When developing an AI algorithm, you may try to improve results by working on the model or perhaps by collecting more data.

And this is as it should be.

But there is also a third powerful option available: Applying **feature engineering techniques**, which transform raw data into features that better represent the underlying problem to the model.

These include…

- **Feature Generation.** Generate relevant variables.
- **Data Transformation.** Modify the scale or distribution of the values.
- **Data Cleaning.** Identify and correct errors in the data.
- **Feature Selection.** Identify the most relevant input variables for the task.
- **Dimension Reduction.** Create compact projections of the data.

This is neither a trivial nor easy task.

After all, you must gain a holistic understanding of the problem in order to evaluate which actions should be applied to the data. Defining the data pipeline for applying feature engineering techniques accounts for the bulk of the effort in machine learning applications, by some estimates close to 90 percent.

“Coming up with features is difficult, time-consuming, requires expert knowledge,” AI pioneer Andrew Ng once noted. “‘Applied machine learning’ is basically feature engineering.”

**Feature Engineering in Supervised Learning Models**

When undertaking feature engineering, several types of variables may be encountered. Each type of variable has distinct characteristics and may require different preprocessing techniques.

Here are some common types of variables you may find:

**Numerical Variables.**

- These variables can take any real value within a given range. Examples include age, income, and temperature. Continuous variables are often scaled or normalized to ensure uniformity in their ranges.
- Discrete numerical variables take on specific, distinct values, typically integers. Examples include the number of children in a family or the count of items in a shopping cart.
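As a minimal sketch of the scaling mentioned above (the `age` values here are made up for illustration), min-max scaling maps a continuous variable onto a uniform range:

```python
import numpy as np

# Hypothetical continuous variable.
age = np.array([18.0, 25.0, 40.0, 60.0])

# Min-max scaling maps the values onto the [0, 1] range.
age_scaled = (age - age.min()) / (age.max() - age.min())
```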

**Categorical Variables.**

- Nominal categorical variables represent categories or labels with no inherent order or ranking. Examples include colors, product categories, or types of fruits.
- Ordinal categorical variables have categories with a meaningful order or ranking. Examples include education levels (e.g., high school, bachelor’s, master’s) or customer satisfaction ratings (e.g., poor, fair, good, excellent). Ordinal variables may be encoded with integer values or custom rank values.
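A brief sketch of encoding an ordinal variable with custom rank values (the satisfaction ratings and the mapping are illustrative assumptions):

```python
import pandas as pd

# Hypothetical customer satisfaction ratings (an ordinal variable).
satisfaction = pd.Series(["poor", "good", "fair", "excellent", "good"])

# Encode the categories with integer ranks that preserve their order.
rank_map = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}
satisfaction_encoded = satisfaction.map(rank_map)
```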

**Time-Related Variables.** These can include date and time features, such as timestamps, days of the week, months, or years. These variables may be extracted from datetime data and used to capture temporal patterns.
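For example, pandas can derive such features directly from a datetime column (the `timestamp` values here are hypothetical):

```python
import pandas as pd

# Hypothetical event log with a raw timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-02 09:15", "2023-06-18 22:40", "2023-11-30 07:05",
    ])
})

# Derive time-related features that can capture temporal patterns.
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["hour"] = df["timestamp"].dt.hour
```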

Once you have a good grasp of the specific nature of the problem, choosing the right tools for the solution requires knowledge of both probability and mathematics.

Elements that should be checked include…

- Missing data.
- Categorical variables. Cardinality and rare labels.
- Compliance with linearity assumptions.
- Distribution of independent variable values.
- Magnitude/scale.

*Missing data* refers to the absence or lack of values in one or more variables within a dataset. Dealing with missing data is a crucial aspect of preparing data for a supervised learning model—and it can have significant effects on the model’s performance and results.

The three long-accepted categories of missing data are…

- **Missing Completely at Random (MCAR).** Data is missing randomly, and there is no systematic pattern. (The probability of missing values is the same for all observations.) In this case, the missing values are unlikely to bias the model.
- **Missing at Random (MAR).** The probability of data being missing depends on other observed variables. While there is a pattern, it can be accounted for using information from other variables. (In a study for a new treatment, for example, some subjects withdraw when they start experiencing side effects.)
- **Missing Not at Random (MNAR).** The probability of data being missing depends on the unobserved or missing values themselves. This type is more challenging to handle and can introduce bias.

Some negative effects on the model could include…

- **Reduced Sample Size.** Missing data can lead to a reduced sample size, which may affect the model’s ability to learn patterns and make accurate predictions. This reduction can be especially problematic if the data is not missing completely at random.
- **Bias and Inaccuracy.** If the missing data is not handled appropriately, it can introduce bias into the model. For example, if certain groups are more likely to have missing values, the model may not generalize well to those groups.
- **Loss of Information.** Missing data can result in the loss of potentially valuable information, especially if the missing variables are important predictors of the target variable. This loss can lead to suboptimal model performance.

Imputation is the act of replacing missing data with statistical estimates of the absent values. The goal of any imputation technique is to produce a complete dataset that allows training a machine learning model.

These techniques can be grouped into two categories:

- **Univariate imputation.** Imputation is done independently for each input variable, without considering other variables.
- **Multivariable imputation.** The imputed value for each variable is a function of two or more variables.

Some univariate imputation techniques include…

**Numeric Variables**

- Mean/Median Imputation.
- Arbitrary Value Imputation.
- End of Tail Imputation.

**Categorical Variables**

- Mode Imputation.
- Adding a “MISSING” Category.

**Both Numeric and Categorical Variables**

- Complete Case Analysis.
- Adding a “MISSING” Indicator.
- Random Sampling Imputation.
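Two of these techniques, the “MISSING” indicator and random sampling imputation, can be sketched together in a few lines (the `income` column and its values are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical variable with missing values.
df = pd.DataFrame({"income": [40.0, np.nan, 55.0, np.nan, 48.0, 61.0]})

# Missing indicator: a binary flag lets the model learn from missingness itself.
df["income_missing"] = df["income"].isna().astype(int)

# Random sampling imputation: draw replacements from the observed values,
# which tends to preserve the variable's distribution.
observed = df["income"].dropna().to_numpy()
missing_idx = df.index[df["income"].isna()]
df.loc[missing_idx, "income"] = rng.choice(observed, size=len(missing_idx))
```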

**Complete Case Analysis (CCA)**

*Complete Case Analysis* consists of discarding all observations in which any of the variables are missing. Therefore, only observations with complete data are analyzed. CCA is most appropriate when the missing data mechanism is believed to be missing completely at random (MCAR) or when the proportion of missing data is very small and does not significantly affect the sample size. It is also suitable when the goal is to maintain the simplicity of the analysis and the loss of information is acceptable. However, if the missing data mechanism is believed to be missing at random (MAR) or missing not at random (MNAR), other imputation methods that account for these mechanisms may be more appropriate to avoid bias in the analysis.

The advantages of CCA are its simplicity, since it does not involve any assumptions about the missing data, and the fact that it preserves the distribution of the variables, since it does not introduce any bias or imputation errors.

This method has limitations and considerations. One significant drawback is that it can lead to a substantial reduction in sample size; smaller samples reduce the power of statistical tests and may limit the generalizability of results. Also, if the MCAR assumption is not met, the loss of data may impact the representativeness of the remaining sample.

In conclusion, as a rule, CCA should be implemented only when the MCAR assumption is met and the proportion of observations with missing data does not exceed 5% of the total dataset.
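A minimal CCA sketch in pandas, including the 5% rule-of-thumb check (the columns and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a few missing entries.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 37, 45, 33, 28, 52],
    "income": [40, 55, 48, np.nan, 39, 61, 70, 44, 38, 80],
})

# Fraction of observations with at least one missing value.
frac_incomplete = df.isna().any(axis=1).mean()

# Complete Case Analysis: keep only fully observed rows.
complete = df.dropna()
```

Here `frac_incomplete` is 0.2, well above the 5% rule of thumb, so in practice an imputation method would likely be preferred over CCA for this toy dataset.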

**Imputation by Mean or Median**

*Imputation by mean or median* is a simple technique used to handle missing data by replacing missing values with the mean (average) or median value of the available data within the same variable. This method is widely used in data preprocessing when dealing with numerical or continuous variables. If the variable is normally distributed, the mean and median are similar. On the other hand, if the distribution is skewed, the median is a better representation of the missing data.

When using this technique, the following assumptions should be checked:

- The missing data should be MCAR or MAR.
- The missing observations resemble the majority.

The advantages of this method are its simplicity and the speed with which it produces a complete dataset.

However, there are some limitations and considerations that you should be aware of:

- It distorts the original distribution: the more missing data there is, the greater the distortion.
- It distorts the original variance.
- It distorts the covariance with the remaining variables.

Imputation by mean or median is most appropriate when the missing data mechanism is believed to be MCAR or MAR and the proportion of missing data is relatively small. It is a quick and simple method to handle missing data, but it should be used cautiously and with an awareness of its limitations.

As a rule, we will apply this technique when…

- …the missing data is missing completely at random (MCAR) or missing at random (MAR), and the assumptions for using mean or median imputation are met.
- …the percentage of missing data is relatively low, typically not exceeding 5% of the total dataset.
- …we want a relatively simple imputation method that does not introduce excessive bias or distortion into the data.
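A short sketch of median imputation with pandas (the `age` values are hypothetical; the outlier 90 is included deliberately to skew the distribution, which is why the median is used rather than the mean):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed variable: the value 90 pulls the mean upward,
# so the median is the more robust replacement.
age = pd.Series([22, 25, 27, np.nan, 31, 90, np.nan])

median_age = age.median()
age_imputed = age.fillna(median_age)
```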

**Arbitrary Value Imputation (Numeric)/Adding a “MISSING” Category (Categorical)**

This approach consists of replacing occurrences of missing values (NA) with an arbitrary value. The most common values used in this technique are:

- For numeric distributions: 0, 999, -999, or -1 (for positive distributions).
- For categorical variables: “MISSING”.

This method works under the assumption that the missing data is MNAR: there is a reason why the values are missing, so the mean and the median are not representative. Its advantages are that it is easy to implement, it is a quick way to obtain a complete dataset and, most important, it captures the significance of the “missing” data, if any.

Despite its advantages, this technique has several limitations: it distorts the original distribution of the data, along with the variance and the covariance with the remaining variables. If the arbitrary value lies at the end of the distribution, it can introduce outliers; care should also be taken not to choose an arbitrary value too close to the mean or median.
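Both variants can be sketched with pandas in a couple of lines (column names and values are made up; -999 is one common choice for a positive numeric distribution):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one numeric and one categorical variable.
df = pd.DataFrame({
    "income": [45.0, np.nan, 61.5, np.nan],
    "city":   ["Miami", None, "Orlando", None],
})

# Numeric: an arbitrary value clearly outside the observed (positive) range.
df["income"] = df["income"].fillna(-999)

# Categorical: an explicit "MISSING" label records that the value was absent.
df["city"] = df["city"].fillna("MISSING")
```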

**End of Tail Imputation**

This approach is equivalent to arbitrary value imputation, but here the arbitrary values are automatically selected from the end of the distribution, and it is applicable only to numeric variables. If the variable has a normal distribution, you can use the mean ± 3 times the standard deviation. If the variable has a skewed distribution, you can use the IQR proximity rule.
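Both rules can be sketched as follows (the values are invented; the 1.5 × IQR multiplier is one common choice for the proximity rule, though larger factors such as 3 are also used in practice):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric variable with a missing entry.
s = pd.Series([10.0, 12.0, 11.0, 13.0, np.nan, 12.5, 11.5])

# Normal-ish distribution: impute at mean + 3 standard deviations.
gaussian_tail = s.mean() + 3 * s.std()
imputed_gaussian = s.fillna(gaussian_tail)

# Skewed distribution: IQR proximity rule beyond the third quartile.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr_tail = q3 + 1.5 * (q3 - q1)
imputed_skewed = s.fillna(iqr_tail)
```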

**Imputation By Frequent Category or Mode**

This approach consists of replacing the missing values with the most frequent category or mode of the variable. It is used for categorical variables and has the advantage that it is a simple and straightforward imputation method.

This method works under the assumption that the missing data is MCAR or MAR.

It is particularly useful when dealing with categorical data with a dominant category, and it largely preserves the original distribution of the categorical variable when the number of missing values is small.

The main disadvantages of this technique are that it distorts the relationship between the most frequent label and other variables in the dataset, and it can exaggerate the presence of the most frequent label if the number of missing values is high. The greater the number of missing values, the greater the distortion.

This technique should be applied only when the missing data is assumed to be MCAR or MAR and the percentage of missing data is relatively low, typically not exceeding 5% of the total dataset.
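A minimal mode-imputation sketch (the `color` values are hypothetical, with a clearly dominant category):

```python
import pandas as pd

# Hypothetical categorical variable with a dominant category.
color = pd.Series(["red", "blue", "red", None, "red", "green", None])

# Replace missing values with the most frequent category (the mode).
mode_value = color.mode()[0]
color_imputed = color.fillna(mode_value)
```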

**Multivariable Imputation**

Often referred to as multiple imputation, multivariable imputation is an advanced technique employed in handling missing data within a dataset. Unlike simpler univariate imputation methods that handle missing values for one variable at a time, multivariable imputation considers the complex interplay and relationships between multiple variables when imputing missing data.

Two common methods used for multivariable imputation are “Multiple Imputation by Chained Equations (MICE)” and “K-Nearest Neighbors (K-NN) Imputation.”

- Multiple Imputation by Chained Equations (MICE) is an iterative imputation method that models each variable with missing data as a function of the other variables. It sequentially imputes missing values for each variable, updating the imputed values in each iteration based on the imputed values from other variables.

This method is highly versatile and can handle various types of variables, including numeric and categorical variables. MICE provides a way to capture complex relationships between variables, making it suitable for datasets with dependencies and interactions.

- K-Nearest Neighbors (K-NN) Imputation is primarily used for numeric or continuous variables. It involves finding the k-nearest data points with complete information (i.e., non-missing values) to the observation with the missing value. The missing value is then imputed based on the values of its nearest neighbors. K-NN imputation can be sensitive to the choice of the distance metric used to determine similarity between data points.
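Both methods are available in scikit-learn; the sketch below uses `IterativeImputer` (a MICE-style imputer, still marked experimental and enabled via a special import) and `KNNImputer` on a small made-up array in which the second column is roughly twice the first:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical data: column 1 is roughly 2x column 0, with one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 9.9],
])

# MICE-style imputation: each column is modeled as a function of the others.
X_mice = IterativeImputer(random_state=0).fit_transform(X)

# K-NN imputation: the gap is filled from the k nearest complete rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```

Because the columns are strongly related, both imputers fill the gap with a value close to 6, which illustrates how multivariable methods exploit relationships between variables.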

In summary, multivariable imputation is a powerful technique for handling missing data by considering the complex relationships between variables. It provides a more comprehensive and flexible approach compared to univariate imputation methods, ensuring that the imputed data aligns with the structure of the original dataset. MICE and K-NN are two notable methods within multivariable imputation, each with its strengths and applicability depending on the nature of the data and the missing data mechanism.

**Your Next Steps**

Scaling AI in your organization isn’t just about adopting new technology. It’s about transforming your entire way of doing things.

To learn more about how OZ can accelerate your business into the data analytics future, click here or schedule a consultation today.