Feature Engineering: Fundamentals and Best Practices

A summary of Chapter 5 of the Book “Designing Machine Learning Systems” by Chip Huyen

João Vítor Venceslau
7 min read · Apr 12, 2024

Introduction

Feature engineering plays a crucial role in the development of machine learning models, largely determining model performance and generalization. In this article, we will explore key strategies, best practices, and pitfalls to avoid in order to maximize the effectiveness of feature engineering. When engineering good features, it’s fundamental to consider both feature importance and feature generalization. The process demands domain-specific knowledge, and subject matter experts, who may not necessarily be engineers, play a vital role. Hence, it’s imperative to structure the workflow in a manner that facilitates contributions from individuals across diverse backgrounds.

Main Feature Engineering Strategies and Techniques

When handling missing values, it’s essential to distinguish between three types: missing not at random, missing at random, and missing completely at random, each requiring specific treatment. There are two primary approaches to this problem: deletion and imputation. Deletion, whether of rows or columns, is the simpler option, but it runs the risk of discarding valuable information and reducing model accuracy. Imputation, on the other hand, is more complex because there is no perfect way to choose the fill values. The common practice is to fill missing values with the mean, median, or mode, but there is always the risk of injecting bias or noise into the data, or worse, causing data leakage.
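As a minimal sketch of this practice (using a hypothetical DataFrame with `income` and `city` columns), the fill statistics are computed on the training split only and then reused to impute both splits:

```python
import pandas as pd

# Hypothetical data with missing numeric and categorical values.
df = pd.DataFrame({
    "income": [52_000, None, 61_000, 48_000, None],
    "city": ["SP", "RJ", None, "SP", "SP"],
})
train, test = df.iloc[:4], df.iloc[4:]

# Compute fill statistics on the training split only, to avoid leaking
# information from the test split into training.
income_median = train["income"].median()
city_mode = train["city"].mode()[0]

train_filled = train.fillna({"income": income_median, "city": city_mode})
test_filled = test.fillna({"income": income_median, "city": city_mode})
```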

When it comes to feature scaling and standardization, preparing features to be within comparable ranges is crucial before feeding them into models. While the range [0, 1] is commonly used, any arbitrary range can work. If you suspect your variables follow a normal distribution, standardization is recommended, normalizing them to have zero mean and unit variance. In cases where certain features exhibit skewed distributions, a log transformation can help alleviate the issue, as machine learning models typically struggle with skewed features. Nevertheless, it’s essential to exercise caution: scaling and standardization can inadvertently lead to data leakage because they rely on global statistics, which should be computed on the training split only. During inference, it’s essential to reuse the statistics obtained during training to scale new data. Should the new data significantly differ from the training set, these statistics may prove less useful, and frequent model retraining becomes necessary to adapt to evolving data patterns.
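As a hedged sketch of this workflow (assuming scikit-learn is available and using a synthetic skewed feature), both the log transform and the scaler statistics come from the training data only and are reused on new data at inference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.lognormal(mean=3.0, sigma=1.0, size=(1000, 1))  # skewed training feature
X_new = rng.lognormal(mean=3.0, sigma=1.0, size=(100, 1))     # data arriving at inference time

# Log-transform to reduce skew, then standardize to zero mean and unit variance.
scaler = StandardScaler().fit(np.log1p(X_train))   # statistics from the training split only
X_train_scaled = scaler.transform(np.log1p(X_train))
X_new_scaled = scaler.transform(np.log1p(X_new))   # reuse the same statistics at inference
```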

Feature discretization involves transforming a continuous feature into a discrete one. However, this categorization introduces discontinuities at category boundaries, and choosing the boundaries of categories can become a challenge. Some strategies for boundary selection include leveraging subject matter expertise, employing common sense, and utilizing tools such as histograms and quantiles.
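As a small illustration (with a hypothetical `ages` series), quantiles give equal-frequency buckets, while hand-picked boundaries can encode domain knowledge:

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 38, 44, 52, 60, 71, 83])

# Equal-frequency buckets derived from quantiles of the data.
quantile_buckets, bin_edges = pd.qcut(ages, q=4, retbins=True, labels=False)

# Hand-picked boundaries informed by domain knowledge or a histogram.
custom_buckets = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 120],
    labels=["minor", "young adult", "middle-aged", "senior"],
)
```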

When it comes to encoding categorical features, the traditional method of one-hot encoding is commonly used. However, in production settings, this approach falls short due to the dynamic nature of categories, which frequently change over time. To address this challenge, the hashing trick offers a viable solution. By applying a hash function to each category, a hashed value is generated, serving as the index for that category. Because the size of the hash space is specified in advance, the number of encoded values is known without needing to know the exact number of categories. Additionally, specific hash functions can be chosen for desired properties, such as locality-sensitive hashing, which ensures similar categories are hashed into nearby values. While regarded as a “trick” and often omitted from traditional machine learning curricula, its widespread adoption in industry underscores its effectiveness.
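A minimal sketch of the hashing trick using Python’s standard library (the bucket count and category names are illustrative):

```python
import hashlib

NUM_BUCKETS = 1_000  # hash space fixed in advance; occasional collisions are accepted

def hash_bucket(category: str, num_buckets: int = NUM_BUCKETS) -> int:
    # md5 is used here only as a stable, platform-independent hash function.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hash_bucket("brand_acme"))      # the same category always maps to the same index
print(hash_bucket("brand_new_2024"))  # a category never seen in training still gets a valid index
```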

Feature crossing is the technique of combining two or more features to generate new ones. It is particularly valuable for capturing nonlinear relationships between features, which is essential for models that struggle with or are incapable of learning such relationships, including linear regression, logistic regression, and tree-based models. However, there are drawbacks to consider: feature crossing can exponentially increase the feature space, necessitating a larger dataset for effective model learning. Moreover, it may lead to overfitting, where the model fits too closely to the training data, compromising its ability to generalize to new data.
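As a minimal sketch (with hypothetical `day_of_week` and `hour_bucket` features), crossing two categorical columns produces a combined feature that a linear model can use to capture their interaction:

```python
import pandas as pd

df = pd.DataFrame({
    "day_of_week": ["mon", "sat", "sun", "mon"],
    "hour_bucket": ["morning", "night", "night", "evening"],
})

# Cross the two features into one combined categorical feature,
# e.g. "sat_night", and one-hot encode the result.
df["day_x_hour"] = df["day_of_week"] + "_" + df["hour_bucket"]
crossed_one_hot = pd.get_dummies(df["day_x_hour"])
```

Note how the crossed feature has up to 7 × (number of hour buckets) possible values, which illustrates how quickly the feature space can blow up.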

Positional features have emerged as a standard technique in both computer vision and natural language processing tasks, representing data positions through vectors known as embeddings. For some models, it’s important to explicitly encode word positions so that the model understands their sequential order. One approach is to treat positional embeddings like word embeddings, where the number of columns corresponds to the number of positions. Typically, the embedding size for positions mirrors that of words so the two can be summed, and the positional embeddings are learned alongside the other model weights. Another approach is to use fixed positional embeddings, also referred to as Fourier features, which have demonstrated efficacy in enhancing model performance, particularly for tasks reliant on positional coordinates.
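A minimal sketch of fixed (sinusoidal) positional embeddings in NumPy, summed with hypothetical word embeddings of the same size; the sine/cosine frequency scheme here is the common construction, used as an illustration:

```python
import numpy as np

def sinusoidal_positions(num_positions: int, dim: int) -> np.ndarray:
    """Fixed positional embeddings built from sine/cosine pairs at different frequencies."""
    positions = np.arange(num_positions)[:, None]           # (num_positions, 1)
    freqs = 1.0 / (10_000 ** (np.arange(0, dim, 2) / dim))  # (dim / 2,)
    angles = positions * freqs                              # (num_positions, dim / 2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

word_embeddings = np.random.rand(8, 16)                 # 8 tokens, embedding size 16
inputs = word_embeddings + sinusoidal_positions(8, 16)  # positions added to word embeddings
```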

Data Leakage — Causes and How to Detect it

Data leakage refers to the phenomenon in which information from the target label “leaks” into the feature set used for prediction, but the same information is not available during inference, leading to models that evaluate well yet fail in production.

Some common causes for data leakage are:

  • Splitting time-correlated data randomly instead of by time: in many cases data is time-correlated, so a random split lets information from the future bleed into the training set (see the sketch after this list).
  • Scaling before splitting, which risks incorporating test split information into the training process; some argue that the data should be split even before exploratory analysis.
  • Filling in missing data with statistics from the test split: which can introduce bias and compromise model integrity.
  • Poor handling of data duplication before splitting: the same information may appear in both the training and test sets, compromising the integrity of model evaluation.
  • Group leakage, where examples with strongly correlated labels are inadvertently divided into different splits, skewing model performance.
  • Leakage originating from the data generation process: highlighting the importance of understanding data sources, collection methods, and processing techniques to mitigate potential issues.
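As a minimal sketch of the first point (with a hypothetical time-stamped DataFrame), splitting by time rather than randomly keeps future information out of the training set:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=100, freq="D"),
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
}).sort_values("timestamp")

# Time-based split: the first 80% of days for training, the rest for testing.
split_idx = int(len(df) * 0.8)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]
```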

Data leakage can happen during many steps, from generating, collecting, sampling, splitting, and processing data to feature engineering. Given this information, it’s important to monitor for data leakage during the entire lifecycle of a machine learning project.

One approach is to assess the predictive power of each feature, or a group of features, with respect to the target variable (label). If a feature exhibits unusually high correlation, investigate how it is generated and whether the correlation makes sense.
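As a hedged sketch of this check (using a synthetic dataset in which `feature_b` is deliberately constructed to leak the label), ranking features by their absolute correlation with the label flags candidates to investigate:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=1_000)
train = pd.DataFrame({
    "feature_a": rng.normal(size=1_000),
    "feature_b": label + rng.normal(scale=0.01, size=1_000),  # hypothetical leaky feature
    "label": label,
})

# Rank features by absolute correlation with the label.
corr = train.drop(columns="label").corrwith(train["label"]).abs().sort_values(ascending=False)
print(corr[corr > 0.95])  # suspiciously high correlations warrant investigating how the feature is generated
```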

Alternatively, conduct ablation studies to measure the significance of individual features or sets of features to the model. It’s also important to remain vigilant for any new features incorporated into the model. If the addition of a new feature significantly improves the model’s performance, it suggests that the feature is either highly beneficial or may inadvertently contain leaked information related to the labels.
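A minimal ablation sketch (using scikit-learn and a synthetic dataset; the feature names are illustrative): drop one feature at a time and compare cross-validated accuracy against the full model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
X = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": y + rng.normal(scale=0.1, size=500),  # hypothetical highly predictive (possibly leaky) feature
    "feature_c": rng.normal(size=500),
})

baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

# Remove one feature at a time and measure the change in cross-validated accuracy.
for col in X.columns:
    score = cross_val_score(RandomForestClassifier(random_state=0), X.drop(columns=col), y, cv=5).mean()
    print(f"without {col}: accuracy change {score - baseline:+.3f}")
```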

Engineering Good Features

In general, adding more features tends to enhance model performance. Nevertheless, more features don’t invariably translate into better performance. The drawbacks of having an excess of features are notable during both model training and deployment:

  • The more features you have, the more opportunities there are for data leakage
  • Too many features can cause overfitting
  • Too many features can increase the memory required to serve a model
  • Too many features can increase inference latency when doing online prediction
  • Useless features become technical debt

Careful feature selection and management are essential to mitigate these issues and optimize model performance effectively.

There are two factors to consider when evaluating whether a feature is good for a model: its importance to the model and its generalization to unseen data.

There are many different methods for measuring the importance of a feature. One such approach involves evaluating the extent to which a model’s performance declines when that feature, or a set of features containing it, is removed from the model.

In classical machine learning algorithms such as gradient-boosted trees, a straightforward way to assess feature importance is to use the built-in functions provided by frameworks like XGBoost. For a more model-agnostic approach, methods such as SHAP (SHapley Additive exPlanations) can be beneficial. Additionally, InterpretML, a great open-source package, leverages feature importance to provide insights into how your model makes predictions. Frequently, a small number of features account for most of your model’s overall feature importance. This observation not only aids in selecting pertinent features but also enhances interpretability, offering valuable insights into the inner workings of your models.
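As a brief sketch (assuming the xgboost and shap packages are installed, and using a synthetic dataset), built-in importances and SHAP values can be obtained as follows:

```python
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])

model = xgb.XGBClassifier(n_estimators=50).fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))  # built-in importance from the boosted trees

explainer = shap.TreeExplainer(model)   # SHAP attributions for the same model
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)       # visualizes which features drive predictions
```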

The features employed in the model should generalize to unseen data. However, not all features exhibit the same level of generalization. Two aspects to consider regarding generalization are feature coverage and the distribution of feature values.

Coverage is the percentage of samples in the dataset that possess values for a particular feature, with higher coverage indicating fewer missing values. Significant disparities in feature coverage between the training and test datasets may suggest a mismatch in their underlying distributions.
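Coverage is straightforward to compute; as a small illustration with a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 51, 29, None],
    "country": ["BR", "BR", None, "US", "PT"],
})

# Coverage per feature: fraction of rows with a non-missing value.
coverage = df.notna().mean()
print(coverage)  # age 0.6, country 0.8
```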

Moreover, if the values present in the training data have no intersection with those in the test data, it could hurt the model’s performance. Therefore, ensuring consistency in feature coverage and value distribution across different datasets is crucial for maintaining model robustness and generalization capability.

Conclusion

Feature engineering plays a pivotal role in the success of machine learning models. By following best practices and avoiding pitfalls, it’s possible to create more robust and generalizable models capable of effectively handling a wide variety of data and scenarios.

Here is a summary of best practices for feature engineering:

  • Split data by time into train/valid/test splits instead of doing it randomly
  • If you oversample your data, do it after splitting
  • Scale and normalize your data after splitting to avoid data leakage
  • Use statistics from only the train split, instead of the entire data, to scale your features and handle missing values
  • Understand how your data is generated, collected, and processed; involve domain experts if possible
  • Keep track of your data’s lineage
  • Understand feature importance to your model
  • Use features that generalize well
  • Remove no longer useful features from your models
