10 key points of Chapter 4 of Chip Huyen’s book, “Designing Machine Learning Systems”
Training data serves as the cornerstone of machine learning algorithms, dictating their effectiveness and reliability. Chapter 4 of the book “Designing Machine Learning Systems” by Chip Huyen presents techniques for acquiring or generating high-quality training data, various sampling strategies, challenges in data labeling, and techniques for data augmentation. Here we will discuss ten key points of this chapter:
- Non-probability Data Sampling: Sampling is a crucial step for creating subsets of data that are feasible to process. In non-probability sampling, the samples are not representative of real-world data and may introduce biases, but they are quick and easy to gather. Some non-probability sampling methods are: convenience sampling, snowball sampling, judgment sampling, and quota sampling.
- Random Data Sampling: Random (probability-based) sampling selects samples according to known probabilities, with simple random sampling giving every sample in the population an equal chance of being selected. This is important for creating representative datasets and reliable models. Common random sampling techniques include: simple random sampling, stratified sampling, weighted sampling, reservoir sampling, and importance sampling.
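Reservoir sampling is the least intuitive of these techniques: it keeps a uniform random sample of k items from a stream whose total length is unknown in advance. Here is a minimal sketch (not code from the book; function and parameter names are our own):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1),
            # which keeps every item's selection probability equal.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
```

Each incoming item replaces a reservoir slot with probability k/(i+1), which is exactly what makes every item in the stream equally likely to end up in the final sample.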
- Data Labeling: The performance of a supervised machine learning model depends heavily on the quality and quantity of the labeled data it’s trained on. There are two major ways to label data: hand labeling and natural labels. Hand-labeling data is expensive, slow, and poses privacy concerns. With natural labels, the system itself provides some form of label by leveraging user behavior as feedback, for example treating clicks or purchases as implicit labels for a recommendation.
- Challenges in Data Labeling: Label multiplicity, where conflicting labels exist for the same data instance, is common with hand labels and emphasizes the importance of clear problem definitions and annotation guidelines. Data lineage means tracking the origin and labeling history of data samples, which helps identify biases and debug models effectively. Feedback loop length is the time between serving a prediction and receiving feedback on it, and it influences model evaluation and issue detection: short feedback loops allow faster issue detection but risk premature (possibly wrong) labels.
- Handling Lack of Labels: Because acquiring sufficient high-quality labels is challenging, techniques such as weak supervision, semi-supervision, transfer learning, and active learning were created to mitigate label scarcity. They leverage, respectively, heuristics (rules developed with subject-matter expertise), structural assumptions about the data, models pre-trained on other tasks, and selective labeling, i.e., labeling the examples the model is most uncertain about.
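Weak supervision is often implemented with labeling functions: small heuristics that each vote on a label or abstain. The toy spam heuristics below are our own illustration, not the book's (production systems such as Snorkel combine the votes with a learned model rather than a simple majority):

```python
# Hypothetical labeling functions for a spam classifier.
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_contains_free(text):
    # Heuristic: promotional emails often say "free".
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_contains_urgent(text):
    # Heuristic: pressure words suggest spam.
    return SPAM if "urgent" in text.lower() else ABSTAIN

def lf_personal_greeting(text):
    # Heuristic: messages opening with a greeting tend to be legitimate.
    return NOT_SPAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def weak_label(text, labeling_functions):
    """Combine labeling-function votes by simple majority, ignoring abstentions."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_free, lf_contains_urgent, lf_personal_greeting]
label = weak_label("FREE money, act urgent!", lfs)  # votes SPAM
```

The appeal is that a handful of such functions can label an entire unlabeled corpus in minutes, trading per-sample accuracy for coverage.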
- Class imbalance: This problem occurs when there is a significant disproportion in the number of samples across different classes within the training dataset. The imbalance often results in insufficient signal for the model to effectively discern the minority classes, potentially leading to the model relying on simplistic heuristics rather than grasping the underlying data patterns. Moreover, misclassifying a sample from the underrepresented class can carry far greater consequences than misclassifying one from the majority class.
- Handling Class Imbalance (Changing evaluation metric): Wrong metrics give you the wrong idea of how your model is doing. On class-imbalance problems, metrics that measure your model’s performance with respect to specific classes, such as F1 score, precision, recall, or the receiver operating characteristic (ROC) curve, should be preferred over simple overall accuracy.
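To see why accuracy misleads, consider a dataset that is 95% negative: a model that always predicts "negative" scores 95% accuracy yet has zero recall on the class that matters. A minimal sketch of the per-class metrics (our own helper; libraries like scikit-learn provide equivalents):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 95 negatives, 5 positives; the model predicts "negative" for everything.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
p, r, f1 = precision_recall_f1(y_true, y_pred)  # recall is 0.0 despite 95% accuracy
```

Recall of 0.0 immediately exposes the failure that the 95% accuracy figure hides.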
- Handling Class Imbalance (Changing data distribution): Modifying the distribution of the training data to reduce the level of imbalance can make it easier for the model to learn. Oversampling (adding more instances of the minority classes) and undersampling (removing instances of the majority classes) are two techniques for this purpose, but undersampling runs the risk of losing important data, while oversampling runs the risk of overfitting on the training data.
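Random oversampling, the simplest version of this idea, can be sketched as follows (our own illustration; libraries such as imbalanced-learn offer this and smarter variants like SMOTE):

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        indices = [i for i, label in enumerate(y) if label == cls]
        for _ in range(target - n):
            i = rng.choice(indices)  # duplication is what risks overfitting
            X_out.append(X[i])
            y_out.append(y[i])
    return X_out, y_out
```

Because the added rows are exact duplicates, the model may memorize them, which is the overfitting risk the chapter warns about; undersampling would instead drop majority rows and risk discarding informative examples.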
- Handling Class Imbalance (Changing algorithms): The algorithm itself can be made more robust to class imbalance, typically by adjusting the loss function to give higher weight to the samples that matter most, so the model focuses more on learning them. Some methods that modify the cost function are: cost-sensitive learning, class-balanced loss, and focal loss.
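The common thread is weighting the loss per class. A minimal sketch of class-weighted binary cross-entropy (our own example of cost-sensitive learning, not code from the book; frameworks expose the same idea via parameters like `class_weight` or `pos_weight`):

```python
import math

def weighted_cross_entropy(y_true, p_pred, class_weights):
    """Binary cross-entropy where each sample's loss is scaled by the
    weight assigned to its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clip probabilities for numerical stability
        w = class_weights[y]             # e.g. {0: 1.0, 1: 5.0} to emphasize class 1
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Misclassifying a rare positive now costs 5x as much as a comparable
# mistake on a common negative.
loss = weighted_cross_entropy([1, 0], [0.1, 0.9], {0: 1.0, 1: 5.0})
```

With weights of {0: 1.0, 1: 5.0}, the gradient signal from minority-class errors is amplified fivefold, pushing the model away from the majority-class shortcut.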
- Data Augmentation: Data augmentation is a family of techniques used to increase the amount of training data; augmented data can make models more robust to noise and even to adversarial attacks. The techniques depend heavily on the data format. Approaches such as label-preserving transformations, perturbation, and data synthesis expand the training data, improving model robustness and generalization.
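For numeric features, the simplest perturbation is adding small random noise while keeping the label unchanged. A minimal sketch (our own illustration; for images the analogous label-preserving transformations would be crops, flips, or rotations):

```python
import random

def augment_with_noise(X, y, n_copies=2, noise_scale=0.01, seed=0):
    """Label-preserving augmentation: jitter numeric features with small
    Gaussian noise, keeping each sample's label unchanged."""
    rng = random.Random(seed)
    X_aug, y_aug = list(X), list(y)
    for _ in range(n_copies):
        for features, label in zip(X, y):
            X_aug.append([x + rng.gauss(0.0, noise_scale) for x in features])
            y_aug.append(label)  # the transformation must not change the label
    return X_aug, y_aug

X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]
X_aug, y_aug = augment_with_noise(X, y, n_copies=2)  # dataset grows from 2 to 6 rows
```

The noise scale is the key design choice: too small and the copies add nothing, too large and the perturbation may no longer preserve the label.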
The insights from these ten points emphasize the crucial role of training data in machine learning systems. Techniques for acquiring high-quality data, mitigating biases, handling label scarcity, managing class imbalance, and augmenting data contribute significantly to the reliability and effectiveness of machine learning models. Effective management of training data remains essential for building robust, high-performing models on real-world datasets. Together, these techniques enable applications that navigate data complexities and support the development of unbiased, effective real-world machine learning systems.