Datasets


Structured / Unstructured data

BIAS / VARIANCE

  1. Various bias types, by queue.acm.

  2. Andrew Ng's bias/variance methodology. Terms: training, validation, test.

    Split: training & validation 70%, test 30%

    Procedure: cross-fold training and validation, or further split the 70% into training and validation.

    BIAS - Situation 1 - doing much worse than human:

    Human expert: 1% error

    Training set error: 5% error (test on train)

    Validation set error: 6% error (test on validation or CFV)

    Conclusion: there is a BIAS between human expert and training set

    Solution: 1. Train a deeper or bigger/larger network, 2. train longer, 3. may need more data to reach the human-expert level, or 4. try a new model architecture.

    VARIANCE - Situation 2 - validation error not close to training error:

    Human expert: 1% error

    Training set error: 2% error

    Validation set error: 6% error

    Conclusion: there is a VARIANCE problem, i.e. OVERFITTING, between training and validation.

    Solution: 1. Early stopping, 2. regularization, 3. get more data, or 4. a new model architecture.

    Situation 3 - both:

    Human expert: 1% error

    Training set error: 5% error

    Validation set error: 10% error

    Conclusion: both problems occur, i.e., BIAS and VARIANCE.

    Solution: apply all of the above.

  • Underfitting (high bias) = train a bigger/deeper model, train longer, possibly get more data.

  • Overfitting (high variance) = early stopping, regularization, more data; reason: the model fits detail & noise in the training set.

  • Overfitting happens more in non-parametric (and non-linear) algorithms such as decision trees.

  • Bottom line: a bigger model or more data will solve most issues.

IMPORTANT! For train/test splitting when the data comes from different distributions:

E.g., TRAIN: 50K hours of general voice chatter as the train set for a deep network; TEST: 10 hours for the specific voice-based problem, e.g., taxi chatter.

Best practice: it is better to draw the validation & test sets from the same distribution as the problem, i.e., the 10-hour set.

Reason: improving scores on a validation set drawn from a different distribution is not the same as improving scores on a validation set drawn from the actual distribution of the problem's data, i.e., the 10 hours.

NOTE: this is unlike the usual supervised setting, where all the data comes from the same distribution and we simply split the training data into train and validation (or use cross-validation).

Split: Train / Valid_Train = 48K / 2K, and Valid / Test = 5K / 5K.

So Situation 1 stays the same.

Situation 2 is now the Valid_Train (train-dev) error.

Situation 3 is the Valid error - a data-mismatch problem; solutions: get more data, data synthesis (make the training data more similar to the test-distribution data), or a new architecture.

Situation 4 is now the Test set error - get more data. (A small diagnostic sketch follows below.)
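A minimal sketch of the diagnostic logic above, in Python; the error values and the "gap" threshold for what counts as a meaningful difference are arbitrary assumptions:

```python
def diagnose(human_err, train_err, valid_err, gap=0.02):
    """Rough bias/variance diagnosis from error rates given as fractions (e.g. 0.05)."""
    issues = []
    if train_err - human_err > gap:   # Situation 1: avoidable bias vs. human level
        issues.append("BIAS: bigger/deeper model, train longer, maybe more data, new architecture")
    if valid_err - train_err > gap:   # Situation 2: variance / overfitting
        issues.append("VARIANCE: early stopping, regularization, more data, new architecture")
    return issues or ["looks fine"]

print(diagnose(0.01, 0.05, 0.06))   # Situation 1: bias only
print(diagnose(0.01, 0.02, 0.06))   # Situation 2: variance only
print(diagnose(0.01, 0.05, 0.10))   # Situation 3: both
```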

SPARSE DATASETS

TRAINING METHODOLOGIES

  1. Train test split

  2. Cross validation

  3. Transfer learning - using a pre-existing classifier from a domain similar to yours, usually trained on millions of samples, and fine-tuning it on new data in order to create a new classifier that utilizes that information in the new domain. Examples: w2v or classic ResNet fine-tuning (see the sketch after this list).

  4. Bootstrapping training - using a similar dataset, such as Yelp with its 5-star ratings, to create a pos/neg sentiment classifier based on 1-star and 5-star reviews, and finally using that classifier to label or sample-select from an unlabelled dataset, in order to create a new classifier or just to select samples for annotation, etc.

  5. Yoav's method for transfer learning across languages - train a classifier on labelled data from English and Spanish, fine-tune on held-out Spanish data, and stop before overfitting. This can be generalized to other domains.
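A minimal fine-tuning sketch for the transfer-learning item above, assuming a recent torchvision (>= 0.13) and an ImageNet-pretrained ResNet-18; the number of classes, the dummy batch, and the hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 3                       # assumed number of classes in the new domain

# Load a ResNet-18 pre-trained on ImageNet and replace its classification head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():          # freeze the backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable

# Only the new head's parameters are optimized; unfreeze more layers if needed.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch (replace with a real DataLoader).
x = torch.randn(8, 3, 224, 224)
y = torch.randint(0, num_classes, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```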

TRANSFER LEARNING

TRAIN / TEST / CROSS VALIDATION

  • Random split test 66/33 - problem: variance each time we rerun.

  • Repeated random split tests - problem: samples may never be included in train/test, or may be selected multiple times.

  • Cross validation - pretty good, but a different random seed gives a different mean accuracy; there is still variance due to randomness.

  • Repeated cross validation - accounts for the randomness of the CV.

  • Statistical significance (t-test) on repeated CV - are the two sets of scores drawn from the same population (i.e., no real difference)? If "yes", the difference is not significant, even if the means and standard deviations differ.

Finally, when in doubt, use k-fold cross validation (k=10) with multiple runs and statistical significance tests.
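A minimal sketch of repeated k-fold CV with a paired t-test between two models, using scikit-learn and SciPy; the models and dataset are placeholders:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)        # placeholder data
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# Paired t-test over the matched folds; caveat: folds overlap, so the test is optimistic.
t, p = stats.ttest_rel(scores_a, scores_b)
print(f"A={scores_a.mean():.3f}  B={scores_b.mean():.3f}  t={t:.2f}  p={p:.3f}")
```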

VARIOUS DATASETS

IMBALANCED DATASETS

A systematic investigation of imbalance effects in CNNs recommends the following:

  1. The effect of class imbalance on classification performance is detrimental.

  2. The method for addressing class imbalance that emerged as dominant in almost all analyzed scenarios was oversampling.

  3. Oversampling should be applied to the level that totally eliminates the imbalance, whereas undersampling can perform better when the imbalance is only removed to some extent.

  4. As opposed to some classical machine learning models, oversampling does not necessarily cause overfitting of CNNs.

  5. Thresholding should be applied to compensate for prior class probabilities when the overall number of properly classified cases is of interest (see the sketch after this list).
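A minimal sketch of point 5, thresholding to compensate for prior class probabilities; one common approach (an assumption here, not necessarily the exact recipe from the study) is to divide the predicted probabilities by the training-set class priors before taking the argmax:

```python
import numpy as np

# Assumed predicted probabilities for 3 classes from a model trained on imbalanced data.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.55, 0.30, 0.15]])
train_priors = np.array([0.90, 0.07, 0.03])   # assumed class frequencies in the training set

naive_pred = probs.argmax(axis=1)        # tends to favour the majority class
adjusted = probs / train_priors          # divide out the priors: p(y|x) / p(y)
adjusted_pred = adjusted.argmax(axis=1)  # renormalization is not needed for argmax

print("before:", naive_pred, "after:", adjusted_pred)
```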

General Rules:

  1. Many samples - undersample the majority class.

  2. Few samples - oversample the minority class.

  3. Consider random and non-random schemes.

  4. Consider different sampling ratios instead of 1:1 (proof? papers?).

  1. Oversampling the minority class

    1. (Random) duplication of samples

    2. SMOTE - find the k nearest neighbours of a minority sample and interpolate: New_Sample = current_sample + rand[0,1] * (k_i - current_sample), where k_i is one of the k nearest neighbours.

  • (in weka) The nearestNeighbors parameter says how many nearest-neighbour instances (surrounding the currently considered instance) are used to build an in-between synthetic instance. The default value is 5, so the attributes of the 5 nearest neighbours of a real existing instance are used to compute a new synthetic one.

  • (in weka) The percentage parameter says how many synthetic instances are created, based on the number of instances in the class with fewer instances (by default - you can also use the majority class by setting the -C option). The default value is 100, meaning that if you have 25 instances in your minority class, another 25 instances are created synthetically from these (using their nearest neighbours' values). With 200%, 50 synthetic instances are created, and so on.

    3. ADASYN - adaptively generates synthetic data for the minority class, focusing on minority examples that are harder to learn (near the classification boundary).

  2. Undersampling the majority class

    1. Remove samples

    2. Cluster centroids - replaces a cluster of samples (k-means) with a centroid.

    3. Tomek links - removes majority-class samples that overlap with the minority class.

    4. Penalizing the majority class during training

  3. Combined over- and under-sampling (hybrid) - e.g., SMOTE + Tomek links or SMOTE + ENN (see the resampling sketch after this list).

  4. Ensemble sampling

    1. EasyEnsemble

    2. BalanceCascade

  5. Don't balance; try algorithms that perform well on imbalanced datasets:

    1. Decision trees - C4.5 / C5.0 / CART / Random Forest

    2. SVM

  6. Penalized models -

    1. Add costs for misclassification of the minority class during training, e.g., penalized-SVM.

    2. Can be complex to set up.
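A minimal resampling sketch using the imbalanced-learn package (an assumption; it is not named above), showing plain SMOTE oversampling and the hybrid SMOTE + Tomek links combination:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification

# Imbalanced toy data: ~95% majority vs ~5% minority.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original:   ", Counter(y))

# SMOTE: interpolate between minority samples and their k nearest neighbours.
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("SMOTE:      ", Counter(y_sm))

# Hybrid: oversample with SMOTE, then clean overlapping pairs with Tomek links.
X_st, y_st = SMOTETomek(random_state=0).fit_resample(X, y)
print("SMOTE+Tomek:", Counter(y_st))
```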

SAMPLE SELECTION

LEARNING CURVES

  1. Predicting the sample size required for training (a learning-curve sketch follows below).

This is a really wonderful study with far-reaching implications that could even impact company strategies in some cases. It starts with a simple question: “how can we improve the state of the art in deep learning?” We have three main lines of attack:

  1. We can search for improved model architectures.

  2. We can scale computation.

  3. We can create larger training data sets.
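A minimal sketch for the learning-curve item above, using scikit-learn's learning_curve to see how the validation score grows with the training-set size; the model and data are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=0)   # placeholder data

# Evaluate the model at increasing fractions of the training data, with 5-fold CV.
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:5d}  train={tr:.3f}  valid={va:.3f}")
```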

DISTILLING DATA

DATASET SELECTION

Overfitting your test set - a statistician's viewpoint, a great article; bottom line, use the Bonferroni correction.

Andrew Ng on understanding what the next stage in DL (& ML) algorithm development is: basic approach - on YouTube.

In practice advice: when there are 2 distributions, it is possible to extend the division of the training set into validation_training and training, and the test into validation and test.

Sparse matrices in ML - e.g., one-hot/tf-idf representations; formats: dictionary (of keys), list of lists, coordinate list.

Student-teacher paradigm (Facebook) - use a big labelled dataset to train a teacher classifier, predict on unlabelled data, choose the best-classified examples based on probability, use those to train a new student model, and finally fine-tune on the labelled dataset to create a more robust model, which is expected to know both the unlabelled and the labelled data with higher accuracy than the fully supervised teacher model / baseline. A sketch follows below.
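A minimal self-training sketch of the student-teacher idea above, using scikit-learn; the models, the 0.9 confidence threshold, and the warm-started refit standing in for "fine-tuning" are all assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder labelled and unlabelled pools.
X, y = make_classification(n_samples=3000, random_state=0)
X_lab, y_lab, X_unlab = X[:500], y[:500], X[500:]

# 1) Teacher: train on the labelled data.
teacher = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# 2) Predict on unlabelled data and keep only confident pseudo-labels.
probs = teacher.predict_proba(X_unlab)
confident = probs.max(axis=1) > 0.9            # arbitrary confidence threshold
X_pseudo, y_pseudo = X_unlab[confident], probs[confident].argmax(axis=1)

# 3) Student: train on the pseudo-labelled data...
student = LogisticRegression(max_iter=1000).fit(X_pseudo, y_pseudo)
# 4) ...then refit on the original labelled data, warm-started from the student weights.
student.set_params(warm_start=True).fit(X_lab, y_lab)
print("student trained on", len(X_pseudo), "pseudo-labelled samples")
```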


Train Test methodology - “The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a ‘vault,’ and be brought out only at the end of the data analysis.”

Out of fold - leave unseen data aside and do cross-fold on that. Good for ensembles.

ModelDepot alone has over 50,000 freely accessible pre-trained models, with search functionality.

(The BEST resource and a great API for Python) - with visual samples; it actually works well on clustering.

cost sensitive sampling

Systematic investigation of imbalance effects in CNNs, with several observations. This is crucial when training networks, because in real life you don't always get a balanced dataset.

Balancing data sets (wiki, scikit-learn & examples in SKLEARN):

SMOTE for imbalance (in weka, needs to be installed; see the paper) - find the k nearest neighbours and interpolate, as in the formula above.

CostSensitiveClassifier - a meta classifier in Weka that wraps classifiers and applies a custom penalty matrix for misclassification.

Gibbs sampling - an MCMC method to draw samples from a potentially really complicated, high-dimensional distribution where, analytically, it is hard to draw samples. The usual suspects are those nasty integrals when computing the normalizing constant of the distribution, especially in Bayesian inference. A Gibbs sampler can draw samples from any distribution, provided you can provide all of the conditional distributions of the joint distribution analytically.
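A minimal Gibbs sampler sketch for a bivariate standard normal with correlation rho, where both conditionals are known analytically (the target distribution is an assumption for illustration):

```python
import numpy as np

rho = 0.8                      # correlation of the assumed bivariate normal target
n_samples, x, y = 10000, 0.0, 0.0
rng = np.random.default_rng(0)
samples = np.empty((n_samples, 2))

for i in range(n_samples):
    # Conditionals of a standard bivariate normal: x|y ~ N(rho*y, 1-rho^2), and symmetrically.
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples[i] = x, y

print("empirical correlation:", np.corrcoef(samples[1000:].T)[0, 1])  # ~0.8 after burn-in
```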

Unread - learning curve sampling applied to model-based clustering - seems like active learning, i.e., sample using EM/clustering to achieve nearly the same accuracy as using all the data.

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics (Medium article) - what I found interesting about this paper is that it challenges the common approach of "the more the merrier" when it comes to training data, and shifts the focus from the quantity of the data to the quality of the data.

More links:

  • Scikit-lego on group-based splitting and transformation

  • Machine Learning Mastery on SMOTE for imbalance

  • How to choose your sample size from a population based on a confidence interval

  • Data advice - should we get more data? How much?

  • Understanding bias variance via learning curves

  • Advice on many things, including learning curves