Deep Network Optimization


PRUNING / KNOWLEDGE DISTILLATION / LOTTERY TICKET

  1. Awesome Knowledge distillation

  2. Lottery ticket

    1. 1, 2 - paper

    2. Uber on Lottery ticket, masking weights retraining

    3. Facebook article and paper

  3. Knowledge distillation 1, 2, 3

  4. Pruning 1, 2

  5. Teacher-student knowledge distillation focusing on Knowledge & Ranking distillation

  6. Neural Network Graph With Shared Inputs

  7. Deep network compression using teacher student

  8. Lottery ticket on BERT - magnitude vs structured pruning on various metrics, i.e., the lottery ticket hypothesis works on BERT. The classical Lottery Ticket Hypothesis was mostly tested with unstructured pruning, specifically magnitude pruning (m-pruning), where the weights with the lowest magnitude are pruned irrespective of their position in the model. We iteratively prune 10% of the least-magnitude weights across the entire fine-tuned model (except the embeddings) and evaluate on the dev set, for as long as the performance of the pruned subnetwork is above 90% of the full model (a minimal code sketch of this loop follows the list).

    We also experiment with structured pruning (s-pruning) of entire components of the BERT architecture based on their importance scores: specifically, we "remove" the least important self-attention heads and MLPs by applying a mask. In each iteration, we prune 10% of BERT heads and 1 MLP, for as long as the performance of the pruned subnetwork is above 90% of the full model. To determine which heads/MLPs to prune, we use a loss-based approximation: the importance scores proposed by Michel, Levy and Neubig (2019) for self-attention heads, which we extend to MLPs. Please see our paper and the original formulation for more details.
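To make the m-pruning loop quoted above concrete, here is a minimal PyTorch sketch using torch.nn.utils.prune. The evaluate callback and the Linear-only parameter selection are illustrative assumptions; the actual BERT experiments follow the paper's exact protocol.

```python
import torch
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model, evaluate, step=0.10, floor=0.90):
    """Iteratively prune the lowest-magnitude weights (m-pruning) until the
    pruned subnetwork drops below `floor` of the full model's dev score."""
    full_score = evaluate(model)
    # Prune all Linear weights; a faithful BERT run would also skip the embeddings.
    params = [(m, "weight") for m in model.modules()
              if isinstance(m, torch.nn.Linear)]
    while True:
        # Global magnitude pruning: mask the `step` fraction of smallest weights.
        prune.global_unstructured(params,
                                  pruning_method=prune.L1Unstructured,
                                  amount=step)
        if evaluate(model) < floor * full_score:
            break  # this last step overshot; in practice keep the previous mask
    return model
```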

Troubleshooting Neural Nets

(37 reasons, 10 more) - copy-pasted and rewritten here for convenience. It is pretty thorough, but long and extensive; you should have some sort of intuition and not go through all of these. The article itself has much more insight and information than this list.

The author of the original article suggests turning everything off and then building your network back up step by step, i.e., a "divide and conquer" debug method.

Dataset Issues

1. Check your input data - for stupid mistakes

2. Try random input - if the error behaves the same on random data, the problem is inside the net. Debug layer by layer.

3. Check the data loader - input data is possibly broken. Check the input layer.

4. Make sure input is connected to output - do samples have correct labels, even after shuffling?

5. Is the relationship between input and output too random? - the inputs may not be sufficiently related to the outputs. This is fairly amorphous; just look at the data.

6. Is there too much noise in the dataset? - badly labelled datasets.

7. Shuffle the dataset - useful to counteract ordering in the dataset; always shuffle inputs and labels together (see the sketch after this list).

8. Reduce class imbalance - imbalanced datasets may bias the prediction toward the majority class. Balance your classes, weight your loss, do something.

9. Do you have enough training examples? - if training from scratch, a rule of thumb is ~1000 images per class, and probably similar numbers for other types of samples.

10. Make sure your batches don't contain a single label - this is probably something you won't notice and will waste a lot of time figuring out! In certain cases, shuffle the dataset to prevent batches from having the same label (see the sketch after this list).

11. Reduce batch size - This paper points out that having a very large batch can reduce the generalization ability of the model. However, please note that I found other references that claim a too-small batch will also impact performance.

Also: test on well-known datasets - when trying out a new architecture or new code, sanity-check it on a standard dataset (e.g., MNIST or CIFAR-10) first.
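For items 7 and 10 above, a minimal numpy sketch of shuffling inputs and labels with the same permutation, and of flagging batches that collapse to a single label (the toy arrays are illustrative, not from the original article):

```python
import numpy as np

def shuffle_together(x, y, seed=0):
    # Item 7: shuffle inputs and labels with the SAME permutation.
    idx = np.random.default_rng(seed).permutation(len(x))
    return x[idx], y[idx]

def single_label_batches(y, batch_size):
    # Item 10: report batch indices that contain only one label.
    return [i // batch_size for i in range(0, len(y), batch_size)
            if len(np.unique(y[i:i + batch_size])) == 1]

x, y = np.arange(12).reshape(6, 2), np.array([0, 0, 0, 1, 1, 1])
print(single_label_batches(y, 3))   # -> [0, 1]: every batch is a single class
x, y = shuffle_together(x, y)
print(single_label_batches(y, 3))   # usually -> [] after shuffling
```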

Data Normalization/Augmentation

12. Standardize the features - rescale inputs to zero mean and unit variance (often loosely called normalization; see the sketch after this list).

13. Do you have too much data augmentation?

Augmentation has a regularizing effect. Too much of this combined with other forms of regularization (weight L2, dropout, etc.) can cause the net to underfit.

14. Check the preprocessing of your pretrained model - when using a pretrained model, make sure your input data is in the same range it was trained on, e.g., [0, 1], [-1, 1] or [0, 255].

15. Check the preprocessing for train/validation/test set - CS231n points out a common pitfall: any preprocessing statistics (e.g., the data mean) should be computed ONLY on the training data, then applied to the validation/test data (see the sketch after this list).
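A minimal scikit-learn sketch tying together items 12 and 15: standardize to zero mean and unit variance, with statistics computed on the training split only (the random arrays are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_raw = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
X_val_raw = rng.normal(loc=5.0, scale=3.0, size=(20, 4))

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # mean/std estimated on TRAIN only
X_val = scaler.transform(X_val_raw)          # same statistics reused, never refit
print(X_train.mean(axis=0).round(2), X_train.std(axis=0).round(2))
```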

Implementation issues

16. Try solving a simpler version of the problem - divide and conquer the prediction, e.g., if the model predicts both class and box coordinates, start with just one of them.

17. Look for the correct loss "at chance" - calculate the loss at chance level; e.g., for a balanced 10-class problem the softmax loss (the negative log probability) should start around -ln(0.1) ≈ 2.3. Afterwards, increasing the regularization strength should increase the loss. (See the sketch after this list.)

18. Check your custom loss function.

19. Verify loss input - parameter confusion, e.g., passing probabilities where the loss expects raw logits.

20. Adjust loss weights - if your loss is composed of several smaller loss functions, make sure their magnitudes relative to each other are correct. This might involve testing different combinations of loss weights.

21. Monitor other metrics - like accuracy; the loss is not always the best predictor of whether the network is training properly.

22. Test any custom layers - debug them individually.

23. Check for "frozen" layers or variables - did you accidentally freeze something that should be trained? (See the sketch after this list.)

24. Increase network size - more layers, more neurons.

25. Check for hidden dimension errors - if your input is, e.g., (k, H, W) = (64, 64, 64), it is easy to miss errors related to a wrong dimension.

26. Explore gradient checking - does your backprop work for custom gradients? Verify it numerically (see http://cs231n.github.io/neural-networks-3/#gradcheck).
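For items 17 and 23 above, a short PyTorch sketch (the tiny Sequential model is a made-up placeholder): the cross-entropy loss "at chance" for a balanced 10-class problem, and listing parameters that will not receive gradients:

```python
import math
import torch

# Item 17: softmax loss at chance for 10 balanced classes is -ln(0.1) ~= 2.30
print(-math.log(1 / 10))

# Item 23: find parameters that were (perhaps accidentally) frozen
model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.ReLU(),
                            torch.nn.Linear(4, 2))
model[0].weight.requires_grad = False        # simulate an accidental freeze
print([name for name, p in model.named_parameters() if not p.requires_grad])
# -> ['0.weight']
```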

Training issues

27. Solve for a really small dataset - can you overfit on 2 samples? If not, something is broken.

28. Check weights initialization - Xavier or He, or forget about it for networks such as RNNs.

29. Change your hyperparameters - grid search.

30. Reduce regularization - too much regularization may cause underfitting; try reducing dropout, batch norm, and weight/bias L2 regularization.

31. Give it more training time as long as the loss is decreasing.

32. Switch from Train to Test mode - layers such as batch norm and dropout behave differently between the two modes; make sure you are in the right one.

33. Visualize the training - activations, weights, layer updates, and biases, e.g., with Tensorboard and Crayon; tips on Deeplearning4j. Expect a roughly Gaussian distribution for weights; biases start at 0 and end up approximately Gaussian. Keep an eye out for parameters diverging to +/- infinity and for biases that become very large; the latter can occur in the output layer for classification if the distribution of classes is very imbalanced.

34. Try a different optimizer - check this excellent post about gradient descent optimizers.

35. Exploding / vanishing gradients - gradient clipping may help (see the sketch after this list). Tips from Deeplearning4j: "A good standard deviation for the activations is on the order of 0.5 to 2.0. Significantly outside of this range may indicate vanishing or exploding activations."

36. Increase/decrease the learning rate, or use adaptive learning.

37. Overcoming NaNs - a big issue for RNNs: decrease the learning rate (see "how to deal with NaNs"), and evaluate layer by layer to find where the NaNs first appear (see the sketch after this list).
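For items 35 and 37, a minimal PyTorch sketch of gradient clipping and of locating the first layer that produces NaNs; model, loss_fn, and optimizer are placeholders for whatever you are training:

```python
import torch

def training_step(model, x, y, loss_fn, optimizer, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Item 35: clip exploding gradients before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()

def first_nan_layer(model, x):
    # Item 37: run layer by layer and report where NaNs first appear.
    for i, layer in enumerate(model):  # assumes an nn.Sequential-style model
        x = layer(x)
        if torch.isnan(x).any():
            return i
    return None
```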

