Attention


  1. AMAZING

  2. Covers self-attention, soft vs. hard attention, global vs. local attention, neural Turing machines, pointer networks, transformers, SNAIL, and self-attention GANs.


  3. Faster, better, more accurate.

  4. Includes Turing machines / attention / adaptive computation time, etc. A general overview, not as clear as the one below.

    1. Soft (above) and hard (crisp) attention

    2. Dropping the hidden output - HAN or AB BiLSTM

    3. Attention concatenated to the input vector

    4. Global vs. local attention

    1. Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.

    2. Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

    3. A problem with this architecture is that performance is poor on long input or output sequences, which is believed to be due to the fixed-size internal representation used by the encoder (see the sketch after this list).

      1. Encoder-decoder

      2. Recursive
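To make the fixed-length bottleneck concrete, here is a minimal PyTorch sketch (my own illustration, not code from any of the linked tutorials): the encoder compresses the whole source sequence into a single hidden state, and the decoder must generate every output step from that one context vector. All sizes and the toy data are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len)
        _, h = self.rnn(self.embed(src))     # h: (1, batch, hidden_dim)
        return h                             # the single fixed-length context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, context):         # tgt: (batch, tgt_len)
        out, _ = self.rnn(self.embed(tgt), context)   # every step starts from `context`
        return self.out(out)                 # (batch, tgt_len, vocab_size)

src = torch.randint(0, 1000, (2, 7))         # toy source token ids
tgt = torch.randint(0, 1000, (2, 5))         # toy target token ids
logits = Decoder()(tgt, Encoder()(src))
print(logits.shape)                          # torch.Size([2, 5, 1000])
```

Attention, the topic of this page, removes this bottleneck by letting the decoder look back at all encoder states instead of only the final one.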

  1. Code on GIT:

note: word level then sentence level embeddings.


BERT/RoBERTa

In both cases, the metrics do not appear to be representative of the extent of linguistic knowledge learned by the BERT models, based on their strong performance on many NLP tasks. Hence, our takeaway is that while we can tease out some structure from the attention weights of BERT models using the above methods, studying the attention weights alone is unlikely to give us the full picture of BERT’s strength at processing natural language.

  1. TRANSFORMERS

α-entmax: a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight. Moreover, we derive a method to automatically learn the α parameter -- which controls the shape and sparsity of α-entmax -- allowing attention heads to choose between focused or spread-out behavior. Our adaptively sparse Transformer improves interpretability and head diversity when compared to softmax Transformers on machine translation datasets.
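For intuition, here is a small NumPy sketch of sparsemax, which is the α = 2 special case of α-entmax; it shows how low-scoring words receive exactly zero weight, unlike softmax. This is my own illustration with made-up scores, not the authors' code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """alpha-entmax with alpha = 2: a Euclidean projection onto the simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                  # scores in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum          # entries that keep nonzero weight
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1) / k_z        # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

scores = np.array([1.5, 1.0, 0.1, -1.0])
print(sparsemax(scores))   # [0.75 0.25 0.   0.  ]  -- exact zeros for weak words
print(softmax(scores))     # every entry strictly positive
```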

ELMO

ULMFIT

BERT

  1. (amazing) Deconstructing bert

    1. I found some fairly distinctive and surprisingly intuitive attention patterns. Below I identify six key patterns and for each one I show visualizations for a particular layer / head that exhibited the pattern.

Pruning - Removes unnecessary parts of the network after training. This includes weight magnitude pruning, attention head pruning, layer pruning, and others. Some methods also impose regularization during training to increase prunability (layer dropout).
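As a toy illustration of weight magnitude pruning (my own sketch, not tied to any specific BERT pruning method), the snippet below zeroes out the smallest-magnitude weights after training:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.75):
    """Zero out the `sparsity` fraction of entries with the smallest |w|."""
    threshold = np.quantile(np.abs(weights), sparsity)   # magnitude cutoff
    mask = np.abs(weights) >= threshold                  # keep only the large weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned, mask = magnitude_prune(w)
print(f"kept {mask.mean():.0%} of the weights")          # roughly 25% survive
```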

Weight Factorization - Approximates parameter matrices by factorizing them into a product of two smaller matrices, which imposes a low-rank constraint on the matrix. Weight factorization can be applied to the token embeddings (which saves a lot of memory on disk) or to the parameters in the feed-forward / self-attention layers (for some speed improvement).
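A minimal sketch of the low-rank idea, using a truncated SVD on an embedding-like matrix; the matrix sizes and rank are illustrative assumptions rather than values from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden, rank = 5000, 256, 64                # toy sizes; BERT's real vocab is ~30k
W = rng.normal(size=(vocab, hidden))               # original embedding matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)   # truncated SVD = best rank-k approximation
A = U[:, :rank] * S[:rank]                         # (vocab, rank)
B = Vt[:rank]                                      # (rank, hidden), so W is approximately A @ B

original, factored = W.size, A.size + B.size
print(f"params: {original:,} -> {factored:,} ({factored / original:.1%})")
```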

Knowledge Distillation - Aka “Student Teacher.” Trains a much smaller Transformer from scratch on the pre-training / downstream data. Normally this would fail, but utilizing soft labels from a fully-sized model improves optimization for unknown reasons. Some methods also distill BERT into different architectures (LSTMs, etc.) which have faster inference times. Others dig deeper into the teacher, looking not just at the output but at weight matrices and hidden activations.
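The soft-label objective can be sketched in a few lines of PyTorch; the temperature and the mixing weight between the soft (teacher) and hard (true label) terms are my own illustrative choices:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's tempered output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10)               # toy logits: 8 examples, 10 classes
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```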

Weight Sharing - Some weights in the model share the same value as other parameters in the model. For example, ALBERT uses the same weight matrices for every single layer of self-attention in BERT.

Quantization - Truncates floating point numbers to only use a few bits (which causes round-off error). The quantization values can also be learned either during or after training.
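A toy sketch of post-training quantization (my own illustration): map float32 weights to int8 with a single per-tensor scale, then dequantize to see the round-off error the text mentions.

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                       # one scale per tensor (a simplification)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale                      # dequantized weights
print("max round-off error:", np.abs(w - w_hat).max())
print("bytes: float32 =", w.nbytes, "-> int8 =", q.nbytes)
```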

Pre-train vs. Downstream - Some methods only compress BERT w.r.t. certain downstream tasks. Others compress BERT in a way that is task-agnostic.

GPT2

GPT3

XLNET

  1. CLIP

    1. Adversarial methodologies

Label flipping is a training technique where one selectively manipulates the labels in order to make the model more robust against label noise and associated attacks - the specifics depend a lot on the nature of the noise. Label flipping bears no benefit only under the assumption that all labels are (and will always be) correct and that no adversaries exist. In cases where noise tolerance is desirable, training with label flipping is beneficial.

Label smoothing is a regularization technique (and then some) aimed at improving model performance. Its effect takes place irrespective of label correctness.
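As a concrete illustration of label smoothing (a minimal sketch, with the epsilon value chosen arbitrarily): the one-hot target is mixed with a uniform distribution, y_smooth = (1 - eps) * y + eps / K.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    one_hot = np.eye(num_classes)[labels]                 # (n, K) hard one-hot targets
    return (1.0 - eps) * one_hot + eps / num_classes      # softened targets

print(smooth_labels(np.array([0, 2, 1]), num_classes=3))
# the "correct" class gets ~0.93 instead of 1.0; the rest share the remaining mass
```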

  1. GAN

    1. Unread:

- Results will change on other data; my impression is that the data in this article is too good.

Mastery on - a really unclear intro

Mastery on - this makes the whole process clear: score the decoder state against each encoder output, normalize the scores using softmax (annotation weights), take the weighted sum of the encoder outputs using those weights (i.e., the context vector), and then decode from the context vector.
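The score -> softmax -> weighted-sum pipeline described above can be written out in a few lines of NumPy. This sketch uses a dot-product score for brevity (the original Bahdanau formulation uses a small additive feed-forward scorer), and all shapes are made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))    # 6 source positions, hidden size 8
decoder_state = rng.normal(size=(8,))       # current decoder hidden state

scores = encoder_states @ decoder_state     # 1. score each encoder output
weights = softmax(scores)                   # 2. normalize into annotation weights
context = weights @ encoder_states          # 3. weighted sum = the context vector
print(weights.round(3), context.shape)      # the decoder then consumes `context`
```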

Mastery on - a theoretical discussion of many attention architectures. This adds useful context to everything above.

Encoder-decoder with recursive

HAN

LSTM

Tushv89

Richliao, hierarchical


- tl;dr: The attention weights between tokens in BERT/RoBERTa bear similarity to some syntactic dependency relations, but the results are less conclusive than we’d like as they don’t significantly outperform linguistically uninformed baselines for all types of dependency relations. In the case of MAX, our results indicate that specific heads in the BERT models may correspond to certain dependency relations, whereas for MST, we find much less support for “generalist” heads whose attention weights correspond to a full syntactic dependency structure.

(amazing)

(amazing)

(seems like it is constantly updated)

Hugging Face

- This memory layer allows us to tackle very large scale language modeling tasks. In our experiments we consider a dataset with up to 30 billion words, and we plug our memory layer into a state-of-the-art transformer-based architecture. In particular, we found that a memory augmented model with only 12 layers outperforms a baseline transformer model with 24 layers, while being twice as fast at inference time.

- This sparsity is accomplished by replacing softmax with α-entmax.


- everything you want to know with code


Batches, with code and linear regression.


“The one cycle policy provides some form of regularisation.” If you wish to know more about the one cycle policy, refer to the excellent paper by Leslie Smith, “A disciplined approach to neural network hyper-parameters”.
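A rough sketch of the one cycle learning-rate schedule (a simplified version of Smith's policy; real implementations also cycle momentum and often add a final annealing phase, and the rate values below are arbitrary):

```python
import numpy as np

def one_cycle_lr(step, total_steps, lr_min=1e-4, lr_max=1e-2):
    half = total_steps / 2
    if step <= half:                                           # ramp up to the peak
        return lr_min + (lr_max - lr_min) * step / half
    return lr_max - (lr_max - lr_min) * (step - half) / half   # ramp back down

schedule = [one_cycle_lr(s, total_steps=100) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])                # low -> peak -> low
```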

- using bert as a service to encode 1024 vectors and do cosine similarity

- The idea is to classify the word “duck” into one of three meanings using BERT embeddings, which promise contextualized embeddings, i.e., to duck, the Duck, etc.

- Attention to the next/previous/identical/related words (in the same and other sentences), other words predictive of a word, delimiter tokens.

(good) - Looking at the visualization and attention heads, focusing on delimiter attention, bag-of-words attention, and next-word attention patterns.

(read this first!)

The most coherent explanation of BERT: 15% masked word prediction and next sentence prediction. Also covers RoBERTa, XLM, ALBERT, DistilBERT.
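To make the 15% masking objective concrete, here is a toy NumPy sketch (my own simplification: real BERT replaces the chosen tokens with [MASK] only 80% of the time, uses a random token 10% and keeps the original 10%; the token ids below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
MASK_ID = 103                                           # assumed [MASK] id (bert-base-uncased)
tokens = np.array([101, 7592, 2088, 2003, 2307, 102])   # made-up input ids

mask = rng.random(len(tokens)) < 0.15                   # choose ~15% of positions
labels = np.where(mask, tokens, -100)                   # -100 = ignore in the loss
corrupted = np.where(mask, MASK_ID, tokens)             # the model sees [MASK] there
print(corrupted, labels)                                # train to recover the originals
```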

Fine tuning using the Hugging Face transformers package.

Youtube

from scratch using TF, with [CLS] [SEP] etc

On fine tuning, with some talk about training from scratch, and probably nothing about using embeddings as input.

Feature names as labels, finetune bert, predict.

(is/is not)

- When combined with distillation, the approach achieves minimal accuracy loss with down to only 3% of the model parameters.

Bert with keras

Finetuning - claims 94% on IMDB. From the official code: “it creates a single new layer that will be trained to adapt BERT to our sentiment task (i.e. classifying whether a movie review is positive or negative). This strategy of using a mostly trained model is called fine-tuning.”

- bert visualization tool.

sentenceBERT

on covid19

- TaBERT is the first model that has been pretrained to learn representations for both natural language sentences and tabular data.

- We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT’s attention.

BertViz is a tool for visualizing attention in the Transformer model, supporting all models from the transformers library (BERT, GPT-2, XLNet, RoBERTa, XLM, CTRL, etc.). It extends the Tensor2Tensor visualization tool by Llion Jones and the transformers library from HuggingFace.

PMI-masking - Joint masking of correlated tokens significantly speeds up and improves BERT's pretraining.

(really good) - TL;DR: BERT’s raw word embeddings capture useful and separable information (distinct histogram tails) about a word in terms of other words in BERT’s vocabulary. This information can be harvested from both raw embeddings and their transformed versions after they pass through BERT with a masked language model (MLM) head.

The GPT-2 small model was trained on the task of language modeling — which tests a program’s ability to predict the next word in a given sentence — by ingesting huge numbers of articles, blogs, and websites. By using just this data it achieved state-of-the-art scores on a number of unseen language tests, an achievement known as zero-shot learning. It can also perform other writing-related tasks, such as translating text from one language to another, summarizing long articles, and answering trivia questions.

for GPT-2 - big algo

on medium - language models can be used to produce good results on zero-shot, one-shot, or few-shot learning.

- Actually it's quite good at explaining it.

(Keras) A model for retrieving images that match natural language queries - the example demonstrates how to build a dual encoder (also known as two-tower) neural network model to search for images using natural language. The model is inspired by the CLIP approach introduced by Alec Radford et al. The idea is to train a vision encoder and a text encoder jointly to project the representation of images and their captions into the same embedding space, such that the caption embeddings are located near the embeddings of the images they describe.
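A compact PyTorch sketch of the dual-encoder ("two-tower") contrastive setup described above; the linear layers stand in for real vision and text encoders, and the temperature value is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

image_encoder = nn.Linear(2048, 256)   # stand-in for a vision backbone
text_encoder = nn.Linear(512, 256)     # stand-in for a text backbone

images = torch.randn(8, 2048)          # toy batch of image features
captions = torch.randn(8, 512)         # toy batch of caption features

img_emb = F.normalize(image_encoder(images), dim=-1)
txt_emb = F.normalize(text_encoder(captions), dim=-1)

logits = img_emb @ txt_emb.t() / 0.07  # pairwise similarities / temperature
targets = torch.arange(8)              # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
print(loss)                            # minimizing this pulls matching pairs together
```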

What are label flipping and label smoothing, and how are they used to make a model more robust against adversarial methodologies?

Smoothing the labels in this way prevents the network from becoming overconfident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition...Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective.

- In this paper we propose an efficient algorithm to perform optimal label flipping poisoning attacks and a mechanism to detect and relabel suspicious data points, mitigating the effect of such poisoning attacks.

- To develop a robust classification algorithm in the adversarial setting, it is important to understand the adversary’s strategy. We address the problem of label flips attack where an adversary contaminates the training set through flipping labels. By analyzing the objective of the adversary, we formulate an optimization framework for finding the label flips that maximize the classification error. An algorithm for attacking support vector machines is derived. Experiments demonstrate that the accuracy of classifiers is significantly degraded under the attack.

Such as label flipping, batch norm, etc. Read!


- transferring styles

- super res images

- good for critique

- divide and conquer

- mini batch discrimination

- reducing mode collapse

with tf code

“GAN”

A really good REVIEW on attention and its many forms, historical changes, etc
Medium on comparing cnn / rnn / han
rnn vs attention vs global attention
attention
attention with lstm encoding / decoding
GIT
paper
Non penalized self attention
BiLSTM attention
paper
Keras layer attention implementation
Attention code for document classification using keras
blog
group chatter
Self Attention pip for keras
git
Philippe Remy on attention in keras - not a single layer, a few of them combined to make it work.
Self attention with relative position representations
NMT - jointly learning to align and translate
Medium on attention plus code, comparison keras and pytorch
Do attention heads in bert roberta track syntactic dependencies?
Jay alammar on transformers
J.A on Bert Elmo
Jay alammar on a visual guide of bert for the first time
J.A on GPT2
Super fast transformers
Lilian Wang on the transformer family
encoders decoders in transformers for seq2seq
The annotated transformer
Large memory layers with product keys
Adaptive sparse transformers
Short tutorial on elmo, pretrained, new data, incremental(finetune?)
using elmo pretrained
Why you can't use elmo to encode words (contextualized)
Vidhya on elmo
Sebastian Ruder on language modeling embeddings for the purpose of transfer learning: ELMO, ULMFIT, OpenAI transformer, BiLSTM
Another good tutorial on elmo
ELMO
tutorial
github
Elmo on google hub and code
How to use elmo embeddings, advice for word and sentence
Using elmo as a lambda embedding layer
Elmo tutorial notebook
Elmo code on git
Elmo on keras using lambda
Elmo pretrained models for many languages
russian
mean elmo
Ari’s intro on word embeddings part 2, has elmo and some bert
Mean elmo
Elmo projected using TSNE - groupings are not semantically similar
Tutorial and code by vidhya
medium
Paper
Ruder on transfer learning
Medium on how - unclear
Fast NLP on how
Paper: ulmfit
Fast.ai on ulmfit
this too
Vidhya on ulmfit using fastai
Medium on ulmfit
Building blocks of ulm fit
Applying ulmfit on entity level sentiment analysis using business news articles
Understanding language modelling using Ulmfit, fine tuning etc
Vidhaya on ulmfit + colab
A disciplined approach to neural network hyper-parameters: Part 1 — learning rate, batch size, momentum, and weight decay
The BERT PAPER
Prerequisite about transformers and attention - this is not enough
Embeddings using bert in python
Google neural machine translation (attention) - too long
What is bert
part 1
Deconstructing bert part 2
Bert demystified
Read this after
thorough tutorial on bert
Code
ep1
2
3
3b
How to train bert
Extending a vocabulary for bert, another kind of transfer learning.
Bert tutorial
Bert for summarization thread
Bert on logs
Bert scikit wrapper for pipelines
What is bert not good at, also refer to the cited paper
Jay Alammar on Bert
Jay Alammar on using DistilBERT
sparse bert
paper
blog post
colaboratory
Bert with t-hub
Bert on medium with code
Bert on git
Better sentiment analysis with bert
here
fine-tuning
Explain bert
paper
Bert question answering
Codebert
Bert multilabel classification
Tabert
TaBERT
All the ways that you can compress BERT
Bert and nlp in 2019
HeBERT - bert for Hebrew sentiment and emotions
KDnuggets on visualizing bert
What does bert look at, analysis of attention
Bertviz
transformers
Tensor2Tensor visualization tool
Llion Jones
transformers
HuggingFace
paper
post
Examining bert raw embeddings
the GPT-2
Medium code
GPT3
Fit More and Train Faster With ZeRO via DeepSpeed and FairScale
Xlnet is transformer and bert combined
git
Implementation of a dual encoder
CLIP
flipping and smoothing
Paper: when does label smoothing helps?
Label smoothing, python code, multi class examples
Label sanitization against label flipping poisoning attacks
Adversarial label flips attacks on svm
Great advice for training gans
Intro to Gans
A fantastic series about gans; the following two, what GANs are and their applications, are from there
What are GANs?
applications
Comprehensive overview
Cycle gan
Super gan resolution
Why gan so hard to train
And how to improve gans performance
Dcgan good as a starting point in new projects
Labels to improve gans, cgan, infogan
Stacked - labels, gan adversarial loss, entropy loss, conditional loss
Progressive gans
Using attention to improve gan
Least square gan - lsgan
Wasserstein gan, wgan gp
Faster training for gans, lower training count rsgan ragan
Addressing gan stability, ebgan began
What is wrong with gan cost functions
Using cost functions for gans in spite of the google brain paper
Proving gan is js-convergence
Dragan on minimizing local equilibria, how to stabilize gans
Unrolled gan for reducing mode collapse
Measuring gans
Ways to improve gans performance
Introduction to gans
Intro to gans
Intro to gan in KERAS
using xgboost and gmm for density sampling
Reverse engineering
Illustrated attention
Illustrated self attention - great
Jay Alammar on attention, the first one is better.
Attention is all you need (paper)
The annotated transformer - reviewing the paper
Lilian weng on attention
Understanding attention in rnns
Another good intro with gifs to attention
Clear insight to what attention is, a must read
Transformer NN by google
Intuitive explanation to attention
Attention by vidhya
Augmented rnns
A survey of long term context in transformers.
Identifying the right meaning with bert