Embedding

Intro

(amazing) Embeddings from the ground up, singlelunch

  1. Faiss - a library for efficient similarity search

  2. Benchmarking - complete with almost everything imaginable

  3. Google Cloud Vertex Matching Engine NN search

    1. search

      1. Recommendation engines

      2. Search engines

      3. Ad targeting systems

      4. Image classification or image search

      5. Text classification

      6. Question answering

      7. Chat bots

    2. Features

      1. Low latency

      2. High recall

      3. managed

      4. Filtering

      5. scale

  4. Pinecone - managed vector similarity search - Pinecone is a fully managed vector database that makes it easy to add vector search to production applications, without having to benchmark and tune algorithms or build and maintain infrastructure for vector search.

  5. Nmslib (benchmarked - benchmarks of approximate nearest neighbor libraries in Python) is the Non-Metric Space Library (NMSLIB): an efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

  6. ScaNN

  7. Vespa.ai - Make AI-driven decisions using your data, in real time. At any scale, with unbeatable performance.

  8. Weaviate - Weaviate is an open source vector search engine and vector database. Weaviate uses machine learning to vectorize and store data, and to find answers to natural language queries, or any other media type.

  9. Neural Search with BERT and Solr - Indexing BERT vector data in Solr and searching with full traversal

  10. Fun With Apache Lucene and BERT Embeddings - This post goes much deeper -- to the similarity search algorithm on Apache Lucene level. It upgrades the code from 6.6 to 8.0

  11. Speeding up BERT Search in Elasticsearch - Neural Search in Elasticsearch: from vanilla to KNN to hardware acceleration

  12. Ask Me Anything about Vector Search - In the Ask Me Anything: Vector Search! session Max Irwin and Dmitry Kan discussed major topics of vector search, ranging from its areas of applicability to comparing it to good ol’ sparse search (TF-IDF/BM25), to its readiness for prime time and what specific engineering elements need further tuning before offering this to users.

  13. Search with BERT vectors in Solr and Elasticsearch - GitHub repository used for experiments with Solr and Elasticsearch using DBPedia abstracts, comparing Solr, vanilla Elasticsearch, elastiknn enhanced Elasticsearch, OpenSearch, and GSI APU

  14. Not All Vector Databases Are Made Equal - A detailed comparison of Milvus, Pinecone, Vespa, Weaviate, Vald, GSI and Qdrant

  15. Vector Podcast - Podcast hosted by Dmitry Kan, interviewing the makers in the Vector / Neural Search industry. Available on YouTube, Spotify, Apple Podcasts and RSS

  16. Players in Vector Search: Video - Video recording and slides of the talk presented at the London IR Meetup on the topic of players, algorithms, software and use cases in Vector Search
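
All of the libraries and services above speed up the same basic operation: finding the stored vectors closest to a query vector. As a point of reference, here is a minimal sketch of the exact (brute-force) version that approximate methods are usually benchmarked against. The toy index, its three-dimensional vectors and the function names are made up for illustration and are not taken from any of the listed libraries.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_u * norm_v)

def top_k(query, index, k=2):
    """Exact search: score every stored vector and keep the k best (id, score) pairs."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index; real embeddings usually have hundreds of dimensions.
index = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.0, 0.0], index, k=2))
```

The cost grows linearly with the number of stored vectors, which is the motivation for the approximate indexes these tools provide.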

TOOLS

FLAIR

  1. Named-Entity Recognition (NER): Recognises whether a word represents a person, a location or another name in the text.

  2. Parts-of-Speech Tagging (PoS): Tags all the words in the given text as to which “part of speech” they belong to.

  3. Text Classification: Classifying text based on the criteria (labels)

  4. Training Custom Models: Making our own custom models.

  5. It comprises popular and state-of-the-art word embeddings, such as GloVe, BERT, ELMo, Character Embeddings, etc. These are very easy to use thanks to the Flair API

  6. Flair’s interface allows us to combine different word embeddings and use them to embed documents. This in turn leads to a significant uptick in results

  7. ‘Flair Embedding’ is the signature embedding provided within the Flair library. It is powered by contextual string embeddings.

  8. Flair supports a number of languages – and is always looking to add new ones

HUGGING FACE

  1. Hugging Face on emotions

    1. how to make a custom PyTorch LSTM with custom activation functions,

    2. how the PackedSequence object works and is built,

    3. how to convert an attention layer from Keras to PyTorch,

    4. how to load your data in PyTorch: DataSets and smart Batching,

    5. how to reproduce Keras weights initialization in PyTorch.

  2. A thorough tutorial on BERT, fine-tuning using the Hugging Face transformers package. Code

YouTube ep1, 2, 3, 3b

LANGUAGE EMBEDDINGS

History

  1. How self attention and relative positioning work (great!)

    1. RNNs are sequential: the same word in a different position gets a different encoding, because the input carried over from the previous words is inherently different.

    2. Attention without positional information! The same word gets the same encoding wherever it appears, not a distinct one per position.

    3. Relative positioning looks at a window around each word and adds a distance vector encoding how many words before or after it each neighbor is, which fixes the problem.

    4. The authors hypothesized that precise relative position information is not useful beyond a certain distance.

    5. Clipping the maximum distance enables the model to generalize to sequence lengths not seen during training.
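
The clipping idea in the last two points can be made concrete with a small sketch. It only builds the matrix of clipped relative distances; in a real model each distinct value would index a learned embedding vector. The function name and the choice of `max_distance=2` are illustrative assumptions.

```python
def clipped_relative_positions(seq_len, max_distance):
    """Return R where R[i][j] = clip(j - i, -max_distance, max_distance)."""
    return [
        [max(-max_distance, min(max_distance, j - i)) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Each row shows how one position "sees" every other position.
for row in clipped_relative_positions(seq_len=5, max_distance=2):
    print(row)
```

Because only `2 * max_distance + 1` distinct values can occur, the same small set of distance embeddings covers a sequence of any length, which is why clipping helps with lengths not seen during training.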

Embedding Foundation Knowledge

  1. Medium post introducing word embeddings, sentence embeddings and trends in the field, with the accompanying git notebook and the author's git. It covers:

    1. Baseline Averaged Sentence Embeddings

    2. Doc2Vec

    3. Neural-Net Language Models (Hands-on Demo!)

    4. Skip-Thought Vectors

    5. Quick-Thought Vectors

    6. InferSent

    7. Universal Sentence Encoder

Language modeling

  1. Ruder on language modelling as the next ImageNet - Language modelling, the last approach mentioned, has been shown to capture many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and sentiment. Compared to related unsupervised tasks such as skip-thoughts and autoencoding, language modelling performs better on syntactic tasks even with less training data.

  2. A tutorial about w2v and skip-thought, with code! The language modelling part is the important one here - Our second method is training a language model to represent our sentences. A language model describes the probability of a text existing in a language. For example, the sentence “I like eating bananas” would be more probable than “I like eating convolutions.” We train a language model by slicing windows of n words and predicting what the next word will be in the text.

  3. BERT, [python git](https://github.com/CyberZHG/keras-bert) - We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks.

  4. OpenAI on language modelling - We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers and unsupervised pre-training. Paper, code.

  5. finetune - Scikit-learn inspired model finetuning for natural language processing. finetune ships with a pre-trained language model from “Improving Language Understanding by Generative Pre-Training” and builds off the OpenAI/finetune-language-model repository.

  6. Did not read - The annotated Transformer - Jupyter notebook on the transformer with annotations
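
The tutorial's description of training data for a language model ("slicing windows of n words and predicting what the next word will be") can be sketched as follows. This only produces the (context, next word) training pairs, not a model; the whitespace tokenizer and the function name are simplifying assumptions.

```python
def next_word_examples(text, n=3):
    """Slide a window of n words over the text; the word after each window is the target."""
    tokens = text.lower().split()
    return [(tokens[i:i + n], tokens[i + n]) for i in range(len(tokens) - n)]

for context, target in next_word_examples("I like eating bananas every morning", n=3):
    print(context, "->", target)
```

A model trained on such pairs assigns a probability to each candidate next word, and from those a probability to a whole sentence.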

Embedding spaces

  1. Sent2vec by gensim - sentence embedding is defined as the average of the source word embeddings of its constituent words. This model is furthermore augmented by also learning source embeddings for not only unigrams but also n-grams of words present in each sentence, and averaging the n-gram embeddings along with the words

  2. Wordrank vs fasttext vs w2v comparison - the better word similarity algorithm

  3. Doc2vec tutorial by gensim - Doc2vec (aka paragraph2vec, aka sentence embeddings) modifies the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents. Most importantly, this tutorial has crucial information about the implementation parameters that should be read before using it.

  4. Lbl2Vec, medium - an algorithm for unsupervised document classification and unsupervised document retrieval. It automatically generates jointly embedded label, document and word vectors and returns documents of categories modeled by manually predefined keywords.

  5. Skip-thought - [git](https://github.com/ryankiros/skip-thoughts) - Where word2vec attempts to predict surrounding words from certain words in a sentence, skip-thought vector extends this idea to sentences: it predicts surrounding sentences from a given sentence. NOTE: Unlike the other methods, skip-thought vectors require the sentences to be ordered in a semantically meaningful way. This makes this method difficult to use for domains such as social media text, where each snippet of text exists in isolation.

  6. Fastsent - Skip-thought vectors are slow to train. FastSent attempts to remedy this inefficiency while expanding on the core idea of skip-thought: that predicting surrounding sentences is a powerful way to obtain distributed representations. Formally, FastSent represents a sentence as the simple sum of its word embeddings, making training efficient. The word embeddings are learned so that the inner product between the sentence embedding and the word embeddings of surrounding sentences is maximized. NOTE: FastSent sacrifices word order for the sake of efficiency, which can be a large disadvantage depending on the use-case.

  7. Weighted sum of words - In this method, each word vector is weighted by the factor $\frac{a}{a + p(w)}$, where $a$ is a hyperparameter and $p(w)$ is the (estimated) word frequency. This is similar to tf-idf weighting, where more frequent terms are weighted down. NOTE: Word order and surrounding sentences are ignored as well, limiting the information that is encoded.

  8. InferSent by facebook - paper - InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language inference data and generalizes well to many different tasks. ABSTRACT: we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks.

  9. Universal sentence encoder - google - notebook, git - The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

  10. Pair2vec - paper - proposes new methods for learning and using embeddings of word pairs that implicitly represent background knowledge about such relationships, i.e., using p2v information with existing models to increase performance. Experiments show that our pair embeddings can complement individual word embeddings, and that they are perhaps capturing information that eludes the traditional interpretation of the Distributional Hypothesis
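
The frequency-weighted scheme from the "Weighted sum of words" item can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the toy corpus and two-dimensional vectors are invented, p(w) is estimated as plain relative frequency, and the weighted sum is divided by the number of words so that sentence length does not dominate (an assumption added here).

```python
from collections import Counter

def word_probabilities(corpus):
    """Estimate p(w) as the relative frequency of each token in the corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def weighted_sentence_vector(tokens, vectors, probs, a=1e-3):
    """Average the word vectors, each scaled by a / (a + p(w))."""
    known = [t for t in tokens if t in vectors]  # skip out-of-vocabulary words
    if not known:
        return []
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    for token in known:
        weight = a / (a + probs.get(token, 0.0))  # frequent words get small weights
        for d in range(dim):
            total[d] += weight * vectors[token][d]
    return [x / len(known) for x in total]

# Hypothetical toy data; a=0.5 is chosen only to keep the numbers readable.
corpus = [["the", "cat"], ["the", "dog"]]
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "dog": [0.0, -1.0]}
probs = word_probabilities(corpus)
print(weighted_sentence_vector(["the", "cat"], vectors, probs, a=0.5))
```

Setting every weight to 1 recovers the plain averaged sentence embedding used as a baseline elsewhere in this list.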

Embedding Models

Cat2vec

  1. Part 2: cat2vec using w2v, and entity embeddings for categorical data

ENTITY EMBEDDINGS

  1. Using embeddings on tabular data, specifically categorical - introduction, using fastai without limiting ourselves to pytorch - the material from this post is covered in much more detail starting around 1:59:45 in the Lesson 3 video and continuing in Lesson 4 of our free, online Practical Deep Learning for Coders course. To see example code of how this approach can be used in practice, check out our Lesson 3 jupyter notebook. Perhaps Saturday and Sunday have similar behavior, and maybe Friday behaves like an average of a weekend and a weekday. Similarly, for zip codes, there may be patterns for zip codes that are geographically near each other, and for zip codes that are of similar socio-economic status. Note: the jupyter notebook doesn't seem to have the embedding example they are talking about.

  2. Embedder - git code for a simplified version of the entity embedding above.

  3. Finally, what they do is label encode each feature using LabelEncoder into an int-based feature, then push each feature into its own embedding layer (one integer input per feature) with an embedding size defined by a rule of thumb (so it seems), merge all layers, train a synthetic regression/classification task and grab the weights of the corresponding embedding layer.
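
The first half of that recipe (label encoding, then one embedding table per feature) can be sketched without any deep learning library. Everything here is illustrative: the size heuristic `min(50, (n + 1) // 2)` is one commonly quoted rule of thumb and is assumed, not confirmed to be what the Embedder code uses, and the table is only randomly initialized, whereas in practice its rows are learned while training the regression or classification network.

```python
import random

def label_encode(values):
    """Map each distinct category to an integer id (sorted for determinism)."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def embedding_size(n_categories, cap=50):
    """Assumed rule of thumb: about half the cardinality, capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)

def init_embedding_table(n_categories, dim, seed=0):
    """One small random vector per category; training would update these rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(n_categories)]

days = ["Mon", "Sat", "Sun", "Fri", "Sat", "Mon"]
ids, mapping = label_encode(days)
dim = embedding_size(len(mapping))
table = init_embedding_table(len(mapping), dim)
vectors = [table[i] for i in ids]  # embedding lookup is just row indexing
print(mapping, dim)
```

After training, the rows of `table` are the entity embeddings: categories that behave alike (for example Saturday and Sunday) are expected to end up close together.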

ALL2VEC EMBEDDINGS

  1. Fast.ai post regarding embeddings for tabular data, i.e., continuous and categorical data

  2. Diff2vec - might be useful on social network graphs, paper, code

  3. emoji 2vec (below)

EMOJIS

  1. Hugging Face on emotions

    1. how to make a custom PyTorch LSTM with custom activation functions,

    2. how the PackedSequence object works and is built,

    3. how to convert an attention layer from Keras to PyTorch,

    4. how to load your data in PyTorch: DataSets and smart Batching,

    5. how to reproduce Keras weights initialization in PyTorch.

  2. Group2vec, git and medium - a multi-input embedding network built from steps 1-6 below, plus two other methods that involve groupby and applying entropy and join/countvec per class. Really interesting

    1. Initialize embedding layers for each categorical input;

    2. For each category, compute dot-products among other embedding representations. These are our ‘groups’ at the categorical level;

    3. Summarize each ‘group’ adopting an average pooling;

    4. Concatenate ‘group’ averages;

    5. Apply regularization techniques such as BatchNormalization or Dropout;

    6. Output probabilities.

WORD2VEC

  1. Monitor train loss using callbacks for word2vec

  2. Cleaning datasets using weighted w2v sentence encoding, then pca and isolation forest to remove outlier sentences.

  3. Chris McCormick on w2v, [post #2](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/) - negative sampling: “Negative sampling addresses this by having each training sample only modify a small percentage of the weights, rather than all of them. With negative sampling, we are instead going to randomly select just a small number of “negative” words (let’s say 5) to update the weights for. (In this context, a “negative” word is one for which we want the network to output a 0 for). We will also still update the weights for our “positive” word (which is the word “quick” in our current example). The “negative samples” (that is, the 5 output words that we’ll train to output 0) are chosen using a “unigram distribution”. Essentially, the probability for selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.”

  4. Chris McCormick on negative sampling and hierarchical softmax training, i.e., a Huffman binary tree over the vocabulary, learning the internal tree nodes, i.e., the path as the probability vector, instead of having len(vocabulary) output neurons.

  5. Another gensim-based w2v tutorial, with starter code and some usage examples of similarity

  6. Mean w2v

  7. Sequential w2v embeddings.

  8. W2v analogies using predefined anthologies of the form x:y::a:b, plus code, plus insights into why it works and why it doesn't. Examples: presence : absence :: happy : unhappy; absence : presence :: happy : proud; abundant : scarce :: happy : glad; refuse : accept :: happy : satisfied; accurate : inaccurate :: happy : disappointed; admit : deny :: happy : delighted; never : always :: happy : Said_Hirschbeck; modern : ancient :: happy : ecstatic
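
The negative sampling step quoted in item 3 can be sketched as drawing a handful of words from a frequency-based distribution. This is an illustration under assumptions: the counts are raised to the power 0.75 as in the original word2vec paper (the quoted post only says the probability is "related to" frequency), and the toy sentence and function names are invented.

```python
import random
from collections import Counter

def negative_sampling_distribution(tokens, power=0.75):
    """Unigram distribution with counts raised to `power`, normalized to sum to 1."""
    counts = Counter(tokens)
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

def sample_negatives(dist, positive, k=5, seed=None):
    """Draw k negative words (with replacement), never returning the positive word."""
    rng = random.Random(seed)
    words = [w for w in dist if w != positive]
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)

tokens = "the quick brown fox jumps over the lazy dog the end".split()
dist = negative_sampling_distribution(tokens)
print(sample_negatives(dist, positive="quick", k=5, seed=0))
```

Only the output weights for the positive word and these k sampled words are updated for a training pair, which is what keeps each update cheap.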

GLOVE

  1. W2v against glove performance comparison - glove wins in % and time.

  2. How glove and w2v work; the following has a very good description - “GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.”
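
The quoted description translates into a simple per-pair objective: the squared gap between the dot product of two word vectors and the log of their co-occurrence count. The sketch below follows only that simplified description; the full GloVe objective also includes bias terms and a weighting function, which are left out here, and the vectors are hand-picked for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def glove_pair_loss(vec_i, vec_j, cooccurrence_count):
    """Squared gap between the dot product and the log co-occurrence count."""
    return (dot(vec_i, vec_j) - math.log(cooccurrence_count)) ** 2

# Hand-picked vectors whose dot product is exactly log(10).
vec_dog = [1.0, 1.0]
vec_cat = [math.log(10) / 2, math.log(10) / 2]
print(glove_pair_loss(vec_dog, vec_cat, cooccurrence_count=10))
print(glove_pair_loss(vec_dog, [0.0, 0.0], cooccurrence_count=10))
```

Training amounts to adjusting all vectors so that this gap is small across every co-occurring pair in the corpus.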

FastText

  1. Gensim - fasttext docs, similarity, analogies

  2. Alternative to gensim - promises speed and out of the box support for many embeddings.

  3. A comparison of w2v vs ft using gensim - “Word2Vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. Makes sense, since fastText embeddings are trained for understanding morphological nuances, and most of the syntactic analogies are morphology based.”

    1. Syntactic means syntax, as in tasks that have to do with the structure of the sentence. These include tree parsing and POS tagging; they usually need less context and a shallower understanding of world knowledge.

    2. Semantic tasks are meaning related, a higher level of the language tree. They typically involve a higher-level understanding of the text and might include tasks such as question answering, sentiment analysis, etc.

    3. As for analogies, he is referring to the mathematical-operator-like properties exhibited by word embeddings. In this context a syntactic analogy would be related to plurals, tense or gender, and a semantic analogy would be a word-meaning relationship such as king - man + woman = queen. See for instance this article (and many others)

  4. Paper on fasttext vs glove vs w2v on a single dataset, performance comparison. Ft wins by a small margin
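
The "mathematical operator like" behavior mentioned under analogies is usually implemented as vector arithmetic followed by a nearest-neighbor lookup. Here is a minimal sketch with hand-made two-dimensional vectors (axes loosely standing for "royalty" and "gender"); real embeddings are learned and far noisier, and the function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the query words
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical toy vectors: [royalty, gender].
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}
print(analogy("man", "king", "woman", vectors))
```

Excluding the three query words from the candidates matters in practice, since the nearest vector to `b - a + c` is often `b` or `c` itself.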

SENTENCE EMBEDDING

Sense2vec

  1. Blog, github: with or without spaCy, w2v trained on tokens carrying POS/ENTITY tags to find similarities, based on Reddit. “We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”

  2. Example similarity queries:

     >>> model.similarity('fair_game|NOUN', 'game|NOUN')
     0.034977455677555599
     >>> model.similarity('multiplayer_game|NOUN', 'game|NOUN')
     0.54464530644393849
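
The keys used in those queries follow the pattern described in the quote: a merged phrase joined with underscores, then a `|` and the tag. A tiny sketch of building such keys is shown below; the helper name is hypothetical and is not part of the sense2vec API.

```python
def sense_key(phrase, tag):
    """Join a (possibly multi-word) phrase with underscores and append its tag."""
    return "_".join(phrase.split()) + "|" + tag

print(sense_key("fair game", "NOUN"))         # fair_game|NOUN
print(sense_key("multiplayer game", "NOUN"))  # multiplayer_game|NOUN
```

Because the tag is part of the key, the same surface form can get separate vectors for its noun and verb senses.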

SENT2VEC aka “skip-thoughts”

  1. Gensim implementation of sent2vec - usage examples, parallel training, a detailed comparison against gensim doc2vec

USE - Universal sentence encoder

BERT+W2V

PARAGRAPH2Vec

Doc2Vec

  1. Shuffle before training each epoch in d2v in order to fight overfitting
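
The shuffling advice can be sketched as a generic training loop. This is a pattern illustration only: `train_one_epoch` is a hypothetical callback standing in for whatever single-epoch training call the chosen library provides, and no specific doc2vec API is implied.

```python
import random

def train_with_shuffling(documents, train_one_epoch, epochs=3, seed=0):
    """Reshuffle the documents before every epoch, then train on the new order."""
    rng = random.Random(seed)
    docs = list(documents)  # copy so the caller's order is left untouched
    for epoch in range(epochs):
        rng.shuffle(docs)
        train_one_epoch(docs, epoch)  # hypothetical: trains one pass over docs

# Demo with a callback that just prints the order it receives.
train_with_shuffling(
    ["doc0", "doc1", "doc2", "doc3"],
    lambda docs, epoch: print(epoch, docs),
)
```

Presenting the documents in a different order each epoch keeps the model from fitting to one fixed sequence of updates.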
