Foundation NLP

Basic NLP

  1. DL for text classification

    1. Logistic regression with word ngrams

    2. Logistic regression with character ngrams

    3. Logistic regression with word and character ngrams (see the baseline sketch after this list)

    4. Recurrent neural network (bidirectional GRU) without pre-trained embeddings

    5. Recurrent neural network (bidirectional GRU) with GloVe pre-trained embeddings

    6. Multi-channel Convolutional Neural Network

    7. RNN (Bidirectional GRU) + CNN model
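A minimal baseline sketch of item 3 above (not the notebook's code): logistic regression over concatenated word and character TF-IDF ngrams with scikit-learn; the toy texts and labels are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union

# Toy data, just to make the sketch runnable.
texts = ["the movie was great", "terrible plot and awful acting", "great fun", "awful and boring"]
labels = [1, 0, 1, 0]

word_ngrams = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))

model = make_pipeline(
    make_union(word_ngrams, char_ngrams),   # concatenate word and character feature spaces
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["great acting", "boring plot"]))
```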

Chunking

NLP for hackers tutorials

  1. Complete guide for training your own Part-Of-Speech Tagger - using the Penn Treebank tagset. Uses nltk or Stanford POS taggers to label the data, creates features from the actual words (manual stemming, etc.), and trains a random forest with the tags as labels, i.e., builds a POS classifier of our own (see the sketch after this list). Not entirely sure why we need to create a classifier from a “classifier”.

  2. WordNet introduction - POS, lemmatization, synonyms, antonyms, hypernyms, hyponyms.

  3. Sentence similarity using WordNet - uses a cumulative sum over synonym similarities for comparison. Today this is typically replaced by mean-w2v sentence similarity.

  4. Stemmers vs lemmatizers - stemmers are faster; lemmatizers are POS/dictionary based, slower, and convert words to their base form.

  5. Chunking - shallow parsing, as compared to deep parsing; similar to NER.

  6. NER - using nltk chunking as a labeller to train a classifier of our own. Uses IOB features as well as others to create a new NER classifier, which should be better than the original thanks to the additional features. Also uses a new English dataset, GMB.
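A hedged sketch of the idea in item 1: take an already-tagged corpus (the Penn Treebank sample shipped with nltk) as the labels, hand-craft word-level features, and fit an off-the-shelf classifier. The feature set below is illustrative, not the tutorial's exact one.

```python
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

nltk.download("treebank", quiet=True)

def word_features(sentence, i):
    """Simple per-token features; the suffix acts as a crude manual 'stemming' signal."""
    word = sentence[i]
    return {
        "word": word.lower(),
        "suffix3": word[-3:],
        "is_capitalized": word[0].isupper(),
        "is_digit": word.isdigit(),
        "prev_word": sentence[i - 1].lower() if i > 0 else "<START>",
        "next_word": sentence[i + 1].lower() if i < len(sentence) - 1 else "<END>",
    }

X, y = [], []
for tagged_sent in nltk.corpus.treebank.tagged_sents()[:500]:
    words = [w for w, _ in tagged_sent]
    for i, (_, tag) in enumerate(tagged_sent):
        X.append(word_features(words, i))   # features for each token
        y.append(tag)                       # Penn Treebank tag as the label

tagger = make_pipeline(DictVectorizer(sparse=True), RandomForestClassifier(n_estimators=50))
tagger.fit(X, y)
print(tagger.predict([word_features(["The", "cat", "sleeps"], 1)]))
```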

Synonyms

  1. Python module to get meanings, synonyms and more for a given word, using Vocabulary (also a comparison against WordNet) https://vocabulary.readthedocs.io/en/… - a usage sketch follows the list below.

For a given word, using Vocabulary, you can get its

  • Meaning

  • Synonyms

  • Antonyms

  • Part of speech : whether the word is a noun, interjection, adverb, etc.

  • Translate : Translate a phrase from a source language to the desired language.

  • Usage example : a quick example on how to use the word in a sentence

  • Pronunciation

  • Hyphenation : shows the particular stress points (if any)
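A minimal usage sketch, assuming the methods documented for the Vocabulary package (they call external web APIs, so results depend on connectivity and may return False when nothing is found).

```python
from vocabulary.vocabulary import Vocabulary as vb

word = "hypothesis"
print(vb.meaning(word))         # JSON string with meanings, or False if not found
print(vb.synonym(word))
print(vb.antonym(word))
print(vb.part_of_speech(word))
print(vb.usage_example(word))
print(vb.pronunciation(word))
print(vb.hyphenation(word))
print(vb.translate("hello", "en", "es"))  # phrase, source language, target language
```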

Swiss army knife libraries

  1. textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spacy library. With the fundamentals — tokenization, part-of-speech tagging, dependency parsing, etc. — delegated to another library, textacy focuses on the tasks that come before and follow after.
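A small hedged sketch of the textacy workflow described above; the exact API varies between textacy versions, and the spaCy model en_core_web_sm is assumed to be installed.

```python
import textacy
from textacy import extract

text = "Textacy delegates tokenization and tagging to spaCy, then adds utilities on top."

# spaCy does the fundamentals; textacy wraps the result with extra helpers.
doc = textacy.make_spacy_doc(text, lang="en_core_web_sm")

print(list(extract.ngrams(doc, 2, filter_stops=True)))  # bigrams, stopwords filtered
print(list(extract.entities(doc)))                      # named entities
```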

Collocation

  1. What is collocation? - “the habitual juxtaposition of a particular word with another word or words with a frequency greater than chance.” Medium tutorial, quite good, comparing freq/t-test/PMI/chi2, with GitHub code.

  2. A website dedicated to collocations, methods, references, metrics.

  3. Text2vec in R - has ideas on how to use collocations for downstream tasks (LDA, W2V, etc.); also explains PMI and other metrics. Note that gensim's metric is unsupervised and probabilistic.

  4. A blog post about keeping or removing stopwords for collocation; useful but no firm conclusion. IMO we should remove them beforehand.

  5. A blog post with code for using nltk-based collocation.

  6. Small code example for using nltk collocation (see the sketch after this list).

  7. Another code / scoring example for nltk collocation.

  8. Jupyter notebook on manually finding collocations - not useful.

  9. Paper: Ngram2Vec - GitHub. “We introduce ngrams into four representation methods. The experimental results demonstrate ngrams’ effectiveness for learning improved word representations. In addition, we find that the trained ngram embeddings are able to reflect their semantic meanings and syntactic patterns. To alleviate the costs brought by ngrams, we propose a novel way of building the co-occurrence matrix, enabling the ngram-based models to run on cheap hardware.”
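The nltk collocation sketch referenced in item 6: score bigrams by PMI and by raw frequency using BigramCollocationFinder; the corpus and frequency filter here are illustrative choices, not from the linked posts.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("genesis", quiet=True)

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words("english-web.txt"))
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times

print(finder.nbest(measures.pmi, 10))       # top collocations by PMI
print(finder.nbest(measures.raw_freq, 10))  # top collocations by raw frequency
```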

Language detection

  1. Using Google lang detect - 55 languages: af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn, zh-tw
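A minimal sketch assuming the langdetect package (a Python port of Google's language-detection library); fixing the seed makes its probabilistic output repeatable.

```python
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

print(detect("Ceci est un petit texte en français."))      # e.g. 'fr'
print(detect_langs("Este es un texto corto en español."))  # list of language:probability pairs
```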

Stemming

How to measure a stemmer?

  1. References [1, 2 (Apr '11), 3 (Index Compression Factor, ICF), 4, 5]

Phrase modelling

  1. Phrase Modeling - using gensim and spacy

Phrase modeling is another approach to learning combinations of tokens that together represent meaningful multi-word concepts. We can develop phrase models by looping over the words in our reviews and looking for words that co-occur (i.e., appear one after another) much more frequently than you would expect them to by random chance. The formula our phrase models will use to determine whether two tokens A and B constitute a phrase is:

$$\frac{\big(\mathrm{count}(AB) - \mathrm{count}_{\min}\big) \times N}{\mathrm{count}(A) \times \mathrm{count}(B)} > \mathrm{threshold}$$
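A hedged gensim sketch of the rule above: Phrases exposes min_count and threshold and, by default, scores candidate bigrams with essentially this formula; the corpus and the low parameter values below are just for illustration.

```python
from gensim.models.phrases import Phrases, Phraser

sentences = [
    ["new", "york", "is", "a", "big", "city"],
    ["she", "moved", "to", "new", "york", "last", "year"],
    ["new", "york", "pizza", "is", "famous"],
]

# min_count and threshold map onto count_min and threshold in the formula above.
bigram = Phrases(sentences, min_count=1, threshold=0.1)
bigram_phraser = Phraser(bigram)  # lighter, frozen version used for transformation

print(bigram_phraser[["i", "love", "new", "york"]])  # expected: ['i', 'love', 'new_york']
```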

Document classification

Hebrew NLP tools

  1. HebMorph - last updated 7 years ago.

Semantic roles
