Language Detection Identification Generation (NLD, NLI, NLG)


Neural Language Models

NEURAL LANGUAGE GENERATION

  1. Word-based vs. char-based: word-based LMs display higher accuracy and lower computational cost than char-based LMs. However, char-based RNN LMs better model languages with rich morphology, such as Finnish, Turkish, or Russian; using word-based RNN LMs to model such languages is difficult, if possible at all, and is not advised. Char-based RNN LMs can mimic grammatically correct sequences for a wide range of languages, but they require a bigger hidden layer and are computationally more expensive, while word-based RNN LMs train faster and generate more coherent texts. Even so, these generated texts are far from making actual sense.
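To make the char-based side of this comparison concrete, here is a minimal sketch of a char-level LSTM language model in Keras. The toy corpus, layer sizes, and sequence length are illustrative assumptions, not settings taken from any of the linked posts:

```python
# Minimal char-level LSTM language model (illustrative sketch).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

text = "the quick brown fox jumps over the lazy dog. " * 200  # toy corpus
chars = sorted(set(text))
char2idx = {c: i for i, c in enumerate(chars)}

seq_len = 40  # characters of context used to predict the next character
X = np.array([[char2idx[c] for c in text[i:i + seq_len]]
              for i in range(len(text) - seq_len)])
y = np.array([char2idx[text[i + seq_len]]
              for i in range(len(text) - seq_len)])

model = keras.Sequential([
    layers.Embedding(len(chars), 32),   # one embedding row per character
    layers.LSTM(128),                   # char models need a bigger hidden layer
    layers.Dense(len(chars), activation="softmax"),  # next-char distribution
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, batch_size=64, epochs=3)
```

A word-based variant is identical except that the vocabulary is built over tokens instead of characters, which shortens the sequences but can blow up the embedding table for morphologically rich languages, since every inflected form of a Finnish or Turkish word becomes its own row.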

LANGUAGE DETECTION / IDENTIFICATION

The main tools in this space are Google's CLD2 and CLD3 (with Python bindings such as PYCLD and POLYGLOT, which wraps CLD), fastText's language-identification models, langid.py, OpenNLP, and the cloud APIs from Google, Microsoft Azure, IBM Watson, and Amazon; the corresponding links are collected at the end of this page. Beware: one of the linked comparisons is based on only 200 samples per language!

Full results for every language tested are in a table at the end of the blog post. From them, the following conclusions can be drawn:

  • all detectors are equally good on some languages, such as Japanese, Chinese, Vietnamese, Greek, Arabic, Farsi, and Georgian - for these, detection accuracy is between 98% and 100%;

  • CLD is much better at detecting "rare" languages, especially ones that are similar to more frequently used languages - Afrikaans vs. Dutch, Azerbaijani vs. Turkish, Malay vs. Indonesian, Nepali vs. Hindi, Russian vs. Bulgarian, etc. (this could be a result of imbalance in the training data - I need to check the source dataset);

  • for "major" languages not mentioned above (English, French, German, Spanish, Portuguese, Dutch), the fastText results are much better than CLD's, and in many cases better than langid.py's & OpenNLP's;

  • for many languages, results for the "compressed" fastText model are slightly worse than results from the "full" model (mostly by only 1-2%, but the gap can be higher - for Kazakh the difference is 33%), but there are languages where the situation is reversed and the compressed results are slightly better than the full ones (for example, German or Dutch);

  • OpenNLP has many misclassifications among Cyrillic-script languages - Russian/Ukrainian, etc.

Here are some additional notes about differences in detector behavior that I observed while analyzing the results:

  • fastText is more reliable than CLD on short texts;

  • fastText models & langid.py detect the language as Hebrew, while CLD reports it as Jewish; similarly, CLD uses the obsolete code 'in' for Indonesian instead of the standard 'id' used by fastText & langid.py;

  • fastText distinguishes between the Cyrillic- & Latin-based versions of Serbian;

  • CLD tends to incorporate geographical & personal names into its detection results - for example, a blog post in German about travel to Iceland is detected as Icelandic, while fastText detects it as German;

  • in extended detection mode, CLD tends to select the rarer language - e.g., Galician or Catalan over Spanish, or Serbian instead of Russian;

  • OpenNLP isn't very good at detecting the language of short texts.

The models released by the fastText team provide a very good alternative to existing language-detection tools such as Google's CLD & langid.py - for most "popular" languages, these models provide higher detection accuracy than the other tools, combined with high detection speed (a drawback of langid.py). Even with the "compressed" model it's possible to reach good detection accuracy, although for some less frequently used languages, CLD & langid.py may show better results.

Performance-wise, langid.py is much slower than both CLD & fastText. On average, CLD requires 0.5-1 ms to perform language detection; for fastText & langid.py I don't have precise numbers yet, only approximations based on the execution speed of the corresponding programs.

Rafael Oliveira posted on FB a simple diagram that shows which languages are detected better by CLD and which are better handled by fastText.
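As a rough illustration of how these three detectors are called in practice, here is a minimal sketch. It assumes pip-installed fasttext, langid, and pycld2 packages, and that fastText's published lid.176.bin LID model has been downloaded locally:

```python
# Comparing three language detectors on the same input (illustrative sketch).
import fasttext   # pip install fasttext
import langid     # pip install langid
import pycld2     # pip install pycld2

text = "Das Wetter in Reykjavik war fantastisch."

# fastText: load the published LID model and predict a label + probability.
ft_model = fasttext.load_model("lid.176.bin")
labels, probs = ft_model.predict(text)
print("fastText:", labels[0].replace("__label__", ""), round(float(probs[0]), 3))

# langid.py: returns (language code, score).
print("langid:", langid.classify(text))

# CLD2: returns a reliability flag, bytes analyzed, and top-3 language details.
is_reliable, _, details = pycld2.detect(text)
print("CLD2:", details[0])
```

All three should agree on 'de' here; the differences described above only start to matter on short texts and on closely related language pairs.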


LANGUAGE TRANSLATION

“[BLEU] looks at the presence or absence of particular words, as well as the ordering and the degree of distortion—how much they actually are separated in the output.”

BLEU’s evaluation system requires two inputs: (i) a numerical translation closeness metric, which is then assigned and measured against (ii) a corpus of human reference translations.

The result is typically measured on a 0-to-1 scale, with 1 being the hypothetical "perfect" translation. However, since the human reference against which MT is measured is always made up of multiple translations, even a human translation would not score a 1. Sometimes the score is expressed multiplied by 100 or, as in the case of Google mentioned above, by 10.

A BLEU score offers more of an intuitive than an absolute meaning and is best used for relative judgments: “If we get a BLEU score of 35 (out of 100), it seems okay, but it actually has no correlation to the quality of the output in any meaningful sense. If it’s less than 15, we can probably safely say it’s very bad. If it’s greater than 60, we probably have some mistake in our testing! So it will generally fall in there.”

“Typically, if you have multiple [human translation] references, the BLEU score tends to be higher. So if you hear a very large BLEU score—someone gives you a value that seems very high—you can ask them if there are multiple references being used; because, then, that is the reason that the score is actually higher.”
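The multi-reference effect is easy to demonstrate. Below is a minimal sketch using the sacreBLEU package mentioned later on this page; the toy sentences are made up for illustration:

```python
# Corpus BLEU with one vs. two reference sets (illustrative sketch).
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat"]

one_ref = [["the cat was sitting on the mat"]]      # a single reference stream
two_refs = [["the cat was sitting on the mat"],
            ["the cat sat on the mat"]]             # add a second stream

print(sacrebleu.corpus_bleu(hypotheses, one_ref).score)   # lower score
print(sacrebleu.corpus_bleu(hypotheses, two_refs).score)  # higher score
```

Because each hypothesis n-gram is matched against the best of all references, adding references can only raise the modified precisions, which is exactly why unusually high reported scores warrant the question quoted above.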


A review of papers on state-of-the-art methods for neural machine translation.

LASER: Facebook's language-agnostic sentence representations for zero-shot multilingual transfer (see the LASER links at the end of this page).

Stanford CoreNLP language processing - POS / NER / dependency parsing, etc. - for 53 languages.

w2v by gensim

Really good: BLEU averages out various metrics using an n-gram method; n-gram models are probabilistic language models often used in computational linguistics.
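For reference, the standard corpus-level BLEU combines the modified n-gram precisions $p_n$ (usually up to $N = 4$, with uniform weights $w_n = 1/4$) with a brevity penalty based on the candidate length $c$ and reference length $r$:

$$
\text{BLEU} = \text{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\text{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

The brevity penalty is what stops a system from gaming precision by emitting very short outputs.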

BLEU sample

A general talk about FAMG (Facebook, Amazon, Microsoft, Google) and their current research directions, including some info about BLEU scores and the comparison issues with reported BLEU values (it boils down to different, unmentioned parameters).

One proposed solution is sacreBLEU: pip install sacrebleu.

Named entity language transliteration: English to Russian, Hebrew, Arabic, and Japanese, with a dataset and code (paper, blog post, and GitHub).

Links:

Neural language generation:
  • Mastery on word-based language models
  • Using RNN
  • Using language modeling
  • Word-based vs. char-based
  • Medium on char-based models, with code; leads to better grammar
  • Git: Keras language models - char-level, word-level, and sentence-level using a VAE

Language detection / identification:
  • A qualitative comparison of Google, Azure, Amazon, and IBM LD/LI
  • CLD2, CLD3, PYCLD, and POLYGLOT (which wraps CLD)
  • Alex Ott's CLD stats
  • CLD comparison vs. Tika vs. langid
  • fastText LI, and the Facebook post
  • Google Detect Language: GitHub code and v3beta
  • Microsoft Azure LD (2 links)
  • IBM Watson (2 links)
  • Amazon (2 links)
  • Lingua - most accurate for Java, though it doesn't seem accurate enough
  • LD with infinity-gram: 99.1% on a lot of data, plus a benchmark for this 2012 method
  • WiLI dataset for LD, with a comparison of CLD vs. others
  • Comparison of CLD vs. fastText vs. OpenNLP (GitHub)
  • Medium on training LI models
  • A comparison of language-identification approaches
  • LI on code-switched social media
  • Comparing LI methods (covers 6 big languages)
  • Comparing LI techniques
  • Radim Rehurek on LI on the web, extending the dictionary
  • A comparative study of LI methods (2 links)

Language translation:
  • Zero-shot multi-language translation by Facebook (GitHub)
  • How to use LASER, on Medium
  • Using embedding spaces
  • The risk of using BLEU
  • BLEU - what is it, and how is it calculated?