Machine & Deep Learning Compendium
Topics Modeling

Last updated 2 years ago

Misc

  1. for topic modelling

  2. (TopSBM) topic block modeling,

NMF (Non-Negative Matrix Factorization)

  1. Non-negative Matrix factorization (NMF)

  2. NMF (Non-negative Matrix factorization)+ code

LSA (TFIDF + SVD)

  1. Including code and explanation about Dirichlet probability.

LDA (Latent Dirichlet Allocation)

  • A great summation about topic modeling, pros and cons (LSA, pLSA, LDA)

  1. (LDA) Latent Dirichlet Allocation

  2. LDA is already taken by the above algorithm!

  3. NMF is in its general definition the search for 2 matrices W and H such that W*H=V where V is an observed matrix. The only requirement for those matrices is that all their elements must be non negative.

    From the above definitions it is clear that in LDA only bag of words frequency counts can be used since a vector of reals makes no sense. Did we create a word 1.2 times? On the other hand we can use any non negative representation for NMF and in the example tf-idf is used.

    As far as choosing the number of iterations goes, I don't know the stopping criterion for NMF in scikit-learn, although I believe it is the relative improvement of the loss function falling below a threshold, so you'll have to experiment. For LDA I suggest manually checking the improvement of the log likelihood on a held-out validation set and stopping when it falls under a threshold. The rest of the parameters depend heavily on the data, so I suggest, as suggested by @rpd, that you do a parameter search. To sum up, LDA can only generate frequencies and NMF can generate any non-negative matrix.
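The NMF definition above (non-negative W and H with W*H close to V) can be sketched in plain Python. This is a minimal illustration that assumes the classic multiplicative update rules for the squared-error loss and a relative-improvement stopping rule; it is not scikit-learn's solver, and the matrix `V`, the tolerance and the helper names are made up for the example.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def squared_error(V, W, H):
    """Squared reconstruction error between V and W*H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, n_topics, max_iter=500, tol=1e-6, seed=0, eps=1e-12):
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    # Strictly positive random start keeps every later update non-negative.
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    losses = [squared_error(V, W, H)]
    for _ in range(max_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
        losses.append(squared_error(V, W, H))
        # Assumed stopping rule: relative improvement of the loss below tol.
        if losses[-2] - losses[-1] <= tol * max(losses[-2], eps):
            break
    return W, H, losses

# Made-up tf-idf-like matrix: 4 documents x 4 terms, real-valued and non-negative.
V = [[1.2, 0.8, 0.0, 0.0],
     [0.9, 1.1, 0.1, 0.0],
     [0.0, 0.0, 1.5, 0.7],
     [0.0, 0.1, 0.6, 1.3]]
W, H, losses = nmf(V, n_topics=2)
```

Real-valued inputs such as tf-idf weights are fine here because the only constraint is non-negativity, which is the contrast with LDA drawn above.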

    1. It is natively unsupervised; it uses a joint probability method to find topics (the user has to pass the number of topics to the LDA API). If “Doc X word” is the size of the input data, LDA transforms it into 2 matrices:

    2. Doc X topic

    3. Word X topic

    4. Further, if you want, you can feed the “Doc X topic” matrix to a supervised algorithm if labels are given.
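To make the two output matrices concrete, here is a toy sketch that assumes a collapsed Gibbs sampler as the inference method (real libraries may use other algorithms). The corpus, the word ids and the function name `lda_gibbs` are invented for illustration, and the second matrix is returned as topic x word, the transpose of the “Word X topic” layout above.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Toy collapsed Gibbs sampler; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(n_topics)]
    topic_totals = [0] * n_topics

    def update(d, w, k, delta):
        doc_topic_counts[d][k] += delta
        topic_word_counts[k][w] += delta
        topic_totals[k] += delta

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        row = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, row):
            update(d, w, k, +1)
        assignments.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all other tokens.
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic_counts[d][k] + alpha)
                    * (topic_word_counts[k][w] + beta)
                    / (topic_totals[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k_new
                update(d, w, k_new, +1)

    # Smoothed estimates: each row is a probability distribution.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + vocab_size * beta) for c in topic_word_counts[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word

vocab = ["cat", "dog", "pet", "stock", "market", "trade"]
docs = [[0, 1, 2, 2], [1, 2, 0], [3, 4, 5, 5], [4, 3, 5], [0, 2, 1, 2], [5, 3, 4]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=len(vocab))
```

Each row of `doc_topic` can then serve as a low-dimensional feature vector for a supervised model when labels exist.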

    1. Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature

    2. Recommender Systems – Using a similarity measure we can build recommender systems. If our system would recommend articles for readers, it will recommend articles with a topic structure similar to the articles the user has already read.

    3. Uncovering Themes in Texts – Useful for detecting trends in online publications for example

    4. A Form of Tagging - If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.

    1. Alpha and Beta Hyperparameters – alpha represents document-topic density and beta represents topic-word density. With a higher alpha, documents are composed of more topics; with a lower alpha, documents contain fewer topics. Likewise, with a higher beta, topics are composed of a large number of words in the corpus, and with a lower beta, they are composed of few words.

    2. Number of Topics – Number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics by using the Kullback-Leibler divergence score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to the original paper [1] on the use of KL divergence.

    3. Number of Topic Terms – Number of terms in a single topic. It is generally decided according to the requirement: if the problem statement is about extracting themes or concepts, a higher number is recommended; if it is about extracting features or terms, a low number is recommended.

    4. Number of Iterations / Passes – Maximum number of iterations allowed for the LDA algorithm to converge.
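The effect of alpha on document-topic density can be seen by sampling from a symmetric Dirichlet prior. The sketch below uses only the standard library (normalised gamma draws); the values 0.1 and 10.0, the topic count and the helper names are arbitrary choices for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [x / total for x in draws]

def mean_max_share(alpha, k=10, n_samples=2000, seed=0):
    """Average share of the single largest topic across sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples)) / n_samples

sparse = mean_max_share(0.1)   # low alpha: most mass on a few topics
dense = mean_max_share(10.0)   # high alpha: mass spread over many topics
```

With a low alpha the largest topic dominates each sampled mixture, while a high alpha gives mixtures close to uniform. The same reasoning applies to beta and the topic-word distributions.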

  4. Ways to improve LDA:

    1. Reduce dimensionality of document-term matrix

    2. Frequency filter

    3. POS filter

    4. Batch wise LDA
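As a rough sketch of the frequency filter mentioned above, the snippet below drops terms that appear in too few or too many documents before the document-term matrix is built. The thresholds and the function name are illustrative assumptions, not values recommended by the source.

```python
from collections import Counter

def frequency_filter(docs, min_docs=2, max_doc_ratio=0.5):
    """Keep terms found in at least min_docs documents and at most max_doc_ratio of them."""
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, df in doc_freq.items()
            if df >= min_docs and df / n_docs <= max_doc_ratio}
    return [[t for t in doc if t in keep] for doc in docs]

docs = [
    ["the", "cat", "sat"],
    ["the", "cat", "purred"],
    ["the", "market", "fell"],
    ["the", "market", "rose"],
]
filtered = frequency_filter(docs)
# filtered == [["cat"], ["cat"], ["market"], ["market"]]
```

Here "the" is removed for being in every document and the one-off words are removed for being too rare, which shrinks the vocabulary that LDA has to model.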

Mallet LDA

    1. With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.

    2. The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.

Visualization

  1. How to interpret topics using pyLDAvis: Let’s interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider:

    1. Larger topics are more frequent in the corpus.

    2. Topics closer together are more similar, topics further apart are less similar.

    3. When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.

    4. Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.

      1. On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)

      2. The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.

      3. An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.

      4. On the right, there is a bar chart showing top terms.

      5. When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.

      6. When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.

        1. Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.

        2. Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic, i.e., terms that occur only in this topic, and do not occur in other topics.

        3. Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.

        4. Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
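The relevance slider can be illustrated with a few lines of Python. This sketch assumes the relevance definition from the LDAvis paper, λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)); the probabilities below are invented, and the function names are not part of pyLDAvis.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Weighted mix of within-topic log probability and log lift."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

def rank_terms(topic_probs, corpus_probs, lam):
    """Sort terms of one topic from most to least relevant."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )

# Made-up numbers: p(w | topic) and overall p(w) for three terms.
topic_probs = {"data": 0.30, "model": 0.25, "dirichlet": 0.05}
corpus_probs = {"data": 0.20, "model": 0.10, "dirichlet": 0.006}

by_probability = rank_terms(topic_probs, corpus_probs, lam=1.0)
# ['data', 'model', 'dirichlet']
by_exclusivity = rank_terms(topic_probs, corpus_probs, lam=0.0)
# ['dirichlet', 'model', 'data']
```

At λ = 1 the frequent but generic term wins, and at λ = 0 the rare term that is almost exclusive to the topic moves to the top, which is the trade-off the slider exposes.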

COHERENCE (Topic)

Conclusion: The results of the first experiment show that using the one-any, any-any and one-all coherences directly for optimization leads to meaningful word sets. The second experiment shows that these coherence measures are able to outperform the UCI coherence as well as the UMass coherence on these generated word sets. For evaluating LDA topics, the any-any and one-any coherences perform slightly better than the UCI coherence. The correlation of the UMass coherence and the human ratings is not as high as for the other coherences.

LDA2VEC

TOP2VEC

This algorithm takes a group of documents (anything that is made up of text), and returns a number of topics (which are made up of a number of words) most relevant to these documents.

In case LDA groups together two topics, we can influence the algorithm in a way that makes those two topics separable -

, used this to build my own classes - using gensim mallet wrapper, doesn't work on pyLDAviz, so use to fix it

- using tfidf matrix as input!

,

One of the best explanation about - tf for lda, tfidf for nmf, but tfidf can be used for top k selection in lda + visualization,

LDA is a probabilistic generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.
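The generative story described above can be written out directly. In this sketch the document-topic mixture and the topic-word distributions are fixed, made-up values; in full LDA they would themselves be drawn from Dirichlet priors.

```python
import random
from collections import Counter

def generate_document(doc_topic, topic_word, vocab, n_words, rng):
    """Sample a topic per word, then a word from that topic; return a bag of words."""
    topic_ids = list(range(len(doc_topic)))
    words = []
    for _ in range(n_words):
        k = rng.choices(topic_ids, weights=doc_topic)[0]
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return Counter(words)

vocab = ["cat", "dog", "pet", "stock", "market", "trade"]
topic_word = [
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # hypothetical "pets" topic
    [0.00, 0.00, 0.00, 0.30, 0.40, 0.30],  # hypothetical "finance" topic
]
doc_topic = [0.8, 0.2]  # this document is mostly about pets

bag = generate_document(doc_topic, topic_word, vocab, n_words=20, rng=random.Random(0))
```

Word order is discarded in the final `Counter`, which is what "bag of words" means here.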

1. Variance score the transformation and inverse transformation of data, test for 1,2,3,4 PCs/LDs/NMs.

Medium on , explains the random probabilistic nature of LDA

Machinelearningplus on - a great read, don't forget to read the article.

Medium on , high level theoretical - not clear

Medium on , some historical reference and general high level how to use examples.

on LDA grid search params and about LDA expectations. Must read.

, talks about the sampling from a distribution of distributions in LDA

- has some text about overfitting - undiscussed in many places.

a, not okay and okay, respectively. Due to how we measure the metrics, ie., read the formulas. and

LDA as

Jupyter notebook on - missing code?

for kmeans, lda, svd,nmf comparison - advice is to keep nmf or other as a baseline to measure against LDA.

with

Selecting the number of topics in LDA, , , , , , , , un, , , , , , + gh code,

- switching from LDA to a variation of it that is guided by the researcher / data

Medium on lda - ,

, ,

,

The best topic modelling explanation including , insights, a great read, with code - shows how to find similar docs by topic in gensim, and shows how to transform unseen documents and do similarity using sklearn:

- Sometimes LDA can also be used as feature selection technique. Take an example of text classification problem where the training data contain category wise documents. If LDA is running on sets of category wise documents. Followed by removing common topic terms across the results of different categories will give the best features for a category.
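The feature-selection idea above reduces to a set operation once the per-category topic terms are available. The sketch assumes LDA has already been run separately on each category and that its top terms were collected into a dictionary; the categories and terms below are invented.

```python
from collections import Counter

def category_specific_terms(topic_terms_by_category):
    """Drop terms that show up in the topic terms of more than one category."""
    counts = Counter(
        term
        for terms in topic_terms_by_category.values()
        for term in set(terms)
    )
    return {
        category: [t for t in terms if counts[t] == 1]
        for category, terms in topic_terms_by_category.items()
    }

# Hypothetical top topic terms per category.
topic_terms = {
    "sports": ["game", "team", "year", "score"],
    "finance": ["market", "year", "stock", "game"],
}
features = category_specific_terms(topic_terms)
# {'sports': ['team', 'score'], 'finance': ['market', 'stock']}
```

The surviving terms are the ones that separate the categories and can be used as features for the classifier.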

, including algorithm, parameters!! And Parameters of LDA

- by the French guy

- has a very good simple example with probabilities

Code:

Great article:

, and using (using clustering for getting group of sentences in each topic)

lda nmf svd, using umass and uci coherence measures

*** code

Paper: , says LDA vs NMI (NMF?) and using coherence to analyze

LDADE's tunings dramatically reduces topic instability.

github code

(didnt read) NTM - with github code

- The inference algorithms in Mallet and Gensim are indeed different. Mallet uses Gibbs sampling, which is more precise than Gensim's faster, online Variational Bayes. There is a way to get relatively similar performance by increasing the number of passes.

Alpha beta in mallet:

The default for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.
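A small worked example of the "pseudo-words" intuition: with five topics and a total prior weight of 5.0, each topic gets one pseudo-word. The sketch below uses the standard smoothed estimate (count + alpha) / (length + total alpha); the counts and the function name are made up.

```python
def smoothed_topic_mix(topic_counts, alpha_sum=5.0):
    """Estimate a document's topic mixture with a symmetric prior of total weight alpha_sum."""
    n_topics = len(topic_counts)
    alpha = alpha_sum / n_topics  # per-topic prior weight
    n_words = sum(topic_counts)
    return [(c + alpha) / (n_words + alpha_sum) for c in topic_counts]

# Both documents assign every word to topic 0; only their length differs.
short_doc = smoothed_topic_mix([2, 0, 0, 0, 0])     # topic 0 share = 3/7, about 0.43
long_doc = smoothed_topic_mix([200, 0, 0, 0, 0])    # topic 0 share = 201/205, about 0.98
```

The two-word document stays close to the uniform prior, while the 200-word document moves almost entirely to the observed topic.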

****

by spacy. There are a lot of moving parts in the visualization. Here's a brief summary:

The plot is rendered in two dimensions according to a multidimensional scaling (MDS) algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.

A more detailed explanation of the pyLDAvis visualization can be found here. Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.

Presentation:

-> ,

,

Paper:

Paper: umass, uci, npmi, cv, cp etc.

Paper:

Paper:

Paper:

Stackexchange:

Paper: - perplexity needs unseen data, coherence doesn't

lda lda-u btm w2vgmm

Paper:

Paper:

Paper:

Paper:

Paper:

Paper: - Abstract: Topic models extract representative word sets—called topics—from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable, therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.

Code: - To conclude, there are many other approaches to evaluating topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. The topic coherence measure is a good way to compare different topic models based on their human interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score.
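To show what a coherence score measures, here is a minimal sketch of a UMass-style coherence computed from document co-occurrence counts. It assumes the pairwise form log((D(w_m, w_l) + 1) / D(w_l)) summed over ordered pairs of top words; library implementations may normalise or aggregate differently, and the corpus is a toy one.

```python
import math

def umass_coherence(top_words, docs):
    """Sum of log((co-document count + 1) / document count) over ordered word pairs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # Assumes every top word occurs in at least one document.
            score += math.log(
                (doc_count(top_words[m], top_words[l]) + 1) / doc_count(top_words[l])
            )
    return score

docs = [
    ["cat", "dog", "pet"],
    ["cat", "pet", "vet"],
    ["dog", "pet", "vet"],
    ["stock", "market", "trade"],
    ["stock", "trade", "bank"],
    ["market", "bank", "trade"],
]
coherent = umass_coherence(["pet", "cat", "dog"], docs)
mixed = umass_coherence(["pet", "stock", "dog"], docs)
```

Words that tend to occur in the same documents score higher (closer to zero) than a word set that mixes unrelated themes, so `coherent` is larger than `mixed` on this corpus.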

Formulas:

Presentation , , , Papers: , , , ,

-

then - - “In your data we can see that there is a peak between 0-100 and a peak between 400-500. What I would think in this case is that "does ~480 topics make sense for the kind of data I have?" If not, you can just do an np.argmax for 0-100 topics and trade-off coherence score for simpler understanding. Otherwise just do an np.argmax on the full set.”
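The advice quoted above amounts to taking an argmax over coherence scores, optionally restricted to a range of topic counts that still makes sense for the data. A plain-Python equivalent of that `np.argmax` step, with invented scores and a hypothetical helper name:

```python
def best_num_topics(scores, max_topics=None):
    """Return the topic count with the highest coherence, optionally capped at max_topics."""
    candidates = {
        k: score
        for k, score in scores.items()
        if max_topics is None or k <= max_topics
    }
    return max(candidates, key=candidates.get)

# Made-up coherence scores keyed by number of topics.
coherence_by_k = {10: 0.41, 50: 0.52, 100: 0.47, 300: 0.44, 480: 0.55}

overall_best = best_num_topics(coherence_by_k)                    # 480
interpretable_best = best_num_topics(coherence_by_k, max_topics=100)  # 50
```

Capping the search trades a slightly lower coherence score for a model that is easier to read.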

- really good

Topic stability Metric, a novel method, compared against jaccard, spearman, silhouette.:

“if you want to rework your own topic models that, say, jointly correlate an article’s topics with votes or predict topics over users then you might be interested in .”

- I just learned about these papers which are quite similar: and .

(excellent read)

+

,

, ,

, ,

Youtube:

on jupyter

,

Topic modeling with DistilBERT , !, c-tfidf, umap, hdbscan, merging similar topics, visualization,

Word cloud
Topic modeling with sentiment per topic according to the data in the topic
Topsbm
Medium Article about LDA and
Sklearn LDA and NMF for topic modelling
A very good article about LSA (TFIDF X SVD), pLSA, LDA, and LDA2VEC.
Lda2vec code
A descriptive comparison for LSA pLSA and LDA
great summation
Latent Dirichlet allocation (LDA) -
Medium Article about LDA and
Medium article on LDA - a good one with pseudo algorithm and proof
this is called Semi Supervised Guided LDA
LDA tutorials plus code
this
Introduction to LDA topic modelling, really good,
plus git code
Sklearn examples using LDA and NMF
Tutorial on lda/nmf on medium
Gensim and sklearn LDA variants, comparison
python 3
Medium article on lda/nmf with code
Tf-idf vs bow for LDA/NMF
important paper
LDA is a probabilistic
How to measure the variance for LDA and NMF, against PCA.
Matching lda mallet performance with gensim and sklearn lda via hyper parameters
What is LDA?
LDA in sklearn
mallet
LSA pLSA, LDA LDA2vec
Medium on LSI vs LDA vs HDP, HDP wins..
LDA
Incredibly useful response
Lda vs pLSA
BLog post on topic modelling
Perplexity vs coherence on held out unseen data
Also this
this
dimensionality reduction
LDA on alpha and beta to control density of topics
hacknews LDA topic modelling
Jupyter notebook
Gensim on LDA
code
Medium on lda with sklearn
blog 1
blog2
using perplexity
prep and aic bic
coherence
coherence2
coherence 3 with tutorial
clear
unclear with analysis of stopword % inclusion
unread
paper: heuristic approach
elbow method
using cv
Paper: new stability metric
Selecting the top K words in LDA
Presentation: best practices for LDA
Medium on guidedLDA
another introductory
la times
Topic modelling through time
Mallet vs nltk
params
params
Paper: improving feature models
Lda vs w2v (doesn't make sense to compare)
again here
Adding lda features to w2v for classification
Spacy and gensim on 20 news groups
Usages
Topic Modelling for Feature Selection
Another great article about LDA
History of LDA
Multilingual - alpha is divided by topic count, reaffirms 7
Topic modelling with lda and nmf on medium
great for top docs, terms, topics etc.
Many ways of evaluating topics by running LDA
Difference between lda in gensim and sklearn a post on rare
The best code article on LDA/MALLET
sklearn
LDA in gensim, a tutorial by gensim
Lda on medium
What are the pros and cons of LDA and NMF in topic modeling? Under what situations should we choose LDA or NMF? Is there comparison of two techniques in topic modeling?
What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
Exploring Topic Coherence over many models and many topics
Practical topic findings for short sentence text
What's the difference between SVD/NMF and LDA as topic model algorithms essentially? Deterministic vs prob based
What is the difference between NMF and LDA? Why are the priors of LDA sparse-induced?
What are the relationships among NMF, tensor factorization, deep learning, topic modeling, etc.?
Code: lda nmf
Unread a comparison of lda and nmf
Presentation: lda sparse coding matrix factorization
An experimental comparison between NMF and LDA for active cross-situational object-word learning
Topic coherence in gensim with jupyter code
Topic modelling dynamic presentation
Topic modelling and event identification from twitter data
Just another medium article about TM
What is Wrong with Topic Modeling? (and How to Fix it Using Search-based SE)
Talk about topic modelling
Intro to topic modelling
Detecting topics in twitter
Another topic model tutorial
neural topic modeling using embedded spaces
Another lda tutorial
Comparing tweets using lda
Lda and w2v as features for some classification task
Improving TM with embeddings
w2v/doc2v for topic clustering - need to see the code to understand how they got clean topics, i assume a human rewrote it
Diff between lda and mallet
Mallet in gensim blog post
contribution
The default for alpha is 5.
pyLDAviz paper***!
pyLDAviz - what am i looking at ?
multidimensional scaling (MDS)
here
Youtube on LDAvis explained
More visualization options including ldavis
A pointer to the ldaviz fix
fix
git code
What is?
Wiki on pmi
Datacamp on coherence metrics, a comparison, read me.
explains what is coherence
Umass vs C_v, what are the diff?
Exploring the Space of Topic Coherence Measures
Automatic evaluation of topic coherence
exploring the space of topic coherence methods
Relation between mutual information / entropy and pmi
coherence / pmi how to calc
Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality
Evaluation of topic modelling techniques for twitter
Topic coherence measures
topic modelling from different domains
Optimizing Semantic Coherence in Topic Models
L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization
Content matching between TV shows and advertisements through Latent Dirichlet Allocation
Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation
Evaluating topic coherence
Evaluating topic coherence, using gensim umass or cv parameter
Inferring the number of topics for gensim's LDA - perplexity, CM, AIC, and BIC
Perplexity as a measure for LDA
Finding number of topics using perplexity
Coherence for tweets
Twitter DLA
tweet pooling improvements
hierarchical summarization of tweets
twitter LDA in java
on github
TM of twitter timeline
in twitter aggregation by conversation
twitter topics using LDA
empirical study
Using regularization to improve PMI score and in turn coherence for LDA topics
Improving model precision - coherence using turkers for LDA
Gensim
paper about their algorithm and PMI/UCI etc.
Advice for coherence,
Good vs bad model (50 vs 1 iterations) measuring u_mass coherence
2nd code
Diff term weighting schemas for topic modeling, code plus paper
Workaround for pyLDAvis using LDA-Mallet
pyLDAvis paper
Visualizing LDA topics results
Visualizing trends, topics, sentiment, heat maps, entities
Measuring LDA Topic Stability from Clusters of Replicated Runs
lda2vec
Datacamp intro
Original blog
Gaussian LDA for Topic Word Embeddings
Nonparametric Spherical Topic Modeling with Word Embeddings
Moody’s Slide Share
Docs
Original Git
Excellent notebook example
Tf implementation
another more recent one tf 1.5
Another blog explaining about lda etc
post
post
Lda2vec in tf
tf 1.5
Comparing lda2vec to lda
lda/doc2vec with pca examples
Example on gh
Git
paper
on medium
BERTopic
BERTopic (same method as the above)
Medium with the same general method
new way of modeling topics
LDA
UCI vs UMASS