Topic Modeling
for topic modelling
(TopSBM) topic block modeling,
Non-negative Matrix Factorization (NMF)
NMF (Non-negative Matrix Factorization) + code
Including code and explanation about Dirichlet probability.
About topic modeling: pros and cons (LSA, pLSA, LDA)
(LDA) Latent Dirichlet Allocation
LDA is already taken by the above algorithm!
NMF, in its general definition, is the search for two matrices W and H such that W*H = V, where V is an observed matrix. The only requirement is that all elements of W and H must be non-negative.
From the above definitions it is clear that in LDA only bag-of-words frequency counts can be used, since a vector of reals makes no sense (did we create a word 1.2 times?). On the other hand, NMF accepts any non-negative representation, and in the example tf-idf is used.
As far as choosing the number of iterations: for NMF in scikit-learn I don't know the exact stopping criterion, although I believe it is the relative improvement of the loss function falling below a threshold, so you'll have to experiment. For LDA I suggest manually checking the improvement of the log likelihood on a held-out validation set and stopping when it falls below a threshold. The rest of the parameters depend heavily on the data, so I suggest, as @rpd did, that you do a parameter search. To sum up: LDA can only generate frequencies, while NMF can generate any non-negative matrix.
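A minimal sketch of the practical upshot, using scikit-learn on a toy corpus: raw counts for LDA, tf-idf for NMF (documents and component counts are illustrative only).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = ["the cat sat on the mat", "dogs and cats are pets", "stock markets fell today"]

# LDA: only bag-of-words frequency counts make sense as input
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# NMF: any non-negative matrix works, tf-idf being the usual choice (V ~ W * H)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(tfidf)
W, H = nmf.transform(tfidf), nmf.components_   # both factors are non-negative
```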
LDA is natively unsupervised; it uses a joint probability method to find topics (the user has to pass the number of topics to the LDA API). If "Doc X word" is the shape of the input data to LDA, it transforms it into two matrices:
Doc X topic
Word X topic
Further, if labels are available, you can feed the "Doc X topic" matrix to a supervised algorithm (a minimal sketch follows).
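A hedged sketch of that pipeline in scikit-learn (toy documents and labels, purely illustrative): LDA produces the Doc X topic matrix, which then feeds a classifier when labels exist.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

docs = ["cats and dogs are pets", "the stock market fell",
        "dogs love long walks", "markets rose today"]
labels = [0, 1, 0, 1]                                            # optional labels

X = CountVectorizer(stop_words="english").fit_transform(docs)    # Doc X word
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                                 # Doc X topic
word_topic = lda.components_.T                                   # Word X topic

clf = LogisticRegression().fit(doc_topic, labels)                # supervised step, if labels exist
```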
Text classification – Topic modeling can improve classification by grouping similar words together in topics rather than using each word as a feature
Recommender Systems – Using a similarity measure we can build recommender systems. If our system recommends articles to readers, it will recommend articles with a topic structure similar to the articles the user has already read.
Uncovering Themes in Texts – Useful for detecting trends in online publications for example
A Form of Tagging - If document classification is assigning a single category to a text, topic modeling is assigning multiple tags to a text. A human expert can label the resulting topics with human-readable labels and use different heuristics to convert the weighted topics to a set of tags.
Alpha and Beta Hyperparameters – alpha represents document-topic density and beta represents topic-word density. The higher the value of alpha, the more topics documents are composed of; the lower the value of alpha, the fewer topics documents contain. Likewise, a higher beta means topics are composed of a larger number of words from the corpus, while a lower beta means they are composed of few words.
Number of Topics – the number of topics to be extracted from the corpus. Researchers have developed approaches to obtain an optimal number of topics using the Kullback-Leibler divergence score. I will not discuss this in detail, as it is too mathematical. For understanding, one can refer to this[1] original paper on the use of KL divergence.
Number of Topic Terms – the number of terms composing a single topic. It is generally decided according to the requirement: if the problem is about extracting themes or concepts, a higher number is recommended; if it is about extracting features or terms, a low number is recommended.
Number of Iterations / Passes – the maximum number of iterations allowed to the LDA algorithm for convergence (a gensim sketch of these parameters follows this list).
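A hedged sketch of setting these knobs in gensim (`eta` is gensim's name for beta; the tiny corpus exists only to make the snippet self-contained):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["cat", "dog", "pet"], ["stock", "market", "fell"], ["dog", "park", "walk"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,      # number of topics
    alpha=0.1,         # lower alpha -> fewer topics per document
    eta=0.01,          # plays the role of beta; lower -> fewer words per topic
    passes=10,         # passes over the corpus
    iterations=400,    # cap on iterations for convergence
)
print(lda.show_topics(num_words=5))   # number of topic terms to inspect
```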
Ways to improve LDA:
Reduce the dimensionality of the document-term matrix
Frequency filter
POS filter (both filters are sketched below)
Batch-wise LDA
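A rough sketch of the frequency and POS filters from the list above, assuming NLTK's tokenizer/tagger data is available and using gensim's `filter_extremes` for the frequency cut:

```python
import nltk
from gensim.corpora import Dictionary

docs = ["the quick brown fox jumps over the lazy dog", "stock markets fell sharply today"]

# POS filter: keep nouns and adjectives only
kept = []
for doc in docs:
    tagged = nltk.pos_tag(nltk.word_tokenize(doc.lower()))
    kept.append([w for w, tag in tagged if tag.startswith(("NN", "JJ"))])

# Frequency filter: drop very rare and very common terms
dictionary = Dictionary(kept)
dictionary.filter_extremes(no_below=1, no_above=0.5, keep_n=50000)
corpus = [dictionary.doc2bow(t) for t in kept]
```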
With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.
The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.
How to interpret topics using pyLDAvis: let's interpret the topic visualization. Notice how topics are shown on the left while words are on the right. Here are the main things you should consider (a minimal pyLDAvis call is sketched after this list):
Larger topics are more frequent in the corpus.
Topics closer together are more similar, topics further apart are less similar.
When you select a topic, you can see the most representative words for the selected topic. This measure can be a combination of how frequent or how discriminant the word is. You can adjust the weight of each property using the slider.
Hovering over a word will adjust the topic sizes according to how representative the word is for the topic.
On the left, there is a plot of the "distance" between all of the topics (labeled as the Intertopic Distance Map)
The relative size of a topic's circle in the plot corresponds to the relative frequency of the topic in the corpus.
An individual topic may be selected for closer scrutiny by clicking on its circle, or entering its number in the "selected topic" box in the upper-left.
On the right, there is a bar chart showing top terms.
When no topic is selected in the plot on the left, the bar chart shows the top-30 most "salient" terms in the corpus. A term's saliency is a measure of both how frequent the term is in the corpus and how "distinctive" it is in distinguishing between different topics.
When a particular topic is selected, the bar chart changes to show the top-30 most "relevant" terms for the selected topic. The relevance metric is controlled by the parameter λ, which can be adjusted with a slider above the bar chart.
Setting the λ parameter close to 1.0 (the default) will rank the terms solely according to their probability within the topic.
Setting λ close to 0.0 will rank the terms solely according to their "distinctiveness" or "exclusivity" within the topic — i.e., terms that occur only in this topic, and do not occur in other topics.
Setting λ to values between 0.0 and 1.0 will result in an intermediate ranking, weighting term probability and exclusivity accordingly.
Rolling the mouse over a term in the bar chart on the right will cause the topic circles to resize in the plot on the left, to show the strength of the relationship between the topics and the selected term.
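A minimal sketch of producing this visualization with pyLDAvis. The import path is `pyLDAvis.gensim` in older releases and `pyLDAvis.gensim_models` in newer ones; `lda`, `corpus`, and `dictionary` are the objects from the gensim sketch above.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis   # pyLDAvis.gensim in older versions

vis = gensimvis.prepare(lda, corpus, dictionary)   # intertopic distance map + term bar chart
pyLDAvis.save_html(vis, "lda_vis.html")            # the lambda slider controls the relevance ranking
```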
Conclusion: the results of the first experiment show that using the one-any, any-any and one-all coherences directly for optimization leads to meaningful word sets. The second experiment shows that these coherence measures are able to outperform the UCI coherence as well as the UMass coherence on these generated word sets. For evaluating LDA topics, the any-any and one-any coherences perform slightly better than the UCI coherence. The correlation of the UMass coherence with the human ratings is not as high as for the other coherences.
This algorithm takes a group of documents (anything made up of text) and returns a number of topics (each made up of a number of words) most relevant to these documents.
NMF (Non-negative Matrix Factorization) + code
In case LDA groups two topics together, we can influence the algorithm in a way that makes those two topics separable -
, used this to build my own classes - using the gensim Mallet wrapper; it doesn't work with pyLDAvis, so use to fix it
- using tfidf matrix as input!
One of the best explanations about - tf for LDA, tf-idf for NMF, but tf-idf can be used for top-k selection in LDA + visualization,
A generative model that generates documents by sampling a topic for each word and then a word from the sampled topic. The generated document is represented as a bag of words.
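A toy numpy sketch (not any linked article's code) of that generative story: sample a document-topic distribution, then for each word position sample a topic and a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "stock", "market", "walk"]
K, alpha, beta = 2, 0.5, 0.1

phi = rng.dirichlet([beta] * len(vocab), size=K)   # topic-word distributions
theta = rng.dirichlet([alpha] * K)                 # document-topic distribution

doc = []
for _ in range(8):                                 # 8 words in the generated document
    z = rng.choice(K, p=theta)                     # sample a topic for this word
    doc.append(rng.choice(vocab, p=phi[z]))        # sample a word from that topic
print(doc)                                         # bag-of-words representation
```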
1. Variance-score the transformation and inverse transformation of the data; test for 1, 2, 3, 4 PCs/LDs/NMs.
Medium on , explains the random probabilistic nature of LDA
Machinelearningplus on - a great read, don't forget to read the article.
Medium on , high level theoretical - not clear
Medium on , some historical references and general high-level how-to-use examples.
on LDA grid search params and about LDA expectations. Must read.
, talks about the sampling from a distribution of distributions in LDA
- has some text about overfitting - undiscussed in many places.
a, not okay and okay, respectively. Due to how we measure the metrics, ie., read the formulas. and
LDA as
Jupyter notebook on - missing code?
for k-means, LDA, SVD, NMF comparison - advice is to keep NMF or another method as a baseline to measure against LDA.
with
Selecting the number of topics in LDA + GitHub code,
- switching from LDA to a variation of it that is guided by the researcher / data
Medium on lda - ,
The best topic-modelling explanation, including insights, a great read, with code - shows how to find similar docs by topic in gensim, and shows how to transform unseen documents and do similarity using sklearn (a rough sketch of both ideas follows):
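Not the article's code, just a hedged sketch of those two ideas with scikit-learn: project an unseen document into topic space, then rank training documents by cosine similarity of their topic vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

train_docs = ["cats and dogs are pets", "the stock market fell", "dogs love walks"]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)                              # Doc X topic

unseen = ["my dog chased a cat"]
unseen_topics = lda.transform(vec.transform(unseen))       # unseen doc in topic space

sims = cosine_similarity(unseen_topics, doc_topics)[0]     # similarity to training docs
print(sims.argsort()[::-1])                                # most similar docs first
```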
- Sometimes LDA can also be used as a feature-selection technique. Take a text-classification problem where the training data contain category-wise documents. Run LDA on each category's documents separately; then removing the topic terms that are common across categories leaves the best features for each category.
, including algorithm, parameters!! And Parameters of LDA
- by the French guy
- has a very good simple example with probabilities
Code:
Great article:
, and using (using clustering to get the group of sentences in each topic)
LDA, NMF, SVD - using UMass and UCI coherence measures
*** code
Paper: , says LDA vs NMI (NMF?) and using coherence to analyze
LDADE's tuning dramatically reduces topic instability.
GitHub code
(didn't read) NTM - with GitHub code
- The inference algorithms in Mallet and gensim are indeed different: Mallet uses Gibbs sampling, which is more precise than gensim's faster, online variational Bayes. There is a way to get relatively similar performance by increasing the number of passes.
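A hedged sketch of both options: gensim's online variational LDA with extra passes, and (commented out) the Mallet Gibbs-sampling wrapper, which lives in `gensim.models.wrappers` only in gensim < 4.0 and needs a local Mallet install.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

texts = [["cat", "dog"], ["stock", "market"], ["dog", "walk"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Online variational Bayes; raising `passes` narrows the quality gap to Gibbs sampling
lda_vb = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2, passes=20)

# Mallet's Gibbs sampler via the wrapper (gensim < 4.0; the Mallet path is an assumption):
# from gensim.models.wrappers import LdaMallet
# lda_gibbs = LdaMallet("/path/to/mallet", corpus=corpus, id2word=dictionary, num_topics=2)
```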
Alpha and beta in Mallet:
The default value for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.
****
by spaCy. There are a lot of moving parts in the visualization. Here's a brief summary:
The plot is rendered in two dimensions according to a dimensionality-reduction algorithm. Topics that are generally similar should appear close together on the plot, while dissimilar topics should appear far apart.
A more detailed explanation of the pyLDAvis visualization can be found . Unfortunately, though the data used by gensim and pyLDAvis are the same, they don't use the same ID numbers for topics. If you need to match up topics in gensim's LdaMulticore object and pyLDAvis' visualization, you have to dig through the terms manually.
Presentation:
Paper:
Paper: UMass, UCI, NPMI, C_v, C_p, etc.
Paper:
Paper:
Paper:
Stackexchange:
Paper: - perplexity needs unseen data, coherence doesn't
LDA, LDA-U, BTM, W2V-GMM
Paper:
Paper:
Paper:
Paper:
Paper:
Paper: - Abstract: Topic models extract representative word sets—called topics—from word counts in documents without requiring any semantic annotations. Topics are not guaranteed to be well interpretable, therefore, coherence measures have been proposed to distinguish between good and bad topics. Studies of topic coherence so far are limited to measures that score pairs of individual words. For the first time, we include coherence measures from scientific philosophy that score pairs of more complex word subsets and apply them to topic scoring.
Code: - To conclude, there are many other approaches to evaluate topic models, such as perplexity, but it is a poor indicator of the quality of the topics. Topic visualization is also a good way to assess topic models. Topic coherence is a good way to compare different topic models based on their human interpretability. The u_mass and c_v topic coherences capture the optimal number of topics by giving the interpretability of these topics a number called the coherence score.
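A hedged sketch of computing the u_mass and c_v coherences with gensim's `CoherenceModel`, reusing the `lda`, `corpus`, `dictionary`, and `texts` objects from the gensim sketch further up.

```python
from gensim.models import CoherenceModel

# u_mass works from the bag-of-words corpus; c_v needs the tokenized texts
cm_umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
cm_cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm_umass.get_coherence(), cm_cv.get_coherence())
```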
Formulas:
Presentation, Papers:
-
then - "In your data we can see that there is a peak between 0-100 and a peak between 400-500. What I would think in this case is: does ~480 topics make sense for the kind of data I have? If not, you can just do an np.argmax for 0-100 topics and trade off coherence score for simpler understanding. Otherwise just do an np.argmax on the full set."
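A small sketch of the argmax idea from that quote; `coherence_for` is a hypothetical helper that trains a model with `k` topics and returns its coherence score.

```python
import numpy as np

topic_counts = list(range(2, 501, 10))
scores = np.array([coherence_for(k) for k in topic_counts])   # hypothetical helper

best_overall = topic_counts[int(np.argmax(scores))]           # global coherence peak
small_range = [i for i, k in enumerate(topic_counts) if k <= 100]
best_simple = topic_counts[small_range[int(np.argmax(scores[small_range]))]]  # simpler model
```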
- really good
Topic stability metric, a novel method, compared against Jaccard, Spearman, and silhouette:
“if you want to rework your own topic models that, say, jointly correlate an article’s topics with votes or predict topics over users then you might be interested in .”
- I just learned about these papers which are quite similar: and .
(excellent read)
Youtube:
on jupyter
Topic modeling with DistilBERT, c-TF-IDF, UMAP, HDBSCAN, merging similar topics, visualization (a BERTopic-style sketch follows),
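A hedged sketch of that pipeline using the BERTopic library, which chains sentence-transformer embeddings, UMAP, HDBSCAN and c-TF-IDF; exact method names may differ across versions, and a realistically sized corpus is needed for UMAP/HDBSCAN to work.

```python
from bertopic import BERTopic

docs = [...]  # a list of raw document strings (placeholder; needs to be reasonably large)

topic_model = BERTopic()                             # embeddings -> UMAP -> HDBSCAN -> c-TF-IDF
topics, probs = topic_model.fit_transform(docs)

topic_model.reduce_topics(docs, nr_topics="auto")    # merge similar topics
fig = topic_model.visualize_topics()                 # intertopic distance visualization
```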