How Many Topics? A Detailed Guide to Topic Modeling
I had the opportunity to be a part of the Girlscript Summer of Code Extended ’20 program, where I was one of the top 3 contributors to the open-source research project ‘How Many Topics?’, based on Natural Language Processing and Topic Modeling. I am extremely grateful to my project mentors, Harini Suresh and Mansi, for helping my team and me wherever required in successfully completing the project. I would also like to thank my teammates, Namya LG, Rashi Singh and Nagasuruthika, for their constant support and enthusiasm towards the project.
Problem Statement
Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus.
We intend to work on a research paper where we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. The goal is to determine k, the number of topics that are relevant to a given piece of text.
The pipeline that was followed for the project was as follows:
- Dataset Collection and required pre-processing.
- Studying and analysing several Topic modeling algorithms and running them.
- Fine-tuning the parameters for topic modeling.
- Classification
The entire code for the project can be found here.
Introduction
Topic modeling is an unsupervised machine learning technique used to analyse a set of documents, detect word and phrase patterns within them, and automatically cluster the word groups and similar expressions that best characterise those documents. To delve deeper into the technique, we researched Topic Modeling using the COVID-19 dataset, which can be found here.
Description of the dataset
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses.
The research part of the project was implemented on the Titles of the COVID-19 research paper dataset. To confirm our results, the best-suited parameters and procedure were then run on the Abstracts of the same dataset and on several other datasets such as Yelp Reviews, Nematode Biology, 20 Newsgroups and Blog Authorship.
Data Pre-Processing
From the entire COVID-19 Research Paper dataset, Topic Modeling is done in two parts, i.e. on the Titles and on the Abstracts of the research papers. The following steps were performed to clean the data in order to prepare it for the topic modeling algorithms (a minimal code sketch of this pipeline follows the list):
- Lowercasing, Punctuation, and Stopword Removal: All the text was lowercased and punctuation was removed with the help of the Regex library in Python. Stopword removal was then done using the list of English stopwords available in the gensim package.
- Tokenization: Word tokenization was done so that the resulting words could then be stemmed and lemmatized. The RegexpTokenizer available in the nltk package was used in this step.
- Stemming: Stemming was done for all the words in the dataset using the Porter Stemmer in order to reduce the words to their root/base form. The tokenized words from the previous step were the input to the stemmer.
- Lemmatization: Lemmatization was done to map each word to its dictionary base form (lemma), for example mapping inflected verb forms back to their base form. The WordNetLemmatizer available in the nltk package was used for this.
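A minimal sketch of this pre-processing pipeline, assuming gensim's built-in stopword list and the nltk tokenizer, stemmer and lemmatizer named above (the exact regular expression, the stem/lemmatize order and the variable name titles are illustrative assumptions):

import re
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokenizer = RegexpTokenizer(r'\w+')   # keeps only word characters
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()      # requires nltk.download('wordnet') once

def preprocess(text):
    text = re.sub(r'[^\w\s]', ' ', text.lower())           # lowercase + strip punctuation
    tokens = tokenizer.tokenize(text)                      # word tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]     # gensim English stopwords
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

docs = [preprocess(title) for title in titles]   # `titles` is assumed to be the list of raw title strings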
Observations
1. Hyperparameter Tuning
For the COVID-19 Title dataset, the following are the questions that often serve as a starting point for anyone getting their hands dirty with Topic Modeling. After several days of research, these are the observations we made that worked well for our dataset.
- Whether to use Bag of Words or Term Frequency-Inverse Document Frequency?
The BoW model captures the frequencies of the word occurrences in a text corpus. It is not concerned about the order in which words appear in the text; instead, it only cares about which words appear in the text. On the other hand, TF-IDF measures how important a particular word is with respect to a document and the entire corpus.
As far as our setup goes, the TF-IDF model is often used when implementing the Latent Dirichlet Allocation algorithm (discussed later), because it has been shown to provide slightly better results than a plain bag of words. However, the difference is not significant, so as far as LDA is concerned, both Bag of Words and TF-IDF seemed to work fine for our dataset.
After trying both models on several different values of k, our team made the following observations:
Even though some of the topics were repetitive, overall the model gave satisfactory results for us to continue further.
However, in our case, even more repetition among topics was observed when the LDA model used TF-IDF. In our view, TF-IDF produced very general topics about COVID-19 that could have been common to all the research papers, whereas the Bag of Words model gave a clearer picture of what was going on in each paper, as more specific words came up. To check whether we were proceeding in the right direction, we performed a performance evaluation of both models. Performance evaluation checks whether the model can distinguish different topics using the words in each topic and their corresponding weights.
Both models performed more or less the same: in the first case Bag of Words performed better, whereas in the second case the TF-IDF model performed much better. So we went with our earlier observation and finally chose the Bag of Words model, as it gave us more detailed and specific topics.
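As a sketch of how the two representations can be built in gensim (assuming docs is the list of pre-processed token lists from the previous section; either corpus can then be fed to LDA for comparison):

import gensim
from gensim import corpora, models

dictionary = corpora.Dictionary(docs)                        # vocabulary mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]       # (token_id, count) pairs per document

tfidf = models.TfidfModel(bow_corpus)                        # re-weights counts by inverse document frequency
tfidf_corpus = tfidf[bow_corpus]

lda_bow = gensim.models.LdaModel(bow_corpus, num_topics=8, id2word=dictionary, passes=10)
lda_tfidf = gensim.models.LdaModel(tfidf_corpus, num_topics=8, id2word=dictionary, passes=10)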
- Whether to use the Latent Dirichlet allocation (LDA) algorithm or the Latent Semantic Analysis (LSA) algorithm?
Latent Dirichlet Allocation (LDA) is the most popular topic modeling algorithm; it is used to assign the text in a document to particular topics. It builds a topics-per-document model and a words-per-topic model, both modelled as Dirichlet distributions.
Latent Semantic Analysis (LSA) is another topic modeling algorithm that helps capture the context around the hidden meanings/concepts (topics) in documents. It is mainly a technique from the field of distributional semantics.
LSA can help us figure out the hidden concepts or topics behind the words. While implementing the LSA algorithm on the COVID-19 title dataset, because of limitations on computational time and resources, we restricted the number of features to 1000 while creating the document-term matrix. This hyperparameter was also tuned to check whether it affected the topic model in any way; however, changing the number of max features produced almost no change in the results for our dataset. If enough computational power is available, it is advisable to use all the terms when implementing the algorithm.
The above has been repeated for varying values of k.
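A minimal sketch of this LSA setup, assuming scikit-learn's TfidfVectorizer and TruncatedSVD and a list titles of cleaned title strings (the variable names and the top-term printing are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer(max_features=1000)          # cap on vocabulary size used here
doc_term_matrix = vectorizer.fit_transform(titles)

k = 8                                                    # number of topics, varied across runs
lsa = TruncatedSVD(n_components=k, n_iter=100, random_state=42)
lsa.fit(doc_term_matrix)

terms = vectorizer.get_feature_names_out()               # get_feature_names() on older scikit-learn
for i, component in enumerate(lsa.components_):
    top_terms = [terms[j] for j in component.argsort()[-10:][::-1]]   # 10 strongest terms per topic
    print(f"Topic {i}: {top_terms}")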
On implementing the same model using LDA (as seen later in the document), our dataset gave much better results than with LSA. LDA produced a greater variety of topics, whereas LSA gave very redundant results across topics. One more reason LSA did not give satisfactory results for our dataset may be that, as the name suggests, Latent Semantic Analysis tries to find the hidden concepts behind words that have the same spelling but mean different things in different contexts. Most of the words in our dataset are technical and specific to the COVID-19 domain, so they rarely carry alternative meanings (unlike a word such as ‘novel’, which can mean a book or something original depending on the context). This may be why LDA performed better than LSA on the COVID-19 dataset. Hence, LDA was finally used for training.
- Whether to use the implementation of the LDA algorithm available in Python's Gensim package or its Scikit-learn package?
The Latent Dirichlet Allocation algorithm can be implemented using two libraries available in Python, the Gensim package and the Scikit-learn package. Gensim is the more popular choice for LDA; however, to check whether the results varied with the package, we tried to implement LDA using both. A major drawback was that the Scikit-learn implementation was very slow, and due to limited computational power, not even one run of the algorithm could be completed, even after dividing our dataset into two parts. Hence, we decided to continue with the Gensim package as it was much faster. If computational resources permit, one should compare the results of implementing LDA with both packages.
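For reference, a minimal sketch of the Scikit-learn variant (assuming a CountVectorizer document-term matrix over the cleaned titles; the parameter values are illustrative, not the settings we actually completed a run with):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

count_vectorizer = CountVectorizer(max_features=1000)
doc_term_matrix = count_vectorizer.fit_transform(titles)

sk_lda = LatentDirichletAllocation(n_components=8, max_iter=10,
                                   learning_method='online', random_state=100)
sk_lda.fit(doc_term_matrix)      # this fitting step was prohibitively slow for our full dataset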
2. Number of Topics
For the COVID-19 Title dataset, the LDA algorithm was implemented twice: on the entire dataset, and after dividing the dataset randomly into two parts to ease the implementation and reduce the running time. A little analysis was also done with regard to coherence, and we initially fixed the number of passes to one.
This is the output that was generated: although the maximum coherence increases with the number of topics, the topics themselves are not relevant at all. Considering the local maxima, the coherence score for number of topics = 11 seems to work best.
The graph above plots coherence against the number of passes with the number of topics fixed at 8; it can be seen that 30–40 passes is the most favourable range for 8 topics.
Earlier, the only parameters that were taken into account were as follows:
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=8, id2word=dictionary, alpha=0.16, passes=i, chunksize=10000, per_word_topics=True)
To delve deeper, the following parameters were then also considered:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=8, random_state=100, update_every=1, chunksize=1000, passes=10, alpha='auto', per_word_topics=True)
To date, the best results have been obtained with the number of topics as 8 and the number of passes as 40, with a calculated coherence score of 0.3253107958923857.
3. Coherence Value
C_v (Coherence value) measure is based on a sliding window, one-set segmentation of the top words, and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity.
When the coherence score seems to keep increasing with the number of topics, it may make better sense to pick the model that gave the highest C_v before the curve flattens out or drops sharply. In this case, we picked k=8, since for very large values of k the topics do not seem to add much value.
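A sketch of the kind of sweep that produces such coherence-versus-k numbers, assuming gensim's CoherenceModel with the c_v measure and the bow_corpus, dictionary and docs objects defined earlier (the k range and pass count are illustrative):

from gensim.models import LdaModel, CoherenceModel

coherence_scores = {}
for k in range(5, 15):
    model = LdaModel(corpus=bow_corpus, id2word=dictionary,
                     num_topics=k, passes=10, random_state=100)
    cm = CoherenceModel(model=model, texts=docs,
                        dictionary=dictionary, coherence='c_v')
    coherence_scores[k] = cm.get_coherence()

for k, score in coherence_scores.items():
    print(f"Number of Topics {k} has coherence value {score:.6f}")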
The title dataset was then divided randomly into two equal halves in order to reduce the runtime and to check whether that would give better results than feeding the entire dataset into the algorithm. As the best results were obtained for 7<=k<=9, the two halves were also checked only in this range, for easier comparison.
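A minimal sketch of such a random split (assuming docs is the full list of pre-processed documents; the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(42)
indices = rng.permutation(len(docs))          # shuffle document indices
half = len(docs) // 2
first_half = [docs[i] for i in indices[:half]]
second_half = [docs[i] for i in indices[half:]]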
First Half of the divided dataset
The major observation was that dividing the dataset did help and gave much better results. From the above picture, we can see that new words have come up (Topic 4 and Topic 5) which were not seen in the earlier runs.
This dataset shows excellent results in terms of visualisation, as the topics in the Intertopic Distance Map are quite far apart. The best improvement that could be made to this model is to change the parameters such that (Topic 1 and Topic 5) and (Topic 2 and Topic 3) no longer overlap. On observing this and the next visualisation, we thought that a suitable number of topics might be 5, as the Intertopic Distance Map is consistently divided into roughly 5 major groups.
Almost the same results can be seen (also much better than those obtained using the entire dataset) as were obtained with k = 7. However, the run with more passes gave a more in-depth insight into the topics. This also suggests that a model run for a greater number of passes gives better results (though this cannot always be tested because of limited computational power).
Topic 3 gave really good and unique results that had not come up before. This model is also excellent in terms of visualisation, as all the topics are quite spread out. As can again be seen in the above visualisation, some topics are a bit clustered together and there are 5 major categories/areas into which the topics fall. So we decided to check whether k = 5 might be a good value for the number of topics.
From the previous observations, an idea that came up was to try 5 as the number of topics in order to get fully separated topics. However, Topics 3 and 5 still overlapped, so this approach was not the right one. The quality of the topics also remained roughly the same as with 8 topics. As a higher coherence value was obtained for k = 8, we decided to finally continue with that model.
This Intertopic Distance distribution is the best one we have seen so far. All the topics are quite far from each other, and again the dataset has been divided into 5 major groups. Some overlaps can clearly be seen in (Topic 6 and Topic 3), (Topic 1 and Topic 2) and (Topic 4 and Topic 5). One thing we observed was that even with a lot of hyperparameter tuning, at least 2 topics were constantly overlapping.
Second Half of the divided dataset
The same procedure was then applied to the second half of the dataset. As k = 8 gave the best results for the first half, we decided to check the second half with the same value, but increased the number of passes to 40 in order to achieve even better results.
Very satisfactory results were obtained (notably better than those without splitting the dataset). New words have come up in this topic model too, and there is comparatively less redundancy. One reason for new words appearing might be that once the dataset is divided, the dominant and most prominent words change to ones that were not as dominant in the entire dataset.
Implementation of our observations on the Abstracts part of the COVID-19 dataset
The COVID-19 Abstracts were likewise divided randomly into two halves so that the same operations could be applied easily and the results compared.
First Half of the divided dataset
For the first half of the dataset, a procedure similar to the one used for the title dataset was followed. As the k value for our dataset (as found from the title dataset) lies between 8<=k<=10, these values were used to test whether they work equally well for the abstracts.
One observation was that values in the range k<7 gave very unsatisfactory results, as most of the words were redundant and not very relevant to our dataset. Closer to k=7, fairly good results started to come up, which told us that the optimum number of topics would be reached soon.
- For k=7: On increasing the number of passes, more general nouns came up as compared to more specific proper nouns (in passes = 2).
- For k=8: Number of topics=8 gave a very good general idea of our dataset which could cover almost all the aspects of COVID-19 in general.
The topics were not redundant, which shows that k=7 and/or k=8 is a good number of topics for our dataset. While k=7 gave more specific topics with proper nouns, k=8 gave a more general picture and hence should be preferred over k=7.
As we decided to continue with k=8 for the first half of the abstract dataset, we built the visualisation model for it using the pyLDAvis package available in Python.
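A minimal sketch of generating this intertopic distance visualisation, assuming a trained gensim lda_model, the bow_corpus and dictionary from before, and pyLDAvis 3.x (older versions expose the same function under pyLDAvis.gensim):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis = gensimvis.prepare(lda_model, bow_corpus, dictionary)   # builds the intertopic distance map
pyLDAvis.save_html(vis, 'lda_visualisation.html')            # or pyLDAvis.display(vis) in a notebook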
Striking Observation
It is usually said that if the intertopic distance between the topics in a topic model is large, the model is better. However, one striking observation across all the datasets was that when the coherence value for a particular number of topics was very high, the intertopic distance for that same k was comparatively low. This caused a dilemma over whether to go with the model with the better coherence value or the one with the better visualisation. In the end, we decided to go with the coherence value, reasoning that since our dataset is focused narrowly on COVID-19, it is very likely that almost all the topics are far more interrelated than in datasets with a very diverse range of topics.
An overall visualisation of how the coherence score changes with the number of topics (k) is shown below. From this graph, we can clearly see that the optimum range of k for our dataset is 8<=k<=10, which matches our earlier results.
Second Half of the divided dataset
For the second half of the abstract dataset, we checked which number of topics would suit the dataset best by plotting the coherence value graph and printing the topics.
We used the coherence value graph because, as per our previous results, the coherence value peaks at the number of topics that suits the dataset best.
The words present in the topic model were not particularly repetitive or redundant.
The words present in the topics were not particularly repetitive or redundant, except for topic 1, which consisted only of two-letter words that made no sense, e.g. ‘de’, ‘la’, ‘en’, etc.
The words present in the topics were not particularly repetitive or redundant, except for topic 6, which consisted only of one- or two-letter words that made no sense.
Words like ‘covid19’ and ‘disease’ were seen to repeat in all the topics. The presence of irrelevant data was negligible.
Only one or two words were repeated, but a lot of two- or three-letter words were found which were irrelevant and did not add much value to our results.
Visualisation
For the COVID-Title Dataset, we tried visualising the results using word clouds, graphs, tables, etc.
- Dominant topics and their percentage contribution
Repetition of words in various documents can be seen.
- Frequency Distribution of Word Counts in Documents
The following graphs show the distribution of words present in various documents in the topics.
- Word cloud with respect to different topic models:
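A minimal sketch of how such a per-topic word cloud can be generated from a trained gensim model (assuming the wordcloud and matplotlib packages; the topic index is arbitrary):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

topic_id = 0                                                    # any topic index from the trained model
topic_terms = dict(lda_model.show_topic(topic_id, topn=30))     # {word: weight} for the top 30 terms
wc = WordCloud(background_color='white').generate_from_frequencies(topic_terms)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title(f'Topic {topic_id}')
plt.show()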
Implementation of our observations on several other datasets
After successfully finding a particular flow to implement our topic modeling algorithm and finding the suitable number of topics along the way, our project mentors advised us to try the same on other datasets of different types to check whether we got the desired results. Each member took up some of the most famous datasets and worked on the same. I chose to work on the Blog Authorship dataset. This dataset contains text from 681,288 blogs which were written before August 2004 with each blog being the work of a single writer.
Because of the huge size of the dataset and the limited computational power available, the dataset was split randomly into 2 halves to reduce the running time and ease the implementation of the algorithm. Data preprocessing was done, which included the usual steps of lowercasing, punctuation and stopword removal, tokenization, stemming and lemmatization. The LDA model was then run on the two halves.
Observations
First Half of the Blog Authorship Dataset
As observed for the COVID-19 Research Paper dataset, 8<=k<=10 was the best-suited range for the number of topics. Intuitively, we started with the same k to check whether it also suited the Blog Authorship dataset.
As can be seen from the above data, k=8 was a very poor value for the number of topics and did not give satisfactory results at all. Many words were repetitive, and the output gave no idea of the overall topics. Hence, the value of k that worked for the COVID-19 Research Paper dataset did not work here. We could think of 2 reasons that might have caused this:
- The striking difference in the size of the datasets: the COVID-19 Research Paper dataset was much smaller (around 250,000 entries) than the Blog Authorship dataset, which contains 681,288 entries.
- The Blog Authorship dataset can cover a very large variety of topics, as blogs can be about anything and everything, whereas the scope of topics in the COVID-19 Research Paper dataset is limited.
Because of the above-stated factors, there is a possibility that 8<=k<=10 might not have worked for huge datasets with very diverse topics.
A Coherence value graph was plotted for k ranging between 7 and 30 in order to find the local and global maxima of the Coherence Value.
From the above graph, we can clearly see that local maxima occur at k = 12/13 and k = 17/18, and the best topic model could occur at k = 22/23, since a higher coherence value corresponds to a better topic model. The coherence value then continues to decrease for k>23, indicating that the most suitable value of k cannot be greater than 23.
k=12 also gave satisfactory results, and this value of k could be used for a topic model since the coherence value has a local maximum at k=12. However, as these topics are not very detailed, there is a high possibility that many other major topics in the dataset are being missed.
This model proved to be a good choice in terms of visualisation, as the topics were not as clustered together and had a larger intertopic distance than for other values of k (as seen later in the document). As with the COVID-19 Research Paper dataset, we had to choose between Coherence Value and Visualisation as the selection criterion; since Coherence Value had proven to be the better criterion, we continued with it.
On checking k=17 (a local maximum of the Coherence Value), the LDA model gave satisfactory results, but a better model would still be the one with k=22, because it achieved the maximum coherence value.
k = 17 gave the best model from the visualisation aspect: the topics were comparatively much more spread out (i.e. the intertopic distance is quite large) than for other values of k and not so clustered together. This suggests that this value of k covers a large number of the diverse topics that are dominant in the dataset.
Overall, k=22 gave a very detailed picture of each topic dominant in the first half of the Blog Authorship dataset; however, k=19/20/21 could also prove to be good, as some of the k=22 topics did not make much sense and included words like ‘like’ very frequently. As the Coherence Value graph also shows a global maximum at k=22, this value of k can finally be chosen as the optimum number of topics.
When it comes to visualisation, it can again be observed that for a high coherence value the intertopic distance is smaller than for lower values of k where a local maximum was achieved. However, the topics were not all clustered together, and since there is some appreciable intertopic distance, this factor can be ignored as Coherence Value outweighs it.
Top 15 words in the Dataset
This histogram was plotted to check that the dominant topics the LDA algorithm found were in accordance with the dominant words in the dataset. From the above graph, we can clearly see that the most frequent words in the dataset also appear in the topics our topic model produced, which somewhat confirms the soundness of the procedure we followed for Topic Modeling.
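A minimal sketch of plotting such a top-15 frequency histogram from the pre-processed token lists (assuming matplotlib; docs is the list of token lists as before):

from collections import Counter
import matplotlib.pyplot as plt

word_counts = Counter(token for doc in docs for token in doc)   # frequency of every token
words, counts = zip(*word_counts.most_common(15))               # top 15 words and their counts

plt.bar(words, counts)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Frequency')
plt.title('Top 15 words in the dataset')
plt.tight_layout()
plt.show()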
Second Half of the Blog Authorship Dataset
Similar to the procedure followed for the first half of the dataset, a Coherence Value graph was plotted for 7<k<30 in order to find the local and global maxima of the Coherence Value. One more reason for choosing k=22/23 over lower values is that, since this dataset is very large and has very diverse entries, a larger number of topics is needed to cover all the dominant ones.
Number of Topics 7 has coherence value 0.417767
Number of Topics 12 has coherence value 0.441731
Number of Topics 17 has coherence value 0.444497
Number of Topics 22 has coherence value 0.458948
Number of Topics 27 has coherence value 0.452409
A striking observation was that the Coherence Value graphs for the first and second halves of the Blog Authorship dataset were essentially identical, with local maxima at k = 12/13 and k = 17/18 and the global maximum at k = 22/23. As the Coherence Value decreases for k>23, no other maxima occur after k=22/23.
On running the algorithm for 13≤k≤15, all the topics made sense and gave a fairly good idea of what each topic was about. Some of the words were redundant, though, which can be improved by choosing a larger value of k with a higher coherence score.
From the visualisation, we can see that many topics are clustered together, so it is possible that this model is not a good one due to the small intertopic distance.
Even though k=17/18 had a higher coherence score than k=12/13, the topics were not as good, as many of them were very repetitive. It appears that k = 12/13 would have sufficed rather than choosing a larger number such as 17/18.
The best topic model is achieved with k = 23. This model gave a very detailed description of what each topic could be and an overall idea of the different blogs the dataset consists of. It also did not contain redundant topics and had the best coherence score, as observed in the graph. Finally, k = 22/23 is the number of topics chosen for the second half of the Blog Authorship dataset.
Again, it can be observed that for a very high coherence value, the visualisation did not give very good results: many topics were clustered together and the intertopic distance was quite small. However, as Coherence Value outweighs the visualisation aspect, k = 22/23 is still the value chosen to train the model.
Conclusion for the Blog Authorship dataset
As the first and second halves of the dataset were largely in agreement and had no contradicting features, it can safely be said that k = 22/23 is the final number of topics chosen for the Blog Authorship dataset, giving a coherence score of around 0.458948. The procedure and algorithm followed to find the topics for the COVID-19 Research Paper dataset worked quite well for this dataset too; however, the value of k had to be increased because of the dataset's huge size and the diversity of its topics.
Overall conclusions made for the COVID-19 datasets
On the basis of our findings, the following are the conclusions made:
- The best-suited value for the number of topics i.e. k comes out to be in the range of 8<=k<=10 for the COVID-19 Title and Abstract dataset.
- Latent Dirichlet Allocation (LDA) algorithm worked better for our dataset as compared to the Latent Semantic Analysis (LSA) algorithm.
- Coherence Value was considered as the main feature used for the analysis of our topic model i.e. it was preferred over other factors like Perplexity score, Log-likelihood, Intertopic Distance Map, etc.