Topic Modeling in Python
What is Topic Modeling?
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In a nutshell, topic models are statistical language models used for uncovering hidden structure in a collection of texts.
There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
In this article, we’ll cover LDA, and implement a basic topic model.
Introduction
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
The Data
The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle. (https://www.kaggle.com/therohk/million-headlines/data)
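Before preprocessing, the headlines need to be loaded into a DataFrame. A minimal sketch, assuming the Kaggle CSV is saved locally as abcnews-date-text.csv (the file name and the headline_text column come from that dataset):
import pandas as pd
# Load the headlines CSV downloaded from Kaggle (file name assumed)
data = pd.read_csv('abcnews-date-text.csv')
# Keep only the headline text; this DataFrame is referenced as 'documents' later on
documents = data[['headline_text']].copy()
documents['index'] = documents.index
print(len(documents))
documents.head()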
Data Preprocessing
In order to preprocess the data set we have, we will perform the following steps:
- Tokenization: split the text into sentences and the sentences into words, lowercase the words, and remove punctuation.
- Remove words with 3 or fewer characters.
- Remove stopwords.
- Lemmatize and stem the remaining words.
Loading Gensim and NLTK libraries:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
Function to perform lemmatization and stemming steps on the data set:
def lemmatize_stemming(text):
    # Lemmatize the word as a verb, then stem the result
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and short tokens, then lemmatize and stem
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
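Before mapping the function over the whole corpus, it helps to sanity-check it on a single headline; the sample text below is made up for illustration:
doc_sample = 'rain helps dampen bushfires'
print('original text:', doc_sample.split())
print('preprocessed:', preprocess(doc_sample))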
Preprocess the headline text, saving the results as ‘processed_docs’, and preview the data set after the preprocessing step:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]
Bag of Words on the Data set
Create a dictionary from ‘processed_docs’ that maps each unique word to an id and records the number of times that word appears in the training set.
dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview the first few (token id, token) pairs in the dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
Gensim filter_extremes
Filter out tokens that appear in:
- fewer than 15 documents (absolute number), or
- more than 50% of the documents (a fraction of the total corpus size, not an absolute number).
After the above two steps, keep only the first 100,000 most frequent tokens.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
Gensim doc2bow
For each document, we create a bag-of-words representation reporting which words appear and how many times each of them appears.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]
Preview Bag of Words for our sample preprocessed document:
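A readable preview can be printed by translating each (token id, count) pair of the sample document back into its word; index 4310 is simply the example document used above:
bow_doc_4310 = bow_corpus[4310]
for token_id, count in bow_doc_4310:
    print('Word {} ("{}") appears {} time(s).'.format(token_id, dictionary[token_id], count))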
Running LDA using Bag of Words
Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
For each topic, we will explore the words occurring in that topic and their relative weight.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
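To see which topics the model assigns to an individual headline, you can score its bag-of-words representation against the trained model; this sketch reuses the sample document at index 4310 from above:
# Topic scores for the sample document, highest probability first
for topic_id, score in sorted(lda_model[bow_corpus[4310]], key=lambda pair: -pair[1]):
    print('Score: {:.4f}\t Topic: {}'.format(score, lda_model.print_topic(topic_id, 10)))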
You can view/download the Jupyter Notebook here (https://nbviewer.org/github/just-arvind/article_src/tree/main/LDA_Topic_Modeling_Article.ipynb)