Topic Modeling in Python
What is Topic Modeling?
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In a nutshell, topic models are statistical language models used for uncovering hidden structure in a collection of texts.
There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).
In this article, we’ll cover LDA, and implement a basic topic model.
Introduction
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.
The Data
The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle. (https://www.kaggle.com/therohk/million-headlines/data)
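Before preprocessing, the headlines need to be loaded into a DataFrame. A minimal sketch, assuming the Kaggle CSV is saved locally as abcnews-date-text.csv (the file name and the headline_text column come from that dataset):
import pandas as pd
# Load the headlines CSV downloaded from Kaggle (file name assumed)
data = pd.read_csv('abcnews-date-text.csv')
# Keep only the headline text; this DataFrame is referenced as 'documents' later on
documents = data[['headline_text']].copy()
documents['index'] = documents.index
print(len(documents))
documents.head()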
Data Preprocessing
In order to preprocess the data set we have, we will perform the following steps:
- Tokenization: split the text into sentences and the sentences into words, lowercase the words, and remove punctuation.
- Remove words with 3 or fewer characters.
- Remove stopwords.
- Lemmatize and stem the remaining words.
Loading Gensim and NLTK libraries:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
import numpy as np
np.random.seed(2018)
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
Function to perform lemmatization and stemming steps on the data set:
def lemmatize_stemming(text):
    # Lemmatize the word as a verb, then stem the result
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and short tokens, then lemmatize and stem
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
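Before mapping the function over the whole corpus, it helps to sanity-check it on a single headline; the sample text below is made up for illustration:
doc_sample = 'rain helps dampen bushfires'
print('original text:', doc_sample.split())
print('preprocessed:', preprocess(doc_sample))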
Preprocess the headline text, saving the results as ‘processed_docs’, and preview the data set after the preprocessing step:
processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]
Bag of Words on the Data set
Create a dictionary from ‘processed_docs’ that maps each unique word to an id and records the number of times that word appears in the training set.
dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview the first few (token id, token) pairs in the dictionary
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
Gensim filter_extremes
Filter out tokens that appear in:
- fewer than 15 documents (absolute number), or
- more than 50% of the documents (a fraction of the total corpus size, not an absolute number).
After the above two steps, keep only the first 100,000 most frequent tokens.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
Gensim doc2bow
For each document, we create a bag-of-words representation reporting which words appear and how many times each of them appears.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
bow_corpus[4310]
Preview Bag of Words for our sample preprocessed document:
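A readable preview can be printed by translating each (token id, count) pair of the sample document back into its word; index 4310 is simply the example document used above:
bow_doc_4310 = bow_corpus[4310]
for token_id, count in bow_doc_4310:
    print('Word {} ("{}") appears {} time(s).'.format(token_id, dictionary[token_id], count))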
Running LDA using Bag of Words
Train our LDA model using gensim.models.LdaMulticore and save it to ‘lda_model’:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)
For each topic, we will explore the words occurring in that topic and their relative weight.
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
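To see which topics the model assigns to an individual headline, you can score its bag-of-words representation against the trained model; this sketch reuses the sample document at index 4310 from above:
# Topic scores for the sample document, highest probability first
for topic_id, score in sorted(lda_model[bow_corpus[4310]], key=lambda pair: -pair[1]):
    print('Score: {:.4f}\t Topic: {}'.format(score, lda_model.print_topic(topic_id, 10)))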
You can view/download the Jupyter Notebook here (https://nbviewer.org/github/just-arvind/article_src/tree/main/LDA_Topic_Modeling_Article.ipynb)