Topic Modeling in Python

Lakebrains Technologies
Nov 15, 2022


Topic Modeling with Python

What is Topic Modeling?

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In a nutshell, topic models are statistical language models used to uncover hidden structure in a collection of texts.

There are several existing algorithms you can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).

In this article, we’ll cover LDA, and implement a basic topic model.

Introduction

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topics.

The Data

The data set we’ll use is a list of over one million news headlines published over a period of 15 years and can be downloaded from Kaggle. (https://www.kaggle.com/therohk/million-headlines/data)
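To follow along, load the headlines into a pandas DataFrame first. A minimal loading sketch, assuming the Kaggle download is saved locally as 'abcnews-date-text.csv' with a 'headline_text' column (adjust the path to match your copy):

import pandas as pd

# Load the news headlines into a DataFrame; the filename is the one used on Kaggle
documents = pd.read_csv('abcnews-date-text.csv')
print(len(documents))
print(documents['headline_text'].head())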

Data Preprocessing

In order to preprocess the data set we have, we will perform the following steps:

  • Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
  • Words with 3 or fewer characters are removed from the data.
  • Stopwords are removed.
  • Words are lemmatized and stemmed to reduce them to their root forms.

Loading Gensim and NLTK libraries:

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
import numpy as np
import nltk

np.random.seed(2018)

# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

Function to perform lemmatization and stemming steps on the data set:

def lemmatize_stemming(text):
    # Lemmatize each token (treating it as a verb), then stem it
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, drop stopwords and very short tokens,
    # then lemmatize and stem whatever remains
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

Preprocess the headline text, saving the results as 'processed_docs', and preview the first ten entries:

processed_docs = documents['headline_text'].map(preprocess)
processed_docs[:10]
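To see what the preprocessing does to a single headline, you can compare one raw document with its processed form. A small sketch (not part of the original walkthrough); index 4310 is chosen only because the same document is inspected again below:

# Compare a raw headline with its preprocessed form
doc_sample = documents['headline_text'][4310]
print('Original document:', doc_sample)
print('Preprocessed document:', preprocess(doc_sample))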

Bag of Words on the Data set

Create a dictionary from 'processed_docs' that maps each word in the training set to a unique id, and preview the first few entries:

dictionary = gensim.corpora.Dictionary(processed_docs)

# Preview the first few (token id, word) pairs in the dictionary
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

Gensim filter_extremes

Filter out tokens that appear in:

  • fewer than 15 documents (an absolute count), or
  • more than 50% of the documents (a fraction of the total corpus size, not an absolute count);
  • after those two filters, keep only the 100,000 most frequent tokens.

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

Gensim doc2bow

For each document, doc2bow creates a list of (token id, count) pairs reporting which dictionary words appear in that document and how many times.

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

Preview the Bag of Words for our sample preprocessed document:

bow_corpus[4310]
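The raw preview is just a list of (token id, count) pairs. A short sketch (an addition to the original walkthrough) that maps the ids back to words through the dictionary makes it easier to read:

# Print each (token id, count) pair of the sample document alongside its word
bow_doc_4310 = bow_corpus[4310]
for token_id, count in bow_doc_4310:
    print('Word {} ("{}") appears {} time(s).'.format(token_id, dictionary[token_id], count))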

Running LDA using Bag of Words

Train our LDA model using gensim.models.LdaMulticore and save it to 'lda_model':

lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

For each topic, we will explore the words occurring in that topic and their relative weight.

for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
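As a quick usage check (not covered in the original walkthrough), the trained model can also score a new headline by converting it to a bag of words with the same dictionary; the example headline below is invented:

# Score an unseen headline against the trained topics (the headline text is made up)
unseen_document = 'rain helps firefighters contain bushfire near sydney'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1 * tup[1]):
    print('Score: {}\t Topic: {}'.format(score, lda_model.print_topic(index, 5)))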

You can view/download the Jupyter Notebook here (https://nbviewer.org/github/just-arvind/article_src/tree/main/LDA_Topic_Modeling_Article.ipynb)
