
1. Topic Modeling (LDA)

1.1 Downloading NLTK Stopwords & spaCy

NLTK (Natural Language Toolkit) is a package for processing natural language with Python. NLTK depends on NumPy, but both are basic packages that come pre-installed in Colab.

We are going to use the Gensim, spaCy, NumPy, pandas, re, Matplotlib, and pyLDAvis packages for topic modeling. The pyLDAvis package is not pre-installed in Colab, so you need to install it manually.

pip install --upgrade gensim
pip install pyldavis==3.2.1
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spaCy for Lemmatization
import spacy

# Visualization tools
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for Gensim (This is optional)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

1.2 Adding NLTK Stop words

Download the stopwords from NLTK in order to use them.

import nltk; nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
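
Depending on your corpus, you may also want to extend the default list with extra stop words. Here is a minimal sketch; the added words are only hypothetical examples, not part of the original tutorial.

# Optionally extend the NLTK stop word list with corpus-specific words
# (the words below are placeholders; pick tokens that add noise in your own text)
stop_words.extend(['chapter', 'mr', 'mrs'])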

1.3 Importing Dataset

We are going to load a dataset, Charles Dickens’s Our Mutual Friend, in Colab. Google Drive needs to be mounted in order to load the dataset. If you run this code on your local machine, skip the Google Drive mounting step and load the dataset from its local path instead.

from google.colab import drive
drive.mount('/content/drive/')
Mounted at /content/drive/

We will load a txt file instead of a csv file. If you want to load a csv file or another type of tabular file, I recommend using pd.read_csv('path').

# Open the text file; we will iterate over its lines below
data = open('drive/My Drive/Colab Notebooks/[your path]/OMF.txt', 'r')

Let’s peek at the first 20 characters of the file.

print(data.readline(20))
In these times of ou
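
As mentioned above, if your corpus lived in a csv file instead, a minimal sketch with pandas might look like the following; the file name and the 'sentences' column are hypothetical placeholders.

# pandas was already imported as pd above
# Hypothetical csv with one sentence per row in a 'sentences' column
df = pd.read_csv('drive/My Drive/Colab Notebooks/[your path]/sentences.csv')
sentences = df['sentences'].astype(str).tolist()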

2.1 Tokenization and Clean-up

Time to tokenize each sentence into a list of words, getting rid of unnecessary items such as punctuation. Gensim’s simple_preprocess lowercases the text, strips punctuation, and drops very short or very long tokens; setting deacc=True additionally removes accent marks.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

data_words = list(sent_to_words(data))

print(data_words[:1])
[['rs', 'though', 'concerning', 'the', 'exact', 'year', 'there', 'is', 'no']]

2.2 Bigram and Trigram

Bigrams and trigrams are sequences of two or three words that frequently occur together; for example, on_the_rocks is a trigram. We can build them with Gensim’s Phrases model. You might want to tune min_count and threshold later to get the best results for your corpus (see the toy sketch after the example output below).

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
['rs', 'though', 'concerning', 'the', 'exact', 'year', 'there', 'is', 'no']
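
To get a feel for how min_count and threshold decide which phrases get merged, here is a small toy sketch; the sentences and parameter values are made up purely for illustration.

# Toy illustration: with a low min_count and threshold, a pair of words that
# always co-occur ('new', 'york') is merged into a single token 'new_york'.
toy_sentences = [['new', 'york', 'is', 'big']] * 10 + [['i', 'like', 'new', 'york']] * 10
toy_bigram = gensim.models.Phrases(toy_sentences, min_count=1, threshold=0.1)
print(toy_bigram[['i', 'like', 'new', 'york']])  # expected: ['i', 'like', 'new_york']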

2.3 Functions that Deal with Stopwords, Lemmatization, Bigrams, and Trigrams

Let’s create functions to remove stopwords, deal with lemmatization, and make bigrams and trigrams. After that, we will implement them.

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize the spaCy English model, keeping only the tagger component (for efficiency)
# Install it first if needed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Perform lemmatization keeping noun, adjective, verb, and adverb
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
[['concern', 'exact', 'year']]

2.4 Dictionary and Corpus

We are going to create the dictionary using data_lemmatized from the previous step.

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
[[(0, 1), (1, 1), (2, 1)]]

Gensim assigns a unique id to each word and vectorizes the documents. Each entry in the generated corpus shown above is a (word_id, word_frequency) pair.

Let’s view the word for word_id=0 from id2word.

id2word[0]
'concern'
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('concern', 1), ('exact', 1), ('year', 1)]]

3.1 Running the LDA Model

We have everything we need to train the LDA model. Let’s build the LDA model with specific parameters. You might want to change num_topics and passes later; passes is the number of passes through the corpus during training, similar to epochs.

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)
# Print the keywords for each of the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0,
  '0.083*"little" + 0.073*"never" + 0.059*"word" + 0.046*"mean" + '
  '0.042*"woman" + 0.032*"hear" + 0.029*"honour" + 0.026*"yet" + 0.026*"next" '
  '+ 0.021*"point"'),
 (1,
  '0.114*"turn" + 0.068*"eye" + 0.060*"return" + 0.038*"place" + 0.025*"love" '
  '+ 0.019*"bear" + 0.019*"hour" + 0.019*"believe" + 0.018*"change" + '
  '0.018*"lay"'),
 (2,
  '0.183*"say" + 0.040*"would" + 0.037*"man" + 0.035*"hand" + 0.029*"much" + '
  '0.019*"great" + 0.017*"twemlow" + 0.017*"old" + 0.017*"quite" + '
  '0.013*"shake"'),
 (3,
  '0.086*"see" + 0.077*"riderhood" + 0.055*"good" + 0.040*"find" + 0.040*"way" '
  '+ 0.037*"ever" + 0.025*"home" + 0.024*"must" + 0.023*"wife" + '
  '0.022*"feeling"'),
 (4,
  '0.088*"come" + 0.077*"make" + 0.051*"think" + 0.043*"young" + 0.033*"lady" '
  '+ 0.030*"want" + 0.027*"question" + 0.027*"seem" + 0.022*"away" + '
  '0.019*"side"'),
 (5,
  '0.075*"know" + 0.062*"take" + 0.027*"shall" + 0.027*"bella" + 0.026*"dear" '
  '+ 0.025*"day" + 0.024*"give" + 0.024*"put" + 0.022*"podsnap" + 0.020*"sit"'),
 (6,
  '0.096*"may" + 0.088*"look" + 0.057*"could" + 0.038*"head" + 0.037*"cry" + '
  '0.033*"part" + 0.032*"do" + 0.029*"face" + 0.025*"long" + 0.022*"voice"'),
 (7,
  '0.087*"go" + 0.046*"time" + 0.031*"well" + 0.023*"name" + 0.022*"back" + '
  '0.021*"last" + 0.021*"let" + 0.019*"keep" + 0.019*"night" + 0.017*"eugene"'),
 (8,
  '0.066*"tell" + 0.049*"leave" + 0.036*"many" + 0.033*"stand" + '
  '0.027*"sloppy" + 0.025*"suppose" + 0.023*"company" + 0.022*"certain" + '
  '0.020*"throw" + 0.020*"lie"'),
 (9,
  '0.067*"ask" + 0.064*"get" + 0.037*"use" + 0.032*"mind" + 0.032*"still" + '
  '0.032*"bring" + 0.026*"open" + 0.024*"wegg" + 0.022*"alone" + 0.022*"door"')]

3.2 Evaluating the LDA Model

After training a model, it is common to evaluate it. For topic modeling, two widely used measures are perplexity (lower is better) and topic coherence (higher is better).

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity:  -9.15864413363542

Coherence Score:  0.4776129744220124
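
A common way to choose num_topics is to train several models and compare their coherence scores. The sketch below is only illustrative; the candidate topic counts are arbitrary, and training several models on the full corpus can take a while.

# Compare coherence for a few candidate numbers of topics (values are arbitrary examples)
for k in [5, 10, 15]:
    model_k = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word,
                                              num_topics=k, random_state=100,
                                              passes=10, alpha='auto')
    cm = CoherenceModel(model=model_k, texts=data_lemmatized,
                        dictionary=id2word, coherence='c_v')
    print(k, 'topics -> coherence:', cm.get_coherence())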

3.3 Visualization

Now that we have the evaluation results, it is time to visualize them. We are going to visualize the results of the LDA model using the pyLDAvis package.

# Visualize the topics
pyLDAvis.enable_notebook()
visualization = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
visualization
# Export the visualization as an HTML file.
pyLDAvis.save_html(visualization, 'drive/My Drive/Colab Notebooks/LDAModel.html')

References:
Topic Modeling with Gensim (Python) by Selva Prabhakaran.
Generating and Visualizing Topic Models with Tethne and MALLET by ASU Digital Innovation Group.
Colab + Gensim + Mallet by Geoff Ford.