

Topic Modeling Workshop

Page history last edited by Alan Liu 6 years, 3 months ago

Alan Liu & Lindsay Thomas (28 August 2015)


This workshop contains an overview of topic modeling for beginners and a lesson plan with step-by-step instructions for taking a group through making a topic model using Mallet, examining Mallet output files, visualizing topics as word clouds using Lexos, and clustering topics in topic models to facilitate interpretation. Alan Liu and Lindsay Thomas (assisted by Jeremy Douglass and Scott Kleinman) co-led the inaugural version of the workshop (in a hybrid face-to-face and Skype environment) on 28 August 2015 for a group of 15 participants, including faculty and graduate students in UC Santa Barbara's English, Writing Program, History, and Sociology departments (with some participants, including Lindsay, Skyping in).


Theory: Overview of Topic Modeling: The first section of the workshop consists of a 25-minute exposition (with slides) of topic modeling, building especially on explanations for beginners posted by Edwin Chen and Ted Underwood.  (Selected readings about topic modeling and examples of research using topic modeling are also provided.)
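The core intuition behind LDA-style topic modeling, as presented in the Chen and Underwood explanations, is that each topic is a probability distribution over words and each document is a mixture of topics. A toy generative sketch can make this concrete (the topics, words, and proportions below are invented purely for illustration):

```python
import random

# Two toy "topics": each is a probability distribution over words.
topics = {
    "publishing": {"books": 0.4, "print": 0.3, "author": 0.2, "ebook": 0.1},
    "weather":    {"rain": 0.5, "cloud": 0.3, "storm": 0.2},
}

def sample_word(distribution, rng):
    """Draw one item according to its probability in the distribution."""
    items = list(distribution)
    weights = [distribution[i] for i in items]
    return rng.choices(items, weights=weights, k=1)[0]

def generate_document(topic_mixture, length, seed=0):
    """Generate a document: pick a topic for each word slot, then a word from that topic."""
    rng = random.Random(seed)
    doc = []
    for _ in range(length):
        topic = sample_word(topic_mixture, rng)   # the mixture is itself a distribution
        doc.append(sample_word(topics[topic], rng))
    return doc

# A document that is 70% "publishing" and 30% "weather":
doc = generate_document({"publishing": 0.7, "weather": 0.3}, length=10)
print(doc)
```

Topic modeling runs this process in reverse: given only the documents, it infers a set of topics and each document's mixture of them.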


Practice: Hands-on Tutorial: The second section of the workshop is an efficient, hands-on lesson plan in topic modeling using sample corpora. (Not covered are pre-processing tasks, including: installation of Mallet [participants are referred to the Programming Historian's guide]; cleaning of texts; studying a corpus with text-analysis tools such as AntConc, Voyant, or the Natural Language Toolkit (NLTK) to find important words to add to a stop list or phrases to consolidate into single tokens; and chunking or segmenting of texts.) In the inaugural workshop on 28 August 2015, participants broke into two groups--one led face-to-face by Alan and the other led in a mixed face-to-face and Skype screen-sharing environment by Lindsay (who herself was Skyping in). In advance, Mallet and demonstration corpora were installed on the machines used by these two groups. Alan and Lindsay led the groups through the lesson plan, adding explanations and responding to questions.
         Given the combined technical and conceptual difficulty for beginners, plus the complications of a hybrid face-to-face and virtual environment, these hands-on tutorials succeeded surprisingly well in the inaugural workshop. One criterion of success is that beginners appear to have gotten past the mental threshold to the point where they say to themselves, "I can do this. I'm still not sure of some of the details. But I know how to get started and I'll figure it out." Another criterion of success was unexpected: the remarkable robustness of questions and discussion during the tutorials. It turned out that the tutorials were not just "hands-on" but "minds-on." In Alan's group, for example, participants from English, Writing, History, and Sociology asked both technical questions and penetrating methodological questions about the nature, goals, premises, implications, and impact of topic modeling for research.
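The chunking or segmenting of texts mentioned above as a pre-processing step (out of scope for the lesson plan) amounts to splitting long texts into fixed-size pieces so that Mallet treats each piece as a document. A minimal sketch, with an arbitrary chunk size:

```python
def chunk_text(text, words_per_chunk=500):
    """Split a text into consecutive chunks of roughly equal word counts."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# Illustrative input: a 900-word "text" built by repetition.
sample = "the quick brown fox jumps over the lazy dog " * 100
chunks = chunk_text(sample, words_per_chunk=250)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk would then be written out as its own .txt file in the corpus folder that Mallet's import-dir command reads.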




Overview: Theory of Topic Modeling


Some Readings:

Overviews and "Simple" Explanations (Recommended = especially recommended for beginners across research disciplines)


More Advanced Explanations & Examples of Research Using Topic Models


Slideshow: Alan Liu's Explanation of Topic Modeling for Humanists

(building on Edwin Chen and Ted Underwood):



Practice: Hands-on Tutorial


Preliminary Steps (Installing Mallet): 


Topic Modeling  -- Step by Step:

        Sample corpora for topic modeling exercises: [download a demonstration corpus from Alan Liu's DH Toychest]:

  • (A) Open a command shell ("command line") window
    1. Windows "command prompt" (using shell command language based on MS-DOS) -- quick guide -- tip on enabling copy/paste operations
    2. Mac "terminal" (using shell command language based on Unix) -- cheat sheet 
    3. Linux command line (also uses bash)
  • (B) Navigate to the Mallet folder (directory)
    1. Windows: type the following at the command line, followed by a return (helpful tip: the <F5> function key pastes in the previously used command)
      •  cd c:\mallet
    2. Mac: type the following at the command line
      • cd /Users/yourusername/mallet
      • As with the Windows command above, this will depend on where and how you have saved the MALLET folder you downloaded when you installed the package. The above command assumes you've saved MALLET (and titled the folder it's in "mallet") under your home directory. 

  • (C) Input a folder of text files and process them into a .mallet data file
    --Use the command below, varying the path and file names as desired (the descriptive placeholder phrases indicate paths/filenames you supply). Use backslashes in Windows, forward slashes on Macs. There must be no hidden returns in the command. Best practice is to set the job up in a text document (with "word wrap" view turned off) and copy/paste the command into the command shell.
    1. bin\mallet import-dir --input path to folder --output path and filename of desired output data file with .mallet extension --keep-sequence --remove-stopwords
    2. Example (Windows): bin\mallet import-dir --input C:\workspace\corpus-sample --output C:\workspace\topic-model\corpus-sample.mallet --keep-sequence --remove-stopwords
    3. Example (Mac): ./bin/mallet import-dir --input /Users/we1s/corpus-sample --output /Users/we1s/topic-model/corpus-sample.mallet --keep-sequence --remove-stopwords
  • (D) Create the topic model ("train" the topic model) -- Use the following command.
    1. bin\mallet train-topics --input path and filename of the previously created data file with .mallet extension --num-topics desired number of topics --optimize-interval 20 --output-state path to output folder\topic-state.gz --output-topic-keys path to output folder\keys.txt --output-doc-topics path to output folder\composition.txt --word-topic-counts-file path to output folder\topic-counts.txt
    2. Example (Windows): bin\mallet train-topics  --input C:\workspace\topic-model\corpus-sample.mallet --num-topics 20 --optimize-interval 20 --output-state C:\workspace\topic-model\topic-state.gz --output-topic-keys C:\workspace\topic-model\keys.txt --output-doc-topics C:\workspace\topic-model\composition.txt --word-topic-counts-file C:\workspace\topic-model\topic-counts.txt
    3. Example (Mac): ./bin/mallet train-topics --input /Users/we1s/topic-model/corpus-sample.mallet --num-topics 20 --optimize-interval 20 --output-state /Users/we1s/topic-model/topic-state.gz --output-topic-keys /Users/we1s/topic-model/keys.txt --output-doc-topics /Users/we1s/topic-model/composition.txt --word-topic-counts-file /Users/we1s/topic-model/topic-counts.txt
  • (E) Examine the topic model output files 
    1. Examine the keys.txt, composition.txt, and topic-counts.txt output files for the topic model. It helps to copy and paste them into a spreadsheet to see their structure. (If you pull keys.txt into a spreadsheet, you can sort by topic weight to see the relative ranking of the topics in importance.)
  • (F) Visualize the topics as word clouds using Lexos
    1. Go to Lexos online
    2. Navigate in the Lexos interface to "Visualize" > "Multicloud"
    3. In the options for a multicloud (or multiple word clouds), choose "topic clouds" (instead of the default "document clouds")
    4. Upload the topic-counts.txt file from your topic model
    5. Click on "Get graphs"
  • (G) Experiment with clustering topics
    1. Convert the topics in your topic model as represented/weighted in the topic-counts.txt file into simple collections of words, one collection for each topic (with words repeated in proportion to their weight in that topic). We call these "topic-documents." Essentially, this gives us a simplified representation of each topic as an ordinary document that can itself be analyzed like other documents.
      1. First make a copy of the topic-counts.txt file from your topic model and place it in its own folder or subfolder. (This allows you to create topic-documents in their own folder, rather than have them clutter up your topic model folder itself.)
      2. From the python_scripts folder on the WE1S Google Drive, download Scott Kleinman's topicsToDocs.py script and save it in the Python Scripts folder on your local machine
      3. Click on the script to open it in Python (or in your Python Integrated Development Environment [IDE] such as Canopy Enthought)
      4. Set the parameters at the top of the script as appropriate so that you input the topic-counts.txt file from its folder (which will also be the output folder), tell the script how many topics are in your topic model (the number you told Mallet to create), and the desired number of top words in each topic to process (e.g., 100). 
      5. Run the script, which will generate a series of text documents, one for each topic in your topic model. These documents will be titled in the format "Topic1.txt," and each will contain the top most frequent words in that topic (limited to the number you requested in the script), but with each word repeated a number of times to represent its relative frequency weight. (For example, a topic-document might begin, "books books books books books books reading reading reading reading author author author print print print publisher publisher ebook ebook, [etc.]...")
    2. Input the generated topic-documents into Lexos
      1. Go to Lexos online and upload your topic-document files.
      2. Then in the Lexos interface go to "Analyze" and process your uploaded files through one of the clustering methods:
        1. Clustering > Hierarchical (creates a dendrogram tree)
        2. Clustering > K-Means (creates a visualization of topic clusters in a Voronoi space)
        3. Similarity Query (creates a cosine similarity ranking of files to a base comparison file)
    3. You can also upload the topic-documents to Voyant Tools' "Scatterplot" tool (documentation), which is a PCA clustering tool, and experiment with that. (Note: the easiest way to upload a folder of text documents to Voyant is to make a .zip file from the files in the folder, then upload the zip file.)
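Step (E)'s suggestion to sort keys.txt by topic weight can also be done with a short script. Each line of Mallet's keys.txt is tab-separated: the topic number, the topic weight (the optimized Dirichlet parameter when --optimize-interval is used), then the top key words. A sketch (the sample lines here are invented for illustration):

```python
# keys.txt format: <topic number>\t<weight>\t<space-separated key words>
sample_keys = """0\t0.085\tbooks print author publisher ebook
1\t0.412\treading library novel fiction readers
2\t0.033\tdata model corpus words topics"""

rows = []
for line in sample_keys.splitlines():
    number, weight, words = line.split("\t")
    rows.append((int(number), float(weight), words))

# Sort topics by weight, heaviest (most prominent) first.
for number, weight, words in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"topic {number}  weight {weight:.3f}  {words}")
```

To run this on a real model, replace the sample string with the contents of your own keys.txt file.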
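Scott Kleinman's topicsToDocs.py script is not reproduced here, but the transformation it performs as described in step (G) can be sketched. Each line of Mallet's word-topic counts file gives a word id, the word, and then topic:count pairs; the sketch below (with invented sample data) rebuilds one "topic-document" per topic, repeating each word in proportion to its count:

```python
from collections import defaultdict

# Each line of Mallet's --word-topic-counts-file output:
# <word id> <word> <topic>:<count> [<topic>:<count> ...]
sample_counts = """0 books 0:6 1:1
1 reading 0:4
2 rain 1:3
3 author 0:3 1:2"""

topic_words = defaultdict(list)   # topic number -> [(count, word), ...]
for line in sample_counts.splitlines():
    parts = line.split()
    word = parts[1]
    for pair in parts[2:]:
        topic, count = pair.split(":")
        topic_words[int(topic)].append((int(count), word))

# Build one "topic-document" per topic: words sorted by count,
# each repeated as many times as its count in that topic.
topic_docs = {}
for topic, counted in topic_words.items():
    counted.sort(reverse=True)    # most frequent words first
    topic_docs[topic] = " ".join(w for c, w in counted for _ in range(c))

print(topic_docs[0])
```

Writing each entry of topic_docs to its own .txt file (e.g., "Topic1.txt") yields exactly the kind of topic-documents that step (G) uploads to Lexos.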





