

Topic Modeling Workshop

Page history last edited by Alan Liu 6 years, 3 months ago

Alan Liu & Lindsay Thomas (28 August 2015)


This workshop contains an overview of topic modeling for beginners and a lesson plan with step-by-step instructions for taking a group through making a topic model using Mallet, examining Mallet output files, visualizing topics as word clouds using Lexos, and clustering topics in topic models to facilitate interpretation. Alan Liu and Lindsay Thomas (assisted by Jeremy Douglass and Scott Kleinman) co-led the inaugural version of the workshop (in a hybrid face-to-face and Skype environment) on 28 August 2015 for a group of 15 participants, including faculty and graduate students in UC Santa Barbara's English, Writing Program, History, and Sociology departments (with some participants, including Lindsay, Skyping in).


Theory: Overview of Topic Modeling: The first section of the workshop consists of a 25-minute exposition (with slides) of topic modeling, building especially on explanations for beginners posted by Edwin Chen and Ted Underwood.  (Selected readings about topic modeling and examples of research using topic modeling are also provided.)
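The core intuition behind LDA-style topic modeling, as presented in the Chen and Underwood explanations, is that each topic is a probability distribution over words and each document is a mixture of topics. A toy generative sketch can make this concrete (the topics, words, and proportions below are invented purely for illustration):

```python
import random

# Two toy "topics": each is a probability distribution over words.
topics = {
    "publishing": {"books": 0.4, "print": 0.3, "author": 0.2, "ebook": 0.1},
    "weather":    {"rain": 0.5, "cloud": 0.3, "storm": 0.2},
}

def sample_word(distribution, rng):
    """Draw one item according to its probability in the distribution."""
    items = list(distribution)
    weights = [distribution[i] for i in items]
    return rng.choices(items, weights=weights, k=1)[0]

def generate_document(topic_mixture, length, seed=0):
    """Generate a document: pick a topic for each word slot, then a word from that topic."""
    rng = random.Random(seed)
    doc = []
    for _ in range(length):
        topic = sample_word(topic_mixture, rng)   # the mixture is itself a distribution
        doc.append(sample_word(topics[topic], rng))
    return doc

# A document that is 70% "publishing" and 30% "weather":
doc = generate_document({"publishing": 0.7, "weather": 0.3}, length=10)
print(doc)
```

Topic modeling runs this process in reverse: given only the documents, it infers a set of topics and each document's mixture of them.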


Practice: Hands-on Tutorial: The second section of the workshop is an efficient, hands-on lesson plan in topic modeling using sample corpora. (Not covered are pre-processing tasks, including: installation of Mallet [participants are referred to the Programming Historian's guide]; cleaning of texts; studying a corpus with text-analysis tools such as AntConc, Voyant, or the Natural Language Toolkit (NLTK) to find important words to add to a stop list or phrases to consolidate into single tokens; and chunking or segmenting of texts.) In the inaugural workshop on 28 August 2015, participants broke into two groups--one led face-to-face by Alan and the other led in a mixed face-to-face and Skype screen-sharing environment by Lindsay (who herself was Skyping in). In advance, Mallet and demonstration corpora were installed on the machines used by these two groups. Alan and Lindsay led the groups through the lesson plan, adding explanations and responding to questions.
         Given the combined technical and conceptual difficulty for beginners, plus the complications of a hybrid face-to-face and virtual environment, these hands-on tutorials succeeded surprisingly well in the inaugural workshop. One criterion of success is that beginners appear to have gotten past the mental threshold to the point where they say to themselves, "I can do this. I'm still not sure of some of the details. But I know how to get started and I'll figure it out." Another criterion of success was unexpected: the remarkable robustness of questions and discussion during the tutorials. It turned out that the tutorials were not just "hands-on" but "minds-on." In Alan's group, for example, participants from English, Writing, History, and Sociology asked both technical questions and penetrating methodological questions about the nature, goals, premises, implications, and impact of topic modeling for research.
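The chunking or segmenting of texts mentioned above as a pre-processing step (out of scope for the lesson plan) amounts to splitting long texts into fixed-size pieces so that Mallet treats each piece as a document. A minimal sketch, with an arbitrary chunk size:

```python
def chunk_text(text, words_per_chunk=500):
    """Split a text into consecutive chunks of roughly equal word counts."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

# Illustrative input: a 900-word "text" built by repetition.
sample = "the quick brown fox jumps over the lazy dog " * 100
chunks = chunk_text(sample, words_per_chunk=250)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk would then be written out as its own .txt file in the corpus folder that Mallet's import-dir command reads.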




Overview: Theory of Topic Modeling


Some Readings:

Overviews and "Simple" Explanations (Recommended = especially recommended for beginners across research disciplines)


More Advanced Explanations & Examples of Research Using Topic Models


Slideshow: Alan Liu's Explanation of Topic Modeling for Humanists

(building on Edwin Chen and Ted Underwood):



Practice: Hands-on Tutorial


Preliminary Steps (Installing Mallet): 


Topic Modeling  -- Step by Step:

        Sample corpora for topic modeling exercises: [download a demonstration corpus from Alan Liu's DH Toychest]:

  • (A) Open a command shell ("command line") window
    1. Windows "command prompt" (using shell command language based on MS-DOS) -- quick guide -- tip on enabling copy/paste operations
    2. Mac "terminal" (using shell command language based on Unix) -- cheat sheet 
    3. Linux command line (also uses bash)
  • (B) Navigate to the Mallet folder (directory)
    1. Windows: type the following at the command line, followed by a return (helpful tip: the <F5> function key pastes in the previously used command)
      •  cd c:\mallet
    2. Mac: type the following at the command line
      • cd /Users/yourusername/mallet
      • As with the Windows command above, this will depend on where and how you have saved the MALLET folder you downloaded when you installed the package. The above command assumes you've saved MALLET (and titled the folder it's in "mallet") under your home directory. 

  • (C) Input a folder of text files and process them into a .mallet data file
    --Use the command below, varying the path and file names as desired (the descriptive placeholder phrases indicate paths/filenames you supply). Use backslashes in Windows, forward slashes on Macs. There must be no hidden returns in the command. Best practice is to set the job up in a text document (with "word wrap" view turned off) and copy/paste the command into the command shell.
    1. bin\mallet import-dir --input path to folder --output path and filename of desired output data file with .mallet extension --keep-sequence --remove-stopwords
    2. Example (Windows): bin\mallet import-dir --input C:\workspace\corpus-sample --output C:\workspace\topic-model\corpus-sample.mallet --keep-sequence --remove-stopwords
    3. Example (Mac): ./bin/mallet import-dir --input /Users/we1s/corpus-sample --output /Users/we1s/topic-model/corpus-sample.mallet --keep-sequence --remove-stopwords
  • (D) Create the topic model ("train" the topic model) -- Use the following command.
    1. bin\mallet train-topics --input path and filename of the previously created data file with .mallet extension --num-topics desired number of topics --optimize-interval 20 --output-state path to output folder\topic-state.gz --output-topic-keys path to output folder\keys.txt --output-doc-topics path to output folder\composition.txt --word-topic-counts-file path to output folder\topic-counts.txt
    2. Example (Windows): bin\mallet train-topics  --input C:\workspace\topic-model\corpus-sample.mallet --num-topics 20 --optimize-interval 20 --output-state C:\workspace\topic-model\topic-state.gz --output-topic-keys C:\workspace\topic-model\keys.txt --output-doc-topics C:\workspace\topic-model\composition.txt --word-topic-counts-file C:\workspace\topic-model\topic-counts.txt
    3. Example (Mac): ./bin/mallet train-topics --input /Users/we1s/topic-model/corpus-sample.mallet --num-topics 20 --optimize-interval 20 --output-state /Users/we1s/topic-model/topic-state.gz --output-topic-keys /Users/we1s/topic-model/keys.txt --output-doc-topics /Users/we1s/topic-model/composition.txt --word-topic-counts-file /Users/we1s/topic-model/topic-counts.txt
  • (E) Examine the topic model output files 
    1. Examine the keys.txt, composition.txt, and topic-counts.txt output files for the topic model. It helps to copy and paste them into a spreadsheet to see their structure. (If you pull keys.txt into a spreadsheet, you can sort by topic weight to see the relative ranking of the topics in importance.)
  • (F) Visualize the topics as word clouds using Lexos
    1. Go to Lexos online
    2. Navigate in the Lexos interface to "Visualize" > "Multicloud"
    3. In the options for a multicloud (or multiple word clouds), choose "topic clouds" (instead of the default "document clouds")
    4. Upload the topic-counts.txt file from your topic model
    5. Click on "Get graphs"
  • (G) Experiment with clustering topics
    1. Convert the topics in your topic model as represented/weighted in the topic-counts.txt file into simple collections of words, one collection for each topic (with words repeated in proportion to their weight in that topic). We call these "topic-documents." Essentially, this gives us a simplified representation of each topic as an ordinary document that can itself be analyzed like other documents.
      1. First make a copy of the topic-counts.txt file from your topic model and place it in its own folder or subfolder. (This allows you to create topic-documents in their own folder, rather than have them clutter up your topic model folder itself.)
      2. From the python_scripts folder on the WE1S Google Drive, download Scott Kleinman's topicsToDocs.py script and save it in the Python Scripts folder on your local machine
      3. Click on the script to open it in Python (or in your Python Integrated Development Environment [IDE] such as Canopy Enthought)
      4. Set the parameters at the top of the script as appropriate so that you input the topic-counts.txt file from its folder (which will also be the output folder), tell the script how many topics are in your topic model (the number you told Mallet to create), and the desired number of top words in each topic to process (e.g., 100). 
      5. Run the script, which will generate a series of text documents, one for each topic in your topic model. These documents will be titled in the format "Topic1.txt," and each will contain the top most frequent words in that topic (limited to the number you requested in the script), but with each word repeated a number of times to represent its relative frequency weight. (For example, a topic-document might begin, "books books books books books books reading reading reading reading author author author print print print publisher publisher ebook ebook, [etc.]...")
    2. Input the generated topic-documents into Lexos
      1. Go to Lexos online and upload your topic-document files.
      2. Then in the Lexos interface go to "Analyze" and process your uploaded files through one of the clustering methods:
        1. Clustering > Hierarchical (creates a dendrogram tree)
        2. Clustering > K-Means (creates a visualization of topic clusters in a Voronoi space)
        3. Similarity Query (creates a cosine similarity ranking of files to a base comparison file)
    3. You can also upload the topic-documents to Voyant Tools' "Scatterplot" tool (documentation), which is a PCA clustering tool, and experiment with that. (Note: the easiest way to upload a folder of text documents to Voyant is to make a .zip file from the files in the folder, then upload the zip file.)
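Step (E)'s suggestion to sort keys.txt by topic weight can also be done with a short script. Each line of Mallet's keys.txt is tab-separated: the topic number, the topic weight (the optimized Dirichlet parameter when --optimize-interval is used), then the top key words. A sketch (the sample lines here are invented for illustration):

```python
# keys.txt format: <topic number>\t<weight>\t<space-separated key words>
sample_keys = """0\t0.085\tbooks print author publisher ebook
1\t0.412\treading library novel fiction readers
2\t0.033\tdata model corpus words topics"""

rows = []
for line in sample_keys.splitlines():
    number, weight, words = line.split("\t")
    rows.append((int(number), float(weight), words))

# Sort topics by weight, heaviest (most prominent) first.
for number, weight, words in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"topic {number}  weight {weight:.3f}  {words}")
```

To run this on a real model, replace the sample string with the contents of your own keys.txt file.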
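Scott Kleinman's topicsToDocs.py script is not reproduced here, but the transformation it performs as described in step (G) can be sketched. Each line of Mallet's word-topic counts file gives a word id, the word, and then topic:count pairs; the sketch below (with invented sample data) rebuilds one "topic-document" per topic, repeating each word in proportion to its count:

```python
from collections import defaultdict

# Each line of Mallet's --word-topic-counts-file output:
# <word id> <word> <topic>:<count> [<topic>:<count> ...]
sample_counts = """0 books 0:6 1:1
1 reading 0:4
2 rain 1:3
3 author 0:3 1:2"""

topic_words = defaultdict(list)   # topic number -> [(count, word), ...]
for line in sample_counts.splitlines():
    parts = line.split()
    word = parts[1]
    for pair in parts[2:]:
        topic, count = pair.split(":")
        topic_words[int(topic)].append((int(count), word))

# Build one "topic-document" per topic: words sorted by count,
# each repeated as many times as its count in that topic.
topic_docs = {}
for topic, counted in topic_words.items():
    counted.sort(reverse=True)    # most frequent words first
    topic_docs[topic] = " ".join(w for c, w in counted for _ in range(c))

print(topic_docs[0])
```

Writing each entry of topic_docs to its own .txt file (e.g., "Topic1.txt") yields exactly the kind of topic-documents that step (G) uploads to Lexos.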





