Meeting 5, 1/28/15
I have run some preliminary topic models on the plain text files Alan prepared from his Jan 14 NYT run (also sent around as a zip via email). I've placed the results in our shared Google Drive folder (topic_models > mallet; I created a separate "mallet" folder in case other experiments are done using other topic modeling programs/tools). I'm not sure I know how to start interpreting the results yet, though some are evocative/weird: check out topic 33 in the 75-topic run, for example ["iran rice rouhani ahmadinejad wallerstein iranian princeton campuses women security noorbaksh tehran condoleezza ali intellectuals report harris retire expelled"]. Slightly more promising is the research I've done to determine best practices for topic modeling. The following is an explanation of what I've done so far, what I've discovered about best practices, and some possible next steps.
Contents of the Google Drive Folder
I used Mallet to create the topic models I've shared in the Google Drive folder. I did four runs, each with a different number of topics: 25, 50, 75, and 100. In the "mallet" folder, you will find four different folders, one for each run I did (25 topics, 50 topics, etc.). Each one of these folders contains the following:
- text_harvest mallet command: This txt file contains the Mallet command used for the topic model run.
- text_harvest keys: This file lists each topic, the Dirichlet parameter for that topic, and the top 19 words associated with that topic. The Dirichlet parameter gives a sense of the relative "weight" of the topic.
- text_harvest composition (in .txt, .csv, and .xlsx formats): This is a standard Mallet output file. It lists each document (columns 1 & 2: an index and the file name), the topic most strongly associated with that document (column 3), and that topic's proportion (column 4). The rest of the columns give progressively lower-ranked topic/proportion pairs for that document. (A quick way to inspect these files from the command line is sketched just after this list.)
- text_harvest topic-state: A compressed text file containing all of the words in the corpus with their topic assignments.
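For a quick sanity check of these outputs, a few command-line one-liners should do. This is a sketch: the file names assume the 50-topic run follows the same naming pattern as the command shown below, and the column layout is the one described above.

# Peek at each topic's Dirichlet parameter and top words:
head text_harvest50_keys.txt

# Tally how often each topic is the single strongest topic for a document
# (skip the header line, pull column 3, count):
tail -n +2 text_harvest50_composition.txt | cut -f3 | sort -n | uniq -c | sort -rn

# Peek at the word-level topic assignments without unpacking the file:
gunzip -c text_harvest50_topic-state.gz | head

The tally from the composition file is also a quick way to see whether documents are piling into just a few topics (see "Number of Topics" under Best Practices below).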
Parsing the Mallet command
In general, the mallet command txt files look like this:
./bin/mallet train-topics --input text_harvest.mallet --num-topics 25 --optimize-interval 20 --output-state text_harvest25_topic-state.gz --output-topic-keys text_harvest25_keys.txt --output-doc-topics text_harvest25_composition.txt
- The ./bin/mallet prefix is because I have a Mac; the equivalent commands on Windows machines are slightly different (bin\mallet). See this great tutorial from the Programming Historian to get started with Mallet.
- --input: Tells Mallet which file to use as input. To run a topic model, you first need to create a .mallet file from your plain text files. I created the "text_harvest.mallet" file in a previous step by following the directions on importing data in the Programming Historian tutorial (the import command is sketched just after this list).
- --num-topics: The number of topics you want to model. Pretty straightforward.
- --optimize-interval: This turns on hyperparameter optimization, which is something I don't fully understand yet. As far as I can tell, it tells Mallet to re-estimate the Dirichlet priors every 20 iterations, which allows some topics to carry more weight than others instead of forcing all topics to be equally prominent. The research I've done suggests that optimized topic models generally produce better results, and 20 is the interval that is generally used.
- --output-state / --output-topic-keys / --output-doc-topics: These tell Mallet which files to output. There are several other output options; it will take some more fiddling -- and some more knowledge of how we eventually want to visualize and parse the results -- to determine which ones we may want to use.
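For reference, the import step that produces the .mallet file looks roughly like this (a sketch following the Programming Historian tutorial; "nyt_texts" is a placeholder for whatever directory holds the plain text files):

# Import a directory of plain text files, keeping word order (required for
# topic modeling) and removing Mallet's default English stop words:
./bin/mallet import-dir --input nyt_texts --output text_harvest.mallet --keep-sequence --remove-stopwords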
Best Practices
- Hyperparameter optimization: Most practitioners seem to agree that enabling hyperparameter optimization produces better results. See the Appendix to "What can topic models of PMLA teach us about the history of literary scholarship?" by Ted Underwood and Andrew Goldstone for an initial discussion of hyperparameter optimization. They also discuss this issue in the Appendix to their forthcoming "The Quiet Transformations of Literary Studies." And if you really want to start down the rabbit hole, take a look at Hannah Wallach et al.'s "Rethinking LDA: Why Priors Matter," which describes a series of experiments on how the choice (and optimization) of priors affects the resulting models.
- Number of Topics: Wallach et al. also emphasize that, thanks to hyperparameter optimization, "the risk of using too many topics is lower than the risk of using too few." There is no standard way to discover the optimal number of topics for a given corpus, and most practitioners seem to play around with different numbers before settling on one that makes the most sense for their corpus. One generally agreed-upon way to check whether one's topics make sense is to look at the composition file: if documents tend to cluster into a small number of topics, or if a small number of topics recurs across most of the documents, that is generally an indication that you need to increase the number of topics. For the preliminary corpus, it looks like 50-75 topics produces well-distributed, potentially meaningful results.
- Stop words: Developing our own stop word list will be important, I think. In general, it seems that developing a robust stop word list attuned to our corpus is probably more important than attempting to fine-tune the corpus prior to topic modeling (although Wallach et al. also suggest that finely tuned stop word lists matter less for optimized models than for non-optimized ones). This means, I think, that we need to worry less about determining what specific search terms to use when searching for texts to include in the corpus, and more about fine-tuning the corpus once we have it assembled. In general, practitioners seem to cast as wide a net as they can, given the parameters of their particular questions, when collecting their corpora; they then focus on fine-tuning the corpus using stop words to achieve more accurate -- or more interesting -- results. Generally speaking, the following guidelines can be employed for constructing a generalized stop word list (an example import command using a custom stop list follows the list):
- Eliminate common words. Mallet comes loaded with a standard English stop word list.
- Eliminate words that are arbitrarily distributed (given names, abbreviations). If you look at some of the preliminary topic modeling results with Alan's sample corpus, names like "nicholas" and "kristoff" already appear in weird places. The model could be improved by taking names out.
- Standardize to US spelling.
- Remove low-frequency words (words outside roughly the top 10,000-100,000, depending on the size of the corpus).
- Remove a few common categories of OCR errors.
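Once we have our own list, my understanding is that the import step can take an extra stop word file (one word per line) on top of Mallet's default English list. A sketch, with hypothetical file names:

# our_stopwords.txt: one word per line (names, OCR junk, corpus-specific noise)
./bin/mallet import-dir --input nyt_texts --output text_harvest.mallet --keep-sequence --remove-stopwords --extra-stopwords our_stopwords.txt

(--stoplist-file would replace the default English list entirely rather than adding to it.)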
Next Steps for Topic Model Experiments
- Experiment with LDAvis to start visualizing results.
- Practice more with Gephi to visualize results. I played around a bit with the MALLET-to-Gephi Data Stacker to import data into Gephi for visualization, but I wasn't really able to understand the results, or Gephi, very well. I was also thrown off for a bit by the current problem with Gephi and OS X Yosemite (see https://lbartkowski.wordpress.com/2014/11/28/gephi-0-8-2-on-apple-osx-yosemite/ for instructions on how to get around this problem -- I now have Gephi running on my machine).
While there is still more experimentation to be done, especially with visualizations, I think we are now at a point where spending much more time tinkering with topic models of a sample corpus may end up being a time suck. Apart from the really general list of best practices I've tried to assemble here, there is still no real "standard operating procedure" for topic modeling among digital humanists, at least as far as I can tell. So much depends on the individual corpus being modeled. This is why I think we should soon focus our efforts on assembling the corpus (and deciding what goes into it). Once we have assembled our corpus, we can start thinking more about fine-tuning the topic models.