Interpreting Topic Models Workshop (2016-01-18)

Preliminaries (very quick status reports & updates on ongoing development work)
Interpreting Topic Models (main focus of meeting)
Overview
- Two topic models serve as the basis of our discussion today:
- Alan used the subcorpus we practiced topic modeling on during our last meeting: New York Times 2010-2014 articles mentioning "humanities" and "liberal arts" (deduplicated). He improved the scrubbing fixes and stopword list for these materials. Then he created topic models of the scrubbed articles at scales ranging from 25 to 1,000 topics (specifically: 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000 topics). The 25-topic version of the topic model is presented in the following document: liu-analysis.docx (Handouts will be provided at the meeting; those not present at UCSB should make a print-out to facilitate discussion.)
- Scott used a different subcorpus that combined LA Times, New York Times, and The Guardian articles from 2013-2014. He created deduplicated, scrubbed topic models for 25 and 50 topics. The 25-topic version of this topic model is presented in the following document (handouts also provided at UCSB; others should make a print-out): kleinman-analysis.docx
- For practical reasons (simplicity and print-out size), our meeting today will focus just on the 25-topic versions of the above models, though for comparison some views of the 50-topic versions will also be introduced in the discussion.
- The main purpose of our meeting today is to use these demonstration topic models to think through the conceptual, workflow, and logistical issues involved in interpreting WE1S topic models of public discourse relating to the humanities.
Materials for Our Discussion (materials are available online in links below and also in a folder for this meeting on the WE1S Google Drive site; some materials will also be provided in hard-copy form at the UCSB location of the meeting)
- Main Documents for Discussion:
- "Alan Liu's Topic Model (25 Topics) of New York Times from 2010-2014"--presented in liu-analysis.docx . The document also includes for comparison at the end some views of the 50-topic version of the model. (Printouts will be handed out at the meeting at UCSB. Other participants should download and print in advance if possible. Printouts are optimal for facilitating comparing/contrasting topic models during discussion.)
- "Scott Kleinman's Topic Model (25 Topics) of LA Times, New York Times, & The Guardian from 2013-2014"--presented in kleinman-analysis.docx. (Printouts will be handed out at the meeting at UCSB. Other participants should download and print in advance if possible.) A spreadsheet of Scott's 50-topic model with topic labels can be found at labelled_keys.xlsx.
- Supplementary materials, resources, scripts, etc. (These can be downloaded from the WE1S Google Drive as needed by meeting participants. The Google Drive site is restricted to project members.):
- Corpus sources:
- articles-scrubbed.zip (the subcorpus of NY Times articles produced by Alan after scrubbing and stopword deletion)
- Scrubbing script:
- config.py (latest version of the Python script that configures scrub.py for scrubbing. This script contains the actual information about fixes, consolidations, etc. to apply when scrubbing.)
- scrub.py (the Python script that actually executes the scrubbing as guided by config.py. This script does not change.)
- Stopword lists:
- combined-mallet-we1s-stopwords.txt (stopword list combining the standard Mallet stopword list and the most recent WE1S stopword list, with both all-lower-case and proper-case (first letter capitalized) versions of each word; a sketch of generating the cased variants appears at the end of this materials list)
- we1s-stopwords-master-file.xlsx (master spreadsheet of stopwords used to generate text-file versions of stopword lists. Keeping stopwords in a spreadsheet facilitates sorting and other management of the list.)
- Topic model results files:
- 25-topic model produced by Alan for NY Times 2010-14:
- keys.txt
- keys.xlsx (spreadsheet version of keys.txt, with topics sorted by weight)
- composition.txt (file showing topic weights in specific documents; WE1S does not yet have an easy or intuitive interface for presenting this information)
- topic_counts-25.txt (the Mallet-generated file that can be used to create "topic documents" and "topic clouds")
- Clustering visualizations:
- topic-models-nyt2010-14-experimental.zip (all the topic models produced by Alan for the NY Times 2010-14 subcorpus. Models for number of topics: 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000)
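The following is a minimal sketch, not the project's actual build script, of how a combined stopword file with both casings of each word (as described for combined-mallet-we1s-stopwords.txt above) might be generated; the input file names are assumptions chosen for the example.

    # Minimal sketch: merge an exported WE1S stopword list with Mallet's standard
    # English list, then write all-lower-case and proper-case versions of each word.
    # The two input file names below are assumptions, not actual project files.
    words = set()
    for path in ["mallet-stopwords-en.txt", "we1s-stopwords.txt"]:
        with open(path, encoding="utf-8") as f:
            words.update(w.strip().lower() for w in f if w.strip())

    with open("combined-mallet-we1s-stopwords.txt", "w", encoding="utf-8") as out:
        for w in sorted(words):
            out.write(w + "\n")               # all-lower-case version
            out.write(w.capitalize() + "\n")  # proper-case (first letter capitalized) version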
Discussion Agenda
Optional methodological readings & topic-modeling systems/interfaces -- See Research on Topic Models in the new Reference Library section of the project site. The materials being gathered there can help us think about our goals and methods in modeling the "humanities" as a "concept" or "discourse." (These materials do not need to be read now. They will be worth looking at as our project moves forward.)
(1) Exercise: group "reading" and discussion of Alan's & Scott's 25-topic models (with some comparison to their 50-topic models)
(2) Possible WE1S project workflow for interpreting topic models (Draft 1) (including parallel, alternative, or conjoined human and machine processes):
Human Processes (typical procedure for steps requiring judgment will be to have a panel of three or more people perform the step and compare results; the hermeneutical rhythm will typically consist of iterative cycles of observations suggesting research questions, and research questions suggesting ways to sharpen observation)
Machine Processes (we may be able to automate some steps and sequences)
Step 1
Human: Assess the topic models to determine an appropriate number of topics. We may decide to generate one, two, or three numbers of topics for simultaneous interpretation. (Questions: Can we define criteria for a "best" topic model? Do we know of any published theory or methods for choosing the right number of topics? Cf. Scotts issues for discussion.pdf)
Machine: Generate topic models at many levels of granularity--e.g., 25, 50, 150, 200, 250, 300, 350, 400, 450, 500. (See the scripting sketch below.)
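The machine step above can be scripted. The following is a rough sketch only (not the procedure Alan actually used): it assumes the mallet command is on the PATH and that the scrubbed articles have already been imported into a MALLET instance file named articles.mallet (a name chosen for this example).

    # Sketch: train topic models at several granularities from one imported corpus.
    import subprocess

    TOPIC_COUNTS = [25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000]

    for n in TOPIC_COUNTS:
        subprocess.run(
            ["mallet", "train-topics",
             "--input", "articles.mallet",   # assumed name of the imported corpus
             "--num-topics", str(n),
             "--optimize-interval", "10",
             "--output-topic-keys", f"keys-{n}.txt",
             "--output-doc-topics", f"composition-{n}.txt",
             "--word-topic-counts-file", f"topic_counts-{n}.txt"],
            check=True)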
Step 2
Human: Initial analysis of the topic models.
- Assign labels to topics (assisted by the automated labeling process listed under "Machine" below).
- Identify and label any major clusters of topics.
- Flag any illegible topics for attention.
Machine: Assemble materials to facilitate interpretation:
- Create sorted keys files in a spreadsheet.
- Create topic cloud visualizations.
- Create clustering visualizations. (Testing phase: compare a human-panel-only clustering with a machine clustering of topics.)
- Assess "nearness" of topics. (We don't yet have a method for this, but cf. the Goldstone DFR Browser "scaled view.")
- If possible, auto-label topics with the top 2-4 most frequent words in each topic, based on an algorithm that establishes a minimum threshold of proportional frequency and decides what to do when one, two, three, or four top words are, or are not, significantly more important than the others. (See the labeling sketch below.)
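As a starting point for the auto-labeling idea above, here is a sketch of one possible algorithm (not a settled WE1S method). It reads the file written by MALLET's --word-topic-counts-file option, in which each line has the form "<word-index> <word> <topic>:<count> ..."; the 1% proportional-frequency threshold and the file name are assumptions for illustration.

    # Sketch: label each topic with its top 2-4 words above a proportional-frequency threshold.
    from collections import defaultdict

    MIN_PROPORTION = 0.01        # assumed threshold for a "significantly important" word
    MIN_WORDS, MAX_WORDS = 2, 4

    topic_word_counts = defaultdict(lambda: defaultdict(int))
    with open("topic_counts-25.txt", encoding="utf-8") as f:   # assumed file name
        for line in f:
            parts = line.split()
            word = parts[1]
            for pair in parts[2:]:
                topic, count = pair.split(":")
                topic_word_counts[int(topic)][word] += int(count)

    for topic, counts in sorted(topic_word_counts.items()):
        total = sum(counts.values())
        ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
        keep = [w for w, c in ranked if c / total >= MIN_PROPORTION][:MAX_WORDS]
        if len(keep) < MIN_WORDS:            # fall back to the top words if too few pass
            keep = [w for w, _ in ranked[:MIN_WORDS]]
        print(topic, " / ".join(keep))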
Step 3
Human: Detailed analysis of the topic model (part I: total corpus, synchronic analysis).
- Study major topics and clusters of topics.
- The human panel reads sample articles and compares them to the topic proportions recorded in the topic-counts.txt file created by Mallet. (This is a sanity check; see the sketch below for one way to pull the top documents for a topic.)
- The human panel writes up analytical notes and observations and compares them.
- Members of the human panel write up a report.
Machine: (none specified for this step)
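For the sanity check above, the panel needs to know which articles carry the highest weight for a given topic. A minimal sketch follows; it assumes the per-document topic proportions are in a MALLET doc-topics (composition) file laid out as "<doc-index> <filename> <proportion for topic 0> <proportion for topic 1> ..." (the layout of newer MALLET versions), and the file name composition.txt is an assumption.

    # Sketch: list the documents with the highest proportion of a given topic.
    def top_documents(composition_path, topic, n=10):
        rows = []
        with open(composition_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):                 # skip header line, if present
                    continue
                fields = line.rstrip("\n").split("\t")
                filename = fields[1]
                proportions = [float(x) for x in fields[2:]]
                rows.append((proportions[topic], filename))
        return sorted(rows, reverse=True)[:n]

    for weight, doc in top_documents("composition.txt", topic=3):
        print(f"{weight:.3f}  {doc}")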
Step 4
Human: Detailed analysis of the topic model (part II: comparative analysis).
- Study major correlations and differences between any two or three parts of our corpus of interest.
Machine: Create a view of the topic model that compares two or more parts of our corpora (e.g., NY Times vs. The Guardian) by the topics and topic weights they contain. We don't yet have an interface or method for using the composition.txt files produced by Mallet to do this. (Cf. the Goldstone DFR Browser "document view," which shows the topics in a single document.) (Alan's experiment) (See the comparison sketch below.)
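While we lack an interface for this, the underlying calculation can be roughed out as below. This is only a sketch: the source tags assumed in the file names ("nytimes", "guardian", "latimes") and the doc-topics layout described in the previous sketch are assumptions for illustration, not features of the actual corpus.

    # Sketch: average topic weights per corpus part, inferred from document file names.
    from collections import defaultdict

    SOURCES = ["nytimes", "guardian", "latimes"]   # hypothetical file-name tags

    def mean_weights_by_source(composition_path):
        sums, counts = {}, defaultdict(int)
        with open(composition_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                filename = fields[1].lower()
                proportions = [float(x) for x in fields[2:]]
                source = next((s for s in SOURCES if s in filename), "other")
                if source not in sums:
                    sums[source] = [0.0] * len(proportions)
                sums[source] = [a + b for a, b in zip(sums[source], proportions)]
                counts[source] += 1
        return {s: [v / counts[s] for v in sums[s]] for s in sums}

    for source, means in mean_weights_by_source("composition.txt").items():
        top5 = sorted(range(len(means)), key=lambda t: means[t], reverse=True)[:5]
        print(source, [(t, round(means[t], 3)) for t in top5])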
Step 5
Human: Detailed analysis of the topic model (part III: time-series analysis).
- Study trends in topics.
Machine: Create views of the topic model that show trend lines for topics (built by showing the weights of topics in documents at time 1, time 2, etc.). We don't yet have a method or tool for this, but cf. the following time-series views in the Goldstone DFR Browser: topics across years | topics within a single year. See also: the demo visualization of topics in State of the Union addresses; the TOM code demos; Robert K. Nelson, "Mining the Dispatch." (Alan's experiment) (See the trend-line sketch below.)
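A first pass at such trend lines could average a topic's weights by year, as in the sketch below. It assumes, purely for illustration, that each document's file name contains a four-digit year (e.g., nytimes-2013-0042.txt) and that the doc-topics file uses the layout described in the earlier sketches.

    # Sketch: mean weight of one topic per year, inferred from years in file names.
    import re
    from collections import defaultdict

    def topic_trend(composition_path, topic):
        sums, counts = defaultdict(float), defaultdict(int)
        with open(composition_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                fields = line.rstrip("\n").split("\t")
                match = re.search(r"(20\d\d)", fields[1])   # assumed year in the file name
                if not match:
                    continue
                year = int(match.group(1))
                sums[year] += float(fields[2 + topic])
                counts[year] += 1
        return {year: sums[year] / counts[year] for year in sorted(sums)}

    for year, weight in topic_trend("composition.txt", topic=3).items():
        print(year, round(weight, 4))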
Step 6
Human: Write up results:
- Create a set of key observations and a data set, and publish them (with a catchy title like "Humanities in Public Discourse: The Manual").
- Co-author a white paper with recommendations for humanities advocacy.
- Create a subset of the above as a brochure or infographic.
- Disseminate research methods and conclusions in academic talks, panels, papers, and articles.
Machine: (none specified for this step)