Interpreting Topic Models Workshop (2016-01-18)

Page history last edited by Alan Liu 8 years, 3 months ago



Preliminaries (very quick status reports & updates on ongoing development work)

  • Next Meeting Dates (options): WE1S project Calendar

    • Friday Jan. 29th? or Friday Feb. 5th?
  • Status of Current Project Stages
    • Overview
      • Current goal: topic modeling the ambient public discourse in which articles mentioning the humanities and liberal arts occur. Later goal: identifying articles in the corpus that are centrally "about" the humanities and topic modeling those. (See Alan's "neighborhood" analogy for this two-goal strategy.)
      • Our major project development stages/cycles to date:
        • (1) scraping,
        • (2) topic modeling workflow,
        • (Parallel development cycle) Manifest system and backend MongoDB system
      • Development work initiated at today's meeting:
        • (3) Interpretation of topic models
      • Next meeting devoted to:
        • (4) researching interfaces for presenting/exploring topic models. (See interfaces being accumulated on our Research on Topic Modeling page)
    • (Status of Development Cycle 1) Scraping
      • Finishing up ongoing scraping tasks
      • Scraping Globe and Mail
      • New scraping we can/want to do in Winter quarter? (Alan has about 180 hours of RA work left on his Academic Senate grant)
      • File renaming (renaming newly collected files named, for example, "File_1.txt" to "nyt-2007-h-1.txt")
        • Automate?
        • Possible need to rename all files to allow for chronological parsing--e.g., filenames that begin "20070906-nyt-h-1.txt"
    • (Status of Development Cycle 2) Topic modeling workflow: From Alan's email of Nov. 23, 2015: "Taking what we learned from [our rehearsal of Nov. 16, 2015], I have now upgraded our instructions and resources for topic modeling.  Please see the just-completed WE1S Topic Modeling Workflow (v 1.0), which is also linked from the bottom of the sidebar on the project PBWorks site. The workflow incorporates links to a new guide prepared by Scott to assist folks with installing a Python environment, installing Mallet, and using the command line. It also incorporates a link to a new Github repository set up by Scott and Jeremy for the latest versions of the project's Python scripts and extra stopword list. (These are in addition to the ready-to-deploy zip file containing resources and organized working folders we already had prepared.)"
      • De-duplication issues:
        • Deduping script
        • Deduping strategy (from Alan's email of Nov. 23, 2015: "We need perhaps a couple of people to study an example of de-duplicating a sample of our articles and advise Jeremy on how or if we should adjust the outcome of his deduping script.  (See the deduping instructions in the Topic Modeling Workflow. Currently, his script identifies close matches above a certain threshold, and we are just deleting all files in the right-hand column of a pair of matches.")
        • Automation (script) for deleting duplicates based on report of the deduping script 
      • Scrubbing issues:
    • (Status of parallel development cycle): Manifest Schema and MongoDB Database system  
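
The file-renaming task noted under Scraping above (for example, "File_1.txt" to "nyt-2007-h-1.txt") could be automated roughly as follows. This is only a sketch: it assumes that newly collected files are named "File_N.txt" and that the source ("nyt"), year, and search-term code ("h") are known for the whole batch. The names new_name and rename_batch are hypothetical and are not part of the project's existing scripts.

```python
import re
from pathlib import Path

# Assumed pattern for newly collected files, e.g. "File_1.txt".
RAW_NAME = re.compile(r"^File_(\d+)\.txt$")


def new_name(old_name, source, year, term):
    """Map "File_1.txt" to e.g. "nyt-2007-h-1.txt"; return None if no match."""
    match = RAW_NAME.match(old_name)
    if match is None:
        return None
    return f"{source}-{year}-{term}-{match.group(1)}.txt"


def rename_batch(folder, source, year, term, dry_run=True):
    """Rename every matching file in folder; return (old, new) name pairs.

    With dry_run=True nothing on disk changes, so the plan can be checked first.
    """
    plan = []
    for path in sorted(Path(folder).iterdir()):
        target = new_name(path.name, source, year, term)
        if target is None:
            continue  # leave files that do not look like "File_N.txt" alone
        plan.append((path.name, target))
        if not dry_run:
            path.rename(path.with_name(target))
    return plan
```

A chronological scheme such as "20070906-nyt-h-1.txt" would also need each article's publication date, which the file name alone does not supply.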


Interpreting Topic Models (main focus of meeting)



  •  Two topic models serve as the basis of our discussion today:
    • Alan used the subcorpus we practiced topic modeling on during our last meeting: New York Times 2010-2014 articles mentioning "humanities" and "liberal arts" (deduplicated).  He improved the scrubbing fixes and stopword list for these materials. Then he created topic models of the scrubbed articles at scales ranging from 25 to 1,000 topics (specifically: 25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000 topics). The 25-topic version of the topic model is presented in the following document: liu-analysis.docx (Handouts will be provided at the meeting; those not present at UCSB should make a print-out to facilitate discussion.)
    • Scott used a different subcorpus that combined LA Times, New York Times, and The Guardian articles from 2013-2014. After deduplicating and scrubbing the articles, he created topic models for 25 and 50 topics. The 25-topic version of this topic model is presented in the following document (handouts also provided at UCSB; others should make a print-out): kleinman-analysis.docx
  • For practical reasons (simplicity and print-out size), our meeting today will focus just on the 25-topic versions of the above models, though for comparison some views of the 50-topic versions will also be introduced in the discussion.
  • The main purpose of our meeting today is to use these demonstration topic models to think through the conceptual, workflow, and logistical issues involved in interpreting WE1S topic models of public discourse relating to the humanities.


Materials for Our Discussion (materials are available online in links below and also in a folder for this meeting on the WE1S Google Drive site; some materials will also be provided in hard-copy form at the UCSB location of the meeting)

  • Main Documents for Discussion:
    • "Alan Liu's Topic Model (25 Topics) of New York Times from 2010-2014"--presented in liu-analysis.docx. For comparison, the document also includes at the end some views of the 50-topic version of the model. (Printouts will be handed out at the meeting at UCSB. Other participants should download and print in advance if possible. Printouts are the best way to compare and contrast topic models during discussion.)
    • "Scott Kleinman's Topic Model (25 Topics) of LA Times, New York Times, & The Guardian from 2013-2014"--presented in kleinman-analysis.docx. (Printouts will be handed out at the meeting at UCSB. Other participants should download and print in advance if possible.) A spreadsheet of Scott's 50-topic model with topic labels can be found at labelled_keys.xlsx.
  • Supplementary materials, resources, scripts, etc. (These can be downloaded from the WE1S Google Drive as needed by meeting participants. The Google Drive site is restricted to project members.):
    • Corpus sources:
      • articles-scrubbed.zip (the subcorpus of NY Times articles produced by Alan after scrubbing and stopword deletion)
    • Scrubbing script:
      • config.py (latest version of the Python script that configures scrub.py for scrubbing. This script contains the actual information about fixes, consolidations, etc. to apply when scrubbing.)
      • scrub.py (the Python script that actually executes the scrubbing as guided by config.py. This script does not change.)
    • Stopword lists:
      • combined-mallet-we1s-stopwords.txt (stopword list including both the standard Mallet stopword list and the most recent WE1S stopword list, with both all-lower-case and proper-case (first letter capitalized) versions of each word.)
      • we1s-stopwords-master-file.xlsx (master spreadsheet of stopwords used to generate text-file versions of stopword lists. Keeping stopwords in a spreadsheet facilitates sorting and other management of the list.)
    • Topic model results files:
      • 25-topic model produced by Alan for NY Times 2010-14:
        • keys.txt
        • keys.xlsx (spreadsheet version of keys.txt, with topics sorted by weight)
        • composition.txt (file showing topic weights in specific documents. WE1S doesn't yet have an easy or intuitive interface for presenting this information.)
        • topic_counts-25.txt (the Mallet-generated file that can be used to create "topic documents" and "topic clouds")
        • Clustering visualizations:
      • topic-models-nyt2010-14-experimental.zip (all the topic models produced by Alan for the NY Times 2010-14 subcorpus. Models for number of topics: 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000)
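
The keys.xlsx file listed above is a spreadsheet version of keys.txt with topics sorted by weight. The same sort could be sketched in Python as follows, assuming keys.txt has three tab-separated fields per line (topic number, topic weight, top words). The function names read_keys, sort_by_weight, and to_csv are hypothetical.

```python
import csv
import io


def read_keys(lines):
    """Parse keys lines into (topic_id, weight, words) tuples.

    Assumes three tab-separated fields per line: topic number, weight,
    and a space-separated list of top words.
    """
    topics = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def sort_by_weight(topics):
    """Return topics ordered from heaviest to lightest."""
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


def to_csv(topics):
    """Render topics as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["topic", "weight", "keywords"])
    for topic_id, weight, words in topics:
        writer.writerow([topic_id, weight, " ".join(words)])
    return buffer.getvalue()
```

In use, the lines would come from something like open("keys.txt", encoding="utf-8"), and the CSV text would be written to a file for opening in a spreadsheet program.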


Discussion Agenda


Optional methodological readings & topic-modeling systems/interfaces -- See Research on Topic Models in the new Reference Library section of the project site.  The materials being gathered there can help us think about our goals and methods in modeling the "humanities" as a "concept" or "discourse." (These materials do not need to be read now. They will be worth looking at as our project moves forward.) 


(1) Exercise: group "reading" and discussion of Alan's & Scott's 25-topic models (with some comparison to their 50-topic models)


(2) Possible WE1S project workflow for interpreting topic models (Draft 1) (including parallel, alternative, or conjoined human and machine processes):


Human Processes

(Typical procedure for steps requiring judgment will be to have a panel of three or more people perform the step and compare results. The hermeneutical rhythm will typically consist of iterative cycles of observations suggesting research questions, and research questions suggesting ways to sharpen observation.)

Machine Processes

(We may be able to automate some steps and sequences)

Assess topic models to determine appropriate number of topics. We may decide to generate one, two, or three numbers of topics for simultaneous interpretation.
(Questions: Can we define criteria for "best" topic model? Do we know any published theory or methods for choosing right number of topics? Cf. Scotts issues for discussion.pdf)

Generate topic models at many levels of granularity--e.g., 25, 50, 150, 200, 250, 300, 350, 400, 450, 500
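
Generating models at many granularities is a natural candidate for a script. The sketch below builds one Mallet train-topics command per topic count. It assumes the corpus has already been imported into a file called "corpus.mallet" (a hypothetical name), and the output file names are also placeholders. By default the commands are only built, not run.

```python
import subprocess

# Topic counts from the list of models described earlier on this page.
TOPIC_COUNTS = (25, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 1000)


def build_command(num_topics, corpus="corpus.mallet", mallet="mallet"):
    """Return the argument list for one Mallet training run."""
    return [
        mallet, "train-topics",
        "--input", corpus,
        "--num-topics", str(num_topics),
        "--output-topic-keys", f"keys-{num_topics}.txt",
        "--output-doc-topics", f"composition-{num_topics}.txt",
        "--word-topic-counts-file", f"topic_counts-{num_topics}.txt",
    ]


def run_all(topic_counts=TOPIC_COUNTS, dry_run=True):
    """Build (and, unless dry_run, execute) one command per topic count."""
    commands = [build_command(n) for n in topic_counts]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Reviewing the output of run_all() before setting dry_run=False makes it easy to confirm paths and file names first.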

Initial analysis of topic models.
  1. Assign labels for topics (assisted by the automated labeling process listed among the machine steps below).
  2. Identify and label any major clusters of topics. 
  3. Flag for attention any illegible topics.
Assemble materials to facilitate interpretation:
  1. Create sorted keys files in a spreadsheet.
  2. Create topic cloud visualizations.
  3. Create clustering visualizations
    (Testing phase: compare a human-panel-only clustering with a machine clustering of topics)
  4. Assess "nearness" of topics (We don't yet have a method to do this; but cf. Goldstone DFR Browser "scaled view")
  5. If possible, auto-label topics with the top 2-4 most frequent words in a topic (based on an algorithm that establishes a minimum threshold of proportional frequency and decides what to do if there are one, two, three, or four top words that are, or are not, significantly more important than others.)
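
The auto-labeling idea in step 5 might be sketched as follows. This is one possible rule, not a settled algorithm: the top two words are always kept, and a third or fourth word is added only while its share of the topic's word tokens stays at or above a minimum threshold. The threshold value, the function names, and the assumed layout of the word-topic counts file (lines like "0 museum 3:12 7:5": a word index, the word, then topic:count pairs) are all assumptions to be checked against our actual Mallet output.

```python
from collections import defaultdict


def read_topic_counts(lines):
    """Collect {topic: {word: count}} from word-topic-count lines.

    Assumes each line looks like "0 museum 3:12 7:5": a word index, the word,
    then topic:count pairs.
    """
    counts = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        word = fields[1]
        for pair in fields[2:]:
            topic, count = pair.split(":")
            counts[int(topic)][word] = int(count)
    return dict(counts)


def auto_label(word_counts, min_share=0.05, min_words=2, max_words=4):
    """Label a topic with its top words.

    The first min_words words are always kept; further words (up to max_words)
    are kept only while their share of the topic's tokens is at least min_share.
    Both thresholds are placeholders to be tuned by the panel.
    """
    total = sum(word_counts.values())
    # Sort by descending count, breaking ties alphabetically for stable labels.
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    label = []
    for word, count in ranked[:max_words]:
        if len(label) >= min_words and count / total < min_share:
            break
        label.append(word)
    return " ".join(label)
```

A human panel would still need to review the suggested labels, since frequent words are not always the most meaningful ones.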

Detailed analysis of topic model (part I: total corpus, synchronic analysis).

  1. Study major topics and clusters of topics.
  2. Human panel reads sample articles and compares to the topic proportions found in the topic-counts.txt file created by Mallet. (This is a sanity check.)
  3. Human panel writes up analytical notes and observations, and compares.
  4. Members of the human panel write up a report.



Detailed analysis of topic model (part II: comparative analysis).

  1. Study major correlations/differences between any two or three parts of our corpus of interest.
Create view of topic model that compares two or more parts of our corpora (e.g., NY Times vs. The Guardian) for the topics and topic weights they contain. We don't yet have an interface or method of using the composition.txt files produced by Mallet to do this. (cf. Goldstone DFR Browser "document view," which shows topics in a single document) (Alan's experiment)
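
As a starting point for such a comparative view, the sketch below averages topic weights per source and lists the topics on which two sources differ most. It assumes a composition file with tab-separated lines of document number, file name, and then one weight per topic in topic order (some Mallet versions write topic/weight pairs instead, which would need a different parser). It also assumes file names begin with a source code such as "nyt"; the code "guardian" used in the usage note is hypothetical, as are all function names.

```python
from collections import defaultdict


def read_composition(lines):
    """Parse doc-topic lines into {file_name: [weight_0, weight_1, ...]}.

    Assumes tab-separated lines of: document number, file name, then one
    weight per topic in topic order.
    """
    docs = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        fields = line.rstrip("\n").split("\t")
        name = fields[1].rsplit("/", 1)[-1]  # drop any leading path
        docs[name] = [float(value) for value in fields[2:]]
    return docs


def mean_weights_by_source(docs):
    """Average each topic's weight over the documents of each source.

    The source is read from the file-name prefix, e.g. "nyt" in
    "nyt-2007-h-1.txt".
    """
    grouped = defaultdict(list)
    for name, weights in docs.items():
        grouped[name.split("-", 1)[0]].append(weights)
    return {
        source: [sum(column) / len(rows) for column in zip(*rows)]
        for source, rows in grouped.items()
    }


def largest_gaps(means, source_a, source_b, top_n=5):
    """List (topic, difference) pairs where two sources differ most."""
    gaps = [(topic, a - b) for topic, (a, b)
            in enumerate(zip(means[source_a], means[source_b]))]
    return sorted(gaps, key=lambda gap: abs(gap[1]), reverse=True)[:top_n]
```

For example, largest_gaps(means, "nyt", "guardian") would return the five topics whose average weight differs most between the two papers, with positive differences marking topics that are heavier in the first source.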

Detailed analysis of topic model (part III: time-series analysis).

  1. Study trends in topics.
Create views of the topic model that show trend lines of topics (created by showing weights of topics in documents at time 1, followed by time 2, etc.). We don't yet have a method or tool for this, but cf. the following time-series views in the Goldstone DFR Browser: topics across years | topics within a single year. See also: demo visualization of topics in State of the Union addresses; the TOM code demos; Robert K. Nelson, "Mining the Dispatch." (Alan's experiment)
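
The data behind a trend line could be computed along the following lines. The sketch assumes that per-document topic weights are already in memory (for example, read from a composition file) and that file names carry the year as their second hyphen-separated field, as in "nyt-2007-h-1.txt". The function names are hypothetical.

```python
from collections import defaultdict


def year_of(file_name):
    """Read the year from a name like "nyt-2007-h-1.txt" (second field)."""
    return int(file_name.split("-")[1])


def topic_trend(doc_weights, topic):
    """Return [(year, mean weight of topic)] in chronological order.

    doc_weights maps file names to per-topic weight lists.
    """
    by_year = defaultdict(list)
    for name, weights in doc_weights.items():
        by_year[year_of(name)].append(weights[topic])
    return [(year, sum(values) / len(values))
            for year, values in sorted(by_year.items())]
```

The resulting (year, mean weight) pairs can be pasted into a spreadsheet or passed to a plotting library to draw the trend line for each topic.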

Write up results:

  1. Create key observations and data set and publish (with a catchy title like "Humanities in Public Discourse: The Manual").
  2. Co-author white paper with recommendations for humanities advocacy.
    1. Create subset of above as a brochure or infographic
  3. Disseminate research methods and conclusions in academic talks, panels, papers, articles.











