

Meeting (2016-11-11) (Interpreting Topic Models)

Page history last edited by Lindsay Thomas 4 years, 6 months ago


Agenda -- last revised 11 Nov. 2016, 12:30 pm, by Alan


  • Next full-team meeting date(s)? -- some options:

    • Thursday, Friday, or Saturday, Nov. 17, 18, or 19 (1 pm Pacific)?
    • Friday, Dec. 2?



Preliminary Discussion Issue 1: Grant Planning

 (discussion to set up for our next meeting)


Sausage-making machine

Preliminary Discussion Issue 2:
Briefing Reports on Current State of our "Sausage Making Machine" (WE1S technical system) and "Sausage" (corpus)


  • WE1S Corpus
    • Main corpus (report from Jamal)
    • "Random" corpus (report from Lindsay, with Annie & Samina)
  • Integrated Workflow & Virtual Machine for Topic Modeling (report from Jeremy)   
  • "Manifest" Workflow Management & Provenance Tracking system (report from Scott, with Tyler and Nate)


1. Interpreting Our Initial Topic Model -- Discussion of Methods & Initial Observations
     --cf. our topic modeling rehearsal of Jan. 18, 2016 



  • B. Preparation for our discussion (to-dos before our meeting):

    • i. Everyone: 

      • Explore the topic models in dfr-browser, noting things of interest (keep an observation notebook). (Also keep notes on bugs, desired improvements, etc. in dfr-browser and our interpretive process.)
      • Identify a few interesting topics and do a "ground truth" exercise of human-reading some articles in which there is a heavy representation of those topics.
        • Note: At present, the best way to access the text of the articles is to open the metadata CSV (or Excel) file for each topic model (linked above) and search for the article by title; the article body is included in these metadata files. Except for the NY Times part of our corpus, our metadata files do not include URLs to the publicly available online versions of articles, because the articles were collected from Proquest. Unfortunately, because our work is distributed, we gathered Proquest articles through a variety of university proxies (UCSB, Clemson, and U. Miami), so there is no easy, uniform way for everyone to use the Proquest article URLs to retrieve the original articles through their own campus Proquest service. Unless we find a way to simplify article access, we will need to read articles clumsily in the CSVs.
    • ii. Volunteer 1 Needed (Jeremy is setting up a multi-topic-count generator; Alan will do the comparative inspection of the topic models with different numbers of topics):
      • Create alternative topic models of 2007-2009 LA Times & Washington Post with different numbers of topics (e.g., 25, 100, 200, 300)
      • Do an intuitive assessment of what number of topics seems best, and assess whether there is an optimal number of topics.
      • Reflect on how we can decide what is the best number of topics to ask for in a topic model of our corpus.
    • iii. Volunteer 2 Needed (Mauro Carassai, with Scott Kleinman, has volunteered):
      • For one or two topics, human-read and compare articles that are "hot" with that topic (at the top of the ranked list of articles in which that topic is represented) and articles that are "cold" with that topic (lower down in the ranked list of articles associated with that topic). This is a way to begin exploring the relation between articles focused on the humanities as a topic and those that are part of what Alan has called the "cosmic background radiation" of the humanities in public discourse.
    • iv. Volunteer 3 Needed (Ashley signed up for this task):
      • Compare/contrast the LA Times and the Washington Post in the topic model. Can we detect at first glance any differences in their views of the humanities?
    • v. Lindsay, with Annie and Samina:
      • Do some of the same exercises as above with the topic models including the "random" corpus.
      • Think about the technical and theoretical issues involved in working with the random corpus, whether as part of a combined topic model (mixed with the rest of the WE1S corpus) or as its own topic model. (To begin with: how do we identify random articles in a topic model in which the random ones are mixed in with the main WE1S corpus?)
    • vi. For future examination:
      • Assess if we need to factor out the following kinds of articles that could potentially blur things together: Event listings, School listings, etc.
      • Experiment with time-series topic modeling 
  • C. Discussion and reports will occur at our meeting based on the above preparation. We will also discuss problems, desired improvements, etc. in tools and methods for interpreting our topic models.
    1. Opening general discussion: initial impressions of our topic models and of the methods/tools for interpreting them (Notes by Alan.docx)
      1. Interpretive method (how we are all using dfr-browser and other tools to "read" the topic models)
      2. Noise problems (sources of noise in our corpus and models)
      3. Tool problems (wish list for dfr-browser and other tools to assist in interpreting topic models)
    2. Focused discussion on particular issues:
      1. What is the right number of topics? (report by Alan and Jeremy)
      2. "Hot" vs. "cold" topics (articles ranked high vs. low for a topic) (report by Mauro, with Scott)
      3. Compare/contrast the LA Times and the Washington Post in the topic model (report by Ashley)
      4. Looking at the "random" corpus (report by Lindsay, Annie, and Samina)
  • D. Based on the above first-pass interpretive study and discussion, we will iterate toward a more systematic process of interpretation, of the sort outlined in the following draft plan for a systematic workflow of topic model interpretation (see the table showing the draft protocol for systematic interpretation).
  • E. We need to create a "starter package" of topic models and an anthology of representative articles for the undergraduate Winter-Spring "collaborative research grant" group titled "Making the Humanities Public." Undergrads will earn a stipend and independent course credit to work together, under the guidance of Jamal and Alan, to use the starter package to create an "as-if" briefing report to President Obama on humanities advocacy; then they will create a few advocacy projects based on the recommendations of that briefing report.
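
For the "ground truth" exercise in B.i, the title lookup in a metadata CSV could be scripted rather than done by hand. The following is only a minimal sketch: the column names `title` and `body` are assumptions (the real metadata files may use different headers), and the sample rows are invented stand-ins, not WE1S data.

```python
import csv
import io


def find_articles_by_title(csv_file, query, title_col="title", body_col="body"):
    """Return (title, body) pairs whose title contains `query`, ignoring case.

    `title_col` and `body_col` are assumed header names; adjust them to
    match the actual metadata file.
    """
    needle = query.casefold()
    matches = []
    for row in csv.DictReader(csv_file):
        title = row.get(title_col) or ""
        if needle in title.casefold():
            matches.append((title, row.get(body_col) or ""))
    return matches


# Tiny in-memory stand-in for a metadata file (invented rows).
SAMPLE = io.StringIO(
    "title,body\n"
    "Humanities funding debate,Text of the first article.\n"
    "Local school listings,Text of the second article.\n"
)

matches = find_articles_by_title(SAMPLE, "humanities")
for title, body in matches:
    print(title, "->", body)
```

With a real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded metadata CSV.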

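The "hot" vs. "cold" comparison in B.iii amounts to ranking articles by how strongly one topic is represented in them and then reading from both ends of that ranking. A rough sketch follows, assuming the per-article topic proportions are available as a dictionary; that data shape, the article names, and the topic number are hypothetical and are not the format dfr-browser itself uses.

```python
def hot_and_cold(doc_topic_weights, topic, n=2):
    """Rank articles by the weight of `topic`; return (hottest n, coldest n).

    `doc_topic_weights` maps an article title to {topic_id: proportion}.
    Articles in which the topic does not appear at all are left out, so
    "cold" here means weakly represented rather than absent.
    """
    ranked = sorted(
        (
            (title, weights[topic])
            for title, weights in doc_topic_weights.items()
            if weights.get(topic, 0) > 0
        ),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n], ranked[-n:]


# Invented proportions for illustration only.
weights = {
    "Article A": {7: 0.62, 3: 0.10},
    "Article B": {7: 0.05},
    "Article C": {3: 0.80},
    "Article D": {7: 0.31},
    "Article E": {7: 0.01, 3: 0.40},
}

hot, cold = hot_and_cold(weights, topic=7, n=2)
print("hot:", hot)
print("cold:", cold)
```

If fewer than 2 × n articles contain the topic, the two lists will overlap, so n should be kept small relative to the size of the ranked list.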











