Meeting (2018-11-27)


 

 

Meeting Time:       Tuesday, November 27, 2018, 9-11 am (Pacific)

Meeting Location: DAHC (Digital Arts & Humanities Commons) (directions)

Meeting Zoom:     We'll use Alan's "instant" Zoom ID (our default meeting Zoom):  https://ucsb.zoom.us/j/760-021-1662

 


Purpose of Today's Meeting

 

  1. Planning for further evolution of our interpretation protocol 

  2.  [Meet next Tuesday to continue?]
  3. Breakout meetings of teams (plus PIs' meeting)

 

Preliminary Business

 

 

 

1. Status Updates

 

 

 

2. Outline of a Fuller Interpretation Protocol

 

Context: Overall Plan for This Year:

 

Fuller Interpretation Protocol (red = what we have workshopped so far)

  1. Preliminary steps:

    1. Method for sampling and normalizing materials from our corpus for modeling

    2. Method for assessing quality of a topic model

    3. Method for choosing granularity level to work with

    4. Scrubbing 

    5. Rules-based (and algorithmically assisted) method of labeling topics 

      • Eventually, we may be able to consolidate clusters of labeled topics under a controlled vocabulary for a topic (e.g., "economics" or "politics") that would be like a "codebook." 
    6. [Include initial "human topic modeling" to help assess topic model and also provide a preliminary overview of topics?] See Team 2's recommendations (Google Doc)

    7. Method for finding and reading original articles (i.e., when our articles are eventually "bagged" and not available as JSON text) 

  2. Method for macroanalysis of a topic model

    1. Identifying topics of interest

    2. Describing the structure of the model and relation of main topics

  3. Method for close analysis of a topic

  4. Method for comparative analysis of two topics

  5. Method for comparing two corpora or subcorpora (or sets of sources within a corpus) 

    1. Method for including social media in our analysis. (E.g., see Ray's topic model of Reddit)

  6. Method for longitudinal analysis of topics 

  7. Method for reporting on the interpretation of a topic model 

 

 

3. Breakout meetings

 

 

 

 

From Lindsay's Ryver Post to Team 1 (Primary Corpus Team) of Nov. 24, 2018:

 

First, I've finally managed to download data for and produce models of articles from the NYT from 2012-2017 containing the other keywords we agreed on last month.... I've produced 11 models based on different keywords, and one comparison model (articles from the NYT from 2012-2017 containing the word "humanities"), all of which you'll find on harbor 10000 in our team folder: http://harbor.english.ucsb.edu:10000/tree/write/projects/teams/2018-19-1-primary-corpus.

 

The data for all models is taken from the NYT from 2012-2017. All models were produced on harbor 10000 with the Python-only workflow, which means that neither the "Scaled" view in dfr-browser nor the topic bubbles visualization will work. (There's a lot more to say about why we're gradually going to switch to a Python-only workflow for modeling, and what problems that will solve and cause, and maybe we'll talk some about it on Tuesday. For now, it's enough to know that there are now two ways to make a dfr-browser: using notebook 4, which is an R notebook, or using only notebook 1, which contains a Python-only way to create a browser. The Python-only workflow works better if you want to create diagnostics files, or if you have a large number of articles [above 10,000].) Where possible, I've normalized the number of articles in each model to ~1200, which is the size of the comparison model (the NYT 2012-2017 "humanities" model), so that we are comparing models of roughly the same size. I did this in the notebook workflow by randomly selecting articles for inclusion in the model. Models with ~1200 articles have 100 topics; models with <500 articles have 25 topics. I chose these numbers quickly and without a ton of thought, so it's very likely the models aren't as good as they could be. But I hope they will suit our purposes. Links to each model are in the worksheets I've created (see below).
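
For anyone who wants to reproduce the size normalization outside the notebooks, the random selection described above could be sketched roughly as follows. This is only an illustration, not the code in the notebook workflow: the function name `sample_articles`, its parameters, and the use of a fixed seed are all assumptions made for the example.

```python
import random


def sample_articles(article_ids, target_size=1200, seed=None):
    """Return a random subset of at most `target_size` article IDs.

    Hypothetical helper: the name, parameters, and seeding are
    illustrative assumptions, not part of the actual notebook workflow.
    """
    ids = list(article_ids)
    # Collections already at or below the target size are kept whole.
    if len(ids) <= target_size:
        return ids
    # A seeded generator makes the selection repeatable across runs.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no article repeats.
    return rng.sample(ids, target_size)


# Example with made-up IDs: reduce 5,000 articles to 1,200.
subset = sample_articles(range(5000), target_size=1200, seed=42)
print(len(subset))
```

Fixing the seed is a design choice for the sketch: it lets two people draw the same subset, which matters if models built from the sample are going to be compared.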

The models are as follows, in no particular order:

  • "art history", 100 topics
  • "english literature", 25 topics
  • "history", 100 topics
  • "humanities" AND "funding", 25 topics
  • "humanities" AND "science", 25 topics
  • "liberal arts", 100 topics
  • "literature", 100 topics
  • "nea", 25 topics
  • "philosophy", 100 topics
  • "science", 100 topics
  • "stem", 100 topics
  • comparison model: "humanities", 100 topics
  • The search for articles containing the keyword "neh" only returned 32 articles from 2012-2017, so I did not do a model for that keyword.

Here's what I'm thinking for Tuesday's meeting:

  1. Once we get to the breakout sessions part of the meeting, those of us in team 1 each choose 1 model to focus on and interpret. In our team Google drive folder (let me know if you don't have access to this), you will find a folder called "Topic model worksheets." This folder contains 12 worksheets I've created for interpreting the models. These worksheets are based on the interpretation protocol we went through during our last two meetings, with some revisions based on Abigail's feedback from her group and the new diagnostics files we are now saving via MALLET. Choose the model you want to focus on, click on that worksheet, and add your name to the top where it says "Name." If someone has already taken a model you wanted to work on, please just choose another one. The worksheets contain links to each model's browser and diagnostics file, as well as basic information about each model (the number of articles, the keyword, etc.). (If you want, obviously, you can choose your model ahead of the meeting.)
    • I will focus on the comparison model (the "humanities" 100-topic model), and will complete my worksheet by the meeting on Tuesday. During the meeting's breakout session, I actually have to meet with the other PIs about the interim report due to the Mellon. So I won't be able to meet with our team then. That's why I'm thinking we'll use the meeting time for individual work (see below).
  2. Use the breakout session time of the meeting to work on your worksheet. There are more models than there are members of our group, so if you are feeling ambitious, feel free to take on more than 1 model. Just remember to add your name to the top of the worksheet for any model you are working on or plan to work on.
  3. Finally, we schedule an actual team meeting for Thursday, December 6 from 10 am - 11 am PST to discuss our models. I'm shooting for Thursday just in case Alan wants to schedule another all-hands meeting for that Tuesday. Does Thurs, Dec 6 from 10 - 11 am PST work for everyone for a team meeting?