
Meeting (2018-11-27)

Page history last edited by Alan Liu 5 years, 10 months ago

 

 

Meeting Time:       Tuesday, November 27, 2018, 9-11 am (Pacific)

Meeting Location: DAHC (Digital Arts & Humanities Commons) (directions)

Meeting Zoom:     We'll use Alan's "instant" Zoom ID (our default meeting Zoom):  https://ucsb.zoom.us/j/760-021-1662

 


Purpose of Today's Meeting

 

  1. Planning for further evolution of our interpretation protocol 

    • Planning for other kinds of interpretative tasks (beyond the "examine one topic" or "compare two topics" we have so far workshopped).
    • Planning for how to incorporate materials and processes of task teams 2, 3, and 4. This includes addressing "subcorpora", additional methods and visualizations, and the "human topic modeling" process of Team 2.
    • Note: "Interpretation protocol" means:
      • (a) A step-by-step declaration of machine-learning processes plus human inspection, reading, and validation processes (these processes will eventually be declarable as WE1S "manifests"); and
      • (b) An accompanying rationale statement for the overall combination of processes (Dan has started writing this).
  2.  [Meet next Tuesday to continue?]
  3. Breakout meetings of teams (plus PIs' meeting)

 

Preliminary Business

 

  • Machine-building workshop this Friday, 2-6, in the Transcriptions Center (SH 2509) [cf. Scanner Praxis build, 2012-13]

 

 

1. Status Updates

 

  • (Team 1) Primary Corpus Collection & Analysis Team -- (Led by: Lindsay Thomas, with the other PIs; Members: Tyler, Leila, Tarika, Rebecca, Jamal, Ryan, Sean)
  • (Team 2) Students and the Humanities Team (Led by Abigail Droge; Members: Jessica, Avery, Leila, Rebecca, Aleah, Tyler)
  • (Team 3) "Gender, Ethnic, and Racial Diversity and Interests" Team (Led by Giorgina Paiella; Members: Jamal, Aleah, Su)
  • (Team 4) Interpretation Lab Team (Led by Dan Baciu; Members: Su, Ryan, Sean, Cindy, Sihwa) 
    • Team 4 Google Drive folder 
    • Interpretation Lab Work Log   
    • Interpretation Protocol development (led by Dan)
    • Word Embedding (Fabian Offert, Teddy Roland, Isaac, Su, Ryan)
    • Data Visualization (Sihwa Park, Cindy Kang) 

    • "Random" Corpus (Led by Lindsay Thomas; Members: U. Miami RAs, Ray) 

      • Note: This group may be run in collaboration between Team #1 and Team #4

  • (Team 5) Team to Harvest and Report on Summer 2018 work (and notes from the Advisory Board meeting)

 

 

2. Outline of a Fuller Interpretation Protocol

 

Context: Overall Plan for This Year:

  • Interpretation rehearsals (of parts or all of our interpretation protocol) in February and March?
  • "Real" interpretation of our corpus (including subcorpora) to address one to three specific research questions in spring quarter--e.g.,
    • What is the similarity/difference between mainstream news media and student discourse on the humanities? (Or a comparison between our main corpus and any subcorpus)
    • How do the humanities and sciences compare in news media?
    • How does the "cosmic background radiation" of the humanities compare in the U.S. West vs. East vs. Midwest vs. South?
    • How does our corpus compare to a "random" corpus? 
  • The goal is to position ourselves by the end of the academic year to embark on actual analytical and reporting work on our corpus. 

 

Fuller Interpretation Protocol (red = what we have workshopped so far)

  1. Preliminary steps:

    1. Method for sampling and normalizing materials from our corpus for modeling

    2. Method for assessing quality of a topic model

    3. Method for choosing granularity level to work with

    4. Scrubbing 

    5. Rules-based (and algorithmically assisted) method of labeling topics 

      • Eventually, we may be able to consolidate clusters of labeled topics under a controlled vocabulary for a topic (e.g., "economics" or "politics") that would be like a "codebook." 
    6. [Include initial "human topic modeling" to help assess topic model and also provide a preliminary overview of topics?] See Team 2's recommendations (Google Doc)

    7. Method for finding and reading original articles (i.e., when our articles are eventually "bagged" and not available as JSON text) 

  2. Method for macroanalysis of a topic model

    1. Identifying topics of interest

    2. Describing the structure of the model and relation of main topics

  3. Method for close analysis of a topic

  4. Method for comparative analysis of two topics

  5. Method for comparing two corpora or subcorpora (or sets of sources within a corpus) 

    1. Method for including social media in our analysis. (E.g., see Ray's topic model of Reddit)

  6. Method for longitudinal analysis of topics 

  7. Method for reporting on the interpretation of a topic model 

 

 

3. Breakout meetings

 

    • Team meetings
    • PIs to meet. PI topics include--
      • January event in Miami
      • Course buyouts this year
      • Issues relating to the Mellon progress report:
        • Repository strategy
        • IP licenses not yet covered in Jeremy's inventory 
      • Getting our models to Carl 

 

 

 

From Lindsay's Ryver Post to Team 1 (Primary Corpus Team) of Nov. 24, 2018:

 

First, I've finally managed to download data for and produce models of articles from the NYT from 2012-2017 containing the other keywords we agreed on last month.... I've produced 11 models based on different keywords, and one comparison model (articles from the NYT from 2012-2017 containing the word "humanities"), all of which you'll find on harbor 10000 in our team folder: http://harbor.english.ucsb.edu:10000/tree/write/projects/teams/2018-19-1-primary-corpus.

 

The data for all models is taken from the NYT from 2012-2017. All models were produced on harbor 10000 with the python-only workflow, which means that the "Scaled" view in dfr-browser doesn't work, and neither does the topic bubbles visualization. (There's a lot more to say about why we're gradually going to switch to a python-only workflow for modeling, and what problems that will solve and cause, and maybe we'll talk some about it on Tuesday. For now, it's enough to know that there are now two ways to make a dfr-browser: using notebook 4, which is an R notebook, or using only notebook 1, which contains a python-only way to create a browser. The python-only workflow works better if you want to create diagnostics files, or if you have a large number of articles [above 10,000].)

Where possible, I've normalized the number of articles in each model to ~1200, which is the size of the comparison model (the NYT 2012-2017 "humanities" model), so that we are comparing models of roughly the same size. I did this in the notebook workflow by randomly selecting articles for inclusion in the model. Models with ~1200 articles have 100 topics; models with <500 articles have 25 topics. I chose these numbers quickly and without a ton of thought, so it's very likely the models aren't as good as they could be. But I hope they will suit our purposes. Links to each model are in the worksheets I've created (see below).

The models are as follows, in no particular order:

  • "art history", 100 topics
  • "english literature", 25 topics
  • "history", 100 topics
  • "humanities" AND "funding", 25 topics
  • "humanities" AND "science", 25 topics
  • "liberal arts", 100 topics
  • "literature", 100 topics
  • "nea", 25 topics
  • "philosophy", 100 topics
  • "science", 100 topics
  • "stem", 100 topics
  • comparison model: "humanities", 100 topics
  • The search for articles containing the keyword "neh" only returned 32 articles from 2012-2017, so I did not do a model for that keyword.
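
Editorial sketch (not part of the original post): the random downsampling and topic-count choices described above could be expressed roughly as follows. This is a minimal illustration, not the notebook code actually used on harbor. The function names `downsample` and `choose_num_topics`, the `seed` parameter, and the `nyt_…` article IDs are hypothetical. The post only gives topic counts for models of ~1200 articles and for models under 500 articles, so the handling of sizes in between is an assumption.

```python
import random

# Size of the "humanities" comparison model, as stated in the post.
TARGET_SIZE = 1200


def downsample(article_ids, target_size=TARGET_SIZE, seed=None):
    """Randomly select up to target_size articles, without replacement.

    Collections already at or below target_size are returned unchanged.
    A fixed seed (an assumption, not mentioned in the post) makes the
    selection repeatable.
    """
    ids = list(article_ids)
    if len(ids) <= target_size:
        return ids
    rng = random.Random(seed)
    return rng.sample(ids, target_size)


def choose_num_topics(n_articles):
    """Return 25 topics for fewer than 500 articles, otherwise 100.

    The post does not say what to do between 500 and ~1200 articles;
    treating those like the ~1200 case is an assumption.
    """
    return 25 if n_articles < 500 else 100


# Hypothetical usage: 5,000 placeholder IDs standing in for search results.
all_ids = [f"nyt_{i}" for i in range(5000)]
selected = downsample(all_ids, seed=42)
num_topics = choose_num_topics(len(selected))
```

Recording the seed alongside each model would let someone else reproduce the same article selection later, which may matter when models are compared across worksheets.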

Here's what I'm thinking for Tuesday's meeting:

  1. Once we get to the breakout sessions part of the meeting, those of us in team 1 choose 1 model to focus on and interpret. In our team Google drive folder (let me know if you don't have access to this), you will find a folder called "Topic model worksheets." This folder contains 12 worksheets I've created for interpreting the models. These worksheets are based on the interpretation protocol we went through during our last two meetings, with some revisions based on Abigail's feedback from her group and the new diagnostics files we are now saving via MALLET. Choose the model you want to focus on, click on that worksheet, and add your name to the top where it says "Name." If someone has already taken a model you wanted to work on, please just choose another one. The worksheets contain links to each model's browser and diagnostics file, as well as basic information about each model (the number of articles, the keyword, etc). (If you want, obviously, you can choose your model ahead of the meeting.)
    • I will focus on the comparison model (the "humanities" 100-topic model), and will complete my worksheet by the meeting on Tuesday. During the meeting's breakout session, I actually have to meet with the other PIs about the interim report due to the Mellon. So I won't be able to meet with our team then. That's why I'm thinking we'll use the meeting time for individual work (see below).
  2. Use the breakout session time of the meeting to work on your worksheet. There are more models than there are members of our group, so if you are feeling ambitious, feel free to take on more than 1 model. Just remember to add your name to the top of the worksheet for any model you are working on or plan to work on.
  3. Finally, we schedule an actual team meeting for Thursday, December 6 from 10 am - 11 am PST to discuss our models. I'm shooting for Thursday just in case Alan wants to schedule another all-hands meeting for that Tuesday. Does Thurs, Dec 6 from 10 - 11 am PST work for everyone for a team meeting?
 

 

 

 

 

 
