
Meeting (2016-08-26)

Page history last edited by Alan Liu 6 years, 5 months ago

 Meeting Outcomes:

(jump to notebook added after meeting at bottom of page)



Next full-team meeting date(s)? -- some options:

  • Friday, Sept. 16 (11 am, or 1 pm Pacific)
  • Friday, Sept. 23

Purpose of the next meeting:

  • Ongoing business
  • Brainstorm as a group about how to structure our topic-model interpretation process (see initial ideas below)
  • Discuss whether we will be ready to submit a paper or panel to DH2017 (the deadline for submissions for DH2016 was Nov. 1, 2015).



1. MongoDB/Manifest Development & Total Workflow Development on File Server / MongoDB / Virtual Machine



2. Corpus Finalization: Metadata -- current projected completion date: Sept. 12

  • Progress so far -- briefing by Jamal and the RA team. (Developer Task Assignments (Corpus Finalization))
  • How should we go about correcting variations that have cropped up in different team members' CSV files?
  • Planning for future incorporation of the Globe & Mail metadata sheets being created by Nathalie Popa
    • Nathalie has scraped 1977 to 2014 "humanities" for The Globe and Mail. Files she has shared from her Google Drive include:
      • ProQuest search results pages
      • Outwit Hub scrape of the ProQuest results (in XLS format)
      • Master metadata spreadsheets (in Google Sheets format); these will require finalization to standardize with our current final format for the CSV files
      • Composite text and doc files of article bodies from each year, with the use of our @@@@@@@@@@ separator (but not yet cut into individual articles).
    • Nathalie has also separately created a mini-corpus of The Globe and Mail in which, for the year 1999, she has scraped both "humanities" and "the arts." (We'll need to evaluate the results to see whether it is worthwhile to scrape "the arts.")
    • Nathalie has another 50 hours of RAship left at McGill that she could use to scrape "liberal arts." (At a minimum, she could get the ProQuest search results, so that we could take it from there if need be.)
  • Creating a sample "random" corpus -- Lindsay to report any thoughts or progress she has on this so far.
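The composite yearly files mentioned above (article bodies joined by the @@@@@@@@@@ separator) still need to be cut into individual articles. A minimal sketch of that step, assuming the separator appears exactly as written in these notes; the function name `split_composite` is hypothetical and not part of any existing team script:

```python
SEPARATOR = "@@@@@@@@@@"  # separator string as written in the notes


def split_composite(text, separator=SEPARATOR):
    """Split one composite file's text into individual article bodies.

    Empty pieces (e.g. after a trailing separator) are dropped.
    """
    pieces = text.split(separator)
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    composite = "First article.\n@@@@@@@@@@\nSecond article.\n@@@@@@@@@@\n"
    for number, body in enumerate(split_composite(composite), start=1):
        print(number, body)
```

This only separates the bodies; matching each body to its row in the master metadata spreadsheet would still have to be done by order or by some other key.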



3. Corpus Finalization: Creating the Corpus of Articles -- can we do so by end of Sept.?

  • Metadata finalization team to give example CSVs to programmers (Tyler and Ishjot) to examine.
    • The CSVs should be representative, and should include known problems (e.g., long articles whose bodies span two cells, CSVs with extra columns)
    • Tyler & Ishjot will try to make a script or scripts that do the following:
      1. Export the article bodies as plain-text files named by the values of the ID column (e.g., "nyt-2012-h-14")
      2. Store in appropriate folders in tree on Mirrormask filestation space (or storage in the MongoDB/Manifest system if it is ready)
      3. Repeat the above two steps, but include in the plain text files a first line consisting of basic metadata (date, author, title <line break>)
  • Creation and management of the corpus of articles:
    • Use the process developed by Tyler & Ishjot to generate the corpus of articles. Some issues:
      1. Can the process be installed in Jeremy's virtual machine so that various team members can re-run the process in whole or in part in the future?
      2. Can the workflow for creating the corpus be integrated with that for de-duplication into a single workflow?
      3. Can a method be developed to allow team members to select all or parts of the corpus to work on (by copying that portion to a working folder on the virtual machine for topic modeling, or by downloading a zip file)?
      4. If the corpus will live on Mirrormask, do we need a backup method that stores the backup on a different machine?
  • Schedule a team meeting when appropriate for training in running/customizing the whole process (hopefully through a Jupyter notebook on the virtual machine)
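The export steps listed above (one plain-text file per article, named by the ID column, optionally with a first line of basic metadata) could be sketched roughly as follows. This is an illustration only, not the script Tyler and Ishjot are writing. The column names `ID`, `date`, `author`, `title`, and `body` are assumptions about the final CSV format, the function name `export_articles` is hypothetical, and all files go into one flat folder rather than a folder tree. The known problems (bodies spanning two cells, extra columns) are not handled here.

```python
import csv
from pathlib import Path


def export_articles(csv_path, out_dir, id_col="ID", body_col="body",
                    meta_cols=("date", "author", "title"),
                    with_metadata=False):
    """Write each CSV row's article body to <out_dir>/<ID>.txt.

    Column names are assumed, not taken from the real CSV files.
    Returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            article_id = (row.get(id_col) or "").strip()
            if not article_id:
                continue  # skip rows with no usable ID
            text = row.get(body_col) or ""
            if with_metadata:
                # First line: date, author, title, then a line break
                header = ", ".join((row.get(col) or "").strip()
                                   for col in meta_cols)
                text = header + "\n" + text
            path = out_dir / f"{article_id}.txt"
            path.write_text(text, encoding="utf-8")
            written.append(path)
    return written
```

Calling `export_articles("sample.csv", "plain")` and then `export_articles("sample.csv", "with_meta", with_metadata=True)` would produce the two versions of the corpus described in steps 1 and 3.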


4. Generate Initial Topic Models -- by early October?

  • Take a small sample of our corpus and create an initial topic model (multiple versions with different numbers of topics).
  • Do the same for our sample "random" corpus
  • Attempt to do a topic model of our entire corpus.
  • Visualize in DFR-browser
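For the "small sample" step above, it may help to draw the sample reproducibly, so that the versions with different numbers of topics are all trained on the same articles. A minimal sketch, assuming articles are identified by ID strings such as "nyt-2012-h-14"; the function name `sample_ids` and the seed value are hypothetical:

```python
import random


def sample_ids(article_ids, k, seed=0):
    """Return a reproducible random sample of up to k article IDs.

    Sorting first makes the result independent of the input order;
    a fixed seed makes it repeatable across runs.
    """
    ids = sorted(article_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


if __name__ == "__main__":
    all_ids = [f"nyt-2012-h-{n}" for n in range(1, 101)]
    print(sample_ids(all_ids, 5))
```

The sampled IDs can then be used to copy the matching plain-text files into a working folder for topic modeling.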


5. Begin Interpreting Topic Models -- by mid October?

  • Our previous rehearsals and thoughts on interpretation process:

    • Topic modeling rehearsal of Jan. 18, 2016: see meeting notes.
    • Draft plan for a workflow for interpretation: see table in Jan. 18, 2016, meeting notes.


  • Alan's initial idea for how we start exploring the interpretation process in October: Conduct a first-pass study and discussion of the topic models (to get a rough sense of what we have, how we want to vary or diversify our topic modeling, etc.). Specific issues we can focus on:

    • Do a "ground truth" assessment of topics by human-reading some articles containing those topics.
    • Compare the topic model of our corpus against the random corpus to see if we are detecting a recognizable difference.
    • Figure out if we need to separate out the following kinds of articles that could potentially blur things together:
      • Event listings
      • School listings, etc.
    • Do an experiment in which we separately topic model a sample of our whole corpus and a sample of articles that are centrally "about" the humanities (and devise an algorithmic means of detecting the latter).
      • Will our topic model allow us to say anything meaningful about how the idea of the humanities participates in general discourse about society (a question about the overall structure of discourse about society)?
      • Or will it only allow us to say something meaningful about focalized, explicit discussions of the humanities?
    • Do an experiment with time-series topic modeling
    • Use clustering methods to assist in grouping topics.
    • Feel our way toward an understanding of what the "architecture" of the complex idea of the humanities looks like (using Pete de Bolla's term).
    • Based on the above first-pass interpretive study and discussion, implement a more systematic process of interpretation of the sort outlined in our draft plan for a workflow for interpretation (see table).
  • Create a "starter package" of topic models and an anthology of representative articles for the undergraduate Winter-Spring CRG (UCSB English Dept. "collaborative research grant" group of students on stipend and independent course credit, working together under the guidance of a grad student on stipend).
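One very rough starting point for the "algorithmic means of detecting" articles that are centrally about the humanities (mentioned in the list above) is to measure how often the word occurs relative to article length. This is only a sketch for discussion: the function names, the single search term, and the threshold of 10 occurrences per 1,000 words are all assumptions, and a real detector would need to be checked against human reading.

```python
import re


def term_rate(text, term="humanities"):
    """Occurrences of `term` per 1,000 words (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return 1000.0 * words.count(term) / len(words)


def is_centrally_about(text, threshold=10.0):
    """Flag an article whose term rate meets an assumed threshold."""
    return term_rate(text) >= threshold
```

A single passing mention in a long article yields a low rate, while a short piece that keeps returning to the word yields a high one; where to draw the line is an open question for the team.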








Meeting Outcomes (to-do's in red)

  • Next full-team meeting: Thursday, Sept. 15, 1pm (Pacific): We will brainstorm topic model interpretation process in that meeting.
  • Corpus Metadata Finalization:
    • RAs to finish current corpus by c. Sept. 12th
      • Nathalie Popa to collect Globe & Mail "liberal arts" (However, exporting the article bodies manually from the sheets is no longer necessary.)
    • Variations among individual RAs' CSV files to be fixed/standardized
    • Jamal to provide Tyler with sample CSVs 
  • Producing the Corpus:
    • Tyler to make a script that does the following: [P.S. Scott indicated at the meeting he had quickly started on this]
      1. Export the article bodies as plain-text files named by the values of the ID column (e.g., "nyt-2012-h-14")
      2. Store in appropriate folders in tree on filestation (or storage in the MongoDB/Manifest system if it is ready)
    • Lindsay to work on producing a quick-and-dirty "random" corpus on a relatively small scale (pending future decision on whether we need to improve the random corpus).
    • We will use the script to export plain-text files for the whole corpus.
  • Back-end Development:
    • Tyler to continue working on MongoDB/Manifest (and file uploading) in collaboration with Scott. Tyler to document his work/code as appropriate.
    • Tyler to consult with Jeremy (in a meeting?) on ideas for possibly simplifying the WE1S workflow.
    • Jeremy to work on the total WE1S workflow system on the filestation, MongoDB, and virtual machine (as indicated in his diagram).
    • Jeremy to work on the de-duping part of the workflow (as in his diagram)
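As a point of reference for the de-duping item above, exact duplicates can be found by hashing a normalized form of each article body. This is a generic sketch, not the method in Jeremy's diagram. It only catches articles that are identical after lowercasing and whitespace normalization, not near-duplicates, and the function names are hypothetical.

```python
import hashlib
import re


def fingerprint(text):
    """SHA-256 hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(articles):
    """Group article IDs whose bodies share a fingerprint.

    `articles` maps article ID -> body text. Returns a list of ID groups,
    each containing two or more IDs with identical normalized text.
    """
    groups = {}
    for article_id, text in articles.items():
        groups.setdefault(fingerprint(text), []).append(article_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each returned group could then be reviewed so that one copy is kept and the others are logged and removed.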


































