Meeting (2016-08-05)

Page history last edited by Alan Liu 7 years, 8 months ago

 Meeting Outcomes:

(jump to notebook added after meeting at bottom of page)

 

 

Next full-team meeting date(s)? -- some options:

  • Friday, Aug. 19th (11 a.m. Pacific)
  • Friday, Aug. 26
  • Friday, Sept. 2

 


I. Corpus Finalization -- (possible to complete A, B, C, D below by early Sept.?)

  • (A) Corpus Finalization, Stage 1 -- Metadata Finalization (finalizing the metadata sheets/CSVs) 

    • Progress to date on the metadata sheets/CSVs, and projected completion date.
    • Variations that have cropped up in how different members of the team are finalizing the metadata in the CSV files.
    • Other problems encountered by the RA team? (Also, possible other adjustments of the CSVs may be suggested by Tyler and Ishjot -- see below.)
    • What is the best scenario for fixing variations and incorporating adjustments in the CSVs? Options:
      • Each RA to iron out existing issues in the CSVs they produced before continuing?
      • Assign a single RA to iron out existing issues in the CSVs?
      • Assign a single RA to inspect and do quality control adjustments when all the CSVs are complete?

 

 

  • (B) Corpus Finalization, Stage 2 -- Finalizing the plain-text article files

    • Produce an example CSV for the programmer RAs (Tyler and Ishjot) to examine.
      • We should annotate or point out on the CSV some possible problems that a script for exporting articles would need to deal with (e.g., long articles whose bodies span two cells)
    • Export the plain-text files:
      • Ask Tyler & Ishjot to take the example CSV and determine if they can make a script that does the following (Tyler & Ishjot to suggest tweaks to the CSVs if necessary):
        1. Export the article bodies as plain-text files named by the values of the ID column (e.g., "nyt-2012-h-14")
        2. Store in appropriate folders in tree on Mirrormask filestation space (pending future storage in the MongoDB/Manifest system)
        3. Repeat the above two steps, but include in the plain text files a first line consisting of basic metadata (date, author, title <line break>)
    • Deal with Unicode special character problems (e.g., nonsense characters where there should be curly quotes, em dashes, diacritics):
      • Do we really need to clean special character problems?
      • If so:
        • Produce a "dictionary" of special character problems encountered in the CSVs.
        • Then, at what stage do we do cleaning? Options:
          • Clean the metadata csv files
          • Clean the exported plain text files
          • Incorporate cleaning of special character problems in the "scrubbing" Python script we use for special issues such as word consolidations.
            • However, see Scott's comments on the Trello board about Python 3 taking care of Unicode issues such as curly quotes.
    • Deal with de-duping problem: (Note: if the de-duping problem hangs us up, we can go on to topic modeling and interpretation with just "humanities" files for the moment until the problem is solved.)
      • Current de-duping workflow
      • RA subgroup (Alanna and Billy volunteered) to work on the following:
        • Run corpus_compare.py on a sample from the WE1S corpus (e.g., one year of one newspaper).
        • Study the results and evaluate for false positives, missed duplicates, etc. to see how aggressive we should be in setting the threshold of difference between files that flags a pair as likely duplicates.
        • We then need a script that will combine "h", "la" (and "ta") files for each year of a publication in a final output folder and delete the duplicate files 
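
The threshold question above can be illustrated with a minimal sketch. It does not reproduce corpus_compare.py; it assumes a simple character-level similarity ratio from Python's standard difflib module, and the function name, the 0.9 threshold, the sample texts, and the "nyt-2012-la-03" ID are all hypothetical. Raising the threshold flags fewer pairs (more missed duplicates); lowering it flags more (more false positives).

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(texts, threshold=0.9):
    """Return (id_a, id_b, ratio) for every pair at or above the threshold.

    `texts` maps a file ID to the article body. The threshold value is an
    assumption to be tuned by inspecting real results.
    """
    flagged = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(texts.items()), 2):
        # autojunk=False turns off difflib's heuristic for long sequences,
        # so the ratio is computed over the full text.
        ratio = SequenceMatcher(None, text_a, text_b, autojunk=False).ratio()
        if ratio >= threshold:
            flagged.append((id_a, id_b, round(ratio, 3)))
    return flagged


# Hypothetical sample: an "h" file and an "la" file that differ by one character.
sample = {
    "nyt-2012-h-14": "The humanities are under pressure at public universities.",
    "nyt-2012-la-03": "The humanities are under pressure at public universities!",
    "nyt-2012-h-15": "City council votes on a new budget for road repairs.",
}

print(likely_duplicates(sample, threshold=0.9))
```

Pairwise comparison grows quadratically with the number of files, so a sketch like this suits a small sample (e.g., one year of one newspaper) rather than the whole corpus.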

 

  • (C) Proof-of-concept End-to-End Run

    • Take a small sample of our corpus and do a complete end-to-end test starting with the CSVs:
      1. Export the plain-text files
      2. Make a topic model
      3. Visualize in DFR-browser
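
Step 1 of this run (the export described under Stage 2 above) could look roughly like the following sketch. The column names "id", "body", "date", "author", and "title" are assumptions about the finalized CSVs, and the function name is hypothetical. The sketch does not handle article bodies that span two cells.

```python
import csv
from pathlib import Path


def export_articles(csv_path, out_dir, with_header=False):
    """Write one plain-text file per CSV row, named by the row's ID value.

    Assumed column names: id, body, date, author, title.
    If with_header is True, the first line of each file holds basic metadata.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            article_id = (row.get("id") or "").strip()
            if not article_id:
                continue  # skip rows that have no ID to name the file by
            body = row.get("body") or ""
            if with_header:
                header = ", ".join(row.get(key) or "" for key in ("date", "author", "title"))
                body = header + "\n" + body
            target = out_dir / f"{article_id}.txt"
            target.write_text(body, encoding="utf-8")
            written.append(target)
    return written
```

Calling the function twice, once with with_header=False and once with with_header=True into a different output folder, would produce the two variants of the plain-text files that Stage 2 asks for.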

 

  • (D) Implement and automate the whole workflow above on the virtual machine if possible

    • Assessment by Jeremy (with Jamal) on feasibility of implementing end-to-end workflow through Jupyter notebook on the virtual machine.
    • Note: the topic modeling parameters will need to be later iteratively adjusted based on our interpretive work (see below).

 

 


II. Manifest/MongoDB Work

 

  • Report from Scott, Jeremy, Ishjot, Tyler on current work and progress to date.

 

 


III. Topic Modeling & Interpretation -- (start on this in early to late Sept.?)

 

  • Where we are so far:

    • Our topic modeling rehearsal of Jan. 18, 2016: see meeting notes.
    • Our draft plan for a workflow for interpretation: see table in Jan. 18, 2016, meeting notes.

 

  • Next steps for implementation:

    • Create some initial topic models from our corpus.
    • Also create a topic model from a random sample of articles from our publications (as suggested by Teddy)
    • Conduct a first-pass study and discussion of the topic models (to get a rough sense of what we have, how we want to vary or diversify our topic modeling, etc.). Specific issues we can focus on:
      • Do a "ground truth" assessment of topics by human-reading some articles containing those topics.
      • Compare the topic model of our corpus against the random corpus to see if we are detecting a recognizable difference.
      • Figure out if we need to separate out the following kinds of articles that could potentially blur things together:
        • Event listings
        • School listings, etc.
      • Experiment with topic modeling two samples separately: a sample of our whole corpus and a sample of articles that are centrally "about" the humanities (and devise an algorithmic means of detecting the latter).
        • Will our topic model allow us to say anything meaningful about how the idea of the humanities participates in general discourse about society (a question about the overall structure of discourse about society)?
        • Or will it only allow us to say something meaningful about focalized, explicit discussions of the humanities?
      • Do an experiment with time-series topic modeling
      • Use clustering methods to assist in grouping topics.
      • Feel our way toward an understanding of what the "architecture" of the complex idea of the humanities looks like (using Peter de Bolla's term).
    • Based on the above first-pass interpretive study and discussion, implement a more systematic process of interpretation of the sort outlined in our draft plan for a workflow for interpretation (see table).  

Meeting Outcomes (current to-do's in red)

  • Next full-team meeting: Friday, Aug. 26th, 11:30 a.m. (Pacific)
  • Metadata finalization work
    • RAs to meet in the next two weeks for a shared work session to help make our workflow more uniform.
    • Individual RAs to continue with the way they have been finalizing CSV files. Standardization of variations among the files to be done later (ideally assisted by scripting).
    • We are aiming to complete metadata finalization by early Sept.
  • Plain-text file finalization work:
    • Tyler & Ishjot will create a script to export plain-text files from the CSVs, name them, & store in appropriate locations.
    • Unicode special-character issues will be addressed at the later stage of using Scott's Python scrubbing script on the corpus.
    • Jeremy's continued work on the de-duping script has obviated the need for an RA subgroup to study de-duping results. Instead, Jeremy will continue that work and build the results of de-duping (with Ishjot & Tyler's help) into our total workflow.
    • Tyler & Ishjot will assist Jeremy in implementing the total workflow in the virtual machine environment.
  • Manifest/MongoDB work:
    • Ishjot & Tyler to continue working with Scott on developing the backend (including file upload capability).
  • Preparing for Topic Modeling & Interpretation:
    • Lindsay will work on creating a random-sample corpus from our publications.
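
One possible way to draw a reproducible random sample is sketched below. It assumes the articles already exist as plain-text ".txt" files under a single folder tree; the function name, the folder layout, and the seed value are hypothetical. A fixed seed means the same sample can be regenerated later for comparison against the main corpus.

```python
import random
from pathlib import Path


def sample_articles(corpus_dir, k, seed=2016):
    """Return k randomly chosen .txt files from corpus_dir (searched recursively).

    Files are sorted before sampling so that the same seed always yields
    the same sample, regardless of filesystem ordering.
    """
    files = sorted(Path(corpus_dir).rglob("*.txt"))
    rng = random.Random(seed)
    # Never ask for more files than exist.
    return rng.sample(files, min(k, len(files)))
```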
