

Meeting 23 (2015-10-2)

Page history last edited by Alan Liu 8 years, 6 months ago




Scheduling of Meetings During Fall

Scraping Work

  • Current status (Developer Tasks page)
    • Lindsay's report based on the meeting of Oct. 23, 2015:
      • These publications will be complete by Monday:
            NYT: Chris plans to finish up with his inspection sign-offs by Monday (there are several years remaining).
            WSJ: Ashley will finish her inspection sign-offs by Monday (1997 and 2008 remain).
            Guardian: Chris will finish up his inspection sign-offs (2000-2014) by Monday.
            NPR: Ashley plans to finish up her inspection sign-offs (2010, 2012-14) by Monday.

        These publications still need more work:
            LA Times: The collection work is done, but none of the inspection work has started. Chris will see what he can accomplish before Monday, but it's likely he won't be able to inspect much of this publication.
            New Yorker: 2004-2011 have been collected; 2012-14 remain. Phillip is planning to devote the rest of this afternoon to inspecting the existing files from the New Yorker. New inspection work will need to be done after Ashley finishes collection work.
            Washington Post: Ashley plans to start collection work using the HTML files I've uploaded after she finishes her remaining inspection work for the above publications. I've also volunteered to do some collection work on later years. Ashley has done one year (1987) and says the collection workflow is very clean and easy so far.
    • Phillip on The New Yorker inspections: "For a few of the scrapes, I encountered the problem of the masters spreadsheet not being able to include the whole article body because the article exceeded the spreadsheet's word limit. I wasn't able to figure out how to fix this, but for these instances, I did however copy and paste the whole article into the plain text documents. Other than this anomaly, the scrapings were mostly fine.... I believe that, in the years I detected this problem, one to two articles were affected. I think I caught all instances of the problem in the years I checked."
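The truncation problem Phillip describes could be checked for automatically before inspection. The following is only a minimal sketch: the notes do not say which spreadsheet or which limit was involved, so `CELL_CHAR_LIMIT`, the character-based (rather than word-based) test, and the `find_oversize_articles` helper are all assumptions made for illustration.

```python
# Hypothetical per-cell limit, counted in characters. The notes mention a
# "word limit" without giving a number, so adjust this to the real limit.
CELL_CHAR_LIMIT = 50_000


def find_oversize_articles(articles, limit=CELL_CHAR_LIMIT):
    """Return the ids of articles whose body is longer than `limit` characters.

    `articles` maps an article id (hypothetical naming) to its full body text.
    """
    return [
        article_id
        for article_id, body in articles.items()
        if len(body) > limit
    ]


if __name__ == "__main__":
    sample = {"short-piece": "A brief article.", "long-piece": "x" * 50_001}
    # Lists the articles that would need the plain-text workaround.
    print(find_oversize_articles(sample))
```

Articles flagged this way are the ones whose full text would need to be copied into the plain text documents by hand.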


Topic Modeling Work

Developer Task Assignments (Topic Modeling)

  • Preprocessing: We need to prototype and debug processes for:
    1. Accessing and working locally with "flattened" collections of files. (Jeremy?)
    2. Deduplication of "humanities," "liberal arts," and "the arts" files. (Jeremy or Alan?)
    3. Scrubbing (current files are kept on Google Drive: we1s-2 > stopwords_and_scrubbing_list ) (Alan, Lindsay, Scott?) (need to put these files on GitHub? Run orientation meeting on GitHub?)
      1. Extra Stopwords list (current version created by Alan)
      2. config.py for Scott's python scrubbing script (current version added to by Lindsay and then by Alan)
    4. Creating/forking deduplicated and scrubbed working corpora.  (Long range goal: through query?) (Short range goal: manually created versions of our corpus for topic modeling experiments):
      1. all files
      2. sub-corpora by publication(s)
      3. sub-corpora by year(s)
  • Topic Modeling (We can use experiments to try out ideas and prototype processes):
    • Current or planned experiments on our corpus (or sub-corpora):
      • Alan: topic models of NY Times 2002-6 "humanities" and NY Times 2010-14 "humanities" (the five years before and after the Great Recession).
      • Others?
      • Schedule a future meeting to discuss our models:
    • Methodological experiments we should also pursue (e.g., as part of the above topic models). After each experiment, each of us could write a brief report on process and results, with suggestions on the following issues:
      • Optimize number of topics
      • Clustering of topics
        • hierarchical (dendrograms)
        • K-means
        • PCA
      • Visualization of topics
      • Human team to cluster topics
      • Schedule a future meeting to discuss clustering (do human clustering or inspection of machine clustering)
    • Experiment with Andrew Goldstone's DFRtopics R package?
  • Creating a Public-Facing Interface:
    • Use Andrew Goldstone's DFR browser to start with?
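The preprocessing steps listed above (deduplication, scrubbing, and forking sub-corpora by publication or year) might be prototyped along the following lines. This is a sketch under stated assumptions, not Scott's scrubbing script or its config.py: the document fields (`publication`, `year`, `text`), the placeholder stopword set, and all function names are hypothetical, and deduplication here only catches copies that are identical after lowercasing and whitespace normalization.

```python
import hashlib
import re
from collections import defaultdict

# Placeholder stopwords for illustration; the real Extra Stopwords list
# lives on Google Drive and is not reproduced here.
EXTRA_STOPWORDS = {"said", "mr", "ms"}


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs):
    """Keep the first copy of each document whose normalized text is identical."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def scrub(text, stopwords=EXTRA_STOPWORDS):
    """Tokenize on letters and apostrophes, then drop the given stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(token for token in tokens if token not in stopwords)


def split_corpus(docs, key):
    """Group documents into sub-corpora by a field such as 'publication' or 'year'."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return dict(groups)
```

Calling `split_corpus(deduplicate(docs), "year")` would give the manually created per-year versions described as the short-range goal; a query-based approach would replace `split_corpus` later.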

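For the hierarchical clustering of topics proposed above, one possible starting point is to group topics by how many top words they share. The sketch below is an assumption-laden illustration in plain Python, not the output of any particular topic modeling tool: the topic labels, the use of Jaccard distance on top-word lists, average linkage, and the `threshold` cutoff are all choices made here for the example.

```python
def jaccard_distance(words_a, words_b):
    """Return 1 minus the share of words two topics have in common."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cluster_topics(topics, threshold):
    """Average-linkage agglomerative clustering of topics.

    `topics` maps a topic label to its list of top words. Clusters are merged
    closest-first until the closest remaining pair is farther than `threshold`.
    """
    clusters = [[label] for label in topics]

    def linkage(cluster_a, cluster_b):
        distances = [
            jaccard_distance(topics[x], topics[y])
            for x in cluster_a
            for y in cluster_b
        ]
        return sum(distances) / len(distances)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = linkage(clusters[i], clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        if distance > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]
```

Recording the merge order and distances instead of stopping at a threshold would give the data needed to draw a dendrogram, and the machine clusters could then be compared against the human team's groupings.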

Manifest scheme, Database system, Backup system

  • reports from Scott and Jeremy


Scott created a demo of webform access to a MongoDB database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:


    1. WE1S flask+deform  



    2. WE1S flask+alpaca (+pymongo)  



(NOTE -- as always, you may need to connect through the campus VPN in order to access these URLs)

