
Meeting 24 (2015-10-23)

Page history last edited by Alan Liu 7 years, 1 month ago




(a) Near-future and Far-future Scheduling

  • Next Meeting Dates (options in November): WE1S project Calendar

  • Future Research Outcomes (?)

    • Panel proposal for DH 2017 in Montreal.

      • WE1S Panel would include several of us giving talks on the sub-topics:
        • WE1S overview and corpus
        • WE1S topic modeling methods, clustering methods, and results
        • WE1S Manifest Schema
        • WE1S infrastructure (GitHub container: MongoDB)
        • Collaboration with Lexos
    • Project write-up for submission to the new DHCommons Journal for peer-review of DH projects.

    • Later: individual or co-authored articles for journals on various aspects of the project.



(b) Scraping Work

  • Current status of UCSB scraping (Developer Tasks page)

    • Main fixes and to-do's that need to be applied to the summer scraping work?
    • Ongoing scraping in fall? (Ashley and possibly also Jamal serving as scrapers)
    • Status of the WE1S Corpus
      • on Google Drive
      • mirrored on Mirrormask (Jeremy's server) at: mirrormask/4humwe1s-GDrive/ (screenshot)
      • "flattened" versions archived at: mirrormask/4humwe1s-GDrive/archives/ (screenshot)
      • backup archive at: TimeBackup/ (screenshot)
  • Scraping of Globe and Mail

    • Scraping to be done by Nathalie Popa (McGill U.) 
    • Current problems
  • Austin Yack's research on government and legislative documents on the humanities


(c) Next Meeting: Workshop for End-to-End Trial Rehearsal of Workflow for Topic Modeling WE1S Corpus (at sample scale)


       Plan for Workshop:

         (We'll run in parallel at our various locations. However, some steps may be pre-prepared, and some may be performed as a tutorial from one location only.)


  • Infrastructure for Workshop:

    • Parallel installations on computers at the following locations:
      • Transcriptions (UCSB) (Prep the workstation attached to the projector and Skype)
      • Alan
      • Lindsay
      • Scott
    • Installations should include (in addition to tools we all already have on our machines)
      • Anaconda distro of Python.
      • Relevant Python scripts:
      • Relevant iPython notebooks:
  • Workshop Stage 1 -- Assemble a demo sub-corpus of the WE1S corpus

    • Access latest "flattened" collections of files in which final file names have been assigned (e.g., no "File12.txt" names) (screenshot of example)
    • Carve out a part of the corpus (e.g., NY Times, 2010-2014, "humanities" and "liberal arts") for our demo corpus
    • Copy the demo corpus to a working folder (e.g., on a Windows machine: C:\workspace\nyt-2010-2014\articles\)
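The carve-out-and-copy step above can be sketched in Python. This is a minimal illustration, not project code: the flattened-folder layout and the filename convention (year and search term appearing in the final file names) are assumptions for the example; the function name is hypothetical.

```python
import shutil
from pathlib import Path

def collect_demo_corpus(flattened_dir, workspace_dir, years, terms):
    """Copy plain-text articles whose final (non-"File12.txt") names
    mention one of the target years and one of the search terms into
    a working folder. Filename convention is assumed for illustration."""
    src = Path(flattened_dir)
    dst = Path(workspace_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.glob("*.txt")):
        name = f.name.lower()
        # Keep files matching any target year AND any target term
        if any(str(y) in name for y in years) and any(t in name for t in terms):
            shutil.copy(f, dst / f.name)
            copied.append(f.name)
    return copied
```

For the demo corpus described above, the call might look like `collect_demo_corpus(flattened, r"C:\workspace\nyt-2010-2014\articles", range(2010, 2015), ["humanities", "liberal arts"])`.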
  • Workshop Stage 2 - De-duplicate the sub-corpus

    • Report from Jeremy (and discussion)
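Jeremy's report will describe the actual de-duplication approach; as a placeholder for discussion, a common baseline is fingerprinting each document's normalized text with a hash and dropping repeats. A minimal sketch, assuming exact-duplicate detection only (it will not catch near-duplicates):

```python
import hashlib
import re

def dedup_texts(docs):
    """Return (kept, dropped) lists of doc ids, treating documents with
    the same normalized-text MD5 fingerprint as duplicates.
    docs: dict mapping doc id -> raw text."""
    seen = {}
    kept, dropped = [], []
    for doc_id, text in sorted(docs.items()):
        # Normalize: lowercase, strip punctuation, collapse whitespace
        norm = re.sub(r"\W+", " ", text.lower()).strip()
        fp = hashlib.md5(norm.encode("utf-8")).hexdigest()
        if fp in seen:
            dropped.append(doc_id)
        else:
            seen[fp] = doc_id
            kept.append(doc_id)
    return kept, dropped
```

Because the text is normalized before hashing, wire-service copies that differ only in casing, punctuation, or whitespace collapse to the same fingerprint.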
  • Workshop Stage 3 - Scrub the sub-corpus

    • Run Scott's Python scrubbing script (scrub.py) on the corpus, and deposit results in a separate folder in the workspace (on a Windows machine, for example, C:\workspace\nyt-2010-2014\articles-scrubbed\)
      • Note: the most current version of the config.py file used to add words and phrases to the scrub.py script is kept on Google Drive: we1s-2 > stopwords_and_scrubbing_list. This config.py reflects Lindsay and Alan's cumulative additions to the scrubbing list so far.
      • Note: there is an Extra Stopwords list for MALLET at the Google Drive location above. It is used during topic modeling with MALLET.
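The internals of scrub.py aren't reproduced here; for orientation, the kind of pass it performs can be sketched as below. The phrase list is a stand-in for the real scrubbing list kept in config.py on Google Drive, and the function names are illustrative, not Scott's actual API.

```python
import re
from pathlib import Path

# Stand-in for the config.py scrubbing list (actual list on Google Drive)
SCRUB_PHRASES = ["all rights reserved", "op-ed"]

def scrub_text(text, phrases=SCRUB_PHRASES):
    """Remove the configured words/phrases (case-insensitively) and tidy whitespace."""
    for p in phrases:
        text = re.sub(re.escape(p), " ", text, flags=re.IGNORECASE)
    return re.sub(r"[ \t]+", " ", text).strip()

def scrub_folder(in_dir, out_dir):
    """Scrub every .txt in in_dir into a parallel file in out_dir
    (e.g., articles\ -> articles-scrubbed\)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for f in Path(in_dir).glob("*.txt"):
        scrubbed = scrub_text(f.read_text(encoding="utf-8"))
        (out / f.name).write_text(scrubbed, encoding="utf-8")
```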
  • Workshop Stage 4 - Topic model the sub-corpus using MALLET:

    • [Steps to be filled in here]
    • Recent topic modeling experiments with WE1S corpus:
      • Alan: topic models of NY Times 2002-6 "humanities", and NY Times 2010-14 "humanities" (the five years before and after the Great Recession).
      • Ashley & Zach: topic models of 1980s discourse about the humanities versus 1990s discourse about the humanities
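Until the steps above are filled in, the standard two-step MALLET invocation can serve as a placeholder. Output file names and the topic count are assumptions for illustration; the Extra Stopwords list from the Google Drive location above plugs in via `--extra-stopwords`, and `--word-topic-counts-file` produces the topic-counts.txt consumed in Stage 5.

```
bin\mallet import-dir --input C:\workspace\nyt-2010-2014\articles-scrubbed ^
    --output nyt-demo.mallet --keep-sequence --remove-stopwords ^
    --extra-stopwords extra-stopwords-mallet.txt

bin\mallet train-topics --input nyt-demo.mallet --num-topics 50 ^
    --optimize-interval 10 --output-doc-topics doc-topics.txt ^
    --output-topic-keys topic-keys.txt --word-topic-counts-file topic-counts.txt
```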
  • Workshop Stage 5 - Clustering topics:

    • Use Scott's topicsToDocs.py script on the topic-counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. (Or use Lexos to do the same)
    • Use Scott's adaptation of the DARIAH-DE tutorial iPython notebook to produce PCA clustering and dendrogram visualizations of the topic-documents
      • Alan's results from NY Times 2002-2016 "humanities": PCA | Dendrogram 
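The topics-to-documents idea above can be sketched as follows. This is an illustration of the concept, not Scott's topicsToDocs.py itself: it assumes the topic-counts file is in MALLET's `--word-topic-counts-file` format (`<word-index> <word> <topic>:<count> ...`) and builds one pseudo-document per topic by repeating each word in proportion to its count, so the result can be fed to Lexos or the clustering notebook.

```python
from collections import defaultdict

def topics_to_docs(word_topic_counts):
    """Build one pseudo-document ("topic-document") per topic from text in
    MALLET's --word-topic-counts-file format, whose lines look like:
        <word-index> <word> <topic>:<count> [<topic>:<count> ...]
    Each topic-document repeats each word in proportion to its count."""
    topic_words = defaultdict(list)
    for line in word_topic_counts.strip().splitlines():
        parts = line.split()
        word = parts[1]
        for pair in parts[2:]:
            topic, count = pair.split(":")
            topic_words[int(topic)].extend([word] * int(count))
    # Join each topic's repeated words into a single "document" string
    return {t: " ".join(words) for t, words in topic_words.items()}
```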

(d) For the Next Workshop: Interpret the Topic Model of the Sub-corpus


  • Discuss the topic model of the sub-corpus based on inspecting the topic model and also the clustering dendrogram (and other clustering experiments)
  • Work out a systematic, documentable workflow for interpreting topic models


(e) For a Later Workshop: Improve the Workflow


  • Improve workflow
  • Experiment with alternative workflows, e.g., Andrew Goldstone's dfrtopics R package?


(f) Manifest schema, Database system

  • reports from Scott and Jeremy


Scott created a demo of webform access to a MongoDB database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:


    1. WE1S flask+deform  



    2. WE1S flask+alpaca (+pymongo)  
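The hosted demos aren't reproduced here, but the flask+pymongo pattern they use can be sketched minimally. All names (route, fields, database, collection) are illustrative assumptions, not the actual WE1S demo: a form POST is turned into a document and inserted into a MongoDB collection. The collection is injected so a real pymongo collection can be passed in production.

```python
from flask import Flask, request

def create_app(manifests):
    """Minimal sketch of a webform-to-MongoDB round trip (names are
    illustrative, not the actual WE1S demo). `manifests` is any object
    with an insert_one method -- in production, a pymongo collection:
        from pymongo import MongoClient
        manifests = MongoClient()["we1s"]["manifests"]
    """
    app = Flask(__name__)

    @app.route("/manifest", methods=["POST"])
    def add_manifest():
        # Build a manifest document from the submitted form fields
        doc = {"title": request.form["title"], "source": request.form["source"]}
        manifests.insert_one(doc)
        return "stored", 201

    return app
```

Serving this from a container then reduces to packaging the app plus a MongoDB instance and exposing the form URL, which matches the container-hosted examples linked above.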



(NOTE -- as always, you may need to use the campus VPN in order to access these URLs)


