Meeting 24 (2015-10-23)

Page history last edited by Alan Liu 8 years, 6 months ago


(a) Near-future and Far-future Scheduling

  • Next Meeting Dates (options in November): WE1S project Calendar

  • Future Research Outcomes (?)

    • Panel proposal for DH 2017 in Montreal.

      • WE1S Panel would include several of us giving talks on the sub-topics:
        • WE1S overview and corpus
        • WE1S topic modeling methods, clustering methods, and results
        • WE1S Manifest Schema
        • WE1S infrastructure (GitHub container: MongoDB)
        • Collaboration with Lexos
    • Project write-up for submission to the new DHCommons Journal for peer-review of DH projects.

    • Later: individual or co-authored articles for journals on various aspects of the project.

 

 


(b) Scraping Work

  • Current status of UCSB scraping (Developer Tasks page)

    • Main fixes and to-do's that need to be applied to the summer scraping work?
    • Ongoing scraping in fall? (Ashley and possibly also Jamal serving as scrapers)
    • Status of the WE1S Corpus
      • on Google Drive
      • mirrored on Mirrormask (Jeremy's server) at: mirrormask/4humwe1s-GDrive/ (screenshot)
      • "flattened" versions archived at: mirrormask/4humwe1s-GDrive/archives/(screenshot)
      • backup archive at: TimeBackup/ (screenshot)
  • Scraping of Globe and Mail

    • Scraping to be done by Nathalie Popa (McGill U.) 
    • Current problems
  • Austin Yack's research on government and legislative documents on the humanities

 


(c) Next Meeting: Workshop for End-to-End Trial Rehearsal of Workflow for Topic Modeling WE1S Corpus (at sample scale)

 

       Plan for Workshop:

         (We'll run in parallel at our various locations. However, some steps may be prepared in advance, and some may be demonstrated as a tutorial from a single location.)

 

  • Infrastructure for Workshop:

    • Parallel installations on computers at the following locations:
      • Transcriptions (UCSB) (Prep the workstation attached to the projector and Skype)
      • Alan
      • Lindsay
      • Scott
    • Installations should include (in addition to tools we all already have on our machines)
      • Anaconda distro of Python.
      • Relevant Python scripts:
      • Relevant iPython notebooks:
  • Workshop Stage 1 -- Assemble a demo sub-corpus of the WE1S corpus

    • Access latest "flattened" collections of files in which final file names have been assigned (e.g., no "File12.txt" names) (screenshot of example)
    • Carve out a part of the corpus (e.g., NY Times, 2010-2014, "humanities" and "liberal arts") for our demo corpus
    • Copy the demo corpus to a working folder (e.g., on a Windows machine: C:\workspace\nyt-2010-2014\articles\)
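A minimal sketch of how the carve-out in the steps above could be scripted. The filename convention (source-year-term-number.txt) and all file names here are hypothetical stand-ins for the final names assigned during flattening:

```python
import re

# Hypothetical filename convention for the flattened corpus; the real
# final file names assigned during flattening may differ.
PATTERN = re.compile(r"^(?P<source>[a-z]+)-(?P<year>\d{4})-(?P<term>[a-z-]+)-\d+\.txt$")

def select_demo_corpus(filenames, source, years, terms):
    """Return the file names that belong to the demo slice of the corpus."""
    selected = []
    for name in filenames:
        m = PATTERN.match(name)
        if (m and m.group("source") == source
                and int(m.group("year")) in years
                and m.group("term") in terms):
            selected.append(name)
    return selected

files = [
    "nyt-2012-humanities-0001.txt",
    "nyt-2008-humanities-0002.txt",      # outside the year range
    "wsj-2012-liberal-arts-0003.txt",    # wrong source
    "nyt-2013-liberal-arts-0004.txt",
]
demo = select_demo_corpus(files, "nyt", range(2010, 2015),
                          {"humanities", "liberal-arts"})
```

The selected names could then be copied into the working folder with shutil.copy.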
  • Workshop Stage 2 - De-duplicate the sub-corpus

    • Report from Jeremy (and discussion)
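Pending Jeremy's report, one common approach to exact de-duplication is fingerprinting each article with a hash of its normalized text. This sketch assumes that method and is not a description of Jeremy's actual procedure:

```python
import hashlib
import re

def fingerprint(text):
    """Hash a lightly normalized version of the article text so that
    trivially different copies (whitespace, case) still collide."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first article seen for each fingerprint.
    docs is a mapping of file name -> article text."""
    seen, keep = set(), []
    for name, text in docs.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            keep.append(name)
    return keep

docs = {
    "a.txt": "The humanities are in the news.",
    "b.txt": "The   humanities are in the news.\n",  # whitespace variant of a.txt
    "c.txt": "A different article entirely.",
}
unique = deduplicate(docs)
```

Near-duplicates (e.g., wire stories with edited ledes) would need a fuzzier comparison than exact hashing.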
  • Workshop Stage 3 - Scrub the sub-corpus

    • Run Scott's Python scrubbing script (scrub.py) on the corpus, and deposit the results in a separate folder in the workspace (on a Windows machine, for example, C:\workspace\nyt-2010-2014\articles-scrubbed\)
      • Note: the most current version of the config.py file used to add words and phrases to the scrub.py script is kept on Google Drive: we1s-2 > stopwords_and_scrubbing_list. This config.py reflects Lindsay and Alan's cumulative additions to the scrubbing list so far.
      • Note: there is an Extra Stopwords list for MALLET at the Google Drive location above. It is used during topic modeling with MALLET.
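For orientation, a toy sketch of the kind of substitution scrub.py performs. The phrase and word lists here are made-up placeholders; the real lists live in the config.py file on Google Drive, and the actual scrub.py logic may differ:

```python
import re

# Hypothetical stand-ins for the phrases and words accumulated in the
# shared config file; the real scrubbing list is maintained on Google Drive.
SCRUB_PHRASES = ["to the editor", "new york times"]
SCRUB_WORDS = ["copyright", "reuters"]

def scrub(text):
    """Lowercase, remove boilerplate phrases and words, then tidy whitespace."""
    out = text.lower()
    for phrase in SCRUB_PHRASES:
        out = out.replace(phrase, " ")
    for word in SCRUB_WORDS:
        out = re.sub(r"\b%s\b" % re.escape(word), " ", out)
    return re.sub(r"\s+", " ", out).strip()

scrubbed = scrub("Copyright New York Times. The humanities matter.")
```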
  • Workshop Stage 4 - Topic model the sub-corpus using MALLET:

    • [Steps to be filled in here]
    • Recent topic modeling experiments with WE1S corpus:
      • Alan: topic models of NY Times 2002-6 "humanities" and NY Times 2010-14 "humanities" (the five years before and after the Great Recession).
      • Ashley & Zach: topic models of 1980s discourse about the humanities versus 1990s discourse about the humanities
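Until the steps are filled in, a sketch of the two MALLET invocations this stage would script. The paths, topic count, and output file names are placeholders to be settled at the workshop; the flags themselves are standard MALLET command-line options:

```python
# Build (but do not run) the MALLET import and training commands.
# All paths and the topic count are placeholders for workshop decisions.
def mallet_commands(corpus_dir, workspace, num_topics=50,
                    extra_stopwords="extra-stopwords.txt"):
    mallet_file = workspace + "/corpus.mallet"
    import_cmd = [
        "mallet", "import-dir",
        "--input", corpus_dir,
        "--output", mallet_file,
        "--keep-sequence",
        "--remove-stopwords",
        "--extra-stopwords", extra_stopwords,  # the Extra Stopwords list above
    ]
    train_cmd = [
        "mallet", "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", workspace + "/keys.txt",
        "--output-doc-topics", workspace + "/doc-topics.txt",
        "--word-topic-counts-file", workspace + "/topic-counts.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("articles-scrubbed", "workspace")
```

Each command list could be handed to subprocess.run() once the paths are real.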
  • Workshop Stage 5 - Clustering topics:

    • Use Scott's topicsToDocs.py script on the topic-counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. (Or use Lexos to do the same.)
    • Use Scott's adaptation of the DARIAH-DE tutorial iPython notebook to produce PCA clustering and dendrogram visualizations of the topic-documents
      • Alan's results from NY Times 2002-2006 "humanities": PCA | Dendrogram 
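An approximation of what topicsToDocs.py does, for reference; the real script's input format and output may differ. This sketch assumes MALLET's --word-topic-counts-file output, where each line reads "typeIndex word topic:count topic:count ...":

```python
from collections import defaultdict

def topics_to_docs(word_topic_counts):
    """Turn MALLET word-topic counts into one pseudo-document per topic,
    repeating each word as many times as the topic used it.
    (Approximation of topicsToDocs.py; the actual script may differ.)"""
    topic_words = defaultdict(list)
    for line in word_topic_counts.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        word = parts[1]
        for pair in parts[2:]:
            topic, count = pair.split(":")
            topic_words[int(topic)].extend([word] * int(count))
    return {t: " ".join(ws) for t, ws in topic_words.items()}

sample = "0 humanities 0:3 1:1\n1 budget 1:2"
docs = topics_to_docs(sample)
```

The resulting topic-documents can then be vectorized and fed to the PCA and dendrogram steps of the notebook.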

(d) For the Next Workshop: Interpret the Topic Model of the Sub-corpus

 

  • Discuss the topic model of the sub-corpus based on inspecting the topic model and also the clustering dendrogram (and other clustering experiments)
  • Work out a systematic, documentable workflow for interpreting topic models

 


(e) For a Later Workshop: Improve

 

  • Improve workflow
  • Experiment with alternative workflows, e.g., Andrew Goldstone's DFRtopics R package?

 


(f) Manifest schema, Database system

  • reports from Scott and Jeremy

 

Scott created a demo of web-form access to a MongoDB database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:

 

    1. WE1S flask+deform  

    http://mirrormask.english.ucsb.edu:8500/

 

    2. WE1S flask+alpaca (+pymongo)  

    http://mirrormask.english.ucsb.edu:8501/

 

(NOTE -- as always, you may need to use the campus VPN to access these URLs)
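To make the database-connected example concrete, here is a hypothetical manifest record of the kind the web form might submit to MongoDB. The field names and the validation rule are illustrative assumptions, not the actual WE1S manifest schema:

```python
import json

# Hypothetical manifest record in the spirit of the WE1S manifest schema;
# the actual schema (field names, required keys) is defined in the project.
REQUIRED = {"name", "namespace", "metapath"}

def validate_manifest(record):
    """Return the set of missing required keys (empty set means valid)."""
    return REQUIRED - set(record)

manifest = {
    "name": "nyt-2010-2014-demo",
    "namespace": "we1s/1.0",
    "metapath": "Corpus,nyt-2010-2014-demo",
    "description": "Demo sub-corpus for the workshop",
}
missing = validate_manifest(manifest)
payload = json.dumps(manifest)  # what the form would POST / pymongo would insert
```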

 

 
