(a) Near-future and Far-future Scheduling
-
-
Future Research Outcomes (?)
(b) Scraping Work
-
- Main fixes and to-dos that still need to be applied to the summer scraping work?
- Ongoing scraping in fall? (Ashley and possibly also Jamal serving as scrapers)
- Status of the WE1S Corpus
- on Google Drive
- mirrored on Mirrormask (Jeremy's server) at: mirrormask/4humwe1s-GDrive/ (screenshot)
- "flattened" versions archived at: mirrormask/4humwe1s-GDrive/archives/ (screenshot)
- backup archive at: TimeBackup/ (screenshot)
-
Scraping of Globe and Mail
- Scraping to be done by Nathalie Popa (McGill U.)
- Current problems
-
Austin Yack's research on government and legislative documents on the humanities
(c) Next Meeting: Workshop for End-to-End Trial Rehearsal of Workflow for Topic Modeling WE1S Corpus (at sample scale)
Plan for Workshop:
(We'll run in parallel at our various locations. However, some steps may be pre-prepared; and some may be performed as a tutorial only from one location.)
-
Infrastructure for Workshop:
- Parallel installations on computers at the following locations:
- Transcriptions (UCSB) (Prep the workstation attached to the projector and Skype)
- Alan
- Lindsay
- Scott
- Installations should include (in addition to tools we all already have on our machines)
- Anaconda distro of Python.
- Relevant Python scripts:
- Relevant IPython notebooks:
-
Workshop Stage 1 -- Assemble a demo sub-corpus of the WE1S corpus
- Access latest "flattened" collections of files in which final file names have been assigned (e.g., no "File12.txt" names) (screenshot of example)
- Carve out a part of the corpus (e.g., NY Times, 2010-2014, "humanities" and "liberal arts") for our demo corpus
- Copy the demo corpus to a working folder (e.g., on a Windows machine: C:\workspace\nyt-2010-2014\articles\)
-
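A minimal Python sketch of the carve-out and copy steps in Stage 1. It assumes, purely for illustration, that the flattened file names follow a pattern such as nytimes-2012-something.txt; the actual WE1S file-name convention is not recorded in these notes, so NAME_PATTERN and both function names are hypothetical and would need to be adapted.

```python
import re
import shutil
from pathlib import Path

# Hypothetical naming pattern: <source>-<year>-<anything>.txt
# The real WE1S file-name convention is not given in these notes.
NAME_PATTERN = re.compile(r"^(?P<source>[a-z]+)-(?P<year>\d{4})-.*\.txt$")


def select_demo_files(file_names, source, first_year, last_year):
    """Return the names that match the source and fall within the year range."""
    selected = []
    for name in file_names:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # skip names that do not follow the assumed pattern
        year = int(match.group("year"))
        if match.group("source") == source and first_year <= year <= last_year:
            selected.append(name)
    return selected


def copy_demo_corpus(flattened_dir, working_dir, source, first_year, last_year):
    """Copy the selected files from the flattened archive into the working folder."""
    flattened_dir, working_dir = Path(flattened_dir), Path(working_dir)
    working_dir.mkdir(parents=True, exist_ok=True)
    names = sorted(p.name for p in flattened_dir.glob("*.txt"))
    chosen = select_demo_files(names, source, first_year, last_year)
    for name in chosen:
        shutil.copy2(flattened_dir / name, working_dir / name)
    return chosen
```

For the example above, the call might look like copy_demo_corpus(flattened_folder, r"C:\workspace\nyt-2010-2014\articles", "nytimes", 2010, 2014), where flattened_folder is wherever the latest flattened collection has been downloaded.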
Workshop Stage 2 - De-duplicate the sub-corpus
- Report from Jeremy (and discussion)
-
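Pending Jeremy's report, here is one simple way exact duplicates could be detected, for discussion only. This is not Jeremy's method: it is a sketch that treats two articles as duplicates when their text is identical after lowercasing and collapsing whitespace. It will not catch near-duplicates (e.g., the same article with a different header). All function names are made up for this example.

```python
import hashlib
from pathlib import Path


def text_fingerprint(text):
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents):
    """Map each duplicate name to the first-seen name with the same fingerprint.

    `documents` is an iterable of (name, text) pairs.
    """
    first_seen = {}
    duplicates = {}
    for name, text in documents:
        key = text_fingerprint(text)
        if key in first_seen:
            duplicates[name] = first_seen[key]
        else:
            first_seen[key] = name
    return duplicates


def find_duplicates_in_folder(folder):
    """Read every .txt file in `folder` and report exact duplicates."""
    paths = sorted(Path(folder).glob("*.txt"))
    docs = ((p.name, p.read_text(encoding="utf-8", errors="replace")) for p in paths)
    return find_duplicates(docs)
```

The result only reports duplicates; deciding which copy to keep (and recording that decision) is left to the workflow.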
Workshop Stage 3 - Scrub the sub-corpus
- Run Scott's Python scrubbing script (scrub.py) on the corpus, and deposit the results in a separate folder in the workspace (on a Windows machine, for example, C:\workspace\nyt-2010-2014\articles-scrubbed\)
- Note: the most current version of the config-py file used to add words and phrases to the scrub.py script is kept on Google Drive: we1s-2 > stopwords_and_scrubbing_list. This config-py reflects Lindsay and Alan's cumulative additions to the scrubbing list so far.
- Note: there is an Extra Stopwords list for MALLET at the Google Drive location above. It is used during topic modeling with MALLET.
-
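To illustrate the general idea of scrubbing for anyone new to this step, here is a generic sketch of removing a list of words and phrases from a text. It is not scrub.py and does not read the config-py file; the function names and the example phrases are assumptions made only for illustration.

```python
import re


def build_scrubber(phrases):
    """Compile one pattern matching any listed phrase as a whole word or phrase."""
    # Longest first, so longer phrases win over shorter ones they contain.
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        raise ValueError("the scrubbing list is empty")
    alternatives = "|".join(re.escape(p) for p in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def scrub_text(text, scrubber):
    """Delete every match, then collapse the whitespace left behind."""
    return " ".join(scrubber.sub(" ", text).split())


# Example with a made-up scrubbing list:
scrubber = build_scrubber(["all rights reserved", "copyright"])
cleaned = scrub_text("Copyright 2012. The humanities. All rights reserved", scrubber)
```

Note that the word-boundary matching used here assumes each phrase starts and ends with a letter or digit; phrases that begin or end with punctuation would need different handling.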
Workshop Stage 4 - Topic model the sub-corpus using MALLET:
- [Steps to be filled in here]
- Recent topic modeling experiments with WE1S corpus:
- Alan: topic models of NY Times 2002-6 "humanities" and NY Times 2010-14 "humanities" (the five years before and after the Great Recession).
- Ashley & Zach: topic models of 1980s discourse about the humanities versus 1990s discourse about the humanities
-
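Until the steps above are filled in, the following sketch shows the two usual MALLET commands (import-dir, then train-topics) assembled from Python. The MALLET options used are standard ones, but the launcher path ("mallet"), the number of topics, the optimize interval, and the output file names are assumptions to be replaced with whatever we settle on at the workshop.

```python
import subprocess


def mallet_import_command(corpus_dir, mallet_file, extra_stopwords=None, mallet="mallet"):
    """Build the command that converts a folder of .txt files to a .mallet file."""
    cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,
        "--output", mallet_file,
        "--keep-sequence",       # required for topic modeling
        "--remove-stopwords",    # MALLET's default English stopword list
    ]
    if extra_stopwords:
        # e.g., the Extra Stopwords list kept on Google Drive
        cmd += ["--extra-stopwords", extra_stopwords]
    return cmd


def mallet_train_command(mallet_file, output_prefix, num_topics=50, mallet="mallet"):
    """Build the command that trains a topic model and writes its output files."""
    return [
        mallet, "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),   # assumed value; to be decided
        "--optimize-interval", "10",       # assumed value; to be decided
        "--output-topic-keys", output_prefix + "keys.txt",
        "--output-doc-topics", output_prefix + "composition.txt",
        "--word-topic-counts-file", output_prefix + "topic-counts.txt",
    ]


def run(cmd):
    """Run one MALLET command and stop with an error if it fails."""
    subprocess.run(cmd, check=True)
```

On a given machine, the mallet argument should point to the launcher in that installation's bin folder. The same option lists can also be typed directly at the command line.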
Workshop Stage 5 - Clustering topics:
- Use Scott's topicsToDocs.py script on the topic-counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. (Or use Lexos to do the same)
- Use Scott's adaptation of the DARIAH-DE tutorial IPython notebook to produce PCA clustering and dendrogram visualizations of the topic-documents
- Alan's results from NY Times 2002-20016 "humanities": PCA | Dendrogram
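For reference, a sketch of what creating "topic-documents" can look like. This is not Scott's topicsToDocs.py. It assumes topic-counts.txt is the file written by MALLET's --word-topic-counts-file option, in which each line has a word index, the word, and then topic:count pairs (for example: 0 humanities 0:3 2:1). Function names are made up for this example.

```python
from collections import defaultdict


def parse_word_topic_counts(lines):
    """Return {topic number: {word: count}} from word-topic-counts lines."""
    topics = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line, or a word with no topic assignments
        word = fields[1]
        for pair in fields[2:]:
            topic, count = pair.split(":")
            topics[int(topic)][word] = int(count)
    return dict(topics)


def topic_document(word_counts):
    """Turn one topic's word counts into a text with each word repeated count times."""
    ordered = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    words = []
    for word, count in ordered:
        words.extend([word] * count)
    return " ".join(words)
```

Each topic-document can then be saved as its own .txt file and loaded into the clustering notebook or into Lexos like any other document.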
(d) For the Next Workshop: Interpret the Topic Model of the Sub-corpus
- Discuss the topic model of the sub-corpus based on inspecting the topic model and also the clustering dendrogram (and other clustering experiments)
- Work out a systematic, documentable workflow for interpreting topic models
(e) For a Later Workshop: Improve
- Improve workflow
- Experiment with alternative workflows, e.g., Andrew Goldstone's dfrtopics R package?
(f) Manifest schema, Database system
- reports from Scott and Jeremy
Scott created a demo of webform access to a MongoDB database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:
1. WE1S flask+deform
http://mirrormask.english.ucsb.edu:8500/
2. WE1S flask+alpaca (+pymongo)
http://mirrormask.english.ucsb.edu:8501/
(NOTE -- as always, you may need to connect to the campus VPN in order to access these URLs)