Meeting 24 (2015-10-23)

Page history last edited by Alan Liu 8 years, 6 months ago


(a) Near-future and Far-future Scheduling

  • Next Meeting Dates (options in November): WE1S project Calendar

  • Future Research Outcomes (?)

    • Panel proposal for DH 2017 in Montreal.

      • WE1S Panel would include several of us giving talks on the sub-topics:
        • WE1S overview and corpus
        • WE1S topic modeling methods, clustering methods, and results
        • WE1S Manifest Schema
        • WE1S infrastructure (GitHub container: MongoDB)
        • Collaboration with Lexos
    • Project write-up for submission to the new DHCommons Journal for peer-review of DH projects.

    • Later: individual or co-authored articles for journals on various aspects of the project.

 

 


(b) Scraping Work

  • Current status of UCSB scraping (Developer Tasks page)

    • Main fixes and to-do's that need to be applied to the summer scraping work?
    • Ongoing scraping in fall? (Ashley and possibly also Jamal serving as scrapers)
    • Status of the WE1S Corpus
      • on Google Drive
      • mirrored on Mirrormask (Jeremy's server) at: mirrormask/4humwe1s-GDrive/ (screenshot)
      • "flattened" versions archived at: mirrormask/4humwe1s-GDrive/archives/(screenshot)
      • backup archive at: TimeBackup/ (screenshot)
  • Scraping of Globe and Mail

    • Scraping to be done by Nathalie Popa (McGill U.) 
    • Current problems
  • Austin Yack's research on government and legislative documents on the humanities

 


(c) Next Meeting: Workshop for End-to-End Trial Rehearsal of Workflow for Topic Modeling WE1S Corpus (at sample scale)

 

       Plan for Workshop:

         (We'll run in parallel at our various locations. However, some steps may be prepared in advance, and some may be demonstrated as a tutorial from a single location.)

 

  • Infrastructure for Workshop:

    • Parallel installations on computers at the following locations:
      • Transcriptions (UCSB) (Prep the workstation attached to the projector and Skype)
      • Alan
      • Lindsay
      • Scott
    • Installations should include (in addition to tools we all already have on our machines)
      • Anaconda distro of Python.
      • Relevant Python scripts:
      • Relevant iPython notebooks:
  • Workshop Stage 1 -- Assemble a demo sub-corpus of the WE1S corpus

    • Access latest "flattened" collections of files in which final file names have been assigned (e.g., no "File12.txt" names) (screenshot of example)
    • Carve out a part of the corpus (e.g., NY Times, 2010-2014, "humanities" and "liberal arts") for our demo corpus
    • Copy the demo corpus to a working folder (e.g., on a Windows machine: C:\workspace\nyt-2010-2014\articles\)
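A minimal sketch of how the carve-out in the steps above could be scripted. The filename convention (source-year-term-number.txt) and all file names here are hypothetical stand-ins for the final names assigned during flattening:

```python
import re

# Hypothetical filename convention for the flattened corpus; the real
# final file names assigned during flattening may differ.
PATTERN = re.compile(r"^(?P<source>[a-z]+)-(?P<year>\d{4})-(?P<term>[a-z-]+)-\d+\.txt$")

def select_demo_corpus(filenames, source, years, terms):
    """Return the file names that belong to the demo slice of the corpus."""
    selected = []
    for name in filenames:
        m = PATTERN.match(name)
        if (m and m.group("source") == source
                and int(m.group("year")) in years
                and m.group("term") in terms):
            selected.append(name)
    return selected

files = [
    "nyt-2012-humanities-0001.txt",
    "nyt-2008-humanities-0002.txt",      # outside the year range
    "wsj-2012-liberal-arts-0003.txt",    # wrong source
    "nyt-2013-liberal-arts-0004.txt",
]
demo = select_demo_corpus(files, "nyt", range(2010, 2015),
                          {"humanities", "liberal-arts"})
```

The selected names could then be copied into the working folder with shutil.copy.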
  • Workshop Stage 2 - De-duplicate the sub-corpus

    • Report from Jeremy (and discussion)
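Pending Jeremy's report, one common approach to exact de-duplication is fingerprinting each article with a hash of its normalized text. This sketch assumes that method and is not a description of Jeremy's actual procedure:

```python
import hashlib
import re

def fingerprint(text):
    """Hash a lightly normalized version of the article text so that
    trivially different copies (whitespace, case) still collide."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first article seen for each fingerprint.
    docs is a mapping of file name -> article text."""
    seen, keep = set(), []
    for name, text in docs.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            keep.append(name)
    return keep

docs = {
    "a.txt": "The humanities are in the news.",
    "b.txt": "The   humanities are in the news.\n",  # whitespace variant of a.txt
    "c.txt": "A different article entirely.",
}
unique = deduplicate(docs)
```

Near-duplicates (e.g., wire stories with edited ledes) would need a fuzzier comparison than exact hashing.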
  • Workshop Stage 3 - Scrub the sub-corpus

    • Run Scott's Python scrubbing script (scrub.py) on the corpus, and deposit the results in a separate folder in the workspace (on a Windows machine, for example, C:\workspace\nyt-2010-2014\articles-scrubbed\)
      • Note: the most current version of the config.py file used to add words and phrases to the scrub.py script is kept on Google Drive: we1s-2 > stopwords_and_scrubbing_list. This config.py reflects Lindsay and Alan's cumulative additions to the scrubbing list so far.
      • Note: there is an Extra Stopwords list for MALLET at the Google Drive location above. It is used during topic modeling with MALLET.
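For orientation, a toy sketch of the kind of substitution scrub.py performs. The phrase and word lists here are made-up placeholders; the real lists live in the config.py file on Google Drive, and the actual scrub.py logic may differ:

```python
import re

# Hypothetical stand-ins for the phrases and words accumulated in the
# shared config file; the real scrubbing list is maintained on Google Drive.
SCRUB_PHRASES = ["to the editor", "new york times"]
SCRUB_WORDS = ["copyright", "reuters"]

def scrub(text):
    """Lowercase, remove boilerplate phrases and words, then tidy whitespace."""
    out = text.lower()
    for phrase in SCRUB_PHRASES:
        out = out.replace(phrase, " ")
    for word in SCRUB_WORDS:
        out = re.sub(r"\b%s\b" % re.escape(word), " ", out)
    return re.sub(r"\s+", " ", out).strip()

scrubbed = scrub("Copyright New York Times. The humanities matter.")
```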
  • Workshop Stage 4 - Topic model the sub-corpus using MALLET:

    • [Steps to be filled in here]
    • Recent topic modeling experiments with WE1S corpus:
      • Alan: topic models of NY Times 2002-6 "humanities" and NY Times 2010-14 "humanities" (the five years before and after the Great Recession).
      • Ashley & Zach: topic models of 1980s discourse about the humanities versus 1990s discourse about the humanities
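Until the steps are filled in, a sketch of the two MALLET invocations this stage would script. The paths, topic count, and output file names are placeholders to be settled at the workshop; the flags themselves are standard MALLET command-line options:

```python
# Build (but do not run) the MALLET import and training commands.
# All paths and the topic count are placeholders for workshop decisions.
def mallet_commands(corpus_dir, workspace, num_topics=50,
                    extra_stopwords="extra-stopwords.txt"):
    mallet_file = workspace + "/corpus.mallet"
    import_cmd = [
        "mallet", "import-dir",
        "--input", corpus_dir,
        "--output", mallet_file,
        "--keep-sequence",
        "--remove-stopwords",
        "--extra-stopwords", extra_stopwords,  # the Extra Stopwords list above
    ]
    train_cmd = [
        "mallet", "train-topics",
        "--input", mallet_file,
        "--num-topics", str(num_topics),
        "--output-topic-keys", workspace + "/keys.txt",
        "--output-doc-topics", workspace + "/doc-topics.txt",
        "--word-topic-counts-file", workspace + "/topic-counts.txt",
    ]
    return import_cmd, train_cmd

import_cmd, train_cmd = mallet_commands("articles-scrubbed", "workspace")
```

Each command list could be handed to subprocess.run() once the paths are real.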
  • Workshop Stage 5 - Clustering topics:

    • Use Scott's topicsToDocs.py script on the topic-counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. (Or use Lexos to do the same.)
    • Use Scott's adaptation of the DARIAH-DE tutorial iPython notebook to produce PCA clustering and dendrogram visualizations of the topic-documents
      • Alan's results from NY Times 2002-2006 "humanities": PCA | Dendrogram 
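An approximation of what topicsToDocs.py does, for reference; the real script's input format and output may differ. This sketch assumes MALLET's --word-topic-counts-file output, where each line reads "typeIndex word topic:count topic:count ...":

```python
from collections import defaultdict

def topics_to_docs(word_topic_counts):
    """Turn MALLET word-topic counts into one pseudo-document per topic,
    repeating each word as many times as the topic used it.
    (Approximation of topicsToDocs.py; the actual script may differ.)"""
    topic_words = defaultdict(list)
    for line in word_topic_counts.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        word = parts[1]
        for pair in parts[2:]:
            topic, count = pair.split(":")
            topic_words[int(topic)].extend([word] * int(count))
    return {t: " ".join(ws) for t, ws in topic_words.items()}

sample = "0 humanities 0:3 1:1\n1 budget 1:2"
docs = topics_to_docs(sample)
```

The resulting topic-documents can then be vectorized and fed to the PCA and dendrogram steps of the notebook.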

(d) For the Next Workshop: Interpret the Topic Model of the Sub-corpus

 

  • Discuss the topic model of the sub-corpus based on inspecting the topic model and also the clustering dendrogram (and other clustering experiments)
  • Work out a systematic, documentable workflow for interpreting topic models

 


(e) For a Later Workshop: Improve

 

  • Improve workflow
  • Experiment with alternative workflows, e.g., Andrew Goldstone's DFRtopics R package?

 


(f) Manifest schema, Database system

  • reports from Scott and Jeremy

 

Scott created a demo of web-form access to a MongoDB database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:

 

    1. WE1S flask+deform  

    http://mirrormask.english.ucsb.edu:8500/

 

    2. WE1S flask+alpaca (+pymongo)  

    http://mirrormask.english.ucsb.edu:8501/

 

(NOTE -- as always, you may need to use the campus VPN to access these URLs)
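To make the database-connected example concrete, here is a hypothetical manifest record of the kind the web form might submit to MongoDB. The field names and the validation rule are illustrative assumptions, not the actual WE1S manifest schema:

```python
import json

# Hypothetical manifest record in the spirit of the WE1S manifest schema;
# the actual schema (field names, required keys) is defined in the project.
REQUIRED = {"name", "namespace", "metapath"}

def validate_manifest(record):
    """Return the set of missing required keys (empty set means valid)."""
    return REQUIRED - set(record)

manifest = {
    "name": "nyt-2010-2014-demo",
    "namespace": "we1s/1.0",
    "metapath": "Corpus,nyt-2010-2014-demo",
    "description": "Demo sub-corpus for the workshop",
}
missing = validate_manifest(manifest)
payload = json.dumps(manifest)  # what the form would POST / pymongo would insert
```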

 

 
