Planning Document for Manual Text Harvesting


Last updated:
Oct. 22, 2014
Project Overview | WE1 Corpus | Workflow: Corpus Collection | Stage 1 Transform | Stage 2 Transform | Dissemination & Publicity | Additional Experiments with WE1 Corpus

 

Project Overview 

WhatEvery1Says (WE1)

 

The 4Humanities WhatEvery1Says research project is collecting a corpus of public discourse about the humanities (in newspapers, magazines, blogs, reports intended for the public or legislatures, etc.) and analyzing that corpus with digital text-analysis methods.

 

Our hypothesis is that digital methods can help us learn new things about how media pundits, politicians, business leaders, administrators, scholars, students, artists, and others are actually thinking about the humanities. For example, are there sub-themes beneath the familiar dominant clichés and memes? Are there hidden connections or mismatches between the “frames” (premises, metaphors, and narratives) of those arguing for and against the humanities? How do different parts of the world or different kinds of speakers compare in the way they think about the humanities? Instead of concentrating on set debates and well-worn arguments, can we exploit new approaches or surprising commonalities to advocate for the humanities in the 21st century?

 

We hope to use findings from the WhatEvery1Says project to provide advocates for the humanities with strategies and materials for effective communication of the value of humanistic study and knowledge--with narratives, arguments, scenarios, and evidence that advance, rather than simply react to, public conversation on the place of the humanities in today's world.

 

Project History and Participants: The WhatEvery1Says project was initiated by 4Humanities.org in 2011.  Currently, project participants include members of the local chapters of 4Humanities at UCSB, CSUN, and UCLA in association with other members of the Southern California digital humanities collective (including the Whittier College Digital Liberal Arts Center).  Contact for project: Alan Liu, UCSB (ayliu@english.ucsb.edu).

 

Research Material 

WhatEvery1Says Corpus

 

  • WhatEvery1Says Corpus v. 1 (collected manually) -- in progress 
  • WhatEvery1Says Corpus v. 2 (extended corpus to be collected systematically or algorithmically)

Workflow (A)

Corpus Collection

 

  1. Continue adding examples of public discourse about the humanities in WhatEvery1Says Corpus v. 1 (kept in a Google spreadsheet). (In progress)
    1. Create a database site to replace the spreadsheet as a collection structure. (To do)
  2. Systematically search and identify online digital archives for documents to include in an extended WhatEvery1Says Corpus v. 2.
    1. Identify online archives of newspapers, magazines, etc., for which we have full-text access and that allow for searching and automated downloading. (To do)
    2. Create a method and script for identifying relevant articles for the WhatEvery1Says corpus (To do)
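As a rough illustration of what the identification script in step 2.2 might look like, the sketch below scores an article by keyword density and flags it for human review. The keyword list and threshold here are placeholders for illustration only, not project decisions; the actual inclusion criteria would be set editorially.

```python
import re

# Hypothetical seed keywords for spotting public discourse about the
# humanities; the real WE1 criteria would be decided by the project team.
KEYWORDS = ["humanities", "liberal arts", "humanistic"]

def relevance_score(text):
    """Count keyword hits per 1,000 words as a crude relevance score."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    if not words:
        return 0.0
    hits = sum(len(re.findall(re.escape(kw), lowered)) for kw in KEYWORDS)
    return 1000.0 * hits / len(words)

def is_candidate(text, threshold=2.0):
    """Flag an article for human review if its score passes the threshold."""
    return relevance_score(text) >= threshold
```

A production version would also need to handle the search and download mechanics of each archive, which vary by provider.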

Workflow (B)

Stage 1 Transform of Corpus
(text archiving, extraction)

 

  1. Archive document from corpus, extract plain-text version, and save the results on the Google Drive space for the project according to the following organizational protocol: (in progress)

    /_corpus
      /whatevery1says
        /_document
          /MyDoc
            /_source
            /_archive
            /_transform-cut-paste
            /_transform-hand-edited

    The path on the Google Drive space is: 4Humanities > 4Humanities--WhatEvery1Says > _corpus > whatevery1says-web-unstructured > _document

      • Full statement of the organizational protocol based on a sample document
      • Editorial policy for what to include / exclude in text extraction
      • Tools to help automate text extraction and preparation, e.g., to auto-extract from PDFs (use with caution)
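The per-document folder protocol above can be scripted so that each newly archived document gets a consistent skeleton. A minimal sketch, assuming the _source, _archive, and transform folders are parallel subfolders of the document folder (the original nesting is ambiguous), and using generic names since the live Google Drive path differs slightly:

```python
from pathlib import Path

# Subfolders created for each archived document, per the protocol above
# (treated here as siblings under the per-document folder).
SUBFOLDERS = ["_source", "_archive", "_transform-cut-paste", "_transform-hand-edited"]

def create_document_skeleton(corpus_root, doc_name):
    """Create /_corpus/whatevery1says/_document/<doc_name>/ with its subfolders."""
    doc_dir = Path(corpus_root) / "_corpus" / "whatevery1says" / "_document" / doc_name
    for sub in SUBFOLDERS:
        (doc_dir / sub).mkdir(parents=True, exist_ok=True)
    return doc_dir
```

A form-driven interface (step 2 below) could call a routine like this before saving the archived and extracted files into the appropriate subfolders.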

  2. Create a form-driven interface for the archiving and extraction work above.  (in progress: Jeremy Douglass)

  3. Preliminary Results of Stage 1 Transformed Corpus: plain-text files for 61 sample documents from the original raw corpus
    Path on Google Drive space: 4Humanities > 4Humanities--WhatEvery1Says > sample-corpus-1.0

Workflow (C)

Stage 2 Transformation of Corpus
(text cleaning and other manipulation)

 

We are currently working on specific components of the following set of processes, which should ideally be explored in iterative complementarity with topic modeling runs (below) and which should ultimately be stitched together and automated as a single workflow.  Once we have a tested workflow, we will begin production-run transformations of the corpus.

 

  1. Perform initial text cleaning, punctuation-stripping, and low-level prepping work -- automate as appropriate using Lexos or other text-preparation tools. (to do)
  2. Identify bigrams (e.g., "social sciences") that need to be converted to unigrams (in progress)
    1. Assisted by identification of frequent collocates using Antconc (in progress: Lindsay Thomas); Antconc results on sample corpus
  3. Build a stop list (in progress: Jeremy Douglass et al.)
    1. Standard starter stop lists
      1. The Fox 1992 stop word list (429 words): Fox, C. (1992). "Lexical Analysis and Stop Lists." In Frakes, W., and Baeza-Yates, R., eds., Information Retrieval: Data Structures and Algorithms, chapter 7. Prentice-Hall. http://www.lextek.com/manuals/onix/stopwords1.html
      2. The SMART 1971 stop word list (571 words): Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Upper Saddle River, NJ: Prentice-Hall. http://www.lextek.com/manuals/onix/stopwords2.html [similar to the MALLET standard English-language stop list]
    2. Andrew Goldstone and Ted Underwood's stop list
    3. Matthew Jockers's stop list
  4. Use named-entity parsers to identify proper names, etc., that can either be put in the stop list or set aside for social-network analysis (separate from the topic modeling). (Zach Horton and Liz Shayne)
    1. Resources: Michelle Moravec, "How to Use Stanford's NER and Extract Results" (2014)
  5. Use part-of-speech taggers to allow us to experiment with subtracting verbs, etc., to improve the usefulness of topic modeling. (in progress: Priscilla Leung)
    1. POS tagging examples by Priscilla
  6. Put component processes together in a single workflow; find ways to automate at least parts of the workflow or transitions between parts (to do)
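As a rough illustration of how steps 1-3 might chain together once automated (step 6), the sketch below lowercases text, merges known bigrams into unigrams, strips punctuation, and drops stop words. The bigram map and stop list are tiny placeholders; the project's real lists would come from the Antconc collocate work and the starter stop lists above.

```python
import re

# Placeholder bigram-to-unigram map (step 2); the real list would come
# from frequent-collocate analysis of the corpus.
BIGRAMS = {"social sciences": "social_sciences", "liberal arts": "liberal_arts"}

# Placeholder stop list (step 3); production runs would load the Fox,
# SMART, or a project-specific list instead.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}

def clean_text(text):
    """Lowercase, merge known bigrams, strip punctuation, drop stop words."""
    text = text.lower()
    for bigram, unigram in BIGRAMS.items():
        text = text.replace(bigram, unigram)
    tokens = re.findall(r"[a-z_]+", text)   # punctuation-stripping (step 1)
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, "The value of the liberal arts and social sciences is clear." would reduce to the tokens value, liberal_arts, social_sciences, clear. The named-entity and part-of-speech steps (4-5) would slot in as additional filters on the token list.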

 

Tools and Resources from the DH Toychest:

Project Dissemination and Publicity

 

  1. Post progress reports and results on 4Humanities.org (descriptions of project are now on the site)
  2. Explore writing co-authored article(s) and papers on the project.
  3. Organize a 4Humanities event for analyzing results of the project.
  4. Brainstorm ways of using the project to build humanities advocacy guides and resources.

Additional Experiments with WhatEvery1Says Corpus