Planning Document for Algorithmic Text Harvesting

Page history last edited by Alan Liu 9 years, 4 months ago

Workflow Plan for Text Harvesting


  • (A) Locate and prioritize sources for text harvesting
    1. What we are targeting: public discourse about the humanities (in English-language text resources), whether by students, journalists and pundits, politicians, general members of the public, or academics. At present, this excludes scholarly research articles, etc.; it also excludes audiovisual sources that do not offer text transcripts (e.g., TV). The first priority is the last five or so years. Later we can sample from historical periods.
    2. Look at the manually-collected WhatEvery1Says corpus as it stands to date for likely newspaper, magazine, and other resources
    3. Check to see which of these we can access for automated searching of metadata or full text through a UCSB subscription
    4. Check to see which provide API access


  • (B) Create general heuristic for searching for/identifying relevant documents
    • For example:
      • (a) create sample set of relevant documents from diverse sources;
      • (b) identify collocates of "humanities" (and run other frequency-based text analyses); do the same for "the arts" in British sources, since Britain speaks of "the arts" in roughly the same way that the U.S. speaks of "humanities";
      • (c) disambiguate if possible from "liberal arts"
    • Lindsay's tests using AntConc to identify collocates of "humanities" as signals of relevant articles: Antconc Experiments
  • (C) Create scripts, etc., for searching specific sources for relevant documents
  • (D) Create methods, scripts for downloading full-text documents where possible.



  • (E) Develop archiving/repository strategy



  • Future tasks:
    • Data wrangling, cleaning
    • Exploratory text-analysis, social-network-analysis, and visualization work:
      • Topic modeling
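The collocation step in (B) can be sketched in a few lines of Python. This is only a minimal illustration, assuming plain-text documents already in memory; it is a stand-in for the AntConc experiments, not a reproduction of AntConc's method. The names `tokenize`, `collocates`, `SAMPLE_DOCS`, and `STOPWORDS`, the window size of 5, and the sample sentences are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(docs, node="humanities", window=5, stopwords=frozenset()):
    """Count words occurring within `window` tokens of `node` on either side.

    Raw co-occurrence counts only; no association measure (such as
    mutual information) is computed here.
    """
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] not in stopwords:
                    counts[tokens[j]] += 1
    return counts


# Hypothetical sample set; a real run would use documents from step (B)(a).
SAMPLE_DOCS = [
    "The humanities face budget cuts.",
    "Budget cuts threaten the humanities and the arts.",
]
STOPWORDS = frozenset({"the", "and"})

if __name__ == "__main__":
    for word, n in collocates(SAMPLE_DOCS, stopwords=STOPWORDS).most_common(5):
        print(word, n)
```

Calling the same function with `node="arts"` would cover the British usage noted in (B)(b), although separating "the arts" from "liberal arts" as in (B)(c) would need phrase-level matching that this sketch does not attempt.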



