
Planning Document for Algorithmic Text Harvesting

Saved by Alan Liu on October 23, 2014 at 9:53:23 am
  • Idea of WhatEvery1Says Project

  • Algorithmic Text Harvesting Sub-project
    • Project Group Logistics
      • Hours and timesheets
      • Semi-regular meetings (e.g., every 2 weeks)?
      • Shared and/or individual development environments (Python, Github, etc.)
    • Workflow
      • Locate and prioritize sources for text harvesting
        1. What we are targeting: public discourse about the humanities (in English-language text resources), whether by students, journalists and pundits, politicians, members of the general public, or academics.  At present this excludes scholarly research articles and the like, as well as audiovisual sources that do not offer text transcripts (e.g., TV).  The first priority is material from roughly the last five years; later we can sample from earlier historical periods.
        2. Look at the manually-collected WhatEvery1Says corpus as it stands to date for likely newspaper, magazine, and other resources
        3. Check which of these sources we can access for automated searching of metadata or full text (e.g., through UCSB subscriptions)
        4. Check to see which provide API access
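Since the last step asks which sources provide API access, here is a minimal sketch of what a programmatic metadata search could look like, using the New York Times Article Search API (also mentioned in the IGERT notes below). The endpoint path and parameter names follow the public v2 API; the key is a placeholder, and `build_search_url` is a hypothetical helper name, not part of any existing project code.

```python
from urllib.parse import urlencode

# Base endpoint for the NYT Article Search API (v2).
NYT_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_search_url(query, begin_date, end_date, api_key="YOUR_KEY_HERE"):
    """Return a request URL for a keyword + date-range metadata search.

    Dates use the API's YYYYMMDD format.  The URL is only constructed
    here, not sent; a real key is needed when the request is actually
    made (e.g., with urllib.request.urlopen or the requests library).
    """
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "api-key": api_key,
    }
    return NYT_SEARCH_URL + "?" + urlencode(params)

# Example: search for "humanities" over (roughly) the last five years.
url = build_search_url('"humanities"', "20090101", "20141023")
```

The same pattern (base URL + encoded query parameters) should adapt to most of the subscription databases that expose an API, though each will have its own endpoint and parameter vocabulary.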
      • Create general heuristic for searching for/identifying relevant documents
        • For example:
          • (a) create sample set of relevant documents from diverse sources;
          • (b) collocate "humanities" with other words (and run other word-frequency analyses); do the same for British sources, which speak of "the arts" in roughly the way that U.S. sources speak of "the humanities";
          • (c) where possible, disambiguate "humanities" from "liberal arts"
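The collocation idea in step (b) can be prototyped in plain Python before moving to NLTK's collocation finders. This sketch (the function name `window_collocates` and the sample sentence are invented for illustration) counts words that occur within a fixed window of a target term:

```python
import re
from collections import Counter

def window_collocates(text, target="humanities", window=5):
    """Count words appearing within `window` tokens of `target`.

    A crude first pass at step (b); NLTK's BigramCollocationFinder and
    its association measures (e.g., PMI) would be the natural next step,
    along with stopword filtering to push words like "the" out of the
    top ranks.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

sample = ("The crisis of the humanities is a perennial theme; "
          "defenders of the humanities answer that the humanities teach "
          "critical thinking.")
counts = window_collocates(sample)
```

Running the same function with `target="arts"` over British sources would cover the U.K. case noted in (b), and comparing collocate lists for "humanities" vs. "liberal arts" is one way to approach the disambiguation in (c).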
      • Create scripts, etc., for searching specific sources for relevant documents
      • Create methods, scripts for downloading full-text documents where possible.
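For the download step, here is a small sketch of how harvested full text might be written to disk with a filename derived from the source URL. `slug_for` and `save_fulltext` are hypothetical names, and the HTTP fetch itself (with polite rate limiting and robots.txt checks) is deliberately left out:

```python
import os
import re
from urllib.parse import urlparse

def slug_for(url):
    """Derive a filesystem-safe filename from an article URL."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    return re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-") + ".txt"

def save_fulltext(url, text, outdir="harvested"):
    """Write one harvested document to disk, named after its URL.

    Keeping one plain-text file per document makes later data wrangling
    and archiving (see Future tasks) straightforward.
    """
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, slug_for(url))
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

A companion metadata file (source, date, query that found the document) per text file would likely be worth adding once the archiving/repository strategy is settled.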
      • Future tasks:
        • Data wrangling, cleaning
        • Develop archiving/repository strategy
        • Exploratory text-analysis, social-network-analysis, and visualization work:
          • Topic modeling
    • Resources:
      • DH Toychest
      • TAPoR
      • Alex's report and resources from IGERT bootcamp
        • Summary of IGERT Boot Camp for What Every 1 Says Project

          1. Graph Theory and Python (Days 1-3)
          2. Linear Algebra (Days 4-6)
          3. Data Mining (Days 7-10)

          Overview of potentially useful tools:
          • Python – overview
          • APIs – general case and example (NYT and gender)
          • Data wrangling and analysis – Pandas, network theory
          • NLTK (Natural Language Toolkit) – existing tools/strategies for analyzing text
          • NetworkX – may be useful for visualizing and presenting data
          • Git/Github – may be useful for version control, collaboration, and accessing code
        • Resources from Bootcamp
