Planning Document for Algorithmic Text Harvesting
Workflow Plan for Text Harvesting
- (A) Locate and prioritize sources for text harvesting
- What we are targeting: public discourse about the humanities (in English-language text resources), whether by students, journalists and pundits, politicians, general members of the public, or academics. At present this excludes scholarly research articles and the like, as well as audiovisual sources that do not offer text transcripts (e.g., TV). The first priority is material from roughly the last five years; later we can sample from earlier historical periods.
- Look at the manually-collected WhatEvery1Says corpus as it stands to date for likely newspaper, magazine, and other resources
- Check to see which of these we can access for automated searching of metadata or full text through UCSB subscriptions
- Check to see which provide API access
- E.g., New York Times - Developers (Search by API) ("why just read the news when you can hack it?"; APIs for accessing headlines, abstracts, first paragraphs, links, etc., to NYT data; includes APIs for articles, best sellers, comments by users, most popular items, newswire, and other parts of the Times; a sample query is sketched below)
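To make the API idea concrete, here is a minimal Python sketch of a query against the NYT Article Search API (v2). The API key is a placeholder; the endpoint and field names follow the public NYT developer documentation, but treat this as a starting point rather than production code.

import requests

NYT_API_KEY = "YOUR_KEY_HERE"  # placeholder; register at developer.nytimes.com
ENDPOINT = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def search_nyt(query, page=0):
    """Return one page (10 results) of article metadata matching the query."""
    params = {
        "q": query,
        "begin_date": "20100101",  # roughly the last five years, per plan (A)
        "page": page,
        "api-key": NYT_API_KEY,
    }
    response = requests.get(ENDPOINT, params=params)
    response.raise_for_status()
    return response.json()["response"]["docs"]

for doc in search_nyt("humanities"):
    # Each doc carries headline, snippet/abstract, and a web_url for later
    # harvesting; the API returns metadata and lead text, not full articles.
    print(doc["headline"]["main"], "|", doc["web_url"])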
- (B) Create general heuristic for searching for/identifying relevant documents
- For example:
- (a) create a sample set of relevant documents from diverse sources;
- (b) collocate "humanities" with other words and run other frequency-based text analyses (do the same for British sources, which speak of "the arts" in roughly the way the U.S. speaks of "the humanities");
- (c) where possible, disambiguate "humanities" from "liberal arts"
- Lindsay's tests using AntConc to identify collocates for "humanities" as signals of relevant articles: Antconc Experiments (a scriptable version of this test is sketched below)
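The collocation test can be scripted as well. Below is a rough Python/NLTK sketch that counts words appearing within a five-word window of "humanities"; the input file name is hypothetical, standing in for a document from the sample set in (a).

from collections import Counter
import nltk

nltk.download("punkt", quiet=True)  # tokenizer model; needed once

def collocates(text, node="humanities", window=5):
    """Count words within +/- `window` tokens of the node word."""
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

sample = open("sample_article.txt").read()  # hypothetical file from the sample set
for word, n in collocates(sample).most_common(20):
    print(word, n)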
- (C) Create scripts, etc., for searching specific sources for relevant documents
- (D) Create methods and scripts for downloading full-text documents where possible (a minimal download sketch follows)
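For (D), a bare-bones download sketch using the requests and BeautifulSoup libraries is below. Real sources will each need their own selectors, and any script must respect paywalls, robots.txt, and terms of service; the URL here is a placeholder.

import requests
from bs4 import BeautifulSoup

def fetch_full_text(url):
    """Download a page and return its visible text, one block per line."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop non-content markup before extracting text
    return soup.get_text(separator="\n", strip=True)

text = fetch_full_text("http://example.com/some-article")  # placeholder URL
print(text[:500])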
- (E) Develop an archiving/repository strategy (a Zotero-based sketch follows below)
- Zotero-specific Resources:
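One possible repository strategy is to push harvested metadata into a shared Zotero group library through the Zotero Web API. The sketch below uses the pyzotero wrapper (our assumption, not yet a project decision); the library ID, key, and item values are all placeholders.

from pyzotero import zotero

zot = zotero.Zotero("GROUP_LIBRARY_ID", "group", "API_KEY")  # placeholders

def archive_item(title, url, date):
    """File one harvested article as a newspaperArticle item in the library."""
    template = zot.item_template("newspaperArticle")
    template["title"] = title
    template["url"] = url
    template["date"] = date
    return zot.create_items([template])

archive_item("Placeholder article title", "http://example.com/article", "2014-06-01")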
- Future tasks:
- Data wrangling and cleaning (see the Pandas sketch after this list)
- Exploratory text-analysis, social-network-analysis, and visualization work
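As a placeholder for the wrangling step, here is a short Pandas sketch; the CSV file and column names are assumptions about what the harvesting scripts will eventually record.

import pandas as pd

df = pd.read_csv("harvested_metadata.csv")    # hypothetical output of step (D)
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.drop_duplicates(subset=["url"])       # same article fetched twice
df = df.dropna(subset=["date", "full_text"])  # drop unusable rows
print(df.groupby(df["date"].dt.year).size())  # articles per year, a first sanity check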
Resources:
- DH Toychest
- TAPoR
- Alex's report and resources from IGERT bootcamp
- Summary of IGERT Boot Camp for What Every 1 Says Project
1. Graph Theory and Python (Days 1-3)
2. Linear Algebra (Days 4-6)
3. Data Mining (Days 7-10)
Overview of potentially useful tools:
• Python – Overview
• APIs – General case and example (NYT and gender)
• Data wrangling and analysis – Pandas, network theory
• NLTK (Natural Language Toolkit) – existing tools/strategies for analyzing text
• NetworkX – may be useful for visualizing, presenting data (toy example below)
• Git/GitHub – may be useful for version control, collaboration, accessing code
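As a toy illustration of the NetworkX idea, the snippet below turns collocate counts (see the sketch under (B)) into a small word network and plots it; the counts here are invented for demonstration.

import networkx as nx
import matplotlib.pyplot as plt

counts = {"crisis": 12, "funding": 9, "liberal": 7, "digital": 5}  # invented numbers
G = nx.Graph()
for word, n in counts.items():
    G.add_edge("humanities", word, weight=n)  # star network around the node word

nx.draw_networkx(G)  # quick plot; node labels drawn by default
plt.show()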
- Resources from Bootcamp
- Alan can also ask for advice from other DH'ers
- Shared Documents
- Web Scraping