Planning Document for Algorithmic Text Harvesting

Page history last edited by Alan Liu 9 years, 4 months ago

Workflow Plan for Text Harvesting

What we are targeting: public discourse about the humanities (in English language text resources), whether by students, journalists and pundits, politicians, general members of the public, or academics. At present, this excludes scholarly research articles, etc.; and it also excludes audiovisual sources that do not offer text transcripts (e.g., TV). The first priority is the last five or so years. Later we can sample from historical periods.
Look at the manually collected WhatEvery1Says corpus as it stands to date to identify likely newspaper, magazine, and other resources
Check to see which of these we can access for automated searching of metadata or full text through UCSB subscription
Check to see which provide API access

E.g., New York Times - Developers (Search by API) ("why just read the news when you can hack it?"; APIs for accessing headlines, abstracts, first paragraphs, links, etc. in NYT data; includes APIs for articles, best sellers, user comments, most popular items, newswire, and other parts of the Times)
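As a rough illustration of what searching by API could look like, here is a minimal Python sketch that builds a query URL and pulls a few fields out of a decoded JSON response. The endpoint, parameter names, and response field names are assumptions based on the NYT Article Search API (v2) and should be checked against the current developer documentation; the function names are hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NYT Article Search API (v2); verify before use.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, begin_date, end_date, api_key, page=0):
    """Return a request URL for one page of search results.

    Dates are strings in YYYYMMDD form (assumed parameter format).
    """
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    return BASE_URL + "?" + urlencode(params)


def extract_records(payload):
    """Pull headline, URL, date, and lead paragraph from a decoded response.

    The field names ("response", "docs", "headline", ...) are assumptions
    about the response layout; missing fields become empty strings.
    """
    records = []
    for doc in payload.get("response", {}).get("docs", []):
        records.append({
            "headline": (doc.get("headline") or {}).get("main", ""),
            "url": doc.get("web_url", ""),
            "date": doc.get("pub_date", ""),
            "lead": doc.get("lead_paragraph", ""),
        })
    return records
```

For example, `build_search_url("humanities", "20100101", "20141231", "YOUR_KEY")` gives a URL that could be fetched and decoded with `json.loads` before being passed to `extract_records`.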

(B) Create general heuristic for searching for/identifying relevant documents

(a) create sample set of relevant documents from diverse sources;
(b) collocate "humanities" with other words (and run other frequency-based text analyses); do the same for Britain, which speaks of "the arts" in roughly the same way that the U.S. speaks of "humanities";
(c) disambiguate if possible from "liberal arts"
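
Step (b) could be prototyped outside a concordancer with a few lines of Python. The sketch below counts the words that occur within a fixed window around a node word; the function name, the window size, the tokenizer, and the stopword handling are illustrative assumptions, not a description of how AntConc computes collocates.

```python
import re
from collections import Counter


def collocates(text, node="humanities", window=4, stopwords=frozenset()):
    """Count words appearing within `window` tokens of each `node` occurrence."""
    # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] not in stopwords:
                counts[tokens[j]] += 1
    return counts
```

Running this over the sample set from step (a) with `node="humanities"`, and again with `node="arts"` for British sources, would give raw co-occurrence counts to compare; words that also cluster around "liberal arts" could then be examined for step (c).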

Lindsay's tests using AntConc to identify collocates of "humanities" as signals of relevant articles: Antconc Experiments

(C) Create scripts, etc., for searching specific sources for relevant documents
(D) Create methods, scripts for downloading full-text documents where possible.
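
One possible shape for steps (C) and (D), sketched in Python under stated assumptions: given a list of candidate URLs, fetch each page, keep those that mention a target term, and save them to disk. All function names are hypothetical, the relevance test is a bare substring check standing in for the heuristic from (B), and the sketch saves the raw page without stripping markup. Access terms for each source would need to be checked before running anything like this.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url, timeout=30):
    """Download a page and return it as text (performs a network call)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def is_relevant(text, terms=("humanities",)):
    """Placeholder relevance check: does any target term occur in the text?"""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def harvest(urls, out_dir, fetch=fetch_url, terms=("humanities",)):
    """Fetch each URL, keep relevant pages, and save them under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        try:
            text = fetch(url)
        except OSError:
            continue  # skip unreachable sources (URLError is an OSError)
        if not is_relevant(text, terms):
            continue
        # File name derived from the URL so repeated runs overwrite, not duplicate.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".txt"
        path = out / name
        path.write_text(text, encoding="utf-8")
        saved.append(path)
    return saved
```

The `fetch` parameter is injectable so the pipeline can be tried on canned pages before any real source is contacted.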

Resources:
