Document Sources
Saved by Phillip Cortes
on August 19, 2015 at 1:26:00 pm
This page lists candidate and active sources of public documents about the humanities (newspapers, magazines, blogs, government or foundation reports, etc.). For finished text-harvesting and text-analysis workflows for particular document sources, see Workflows for Sources.
Active sources for WhatEvery1Says text-mining project
(Sources we are currently working on)
Source (and URL)
|
Nation
|
Text-Harvest Method
|
Coverage |
Workflow
|
New York Times
|
US |
API for search
Import.io (& Outwit Hub) for scraping |
- NYT provides access to its archives in 3 tiers:
  - 1850-1923: non-OCR PDFs
  - 1923-1980: non-OCR PDFs (limited to 100 articles per month with a NYT subscription)
  - 1981-present: fully digitized text, available with a NYT subscription. The HTML of the digitized text appears to have evolved in 4 stages, each requiring a different Import.io extractor:
    - 1981-2004
    - 2005 (anomalous year requiring special handling)
    - 2006-2011
    - 2012-present
- The API provides full-text access from 1981 on, and metadata and abstracts for earlier years.
|
Workflow
|
Wall Street Journal |
US |
ProQuest for search
Outwit Hub for scraping
|
- ProQuest provides the WSJ from 2 Jan. 1984 to the present.
- However, some articles have both a readily accessible abstract and full text, while others have only an accessible abstract, with the full text hidden in a Flash widget. Full text is the norm from about 1998 on (with unpredictable exceptions for particular articles); in earlier years it is inconsistently missing.
|
Workflow
|
USA Today
|
US
|
API for search
Python script for scraping
|
- Full text is available through the API back to 2004.
|
Workflow
|
The Guardian (guardian.co.uk) |
UK
|
API for search
Import.io for scraping
|
- Full digitization appears to start in 1994, the year of the oldest results returned for "humanities" in The Guardian API Console.
|
Workflow
|
NPR |
US
|
Import.io for scraping
|
|
Workflow
|
|
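The New York Times coverage notes say the digitized HTML evolved in four stages, each needing its own Import.io extractor. As a rough sketch of how a harvesting script might route an article to the right extractor by publication year, something like the following could work. The year ranges come from the notes above; the extractor labels and the function name are hypothetical placeholders, not actual Import.io identifiers or part of any WE1S workflow.

```python
from typing import Optional

# Year ranges are taken from the NYT coverage notes on this page.
# The labels are hypothetical placeholders, not real Import.io extractor IDs.
NYT_HTML_STAGES = [
    (1981, 2004, "nyt_1981_2004"),
    (2005, 2005, "nyt_2005"),          # anomalous year requiring special handling
    (2006, 2011, "nyt_2006_2011"),
    (2012, None, "nyt_2012_present"),  # None means the range is open-ended
]


def nyt_extractor_for_year(year: int) -> Optional[str]:
    """Return the placeholder extractor label for a publication year.

    Returns None for years before 1981, where the archive offers
    non-OCR PDFs rather than digitized HTML.
    """
    for start, end, label in NYT_HTML_STAGES:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

For example, `nyt_extractor_for_year(2005)` returns the label for the anomalous 2005 stage, while a year such as 1975 returns `None`, signalling that no HTML extractor applies. Keeping the ranges in one table makes it easy to add a stage if the site's HTML changes again.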
Candidate sources (non-academic)
= Possible get
Candidate sources (academic)
Other Sources Suggested by WE1S's Earlier Manually Collected Articles on the Humanities