Document Sources
Saved by Alan Liu
on August 4, 2015 at 8:52:38 pm
This page lists candidate and active sources of public documents about the humanities (newspapers, magazines, blogs, government or foundation reports, etc.). For finished text-harvesting and text-analysis workflows for particular document sources, see Workflows for Sources.
Active sources for WhatEvery1Says text-mining project
(Sources we are currently working on)
Source (and URL)
|
Nation
|
Text-Harvest Method
|
Coverage |
Workflow
|
New York Times
|
US |
API for search
Import.io (& Outwit Hub) for scraping |
- NYT provides access to its archives in 3 tiers:
  - 1850-1923: non-OCR PDFs
  - 1923-1980: non-OCR PDFs (limited to 100 articles per month with an NYT subscription)
  - 1981-present: fully digitized text, available with an NYT subscription. The HTML of the digitized text appears to have evolved in 4 stages, each requiring a different Import.io extractor:
    - 1981-2004
    - 2005 (anomalous year requiring special handling)
    - 2006-2011
    - 2012-present
- The API provides full-text access from 1981 on, and metadata and abstracts for prior years.
|
Workflow
|
Wall Street Journal |
US |
ProQuest for search
Outwit Hub for scraping
|
- ProQuest provides the WSJ from 2 Jan. 1984 to the present.
- However, some articles have readily accessible abstract and full text, while others have only an accessible abstract with the full text hidden in a Flash widget. Full text is the norm from about 1998 on (with unpredictable exceptions in particular articles); full text is inconsistently missing in earlier years.
|
Workflow
|
USA Today
|
US
|
API for search
Python script for scraping
|
- Full text is available in the API back to 2004.
|
Workflow
|
The Guardian (guardian.co.uk) |
UK
|
API for search
Import.io for scraping
|
- Full digitization appears to start in 1994, the year of the oldest results returned for "humanities" in The Guardian API Console.
|
Workflow
|
NPR |
US
|
Import.io for scraping
|
|
Workflow
|
|
|
|
|
|
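The four HTML stages noted for the New York Times above suggest a simple year-based dispatch when choosing an extractor. The following is a minimal sketch in Python, not the project's actual tooling: the extractor labels and the function name are hypothetical (this page does not name the Import.io extractors), and it assumes the publication year of each article is already known from the search results.

```python
# Year ranges are taken from the NYT coverage notes above.
# The labels are hypothetical placeholders, not real Import.io extractor names.
NYT_EXTRACTOR_ERAS = [
    (1981, 2004, "nyt_1981_2004"),
    (2005, 2005, "nyt_2005"),          # anomalous year requiring special handling
    (2006, 2011, "nyt_2006_2011"),
    (2012, None, "nyt_2012_present"),  # None marks an open-ended range
]


def nyt_extractor_for(year):
    """Return the (hypothetical) extractor label for an article's publication year."""
    if year < 1981:
        # Before 1981 the archive offers only non-OCR PDFs, so no HTML extractor applies.
        raise ValueError(f"No fully digitized text for {year}; archive is PDF-only before 1981")
    for start, end, label in NYT_EXTRACTOR_ERAS:
        if year >= start and (end is None or year <= end):
            return label
    # Unreachable for year >= 1981, kept as a guard if the table is edited.
    raise ValueError(f"No extractor configured for {year}")
```

Keeping the ranges in one table means that a fifth HTML stage, if one appears, needs only a new row rather than new branching logic.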
Candidate sources (non-academic)
Source (and URL)
|
Nation
|
API
|
Non-API Search
|
WE1S Lead Researcher
|
Notes
|
Newspapers |
|
|
|
|
|
Australian Newspapers (through Trove) |
Australia |
|
|
|
|
Globe and Mail |
Canada |
TBD |
|
Phillip Cortes
|
|
Washington Post |
US |
TBD |
ProQuest |
Lindsay
|
|
LA Times |
|
|
|
Lindsay |
|
Magazines & Online Publications
|
|
|
|
|
|
The Atlantic |
US |
TBD |
|
Alan Liu |
Ungettable: available only as PDFs through the EBSCOhost Academic Search Complete database (subscribed to by UCSB), or at exorbitant prices from the magazine itself ($99/yr for access to archives, with a limit of 300 articles) |
Business Insider |
US
|
TBD
|
|
|
|
Forbes Magazine |
US |
TBD |
|
Lindsay
|
|
Huffington Post |
US |
TBD |
|
|
|
The Independent |
UK |
No |
|
Alan Liu |
Available through LexisNexis from 2010. Requires post-processing to filter out near matches for "humanities" ("humanity," "humanity's"). Only 1,000 results are shown at a time.
|
Los Angeles Review of Books |
US |
TBD |
|
J. Callies
|
|
New Republic |
US |
TBD |
|
J. Callies
|
|
New Yorker |
US |
TBD |
|
Ashley
|
|
Salon |
US |
|
|
Chris Walker |
|
Time Magazine |
US |
TBD |
|
|
|
Broadcast Media (with text transcripts)
|
|
|
|
|
|
|
|
|
|
|
|
Student Newspapers |
|
|
|
|
|
Daily Californian (UC Berkeley) |
US |
TBD |
|
Ashley
|
|
Daily Bruin |
|
|
|
J. Callies |
|
Harvard Crimson |
|
|
|
|
|
Yale Daily News |
US |
TBD |
|
Alan
|
|
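The note on The Independent above mentions post-processing to filter out near matches for "humanities" such as "humanity" and "humanity's". One way to do this is a whole-word regular-expression test, sketched below in Python. The function names are hypothetical, and the sketch assumes each search result has already been reduced to a plain-text string; it does not reflect any particular LexisNexis export format.

```python
import re

# Match "humanities" only as a whole word, in any letter case.
# "humanity" and "humanity's" do not contain the full word, so they fail the test.
HUMANITIES_PATTERN = re.compile(r"\bhumanities\b", re.IGNORECASE)


def mentions_humanities(text):
    """Return True if the text contains the whole word 'humanities'."""
    return HUMANITIES_PATTERN.search(text) is not None


def filter_results(texts):
    """Keep only the texts that actually mention 'humanities'."""
    return [t for t in texts if mentions_humanities(t)]
```

For example, `filter_results(["Funding for the Humanities", "A debt to humanity's past"])` keeps only the first string. A text that contains both "humanity" and "humanities" is kept, since the test only requires one genuine match.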
Candidate sources (academic)
Other Sources Suggested by WE1S's Earlier Manually Collected Articles on the Humanities