Document Sources

Page history last edited by Alan Liu 4 years, 8 months ago

This page lists candidate and active sources of public documents about the humanities (newspapers, magazines, blogs, government or foundation reports, etc.). For finished text-harvesting and text-analysis workflows for particular document sources, see Workflows for Sources.

Active sources for WhatEvery1Says text-mining project

(Sources we are currently working on)

 

Source (and URL)

Nation

Text-Harvest Method

Coverage

Workflow

Notes

New York Times
US API for search
Import.io (& Outwit Hub) for scraping
  • NYT provides access to their archives in 3 tiers:
    • 1850-1923: non-OCR PDFs
    • 1923-1980: non-OCR PDFs (limited to 100 articles per month with a NYT subscription)
    • 1981-present: fully-digitized text, available with NYT subscription. The HTML in the digitized text appears to have evolved in 4 stages, each requiring a different Import.io extractor:
      • 1981-2004
      • 2005 (anomalous year requiring special handling)
      • 2006-2011
      • 2012-present 
  • API provides full-text access from 1981 on, and metadata and abstracts for prior years.
Workflow
 
 Wall Street Journal US

ProQuest for search

Outwit Hub for scraping

  • ProQuest provides the WSJ from 2 Jan. 1984 to the present.
  • However, some articles have a readily accessible abstract and full text, while others have only an accessible abstract, with the full text hidden in a Flash widget. Full text is the norm from about 1998 on (with unpredictable exceptions in particular articles); in earlier years, full text is missing inconsistently.
Workflow
 
USA Today
US

API for search

Python script for scraping

  • (full text in API back to 2004)
Workflow
 
The Guardian (guardian.co.uk) UK

API for search

Import.io for scraping

  • Full digitization appears to start in 1994, the year of the oldest results returned for "humanities" in The Guardian API Console.

Workflow
 
NPR US
Import.io for scraping
  Workflow
 
New Yorker US TBD EBSCOhost
 
Available in HTML (full text) through EBSCOhost via the library. Issues run from 1985 to the present. There is no API.
LA Times US  No  ProQuest
Full text from 1985 to the present. Gettable using a workflow similar to the one for WSJ.
U.S. Patents  US  Manual Search and Scrape
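
The New York Times coverage notes above describe four stages of HTML markup, each needing its own extractor. A minimal sketch of how a scraping script might route an article to the right extractor by publication year follows. The year ranges come from the notes; the extractor labels and the function name are hypothetical placeholders, not actual Import.io identifiers.

```python
from typing import Optional

# Year ranges are taken from the NYT coverage notes on this page.
# The labels are hypothetical names for the four extractors.
HTML_STAGES = [
    (1981, 2004, "nyt_1981_2004"),
    (2005, 2005, "nyt_2005"),          # anomalous year, handled separately
    (2006, 2011, "nyt_2006_2011"),
    (2012, None, "nyt_2012_present"),  # None means "no upper bound"
]


def extractor_for_year(year: int) -> Optional[str]:
    """Return the extractor label for a publication year.

    Returns None for years before 1981, when only non-OCR PDFs exist.
    """
    for start, end, label in HTML_STAGES:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

Keeping the ranges in one table means a newly discovered markup change only requires adding a row, not editing the lookup logic.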
 

 

Candidate sources (non-academic)

= Possible get

Source (and URL)

Nation

API 

Non-API Search

WE1S Lead Researcher

Notes

Newspapers           
Australian Newspapers (through Trove)  Australia      
Globe and Mail Canada No   Phillip Cortes
It seems that only Canadian institutions have free access to its archives via ProQuest.
Washington Post
US  No ProQuest Lindsay
Full text from 1987 to the present. Gettable using a workflow similar to the one for WSJ.
Magazines & Online Publications  
         
The Atlantic US TBD   Alan Liu

Ungettable: available only as PDFs through the EBSCOhost Academic Search Complete database (to which UCSB subscribes), or at exorbitant prices from the magazine itself ($99/yr for archive access, with a limit of 300 articles).

 

We may try to write to them requesting access.

Business Insider US
TBD
     
Forbes Magazine US No   Lindsay
Probably ungettable: available through the Business Source Complete database, but coverage is spotty and many articles are available only as PDFs. There also doesn't appear to be a way to search the database reliably and systematically: a sample search for "humanities" across the magazine's entire publication record since 1990 returned only 23 hits, and many of those articles didn't even seem to contain the word "humanities."
Harper's Magazine US        
Huffington Post US TBD      
The Independent UK No   Alan Liu Available through LexisNexis from 2010. Requires post-processing to filter out near-matches for "humanities" ("humanity," "humanity's"). Only 1,000 results are shown at a time.
Los Angeles Review of Books US TBD   J. Callies
They are willing to work with us in making their archives available. (Jonathan wrote to them; Alan followed up; they are consulting their developer.)
New Republic US TBD   J. Callies
 
Salon US     Chris Walker  
Time Magazine US TBD      
Broadcast Media  (with text transcripts) 
         
           
Student Newspapers           
Daily Californian (UC Berkeley) US TBD   Ashley
Ungettable: there is no API, and it is not available through the library.
Daily Bruin       J. Callies  
Harvard Crimson          
Yale Daily News US TBD   Alan
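
The note on The Independent above says that search results need post-processing to drop near-matches such as "humanity" and "humanity's." One way that filter could look is sketched below, assuming each harvested article is already available as a plain-text string. The function names are hypothetical, and this is only an illustration, not the project's actual post-processing step.

```python
import re
from typing import Iterable, List

# Match "humanities" as a whole word, in any letter case.
# The word boundaries keep "humanity" and "humanity's" from matching.
HUMANITIES = re.compile(r"\bhumanities\b", re.IGNORECASE)


def mentions_humanities(text: str) -> bool:
    """Return True if the text contains the whole word "humanities"."""
    return HUMANITIES.search(text) is not None


def filter_articles(articles: Iterable[str]) -> List[str]:
    """Keep only the articles that actually mention "humanities"."""
    return [article for article in articles if mentions_humanities(article)]
```

The same check could be used to verify hits from other databases where search results do not reliably contain the search term.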
 

 

Candidate sources (academic)

 

Source (and URL)

Nation

API 

Non-API Search

WE1S Lead Researcher

Notes


Newspapers           
Chronicle of Higher Education
US 

No

   

 

Inside Higher Ed
US  
TBD
     
University Media Sources
         
Harvard Gazette US
       
           

 

Candidate sources (political)

 

Source (and URL)

Nation

API 

Non-API Search

WE1S Lead Researcher
(originally researched by Austin Yack)

Notes


[Kind of Material]           
 
US 

 

   

 

 
US  
 
     
[Kind of Material] 
         
 
 
       
           

 

Other Sources Suggested by WE1S's Earlier Manually Collected Articles on the Humanities

 

 

 
