Alternate Collection Workflows


This page provides alternate workflows for collecting document sources for the WhatEvery1Says project.  (Workflows are revised or extended as the developers continue working on the project. See also the primary Collection Workflows page.)


New York Times (Alternate Method 1)

Using the NYT API and Python scripts to search for and download articles published after 1980, with BeautifulSoup as the scraper

(See details of access by time range to the full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.)

 

Requirements:

 

Workflow Steps:

  1. Search using the NYT API and Python scripts: (Scripts are located in the project Mirrormask site folder here) (Older versions of materials are also located in the Google Drive folder Scripts for NY Times)
    1. Read _instructions.txt (included in the NYT scripts folder) for an overview, then download the Python scripts and settings.cfg to your computer.
    2. Adjust the settings.cfg file for the getTimesArticles_fq.py script to insert the paths for your working space, the search terms you want, and the date span you want.
    3. Run the getTimesArticles_fq.py script. (A rough sketch of the kind of API query the script issues appears after this list.)
    4. Pull into a spreadsheet program the .TSV (tab-separated values) file that the script creates as a summary of the JSON files retrieved for particular articles. (The JSON files contain metadata and abstracts of the articles found in the search; the .TSV file aggregates the metadata from all the JSON files.)
    5. Select the column in the spreadsheet for the URLs of articles found in the search, and copy the URLs into a text file. (Call the file urls.txt. A scripted way to extract the column is sketched after this list.)
  2. Download articles:
    1. Use the Firefox DownloadThemAll plugin on the urls.txt file to download all the NYT articles and save them to a local folder.
  3. Scrape plain text of articles:
    1. Use the nyt_scraper_folder.py Python script to extract the plain text from all the NYT articles in your local folder and aggregate it in a file called text_harvest.txt. (The plain text for individual articles is separated in the aggregate file by a string of 10 "at" signs: "@@@@@@@@@@". A sketch of this kind of scrape appears after this list.)
    2. If you wish, use a file-splitter utility (such as Chopping List for Windows) to split the text_harvest.txt file into separate plain-text files for each article; a scripted alternative is sketched after this list.
  4. Fill out the Collection Manifest Form.
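The search step above is handled by getTimesArticles_fq.py; for orientation, here is a minimal sketch of a query against the NYT Article Search API (version 2). The endpoint and parameter names follow the public API documentation, but the search term, date span, file names, and rate-limit pause are placeholders rather than the script's actual settings.

```python
# Sketch only: approximates the kind of query getTimesArticles_fq.py
# builds from settings.cfg (search term, date span, output location).
import json
import time
import requests

API_KEY = "YOUR_NYT_API_KEY"          # from your NYT developer account
ENDPOINT = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

params = {
    "fq": 'body:("humanities")',      # filtered query; placeholder term
    "begin_date": "19810101",         # date span in YYYYMMDD form
    "end_date": "20141231",
    "api-key": API_KEY,
}

page = 0
while True:
    params["page"] = page
    response = requests.get(ENDPOINT, params=params)
    response.raise_for_status()
    docs = response.json()["response"]["docs"]   # 10 results per page
    if not docs:
        break

    # Save one JSON file of metadata/abstracts per results page.
    with open(f"nyt_page_{page:04d}.json", "w", encoding="utf-8") as f:
        json.dump(docs, f, indent=2)

    page += 1
    time.sleep(6)                     # stay under the API rate limit
```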

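If you prefer not to copy the URL column by hand in a spreadsheet (step 1.v), it can be pulled out of the .TSV summary file with a few lines of Python. The file name nyt_summary.tsv and the column name url are assumptions; check the header row of the file the script actually produces.

```python
# Sketch: write the URL column of the .TSV summary to urls.txt.
# Assumes a header row containing a column named "url"; adjust to
# match the .TSV produced by getTimesArticles_fq.py.
import csv

with open("nyt_summary.tsv", newline="", encoding="utf-8") as tsv_file, \
     open("urls.txt", "w", encoding="utf-8") as out:
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        url = row.get("url", "").strip()
        if url:
            out.write(url + "\n")
```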
 
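The scraping step works roughly as follows. This is only a sketch of the approach (BeautifulSoup over the downloaded HTML, with articles joined by the ten-character delimiter); it does not reproduce the selectors in nyt_scraper_folder.py, and the folder name and paragraph selection are assumptions to check against your downloaded pages.

```python
# Sketch: aggregate plain text from downloaded NYT HTML files into
# text_harvest.txt, separating articles with ten "@" signs as the
# project script does. The <p> selection is a guess; adjust it to
# the markup of your downloaded pages.
import os
from bs4 import BeautifulSoup

DELIMITER = "\n@@@@@@@@@@\n"
folder = "nyt_articles"               # local folder of downloaded pages

texts = []
for name in sorted(os.listdir(folder)):
    if not name.lower().endswith((".html", ".htm")):
        continue
    with open(os.path.join(folder, name), encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    texts.append("\n".join(paragraphs))

with open("text_harvest.txt", "w", encoding="utf-8") as out:
    out.write(DELIMITER.join(texts))
```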

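If you do not have a splitter utility handy, the same split can be done in Python, assuming the ten-character delimiter described above; the output file names are placeholders.

```python
# Sketch: split text_harvest.txt into one plain-text file per article,
# using the ten-character "@" delimiter, as an alternative to a file
# splitter utility such as Chopping List.
with open("text_harvest.txt", encoding="utf-8") as f:
    articles = f.read().split("@@@@@@@@@@")

for i, article in enumerate(articles, start=1):
    article = article.strip()
    if not article:
        continue
    with open(f"article_{i:04d}.txt", "w", encoding="utf-8") as out:
        out.write(article + "\n")
```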
 

New York Times (Alternate Method 2)

Using ProQuest for articles from 1923 through the end of 1980

(See details of access by time range to the full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.)

 

Requirements

 


The Guardian (Alternate Method 1 - Using OutWit Hub as scraper) 

 

Requirements:

 

Workflow Steps:

  1. Get a Guardian API key if you don't have one (request a developer key).
  2. Search The Guardian using the Guardian's API Console from within the browser in OutWit Hub:
    1. Open OutWit Hub.
    2. In the web browser built into OutWit Hub, go to the URL of the Guardian's  Open Platform API Console (beta version, not the old version): http://open-platform.theguardian.com/explore/. (Make sure that "Page" is selected in the OutWit Hub sidebar at the left, in order to show the web page you are opening.)
    3. When the Guardian's API console search form loads, check the box in that form for "Show All Filters."  Then fill in the following fields in the form:
      1. search term
      2. order-by (choose "oldest")
      3. page-size (set at 199 to show the maximum number of hits per search-results page)
      4. from-date & to-date (in the format, e.g., "2014-01-31")
      5. api-key (your Guardian api key)
    4. At the bottom of the Guardian API Console web page, you'll see a live view of the JSON returned by the search, including the URLs for the pages found.  The maximum number of hits on a single search-results page is 199.
    5. For multiple search-results pages:
      1. The JSON search results will start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., currentPage: 2, pages: 2).
      2. After harvesting the results of one page (through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results. (A scripted version of this search-and-paginate loop is sketched after this list.)
  3. Scrape and Export the Guardian articles using OutWit Hub: 
    1. In the OutWit Hub sidebar, choose "Links".
    2. You'll see a table of all the URLs found on the page, including the ones in the JSON for the search results.
    3. Select all the rows that include URLs for articles (as opposed to those for the Guardian search form).
    4. Right-click on the selection, choose "Auto-Explore Selected Links -> Fast Scrape", and then the Guardian Scraper. (Note: do not choose "Fast Scrape (Include Selected Data)," since that will include all the columns in the table of data instead of only the Date, Headline, Author, and ArticleBody columns the Guardian Scraper is designed to collect.)
    5. OutWit Hub will now proceed to scrape all the Guardian pages in the list of URLs. In doing so, it will automatically switch to the "Scraped" view (as indicated in the sidebar) and populate a table with the scraped data for Date, Headline, Author, and ArticleBody.
    6. When OutWit Hub has completed scraping, select the entire scraped table, right-click, and choose "Export Selection As ..." (export formats include Excel, JSON, XML, HTML, CSV, TXT, and SQL).
    7. (Repeat as needed for multiple search-results pages of URLs found through the Guardian API Console, as described in step 2.iv above.)
  4. Chop the scraped articles into individual articles (a sketch of one approach appears after this list):
    1. If you exported the scraped articles from OutWit Hub as a .txt file, then:
      1. you can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles.  The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks)
      2. or you can use Scott's cut.py Python script for the purpose (download from our Mirrormask server)
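The search-and-paginate loop in step 2 can also be run as a script rather than through the API Console. The sketch below uses the public Content API endpoint that the console queries; the search term, dates, and output file name are placeholders.

```python
# Sketch: the same search the Guardian API Console builds, issued with
# requests. Pagination follows the currentPage/pages fields described
# in step 2.v.
import requests

API_KEY = "YOUR_GUARDIAN_API_KEY"
ENDPOINT = "https://content.guardianapis.com/search"

params = {
    "q": "humanities",                # placeholder search term
    "order-by": "oldest",
    "page-size": 199,                 # hits per search-results page
    "from-date": "2014-01-01",
    "to-date": "2014-01-31",
    "api-key": API_KEY,
    "page": 1,
}

article_urls = []
while True:
    data = requests.get(ENDPOINT, params=params).json()["response"]
    article_urls.extend(item["webUrl"] for item in data["results"])
    if data["currentPage"] >= data["pages"]:
        break
    params["page"] += 1

with open("guardian_urls.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(article_urls) + "\n")
```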

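As one scripted way to do the chopping, if you export the scraped table from OutWit Hub as CSV rather than TXT, a short script can write one text file per article. The column names below are the ones the Guardian Scraper is described as collecting (Date, Headline, Author, ArticleBody); the export file name is an assumption, and the columns should be adjusted if your export differs.

```python
# Sketch: turn OutWit Hub's CSV export into one text file per article.
import csv

with open("guardian_export.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=1):
        with open(f"guardian_article_{i:04d}.txt", "w", encoding="utf-8") as out:
            out.write(row.get("Headline", "") + "\n")
            out.write(row.get("Date", "") + "\n")
            out.write(row.get("Author", "") + "\n\n")
            out.write(row.get("ArticleBody", "") + "\n")
```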
 

 
