• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!


NPR Collection Workflow

Page history last edited by Jonathan Callies 8 years, 10 months ago Saved with comment

This page provides instructions for collecting the NPR from among document sources for the WhatEvery1Says project.  (Collecting workflows are revised or extended as the developers continued working on the project. See also Alternate Workflows.)


Using NPR API for searching and Import.io as scraper (last revised July 13, 2014)


Requirements: (preinstalled on the WE1S workstations)

  • NPR API Key (register and request a key)
  • Import.io (free) -- Note: Import.io can be used in a browser to extract data automatically from a Web page; and the browser version can also use customized extractors that someone else has created.  However, creating customized extractors (e.g., able to select only particular data from a page) can only be done through Import.io's downloadable "app" (which is also free).
  • Convert JSON to CSV online tool.


Workflow Steps:

  1. Get a NPR API Key if you don't have one (request developer key).
  2. Search NPR using the NPR API Console:
    1. In a web browser (not the one built into Import.io), go to the URL of NPR's API Console: http://www.npr.org/api/queryGenerator.php.
    2. When NPR's API console search form loads,
      1. Click the "Fields" tab. For Output Format, select JSON.  For Selectable Fields, choose only "listText".
      2. Then click the "Control" tab.  Fill in the search parameters in the form.  (Show only stories with "text."  Use "date ascending" for the Sort Order.  The start and end date formats are, for example, 2010-12-31.  Start Number refers to the number of the search-results page if there is more than one (leaving it blank defaults to the first search page).
    3. Then click on "Create API Call," followed by "Run API Call" to generate search results in JSON format.  (The JSON will appear all on a single line.)
    4. For multiple search-results pages:
      1. After harvesting the results of one page (through the process described below), you can use the "Start Number" field in the NPR API Console's search form to request the next page of search results.
    5. Alternatively: Instead of using the NPR API console, you can just directly enter variations of the following URL, which is an example of the search request the API console would generate (adjust the parameters, search-results start page, and add your own API key:
  3. Collect the URLs of the links from the NPR search using the Convert JSON to CSV online tool and Excel:
    1. First, save the JSON in files called "json.txt" to keep among your working data.
    2. Then copy-and-paste the JSON search results from the steps above into the Convert JSON to CSV online tool.  When the CSV version of the data appears, click on the "Download" button in the tool to export to Excel. Save this Excel spreadsheet as "urls.xlsx" among your working data.
    3. The Excel spreadsheet will show three different URLs for each article, one per row: "html," "api," and "short."  We want only the "html" URL.  To make Excel filter for only the rows for "html" URLs, take the following steps:
      1. Select the column showing the labels "html," "api," and "short."
      2. Click on the "Data" tab in Excel and then click on "Filter". This will create a little down-arrow icon in the first row of the column for the available filters..
      3. Click on the down-arrow filter icon and deselect all the values for the column except "html."  This will result in a filtered view of the spreadsheet showing only the "html" URLs.
      4. Select and copy the URLs in the URL column. Save these in a text file called "urls.txt" to keep among your working data.
  4. Scrape and Export the NPR articles using Import.io: 
    1. Open Import.io by pointing your browser to the URL for the NPR extractor we created for the WE1S project: https://import.io/data/mine/?tag=EXTRACTOR&id=9b1bac59-4f18-4f33-be15-62d931d99208
    2. Click the Import.io tab for "Bulk Extract"
    3. Copy-and-paste into into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in Excel (from the steps above).  (The URLs should be listed one per line) 
    4. Then click on "Run queries"
    5. When extraction of data is complete, click on the "Export" button in Import.io."  From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet".) Open the resulting web page, which will show you the query results in a table. Copy all (Ctrl-A) the page.
    6. Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and in in the appropriate working data subfolder (npr > working_data > [year] > [query term] > aggregate_working_data) create a Google spreadsheet named "npr-[year]-h-master" (for "humanities" queries) or "npr-[year]-la-master" (for "liberal arts" queries).  E.g., "npr-2007-h-master" or "npr-2007-la-master.   Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as "master spreadsheet".
              (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)
      1. Organize and add columns in Master Spreadsheet:
        Arrange (and add) columns in the spreadsheet as follows:
        1. Column A (add this column last when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #" Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells with that number.  The purpose to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping. 
        2. Column B -- the "date" column
        3. Column C -- the "title" column
        4. Column D -- the "articlebody" column
        5. Column E -- the "author" column
        6. Column F -- Add a column here with label "3 words." Insert the following formula in the cell in the second row The content: =COUNTA(SPLIT(D2, " ")) 
               Adjust the "D2" here to correspond to the column in which the "articlebody" current resides, which may change depending on whether you have added Column A above yet.
               Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.
        7. Column G -- the "pageUrl" column
  5. Chop the scraped articles as individual articles:
    1. you can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles.  The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks)
    2. Or you can use Scott's cut.py Python script for the purpose (located in the"pythonscripts" folder on the local workstation (and also on the WE1S Google Drive)
  6. Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure indicated in the screenshot below) 
    1. WE1S Google Drive organization"Data" consists of individual plain text files for a publication organized by folders for each year. For example, corpus > npr > data > 1984 > plain_text > humanities contains all the scraped individual plain text files collected for 1984 that contain the word "humanities."
    2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of NPR, the working data consists of subfolders under each year for:
      1. aggregate_working_data folder (containing urls.xlsx, urls.txt, the master spreadsheet; aggregate-plain-text.docx; aggregate-plain-text.txt
        1. Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1" and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the following formula in the next cell down: "=A2+1"; then drag the handle at lower-right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.)
      2. individual_plain_text folder (containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data)
  7. Fill out Collection Manifest Form
    1. Go to Form  (submitting the form will send your metadata about the collection run to a Google spreadsheet)







Comments (0)

You don't have permission to comment on this page.