USA Today Collection Workflow

Saved by Alan Liu on July 12, 2015
 

This page provides instructions for collecting articles from USA Today, one of the document sources for the WhatEvery1Says project.  (Collecting workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)


 

USA Today

Using the USA Today "Articles" API & Python scripts for searching/downloading articles

See details of access by time range to the full-text, partial-text, and PDF versions of the articles discoverable through the USA Today API.

 

Requirements (preinstalled on the WE1S workstations):

 

Workflow Steps:

  1. Search using the USA Today API and Python scripts:
    1. Use your browser to:
      1. send a request string to the USA Today Articles API (entering the string as the URL, and adjusting the search term and dates you want in the URL request string);
      2. convert the JSON metadata that the API returns as results into a CSV file using the json2csv.py script (a sketch of the general idea appears after this list);
      3. copy the URLs for the articles into a urls.txt file.
  2. Scrape and Export the USAToday articles using Import.io: 
    1. Open the Import.io app and point its built-in browser to the URL for the USAToday extractor: https://import.io/data/mine/?id=635fe72f-51bb-4dd2-b1d9-7222b1e47e5f
    2. Click the Import.io tab for "Bulk Extract"
    3. Copy and paste into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in Excel (from the steps above). (The URLs should be listed one per line.)
    4. Then click on "Run queries"
    5. When extraction of the data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet" option.) Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
    6. Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate USA Today working data subfolder ([source folder] > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "usatoday-[year]-h-master" (for "humanities" queries) or "usatoday-[year]-la-master" (for "liberal arts" queries), e.g., "usatoday-2007-h-master" or "usatoday-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
              (Note: if pasting into the Google master spreadsheet produces an error report that some cell exceeds the maximum character limit, then first paste into an Excel spreadsheet, upload the Excel sheet to Google Drive, and open it as a Google spreadsheet. For some reason this works.)
      1. Organize and add columns in the Master Spreadsheet:
        Arrange (and add) the columns in the spreadsheet as follows:
        1. Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells with incrementing numbers. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
        2. Column B -- the "date" column
        3. Column C -- the "title" column
        4. Column D -- the "articlebody" column
        5. Column E -- the "author" column
        6. Column F -- Add a column here with the label "# words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
               Adjust the "D2" here to correspond to the column in which the "articlebody" content currently resides, which may change depending on whether you have added Column A above yet.
               Then drag the active handle at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article. (A Python equivalent of this check is sketched after this list.)
        7. Column G -- the "pageUrl" column
  3. Chop the scraped articles into individual articles:
    1. You can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks).
    2. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation, and also on the WE1S Google Drive). A sketch of the same idea appears after this list.
  4. Fill out the Collection Manifest Form.
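The following is a minimal sketch of the step 1 conversion in Python (the general idea behind json2csv.py, not the script itself). It assumes the JSON results returned by the API have been saved from the browser to a file named results.json and that each article record carries "date", "title", and "url" fields; the file name and field names are assumptions and should be adjusted to match the actual API response.

import csv
import json

# Load the JSON results saved from the browser in step 1.
with open("results.json", encoding="utf-8") as f:
    data = json.load(f)

# The article records may be nested under a key such as "results"; adjust
# this lookup to the actual structure of the API response.
articles = data.get("results", data) if isinstance(data, dict) else data

# Write the article metadata to a CSV file.
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "title", "url"], extrasaction="ignore")
    writer.writeheader()
    for article in articles:
        writer.writerow(article)

# Copy just the URLs into urls.txt, one per line, ready for the Import.io
# bulk extract in step 2.
with open("urls.txt", "w", encoding="utf-8") as f:
    for article in articles:
        f.write(article["url"] + "\n")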

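The word-count column added in step 2 (column F) can also be checked outside the spreadsheet. The following is a minimal sketch mirroring the =COUNTA(SPLIT(D2, " ")) formula, run against a CSV export of the master spreadsheet; the file name and the "articlebody" column header are assumptions and should be adjusted to match your export.

import csv

# Count the words in each article body, row by row (row 2 is the first data
# row, matching the spreadsheet).
with open("usatoday-master-export.csv", encoding="utf-8") as f:
    for row_number, row in enumerate(csv.DictReader(f), start=2):
        word_count = len(row["articlebody"].split())
        print(f"Row {row_number}: {word_count} words")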
 
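The following is a minimal sketch of the step 3 chopping (the same idea as Chopping List or Scott's cut.py, not a copy of either). It assumes the scraped article bodies have been gathered into a cumulative plain text file with one article per line (i.e., separated by carriage returns); the file and folder names are placeholders.

from pathlib import Path

source = Path("usatoday-cumulative.txt")  # hypothetical name for the cumulative file
out_dir = Path("plain_text")
out_dir.mkdir(exist_ok=True)

# Split on line breaks, skipping any blank lines between articles.
articles = [a for a in source.read_text(encoding="utf-8").splitlines() if a.strip()]

# Number the output files so they correspond to the "Plain Text file #"
# column added to the master spreadsheet in step 2.
for i, article in enumerate(articles, start=1):
    (out_dir / f"{i}.txt").write_text(article, encoding="utf-8")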