USA Today Collection Workflow


This page provides instructions for collecting USA Today articles from among the document sources for the WhatEvery1Says project.  (Collection workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)


 

USA Today

Using the USA Today "Articles" API and Import.io (last revised July 13, 2015)

See details of access by time range to the full-text, partial-text, and PDF versions of the articles discoverable through the USA Today API.

 

Requirements (preinstalled on the WE1S workstations):

 

Workflow Steps:

  1. Search using the USA Today API and Python scripts:
    1.  Use your browser to:
      1. send a request string to the USA Today Articles API (entering the string as the URL and adjusting the search term and dates you want in the request string);
      2. convert the JSON metadata that the API returns as results into a CSV file using the json2csv.py script;
      3. copy the URLs for the articles into a urls.txt file. (A sketch of this step appears below.)
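
The sketch below illustrates step 1 end to end in Python: sending a request string to the articles API, then converting the returned JSON metadata into a CSV file and a urls.txt file. It is a stand-in, not the project's json2csv.py; the endpoint, query parameters, and response/field names are assumptions that should be checked against the USA Today API documentation.

```python
# Illustrative sketch of step 1 (query the API, then write results to CSV and urls.txt).
# The endpoint, parameter names, and response shape below are ASSUMPTIONS; adjust them
# to match the USA Today API documentation and the project's json2csv.py script.
import csv
import json
import urllib.request

REQUEST_URL = (
    "http://api.usatoday.com/open/articles"       # assumed endpoint
    "?keyword=humanities"                         # search term to adjust
    "&begin_date=2015-01-01&end_date=2015-12-31"  # date range to adjust
    "&api_key=YOUR_KEY"                           # placeholder API key
    "&encoding=json"
)

def fetch_results(url):
    """Send the request string and return the parsed JSON response."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def write_csv_and_urls(records, csv_path="results.csv", urls_path="urls.txt"):
    """Write the article metadata records to a CSV file and their URLs to urls.txt."""
    if not records:
        return
    fieldnames = sorted({key for record in records for key in record})
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    with open(urls_path, "w", encoding="utf-8") as urls_file:
        for record in records:
            if record.get("link"):                # assumed name of the URL field
                urls_file.write(record["link"] + "\n")

if __name__ == "__main__":
    data = fetch_results(REQUEST_URL)
    # Assumed response shape: {"response": {"articles": [...]}}
    articles = data.get("response", {}).get("articles", [])
    write_csv_and_urls(articles)
```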
  2. Scraping (Phase 1): Scrape and export the USA Today articles using Import.io: 
    1. Open Import.io by pointing your browser to the URL for the USA Today extractor we built for the WE1S project: https://import.io/data/mine/?id=635fe72f-51bb-4dd2-b1d9-7222b1e47e5f
    2. Click the Import.io tab for "Bulk Extract"
    3. Copy and paste the whole list of URLs you collected (from the steps above) into the Import.io field "Enter URLs to extract from". (The URLs should be listed one per line.)
    4. Then click on "Run queries"
    5. When extraction of the data is complete, click the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet" option.) Open the resulting web page, which will show the query results in a table. Select all (Ctrl-A) and copy the page.
    6. Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder (usatoday > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "usatoday-[year]-h-master" (for "humanities" queries) or "usatoday-[year]-la-master" (for "liberal arts" queries), e.g., "usatoday-2007-h-master" or "usatoday-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
              (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)
      1. Organize and add columns in the Master Spreadsheet:
        Arrange (and add) columns in the spreadsheet as follows (a scripted equivalent is sketched after this list):
        1. Column A (add this column last, when you have finished all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the fill handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence of the final individual plain text files output by the scrape. 
        2. Column B -- the "date" column
        3. Column C -- the "title" column
        4. Column D -- the "articlebody" column
        5. Column E -- the "author" column
        6. Column F -- Add a column here with the label "3 words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " ")) 
               Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have already added Column A above.
               Then drag the fill handle at the lower right of the cell containing the formula down the column to automatically populate the rest of the column with the number of words in each article.
        7. Column G -- the "pageUrl" column
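
For reference, the sketch below reproduces the same column work in pandas: sequential file numbers (the "=A2+1" fill) and per-article word counts (the =COUNTA(SPLIT(...)) formula). It assumes the Import.io export has been saved as a CSV with the column names listed above; the file names and the word-count column label here are illustrative only, not the project's actual tooling.

```python
# Illustrative pandas equivalent of the spreadsheet steps above. The file names and
# the word-count column label are ASSUMPTIONS; the source column names (date, title,
# articlebody, author, pageUrl) follow the Import.io export described above.
import pandas as pd

def build_master(csv_in="importio-export.csv", csv_out="usatoday-YEAR-h-master.csv"):
    df = pd.read_csv(csv_in)

    # Equivalent of =COUNTA(SPLIT(D2, " ")): count whitespace-separated words per article.
    df["word count"] = df["articlebody"].fillna("").str.split().str.len()

    # Equivalent of entering "1" in cell A2 and dragging "=A2+1" down the column.
    df.insert(0, "Plain Text file #", list(range(1, len(df) + 1)))

    # Keep the column order used in the master spreadsheet (columns A through G).
    ordered = ["Plain Text file #", "date", "title", "articlebody",
               "author", "word count", "pageUrl"]
    df = df[[column for column in ordered if column in df.columns]]
    df.to_csv(csv_out, index=False)
    return df
```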
  3. Chop the scraped articles into individual article files:
    1. You can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as $r (no quote marks).
    2. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation, and also on the WE1S Google Drive). A minimal stand-in sketch of this step appears below.
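
The following is only a stand-in sketch of the chopping step, not Scott's cut.py. It assumes the cumulative plain text file separates articles with carriage returns (one article per line), as described above, and writes File_1.txt, File_2.txt, and so on, so that the numbering matches the "Plain Text file #" column in the master spreadsheet.

```python
# Illustrative stand-in for the chopping step (this is NOT the project's cut.py).
# Assumption: the cumulative file holds one article per line (carriage-return delimited).
from pathlib import Path

def chop(aggregate_path="aggregate-plain-text.txt", out_dir="individual_plain_text"):
    """Split the cumulative text file into File_1.txt, File_2.txt, ... in out_dir."""
    text = Path(aggregate_path).read_text(encoding="utf-8")
    articles = [chunk.strip() for chunk in text.splitlines() if chunk.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for number, article in enumerate(articles, start=1):
        # File numbers correspond to the "Plain Text file #" column in the master spreadsheet.
        (out / f"File_{number}.txt").write_text(article + "\n", encoding="utf-8")
    return len(articles)
```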
  4. Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure described below; a sketch of the layout follows this step): 
    1. WE1S Google Drive organization: "Data" consists of individual plain text files for a publication, organized by folders for each year. For example, corpus > usatoday > data > 2000 > plain_text > humanities contains all the scraped individual plain text files collected for 2000 that contain the word "humanities."
    2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of USA Today, the working data consists of subfolders under each year for:
      1. aggregate_working_data folder (containing urls.txt, the master spreadsheet, aggregate-plain-text.docx, and aggregate-plain-text.txt)
        1. Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the formula "=A2+1" in the next cell down; then drag the fill handle at the lower right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of the individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.).
      2. individual_plain_text folder (containing the chopped individual plain text files; these are also copied into the relevant "data" branch of the site as final data)
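
Since the screenshot of the folder structure is not reproduced on this page, the sketch below reconstructs the layout from the paths named in this step. The year and query term are placeholders, and the script only mirrors the structure locally before upload.

```python
# Illustrative reconstruction of the Drive folder layout from the paths named above.
# YEAR and QUERY are placeholders; this only creates a local mirror of the structure.
from pathlib import Path

YEAR = "2000"
QUERY = "humanities"

FOLDERS = [
    f"corpus/usatoday/data/{YEAR}/plain_text/{QUERY}",
    f"corpus/usatoday/working_data/{YEAR}/{QUERY}/aggregate_working_data",
    f"corpus/usatoday/working_data/{YEAR}/{QUERY}/individual_plain_text",
]

for folder in FOLDERS:
    Path(folder).mkdir(parents=True, exist_ok=True)
```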
  5. Fill out Collection Manifest Form
    1. Go to Form  (submitting the form will send your metadata about the collection run to a Google spreadsheet)