NY Times Collection Workflow


This page provides instructions for collecting the New York Times (1981-) from among the document sources for the WhatEvery1Says project.  (Collection workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)


New York Times (1981-present)

(Using the NY Times API for searching and Import.io as scraper; last revised September 11, 2015)

See details of access by time range to the full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.

 

Overview

  • The NY Times has a complicated history of evolving source code, with at least the following epochs: 1981-2004, 2005, 2006-11, 2012-present. Though all articles from the NY Times archives are now wrapped in the paper's contemporary look-and-feel, the earlier source code for article bodies retains most of its original formatting/tag structure (with light changes) and is simply embedded in the more modern code.
  • This means that we have to use different strategies for each major epoch of code (e.g., different scrapers or combinations of scrapers).

 

Requirements: (preinstalled on the WE1S workstations)

 

Workflow Steps:

  1. Search using the NYT API and getTimesArticles_fq.py Python script:
    1. Adjust the settings.cfg file for the getTimesArticles_fq.py script to insert the paths for your working space, the search terms you want, and the date span you want. To search for a phrase, enter the phrase in the format "liberal+arts". (Note that the settings.cfg file must be located in a "config" subfolder within the folder holding the getTimesArticles_fq.py script.)
    2. Run the getTimesArticles_fq.py script.  (Double-click on the file, which will pull it into a Python editor in the Enthought Canopy IDE [integrated development environment] for Python.  Then, from the menu at the top of the IDE, go to Run > Run File.)  This will start the script, which uses the NY Times API to search systematically for articles matching a query term in a date range, pulls in the JSON output from the API for each found article, and harvests metadata from the JSON files into a cumulative TSV (tab-separated values) file.  Depending on the date range specified for the search, this can take a long time.  When the script completes, you will see the command prompt again in the Python environment.  (A simplified sketch of this query-and-harvest loop appears after the notes below.)
      1. Error correction: Occasionally, the Python script will terminate prematurely when it encounters an error while querying and retrieving information through the NY Times API. In that case, locate the last JSON file retrieved (whose file name shows the date, e.g., "20011014"), reset the settings.cfg file to start at that date, and restart the Python script. (The .TSV file that the Python script writes as it harvests data from the JSON files is rewritten cumulatively after each JSON file, so after a fresh start the Python script simply keeps adding to the bottom of the .TSV file.)
      2. Note: Keep and upload the .TSV file and the JSON files as "working data" from the collection run to the WE1S Google Drive.  See example of path/folders for where to put this working data material on the Google Drive. 
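         Note: The following is a minimal, hypothetical sketch of the kind of loop getTimesArticles_fq.py performs, included only for orientation. The real script's configuration key names, its date-based JSON file naming, and its TSV columns may differ from what is assumed here.

           # Illustrative sketch only -- NOT the project's getTimesArticles_fq.py.
           # Assumes a hypothetical config/settings.cfg with a [search] section
           # containing api_key, query, begin_date, end_date, out_dir; the real
           # script's key names and its date-based JSON file naming may differ.
           import configparser, csv, json, os, time
           import requests

           cfg = configparser.ConfigParser()
           cfg.read(os.path.join("config", "settings.cfg"))
           s = cfg["search"]

           API = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
           os.makedirs(s["out_dir"], exist_ok=True)

           with open(os.path.join(s["out_dir"], "results.tsv"), "w",
                     newline="", encoding="utf-8") as tsv:
               writer = csv.writer(tsv, delimiter="\t")
               writer.writerow(["date", "title", "author", "url"])
               page = 0
               while True:
                   resp = requests.get(API, params={
                       "q": s["query"],                # e.g. "liberal+arts"
                       "begin_date": s["begin_date"],  # e.g. 20070101
                       "end_date": s["end_date"],      # e.g. 20071231
                       "page": page,
                       "api-key": s["api_key"],
                   })
                   resp.raise_for_status()
                   docs = resp.json()["response"]["docs"]
                   if not docs:
                       break
                   # Keep the raw JSON for this batch, then harvest metadata
                   # from it into the cumulative TSV.
                   with open(os.path.join(s["out_dir"], "page_%04d.json" % page),
                             "w", encoding="utf-8") as jf:
                       json.dump(docs, jf)
                   for d in docs:
                       writer.writerow([
                           d.get("pub_date", ""),
                           (d.get("headline") or {}).get("main", ""),
                           (d.get("byline") or {}).get("original", ""),
                           d.get("web_url", ""),
                       ])
                   page += 1
                   time.sleep(6)  # stay under the API rate limit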
    3. Pull into a spreadsheet program the .TSV (tab separated values) file that the script creates as a summary of the JSON files retrieved for particular articles.  (The JSON files contain metadata and abstracts of the articles found in the search; the .TSV file aggregates the metadata from all the JSON files.)
    4. Select the column in the spreadsheet for the URLs of articles found in the search, and copy the URLs into a text file.  (Call the file urls.txt)
    5. Use Word to add "?pagewanted=all" to most (but not all) URLs as follows (a scripted alternative is sketched after this step):
      1. Start by using search-and-replace in Word to look for ".html^13" (that is, all URLs ending in ".html" followed by a line break before the next URL). Replace with ".html?pagewanted=all^13"
      2. Next use search and replace to look for "/^13" (that is, URLs ending in a forward slash followed by a line break before the next URL). Replace with "?pagewanted=all^13"
      3. In the later years of the NY Times, some URLs begin "http://query.nytimes.com/..." (ending with a long numerical ID number).  These should be left as is after the above search-and-replace steps (i.e., these special URLs should NOT have the pagewanted query appended to them).
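       Note: If you prefer to script step 5 rather than use Word, a sketch like the following (a hypothetical helper, not part of the standard workflow) makes the same edits to urls.txt:

         # Hypothetical scripted alternative to the Word edits in step 5:
         # append "?pagewanted=all" to URLs ending in ".html" or "/", but
         # leave query.nytimes.com URLs untouched (step 5.3).
         with open("urls.txt", encoding="utf-8") as f:
             urls = [u.strip() for u in f if u.strip()]

         fixed = []
         for u in urls:
             if u.startswith("http://query.nytimes.com"):
                 fixed.append(u)
             elif u.endswith(".html") or u.endswith("/"):
                 fixed.append(u + "?pagewanted=all")
             else:
                 fixed.append(u)

         with open("urls.txt", "w", encoding="utf-8") as f:
             f.write("\n".join(fixed) + "\n")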
  2. Scrape articles using Import.io (and/or OutWit Hub Pro for later years): 
    1. Scraping (Phase 1): Initial Scrape
      1. NY Times 1981-2004, 2006-11:
        1. Open a browser to https://import.io/data/mine/. Logging in as the WE1S user in Import.io will reveal the extractors we have pre-built for various publications, including for the New York Times at each stage in the evolution of the Times's HTML format, beginning in 1981 when fully digital text became available.
          • Toggle the selector from "Single URL" to "Bulk Extract"
          • In the "Enter URLs to extract from" box, copy-and-paste the whole list of URLs you collected in urls.txt (from the steps above).  (The URLs should be listed one per line.)
          • Then click on "Run queries". (Watch the counter at the top left to see that all the queries complete correctly. If Import.io fails to get some pages, it will give a tally of failed pages.  Rerun the queries if there are failed pages. If you still can't get some failed pages, click on the "i" information icon to see the URLs, and copy them into a file called "urls-failed.txt" for later reference.)
          • When extraction of data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet.") Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
          • Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder (new_york_times > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "nytimes-[year]-h-master" (for "humanities" queries) or "nytimes-[year]-la-master" (for "liberal arts" queries), e.g., "nytimes-2007-h-master" or "nytimes-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
                    (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)
            1. Organize and add columns in Master Spreadsheet:
              Arrange (and add) columns in the spreadsheet as follows:
              1. Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells with incrementing numbers.  The purpose is to assign a file identifier to each article corresponding to the file sequence in the final individual plain text files that will be the output of scraping.
              2. Column B -- the "date" column
              3. Column C -- the "title" column
              4. Column D -- the "articlebody" column
              5. Column E -- the "author" column
              6. Column F -- Add a column here with the label "# words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))      Note: For Excel, the equivalent formula is: =IF(LEN(TRIM(D2))=0,0,LEN(TRIM(D2))-LEN(SUBSTITUTE(D2," ",""))+1)
                     Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have added Column A above yet.
                     Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.
              7. Column G -- the "pageUrl" column
        2. When you have the correct extractor selected, click the Import.io tab for "Bulk Extract".
      2. NY Times 2005, 2012- : Using Outwit Hub Pro instead of Import.io, make several passes through each year to scrape the different kinds of pages in the NY Times ecosystem in these years. (Fuller detail on getting missing data and pages follows further below.)
        1. Use the TextFixer URL to Link online tool to convert all the URLs in "urls.txt" into links. Paste them into a text file; frame the links with <html><body> . . . </body></html> tags, and name the file "urls-blogs.html". Open in a browser to see the local URL. Load this file into Outwit Hub Pro (by inputting the "file:///..." path to the local html file).
        2. Click on "links" in the Outwit Hub Pro sidebar to show the links in the page.
        3. Select all, then right-click. Choose "Auto-Explore Pages," then "Fast Scrape (Include Selected Data)," then choose the appropriate NY Times scraper as follows:
          1. NY Times 2005:
            1. First use the NY Times 2005 Scraper to get most of the results. Export to Excel; copy all; and paste into the Google master spreadsheet as described above.
            2. Create a filter in the master spreadsheet on the pageURL column to show only URLs that begin "http://query" (filter by "condition" > "text starts with").
            3. Copy the URLs and create a urls-nytimes-query-pages.txt file. Append "&pagewanted=all" to every URL (note the &, not the usual ?). Use that to create a urls-nytimes-query-pages.html file.
            4. Feed the urls-nytimes-query-pages.html page into Outwit Hub Pro, and (as described above) scrape the links, but this time using the NYTimes 2005 Query Pages scraper.
            5. Export into Excel and then copy the article bodies and any other missing data into the Google master spreadsheet.
            6. Finally, use the same process to filter for any remaining articles in the master spreadsheet with no article bodies. These will be book reviews. Get the URLs, create urls-reviews.txt and urls-reviews.html files, and feed into Outwit Hub Pro, this time using the NYTimes Book Reviews scraper.  Use the results to fill in the missing data in the Google master spreadsheet.
          2. NY Times 2013-
            1. Use the same process as described for 2005, but substituting the appropriate Outwit Hub scrapers for the year (NYTimes 2013- Scraper, NYTimes Blogs scraper). The book reviews are now similar to the blogs and can be scraped with the same scraper.
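           Note: The TextFixer "URL to Link" conversion used in step 1 above (and again in Phase 2 below) can also be scripted; this is a minimal sketch, assuming one URL per line in urls.txt:

             # Sketch of the TextFixer-style "URL to Link" conversion: wrap each
             # URL from urls.txt in an <a> tag inside a bare HTML page that
             # Outwit Hub Pro can load via its file:/// path.
             import html

             with open("urls.txt", encoding="utf-8") as f:
                 urls = [u.strip() for u in f if u.strip()]

             links = "\n".join('<a href="{0}">{0}</a><br>'.format(html.escape(u))
                               for u in urls)

             with open("urls-blogs.html", "w", encoding="utf-8") as f:
                 f.write("<html><body>\n" + links + "\n</body></html>\n")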
    2. Scraping (Phase 2): Further Information on getting missing pages and data (see also above for the years after 2005)
      1. API search results for the New York Times after the mid 2000s include URLs for NYT "blogs" (such as the "Opinionator" series) that cannot be scraped automatically.  (They will appear as a blank row in the master spreadsheet.)
        1. One option is to manually get the results and put them in the appropriate cells in the spreadsheet.  Be sure to eliminate all line returns in the article body.
        2. A second option is to use Outwit Hub Pro:
          1. Collect from the master spreadsheet the URLs for the blogs (by searching on "blog", or by creating a "filter view" for rows where the "# words" column is "1") and copy the URLs into a file called "urls-blogs.txt".
          2. Use the TextFixer URL to Link online tool to convert all the URLs into links. Paste them into a text file; frame the links with <html><body> . . . </body></html> tags, and name the file "urls-blogs.html". Open in a browser to see the local URL. Load this file into Outwit Hub Pro (by inputting the "file:///..." path to the local html file).
          3. Click on "links" in the Outwit Hub Pro sidebar to show the links in the page.
          4. Select all, then right-click. Choose "Auto-Explore Pages," then "Fast Scrape (Include Selected Data)," then choose the "NY Times Blogs" scraper.
          5. After the data has been scraped, "Export" the results from Outwit Hub Pro as an Excel spreadsheet.
          6. Use this spreadsheet to fill in the missing data in the master spreadsheet.  "Filtered View": The easiest way to work with the master spreadsheet for this purpose is to create a "filter view" called "filter-blogs" that shows only the rows in which the value of "# words" is "1". (Click on the dropdown arrow by the funnel-shaped icon for filters in Google Spreadsheets. Select "create a new filter." Set the name of the filter and the column/row numbers for the span you want to filter on. "Clear" all the checked possibilities. Select only "1", and click "okay." Closing the filter view will restore the normal view of the spreadsheet.)
                Once you have a filtered view showing only the blogs, then you can batch copy the information from the Outwit Hub Pro output to the master spreadsheet, since the rows should be identical.
      2. There is also a class of URLs that are dynamic queries in the format, e.g., "http://query.nytimes.com/gst/fullpage.html?res=940DE4DB1E30F932A15752C0A9619C8B63". These show up in the master spreadsheet with article bodies but missing dates, etc.  If possible, go manually to the pages to find the missing data.
      3. There are often several pages in a collection run for a later year of the NY Times (when its pages become extremely complex with added materials) that are truncated. Typically, this means that the scraper collected the first lines of text, then hit a special piece of scripted, tabular, image, or other content that stopped it.  Look down the "# words" column for any suspiciously short articles that are not obviously "corrections" or other legitimately short items.  If you see a word count of 18 or 72 or so for an article with a full headline, go in a browser to the actual URL and look at the page to see if there is missing content to be added to the spreadsheet.
      4. Duplicate articles: Occasionally, you will see articles that are obvious duplicates. Eliminate the duplicates if you see them.
      5. Special note: some years of the NY Times from 2010 on must be processed first using the Import.io extractor for 2006-2011, but will have many missing dates and author names. These can be scraped a second time using the Import.io extractor named "supplemental NYT 2010-", and the results then used to fill in the missing dates and authors from the first scrape.  (The supplemental extractor, however, cannot be used as the primary scraper because it cannot get many pages, including but not limited to the blogs.) Once you have the results from the supplementary extractor, you can copy them into the master spreadsheet.  The easiest way to do that is to create another "filtered view" (see instructions above), but this time set it to filter for blanks in the date column.
      6. Finally, if there were failed URLs in the initial use of Import.io to scrape (which you earlier saved in a file called "urls-failed.txt"), either get them manually or rerun them in Import.io. Add the results to the master spreadsheet.
      7. Pages that can't be scraped (except selectively) in the recent years of the NY Times:
        1. Room for Debate special forum articles: From 2013 on, the "Room for Debate" forums have a cover page and then discrete pages for separate contributors to the debate topic. There appears to be no way to scrape, or even display, all the materials in the debate in a single page.  Click on each contributor's page in the forum to find which one(s) mention "humanities" or "liberal arts," and collect that individual debate article manually.
        2. Photo Slideshows: these are ungettable (delete the row in the master spreadsheet).
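       Note: Several of the Phase 2 checks above (blank blog rows, missing dates, suspiciously short article bodies) can also be spotted by exporting the master spreadsheet to CSV and scanning it with a short script. The sketch below assumes a hypothetical export named master-export.csv with column headers "date", "articlebody", and "pageUrl", and an arbitrary 100-word threshold; adjust to match your sheet.

         # Sketch of a quick scan over a CSV export of the master spreadsheet
         # (file name, headers, and threshold here are assumptions):
         # flag empty bodies, short bodies, and missing dates.
         import csv

         with open("master-export.csv", newline="", encoding="utf-8") as f:
             for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 = header
                 body = (row.get("articlebody") or "").strip()
                 words = len(body.split())
                 if not body:
                     print("row %d: empty articlebody -> %s" % (i, row.get("pageUrl")))
                 elif words < 100:  # arbitrary "suspiciously short" cutoff
                     print("row %d: only %d words -> %s" % (i, words, row.get("pageUrl")))
                 if not (row.get("date") or "").strip():
                     print("row %d: missing date -> %s" % (i, row.get("pageUrl")))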
    3. Scraping Phase 3 (Outputting all the results to an "aggregate-plain-text.txt" file)
      1. In the master spreadsheet, select just the following columns: date, title, articlebody, and copy them (Ctrl-C).  (As described previously, these columns should have been arranged adjacent to each other in this order.)
      2. Open a Word document that you will name "aggregate-plain-text.docx," and paste in the contents of the above columns. (Be sure to paste as unformatted text.) This will create a file with all the articles (each beginning with the date, author, and title preceding the article body).  Individual articles are separated from each other by a return (the only use of returns in the file).  There will be excess returns and pages at the bottom of the file that need to be deleted.
      3. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, two returns, and ten "@" signs ("   ^13^13@@@@@@@@@@").  This creates an easy-to-recognize and easy-to-manipulate delimiter between individual articles in the aggregate file.  (One exception: remove the delimiter after the last article at the end of the file so as to prevent creating a blank individual plain text file later.)
      4. Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows:
        1. When Word shows the dialogue for conversion to plain text, choose "Other encoding" > "Unicode (UTF-8)" (i.e., do not choose "Windows default").
  3. Chop Into Individual Plain Text Files
    1. You can use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles.  For "number of places to pad filenames with zeros," leave blank (not "0").  Set the delimiter as the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need to use a carriage return as the delimiter between articles, enter that delimiter in Chopping List as "$r", with no quote marks.)
    2. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation, and also on the WE1S Google Drive).
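       Note: If neither Chopping List nor cut.py is at hand, the same split can be sketched in a few lines of Python, assuming the ten-"@" delimiter added in the previous step and the File_1.txt, File_2.txt, ... naming referred to in the upload instructions below:

         # Sketch of the chop step without Chopping List or cut.py: split
         # aggregate-plain-text.txt on the ten-"@" delimiter and write one
         # article per file as File_1.txt, File_2.txt, ...
         with open("aggregate-plain-text.txt", encoding="utf-8") as f:
             articles = [a.strip() for a in f.read().split("@" * 10) if a.strip()]

         for n, article in enumerate(articles, start=1):
             with open("File_%d.txt" % n, "w", encoding="utf-8") as out:
                 out.write(article + "\n")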
  4. Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure described below)
    1. WE1S Google Drive organization: "Data" consists of individual plain text files for a publication, organized by folders for each year. For example, corpus > new_york_times > data > 1981 > plain_text > humanities contains all the scraped individual plain text files collected for 1981 that contain the word "humanities."
    2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of the New York Times, the working data consists of subfolders under each year for:
      1. aggregate_working_data folder (containing the .TSV spreadsheet; urls.txt; the master spreadsheet; aggregate-plain-text.docx; aggregate-plain-text.txt)
        1. Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the formula "=A2+1" in the next cell down; then drag the handle at the lower right of that cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of the individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.).
      2. individual_plain_text folder (containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data)
      3. JSON folder (containing JSON files)
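         Note: Combining the paths mentioned above, the intended layout is roughly as follows; [year] and [query term] are placeholders, the liberal_arts folder is assumed as the parallel of the humanities folder, and the Google Drive itself is the authoritative reference for the exact arrangement.

           corpus
             new_york_times
               data
                 [year]
                   plain_text
                     humanities      (individual plain text files mentioning "humanities")
                     liberal_arts    (assumed parallel folder for "liberal arts" files)
               working_data
                 [year]
                   [query term]
                     aggregate_working_data
                     individual_plain_text
                     JSON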
  5. Fill out Collection Manifest Form
    1. Go to Form  (submitting the form will send your metadata about the collection run to a Google spreadsheet) 

 

 


 

Quality-Control Inspection Procedures for the New York Times Corpus Collected for WE1S

 

On the WE1S Google Drive, the "corpus" consisting of all the materials we collect is organized by publication, with sub-branches for "data" (finished data in individual plain text form) and "working_data" (the master spreadsheets, urls, aggregated text files, etc. we use to generate the final data). The following quality-control procedures are designed to sample and compare "data" and "working_data" to give us confidence in our final data. It's a reality check, since there are so many opportunities for human error in the collection process. We're not aiming for perfection (because "big data" by its nature is messy). We're just aiming to avoid big mistakes.

 

  1. Verify that the final data files are for the right year. E.g., open one or two NY Times individual plain text files for 2010 to ensure from the dates included at the top that the files are actually from 2010.
  2. Verify that "humanities" occurs in articles for the humanities branch of the collection. Verify that "liberal arts" occurs in articles for the liberal arts branch.
  3. Compare the number of files in the master spreadsheets (for both "humanities" and "liberal arts" queries) to the number of plain text files in the relevant final data folders.
  4. Using only the master spreadsheets in working_data, sample a few articles to verify that when you click on the URL source the article actually corresponds to the metadata (date, author, title, articlebody) listed in the spreadsheet.
  5. Inspect the # words column in the master spreadsheet for any suspiciously short articles.
  6. Sample one or two long articles (over 2,000 words) to see that what we got in "articlebody" includes all of the article.  (This is a test that we got all the "next page" sections of an article.)
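     Note: Check 3 (comparing counts) can also be done with a one-off script; a sketch, assuming a CSV export of a master spreadsheet and the data path used in the examples above (both names below are placeholders):

       # Sketch for check 3: compare the number of data rows in an exported
       # copy of a master spreadsheet with the number of plain text files in
       # the matching data folder (both paths below are placeholders).
       import csv, glob

       with open("nytimes-2010-h-master.csv", newline="", encoding="utf-8") as f:
           rows = sum(1 for _ in csv.DictReader(f))   # header row excluded

       files = len(glob.glob("data/2010/plain_text/humanities/*.txt"))
       print("%d spreadsheet rows vs. %d plain text files" % (rows, files))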
