Guardian Collection Workflow


This page provides instructions for collecting The Guardian from among the document sources for the WhatEvery1Says project.  (Collection workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)


 

The Guardian (1994-present)

Using the Guardian API for searching and Import.io as the scraper (last revised August 2, 2015) 

 

Requirements: (preinstalled on the WE1S workstations)

  • Guardian API Key (register on developer's site and request a key for the "articles" API)
  • Import.io (free) -- Note: Import.io can be used in a browser to extract data automatically from a Web page, and the browser version can also use customized extractors that someone else has created.  However, creating customized extractors (e.g., ones able to select only particular data from a page) can only be done through Import.io's downloadable "app" (which is also free).

 

Workflow Steps:

  1. Get a Guardian API Key if you don't have one (request developer key).
  2. Search The Guardian using the Guardian's API Console:
    1. In a web browser (not the one built into Import.io), go to the URL of the Guardian's "Open Platform API Console" (beta version, not the old version): http://open-platform.theguardian.com/explore/
    2. When the Guardian's API Console search form loads, check the box in that form for "Show All Filters."  Then fill in the following fields in the form:
      1. search term in double quotes (e.g., "humanities" or "liberal arts")
      2. order-by (choose "oldest")
      3. page-size (set at 200 to show the maximum number of hits per search-results page)
      4. from-date & to-date (in the format, e.g., "2014-01-31")
      5. api-key (your Guardian api key)
    3. At the bottom of the Guardian API Console web page, you'll instantly see a live view of the JSON returned by your search parameters, including the URLs for the pages found.  The maximum number of hits on a single search-results page is 199. Select and copy the JSON results, beginning with the first curly bracket "{" and ending with the last curly bracket "}".
    4. For multiple search-results pages:
      1. The JSON search results start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., currentPage: 2, pages: 2). This shows how many total results there are and whether you need to gather the JSON on extra search-results pages.
      2. If so, after harvesting the results of one search-results page (by copying the JSON into a "guardian-json" spreadsheet through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results.
      3. You'll then need to copy the JSON from each additional set of search results and accumulate it at the bottom of the JSON previously collected in the "guardian-json" spreadsheet. (A scripted alternative to this console workflow is sketched below.)
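
    If you prefer scripting this step to copying JSON out of the console by hand, the same search can be run against the Guardian Content API from a short Python script. The following is only a minimal sketch, assuming the standard content.guardianapis.com/search endpoint and the same parameters used in the console form above; the output file name "guardian-json.json" is an arbitrary choice for your working data.

        # Sketch: query the Guardian Content API directly and accumulate all
        # search-results pages, instead of copying JSON from the API Console.
        import json
        import requests

        API_KEY = "YOUR-GUARDIAN-API-KEY"   # your own developer key
        ENDPOINT = "https://content.guardianapis.com/search"

        params = {
            "q": '"humanities"',       # search term in double quotes
            "order-by": "oldest",
            "page-size": 200,          # maximum hits per search-results page
            "from-date": "2014-01-01",
            "to-date": "2014-12-31",
            "api-key": API_KEY,
            "page": 1,
        }

        all_results = []
        while True:
            response = requests.get(ENDPOINT, params=params).json()["response"]
            all_results.extend(response["results"])
            if response["currentPage"] >= response["pages"]:
                break                  # no more search-results pages
            params["page"] += 1

        # Save the accumulated results alongside the "guardian-json" working data.
        with open("guardian-json.json", "w", encoding="utf-8") as f:
            json.dump(all_results, f, indent=2)
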
  3. Collect the URLs of the links from the Guardian search in Excel:
    1. Paste the JSON search results from the steps above into a blank spreadsheet, e.g., an Excel sheet (using "paste special" to paste without formatting). Name the spreadsheet "guardian-json.xlsx" and save it among your working data. (See above on adding extra JSON from additional search-results pages from the Guardian API Console.)
    2. Select the first column, then go to Data view ("Data" tab in Excel) and click on "Filter." 
    3. Cell A1 will now show a little drop-down arrow.  Click on the arrow, and choose: Text Filters > Begins with.  Type in "webURL: " (includes space after the colon). Then click "OK" to filter the worksheet to show only the rows beginning with this expression.
    4. Next do a search-and-replace in Excel to replace "webURL: " (includes trailing space) with nothing. This will leave you with a visible column of URLs without extraneous matter.
    5. Finally, copy the column of URLs, excluding any rows with only a curly bracket (and save the URLs in your working data as "urls.txt").
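
    If you collected the JSON with the script sketched above rather than pasting it into Excel, the article URLs can also be pulled out programmatically instead of filtering on "webURL: ". A minimal sketch, assuming the results were saved as "guardian-json.json" and that each result carries a webUrl field as in the console output:

        # Sketch: extract the article URLs from the saved JSON results and
        # write them one per line, ready for Import.io's "Bulk Extract" field.
        import json

        with open("guardian-json.json", encoding="utf-8") as f:
            results = json.load(f)

        urls = [item["webUrl"] for item in results if "webUrl" in item]

        with open("urls.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(urls))
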
  4. Scraping (Phase 1): Scrape and Export the Guardian articles using the Import.io "Guardian - Primary Scraper": 
    1. Open Import.io by pointing your browser to the URL for the Guardian - Primary Scraper extractor
    2. Click the Import.io tab for "Bulk Extract"
    3. Paste into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in Excel (from the steps above).  (The URLs should be listed one per line.) 
    4. Then click on "Run queries".
      1. Note: If there are failed queries (there will probably be a fair number), click on the "i" info icon for the failed queries, then click on "retry" to retry them once.  Then copy the remaining failed URLs into a file called "urls-failed.txt." (You'll need to clean extra text and numbers from this file. The easiest way is to use a word processor to get rid of everything but the number, tab, and URL on the relevant lines. Then paste into a temporary Excel spreadsheet and copy only the column containing the URLs. Excel will treat the tab as a delimiter; if not, use Data > Text to Columns to separate the information into columns.) Keep the "urls-failed.txt" file in your working data for phase 2 of the scraping process below.
    5. When extraction of data is complete, click on the "Export" button in Import.io.  From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet".) Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
    6. Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder (guardian > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "guardian-[year]-h-master" (for "humanities" queries) or "guardian-[year]-la-master" (for "liberal arts" queries), e.g., "guardian-2007-h-master" or "guardian-2007-la-master."  Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
              (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)
      1. Organize and add columns in Master Spreadsheet:
        Arrange (and add) columns in the spreadsheet as follows:
        1. Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #" Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column with sequential numbers.  The purpose is to assign a file identifier to each article that corresponds to the file sequence in the individual plain text files that will be the final output of scraping. 
        2. Column B -- the "date" column
        3. Column C -- the "title" column
        4. Column D -- the "subtitle" column [Note: This is an extra column in the Guardian master spreadsheet]
        5. Column E -- the "articlebody" column
        6. Column F -- the "author" column [Note: The author names are in a column labeled "author/text." Use this column and retitle it "author." There is another column called "author" that includes a link to the author's page. Retitle this column "authorURL"]
        7. Column G -- Add a column here with label "# words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(E2, " ")) 
               Adjust the "E2" here to correspond to the column in which the "articlebody" currently resides, which may change depending on whether you have added Column A above yet.
               Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.
        8. Column H -- the "pageUrl" column
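
    For reference, the same column arrangement can also be sketched in pandas rather than done by hand in the master spreadsheet. This is only an illustration, not part of the official workflow: the input file name "guardian-export.html" is an assumption, the column labels are those of the Import.io export described above, and reading the HTML table requires an HTML parser such as lxml to be installed.

        # Sketch: load the Import.io HTML export, rename the author columns,
        # add the "# words" and "Plain Text file #" columns, and order the
        # columns as in the master spreadsheet.
        import pandas as pd

        df = pd.read_html("guardian-export.html")[0]

        # "author/text" becomes "author"; the link column becomes "authorURL".
        df = df.rename(columns={"author/text": "author", "author": "authorURL"})

        # Word count of each article body (the "# words" column).
        df["# words"] = df["articlebody"].fillna("").str.split().str.len()

        # Sequential file numbers matching the plain text files produced later.
        df.insert(0, "Plain Text file #", range(1, len(df) + 1))

        columns = ["Plain Text file #", "date", "title", "subtitle",
                   "articlebody", "author", "# words", "pageUrl"]
        df[columns].to_csv("guardian-master.csv", index=False)
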
  5. Scraping (Phase 2): Scrape and Export Failed URLs using the Import.io "Guardian - Secondary Scraper":
    1. Feed the URLs from the "urls-failed.txt" file you earlier created into Import.io, this time using the Guardian - Secondary Scraper.  This will get almost all the failed URLs.
    2. Export into an HTML file, then copy that and paste into an Excel spreadsheet that you name "failed-urls-added-to-master.xlsx". Arrange the columns in order to exactly match the master spreadsheet (date, title, subtitle, articlebody, author, # words, pageURL).
    3. Then copy the exact block of date, title, subtitle, articlebody, author, # words, pageURL cells from the Excel sheet to the bottom of the master Google spreadsheet.
    4. This will result in all the material from the failed URLs added out of chronological order at the bottom of the master spreadsheet. To convert dates from plain text into actual dates and then sort chronologically, follow this process:
      1. Select all the dates in the Google spreadsheet and copy them.
      2. Paste them into a column in an Excel spreadsheet that you name "dates-conversion.xlsx".
      3. Under the "Data" tab in Excel, click on "Text to Columns."
        1. Choose "delimited."
        2. Then on the next dialogue screen, set the delimiter to "space."
        3. Then finish, which will separate all the parts of the plain text date and time into separate columns.
      4. Delete the irrelevant columns, which will leave you with separate columns for day, month, and year.
      5. Then in another column, use the following formula to concatenate the text in the three columns, adding a space between each text string (adjusting the cell references as appropriate): =CONCATENATE(A2," ",B2," ",C2)
        1. Note: do this for the first cell, then drag down on the active corner of the cell to apply to all the rest of the column.
      6. Next, copy the column and paste as "values" only in another column, so that you end up with cells that look like this: "1 January 2014".
      7. Select the column, and format as date (as opposed to plain text).
      8. Then select the exact block of dates in the finished column in the Excel spreadsheet, and paste them over the exact block of dates in the Google spreadsheet.  (You should be able to tell that the original plain text dates are replaced correctly with their actual date equivalents in order.)
      9. Finally, because all you have done so far is convert the plain-text dates into actual dates, click on the sort arrow at the top of the date column and sort "A - Z" to sort chronologically. (A pandas alternative to this date-conversion process is sketched below.)
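
    As an alternative to the Excel round trip just described, the plain-text dates can be parsed and sorted with pandas. A minimal sketch, assuming the master spreadsheet has been exported as "guardian-master.csv" with a "date" column:

        # Sketch: parse whatever date/time text was scraped, keep only the
        # calendar date, and sort the articles oldest-first.
        import pandas as pd

        df = pd.read_csv("guardian-master.csv")
        df["date"] = pd.to_datetime(df["date"], errors="coerce").dt.date
        df = df.sort_values("date")
        df.to_csv("guardian-master-sorted.csv", index=False)
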
  6. Scraping (Phase 3): Delete any remaining articles with missing articlebody.
    1. In the later years of The Guardian, the two previous phases of scraping will leave a fair number of articles that have all the metadata (date, title, etc.) but no article body (# words = 1). An examination of these shows that almost all are ungettable material -- e.g., live blogs and video/pictorial posts made up of separate postings, photo sequences, lists in table format, etc.
    2. Delete each of these rows.
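
    The same cleanup can be done programmatically. A sketch, assuming the sorted sheet from the previous step was saved as "guardian-master-sorted.csv":

        # Sketch: drop rows whose article body never scraped (empty or a
        # single token, i.e., # words = 1), matching the manual deletion above.
        import pandas as pd

        df = pd.read_csv("guardian-master-sorted.csv")
        df = df[df["articlebody"].fillna("").str.split().str.len() > 1]
        df.to_csv("guardian-master-clean.csv", index=False)
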
  7. Scraping (Phase 4): Output all the results to an "aggregate-plain-text.txt" file
    1. In the master spreadsheet, select just the following columns: date, title, subtitle, articlebody, and copy them.  (As described previously, these columns should have been arranged adjacent to each other in this order.)
    2. Open a Word document that you will name "aggregate-plain-text.docx," and paste in the contents of the above columns. (Be sure to paste unformatted.) This will create a file with all the articles (with date, title, and subtitle preceding each article body). Individual articles are separated from each other by a return (the only use of returns in the file).  There will be excess returns and pages at the bottom of the file that need to be deleted.
    3. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, followed by two returns, followed by ten "@" signs ("   ^13^13@@@@@@@@@@").  This creates an easy-to-recognize and -manipulate delimiter between individual articles in the aggregate file.  (One exception: remove the delimiter after the last article at the end of the file so as to prevent creating a blank individual plain text file later.)
    4. Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows: 
      1. When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
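
    If you are working from a CSV export of the master spreadsheet rather than in Word, the aggregate file can also be written by a short script that inserts the same delimiter. A sketch, assuming the cleaned sheet was saved as "guardian-master-clean.csv" with the column labels used earlier:

        # Sketch: build aggregate-plain-text.txt with tab-separated fields per
        # article and the three-spaces / two-returns / ten-"@" delimiter between
        # articles (no delimiter after the last article), saved as UTF-8.
        import pandas as pd

        df = pd.read_csv("guardian-master-clean.csv").fillna("")

        articles = ["\t".join([str(row["date"]), str(row["title"]),
                               str(row["subtitle"]), str(row["articlebody"])])
                    for _, row in df.iterrows()]

        delimiter = "   \n\n@@@@@@@@@@"
        with open("aggregate-plain-text.txt", "w", encoding="utf-8") as f:
            f.write(delimiter.join(articles))
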
  8. Chop Into Individual Plain Text Files
    1. You can use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles.  For "number of places to pad filenames with zeros," leave blank (not "0").  Set the delimiter as the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need to find a delimiter between articles that is a carriage return, enter that delimiter in Chopping List as "$r" with no quote marks.)
    2. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive). A rough equivalent is sketched below.
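
    If you have neither Chopping List nor cut.py to hand, a rough equivalent of the chopping step looks like the sketch below. The File_1.txt, File_2.txt naming is an assumption chosen to match the "Plain Text file #" column described above; cut.py itself may behave differently.

        # Sketch: split aggregate-plain-text.txt on the ten-"@" delimiter and
        # write each article to its own numbered plain text file.
        import os

        with open("aggregate-plain-text.txt", encoding="utf-8") as f:
            articles = f.read().split("@@@@@@@@@@")

        os.makedirs("individual_plain_text", exist_ok=True)
        for n, text in enumerate(articles, start=1):
            path = os.path.join("individual_plain_text", f"File_{n}.txt")
            with open(path, "w", encoding="utf-8") as out:
                out.write(text.strip())
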
  9. Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure indicated in the screenshot below) 
    1. [Screenshot: WE1S Google Drive organization] "Data" consists of individual plain text files for a publication, organized by folders for each year. For example, corpus > guardian > data > 1984 > plain_text > humanities contains all the scraped individual plain text files collected for 1984 that contain the word "humanities."
    2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of the Guardian, the working data consists of subfolders under each year for:
      1. aggregate_working_data folder (containing urls.xlsx, the master spreadsheet, aggregate-plain-text.docx, and aggregate-plain-text.txt)
        1. Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the formula "=A2+1" in the next cell down; then drag the handle at the lower right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of the individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.)
      2. individual_plain_text folder (containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data)
  10. Fill out Collection Manifest Form
    1. Go to the Form.  (Submitting the form will send your metadata about the collection run to a Google spreadsheet.)
