Washington Post Collection Workflow

Page history last edited by Alan Liu 8 years, 9 months ago

This page provides instructions for collecting The Washington Post (1987-) from among document sources for the WhatEvery1Says project.  (Collecting workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)

The Washington Post (1987-present)

Using Proquest and Outwit Hub Pro (last revised October 8, 2015)


Note: This workflow is almost identical to the Wall Street Journal workflow, since both publications use ProQuest.


Requirements: (preinstalled on the WE1S workstations)

  • Access through institutional subscription to Proquest 
  • Outwit Hub Pro (installed on one of the Transcriptions workstations, using purchased license)


Workflow Steps:

  1. Search Proquest: The Washington Post (pre-1997 Fulltext) OR Proquest: The Washington Post:
    1. The Washington Post is divided up into two parts: those articles published on and prior to December 3, 1996, and those articles published on and after December 4, 1996. Choose which "publication" to search depending on the year you are searching for. 
    2. Since an institutional license is required for Proquest, you need to be working on a campus with a license or while connected to such a campus through the campus VPN (e.g., the UCSB VPN).
    3. At UCSB, open the Proquest search form for The Washington Post (pre-1997 Full Text) OR The Washington Post.
    4. Click on the "Advanced Search" link to show the Advanced Search interface for Proquest.
    5. With the Proquest Advanced Search form pre-set to the publication number designating the WP, use the search fields and Boolean operators to search as follows:
      1. Keyword(s).  To search for a literal word or phrase rather than fuzzy resemblances, put quotes around the term.  For example, searching for the word humanities without quotes will also find articles with the word "humanity."  But searching for "humanities" (in quotes) restricts the search to the literal string of characters.  Searching for "liberal arts" (with quotes around the phrase) returns literal string matches but ignores punctuation, so that results include "liberal arts" and "liberal-arts".  (Note: Proquest allows for Boolean and/or concatenations of terms.  But it does not allow for proximity searches--e.g., two words within a span of 4 words from each other.)
      2. Date range. Select a span of dates. (The Washington Post has fully digitized full text available through Proquest from 1987 on. But make sure you are searching in the right "publication" for the year you are scraping, or your search will return 0 results.)
      3. Sort order: Scroll to the bottom of the Proquest search form and request sort by "oldest first"
      4. Also set the number of results to show at one time ("Items per page") to the maximum of 100. (Note: if the number of results exceeds 100, you will need to be sure that all the results, e.g., 1-134, including those not visible on the current results page, are "selected" for export in the next step below.)
    6. After running the search, "select" all article titles in the results list. Make sure to select only articles from the year you are searching for. (This may just be a Chrome problem, but sometimes ProQuest keeps articles from previous years "selected" even though they do not appear on the screen or in the search results. You can tell they are selected because, although your search may have returned only 137 articles, something like 260 articles will be shown as "selected". You can clear the ghost results by clicking "Clear" on the search results page.) Then use the "Export/Save" function to export to "HTML."  Fill in the dialogue for export options as in the screenshot below (being sure to include the "full text"). [Screenshot: Proquest Wall Street Journal export to HTML options]
    7. Open the resulting HTML page (a local web page consisting of a .html file). Also copy the local web page from the temp folder on your workstation where it appears to the working data to be saved for the collection run. The file name for the HTML file has by default the date of collection (e.g., ProQuestDocuments-2015-07-21.html). For the copy you are storing among working data, change the name to reflect the materials being collected.  For example, the HTML file with the results of the search of the WP for 1987, "humanities" query, should be: "ProQuestDocuments-wp-1987-h.html".
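The renaming convention in the last sub-step can be sketched in Python. This is only an illustration: the function name `export_copy_name` and its parameters are hypothetical, and the pattern is inferred from the single example given above ("ProQuestDocuments-wp-1987-h.html").

```python
def export_copy_name(publication: str, year: int, query_code: str) -> str:
    """Build the file name for the stored copy of a Proquest HTML export.

    Assumed pattern: ProQuestDocuments-[publication]-[year]-[query code].html,
    e.g. publication "wp" and query code "h" for "humanities".
    """
    return f"ProQuestDocuments-{publication}-{year}-{query_code}.html"


# Example: the 1987 "humanities" search of The Washington Post.
print(export_copy_name("wp", 1987, "h"))
```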
  2. Scraping (Phase 1): Scrape articles Using Outwit Hub Pro:
    1. We need to use Outwit Hub Pro (the pro version is a paid product available on one of the Transcriptions workstations) because we have not been able to use our usual Import.io solution to build a working scraper.
    2. Open Outwit Hub Pro.
    3. Click on "scrapers" in the sidebar at the left.
      1. If there are no visible scrapers, then download them in the form of XML files from the Outwit Hub Scrapers folder on the WE1S Google Drive; then "import" them in Outwit Hub as scrapers.
    4. Select the "Proquest Wall Street Journal" scraper (by double-clicking on that scraper until its collection fields appear). This scraper works for The Washington Post as well.
    5. Enter the URL of the local .html file created from the Proquest export into Outwit Hub Pro's built-in browser and "return" to load the page.
    6. Click on "Execute." This will scrape the date, title, and article body of articles from the local .html page.
    7. "Export" the results from Outwit Hub Pro as an Excel spreadsheet.
    8. Creation of Master Spreadsheet for Scrape:
      1. Copy the entire contents of the Excel spreadsheet exported from Outwit Hub Pro above.
      2. Go to the WE1S Google Drive and in the appropriate working data subfolder (washington_post > working_data > [year] > [query term] > aggregate_working_data) create a Google spreadsheet named "wp-[year]-h-master" (for "humanities" queries) or "wp-[year]-la-master" (for "liberal arts" queries).  E.g., "wp-2007-h-master" or "wp-2007-la-master".  Paste into the Google spreadsheet the content from Excel. The Google spreadsheet will be the master spreadsheet for the scrape (referred to in the instructions below as the "master spreadsheet").
                (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)
      3. Organize and add columns in Master Spreadsheet:
        Arrange (and add) columns in the spreadsheet as follows:
        1. Column A (add this column last when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #" Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells with sequential numbers.  The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping. 
        2. Column B -- the "date" column [Special note: For some reason, all the dates in the export of search results from Proquest will be off by one row.  Move the entire column of dates in the spreadsheet up one row.]
        3. Column C -- the "title" column
        4. Column D -- the "articlebody" column
        5. Column E -- the "author" column
        6. Column F -- Add a column here with the label "# words." Content: insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))     Note: For Excel, the equivalent formula is: =IF(LEN(TRIM(D2))=0,0,LEN(TRIM(D2))-LEN(SUBSTITUTE(D2," ",""))+1)
               Adjust the "D2" here to correspond to the column in which the "articlebody" currently resides, which may change depending on whether you have added Column A above yet.
               Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.
        7. Column G -- the "document Url" column [Special note: For some reason, all the document URLs in the export of search results from Proquest will be off by one row.  Move the entire column of document URLs in the spreadsheet up one row.]
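For anyone checking the master spreadsheet outside of Google Sheets or Excel, the two mechanical operations in this step (the word count and the one-row upward shift of the date and document URL columns) can be approximated in Python. This is a rough sketch, not part of the WE1S tooling: the function names are hypothetical, and the word count splits on any whitespace rather than only on spaces, so it can differ slightly from the spreadsheet formulas.

```python
def word_count(article_body: str) -> int:
    """Count whitespace-separated words; an empty or blank body counts as 0."""
    return len(article_body.split())


def shift_up(column: list[str]) -> list[str]:
    """Move every data cell up one row, padding the last row with an empty string.

    Pass only the data cells (not the header label in row 1).
    """
    return (column[1:] + [""]) if column else []


print(word_count("The humanities  matter"))
print(shift_up(["", "1987-01-02", "1987-01-05"]))
```

Note that this sketch reports 0 for an empty article body, whereas the workflow describes the spreadsheet showing a word count of "1" in that case.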
  3. Scraping (Phase 2): Getting missing data.
    1. Occasionally, only an abstract is available for an article, in which case there will be no content in the "articlebody" field for that article and a word count of "1". Go to the document URL to verify that there is no full text available. If there is only an abstract, manually copy the abstract, remove line breaks, and paste it into the articlebody field prefaced with the phrase, in brackets: "[Abstract Only]".
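The clean-up applied to a copied abstract can be sketched as follows. The function name `format_abstract` is hypothetical, and two details are assumptions rather than part of the workflow: a single space is placed between the bracketed prefix and the abstract, and all runs of whitespace (not only line breaks) are collapsed to single spaces.

```python
ABSTRACT_PREFIX = "[Abstract Only]"


def format_abstract(abstract: str) -> str:
    """Remove line breaks from a copied abstract and prepend the bracketed prefix."""
    # split() with no arguments breaks on any whitespace, including \r and \n,
    # so joining with single spaces yields one continuous line.
    one_line = " ".join(abstract.split())
    return f"{ABSTRACT_PREFIX} {one_line}"


print(format_abstract("First line of the abstract.\nSecond line.\r\n"))
```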
  4. Scraping (Phase 3): Output all the results to an "aggregate-plain-text.txt" file
    1. In the master spreadsheet, select just the following columns: date, title, articlebody, and use "Ctrl-C" to copy them.  (As described previously, these columns should have been arranged adjacent to each other in this order.)
    2. Open a Word document that you will name "aggregate-plain-text.docx," and paste in the contents of the above columns. (Be sure to use paste - unformatted). This will create a file with all the articles (each beginning with the date and title preceding the article body). Individual articles are separated from each other by a return (the only use of returns in the file).  There will be excess returns and pages at the bottom of the file that need to be deleted. (One important note: if you copied in whole columns from the spreadsheet, the first line in the doc will consist of the column labels followed by a return.  Delete this line. Otherwise, at the end of the process you will create an empty first file.)
    3. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, followed by two returns, followed by ten "@" signs ("   ^13^13@@@@@@@@@@").  This creates an easy-to-recognize and -manipulate delimiter between individual articles in the aggregate file.  (One exception: remove the delimiter after the last article at the end of the file so as to prevent creating a blank individual plain text file later).
    4. Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows: 
      1. When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
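The Word find-and-replace in this phase amounts to joining the articles with a fixed delimiter (three spaces, two returns, ten "@" signs) and leaving no delimiter after the last article. A minimal Python sketch of the same result follows. It is an illustration only: the function names are hypothetical, and it assumes each row is a (date, title, articlebody) triple whose cells are tab-separated, as an unformatted paste from a spreadsheet would typically produce.

```python
DELIMITER = "   \n\n" + "@" * 10  # three spaces, two returns, ten "@" signs


def build_aggregate(rows: list[tuple[str, str, str]]) -> str:
    """Join (date, title, articlebody) rows into one delimited aggregate string."""
    records = ["\t".join(row) for row in rows]
    # join() puts the delimiter only between records, so none follows the last one.
    return DELIMITER.join(records)


def save_aggregate(text: str, path: str) -> None:
    """Write the aggregate text as UTF-8, matching the encoding chosen in Word."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


rows = [
    ("1987-01-02", "Title A", "Body A"),
    ("1987-01-05", "Title B", "Body B"),
]
print(build_aggregate(rows))
```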
  5. Chop Into Individual Plain Text Files
    1. You can use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles.  For "number of places to pad filenames with zeros," leave blank (not "0").  Set the delimiter as the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need to use a carriage return as the delimiter between articles, enter that delimiter in Chopping List as "$r" (no quote marks).)
    2. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive).
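For reference, the chopping operation itself is simple enough to sketch in a few lines of Python. This is not Scott's cut.py (whose contents are not reproduced on this page) and not a description of how Chopping List works internally; the function names are hypothetical. The sketch assumes the ten-"@" delimiter, names the output File_1.txt, File_2.txt, and so on without zero padding, strips surrounding whitespace from each article, and skips blank pieces such as one left by a trailing delimiter.

```python
from pathlib import Path

DELIMITER = "@" * 10  # the ten "@" signs added to the aggregate file


def chop(aggregate_text: str) -> list[str]:
    """Split the aggregate text on the delimiter, dropping blank pieces."""
    return [part.strip() for part in aggregate_text.split(DELIMITER) if part.strip()]


def write_files(articles: list[str], out_dir: str) -> list[Path]:
    """Write each article to File_N.txt (N starting at 1) as UTF-8."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, article in enumerate(articles, start=1):
        path = out / f"File_{number}.txt"
        path.write_text(article, encoding="utf-8")
        paths.append(path)
    return paths
```

A typical call would read aggregate-plain-text.txt with `Path(...).read_text(encoding="utf-8")`, pass the result to `chop`, and then hand the list to `write_files` together with the target folder.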
  6. Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure indicated in the screenshot below) 
    1. [Screenshot: WE1S Google Drive organization] "Data" consists of individual plain text files for a publication organized by folders for each year. For example, corpus > washington_post > data > 1984 > plain_text > humanities contains all the scraped individual plain text files collected for 1984 that contain the word "humanities."
    2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of The Washington Post, the working data consists of subfolders under each year for:
      1. aggregate_working_data folder (containing the local HTML file exported from the Proquest search; the Outwit Hub scraped export XLS spreadsheet; the master spreadsheet; aggregate-plain-text.docx; aggregate-plain-text.txt)
        1. Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the following formula in the next cell down: "=A2+1"; then drag the handle at lower-right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.)
      2. individual_plain_text folder (containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data)
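The folder layout described in this step can be expressed as path-building helpers, which may help when scripting uploads or checking that files landed in the right place. This is a sketch under assumptions: the function names are hypothetical, the "data" path follows the example given above, and the "working_data" path assumes that branch sits directly under the publication folder with year and query-term subfolders.

```python
from pathlib import PurePosixPath


def data_folder(publication: str, year: int, query: str) -> PurePosixPath:
    """Folder for final plain text files, e.g. corpus/washington_post/data/1984/plain_text/humanities."""
    return PurePosixPath("corpus", publication, "data", str(year), "plain_text", query)


def working_data_folder(publication: str, year: int, query: str, subfolder: str) -> PurePosixPath:
    """Folder for working data; subfolder is aggregate_working_data or individual_plain_text."""
    return PurePosixPath("corpus", publication, "working_data", str(year), query, subfolder)


print(data_folder("washington_post", 1984, "humanities"))
print(working_data_folder("washington_post", 1987, "humanities", "aggregate_working_data"))
```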
  7. Fill out Collection Manifest Form
    1. Go to Form  (submitting the form will send your metadata about the collection run to a Google spreadsheet) 






