| 
  • If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!

View
 

New York Times Random Corpus Collection Workflow

Page history last edited by Lindsay Thomas 7 years, 5 months ago

The New York Times (1981-present)         

Using Proquest, Outwit Hub Pro, and OpenRefine (last revised September 18, 2016)

 

Note: This workflow is basically identical to the Washington Post and the Wall Street Journal random corpus collection workflows, since all can be done using ProQuest.

 

Requirements: 

 

Workflow Steps:

1. Search ProQuest: New York Times (Late Edition, Eastern Coast) 

    1. Since an institutional license is required for ProQuest, you need to be working on a campus or you need to sign in to ProQuest via the UM Library site.
      1. Go to the UM Library site. Search for Databases > P. Select "ProQuest Newsstand."
      2. If you're off campus, you will need to sign in with your CaneID and password.  
    2. Click on the "Advanced Search" link to show the Advanced Search interface for ProQuest. You might also be taken here automatically.
      1. Click on "Publications." Search for "New York Times."
      2. Select the Eastern Edition option of NYT (full text from 1980 to the present).
    3. You should now be looking at the "Publication Information" site for your selected publication. Scroll down and select "Advanced Search" under "Search within this publication." 
    4. Search as follows: 
      1. First search bar. Make sure the publication's pubid is automatically entered into the first search bar. 
      2. Second search bar. To search for a literal word or phrase rather than fuzzy resemblances, put quotes around the term. Enter the word "of" including the quotes.
        1. Note: Because ProQuest does not technically allow for random sampling, this is how we're "hacking" its interface: we're simply searching for any articles in a given year that contain the word "of." This will give us an effectively random sampling of articles published in that year. 
      3. Publication date. Select "specific date range" from the dropdown menu and fill out as follows:
        1. Start: January 1, whatever year you are scraping
        2. End: December 31, whatever year you are scraping 
      4. Sort results by: Scroll to the bottom of the ProQuest search form and make sure "Relevance" is selected for this field.
      5. Items per page: This is where you tell ProQuest how many search results to display on one page at a time. You will want to change this number depending on how many articles you are going to be scraping for your year. For example, if you need to scrape 102 articles, setting the "items per page" to 100 makes the most sense and will allow for the quickest scraping. However, if you're only scraping 25 articles, you will likely want to set the items per page to 20. 
    5. After running the search, select the number of articles you need to scrape. Depending on how many you need, you may need to go to the next page and select the rest. For example, if you need to scrape 104 articles, select all from the first page of results, then click "Next," and select 4 from the following page. Make sure to select only articles from the year you are searching for.
      1. Note: This may just be a Chrome problem, but for whatever reason, sometimes ProQuest will keep articles from previous years "selected," even though they do not appear on the screen or in the search results. You can tell they are selected because although you have only selected 137 articles, there will be something like 260 articles "selected" when you go to export/save your results. You can clear the ghost results by clicking "Clear" on the search results page.
    6. You should also switch up the page or pages of search results you select articles from. For example, if you need to collect 105 articles from 1987, you might collect those 105 articles from the first and second pages of search results (100 from the first page, 5 from the second). If you then need to collect 102 from 1990, you might select these 102 articles from the 8th and 9th pages of search results, and so on. This will also help to randomize our results.
    7. Once you've selected the proper number of articles, use the "Export/Save" function to export to "HTML."  Fill in the dialogue for export options as in the below screenshot (being sure to include the "full text").

 

 

    1. Save the resulting HTML page (a local web page consisting of a .html file) in whatever folder you have set up to hold WE1S materials. The file name for the HTML file has by default the date of collection (e.g., ProQuestDocuments-2015-07-21.html). To help keep things straight, change the name to reflect the materials being collected.  For example, the HTML file with the results of the randomized search of the NYT for 1987 should be: "ProQuestDocuments-nyt-1987-random.html" [the .html is the file extension, not part of the file's name]).

 

2. Scraping (Phase 1): Scrape articles Using Outwit Hub Pro

    1. Open Outwit Hub Pro.
    2. Click on "scrapers" in the sidebar at the left (under "automators").
      1. If there are no visible scrapers, then download them in the form of XML files from the Outwit Hub Scrapers folder on the WE1S Google Drive; then "import" them in Outwit Hub as scrapers (the ProQuest Wall Street Journal scraper). Once it's imported, it will appear in the bottom half of the screen with a check mark next to it.
    3. Select the "ProQuest Wall Street Journal" scraper (by double-clicking on that scraper until its collection fields appear). This scraper works for The New York Times as well.
    4. Open the HTML file you downloaded from ProQuest in a web browser. Copy the URL. Then, enter that URL into Outwit Hub Pro's built-in browser and click enter/return to load the page.
    5. Click on "Execute" in Outwit Hub Pro. This will scrape the date, title, and article body of articles from the local .html page.
    6. Click on the black "Export" tab, then choose "Excel" from the dropdown menu. Click "Edit." Arrange the columns in the top yellow box by moving them around so that they are in the following order:
      1. Source
      2. Date
      3. Title
      4. Articlebody
      5. Author
      6. Document URL
    7. Note: Move all of the yellow column headings to the left of the editing interface. Also, sometimes, particularly for the NYT, the scraper will pick up an additional field called "Source URL." If you see that column title, don't worry about it now. We will get rid of it when we edit the spreadsheet in the following step.
    8. Click the green check mark, then click "Export." Save the Excel spreadsheet in your WE1S folder according to the following format: publication-random-year.xlsx. So, for a scrape of the New York Times for 1987, it would look like this: "nyt-1987-random.xlsx" (again, .xlsx is the file extension, not part of the file name). Save as an Excel spreadsheet (.xslx format -- you may not be able to do this directly from Outwit Hub Pro's export function. You may need to save the file using .xls or .xml, and then open it up in Excel and save it as .xslx).

 

3. Scraping (Phase 2): Getting missing data.

    1. Occasionally, ProQuest makes only abstracts available, in which case there will be no content in the "articlebody" field for an article and a word count of "1". Go to the document URL to verify that there is no full text available. If there is full text of the article body available, copy and paste it into the articlebody column for that article. If not, delete that article and its metadata (title, author, etc) from your spreadsheet. After you've finalized the spreadsheet, go back and download whatever number of missing articles from that year that you need from ProQuest and repeat the process. (So, for example, if you were supposed to download 105 articles from 1987, but one of those articles was missing its articlebody, after finalizing the first spreadsheet, you would go back into ProQuest and repeat the process again, but this time only searching for 1 article from 1987.)

 

4. Organize and edit columns in spreadsheet:  

    1. Organize and edit columns in the spreadsheet you have just saved as follows. It's important that the columns names correspond exactly to how they are written here (ie, if it's in lowercase here, it needs to be in lowercase in your spreadsheet). Make sure to save your work as you go along.
    2. If your spreadsheet contains a SourceURL column, delete it now. Select the entire column, right-click, and choose "delete." 
    3.  Column A: Add this column to the beginning of the spreadsheet by selecting the current A column, right-clicking, then selecting "Insert." A new A column should appear to the left of the selected column.
      1.  Label: id
      2.  Content: Type in the first available column: publication-year-r-1. As an example: wsj-1987-r-1. The "r" is for "random corpus." Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the article #. Now each article has a unique id.
    4.  Column B:
      1.  Label: publication
    5.  Column C: 
      1. Label: pubdate
      2. Content: For some reason, all the dates in the export of search results from Proquest will be off by one row.  Move the entire column of dates in the spreadsheet up one row. Do this by selecting the first row with a date in it, pressing shift, and then using the arrow keys on your keyboard to select all of the remaining dates. Copy this selection. Then, place your cursor in the empty field in the top row and paste in the dates you copied.
      3. Formatting date column: The dates will also be in the following format: Jan 2, 2002. However, we need them to be in this format: 02/02/2002. Unfortunately the fix is a bit cumbersome. Here's how you do it:
        1. Copy all of the dates in the date column and paste them in a separate, empty column or in any empty plain text file.
        2. Select the date column and click CONTROL + 1 (Windows) or COMMAND + 1 (Mac). The formatting dialogue box will appear.
        3. Select "Text" under the Category menu, and then click OK.
        4. The dates will have all changed to numbers. At this point, you need to replace each of these numbers with the correct dates. You can try pasting in the dates again, but I'm not sure if this will work. This may need to be done by hand.  
    6.  Column D:
      1.  Label: title
    7.  Column E:
      1.  Label: articlebody 
      2. Content: Articles from some years of the NYT all begin with "LEAD: ". We need to take this out. To do this, use Excel's find/replace feature. Select the articlebody column, and then click CTRL+F (Windows)/COMMAND+F (Mac). Or Edit > Find > Replace. Search for "LEAD: " (without the quotes, but with a space after the colon) in the "Find" box, and leave the "Replace" box empty. Then click "Replace all." This should get rid of all of the "LEAD: " instances. 
    8.  Column F:
      1. Label: author 
      2. Content: Check this column to see if there are words or phrases other than the author’s name in the cells, such as "by" or "published". If there are, you can cut these phrases by using Excel’s find/replace feature. Select the respective author column, and then click CTRL+F (Windows)/COMMAND+F (Mac). Or Edit > Find > Replace. Search for "by" in the "Find" box, and leave the "Replace" box empty. Then click "Replace all." 
      3. Formatting author column: The proper format for the author column is Last name_First name_Middle name or initial. You must therefore add an underscore between the names. The NYT makes this difficult because many of their author names aren't necessarily standardized. It may be easiest to do this manually, but here's a rough list of what you need to change:
        1. Delete all "Special to the New York Times"
        2. Delete all commas
        3. Delete middle initial periods, or periods after Jr.
        4. Change "Associated Press" to AP
        5. If there are two authors, their names should just follow each other with no commas or "and"'s between them, e.g.: Smith_John Johnson_Laura
        6. Here's how to automatically format some of the names in Excel:
          1. Click on the newly cleaned column and click CTRL+F on a Windows machine or CMD+F if you have a Mac to open the find and replace feature (make sure to click "Replace" once the "Find" box appears). Or Edit > Find > Replace. In the "find" box put a comma and a space (, ); in the "replace" box put an underscore (_). Click "Find Next" and then "Replace," checking each entry as you go one-by-one. This will put an underscore automatically between each last name and first name.  
        7. Check your work by looking over the author column to check for any errors. 
      4. Sometimes an article won't have an author. That's fine. 
    9.  Column G:
      1.  Label: docURL 
      2. Content: For some reason, all the docURL's in the export of search results from ProQuest will be off by one row. Move the entire column of docURL's in the spreadsheet up one row. Do this by selecting the first row with a docURL in it, pressing shift, and then using the arrow keys on your keyboard to select all of the remaining docURL's. Copy this selection. Then, place your cursor in the empty field in the top row and paste in the docURL's you copied. 
    10.  Column H:
      1.  Label: wordcount 
      2.  Content: Add this column to your spreadsheet by titling column H "wordcount." To generate a count of the words in the articlebody, insert the following formula in the cell in the second row: =IF(LEN(TRIM(E2))=0,0,LEN(TRIM(E2))-LEN(SUBSTITUTE(E2," ",""))+1). Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.  

 

5. Use OpenRefine to convert the dates to UTC format and to check the author column for whitespace.

    1. Open the OpenRefine interface:
      1. On a Windows machine: Open the “C:/openrefine-win-2.6-rc2/openrefine-2.6-rc2” folder in your system and run “openrefine.exe” by clicking on the named file twice to open the OpenRefine interface at the address 127.0.0.1:3333 (you can always navigate back to the OpenRefine interface by pointing your browser to this address, and can even use it in multiple windows using it).
      2. On a Mac: Open OpenRefine in Applications.
      3. It will open up a tab in your web browser. 
    2. Once you are in the OpenRefine interface, click “Create Project” and upload the spreadsheet from Outwit Hub Pro that you recently finished editing. Click "Next" and then Create Project" again.
    3. Pubdate: We now need to change the pubdates to UTC format (e.g., 2016-01-01T00:00:00Z).
      1. To change this, mouse over the arrow at the top of the pubdate column. 
      2. Select Edit Cells > Common transforms > To date. This should change all of the dates in the column to UTC format.   
    4. Make sure there is no whitespace in the author column:
      1. Open that column's drop down menu, and select "Edit Cells" > "Common transformations" > "trim leading and trailing whitespace”. 
    5. The metadata should look as follows when you're done:

 

id

publication

pubdate

title

articlebody

author

docUrl

wordcount

nyt-1987-r-1

New York Times

1987-01-01 T00:00:00Z

Title of article

;lkaslkjf;lkjas;lkjdl;kj;l

Smith_John_M

url from Proquest

4500

 

    1. When you have refined the metadata, look for the “Export” drop down menu in the right-hand corner of the page. Select the “Comma Separated Value” option, and your OpenRefine project should be downloaded as an .csv file. Make sure that you do not have any text facet/filtering options selected; if you do, OpenRefine will only export a sheet containing the filtered metadata. Save your file as a csv file with the same filename.
    2. Open up the exported .csv file in Excel and check it for newly created blank columns. If OpenRefine has created new columns, delete them by highlighting the columns, right-clicking them, selecting delete from the generated menu. 

 

6. Upload Data to the WE1S Google Drive 

    1. Sign into the WE1S google drive.
    2. Click on we1s-2 > data_archive > random > random_corpus_a.
    3. There you will see folders for the NY Times, the Washington Post, and the Wall Street Journal.
      1. Select the appropriate folder for your publication.
    4. Create a folder for the year you scraped and title it according to that year, eg., 1987.
    5. Upload the HTML file you downloaded from ProQuest and the .csv file you created using Outwit Hub Pro and OpenRefine (you do not need to upload the .xslx file).

 

7. Fill out Collection Manifest Form and mark the pub/year as completed on the PBworks site.

    1. Go to Form  (submitting the form will send your metadata about the collection run to a Google spreadsheet).
    2. Check your year off on the Developer Task Assignments (Random Corpus Scraping) page.

 

 

Comments (0)

You don't have permission to comment on this page.