New Yorker Collection Workflow


 

Workflow Steps: (edited on 9/7/2015)

 

1.  Search Academic Search Complete: 

 

    1. Use the Advanced Search interface for Academic Search Complete (it runs your search through EBSCOhost, but I find it easier to start from Academic Search Complete).
    2. With Academic Search Complete open, set one search field to "SO Journal Name" and enter New Yorker, and set the "All Text" field to the query term "humanities." Then make sure to click "Full Text" in the left-hand sidebar, and enter the time frame you are looking for in the "date" field: January 1-December 31 of the year in question.
    3. After running the search, select all article titles in the results list on the left-hand sidebar.
    4. Download articles. Click on the "folders" beside each of the articles that appear in your results so that they turn from blue to yellow. The right-hand sidebar will say "Folder has items" and list those items. Click on "Folder View" once your right-hand folder has all the items you want.
    5. Once in "Folder View," click "Save as File" in the right-hand column. This saves your articles--including the full text--as a webpage containing each article's full text along with the title, author, source, word count, etc. Save the page as an HTML file and change the name to reflect the materials being collected. For example, the HTML file with the results of the 2011 "humanities" search of the New Yorker would be "EBSCOHostDocuments-ny-2011-h-html".

 

2. Scraping (Phase 1): Scrape articles using Outwit Hub Pro:

 

               i.  Open Outwit Hub Pro (The pro version is a paid product available on one of the Transcriptions workstations).

               ii. Enter the link (starting with file:///) to the HTML file you saved from the EBSCOhost search into the address bar of Outwit Hub's built-in browser at the top, and press Return to load the page.

               iii. Click on "scrapers" on the left-hand side of Outwit Hub and click the "EBSCOhost scraper."

               iv. Click on "Execute" on the right-hand side of Outwit Hub Pro. You should then see your data in columns that you can download into an Excel spreadsheet.

               v. Export the results from Outwit Hub Pro as an Excel spreadsheet. (Note: For whatever reason, the date and articlebody columns are one row off, so delete the top cell of each of these columns in your Excel spreadsheet to make the rows align.)
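
Note (optional): If you prefer to fix the off-by-one alignment with a script rather than by hand in Excel, a minimal Python/pandas sketch along the following lines could do it. The file names and the column headers ("date", "title", "articlebody") are assumptions here; adjust them to match your actual export.

    import pandas as pd

    # Load the spreadsheet exported from Outwit Hub Pro (file name is an assumption).
    df = pd.read_excel("outwit_export.xlsx")

    # The "date" and "articlebody" columns sit one row lower than the others;
    # shift them up one row so every row describes a single article.
    for col in ["date", "articlebody"]:
        df[col] = df[col].shift(-1)

    # Drop the now-empty trailing row and save an aligned copy.
    df = df.dropna(subset=["title", "articlebody"], how="all")
    df.to_excel("outwit_export_aligned.xlsx", index=False)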

 

 

3.  Creation of Master Spreadsheet for Scrape:

 

          i. Copy the entire contents of the Excel spreadsheet exported from Outwit Hub Pro above.

          ii. Go to the WE1S Google Drive and, in the appropriate working data subfolder (new_yorker > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named “new_yorker-[year]-h-master” (for “humanities” queries) or “new_yorker-[year]-la-master” (for “liberal arts” queries). Paste the content from Excel into the Google spreadsheet. The Google spreadsheet will be the master spreadsheet for the scrape (referred to in the instructions below as the “master spreadsheet”).

 

 (Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet.  For some reason this works.)

 

Organize and add columns in the master spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:

               a. Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #" Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that corresponds to the file sequence of the final individual plain text files output by the scrape.

               b. Column B -- the "date" column [Special note: For some reason, all the dates in the scraped export will be off by one row. If they are, move the entire column of dates in the spreadsheet up one row.]

               c. Column C -- the "title" column

               d. Column D -- the "articlebody" column

               e. Column E -- the "author" column

               f. Column F -- Add a column here with the label "# words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
                  Adjust "D2" to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have added Column A above yet. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
               g. Column G -- the "document Url" column [Special note: For some reason, all the document URLs in the scraped export will be off by one row. If they are, move the entire column of document URLs in the spreadsheet up one row.]
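
Note (optional): If you would rather script these column operations than build them by hand, the following pandas sketch produces the same A-G layout (Plain Text file #, date, title, articlebody, author, # words, document Url). The file names are hypothetical and the column headers are assumed to match the Outwit Hub export; this is an illustration, not a required part of the workflow.

    import pandas as pd

    # Start from the aligned Outwit Hub export (file name is an assumption).
    df = pd.read_excel("outwit_export_aligned.xlsx")

    # Columns B-E and G: date, title, articlebody, author, document Url.
    df = df[["date", "title", "articlebody", "author", "document Url"]]

    # Column F "# words": whitespace word count of the article body
    # (the script equivalent of =COUNTA(SPLIT(D2, " "))).
    df.insert(4, "# words", df["articlebody"].fillna("").str.split().str.len())

    # Column A "Plain Text file #": sequential numbers matching the order of the
    # individual plain text files produced later (File_1.txt, File_2.txt, ...).
    df.insert(0, "Plain Text file #", list(range(1, len(df) + 1)))

    df.to_excel("new_yorker-master.xlsx", index=False)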

 

4.  Scrape plain text of articles:

i.   In the master spreadsheet, select just the following columns: date, title, and articlebody, and copy them (Ctrl-C). (As described previously, these columns should have been arranged adjacent to each other in this order.)


ii.   Open a Word document that you will name "aggregate-plain-text.docx" and paste in the contents of the above columns. (Be sure to use paste - unformatted.) This will create a file with all the articles (with the date and title preceding each article body). Individual articles are separated from each other by a return (the only use of returns in the file). There will be excess returns and pages at the bottom of the file that need to be deleted.


iii. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, two returns, and ten "@" signs ("   ^13^13@@@@@@@@@@"). This creates an easy-to-recognize and -manipulate delimiter between individual articles in the aggregate file. (One exception: remove the delimiter after the last article at the end of the file so as to prevent creating a blank individual plain text file later.)
iv. Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows:
           1. When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
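
Note (optional): The same result -- the date, title, and articlebody columns concatenated with the ten-"@" delimiter and saved as UTF-8 plain text with no delimiter after the last article -- can also be produced with a short script instead of the Word copy/paste-and-replace steps. This is only a sketch under the column names assumed above; the tab separators within each line are an assumption standing in for the pasted spreadsheet cells.

    import pandas as pd

    df = pd.read_excel("new_yorker-master.xlsx")

    # Delimiter equivalent to the Word replacement string "   ^13^13@@@@@@@@@@":
    # three spaces, two returns, then ten "@" signs.
    DELIMITER = "   \n\n@@@@@@@@@@"

    # One line per article: date, title, and article body.
    records = [
        f'{row["date"]}\t{row["title"]}\t{row["articlebody"]}'
        for _, row in df.iterrows()
    ]

    # Joining (rather than appending) the delimiter avoids a trailing delimiter
    # after the last article, so no blank file is created when chopping.
    with open("aggregate-plain-text.txt", "w", encoding="utf-8") as f:
        f.write(DELIMITER.join(records))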

 

 5. Chop Into Individual Plain Text Files

i. Upload the data and working data to the WE1S Google Drive, following the WE1S Google Drive organization described below.


ii. You can use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles. For "number of places to pad filenames with zeros," leave the field blank (not "0"). Set the delimiter as the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need to use a carriage return as the delimiter between articles, enter that delimiter in Chopping List as $r, with no quote marks.)

iii. Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation, and also on the WE1S Google Drive).
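
Note (optional): For reference, here is a minimal sketch of what such a chopping step looks like in Python. This is an illustration only, not Scott's actual cut.py: it splits the aggregate file on the ten-"@" delimiter and writes one numbered plain text file per article.

    import os

    DELIMITER = "@@@@@@@@@@"
    OUTPUT_DIR = "individual_plain_text"  # folder name is an assumption; see the Drive organization below

    os.makedirs(OUTPUT_DIR, exist_ok=True)

    with open("aggregate-plain-text.txt", "r", encoding="utf-8") as f:
        articles = f.read().split(DELIMITER)

    # File numbers match the "Plain Text file #" column in the master spreadsheet.
    for i, article in enumerate(articles, start=1):
        path = os.path.join(OUTPUT_DIR, f"File_{i}.txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write(article.strip())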

 

  1. "Data" consists of individual plain text files for a publication organized by folders for each year. For example, corpus > New_Yorker > data > 2001 > plain_text > humanities contains all the scraped individual plain text files collected for 1984 that contain the word "humanities."
  2. "Working_data" sits in a parallel branch under each publication, also organized by folders for each year.  In the case of New Yorker, the working data consists of subfolders under each year for:
    1. aggregate_working_data folder (containing the local HTML file exported from the EBSCOhost search, the Outwit Hub scraped export XLS spreadsheet, the master spreadsheet, aggregate-plain-text.docx, and aggregate-plain-text.txt)

Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the formula "=A2+1" in the next cell down; then drag the handle at the lower right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of the individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.).

      ii. individual_plain_text folder (containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data)

 

 

6. Fill out Collection Manifest Form

  1. Go to the Form (submitting the form will send your metadata about the collection run to a Google spreadsheet).