Workflow Steps: (edited on 9/7/2015)
1. Search Academic Search Complete:
2. Scraping (Phase 1): Scrape articles using Outwit Hub Pro:
i. Open Outwit Hub Pro (The pro version is a paid product available on one of the Transcriptions workstations).
ii. Put the link (starting with file:///) that you got from EBSCOhost in the top built-in browser of Outwit Hub and "return" to load the page.
iii. Click on "scrapers" on the left-hand side of Outwit Hub and click "EBSCOhost scraper"
iv. Click on "Execute" on the right-hand side of Outwit Hub Pro. You should then see your data in columns that you can download into an Excel spreadsheet.
v. Export the results from Outwit Hub Pro as an Excel spreadsheet. (Note: For whatever reason, the "date" and "articlebody" columns are each off by one row in the export. Delete the top cell of each of these columns in your Excel spreadsheet to make the rows align.)
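The one-row offset can also be corrected programmatically before the data goes into a spreadsheet. Below is a minimal sketch in plain Python, assuming each exported row is represented as a dict keyed by the scraper's column names ("title", "date", "articlebody"); the function name and sample values are hypothetical, not part of the workflow's tooling:

```python
def realign(rows, shifted_cols=("date", "articlebody")):
    """Move the values of the misaligned columns up one row and drop
    the now-empty last row, so every row describes a single article."""
    fixed = []
    for i, row in enumerate(rows[:-1]):
        new_row = dict(row)
        for col in shifted_cols:
            new_row[col] = rows[i + 1][col]  # pull the value up from the row below
        fixed.append(new_row)
    return fixed

# Tiny mock export: titles line up, but dates and bodies lag one row behind.
rows = [
    {"title": "A", "date": "",           "articlebody": ""},
    {"title": "B", "date": "2015-01-05", "articlebody": "Body of A"},
    {"title": "",  "date": "2015-02-10", "articlebody": "Body of B"},
]
print(realign(rows))
```

This does in one pass what deleting the top cell of each misaligned column does by hand.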
3. Creation of Master Spreadsheet for Scrape:
i. Copy the entire Excel spreadsheet exported from Outwit Hub Pro above.
ii. Go to the WE1S Google Drive and in the appropriate working data subfolder (new_yorker > working_data > [year] > [query term] > aggregate_working_data) create a Google spreadsheet named “new_yorker-[year]-h-master” (for “humanities” queries) or “new_yorker-[year]-la-master” (for “liberal arts” queries). Paste into the Google spreadsheet the content from Excel. The Google spreadsheet will be the master spreadsheet for the scrape (referred to in the instructions below as the “master spreadsheet”).
(Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet. For some reason this works.)
Organize and add columns in Master Spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:
a. Column A (add this column last, when you have finished all other work on the spreadsheet) -- Label: "Plain Text file #" Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
b. Column B -- the "date" column [Special note: For some reason, all the dates in the export of search results from Proquest will be off by one row. Move the entire column of dates in the spreadsheet up one row.]
c. Column C -- the "title" column
d. Column D -- the "articlebody" column
e. Column E -- the "author" column
f. Column F -- Add a column here with the label "# words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have already added Column A above. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
g. Column G -- the "document Url" column [Special note: For some reason, all the document URLs in the export of search results will be off by one row. Move the entire column of document URLs in the spreadsheet up one row.]
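For reference, the "# words" formula in Column F can be mirrored in Python; this sketch (the function name is hypothetical) approximates =COUNTA(SPLIT(D2, " ")) by splitting on whitespace and counting the non-empty pieces:

```python
def word_count(article_body: str) -> int:
    """Approximate =COUNTA(SPLIT(D2, " ")): split the article body on
    whitespace and count the non-empty tokens (runs of spaces are ignored)."""
    return len(article_body.split())

print(word_count("The  humanities  matter."))  # → 3
```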
4. Scrape plain text of articles:
i. In the master spreadsheet, select just the following columns: date, title, articlebody, and copy them (Ctrl-C). (As described previously, these columns should have been arranged adjacent to each other in this order.)
ii. Open a Word document that you will name "aggregate-plain-text.docx," and paste in the contents of the above columns. (Be sure to paste unformatted.) This will create a file with all the articles (the date and title preceding each article body). Individual articles are separated from each other by a return (the only use of returns in the file). There will be excess returns and pages at the bottom of the file that need to be deleted.
iii. Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, two returns, and ten "@" signs ("   ^13^13@@@@@@@@@@"). This creates an easy-to-recognize and easy-to-manipulate delimiter between individual articles in the aggregate file. (One exception: remove the delimiter after the last article at the end of the file to prevent creating a blank individual plain text file later.)
iv. Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows:
1. When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
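Steps ii–iv can be sketched in Python as well. This is an illustration, not the workflow's actual tooling: each string in the list stands in for one article's pasted date/title/articlebody row, the delimiter matches the find-and-replace string above (three spaces, two returns, ten "@" signs), no delimiter follows the last article, and the output is saved as UTF-8:

```python
# The replacement string from the Word find-and-replace step:
# three spaces, two returns, ten "@" signs.
DELIMITER = "   \n\n@@@@@@@@@@"

def build_aggregate(articles, path="aggregate-plain-text.txt"):
    """Join the articles with the delimiter (none after the final article)
    and save the result as UTF-8 plain text, as Word's "Unicode UTF8"
    export option does."""
    text = DELIMITER.join(articles)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

build_aggregate(["2015-01-05 Title A Body A", "2015-02-10 Title B Body B"])
```

Joining (rather than appending the delimiter to every article) is what prevents the trailing blank file that step iii's "one exception" warns about.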
5. Chop Into Individual Plain Text Files
i. Upload Data and Working Data to the WE1S Google Drive
ii. Use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles. For "number of places to pad filenames with zeros," leave the field blank (not "0"). Set the delimiter to the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need a delimiter between articles that is a carriage return, enter that delimiter in Chopping List as "$r" (no quote marks).)
iii. Alternatively, you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive).
Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the formula "=A2+1" in the next cell down; then drag the handle at the lower right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of the individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.).
(Also upload the individual_plain_text folder, containing the chopped, individual plain text files. These are also copied into the relevant "data" branch of the site as final data.)
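The chopping step itself takes only a few lines of Python. This is a sketch of the same idea as Chopping List or cut.py, not the actual cut.py script, and the function and folder names are hypothetical: split the aggregate file on the ten-"@" delimiter and write File_1.txt, File_2.txt, and so on, matching the numbers in the master spreadsheet's "Plain Text file #" column:

```python
import os

def chop(aggregate_path, out_dir="individual_plain_text", delimiter="@@@@@@@@@@"):
    """Split the aggregate plain-text file on the delimiter and write one
    File_<n>.txt per article (numbered from 1, with no zero-padding)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(aggregate_path, encoding="utf-8") as f:
        articles = f.read().split(delimiter)
    for n, article in enumerate(articles, start=1):
        name = os.path.join(out_dir, "File_%d.txt" % n)
        with open(name, "w", encoding="utf-8") as out:
            # strip the leftover spaces/returns that surrounded the delimiter
            out.write(article.strip() + "\n")
    return len(articles)
```

Because the aggregate file has no delimiter after its last article, the split yields exactly one piece per article and no trailing blank file.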
6. Fill out Collection Manifest Form