
Alternate Collection Workflows

Page history last edited by Alan Liu 8 years, 9 months ago

This page provides alternate workflows for collecting document sources for the WhatEvery1Says project. (Workflows are revised or extended as the developers continue working on the project. See also the primary Collection Workflows page.)

New York Times

Alternate Method 1: Using API and scripts for searching/downloading; & BeautifulSoup as scraper
Alternate Method 2: Using Proquest to search/download articles between 1923 and end of 1980
Alternate Method for Getting "next pages"

Wall Street Journal

Alternate Method 1: Using Proquest, Python Scripts, and Wget

The Guardian

Alternate Method 1: Using OutWit Hub as scraper

New York Times (Alternate Method 1)

Using the NYT API & Python scripts for searching/downloading articles after 1980; & BeautifulSoup as scraper

(See details of access by time range to full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.)

Requirements:

NY Times API Key (request a key)
Python (2.x version) (with Beautiful Soup 4 package)

Scripts for NY Times located in the "pythonscripts" folder on the local workstation (and also on the WE1S Google Drive)

Firefox & DownloadThemAll plugin (or an alternative method of downloading web pages from a list of URLs, such as Wget)

Workflow Steps:

Search using the NYT API and Python scripts: (Scripts are located in the project Mirrormask site folder here) (Older versions of materials are also located in the Google Drive folder Scripts for NY Times)

Read _instructions.txt (included in the NYT scripts folder) for an overview, and download the Python scripts and settings.cfg to your computer.
Adjust the settings.cfg file for the getTimesArticles_fq.py script to insert the paths for your working space, the search terms you want, and the date span you want.
Run the getTimesArticles_fq.py script.
Pull into a spreadsheet program the .TSV (tab separated values) file that the script creates as a summary of the JSON files retrieved for particular articles. (The JSON files contain metadata and abstracts of the articles found in the search; the .TSV file aggregates the metadata from all the JSON files.)
Select the column in the spreadsheet for the URLs of articles found in the search, and copy the URLs into a text file. (Call the file urls.txt)
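The copy-the-URL-column step can also be done without a spreadsheet program. The following is a minimal sketch, not part of the project scripts: it assumes the .TSV file has a header row and that the URL column is named "url" (a hypothetical name; check the actual header in your file). It is written for Python 3 rather than the Python 2.x listed in the requirements.

```python
import csv


def extract_urls(tsv_path, out_path="urls.txt", column="url"):
    """Copy the non-empty values of one column of a TSV file into a text file.

    The column name "url" is an assumption; pass the real header name.
    """
    with open(tsv_path, newline="", encoding="utf-8") as tsv_file:
        reader = csv.DictReader(tsv_file, delimiter="\t")
        urls = [row[column].strip() for row in reader
                if (row.get(column) or "").strip()]
    # One URL per line, which is what download tools such as Wget expect.
    with open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write("\n".join(urls) + "\n")
    return urls
```

Calling extract_urls("results.tsv") (where "results.tsv" stands in for whatever file the search script produced) writes urls.txt in the current folder and returns the list of URLs.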

Download articles:

Use the Firefox DownloadThemAll plugin on the urls.txt file to download all the NYT articles and save them to a local folder.

Scrape plain text of articles:

Use the nyt_scraper_folder.py Python script to extract the plain text from all the NYT articles in your local folder and aggregate it in a file called text_harvest.txt. (The plain text for individual articles is separated in the aggregate file by a string of 10 at signs: "@@@@@@@@@@")
If you wish, use a file splitter utility program (like Chopping List for Windows) to split the text_harvest.txt file into separate plain-text files for each article.
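If no file splitter utility is at hand, the split can be sketched in a few lines of Python. This is an illustration only, not one of the project scripts: the output file names (article_0001.txt and so on) are hypothetical, and the code targets Python 3.

```python
from pathlib import Path

# The delimiter written between articles in the aggregate file.
DELIMITER = "@" * 10


def split_harvest(harvest_path, out_dir, delimiter=DELIMITER):
    """Split an aggregate text file into one file per article.

    Returns the number of article files written. Empty segments
    (for example after a trailing delimiter) are skipped.
    """
    text = Path(harvest_path).read_text(encoding="utf-8")
    articles = [part.strip() for part in text.split(delimiter) if part.strip()]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for number, article in enumerate(articles, start=1):
        # Hypothetical naming scheme: article_0001.txt, article_0002.txt, ...
        target = out / "article_{:04d}.txt".format(number)
        target.write_text(article + "\n", encoding="utf-8")
    return len(articles)
```

For example, split_harvest("text_harvest.txt", "articles") would write the individual articles into a folder called articles.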

Fill out Collection Manifest Form

New York Times (Alternate Method 2)

Using Proquest for articles between 1923 and end of 1980

(See details of access by time range to full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.)

Requirements

Access through institutional subscription to Proquest Historical Newspapers: New York Times
[Under construction]

New York Times (Alternate Method for Getting "Next Pages")

Scraping Phase 2 (Scrape "Next Pages" and other missing pages and data)

Some articles will have "next page(s)" link(s), often just one but sometimes continuing to several sequent pages. These have to be collected and added to the appropriate article bodies in the aggregate-plain-text.docx file. This can be done as follows:

"Next Pages" (Best Method):

If you included "?pagewanted=all" at the end of the URLs in urls.txt (as instructed under scraping phase 1 above), then no action should be needed. All articles that continued to sequent pages should have been scraped by Import.io in their entirety.
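Adding the parameter to every URL by hand is tedious for long lists. A minimal Python 3 sketch of one way to do it follows; the function name is hypothetical, and the code only rewrites the URL string without checking that the site honors the parameter.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_pagewanted(url, value="all"):
    """Return url with its pagewanted query parameter set to value.

    Any existing pagewanted parameter is replaced; other query
    parameters are kept in their original order.
    """
    parts = urlsplit(url)
    query = [(key, val)
             for key, val in parse_qsl(parts.query, keep_blank_values=True)
             if key != "pagewanted"]
    query.append(("pagewanted", str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Applying with_pagewanted to each line of urls.txt before downloading gives a list in which every URL requests the full article.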

"Next Pages" (Alternate Method):

If the "?pagewanted=all" method does not work for some years of the New York Times HTML format, then return to the master spreadsheet you earlier created. There you will see "nextpage" links for all articles that have sequent pages. First copy the text in the "articlebody" for that article and paste it unformatted into a word processor, adding a space or tab at the end. (This temporary word processing file will serve as a staging ground for accumulating all the parts of the article.)
Then in the master spreadsheet, find the column labeled "nextpage" that contains the full URL of the sequent page. (There is a different column labeled "nextpage/_source" that contains only partial URLs.)
Copy the URL for the next page.
Then go back to the Import.io bulk extract page that you left open in your browser. Enter the URL for the sequent page, and run the query. This will produce a new query results page. (Note: observe whether the query results show that there is an additional "nextpage" beyond the current one.)
Export the Import.io query results for the sequent article page as "HTML". Copy the article body. Then paste that text to the end of the document in your temporary word processor page.
If an article goes on to additional sequent pages, go back to the Import.io bulk extract page where you previously inserted the URL of the first sequent page (it will end with "pagewanted=2"). Increment to the next page (e.g., change the "=2" to "=3"). Run the query, export the results to HTML again, and copy the article body to the end of your temporary word processor page.
Repeat until you get all the sequent pages of the article.
Then check that your temporary word processor file has no line returns in it, copy all the text, and paste it in place of the original partial articlebody in the master spreadsheet.
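Two parts of this manual procedure are mechanical: incrementing the "pagewanted" number, and joining the collected page texts without line returns. They can be sketched in Python 3 as below. Both function names are hypothetical. Note that join_article_parts collapses every run of whitespace (including tabs) to a single space, which is slightly stronger than only removing line returns.

```python
import re


def next_page_url(url):
    """Return the URL of the following page by incrementing pagewanted=N."""
    match = re.search(r"([?&]pagewanted=)(\d+)", url)
    if match is None:
        raise ValueError("URL has no numeric pagewanted parameter: " + url)
    next_number = int(match.group(2)) + 1
    # Replace only the digits of the page number, leaving the rest intact.
    return url[:match.start(2)] + str(next_number) + url[match.end(2):]


def join_article_parts(parts):
    """Join page texts into one line, separating the parts with a space."""
    return " ".join(" ".join(part.split()) for part in parts)
```

Starting from the first sequent page's URL, repeated calls to next_page_url give the URLs to enter in the bulk extract page, and join_article_parts produces the single-line text to paste back into the master spreadsheet.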

Wall Street Journal, 1984-present (Alternate Method 1)

(using Proquest, Python scripts, and Wget)

Requirements: (preinstalled on the WE1S workstations)

Access through institutional subscription to Proquest
Python (with Beautiful Soup 4 package)
Scripts for Wall Street Journal located in the "pythonscripts" folder on the local workstation (and also on the WE1S Google Drive)
Wget to download web pages from a list of URLs.

Workflow Steps:

Search Proquest: Wall Street Journal (Eastern Edition):

Use the Advanced Search interface for Proquest: Wall Street Journal (Eastern Edition) to run your search.

With the Proquest Advanced Search form pre-set to the publication number designating the WSJ, use the search fields and Boolean operations in the rest of the form to search for keyword(s). (To search for a literal single word or phrase rather than fuzzy resemblances, put quotes around the term. For example, searching for the word humanities without quotes will also find articles with the word "humanity," but searching for "humanities" (in quotes) restricts the search to the literal string of characters.) (Note: Proquest allows for Boolean and/or concatenations of terms, but it does not allow for proximity searches, e.g., two words within a span of 4 words from each other.)

After running the search, select all article titles (or whichever ones you want) in the results list. Then use the "Export/Save" function to export to "XLS (works with Microsoft Excel)." This will open a spreadsheet summarizing the results of the search.

Download articles:

From the spreadsheet of the Proquest search results (see above), select the whole column named "DocumentURL" (you have to scroll far to the right to see that column). Copy the URLs and paste them into a text file called urls.txt in your local working folder.
Use Wget to download the articles:

If you don't already have it on your computer, download the Wget command-line program commonly used to download files from the web. (Save the executable wget file in some folder that is in the PATH definitions on your computer, so you can invoke the program from whatever working folder you happen to be in. Alternatively, invoke the program with an explicit path.)
Open a command line terminal or bash window on your computer. Then run the following two commands (where you configure the first one for the path of the local folder where you want to save the results).

cd C:/workspace/wsj_downloads/
wget --no-check-certificate -i C:/workspace/urls.txt
The result is a folder called wsj_downloads with the Web pages you found in your search saved as local files. (Important note: these files will be saved without a ".html" extension.)
Rename each of the downloaded files in your results folder so that they have the extension ".html" (There are utility programs for Mac, Windows, etc. for mass renaming of files--e.g., Advanced Renamer for Windows)
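As an alternative to a mass-renaming utility, the extension can be added with a short Python 3 script. This is a sketch under simple assumptions: every file in the folder that does not already end in ".html" is a downloaded article, and a file is skipped if the renamed target already exists. The function name is hypothetical.

```python
from pathlib import Path


def add_html_extension(folder):
    """Append ".html" to every file in folder that lacks that extension.

    Returns the new names of the files that were renamed.
    """
    renamed = []
    # sorted() lists the folder once, before any renaming happens.
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix.lower() == ".html":
            continue
        target = path.with_name(path.name + ".html")
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

For the folder used in the commands above, the call would be add_html_extension("C:/workspace/wsj_downloads").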

Scrape plain text of articles:

Use the wjs_scraper_folder.py Python script (from the Google Drive folder: Scripts for Wall Street Journal) to extract the plain text from all the articles and aggregate it in a file called wsj_text_harvest.txt. (The plain text for individual articles is separated in the aggregate file by the string "@@@@@@@@@@")
If you wish, use a file splitter utility program (like Chopping List for Windows) to split the wsj_text_harvest.txt file into separate plain-text files for each article.

Fill out Collection Manifest Form

The Guardian (Alternate Method 1 - Using OutWit Hub as scraper)

Requirements:

Guardian API Key (register on developer's site and request a key for the "articles" API)
OutWit Hub ($89 Pro version needed for unlimited calls; chart of difference between free Light and paid Pro versions)

Guardian Scraper created by Alan for OutWit Hub (download from the folder for outwit_hub_scrapers in the WE1S shared Google Drive; or from our Mirrormask site here)

[Note: OutWit Hub can also be used as a Firefox add-on for Windows and Firefox add-on for Mac, though Alan has not tried them. The add-ons are supposed to have the functionality of the free OutWit Hub "light" version.]

Workflow Steps:

Get a Guardian API Key if you don't have one (request developer key).
Search The Guardian using the Guardian's API Console from within the browser in OutWit Hub:

Open OutWit Hub.
In the web browser built into OutWit Hub, go to the URL of the Guardian's Open Platform API Console (beta version, not the old version): http://open-platform.theguardian.com/explore/. (Make sure that "Page" is selected in the OutWit Hub sidebar at the left, in order to show the web page you are opening.)
When the Guardian's API console search form loads, check the box in that form for "Show All Filters." Then fill in the following fields in the form:

search term
order-by (choose "oldest")
page-size (set at 199 to show the maximum number of hits per search-results page)
from-date & to-date (in the format, e.g., "2014-01-31")
api-key (your Guardian api key)

At the bottom of the Guardian API Console web page, you'll see a live view of the JSON returned by the search, including the URLs for the pages found. The maximum number of hits on a single search-results page is 199.
For multiple search-results pages:

The JSON search results will start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., currentPage: 2, pages: 2)
After harvesting the results of one page (through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results.
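The same search fields and paging logic can be expressed outside the API Console. The Python 3 sketch below only builds a query URL and reads the paging metadata from a JSON reply; it makes no network call. The field names order-by, page-size, from-date, to-date, page, api-key, currentPage, and pages come from the steps above. The endpoint URL, the "q" parameter for the search term, and the outer "response" wrapper in the JSON are assumptions that should be checked against the Guardian's current developer documentation.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; not stated on this page.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_search_url(term, from_date, to_date, api_key, page=1, page_size=199):
    """Build a search URL using the same fields as the API Console form."""
    params = {
        "q": term,                 # assumed name of the search-term field
        "order-by": "oldest",
        "page-size": page_size,
        "from-date": from_date,    # format: 2014-01-31
        "to-date": to_date,
        "page": page,
        "api-key": api_key,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def next_page_number(response_text):
    """Return the next search-results page number, or None on the last page."""
    # Assumes the metadata sits under a top-level "response" key.
    meta = json.loads(response_text)["response"]
    if meta["currentPage"] < meta["pages"]:
        return meta["currentPage"] + 1
    return None
```

After harvesting one page of results, passing its JSON to next_page_number tells you which value to enter in the "page" field next, or that there are no more pages.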

Scrape and Export the Guardian articles using OutWit Hub:

In the OutWit Hub sidebar, choose "Links".
You'll see a table of all the URLs found on the page, including the ones in the JSON for the search results.
Select all the rows that include URLs for articles (as opposed to those to the Guardian search form).
Right-click on the selection, and choose "Auto-Explore Selected Links -> Fast Scrape", and then the Guardian Scraper. (Note: do not choose "Fast Scrape (Include Selected Data)," since that will include all the columns in the table of data instead of the ones for Date, Headline, Author, and ArticleBody that the Guardian Scraper is designed to collect.)
OutWit Hub will now proceed to scrape all the Guardian pages in the list of URLs. In doing so, it will automatically switch to the "Scraped" view (as indicated in the sidebar) and populate a table with the scraped data for Date, Headline, Author, and ArticleBody.
When OutWit Hub has completed scraping, select all the scraped table, right-click, and "Export Selection As ..." (export formats include: Excel, JSON, XML, HTML, CSV, TXT, SQL)
(Repeat as needed for multiple search-results pages of URLs found through the Guardian API Console, as described in step 2.iv above.)

Chop the scraped articles as individual articles:

If you exported the scraped articles from OutWit Hub as a txt file, then:

you can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks)
or you can use Scott's cut.py Python script for the purpose (download from our Mirrormask server)
