Collection Workflows
This page provides instructions for collecting document sources for the WhatEvery1Says project. (Collection workflows are revised or extended as the developers continue working on the project. See also Alternate Workflows.)
New York Times
Using the NY Times API for searching and Import.io as scraper (last revised July 2, 2015)
See details of access by time range to full-text, partial-text, and PDF versions of the articles discoverable through the NYT API.
Requirements: (preinstalled on the WE1S workstations)
Workflow Steps:
- Search using the NYT API and getTimesArticles_fq.py Python script:
- Adjust the settings.cfg file for the getTimesArticles_fq.py script to insert the paths for your working space, the search terms you want, and the date span you want. To search for a phrase, enter the phrase in the format "liberal+arts". (Note that the settings.cfg file must be located in a "config" subfolder within the folder holding the getTimesArticles_fq.py script.)
- Run the getTimesArticles_fq.py script. (Double-click on the file, which will pull it into a Python editor in the Enthought Canopy IDE [integrated development environment] for Python. Then, from the menu at the top of the IDE, choose Run > Run file.) This starts the script, which uses the NY Times API to search systematically for articles in a date range for a query term, pulls in the JSON output from the API for each found article, and harvests metadata from the JSON files into a cumulative TSV (tab separated values) file. Depending on the date range specified for the search, this can take a long time. When the script completes, you will see the command prompt again in the Python environment.
- Error correction: Occasionally, the Python script will terminate prematurely when it encounters some kind of error while querying and getting information through the NY Times API. In that case, locate the last JSON file retrieved (whose file name shows the date, e.g., "20011014"), reset the settings.cfg file to start at that date, and restart the Python script. (The .TSV file that the Python script writes as it harvests data from the JSON files is rewritten cumulatively after each JSON file, so after a fresh start the Python script simply keeps adding to the bottom of the .TSV file.)
- Note: Keep and upload the .TSV file and the JSON files as "working data" from the collection run to the WE1S Google Drive. See example of path/folders for where to put this working data material on the Google Drive.
- Pull into a spreadsheet program the .TSV (tab separated values) file that the script creates as a summary of the JSON files retrieved for particular articles. (The JSON files contain metadata and abstracts of the articles found in the search; the .TSV file aggregates the metadata from all the JSON files.)
- Select the column in the spreadsheet for the URLs of articles found in the search, and copy the URLs into a text file. (Call the file urls.txt)
- Using a text editor or word processor, add the following to the end of all URLs: "?pagewanted=all"
- For any URLs that end with a trailing forward slash before the question mark (i.e., "/?pagewanted=all"), remove that final forward slash. (A scripted alternative for these two URL edits is sketched below.)
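A minimal Python sketch of these two URL edits, for reference (it assumes urls.txt holds one article URL per line; it is not part of the WE1S scripts):

# Append "?pagewanted=all" to each URL in urls.txt, dropping any trailing slash first.
with open("urls.txt", "r", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]
fixed = [url.rstrip("/") + "?pagewanted=all" for url in urls]
with open("urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(fixed) + "\n")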
- Scrape articles using Import.io:
- Scraping Phase 1 (Initial Scrape):
- Open a browser to https://import.io/data/mine/ (or use the Import.io app). Logging in as the WE1S user in Import.io will reveal the extractors we have pre-built for various publications, including for the New York Times at each stage in the evolution of the Times's HTML format, beginning in 1981 when fully digital text became available. Or click on the links below to go directly to the extractors. (The year 2005 has no extractor and must be scraped using the Beautiful Soup Python script in the alternate collection workflows):
- When you have the correct extractor selected, click the Import.io tab for "Bulk Extract".
- In the "Enter URLs to extract from" box, copy-and-paste the whole list of URLs you collected in urls.txt (from the steps above). (The URLs should be listed one per line.)
- Then click on "Run queries". (Watch the counter at the top left to see that all the queries complete correctly. If some Import.io fails to get some pages, it will give a tally of failed pages. Rerun the queries if there are failed pages.)
- When extraction of data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet.") Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
- Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder (new_york_times > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "nytimes-[year]-h-master" (for "humanities" queries) or "nytimes-[year]-la-master" (for "liberal arts" queries), e.g., "nytimes-2007-h-master" or "nytimes-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
(Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet. For some reason this works.)
- Organize and add columns in Master Spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:
- Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
- Column B -- the "date" column
- Column C -- the "title" column
- Column D -- the "articlebody" column
- Column E -- the "author" column
- Column F -- Add a column here with the label "3 words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have yet added Column A above. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
- Column G -- the "pageUrl" column
- Scraping Phase 2 (Scrape "Next Pages" and other missing pages and data)
- Some articles will have "next page(s)" link(s), often just one but sometimes continuing to several sequent pages. These have to be collected and added to the appropriate article bodies in the aggregate-plain-text.docx file. This can be done as follows:
- "Next Pages" (Best Method):
- If you included "?pagewanted=all" at the end of the URLs in urls.txt (as instructed under Scraping Phase 1 above), then no action should be needed. All articles that continued to sequent pages should have been scraped by Import.io in their entirety.
- "Next Pages" (Alternate Method):
- If the "?pagewanted="all" method does not work for some years of the New York Times HTML format, then return to the master spreadsheet you earlier created. There you will see "nextpage" links for all articles that have sequent pages. First copy the text in the "articlebody" for that article and paste it unformatted into a word processor, adding a space or tab at the end. (This temporary word processing file will serve as a staging ground for accumulating all the parts of the article.)
- Then in the master spreadsheet, find the column labeled "nextpage" that contains the full URL of the sequent page. (There is a different column labeled "nextpage/_source" that contains only partial URLs.)
- Copy the URL for the next page.
- Then go back to the Import.io bulk extract page that you left open in your browser. Enter the URL for the sequent page, and run the query. This will produce a new query results page. (Note: observe whether the query results show that there is an additional "nextpage" beyond the current one.)
- Export the Import.io query results for the sequent article page as "HTML". Copy the article body. Then paste that text to the end of the document in your temporary word processor page.
- If an article goes on to additional sequent pages, go back to the Import.io bulk extract page where you previously inserted the URL of the first sequent page (it will end with "pagewanted=2"). Increment to the next page (e.g., change the "=2" to "=3"). Run the query, export the results to HTML again, and copy the article body to the end of your temporary word processor page.
- Repeat until you get all the sequent pages of the article.
- Then check that your temporary word processor file has no line returns in it, copy all the text, and paste it in place of the original partial articlebody in the master spreadsheet.
- Other missing pages and data: Some years of the New York Times (e.g., 2007) provide us through the API with URLs to pages that are unscrapable or incompletely scrapable through automated means.
- NY Times blogs (such as the "Opinionator" series of blogs) are wholly unscrapable. (They will appear as blank rows in the master spreadsheet.) You'll need to get the results manually and put them in the appropriate cells in the spreadsheet. Be sure to eliminate all line returns in the article body.
- There is also a class of URLs that are dynamic queries in the format, e.g., "http://query.nytimes.com/gst/fullpage.html?res=940DE4DB1E30F932A15752C0A9619C8B63". These show up in the master spreadsheet with article bodies but missing dates, etc. If possible, go manually to the pages to find the missing data.
- Duplicate articles: Occasionally, there are duplicate articles that need to be eliminated.
- Scraping Phase 3 (Outputting all the results to an "aggregate-plain-text.txt" file)
- In the master spreadsheet, arrange the columns so that they begin in this order: date, title, articlebody. Then select just these columns and copy them.
- Open a Word document that you will name "aggregate-plain-text.docx" and paste in the contents of the above columns. (Be sure to paste unformatted.) This will create a file with all the articles (date and title preceding each article body). Individual articles are separated from each other by a return (the only use of returns in the file). There will be excess returns and pages at the bottom of the file that need to be deleted.
- Using Word's find-and-replace function, replace all returns (found by searching for "^13") with three spaces, followed by two returns, followed by ten "@" signs ("   ^13^13@@@@@@@@@@"). This creates an easy-to-recognize and -manipulate delimiter between individual articles in the aggregate file. (One exception: remove the delimiter after the last article at the end of the file so as to prevent creating a blank individual plain text file later.)
- Finally, save or export the aggregate-plain-text.docx Word file as a .txt file (aggregate-plain-text.txt) as follows:
- When Word shows the dialogue for conversion to plain text, choose "other encoding" > "Unicode UTF8" (i.e., do not choose "Windows default").
- Chop Into Individual Plain Text Files
- You can use Chopping List (or a similar utility) to chop the aggregate-plain-text.txt file into text files for individual articles. For "number of places to pad filenames with zeros," leave the field blank (not "0"). Set the delimiter as the ten "@" signs (@@@@@@@@@@) you previously added to the aggregate-plain-text.txt file. (If instead you ever need to use a carriage return as the delimiter between articles, enter that delimiter in Chopping List as "$r", no quote marks.)
- Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive). A minimal scripted equivalent of the chopping step is sketched after this list.
- Remember to delete the dummy File_1.txt that you earlier created before uploading the final data to Google Drive.
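For reference, a minimal Python sketch of the chopping step (an illustrative stand-in, not Scott's cut.py itself): it splits aggregate-plain-text.txt on the ten-"@" delimiter and writes File_1.txt, File_2.txt, and so on.

# Split the aggregate file into individual plain text files, one per article.
with open("aggregate-plain-text.txt", "r", encoding="utf-8") as f:
    articles = [a.strip() for a in f.read().split("@@@@@@@@@@") if a.strip()]
for i, article in enumerate(articles, start=1):
    with open("File_{}.txt".format(i), "w", encoding="utf-8") as out:
        out.write(article + "\n")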
- Upload Data and Working Data to the WE1S Google Drive (according to the path/folder structure indicated in the screenshot below)
- "Data" consists of individual plain text files for a publication organized by folders for each year. For example, corpus > new_york_times > data > 1981 > plain_text > humanities contains all the scraped individual plain text files collected for 1981 that contain the word "humanities."
- "Working_data" sits in a parallel branch under each publication, also organized by folders for each year. In the case of the New York Times, the working data consists of subfolders under each year for:
- aggregate_working_data folder (containing the .TSV spreadsheet; urls.txt; the master spreadsheet; aggregate-plain-text.docx; aggregate-plain-text.txt)
- Note: Before uploading the master spreadsheet, remember to create an extra column at the left (which will be column A), label the column "Plain Text File #" in cell A1, and populate the column with sequential file numbers. (Enter "1" in cell A2; enter the following formula in the next cell down: "=A2+1"; then drag the handle at lower-right of the cell down the column to automatically populate the rest of the column.) The purpose of this is to create file numbers in the spreadsheet that match the file names of individual plain text files created by Chopping List (File_1.txt, File_2.txt, etc.)
- individual_plain_text folder (containing the chopped, individual plain text files; these are also copied into the relevant "data" branch of the site as final data)
- JSON folder (containing JSON files)
- Fill out Collection Manifest Form
- Go to Form (submitting the form will send your metadata about the collection run to a Google spreadsheet)
Wall Street Journal, 1984-present
(using Proquest and Python scripts)
Requirements: (preinstalled on the WE1S workstations)
- Access through institutional subscription to Proquest
- Python (with Beautiful Soup 4 package)
- Scripts for Wall Street Journal located in the "pythonscripts" folder on the local workstation (and also on the WE1S Google Drive)
- Wget, to download web pages from a list of URLs.
Workflow Steps:
- Search Proquest: Wall Street Journal (Eastern Edition):
- Use the Advanced Search interface for Proquest: Wall Street Journal (Eastern Edition) to run your search.
- With the Proquest Advanced Search form pre-set to the publication number designating the WSJ, use the search fields and Boolean operators in the rest of the form to search for keyword(s). (To search for a literal single word or phrase rather than fuzzy resemblances, put quotes around the term. For example, searching for the word humanities without quotes will also find articles with the word "humanity," but searching for "humanities" in quotes restricts the search to the literal string of characters.) (Note: Proquest allows for Boolean and/or concatenations of terms, but it does not allow for proximity searches--e.g., two words within a span of 4 words of each other.)
- After running the search, select all article titles (or whichever ones you want) in the results list. Then use the "Export/Save" function to export to "XLS (works with Microsoft Excel)." This will open a spreadsheet summarizing the results of the search.
- Download articles:
- From the spreadsheet of the Proquest search results (see above), select the whole column named "DocumentURL" (you have to scroll far to the right to see that column). Copy the URLs and paste them into a text file called urls.txt in your local working folder.
- Use Wget to download the articles:
- If you don't already have it on your computer, download the Wget command-line program commonly used to download files from the web. (Save the executable wget file in some folder that is in the PATH definitions on your computer, so you can invoke the program from whatever working folder you happen to be in; alternatively, invoke the program with an explicit path.)
- Open a command line terminal or bash window on your computer. Then run the following two commands (where you configure the first one for the path of the local folder where you want to save the results).
cd C:/workspace/wsj_downloads/
wget --no-check-certificate -i C:/workspace/urls.txt
- The result is a folder called wsj_downloads with the Web pages you found in your search saved as local files. (Important note: these files will be saved without a ".html" extension.)
- Rename each of the downloaded files in your results folder so that they have the extension ".html". (There are utility programs for Mac, Windows, etc., for mass renaming of files--e.g., Advanced Renamer for Windows.)
- Scrape plain text of articles:
- Use the wjs_scraper_folder.py Python script (from the Google Drive folder of scripts for the Wall Street Journal) to extract the plain text from all the articles and aggregate it in a file called wsj_text_harvest.txt. (The plain text for individual articles is separated in the aggregate file by the string "@@@@@@@@@@".) A minimal sketch of this harvesting step appears after this list.
- If you wish, use a file splitter utility program (like Chopping List for Windows) to split the wsj_text_harvest.txt file into separate plain-text files for each article.
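For reference, a minimal Beautiful Soup sketch of the harvesting step (an illustrative stand-in for the project's scraper script, not the script itself): it grabs all visible page text from each downloaded .html file rather than isolating the article body the way the real script presumably does, and joins the articles with the ten-"@" delimiter.

import glob
from bs4 import BeautifulSoup

articles = []
for path in sorted(glob.glob("wsj_downloads/*.html")):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Collapse whitespace so each article becomes a single line of plain text.
    articles.append(" ".join(soup.get_text(separator=" ").split()))

with open("wsj_text_harvest.txt", "w", encoding="utf-8") as f:
    f.write("\n@@@@@@@@@@\n".join(articles))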
- Fill out Collection Manifest Form
USA Today
Using the USA Today "Articles" API and Python scripts for searching/downloading articles
See details of access by time range to full-text, partial-text, and PDF versions of the articles discoverable through the USA Today API.
Requirements: (preinstalled on the WE1S workstations)
Workflow Steps:
- Search using the USA Today API and Python scripts:
- Use your browser to:
- send a request string to the USA Today articles API (entering the string as the URL, adjusting the search term and dates you want in the URL request string);
- convert the JSON metadata that the API returns as results into a CSV file using the json2csv.py script (a minimal sketch of this conversion appears after this list);
- copy the URLs for articles into a urls.txt file.
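For reference, a minimal Python sketch of the JSON-to-CSV conversion (an illustrative stand-in for json2csv.py, not the script itself; the file names and the top-level "stories" key are assumptions -- adjust them to the API's actual JSON layout).

import csv, json

with open("results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Pull out the list of article records, whatever the top-level wrapper is.
if isinstance(data, dict):
    records = data.get("stories", [data])
else:
    records = data

fieldnames = sorted({key for rec in records for key in rec})
with open("results.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)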
- Scrape and Export the USAToday articles using Import.io:
- Open the Import.io app and point its built-in browser to the URL for the USAToday extractor: https://import.io/data/mine/?id=635fe72f-51bb-4dd2-b1d9-7222b1e47e5f
- Click the Import.io tab for "Bulk Extract"
- Copy-and-paste into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in urls.txt (from the steps above). (The URLs should be listed one per line.)
- Then click on "Run queries"
- When extraction of data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet.") Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
- Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder ([publication] > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "usatoday-[year]-h-master" (for "humanities" queries) or "usatoday-[year]-la-master" (for "liberal arts" queries), e.g., "usatoday-2007-h-master" or "usatoday-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
(Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet. For some reason this works.)
- Organize and add columns in Master Spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:
- Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
- Column B -- the "date" column
- Column C -- the "title" column
- Column D -- the "articlebody" column
- Column E -- the "author" column
- Column F -- Add a column here with the label "3 words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have yet added Column A above. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
- Column G -- the "pageUrl" column
- Chop the scraped articles into individual articles:
- You can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks).
- Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive).
- Fill out Collection Manifest Form
The Guardian
Using Guardian API for searching and Import.io as scraper
Requirements: (preinstalled on the WE1S workstations)
- Guardian API Key (register on developer's site and request a key for the "articles" API)
- Import.io (free) -- Note: Import.io can be used in a browser to extract data automatically from a Web page; and the browser version can also use customized extractors that someone else has created. However, creating customized extractors (e.g., able to select only particular data from a page) can only be done through Import.io's downloadable "app" (which is also free).
Workflow Steps:
- Get a Guardian API Key if you don't have one (request developer key).
- Search The Guardian using the Guardian's API Console:
- In a web browser (not the one built into Import.io), go to the URL of the Guardian's Open Platform API Console (beta version, not the old version): http://open-platform.theguardian.com/explore/.
- When the Guardian's API console search form loads, check the box in that form for "Show All Filters." Then fill in the following fields in the form:
- search term
- order-by (choose "oldest")
- page-size (set at 199 to show the maximum number of hits per search-results page)
- from-date & to-date (in the format, e.g., "2014-01-31")
- api-key (your Guardian api key)
- At the bottom of the Guardian API Console web page, you'll see a live view of the JSON returned by the search, including the URLs for the pages found. The maximum number of hits on a single search-results page is 199.
- For multiple search-results pages:
- The JSON search results will start with metadata that includes the number of the current search-results page and the total number of search-results pages (e.g., currentPage: 2, pages: 2).
- After harvesting the results of one page (through the process described below), you can use the "page" field in the Guardian API Console's search form to request the next page of search results. (A sketch of querying the API directly, without the console, follows this list.)
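For reference, a minimal Python sketch of querying the Guardian search API directly instead of through the console (the endpoint and parameter names assume the standard Open Platform "search" endpoint, which mirrors the console fields above; requests is a third-party package).

import requests

params = {
    "q": '"liberal arts"',            # search term (quoted for an exact phrase)
    "from-date": "2014-01-01",
    "to-date": "2014-12-31",
    "order-by": "oldest",
    "page-size": 199,
    "page": 1,                        # increment for additional search-results pages
    "api-key": "YOUR_GUARDIAN_API_KEY",
}
resp = requests.get("https://content.guardianapis.com/search", params=params)
results = resp.json()["response"]["results"]
print("\n".join(item["webUrl"] for item in results))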
- Collect the URLs of the links from the Guardian search in Excel:
- Copy-and-paste the JSON search results from the steps above into a blank Excel worksheet (using "Paste Special" to paste without formatting)
- Select the first column, then go to Data view ("Data" tab in Excel) and click on "Filter."
- Cell A1 will now show a little drop-down arrow. Click on the arrow, and choose: Text Filters > Begins with. Type in "webURL: " (includes space after the colon) to filter the worksheet to show only the rows beginning with this expression.
- Do a search-and-replace in Excel to replace "webURL: " (including the trailing space) with nothing, leaving only the URLs. (A scripted alternative to these Excel steps is sketched below.)
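For reference, a minimal Python sketch of the same URL-collection step done in a script rather than in Excel (guardian-results.txt, a text file holding the pasted console output, is an assumed file name).

import re

with open("guardian-results.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Grab whatever follows each webURL field, up to the next quote, comma, or whitespace.
urls = re.findall(r'webURL"?\s*:\s*"?([^",\s]+)', text, flags=re.IGNORECASE)

with open("urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")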
- Scrape and Export the Guardian articles using Import.io:
- Open the Import.io app and point its built-in browser to the URL for the Guardian extractor: https://import.io/data/mine/?tag=EXTRACTOR&id=c43eee85-3697-42a1-81f0-dcad2bc66c45
- Click the Import.io tab for "Bulk Extract"
- Copy-and-paste into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in Excel (from the steps above). (The URLs should be listed one per line.)
- Then click on "Run queries"
- When extraction of data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet.") Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
- Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder ([publication] > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "guardian-[year]-h-master" (for "humanities" queries) or "guardian-[year]-la-master" (for "liberal arts" queries), e.g., "guardian-2007-h-master" or "guardian-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
(Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet. For some reason this works.)
- Organize and add columns in Master Spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:
- Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
- Column B -- the "date" column
- Column C -- the "title" column
- Column D -- the "subtitle" column
- Column E -- the "articlebody" column
- Column F -- the "author" column
- Column G -- Add a column here with the label "3 words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(E2, " "))
Adjust the "E2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have yet added Column A above. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
- Column H -- the "pageUrl" column
- (Repeat the steps above as needed for multiple search-results pages of URLs found through the Guardian API Console, as described in step 2.iv above.)
- Chop the scraped articles into individual articles:
- You can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks).
- Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive).
NPR
Using NPR API for searching and Import.io as scraper
Requirements: (preinstalled on the WE1S workstations)
- NPR API Key (register and request a key)
- Import.io (free) -- Note: Import.io can be used in a browser to extract data automatically from a Web page; and the browser version can also use customized extractors that someone else has created. However, creating customized extractors (e.g., able to select only particular data from a page) can only be done through Import.io's downloadable "app" (which is also free).
- Convert JSON to CSV online tool.
Workflow Steps:
- Get an NPR API Key if you don't have one (request developer key).
- Search NPR using the NPR API Console:
- In a web browser (not the one built into Import.io), go to the URL of NPR's API Console: http://www.npr.org/api/queryGenerator.php.
- When NPR's API console search form loads,
- Click the "Fields" tab. For Output Format, select JSON. For Selectable Fields, choose only "listText".
- Then click the "Control" tab. Fill in the search parameters in the form. (Show only stories with "text." Use "date ascending" for the Sort Order. The start and end date formats are, for example, 2010-12-31. Start Number refers to the number of the search-results page if there is more than one (leaving it blank defaults to the first search page).
- Then click on "Create API Call," followed by "Run API Call" to generate search results in JSON format. (The JSON will appear all on a single line.)
- For multiple search-results pages:
- After harvesting the results of one page (through the process described below), you can use the "Start Number" field in the NPR API Console's search form to request the next page of search results.
- Alternatively: Instead of using the NPR API console, you can directly enter variations of the following URL, which is an example of the search request the API console would generate (adjust the parameters and search-results start page, and add your own API key):
http://api.npr.org/query?fields=listText&requiredAssets=text&startDate=2010-01-01&endDate=2010-12-31&searchTerm=humanities&startNum=1&dateType=story&sort=dateAsc&output=JSON&numResults=20&apiKey=[your api key here]
- Collect the URLs of the links from the NPR search using the Convert JSON to CSV online tool and Excel:
- Copy-and-paste the JSON search results from the steps above into the Convert JSON to CSV online tool. When the CSV version of the data appears, click on the "Download" button in the tool to export to Excel.
- The Excel spreadsheet will show three different URLs for each article, one per row: "html," "api," and "short." We want only the "html" URL. To make Excel filter for only the rows for "html" URLs, take the following steps:
- Select the column showing the labels "html," "api," and "short."
- Click on the "Data" tab in Excel and then click on "Filter". This will create a little down-arrow icon in the first row of the column for the available filters..
- Click on the down-arrow filter icon and deselect all the values for the column except "html." This will result in a filtered view of the spreadsheet showing only the "html" URLs.
- Select and copy the URLs in the URL column. (A scripted alternative that pulls the "html" URLs straight from the JSON is sketched below.)
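For reference, a minimal Python sketch of collecting only the "html" link for each story straight from the NPR JSON (the keys "list", "story", "link", and "$text" are assumptions about the Story API's JSON layout -- adjust them if your output differs; npr-results.json is an assumed file name).

import json

with open("npr-results.json", "r", encoding="utf-8") as f:
    data = json.load(f)

urls = []
for story in data["list"]["story"]:
    for link in story.get("link", []):
        if link.get("type") == "html":
            urls.append(link["$text"])

with open("urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(urls) + "\n")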
- Scrape and Export the NPR articles using Import.io:
- Open the Import.io app and point its built-in browser to the URL for the NPR extractor: https://import.io/data/mine/?tag=EXTRACTOR&id=9b1bac59-4f18-4f33-be15-62d931d99208
- Click the Import.io tab for "Bulk Extract"
- Copy-and-paste into the Import.io field for "Enter URLs to extract from" the whole list of URLs you collected in Excel (from the steps above). (The URLs should be listed one per line.)
- Then click on "Run queries"
- When extraction of data is complete, click on the "Export" button in Import.io. From the export options, choose "HTML." (There are problems with choosing the more intuitive-sounding "spreadsheet.") Open the resulting web page, which will show you the query results in a table. Select all (Ctrl-A) and copy the page.
- Creation of Master Spreadsheet for Scrape: Go to the WE1S Google Drive and, in the appropriate working data subfolder ([publication] > working_data > [year] > [query term] > aggregate_working_data), create a Google spreadsheet named "npr-[year]-h-master" (for "humanities" queries) or "npr-[year]-la-master" (for "liberal arts" queries), e.g., "npr-2007-h-master" or "npr-2007-la-master". Paste into the spreadsheet the content you copied from the HTML page exported from Import.io. This will be the master spreadsheet for the scrape of a year (referred to in the instructions below as the "master spreadsheet").
(Note: if pasting into the Google master spreadsheet produces an error report that some cell has characters exceeding the maximum limit, then first paste into an Excel spreadsheet, then upload the Excel sheet into Google Drive, and open it as a Google spreadsheet. For some reason this works.)
- Organize and add columns in Master Spreadsheet:
Arrange (and add) columns in the spreadsheet as follows:
- Column A (add this column last, when you have finished with all other work on the spreadsheet) -- Label: "Plain Text file #". Content: insert the number "1" in cell A2. Then in cell A3 insert the formula "=A2+1" (to increment the number). Then drag the active handle at the lower right of cell A3 down the whole column from A4 on to automatically populate the rest of the column cells. The purpose is to assign a file identifier to each article that will correspond to the file sequence in the final individual plain text files that will be the output of scraping.
- Column B -- the "date" column
- Column C -- the "title" column
- Column D -- the "articlebody" column
- Column E -- the "author" column
- Column F -- Add a column here with the label "3 words." Insert the following formula in the cell in the second row: =COUNTA(SPLIT(D2, " "))
Adjust the "D2" here to correspond to the column in which "articlebody" currently resides, which may change depending on whether you have yet added Column A above. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the number of words in each article.
- Column G -- the "pageUrl" column
- (Repeat the steps above as needed for multiple search-results pages of URLs found through the NPR API Console, as described under "For multiple search-results pages" in step 2 above.)
- Chop the scraped articles into individual articles:
- You can use Chopping List (or a similar utility) to chop the cumulative text file into text files for individual articles. The delimiter between articles in the cumulative text file is a carriage return, which can be entered as the delimiter in Chopping List as "$r" (no quote marks).
- Or you can use Scott's cut.py Python script for the purpose (located in the "pythonscripts" folder on the local workstation and also on the WE1S Google Drive).