• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • Dokkio Sidebar (from the makers of PBworks) is a Chrome extension that eliminates the need for endless browser tabs. You can search all your online stuff without any extra effort. And Sidebar was #1 on Product Hunt! Check out what people are saying by clicking here.


Corpus Finalization Workflow

Page history last edited by Jamal Russell 6 years, 1 month ago

WE1S Corpus Finalization Workflow


1)    Sign up for a year and publication date under the “Developer Task Assignments (Finalizing Corpus) page on 4Humanities pbworks: 4humwhatevery1says.pbworks.com


2)    Log-in to the WE1S Google Drive and navigate to your chosen publication and year's folder.


File path: WE1S-2 à data_archive à corpus à publication (e.g. NYT) à working dataà year à humanities à aggregate working data à master.xlsx spreadsheet in Excel.

Note: if an Excel master spreadsheet is not available, open the Google sheets master file and save as an Excel file (.xlsx).


3)     Open the spreadsheet and go over the following columns for necessary edits:

a)   Plain Text File #: Type in the first available column (publication-year-h-1). As an example: nyt-2010-h-1. Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the article #. Make sure that the new name matches up with the names of the plain text files in the WE1S Google Drive (thus, latimes-2000-h-1 should match with the name of the corresponding plain text file in the "data" folder of the Google Drive).


b)  Publication: If there is no publication category, create one and write in the publication name for each article. The column should be between the Plain Text File # column and the Date column.

c)  # words: If the word count is not available or in error, insert the following formula in the cell in the second row: =IF(LEN(TRIM(D2))=0,0,LEN(TRIM(D2))-LEN(SUBSTITUTE(D2," ",""))+1). Then drag the active corner at the lower right of the cell with the formula down the column to automatically populate the rest of the column with the # of words in the articles.


d)   Author: Check this column to see if there are words or phrases other than the author’s name in the cells, such as "by" or "published." If there are, you can cut these phrases by using Excel’s find/replace feature. Select the respective author column, and then click CTRL+F. Search for "by" in the "Find" box, and leave the "Replace" box empty. Then either "Replace all" or "Find Next" to check and see that Excel is only replacing the "bys" in the proper column. 

     -If you are working on the version of Excel installed on the lab computers, you will have to make sure that you are running your find/replace within your selected row (which is done by selecting the "Within" drop down menu and selecting

     "Selection"). This will limit your search to your selected row, and ensure that you are not eliminating "bys" in the Title or Articlebody columns.


     Formatting Author column: The proper format for the author column is Last name_First name_Middle name or initial. To format this in Excel, do the following:


i. Trim the white spaces before the author name. To do this, left click on the author column and click "insert." This will insert a new column. In that column put the following formula=TRIM (A1) . Please note that A1 refers to the cell that the author name is in, and that will differ depending on the spreadsheet you are working on. The formula will trim the author's name in the new column you created. Then drag the cell down to cover every cell in the newly created column, so that no author name has any white space before it. Finally, select the newly created column and click insert again, creating another new column. In this column, click paste special to insert only the text from the newly trimmed author names. In the paste special box, click "Values" to show only the text.


ii. Flip the author's first name and last name if necessary. If your spreadsheet has the author name formatted as first name last name, create a new column through the same method as listed earlier. Then, put the following format in the newly created cell to flip the first name and last name in your spreadsheet, and change the "AI" to the proper cell of the author name you're trying to flip:  =MID(A1&" "&A1,FIND(" ",A1)+1,LEN(A1))


This formula will flip the first name and last name. Drag the cell down so that every author name is flipped. Please note that there are somenames that won't flip when there's a middle initial or name. Identify these cases and then manually fix them. Finally, insert a new column again and paste special the newly cleaned names into the new column. 


iii. Add an underscore between the names. Click on the newly cleaned column and click CTRL+F to open the find and replace feature. In the "find" box put a space; in the "replace" box put an underscore. This will put an underscore automatically between each last name and first name. 


vi. Finally, look over the newly revised column and check for errors. When this column is complete, delete the columns you made other than the finished column. 


e)  Date: Depending on the date format, do the following:

     -If the date is in mm/dd/yyyy format, leave it be. When the sheet is imported into OpenRefine, the dates will be automatically converted to UTC format.


     -If the date is in any other format: Make sure that you only have the date contained within the cells of the column (all "Published:" headers and the like should be cut) and that the date of the column is the second column of the sheet.

     Save the sheet as a .csv file in your folder of choice. Download the csv_date_clean.py script circulated via e-mail and place it in the same folder that your .csv file is in. Open your system's command shell/prompt and navigate to the folder

     you are using (the command is "cd [folder name]." If you need to know what the names of the folders are, type in "dir" and press enter). When you have navigated to your folder of choice, type in the following commands:

          -pip install click dateparser: This installs the dateparser Python library on your system, which is what will be used to clean up the dates within your .csv file.

          -python csv_date_clean.py [Input] [Output]: This cleans the dates in your saved .csv file and creates a new .csv file with your dates in UTC format. "Input" is the name of the .csv file you just saved and "Output" is the name of the newly

          produced .csv file.


     -Additional Steps for The Guardian Corpus:

          -Left click on the "date" column and select "Insert" to create a new column. Copy the date information from the original date column and paste it into the new one.

          -In the new column, use the find/replace function to cut all of the EST/EDT times, as well as any leading or trailing white space. This should produce dates in a "dd-mm-yy" format. Save the sheet.

          -Run the csv_date_clean.py script on your saved metadata sheet to get your edited dates into a basic UTC format.

          -Using the old "date" column as reference, convert the provided EST/EDT times into UTC times and write them into your edited and transformed "date" column (UTC is 4 hours ahead of EST/EDT, so 12:00 EST/EDT=4:00 UST)


4)    Open the OpenRefine interface in your web browser by opening the “C:/openrefine-win-2.6-rc2/openrefine-2.6-rc2” folder in your system and run “openrefine.exe” by clicking on the named file twice to open the OpenRefine interface at the address (you can always navigate back to the OpenRefine interface by pointing your browser to this address, and can even use it in multiple windows using it).


5)    Once you are in the OpenRefine interface, click “Create Project” and upload the sheet that you recently downloaded and edited. 


6)    In the editor, rename the relevant columns as follows:

a.     Plain Text File # à id


b.     Date à pubdate


c.     Article à articlebody


d.     DocumentURL/PageURL à docUrl


e.     # words à wordcount


(the cases of the new titles should be exactly as shown here). The columns, from left to right, should read: id, publication, pubdate, title, articlebody, author, docUrl, wordcount.


7)    Edit the content of the columns as follows:

a.     General: Check the contents of a column for whitespace by opening a given column’s drop down menu, navigating to the “Edit cells” option, and selecting “Trim leading and trailing whitespace” from the “Common transforms” options (this is especially important for the author column).


b.    Pubdate:  All dates should be in UTC format (2016-01-01T00:00:00Z). Double check this column to make sure that OpenRefine has either converted or retained your UTC dates.


c.     Author: Open the author column’s drop down menu and, from the “Facet” options, select “Text facet.” This will open a record of the names on the sidebar. From here, you can use the “Cluster” option to group similar names together and edit them accordingly, as well as edit the individual names themselves (for now, just make sure that the names are consistent. Give yourself as little work as possible).


d.     Wordcount: Move the wordcount to the end of the spreadsheet by going into the column drop down menu and selecting “Move column to end” under “Edit column.”


e.     The metadata should look as follows:











New York Times

1984-01-01 T00:00:00Z

The Learning Network







8)    When you have refined the metadata, look for the “Export” drop down menu in the right-hand corner of the page. Select the “Comma Separated Value” option, and your OpenRefine project should be downloaded as an .csv file. Make sure that you do not have any text facet/filtering options selected; if you do, OpenRefine will only export a sheet containing the filtered metadata.


9)    Open up the exported .csv file in Excel and check it for newly created blank columns. If OpenRefine has created new columns, delete them by highlighting the columns, right-clicking them, selecting delete from the generated menu.


10)  Log on to the WE1S Mirrormask server at https://mirrormask.english.ucsb.edu:5001 (username and password below) and click on the File Station icon. Starting within the “we1s-3” folder, navigate to your chosen publication and year's “aggregate_working_folder” and upload the refined .csv file to the folder.


11)  To exit OpenRefine, close any browser windows you were using to edit spreadsheets, navigate back to the OpenRefine command shell (which should still be open), and hit “ctrl+c.” This will stop the OpenRefine application and the shell should disappear within a couple of seconds.


12)  Mark the year as completed on the pbworks site. You’re done!


Comments (0)

You don't have permission to comment on this page.