

Instructions: Stage 1 Processing for Unstructured Corpus (Text Capture and Archiving)

Page history last edited by Priscilla Leung 10 years, 1 month ago

The following are instructions for Stage 1 Processing of the miscellaneous (unstructured) corpus of statements about the humanities collected in the WhatEvery1Says Google Drive spreadsheet.  The instructions describe how to capture the text (or multiple representations of the text) using a template file structure that can be added to the archive we are building in our Google Drive project folder.

  1. The following are the relevant resources you will need:
    1. Spreadsheet: https://docs.google.com/spreadsheet/ccc?key=0AoYF_lw5DlrkdGlENUJDS3lMY3lkNElRSnZiSllCZnc&usp=drive_web 
    2. Google Drive project folder (owned by transcriptions.ucsb@gmail.com, sharable with participants): https://drive.google.com/folderview?id=0B5qy9SR1ODepWk1MMFk5VUQ0enM&usp=sharing
    3.  Template (please make a *copy* of this template folder of files to seed your work): https://drive.google.com/file/d/0BwBRb4YK4iSQREktVlJXZUI5MGs/edit?usp=sharing
  2. Choose a document to work on: Select a document from the spreadsheet -- e.g., #1 "Humanists: Do Not Panic". Insert your name in the column labeled "Stage 1 Document Processing Editor". (Note: rows with a yellow fill in the spreadsheet indicate documents that have already been stage-1 processed.)

  3. Open the Google Drive folder where you will be saving your output: Working in Google Drive (either through a browser or in a synced folder on your local hard drive managed by the Google Drive local app), go to the following location for our project: 4Humanities--Whatevery1Says > _corpus > whatevery1says-web-unstructured > _document

  4. Seed a copy of the template folder for the document you are working on: Make a copy of the folder titled "0000--Session Template".  Give a name to this copied folder that starts with a 4-digit version of the entry ID from the document you are working on in the spreadsheet -- e.g., 0001--Humanists Do Not Panic/

  5. Analyze how best to capture the text content of the document you are working on (deciding on a "source" format): Go to the document (at the URL included in the spreadsheet). Explore which way of accessing the document (e.g., default view, print view, PDF view, etc.) is best for acquiring a clean plain-text copy of the content. This requires some judgement.

  6. Depending on the source format you choose, create a "source" subfolder indicating the nature of that format: Create a _source folder whose name describes the way the article is being saved -- e.g., _source--html-page/

    Names for archive folders are being standardized (see a large list below**), but the most common for our purposes are:


    You may create more than one source folder if the document is available in two forms (from two different URLs) that are both of interest.
    1. Save the URL of the document you are working on in the folder: On a Mac computer, drag the icon from the URL bar into the folder to create a .webloc XML file. On a Windows computer, drag the icon from the URL bar into the folder to create an Internet Shortcut text file. On a Linux computer, save the URL in a plain text file, e.g. url.txt. (If drag-and-drop does not work for your computer, save the URL as a .txt file.) The purpose of this step is to ensure that we have a convenient nearby record of the original URL.

    2. Archive a copy of the original document (as HTML or PDF, depending on its original format) in the source folder: Create in the source folder an archive subfolder named (for example, among likely candidates):


      Then save a copy of the original document in that _archive subfolder to ensure that we have an archival copy in case the original later disappears from the Internet. (Use your browser to save an HTML file either as "HTML only" or as a complete web page, including images, etc.)

      1. Create a _transform sub-sub-folder in the _archive folder named after the way you will be transforming the original document into plain text -- e.g., _transform--cut-paste/
      2. **FINAL STEP (PAYLOAD)**: Save a plain-text copy of the original document in this _transform sub-sub-folder: Follow the Editorial Policy to include/exclude material.

      3. (Optional) If clean-up is required, make a copy of the plain text into a subfolder and hand-edit it:


        Transforms may be chained to any depth. Some common transforms are:
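
For participants working in a locally synced copy of the project folder, the folder layout described in steps 4 and 6 can be sketched in a short script. This is only an illustrative sketch, not part of the project's tooling: the function names, the output file name document.txt, and the rule for removing punctuation from titles are assumptions (the page shows only that "Humanists: Do Not Panic" becomes "0001--Humanists Do Not Panic"). The source, archive, and transform folder names are passed in by the caller, because the standardized names are not listed here.

```python
import re
import shutil
from pathlib import Path


def document_folder_name(entry_id: int, title: str) -> str:
    """Build a folder name such as '0001--Humanists Do Not Panic'."""
    # Assumption: characters that are awkward in file names are simply dropped.
    safe_title = re.sub(r'[\\/:*?"<>|]', "", title)
    safe_title = re.sub(r"\s+", " ", safe_title).strip()
    return f"{entry_id:04d}--{safe_title}"


def seed_document(corpus_dir, template_name, entry_id, title, url,
                  source_name, archive_name, transform_name, plain_text):
    """Copy the template folder and create source/archive/transform folders.

    Returns the path of the plain-text file written in the transform folder.
    """
    corpus_dir = Path(corpus_dir)
    doc_dir = corpus_dir / document_folder_name(entry_id, title)

    # copytree raises FileExistsError if the document folder already exists,
    # which protects work that another editor has already saved.
    shutil.copytree(corpus_dir / template_name, doc_dir)

    source_dir = doc_dir / source_name
    transform_dir = source_dir / archive_name / transform_name
    transform_dir.mkdir(parents=True, exist_ok=True)

    # Keep a record of the original URL next to the archived material.
    (source_dir / "url.txt").write_text(url + "\n", encoding="utf-8")

    # Hypothetical file name for the plain-text payload.
    text_path = transform_dir / "document.txt"
    text_path.write_text(plain_text, encoding="utf-8")
    return text_path
```

A call such as seed_document(folder, "0000--Session Template", 1, "Humanists: Do Not Panic", url, "_source--html-page", "_archive", "_transform--cut-paste", text) would produce 0001--Humanists Do Not Panic/_source--html-page/_archive/_transform--cut-paste/document.txt. The archived HTML or PDF copy itself still needs to be saved by hand from the browser.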



