Notes From Meeting of March 19, 2014, "Archiving an Unstructured Web Corpus -- Discussion and Work"

Page history last edited by Alan Liu 10 years, 8 months ago

 

 

  • Preliminary thoughts by Alan in response to planning documents ( A  |  B ) from Jeremy. 
    For our text preparation work, we need:
    • Dual workflow strategy:
      • Workflow for a miscellaneous corpus (see 3.A below)
      • Workflow for predefined corpora (see 3.B below)
      • Alternatively: a wide-angled Google search with transparent search criteria:
        • collocation of words to search for
        • a hit triggers several other tests
        • finally documents selected
    • Editorial policy
    • Uniformity of method and tools
    • Modularity, rollback, and fork (Github?)
      • Jeremy suggests: flat directory tree structure, branching for each version of a transformed document (see his notes below)
      • Run on Google Drive (?), which supports versioning for 30 days; or run off Jeremy's server?
    • Documentation
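
    The wide-angled search alternative listed above (a collocation of words, a hit that triggers further tests, then final selection) might be sketched roughly as follows. This is only an illustration: the token window, the example terms "humanities" and "crisis", and the single follow-up test `long_enough` are all hypothetical, since the notes do not specify the search criteria or the tests.

```python
import re


def has_collocation(text, word_a, word_b, window=10):
    """Return True if word_a and word_b occur within `window` tokens of each other.

    The window size is an assumption; the notes do not define "collocation".
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


def long_enough(text, min_words=20):
    """Hypothetical follow-up test: discard very short pages."""
    return len(text.split()) >= min_words


def select_documents(docs, word_a, word_b, tests):
    """Keep documents that hit the collocation and then pass every follow-up test.

    docs maps a document name to its plain text.
    """
    selected = []
    for name, text in docs.items():
        if not has_collocation(text, word_a, word_b):
            continue  # no hit, so no further tests are run
        if all(test(text) for test in tests):
            selected.append(name)
    return selected
```

    Keeping the criteria as explicit function arguments is one way to make the search transparent: the word pair and the list of tests can be recorded alongside the resulting corpus.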

 

  • Notes of discussion and results of the meeting (by Jeremy, from his email of 3/19/14):

    Archiving an Unstructured Web Corpus -- Discussion and Workshop

    -   Reviewed the materials on the wiki

    http://4humwhatevery1says.pbworks.com/w/page/75154250/Text%20Preparation%20and%20Topic%20Modeling%20Planning%20Document#view=page

    -   Discussed creating an online corpus and the storage of our corpus - in particular, issues relating to shared access and version control.
    -   Went forward with a plan to initially use a shared Google Drive folder to store the unstructured web corpus as it is being collected.
    -   Reviewed a scheme for storing source URLs, archived versions of pages, plain text copies, and edited/cleaned versions of those copies in a URI-like folder structure, like this:

    /_corpus
      /whatevery1says
        /_document
          /MyDoc
            /_source
              /_archive
                /_transform-cut-paste
                  /_transform-hand-edited
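
    As a rough illustration, the folder scheme above could be pre-created with a short script like the one below. It is a sketch under assumptions: it reads the indentation in the listing as literal nesting (each stage folder inside the previous one), and the function name `make_document_tree` and the use of a temporary directory as the root are hypothetical, not part of the group's actual setup.

```python
from pathlib import Path
import tempfile

# Stage folders, in the nesting order shown in the scheme above.
STAGES = ["_source", "_archive", "_transform-cut-paste", "_transform-hand-edited"]


def make_document_tree(root, project, doc_name):
    """Create the nested stage folders for one document and return the deepest one."""
    path = Path(root) / "_corpus" / project / "_document" / doc_name
    for stage in STAGES:
        path = path / stage
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run on an existing tree
    return path


if __name__ == "__main__":
    # A temporary directory stands in for the shared folder in this example.
    with tempfile.TemporaryDirectory() as tmp:
        deepest = make_document_tree(tmp, "whatevery1says", "MyDoc")
        print(deepest.relative_to(tmp).as_posix())
```

    Because `exist_ok=True` is set, the script can be run again when a new document is added without disturbing folders that editors have already filled.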

    -   Discussed the scheme's advantages and limitations, its relationship to automated downloading or text processing, and edge cases when dealing with multiple sources, different browsers and operating systems, coordinating work between editors, etc.

    -   Diving in: After adding a pre-created project folder tree to the Google Drive folder (linked above), we logged in and each added 1-2 articles found in the spreadsheet to the corpus. The archiving process explored by the group is outlined below.

    -    After the hands-on work, we discussed coordination and next steps:
       -   We need to capture feedback on the process -- both in per-document editor log files, and in a shared document (which needs to be set up). Everything about this process could be documented better [JD: I'm doing some of it below].
       -   There is a new "Corpus Editor" column on the spreadsheet to indicate the articles that each participant signed up for (either during or after the meeting).
       -   Our goal is to do 5-10 entries per member -- from a mix of sources at first, with the possibility of later specialization in a single source for greater efficiency (e.g. focusing on The Chronicle of Higher Education) -- with a goal of having 30-50 documents ready for the other group.
       -   Our tentative deadline is sometime before the next meeting -- with that meeting hopefully happening by the first week of May.

 
