Notes From Meeting of March 19, 2014, "Archiving an Unstructured Web Corpus -- Discussion and Work"

Page history last edited by Alan Liu 9 years, 2 months ago



  • Preliminary thoughts by Alan in response to planning documents ( A  |  B ) from Jeremy. 
    For our text preparation work, we need:
    • Dual workflow strategy:
      • Workflow for a miscellaneous corpus (see 3.A below)
      • Workflow for predefined corpora (see 3.B below)
      • Alternatively: a wide-angle Google search with transparent search criteria:
        • collocation of words to search for
        • a hit triggers several other tests
        • finally, documents are selected
    • Editorial policy
    • Uniformity of method and tools
    • Modularity, rollback, and fork (Github?)
      • Jeremy suggests: flat directory tree structure, branching for each version of a transformed document (see his notes below)
      • Run on Google Drive (?), which supports versioning for 30 days; or run off Jeremy's server?
    • Documentation
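
    The alternative selection workflow above (collocation search, then follow-up tests, then selection) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the 10-token window, and the `long_enough` test are hypothetical and were not decided at the meeting, and the documents are assumed to be already downloaded as plain text rather than fetched from Google.

```python
import re


def has_collocation(text, word_a, word_b, window=10):
    """Return True if word_a and word_b occur within `window` tokens of each other.

    The window size is an assumed parameter, not a value from the notes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


def select_documents(documents, word_a, word_b, follow_up_tests, window=10):
    """Keep documents that hit the collocation and then pass every follow-up test."""
    selected = []
    for doc in documents:
        # Step 1: the collocation search is the initial "hit".
        if not has_collocation(doc["text"], word_a, word_b, window):
            continue
        # Step 2: a hit triggers the other tests; all must pass.
        if all(test(doc) for test in follow_up_tests):
            # Step 3: the document is selected.
            selected.append(doc)
    return selected


def long_enough(doc, min_words=5):
    """Hypothetical follow-up test: reject very short texts."""
    return len(doc["text"].split()) >= min_words
```

    Keeping each test as a separate function makes the search criteria transparent: the list passed as `follow_up_tests` is itself a record of what was checked.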


  • Notes of discussion and results of the meeting (by Jeremy, from his email of 3/19/14):

    Archiving an Unstructured Web Corpus -- Discussion and Workshop

    -   Reviewed the materials on the wiki


    -   Discussed creating an online corpus and the storage of our corpus - in particular, issues relating to shared access and version control.
    -   Went forward with a plan to initially use a shared Google Drive folder to store the unstructured web corpus as it is being collected.
    -   Reviewed a scheme for storing source URLs, archived versions of pages, plain text copies, and edited/cleaned versions of those copies in a URI-like folder structure, like this:


    -   Discussed the scheme's advantages and limitations, its relationship to automated downloading and text processing, and edge cases when dealing with multiple sources, different browsers and operating systems, coordinating work between editors, etc.

    -   Diving in: After adding a pre-created project folder tree to the Google Drive folder (linked above), we logged in and each added 1-2 articles found in the spreadsheet to the corpus. The archiving process explored by the group is outlined below.

    -    After the hands-on work, we discussed coordination and next steps:
       -   We need to capture feedback on the process -- both in per-document editor log files, and in a shared document (which needs to be set up). Everything about this process could be documented better [JD: I'm doing some of it below].
       -   There is a new "Corpus Editor" column on the spreadsheet to indicate the articles each participant signed up for (either during or after the meeting).
       -   Our goal is to do 5-10 entries per member -- from a mix of sources at first, with the possibility of later specialization in a single source for greater efficiency (e.g. focusing on The Chronicle of Higher Education) -- with a goal of having 30-50 documents ready for the other group.
       -   Our tentative deadline is sometime before a next meeting -- with the next meeting hopefully happening by the first week of May.
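
    As a rough illustration of the URI-like folder scheme reviewed at the meeting, the sketch below maps a source URL to a folder path and lists one file for each of the four kinds of item the notes mention (source URL, archived page, plain text copy, cleaned copy). The root folder name, the file names, and the rule of lowercasing the host and dropping the query string are all assumptions made for this example; they are not the layout the group actually adopted.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Hypothetical file names, one per kind of item stored for each article.
ITEM_FILES = {
    "source_url": "source-url.txt",
    "archived_page": "archived.html",
    "plain_text": "plain.txt",
    "cleaned_text": "cleaned.txt",
}


def corpus_folder(url, root="corpus"):
    """Map a source URL to a folder path that mirrors the URL's host and path.

    Assumption: the scheme and query string are ignored, and the host is lowercased.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [s for s in parts.path.split("/") if s]
    return PurePosixPath(root, host, *segments)


def corpus_files(url, root="corpus"):
    """Return the path of each stored item for one article, keyed by item kind."""
    folder = corpus_folder(url, root)
    return {kind: folder / name for kind, name in ITEM_FILES.items()}
```

    For example, `corpus_folder("http://Example.org/news/2014/article-1/")` gives `corpus/example.org/news/2014/article-1`. Deriving the path from the URL alone means two editors archiving the same article land in the same folder, which is one way to catch duplicate work.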

