Notes From Meeting of March 19, 2014, "Archiving an Unstructured Web Corpus -- Discussion and Work"
Last edited by Alan Liu 10 years, 8 months ago
- Preliminary thoughts by Alan in response to planning documents ( A | B ) from Jeremy.
For our text preparation work, we need:
- Dual workflow strategy:
- Workflow for a miscellaneous corpus (see 3.A below)
- Workflow for predefined corpora (see 3.B below)
- Alternatively: a wide-angle Google search with transparent search criteria:
- collocation of words to search for
- a hit triggers several other tests
- finally, documents are selected
- Editorial policy
- Uniformity of method and tools
- Modularity, rollback, and fork (Github?)
- Jeremy suggests: flat directory tree structure, branching for each version of a transformed document (see his notes below)
- Run on Google Drive (?), which supports versioning for 30 days; or run off Jeremy's server?
- Documentation
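The search-based alternative listed above (a collocation of words, a hit that triggers further tests, then selection) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the 10-token window, and the sample follow-up test are all hypothetical, and the sketch filters text already in hand rather than calling any search service.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_collocation(tokens, words, window=10):
    """Return True if every word in `words` occurs within some run of
    `window` consecutive tokens. The window size is an assumed parameter."""
    targets = {w.lower() for w in words}
    for start in range(len(tokens)):
        if targets <= set(tokens[start:start + window]):
            return True
    return False


def select_documents(documents, words, follow_up_tests, window=10):
    """Keep documents where the collocation hits AND every follow-up test passes.

    `documents` maps a document id to its plain text; `follow_up_tests` is a
    list of functions taking the text and returning True or False.
    """
    selected = []
    for doc_id, text in documents.items():
        if not has_collocation(tokenize(text), words, window):
            continue  # no hit, so the follow-up tests are never triggered
        if all(test(text) for test in follow_up_tests):
            selected.append(doc_id)
    return selected


if __name__ == "__main__":
    # Made-up sample texts, for illustration only.
    docs = {
        "a": "The humanities are in crisis, say many commentators on higher education today.",
        "b": "Crisis in the oil market.",
        "c": "Humanities crisis.",
    }
    long_enough = lambda text: len(text.split()) >= 5  # hypothetical follow-up test
    print(select_documents(docs, ["humanities", "crisis"], [long_enough]))
```

Keeping the word list, the window, and the follow-up tests as explicit arguments is one way to keep the search criteria transparent: they can be written down in the documentation exactly as they were run.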
- Notes of discussion and results of the meeting (by Jeremy, from his email of 3/19/14):
Archiving an Unstructured Web Corpus -- Discussion and Workshop
- Reviewed the materials on the wiki
http://4humwhatevery1says.pbworks.com/w/page/75154250/Text%20Preparation%20and%20Topic%20Modeling%20Planning%20Document#view=page
- Discussed creating an online corpus and the storage of our corpus, in particular issues relating to shared access and version control.
- Went forward with a plan to initially use a shared Google Drive folder to store the unstructured web corpus as it is being collected.
- Reviewed a scheme for storing source URLs, archived versions of pages, plain text copies, and edited/cleaned versions of those copies in a URI-like folder structure, like this:
/_corpus /whatevery1says /_document /MyDoc /_source /_archive /_transform-cut-paste /_transform-hand-edited
- Discussed the scheme's advantages and limitations, its relationship to automated downloading or text processing, and edge cases when dealing with multiple sources, different browsers and operating systems, coordinating work between editors, etc.
- Diving in: After adding a pre-created project folder tree to the Google Drive folder (linked above), we logged in and each added 1-2 articles found in the spreadsheet to the corpus. The archiving process explored by the group is outlined below.
- After the hands-on work, we discussed coordination and next steps:
- We need to capture feedback on the process -- both in per-document editor log files and in a shared document (which needs to be set up). Everything about this process could be documented better [JD: I'm doing some of it below].
- There is a new "Corpus Editor" column on the spreadsheet to indicate the articles that each participant signed up for (either during or after).
- Our goal is to do 5-10 entries per member -- from a mix of sources at first, with the possibility of later specialization in a single source for greater efficiency (e.g. focusing on The Chronicle of Higher Education) -- with a goal of having 30-50 documents ready for the other group.
- Our tentative deadline is sometime before the next meeting -- with the next meeting hopefully happening by the first week of May.
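As an illustration of the URI-like folder scheme reviewed in these notes, the following sketch builds the folder tree for one document. It assumes that `_source`, `_archive`, `_transform-cut-paste`, and `_transform-hand-edited` sit side by side under each document folder; the notes list the folder names but do not confirm that nesting. The function names are hypothetical, and the sketch works on a local directory rather than on Google Drive itself.

```python
import tempfile
from pathlib import Path

# Per-document subfolders named in the notes. Placing them as siblings under
# each document folder is an ASSUMPTION, not something the notes confirm.
DOCUMENT_SUBFOLDERS = (
    "_source",                 # where the source URL is recorded
    "_archive",                # archived version of the page
    "_transform-cut-paste",    # plain text copy
    "_transform-hand-edited",  # edited/cleaned version of that copy
)


def document_root(corpus_root, project, doc_name):
    """Build the assumed path <root>/_corpus/<project>/_document/<doc_name>."""
    return Path(corpus_root) / "_corpus" / project / "_document" / doc_name


def scaffold_document(corpus_root, project, doc_name):
    """Create the subfolders for one document and return their paths."""
    root = document_root(corpus_root, project, doc_name)
    created = []
    for sub in DOCUMENT_SUBFOLDERS:
        folder = root / sub
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


if __name__ == "__main__":
    # Use a throwaway directory so the example leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_document(tmp, "whatevery1says", "MyDoc"):
            print(folder.relative_to(tmp).as_posix())
```

Because each transformation stage gets its own folder, an editor can redo or discard one stage without touching the archived original, which is in the spirit of the modularity and rollback goals noted earlier.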