
Planning Notes (Thomas, Feb 2, 2014)

Page history last edited by Alan Liu 10 years, 1 month ago

Dear all,

You're receiving this email because you signed up for the Text Extraction group for our 4Humanities topic modeling project. This email is long, so here is what it contains: a list of group members; a short summary of our meeting on Tuesday with info that pertains to your group; an initial action plan for your group; other questions to consider as you work on your initial action plan; some resources to help you; and some final info and reminders.

Group Members: (please let me know if you are no longer interested in participating)
- Jeremy Douglass (group leader?)
- Liz Shayne (group leader?)
- Alan Liu
- Lindsay Thomas
- Priscilla Leung
- Ashley Champagne
- Zach Horton

NOTE: Jeremy, after you left the meeting, we decided that we would like to nominate you to lead our text preparation group (and you weren't there to protest...), simply because you already have the most knowledge about this area. "Leading" the group would simply involve taking the lead on finding tools and methods of text extraction, for example, and pointing the group in their general direction. Liz also volunteered to co-lead the group or to take over if you can't.

Meeting Summary

As we discussed during our meeting on Tuesday, the first line of attack for this group is text extraction and preliminary text preparation. At the meeting, we decided that our first goal should be to come up with a set of tools and workflow ideas for text extraction and preparation. We don't yet know exactly which corpora we will be working with, so the idea isn't to start extracting text now, but rather to investigate how best to go about text extraction and to develop a workflow for doing so.

Initial Action Plan

1. Research methods for extracting plaintext from HTML and PDF files.

2. Research methods of preliminary text preparation: Use Python scripts or other tools for fixing common errors, standardizing spellings, resolving hyphenations, etc., and/or decide if we need to do this at all.

3. Develop a list of tools we could use to do #1 and #2 and develop a theoretical workflow plan for extracting plaintext files once we have our corpora.

(NOTE: More details about #1 and #2 above can be found on the Text Preparation Group project page in the second and third bullet points: http://4humwhatevery1says.pbworks.com/w/page/75154250/Text%20Preparation%20and%20Topic%20Modeling%20Planning%20Document)
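As a starting point for #1 and #2, here is a minimal sketch of what an HTML-to-plaintext step followed by a light cleanup pass might look like in Python. It is only an illustration under stated assumptions, not a chosen workflow: the names `PlainTextExtractor`, `html_to_text`, and `prepare_text` are hypothetical, it uses only the standard library, and it does not cover PDF files, which would need a separate tool that we still have to research. The hyphenation rule is deliberately naive and would also join genuinely hyphenated compounds that happen to break at the end of a line.

```python
import re
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> contents."""

    SKIP_TAGS = {"script", "style"}
    # Assumed list of block-level tags; a line break is emitted before each one
    # so that words in adjacent paragraphs do not run together.
    BLOCK_TAGS = {"p", "div", "li", "br", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (step #1, HTML only)."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def prepare_text(text):
    """Apply two simple cleanup rules (step #2)."""
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Calling `prepare_text(html_to_text(page))` on a downloaded page would give one cleaned string per document. Spelling standardization and error correction are left out here because they depend on decisions we have not made yet.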

Other questions to consider as we develop this initial action plan:

- What is the best way to scale up text extraction and preparation so that it can be done on many files?

- Do we need to think about how to exclude non-relevant material like advertisements, author bios, copyright notices, etc.?
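To make these two questions more concrete, here is a rough sketch of how a batch run over a folder of already-extracted plaintext files might look, with a simple line filter for non-relevant material. Everything specific in it is an assumption for illustration: the patterns in `BOILERPLATE_PATTERNS`, the function names, and the idea that the input is a flat folder of `.txt` files. A real pattern list would have to come from looking at the actual corpora.

```python
import re
from pathlib import Path

# Hypothetical patterns for lines to drop; not derived from any real corpus.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*advertisement\s*$", re.IGNORECASE),
    re.compile(r"^\s*(copyright|©)\s", re.IGNORECASE),
    re.compile(r"all rights reserved", re.IGNORECASE),
]


def strip_boilerplate(text):
    """Drop every line that matches one of the boilerplate patterns."""
    kept = [
        line
        for line in text.splitlines()
        if not any(pattern.search(line) for pattern in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept)


def process_directory(src_dir, dst_dir):
    """Clean every .txt file in src_dir and write the result to dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        cleaned = strip_boilerplate(path.read_text(encoding="utf-8"))
        out_path = dst / path.name
        out_path.write_text(cleaned, encoding="utf-8")
        written.append(out_path)
    return written
```

Writing cleaned copies to a separate folder keeps the original files untouched, so the filter can be adjusted and rerun as we learn what counts as non-relevant.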

Resources:
- WhatEvery1Says Project Page: http://4humwhatevery1says.pbworks.com/w/page/75154136/Meeting%202014-02-18 (includes info about the project as well as resources on topic modeling itself)

- Text Preparation Group project page: http://4humwhatevery1says.pbworks.com/w/page/75154250/Text%20Preparation%20and%20Topic%20Modeling%20Planning%20Document

Final Info and Reminders:

Around the end of this quarter or the beginning of spring quarter, you will receive a link to a Doodle poll that lists possible dates and times for our next meeting as a whole project team. This meeting will ideally occur during the first half of spring quarter. Before then, we may wish to meet as a Text Preparation Group to discuss these issues in more detail. However you do it, the end goal is to produce a list of corpora we could use for our initial investigations.
