Meeting (2019-03-05) UM Comparison Corpus Meeting
Last edited by Lindsay Thomas 6 years, 2 months ago
Meeting Time: Tuesday, March 5, 2 pm EST
Meeting Location: Faculty Exploratory (library room 305)
0. Preliminary Business
- March timesheets due March 18 at the latest
- All-hands meeting next week, 1 pm EST (10 am PST): can attend remotely
Purpose of today's meeting
- Check in about article classification, begin process for classification of articles NOT about the humanities
1. Hand classification of humanities articles discussion
- 140ish marked as explicitly about the humanities
- Why we are doing this: produce data set of known classifications for training
- Process:
  - Train model on known data, A
    - Set 1: articles about the humanities (explicit)
    - Set 2: articles not about the humanities
    - Feature set:
      - Tf-idf: a word's weight increases in proportion to the number of times it appears in a document and is offset by the number of documents in the corpus that contain it.
      - Possibly also using LIWC
  - Train model on known data, B
    - Set 1: articles about the humanities (explicit and implicit)
    - Set 2: articles not about the humanities
    - Feature set:
      - Tf-idf: same weighting as for model A.
      - Possibly also using LIWC
  - Test models A and B on unknown data
    - Set 1: articles containing the word humanities (that the model hasn't seen before)
    - Set 2: articles not containing the word humanities (that the model hasn't seen before)
  - Logic of steps 1 and 2 is to see how/if these classifiers can predict the class of an article based on the presence or absence of a search term alone.
- Questions?
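For reference, the training sets and tf-idf weighting described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's classification notebook: the sample texts, the label names ("explicit", "implicit", "not"), and the helper names `training_set` and `tfidf` are all hypothetical, and the formula used (term frequency times log(N / document frequency)) is one common tf-idf variant among several. LIWC features are not shown.

```python
import math
from collections import Counter

# Hypothetical hand-labeled articles; texts and labels are invented for illustration.
LABELED = [
    ("explicit", "the humanities crisis and the value of the humanities"),
    ("implicit", "why reading philosophy and the study of history still matters"),
    ("not", "the city council approved the new transit budget"),
    ("not", "the team won the final game of the season"),
]


def training_set(labeled, positive_labels):
    """Return (text, is_humanities) pairs, skipping labels in neither class."""
    pairs = []
    for label, text in labeled:
        if label in positive_labels:
            pairs.append((text, True))
        elif label == "not":
            pairs.append((text, False))
    return pairs


def tfidf(docs):
    """Weight each word in each document by tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Model A trains on explicit articles only; model B adds implicit ones.
set_a = training_set(LABELED, {"explicit"})
set_b = training_set(LABELED, {"explicit", "implicit"})
features_b = tfidf([text for text, _ in set_b])
```

In this toy data, "the" appears in every document of set B, so its weight is zero, while "humanities" appears twice in a single document and gets the largest weight there. That is the offsetting effect the tf-idf bullet describes.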
2. Read articles explicitly about the humanities
- Google form results spreadsheet (NOT orange!)
- Process:
- For those articles marked as being explicitly about the humanities:
- Using the link in the results spreadsheet, read the article.
- If it is indeed explicitly about the humanities as a concept/theme, highlight the results spreadsheet row in yellow (see example row 2 in spreadsheet).
- 3 Readers:
3. Classification of articles NOT about the humanities
- New Google form (NOT about the humanities form is orange! So is top row of its results spreadsheet!)
- Working from 75-topic comparison corpus model
- Only need about 150 articles
- If article is somehow about the humanities (without containing the word humanities), enter it into the humanities form we worked with previously: WE1S Comparison Corpus Top 10 News Sources 2014-2017
- Mark as "does not contain the word humanities" on that form (this is a new edit to the form)
- Volunteers:
- 2017, A-M (bibliography view): Ruth
- 2017, N-Z (bibliography view): Dieyun
4. Goals
- Next UM team meeting is March 19
- Finish classification of articles NOT about the humanities by then
- Finish reading/double-checking articles marked as explicitly about the humanities
- My goal: Get classification notebook written and in working order
Planning for Future Meetings
- All-hands meeting: Mar. 14 (UM spring break) (remote meeting via Zoom -- you are not required/expected to attend, but it is paid time if you want to attend the meeting)
- UM team meeting: March 19, 2 pm, Faculty Exploratory