Meeting (2019-03-28) UM Comparison Corpus Meeting
Meeting Time: Thursday, March 28, 2:00 pm (Eastern)
Meeting Location: Faculty Exploratory (library room 305)
0. Preliminary Business
- April timesheets are due to me by April 18
Purpose of today's meeting
- Check in regarding classification tasks, continue classification tasks (yay!)
- Report results of initial classification experiments
1. Discussion of initial classification experiments
Classification experiments (using classification script):
- Explicit vs. not humanities, LIWC feature sets
- 289 explicit (includes all articles marked explicit, even those not yet double-checked)
- 382 not humanities
- Naive Bayes:
- Accuracy: .6347
- Average F1 score over 10 folds: .5515
- Confusion matrix: https://drive.google.com/file/d/1vReF16da4zdWaLRayTQ0NnVkc74QhpnF/view?usp=sharing
- SVM:
- Accuracy: .7485
- Average F1 score over 10 folds: .6348
- Confusion matrix: https://drive.google.com/file/d/1LxCc0yFj2jUj_fRiyUDQ3FbjE7QAkb_6/view?usp=sharing
- Explicit vs. not humanities, tf-idf feature sets (using pre-processing script to extract feature sets)
- 289 explicit (includes all articles marked explicit, even those not yet double-checked)
- 382 not humanities
- Naive Bayes:
- Accuracy: .4062
- Average F1 score over 10 folds: .4903
- Confusion matrix: https://drive.google.com/file/d/1CqalT6fNF2THriP3s5WGQrLngrdhIUDI/view?usp=sharing
- Did not run SVM with this feature set because it would take a very long time without extensive tuning (or without using the primal SVM formulation, which I'm not sure how to implement); a generic sketch of this kind of setup appears below
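A rough sketch of this kind of experiment, for reference only: it is not our actual classification script, it assumes scikit-learn, and the file and column names are made up. LinearSVC(dual=False) is the primal linear SVM mentioned above, which avoids the long training times of the kernel SVM on tf-idf features.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

df = pd.read_csv("labeled_articles.csv")        # hypothetical file with "text" and "label" columns
X = TfidfVectorizer(max_features=5000).fit_transform(df["text"])
y = (df["label"] == "explicit").astype(int)     # explicit humanities vs. not humanities

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Linear SVM (primal)", LinearSVC(dual=False))]:
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1")
    print(f"{name}: accuracy={acc.mean():.4f}, mean F1 over 10 folds={f1.mean():.4f}")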
LIWC feature comparison:
- Using rank-sum test to compare ratios of features in one document set to another (feature comparison script; a minimal sketch appears after this list): https://drive.google.com/file/d/1yUOmdsMNJKfDeWikS1tmDXv1Xa7CAHuG/view?usp=sharing
- "Work" words are over 2x as likely to appear in humanities articles as in not humanities articles
- Not humanities articles are almost 2x as likely to quote someone as humanities articles
- Overall, differences are not as distinct as we see between fiction and non-fiction, or even between genres of fiction (in Piper, for example)
- See "LIWC2015 dictionary poster" in GDrive for lists of words in each category
Implications?:
- We need more hand-labeled data for testing
- Additional experiments (feature-set sketches appear after this list):
- binary bag of words feature sets
- bigram tfidf feature sets
- topic model feature sets
- Try with glmnet package, as described here: https://www.r-bloggers.com/text-classification-with-tidy-data-principles/
- Try the exact same experiments again, but be more careful about using a balanced training set (equal numbers of humanities and not-humanities articles; this training set was slightly unbalanced).
- Try "humanities" vs "science" keywords (instead of "humanities" vs "not humanities") across same source groups
- Try including "not explicit" humanities data in humanities data (vs "not humanities")
- Customized stop word lists may also help improve accuracy
- It may also simply be difficult to predict the class of "humanities" vs. "not humanities" articles drawn from the same group of sources
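A minimal sketch of the proposed alternative feature sets, assuming scikit-learn; the example texts, parameter values, and stop word handling are placeholders, not settings from our scripts:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder article texts; in practice these would be loaded from the corpus
texts = ["the humanities are in crisis", "funding for stem research is growing"]

# Binary bag of words: 1 if a word appears in a document, 0 otherwise
# (stop_words can also take a customized stop word list, as suggested above)
X_binary = CountVectorizer(binary=True, stop_words="english").fit_transform(texts)

# Bigram tf-idf: unigrams and bigrams weighted by tf-idf
X_bigram = TfidfVectorizer(ngram_range=(1, 2), max_features=10000).fit_transform(texts)

# Topic model features: per-document topic proportions from LDA over raw counts
counts = CountVectorizer(stop_words="english").fit_transform(texts)
X_topics = LatentDirichletAllocation(n_components=5, random_state=42).fit_transform(counts)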
Planning for Future Meetings
- Next UM team meetings:
- Friday, April 12, 1:00 pm
- Friday, April 26?
- All-hands meeting:
- Friday, April 12, 2:00 pm