
Meeting (2019-03-28) UM Comparison Corpus Meeting

Last edited by Lindsay Thomas 4 years, 11 months ago

 

 

Meeting Time:       Thursday, March 28, 2:00 pm (Eastern)

Meeting Location: Faculty Exploratory (library room 305)

 


 

0. Preliminary Business

 

  • April timesheets due to me April 18 

 

Purpose of today's meeting

  • Check in regarding classification tasks, continue classification tasks (yay!) 
  • Report results of initial classification experiments 

 

 

1. Discussion of initial classification experiments

 

 

Classification experiments (using classification script):

  1. Explicit vs. not humanities, LIWC feature sets
    1. 289 explicit (included all that had been marked explicit, including those not yet double-checked)
    2. 382 not humanities 
    3. Naive Bayes: 
      1. Accuracy: .6347
      2. Average F1 score over 10 folds: .5515
      3. Confusion matrix:  https://drive.google.com/file/d/1vReF16da4zdWaLRayTQ0NnVkc74QhpnF/view?usp=sharing 
    4. SVM:
      1. Accuracy: .7485
      2. Average F1 score over 10 folds: .6348
      3. Confusion matrix: https://drive.google.com/file/d/1LxCc0yFj2jUj_fRiyUDQ3FbjE7QAkb_6/view?usp=sharing 
  2. Explicit vs. not humanities, tf-idf feature sets (using pre-processing script to extract feature sets)
    1. 289 explicit (included all that had been marked explicit, including those not yet double-checked)
    2. 382 not humanities 
    3. Naive Bayes:
      1. Accuracy: .4062
      2. Average F1 score over 10 folds: .4903
      3. Confusion matrix: https://drive.google.com/file/d/1CqalT6fNF2THriP3s5WGQrLngrdhIUDI/view?usp=sharing 
    4. SVM not run on this feature set: training would take a very long time without extensive tuning (or without switching to the primal SVM form, which I'm not sure how to implement) 
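
To help read the numbers above, here is a minimal sketch of the majority-class baseline implied by the 289 / 382 split, along with how accuracy and F1 follow from a binary confusion matrix. The class counts come from these notes; the function names and the example confusion-matrix counts are hypothetical and are not the values behind the linked matrices or the classification script.

```python
def majority_baseline(n_pos: int, n_neg: int) -> float:
    """Accuracy obtained by always predicting the larger class."""
    return max(n_pos, n_neg) / (n_pos + n_neg)


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Class counts from the notes: 289 explicit humanities, 382 not humanities.
baseline = majority_baseline(289, 382)
print(f"majority-class baseline accuracy: {baseline:.4f}")

# Hypothetical confusion-matrix counts for a single fold (illustration only).
tp, fp, fn, tn = 20, 8, 9, 30
print(f"accuracy: {accuracy(tp, fp, fn, tn):.4f}")
print(f"F1: {f1_score(tp, fp, fn):.4f}")
```

If the evaluation folds kept roughly this class split, always guessing "not humanities" would score about .57. On that reading, the LIWC SVM (.7485) and LIWC Naive Bayes (.6347) beat the baseline, while the tf-idf Naive Bayes (.4062) falls below it.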

 

LIWC feature comparison:

  • Using rank-sum test to compare ratios of features in one document set to another (feature comparison script): https://drive.google.com/file/d/1yUOmdsMNJKfDeWikS1tmDXv1Xa7CAHuG/view?usp=sharing 
    • "Work" words are over 2x as likely to appear in humanities articles as in not humanities articles 
    • Not humanities articles are almost 2x as likely to quote someone as humanities articles
    • Overall, differences are not as distinct as we see between fiction and non-fiction, or even between genres of fiction (in Piper, for example) 
  • See "LIWC2015 dictionary poster" in GDrive for lists of words in each category
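
As a rough illustration of the rank-sum comparison described above, the sketch below implements a two-sided Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation and tie correction, using only the Python standard library. It is not the feature comparison script linked above; the function name and the per-document ratios are hypothetical, and no continuity correction is applied.

```python
import math


def rank_sum_test(xs, ys):
    """Two-sided rank-sum test (normal approximation, tie-corrected).

    Returns (U statistic for xs, z score, two-sided p-value).
    """
    n1, n2 = len(xs), len(ys)
    combined = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        # Find the run of equal values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 0.0, 1.0
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, z, p


# Hypothetical per-document ratios for one LIWC category (illustration only).
humanities = [0.031, 0.027, 0.040, 0.035, 0.029]
not_humanities = [0.012, 0.018, 0.015, 0.020, 0.011]
u, z, p = rank_sum_test(humanities, not_humanities)
print(f"U = {u}, z = {z:.3f}, p = {p:.4f}")
```

The normal approximation is only reasonable for document sets of moderate size or larger; for very small samples an exact test would be the safer choice.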

 

Implications?:

  • We need more hand-labeled data for testing 
  • Additional experiments:
    • binary bag of words feature sets
    • bigram tf-idf feature sets
    • topic model feature sets 
    • Try with glmnet package, as described here: https://www.r-bloggers.com/text-classification-with-tidy-data-principles/ 
    • Try exact same experiments again, but be more careful about having a balanced training set (equal numbers of humanities and not-humanities articles in training set; this training set was slightly unbalanced).
    • Try "humanities" vs "science" keywords (instead of "humanities" vs "not humanities") across same source groups
    • Try including "not explicit" humanities data in humanities data (vs "not humanities") 
  • Customized stop word lists may also help improve accuracy 
  • It may also simply be difficult to predict the class of "humanities" vs. "not humanities" articles drawn from the same group of sources 
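
One way to approach the balanced training set mentioned above is to downsample each class to the size of the smallest one before training. The sketch below assumes this downsampling approach (other options, such as class weighting, exist); the function name and document ids are hypothetical stand-ins for the 289 / 382 labeled articles.

```python
import random


def balanced_sample(docs_by_label, seed=0):
    """Downsample every class to the size of the smallest class.

    Returns a shuffled list of (document, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    n = min(len(docs) for docs in docs_by_label.values())
    sample = []
    for label, docs in docs_by_label.items():
        for doc in rng.sample(docs, n):
            sample.append((doc, label))
    rng.shuffle(sample)
    return sample


# Hypothetical document ids standing in for the labeled articles.
corpus = {
    "humanities": [f"hum_{i}" for i in range(289)],
    "not_humanities": [f"not_{i}" for i in range(382)],
}
train = balanced_sample(corpus)
counts = {label: sum(1 for _, lab in train if lab == label) for label in corpus}
print(counts)
```

Downsampling discards some of the larger class, which matters when hand-labeled data is already scarce, so it may be worth repeating with several seeds and comparing results.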

 

 

 

 

2. Classification tasks

 

  1. Need more articles classified as being explicitly about the humanities 
    1. NEW top 10 US newspaper model (data NOT included in previous model, 1980 articles): http://harbor.english.ucsb.edu:10001/projects/teams/2018-19-5-comparison-corpus/20190328_0225_humanities-topusnewspapers-50/browser/#/bib 
    2. NEW Google form: https://docs.google.com/forms/d/e/1FAIpQLSev3AIbRTQIzpyxhji5oOH_rEgjLWL6c0OSU6r5xDmkaHZbwA/viewform
    3. Going for 500-750 articles explicitly about the humanities --> stricter criteria of "aboutness" 
    4. Team members taking this on:
      1. 2010-2011: Ashley
      2. 2012: Ruth (after NOT)
      3. 2013: Dieyun
      4. 2014: 
      5. 2015: Suchi
  2. Need more articles classified as being NOT about the humanities
    1. Google form (NOT about the humanities form is orange! So is top row of its results spreadsheet!)
    2. Working from 75-topic comparison corpus model 
    3. Currently at 650; going for 1000 articles explicitly tagged as NOT being about the humanities (1183 articles total in model)
    4. If article is somehow about the humanities (without containing the word humanities), enter it into the humanities form we worked with previously: WE1S Comparison Corpus Top 10 News Sources 2014-2017 
      1. Mark as "does not contain the word humanities" on that form
    5. Team member taking this on: 
      1. Ruth 

 

 

 

 

 

Planning for Future Meetings

 

  • Next UM team meetings:
    • Friday, April 12, 1:00 pm
    • Friday, April 26? 
  • All-hands meeting:
    • Friday, April 12, 2:00 pm 

 

 

 

 

 

 

 
