Meeting (2019-03-28) UM Comparison Corpus Meeting

Meeting Time: Thursday, March 28, 2:00 pm (Eastern)

Meeting Location: Faculty Exploratory (library room 305)

0. Preliminary Business

Purpose of today's meeting

1. Discussion of initial classification experiments

Classification experiments (using classification script):

  1. Explicit vs. not humanities, LIWC feature sets
    1. 289 explicit (all articles marked explicit so far, including those not yet double-checked)
    2. 382 not humanities 
    3. Naive Bayes: 
      1. Accuracy: .6347
      2. Average F1 score over 10 folds: .5515
      3. Confusion matrix: https://drive.google.com/file/d/1vReF16da4zdWaLRayTQ0NnVkc74QhpnF/view?usp=sharing
    4. SVM:
      1. Accuracy: .7485
      2. Average F1 score over 10 folds: .6348
      3. Confusion matrix: https://drive.google.com/file/d/1LxCc0yFj2jUj_fRiyUDQ3FbjE7QAkb_6/view?usp=sharing 
  2. Explicit vs. not humanities, tf-idf feature sets (using pre-processing script to extract feature sets)
    1. 289 explicit (all articles marked explicit so far, including those not yet double-checked)
    2. 382 not humanities 
    3. Naive Bayes:
      1. Accuracy: .4062
      2. Average F1 score over 10 folds: .4903
      3. Confusion matrix: https://drive.google.com/file/d/1CqalT6fNF2THriP3s5WGQrLngrdhIUDI/view?usp=sharing 
    4. SVM: not run with this feature set, because training would take a very long time without extensive tuning or without switching to the primal SVM form (which I'm not sure how to implement)
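
On the primal SVM form mentioned in the last item: below is a minimal sketch, in plain Python, of one common way to train a linear SVM directly on the primal objective (Pegasos-style stochastic gradient descent on the hinge loss). It is not the classification script used for the experiments above. The function names, the toy documents, and the values of lam and epochs are all hypothetical choices for illustration. If the classification script happens to be built on scikit-learn (an assumption; the notes do not say), LinearSVC with dual=False is the ready-made primal solver and would likely be the simpler route.

```python
import random


def score(w, doc):
    """Dot product w.x for one sparse document {term: weight}."""
    return sum(w.get(term, 0.0) * value for term, value in doc.items())


def train_primal_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD on the primal objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * w.x_i)).

    X: list of sparse documents, each a dict {term: weight} (e.g. tf-idf)
    y: list of labels, +1 (explicit) or -1 (not humanities)
    Returns the weight vector as a dict {term: weight}.
    """
    rng = random.Random(seed)
    w = {}
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * score(w, X[i])
            # Regularization step: shrink every weight by (1 - eta * lam) = 1 - 1/t.
            shrink = 1.0 - 1.0 / t
            for term in w:
                w[term] *= shrink
            # Hinge-loss step: only examples inside the margin move w.
            if margin < 1.0:
                for term, value in X[i].items():
                    w[term] = w.get(term, 0.0) + eta * y[i] * value
    return w


def predict(w, doc):
    """Return +1 (explicit) or -1 (not humanities)."""
    return 1 if score(w, doc) >= 0 else -1


# Toy data (hypothetical, not from the corpus): the two classes use disjoint terms.
docs = [
    {"humanities": 0.9, "literature": 0.4},
    {"humanities": 0.7, "philosophy": 0.6},
    {"election": 0.8, "budget": 0.5},
    {"budget": 0.6, "stadium": 0.7},
]
labels = [1, 1, -1, -1]

weights = train_primal_svm(docs, labels)
predictions = [predict(weights, d) for d in docs]
print(predictions)
```

Each update touches only the terms present in one document, so the cost per step scales with document length rather than with the full vocabulary, which is the usual reason the primal form is attractive for sparse tf-idf features. Whether it would be fast enough on the full feature set here is untested.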

LIWC feature comparison:

Implications?:
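
One reference point that may help frame this discussion is the majority-class baseline: the accuracy of a classifier that always predicts the larger class. The sketch below computes it from the class counts reported above (289 explicit, 382 not humanities) and compares it with the reported accuracies. It assumes the reported accuracies were computed over the same 671 articles; the function and variable names are illustrative only.

```python
# Class counts and accuracies as reported in the notes above.
n_explicit = 289
n_not_humanities = 382

reported_accuracy = {
    "LIWC + Naive Bayes": 0.6347,
    "LIWC + SVM": 0.7485,
    "tf-idf + Naive Bayes": 0.4062,
}


def majority_baseline(counts):
    """Accuracy of always predicting the largest class."""
    return max(counts) / sum(counts)


baseline = majority_baseline([n_explicit, n_not_humanities])

for name, acc in reported_accuracy.items():
    # Positive difference: better than always guessing "not humanities".
    print(f"{name}: {acc:.4f} ({acc - baseline:+.4f} vs. baseline {baseline:.4f})")
```

Under that assumption the baseline is about .5693, so both LIWC runs sit above it while the tf-idf Naive Bayes run sits below it. That may point to a problem in the tf-idf pipeline rather than in the features themselves, though the notes alone cannot confirm this.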

2. Classification tasks

  1. Need more articles classified as being explicitly about the humanities 
    1. NEW top 10 US newspaper model (data NOT included in previous model, 1980 articles): http://harbor.english.ucsb.edu:10001/projects/teams/2018-19-5-comparison-corpus/20190328_0225_humanities-topusnewspapers-50/browser/#/bib 
    2. NEW Google form: https://docs.google.com/forms/d/e/1FAIpQLSev3AIbRTQIzpyxhji5oOH_rEgjLWL6c0OSU6r5xDmkaHZbwA/viewform
    3. Going for 500-750 articles explicitly about the humanities --> stricter criteria of "aboutness" 
    4. Team members taking this on:
      1. 2010-2011: Ashley
      2. 2012: Ruth (after finishing the NOT about the humanities task)
      3. 2013: Dieyun
      4. 2014: 
      5. 2015: Suchi
  2. Need more articles classified as being NOT about the humanities
    1. Google form (the NOT about the humanities form is orange! So is the top row of its results spreadsheet!)
    2. Working from 75-topic comparison corpus model 
    3. Currently at 650; going for 1000 articles explicitly tagged as NOT being about the humanities (1183 articles total in model)
    4. If an article is somehow about the humanities (without containing the word "humanities"), enter it into the humanities form we worked with previously: WE1S Comparison Corpus Top 10 News Sources 2014-2017
      1. Mark as "does not contain the word humanities" on that form
    5. Team member taking this on: 
      1. Ruth 

Planning for Future Meetings