

Corpus Design and Research Group Meeting (2018-03-02)

Page history last edited by Samina Gul Ali 6 years, 2 months ago

 Meeting Outcomes:

(jump to notebook added after meeting at bottom of page)


Meeting time: Friday, March 2, 2018, 12-1pm Pacific
Meeting Zoom URL: https://ucsb.zoom.us/j/902248211
  • PI: Alan Liu; co-PIs: Jeremy Douglass, Scott Kleinman, Lindsay Thomas
  • Project Manager: Samina Ali




Corpus Design & Research Group

  •  UCSB Graduate Student RAs (currently focusing on corpus design & collection strategy)
    • Rebecca Baker (English)
    • Nazanin Keynejad (Comp. Lit.)
    • Giorgina Paiella (English), WE1S Analyses & Reports Editor
    • Aili Peeker (English)
    • Jamal Russell (English)
    • Tyler Shoemaker (English)


  •  U. Miami RAs:
    • Samina Ali (English, U. Miami), WE1S Project Manager
    • Tarika Sankar (English, U. Miami)
    • Annie Schmalstig (English, U. Miami)
  •  CSUN RA: Sandra Fernandez 

Next meeting (?):  



Text-Analysis Hacker Group

  • UCSB Faculty
    • Fermín Moscoso del Prado Martín, Assistant Professor of Linguistics, UC Santa Barbara
  • UCSB Graduate Student RAs
    • Sandra Auderset (Linguistics, UCSB)
    • Devin Cornell (Sociology, UCSB)
    • Nicholas Lester (Linguistics, UCSB)
    • Fabian Offert (Media Arts & Technology, UCSB)
    • Teddy Roland (English, UCSB)
    • Chloe Willis (Linguistics, UCSB) 
  • Other Participants
    • Ryan Heuser (WE1S Advisory Board member; Ph.D. student at Stanford U.)

Next meeting (?):  



Preliminary Business


  • FYI: WE1S T-Hackers group meeting Tuesday, March 13th, 1pm
  • Next C-Hackers group meeting(s):
    • Meeting similar to the above T-Hackers meeting during first week of spring quarter at UCSB? (April 2-6) (When does UM start up again after their break?) 
    • Workshop for learning and debugging the search/download workflows for databases (week of April 9-13?)
  • Scott giving a topic-modeling workshop at UCI on Friday April 13th 
  • Call for RAs for WE1S UCSB "summer research camp" coming soon. Current plan for camp:
    • Five weeks July 2-Aug. 5 (culminating in meeting with the WE1S Advisory Board)
    • Interdisciplinary teams of RAs 
    • Each week: 4 days
    • Each day: about 5 hours, split between collection work and higher-level research work 
    • Some flexibility for RAs who need to work less or more time, or who are away some days  


1. Corpus Representativeness: Facets totals



  • Other links:
    • WE1S Corpus Collection Form
    • WE1S Corpus Collection List Form
    • WE1S Corpus Collection List (current)
    • Deprecated version of corpus collection list
    • Trello Board for current tasks
    • Areas of Focus




2. Corpus Representativeness: Research on corpus collection and selection


  • Presentation by Lindsay on results of research into corpus collection and selection  
    • How do we know if our corpus is representative? 
      • Totaling facets 
      • What others have done: previous work in communications/journalism studies/linguistics on newspaper corpora 
    • Problems we face in determining whether or not our corpus is representative:
      • "Representative" of what? No de facto measure of representativeness exists for our question/project
        • We actually don't know what constitutes public discourse on the humanities 
      • Bespoke corpora
        • Vs. corpora for linguists, like BYU's NOW corpus
          • Costs money to access, but something we could consider for our project for a reference corpus
        • Apart from BYU's NOW corpus, no shareable, general-purpose corpus of contemporary journalistic sources currently exists 
      • Technical issues will likely limit our corpus significantly
    • Corpus collection as an intentionally iterative process: Hypothesis is that we will only understand what a "representative" corpus looks like once we start actually collecting and analyzing data.
      • Identify phases of collection, which we will revise as we go
      • Use analysis of the data to revise these phases and to help us determine what to collect next. Do the same patterns hold when we examine different phases or "levels" of the corpus, or do very different patterns emerge depending on what kinds of sources we examine?
      • Need a plan for what constitutes "phase 1" collection, "phase 2" collection, etc. 
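
One way to picture the "totaling facets" check mentioned above is a simple count of how many sources fall under each facet value. The sketch below is only an illustration: the facet names ("region", "medium"), the sample records, and the helper functions are hypothetical and are not taken from the WE1S collection list or any WE1S tool.

```python
from collections import Counter

# Hypothetical source records; facet names and values are illustrative only.
sources = [
    {"title": "Source A", "region": "national", "medium": "newspaper"},
    {"title": "Source B", "region": "regional", "medium": "newspaper"},
    {"title": "Source C", "region": "national", "medium": "broadcast"},
]

def facet_totals(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(r.get(facet, "unknown") for r in records)

def facet_shares(records, facet):
    """Return each facet value's share of all records (empty dict if no records)."""
    totals = facet_totals(records, facet)
    n = sum(totals.values())
    if n == 0:
        return {}
    return {value: count / n for value, count in totals.items()}

region_totals = facet_totals(sources, "region")
medium_shares = facet_shares(sources, "medium")
```

Comparing such totals across collection phases is one possible way to see whether a facet is over- or under-represented; what counts as "balanced" remains an open question, as noted above.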



3. Plan for collection


  • Brainstorming an initial plan for the first couple of collection phases 
    • Phase 1?
      • Phase 1.1?
      • 1.2?
    • Etc.



4. Next Steps

  • Individual tasks (review and roundup of tasks that surfaced during this meeting)
  • Next Meetings?
    • Search/download workflow practicum 
    • Intro to the WE1S Virtual Workflow Manager/Manifest system 




Meeting Outcomes


  • Alan to schedule:
    • C-Hackers meeting during week of April 2-6 for Scott & Jeremy's presentations of the WE1S workflow platform.
    • Workshop to debug database search/download workflows
  • Everyone:
    • create workflow pages on databases (to be posted on PBworks)
    • start thinking about "tags" for our Zotero library re: academic research on corpus collection
  • U. Miami group:
    • start collection "phases" planning document
      • our initial thoughts for phase 1: (a) big national newspapers for at least one year, 2017, (b) some newspapers from Giorgina's area of focus to allow us to experiment with researching issues related to ethnic/gender groups
    • make the metrics data sheet update automatically
    • Samina:
      • update the project masthead with Giorgina's title as "Analyses and Reports Editor"
      • work with Lindsay to link WE1S spreadsheets
      • gather academic sources on news corpus collection
      • for the area of focus report, complete two for the Caribbean: one for Spanish and one for English
  • Naz to start a Zotero group library (reminder to everyone: set up a Zotero account)
  • Giorgina to add question about phasing to the area of focus report template, and to create deadline for initial work on the report
  • Tyler to send Jeremy the current guess about the format of the LexisNexis download
  • Aili to look into transcripts of broadcasts and how much we can feasibly collect



