
Text Analysis Hacker Group Meeting (2018-02-06)

Page history last edited by Samina Gul Ali 7 years, 3 months ago

Meeting Outcomes:

(jump to the notes added after the meeting, at the bottom of the page)

 

Meeting time: Tuesday, February 6, 2018, 1-2pm Pacific. Meeting Zoom URL: https://ucsb.zoom.us/j/942606367
  • PIs: Alan Liu; co-PIs: Jeremy Douglass, Scott Kleinman, Lindsay Thomas
  • Project Manager: Samina Ali

  • UCSB Graduate Student RAs (currently focusing on corpus design & collection strategy)
    • Rebecca Baker (English)
    • Nazanin Keynejad (Comp. Lit.)
    • Giorgina Paiella (English)
    • Aili Peeker (English)
    • Jamal Russell (English)
    • Tyler Shoemaker (English)

 

  • Text-Analysis Hacker Group:
    • Faculty
      • Fermín Moscoso del Prado Martín, Assistant Professor of Linguistics, UC Santa Barbara
    • UCSB Graduate Student RAs
      • Sandra Auderset (Linguistics, UCSB)
      • Devin Cornell (Sociology, UCSB)
      • Nicholas Lester (Linguistics, UCSB)
      • Fabian Offert (Media Arts & Technology, UCSB)
      • Teddy Roland (English, UCSB)
      • Chloe Willis (Linguistics, UCSB) 
    • Other Participants
      • Ryan Heuser (WE1S Advisory Board member; Ph.D. student at Stanford U.)
  •  U. Miami RAs:
    • Samina Ali (English, U. Miami)
    • Tarika Sankar (English, U. Miami)
    • Annie Schmalstig (English, U. Miami)
 
  •  CSUN RA: Sandra Fernandez 

Next meeting (?):  


Preliminaries

 

  • Overview by Alan
    • WE1S Mellon project (Oct. 2017 - Sept. 2020)
    • PIs
    • RAs working on corpus design & collection
    • Text Analysis Hacker Group
      • Self-introductions
      • RAs to be entered into Kronos 
    • Summer 2018 research camp (July 2-August 4) and Advisory Board meeting (August 3-4) 


1. Status of WE1S Project
(quick presentation by Alan)

 

  • Current work:
    • Design the WE1S corpus 
    • Develop the WE1S collection and analysis workflow (for "open, reproducible" data research):
      • Virtual Workflow Management (VWM) system (login required)
      • Manifest System (GitHub repo):
        • "a set of recommendations for the construction of JSON-formatted manifest documents for the WE1S project. These JSON documents can be used as data storage and configuration files for a variety of scripted processes and tools that read the JSON format."
        • "A WE1S manifest is a JSON document that includes metadata describing a publication, process, set of data, or output of some procedure. Manifests can be used for a variety of purposes, but their primary intent is to help humans document and keep track of their workflow." 
        • Manifest types:
          • Publications: Information about the provenance of primary data.
          • Corpus: The data store for primary and generated data.
          • Processes: Metadata about workflow processes used to collect and generate data.
          • Scripts: The data store for scripts and metadata about tools used to implement workflow processes.
        • MongoDB database
      • "Golden spike" moment of integration between the VWM and Manifest systems by late February.
    • Collection work to begin in March
    • Special constraints of collection work imposed by "terms and conditions" of proprietary database sources
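In concrete terms, a manifest of the kinds listed above might look like the following minimal sketch (the field names here are hypothetical illustrations, not the project's actual schema):

```python
import json

# Hypothetical sketch of a WE1S-style JSON manifest; field names are
# illustrative only, not the project's actual schema.
manifest = {
    "name": "sample_publication",
    "manifest_type": "Publication",        # one of the four types above
    "title": "Sample newspaper collection",
    "description": "Provenance information for a set of primary data.",
    "processes": ["collect", "tokenize"],  # workflow steps applied
}

# Serialize to JSON for storage (e.g., as a document in MongoDB).
serialized = json.dumps(manifest, indent=2)
print(serialized)
```

Because manifests are plain JSON, the same document can serve both as a human-readable workflow record and as a configuration file read by scripted processes.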

 

 

2. Text Analysis Hacker Group Goals
(preliminary presentation by Alan, to be followed by general discussion)

 

  • General goals stated in our proposal to the DAHC (Digital Arts & Humanities Commons):
    • "WE1S proposes to create a "Text Analysis Hackerspace" in the DAHC that can serve as the hub for faculty and graduate students with an interest in text analysis methods. Using its grant resources, WE1S will sponsor the participation in the hackerspace of an interdisciplinary group at UCSB with advanced text-analysis and linguistics skills. The group will consist of WE1S-funded graduate-student research assistants and Linguistics faculty member Fermín Moscoso del Prado Martín (with WE1S PI Alan Liu participating). Their role will be to incorporate in the main WE1S workflow a set of experiments in leading-edge text analysis methods; to introduce text analysis at both basic and advanced levels to other students and faculty; and to cross-fertilize with other DAHC and Wireframe projects (e.g., in regard to visualization methods for text analysis or the analysis of social media text). The goal is not just to aid the WE1S project itself (by advancing its research workflow and methods, and by training other research assistants less familiar with text analysis) but to contribute generally to research and teaching at UCSB." 
  • Goals directly related to WE1S:
    • Experiment with advanced or variant kinds of topic modeling
    • Experiment with word embedding
      • Create variant of WE1S collection and analysis workflow for word2vec
    • Suggest linguistics-based lines of investigation (e.g., corpus linguistics, other kinds of computational linguistics?)
    • Build DIY text-analysis workstation or system
      • Current WE1S Mirrormask machine and virtual machine
      • Fabian Offert's prototype neural-network machine based on NVIDIA chips
      • Ryan Heuser's comments on virtual machine solution:
        • " What I've found most useful is a shared machine that people can SSH into, and that has enough disk space and computational power that it can store the entirety of the canonical version of the data, facilitate running tasks on it (e.g. part-of-speech tagging), and then storing alternate versions of the canonical data (e.g. part-of-speech tagged xml files, etc). Lately for us that has meant more of a virtual than an actual machine—i.e. we're now SSHing mainly into a compute cluster at Stanford, and the data lives on a data server, both of which we pay for. The virtual approach has certain advantages, mainly the even greater parallelization (depending on the state of the Stanford-wide queue, ~100 of our processes (again, e.g. part-of-speech tagging texts) can run in parallel). That said, I think in practice the difference between actual vs. virtual machine here is not necessarily great, given that the computational requirements for DH, in my experience, only rarely require that level of parallelization, and the 16-32 CPUs that an actual machine has these days can be easily good enough."
  • Goals for general research communities at UCSB, CSUN, and U. Miami (WE1S partner institutions):
    • Workshops:
      • Topic modeling
      • Word embedding
      • Etc.
    • Documentation of building of DIY machine (including videos)
    • Documentation of text analysis experiments (including videos)
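By way of illustration for the word-embedding goal above: word2vec's skip-gram variant trains on (target, context) word pairs drawn from a sliding window over each document, which can be sketched in a few lines (toy data; the helper name is ours, not part of any WE1S code):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs the way word2vec's
    skip-gram variant samples its training data."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target word itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the humanities shape public discourse".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A full workflow variant would feed such pairs (or, more practically, whole tokenized documents) to an embedding trainer such as gensim's word2vec implementation.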

 

 

 

3. Next Steps
(general discussion)

 

  • [TBD]
  • Text Analysis Hacker Group to meet by itself (with Alan and other PIs not taking a leadership role)?


4. Visit to DAHC


Meeting Outcomes

 

  • Three subgroups will schedule a meeting during the week of Feb 26-March 3 or March 12-16
    • Subgroup One:  Word-embedding (Devin Cornell, Ryan Heuser, Fabian Offert, Teddy Roland, and Fermín Moscoso del Prado Martín; with Scott Kleinman sitting in as a member of our WE1S PI group). This group may also address advanced topic-modeling issues.
    • Subgroup Two:  Linguistics research group (Fermín, Sandra Auderset, Nicholas Lester, and Chloe Willis; with Scott Kleinman also sitting in on the group). (We are likely also to ask Prof. Jack Dubois in Linguistics to sit in.)
    • Subgroup Three: DIY Machine building group (Fabian Offert, Teddy Roland, and Jeremy Douglass, with Ryan Heuser consulting from Stanford)
  • Each of the subgroups will begin brainstorming what they would like to do (research that could help advance or augment the WE1S project, as well as planning introductory and advanced workshops).
    • What are the research directions you would like to pursue that would be interesting and also assist, augment, or complement the WE1S project?
    • What are the introductory and/or advanced workshops you would like to run in the future?
    • For the DIY Machine building group: what do we want to make (specs and rationale), and how much will it cost?
  • One question we will keep in mind as we move forward: How can we create a smaller-scale sample of our collection to use for word-embedding that falls within database "terms and conditions"? Since we cannot store articles, we currently have to immediately create "word bags" to use for topic modelling. What are our options here for word-embedding?
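The "word bag" step mentioned above can happen at collection time, so that full article text is never retained. A minimal sketch (hypothetical helper, not the project's actual pipeline):

```python
from collections import Counter
import re

def to_word_bag(article_text):
    """Reduce an article to token counts, discarding word order
    so the original full text need not be stored."""
    tokens = re.findall(r"[a-z']+", article_text.lower())
    return dict(Counter(tokens))

bag = to_word_bag("The humanities matter. The public humanities matter more.")
print(bag)
```

Because counts discard word order, such bags suffice for topic modeling but not for training embeddings like word2vec, which need word sequences; that is exactly the tension the question above identifies.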
