Meeting (2016-09-15)

Page history last edited by Alan Liu 7 years, 8 months ago

 Meeting Outcomes:

(jump to notebook added after meeting at bottom of page)


  • Next full-team meeting date(s)? -- some options:

    • Thursday, Sept. 29 (11 am, or 1 pm Pacific)
    • Thursday or Friday, Oct. 6 or 7
    • Thursday or Friday, Oct. 13 or 14

Purpose of next meeting:

    • [TBD]
  • RA timesheets

  • RA status in fall (who can/wants to continue)?



1. MongoDB/Manifest Development & Total Workflow Development on File Server / MongoDB / Virtual Machine

  • WE1S workflow stack: (Jeremy's diagram: douglass-we1s-workflow-diagram.pdf)
  • Briefing by Jeremy, Scott, and Tyler on current state of development
  • Nate Diamond's interest in RA'ing in fall (MongoDB, Flask, Python, JS & jQuery, JSON)


2. Corpus Finalization: Metadata (completion date as originally projected: Sept. 12)


3. Corpus Finalization: Creating the Corpus of Articles (can we do so by end of Sept.?)

  • Metadata finalization team to give example CSVs to Tyler to examine.
    • Tyler will make one or more scripts that do the following:
      1. Export the article bodies as plain-text files named by the values of the ID column (e.g., "nyt-2012-h-14")
      2. Store in appropriate folders in tree on Mirrormask filestation space (or storage in the MongoDB/Manifest system if it is ready)
      3. Repeat the above two steps, but include in the plain text files a first line consisting of basic metadata (date, author, title <line break>)
  • Creation and management of the corpus of articles:
    • Use the process developed by Tyler to generate the corpus of articles. Some issues:
      1. Can the process be installed in Jeremy's virtual machine so that various team members can re-run the process in whole or in part in the future via a Jupyter notebook?
      2. Can the workflow for creating the corpus be integrated with that for de-duplication into a single workflow?
      3. If the corpus will live on Mirrormask, do we need a backup method that stores the backup on a different machine?
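
As a rough illustration of the export steps listed above for Tyler's script, here is a minimal Python sketch. It is not the team's actual script. The column names (`id`, `date`, `author`, `title`, `body`), the tab separator in the metadata line, and the output folder layout are all assumptions; the real CSVs from the metadata team may differ.

```python
import csv
from pathlib import Path


def export_articles(csv_path, out_dir, with_header=False, sep="\t"):
    """Write one plain-text file per CSV row, named by the row's ID value.

    Column names here are hypothetical; adjust them to the real CSV headers.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            article_id = (row.get("id") or "").strip()
            if not article_id:
                continue  # skip rows without an ID rather than guess a filename
            body = row.get("body") or ""
            if with_header:
                # Step 3: first line holds date, author, title, then a line break.
                fields = [(row.get(key) or "") for key in ("date", "author", "title")]
                body = sep.join(fields) + "\n" + body
            target = out_dir / (article_id + ".txt")
            target.write_text(body, encoding="utf-8")
            written.append(target)
    return written
```

Calling `export_articles("metadata.csv", "corpus/plain")` and then `export_articles("metadata.csv", "corpus/with-metadata", with_header=True)` would cover steps 1 and 3; both file and folder names are placeholders. Because the function is self-contained, it could be pasted into a Jupyter notebook cell for re-running in whole or in part.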


4. Creation of "Random" Corpus

  • Lindsay's team at U. Miami.
  • Issues: (from Lindsay's email of 9/11/16) 
      • Excerpts from Lindsay's email of 9/11/16:
        "Here are some outstanding questions before I get into the options for proceeding:
        •     How large should the random corpus be? Should we just say it’s going to be x number of articles large, and then divide that total number of articles up by publication, for example? Or should we decide how large it is — and how many articles come from each publication we collect from — according to some other measure?
        •     How many publications should we collect from? The lowest-hanging fruit here is probably the big three — the NY Times, the Washington Post, and the Wall Street Journal — but how important is it to us that we collect from all of our publications to create this randomized corpus?
        •     When do we want/need it done?

        • The answers to these questions, of course, depend in large part on which option(s) we choose for proceeding. So, now for those options:
          • A. Collect from the NY Times using the NY Times API, and from the Washington Post and the Wall Street Journal using ProQuest
            • To collect from the Times, we would follow the same workflow we used to download articles from the NY Times for the main corpus, with some tweaks to the Python script we used (getTimesArticles.py) so that we would get a randomized (or as close as we could come) selection of articles. Scott and I have been emailing a bit about how best to do this, and it seems doable, depending on how large we want the corpus to be.
                        Collecting a randomized corpus of articles from the WP and WSJ using ProQuest is more complicated, because the ProQuest newsstand interface doesn’t allow for random sampling. There is a manual workaround I could imagine doing, however: Let’s say we wanted to collect a random sample of articles published in the WP in the year 1999. If you search for everything published in the WP for that year, you get over 43,000 articles. Since ProQuest only allows you to display and download 100 at a time, this would make exporting every article in HTML form and scraping it using Outwit Hub Pro (the workflow we used to get articles from ProQuest for the main corpus) prohibitive, to say the least. However, we could use a random number generator to select x dates for us, and then limit our search in ProQuest to just those dates for any particular year (say, for example, Feb 2, 1999; April 23, 1999; August 12, 1999; Sept 1, 1999; and November 17, 1999). We could then download the first y number of articles that appear for those dates, depending on how many articles we wanted to download for each year. This would, I think, make our article selection more random — there would still be some bias toward the first y number of articles on any date — for any given year, and it would also give us a manageable number to deal with in terms of scraping. This would be somewhat labor-intensive, but if we decide our randomized corpus can be smaller, then I think it’s doable in the time we have.
          • B. Collect only from the NY Times
            • If we thought the ProQuest collection process would take too long or would be too biased, we could decide only to collect from the NY Times. This would mean our randomized corpus would be NY Times-specific, of course, but articles from the NY Times make up the largest portion of our main corpus anyway.
          • C. Collect from all of our publications
            • These other publications include USA Today, NPR, the Guardian, and the New Yorker. We can use Academic Search Premier to collect from the New Yorker, which would mean it would involve a process similar to the process we would use for the ProQuest publications (WP and WSJ). USA Today, NPR, and the Guardian all have their own API’s, so I would need to do some investigating to determine how easy or difficult it would be to do a randomized sampling of articles from those publications.
      • Current state of the discussion after emails back and forth:
        • Alan: also a need for a "super-quick-and-dirty" random corpus
        • Lindsay & Scott: NY Times.
        • Teddy:
          Thinking about Alan’s question, I second (third?) the notion that training on a symmetric corpus of “humanities” and “non-humanities” texts is on surer statistical footing. If we hope to identify topics that occur more often in humanities over control articles (or vice-versa) then it is most rigorous to train on both types of articles — and test topic distributions on equal sized hold-out sets. But I’d also like to qualify it by saying that it depends on the research question that we hope to ask and how different we think humanities articles are from non-humanities ones.

          Drilling down through the statistics, there are interpretive questions that get encoded in such corpus balancing. Training on humanities+control articles asks an essential question about the boundaries around humanities discourse. What lies outside discussions of the humanities? Perhaps that negative space is integral to our understanding of humanities discourse. In particular, where we look for an outside to humanities discourse will shape the model itself. To put a fine point on it: Are we interested in knowing that obituaries mention the humanities more than other genres of article or do we want to know how humanities obituaries compare to non-humanities obituaries? Which one of these we are interested in will determine what control articles need to be collected.

          On the other hand, there is a looming empirical question whether a model trained on humanities articles alone would construct topics differently or whether it might cluster articles differently. In a sense we are judging the difference between humanities-on-its-own-terms and humanities-in-the-world. It is possible that these will not be very different across these two models, since topics are learned by their relative proportions in documents. For example, it is possible that a modeled obituary topic would be more-or-less the same whether it has been distinguished from articles that also mention a passion for the humanities vs say membership at the Elks lodge. I qualify these statements as possibilities because I believe they touch on a few open questions about topic modeling in general. It would be worth asking a computer scientist who does LDA for their intuition.

          By the same token, answering for ourselves whether we think those possibilities are true or not will help us to identify the parameters of the study. If — and this is a big if — we believe humanities obituaries are comprised of the same topic distributions as non-humanities obituaries, albeit with a humanities flavor, then we might be able to use a humanities-only corpus while training and simply use a small control corpus (~350 articles of each type) to identify which topics are humanities-specific. Even if we think this is only partly true, performing this test on the quick and dirty control corpus would probably be enlightening.

          I realize that these suggestions mean possible additional labor, so I’ll be clear that I think we could do some robust work with a control corpus of ~2000 NYT articles. Not sure if that is a large or small number based on previous discussion, and of course the more, the merrier. If we pin down a few specific tests it will simply be a matter of feeding those into the pipeline along with a subset of the existing WE1S corpus. In terms of the research we plan to do, I could imagine this humanities+control corpus being a first phase that then raises questions we hope to answer by topic modeling the full WE1S corpus.
        • Alan:
          My best guess is that we're going to be going round and round a hermeneutical circle between making empirical discoveries about our corpus and reshaping/redirecting our interpretive questions. So, for example, if we compare what is "inside" and what is "outside" the domain of articles mentioning the humanities and discover there is a difference, then that suggests a line of interpretive questions as follow-ups. If we don't find a difference, then that redirects us toward questions like this focused internally on the humanities articles: what are the top topics or clusters of topics? How are they relatively weighted? How do they compare in different papers? different years?  -- i.e., questions that get at what the structure of discourse about the humanities is.

          Finding no difference between our WE1S corpus and a control corpus would also lead us to experiments in constraining the WE1S corpus until we see a difference. I suspect that if we used a set of features (e.g., collocates) to identify the articles in the WE1S corpus that "focialize" on the humanities (e.g., Fish's Opinionator op-eds on the "humanities crisis"), then that subset is highly likely to show differences from the control set. But we won't know until we try the experiment.
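
The random-date workaround for ProQuest described in Lindsay's email above (pick x dates per year with a random number generator, then search only those dates) could be sketched as follows. The function name and the `seed` parameter are illustrative assumptions. The sketch only chooses dates; it does not remove the bias toward the first y articles returned for each date, which the email already notes.

```python
import random
from datetime import date, timedelta


def sample_dates(year, count, seed=None):
    """Return `count` distinct dates from `year`, drawn uniformly at random, sorted."""
    first = date(year, 1, 1)
    days_in_year = (date(year + 1, 1, 1) - first).days
    if count > days_in_year:
        raise ValueError("cannot sample more dates than the year contains")
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    offsets = rng.sample(range(days_in_year), count)
    return sorted(first + timedelta(days=offset) for offset in offsets)
```

For example, `sample_dates(1999, 5, seed=42)` yields five dates in 1999 to use as search limits. Recording the seed alongside the corpus would let the team document exactly how the dates were chosen.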


5. Create Initial Topic Models & Conduct Initial Interpretation (by mid October?)
     --cf. our topic modeling rehearsal of Jan. 18, 2016
     --when the "random" corpus is ready, it will need to be included in the same topic model(s)

  • (A) Create some topic models (using our existing scrubbing fixes & stopwords list):
    • Attempt to create initial topic models of our entire WE1S corpus. (Later, when the "random" corpus is ready, it will need to be included in the same topic model.)
    • If topic modeling the whole corpus is not viable at this point, then take a sample of the corpus and create initial topic models:
      • How big a sample? What to sample?
      • How many topics? -- 20, 30, 40, 50, 100, 200?
  • (B) Prepare topic models for interpretation:
    • Input into DFR-browser
    • Additional steps?
      • Create topic word-clouds
      • Prepare cluster analyses (dendrograms via Lexos)
  • (C) Conduct a first-pass study & discussion of the topic models in seminar style (to get a rough sense of what we have, how we want to vary or diversify our topic modeling, etc.):

    • (1) Before the seminar meeting, we as a group prepare by "reading" the topic models:
      • We explore in DFR-browser (supplemented by clustering visualizations)
      • We do a "ground truth" grokking of selected topics by human-reading some articles containing those topics.
      • We identify some top topics or interesting topics.
    • (2) We delegate to individuals in our group to study and report on the following issues:
      • Assess the optimal number of topics.
      • For selected topics, examine
        • the representation of that topic in selected articles that focus on the humanities
        • the representation of that topic in selected articles in the whole WE1S corpus (excluding those in the random corpus)
        • the representation of that topic in the "random" corpus.
      • Assess if we need to factor out the following kinds of articles that could potentially blur things together:
        • Event listings
        • School listings, etc.
    • (3) We do an experiment with time-series topic modeling
  • Based on the above first-pass interpretive study and discussion, implement a more systematic process of interpretation of the sort outlined in the following draft plan for a systematic workflow of topic model interpretation:

    Human Processes

    (Typical procedure for steps requiring judgment will be to have a panel of three or more people perform the step and compare results. The hermeneutical rhythm will typically consist of iterative cycles of observations suggesting research questions, and research questions suggesting ways to sharpen observation.)

    Machine Processes

    (We may be able to automate some steps and sequences)

    Assess topic models to determine appropriate number of topics. We may decide to generate one, two, or three numbers of topics for simultaneous interpretation.
    (Questions: Can we define criteria for "best" topic model? Do we know any published theory or methods for choosing right number of topics? Cf. Scotts issues for discussion.pdf)

    Generate topic models at many levels of granularity--e.g., 25, 50, 150, 200, 250, 300, 350, 400, 450, 500
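
One way to generate models at several granularities is to build one Mallet `train-topics` command per topic count. The sketch below only constructs the command lines; it assumes `mallet` is on the PATH and that the corpus has already been imported into a Mallet input file. The file names and the `out_prefix` parameter are hypothetical.

```python
def mallet_commands(input_file, topic_counts, out_prefix="model"):
    """Build one `mallet train-topics` argument list per requested topic count."""
    commands = []
    for k in topic_counts:
        commands.append([
            "mallet", "train-topics",
            "--input", input_file,
            "--num-topics", str(k),
            # Output names are placeholders; one keys file and one
            # composition file are written per model.
            "--output-topic-keys", f"{out_prefix}-{k}-keys.txt",
            "--output-doc-topics", f"{out_prefix}-{k}-composition.txt",
        ])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` to train the models in sequence, for instance with `mallet_commands("we1s.mallet", [25, 50, 150, 200])`.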

    Initial analysis of topic models.
    1. Assign labels for topics (assisted by automated process suggested at right).
    2. Identify and label any major clusters of topics. 
    3. Flag for attention any illegible topics.
    Assemble materials to facilitate interpretation:
    1. Create sorted keys files in a spreadsheet.
    2. Create topic cloud visualizations.
    3. Create clustering visualizations
      (Testing phase: compare a human-panel-only clustering with a machine clustering of topics)
    4. Assess "nearness" of topics (We don't yet have a method to do this; but cf. Goldstone DFR Browser "scaled view")
    5. If possible, auto-label topics with the top 2-4 most frequent words in a topic (based on an algorithm that establishes a minimum threshold of proportional frequency and decides what to do if there are one, two, three, or four top words that are, or are not, significantly more important than others.)
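
The auto-labeling idea in step 5 could be prototyped along these lines. This is one possible reading of that step, not a settled algorithm: a word is kept if its share of the topic's total weight meets a minimum threshold, at most four words are kept, and the top two are used as a fallback when fewer than two pass. The threshold value, the separator, and the input format (a list of word and weight pairs for one topic) are assumptions.

```python
def auto_label(word_weights, min_share=0.05, min_words=2, max_words=4):
    """Label a topic with its top words whose proportional weight meets a threshold.

    word_weights: list of (word, weight) pairs for a single topic (hypothetical format).
    """
    total = sum(weight for _, weight in word_weights)
    if total <= 0:
        return ""
    ranked = sorted(word_weights, key=lambda pair: pair[1], reverse=True)
    # Keep up to max_words words that each carry at least min_share of the weight.
    chosen = [word for word, weight in ranked[:max_words] if weight / total >= min_share]
    if len(chosen) < min_words:
        # Fallback: no clear leaders, so use the top min_words words regardless.
        chosen = [word for word, _ in ranked[:min_words]]
    return " / ".join(chosen)
```

A human panel would still want to review these labels, since the most frequent words in a topic are not always the most descriptive ones.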

    Detailed analysis of topic model (part I: total corpus, synchronic analysis).

    1. Study major topics and clusters of topics.
    2. Human panel reads sample articles and compares to the topic proportions found in the topic-counts.txt file created by Mallet. (This is a sanity check.)
    3. Human panel writes up analytical notes and observations, and compares.
    4. Members of the human panel write up a report.



    Detailed analysis of topic model (part II: comparative analysis).

    1. Study major correlations/differences between any two or three parts of our corpus of interest.
    Create view of topic model that compares two or more parts of our corpora (e.g., NY Times vs. The Guardian) for the topics and topic weights they contain. We don't yet have an interface or method of using the composition.txt files produced by Mallet to do this. (cf. Goldstone DFR Browser "document view," which shows topics in a single document) (Alan's experiment)
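
A comparison of two parts of the corpus could start from per-document topic proportions, averaged by group. The sketch below assumes those proportions have already been parsed into a dictionary mapping document ID to a list of topic weights; parsing the Mallet composition file is left out because its layout varies by Mallet version. Grouping by the first field of the ID (e.g., "nyt") is also an assumption.

```python
from collections import defaultdict


def mean_topic_weights(doc_topics, group_of):
    """Average topic proportions per group.

    doc_topics: dict of doc_id -> list of topic proportions (hypothetical format).
    group_of: function mapping a doc_id to a group label such as a publication.
    """
    sums = {}
    counts = defaultdict(int)
    for doc_id, weights in doc_topics.items():
        group = group_of(doc_id)
        if group not in sums:
            sums[group] = [0.0] * len(weights)
        for index, weight in enumerate(weights):
            sums[group][index] += weight
        counts[group] += 1
    return {g: [s / counts[g] for s in totals] for g, totals in sums.items()}


def largest_differences(means, group_a, group_b, top_n=5):
    """Return (topic index, mean difference) pairs, largest absolute gap first."""
    pairs = zip(means[group_a], means[group_b])
    diffs = [(index, a - b) for index, (a, b) in enumerate(pairs)]
    return sorted(diffs, key=lambda pair: abs(pair[1]), reverse=True)[:top_n]
```

With `group_of=lambda doc_id: doc_id.split("-")[0]`, the result of `largest_differences(means, "nyt", "guardian")` would list the topics on which the two publications diverge most, assuming "guardian" is the ID prefix used for that paper. A difference in means is only a starting point; it says nothing about statistical significance.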

    Detailed analysis of topic model (part III: time-series analysis).

    1. Study trends in topics.
    Create views of the topic model that show trend lines of topics (created by showing weights of topics in documents at time 1, followed by time 2, etc.). We don't yet have a method or tool for this, but cf. the following time-series views in the Goldstone DFR Browser: topics across years | topics within a single year. (See also: demo visualization of topics in State of Union addresses; the TOM code demos; Robert K. Nelson, "Mining the Dispatch") (Alan's experiment)
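
Until a tool is chosen, a trend line for one topic could be approximated by averaging that topic's weight over the documents of each year. This sketch assumes document IDs follow the pattern of the example ID "nyt-2012-h-14", with the year as the second hyphen-separated field, and that topic proportions are already loaded as a dictionary of document ID to a list of weights. Both are assumptions about data the team has not yet finalized.

```python
from collections import defaultdict


def year_from_id(doc_id):
    """Extract the year, assuming IDs look like 'nyt-2012-h-14'."""
    return int(doc_id.split("-")[1])


def topic_trend(doc_topics, topic_index):
    """Return (year, mean weight of the topic) pairs in chronological order."""
    by_year = defaultdict(list)
    for doc_id, weights in doc_topics.items():
        by_year[year_from_id(doc_id)].append(weights[topic_index])
    return [(year, sum(values) / len(values)) for year, values in sorted(by_year.items())]
```

The resulting list of pairs can be fed to any plotting library to draw the trend line for that topic.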

    Write up results:

    1. Create key observations and data set and publish (with a catchy title like "Humanities in Public Discourse: The Manual").
    2. Co-author white paper with recommendations for humanities advocacy.
      1. Create subset of above as a brochure or infographic
    3. Disseminate research methods and conclusions in academic talks, panels, papers, articles.
  • Create "starter package" of topic models and anthology of representative articles for undergraduate Winter-Spring CRG (UCSB English Dept. "collaborative research grant" group of students on stipend and independent course credit to work together under guidance of a grad student on stipend).








Meeting Outcomes (to-do's in red)

  • Next full-team meeting: Friday, Oct. 7th, 2016 at 1pm Pacific (pending possible adjustment if we are not ready for topic modeling by that time)
  • RA availability in fall at UCSB: Jamal and Billy will continue; Sydney may continue at a lower level of activity; Tyler will decide if he can continue (Alanna & Chris TBD)
  • Backend Development Work: Scott, Jeremy, and Tyler will work together on finalizing the "sausage-making machine" (the full WE1S workflow for producing a corpus). Alan will check in with this sub-team in about a week to get a better sense of when we can expect to "make sausage". Specific to-dos at present:
    • Script for exporting article bodies with select metadata (as specified above here).
    • Creation of query & export UI for MongoDB and Manifest (if possible)
    • Integration with de-duping process and processes for topic modeling and inputting into DFR-browser.
  • Production of Random Corpus: Lindsay, Samina, and Annie will create a "random" corpus of approx. 2,000 articles. (Discussion is ongoing about selection principles and scraping process.) Alan will check in with them in about a week to see if the current completion timeline of about a month is feasible.
  • Topic Modeling and Interpretation: We will create topic models of 20, 30, 40, 50, 100, and 200 topics from some combination of the following materials (depending on feasibility in a few weeks): full WE1S corpus, partial WE1S corpus, random corpus. If it looks possible, we will devote the next team meeting to "reading" a topic model together as outlined above here. If it doesn't appear we will be ready by the Oct. 7th date of the next meeting, then we will devote that meeting to continuing business and set a next date for the topic modeling interpretation meeting.
  • (Possible future Stanford collaboration: Alan, Jeremy, and Ryan will keep in touch about possible ways to extend the Stanford/UCSB grad student exchange program, possibly with WE1S as an activity.)

































