
Meeting 17 (2015-08-05)

Page history last edited by Alan Liu 8 years, 10 months ago

Progress to Date (and Future Scheduling)


  • Status Reports (Developer Task Assignments)
    • NY Times
    • WSJ
    • The Guardian
    • NPR
    • Other sources we're researching (Document Sources)
      • Possibilities We Considered: 
        * Other U.S. cities (e.g., Washington Post, LA Times, Chicago Tribune)?
        * Online news/popular media (e.g., Huff Post, Salon)
        * TV/Radio media
        * Middlebrow (e.g., USA Today)
        * Magazines (e.g., New Republic, LA Review of Books)?
        * Economic press (e.g., Forbes, Business Insider, The Economist)?
        * Higher-education press (e.g., Chronicle of Higher Education, Higher Ed)?
        * Student papers (e.g., Harvard Crimson, Yale Daily News, UCLA Bruin)?
        * Commencement speeches.
        * Articles on "sciences"?
        * Social media
      • Priorities We Decided On:
        1. First priority:
          1. Other nations: UK, Canada (Australia, New Zealand, India)
          2. At least one other U.S. city
          3. Born-digital publications (e.g., Huff Post, Salon, etc.)
        2. Second priority:
          1. At least one higher-ed publication (e.g., Chronicle of Higher Education, Higher Ed)
          2. At least one source from the economic press (e.g., Forbes, Business Insider, The Economist)
          3. Student newspapers (e.g., Harvard Crimson, Yale Daily News, UCLA Bruin)



Discussion of Scrubbing and Other Preprocessing Steps


  • Current Scrubbing List ("List of Fixes Needed for Raw Texts")
  • Issue: We can't try to add everything possible in detail to the scrubbing list. What is the right way to think about this?
    • Example: tokenizing (consolidating) compound phrases and organization names
    • Lindsay's topic modeling experiments with pre- and post-scrubbed NYT files:
      • [Email from Lindsay of 5 Aug 2015:]

        I took some time yesterday to experiment with topic modeling to see what difference tokenization makes on the resulting models. Here's what I did:

            I modeled just the "humanities" articles from the NYT from 2014 using 100 topics with an optimization interval of 20 and no tokenization. The keys file is attached here ("nyt_2014_h_pre_100_keys.txt", nyt_2014_h_pre_100_keys.xlsx). The results are broadly similar to yours.
            I used Scott's python script to tokenize that same corpus, adding additional words to the script from our pbworks page (Note: I only did tokenization and the very minimal punctuation fixes included in the string here; I did not add additional stop words to MALLET's default stop word list). The keys file is attached here ("nyt_2014_h_post_100_keys.txt", nyt_2014_h_post_100_keys.xlsx). The results are fairly similar to those attained prior to tokenization. One example where tokenization seems to have made a difference, though, is topic 58, which is clearly about Israel, Palestine and the American Studies Association's boycott. This topic shows up as very coherent in the post-tokenization model (where "American Studies Association" has been tokenized), whereas it's more dispersed in the pre-tokenization model.
            I then decided to see if tokenization had an even more noticeable effect on a larger corpus. So I decided to model all of the "humanities" articles from the NYT from 1996-2014 (the most complete set of data we have as of yet from the NYT). This is a corpus of about 3400 articles. I used 350 topics (a guess, based on Jockers's model of 3300-ish novels) with an optimization interval of 20. The keys file, pre-processing, is also attached here ("nyt_1996-2014_h_pre_350_keys.txt", nyt_1996-2014_h_pre_350_keys.xlsx).
            I then modeled the same corpus but used Scott's python script to tokenize. The keys file is attached ("nyt_1996-2014_h_post_350_keys.txt", nyt_1996-2014_h_post_350_keys.xlsx). It's a bit tough to see any major differences just quickly looking through such a large number of topics, but one thing I noticed in the post-tokenized model is topic 74, which groups tokens like "story;advertisement;continue," "story;related," and "story;the" together. I might call this topic something like "next page and continue reading links." These types of words are much more dispersed throughout the non-tokenized model, and do not constitute a coherent topic, as far as I can tell.

        So, while I would need to spend much more time on this to know for sure, it seems to me that tokenization has a noticeable effect on the topic coherence of the models, perhaps even indirectly (for topics that depend less on tokenized words, like the "next page and continue reading links" topic above). As we talked about on Monday, it's clear we can't tokenize everything, but from the ASA example, it seems that names of associations, grants, endowments, etc. should be our first priority for tokenization. Other things currently on the tokenization list -- words like "phd," "ma," and even "mla" -- didn't show up often enough to be included as top-20 keys, but they could be affecting the model in smaller ways.


  • Issue: How will we implement scrubbing? (e.g., Lexos, Python scripts, or R scripts?)
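
If the Python route is chosen, the consolidation of compound phrases discussed above could look roughly like the sketch below. This is not Scott's script, which is not reproduced on this page; the function names, the underscore joiner, and the lowercasing of consolidated tokens are all assumptions made for illustration. The joiner in particular should be whatever character the downstream tokenizer keeps intact.

```python
import re


def phrase_pattern(phrases):
    """Build one regex matching any listed phrase (hypothetical helper).

    Longer phrases come first so that "American Studies Association"
    is preferred over "American Studies" when both are listed.
    """
    ordered = sorted(phrases, key=len, reverse=True)
    # Allow any run of whitespace (including line breaks) between words.
    alternatives = [
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in ordered
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def consolidate_phrases(text, phrases, joiner="_"):
    """Replace each listed phrase with a single lowercased token.

    The joiner and the lowercasing are assumptions, not project decisions.
    """
    pattern = phrase_pattern(phrases)
    return pattern.sub(lambda m: joiner.join(m.group(0).lower().split()), text)


if __name__ == "__main__":
    scrub_list = ["American Studies Association", "American Studies"]
    sample = "The American Studies Association voted; American studies grew."
    print(consolidate_phrases(sample, scrub_list))
```

Keeping the phrase list as plain data, separate from the code, would let the scrubbing list grow on the wiki without changes to the script; the priority order suggested in the email (associations, grants, endowments first) would then only affect which phrases are added first.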


Discussion of Topic Modeling Strategy

  • Possible research questions (brainstormed in WE1S meeting 16 with aid of example topic model of NYT 2014 "humanities" and AntConc analysis of the same files). Red = cardinal questions.
    • What are the topics with which the humanities are associated in public discourse (compared in future with academic, foundation, legislative, and other kinds of discourse)?
      • What are the most important topics, and their relative importance?
      • What are the kinds of topics (e.g., economic, "life," culture, colleges, compared to sciences, etc.), and their relative importance?
        • What is the relative weight between general society and institutional (academic, governmental, funding agency, foundation) topics?
        • What is the relative weight of general society and academic disciplinary notions of the humanities?
          • Are there topics in public discourse that coincide with academic disciplinary notions of the humanities--literary study, history, philosophy, classics, etc.?
        • What does the topic model suggest about how the public thinks of humanities relative to the arts, social sciences, and sciences?
      • How do topics vary by nation and decade? 
        • Can we identify important historical moments (e.g., 9/11) and correlate with discussion of the humanities?
        • How does public discourse on the humanities correlate with actual educational facts on the ground (enrollment, tuition, student loans)?
      • Can we identify the articles that are "prevalently" about the humanities, and then see what topics have a high weight in them?
      • Can researching collocates and n-grams of "humanities" (and other text analysis) suggest other research questions?
    • What are the "hot" button topics in public discourse about the humanities in each year, and over time? How do they compare to the way academics and foundations discuss the humanities?
      • If we imagine our end-goal to be a practical advocacy "kit" for humanities advocacy--the equivalent of what politicians call "talking points"--what topics would we steer advocates away from, and to?
      • In such a "kit," how can relational understandings of the humanities best be exploited--e.g., talking about the humanities in relation to the sciences, to getting a job, to personal well-being, to national security, etc.?
  • Topic Modeling Strategy to "Operationalize" These Research Questions:
    • Our Data Files
      • We are storing our corpus as plain text files on Google Drive in folders with paths (for example) like this:
        • \data_archive\corpus\new_york_times\data\2000\plain_text\humanities
        • \data_archive\corpus\new_york_times\data\2000\plain_text\liberal_arts
        • \data_archive\corpus\new_york_times\data\2000\plain_text\the_arts
      • Individual text files are named by the convention (for example): "nyt-2000-h-18.txt" (where "h" = "humanities")
    • First step (?) 
      • De-duplicate files collected from searches on "humanities," "liberal arts," and "the arts".
      • Consolidate in single folder of files
        • File names (e.g., "nyt-2000-h-18.txt") allow us to backtrack so as to provide citations and links later for public-facing front-end.
    • Second step (?)
      • Run MALLET on whole folder of consolidated files
    • Third step (?) 
      • For developer use: output to MongoDB database that allows querying
      • For public use: output the topic model, visualizations, etc. (but not full text, only links to original articles).
      • Crucial needs in both developer and public use cases: the ability to see differentially/comparatively (if possible):
        • How a topic ranks relative to others in the whole corpus.
        • What the kinds of topics are (genres of topics) and their relative ranks.
        • What the ranking is in any subset of the corpus (e.g., nation, decade, publication). That is, if we examined only U.S. newspapers, or a particular newspaper, what is the rank of topics compared to the general ranking order for the whole corpus?
        • Chronological (longitudinal) topic modeling.
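
The steps above can be sketched in Python under some loudly stated assumptions. The file-name pattern follows the one example given ("nyt-2000-h-18.txt"); the query codes "la" and "a" used in the demo for "liberal arts" and "the arts" are hypothetical, since only "h" is defined in these notes. De-duplication here means exact match after whitespace and case normalization, which may be too strict for real articles. Ranking topics by mean per-document weight is one possible reading of "rank", not a decision recorded in the meeting. The topic weights are made-up numbers standing in for per-document topic proportions from a model.

```python
import hashlib
import re

# Assumed convention: source-year-query-number.txt, e.g. "nyt-2000-h-18.txt".
NAME_RE = re.compile(
    r"^(?P<source>[a-z]+)-(?P<year>\d{4})-(?P<query>[a-z]+)-(?P<num>\d+)\.txt$"
)


def parse_name(filename):
    """Recover source, year, query code, and number from a file name."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return match.groupdict()


def deduplicate(texts):
    """Given {filename: text}, keep one file name per distinct article.

    Articles count as duplicates when their text is identical after
    collapsing whitespace and lowercasing (an assumption; near-duplicates
    are not detected). The alphabetically first file name is kept.
    """
    kept = {}
    for name in sorted(texts):
        normalized = " ".join(texts[name].split()).lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        kept.setdefault(digest, name)
    return sorted(kept.values())


def rank_topics(doc_topics, keep=lambda name: True):
    """Order topic ids by mean weight over the documents selected by `keep`.

    `doc_topics` maps file name to a list of topic weights, one per topic.
    """
    selected = [weights for name, weights in doc_topics.items() if keep(name)]
    if not selected:
        raise ValueError("no documents selected")
    n_topics = len(selected[0])
    means = [sum(w[t] for w in selected) / len(selected) for t in range(n_topics)]
    return sorted(range(n_topics), key=lambda t: means[t], reverse=True)


if __name__ == "__main__":
    texts = {
        "nyt-2000-h-18.txt": "Same  article text.",
        "nyt-2000-la-3.txt": "same article text.",  # hypothetical "la" code
        "nyt-2000-a-7.txt": "Different.",           # hypothetical "a" code
    }
    print(deduplicate(texts))

    weights = {  # made-up proportions for three topics
        "nyt-2000-h-18.txt": [0.7, 0.2, 0.1],
        "nyt-2014-h-2.txt": [0.1, 0.3, 0.6],
    }
    print(rank_topics(weights))
    print(rank_topics(weights, keep=lambda n: parse_name(n)["year"] == "2014"))
```

Because the subset is chosen by a predicate on the file name, the same function could compare a single year, decade, or publication against the whole corpus, as long as the naming convention carries that information. Nation is not encoded in the example file name, so that comparison would need a separate lookup.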





