Meeting 17 (2015-08-05)

Saved by Alan Liu
on August 4, 2015 at 10:57:56 pm
 

Progress to Date (and Future Scheduling)

 

  • Status Reports (Developer Task Assignments)
    • NY Times
    • WSJ
    • The Guardian
    • NPR
    • Other sources we are researching (Document Sources)
      • Possibilities We Considered: 
        * Other U.S. cities (e.g., Washington Post, LA Times, Chicago Tribune)?
        * Online media news/popular media (e.g., Huff Post, Salon)
        * TV/Radio media
        * Middlebrow (e.g., USA Today)
        * Magazines (e.g., New Republic, LA Review of Books)?
        * Economic press (e.g., Forbes, Business Insider, The Economist)?
        * Higher-education press (e.g., Chronicle of Higher Education, Higher Ed)?
        * Student papers (e.g., Harvard Crimson, Yale Daily News, UCLA Bruin)?
        * Commencement speeches.
        * Articles on "sciences"?
        * Social media
      • Current Priorities:
        1. First priority:
          1. Other nations: UK, Canada (Australia, New Zealand, India)
          2. At least one other U.S. city
          3. Born-digital publications (e.g., Huff Post, Salon, etc.)
        2. Second priority:
          1. At least one higher-ed publication
          2. At least one source from the economic press
          3. Student newspapers

 

  1. Other nations: UK (Canada, Australia, New Zealand, India)?
  2. At least one other U.S. city
  3. Born-digital publications
  4. One higher-ed publication (e.g., Chronicle of Higher Education, Higher Ed)
  5. Economic press (e.g., Forbes, Business Insider, The Economist)
  6. One student newspaper (e.g., Harvard Crimson, Yale Daily News, UCLA Bruin)

 

 


Discussion of Scrubbing and Other Preprocessing Steps

 

  • Current Scrubbing List ("List of Fixes Needed for Raw Texts")
  • Issue: We can't try to add everything possible in detail to the scrubbing list. What is the right way to think about this?
  • Issue: How will we implement scrubbing? (e.g., Lexos, Python scripts, or R scripts?)

 


Discussion of Topic Modeling Strategy

  • Possible research questions (as brainstormed in WE1S meeting 16, with the aid of an example topic model of NYT 2014 "humanities" and AntConc analysis of the same files). Red = cardinal questions.
    • What are the topics with which the humanities are associated in public discourse (compared in future with academic, foundation, legislative, and other kinds of discourse)?
      • What are the most important topics, and their relative importance?
      • What are the kinds of topics (e.g., economic, "life," culture, colleges, compared to sciences, etc.), and their relative importance?
        • What is the relative weight between general society and institutional (academic, governmental, funding agency, foundation) topics?
        • What is the relative weight of general society and academic disciplinary notions of the humanities?
          • Are there topics in public discourse that coincide with academic disciplinary notions of the humanities--literary study, history, philosophy, classics, etc.?
        • What does the topic model suggest about how the public thinks of humanities relative to the arts, social sciences, and sciences?
      • How do topics vary by nation and decade? 
        • Can we identify important historical moments (e.g., 9/11) and correlate with discussion of the humanities?
        • How does public discourse on the humanities correlate with actual educational facts on the ground (enrollment, tuition, student loans)?
      • Can we identify the articles that are "prevalently" about the humanities, and then see what topics have a high weight in them?
      • Can researching collocates and n-grams of "humanities" (and other text analysis) suggest other research questions?
    • How do topics vary by nation and decade?
      • Can we correlate important historical moments (e.g., 9/11) and public discourse about the humanities?
      • How does public discourse about the humanities correlate with educational facts on the ground (e.g., trends in enrollment figures, tuition, etc.)?
    • What are the "hot-button" topics in public discourse about the humanities in each year, and over time? How do they compare to the way academics and foundations discuss the humanities?
      • If we imagine our end-goal to be a practical "kit" for humanities advocacy (the equivalent of what politicians call "talking points"), which topics would we steer advocates away from, and which toward?
      • In such a "kit," how can relational understandings of the humanities best be exploited--e.g., talking about the humanities in relation to the sciences, to getting a job, to personal well-being, to national security, etc.?
  • Topic Modeling Strategy 
    • Our Data Files
      • We are storing our corpus as plain text files on Google Drive in folders with paths (for example) like this:
        • \data_archive\corpus\new_york_times\data\2000\plain_text\humanities
        • \data_archive\corpus\new_york_times\data\2000\plain_text\liberal_arts
        • \data_archive\corpus\new_york_times\data\2000\plain_text\the_arts
      • Individual text files are named by the convention (for example): "nyt-2000-h-18.txt" (where "h" = "humanities")
    • First step (?) 
      • De-duplicate files collected from searches on "humanities," "liberal arts," and "the arts".
      • Consolidate in single folder of files
        • File names (e.g., "nyt-2000-h-18.txt") allow us to backtrack so as to provide citations and links later for public-facing front-end.
    • Second step (?)
      • Run MALLET on whole folder of consolidated files
    • Third step (?) 
      • For developer use: output to a MongoDB database that allows querying
      • For public use: output the topic model, visualizations, etc. (but not full text, only links to original articles).
      • Crucial needs in both developer and public use cases: the ability to see differentially/comparatively (if possible):
        • How a topic ranks relative to others in the whole corpus.
        • What the kinds of topics are (genres of topics) and their relative ranks.
        • What the ranking is in any subset of the corpus (e.g., nation, decade, publication). That is, if we examined only U.S. newspapers, or a particular newspaper, how do the topics rank compared to the general ranking order for the whole corpus?
        • Longitudinal topic modeling.
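
As a starting point for discussing the "First step" above, here is a minimal Python sketch of de-duplicating and consolidating the files. It is not an agreed implementation, and it rests on several assumptions: files follow the "nyt-2000-h-18.txt" naming pattern; "duplicate" means identical text after collapsing whitespace and lowercasing (near-duplicates with small edits would not be caught); and the output folder sits outside the corpus folder being scanned. The function names and the "la" query code used in the usage note are hypothetical. Only "h" = "humanities" comes from our convention.

```python
import hashlib
import re
import shutil
from pathlib import Path

# Assumed pattern for names like "nyt-2000-h-18.txt":
# source code, four-digit year, query code, serial number.
FILENAME_RE = re.compile(
    r"^(?P<source>[a-z]+)-(?P<year>\d{4})-(?P<query>[a-z]+)-(?P<serial>\d+)\.txt$"
)


def parse_name(name):
    """Split a file name into its parts, or return None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["year"] = int(parts["year"])
    parts["serial"] = int(parts["serial"])
    return parts


def content_key(text):
    """Hash the text after collapsing whitespace and lowercasing (exact-match only)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def consolidate(corpus_root, out_dir):
    """Copy one file per distinct text into out_dir.

    Returns (kept_names, duplicates), where duplicates is a list of
    (skipped_name, kept_name) pairs so that skipped files can still be traced.
    Assumes out_dir is NOT inside corpus_root.
    """
    corpus_root = Path(corpus_root)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    seen = {}        # content hash -> name of the first file with that text
    duplicates = []  # (skipped file name, name of the file it duplicates)

    for path in sorted(corpus_root.rglob("*.txt")):
        if parse_name(path.name) is None:
            continue  # ignore files that do not follow the naming convention
        key = content_key(path.read_text(encoding="utf-8", errors="replace"))
        if key in seen:
            duplicates.append((path.name, seen[key]))
            continue
        seen[key] = path.name
        shutil.copy2(path, out_dir / path.name)

    return sorted(seen.values()), duplicates
```

Usage would look something like `kept, dups = consolidate("corpus/new_york_times/data/2000/plain_text", "consolidated/2000")`. Because the original file names are preserved, each consolidated file can still be traced back to its source, year, and search term for citations and links, and the `dups` list records which search-term folders returned the same article.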

 

 

 

 

 
