Meeting 18 (2015-08-14)

Page history last edited by Alan Liu 8 years, 10 months ago

Progress to Date (and Future Scheduling)


  • Status Reports (Developer Task Assignments)
    • Completion of collection work for NYT, WSJ, Guardian, NPR
    • Completion of inspection work and sign-off
    • Alan then to tidy up final data files (renaming, etc.)
  • Next Steps (A): Scrubbing List: In the next week, spend one hour combing files and adding "fixes" to our Scrubbing List
  • Next Steps (B): Other sources to collect (Document Sources)
    • Priorities We Decided On:
      1. First priority:
        1. Other nations: UK, Canada (Australia, New Zealand, India)
        2. At least one other U.S. city
        3. Born-digital publications (e.g., Huff Post, Salon, etc.)
      2. Second priority:
        1. At least one higher-ed publication (e.g., Chronicle of Higher Education, Higher Ed)
        2. At least one source from the economic press (e.g., Forbes, Business Insider, The Economist)
        3. Student newspapers (e.g., Harvard Crimson, Yale Daily News, UCLA Bruin)



Ongoing research on the text analysis methods we will use on our corpus


  • Topic Modeling

  • Journalism data analytics

    • Email from Alan to Scott & Lindsay of 13 Aug 2015 on Overview (and journalism data analytics tools)
            Exploring Overview has clued me in to the whole infrastructure that the journalism profession has built to conduct data analytics. Besides Overview, there is DocumentCloud (to which journalists can upload primary documents for storage, publishing, and/or analysis, and to which Overview can connect) and OpenCalais (which does named entity recognition, and through which DocumentCloud processes documents). (See also the OpenCalais Viewer.)
            Except for OpenCalais, which is a Thomson Reuters product (but with a free tier sufficient for our purposes), these tools are open-source and originated as professional and public-service tools for journalists. That provenance gives the tools some legitimacy; the other plus, of course, is that there is a good fit between the way journalists want to cluster and explore documents and our goals for the WE1S project.
           Overview is like Lexos for journalists. The workflow is that you upload your documents, and then it clusters them and does various analytics/visualizations. The fundamental operations involved are word counting (actually, bigram counting); TF-IDF to evaluate the importance of words in documents; and then K-means clustering of documents. One of the main outputs of Overview is a tree of clustered documents. Especially handy: clicking on a leaf (in what is essentially a dendrogram presentation) pops up on the right the actual documents participating in that leaf.
           Overview is something you can just register for. DocumentCloud has an application process. (I've applied, but we'll see if they let me in as a non-journalist. Also, their terms of service say no copyrighted material. But we could use them for the policy, government, open-source, and other materials we will eventually want to add to our dataset, even if that excludes uploading newspaper articles.)
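           The three operations named in the email (bigram counting, TF-IDF weighting, K-means clustering) can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Overview's actual code: the function names, the smoothed IDF formula, the farthest-point seeding, and the four toy "documents" are all assumptions made for the example.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, split on whitespace, and return adjacent word pairs."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) over bigrams."""
    counts = [Counter(bigrams(d)) for d in docs]
    vocab = sorted({bg for c in counts for bg in c})
    n = len(docs)
    # Document frequency: number of documents containing each bigram.
    df = {bg: sum(1 for c in counts if bg in c) for bg in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Smoothed IDF (an assumption): common bigrams keep a small weight.
        vec = [(c[bg] / total) * (math.log((1 + n) / (1 + df[bg])) + 1)
               for bg in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain K-means with deterministic farthest-point seeding."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the vector farthest from all seeds chosen so far.
        farthest = max(vectors,
                       key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Toy stand-ins for articles (hypothetical text, not from our corpus).
docs = [
    "the humanities are in crisis say critics",
    "stock markets fell sharply on monday",
    "critics say the humanities are in crisis again",
    "on monday stock markets fell sharply again",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

           On the toy input the two humanities sentences land in one cluster and the two market sentences in the other. A tree of clusters like the one Overview displays would need a further step (for example, re-clustering within each cluster), which this sketch does not attempt.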
  • Overall Strategy For Analysis:

    • The multiple experiments we have been doing recently are bringing into focus, I think, a basic strategy for data analysis that we will want to pursue when we have our corpus in hand. My UCSB Geography colleague Krzysztof Janowicz, who works in digital geography, has a great metaphor for data mining: a data observatory (astronomy rather than digging into the earth). I'll go with that metaphor here.
           (a) First analytical step for WE1S: we observe the universe of public discourse about the humanities (later, also government, academic, foundation, and other discourse) using a variety of telescopic distant-reading instruments, including especially topic modeling (instrument: Mallet, supplemented by Lexos topic multiclouds); PCA and K-means clustering (instruments: Lexos, Overview, Voyant, etc.); and possibly other instruments. Criteria for selection of instruments include: provenance (good: open source, public service, academic; bad: proprietary or commercial); transparency of method; etc.
           (b) Some of these instruments help us innovatively refine observations conducted with other methods--e.g., using Overview to process our "topic-documents" generated originally from Mallet to help identify clusters of topics.
           (c) Human domain experts (ourselves as humanists) review the methods to see if their outcomes converge (one main goal) and if there are also interesting divergences (another intellectual goal).
           (d) We then create a standard analytical workflow (and manifest) based on the above. This workflow and its rationale amount to another publishable deliverable of the project (a methodological deliverable parallel to the manifest schema system).
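           One possible aid to the review in step (c), offered as an assumption rather than a settled method: if two instruments each assign every document to a group (say, a dominant topic from Mallet and a K-means cluster), the Rand index gives the fraction of document pairs on which they agree, and the disagreeing pairs point to the divergences worth reading closely. The function names and the label lists below are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two groupings agree.

    A pair "agrees" when both groupings put the two documents together,
    or both put them apart. Label values themselves need not match.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must cover the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def divergent_pairs(labels_a, labels_b):
    """Document index pairs that one grouping joins and the other splits."""
    return [
        (i, j)
        for i, j in combinations(range(len(labels_a)), 2)
        if (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
    ]


# Hypothetical group assignments for five documents from two instruments.
topic_labels = [0, 0, 1, 1, 2]    # e.g., dominant topic per document
cluster_labels = [1, 1, 0, 0, 0]  # e.g., K-means cluster per document

score = rand_index(topic_labels, cluster_labels)
diverging = divergent_pairs(topic_labels, cluster_labels)
print(score)      # 0.8
print(diverging)  # [(2, 4), (3, 4)]
```

           A score near 1 would suggest the instruments converge; the listed pairs are the documents a human reader would pull up to interpret a divergence. The number is a prompt for expert judgment, not a replacement for it.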



