Comparing Interfaces for Exploring Topic Models Meeting (2016-02-12)

Page history last edited by Alan Liu 5 years, 3 months ago

 

 

Preliminaries 

  • Introductions

  • Next Meeting Dates (options): WE1S project Calendar

    • The purpose of our next meeting will be to:
      (a) continue with exploration of topic model interfaces (hopefully we'll be further down the road in implementing / trying them);
      (b) begin to pull together everything we've done into a final plan for finalizing files in our corpus, topic modeling them, and putting them in a topic model interface, with storage and querying support in MongoDB.
      Some possibilities for meeting dates:
      • F, Feb. 26 (1 pm)
      • F, March 4 (1 pm)
      • F, March 25 (1 pm)
      • R, March 31 (1 pm)
      • F, April 1 (2 pm)
  • Quick Orientation & Review of Project Work (for benefit of newcomers at CSUN and elsewhere)
    • Overview
      • Current goal: topic modeling the ambient public discourse in which articles mentioning the humanities and liberal arts occur. Later goal: identifying articles in the corpus that are centrally "about" the humanities and topic modeling those. (See Alan's "neighborhood" analogy for this two-goal strategy.)
      • Our major project development stages/cycles to date:
      • Development work to be initiated at today's meeting:
        • (4) Comparing interfaces for exploring topic models
      • Other ongoing development work:
        • Nathalie Popa: scraping Globe & Mail 
        • Austin Yack: "What Politicians Have to Say About the Humanities" and supporting spreadsheets for his research on the U.S. White House, Congress, and California State Legislature (using Google Visualization API to output HTML tables from Google Spreadsheets. See experiment.)
    • Status Updates?
      • Scraping
      • File renaming issue
      • De-duplication issues:
        • Deduping script
        • Deduping strategy (from Alan's email of Nov. 23, 2015): "We need perhaps a couple of people to study an example of de-duplicating a sample of our articles and advise Jeremy on how or if we should adjust the outcome of his deduping script. (See the deduping instructions in the Topic Modeling Workflow.) Currently, his script identifies close matches above a certain threshold, and we are just deleting all files in the right-hand column of a pair of matches."
      • Archiving on Mirrormask has stopped working (?)
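
The deduping strategy above can be sketched in Python. This is only a minimal illustration of threshold-based near-duplicate matching, not Jeremy's actual script: the difflib similarity measure, the 0.9 threshold, and the keep-left/drop-right policy are assumptions standing in for whatever his script does.

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_near_duplicates(texts, threshold=0.9):
    """Return (i, j) index pairs whose similarity ratio meets the threshold."""
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
            pairs.append((i, j))
    return pairs

def dedupe(texts, threshold=0.9):
    """Drop the right-hand member of each matched pair, keeping the left."""
    drop = {j for _, j in find_near_duplicates(texts, threshold)}
    return [t for i, t in enumerate(texts) if i not in drop]
```

The "right-hand column" deletion rule from Alan's email corresponds here to always discarding the second index of a matched pair.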

 


Comparing Interfaces for Exploring Topic Models (main focus of today's meeting)

 

Overview

  • The problem (as we learned at our last meeting) is preparing, organizing, and navigating between views of topic models. See the table below, "Possible WE1S project workflow for interpreting topic models (Draft 1)." (This table is now also a separate page: Topic Modeling Interpretation Workflow.)
  • WE1S does not want to reinvent the wheel and build its own interface for interpreting topic models if that can be avoided, so the project is searching for an existing or adaptable interface to serve the twin purposes of interpreting and presenting models.
  • Two key questions in comparing interfaces for exploring topic models:
    • (a) Will this system effectively help us interpret the WE1S topic models? 
    • (b) Does this system seem possible in terms of technical implementation?
  • Other questions to consider:
    • Can the system allow for comparison between parts of our corpus (e.g., comparing NY Times to Wall St. Journal, or American and English newspapers)?
    • Can the system allow for time-series views of topic models?
    • Do we want all-in-one systems, or instead systems that allow us to do Mallet topic modeling separately and feed the results into the system?
    • Does the system require, or optionally benefit from, metadata in addition to plain-text corpora?
    • Visualization idiom: how easy is it in a system to think with the pictures it generates?

 

Materials for Our Discussion 

 

 

 

Discussion Agenda

  • Comments and evaluations of the systems still in the running.
  • Narrow the field to a short list of systems to try to implement.
    • Plan/divide up the implementation work (to be accomplished if possible by the next WE1S meeting)
  • Set up a virtual machine with multiple topic model systems on it that we can share?

 

 


 

Possible WE1S project workflow for interpreting topic models (Draft 1) (including parallel, alternative, or conjoined human and machine processes):

 

Human Processes

(Typical procedure for steps requiring judgment will be to have a panel of three or more people perform the step and compare results. The hermeneutical rhythm will typically consist of iterative cycles of observations suggesting research questions, and research questions suggesting ways to sharpen observation.)

Machine Processes

(We may be able to automate some steps and sequences)

1
Human: Assess topic models to determine the appropriate number of topics. We may decide to generate one, two, or three numbers of topics for simultaneous interpretation.
(Questions: Can we define criteria for the "best" topic model? Do we know any published theory or methods for choosing the right number of topics? Cf. Scotts issues for discussion.pdf)

Machine: Generate topic models at many levels of granularity--e.g., 25, 50, 150, 200, 250, 300, 350, 400, 450, 500 topics.
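
The machine step above could be driven by a short script that loops over granularities, building one Mallet train-topics command per level. A sketch: the flag names are standard Mallet options, but the corpus path, output file names, and the idea of calling Mallet from Python are assumptions.

```python
def mallet_commands(corpus="corpus.mallet",
                    topic_counts=(25, 50, 150, 200, 250, 300, 350, 400, 450, 500)):
    """Build one 'mallet train-topics' command per granularity.

    Each returned list can be handed to subprocess.run(); the corpus and
    output file names here are placeholders.
    """
    cmds = []
    for k in topic_counts:
        cmds.append([
            "mallet", "train-topics",
            "--input", corpus,
            "--num-topics", str(k),
            "--output-topic-keys", f"keys-{k}.txt",
            "--output-doc-topics", f"composition-{k}.txt",
        ])
    return cmds
```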

2
Human: Initial analysis of topic models.
  1. Assign labels for topics (assisted by the automated process suggested under Machine below).
  2. Identify and label any major clusters of topics.
  3. Flag for attention any illegible topics.

Machine: Assemble materials to facilitate interpretation:
  1. Create sorted keys files in a spreadsheet.
  2. Create topic cloud visualizations.
  3. Create clustering visualizations.
    (Testing phase: compare a human-panel-only clustering with a machine clustering of topics.)
  4. Assess "nearness" of topics. (We don't yet have a method to do this; but cf. the Goldstone DFR Browser "scaled view".)
  5. If possible, auto-label topics with the top 2-4 most frequent words in a topic (based on an algorithm that establishes a minimum threshold of proportional frequency and decides what to do when one, two, three, or four top words are, or are not, significantly more important than the others).
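
The auto-labeling idea could be prototyped as below. The 5% proportional-frequency threshold, the fallback to two words, and the " / " separator are all placeholder choices standing in for the algorithm still to be designed.

```python
def auto_label(word_weights, max_words=4, min_share=0.05):
    """Label a topic with its top 2-4 words.

    word_weights is a list of (word, weight) pairs for one topic. A word
    must carry at least min_share of the topic's total weight to appear in
    the label (min_share is an assumed threshold, not a settled value).
    """
    total = sum(w for _, w in word_weights)
    ranked = sorted(word_weights, key=lambda x: x[1], reverse=True)
    picked = [w for w, wt in ranked if wt / total >= min_share][:max_words]
    # Always show at least the top two words, even if below threshold.
    if len(picked) < 2:
        picked = [w for w, _ in ranked[:2]]
    return " / ".join(picked)
```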
3
Human: Detailed analysis of topic model (part I: total corpus, synchronic analysis).
  1. Study major topics and clusters of topics.
  2. Human panel reads sample articles and compares them to the topic proportions found in the topic-counts.txt file created by Mallet. (This is a sanity check.)
  3. Human panel writes up analytical notes and observations, and compares them.
  4. Members of the human panel write up a report.
4
Human: Detailed analysis of topic model (part II: comparative analysis).
  1. Study major correlations/differences between any two or three parts of our corpus of interest.

Machine: Create a view of the topic model that compares two or more parts of our corpora (e.g., NY Times vs. The Guardian) for the topics and topic weights they contain. We don't yet have an interface or method for using the composition.txt files produced by Mallet to do this. (Cf. the Goldstone DFR Browser "document view," which shows topics in a single document.) (Alan's experiment)
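
A sketch of what that comparative view could compute from composition.txt: mean topic weights per subcorpus. The column layout assumed here (doc index, filename, then one proportion per topic) matches newer Mallet doc-topics output, but our actual file layout and the source_of mapping are assumptions.

```python
from collections import defaultdict

def mean_topic_weights(doc_topics_lines, source_of):
    """Average Mallet doc-topic proportions per subcorpus.

    Assumes lines of the form: <doc-index> <filename> <prop-0> <prop-1> ...
    source_of maps a filename to a subcorpus label (e.g. "nytimes").
    """
    sums = defaultdict(lambda: None)
    counts = defaultdict(int)
    for line in doc_topics_lines:
        if line.startswith("#"):      # skip Mallet's header line, if present
            continue
        fields = line.split()
        name, props = fields[1], [float(x) for x in fields[2:]]
        src = source_of(name)
        if sums[src] is None:
            sums[src] = [0.0] * len(props)
        sums[src] = [a + b for a, b in zip(sums[src], props)]
        counts[src] += 1
    return {src: [s / counts[src] for s in sums[src]] for src in sums}
```

The resulting per-source vectors could then be placed side by side (e.g., NY Times vs. The Guardian) for comparison.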
5
Human: Detailed analysis of topic model (part III: time-series analysis).
  1. Study trends in topics.

Machine: Create views of the topic model that show trend lines for topics (created by showing weights of topics in documents at time 1, followed by time 2, etc.). We don't yet have a method or tool for this, but cf. the following time-series views in the Goldstone DFR Browser: topics across years | topics within a single year. See also: the demo visualization of topics in State of the Union addresses; the TOM code demos; Robert K. Nelson, "Mining the Dispatch". (Alan's experiment)
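
The trend-line computation itself is simple once documents carry dates. A sketch, assuming each document has been reduced to a (date bucket, topic proportions) pair; the date metadata is exactly the piece we don't yet have a tool for.

```python
from collections import defaultdict

def topic_trend(docs, topic):
    """Mean weight of one topic per time bucket.

    docs is an iterable of (date_key, [topic proportions]) pairs, where
    date_key is any sortable bucket label such as a year (a placeholder
    for whatever date metadata our corpus files end up carrying).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for when, props in docs:
        totals[when] += props[topic]
        counts[when] += 1
    return {when: totals[when] / counts[when] for when in sorted(totals)}
```

The returned mapping (bucket -> mean weight) is what a trend-line chart would plot.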
6
Human: Write up results:
  1. Create key observations and a data set, and publish them (with a catchy title like "Humanities in Public Discourse: The Manual").
  2. Co-author a white paper with recommendations for humanities advocacy.
    1. Create a subset of the above as a brochure or infographic.
  3. Disseminate research methods and conclusions in academic talks, panels, papers, and articles.