Meeting 2014-05-16
last edited by Alan Liu
4Humanities "WhatEvery1Says" Project
Project Idea | Current State of Project | Future Steps and Need for More Collaborators
WhatEvery1Says Project Idea
Research Material:
- WhatEvery1Says Corpus #1 (collected manually)
- WhatEvery1Says Corpus #2 (extended corpus collected systematically or algorithmically)
Research Questions:
Our hypothesis is that digital methods can help us learn new things about how media pundits, politicians, business leaders, administrators, scholars, students, artists, and others are actually thinking about the humanities. For example, are there sub-themes beneath the familiar dominant clichés and memes? Are there hidden connections or mismatches between the “frames” (premises, metaphors, and narratives) of those arguing for and against the humanities? How do different parts of the world or different kinds of speakers compare in the way they think about the humanities? Instead of concentrating on set debates and well-worn arguments, can we exploit new approaches or surprising commonalities to advocate for the humanities in the 21st century?
Specific research questions:
- What are the common "themes" (ideas, theses, evidence, metaphors, etc.) that divide or join people discussing the humanities?
- What are the lower-level or latent themes beneath those everyone "knows"?
- What are the outlier themes?
- What are the patterns of connection between themes, between spokespersons, and between media outlets?
- How do themes compare across time?
- How are themes differentiated by nation, region, gender, age, etc.?
- Other questions ...
Research Method: Topic Modeling
Other Possible Analytical Goals
Initial Proof of Concept
Intended Outcomes
- Creation of an interactive site for exploring the topic model of WhatEvery1Says. (Cf. DFR-Browser, a browser-based visualization interface created by Andrew Goldstone for exploring his topic model of JSTOR articles.)
- Co-authored research report or article on outcomes.
- Workshop to brainstorm ways we can apply the outcomes in facilitating, guiding, or creating advocacy arguments and materials.
Stage 1 Transformation of Corpus (documents from raw corpus archived and extracted as plain text)
- Raw WhatEvery1Says Corpus #1 (links to documents collected manually)
- Workflow protocol for archiving and extracting text from the raw corpus:
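One piece of the archiving-and-extraction step is converting archived web pages to plain text. A minimal sketch of that step, using only Python's standard-library HTML parser; the actual workflow protocol is not spelled out on this page, so the tool choice here is an assumption, not a record of the project's method:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML document, skipping scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside <script> or <style>.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

In practice each archived document would be run through something like `html_to_text` and saved as a `.txt` file for the Stage 1 corpus.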
Stage 2 Transformation of Corpus (plain text files cleaned and prepared for topic modeling)
We are currently working on specific components of the following set of processes, which ideally should be explored in iterative complementarity with initial topic modeling runs, and which should ultimately be stitched together and automated as a single workflow:
- Perform initial text cleaning, punctuation-stripping, and low-level prepping work -- automate using Lexos or other text-preparation tools?
- Identify bigrams (e.g., "social sciences") that need to be converted to unigrams
- Assisted by identification of frequent collocates using Antconc (Lindsay Thomas)
- Build a stop list (Jeremy Douglass)
- Standard starter stop lists
- 1. The Fox 1992 stop word list (429 words). Fox, C. (1992). Lexical analysis and stop lists. In Frakes, W. and Baeza-Yates, R., editors, Information Retrieval: Data Structures and Algorithms, chapter 7. Prentice-Hall. http://www.lextek.com/manuals/onix/stopwords1.html
- 2. The SMART 1971 stop word list (571 words): Salton, G. 1971. The SMART Retrieval System—Experiments in Automatic Document Processing, Upper Saddle River, NJ, USA: Prentice-Hall, Inc. http://www.lextek.com/manuals/onix/stopwords2.html [similar to MALLET standard English language stop list]
- Andrew Goldstone and Ted Underwood's stop list
- Matthew Jockers's stop list
- Use named-entity parsers to identify proper names, etc., that can either be put in the stop list or set aside for social-network analysis (separate from the topic modeling) (Zach Horton and Liz Shayne)
- Use part-of-speech taggers to allow us to experiment with subtracting verbs, etc., to improve the usefulness of topic modeling. (Priscilla)
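The cleaning, bigram-merging, and stop-word steps above could eventually be stitched into a single pass along these lines. This is only a sketch: the bigram and stop-word lists here are tiny placeholders, standing in for the real lists that would come from the AntConc collocate work and the combined Fox/SMART/custom stop lists:

```python
import string

# Placeholder lists -- in practice these would be produced by the
# collocate analysis (bigrams) and the merged stop lists described above.
BIGRAMS = {("social", "sciences"): "social_sciences",
           ("liberal", "arts"): "liberal_arts"}
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}


def clean_tokens(text):
    """Lowercase, strip punctuation, merge known bigrams, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()

    # Merge listed bigrams into single tokens so the topic model
    # treats e.g. "social sciences" as one term.
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAMS:
            merged.append(BIGRAMS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1

    return [t for t in merged if t not in STOP_WORDS]
```

Keeping each step a separate function (or a separate stage in Lexos) would make it easier to experiment with the order and combination of steps against trial topic-model runs.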
Early Topic Model Run on the 61 Documents in the Stage 1 Transform Sample Corpus:
- Alan's topic model run of 1 May 2014 (using MALLET):
- How-to Resources for MALLET:
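A MALLET run like the one above involves two command-line invocations: importing the plain-text directory, then training the model. The sketch below only assembles those commands in Python (actually executing them, e.g. via `subprocess.run`, is left to the user); the file names and parameter values are illustrative assumptions, not a record of the actual 1 May 2014 settings:

```python
def mallet_commands(text_dir="corpus_txt", num_topics=20,
                    stoplist="we1s_stoplist.txt"):
    """Build the two MALLET invocations: import plain-text files, then train."""
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", text_dir,
        "--output", "we1s.mallet",
        "--keep-sequence",            # required for topic modeling
        "--stoplist-file", stoplist,  # custom stop list from Stage 2
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", "we1s.mallet",
        "--num-topics", str(num_topics),
        "--output-topic-keys", "topic_keys.txt",  # top words per topic
        "--output-doc-topics", "doc_topics.txt",  # topic weights per document
    ]
    return import_cmd, train_cmd
```

Wrapping the commands this way also makes it easy to script the "iterative tweaking" runs, varying `num_topics` and the stop list across a batch of experiments.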
Future Steps and Need for More Collaborators
Major Tasks (some task groups could be the projects of other SoCal 4Humanities chapters / digital humanists):
- Continue advancing and experimenting with Stage 2 Transformation of WhatEvery1Says corpus.
- Iterative work on running and tweaking topic models of the corpus.
- Develop methods and scripts for automated, systematic identification of relevant documents for inclusion in the raw WhatEvery1Says corpus:
- Identify available full-text corpora (e.g., newspaper and magazine online archives)
- Develop methods of searching and relevancy identification.
- Collect documents for Stage 1 transformation.
- Extend collection backward in time to selected sample decades.
- Develop (or borrow) methods of facilitating the interpretation of topic models:
- Create visualizations and other methods of "grokking" topic models
- Develop or adapt front-end interfaces for topic models. Examples:
- Andrew Goldstone's interface
- Jeffrey M. Binder and Collin Jennings's interface
- Use the WhatEvery1Says corpus for other kinds of analysis:
- Social network analysis
- Other kinds of text analysis, or clustering analysis
- Possible future co-authoring of article(s)
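As a first step toward "grokking" a model before any front-end interface exists, the plain-text files MALLET emits can be read directly. This sketch parses the tab-separated format of MALLET's `--output-topic-keys` file (topic id, Dirichlet weight, top words); the sample data in the test is invented for illustration:

```python
def parse_topic_keys(lines):
    """Parse MALLET --output-topic-keys lines: <id>\t<weight>\t<word word ...>."""
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics[int(topic_id)] = {
            "weight": float(weight),   # the topic's Dirichlet parameter
            "words": words.split(),    # top words, most probable first
        }
    return topics
```

The same dictionary could feed simple visualizations (word clouds, weight bar charts) or be handed off to a fuller browser interface like Goldstone's.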