Planning Notes (Liu, Nov 21, 2013)
Last edited by Alan Liu 10 years ago
Plan for Developing Methods of Processing Samples from WhatEvery1Says Corpus
(later to be applied to larger sample and wider corpora)
- Document Selection: Choose a representative sample of both HTML and PDF documents from WhatEvery1Says.
- Text Extraction: Develop methods for semi-automating the following tasks: [Task: research and experiment with this issue]
- Extract text as "plain text."
- Exclude non-relevant material such as advertisements, author bios, copyright notices, page numbers, etc.
- Resources & tools to investigate:
- Preliminary text preparation: Use Python scripts or other tools for fixing common errors, standardizing spellings, resolving hyphenations, etc. [Task: research and experiment with this issue]
- Resources & tools to investigate:
- Secondary (interpretive) text preparation: Use Python scripts or other tools for the following tasks: [Task: research and experiment with this issue]
- Lemmatization.
- Consolidating semantically unitary bigrams into unigrams (e.g., "social sciences" into "social_sciences").
- Filtering out proper names (using named-entity recognizers).
- Creating a "stop list." (See Ted Underwood and Andrew Goldstone's stop list for their project on topic modeling literary studies journals.)
- Experimenting with part-of-speech (POS) taggers to filter out everything but nouns (cf. Matthew Jockers's topic-modeling work).
- "Chunking" (breaking documents, if needed, into appropriately sized subdocuments). Alternatively, we could try the text "segmentation" approach described in E. Thomas Ewing, et al., "Mining Coverage of the Flu: Big Data's Insights into an Epidemic" (2014) [Alan's annotated copy]
- Resources & tools to investigate:
- Topic Modeling:
- Experiment to see if we should use the full-featured Mallet topic modeling suite or the simpler Java-based graphical tool built on it (Topic Modeling Tool).
- Experiment with the Overview tool for topic-sorting journalism and other documents.
- Experiment with different parameters for the topic modeling, most importantly: number of topics to ask the algorithm to produce.
- Experiment with interpretive labeling of topics.
- Experiment with visualizing topics.
- Experiment with social-network analysis of topics.
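
For the Text Extraction step, one possible starting point for the HTML documents is sketched below. It uses only Python's standard-library `html.parser`. The names `PlainTextExtractor`, `html_to_plain_text`, and the `SKIP_TAGS` set are illustrative assumptions, not tools chosen by the project; real pages will probably need site-specific rules to catch advertisements and author bios, and PDFs would need a separate tool.

```python
from html.parser import HTMLParser

# Tags whose contents are assumed never to be article text.
# This set is a guess for illustration; tune it per source site.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class PlainTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # >0 while inside a skipped element
        self._parts = []       # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        # One fragment per line; inline markup therefore also
        # produces line breaks, which later steps can normalize.
        return "\n".join(self._parts)


def html_to_plain_text(html):
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{}</style></head><body>"
    "<nav>Menu</nav><p>Humanities matter.</p>"
    "<script>x()</script><p>Second paragraph.</p>"
    "<footer>Copyright 2013</footer></body></html>"
)
print(html_to_plain_text(sample))
```

Tag-based filtering is only a first pass: it cannot recognize boilerplate that sits inside ordinary paragraph tags, which is why the step is described above as semi-automated.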
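
Several of the preliminary and secondary text-preparation tasks (resolving hyphenations, consolidating bigrams, applying a stop list, chunking) can be prototyped in a few lines of plain Python. The sketch below is a minimal illustration under stated assumptions: the `STOP_WORDS` and `BIGRAMS` sets are tiny placeholders rather than the project's actual lists, all function names are hypothetical, and lemmatization, named-entity filtering, and POS tagging are left out because they need external libraries.

```python
import re

# Placeholder lists for illustration only; a real run would load
# a curated stop list and a reviewed list of unitary bigrams.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
BIGRAMS = {("social", "sciences"), ("liberal", "arts")}


def fix_hyphenation(text):
    # Rejoin words split across a line break: "sci-\nences" -> "sciences".
    # Caveat: this also joins genuine compounds such as "well-\nknown".
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def tokenize(text):
    # Lowercase alphabetic tokens, keeping internal apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def merge_bigrams(tokens, bigrams=BIGRAMS):
    # Replace each listed adjacent pair with one underscore-joined token.
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t not in stop_words]


def chunk(tokens, size):
    # Split a token list into consecutive subdocuments of at most `size`.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


raw = "The social sci-\nences and the liberal arts are in the news."
tokens = remove_stop_words(merge_bigrams(tokenize(fix_hyphenation(raw))))
print(tokens)
print(chunk(tokens, 2))
```

Bigrams are merged before stop words are removed so that dropping a stop word cannot create a false adjacency between two words that were not neighbors in the original text.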
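
For the experiment with the number of topics, one option is to generate a series of Mallet command lines that differ only in `--num-topics` and compare the resulting topic keys by hand. The sketch below only builds the commands; it does not run them. The function name `mallet_commands`, the directory names, and the topic counts are assumptions for illustration, and the `bin/mallet` path depends on where Mallet is installed.

```python
def mallet_commands(corpus_dir, work_dir, topic_counts, mallet="bin/mallet"):
    """Build one import command and one train-topics command per topic count.

    All paths here are hypothetical; adjust them to the local setup.
    """
    corpus_file = f"{work_dir}/corpus.mallet"

    # Import a directory of plain-text files. --keep-sequence preserves
    # word order, which Mallet's topic trainer requires.
    import_cmd = [
        mallet, "import-dir",
        "--input", corpus_dir,
        "--output", corpus_file,
        "--keep-sequence",
    ]

    # One training run per candidate number of topics.
    train_cmds = []
    for k in topic_counts:
        train_cmds.append([
            mallet, "train-topics",
            "--input", corpus_file,
            "--num-topics", str(k),
            "--output-topic-keys", f"{work_dir}/keys_{k}.txt",
            "--output-doc-topics", f"{work_dir}/doc_topics_{k}.txt",
        ])
    return import_cmd, train_cmds


import_cmd, train_cmds = mallet_commands("sample_txt", "out", [20, 40, 60])
print(" ".join(import_cmd))
for cmd in train_cmds:
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` once Mallet is installed. Writing the outputs to files named after the topic count keeps the runs side by side for the interpretive labeling step.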