
Planning Notes (Liu, Nov 21, 2013)

Page history last edited by Alan Liu 10 years ago

 

  

Plan for Developing Methods of Processing Samples from WhatEvery1Says Corpus

(later to be applied to larger sample and wider corpora)

 

  • Document Selection: Choose a representative sample of both HTML and PDF documents from WhatEvery1Says.
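
One possible way to draw such a sample is sketched below. It assumes that "representative" can be approximated by a reproducible random draw within each file type; the function name, the grouping by file extension, and the fixed seed are illustrative assumptions, not part of the plan itself.

```python
import random
from collections import defaultdict
from pathlib import PurePath


def stratified_sample(paths, per_type, seed=0):
    """Pick up to `per_type` documents of each file type (by extension).

    Hypothetical helper: grouping by extension and drawing a simple random
    sample within each group is an assumption about what "representative"
    means here.
    """
    by_type = defaultdict(list)
    for p in paths:
        # Normalize the extension so ".HTML" and ".html" land in one group.
        by_type[PurePath(p).suffix.lower()].append(p)

    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample = {}
    for ext, group in sorted(by_type.items()):
        k = min(per_type, len(group))
        sample[ext] = sorted(rng.sample(group, k))
    return sample
```

Calling `stratified_sample(list_of_paths, per_type=25)` would return a dictionary keyed by extension (for example `".html"` and `".pdf"`), each holding at most 25 file paths.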

 

 

 

  • Secondary (interpretive) text preparation: Use Python scripts or other tools for the following tasks: [Task: research and experiment with this issue]
    • Lemmatization.
    • Consolidating semantically unitary bigrams into unigrams (e.g., "social sciences" into "social_sciences").
    • Filtering out proper names (using named-entity recognizers).
    • Creating a "stop list." (See Ted Underwood and Andrew Goldstone's stop list for their project on topic modeling literary studies journals.)
    • Experimenting with part-of-speech (POS) taggers to filter out everything but nouns (cf. Matthew Jockers's topic-modeling work).
    • "Chunking" (breaking documents if needed into appropriately sized subdocuments). Alternatively, we could try the text "segmentation" approach described in E. Thomas Ewing, et al., "Mining Coverage of the Flu: Big Data's Insights into an Epidemic" (2014) [Alan's annotated copy]
    • Resources & tools to investigate:
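
As a starting point for experimenting with the preparation tasks above, here is a minimal sketch of three of them (bigram consolidation, stop-list filtering, and chunking) using only the Python standard library. The word lists, function names, and chunk size are hypothetical placeholders. Lemmatization, named-entity filtering, and POS tagging would need an NLP library and are not shown.

```python
import re

# Hypothetical, tiny lists for illustration only; a real run would load a
# curated stop list and a project-specific list of unitary bigrams.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is", "are"}
BIGRAMS = {("social", "sciences"), ("liberal", "arts")}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def merge_bigrams(tokens, bigrams=BIGRAMS):
    """Join listed two-word phrases into one underscore-joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def chunk(tokens, size):
    """Split a token list into consecutive subdocuments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def prepare(text, chunk_size=1000):
    """Tokenize, merge bigrams, filter stop words, then chunk."""
    # Bigrams are merged before stop words are removed so that deleting a
    # stop word cannot create a false adjacency between two other words.
    tokens = remove_stop_words(merge_bigrams(tokenize(text)))
    return chunk(tokens, chunk_size)
```

For example, `prepare("The social sciences and the liberal arts are in the news", chunk_size=2)` returns `[["social_sciences", "liberal_arts"], ["news"]]`.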

 

  • Topic Modeling:
    • Experiment to see if we should use the full-featured, command-line MALLET topic modeling suite or its graphical Java implementation (Topic Modeling Tool).
    • Experiment with the Overview tool for topic-sorting journalism and other documents.
    • Experiment with different parameters for the topic modeling, most importantly: number of topics to ask the algorithm to produce.
    • Experiment with interpretive labeling of topics.
    • Experiment with visualizing topics.
    • Experiment with social-network analysis of topics.
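
If the command-line MALLET suite is chosen, the experiment with the number of topics could be scripted roughly as follows. This is a sketch under assumptions: the corpus has already been imported into a MALLET file, and the file names, the `bin/mallet` path, and the helper name are hypothetical. The function only builds the command lines; it does not run MALLET.

```python
def mallet_commands(input_file, topic_counts, mallet="bin/mallet"):
    """Build one `train-topics` command per candidate number of topics.

    Hypothetical helper: the output file names are illustrative, chosen so
    that results from different topic counts do not overwrite each other.
    """
    commands = []
    for k in topic_counts:
        commands.append([
            mallet, "train-topics",
            "--input", input_file,
            "--num-topics", str(k),
            "--output-topic-keys", f"keys_{k}.txt",
            "--output-doc-topics", f"composition_{k}.txt",
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to train one model, after which the topic-key files for the different topic counts can be compared side by side during interpretive labeling.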

 


Document Collection Strategy
(go to planning page)

 

 

 
