Programming Resources

Page history last edited by Patrick Mooney 8 years, 11 months ago

 

A. Development Environment

 

  • Programming Languages
    • Python -- Python.org
    • R -- R-project.org
      • R environments and packages
        • RStudio
        • rOpenSci (workflow environment based on R, designed for scientists but potentially useful for other scholars who process and narrate data. "Use our packages to acquire data (both your own and from various data sources), analyze it, add in your narrative, and generate a final publication in any one of widely used formats such as Word, PDF, or LaTeX"; "packages that allow access to data repositories through the R statistical programming environment [and] facilitate drawing data into an environment where it can readily be manipulated"; "analyses and methods can be easily shared, replicated, and extended by other researchers")
      • R Tutorials (see in Alan's DH Toychest)
    • Topic Modeling Tools (complemented by Text Preparation "Recipes" for Topic Modeling Work above) (see Topic Modeling Tutorials)
      • DFR-Browser (browser-based visualization interface created by Andrew Goldstone for exploring JSTOR articles through topic modeling, facilitated by the JSTOR "Data for Research" [DFR] site)
      • Gensim ("free Python library: scalable statistical semantics, analyze plain-text documents for semantic structure, retrieve semantically similar documents")
      • Glimmer.rstudio.com Topic Modeling (LDA) visualization tool (allows users to upload their own data to generate scatterplots and bar charts)
      • In-Browser Topic Modeling ("Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations"; by David Mimno.) Note: the files for this tool can be downloaded from GitHub and run locally.
      • LDAvis ("R package for interactive topic model visualization") (example of use)
      • MALLET
        • Mallet (MAchine Learning for LanguagE Toolkit)
          • GRMM (GRaphical Models in Mallet)
          • Programming Historian tutorial for installing and starting with MALLET
      • MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
      • The Networked Corpus ("a Python script that generates a collection of Web pages like the ones we have created for The Spectator.... designed to work with MALLET."  The Networked Corpus project "provides a new way to navigate large collections of texts. Using a statistical method called topic modeling, it creates links between passages that share common vocabularies, while also showing in detail the way in which the topic modeling program has “read” the texts.")
      • Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
      • TMVE ("basic implementation of a topic model visualization engine")
      • Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in the "Downloads" tab on the site]. Input should be provided as .txt files saved in the same directory, formatted with a return between each separate document)
      • "Two Topic Browsers" by Jonathan Goodwin
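The input format noted above for the Topic Modeling Tool (.txt input with a return between each separate document) can be prepared with a short script. The following is a minimal sketch, assuming that "a return between each separate document" means one document per line; the function name build_input_file and the folder and file names in the usage note are hypothetical and do not come from the tool itself.

```python
from pathlib import Path


def build_input_file(source_dir, out_path):
    """Write every .txt file in source_dir to out_path, one document per line.

    Assumption: the target tool treats each line of the input file as one
    document, so line breaks inside a document are collapsed to spaces.
    """
    out_path = Path(out_path)
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("*.txt")):
            if path.resolve() == out_path.resolve():
                continue  # never read the output file back in as a document
            text = path.read_text(encoding="utf-8")
            one_line = " ".join(text.split())  # collapse all runs of whitespace
            if one_line:  # skip empty files
                out.write(one_line + "\n")
                count += 1
    return count
```

For example, build_input_file("articles", "tmt_input.txt") would combine every .txt file in a folder named articles into a single file and return the number of documents written.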

 

2. Search Methods/Scripts

 

 

3. Scraping Methods/Scripts

 

    • i. Python
    • ii. R
    • iii. Auto-downloading Web pages
      • Option 1 (Pros: easy to use, optimized for multi-threaded downloading. Cons: cannot get past proxy servers or password login screens.): DownThemAll add-on for Firefox, https://addons.mozilla.org/en-US/firefox/addon/downthemall/
        • Usage (for our purposes):
          • From the .TSV file (or spreadsheet) created at the end of using the get-nytimes-articles.py script, take the column of URLs and put it in a .txt file, each URL on a separate line (e.g., urls.txt)
          • Open the urls.txt file as a local Web file in Firefox (file:///pathname)
          • Right-click on the displayed file and choose DownThemAll from the menu.
          • Choose the folder in which to save the files downloaded from the URL list.
          • Set the "Renaming Mask" field to the following in order to number items: *inum*_*name*.*ext*
      • Option 2 (Pros: customized, scripted control, including recursive downloading; can pass login information to a proxy or protected site. Cons: requires installation; harder to use because it is a command-line program run from a terminal): Wget, https://www.gnu.org/software/wget/ (Download the version for your platform; for convenience, save the executable in a directory that is on your computer's PATH so that Wget can be invoked from the command line regardless of your current working directory. [To see your PATH, open a command window and type "path" on Windows, or open a terminal and type "echo $PATH" on Linux or OS X.])
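The first usage step under Option 1 (moving the column of URLs from the .TSV file into urls.txt, one URL per line) can also be scripted. This is a minimal sketch, assuming the TSV has a header row and that the URL column is named "url"; that column name, the function name extract_urls, and the file name articles.tsv are hypothetical, so check them against the actual output of get-nytimes-articles.py.

```python
import csv


def extract_urls(tsv_path, out_path, url_column="url"):
    """Copy one column of a tab-separated file into a text file, one URL per line.

    Assumption: the first row of the TSV is a header and url_column names the
    column that holds the article URLs.
    """
    urls = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            url = (row.get(url_column) or "").strip()
            if url:  # skip rows that have no URL
                urls.append(url)
    with open(out_path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return urls
```

Calling extract_urls("articles.tsv", "urls.txt") produces a file that can be opened in Firefox for DownThemAll as described under Option 1. The same file should also work as a URL list for Wget, which reads one from a file through its -i (--input-file) option, for example: wget -i urls.txt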

 

4. Topic Modeling Tools

 

  • i. MALLET (and Mallet frontends)
    • Mallet (MAchine Learning for LanguagE Toolkit)
      • GRMM (GRaphical Models in Mallet)
      • Programming Historian tutorial for installing and starting with MALLET
    • Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in the "Downloads" tab on the site]. Input should be provided as .txt files saved in the same directory, formatted with a return between each separate document)
  • ii. Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
  • iii. Integrated workflow environments for topic modeling (and topic model visualization tools)
    • Lexos ("online tool ... to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
    • Serendip ("system for visually exploring topic models generated on large corpora of documents") (Installing Serendip.docx, with added notes by A. Liu)
    • MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
    • LDAvis ("package for interactive topic model visualization.... designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.")
  • iv. Patrick's repetitive topic-modeling script for Linux/OS X is on SourceForge; there's a README file there in several formats (the HTML version is also here).
    • It's still kind of a quick, ugly hack, though a bit cleaner than the initial version I discussed in a meeting. Suggestions, feedback, and offers to recode in Python would be very welcome.
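As a rough illustration of the kind of reshaping listed above for MALLET-to-Gephi Data Stacker (turning MALLET's '--output-doc-topics' output into something Gephi understands), here is a minimal sketch. It is not that tool's code. It assumes a tab-separated doc-topics file in which each row holds a document index, a document name, and then one proportion per topic in topic order; other layouts would need different parsing. The function name doc_topics_to_edges, the topic_N labels, and the 0.1 threshold are hypothetical choices. The output is a CSV edge list with Source, Target, and Weight columns, which Gephi can import.

```python
import csv


def doc_topics_to_edges(doc_topics_path, edges_path, threshold=0.1):
    """Convert a dense doc-topics table into a Source,Target,Weight edge list.

    Assumption: each data row is "index<TAB>name<TAB>p0<TAB>p1..." with one
    proportion per topic. Lines starting with "#" are treated as headers.
    """
    edges = []
    with open(doc_topics_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header lines
            fields = line.strip().split("\t")
            doc_name = fields[1]
            for topic, value in enumerate(fields[2:]):
                weight = float(value)
                if weight >= threshold:  # keep only the stronger links
                    edges.append((doc_name, f"topic_{topic}", weight))
    with open(edges_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        writer.writerows(edges)
    return edges
```

Each row of the result links one document to one topic, so the graph in Gephi is bipartite: documents on one side, topics on the other, with the topic proportion as the edge weight. Raising the threshold thins the graph to the dominant topics per document.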

 