
Programming Resources

Last edited by Patrick Mooney 8 years, 11 months ago

Development Environment | Search Methods/Scripts
Scraping Methods/Scripts | Topic Modeling Tools

1. Development Environment

Programming Languages

Python -- Python.org

Python Integrated Development Environments (IDEs)

Enthought Canopy (installs Python plus editing environment and IPython Notebooks)
Anaconda
PyCharm
PythonToolkit

Python Packages

Beautiful Soup ("Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work")
Pattern ("web mining module for the Python programming language. It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.")
NLTK (Natural Language ToolKit)

Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (NLTK)

Tutorials:

Python Tutorials (see in Alan's DH Toychest). Also:

"Reading and Writing Files in Python"

BeautifulSoup tutorials:

Jeri Wieringa / Programming Historian, "Intro to Beautiful Soup"

R -- R-project.org

R environments and packages

RStudio
rOpenSci (workflow environment based on R, designed for scientists but potentially useful to other scholars who process data and build narratives around it. "Use our packages to acquire data (both your own and from various data sources), analyze it, add in your narrative, and generate a final publication in any one of widely used formats such as Word, PDF, or LaTeX"; "packages that allow access to data repositories through the R statistical programming environment [and] facilitate drawing data into an environment where it can readily be manipulated"; "analyses and methods can be easily shared, replicated, and extended by other researchers")

R Tutorials (see in Alan's DH Toychest)

Topic Modeling Tools (complemented by Text Preparation "Recipes" for Topic Modeling Work above) (see Topic Modeling Tutorials)

DFR-Browser (browser-based visualization interface created by Andrew Goldstone for exploring JSTOR articles through topic modeling [facilitated by the JSTOR "Data for Research" (DFR) site])
Gensim ("free Python library: scalable statistical semantics, analyze plain-text documents for semantic structure, retrieve semantically similar documents")
Glimmer.rstudio.com Topic Modeling (LDA) visualization tool (allows users to upload their own data to generate scatterplots and bar charts)
In-Browser Topic Modeling ("Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations"; by David Mimno.) Note: the files for this tool can be downloaded and run locally; download from GitHub here.
LDAvis ("R package for interactive topic model visualization") (example of use)
MALLET

Mallet (MAchine Learning for LanguagE Toolkit)

GRMM (GRaphical Models in Mallet)
Programming Historian tutorial for installing and starting with MALLET

MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
The Networked Corpus ("a Python script that generates a collection of Web pages like the ones we have created for The Spectator.... designed to work with MALLET." The Networked Corpus project "provides a new way to navigate large collections of texts. Using a statistical method called topic modeling, it creates links between passages that share common vocabularies, while also showing in detail the way in which the topic modeling program has “read” the texts.")
Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
TMVE ("basic implementation of a topic model visualization engine")
Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in "Downloads" tab on site]. Input files should be in .txt files saved in same directory; the input files are formatted with returns between each separate document)

See also Miriam Posner, "Very basic strategies for interpreting results from the Topic Modeling Tool"

"Two Topic Browsers" by Jonathan Goodwin

2. Search Methods/Scripts

i. Tools and Scripts:

General Toolkits:

Data Science Toolkit (developer documentation)

New York Times APIs, Scripts, and Tools

NY Times Developers APIs page

Article Search API ("Search Times articles from 1851 to today, retrieving headlines, abstracts and links to associated multimedia")
Semantic API ("Get access to the people, places, organizations and descriptors that make up the controlled vocabulary used as metadata by The New York Times")

API Console (drop-down menu tool for building simple calls to the Times APIs)
Tutorials:

James Boehmer, "The New York Times Article Search API" (slides)
csitkursus, "Python and the NYTimes Api"
Data-gov Wiki at Rensselaer Polytechnic Institute, "How to use New York Times Article Search API"
Chris Utz, "Show Me the Code: NYT Trender"
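
A minimal sketch of what a call to the Article Search API looks like when it is built by hand rather than through the API Console. The endpoint and the parameter names (q, begin_date, end_date, page, api-key) are assumed from version 2 of the API and should be checked against the current NY Times developer pages; YOUR_API_KEY is a placeholder. The sketch only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed v2 endpoint; verify against the NY Times developer pages.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, begin_date, end_date, api_key, page=0):
    """Return a request URL for one page of Article Search results.

    Dates are strings in YYYYMMDD form.
    """
    params = {
        "q": query,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    # urlencode escapes spaces and other unsafe characters in the values.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("digital humanities", "20100101", "20101231", "YOUR_API_KEY"))
```

Increasing page while keeping the other parameters fixed steps through further pages of results for the same query.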

ii. Search Query Syntaxes:

Lucene text query syntax (guide; fairly full)
Lucene text query syntax
Lucene Query syntax tutorial (easy to consult; partial coverage)
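
As a quick illustration of the syntax these guides cover, the sketch below assembles a Lucene-style query from fielded terms, a quoted phrase, and boolean operators. The helper functions and the field names headline and body are hypothetical; the fields available in practice depend on the index or API being searched.

```python
def phrase(text):
    """Wrap text in double quotes so it is matched as an exact phrase."""
    return '"' + text.replace('"', '\\"') + '"'


def field(name, value):
    """Restrict a term or phrase to one field, e.g. body:humanities."""
    return f"{name}:{value}"


def all_of(*clauses):
    """Require every clause (boolean AND); parentheses group the result."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses):
    """Accept any clause (boolean OR); parentheses group the result."""
    return "(" + " OR ".join(clauses) + ")"


query = all_of(
    field("headline", phrase("liberal arts")),
    any_of(field("body", "humanities"), field("body", "literature")),
)
print(query)
# prints: (headline:"liberal arts" AND (body:humanities OR body:literature))
```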

iii. Theory and Methods:

General Overviews:

Aggarwal, Charu C., and ChengXiang Zhai, "A Survey of Text Clustering Algorithms" [PDF], chap. 4 in Mining Text Data, ed. Aggarwal and Zhai (Springer: 2012)

Direct Search methods:

Baghel, Rekha, and Renu Dhir (2010), "A Frequent Concepts Based Document Clustering Algorithm" [PDF]
Liu, Xiangwei, and Pilian He (2005), "A Study on Text Clustering Algorithms Based on Frequent Term Sets" [PDF]
Python methods/scripts:

NLTK (Natural Language Tool Kit) methods/scripts

Collocations

get-nytimes-articles (Python script for "getting data from the New York Times Article API. Retrieves JSON from the API, stores it, parses it into a TSV file.")
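
The parse step described above (JSON in, TSV out) can be sketched as follows. This is not the get-nytimes-articles script itself. The sample data is made up, and the response layout (response.docs, with web_url, pub_date, and headline.main for each article) is an assumption based on version 2 of the Article Search API.

```python
import csv
import io
import json

# Hypothetical, trimmed-down sample of an Article Search JSON response.
SAMPLE_JSON = """
{"response": {"docs": [
  {"web_url": "http://example.com/a", "pub_date": "2014-01-02",
   "headline": {"main": "First headline"}},
  {"web_url": "http://example.com/b", "pub_date": "2014-01-03",
   "headline": {"main": "Second\\theadline"}}
]}}
"""


def docs_to_tsv(json_text, out_file):
    """Write one tab-separated row per article: date, headline, URL."""
    docs = json.loads(json_text)["response"]["docs"]
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    writer.writerow(["pub_date", "headline", "url"])
    for doc in docs:
        # Collapse whitespace so stray tabs or newlines cannot break a row.
        headline = " ".join(doc.get("headline", {}).get("main", "").split())
        writer.writerow([doc.get("pub_date", ""), headline, doc.get("web_url", "")])
    return len(docs)


buffer = io.StringIO()  # with real data: open("articles.tsv", "w", newline="")
count = docs_to_tsv(SAMPLE_JSON, buffer)
print(buffer.getvalue())
```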

Trained classifier methods

Python methods/scripts:

NLTK (Natural Language Tool Kit) methods/scripts

"Training Binary Text Classifiers with NLTK Trainer"
"Text Classification for Sentiment Analysis – Naive Bayes Classifier" (also useful for learning about general method of NLTK classification)
NLTK site "How-To's":

Classifiers

3. Scraping Methods/Scripts

i. Python

Scraping New York Times & The Guardian using Python
Article Scraping in Python (video tutorial on web scraping using Python and Beautiful Soup)
Human Rights Coverage Over Time: A Tutorial in Automated Text Analysis (example of scraping and analyzing)
Scraping Using Python Packages

Beautiful Soup (Download) (Python package for scraping)

Installation process: run the following command in a command/terminal window: pip install beautifulsoup4 (pip downloads the package itself; extracting the tar.gz download is needed only for a manual install)
Tutorials

Beautiful Soup Documentation and tutorial
Tutorial from Python for Beginners
Example of a Python script for reading in an HTML file and running Beautiful Soup on it: Example of using Beautiful Soup.txt

Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (NLTK)

Scrapy (Python "framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival")

ii. R

Scraping New York Times Articles with R

iii. Auto-downloading Web pages

Option 1 (Pros: easy to use, optimized for multi-thread downloading. Cons: cannot get past proxy servers or password login screens.): DownThemAll addon for Firefox, https://addons.mozilla.org/en-US/firefox/addon/downthemall/

Usage (for our purposes):

From the .TSV file (or spreadsheet) created at the end of using the get-nytimes-articles.py script, take the column of URLs and put it in a .txt file, each URL on a separate line (e.g., urls.txt)
Open the urls.txt file as a local Web file in Firefox (file:///pathname)
Right click on the displayed file and choose DownThemAll from the menu.
Define folder in which to save the files downloaded from the URL list.
Set the "Renaming Mask" field to the following in order to number items: *inum*_*name*.*ext*
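
The first step above (pulling the URL column out of the .TSV file into a plain list) can also be scripted. This is a minimal sketch that assumes the .TSV file has a header row and that the URL column is named url; both the column name and the file names in the comments are hypothetical and should be adjusted to match the actual output.

```python
import csv
import io


def extract_urls(tsv_file, column="url"):
    """Return the non-empty values of one column from a TSV file with a header row."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    urls = []
    for row in reader:
        # Short rows yield None for missing columns, so fall back to "".
        value = (row.get(column) or "").strip()
        if value:
            urls.append(value)
    return urls


def write_url_list(urls, out_file):
    """Write each URL on its own line."""
    for url in urls:
        out_file.write(url + "\n")


# In-memory stand-in for the TSV file; with real data use
# open("articles.tsv", newline="") instead.
sample_tsv = io.StringIO(
    "pub_date\theadline\turl\n"
    "2014-01-02\tFirst headline\thttp://example.com/a\n"
    "2014-01-03\tNo link\t\n"
    "2014-01-04\tThird headline\thttp://example.com/c\n"
)
urls = extract_urls(sample_tsv)
out = io.StringIO()  # with real data: open("urls.txt", "w")
write_url_list(urls, out)
print(out.getvalue())
```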

Option 2 (Pros: customizable, scripted control, including recursive downloading; can pass login information to a proxy or protected site. Cons: requires installation; harder to use because it is a command-line program): Wget, https://www.gnu.org/software/wget/ (Download for your platform; for convenience, save the executable file in a directory that is in the PATH definition on your computer so that Wget can be invoked from the command line no matter your current working directory. [To see your path definitions, open a command window or terminal and type "path" on Windows or "echo $PATH" on Linux/OS X.])
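
A minimal sketch of driving Wget from a Python script, assuming a urls.txt file like the one described under Option 1 and a hypothetical output folder named downloads. The three options used (--input-file, --directory-prefix, --wait) are standard Wget options; the one-second pause is only an illustrative choice. The download runs only if Wget is found on the PATH and urls.txt exists.

```python
import os
import shutil
import subprocess


def build_wget_command(url_list_path, output_dir, wait_seconds=1):
    """Build a Wget command line that downloads every URL listed in a file."""
    return [
        "wget",
        f"--input-file={url_list_path}",    # read URLs from this file, one per line
        f"--directory-prefix={output_dir}", # save downloads into this folder
        f"--wait={wait_seconds}",           # pause between requests
    ]


command = build_wget_command("urls.txt", "downloads")
print(" ".join(command))

# Run only if Wget is installed and the URL list is present.
if shutil.which("wget") and os.path.exists("urls.txt"):
    subprocess.run(command, check=False)
```

The printed command can equally be pasted into a terminal as is.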

4. Topic Modeling Tools

i. MALLET (and Mallet frontends)

Mallet (MAchine Learning for LanguagE Toolkit)

GRMM (GRaphical Models in Mallet)
Programming Historian tutorial for installing and starting with MALLET

Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in "Downloads" tab on site]. Input files should be in .txt files saved in same directory; the input files are formatted with returns between each separate document)
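
A minimal sketch of preparing an input file in the layout described above. It reads "returns between each separate document" as meaning one document per line, which is an assumption; the function name and sample texts are hypothetical.

```python
def to_one_document_per_line(documents):
    """Join documents into one string with exactly one document per line."""
    lines = []
    for text in documents:
        # Collapse internal line breaks and runs of spaces so that a return
        # only ever marks the boundary between two documents.
        flattened = " ".join(text.split())
        if flattened:
            lines.append(flattened)
    return "\n".join(lines) + "\n"


docs = ["First article,\nsplit over two lines.", "   ", "Second article."]
corpus = to_one_document_per_line(docs)
print(corpus)
```

Writing corpus to a .txt file then gives a single input file in which empty documents have been dropped.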

ii. Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
iii. Integrated workflow environments for topic modeling (and topic model visualization tools)

Lexos ("online tool ... to 'scrub' (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
Serendip ("system for visually exploring topic models generated on large corpora of documents") (Installing Serendip.docx, with added notes by A. Liu)
MALLET-to-Gephi Data Stacker (online tool that takes "the '--output-doc-topics' output from MALLET and reorganize it into a format that Gephi understands")
LDAvis ("package for interactive topic model visualization.... designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.")
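
To make the MALLET-to-Gephi step in the list above concrete, the sketch below turns document-topic proportions into an edge list with Source, Target, and Weight columns, a layout Gephi's spreadsheet import can read. It is not the Data Stacker's own code. It assumes a tab-separated '--output-doc-topics' layout of index, document name, then one proportion per topic; some MALLET versions write topic/proportion pairs instead, so check the file first. The sample data and the 0.1 cutoff are made up.

```python
import csv
import io


def doc_topics_to_edges(doc_topics_file, min_weight=0.1):
    """Turn document-topic proportions into (document, topic, weight) edges.

    Assumes each data line is tab-separated: index, document name, then one
    proportion per topic in topic order. Lines starting with '#' are skipped.
    """
    edges = []
    for line in doc_topics_file:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        name, proportions = fields[1], fields[2:]
        for topic, value in enumerate(proportions):
            weight = float(value)
            # Keep only the stronger document-topic links to limit clutter.
            if weight >= min_weight:
                edges.append((name, f"topic_{topic}", weight))
    return edges


def write_edge_csv(edges, out_file):
    """Write an edge list with Source, Target, Weight columns."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


sample = io.StringIO(
    "#doc\tname\ttopic proportions\n"
    "0\tdoc_a.txt\t0.7\t0.25\t0.05\n"
    "1\tdoc_b.txt\t0.05\t0.05\t0.9\n"
)
edges = doc_topics_to_edges(sample)
out = io.StringIO()  # with real data: open("edges.csv", "w", newline="")
write_edge_csv(edges, out)
print(out.getvalue())
```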

iv. Patrick's repetitive topic-modeling script for Linux/OS X is on SourceForge; there's a README file there in several formats (the HTML version is also here).

It's still kind of a quick, ugly hack, though a bit cleaner than the initial version I discussed in a meeting. Suggestions, feedback, and offers to recode in Python would be very welcome.
