Text Preparation and Topic Modeling Resources
Last edited by Alan Liu 11 years ago
Text Preparation
- Tutorials
- Tools
- Collection-level tools for massaging the WhatEvery1Says spreadsheet, archive filenames, etc.:
- OpenRefine ("tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase")
- NameChanger ("Rename a list of files quickly and easily. See how the names will change as you type")
- Overview ("automatically sorts thousands of documents into topics and sub-topics, by reading the full text of each one")
- Tools for transforming texts:
- Lexos - Integrated Lexomics Workflow ("online tool ... to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations")
- NEX - Named Entity eXtraction (Web tool from dataTXT to identify names, concepts, etc. in short texts; also allows API access)
- Stanford Named Entity Recognizer (NER) ("a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. It comes with well-engineered feature extractors for Named Entity Recognition, and many options for defining feature extractors. Included with the download are good named entity recognizers for English, particularly for the 3 classes (PERSON, ORGANIZATION, LOCATION), and we also make available on this page various other models for different languages and circumstances")
- VARD 2 ("software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic tools such as keyword analysis, collocations," etc.)
- Text Preparation "Recipes" for Topic Modeling Work:
- Matthew Jockers
- "'Secret' Recipe for Topic Modeling Themes" (guidance on creating stop lists, using part-of-speech taggers to filter text, and "chunking" texts into sections of suitable length to optimize topic-modeling results)
- "Expanded Stopwords List" ("Below is the list of stop words I used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.")
- Andrew Goldstone & Ted Underwood, "Code Used ... in Analyzing Topic Models of Literary-studies Journals" (GitHub repository of stoplist, code, and resources for Goldstone and Underwood's topic modeling project)
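The stop-list and "chunking" steps mentioned in the recipes above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by Jockers or by Goldstone and Underwood: the function names, the tiny stop list, and the chunk size are all hypothetical, and the part-of-speech filtering step is left out because it requires an external tagger.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stopwords(tokens, stoplist):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stoplist]

def chunk(tokens, size):
    """Split tokens into consecutive chunks of at most `size` tokens each."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Hypothetical stop list for illustration; a real one would be far longer
# (e.g. high-frequency words plus personal names).
STOPLIST = {"the", "of", "an", "and", "a"}

text = "The history of an idea and the idea of a history"
tokens = remove_stopwords(tokenize(text), STOPLIST)
chunks = chunk(tokens, 2)  # chunk size of 2 is for demonstration only
print(chunks)
```

In practice each chunk would be written out as its own document before modeling, and the chunk size would be in the hundreds or thousands of words rather than two.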
Topic Modeling
- Tutorials
- Tools
- Mallet
- Mallet (MAchine Learning for LanguagE Toolkit)
- In-Browser Topic Modeling ("Many people have found topic modeling a useful (and fun!) way to explore large text collections. Unfortunately, running your own models usually requires installing statistical tools like R or Mallet. The goals of this project are to (a) make running topic models easy for anyone with a modern web browser, (b) explore the limits of statistical computing in Javascript and (c) allow tighter integration between models and web-based visualizations"; by David Mimno.) Note: the files for this tool can be downloaded and run locally; download from GitHub here.
- Topic Modeling Tool (Java-based "graphical user interface tool for Latent Dirichlet Allocation topic modeling" by David Newman; comes with test input files [look in the "Downloads" tab on the site]. Input must be plain-text (.txt) files saved in the same directory; within an input file, separate documents are delimited by returns, i.e., line breaks)
- Other Tools Related to Topic Modeling
- Glimmer.rstudio.com Topic Modeling (LDA) visualization tool (allows users to upload their own data to generate scatterplots and bar charts)
- Stanford Topic Modeling Toolbox ("brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features that ability to: * Import and manipulate text from cells in Excel and other spreadsheets; * Train topic models (LDA, Labeled LDA, and PLDA new) to create summaries of the text; * Select parameters (such as the number of topics) via a data-driven process; * Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data")
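For tools that expect one document per line, such as the input format noted for the Topic Modeling Tool above, a folder of separate .txt files can be merged into a single input file. The following is a minimal sketch under that assumption; the function name, file names, and sample texts are hypothetical, and the exact input requirements should be checked against the tool's own test files.

```python
from pathlib import Path
import tempfile

def one_document_per_line(input_dir, output_file):
    """Write each .txt file in input_dir as a single line of output_file.

    Internal line breaks and runs of whitespace are collapsed to single
    spaces so that a return only ever marks the boundary between documents.
    Returns the number of documents written.
    """
    paths = sorted(Path(input_dir).glob("*.txt"))
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            out.write(" ".join(text.split()) + "\n")
    return len(paths)

# Demonstration with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "docs"
    src.mkdir()
    (src / "a.txt").write_text("First document,\nsplit over lines.\n", encoding="utf-8")
    (src / "b.txt").write_text("Second   document.", encoding="utf-8")
    combined = Path(tmp) / "combined.txt"
    count = one_document_per_line(src, combined)
    lines = combined.read_text(encoding="utf-8").splitlines()

print(count, lines)
```

The output file is kept outside the input folder here so that a second run does not pick up the merged file as if it were another document.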