End-to-end Topic Modeling Rehearsal Workshop (2015-11-16)

Resources needed on computers participating in workshop
Download the suggested workspace folder structure for this workshop, containing the subcorpus of New York Times articles, 2010-14, that we will be using, along with all the other items in red below (Python scripts, stop-word list, and iPython notebook), as a single, ready-to-deploy zip file: here. The zip file also includes instructions and links to the software needed for the workshop. Note: the recommendations below for local working-folder structure and other pathnames are given for Windows machines, e.g., "C:/workspace/1". Mac users, please translate these into the Mac equivalents, e.g., "/Users/yourusername/workspace/1".
- Programming Environments:
- Software:
- Mallet, version 2.0.8RC2 (latest version)
- Reminders for Windows installation of Mallet:
- Mallet must be installed at root level (C:\mallet). After installing, set the environment variable MALLET_HOME to point to the Mallet directory. (Instructions for setting environment variables in Windows)
- Windows pathnames entered at the command line use the backslash (e.g., bin\mallet); Mac pathnames use the forward slash (e.g., ./bin/mallet)
- Example Mallet command lines (pathnames configured according to recommended working folder structure described below)
- Input folder of text files and generate .mallet data file:
- bin\mallet import-dir --input C:\workspace\1\articles-scrubbed --output C:\workspace\1\topic-models\topics.mallet --keep-sequence --remove-stopwords --extra-stopwords C:\workspace\1\we1s_extra_stopwords.txt
- Create topic model:
- bin\mallet train-topics --input C:\workspace\1\topic-models\topics.mallet --num-topics 100 --optimize-interval 20 --output-state C:\workspace\1\topic-models\topic-state.gz --output-topic-keys C:\workspace\1\topic-models\keys.txt --output-doc-topics C:\workspace\1\topic-models\composition.txt --word-topic-counts-file C:\workspace\1\topic-models\topic_counts.txt
- Python Scripts: (items in red available in the ready-to-deploy zip file for this workshop: here)
- Scrubbing scripts: (download latest version from WE1S Google Drive here)
- config.py (This is the configuration file for Scott's scrub.py; config.py specifies the input and output folders for files, and is loaded with the exact words and phrases that scrub.py processes. The latest version of config.py contains the words and phrases that Lindsay and Alan have so far added. Alan added many of the terms after inspecting sample NY Times files in Antconc to look for frequent words, word forms, names, and phrases that need to be deleted, consolidated, or otherwise massaged.)
- scrub.py (This is the actual Python script to run on a folder of files once config.py has been configured. Usage: configure config.py first if necessary, then run scrub.py)
- Topics-to-Documents script: (download latest version from WE1S Google Drive here)
- topicsToDocs.py (We convert the topics in a topic model into what we call "topic-documents" so that we can cluster them and do other analysis with them as if they were ordinary text documents of the kind clustering tools are designed for. topicsToDocs.py uses as input the topic_counts.txt file created by Mallet. For each topic, it creates a "topic-document" [named "Topic0.txt," "Topic1.txt," etc.] containing the 100 most important words of that topic, repeated in proportion to their weights in the topic. Usage: configure lines 2-5 in topicsToDocs.py with the input path of the topic_counts.txt file, the name of that file, the number of topics in your topic model, and the number of top words in each topic to act on.)
- iPython Notebooks: (download latest versions from WE1S Google Drive here)
Note: iPython Notebooks open as locally served Web pages containing documentation or explanation interspersed with actionable Python code. Usage: placing your cursor in one of the code cells on the page and pressing Shift + Enter will execute that bit of code. Normally, a user executes the code cells in sequence to generate the results needed as input for later code.
- dariah-de-tutorial-v3.ipynb (Based on a Python tutorial by DARIAH-DE, and adapted first by Scott and then Alan, this iPython notebook can be used to input a folder of "topic-documents" and generate PCA and hierarchical dendrogram visualizations. The visualizations open in separate windows, and can be enlarged and manipulated in various ways. Important: because of package dependencies that are difficult to install, dariah-de-tutorial-v3.ipynb should be opened and run in the Anaconda distribution of Python, which comes with all the necessary packages. Open the Anaconda "Launcher"; then launch IPython Notebooks from the menu that appears; and then navigate to the folder where dariah-de-tutorial-v3.ipynb is stored on your local computer. Selecting the notebook will start an instance of a local web server [as if there were a server on your machine, but one that serves web pages only to yourself], and that local server will show you the notebook as a web page in your browser.)
- Note: For the set of cells containing executable code in this iPython notebook under the heading "Preparing to Analyze Document Vectors," there is one cell (#4) to pay special attention to. It's the cell that follows immediately after the explanation, "We will use CountVectorizer both to load the texts and to construct a DTM from them. We'll start by constructing a list of filenames...." This cell contains code that begins with "filenames = [". The paths and file names here have to match where you put the 100 topic-documents on your local machine.
- For Windows machines, the code for all the filenames for this workshop is pre-filled in the iPython notebook. The code for this cell is also available here in the plain text file titled filenames-for-iPython-notebook.txt
- For Mac machines, you will need to adapt the code (using search and replace in a text processor)
- Stopword List: (download latest version from WE1S Google Drive here) -- Recommended location to save file on local machine: C:\workspace\1\we1s_extra_stopwords.txt (or Mac equivalent)
- we1s_extra_stopwords.txt (This is an extra stopword list that supplements Mallet's default stopword list. Usage: in the initial command string that instructs Mallet to input a folder of files and generate a .mallet data file, append to "--remove-stopwords" the additional parameter (separated by a space): "--extra-stopwords [path/filename for extra stopwords file]")
- Subcorpus of articles from WE1S Corpus serving as primary materials for the workshop: (access requires login as WE1S user on WE1S Google Drive)
- articles-h (New York Times articles, 2010-14, mentioning "humanities") (download as zip file from WE1S Google Drive here)
- articles-la (New York Times articles, 2010-14, mentioning "liberal arts") (download as zip file from WE1S Google Drive here)
- articles-lah (This is the consolidated, de-duplicated combination of the above two sets of New York Times articles, 2010-14, for "humanities" and "liberal arts") (download as zip file from WE1S Google Drive here)
- Recommended folder structure on local machines: (Mac equivalents for the below paths would begin "/Users/yourusername/workspace/1")
(This set of workspace folders is available in pre-organized form in the ready-to-deploy zip file for this workshop: here)
- C:/workspace/1/ (main workspace; the subfolder "1" is there because most people will have other folders already in the workspace)
- C:/workspace/1/articles-h (holds the subcorpus of New York Times articles, 2010-14, mentioning "humanities")
- C:/workspace/1/articles-la (holds the subcorpus of New York Times articles, 2010-14, mentioning "liberal arts")
- C:/workspace/1/articles (holds the consolidated, de-duplicated humanities and liberal arts articles)
- C:/workspace/1/articles-scrubbed (holds scrubbed versions of the articles)
- C:/workspace/1/topic-models (holds Mallet results and also subfolders for later derivative results; iterations of topic models can be held in subfolders named "iteration1," etc.)
- C:/workspace/1/topic-models/topic-documents (holds "topic documents" created using the topicsToDocs.py script)
- C:/workspace/1/topic-models/topic-clusters (holds PCA, dendrogram, and other clustering visualizations of topics)
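Note: if you prefer to create these workspace folders with a short script rather than by hand, the following is a minimal Python sketch (the Windows base path above is assumed; Mac users would substitute "/Users/yourusername/workspace/1" for "base"):

    # create_workspace.py -- minimal sketch for setting up the recommended folders
    # Assumes the Windows base path used above; adjust "base" for Mac.
    import os

    base = "C:/workspace/1"
    subfolders = [
        "articles-h",
        "articles-la",
        "articles",
        "articles-scrubbed",
        "topic-models",
        "topic-models/topic-documents",
        "topic-models/topic-clusters",
    ]

    for sub in subfolders:
        path = os.path.join(base, sub)
        if not os.path.exists(path):
            os.makedirs(path)  # creates intermediate folders as needed
            print("created " + path)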
Preliminaries
Workshop for End-to-End Trial Rehearsal of Workflow for Topic Modeling WE1S Corpus (at sample scale)
Step-by-step instructions for Workshop:
(We'll run in parallel at our various locations. However, some steps will be pre-prepared.)
-
Infrastructure for Workshop:
- Parallel installations on computers at the following locations:
- Transcriptions (UCSB) (Prep the workstation attached to the projector and Skype)
- Alan
- Lindsay
- Scott (if he is able to participate in the workshop)
- Installations should include: (see under spotlight at top of this page)
-
Workshop Stage 1 -- Assemble a demo subcorpus of the WE1S corpus
- Access latest "flattened zipped" versions of the WE1S corpus on the Mirrormask server at the location: File Station > Archives (screenshot of example). For this workshop, we will be using the flattened version of 25 October 2015 (corpus_2015-10-25_flat.zip). (Note: accessing the Mirrormask server requires working over the UCSB VPN and login permissions for the WE1S user)
- Download the zipped file, unzip it, and copy out part of the corpus for use as our workshop demo corpus. (We will be using the New York Times articles, 2010-14, mentioning "humanities" and also the New York Times articles, 2010-14, mentioning "liberal arts".) (To save time, you can download the humanities files for these years here, and the liberal arts files here, from the WE1S Google Drive, which you need to log into as the WE1S user.)
- Paste the demo corpus into your local workspace in folders: C:\workspace\1\articles-h and C:\workspace\1\articles-la
-
Workshop Stage 2 - De-duplicate the demo subcorpus
- This step has been prepared in advance with the help of Jeremy (see corpus_compare.xlsx )
- Total number of New York Times 2010-14 "humanities" articles: 1,062
- Total number of New York Times 2010-14 "liberal arts" articles: 751
- Total articles after consolidating and de-duping the above: 1,623
- Jeremy will demonstrate the process in real time.
- De-duplication script:
corpus_compare.py https://mirrormask.english.ucsb.edu:5001/fbsharing/CBkLLLlX
The de-duplication Python script generates a listing of duplicate file pairs in spreadsheet format, along with the reason for each match and a rating of the degree of matching. It can be downloaded and run on your computer; it requires Python 2.7 and scikit-learn. Alternately, it can be used from a virtual container (Anaconda) running on our server with all requirements already installed. (An illustrative sketch of the underlying matching technique appears at the end of this stage.)
usage:
corpus_compare.py [-h] [-i INPUTPATHS [INPUTPATHS ...]] [-f FILEPATTERN] [-o OUTPUTFILE] [-t THRESHOLD]
Use Example 1:
corpus_compare.py
Scans all .txt files in the current directory and subdirectories; default settings and output.
Use Example 2:
corpus_compare.py -i /mytemp/doc-compare-test/2015-11-16-workshop/data/ -f "*.txt" -t 0.85 -o /mytemp/doc-compare-test/2015-11-16-workshop/corpus_compare-args.csv
Scans with a specified input directory, file filter, threshold, and output file.
- The de-duplicated, consolidated demo sub-corpus (consisting of New York Times articles, 2010-14, mentioning "humanities" and/or "liberal arts") should be placed in your local workspace in the folder C:\workspace\1\articles\
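For those curious how corpus_compare.py's matching works in general terms: the script itself is the authority, but the underlying technique (pairwise TF-IDF cosine similarity with scikit-learn) can be illustrated with a minimal sketch like the one below. The folder path and the 0.85 threshold are placeholders, not settings taken from the script.

    # dedupe_sketch.py -- illustrative only; this is NOT the actual corpus_compare.py
    # Compares every pair of .txt files in a folder by TF-IDF cosine similarity
    # and prints pairs whose similarity exceeds a threshold. Requires scikit-learn.
    import glob
    import io
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    folder = "C:/workspace/1/articles"   # placeholder path
    threshold = 0.85                     # placeholder similarity cutoff

    paths = sorted(glob.glob(folder + "/*.txt"))
    texts = [io.open(p, encoding="utf-8", errors="ignore").read() for p in paths]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    sims = cosine_similarity(tfidf)      # dense matrix of pairwise similarities

    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if sims[i, j] >= threshold:
                print("%s\t%s\t%.3f" % (paths[i], paths[j], sims[i, j]))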
-
Workshop Stage 3 - Scrub the demo subcorpus
- Apply Scott's Python scrubbing script (scrub.py) to the subcorpus, using the current version of the config.py file containing the scrubbing information (for a convenient view of the file's contents, see the text version: config.py (2015-11-16).txt). A generic illustration of how such a config-driven scrubber works appears at the end of this stage.
- Deposit results in a separate folder in the workspace: C:\workspace\1\articles-scrubbed\
- Download a copy of the current WE1S we1s_extra_stopwords.txt file from the WE1S Google Drive here; and put it in your workspace folder, C:/workspace/1/ (This will be used when running MALLET during the topic modeling stage next.)
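scrub.py and config.py are Scott's; the sketch below is only meant to illustrate the general pattern of a config-driven scrubber, i.e., a set of deletions and consolidations applied to every file in a folder. The phrases and paths shown are hypothetical stand-ins, not the contents of the real config.py.

    # scrub_sketch.py -- illustrative only; this is NOT Scott's scrub.py
    # Applies hypothetical deletions and consolidations to every .txt file in an
    # input folder and writes the cleaned copies to an output folder.
    import glob
    import io
    import os
    import re

    input_folder = "C:/workspace/1/articles"            # placeholder
    output_folder = "C:/workspace/1/articles-scrubbed"  # placeholder

    delete_phrases = ["the new york times", "op-ed"]    # hypothetical examples
    consolidate = {"liberal-arts": "liberal arts"}      # hypothetical examples

    for path in glob.glob(input_folder + "/*.txt"):
        text = io.open(path, encoding="utf-8", errors="ignore").read().lower()
        for phrase in delete_phrases:
            text = text.replace(phrase, " ")
        for old, new in consolidate.items():
            text = text.replace(old, new)
        text = re.sub(r"\s+", " ", text)                # collapse leftover whitespace
        out_path = os.path.join(output_folder, os.path.basename(path))
        with io.open(out_path, "w", encoding="utf-8") as out:
            out.write(text)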
-
Workshop Stage 4 - Topic model the subcorpus using MALLET:
- (A) Open a command shell ("command line") window
- Windows "command prompt" (using shell command language based on MS-DOS) -- quick guide -- tip on enabling copy/paste operations
- Mac "terminal" (using shell command language based on Unix) -- cheat sheet
- Linux command line (also uses bash)
- (B) Navigate to the Mallet folder (directory)
- Windows: type the following at the command line, followed by a return (helpful tip: the <F5> function key pastes in the previously used command)
- cd C:\mallet
- Mac: type the following at the command line
- cd /Users/yourusername/mallet
- As with the Windows command above, this will depend on where and how you have saved the MALLET folder you downloaded when you installed the package. The above command assumes you've saved MALLET (and titled the folder it's in "mallet") under your home directory.
- (C) Input a folder of text files and process them into a .mallet data file
Use the command below, varying the path and file names as desired (red italics indicate path/filenames you supply). Use backslashes in Windows, forward slashes on Macs. There must be no hidden returns in the command. Best practice is to set the job up in a text document (without "wordwrap" view turned on) and copy/paste the command into the command shell.
- General format (Windows): bin\mallet import-dir --input path to folder --output path and filename of desired output data file with .mallet extension --keep-sequence --remove-stopwords
- Command line for this workshop (Windows):
- bin\mallet import-dir --input C:\workspace\1\articles-scrubbed --output C:\workspace\1\topic-models\topics.mallet --keep-sequence --remove-stopwords --extra-stopwords C:\workspace\1\we1s_extra_stopwords.txt
- Command line for this workshop (Mac):
- ./bin/mallet import-dir --input /Users/[your-user-name]/workspace/1/articles-scrubbed --output /Users/[your-user-name]/workspace/1/topic-models/topics.mallet --keep-sequence --remove-stopwords --extra-stopwords /Users/[your-user-name]/workspace/1/we1s_extra_stopwords.txt
- (D) Create the topic model ("train" the topic model) -- Use the following command.
- General format (Windows): bin\mallet train-topics --input path and filename of the previously created data file with .mallet extension --num-topics desired number of topics --optimize-interval 20 --output-state path to output folder\topic-state.gz --output-topic-keys path to output folder\keys.txt --output-doc-topics path to output folder\composition.txt --word-topic-counts-file path to output folder\topic_counts.txt
- Command line for this workshop (Windows):
- bin\mallet train-topics --input C:\workspace\1\topic-models\topics.mallet --num-topics 100 --optimize-interval 20 --output-state C:\workspace\1\topic-models\topic-state.gz --output-topic-keys C:\workspace\1\topic-models\keys.txt --output-doc-topics C:\workspace\1\topic-models\composition.txt --word-topic-counts-file C:\workspace\1\topic-models\topic_counts.txt
- Command line for this workshop (Mac):
- ./bin/mallet train-topics --input /Users/[your-user-name]/workspace/1/topic-models/topics.mallet --num-topics 100 --optimize-interval 20 --output-state /Users/[your-user-name]/workspace/1/topic-models/topic-state.gz --output-topic-keys /Users/[your-user-name]/workspace/1/topic-models/keys.txt --output-doc-topics /Users/[your-user-name]/workspace/1/topic-models/composition.txt --word-topic-counts-file /Users/[your-user-name]/workspace/1/topic-models/topic_counts.txt
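Note: to skim the resulting topics without opening the output files by hand, a small Python sketch like the one below will print each topic from keys.txt (each line of that file gives the topic number, its Dirichlet parameter, and the topic's top key words). The path shown assumes the Windows output location used above; adjust it for your machine.

    # print_keys.py -- minimal sketch for skimming Mallet's keys.txt output
    import io

    keys_path = "C:/workspace/1/topic-models/keys.txt"  # adjust for your machine

    with io.open(keys_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            topic_num, alpha, words = line.rstrip("\n").split("\t", 2)
            print("Topic %s (alpha=%s): %s" % (topic_num, alpha, words))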
-
Workshop Stage 5 - Clustering the topics:
- Use Scott's topicsToDocs.py script on the topic_counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. (Or use Lexos to do the same):
- Configure the settings at the top of the topicsToDocs.py script to the following (Windows; adapt path name for Mac)
- input_file_path = "C:/workspace/1/topic-models"
  input_file = "topic_counts.txt"
  num_topics = 100
- The script will deposit the resulting topic documents (named "Topic0.txt," "Topic1.txt," etc.) in the same folder as its input (in this case C:/workspace/1/topic-models/). Move the topic documents into C:/workspace/1/topic-models/topic-documents/
- Using the Anaconda distribution of Python, open your local copy of the iPython notebook titled dariah-de-tutorial-v3.ipynb (available from the WE1S Google Drive here). This is the WE1S adaptation of a DARIAH-DE tutorial iPython notebook (edited by Scott and Alan). The notebook contains executable Python commands and scripts that we will use to produce PCA clustering and dendrogram visualizations of the topic-documents. (A compact code sketch of what the notebook does appears at the end of this stage.)
- Open the Anaconda "Launcher"
- From the options in the launcher, choose "ipython-notebook" and launch. From the folder list that opens, find and open the local copy of dariah-de-tutorial-v3.ipynb. This will start a local web service on your computer and open the iPython notebook as if it were a Web page in your browser. (Note: the local web service is started in a command or terminal window on your computer. Leave that window open.)
- In the iPython notebook, scroll down a bit and begin at the section titled "Getting Some Data." You'll see paragraphs of explanatory text interspersed with executable cells containing live Python code that can be run. To run the code in a cell, place your cursor in the cell, and then press CTRL-return.
- Execute the code in each of the cells in sequence.
- Note: when you get to the set of cells under the heading "Preparing to Analyze Document Vectors," there is one cell (#4) to pay special attention to. It's the cell that follows immediately after the explanation, "We will use CountVectorizer both to load the texts and to construct a DTM from them. We'll start by constructing a list of filenames...." This cell contains code that begins with "filenames = [". The paths and file names here have to match where you put the 100 topic-documents on your local machine.
- For Windows machines, the code for all the filenames for this workshop is pre-filled in the iPython notebook. The code for this cell is also available here in the plain text file titled filenames-for-iPython-notebook.txt
- For Mac machines, you will need to adapt the code (using search and replace in a text processor)
- Example of results from using this iPython notebook:
- Alan's results from NY Times 2002-2006 "humanities": PCA | Dendrogram
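Note: the notebook itself is the authoritative version of this analysis. Purely as a compact reminder of what it does, the sketch below builds a document-term matrix from the topic-documents with CountVectorizer and then produces a PCA plot and a hierarchical-clustering dendrogram (scikit-learn, scipy, and matplotlib; the folder path is the recommended Windows location, and Ward linkage is just one reasonable choice, not necessarily the notebook's).

    # cluster_sketch.py -- compact sketch of the notebook's analysis; the notebook is authoritative
    import glob
    import os
    import matplotlib.pyplot as plt
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import PCA
    from scipy.cluster.hierarchy import linkage, dendrogram

    folder = "C:/workspace/1/topic-models/topic-documents"  # adjust for your machine
    filenames = sorted(glob.glob(folder + "/Topic*.txt"))
    labels = [os.path.basename(f).replace(".txt", "") for f in filenames]

    # Document-term matrix: CountVectorizer reads the files directly via input="filename"
    dtm = CountVectorizer(input="filename").fit_transform(filenames).toarray()

    # PCA scatterplot of the topic-documents
    coords = PCA(n_components=2).fit_transform(dtm)
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1])
    for label, x, y in zip(labels, coords[:, 0], coords[:, 1]):
        plt.annotate(label, (x, y), fontsize=8)
    plt.title("PCA of topic-documents")

    # Hierarchical clustering dendrogram (Ward linkage is one reasonable choice)
    plt.figure(figsize=(12, 6))
    dendrogram(linkage(dtm, method="ward"), labels=labels, leaf_font_size=8)
    plt.title("Dendrogram of topic-documents")
    plt.show()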
(d) For the Next Workshop: Interpret the Topic Model of the Sub-corpus
- Discuss the topic model of the sub-corpus based on inspecting the topic model and also the clustering dendrogram (and other clustering experiments)
- Work out a systematic, documentable workflow for interpreting topic models
Note about general strategy: Eventually, there are likely to be at least three phases of work involved in interpreting topic models of the WE1S corpus:
- Interpret the topic model(s) that represent the whole corpus.
- Interpret the above in a way that discovers comparative differences in the incidence and weight of topics across distinct chronological, national, or other segments of our corpus.
- Create and interpret a topic model specifically for those articles in our corpus that focus directly rather than peripherally on the humanities. (To identify such articles, we will need to develop a textual-analysis method for locating articles that emit the strongest "signal" of direct focus on the humanities--e.g., articles where the word "humanities" co-occurs with specific other words like "college," "majors," "literature," "history," etc. One rough sketch of such a filter appears at the end of this note.)
Conceptually, #1 and #2 above are about modeling the structure of ideas in the overall field of discourse in which the word "humanities" participates. The metaphor of a "neighborhood" of discourse may help clarify. When WE1S scrapes all articles that mention "humanities" and "liberal arts," it is collecting the whole neighborhood in which our particular subject of interest, the humanities, has a residence or does business. But there are a lot of other houses and businesses in the neighborhood--private residences, shops, movie theaters, museums, banks, churches, police stations, etc. When we topic model this whole neighborhood of discourse, we are asking the question: how is discourse about and by the public in the neighborhood structured? Secondarily: how does discourse about the humanities fit into that overall picture?
By contrast, #3 above is about pre-identifying the exact household or business of the humanities and using topic modeling to understand the structure of discourse about/by the humanities in that focalized discursive space--e.g., the way that articles titled such things as "The Decline of the Humanities" talk about the humanities. Of course, articles of this latter sort will intersect with many of the topics in the larger set--youth, the economy, jobs, etc. It's an open empirical question how much articles that worry explicitly about the humanities do or do not match other public discourse in their selection, prioritization, and weighting of topics.
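To make #3 above concrete, the sketch below shows one very rough way such a "signal" filter might work: count how many seed words co-occur with "humanities" in each article and keep articles above some cutoff. The seed-word list and the cutoff shown are placeholders for a method we would still need to develop and validate, not a settled procedure.

    # signal_sketch.py -- rough illustration of the co-occurrence filter described in #3
    import glob
    import io

    seed_words = ["college", "majors", "literature", "history"]  # illustrative seed words only

    for path in glob.glob("C:/workspace/1/articles/*.txt"):      # placeholder path
        text = io.open(path, encoding="utf-8", errors="ignore").read().lower()
        if "humanities" in text:
            hits = sum(1 for w in seed_words if w in text)
            if hits >= 2:   # arbitrary cutoff: "humanities" plus at least two seed words
                print("%s\t%d seed words" % (path, hits))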
(e) For a Later Workshop: Improve
- Improve workflow
- Experiment with alternative workflows, e.g., Andrew Goldstone's DFRtopics R package?
(f) Manifest schema, Database system
- reports from Scott and Jeremy
Scott created a demo of webform access to a mongodb database, and I have built a system to serve it out of containers (virtual machines). An early form example and a more recent database-connected example are hosted here:
1. WE1S flask+deform
http://mirrormask.english.ucsb.edu:8500/
2. WE1S flask+alpaca (+pymongo)
http://mirrormask.english.ucsb.edu:8503/
(NOTE -- as always, you may need to be on the campus VPN in order to access these URLs)