

Topic Modeling Workflow (version 1)

Page history: last edited by Scott Kleinman 8 years, 5 months ago


(Version 1.0, last revised 23 Nov 2015)



       Step-by-step Instructions For Topic Modeling the WE1S Corpus (or subcorpora):


  • Stage 0 -- Prepare your local computer with the required software, resources, and workspace.


    • (A) Download a ready-to-deploy zip file for the WE1S topic-modeling environment containing the resources listed below. (Note: the latest versions of the Python scripts and stopword lists included here are also available on this Github site).
      • Workspace main folder: C:\workspace\1\
        Windows users should copy the workspace folder and its nested subfolders to immediately under the root of the C: drive.  Mac users: copy to immediately under the root of your user space.
                 Explanation: Because parts of the topic modeling process involve entering path names on a command line, near-root locations and short folder names are best. Vary the name of the subfolder "1" as needed if you are doing multiple topic modeling experiments. Path names below are for Windows and start with "C:\". Mac equivalents for paths would begin "./~/" where the period stands for the root and the tilde stands for "Users/[your username]/". (Note that Windows systems use the backslash ( \ ) in paths, while Mac systems use the forward slash ( / ). A handy, if biased, mnemonic is: "Windows backwards, Mac forwards.")
        • Workspace subfolders: (these are already contained in the main workspace folder)
          • C:\workspace\1\articles-not-deduped (will hold combined articles for the search terms "humanities" and "liberal arts," plus, in Commonwealth nation publications, also "the arts")
          • C:\workspace\1\articles (will hold combined but de-duplicated versions of the above articles)
          • C:\workspace\1\articles-scrubbed (will hold scrubbed versions of the de-duplicated articles)
          • C:\workspace\1\topic-models (will hold Mallet results and also subfolders for later derivative results. Iterations of topic models can be held in subfolders here named "iteration1," etc.)
          • C:\workspace\1\topic-models\topic-documents (will hold "topic documents" created using the topicsToDocs.py script)
          • C:\workspace\1\topic-models\topic-clusters (will hold PCA, dendrogram, and other clustering visualizations of topics)
          • C:\workspace\resources (contains resources such as Python scripts and stopword lists needed for various stages of the process)
      • Python Scripts (These are already contained in the C:\workspace\resources subfolder.)
        • Scrubbing scripts:
          • config.py
          • scrub.py
        • De-duplication script:
          • corpus_compare.py
        • Topics-to-Documents script: 
          • topicsToDocs.py
      • Jupyter (formerly known as iPython) notebook:
        • dariah-de-tutorial-v3.ipynb (This is already contained in the C:\workspace\resources subfolder. Explanation: Jupyter/iPython notebooks contain explanatory text alongside Python code that is executable inside the notebook. They are often used to guide users through Python tasks. Adapted from a Python tutorial by DARIAH-DE, this particular Jupyter notebook takes "topic-documents" generated after topic modeling and creates from them scalable PCA and hierarchical dendrogram visualizations of topic clusters. The Jupyter notebook thus supplements the process of clustering that can be done more easily, but without scalability or other means of manipulation, through Lexos.)
      • Stopword lists (This is already in the C:\workspace\resources folder)
        • we1s_extra_stopwords.txt (Explanation: The Mallet topic modeling program comes with a default stopword list for improving topic models by ignoring common words and other non-thematic text. The WE1S extra stopword list is added during the running of Mallet to ignore additional non-thematic words found through text analysis to be frequent in the WE1S corpus.)
    • (B) Install Python Programming Environment & Mallet Topic-Modeling Software
      For instructions and explanations on installing these, consult Scott Kleinman's Guide ("SKGuide"). Besides explicit instructions for downloading and installing Anaconda Python and Mallet, the SKGuide also offers context about the purpose of the tools and beginner's instructions for the "command line" environment in which such tools are run. (There is also a test folder for performing small tests of the setup. Tests walk the user through executing Python and Mallet commands from the command line, executing Python scripts from the command line, and executing Python code in an iPython/Jupyter notebook.)
               General explanation: The WE1S topic-modeling workflow requires the Python 2.7 programming language (not the 3.0 generation of Python). Python is best installed as part of a "distribution" ("distro") that includes, besides Python itself, suites of "packages" (task-specific code libraries) and an "integrated development environment" (IDE) interface with tools for editing and executing scripts. The ideal Python distribution for the WE1S topic-modeling workflow is Anaconda (described in SKGuide). In addition (and unrelated to Python), the workflow requires the Mallet topic-modeling software, which runs on Java. The SKGuide includes installation instructions.
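
The zip file in (A) already contains the workspace folders, but step (A) also suggests varying the subfolder name "1" for additional experiments. The following is a minimal sketch, not part of the WE1S resources, of how the same folder layout could be created for another experiment. The function name make_workspace and its parameters are illustrative assumptions.

```python
import os

# Subfolder names are taken from the workspace listing in Stage 0 (A).
EXPERIMENT_SUBFOLDERS = [
    "articles-not-deduped",
    "articles",
    "articles-scrubbed",
    "topic-models",
    os.path.join("topic-models", "topic-documents"),
    os.path.join("topic-models", "topic-clusters"),
]


def make_workspace(workspace_root, experiment="1"):
    """Create the experiment subfolders and the shared resources folder.

    Returns the list of folder paths, whether newly created or already present.
    (make_workspace is a hypothetical helper, not a WE1S script.)
    """
    folders = [os.path.join(workspace_root, experiment, name)
               for name in EXPERIMENT_SUBFOLDERS]
    folders.append(os.path.join(workspace_root, "resources"))
    for folder in folders:
        if not os.path.isdir(folder):
            os.makedirs(folder)
    return folders
```

For example, make_workspace(r"C:\workspace", "2") would set up a second experiment folder next to the first without touching existing files.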


  • Stage 1 -- Assemble a subcorpus of the WE1S corpus

    1. Download the latest "flattened zipped" versions of the WE1S corpus from the specialized UCSB Mirrormask server, which can only be accessed over the UCSB VPN with login permissions for the WE1S user. Once you are logged into Mirrormask, the flattened zipped versions of the WE1S corpus are available at the location: File Station > Archives (screenshot of example).
             Explanation: Mirrormask is a NAS (network-attached storage) machine in WE1S team member Jeremy Douglass's office at UCSB that WE1S uses to back up and manipulate its corpus as well as to track work on the corpus.  "Flattened" refers to the automated process by which Mirrormask each night not only backs up the WE1S corpus but collects all its documents, which are held in subfolders for particular publications and years, into a single cumulative folder designed for input into Mallet topic modeling and other operations requiring a single input folder.
    2. Copy out the part of the corpus you want to use. For example, excerpt only New York Times articles in the years 2010-2014 that include the word "humanities" or the phrase "liberal arts." (When working with articles in the WE1S corpus from U.K. or Commonwealth publications such as The Guardian, there is a third set of articles that WE1S scraped through searching on the phrase "the arts.")
              The filename nomenclature that WE1S uses for documents in its corpus is as follows (by example): nyt-2010-h-1.txt (where "h" indicates "humanities" and "1" is the file number for that year). The equivalent file names for articles found through searching on "liberal arts" and "the arts," respectively, include "la" and "ta" in the filenames instead of "h".
    3. Paste the combined files for your subcorpus into your workspace in the folder: C:\workspace\1\articles-not-deduped (Mac equivalent: ./~/workspace/1/articles-not-deduped). The folder name indicates that the combined "humanities" and "liberal arts" (and "the arts") files have not yet been de-duplicated to remove copies of the same article containing more than one of the search terms.
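
Steps 2 and 3 can also be scripted once the flattened corpus has been unzipped locally. The sketch below is an illustration rather than a WE1S tool: it parses names of the form nyt-2010-h-1.txt and copies the matching files. The assumption that publication codes are lowercase letters and digits is inferred from the single "nyt" example, and the function names are hypothetical.

```python
import os
import re
import shutil

# Pattern for names such as nyt-2010-h-1.txt:
# publication code, four-digit year, search-term code (h, la, ta), file number.
# Assumption: publication codes contain only lowercase letters and digits.
FILENAME_PATTERN = re.compile(r"^([a-z0-9]+)-(\d{4})-(h|la|ta)-(\d+)\.txt$")


def parse_name(filename):
    """Return (publication, year, term, number), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    publication, year, term, number = match.groups()
    return publication, int(year), term, int(number)


def copy_subcorpus(source_dir, target_dir, publication, first_year, last_year,
                   terms=("h", "la")):
    """Copy matching files from the flattened corpus folder; return their names."""
    copied = []
    for filename in sorted(os.listdir(source_dir)):
        parsed = parse_name(filename)
        if parsed is None:
            continue  # skip anything that does not follow the nomenclature
        pub, year, term, _number = parsed
        if pub == publication and first_year <= year <= last_year and term in terms:
            shutil.copy2(os.path.join(source_dir, filename),
                         os.path.join(target_dir, filename))
            copied.append(filename)
    return copied
```

Calling copy_subcorpus(flattened_folder, r"C:\workspace\1\articles-not-deduped", "nyt", 2010, 2014) would reproduce the New York Times example in step 2. Pass terms=("h", "la", "ta") for U.K. or Commonwealth publications.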


  • Stage 2 - De-duplicate the subcorpus

    • Run the Python script corpus_compare.py on your folder of not yet de-duplicated files (the files in the Windows folder C:\workspace\1\articles-not-deduped or the Mac folder ./~/workspace/1/articles-not-deduped). The steps for running this Python 2.7 script are as follows:
      1. Click on corpus_compare.py (in your workspace /resources subfolder) to open it in your Python IDE (integrated development environment). Anaconda will open the script in its Spyder interface. (The equivalent is the Canopy interface in Enthought, another often used Python IDE.)
      2. Under the "Run" tab in Anaconda's Spyder (or Enthought's Canopy), use the "configuration" or "configure" option to set the "working folder" to C:\workspace\1\articles-not-deduped (the input folder of files you want the script to operate on). In the "command line options" or "arguments" field, specify the output folder by entering: -o C:\workspace\1\articles (For Mac users: -o ./~/workspace/1/articles).

        [Screenshot: Anaconda Spyder Run > Configurations dialogue]
        [Screenshot: Canopy Enthought Run > Configurations dialogue]


      • Note: By default, corpus_compare.py will run according to pre-set parameters and create a report file called compare-args.csv, which it will put in the folder of the files being de-duped. However, you can also set optional parameters by entering a command string in the Spyder (or Enthought) configuration dialogue according to the usage schema below. (The parameter "-t" sets the sensitivity threshold above which a match between files is detected, where the higher the threshold, the more exact rather than fuzzy the match):
        • Usage format: corpus_compare.py [-h] [-i INPUTPATHS [INPUTPATHS …]] [-f FILEPATTERN] [-o OUTPUTFILE] [-t THRESHOLD]
        • Example usage: corpus_compare.py -i /mytemp/doc-compare-test/2015-11-16-workshop/data/ -f "*.txt" -t 0.85 -o /mytemp/doc-compare-test/2015-11-16-workshop/corpus_compare-args.csv
      • Note: If you are running Python scripts directly in a command or terminal window (and not in Anaconda Spyder or Canopy Enthought), then the syntax on the command line (once you have changed directory using the cd [path] command to the folder holding the corpus_compare.py script) is: python corpus_compare.py [plus any optional parameters as cited above].
      3. After corpus_compare.py has run, it will by default deposit a report file called compare-args.csv in the folder of files being de-duped. Open this CSV file in a spreadsheet (or paste it into a spreadsheet) (see screenshot). Pairs of files in the columns labeled "file1" and "file2" are duplicates of each other. After copying your combined files to the subfolder that will hold your de-duped files -- C:\workspace\1\articles (or for Mac users: ./~/workspace/1/articles) -- delete the files whose names are in the right column of the CSV file (labeled "file2"). You will be left in your C:\workspace\1\articles subfolder with your combined, de-duped files.
         [Screenshot: compare-args.csv file viewed in spreadsheet]
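
Deleting the "file2" files by hand is error-prone for a large report, so the step could be scripted. This is a minimal sketch that relies only on the "file1" and "file2" column labels described above. Whether those columns hold bare file names or full paths is not stated here, so the code assumes either and keeps only the file name. The function names are hypothetical.

```python
import csv
import os


def duplicate_names(report_path):
    """Collect the file names listed in the report's "file2" column."""
    names = set()
    with open(report_path) as handle:
        for row in csv.DictReader(handle):
            value = (row.get("file2") or "").strip()
            if value:
                # Assumption: the column may hold a path; keep only the file name.
                names.add(os.path.basename(value.replace("\\", "/")))
    return names


def remove_duplicates(report_path, articles_dir, dry_run=True):
    """Delete (or, with dry_run=True, only list) the "file2" files in articles_dir."""
    found = []
    for name in sorted(duplicate_names(report_path)):
        path = os.path.join(articles_dir, name)
        if os.path.isfile(path):
            found.append(name)
            if not dry_run:
                os.remove(path)
    return found
```

Running it first with the default dry_run=True returns the names that would be removed, which can be checked against the spreadsheet before calling it again with dry_run=False on the C:\workspace\1\articles folder.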



  • Stage 3 - Scrub the subcorpus

    "Scrub" means to pre-process the subcorpus files to clean them (e.g., of frequent OCR or formatting errors) and perform other operations (e.g., consolidate phrases such as "social sciences" or "National Endowment for the Humanities" into a single word, "social_sciences" or "national_endowment_for_the_humanities") for the purpose of optimizing topic modeling.
    1. Apply to the subcorpus the scrub.py Python script, which is configured with the config.py file.
      1. The config.py file contains the current scrubbing information. (For a sense of the information the file contains, see  text version of config.py (2015-11-16).txt). This file does not actually need to be executed in Python; it is simply called on as a library of information during the process of scrubbing. However, config.py does need to be edited at the top to set the input and output folders of the operation as follows:
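                   input_file_path = r"C:\workspace\1\articles"
                   output_file_path = r"C:\workspace\1\articles-scrubbed"
                   (The r prefix makes these raw strings, so that Python does not read backslash sequences such as \1 or \a as escape characters.)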
      2. The scrub.py file is the actual script that needs to be run in Python once config.py is configured. It will deposit the finished, scrubbed files in the folder indicated in the output_file_path, which should be C:\workspace\1\articles-scrubbed (for Mac users: ./~/workspace/1/articles-scrubbed)
      3. Important: scrub.py will save the scrubbing log as a file called log.txt in your articles-scrubbed folder. Make sure to remove it before proceeding to Stage 4.
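
To give a sense of what phrase consolidation involves, here is a small illustrative sketch. It is not the code in scrub.py, and the phrase list is a hypothetical stand-in for the replacements defined in config.py. It only shows one plausible way to turn multi-word phrases into single underscore-joined tokens.

```python
import re

# Hypothetical phrase list; the real replacements live in config.py.
PHRASES = ["social sciences", "National Endowment for the Humanities"]


def consolidate_phrases(text, phrases=PHRASES):
    """Replace each listed phrase with a lowercase, underscore-joined token."""
    # Handle longer phrases first so a short phrase cannot split a longer one.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        token = "_".join(word.lower() for word in words)
        # Match the words in order, ignoring case and allowing any whitespace between them.
        pattern = re.compile(
            r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b",
            re.IGNORECASE)
        text = pattern.sub(lambda _match, token=token: token, text)
    return text
```

Because a topic model treats each whitespace-separated token as one word, the consolidated form keeps such a phrase together as a single term instead of scattering its parts across topics.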


  • Stage 4 - Topic model the subcorpus using MALLET:

    For additional help on using the command line interface or using Mallet, consult Scott Kleinman's Guide ("SKGuide").
    • (A) Open a command shell ("command line") window
      1. Windows "command prompt" (using shell command language based on MS-DOS) -- quick guide -- tip on enabling copy/paste operations
      2. Mac "terminal" (using shell command language based on Unix) -- cheat sheet 
      3. Linux command line (also uses bash)
    • (B) Navigate to the Mallet folder (directory)
      1. Windows: type the following at the command line, followed by a return (helpful tip: the <F5> function key pastes in the previously used command)
        •  cd c:\mallet
      2. Mac: type the following at the command line
        • cd /Users/yourusername/mallet
        • As with the Windows command above, this will depend on where and how you have saved the MALLET folder you downloaded when you installed the package. The above command assumes you've saved MALLET (and titled the folder it's in "mallet") under your home directory. 

    • (C) Input a folder of text files and process them into a .mallet data file
      Use the command below, varying the path and file names as desired (bracketed text indicates path/filenames you supply). Use backslashes in Windows, forward slashes on Macs. There must be no hidden returns in the command. Best practice is to set the job up in a text document (without "wordwrap" view turned on) and copy/paste the command into the command shell.
      • General format (Windows): bin\mallet import-dir --input [path to folder] --output [path and filename of desired output data file with .mallet extension] --keep-sequence --remove-stopwords
      • Command line (Windows):
        • bin\mallet import-dir --input C:\workspace\1\articles-scrubbed --output C:\workspace\1\topic-models\topics.mallet --keep-sequence --remove-stopwords  --extra-stopwords C:\workspace\resources\we1s_extra_stopwords.txt
      • Command line (Mac):
        • ./bin/mallet import-dir --input ./~/workspace/1/articles-scrubbed --output  ./~/workspace/1/topic-models/topics.mallet --keep-sequence --remove-stopwords  --extra-stopwords ./~/workspace/resources/we1s_extra_stopwords.txt

    • (D) Create the topic model ("train" the topic model) -- Use the following command.
      • General format (Windows; bracketed text indicates values you supply): bin\mallet train-topics --input [path and filename of the previously created data file with .mallet extension] --num-topics [desired number of topics] --optimize-interval 20 --output-state [path to output folder]\topic-state.gz --output-topic-keys [path to output folder]\keys.txt --output-doc-topics [path to output folder]\composition.txt --word-topic-counts-file [path to output folder]\topic_counts.txt
      • Command line (Windows):
        • bin\mallet train-topics  --input C:\workspace\1\topic-models\topics.mallet --num-topics 100 --optimize-interval 20 --output-state C:\workspace\1\topic-models\topic-state.gz --output-topic-keys C:\workspace\1\topic-models\keys.txt --output-doc-topics C:\workspace\1\topic-models\composition.txt --word-topic-counts-file C:\workspace\1\topic-models\topic_counts.txt
      • Command line (Mac):
        • ./bin/mallet train-topics  --input ./~/workspace/1/topic-models/topics.mallet --num-topics 100 --optimize-interval 20 --output-state ./~/workspace/1/topic-models/topic-state.gz --output-topic-keys ./~/workspace/1/topic-models/keys.txt --output-doc-topics ./~/workspace/1/topic-models/composition.txt --word-topic-counts-file ./~/workspace/1/topic-models/topic_counts.txt
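
Long commands like those in (C) and (D) are easy to mistype. One option, sketched here as an illustration and not as part of the WE1S toolkit, is to assemble them in Python from the same options shown above. The function names and parameters are hypothetical; the Mallet options themselves are exactly those used in the commands above.

```python
import os


def import_dir_command(mallet_bin, input_dir, output_file, extra_stopwords=None):
    """Return the Stage 4 (C) import-dir command as a list of arguments."""
    command = [mallet_bin, "import-dir",
               "--input", input_dir,
               "--output", output_file,
               "--keep-sequence", "--remove-stopwords"]
    if extra_stopwords:
        command += ["--extra-stopwords", extra_stopwords]
    return command


def train_topics_command(mallet_bin, mallet_file, output_dir,
                         num_topics=100, optimize_interval=20):
    """Return the Stage 4 (D) train-topics command as a list of arguments."""
    return [mallet_bin, "train-topics",
            "--input", mallet_file,
            "--num-topics", str(num_topics),
            "--optimize-interval", str(optimize_interval),
            "--output-state", os.path.join(output_dir, "topic-state.gz"),
            "--output-topic-keys", os.path.join(output_dir, "keys.txt"),
            "--output-doc-topics", os.path.join(output_dir, "composition.txt"),
            "--word-topic-counts-file", os.path.join(output_dir, "topic_counts.txt")]
```

Joining a returned list with single spaces, for example " ".join(train_topics_command(...)), gives a one-line command with no hidden returns that can be pasted into the command shell. The list could also be handed to Python's subprocess module from inside the Mallet folder, though that route is untested here.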

    • (E) When the topic model is complete, you should see that Mallet has deposited in the C:\workspace\1\topic-models folder the following set of files:
      • composition.txt
      • keys.txt
      • topic_counts.txt
      • topics.mallet
      • topic-state.gz
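
A quick way to confirm the list above is to check the folder programmatically. The following sketch, with a hypothetical function name, reports which of the five expected files are missing or empty.

```python
import os

# File names as listed in Stage 4 (E).
EXPECTED_OUTPUTS = ["composition.txt", "keys.txt", "topic_counts.txt",
                    "topics.mallet", "topic-state.gz"]


def missing_outputs(topic_models_dir):
    """Return the expected Mallet files that are absent or empty."""
    missing = []
    for name in EXPECTED_OUTPUTS:
        path = os.path.join(topic_models_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing
```

An empty list from missing_outputs(r"C:\workspace\1\topic-models") suggests that both Mallet commands finished and wrote their results.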

    • (F - Optional) Text analyze or inspect the topic model to improve the scrubbing of files and stopword list, then repeat topic modeling
      • Topic models can be inspected casually or text-analyzed more systematically to harvest phrases and words to be added to the WE1S config.py file (which configures our scrubbing script) or the we1s_extra_stopwords.txt file (which adds stopwords). Re-scrubbing the files will produce better topic models, which in turn can be studied to discover additional scrubbing issues in an iterative process. See "Using Text Analysis to Improve Scrubbing and Stopword Lists" for instructions on this process and on adding fixes to the WE1S Master List of Scrubbing Fixes and Stopwords.
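
One simple form of such text analysis is to look for words that appear among the key words of many topics, since these are often non-thematic and may be stopword candidates. The sketch below is an illustration, not the WE1S procedure. It assumes the usual layout of Mallet's keys.txt (topic number, weight, and space-separated key words, separated by tabs), and the min_share cutoff is an arbitrary assumption. Its output is only a list of candidates for human review.

```python
import io
from collections import Counter


def read_topic_keys(keys_path):
    """Return a list of key-word lists, one per topic line in keys.txt."""
    topics = []
    with io.open(keys_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                topics.append(fields[2].split())
    return topics


def widespread_words(topics, min_share=0.25):
    """Return (word, topic_count) pairs for words keyed in at least min_share of topics."""
    counts = Counter()
    for words in topics:
        counts.update(set(words))  # count each word once per topic
    cutoff = min_share * len(topics)
    candidates = [(word, count) for word, count in counts.items() if count >= cutoff]
    # Most widespread first; ties in alphabetical order.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))
```

Words returned by widespread_words(read_topic_keys(r"C:\workspace\1\topic-models\keys.txt")) can then be judged one by one before any are added to we1s_extra_stopwords.txt.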



  • Stage 5 - Cluster the topics:

    • Method 1 -- Using Lexos
      1. Go to Lexos online. Under the "Visualize" tab, choose "Multicloud". The following dialogue will appear (numbered labels refer to steps outlined here).
        Lexos Multicloud dialogue screen
      2. Toggle the switch to "Topic Clouds".
      3. Check the box for "convert topics to documents"
      4. Click on "Upload file" and upload the topic_counts.txt file that Mallet generated.
      5. Then click on "Get Graphs" to produce word clouds of each topic in your topic model.  After a wait, which can be long for a large number of topics, you will see word clouds. (Example).
      6. Next, click on the "Analyze" tab in Lexos, and choose "Clustering" > "Hierarchical Clustering" in order to produce dendrograms to assist in seeing how the topics in your topic model are clustered. The following dialogue will appear:
        Lexos Get Dendrogram dialogue
      7. Click on "Get Dendrogram" to produce a dendrogram visualization of clusters of topics. An example:
        Lexos Dendrogram Example
    • Method 2 -- Using the topicsToDocs.py script on the topic_counts.txt file produced by MALLET to create "topic-documents" from the individual topics in the topic model. This method is more complicated, but it has the advantage of producing dendrograms that can be scaled and manipulated for greater legibility in ways that Lexos dendrograms currently do not allow.
              (Explanation: A topic model finds "topics" consisting of words that tend to co-occur in a corpus, and also counts the relative frequency of the words. For the purpose of clustering topics with other topics, WE1S also converts topics into "topic documents," which are plain-text document files containing the most frequent words in a topic, with words repeated in proportion to their weight in the topic.)
      1. First configure the settings at the top of the topicsToDocs.py script to the following (Mac users: adapt the path name for Mac). The r prefix makes the path a raw string, so that Python does not read backslash sequences such as \1 or \t as escape characters.
        • input_file_path = r"C:\workspace\1\topic-models"
          input_file = "topic_counts.txt"
          num_topics = 100
        • The script will deposit the resulting topic documents (named "Topic0.txt", "Topic1.txt", etc.) in the same folder as its input (in this case C:\workspace\1\topic-models\). Move the topic documents into C:\workspace\1\topic-models\topic-documents\
      2. Using the Anaconda distribution of Python, open your local copy of the Jupyter (formerly called iPython) notebook titled dariah-de-tutorial-v3.ipynb. This is the WE1S adaptation of a DARIAH-DE tutorial. It contains explanations and executable Python commands that will produce PCA clustering and dendrogram visualizations of topic-documents.
        • Open the Anaconda "Launcher"
        • From the options in the launcher, choose "ipython-notebook" and launch. From the folder list that opens, find and open the local copy of dariah-de-tutorial-v3.ipynb. This will start a local web service on your computer and open the Jupyter (iPython) notebook as if it were a Web page in your browser. (Note: the local web service is started in a command or terminal window on your computer. Leave that window open.)
        • In the Jupyter (iPython) notebook, scroll down a bit and begin at the section titled "Getting Some Data."  You'll see paragraphs of explanatory text interspersed with executable cells containing live Python code that can be run. To run the code in a cell, place your cursor in the cell, and then press CTRL-return.
          • Execute the code in each of the cells in sequence.
          • Note: when you get to the set of cells under the heading "Preparing to Analyze Document Vectors," there is one cell (#4) to pay special attention to. It's the cell that follows immediately after the explanation, "We will use CountVectorizer both to load the texts and to construct a DTM from them. We'll start by constructing a list of filenames...." This cell contains code that begins with filenames = [  The paths and file names here have to be specific for where you put the 100 topic-documents on your local machine.
            • For Windows machines, a copy of the code for all the filenames for this workshop is pre-filled in the Jupyter (iPython) notebook. The code for this cell is also contained here in this plain text file titled filenames-for-iPython-notebook.txt
            • For Mac machines, you will need to adapt the code (using search and replace in a text processor).
        • Example of results from using this Jupyter (iPython) notebook:
          • Alan's results from NY Times 2002-2006 "humanities": PCA | Dendrogram
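
The idea of a "topic document" described under Method 2 can be sketched in a few lines. The code below is an illustration of the concept and not the actual topicsToDocs.py script. It assumes the usual layout of Mallet's word-topic-counts file (a word index, the word, then topic:count pairs on each line), and the top_words and scale parameters are hypothetical. It returns the list of paths it wrote, in the same shape as the filenames = [ ... ] list that the notebook cell expects, which may save Mac users the search-and-replace step.

```python
import io
import os
from collections import defaultdict


def read_topic_counts(path):
    """Return {topic_number: {word: count}} from a Mallet word-topic-counts file."""
    topics = defaultdict(dict)
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # no topic:count pairs on this line
            word = fields[1]
            for pair in fields[2:]:
                topic, count = pair.rsplit(":", 1)
                topics[int(topic)][word] = int(count)
    return topics


def write_topic_documents(topics, output_dir, num_topics, top_words=100, scale=1.0):
    """Write Topic0.txt, Topic1.txt, ... and return their paths in topic order."""
    filenames = []
    for topic in range(num_topics):
        counts = topics.get(topic, {})
        # Most frequent words first; ties in alphabetical order.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_words]
        tokens = []
        for word, count in ranked:
            # Repeat each word in proportion to its count (at least once).
            tokens.extend([word] * max(1, int(round(count * scale))))
        path = os.path.join(output_dir, "Topic%d.txt" % topic)
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(u" ".join(tokens))
        filenames.append(path)
    return filenames
```

With large corpora the raw counts can run into the thousands, so a scale below 1.0 keeps the topic documents small while preserving the relative weights of the words.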






