Using Text Analysis to Improve Scrubbing and Stopword Lists


Overview

As outlined in the WE1S Topic Modeling Workflow, we take two preprocessing steps to optimize the raw plain-text files of articles and other material for topic modeling. First, we use a Python script to "scrub" the files (fixing common punctuation problems, standardizing word forms, consolidating some phrases, and making other corrections). Second, we ask the Mallet topic-modeling program to add to its default stopword list the we1s_extra_stopwords.txt file, which contains additional common words, names, numbers, etc. to ignore during topic modeling. (See the command line in step C of topic modeling.)

         The general principle is that we want the WE1S scrubbing script (whose values are set in config.py) and the WE1S extra stopword list (whose values are set in we1s_extra_stopwords.txt) to work in tandem to prevent the Mallet topic modeling program from encountering frequent non-thematic lexical elements that detract from discovering meaningful topics. Consider the following hypothetical example of a cluster of words that Mallet suggests are part of a single topic because they frequently co-occur in a corpus: "whale harpoon of ship oil John the blubber etc. boat sea above seals captain Monday 's ocean". Here words such as "whale," "harpoon," "ship," and "blubber" are thematically significant. The other words or word fragments (such as "of," "John," "etc.," "Monday," and "'s") are non-thematic because they are either too common or co-occur in non-meaningful ways with other terms. For example, the fact that a word like "of" or a name like "John" co-occurs with both terms A and B does not mean that terms A and B actually have anything in common as a theme. (There could be many different people named "John" in a corpus, for instance.) Note: we only worry about problems that occur frequently (as determined, for example, by text analysis of topic models). Infrequent problems will tend to drop out of sight in topic modeling.
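The tandem principle can be illustrated with a minimal Python sketch applied to the hypothetical word cluster above. The rules and stopwords here are invented for illustration only; they are not the actual contents of config.py or we1s_extra_stopwords.txt, and the function names are hypothetical.

```python
import re

# Hypothetical scrubbing rules and stopwords, for illustration only.
# The real values live in config.py and we1s_extra_stopwords.txt.
SCRUB_RULES = [
    (r"'s\b", ""),     # strip stray possessive fragments such as "'s"
    (r"\betc\.", ""),  # drop the abbreviation "etc."
]
STOPWORDS = {"of", "the", "above", "john", "monday"}


def scrub(text):
    """Apply each regex substitution in order, then collapse whitespace."""
    for pattern, replacement in SCRUB_RULES:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stopwords(text, stopwords):
    """Keep only tokens whose lowercase form is not in the stopword set."""
    return [tok for tok in text.split() if tok.lower() not in stopwords]


raw = ("whale harpoon of ship oil John the blubber etc. boat sea "
       "above seals captain Monday 's ocean")
thematic = remove_stopwords(scrub(raw), STOPWORDS)
print(thematic)
```

In this sketch the scrubbing step handles fragments that a plain word list cannot match reliably, and the stopword step handles whole words, which is roughly the division of labor described above.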

        Text analysis of topic models we have already created can help us identify frequent non-thematic words and phrases in our corpus to add to our scrubbing script or extra stopwords list. Doing so iteratively will improve our topic models. The following instructions describe how to use the Antconc text analysis tool on the articles-scrubbed folder in the WE1S topic modeling workflow (containing articles scrubbed by an earlier iteration of our scrubbing script) to discover additional problems to fix, and then how to add fixes to the WE1S Master List of Scrubbing Fixes and Stopwords.

 


Step 1 - Download and open Antconc

  1. Download the latest version of Antconc for your operating system from http://www.laurenceanthony.net/software.html.  Save the downloaded executable file (Windows) or zip folder (Mac) on your hard drive. The program runs directly from these downloaded files (no further installation process is required).
  2. Open Antconc.

 


Step 2 - Configure Antconc to apply to texts the current we1s_extra_stopwords.txt file

  1. As shown in the screenshot below, click on the tab for "Tool Preferences" (labeled #1 in the screenshot).
  2. In the dialogue that opens, click on "Word List" in the left Category sidebar (#2).
  3. Check the button for "Use a stoplist below" (#3).
  4. Click on "Open" (#4) and navigate to the location of the existing we1s_extra_stopwords.txt file on your computer (in the WE1S topic modeling workflow, it is located at C:\workspace\resources\ [for Mac users: ./~/workspace/resources/]). Selecting that file will populate the "Add Words From File" window in the dialogue.
  5. Then click on "Apply." This will apply the stopword list to text files operated on by Antconc in this session. (Antconc does not save this information between sessions unless you export settings [under the "File" tab] and then import settings when opening a new session.)
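For readers who want to see what applying a stoplist amounts to, the following Python sketch loads a stopword file into a set and filters a token list against it. It assumes the file is plain UTF-8 text with words separated by whitespace or newlines; the function names and the demo file are hypothetical, and the sketch does not reproduce Antconc's internal behavior.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read a stopword file into a set of lowercase words.

    Assumes plain UTF-8 text with words separated by whitespace or newlines.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {word.lower() for word in text.split()}


def filter_tokens(tokens, stopwords):
    """Return the tokens whose lowercase form is not a stopword."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Demo with a throwaway file standing in for we1s_extra_stopwords.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_stopwords.txt"
    demo_file.write_text("john\nmonday\nabove\n", encoding="utf-8")
    stopwords = load_stopwords(demo_file)
    kept = filter_tokens(["whale", "John", "ship", "Monday"], stopwords)

print(kept)
```

To try it on the real list, pass the path to your own copy of we1s_extra_stopwords.txt to load_stopwords.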

 

Screenshot: Antconc - Set Stopwords list


Step 3 - Load current scrubbed files into Antconc

  1. As shown in the screenshot below, load the already scrubbed files for the WE1S corpus (or subcorpus) you are working with into Antconc by clicking on the "File" tab and then on "Open Dir" (#1). Navigate to the articles-scrubbed folder in the WE1S topic modeling workflow containing the scrubbed articles. (The purpose of using Antconc in the WE1S workflow is to improve the scrubbing of files, so we operate iteratively on the last state of the files as they were previously scrubbed.)
  2. Click on the tab for "Word List" (#2).
  3. Then click on "Start" (#3). Warning: If the number of files you are working on is very large, there will be a long wait as Antconc processes them (discovering words and counting their frequency). The processing of 2,000 files takes about 5 minutes (depending on the speed of your computer).
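The word-discovery and counting pass described above can be approximated outside Antconc with a short Python sketch. It assumes the scrubbed articles are UTF-8 .txt files in a single folder and uses a deliberately simple tokenizer (lowercase letters, digits, and apostrophes); the function name and the demo files are hypothetical, and counts may differ from Antconc's because its token definition is configurable.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Simple token pattern (an assumption, not Antconc's definition).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def word_frequencies(folder, stopwords=frozenset()):
    """Count tokens across every .txt file in a folder, skipping stopwords."""
    counts = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace").lower()
        counts.update(t for t in TOKEN_RE.findall(text) if t not in stopwords)
    return counts


# Demo on two tiny files standing in for the articles-scrubbed folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("The whale and the ship", encoding="utf-8")
    (Path(tmp) / "b.txt").write_text("The ship sailed", encoding="utf-8")
    freqs = word_frequencies(tmp, stopwords={"the", "and"})

# Print words from most to least frequent, as in a word list view.
for rank, (word, count) in enumerate(freqs.most_common(), start=1):
    print(rank, count, word)
```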

 

 

Screenshot: Antconc - Load files

 


Step 4 - Analyze using Word List

  1. As shown in the screenshot below, once Antconc has finished processing your corpus while in Word List view, it will show the words in your corpus (#3) ranked by ordinal number (#1) and frequency count (#2).
  2. The most common words at the top of the list will be articles, prepositions, and other function words. You can assume that these will be stopped out by the Mallet default stop list (Mallet_2015.txt) during topic modeling. Scroll down the list and examine the first hundred or so substantive words. (You only need to worry about high-frequency words, since low-frequency words will likely not matter for topic modeling.) Look for any problem words that need to be scrubbed or stopped out. (To add fixes, go to the WE1S Master List of Scrubbing Fixes and Stopwords. For a general explanation of the kinds of problems we are looking for, see "general principle" above.)
  3. A useful tool in Antconc that complements analysis by word frequency is the "Concordance" view. If you are not sure of the sense or context of a word in the word list as it is commonly used in our corpus, enter it in the Concordance view to get a keyword-in-context view of its occurrences (see example).
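As a rough illustration of what a concordance view produces, the following Python sketch prints each occurrence of a keyword with a fixed window of surrounding characters. The function name, the character-based window, and the sample sentence are assumptions made for this example; Antconc's own display options are richer.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = "The captain saw the whale. The whale dived under the ship."
hits = kwic(sample, "whale")
for line in hits:
    print(line)
```

Reading a handful of such lines is usually enough to decide whether a frequent word is thematic or a candidate for scrubbing or stopping out.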

 

 

Screenshot: Antconc - Word List

 


Step 5 - Analyze phrases using "Clusters/N-grams"

  1. As shown in the screenshot below, you can analyze the phrases in which important words occur by using Antconc's "Clusters/N-grams" view. Click on the tab for the view to open it (#1).
  2. Enter a word that you are interested in (#2).
  3. Set the size of the clusters (phrases) containing the word that you want Antconc to find (#3).
  4. Set whether the word you are interested in is at the beginning or end of the phrases you want Antconc to find (#4).
  5. Examine the most frequent phrases (not bothering with lower-frequency phrases). Look for any problem phrases that need to be scrubbed. (To add fixes, go to the WE1S Master List of Scrubbing Fixes and Stopwords. For a general explanation of the kinds of problems we are looking for, see "general principle" above.)
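The cluster search in the steps above can be sketched in Python as counting fixed-size phrases in which the search word comes first or last. The function name, the "left"/"right" parameter values, and the sample tokens are hypothetical choices for this example, and the sketch assumes the text has already been split into tokens.

```python
from collections import Counter


def clusters(tokens, term, size=2, position="left"):
    """Count phrases of `size` tokens where `term` is first ("left") or last ("right")."""
    if position not in ("left", "right"):
        raise ValueError("position must be 'left' or 'right'")
    counts = Counter()
    for i in range(len(tokens) - size + 1):
        gram = tokens[i:i + size]
        anchor = gram[0] if position == "left" else gram[-1]
        if anchor == term:
            counts[" ".join(gram)] += 1
    return counts


tokens = ("liberal arts college and liberal arts degree "
          "in the liberal tradition").split()
left_counts = clusters(tokens, "liberal", size=2, position="left")

# Print the most frequent phrases first.
for phrase, count in left_counts.most_common():
    print(count, phrase)
```

A phrase that dominates such a list (here, "liberal arts") may be a candidate for consolidation by the scrubbing script if it should be treated as a single unit.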

 

 

Screenshot: Antconc - Clusters/N-grams view

 


Step 6 - Add words and phrases to your list of candidates for the scrubbing list or we1s_extra_stopwords.txt list

  1. If you identify problems that need to be scrubbed or stopped out, check whether they are already accounted for in the WE1S "Master List of Scrubbing Fixes and Stopwords". If they are not, add your fixes under the appropriate categories on that page.

  2. For a general explanation of the kinds of problems we are looking for, see "general principle" above. The existing examples in our "Master List of Scrubbing Fixes and Stopwords" will help explain the specific categories of problems we are looking for.
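Checking candidates against what is already on a list can also be done mechanically. The following Python sketch compares candidate words with an existing set of stopwords, ignoring case and repeated candidates. The function name and the sample words are hypothetical, and the sketch covers single words only, not the phrase-level fixes on the master list.

```python
def new_candidates(candidates, existing):
    """Return candidates not already in `existing`, ignoring case and repeats."""
    existing_lower = {word.lower() for word in existing}
    seen = set()
    fresh = []
    for word in candidates:
        key = word.lower()
        if key not in existing_lower and key not in seen:
            seen.add(key)
            fresh.append(word)
    return fresh


existing_stopwords = {"john", "monday"}
found_in_antconc = ["John", "Tuesday", "tuesday", "percent"]
to_add = new_candidates(found_in_antconc, existing_stopwords)
print(to_add)
```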