Operational Master Files
The operational master file for WE1S scrubbing fixes is the config.py script (which configures the scrub.py script). The current copy with the latest fixes is here on the WE1S Google Drive. Fixes stored in config.py are active when scrub.py runs, provided both files are stored in the same directory on a local workstation. (config.py last revised 2 Jan. 2016 by Alan.)
The operational master file for WE1S stopwords is the spreadsheet we1s-stopwords-master-file.xlsx (on the WE1S Google Drive). That spreadsheet contains both the extra stopwords added for the WE1S project (sheet 1) and the standard Mallet stopwords list (sheet 2). In both cases, the lists exist in two forms: all-lower-case and "proper" case (first letter capitalized). (There are also a few all-upper-case words in the lists.) The spreadsheet is used to generate plain-text stopwords files as needed. Which file we generate depends on the workflow: if we use the scrubbing script not only to apply fixes but also to delete stopwords from all articles before topic modeling, we generate a stopword list combining the Mallet and WE1S lists; if we ask Mallet to ignore stopwords only at the topic-modeling stage, we generate a stopword list consisting of only the WE1S list. Note: if using scrub.py to delete stopwords from the articles, both the lower-case and proper-case lists of stopwords must be included in the stopwords file (to delete both versions of words). (we1s-stopwords-master-file.xlsx last revised 2 Jan. 2016 by Alan.)
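The choice between the two stopword files can be sketched in Python as follows. This is only an illustration, not the project's actual tooling: the function names and the flag scrub_deletes_stopwords are hypothetical, and the word lists are assumed to have been read from the spreadsheet already.

```python
def expand_case_variants(words):
    """Return each word in lower-case, proper-case, and original form, without duplicates."""
    seen = set()
    expanded = []
    for word in words:
        # Keeping the original form preserves any all-upper-case entries.
        for variant in (word.lower(), word.capitalize(), word):
            if variant not in seen:
                seen.add(variant)
                expanded.append(variant)
    return expanded


def build_stopword_list(we1s_words, mallet_words, scrub_deletes_stopwords):
    """Pick the lists to combine for one of the two workflows (hypothetical helper)."""
    if scrub_deletes_stopwords:
        # scrub.py deletes the stopwords itself: combine the Mallet and WE1S lists.
        return expand_case_variants(list(mallet_words) + list(we1s_words))
    # Mallet applies its own standard list later: supply only the WE1S extras.
    return expand_case_variants(we1s_words)


def to_file_text(words):
    """Format a stopword list as plain text, one word per line."""
    return "\n".join(words) + "\n"
```

For example, build_stopword_list(["pages"], ["the"], True) yields the, The, pages, Pages, so both the lower-case and proper-case versions of each word would be deleted by the scrubbing step.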
Function of This Web Page
This web page functions as a convenience to record the fixes and stopwords in the operational versions of the files described above. Collaborators on WE1S can check the web page to see if a problem they discover is already taken care of. If not, they can add suggestions to the bottom of each category under the label "to be added" (indicated in red).
The web page also explains the principles we are using for identifying fixes and stopwords.
Note: We can include here also problems that only show up in some publications, so long as applying fixes to them has no deleterious impact on other publications. However, indicate with a note when a fix should be applied only to a specific publication. For example: we want to remove "New York Times" only from the NY Times files, not from other publications.
The general principle is that we want the WE1S scrubbing script (whose values are set in config.py) and the WE1S extra stopword list (whose values are set in we1s_extra_stopwords.txt) to work in tandem to prevent the Mallet topic modeling program from encountering frequent non-thematic lexical elements that detract from discovering meaningful topics. Consider the following hypothetical example of a cluster of words that Mallet suggests are part of a single topic because they frequently co-occur in a corpus: "whale harpoon of ship oil John the blubber etc. boat sea above seals captain Monday 's ocean". Here the words in red are thematically significant words. The other words or word fragments are non-thematic because they are either too common or co-occur in non-meaningful ways with other terms. For example, the fact that a word like "of" or a name like "John" co-occurs with both terms A and B does not mean that terms A and B actually have anything in common as a theme. (There could be many different "John's" in a corpus, for instance). Note: we only worry about problems that occur frequently, as determined by text analysis of topic models, for example. Infrequent problems will tend to drop out of sight in topic modeling.
Punctuation Fixes: (applied in config.py)
- Punctuation fixes currently in config.py (for scrubbing script)
- . (replace period with period followed by space)
- : (replace colon with period followed by space)
- ? (replace ? with ? followed by space)
- Punctuation fixes to be added to config.py
- 's (replace apostrophe s with nothing)
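As an illustration of how the punctuation fixes currently listed might be applied, here is a minimal Python sketch. The name PUNCTUATION_FIXES and the list-of-pairs layout are assumptions made for this example; the real structure of config.py may differ. The pending 's fix is left out because it is not yet in config.py.

```python
import re

# Hypothetical representation of the three fixes listed as current.
PUNCTUATION_FIXES = [
    (".", ". "),  # period -> period followed by space
    (":", ". "),  # colon -> period followed by space
    ("?", "? "),  # question mark -> question mark followed by space
]


def apply_punctuation_fixes(text, fixes=PUNCTUATION_FIXES):
    """Apply each (old, new) replacement in order, then tidy the spacing."""
    for old, new in fixes:
        text = text.replace(old, new)
    # A mark that was already followed by a space now has two; collapse them.
    return re.sub(r" {2,}", " ", text).strip()


print(apply_punctuation_fixes("Whales.Ships:oil?Yes"))
# prints "Whales. Ships. oil? Yes"
```

The point of these fixes is to keep words on either side of a punctuation mark from being fused into one token when the text is later split on whitespace.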
Spelling Variants
(applied in config.py)
- Spelling variants currently consolidated in config.py (for scrubbing script)
Principles for scrubbing: (1) for important concepts in our context, we standardize to the American spelling.
- centre --> center
- labour --> labor
- organisations --> organization
- programme --> program
- Spelling variants to be added for consolidation in config.py
Tokenizations
(applied in config.py)
("Tokenize" in this context means change multi-word phrases into what a text analysis program will treat as a single word.) Principles: We consolidate: (1) important organization, nation, or other entity names for our context; (2) important acronyms in our context; (3) important names or phrases in which one part of the phrase would otherwise be stopped out; (4) phrases, titles, and names that tend to be used as indicators of humanities works or issues (e.g., liberal_arts, Romeo_and_Juliet); (5) other important phrases functioning as compound words (e.g., social_sciences).
- Tokenizations currently in config.py (for scrubbing script)
- Abu_Dhabi
- Affordable_Care_Act
- American Association of University Professors --> AAUP
- American_Studies_Association
- Art_History
- Big_Ten
- Bryn_Mawr
- center-left, centre-left --> center_left
- Carnegie_Mellon
- Claremont_McKenna
- Chronicle_of_Higher_Education
- Cold_War
- Common_Core
- continuing_education
- Cooper_Union
- Department_of_Education
- distance education, distance-education --> distance_education
- distance_learning
- distant_reading
- East_Coast
- East_Asian
- e-mail, E-mail --> email
- Emily_Dickinson
- Ezra_Pound
- Fairleigh_Dickinson
- Ford_Foundation
- full_time
- [consolidations for school grades]
- hard_headed
- hard_nosed
- hard_science
- hard_sciences
- hard_times
- hard_wired
- hard_work
- hard_working
- Harvard_University
- Harvey_Mudd
- high-paying --> high_paying
- high school, high-school, high schools, High School --> high_school
- high-skill --> high_skill
- high-skilled --> high_skilled
- high_technology
- H.M.O. --> HMO
- Ho_Chi_Minh
- Hong_Kong
- Hurricane_Katrina
- Ivy_League
- King_Lear
- Las_Vegas
- cum_laude
- [consolidations related to "left"]
- American_left
- British_left
- left_brain
- left_leaning
- left_wing
- the_left
- [consolidations related to "right"]
- [same with variation as for "left" above]
- Letters_to_the_Editor
- liberal arts, liberal-arts, liberal art, liberal-art --> liberal_arts
- Long_Island
- long-term, long term --> long_term
- long-time, long time --> long_time
- los_angeles
- [consolidations related to humanities majors] -- both singular and plural of "major" need to be consolidated (e.g., "English major" and "English majors")
- English_major
- History_major
- Philosophy_major
- French_major
- Classics_major
- Art_major
- Language_major
- humanities_major
- arts_major
- art_major
- Art_History_major
- major_in_humanities
- [consolidations related to humanities minors]
- [same with variation as for humanities majors]
- Martin_Luther_King
- M.D., M.D.s --> M_D
- Middle East, Middle Eastern --> Middle_East
- M.A. --> M_A
- MacArthur_Foundation
- Mellon_Foundation
- Modern Language Association --> MLA
- Mother_Teresa
- National Endowment for the Humanities, N. E. H. --> NEH
- National Endowment for the Arts, N. E. A. --> NEA
- National_Humanities_Center
- National_Commission_on_Excellence_in_Education
- New_Haven
- New_Left
- New_Jersey
- New_York
- Nobel_laureate
- North America, North_American --> North_America
- Op Ed, Op ed, op ed --> Op_Ed
- part time, part-time --> part_time
- Pell grant, Pell grants --> Pell_grants
- Ph.D., Ph. D., PhD --> Ph_D
- Phi_Beta_Kappa
- poet laureate, Poet laureate, poets laureate --> poet_laureate
- political_correctness
- queer_studies
- queer_theory
- Rockefeller_Foundation
- Romeo_and_Juliet
- Royal_Society
- SAT_scores
- SAT averages, SAT score averages --> SAT_averages
- SAT_verbal
- SAT_math
- SAT subject test, SAT subject tests --> SAT_subject_test
- SAT test, SAT tests --> SAT_tests
- SAT_results
- SAT_data
- Saul_Bellow
- sci-fi, sci fi, Sci-Fi, Sci Fi --> sci_fi
- social_sciences
- social_science
- social_scientist
- social_studies
- South America, South American --> South_America
- Star_Trek
- Star_Wars
- startup, startups, Startup, Startups, start-ups, Start-ups, Start-up --> start-up
- status quo, status-quo --> status_quo
- Tel_Aviv
- The_National_Review
- Title_IX
- Uncle_Sam
- United Kingdom, U. K., U.K. --> United_Kingdom
- United_Nations
- University_of_California
- University_of_Chicago
- University_of_Bridgeport
- US, USA, U.S.A., U. S. A., U.S., United States, United States of America --> U_S
- vice_chancellor
- vice_dean
- vice_president
- vice_presidential
- vice_provost
- vice_rector
- waiting list, waiting lists --> waiting_list
- Wall_Street_Journal
- Wi-fi, wifi, Wifi --> wi-fi
- White_House
- White_Plains
- World_War_I
- World_War_II
- Yale_University
- Tokenizations to be added to config.py
- American_Express
- Black_History_Month
- Board_of_Education
- Business_World
- City_College_of_New_York
- City_University_of_New_York
- Corporation_for_Public_Broadcasting
- D.C., D. C. --> D_C
- D.C. Public Library, District of Columbia Public Library --> D.C._Public_Library
- Department_of_Education
- Department_of_Public_Works
- District_of_Columbia
- Federal_City_College
- Fort_Washington
- George_Mason_University
- George_Washington
- George_Washington_University
- House_Intelligence_Committee
- House_Speaker
- Howard_University
- Kansas_City
- Kennedy_Center_Opera_House
- Latin_American
- Mexico_City
- National_Book_Award
- National_Endowment_for_the_Arts
- National_Endowment_for_the_Humanities
- National_Gallery
- National_Geographic
- National_Geographic_Society
- National_Guard
- National_Humanities_Center
- National_Humanities_Medal
- National_Museum_of_American_History
- National_Museum_of_Natural_History
- National_Park_Service
- National_Public_Radio
- New_York_City
- New_York_Public_Library
- North_American
- Ohio State, Ohio State University --> Ohio_State_University
- Paul_Peck_Humanities_Institute
- Secretary_of_Education
- Secretary_of_State
- State_Department
- Third_World
- University_of_Maryland
- Washington, D.C. --> Washington_D.C.
- Washington_Metropolitan_Area
- Washington_Performing_Arts_Society
- Washington_Post
- White_House
- World_War_I
- World_War_II
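The tokenization entries in this section can be read as a mapping from surface variants to a single token. A minimal Python sketch of applying such a mapping is given here; the dictionary name TOKENIZATIONS and both function names are hypothetical and are not taken from config.py, and only a handful of the listed entries are included.

```python
import re

# A few entries from the lists above: variant on the left, token on the right.
TOKENIZATIONS = {
    "liberal arts": "liberal_arts",
    "liberal-arts": "liberal_arts",
    "liberal art": "liberal_arts",
    "liberal-art": "liberal_arts",
    "high school": "high_school",
    "high-school": "high_school",
    "high schools": "high_school",
    "High School": "high_school",
    "Modern Language Association": "MLA",
    "New York": "New_York",
    "New York City": "New_York_City",
}


def build_pattern(mapping):
    """Compile one regex that matches any variant as a whole phrase."""
    # Longest variants first, so "New York City" is tried before "New York".
    variants = sorted(mapping, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b")


def tokenize_phrases(text, mapping=TOKENIZATIONS):
    """Replace every listed variant with its consolidated token."""
    pattern = build_pattern(mapping)
    return pattern.sub(lambda match: mapping[match.group(0)], text)


print(tokenize_phrases("She studied liberal arts in New York City after high school."))
# prints "She studied liberal_arts in New_York_City after high_school."
```

Ordering matters whenever one phrase contains another, as with New_York and New_York_City. Note that the word-boundary anchors in this sketch assume variants that begin and end with a letter or digit; entries such as "M.D." or "Ph. D." that end in a period would need separate handling.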
Phrases stopped out in config.py (for scrubbing script):
- Stop phrases currently in config.py (for scrubbing script)
- Associated Press
- Continue reading the main story
- Corrections & Amplifications
- Credit: By
- New York Times (this fix specific to NYT)
- njtowns@nytimes.com
- N.Y. / Region
- Published:
- Room for Debate (this fix specific to NYT)
- Special to the New York Times (this fix specific to NYT)
- Sunday New Jersey Section
- 620 Eighth Avenue, New York, N.Y. 10018-1405
- Stop phrases to add to config.py
- Guardian (this fix specific to Guardian)
- Washington_Post (this fix specific to WP)
- Washington_Ways (this fix specific to WP)
- Los_Angeles_Times (this fix specific to LATimes)
- National_Public_Radio (this fix specific to NPR)
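One way to honor the publication-specific notes above is to keep general stop phrases and per-publication stop phrases in separate collections. The following Python sketch assumes this layout; the publication key "nyt", the variable names, and the function are all hypothetical and do not reflect how config.py actually stores its values.

```python
# Phrases removed from every publication (a subset of the list above).
GENERAL_STOP_PHRASES = [
    "Continue reading the main story",
    "Associated Press",
    "Published:",
]

# Phrases removed only from the named publication. The key is an assumption.
PUBLICATION_STOP_PHRASES = {
    "nyt": ["Special to the New York Times", "New York Times", "Room for Debate"],
}


def remove_stop_phrases(text, publication=None):
    """Delete general stop phrases, plus those specific to one publication."""
    phrases = list(GENERAL_STOP_PHRASES) + PUBLICATION_STOP_PHRASES.get(publication, [])
    # Longest first, so "Special to the New York Times" goes before "New York Times".
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, " ")
    # Collapse the whitespace left behind (this also flattens line breaks).
    return " ".join(text.split())
```

With this arrangement, remove_stop_phrases(article, "nyt") strips "New York Times" from a NY Times file, while the same call without a publication leaves that phrase intact in other sources.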
Single words to be stopped out in we1s_extra_stopwords.txt:
- Names currently stopped out in we1s_extra_stopwords.txt
Principles for adding words: (1) common first names of people, but not first names that are also common last names or place names; (2) months, numbers, days of the week and their abbreviations; (3) words such as "pages" that in newspapers have a primarily functional rather than thematic meaning; (4) common inadvertent OCR and word-fragment problems.
- ’ll
