Alan's Instructions for Implementing The Networked Corpus

Last edited by Alan Liu 4 years, 8 months ago

The following step-by-step instructions for implementing The Networked Corpus topic model browser are based on Scott Kleinman's Instructions for implementing The Networked Corpus and use his adapted version of the code files for The Networked Corpus (zip file). Like Scott's instructions, Alan's instructions are for implementation on a Windows machine. Adapt path locations and change backslashes to forward slashes for Mac and Linux machines as needed.


Scrubbing Articles to Delete Non-ASCII Characters

  1. Put the articles to be topic modeled in C:\networkedcorpus\articles
  2. Use Notepad++ to delete all non-ASCII characters from the files in that folder: (a) open "Find in Files" and set the directory to the articles folder; (b) set the search mode to "Regular expression"; (c) search for [^\x00-\x7F]+ and replace with nothing.
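
The same scrubbing can also be scripted instead of done by hand in Notepad++. The following is only a minimal sketch, not part of The Networked Corpus code: the function names strip_non_ascii and scrub_folder are hypothetical, and it assumes the articles are .txt files readable as UTF-8. It removes every character outside the 7-bit ASCII range and leaves tabs and line breaks in place.

```python
import re
from pathlib import Path

# Matches runs of characters outside the 7-bit ASCII range (0x00-0x7F).
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def strip_non_ascii(text: str) -> str:
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub("", text)


def scrub_folder(folder: str, encoding: str = "utf-8") -> int:
    """Rewrite each .txt file in folder without non-ASCII characters.

    Assumes the articles are .txt files in the given encoding (an
    assumption; the source only says they are plain text). Bytes that
    cannot be decoded become U+FFFD and are then stripped as well.
    Returns the number of files that were changed.
    """
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        original = path.read_text(encoding=encoding, errors="replace")
        cleaned = strip_non_ascii(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="ascii")
            changed += 1
    return changed
```

Calling scrub_folder(r"C:\networkedcorpus\articles") would overwrite the files in place, so it is worth keeping a backup copy of the articles folder first.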

Topic Modeling the Articles

Generate the topic model with MALLET from the articles in C:\networkedcorpus\articles, placing the results in C:\networkedcorpus\networkedcorpusmalletfiles\ (Adjust path names as desired for your machine. Note that the names of the MALLET output files in step 3 below must be exactly as shown there, e.g., topic_state.gz, topic_keys.txt, doc_topics.txt, etc.)

  1. In a command window, change directory to Mallet: cd C:\mallet  Then run the following two commands (adjusting path names as needed for your machine):
  2. bin\mallet import-dir --input C:\networkedcorpus\articles --output C:\networkedcorpus\networkedcorpusmalletfiles\corpus.mallet --keep-sequence --remove-stopwords --token-regex "[\p{L}\p{M}]+"
  3. bin\mallet train-topics --input C:\networkedcorpus\networkedcorpusmalletfiles\corpus.mallet --num-topics 50 --optimize-interval 20 --output-state C:\networkedcorpus\networkedcorpusmalletfiles\topic_state.gz --output-topic-keys C:\networkedcorpus\networkedcorpusmalletfiles\topic_keys.txt --output-doc-topics C:\networkedcorpus\networkedcorpusmalletfiles\doc_topics.txt --word-topic-counts-file C:\networkedcorpus\networkedcorpusmalletfiles\topic_counts.txt
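
Because the two commands repeat the same long folder paths many times, it can help to generate them from one place when adapting them to another machine. The sketch below is an illustration only: mallet_commands is a hypothetical helper, not part of MALLET or The Networked Corpus. It uses only the MALLET options that appear in the commands above, and it assumes the paths contain no spaces (they are not quoted).

```python
from pathlib import PureWindowsPath

# Token pattern copied from the import-dir command above.
TOKEN_REGEX = r"[\p{L}\p{M}]+"


def mallet_commands(articles_dir, model_dir, num_topics=50, optimize_interval=20):
    """Build the import-dir and train-topics command lines as strings.

    The commands are meant to be pasted into a command window whose
    current directory is the MALLET folder. Paths are assumed to
    contain no spaces.
    """
    articles = PureWindowsPath(articles_dir)
    model = PureWindowsPath(model_dir)
    corpus = model / "corpus.mallet"

    import_cmd = " ".join([
        r"bin\mallet", "import-dir",
        "--input", str(articles),
        "--output", str(corpus),
        "--keep-sequence", "--remove-stopwords",
        "--token-regex", '"' + TOKEN_REGEX + '"',
    ])
    # Output file names are kept exactly as in the instructions,
    # since the Networked Corpus script expects them.
    train_cmd = " ".join([
        r"bin\mallet", "train-topics",
        "--input", str(corpus),
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        "--output-state", str(model / "topic_state.gz"),
        "--output-topic-keys", str(model / "topic_keys.txt"),
        "--output-doc-topics", str(model / "doc_topics.txt"),
        "--word-topic-counts-file", str(model / "topic_counts.txt"),
    ])
    return import_cmd, train_cmd
```

Printing the two strings returned by mallet_commands(r"C:\networkedcorpus\articles", r"C:\networkedcorpus\networkedcorpusmalletfiles") should reproduce the commands in steps 2 and 3; changing only the two folder arguments keeps every path consistent.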

Running the Networked Corpus Script

  1. In the directory holding the Networked Corpus program, edit gen-networked-corpus.py so that line 303 in the script points to the folder (path relative to the location of the Networked Corpus folder) where the original plain-text files for topic modeling are kept: datadir = "articles"
  2. In a command window, change directory to the C:\networkedcorpus directory: cd c:\networkedcorpus  
  3. Run gen-networked-corpus.py using the following command (Note: the location of Python needs to be specified as below only on workstations with multiple Python installations, e.g., Anaconda, Enthought, etc.):
    C:\Users\Alan\Anaconda\python gen-networked-corpus.py --input-dir networkedcorpusmalletfiles --output-dir networkedcorpusresults
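
Since the script depends on the MALLET output files having the exact names given earlier, a quick check before running it can save a failed run. This is a hedged sketch rather than anything shipped with The Networked Corpus: missing_mallet_files is a hypothetical helper, and the list of expected names is an assumption taken from the train-topics command above, not from the script itself.

```python
from pathlib import Path

# Assumed list: the four output files named in the train-topics command.
EXPECTED_FILES = (
    "topic_state.gz",
    "topic_keys.txt",
    "doc_topics.txt",
    "topic_counts.txt",
)


def missing_mallet_files(input_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in input_dir."""
    folder = Path(input_dir)
    return [name for name in expected if not (folder / name).is_file()]
```

Run from C:\networkedcorpus, missing_mallet_files("networkedcorpusmalletfiles") returns an empty list when all four files are in place; any names it does return point to an output option that was misspelled or omitted in the train-topics step.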

Seeing the Results

  1. Go to the C:\networkedcorpus\networkedcorpusresults folder and click on index.html (or any of the .html files there).








