Alan's Instructions for Implementing The Networked Corpus

Page history last edited by Alan Liu 8 years ago

The following step-by-step instructions for implementing The Networked Corpus topic model browser are based on Scott Kleinman's Instructions for implementing The Networked Corpus and use his adapted version of the code files for The Networked Corpus (zip file). Like Scott's instructions, Alan's instructions are for implementation on a Windows machine. On Mac and Linux machines, adapt path locations and change backslashes to forward slashes as needed.

 

Scrubbing Articles to Delete Non-ASCII Characters

  1. Put the articles to be topic modeled in C:\networkedcorpus\articles
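  2. Use Notepad++ to delete all non-ASCII characters from the files in that folder: (a) open "Find in Files" and set the directory to the articles folder; (b) set the search mode to regular expression; (c) search for [^\x1F-\x7F]+ and replace with nothing. Note that this pattern also removes ASCII control characters below \x1F, including tabs and line breaks; to remove only non-ASCII characters, search for [^\x00-\x7F]+ instead.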
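
The same scrubbing step can also be scripted. The following is a minimal Python 3 sketch, not part of The Networked Corpus code: the function names scrub_text and scrub_folder are hypothetical, and it assumes the articles are UTF-8 plain-text files with a .txt extension. Unlike the Notepad++ pattern above, it removes only characters outside the ASCII range, so tabs and line breaks are kept.

```python
import os
import re

# Matches any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def scrub_text(text):
    """Return text with every non-ASCII character removed."""
    return NON_ASCII.sub("", text)


def scrub_folder(folder, encoding="utf-8"):
    """Rewrite each .txt file in folder in place.

    Assumption: the files are plain text in the given encoding.
    Returns the number of files that were changed.
    """
    changed = 0
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        # newline="" keeps the original line endings untouched;
        # undecodable bytes become U+FFFD, which is then stripped.
        with open(path, "r", encoding=encoding, errors="replace", newline="") as f:
            original = f.read()
        cleaned = scrub_text(original)
        if cleaned != original:
            with open(path, "w", encoding="ascii", newline="") as f:
                f.write(cleaned)
            changed += 1
    return changed
```

For example, scrub_folder(r"C:\networkedcorpus\articles") would clean the articles folder used in these instructions. Because the files are overwritten in place, keep a backup copy of the originals first.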


Topic Modeling the Articles

Generate the topic model with MALLET from the articles in the C:\networkedcorpus\articles folder, placing the results in C:\networkedcorpus\networkedcorpusmalletfiles\ (adjust path names as desired for your machine). Note that the names of the MALLET output files in step 3 below must be exactly as indicated: topic_state.gz, topic_keys.txt, doc_topics.txt, and topic_counts.txt.

  1. In a command window, change directory to the MALLET folder: cd C:\mallet  Then run the two commands in steps 2 and 3 (adjusting path names as needed for your machine).
  2. bin\mallet import-dir --input C:\networkedcorpus\articles --output C:\networkedcorpus\networkedcorpusmalletfiles\corpus.mallet --keep-sequence --remove-stopwords --token-regex "[\p{L}\p{M}]+"
  3. bin\mallet train-topics --input C:\networkedcorpus\networkedcorpusmalletfiles\corpus.mallet --num-topics 50 --optimize-interval 20 --output-state C:\networkedcorpus\networkedcorpusmalletfiles\topic_state.gz --output-topic-keys C:\networkedcorpus\networkedcorpusmalletfiles\topic_keys.txt --output-doc-topics C:\networkedcorpus\networkedcorpusmalletfiles\doc_topics.txt --word-topic-counts-file C:\networkedcorpus\networkedcorpusmalletfiles\topic_counts.txt
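
Because the output file names have to match exactly, it can help to check the MALLET output folder before running the Networked Corpus script. The following is a minimal Python sketch of such a check. The function name missing_mallet_outputs is hypothetical, and the list of expected names simply mirrors the train-topics command above; which of these files the script actually reads is not confirmed here.

```python
import os

# File names taken from the train-topics command in these instructions.
EXPECTED_OUTPUTS = (
    "topic_state.gz",
    "topic_keys.txt",
    "doc_topics.txt",
    "topic_counts.txt",
)


def missing_mallet_outputs(folder, expected=EXPECTED_OUTPUTS):
    """Return the expected file names that are not present in folder."""
    present = set(os.listdir(folder))
    return [name for name in expected if name not in present]
```

Calling missing_mallet_outputs(r"C:\networkedcorpus\networkedcorpusmalletfiles") returns an empty list when all four files are in place, and otherwise lists the names that still need to be generated or renamed.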


Running the Networked Corpus Script

  1. In the directory holding the Networked Corpus program, edit gen-networked-corpus.py so that line 303 in the script points to the folder (path relative to the location of the Networked Corpus folder) where the original plain-text files for topic modeling are kept: datadir = "articles"
  2. In a command window, change directory to the C:\networkedcorpus directory: cd c:\networkedcorpus  
  3. Run gen-networked-corpus.py using the following command (Note: the location of Python needs to be specified as below only on workstations with multiple Python installations, e.g., Anaconda, Enthought, etc.):
    C:\Users\Alan\Anaconda\python gen-networked-corpus.py --input-dir networkedcorpusmalletfiles --output-dir networkedcorpusresults
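
On a workstation with several Python installations, the command in step 3 can also be assembled programmatically so that the interpreter path is explicit. This is a rough Python 3 sketch under stated assumptions: build_command and run_networked_corpus are hypothetical helper names, and only the script name and the --input-dir and --output-dir options come from the instructions above. If no interpreter is given, it falls back to the interpreter running the sketch, which may not be the Python version the script expects.

```python
import subprocess
import sys


def build_command(input_dir, output_dir, python_exe=None,
                  script="gen-networked-corpus.py"):
    """Return the argument list for running the Networked Corpus script."""
    # Assumption: fall back to the current interpreter when none is named.
    exe = python_exe or sys.executable
    return [exe, script, "--input-dir", input_dir, "--output-dir", output_dir]


def run_networked_corpus(workdir, input_dir, output_dir, python_exe=None):
    """Run the script from workdir; raises CalledProcessError on failure."""
    cmd = build_command(input_dir, output_dir, python_exe)
    return subprocess.run(cmd, cwd=workdir, check=True)
```

For the setup described here, the call would be run_networked_corpus(r"C:\networkedcorpus", "networkedcorpusmalletfiles", "networkedcorpusresults", python_exe=r"C:\Users\Alan\Anaconda\python").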


Seeing the Results

  1. Go to the c:\networkedcorpus\networkedcorpusresults folder and open index.html (or any of the other .html files there) in a web browser.
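
Opening the results can be scripted as well. The sketch below uses only the Python 3 standard library. The helper names index_uri and open_results are hypothetical, and it assumes the script wrote an index.html file into the results folder, as described in the step above.

```python
import webbrowser
from pathlib import Path


def index_uri(results_dir):
    """Return a file:// URI for index.html inside results_dir."""
    return (Path(results_dir) / "index.html").resolve().as_uri()


def open_results(results_dir):
    """Open index.html in the default browser, if it exists."""
    index = Path(results_dir) / "index.html"
    if not index.is_file():
        raise FileNotFoundError("No index.html found in %s" % results_dir)
    return webbrowser.open(index_uri(results_dir))
```

For example, open_results(r"C:\networkedcorpus\networkedcorpusresults") opens the browser on the generated index page and raises an error if the script has not produced one yet.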

 

 

 

 

 

 

 

 
