Meeting 10 (2015-04-22)


 

Setting our next milestone goals:

  • Solve the issues necessary to position us for real collecting runs at scale (with manifest documentation) by summer.
  • Complete the collection of plain text for the available New York Times and Wall Street Journal material by June (or July).

 

Also: Transcriptions Research Slam, "SynchDH" (May 8th)

 


 

(1) Data Storage Platform

 

  • Progress on the NoSQL, MongoDB, BSON, Flask discussion?
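
A minimal sketch of what the MongoDB route under discussion might look like: manifests stored as JSON/BSON documents in a local MongoDB instance, with a small Flask endpoint for retrieving them. This is only an illustration, not a decision; the database, collection, and field names below are placeholders.

    # Sketch only: store a manifest as a MongoDB document and expose it through
    # a small Flask endpoint. Assumes a local MongoDB server plus the pymongo
    # and Flask packages; "project_db" and "manifests" are placeholder names.
    from flask import Flask, jsonify
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    manifests = client["project_db"]["manifests"]   # documents are stored as BSON

    app = Flask(__name__)

    # Insert (or update) a manifest document -- the JSON/BSON document itself
    # plays the role of the manifest file.
    manifests.update_one(
        {"manifestID": "collecting_nyt_2013"},
        {"$set": {
            "manifestID": "collecting_nyt_2013",
            "type": "collecting",
            "publication": "New York Times",
            "years": [2013],
        }},
        upsert=True,
    )

    @app.route("/manifests/<manifest_id>")
    def get_manifest(manifest_id):
        # Query by manifestID; exclude Mongo's internal _id from the response.
        doc = manifests.find_one({"manifestID": manifest_id}, {"_id": 0})
        return jsonify(doc or {})

    if __name__ == "__main__":
        app.run(debug=True)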

 


 

(2) Manifest Schema

 

Alan's suggestions for and revisions of our manifest system:

  • I think we're going to end up with four functionally different kinds of manifests, which can feed into or call each other in a system something like this (a rough sketch follows the list):
    1. "Collecting" manifests -- to track harvesting and scraping work.  Collecting manifests can refer by manifestID to package manifests (e.g., for the NY Times collecting scripts/tools module).  Collecting manifests track output stored in various path locations. (Draft example: collecting manifest (draft 2015-04-19).docx -- Alan's latest revision of the collecting manifest for the NYT from the April 1st rehearsal, with comments and queries.)
    2. "Corpus" manifests -- to track the corpora we create from the outputs of collecting harvests.  Corpus manifests have sections (e.g., years of materials) and states (e.g., raw text, preprocessed text), each of which will have a reference number.  Sections and states refer back by manifestID to the collecting harvest that produced them.  (No draft example yet.)
    3. "Processing" manifests -- to track various processes, including cleaning work, topic model runs, visualizations, etc.  Processing manifests refer by manifestID to corpus manifests (and reference numbers for their sections and states) as input material.  For example, a topic model manifest could take as its source material corpus 32, sections 4,5,7,8, state raw-text.  Processing manifests can also refer to package manifests as needed. (Draft example: processing manifest (draft 2015-04-19).docx -- Alan's revision and suggestions for Lindsay's processing manifest template, with Scott's generalizing revision incorporated.  This revision is a rough draft.)
    4. "Package" manifests -- modules of sequenced tools/scripts with options (and instructions as necessary). Package manifests are called into collecting and processing manifests by ID number as needed. (Draft example: package manifest (draft 2015-04-19).docx  -- Alan's mockup of a package manifest for the sequence of scripts and tools we use to scrape the NYT; includes fields for instructions.)
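
To make the cross-references concrete, here is a rough illustration (made-up IDs, paths, and field names, not an agreed schema) of how the four manifest types might refer to one another by manifestID:

    # Hypothetical illustration only: four manifest types that call or feed into
    # one another by manifestID. All IDs, paths, and field names are invented.
    package_manifest = {
        "manifestID": "package_nyt_scrape_tools",
        "type": "package",
        "scripts": ["get_urls.py", "scrape_articles.py", "strip_html.py"],
        "instructions": "Run the scripts in the order listed.",
    }

    collecting_manifest = {
        "manifestID": "collecting_nyt_2013",
        "type": "collecting",
        "usesPackage": "package_nyt_scrape_tools",   # refers to a package manifest
        "outputPath": "NYT/2013/raw/",
    }

    corpus_manifest = {
        "manifestID": "corpus_nyt",
        "type": "corpus",
        "sections": [{"ref": 1, "year": 2013, "fromCollecting": "collecting_nyt_2013"}],
        "states": [{"ref": 1, "label": "raw-text"}, {"ref": 2, "label": "preprocessed"}],
    }

    processing_manifest = {
        "manifestID": "processing_topicmodel_001",
        "type": "processing",
        "sourceCorpus": "corpus_nyt",                # corpus manifest as input material
        "sourceSections": [1],
        "sourceState": 1,                            # e.g., raw-text
    }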

 


 

(3) Review/Critique the April 1st Collecting Rehearsal

  • Results of the collecting rehearsal workshop we did on April 1, 2015
    • Results of the Wall Street Journal collecting run (by Alex) -- Alex discovered a problem with the Wall Street Journal collecting.
    • Results of the New York Times collecting run (by Patrick and Zach); results stored on MirrorMask.
  • Report Card on the collecting rehearsal (problems and possible solutions, listed from most to least important):
    • Problem: Earlier years of the digitized NYT (whose full digital product begins in 1989) contain occasional articles that are "continued" on other online pages--e.g., this article from 1989.  (Alan has not yet confirmed that current or recent NYT issues don't have the same problem.)  Scraping these articles does not capture the "next page" content at the bottom of the original article.  The source code of the original articles includes markup like the following, which our script does not know how to follow:
      <a onclick="s_code_linktrack('Article-MultiPagepageNum2');" title="Page 2" href="/1989/01/18/us/washington-transition-reagan-s-final-rating-best-any-president-since-40-s.html?pagewanted=2">2</a>
      • Solution: [none yet] (a rough scraping sketch follows this list)
    • Problem: Need to ensure that all downloads of articles are completed.  (Alan compared his earlier collecting of the NYT for 2013 with the collecting for the same year in the April 1 workshop; his run is missing articles that appear in the April 1 run, which looks more complete.)
      • Solution: when using DownLoadThemAll, be sure to wait for multiple attempts to get hung pages.
      • Solution: do another run of 2013 and compare the urls.txt with the April 1 run (a comparison sketch follows this list).
    • Problem: very occasional double downloads of articles -- e.g., "adrianne-wadewitz-37-wikipedia-editor-and-academic-dies.html" and "adrianne-wadewitz-37-wikipedia-editor-and-academic-dies_001.html"
    • Problem: Lack of common utility tools on different platforms (e.g., Chopping List.exe available for Windows but not for other platforms)
      • Solution: writing Python scripts to substitute for utility tools (thanks, Scott!)
      • Additional step: adding a manifest field to track which machines we are using.
    • Problem: small errors in storing working files -- e.g.:
      • the urls.txt file for NYT 2013 stored on MirrorMask actually includes both 2013 and 2014;
      • missing .tsv files
      • Solution: [none yet]
    • Cleaning Problems:
      • encoding errors in which curly quotation marks come through as mojibake such as: â€œ (an encoding-repair sketch follows this list)
  • Also: Alan downloaded the pages for NYT 2010-2014 that he had previously collected using the "liberal arts" query and confirmed that searching on "liberal arts" through the NYT API returns the bigram we want.
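
Regarding the multi-page article problem above (no solution yet): here is a rough, untested sketch of how a Python scraping step might detect and fetch the "pagewanted" continuation links shown in the source-code snippet. It assumes the requests and beautifulsoup4 packages; nothing here has been checked against the live NYT site.

    # Sketch only: fetch an article page and any "next page" continuations whose
    # links look like href="...?pagewanted=2". Assumes requests + beautifulsoup4.
    # (Some NYT article URLs also accepted "?pagewanted=all"; worth testing.)
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def fetch_all_pages(article_url):
        """Return the HTML of an article plus any continuation pages."""
        first = requests.get(article_url, timeout=30).text
        soup = BeautifulSoup(first, "html.parser")
        # Collect each continuation link once; this may also pick up links back
        # to page 1, so downstream de-duplication of text is still advisable.
        extra_urls = sorted({
            urljoin(article_url, a["href"])
            for a in soup.find_all("a", href=True)
            if "pagewanted=" in a["href"]
        })
        return [first] + [requests.get(u, timeout=30).text for u in extra_urls]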

 
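In the same spirit as the "compare the urls.txt with the April 1 run" suggestion above, a small sketch of such a comparison; it also flags "_001"-style double downloads. The file and folder names are placeholders for wherever the two runs are actually stored.

    # Sketch only: compare the URL lists from two collecting runs and flag files
    # that look like double downloads (e.g., "...-dies_001.html"). File and
    # folder names below are placeholders.
    import re
    from pathlib import Path

    def load_urls(path):
        return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

    run_a = load_urls("urls_2013_earlier.txt")   # placeholder: the earlier run
    run_b = load_urls("urls_2013_april1.txt")    # placeholder: April 1 workshop run

    print("In the April 1 run but missing from the earlier run:")
    for url in sorted(run_b - run_a):
        print("  ", url)

    print("In the earlier run but missing from the April 1 run:")
    for url in sorted(run_a - run_b):
        print("  ", url)

    # Flag probable duplicate downloads in a download folder.
    downloads = Path("NYT/2013")                 # placeholder download folder
    dupes = [p.name for p in downloads.glob("*.html") if re.search(r"_\d{3}\.html$", p.name)]
    print("Possible duplicate downloads:", dupes)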

 
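For the "â€œ"-style cleaning problem noted above: those sequences usually come from UTF-8 bytes for curly quotes being decoded as Windows-1252, and reversing the round trip repairs them. A minimal sketch (the ftfy package does the same thing more robustly, if we prefer a library):

    # Sketch only: repair mojibake such as "â€œ", which appears when the UTF-8
    # bytes for a curly quote are mistakenly decoded as Windows-1252.
    def fix_mojibake(text):
        try:
            return text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text   # not actually garbled (or garbled differently); leave as is

    print(fix_mojibake("â€œliberal arts"))   # -> “liberal arts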

Earlier Discussions of Manifest Schema (for current state of discussion, click here)

  • Processing fields: "In regard to the Processing section of the Manifest: I sense here that we need more detail in the manifest (but not excessive detail).  As an action item, I suggest that Lindsay and I, who have done trial harvest runs and processing plus cleaning, think about this and mock up an idea of what more we really need to track.  "Need to track" has two meanings.  First, what do we need to track in processing steps, resources, and transitional files created for development purposes (e.g., if we need to go back to a fork in the decision tree for a workflow and experiment with a different script)?  Second, what do we need to track in processing steps to document the eventual deliverables of the project (the topic models, articles, etc.) [i.e., processing steps that will need to be exposed to the end-user]?  This latter topic is new territory for the digital humanities, where to my knowledge none of the leaders in text mining work offer any transparent documentation of the key steps in their process that could allow others to reproduce their experiments."
    • Lindsay, Alan, and Scott's email exchange on processing fields of 23-24 March 2015:
      • Lindsay: "the manifest schema currently contains two different "processingInfo" sections: one located under "resource" and one as its own top-level element. It occurs to me that these two levels of the "processingInfo" element may get us some way toward thinking about the two different "tracking levels" that Alan outlines in his email. For example, for each item we store, the "resource > processingInfo" element would track that particular item's processing history. At the same time, the top-level "processingInfo" element would track the steps needed/taken for documenting eventual deliverables of the project (i.e., "Information about a sequence of processing steps not embedded within a resource element"). "
      • Alan: "I think it may come down to what we think we are tracking--the "identity" of the resource/processing workflow, as it were, linked to a manifest or family of manifests.  Perhaps the essential identity would simply be the whole harvest for a single publication (e.g., NYT)?  If we did the for-real harvest of all the digitized years of the NYT in a single run or series of runs, it's conceivable that we could set up a manifest for that whole run, with text file locations designated generically (e.g., not a specific folder and file name but all files starting "NYT__" in a particular folder set).  Then there could be child manifests for cleaning, topic modeling, and other experimentation after that (?)  In other words, a child manifest would essentially say, 'Alan took as his source files the material from the parent manifest, and then he did x, y, and z with it.'"
      • Scott: "I was thinking along lines similar to what Alan suggests (i.e. one manifest for one harvest, rather than one manifest for each file in the harvest) whilst trying to make the system flexible enough for us to attach metadata to individual files if we wanted to. So we could have a manifest for each file. With a bit of tweaking of the "Required" option in the schema, a manifest might be as simple as the following:

        manifestId: (unique manifest ID)
        resourceId: (unique ID for the file)
        label: (optional, but useful for generating graphs)
        relationships:
          isPartOf: (ID of the harvest manifest)

        In other words, a unique ID for the manifest file, a unique ID for the data file, an optional label, and a reference to the manifest for the entire harvest.
        Frankly, that might end up creating proliferating manifest files and, although we can probably generate them automatically in many cases, potentially also a lot of extra work if we want to be consistent.
        I am starting to warm to Chris' suggestion of a document storage database. The one he mentioned, MongoDB, stores documents in JSON format which essentially is a manifest document (but one that the database can also query). This would allow for the flexibility described above, but without the proliferating manifest files."
  • resourceLocation field: "Finally, my sense is that the "resourceLocation" field in the Manifest is going to need more elaboration (depending on how we finalize our data storage file structure), since it would be useful to track at least the following states of stored resources: downloaded original documents; aggregate scraped plain-text files for a span of years or a single year; individual scraped plain-text files; text-analysis outputs. I think we can defer action on this issue until the data storage organization issue clarifies a bit more."
    • From Scott: "The manifest schema actually handles these situations (though whether it does so well is open to discussion). Unfortunately, it appears that formatting changes caused the omission of the processedText field from the list of fields. You can see it in the example under processes. An updated version of the schema with the processedText field restored is accessible at http://scottkleinman.net/wp-content/uploads/ManifestDocumentation.html. The action field identifies the "state" of the stored resources and the processedText field provides the file path."

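To make Lindsay's point about the two "processingInfo" levels concrete, here is a loose illustration (not the actual schema, which is documented at the ManifestDocumentation link above): one processingInfo list inside a resource element for that item's own history, and one top-level processingInfo list for the project-level sequence, with processedText giving a file path for each state.

    # Loose illustration only -- field names follow the discussion above, but the
    # authoritative schema is the ManifestDocumentation page linked above.
    manifest = {
        "manifestId": "harvest_nyt_2013",
        "resource": {
            "resourceId": "nyt_2013_article_0042",
            "resourceLocation": "NYT/2013/raw/article_0042.html",
            "processingInfo": [
                # processing history of this particular item
                {"action": "scraped", "processedText": "NYT/2013/text/article_0042.txt"},
                {"action": "cleaned", "processedText": "NYT/2013/clean/article_0042.txt"},
            ],
        },
        "processingInfo": [
            # top-level sequence documenting the project's eventual deliverables
            {"action": "topic-modeled", "notes": "tools, parameters, and outputs recorded here"},
        ],
    }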
 

 

 
