Meeting 8 (2015-03-18)

Page history last edited by Alan Liu 9 years ago
Meeting Participants:
  • UCSB: Alan Liu, Zach Horton, Alex Kulick, Patrick Mooney
  • Clemson U.: Lindsay Thomas
  • CSUN: Scott Kleinman, Chris McKinlay

 

(1) Project Data Storage & Sharing

  • VPN to Jeremy's Synology site (Mirrormask):
    • Workable for everyone?
    • Workable if we expand to include other project developers in other locations in future? 
  • 4Humanities filespace (on Mirrormask): https://128.111.86.240:5001/
    "Filestation" folders on the filespace are under home > www > whatevery1says.  Presently, the folders include the following (pending reorganization; see below on organization of files).  ("data_archive" should be stored in an organized fashion in the filespace.  The contents of the other folders we may instead want to store and organize in the Bolt CMS):
    • data_archive     # archive of data harvests from publications (including downloads, transitional docs, plain-text docs).  Folder nomenclature and organization need standardization.
      • _production_runs_trial_1     # Alan's harvests for "humanities" & "liberal arts" for NYT and WSJ; harvests not yet at gold-standard quality.
        • new_york_times     # subfolders not shown here
        • wall_street_journal
    • [metadata]     # currently holds Scott's prototype documents for the manifest schema; this folder may later not be needed if manifests for metadata are kept on the Bolt site, or saved as exports from Bolt to either the "data" or "output" folders.
      • manifest
    • output     # intended to include output ready to be moved to public-facing web site at some time in the future.
    • workflows     # intended to store instructions, tutorials, and resources for workflows (e.g., the NYT workflow); top level of folder to hold instructions for each workflow, unless we put them in the Bolt CMS.
      • scripts     # intended to store Python, R, and shell scripts; unclear if we should organize by programming language or by task; we may want to create a naming system for scripts to indicate their sequence in a workflow
        • python_scripts 
      • tutorials
        • iPython_notebooks     # iPython notebooks for tutorials on workflow and recipes (if we decide these are useful).
  • 4Humanities Bolt CMS site (on Mirrormask): https://mirrormask.english.ucsb.edu/~4Humanities/bolt/
    • The admin page for the Bolt site is at https://mirrormask.english.ucsb.edu/~4Humanities/bolt/bolt. Click the "Manifests" link in the left sidebar to create or edit a manifest. Access to the admin site requires creation of an individual account.
    • Where are the manifests stored? Manifest data entered in Bolt are stored in the Bolt database. However, Bolt does not currently create manifest YAML documents (although it can store them as file uploads). Ideally, we should create a Bolt extension that imports YAML documents, inserts their data into the database, and then re-creates the documents for export.
    • Can we later process/harvest data from them in batch mode (e.g., through direct database query)? The above method allows access to all the manifest data through database querying. Bolt's search function will return all manifests with specific keywords (some filtering is also possible). More complex database queries will need a Bolt extension, or the database will have to be accessed by an external script. Either of these two methods could be used to generate a YAML manifest containing mixed data (e.g. a concatenated stream of individual manifests), which could then be parsed by scripts making use of the manifest's content.
    • Email from Scott of 16 March 2015:
              "Bolt has just released a new version, so we'll have to update in the next couple of weeks (I'm giving them time to discover bugs). Right now, Alan and myself are the only ones with accounts. I can create accounts for other before Wednesday so that we can do a manifest creation rehearsal.
              Right now, the manifest creation form is auto-generated by Bolt with no customisation, so we will eventually have to do some tweaking. Form validation is not currently implemented. I didn't want to add it until we decided what should be required. It is possible to save a manifest as a draft, but I suspect that it might not let you save incomplete required fields by default. We might have to do a little hacking to get it to do that.
              Right now, we have to check manually whether a manifestID is unique. We could hack the save function to check the database first. Note that Bolt will already check to see whether the manifest's internal database ID is unique. One possibility we should discuss is whether we should just use Bolt's internal ID."
  • Organization of Data Files
    • Email from Scott of 17 March 2015:
              "The manifest schema requires that we specify file paths in a number of places, but I have been putting off thinking about how to organise them on the server. I finally gave it some thought today, and I see why Alan was pulling his hair out over this issue--even before we got server space. In an effort to get my head around file paths and how they will relate to manifest documents, I've put up a small tool that models a file browser. You can see it at http://www.csun.edu/~sk36711/4humanities/filepaths.html. I didn't get a chance to see how closely this corresponds to the file structure Alan is currently using, but I hope the tool provides us with a way to visualise where we might put things."
    • Alan's current data organization:

 

(2) Manifest Schema

  • Issues:
    • Email from Scott of 25 Feb 2015:
                 "I created a manifest content type in Bolt based on the first draft of manifest template. What this means is that Bolt offers a "New Manifest" option which is the equivalent of a "New Post" option in Wordpress. In effect, each manifest is a blog post. Clicking "New Manifest" opens up a form that is automatically generated from the content type definition. I have put some screenshots in the Word document attached to this e-mail. [ManifestScreenShots.docx]
                Note that I've organised the data entry into groups of form fields, which are navigable in the tabs at the top. The meta tab is for Bolt-internal information and is not derived from our manifest. I ran into a couple of minor problems with our field terms. A couple were reserved for use by Bolt (e.g. "id"), and others are apparently run through a filter that changes their capitalisation (which should be an easy fix). But, as I mentioned, we need to reconsider field names anyway. A somewhat related issue is that Bolt uses YAML to create the form, so I had to modify our manifest document to fit the available form fields. I've included the YAML configuration at the end of the Word document so you can see what it looks like. On the whole, it was easy to map our manifest schema onto the Bolt content type, but I did skim over problems where we had multiple levels of embedding. For instance, I just used a text editor for "Author" to allow for multiple authors, but, of course, these will not be separate entries in the database. Likewise, I didn't even bother with multiple processing steps. I'll need to see if I can figure out how to handle these types of scenarios. Still, this is a promising start."
    • Email from Scott of 1 March 2015:
               "I have produced a fuller set of documentation for the manifests [ManifestDocumentation.html ] based on prior discussion and further reflection (see attachment). There are a few more issues we may wish to discuss, but I have inserted a versioning system so that we can begin using the manifest schema even if there are further changes down the line (which will inevitably be the case)."
  • Rehearsal: creating a manifest on the Bolt site.
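One issue that may come up during the rehearsal is that manifestID uniqueness currently has to be checked by hand. A minimal sketch of how an external script might run that check against the database is given here. It assumes the database is reachable through a Python DB-API connection (sqlite3 is used only for illustration), and the table name "bolt_manifests" and column name "manifestid" are hypothetical placeholders, not names confirmed from the Bolt installation.

```python
import sqlite3


def manifest_id_exists(conn, manifest_id,
                       table="bolt_manifests", column="manifestid"):
    """Return True if a manifest with this manifestID is already stored.

    The default table and column names are hypothetical; substitute
    whatever names the Bolt database actually uses.
    """
    # Table and column names cannot be bound as SQL parameters, so accept
    # only plain identifiers before interpolating them into the query.
    if not (table.isidentifier() and column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    query = f"SELECT 1 FROM {table} WHERE {column} = ? LIMIT 1"
    # The manifestID itself is passed as a bound parameter.
    return conn.execute(query, (manifest_id,)).fetchone() is not None
```

A save routine (or a person creating a manifest) could call manifest_id_exists(conn, "some_id") first and refuse to save when it returns True. If we decide simply to use Bolt's internal ID, this check becomes unnecessary.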

 

(3) Production-run rehearsal for harvesting from NYT, WSJ, USAToday

       (for Alan, Zach, Patrick, Alex:  Monday March 23, 1pm)

  • Prep by looking at the workflows for each publication.
  • Need the following scripts and utilities on lab machines and our laptops (or equivalents for our OS platforms):
    • Python scripts for the various workflows
    • Wget (versions for different OS's; may come included on Macs and Linux?)
    • DownloadThemAll plugin for Firefox.
    • Chopping List (or an equivalent utility) for splitting the aggregated_text_harvest.txt file created by our harvesting workflows into separate plain-text files, one per article, at the delimiter of ten at signs (@) between articles.
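
For anyone without Chopping List on their platform, the splitting step could be done with a short Python script. The following is a minimal sketch, assuming the delimiter is exactly ten consecutive "@" characters and the file is UTF-8 text. The function names and the "article_001.txt" output naming are illustrative assumptions, not an agreed project convention.

```python
from pathlib import Path

# Ten "@" characters separate articles in the aggregated harvest file.
DELIMITER = "@" * 10


def split_harvest(text, delimiter=DELIMITER):
    """Split aggregated harvest text into a list of article texts."""
    parts = (chunk.strip() for chunk in text.split(delimiter))
    # Drop empty chunks, e.g. the one after a trailing delimiter.
    return [chunk for chunk in parts if chunk]


def write_articles(source, out_dir, prefix="article"):
    """Write each article in `source` to its own numbered .txt file."""
    source, out_dir = Path(source), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    articles = split_harvest(source.read_text(encoding="utf-8"))
    # Zero-pad the numbers so the files sort in harvest order.
    width = max(3, len(str(len(articles))))
    written = []
    for number, article in enumerate(articles, start=1):
        target = out_dir / f"{prefix}_{number:0{width}d}.txt"
        target.write_text(article + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling write_articles("aggregated_text_harvest.txt", "plain_text") would write one file per article into a "plain_text" folder (a hypothetical folder name) next to the script.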

 

 
