Virtual Machine for Collection Runs

Page history last edited by Patrick Mooney 9 years ago

I've been putting together a virtual machine that can be deployed as a uniform environment on any machines that we're using to scrape (and, perhaps, topic model). This is intended to be an alternative approach to Alan's proposal for Workstation Set-Up. This approach to setup develops a ... well, a virtual machine, a virtual computer that can be deployed and run as if it were a real computer on any number of other systems. Instead of making sure that everyone installs the same or similar versions of software, we can just have everyone install a "hypervisor" program that runs the virtual computer, then copy the hard drive image onto their own hard drive. That is, instead of setting up a workstation by manually installing import.io, Python 2.7, DownThemAll and LastPass for Firefox, our current scripts, etc. etc. etc., then setting up a bunch of desktop shortcuts, all we need to do is install the hypervisor (currently VirtualBox, though we have other options), tweak a few settings, and copy over the virtual hard drive file. Running the "guest operating system" (a Linux distribution) then ensures that everyone has the exact same software working in the exact same way. It's a quicker set-up process and results in a more consistent experience for the person engaging in the scraping.


Here is the compressed version of the hard drive for the virtual machine, although you should probably read through the rest of this document before trying to use it.


There are several other advantages, too, I think.

  • Everyone's using the exact same versions of the exact same software. There are no subtle problems with slightly different versions of, say, Python, nor are there subtle problems with differences between Python under different operating systems, because everyone's using (say) 32-bit Python 2.7.6 under 32-bit Linux. Similarly, we don't have to worry about altering our scripts to specify different paths on different computers, or deal with Windows vs. Mac vs. Linux pathnames, because the folder setup in the virtual machine is the same for everyone using it.
  • When new versions of the software come out, we just have to update the master copy of the virtual machine and push it to each workstation using it. This is a process that can be made more or less automatic, if we want. We don't have to worry about everyone needing to separately update multiple pieces of software on multiple computers; we just push a new virtual hard drive out for everyone.
  • If any workstation starts having weird problems, we can just push a clean copy of the virtual hard drive out to it; it's faster than diagnosing the actual problem or rebuilding the affected system from scratch. In fact, if we want to go with a "FOR GOD'S SAKE, NEVER SAVE ANYTHING DIRECTLY IN THE VIRTUAL ENVIRONMENT" policy (I recommend this: we can use the virtual environment to write to the host OS if we want to think about how to arrange this interaction), then we might just want to push a clean copy of the virtual environment out to every workstation periodically anyway (weekly? nightly?).
  • We may wind up considering the virtual environment itself to be another deliverable generated by the WE1S project. This is another meaningful payoff for the public -- a pre-built scraping and topic modeling environment (just drop it in and start working!). It also lets us show our hand in terms of how we've done our research: anyone who wants to replicate our results is one large step closer to being able to use the exact same software that we did.
  • It's also a way of documenting the specific versions of software used in a particular collection or modeling run: we know that, for instance, the version of Python used on the 3 September 2015 run was 2.7.8 because we were using update 9 of the environment. It makes documenting software versions much simpler.
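
As a rough illustration of the version-documentation point, here is a minimal Python sketch of a script that could be run inside the guest at the start of a collection or modeling run to record what software was present. Nothing like this exists in the current image: the function names, the manifest layout, and the idea of an "environment_update" number stored alongside the versions are all assumptions made for the example.

```python
import json
import platform
import subprocess


def tool_version(command):
    """Return the first line a command prints, or None if it cannot be run."""
    try:
        output = subprocess.check_output(command, stderr=subprocess.STDOUT)
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = output.decode("utf-8", "replace").splitlines()
    return lines[0].strip() if lines else None


def build_manifest(environment_update, commands):
    """Collect interpreter, OS, and tool versions for one run.

    environment_update is a hypothetical label for the image revision;
    commands maps a tool name to the argv list that prints its version.
    """
    manifest = {
        "environment_update": environment_update,
        "python": platform.python_version(),
        "architecture": platform.architecture()[0],
        "system": platform.system(),
        "tools": {},
    }
    for name, command in sorted(commands.items()):
        manifest["tools"][name] = tool_version(command)
    return manifest


if __name__ == "__main__":
    # Illustrative call: record the wget version next to the Python version.
    manifest = build_manifest(9, {"wget": ["wget", "--version"]})
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Saving that output next to the scraped data would tie each run to a known state of the environment, whichever update of the image happened to be in use.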


So, some specifics. I should say up front that this environment was built as a test run, and that it should probably be built again from scratch before we actually use it as a production environment (this will take several hours but is completely do-able), but that the current environment gives us something to play with and a basis for talking about what we do & don't want to include in the virtual machine. I'll indicate why I made the choices I did, but many of them involve trade-offs in one way or another and it's rare that any choice is obviously and unambiguously "the best choice"; everything's up for debate.


  • The virtual machine uses VirtualBox 4.3.28 r100309. We could theoretically use some other hypervisor; as far as I can tell, they're all about equally good (but Jeremy or Zach may have more experience in this regard than I do). I picked VirtualBox because it's easy to install under any host operating system we might want to use and I already use it. The guest operating system has the VirtualBox guest additions installed; this gives the guest OS the ability to get at (a configurable subset of) the files on the host machine, and to write to a designated folder. (My own recommendation is that we have a "NEVER NEVER NEVER save data in the virtual machine because it could be overwritten at any time" policy; instead, we should use the software installed in the virtual environment to write to the file system on the host OS. Like everything else, though, this is debatable.)
  • The guest operating system is Linux Mint 17.2 (code name "Rafaela") with the MATE desktop environment. I've installed the 32-bit version on a 32-bit virtual machine that we should be able to run pretty much anywhere. I picked the 32-bit version because (a) that means the environment can be run on the widest variety of host machines, and (b) some of the programs installed in the guest environment have hiccups in their 64-bit versions. Picking Linux in some form as the guest OS is really our only choice (well ... our only sensible choice) if we don't want to deal with licensing issues, and really is our only choice if we want to redistribute the environment (we certainly cannot redistribute a Windows- or MacOS-based setup without going through long discussions with Microsoft or, even less likely to be successful, Apple). However, there are of course hundreds of Linux distros that we could theoretically use, and there are a lot of tradeoffs involved. Here's why I picked Linux Mint:
    • It's an Ubuntu derivative, which means that it'll be easy to find help if (when) necessary after I graduate and move on.  Virtually everything works the same in Linux Mint as it does in Ubuntu, which means virtually anyone who knows Linux at all can administer the guest system.
    • Linux Mint's update procedure ranks updates in terms of safety, and seems to me to do a pretty good job, so there's an additional vetting process in terms of software updates.
    • Unlike a lot of Linux distros, it comes with the ability to play non-free media on a standard system installation. This may not matter -- we may decide that we'd rather handle any media through the host operating system -- but if we encounter situations where we want to deal with non-textual media, then we may decide that we want to standardize procedures for doing so and install a standard set of software into the guest environment. Not needing to set up closed-source codecs will save an annoying step later if we go in this direction.
    • It's comparatively lightweight. We could certainly find other distributions that are even more lightweight, but the tradeoff will definitely be ease of use for people who don't have experience using Linux. 
    • The MATE desktop environment is itself comparatively lightweight and "feels like" Windows XP in a lot of ways, which (I'm guessing) is probably the usage metaphor that's most broadly intelligible to the largest number of people. (Linux people who haven't heard of MATE might be enlightened to hear that it's a fork of GNOME 2 that's still being updated and otherwise developed.) Future RAs or unknown scholars who want to verify our results will probably hit the ground running more quickly here than with most of the alternative possibilities, because I think that virtually everyone can interact with Windows XP.
    • 17.2 is a Long Term Support release, and the current version will continue to receive updates until at least April 2019. Updating from there will probably be easy enough, but we won't need to do so for nearly four years if we don't want to.
  • The virtual machine has (lowercase) we1s as the username, and has the expected password. It's got Firefox 38.0 as its default browser and has DownThemAll and LastPass installed as extensions. Some minor changes to the default Firefox preferences have been made because that seemed sensible to me (I can elaborate if anyone wants). Privilege escalation happens with sudo, which is the most common Linux model; the standard password is used for privilege escalation (though that's an administrative matter that probably won't be relevant to RAs working with the environment). 
  • There's a folder on the desktop with launchers and links, set up according to Alan's instructions, by analogy with how things are set up on the Macs. The expected folders ~/workspace and ~/pythonscripts exist and there are links to them in the folder on the desktop. For the most part, I take Alan's term "shortcut" to mean "symbolic link" except when a desktop launcher is required (e.g., with websites). Out of all the mess of Linux options for accomplishing this task, desktop launchers use xdg-open to launch sites in the default browser, which is Firefox.
  • Canopy is installed in ~/Canopy, although when I rebuild the system I'll probably put it in ~/bin/Canopy on the theory that that will help to consolidate programs installed by hand (and not with the aptitude package manager) in ~/bin. Import.io is installed in ~/bin/import.io/. MALLET, AntConc, etc. are not installed, but they could be (I'd vote for putting them in ~/bin/mallet etc.), and of course one of the advantages of doing this is we could install new software on all of the machines just by installing it in one place -- the master version of the virtual environment -- and then pushing a new virtual hard drive out.
  • There's a lot of software that's preinstalled with Linux Mint that could (and probably should) be cleaned off: chat, email, music, CD/DVD burning, system tools, graphic editors, office-style document editors, printing utility software, and video software could all probably be eliminated; there's a lot of other stuff that might make the installation smaller and quicker to deploy. On the other hand, there are plenty of cases where it's not trivially obvious that some of this software won't be needed, and maybe it's worthwhile to talk about what should and shouldn't be there. (Assistive technology for disabled users? Utilities for non-Western script entry? Abilities to print so that we can support easy PDF generation?)
  • Python is 2.7.6 (that's the version available in the software repositories for Linux Mint). Wget is 1.15 (ditto). Beautiful Soup 4 is installed. All installed software is up to date as of 22 June 2015 or later, though I didn't fully update everything before uploading the image: alas, I forgot to do this before I started the long upload of the virtual hard drive.
  • Advanced Renamer for Windows and Chopping List for Windows haven't been installed because, well, this isn't Windows. We could probably get them both working under Wine, which is a Windows compatibility layer for Linux, but this may be unnecessarily complex if there are other ways to accomplish what we're doing with these tools. 
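
To make the desktop-launcher setup a little more concrete, here is a minimal Python sketch that writes a launcher file which opens a website through xdg-open. It is not the script used to build the current image (the launchers there were made by hand); the function names, the file-naming rule, and the example URL in the usage note are assumptions for illustration only.

```python
import os
import stat


def launcher_text(name, url):
    """Return the contents of a .desktop launcher that opens url via xdg-open."""
    # Refuse characters that would need escaping inside a quoted Exec argument.
    if any(ch in url for ch in '"`$\\\n'):
        raise ValueError("URL contains a character that needs escaping in Exec")
    # A literal percent sign must be doubled in an Exec line.
    exec_url = url.replace("%", "%%")
    return (
        "[Desktop Entry]\n"
        "Type=Application\n"
        "Name={0}\n"
        'Exec=xdg-open "{1}"\n'
        "Terminal=false\n"
    ).format(name, exec_url)


def write_launcher(folder, name, url):
    """Write an executable launcher file into folder and return its path."""
    path = os.path.join(folder, name.replace(" ", "_") + ".desktop")
    with open(path, "w") as handle:
        handle.write(launcher_text(name, url))
    # Mark the launcher executable so the desktop is willing to run it.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path
```

Something like write_launcher(os.path.expanduser("~/Desktop"), "Example Site", "https://example.com/") would drop one launcher on the desktop; generating all of them from a single list of sites would make rebuilding the environment from scratch less tedious.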


So what's linked here is a compressed virtual hard drive image that can be used with VirtualBox. Here's a screen capture of the basic setup information:



Other setup information is available on request, but I think this should be enough to get people going. If we want to go ahead with doing this, we can create an encapsulated machine that preserves all settings in addition to the hard drive to make installation even easier, but I haven't done this this time around (nor, I think, is this what we'd want to do for anything except the initial installation; for updates, just replacing the hard drive will be easier than creating an entire encapsulated machine). Downloading and uncompressing the hard drive file and using it as the hard drive for the virtual machine set up as described above should allow the virtual machine to boot and give people a chance to play with it.
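
For anyone who would rather script the setup than click through the VirtualBox interface, here is a minimal Python sketch that assembles the VBoxManage commands for creating a machine and attaching the downloaded hard drive file. It only builds the command lists and does not run them. The function name and the controller name "SATA" are my own choices for the example, and the machine name, OS type, memory size, and disk path must be supplied by the caller, since the actual settings are the ones in the screen capture (VBoxManage list ostypes shows the valid OS type identifiers).

```python
def build_setup_commands(vm_name, disk_path, os_type, memory_mb):
    """Return VBoxManage argv lists that create a VM around an existing disk."""
    controller = "SATA"  # arbitrary name for the storage controller
    return [
        # Create and register an empty machine definition.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", os_type, "--register"],
        # Give the machine its memory allocation, in megabytes.
        ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)],
        # Add a storage controller for the hard drive to hang off.
        ["VBoxManage", "storagectl", vm_name,
         "--name", controller, "--add", "sata"],
        # Attach the uncompressed virtual hard drive file.
        ["VBoxManage", "storageattach", vm_name,
         "--storagectl", controller, "--port", "0", "--device", "0",
         "--type", "hdd", "--medium", disk_path],
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute the real settings before use.
    for command in build_setup_commands(
            "we1s-test", "/path/to/disk-image", "Ubuntu", 1024):
        print(" ".join(command))
```

Each list could be handed to subprocess.check_call once the values have been checked against the real setup information.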


If we decide we want to go ahead with this method of deploying encapsulated software, there are some decisions to be made:

  • Do we want to go with Linux Mint, or should we try something else? (I'm happy to talk about other options and advantages/disadvantages if that's what people want.) 
  • Do we want to install additional software? (MALLET? AntConc?) What software should we remove from the default installation? (I'm in favor of streamlining it as much as possible -- we can always install things later and push another hard drive image out. Still, there are some things that we may want to avoid removing that are integrated into the system to provide certain kinds of functionality for, e.g., disabled users, on the theory that setting them up again later may be more trouble than just leaving them in. But we actually can strip out a lot of stuff that can be provided by the host OS instead, including system functionality for things like wireless connectivity -- network functionality, wired or wireless, can piggyback off the host OS just fine, so there's no need to have software supporting wireless installed in the guest OS, too. Ditto for Bluetooth, probably, and any number of other things.)
  • Do we want to standardize how the guest OS shares folders with the host OS? (This is probably a good idea.) If so, how? (Again, I'm happy to discuss options here.)
  • Do we want to automate privilege escalation so it always succeeds automatically? (This is probably a bad idea, but there's an argument for it.)
  • If we're thinking about distributing this as a deliverable ... the current problem is that it's an OS with software that knows our passwords. We may want to think about how to deal with this situation before, well, delivering.
  • Any number of other minor setup details could go any number of ways, and we might want to discuss them before setting them in stone.
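
On the folder-sharing question, one possible convention (a sketch for discussion, not a decision) is for every script to build its output paths through a small helper that refuses to write anywhere except the shared folder. The Python below assumes the shared folder is mounted somewhere in the guest, for example under /media/sf_<name>, which is where the guest additions normally auto-mount shared folders on Linux guests; the helper name, the exception class, and the folder name in the usage note are hypothetical.

```python
import os


class SharedFolderError(Exception):
    """Raised when output would land inside the virtual machine."""


def output_path(shared_root, relative_name, require_mount=True):
    """Return an absolute path inside the shared folder, or raise.

    shared_root is the guest-side mount point of the host folder;
    relative_name is where the caller wants to write underneath it.
    """
    root = os.path.realpath(shared_root)
    if not os.path.isdir(root):
        raise SharedFolderError("shared folder not found: " + root)
    # If the share is not mounted, the directory is just part of the guest disk.
    if require_mount and not os.path.ismount(root):
        raise SharedFolderError("not a mounted shared folder: " + root)
    target = os.path.realpath(os.path.join(root, relative_name))
    # Reject names such as "../x" that would climb out of the shared folder.
    if not target.startswith(root + os.sep):
        raise SharedFolderError("path escapes the shared folder: " + target)
    return target
```

A scraping script would then call something like output_path("/media/sf_we1s", "2015-09-03/page.html") and fail loudly, rather than quietly saving into the guest, if the share isn't there.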


You'll notice that I haven't yet linked the compressed hard drive image. I'm still waiting for it to upload. I'll edit this page when it has and provide a link to it. Dropbox estimates it'll be another hour or so.
