Bentley Historical Library Curation Team Blog: January 2016

Thursday, January 28, 2016

Archivematica to DSpace (and back again to ArchivesSpace)

So, we haven't talked much yet on this blog about DSpace...

DSpace is a turnkey institutional repository.

Well, I guess that's not entirely true. Mike has written about both digital objects/DSpace items and also our current use of DSpace to provide access to digital archives (to see what condition his condition was in). But it's only been in the most recent phase of development with Artefactual that we've turned to DSpace/Deep Blue (our slightly customized DSpace instance) integration.

If you're interested, we've also been working on some optimization issues, packaging AIPs and prioritizing our development requests for improvements to the Appraisal and Arrangement tab (including some exciting UI enhancements--thanks, Dan!)--but all of that's a topic for another day!

Today's post is on the Archivematica-DSpace part of the ArchivesSpace-Archivematica-DSpace Workflow Integration project. I suspect this will be a two-parter. Today, I'll outline the workflow we envision as well as some of the options we're weighing for making it happen. Later, I'll report back on what course of action we decided to take and why.

Workflow

The essential workflow goes like this:

An archivist accessions (ArchivesSpace), transfers (Archivematica), appraises and arranges (forthcoming in Archivematica 1.6) a SIP.
An archivist "Finalize[s] Arrangement" for a particular digital object and its components.
Archivematica runs said digital object through the rest of its Ingest process (we'll be normalizing for preservation but you can do whatever you like!).
Archivematica creates a single Digital Object in ArchivesSpace, with one or more associated Digital Object Components.
Archivematica spits out a bagged AIP (actually two bags, one with the data itself and one with more administrative-type information) into a user-selected collection in DSpace. The deposit includes both data (an item composed of one or more bitstreams) and metadata, the latter likely coming from the ArchivesSpace metadata we've been using/creating already within Archivematica (i.e., not pulled from ArchivesSpace, or at least not pulled immediately from ArchivesSpace). [1]
Archivematica updates ArchivesSpace with the relevant information: handle, URL, etc.

Considerations

It doesn't all come down to workflow. We have some goals for the way we'd like for this to work, and Archivematica and DSpace have some additional responsibilities as components of an OAIS.

Our Goals

One of our goals for this Mellon-funded project is to ensure that new features and functionality are modular so that other institutions may adopt some or all project deliverables. Indeed, this is a requirement of Mellon's, an organization that "aims to maximize the use and sustainability of technology," and would like for "funded work to be made publicly available for the long-term benefit of... cultural institutions." [3]

This is something we take very seriously. It has informed everything about this process, from the way we've done development (so that, for example, institutions who don't use ArchivesSpace can still make use of the Appraisal and Arrangement tab), to our attempts to ensure that workflows are flexible (for MPLP as a baseline folks and for item-level folks) to our plans for sustainability (all code, in addition to just being out there, will be incorporated into Archivematica's core code and maintained by Artefactual going forward). It's even informed the way we've tried to reach out to others on thorny issues and the way we're trying to be as open as possible and share as much as we can on this blog.

All of this holds true for this process of deciding how we'll get data and metadata from Archivematica to DSpace. Ideally, we'd like for this to work for all institutions, regardless of the repository they're using (or even if they're not using a repository, but that part's easy). As we consider a move to Hydra in the next few years, this would actually work out well for us too. If that won't work, we'd at least like for this to work for everyone who uses DSpace, and not be tied specifically to Deep Blue. If even that won't work, we'll reluctantly settle for something that will only work for Deep Blue, for our local Dublin Core conventions or that MLibrary LIT developers will have to develop and maintain because, after all, we do have to make sure that data and metadata do actually get from Archivematica to DSpace by the time the grant concludes.

Archivematica's (and DSpace's) Additional Responsibilities

In addition, Archivematica and DSpace, by virtue of the fact that they are components of a digital preservation system, have some additional responsibilities above and beyond just exchanging data and metadata. Archivematica, for instance, needs to be able to ensure that AIPs have successfully transferred to what we're using for Archival Storage (i.e., DSpace), say by having some mechanism to verify checksums on both ends of a transfer. For all you OAIS junkies out there, that would be the Error Checking function of the Archival Storage Functional Entity:

The Error Checking function provides statistically acceptable assurance that no components of the AIP are corrupted in Archival Storage or during any internal Archival Storage data transfer... The Preservation Description Information (PDI) Fixity Information provides some assurance that the Content Information has not been altered as the AIP is moved and accessed.

As a result, whatever protocol or method we use to transfer data and metadata needs to be able to check this kind of thing and throw up an error if something goes wrong.
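To make that requirement a bit more concrete, here's a minimal Python sketch of a "both ends of a transfer" fixity check. It isn't how Archivematica actually does it; the function names, the FixityError class and the choice of MD5 are all assumptions made up for illustration.

```python
import hashlib


class FixityError(Exception):
    """Raised when a transferred file does not match its source (hypothetical)."""


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to spare memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_path, destination_path, algorithm="md5"):
    """Compare checksums computed on both ends; throw an error on mismatch."""
    expected = file_checksum(source_path, algorithm)
    actual = file_checksum(destination_path, algorithm)
    if expected != actual:
        raise FixityError(
            f"{destination_path}: expected {expected}, got {actual}"
        )
    return actual
```

In a real deposit the "destination" checksum would come back from the repository rather than from a local path, which is exactly why the transfer protocol matters.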

Archivematica also has a responsibility to be able to reassemble the AIP upon request. That would be the Provide Data function of the Archival Storage Functional Entity:

This function receives an AIP request that identifies the requested AIP(s) and provides them on the requested media type or transfers them to a temporary storage area.

As a result, Archivematica will need to have more granular information about individual bitstreams that make up an AIP than we originally anticipated needing, for example, for the minimal metadata that we'll record for Digital Objects in ArchivesSpace. [2] The handles are important for this, but so are the bitstream URLs, even for the administrative bit that will be hidden to users.

Those are just two examples but I hope they serve to illustrate the fact that this data/metadata exchange won't be quite as simple as copying files from one place to another.

Options

Just last week we had a brainstorming session with representatives from Artefactual, the Bentley, MLibrary LIT and the University of Edinburgh (including someone who works on SWORD) on the topic of Archivematica-DSpace data and metadata exchange. Justin at Artefactual began by outlining what he sees as the three options for getting data and metadata from Archivematica to DSpace, and we spent the hour discussing the advantages and disadvantages of each.

REST API

The DSpace REST API provides a programmatic interface to DSpace Communities, Collections, Items and Bitstreams. In the latest version, the REST API allows authentication to access restricted content as well as allowing Create, Edit and Delete on DSpace Objects. REST Endpoints allow you to do things like login, logout and, important for our purposes, post metadata and bitstreams to items, post policies to bitstreams, and get handles.

If you want to know more about the REST API, check out the latest documentation on their wiki.
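For a rough idea of what talking to the REST API might look like, here's a Python sketch that only builds the requests (it doesn't send anything). The endpoint paths, the rest-dspace-token header and the name query parameter reflect my reading of the DSpace 5.x REST documentation, so treat them as assumptions to check against your version; the server URL and function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

REST_BASE = "https://dspace.example.edu/rest"  # hypothetical server


def dspace_login_request(email, password):
    """Build the login request; the response body is expected to be a token."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{REST_BASE}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dspace_add_metadata_request(token, item_id, entries):
    """Build a request that posts (key, value) metadata pairs to an item."""
    body = json.dumps([{"key": k, "value": v} for k, v in entries])
    return urllib.request.Request(
        f"{REST_BASE}/items/{item_id}/metadata",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "rest-dspace-token": token},
        method="POST",
    )


def dspace_add_bitstream_request(token, item_id, filename, payload):
    """Build a request that posts one file's bytes to an item as a bitstream."""
    query = urllib.parse.urlencode({"name": filename})
    return urllib.request.Request(
        f"{REST_BASE}/items/{item_id}/bitstreams?{query}",
        data=payload,
        headers={"rest-dspace-token": token},
        method="POST",
    )
```

Passing any of these to urllib.request.urlopen would actually send it. Note that nothing here verifies a checksum, which is the callback problem mentioned in the disadvantages.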

Some advantages of the REST API:

Can deposit bitstreams.
Can edit metadata.

Some disadvantages:

This would not work with other repositories.
We'd need to develop a callback for something like verifying checksums.
While you can get access to restricted content, we're not sure if it can handle groups that we use for permissions (for example, Bentley IP addresses for Reading Room Only material).
Harder to get handle.

Simple Archive Format

DSpace also has a set of command line tools for importing and exporting items in batches using the DSpace Simple Archive Format. The basic idea is to produce a particular directory structure with sub-directories for Items. Each sub-directory contains components for the item's descriptive metadata and the files that make up the item. There are also conventions for the XML files and for the contents of the contents file. Important for our purposes, the Simple Archive Format allows you to import items to particular collections, can alert (e-mail) folks that items have been imported, can resume a failed import and can add items from a ZIP file. There's also a UI for import, but I don't suspect we'll be using that.

For more on the Simple Archive Format, check out the latest documentation on their wiki.
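Since the directory structure is the whole point, here's a small Python sketch that writes one item sub-directory: a dublin_core.xml file, the item's files and a contents file listing them. The function name and the sample values are hypothetical, and this is only the bare-bones layout as I understand it; our real scripts handle a lot more (local Dublin Core conventions, embargoes and so on).

```python
from pathlib import Path
from xml.etree import ElementTree as ET


def write_saf_item(archive_dir, item_name, metadata, files):
    """Write one Simple Archive Format item directory and return its path.

    metadata: list of (element, qualifier, value) Dublin Core triples;
              a qualifier of None is written as "none".
    files:    mapping of filename -> bytes for the item's bitstreams.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # Descriptive metadata: one <dcvalue> per Dublin Core field.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # The files themselves, plus a "contents" file naming one file per line.
    for filename, payload in files.items():
        (item_dir / filename).write_bytes(payload)
    listing = "".join(f"{name}\n" for name in files)
    (item_dir / "contents").write_text(listing, encoding="utf-8")
    return item_dir
```

For example, write_saf_item("archive", "item_000", [("title", None, "Test item")], {"report.pdf": b"..."}) produces archive/item_000/ with three files in it, ready for the batch import tool.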

Some advantages of the Simple Archive Format:

It's simple! But seriously, we have a lot of experience with it--we use it all the time.
Less work to implement.
We could do everything we want with it with some development.
Already works with our locally-grown embargo functionality.
Returns a file that maps deposited filenames to what they became.

Some disadvantages:

This would not work with other repositories.
There are some questions about how difficult it would be to make this work for DSpace and specific instances of DSpace like Deep Blue, given variations in Dublin Core and that kind of thing.
We'd need to develop a callback for something like verifying checksums.
It's an offline communication format, so it's slower and involves more code to maintain.
Would have to be developed by individual institutions.

SWORD

They was looking for the SWORDs, they was looking for the SWORDs! Sorry, couldn't help myself.

This is the SWORD we're looking for.

SWORD (Simple Web-service Offering Repository Deposit) is an interoperability standard that allows digital repositories to accept the deposit of content from multiple sources in different formats via a standardized protocol. SWORD allows clients to talk to repository servers. Important for our purposes, it allows deposit to a SWORD-compliant repository (DSpace is one of them, and so is Fedora) by a third party system (like Archivematica). It allows you to deposit files and, in its latest version, it handles not only the "fire and forget" deposit scenario but also the functions needed to support the whole deposit lifecycle, such as notifying a depositor that a deposit was successful and even verifying a Content-MD5 header. Cool stuff.

If you want to know more about SWORD, check out their website.
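Here's roughly what a SWORD deposit with a checksum could look like from the client side, as a Python sketch that builds the request without sending it. The header names and the SimpleZip packaging URI are from my reading of the SWORD v2 profile, and I'm assuming the Content-MD5 value is a hex digest; the function name and collection URI are hypothetical, and authentication is left out entirely.

```python
import hashlib
import urllib.request

SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def sword_deposit_request(collection_uri, filename, payload, in_progress=False):
    """Build a binary deposit request carrying an MD5 of the payload.

    A server that checks Content-MD5 can reject a corrupted upload, which is
    the kind of error checking we need on the way into Archival Storage.
    """
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(payload).hexdigest(),  # assumed hex digest
        "Packaging": SIMPLE_ZIP,
        # "true" leaves the deposit open so more can be added later.
        "In-Progress": "true" if in_progress else "false",
    }
    return urllib.request.Request(
        collection_uri, data=payload, headers=headers, method="POST"
    )
```

The In-Progress flag is what would let a client create the item first and come back with more later.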

Some advantages of using SWORD:

This would work for other repositories besides DSpace, which means a lot for our goals.
It's good for depositing files.
Hydra/Fedora support is already there.
Does things live and dynamically.
May allow you to create a handle and add metadata later.
Can e-mail folks letting them know that ingest was successful.

And some disadvantages:

It's not really for adding or editing metadata.
Doesn't handle permissions, restrictions or embargoed items.

Conclusion

Decisions, decisions! It's important to note that it's not like these are all mutually exclusive options. Simple Archive Format scripts and SWORD could be used in conjunction, for example, and this is one of the options we're currently exploring. We could also make changes to the DSpace code itself.

Well, that's about it for Archivematica-DSpace/Deep Blue Integration. Check back soon for an update on the course we've decided to take!

[1] I'm actually not sure what order this will happen in, that is, whether the Digital Object will get created in ArchivesSpace first or if content will get deposited into Deep Blue first.

[2] Check out Mike's post on digital objects and ArchivesSpace for more information on how we envision using Digital Objects in ArchivesSpace. Full disclosure, our thinking on this is evolving a bit, but for now it's still true that we'd like to use Digital Objects in ArchivesSpace mostly for their ability to point or link out to, in our case, a handle. That's a fairly minimal implementation compared to all the rich technical and administrative metadata you could add to Digital Objects. It's a system of record thing!

[3] Grantmaking Policies, Andrew W. Mellon Foundation.

Monday, January 18, 2016

ArchivesSpace + Vagrant = Archivagrant

If you're anything like me, you find yourself regularly creating, destroying, and recreating ArchivesSpace instances to run test migrations with a slightly modified set of data, to test new or updated plugins, or to verify that everything that previously worked still works with a new version of ArchivesSpace. Manually downloading an ArchivesSpace release, setting up a MySQL database, editing or copying over a config file, setting up default values, and all of the other steps that go into getting an ArchivesSpace release ready for testing can be a time consuming process. If you're even more like me, you went through this process countless times, all the while thinking "there must be a better way" without realizing the entire time that there is, indeed, a better way: Vagrant.

Vagrant is an application that allows users to create a single configuration file (aka a Vagrantfile) that can be used, shared, and reused to "create and configure lightweight, reproducible, and portable development environments" in the form of virtual machines. Vagrant, and tools like it, have been widely used by developers to solve issues that arise when development on a single application is done in a variety of development environments and to make the process of configuring a development environment across time, users, and operating systems easier and more consistent. Vagrant allows users to do some upfront work to configure an environment so that other people working on a project (and their future selves) will not have to worry about going through manual, time-consuming, and error-prone configuration steps ever again. While we aren't doing a lot of heavy developing, repurposing an existing tool like Vagrant to cut down on the amount of time we spend unzipping directories, editing config files, installing plugins, and so on allows us to focus on the work that we really need to do.

This blog post will walk through our ArchivesSpace Vagrant (or, Archivagrant) project to demonstrate how to set up a Vagrantfile and related provisioning scripts that, once written, will download the latest ArchivesSpace release, install dependencies and user plugins, set up ArchivesSpace to run against a MySQL database, and set up some of our local ArchivesSpace default values.

In order to begin using Vagrant, you will first need to do the following:

Install VirtualBox
Install Vagrant
Install a terminal application with ssh (Secure Shell). If you're on Mac or Linux, the default terminal should work. If you're on Windows and have git installed, the git shell should also work. Another option is to use a Unix-like terminal emulator such as Cygwin, being sure to install the ssh package during setup and installation.

If you've never used Vagrant before and are curious about how it works in general, follow along with the Vagrant Getting Started instructions for information about downloading and installing boxes, setting up a basic Vagrantfile, and provisioning, starting, and accessing a Vagrant virtual machine before continuing with the following instructions.

As detailed in the Vagrant setup instructions, a Vagrantfile is all that's needed to set up a Vagrant virtual machine that can be installed, started, stopped, destroyed, and recreated at any time and on any machine. Here's ours:

The Vagrantfile for this project is pretty simple. First, it indicates that the box to be used is hashicorp/precise32, which is an Ubuntu 12.04 LTS 32-bit box. Next, ports 8080 and 8089 from the guest virtual machine are forwarded to the same ports on the host machine. This will allow us to use the browser on our host machine (the computer running Vagrant) to access the ArchivesSpace application running inside of the virtual machine and interact with the ArchivesSpace application as if it were running on our actual machine using its default backend and staff interface ports. That way, we don't need to worry about finding the IP address of the Vagrant machine or remembering any non-default ports (it also means that we don't need to change anything in the many scripts we have that access the ArchivesSpace API using http://localhost:8089). Next, the Vagrantfile allocates 2 GB of RAM to the virtual machine to improve performance. Finally, the Vagrantfile provisions the virtual machine using three shell scripts: setup_python.sh, setup_mysql.sh, and setup_archivesspace.sh.

The first shell script, setup_python.sh, is fairly short and simple. It first updates Ubuntu's packages (to ensure that we'll be downloading the most up-to-date packages in this and any subsequent provisioning scripts), then installs the Python package manager pip using the Ubuntu package manager, upgrades pip to its latest version, and installs the Python Requests library, which we'll be using later to find and download the latest version of ArchivesSpace and configure our ArchivesSpace defaults.

The next shell script, setup_mysql.sh, installs the Ubuntu mysql-server package, sets up a root username and password (since this is a temporary virtual machine used only for testing purposes, it's okay if the username and password are weak and exposed), and finally creates and configures a database following the official ArchivesSpace documentation for running ArchivesSpace against MySQL.

The final provisioning script, setup_archivesspace.sh, is the most detailed. It also makes use of two separate Python scripts that do the bulk of the work, so for the purposes of this post we'll take a look at setup_archivesspace.sh in two parts. It's worth reiterating that this provisioning script configures ArchivesSpace for our needs here at the Bentley Historical Library, but you should be able to modify it to suit your needs by changing some of the variables and removing some of the plugins (or adding your own).

The first part of the setup_archivesspace.sh shell script is pretty straightforward. The script first installs the Ubuntu packages that will be used in provisioning ArchivesSpace: Java (required by ArchivesSpace), unzip (used to extract the downloaded ArchivesSpace release), and git (used to install plugins from the Bentley's GitHub repository). Then, the shell script calls a separate Python script, download_latest_archivesspace.py, which is used to locate and download the latest release of ArchivesSpace.

This Python script uses the Python Requests library and the GitHub API to find the URL for the latest official ArchivesSpace release, download it, and extract it to the guest machine's home directory.
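As a rough illustration of that approach (not the actual download_latest_archivesspace.py), here's a sketch that asks the GitHub API for the latest release and picks out the .zip asset. To keep it dependency-free it uses the standard library's urllib rather than Requests; the function names and the "first .zip asset" rule are my assumptions, and extraction is left out.

```python
import json
import urllib.request

LATEST_RELEASE_API = (
    "https://api.github.com/repos/archivesspace/archivesspace/releases/latest"
)


def zip_asset_url(release):
    """Return the download URL of the first .zip asset in a release record."""
    for asset in release.get("assets", []):
        if asset.get("name", "").endswith(".zip"):
            return asset["browser_download_url"]
    raise LookupError("no .zip asset attached to this release")


def download_latest(destination):
    """Look up the latest release, download its zip, and return the URL used."""
    with urllib.request.urlopen(LATEST_RELEASE_API) as response:
        release = json.load(response)
    url = zip_asset_url(release)
    urllib.request.urlretrieve(url, destination)
    return url
```

Calling download_latest("archivesspace.zip") from a provisioning script would then leave the zip ready for unzip.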

After downloading and unzipping the latest version of ArchivesSpace, the setup_archivesspace.sh provisioning script sets variables for the database URL and plugins entries to be edited in the ArchivesSpace config file. Then, several plugins are downloaded to the ArchivesSpace plugins directory, including the latest version of the container management plugin, our own EAD importer and exporter plugins, and our slightly modified version of Mark Cooper's very handy aspace-jsonmodel-from-format plugin (used to convert our legacy EADs to ArchivesSpace JSONModel format before posting them via the API -- we'll blog about that at some point, but it makes error identification much easier). Next, the setup_archivesspace.sh script edits the ArchivesSpace config file, replacing the default database URL and plugins entries with the variables that we set up earlier. The script continues by running the ArchivesSpace setup-database.sh script, then configures ArchivesSpace to run at system start (so we won't have to access the virtual machine just to start the application), and starts ArchivesSpace. Finally, the provisioning script calls another Python script, archivesspace_defaults.py, to set up some of our default configurations.
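The config-editing step boils down to swapping out a couple of AppConfig lines. Our provisioning script does this in shell; the sketch below shows the same idea in Python purely for illustration. The function name is hypothetical, and it assumes each setting sits on a single (possibly commented-out) line.

```python
import re


def set_config_entry(config_text, key, value):
    """Replace the first AppConfig[:key] line, or append one if none exists.

    value is inserted as-is, so pass it already quoted in Ruby syntax.
    """
    pattern = re.compile(
        rf"^#?[ \t]*AppConfig\[:{re.escape(key)}\][ \t]*=.*$", re.MULTILINE
    )
    new_line = f"AppConfig[:{key}] = {value}"
    if pattern.search(config_text):
        # A function replacement avoids backslash processing of the value.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    return config_text.rstrip("\n") + "\n" + new_line + "\n"
```

Reading config.rb, running it through set_config_entry once for db_url and once for plugins, and writing it back covers what the provisioning step needs.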

This script uses the ArchivesSpace API to set up some of the default values that we've been using for testing, including setting up a repository, container profiles, classifications, and repository preferences and editing the subject sources and name sources enumerations. While these are all configurations that can easily be set up using the ArchivesSpace staff interface, setting up some of these basic configurations in a provisioning script makes the process of starting and using an ArchivesSpace Vagrant instance that much faster.
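To give a flavor of what setting defaults through the API involves, here's a sketch (not the actual archivesspace_defaults.py) that builds a login request and a create-repository request against the backend port. It uses urllib instead of Requests so it stands alone; the function names and sample values are hypothetical, and admin/admin is assumed to be the password on a fresh test instance.

```python
import json
import urllib.parse
import urllib.request

BACKEND = "http://localhost:8089"  # ArchivesSpace backend port forwarded by Vagrant


def aspace_login_request(username="admin", password="admin"):
    """Build the login request; the JSON response should include a session id."""
    body = urllib.parse.urlencode({"password": password}).encode("utf-8")
    return urllib.request.Request(
        f"{BACKEND}/users/{username}/login", data=body, method="POST"
    )


def aspace_create_repository_request(session, repo_code, name):
    """Build a request that creates a repository using an existing session."""
    body = json.dumps({"repo_code": repo_code, "name": name}).encode("utf-8")
    return urllib.request.Request(
        f"{BACKEND}/repositories",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-ArchivesSpace-Session": session,
        },
        method="POST",
    )
```

Sending the first request with urllib.request.urlopen and reading "session" out of the JSON response gives the value to pass to the second.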

Now that we've written our Vagrantfile and associated provisioning scripts, the process of setting up a new ArchivesSpace instance for testing is as simple as doing the following:

Clone the archivagrant GitHub repository (if we haven't already)
Open a terminal application and change directories to the archivagrant directory
vagrant up

The first time that we issue the vagrant up command, it provisions the virtual machine using the scripts detailed above. Once the provisioning process is complete, we can point our host machine's browser to http://localhost:8080 (to access the ArchivesSpace staff interface) and any scripts we have to http://localhost:8089 (to access the ArchivesSpace backend). If we need to gain command line access to the running virtual machine (to stop or restart ArchivesSpace, install any additional packages, mess around in an Ubuntu server without worrying that we're going to break everything, etc.), we can vagrant ssh into it. The virtual machine can be suspended using a vagrant suspend command; shut down using vagrant halt; and destroyed with vagrant destroy. If suspended or shut down, the virtual machine can be started back up again to its previous state with another vagrant up. If destroyed, a vagrant up will recreate the virtual machine from scratch, going through the entire provisioning process. For the pros and cons of each approach, check out the Vagrant teardown documentation. I use vagrant halt most of the time, but a vagrant destroy is easiest when a new version of ArchivesSpace is released or when I have messed everything up beyond salvation.

Finally, there may be times when we want to start over with a fresh ArchivesSpace database in an existing Vagrant virtual machine without going through the process of recreating the entire machine through a vagrant destroy. The script reset_archivesspace.sh can be run by doing a vagrant ssh into the guest machine and changing directories to the /vagrant directory (a shared folder set up by Vagrant that syncs the contents of the Vagrant project's directory on the host machine to the guest machine).

The script sets up a clean MySQL database and our ArchivesSpace defaults without redownloading ArchivesSpace or reprovisioning the entire machine.

It looks like there are several other ArchivesSpace users that use a Vagrant ArchivesSpace configuration that might be worth checking out if the previously described setup doesn't quite work for you or if you want to see how others are doing it. If you're using some other way to ease the pain of frequently installing ArchivesSpace test instances, let us know!

Wednesday, January 6, 2016

2015: A Year of Progress on ArchivesSpace-Archivematica-DSpace Workflow Integration

Greetings and Happy New Year to one and all!

As we're still in the first week of 2016, we hope you'll allow us to cast one last glance back at 2015. The past year has been one of intense effort on the part of our Mellon grant team. From the three-day site visit by Evelyn McLellan and Justin Simpson of Artefactual Systems last January through our presentations at SAA and iPRES to our exploration of ASpace digital object records and PREMIS rights crosswalks, we've had our noses to the grindstone!

We've been thrilled with the interest and engagement of our peers in the project and this blog in particular; thanks to everyone who's read or commented (and please keep the feedback coming)!

Based upon page views, here are the most popular posts from the past year. Please take a moment to check them out (and get ready, because the best is yet to come!):

Happy reading and here's to an exciting and productive 2016!