Friday, April 24, 2015

Implementing Archivematica

My last post detailed our work to implement ArchivesSpace, the open source archives information management application for managing and providing web access to archives, manuscripts and digital objects. Today's post is an overview of implementing Archivematica (don't worry, much more on our feature development work with Artefactual Systems, Inc. to come) here at the Bentley Historical Library:


Known for its active community and catchy slogans, Archivematica is a “web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.” [1]

Archivematica Kick Off


Mike has already mentioned the Artefactual Systems site visit we hosted in January of this year. It was during this site visit that folks from the Library Information Technology division and Artefactual Systems, Inc. installed Archivematica and the Archivematica Storage Service. Unlike our installation of ArchivesSpace, our installation of Archivematica is currently hosted, maintained and supported (thanks, Aaron!) by the Library Information Technology division. When I started in late February, one of my initial work priorities was testing this local implementation in order to determine how Archivematica might "replace and extend" existing workflows and procedures.

Background


Note: Our proposal to the Andrew W. Mellon Foundation gave a great background to the problem that the ArchivesSpace-Archivematica-DSpace Workflow Integration project is trying to solve. That document isn't public, but much of this introduction is gleaned from that portion of the proposal which details our institutional context.

A previous post detailed the history of digital curation at the Bentley Historical Library. Relevant to our immediate conversation is Automated Processor (AutoPro), a homegrown tool that--if you couldn't guess from its name--automates digital processing from "Initial Survey" through "Deposit Content in Deep Blue" using 33 Windows CMD.EXE shell scripts that control more than 20 applications and various command line utilities:

Windows Command Prompt Interface

I know from personal experience that AutoPro has been, and continues to be, an effective processing tool. In fact, it has received a number of accolades, being recognized by conference reviewers at iPRES 2012 as:
  • "[A] successful implementation of various tools into a successful institutional workflow [...that] will be relevant to other implementers."
  • "[A] useful breakdown of the workflow steps used to process unstructured documents for ingest into an archival repository."
  • "[A] sound methodology including automated metadata generation following the EAD and PREMIS standards and creation of an audit trail."

A recent review of how to more efficiently process and deposit unstructured archival content (Microsoft Office documents, images, audio, video, &c.) in Deep Blue, however, determined that while AutoPro has been an effective processing tool, it is not an ideal solution, as:
  • Component programs installed on individual workstations must undergo frequent updates.
  • Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and few text-processing capabilities.
  • Scalability becomes an issue, as very large files or large collections can require a large amount of workstation resources.

This review just happened to coincide with enhancements to the University of Michigan Library's repository infrastructure and an increased budget for digital archives storage at the Bentley Historical Library. As a result, the Library Information Technology division recommended that we investigate Archivematica as an alternative to AutoPro, citing the following advantages:
  • A graphical user interface available via a web-based dashboard.
  • A "client/server processing architecture [that] allows it to be deployed in multi-node, distributed processing configurations."
  • Support for "large-scale, resource-intensive production environments" that would permit archivists to ingest and process simultaneously multiple large deposits of digital archives.
  • "Highly scalable configurations" that would permit granular control of settings for individual Virtual Machines (VMs) according to the size or contents of given Submission Information Packages (SIPs).
  • Ability to "control and trigger specific micro-services."
  • Improved exception handling and various notifications, "includ[ing] error reports, monitoring of [system] tasks and manual approvals in the workflow."
  • Simplified "alteration of preservation plans and user access levels." [2]

Initial Testing


Archivematica Dashboard

Using relevant procedures and workflows from Archivematica’s Testing page, I ran a number of representative transfers through Archivematica’s transfer and ingest micro-services using a variety of processing configurations. 

In addition to the sample transfers provided by Artefactual Systems, Inc. (some of which were intentionally designed to trigger Archivematica failures, such as the "Scan for viruses" micro-service), I tested a number of in-house transfers that had been previously run through AutoPro. These included all types of digital objects: 
  • websites;
  • text(ual) materials like PDFs and Word documents;
  • spreadsheets;
  • images;
  • email; and
  • audio/video files.

Some of these were hierarchical in nature, and some were flat. One transfer was exceptionally large (about 10.7 GB, although that's only a small percentage of the total SIP). I also experimented with a disk image to test the new Forensic disk image ingest feature of Archivematica (released in September 2014) and a collection of sample files with personally identifiable information intended to test Archivematica’s existing integration with bulk_extractor.

Findings


Our main interest in all this testing was to find out if and how Archivematica would "replace and extend" the Bentley Historical Library's existing procedures (i.e., AutoPro).

Replacing AutoPro


After some initial trial and error and communication with the Library Information Technology division, nearly all transfers were successfully ingested (I'll get to the one exception in a bit). We did run into some permissions trouble when indexing and storing transfers and Archival Information Packages (AIPs), but I believe most of that stems from the way we have our server set up here.

Most of the steps in the Bentley’s current digital processing workflow utilizing AutoPro can be replaced by one of Archivematica’s micro-services:

AutoPro Workflow Step                                   | Archivematica Micro-Service
--------------------------------------------------------|----------------------------------------
Virus scan                                              | Scan for viruses
Create temporary backup                                 | Create transfer backups
Open archive files (.ZIP, .TAR, etc.)                   | Extract packages
File and folder name normalization                      | Clean up names
Identify missing file extensions                        | Characterize and extract metadata
Create preservation copies                              | Normalize (Normalize preservation)
PII (credit card and Social Security number) scan       | Examine contents*
Appraisal and arrangement                               | [Appraisal and Arrangement tab]
Descriptive and administrative metadata creation        | Metadata
Extract technical metadata                              | Characterize and extract metadata
Transfer content (with metadata) to long-term storage   | Store AIP
Clean up                                                | Store AIP (Remove processing directory)

There are two notable exceptions (shown in red in the original table): the "Appraisal and arrangement" step and the PII scan.
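For anyone who wants to track a mapping like this programmatically, here is a minimal sketch in Python. It simply restates the table as data; the "full", "partial" and "none" coverage labels and the function name are shorthand invented for this example, not anything AutoPro or Archivematica defines.

```python
# Illustrative only: step and micro-service names come from the table in this
# post; the coverage labels ("full", "partial", "none") are hypothetical.
AUTOPRO_TO_ARCHIVEMATICA = {
    "Virus scan": ("Scan for viruses", "full"),
    "Create temporary backup": ("Create transfer backups", "full"),
    "Open archive files": ("Extract packages", "full"),
    "File and folder name normalization": ("Clean up names", "full"),
    "Identify missing file extensions": ("Characterize and extract metadata", "full"),
    "Create preservation copies": ("Normalize", "full"),
    "PII scan": ("Examine contents", "partial"),
    "Appraisal and arrangement": (None, "none"),
    "Descriptive and administrative metadata creation": ("Metadata", "full"),
    "Extract technical metadata": ("Characterize and extract metadata", "full"),
    "Transfer content to long-term storage": ("Store AIP", "full"),
    "Clean up": ("Store AIP", "full"),
}


def notable_exceptions(mapping):
    """Return the AutoPro steps that are not fully replaced, in table order."""
    return [step for step, (_service, coverage) in mapping.items() if coverage != "full"]


if __name__ == "__main__":
    for step in notable_exceptions(AUTOPRO_TO_ARCHIVEMATICA):
        print(step)
```

Keeping the mapping as data makes it easy to update the coverage labels as the feature development described below lands.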

Notable Exception #1: Appraisal and Arrangement

The first notable exception is AutoPro’s “Appraisal and arrangement” step, for which there exists no comparable Archivematica micro-service. This functionality is very important to us. While it's true that additional steps are needed in the digital world to ensure the authenticity, integrity and security of content, digital processing is first and foremost traditional processing (this is also why we have one Curation division here at the Bentley Historical Library, not two). Traditional archival functions like appraisal, arrangement and description are just as important in the digital world as they are in the paper world.

This is why we are partnering with Artefactual Systems, Inc. to develop an Appraisal and Arrangement tab in Archivematica. We consider this functionality a high priority, and as such it is part of the first phase of development. The mockup below is what we're working on during the first sprint; it's the Transfer Backlog pane (the "appraisal" part). The final product will also include an ArchivesSpace pane (the "arrangement" part).

Be sure to keep an eye out on this page of the Archivematica wiki for the latest and greatest version of the Appraisal and Arrangement tab.

Notable Exception #2: PII

A second exception has to do with Personally Identifiable Information (PII). While the “Examine contents” micro-service of Archivematica does replicate AutoPro's functionality to identify documents that may contain PII, it does not replicate its ability to redact PII (via Identity Finder's “Scrub” functionality), and it does not currently replicate its ability to “Shred” or securely delete files containing PII.

As it turns out, the University of Michigan has decided to pull support for Identity Finder, so this is a bit of a moot point. However, part of our proposed feature development with Artefactual Systems, Inc. also includes introducing functionality in Archivematica to act on some of the bulk_extractor reports it is currently running on transfers. For example, we hope to be able to apply machine-actionable PREMIS rights statements to files and folders identified using the accounts scanner (or others) in bulk_extractor, which looks for credit card numbers, credit card track 2 information (the magnetic stripe data track read by ATMs and credit card checkers), phone numbers, and other formatted numbers. We would then use this metadata to automatically embargo or restrict access to content in Deep Blue.
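To give a rough idea of what "acting on" a report could look like, here is a minimal sketch in Python. It is not Archivematica's implementation and does not produce PREMIS; it only shows the first step of turning scanner hits into a per-file decision. It assumes a tab-separated report (location, feature, context) with "#" comment lines, and, as a loud simplification, a hypothetical "path:offset" location column. Real bulk_extractor output encodes locations differently, so the splitting would need adapting. The sample data and function names are made up.

```python
from collections import Counter


def files_with_hits(feature_text):
    """Count reported features per file.

    Assumes tab-separated lines (location, feature, context), '#' comment
    lines, and a hypothetical '<path>:<offset>' location column.
    """
    hits = Counter()
    for line in feature_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and report comments
        location = line.split("\t", 1)[0]
        path, _sep, _offset = location.rpartition(":")
        hits[path or location] += 1  # fall back to the raw location if no ':'
    return hits


def restriction_decisions(hits, threshold=1):
    """Flag every path with at least `threshold` hits as 'restrict' (hypothetical label)."""
    return {path: "restrict" for path, count in hits.items() if count >= threshold}


# Made-up sample using well-known test card numbers.
SAMPLE = (
    "# sample accounts report\n"
    "letters/donor.docx:1042\t4111111111111111\tcard 4111111111111111 exp\n"
    "letters/donor.docx:2210\t4012888888881881\tcard 4012888888881881 exp\n"
    "minutes/1998.txt:88\t5555555555554444\tno. 5555555555554444\n"
)

if __name__ == "__main__":
    print(restriction_decisions(files_with_hits(SAMPLE), threshold=2))
```

The threshold is there because a single hit is often a false positive; where to set it is a policy decision, not something this sketch can answer.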

Extending AutoPro


A number of Archivematica micro-services would actually extend the functionality of AutoPro, giving the Bentley the ability to:
  • automatically create UUIDs for transfers, SIPs and files, uniquely identifying and directly associating transfers and SIPs, as well as files and metadata, and, as part of the proposed development work, directly associating that with the DSpace Handle System;
  • create workflow “pipelines,” pre-configuring processing decisions for transfers and SIPs for groups of like material (i.e., born-digital acquisitions, digitization projects, audio/video, disk images vs. logical copies of directories, web archives, etc.);
  • automatically generate a robust METS.xml document, which is automatically added to any SIP generated from a transfer;
  • verify transfer checksums to compare data inside of Archivematica with data as it existed outside of Archivematica;
  • quarantine a transfer for a set period of time, until virus definitions update;
  • remove cache files; [3]
  • automatically normalize files to create Dissemination Information Packages and thumbnails, if desired;
  • set permissions using PREMIS rights metadata, which, as part of the proposed development work, would also be recorded in ArchivesSpace and would carry over to the ability to embargo collections in DSpace; and
  • interact with AIPs and their METS files via an API.
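To make the checksum item in the list above a little more concrete, here is a minimal sketch in Python of the general idea: recompute a digest for each file and compare it with the value recorded before the transfer. This is not Archivematica's code; the function names, the choice of SHA-256 and the manifest shape (a dict of relative path to hex digest) are all assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root, manifest):
    """Return the relative paths that are missing or whose digest has changed.

    `manifest` is a hypothetical {relative_path: expected_hex_digest} dict
    recorded before the files were handed over.
    """
    mismatches = []
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatches.append(relative_path)
    return mismatches
```

An empty list means every file listed in the manifest arrived intact; anything else is a list of files to investigate before processing continues.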

Improving AutoPro


The original Mellon proposal noted that AutoPro is not an ideal solution because component programs installed on individual machines must undergo frequent updates, because Windows CMD.EXE scripts have a limited capacity to handle errors and exceptions and little text-processing functionality, and because scalability becomes an issue. 

Archivematica addresses some of these limitations:

Web-Based

Because Archivematica is web-based, there is no need to install clients on individual machines, and system updates only need to happen once.

Better Error-Handling

Archivematica was designed to anticipate a wide variety of processing errors. As a result, it also improves upon AutoPro’s ability to handle them. While some errors result in a process being halted and the transfer or SIP being moved to the failed directory, for others, processing can continue. Both types of errors were encountered and corrected during testing, as you can see in this typical "Archivematica Fail Report":

Type                                                          | Status                 | Started
--------------------------------------------------------------|------------------------|--------------------
Index AIP                                                     | Failed                 | 2015-03-16 17:26:35
Store the AIP                                                 | Completed successfully | 2015-03-16 16:53:27
Verify AIP                                                    | Completed successfully | 2015-03-16 16:52:17
Move to processing directory                                  | Completed successfully | 2015-03-16 16:52:17
Move to processing directory                                  | Completed successfully | 2015-03-16 15:00:38
Normalize                                                     | Completed successfully | 2015-03-16 15:00:38
Resume after normalization file identification tool selected. | Completed successfully | 2015-03-16 15:00:38
Identify file format                                          | Failed                 | 2015-03-16 14:42:38
Select pre-normalize file format identification command       | Completed successfully | 2015-03-16 14:42:38
Move to select file ID tool                                   | Completed successfully | 2015-03-16 14:42:38
Set resume link after tool selected.                          | Completed successfully | 2015-03-16 14:42:37
Set file permissions                                          | Completed successfully | 2015-03-16 14:37:09
Create removal from backlog PREMIS events                     | Completed successfully | 2015-03-16 14:37:09
Approve SIP Creation                                          | Completed successfully | 2015-03-16 14:18:16

As you can see, that's a lot of green (actually much more than is displayed here, hence the ellipses); the majority of these micro-services worked just fine. "Identify file format" is an example of an Archivematica error for which processing can and did continue. "Index AIP" is an example of an error for which processing is halted.
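If you end up reading a lot of these reports, pulling out just the failures is easy to script. The sketch below, in Python, assumes the report has already been turned into (type, status, started) rows; the rows are a subset copied from the report above, and the function name is hypothetical.

```python
from datetime import datetime

# A subset of rows from the fail report above: (type, status, started).
REPORT_ROWS = [
    ("Index AIP", "Failed", "2015-03-16 17:26:35"),
    ("Store the AIP", "Completed successfully", "2015-03-16 16:53:27"),
    ("Identify file format", "Failed", "2015-03-16 14:42:38"),
    ("Approve SIP Creation", "Completed successfully", "2015-03-16 14:18:16"),
]


def failed_tasks(rows):
    """Return (type, started) pairs for failed tasks, oldest first."""
    failures = [
        (task, datetime.strptime(started, "%Y-%m-%d %H:%M:%S"))
        for task, status, started in rows
        if status == "Failed"
    ]
    return sorted(failures, key=lambda pair: pair[1])


if __name__ == "__main__":
    for task, started in failed_tasks(REPORT_ROWS):
        print(f"{started:%H:%M:%S}  {task}")
```

Sorting oldest first puts the failures in the order they happened, which is usually the order you want to investigate them in.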

Scalability (To Be Determined)

Unfortunately, I'm not able to report out yet on how Archivematica does with scalability. We've heard tell that Archivematica can work on packages as large as one TB. However, I've attempted the 10.7 GB transfer twice, with no luck yet. Artefactual Systems, Inc. is currently working with the Library Information Technology division to get this resolved. Stay tuned for an update to this post.

Conclusion


While we did encounter some issues during Archivematica testing, for the most part it seems that Archivematica (or the proposed feature development) does indeed replace and extend the functionality of AutoPro. We're excited to start using it in production!

[1] https://www.archivematica.org/en/
[2] Quotations in this section are from https://www.archivematica.org/wiki/Overview.
[3] Curse you, thumbs.db!
