Tuesday, May 19, 2015

Legacy EAD Clean Up: Getting Started

Previous posts about our work migrating our legacy EADs to ArchivesSpace have discussed the results of legacy EAD import testing and examined the overall scale of the migration and some potential solutions for completing it successfully.

Those posts focused on the bigger picture of migrating legacy metadata to ArchivesSpace: overall error rates, common errors, and the general concerns we must address in order to migrate our legacy collections information without error and in a way that ensures maximum usability going forward. This post is the first in a series that will take a more detailed look at individual errors and the tools and strategies that we have used to resolve them.

Tools

As previously mentioned, we have found a great deal of success in approaching our legacy EAD clean up programmatically through the creation of our own custom EAD importer and by using Python and OpenRefine to efficiently clean up our legacy EADs. In order to make use of some of the scripts that we will be sharing in this and future posts, you will need to have the following tools installed on your computer:

Python 2.7.9
Aside from the custom EAD importer (which is written in Ruby), the scripts and code snippets that we will be sharing are written in Python, specifically in Python 2. Python 2.7.9 is the most recent version, but if you have an older version of Python 2 installed on your computer that will also work.

lxml
lxml is an XML toolkit module for Python, and is the primary Python module that we use for working with EAD (and, later, with MARC XML). To easily install lxml, make sure that pip is installed along with your Python installation and type 'pip install lxml' into a Command Prompt or terminal window.

To test that you have Python and lxml installed properly, open a Command Prompt (cmd.exe on a Windows computer) or a terminal window (on a Mac or Linux machine) and enter 'python'. This should start an interactive Python session within the window, displaying something like the following:

Python 2.7.9 (default, Dec 10 2014, 12:24:55) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

If that doesn't work, check the official Python documentation for help on getting Python set up on your system.

Once you are able to start an interactive Python session in your Command Prompt or terminal window, type 'import lxml' next to the '>>>' and hit enter. If an error displays, something went wrong with the lxml installation. Otherwise, you should be all set up for using Python, lxml, and many of the scripts that we will be sharing on this blog.

OpenRefine
A lot of the metadata clean up that we've been doing has been accomplished by exporting the content of certain fields from our EADs into a CSV file using Python, editing that file using OpenRefine, and then reinserting the updated metadata back into our EADs using Python.
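As a rough sketch of the export step in that workflow, something like the following works. The directory layout, column names, and the choice of the collection-level &lt;unittitle&gt; as the field being exported are all illustrative assumptions; the filename column is what lets us match edited rows back to the right EAD later.

```python
# Sketch: export each EAD's collection-level <unittitle> to a CSV,
# keyed by filename, for editing in OpenRefine. Directory and column
# names are hypothetical examples, not our production script.
import csv
import os
from lxml import etree

def export_unittitles(ead_dir, csv_path):
    with open(csv_path, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['filename', 'unittitle'])
        for filename in sorted(os.listdir(ead_dir)):
            if not filename.endswith('.xml'):
                continue
            tree = etree.parse(os.path.join(ead_dir, filename))
            # The collection-level <unittitle> lives in the <did> directly
            # under <archdesc>; local-name() sidesteps namespace differences.
            titles = tree.xpath(
                '//*[local-name()="archdesc"]/*[local-name()="did"]'
                '/*[local-name()="unittitle"]')
            title = titles[0].xpath('string()').strip() if titles else ''
            writer.writerow([filename, title])
```

After cleaning the CSV in OpenRefine, a companion script can walk the same directory, look up each filename's corrected value, and write it back into the XML.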

The Basics of Working with EADs

Many of the Python scripts that we have written for our EADs can be broken down into several groups, among them scripts that extract metadata to be examined/cleaned in OpenRefine and scripts that help us identify potential import errors that need to be investigated on a case-by-case basis.

1. Extracting metadata from EADs

One of the most common types of scripts that we've been using are those that extract some metadata from our EADs and output it to another file (usually a CSV). We'll get into the specifics of how we've used this to clean up dates, extents, access restrictions, subjects, and more in future posts dedicated to each topic, but to give you an idea of the kind of information we've been able to extract and clean up, take a look at this example script that will print collection-level extent statements from EADs to the Command Prompt or terminal window:



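A minimal sketch of a script along these lines might look like the one below. It assumes a single directory of EAD files and uses local-name() in the XPath expressions so that it works whether or not the EADs carry a namespace; those details are assumptions, not a copy of our production code.

```python
# Sketch: print the collection-level <extent> statement(s) from each
# EAD in a directory. The directory argument and element paths are
# assumptions for illustration.
import os
from lxml import etree

def print_extents(ead_dir):
    for filename in sorted(os.listdir(ead_dir)):
        if not filename.endswith('.xml'):
            continue
        tree = etree.parse(os.path.join(ead_dir, filename))
        # Collection-level extents sit in <archdesc>/<did>/<physdesc>/<extent>.
        extents = tree.xpath(
            '//*[local-name()="archdesc"]/*[local-name()="did"]'
            '/*[local-name()="physdesc"]/*[local-name()="extent"]')
        for extent in extents:
            print(filename + ': ' + extent.xpath('string()').strip())
```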
When this script is run against a sample set of EADs, we get the following output:







As you can see from that small sample, we have a wide variety of extent statements that will need to be modified before we import them into ArchivesSpace. Look forward to a post about that in the near future!

2. Identifying errors in EADs

One of the most common types of problems that we identified during our legacy EAD import testing is that some bits of metadata required to successfully import EADs into ArchivesSpace are either missing or misplaced in our EADs. As such, we are not always looking to extract metadata from our EADs to clean up and reinsert to fit ArchivesSpace's or our own liking. Sometimes we simply need to know that information is not present in our EADs, or at least is not present in the way that ArchivesSpace expects.

The most common error associated with missing or misplaced information in our EADs is the result of components lacking <unittitle> and/or <unitdate> tags. A title is required for all archival objects in ArchivesSpace, and it can be supplied as a title and a date, just a title, or just a date.

We have some (not many, but some) components in our EADs that are missing an ArchivesSpace-acceptable title. Sometimes, this might be the result of the conversion process from MS Word to EAD inserting a stray empty component at the end of a section in the container list, such as at the end of a series or subseries. These empty components can be deleted and the EAD will import successfully. Other times, however, our components that lack titles actually ought to have titles; this is usually evident when a component has a container, note, extent, or other kind of description that indicates there really is something being described that needs a title.

So, rather than write a script that will delete components missing titles or modify our custom EAD importer to tell ArchivesSpace to ignore those components, we need to investigate each manifestation of the error and decide on a solution on a case-by-case basis. This script (something like it, anyway) helps us do just that:



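A sketch of that check, using lxml's getpath() to report each offending component's location, might look like this. The directory handling and the use of local-name() to cope with namespaced and un-namespaced EADs alike are assumptions for illustration.

```python
# Sketch: flag any <c01>...<c12> component that lacks a <unittitle> with
# text, a <title> nested inside <unittitle>, and a <unitdate>, printing
# the filename and the component's xpath.
import os
import re
from lxml import etree

def find_untitled_components(ead_dir):
    results = []
    for filename in sorted(os.listdir(ead_dir)):
        if not filename.endswith('.xml'):
            continue
        tree = etree.parse(os.path.join(ead_dir, filename))
        for component in tree.iter():
            if not isinstance(component.tag, str):
                continue  # skip comments and processing instructions
            if not re.match(r'c(0[1-9]|1[0-2])$',
                            etree.QName(component).localname):
                continue
            # Acceptable titles, all within the component's <did>: a
            # <unittitle> with text or a nested <title>, or a <unitdate>.
            acceptable = component.xpath(
                './*[local-name()="did"]/*[local-name()="unittitle"]'
                '[normalize-space() or *[local-name()="title"]]'
                ' | ./*[local-name()="did"]/*[local-name()="unitdate"]')
            if not acceptable:
                results.append((filename, tree.getpath(component)))
    for filename, path in results:
        print(filename + ': ' + path)
    return results
```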
This script will check each <c0x> component in an EAD for either a <unittitle>, a title nested within a <unittitle> (such as <unittitle><title>), or a <unitdate>. If a component is missing all three acceptable forms of a title, the script will output the filename and the xpath of the component.

A sample output from that script is:




Checking those xpaths in the EAD will take us directly to each component that is missing an acceptable title. From there, we can make decisions about whether the component is a bit of stray XML that can be deleted or if the component really does refer to an archival object and ought to have a title.

For example, the following component refers to something located in box 1 and has a note referring to other materials in the collection. This should have a title.

<c03 level="item">
    <did>
      <container type="box" label="Box">1</container>
      <unittitle/>
    </did>
    <note><p>(See also Economic Research Associates)</p></note>
</c03>


This component, however, is a completely empty <c02> element at the end of a <c01> element and does not have any container information, description, or other metadata associated with it. This can safely be deleted.

<c02 level="file"><did><unittitle/></did></c02></c01>


In the coming weeks we'll be detailing how we've used these sorts of strategies to tackle some of the specific problems that we've identified in our legacy EADs, including outputting information from and about all of our EADs to CSVs, cleaning up all the messy data in OpenRefine, and replacing the original, messy data with the new, clean data. We'll also be detailing some things that we've been able to clean up entirely programmatically, without needing to even open an EAD or look at its contents in an external program. Legacy metadata clean up here at the Bentley is an ongoing process, and our knowledge of our own legacy metadata issues, our understanding of how we want to resolve them, and our skills to make that a possibility are constantly evolving. We can't wait to share all that we've learned!
