Tuesday, May 5, 2015

Legacy EAD Clean Up: Scale and Solutions

As discussed in a previous post about the legacy EAD import testing into ArchivesSpace that we conducted here at the Bentley, our legacy EADs will require quite a bit of cleanup before they will import into ArchivesSpace without error and with all of their data mapped as we would like.

Based on that testing, we have a good sense of the common errors we can expect to find in our legacy EADs. We have also investigated how our data currently maps into ArchivesSpace and how we might want to use some of its features going forward, so we likewise have a good sense of the additional metadata cleanup and normalization we will need to take care of before migrating.

Having a good understanding of what we must do before migrating our legacy EADs does not quite answer the question of how we should proceed with that cleanup. The scale of the errors that need resolving, and of the additional metadata that needs cleaning, pushed us to consider how we might do some of that work in an efficient and programmatic way, and we've found some tools, including Python and OpenRefine, incredibly useful in that endeavor. We will share specific examples, code samples, step-by-step instructions, and maybe some video demonstrations in later posts; for now, we hope to give a sense of the scale of the legacy metadata cleanup we're doing and to introduce the tools that have been most effective in helping us toward our goals.

The Scale of the Clean Up

As mentioned in the previous post regarding legacy EAD import testing into ArchivesSpace, preliminary testing revealed an error rate of about 35%. At that rate, almost 1,000 of our legacy EADs would fail during migration to ArchivesSpace, each with varying numbers and types of errors.

Further investigation of how our legacy data maps into ArchivesSpace revealed that we need to modify almost all of our nearly 3,000 EADs, not only so that they migrate into ArchivesSpace without error, but so that all of the descriptive metadata within and associated with them arrives in ArchivesSpace the way we want it.

Some of the concerns that this additional ArchivesSpace testing revealed include:

1. Extents. We have some extent statements that refer to multiple types of materials or containers, such as "3 linear feet and 2 oversize volumes." Extent statements such as these are imported into ArchivesSpace as:

Number: 3
Type: linear feet and 2 oversize volumes

This clutters the ArchivesSpace controlled value list for extent type and would make it very difficult for us to run any meaningful reports on the extent of our collections in the future. Ideally, that extent statement would be imported into ArchivesSpace as two separate statements, each with the relevant number and type.
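To give a rough sense of how that kind of split might be automated, here is a minimal Python sketch (an illustration only, not our actual migration code) that assumes compound statements follow the "number type and number type" pattern shown above:

import re

def split_extent(statement):
    # Split a compound extent statement such as
    # '3 linear feet and 2 oversize volumes' into (number, type) pairs.
    # Assumes each component begins with a number followed by its type.
    components = re.split(r'\s+and\s+', statement)
    extents = []
    for component in components:
        match = re.match(r'([\d.]+)\s+(.+)', component.strip())
        if match:
            extents.append((match.group(1), match.group(2)))
    return extents

print(split_extent('3 linear feet and 2 oversize volumes'))
# [('3', 'linear feet'), ('2', 'oversize volumes')]

Real extent statements are messier than this, of course, which is exactly the sort of irregularity that OpenRefine (discussed below) helps us find.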

2. Dates. Almost all of our <unitdate> tags within our EAD container lists contain only the date within the tags (<unitdate>May 5, 2015</unitdate>) and not an additional normal attribute with a normalized, machine-understandable form of the date (<unitdate normal="2015-05-05">May 5, 2015</unitdate>). As such, our dates import into ArchivesSpace only as plain text date "Expressions," and not as normalized "Begin" and "End" dates.
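As a hedged illustration of how normal attributes might be added in bulk, the following Python sketch uses the lxml and python-dateutil libraries (our choice for this example; later posts will go into the details of our actual approach) to normalize simple single dates:

from lxml import etree
from dateutil import parser  # third-party: pip install lxml python-dateutil

def normalize_unitdates(ead_path, out_path):
    # Add a normal attribute holding an ISO 8601 date to any <unitdate>
    # with a simple, parseable single date. Legacy dates (ranges, 'circa',
    # 'undated', etc.) need far more careful handling than this.
    tree = etree.parse(ead_path)
    for unitdate in tree.iter('{*}unitdate'):  # any (or no) namespace
        if unitdate.get('normal') or not unitdate.text:
            continue
        try:
            parsed = parser.parse(unitdate.text.strip())
            unitdate.set('normal', parsed.strftime('%Y-%m-%d'))
        except (ValueError, OverflowError):
            pass  # leave unparseable dates for manual review
    tree.write(out_path, encoding='utf-8', xml_declaration=True)

# Hypothetical filenames:
# normalize_unitdates('legacy.xml', 'legacy-normalized.xml')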

3. Subjects. Some of our collections have subjects grouped by type of material. In our EAD, the material types to which the subjects refer are identified with a <head> tag in each group of <controlaccess> tags, as follows:

<controlaccess>
  <controlaccess>
    <head>Subjects</head>
    <subject>Engineering</subject>
      etc.
  </controlaccess>
  <controlaccess>
    <head>Subjects - Visual Materials:</head>
    <subject>Aged persons</subject>
      etc.
  </controlaccess>
</controlaccess>

ArchivesSpace does not maintain those groupings.
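One way to keep those groupings from disappearing is to pull them out before migration. Here is a minimal Python sketch (assuming the nested <controlaccess> structure shown above) that writes each term and its group heading to a CSV:

import csv
from lxml import etree

def export_grouped_subjects(ead_path, csv_path):
    # Write each controlled access term and its group <head> to a CSV
    # so that the groupings survive outside of ArchivesSpace.
    tree = etree.parse(ead_path)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['group', 'tag', 'term'])
        for group in tree.iter('{*}controlaccess'):
            head = group.find('{*}head')
            if head is None:
                continue  # skip the outer wrapper, which has no <head>
            for term in group:
                if term is head or not isinstance(term.tag, str):
                    continue  # skip the <head> itself and any comments
                if term.text:
                    writer.writerow([head.text,
                                     etree.QName(term).localname,
                                     term.text.strip()])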

4. Digital object descriptions. Digital objects related to our archival collections are currently accessible through Deep Blue, the University of Michigan's institutional repository. These digital objects are linked to and minimally described in our EADs using <dao> tags that often contain a brief associated <note>. The descriptive metadata in that <note> and the descriptive metadata in Deep Blue often do not match, with Deep Blue often providing a much richer description for the digital content. If we want to use ArchivesSpace as our system of record for descriptive metadata going forward, we will need to reconcile those descriptions in Deep Blue with our descriptions in our EADs.
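As a starting point for that reconciliation, a short Python sketch like the following (the attribute handling is an assumption on our part, since <dao> links may use href or xlink:href depending on the file) could pull every link and its brief EAD note into a CSV for side-by-side comparison with the Deep Blue records:

import csv
from lxml import etree

XLINK = 'http://www.w3.org/1999/xlink'

def export_dao_descriptions(ead_path, csv_path):
    # Write each <dao> link and its brief note text to a CSV so that
    # the EAD descriptions can be lined up against Deep Blue's records.
    tree = etree.parse(ead_path)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['href', 'ead_note'])
        for dao in tree.iter('{*}dao'):
            href = dao.get('href') or dao.get('{%s}href' % XLINK) or ''
            note_text = ' '.join(' '.join(dao.itertext()).split())
            writer.writerow([href, note_text])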

5. Formatting. Some of our EADs have formatting characteristics or certain characters that display fine in the EADs and in our current access system, but which come out looking a little bit off after being imported into ArchivesSpace. Examples of these formatting characteristics include trailing commas and the word "and" between <unitdate> tags, which look like this in the EAD:

<unittitle>Letters, <unitdate>1919</unitdate> and <unitdate>1925</unitdate></unittitle>

ArchivesSpace strips dates out of <unittitle> tags, but leaves behind the trailing commas and the stray text between the tags. In ArchivesSpace, that bit of EAD might look like this:

Letters, and, 1919 1925
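Because the problem is so regular, it lends itself to a batch fix. A minimal Python sketch (the patterns assume the exact formatting shown above) operating on the raw EAD text:

import re

def clean_unittitle_formatting(ead_text):
    # Strip the trailing comma before a <unitdate> and the stray 'and'
    # between consecutive <unitdate> tags, so nothing is left dangling
    # once ArchivesSpace pulls the dates out of the <unittitle>.
    ead_text = re.sub(r',\s*(<unitdate)', r' \1', ead_text)
    ead_text = re.sub(r'(</unitdate>)\s*and\s*(<unitdate)', r'\1 \2', ead_text)
    return ead_text

sample = ('<unittitle>Letters, <unitdate>1919</unitdate> and '
          '<unitdate>1925</unitdate></unittitle>')
print(clean_unittitle_formatting(sample))
# <unittitle>Letters <unitdate>1919</unitdate> <unitdate>1925</unitdate></unittitle>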

We plan to dig into each of those issues in greater detail in later posts; the list above is just to demonstrate that we need to make changes (some great and some small) to almost all of our legacy EADs. Making those changes manually would not be feasible, so we started looking into existing tools that could help us make them as efficiently as possible. Two tools that we've found indispensable in that process are the programming language Python and the metadata cleanup tool OpenRefine.

An Introduction to Python and OpenRefine

Python is a general-purpose programming language that can do all sorts of things; for our purposes, it has served as an error identification, batch correction, and data extraction tool. We have written dozens of short Python scripts that serve a wide variety of functions in our cleanup process: scripts that quickly parse our legacy EADs to identify which ones have a particular issue, scripts that make batch changes to all of our EADs at once for common and programmatically resolvable issues, and scripts that extract the information from specific fields in our EADs and write it to a CSV file so that we can do additional investigation and cleanup in OpenRefine.
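The first of those categories is the simplest to illustrate. A sketch like the one below (the directory name and XPath are hypothetical examples) walks a folder of EADs and prints the name of every file matching an XPath expression, here flagging files with un-normalized dates:

import os
from lxml import etree

def find_eads_with_issue(ead_dir, xpath):
    # Print the filename of every EAD in a directory that matches
    # the given XPath expression.
    for filename in sorted(os.listdir(ead_dir)):
        if not filename.endswith('.xml'):
            continue
        tree = etree.parse(os.path.join(ead_dir, filename))
        if tree.xpath(xpath):
            print(filename)

# Hypothetical directory; flags EADs containing un-normalized dates.
find_eads_with_issue('eads', '//*[local-name()="unitdate"][not(@normal)]')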

OpenRefine is a "powerful tool for working with messy data" that allows us to take the CSV files (some of which have tens of thousands of rows) we've created using Python, quickly and efficiently find the messy data in them, clean it up, and re-export the data to be written back into our EADs using Python.
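To sketch the last leg of that round trip (with heavy caveats: the "cleaned" column name is hypothetical, and this assumes the CSV rows are still in document order), cleaned values can be written back into an EAD roughly like so:

import csv
from lxml import etree

def apply_cleaned_extents(ead_path, csv_path, out_path):
    # Write OpenRefine-cleaned extent text back into an EAD, pairing
    # the CSV rows with the <extent> tags in document order.
    tree = etree.parse(ead_path)
    extents = tree.iter('{*}extent')
    with open(csv_path, newline='') as f:
        for element, row in zip(extents, csv.DictReader(f)):
            element.text = row['cleaned']
    tree.write(out_path, encoding='utf-8', xml_declaration=True)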

Subsequent posts will take a closer look at how we have cleaned up some specific aspects of our legacy EADs, including dates, extents, subjects, digital object descriptions, and formatting. We'll also dive into some of the work that we've done developing our custom EAD importer to resolve some of our most common ArchivesSpace import errors without us needing to make any modifications to the original EADs. Stay tuned!
