Wednesday, July 1, 2015

Inserting Normalized Dates into EADs

In previous posts, we shared how we automated the normalization of 75% of our unitdates and how we extracted the remaining 25% for normalization in OpenRefine. This post will detail how we take the normalized dates from OpenRefine and insert them into the proper place in our EADs.

When we left off last time, we had just created a new column, 'normal,' in our OpenRefine project, containing normalized versions of dates of the form 'Month YYYY-Month YYYY.' The first step in exporting these dates from OpenRefine is to click on the triangle next to the 'normal' column name and select "Facet by blank" from the "Customized facets" drop down in the "Facets" menu.
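To recap the kind of transformation that produced the 'normal' column, here is a minimal sketch of how a date like 'May 1942-June 1943' might be converted to the ISO-style form used in an EAD 'normal' attribute. The function name is hypothetical, and this assumes the 'Month YYYY-Month YYYY' pattern described above:

```python
from datetime import datetime

def normalize_range(text):
    """Convert a date of the form 'Month YYYY-Month YYYY'
    (e.g. 'May 1942-June 1943') to 'YYYY-MM/YYYY-MM',
    the form suitable for an EAD @normal attribute."""
    parts = [p.strip() for p in text.split("-")]
    # parse each half as a full month name plus year, then
    # reformat as year-month
    return "/".join(
        datetime.strptime(p, "%B %Y").strftime("%Y-%m") for p in parts
    )

print(normalize_range("May 1942-June 1943"))  # 1942-05/1943-06
```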


Next, select "False" from the "Facet by blank" options menu on the left side of the project.



This will give us the ability to export a CSV containing only the dates that we have normalized in our current OpenRefine project, rather than waiting until we have normalized every single date to actually export the CSV and update our EADs. This helps us to focus on a small(ish) subset of dates to normalize each time we work in OpenRefine, breaking large scale legacy metadata cleanup into more manageable chunks.

Once the project is faceted to include only the dates that we have normalized, click "Comma-separated Value" from the "Export" menu on the top right and save the CSV.


Our exported CSV looks like this:


Our exported CSV contains five columns:

- 'Column 1': the filename of the EAD
- 'Column 2': the unitdate's XPath (the unique address of the particular unitdate within the EAD)
- 'Column 3': the original date text that we extracted from the EAD
- 'expression': the cleaned-up version of the original text (we didn't clean up the date text in this example, but you may want to do this to remove typos, normalize abbreviations, etc.)
- 'normal': the normalized form of the date text that we created in OpenRefine, to be added as a 'normal' attribute

This is all of the info we need to add 'normal' attributes to each of the unitdates represented in this CSV, and we do it all with this Python script:


It's important to note that Python indexes start at 0, not 1. So as the script loops through each row in the CSV, telling Python the following:

...
filename = row[0]
date_path = row[1]
...

it's really telling Python that the filename is located in the first column of the row, the date_path is located in the second column of the row, and so on. This is important to keep in mind in the event that you add additional columns, create your columns in a different order, etc., and need to change the script accordingly.
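One way to avoid positional indexing entirely (an alternative to the approach above, not what our script does) is to read columns by header name with csv.DictReader, so that reordering or adding columns doesn't break anything. The filename and XPath below are made-up examples; the header names follow the CSV layout described above:

```python
import csv
import io

# Simulate a small exported CSV; in practice you would pass a
# file object opened from disk instead of io.StringIO.
sample = io.StringIO(
    "Column 1,Column 2,Column 3,expression,normal\n"
    "box1.xml,/ead/archdesc/did/unitdate,May 1942-June 1943,"
    "May 1942-June 1943,1942-05/1943-06\n"
)

# DictReader maps each row to a dict keyed by the header row,
# so columns are looked up by name rather than position.
for row in csv.DictReader(sample):
    filename = row['Column 1']
    normal = row['normal']
    print(filename, normal)  # box1.xml 1942-05/1943-06
```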

This may all seem like a lot of effort just to change something like this:

To this:

But using a combination of our automation script, our script that extracts non-automatable dates to normalize in OpenRefine, and the script above to insert normalized dates into our EADs from a CSV, we have been able to normalize approximately 400,000 dates, giving us enhanced descriptive metadata and (hopefully!) improved opportunities for facilitating patron access going forward.

1 comment:

  1. Thank you all so much for posting about this. We at Maryland have been fussing about how to automate the normalization of our own crazy dates, and these scripts and info are a huge help. I actually got this to work for us on a decent-sized sample (4,000+ dates). Y'all are the best.
