Wednesday, June 8, 2016

Born-Digital Data: What Does It *Really* Look Like (Research Data Redux)

This is a follow-up to Jenny Mitcham's recent Research data - what does it *really* look like post, and in particular to these questions she posed:
I'd be interested to know whether for other collections of born digital data (not research data) a higher success rate would be expected? Is identification of 37% of files a particularly bad result or is it similar to what others have experienced?


Extracting technical metadata with the file profiling tool DROID has been part of our digital processing procedures for born-digital accessions since the beginning, so to speak, about 2012. Right before deposit into DeepBlue and our dark archive, a CSV export of DROID's output gets included in a metadata folder in our Archival Information Packages (AIPs). Kudos to Nancy Deromedi and Mike Shallcross for their foresight and for their insistence on standardizing our AIPs. It made my job today easy!

At first I was thinking that I'd write a Python script that would recursively "walk" the directories in our dark archive looking for files that began with "DROID_" (our standard prefix for these files) and ended with ".csv". That would have worked, but I'm a bit paranoid about pointing Python at anything super important (read: pointing my code at anything super important), and making a 1.97 TB working copy wasn't feasible. So, I took the easy way out...
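For what it's worth, the walk I had in mind would have looked something like this. It's a read-only sketch: it only collects paths and never touches the files themselves, which was the whole worry in the first place.

```python
import os

def find_droid_reports(root):
    """Recursively collect DROID CSV reports.

    Our convention names these files DROID_*.csv, so that's all we
    look for. Read-only: builds a list of paths, nothing more.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("DROID_") and name.lower().endswith(".csv"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```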

First, I did a simple search (DROID_*.csv) in Windows Explorer...

...made my working copy of individual DROID outputs (using TeraCopy!)...

...and wrote a short script to make one big (~215 MB) DROID output file.

These are not the DROIDs you're looking for.

Note that I had the script skip over Folders (because we're only interested in files here), packaged files like ZIPs (because DROID looks in [most of] these anyway), and any normalized versions of files, which I could identify because they get a "_bhl-[CRC-8]" suffix. Kudos again to Nancy and Mike for making this easy.
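The filtering logic of that script boils down to something like the sketch below. The column names (TYPE, NAME) assume DROID's default one-row-per-file CSV export, and the "_bhl-" pattern is our local convention for normalized copies, so adjust accordingly:

```python
import csv
import re

# Normalized copies get a "_bhl-<checksum>" suffix before the extension
# (our local convention), e.g. "report_bhl-1A.pdf".
NORMALIZED = re.compile(r"_bhl-[0-9A-Fa-f]+\.[^.]+$")

def keep_row(row):
    """Decide whether one DROID CSV row makes it into the combined file.

    Skips folders (TYPE == "Folder"), the package files themselves
    (TYPE == "Container" -- DROID already lists their contents), and
    our normalized derivatives.
    """
    if row.get("TYPE") != "File":
        return False
    if NORMALIZED.search(row.get("NAME", "")):
        return False
    return True

def combine_reports(paths, out_path):
    """Concatenate many DROID CSVs into one, applying keep_row()."""
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as src:
                for row in csv.DictReader(src):
                    if writer is None:
                        writer = csv.DictWriter(out, fieldnames=row.keys())
                        writer.writeheader()
                    if keep_row(row):
                        writer.writerow(row)
```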

All of the data (about 3/4 million individual files!) in this sample represents just about anything and everything born-digital that we've processed since 2012... basically anything related to our two collecting areas of the University of Michigan and the state. I'd guess that much of it is office documents and websites (and recently, some Twitter Archives). The vast majority of the data was last modified in the past 15 years, and our peaks are in 2006 and 2008. The distribution of dates is illustrated below...
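With everything in one big CSV, tallying last-modified years is a one-pass count. A sketch, assuming DROID's default LAST_MODIFIED column (which starts with an ISO-style YYYY-MM-DD date):

```python
from collections import Counter

def year_distribution(rows):
    """Tally last-modified years from DROID rows (as dicts).

    Rows with a blank or malformed LAST_MODIFIED value are skipped.
    """
    years = Counter()
    for row in rows:
        stamp = row.get("LAST_MODIFIED", "")
        if len(stamp) >= 4 and stamp[:4].isdigit():
            years[stamp[:4]] += 1
    return years
```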

Here are some of the findings of this exercise:

Summary Statistics

  • DROID reported that 731,949 individual files were present
  • 658,520 (89.9%) were given a file format identification by DROID
  • 657,808 (99.9%) of the identified files were given just one possible identification. Of the rest, 610 files were given two different identifications, 1 file was given three, 3 files were given five, 13 files were given six, 45 files were given seven, 28 files were given eight, and a further 12 were given nine. Across these multiple-identification cases, 331 files were identified by signature and 380 by extension.
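These summary statistics fall out of the FORMAT_COUNT and METHOD columns in DROID's default export. A sketch of the tally, where a file counts as identified when FORMAT_COUNT is greater than zero:

```python
from collections import Counter

def id_summary(rows):
    """Summarize DROID identification results.

    Assumes a one-row-per-file export with FORMAT_COUNT and METHOD
    columns (default names).
    """
    total = len(rows)
    identified = 0
    format_counts = Counter()   # how many files got 1, 2, 3... identifications
    methods = Counter()         # Signature / Extension / Container
    for row in rows:
        n = int(row.get("FORMAT_COUNT") or 0)
        if n > 0:
            identified += 1
            format_counts[n] += 1
            methods[row.get("METHOD", "")] += 1
    return {"total": total, "identified": identified,
            "format_counts": format_counts, "methods": methods}
```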

Files that Were Identified

  • Of the 658,520 files that were identified:
    • 580,310 (88.1%) were identified by signature (which, as Jenny suggests, is a fairly accurate identification)
    • 13,478 (2%) were identified by extension alone (which implies a less accurate identification). Lots of these were HTML and XML files, although there were some Microsoft Office files as well
    • 64,732 (9.8%) were identified by container. Like Jenny said, these were mostly Microsoft Office files, which are types of container files (and still suggests a high level of accuracy)
  • 180 different file formats were identified within the collection of born-digital data
  • Of the identified files, 152,626 (23.2%) were HTML files. This was by far the most common file format identified within the born-digital dataset. The top 10 identified formats are as follows:
    • Hypertext Markup Language - 152,626
    • JPEG File Interchange Format - 142,161
    • Extensible Hypertext Markup Language - 62,039
    • JP2 (JPEG 2000 part 1) - 56,986
    • Graphics Interchange Format - 48,317
    • Microsoft Word Document - 38,459
    • Exchangeable Image File Format (Compressed) - 18,826
    • Microsoft Word for Windows Document - 18,140
    • Acrobat PDF 1.4 - Portable Document Format - 17,840
    • Acrobat PDF 1.3 - Portable Document Format - 10,875
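The format ranking above is just a frequency count over the FORMAT_NAME column (again assuming the default export):

```python
from collections import Counter

def top_formats(rows, n=10):
    """Rank identified format names; rows with no FORMAT_NAME are skipped."""
    names = Counter(row["FORMAT_NAME"] for row in rows if row.get("FORMAT_NAME"))
    return names.most_common(n)
```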

Files that Weren't Identified

  • Of the 73,421 files that weren't identified by DROID, 851 different file extensions were represented
  • 1,888 (2.6%) of the unidentified files had no file extension at all
  • The most common file extensions for the files that were not identified are as follows:
    • emlx - 21,987
    • h - 8,545
    • cpp - 8,501
    • htm - 8,032
    • pdf - 5,216
    • png - 4,250
    • gif - 2,085
    • dat - 1,419
    • xml - 1,379
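The extension tally works the same way, restricted to rows with no PUID (i.e., no identification). A sketch, assuming the default PUID and EXT columns:

```python
from collections import Counter

def unidentified_extensions(rows):
    """Tally extensions of files DROID could not identify.

    Files with no extension at all are tallied under "(none)".
    """
    exts = Counter()
    for row in rows:
        if not row.get("PUID"):
            exts[row.get("EXT", "").lower() or "(none)"] += 1
    return exts
```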

Some Thoughts

  • Like Jenny, we do have a long tail of file formats, but perhaps not quite as long as the long tail of research data. I actually expected it to be longer (10.1% unidentified seems pretty good... I think?), since at times it feels like as a repository for born-digital archives we get everything and the kitchen sink from our donors (we don't, for example, require them to deposit certain types of formats), and because we are often working with older (relatively speaking) material.
  • We too had some pretty common extensions (many, in fact) that did not get identified (including the .dat files that Jenny reported on). Enough that I feel like I'm missing something here...
  • In thinking about how the community could continue to explore the problem, perhaps a good start would be defining what information is useful to report (I simply copied the format in Jenny's blog) and hearing from other institutions. It seems like it should be easy enough to anonymize and share this information.
  • What other questions should we be asking? I think Jenny's questions seem focused on her goal of feeding information back to PRONOM. That's a great goal, but I also think there are ways we can use this information to identify risks and issues in our collections, to ensure that our own or our patrons' technical environments support them, and to advocate in our own institutions for more resources.
And, if you haven't yet, be sure to check out the original post and subscribe to that Digital Archiving at the University of York blog! Also be sure to check out the University of York's and University of Hull's exciting, Jisc-funded work to enhance Archivematica to better handle research data management.

[1] I think Jenny's only interested in original files, but an interesting follow-up question might ask questions along the lines of what percentage of files we were able to normalize...