Tuesday, July 7, 2015

Git-Flow for Archival Workflows

We here at the Bentley Historical Library have been using GitHub for quite some time now. (Really, it's only been since May 19th of this year, so not even two months, but who's counting?) Since we have so much experience, we figured it was about time for a post on how we handle version control for our project to migrate all of our legacy EADs into ArchivesSpace using Git and GitHub (and no, they're not the same thing).

Git is not the same as GitHub. [1] Also, I just learned that "git" is English slang for "unpleasant person."

GitHub is not the same as Git. [2] It turns out that GitHub is not a center for unpleasant people.

The Problem: Version Control

The following transcript is adapted from an actual four-minute chat conversation I may or may not have had with a colleague (who may or may not be Dallas). I think it describes our frustrations better than a narrative description could.

**Disclaimer!**
Names have been changed to protect the innocent (and the guilty, i.e., me!). Also, I'm just back from a vacation where I spent some time at the beach, so ocean animals are on my mind.

Anonymous White-Spotted Puffer [3]
10:45 AM
so, it sounds like anonymous red lion fish's thing got added to real_masters_all.
10:46 AM
that's probably my fault. if there are any big mistakes anonymous red lionfish can just fix those, maybe using a backup
has anonymous great white shark replaced the ead masters yet?




Anonymous Atlantic Ghost Crab [4]
10:47 AM
ugh
umm, yeah i dunno






Anonymous White-Spotted Puffer
i didn't realize anonymous red lionfish had done it to real_masters_all
10:48 AM
because anonymous red lionfish was working form a copy anonymous red lionfish had made




Anonymous Atlantic Ghost Crab
anonymous great white shark has not replaced ead masters yet but anonymous goldband fusilier and i have probably made our own changes already
but maybe anonymous red lionfish could take a copy of just the things in a csv
10:49 AM
and we could fold those back into the real masters. hopefully there won't be too much that needs to be fixed.


Anonymous White-Spotted Puffer
yeah










The problem was that there were too many people trying to do too many things at once to the same version (or two, or three) of our EADs; the problem was version control!

Even though, as I mentioned, we had been using GitHub for quite some time to showcase and share our custom ArchivesSpace EAD Importer and the tools we've developed to clean or prep our legacy EAD and MARC XML for migration to ArchivesSpace, as well as to make changes to the Archivematica documentation (yes, I'm rather proud of this and this contribution--thanks again for showing us the ropes, Justin and Sarah!), we hadn't been using Git and GitHub the way they were intended to be used: to solve the problem of version control when working in teams whose members may or may not be working right next to each other every day (or in our case, even on the same computers every day).

After some discussion about the suitability of GitHub for this project (while we know a number of libraries and archives use GitHub for a variety of purposes, we're still not sure if there is any precedent for putting EADs on GitHub--maybe we're the first!), we decided to move forward with creating a "repo" for our working copy of the EADs. To fit in with the A-Team theme, we went with the name vandura, after the model of the GMC van used in the show.

We even figured out how to add a picture to our README file in Markdown:

Classy.

We decided to retain the "Real_Masters_all" directory name (because that is so different from "Real_Masters" and "FindingAids/EAD/Master"--all actual directory names!) for our EADs to serve as a reminder of those dark times, in the not too distant past, when things seemed simple, and when we just made changes to our version of record as we pleased, without thought to the hard work of our colleagues that we may or may not have been overwriting (because hey, we'll never know, and there would be no way to prove it anyway!).

Wait, I've Heard of GitHub...What's Git?


Before we go on...

If you're like me (an archivist, not a programmer!) you may or may not have known that Git and GitHub are actually two different things. Git is a distributed version control system (that is, it does not work like a shared network drive does--no copy of a project directory is any better or more 'authoritative' than any other, and team members collaborate on identical copies). GitHub is a web-based Git repository hosting service (which is why it is so popular with open source software like Archivematica and ArchivesSpace), which also offers its own features (like forks and pull requests). Git is a tool that you mostly use in the terminal on your local computer, while GitHub is a service that you mostly use with a graphical user interface on the Internet.

Why Use Git and/or GitHub?


So Git is a version control system, and GitHub is used in conjunction with it for work in teams. Why use them?


  • Git and GitHub are not just for software, or for people with l337 h4x0r s|<1llz. In fact, both of these work extremely well for anything that is primarily text, whether that is your EADs in XML, your catalog records in MARC, your website in HTML or even your blog written in Markdown.
  • All the cool kids are doing it. Whether it's companies like Artefactual Systems, Inc. (Archivematica) or Lyrasis (ArchivesSpace), or any of the institutions on this list, GitHub has become the place where open source software is shared with others.
  • It's better than regular old backups. With Git, you make what are called "commits" (more on that later) with meaningful messages (e.g., "correcting spelling mistakes" or "changing id attribute to authfilenumber"). You can then go back and look at all of your commits, remember why you made a particular change you made, and even revert back to a version of a project before a particular commit. All of that is much more useful when looking back on the work you've done than seeing a backup of your project made at an arbitrary time by a computer.
  • It is distributed. Everything is local. See comment above about difference between this process and using a shared network drive.
  • Interns have a place where they can point to the work they've done. With GitHub, since interns have their own accounts and since there is an online, public record of every change they have ever made, interns can point to a place online where they can showcase their work for potential employers.
  • You don't have to be at the Bentley or using any particular computer to do some work. That's handy.
  • Everything that happens gets recorded. Check this out. That's right, all 418 changes we've made in the 27 days we've used Git and GitHub for our EADs. It's like an audit trail. And you know we digital preservation types like our audit trails.
  • Management of the whole process is much easier. While there are many hands working on the same set of files, only a few hands get to accept and merge what are called "pull requests" (again, more on that later) into the Bentley's repository.
  • GitHub will tell you when you're going to overwrite someone else's work! That's probably my favorite benefit. While this doesn't make the process of figuring out what to do about conflicts any easier, at least we know about them!
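
To make the "better than regular old backups" point a little more concrete, here is a minimal sketch of looking back through commits and undoing one. The file name, its contents, and the identity settings are made up for illustration, and it all happens in a throwaway directory rather than in our actual repository.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical; nothing here touches real EADs).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Archivist"
git config user.email "archivist@example.org"

# First commit: an example file with an id attribute.
printf '<ead><persname id="example">Smith family</persname></ead>\n' > example.xml
git add example.xml
git commit -q -m "adding example ead"

# Second commit: a change with a meaningful message.
printf '<ead><persname authfilenumber="example">Smith family</persname></ead>\n' > example.xml
git add example.xml
git commit -q -m "changing id attribute to authfilenumber"

# Look back at what was done, and why.
git log --oneline

# Undo the second commit by recording a new commit that reverses it.
git revert --no-edit HEAD
git log --oneline
cat example.xml
```

Note that git revert does not erase anything: the original commit and the commit that undoes it both stay in the history, which suits the audit trail idea nicely.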


Convinced? I am.

And the How: How We're Using Git and GitHub for Curation Workflows


While we haven't even begun to scratch the surface of all the different operations you could do with Git and GitHub, here's the handful that we've found helpful so far, broken down into three stages: 1) the initial, project and daily setup; 2) the process for making changes; and 3) the process for merging those changes with the Bentley's version.

Say what you want about my handwriting, but I think that's a pretty good rendering of a laptop, if I do say so myself.

The Setup (with Git and GitHub)


While Git comes standard on many Linux operating systems, it doesn't on Windows or Mac. We're a Windows shop, so there was some setup involved.

Once Per Lifetime


If you haven't already, join GitHub. The instructions are here. If you're using Windows like us you'll also need to download and install the latest version of GitHub for Windows.

Once Per Project


Fork the vandura (or any other) repository to your account online. This basically means make a copy of the repository on your account. Note that "repo," which you'll hear people say sometimes, is short for "repository" and is just a fancy word for folder with files or other folders in it, or a project directory. On GitHub, you can do this by navigating to the repository you want to fork and clicking Fork in the top-right corner of the page.

Create a local clone of your fork on your computer. In other words, make a copy of the repository on your local computer. You can do this by navigating to your fork of the repository on GitHub and copying the HTTPS clone URL in the right sidebar to your clipboard. Then, open the Git Shell application and type:

git clone https://github.com/YOUR-USERNAME/vandura.git 

Next you'll need to configure a remote for your fork (so it knows where it came from). Move into the project directory by typing:

cd vandura

Then check to see what the current remote is by typing:

git remote -v

Specify a new upstream remote repository (pointing it at the Bentley's original repository, the one you forked) by typing:

git remote add upstream https://github.com/bentley-historical-library/vandura.git

Finally, verify the new upstream remote repository by typing:

git remote -v
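
If you'd like to rehearse the once-per-project steps without touching GitHub at all, here is a rough sketch that runs them in a throwaway directory. The big assumption: local "bare" repositories stand in for the two GitHub URLs (bentley.git for the Bentley's repository, fork.git for your fork), and the example file and identity are invented. With the real thing you would use the HTTPS URLs shown above instead.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; local paths stand in for the GitHub URLs (an assumption).
sandbox=$(mktemp -d)
cd "$sandbox"

# Build a small stand-in for the Bentley's repository (hypothetical contents).
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Archivist"
git config user.email "archivist@example.org"
mkdir Real_Masters_all
echo '<ead/>' > Real_Masters_all/example.xml
git add .
git commit -q -m "adding example ead"
cd ..
git clone -q --bare seed bentley.git       # plays the role of the Bentley's repo
git clone -q --bare bentley.git fork.git   # plays the role of clicking Fork

# The steps from the post, with local paths in place of the HTTPS URLs.
git clone -q "$sandbox/fork.git" vandura
cd vandura
git remote -v                              # only "origin" (your fork) so far
git remote add upstream "$sandbox/bentley.git"
git remote -v                              # now "origin" and "upstream"
```

After the last command you should see two remotes listed: origin, which points at your fork, and upstream, which points at the repository you forked from.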

Once (or Twice...) Per Shift (with Pictures!)


The rest of these instructions detail our day-to-day work, starting with syncing a local version of the files with the Bentley's master version. So here we go (with pictures--thanks, Devon! [5])...

It starts with syncing your fork, ensuring that what you have on your local computer matches what the Bentley has online (which may have been updated since you last sat down to do some work). After ensuring that you're in the appropriate directory, you do this by...

Using git fetch upstream to fetch new commits from the upstream repository.

Using git merge upstream/master to merge the changes from upstream/master into your local master branch.

Or, if changes were made to the upstream repository while you were making changes to your fork, you can apply those changes to your local version before applying your changes by...

Using git rebase upstream/master to "rebase," that is, to bring your local version up to date with the upstream repository and then replay your own commits on top of the upstream version before pushing your changes (I know, it's getting complicated).
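
Here is a sketch of both syncing situations played out in a throwaway sandbox. As before, local bare repositories stand in for GitHub, and the file names, the upstream_change helper, and the identity are all hypothetical. The first sync has no local work, so fetch and merge is enough. The second has a local commit made while the upstream moved on, so it uses rebase.

```shell
#!/bin/sh
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-in for the Bentley's repository, plus a fork and a local clone.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Colleague"
git config user.email "colleague@example.org"
echo '<ead>one</ead>' > a.xml
echo '<ead>two</ead>' > b.xml
git add .
git commit -q -m "adding two example eads"
cd ..
git clone -q --bare seed bentley.git
git clone -q --bare bentley.git fork.git
git clone -q fork.git vandura
cd vandura
git config user.name "Example Archivist"
git config user.email "archivist@example.org"
git remote add upstream "$sandbox/bentley.git"

# Hypothetical helper: a colleague's change to a.xml lands upstream.
upstream_change() {
    (
        cd "$sandbox/seed"
        echo "$1" > a.xml
        git commit -q -am "$2"
        git push -q "$sandbox/bentley.git" master
    )
}

# Start of shift: upstream has moved, we have no local work yet.
upstream_change '<ead>one, corrected</ead>' "correcting spelling mistakes"
git fetch -q upstream
git merge -q upstream/master

# Later: we commit locally while upstream moves again.
echo '<ead>two, with boxes</ead>' > b.xml
git commit -q -am "adding boxes to boxlist"
upstream_change '<ead>one, with authfilenumber</ead>' "changing id attribute to authfilenumber"
git fetch -q upstream
git rebase -q upstream/master

# Our commit now sits on top of everything from upstream.
git log --oneline
```

Because the colleague and the local user touched different files here, the rebase goes through without complaint. When both sides change the same lines, Git stops and asks you to sort out the conflict first.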

Making Changes (with Git)


Now it's time to make changes! This happens the same way you'd make any other change to a file on your local computer--by opening the XML editor of your choice, for example, and making a change, or running a Python program. Git only gets involved when you are ready to "snapshot" files and record these snapshots on your local machine in preparation for version control (and Git, by the way, only gets involved on your local machine). 

Note: For those with some experience with GitHub, you'll notice that we aren't using different branches (e.g., a development branch and a master branch). This is because we are already using a working copy of our EADs to make changes (not the master). No branch needed! Plus, this makes the process that much easier to teach to others.

Sometimes we make small changes (such as correcting spelling mistakes, or adding or deleting boxes from a boxlist, &c., all of which happen to a single XML file). After making changes to a single file, we snapshot that file by...

Using git add [filename] to snapshot a single file in preparation for versioning.

Sometimes we make big changes (for example, adding an Authority ID attribute to <persname> elements, which changed 1386 files and 11761 <persname> elements at once) to multiple files. You can snapshot these by...

Using git add . to snapshot all files in a directory that have changed since the last commit in preparation for versioning.

Then we get them ready for versioning by...

Using git commit -m "[meaningful message]" to record file snapshots permanently in your version history.

Note: These steps for making changes can be repeated ad nauseam. You make commits as often as you think you make a meaningful change (that you may want to go back to later). Also, those messages are important! "updates" is not nearly as helpful as "separated boxes for use with aeon".
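
Put together, the small-change and big-change routines look something like the sketch below. The three example files, the authfilenumber value, and the sed one-liners standing in for an XML editor or a Python program are all assumptions made for illustration, and everything runs in a throwaway directory.

```shell
#!/bin/sh
set -eu

# Throwaway repository with a few hypothetical EADs.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Archivist"
git config user.email "archivist@example.org"
mkdir Real_Masters_all
for name in smith jones lee; do
    printf '<ead><persname>%s family</persname></ead>\n' "$name" \
        > "Real_Masters_all/$name.xml"
done
git add .
git commit -q -m "adding example eads"

# Small change: fix one file, then snapshot just that file.
f=Real_Masters_all/smith.xml
sed 's/smith family/Smith family/' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
git add Real_Masters_all/smith.xml
git commit -q -m "correcting spelling mistakes"

# Big change: touch every file, then snapshot everything that changed.
for f in Real_Masters_all/*.xml; do
    sed 's/<persname>/<persname authfilenumber="example">/' "$f" > "$f.tmp" \
        && mv "$f.tmp" "$f"
done
git add .
git status --short            # shows what is about to be committed
git commit -q -m "adding authfilenumber attribute to persname elements"

git log --oneline
```

Running git status --short before committing is optional, but it is a cheap way to check that git add . picked up the files you expected and nothing else.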

The Finish (with GitHub)


Now it's time to get GitHub involved, both your associated personal account and our team or institutional account.

For a Team Member


Upload all local commits to your account on GitHub in order to be able to merge them with the Bentley's account by...

Using git push to "push" those commits to your online account.


Finally, merge your account's version with the Bentley's version online by...

Making a pull request using GitHub.
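
The pull request itself happens on the GitHub website, but the push that comes before it can be sketched locally. In this throwaway sandbox, local bare repositories once again stand in for your fork and the Bentley's repository (an assumption, as are the file name and identity), and the push names the remote and branch explicitly where the post just uses git push. The final command lists the commits that are on your fork but not yet upstream, which is roughly what a pull request would propose.

```shell
#!/bin/sh
set -eu

sandbox=$(mktemp -d)
cd "$sandbox"

# Stand-ins for the Bentley's repository, your fork, and your local clone.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Colleague"
git config user.email "colleague@example.org"
echo '<ead>boxes 1-3</ead>' > example.xml
git add .
git commit -q -m "adding example ead"
cd ..
git clone -q --bare seed bentley.git
git clone -q --bare bentley.git fork.git
git clone -q fork.git vandura
cd vandura
git config user.name "Example Archivist"
git config user.email "archivist@example.org"
git remote add upstream "$sandbox/bentley.git"

# A local commit, ready to share.
echo '<ead>box 1, box 2, box 3</ead>' > example.xml
git commit -q -am "separated boxes for use with aeon"

# Upload local commits to the fork ("origin").
git push -q origin master

# What would a pull request contain? Commits on the fork but not upstream.
git fetch -q upstream
git log --oneline upstream/master..origin/master
```

If that last command prints nothing, your fork has nothing new to offer and there is no pull request to make.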

For the Team


One of the administrators for the Bentley account will then get a notification that a pull request has been made (so called because Devon, for example, as an intern, does not have the ability to push to the main Bentley account, and instead requests that an administrator pull his changes). One of the administrators compares the changes that need to be made...

Comparing the changes that need to be made. This is incredibly helpful.

Based on that comparison, they either accept the changes or, if there is some sort of conflict, give him instructions (again, all online out in the open) to, for example, rebase to get the latest version of the EADs before making his pull request, and then accept...


Devon's changes have been merged with the Bentley's account. Notice that we're told that the latest change was Dallas merging Devon's pull request, and his meaningful commit message is shown next to the Real_Masters_all folder.

Kapow! Version controlled.

So Far, So Good


While there is a bit of a learning curve to using Git and GitHub (thanks again, Justin and Sarah, as well as Greg and Fiona, the Software Carpentry folks at HASTAC who taught Dallas and me Version Control with Git!) and teaching it to others, implementing a version control system has been great! We are now able to see every change that has been made. We know who did it and when (and, ideally, why!). We even know when we're about to overwrite someone else's changes. Life is good!

All that being said, we've experienced a few hiccups along the way and we're still working out our Git-flow. We'd love to hear what you're doing for version control or your experience with Git and/or GitHub. Let us know by leaving a comment or getting in touch via email or Twitter!

[1] "Git-logo" by Jason Long - http://git-scm.com/downloads/logos. Licensed under CC BY 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Git-logo.svg#/media/File:Git-logo.svg
[2] "GitHub logo 2013" by GitHub - https://github.com/logos. Licensed under Public Domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:GitHub_logo_2013.svg#/media/File:GitHub_logo_2013.svg
[3] "Puffer Fish DSC01257" by Brocken Inaglory - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Puffer_Fish_DSC01257.JPG#/media/File:Puffer_Fish_DSC01257.JPG
[4] "Ocypode quadrata (Martinique)" by Free On Line Photos. Licensed under No restrictions via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Ocypode_quadrata_(Martinique).jpg#/media/File:Ocypode_quadrata_(Martinique).jpg
[5] Since these screenshots were done as Devon worked, they sometimes get a bit out of order...
