I learned how to catalogue with MARC21 right at the beginning of my career when I was a graduate trainee librarian but haven’t had to do this for many years! Recently Health Education England (HEE) purchased a big bundle of 164 Oxford University Press (OUP) e-books for all NHS England staff and I needed to get these onto our library catalogue. The catalogue records were made available to download in MARC21 format but unfortunately the library management system I use, Heritage, does not accept this format. I didn’t want to manually catalogue 164 e-books, so I needed to translate MARC21 records into a Heritage-friendly CSV file and strip out all the weird MARC encoding… which OpenRefine was perfect for.
Convert MARC21 into Tab delimited OpenRefine readable format
I downloaded MarcEdit (free software) and followed the instructions on this website https://guides.uflib.ufl.edu/c.php?g=147853&p=6878235 to convert it into a tab delimited file. Use this file that I’ve already created to save you time. WordPress only allows me to upload XLS files and not CSVX so I’ve had to convert it into XLS, I’m sorry… Email me for the proper CSVX.
Work out what the end product needs to look like
I exported a spreadsheet of my existing e-book holdings and read up on the documentation to see what format my data needed to be in. Here’s the finished product for us to refer to so we know what we’re aiming for. WordPress only allows me to upload XLS files and not CSVX so I’ve had to convert it into XLS, I’m sorry… Email me if you need the proper CSVX.
Start transforming the data – one column at a time.
020$a – Get rid of the (ebook) : text – do a simple find and replace. To do this, EDIT CELLS > TRANSFORM > value.replace(‘(ebook) :’,’’)
This bit of code is essentially the same as ‘Find and replace’ in Microsoft Word/Excel. The first part of the bit in parantheses is what is being found, and the second part is what it is being replaced with. value.replace(‘TEXT BEING FOUND’,‘TEXT USED TO REPLACE IT’) – more information is available here https://guides.library.illinois.edu/openrefine/grel
245$a – Needed to get rid of the punctuation at the end of the titles. For example Oxford handbook of emergency medicine. should be Oxford handbook of emergency medicine
and Oxford handbook of renal nursing / should be Oxford handbook of renal nursing
EDIT CELLS > TRANSFORM > Use this bit of code to remove punctuation at the end of cells: replace(value, /[.]$/, ”) …I got this from https://groups.google.com/g/openrefine/c/9LB9OT0q2rY
Do the same transform process but this time replace the [.] in the code with [/] to address the issue with titles like the Oxford handbook of renal nursing.
When done, then do EDIT CELLS > COMMON TRANSFORMS > Trim leading and trailing whitespace to get rid of any extra spaces at the end of cells.
250$a – This is where OpenRefine really comes into its own. You can use the FACET > TEXT FACET option to see all the different ways that the edition info is given and edit them to be more standard. First, use the built-in similarity detector (CLUSTER) to group some together, found on the left hand panel. The ones that OpenRefine thinks are similar can be grouped together and renamed en masse. Click on Merge Selected & Close when you’re done.
For the remaining ones, you can click on the little ‘edit’ button next to them and change them so that they are standard. When finished, click on the little ‘X’ in the top left next to 250$a to close the facet.
260$a and 264$a are both equivalent to the PLACE column required by Heritage. We can just delete one of the columns by clicking on EDIT COLUMN > REMOVE THIS COLUMN
Then I used the TEXT FACET option to mass edit the cells so they all said Oxford – luckily all 136 books have the same place of publication and publisher!
260$b and 264$b are also both the same, they equate to the PUBLISHERcolumn. So you can do pretty much the same processes as above – delete one of the columns and change the text to Oxford University Press for the rest.
260$c and 264$c Unfortunately the dates are in two different columns… and we can’t just delete one of them. So it’s time to MERGE COLUMNS. I used this info: https://guides.library.illinois.edu/openrefine/combining
I then had to do a bit more tidying – some of the years also had a random full stop at the end for no reason, and a c at the beginning, e.g. c2011. You can fix this using the EDIT CELLS > TRANSFORM > value.replace(‘TEXT BEING FOUND’,‘TEXT USED TO REPLACE IT’) method we used earlier.
520$a – finally a column that is just fine and doesn’t need any changes – it will just go in the NOTESCSVX column with the string (CON) at the beginning as a ‘contents’ note.
650 – This one is quite simple. There’s a lot of annoying MARC21 punctuation such as \0$a $v $x that can be gotten rid of with the same find and replace value.replace method as above. To make it readable by Heritage we need to replace the $v and $x with ‘ – ‘, and just delete the \0$a. Note that to remove the \ symbol you need to do value.replace(‘\\’,’’)
700$a – What I did here was split this column into several columns using the ; separator so that I could tidy up the data on each individual author. EDIT COLUMN > SPLIT INTO SEVERAL COLUMNS
I thought that whether the authors were identified as ‘editors’ or ‘authors’ was not very important to the catalogue record so I just left this information out altogether. I used the code for removing specific punctuation from the end of cells several times – replace(value, /[.]$/, ”) and replace(value, /[,]$/, ”)
When I’d tidied up the individual columns I then merged them back together using EDIT COLUMN > JOIN COLUMNS. You need to use the separator ;
Those are the main bits of information that required transformation. I exported as a CSV (comma separated values) file and pasted information into my Heritage file under the right sections.
I also did quite a lot of work to turn the dois and URLs provided in the MARC21 file into WAYFless URLs but will describe this separately.