Creating Copac: Matching diverse data

Creating a database with many contributing libraries is an interesting process. The way in which different libraries catalogue their collections can vary significantly, depending, for example, on the nature of the library and the catalogue system they use. In addition, there are record variations within individual library catalogues, reflecting changes over time in cataloguing standards, local systems, and individual cataloguing styles. In an ideal world we want one record in Copac for each document, with details of all the libraries that hold a copy of that document. In practice this can be challenging…

We’ve previously talked about our overall approach to deduplication, but the team that undertook the recent White Rose collection analysis project expressed a need to know more about how record matching works in Copac. So we’ve written a summary document setting out the basic match process that all records go through as they are added to Copac.

In essence…
Incoming records go through an initial check to identify whether they might be duplicates of records already on Copac. If a match is found a ‘potential duplicate pair’ is created.

Potential duplicate pairs of records then go through a detailed match process to confirm whether they are genuine duplicates. Incoming records may form match pairs with multiple Copac records and each match pair is tested in turn.

  • If a pair of potentially duplicate records fails the detailed match tests, the new incoming record is added to Copac as a single, unconsolidated record.
  • If a pair of potentially duplicate records passes the detailed match tests, the records are merged to form a consolidated record. If an incoming record has multiple match pairs that pass the detailed match, all the records will be brought together in a single consolidated record; a record will never appear in more than one consolidation.
  • The incoming record may match with an existing Copac record that is itself already part of a larger set of records, so the new record will be merged into that larger consolidation. It is not necessary that each record in a consolidation matches every other record in that consolidation.
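
The consolidation behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the Copac implementation: the record fields, the `initial_check` and `detailed_match` functions, and the `Consolidations` class are all hypothetical stand-ins, and the real match tests are those set out in the summary document. The sketch shows how a new record that matches just one member of an existing consolidation still joins the whole set.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Matcher = Callable[[Record, Record], bool]


class Consolidations:
    """Hypothetical tracker of merged records, built on a simple union-find."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}

    def _find(self, rec_id: str) -> str:
        # Follow parent links to the representative of this consolidation.
        while self._parent[rec_id] != rec_id:
            self._parent[rec_id] = self._parent[self._parent[rec_id]]
            rec_id = self._parent[rec_id]
        return rec_id

    def add(self, rec_id: str, matched_ids: Iterable[str]) -> None:
        # A new record starts on its own, then joins every consolidation
        # that contains a confirmed match, so it ends up in exactly one set.
        self._parent[rec_id] = rec_id
        for other in matched_ids:
            self._parent[self._find(other)] = self._find(rec_id)

    def groups(self) -> List[List[str]]:
        out: Dict[str, List[str]] = {}
        for rec_id in self._parent:
            out.setdefault(self._find(rec_id), []).append(rec_id)
        return sorted(sorted(group) for group in out.values())


def add_record(database: Dict[str, Record], consolidations: Consolidations,
               incoming: Record, initial_check: Matcher,
               detailed_match: Matcher) -> None:
    # Step 1: the initial check creates potential duplicate pairs.
    candidates = [r for r in database.values() if initial_check(incoming, r)]
    # Step 2: each pair is tested in turn by the detailed match.
    confirmed = [r["id"] for r in candidates if detailed_match(incoming, r)]
    # Step 3: merge with confirmed matches, or stay unconsolidated.
    database[incoming["id"]] = incoming
    consolidations.add(incoming["id"], confirmed)


# Assumed, deliberately crude match rules for the example only.
def same_title(a: Record, b: Record) -> bool:
    return a["title"] == b["title"]


def shares_isbn_or_imprint(a: Record, b: Record) -> bool:
    same_isbn = a["isbn"] != "" and a["isbn"] == b["isbn"]
    same_imprint = a["imprint"] != "" and a["imprint"] == b["imprint"]
    return same_isbn or same_imprint


database: Dict[str, Record] = {}
consolidations = Consolidations()
incoming_records = [
    {"id": "a", "title": "hamlet", "isbn": "111", "imprint": "Penguin 1980"},
    {"id": "b", "title": "hamlet", "isbn": "", "imprint": "Penguin 1980"},
    {"id": "c", "title": "hamlet", "isbn": "111", "imprint": ""},
    {"id": "d", "title": "hamlet", "isbn": "222", "imprint": "Arden 2006"},
    {"id": "e", "title": "macbeth", "isbn": "333", "imprint": "Arden 2015"},
]
for record in incoming_records:
    add_record(database, consolidations, record,
               same_title, shares_isbn_or_imprint)

result = consolidations.groups()
print(result)
```

In this made-up data, record c passes the detailed match with a but not with b, yet all three end up in one consolidation because a and b were already merged. Record d forms potential duplicate pairs that all fail the detailed match, and record e forms no pairs at all, so both remain single records.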

Each consolidated record can be ‘expanded’ in the Copac result display, so you can see all the originally supplied records that have been brought together to form that consolidation. This means all the information in a record from a specific library can be seen in context – particularly important for early materials.

Data deduplication is a fluid process, and Copac records change daily in response to additions and deletions supplied by our contributing libraries. Similarly, the match process itself evolves over time as the data changes and new matching problems emerge. However, we need to be careful that in trying to improve our matching of some records we don’t create mistaken matches for others, which would result in records being incorrectly merged.

The Copac Record match summary document provides a more detailed overview of the Copac record match process: Copac Record Match and Deduplication Procedure 1017

If you have any questions about this – or any other aspect of the Copac service – you can get in touch via our helpdesk: help.copac@jisc.ac.uk