Re-structuring the database

We are thinking of changing the database we use to run Copac. The current database software is very good at what it does, which is free-text searching, but it is proving problematic in other areas. For instance, it doesn’t know about Unicode or XML, which was acceptable some years ago when 7-bit ASCII was the norm, record displays were simple and there was less interest in inter-operating with other people and services. We have managed to shoehorn Unicode and XML into the database, though they don’t sit there easily and some pre- and/or post-processing is needed on the records.

The current database software also doesn’t cope well with the number and size of records we are throwing at it. For instance, the limit on record size is too small, and the number of records we hold means the database has to be structured in a way that makes updating slower than we would like. We’d also like something with faster searching.

We haven’t decided what the replacement software is going to be, though we have been thinking about how a new Copac database might be structured…


De-duplication

Some people think we do too much de-duplication of our records; others think we do too little. So, we are thinking of having two levels of de-duplication: one at the FRBR work level and another broadly based on edition and format. The two levels would be linked in a 1-to-n relationship. That is, a FRBR-level record would link to several edition-level records, while an edition-level record would link back to one FRBR-level record and also to the other edition-level records which link to the same FRBR record. This would result in a three-level hierarchy with the individual library records at the bottom. How this would translate into a user interface is yet to be decided.
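As a rough illustration, the three-level hierarchy could be modelled along these lines (a minimal Python sketch; the class and field names are my own invention, not an actual Copac schema):

```python
from dataclasses import dataclass, field

@dataclass
class LibraryRecord:
    """Bottom level: one record from one contributing library."""
    record_id: str
    library: str

@dataclass
class EditionRecord:
    """Middle level: records de-duplicated by edition and format."""
    edition_id: str
    work_id: str                                   # link back to one FRBR work
    members: list = field(default_factory=list)    # the library records merged here

@dataclass
class WorkRecord:
    """Top level: the FRBR work."""
    work_id: str
    editions: list = field(default_factory=list)   # 1-to-n links to editions

# One work linking to two editions, each holding library records:
work = WorkRecord("w1", editions=["e1", "e2"])
ed1 = EditionRecord("e1", work_id="w1", members=[LibraryRecord("r1", "Cambridge")])
ed2 = EditionRecord("e2", work_id="w1",
                    members=[LibraryRecord("r2", "Oxford"),
                             LibraryRecord("r3", "Edinburgh")])

# An edition reaches its sibling editions via the shared work record:
siblings = [e for e in work.editions if e != ed1.edition_id]
```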

Holdings statements

We currently store local holdings information in with the main bibliographic record; doing otherwise in a non-relational database would have been troublesome. The plan is to keep the holdings out of the bibliographic records and only pull them in when they are needed.
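Keeping holdings in a separate store keyed on the record identifier, and joining them in only at display time, might look something like this (a toy sketch; the store names and fields are illustrative, not the actual Copac data model):

```python
# Bibliographic data and local holdings kept in separate stores,
# both keyed on the contributing library's record id.
bib_store = {
    "r1": {"title": "Ulysses", "author": "Joyce, James"},
}
holdings_store = {
    "r1": [{"library": "Cambridge", "shelfmark": "XYZ.123"}],
}

def display_record(record_id):
    """Pull the holdings in only when the record is actually shown."""
    record = dict(bib_store[record_id])            # copy the bib data
    record["holdings"] = holdings_store.get(record_id, [])
    return record
```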


This should enable us to reduce the burden of the vast number of updates we have to perform. For instance, we sometimes receive updates from our contributing libraries of over 100,000 records, and updates of over a quarter of a million records are not unknown. Our larger contributors send updates of around twenty thousand records on a weekly basis. We now have over 50 contributing libraries, and that adds up to a lot of records every week that we need to push through the system.

Unfortunately for us, many of these updated records probably contain changes only to local data and none to the bibliographic data. However, the current system means we have to delete each record from the database and then add it back in. If a record was part of a de-duplicated set, that delete and add results in the de-duplicated record being rebuilt twice for probably no overall change to the bibliographic details.

So, the plan for a new system is that when a library updates a record, we will immediately update our copy that stores the local data and mark for update the FRBR-level and edition-level records it is part of. The updating of these de-duplicated record sets will be done off-line, or during the small hours when the systems are less busy. If we can determine that an updated record had no changes to the bibliographic data, then there would be no need to update the de-duplicated sets at all.
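One cheap way to make the “no bibliographic change” test work is to fingerprint just the bibliographic fields (ignoring local data) and compare hashes, queuing a rebuild only when the fingerprint changes. A sketch under that assumption (all function and field names here are hypothetical):

```python
import hashlib

def bib_fingerprint(record):
    """Hash only the bibliographic fields, ignoring local/holdings data."""
    bib_fields = {k: v for k, v in sorted(record.items()) if k != "holdings"}
    return hashlib.sha256(repr(bib_fields).encode("utf-8")).hexdigest()

def apply_update(stored, incoming, dirty_queue):
    """Refresh the local copy immediately; defer de-dup rebuilds
    unless the bibliographic data actually changed."""
    changed = bib_fingerprint(stored) != bib_fingerprint(incoming)
    stored.update(incoming)                 # local data is always updated
    if changed:
        # mark the parent edition/work records for off-line rebuilding
        dirty_queue.append(stored["record_id"])
    return changed

queue = []
old = {"record_id": "r1", "title": "Ulysses", "holdings": ["old shelfmark"]}
new = {"record_id": "r1", "title": "Ulysses", "holdings": ["new shelfmark"]}
apply_update(old, new, queue)   # holdings-only change: nothing queued
```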

What now?

We think we know how we are going to do all the above and our next step is to produce a mock-up we can use to test our ideas…

6 thoughts on “Re-structuring the database”

  1. Clearly restructuring the database should provide a better environment for handling holdings data as well as ease some of the pain associated with updates. There’s a limited amount contributors can do to help with the latter, of course – you can already tell from MARC 005s whether a bib or holdings record has been updated since you last received it from us, but in the current database model you have no choice but to process both anyway. Anything that reduces that wasted “effort” is clearly a good thing.
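For what it’s worth, the 005-based check could be very simple: MARC field 005 holds the date and time of the latest transaction as yyyymmddhhmmss.f, so plain string comparison orders correctly. A sketch:

```python
def needs_processing(incoming_005, last_seen_005):
    """Return True if the record has changed since we last received it.

    MARC 005 ('Date and Time of Latest Transaction') is formatted
    yyyymmddhhmmss.f, so lexicographic comparison matches chronological order.
    """
    if last_seen_005 is None:          # never seen this record before
        return True
    return incoming_005 > last_seen_005
```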

    I’m slightly less clear re deduplication. I don’t think having a work level and an expression/manifestation level – interesting as that is in itself – quite addresses what I perceive to be the gist of the comments you refer to at the beginning of that para. The real issue here seems to be about the matching algorithm(s) – whether this is the same as that. By associating all these expression/manifestation records not only with each other but with a “work” you are adding to the deduplication. The fact that the patient researcher will eventually be able to determine that some of these records are inappropriately linked may be a bonus (if there are any patient researchers left out there…), but it means there’s likely to be more such knots to untangle in the first place. A tricky one which I don’t envy you trying to tackle!

  2. In response to Hugh Taylor re. the deduplication issue.

    The current deduplication process is actually already a two-part match process so the move to having both work level and manifestation level records won’t actually add to the deduplication burden.

    With the deduplication process it would be too slow to match every incoming record with the entire database, so we do an initial quick and dirty match to create a set of potential duplicate records. The incoming record is then matched against the potential duplicates using a second, much more detailed, duplicate checking process to confirm or reject the initial match.

    We intend to refine the initial match procedure to bring together records for all the different manifestations of a work. This work-level record set will then act as the set of potential duplicates for the second match stage, where we merge just those records for a specific manifestation of the work, e.g. the records relating to one particular edition.
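The two-stage process might be sketched as a cheap “blocking” key that gathers candidate duplicates per work, followed by a finer key that merges only records for one edition (the field choices below are illustrative, not the actual Copac matching algorithm):

```python
from collections import defaultdict

def work_key(record):
    """Stage 1: quick-and-dirty key bringing together all
    manifestations of a work as a candidate duplicate set."""
    return (record["title"].lower().strip(), record["author"].lower().strip())

def edition_key(record):
    """Stage 2: a much more detailed match within the candidate set,
    confirming records for one specific edition and format."""
    return (record["edition"], record["format"])

def deduplicate(records):
    works = defaultdict(list)
    for r in records:
        works[work_key(r)].append(r)             # candidate sets per work
    merged_sets = []
    for candidates in works.values():
        editions = defaultdict(list)
        for r in candidates:
            editions[edition_key(r)].append(r)   # merge per edition within a work
        merged_sets.extend(editions.values())
    return merged_sets
```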

    So the match procedures will need to be changed but the overall work involved in matching should be much the same as now.

    However, I think Hugh is right in suggesting that work level record matching may result in a higher level of misconsolidation of records. The broader the match, the more likely it is that we will bring records together inappropriately. There are still some ‘patient researchers’ out there who get in touch to report errors in the records, something we are planning to make easier to do. And we will put in place ways of dealing with reported consolidation errors to stop them happening again. But as Hugh says – ‘A tricky one’ – and the diversity of the records is always going to make it difficult to resolve.

  3. Hi Baptiste, thanks for your comment and yes, we are hoping to include relevance ranking in a new system. It will be interesting to see how well relevance ranking will work on our records. Some of our records are really very minimal, having little more than an author and a title (and sometimes not even an author), while others have extensive tables of contents, notes and subject information.

    The de-duplication should help as it brings together full and minimal records. Users will find the minimal records by virtue of us being able to associate them with the metadata from the fuller records. So a poorer record may be pulled up the rankings because it is part of a de-duplicated group of records.

    I think it unlikely we can do much with our really poor records (those with just a title and little else). It may not be possible to allocate them to a FRBR work-level record and they may just have to sit alone and unloved in the database. 🙂

  4. Pingback: Handling XML errors at Copac Developments

  5. Pingback: Getting to know the Copac libraries at Copac Developments