Educating our systems

JIBS workshop 13/11/08

I attended the JIBS workshop in London on ‘How to compete with Google: simple resource discovery systems for librarians’ with two agendas: one of a Copac team member, interested to see what libraries are doing that could be relevant to Copac; and the other of having recently completed some research on federated search engines, and being anxious to keep up-to-date with the developments.

The day consisted of seven presentations, and concluded with the panel taking discussion questions. Four of the presentations focussed on specific implementations: of Primo at UEA; of Encore at the University of Glasgow; of ELIN at the universities of Portsmouth and Bath; and of Aquabrowser at the University of Edinburgh. Some interesting themes ran through all of these presentations. One was that of increased web 2.0 functionality – library users expect the same level of functionality from library resource discovery systems as they find elsewhere on the internet. With this in mind, libraries have been choosing systems that allow personalisation in various forms. Some systems allow users to save results and favourite resources, and to choose whether to make these public or keep them private.

Another popular feature is tag clouds. These give users a visual method of exploring subjects, and expanding or refining their search. Some systems (such as Encore) allow the adding of ‘community’ tags. This allows users to tag resources as they please, and not rely on cataloguer-added tags. While community tags expand the resource-discovery possibilities and add some good web 2.0 user interaction, concerns have been raised about their quality. Glasgow are putting a system in place to filter the most common swearwords, and hopefully ward off deliberate vandalism, but there is still a worry that user-added tags might not achieve the critical mass needed to become a significant asset in resource discovery. As we at Copac are looking into the possibility of adding tags to Copac records, we will be interested in seeing how this resolves.

The addition of book covers and tables-of-contents to records seems to be a desirable feature for many libraries – and it is nice that Copac is ahead of the pack in this regard! Informal comments throughout the day showed that people are very enthusiastic about the recent developments at Copac, and enjoy the new look.

It was also very interesting to see that some libraries are introducing (limited) FRBRisation for the handling and display of results. UEA, for instance, are grouping multiple editions of the same work together on their Primo interface. This means that a search for ‘Middlemarch’ returns 31 results, the first of which contains 19 versions of the same item. These include 18 different editions of Middlemarch in book form, and one video. While the system is not yet perfect (‘Middlemarch: a study of provincial life’ is not yet recognised as the same work), it is very encouraging to see FRBRised results working in practical situations. Introducing RDA and the principles of FRBR and FRAD at Copac is going to be an interesting challenge, as we will be receiving records produced to both RDA and AACR2 standards for a while. Copac, with its de-duplication system, already performs some aspects of FRBR, as the same work at multiple libraries is grouped as one record.

There were also two presentations dealing with information-seeking behaviour, by Maggie Fieldhouse from UCL and Mark Hepworth from Loughborough. Mark highlighted the need – echoed in later presentations – for users to be given the choice about how much control they had over their search. This was part of ‘training the system’ rather than ‘training the user’. Copac tries to be an ‘educated system’: we provide a variety of search options (from simple to very advanced) through a variety of different interfaces (including browser plug-ins and a Facebook widget), and we hope that this contributes to our users’ search successes. As part of this, we are going to be undertaking some usability studies, which we hope will make Copac even better trained.

A very enjoyable and informative day which has given me plenty to think about – and nice new library catalogues to play with!

All the presentations from the JIBS event are available for download:
http://www.jibs.ac.uk/events/workshops/simplerds/

Re-structuring the database

We are thinking of changing the database we use to run Copac. The current database software we use is very good at what it does, which is free text searching, but it is proving problematical in other areas. For instance, it doesn’t know about Unicode or XML, which was okay some years ago when 7-bit ASCII was the norm, record displays were simple and there was less interest in inter-operating with other people and services. We have managed to shoehorn Unicode and XML into the database, though it doesn’t sit there easily and some pre- and/or post-processing is needed on the records.

The current database software doesn’t cope well with the number and size of records we are throwing at it. For instance, the limit on record size is too small, and the number of records we have means the database has to be structured in a way that makes updating slower than we would like. We’d also like something with faster searching.

We haven’t decided what the replacement software is going to be, though we have been thinking about how a new Copac database might be structured…

De-duplication

Some people think we do too much de-duplication of our records, others think we do too little. So, we are thinking of having two levels of de-duplication: one at the FRBR work level and another broadly based on edition and format. The two levels would be linked in a 1 to n relationship, i.e. a FRBR level record would link to several edition level records. An edition level record would link back to one FRBR level record and also to the other edition level records which link to the same FRBR record. This would result in a three level hierarchy with the individual library records at the bottom. How this would translate into a user interface is yet to be decided.
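To make the three levels a little more concrete, here is a minimal Python sketch of how the links might hang together. It is only an illustration: the class names (WorkRecord, EditionRecord, LibraryRecord), the field names and the sample identifiers are all made up for this example, and nothing here reflects a decision about the real database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LibraryRecord:
    """Bottom level: one contributing library's own record (hypothetical fields)."""
    record_id: str
    library: str
    title: str


@dataclass
class EditionRecord:
    """Middle level: library records de-duplicated broadly by edition and format."""
    edition_id: str
    # Link back to exactly one work; excluded from repr/compare to avoid cycles.
    work: "WorkRecord" = field(repr=False, compare=False)
    library_records: List[LibraryRecord] = field(default_factory=list)

    def siblings(self) -> List["EditionRecord"]:
        """Other edition level records that link to the same work."""
        return [e for e in self.work.editions if e is not self]


@dataclass
class WorkRecord:
    """Top level: FRBR work level record linking to n edition level records."""
    work_id: str
    editions: List[EditionRecord] = field(default_factory=list)

    def add_edition(self, edition_id: str) -> EditionRecord:
        edition = EditionRecord(edition_id=edition_id, work=self)
        self.editions.append(edition)
        return edition


# Illustrative data only: one work with a book edition and a video.
work = WorkRecord("w1")
book = work.add_edition("e1")
video = work.add_edition("e2")
book.library_records.append(LibraryRecord("r1", "Library A", "Middlemarch"))
book.library_records.append(LibraryRecord("r2", "Library B", "Middlemarch"))
```

From any edition level record you can reach its work via book.work and the other editions of that work via book.siblings(), which is the 1 to n linking described above.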

Holdings statements

We currently store local holdings information in with the main bibliographic record. Doing otherwise in a non-relational database would have been troublesome. The plan is to keep the holdings out of the bibliographic records and only pull them in when they are needed.
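As a rough sketch of what keeping the holdings apart could look like, the Python below uses two plain dictionaries as stand-ins for two separate stores. The store names, function names, field names and sample values are all assumptions made for the sake of the example; the real storage has not been chosen.

```python
from typing import Dict, List

# Hypothetical stand-ins for two separate stores, keyed by the same record id.
bib_records: Dict[str, dict] = {
    "b1": {"title": "Middlemarch", "author": "Eliot, George"},
}
holdings: Dict[str, List[dict]] = {
    "b1": [
        {"library": "Library A", "shelfmark": "example-1"},
        {"library": "Library B", "shelfmark": "example-2"},
    ],
}


def brief_display(bib_id: str) -> dict:
    """Result-list view: bibliographic data only, no holdings lookup."""
    return dict(bib_records[bib_id])


def full_display(bib_id: str) -> dict:
    """Full view: holdings are pulled in only at the point they are needed."""
    record = dict(bib_records[bib_id])
    record["holdings"] = holdings.get(bib_id, [])
    return record
```

The point of the split is that a change to a shelfmark touches only the holdings store, and the bibliographic record is left alone.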

Updating

This should enable us to reduce the burden of the vast number of updates we have to perform. For instance, we sometimes receive updates from our contributing libraries of over 100,000 records, and updates of over a quarter of a million records are not unknown. Our larger contributors send updates of around twenty thousand records on a weekly basis. We now have over 50 contributing libraries and that adds up to a lot of records every week that we need to push through the system.

Unfortunately for us, many of these updated records probably only have changes to local data and no changes to the bibliographic data. However, the current system means we have to delete each updated record from the database and then add it back in. If a record was part of a de-duplicated set then that delete and add results in the de-duplicated record being rebuilt twice for probably no overall change to the bibliographic details.

So, the plan for a new system is that when a library updates a record we will immediately update our copy that stores the local data and mark for update the FRBR level and edition level records it is a part of. The updating of these de-duplicated record sets will be done off-line or during the small hours when the systems are less busy. If we can determine that an updated record had no changes to the bibliographic data then there would be no need to update the de-duplicated sets at all.
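The update flow described above could be sketched along the following lines. This is a minimal illustration under stated assumptions, not our implementation: detecting "no bibliographic change" by comparing a SHA-256 fingerprint of the bibliographic fields is just one possible approach, and every name here (apply_update, rebuild_dirty_sets, the dictionaries and the set identifiers) is hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, Set

stored_bib_hash: Dict[str, str] = {}      # record id -> fingerprint of bib data
local_data: Dict[str, dict] = {}          # record id -> local (holdings) data
record_to_sets: Dict[str, Set[str]] = {}  # record id -> de-duplicated sets it is in
dirty_sets: Set[str] = set()              # sets waiting for an off-line rebuild


def bib_fingerprint(bib: dict) -> str:
    """Stable fingerprint of the bibliographic fields (an assumed approach)."""
    canonical = json.dumps(bib, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_update(record_id: str, bib: dict, local: dict) -> bool:
    """Store local data at once; mark sets dirty only if the bib data changed."""
    local_data[record_id] = local
    new_hash = bib_fingerprint(bib)
    if stored_bib_hash.get(record_id) == new_hash:
        return False  # local-only change: de-duplicated sets are untouched
    stored_bib_hash[record_id] = new_hash
    dirty_sets.update(record_to_sets.get(record_id, set()))
    return True


def rebuild_dirty_sets(rebuild: Callable[[str], None]) -> int:
    """Off-line pass: rebuild each marked set once, however often it was marked."""
    count = 0
    while dirty_sets:
        rebuild(dirty_sets.pop())
        count += 1
    return count
```

Because the marked sets are held in a set, a de-duplicated record that is touched by several updates in one batch would still only be rebuilt once in the off-line pass, rather than twice per updated record as happens with delete and add.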

What now?

We think we know how we are going to do all the above and our next step is to produce a mock-up we can use to test our ideas…