Issues searching other library catalogues

Some of you may have noticed that there is now a facility on the Copac search forms to search your local library catalogue as well as Copac. You’ll only see this option if you have logged into Copac and are from a supported library.

The searching of the local library catalogues and Copac is performed using the Z39.50 search protocol. Due to differences in local configurations, the queries we send to Copac and to the various library catalogues have to be constructed very differently.

When we built the Copac Z39.50 server, we tried to make it flexible in the type of query it would accept within the limitations imposed upon us by the database software we use. Our database software was made for keyword searching of full text resources. As such it is good at adjacency searches, but you can’t tell it you want to search for a word at the start of a field.

Systems built on relational databases tend to have the opposite strengths. They often aren’t good at keyword searching, but find it very easy to match words at the start of a field.

The result is that our default search is a keyword search, while some other systems default to matching query terms at the start of a field. Hence, if we send exactly the same search to Copac and to a library catalogue, we can get very different results from the two systems. To get a consistent result we have to tweak the query sent to the library so that it performs a search as near as possible to that performed by Copac. Working out how to tweak (or transform, or mangle) the queries is a black art and we are still experimenting.
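
To give a flavour of the sort of tweaking involved, here is a minimal sketch of per-target query configuration. The attribute numbers follow the standard Z39.50 Bib-1 attribute set (3 = position, 5 = truncation), but the target names and the particular settings are invented for illustration — real targets each need their own experimentally discovered profile.

```python
# Hypothetical per-target settings: Copac defaults to keyword matching
# (any position in field), while some OPACs match from the start of the
# field unless told otherwise.
TARGET_PROFILES = {
    "copac":     {"position": 3, "truncation": 100},  # any position, no truncation
    "library_x": {"position": 1, "truncation": 1},    # first in field, right truncation
}

def build_pqf(target, use_attr, term):
    """Render a Prefix Query Format (PQF) string tweaked for one target."""
    profile = TARGET_PROFILES[target]
    return (f'@attr 1={use_attr} '
            f'@attr 3={profile["position"]} '
            f'@attr 5={profile["truncation"]} '
            f'"{term}"')
```

The same author search then goes out with different position and truncation attributes depending on which system it is aimed at.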

Stop word lists are also an issue. Some library systems like to fail your search if you search for a stop word. Better systems just ignore stop words in queries and perform the search using the remaining terms. The effect is that searching for “Pride and prejudice” fails on some systems because “and” is stop-worded. To get around this we have to remove stop words from queries, but first we need to know what the stop words are.
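
The stripping itself is simple once a target’s stop word list is known; the sketch below assumes a hypothetical list for one target.

```python
STOP_WORDS = {"and", "or", "the", "of", "a"}  # hypothetical list for one target

def strip_stop_words(query):
    """Drop stop-worded terms so e.g. "Pride and prejudice" still matches."""
    terms = [t for t in query.split() if t.lower() not in STOP_WORDS]
    # If everything was a stop word, send the query unchanged rather than nothing.
    return " ".join(terms) if terms else query
```

The hard part, as noted above, is discovering the list for each system in the first place.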

The result is that the search of other library systems is not yet as good as it could be, though it will get better over time as we discover what works best with the various library systems that are out there.

Beta login issues

Users from some Institutions had been unable to log in to Copac Beta. Thanks to help from colleagues, we think we have now resolved the issue, which was related to an exchange of security certificates between servers. The result was that a handful of Institutions were not trusting us and so were not releasing the anonymised username that we require. This seems to be fixed now, and we’ve noticed that users from those Institutions can now log in.

So, if you tried to log in to Copac Beta and received a “Login failed” message, please try again. And please let us know if you still can’t get access.

Atom and Shibboleth

The Search History and My References features of the Copac Beta Test Interface are stored in a database with an Atom Publishing Protocol (APP) Interface. The idea is to make the database open to use by other people and services and so enable re-purposing of the data.

Authentication poses a problem. We need to authenticate so that we can identify the user and show them their records and not someone else’s. We didn’t want people to have to register to use Copac, and neither did we want to get into developing a mechanism to handle user registration, etc. So, we have used the JISC supported UK Federation (aka Shibboleth) Access Management system. This allows users to login to Copac using their own institutional username. Registering separately with Copac is not needed to gain access.

The downside is that Shibboleth is designed to work with web browsers. I don’t know the technicalities of it all, but a login with Shibboleth seems to involve multiple browser redirects, possibly a WAYF asking “Where Are You From?”, and a web page with a bunch of Javascript that the browser has to interpret and that redirects the browser yet again. I’ve tried accessing the Shibboleth-protected version of our APP Interface with some APP client software and none of it could get past the authentication — however, it is very hard to diagnose where the problems are.

I also tried the command line program “curl” to access the APP Interface, and while it can handle the redirects and the username and password, I think it fails when it gets to the page with the Javascript. That is fair enough: “curl” isn’t a web browser, it is just a program that retrieves urls.

So, can we make do without Shibboleth? Well we can, but the options are either not terribly secure or not practical. The options I can think of are:

  1. We put a token (eg a unique id) in the url. This effectively makes the user’s collection of records and search history public if the url is published.
  2. We put the token in a cookie. This is still insecure and subject to cookie hijacking, but is more private as the token isn’t in the url. Many high profile web sites seem to use such a cookie for authentication, and if they do, then I don’t see why we shouldn’t. However, I’m not sure how practical it is to get third party APP client software to send the cookie — unless the APP client was written as part of a web browser that already has the cookie.
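
For what it’s worth, sending the cookie from a standalone client is easy enough when you control the client; the sketch below prepares (but does not send) such a request. The service url and the cookie name “copac-token” are both invented for illustration.

```python
import urllib.request

def make_app_request(url, token):
    """Build a GET request for an APP collection, authenticated by cookie."""
    req = urllib.request.Request(url)
    # The cookie name "copac-token" is hypothetical.
    req.add_header("Cookie", f"copac-token={token}")
    return req

req = make_app_request("http://copac.example/app/marked-list", "abc123")
```

The practical problem is persuading off-the-shelf APP clients, which we don’t control, to do the same.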

You can try accessing the Shibboleth protected APP server for yourself at the following url:


If you’ve already used the Copac Beta then your Search History and My References collections can be found at the following urls in the form of Atom feeds:


Please let us know how you get on! I’ve tried the above urls with Firefox and Safari. Firefox gets through the authentication and displays the Atom feeds and Service Documents. Safari seems to put itself into an infinite loop whilst trying to display the feed (maybe this is something to do with the XML in our Atom feed?)

We’d be very interested to hear your thoughts on the above.

Copac Beta : new search urls

As the new Copac beta test interface is now storing users’ search history in a database, we needed Copac search urls to be stateless (or RESTful.) If you look at the current Copac urls, you will notice, as you navigate through a result set, just how much saved state is encoded in the url. There are references to the session ID and to the number of your query within your session.

In the new scheme of things, that is all gone and I believe our search urls are now stateless — that is, all the information needed to display a search result is now encoded in the url. The CGI script serving the url does not have to go delving into a database to work out what to do.

I’ll attempt here to explain the new url scheme and hopefully you will see how it can be used as a machine to machine interface to Copac. I should point out though, that this is describing the beta version and things may change in the future.

So, to perform an author query against the Copac database, all you need is a url like this:

The above url will perform an author search for “sutter” and will display an HTML rendered page showing the first page of brief records. If you would like the results sorted, then you can add a “sort-order” element to the url as follows:

The above url will sort the query by the record title field. If the result set is too large to sort, then you will be redirected back to the same query without the sort-order.

If you want to view the first full record in a result set, then add an “rn” element to the url:

Similarly, to view the second page of brief records:

All the above urls return an HTML display — not what you want for machine to machine communication. So, to get some programmer friendly XML you can add the “format” element to the url:

The above url returns a page of MODS XML records. A page is, by default, 25 records. If you’d prefer more or fewer records in a page, then you can set the page size by sending a “Page-size” header with the HTTP request. And, so that you know how large the result set is, a “Result-set-size” header is returned with the HTTP response when a “format” is specified in the url.
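
Putting the pieces together, a client might build these stateless urls along the following lines. The base url here is a made-up placeholder; the parameter names (“sort-order”, “rn”, “format”) are the ones described above.

```python
from urllib.parse import urlencode

BASE = "http://copac.example/search"  # hypothetical base url

def search_url(field, term, sort_order=None, rn=None, fmt=None):
    """Build a stateless Copac-style search url."""
    params = {field: term}
    if sort_order:
        params["sort-order"] = sort_order
    if rn:
        params["rn"] = rn          # view one full record from the result set
    if fmt:
        params["format"] = fmt     # e.g. request XML instead of HTML
    return BASE + "?" + urlencode(params)
```

Because everything needed to reproduce the result is in the url itself, any of these urls can be bookmarked, shared, or generated by another program.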

You can, of course, specify a “sort-order” along with a “format”. You’ll be able to discover the various query fields, sort and format options by delving around the user interface and performing a few queries. I’m not going to document them here and now as it is all still beta and they may change before we go live.

An enhanced Marked List

As part of the D2D work we are enhancing the functionality of the “Marked List” feature in Copac. The Marked List allows you to save records from your search session, for downloading or emailing to yourself in a variety of formats. One of the drawbacks to the Marked List is that it is linked to your search session. That means that when you come back to Copac tomorrow, the records you saved today will have gone.

So, one enhancement is to make your List of saved records permanent, so that when you come back next week, everything you saved last week is still there. The downside to this is that you will need to log in so that we know who you are and which records are yours. If you don’t want to log in to use Copac, you will still be able to use it; you just won’t get the facility of a permanent Marked List.

The current plan is to provide an API to the Marked List and it seems most sensible to use the Atom Publishing Protocol (APP). One of the nice side effects of using APP is that you’ll get an Atom feed of the records you’ve saved, plus you’ll be able to manage your collection of records with a suitable APP client outside of the Copac web site. Your Marked List will be private to you, though we will look at adding an option to publish your List to make it public.
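
As a concrete illustration of that side effect, here is how the feed of saved records might be consumed. The feed content and titles below are made up, not real Copac output, but the structure is standard Atom.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# A hypothetical minimal Marked List feed.
feed = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>My References</title>
  <entry><title>Pride and prejudice</title></entry>
  <entry><title>China tide</title></entry>
</feed>"""

root = ET.fromstring(feed)
# Pull out the title of each saved record.
titles = [e.findtext(ATOM_NS + "title") for e in root.iter(ATOM_NS + "entry")]
```

Any feed reader or APP client that speaks Atom could do the same, which is exactly the re-purposing we are after.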

The fly in the ointment of all this might be Shibboleth (the UK Academic access management mechanism.) It isn’t clear to me if an Atom feed is going to work in a Shibbolized environment. I hope to have something to test soon and I’ll keep you informed…

Bookmarking Copac records

In a previous post, “Persistent identifiers for Copac records”, I said that we would soon be adding links from our Full record pages to bookmarking sites such as Delicious. Well, we have now added the links to Delicious!

We hope you find this functionality useful. Let us know if there are other such sites you think we should be linking to.

Persistent identifiers for Copac records

If you know the record number of a Copac record, there is now a simple url that will return you the record in MODS XML format. The urls take the following form:<record-number>. For instance, the work “China tide : the revealing story of the Hong Kong exodus to Canada” has a Copac Record Number of 72008715609 and can be linked to with the url

Over the next few weeks we’ll be looking at adding these links to the Copac Full record pages and also introducing links to Bookmarking web sites such as

SRU Developments

I am a member of the OASIS Technical Committee that is attempting to formally standardize SRU. Some of the enhancements we are proposing to make to SRU as part of the standardization process are listed below:

  1. Allow Non-XML Record Representations
  2. Enhancements to Proximity searches in CQL
  3. Faceted Searching
  4. Ability for a server to be vague about the Result Set size
  5. Multiple Query Types
  6. Eliminate the Version and Operation Parameters 
  7. Alternative Response Formats

Some of the above are fairly trivial, such as the ability of the server to return an approximate number of records found by the query. It may not be immediately obvious why a server may not want to give an exact number of records found, but it enables very useful performance optimizations to be made on the server. For example, when you do a search on your favourite Internet search engine it will probably say something like “Results 1 – 10 of about 1,050” on the results page.

We are also being asked to enhance proximity searching so that it will support structured records, i.e. the sort of data you might find in a complex XML document. Some such queries might be as follows:

  • author = smith and date = 2006, but both must be found within the same containing XML element.
  • dc.creator is in the second grandchild of the grandfather of a node with = 2006 

Some of the enhancements, such as multiple query types and response formats, are quite controversial within the community. One objection is that by giving implementors choices, you will fragment the community and remove any chance of interoperability.

If you have an opinion about any of the above you are encouraged to join in the discussions by joining the OASIS Search Web Services Technical Committee.

Harvesting records from Copac with OAI-PMH

We have recently implemented an OAI-PMH interface to Copac (access details can be found on our Copac Interfaces web page.) It has passed all the tests of the OAI-PMH protocol validator hosted by the Open Archives Initiative, so I’m reasonably confident we have a usable service up and running. The niggling doubt I have is that the OAI Repository Explorer seems to have a problem parsing the MODS records we deliver — it has no problems with our DC records, so I’m hoping it is a problem at their end, not ours. If you know different then please let us know!

Several Sets are available through which you can harvest sub-sets of the Copac database. We currently have only four Sets: Music, Sounds, Images and Maps. If there are sub-sets of the Copac database you think it would be useful to harvest and are not in the list, then let us know and we’ll see what we can do. (We are limited in what Sets we offer by what is easily and efficiently searched for in the database.)

The Copac database contains almost 34 million records. We are a little worried about the performance hit our servers (and hence our users) might suffer should someone decide to try and harvest the whole of the database. Therefore, we are currently insisting that any ListRecords or ListIdentifiers requests must specify a Set. If no Set is specified then a BadArgument error is returned.
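
For harvesters, a request honouring that restriction might be built as follows. The base url is a made-up placeholder, but verb, metadataPrefix, set and resumptionToken are all standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode

OAI_BASE = "http://copac.example/oai"  # hypothetical endpoint

def list_records(set_spec, metadata_prefix="oai_dc", token=None):
    """Build an OAI-PMH ListRecords url; note a Set is required (see above)."""
    if token:
        # A resumption token replaces all other arguments except the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords",
                  "metadataPrefix": metadata_prefix,
                  "set": set_spec}
    return OAI_BASE + "?" + urlencode(params)
```

A harvest of, say, the Music Set would start with the first form and then follow any resumption tokens returned in the responses.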

Unfortunately we do not maintain information about deleted records, so it will be necessary to periodically re-harvest records from us. I wish we could offer deletion information, however the way we receive records from our contributing Institutions and the de-duplication process makes it very difficult to track deletions and keep a stable record number. But such difficulties are probably a blog post for the future.