Atom and Shibboleth

The Search History and My References features of the Copac Beta Test Interface are stored in a database with an Atom Publishing Protocol (APP) Interface. The idea is to make the database open to use by other people and services and so enable re-purposing of the data.

Authentication poses a problem. We need to authenticate so that we can identify the user and show them their records and not someone else’s. We didn’t want people to have to register to use Copac and neither did we want to get into developing a mechanism to handle user registration, etc. So, we have used the JISC supported UK Federation (aka Shibboleth) Access Management system. This allows users to log in to Copac using their own institutional username. Registering separately with Copac is not needed to gain access.

The downside is that Shibboleth is designed to work with web browsers. I don’t know the technicalities of it all, but a login with Shibboleth seems to involve multiple browser redirects, possibly a WAYF asking “Where are you From?” and a web page with a bunch of Javascript that the browser has to interpret that redirects the browser yet again. I’ve tried accessing the Shibboleth protected version of our APP Interface with some APP client software and none of it could get past the authentication. However, it is very hard to diagnose where the problems are.

I also tried the command line program “curl” to access the APP Interface and while it can handle the redirects and the username and password I think it fails when it gets to the page with the Javascript. Which is fair enough, “curl” isn’t a web browser, it is just a program that retrieves urls.

So, can we make do without Shibboleth? Well we can, but the options are either not terribly secure or not practical. The options I can think of are:

  1. We put a token (eg a unique id) in the url. This effectively makes the user’s collection of records and search history public if the url is published.
  2. We put the token in a cookie. This is still insecure and subject to cookie hijacking, but is more private as the token isn’t in the url. Many high profile web sites seem to use such a cookie for authentication, and if they do, then I don’t see why we shouldn’t. However, I’m not sure how practical it is to get third party APP client software to send the cookie, unless the APP client was written as part of a web browser that already has the cookie.
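
To make the second option concrete, here is a minimal Python sketch of how a non-browser client might send such a token in a cookie rather than in the url. The endpoint address and the cookie name are made up for illustration and are not taken from the Copac service.

```python
import urllib.request

# Hypothetical endpoint and cookie name (assumptions, not the real Copac values).
FEED_URL = "https://example.org/app/search-history"
COOKIE_NAME = "copac-token"


def build_request(token: str) -> urllib.request.Request:
    """Return a GET request that carries the token in a Cookie header."""
    req = urllib.request.Request(FEED_URL)
    # The token travels in a header, so it never appears in the url itself.
    req.add_header("Cookie", f"{COOKIE_NAME}={token}")
    req.add_header("Accept", "application/atom+xml")
    return req


req = build_request("abc123")
print(req.full_url)
print(req.get_header("Cookie"))
# Sending it would be: urllib.request.urlopen(req)
```

The token is still a bearer secret, so anyone who obtains the cookie can read the feed. The sketch only shows that the url stays clean.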

You can try accessing the Shibboleth protected APP server for yourself at the following url:


If you’ve already used the Copac Beta then your Search History and My References collections can be found at the following urls in the form of Atom feeds:


Please let us know how you get on! I’ve tried the above urls with Firefox and Safari. Firefox gets through the authentication and displays the Atom feeds and Service Documents. Safari seems to put itself into an infinite loop whilst trying to display the feed (maybe this is something to do with the XML in our Atom feed?)

We’d be very interested to hear your thoughts on the above.

Copac Beta : new search urls

As the new Copac beta test interface is now storing users’ search history in a database we needed Copac search urls to be stateless (or RESTful.) If you look at the current Copac urls, you will notice as you navigate through a result set, just how much saved state is encoded in the url. There are references to the session ID and the number of your query within your session.

In the new scheme of things, that is all gone and I believe our search urls are now stateless — that is, all the information needed to display a search result is now encoded in the url. The CGI script serving the url does not have to go delving into a database to work out what to do.

I’ll attempt here to explain the new url scheme and hopefully you will see how it can be used as a machine to machine interface to Copac. I should point out though, that this is describing the beta version and things may change in the future.

So, to perform an author query against the Copac database, all you need is a url like this:

The above url will perform an author search for “sutter” and will display an HTML rendered page showing the first page of brief records. If you would like the results sorted, then you can add a “sort-order” element to the url as follows:

The above url will sort the query by the record title field. If the result set is too large to sort, then you will be redirected back to the same query without the sort-order.

If you want to view the first full record in a result set, then add an “rn” element to the url:

Similarly, to view the second page of brief records:

All the above urls return an HTML display — not what you want for machine to machine communication. So, to get some programmer friendly XML you can add the “format” element to the url:

The above url returns a page of MODS XML records. A page, by default, is 25 records. If you’d prefer more or fewer records in a page, then you can set the page size by sending a “Page-size” header with the HTTP request. And, so that you know how large the result set is, a “Result-set-size” header is returned with the HTTP response when a “format” is specified in the url.
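
As a rough sketch of how a client might use these two headers, the Python below sets “Page-size” on a request and reads “Result-set-size” from a response. The header names come from the description above. The url is a placeholder and the helper names are my own.

```python
import urllib.request


def paged_request(url: str, page_size: int) -> urllib.request.Request:
    """Build a GET request asking for page_size records per page."""
    req = urllib.request.Request(url)
    req.add_header("Page-size", str(page_size))
    return req


def result_set_size(headers):
    """Read the Result-set-size response header, or None if it is absent."""
    value = headers.get("Result-set-size")
    return int(value) if value is not None else None


# Placeholder url (an assumption); the real search url is not reproduced here.
req = paged_request("https://example.org/search?author=sutter&format=mods", 50)
print(req.get_header("Page-size"))

# With a live server the response headers would be read like this:
#   with urllib.request.urlopen(req) as resp:
#       total = result_set_size(resp.headers)
```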

You can, of course, specify a “sort-order” along with a “format”. You’ll be able to discover the various query fields, sort and format options by delving around the user interface and performing a few queries. I’m not going to document them here and now as it is all still beta and they may change before we go live.
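
For machine to machine use, the url elements described above could be assembled with a small helper like the Python sketch below. The “sort-order”, “rn” and “format” element names are the ones mentioned in this post. The base url, the “author” field name and the example values “title” and “mods” are assumptions for illustration only.

```python
from urllib.parse import urlencode

# Hypothetical base url (an assumption); substitute the real beta address.
BASE_URL = "https://example.org/search"


def search_url(field, term, sort_order=None, rn=None, fmt=None):
    """Build a stateless search url from a query field and optional elements."""
    params = [(field, term)]
    if sort_order is not None:
        params.append(("sort-order", sort_order))
    if rn is not None:
        params.append(("rn", str(rn)))  # record number within the result set
    if fmt is not None:
        params.append(("format", fmt))
    return BASE_URL + "?" + urlencode(params)


print(search_url("author", "sutter"))
print(search_url("author", "sutter", sort_order="title", fmt="mods"))
```

Because everything needed to reproduce the result is in the url, the same string can be bookmarked, shared or replayed by another program without any session.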

Search results as an Atom feed?

Here are a few questions for you. Would it be useful to be able to get your Copac search results as an Atom feed? If so, would it help in aggregating Copac searches with results from other services? Would it make writing widgets for, say, iGoogle or Netvibes, easier? Would you like Copac urls to be RESTful? (I hope so, as they will be before long.)

Yesterday I was thinking about the different search result formats we provide and I was wondering if Atom might be useful. Then a conversation I had this morning with some colleagues made me think an Atom format could be very useful in the areas outlined above. However, I don’t have experience of implementing widgets or working with Feeds, so I thought I’d ask here. Any thoughts, anyone?

Search history & a stateless interface

One of the things I’d like to do for Copac is to re-write the code behind the web based user interface. The current architecture was designed to work with a Z39.50 server and I now consider it to be too complex. This makes it hard to debug when things go wrong and the complexity of it means that things do go wrong.

So, I’d like to move the interface over to a REST based stateless interface that talks directly to the database without going through our Z39.50 interface. This should decrease the time to produce a response after a user hits the search button and should be more reliable.

What I wasn’t too sure about, until now, was how we would incorporate Copac’s Search History feature into a stateless, REST based, interface. The answer came to me during the small hours this morning. We can put the searches into the same Atom Publishing Protocol (APP) repository that we plan to use for the Marked List. (The Search History and Marked List would be separate collections within the repository and so wouldn’t be mixed up together.)

The advantages of this are: the user can have an Atom feed of their searches, they can tag and annotate their searches and generally manipulate their search history by deleting and editing entries through APP client software. We might also be able to include searches from other services. I think such a search history would work for any REST based service. So if we can move other Mimas services, such as Zetoc and the Archives Hub over to a REST based interface, then a user could potentially have, in one place, an archive of all the searches they have performed over a number of different services.
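
To give a feel for what storing a search in an APP collection might involve, here is a minimal Python sketch that builds an Atom entry for one search. The choice of elements (a title, a link to the stateless search url, and categories for tags) is my assumption and not a description of the actual repository. The entry id is left for the server to assign.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialise Atom as the default namespace


def search_entry(title, search_url, updated, tags=()):
    """Return an Atom entry (as a string) describing one saved search."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # The stateless search url is enough to re-run the search later.
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", href=search_url, rel="related")
    for tag in tags:
        ET.SubElement(entry, f"{{{ATOM_NS}}}category", term=tag)
    return ET.tostring(entry, encoding="unicode")


# The url below is a placeholder, not a real Copac address.
xml = search_entry(
    "author: sutter",
    "https://example.org/search?author=sutter",
    "2009-01-01T00:00:00Z",
    tags=["c++"],
)
print(xml)
```

An APP client would POST such an entry to the collection url with a Content-Type of application/atom+xml;type=entry, and could later edit or delete it through the same protocol.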

An enhanced Marked List

As part of the D2D work we are enhancing the functionality of the “Marked List” feature in Copac. The Marked List allows you to save records from your search session, for downloading or emailing to yourself in a variety of formats. One of the drawbacks to the Marked List is that it is linked to your search session. That means that when you come back to Copac tomorrow, the records you saved today will have gone.

So, one enhancement is to make your List of saved records permanent, so when you come back next week, everything you saved last week is still there. The downside to this is that you will need to log in so that we know who you are and which are your records. If you don’t want to log in to use Copac, then you will still be able to, you just won’t get the facility of a permanent Marked List.

The current plan is to provide an API to the Marked List and it seems most sensible to use the Atom Publishing Protocol (APP). One of the nice side effects of using APP is that you’ll get an Atom feed of the records you’ve saved, plus you’ll be able to manage your collection of records with a suitable APP client outside of the Copac web site. Your Marked List will be private to you, though we will look at adding an option to publish your List to make it public.
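
As an illustration of what an Atom feed of saved records would let an outside program do, the Python sketch below lists the entries in a feed. The sample feed is invented for the example and only uses standard Atom elements. The real feed may carry more, or different, data.

```python
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample data; not an actual Marked List feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Marked List</title>
  <entry><title>First saved record</title><id>urn:example:1</id></entry>
  <entry><title>Second saved record</title><id>urn:example:2</id></entry>
</feed>"""


def saved_records(feed_xml):
    """Return (id, title) pairs for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [
        (
            entry.findtext("atom:id", namespaces=ATOM),
            entry.findtext("atom:title", namespaces=ATOM),
        )
        for entry in root.findall("atom:entry", ATOM)
    ]


for record_id, title in saved_records(SAMPLE_FEED):
    print(record_id, title)
```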

The fly in the ointment of all this might be Shibboleth (the UK Academic access management mechanism.) It isn’t clear to me if an Atom feed is going to work in a Shibbolized environment. I hope to have something to test soon and I’ll keep you informed…

Bookmarking Copac records

In a previous post, “Persistent identifiers for Copac records”, I said that we would soon be adding links from our Full record pages to bookmarking sites such as Delicious. Well, we have now added the links to Delicious!

We hope you find this functionality useful. Let us know if there are other such sites you think we should be linking to.

Persistent identifiers for Copac records

If you know the record number of a Copac record, there is now a simple url that will return you the record in MODS XML format. The urls take the following form:<record-number>. For instance, the work “China tide : the revealing story of the Hong Kong exodus to Canada” has a Copac Record Number of 72008715609 and can be linked to with the url

Over the next few weeks we’ll be looking at adding these links to the Copac Full record pages and also introducing links to Bookmarking web sites such as

Handling XML errors

I’ve just installed some updated software that should increase the reliability of the web service. Unfortunately, while I was installing the software people using the service will have seen error messages in place of our records. The disruption should only have lasted a minute or two and everything should be working now.

The update allows us to better cope with errors in the records. In the past an XML error in one record in a page of results was causing users to see a “500 Internal Server Error” page rather than their records. Things are now better, though not perfect. We still cannot display the record with the errors, but the rest of the records are displayed and there should be no more Internal Server Error pages because of bad XML. Records with errors will now show as follows in the brief display:

An un-displayable record in the Brief display.
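
The general idea of coping with one bad record without losing the whole page can be sketched in Python as follows. This is not our production code. The record structure and the placeholder text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder text shown in place of a record that will not parse.
PLACEHOLDER = "This record cannot be displayed."


def brief_titles(records):
    """Parse each record separately so one bad record cannot fail the page."""
    lines = []
    for raw in records:
        try:
            root = ET.fromstring(raw)
        except ET.ParseError:
            # Bad XML: show a placeholder and carry on with the next record.
            lines.append(PLACEHOLDER)
            continue
        lines.append(root.findtext("title", default="(untitled)"))
    return lines


records = [
    "<record><title>China tide</title></record>",
    "<record><title>Broken &am\np; entity</title></record>",  # entity split by a line break
    "<record><title>Another work</title></record>",
]
for line in brief_titles(records):
    print(line)
```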

As I mentioned in a previous post our database software does not natively support XML and it is occasionally inserting line-breaks where it shouldn’t, such as in the middle of an XML Entity! Our next task is to modify our line breaking algorithm (so that the database doesn’t need to do it itself) and correct the affected records.
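
One way a line breaking routine can avoid splitting an entity is to treat each entity reference as a single unbreakable token. The Python sketch below does a simple hard wrap on that basis. It is only an illustration of the idea under that assumption, not the algorithm we will actually use, and it makes no attempt to break at word boundaries or to keep tags whole.

```python
import re

# An entity or character reference (e.g. &amp; or &#x26;) is one token;
# every other character is a token on its own.
TOKEN = re.compile(r"&#?\w+;|.", re.DOTALL)


def wrap_preserving_entities(text, width):
    """Split text into lines of at most width characters, never inside an entity."""
    lines, current = [], ""
    for match in TOKEN.finditer(text):
        token = match.group(0)
        # Start a new line if this token would push the line past the limit.
        if current and len(current) + len(token) > width:
            lines.append(current)
            current = ""
        current += token
    if current:
        lines.append(current)
    return lines


for line in wrap_preserving_entities("Marks &amp; Spencer", 8):
    print(repr(line))
```

Joining the lines back together gives the original text, so the wrapping can be undone before the XML is parsed. An entity longer than the width is kept whole on a line of its own rather than being split.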

SRU Developments

I am a member of the OASIS Technical Committee that is attempting to formally standardize SRU. Some of the enhancements we are proposing to make to SRU as part of the standardization process are listed below:

  1. Allow Non-XML Record Representations
  2. Enhancements to Proximity searches in CQL
  3. Faceted Searching
  4. Ability for a server to be vague about the Result Set size
  5. Multiple Query Types
  6. Eliminate the Version and Operation Parameters 
  7. Alternative Response Formats

Some of the above are fairly trivial, such as the ability of the server to return an approximate number of records found by the query. It may not be immediately obvious why a server may not want to give an exact number of records found, but it enables very useful performance optimizations to be made on the server. For example, when you do a search on your favourite Internet search engine it will probably say something like “Results 1 – 10 of about 1,050” on the results page.

We are also being asked to enhance proximity searching so that it will support structured records, i.e. the sort of data you might find in a complex XML document. Some such queries might be as follows:

  • author = smith and date = 2006, but both must be found within the same containing XML element.
  • dc.creator is in the second grandchild of the grandfather of a node with = 2006 

Some of the enhancements, such as multiple query types and response formats, are quite controversial within the community. One objection is that by giving implementors choices, you will fragment the community and remove any chance of interoperability.

If you have an opinion about any of the above you are encouraged to join in the discussions by joining the OASIS Search Web Services Technical Committee.

Google Book Search

We have re-enabled links to Google Book Search, again. If you haven’t already seen these links, they appear in the sidebar of the Full Record display underneath the menu and cover image. The link text will read as either “Google Full View”, “Google Preview” or “Google Book Search” depending on the amount and type of information held by Google.

Javascript embedded within the Full Record page connects to Google Book Search to determine if Google hold information on the work. This enables us to show links to Google only when there is something useful to see when you follow the link. The downside to this is that Google will log the IP address of your computer, any Cookies they have previously set on your browser and the ISBN of the work you are viewing; even if you don’t follow the link.

Some of our users expressed concerns about being forced to link to Google and so we changed the way in which the connection to Google was made. We had a small script on our server act as an intermediary between your computer and Google. That way your computer was only talking to our server and all the connections to Google Book Search originated from our server. This worked okay for a short amount of time until our script was blocked by Google. The message sent back to our script from Google was that “your query looks similar to automated requests from a computer virus or spyware application”, which I can understand. We did try contacting people at Google to see if there was any way we could keep using our script. All we’ve had from Google is a deathly silence.

So we’ve re-instated the links to Google Book Search and we now have a Preferences page which enables you to turn the links off if you don’t like Google being able to track what you do on Copac. You will need cookies enabled on your browser for the Preference settings to work. The link to the Preferences page appears in the sidebar menu on the search forms and Full and Brief record displays.