Copac Beta Interface

We’ve just released the beta test version of a new Copac interface and I thought I’d write a few notes about it and how we’ve created it.

Some of the more significant changes to the search result page (or “brief display” as we call it) are:

  • There are now links to the library holdings information pages directly from the brief display. You no longer have to go via the “full record” page to get to the holdings information.
  • You can see a more complete view of a record by clicking on the magnifying glass icon at the end of the title. This enables you to quickly view a more detailed record without having to leave the brief display.
  • You can quickly edit your query terms using the search forms at the top of the page.
  • To further refine your search you can add keywords to the query by typing them into the “Search within results” box.
  • You can change the number of records displayed in the result page.

The pages have been designed using Responsive Web Design techniques, which is jargon meaning that the HTML5 and CSS are written so that the web page rearranges itself to suit the size of your screen. The new interface should work whether you are using a desktop with a cinema display, a tablet computer or a mobile phone. Users of those three display types will see a different arrangement of screen elements, and some elements may be missing altogether on the smaller displays. If you use a tablet computer or smartphone, please give the beta a try on them and let us know what you think.

The CGI script that creates the web pages is a C++ application which outputs some fairly simple, custom XML. The XML is fed through an XSLT stylesheet to produce the HTML (and also the various record export formats). Opinion on the web seems divided on whether or not this is a good idea; the most valid complaint seems to be that it is slow. It seems fast enough to us, and the beta way of doing things is actually an improvement, as there is now just one XSLT stylesheet used in creating the display, whereas our old way of doing things ran multiple XSLT stylesheets multiple times for each web page. Which probably just goes to show that the most significant eater of time is the searching of the database rather than the creation of the HTML.
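For the curious, here is a minimal sketch of that XML-to-HTML step, written in Python with the lxml library rather than taken from our C++ application; the stylesheet name and the second-pass example are invented for illustration.

```
# Minimal sketch of the XML -> XSLT -> HTML step, assuming Python and lxml.
# The stylesheet and file names are invented, not the actual Copac ones.
from lxml import etree

def render_results(results_xml: bytes, stylesheet: str = "brief-display.xsl") -> str:
    """Transform the search-result XML into an HTML page."""
    doc = etree.fromstring(results_xml)              # the XML emitted by the CGI application
    transform = etree.XSLT(etree.parse(stylesheet))  # one stylesheet, applied once
    return str(transform(doc))

# A record export format would simply use a different stylesheet,
# e.g. render_results(xml, "export-format.xsl"), rather than a second pass.
```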

Copac deduplication

Over 60 institutions contribute records to the Copac database. We try to de-duplicate those contributions so that records from multiple contributors for the same item are “consolidated” together into a single Copac record. Our de-duplication efforts have reduced over 75 million records down to 40 million.
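The matching rules themselves are beyond the scope of this post, but purely as an illustration of the idea, here is a sketch that groups contributed records on a crude normalised key; the real Copac matching is considerably more sophisticated and the field names here are invented.

```
# Purely illustrative: group contributed records on a crude normalised key.
# Copac's real matching rules are far more sophisticated than this.
from collections import defaultdict
import re

def match_key(record: dict) -> str:
    # Lower-cased, punctuation-stripped title plus date of publication.
    title = re.sub(r"[^a-z0-9 ]", "", record.get("title", "").lower())
    return " ".join(title.split()) + "|" + record.get("date", "")

def consolidate(records: list[dict]) -> list[list[dict]]:
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)
    return list(groups.values())
```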

Our contributors send us updates on a regular basis which results in a large amount of database “churn.” Approximately one million records a month are altered as part of the updating process.

Updating a consolidated record

Updating a database like Copac is not as straightforward as you might think. A contributor sending us a new record may result in us deleting a Copac record, while a contributor deleting a record may result in a new Copac record being created. A diagram may help explain this.

A Copac consolidated record created from 5 contributed records. Lines show how contributed records match with one another.

The above graph represents a single Copac record consolidated from five contributed records: a1, a2, a3, b1 & b2. A line between two records indicates that our record matching algorithm thinks the records are for the same bibliographic item. Hence, records a1, a2 & a3 match with one another; b1 & b2 match with each other; and a1 matches with b1.

Should record b1 be deleted from the database, then, as b2 does not match any of a1, a2 or a3, we are left with two clumps of records. Records a1, a2 & a3 would form one consolidated record, and b2 would constitute a Copac record in its own right as it matches no other record. Hence the deletion of a contributed record turns one Copac record into two Copac records.

I hope it is clear that the inverse can happen — that a new contributed record can bring together multiple Copac records into a single Copac record.
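To make the splitting concrete, here is a small sketch (not the production code) that treats the match results as an undirected graph and derives the consolidated sets as its connected components; dropping b1 from the example above leaves two sets.

```
# Sketch only: treat record matches as edges of an undirected graph and
# find the consolidated sets as its connected components.
from collections import defaultdict

def consolidated_sets(records, matches):
    """records: iterable of record ids; matches: iterable of (id, id) pairs."""
    adj = defaultdict(set)
    for a, b in matches:
        adj[a].add(b)
        adj[b].add(a)
    seen, sets = set(), []
    for rec in records:
        if rec in seen:
            continue
        stack, component = [rec], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        sets.append(sorted(component))
    return sets

records = {"a1", "a2", "a3", "b1", "b2"}
matches = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("b1", "b2"), ("a1", "b1")]
print(consolidated_sets(records, matches))                        # one set of five
print(consolidated_sets(records - {"b1"},
                        [m for m in matches if "b1" not in m]))   # two sets: a1-a3, and b2 alone
```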

The above is what would happen in an ideal world. Unfortunately the current Copac database does not save a log of the record matches it has made and neither does it attempt to re-match the remaining records of a consolidated set when a record is deleted. The result is that when record b1 is deleted, record b2 will stay attached to records a1, a2 & a3. Coupled with the high amount of database churn this can sometimes result in seemingly mis-consolidated records.

Smarter updates

As part of our forthcoming improvements to Copac we are keeping a log of the records that match. This makes it easier for the Copac update procedures to correctly disentangle a consolidated record and should result in fewer mis-consolidations.

We are also trying to make the update procedures smarter and have them do less work. For historical reasons the current Copac database is really two databases: a database of the contributors’ records and a database of consolidated records. The contributors’ database is updated first and a set of deletions and additions/updates is passed on to the consolidated database. The consolidated database doesn’t know whether an updated record has changed in a trivial way or now represents another item completely. It therefore has no choice but to re-consolidate the record, and that means deleting it from the database and then adding it back in (there is no update functionality). This is highly inefficient.

The new scheme of things tries to be a bit more intelligent. An updated record from a contributor is compared with the old version of itself and categorised as follows:

  • The main bibliographic details are unchanged and only the holdings information is different.
  • The bibliographic record has changed, but not in a way that would affect the way it has matched with other records.
  • The bibliographic record has changed significantly.

Only in the last case does the updated record need to be re-consolidated (and in future that will be done without having to delete the record first!) In the first two cases we would only need to refresh the record that we use to create our displays.
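A rough sketch of that categorisation might look like the following; the choice of which fields count as holdings data and which as match-critical data is an assumption for illustration, not the actual Copac rules.

```
# Rough sketch of the three-way update categorisation described above.
# The field groupings are assumptions, not the real Copac rules.
from enum import Enum

class Change(Enum):
    HOLDINGS_ONLY = 1   # only holdings changed: refresh the display record
    DISPLAY_ONLY = 2    # bibliographic change that cannot affect matching: refresh the display record
    SIGNIFICANT = 3     # match-critical fields changed: re-consolidate

MATCH_FIELDS = ("title", "author", "date", "isbn")   # assumed
HOLDINGS_FIELDS = ("holdings",)                      # assumed

def categorise(old: dict, new: dict) -> Change:
    def pick(rec, keys):
        return {k: rec.get(k) for k in keys}
    if pick(old, MATCH_FIELDS) != pick(new, MATCH_FIELDS):
        return Change.SIGNIFICANT
    bib_keys = [k for k in set(old) | set(new) if k not in HOLDINGS_FIELDS]
    if pick(old, bib_keys) != pick(new, bib_keys):
        return Change.DISPLAY_ONLY
    return Change.HOLDINGS_ONLY
```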


An analysis of an update from one of our contributors showed that it contained 3818 updated records: 954 had unchanged bibliographic details and only 155 (around 4%) had changed significantly and needed re-consolidating. The saving there is quite big. In the current Copac database we have to re-consolidate all 3818 records, whereas in the new version of Copac we only need to re-consolidate 155. This will reduce database churn significantly, result in updates being applied faster and allow us to take on more contributors.

Example Consolidations

Just for interest, and because I like the graphs, I’ve included a couple of graphs of consolidated records from our test database. The first graph shows a larger set of records. There are two records in this set whose deletion would break the set up into two smaller sets.
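In graph terms those two records are articulation points (cut vertices). A quick way to find them, assuming the networkx library and an invented edge list rather than the real data, would be:

```
# Sketch: the records whose deletion splits a consolidated set are the
# articulation points of the match graph. Assumes networkx; the edge
# list is invented for illustration.
import networkx as nx

G = nx.Graph([("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r4", "r5"), ("r5", "r3")])
print(list(nx.articulation_points(G)))   # the records whose removal breaks up the set
```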

The graph below shows a smaller set of records where each record matches with every other record.

Performance improvements

The run-up to Christmas (or the Autumn term if you prefer) is always our busiest time of year as measured by the number of searches performed by our users. Last year the search response times were not what we would have liked, and we have been investigating the causes of the poor performance and ways of improving it. Our IT people determined that at our busiest times the disk drives in our SAN were being pushed to their maximum performance and just couldn’t deliver data any faster. So, over the summer we have installed an array of Solid State Disks to act as a fast cache for our file-systems (for the more technical, I believe it is actually configured as a ZFS Level 2 cache).

The SSD cache was turned on during our brief downtime on Thursday morning and so far the results look promising. I’m told the cache is still “warming up” and that performance may improve still further. The best performance indicator I can provide is the graph below. We run a “standard” query against the database every 30 minutes and record the time taken to run the query. The graph below plots the time (in seconds) to run the query since midnight on the 23rd August 2011. I think it is pretty obvious from looking at the graph exactly when the SSD cache was configured in.
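The probe itself is nothing exotic; something along the lines of the sketch below, run every half hour, does the job. The query call and log location are placeholders, not the real Copac internals.

```
# Sketch of a response-time probe: run a fixed "standard" query and
# append the elapsed time to a log file.
import time

def run_standard_query():
    """Placeholder for the real search call against the database."""
    time.sleep(0.1)

def record_response_time(logfile: str = "response-times.log") -> float:
    start = time.monotonic()
    run_standard_query()
    elapsed = time.monotonic() - start
    with open(logfile, "a") as log:
        log.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')}\t{elapsed:.2f}\n")
    return elapsed
```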

It all looks very promising so far and I think we can look forward to the Autumn with less trepidation and hopefully some happier users.

Hardware move

The hardware move has gone relatively smoothly today. We’ve had some configuration issues that prevented some Z39.50 users from pulling back records and another configuration problem that meant a small percentage of the records weren’t visible. That should all be fixed now, but if you see something else that looks like a problem, then please let us know.

The DNS entry for copac.ac.uk was changed at about 10am this morning. At 4pm we’re still seeing some usage on the old hardware. However, most usage started coming through to the new machine very soon after the DNS change.

The changeover to the new hardware has involved a lot of preparation over many weeks. Now it’s done, we can get back to re-engineering Copac… a new database backend and new search facilities for the users.

Database update

We’ve had a recurrence of the problem I reported a month ago and so last night we installed an update to the database software we use. I’m told the update contains fixes relevant to the problems we have been experiencing, so here’s hoping it brings some increased reliability with it.

Please accept our apologies if you experienced some disruption last night while I was updating the software.

Yesterday’s loss of service

I thought I’d write a note about why we lost the Copac service for a couple of hours yesterday.

The short of it is that our database software hung when it tried to read a corrupted file in which it keeps track of sessions. The result was that everyone’s search process hung, and frustrated users kept re-trying their searches, which created more hung sessions, until the system was full of hung processes and had no CPU or memory left. Once we had deleted the corrupted file, everything was okay.

The long version goes something like this… From what I remember, things started going pear-shaped a little before noon when the machine running the service started becoming unresponsive. A quick look at the output of top showed we had far more search sessions running than normal and that the system was almost out of swap space.

It wasn’t clear why this was happening, and because the system was running out of swap it was very difficult to diagnose the problem. It was difficult to run programs from the command line as, more often than not, they immediately died with the message “out of memory.” I did manage to shut down the web server in an effort to lighten the load and stop more search sessions being created, but it was proving almost impossible to kill off the existing search sessions. In Unix a “kill -9” on a process should immediately stop the process and release its memory back to the system. But yesterday a “kill -9” was having no effect on some processes, and those that we did manage to kill were being listed as “defunct” and still seemed to be holding onto memory. In the end we just thought it would be best to re-boot the system and hope that it would solve whatever the problem was.

It took ages for the system to shut itself down – presumably because the shutdown procedures had no memory to work in. Anyway, it did finally reboot, and within minutes of coming up the system became overloaded with search sessions and ran out of memory again.

We immediately shut down the web server again. However, search sessions were still being created by people using Z39.50, so we had to edit the system configuration files to stop inetd spawning more Z39.50 search sessions. Editing inetd.conf didn’t prove to be the trivial task it should have been, but we did get it done eventually. We then tried killing off the 500 or so search sessions that were hogging the system — and that proved difficult too. Many of the processes refused to die. So, after sitting staring at the screen for about 15 minutes, unable to run programs because there was no memory and wondering what on earth to do next, the system recovered itself. The killed-off processes did finally die, memory was released and we could do stuff again!

A bit of investigation showed that the search processes weren’t getting very far into their initialisation procedure before hanging or going into an infinite loop. I used the Solaris truss program to see what files the search process was reading and what system calls it was making. Truss showed that the process was going off into cloud cuckoo land just after reading a file the database software uses to keep track of sessions. So I deleted that file and everything started working again! The file was re-created the next time a search process ran — presumably the original had become corrupted.

Issues searching other library catalogues

Some of you may have noticed that there is now a facility on the Copac search forms to search your local library catalogue as well as Copac. You’ll only see this option if you have logged into Copac and are from a supported library.

The searching of the local library catalogues and of Copac is performed using the Z39.50 search protocol. Due to differences in local configurations, the queries we send to Copac and to the various library catalogues have to be constructed very differently.

When we built the Copac Z39.50 server, we tried to make it flexible in the type of query it would accept within the limitations imposed upon us by the database software we use. Our database software was made for keyword searching of full text resources. As such it is good at adjacency searches, but you can’t tell it you want to search for a word at the start of a field.

Systems built on relational databases tend to be the complete opposite in functionality. They often aren’t good at keyword searching, but find it very easy to find words at the start of a field.

The result is that we make our default search a keyword search, while some other systems default to searching for query terms at the start of a field. Hence, if we send exactly the same search to Copac and to a library catalogue, we can get very different results from the two systems. To try and get a consistent result we have to tweak the query sent to the library so that it performs a search as near as possible to that performed by Copac. Working out how to tweak (or transform, or mangle) the queries is a black art and we are still experimenting.
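As a rough illustration of the kind of tweak involved, here is a sketch that swaps Bib-1 position attributes per target; the per-target choices are invented examples, not the real Copac configuration.

```
# Sketch of per-target query mangling using Bib-1 attributes in PQF.
# Position attribute (type 3): 3 = any position in field, 1 = first in field.
# The per-target choices below are invented examples.
TARGET_ATTRIBUTES = {
    "copac":       "@attr 1=4 @attr 3=3",   # title search, keyword anywhere in the field
    "somelibrary": "@attr 1=4 @attr 3=1",   # title search, term at the start of the field
}

def build_title_query(target: str, terms: list[str]) -> str:
    """Build a PQF title query tuned to a particular Z39.50 target."""
    phrase = " ".join(terms)
    return f'{TARGET_ATTRIBUTES[target]} "{phrase}"'

print(build_title_query("somelibrary", ["pride", "and", "prejudice"]))
```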

Stop word lists are also an issue. Some library systems like to fail your search if you search for a stop word; better systems just ignore stop words in queries and perform the search using the remaining terms. The effect is that searching for “Pride and prejudice” fails on some systems because “and” is a stop word. To get around this we have to remove stop words from queries, but first we need to know what the stop words are.
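Stripping them is the easy part, as the sketch below shows (the stop-word list is an invented example); discovering each system’s list is the harder problem.

```
# Sketch: remove a target's stop words from the query terms before sending.
STOP_WORDS = {"somelibrary": {"a", "an", "and", "of", "the"}}   # invented example list

def strip_stop_words(target: str, terms: list[str]) -> list[str]:
    stops = STOP_WORDS.get(target, set())
    kept = [t for t in terms if t.lower() not in stops]
    return kept or terms   # never send an empty query

print(strip_stop_words("somelibrary", ["pride", "and", "prejudice"]))   # ['pride', 'prejudice']
```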

The result is that the search of other library systems is not yet as good as it could be, though it will get better over time as we discover what works best with the various library systems that are out there.

Copac Beta can search your library too

One of the new features we are trialling in the new Copac Beta is the searching of your local institution’s library catalogue alongside Copac. To do this we need to know which Institution you are from and whether or not your Institutional library catalogue can be searched with the Z39.50 protocol.

To identify where you are from, we are using information given to us during the login process. When you log in, your Institution gives us various pieces of information about you, including something called a scoped affiliation. For someone logging in from, say, the University of Manchester, the scoped affiliation might be something like “student@manchester.ac.uk”.

Once we know where you are from, we search a database of Institutional Z39.50 servers to see if your Institution’s library is searchable. If it is, we can present the extra options on the search forms and, indeed, fire off queries to your library catalogue.
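A rough sketch of that lookup (the server entry is an invented placeholder, not a real IESR record):

```
# Sketch: map the scoped affiliation released at login to a Z39.50 target.
# The server entry below is an invented placeholder, not a real IESR record.
Z3950_SERVERS = {
    "manchester.ac.uk": {"host": "z3950.example.ac.uk", "port": 210, "database": "MAIN"},
}

def local_catalogue(scoped_affiliation: str):
    """e.g. 'student@manchester.ac.uk' -> that Institution's Z39.50 target, or None."""
    scope = scoped_affiliation.rsplit("@", 1)[-1].lower()
    return Z3950_SERVERS.get(scope)
```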

Our database of Z39.50 servers is created from records harvested from the IESR. So, if you’d like your Institution’s catalogue available through Copac, make sure it is included in the IESR by talking to the nice people there.

Many thanks to everyone who tried the Beta interface early on and discovered that this feature mostly wasn’t working. You enabled us to identify some bugs and get the service working.

Beta login issues

Users from some Institutions had been unable to log in to Copac Beta. Thanks to help from colleagues, we think we have now resolved the issue, which was related to an exchange of security certificates between servers. The result was that a handful of Institutions were not trusting us and so were not releasing the anonymised username that we require. This seems to be fixed now and we’ve noticed that users from those Institutions can now log in.

So, if you tried to login to Copac Beta and received a “Login failed” message, please try again. And please let us know if you still can’t get access.

Copac Beta tweaks

[Jargon alert] I just noticed that the Shibboleth TargetedIds (read anonymised usernames) we are seeing when people log into Copac Beta are much longer than I expected. Some of them may have become truncated when saved in our user database. So, I’ve just increased the field size in the database. This may mean that some people will have lost their search history and references. Sorry about that. But a Beta test version is all about finding out about such niggles.

Thanks for persevering everyone.

[Added 30/3/2009] Tagging in ‘My references’ currently broken. We’re looking into it and hope to have it fixed soon.

[31/3/2009] Tagging issues now look to be fixed. However, when I was looking through the logs I spotted another problem which may reveal the reason some people are having difficulties searching Beta. Investigations are in progress.

[1/4/2009] It looks like some people are unable to gain access to Copac Beta because their Identity Provider isn’t providing us with an anonymised userid or, to use the jargon, a TargetedID. We need this so that we know which Search History and References are yours.

There’s not a lot we can do about this; it is up to your Institution to release the TargetedID to us. However, if you are getting a “Login Failed” message, please contact us, telling us which Institution you are from, and we’ll try hassling your system admins.