Surfacing the Academic Long Tail — Announcing new work with activity data

We’re pleased to announce that JISC has funded us to work on the SALT (Surfacing the Academic Long Tail) Project, which we’re undertaking with the University of Manchester, John Rylands University Library.

Over the next six months the SALT project will build a recommender prototype for Copac and the JRUL OPAC interface, which will be tested by the communities of users of those services. Following on from the invaluable work undertaken at the University of Huddersfield, we’ll be working with more than ten years of aggregated and anonymised circulation data amassed by JRUL. Our approach will be to develop an API onto that data, which in turn we’ll use to develop the recommender functionality in both services. Obviously, we’re indebted to the knowledge acquired by the Huddersfield project, and the SALT project will work closely with colleagues there (Dave Pattern and Graham Stone) to see what happens when we apply this concept in the research library and national library service contexts.

Our overall aim is that by working collaboratively with other institutions and Research Libraries UK, the SALT project will advance our knowledge and understanding of how best to support research in the 21st century. Libraries are a rich source of valuable information, but sometimes the sheer volume of materials they hold can be overwhelming even to the most experienced researcher, and we know that researchers’ expectations of how to discover content are shifting in an increasingly personalised digital world. We know that library users, particularly those researching niche or specialist subjects, are often seeking content based on a recommendation from a contemporary, a peer, colleagues or academic tutors. The SALT Project aims to give libraries the ability to provide users with that information. Similar to Amazon’s ‘customers who bought this item also bought…’ feature, the recommenders on this system will appear on a local library catalogue and on Copac, and will be based on circulation data gathered over the past 10 years at The University of Manchester’s internationally renowned research library.

How effective will this model prove to be for users, particularly humanities researchers?

Here’s what we want to find out:

  • Will researchers in the field of humanities benefit from receiving book recommendations, and if so, in what ways?
  • Will the users go beyond the reading list and be exposed to rare and niche collections — will new paths of discovery be opened up?
  • Will collections in the library, previously undervalued and underused, find a new appreciative audience? Will the Long Tail be exposed and exploited for research?
  • Will researchers see new links in their studies, possibly in other disciplines?

We also want to consider if there are other potential beneficiaries. By highlighting rarer collections, valuing niche items and bringing to the surface less popular but nevertheless worthy materials, libraries will have the leverage they need to ensure the preservation of these rich materials. Can such data or services assist in decision-making around collections management? We will be consulting with Leeds University Library and the White Rose Consortium, as well as UKRR in this area.

And finally, as part of our sustainability planning, we want to look at how scalable this approach might be for developing a shared aggregation service of circulation data for UK University Libraries. We’re working with potential data contributors such as Cambridge University Library, University of Sussex Library, and the M25 consortium, as well as RLUK, to trial and provide feedback on the project outputs, with specific attention to the sustainability of an API service as a national shared service for HE/FE that supports academic excellence and drives institutional efficiencies.

Copac Beta : new search urls

As the new Copac beta test interface is now storing users’ search history in a database we needed Copac search urls to be stateless (or RESTful.) If you look at the current Copac urls, you will notice as you navigate through a result set, just how much saved state is encoded in the url. There are references to the session ID and the number of your query within your session.

In the new scheme of things, that is all gone and I believe our search urls are now stateless — that is, all the information needed to display a search result is now encoded in the url. The CGI script serving the url does not have to go delving into a database to work out what to do.

I’ll attempt here to explain the new url scheme and hopefully you will see how it can be used as a machine to machine interface to Copac. I should point out though, that this is describing the beta version and things may change in the future.

So, to perform an author query against the Copac database, all you need is a url like this:

http://beta.copac.ac.uk/search?au=sutter

The above url will perform an author search for “sutter” and will display an HTML rendered page showing the first page of brief records. If you would like the results sorted, then you can add a “sort-order” element to the url as follows:

http://beta.copac.ac.uk/search?au=sutter&sort-order=ti

The above url will sort the query by the record title field. If the result set is too large to sort, then you will be redirected back to the same query without the sort-order.

If you want to view the first full record in a result set, then add an “rn” element to the url:

http://beta.copac.ac.uk/search?au=sutter&rn=1

Similarly, to view the second page of brief records:

http://beta.copac.ac.uk/search?au=sutter&page=2
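For anyone wanting to generate these urls from a program, here is a minimal sketch in Python. It only uses the elements shown above (“au”, “sort-order”, “rn” and “page”); the helper name search_url and the trick of turning underscores into hyphens are my own illustration and not part of Copac.

```python
from urllib.parse import urlencode

# Base address of the beta search script, as given in the examples above.
BASE_URL = "http://beta.copac.ac.uk/search"


def search_url(**params):
    """Build a stateless search url from keyword arguments.

    Hypothetical helper: underscores in keyword names become hyphens,
    so sort_order="ti" is sent as the "sort-order" element.
    """
    query = {name.replace("_", "-"): value for name, value in params.items()}
    return BASE_URL + "?" + urlencode(query)


# The four example urls from the text, built without any session state.
print(search_url(au="sutter"))
print(search_url(au="sutter", sort_order="ti"))
print(search_url(au="sutter", rn=1))
print(search_url(au="sutter", page=2))
```

Because everything needed to display the result is in the url itself, the same call always produces the same address, which is what makes it safe to store, bookmark or hand to another program.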

All the above urls return an HTML display — not what you want for machine to machine communication. So, to get some programmer friendly XML you can add the “format” element to the url:

http://beta.copac.ac.uk/search?au=sutter&page=1&format=XML+-+MODS

The above url returns a page of MODS XML records. A page, by default, is 25 records. If you’d prefer more or fewer records in a page, then you can set the page size by sending a “Page-size” header with the HTTP request. And, so that you know how large the result is, a “Result-set-size” header is returned with the HTTP response when a “format” is specified in the url.
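As a rough sketch of how a client might use the two headers, the Python below prepares a request for a page of MODS XML and reads “Result-set-size” from the response. The function names are hypothetical, and I am assuming that “Page-size” takes a plain integer; the exact value format isn’t spelled out above. Nothing is sent over the network until fetch_page is called.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base address of the beta search script, as given in the examples above.
BASE_URL = "http://beta.copac.ac.uk/search"


def build_request(author, page=1, page_size=None, record_format="XML - MODS"):
    """Prepare (but do not send) a request for one page of records.

    Hypothetical helper. urlencode turns "XML - MODS" into "XML+-+MODS",
    matching the "format" element in the example url.
    """
    query = urlencode({"au": author, "page": page, "format": record_format})
    request = Request(BASE_URL + "?" + query)
    if page_size is not None:
        # Assumption: the header value is a plain integer.
        request.add_header("Page-size", str(page_size))
    return request


def fetch_page(request):
    """Send the request and return (result_set_size, body).

    result_set_size is the raw "Result-set-size" header value,
    or None if the server did not send one. Needs network access.
    """
    with urlopen(request) as response:
        return response.headers.get("Result-set-size"), response.read()


request = build_request("sutter", page_size=50)
print(request.full_url)
print(request.get_header("Page-size"))
```

Keeping the page size in a header rather than in the url means the url still identifies the same query whatever page size a particular client prefers.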

You can, of course, specify a “sort-order” along with a “format”. You’ll be able to discover the various query fields, sort and format options by delving around the user interface and performing a few queries. I’m not going to document them here and now as it is all still beta and they may change before we go live.

An enhanced Marked List

As part of the D2D work we are enhancing the functionality of the “Marked List” feature in Copac. The Marked List allows you to save records from your search session, for downloading or emailing to yourself in a variety of formats. One of the drawbacks to the Marked List is that it is linked to your search session. That means that when you come back to Copac tomorrow, the records you saved today will have gone.

So, one enhancement is to make your List of saved records permanent, so when you come back next week, everything you saved last week is still there. The downside to this is that you will need to log in so that we know who you are and which are your records. If you don’t want to log in to use Copac, then you will still be able to; you just won’t get the facility of a permanent Marked List.

The current plan is to provide an API to the Marked List and it seems most sensible to use the Atom Publishing Protocol (APP). One of the nice side effects of using APP is that you’ll get an Atom feed of the records you’ve saved, plus you’ll be able to manage your collection of records with a suitable APP client outside of the Copac web site. Your Marked List will be private to you, though we will look at adding an option to publish your List to make it public.

The fly in the ointment of all this might be Shibboleth (the UK Academic access management mechanism.) It isn’t clear to me if an Atom feed is going to work in a Shibbolized environment. I hope to have something to test soon and I’ll keep you informed…

Persistent identifiers for Copac records

If you know the record number of a Copac record, there is now a simple url that will return you the record in MODS XML format. The urls take the following form: http://copac.ac.uk/crn/<record-number>. For instance, the work “China tide : the revealing story of the Hong Kong exodus to Canada” has a Copac Record Number of 72008715609 and can be linked to with the url http://copac.ac.uk/crn/72008715609.
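A small Python sketch of building such a link from a record number follows. The function names are made up for illustration, and the check that a record number consists only of digits is an assumption based on the single example above.

```python
from urllib.request import urlopen

# Prefix of the persistent record urls described above.
CRN_BASE = "http://copac.ac.uk/crn/"


def record_url(record_number):
    """Return the persistent url for a Copac Record Number.

    Assumption: record numbers are made up of digits only,
    as in the one example given in the text.
    """
    number = str(record_number).strip()
    if not number.isdigit():
        raise ValueError("unexpected record number: %r" % (record_number,))
    return CRN_BASE + number


def fetch_mods(record_number):
    """Fetch the MODS XML for a record as bytes. Needs network access."""
    with urlopen(record_url(record_number)) as response:
        return response.read()


# The "China tide" example from the text.
print(record_url(72008715609))
```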

Over the next few weeks we’ll be looking at adding these links to the Copac Full record pages and also introducing links to bookmarking web sites such as del.icio.us.

Search Solutions 2008

On Tuesday last I attended “Search Solutions 2008”, organised by the BCS-IRSG, and to quote from the event programme, “Search Solutions is a special one-day event dedicated to the latest innovations in information search and retrieval.” The format of the day was a series of short talks, 11 in all, each about 20 minutes in length with the chance for questions from the audience after each talk.

One of the themes through the day was the linguistic analysis of texts such as blog posts and web pages: in other words, deducing the correct meaning of a word like Georgia. Is it referring to someone called Georgia, the country that used to be part of the USSR, or the US state? As all the speakers were from commercial companies, no-one was giving their secrets away, but approaches mentioned ranged from Bayesian analysis to a team of 50 linguistic experts.

Another theme was how social networking can help users find what they’re looking for. User recommendations and tagging were both cited frequently in this regard. Elias Pampalk from last.fm gave a very interesting talk on how tagging is being used on last.fm. They have made it very easy for users to tag. Adding a tag usually involves no typing, just a couple of mouse clicks to select either a tag you’ve used before or a tag someone else has used for that item. There is also an incentive for people to tag at last.fm, as it can help you discover new music and connect you to people with similar tastes. They seem to have gotten it right, as they are collecting over 2.5 million tags per month.

At the end of his talk, Elias mentioned that last.fm had an open API, which I had never realised before. This got me wondering if we could provide links from Copac to last.fm. This perhaps isn’t as strange an idea as it may first seem. Copac doesn’t hold records for just books; we have many records in the database for CDs and sheet music. It might be kind of neat to provide a link from those records to last.fm’s page about the artist or album and perhaps pull in images as well. Something to think about when we can find a bit of spare time.

Overall it was a very interesting day with many thought-provoking talks and I’d happily attend a similar day next year.

To Google or not to Google [with update]

As Ashley has just posted, we’ve just reinstated the links to Google Books that were appearing in the right-hand column of relevant records. Back in March we were pleased to be among the throng of those incorporating the new Google Books API. If Google’s mission is to ‘organize the world’s information and make it universally accessible and useful,’ who are we to argue? What self-respecting library service wouldn’t want to be a part of a project that promotes the Public Good?

Then something unusual happened — we got complaints. Not a great many, but still a vociferous few who questioned why Copac would give Google ‘personal data’ about them as users. Several of us in the team went back and forth over whether this was actually the case. My own opinion was that a) this was not ‘personal’ data, but usage data, and therefore not a threat to any individual’s privacy, and b) even if we were giving Google a little bit of something about our users and how they behaved, what does it matter if the trade-off is an improved system? Nonetheless, we went ahead and added that small script so that Google only spoke to the Copac server. No dice.

I was not all that surprised that our attempt at a workaround wasn’t effective (it would have been nice to have heard something back officially from Google on this front, but we’ll live). I am still wondering if it matters, though. Does it make sense that we ‘pay’ Google for this API by giving them this information about Copac users: their IP addresses and the ISBNs of books they look at? (Is this, in fact, what we’re doing? Paying them?) Isn’t all this just part of the collective move toward the greater Public Good that the entire Google Books Search project seems to be about?

Ultimately, right now, yes. This is the trade-off we’re willing to make. So we’ve reinstated the links, but also added an option under Preferences for now to allow users to de-googlise their searches. Turning off the feature for good would be reactionary to say the least (and perhaps, more to the point, in the political landscape in which Copac operates, *seen* as reactionary). Right now, if you’re in the ‘Resource Discovery’ business, then a good relationship with the most ubiquitous and powerful search engine in the world is of no small importance.

Indeed, behind the scenes, our colleagues at RLUK have been working with Google on our behalf to sign an agreement which will mean that Google can spider Copac records. The National Archives has recently done this, and from what I hear anecdotally from people there, it’s already having a dramatic impact on their stats: thousands of users are discovering TNA records through Google searches, and so discovering a resource they might not have known about before. We are hoping that users will have a similar experience with Copac, especially those looking for unique and rare items held in UK libraries that might not surface through any other means. We are eager to see what sort of impact a Google gateway to Copac will have, and we know it can only enhance the exposure of the collections. We’re also exploring this option for The Archives Hub.

Of course, this also means that Google gets to index more information about Copac web searches. David Smith’s article last week “Google, 10 years on. Big Friendly Giant or Greedy Goliath?” highlights some of the broader concerns about this. To what extent should we be concerned about the fact that a corporation is hoovering up information about our usage behaviour? I am always suspicious of overblown language surrounding technology, and Smith’s article does invoke quite a number of metaphors connoting a dark and grasping Google that we’d better start keeping an eye on, “Google’s tentacles are everywhere.”

But invocations of the ‘Death Star’ notwithstanding (!) I think we’re all learning to be a bit more cautious about our approach to Google. It may not be the Dark Lord, but it’s no ‘Big Friendly Giant’ either. For now, we’re pleased to be able to plug in Google’s free API (thank you, Google) and that Copac will soon be searchable via the engine. But nothing is entirely free, or done for entirely altruistic purposes; this is business after all. We just have to keep that in mind and talk constructively and openly about what we’re willing to pay.

[Updated to add: Likely much too late in the game, but I’ve just spent an hour or so listening to The Library 2.0 Gang’s podcast with Frances Haugen, product manager for the Google Book Search API. Tim Spalding (LibraryThing) and Oren Beit-Arie (Ex Libris) were among those to pose some of the tougher questions surrounding the API, specifically the fact that it only works client-side and forces the user into the Google environment. According to Frances, future developments will include a server-side API, and an ultimate goal is to move to a place where the API can be used to mash up data in new interface contexts. We’ll certainly be watching this space :-)]