Surfacing the Academic Long Tail — Announcing new work with activity data

We’re pleased to announce that JISC has funded us to work on the SALT (Surfacing the Academic Long Tail) Project, which we’re undertaking with the University of Manchester’s John Rylands University Library (JRUL).

Over the next six months the SALT project will be building a recommender prototype for Copac and the JRUL OPAC interface, which will be tested by the communities of users of those services. Following on from the invaluable work undertaken at the University of Huddersfield, we’ll be working with more than ten years of aggregated and anonymised circulation data amassed by JRUL. Our approach will be to develop an API onto that data, which in turn we’ll use to develop the recommender functionality in both services. Obviously, we’re indebted to the knowledge acquired by a similar project at the University of Huddersfield, and the SALT project will work closely with colleagues there (Dave Pattern and Graham Stone) to see what happens when we apply this concept in the research library and national library service contexts.
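
To give a flavour of the idea, here is a minimal sketch in Python; the data shapes and names are ours for illustration only, not the actual JRUL data or the eventual API design. At its heart, a circulation-data recommender is a simple co-occurrence count: items frequently borrowed by the same anonymised borrowers as the item in hand become candidate recommendations.

```python
from collections import Counter, defaultdict

# Illustrative only: anonymised loan records as (borrower_id, item_id) pairs.
# The real JRUL data and the API on top of it will differ; this just shows
# the co-occurrence idea behind the recommender.
loans = [
    ("b1", "item-A"), ("b1", "item-B"), ("b1", "item-C"),
    ("b2", "item-A"), ("b2", "item-B"),
    ("b3", "item-B"), ("b3", "item-D"),
]

# Map each anonymised borrower to the set of items they have borrowed.
items_by_borrower = defaultdict(set)
for borrower, item in loans:
    items_by_borrower[borrower].add(item)

def also_borrowed(item_id, limit=5):
    """Items most often borrowed by people who also borrowed item_id."""
    counts = Counter()
    for items in items_by_borrower.values():
        if item_id in items:
            counts.update(items - {item_id})
    return counts.most_common(limit)

print(also_borrowed("item-A"))  # [('item-B', 2), ('item-C', 1)]
```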

Our overall aim is that by working collaboratively with other institutions and Research Libraries UK, the SALT project will advance our knowledge and understanding of how best to support research in the 21st century. Libraries are a rich source of valuable information, but sometimes the sheer volume of materials they hold can be overwhelming even to the most experienced researcher — and we know that researchers’ expectations of how to discover content are shifting in an increasingly personalised digital world. We know that library users — particularly those researching niche or specialist subjects — are often seeking content based on a recommendation from a contemporary, a peer, a colleague or an academic tutor. The SALT Project aims to give libraries the ability to offer users those recommendations. Similar to Amazon’s ‘customers who bought this item also bought…’, the recommendations will appear on a local library catalogue and on Copac, and will be based on circulation data gathered over the past ten years at the University of Manchester’s internationally renowned research library.

How effective will this model prove to be for users — particularly humanities researchers?

Here’s what we want to find out:

  • Will researchers in the field of humanities benefit from receiving book recommendations, and if so, in what ways?
  • Will the users go beyond the reading list and be exposed to rare and niche collections — will new paths of discovery be opened up?
  • Will collections in the library, previously undervalued and underused, find a new appreciative audience — will the Long Tail be exposed and exploited for research?
  • Will researchers see new links in their studies, possibly in other disciplines?

We also want to consider whether there are other potential beneficiaries. By highlighting rarer collections, valuing niche items and bringing to the surface less popular but nevertheless worthy materials, libraries will have the leverage they need to ensure the preservation of these rich materials. Can such data or services assist in decision-making around collections management? We will be consulting with Leeds University Library and the White Rose Consortium, as well as UKRR, in this area.

And finally, as part of our sustainability planning, we want to look at how scalable this approach might be for developing a shared aggregation service of circulation data for UK University Libraries. We’re working with potential data contributors such as Cambridge University Library, the University of Sussex Library and the M25 Consortium, as well as RLUK, to trial and provide feedback on the project outputs, with specific attention to the sustainability of an API service as a national shared service for HE/FE that supports academic excellence and drives institutional efficiencies.

The SALT Project – Supporting Researchers in the 21st Century.

Libraries are a rich source of valuable information, but sometimes the sheer volume of materials they hold can be overwhelming even to the most experienced researcher. Sometimes what library users crave the most is a recommendation from a contemporary, a peer, a colleague or an academic tutor. The SALT (Surfacing the Academic Long Tail) Project aims to give libraries the ability to provide you with those recommendations. Similar to Amazon’s ‘customers who bought this item also bought…’, the recommendations on this system will appear on a local library catalogue and on Copac, and will be based on circulation data which has been gathered over the past ten years at the University of Manchester’s internationally renowned research library. What the SALT project wants to find out is: will researchers in the field of humanities benefit from receiving book recommendations? Will users go beyond the reading list and be exposed to rare and niche collections? Will collections in the library, previously undervalued and underused, find a new appreciative audience? Will researchers see new links in their studies, possibly in other disciplines? And as a result, could this improve the quality of research, improve grades and advance knowledge? The users of libraries are not the only beneficiaries of this project. By highlighting rarer collections, valuing niche items and bringing to the surface less popular but nevertheless worthy materials, libraries will have the leverage they need to ensure the preservation of these rich materials.

Over the next six months the SALT project will build a recommender prototype. It will be tested on the University of Manchester’s own local library catalogue and, for a national audience, on Copac. The project is indebted to the knowledge acquired by a similar project at the University of Huddersfield, and the SALT project will work closely with colleagues at Huddersfield to take this concept to the next level. Users and librarians will be invited to try the prototype and feed back their thoughts to the developers. By working collaboratively with other institutions and Research Libraries UK, the SALT project will advance our knowledge and understanding of how best to support research in the 21st century.

Auto-complete considered harmful?

Behind the scenes we’ve been creating new versions of Copac that use relational database technology (the current version of Copac doesn’t use a relational database). It’s a big change which has kept me busy for a long time now. One of the things we thought it would be nice to do with all this structured data is to have fields on our web search forms offer suggestions (or auto-complete) as the user types.

It turned out that implementing auto-complete was very easy thanks to jQuery UI. Below is a screenshot (from my test interface) showing the suggestions that auto-complete offers after typing “sha” in the author field.

The suggestions are ordered by how frequently the name appears in the database. So in the screenshot above, “Shakespeare, William, 1564-1616” is the most frequently occurring name starting with the letters “sha” in my test database.

(By the way, these example screenshots are from a test database of about 5 million records selected in a very non-random way from seven of our contributing libraries.)
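
For the curious, the server side of this can be as simple as a prefix match ordered by occurrence count. The sketch below uses an in-memory SQLite table of author headings with invented record counts; it isn’t our actual schema or code, just an illustration of the frequency ordering described above.

```python
import sqlite3

# Illustrative schema: one row per author heading, with a count of how many
# records carry that heading. Not the real Copac schema; counts are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_headings (heading TEXT, record_count INTEGER)")
conn.executemany(
    "INSERT INTO author_headings VALUES (?, ?)",
    [
        ("Shakespeare, William, 1564-1616", 12453),
        ("Shaw, George Bernard, 1856-1950", 3210),
        ("Sharpe, Tom", 187),
    ],
)

def suggest_authors(prefix, limit=10):
    """Return author headings starting with `prefix`, most frequent first."""
    cur = conn.execute(
        "SELECT heading FROM author_headings "
        "WHERE heading LIKE ? ORDER BY record_count DESC LIMIT ?",
        (prefix + "%", limit),
    )
    return [row[0] for row in cur]

print(suggest_authors("Sha"))  # Shakespeare first, as in the screenshot
```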

Having done the Author auto-complete I started thinking about how we would present suggestions for a Title auto-complete popup. It didn’t seem useful to present the user with an alphabetical list of titles, neither did it seem much more useful to present the most commonly occurring titles. I thought we could relatively easily log which records users view and then present the suggestions ranked according to how often a title has been viewed.

Then I thought that if a user has already selected an author from the Author auto-complete suggestions, it only makes sense to suggest titles that are by the selected author. For example, a user has selected Shakespeare from the author auto-complete suggestions. They then type “lo” in the title field. It would be pointless and counter-intuitive to list “Lord of the Rings” in the title suggestions; what we should show is “Love’s Labour’s Lost”. But then, by the time you’ve created that list of suggestions for the user, you’ve pretty much done their search for them already. So why not just show them the search results straight away? Google are doing this now with their Instant search results. Well, as hip and sexy as that sounds, I don’t think we can go there. For a start, I don’t think we have the compute horsepower to make it as instant as Google do, and there are fundamental data problems which make it very hard for us to do well.

So, going back to the Author auto-suggestions, let’s look at what happens when I type “tol” in the author field:

Again, the author suggestions look very nice, but unfortunately the list contains Leo Tolstoy twice: at the top of the list as “Tolstoy, Leo, graf, 1828-1910” and at the bottom of the list as “Tolstoy, Leo”. That’s because there’s no consistent Authority Control across our ~60 contributing libraries (and then there are all the typos to consider).

There are two ways we can turn a user selection from an auto-complete list into a search:

  1. We can turn the author name into a keyword search.
  2. Each of those names in the list has a unique database ID and we can search for records that have that author-ID.

If we do (2), then selecting one form of the name Leo Tolstoy will only find records with that exact form, and won’t find records that have the second (or third or fourth) form of the name. This will give the search a lot of precision, but the recall is likely to be terrible.

If we do (1), then the top-ranking “Tolstoy, Leo, graf, 1828-1910” will only find a subset of our Tolstoy records. As there is a substantial set of records that don’t include “graf, 1828-1910”, a keyword search including those terms will miss those records entirely. If the user selected “Tolstoy, Leo” from the list they would likely find all the Leo Tolstoy records in the database (except those catalogued as “Tolstoy, L.” and those records with typos). The user may wonder why the name variant that finds most records is listed 10th, while the name listed first finds only a subset.

Maybe we could get around these problems by only using the MARC $a subfield from the 100 and 700 tags. (The examples above are using 100 $a$b$c$d.) Doing that would remove all the additions to names such as “Sir” and the dates. That would probably be okay for authors with distinctive names, but could merge lots of authors with common names. It would reduce search precision and increase recall.
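
Here’s a rough sketch of what that $a-only indexing might look like, using the pymarc library purely for illustration (Copac’s processing doesn’t actually run on pymarc):

```python
from pymarc import MARCReader  # pymarc used purely for illustration here

def name_keys(marc_path):
    """Yield $a-only name keys from the 100/700 fields of each record.

    Indexing just $a ("Tolstoy, Leo,") rather than $a$b$c$d
    ("Tolstoy, Leo, graf, 1828-1910") collapses the name variants,
    at the cost of merging genuinely different authors who share a name.
    """
    with open(marc_path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # pymarc yields None for unreadable records
                continue
            for field in record.get_fields("100", "700"):
                for name in field.get_subfields("a"):
                    # Normalise trailing punctuation and case for matching.
                    yield name.rstrip(" ,.").lower()
```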

So far I’ve only considered auto-complete on author and title fields. The Copac search forms have many fields and I’m not sure we have the facilities or compute power to inter-relate all the auto-complete suggestions so that the user only sees suggestions that make sense according to the fields the user has already filled in.

If we could inter-relate all the fields on our search forms we would probably know the search result before the user hit the search button. So what would be the point of having a search button anyway? That brings us back to the Google Instant search type of interface.

What should we do?

  • We could just not bother trying to inter-relate the auto-complete suggestions and let users select mutually incompatible suggestions. (Which seems rather unhelpful.)
  • We could not do auto-complete at all. (Again, this seems unhelpful at first sight, but may be better, as auto-complete seems to effect an increase in search precision which may not be useful against a database containing very variable quality data.)
  • We could have just a single field on our search form. (Much easier to program, but not what our users tell us they want.)
  • Just offer auto-complete on two or three fields and inter-relate them. (To make this work I think we’d have to make the suggestions as imprecise as we can without them being a waste of space.)

I hope the above ramblings make some sense. If anyone has thoughts on this issue we’d like to hear your views.

Hardware move

The hardware move has gone relatively smoothly today. We’ve had some configuration issues that prevented some Z39.50 users from pulling back records and another configuration problem that meant a small percentage of the records weren’t visible. That should all be fixed now, but if you see something else that looks like a problem, then please let us know.

The DNS entry for copac.ac.uk was changed at about 10am this morning. At 4pm we’re still seeing some usage on the old hardware. However, most usage started coming through to the new machine very soon after the DNS change.

The changeover to the new hardware has involved a lot of preparation over many weeks. Now it’s done, we can get back to re-engineering Copac… a new database backend and new search facilities for the users.

Behind the Copac record 2: MODS and de-duplication

We left the records having been rigorously checked for MARC consistency, and uploaded to the MARC21 database used for the RLUK cataloguing service. Next they are processed again, to be added to Copac.

One of the major differences between Copac and the MARC21 database is that the Copac records are not in MARC21. They’re in MODS XML, which is

an XML schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. It is a derivative of the MARC 21 bibliographic format (MAchine-Readable Cataloging) and as such includes a subset of MARC fields, using language-based tags rather than numeric ones.

Copac records are in MODS rather than MARC because Copac records are freely available for anyone to download, and use as they wish. The records in the MARC21 database are not – they remain the property of the creating library or data provider. We couldn’t offer MARC records on Copac without getting into all sorts of copyright issues. Using MODS also means we have all the interoperability benefits of using an XML format.
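
To illustrate the “language-based tags rather than numeric ones” point, here is a hand-rolled sketch, not our actual conversion code, of a minimal MODS record built from a couple of MARC values:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def minimal_mods(title, author):
    """Build a (very) minimal MODS record: numeric MARC tags become readable
    element names such as titleInfo/title and name/namePart."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title  # from MARC 245 $a
    name = ET.SubElement(mods, f"{{{MODS_NS}}}name", type="personal")
    ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = author    # from MARC 100 $a
    return ET.tostring(mods, encoding="unicode")

print(minimal_mods("Hamlet", "Shakespeare, William, 1564-1616"))
```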

Before we add the records to Copac we check local data to ensure we’re making best use of available local holdings details, and converting local location codes correctly. Locations in MARC records will often be in a truncated or coded form, eg ‘MLIB’ for ‘Main Library’. We make sure that these will display in a format that will be meaningful to our users.
It is also at this point that we do the de-duplication of records for Copac. Now, Copac de-duplication garners very mixed reactions: some users think we aren’t doing enough de-duplication, and occasionally we get told that we’re doing too much! We can’t ever hope to please everyone, but we’re aware that the process isn’t perfect, and we’ll be reviewing and updating de-duplication during the re-engineering. We will also be exploring FRBR work-level de-duplication.

As I’ve mentioned in an earlier blog post, we don’t de-duplicate anything published pre-1801. So what do we do for the post-1801 records?

As new records come in we do a quick and dirty match against the existing records using one or more of ISBN, ISSN, title key and date. This identifies potential matches, which then go through a range of other exact and partial field matches. The exact procedure will vary depending on the type of material, so journals (for instance) will go through a slightly different process from monographs.
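
In outline, that first pass amounts to generating candidate match keys for each incoming record. The sketch below is purely illustrative; the field names and normalisation are invented here, and the real process is considerably more involved.

```python
import re

def match_keys(record):
    """Generate quick-and-dirty candidate match keys for a record.

    `record` is assumed here to be a dict with optional 'isbn', 'issn',
    'title' and 'date' values. Any existing record sharing a key with an
    incoming record becomes a candidate for the fuller exact and partial
    field matching described above.
    """
    keys = set()
    if record.get("isbn"):
        keys.add(("isbn", record["isbn"].replace("-", "")))
    if record.get("issn"):
        keys.add(("issn", record["issn"]))
    if record.get("title") and record.get("date"):
        # Title key: lower-cased, punctuation stripped, first few words only.
        words = re.sub(r"[^a-z0-9 ]", "", record["title"].lower()).split()
        keys.add(("title+date", " ".join(words[:4]), record["date"]))
    return keys
```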

Records that are deemed to be the same are merged, and for many fields the unique data from each record is indexed. This provides enhanced access to materials, e.g. a wider range of subject headings than would be present in any of the original records. The de-duplication process can thus result in the creation of a single enhanced record containing holdings details for a range of contributing libraries.
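
And a caricature of the merge step itself (again a sketch, not the production code): holdings from every matched record are kept, and repeatable fields such as subject headings are pooled.

```python
def merge_records(matched):
    """Consolidate a list of matched record dicts into one Copac-style record.

    Holdings from every contributing library are kept, and repeatable fields
    such as subject headings are pooled and de-duplicated, which is what gives
    the merged record a wider range of access points than any single original.
    """
    merged = dict(matched[0])
    merged["holdings"] = [h for rec in matched for h in rec.get("holdings", [])]
    merged["subjects"] = sorted({s for rec in matched for s in rec.get("subjects", [])})
    return merged
```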

As we create the Copac records we also check for the availability of supplementary content information for each document, derived from BookData. We incorporate this into the Copac record, further enhancing record content for both search and display, e.g. a table of contents, abstract or reviews.

Because the de-duplication process is fully automated it needs to err on the side of caution, otherwise some materials might disappear from view, subsumed into similar but unrelated works. This can mean that records which appear to be self-evident duplicates to a searcher are kept separate on Copac because of minor differences in the records. Changes made to solve one problem case could result in many other records being mis-consolidated. It’s a tricky balance.

However, there is another issue: the current load and de-duplication is a relatively slow process. We have large amounts of data flowing onto the database every day and restricted time for dealing with updates. Consequently, where a library has been making significant local changes to their data and we get a very large update (say 50,000 records), this will be loaded straight onto Copac without going through the de-duplication process.

This means that the load will, almost certainly, result in duplicate records. These will disappear gradually as they are pulled together by subsequent data loads, but it is this bypassing of the de-duplication procedure in favour of timeliness that results in many of the duplicate records visible on Copac. One of the aims of the re-engineering is to streamline the data load process, to avoid this update bottleneck and improve overall duplicate consolidation levels.

So, that’s the Copac record, from receipt to display. We hope you’ve enjoyed this look behind the Copac records. Anything else you’d like to know about? Tell us in the comments!

Thanks to Shirley Cousins for the explanation of the de-duplication procedures.

Issues searching other library catalogues

Some of you may have noticed that there is now a facility on the Copac search forms to search your local library catalogue as well as Copac. You’ll only see this option if you have logged into Copac and are from a supported library.

The searching of the local library catalogues and Copac is performed using the Z39.50 search protocol. Due to differences in local configurations, the queries we send to Copac and to the various library catalogues have to be constructed very differently.

When we built the Copac Z39.50 server, we tried to make it flexible in the type of query it would accept within the limitations imposed upon us by the database software we use. Our database software was made for keyword searching of full text resources. As such it is good at adjacency searches, but you can’t tell it you want to search for a word at the start of a field.

Systems built around relational databases tend to be the complete opposite in functionality. They often aren’t good at keyword searching, but find it very easy to find words at the start of a field.

The result is that we make our default search a keyword search, while some other systems default to searching for query terms at the start of a field. Hence, if we send exactly the same search to Copac and to a library catalogue, we can get very different results from the two systems. To try to get a consistent result we have to tweak the query sent to the library so that it performs a search as near as possible to that performed by Copac. Working out how to tweak (or transform, or mangle) the queries is a black art and we are still experimenting.

Stop word lists are also an issue. Some library systems like to fail your search if you search for a stop word. Better systems just ignore stop words in queries and perform the search using the remaining terms. The effect is that searching for “Pride and prejudice” fails on some systems because “and” is stop worded. To get around this we have to remove stop words from queries. But we first need to know what the stop words are.
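
As a small example of the kind of query mangling involved, here is a sketch with a made-up stop word list; in practice each target system needs its own, and we have to discover it first.

```python
# Hypothetical per-target stop word lists; in practice each library system
# has its own, and we have to find out what it is before we can strip them.
STOP_WORDS = {
    "library_x": {"and", "or", "the", "of", "a"},
}

def prepare_title_query(target, title):
    """Strip the target system's stop words from a title query so that,
    for example, "Pride and prejudice" doesn't fail outright on systems
    that reject stop-worded terms."""
    stops = STOP_WORDS.get(target, set())
    terms = [t for t in title.split() if t.lower() not in stops]
    return " ".join(terms)

print(prepare_title_query("library_x", "Pride and prejudice"))  # -> Pride prejudice
```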

The result is that the search of other library systems is not yet as good as it could be, though it will get better over time as we discover what works best with the various library systems that are out there.

Copac Beta can search your library too

One of the new features we are trialling in the new Copac Beta is the searching of your local institution’s library catalogue alongside Copac. To do this we need to know which institution you are from and whether or not your institutional library catalogue can be searched with the Z39.50 protocol.

To identify where you are from, we use information given to us during the login process. When you log in, your institution gives us various pieces of information about you, including something called a scoped affiliation. For someone logging in from, say, the University of Manchester, the scoped affiliation might be something like “student@manchester.ac.uk”.

Once we know where you are from, we search a database of institutional Z39.50 servers to see if your institution’s library is searchable. If it is, we can present the extra options on the search forms and, indeed, fire off any queries to your library catalogue.
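
In outline, the logic looks something like the sketch below; the hostname and the lookup table are invented for illustration.

```python
# Invented example data: in reality this lookup table is built from
# records harvested from the IESR.
Z3950_SERVERS = {
    "manchester.ac.uk": {"host": "z3950.library.example.ac.uk", "port": 210, "db": "MAIN"},
}

def local_catalogue_for(scoped_affiliation):
    """Map a scoped affiliation such as 'student@manchester.ac.uk' to the
    institution's Z39.50 target, or None if we don't know of a searchable
    catalogue for that institution."""
    _, _, scope = scoped_affiliation.partition("@")
    return Z3950_SERVERS.get(scope.lower())

print(local_catalogue_for("student@manchester.ac.uk"))
```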

Our database of Z39.50 servers is created from records harvested from the IESR. So, if you’d like your Institution’s catalogue available through Copac, make sure it is included in the IESR by talking to the nice people there.

Many thanks to everyone who tried the Beta interface early on and discovered that this feature mostly wasn’t working. You enabled us to identify some bugs and get the service working.

New Copac trial interfaces

We are beginning a major redevelopment of the Copac National, Academic, and Specialist library catalogue service. The first stage of this work will introduce a login version of Copac with a range of new personalised facilities. Alongside this we will retain an open-access version of Copac.

Building on the recent Copac Beta trial, we have two new Copac Beta trial interfaces.

Personalised Copac, as seen in the beta trial, now has a new addition in the form of ‘my local library’ search, which allows members of some universities to search their own library catalogue alongside Copac, giving a single result set. This requires you to login to Copac.

The new standard Copac is a streamlined service which allows you to search and export records without logging in to the personalised Copac. It also includes a new journal table-of-contents display (where available).

Both these interfaces can be accessed at http://beta.copac.ac.uk/, and the trial will be running until 26th July.

There is a very short feedback questionnaire for each interface. We would appreciate it if you could fill in the questionnaire, or just email the Copac helpdesk (copac@mimas.ac.uk) with any comments you may have.

Notes on (Re)Modelling the Library Domain (JISC Workshop).

A couple of weeks ago, I attended JISC’s Modelling the Library Domain Workshop. I was asked to facilitate some sessions at the workshop, which was an interesting but slightly (let’s say) ‘hectic’ experience. Despite this, I found the day very positive. We were dealing with potentially contentious issues, but I noted real consensus around some key points. The ‘death of the OPAC’ was declared and no blood was shed as a result. Instead I largely heard murmured assent. As a community, we might have finally faced a critical juncture, and there were certainly lessons to be learned in terms of considering the future of services such as Copac, which, as a web search service, would count in the Library Domain Model as a national JISC service ‘Channel’.

In the morning, we were asked to interrogate what has been characterised as the three ‘realms’ of the Library Domain: Corporation, Channels, and Clients. (For more explanation of this model, see the TILE project report on the Library Domain Model). My groups were responsible for picking apart the ‘Channel’ realm definition:

The Channel: a means of delivering knowledge assets to Clients, not necessarily restricted to the holdings or the client base of any particular Corporation. Channels within this model range from local OPACs to national JISC services and ‘webscale’ services such as Amazon and Google Scholar. Operators of channel services will typically require corporate processes (e.g. a library managing its collection, an online book store managing its stock). However, there may be an increasing tendency towards separation, channels relying on the corporate services of others and vice versa (e.g. a library exposing its records to channels such as Google or Liblime, a bookshop outsourcing some of its channel services to the Amazon marketplace).

In subsequent discussion, we came up with the following key points:

  • This definition of ‘channel’ was too library-centric. We need to work on ‘decentring’ our perspective in this regard.
  • We will see an increasing uncoupling of channels from content. We won’t be pointing users to content/data; rather, data/content will be pushed to users via a plethora of alternative channels.
  • Users will increasingly expect this type of content delivery. Some of these channels we can predict (VLEs, Google, etc) and others we cannot. We need to learn to live with that uncertainty (for now, at least).
  • There will be an increasing number of ‘mashed’ channels – a recombining of data from different channels into new bespoke/2.0 interfaces.
  • The lines between the realms are already blurring, with users becoming corporations and channels, etc.
  • We need a more fundamental rethinking of the OPAC as the primary delivery channel for library data. It is simply one channel, serving specific use-cases and business processes within the library domain.
  • Control. This was a big one. In this environment libraries increasingly devolve control of the channels their ‘clients’ use to access the data. What are the risks and opportunities to be explored around this decreasing level of control? What related business cases already exist, and what new business models need to evolve?
  • How are our current ‘traditional’ channels actually being used? How many times are librarians re-inventing the wheel when it comes to creating channels such as e-resource or subject-specialist resource pages? We need to understand this on a broad scale.
  • Do we understand how the channels libraries currently control and create might add value, in expected and unexpected ways? There was a general sense that we know very little in this regard.

There’s a lot more to say about the day’s proceedings, but the above points give a pretty good glimpse into the general tenor of the day. I’m now interested to see what use JISC intends to make of these outputs. The ‘what next?’ question now hangs rather heavily.

It’s Official — Copac’s Re-engineering

We’ve been hinting for a while now about significant changes being imminent for Copac, and I am now pleased to announce that we’ve had official word that we have secured JISC funding to overhaul the Copac service over the next year.

The major aim for this work is to improve the Copac user experience.  In the short term this will mean improving the quality of the search results.  More broadly, this will mean providing more options for personalising and reusing Copac records.

We’re going to be undertaking the work in two phases. We’re calling Phase 1 the ‘iCue Project’ (it stands for ‘Improving the Copac User Experience’). This work will be focused on investigating and proposing pragmatic solutions that improve the Copac infrastructure and end-user experience, and we’re going to be partnering with Mark Van Harmelen of Personal Learning Environments Ltd (PLE) in this work (Mark is also involved in the JISC TILE project, so we believe there’s a lot of fruitful overlap there, especially around leveraging the potential of circulation data a la Huddersfield). The second phase is really about doing the work — re-engineering Copac in line with the specifications defined in the iCue Project.

We see this work tackling three key areas for Copac:

(i) Interface revision: We’ll be redesigning Copac’s user interface, focusing on areas of usability and navigability of search results. We are aware that the sheer size of our database and our current system means that searches can return large, unstructured result sets that do not facilitate users finding what they need.  Addressing this is a major priority.  We’ll be building on the CERLIM usability report we recently commissioned (more on that in another post) and also drawing on the expertise of OPAC 2.0 specialists such as Dave Pattern.  We’ll also be working consistently with users (librarian users and researcher users) to monitor and assess how we’re doing.

(ii) Database Restructuring: A more usable user interface is going to critically rely on a suitable restructuring of Copac’s database. Particularly, we are centrally interested in FRBR (Functional Requirements for Bibliographic Records) as a starting point for a new database structure. We anticipate that whatever we learn as we undertake this piece of work will be of interest to the broader community, and plan to disseminate this knowledge, and update the community via this blog.

(iii)  De-duplication: The restructuring implies further de-duplication of Copac’s contents, and so we’re also developing a de-duplication algorithm.  Ideally we would like to see the FRBR levels of work, expression, manifestation and (deduplicated) item being supported, or a pragmatic version of the same.
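
By way of illustration of those FRBR levels, here is a sketch of one possible shape for such a structure (an illustration only, not a committed design):

```python
from dataclasses import dataclass, field
from typing import List

# One possible shape for the FRBR hierarchy:
# Work -> Expression -> Manifestation -> Item (the individual holding).

@dataclass
class Item:
    library: str            # holding library
    shelfmark: str

@dataclass
class Manifestation:
    publisher: str
    year: int
    isbn: str = ""
    items: List[Item] = field(default_factory=list)      # de-duplicated holdings

@dataclass
class Expression:
    language: str
    form: str               # e.g. "text", "spoken word"
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:
    title: str
    creator: str
    expressions: List[Expression] = field(default_factory=list)
```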

The end user benefits:
1. Searches are faster and more effective (Copac database is more responsive and robust; users are presented with a more dramatically de-duplicated results view)
2.  Search-related tasks are easier to perform (i.e. the flexibility of this system will support the narrowing/broadening of searches, faceted searching, personalising/sharing content)
3.  Access to more collections (Copac database is able to hold more content and continue to grow)

So there we have it.  It’s going to be quite a year for the Copac team.  If you have any questions, comments or suggestions you’d like us to take on board, do leave a comment here or email us.  (Not that this will be the only time we ask!) We can also be chatted to via twitter @Copac.