Auto-complete considered harmful?

Behind the scenes we’ve been creating new versions of Copac that use relational database technology (the current version of Copac doesn’t use a relational database). It’s a big change which has kept me busy for a long time now. One of the things we thought it would be nice to do with all this structured data is to have fields on our web search forms offer suggestions (or auto-complete) as the user types.

It turned out that implementing auto-complete was very easy thanks to jQuery UI. Below is a screen shot (from my test interface) showing the suggestions that auto-complete offers after typing “sha” in the author field.

The suggestions are ordered by how frequently the name appears in the database. So in the screen shot above, “Shakespeare, William, 1564-1616” is the most frequently occurring name starting with the letters “sha” in my test database.
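As a rough illustration of that ordering, here is a minimal Python sketch. It is not the Copac code: the function name `suggest_authors` and the sample headings are made up for the example, and a real implementation would query the database rather than a list in memory.

```python
from collections import Counter

def suggest_authors(headings, prefix, limit=10):
    """Return up to `limit` headings starting with `prefix`, most frequent first."""
    prefix = prefix.casefold()
    counts = Counter(h for h in headings if h.casefold().startswith(prefix))
    # most_common() sorts by count, highest first; ties keep first-seen order
    return [heading for heading, _ in counts.most_common(limit)]

# Hypothetical author headings, one entry per catalogue record
headings = [
    "Shakespeare, William, 1564-1616",
    "Shaw, Bernard, 1856-1950",
    "Shakespeare, William, 1564-1616",
    "Sharp, Cecil J.",
    "Shakespeare, William, 1564-1616",
    "Shaw, Bernard, 1856-1950",
    "Tolstoy, Leo",
]

print(suggest_authors(headings, "sha"))
# ['Shakespeare, William, 1564-1616', 'Shaw, Bernard, 1856-1950', 'Sharp, Cecil J.']
```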

(By the way, these example screen shots are from a test database of about 5 million records selected in a very un-random way from seven of our contributing libraries.)

Having done the Author auto-complete I started thinking about how we would present suggestions for a Title auto-complete popup. It didn’t seem useful to present the user with an alphabetical list of titles, nor did it seem much more useful to present the most commonly occurring titles. I thought we could relatively easily log which records users view and then present the suggestions ranked according to how often a title has been viewed.

Then I thought that if a user has already selected an author from the Author auto-complete suggestions, it only makes sense to suggest titles that are by the selected author. For example, a user has selected Shakespeare from the author auto-complete suggestions. They then type “lo” in the title field. It would be pointless and counter-intuitive to list “Lord of the Rings” in the title suggestions; what we should show is “Love’s Labour’s Lost”.

But then, by the time you’ve created that list of suggestions for the user you’ve pretty much done their search for them already. So why not just show them the search results straight away? Google are doing this now with their Instant search results. Well, as hip and sexy as that sounds, I don’t think we can go there. For a start I don’t think we have the compute horsepower to make it as instant as Google do, and there are fundamental data problems which make it very hard for us to do well.
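To make the author-restricted title suggestions concrete, here is a small Python sketch of the idea, ranking by how often a title has been viewed. Everything in it is an assumption made for illustration: the function `suggest_titles`, the record layout and the view counts are invented, and we have no such view log yet.

```python
def suggest_titles(records, prefix, author=None, limit=10):
    """Suggest titles starting with `prefix`, most viewed first.

    If `author` is given, only titles by that author are considered.
    """
    prefix = prefix.casefold()
    views = {}
    for rec in records:
        if author is not None and rec["author"] != author:
            continue
        if rec["title"].casefold().startswith(prefix):
            views[rec["title"]] = views.get(rec["title"], 0) + rec["views"]
    # Highest view count first; alphabetical order breaks ties
    return sorted(views, key=lambda title: (-views[title], title))[:limit]

# Hypothetical records with made-up view counts
records = [
    {"author": "Shakespeare, William, 1564-1616", "title": "Love's Labour's Lost", "views": 12},
    {"author": "Tolkien, J. R. R.", "title": "Lord of the Rings", "views": 90},
    {"author": "Shakespeare, William, 1564-1616", "title": "Hamlet", "views": 40},
]

print(suggest_titles(records, "lo"))
# ['Lord of the Rings', "Love's Labour's Lost"]
print(suggest_titles(records, "lo", author="Shakespeare, William, 1564-1616"))
# ["Love's Labour's Lost"]
```

Note that the second call has effectively already run an author search, which is the point made above: producing the filtered suggestions costs about as much as producing the results.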

So, going back to the Author auto-suggestions, let’s look at what happens when I type “tol” in the author field:

Again, the author suggestions look very nice, but unfortunately the list contains Leo Tolstoy twice: at the top of the list as “Tolstoy, Leo, graf, 1828-1910” and at the bottom of the list as “Tolstoy, Leo”. That’s because there’s no consistent Authority Control across our ~60 contributing libraries (and then there are all the typos to consider).

There are two ways we can turn a user selection from an auto-complete list into a search.

  1. We can turn the author name into a keyword search.
  2. Each of those names in the list has a unique database ID and we can search for records that have that author-ID.

If we do 2.) then selecting one form of the name Leo Tolstoy will only find records with that exact form and won’t find records that have the second (or third or fourth) form of the name. This will give the search a lot of precision but the recall is likely to be terrible.

If we do 1.) then the top ranking “Tolstoy, Leo, graf, 1828-1910” will only find a subset of our Tolstoy records. As there is a substantial set of records that don’t include “graf, 1828-1910”, a keyword search including those terms will miss those records entirely. If the user selected “Tolstoy, Leo” from the list they will likely find all the Leo Tolstoy records in the database (except those catalogued as “Tolstoy, L.” and those records with typos). The user may wonder why the name variant that finds most records is listed 10th, while the name listed first finds only a subset.
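The difference between the two approaches can be seen in a toy Python example. This is only a sketch under stated assumptions: the records, the author IDs and the functions `search_by_id` and `search_by_keywords` are invented here, and the keyword search is a simple "all words must be present" match, not our real search engine.

```python
import re

# Hypothetical records: (author_id, heading as catalogued)
records = [
    (1, "Tolstoy, Leo, graf, 1828-1910"),
    (1, "Tolstoy, Leo, graf, 1828-1910"),
    (2, "Tolstoy, Leo"),
    (3, "Tolstoy, L."),
]

def search_by_id(author_id):
    """Option 2: match only records carrying exactly this author ID."""
    return [heading for rec_id, heading in records if rec_id == author_id]

def tokens(text):
    """Split a heading into lower-cased words and numbers."""
    return set(re.findall(r"\w+", text.casefold()))

def search_by_keywords(heading):
    """Option 1: match records containing every word of the chosen heading."""
    wanted = tokens(heading)
    return [h for _, h in records if wanted <= tokens(h)]

print(len(search_by_id(1)))                                         # 2
print(len(search_by_keywords("Tolstoy, Leo, graf, 1828-1910")))     # 2
print(len(search_by_keywords("Tolstoy, Leo")))                      # 3
```

In this toy data the shorter form finds more records than the longer one, and neither approach finds “Tolstoy, L.”.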

Maybe we could get around these problems by only using the MARC $a subfield from the 100 and 700 tags. (The examples above are using 100 $a$b$c$d.) Doing that would remove all the additions to names such as “Sir” and the dates. That would probably be okay for authors with distinctive names, but could merge lots of authors with common names. It would reduce search precision and increase recall.
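Here is a short Python sketch of what keeping only $a would do to the suggestion list. The representation of a MARC field as a list of (subfield code, value) pairs, the function `name_only` and the sample headings are all assumptions for illustration, not how our loader actually stores the data.

```python
from collections import Counter

def name_only(subfields):
    """Keep only the $a subfield(s) and trim trailing commas and spaces."""
    parts = [value for code, value in subfields if code == "a"]
    return " ".join(parts).rstrip(" ,")

# Illustrative 100 fields as (subfield code, value) pairs
fields = [
    [("a", "Tolstoy, Leo,"), ("c", "graf,"), ("d", "1828-1910")],
    [("a", "Tolstoy, Leo")],
    [("a", "Smith, John,"), ("d", "1580-1631")],
    [("a", "Smith, John,"), ("d", "1618-1652")],
]

counts = Counter(name_only(field) for field in fields)
print(counts)
# Counter({'Tolstoy, Leo': 2, 'Smith, John': 2})
```

The two Tolstoy forms collapse into one suggestion, which is what we want, but so do two different people called John Smith, which is the loss of precision mentioned above.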

So far I’ve only considered auto-complete on author and title fields. The Copac search forms have many fields and I’m not sure we have the facilities or compute power to inter-relate all the auto-complete suggestions so that the user only sees suggestions that make sense according to the fields the user has already filled in.

If we could inter-relate all the fields on our search forms we would probably know the search result before the user hit the search button. So what would be the point of having a search button anyway? That brings us back to the Google Instant search type of interface.

What should we do?

  • We could just not bother trying to inter-relate the auto-complete suggestions and let users select mutually incompatible suggestions. (Which seems rather unhelpful.)
  • We could not do auto-complete at all. (Again, this seems unhelpful at first sight, but may be better as the auto-complete seems to effect an increase in search precision which may not be useful against a database containing very variable quality data.)
  • We could have just a single field on our search form. (Much easier to program, but not what our users tell us they want.)
  • We could just offer auto-complete on two or three fields and inter-relate them. (To make this work I think we’d have to make the suggestions as imprecise as we can without them being a waste of space.)

I hope the above ramblings make some sense. If anyone has thoughts on this issue we’d like to hear your views.