Google as your federated search interface

For a while now I've noticed that journal articles from some not-for-profit (but not free) content delivery projects--like JSTOR and Project MUSE--have populated my Google search results. But the implications of that never sank in until I sat in on recent presentations by salespeople from both Thomson Gale and EBSCO, both of whom say they are working with Google on a similar capability: results from their databases will appear in Google results.

Apparently, how this would work is still under development; result placement algorithms and the point at which user verification occurs are among the major open issues. (I wonder how the JSTOR and MUSE hits always seem to rank high?)

But still, the idea of Google being the interface for federated searching makes sense in a context where libraries are constantly resisting the user behavior of going to Google first and ending their information searches with what is found there. Making other federated search interfaces as simple as Google may not be enough--if users go to Google anyway. And I have to confess that I've been one to protest against the dumbing down of federated search interfaces; but Google is the elephant in the room, and can't be ignored.


I think there's quite a bit more to this than the usual "Sadly, we have no choice but to go where our users want us to go, even if it means degrading our services." I've been thinking about this a lot lately.

1. It's seemed to me that federated search just doesn't work very well. You're suggesting current federated search interfaces have been 'dumbed down' because people think users want it dumb---it's always seemed to me that federated search interfaces are dumb simply because the technology is dumb. Hypothesis: it's not currently possible to provide a smart federated search interface. True or false?

2. Current federated searching for scholarly information generally relies on searching multiple repositories at once at the point of request, and then merging the results ("cross-search"). I think recent history has shown that this just doesn't work, and that's part of why current federated search technology is so 'dumb'. The better way to do federated search is a harvest: an ingest of all the (meta)data into a single store. That's what Google does, and as a result Google can do things that we simply can't do with cross-search 'federated search'. Note well also: that's what OAI-PMH does too, and the development of OAI-PMH in fact came out of the realization that cross-search (the Z39.50 model) just doesn't work that well for federated search. [See "Cross search or harvest" at http://www.oaforum.org/tutorial/english/page2.htm].

3. Our content providers and vendors won't provide us that data to ingest in a harvest. That is why, even if we buy new products coming out in the library market intended to harvest metadata to a single store, we can't use them on our licensed content. Because the content providers want to keep control of that data, we have to access it through their interfaces (for as long as we keep paying for it); we can't download the data ourselves (in which case we'd still have it even if we stopped paying).

4. Aha! But the market force of the 900 pound gorilla that is Google means that these content providers are turning around: they ARE providing harvestable metadata. (At least they're saying they will. We'll see.) But not to us. To Google. So how will this be done? At least some stuff will be on the open web. This is a very welcome thing, for a couple of reasons. First, as libraries, the more freely available information and metadata, the better! Second, if it's on the free web, it's possible we can harvest it too, just like Google. However, being on the free web isn't enough for convenient harvesting---really, they should be making it available via OAI-PMH or similar.

5. But in addition to technical barriers, I'm worried that the content providers are going to be providing certain metadata to Google that they _don't_ put on the free web (or provide to us in machine-readable or harvestable format). I have already noticed some things in Google that look like this---Google can provide a title and excerpt matching my query, but when I click on it, I just get a come-on from the content provider asking me to pay $5 for the article. [With EBSCO and Wilson at least, they plan to also give me a link to my link resolver--if I have it configured properly, which is another story.] This is what we should really be worried about---not that users are using Google, but that Google is monopolizing data.

6. Because there ARE things that Google (Scholar) search does better than our federated search (cross-search). Our federated search is ridiculously slow compared to Google. For many queries, Google is much better at relevancy ranking a 'merged' set of results than our federated search. With Google, you can search the contents of literally hundreds of thousands of dbs at once; with our cross-search you get maybe half a dozen (and the product we use can only get away with grabbing the first couple dozen results from each at once!). There's a reason our users are flocking to Google, and it's not _only_ because of 'brand consciousness' OR laziness. There are also things our searches do better than Google: knowing what the coverage is of the corpus covered by your search, completeness, limiting to certain disciplinary domains, etc. [Sadly, I don't think our cross-searches do THAT much better a job of fielded or boolean searching, or limiting to what is actually available (whether in full text or physically)].
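
To make the cross-search versus harvest contrast from points 2 and 6 concrete, here is a rough toy sketch in Python. Everything in it is made up for illustration: the two "databases", the record names, and especially the relevance scores (real sources generally don't hand back scores you can compare across vendors, which is part of the problem). It only shows how grabbing the first few hits from each source before merging can lose records that a single harvested store would have ranked highly.

```python
from typing import Dict, List

# Hypothetical toy data: each "database" maps a record id to a relevance
# score for one fixed query. The scores are invented for illustration.
SOURCES: Dict[str, Dict[str, float]] = {
    "db_a": {"a1": 0.9, "a2": 0.8, "a3": 0.7},
    "db_b": {"b1": 0.4, "b2": 0.3, "b3": 0.2},
}


def cross_search(sources: Dict[str, Dict[str, float]],
                 per_source_limit: int, top_n: int) -> List[str]:
    """Query every source at request time, keep only the first few hits
    from each, then merge and re-rank what was fetched."""
    merged = []
    for records in sources.values():
        ranked = sorted(records.items(), key=lambda kv: kv[1], reverse=True)
        merged.extend(ranked[:per_source_limit])  # the rest is never seen
    merged.sort(key=lambda kv: kv[1], reverse=True)
    return [record_id for record_id, _ in merged[:top_n]]


def harvested_search(sources: Dict[str, Dict[str, float]],
                     top_n: int) -> List[str]:
    """Ingest everything into one store ahead of time, then rank once
    over the full set."""
    store: Dict[str, float] = {}
    for records in sources.values():
        store.update(records)
    ranked = sorted(store.items(), key=lambda kv: kv[1], reverse=True)
    return [record_id for record_id, _ in ranked[:top_n]]


if __name__ == "__main__":
    print(cross_search(SOURCES, per_source_limit=1, top_n=3))
    print(harvested_search(SOURCES, top_n=3))
```

With a per-source limit of 1, the cross-search never sees "a2" or "a3", even though both outrank everything in the second database; the harvested version ranks over all six records at once.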

To make our own federated searches have the best of both worlds---to make them actually work well, as well as to in fact _smarten them up_ (not dumb them down), we require moving to a harvest instead of a cross search model. [We also need harmonized metadata, but that's an even harder problem]. It is our content providers (and their business model) that, sadly, stand in the way of this. Now of course, we need our content providers to stay in business, we need them to have a business model that works. But we also need to be able to harvest that metadata (and even full text data) for indexing, in order to provide a better search. That the existence of Google is pushing these content providers to take steps in that direction is actually pretty huge. But there are significant concerns.
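
For anyone who hasn't looked at what harvesting via OAI-PMH involves, here is a minimal sketch, again in Python and only under stated assumptions. The `ListRecords` verb, the `oai_dc` metadata prefix, and `resumptionToken` paging come from the OAI-PMH protocol itself; the base URL, the function names, and the sample response are hypothetical, and a real harvester would also need HTTP fetching, error handling, and incremental (date-based) harvesting.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

# XML namespaces used by OAI-PMH responses carrying Dublin Core records.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample response, trimmed to one record, so the sketch can
# run without a network connection.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url: str,
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL. A follow-up request carries only
    the resumption token, not the metadata prefix."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_titles(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Pull the Dublin Core titles and the next resumption token (if any)
    out of one ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iterfind(".//dc:title", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return titles, token


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))  # hypothetical endpoint
    print(parse_titles(SAMPLE_RESPONSE))
```

The loop a harvester would run is: request, parse, store the records locally, and repeat with the returned token until none comes back. The local store is then what gets indexed and searched.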

Sorry for such a long missive, hope it's helpful.
