Archive for April, 2008

The 32nd ELAG Library Systems Seminar is about to start

Next week, from Monday till Wednesday, this blog will be a trifle more active than usual. I am participating in the annual seminar of the European Library Automation Group, and I will do my best to cover as much of the event as possible on this blog. Live, that is. For those who want to follow bits and pieces of the conference, there will be a live video stream at http://wurtv.wur.nl/presentations/roadkit4. This video stream only works with Internet Explorer (sorry, sorry, sorry…). The programme is posted here, so you can work out the times in your own time zone whenever you want to.

Anyway, if you want to follow me on Twitter, I will post all changes to the programme there, along with the exact start times of the major sessions.

Lastly, a message for the event bloggers. I expect at least a few. I propose to use the ELAG08 tag for this conference, so please apply this tag to all your blog posts, Flickr photos or YouTube videos. That should make it much easier to track the developments of this event.

Linking from Catalog of Wageningen UR Library to Google Books

Previously I announced that we made use of the Google Books API to link to the full text whenever possible. We only experienced two problems with this service. First, the quite frequent Google spam warnings, which have been partially resolved but still keep coming back. Second, we did not have the required OCLC or LCCN numbers for the pre-ISBN books in our catalog.

Thanks to OCLC Nederland, this second problem has now been circumvented as well. OCLC built a service to which we feed the PPN (Pica Production Number), which is available in our catalog, and which returns the OCLC number. We feed that number into the Google Books API, which determines the kind of electronic availability of the book; that in turn determines the link and text shown on the catalog record. Peter described this in more detail. Just another hooray for OCLC, since the service is now working.
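
The lookup chain described above (PPN to OCLC number to Google Books availability to catalog link) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code running in the catalog: the `PPN_TO_OCLC` table is a hypothetical stand-in for the OCLC service, whose real interface is not described here, the link texts are invented, and the request parameters (`bibkeys`, `jscmd=viewapi`, `callback`) and response fields (`preview`, `preview_url`, `info_url`) should be checked against the current Google Books documentation.

```python
import json
import re
from urllib.parse import urlencode

# Hypothetical stand-in for the PPN lookup service built by OCLC;
# the real interface is not described here. The numbers are made up.
PPN_TO_OCLC = {"123456789": "1234567"}

# Illustrative link texts per viewability level (assumed wording).
LINK_TEXT = {
    "full": "Full text at Google Books",
    "partial": "Preview at Google Books",
    "noview": "About this book at Google Books",
}


def viewability_url(oclc_number, callback="cb"):
    """Build a Google Books request URL for one OCLC number."""
    params = {
        "bibkeys": f"OCLC:{oclc_number}",
        "jscmd": "viewapi",
        "callback": callback,
    }
    return "https://books.google.com/books?" + urlencode(params, safe=":")


def parse_jsonp(body):
    """Strip the callback wrapper from a JSONP body and decode the JSON."""
    match = re.search(r"\((.*)\)\s*;?\s*$", body, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


def catalog_link(ppn, fetch):
    """Return (link text, URL) for a catalog record, or None if unknown.

    `fetch` is any function that takes a URL and returns the response body.
    """
    oclc = PPN_TO_OCLC.get(ppn)
    if oclc is None:
        return None
    info = parse_jsonp(fetch(viewability_url(oclc))).get(f"OCLC:{oclc}")
    if info is None:
        return None
    preview = info.get("preview", "noview")
    text = LINK_TEXT.get(preview, LINK_TEXT["noview"])
    if preview in ("full", "partial"):
        url = info.get("preview_url")
    else:
        url = info.get("info_url")
    return text, url
```

Passing `fetch` in as a parameter keeps the sketch free of network calls; a real catalog would plug in its own HTTP client there, and would probably cache the answers to avoid triggering the spam warnings mentioned earlier.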

A few examples are:

Even when the full text is not available on Google Books, the service can be useful. In the following example, Hogg, R. (1884) The fruit manual, the electronic version of the 1860 edition is available on Google Books rather than the 1884 edition we have in our collection.

It actually took quite some effort to find these examples. Perhaps an indication of our unique collection?

Thomson increases journal coverage in the social sciences

Thomson Scientific published a press release yesterday in which they announced a substantial increase of journals in the social sciences indexed in Web of Science.

Track back a little. The November/December 2007 issue of Searcher magazine carried an interview by Herther with Keith MacGregor and James Testa of Thomson Scientific. The interview closed with the following question:

If, by some chance, next year there were suddenly 500 new journals that met the criteria for acceptance in Web of Science, would you add 500 new journals to the databases?

Testa: Well, this is a hypothetical question, so the hypothetical answer would be yes. If these were journals that met our criteria, we would absolutely add these. I don’t see how we could say no. I’d be surprised to see that happen, but, yes we would certainly accept them.

In the interview Testa indicated that the normal pace of growth for Web of Science is on the order of 100-200 journals per year. The 162 regional social science journals now announced as additions to Web of Science can thus be considered part of those hypothetical journals. The newly identified collection contains journals that typically target a regional rather than international audience by approaching subjects from a local perspective or focusing on particular topics of regional interest. They include 49 titles from the Asia-Pacific region and 91 from the European Union.

The only thing left on my wishlist is a complete title list of the journals that have been added. For now we have to guess at the additions. In this respect Thomson behaves like the “Guide Michelin”: you have to figure out for yourself which restaurants dropped a star or gained one.

Reference
Herther, N. K. (2007). Thomson Scientific and the citation indexes : an interview with Keith MacGregor and James Testa. Searcher 15(10): 8-17.

Library website design, search engine crawlers and SEO

Digital libraries have tons of data, and when they don’t have the data in digital format, they have really nice, structured, electronically available metadata about those data. Library catalogs, we call them. They are plain ordinary databases and come in all kinds of flavours.

When I joined the library behind the scenes some eight years ago, indexing of the library catalog was off limits for the search engines. That would cripple the system, at the expense of the users! Actually, we were talking about AltaVista and AllTheWeb in those days, although Google was around already. Times have changed though. We have taken away all kinds of no-index and no-follow signs on our system, and the first catalog cards are being indexed by search engines. We are just starting to use RSS and OAI as sitemaps for Google. But this is not the only approach that should be taken. The site should become optimized for the Google bots and for the crawlers of all kinds of search engines, although Google is by far the most important search engine at this moment.

Interesting to look back in my archives: a study was carried out two years ago by Drunk men work here. It does not seem to be a peer-reviewed study, but it is interesting nevertheless. In their research they compared the crawling behaviour of the Google, Yahoo! and MSN bots on a really large site that was set up as a binary search tree. Quite amazingly, the Yahoo! bot proved to be the most proficient, having indexed the most underlying pages and reached the deepest level. The Google bot followed at quite some distance and MSN came in last.

How matters have changed over the last two years. Smith and Nelson (2008) built two large digital library websites to study crawler behaviour, comparing a wide and a deep linking design. It appeared that conventional wisdom held true, in that the wide-design sites were indexed twice as fast: in the case of Google, 18 days compared to 44 days. The Yahoo! crawler failed to index the complete site. The MSN bot took more than 200 days in the wide design and failed to index the entire website in the deep design.
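
To get a feel for why a deep design is harder on crawlers, here is a small back-of-the-envelope sketch. It is not taken from either study: the page count and fan-out values are illustrative assumptions, and the function `levels_needed` is a hypothetical helper that simply counts how many link levels a crawler must descend from a single root page before every page is reachable.

```python
def levels_needed(total_pages, links_per_page):
    """Return the link depth needed to reach total_pages from one root page.

    Assumes a perfectly balanced tree in which every page links to
    links_per_page pages on the next level.
    """
    if total_pages < 1 or links_per_page < 1:
        raise ValueError("both arguments must be at least 1")
    reachable = 1   # the root page itself
    frontier = 1    # number of pages on the current level
    depth = 0
    while reachable < total_pages:
        frontier *= links_per_page
        reachable += frontier
        depth += 1
    return depth


# Illustrative numbers only: a million pages, deep versus wide.
deep = levels_needed(1_000_000, 2)     # binary tree, two links per page
wide = levels_needed(1_000_000, 100)   # one hundred links per page
print(f"deep design: {deep} levels, wide design: {wide} levels")
```

With two links per page a crawler has to follow a chain of 19 links to reach the bottom of a million-page site; with a hundred links per page it needs only 3.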

The latest article I read that touched on this subject was by Jody L. DeRidder (2008), who explored the use of Google sitemaps and static browse pages (for which I have been pleading for so long already, not so much for robots as for our human users). The article concluded, albeit with a relatively small sample, that static browse pages enhanced the crawling and indexing by search engines.

Having digested all this, I think we are back again at our thesaurus and classification system, and should use those logical trees as entry points for the crawlers into our catalog. Isn’t it nice that we have been indexing all our records manually for years, and that we can now use that system as efficient highways into our catalog for the crawlers? Old systems used for new purposes.
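
As a rough illustration of that idea, the sketch below turns a thesaurus tree into static browse pages, one per term, each linking to its narrower terms and to the catalog records indexed under it. Everything in it is a hypothetical assumption made for the example: the thesaurus fragment, the record URLs, the `/browse/` path and the function names are invented, not taken from our catalog.

```python
from html import escape

# Hypothetical thesaurus fragment: term -> narrower terms.
NARROWER = {
    "crops": ["fruit crops", "cereals"],
    "fruit crops": ["apples"],
    "cereals": [],
    "apples": [],
}

# Hypothetical catalog index: term -> URLs of records indexed with it.
RECORDS = {
    "apples": ["/catalog/record/1001"],
    "cereals": ["/catalog/record/2002"],
}


def slug(term):
    """Turn a thesaurus term into a file-name-friendly string."""
    return term.replace(" ", "-")


def browse_page(term):
    """Render one static browse page as a plain HTML fragment."""
    items = [
        f'<li><a href="/browse/{slug(t)}.html">{escape(t)}</a></li>'
        for t in NARROWER.get(term, [])
    ]
    items += [
        f'<li><a href="{escape(url)}">{escape(url)}</a></li>'
        for url in RECORDS.get(term, [])
    ]
    return f"<h1>{escape(term)}</h1>\n<ul>\n" + "\n".join(items) + "\n</ul>\n"


def build_site(root):
    """Walk the thesaurus from root and return {file name: page HTML}."""
    pages = {}
    stack = [root]
    while stack:
        term = stack.pop()
        name = slug(term) + ".html"
        if name in pages:  # term already rendered via another broader term
            continue
        pages[name] = browse_page(term)
        stack.extend(NARROWER.get(term, []))
    return pages
```

Calling `build_site("crops")` on this fragment yields four pages. Because every record hangs off a thesaurus term, a crawler that enters at the top term can reach each record through a short chain of plain links.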

Release the spiders!

The second part of the exercise is of course to design efficient catalog records that rank well in the search engine result pages. I wonder who has formally studied those matters. Any suggestions?

References
DeRidder, J. L. (2008). Googlizing a digital library. Code4Lib Journal, 2. http://journal.code4lib.org/articles/43
Smith, J. A. and M. L. Nelson (2008). Site design impact on robots: An examination of search engine crawler behaviour at deep and wide websites. D-Lib Magazine, 14(3/4). http://www.dlib.org/dlib/march08/smith/03smith.html