Library website design, search engine crawlers and SEO

Digital libraries have tons of data, and when they don’t have the data in digital format, they have really nice, structured, electronically available metadata about those data. Library catalogs, we call them. They are plain ordinary databases and come in all kinds of flavours.

When I joined the library behind the scenes some eight years ago, indexing of the library catalog was off limits for the search engines. That would cripple the system, at the expense of the users! Actually, we were talking about Altavista and AllTheWeb in those days, although Google was around already. Times have changed though. We have taken away all kinds of no-index, no-follow signs on our system, and the first catalog cards are being indexed by search engines. We are just starting to use RSS and OAI as sitemaps for Google. But this is not the only approach that should be taken. The site should become optimized for the bots and crawlers of all kinds of search engines, although Google is by far the most important search engine at this moment.
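To make the sitemap idea a bit more concrete, here is a minimal sketch of how a list of catalog record URLs could be turned into a sitemap file. It is not how our system does it (we feed RSS and OAI output to Google); the record URLs, the dates and the `build_sitemap` helper are all made up for illustration. Only the `urlset`, `url`, `loc` and `lastmod` elements and the namespace come from the sitemap protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML as a string for (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in records:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record URLs; a real list would come from the catalog database.
records = [
    ("https://library.example.org/catalog/record/1", "2008-04-01"),
    ("https://library.example.org/catalog/record/2", "2008-04-15"),
]

sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```

The sitemap protocol limits a single file to 50,000 URLs, so a catalog of any real size would need several such files plus a sitemap index pointing at them.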

Looking back in my archives, I found an interesting study carried out two years ago by Drunk men work here. It does not seem to be a peer-reviewed study, but it is interesting nevertheless. In their research they compared the crawling behaviour of the Google, Yahoo! and MSN bots on a really large site that was set up as a binary search tree. Quite amazingly, the Yahoo! bot proved to be the most proficient, having indexed the most underlying pages and reached the deepest level. The Google bot followed at quite some distance and MSN came in last.

How matters have changed over the last two years. Smith and Nelson (2008) built two large digital library websites to study crawler behaviour. They compared wide and deep linking designs of the websites. It appeared that conventional wisdom held true, in that the wide design sites were indexed twice as fast: in the case of Google, 18 days compared to 44 days. The Yahoo! crawler failed to index the complete site. The MSN bot took more than 200 days on the wide design and failed to index the entire website on the deep design.
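Why a wide design helps can be seen with a bit of arithmetic: the more links each page carries, the fewer levels a crawler has to descend before it has seen everything. The sketch below counts those levels for a site laid out as a full tree. The 100,000 pages and the branching factors of 2 and 100 are my own illustrative assumptions, not the parameters Smith and Nelson used.

```python
def levels_needed(total_pages, links_per_page):
    """Number of link levels below the home page needed to reach every page
    when each page links to `links_per_page` pages one level down."""
    if links_per_page < 1:
        raise ValueError("links_per_page must be at least 1")
    levels = 0
    reachable = 1  # the home page itself
    frontier = 1   # number of pages on the deepest level reached so far
    while reachable < total_pages:
        frontier *= links_per_page
        reachable += frontier
        levels += 1
    return levels


# Illustrative figures only.
deep = levels_needed(100_000, 2)    # binary tree, as in the older study
wide = levels_needed(100_000, 100)  # a wide design with 100 links per page
print(deep, wide)  # prints: 16 3
```

With these made-up numbers a crawler has to follow links sixteen levels down in the deep design, against three in the wide one.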

The latest article I read that touched on this subject was by Jody L. DeRidder (2008), who explored the use of Google sitemaps and static browse pages (for which I have been pleading for so long, not so much for robots as for our human users). The study concluded, albeit with a relatively small sample, that static browse pages enhanced the crawling and indexing by search engines.

Having digested all this, I think we are back again at our thesaurus and classification system, and can use those logical trees as entry points for the crawlers into our catalog. Isn’t it nice that we have been indexing all our records manually for years, and that we can make use of that system as efficient highways into our catalog for the crawlers? Old systems used for new purposes.
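As a rough sketch of what such highways could look like: take the thesaurus as a tree of broader and narrower terms, and write out one static browse page per term, each linking to its narrower terms and to the records indexed with that term. Everything specific here is an assumption for illustration: the tiny thesaurus fragment, the `/browse/` and `/catalog/subject/` URL patterns, and the helper names.

```python
from html import escape

# Hypothetical fragment of a thesaurus: term -> narrower terms.
THESAURUS = {
    "Agriculture": ["Crops", "Livestock"],
    "Crops": ["Cereals", "Legumes"],
    "Livestock": [],
    "Cereals": [],
    "Legumes": [],
}


def slug(term):
    """Turn a thesaurus term into a file-name friendly form."""
    return term.lower().replace(" ", "-")


def browse_page(term, narrower):
    """Return a small HTML fragment for one static browse page."""
    items = "\n".join(
        f'  <li><a href="/browse/{slug(t)}.html">{escape(t)}</a></li>'
        for t in narrower
    )
    # Assumed URL pattern for the catalog records indexed with this term.
    records = (
        f'<a href="/catalog/subject/{slug(term)}">'
        f"Records indexed with {escape(term)}</a>"
    )
    return f"<h1>{escape(term)}</h1>\n<ul>\n{items}\n</ul>\n{records}\n"


# One page per term, keyed by file name; writing them to disk is left out.
pages = {
    f"{slug(term)}.html": browse_page(term, narrower)
    for term, narrower in THESAURUS.items()
}
print(pages["agriculture.html"])
```

Each page is plain HTML with ordinary links, so a crawler entering at the top term can walk down the tree without ever having to fill in a search form.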

Release the spiders!

The second part of the exercise is, of course, to design efficient catalog records that rank well in the search engine result pages. I wonder who has formally studied those matters. Any suggestions?

DeRidder, J. L. (2008). Googlizing a digital library. Code4Lib Journal, 2.
Smith, J. A. and M. L. Nelson (2008). Site design impact on robots: An examination of search engine crawler behaviour at deep and wide websites. D-Lib Magazine, 14(3/4).

Eric Lease Morgan’s digital information landscape

During the Ticer’07 summer school ‘Digital Libraries à la Carte’ I first met Eric Lease Morgan. He was an excellent instructor, making the techie stuff more palatable.

With much interest I noted that one of his recent lectures was cited in Current Cites. His lecture “Today’s digital information landscape” makes some thoughtful points on future libraries, librarianship and, above all, catalogs. Here are some interesting quotes selected from the various parts of his lecture:

On MARC and XML “MARC is a Gordian Knot that needs to be cut, and XML put into its place.”

On databases and indexes “They are two sides of the same information retrieval coin.”

On exploiting the network “A rising tide floats all boats. The tide of network computing is certainly upon us. Let’s make sure our boats are in the water.”

On institutional repositories and open access “Acquisitions departments are not necessarily about buying content… An acquisitions department is responsible for bringing collections into the library.”

On the next generation catalogs “More importantly, a “next generation” library catalog will provide services against the things discovered. These services can be enumerated and described with action statements including but not limited to: get it, add it to my personal collection, tag & classify it, review it, buy it, delete it, edit it, share it, link it, compare & contrast it, search it, summarize it, extract all the images from it, cite it, trace it, delete it. Each of these tasks supplement the learning, teaching, and research process.” And “Collections without services are useless. Services without collections are empty. Library catalogs lie at the intersection of collections and services.”

Morgan concludes with “The principles of collection, organization, preservation, and dissemination are extraordinarily relevant in today’s digital landscape. The advent of the globally networked computers, Internet indexes, and mass digitization projects have not changed this fact.”

Worth reading as a whole.

Morgan, E. L. (2007). Today’s digital information landscape. Infomusings.