Google and the academic Deep Web

Hagedorn and Santelli (2008) just published an interesting article on how comprehensively Google indexes academic repositories. The article prompts me to write up some observations I have been meaning to make for quite some time. It addresses a question I got from a colleague of mine, who observed that the deep web apparently doesn’t exist anymore.

Google has started to index Flash files. It has started to retrieve information hidden behind search forms on the web, i.e. to index information contained in databases. Google and OCLC exchange information on scanned books and on those contained in WorldCat. Google, so it seems, has indexed the Web comprehensively, with 1 trillion indexed webpages. Could there possibly be anything more to index?

The article by Hagedorn and Santelli shows convincingly that Google still has not indexed all the information contained in OAIster, the second largest archive of open access article information. Only Scientific Commons is more comprehensive. They tested this with the Google Research API, through the University Research Program for Google Search, and only checked whether each URL was present. This approach reveals only part of the depth of the academic Deep Web. The figures are staggering already, but reality bites even harder.

A short while ago I taught a Web Search class for colleagues at the University Library at Leiden. To demonstrate what the Deep or Invisible Web actually is, I used an example from their own repository. It was a thesis on Cannabis from last year, deposited as one huge PDF of 14 MB. Using Google you can find the metadata record. With Google Scholar as well. However, if you search for a quite specific sentence from the first pages of the actual PDF file, Google does not return the thesis you are looking for. You find three other PhD dissertations, two of them defended at the same university on that same day, but not the one on Cannabis.

Interestingly, you are able to find parts of the thesis in Google Scholar, e.g. chapter 2, chapter 3, etc. But those are the chapters that have been published elsewhere as articles in scholarly journals. Unfortunately, none of these Google Scholar records refers back to the original thesis, which is available in Open Access, and none of the chapters have been posted as OA journal article pre-prints in the Leiden repository. In Google Scholar most of the material is still behind toll gates at publishers’ websites.

Is Google to blame for this incomplete indexing of repositories? Hagedorn and Santelli do indeed point the finger at Google. However, John Wilkin, a colleague of theirs, doesn’t agree. Neither did Lorcan Dempsey. And neither do I.

I have taken an interest in the new role of librarians. We are no longer solely responsible for bringing external documentary resources within reach of our academic clientele. We also have the precious task of bringing the fruits of their labour into the floodlights of the external world as well as we can, be it for academic or plain lay interest. We have to get the information out there. Open Access plays an important role in this new task. But that task doesn’t stop at simply making it available on the Web.

Making it available is only a first, essential step. Making it rank well is a second, perhaps even more important step. So as librarians we have to become SEO experts. I have mentioned this here before, as well as on my Dutch blog.

So what to do about this example from the Leiden repository? Well, there is actually a slew of measures that should be taken. The first, of course, is to divide the complete thesis into parts at chapter level. Admittedly, publishers give permission to republish the articles, of which most theses in the natural sciences in the Netherlands consist, only when the thesis is published as a whole. On the other hand, nearly 95% of publishers allow the publication of pre-prints and peer-reviewed post-prints: the so-called RoMEO green road. So it is up to the repository managers, preferably with the consent of the PhD candidate, to split the thesis into its parts (the chapters, which are the pre-prints or post-prints of articles) and to archive the thesis at chapter level as well. The record for the thesis then contains links to the individual chapters deposited elsewhere in the repository, and these far more digestible chunks of information are more palatable for search engine spiders and crawlers.

An interesting side effect of this additional effort on the repository side is that deposit rates will increase considerably. This applies to most universities in the Netherlands, and to our own collection of theses as well. Since PhD students are responsible for the lion’s share of academic research at the university, depositing the individual chapters as article preprints in the repository will be of major benefit to the OA performance of the university. It will require more labour on the part of repository management, but if we take this seriously it is well worth the effort.

We still have to work really hard on the visibility of the repositories, but making the information more palatable is a good start.

Hagedorn, K. and J. Santelli (2008). Google still not indexing hidden web URLs. D-Lib Magazine 14(7/8).

Library website design, search engine crawlers and SEO

Digital libraries have tons of data, and when they don’t have the data in digital format, they have really nice, structured, electronically available metadata about those data. Library catalogs, we call them. They are plain ordinary databases and come in all kinds of flavours.

When I joined the library behind the scenes some eight years ago, indexing of the library catalog was off limits for the search engines. That would cripple the system, at the expense of the users! Actually we were talking about AltaVista and AllTheWeb in those days, although Google was around already. Times have changed though. We have taken away all kinds of no-index and no-follow signs on our system, and the first catalog cards are being indexed by search engines. We are just starting to use RSS and OAI as sitemaps for Google. But this is not the only approach that should be taken. The site should be optimized for the Google bots and for the crawlers of all kinds of search engines, although Google is by far the most important search engine at this moment.
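To make the sitemap idea a bit more concrete, here is a minimal sketch in Python of turning a list of catalog record URLs into a sitemap in the sitemaps.org XML format. It is only an illustration: the function name and the example.org record URLs are made up, and it is not how our own RSS or OAI feeds are produced.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(record_urls):
    """Return a sitemap XML string with one <url><loc> entry per record URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for record_url in record_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = record_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical catalog record URLs, for illustration only.
records = [
    "https://catalog.example.org/record/1",
    "https://catalog.example.org/record/2",
]
sitemap_xml = build_sitemap(records)
print(sitemap_xml)
```

In practice the list of URLs would come out of the catalog database or an OAI harvest, and the resulting file would be announced to the search engines.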

Looking back in my archives, I found a study carried out two years ago by Drunk Men Work Here. It does not seem to be a peer-reviewed study, but it is interesting nevertheless. They compared the crawling behaviour of the Google, Yahoo! and MSN bots on a really large site that was set up as a binary search tree. Quite amazingly, the Yahoo! bot proved to be the most proficient, having indexed the most underlying pages, down to the deepest level. The Google bot followed at quite some distance and MSN came in last.

How matters have changed over the last two years. Smith and Nelson (2008) built two large digital library websites to study crawler behaviour. They compared a wide and a deep linking design. It appeared that conventional wisdom held true: the wide design was indexed more than twice as fast, in the case of Google in 18 days compared to 44 days. The Yahoo! crawler failed to index the complete site. The MSN bot took more than 200 days for the wide design and failed to index the entire website with the deep design.
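Why a deep design is harder on a crawler can be illustrated with some simple arithmetic. The sketch below counts how many link levels below the home page are needed to hold a given number of pages when every page links to a fixed number of children. The figures are hypothetical and are not taken from either study: 2 links per page mimics a binary tree, 100 links per page stands in for a wide design.

```python
def levels_needed(total_pages, links_per_page):
    """Smallest number of link levels below the home page that can hold total_pages."""
    levels = 0
    capacity = 0   # pages reachable within `levels` clicks of the home page
    width = 1      # pages at the current level (the home page itself at level 0)
    while capacity < total_pages:
        levels += 1
        width *= links_per_page
        capacity += width
    return levels


site_size = 100_000                      # hypothetical number of catalog pages
deep = levels_needed(site_size, 2)       # binary-tree style site
wide = levels_needed(site_size, 100)     # wide browse-page style site
print(f"deep design: {deep} levels, wide design: {wide} levels")
```

With these assumed numbers the binary design buries its last pages many more clicks away from the home page than the wide design does, which is one plausible reason for a crawler to give up before it reaches the bottom.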

The latest article I read that touched on this subject was by Jody L. DeRidder (2008), who explored the use of Google sitemaps and static browse pages (for which I have been pleading for so long, not so much for robots as for our human users). The conclusion, based on a relatively small sample, was that static browse pages enhanced crawling and indexing by search engines.

Having digested all this, I think we are back again at our thesaurus and classification systems, and should use those logical trees as entry points for the crawlers into our catalog. Isn’t it nice that we have been indexing all our records manually for years, and that we can now use that system as efficient highways into our catalog for the crawlers? Old systems used for new purposes.
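As a rough sketch of what such highways could look like, the Python below walks a small classification tree and produces one static browse page per node, each linking to its narrower terms or to the catalog records filed under it. The tree, the /browse paths and the record URLs are all invented for the example; a real thesaurus would of course be far larger and come from the catalog itself.

```python
from html import escape

# Hypothetical classification tree: a term maps either to narrower terms (dict)
# or to the catalog records filed under it (list of URLs).
tree = {
    "Agriculture": {
        "Crops": ["/record/101", "/record/102"],
        "Soils": ["/record/201"],
    },
}


def browse_pages(node, path="/browse"):
    """Yield (page_path, html) pairs, one static browse page per node of the tree."""
    if isinstance(node, dict):
        # Inner node: link to the browse page of every narrower term.
        links = [(f"{path}/{term.lower()}", term) for term in node]
        for term, child in node.items():
            yield from browse_pages(child, f"{path}/{term.lower()}")
    else:
        # Leaf node: link straight to the catalog records.
        links = [(record_url, record_url) for record_url in node]
    items = "".join(
        f'<li><a href="{escape(href)}">{escape(label)}</a></li>'
        for href, label in links
    )
    yield path, f"<ul>{items}</ul>"


static_pages = dict(browse_pages(tree))
for page_path in sorted(static_pages):
    print(page_path)
```

Every record is then reachable through plain links from the top of the tree, which is what a crawler needs, and human users get a browsable entry into the catalog as a bonus.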

Release the spiders!

The second part of the exercise is, of course, to design efficient catalog records that rank well in the search engine result pages. I wonder who has formally studied those matters. Any suggestions?

DeRidder, J. L. (2008). Googlizing a digital library. Code4Lib Journal 2.
Smith, J. A. and M. L. Nelson (2008). Site design impact on robots: An examination of search engine crawler behaviour at deep and wide websites. D-Lib Magazine 14(3/4).

Full feeds versus partial feeds

Full feeds versus partial feeds is an old debate. Have a look at the 2.7 million Google hits for this simple query. Most of the debate, however, concentrates on the presumed effects on visitors to the actual blog and on (missed?) advertising revenue.

This afternoon I had an interesting discussion with a representative from a library organization about the findability and accessibility of scientific information. My point of view was that blogging about science and scientific articles would at least increase the findability of these articles. However, this is only true when the feeds of the blog are full feeds. Nowadays the search for very new, young or even premature information on the web should be complemented with searches in blog search engines and news search engines. These search engines are on most occasions not exactly what their name suggests. In most instances they are RSS feed search engines, i.e. they only index RSS feeds.

The consequences are simple. When a blog uses partial feeds, only the headline is indexed by blog search engines. Have a look, for instance, at the Technorati results for the IAALD blog, or at those from Google Blog Search or Ask blog search. These are the top three blog search engines at the moment. The discoverability of content from the IAALD blog with these search engines is miserable, even though the blog has some excellent content.
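The effect is easy to mimic. The sketch below is a toy model in Python, not how Technorati or any other blog search engine actually works: it assumes an indexer that sees nothing but the title and description of each item in the RSS feed. Both feeds are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two hypothetical feeds for the same post: one carries the text, one only the headline.
FULL_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>Google and the academic Deep Web</title>
<description>Chapters deposited as preprints are easier for crawlers to digest.</description></item>
</channel></rss>"""

PARTIAL_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>Google and the academic Deep Web</title></item>
</channel></rss>"""


def indexed_text(feed_xml):
    """Return the lower-cased text a feed-only indexer sees: item titles and descriptions."""
    root = ET.fromstring(feed_xml)
    parts = []
    for item in root.iter("item"):
        parts.append(item.findtext("title", default=""))
        parts.append(item.findtext("description", default=""))
    return " ".join(parts).lower()


def findable(feed_xml, phrase):
    """True if the phrase occurs in the text exposed by the feed."""
    return phrase.lower() in indexed_text(feed_xml)


print(findable(FULL_FEED, "preprints"))
print(findable(PARTIAL_FEED, "preprints"))
```

A word that occurs only in the body of the post can be found through the full feed but not through the partial one, while a word from the headline can be found in both.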

So far the discussion of full text feeds versus partial feeds has concentrated on the arguments of pro-bloggers who are worried about their advertising revenue. For scientists, the argument of discoverability is far more important, and they should always opt for full feeds to syndicate their content as widely as possible.

It sounds strange but a lot of people have not yet realized this.

Searching for Science

For a little while now, say a year and a half or so, I have been teaching at regular intervals a course on finding scholarly information with freely available resources on the Web. The course is titled “Searching for Science”. The course material is freely available in one of my wikis. The main reason for using a wiki to present a course like this is that linking to examples on the Web works so much more smoothly than with PowerPoint.

As for today’s course, a small group attended: 4 researchers and 5 (mostly) international students. A nice mix. I really enjoyed it, and I think they did as well. Well, at least they gave me a really positive evaluation.
During the course I spend about three quarters of the morning, say a little over 2 hours, on general search tactics: search engines and their commands, Web directories and the Deep Web. During the evaluation I always get the feedback that some plain Google commands and search tips earn the most Brownie points. What’s always interesting is an exercise in which we compare the coverage of scholarly search engines plus Live Academic in retrieving a known article from an OA repository in the Netherlands. I always ask the students to search with the full title of the article and then to repeat the exercise with a sentence from the discussion section of the article. As usual Live Academic failed entirely. Google Scholar did reasonably well on both, but today Scirus and Scientific Commons only worked with the title words. These outcomes can be different again tomorrow, and they are always difficult to explain.

Meanwhile I find some real gratification in pointing my students to some of the OA discussions as well, whilst covering collections of OA journals and repositories or mentioning Open Course Ware sources.

On most occasions the participants are entirely new to some of the Science 2.0 developments. RSS? Never heard of it. So I introduce them to Bloglines, Netvibes and Google Reader, and show them something about scholarly blogs, social bookmarking for scientists or Digg.

We do actually have a course on Science 2.0 in the planning for sometime in April. It still needs a lot of development, though. But it will be interesting.