The mysterious ways of Web of Science

A while back, one of our researchers asked me how Steven Salzberg had arrived at the number of citations for the paper on the Arabidopsis genome in Nature. When he checked Web of Science, it delivered zero citations, and that couldn't be true for such a breakthrough paper. Salzberg had reported 2689 citations! How did he arrive at that number?

I first checked the paper in Web of Science myself, and also found zero citations.

Zero citations from Web of Science for the Arabidopsis papers

I was not entirely surprised, since I realized it was one of those consortium papers, and I knew Thomson had had problems with a consortium paper in the past. But it was annoying nonetheless.

I first looked into the issue around the Human Genome Project and found it mentioned even in Science Watch from Thomson. From that article, however, it appeared that Thomson had only improved the tracking of citations for the Human Genome Project paper itself, and not addressed the underlying issue. Even though the Arabidopsis paper was even older, the citations to it had not been corrected. Something in the way WoS searched for or tracked citations was going wrong, but where was the error being made?

I made a few futile attempts in the cited reference search with Arabidopsis as author, or Arabidopsis*. I searched the cited reference search for Kaul as author (who is listed at the end of the original article as first author), but that only yielded some 130 citations, not enough to account for Salzberg's number. I did not want to use the cited reference search to trawl for articles cited from Nature in 2000: that is a very large result set, and you have to wade through innumerable pages of results since you can't refine this type of search by volume or page number. (Wouldn't that be nice?)

To reassure my inquisitive researcher I pointed him to Scopus (sorry, Thomson), where he could see a reassuring 3000+ citations for himself. Meanwhile, I did not have a quick fix for the problem.

It was only later, when I looked into the problem again, that I was somehow forwarded to the All Databases search rather than the Web of Science search tab, which I normally use. To my utter amazement, this time the title search delivered two records. Both with zero citations, but more importantly, they showed [Anon] Ar Gen In as the author.


Now the problem was simple: I had found the author. A cited reference search indeed yielded nearly the 2689 citations Steven Salzberg had reported.


But these figures are not entirely correct either, since an additional 131 citations can be found with Kaul as first-author reference to Nature, with the correct volume and page number.

Of course I requested a correction of the citation data at Web of Science, but forgot to include Kaul's citations. Hopefully this will be repaired at a later date.

But what really makes me wonder is the slight, but very important, difference in record presentation between the All Databases search and the Web of Science search on Web of Knowledge. For me personally, the standard entry point into Web of Knowledge is the Web of Science tab. In my normal working routine I would never go to the All Databases tab to look up a number of citations. Only by luck did I find the right author name on this occasion. But surely that shouldn't have to become the standard way to perform searches?

Trends in the science and information world

Tomorrow I have to teach a class on better searching for scientific information on the world wide web. In the introduction I try to highlight the major trends in research and in the information landscape. I came up with the following two bullet lists.

Trends in science and research

  • Increased multidisciplinarity
  • Increasing cooperation between scientists
  • Internationalization of research
  • Need for primary data
  • More competition for same grant money

Trends in the information world

  • Increased importance of free web resources
  • From information scarcity to overload
  • After A&I databases and journals, currently the digitization of books
  • From bibliographic control to fulltext search
  • Open Access & Source
  • Multiformity of resources
  • User in control

I wonder if anybody has additional suggestions for either of these lists.

Google and the academic Deep Web

Hagedorn and Santelli (2008) just published an interesting article on the comprehensiveness of Google's indexing of academic repositories. This article triggered me to write up some observations I had been intending to make for quite some time. It addresses a question I got from a colleague of mine, who observed that the deep web apparently doesn't exist anymore.

Google has started to index Flash files. Google has started to retrieve information hidden behind search forms on the web, i.e. to index information contained in databases. Google and OCLC exchange information on scanned books and those contained in WorldCat. Google, so it seems, has indexed the Web comprehensively, with 1 trillion indexed webpages. Could there possibly be anything more to index?

The article by Hagedorn and Santelli shows convincingly that Google still has not indexed all the information contained in OAIster, the second largest archive of open access article information; only Scientific Commons is more comprehensive. They tested this with the Google Research API, under the University Research Program for Google Search, and only checked whether the URL was present. This approach reveals only part of the depth of the academic Deep Web, but the figures are staggering already. And reality bites even harder.
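The methodology boils down to a URL-presence check: given the URLs a repository exposes and the set of URLs a search engine reports as indexed, coverage is the fraction of the former found in the latter. A toy sketch in Python, with made-up repository URLs and a simulated index standing in for the real Google Research API calls:

```python
def index_coverage(repository_urls, indexed_urls):
    """Fraction of repository URLs present in a search engine's index.

    Note: a record counts as 'indexed' merely when its URL is known to
    the engine; this says nothing about whether the full text behind
    the URL (e.g. a large PDF) was actually crawled.
    """
    indexed = set(indexed_urls)
    found = [url for url in repository_urls if url in indexed]
    return len(found) / len(repository_urls)

# Hypothetical sample data, for illustration only.
repo = [
    "http://repository.example.edu/handle/1",
    "http://repository.example.edu/handle/2",
    "http://repository.example.edu/handle/3",
    "http://repository.example.edu/handle/4",
]
index = {
    "http://repository.example.edu/handle/1",
    "http://repository.example.edu/handle/3",
}

print(index_coverage(repo, index))  # 0.5: only half the records are findable at all
```

Even a 100% score on this check would only mean the metadata records are discoverable, which is exactly why URL presence understates the real depth of the problem.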

A short while ago I taught a web search class for colleagues at the university library in Leiden. To demonstrate what the Deep or Invisible Web actually constitutes, I used an example from their own repository: a thesis on Cannabis, defended last year and deposited as one huge PDF of 14 MB. Using Google you can find the metadata record, and with Google Scholar as well. However, if you search for a quite specific sentence from the opening pages of the actual PDF, Google does not return the sought-after thesis. You find three other PhD dissertations, two of them defended at the same university on the same day, but not the one on Cannabis.

Interestingly, you are able to find parts of the thesis in Google Scholar, e.g. chapter 2, chapter 3, etc. But those are the chapters of the thesis that have been published elsewhere in scholarly journals. Unfortunately, none of these parts in Google Scholar refers back to the original Open Access thesis, or to the OA journal article pre-prints posted in the Leiden repository. In Google Scholar, most of this material is still behind toll gates on publishers' websites.

Is Google to blame for this incomplete indexing of repositories? Hagedorn and Santelli do indeed point the finger at Google. However, John Wilkin, a colleague of theirs, doesn't agree. Neither did Lorcan Dempsey. And neither do I.

I have taken an interest in the new role of librarians. We are no longer solely responsible for bringing external documentary resources into the realm of our academic clientele. We also have the dear task of bringing the fruits of their labour, as well as we can, into the floodlights of the external world, be it for academic or plain lay interest. We have to bring the information out there. Open Access plays an important role in this new task. But that task doesn't stop at simply making the material available on the Web.

Making it available is only a first, essential step. Making it rank well is a second, perhaps even more important one. So as librarians we have to become SEO experts. I have mentioned this before, both here and on my Dutch blog.

So what to do about this chosen example from the Leiden repository? There is actually a slew of measures that should be taken. The first, of course, is to divide the complete thesis into parts, at the chapter level. Admittedly, publishers only give permission to deposit the articles, of which most theses in the sciences in the Netherlands consist, once the thesis has been published as a whole. On the other hand, nearly 95% of publishers allow publication of pre-prints and peer-reviewed post-prints: the so-called RoMEO green road. So it is up to the repository managers, preferably with the consent of the PhD candidate, to split the thesis into its parts, the chapters, which are the pre-prints or post-prints of articles, and to archive the thesis at the chapter level as well. The record for the thesis, with links to these far more digestible chunks of information, then becomes much more palatable to search engine spiders and crawlers. The record for the thesis thus contains links to the individual chapters deposited elsewhere in the repository.

An interesting side effect of this additional effort on the repository side is that deposit rates will increase considerably. This applies to most universities in the Netherlands, and to our collection of theses as well. Since PhD students are responsible for the lion's share of academic research at the university, depositing the individual chapters as article pre-prints in the repository will be of major benefit to the university's OA performance. It will require more labour on the side of repository management, but if we take this seriously it is well worth the effort.

We still have to work really hard at the visibility of the repositories, but making the information more palatable is a good start.

Hagedorn, K. and J. Santelli (2008). Google still not indexing hidden web URLs. D-Lib Magazine 14(7/8).

PubMed sucks, or the user is broken

Anna Kushnir runs a blog on a high-profile platform over at Nature Publishing. Last Saturday she complained about the user-unfriendliness of PubMed:

I have spent an absurd amount of time on PubMed recently and can say in no uncertain terms that it is making my dissertation writing way more painful than it needs to be. I can hold a paper in my hands, search for two authors’ last names and have PubMed come up with nothing.

PubMed, however, is probably the most widely used bibliographic database in the world, certainly in the world of medicine. Many libraries run special classes to teach the intricacies of PubMed. We librarians have to admit: searching PubMed is not easy. It is certainly not intuitive. And once you have found what you searched for, it is complicated to get the information over into another programme such as Reference Manager or EndNote. If you succeed at that, you get abbreviated journal titles, authors with at most two initials, and so on.

How surprising, then, was the reaction of Dean Giustini. His reaction is perhaps typical of librarians in general: we go out and teach the user a few tricks. We teach and teach. The database is not broken! It's the user we need to mend.

I thought Dean would know better than this. Of course he is right that this complaint about PubMed is an excellent teaching moment. But I would rather stress Anna Kushnir's message, which is that searching PubMed is not intuitive. Far from it. Even if you had classes in searching PubMed some years ago, that knowledge is now obsolete. That is good for PubMed, they innovate and improve, but if we think that refresher courses in searching PubMed should be high on the list of doctors, surgeons and medical researchers, we are on the wrong track. They simply don't have time for these courses; it is a rat race just to keep informed of the progress in their own specialities. Why would they need the training of full-time MLIS professionals just to search a bibliographic database?

We have to go out there and listen to our users. Anna Kushnir is one of them. Her message is plain and simple: searching PubMed, however good we think it already is, should become more intuitive. I think we can, and should, do a lot better at building these more intuitive search engines.

I see Anna's post more as a challenge for our profession than as a teaching moment.

Full feeds versus partial feeds

Full feeds versus partial feeds is an old debate; have a look at the 2.7 million Google hits for this simple query. Most of the debate, however, concentrates on the presumed effects on visitors to the actual blog and on (missed?) advertising revenue.

This afternoon I had an interesting discussion with a representative of a library organization about the findability and accessibility of scientific information. My point of view was that blogging about science and scientific articles would at least increase the findability of those articles. However, this is only true when the blog's feeds are full feeds. Nowadays, the discovery of very new, young or even premature information on the web should be complemented with searches on blog search engines and news search engines. These search engines are mostly not quite what their name suggests: in most instances they are RSS feed search engines, i.e. they only index RSS feeds.

The consequences are simple. When a blog uses partial feeds, only the headline is indexed by blog search engines. Have a look, for instance, at the Technorati results for the IAALD blog, or at Google Blog Search, or at Ask blog search. These represent the top three blog search engines at the moment. The discoverability of content from the IAALD blog with these search engines is miserable, even though it has some excellent content.
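The mechanics are easy to demonstrate: a feed search engine sees only what the feed itself carries, typically each item's title and description, and never fetches the HTML page behind the link. With a partial feed, the description is a one-line teaser, so the body text is simply invisible to it. A minimal Python sketch (the RSS fragment below is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up RSS fragment: one full-feed item and one partial-feed item
# that exposes only a truncated teaser.
RSS = """<rss version="2.0"><channel>
  <item>
    <title>Full feed post</title>
    <description>The complete article text, every paragraph of it, travels
      inside the feed, so a feed search engine can match any sentence.</description>
  </item>
  <item>
    <title>Partial feed post</title>
    <description>Read more on the blog.</description>
  </item>
</channel></rss>"""

def searchable_text(rss_xml):
    """Return what an RSS search engine can actually index for each item:
    only the title and description carried in the feed itself."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("description", ""))
            for item in root.iter("item")]

for title, body in searchable_text(RSS):
    print(title, "->", len(body.split()), "indexable words")
```

Running this shows the full-feed item offering a whole body of matchable words, while the partial-feed item offers almost nothing beyond its headline, which is exactly the discoverability gap described above.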

So far, the discussion of full feeds versus partial feeds has concentrated on the arguments of pro bloggers worried about their advertising revenue. For scientists, the argument of discoverability is far more important, and they should always opt for full feeds to syndicate their content as widely as possible.

It sounds strange, but a lot of people have not yet realized this.