Archive for the 'Databases' Category

Library website design, search engine crawlers and SEO

Digital libraries have tons of data, and when they don’t have the data in digital format, they have really nice, structured, electronically available metadata about those data. Library catalogs, we call them. They are plain ordinary databases and come in all kinds of flavours.

When I joined the library behind the scenes some eight years ago, indexing of the library catalog was off limits for the search engines. That would cripple the system, at the expense of the users! Actually, we were talking about Altavista and AllTheWeb in those days, although Google was around already. Times have changed though. We have taken away all kinds of no-index, no-follow signs on our system and the first catalog cards are being indexed by search engines. We are just starting to use RSS and OAI as sitemaps for Google. But this is not the only approach that should be taken. The site should become optimized for the Google bots and the crawlers of all kinds of search engines, although Google is by far the most important search engine at this moment.

It is interesting to look back in my archives: a study was carried out two years ago by Drunk men work here. It is not a peer-reviewed study, so it seems, but interesting nevertheless. In their research they compared the crawling behaviour of the Google, Yahoo! and MSN bots on a really large site that was set up as a binary search tree. Quite amazingly, the Yahoo! bot proved to be the most proficient bot, having indexed the most underlying pages, down to the deepest level. The Google bot followed at quite some distance and MSN came in last.

How matters have changed over the last two years. Smith and Nelson (2008) built two large digital library websites to study crawler behaviour. They compared wide and deep linking designs of the websites. It appeared that conventional wisdom held true, in that the wide design sites were indexed twice as fast: in the case of Google, 18 days compared to 44 days. The Yahoo! crawler failed to index the complete site. The MSN bot took more than 200 days for the wide design and failed to index the entire website when using the deep design.
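To get a feel for why wide designs are easier on crawlers, here is a minimal sketch that counts how many link levels a crawler has to descend before it can reach every page. The page count and fan-out values are made-up illustrations, not the figures used by Smith and Nelson, and the function name is my own.

```python
def levels_needed(n_pages: int, fanout: int) -> int:
    """Smallest link depth at which a tree with the given fan-out
    (links per page) can hold n_pages pages. The home page is depth 0."""
    if n_pages < 1 or fanout < 2:
        raise ValueError("need at least one page and a fan-out of 2 or more")
    depth = 0
    capacity = 1          # pages reachable so far (just the home page)
    level_size = 1        # pages on the current deepest level
    while capacity < n_pages:
        level_size *= fanout
        capacity += level_size
        depth += 1
    return depth


# Hypothetical site of 10,000 pages:
deep = levels_needed(10_000, fanout=2)     # binary-tree style, "deep" design
wide = levels_needed(10_000, fanout=100)   # 100 links per page, "wide" design
print(f"deep design: {deep} levels, wide design: {wide} levels")
```

With only two links per page the deepest records sit thirteen clicks from the home page, whereas a hundred links per page brings everything within two clicks.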

The latest article I read that touched on this subject was by Jody L. DeRidder (2008), who explored the use of Google sitemaps and static browse pages (for which I have been pleading for so long already, not so much for robots as for our human users). The study concluded, with a relatively small sample, that static browse pages enhanced crawling and indexing by search engines.
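For readers who have never seen one, a sitemap is just a small XML file listing the URLs you would like crawled. The sketch below builds one with the Python standard library, following the public sitemaps.org 0.9 schema. The catalog URLs and dates are invented examples, and this is not the setup described by DeRidder.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (url, last_modified) pairs.

    Dates are expected as YYYY-MM-DD strings.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical catalog records:
records = [
    ("https://catalog.example.org/record/1", "2008-04-01"),
    ("https://catalog.example.org/record/2", "2008-04-03"),
]
print(build_sitemap(records))
```

In practice the list of records would come from whatever the catalog can export, for instance the identifiers and datestamps of an OAI feed.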

Having digested all this, I think we are back again at our thesaurus and classification system, and should use those logical trees as entry points for the crawlers into our catalog. Isn’t it nice that we have been indexing all our records manually for years, and that we can make use of that system as efficient highways into our catalog for the crawlers? Old systems used for new purposes.
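As a rough illustration of the idea, the sketch below walks a tiny thesaurus and produces one static browse page per term, each linking to its narrower terms and to the records indexed under it. The thesaurus, the record identifiers and the URL layout are all invented for the example. A real catalog would read them from its own database.

```python
from html import escape

# Hypothetical miniature thesaurus: each term maps to its narrower terms.
THESAURUS = {
    "agriculture": ["crops", "livestock"],
    "crops": ["cereals"],
    "livestock": [],
    "cereals": [],
}

# Hypothetical record identifiers indexed under each term.
RECORDS = {
    "cereals": ["rec001", "rec002"],
    "livestock": ["rec003"],
}


def slug(term):
    """Turn a thesaurus term into a file-name-friendly string."""
    return term.lower().replace(" ", "-")


def browse_page(term):
    """Return the HTML of the static browse page for one term."""
    items = []
    for narrower in THESAURUS.get(term, []):
        items.append(
            f'<li><a href="/browse/{slug(narrower)}.html">{escape(narrower)}</a></li>'
        )
    for record_id in RECORDS.get(term, []):
        items.append(f'<li><a href="/record/{record_id}">{escape(record_id)}</a></li>')
    return f"<h1>{escape(term)}</h1>\n<ul>\n" + "\n".join(items) + "\n</ul>"


def build_all(root):
    """Walk the thesaurus from the root term and return {file name: HTML}."""
    pages = {}
    stack = [root]
    while stack:
        term = stack.pop()
        filename = f"{slug(term)}.html"
        if filename in pages:      # term already visited via another branch
            continue
        pages[filename] = browse_page(term)
        stack.extend(THESAURUS.get(term, []))
    return pages


pages = build_all("agriculture")
print(sorted(pages))
```

Every record then sits at most a few plain links away from the top term, which is exactly the kind of highway a crawler (or a human) can follow.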

Release the spiders!

The second part of the exercise is, of course, designing efficient catalog records that rank well in the search engine result pages. I wonder who has formally studied those matters. Any suggestions?

References
DeRidder, J. L. (2008). Googlizing a digital library. Code4Lib Journal, 2. http://journal.code4lib.org/articles/43
Smith, J. A. and M. L. Nelson (2008). Site design impact on robots: An examination of search engine crawler behaviour at deep and wide websites. D-Lib Magazine, 14(3/4). http://www.dlib.org/dlib/march08/smith/03smith.html

PubMed sucks, or the user is broken

Anna Kushnir runs a blog on a high-profile platform over at Nature Publishing. Last Saturday she complained about the user friendliness of PubMed.

I have spent an absurd amount of time on PubMed recently and can say in no uncertain terms that it is making my dissertation writing way more painful than it needs to be. I can hold a paper in my hands, search for two authors’ last names and have PubMed come up with nothing.

PubMed, however, is probably the most widely used bibliographic database in the world, and certainly in the world of medicine. Many libraries run special classes to teach the intricacies of PubMed. We librarians have to admit that searching PubMed is not easy. It is certainly not intuitive. After you’ve found what you searched for, it is complicated to get the information over to another programme such as Reference Manager or EndNote. If you succeed in that, you get abbreviated journal titles, authors with at most two initials, etc.

How surprising was the reaction of Dean Giustini. Well, his reaction is perhaps typical for librarians in general: we go out and teach the user a few tricks. We teach and teach. The database is not broken! It’s the user we need to mend.

I thought Dean would know better than this. Of course he is right that this complaint about PubMed is an excellent teaching moment. But I would rather stress the message from Anna Kushnir, and that is that searching PubMed is not intuitive. Far from it. Even if you had classes in searching PubMed some years ago, that knowledge is now obsolete. That is good for PubMed, since they innovate and improve, but when we think that refresher courses in searching PubMed should be high on the lists of doctors, surgeons and medical researchers, we are on the wrong track. They simply don’t have time for these courses. It is a rat race to keep informed on the progress in their own specialities. Why would they need courses meant for full-time MLIS professionals to search a bibliographic database?

We have to go out there and listen to our users. Anna Kushnir is one of them. Her message is plain and simple: searching PubMed, however good we think it already might be, should become more intuitive. I think we should, and can, do a lot better in building these more intuitive search engines.

I see the post from Anna more as a challenge for our profession, than as a teaching moment.

ISI Web of Knowledge development survey

Thomson Scientific has posted a small survey on the new Web of Knowledge interface. It only took 5 to 10 minutes to complete. Really worth the effort when you are serious about this product. One of the questions that struck me as a bit odd was where they inquired about the necessity of a fully functioning back button on your browser. It struck me as odd since I have heard from the marketing people themselves that users are complaining about a back button that does not work. I only get frustrated a couple of times per session in Web of Knowledge when a page has expired once again. Old habits never die. So each time I whistle a foul tone when it happens.

So please take this Survey, and tell them!

The response box for ideas for improvement is a bit small. So please ISI have a look at the following related posts:

Seems that times are changing, and they start listening to their users again!

Thomson launches ScienceWatch

One of the lesser known citation databases from Thomson Scientific is the Essential Science Indicators. It actually has some of the most interesting material, since it contains analyses of the WoS data over the past ten years. Since it is a bit of an odd database, there is quite a lot of support material around it. Those websites, however, had the look and feel of the twentieth century (have a look, before it is too late, at In-cites or ESI-topics and you will probably agree).

However, they have updated the site and completely overhauled the looks, resulting in a brand new ScienceWatch. It looks much better, cleaner, fresher, and appears to be better organized. However, for the most important page for my day-to-day work, the journal list, they still use the old journal list at In-Cites.

If they are about to redesign the list, I only have a few simple requests for Thomson. Please do include ISSN numbers in this list, and secondly match the journal abbreviations with those in Web of Science. The last one seems only too logical, but it wasn’t the practice up till now. At the same time I do realize that this request means a major overhaul of the ESI database as well. Perhaps that is about time. After the new Web of Knowledge this interesting database can’t be left behind. But please, please, please, do keep the file as a single downloadable table, which works really well. Much better than the current master journal lists.

With this new site I have to update my RSS feeds as well! A bit odd.

New version of Citeseer available

Citeseer was the first citation-enhanced bibliographic database to provide freely available citation data for the scientific literature. It was therefore the first serious competitor for the kings of citation data, ISI/Thomson Scientific. Citeseer covered the literature of computer and information science. Started in 1997 at the NEC Research Institute, Princeton, New Jersey, it has come a long way. Since its inception, the original CiteSeer grew to index over 750,000 documents and served over 1.5 million requests daily, pushing the limits of the system’s capabilities.

The next generation Citeseer, CiteseerX, is now available for search.

My first impression is a really nice intuitive layout, and a fast search performance. I will keep pointing students to this free resource during my classes on citation analysis.

Scopus is adding institute disambiguation

Today it was announced that institute disambiguation, or the affiliation identifier, will become functional in Scopus early January 2008. This promotional site demonstrates what a search for the University of Liverpool returns: options for selecting the right University of Liverpool, and whether or not you want to include the teaching hospitals in a subsequent search as well.

Web of Science already included a refine option with an affiliation option amongst others, but the way the results are presented for Scopus shows that Elsevier has taken a different approach to solving this problem.

It will be interesting to test both approaches in more detail when the Scopus tool is officially launched.

Scopus is speeding up its indexing

I knew it was coming, but today I noted for the first time that Scopus is already indexing and alerting ‘articles in press’ (or any of its variations such as ‘online first’). In one of my regular alerts I got this article from Henk Moed:

Moed, H.F. (2007) UK Research Assessment Exercises: Informed judgments on research quality or quantity? Scientometrics, pp. 1-9. Article in Press

SJR : Scimago Journal & Country Rank

Sometimes you find these real gems. Wow, fantastic.

This evening I had this exciting feeling when I saw SJR for the first time. Tipped off by Recherchen Blog, I stumbled upon Scimago, a database that provides a plethora of bibliometric indicators for journals and research performance at a country level. They have developed their own PageRank-like indicator (as in Google’s PageRank) for journal ranking, called the SJR indicator. But the data they provide is much more than only this indicator. Articles, citations and citations per article are provided as well.

This database is based on data provided by Scopus, which covers a much larger dataset than Journal Citation Reports or the Essential Science Indicators from Thomson Scientific. Very interesting to observe that SJR is freely available on the Web. This is a new development in the competition that is taking place between the two publishing giants Elsevier and Thomson.

The information contained in SJR is so overwhelming that it will take some time before I fully comprehend the possibilities of this database, understand the new indicators and make comparisons with the old established databases. The system provides really nice graphics for journal data as well. The makers of SJR are really serious about their research: they recently published some of their analyses with this database in Scientometrics (on my pile of stuff to read).

Noted some mention of SJR at Sidi and DigitalKoans as well. In the Spanish blogosphere the rumour has been spreading for some time already.

This database will certainly be covered in more detail at a later date.

Literature:
Moya-Anegón, F. d., Z. Chinchilla-Rodríguez, B. Vargas-Quesada, E. Corera-Álvarez, F. J. Muñoz-Fernández, A. González-Molina & V. Herrero-Solana (2007). Coverage analysis of Scopus: A journal metric approach. Scientometrics 73(1): 53-78. http://www.scimago.es/file.php?file=/1/Documents/CoverageScopus07.pdf

Interview with two Thomson executives on the citation indexes

When you work on a nearly daily basis with the products of Thomson ISI and have developed a love-hate relationship with the databases, you sift through all the information on these products you can find. I therefore welcomed the interview with Keith MacGregor, executive VP of Thomson’s academic and government strategic business unit, and James Testa, senior director, editorial development and publisher relations for Thomson, that Nancy K. Herther published in the last issue of the Searcher (not free on-line).

The interview itself was rather too nice; the interviewer was perhaps too polite to raise really sensitive subjects. The parting thoughts listed by Herther at the end of the interview were the most interesting points of the whole article. A real pity that the two executives did not have a chance to formulate their opinions on those points. In addition to the parting thoughts listed by Herther, I would have loved to hear the opinion of these two gentlemen on the stubborn ISI/Thomson Scientific policy not to change anything in the data collected in WoS. This results in all kinds of inconsistencies in journal and author names when these are studied over a longer time period. I have the feeling that they try to correct some of the data in the software environment, but when you have to deal with the output as an analyst or collection development librarian, you end up with a load of data inconsistencies.

Only a few days ago I had to look into the citedness of T.B. van Wimersma Greidanus who published between 1969 and 1996. Impressive publication list, but really difficult to collect all those 300+ references from the cited ref search. For journal titles I have blogged already on this subject before and even before.

According to a few, Thomson is opening up a bit. However Herther wrote “I read a great deal of the published criticisms of citation data used for ranking individuals and institutions. I was therefore surprised at the absence of Thomson Scientific’s voice in many of these debates”. Which confirms my impression. But then again, perhaps times they are a-changin’.

Reference
Herther, N. K. (2007). Thomson Scientific and the citation indexes : an interview with Keith MacGregor and James Testa. Searcher 15(10): 8-17.

Consistent search interfaces, oh so difficult

One of my annoyances when searching for journals in Web of Science has always been that in standard search you have to fill in the full journal title, but when you search for a journal in the cited ref search you have to use the abbreviated journal title. A very inconvenient way of doing searches in the same database, albeit in a different index. Try explaining this in your classes on searching databases. Another small grumble in this respect is that the title abbreviations between or within different ISI products are not the same either, so you are always left guessing.

This afternoon I had to check since when the Journal of Environmental Planning and Management has been indexed in Web of Science. The answer was found quite quickly. The journal only started this year to be covered by WoS. So I had to look up some citation data using a cited ref search. Easier said than done.

Using the official journal abbreviation list on the cited ref search, the journal appeared not to be there. But it has been indexed in WoS since the beginning of this year already. Moving over to the new interface, assuming they would have updated matters there a lot more, brought me some more disappointment. The journal list in the new interface was not up to date either.

Guessing the abbreviation I arrived quickly at the following abbreviations being used within WoS for the same journal:

  • J ENV PLANNING MANAG
  • J ENV PLANN MANAG
  • J ENV PLAN MANAG
  • J ENV PLAN MNGMT
  • J ENV PLANNING MANGE
  • J ENV PLAN MANAGE
  • J ENVIRON PLANN MAN

This list is certainly not exhaustive, but just illustrates my point of different abbreviations for the same journal (how do they ever calculate the right impact factor, you might wonder?).
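Collecting such variants by hand is tedious, so here is a small sketch of how one might flag candidate abbreviations automatically. It assumes a very simple rule of my own: a variant matches when each of its words starts with the same letter as, and is spelled with letters taken in order from, the corresponding word of the full title. This is certainly not how Thomson handles abbreviations, and the stop-word list is an assumption too.

```python
STOPWORDS = {"OF", "AND", "THE", "FOR", "IN"}   # assumed, not an official list


def is_abbreviation_of(abbr: str, word: str) -> bool:
    """True if abbr starts like word and its letters occur in word in order."""
    if not abbr or abbr[0] != word[0]:
        return False
    remaining = iter(word)
    # 'ch in remaining' consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in abbr)


def matches_title(variant: str, full_title: str) -> bool:
    """True if every word of the variant abbreviates the matching title word."""
    title_words = [w for w in full_title.upper().split() if w not in STOPWORDS]
    variant_words = variant.upper().split()
    if len(variant_words) != len(title_words):
        return False
    return all(is_abbreviation_of(a, w) for a, w in zip(variant_words, title_words))


TITLE = "Journal of Environmental Planning and Management"
candidates = [
    "J ENV PLANNING MANAG",
    "J ENV PLANN MANAG",
    "J ENV PLAN MANAG",
    "J ENV PLAN MNGMT",
    "J ENV PLANNING MANGE",
    "J ENV PLAN MANAGE",
    "J ENVIRON PLANN MAN",
    "J ENV PSYCHOL",          # a different journal, should be rejected
]
for candidate in candidates:
    print(candidate, "->", matches_title(candidate, TITLE))
```

A rule this loose will also let through false positives for journals with similar titles, so the output is a list to check by eye, not a final answer.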

My idea is that when you undertake such a major overhaul of your web platform, you look at the search ergonomics as well. Full title search in the normal search and abbreviated title search in the cited ref search should have been reported back to ISI headquarters as a problem by all the marketeers and sales people on many different occasions. So this little annoyance should have been rectified in the latest extensive product overhaul.

That journal abbreviation lists are not up to date with the latest additions of newly indexed periodicals is a sign of very sloppy maintenance of your databases. For an important database such as Web of Science I would have expected higher standards of accuracy.

It seems that the competition has not yet fully woken up this giant in database land. Please, Thomson, wake up!