Document types in databases

I am currently reading, with a great deal of interest, the arXiv preprint of the review by Ludo Waltman (2015) from CWTS on bibliometric indicators. In this post I want to comment briefly on his section 5.1, where he discusses the role of document types in bibliometric analyses. Ludo mainly reviews and comments on the inclusion or exclusion of certain document types in bibliometric analyses; he does not touch upon the subject of discrepancies between databases. I want to argue that he could take his review a step further in this area.

Web of Science and Scopus differ quite a bit from each other in how they assign document types. If you don’t realize that this discrepancy exists, you can draw wrong conclusions when comparing bibliometric analyses based on these two databases.

This blog post is a quick illustration that this is an issue that should be addressed in a review like this. To illustrate my argument I looked at the document types assigned to Nature publications from 2014 in Web of Science and Scopus. The following table gives an overview of the results:

Document Types Assigned to 2014 Nature Publications in Web of Science and Scopus

| WoS document type | # Publications | Scopus document type | # Publications |
|---|---|---|---|
| Editorial Material | 833 | Editorial | 244 |
| Article | 828 | Article | 1064 |
| | | Article in Press | 56 |
| News Item | 371 | | |
| Letter | 272 | Letter | 253 |
| Correction | 109 | Erratum | 97 |
| Book Review | 102 | | |
| Review | 34 | Review | 51 |
| Biographical Item | 13 | | |
| Reprint | 3 | | |
| | | Note | 600 |
| | | Short Survey | 257 |
| Total | 2565 | Total | 2622 |

In the first place Scopus yields for the year 2014 a few more publications for Nature than Web of Science does. The difference can be explained by the articles in press that are still present in the Scopus search results. This probably still requires maintenance from Scopus and should be corrected.
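The arithmetic behind this observation can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, and the dictionary and variable names are my own choice, not anything exported by either database.

```python
# Document type counts for Nature, publication year 2014,
# copied from the table above.
wos = {
    "Editorial Material": 833, "Article": 828, "News Item": 371,
    "Letter": 272, "Correction": 109, "Book Review": 102,
    "Review": 34, "Biographical Item": 13, "Reprint": 3,
}
scopus = {
    "Editorial": 244, "Article": 1064, "Article in Press": 56,
    "Letter": 253, "Erratum": 97, "Review": 51,
    "Note": 600, "Short Survey": 257,
}

wos_total = sum(wos.values())
scopus_total = sum(scopus.values())

# Leave out the articles in press, which Web of Science does not list.
scopus_without_aip = scopus_total - scopus["Article in Press"]

print(wos_total, scopus_total, scopus_without_aip)  # 2565 2622 2566
```

Without the 56 articles in press, the two totals differ by a single publication.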

More importantly, WoS classifies 833 publications as “editorial material”, whereas Scopus classifies only 244 publications as “editorial”. It is a well-known tactic of journals such as Nature to label articles as editorial material, since this practice artificially boosts their impact factor. I have had many arguments with professors whose invited News and Views items (most often very well cited!) were left out of bibliometric analyses because they were assigned to the “editorial material” category.

“Letters”, “corrections” or “errata” are of the same order of magnitude in Scopus and Web of Science. “News items” are a category of publications in Web of Science, but not in Scopus. They are probably listed as “note” in Scopus. Some of the “short surveys” in Scopus turn up in Web of Science as “news item”. But all these categories probably don’t affect bibliometric analyses too much.

The discrepancy in “reviews” between Web of Science and Scopus, however, is important, and large as well. Web of Science classifies 34 articles as a “review”, whereas Scopus counts 51 “reviews” in the same journal over the same period. Reviews are included in bibliometric analyses, and since they attract relatively more citations than ordinary articles, special baselines are constructed for this document type. Comparisons between these databases are therefore affected first and foremost by differences in how they assign document types.

The differences in editorial material, articles and reviews between Web of Science and Scopus are the most likely to affect the outcomes of comparisons in bibliometric analyses between these two databases. But I am not sure about the size of this effect. I would love to see some more quantitative studies in the bibliometrics arena to investigate this issue.



Waltman, Ludo (2015). A review of the literature on citation impact indicators.

The week in review – week 4

The week in review: a new attempt to get some life back into this weblog. It is inspired, of course (for the Dutch readers), by TWIT The Week In Tweets by colleague @UBABert and the older monthly overviews which Deet’jes used to do on

The new Web of Science interface
Whilst I was in Kenya the previous week to give training for PhD students and staff at Kenyatta University and the University of Nairobi, Thomson Reuters released their new version of the Web of Science. So only this week did I have a first go at it. We haven’t been connected to Google Scholar yet and are still waiting to see that come through, but in general the new interface is an improvement over the old one, although searching for authors is still broken for those who haven’t claimed their ResearcherID.
Apart from that, what I hadn’t noticed in the demo versions of the new interface is the new Open Access facet in Web of Science. I like it. But immediately the question of how they do it comes to mind. There is no information in the help files on this new possibility, so my first guess would be the DOAJ list of journals. A message on the Sigmetrics list added a little more confusion, since various PLoS journals are included in their ‘Open Access Journal Title List’, except for PLoS ONE. Actual searches in Web of Science quickly illustrate that, for almost any topic in the past few years, PLoS ONE is the largest OA journal contributing content to this Open Access facet. I guess this new facet in Web of Science will spark some more research in the near future. I see the practical approach of Web of Science as a first step in the right direction. The next challenge is of course to indicate the individual Open Access articles in hybrid journals, followed by (and this will be a real challenge) green archived copies of Toll Access articles. The latter is badly needed since we can’t rely only on Google Scholar to do this for us.

Two interesting articles in the unfolding field of Altmetrics deserve mention. The groups of Judit Bar-Ilan and Mike Thelwall cooperated in “Do blog citations correlate with a higher number of future citations? Research blogs as a potential source for alternative metrics”. They show that Research Blogging is a good post-peer-review blogging platform, able to pick out the better cited articles. However, the number of articles covered by the platform is really too small for it to become a widely used altmetric indicator.
The other article, at the moment still a working paper, was from CWTS (Costas et al. 2014). They combined Web of Science covered articles with the indicators and investigated many different Altmetric indicators, such as mentions on Facebook walls, blogs, Twitter, Google+ and news outlets, but not Mendeley. Twitter is by far the most abundant Altmetric source in this study, but blogs are in a better position to identify top publications. However, the main problem remains the limited coverage by the various altmetrics tools. For 2012, 24% of the publications had an altmetric mention, whereas 26% of the publications had already received a citation. This confirms the finding of the other study that social media tools cover the peer reviewed scholarly output only on a limited scale.

Scholarly Communication
As a follow up on my previous post on the five stars of transparent pre-publication peer review, a few articles on peer review came to my attention. The first was, yet another, excellent bibliography by Charles W. Bailey Jr. on transforming peer review. He did not cover blogposts, only peer reviewed journals. The contributions to this field are published in many different journals, so an overview like this still has its merits.
Through a tweet from @Mfenner I was notified of a really interesting book, ‘Opening Science’. It still lacks a chapter on changes in the peer review system, but it is really strong at indicating new trends in scholarly communication and publishing. Worth further perusing.

Rankings
Although the ranking season has not started yet, the rankers are always keen on putting old wine in new bottles. The Times Higher Education presented this week the 25 most international universities in the world. It is based on the THE WUR released last year, this time focusing only on the ‘international outlook’ indicator, which accounts for 7.5% of their standard ranking. Of the Dutch universities Maastricht does well. Despite the fact that Wageningen University hosts students from more than 150 countries, we only ranked 45th on this indicator. More interesting was an article by Alter and Reback (2014), in which they show that rankings actually influence the number of freshmen applying to a college in the United States, and that quality of college life is an important factor as well. So it makes sense for universities to invest in campus facilities and recreation possibilities such as sports grounds.

Random notes
A study on copyright, database rights and IPR in Europe for Europeana by Guibault. Too much to read at once, and far too difficult to comprehend at once. But essential reading for repository managers.


Alter, M., and R. Reback. 2014. True for Your School? How Changing Reputations Alter Demand for Selective U.S. Colleges. Educational Evaluation and Policy Analysis. (Free access)
Bailey Jr., C. W. 2014. Transforming Peer Review Bibliography. Available from
Binfield, P. 2014. Novel Scholarly Journal Concepts. In: Opening Science, edited by Sönke Bartling and Sascha Friesike, 155-163. Springer International Publishing. OA version:
Costas, R., Z. Zahedi, and P. Wouters. 2014. Do ‘altmetrics’ correlate with citations? Extensive comparison of altmetric indicators with citations from a multidisciplinary perspective. CWTS Working Paper Series Vol. CWTS-WP-2014-001. Leiden: CWTS. 30 pp.
Guibault, L., and A. Wiebe. 2013. Safe to be open : Study on the protection of research data and recommendation for access and usage. Göttingen: Universitätsverlag Göttingen 167 pp.
Shema, H., J. Bar-Ilan, and M. Thelwall. 2014. Do blog citations correlate with a higher number of future citations? Research blogs as a potential source for alternative metrics. Journal of the Association for Information Science and Technology: n/a-n/a. OA version:

How Google Scholar Citations passes the competition left and right

Last Thursday Google Scholar Citations went public. It was to be expected. Since August the product has been tested by a few (blogging) scientists. We only had to wait patiently for it to be released to all scientists. Last Thursday the moment was there.

Was it worth the wait? Yes, it certainly was. Google Scholar Citations really excels at finding publications you completely forgot about. But even then, there are still (obscure) publications that even Google Scholar doesn’t know about. You simply log in and deselect those few publications that don’t belong to you. You can run searches to find publications that Google has overlooked. You get a comprehensive publication list quite quickly. Well, when your name is not too common, that is. How it works for very common names (Korean scientists come to mind, as well as John Smith) I don’t know yet. But so far nothing new: Anne-Wil Harzing’s excellent Publish or Perish software already did this. What is new is the fact that Google Scholar Citations keeps the citations and publications automatically up to date and allows you to publish your own publication list on the Web with the citations and some crude citation metrics.

The two major competitors in this arena are Thomson Reuters with their ResearcherID and Elsevier’s Scopus with its Scopus ID. With both services you can identify your own publications and assign them to a unique number. In this way you can create your unique publication list with citation metrics as well. Their main disadvantage compared to Google Scholar is their rather limited resource set. Thomson Reuters’ WoS “only” covers some 10,000 scholarly journals, a set of selected proceedings and, as of recently, only 30,000 books. Scopus has nearly double the number of journals but stays behind in proceedings and covers hardly any books. Google Scholar certainly covers more, but we still don’t understand what is included and what is not, and we sometimes have our doubts about how current Google Scholar is. The larger resource base of Google Scholar, including books and book chapters, will make this service more attractive for social scientists and scholars in the arts and humanities.

On top of the smaller publication base on which these services are built, these two competitors each have their own particular disadvantage as well. You have to maintain your publication list in Thomson Reuters ResearcherID manually. Each time you publish a new article, you have to add it to your profile yourself. Looking around, I see that most researchers are a bit sloppy in this respect. You can, however, make your publication list and the citation impact publicly available. See for example my meagre list. Scopus, on the other hand, maintains your publication list automatically (albeit it made some serious mistakes in this area in the past, but they seem to have improved this service). But, and this is a big but, you can’t publish your properly curated publication list with citations publicly on the Web. They used to have 2Collab for this, but since they stopped 2Collab they haven’t come up with an alternative mechanism to publish your publication list with citation impact on a public website. A real pity.

So Google Scholar easily beats ResearcherID, since it updates automatically, and Scopus ID, because you can make your list with citations publicly available. Making your publication list openly available is really recommended to all scientists; it helps your personal branding.

Certainly there are disadvantages to Google Scholar as well. The most serious at this moment are all kinds of ghost citations. If you look at the citations to our bibliometrics analysis on top of repositories paper, Google counts three citations. But when checking the Leydesdorff citations, a reference to our article is not to be found (of course it should have been there, but it isn’t). 0xDE reported a spam account in the name of Peter Taylor, where they collected various Taylors in a single profile boasting an h-index of 94. That Google Scholar can be fooled has been reported by Beel & Gipp (2010).

When I was interviewed for our university paper on Google Scholar Citations (in Dutch) I told them: Google Scholar is only about five years old. Give them another five years and they will have changed the market for abstracting and indexing databases totally. If only 20 percent of all scientists correct their publication lists (including editing the references to fix the mistakes Google has made), even without making them publicly available, Google sits on a treasure trove of high quality metadata. Really interesting to see how this story will develop.

Joeran Beel and Bela Gipp. Academic search engine spam and google scholar’s resilience against it. Journal of Electronic Publishing, 13(3), December 2010.

Some observations during the bibliometrics session at the Österreichische Bibliothekartag

Although the program consistently talks about the Österreichische Bibliothekartag (singular), the whole library day actually spans four days. One would have expected at least the Österreichische Bibliothekartage (plural), but they insist on mentioning only one day. Of those four days, I was only present during part of the morning of the third day, so this is a very limited report on the Österreichische Bibliothekartag. Looking at their program, it is very comprehensive and interesting. I never thought that you could fill a complete session, 5 presentations, talking about cooking books (no pun intended). It only reflects that bibliometrics was a small part of the program amongst the many other subjects covered. I noticed a lot of presentations on e-book platforms, many digitization projects, plenty on mobile, less on library 2.0 than you would expect (is the hype over?), and open access also had a very limited role. What struck me as interesting for conference organizers is that many commercial presentations were spread evenly throughout the sessions. Just a sign of taking the sponsors seriously.

So much for the conference as a whole, of which I actually experienced too little. On to the bibliometrics session. The session was chaired by Juan Gorraiz, a bubbly Spaniard who has been working in Austria for years. Give him the opportunity and he will take the floor, and he would love to take all the time available and fill the slots for all presentations planned.

The first presentation was on a piece of research that should result in a master’s thesis at some point; some preliminary results were presented in this session by Christian Gumpenberger. The focus of the research was on the acceptance of and familiarity with bibliometrics among Austrian researchers. The results were not really shocking: most researchers stated that they were familiar with impact factors, but for the moment there was no clue as to whether they were aware of a thing like a two-year citation window, or of the difference between citable and non-citable items leading to the inflation of impact factors for journals like Nature and Science. Christian sketched some sunny skies for bibliometrics in Austria, but in the subsequent discussion this sunny view was criticized quite a bit. Nevertheless, I would like to have a look at this master’s thesis when it becomes available.

The second presentation was by Nicola de Bellis from Italy. Nicola has written an interesting book on citation analysis in which he stresses the sociological, philosophical and historical aspects of bibliometric analyses. It is always interesting to hear a presentation like this, away from the fact-finding, number-crunching approach which I normally take, and to dream away a bit on outlines of what in an ideal world should be done on a subject like this. Quite a lot, but some of it is beyond being practical. When you carry out bibliometric analyses in the library at some scale, like dealing with 18,000 papers that have collected 265,000 citations as we do in our library, you can only be practical. So there is an interesting conflict between his presentation (which will be online soon, I hope) and mine, which followed Nicola’s.

I don’t want to cover all aspects of Nicola’s presentation. Go and read the book, which I am going to do as well. But at one point during his presentation I strongly disagreed with him: where he stated that only mediocre scientists have an interest in bibliometrics and that top scientists normally don’t have an interest in this topic. My experience is quite the contrary. In the first place it was one of Wageningen’s top scientists who urged the library to take a subscription to Web of Science back in 2001, and made it possible with a special contribution from his top institute. He knew he was a highly cited scientist, but somehow he needed Web of Science to confirm his reputation. Later on as well, apart from the discussion with scholars in the social sciences department, it has always been the top performing groups that invited me to give a presentation on this subject, rather than the groups that were lagging behind in the bibliometric performance indicators. To me it has always appeared that those who are leading the pack are also interested in staying ahead of the rest, and invite the library to explain the results obtained and to help enhance their performance in the future.

The second point in Nicola’s presentation where he went far beyond the practical was his insistence that, for any publication, all citations should be retrieved from the three general databases (Web of Science, Scopus and Google Scholar), supplemented with citations from at least one citation-enriched subject-specific database. Well, in the first place that is a lot of work for a single publication, and it leads to deduplication errors if you’re not very careful. Secondly, it should be well known that Google Scholar, albeit attractive because of tools like Harzing’s Publish or Perish, is not a reliable database for citation counts at this moment (Jacsó 2008). Google Scholar still has serious problems with ordinary counting and deduplication and should therefore not be used for serious citation analyses.

The third argument against the use of multiple databases goes a bit further into the theory of bibliometrics and relies on approaches described by Waltman et al. (2011) and Leydesdorff et al. (2011). The key point is that a number of citations in itself has no meaning. It should be related to the citations of related documents in the same field of science. You can do that by normalizing on the mean citation rate in the field (Waltman et al. 2011) or by the perhaps more sophisticated approach sketched by Leydesdorff et al. (2011), based on the citation distributions in the field to which the paper belongs. The latter approach is very novel and has not really been widely tested yet. Both approaches rely on the availability of all the citations to the publications in a certain field of science of a certain age and document type. You can expect to have the means or citation distributions available when you work with one specific database (for WoS there is plenty of experience, for Scopus it is coming with SciVal Strata, but for Google Scholar it doesn’t exist yet), but that is beyond reality when you derive citation data from three or four databases at the same time.
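To make the normalization argument concrete, here is a minimal Python sketch of normalizing on the mean citation rate per field, publication year and document type, in the spirit of Waltman et al. (2011). It is not their implementation: the function name, the record layout and all the numbers are made up for illustration, and it assumes that every baseline group has a mean above zero.

```python
from collections import defaultdict
from statistics import mean


def mean_normalized_citation_score(papers, baseline_papers):
    """Average of (citations / expected citations) over a set of papers.

    The expected value is the mean citation count of all baseline papers
    sharing the same field, publication year and document type.
    """
    groups = defaultdict(list)
    for p in baseline_papers:
        groups[(p["field"], p["year"], p["doctype"])].append(p["citations"])
    # Assumption: each baseline mean is greater than zero.
    baselines = {key: mean(counts) for key, counts in groups.items()}

    ratios = [
        p["citations"] / baselines[(p["field"], p["year"], p["doctype"])]
        for p in papers
    ]
    return mean(ratios)


# Hypothetical baseline: every paper in one field and year, from ONE database.
world = (
    [{"field": "Ecology", "year": 2008, "doctype": "Article", "citations": c}
     for c in (0, 4, 8, 12)]      # articles: mean of 6 citations
    + [{"field": "Ecology", "year": 2008, "doctype": "Review", "citations": c}
       for c in (10, 20, 30)]     # reviews: mean of 20 citations
)

# Hypothetical own output: one article and one review.
own = [
    {"field": "Ecology", "year": 2008, "doctype": "Article", "citations": 9},
    {"field": "Ecology", "year": 2008, "doctype": "Review", "citations": 10},
]

score = mean_normalized_citation_score(own, world)
print(score)  # 1.0: (9/6 + 10/20) / 2
```

The review has more citations than the article, yet it scores below its baseline while the article scores above it. The sketch also shows why mixing sources is a problem: the `world` list has to contain all comparable papers from the same database that supplied the citation counts.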

But apart from the critical points I just made, I liked the presentation by De Bellis very much. For those interested in similar views on citation practice, I really recommend reading MacRoberts & MacRoberts (1996) as well.

The session closed with my presentation, which is enclosed here

Bibliometric analysis tools on top of the university’s bibliographic database, new roles and opportunities for library outreach

View more presentations from Wouter Gerritsma

After that the session ended with some discussion, but soon all 30 or so participants hurried off for coffee.


De Bellis, N. (2009). Bibliometrics and citation analysis : From the Science Citation Index to cybermetrics. ISBN 9780810867130, The Scarecrow Press, 450p.
Jacsó, P. (2008). The pros and cons of computing the h-index using Google Scholar. Online Information Review, 32 (3): 437-451
Leydesdorff, L., L. Bornmann, R. Mutz & T. Opthof (2011). Turning the tables on citation analysis one more time: Principles for comparing sets of documents. Journal of the American Society for Information Science and Technology n/a-n/a
MacRoberts, M. H. & B. R. MacRoberts (1996). Problems of citation analysis. Scientometrics, 36(3): 435-444
Waltman, L., N. J. van Eck, T. N. van Leeuwen, M. S. Visser & A. F. J. van Raan (2011). Towards a new crown indicator: Some theoretical considerations. Journal of Informetrics, 5(1): 37-47.

Scimago rankings 2011 released

Today Félix de Moya Anegón announced on Twitter that the Scimago Institutional Rankings (SIR) for 2011 were released. These rankings are not very well known or widely used. Yesterday, during a ranking masterclass from the Dutch Association for Institutional Research, the SIR was not even mentioned. Undeservedly so. Scimago lists just over 3000 institutions worldwide. It is therefore one of the most comprehensive institutional rankings, if not the most comprehensive. It is also a very clear ranking: they only measure publication output and impact. It thus ranks only the research performance of the institutions and is therefore very similar to the Leiden ranking.

What I like about Scimago is the innovative indicators they come up with each year. Last year they introduced the %Q1 parameter, which is the ratio of publications that an institution publishes in the most influential scholarly journals of the world. Journals considered for this indicator are those ranked in the first quartile (25%) in their categories as ordered by the SCImago Journal Rank (SJR) indicator. This year they introduced the Excellence Rate. The Excellence Rate indicates which percentage of an institution’s scientific output is included in the set formed by the 10% most cited papers in their respective scientific fields. It is a measure of the high quality output of research institutions. The two indicators are very similar; the Excellence Rate is just a tougher version of the %Q1.
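Both indicators boil down to a simple share of an institution’s output. The Python sketch below shows the arithmetic only. It assumes that somebody has already determined, for each publication, the quartile of its journal and whether it belongs to the 10% most cited papers of its field; the field names `journal_quartile` and `in_top10_percent_cited` and the sample data are hypothetical, and this is not Scimago’s own procedure.

```python
def percent_q1(publications):
    """Share (%) of publications in journals ranked in the first quartile."""
    in_q1 = sum(p["journal_quartile"] == 1 for p in publications)
    return 100 * in_q1 / len(publications)


def excellence_rate(publications):
    """Share (%) of publications among the 10% most cited of their field."""
    excellent = sum(p["in_top10_percent_cited"] for p in publications)
    return 100 * excellent / len(publications)


# Made-up output of a small institution: five publications.
output = [
    {"journal_quartile": 1, "in_top10_percent_cited": True},
    {"journal_quartile": 1, "in_top10_percent_cited": False},
    {"journal_quartile": 1, "in_top10_percent_cited": False},
    {"journal_quartile": 2, "in_top10_percent_cited": False},
    {"journal_quartile": 4, "in_top10_percent_cited": False},
]

print(percent_q1(output))       # 60.0
print(excellence_rate(output))  # 20.0
```

In this made-up example three of the five papers appear in first-quartile journals, but only one of them reaches the top 10% most cited.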

The other new indicator is the Specialization Index. The Specialization Index indicates the extent of thematic concentration or dispersion of an institution’s scientific output. Values range from 0 to 1, indicating generalist versus specialized institutions respectively.

Their most important indicator of research performance is the Normalized Impact (NI), which is similar to the MNCS of CWTS and the RI as we calculate it in Wageningen. The values show the relationship between an institution’s average scientific impact and the world average, which is set to 1; i.e. a score of 0.8 means the institution is cited 20% below average and 1.3 means it is cited 30% above average.

Last year the Scimago team already showed that there is an exponential relationship between an institution’s ability to place its scientific papers in better journals (%Q1) and the average impact achieved by its production in terms of Normalized Impact. It is a relationship I always show in classes on publication strategy (slides 15 and 16). When looking at the Dutch universities, I noted that the correlation between the new Excellence Rate and Normalized Impact is even better than with the %Q1. So the pressure to publish in the absolute top journals per research field will increase even further if this becomes general knowledge.

What do we learn about the Dutch universities from the Scimago rankings? Rotterdam still maintains its top position for normalized impact; it also scores best for the %Q1 and Exc. Directly after Rotterdam come Leiden, UvA, VU, Utrecht and Radboud with equal impact. Utrecht published the most articles during the period 2005-2009. Wageningen excels at international cooperation. And Tilburg and Wageningen are the most specialized universities in the Netherlands.

Making these international rankings is quite a daunting task. For the Netherlands I noticed that the output of Nijmegen was distributed over Radboud University and Radboud University Nijmegen Medical Centre; this was not done for the other university hospitals. And for Wageningen the output was listed under Wageningen University and Research Centre and Plant Research International (which is part of Wageningen UR). But for researchers from Spain these are difficult nuances to resolve 100% perfectly.

My only real complaint about the ranking is that they state it is not a league table, yet they rank the institutions on publication output. It would be so much more obvious to present the list ranked on NI. Since they only produce the ranking as a PDF file, it took me a couple of hours to translate it into an Excel spreadsheet so that I can rank the data any way I wish. With all the information at hand it is also possible to design your own indicators, such as a power rank by analogy with the Leiden rankings.

The message to my researchers: aim for the best journals in your field. We still have scope for improvement. We are still not in the neighbourhood of the 30 to 40% Exc. Rate we see for Rockefeller, Harvard and the like.