Academic search engine optimization: for publishers

A few weeks ago a tweet on the subject of academic search engine optimization caught my eye.

The nicely styled PDF referred to in the tweet was from Wiley. Wiley has been quite active in this area. Somewhere in my bookmark list I have the link to their webpage on optimizing your research articles for search engines (SEO), tucked away in their author services section, and a link to the article “Search engine optimization and your journal article: Do you want the bad news first?” on their Exchange blog. Wiley is not the only publisher dealing with this subject: here is an example on academic search engine optimization from Elsevier and another example from Sage. I bet there are other examples from publishers to be found.

The main advice is to use the right keywords. Use these keywords in your title and repeat them throughout your abstract, contextually repeated as they say. Mention some synonyms for those keywords as well, and do make use of the article's keyword fields too. The publishers emphasize using Google Trends or Google AdWords to find the right keywords, but in my opinion that is ill-advised for academic search engine optimization. When selecting keywords it is better to use the keyword systems, ontologies or thesauri of your subject area, because experienced researchers will use that terminology to search for their information as well. In the biomedical area the obvious choice is to consult the MeSH browser, while in agriculture or ecology the CAB Thesaurus is the first choice for selecting appropriate keywords. The Wiley SEO tips end with the advice to be consistent with your own name (and affiliation; your lab deserves to be named properly as well), and not to forget to cite your previous work.
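Checking draft keywords against a controlled vocabulary can be automated. The sketch below is a minimal illustration, assuming the preferred terms have already been exported from a thesaurus such as MeSH or CAB into a plain set; the mini-thesaurus and candidate keywords are made up for the example.

```python
# Sketch: split an author's draft keywords into terms that already match a
# controlled vocabulary and free-text leftovers that should be replaced by
# preferred terms. The thesaurus set and candidates below are invented; a
# real workflow would load terms from MeSH or the CAB Thesaurus.

def preferred_terms(candidates, thesaurus):
    """Partition keywords into controlled terms and free-text leftovers."""
    controlled = [t for t in candidates if t.lower() in thesaurus]
    free_text = [t for t in candidates if t.lower() not in thesaurus]
    return controlled, free_text

cab_like = {"soil fertility", "crop yield", "nitrogen fertilizers"}
draft = ["Soil fertility", "dirt quality", "Crop yield"]

controlled, leftovers = preferred_terms(draft, cab_like)
print(controlled)   # keywords already matching the vocabulary
print(leftovers)    # candidates to swap for preferred thesaurus terms
```

The leftovers are the terms worth revisiting: "dirt quality" would be replaced by whatever preferred term the thesaurus uses for that concept.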

The role of the editors in Academic Search Engine Optimization

In their short PDF the Wiley team recommends using headings as well: “Headings for the various sections of your article tip off search engines to the structure and content of your article. Incorporate your keywords and phrases in these headings wherever it’s appropriate.” A nice suggestion, but in practice this is hardly ever in the hands of the individual author. Scholarly articles tend to have a rather fixed structure, the IMRAD structure (Introduction, Methods, Results and Discussion) being the most common. In such a case the author has no room to add headings at the right positions in the paper. But research by Hamrick et al. showed that papers with callouts tend to receive higher numbers of citations. A “callout” is a phrase or sentence from the paper, perhaps paraphrased, that is displayed prominently in a larger font. The journal they investigated had abandoned the practice of using callouts, but after their article the practice was reinstated. A decision like that is an editorial decision. All journals would do well to help their readers with pointers in the form of callouts, and to benefit from the effects callouts can have on academic search engine optimization as well. My favourite Wiley journal, JASIST, certainly doesn’t make systematic use of them.

The other topic on which the editorial board has an important say is the layout of the reference lists in their journals. I have pleaded many times before for a reduction in the number of reference list specifications. It looks as if the first task a newly established journal's editorial board embarks upon is formulating yet another exotic variation on the many different styles for laying out the reference list. The point, however, is that these definitions hardly make use of the possibilities of academic search engine optimization, or of search engine optimization at all; most often they forget to include linking options in the reference list altogether. Older instructions to authors have not caught up with the present time. In the HTML version of scholarly articles, links are included as part of the journal platform software, but in the PDF versions the URLs are often forgotten altogether. Where DOIs are linkable in the webpage, DOIs in the PDF version are most often presented in the form doi:10.1002/asi/etc. The APA style and many others even explicitly stipulate referencing a DOI as doi:, which goes against the advice of the DOI governing body. These bad practices result in DOIs in the PDF versions of reference lists that don't link, which is a complete and utter waste of SEO opportunity. Academic search engine optimization is badly broken in this area.
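Turning bare doi: statements into resolvable links is a mechanical transformation that a journal's production pipeline could apply. A minimal sketch, assuming references arrive as plain text; the sample reference and its DOI are fabricated for illustration:

```python
import re

# Sketch: rewrite bare "doi:10.xxxx/yyy" strings in a reference list into
# clickable resolver URLs, in line with the DOI display guidelines.
# The example reference below is invented.

DOI_PATTERN = re.compile(r"\bdoi:\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)

def linkify_dois(text):
    """Replace 'doi:' prefixes with resolvable https://doi.org/ links."""
    return DOI_PATTERN.sub(lambda m: "https://doi.org/" + m.group(1), text)

ref = "Smith, J. (2011). An example article. doi:10.1000/xyz123"
print(linkify_dois(ref))
# Smith, J. (2011). An example article. https://doi.org/10.1000/xyz123
```

Running this over every reference list before typesetting the PDF would make each citation a working link for both readers and crawlers.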

The role of publishers in Academic Search Engine Optimization

Publishers have their role in supporting the editorial boards in resolving the two items mentioned above. But they should also have a careful look at the PDF files they currently produce. At this moment the Google Webmaster blog gives only a few pointers on PDF optimization. To mention a few interesting ones: links should be included in the PDF (again, DOIs as links rather than doi: statements), since they are treated as ordinary links. And this last point is important as well: “How can I influence the title shown in search results for my PDF document?” The title attribute in the PDF is used, as is the anchor text. On publishers' sites the anchor text is most often simply “PDF”; if only they would use the article title as anchor text, it would work to their advantage. And although it is not mentioned in the Google Webmaster blogpost, probably because it is too obvious: if the file name were based on the title, it would certainly help the SEO for the PDF, and it would help all those scientists who download PDF files for their research to sort out which file is about what. Was 123456.pdf about the genetics or the genomes, or was that 234567.pdf? Clear titles would help researchers as well as search engines to work out what it is all about.
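Deriving a descriptive file name from the article title is trivial to automate. A minimal sketch; the title used below is invented for the example:

```python
import re

# Sketch: turn an article title into a readable, URL-safe PDF file name,
# instead of an opaque numeric id like 123456.pdf.

def title_to_filename(title, max_len=80):
    """Lowercase the title, collapse punctuation and spaces into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-") + ".pdf"

print(title_to_filename("Assessing What Distinguishes Highly Cited Papers"))
# assessing-what-distinguishes-highly-cited-papers.pdf
```

A file named like this tells both the downloading researcher and the search engine what it contains before it is even opened.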

And whilst publishers are on the subject of PDF optimization, they might as well complete the other attributes of their PDF files, such as authors, keywords and summary. If not now, another search engine might make use of those attributes some day; you might as well be prepared. Researchers using reference management tools can also benefit from those metadata attributes. Ross Mounce has some interesting blogposts about researchers' need for good metadata in PDFs. In theory this is little effort, since all that metadata is in the publishers' databases already; so use it to optimize your PDFs for academic search engine optimization, or as a service to your most loyal users, who have so far put up with a load of bad PDFs.
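Filling those attributes is a mapping exercise from the bibliographic record to the standard PDF document-information keys. The sketch below only builds that mapping; the record is fabricated, and a real pipeline would write the entries into the file with a PDF library or a tool such as ExifTool.

```python
# Sketch: map fields from a publisher's bibliographic record onto the
# standard PDF document-information keys (/Title, /Author, /Subject,
# /Keywords). The record below is an invented example.

def to_pdf_info(record):
    """Build a PDF info dictionary from a bibliographic record."""
    return {
        "/Title": record["title"],
        "/Author": "; ".join(record["authors"]),
        "/Subject": record["abstract"],
        "/Keywords": ", ".join(record["keywords"]),
    }

record = {
    "title": "An example article",
    "authors": ["A. Author", "B. Author"],
    "abstract": "One-paragraph summary of the article.",
    "keywords": ["soil fertility", "crop yield"],
}
info = to_pdf_info(record)
print(info["/Author"])   # A. Author; B. Author
```

Since every field here already exists in the production database, populating the PDF metadata is essentially free once the mapping is in place.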

References

Hamrick, T. A., R. D. Fricker, and G. G. Brown. 2010. Assessing what distinguishes highly cited from less-cited papers published in Interfaces. Interfaces, 40(6): 454-464. http://dx.doi.org/10.1287/inte.1100.0527. OA version: http://faculty.nps.edu/tahamric/docs/citations%20paper.pdf


Google and the academic Deep Web

Hagedorn and Santelli (2008) just published an interesting article on the comprehensiveness of Google's indexing of academic repositories. This article triggered me to write up some observations I had been intending to make for quite some time. It addresses a question I got from a colleague of mine, who observed that the deep web apparently doesn't exist anymore.

Google has made a start at indexing Flash files. Google has made a start at retrieving information hidden behind search forms on the web, i.e. it has started to index information contained in databases. Google and OCLC exchange information on books scanned and those contained in WorldCat. Google, so it seems, has indexed the web comprehensively, with 1 trillion indexed webpages. Could there possibly be anything more to index?

The article by Hagedorn and Santelli shows convincingly that Google still has not indexed all the information contained in OAIster, the second largest archive of open access article information; only Scientific Commons is more comprehensive. They tested this with the Google Research API under the University Research Program for Google Search, checking only whether the URL was present. This approach reveals only part of the depth of the academic deep web, but the figures are staggering already. And reality bites even more.

A short while ago I taught a web search class for colleagues at the University Library at Leiden. To demonstrate what the deep or invisible web actually constitutes, I used an example from their own repository: a thesis on Cannabis from last year, deposited as one huge PDF of 14 MB. Using Google you can find the metadata record, and with Google Scholar as well. However, if you search for a quite specific sentence from the opening pages of the actual PDF file, Google does not return the sought-after thesis. You find three other PhD dissertations, two of them defended at the same university on the same day, but not the one on Cannabis.

Interestingly, you are able to find parts of the thesis in Google Scholar, e.g. chapter 2, chapter 3, etc. But those are the chapters of the thesis that have been published elsewhere in scholarly journals. Unfortunately, none of these parts in Google Scholar refers back to the original thesis, which is in open access, or to the OA journal article pre-prints posted in the Leiden repository. In Google Scholar most of this material still sits behind toll gates on publishers' websites.

Is Google to blame for this incomplete indexing of repositories? Hagedorn and Santelli do indeed point the finger at Google. However, John Wilkin, a colleague of theirs, doesn't agree. Neither did Lorcan Dempsey, and neither do I.

I have taken an interest in the new role of librarians. We are no longer solely responsible for bringing external documentary resources into the realm of our academic clientele. We also have the dear task of bringing the fruits of our researchers' labour as well as possible into the floodlights of the external world, be the interest academic or plain lay curiosity. We have to bring the information out there. Open Access plays an important role in this new task, but that task doesn't stop at simply making material available on the web.

Making it available is only a first, essential step. Making it rank well is a second, perhaps even more important step. So as librarians we have to become SEO experts. I have mentioned this before, both here and on my Dutch blog.

So what can be done about this example from the Leiden repository? There is actually a slew of measures that should be taken. The first, of course, is to divide the complete thesis into parts at the chapter level. Admittedly, publishers give permission to republish the articles that make up most theses in the hard sciences in the Netherlands only once the thesis has been published as a whole. On the other hand, nearly 95% of publishers allow publication of pre-prints and peer-reviewed post-prints: the so-called RoMEO green road. So it is up to the repository managers, preferably with the consent of the PhD candidate, to split the thesis into its parts (the chapters, which are the pre-prints or post-prints of articles) and archive the thesis at the chapter level as well. The record for the thesis then contains links to the individual chapters deposited elsewhere in the repository, and these far more digestible chunks of information are much more palatable to search engine spiders and crawlers.
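The resulting parent-child structure can be sketched in a few lines. This is a minimal illustration of the idea only; the repository URL and chapter identifiers are invented, and a real repository would express these links in its own metadata format.

```python
# Sketch: a parent repository record for a thesis that links to
# chapter-level deposits, so crawlers find small, well-titled chunks
# instead of one 14 MB PDF. All identifiers below are invented.

def thesis_record(title, chapter_ids, base_url):
    """Return a parent record whose links point at chapter records."""
    return {
        "title": title,
        "links": [f"{base_url}/{cid}" for cid in chapter_ids],
    }

record = thesis_record(
    "Example thesis",
    ["chapter-2", "chapter-3", "chapter-4"],
    "https://repository.example.org/handle",
)
print(record["links"][0])  # https://repository.example.org/handle/chapter-2
```

Each chapter link resolves to its own record with its own title and PDF, which is exactly the kind of structure search engine crawlers reward.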

An interesting side effect of this additional effort on the repository side is that deposit rates will increase considerably. This applies to most universities in the Netherlands, and to our collection of theses as well. Since PhD students are responsible for the lion's share of academic research at the university, depositing the individual chapters as article pre-prints in the repository will be of major benefit to the university's OA performance. It will require more labour on the side of repository management, but if we take this seriously it is well worth the effort.

We still have to work really hard at the visibility of the repositories, but making the information more palatable is a good start.

Reference:
Hagedorn, K. and J. Santelli (2008). Google still not indexing hidden web URLs. D-Lib Magazine 14(7/8). http://www.dlib.org/dlib/july08/hagedorn/07hagedorn.html