The invisible web is still there, and it is probably larger than ever

Book review: Devine, J., & Egger-Sider, F. (2014). Going beyond Google again : strategies for using and teaching the Invisible Web. Chicago: Neal-Schuman, an imprint of the American Library Association. ISBN 9781555708986, 180p.


The invisible web, as we know it, dates back to at least 2001. In that year both Sherman & Price (2001) and Bergman (2001) published studies that described the issue of the deep, or invisible, web for the first time. These two seminal studies used different terms for the same concept, invisible and deep, but independently of each other they showed convincingly that there was more information available than ordinary search engines can see.

Later on Lewandowski & Mayr (2006) showed that Bergman had perhaps overstated the size of the problem, but it certainly remained a problem for those unaware of the issue. Ford & Mansourian (2006) added the concept of “cognitive invisibility”, i.e. everything beyond page 1 of the Google results. Since then very little has happened in research on this problem in the search or information retrieval community. The notion of the “deep web” has continued to receive some interest in computer science, where researchers look into query expansion and data mining to alleviate the problems. But groundbreaking scientific studies on this subject in the area of information retrieval or LIS have been scarce.

The authors of the current book, Devine and Egger-Sider, have been involved with the invisible web since 2004 (Devine & Egger-Sider, 2004; Devine & Egger-Sider, 2009). Their main concern is to get the concept of the invisible web into the information literacy curriculum. The current book documents a major survey in this area. To support that aim they maintain a useful website with invisible web discovery tools.

The current book is largely a repetition of their previous book (Devine & Egger-Sider, 2009). However, two major additions to the notion of the invisible web have been made: Web 2.0, or the social web, and the mobile, or apps, web. I was already aware of the first concept and have used it for quite a long time in classes for information professionals in the Netherlands. The second concept was an eye-opener for me. I did realize that search on mobile devices was different, more personalized than anything else, but I had not categorized it as part of the invisible web.

Where Devine and Egger-Sider (2014) disappoint is that the proposed solutions, curricula, etc. only address the invisible web as a database problem: identify the right databases and perform your searches. Make students and scholars aware of the problem, guide them to the additional resources, and the problem is solved. However, no solution whatsoever is provided for the information gap caused by the social web or the mobile web. In this respect the book does not add anything to the 2009 version.

Another aspect of the ever-increasing invisible web as we know it concerns grey literature. Scholarly output in the form of peer-reviewed articles or books is reasonably well covered by (web) search engines and library-subscribed A&I databases, but retrieving grey literature remains a major problem. The whole notion of grey literature is mentioned in this book. Despite their concern about the invisible or deep web, the authors also fail to stress the advantages that full-scale web search engines have brought. Previously we only had the indexed bibliographic information to search, whereas web search engines brought us full text search. Full text search, while not superior, has brought us new opportunities and sometimes improved retrieval as well.

The book is not entirely up to date. The majority of the references run up to 2011; only a few references from 2012, let alone 2013, are included. Apparently the book took a long time to write and produce. But what is really lacking is a suitable accompanying website. A short list of the many URLs provided in the book would probably have been helpful to many readers. For the time being we have to make do with their older webpage, which is less comprehensive than the complete collection of sources mentioned in this edition.

Where the book completely fails is in its treatment of the darknet. Since Wikileaks and Snowden we should be aware that even more is going on in the invisible web than ever before. Devine & Egger-Sider only mention the darknet or dark web as an area they will not treat. This is slightly disappointing.

If you already have the 2009 edition of this book, there is no need to upgrade to the current version.

References
Bergman, M.K. (2001). White Paper: The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing, 7(1). http://dx.doi.org/10.3998/3336451.0007.104
Devine, J., & Egger-Sider, F. (2004). Beyond Google : The invisible Web in the academic library. The Journal of Academic Librarianship, 30(4), 265-269. http://dx.doi.org/10.1016/j.acalib.2004.04.010
Devine, J., & Egger-Sider, F. (2009). Going beyond Google : the invisible web in learning and teaching. London: Facet Publishing. 156p.
Devine, J., & Egger-Sider, F. (2014). Going beyond Google again : strategies for using and teaching the Invisible Web. Chicago: Neal-Schuman, an imprint of the American Library Association. 180p.
Lewandowski, D., & Mayr, P. (2006). Exploring the academic invisible web. Library Hi Tech, 24(4), 529-539. http://dx.doi.org/10.1108/07378830610715392 OA version: http://eprints.rclis.org/9203/
Sherman, C., & Price, G. (2001). The invisible web: Discovering information sources search engines can’t see. Medford NJ, USA: Information Today. 439p.
Ford, N., & Mansourian, Y. (2006). The invisible web: An empirical study of “cognitive invisibility”. Journal of Documentation, 62(5), 584-596. http://dx.doi.org/10.1108/00220410610688732

Other reviews for this book
Malone, A. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web, Jane Devine, Francine Egger-Sider. Neal-Schuman, Chicago (2014), ISBN: 978-1-55570-898-6. The Journal of Academic Librarianship, 40(3–4), 421. http://dx.doi.org/10.1016/j.acalib.2014.03.006
Mason, D. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web. Online Information Review, 38(7), 992-993. http://dx.doi.org/10.1108/OIR-10-2014-0228
Stenis, P. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web. Reference & User Services Quarterly, 53(4), 367-367. http://dx.doi.org/10.5860/rusq.53n4.367a
Sweeper, D. (2014). A Review of “Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web”. Journal of Electronic Resources Librarianship, 26(2), 154-155. http://dx.doi.org/10.1080/1941126x.2014.910415

Towards a Google Scholar API

A while back I begged Google to come up with an API (Application Programming Interface) for Google Scholar. With the many possibilities for the Google Maps API they practically set the standard for APIs. For the library world they lived up to their promise when they launched the Google Books API back in 2008. But for Google Scholar Google has never delivered an API.

Nearly a Google Scholar API

A less well documented feature of Google Scholar is that you can look up information for a specific article using the DOI of that article.
This search query in Google Scholar with the full DOI returns exactly one result. Had you carried out a title search for this article, Google Scholar would have returned 23 results, admittedly with the correct article at the top. But the title search took Google Scholar about 0.04 seconds, whereas the DOI lookup took only 0.01 seconds. For many other articles the title query is in the order of a hundred times slower than the DOI lookup.

Playing around with the DOI in Google Scholar you can retrieve some more interesting results. The citing articles can be retrieved on the basis of the DOI (this is implemented in the PLoS article metrics page). The versions, or document cluster, of an article (useful to identify OA versions of an article) can also be queried directly on the basis of the DOI. Unfortunately I don’t see how you can get to the related articles using the DOI (any suggestions are welcome in the comments). To get the related articles you need an internal Google Scholar article number. My conclusion from these examples is that Google Scholar already has a mechanism in place that can form the backbone of an API. Using the DOI to look up an article in Google Scholar is in most cases very quick and precise. Only in a few instances have I come across examples where Google Scholar was in error, most often for editorial material or corrections, and in some instances when a version in an Open Access repository interfered with this mechanism.
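As a rough illustration of how such a DOI lookup could be scripted, here is a minimal Python sketch that only builds the search URL. The post does not spell out the URL of the lookup, so passing the DOI in the q parameter is an assumption; the base URL is the one used in the year range example further on in these posts. The function name scholar_doi_lookup_url is hypothetical, and the DOI is simply borrowed from the reference list above. Nothing is fetched or scraped.

```python
from urllib.parse import urlencode

# Base URL as it appears in the year range example in these posts.
SCHOLAR_SEARCH = "http://scholar.google.com/scholar"


def scholar_doi_lookup_url(doi: str) -> str:
    """Build a Google Scholar search URL that queries for one DOI.

    Assumption: the DOI is passed as an ordinary query in the q parameter.
    """
    doi = doi.strip()
    # Accept resolver-style input and reduce it to the bare DOI.
    for prefix in ("http://dx.doi.org/", "https://doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    return SCHOLAR_SEARCH + "?" + urlencode({"q": doi})


# DOI taken from the Lewandowski & Mayr (2006) reference, for illustration only.
print(scholar_doi_lookup_url("http://dx.doi.org/10.1108/07378830610715392"))
```

The sketch stops at the URL on purpose: without an official API, anything beyond opening that URL in a browser would amount to screen scraping.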

It works partially with ISBN

As long as books and reports have an ISBN assigned to them, it is also possible to retrieve exactly one result based on the ISBN, e.g. ISBN 9022010007, but playing around with the citations or the document cluster is not directly possible on the basis of this ISBN. Judging from the ISBN query results, Google seems to be close to some useful functionality in this area as well.

A monetizing model for Google Scholar

I can imagine that publishers or repositories would really like to make use of functionalities like these from Google Scholar. For publishers and repositories it would be valuable to show the citation data, to link to related items or to look up Open Access versions in the document clusters around an article. The announced integration of Google Scholar and Web of Science data for Web of Science customers is a sign that Google is willing to share data. There is likely to be some money involved in this deal as well. I wonder if Google is willing to strike similar deals with other publishers. PLoS journals are a good example: they are actually very close to using this information from Google Scholar as well. They just don’t dare to screen scrape the information they really want and need, so currently they only link out. Altmetric data providers such as Plum Analytics and Altmetric are other partners that might be interested in integrating this kind of information from Google Scholar in their metrics dashboards. In the end their customers pay a price for this data integration.

Why am I suggesting a monetizing model for Google Scholar? Currently Google Scholar still seems to be the (important) pet project of Anurag Acharya. It is not included in the major services offered in the Google spine. It looks like Google is not earning any money with it. So Google Scholar is at risk of being taken down as soon as next week, following the examples of Google Reader and iGoogle, where no monetizing models were available either. If there is a fair earning model for Google Scholar, I do hope it will increase the sustainability of Google Scholar, and that we get new exciting data sources to do research with.

The numrange operator in Google and Google Scholar

Google allows you to search for numbers within a specific range, e.g. [stonewashed jeans $20..$30]. As the example indicates, the search is for a price range, which is also the origin of this operator: it was probably first developed for Google Catalogs (now a retired service). In the ordinary Google it is still available, well hidden in the advanced search form.


The numrange operator works fine for many purposes.
[“mountain bike” $500..$800]
[“Russian revolution” 1900..1920]
[“Theobroma cacao” 2010..2014]
The last example aims at retrieving items on cocoa from the (publication) years 2010 to 2014, whereas the Russian revolution example guesses at the years in which the event took place.

It doesn’t work in Euros

So far it seems fine. But it doesn’t work for Euros.
[“mountain bike” €500..€800]
That probably has something to do with the character set. Nor does it work for Pound Sterling [“mountain bike” £500..£800]. Although it doesn’t search for prices in Pound Sterling or Euros, it does still return results for the bare number range.

Use three dots

The other problem with the numrange operator is that it doesn’t work for large figures. The search [water 988650..988700] fails. However, if you use three dots instead of two, it works fine: [water 988650...988700].
The other examples work with three dots as well as with two dots.
[“mountain bike” $500...$800]
[“Russian revolution” 1900...1920]
[“Theobroma cacao” 2010...2014]

So the quick conclusion is to use three dots rather than two. Hat tip for the three dots goes to @Henkvaness and his book De Google code.
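For anyone who generates such queries in a script, the pattern can be captured in a few lines of Python. This is only a sketch: the function name numrange_query and its parameters are made up for illustration, and the function merely produces the query text to paste into the search box. It defaults to three dots, following the conclusion above.

```python
def numrange_query(terms: str, low: int, high: int,
                   prefix: str = "", dots: int = 3) -> str:
    """Return a search string with a number range appended to the terms.

    prefix is an optional currency sign such as "$"; dots is 2 or 3.
    """
    if dots not in (2, 3):
        raise ValueError("use two or three dots")
    if low > high:
        raise ValueError("low must not exceed high")
    return f"{terms} {prefix}{low}{'.' * dots}{prefix}{high}"


print(numrange_query('"mountain bike"', 500, 800, prefix="$"))
print(numrange_query("water", 988650, 988700))
print(numrange_query('"Russian revolution"', 1900, 1920, dots=2))
```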

Numrange operator in Google Scholar

In Google Scholar the numrange operator doesn’t work. Well, that was my experience, which I blogged about yesterday in my Google Scholar blog post. For researchers searching for publications, the numrange operator would serve in the first place as a quick way to limit the results to a range of publication years. Google Scholar facilitates this through the advanced search form or, after a search, through the facets in the search engine results page. But in the default Google Scholar search box the numrange operator doesn’t work for publication year ranges, neither with two dots [“Theobroma cacao” 2010..2014] nor with three dots [“Theobroma cacao” 2010...2014].

But yesterday Henk van Ess reacted in the comments on my SlideShare presentation “Google Scholar : Google for research” that the numrange operator does work in Google Scholar. After a little toying around I found that it indeed works fine for ranges that are not likely to be publication years. A search with three dots [“Theobroma cacao” 10...14] or two dots [“Theobroma cacao” 10..14] works indeed. But as soon as you come near a year range it doesn’t: [“Theobroma cacao” 1800...1850].

If you want to search for year ranges in Google Scholar you have to do it through the advanced search form, or use the more complicated URL parameters as_ylo and as_yhi, as in http://scholar.google.com/scholar?q=%22Theobroma%20cacao%22&as_ylo=1800&as_yhi=1850
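The URL above can also be assembled programmatically. The following Python sketch uses only the standard library; the parameter names q, as_ylo and as_yhi come from the example URL, while the function name scholar_year_range_url is hypothetical. It only builds the URL and does not contact Google Scholar.

```python
from urllib.parse import quote, urlencode


def scholar_year_range_url(query: str, year_low: int, year_high: int) -> str:
    """Build a Google Scholar URL limited to a range of publication years."""
    params = {"q": query, "as_ylo": year_low, "as_yhi": year_high}
    # quote_via=quote encodes spaces as %20, as in the example URL.
    return "http://scholar.google.com/scholar?" + urlencode(params, quote_via=quote)


print(scholar_year_range_url('"Theobroma cacao"', 1800, 1850))
```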

Reference
van Ess, H. 2009. De Google code. Amsterdam: Pearson Education. ISBN 9789043019088 136p.

Google Scholar : Google for research

Or super search tips for researchers and students on how to use Google Scholar more efficiently. The embedded SlideShare presentation and this blog post will be kept up to date and in sync. And, more interestingly, all links and examples in the SlideShare presentation are clickable, so you can see what I mean.

The following scholarly super search tips are an explanation for the embedded slideshare presentation.

You can use, and should use, the usual Google shortcuts. The ones listed in this slide are the most important ones. Search for [“phrase searching”] to keep the words together. Search for specific file types with the ext: (or filetype:) operator. Limit searches to specific parts of the www with the site: operator. Search for specific words in the title with the allintitle: (or intitle:) operator. Use the OR operator to include synonyms of certain search terms. Exclude specific terms with the - sign. And last, but not least, combine all these operators. A few more tips like these can be found in the post “Google better with Google”.

An important Google operator that you can’t use in Google Scholar is the numerical range operator (numrange): the three dots (...) connecting two figures. In Google Scholar you even get a warning that the numrange operator isn’t working when you make use of it. Instead of the numrange operator, the facet for publication years is extremely important in Google Scholar.

But before you start using Google Scholar on a regular basis, turn to the search engine settings. There are three tabs that need a little tuning to optimize Google Scholar for your purposes. In the first tab you should select twenty search results per page, and have them open in a new tab/window. Select your preferred bibliography (reference) manager here as well. In case you use Mendeley, you get the best results when selecting Reference Manager as preferred bibliography manager. In the second tab you can select the language of the interface as well as of the search results. It is not recommended to select search results in a single language only. In the last tab you can select the Library links that should be shown. When you are on campus this is normally selected automatically, but especially when you’re off campus it is recommended to select the library access that you have, to connect to more content directly.

The Google advanced search options are currently hidden behind the small triangle in the search box. You only need them for a few types of searches.

At the beginning you might like to use the advanced search form to search for authors. But soon you learn that a search for an author actually translates into the author: operator, e.g. [author:“KE Giller”] in the Google Scholar search box. If you want to search for the oeuvre of two authors the advanced search form already fails; you have to do that through the normal search box: [author:“R Leemans” OR author:“KE Giller”]. The second useful option in the advanced search form is the possibility to search for articles in a certain journal. This option doesn’t translate back into a neat operator in the standard search box. But in the URL you can see what actually happens: it translates into the as_publication= parameter, as in http://scholar.google.com/scholar?as_publication=%22agricultural%20systems%22 . The years option in the advanced search form can be used here, but the year range can also be set after an initial search through the facets. That is what I normally prefer.
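Both patterns are easy to generate. The Python sketch below builds the author: query for any number of authors and the journal URL with the as_publication parameter, both exactly as shown above. The function names author_query and journal_url are hypothetical, straight quotes are used instead of typographic ones, and nothing is sent to Google Scholar.

```python
from urllib.parse import quote, urlencode


def author_query(*names: str) -> str:
    """Combine one or more author names with the author: operator and OR."""
    return " OR ".join(f'author:"{name}"' for name in names)


def journal_url(journal: str) -> str:
    """Build a Google Scholar URL restricted to a journal title."""
    # The journal title is wrapped in quotes, as in the example URL above.
    params = {"as_publication": f'"{journal}"'}
    return "http://scholar.google.com/scholar?" + urlencode(params, quote_via=quote)


print(author_query("R Leemans", "KE Giller"))
print(journal_url("agricultural systems"))
```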

The ranking of the search results is heavily influenced by the citations to the articles found. The consequence of this influence of citations on the ranking of the results is that most often older material is at the top of the results page. It is therefore of utmost importance to use the year range option in the advanced search screen or the year range option in facets to select more recent results rather than heavily cited older material found at the top of the results page. When searching for recent results the standard ranking in Google Scholar is counterproductive and you have to make use of the year ranges.

Google Scholar searches for fewer word variants than the big Google does. There is no verbatim search needed as in the big Google, but “phrase” quotes around a single word still work to search specifically for that single word. Another interesting gem is that the tilde operator still functions in Google Scholar to search for a keyword and its synonyms (hat tip @wichor). Something I come across quite a lot amongst experienced searchers is the use of parentheses, but unfortunately these don’t work in Google Scholar (or the big Google).

Looking at the search results in more detail, the snippet of each result is surrounded by many options. In the first place, Open Access versions are clearly indicated in the last column of the search engine results page. With the Save option you can add the result to your Google Scholar library (not connected to the Google Books library). Under the Cite option you find three formats for the reference: APA, MLA or Chicago. The Import option lets you export the reference to your bibliography management software, such as EndNote, RefWorks, etc., but only one reference at a time. The Versions option is useful to locate other full text versions (e.g. with better scanning quality). In combination with the Cite option it also helps you arrive at a complete, properly formatted reference for your reference list. The last options, Related articles and Cited by, allow you to search further for information starting from a useful result. The exact algorithm behind the Related articles option has not been published, nor has it been studied and reported widely in the literature.

In Google Scholar it is really easy to set up search alerts. You only have to be aware that a standard search in Google Scholar allows a search query of 256 characters, whereas for an alert the limit is 100 characters (barely sufficient for a proper search query). On top of the search alerts, you can receive updates based on the articles in your My Citations profile.
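A quick way to check a query against these two limits before saving it as an alert is sketched below in Python. The figures 256 and 100 are the ones quoted above and may well have changed since; the function name query_fits is hypothetical.

```python
SEARCH_LIMIT = 256  # characters for a standard search, as quoted above
ALERT_LIMIT = 100   # characters for a search alert, as quoted above


def query_fits(query: str) -> dict:
    """Report the query length and whether it fits each limit."""
    length = len(query)
    return {
        "length": length,
        "search": length <= SEARCH_LIMIT,
        "alert": length <= ALERT_LIMIT,
    }


print(query_fits('author:"R Leemans" OR author:"KE Giller"'))
```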

On the quality of Google Scholar as a comprehensive search engine for researchers the last word has not been spoken yet. In terms of coverage it is probably larger than any other academic database or search engine. However, not all scholarly sources, such as OA repositories, are fully indexed yet. The big Google index still finds OA resources not indexed in Google Scholar. For systematic reviews Google Scholar is a good addition to the range of databases to search. Metadata quality is still something that needs improvement, as does the disambiguation of articles and authors. The versions function sometimes helps with finding the proper metadata for a reference. The announced coupling to Web of Science should really be a big plus in this area.

Google better with Google

Or 14 super search tips for scientists and students. The following scholarly super search tips are an explanation for the enclosed slideshare presentation.


This SlideShare presentation was posted a while back on WoW!ter’s SlideShare, but has been updated to stay in sync with this blog post.

The tips
1. Which Google do you want to use? We have a large international audience of users at our University, who normally are redirected to http://www.google.nl. However if you use http://www.google.com/ncr then you get the international version. But if you prefer your Indian version http://www.google.co.in/ncr works as well. With the /ncr you can control the regional version you are using easily.

2. Personalize your search experience. The settings are nowadays found under the small cogwheel at the top right-hand corner of the page. The section I always pay attention to is the filter option. Why should Google judge whether something is fit for my eyes or not? I also advise setting the number of search results to 50 (but you can’t make use of Google Instant search in that case). I used to use 100 results, but even I found that a wee bit too much. Lastly, I always check the box to open the results in a new window (it actually opens a new tab, rather than a window); this keeps my search results window intact whilst I browse some of the results I retrieved.

Some further personalization would be to install the Google Toolbar in your browser, or, a step further still, to make use of iGoogle.

3. There is more than 1 Google. Many people only use the standard Google web search engine. But for academics, Google Scholar, Google Book Search and Google Patents are specific interfaces that should certainly be part of the searcher’s tricks of the trade.

4. Google universal. Nowadays, Google has realized that the many different search interfaces cause a problem for the users as well and therefore they have introduced the universal search engine results page with a lot of specific options on the left hand side of the results. However a suggestion to use Google Scholar is not included.

5. Learn from the advanced search interface. All Google search interfaces have an advanced search option. Use these options to see what the possibilities of the specific search interface are, and learn how you can make use of these advanced search operators in the normal search interface. When you make use of the advanced search options in Google Scholar you see an option to search for a specific author, which translates in the Scholar search box into [nitrogen fixation author:“K E Giller”].

6. Be specific or search with more than 1 term. In the Dutch language we can often get away with searching for a single word, because we are allowed to make incredibly long compound words such as “wapenstilstandsonderhandelingen”. When you’re searching for scientific information you had better stick to English as the language. In English you can’t make such compound words. This is a small language difference which necessitates searching with more terms. But apart from the language difference, when you search with more terms, searches become more specific and the results more relevant. In the current example a search for water only yields more than 700 million results, whereas [Water management technology assessment] yields nearly 8 million results.
Interestingly, when you look at the results in the slides, you’ll notice that the total result numbers in Google are unreliable, to say the least. In the step from 2 to 3 search terms the result set increases again.
The fifth example in the slide is an introduction to the next slide. You can be even more precise when searching.

7. Keep words together. Make use of “phrase searches”. A phrase search is a search which returns the words in exactly the specified order. Of course Google already ranks the results that contain the search terms as a phrase at the very top of the search engine results page. This technique also reduces the sheer number of possible results. Compare for instance [“water management”] with [water management]. You can combine as many phrases as you like (see the previous slide), or make them really long (the latter is also used in plagiarism checks).

8. Search for title words. When you feel overwhelmed by the number of results, a good solution is to limit your search to title words rather than words anywhere on a page. You can search for single title words with the intitle: operator, or for all of your search words with the allintitle: operator. These operators behave the same when you compare [intitle:“water management”] with [allintitle:water management].

9. Search for information in PDF files. Most scientific information is published on the web in the format of PDF files, be it as a scientific report or a scholarly article, e.g. [Agaricus bisporus ext:pdf]. A couple of years ago this was an extremely efficient way to look for scholarly information on the Web. Since it has become very easy to produce your own PDF files, this technique has lost some of its effectiveness, but it still works wonders, especially in combination with the other tips.

10. Search for results from a specific domain. In some cases it is useful to restrict your results to a certain website or domain. This is certainly true for sites that don’t have good site search options, e.g. [EndNote site:library.wur.nl]. You can also limit the results to the academic institutions of the USA [“water management” site:.edu].

11. Search for number ranges. Apart from the fact that Google is a powerful calculator, you can also search for number ranges. This comes in handy when you want to limit your search to results from certain publication years, e.g. [“publication strategy” 2009...2011]. Note that three dots behave differently from (and better than) the standard two dots.

12. Exclude specific terms with the - operator. You can narrow your searches using this operator. You can exclude as many words as you want by using the - sign in front of all of them, for example [mercury -ford -freddy -outboards -planets].

13. Search with OR. On some occasions the intelligence of Google doesn’t include obvious synonyms. With the OR operator you can combine search terms, e.g. [“carbon dioxide” OR CO2]. Notice that OR should be typed in capitals.

14. Combine. Having seen some of the options of the Google search engine, you should realize that you can combine most of these operators. In this way you can make very precise searches, e.g. [“publication strategy” citations 2009...2011 ext:pdf].
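To show how the operators from the tips above stack, here is a small Python sketch that assembles a query string from its parts. The function build_query and its parameter names are invented for this illustration; it only produces the text you would type in the search box, with straight quotes and three dots for the number range. No parentheses are used around the OR terms, since these don’t work in Google.

```python
def build_query(terms=(), phrases=(), any_of=(), exclude=(),
                years=None, site=None, ext=None):
    """Assemble a Google query from plain terms and operators.

    phrases are wrapped in quotes, any_of terms are joined with OR,
    exclude terms get a leading minus, years is a (low, high) tuple.
    """
    parts = [f'"{phrase}"' for phrase in phrases]
    parts += list(terms)
    if any_of:
        parts.append(" OR ".join(any_of))
    parts += [f"-{word}" for word in exclude]
    if years:
        low, high = years
        parts.append(f"{low}...{high}")
    if site:
        parts.append(f"site:{site}")
    if ext:
        parts.append(f"ext:{ext}")
    return " ".join(parts)


# Reproduces the combined example from tip 14.
print(build_query(phrases=["publication strategy"], terms=["citations"],
                  years=(2009, 2011), ext="pdf"))
# Reproduces the exclusion example from tip 12.
print(build_query(terms=["mercury"],
                  exclude=["ford", "freddy", "outboards", "planets"]))
```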