The invisible web is still there, and it is probably larger than ever

Book review: Devine, J., & Egger-Sider, F. (2014). Going beyond Google again: Strategies for using and teaching the invisible web. Chicago: Neal-Schuman, an imprint of the American Library Association. ISBN 9781555708986, 180p.


The invisible web, as we know it, dates back to at least 2001. In that year Sherman & Price (2001) and Bergman (2001) each published a study describing, for the first time, the whole issue surrounding the deep, or invisible, web. These two seminal studies used different terms for the same concept, invisible and deep, but both demonstrated convincingly, and independently of each other, that more information is available on the web than ordinary search engines can see.

Later on, Lewandowski & Mayr (2006) showed that Bergman had probably overstated the size of the problem, but it certainly remained a problem for those unaware of the issue. Ford & Mansourian (2006) added the concept of “cognitive invisibility”, i.e. everything beyond page 1 of the Google results. Since then, very little has happened in research on this problem in the search or information retrieval community. The notion of the “deep web” has continued to receive some interest in computer science, where researchers look into query expansion and data mining to alleviate the problem. But groundbreaking scientific studies on this subject in information retrieval or LIS have been scarce.

The authors of the current book, Devine and Egger-Sider, have been involved with the invisible web since at least 2004 (Devine & Egger-Sider, 2004; Devine & Egger-Sider, 2009). Their main concern is to get the concept of the invisible web into the information literacy curriculum, and the current book documents a major survey in this area. To that end they maintain a useful website with invisible web discovery tools.

The current book is largely a repetition of their previous one (Devine & Egger-Sider, 2009). However, two major additions to the notion of the invisible web have been made: Web 2.0, or the social web, and the mobile, or apps, web. The first concept I was aware of and had already used in classes for information professionals in the Netherlands for quite some time. The second was an eye-opener for me. I did realize that search on mobile devices was different, more personalized than anything else, but I had not categorized it as a part of the invisible web.

Where Devine and Egger-Sider (2014) disappoint is that the proposed solutions, curricula and so on address the invisible web only as a database problem: identify the right databases and perform your searches; make students and scholars aware of the problem, guide them to the additional resources, and the problem is solved. However, no solution whatsoever is provided for the information gap due to the social web or the mobile web. On this point the book adds nothing to the 2009 version.

Another aspect of the ever-increasing invisible web concerns grey literature. Scholarly output in the form of peer-reviewed articles or books is reasonably well covered by (web) search engines and library-subscribed A&I databases, but retrieving grey literature remains a major problem. The whole notion of grey literature is barely mentioned in this book. Despite their concern about the invisible or deep web, the authors also fail to stress the advantages that full-scale web search engines have brought. Previously we could search only the indexed bibliographic information, whereas web search engines brought us full-text search. Full-text search, while not always superior, has given us new opportunities and sometimes improved retrieval as well.

The book is not entirely up to date. The majority of the references date from 2011 or earlier; only a few from 2012, let alone 2013, are included. Apparently the book took a long time to write and produce. But what is really lacking is a suitable accompanying website. A short list of the many URLs provided in the book would have been helpful to many readers. For the time being we have to make do with their older webpage, which is less comprehensive than the complete collection of sources mentioned in this edition.

Where the book completely fails is in its treatment of the darknet. Since WikiLeaks and Snowden we should be aware that even more is going on in the invisible web than ever before. Devine & Egger-Sider mention the darknet, or dark web, only as an area they will not treat. This is more than slightly disappointing.

If you already have the 2009 edition of this book, there is no need to upgrade to the current version.

References
Bergman, M. K. (2001). White paper: The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). http://dx.doi.org/10.3998/3336451.0007.104
Devine, J., & Egger-Sider, F. (2004). Beyond Google: The invisible web in the academic library. The Journal of Academic Librarianship, 30(4), 265-269. http://dx.doi.org/10.1016/j.acalib.2004.04.010
Devine, J., & Egger-Sider, F. (2009). Going beyond Google: The invisible web in learning and teaching. London: Facet Publishing. 156p.
Devine, J., & Egger-Sider, F. (2014). Going beyond Google again: Strategies for using and teaching the invisible web. Chicago: Neal-Schuman, an imprint of the American Library Association. 180p.
Ford, N., & Mansourian, Y. (2006). The invisible web: An empirical study of “cognitive invisibility”. Journal of Documentation, 62(5), 584-596. http://dx.doi.org/10.1108/00220410610688732
Lewandowski, D., & Mayr, P. (2006). Exploring the academic invisible web. Library Hi Tech, 24(4), 529-539. http://dx.doi.org/10.1108/07378830610715392 OA version: http://eprints.rclis.org/9203/
Sherman, C., & Price, G. (2001). The invisible web: Uncovering information sources search engines can’t see. Medford, NJ: Information Today. 439p.

Other reviews for this book
Malone, A. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web, Jane Devine, Francine Egger-Sider. The Journal of Academic Librarianship, 40(3–4), 421. http://dx.doi.org/10.1016/j.acalib.2014.03.006
Mason, D. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web. Online Information Review, 38(7), 992-993. http://dx.doi.org/10.1108/OIR-10-2014-0228
Stenis, P. (2014). Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web. Reference & User Services Quarterly, 53(4), 367. http://dx.doi.org/10.5860/rusq.53n4.367a
Sweeper, D. (2014). A Review of “Going Beyond Google Again: Strategies for Using and Teaching the Invisible Web”. Journal of Electronic Resources Librarianship, 26(2), 154-155. http://dx.doi.org/10.1080/1941126x.2014.910415

Google and the academic Deep Web

Hagedorn and Santelli (2008) have just published an interesting article on the comprehensiveness of Google's indexing of academic repositories. This article triggered me to write up some observations I had been intending to make for quite some time. It addresses a question I got from a colleague of mine, who observed that the deep web apparently doesn't exist anymore.

Google has made a start on indexing Flash files. Google has made a start on retrieving information that is hidden behind search forms on the web, i.e. it has started to index information contained in databases. Google and OCLC exchange information on scanned books and on those contained in WorldCat. Google, so it seems, has indexed the web comprehensively, with 1 trillion indexed webpages. Could there possibly be anything more to be indexed?

The article by Hagedorn and Santelli shows convincingly that Google still has not indexed all the information contained in OAIster, the second-largest archive of open access article information; only Scientific Commons is more comprehensive. They tested this with the Google Research API, under the University Research Program for Google Search, and checked only whether each URL was present in Google's index. This approach reveals only part of the depth of the academic deep web, but the figures are staggering already. And reality bites even harder.
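The University Research Program API that Hagedorn and Santelli used has since been discontinued, but the methodology is easy to replicate. Here is a minimal sketch of the same URL-presence test using Google's Custom Search JSON API; the API key, engine ID and sample URL are placeholders you would supply yourself, and the exact-match heuristic is my assumption, not their original test harness:

import time

import requests

API_KEY = "YOUR_API_KEY"      # placeholder: your own Custom Search API key
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder: your own search engine (cx) id

def is_indexed(url: str) -> bool:
    """Search for the URL itself and check whether that exact URL
    comes back among the top results."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": url, "num": 10},
        timeout=15,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    return any(item.get("link", "").rstrip("/") == url.rstrip("/")
               for item in items)

# Hypothetical sample; in practice you would harvest record URLs from a
# repository via OAI-PMH, much as Hagedorn and Santelli sampled OAIster.
sample = ["https://repository.example.edu/handle/1887/12345"]
for url in sample:
    print(url, "->", "indexed" if is_indexed(url) else "NOT indexed")
    time.sleep(1)  # be polite and stay within the API's rate limits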

A short while ago I taught a web search class for colleagues at the University Library in Leiden. To demonstrate what the deep or invisible web actually constitutes, I used an example from their own repository: a thesis on Cannabis from last year, deposited as one huge PDF of 14 MB. Using Google you can find the metadata record, and with Google Scholar as well. However, if you search for a quite specific sentence from the opening pages of the actual PDF file, Google does not return the sought-after thesis. You find three other PhD dissertations, two of them defended at the same university on the same day, but not the one on Cannabis.
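This classroom test can be automated the same way: wrap the sentence in double quotes so it is treated as an exact phrase, and see which documents come back. Reusing the setup from the sketch above (the phrase here is an invented stand-in, not the actual sentence from the Cannabis thesis):

# Exact-phrase visibility test, reusing API_KEY / ENGINE_ID from above.
phrase = '"an invented sentence from the opening pages of the thesis"'
resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": ENGINE_ID, "q": phrase, "num": 10},
    timeout=15,
)
resp.raise_for_status()
print([item["link"] for item in resp.json().get("items", [])])
# The deep-web symptom: the 14 MB PDF itself does not show up here.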

Interestingly, you are able to find parts of the thesis in Google Scholar, e.g. chapter 2, chapter 3, etc. But those are the chapters of the thesis that have been published elsewhere in scholarly journals. Unfortunately, none of these parts in Google Scholar refers back to the original thesis, which is in open access, or to the OA pre-prints of those journal articles posted in the Leiden repository. In Google Scholar, most of this material is still behind toll gates at publishers' websites.

Is Google to blame for this incomplete indexing of repositories? Hagedorn and Santelli indeed point the finger at Google. However, John Wilkin, a colleague of theirs, doesn't agree. Neither did Lorcan Dempsey. And neither do I.

I have taken an interest in the new role of librarians. We are no longer solely responsible for bringing external, documentary resources from outside into the realm of our academic clientele. We also have the important task of bringing the fruits of their labour, as well as possible, into the floodlights of the external world, be it for academic or plain lay interest. We have to bring the information out there. Open access plays an important role in this new task. But that task doesn't stop at simply making the work available on the web.

Making it available is only a first, essential step. Making it rank well is a second, perhaps even more important one. So as librarians we have to become SEO experts. I have mentioned this here before, as well as on my Dutch blog.
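Part of that SEO work is very concrete: expose, on every record page, the metadata that scholarly crawlers actually read. Google Scholar's inclusion guidelines, for instance, ask for Highwire Press style citation_* meta tags. A minimal sketch of generating them follows; the thesis details are invented:

from html import escape

def citation_meta_tags(title, authors, date, pdf_url):
    """Render the Highwire Press meta tags that Google Scholar's
    inclusion guidelines recommend for repository record pages."""
    tags = [("citation_title", title)]
    tags += [("citation_author", author) for author in authors]
    tags += [("citation_publication_date", date),  # YYYY/MM/DD
             ("citation_pdf_url", pdf_url)]
    return "\n".join(f'<meta name="{name}" content="{escape(value)}">'
                     for name, value in tags)

# Invented example record:
print(citation_meta_tags(
    title="A study of Cannabis sativa",
    authors=["Doe, J."],
    date="2007/06/14",
    pdf_url="https://repository.example.edu/bitstream/1887/12345.pdf",
))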

So what to do about this example from the Leiden repository? There is actually a slew of measures that should be taken. The first, of course, is to divide the complete thesis into parts, at chapter level. Admittedly, publishers give permission to republish their articles (of which most theses in the natural sciences in the Netherlands consist) only when the thesis is published as a whole. On the other hand, nearly 95% of publishers allow the deposit of pre-prints and peer-reviewed post-prints: the so-called RoMEO green road. So it is up to the repository managers, preferably with the consent of the PhD candidate, to split the thesis into its parts (the chapters, which are the pre-prints or post-prints of articles) and archive the thesis at chapter level as well. This makes the record for the thesis, with its links to far more digestible chunks of information, better palatable to search engine spiders and crawlers. The record for the thesis thus contains links to the individual chapters deposited elsewhere in the repository, as sketched below.
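To make this concrete, here is a sketch of the proposed parent/child record structure, written as simple Dublin Core style dictionaries. The handles and titles are invented, and a real repository would express the same relations in OAI-DC XML or METS rather than Python:

# Invented handles and titles, for illustration only.
thesis_record = {
    "identifier": "hdl:1887/12345",
    "type": "doctoral thesis",
    "title": "A study of Cannabis sativa",
    # dcterms:hasPart: the record links out to each chapter deposit,
    # giving crawlers small, digestible documents to follow.
    "hasPart": ["hdl:1887/12346", "hdl:1887/12347"],
}

chapter_records = [
    {
        "identifier": "hdl:1887/12346",
        "type": "article post-print",   # the RoMEO green road version
        "title": "Chapter 2 (published as a journal article)",
        "isPartOf": "hdl:1887/12345",   # dcterms:isPartOf, back to the thesis
    },
    {
        "identifier": "hdl:1887/12347",
        "type": "article pre-print",
        "title": "Chapter 3 (published as a journal article)",
        "isPartOf": "hdl:1887/12345",
    },
]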

An interesting side effect of this additional effort on the repository side is that deposit rates will increase considerably. This applies to most universities in the Netherlands, and to our collection of theses as well. Since PhD students are responsible for the lion's share of academic research at the university, depositing the individual chapters as article pre-prints in the repository will be of major benefit to the university's OA performance. It will require more labour on the side of repository management, but if we take this seriously it is well worth the effort.

We still have to work really hard at the visibility of the repositories, but making the information more palatable is a good start.

Reference:
Hagedorn, K., & Santelli, J. (2008). Google still not indexing hidden web URLs. D-Lib Magazine, 14(7/8). http://www.dlib.org/dlib/july08/hagedorn/07hagedorn.html