A Prospective View of Citation Indexing and Information Retrieval
In the 21st Century

Eugene Garfield
Chairman Emeritus, ISI, 3501 Market Street, Philadelphia, PA 19104, USA
Tel: 215-243-2205

Distinguished Lectureship
New Jersey Chapter
American Society for Information Science and Technology
October 28, 2004
The Rutgers Club

About five years ago, I addressed this New Jersey Chapter of ASIS&T about the future role of the information scientist. That was part of my role as President of ASIS&T.1 But tonight I want to talk to you about my past career as an information scientist. Since my initial remarks are prepared, they will be a little more formal than this afternoon's session with the students and faculty at Rutgers. In that session, we covered a wide range of topics, especially citation index theory and practice. Tonight, I would like to review some ideas I first discussed in the millennium issue of JASIS&T published in early 2001.2

In the 1950s, research scientists typically scanned a dozen or so journals they personally received, including abstracting journals like Chemical Abstracts or Biological Abstracts. Most visited libraries regularly, where they could scan the latest journals. In those days, journals were much more affordable. Ironically, a survey by Russ Ackoff of the local American Chemical Society section in Philadelphia found that the average member subscribed to more journals than he read. Ackoff also found that half of the papers in core journals would be read by 1% of readers or less, and that no report was likely to be read by more than 7%.3 However, that data is now 35 years old.

It was the time of the reprint culture.4 Authors exchanged reprints generously. It was not unusual to mail reprints regularly to one's "invisible college." The now ubiquitous Xerox machine was not yet widely available, and photostats were expensive and cumbersome. In those days it was common for journals to be circulated to the company research staff. When this became too cumbersome, many libraries circulated contents pages. This eventually led to the birth of Current Contents. In those days, correspondence by snail mail was the norm, as was the use of printed indexing and abstracting services. The pace of research and publication was significantly slower. Once published, however, journals and reprints were delivered rather quickly, even by transatlantic steamship. Rudi Schmid described the remarkable speed of intercontinental and transcontinental mail transit times from Europe to California from 1852 to 1941.5 While it has been modernized in many respects, the postal system was considered somewhat archaic. After World War II, the introduction of telephone, fax, and then email reinforced that impression. Nevertheless, most print journals still use snail mail for domestic distribution, while air-cargo services are used for international distribution.

Online access to indexing and abstracting services was introduced in the 1970s. For example, SSCI was file number 7 in 1972. Twenty-five years later, full-text journal articles began to appear online. Now they are provided routinely. The integration of the journal literature with A&I services through linking presents a completely transformed situation. Readers now access journal contents pages, abstracts, and full text almost instantly. You can browse the current literature online and, in real time, go backward and then forward again into related documents via cited references. As full-text archives increase their chronological scope, you will be able to search and peruse the literature without ever entering the library, provided you are a researcher at an institution that covers the costs of these capabilities.

Ancillary to these developments is the issue of digital libraries, open access, and individual and institutional archiving. Today we can access a significant part of the last decade of the literature electronically. In five to ten years, this will extend to much of the significant journal literature of the twentieth century, that is, the 1,000 or more most-consulted and higher-impact journals. The cost of converting the full texts of several million articles from the back runs of legacy journals will not be trivial, but there is evidence that libraries are willing to support these costs, as was demonstrated by the successful launch of JSTOR and other legacy projects. To supplement these efforts, Dana Roth and I have discussed the idea of creating files of perhaps ten to thirty thousand of the most-cited papers.6 While many of the highest-impact journals of science are currently available electronically, complete archives are relatively rare. Many of the most-cited journals identified by ISI's Journal Citation Reports® are electronically accessible if your library has electronic site licenses. I find that Drexel can supply a large percentage of my needs for current and back-issue journals.

An alternative interim step is to use email to contact authors for access to articles not yet available on the web. Papers often can be found on the author's personal home page. This type of self-archiving is key to Steve Harnad's idea of open access. He wants each institution to assume that responsibility, but creating digital libraries of faculty papers is a formidable task. In the meantime, it would be enormously helpful if university web sites provided a standardized means of locating faculty email addresses, which would lead to CVs and bibliographies and include up-to-date URLs for the papers and books listed. It is significant that Current Contents and the Web of Science now include these email addresses, a logical extension of the address directories provided since 1960.

The creation of large digital libraries seems inevitable, since technology and outsourcing continue to reduce the cost of conversion from paper. Large-scale conversions to PDF files are possible at costs from 5 to 50 cents per page. Whatever the cost, there is a separate issue when discussing free access to the public. Clearly, there is a tacit desire to archive everything that has been published, perhaps 50 to 100 million papers and books. In the meantime, as long as we have a situation that is half-electronic and half-paper, authors will provide equally half-baked retrospective coverage of the literature. Authors take the path of least resistance. For some younger authors, if it is not electronic, it does not exist.

Searching Full Text

While alerting and SDI services were available 40 years ago, it is now rather routine for publishers to announce forthcoming articles electronically. This can significantly reduce the time between actual publication and citation by others. The time lag between submission and publication of articles is rapidly diminishing, as is the work of preparing and editing manuscripts. The need to standardize formats for electronic documents is evident, as is the desire to standardize electronic manuscripts per se. One can rely on Reference Manager or other citation management systems to produce articles and citations in any journal style required, without having to completely retype manuscripts. These systems have increased the efficiency of producing original manuscripts. Furthermore, the increased use of personal web pages reduces the need to go directly to the library for a lot of archival material. Like other authors, I have "self-archived" most of what I have published in my career. Howard Lenhoff suggested that retired scholars do this systematically, and even include the work they have not published.7 I estimate this would cost the average well-published author about $5,000. BioMed Central now offers universities a service for doing this with current material. It could easily be extended to older material as well.
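The style-reformatting capability of such citation management systems can be sketched in a few lines: a single stored bibliographic record is rendered in whatever output style a journal requires. The style names and formats below are purely illustrative, not any real journal's specification.

```python
# One stored bibliographic record, rendered in two hypothetical
# journal styles -- the core trick of any reference manager.

ref = {
    "authors": ["Garfield E", "Sher IH"],
    "title": "ASCA -- A New Personalized Current Awareness Service for Scientists",
    "journal": "American Behavioral Scientist",
    "volume": 10, "issue": 5, "pages": "29-32", "year": 1967,
}

def render(ref, style):
    """Render one record in a named (made-up) output style."""
    authors = ", ".join(ref["authors"])
    if style == "vancouver-like":
        return (f'{authors}. {ref["title"]}. {ref["journal"]} '
                f'{ref["year"]};{ref["volume"]}({ref["issue"]}):{ref["pages"]}.')
    if style == "numbered-like":
        return (f'{authors}, "{ref["title"]}," {ref["journal"]} '
                f'{ref["volume"]}({ref["issue"]}):{ref["pages"]} ({ref["year"]}).')
    raise ValueError(f"unknown style: {style}")

print(render(ref, "vancouver-like"))
print(render(ref, "numbered-like"))
```

Because the record is stored once in structured form, switching a manuscript from one journal's style to another is a matter of changing one argument, not retyping the bibliography.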

But we could do much more to take advantage of the new technologies. It should be routine for editors to feed the bibliographies of submitted manuscripts into a database to verify the accuracy of every paper or book reference cited, thus reducing the annoying typos and other errors that do occur. Right now this has to be done one reference at a time, and perhaps five to ten percent of citations contain errors or typos, depending upon how you define an error.
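Such batch verification could work roughly as follows. This is a hypothetical sketch, not any publisher's actual system: each manuscript reference is looked up in a trusted citation database by journal, volume, and first page, with mismatched years flagged and a fuzzy match used to catch misspelled journal names.

```python
# Hypothetical batch reference checker: flag the kinds of
# volume/year/journal typos editors now catch one at a time.

import difflib

database = {  # toy stand-in for a citation database
    ("Science", 159, 56): {"author": "Merton RK", "year": 1968},
    ("Taxon", 33, 636): {"author": "Schmid R", "year": 1984},
}

def check(refs):
    problems = []
    for r in refs:
        key = (r["journal"], r["volume"], r["page"])
        rec = database.get(key)
        if rec is None:
            # try a fuzzy journal-name match before declaring it missing
            names = [k[0] for k in database]
            close = difflib.get_close_matches(r["journal"], names, n=1, cutoff=0.8)
            hint = f" (did you mean {close[0]}?)" if close else ""
            problems.append(f"not found: {key}{hint}")
        elif rec["year"] != r["year"]:
            problems.append(f"year mismatch for {key}: {r['year']} vs {rec['year']}")
    return problems

refs = [
    {"journal": "Science", "volume": 159, "page": 56, "year": 1986},  # transposed year
    {"journal": "Taxn", "volume": 33, "page": 636, "year": 1984},     # journal typo
]
for p in check(refs):
    print(p)
```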

The time-consuming process of refereeing has been improved. Electronic access eases the paperwork involved. Email receipt of PDF or Word manuscripts stimulates potential referees to act promptly. Eliminating snail mail makes the process less costly and thereby increases the number of referees one can consult. But the situation is not perfect, as Tefko Saracevic and I recently experienced when we could not read the illustrations in a PDF manuscript, necessitating the use of snail mail.

Searching Full Texts

Searching full texts of documents presents new and interesting problems. Information scientists have been studying full-text searching for fifty years. John O'Connor was one of the pioneers.8 Early on, he recognized the need to create artificially intelligent searching systems. Personal experience with large-scale files, including even my own, demonstrates the blessings and the dilemmas of full-text searching. For the rare word or phrase, it can be extremely efficient. For the frequently occurring term, it can be highly frustrating. Twenty-first-century users will demand more sophisticated methods for refining full-text searches. Google does not really solve the problem. I have experienced this lately both with my own website and a new one covering 4,000 Citation Classics (www.citationclassics.org). Google and other search engines do not display results in a manner that is conducive to quick selection unless the output is limited.
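The rare-term/common-term asymmetry is easy to see in a minimal inverted index. The three-document corpus below is invented purely for illustration.

```python
# Minimal inverted index: a rare term retrieves one precise hit,
# a common term matches every document and swamps the user.

from collections import defaultdict

docs = {
    1: "citation indexing of the chemical literature",
    2: "full text searching of the chemical literature",
    3: "thalidomide toxicity in the clinical literature",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(sorted(index["thalidomide"]))  # rare term: one precise hit
print(sorted(index["literature"]))   # common term: every document matches
```

At the scale of millions of articles, the "literature" case is the frustrating one, and it is exactly where users need better refinement tools than a raw hit list.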

Regardless of the search engine used, the speed of access to electronic files is crucial to their effective use. If you must wait 20 seconds to display the context of a retrieved document, then scanning a large list retards the process of information recovery. I experience the elation and frustration of full-text searching when I use the Verity system to search my own publications. To take full advantage of its word-for-word indexing, I need to be able to instantly pop up the context in which the term occurs, not just the title of the paper in which it is contained. This is possible with CiteSeer, the autonomous citation index developed by Steve Lawrence et al.9 I imagine that future Google searching will take advantage of his experience, since he left NEC in Princeton to work for them. The recent announcement of Google Print is a harbinger of that.
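Displaying the context rather than just the title is the classic keyword-in-context (KWIC) idea, and a minimal version is only a few lines. The sample sentence below is invented for illustration.

```python
# KWIC-style context display: show each hit of a term inside a
# window of surrounding words, instead of only the document title.

def kwic(text, term, window=3):
    """Yield each occurrence of term with `window` words on each side."""
    words = text.split()
    for i, w in enumerate(words):
        if w.lower().strip(".,") == term.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            yield f"... {left} [{w}] {right} ..."

doc = ("Citation indexing permits the searcher to move backward and "
       "forward in time, since each citation links one paper to another.")
for line in kwic(doc, "citation"):
    print(line)
```

The scanning cost here is trivial; in a real system the context snippets would be precomputed or served from the index so they pop up instantly, which is precisely the speed requirement described above.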

SDI Profiling and Clipping Services

Many years ago, I wrote a letter to the New York Times about push-pull technology. It was not published in the Times but did appear in a paper I published in JASIST.2 I will not take the time to read the letter here. The key point was that in 1965 Irv Sher and I published a paper on Automatic Subject Citation Alert (ASCA), the first commercially available computer-based system for selective dissemination of information (SDI).10, 11

While newspaper and magazine clipping services have existed since the beginning of the last century, the ASCA personal alerting system for the first time dealt with the huge body of scientific and scholarly literature. Personal alerting services have not proven to be a howling financial success. They can only survive as by-products of systems that derive their success from searching.

Thirty-five years after launching ASCA, it is difficult to estimate the extent to which SDI is used. I see minimal evidence of this in academia. Certain institutions like Stanford have made it popular by using the ISI database in combination with SDI software developed by Los Alamos National Laboratory. Information professionals have an important educational task to make users "profile" conscious so that they will embrace these SDI systems. In particular, they must learn to take full advantage of citation as well as keyword profiling. While not called citation profiling, this capability has been incorporated in the HighWire system.12 For each new article one encounters, the user can automatically include its citation as part of an alerting profile. And most of you are aware of the now ubiquitous Google and Yahoo news alerts.
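The combination of keyword and citation profiling can be sketched as follows. The profile and article data are invented, and the matching is deliberately simple; real SDI systems score and rank rather than just intersect sets.

```python
# Sketch of ASCA-style SDI matching: a user profile combines keyword
# terms and cited-reference terms; incoming articles are alerted
# when any profile element matches.

profile = {
    "keywords": {"thalidomide", "teratogenicity"},
    "citations": {"Lenz W, 1962, Lancet"},  # illustrative cited-work string
}

articles = [
    {"title": "Thalidomide revisited",
     "keywords": {"thalidomide", "immunomodulation"},
     "cited": {"Lenz W, 1962, Lancet", "McBride WG, 1961, Lancet"}},
    {"title": "Crystal growth in microgravity",
     "keywords": {"crystallography"},
     "cited": {"Bragg WL, 1913, P R Soc Lond"}},
]

def alerts(profile, articles):
    hits = []
    for a in articles:
        kw = profile["keywords"] & a["keywords"]
        ct = profile["citations"] & a["cited"]
        if kw or ct:
            hits.append((a["title"], sorted(kw), sorted(ct)))
    return hits

for title, kw, ct in alerts(profile, articles):
    print(title, "| keywords:", kw, "| citations:", ct)
```

Note how the citation element catches the relevant article even if its vocabulary had drifted away from the profile's keywords; that is the advantage of citation profiling over keywords alone.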

Foreign Language Translation

A significant amount of interesting literature is still published in foreign languages. On-line translation dictionaries facilitate reading foreign-language material. Using pop-up windows to translate individual words or phrases, much as one uses a spell checker, can save considerable time. Given a real-time word-for-word look-up system, I can read or scan most papers in German, Spanish, or French with minimum difficulty. A great deal of editorial comment is still expressed in foreign languages, so this translation capability is important to those who wish to take into account opinions expressed by foreign authors. Foreign editors should take advantage of these translation facilities to produce online multi-lingual versions of their editorials and articles, since online journals do not face the serious space limitations of print.13
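The word-for-word look-up idea is mechanically simple, as this sketch shows. The three-entry German glossary is purely illustrative; a real system would hold a full bilingual dictionary and handle inflection.

```python
# Word-for-word look-up sketch: gloss known words in place,
# the way a spell checker annotates a text.

glossary = {"zeitschrift": "journal", "forschung": "research",
            "wissenschaft": "science"}

def gloss(text, glossary):
    """Annotate each glossary word with its English translation."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,")
        out.append(f"{word} [{glossary[key]}]" if key in glossary else word)
    return " ".join(out)

print(gloss("Die Zeitschrift berichtet über neue Forschung.", glossary))
```

The point is that even this crude gloss, delivered in real time as a pop-up, lets a reader with some knowledge of the language scan a paper without stopping to consult a printed dictionary.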

As a final thought on the matter of machine translation, the work of Watters & Patel, among others, indicates that Systran leaves much to be desired14.

Information Nirvana

In the early days of my career, I referred to the coming information nirvana.15   This was yet another metaphor for the World Brain of H.G. Wells and the dreams of the early encyclopedists. Each new generation of information technology advancements brings with it a need for new refinements.

The notion of the automatic review of the literature has been in the minds of information scientists for a long time. Whether we can ever obtain artificially intelligent literature reviews remains to be seen. Displaying lists of citations surrounded by contextual text is just one step in that direction.16 For this reason, the interpretive role of information scientists, especially in pharmaceutical companies, is still essential. Automatic or computer-assisted reviewing, if it ever arrives, will simply make them more productive.

Research scientists, especially in the life sciences, need to parse scientific documents so that key phrases used in various combinations can lead to interesting correlations. Sher used phrase analysis to create KeyWords Plus.17 New systems of artificial intelligence will facilitate the indexing needed in evaluative medicine or bioinformatics. The pharmaceutical and biotechnology industries are now dependent upon a whole new sub-industry involving structure-function determination and correlation.
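The principle behind KeyWords Plus, harvesting candidate index terms from the titles of an article's cited references, can be illustrated roughly. The actual ISI algorithm is more elaborate; the cited titles below are invented, and this sketch only counts recurring title words.

```python
# Rough illustration of the KeyWords Plus idea: words that recur
# across the titles of an article's cited references become
# candidate index terms for the citing article.

from collections import Counter

STOP = {"of", "the", "in", "and", "a", "for", "on"}

cited_titles = [
    "Structure and function of the ribosome",
    "Ribosome assembly in eukaryotes",
    "Protein synthesis and ribosome function",
]

counts = Counter(
    w for title in cited_titles
    for w in title.lower().split() if w not in STOP
)

# terms appearing in two or more cited titles become candidate keywords
print([w for w, n in counts.most_common() if n >= 2])
```

Even this crude version surfaces terms that may never appear in the citing article's own title, which is exactly the retrieval power the technique was designed to add.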

As mentioned earlier, the pioneer in discussing this type of a posteriori intelligence was John O'Connor. The intelligent automaton should be able to conclude that a document concerns an aspect of toxicity even if it never mentions the word toxic or toxicity.8

Another expression of the AI challenge is implicit in the distinction I made in 196518 between an automated system for compiling citation indexes and a system that is able to read a text and supply relevant cited references. An experiment with a group of graduate students demonstrated that the need for a cited reference in a text is perceived quite differently depending upon the reader's experience and sophistication. Given a paper I had published in the Journal of Chemical Documentation,19 the students were asked to insert a mark wherever they thought a reference was needed. The number of references recommended per document varied from 15 to 75, but averaged about 35, which in fact was close to what I had actually used.20

From the preceding remarks, it will not be surprising that I hold in high esteem the work of Don Swanson in attempting to create an artificially intelligent agent for generating correlations between disease elements and potential therapies.21, 22

All such experiments emphasize the unique role played by the critical review in the progress of science. This role is needed increasingly, even as we gain easier access to the primary literature. It is the a posteriori use of the literature that paves the way to discovery. That is what the IR game is all about. Information systems should facilitate the process of making new connections. In the meantime, human researchers, mainly laboratory-based, continue this creative process of reviewing. Organizations like Annual Reviews, Nature Reviews, the Trends journals, and others already provide a rich supply of such reviews. The huge output of review articles and their high impact demonstrates, I believe, their value to the scientific community. Lee Fleming (Harvard) and Olav Sorenson (UCLA) have recently referred to this inventive process as combinatorial.23

Twenty years ago, ISI and Annual Reviews established the National Academy of Sciences award in recognition of this role.24 However, the nature of scholarly reviews is changing in two directions: the minireview is now quite popular, but at the same time even larger, more comprehensive reviews are appearing, sometimes in the form of mini-databases.

Perhaps the most significant advance in reviewing has been made by the Cochrane Collaboration centers, which form the basis for modern evidence-based medicine.25 The success of those enterprises may now be applied to fields besides medicine, as in the new Campbell Collaboration to judge outcomes in social engineering.26 For those interested, please contact Robert Boruch at the University of Pennsylvania (robertb@gse.upenn). Electronic journals and databases will aid these systems of synthesis and should significantly reduce publication bias, since space in online journals will not be a limiting factor.27 All necessary backup data, even for negative results, can now be stored electronically. This has recently been highlighted in the press by the consortium of medical journal editors.28 Those of you in Pharma will know exactly what I mean.

Information Discovery and Recovery

This leads to a concluding observation. Information retrieval concerns both information discovery and information recovery.29, 30 While closely related, the latter process, information recovery, should approach perfection in the years to come. We should rarely have difficulty in recovering papers we have encountered in the past. Information discovery systems, however, will remain a daunting challenge for decades to come, since they involve the injection of human intelligence that is difficult to match in AI systems. Recognizing how long it has taken to reach the present state of the art, I doubt that many of us will still be here when these breakthroughs occur.

In the meantime, I have tried to make a further dent in this process. As I described earlier today, the ability to select the significant literature from a large mass of retrieved documents is essential, including the ability to provide reviews that are not only topically but also historically correct. For those of you who are not yet familiar with HistCite, I refer you to my web page on algorithmic historiography: http://garfield.library.upenn.edu/algorithmichistoriographyhistcite.html

 I prepared a special search on the subject of thalidomide [for this afternoon’s session] since it was particularly relevant to the pharmaceutical industry.  However, you will have to go to the URL to find that case study: http://garfield.library.upenn.edu/histcomp/thalidomide_ti/
And if you are so inclined, join our volunteer group of HistCite evaluators and receive free access to the software.

Thank you for your attention.

References:

1. Garfield, E. "The Future of the Information Scientist," talk presented at the New Jersey Chapter of ASIS Meeting, February 10, 2000.

2.    Garfield, E.  “A Retrospective and Prospective View of Information Retrieval and Artificial Intelligence in the 21st Century.” Journal of the American Society for Information Science and Technology 52(1):18-21(2001). http://garfield.library.upenn.edu/papers/jasis52%281%29p18y2001.pdf

3. Ackoff, RL and Halbert, MH. An Operations Research Study of the Scientific Activity of Chemists. Cleveland: Case Institute of Technology Operations Research Group (1958), as cited in Merton RK, "The Matthew Effect in Science," Science 159(3810):56-63 (January 5, 1968).

4. Garfield, E. "Evolution of the Reprint Culture: From Photostats to Home Pages on the World Wide Web: A Tutorial on How to Create Your Electronic Archive," The Scientist 13(4):14 (February 15, 1999). http://www.the-scientist.library.upenn.edu/yr1999/feb/comm_990215.html

5. Schmid R. "Naturae novitates, 1879-1944 - Its Publication and Intercontinental Transit Times Mirror European History," Taxon. 33:(4) 636-654 (1984).

6. Roth, D. Private Communication (1999).

7. Lenhoff, Howard M. "An E-Journal for a Vanishing Resource," The Scientist 14(2):35 (January 24, 2000). http://www.the-scientist.com/yr2000/jan/opin_000124.html

8. O'Connor, J. "Automatic Subject Recognition in Scientific Papers – An Empirical Study". Journal of the ACM, 12(4):490 (1965).

9. Lawrence, S. "Digital Libraries and Autonomous Citation Indexing," Computer 32(6):67-71 (1999).

10. Garfield, E. and Sher, I. "ASCA (Automatic Subject Citation Alert) -- A New Personalized Current Awareness Service for Scientists," American Behavioral Scientist 10(5):29-32 (1967). Reprinted in Essays of an Information Scientist, Volume 6, pages 514-517 (1984).

11. Garfield, E. "Introducing ASCA IV, an SDI System with Exclusive Features," Current Contents No. 5 (May 2, 1969). Reprinted in Essays of an Information Scientist, Volume 1. Philadelphia: ISI Press, page 38 (1977).

12. http://highwire.stanford.edu/

13. Garfield, E. “Foreign Language Editorials Should be Translated for the Web,” The Scientist 14(9):6 (May 1, 2000) http://www.the-scientist.com/yr2000/may/comm_000501.html

14. Watters, PA and Patel, M. "Semantic Processing Performance of Internet Machine Translation Systems," Internet Research: Electronic Networking Applications and Policy 9(2):153-160 (1999).

15. Garfield, E. "Research Nirvana -- Total Dissemination and Retrieval of Bio-medical Information?" Paper presented at the Sixth Annual Session, Medical Writers' Institute, New York City, October 5, 1963.

16. Small, H. “Cited Documents as Concept Symbols,” Social Studies of Science, 8:327-340 (1978). http://garfield.library.upenn.edu/small/hsmallsocstudsciv8y1978.pdf

17. a.) Garfield, E. "KeyWords Plus: ISI's Breakthrough Retrieval Method. Part 1. Expanding Your Searching Power on Current Contents on Diskette," Current Contents No. 32, pages 3-7 (August 6, 1990). Reprinted in Essays of an Information Scientist, Volume 13. Philadelphia: ISI Press, pages 295-299 (1991). http://garfield.library.upenn.edu/essays/v13p295y1990.pdf

17. b.) Garfield, E. "KeyWords Plus Takes You Beyond Title Words. Part 2. Expanded Journal Coverage for Current Contents on Diskette Includes Social and Behavioral Sciences," Current Contents No. 33, pages 5-9 (August 13, 1990). Reprinted in Essays of an Information Scientist, Volume 13. Philadelphia: ISI Press, pages 300-304 (1991).

18. Garfield, E. "Can Citation Indexing Be Automated?" Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Helprin, eds., Statistical Association Methods for Mechanized Documentation, Symposium Proceedings, 1964. National Bureau of Standards Miscellaneous Publication 269, pages 189-192 (December 1965).

19. Garfield, E. "Information Theory and Other Quantitative Factors in Code Design for Document Card Systems," Journal of Chemical Documentation 1(1):70-75 (1961). Reprinted in Current Contents No. 44, pages 8-19 (October 31, 1977) and in Essays of an Information Scientist, Volume 3. Philadelphia: ISI Press, pages 274-285 (1980).

20. Garfield, E. "Information Theory and All That Jazz: A Lost Reference List Leads to a Pragmatic Assignment for Students," Current Contents No. 44, pages 5-19 (October 31, 1977). Reprinted in Essays of an Information Scientist, Volume 3. Philadelphia: ISI Press, pages 271-273 (1980).

21. Swanson, DR and Smalheiser, NR. "An Interactive System for Finding Complementary Literatures: A Stimulus to Scientific Discovery," Artificial Intelligence 91(2):183-203 (1997).

22. Swanson, DR. and Smalheiser NR. "Implicit Text Linkages between Medline Records: Using Arrowsmith as an Aid to Scientific Discovery,"  Library Trends 48(1):48-59 (1999).

23. Fleming, L. and Sorenson, O. "Science as a Map in Technological Search," Strategic Management Journal 25(8-9):909-928 (August-September 2004).

24. Garfield, E. "The NAS James Murray Luck Award for Excellence in Scientific Reviewing: G. Alan Robison Receives the First Award for His Work on Cyclic AMP," Current Contents No. 18, pages 5-9 (April 30, 1979). Reprinted in Essays of an Information Scientist, Volume 4. Philadelphia: ISI Press, pages 127-131 (1981).

25. http://www.cochrane.org/

26. http://www.campbellcollaboration.org/

27. Song, F., Eastwood, A., Gilbody, S., and Duley, L. "The Role of Electronic Journals in Reducing Publication Bias," Medical Informatics and the Internet 24(3):223-229 (1999).

28. De Angelis, C., Drazen, JM, et al. “Clinical Trial Registration: A Statement from the International Committee of Medical Journal Editors"  New England Journal of Medicine 351(12):1250-1251 (September 16, 2004).

29. Garfield, E. “ISI’s Comprehensive System of Information Services,” Current Contents No. 3 (March 25, 1969).
Reprinted in Essays of an Information Scientist, Volume 1. Philadelphia: ISI Press, page 31 (1977). http://garfield.library.upenn.edu/essays/V1p032y1962-73.pdf

30. Garfield, E. “ISI Eases Scientists’ Information Problems: Provides Convenient Orderly Access to Literature,”
Karger Gazette No. 13, pg. 2 (March 1966). Reprinted as “The Who and Why of ISI,” Current Contents No. 13, pages 5-6 (March 5, 1969), which was reprinted in Essays of an Information Scientist, Volume 1: ISI Press, pages 33-37 (1977). http://garfield.library.upenn.edu/essays/V1p033y1962-73.pdf