MECHANIZATION OF CHEMICAL INFORMATION PUBLICATIONS AND SERVICES

by

A. W. Elias
E. Garfield
G. Foeman
G. Revesz

Institute for Scientific Information
Philadelphia, Pa.
_________________________________

ABSTRACT
A review is presented of intellectual, design, and format
considerations involved in the mechanization of chemical
information for publication. The role of conventional data
processing equipment, computers, and specialized devices
will be noted. By-product information for retrieval will be
discussed with attention to current research in this area.
                                          _________________________________

The use of mechanical procedures as aids in the production of chemical information publications is growing, as is the use of machine records as publications in their own right. It is proper to question these applications, and it is important to know whether they are used simply because they exist. Are they medications used to alleviate symptoms rather than the disease? Are they sales gimmicks used for promotional purposes that employ the glamour of automata? Or are they sound, systematic solutions offered to meet the needs of publisher and user?

What exactly is meant by mechanization in chemical information services and publications? By mechanization we mean the application of a device or devices to perform work either of a "routine" clerical nature or of an intellectual variety.

Chemical information services include conventional printed journals and machine-language versions of them, such as punched cards or magnetic tape. Indeed, the whole range of abstracting and publication media falls into this category.

Let us analyze mechanization then not in terms of specific products but in terms of information functions using examples as appropriate.

The functions for mechanization may be broadly categorized as those (1) prior to the physical production of a product, that is, mechanization of the intellectual phase of chemical information, and those (2) subsequent mechanical or production activities which make the information derived by intellectual activity available to the user.

In which applications of machines have the intellectual operations necessary for chemical information publications been affected? In so-called "auto-abstracting" (1) complete chemical texts would be subjected to machine analysis.

In these techniques the most important or "meaning-bearing" words and phrases are detected through statistical or other procedures. In this way indexing or abstracting is presumably accomplished without human intervention. In fact, this is only partially true, since considerable human editing of the text for machine input is presently required. Uniformity and consistency are said to result from the mechanical procedures involved, and semantic problems are presumably avoided by examining each document individually and not in relation to other papers in the same universe. Such work is presently being expanded by Giuliano (2), Sherry (3), and Bryant (4). Dr. John O’Connor, of ISI, has reviewed and is studying in detail the various automatic indexing procedures. His studies show that automatic indexing will have to be considerably improved to match the sophistication of many human indexing procedures.
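
As a rough illustration of what statistical detection of "meaning-bearing" words can look like, the following Python sketch counts word frequencies after discarding common words and then ranks sentences by the number of frequent words they contain. The common-word list, the frequency threshold, the sample text, and all function names are assumptions made for illustration; the sketch is not the procedure of any of the workers cited above.

```python
import re
from collections import Counter

# Hypothetical list of common words that carry little meaning.
COMMON_WORDS = {"the", "of", "a", "an", "and", "in", "is", "to",
                "by", "with", "was", "were", "for"}

def significant_words(text, min_count=2):
    """Return words occurring at least min_count times, ignoring common words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in COMMON_WORDS)
    return {w for w, n in counts.items() if n >= min_count}

def rank_sentences(text, min_count=2):
    """Order sentences by the number of significant words each contains."""
    keywords = significant_words(text, min_count)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for s in sentences:
        tokens = re.findall(r"[a-z]+", s.lower())
        scored.append((sum(1 for t in tokens if t in keywords), s))
    # Stable sort: highest score first, original order kept among ties.
    return [s for _, s in sorted(scored, key=lambda pair: -pair[0])]

# Made-up sample text, used only to exercise the functions.
sample = ("The ketone was reduced with sodium borohydride. "
          "The yield was good. "
          "Reduction of the ketone gave the alcohol, and the alcohol "
          "was oxidized back to the ketone.")

for sentence in rank_sentences(sample):
    print(sentence)
```

Even in this toy form the human contribution is visible: someone must choose the common-word list and the threshold, and the text must first be put into machine-readable form.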

Mechanical translation should be mentioned here. Obviously the analysis of language for translation and the analysis of language for retrieval are related. Therefore some of the achievements of MT may spill over to auto-indexing. In 1961, Garfield showed that natural language expressions of chemical compounds, that is, chemical nomenclature, can be translated by computer to such chemonymic forms as molecular formulas (5). This was but a first step in a more ambitious grammar he calls CHEMTRAN in which names, notations, structural formulae, ciphers, etc. would be produced automatically. Recently, Dyson (6) has reported on research involving automatic translation of chemical names. The Dyson work is directed to the problem of converting CA index nomenclature entries to atom-by-atom connectivity matrices and/or Dyson ciphers. The method requires, however, that the chemical name first be presented in the form in which it appears in the CA indexes, a form that originally required a not insignificant human intervention. This is quite different from dealing with the unindexed chemical name as encountered in the literature.

While computers undoubtedly will play ever increasing roles in the intellectual operations required for handling chemical information, other devices may be employed to facilitate routine operations. INDEX CHEMICUS has recently made one of these publicly available. The HydroBond Computer was developed to aid the INDEX CHEMICUS indexing staff in their calculations of molecular formulas. By setting the scales of this circular slide rule, the proper hydrogen count is obtained in an accurate and speedy fashion. Use of the entire procedure, however, assures accuracy of the entire formula -- not merely the H count.

The application of mechanical procedures for the physical production of chemical information publications and services may be divided into four categories based on the form and functions of the services. These include
1) production of chemical journals,
2) textbooks,
3) chemical handbooks and guides, and
4) alerting-retrieval services.

Conventional journal and textbook publication involving mechanization must be subject to the same economics as the publication of any magazine or book. It is justified in the interest of speed of production, reduction in production costs, improvement in product quality, the obtaining of some by-product which is itself salable, or by the achievement of input and output flexibility yielding, at low work or low additional cost, supplemental publications or services. "Intellectual" effort may be reduced by eliminating repetitious editing work previously required.

As far as the initial creation of chemical information is concerned, that is, prior to its translation into some form, e.g. the manuscript, mechanical devices have contributed very little. Dictation machines have been used at CA to expedite indexing work, and production of manuscripts by tape typewriter is a start in this direction, but so far we have only scratched the surface. If speed of production were the only reason, it would be difficult for us to justify mechanization, since the intellectual activity involved is the main rate-limiting factor.

Production cost reduction through mechanization is a more cogent reason, therefore, and, indeed, much of the activity in publication mechanization results from consideration of this factor. It is, however, a general result, not one specific to the economics of chemical information publications.

The third economic consideration, that of improved product quality, may be a real consideration for the future of mechanized publication production, but at this time cannot be realistically considered as a single factor justifying mechanization. Naturally, it would be nice if this were also achieved in the course of attaining one of the other objectives of the mechanization program.

The fourth consideration may well be an economic justification for producing a publication or service by mechanized procedures; that is, can useful and salable by-products be obtained? This is not presently true for chemical journals, but may well be in the future when mechanical processing of entire chemical texts is desired for whatever reason -- MT or MI. CT and ASCA, to be discussed shortly, are both by-product services which could not be sold initially at their low cost if they were not by-products of publications. CT is a by-product of CA, and ASCA a by-product of the SCI.

Finally, mechanization may be reasonable if the needs of the publisher and/or user can be foreseen as requiring more than one form of the same information. Flexible machine records can be used to update, correct, reformat, select, or suppress information.

For example, in the production of a handbook or data compilation, the foreseen need of revisions or updating at frequent intervals could be a justification for mechanization.

The most extensive applications of mechanization of chemical information services have taken place in the alerting or retrieval types. I will not attempt to cover all of these, but some typical examples can be discussed here.

Chemical Titles is a good starting point. CT is justified, among other reasons, on the basis of its speed and low cost. The intellectual analysis of the titles is not really achieved completely mechanically. In fact, every title must be pre-edited for keypunching. The computer then creates all possible permutations of the title words and, by routine alphabetizing, an index results which can be quite useful, especially when its limitations are understood. There also results a similarly useful, or potentially useful, tape which can, with further refinements, provide selective dissemination of information (SDI) along the lines devised originally by Luhn (7) and used experimentally by Dr. Rice, of Eli Lilly (8). The process that is mechanized in using CT tapes for SDI is the inspection of the edited title information, which simulates the intellectual activity of indexing by keywords. This requires the use of so-called stop lists. C. Montgomery and D. Swanson have also suggested using the same technique without the computer (9). In short, an algorithmic procedure can be used by a clerk or a machine to produce the same result. Therefore, it is clear that the intellectual achievement is in programming, that is, in the design of algorithmic procedures; it is not an intellectual achievement by the machine. It may well be that programming is the principal ingredient in characterizing intellectual achievement. For example, compiling citation indexes involves an algorithmic procedure which, for the time being, has not been and cannot be done by machine. Citation indexers or editors must scan texts to extract bibliographic citations and edit them, and specially trained keypunchers convert them to machine language. Further, it is unquestionably beyond the capability of any computer, in the foreseeable future, to mechanize the intellectual analysis performed by the author who makes the reference citation in the first place. This was discussed in detail at the NBS-ADI symposium last year.
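
The title-permutation procedure and its reuse for selective dissemination can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: "permutation" is taken to mean one rotation of the title per non-stop word, the stop list is a short hypothetical one, the sample titles are invented, and none of the names below come from the CT programs themselves.

```python
# Hypothetical stop list; a working stop list would be much longer.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}

def kwic_entries(title, stop_words=STOP_WORDS):
    """Return one (keyword, rotated title) entry per non-stop word in the title."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip(".,;:()").lower()
        if not key or key in stop_words:
            continue
        # Rotate so the keyword leads; "/" marks the original end of the title.
        rotated = " ".join(words[i:] + ["/"] + words[:i])
        entries.append((key, rotated))
    return entries

def kwic_index(titles, stop_words=STOP_WORDS):
    """Build the index by routine alphabetizing of all entries."""
    entries = []
    for title in titles:
        entries.extend(kwic_entries(title, stop_words))
    return sorted(entries)

def sdi_matches(titles, profile_words, stop_words=STOP_WORDS):
    """Return the titles that share at least one keyword with a user's profile."""
    profile = {w.lower() for w in profile_words}
    hits = []
    for title in titles:
        keys = {key for key, _ in kwic_entries(title, stop_words)}
        if keys & profile:
            hits.append(title)
    return hits

# Invented titles, used only to exercise the functions.
titles = ["Synthesis of Steroids", "Infrared Spectra of Steroids in Solution"]
for key, line in kwic_index(titles):
    print(f"{key:10} {line}")
```

Every step here could equally be carried out by a clerk with the same stop list, which is the point made above: the intellectual content lies in the design of the procedure and in the pre-editing of the titles.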

In CT we have seen how the mag tapes can serve the dual function of producing a printed KWIC index and an SDI system based on title words. In a similar fashion, the magnetic tapes used to produce the Science Citation Index are used to provide a service for selective dissemination of information. It is called ASCA. The weekly index tapes of SCI are matched with user profiles. In ASCA, however, profiles consist of citations rather than words. The need to employ matching "factors," or any other means of resolving ambiguity of words, is eliminated since there is no ambiguity in a citation, whereas a single word may not be sufficient to define the content of a document. In ASCA the scientist is notified of the existence of the match, its bibliographic location, and the title and authors of the citing article. The Automatic Subject Citation Alert is, like CT tapes, also commercially available to the individual scientist. However, while CT tapes are transmitted to the subscriber, the citation tapes are not sent to the client. The so-called source tapes, the equivalent of CT's title tapes, are available and are now being tested experimentally by one large agency. There is every reason to state they are comparable to CT tapes, but naturally there are differences in the tape formats which may mean user programs for CT cannot be used on SCI tapes without some modification.
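
The matching of citation profiles against a week's source items can be pictured with the following Python sketch. The record layout, the normalization rule, the function names, and every citation shown are invented for illustration; the actual SCI and ASCA tape formats are not described in this paper.

```python
from collections import namedtuple

# Hypothetical record layout for one citing (source) article.
SourceItem = namedtuple("SourceItem", "authors title location cited")

def normalize(citation):
    """Reduce a citation string to a comparable key (case and spacing ignored)."""
    return " ".join(citation.upper().split())

def asca_alerts(weekly_items, profiles):
    """Return {subscriber: [(matched citation, source item), ...]} for one week."""
    alerts = {}
    for subscriber, wanted in profiles.items():
        wanted_keys = {normalize(c) for c in wanted}
        for item in weekly_items:
            for cited in item.cited:
                if normalize(cited) in wanted_keys:
                    alerts.setdefault(subscriber, []).append((cited, item))
    return alerts

# All of the data below are made up.
weekly_items = [
    SourceItem(authors="A. Author", title="A made-up citing paper",
               location="J. Hypothetical Chem. 1, 1 (1965)",
               cited=["DOE J, J INVENTED CHEM 5, 100, 1960",
                      "ROE R, INVENTED REV 2, 7, 1958"]),
    SourceItem(authors="B. Writer", title="Another made-up paper",
               location="J. Hypothetical Chem. 1, 9 (1965)",
               cited=["POE E, INVENTED LETT 3, 44, 1962"]),
]
profiles = {"subscriber-1": ["Doe J, J Invented Chem 5, 100, 1960"]}

for subscriber, hits in asca_alerts(weekly_items, profiles).items():
    for cited, item in hits:
        print(subscriber, "|", cited, "|", item.title, "|", item.authors, "|", item.location)
```

Because a profile entry is a complete citation, the comparison is a simple equality test; no weighting of terms is needed, as it would be for a profile made up of single words.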

An example of a machine index is that prepared by Derwent Ltd. This is known as the RingDoc system, derived as it was from the original cooperative Ring of four pharmaceutical companies. RingDoc is a combination of conventional abstracting and machine retrieval capability. Instead of a printed index, a set of punched cards is issued at regular intervals. The chemical information contained in abstracts, which were previously issued on individual 4 x 6 inch sheets of paper, has been coded by a chemist. The code used is essentially the system designed by Steidle, of Knoll A.G., Germany (10). However, for the use of those customers who do not use punched cards, an entirely independent set of machine information is available -- this is the previously reported codeless scanning (11) developed at Hoffmann-La Roche and Sandoz. The use of the machine in the RingDoc system is mainly in sorting punched cards by well-known, simple EAM methods. Some companies may use these punched cards to prepare a mag tape that can be searched on a computer. This is not as simple as it sounds, however.

Codeless scanning tapes, however, can be used both for searching and for dissemination in a fashion similar to that for CT and SCI source tapes.

Farmdoc, the earlier service issued by Derwent, involves a punched card on which an abstract has been imprinted.

Incidentally, the indexing entries used by Information for Industry for its Uniterm Index to Patents are also available on magnetic tape, but I will not discuss this system today.

Machine methods are used extensively in the preparation of the Index Chemicus. In 1965 some new methods were introduced which increased the reliance on machine methods. A brief outline of its production technique is necessary to explain this. The main intellectual task, of course, remains a human operation, including selection of articles to be indexed. Indexing the novel compounds reported within those articles, including preparation of diagrams and flow charts, is also done by chemists, as is selection of the subject terms. In this report the subject index is similar to CA's key-word indexes, not to be confused with CT. All indexed information in an IC abstract is keypunched except for the structural diagrams and author summaries. This includes titles, authors, addresses, bibliographic citations, molecular formulas, diagram reference numbers, generic substituents, and subject index terms. These punched cards contain all of the alpha-numeric information necessary to produce the published abstract, the biweekly indexes, and the quarterly and yearly cumulative indexes. The cards themselves do not contain the control punches and instructions needed to format the information. This is done in an independent operation in which the cards are converted to paper tape using a programmed card-to-tape converter. Through this program, control information missing from the cards is sensed by the machine and generated. Thus in a molecular formula card, for example, the machine detects successive alphabetic and numeric characters and, based on this juxtaposition, generates in the paper tape the instruction to the typewriter to shift to lower case for the second alphabetic character only. Similarly, the all-cap mode for title cards, the initial-cap mode for author cards, tabulations, and indentations are all generated without encoded instructions.
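
The generation of case-shift instructions from the juxtaposition of characters can be imitated in Python as follows. This is a sketch under assumptions, not a description of the card-to-tape converter program: it supposes that cards are punched in capitals only and that every element symbol is followed by a digit or a blank, so that two adjacent letters always belong to one two-letter symbol. How a sequence such as C followed directly by H was actually punched is not stated here. The card-type names and function names are likewise hypothetical.

```python
def format_formula(card_text):
    """Lower-case the second of two adjacent letters, e.g. 'C6H5CL' -> 'C6H5Cl'."""
    out = []
    prev_alpha = False
    for ch in card_text.upper():
        if ch.isalpha() and prev_alpha:
            out.append(ch.lower())
            prev_alpha = False  # only the second letter of a pair is shifted
        else:
            out.append(ch)
            prev_alpha = ch.isalpha()
    return "".join(out)

def format_title(card_text):
    """Title cards are printed in the all-cap mode."""
    return card_text.upper()

def format_author(card_text):
    """Author cards are printed in the initial-cap mode."""
    return " ".join(word.capitalize() for word in card_text.split())

# Hypothetical card types mapped to the formatting rule each one receives.
FORMATTERS = {
    "formula": format_formula,
    "title": format_title,
    "author": format_author,
}

def format_card(card_type, card_text):
    """Apply the rule for the card type; no control codes are punched on the card."""
    return FORMATTERS[card_type](card_text)

print(format_card("formula", "C6H5CL"))
print(format_card("author", "SMITH J A"))
```

The formatting rule is selected by the type of card alone, which is what allows the keypuncher to enter only the alpha-numeric information.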

When the paper tape is passed through an output typewriter, the resultant product is uniform in format. Gross errors are detected since invalid punches or formats are printed obviously out of position. The information needed, both for the IC abstract and the indexes, is obtained as the result of one keypunching operation. Of course, IC has never used highly skilled compositors, as in Chemical Abstracts, because the publication is produced by photo offset printing.

Although this procedure is still too new for final evaluation, it is expected that production will be speeded and/or increased through the elimination of repetitive operations formerly required to produce the abstract itself by typists followed by a separate keypunching operation for the indexes.

The card system permits more facile correction of machine information, but only partially resolves the problem of corrections in the abstracts, which still usually have to be done by hand unless the corrections in one abstract are quite drastic, which is rarely the case.

Salable by-products include the various computer tapes of abstract and/or index information. Flexibility is inherent in the machine unit records and tapes, enabling the rearrangement of the information and the selection or suppression of parts thereof. One application might be the production of a KWIC-type index using only the title records, a KWOC-type by adding the subject terms, or even use for codeless scanning.

Research continues in all the areas we have discussed. The rising costs of conventional composition and typography alone militate for further experiments, as exemplified by the ACS and AIP research along these lines.

In conclusion, we have reviewed a number of areas in which mechanization has been used to advantage. Nevertheless, it should be evident that the use of machines to perform intellectual functions in indexing has been minimal.

The main use of computers has been in facilitating and lowering the cost of the production or routine operations in indexing. We are only beginning to seriously tackle these problems. Of course, if there were more time, we could discuss many operations now performed by machines that used to be considered intellectual or editing tasks as, e.g., in the selection of cross-references by the Index Medicus. This underscores one of the advantages of the systems approach -- defining the intellectual and routine tasks in a precise fashion. Getting the author to participate in the indexing system should not deceive anyone into believing that machines have conquered fundamental human capabilities. Even the advent of optical and/or electronic character recognition devices will not alter this situation. The attempt to use such devices for intellectual analysis heightens the participation of the author in text preparation. CT and CC type publications focus greater attention on the need for better titles. The development of citation indexing encourages not only standard bibliographic practices, but also more careful attention to citing earlier literature. Just as a poor title will not cause a paper to be disseminated by a word-oriented system, a paper will not be disseminated in a citation-based system if it does not cite prior literature. Thus, the inevitable product is, as it has always been, primarily a product of the human intellect -- whether it is the original author or the editor-indexer who must transform the author’s intellectual creation into an organized, retrievable, and disseminable form.

REFERENCES
1. H. P. Luhn, "The Automatic Creation of Literature Abstracts," IBM Journal of Research and Development 2(2), 159-165 (1958).
2. V. E. Giuliano, "Requirements of Future Computer Memories for Document Processing." Paper presented at the 27th Annual Meeting of the American Documentation Institute, Philadelphia, Pa., October 5-8, 1964. Abstracted in Proceedings of the American Documentation Institute 1, 341 (1964).
3. M. E. Sherry, "Memory Organization of a 7090 to do Statistical Association Processing for Document Retrieval," Proceedings of the American Documentation Institute 1, 337-339 (1964).
4. E. C. Bryant, "Redirection of Research into Associative Retrieval," Proceedings of the American Documentation Institute 1, 503-505 (1964).
5. E. Garfield, An Algorithm for Translating Chemical Names to Molecular Formulas, Institute for Scientific Information, Philadelphia, 1961.
6. G. M. Dyson, "A Cluster of Algorithms Relating the Nomenclature of Organic Compounds to Their Structure Matrices and Ciphers," Information Storage and Retrieval 2, 159-199 (1964).
7. H. P. Luhn, "Keyword-in-context Index for Technical Literature (KWIC Index)," IBM Advanced Systems Development Division Report RO-127, August 31, 1959.
8. C. N. Rice, "Automated Literature-Alerting Services." Paper presented at the 148th National Meeting of the American Chemical Society, Chicago, Ill., September 1, 1964. Abstracted in Abstracts of Papers for that meeting, p. 5G.
9. C. Montgomery and D. R. Swanson, "Machinelike Indexing by People," American Documentation 13(4), 359-366 (1962).
10. W. Steidle, "Möglichkeit der mechanischen Dokumentation in der organischen Chemie," Die Pharmazeutische Industrie 19, 88-93 (1957).
11. F. Wegmueller, R. Becher, and B. Hoffman, "Codeless Scanning, A New Method of Automatic Documentation," Experientia 16(8), 383-384 (1960).