JILT 1997 (3) - Taylor Fitchett
The Road to the Virtual Library:
The Center for Electronic Text in the Law Builds DIANA
Taylor Fitchett
University of Cincinnati
taylor.fitchett@uc.edu
Abstract
The advent of the virtual library is the most fundamental change in the history of librarianship. It will alter the way libraries do business, but it should not change the historical mission of the library to acquire, organize, preserve, and make accessible the human record. The University of Cincinnati College of Law Library created the Center for Electronic Text in the Law (CETL) to do research on the management of electronic text in preparation for its participation in building the virtual library.
CETL is one of a growing number of library efforts focused on the production of electronic text that will form the foundation of the virtual library. Its staff are currently looking at methods of coding electronic text using Standard Generalized Markup Language (SGML), specifically Text Encoding Initiative (TEI) SGML. As the virtual library grows, meaning must be extracted from vast amounts of textual data. Applying this type of markup to documents constructs a foundation for the future, preventing information loss when documents are migrated to applications not yet created.
In addition to its research function, CETL produces two databases for the Internet. The article offers justification for the librarian's role in producing such databases and details the creation of CETL's SGML database of human rights documents, DIANA.
Key words: virtual library, digital library, document imaging, Standard Generalized Markup Language, SGML, Text Encoding Initiative, TEI, Center for Electronic Text in the Law, CETL, DIANA, human rights documentation
This is a Commentary article published on 31 October 1997.
Citation: Fitchett T, 'The Road to the Virtual Library: The Center for Electronic Text in the Law Builds DIANA', Commentary, 1997 (3) The Journal of Information, Law and Technology (JILT). <http://elj.warwick.ac.uk/jilt/virtlib/97_3fitc/>. New citation as at 1/1/04: <http://www2.warwick.ac.uk/fac/soc/law/elj/jilt/1997_3/fitchett/>
1. Introduction
For over a decade the signs of the virtual age that we are entering have been visible. Certainly a large number of people would be both technologically and psychologically ready for the virtual library if it were here today. Some define the virtual library as the library with no walls, no books, and no librarians. Others see it as a nexus for information activities, including e-mail, teleconferencing, newsgroups, listservs, etc. It encompasses access, delivery, and preservation systems, as well as education and communication systems. It is one global library and one worldwide system of communication that will provide all the critical works of the libraries and research institutions in the world. This library will be accessible to anyone, anywhere, and at any time.
Librarians have played a prominent role in the evolution to electronic information and are essential components in its continuing development. In the years prior to the Machine-Readable Cataloging (MARC) era that began in the mid-1960s, libraries were organising bibliographic information in a fashion that would lead to the international implementation of the MARC record, a development that would be the foundation of the library's electronic movement. With the vast body of metadata electronically encoded, it would be a matter of time until the text associated with the MARC record would also be available digitally. While the virtual library does not yet exist in its entirety, parts of it are in place, and its advent brings with it a fundamental change in the way libraries do business. Today, librarians are producing information for the Internet and exploring ways to better index its content.
Librarians are also studying complex issues in information management, not all of which are foremost on the agendas of others in the information business. These issues include the authentication of documents, document citation and migration, the organisation of huge bodies of data, information security and preservation, and user privacy. Certainly publishers, authors, and researchers are concerned with the management of electronic information, but the preservation of our cultural heritage during the age of electronic information depends on coordinated strategies among libraries. Transition from hard copy collections to the electronic medium is not license for the library to abandon the historical charge from society to acquire, organise, preserve, and make accessible the human record.
With no master blueprint on how to build a virtual library, libraries have begun to move collections into cyberspace. Special collections of photographs, manuscripts, and older literary works for which copyright is not a barrier, are among the first items from libraries to enter cyberlife. There are a growing number of sophisticated electronic text projects in the States. Most notable for their standards in developing electronic text are the Model Editions Partnership, supported by Rutgers University and the Universities of South Carolina and Illinois at Chicago <http://mep.cla.sc.edu/>, and the Women Writers Project at Brown University <http://www.wwp.brown.edu/>. The goals and strategies of such groups have set the standard for the Center for Electronic Text in the Law at the University of Cincinnati, whose work, especially its work on the DIANA database of international human rights documents, is the focus of this article <http://www.law.uc.edu/Diana>.
2. The Center for Electronic Text in the Law
Four years ago the University of Cincinnati College of Law Library established the Center for Electronic Text in the Law (CETL), because we realised electronic text had to become an important part of our law school's operation. We also understood that we needed to look at information in new ways and develop new information paradigms to prepare to meet the predicted needs of the user. There was concern that although we increasingly relied upon the electronic medium for access to legal materials we had little input to either the development or use of that medium. With the support of the Dean of the College of Law and a few thousand dollars, in 1993 CETL was created. [1] CETL is fully integrated into the library's operations, and although it has only two employees assigned on a full-time basis, most members of the library staff have responsibilities related to its work. CETL's mission is to promote legal scholarship through the understanding, management, and production of enhanced electronic text. The work of CETL can be divided into three categories:
2.1 Research on Electronic Text:
Research on methods of processing and managing electronic text is an ongoing objective at CETL. In the earliest stages of exploring hardware and software platforms and various standards for electronic text, research was CETL's primary function. During this period we built a small prototype database for experimentation and demonstration, but we were not anxious to rush into text production without a review of the technology available at the time and an understanding of the standards associated with electronic data. Although CETL has added a production component, exploring new methods of electronic text management remains the foundation of its work.
At the moment, we are looking at standards of encoding electronic text, especially using ISO standard 8879, the highly structured Standard Generalized Markup Language (SGML). We are examining both the costs and benefits of this markup system. SGML became interesting to us because it preserves text structure in an application-independent manner. The type of SGML that CETL is using is the version created by the Text Encoding Initiative (TEI). We use TEI SGML because it has become an academic standard.
Text markup is so critical to the management of electronic information that a small digression to explain its importance is merited. In the future, we do not want to discover that we have limited people's ability to use the electronic text we have created. Software changes rapidly and we risk information loss if we do not protect it during the many migrations it will surely make. Librarians must build systems that can be shared among institutions and this implies the use of standards. SGML is a standard that can be employed by those building digital libraries to ensure the ability to share data. A recent report sponsored by the Commission on Preservation and Access <http://www.clir.org/cpa/> concludes that SGML meets current preservation and access requirements for digital libraries. [2]
The University of Cincinnati Law Library was among the first law libraries to use SGML as a distribution and preservation medium for legal information. SGML is an international standard adopted in 1986 for the description of marked-up electronic text <http://www.sil.org/sgml/sgml.html>. It is a markup language, i.e. a set of instructions for encoding text. Encoding text is a means of making explicit the interpretation of text, e.g. identifying the author or title of a publication or designating that a word is a term of art within a discipline. SGML provides a standardised method to assign attributes to a text and define the structure of the text. SGML is descriptive, i.e. it explains the contextual significance of a language, rather than being procedural. There is a new international standard, ISO/IEC 10179, called Document Style Semantics and Specification Language (DSSSL), that has been developed as a companion to SGML and specifies how SGML data will be output in either print or electronic format.
SGML allows the separation of text and structure. Because of this ability SGML documents can be interchanged among many systems in many ways. It is, in other words, highly reusable data, because it is not bound by physical formatting. The need to use electronic texts in new applications helped drive the development of SGML. Interchanging documents with minimal information loss as they move from application to application will preserve content. SGML is widely used because it is non-proprietary; it is vendor independent and application independent. Because SGML documents are intelligent they are very searchable. The user can search document fields, such as footnotes, glossaries, or tables of contents, thereby pinpointing information efficiently.
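To make the idea of searching document fields concrete, the following is a minimal sketch, not a description of CETL's software. It assumes a small, well-formed XML sample standing in for SGML (Python's standard library has no SGML parser); the element names `p` and `note`, the sample notes, and the function `search_field` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: descriptive markup says what each piece of text IS
# (a paragraph, a footnote), not how it should look on the page.
SAMPLE = """
<text>
  <body>
    <p>Everyone has the right to life, liberty and security of person.</p>
    <note place="foot">Adopted by the General Assembly in 1948.</note>
    <p>No one shall be subjected to torture.</p>
    <note place="foot">See also the 1984 Convention against Torture.</note>
  </body>
</text>
"""


def search_field(xml_text, field, term):
    """Return the text of every <field> element containing term (case-insensitive)."""
    root = ET.fromstring(xml_text)
    hits = []
    for element in root.iter(field):
        content = "".join(element.itertext()).strip()
        if term.lower() in content.lower():
            hits.append(content)
    return hits


# The same word can be searched for in footnotes only, or in body paragraphs only.
footnote_hits = search_field(SAMPLE, "note", "torture")
paragraph_hits = search_field(SAMPLE, "p", "torture")
print(footnote_hits)
print(paragraph_hits)
```

Because the structure is recorded in the markup rather than in the formatting, restricting a search to one kind of field requires no knowledge of how the document is displayed.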
In addition to its research function, CETL has acted as a text consultant for a number of libraries who are developing technological approaches to publication. Consequently, an effort is made to stay current with new products and standards that relate to text representation and management.
2.2 Publication Services to Faculty and Students:
CETL also serves as a publication center for law faculty and students. It assists the law school community in building databases for research and teaching. Much of the assistance has been course related, and we have drawn on the resources of the University's Center for Academic Technology for consultation in instructional design. Faculty members, accustomed to the traditional method of teaching law, need assistance in preparing course materials for the online environment. Whether assisting in building a Folio Views database of case law or an HTML page for a class assignment, it is our goal to encourage the teacher to maximise the use of the instructional tool.
2.3 Electronic Publishing:
CETL publishes two databases on the Internet. Each of these databases supports a center of study that has been established at the University of Cincinnati College of Law. The Corporate Law database supports the work of the Center for Corporate Law. The documents in this database are either obtained in electronic format from an information vendor, or scanned locally, and marked up in HTML. The other, DIANA, a collection of human rights documents, is an SGML database that supports the work of the Urban Morgan Institute for Human Rights. [3] DIANA is the more complex of the two databases, and its construction process is described in the following text.
Before describing the techniques of building DIANA, a brief history of its origin is in order. The University of Cincinnati Law Library has a typical collection of legal materials for a law school. One collection distinguishes it from similar schools: its collection of human rights materials. The collection is heavily borrowed from by other academic institutions, and the bibliographic indexing tools that provide access to its content are minimal. Several years ago a decision was made to create a database of human rights materials that would be accessible to researchers over the Internet. Many documents necessary to research in the field of human rights are difficult to obtain, especially in areas of the world where they are most needed.
At the same time that we were beginning our human rights initiative, we learned that a number of other institutions had a similar idea, and we decided to partner with these groups. We had several coordinating meetings and set up an advisory board of human rights scholars and activists. The members of the Board, who oversee the development of DIANA, have agreed that it will be a non-fee based library of human rights materials on the Internet. It was decided that the database will first contain the core instruments central to research in the field of human rights and expand to include difficult-to-obtain U.N. documents, briefs of the various human rights organisations, non-governmental organisation information, and current awareness materials. As an international database DIANA will be in multiple languages, and it will ultimately combine the efforts of many organisations around the world. [4]
Unfortunately, at this time there has been no consensus among the contributors to DIANA concerning the technical standards for building the database. CETL stands firmly behind a set of principles for editorial markup that are not used by other contributors. It is our concern that the database be built to meet the scholarly editorial practices of today, that it be designed to limit the amount of information loss that could be encountered in future migrations, and that it be maximally transportable. All references to the construction of the DIANA database that follow refer to the University of Cincinnati's portion of that database.
3. Creation of the SGML Database: DIANA
CETL Process Flow Chart: <http://www.law.uc.edu/CETL>
3.1 Determining Intellectual Content
The first step in the process of building DIANA is taken outside of CETL with the selection of a particular title for inclusion in the database. As the project has grown, many people in the field of human rights have become interested in DIANA's scope of coverage. We have drawn on the work of bibliographers who are not on our staff, and we have consulted with scholars working in the field of human rights to ensure the proper development of the intellectual content of DIANA.
3.2 Acquisition of Source Material and Copyright Permission
The acquisitions librarian locates and acquires materials for inclusion in the databases. What makes the acquisition of documents for our electronic work challenging is that we strive to obtain original sources or the closest thing we can get to an original source. This means that if we are working on current electronic documents from the United Nations, we want to acquire them from the United Nations in the word processing format in which they are originally produced. If we are scanning hard copy of the Organisation of African Unity resolutions, we prefer to get the text of the resolutions directly from the OAU and not reprinted from some other publication.
The acquisitions librarian also acquires copyright permission for an electronic reproduction of a source document when it is required. There are so many parties interested in the transfer of electronic information, including copyright holders, librarians, publishers, reproduction rights organisations, various user groups, Internet providers, etc., that we are proceeding with caution when digitising any material not in the public domain. Acknowledging the uncertainty surrounding the application of copyright laws to digital media, CETL moves deliberately toward the acquisition of copyright permission.
3.3 Administrative Control of Text
From the moment the acquired document enters CETL it is tracked. Electronic documents and their paper counterparts are assigned accession numbers. The number always resides with the electronic copy, and if there is a companion paper copy, it is placed in a folder and archived under the same number. In this way, the paper copy is always available to the text editors who may have to consult the document later in the conversion process.
3.4 Text Conversion Process
The text conversion process, i.e. the process of taking an original document and putting it in electronic format, can be quite simple or very complex, depending on the intent for making the document electronically accessible and the format of the source document. Concerns with such things as document stability, transportability, and preservation of a source document can make the conversion process complex and, consequently, expensive. The DIANA database is viewed as being a long-term resident within the virtual library, and therefore, must be built using the highest standards available today for the management of electronic text. A document of transient importance can be converted simply and inexpensively in any number of ways, including having it scanned into PDF, marked up in HTML, or even left in a flat ASCII format.
The path that a document follows once it arrives in CETL differs depending on a number of variables. If it arrives already in an electronic format, as most of DIANA's United Nations documents do, a number of processing steps will be eliminated. If, however, the source document is in paper form a decision must be made as to whether it will be digitally imaged and OCR'd or whether it is best sent to a vendor who will rekey the text. Unless the original text is very high quality, we have found that it is considerably cheaper, roughly six times less expensive, to have the text rekeyed.
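The cost comparison can be worked through with a short calculation. This is only a sketch: the rekeying rate (85 cents per thousand characters, about $2 per page) is the vendor rate quoted in footnote 5, and the factor of six is the rough ratio given above. The characters-per-page and OCR-path figures that the code derives are implications of those numbers, not costs reported by CETL, and the function name `rekey_cost` is an assumption.

```python
RATE_PER_THOUSAND_CHARS = 0.85  # dollars, vendor rate quoted in footnote 5
REKEY_COST_PER_PAGE = 2.00      # dollars, approximate figure from footnote 5
OCR_TO_REKEY_RATIO = 6          # "roughly six times", as stated in the text


def rekey_cost(char_count):
    """Cost in dollars of having char_count characters rekeyed at the quoted rate."""
    return char_count / 1000 * RATE_PER_THOUSAND_CHARS


# Page length implied by the two quoted rekeying figures (derived, not reported).
implied_chars_per_page = REKEY_COST_PER_PAGE / RATE_PER_THOUSAND_CHARS * 1000

# Per-page cost of the imaging-and-OCR path implied by the sixfold ratio
# (derived, not reported).
implied_ocr_cost_per_page = REKEY_COST_PER_PAGE * OCR_TO_REKEY_RATIO

print(round(implied_chars_per_page))
print(implied_ocr_cost_per_page)
print(rekey_cost(10_000))
```

On these assumptions a page holds a little over two thousand characters, and correcting OCR output would have to cost well under the implied per-page figure before scanning became competitive with rekeying.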
As CETL has grown and learned more about processing text, we have reduced labor costs. One way to reduce the cost of processing electronic text is to find additional ways to automate the process. We continue to explore new software and we write small programs to improve existing software. Another way of reducing labor costs is simply to do less text processing. At CETL, that translates into doing less markup. The level of markup needed for a particular text has caused some controversy among the text editors in CETL. Once text analysis has begun it is easy to identify document components that could be marked up. Scholars who mark up poetry, for example, often divide the language into its smallest components for analytical purposes. But even basic markup to identify gross document structure, to set up the links to other documents, and to record metadata is labor-intensive. In our earliest attempts to build cost accounting into our production process, it was clear that we had to reduce the labor costs of the markup process.
3.5 Imaging of Source Documents
Once a document is acquired, if the original is a paper document, it is prepared for imaging. Usually it is a photocopy as opposed to the original document that is imaged. CETL images documents at 400 dpi for optimal preservation of text information.
The image serves as a medium for preservation and authentication of a source document, but is unsearchable, requires indexing, and relies on other systems to manage its use. Because the scholarly researcher must have both the flexibility to go beyond the digital object to the meaning associated with the text and the certainty of textual accuracy provided by a digital image, it was determined in the early developmental days that DIANA would consist of both images and searchable text when the document source was in print format.
3.6 Creation of ASCII Text
When the document is not available in electronic form, it is sent off-site to be double-keyed. [5] The conversion company can also add basic SGML markup to the ASCII text. When the document is acquired in word processing format, it is converted to basic SGML form using software tools. FastTag from Avalanche/Interleaf has been used for this purpose but CETL is currently converting to tools built on Omnimark.
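Double keying generally means that two operators type the same source independently and the two transcriptions are compared, so that a discrepancy exposes a probable keying error. The article does not describe the vendor's comparison procedure; the sketch below only illustrates the general idea using Python's standard `difflib` module, and the function name `keying_discrepancies` and the sample text are assumptions.

```python
import difflib


def keying_discrepancies(first_pass, second_pass):
    """Return (line_number, first_version, second_version) for each run of lines
    where two independently keyed transcriptions disagree."""
    first_lines = first_pass.splitlines()
    second_lines = second_pass.splitlines()
    matcher = difflib.SequenceMatcher(a=first_lines, b=second_lines, autojunk=False)
    problems = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are 1-based and refer to the first transcription.
            problems.append((i1 + 1, first_lines[i1:i2], second_lines[j1:j2]))
    return problems


# Hypothetical transcriptions: the second typist keyed "bom" for "born".
first = (
    "Article 1\n"
    "All human beings are born free and equal in dignity and rights.\n"
    "Article 2"
)
second = (
    "Article 1\n"
    "All human beings are bom free and equal in dignity and rights.\n"
    "Article 2"
)

problems = keying_discrepancies(first, second)
for line_number, a_version, b_version in problems:
    print(line_number, a_version, b_version)
```

Only the flagged lines need to be checked against the paper or image copy, which is why the technique can reach high accuracy without proofreading every page.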
3.7 Document Markup
Once the document is in electronic form, CETL turns its focus toward adding more value to the markup inserted by the double-keyers or the software conversion tools. The markup added at this point usually requires an understanding of the text and its significance and structure. It is at this point, for example, that cross references to other parts of the document or other documents are added. In addition, there are any number of features specific to legal text, such as the highly structured organisation of a legal statute or the effective date of that legislation, that require unique identification. The TEI Guidelines for Text Encoding and Interchange are a markup developed by scholars, librarians, and those interested in computing for use with literature in the humanities. [6] CETL is currently working to extend the TEI Guidelines for use with legal materials.
Typically, text is marked up at the paragraph level. Quotes, underlined words, tables, and other features that fall within paragraphs are also marked up. Markup includes identification of the basic reference unit so that it will be possible to create hypertext links to it later on. If the document exists in several languages, as often happens with United Nations materials, markup is added to indicate parallel points in the various language versions. When needed, pagination of the original paper document is indicated in the electronic version. Where there have been hard copy source documents, links are made to the digital images that were created earlier in the process. This completes the creation of the archival electronic text, and all subsequent distribution of this text is done from the SGML document.
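The markup features just described (identified reference units, recorded page breaks, and pointers between parallel language versions) can be illustrated with a small sketch. It is not CETL's actual encoding: the samples are simplified, well-formed XML standing in for SGML, the attribute names `id`, `corresp`, and `n` and the element `pb` follow TEI conventions, and the identifiers and helper functions are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical English and French versions of the same instrument.
ENGLISH = """
<text lang="en">
  <pb n="1"/>
  <p id="en.art1" corresp="fr.art1">All human beings are born free and equal in dignity and rights.</p>
  <pb n="2"/>
  <p id="en.art3" corresp="fr.art3">Everyone has the right to life, liberty and security of person.</p>
</text>
"""

FRENCH = """
<text lang="fr">
  <p id="fr.art1" corresp="en.art1">Tous les êtres humains naissent libres et égaux en dignité et en droits.</p>
  <p id="fr.art3" corresp="en.art3">Tout individu a droit à la vie, à la liberté et à la sûreté de sa personne.</p>
</text>
"""


def index_paragraphs(xml_text):
    """Map each paragraph id to (page of the paper original, corresp target, text)."""
    root = ET.fromstring(xml_text)
    index = {}
    current_page = None
    for element in root.iter():  # document order
        if element.tag == "pb":
            current_page = element.get("n")
        elif element.tag == "p":
            index[element.get("id")] = (current_page, element.get("corresp"), element.text)
    return index


def parallel_passages(source_xml, target_xml):
    """Pair each source paragraph with the target paragraph it points to."""
    source = index_paragraphs(source_xml)
    target = index_paragraphs(target_xml)
    pairs = []
    for paragraph_id, (_page, corresp, text) in source.items():
        if corresp in target:
            pairs.append((paragraph_id, text, target[corresp][2]))
    return pairs


pairs = parallel_passages(ENGLISH, FRENCH)
for paragraph_id, english_text, french_text in pairs:
    print(paragraph_id, "|", english_text, "|", french_text)
```

The identifier on each reference unit is what later makes a hypertext link or a citation to that unit possible, and the page-break markers let a reader move from the searchable text back to the corresponding page image.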
As the virtual library grows, so, too, will the need to extract meaning from vast amounts of textual data. In building the DIANA database we are constructing a foundation for hundreds of millions of documents, not just for the relatively small number of documents now in the database. Based on our knowledge of the search and retrieval software existing today, SGML markup is an indispensable component of a research database.
3.8 Assigning the Metadata to the Document
There is a growing awareness within the information industry of the importance of assigning an appropriate amount of information to a particular document, or object, to define and index it during its electronic journey. One of the most critical roles that the librarians who work in CETL play is in the assignment of metadata, data describing data, to the document. CETL has automated much of this process, but the catalogers still have a bit of work in deciding how bibliographic data will be managed in the TEI header. CETL's catalogers are involved in building the structure of the DIANA database and in designing the electronic header that travels with all of the DIANA documents created by CETL. The MARC record can be incorporated into the header but the information in the MARC record alone does not substitute for the header. CETL is looking forward to the development of the SGML DTD for USMARC that is scheduled for completion in 1997.
The TEI header has four parts: 1) a file description, containing a full bibliographic description of the source document, where subject matter keywords chosen from standard thesauri can also be indicated (in the case of DIANA we use HURIDOCS and UNBIS [7] subject headings as well as Library of Congress Subject Headings); 2) an encoding description, describing encoding practices; 3) a text profile, where, for example, CETL explains how text ambiguities were handled and who did the work; and 4) a revision history, where any subsequent changes made to the text would be indicated.
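The four-part structure can be sketched as a skeleton. The element names `teiHeader`, `fileDesc`, `encodingDesc`, `profileDesc`, and `revisionDesc` are the TEI names for the four parts; everything else here is a simplification that has not been validated against the TEI DTD, and the function `build_header` and the sample values are assumptions rather than CETL's header design.

```python
import xml.etree.ElementTree as ET


def build_header(title, source_note, encoding_note, profile_note, change_note):
    """Build a simplified four-part TEI-style header (illustrative, not DTD-valid)."""
    header = ET.Element("teiHeader")

    # 1) File description: bibliographic description of the file and its source.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # 2) Encoding description: the encoding practices that were followed.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = encoding_note

    # 3) Text profile.
    profile_desc = ET.SubElement(header, "profileDesc")
    ET.SubElement(profile_desc, "p").text = profile_note

    # 4) Revision history: one entry per subsequent change to the text.
    revision_desc = ET.SubElement(header, "revisionDesc")
    ET.SubElement(revision_desc, "change").text = change_note

    return header


# Hypothetical values for illustration only.
header = build_header(
    title="Universal Declaration of Human Rights: electronic edition",
    source_note="Transcribed from the printed United Nations text.",
    encoding_note="Marked up at the paragraph level.",
    profile_note="Ambiguous readings were resolved against the page image.",
    change_note="Initial encoding.",
)
print(ET.tostring(header, encoding="unicode"))
```

Because the header travels inside the same file as the text, the bibliographic and editorial record survives wherever the document is copied or migrated.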
3.9 Electronic Text Distribution
Once the SGML document is created it can be delivered in three ways. The first method is through down-translation from SGML to a word-processing format. This gives the user a formatted document but the SGML added value is lost.
The second method of delivering SGML requires the receiver to use an SGML viewer, such as the Panorama viewer from SoftQuad. This viewer may be employed as a helper application by a Web browser when it encounters SGML. It downloads the text, the DTD, and the stylesheet to the client machine. This process may be cumbersome to many users, so other means of delivery must be available.
The final method of delivery is through conversion of the SGML markup to corresponding HTML markup. There are several ways to do this. CETL uses DynaWeb, an HTTP server plug-in from INSO Corp. The user needs only Web access and no additional applications to view the document, but, again, much of the SGML added value is lost.
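Why down-translation to HTML loses added value can be seen in a small sketch. This is not how DynaWeb works internally, which the article does not describe; it is a generic illustration using well-formed XML as a stand-in for SGML, and the tag mapping, the sample, and the function `to_html` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from descriptive source elements to presentational HTML tags.
TAG_MAP = {"text": "div", "head": "h2", "p": "p", "term": "em", "foreign": "em"}

SAMPLE = (
    '<text><head>Article 5</head>'
    '<p id="art5">No one shall be subjected to '
    '<term type="legal">torture</term>.</p></text>'
)


def to_html(element):
    """Recursively rebuild the tree with HTML tag names, dropping source attributes."""
    html = ET.Element(TAG_MAP.get(element.tag, "span"))
    html.text = element.text
    html.tail = element.tail
    for child in element:
        html.append(to_html(child))
    return html


html_output = ET.tostring(to_html(ET.fromstring(SAMPLE)), encoding="unicode")
print(html_output)
```

In the output the paragraph identifier and the fact that "torture" was encoded as a legal term of art are gone: the word is merely emphasised, and a technical term can no longer be told apart from a foreign phrase. That is the sense in which the user gets a readable page while the SGML added value is lost.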
In order to have access to the added value offered by SGML a new type of markup is being developed, Extensible Markup Language, XML, a simplified form of SGML that is compatible with its parent markup language. The goal of XML is to enable SGML to be served, received, and processed on the Web in the way that is now possible with HTML. When XML becomes available, CETL will deliver its data in XML format <http://www.textuality.com/sgml-erb/WD-xml-lang.html>.
4. Conclusion
In this brief overview of the work of the Center for Electronic Text in the Law, a glimpse of one of many initiatives coming from libraries around the world to build a virtual library has been offered. I have no realistic estimate of what it will cost to make this library a reality or how long it will take other than to say that it will cost trillions of dollars and take a long time to get the billions of documents of the world's research institutions alone converted to electronic text. America's National Archives houses 6 billion documents, and it would be safe to say that it will take decades to get the documents from this single institution into electronic form. Perhaps information specialists of the future will decide that such a retrospective endeavor is too ambitious.
Those who have limited experience in the production of electronic text are shocked at the expense of building an electronic library of research quality materials. They are also surprised that the costs do not lie primarily in the acquisition of hardware, software, or even in the cost of information itself. The majority of dollars are spent on personnel to understand the technology, acquire the information, and then process it. Despite the cost of new technologies, library budgets are gradually being reallocated to accommodate them as the concept of access to information, versus ownership of information, is promoted by librarians. If budgetary constraints are used as justification for not exploring new ways of managing information, librarians may find that they are less relevant to the future.
While almost all of my colleagues in law libraries are involved with the Internet to locate information and create home pages, a much smaller number are actively engaged in building research quality databases for the Internet. Justification of funding for a text center, such as CETL, within a law library has been requested by law professors, law deans, and law librarians alike. Setting a new agenda in an organisation requires months, or perhaps years, of groundwork with those who control the budget, the administration, and those who do the work, the librarians. Administrators, who are themselves under financial constraints, want to understand the cost-benefits of innovation. Text conversion projects are expensive and difficult to justify through conventional methods. Librarians who see the transition from hard copy to electronic delivery of information as an omen for the disintermediation between librarian and end-user may not eagerly facilitate the change. In most libraries it is the existing workforce who will implement change, so it is critical that they share a common vision for a new direction and possess the skills to achieve their goals.
While the librarian's part in building the virtual library is yet to be determined, early indicators are that the role will be significant. Librarians will not be the only players building the virtual library. Authors, publishers, computer scientists, and others will have a significant impact on its character. Librarians, who have several decades of experience in the creation and maintenance of online bibliographic databases, understand that the next logical progression is to link those existing databases to full-text. There is enthusiasm among librarians about the information highway for many reasons, including the fact that information can be made accessible to remote users, that multiple users can access the same information simultaneously, and that powerful search tools offering full-text indexing can be found. At the same time there is the gnawing realisation that information on the Net is not secure, well-preserved, well-organised, or necessarily authentic.
The journalist and writer, Bruce Sterling, offers one of the more pessimistic views of the digital revolution:
Computers swallow whatever they can touch, and everything they swallow is forced to become as unstable as they are. With the soaring and brutal progress of Moore's Law, computer systems have become a series of ever-faster, ever more complex, and even more elaborate coffins. [8]
Certainly it is possible to envision us all as the victims of the information revolution, drowning in data today without the certainty of having access to any of it tomorrow. But it is equally possible to envision us as the beneficiaries of the same revolution, having learned to manage information through the use of technology. Computers can enhance learning, giving us new ways of looking at information. Sophisticated retrieval software will simplify our quest for information by helping us evaluate search results from vast data repositories. However, lack of foresight on the part of the library profession, which has been charged by our society with the management of information and the preservation of our cultural heritage, would certainly drive a nail into the information coffin. But a look at the twentieth century library system reveals a highly structured and successfully managed network of information centers. As these networked entities begin to meld into the virtual library, librarians will develop and apply the appropriate management standards to the electronic medium to ensure the viability of yet another mode of information transfer, whatever its ultimate duration may be.
Footnotes
[1] Nick Finke is the Executive Director of the Center for Electronic Text in the Law at the University of Cincinnati College of Law Library (nick.finke@uc.edu). Greg MacGowan is the Associate Director. Other staff members include Joe Madlener, web designer; Cynthia Aninao, Janet Smith, and Akram Pari, electronic text catalogers; Don Blair, work flow manager; Bill Linder, UNIX consultant; and numerous, dedicated text editors who are graduate students at the University of Cincinnati.
Jeanette Yackle, Head of Reference for International and Foreign Law at Harvard Law School, Ewa Brantley, human rights scholar and activist, and Mariano Morales-lebron, Head of Reference at the University of Cincinnati College of Law Library, have had primary responsibility for selection of materials that have gone into Cincinnati's site.
Taylor Fitchett, Director of the University of Cincinnati College of Law Library, can be contacted at taylor.fitchett@uc.edu.
[2] Coleman J and Willis D (1997) SGML as a Framework for Digital Preservation and Access (Washington, D.C.: The Commission on Preservation and Access).
[3] DIANA is named in honor of Diana Vincent-Daviss, the late library director and human rights bibliographer from Yale Law School.
[4] The institutions currently developing the DIANA database are The Orville B. Schell Center for International Human Rights at Yale Law School, The University of Minnesota Human Rights Center, The University of Toronto Law Library, Harvard Law School Library, the Urban Morgan Institute for Human Rights, and The University of Cincinnati College of Law Library.
[5] The company, Input Center, 320 N. Michigan, Suite 404, Chicago, IL 60601, charges 85 cents per thousand characters (approximately $2 per page) which includes the markup.
[6] The TEI Guidelines for encoding and interchanging electronic text were developed under the direction of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The latest version of the TEI Guidelines (TEI P3) was published in 1994.
[7] HURIDOCS is a network of non-governmental organizations and individuals concerned with human rights documentation who are striving to build one set of information standards, HURIDOCS Standard Formats. Their guidelines are based on the Anglo-American Cataloging Rules and include features such as geographical terms and codes, human rights indexing terminology, and guidelines for recording the names of persons.
The UNBIS Thesaurus, prepared by the Dag Hammarskjold Library, contains the terminology used for subject access to the United Nations Bibliographic Information System.
[8] Sterling B (1997) 'The Digital Revolution in Retrospect' 40 no.2 Communications of the ACM 79.