The Automatic Linking of Legal Citations
Dr Justin Needle
Many organisations, be they publishers or other information-rich enterprises, are in possession of large volumes of data to which they are looking to add value in various ways. The motivation behind this might be purely commercial (a desire to publish revenue-generating material, for example) or, if the data is not intended for external consumption, an attempt to facilitate and improve the flow of information within an organisation.
Either way, as the volume of digital content produced by and available to organisations on the Internet, CD-ROM, corporate intranets and subscription services continues to increase at an astonishing rate, the problem of maintaining data in a manageable and, more to the point, usable form becomes ever more acute. More efficient ways of locating relevant information and negotiating one's way through it are urgently needed.
It is well known that the ability to link related documents together represents one important means of making information far easier to retrieve and navigate. Hypertext is, of course, nothing new and is familiar to all users of the World Wide Web, but the large-scale creation of hypertext links has required a substantial investment of time and effort on the part of electronic publishers.
This article describes how the legal publisher Context Limited has developed a range of advanced text processing techniques which overcome such obstacles and represent a significant advance in legal publishing and content management.
2. The Problem
Context publishes the JUSTIS range of over twenty database products, distributed on CD-ROM and via its web site JUSTIS.com. Its broad user base is drawn from government, industry and commerce, the legal and financial professions and the academic community.
The content for the JUSTIS range is drawn from many unrelated sources, including official databases from the European Union and electronic versions of print products from commercial law publishers. Most of these independent sources contain numerous references to documents in other database products published by Context.
Appeal Court rulings, for example, refer extensively to earlier court rulings on related cases, such as:
'. . . with all respect to the view of Stephenson L.J. expressed in Blackshaw v. Lord  Q.B. 1, 28, I would not agree that . . .'
Given the requirement that users be able to navigate easily from case to case, such citations need to be converted into hypertext links, so that by simply clicking on a link a user will be taken directly to the full text of the cited case.
The conventional method of creating hypertext links between documents involves manually editing each document and inserting fixed links at the database production stage. Unfortunately, there is a major problem. The JUSTIS databases contain millions of citations which, in order to achieve the required functionality, need to be converted into millions of corresponding hypertext links. The manual creation of links on this scale is not really an option since link creation is a laborious process, requiring the services of skilled, and expensive, editors. Even if an editor is able to identify and process ten links per hour, which is optimistic, then the human effort required will be approximately a hundred thousand hours per million links created. The situation is made worse by the fact that, as new documents or products are added to JUSTIS, the existing corpus of documents would need to be manually reprocessed in order to insert additional links to the new items, an exercise which would in practice be both highly impracticable and prohibitively expensive. Such tasks are, indeed, so daunting that they have rarely if ever been attempted. Clearly, there ought to be a better way of tackling the problem.
3. The Solution
Fortunately, techniques have been developed which enable links to be created far more quickly and cost-effectively than would ever be possible using conventional methods. These techniques function by treating citations as types of meaningful patterns occurring within the text of documents. Such patterns have a certain regularity of structure. For example, a legal citation will typically contain some combination of case name, year of publication, series name, volume number and page number. Intelligent pattern-matching may be used to identify such references, even where they occur in a variety of formats, i.e. where the same document may be referred to in a number of different ways.
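As a rough illustration of treating citations as structured patterns, the following Python sketch recognises two common layouts of the same reference and reduces them to one normalised record. The regular expression, the field names and the sample citations are all invented for this example; they are assumptions, not a description of the patterns actually used in JUSTIS.

```python
import re

# Illustrative pattern only: "[year] volume series page" or "(year) volume series page".
CITATION = re.compile(
    r"""
    (?:\[(?P<year_sq>\d{4})\]|\((?P<year_rd>\d{4})\))  # [1990] or (1990)
    \s+
    (?:(?P<volume>\d+)\s+)?                            # optional volume number
    (?P<series>[A-Z](?:\.?[A-Z])*\.?)                  # A.C., AC, W.L.R., WLR
    \s+
    (?P<page>\d+)                                      # first page
    """,
    re.VERBOSE,
)

def find_citations(text):
    """Return one normalised record per citation found in text."""
    records = []
    for m in CITATION.finditer(text):
        records.append({
            "year": int(m.group("year_sq") or m.group("year_rd")),
            "volume": int(m.group("volume")) if m.group("volume") else None,
            "series": m.group("series").replace(".", ""),  # "A.C." and "AC" agree
            "page": int(m.group("page")),
        })
    return records

# Two hypothetical citations to the same (invented) case, written differently.
sample = "Compare Smith v. Jones [1990] 2 A.C. 100 with Smith v Jones (1990) 2 AC 100."
found = find_citations(sample)
```

Both references yield the same record, which is what allows differently formatted citations to be resolved to a single target document.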
Unlike a conventional hypertext system or free-text search facility, software based on this type of technology actually interprets the citations it finds within documents. Once a document has been retrieved, the text is automatically scanned for relevant citations. For each citation recognised, the software uses the resulting data, together with any surrounding contextual information which might be available, in order to determine the location of the full text of the cited document and convert the citation into a link to it. The process is entirely automated and software driven.
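The interpret-then-link step might be sketched as follows. This is a minimal illustration under stated assumptions: the lookup table `DOCUMENT_INDEX`, its URL and the sample case are hypothetical, and a production system would consult a real document database rather than an in-memory dictionary.

```python
import html
import re

# Hypothetical index: (year, volume, series, first page) -> location of the full text.
DOCUMENT_INDEX = {
    (1990, 2, "AC", 100): "/cases/r-b-v-jones",
}

CITATION = re.compile(r"\[(\d{4})\]\s+(?:(\d+)\s+)?([A-Z](?:\.?[A-Z])*\.?)\s+(\d+)")

def link_citations(text, index=DOCUMENT_INDEX):
    """Return HTML in which every citation held in the index becomes a link."""
    out = []
    pos = 0
    for m in CITATION.finditer(text):
        year, volume, series, page = m.groups()
        key = (int(year), int(volume) if volume else None,
               series.replace(".", ""), int(page))
        out.append(html.escape(text[pos:m.start()]))
        cited = html.escape(m.group(0))
        url = index.get(key)
        if url is None:
            out.append(cited)  # recognised, but the target is not held: leave as text
        else:
            out.append('<a href="{}">{}</a>'.format(html.escape(url), cited))
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)

linked = link_citations(
    "In R & B Ltd v. Jones [1990] 2 A.C. 100 and [1991] 1 W.L.R. 5 the point arose."
)
```

The second citation is recognised but left as plain text because the sketch's index holds no document for it.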
The software is also capable of recognising and linking citations even in cases where the bibliographic information contained within a reference is incomplete. For example, if all we are given in the text is '. . . expressed in Blackshaw v. Lord , I would not agree that . . .', the software will recognise this incomplete case reference as a reference, and knows further that the full publication details of the case are likely to have been given earlier in the document. The software will then search backwards through the document until it finds the information required to complete the link, then return to the incomplete citation and create the link in the appropriate place.
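One simple way to approximate this backward search is shown below. The sketch assumes that an incomplete reference can be recognised by its case name alone and completed from the nearest earlier mention of the same case that carries full details; the patterns, the `canonical` helper and the sample text are invented for illustration.

```python
import re

NAME = r"(?P<name>[A-Z][a-z]+ v\.? [A-Z][a-z]+)"
DETAILS = r"(?P<details>\[\d{4}\]\s+(?:\d+\s+)?[A-Z](?:\.?[A-Z])*\.?\s+\d+)"
# A mention is a case name, optionally followed by its publication details.
MENTION = re.compile(NAME + r"(?:\s+" + DETAILS + r")?")

def canonical(name):
    """Treat 'Smith v. Jones' and 'Smith v Jones' as the same case."""
    return name.replace(" v. ", " v ")

def resolve_references(text):
    """Pair every case-name mention with details, searching backwards if needed."""
    mentions = list(MENTION.finditer(text))
    resolved = []
    for i, m in enumerate(mentions):
        details = m.group("details")
        if details is None:
            # Incomplete reference: walk back through earlier mentions of this case.
            for earlier in reversed(mentions[:i]):
                same_case = canonical(earlier.group("name")) == canonical(m.group("name"))
                if same_case and earlier.group("details"):
                    details = earlier.group("details")
                    break
        resolved.append((m.group("name"), details))
    return resolved

text = ("In Smith v. Jones [1990] 2 A.C. 100 the court set out the test. "
        "The view expressed in Smith v Jones, however, was later doubted, "
        "and Brown v. Green was not cited.")
resolved = resolve_references(text)
```

The second mention inherits the details of the first, while the third stays unresolved because no earlier full citation exists for it.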
4. Approaches to Automatic Citation Linking
There are a number of ways in which these techniques may be implemented. The most obvious and straightforward is to insert links directly into the text of documents (in HTML or Acrobat PDF format, for example) at the database production stage. The advantage of this approach, which is known as offline or hard linking, is that it is possible to produce formatted, browser-ready documents without the need for any further programming or bespoke document delivery software. For example, one might start with plain text documents and end up with Internet-ready HTML pages in which the citations have been converted into links.
Context has utilised this approach within JUSTIS. Its database of UK Case Law, for example, contains numerous citations to related cases, as well as to EC directives and other legal documents published in separate JUSTIS products. Offline linking is used to identify case references at the database production stage, thus achieving very substantial savings in production costs.
But there is a far more powerful and flexible approach to link creation, whereby links are created dynamically, or 'on the fly', as and when they are needed. The advantage of this soft linking approach is that, in cases where information needs to be frequently updated, as is typically the case with legal content, it is not necessary to reprocess an entire corpus of documents after each modification. Even in cases where the contents of documents do not change on a regular basis, the scope and efficiency of the linking can be continuously improved without the need for further processing. In this way, the links are continually kept up to date.
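The practical difference can be seen in a small sketch in which links are computed each time a document is retrieved. The class, its attributes and the sample reference are hypothetical, and HTML escaping is omitted for brevity; the point is only that adding a new target to the index changes the rendered output without the stored text being touched.

```python
import re

CITATION = re.compile(r"\[(\d{4})\]\s+([A-Z]+)\s+(\d+)")

class SoftLinker:
    def __init__(self):
        self.index = {}      # (year, series, page) -> URL; hypothetical lookup table
        self.documents = {}  # document id -> stored plain text, never rewritten

    def render(self, doc_id):
        """Create links at retrieval time from the current state of the index."""
        def to_link(m):
            key = (int(m.group(1)), m.group(2), int(m.group(3)))
            url = self.index.get(key)
            if url is None:
                return m.group(0)
            return '<a href="{}">{}</a>'.format(url, m.group(0))
        return CITATION.sub(to_link, self.documents[doc_id])

linker = SoftLinker()
linker.documents["d1"] = "See [1999] XYZ 12."
before = linker.render("d1")                        # target not yet published
linker.index[(1999, "XYZ", 12)] = "/docs/xyz-1999-12"
after = linker.render("d1")                         # same stored text, new link
```

With hard linking, the equivalent change would require the stored document itself to be regenerated.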
Soft linking may be implemented in a number of ways. Context has so far developed two soft linking applications:
1. Multi-Source Linking. Multi-Source Linking (MSL) allows users to create links 'on the fly' from text appearing anywhere in virtually any Windows application, such as a Web browser, word-processing package or email application. If a user sees a citation in the text which is of interest, all that is required is to highlight that citation using the mouse and click on a button (which is located in an 'always on top' window residing on the user's desktop); the full text of the cited document is then, where available, retrieved. Context has incorporated MSL into JUSTIS. This implementation of MSL, known as Floating J-Link, is delivered as a small, downloadable program which may be run at system start-up and enables end users to link from any document reference to the appropriate record within JUSTIS.
2. Web Wide Linking. An alternative approach, known as Web Wide Linking (WWL), operates in a Web-only environment. Once a web page has been retrieved, the text is automatically scanned for linkable citations, which are converted into hypertext links before the page is presented to the user. It is also possible to customise the link creation process to the preferences of individual users, so that only selected reference types are converted into links. Context is implementing a version of WWL for JUSTIS, to be known as Automatic J-Link. Due for release later in 2000, Automatic J-Link will be delivered to users as a web browser plug-in which creates a button on the browser's toolbar. Automatic J-Link recognises references to JUSTIS documents appearing on any web page which the user is currently viewing. When the user clicks on the toolbar button, the citations are converted into links which, when clicked, take subscribers to the full text of cited documents on JUSTIS.com. Automatic J-Link works on any document retrieved from the Internet and may also be tailored to work within corporate intranets.
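Restricting link creation to the reference types a user has selected could be handled along the following lines. The type names, the patterns and the sample page text are assumptions made for this sketch and do not reflect the actual J-Link configuration.

```python
import re

# One illustrative pattern per reference type.
PATTERNS = {
    "case": re.compile(r"\[\d{4}\]\s+(?:\d+\s+)?[A-Z](?:\.?[A-Z])*\.?\s+\d+"),
    "directive": re.compile(r"Directive\s+\d{2,4}/\d+/EE?C"),
}

def linkable_citations(page_text, enabled_types):
    """Return (position, type, text) for citations of the enabled types only."""
    hits = []
    for ref_type, pattern in PATTERNS.items():
        if ref_type not in enabled_types:
            continue  # the user has switched this reference type off
        for m in pattern.finditer(page_text):
            hits.append((m.start(), ref_type, m.group(0)))
    return sorted(hits)

page = "Directive 95/46/EC was considered in Smith v. Jones [1990] 2 A.C. 100."
directives_only = linkable_citations(page, {"directive"})
everything = linkable_citations(page, {"case", "directive"})
```

Here `directives_only` contains a single hit, while `everything` contains both the directive and the case citation in page order.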
The ability of users to navigate easily and seamlessly between one JUSTIS CD-ROM and another, and across the entire JUSTIS.com web site, in effect creates a single, fully cross-referenced virtual database. Automatically linking large numbers of documents has added substantial value to Context's product range, at a fraction of the cost of conventional, manual methods.
5. The Syntalex Engine
At the heart of Context's linking technology is Syntalex, an intelligent, generic and highly flexible Engine which automatically identifies references within text. The Syntalex Reference Recognition Engine can be applied in any domain where value may be added to text through the application of rules based on text-pattern recognition. In addition to automatic linking, the Syntalex Engine has been used to create powerful text processing applications across a diverse range of areas, including machine-aided indexing (matching the text of documents against a lexicon or thesaurus of key words and phrases in order to assist editors in the indexing, categorisation and retrieval of documents); automated document markup and conversion (the identification of specific elements within raw text and the automatic insertion of appropriate markup and formatting tags to produce structured content); and 'web farming' (the automated and timely extraction of specific data from the Internet or other information sources).
The generality and flexibility of the Syntalex Engine have been achieved through a rule-based design, and the behaviour of the Engine is controlled and determined by the use of rule files. A rule file contains a number of rules, each of which comprises two distinct components:
The 'Matching' Component: describes and defines a specific type of text pattern, such as a legal citation or other type of reference.
The 'Action' Component: specifies a type of action that is to be carried out once the conditions specified by the Matching Component have been satisfied. Examples include creating a hypertext link, outputting information to a database, retrieving specific information from an online resource and sending an email alert.
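The pairing of a matching component with an action component can be pictured with a short sketch. The article does not describe the actual Syntalex rule format, so the `Rule` class, the two example rules and the `/lookup` URL below are purely hypothetical stand-ins for the idea.

```python
import re
from dataclasses import dataclass
from typing import Callable
from urllib.parse import quote_plus

@dataclass
class Rule:
    name: str
    matching: re.Pattern                # the 'Matching' component
    action: Callable[[re.Match], str]   # the 'Action' component

index_entries = []  # stands in for "outputting information to a database"

def make_link(m):
    """Action: replace the matched citation with a hypertext link."""
    return '<a href="/lookup?cite={}">{}</a>'.format(quote_plus(m.group(0)), m.group(0))

def record_entry(m):
    """Action: record the match elsewhere and leave the text unchanged."""
    index_entries.append(m.group(0))
    return m.group(0)

RULES = [
    Rule("case-link",
         re.compile(r"\[\d{4}\]\s+(?:\d+\s+)?[A-Z](?:\.?[A-Z])*\.?\s+\d+"),
         make_link),
    Rule("directive-index",
         re.compile(r"Directive\s+\d{2,4}/\d+/EE?C"),
         record_entry),
]

def apply_rules(text, rules=RULES):
    """Run each rule in turn: wherever its pattern matches, carry out its action."""
    for rule in rules:
        text = rule.matching.sub(rule.action, text)
    return text

output = apply_rules("Directive 95/46/EC was applied in [1990] 2 A.C. 100.")
```

Keeping the pattern and the action separate means the same pattern can drive different behaviour simply by swapping the action.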
A solution incorporating the Syntalex Engine will typically comprise the following elements:
The core Syntalex Reference Recognition Engine;
One or more rule files, programmed to recognise specific types of text pattern and perform specific types of actions, which are used to guide and control the Engine's pattern matching and text processing functions;
A software application which applies the Syntalex Engine, rule files and required data sources to a body of documents or other type of content.
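How these three elements fit together might look like the following sketch. It assumes, purely for illustration, that a rule file is a JSON list of pattern and action names; the real rule file syntax is not described here, and the `Engine` class, the `ACTIONS` table and the sample documents are invented.

```python
import json
import re

# Element 2: a rule file (hypothetical JSON format, held in a string for the example).
RULE_FILE = r'''
[
  {"name": "case", "pattern": "\\[\\d{4}\\] [A-Z]+ \\d+", "action": "link"},
  {"name": "directive", "pattern": "Directive \\d+/\\d+/EC", "action": "markup"}
]
'''

# Actions a rule file may name; both are illustrative.
ACTIONS = {
    "link": lambda rule, m: '<a href="/find/{}">{}</a>'.format(rule["name"], m.group(0)),
    "markup": lambda rule, m: '<ref type="{}">{}</ref>'.format(rule["name"], m.group(0)),
}

# Element 1: the engine, which knows nothing about any particular reference type.
class Engine:
    def __init__(self, rule_file_text):
        self.rules = [
            (spec, re.compile(spec["pattern"]), ACTIONS[spec["action"]])
            for spec in json.loads(rule_file_text)
        ]

    def process(self, text):
        for spec, pattern, action in self.rules:
            text = pattern.sub(lambda m, s=spec, a=action: a(s, m), text)
        return text

# Element 3: an application that applies engine and rules to a body of documents.
def run_application(engine, documents):
    return {doc_id: engine.process(body) for doc_id, body in documents.items()}

results = run_application(
    Engine(RULE_FILE),
    {"d1": "See [1990] AC 100.", "d2": "Directive 95/46/EC applies."},
)
```

Because all domain knowledge lives in the rule file, the same engine and application could be pointed at a different kind of content by supplying different rules.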
The highly flexible and modular design of the Engine means that it can be readily integrated within existing content management systems, information products, internet sites, corporate intranets and Web browsers, and is able to process content in all standard electronic formats, including plain text, HTML, XML and Acrobat PDF.
The implications of applying the technology described here to legal content are significant. Not only does it eliminate the need for the laborious manual tagging of data and creation of hypertext links but, more importantly, it also increases the value and usability of legal, official and related information by enabling users to navigate instantly and seamlessly across diverse content resources.
The technology therefore overcomes a number of seemingly insurmountable obstacles, both technical and commercial, which have previously stood in the way of attempts by content producers and managers to improve the usability of their electronic information.
Context Limited: http://www.context.co.uk/
This is a Commentary published on 31 October 2000.
Citation: Needle J, 'The Automatic Linking of Legal Citations', Commentary 2000 (3) The Journal of Information, Law and Technology (JILT). <http://elj.warwick.ac.uk/jilt/00-3/needle.html>. New citation as at 1/1/04: <http://www2.warwick.ac.uk/fac/soc/law/elj/jilt/needle/>