JILT 1997 (2) - AustLII Paper 3

3. Indexing Law on the Internet

3.1	The problems of finding law
3.2	AustLII's approach
3.3	AustLII's Links
	3.3.1	AustLII's Web Indices
	3.3.2	Operation of the Links Indices
	3.3.3	AustLII index software
3.4	The targeted web spider
	3.4.1	Wallace the Gromit Harness
	3.4.2	Impact on Other Sites
	3.4.3	Mirror sites on AustLII
	3.4.4	Checking for bad links
3.5	Project DIAL
3.6	Targeted web spider - AustLII's future directions

3.1 The Problems of Finding Law on the Internet

There are essentially only two types of tools which help users find legal materials on the Internet:

•'Intellectual' indexes where individual web pages are classified by hand according to various classificatory schemes. Usually, such indices only provide the title, URL and perhaps a brief description of each site indexed. Yahoo! is a well known general example.

•'Robot' indexes where a program traverses the web, downloading every page it encounters, so that every word on every page can be indexed by a remotely located search engine. When the search engine displays a URL as a result of a search, that URL is to the original site, not to a mirror on the remote site. Alta Vista is perhaps the best known general example. The advantage is, of course, that it is possible to search for every word indexed, at least using Boolean operators.

Viewed from the perspective of an Australian user of Internet legal materials, finding Australian legal information on the Internet is difficult, for at least the following reasons:

•As the quantity of Australian legal material on the Internet grows, it is difficult to maintain intellectual indexes, at least with any depth of indexing of significant sites. The best that can be hoped for is that sites with significant legal materials are identified.

•There are no satisfactory 'Australian only' robot indexing sites providing both extensive coverage and a useful search engine. To use Alta Vista or other Internet-wide search engines to limit searches to Australian law is not easy, as discussed below.

•Many sites containing valuable legal information do not have search engines at all, so searching at word level is not possible. Users are also confused by multiple search engines.

So, in Australia it may be possible to find most useful sites of legal materials, but it is often difficult to know what is on them. If we generalise the problem to that of finding Internet legal information world-wide, the problems are variations on the Australian situation:

•While there are many multi-country intellectual indices to law on the Internet, none are even remotely comprehensive, and many are US-oriented with a slight international gloss. Some very good indices do exist for particular countries such as Canada, and for some subject matter areas, but they are often difficult to find from the multi-country indices. It is therefore difficult to find a good place to start.

•There are very good Internet-wide robot indexes, such as Alta Vista, but they are not as comprehensive as people often assume. For example, Alta Vista apparently only indexes about 600 pages of even the largest web site [21]. Furthermore, well-behaved robots adhere to the robot exclusion standard, by which web servers tell robots which pages they may not index on a site. Because of the effects of some robots on server performance, and for other reasons, many servers exclude robots. Such factors lead to estimates that even the largest Internet-wide search engines only index about 20% of the estimated 150 million web pages.

•It is difficult to make searches precise enough to find only legal materials using Internet-wide robot indexes, because they index predominantly non-legal material. It is usually necessary to try to impose some ad hoc search limitation (in addition to the real search terms) such as 'law or legislation or code or court' or some such, to try to stem the flood of irrelevant information (or more likely, to fool the relevance ranking into putting legally oriented material first).

•It is also difficult for most users to limit searches to materials concerning laws of particular countries [22], and failure to do so will usually result in the search being flooded with material from North America and other 'content rich' parts of the Internet.

•When you do find a site containing valuable legal information it will often not have a search engine at all, so searching at word level is not possible. Users are also confused by multiple search engines.
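The ad hoc search limitation described in the list above can be illustrated with a short Python sketch. It is only an illustration: the function name `with_legal_filter` is hypothetical, and the `AND`/`OR` notation is generic Boolean syntax rather than the query language of Alta Vista or any other particular search engine.

```python
# Terms quoted in the text as a typical ad hoc "legal material" limitation.
LEGAL_FILTER = ("law", "legislation", "code", "court")


def with_legal_filter(query, filter_terms=LEGAL_FILTER):
    """Combine the real search terms with an ad hoc legal limitation.

    The result uses generic Boolean notation (an assumption), not the
    syntax of any specific search engine.
    """
    limitation = " OR ".join(filter_terms)
    return f"({query}) AND ({limitation})"


print(with_legal_filter("native title"))
```

The weakness of this approach is visible in the sketch itself: the limitation is a guess about vocabulary, not a statement about which sites actually contain law.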

So the problems of finding legal materials world-wide are that it is both difficult to find which useful sites exist for a particular country or subject, and also difficult to find what is on such sites as are known.

3.2 AustLII's approach - A robot targeted by an intellectual index

Our approach to solving these problems rests on these propositions:

•Robot indexing of remote law sites, and a sufficiently powerful search engine, are necessary;

•Searching robot indexed sites will work much better if (i) only law sites are indexed (to remove non-legal 'noise'); and (ii) such sites are indexed comprehensively;

•Significant law sites which normally exclude robots may allow a law-oriented robot to index them, by request. The number of requests may be manageable.

•A comprehensive intellectual index is needed to identify the law sites worth indexing, and therefore to 'target' the robot.

AustLII has a suitable search engine (SINO), its own Internet indexing software (Feathers) which can be used to 'target' a robot, and a sufficiently comprehensive index of law on the Internet, at least for Australian law (Australian Links). To these we have added a robot (or 'web spider' as we prefer to call it) called Gromit, and a 'harness' called Wallace, which directs the spider by using an intellectual index. The targeted web spider will soon play a significant role in AustLII's future developments. AustLII's research on Internet law indexing is supported by an Australian Research Council small grant for 1997.

The rest of this paper describes the components of AustLII's targeted web spider, and indicates some of the roles it may play.

3.3 AustLII's Links - Australian and World web indices to law

3.3.1 History of AustLII's web indices

AustLII was launched in July 1995 with an Index to Australian Law on the Net, a conventional hypertext index based around a source/author index approach. The index was maintained periodically by Graham Greenleaf until it reached about 500 entries a year later, at which point the maintenance of 'hand-tooled' web pages, lack of search capacity, and lack of a subject index became problems which had to be addressed.

Geoffrey King wrote the Chain indexing software for a new user interface to the Links indices, for hierarchical browsing, for editing and maintenance of index entries, and for an interface to the SINO search engine. We then settled new source and subject index categories, transferred all data from the old index into the new one, and added symbolic links to make the whole structure work. The new Links indices were launched in October 1996, and 'Australian Links' was the runner-up in the Australian Society of Indexers inaugural web indexing awards in 1996. Index entries had grown to about 1,500 by mid-1997, of which over 1,000 relate to Australian law sites. At this point the Chain indexing software also required redevelopment to satisfy new demands.

3.3.2 Operation of the Links indices

The Australian Links index can be used in three principal ways:

•as a Source index which categorises the sites according to their source or 'author';
•as a Subject index, which categorises the same sites according to over 50 heads of legal subject matter; or
•by Searching the index, from a search window at the top of each page, which allows Boolean and proximity searching (using AustLII's SINO) over both the index categories and index entries. Searches may be over the whole index, or limited to those sub-categories lower in the index tree.

Users may submit links to be added to the index, but they are edited and approved by the index editors before they are added.

The following example shows the 'Administrative law' subject index page.

http://www.austlii.edu.au/links/Australia/Subjects/Administrative_Law/index.html

3.3.3 The new AustLII index software - Feathers

The indexing software has now been rewritten, with the new software ('Feathers') and interface to be released in July 1997. It will result in major changes in the way that the links are maintained, and in the editing facilities available to those who maintain it. It will allow considerable customisation of the appearance of index pages, so that they can appear in a consistent style with text collections and other resources, which will be valuable for our special projects and teaching resources. Another major aspect of the rewrite is to allow interaction between Feathers and the targeted web spider discussed below.

3.4 The Targeted Web Spider - Gromit (and Wallace!)

Gromit is a specialist web robot. It targets selected legal web sites, namely a subset of the URLs contained in AustLII's Links Internet indices, selected for their high value legal content. Gromit Web Robot (Gromit) is a single program that recursively downloads all text files on a site for indexing by AustLII's SINO Search Engine.

We call Gromit a Targeted Web Spider because it is not designed to traverse the Web generally: its downloading is limited to the site identified by the URL given when it is invoked. For example, if Gromit is invoked to download the URL http://actag.canberra.edu.au/actag/ (ie the ACT Lawnet site), any linked pages that fall below the original URL (ie lower down in the file hierarchy on the same server) will be downloaded. Linked pages outside that scope are ignored. The Gromit robot is not allowed to wander 'off site'.
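The scope rule can be sketched in a few lines. Gromit itself is written in Perl 5; the following Python is only a minimal illustration of the rule as described here, and the function names `scope_prefix` and `in_scope`, as well as the sub-paths used in the usage note, are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def scope_prefix(start_url):
    """Return (scheme, host, directory) identifying the spider's scope."""
    parts = urlsplit(start_url)
    # Everything up to and including the last "/" of the starting path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return parts.scheme, parts.netloc.lower(), directory


def in_scope(start_url, page_url, href):
    """True if the link href, found on page_url, falls below start_url."""
    scheme, host, directory = scope_prefix(start_url)
    target = urlsplit(urljoin(page_url, href))  # resolve relative links
    return (
        target.scheme == scheme
        and target.netloc.lower() == host
        and target.path.startswith(directory)
    )
```

With the starting URL http://actag.canberra.edu.au/actag/, a relative link such as `acts/index.html` is in scope, while `../other/page.html` or a link to another host is not.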

Normal operation for remote indexing purposes (as opposed to mirroring) is in text only mode, so image links will also be ignored, as will any links that do not appear to be of the MIME type text/html or text/plain.
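A text-only filter of this kind might look as follows. This is a sketch in Python, not Gromit's own Perl code; it assumes the decision is made from a Content-Type header value, and the function name `is_indexable` is hypothetical.

```python
# The two media types the text says are kept in text-only mode.
TEXT_TYPES = {"text/html", "text/plain"}


def is_indexable(content_type):
    """True if a Content-Type header value names text/html or text/plain."""
    if not content_type:
        return False  # no type information: treat as not indexable
    # Drop parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in TEXT_TYPES
```

Under this sketch `is_indexable("text/html; charset=ISO-8859-1")` is true and `is_indexable("image/gif")` is false.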

Gromit maintains a local cache of downloaded documents, so that they can be indexed by AustLII's SINO Search Engine. The cached documents are not available for browsing or downloading via AustLII's servers - users must go to the original host in order to browse or download.

3.4.1 Wallace the Gromit harness

Gromit is not intended to be used directly by a human operator. Typically, it runs under the control of Wallace, a control script that fires off Gromit processes over blocks of URLs. AustLII's new software for the Links indices, Feathers, will invoke Gromit processes in relation to those sites selected by the editors of the indices. A separate version of Gromit to access protected databases on remote servers, with permission, is also available.

Wallace is a harness program for Gromit. Wallace instructs Gromit as to which sites it should download, and monitors its progress. Wallace runs a number of spider processes at any one time, but limits the maximum number of spiders to a preset limit. When one spider finishes, another is started automatically to download a different site. Wallace reads the list of links to download from a remote mSQL database using the Perl DBD and DBI modules. The database is expected to be in the format maintained by the Feathers links system.

Wallace first retrieves all the URLs in the database that are marked for indexing or mirroring. It then sorts the URLs by host name. URLs are grouped into host bands (that is, groups of URLs that all contain the same host name) and these bands are passed as URL lists to the web spider (Gromit) for downloading. Wallace runs its spiders concurrently. There may be a web spider running for each host band at the same time, up to a maximum of 10. The user can modify the maximum number of spider processes. As one spider completes, another is started, until all host bands have been downloaded.
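The banding and scheduling described above can be sketched as follows. This is an illustrative Python sketch, not Wallace's Perl code: it uses a thread pool where Wallace starts separate spider processes, it takes the URL list as an argument rather than reading it from the mSQL database, and the names `host_bands`, `run_spiders` and `spider` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

MAX_SPIDERS = 10  # the default maximum mentioned in the text


def host_bands(urls):
    """Sort URLs by host name and group them into one band per host."""
    bands = defaultdict(list)
    for url in sorted(urls, key=lambda u: urlsplit(u).netloc.lower()):
        bands[urlsplit(url).netloc.lower()].append(url)
    return dict(bands)


def run_spiders(urls, spider, max_spiders=MAX_SPIDERS):
    """Run spider(band) once per host band, at most max_spiders at a time.

    Each band is handled by a single worker, so no host is visited by two
    workers at once. As one worker finishes, the pool starts the next band.
    """
    bands = host_bands(urls)
    with ThreadPoolExecutor(max_workers=max_spiders) as pool:
        results = pool.map(spider, bands.values())
        return dict(zip(bands.keys(), results))
```

Here `spider` stands for whatever downloads one band; passing `len`, for instance, simply reports how many URLs each band contains.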

3.4.2 Impact on Other Sites

Gromit is a relatively unobtrusive robot, designed to have minimal impact on the sites it visits. The robot, designed and implemented by AustLII staff, has been written in Perl 5, and uses the LWP library. In particular, the LWP::RobotUA object is used as the basis for Gromit. That module, together with other measures taken in the program, minimises impact on web performance because:

•It obeys the Robots Exclusion Protocol so as to not visit areas where robots are not welcome. Specifically, it obeys directives in the robots.txt file in the root directory of servers (see Robots Exclusion at The Web Robots Pages).

•No one site is accessed twice by the robot within a 2 minute period.

•The robot caches downloaded documents for later indexing, and will issue a HEAD request for a page before attempting to download fresh versions of already cached pages. On those web sites that support such mechanisms, Gromit will take advantage of the If-Modified-Since and Last-Modified HTTP headers, reducing server load for those machines.

•A notorious problem with web spiders is that they can saturate a remote site with requests, slowing down the remote server and denying access to other web users. Because URLs are grouped into host bands, and Gromit processes the URLs in a band consecutively, no one site is accessed by more than one Gromit process at the same time.
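Three of the measures listed above (obeying robots.txt, spacing out visits to a host, and conditional requests) can be sketched with the Python standard library. This is not Gromit's implementation, which relies on Perl's LWP::RobotUA; the user agent string, the `HostTimer` class and the function names are hypothetical, and the clock is passed in explicitly so the example runs without network access or real waiting.

```python
from email.utils import formatdate
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"   # hypothetical robot name
MIN_DELAY_SECONDS = 120        # the 2 minute period mentioned in the text


def load_rules(robots_txt_lines):
    """Parse the lines of a site's robots.txt into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


class HostTimer:
    """Tracks the last visit to each host to enforce a minimum delay."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS):
        self.min_delay = min_delay
        self._last_visit = {}

    def seconds_to_wait(self, host, now):
        """Seconds still to wait before host may be visited again."""
        last = self._last_visit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, host, now):
        """Note that host was visited at time now (seconds)."""
        self._last_visit[host] = now


def conditional_headers(cached_at):
    """Headers asking the server to send a page only if it has changed
    since cached_at (seconds since the epoch)."""
    return {"If-Modified-Since": formatdate(cached_at, usegmt=True)}
```

A spider built on these pieces would call `rules.can_fetch(USER_AGENT, url)` before each request, sleep for `seconds_to_wait(...)`, and send `conditional_headers(...)` when refreshing a page it already holds in its cache.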

Gromit is still under development, and during this initial stage will not be running unattended. Further information can be obtained on the page 'Gromit Web Robot - Information for Web Managers'.

3.4.3 Mirror sites on AustLII

AustLII has been granted permission to mirror certain legal sites. The Gromit robot is used to download these sites and keep the mirrors updated. When mirroring, Gromit rewrites local URLs to use the mirror copies of documents, and also downloads any graphics or other files that may be referenced there.
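The URL rewriting step can be illustrated as follows. This is a simplified Python sketch under stated assumptions, not Gromit's code: it assumes a mirror keeps the same path layout as the original site, and the function name `rewrite_for_mirror` and all host names are hypothetical.

```python
from urllib.parse import urljoin


def rewrite_for_mirror(page_url, href, site_prefix, mirror_prefix):
    """Rewrite a link so that it points at the mirror copy.

    Links that resolve to a location under site_prefix are redirected to
    the same relative path under mirror_prefix. Links to other sites are
    returned as absolute URLs, unchanged.
    """
    absolute = urljoin(page_url, href)  # resolve relative links first
    if absolute.startswith(site_prefix):
        return mirror_prefix + absolute[len(site_prefix):]
    return absolute
```

For example, a link `acts/a1.html` on the page http://remote.example/law/index.html would be rewritten to point below the mirror prefix, while a link to a third site would be left pointing at that site.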

3.4.4 Checking for bad links

A by-product of the development of the web spider is that it will also be used to check the validity of all links in AustLII's indices, so as to improve the quality of the indices.
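A link checker of this kind can be sketched briefly. The following Python is an assumption-laden illustration rather than a description of AustLII's software: the names `head_status` and `check_links` are hypothetical, and treating any HTTP status of 400 or above as a bad link is a choice made for the example.

```python
import urllib.error
import urllib.request


def head_status(url, timeout=10):
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status


def check_links(urls, fetch_status=head_status):
    """Return (url, reason) pairs for links that appear to be bad."""
    bad = []
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError as err:  # DNS failure, refused connection, timeout
            bad.append((url, f"unreachable: {err}"))
            continue
        if status >= 400:
            bad.append((url, f"HTTP {status}"))
    return bad
```

Passing a different `fetch_status` function makes it possible to test the checker without touching the network, or to reuse a spider's own politeness rules when checking a large index.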

3.5 Project DIAL - A Challenge for Internet Law Indexing

AustLII's management team are involved in a consultancy project for the Asian Development Bank, Project DIAL (Development of the Internet for Asian Law). It is a feasibility study of the potential use of the Internet to assist those involved in the development of legislation in the developing member countries (DMCs) of the Bank. One method of assistance which is envisaged is the DIAL Index of legislative and other resources already on the World Wide Web, so as to provide ready access to comparative legal materials from other countries.

The prototype DIAL Index, which will be located on AustLII, will be based principally around a subject index of matters which are of particular interest to those drafting legislation in DMCs ('Privatisation', 'Environment' etc.), but will also have indexes which classify materials by the country they concern, by international organisations etc. The prototype DIAL Index will provide a very extensive testing ground for methods of intellectual indexing of legal web sites world-wide, but particularly in the Asia-Pacific region, and of the use of a targeted web spider to provide word-level searching of remote sites. The flexibility of the new SINO interface which allows searches to be limited to any combination of available databases will be put to very extensive use in the design of an interface to accommodate the needs of Project DIAL. The end result is intended to be a very extensive set of links and searches on each subject matter, which gives users access to a wide range of comparable legislative resources in many countries.

It should be stressed that the Asian Development Bank has not made any decision whether Project DIAL will proceed beyond prototype stage, or about the software, servers etc. which might be employed in the final system.

3.6 The targeted web spider and AustLII's future directions

One international application of the targeted web spider is mentioned above. Within Australia, despite the availability of numerous Australian web sites for law, AustLII will be able to become an even more comprehensive starting point for Australian legal research. The new SINO interface allows construction of searches over varying combinations of Australian and overseas materials, and searches over single sites which do not otherwise have a search engine. Mirroring of other sites will allow automated hypertext links to materials in AustLII's collection to be added (where permitted).