Terminology retrieval can be considered a special application of information retrieval. The trends in information retrieval mentioned in the introduction to this chapter can therefore be regarded as relevant to the design of software to retrieve terminological information. This section investigates those retrieval requirements which are exclusive to terminological data.
Standard information retrieval is based on the premise that the system has only one human user-type, the on-line user. The majority of systems take data which are already in existence in another form (texts, articles, abstracts etc.) and convert them to a new storage medium to allow easier access to and wider dissemination of that data than was possible in their previous printed form.
Essentially, the structure of the information remains unchanged. What has altered is the way in which the information can be used.
The situation in terminology information retrieval is the complete reverse.
The data is created in the first place in machine-readable files and then either made available to on-line users or converted to another form (printed dictionary, microfiche, microcomputer-based machine-readable glossary). Previous data collections are used as sources, but the resultant data structure bears no resemblance to that which existed prior to terminology compilation. Thus in traditional information retrieval applications, information is normally converted from print to machine-readable format, and the facilities required for retrieval can be entirely oriented to a narrow range of on-line queries. In terminology retrieval, data is created in machine-readable form and may be converted to a printed format to satisfy the requirements of particular user groups. It must, therefore, be structured in such a way that a great diversity of subsets of information can be extracted not only on-line, but also in formats suitable for effective presentation on other media.
The use of the two types of data also differs. The raw data input to IR systems is textual data which, for improved access in retrieval, will be analysed and processed in some way. Terminological data could form a particular module within the system, assisting in the computational analysis of the text being processed. The form of the output of IR systems is basically of two types: either it is a reference to a text or it is a text. The use of the output is also twofold: either it is used to get hold of a text or it is the end result of the search process, i.e. the information need is satisfied with the provision of the text.
Terminological data output to the end user is of two types: it is either an item selected from a natural language corpus, e.g. a term or a context, or it is information related to an item of natural language, in which case it is entirely the result of human decision-taking and attribution. In most cases the information sought is of a mixed nature, i.e. the end user seeks information in relation to a lexical item of one language and expects a commentary of some sort which relates the lexical item to the user query. The use of the output is more varied than for an IR system. Human uses are discussed in section 7.4 below.
For use in NLP systems it is desirable that terminological information systems be able to produce output in some sort of formal representation. To take this point a step further, producing output in a formal representation relies heavily on a similar representation being used internally for the storage of data. This supports the earlier argument for a more structured logical database representation than is possible in most current information retrieval software.
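The dependence of formal output on formal storage can be illustrated with a minimal sketch. It assumes a hypothetical record type, `TermRecord`, whose field names are invented for illustration and are not taken from any particular term bank. Because the stored record is already structured, a formal export (here JSON) is a direct serialisation, and the exported form can be read back without loss.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TermRecord:
    """Hypothetical internal record; all field names are assumptions."""
    concept_id: str
    term: str
    language: str
    definition: str
    broader: list = field(default_factory=list)  # ids of broader concepts


def to_formal(record):
    """Serialise a stored record to JSON with a stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def from_formal(text):
    """Rebuild a record from its exported JSON form."""
    return TermRecord(**json.loads(text))


record = TermRecord(
    concept_id="C003",
    term="term bank",
    language="en",
    definition="A structured database of terminological records.",
    broader=["C000"],
)
exported = to_formal(record)
restored = from_formal(exported)
```

If the data were held only as free text, an export of this kind would first require the structure to be recovered by analysis, which is the difficulty the paragraph above points to.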
On-line terminology retrieval has several further requirements over and above those provided by standard information retrieval software because of the heterogeneity of the end users. Differences in user requirements can be perceived on a number of distinct planes.
Terminology has many distinct user groups: translators, technical writers, abstractors, teachers etc. These users must be able to retrieve the subsets of data they need; they also want to be able both to begin a search from any data category and to combine search parameters and data categories in ways which information retrieval software rarely permits. The following access paths, via a powerful non-procedural query language which allows the formulation of any logical query commencing from any point in the database, should be supported as a minimum:
- to the term, e.g.
  via direct query,
  via selection from an index or permuted index,
  via expansion of the search string to retrieve entries containing the term or entries similar in their orthographic or phonetic form.
- to the concept, e.g.
  via direct access using an identification code,
  via conceptual relationships,
  via free text searches on definitions (and possibly contexts).
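Most of the access paths listed above can be sketched over a small in-memory collection. The entries, field names and function names below are all hypothetical, and the sketch uses ordinary Python functions rather than the non-procedural query language the text calls for. Orthographic similarity is approximated with the standard library's `difflib`; phonetic matching is not covered.

```python
import difflib
from collections import defaultdict

# Hypothetical entries; ids, terms and definitions are invented for illustration.
ENTRIES = [
    {"id": "C001", "term": "information retrieval", "broader": None,
     "definition": "Finding stored documents that satisfy an information need."},
    {"id": "C002", "term": "terminology retrieval", "broader": "C001",
     "definition": "Retrieval of terms and concept descriptions from a term bank."},
    {"id": "C003", "term": "term bank", "broader": None,
     "definition": "A structured database of terminological records."},
]

BY_TERM = {e["term"]: e for e in ENTRIES}  # access to the term by direct query
BY_ID = {e["id"]: e for e in ENTRIES}      # access to the concept by identification code


def permuted_index(entries):
    """Map each word of a multi-word term to every term containing it."""
    index = defaultdict(list)
    for entry in entries:
        for word in entry["term"].split():
            index[word].append(entry["term"])
    return dict(index)


def similar_terms(query, cutoff=0.8):
    """Expand a search string to terms with a similar spelling."""
    return difflib.get_close_matches(query, list(BY_TERM), n=5, cutoff=cutoff)


def narrower_concepts(concept_id):
    """Follow one conceptual relationship: concepts whose broader concept is given."""
    return [e["id"] for e in ENTRIES if e["broader"] == concept_id]


def search_definitions(text):
    """Free-text search on definitions (case-insensitive substring match)."""
    return [e["id"] for e in ENTRIES if text.lower() in e["definition"].lower()]
```

For example, `similar_terms("termbank")` finds the entry for "term bank" despite the missing space, and `narrower_concepts("C001")` moves from a concept to the concepts directly below it.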
Terminology databases, like dictionaries, attract users with different levels of subject specialisation. Whereas a scientist will expect a very precise definition of a concept, a translator may prefer a less technical definition and an undergraduate student may be looking for a definition more akin to an encyclopaedia entry. A term bank, if stored in a formal and structured enough manner, should be capable of tailoring output to different levels of technical ability on the part of users as defined in their output profiles.
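One way to picture profile-driven output is the sketch below. It assumes, purely for illustration, that each entry stores several definitions keyed by level and that an output profile maps a user type to a level. The profile names, level names and definition texts are all invented.

```python
# Hypothetical output profiles: user type -> preferred definition level.
OUTPUT_PROFILES = {
    "scientist": "specialist",
    "translator": "general",
    "student": "encyclopaedic",
}

# Hypothetical entry holding one definition per level (texts are illustrative).
ENTRY = {
    "term": "term bank",
    "definitions": {
        "specialist": "A database of concept-oriented terminological records "
                      "with controlled data categories.",
        "general": "A computer-held collection of specialist vocabulary.",
        "encyclopaedic": "A term bank stores the vocabulary of specialist "
                         "fields and is consulted by translators, writers "
                         "and teachers, among others.",
    },
}


def definition_for(entry, user_type, default_level="general"):
    """Pick the definition matching the user's output profile.

    Unknown user types, and levels missing from the entry, fall back
    to the default level.
    """
    level = OUTPUT_PROFILES.get(user_type, default_level)
    definitions = entry["definitions"]
    return definitions.get(level, definitions[default_level])
```

A call such as `definition_for(ENTRY, "student")` returns the encyclopaedic text, while a user type with no profile receives the general definition.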