Terminological data banks—a definition

Traditionally the concept of a terminological data bank or term bank has been defined as an automated collection of the vocabularies of a subset of specialised knowledge created to serve a particular user group. This definition implies monofunctional use and was appropriate for those term banks which have been used predominantly as tools for large translation services. The vocabularies collected in this way are in most cases considered as enhanced but still conventional glossaries, thesauri or technical dictionaries transferred to a new medium. Existing term banks were designed to answer the same type of questions for which one would consult a good dictionary. These questions are perfectly valid, but they are only addressed to and elicit direct responses from the various parts of the conventional dictionary entry, e.g.:

ENTRY PART      QUESTION                                          ANSWER

gender          What is the gender of 'imprimante'?               feminine
spelling        What is the spelling of the French word whose     clavette Woodruff
                English equivalent is 'Woodruff key'?
equivalent      What is the French for 'laser printer'?           imprimante laser
definition      What is a 'laser printer'?                        a printer which ...
synonym         Can I use 'bit' as synonym for 'binary digit'?    binary digit - Abbrev.: bit
subject label   Is 'bit' restricted to a subject field?           Computation
example         Is there an example sentence containing 'bit'?

While these questions may satisfy existing expectations, they neither exhaust the information available nor do they present all the information that may be useful to a wide range of existing and potential dictionary users. In addition, the answer may or may not satisfy the user and it may be ambiguous. For example, listing a subject field label may mean that a word has a particular meaning in the named subject field, or that its use is restricted to this field.

The questions that a specialist user would address to a conventional term bank are in principle the same as those addressed to a general language dictionary. The difference lies in the frequency with which certain questions are asked and the class of entry that is being consulted. Most existing term banks can provide satisfactory answers to these questions and, according to the specific nature of the term bank, also to some other questions regarding sources, date of recording in the term bank and more extensive usage notes.

The potential information of a lexical database is, however, not exploited by existing automated dictionaries and term banks. This has several reasons:

(a) the information held is not unified in a manner suitable for flexible retrieval;

(b) there is a lack of coherent structure and formal representation of data;

(c) [...]tries which the machine provides and underutilise or ignore those facilities which allow the storage and representation of terminological data in new and interesting ways.

Modern expectations in both general language lexicography and terminology processing are for flexibility to become the main criterion in designing machine-readable dictionaries. In addition to the type of questions listed above it is now considered desirable that information should be stored in such a way that the following searches and queries become possible:

QUERY                                                         SEARCH OF FIELD

- Compile a glossary of all terms with usage note of 'ICI'.   usage restriction or scope note
- What do you call a machine that performs X?                 definition or conceptual links
- What parts make up a Y?                                     subordinate partitive terms
- Find all terms entered by 'JCS' since 1985                  name of the terminologist over a period of time
- Compile a glossary of terms related to 'GPSG'.              subject field or a subdivision
- Print all terms with a source of CCL.                       subset of origin of data

Storage of Terminology 169

Much of the data required to answer the above questions is already available in existing term banks, but retrieving it requires a cumbersome scan of the whole database, i.e. the data is hidden.
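The difference between scanning the whole collection and retrieving by field can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any existing term bank: the record layout, the field names (`term`, `terminologist`, `year`, `subject`, `source`), the initials 'RC', the source label 'OTHER' and all sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"term": "laser printer", "terminologist": "JCS", "year": 1986,
     "subject": "Computation", "source": "CCL"},
    {"term": "binary digit", "terminologist": "JCS", "year": 1984,
     "subject": "Computation", "source": "CCL"},
    {"term": "Woodruff key", "terminologist": "RC", "year": 1987,
     "subject": "Engineering", "source": "OTHER"},
]


def build_index(records, field):
    """Map each value of `field` to the list of records carrying it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record)
    return index


def terms_entered_by(index, name, since):
    """Answer 'find all terms entered by NAME since YEAR' from the index."""
    return sorted(r["term"] for r in index.get(name, []) if r["year"] >= since)


# Built once; later queries consult the index instead of scanning RECORDS.
by_terminologist = build_index(RECORDS, "terminologist")
by_source = build_index(RECORDS, "source")

jcs_terms = terms_entered_by(by_terminologist, "JCS", 1985)
ccl_terms = sorted(r["term"] for r in by_source["CCL"])
```

Once a field such as the terminologist's name or the source is held as a retrievable unit in its own right, queries of the kind listed above no longer depend on the headword as the only entry point.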

These new information possibilities should be reflected in the definition of the modern concept of a term bank, which may thereby be oriented towards a broader function than that normally associated with dictionaries.

A term bank should therefore be defined as:

'a collection, stored in a computer, of special language vocabularies, including nomenclatures, standardised terms and phrases, together with the information required for their identification, which can be used as a mono- or multilingual dictionary for direct consultation, as a basis for dictionary production, as a control instrument for consistency of usage and term creation and as an ancillary tool in information and documentation.'

A term bank is therefore more than just an automated version of a printed dictionary, designed to meet the needs of a single user group. Using the above definition it should now be possible to specify the design of a term bank so that users with varying degrees of expertise can access the same term bank and retrieve data for a variety of different purposes from a variety of starting points. Indeed, using the same term bank to serve both human users and machine uses such as machine translation, natural language processing and expert systems is well within reach, given a flexible and sensitive design and a sophisticated interface both for the input of data and for their retrieval.

6.3 Modern terminological data bank design

NOTE: The model presented in this section was conceived by Richard Candeland as part of the research he carried out at UMIST, and is reproduced here with his kind permission.

At first sight, the design of a terminological data bank presents no problems.

The lack of flexibility inherent in traditional term banks can be dismissed on historical grounds. There was no suitable purpose-made software available when the first and second generations of term banks were conceived and consequently they were implemented using custom-built database/file management systems. Any significant diversification of use, let alone meeting all the requirements we now consider essential, would have required this software to be completely rewritten.

An examination of the current term bank facilities and software permits two possible interpretations of the conditions for term bank design:

EITHER: building a term bank is simply a matter of specifying the structure of terminological data using traditional database design techniques and implementing the resultant model by means of an existing database management system (DBMS) package;

OR: building a term bank is merely another application of conventional information retrieval (IR) techniques and so can be loaded onto any good IR software package.

Although both methods have been used for the construction of term banks, neither has proved entirely satisfactory, and on closer study it becomes evident that in practice both approaches have their problems.

Concerning the current position of the logical structure of term banks, the following features are still to be found in the majority of well-established term banks:

- There is still strict adherence to the rigid sub-division of the knowledge base into separate areas and a failure to take account of inter-disciplinary and overlapping subject fields. Many identical concepts must consequently either be duplicated or subjectively classified as belonging to a particular discipline. (This inability to satisfactorily represent the relationship subject field : concept can be held to explain the inability of term bank administrators to agree on a subject classification).

- There is a tendency to organise the database linguistically rather than conceptually. In a theoretical model of terminology, terms are subordinate to the concept; in practice, however, the term is the primary, and often the only, entry point into the database and so must be given precedence.

- There is a tendency to ignore conceptual relationships as they are difficult to represent using a network-type database model. Where they have been introduced, only a very restricted subset of the relationships in existence (i.e. those which fit neatly into a hierarchical one : many representation) has been implemented.

- There is a tendency to impose a uniform record structure over the whole database. Subject fields whose terminology is taxonomic in structure share the same logical record format with those whose structure is very loose. Differences of this kind are often catered for by having a 'put-anything-here-that-does-not-fit-elsewhere' data field but this approach adversely affects the flexibility with which data can be retrieved from the system.

- The inability to represent the relationship 'can be translated in language X by' as a many:many relationship has resulted in the need:

- to declare each concept as having only one equivalent concept in each foreign language without inferring that the concepts match;

- to enter foreign-language equivalents only for matching concepts;

- to store target language equivalents with a limited amount of associated data as part of a monolingual record in the source language database. The full entry is then stored independently as part of a separate logical database.

- Implementations of network-type data models preclude the dynamic creation of new record types and new relationships.

Consequently it is difficult to change the orientation of existing terminological data banks and adapt them to new and diverse user-groups. Many term banks are therefore unable to meet the requirements of the present information market. Current trends in database management generally favour an approach based on the relational database model. Implementations of this model in the commercial environment began to appear in the early eighties and many powerful data storage and manipulation tools are now available.

6.3.1 Representation of terminology—a theoretical model

When applying the computer to any task, it is first necessary to define a theoretical model both of the processing involved and of the data that require processing. A model of the data is devised by defining the entities which are relevant to the application in question and identifying the relationships which hold between them. It is, therefore, a representation of the real world of the application in question and is drafted independently of any of the constraints which will later be imposed by attempts to implement the theoretical model onto a computer architecture using a particular logical data-storage model.

In the majority of existing computerised terminology processing systems the task to be achieved has been very specific. The range of operations to be performed has been limited to those required to satisfy the demands of an ideal user (e.g. a translator at the EC, a standards expert at AFNOR). All theoretical models of terminological data have therefore been heavily biased by the needs of a single user-group and, once implemented, have resulted in term banks locked into a particular application.

There are a number of very practical reasons which have contributed to the re-assessment of the design requirements of term banks.

1. The demand from new and diverse user-groups for high-quality terminology.

2. The immense cost of terminology compilation.

3. The greater range of expectation of users.

These circumstances have resulted in a call for the pooling of terminological resources in a multifunctional term bank or an even wider-ranging database more properly designated a lexical database. Once a multipurpose tool was envisaged, it also became necessary to re-appraise the theoretical model of terminology processing. In such a situation emphasis must be placed on the definition of a general and exhaustive theoretical model of terminological data, such that all the data models produced by the investigation of individual retrieval requirements can be mapped onto it. Only in this way is it feasible to serve many diverse and disparate user groups from a single central terminological collection.

From a model-theoretic point of view, therefore, the new design objective for optimal terminology processing is the storage of terminological data in an application-independent fashion so that all conceivable purposes can be served equally well. In such a 'new' situation, compilation continues to be governed by the theoretical model of the data but, in contrast, a model of retrieval as such remains undefined. In practice several processing models for terminology retrieval are required. Each will match the specific needs of its target user-group and each will have its own model of terminological data which can be implemented at the logical level as a view of terminological data as a whole, i.e. subsets of the data will be extracted (and converted into an appropriate format) as each application requires.

This notion of user views of data is far from new and has been in use in the world of commercial database management for many years. Only recently, however, has it been recognised as applicable to terminology. We can now define the main types of data that have to be accommodated.
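The idea of user views over one application-independent store can be sketched with Python's standard `sqlite3` module. This is only an illustration under stated assumptions: the `entry` table, its columns, the two view names and the user groups they stand for are hypothetical and far simpler than any real term bank.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared, application-independent store (hypothetical layout).
CREATE TABLE entry (
    term        TEXT,
    language    TEXT,
    definition  TEXT,
    grammar     TEXT,
    usage_note  TEXT
);
INSERT INTO entry VALUES
    ('laser printer', 'en', 'a printer which ...', 'noun', NULL),
    ('imprimante laser', 'fr', NULL, 'noun, feminine', NULL);

-- View for a hypothetical translator group: forms and grammar only.
CREATE VIEW translator_view AS
    SELECT term, language, grammar FROM entry;

-- View for a hypothetical subject-specialist group: English definitions only.
CREATE VIEW specialist_view AS
    SELECT term, definition FROM entry
    WHERE language = 'en' AND definition IS NOT NULL;
""")

translator_rows = conn.execute(
    "SELECT term, language, grammar FROM translator_view ORDER BY term"
).fetchall()
specialist_rows = conn.execute(
    "SELECT term, definition FROM specialist_view"
).fetchall()
```

Each view extracts and reshapes a subset of the same underlying data, so adding a further user group means defining another view rather than compiling another collection.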

Any contemporary model of terminological data comprises the following broad data categories which are all interdependent:

- Housekeeping/management data (reference/record number, terminologist's name, date of first coding, information about updates);

- Conceptual data (subject, scope, definition, related concepts, related terms and type of relationship);

- Linguistic data (lexical entries, their form and grammatical features);

- Pragmatic data (usage restrictions and special labels, contextual data);

- Bibliographic reference data.
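The five data categories listed above can be pictured as the parts of a single record. The following is a minimal sketch only: the class and attribute names are assumptions made for the example, and the sample values (record number, initials, date) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Housekeeping:
    record_number: int
    terminologist: str
    first_coded: str                                   # date of first coding
    updates: List[str] = field(default_factory=list)   # information about updates


@dataclass
class Conceptual:
    subject: str
    definition: str
    scope: Optional[str] = None
    # Pairs of (related concept or term, type of relationship).
    related: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Linguistic:
    form: str
    grammar: str


@dataclass
class Pragmatic:
    usage_labels: List[str] = field(default_factory=list)
    contexts: List[str] = field(default_factory=list)


@dataclass
class TermRecord:
    housekeeping: Housekeeping
    conceptual: Conceptual
    linguistic: Linguistic
    pragmatic: Pragmatic = field(default_factory=Pragmatic)
    references: List[str] = field(default_factory=list)  # bibliographic data


# Hypothetical sample record.
record = TermRecord(
    housekeeping=Housekeeping(1, "JCS", "1986"),
    conceptual=Conceptual(subject="Computation", definition="a printer which ..."),
    linguistic=Linguistic(form="imprimante laser", grammar="noun, feminine"),
)
```

A flat grouping like this shows the categories but not their interdependence; it is exactly the kind of uniform record that a fuller model would break up into separate entities and relationships.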

Even though recent developments in modelling terminological data converge on the necessity of creating a multipurpose tool, there are still marked differences in the perception of the data that have to be accommodated in a theoretical model. Different categories are selected as meriting isolation as entities in their own right and there are disagreements about the relationships which hold between entities.

The theoretical model of terminological data which follows is not to be considered a definitive model. It draws on the theoretical models underlying the most recent developments in term bank design as well as models proposed for the application-independent representation of the general-language lexicon.

Figure 6.1. Theoretical model of terminological data

Entities in the model

ENTITIES                    PROPERTIES

Pool/Collection of terms    Code - number
                            Function/purpose

Origin/Update               Originating Centre
                            Date
                            Originator

Concept                     Code - number
                            Formal representation (language independent)

Conceptual Link             Type of link
                            Nature (e.g. reciprocal, arity, ratio, e.g. 1 : N, N : N or 1 : 1, etc.)

Language                    Language

Definition                  Textual definition

Term                        Graphemic form
                            Phonetic form
                            Grammatical features (syntactic and morphological)

Usage                       Status (preferred, deprecated, abbreviation, equivalent etc.)
                            Usage note (stylistic, geographical)

Context                     Phrase containing term

Source                      Author/organisation
                            Title
                            Page reference
                            Publisher
                            Date
                            Type of source

Relationships between entities in the model

EXISTS BETWEEN ENTITIES              TYPE

Pool - Concept                       1 : N

Concept - Origin/Update              1 : N

Concept - Link - Concept(s)          N : N : N (: N :)
                                     (any concept can have any number of links which may be
                                     to one or more than one other concepts)

Concept - Language - Definition      1 : 1 : 1
                                     (one textual definition per concept per language)

Concept - Language - Usage - Term    N : N : N : N
                                     (one concept may have any number of terms in any number
                                     of languages, depending on any number of variations in
                                     usage. Conversely any term may refer to more than one
                                     concept)

Term - Context                       1 : N

Source - Term                        1 : N
Source - Definition                  1 : N
Source - Context                     1 : N

NOTES:

1. What are traditionally called conceptual relationships have been renamed conceptual links here to avoid any confusion between conceptual relationships and relationships in the data model which exist between any entities.

2. Conceptual links are not enumerated. It is now recognised that hierarchical links are restrictive and are being supplemented by the idea of a network of links. Research into the nature and types of such links continues. In order not to prejudge the results of such research it is necessary to make provisions in the model for the addition of new links and the possibility that such links will not conform to the binary, 1 : N nature of traditional hierarchical knowledge representation.

3. As subject fields and scope limitations may themselves be considered as concepts, the membership of particular terms of a conceptual system or subject area (or areas) can be represented as a special type of concept-concept link. This allows the construction of subject domains from the bottom up and may help resolve the controversial question of the sub-division of the knowledge base. The current practice of initially declaring a list of valid subject fields and demanding that all terms are subsequently allocated (normally at the discretion of the terminologist) to a subject field from that list can therefore be replaced by one in which the terminologist enters a concept and describes its environment using conceptual links. The machine itself will then determine to which knowledge domain or domains the term belongs.

4. One further type of specialised concept : term relationship could be used to indicate the nearest term in language B for a concept in language A where there is no match of conceptual systems across languages. Most current term bank implementations either assume all concepts match or no concepts match and allow no compromise between the two positions. The implementation of a 'nearest equivalent to' relationship would be of particular use when non-standardised terminology was being processed.
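The bottom-up determination of subject domains described in note 3 can be sketched as a traversal of conceptual links. This is an illustration under stated assumptions, not the model's actual procedure: the link types (`kind_of`, `belongs_to`), the sample concepts and the decision to follow links transitively are all hypothetical choices made for the example.

```python
from collections import deque

# (from_concept, link_type, to_concept); all names are illustrative.
LINKS = [
    ("laser printer", "kind_of", "printer"),
    ("printer", "kind_of", "output device"),
    ("output device", "belongs_to", "Computation"),
    ("printer", "belongs_to", "Office equipment"),
]

# Subject fields are themselves concepts; this set only marks which
# concepts should be reported as domains.
SUBJECT_FIELDS = {"Computation", "Office equipment"}


def domains_of(concept, links=LINKS, subject_fields=SUBJECT_FIELDS):
    """Collect every subject field reachable from `concept` via links."""
    outgoing = {}
    for source, _link_type, target in links:
        outgoing.setdefault(source, []).append(target)

    found, seen, queue = set(), {concept}, deque([concept])
    while queue:  # breadth-first walk over the link network
        current = queue.popleft()
        for target in outgoing.get(current, []):
            if target in subject_fields:
                found.add(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(found)


laser_printer_domains = domains_of("laser printer")
```

The terminologist records only the immediate environment of a concept; membership of one or several domains then follows from the links, and a concept may fall into overlapping subject fields without being duplicated.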

6.3.2 Representation of terminology (logical implementation)

When implementing the data model with a particular application in mind, using available data-storage models, it is theoretically always possible to adhere to the original definition of the model. In practice, however, the resultant logical database structure is sometimes so cumbersome and unwieldy as to make this unwise. This has been the case with terminological data and has resulted in term bank designers either omitting the more complex structures inherent in the data or simplifying the structure of the model to facilitate an easy implementation.

This problem has been exacerbated by the fact that at the time when most term banks were created research in data storage models was in its early stages and most data storage software was based on the limited hierarchical and network models. Both models support only binary, one : many relationships and as a result attempts have been made to portray terminological data purely in terms of this type of relationship. The driving force behind term bank design has therefore been the availability of software and software techniques rather than a desire to achieve a true logical implementation of the data. As a result logical implementations of terminological data do not reflect many of the trends in the representation theory of terminology outlined above and have consequently failed to make use of the removal of physical restraints which operated on the first and second generation of term banks.

There has been some recent research into the suitability of the relational model for storing terminological data. The CEZEAUTERM term bank, for example, was designed using a relational model of terminological data, although it was not implemented, using a pre-written relational database management package instead. (At that time commercial relational database management packages were not fully developed.) The relational model eliminates some of the problems highlighted earlier. The relatively weak relational structure between the data fields imposed on conventional terminologies by a hierarchical or network approach is replaced by a potentially more powerful set of relationships between all entities, features and values.

The relational model can efficiently and easily represent the many : many relationships which are so common in terminological data. For example, the representation of synonyms and homonyms in different subject fields is easily achieved. If concept X is signified by terms A and [...] and concept Y is signified by A and [...] then the relation concept-term would be represented as follows:

SUBJECT    CONCEPT    TERM

1          X          A
           X          [...]
2          Y          A
           Y          [...]

Relations in the relational database model are simply represented as two-dimensional tables. For a network database model a link record must be introduced to allow for the representation of the many : many relationships, thus:

SUBJECT CONCEPT

SUBJECT-CONCEPT LINK
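A relation of this kind can be sketched as a single table using Python's standard `sqlite3` module. This is an illustration under stated assumptions: the table and column names are hypothetical, and the second term of each concept is given the placeholder labels 'B' and 'C' purely for the sake of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per (subject, concept, term) combination; no link record needed.
CREATE TABLE concept_term (
    subject TEXT NOT NULL,
    concept TEXT NOT NULL,
    term    TEXT NOT NULL,
    PRIMARY KEY (subject, concept, term)
);
-- 'B' and 'C' are placeholder labels assumed for this sketch.
INSERT INTO concept_term VALUES
    ('1', 'X', 'A'),
    ('1', 'X', 'B'),
    ('2', 'Y', 'A'),
    ('2', 'Y', 'C');
""")


def terms_for(concept):
    """Synonyms: every term that signifies the given concept."""
    rows = conn.execute(
        "SELECT term FROM concept_term WHERE concept = ? ORDER BY term",
        (concept,),
    )
    return [term for (term,) in rows]


def concepts_for(term):
    """Homonyms: every concept that the given term can signify."""
    rows = conn.execute(
        "SELECT concept FROM concept_term WHERE term = ? ORDER BY concept",
        (term,),
    )
    return [concept for (concept,) in rows]


synonyms_of_x = terms_for("X")
meanings_of_a = concepts_for("A")
```

The same table is read in either direction: from concept to terms, or from term to concepts. Neither the concept nor the term is privileged as the entry point.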

Furthermore relationships can be created between more than two entities if required. Thus a particular theoretical model of terminological data may view term W as being the value created by the relationship 'used for concept X in language Y for text-type Z'. A relationship between four entities such as this can be represented using the relational model as one relation. A similar representation using a network or hierarchy model would require the creation
