Storage of terminology: practical considerations

The particular nature and the considerable volume of data to be stored in terminological data banks determine the type of software tools needed for the physical organisation of data on disk, the logical organisation of the structure of the data, and the management (security, integrity etc.) and running of the resulting database. This need is being met in different ways. Some term bank designers have opted to use commercially available systems, where the choice is between database management software and commercial information retrieval software. Others have decided that none of the available software matches their particular requirements and have constructed term bank software from scratch. All three approaches have their advantages and disadvantages.

Commercial database management systems and information and documentation systems are based on particular models, the relative theoretical merits and failings of which were discussed in the previous section. Both types of system are thoroughly tested before they are released and are consequently reliable and robust. They possess all the facilities which are appropriate to their intended task and allow a wide range of applications to be built onto them. From a practical point of view, however, there are a number of other considerations which affect their usefulness as models for term bank software.

6.4.1 Database management systems

In order to build a term bank from scratch it is first necessary to build data management software; this will without doubt incorporate many of the features already provided in commercial packages. The next step is to develop a user-interface, i.e. the specific application, in much the same way as would be required to implement a term bank using existing software. Whereas the simple structure of early term banks, based loosely on the model provided by the format of the printed dictionary, encouraged the development of custom-built software, the move to more complex terminological data models made this task less attractive. The designers of more recent term banks (with one or two exceptions) therefore turned to commercial software as the basis on which to construct their systems.

Commercially available database management systems have proved unsuitable for the implementation of a term bank in a number of respects.

(a) Terminological data is of a predominantly textual nature. Much of the data is of variable length and there is no pre-determined maximum for any of the values in question. It is, for example, extremely difficult to prescribe a fixed length for definitions. This is exacerbated by the fact that the structure of terminological data and the relative importance of particular data items may vary from one subject field to another. Database management systems are not designed to cater for data of this type. They were written to manipulate predominantly numerical data and strictly formatted information. It is, therefore, difficult to implement terminological data neatly on the majority of commercially available database software.

(b) Database management systems are not usually suitable for more than one language, whereas terminological data is, in the majority of cases, multilingual in nature. Because of the compression techniques used by many database management systems to make more efficient use of available storage, the extended character facilities required to represent diacritics or non-Latin characters are not generally available.

(c) Database management systems generally have ordering systems, such as indexes and other access facilities, in which the difference between upper and lower case is not fully acknowledged. The distinction may not be supported at all, in which case the user may remain unaware of the graphemic peculiarities of certain terms. Alternatively, it is always significant; in such a case the term 'Boussinesq equation', for example, would not come between 'boulder' and 'coefficient of consolidation', purely because it begins with a capital 'B'. In terminological data processing the upper-/lower-case distinction is partially significant. In the majority of database searches users will wish any differences to be ignored in order to increase the scope of retrieval. On the other hand, when the term is retrieved they will wish to see it in its true graphemic form.
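
The ordering problem described in (c) can be illustrated with a minimal Python sketch. It is only an illustration of the principle, not a description of any particular database product: the three terms are those used in the text, while the helper name `find` and the choice of `str.casefold` as the folding rule are assumptions made for the example.

```python
# The three terms used as an example in the text.
terms = ["coefficient of consolidation", "Boussinesq equation", "boulder"]

# Case always significant: in code-point order every capital letter
# precedes every lower-case letter, so the capitalised term sorts first.
case_sensitive = sorted(terms)

# Case partially significant: fold case for ordering only, while the
# stored strings keep their true graphemic form.
folded = sorted(terms, key=str.casefold)

def find(query, records):
    """Hypothetical lookup: ignore case when matching, return true forms."""
    key = query.casefold()
    return [t for t in records if t.casefold() == key]

print(case_sensitive)
print(folded)
print(find("boussinesq EQUATION", terms))
```

Folding is applied only to the comparison key, never to the stored value, which is what allows retrieval to ignore case while the display preserves it.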

These problems can be overcome to a certain degree by sensitive and clever design but the resultant logical database schema is somewhat contrived and does not easily lend itself to a flexible or efficient user-interface.

6.4.2 Information retrieval systems (IR)

Information and document retrieval packages were designed with the specific task of storing textual data and therefore allow true variable-length text entries. Most good IR packages can adequately handle the problems related to multilingual data but the upper-/lower-case distinctions are usually absent.

Most IR software exactly mirrors the model on which traditional printed dictionaries and glossaries are based. Each entry consists of a heterogeneous collection of unformatted or loosely formatted data which is accessed via a head-word.

IR software generally allows for thesauri of varying complexity to be set up, which ought to satisfy the needs of terminologists wishing to portray conceptual links in their term banks. It should therefore be a simple task to transfer a conventional glossary onto an IR package without any loss of data or facilities and indeed make some improvements in the process.

IR packages can therefore supply most of the needs of conventional glossary production and consultation; however, current requirements are for a more complex data structure allowing terminological data to be retrieved in new and more interesting ways. The restricted data model offered by most IR systems is consequently inadequate for the following reasons:

(a) Information retrieval database models lack structure. They allow a limited amount of flexibility in the way the information can be internally structured and retrieved but do not allow the database administrator the full range of options for logical database design; nor do they enable the administrator to select the most efficient storage techniques as appropriate to the data being represented. Consequently they discourage the representation of complex relationships between data such as those discussed in the previous sections and work against a true portrayal of the theoretical representation of terminology presented earlier. On a more practical level, when implementing the theoretical model, it is difficult to preserve the distinction between conceptual and linguistic data categories. Information retrieval packages do not have sufficient data structuring facilities to achieve this purpose. They also preclude any more formal representation of the type required to satisfy the diverse user groups listed above, making the goal of a flexibly constructed term bank difficult if not impossible to achieve.

(b) The range of thesaurus building facilities offered is insufficient for current terminology processing requirements. The relationships which can be represented in the system thesauri are generally restricted to those hierarchical relationships which many terminologists are now claiming to be inadequate to truly portray terminological data.

(c) [...] database to which documents are added at pre-determined times, e.g. in batches of new accessions. Once on the database, documents remain practically unaltered until they are finally regarded as obsolete and possibly removed to a back-up store or deleted altogether. Terminological data is not of this nature. It is volatile and undergoes change in line with linguistic usage or the structure of knowledge. New terms are coined to signify new concepts, old terms become obsolete but are simply marked as such and not eliminated, existing terms change their meaning and therefore require a change in the record. Many other consequent changes may then occur; e.g. synonyms may cease to be synonyms and abbreviations can be promoted to become the preferred term, the previous preferred term becoming a synonym. The more complex the representation of terminology used, the more likely it is that individual values and features of a record will change. Addition of new records may require changes in records elsewhere to maintain the system of cross references. As the volume of data held increases, so the number of terms needing monitoring, modification and updating will rise. Information retrieval software does not normally possess facilities for coping with large-scale database modification, nor does it have the facilities for large-scale verification of the data and the recovery of the system in the event of a failure. For typical information retrieval applications, these facilities are not deemed necessary; for terminology processing on a large scale they are indispensable.

(d) There are intrinsic differences in the information processing being undertaken and the modes of access required. Firstly, the optimum response to a query in a typical IR application is a set of documents; in terminology retrieval the optimum response is normally a single term record. Secondly, the contents of a typical IR database originate elsewhere (as articles, abstracts, texts etc.) and on conversion to machine-readable format they undergo no logical change. Their sole purpose is to satisfy on-line queries from users. Terminological data, however, are built up initially on the machine and may subsequently be converted to a printed or other format for other uses. They must therefore be structured in such a way that a great diversity of subsets of information can be extracted not only on-line but also on microfiche, on paper and in a form suitable for use in machine-(assisted) translation systems, spelling and style checkers, automatic abstracting systems and other natural language processing environments.
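
The kinds of change described in (c) and the subset extraction described in (d) can be sketched as follows. This is a minimal illustration under stated assumptions, not a proposed term bank design: the class `TermRecord`, its field names, the concept identifier and the sample definition are all hypothetical and were invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TermRecord:
    # Conceptual data (hypothetical field names).
    concept_id: str
    definition: str                      # variable-length text
    # Linguistic data.
    preferred: str
    synonyms: list = field(default_factory=list)
    obsolete: list = field(default_factory=list)   # marked, never deleted

    def promote(self, term):
        """Make a synonym the preferred term; the old one becomes a synonym."""
        if term not in self.synonyms:
            raise ValueError(f"{term!r} is not a synonym in this record")
        self.synonyms.remove(term)
        self.synonyms.append(self.preferred)
        self.preferred = term

    def mark_obsolete(self, term):
        """Flag a synonym as obsolete without removing it from the record."""
        self.synonyms.remove(term)
        self.obsolete.append(term)

    def subset(self, *names):
        """Extract only the named fields, e.g. for a printed glossary."""
        return {name: getattr(self, name) for name in names}

# Illustrative record; the identifier and definition are invented.
record = TermRecord(
    concept_id="C001",
    definition="Storage and retrieval of textual documents.",
    preferred="information retrieval",
    synonyms=["IR", "document retrieval"],
)

record.promote("IR")                        # abbreviation becomes preferred
record.mark_obsolete("document retrieval")  # kept, but flagged
glossary_view = record.subset("preferred", "synonyms")
print(glossary_view)
```

Even in this small sketch a single change of usage touches several fields of one record; in a term bank with cross references the same change would also have to be propagated to other records, which is why facilities for large-scale modification and verification matter.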

The natural conclusion to be drawn from the above discussion is that there are no suitable off-the-peg software packages available for the storage of terminological data. Indeed, recent research activity directed towards the automation of general language lexicography has led a number of computational linguists to the same conclusion regarding the data of general language lexicography. This has prompted a call from a number of sources for database management software based on the unique representational requirements of a dictionary data model. Such demands are also in line with current trends in Information Science, where new developments are towards greater abstraction and the definition of application-oriented data types. However, no specific model for lexical data has as yet been agreed amongst lexicographers, nor has computational lexicography reached a sufficient level of maturity for the specific computational requirements of lexical data to be determined. Research will no doubt continue into methods and strategies for storing lexical data, prompted by the realisation amongst many computational linguists and computer scientists that this area of Natural Language Processing has been sadly neglected in recent years.

The renewed interest in the lexicon will have a profound effect on the storage not only of general language vocabulary but also of terminology.
