ATAMIRI presentation in Washington D.C., OAS, 1985

ATAMIRI: Towards Language Engineering - 1985

Description of the first multilingual translator, presented at the MT Workshop held under the auspices of the General Secretariat of the Organization of American States, in Washington, D.C., March 1985.

Iván Guzmán de Rojas

SYSTEM OVERVIEW

A brief description

ATAMIRI is a complete, user-friendly machine translation (MT) system designed by Iván Guzmán de Rojas and developed by IGRAL, the language engineering R&D group he leads in La Paz, Bolivia. He began the work in 1978, and in the spring of 1985, under the auspices of the Organization of American States, ATAMIRI was presented in Washington as the first multilingual translator prototype. Since then, it has been in service at various translation centers in Latin America and Europe.

Thanks to a powerful inference engine capable of resolving the most frequent homograph occurrences in the source text, the multilingual translator generates a high-quality draft in several target languages simultaneously. The average number of correction keystrokes required to turn the draft into the final text is less than 15% of the number of characters in the source text.

Grammars are external to the program algorithms, so new source and target languages can be implemented relatively easily. A new language can then be paired with all previously installed languages. Language expansions in this novel MT system are thus not developed pair by pair, which significantly reduces implementation time and cost.
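The saving from not developing languages by pairs can be illustrated with a simple count. The sketch below is only an illustration, under the assumption that a conventional pair-based design needs one module per translation direction and that every language could act as both source and target. The function names are hypothetical and are not part of ATAMIRI.

```python
def directed_pairs(n_languages: int) -> int:
    """Translation directions among n languages (each ordered pair counted once)."""
    return n_languages * (n_languages - 1)


def new_pair_modules(n_installed: int) -> int:
    """Pair modules a pair-based design would need when one language is added."""
    # The newcomer must be paired with every installed language, in both directions.
    return 2 * n_installed


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, "languages:", directed_pairs(n), "directions;",
              "adding one more would need", new_pair_modules(n),
              "pair modules, versus a single language implementation")
```

Under these assumptions the pair-based effort grows with the number of installed languages, whereas an implementation that is independent of pairs stays at one unit of work per new language.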

The system's lexicographer is multilingual and allows centralized lexicon management in a PC network environment. It performs quick look-ups, and user-oriented terminology can easily be entered in context subdictionaries; it also supports a thesaurus facility. An independent lexicographer submodule allows off-network lexical expansions, i.e. new language updates made at other installations.

Thanks to an efficient retranslation feature, the user can recycle already translated and revised text fragments from older documents that appear again in a new document. This capability to detect differences in the source text and translate only the new or modified passages, inserting them automatically in the right place, is especially useful when translating updated versions of a handbook.
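The idea behind retranslation can be sketched with a sentence-level comparison of the old and new source texts. This is a minimal illustration, not ATAMIRI's actual algorithm: it assumes both versions are already split into sentences, uses Python's standard `difflib` in place of whatever comparison the system performs, and the function name and sample sentences are invented for the example.

```python
from difflib import SequenceMatcher


def plan_retranslation(old_source, old_target, new_source):
    """Pair each new source sentence with a reusable translation, or None.

    old_source and old_target are parallel lists: old_target[i] is the
    revised translation of old_source[i].
    """
    plan = []
    matcher = SequenceMatcher(a=old_source, b=new_source, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged sentences: reuse the revised translation as it stands.
            for offset in range(i2 - i1):
                plan.append((new_source[j1 + offset], old_target[i1 + offset]))
        else:
            # "replace" and "insert": new or modified sentences to translate.
            # "delete" adds nothing here because j1 == j2 for deleted spans.
            for j in range(j1, j2):
                plan.append((new_source[j], None))
    return plan


if __name__ == "__main__":
    old_src = ["Open the lid.", "Press start.", "Wait."]
    old_tgt = ["Abra la tapa.", "Pulse inicio.", "Espere."]
    new_src = ["Open the lid.", "Insert the disk.", "Press start.", "Wait two minutes."]
    for sentence, reused in plan_retranslation(old_src, old_tgt, new_src):
        print(sentence, "->", reused if reused is not None else "[translate]")
```

Sentences paired with `None` are the only ones that would be sent to the translator; the others keep their position in the new document together with their earlier revised translation.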

The bilingual word processing tool helps raise post-editing productivity, since the draft can be compared sentence by sentence with the source text on the working screen.

Lexicon and Grammars

ARUNQERA is the lexical file of the system. It currently (May 1992) contains the following number of lexemes (T = target language, S = source language):

**Operational Group:**
Language	Lexemes	Target/Source
Spanish	18,773	T/S
English	17,691	T/S
German	13,500	T
French	16,006	T
Aymara	1,300	T

ARUNQERA's design allows any of these languages to be implemented either as a source or as a target language. The file can be accessed in several language directions simultaneously and in shared mode from the server, according to the demand generated on the network by the multilingual translator and the lexicographer.

This lexical file is organized so that user terminology subdictionaries and context subdictionaries can be treated as independent subsets of the whole lexicon. The method of concept coding also allows lexemes to be added in various languages simultaneously and independently (in different file copies), while ensuring lexical consistency and database integrity.
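One way to picture concept coding is a table keyed by a language-independent concept code, with one lexeme per language under each code. The sketch below is an assumption made for illustration only: the text does not describe ARUNQERA's actual file layout, and the codes, language tags, and function name are hypothetical. It shows how lexemes added independently in separate copies could be merged while checking consistency.

```python
def merge_copies(base, *copies):
    """Merge lexicon copies keyed by concept code, rejecting conflicting entries.

    Each lexicon maps a concept code to a {language: lexeme} dictionary.
    """
    merged = {code: dict(forms) for code, forms in base.items()}
    for copy in copies:
        for code, forms in copy.items():
            entry = merged.setdefault(code, {})
            for lang, lexeme in forms.items():
                if lang in entry and entry[lang] != lexeme:
                    # Two copies disagree on the same concept and language.
                    raise ValueError(f"conflict for {code}/{lang}: "
                                     f"{entry[lang]!r} vs {lexeme!r}")
                entry[lang] = lexeme
    return merged


if __name__ == "__main__":
    base = {"C001": {"es": "casa", "en": "house"}}
    german_copy = {"C001": {"es": "casa", "en": "house", "de": "Haus"}}
    french_copy = {"C001": {"es": "casa", "en": "house", "fr": "maison"}}
    print(merge_copies(base, german_copy, french_copy))
```

Because every language hangs off the same concept code, work done on the German copy and on the French copy never touches the same field, which is what makes the independent updates mergeable in this simplified picture.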

Other grammatical knowledge of the system is stored in special files containing verbal terminations and syntagmatic formulas, organized in a multilingual structure similar to that of the lexical file ARUNQERA.

The collection of syntagmatic formulas, about six hundred per language, can accurately represent the syntax of millions of different sentences. Each formula was carefully derived from the syntactic analysis of sentence structures encountered over years of translation experience with ATAMIRI.

How does the system help you?

First you run the WP interface program to create a DP text file from your source WP document. This program automatically pre-edits the source document for you; for example, it eliminates underscores, sets aside all editing symbols and formats, and executes the subroutines needed to ensure that the document is recorded sentence by sentence. It is run only once, but the DP text file it creates can be used repeatedly by the lexical analyzer or by the translator.

Next you may run the lexical analyzer, which delivers the list of words occurring in the source text but not found in ATAMIRI's lexical file. This list mostly contains the specialized terminology of the source document; if the text belongs to a new semantic field not previously handled by the system, the missing-words list might amount to 3 to 10% of the total number of words. Once the lexical file has been enriched with the new terms on the list, any new text in the same field will show a reduced missing-words rate of less than 2% for the given source language.

The lexical analyzer can be run in multilingual mode, with the option to choose up to five target languages to be analyzed simultaneously. This upper limit can be extended if enough memory is available. The program analyzes more than 300,000 words per hour.

The list delivered by the lexical analyzer shows not only the terms missing in the source language, but also those missing in any of the chosen target languages, even if they exist in the source language. This happens because the lexical file does not need to be equally enriched in all implemented languages. Misspellings in the text also appear in the list, so the analyzer doubles as an effective spelling checker.
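The behaviour of the missing-words list can be sketched as follows. This is a simplified illustration rather than the real analyzer: it assumes a lexicon that maps a hypothetical concept code to a `{language: lexeme}` dictionary, looks words up by exact form only (no handling of inflection), and uses invented names and sample entries.

```python
import re


def missing_terms(text, lexicon, source, targets):
    """Return {language: sorted source words that lack an entry in that language}."""
    # Index the source-language lexemes by their lower-cased form.
    source_index = {}
    for code, forms in lexicon.items():
        if source in forms:
            source_index[forms[source].lower()] = code

    # Distinct alphabetic words of the text, in alphabetical order.
    words = sorted(set(re.findall(r"[^\W\d_]+", text.lower())))

    report = {lang: [] for lang in [source, *targets]}
    for word in words:
        code = source_index.get(word)
        if code is None:
            # Unknown in the source language: a new term or a misspelling.
            report[source].append(word)
            continue
        for lang in targets:
            if lang not in lexicon[code]:
                # Known in the source language but not yet in this target.
                report[lang].append(word)
    return report


if __name__ == "__main__":
    lexicon = {
        "C1": {"en": "house", "es": "casa", "de": "Haus"},
        "C2": {"en": "valve", "es": "válvula"},
    }
    print(missing_terms("House valve gasket.", lexicon, "en", ["es", "de"]))
```

In the sample run, a word absent from the English entries is reported for the source language, while a word that has an English and a Spanish entry but no German one is reported only under German, mirroring the case where the lexical file is not equally enriched in all languages.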

Once you have entered the missing terms for your document, you can run the translator and obtain a first draft. The translator program can operate in multilingual mode, so that you can generate up to five translation drafts simultaneously in the chosen target languages. ATAMIRI generates the draft at a rate of more than 200,000 words per hour.

At this stage, with the help of the text handling module, you can proceed to revise the draft. For this purpose the system offers a bilingual revision screen with WP functions and look-up access to the lexical file. In a multilingual environment you may distribute the drafts among different revisers, one workstation per target language, all post-editing at the same time.

After polishing the draft, you run the text-to-document interface program, which creates your translated WP document with the same editing structure as the source document. If you prefer, you may do the post-editing directly on the translated WP document, on a PC at home.

In this way, ATAMIRI helps the translation center by immediately providing a high-quality draft with consistent terminology and by allowing efficient post-editing with adequate WP tools. The specialized lexical research needs to be done only once for each new entry, which then remains available for future translations and for everybody on the network who has access to the lexical server.

Performance Measurements

According to various studies of human translator performance, average productivity is 1,500 words per day, i.e. six pages per day, including terminology research, writing the draft and revising it. Higher translation speeds can be sustained only by a person under stress for a short period of time. An upper limit of 48 pages per day is set by the maximum proofreading speed (source and target text) during revision of a perfect draft.

Provided that the lexicographic task has been adequately accomplished, the draft generated by ATAMIRI allows measured post-editing speeds of between 24 and 36 pages per day. A draft that has been 30% retranslated can allow a post-editing speed as high as 55 pages per day, since the retranslation feature marks those sentences from prior translations that no longer need revision.
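The page figures quoted in this section can be put on a common footing with a short calculation. The sketch below relies only on the numbers given in the text; the page size of 250 words is derived from the stated equivalence of 1,500 words and six pages, and the function names are hypothetical.

```python
HUMAN_WORDS_PER_DAY = 1500
HUMAN_PAGES_PER_DAY = 6
# Page size implied by "1,500 words per day, i.e. six pages per day".
WORDS_PER_PAGE = HUMAN_WORDS_PER_DAY / HUMAN_PAGES_PER_DAY


def words_per_day(pages_per_day):
    """Convert a daily page rate into a daily word rate."""
    return pages_per_day * WORDS_PER_PAGE


def gain_over_human(pages_per_day):
    """Ratio of a daily page rate to the unaided human baseline."""
    return pages_per_day / HUMAN_PAGES_PER_DAY


if __name__ == "__main__":
    for label, pages in [("post-editing, low", 24), ("post-editing, high", 36),
                         ("proofreading limit", 48), ("30% retranslated", 55)]:
        print(f"{label}: {pages} pages = {words_per_day(pages):.0f} words/day, "
              f"{gain_over_human(pages):.1f}x the human baseline")
```

On these figures, post-editing an ATAMIRI draft corresponds to four to six times the unaided baseline, and the 55-page figure can exceed the 48-page proofreading limit only because the retranslated sentences are excluded from revision.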

The post-editing speed depends on the draft quality, which diminishes in direct proportion to the number of keystrokes needed to turn the draft into the final text. Depending on how difficult the source text is (excluding poetry), ATAMIRI's draft requires a number of post-editing keystrokes ranging between 5% (quality Q = 95%) and 18% (Q = 82%) of the total number of keystrokes necessary to write the whole translation. Experience with MT systems has shown that translation software is economically interesting only when the delivered quality is above 75%.
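The quality figures above are consistent with reading Q as 100% minus the share of post-editing keystrokes. The following sketch makes that reading explicit; the formula is inferred from the two quoted pairs (5% and Q = 95%, 18% and Q = 82%), and the names and the sample keystroke counts are hypothetical.

```python
ECONOMIC_THRESHOLD = 75.0  # quality (%) above which MT is said to pay off


def draft_quality(post_edit_keystrokes, total_keystrokes):
    """Quality Q in percent: share of the translation that needed no retyping."""
    if total_keystrokes <= 0:
        raise ValueError("total_keystrokes must be positive")
    return 100.0 * (total_keystrokes - post_edit_keystrokes) / total_keystrokes


def is_economical(quality):
    """True when the quality exceeds the 75% threshold quoted in the text."""
    return quality > ECONOMIC_THRESHOLD


if __name__ == "__main__":
    # Illustrative counts for a translation of 10,000 keystrokes.
    for edits in (500, 1800, 3000):
        q = draft_quality(edits, 10_000)
        print(f"{edits} correction keystrokes: Q = {q:.0f}%, "
              f"economical: {is_economical(q)}")
```

With these sample counts, the 5% and 18% cases reproduce the quoted qualities of 95% and 82%, while a draft needing 30% of the keystrokes would fall below the 75% threshold.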

Installation Requirements

ATAMIRI performs best when the system is installed as a multi-user facility in a DOS/Novell network of 386 microcomputers, where the lexical file resides on the server and is shared by various users.

Running ATAMIRI requires at least 3 MB of storage for translation into two target languages and 5 MB for translation into five target languages. The minimum disk capacity required is 40 MB. The system server should have a disk of at least 60 MB for programs and dictionaries; a 100 MB server is recommended.

ATAMIRI can be installed on a 386 notebook as a single-user facility; a 40 MB disk is required. In this minimum configuration the system cannot be exploited to its full potential.

The old version for the Wang VS machine has been discontinued, but the Wang WP and WP Plus interface subroutines are still available, so a VS user with a 386 PC attached to it can translate Wang WP or WP Plus documents. The VS could also be used as the network server if several PCs are connected to it.

For users handling source documents that are available only on paper and not on magnetic media, it is recommended to install an optical character recognition (OCR) unit with the corresponding interface software. The user can choose the word processing software of his preference, provided it has the option to generate an ASCII file for the text.

Potential Users of ATAMIRI

The system has been designed as a tool for a translation center, to assist, not to replace, professional translators in their tasks of searching for and storing terminology, writing the draft and revising the final text.

Potential users are:

Translation centers and international conference organizers that deal with large volumes of messages and documents in a multilingual environment.

Fast translation services that sell translations, either in draft or in final form, using telecommunications.

Training centers that conduct workshops on machine translation for professional translators.

Universities where the system can be used as a linguistic research tool and training facility.