Memorandum on the Multilingual Translator ATAMIRI

The Multilingual Translator Software ATAMIRI

Presented at the VII Simposio Ibero-Americano de Terminologia e Indústrias da Língua

Fundación Calouste Gulbenkian, Lisboa, Portugal,

From November 14th to 17th, 2000

Iván Guzmán de Rojas

The European Council (Cannes, June 1995) has stressed the relevance of the linguistic and cultural aspects of the Information Society in Europe. The G7 Conference of Ministers on the Information Society and Development has emphasized that information technologies have tremendous potential to preserve and exploit cultural and linguistic diversity. The Discussion Document (July 1997) Living and Working Together in the Information Society points out the central role that multilinguality should play in high-bandwidth digital communications and the World Wide Web. (Link to the Bangemann Report - 1997)

The Survey of Machine Translation products and services (September 1996)[1], carried out by Equipe Consortium Ltd. on behalf of the European Council, concludes in its summary of data and findings, among other points, that:

· Some of the best technologies may not be in commercially available products.

· Only a few products are plausible for handling EC translation loads.

· The most established translation services are currently provided by product vendors, but these are rudimentary.

The Survey contains a selected list with data on 25 machine translation software developers: 5 from the USA, 3 each from Canada and Germany, 2 each from Denmark, Finland, Russia and Japan, and one each from France, Belgium, Greece, Spain, Ukraine and Bolivia. It is remarkable that famous systems like ARIANE (Grenoble) and EUROTRA (EC), which seemed very promising a decade ago, are not even mentioned in the Survey.

Comparing the technical characteristics of the software produced by these developers, it is surprising to find that only one, IGRAL from Bolivia with its ATAMIRI software, has been able to design and develop a truly multilingual machine translator, i.e. one program with one lexical and grammatical database, supporting various languages that can each operate as either source or target language, with simultaneous translation from any source language into several target languages. All other vendors try to cover the multilingual demand with multiple programs and dictionaries developed per language pair. Most of these operate in one direction only; few are bidirectional.

Another interesting finding in the Survey is that only ATAMIRI is capable of translating at high speed (over one million words per hour on a 400 MHz Windows NT workstation), while all the rest have reported speeds under 100,000 words per hour running on powerful mainframes. This means that ATAMIRI can translate a 400-word web page in roughly one second, while other translators would need at least 15 to 45 seconds. These figures are critical if we take into account the additional time needed for web page transmission when the MT system is located on the Web, at the information providers or at the search engines, as would be convenient for terminology database management.
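The per-page times above follow from simple arithmetic on the reported throughputs. The short sketch below is only a check of that arithmetic, assuming a sustained throughput and ignoring transmission time; the function name is illustrative and not part of any MT product.

```python
def seconds_to_translate(words: int, words_per_hour: float) -> float:
    """Seconds needed to translate `words` at a sustained throughput."""
    return words * 3600.0 / words_per_hour


PAGE_WORDS = 400  # web page size used in the text

# Throughput bounds quoted from the Survey figures in the text.
fast = seconds_to_translate(PAGE_WORDS, 1_000_000)  # "over one million words per hour"
slow = seconds_to_translate(PAGE_WORDS, 100_000)    # "under 100,000 words per hour"

print(f"1,000,000 words/hour: {fast:.2f} s per page")
print(f"  100,000 words/hour: {slow:.1f} s per page")
```

At exactly one million words per hour the page takes 1.44 seconds, so the one-second figure corresponds to a throughput of about 1.44 million words per hour, which is consistent with "over one million". At 100,000 words per hour the same page takes 14.4 seconds, and any slower system takes longer.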

User respondents to the Survey are using only 14 unidirectional language pairs: English is the source language in 5 pairs, German in 3 pairs, and Chinese, Finnish, French, Italian, Russian and Spanish in only one pair each. Japanese was not considered in the user survey, as Japanese is a low-priority language for the EC. On the other hand, the range of languages covered is surprisingly wide: in total, 162 language pairs are currently offered by the vendors and an additional 62 language pairs are in development. Why is there such a discrepancy between the 162 pairs on offer and the 14 pairs actually used in production environments? Poor translation quality, low speed across the whole process, high costs?

The high cost of multilinguality is well recognized in the EC. It is caused by the diverse requirements of customers and commercial partners in a context where a growing amount of trade is carried out electronically across linguistic borders, and where global competitiveness rests increasingly on higher information productivity and communication effectiveness.

The estimated cost of developing and implementing N languages in a language-pair, transfer-based MT system is proportional to the N(N-1) translation directions in the multilingual set, while for an interlingua-based system, like ATAMIRI's technology, the cost is proportional only to the N languages[2].
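The difference between the two cost models can be tabulated directly. The sketch below simply counts translation directions against languages and, like footnote [2], assumes the same proportionality factor applies in both models; the function names are illustrative.

```python
def pairwise_directions(n: int) -> int:
    """Translation directions a language-pair (transfer) system must cover."""
    return n * (n - 1)


def interlingua_modules(n: int) -> int:
    """Language modules an interlingua system needs: one per language."""
    return n


def cost_ratio(n: int) -> float:
    """Pairwise cost divided by interlingua cost, assuming an equal factor k."""
    return pairwise_directions(n) / interlingua_modules(n)


# Language counts mentioned in the text and its footnotes.
for n in (2, 8, 16, 21):
    print(f"N={n:>2}: {pairwise_directions(n):>3} directions vs "
          f"{interlingua_modules(n):>2} languages, ratio {cost_ratio(n):.0f}")
```

For 16 languages this gives 240 directions, and for 21 languages a ratio of 20, matching the figures used elsewhere in this memorandum.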

An example of the high cost of multilinguality is the famous EUROTRA Project, to which the EC assigned a total budget of at least US$30 million to produce an eight-language multilingual translator. After almost ten years, no prototype was known to have achieved the project objectives. In the workshop held in March 1985 at the OAS in Washington, ATAMIRI demonstrated the feasibility of low-cost multilingual MT operation with 10 languages[3] implemented, though with different dictionary sizes and translation quality levels.

The straightforward proof that ATAMIRI is built on very advanced language engineering technology is the fact that its twenty years of R&D costs have been covered by the restricted personal budget of its creator, plus some income from a few users and translation service clients. That is why its dictionary has a relatively low number of entries per language compared to other products on the market.

The most economically significant advantage of ATAMIRI comes from its table-driven multilingual translator engine, which uses a matrix[4] language representation that allows a new language to be implemented without much additional programming effort. As soon as the new language has enough lexical entries in a given domain, it is ready to be used both as source and as target language in relation to all other languages already introduced in the system.
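The effect of adding one language to a shared, table-driven representation can be illustrated with a deliberately tiny sketch. This is not ATAMIRI's matrix representation or its engine: the concept identifiers, the `LEXICON` table and all function names are hypothetical, and the example handles isolated words only, with no syntax.

```python
# Hypothetical toy lexicon: one table per language, keyed by shared concept ids.
LEXICON = {
    "en": {"HOUSE": "house", "WATER": "water"},
    "es": {"HOUSE": "casa", "WATER": "agua"},
}


def add_language(code, entries):
    """Register a new language by supplying only its own table."""
    LEXICON[code] = dict(entries)


def translate_word(word, source, target):
    """Map source word -> shared concept -> target word (None if unknown)."""
    for concept, form in LEXICON[source].items():
        if form == word:
            return LEXICON[target].get(concept)
    return None


def directions():
    """All (source, target) directions currently available."""
    return [(s, t) for s in LEXICON for t in LEXICON if s != t]


print(len(directions()))  # directions with two languages

# Adding Portuguese requires one new table and no new code.
add_language("pt", {"HOUSE": "casa", "WATER": "água"})

print(len(directions()))  # directions with three languages
print(translate_word("water", "en", "pt"))
print(translate_word("água", "pt", "es"))
```

In this toy setting the new language becomes usable as both source and target against every language already present, which is the property the paragraph above describes; a real system would of course also need grammatical data for each language.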

With ATAMIRI, a full implementation of 16 languages (equivalent to 240 language pairs) has an estimated additional[5] R&D cost of US$10 million, mainly for lexicographic work and translation quality improvements.

The development time and costs required for language-pair-based MT systems conspire against the urgently needed coverage of more languages, such as the less widely spoken European languages and globally strategic languages. On that path it is practically impossible to conceive of a truly pluricultural worldwide network.

In web page translation, even a human translator faces the tedious and difficult task of discriminating between terms that properly belong to the text being translated and the embedded markers of the hypertext language. Very often MT systems misplace markers that syntactically belong to the sentence structure and should be relocated according to the target language syntax; otherwise the underlined links will not make sense in the translated web page. This kind of syntactic transformation is handled very well by ATAMIRI thanks to the underlying general language representation in its translator engine design.

The development and promotion of interoperability guides and standards for language databases and components becomes almost impossible to achieve unless a genuinely multilingual technology is applied. It seems that such a technology currently operates only in ATAMIRI, where it has been running since 1985, though only as a system prototype whose full potential is barely exploited for lack of financial support for its further development and implementation at operational level.

Worldwide cooperation is required to mobilize the competencies needed to address the multilinguality issue. As the creator of ATAMIRI, I urge leaders of institutions and corporations that promote Language Engineering projects, and government authorities concerned with the problems of Human Language Technologies, to support a thorough assessment of ATAMIRI and benchmark activities to test its multilingual technology, its capacity for translation quality improvement and its operational speed, especially in web page translation.

Iván Guzmán de Rojas

IguzmanRR@hotmail.com

[1] http://158.169.50.95:10080/langeng/reps/mtsurvey/mtsurvey.html

[2] If we assume that the proportionality factor k is approximately equal in both cases, the ratio R of the cost of the pairwise model to that of the interlingua model is: R = kN(N-1)/kN = N-1

This cost advantage of the interlingua model grows from R=1 for N=2 (one pair of languages) to R=20 for the case of 21 languages, whose implementation on the Internet is urgent. Experience with the pairwise language model has shown that the factor k is of the order of one million US$ per language. Therefore, for N=21 the interlingua model is 20 times cheaper, an economic advantage of about US$400 million. This advantage keeps growing as the number of languages in the multilingual environment increases: for the 131 ISO-639 coded languages, for example, the advantage factor is 130!

[3] With English and Spanish, both as source and target languages, ATAMIRI's lexical database contains about 25,000 entries. With fewer entries, at an experimental level and only as target languages, the system is able to translate into French, Portuguese, Italian, German, Dutch, Swedish, Hungarian and Aymara. The capacity to implement more languages is practically unlimited.

[4] Iván Guzmán de Rojas, "ATAMIRI - Sistema de traducción interlingüe utilizando el lenguaje Aymara". Translation of the paper presented in Budapest, August 1988, under the title "ATAMIRI - Interlingual MT Using the Aymara Language".

Published in: Dan Maxwell / Klaus Schubert / A. P. M. Witkam (eds.), New Directions in Machine Translation, Conference Proceedings, Budapest 18/19-8-1988.

Budapest: John von Neumann Society for Computing Sciences / Dordrecht/Providence: Foris Publishers

[5] Compared with the US$240 million that would be required using pairwise language transfer technology, with a tree language representation peculiar to each language pair.