The Multilingual Translator Software ATAMIRI
Presented at the VII Simposio Ibero-Americano de Terminologia e Indústrias da Língua
Fundación Calouste Gulbenkian, Lisboa, Portugal, from November 14th to 17th, 2000
Iván Guzmán de Rojas
The European Council (Cannes, June 1995) has stressed the relevance of linguistic
and cultural aspects of the Information Society in Europe. The G7 Conference of
Ministers on The Information Society and Development has emphasized that
information technologies have a tremendous potential to preserve and exploit
cultural and linguistic diversity. The Discussion Document (July 1997) Living and
Working Together in the Information Society points out the central role that
multilinguality should play in high-bandwidth digital communications and the
World Wide Web. (Link to the Bangemann Report - 1997)
The Survey of Machine Translation products and services (September 1996)[1],
carried out by Equipe Consortium Ltd. on behalf of the European Council,
concludes in its summary of data and findings, among other things, that:
· Some of the best technologies may not be in commercially available products.
· Only a few products are plausible for handling EC translation loads.
· The most established translation services are currently provided by product
vendors, but these are rudimentary.
The Survey contains a selected list with data on 25 machine translation software
developers: 5 from the USA; 3 each from Canada and Germany; 2 each from Denmark,
Finland, Russia and Japan; and one each from France, Belgium, Greece, Spain,
Ukraine and Bolivia. It is remarkable that famous systems like ARIANE (Grenoble)
and EUROTRA (EC), which a decade ago seemed very promising, are not even
mentioned in the Survey.
Comparing the technical characteristics of the software produced by these
developers, it is surprising to find that only one, IGRAL from Bolivia with its
ATAMIRI software, has been able to design and develop a truly multilingual
machine translator, i.e. one program and one lexical and grammatical database
supporting various languages, each capable of operating either as source or as
target language, with simultaneous translation from any source language into
various target languages. All other vendors try to cover the multilingual demand
with multiple programs and dictionaries developed per language pair, mostly
capable of operating in only one direction; few are bidirectional.
Another interesting finding in the Survey is that only ATAMIRI is capable of
translating at high speed (over one million words per hour on a 400 MHz Windows
NT workstation), while all the rest have reported speeds under 100,000 words per
hour running on powerful mainframes. This means that ATAMIRI can easily translate
a 400-word web page in about one second, while other translators would need at
least 15 to 45 seconds. These figures are critical if we take into account the
additional time needed for web page transmission when the MT system is located on
the Web at the information providers or at the search engines, as would be
convenient for terminology database management reasons.
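The per-page figures follow from simple arithmetic on the quoted throughputs. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and the rates are the round numbers cited from the Survey, not new measurements. At exactly one million words per hour a 400-word page takes about 1.4 seconds, so the "about one second" figure corresponds to a rate somewhat above one million; conversely, 15 to 45 seconds per page corresponds to roughly 96,000 down to 32,000 words per hour.

```python
def seconds_for_page(words_per_page: int, words_per_hour: float) -> float:
    """Time in seconds to translate one page at a given hourly throughput."""
    return words_per_page * 3600.0 / words_per_hour


def implied_words_per_hour(words_per_page: int, seconds: float) -> float:
    """Hourly throughput implied by a given per-page translation time."""
    return words_per_page * 3600.0 / seconds


if __name__ == "__main__":
    page = 400  # words, the page size used in the text
    print(f"1,000,000 words/h: {seconds_for_page(page, 1_000_000):.2f} s per page")
    print(f"  100,000 words/h: {seconds_for_page(page, 100_000):.2f} s per page")
    print(f"15 s per page implies {implied_words_per_hour(page, 15):,.0f} words/h")
    print(f"45 s per page implies {implied_words_per_hour(page, 45):,.0f} words/h")
```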
User respondents to the Survey are using only 14 unidirectional language pairs:
English is the source language in 5 pairs, German in 3 pairs, and Chinese,
Finnish, French, Italian, Russian and Spanish in only one pair each. Japanese was
not considered in the user survey, as Japanese is a low-priority language for the
EC. On the other hand, the range of languages covered is surprisingly wide:
vendors currently offer 162 language pairs in total, and an additional 62
language pairs are in development. Why this discrepancy between the 162 pairs
offered and the 14 pairs actually used in production environments? Poor
translation quality, low speed in the whole process, high costs?
The high cost of multilinguality is well recognized in the EC. It is caused by
the diverse requirements of customers and commercial partners in a context where
a growing amount of trade is being carried out electronically across linguistic
borders, and where global competitiveness rests increasingly on higher
information productivity and communication effectiveness.
The estimated cost of developing and implementing N languages in language-pair
transfer-based MT systems is proportional to the N(N-1) translation directions in
the multilingual set, while for an interlingua-based system, like ATAMIRI's
technology, the cost is proportional only to the N languages[2].
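The scaling argument can be checked numerically. The following is a minimal sketch, assuming (as footnote [2] does) the same proportionality factor k for both models and taking k as one million US$; the function names are illustrative only.

```python
K_MILLION_USD = 1  # assumed cost factor k, in millions of US$ (order of magnitude from footnote [2])


def pairwise_directions(n: int) -> int:
    """Translation directions a language-pair system must cover for n languages."""
    return n * (n - 1)


def cost_ratio(n: int) -> int:
    """Ratio R of pairwise cost to interlingua cost: k*n*(n-1) / (k*n) = n - 1."""
    return pairwise_directions(n) // n


if __name__ == "__main__":
    print(" N  directions  pairwise (M US$)  interlingua (M US$)    R")
    for n in (2, 16, 21, 131):
        directions = pairwise_directions(n)
        print(f"{n:3d}  {directions:10d}  {directions * K_MILLION_USD:16d}"
              f"  {n * K_MILLION_USD:19d}  {cost_ratio(n):3d}")
```

For N=16 this gives 240 directions, and for N=21 it gives 420 directions against 21 languages, a ratio of 20.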
An example of the high cost of multilinguality is the famous EUROTRA Project, to
which the EC assigned a total budget of at least 30 million US$ to produce
eight-language multilingual translator software. After almost ten years, no
prototype was known to have achieved the project objectives. In the workshop held
in March 1985 at the OAS in Washington, ATAMIRI demonstrated the feasibility of
low-cost multilingual MT operation with 10 languages[3] implemented, though with
different dictionary sizes and translation quality levels.
The straightforward proof that ATAMIRI is built on a very advanced language
engineering technology is the fact that its twenty years of R&D costs have been
covered by the restricted personal budget of its creator, plus some income from a
few users and translation service clients. That is why its dictionary has a
relatively low number of entries per language compared to other products on the
market.
The most economically significant advantage of ATAMIRI comes from its
table-driven multilingual translator engine, which uses a matrix[4] language
representation that allows a new language to be implemented without much
additional programming effort. As soon as the new language has enough lexical
entries in a given domain, it is ready to be used both as source and as target
language in relation to all other languages already introduced in the system.
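The practical consequence, that a newly added language becomes usable in every direction at once, can be illustrated with a toy pivot lexicon. This is a deliberately simplified, hypothetical sketch: it does word-for-word lookup through invented concept identifiers and does not reproduce ATAMIRI's matrix language representation or its grammar handling. All names and entries below are assumptions made for illustration.

```python
from typing import Dict

# Hypothetical shared data: one table per language, mapping a surface word
# to a language-neutral concept identifier (the "pivot").
LEXICON: Dict[str, Dict[str, str]] = {
    "en": {"house": "C_HOUSE", "water": "C_WATER"},
    "es": {"casa": "C_HOUSE", "agua": "C_WATER"},
}


def add_language(code: str, entries: Dict[str, str]) -> None:
    """Register a new language by supplying only its own word-to-concept table."""
    LEXICON[code] = dict(entries)


def translate_word(word: str, source: str, target: str) -> str:
    """Go from source word to concept, then to the first target word for that concept."""
    concept = LEXICON[source][word]
    for target_word, target_concept in LEXICON[target].items():
        if target_concept == concept:
            return target_word
    raise KeyError(f"no {target} entry for concept {concept}")


def translate_to_all(word: str, source: str) -> Dict[str, str]:
    """Translate one source word into every other registered language."""
    return {code: translate_word(word, source, code)
            for code in LEXICON if code != source}


# Adding Portuguese requires no code changes and no per-pair dictionaries.
add_language("pt", {"casa": "C_HOUSE", "água": "C_WATER"})

if __name__ == "__main__":
    print(translate_word("house", "en", "pt"))  # new language as target
    print(translate_word("água", "pt", "en"))   # new language as source
    print(translate_to_all("agua", "es"))
```

The design point is that the engine code never mentions a specific language pair: the only per-language work is filling in one table, which is why cost grows with N rather than N(N-1).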
With ATAMIRI, a full implementation of 16 languages (equivalent to 240 language
pairs) has an estimated additional R&D cost[5] of 10 million US$, mainly for
lexicographic work and translation quality improvements.
The development time and costs required for language-pair-based MT systems
conspire against the urgently needed coverage of more languages, such as the less
widely spoken European languages and globally strategic languages. Under these
conditions it is practically impossible to conceive of a truly pluricultural
worldwide network.
In web page translation, even a human translator faces the tedious and difficult
task of discriminating between terms that properly belong to the text being
translated and embedded markers of the hypertext language. Very often MT systems
misplace markers that syntactically belong to the sentence structure and should
be relocated according to the target language syntax; otherwise the underlined
links will not make sense in the translated web page. This kind of syntactic
transformation is handled very well by ATAMIRI, thanks to the underlying general
language representation in its translator engine design.
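One way to picture the marker problem is to attach each hypertext marker to the word it encloses, so that the marker travels with the word when the target syntax reorders the sentence. The sketch below is a hypothetical toy, not ATAMIRI's method: the lexicon, the reordering rule (given here as an explicit index list) and the file name `colors.html` are invented for the example.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple

# A token is a word plus the href of the link that encloses it (None if unlinked).
Token = Tuple[str, Optional[str]]


def transfer(tokens: List[Token], lexicon: Dict[str, str], order: List[int]) -> List[Token]:
    """Translate each word and reorder; every token keeps its own link marker."""
    translated = [(lexicon[word], href) for word, href in tokens]
    return [translated[i] for i in order]


def render(tokens: List[Token]) -> str:
    """Rebuild the markup, wrapping each run of tokens that share a link in one <a> tag."""
    parts = []
    for href, run in groupby(tokens, key=lambda token: token[1]):
        text = " ".join(word for word, _ in run)
        parts.append(text if href is None else f'<a href="{href}">{text}</a>')
    return " ".join(parts)


if __name__ == "__main__":
    # English source: the <a href="colors.html">red</a> house
    source = [("the", None), ("red", "colors.html"), ("house", None)]
    lexicon = {"the": "la", "red": "roja", "house": "casa"}  # toy entries
    order = [0, 2, 1]  # Spanish places the adjective after the noun
    print(render(transfer(source, lexicon, order)))
```

Because the link stays on "roja" rather than on the second position in the sentence, the underlined text in the translated page still points at the word it was meant to annotate.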
The development and promotion of interoperability guides and standards for
language databases and components becomes almost impossible to achieve unless a
genuinely multilingual technology is applied. It seems that such a technology has
so far been operating only in ATAMIRI, since 1985, though only as a system
prototype whose full potential is barely exploited, for lack of financial support
for its further development and implementation at an operational level.
Worldwide cooperation is required to mobilize the competencies needed to address
the multilinguality issue. As the creator of ATAMIRI, I urge leaders of
institutions and corporations that promote Language Engineering projects, and
government authorities concerned with the problems of Human Language
Technologies, to support a thorough assessment of ATAMIRI and benchmark
activities to test its multilingual technology, its capacity for translation
quality improvement and its operational speed, especially in web page
translation.
Iván Guzmán de Rojas
[2] If we assume that the proportionality factor k is approximately equal in both
cases, the ratio R of the pairwise model costs to the interlingua model costs is:
R = kN(N-1) / (kN) = N-1
This cost advantage of the interlingua model grows from R=1 for N=2 (one pair of
languages) to R=20 for the case of 21 languages, whose implementation is urgent
on the Internet. Experience with the language-pairwise model has shown that the
factor k is of the order of one million US$ per language. Therefore, a comparison
for N=21 means an economic advantage where the interlingua model is 20 times
better, i.e., a saving of about 400 million US$ (420 million against 21 million).
This position of superiority keeps growing as the number of languages in the
multilingual environment increases; for example, for the 131 ISO-639 coded
languages the advantage factor is 130!
[3] With English and Spanish, both as source and target languages, ATAMIRI's
lexical database contains about 25,000 entries. With fewer entries, at an
experimental level and only as target languages, the system is able to translate
into French, Portuguese, Italian, German, Dutch, Swedish, Hungarian and Aymara.
The capacity to implement more languages is practically unlimited.
[4] Iván Guzmán de Rojas, "ATAMIRI - Sistema de traducción interlingüe utilizando
el lenguaje Aymara". Translation of the paper presented in Budapest, August 1988,
under the title "ATAMIRI - Interlingual MT Using the Aymara Language". Published
in: Dan Maxwell / Klaus Schubert / A. P. M. Witkam (eds.), New Directions in
Machine Translation, Conference Proceedings, Budapest, 18-19 August 1988.
Budapest: John von Neumann Society for Computing Sciences / Dordrecht/Providence:
Foris Publishers.
[5] Compared with the 240 million US$ that would be required using the
language-pairwise transfer technology, with the tree language representation
peculiar to each language pair.