In the future, Arlette Attali is thinking about "contributing to the development of the linguistic tools associated with the FRANTEXT database and getting teachers, researchers and students to know them." In her e-mail of June 11, 1998, she also explained the changes brought by the Internet to her professional life:
"As I was more specially assigned to the development of textual databases at the INaLF, I had to explore the websites giving access to electronic texts and test them. I became a 'textual tourist' with the good and bad sides of this activity. The tendency to go quickly from one link to another, and to skip through the information, was a permanent danger — it is necessary to target what you are looking for if you don't want to lose your time. The use of the Web totally changed my working methods — my investigations are not only bookish and within a narrow circle anymore, on the contrary they are expanding thanks to the electronic texts available on the Internet."
The ARTFL Project (ARTFL: American and French Research on the Treasury of the French Language) is a cooperative project established in 1981 by the Institut national de la langue française (INaLF) (National Institute of the French Language, based in France) and the Division of the Humanities of the University of Chicago. Its purpose is to be a research tool for scholars and students in all areas of French studies.
The origin of the project is a 1957 initiative of the French government to create a new dictionary of the French language, the Trésor de la Langue Française (Treasure of the French Language). In order to provide access to a large body of word samples, it was decided to transcribe an extensive selection of French texts for use with a computer. Twenty years later, a corpus totaling some 150 million words had been created, representing a broad range of written French — from novels and poetry to biology and mathematics — stretching from the 17th to the 20th centuries.
This corpus of French texts was an important resource not only for lexicographers, but also for many other types of humanists and social scientists engaged in French studies — on both sides of the Atlantic. The result of this realization was the ARTFL Project, as explained on its website:
"At present the corpus consists of nearly 2,000 texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. The eighteenth, nineteenth and twentieth centuries are about equally represented, with a smaller selection of seventeenth century texts as well as some medieval and Renaissance texts. We have also recently added a Provençal database that includes 38 texts in their original spellings. Genres include novels, verse, theater, journalism, essays, correspondence, and treatises. Subjects include literary criticism, biology, history, economics, and philosophy. In most cases standard scholarly editions were used in converting the text into machine-readable form, and the data contain page references to these editions."
One of the largest databases of its kind in the world, ARTFL permits both the rapid exploration of single texts and intertextual research of many kinds. ARTFL is now on the Web, and the system is available through the Internet to its subscribers. Access to the database is organized through a consortium of user institutions, in most cases universities and colleges, which pay an annual subscription fee.
The ARTFL Encyclopédie Project is currently developing an on-line version of Diderot and d'Alembert's Encyclopédie, ou Dictionnaire raisonné des sciences, des arts et des métiers, including all 17 volumes of text and 11 volumes of plates from the first edition, that is to say about 18,000 pages of text and exactly 20,736,912 words.
Published under the direction of Diderot between 1751 and 1772, the Encyclopédie counted as contributors the most prominent philosophers of the time: Voltaire, Rousseau, d'Alembert, Marmontel, d'Holbach, Turgot, etc.
"These great minds (and some lesser ones) collaborated in the goal of assembling and disseminating in clear, accessible prose the fruits of accumulated knowledge and learning. Containing 72,000 articles written by more than 140 contributors, the Encyclopédie was a massive reference work for the arts and sciences, as well as a machine de guerre which served to propagate Enlightened ideas […] The impact of the Encyclopédie was enormous, not only in its original edition, but also in multiple reprintings in smaller formats and in later adaptations. It was hailed, and also persecuted, as the sum of modern knowledge, as the monument to the progress of reason in the eighteenth century. Through its attempt to classify learning and to open all domains of human activity to its readers, the Encyclopédie gave expression to many of the most important intellectual and social developments of its time."
At present, while work continues on the fully navigational full-text version, ARTFL provides public access on its website to the Prototype Demonstration of Volume One. In autumn 1998 a preliminary version was released for consultation by all ARTFL subscribers.
Other ARTFL projects, listed in the Reference Collection on the ARTFL home page, are: the 1st (1694) and 5th (1798) editions of the Dictionnaire de l'Académie française; Jean Nicot's Trésor de la langue française (1606) dictionary; Pierre Bayle's Dictionnaire historique et critique (1740 edition, an image-only version); The Wordsmyth English Dictionary-Thesaurus; Roget's Thesaurus (1911 edition); Webster's Revised Unabridged Dictionary; the French Bible by Louis Segond and parallel Bibles in German, Latin and English; etc.
Created by Michael S. Hart in 1971, Project Gutenberg was the first information provider on the Internet. It is now the oldest digital library on the Web, and the biggest in terms of the number of works (1,500) which have been digitized for it, with 45 new titles per month. Michael Hart's purpose is to put as many literary texts as possible on the Web for free.
In his e-mail of August 23, 1998, Michael S. Hart explained:
"We consider e-text to be a new medium, with no real relationship to paper, other than presenting the same material, but I don't see how paper can possibly compete once people each find their own comfortable way to e-texts, especially in schools. […] My own personal goal is to put 10,000 e-texts on the Net, and if I can get some major support, I would like to expand that to 1,000,000 and to also expand our potential audience for the average e-text from 1.x% of the world population to over 10%… thus changing our goal from giving away 1,000,000,000,000 e-texts to 1,000 time as many… a trillion and a quadrillion in US terminology."
Project Gutenberg is now developing its foreign collections, as announced in the Newsletter of October 1997. In the Newsletter of March 1998, Michael S. Hart mentioned that Project Gutenberg's volunteers were now working on e-texts in French, German, Portuguese and Spanish, and he was also hoping to get some e-texts in the following languages: Arabic, Chinese, Danish, Dutch, Esperanto, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Latin, Lithuanian, Polish, Romanian, Russian, Slovak, Slovene, and Valencian (Catalan).
3.5. Terminological Databases
The free consultation of terminological databases on the Web is much appreciated by language specialists. Several terminological databases are maintained by international organizations: Eurodicautom, maintained by the Translation Service of the European Commission; ILOTERM, maintained by the International Labour Organization (ILO); the ITU Telecommunication Terminology Database (TERMITE), maintained by the International Telecommunication Union (ITU); and the WHO Terminology Information System (WHOTERM), maintained by the World Health Organization (WHO).
Eurodicautom is the multilingual terminological database of the Translation Service of the European Commission. Initially developed to assist in-house translators, it is consulted today by an increasing number of European Union officials other than translators, as well as by language professionals throughout the world. Its huge, constantly updated content is drafted in twelve languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Latin, Portuguese, Spanish and Swedish), and covers a broad spectrum of human knowledge, though the main core relates to European Union topics.
ILOTERM is the quadrilingual (English, French, German, Spanish) terminology database maintained by the Terminology and Reference Unit of the Official Documentation Branch (OFFDOC) of the International Labour Office (ILO), Geneva, Switzerland. Its primary purpose is to provide solutions, reflecting current usage, to terminological problems in the social and labor fields. Terms are entered in English with their French, Spanish and/or German equivalents. The database also includes records (in up to four languages) concerning the structure and programmes of the ILO, official names of international institutions, national bodies and employers' and workers' organizations, as well as titles of international meetings and instruments.
The ITU Telecommunication Terminology Database (TERMITE) is maintained by the Terminology, References and Computer Aids to Translation Section of the Conference Department of the International Telecommunication Union (ITU), Geneva, Switzerland. TERMITE (59,000 entries) is a quadrilingual (English, French, Spanish, Russian) terminological database which contains all the terms which appeared in ITU printed glossaries since 1980, as well as more recent entries relating to the different activities of the Union.
Maintained by the World Health Organization (WHO), Geneva, Switzerland, the WHO Terminology Information System (WHOTERM) includes: the WHO General Dictionary Index, giving access to an English glossary of terms with the French and Spanish equivalents for each term; three glossaries in English (Health for All, Programme Development and Management, and Health Promotion); the WHO TermWatch, an awareness service reflecting current WHO usage of technical terminology (though not necessarily terms officially approved by WHO); and a series of links to health-related terminology.
[In this chapter:]
[4.1. Translation Services / 4.2. Machine Translation / 4.3. Computer-Assisted Translation]
4.1. Translation Services
Maintained by Vorontsoff, Wesseling & Partners, Amsterdam, the Netherlands, Aquarius is a directory of translators and interpreters covering 6,100 translators, 800 translation companies, 91 specialized areas of expertise and 369 language combinations. This non-commercial project helps to locate and contact the best translators in the world directly, without intermediaries or agencies. The Aquarius database can be searched by location, language combination and specialization.
Founded by Bill Dunlap, Euro-Marketing Associates offers Global Reach, a methodology for companies to expand their Internet presence into a more international framework. This includes translating a website into other languages, actively promoting it, and using local banner advertising to increase local website traffic in all on-line countries. Bill Dunlap explains:
"Promoting your website is at least as important as creating it, if not more important. You should be prepared to spend at least as much time and money in promoting your website as you did in creating it in the first place. With the "Global Reach" program, you can have it promoted in countries where English is not spoken, and achieve a wider audience… and more sales. There are many good reasons for taking the on-line international market seriously. "Global Reach" is a means for you to extend your website to many countries, speak to on-line visitors in their own language and reach on-line markets there."
In his e-mail of December 11, 1998, he also explains how the use of the Internet changed his professional life:
"Since 1981, when my professional life started, I've been involved with bringing American companies in Europe. This is very much an issue of language, since the products and their marketing have to be in the languages of Europe in order for them to be visible here. Since the Web became popular in 1995 or so, I've turned these activities to their on-line dimension, and have come to champion European e-commerce among my fellow American compatriates. Most lately at Internet World in New York, I spoke about European e-commerce and how to use a website to address the various markets in Europe."
4.2. Machine Translation
Machine translation (MT) is the automated process of translating from one natural language to another. MT analyzes the text in the source language and automatically generates corresponding text in the target language.
Characterized by the absence of any human intervention during the translation process, machine translation (MT) is also called "fully automatic machine translation (FAMT)". It differs from "machine-aided human translation (MAHT)" or "computer-assisted translation (CAT)", which involves some interaction between the translator and the computer.
As SYSTRAN, a company specializing in translation software, explains on its website:
"Machine translation software translates one natural language into another natural language. MT takes into account the grammatical structure of each language and uses rules to transfer the grammatical structure of the source language (text to be translated) into the target language (translated text). MT cannot replace a human translator, nor is it intended to."
The European Association for Machine Translation (EAMT) gives the following definition:
"Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful for certain specific applications, usually in the domain of technical documentation. In addition, translation software packages which are designed primarily to assist the human translator in the production of translations are enjoying increasing popularity within professional translation organizations."
Machine translation is the earliest type of natural language processing. Here are the explanations given by Globalink:
"From the very beginning, machine translation (MT) and natural language processing (NLP) have gone hand-in-hand with the evolution of modern computational technology. The development of the first general-purpose programmable computers during World War II was driven and accelerated by Allied cryptographic efforts to crack the German Enigma machine and other wartime codes. Following the war, the translation and analysis of natural language text provided a testbed for the newly emerging field of Information Theory.
During the 1950s, research on Automatic Translation (known today as Machine Translation, or 'MT') took form in the sense of literal translation, more commonly known as word-for-word translations, without the use of any linguistic rules.
The Russian project initiated at Georgetown University in the early 1950s represented the first systematic attempt to create a demonstrable machine translation system. Throughout the decade and into the 1960s, a number of similar university and government-funded research efforts took place in the United States and Europe. At the same time, rapid developments in the field of Theoretical Linguistics, culminating in the publication of Noam Chomsky's Aspects of the Theory of Syntax (1965), revolutionized the framework for the discussion and understanding of the phonology, morphology, syntax and semantics of human language.
In 1966, the U.S. government-issued ALPAC report offered a prematurely negative assessment of the value and prospects of practical machine translation systems, effectively putting an end to funding and experimentation in the field for the next decade. It was not until the late 1970s, with the growth of computing and language technology, that serious efforts began once again. This period of renewed interest also saw the development of the Transfer model of machine translation and the emergence of the first commercial MT systems.
While commercial ventures such as SYSTRAN and METAL began to demonstrate the viability, utility and demand for machine translation, these mainframe-bound systems also illustrated many of the problems in bringing MT products and services to market. High development cost, labor-intensive lexicography and linguistic implementation, slow progress in developing new language pairs, inaccessibility to the average user, and inability to scale easily to new platforms are all characteristics of these second-generation systems."
A number of companies specialize in machine translation development, such as Lernout & Hauspie, Globalink, Logos and SYSTRAN.
Based in Ieper (Belgium) and Burlington (Massachusetts, USA), Lernout & Hauspie (L&H) is an international leader in the development of advanced speech technology for various commercial applications and products. The company offers four core technologies: automatic speech recognition (ASR), text-to-speech (TTS), text-to-text and digital speech compression. Its ASR, TTS and digital speech compression technologies are licensed to leading companies in the telecommunications, computers and multimedia, consumer electronics and automotive electronics industries. Its text-to-text (translation) services are provided to information technology (IT) companies and to vertical and automation markets.
The Machine Translation Group of Lernout & Hauspie comprises enterprises that develop, produce, and market highly sophisticated machine translation systems: L&H Language Technology, AppTek, AILogic, NeocorTech and Globalink. Each is an international leader in its particular segment.
Founded in 1990, Globalink is a major U.S. company in language translation software and services, offering customized translation solutions built around a range of software products, on-line options and professional translation services. The company publishes language translation software products in Spanish, French, Portuguese, German, Italian and English, and addresses the translation problems faced by everyone from individuals and small businesses to multinational corporations and governments (from a stand-alone product that gives a fast draft translation to a full system for managing professional document translations). Globalink describes its approach on its website as follows:
"With Globalink's translation applications, the computer uses three sets of data: the input text, the translation program and permanent knowledge sources (containing a dictionary of words and phrases of the source language), and information about the concepts evoked by the dictionary and rules for sentence development. These rules are in the form of linguistic rules for syntax and grammar, and some are algorithms governing verb conjugation, syntax adjustment, gender and number agreement and word re-ordering.
Once the user has selected the text and set the machine translation process in motion the program begins to match words of the input text with those stored in its dictionary. Once a match is found, the application brings up a complete record that includes information on possible meanings of the word and its contextual relationship to other words that occur in the same sentence. The time required for the translation depends on the length of the text. A three-page, 750-word document takes about three minutes to render a first draft translation."
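The dictionary-matching and rule-application process described in this quotation can be sketched in a few lines of Python. This is a toy illustration, not Globalink's actual implementation: the lexicon, the part-of-speech tags and the single reordering rule are invented for the example, and real systems add many more rules (verb conjugation, gender and number agreement, etc.):

```python
# Toy sketch of dictionary-driven MT: match each input word against a
# bilingual lexicon, then apply a word re-ordering rule.

# Hypothetical English-to-French lexicon with part-of-speech tags.
LEXICON = {
    "the":   ("le",     "DET"),
    "white": ("blanc",  "ADJ"),
    "house": ("maison", "NOUN"),
}

def translate(sentence: str) -> str:
    # Step 1: match each input word with its dictionary record.
    records = [LEXICON.get(w.lower(), (w, "UNK")) for w in sentence.split()]
    # Step 2: apply one re-ordering rule: in French, most adjectives
    # follow the noun they modify (ADJ NOUN -> NOUN ADJ).
    i = 0
    while i < len(records) - 1:
        if records[i][1] == "ADJ" and records[i + 1][1] == "NOUN":
            records[i], records[i + 1] = records[i + 1], records[i]
        i += 1
    return " ".join(word for word, _tag in records)

print(translate("the white house"))  # -> "le maison blanc"
```

Note that the output is wrong ("maison" is feminine, so the adjective should be "blanche"): without agreement rules, exactly the kind of error that full MT systems spend much of their linguistic machinery avoiding.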
Randy Hobler is a Marketing Consultant for Globalink. He is currently acting as the Product Marketing Manager for Globalink's suite of Internet-based products and services. In his e-mail of September 3, 1998, he wrote:
"85% of the content of the Web in 1998 is in English and going down. This trend is driven not only by more websites and users in non-English-speaking countries, but by increasing localization of company and organization sites, and increasing use of machine translation to/from various languages to translate websites.
Because the Internet has no national boundaries, the organization of users is bounded by other criteria driven by the medium itself. In terms of multilingualism, you have virtual communities, for example, of what I call 'Language Nations'… all those people on the Internet wherever they may be, for whom a given language is their native language. Thus, the Spanish Language nation includes not only Spanish and Latin American users, but millions of Hispanic users in the US, as well as odd places like Spanish-speaking Morocco.
Language Transparency: We are rapidly reaching the point where highly accurate machine translation of text and speech will be so common as to be embedded in computer platforms, and even in chips in various ways. At that point, and as the growth of the Web slows, the accuracy of language translation hits 98% plus, and the saturation of language pairs has covered the vast majority of the market, language transparency (any-language-to-any-language communication) will be too limiting a vision for those selling this technology. The next development will be 'transcultural, transnational transparency', in which other aspects of human communication, commerce and transactions beyond language alone will come into play. For example, gesture has meaning, facial movement has meaning and this varies among societies. The thumb-index finger circle means 'OK' in the United States. In Argentina, it is an obscene gesture.
When the inevitable growth of multi-media, multi-lingual videoconferencing comes about, it will be necessary to 'visually edit' gestures on the fly. The MIT Media Lab [MIT: Massachusetts Institute of Technology], Microsoft and many others are working on computer recognition of facial expressions, biometric access identification via the face, etc. It won't be any good for a U.S. business person to be making a great point in a Web-based multi-lingual video conference to an Argentinian, having his words translated into perfect Argentinian Spanish if he makes the 'O' gesture at the same time. Computers can intercept this kind of thing and edit them on the fly.
There are thousands of ways in which cultures and countries differ, and most of these are computerizable to change as one goes from one culture to the other. They include laws, customs, business practices, ethics, currency conversions, clothing size differences, metric versus English system differences, etc., etc. Enterprising companies will be capturing and programming these differences and selling products and services to help the peoples of the world communicate better. Once this kind of thing is widespread, it will truly contribute to international understanding."
Logos is an international company (US, Canada and Europe) that has specialized in machine translation for 25 years, providing various translation tools, machine translation systems and supporting services.
SYSTRAN (an acronym for System Translation) is a company specializing in machine translation software. SYSTRAN's headquarters are located in Soisy-sous-Montmorency, France. Sales and marketing, along with most development, operate out of its subsidiary in La Jolla, California. The SYSTRAN site gives an interesting overview of the company's history. One of the company's products is AltaVista Translation, an automatic translation service of English Web pages into French, German, Italian, Portuguese, or Spanish, and vice versa, available on the AltaVista site, the most frequently used search engine on the Web.
Based in Montreal, Canada, Alis Technologies is an international company specializing in the development and marketing of language-handling solutions and services, particularly language implementation in the IT industry. Alis Translation Solutions (ATS) offers a wide selection of applications and languages, and multiple tools and services for the best possible translation quality. Language Technology Solutions (LTS) is devoted to commercializing advanced tools and services in the field of language engineering and information technology, transforming unilingual information systems into software that users can put to work in their own language (90 languages covered).
Another machine translation development is SPANAM and ENGSPAN, fully automatic machine translation systems developed and maintained by the computational linguists, translators, and systems programmers of the Pan American Health Organization (PAHO), Washington, D.C. The PAHO Translation Unit has used SPANAM (Spanish to English) and ENGSPAN (English to Spanish) to process over 25 million words since 1980. Staff and free-lance translators post-edit the raw output to produce high-quality translations with a 30-50% gain in productivity. The system is installed on a local area network at PAHO Headquarters and is used regularly by staff in the technical and administrative units. The software is also installed in a number of PAHO field offices and has been licensed to public and non-profit institutions in the US, Latin America, and Spain.
Some associations also contribute to machine translation development.
The Association for Computational Linguistics (ACL) is the main international scientific and professional society for people working on problems involving natural language and computation. Published by MIT Press, the ACL quarterly journal, Computational Linguistics (ISSN 0891-2017), continues to be the primary forum for research on computational linguistics and natural language processing. The Finite String is its newsletter supplement. The European branch of ACL is the European Chapter of the Association for Computational Linguistics (EACL), which provides a regional focus for its members.
The International Association for Machine Translation (IAMT) heads a worldwide network with three regional components: the Association for Machine Translation in the Americas (AMTA), the European Association for Machine Translation (EAMT) and the Asia-Pacific Association for Machine Translation (AAMT).
The Association for Machine Translation in the Americas (AMTA) presents itself as an association dedicated to anyone interested in the translation of languages using computers in some way. It has members in Canada, Latin America, and the United States, including people with translation needs, commercial system developers, researchers, sponsors, and people studying, evaluating, and working to understand the science of machine translation and to educate the public on the important scientific techniques and principles involved.
The European Association for Machine Translation (EAMT) is based in Geneva, Switzerland. This organization serves the growing community of people interested in MT (machine translation) and translation tools, including users, developers, and researchers of this increasingly viable technology.
The Asia-Pacific Association for Machine Translation (AAMT), formerly called the Japan Association for Machine Translation (created in 1991), is comprised of three entities: researchers, manufacturers, and users of machine translation systems. The association endeavors to develop machine translation technologies to expand the scope of effective global communications and, for this purpose, is engaged in machine translation system development, improvement, education, and publicity.
In Web embraces language translation, an article published by ZDNN (ZD Network News) on July 21, 1998, Martha L. Stone explains:
"Among the new products in the $10 billion language translation business are instant translators for websites, chat rooms, e-mail and corporate intranets.
The leading translation firms are mobilizing to seize the opportunities. Such as:
SYSTRAN has partnered with AltaVista and reports between 500,000 and 600,000 visitors a day on babelfish.altavista.digital.com, and about 1 million translations per day — ranging from recipes to complete Web pages.
About 15,000 sites link to babelfish, which can translate to and from French, Italian, German, Spanish and Portuguese. The site plans to add Japanese soon.
'The popularity is simple. With the Internet, now there is a way to use US content. All of these contribute to this increasing demand,' said Dimitros Sabatakakis, group CEO of SYSTRAN, speaking from his Paris home.
Alis technology powers the Los Angeles Times' soon-to-be launched language translation feature on its site. Translations will be available in Spanish and French, and eventually, Japanese. At the click of a mouse, an entire web page can be translated into the desired language.
Globalink offers a variety of software and Web translation possibilities, including a free e-mail service and software to enable text in chat rooms to be translated.
But while these so-called 'machine' translations are gaining worldwide popularity, company execs admit they're not for every situation.
Representatives from Globalink, Alis and SYSTRAN use such phrases as 'not perfect' and 'approximate' when describing the quality of translations, with the caveat that sentences submitted for translation should be simple, grammatically accurate and idiom-free.
'The progress on machine translation is moving at Moore's Law — every 18 months it's twice as good,' said Vin Crosbie, a Web industry analyst in Greenwich, Conn. 'It's not perfect, but some [non-English speaking] people don't realize I'm using translation software.'
With these translations, syntax and word usage suffer, because dictionary-driven databases can't decipher between homonyms — for example, 'light' (as in the sun or light bulb) and 'light' (the opposite of heavy).
Still, human translation would cost between $50 and $60 per Web page, or about 20 cents per word, SYSTRAN's Sabatakakis said.
While this may be appropriate for static 'corporate information' pages, the machine translations are free on the Web, and often less than $100 for software, depending on the number of translated languages and special features."
4.3. Computer-Assisted Translation
Within the World Health Organization (WHO), Geneva, Switzerland, the Computer-assisted Translation and Terminology Unit (CTT) is assessing technical options for using computer-assisted translation (CAT) systems based on "translation memory". With such systems, translators have immediate access to previous translations of portions of the text before them. These reminders of previous translations can be accepted, rejected or modified, and the final choice is added to the memory, thus enriching it for future reference. By archiving daily output, the translator would soon have access to an enormous "memory" of ready-made solutions for a considerable number of translation problems. Several projects are currently under way in such areas as electronic document archiving and retrieval, bilingual/multilingual text alignment, computer-assisted translation, translation memory and terminology database management, and speech recognition.
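The translation-memory mechanism described above can be sketched minimally in Python. This is a toy illustration, not the CTT's actual system: it uses the standard library's difflib for fuzzy matching, whereas commercial CAT tools implement their own segment-matching algorithms, and the sentences are invented for the example:

```python
# Toy translation memory: archive finished translations, then offer the
# closest archived segment as a suggestion the translator can accept,
# reject or modify.
import difflib

memory = {}  # source sentence -> approved translation

def archive(source: str, translation: str) -> None:
    """Add a finished translation to the memory, enriching it."""
    memory[source] = translation

def suggest(source: str, threshold: float = 0.7):
    """Return (stored segment, its translation, similarity score) for
    the closest match, or None if nothing scores above the threshold."""
    best = None
    for stored, translation in memory.items():
        score = difflib.SequenceMatcher(None, source, stored).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (stored, translation, score)
    return best

archive("The committee approved the budget.",
        "Le comité a approuvé le budget.")
hit = suggest("The committee approved the new budget.")
if hit:
    print(f"Previous translation ({hit[2]:.0%} match): {hit[1]}")
```

A near-identical sentence thus retrieves the earlier translation as a starting point, which is exactly the "recycling" effect that makes translation memories pay off on repetitive technical documentation.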
Contrary to the universal translation machine whose imminent advent was announced some 50 years ago, machine translation systems still do not produce good-quality translations. Why not? Pierre Isabelle and Patrick Andries, from the Laboratoire de recherche appliquée en linguistique informatique (RALI) (Laboratory for Applied Research in Computational Linguistics) in Montreal, Quebec, explain this failure in La traduction automatique, 50 ans après (Machine translation, 50 years later), an article published in the Dossiers of the daily cybermagazine Multimédium:
"The ultimate goal of building a machine capable of competing with a human translator remains elusive due to the slow progress of the research. […] Recent research, based on large collections of texts called corpora - using either statistical or analogical methods - promise to reduce the quantity of manual work required to build a MT [machine translation] system, but it is less sure than they can promise a substantial improvement in the quality of machine translation. […] the use of MT will be more or less restricted to information assimilation tasks or tasks of distribution of texts belonging to restricted sub-languages."
Following ideas expressed by Yehoshua Bar-Hillel in The State of Machine Translation, an article published in 1951, Pierre Isabelle and Patrick Andries define three MT implementation strategies: 1) a tool for information assimilation, to scan multilingual information and supply rough translations; 2) situations of "restricted language", such as the METEO system which, since 1977, has been translating the weather forecasts of the Canadian Ministry of Environment; 3) human/machine coupling before, during and after the MT process, which is not necessarily more economical than traditional translation.
The authors favour a "workstation for the human translator" over a "robot translator":
"The recent research on the probabilist methods permitted in fact to demonstrate that it was possible to modelize in a very efficient way some simple aspects of the translation relationship between two texts. For example, methods were set up to calculate the correct alignment between the text sentences and their translation, that is, to identify the sentence(s) of the source text which correspond(s) to each sentence of the translation. Applied on a large scale, these techniques allow the use of archives of a translation service to build a translation memory which will often permit the recycling of previous translation fragments. Such systems are already available on the translation market (IBM Translation Manager II, Trados Translator's Workbench by Trados, RALI TransSearch, etc.)
The most recent research focuses on models able to establish correspondences automatically at a finer level than the sentence: syntagms and words. The results obtained point to a whole family of new tools for the human translator, including aids for terminological research, aids for dictation and translation typing, and detectors of translation errors."
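The sentence alignment the authors describe can be illustrated with a toy length-based aligner, in the spirit of Gale and Church's approach (a sketch only: the SKIP penalty and the example sentences are invented, and real aligners use statistical length models and 2-1/1-2 beads rather than raw character-length differences):

```python
def align_sentences(src, tgt):
    """Minimal length-based sentence alignment: dynamic programming over
    sentence lengths, allowing 1-1 pairings plus insertions/deletions."""
    SKIP = 50  # assumed penalty for leaving a sentence unaligned
    n, m = len(src), len(tgt)
    # cost[i][j] = best cost of aligning src[:i] with tgt[:j]
    cost = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # 1-1 bead: cost grows with length mismatch
                c = cost[i-1][j-1] + abs(len(src[i-1]) - len(tgt[j-1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i-1, j-1)
            if i and cost[i-1][j] + SKIP < cost[i][j]:   # 1-0 bead
                cost[i][j], back[i][j] = cost[i-1][j] + SKIP, (i-1, j)
            if j and cost[i][j-1] + SKIP < cost[i][j]:   # 0-1 bead
                cost[i][j], back[i][j] = cost[i][j-1] + SKIP, (i, j-1)
    # trace the best path backwards, collecting the 1-1 pairs
    pairs, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[i-1], tgt[j-1]))
        i, j = pi, pj
    return pairs[::-1]

en = ["I agree.", "The committee will meet on Tuesday at ten."]
fr = ["Je suis d'accord.", "Le comité se réunira mardi à dix heures."]
pairs = align_sentences(en, fr)
```

Applied to a translation service's archives, such aligned pairs are exactly what populates the translation memories mentioned in the quote.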
[In this chapter:]
[5.1. Machine Translation Research / 5.2. Computational Linguistics / 5.3. Language Engineering / 5.4. Internationalization and Localization]
5.1. Machine Translation Research
The CL/MT Research Group (Computational Linguistics (CL) and Machine Translation (MT) Group) is a research group in the Department of Language and Linguistics at the University of Essex, United Kingdom. It serves as a focus for research in computational, and computationally oriented, linguistics. It has been in existence since the late 1980s, and has played a role in a number of important computational linguistics research projects.
Founded in 1986, the Center for Machine Translation (CMT) is now a research center within the new Language Technologies Institute at the School of Computer Science at Carnegie Mellon University (CMU), Pittsburgh, Pennsylvania. It conducts advanced research and development in a suite of technologies for natural language processing, with a primary focus on high-quality multilingual machine translation.
Within the CLIPS Laboratory (CLIPS: Communication langagière et interaction personne-système = Language Communication and Person-System Interaction) of the French IMAG Federation, the Groupe d'étude pour la traduction automatique (GETA) (Study Group for Machine Translation) is a multi-disciplinary team of computer scientists and linguists. Its research topics concern all the theoretical, methodological and practical aspects of computer-assisted translation (CAT), or more generally of multilingual computing. The GETA participates in the UNL (Universal Networking Language) project, initiated by the Institute of Advanced Studies (IAS) of the United Nations University (UNU).
"UNL (Universal Networking Language) is a language that - with its companion "enconverter" and "deconverter" software - enables communication among peoples of differing native languages. It will reside, as a plug-in for popular World Wide Web browsers, on the Internet, and will be compatible with standard network servers. The technology will be shared among the member states of the United Nations. Any person with access to the Internet will be able to "enconvert" text from any native language of a member state into UNL. Just as easily, any UNL text can be "deconverted" from UNL into native languages. United Nations University's UNL Center will work with its partners to create and promote the UNL software, which will be compatible with popular network servers and computing platforms."
The Natural Language Group (NLG) at the Information Sciences Institute (ISI) of the University of Southern California (USC) is currently involved in various aspects of computational/natural language processing. The group's projects are: machine translation; automated text summarization; multilingual verb access and text management; development of large concept taxonomies (ontologies); discourse and text generation; construction of large lexicons for various languages; and multimedia communication.
Eduard Hovy, Head of the Natural Language Group, explained in his e-mail of August 27, 1998:
"Your presentation outline looks very interesting to me. I do wonder, however, where you discuss the language-related applications/functionalities that are not translation, such as information retrieval (IR) and automated text summarization (SUM). You would not be able to find anything on the Web without IR! — all the search engines (AltaVista, Yahoo!, etc.) are built upon IR technology. Similarly, though much newer, it is likely that many people will soon be using automated summarizers to condense (or at least, to extract the major contents of) single (long) documents or lots of (any length) ones together. […]
In this context, multilingualism on the Web is another complexifying factor. People will write their own language for several reasons — convenience, secrecy, and local applicability — but that does not mean that other people are not interested in reading what they have to say! This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some Government Intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire 'weak' bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article.
For these kinds of reasons, the US Government has over the past five years been funding research in MT, SUM, and IR, and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in *all* the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have.
You can see a demo of our version of this capability, using English as the user language and a collection of approx. 5,000 texts of English, Japanese, Arabic, Spanish, and Indonesian, by visiting MuST Multilingual Information Retrieval, Summarization, and Translation System.
Type your query word (say, 'baby', or whatever you wish) in and press 'Enter/Return'. In the middle window you will see the headlines (or just keywords, translated) of the retrieved documents. On the left you will see what language they are in: 'Sp' for Spanish, 'Id' for Indonesian, etc. Click on the number at left of each line to see the document in the bottom window. Click on 'Summarize' to get a summary. Click on 'Translate' for a translation (but beware: Arabic and Japanese are extremely slow! Try Indonesian for a quick word-by-word 'translation' instead).
This is not a product (yet); we have lots of research to do in order to improve the quality of each step. But it shows you the kind of direction we are heading in."
"How do you see the future of Internet-related activities as regards languages?"
"The Internet is, as I see it, a fantastic gift to humanity. It is, as one of my graduate students recently said, the next step in the evolution of information access. A long time ago, information was transmitted orally only; you had to be face-to-face with the speaker. With the invention of writing, the time barrier broke down — you can still read Seneca and Moses. With the invention of the printing press, the access barrier was overcome — now *anyone* with money to buy a book can read Seneca and Moses. And today, information access becomes almost instantaneous, globally; you can read Seneca and Moses from your computer, without even knowing who they are or how to find out what they wrote; simply open AltaVista and search for 'Seneca'. This is a phenomenal leap in the development of connections between people and cultures. Look how today's Internet kids are incorporating the Web in their lives.
The next step? — I imagine it will be a combination of computer and cellular phone, allowing you as an individual to be connected to the Web wherever you are. All your diary, phone lists, grocery lists, homework, current reading, bills, communications, etc., plus AltaVista and the others, all accessible (by voice and small screen) via a small thing carried in your purse or on your belt. That means that the barrier between personal information (your phone lists and diary) and non-personal information (Seneca and Moses) will be overcome, so that you can get to both types anytime. I would love to have something that tells me, when next I am at a conference and someone steps up, smiling to say hello, who this person is, where last I met him/her, and what we said then!
But that is the future. Today, the Web has made big changes in the way I shop (I spent 20 minutes looking for plane routes for my next trip with a difficult transition on the Web, instead of waiting for my secretary to ask the travel agent, which takes a day). I look for information on anything I want to know about, instead of having to make a trip to the library and look through complicated indexes. I send e-mail to you about this question, at a time that is convenient for me, rather than your having to make a phone appointment and then us talking for 15 minutes. And so on."
The Computing Research Laboratory (CRL) at New Mexico State University (NMSU) is a non-profit research enterprise committed to basic research and software development in advanced computing applications concentrated in the areas of natural language processing, artificial intelligence and graphical user interface design. Applications developed from basic research endeavors include a variety of configurations of machine translation, information extraction, knowledge acquisition, intelligent teaching, and translator workstation systems.
Maintained by the Department of Linguistics of the Translation Research Group of Brigham Young University (BYU), Utah, TTT.org (Translation, Theory and Technology) provides information about language theory and technology, particularly relating to translation. Translation technology includes translator workbench tools and machine translation. In addition to translation tools, TTT.org is interested in data exchange standards that allow various tools to interoperate, allowing the integration of tools from multiple vendors in the multilingual document production chain.
In the area of data exchange standards, TTT.org is actively involved in the development of MARTIF (machine-readable terminology interchange format). MARTIF is a format to facilitate the interchange of terminological data among terminology management systems. This format is the result of several years of intense international collaboration among terminologists and database experts from various organizations, including academic institutions, the Text Encoding Initiative (TEI), and the Localisation Industry Standards Association (LISA).
5.2. Computational Linguistics
The Laboratoire de recherche appliquée en linguistique informatique (RALI) (Laboratory of Applied Research in Computational Linguistics) is a laboratory of the University of Montreal, Quebec. The RALI's personnel includes computer scientists and linguists experienced in natural language processing, in both classical symbolic methods and newer probabilistic methods.
Thanks to the Incognito laboratory, which was founded in 1983, the University of Montreal's Computer Science and Operational Research Department (DIRO) established itself as a leading research centre in the area of natural language processing. In June 1997, Industry Canada agreed to transfer to the DIRO all the activities of the machine-aided translation program (TAO), which had been conducted at the Centre for Information Technology Innovation (CITI) since 1984. A new laboratory — the RALI — was opened in order to promote and develop the results of the CITI's research, allowing the members of the former TAO team to pursue their work within the university community. The RALI's areas of expertise include work in: automatic text alignment, automatic text generation, automatic reaccentuation, language identification and finite state transducers.
The RALI produces the "TransX family" of what it calls "a new generation" of translation support tools (TransType, TransTalk, TransCheck and TransSearch), which are based on probabilistic translation models that automatically calculate the correspondences between the text produced by a translator and the original source language text.
" TransType speeds up the keying-in of a translation by anticipating a translator's choices and critiquizing them when appropriate. In proposing its suggestions, TransType takes into account both the source text and the partial translation that the translator has already produced.
TransTalk is an automatic dictation system that makes use of a probabilistic translation model in order to improve the performance of its voice recognition model.
TransCheck automatically detects certain types of translation errors by verifying that the correspondences between the segments of a draft and the segments of the source text respect well-known properties of a good translation.
TransSearch allows translators to search databases of pre-existing translations in order to find ready-made solutions to all sorts of translation problems. In order to produce the required databases, the translations and the source language texts must first be aligned."
Some of RALI's other projects are:
- the SILC Project, concerning language identification. When a document is submitted to the system, SILC attempts to determine what language the document is written in and the character set in which it is encoded.
- the FAP: Finite Automata Package (FAP), a project concerning finite-state transducers. The finite-state automaton is a simple and efficient computational device for describing sequences of symbols (words, characters, etc.) known as the regular languages. The finite-state transducer is a device for linking pairs of these sequences under the control of a grammar of local correspondences, and thus provides a means of rewriting one sequence as another. Applications of these techniques in NLP include: dictionaries, morphological analysis, part-of-speech tagging, syntactic analysis, and speech processing.
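The core idea behind a language identifier like SILC can be shown with a toy character n-gram profile matcher (an illustration only: the training snippets below are invented, and SILC's actual models are statistical ones trained on large corpora; SILC also identifies character encodings, which this sketch ignores):

```python
from collections import Counter

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = " " + text.lower() + " "
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def identify(sample, profiles):
    """Guess the language whose profile shares the most n-gram mass
    with the sample (a toy version of n-gram language identification)."""
    sp = ngram_profile(sample)
    def overlap(lang):
        lp = profiles[lang]
        return sum(min(c, lp[g]) for g, c in sp.items())
    return max(profiles, key=overlap)

# Tiny invented "training" snippets of common function words per language;
# a real system would use profiles built from megabytes of text.
profiles = {
    "fr": ngram_profile("le la les et dans pour avec sur une des que"),
    "en": ngram_profile("the and in for with on a of that is to"),
}
guess = identify("the files are in the directory", profiles)
```

Function words and their surrounding character sequences differ sharply between languages, which is why even this crude overlap score usually picks the right profile for a sentence-length sample.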
The Xerox Palo Alto Research Center (PARC)'s projects include two main projects concerning languages: Inter-Language Unification (ILU) and Natural Language Theory and Technology (NLTT).
The Inter-Language Unification (ILU) System is a multi-language object interface system. The object interfaces provided by ILU hide implementation distinctions between different languages, between different address spaces, and between operating system types. ILU can be used to build multilingual object-oriented libraries ("class libraries") with well-specified language-independent interfaces. It can also be used to implement distributed systems, or to define and document interfaces between the modules of non-distributed programs.
The goal of Natural Language Theory and Technology (NLTT) is to develop theories of how information is encoded in natural language and technologies for mapping information to and from natural language representations. This will enable the efficient and intelligent handling of natural language text in critical phases of document processing, such as recognition, summarizing, indexing, fact extraction and presentation, document storage and retrieval, and translation. It will also increase the power and convenience of communicating with machines in natural language.
Based in Cambridge, United Kingdom, and Grenoble, France, the Xerox Research Centre Europe (XRCE) is also a research organization of the international company Xerox. It focuses on increasing productivity in the workplace through new document technologies, with several tools and projects relating to languages.
One of Xerox's research activities is MultiLingual Theory and Technology (MLTT), to study how to analyze and generate text in many languages (English, French, German, Italian, Spanish, Russian, Arabic, etc.). The MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. Currently under development are phrasal parsers for French and German, a lexical functional grammar (LFG) for French and projects on multilingual information retrieval, translation and generation.
Founded in 1979, the American Association for Artificial Intelligence (AAAI) is a non-profit scientific society devoted to advancing the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines. AAAI also aims to increase public understanding of artificial intelligence, improve the teaching and training of AI practitioners, and provide guidance for research planners and funders concerning the importance and potential of current AI developments and future directions.
The Institut Dalle Molle pour les études sémantiques et cognitives (ISSCO) (Dalle Molle Institute for Semantic and Cognitive Studies) is a research laboratory attached to the University of Geneva, Switzerland, which conducts basic and applied research in computational linguistics (CL), and artificial intelligence (AI). The site gives a presentation of the ISSCO projects (European projects, projects of the Swiss National Science Foundation, projects of the French-speaking community, etc.).
Created by the Foundation Dalle Molle in 1972 for research into cognition and semantics, ISSCO has come to specialize in natural language processing and, in particular, in multilingual language processing, in a number of areas: machine translation, linguistic environments, multilingual generation, discourse processing, data collection, etc. The University of Geneva provides administrative support and infrastructure for ISSCO. The research is funded solely by grants and by contracts with public and private bodies.
ISSCO is multi-disciplinary and multi-national, "drawing its staff and its visitors from the disciplines of computer science, linguistics, mathematics, psychology and philosophy. The long-term staff of the Institute is relatively small in number; with a much larger number of visitors coming for stays ranging from a month to two years. This ensures a continual exchange of ideas and encourages flexibility of approach amongst those associated with the Institute."
The International Conferences on Computational Linguistics (COLINGs) are organized every two years by the International Committee on Computational Linguistics (ICCL).
"The International Committee on Computational Linguistics was set up by David Hays in the mid-Sixties as a permanent body to run international computational linguistics conferences in an original way, with no permanent secretariat, subscriptions or funds. It was ahead of its time in that and other ways. COLING has always been distinguished by pleasant venues and atmosphere, rather than by the clinical efficiency of an airport conference hotel: COLINGs are simply nice conferences to be at. […] In recent years, the ACL [Association for Computational Linguistics] has given great assistance and cooperation in keeping COLING proceedings available and distributed."
5.3. Language Engineering
Launched in January 1999 by the European Commission, the website HLTCentral (HLT: Human Language Technologies) gives a short definition of language engineering:
"Through language engineering we can find ways of living comfortably with technology. Our knowledge of language can be used to develop systems that recognise speech and writing, understand text well enough to select information, translate between different languages, and generate speech as well as the printed world.
By applying such technologies we have the ability to extend the current limits of our use of language. Language enabled products will become an essential and integral part of everyday life."
A full presentation of language engineering can be found in Language Engineering: Harnessing the Power of Language.
From 1992 to 1998, the Language Engineering Sector was part of the Telematics Applications Programme of the European Commission. Its aim was to facilitate the use of telematics applications and to increase the possibilities for communication in and between European languages. RTD (research and technological development) work focused on pilot projects that integrated language technologies into information and communications applications and services. A key objective was to improve their ease of use and functionality and broaden their scope across different languages.
From January 1999, the Language Engineering Sector has been rebranded as Human Language Technologies (HLT), a sector of the IST Programme (IST: Information Society Technologies) of the European Commission for 1999-2002. HLTCentral has been set up by the LINGLINK Project as the springboard for access to Language Technology resources on the Web: information, news, downloads, links, events, discussion groups and a number of specially-commissioned studies (e-commerce, telecommunications, Call Centres, Localization, etc.).
The Multilingual Application Interface for Telematic Services (MAITS) is a consortium formed to specify an applications programming interface (API) for multilingual applications in the telematic services. A number of telematic applications, such as X.500, WWW, X.400, Internet mail and databases, are planned to be enhanced to use this i18n API, and products are planned to be implemented using the API.
FRANCIL (Réseau francophone de l'ingénierie de la langue) (Francophone Network in Language Engineering) is a programme launched in June 1994 by the Agence universitaire de la francophonie (AUPELF-UREF) (University Agency for Francophony) to strengthen activities in linguistic engineering, particularly for automatic language processing. This quickly-growing sector includes research and development for text analysis and generation, and for speech recognition, comprehension and synthesis. It also includes some applications in the following fields: document management, communication between the human being and the machine, writing aid, and computer-assisted translation.
5.4. Internationalization and Localization
"Towards communicating on the Internet in any language…" Babel is an Alis Technologies/ Internet Society joint project to internationalize the Internet. Its multilingual site (English, French, German, Italian, Portuguese, Spanish and Swedish) has two main sections: languages (the world's languages; typographical and linguistic glossary; Francophonie (French-speaking countries); and the Internet and multilingualism (developing your multilingual Web site; coding the world's writing).
The Localisation Industry Standards Association (LISA) is a main organization for the localization and internationalization industry. The current membership of 130 leading players from all around the world includes software publishers, hardware manufacturers, localization service vendors, and an increasing number of companies from related IT sectors. LISA defines its mission as "promoting the localization and internationalization industry and providing a mechanism and services to enable companies to exchange and share information on the development of processes, tools, technologies and business models connected with localization, internationalization and related topics". Its site is housed and maintained by the University of Geneva, Switzerland.
W3C Internationalization/Localization is part of the World Wide Web Consortium (W3C), an international industry consortium founded in 1994 to develop common protocols for the World Wide Web. The site gives in particular a definition of protocols used for internationalization/localization: HTML; base character set; new tags and attributes; HTTP; language negotiation; URLs & other identifiers including non-ASCII characters; etc. It also offers some help with creating a multilingual site.
Agence de la francophonie
Alis Technologies
AltaVista Translation
American Association for Artificial Intelligence (AAAI)
Aquarius
ARTFL Project (ARTFL: American and French Research on the Treasury of the French Language)
Asia-Pacific Association for Machine Translation (AAMT)
Association for Computational Linguistics (ACL)
Association for Machine Translation in the Americas (AMTA)
Babel / Alis Technologies & Internet Society
CAPITAL (Computer-Assisted Pronunciation Investigation Teaching and Learning)
Center for Machine Translation (CMT) / Carnegie Mellon University (CMU)
Centre d'expertise et de veille inforoutes et langues (CEVEIL)
COLING (International Conference on Computational Linguistics)
Computational Linguistics (CL) and Machine Translation (MT) Group (CL/MT Research Group) / Essex University
Computer-Assisted Translation and Terminology Unit (CTT) / World Health Organization (WHO)
Computing Research Laboratory (CRL) / New Mexico State University (NMSU)
CTI (Computer in Teaching Initiative) Centre for Modern Languages / University of Hull
Dictionnaire francophone en ligne / Hachette & Agence universitaire de la Francophonie (AUPELF-UREF)
Dictionnaires électroniques / Swiss Federal Administration
ENGSPAN (SPANAM and ENGSPAN) / Pan American Health Organization (PAHO)
Ethnologue (The)
Eurodicautom / European Commission
EUROCALL (European Association for Computer-Assisted Language Learning)
European Association for Machine Translation (EAMT)
European Chapter of the Association of Computational Linguistics (EACL)
European Committee for the Respect of Cultures and Languages in Europe (ECRCLE)
European Language Resources Association (ELRA)
European Minority Languages / Sabhal Mór Ostaig
European Network in Language and Speech (ELSNET)
Fonds francophone des inforoutes / Agence de la francophonie
FRANCIL (Réseau francophone de l'ingénierie de la langue) / Agence universitaire de la francophonie (AUPELF-UREF)
FRANTEXT / Institut national de la langue française (INaLF)
Global Reach
Globalink
Groupe d'étude pour la traduction automatique (GETA)
Human Language Technologies (HLTCentral) / European Commission
Human-Languages Page (The)
ILOTERM / International Labour Organization (ILO)
Institut Dalle Molle pour les études sémantiques et cognitives (ISSCO)
Institut national de la langue française (INaLF)
International Committee on Computational Linguistics (ICCL)
International Conference on Computational Linguistics (COLING)
Internet Dictionary Project
Internet Resources for Language Teachers and Learners
Laboratoire de recherche appliquée en linguistique informatique (RALI)
Language Futures Europe
Language Today
Languages of the World by Computers and the Internet (The) (Logos Home Page)
Lernout & Hauspie
LINGUIST List (The)
Localisation Industry Standards Association (LISA)
Logos (Canada, USA, Europe)
Logos (Italy)
Logos Home Page (The Languages of the World by Computers and the Internet)
Merriam-Webster Online: the Language Center
Multilingual Application Interface for Telematic Services (MAITS)
Multilingual Glossary of Internet Terminology (The) (NetGlos) / WorldWide Language Institute (WWLI)
Multilingual Information Society (MLIS) / European Commission
MultiLingual Theory and Technology (MLTT) / Xerox Research Centre Europe (XRCE)
Multilingual Tools and Services / European Union
Natural Language Group (NLG) at USC/ISI / University of Southern California (USC)
NetGlos (The Multilingual Glossary of Internet Terminology) / WorldWide Language Institute (WWLI)
OneLook Dictionaries
PARC (Xerox Palo Alto Research Center)
Project Gutenberg
RALI (Laboratoire de recherche appliquée en linguistique informatique)
Réseau francophone de l'ingénierie de la langue (FRANCIL) / Agence universitaire de la francophonie (AUPELF-UREF)
SPANAM and ENGSPAN / Pan American Health Organization (PAHO)
Speech on the Web
TERMITE (ITU Telecommunication Terminology Database) / International Telecommunication Union (ITU)
Travlang
TTT.org (Translation, Theory and Technology) / Brigham Young University (BYU)
Universal Networking Language (UNL) / United Nations University (UNU)
W3C Internationalization/Localization / World Wide Web Consortium (W3C)
Web Languages Hit Parade / Babel
Web of Online Dictionaries (A)
WELL (Web Enhanced Language Learning)
WHOTERM (WHO Terminology Information System) / World Health Organization (WHO)
Xerox Palo Alto Research Center (PARC)
Xerox Research Centre Europe (XRCE)
Yamada WWW Language Guides
An asterisk (*) indicates the persons who sent contributions especially for this study.
Patrick Andries (Laboratoire de recherche appliquée en linguistique informatique - RALI)
Arlette Attali* (Institut national de la langue française - INaLF)
Robert Beard* (A Web of Online Dictionaries)
Louise Beaudoin (Ministry of Culture and Communications in Quebec)
Guy Bertrand* (Centre d'expertise et de veille inforoutes et langues - CEVEIL)
Tyler Chambers* (The Human-Languages Page)
Jean-Pierre Cloutier (Chroniques de Cybérie)
Cynthia Delisle* (Centre d'expertise et de veille inforoutes et langues - CEVEIL)
Helen Dry* (The LINGUIST List)
Bill Dunlap* (2) (Euro-Marketing Associates, Global Reach)
Marcel Grangier* (Section française des Services linguistiques centraux de la Chancellerie fédérale suisse)
Barbara F. Grimes* (The Ethnologue)
Michael S. Hart* (Project Gutenberg)
Randy Hobler* (Globalink)
Eduard Hovy* (Natural Language Group at USC/ISI)
Pierre Isabelle (Laboratoire de recherche appliquée en linguistique informatique - RALI)
Christiane Jadelot* (Institut national de la langue française - INaLF)
Annie Kahn (Le Monde)
Brian King* (NetGlos)
Geoffrey Kingscott* (Praetorius)
Steven Krauwer* (European Network in Language and Speech - ELSNET)
Michael C. Martin* (Travlang)
Yoshi Mikami* (The Languages of the World by Computers and the Internet)
Caoimhín P. Ó Donnaíle* (European Minority Languages)
Henri Slettenhaar* (professor at the Webster University)
Martha L. Stone (2) (ZDNN)
June Thompson* (CTI (Computer in Teaching Initiative) Centre for Modern Languages)
Paul Treanor* (Language Futures Europe)
Rodrigo Vergara (Logos, Italy)
Robert Ware* (2) (OneLook Dictionaries)
Copyright © 1999 Marie Lebert
End of Project Gutenberg's Multilingualism on the Web, by Marie Lebert