= [Quote]
Robert Beard, a language teacher at Bucknell University, in Lewisburg, Pennsylvania, wrote in September 1998: "As a language teacher, the web represents a plethora of new resources produced by the target culture, new tools for delivering lessons (interactive Java and Shockwave exercises) and testing, which are available to students any time they have the time or interest — 24 hours a day, 7 days a week. It is also an almost limitless publication outlet for my colleagues and I, not to mention my institution. (…) Ultimately all course materials, including lecture notes, exercises, moot and credit testing, grading, and interactive exercises will be far more effective in conveying concepts that we have not even dreamed of yet."
= CTI Centre for Modern Languages
Since its inception in 1989, the CTI (Computers in Teaching Initiative) Centre for Modern Languages, based in the Language Institute at the University of Hull, United Kingdom, has aimed to promote and encourage the use of computers in language learning and teaching. The CTI Centre provides information on how computer-assisted language learning (CALL) can be effectively integrated into existing courses. It offers support to language lecturers who use computers in their teaching or wish to do so.
June Thompson, manager of the CTI Centre, wrote in December 1998: "The internet has the potential to increase the use of foreign languages, and our organization certainly opposed any trend towards the dominance of English as the language of the internet. The use of the internet has brought an enormous new dimension to our work of supporting language teachers in their use of technology in teaching."
How about the future? "I suspect that for some time to come, the use of internet-related activities for languages will continue to develop alongside other technology-related activities (e.g. use of CD-ROMs — not all institutions have enough networked hardware). In the future I can envisage use of internet playing a much larger part, but only if such activities are pedagogy-driven. Our organization is closely associated with the WELL project which devotes itself to these issues."
The WELL (Web Enhanced Language Learning) project was a EUROCALL (European Association for Computer-Assisted Language Learning) project that ran from 1997 to 2000 in the United Kingdom to provide access to high-quality web resources in 12 languages. The resources were selected and described by subject experts, with information and examples on how to use them for teaching and learning.
More generally, EUROCALL's goal is to promote the use of foreign languages within Europe, to provide a European focus for all aspects of the use of technology for language learning, and to enhance the quality, dissemination and efficiency of CALL materials. Another project of EUROCALL is CAPITAL (Computer-Assisted Pronunciation Investigation Teaching and Learning), run by a group of researchers and practitioners interested in using computers in this field.
= LINGUIST List
The LINGUIST List was founded by Anthony Rodrigues Aristar in 1990 at the University of Western Australia, with 60 subscribers, before moving to Texas A&M University in 1991. In 1997, emails sent to the distribution list were also available on the list's own website, in the following sections: the profession (conferences, linguistic associations, programs), research and research support (papers, dissertation abstracts, projects, bibliographies, topics, texts), publications, pedagogy, language resources (languages, language families, dictionaries, regional information), and computer support (fonts and software). The LINGUIST List is a component of the WWW Virtual Library for linguistics.
Helen Dry, moderator of the LINGUIST List, wrote in August 1998: "The LINGUIST List, which I moderate, has a policy of posting in any language, since it's a list for linguists. However, we discourage posting the same message in several languages, simply because of the burden extra messages put on our editorial staff. (We are not a bounce-back list, but a moderated one. So each message is organized into an issue with like messages by our student editors before it is posted.) Our experience has been that almost everyone chooses to post in English. But we do link to a translation facility that will present our pages in any of five languages; so a subscriber need not read LINGUIST in English unless s/he wishes to. We also try to have at least one student editor who is genuinely multilingual, so that readers can correspond with us in languages other than English."
She added in July 1999: "We are beginning to collect some primary data. For example, we have searchable databases of dissertation abstracts relevant to linguistics, of information on graduate and undergraduate linguistics programs, and of professional information about individual linguists. The dissertation abstracts collection is, to my knowledge, the only freely available electronic compilation in existence."
= [Quote]
Caoimhín Ó Donnaíle has taught computing — through the Gaelic language — at the Institute Sabhal Mór Ostaig, on the Island of Skye, in Scotland. He has also maintained the bilingual (English, Gaelic) college website, which is the main site worldwide with information on Scottish Gaelic. He wrote in May 2001: "Students do everything by computer, use Gaelic spell-checking, a Gaelic online terminology database. There are more hits on our website. There is more use of sound. Gaelic radio (both Scottish and Irish) is now available continuously worldwide via the internet. A major project has been the translation of the Opera web browser into Gaelic — the first software of this size available in Gaelic."
= The Ethnologue
Published by SIL International (SIL was initially known as the Summer Institute of Linguistics), "The Ethnologue: Languages of the World" is an encyclopedic reference work cataloging all of the world’s 6,909 known living languages. The 16th edition was published in 2009, in print and on the web. The Ethnologue has been an active research project for more than fifty years. Thousands of linguists have contributed to the Ethnologue worldwide. A new edition is published approximately every four years.
The Ethnologue was founded in 1951 by Richard Pittman, who was motivated by the desire to share information on language development needs around the world with his colleagues at SIL International as well as with other language researchers. Richard Pittman was the editor of the 1st to 7th editions (1951-1969).
Barbara Grimes was the editor of the 8th to 14th editions (1971-2000). She wrote in January 2000: "It is a catalog of the languages of the world, with information about where they are spoken, an estimate of the number of speakers, what language family they are in, alternate names, names of dialects, other socio-linguistic and demographic information, dates of published Bibles, a name index, a language family index, and language maps." In 1971, information was expanded from primarily minority languages to encompass all known languages of the world. Between 1967 and 1973, she completed an in-depth revision of the information on Africa, the Americas, the Pacific, and a few countries of Asia. During her years as editor, the number of identified languages grew from 4,493 to 6,809. The information recorded on each language expanded so that the published work more than tripled in size.
In 2000, Raymond Gordon Jr. became the third editor of the Ethnologue and produced the 15th edition (2005). Shortly after the publication of the 15th edition, Paul Lewis became the editor, responsible for general oversight and research policy. He installed Conrad Hurd as managing editor, responsible for operations and database management, and Raymond Gordon as senior research editor, leading a team of regional and language-family focused research editors.
In the Introduction to its latest edition (16th edition, 2009), the Ethnologue defines a language as follows: "How one chooses to define a language depends on the purposes one has in identifying that language as distinct from another. Some base their definition on purely linguistic grounds. Others recognize that social, cultural, or political factors must also be taken into account. In addition, speakers themselves often have their own perspectives on what makes a particular language uniquely theirs. Those are frequently related to issues of heritage and identity much more than to the linguistic features of the language(s) in question."
As explained in the Introduction, one feature of the database since its inception has been a system of three-letter language identifiers, which have appeared in the publication itself from the 10th edition (1984) onwards. "In 1998, the International Organization for Standardization (ISO) adopted ISO 639-2, a standard for three-letter language identifiers. The standard is based on a convergence of ISO 639-1 (an earlier standard for two-letter language identifiers adopted in 1988) and of ANSI Z39.53 (also known as the MARC language codes, a set of three-letter identifiers developed within the library community and adopted as an American National Standard in 1987). The ISO 639-2 standard was insufficient for many purposes since it has identifiers for fewer than 400 individual languages. Thus in 2002, ISO TC37/SC2 formally invited SIL International to prepare a new standard that would reconcile the complete set of codes used in the Ethnologue with the codes already in use in the earlier ISO standard. In addition, codes developed by Linguist List to handle ancient and constructed languages were to be incorporated. The result, which was officially approved by the subscribing national standards bodies in 2006 and published in 2007, is a standard named ISO 639-3 that provides standardized three-letter codes for identifying nearly 7,500 languages (ISO 2007). SIL International was named as the registration authority for the ISO 639-3 standard inventory of language identifiers and administers the annual cycle for changes and updates. This edition of Ethnologue is the second to use the ISO 639-3 language identifiers. In the fifteenth edition they had the status of Draft International Standard. In this edition they are based on the standard as originally adopted plus the 2006 series of adopted change requests (released August 2007) and the 2007 series of adopted change requests (released January 2008).
Information about the ISO 639-3 standard and procedures for requesting additions, deletions, and other modifications to the ISO 639-3 inventory of identified languages can be found at the ISO 639-3 website: http://www.sil.org/iso639-3."
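The relationship between the two-letter and three-letter identifiers described above can be illustrated with a minimal Python sketch. The table below holds only a handful of well-known identifiers copied by hand for illustration, not the official inventory, and the names SAMPLE_CODES, is_well_formed and to_two_letter are hypothetical.

```python
import re

# A tiny hand-picked sample; the full ISO 639-3 inventory has thousands of entries.
# Keys are ISO 639-3 three-letter identifiers; each value gives the language
# name and its ISO 639-1 two-letter identifier.
SAMPLE_CODES = {
    "eng": ("English", "en"),
    "fra": ("French", "fr"),
    "deu": ("German", "de"),
    "spa": ("Spanish", "es"),
    "gla": ("Scottish Gaelic", "gd"),
    "hat": ("Haitian", "ht"),
}

# Three lowercase ASCII letters: the shape of an identifier, not proof it is assigned.
THREE_LETTER = re.compile(r"[a-z]{3}")


def is_well_formed(code):
    """Return True if the string has the shape of a three-letter identifier."""
    return THREE_LETTER.fullmatch(code) is not None


def to_two_letter(code):
    """Return the two-letter identifier for a sample code, or None if not in the sample."""
    entry = SAMPLE_CODES.get(code)
    return entry[1] if entry else None
```

Most languages in the three-letter inventory have no two-letter counterpart at all, which is why a real lookup would have to allow for a missing value, as to_two_letter does here by returning None.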
= Experiences
Caoimhín Ó Donnaíle has taught computing - through the Gaelic language - at the Institute Sabhal Mór Ostaig, on the Island of Skye, in Scotland. He has also maintained the bilingual (English, Gaelic) college website, which is the main site worldwide with information on Scottish Gaelic, as well as the bilingual webpage European Minority Languages, a list of minority languages by alphabetic order and by language family. He wrote in May 2001: "There has been a great expansion in the use of information technology in our college. Far more computers, more computing staff, flat screens. Students do everything by computer, use Gaelic spell-checking, and a Gaelic online terminology database. There are more hits on our website. There is more use of sound. Gaelic radio (both Scottish and Irish) is now available continuously worldwide via the internet. A major project has been the translation of the Opera web browser into Gaelic — the first software of this size available in Gaelic."
How about the internet and endangered languages? "I would emphasize the point that as regards the future of endangered languages, the internet speeds everything up. If people don't care about preserving languages, the internet and accompanying globalization will greatly speed their demise. If people do care about preserving them, the internet will be a tremendous help."
Guy Antoine is the founder of Windows on Haiti, a reference website about Haitian culture. He wrote in November 1999: "In Windows on Haiti, the primary language of the site is English, but one will equally find a center of lively discussion conducted in 'Kreyòl'. In addition, one will find documents related to Haiti in French, in the old colonial Creole, and I am open to publishing others in Spanish and other languages. I do not offer any sort of translation, but multilingualism is alive and well at the site, and I predict that this will increasingly become the norm throughout the web."
Guy added in June 2001: "Kreyòl is the only national language of Haiti, and one of its two official languages, the other being French. It is hardly a minority language in the Caribbean context, since it is spoken by eight to ten million people. (…) I have taken the promotion of Kreyòl as a personal cause, since that language is the strongest of bonds uniting all Haitians, in spite of a small but disproportionately influential Haitian elite's disdainful attitude to adopting standards for the writing of Kreyòl and supporting the publication of books and official communications in that language. For instance, there was recently a two-week book event in Haiti's Capital and it was promoted as 'Livres en Folie' ('A mad feast for books'). Some 500 books from Haitian authors were on display, among which one could find perhaps 20 written in Kreyòl. This is within the context of France's major push to celebrate Francophony among its former colonies. This plays rather well in Haiti, but directly at the expense of Creolophony. What I have created in response to those attitudes are two discussion forums on my website, Windows on Haiti, held exclusively in Kreyòl. One is for general discussions on just about everything but obviously more focused on Haiti's current socio-political problems. The other is reserved only to debates of writing standards for Kreyòl. Those debates have been quite spirited and have met with the participation of a number of linguistic experts. The uniqueness of these forums is their non-academic nature."
Robert Beard, co-founder of the yourDictionary.com portal, wrote in January 2000: "While English still dominates the web, the growth of monolingual non-English websites is gaining strength with the various solutions to the font problems. Languages that are endangered are primarily languages without writing systems at all (only 1/3 of the world's 6,000+ languages have writing systems). I still do not see the web contributing to the loss of language identity and still suspect it may, in the long run, contribute to strengthening it. More and more Native Americans, for example, are contacting linguists, asking them to write grammars of their language and help them put up dictionaries. For these people, the web is an affordable boon for cultural expression."
= [Quote]
Peter Raggett, deputy-head (and then head) of the Central Library at the OECD (Organization for Economic Cooperation and Development), wrote in August 1999: "I think it is incumbent on European organizations and businesses to try and offer websites in three or four languages if resources permit. In this age of globalization and electronic commerce, businesses are finding that they are doing business across many countries. Allowing French, German, Japanese speakers to easily read one's website as well as English speakers will give a business a competitive edge in the domain of electronic trading."
= [Text]
In 1999, the subtitle of Babel's website was: "Towards communicating on the internet in any language…" Babel was a joint project of Alis Technologies and the Internet Society to contribute to the internationalization of the internet. Babel offered a multilingual website (English, French, German, Italian, Portuguese, Spanish and Swedish), with information about the world's languages, and a typographical and linguistic glossary. "The Internet and Multilingualism" section gave information on how to develop a multilingual website, and how to code the "world's writing".
Bill Dunlap, founder of Euro-Marketing Associates, a company based in San Francisco and Paris, launched the international marketing consultancy Global Reach as a methodology for U.S. companies to expand their internet presence into an international framework. This included translating a website into other languages, actively promoting it, and using local online banner advertising to increase local website traffic.
Bill Dunlap explained in December 1998: "Promoting your website is at least as important as creating it, if not more important. You should be prepared to spend at least as much time and money in promoting your website as you did in creating it in the first place. With the Global Reach program, you can have it promoted in countries where English is not spoken, and achieve a wider audience… and more sales. There are many good reasons for taking the online international market seriously. Global Reach is a means for you to extend your website to many countries, speak to online visitors in their own language and reach online markets there. (…)
Since 1981, when my professional life started, I've been involved with bringing American companies in Europe. This is very much an issue of language, since the products and their marketing have to be in the languages of Europe in order for them to be visible here. Since the web became popular in 1995 or so, I've turned these activities to their online dimension, and have come to champion European e-commerce among my fellow American compatriots. Most lately at Internet World in New York, I spoke about European e-commerce and how to use a website to address the various markets in Europe."
Bill added in July 1999: "After a website's home page is available in several languages, the next step is the development of content in each language. A webmaster will notice which languages draw more visitors (and sales) than others, and these are the places to start in a multilingual web promotion campaign. At the same time, it is always good to increase the number of languages available on a website: just a home page translated into other languages would do for a start, before it becomes obvious that more should be done to develop a certain language branch on a website."
The World Wide Web Consortium (W3C) was founded in October 1994 to develop interoperable technologies (specifications, guidelines, software, and tools) for the web, for example specifications for markup languages (HTML, XML, and others), and to act as a forum for information, commerce, communication and collective understanding. In 1998, the section Internationalization/Localization gave a definition of protocols used for internationalization/localization: HTML, base character set, new tags and attributes, HTTP, language negotiation, URLs & other identifiers including non-ASCII characters, etc. It also offered some help with creating a multilingual website.
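Language negotiation over HTTP, one of the items listed above, can be sketched as follows. In this mechanism the browser sends an Accept-Language header listing language tags with optional quality weights, and the server picks the best version it has. This is a simplified Python illustration, not W3C code: the function names and the default of "en" are assumptions, and wildcards and the finer matching rules are ignored.

```python
def parse_accept_language(header):
    """Parse an Accept-Language header into (tag, quality) pairs, best first."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without a weight counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal weights keep the order the browser sent.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header, available, default="en"):
    """Pick the first preferred language the site offers, else the default."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # fall back from "fr-ca" to "fr"
        if primary in available:
            return primary
    return default
```

For a site published in English, French and German, a call such as choose_language("fr-CA,fr;q=0.9,en;q=0.5", {"en", "fr", "de"}) selects the French version.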
The Localisation Industry Standards Association (LISA) was created in the mid-1990s as a forum for "software publishers, hardware manufacturers, localization service vendors, and an increasing number of companies from related IT sectors." LISA has defined its mission as "promoting the localization and internationalization industry and providing a mechanism and services to enable companies to exchange and share information on the development of processes, tools, technologies and business models connected with localization, internationalization and related topics". Its website was first housed and maintained by the University of Geneva, Switzerland.
Launched in January 1999 by the European Commission, the website HLTCentral (HLT: Human Language Technologies) gave a short definition of language engineering: "Through language engineering we can find ways of living comfortably with technology. Our knowledge of language can be used to develop systems that recognize speech and writing, understand text well enough to select information, translate between different languages, and generate speech as well as the printed word. By applying such technologies we have the ability to extend the current limits of our use of language. Language enabled products will become an essential and integral part of everyday life."
= [Quote]
Tim McKenna is an author who thinks and writes about the complexity of truth in a world of flux. He wrote in October 2000: "When software gets good enough for people to chat or talk on the web in real time in different languages, then we will see a whole new world appear before us. Scientists, political activists, businesses and many more groups will be able to communicate immediately without having to go through mediators or translators."
= A definition
Machine translation (MT) is the automated process of translating a text from one language into another. MT analyzes the text in the source language and automatically generates the corresponding text in the target language. Because no human intervenes during the translation process, machine translation differs from computer-assisted translation (CAT), which involves some interaction between the translator and the computer.
As explained on the website of SYSTRAN, a company specializing in translation software, "machine translation software translates one natural language into another natural language. MT takes into account the grammatical structure of each language and uses rules to transfer the grammatical structure of the source language (text to be translated) into the target language (translated text). MT cannot replace a human translator, nor is it intended to."
The website of the European Association for Machine Translation (EAMT) gives the following definition: "Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful for certain specific applications, usually in the domain of technical documentation. In addition, translation software packages which are designed primarily to assist the human translator in the production of translations are enjoying increasing popularity within professional translation organizations."
Machine translation is the earliest type of natural language processing, as stated on the website of Globalink, a company offering language translation software and services: "From the very beginning, machine translation (MT) and natural language processing (NLP) have gone hand-in-hand with the evolution of modern computational technology. The development of the first general-purpose programmable computers during World War II was driven and accelerated by Allied cryptographic efforts to crack the German Enigma machine and other wartime codes. Following the war, the translation and analysis of natural language text provided a testbed for the newly emerging field of Information Theory.
During the 1950s, research on Automatic Translation (known today as Machine Translation, or 'MT') took form in the sense of literal translation, more commonly known as word-for-word translations, without the use of any linguistic rules. The Russian project initiated at Georgetown University in the early 1950s represented the first systematic attempt to create a demonstrable machine translation system. Throughout the decade and into the 1960s, a number of similar university and government-funded research efforts took place in the United States and Europe. At the same time, rapid developments in the field of Theoretical Linguistics, culminating in the publication of Noam Chomsky's "Aspects of the Theory of Syntax" (1965), revolutionized the framework for the discussion and understanding of the phonology, morphology, syntax and semantics of human language.
In 1966, the U.S. government-issued ALPAC (Automatic Language Processing Advisory Committee) report offered a prematurely negative assessment of the value and prospects of practical machine translation systems, effectively putting an end to funding and experimentation in the field for the next decade. It was not until the late 1970s, with the growth of computing and language technology, that serious efforts began once again. This period of renewed interest also saw the development of the Transfer model of machine translation and the emergence of the first commercial MT systems. While commercial ventures such as SYSTRAN and METAL began to demonstrate the viability, utility and demand for machine translation, these mainframe-bound systems also illustrated many of the problems in bringing MT products and services to market. High development cost, labor-intensive lexicography and linguistic implementation, slow progress in developing new language pairs, inaccessibility to the average user, and inability to scale easily to new platforms are all characteristics of these second-generation systems."
As explained in August 1998 by Eduard Hovy, head of the Natural Language Group at USC/ISI (University of Southern California/Information Sciences Institute), machine translation implies "language-related applications/functionalities that are not translation, such as information retrieval (IR) and automated text summarization (SUM). You would not be able to find anything on the Web without IR! — all the search engines (AltaVista, Yahoo!, etc.) are built upon IR technology. Similarly, though much newer, it is likely that many people will soon be using automated summarizers to condense (or at least, to extract the major contents of) single (long) documents or lots of (any length) ones together."
= Experiences
In December 1997, AltaVista, a leading search engine, was the first to launch free translation software, Babel Fish (also called AltaVista Translation), which could translate webpages (up to three pages at a time) from English into French, German, Italian, Portuguese or Spanish, and vice versa. The software was developed by SYSTRAN (an acronym for System Translation), a company specializing in machine translation software. SYSTRAN's headquarters are located in Soisy-sous-Montmorency, near Paris, France. Sales, marketing, and research and development are based in its subsidiary in La Jolla, California.
This initiative was followed by other translation software developed by Alis Technologies, Globalink, Lernout & Hauspie, and Softissimo, with free and/or paid versions on the web.
Based in Montreal, Quebec, Alis Technologies has specialized in development and marketing of language handling solutions and services, particularly language implementation in the information technology industry. Alis Translation Solutions (ATS) has offered applications in a number of languages, and tools and services to improve the quality of translations. Language Technology Solutions (LTS) has marketed advanced tools and services for language engineering and information technology (90 languages covered).
Based in Ieper, Belgium, and Burlington, Massachusetts, Lernout & Hauspie (L&H) was a leader in advanced speech technology for commercial applications and products, with four core technologies: automatic speech recognition (ASR), text-to-speech (TTS), text-to-text (TTT), and digital speech compression (DSC). Its ASR, TTS and DSC technologies were licensed to companies in telecommunications, computers and multimedia, consumer electronics and automotive electronics. Its TTT translation services were provided to IT companies, and vertical and automation markets. The Machine Translation Group created by Lernout & Hauspie included L&H Language Technology, AppTek, AILogic, NeocorTech, and Globalink. Lernout & Hauspie was later bought by Nuance Communications.
Globalink, a company created in 1990 in the U.S., focused on language translation software and services, i.e. customized translation solutions built around software products, online options, and professional translation services. The software products were available in Spanish, French, Portuguese, German, Italian and English, for individuals, small businesses, multinational corporations and governments, from a stand-alone product giving a fast draft translation to a full system managing professional translations.
As explained on the company website in 1998, "with Globalink's translation applications, the computer uses three sets of data: the input text, the translation program and permanent knowledge sources (containing a dictionary of words and phrases of the source language), and information about the concepts evoked by the dictionary and rules for sentence development. These rules are in the form of linguistic rules for syntax and grammar, and some are algorithms governing verb conjugation, syntax adjustment, gender and number agreement and word re-ordering. Once the user has selected the text and set the machine translation process in motion, the program begins to match words of the input text with those stored in its dictionary. Once a match is found, the application brings up a complete record that includes information on possible meanings of the word and its contextual relationship to other words that occur in the same sentence. The time required for the translation depends on the length of the text. A three-page, 750-word document takes about three minutes to render a first draft translation."
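The process described in this quotation (dictionary lookup, a record for each matched word, then rules such as word re-ordering) can be illustrated with a toy Python sketch. It is in no way Globalink's software: the four-word English-to-French lexicon, the single re-ordering rule and all names are invented for illustration, and real systems also handle agreement, conjugation and ambiguity.

```python
# Toy English-to-French lexicon: each record stores a translation and a part of speech.
LEXICON = {
    "the": {"target": "le", "pos": "det"},
    "black": {"target": "noir", "pos": "adj"},
    "cat": {"target": "chat", "pos": "noun"},
    "sleeps": {"target": "dort", "pos": "verb"},
}


def look_up(words):
    """Match each input word against the lexicon and return its record."""
    records = []
    for word in words:
        record = LEXICON.get(word.lower())
        if record is None:
            # Unknown words are passed through untranslated.
            record = {"target": word, "pos": "unknown"}
        records.append(record)
    return records


def reorder(records):
    """Word re-ordering rule: adjective + noun becomes noun + adjective."""
    result = list(records)
    i = 0
    while i < len(result) - 1:
        if result[i]["pos"] == "adj" and result[i + 1]["pos"] == "noun":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return result


def translate(sentence):
    """Produce a rough draft: lexicon lookup followed by the re-ordering rule."""
    records = reorder(look_up(sentence.split()))
    return " ".join(record["target"] for record in records)


print(translate("the black cat sleeps"))  # le chat noir dort
```

A word missing from the lexicon, as in translate("the dog"), simply comes through unchanged ("le dog"), which hints at why dictionary coverage and post-editing matter for a first draft.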
At the headquarters of the World Health Organization (WHO) in Geneva, Switzerland, the Computer-assisted Translation and Terminology Unit (CTT) has been a pioneer since 1997 in assessing technical options for using computer-assisted translation (CAT) systems based on translation memory (TM). With such systems, translators can access previous translations from portions of the text; accept, reject or modify them; and add the new translation to the memory, thus enriching it for future reference. By archiving the daily output, the translator helps in building an extensive translation memory and in solving a number of translation issues. Several projects have been under way at the CTT for electronic document archiving and retrieval, bilingual/multilingual text alignment, computer-assisted translation, translation memory and terminology database management, and speech recognition.
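The translation memory workflow described above (retrieve a previous translation for a similar portion of text, accept or modify it, then store the result) can be sketched in a few lines of Python. This is an illustrative assumption about how such a memory might work, not the system used at the WHO: the class name, the 0.75 similarity threshold and the use of the standard library's difflib for fuzzy matching are all choices made for this sketch.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """A minimal in-memory store of source segments and their translations."""

    def __init__(self, threshold=0.75):
        self.segments = {}          # source segment -> stored translation
        self.threshold = threshold  # minimum similarity for a suggestion

    def add(self, source, target):
        """Store (or overwrite) the translation of a source segment."""
        self.segments[source] = target

    def suggest(self, source):
        """Return (stored_source, stored_target, score) for the closest match,
        or None if nothing in the memory is similar enough."""
        best = None
        best_score = 0.0
        for stored_source, stored_target in self.segments.items():
            score = SequenceMatcher(
                None, source.lower(), stored_source.lower()
            ).ratio()
            if score > best_score:
                best, best_score = (stored_source, stored_target), score
        if best is not None and best_score >= self.threshold:
            return best[0], best[1], best_score
        return None


memory = TranslationMemory()
memory.add("Wash your hands before meals.", "Lavez-vous les mains avant les repas.")

# A near match is offered to the translator, who can accept, reject or modify it.
match = memory.suggest("Wash your hands after meals.")
if match is not None:
    memory.add("Wash your hands after meals.", "Lavez-vous les mains après les repas.")
```

Each accepted or corrected segment is added back with add, so the memory grows with the translator's daily output, as the paragraph above describes.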
The Pan American Health Organization (PAHO) in Washington, D.C. has developed its own machine translation software as a joint effort of its computational linguists, translators, and system programmers. The PAHO Translation Unit has used SPANAM (Spanish to English) since 1980 and ENGSPAN (English to Spanish) since 1985, processing over 25 million words between 1980 and 1998. Staff and freelance translators post-edit the raw output to produce high-quality translations with a 30-50% gain in productivity. The software is available on the LAN (Local Area Network) of PAHO Headquarters, where it is regularly used by the staff of technical and administrative units. It is also available in a number of PAHO field offices, and has been licensed to public and non-profit institutions in the U.S., Latin America, and Spain. The software was later renamed PAHOMTS and extended with new language pairs involving Portuguese.
= Comments
# Comments from ZDNN
In "Web Embraces Language Translation", an article published in ZDNN (ZDNetwork News) on 21 July 1998, Martha Stone explained: "Among the new products in the $10 billion language translation business are instant translators for websites, chat rooms, email and corporate intranets. The leading translation firms are mobilizing to seize the opportunities. Such as:
*SYSTRAN has partnered with AltaVista and reports between 500,000 and 600,000 visitors a day on babelfish.altavista.digital.com, and about 1 million translations per day — ranging from recipes to complete webpages. About 15,000 sites link to babelfish, which can translate to and from French, Italian, German, Spanish and Portuguese. The site plans to add Japanese soon. 'The popularity is simple. With the internet, now there is a way to use U.S. content. All of these contribute to this increasing demand,' said Dimitros Sabatakakis, group CEO of SYSTRAN, speaking from his Paris home.
*Alis technology powers the Los Angeles Times' soon-to-be launched language translation feature on its site. Translations will be available in Spanish and French, and eventually, Japanese. At the click of a mouse, an entire webpage can be translated into the desired language.
*Globalink offers a variety of software and web translation possibilities, including a free email service and software to enable text in chat rooms to be translated.
But while these so-called 'machine' translations are gaining worldwide popularity, company execs admit they're not for every situation. Representatives from Globalink, Alis and SYSTRAN use such phrases as 'not perfect' and 'approximate' when describing the quality of translations, with the caveat that sentences submitted for translation should be simple, grammatically accurate and idiom-free. 'The progress on machine translation is moving at Moore's Law — every 18 months it's twice as good,' said Vin Crosbie, a web industry analyst in Greenwich, Conn. 'It's not perfect, but some [non-English speaking] people don't realize I'm using translation software.'
With these translations, syntax and word usage suffer, because dictionary-driven databases can't decipher between homonyms — for example, 'light' (as in the sun or light bulb) and 'light' (the opposite of heavy). Still, human translation would cost between $50 and $60 per webpage, or about 20 cents per word, SYSTRAN's Sabatakakis said. While this may be appropriate for static 'corporate information' pages, the machine translations are free on the web, and often less than $100 for software, depending on the number of translated languages and special features."
# Comments from RALI
Although the imminent arrival of a universal translation machine was announced at the end of the 1940s, machine translation has not yet produced good translations. Pierre Isabelle and Patrick Andries, two scientists from the RALI Laboratory (Laboratory for Applied Research in Computational Linguistics - Laboratoire de Recherche Appliquée en Linguistique Informatique) in Montreal, Quebec, explain the reasons for this failure in "La Traduction Automatique, 50 Ans Après" (Machine Translation, 50 Years Later), an article published in 1998 by Multimédium, a French-language online magazine: "The ultimate goal of building a machine capable of competing with a human translator remains elusive due to slow progress in research. (…) Recent research, based on large collections of texts called corpora, using either statistical or analogical methods, has promised to reduce the quantity of manual work required to build a machine translation (MT) system, but can't promise for sure a significant improvement in the quality of machine translation. (…) The use of MT will be more or less restricted to tasks of information assimilation or tasks of text distribution in restricted sub-languages."
Drawing on the ideas Yehoshua Bar-Hillel expressed in "The State of Machine Translation", an article published in 1951, Pierre Isabelle and Patrick Andries define three implementation strategies for machine translation: (a) a tool for information assimilation that scans multilingual data and supplies rough translations; (b) "restricted language" situations, such as the METEO system which, since 1977, has translated the weather forecasts of the Canadian Ministry of Environment; (c) human/machine coupling before, during and after the machine translation process, which may not save money compared to traditional translation.
Pierre Isabelle and Patrick Andries favor "a workstation for the human translator" over a "robot translator": "Recent research on probabilistic methods has shown that it is possible to model efficiently some simple aspects of the translation relationship between two texts. For example, methods have been developed to calculate the correct alignment between the sentences of a text and their translation, that is, to identify the sentence(s) of the source text corresponding to each sentence of the translation. Applied on a large scale, these techniques can use the archives of a translation service to build a translation memory for recycling fragments from previous translations. Such systems are already available on the translation market (IBM Translation Manager II, Trados Translator's Workbench by Trados, RALI TransSearch, etc.). The latest research focuses on models that can automatically set up correspondences at a finer level than the sentence level, i.e. syntagms and words. The results give hope for a range of new tools for the human translator, including tools for the study of terminology, for dictation and translation typing, and for the detection of translation errors."
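To make the idea of sentence alignment concrete, here is a minimal Python sketch. It is not the probabilistic method of the RALI researchers, which the article does not detail; it swaps in a much simpler length-based heuristic (a translation tends to be about as long as its source) solved by dynamic programming. The function name `align`, the merge penalty of 2, and the sample sentences are assumptions chosen for illustration.

```python
def align(src, tgt):
    """Align two lists of sentences with dynamic programming.

    Allowed groupings: 1-1, 1-2 and 2-1 sentences. The cost of a grouping
    is the difference in character length between its two sides, plus a
    small penalty when sentences have to be merged.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    groupings = [(1, 1, 0), (1, 2, 2), (2, 1, 2)]  # (src count, tgt count, penalty)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in groupings:
                if i + di > n or j + dj > m:
                    continue
                src_len = sum(len(s) for s in src[i:i + di])
                tgt_len = sum(len(t) for t in tgt[j:j + dj])
                candidate = cost[i][j] + abs(src_len - tgt_len) + penalty
                if candidate < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = candidate
                    back[i + di][j + dj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment possible with the allowed groupings")
    # Walk back from the end to recover the chosen groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = [
    "The report is ready.",
    "It was written in Geneva and it covers three years of work.",
]
french = [
    "Le rapport est prêt.",
    "Il a été rédigé à Genève.",
    "Il couvre trois années de travail.",
]
aligned = align(english, french)
```

Here the second English sentence is paired with the last two French sentences, because together they match its length. Aligned pairs of this kind are what a translation memory is built from.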
# Comments from Randy Hobler
In September 1998, Randy Hobler was a consultant in internet marketing at Globalink, after working for IBM, Johnson & Johnson, Burroughs Wellcome, Pepsi, and Heublein. He wrote in an email interview: "We are rapidly reaching the point where highly accurate machine translation of text and speech will be so common as to be embedded in computer platforms, and even in chips in various ways. At that point, and as the growth of the web slows, the accuracy of language translation hits 98% plus, and the saturation of language pairs has covered the vast majority of the market, language transparency (any-language-to-any-language communication) will be too limiting a vision for those selling this technology. The next development will be 'transcultural, transnational transparency', in which other aspects of human communication, commerce and transactions beyond language alone will come into play. For example, gesture has meaning, facial movement has meaning and this varies among societies. The thumb-index finger circle means 'OK' in the United States. In Argentina, it is an obscene gesture.
When the inevitable growth of multimedia, multilingual videoconferencing comes about, it will be necessary to 'visually edit' gestures on the fly. The MIT (Massachusetts Institute of Technology) Media Lab, Microsoft and many others are working on computer recognition of facial expressions, biometric access identification via the face, etc. It won't be any good for a U.S. business person to be making a great point in a web-based multilingual video conference to an Argentinian, having his words translated into perfect Argentinian Spanish if he makes the 'O' gesture at the same time. Computers can intercept this kind of thing and edit them on the fly.
There are thousands of ways in which cultures and countries differ, and most of these are computerizable to change as one goes from one culture to the other. They include laws, customs, business practices, ethics, currency conversions, clothing size differences, metric versus English system differences, etc. Enterprising companies will be capturing and programming these differences and selling products and services to help the peoples of the world communicate better. Once this kind of thing is widespread, it will truly contribute to international understanding."
= Machine translation R&D
Here is an overview of the work of four research centers, in Quebec (RALI Laboratory), California (Natural Language Group), Switzerland (ISSCO) and Japan (UNDL Foundation).
# RALI Laboratory
In Montreal, Quebec, the RALI Laboratory (Laboratory of Applied Research in Computational Linguistics - Laboratoire de Recherche Appliquée en Linguistique Informatique) has worked in automatic text alignment, automatic text generation, automatic reaccentuation, language identification, and finite state transducers. RALI produces the "TransX family" of what it calls "a new generation" of translation support tools (TransType, TransTalk, TransCheck, and TransSearch), which are based on probabilistic translation models that automatically calculate correspondences between the text produced by a translator and the original text from the source language.
As explained on RALI's website in 1998: "(a) TransType speeds up the keying-in of a translation by anticipating a translator's choices and criticizing them when appropriate. In proposing its suggestions, TransType takes into account both the source text and the partial translation that the translator has already produced. (b) TransTalk is an automatic dictation system that makes use of a probabilistic translation model in order to improve the performance of its voice recognition model. (c) TransCheck automatically detects certain types of translation errors by verifying that the correspondences between the segments of a draft and the segments of the source text respect well-known properties of a good translation. (d) TransSearch allows translators to search databases of pre-existing translations in order to find ready-made solutions to all sorts of translation problems. In order to produce the required databases, the translations and the source language texts must first be aligned."
# Natural Language Group
The Natural Language Group (NLG) at the Information Sciences Institute (ISI) of the University of Southern California (USC) has been involved in various aspects of computational/natural language processing: machine translation, automated text summarization, multilingual verb access and text management, development of large concept taxonomies (ontologies), discourse and text generation, construction of large lexicons for various languages, and multimedia communication.
Eduard Hovy, head of the Natural Language Group, explained in August 1998: "People will write their own language for several reasons (convenience, secrecy, and local applicability), but that does not mean that other people are not interested in reading what they have to say! This is especially true for companies involved in technology watch (say, a computer company that wants to know, daily, all the Japanese newspaper and other articles that pertain to what they make) or some Government Intelligence agencies (the people who provide the most up-to-date information for use by your government officials in making policy, etc.). One of the main problems faced by these kinds of people is the flood of information, so they tend to hire 'weak' bilinguals who can rapidly scan incoming text and throw out what is not relevant, giving the relevant stuff to professional translators. Obviously, a combination of SUM (automated text summarization) and MT (machine translation) will help here; since MT is slow, it helps if you can do SUM in the foreign language, and then just do a quick and dirty MT on the result, allowing either a human or an automated IR-based text classifier to decide whether to keep or reject the article. For these kinds of reasons, the U.S. Government has over the past five years been funding research in MT, SUM, and IR (information retrieval), and is interested in starting a new program of research in Multilingual IR. This way you will be able to one day open Netscape or Explorer or the like, type in your query in (say) English, and have the engine return texts in *all* the languages of the world. You will have them clustered by subarea, summarized by cluster, and the foreign summaries translated, all the kinds of things that you would like to have."
Eduard Hovy added in August 1999: "Over the past 12 months I have been contacted by a surprising number of new information technology (IT) companies and startups. Most of them plan to offer some variant of electronic commerce (online shopping, bartering, information gathering, etc.). Given the rather poor performance of current non-research level natural language processing technology (when is the last time you actually easily and accurately found a correct answer to a question to the web, without having to spend too much time sifting through irrelevant information?), this is a bit surprising. But I think everyone feels that the new developments in automated text summarization, question analysis, and so on, are going to make a significant difference. I hope so!—but the level of performance is not available yet.
It seems to me that we will not get a big breakthrough, but we will get a somewhat acceptable level of performance, and then see slow but sure incremental improvement. The reason is that it is very hard to make your computer really 'understand' what you mean: this requires us to build into the computer a network of 'concepts' and their interrelationships that (at some level) mirror those in your own mind, at least in the subject areas of interest. The surface (word) level is not adequate: when you type in 'capital of Switzerland', current systems have no way of knowing whether you mean 'capital city' or 'financial capital'. Yet the vast majority of people would choose the former reading, based on phrasing and on knowledge about what kinds of things one is likely to ask the web, and in what way. Several projects are now building, or proposing to build, such large 'concept' networks. This is not something one can do in two years, and not something that has a correct result. We have to develop both the network and the techniques for building it semi-automatically and self-adaptively. This is a big challenge."
Eduard Hovy added in September 2000: "I see a continued increase in small companies using language technology in one way or another: either to provide search, or translation, or reports, or some other communication function. The number of niches in which language technology can be applied continues to surprise me: from stock reports and updates to business-to-business communications to marketing…
With regard to research, the main breakthrough I see was led by a colleague at ISI (I am proud to say), Kevin Knight. A team of scientists and students last summer at Johns Hopkins University in Maryland developed a faster and otherwise improved version of a method originally developed (and kept proprietary) by IBM about 12 years ago. This method allows one to create a machine translation (MT) system automatically, as long as one gives it enough bilingual text. Essentially the method finds all correspondences in words and word positions across the two languages and then builds up large tables of rules for what gets translated to what, and how it is phrased.
Although the output quality is still low — no-one would consider this a final product, and no-one would use the translated output as is — the team built a (low-quality) Chinese-to-English MT system in 24 hours. That is a phenomenal feat — this has never been done before. (Of course, say the critics: you need something like 3 million sentence pairs, which you can only get from the parliaments of Canada, Hong Kong, or other bilingual countries; and of course, they say, the quality is low. But the fact is that more bilingual and semi-equivalent text is becoming available online every day, and the quality will keep improving to at least the current levels of MT engines built by hand. Of that I am certain.)
Other developments are less spectacular. There's a steady improvement in the performance of systems that can decide whether an ambiguous word such as "bat" means "flying mammal" or "sports tool" or "to hit"; there is solid work on cross-language information retrieval (which you will soon see in being able to find Chinese and French documents on the web even though you type in English-only queries), and there is some rather rapid development of systems that answer simple questions automatically (rather like the popular web system AskJeeves, but this time done by computers, not humans). These systems refer to a large collection of text to find 'factoids' (not opinions or causes or chains of events) in response to questions such as 'what is the capital of Uganda?' or 'how old is President Clinton?' or 'who invented the xerox process?', and they do so rather better than I had expected."
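The method Eduard Hovy describes, learning word correspondences from bilingual text and building tables of what gets translated to what, can be illustrated with a small Python sketch in the spirit of the simplest IBM word-alignment model (often called Model 1). This is an assumption about the family of techniques involved, not a reconstruction of the Johns Hopkins system, and it ignores word positions and phrasing entirely. The function names and the three-sentence German-English corpus are invented for the example; real systems need millions of sentence pairs.

```python
from collections import defaultdict


def train_word_table(pairs, iterations=10):
    """Estimate word-translation probabilities t[(target, source)]
    from sentence pairs with expectation-maximization."""
    tgt_vocab = {word for _, tgt in pairs for word in tgt}
    # Start with every target word equally likely for every source word.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) link counts
        total = defaultdict(float)  # expected count of each source word
        for src, tgt in pairs:
            for tw in tgt:
                # Share one unit of evidence for tw among the source words,
                # in proportion to the current probabilities.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    share = t[(tw, sw)] / norm
                    count[(tw, sw)] += share
                    total[sw] += share
        # Re-estimate the table from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
table = train_word_table(corpus)
guesses = {word: best_translation(table, word)
           for word in ["das", "Haus", "Buch", "ein"]}
```

No dictionary is supplied: the pairing of "das" with "the" or "Buch" with "book" emerges only from which words keep appearing together across sentence pairs, which is why the approach depends so heavily on the amount of bilingual text available.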
# ISSCO
In Geneva, Switzerland, ISSCO (Dalle Molle Institute for Semantic and Cognitive Studies - Institut Dalle Molle pour les Études Sémantiques et Cognitives) is a research laboratory conducting basic and applied research in computational linguistics (CL) and artificial intelligence (AI), for a number of Swiss and European research projects. The University of Geneva has provided administrative support and infrastructure. Research is funded with grants and contracts with public and private bodies.
Created by the Foundation Dalle Molle in 1972 to conduct research in cognition and semantics, ISSCO has come to specialize in natural language processing, including multilingual language processing, in a number of areas: machine translation, linguistic environments, multilingual generation, discourse processing, data collection, etc. ISSCO is multi-disciplinary and multi-national. As explained on its website in 1998, "its staff and its visitors [are drawn] from the disciplines of computer science, linguistics, mathematics, psychology and philosophy. The long-term staff of the Institute is relatively small in number; with a much larger number of visitors coming for stays ranging from a month to two years. This ensures a continual exchange of ideas and encourages flexibility of approach amongst those associated with the Institute."
# UNDL Foundation
The UNL (universal networking language) project was launched in the mid-1990s as a main digital metalanguage project by the Institute of Advanced Studies (IAS) of the United Nations University (UNU) in Tokyo, Japan. As explained on the bilingual (English, Japanese) website in 1998: "UNL is a language that — with its companion 'enconverter' and 'deconverter' software — enables communication among peoples of differing native languages. It will reside, as a plug-in for popular web browsers, on the internet, and will be compatible with standard network servers. The technology will be shared among the member states of the United Nations. Any person with access to the internet will be able to 'enconvert' text from any native language of a member state into UNL. Just as easily, any UNL text can be 'deconverted' from UNL into native languages. United Nations University's UNL Center will work with its partners to create and promote the UNL software, which will be compatible with popular network servers and computing platforms."
In 2000, 120 researchers worldwide were working on a multilingual project in 16 languages (Arabic, Brazilian Portuguese, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Latvian, Mongolian, Russian, Spanish, Swahili, and Thai). The UNDL Foundation (UNDL: Universal Networking Digital Language) was founded in January 2001 to develop and promote the UNL project.
= Chronology
[Each line begins with the year or the year/month.]
1968: ASCII is the first character set encoding.
1971: Project Gutenberg is the first digital library.
1974: The internet takes off.
1990: The web is invented by Tim Berners-Lee.
1991/01: Unicode is a universal character set encoding for all languages.
1993/11: Mosaic is the first web browser.
1994/05: The Human-Languages Page is a catalog of language-related internet resources.
1994/10: The World Wide Web Consortium will deal with internationalization and localization.
1994: Travlang is dedicated to both travel and languages.
1995/12: The Kotoba Home Page deals with language issues using our keyboard.
1995: The Internet Dictionary Project works on creating free translating dictionaries.
1995: NetGlos is a multilingual glossary of internet terminology.
1995: Global Reach is a virtual consultancy stemming from Euro-Marketing Associates.
1995: LISA is the localization industry standards association.
1995: "The Ethnologue: Languages of the World" offers a free online version.
1996/04: OneLook Dictionaries is a fast finder in online dictionaries.
1997/01: UNL (universal networking language) is a digital metalanguage project.
1997/12: AltaVista launches AltaVista Translation, also called Babel Fish.
1997: The Logos Dictionary goes online for free.
1999/12: Britannica.com is the first main English-language online encyclopedia.
1999/12: WebEncyclo is the first main French-language online encyclopedia.
1999: WordReference.com offers free online bilingual translating dictionaries.
2000/02: yourDictionary.com is a major language portal.
2000/07: Non-English-speaking internet users reach 50%.
2001/01: Wikipedia is a main free multilingual cooperative encyclopedia.
2001/01: The UNDL Foundation develops UNL, a digital metalanguage project.
2001/04: The Human-Languages Project becomes the iLoveLanguages portal.
2004/01: Project Gutenberg Europe is launched as a multilingual project.
2007/03: IATE is the new terminological database of the European Union.
2009: "The Ethnologue" launches its 16th edition as an encyclopedic reference work.
= Websites
Alis Technologies: http://www.alis.com/
Aquarius.net: Directory of Localization Experts: http://www.aquarius.net/
ASCII Table: http://www.asciitable.com/
Asia-Pacific Association for Machine Translation (AAMT): http://www.aamt.info/
Association for Computational Linguistics (ACL): http://www.aclweb.org/
Association for Machine Translation in the Americas (AMTA): http://www.amtaweb.org/
CALL@Hull: http://www.fredriley.org.uk/call/
ELRA (European Language Resources Association): http://www.elra.info/
ELSNET (European Network of Excellence in Human Language Technologies): http://www.elsnet.org/
Encyclopaedia Britannica Online: http://www.britannica.com/
Encyclopaedia Universalis: http://www.universalis-edu.com/
Ethnologue: http://www.ethnologue.com/
Ethnologue: Endangered Languages: http://www.ethnologue.com/nearly_extinct.asp
EUROCALL (European Association for Computer-Assisted Language Learning): http://www.eurocall-languages.org/
European Association for Machine Translation (EAMT): http://www.eamt.org/
European Bureau for Lesser-Used Languages (EBLUL): http://www.eblul.org/
European Commission: Languages of Europe: http://ec.europa.eu/education/languages/languages-of-europe/
European Minority Languages (list of the Institute Sabhal Mòr Ostaig): http://www.smo.uhi.ac.uk/saoghal/mion-chanain/en/
Google Translate: http://translate.google.com/
Grand Dictionnaire Terminologique (GDT): http://www.granddictionnaire.com/
IATE: InterActive Terminology for Europe: http://iate.europa.eu/
ILOTERM (ILO: International Labor Organization): http://www.ilo.org/iloterm/
iLoveLanguages: http://www.ilovelanguages.com/
International Committee on Computational Linguistics (ICCL): http://nlp.shef.ac.uk/iccl/
Internet Dictionary Project (IDP): http://www.june29.com/IDP/
Internet Society (ISOC): http://www.isoc.org/
Laboratoire CLIPS (Communication Langagière et Interaction Personne-Système): http://www-clips.imag.fr/
Laboratoire CLIPS: GETA (Groupe d'Étude pour la Traduction Automatique): http://www-clips.imag.fr/geta/
LINGUIST List (The): http://linguistlist.org/
Localization Industry Standards Association (LISA): http://www.lisa.org/
Logos: Multilingual Translation Portal: http://www.logos.it/
MAITS (Multilingual Application Interface for Telematic Services): http://wwwold.dkuug.dk/maits/
Merriam-Webster Online: http://www.merriam-webster.com/
Natural Language Group (NLG) at USC/ISI: http://www.isi.edu/natural-language/
Nuance: http://www.nuance.com/
OneLook Dictionary Search: http://www.onelook.com/
Oxford English Dictionary (OED): http://www.oed.com/
Oxford Reference Online (ORO): http://www.oxfordreference.com/
PAHOMTS (PAHO: Pan American Health Organization): http://www.paho.org/english/am/gsp/tr/machine_trans.htm
Palo Alto Research Center (PARC): http://www.parc.com/
Palo Alto Research Center (PARC): Natural Language Processing: http://www.parc.com/work/focus-area/NLP/
RALI (Recherche Appliquée en Linguistique Informatique): http://www-rali.iro.umontreal.ca/
Reverso: Free Online Translator: http://www.reverso.net/
SDL: http://www.sdl.com/
SDL: FreeTranslation.com: http://www.freetranslation.com/
SDL Trados: http://www.trados.com/
Softissimo: http://www.softissimo.com/
SYSTRAN: http://www.systranlinks.com/
SYSTRANet: Free Online Translator: http://www.systranet.com/
TEI: Text Encoding Initiative: http://www.tei-c.org/index.xml
TERMITE (Terminology of Telecommunications): http://www.itu.int/terminology/index.html
*tmx Vokabeltrainer: http://www.tmx.de/
Transparent Language: http://www.transparent.com/
TransPerfect: http://www.transperfect.com/
Travlang: http://www.travlang.com/
Travlang's Translating Dictionaries: http://dictionaries.travlang.com/
UNDL (Universal Networking Digital Language) Foundation: http://www.undl.org/
Unicode: http://www.unicode.org/
Yahoo! Babel Fish: http://babelfish.yahoo.com/
YourDictionary.com: http://www.yourdictionary.com/
YourDictionary.com: Endangered Languages: http://www.yourdictionary.com/elr/index.html
W3C: World Wide Web Consortium: http://www.w3.org/
W3C Internationalization Activity: http://www.w3.org/International/
WELL (Web Enhanced Language Learning): http://www.well.ac.uk/
Wordfast: http://www.wordfast.org/
Xerox XRCE (Xerox Research Centre Europe): http://www.xrce.xerox.com/
Xerox XRCE: Cross-Language Technologies: http://www.xrce.xerox.com/competencies/cross-language/
Copyright © 2009 Marie Lebert. All rights reserved.
End of Project Gutenberg's The Internet and Languages, by Marie Lebert