Interlingua in Google Translate

Machine Translation is the supreme discipline of computational linguistics; it was one of the first major tasks defined for computers in the years after World War II. Warren Weaver, an American science administrator, wrote in his famous 1949 memorandum "Translation": "It is very tempting to say that a book written in Chinese is simply a book written in English which was coded into the 'Chinese code'. If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?"

After many ups and downs in the following decades, the first real breakthrough came with fast PCs, fast web connections and the possibility of compiling and processing immense language data sets. But instead of compiling sets of grammar rules to define one language, then another, and then the relationships between them, the use of statistical models came into vogue: instead of years of linguistic work, a few weeks of processing produced similar results. While rule-based systems created nice-looking sentences with often stupid word choices, statistics-based systems created stupid-looking sentences with good phrase quality. One thing that linguists as well as statisticians always dreamed about was the so-called Interlingua: a kind of neutral in-between language that would make it possible to translate the pure meaning of a sentence into this Interlingua and afterwards to construct a sentence in the target language that bears the same meaning. There is a common three-step pyramid that describes the rising quality of machine translation:
First level: Direct translation from one language to another
Second level: Transfer using one elaborated way or another, e.g. rules, statistics, etc.
Third level: Using an Interlingua.

There were many attempts, from planned languages such as Esperanto to semantic primes and lexical functions. The result was always the same: there is no Interlingua. "Meaning" is too complex a concept to be modeled in a static way.

In 2006, Google released Google Translate, a now very popular MT system that was originally statistics-based, created by the German computer scientist Franz Josef Och (now at Human Longevity). This event inspired me in a very personal way to focus my linguistics career on computational linguistics and to write my Magister thesis, titled "Linguistic Approaches to Improve Statistical Machine Translation" (Linguistische Ansätze zur Verbesserung von statistischer maschineller Übersetzung), at the University of Kassel. That was 10 years ago. Recently, I talked to a friend about the success of Google's AI in beating the Go master Lee Sedol using a neural network. Would this be able to change Machine Translation as well?

In September, Google announced on their research blog that they are switching their translation system from a statistics-based approach to Google Neural Machine Translation (GNMT), "an end-to-end learning framework that learns from millions of examples, and provided significant improvements in translation quality". This system is capable of zero-shot translation, as they write in an article published three days ago, on November 22nd. A zero-shot translation is a translation between two languages for which the system has seen no direct translation examples. For instance, if the system is trained on examples to translate between English and Japanese and between English and Korean, then translating between Japanese and Korean, a pair without any training data, would be a zero-shot translation. As Google states in their blog:
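To make the definition concrete, here is a minimal sketch in Python. It only works out which translation directions count as zero-shot for a given set of trained language pairs; it says nothing about how GNMT works internally. The function name `zero_shot_pairs` and the list of trained pairs are assumptions made for illustration.

```python
from itertools import permutations


def zero_shot_pairs(trained_pairs):
    """Return all (source, target) directions that have no direct training data."""
    trained = set(trained_pairs)
    # Every language that appears anywhere in the training data.
    languages = sorted({lang for pair in trained for lang in pair})
    # Any ordered pair of known languages that was never trained directly.
    return [pair for pair in permutations(languages, 2) if pair not in trained]


# Hypothetical training setup mirroring the example in the text.
trained_pairs = [
    ("English", "Japanese"), ("Japanese", "English"),
    ("English", "Korean"), ("Korean", "English"),
]

print(zero_shot_pairs(trained_pairs))
# [('Japanese', 'Korean'), ('Korean', 'Japanese')]
```

The system knows all three languages, but only the two directions between Japanese and Korean lack direct examples.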

To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation. 
The success of the zero-shot translation raises another important question: Is the system learning a common representation in which sentences with the same meaning are represented in similar ways regardless of language — i.e. an “interlingua”?

This is indeed hard to tell: neural networks are closed systems. The computer learns something from a data set in an intelligent but incomprehensible and obscure way. Google is able to visualize the data the network produces, though. You have to take a look at the blog post to understand this in detail, but:

Within a single group, we see a sentence with the same meaning but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of existence of an interlingua in the network. 
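The grouping idea in this quote can be illustrated with a toy sketch in Python. This is not Google's method, and the vectors below are invented: they merely stand in for the network's internal sentence representations, which are far larger in reality. The sketch checks whether each sentence's nearest neighbour (by cosine similarity) has the same meaning, regardless of language. The names `vectors`, `cosine` and `nearest` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, keyed by (language, meaning).
# They are illustrative only and were not produced by any real model.
vectors = {
    ("en", "greeting"): [0.9, 0.1, 0.0],
    ("ja", "greeting"): [0.8, 0.2, 0.1],
    ("ko", "greeting"): [0.85, 0.15, 0.05],
    ("en", "weather"): [0.1, 0.9, 0.2],
    ("ja", "weather"): [0.0, 0.8, 0.3],
    ("ko", "weather"): [0.05, 0.85, 0.25],
}


def nearest(key):
    """Return the key of the most similar vector other than `key` itself."""
    others = (k for k in vectors if k != key)
    return max(others, key=lambda k: cosine(vectors[key], vectors[k]))


for key in vectors:
    language, meaning = key
    print(language, meaning, "-> nearest:", nearest(key))
```

If sentences cluster by meaning rather than by language, every nearest neighbour shares the meaning label of its query, as it does for these invented vectors.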

Google, this is awesome! Thank you so much for sharing!

Image: Mihkelkohava Üleslaadija 

Book review: Petro's memoirs

After 13 years, an article of mine appeared in Amiga Future again yesterday for the first time. Admittedly, I have not used an Amiga since then, and my excursions with UAE are of a rather modest nature. But after reading the memoirs of Petro Tyschtschenko, who was the face of the Amiga for many years, I asked my old acquaintance Andreas Magerl, the long-time APC&TCP frontman and AF editor-in-chief, whether I could write a review for him. To my great delight he agreed immediately, and so I was able to make, at least as it felt to me, a small contribution to spreading these memoirs, which are very valuable for the historiography of the home computer.

Amiga Future September/October 2014, p. 37: "Petros Erinnerungen" ("Petro's Memories")

Petro Taras Tyschtschenko with Patric Klüter: Meine Erinnerungen an Commodore und Amiga (My Memories of Commodore and Amiga).

Oldest song ever

In the early 1950s, archaeologists unearthed several clay tablets from the 14th century B.C.E. Found, WFMU tells us, “in the ancient Syrian city of Ugarit,” these tablets “contained cuneiform signs in the hurrian language,” which turned out to be the oldest known piece of music ever discovered, a 3,400 year-old cult hymn.

Listen to the Oldest Song in the World: A Sumerian Hymn Written 3,400 Years Ago


Great overview!

What tools does a grammarian need? A brain helps, and so does a computer, but surely one of our most essential tools is some kind of diagramming system. How can we think about a sentence's structure, after all, without displaying it visually? Geographers have maps; mathematicians have equations; composers have musical notation; economists have graphs; and grammarians have trees.

It wasn't always so.

A Brief History of Diagramming Sentences

Altavista shuts down

Those were the days.

The decision also shows how much the web has changed since Altavista launched in 1995. Back then, the search provider was a pioneer. While other search engines collected the existing websites in editorially maintained catalogs and directory services, Altavista built software. Altavista's "Super-Spider", which moved through the web like a spider, read out information about the pages on the web and recorded it in an index.

End of an Internet fossil: the search engine Altavista is being shut down
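
The spider-and-index approach described in the quote above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not Altavista's actual software: the "web" is a hypothetical in-memory dictionary, the host names are made up, and the function name `crawl_and_index` is invented for this example.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
WEB = {
    "a.example": ("amiga computer history", ["b.example", "c.example"]),
    "b.example": ("search engine history", ["a.example"]),
    "c.example": ("amiga magazine", []),
    "d.example": ("unlinked page", []),  # never reached: nothing links to it
}


def crawl_and_index(start):
    """Follow links breadth-first from `start` and build a word -> URLs index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = WEB[url]
        # Record every word on the page in the index.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Queue each linked page that has not been visited yet.
        for link in links:
            if link not in seen and link in WEB:
                seen.add(link)
                queue.append(link)
    return index


index = crawl_and_index("a.example")
print(sorted(index["amiga"]))    # ['a.example', 'c.example']
print(sorted(index["history"]))  # ['a.example', 'b.example']
print("unlinked" in index)       # False
```

Unlike an editorially maintained catalog, nothing here is curated by hand: a page ends up in the index simply because the spider reached it by following links.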