Shalom, Siri. Do You Understand a Word I’m Saying?

A government agency is embarking on a project to help computers and devices understand Hebrew better

Amazon's Alexa artificial intelligence device, May 17, 2017.
Elaine Thompson/AP

As part of an effort to teach computers and devices to speak and understand Hebrew better, the Israeli government is setting up a database designed to aid the development of artificial intelligence applications based on natural language processing – that is, interactions between computers and human language.

Hebrew is a difficult language, even for a machine. To understand it well, devices need to be trained on it extensively. While recognizing speech itself (converting speech to text) in Hebrew is relatively easy, deciphering the meaning of texts is complicated. Machines find it difficult to understand sentences and put words in the right context in any language, and this is especially true in Hebrew.


The word “ha’kafé,” for example, could mean coffee – one of the world’s favorite hot drinks – or the perimeter of a field, depending on context.
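In NLP terms this is lexical ambiguity: the unvocalized spelling הקפה maps to several distinct readings, and only context picks the right one. A toy sketch of how an ambiguity-aware lexicon lookup might be structured (the entries and function are illustrative, not drawn from any real lexicon or from the government's database):

```python
# Toy illustration of lexical ambiguity in unvocalized Hebrew.
# The lexicon entries below are illustrative, not from a real resource.
ANALYSES = {
    "הקפה": [
        {"gloss": "the coffee",             "analysis": "DET + NOUN (ha + kafe)"},
        {"gloss": "perimeter/encirclement", "analysis": "NOUN (hakafa)"},
    ],
}

def analyses(surface_form):
    """Return every candidate reading for an unvocalized surface form."""
    return ANALYSES.get(surface_form, [])

for reading in analyses("הקפה"):
    print(reading["gloss"], "-", reading["analysis"])
```

A disambiguation system has to choose among these candidates using the surrounding words, which is exactly what makes Hebrew text harder for machines than its speech-to-text step.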

Because the number of Hebrew speakers in the world is so small, private companies haven’t invested in creating voice applications for conversations in natural Hebrew. That’s the reason that voice assistants like Amazon’s Alexa, Google Assistant and Microsoft’s Cortana, which are integrated into smartphones, smart speakers and PCs, don’t support Hebrew.

Apple’s Siri does support Hebrew, but only to a limited degree. And for the same reason, voice and text chatbots don’t support Hebrew either.

“When we looked for chatbot solutions for government services for gov.il websites, we discovered that there was a widespread problem throughout the economy – the lack of Hebrew language digitization,” explained Yogev Shamni, director of the accessible government unit at the Government ICT Authority. “Everyone uses digital services – Siri, Alexa, Google. We know they work well in English, but in Hebrew they’re limited.”

To fix this, officials at the Government Information and Communications Technology Authority are meeting with people from industry, the universities and high-tech companies who work in the field, with the aim of building a database to train devices to better understand the language.

Last week the authority unveiled a pilot for a “Manually Labelled Corpus of Modern Hebrew” – a database of texts that uses the Universal Dependencies framework for consistent grammatical annotation. It is a database of Hebrew sentences broken down into their components and labeled by linguists: a table in which each word receives a list of characteristics, such as context, subject and tense, that help machines understand the meaning of the sentence.
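Universal Dependencies treebanks store each sentence in the CoNLL-U format: one word per line, with tab-separated columns for the word form, lemma, part of speech, morphological features and syntactic role. A minimal sketch of reading such an entry (the two-word Hebrew example is invented for illustration and is not taken from the actual corpus):

```python
# Minimal sketch of a Universal Dependencies (CoNLL-U) entry.
# The transliterated example "ha kafe" ("the coffee") is invented
# for illustration, not copied from the government's corpus.
conllu = (
    "1\tha\tha\tDET\t_\tPronType=Art\t2\tdet\t_\t_\n"
    "2\tkafe\tkafe\tNOUN\t_\tGender=Masc|Number=Sing\t0\troot\t_\t_\n"
)

def parse_conllu(text):
    """Turn CoNLL-U lines into dicts, one per word, keyed by column name."""
    cols = ["id", "form", "lemma", "upos", "xpos",
            "feats", "head", "deprel", "deps", "misc"]
    words = []
    for line in text.strip().splitlines():
        if line and not line.startswith("#"):
            words.append(dict(zip(cols, line.split("\t"))))
    return words

for word in parse_conllu(conllu):
    print(word["form"], word["upos"], word["feats"])
```

It is this per-word labeling – done by hand, by linguists – that gives a learning system the grammatical signal it needs to train on.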

Initially it will be used for chatbots on government websites, so that users can make requests in ordinary spoken or written language – to do things like applying for a passport. Eventually, though, it will be available not only to the government but also to startups and big companies.

“Every company will be able to use it to develop ways to talk with a smart car or a smart home, to improve accessibility for people with disabilities, and for other applications,” said Shamni. “The vision is to do things like Google Duplex – the Google service that lets you make restaurant reservations with a voice assistant.”

Not everyone is convinced that the pilot is the way to go.

“I wouldn’t say that a database with morphological labels is what separates us from Cortana,” said one expert in language processing who asked not to be identified.

“It’s certainly a welcome development in that there will be a tagged corpus, but is the morphological analysis really necessary? Will it advance the development of applications? ... Opinions about this are divided. There are other natural language processing methods that do not require such labeling, and the trend today in machine learning is, in fact, toward automation and smart data-processing methods.”

The pilot is being conducted with the help of the Hebrew Language Academy and with Prof. Reut Tsarfaty of Bar-Ilan University. The database so far includes 600 labeled sentences and is freely available to the public. The authority is asking the public to contribute feedback and references.

It’s a small start, but the authority plans to expand it as quickly as possible. “As more words are added to the corpus, devices will be able to increase their understanding from 30-40 percent today to 70-80 percent, as they do in English,” Shamni said.