Hey Alexa, Why Don’t You Speak Hebrew?

'Semitic languages are hard to analyze,' expert explains, and because the number of Hebrew speakers worldwide is very small, companies haven’t bothered to invest. Israel is now stepping in.

Sagi Cohen
Why can't Siri speak Hebrew? Credit: Kavakavins111/WikiCommons

A new program announced last week by the Israel Innovation Authority and the National Digital Ministry will seek to remove one of the most significant barriers to Israelis seeking to enter the digital age – the fact that computers and apps have trouble understanding Hebrew.

The two agencies approved the establishment of an Association of Natural Language Processing Technology Companies that will advance computerized systems’ understanding of Hebrew and Arabic. For the first three years, the program’s budget will be 7.5 million shekels ($2.2 million). 

The goal is to build a database that will enable government agencies and commercial companies both in Israel and abroad to develop apps, programs and digital services that know how to understand Hebrew speech and text.

Every Israeli is familiar with the problem: Today, there are virtually no apps or programs capable of understanding Hebrew well. This makes it very hard for anyone who wants to analyze or extract insights from the mountains of Hebrew-language data and documents available in, for instance, the legal and medical fields. There also aren’t enough programs and apps capable of understanding natural speech in Arabic.

Popular voice-activated “smart” assistants like Amazon’s Alexa, Google Assistant, Microsoft’s Cortana and Apple’s Siri are now integrated into smartphones, Bluetooth loudspeakers and computers. But they either don’t support Hebrew at all or support it only in a very limited fashion.

Amazon's Alexa artificial intelligence device, May 17, 2017. Credit: Elaine Thompson/AP

Granted, computers are getting better at identifying Hebrew speech and converting it into text. But they find it harder to decipher the meaning of these texts, due to certain characteristics of the Hebrew language. 

For instance, the four Hebrew letters heh-kuf-peh-heh could signify either a hot coffee (hakafeh) or making a circuit around a field (hakafah). The word het-yud-peh-heh could either be the name of an Israeli city (Haifa) or a man who covers (hipa) for a friend. And so forth.
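
The ambiguity can be pictured as a lookup in which one unvocalized spelling maps to several readings. The short Python sketch below is purely illustrative: the tiny lexicon and the function names `readings` and `is_ambiguous` are assumptions made for this example, not part of any tool mentioned here, and the letter sequences are spelled out by name as in the examples above.

```python
# Toy lexicon: one written form (letters spelled out by name) -> possible readings.
# It holds only the two examples from the text; a real lexicon would be far larger.
LEXICON = {
    "heh-kuf-peh-heh": [
        ("hakafeh", "the coffee"),
        ("hakafah", "a circuit around something"),
    ],
    "het-yud-peh-heh": [
        ("Haifa", "name of an Israeli city"),
        ("hipa", "covered (for someone)"),
    ],
}


def readings(written_form):
    """Return every known reading of an unvocalized written form."""
    return LEXICON.get(written_form, [])


def is_ambiguous(written_form):
    """A form is ambiguous when the lexicon lists more than one reading."""
    return len(readings(written_form)) > 1


for form in LEXICON:
    options = ", ".join(f"{reading} ({gloss})" for reading, gloss in readings(form))
    print(f"{form}: {options}")
```

A program that sees only the letters has no way to choose among the listed readings, which is why the surrounding context matters.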

This inability to comprehend Hebrew speech already prevents many Israelis from using advanced digital services. And everyone agrees that the problem will only get worse. 

In the near future, the main platforms that we will use to operate technological devices will be based on voice and speech. If these devices don’t understand Hebrew, it will greatly limit many Israelis’ ability to use technology. Imagine, for instance, that keyboards were sold with no Hebrew letters or that cellphones came with no Hebrew menus or user interface.

Will we succeed this time?

To date, because the number of Hebrew speakers worldwide is very small, companies haven’t bothered to invest in a solution. Therefore, the state is trying to solve this market failure.

A woman uses Siri voice-control on her Apple iPhone. Credit: Bloomberg Finance LP

“Semitic languages are hard to analyze,” said Aviv Zeevi, the head of the Innovation Authority’s technological infrastructure division. “The whole world is racing forward with natural language processing and developing smart tools. But when you want to apply them to data in Hebrew, it’s impossible. Therefore, a situation has been created in which the field of natural language processing in Hebrew has been neglected, and this is a big market failure.

“It’s impossible, for instance, to analyze medical files or court documents, because these are incomprehensible Hebrew texts,” he continued. “If you want to create insights on the basis of this data, you need an artificial intelligence tool that’s capable of doing natural language processing in Hebrew.”

The Association of Natural Language Processing Technology Companies will create the necessary infrastructure – a database of texts that will be broken down into their component parts and tagged by linguists according to their syntactic, semantic and morphological characteristics (context, subject or object, tense and so forth) to help make the sentence’s meaning comprehensible. Via this database, it will be possible to train programs to better understand the context and meaning of Hebrew texts. 
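
A rough idea of how tagged text can teach a program to use context is sketched below in Python. Everything in it is an assumption made for illustration: the `Token` fields, the tag names, the three hand-made sample sentences (context words are given as English glosses) and the simple counting method, which stands in for the far more sophisticated models a real project would use.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    written: str  # form as written (hypothetical; context words are English glosses)
    reading: str  # reading chosen by a human annotator
    pos: str      # part of speech (hypothetical tag names)
    role: str     # syntactic role in the sentence


# Hand-made sample standing in for a tagged corpus.
TAGGED = [
    [Token("drank", "drank", "VERB", "predicate"),
     Token("heh-kuf-peh-heh", "hakafeh", "NOUN", "object")],
    [Token("completed", "completed", "VERB", "predicate"),
     Token("heh-kuf-peh-heh", "hakafah", "NOUN", "object")],
    [Token("drank", "drank", "VERB", "predicate"),
     Token("heh-kuf-peh-heh", "hakafeh", "NOUN", "object")],
]


def train(sentences):
    """Count how often each reading follows each preceding word."""
    by_context = defaultdict(Counter)  # (written, previous written) -> reading counts
    overall = defaultdict(Counter)     # written -> reading counts
    for sentence in sentences:
        previous = None
        for token in sentence:
            by_context[(token.written, previous)][token.reading] += 1
            overall[token.written][token.reading] += 1
            previous = token.written
    return by_context, overall


def predict(model, written, previous):
    """Pick the reading seen most often after `previous`, else the most common overall."""
    by_context, overall = model
    counts = by_context.get((written, previous)) or overall.get(written)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


model = train(TAGGED)
print(predict(model, "heh-kuf-peh-heh", "drank"))
print(predict(model, "heh-kuf-peh-heh", "completed"))
```

With this sample, the same four letters are read as coffee after "drank" and as a circuit after "completed". The more tagged text there is, the more contexts such a program can tell apart.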

The association will also explore the possibility of adapting or developing tools in open-source code to improve the quality of Hebrew and Arabic comprehension in various computerized systems.

The project will be run by Avner Algom, an entrepreneur and chairman of Israel’s association of cloud computing companies, IGTCloud. The work of tagging the elements and creating the database will be done by subcontractors.

The project’s goal is to use the database to develop services for both industry and the government sector. Examples could include advanced apps for various government services (for instance, going to a government website and applying for a passport via voice chat with a bot), advanced voice identification services that banks could offer their customers, and apps and smart services from industrial companies that would let people talk to the devices in their car, their home or their smartphone.

In theory, such a database could also be used by Google, Amazon and Apple to significantly improve their voice-based digital assistants by enabling them to understand Hebrew better.

The association’s member companies will both contribute to the project’s development and be entitled to make use of its products. The companies in question include Intel, Ginger Software, Rafael, AudioCodes, Bank Hapoalim, Melingo, Ynet, Walla and many others. 

A new program sponsored by the Innovation Authority and the National Digital Ministry will develop infrastructure to make it possible to build apps capable of understanding Hebrew speech and text. Credit: SHANNON STAPLETON/Reuters

The project is being funded by the participating companies themselves, each of which will contribute 500,000 shekels. Invitations have also been extended to the major technology giants, and Google and Microsoft are considering joining the association. 

In addition, the project will be assisted by several academic researchers, including Prof. Reut Tsarfaty of Bar-Ilan University. The sources for the database’s content will include newspapers (Haaretz among them), the Knesset archive, the Maccabi health maintenance organization, Bank Hapoalim and the National Institute for Testing and Evaluation.

The Innovation Authority didn’t want to wait until the program could secure funding from the Finance Ministry, which is why it decided to launch it with funding from an association of companies instead. This model will also be useful for understanding the existing needs and capabilities.

“The public sector is currently dealing with information in Hebrew and Arabic of which the lion’s share is unstructured,” said Asher Bitton, director general of the National Digital Ministry. “One of the big challenges in digitizing public services is to enable operational efficiency, availability to the public without cost and also high productivity.”

This is a major, long-term challenge that is shadowed by previous failed attempts to create databases and tools for Hebrew linguistic analysis. Similar projects have been launched in various frameworks, including by commercial companies for their own internal use, but none has really taken off.  

Just months ago, at the start of this year, the Government Teleprocessing Authority announced a project to create a database that would help computers understand Hebrew, with goals very similar to those of the new association. But that project was defined as a limited pilot, and it apparently didn’t get very far.

The new project appears to be much larger in scope and more ambitious. It is also better funded and includes major industrial companies.

“We want to build something broad enough, generic enough and with enough data, and that is why we wanted industry to be involved in defining the corpora as well,” Zeevi said, referring to the linguistic databases. 
