Tuesday, December 15, 2009
On December 7th, we launched our new Japanese voice search system (音声検索), which has been available for various flavors of English since last year and for Mandarin Chinese for the past two months. The initial Japanese system works on the Android platform and also through the Google Mobile App on the iPhone as announced in a Japanese blog and a general explanation on how to get started. For developers who want to make use of the speech recognition backend for their own Android applications there is a public API (recognizer intent API) described here.
Although speech recognition has had a long history in Japan, creating a system that can handle a problem as difficult as voice search is still a considerable challenge. Today, most speech recognition systems are large statistical systems that must learn two models from sets of examples, an acoustic model and a language model. The acoustic model represents (statistically) the fundamental sounds of the language, and the language model statistically represents the words, phrases, and sentences of the language. The acoustic model for Japanese voice search was trained using a large amount of recorded Japanese speech, with the associated transcriptions of the words spoken. The language model for Japanese voice search was trained on Japanese search queries.
While speech recognition systems are surprisingly similar across different languages, there are some problems that are more specific to Japanese. Some of the challenges we faced while developing Japanese voice search included:
- Spaces in Japanese text
As we looked at some popular search queries in Japan we saw that Japanese often doesn't have spaces but sometimes it does. For example, if a user searches for Ramen noodles near Tokyo station they will often type: "東京駅 ラーメン" with a space in between Tokyo station and Ramen -- therefore, we would like to display it in this way as well. Getting the spaces right is difficult and we continue working to improve it.
- Japanese word boundaries
Word boundaries in Japanese are often not clear and subject to interpretation as most of the time there are not spaces between words. This also makes the definition of the vocabulary (the words that can be recognized theoretically) extremely large. We deal with this problem by finding likely word boundaries using a statistical system which also helps us limit the vocabulary.
- Japanese text is written in 4 different writing systems
Japanese text as written today uses Kanji, Hiragana, Katakana & Romaji, often mixed in the same sentence and sometimes in the same word, depending on the definition of a word. Try these example queries to see some interesting cases: "価格.com", "マーボー豆腐", "東京都渋谷区桜丘町26-1". We try to display the output in a way that is most user-friendly, which often means to display it as you would write it down.
- Japanese has lots of basic characters and many have several pronunciations depending on context
To be able to recognize a word you need to know its pronunciation. Western languages in general use only ASCII or a slightly extended set of characters which is relatively small (less than 100 total in most cases). For Japanese the number of basic characters is the union of all basic characters from the four writing systems mentioned above, which is several thousands in total. Finding the correct pronunciations for all words in the very large voice search vocabulary is difficult and is often done using a combination of human effort and automatic statistical systems. This is even more difficult in the Japanese case as the number of basic characters is higher and there are a vast number of exceptions, for example consider the case of: "一人" (hitori) versus "一人前" (ichininmae). Although the phrases look very similar they have completely different pronunciations.
- Encoding issues
Japanese characters can be written in many encoding systems including UTF-8, Shift_JIS, EUC-JP and others. While at Google we try to use exclusively UTF-8 there are still interesting edge cases to deal with. For example some characters exist in different forms in the same encoding system. Compare for example "カナ" and "ｶﾅ" -- they both say "kana" and mean the exact same thing, the first in full-width and the second in half-width. There are numerous similar cases like this in Japanese that make normalization of the text data more difficult.
- Every speaker sounds different
People speak in different styles, slow or fast, with an accent or without, have lower or higher pitched voices, etc. To make it work for all these different conditions we trained our system on data from many different sources to capture as many conditions as possible.