The limits of language are the limits of the world. (The limits of my language mean the limits of my world.)
So said Wittgenstein, one of the 20th century's leading philosophers. As he observed, humans think in language and live within the framework of that language. Since we are Korean, we think and live within the framework of the Korean language, and the world we understand is bound to differ from the one understood by people in the English-speaking world.
Therefore, to understand the world more broadly and more deeply, we need to expand our horizons through language. But learning a new language isn't easy. To properly understand a language, you need to know the country, region, culture, and people it belongs to, not just increase your vocabulary.
The world is big, and there are many languages. However...
There are said to be around 7,100 languages in the world. That means there is probably a great deal of human knowledge and information that has not yet been shared with the rest of the world. It's unfortunate that our ability to learn languages is so limited.
Meanwhile, the online world is dominated by English. It's often said that the Internet is an open information space, but I think that holds only for English users. In reality, there is a huge knowledge and information gap for the many people who don't speak English.
The shortcomings of English-centric natural language processing
NLP research, including machine translation and language models, has historically focused on English. That is no surprise, since the field developed mainly in the US and other Western countries. As a result, most languages, with the exception of a few such as English and Spanish, were left out of NLP research.
Most multilingual AI models also rely on English. For example, when translating from German to Korean, they first translate the German into English and then translate the English into Korean. This detour through English was probably a major cause of the bizarre mistranslations that machine translators used to produce so often.
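To make the English detour concrete, here is a minimal sketch in Python. It does not use a real translation model: the three lookup tables, the function names, and the word choices are all hypothetical toy data invented for illustration, and the Korean forms are a deliberate simplification. The point it tries to show is that English has a single word "you" for both the informal German "du" and the formal "Sie", so a distinction that is dropped in the English step cannot be recovered in the Korean step.

```python
# Toy lookup tables standing in for real translation models.
# All entries are hypothetical, simplified examples, not real model output.
DE_TO_EN = {"du": "you", "Sie": "you"}   # English merges informal and formal "you"
EN_TO_KO = {"you": "너"}                  # the pivot step has to pick a single Korean form
DE_TO_KO = {"du": "너", "Sie": "당신"}     # a direct table can keep the distinction


def translate(word, table):
    """Look up a word in a toy table; return it unchanged if it is unknown."""
    return table.get(word, word)


def pivot_translate(word):
    """German -> English -> Korean: the English-centric route."""
    english = translate(word, DE_TO_EN)
    return translate(english, EN_TO_KO)


def direct_translate(word):
    """German -> Korean without passing through English."""
    return translate(word, DE_TO_KO)


for word in ("du", "Sie"):
    print(word, "| via English:", pivot_translate(word), "| direct:", direct_translate(word))
```

In this toy setup the two routes agree for "du" but disagree for "Sie", because the formality information is lost at the English step. Real systems are far more complex, but the same kind of information loss can occur whenever every language pair is routed through one pivot language.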
Meanwhile, globalization is making NLP technology more and more important. A growing number of situations require people to communicate across language barriers. Unfortunately, most people around the world are still excluded from the benefits of technological advances such as AI translation.
Languages for which little data is available to train AI language models are called low-resource languages. As is well known, though, NLP research requires large amounts of linguistic data. As a result, only speakers of a select few widely used languages (out of around 7,100 worldwide) can use AI language tools.
In fact, according to Meta AI, “More than 20% of the world's population cannot receive commercialized translation technology services.”* In other words, there is a digital divide that prevents speakers of low-resource languages from communicating freely. This is why solutions are needed for those who are excluded from the global exchange of knowledge, information, and culture because of language.
Wrapping up
Before looking at multilingual AI in earnest, this post looked at why languages other than English are becoming important in NLP research. Recently, there have been more and more attempts to make language models and translation models multilingual. Given the unfortunate circumstances described above, this is great news for the many people around the world who have been marginalized until now.
In the next post, I'll take a closer look at this topic through real-world research and development examples from industry.
* Quote source: https://www.ciokorea.com/t/22000/AI/243970#csidxaf4c5dbdb5bf6318b0d338efe81a7fa
References
[1] https://www.washingtonpost.com/news/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/
[2] https://www.ethnologue.com/guides
[3] https://edu.krlo.co.kr/2018/05/09/q-001/
[4] https://ai.facebook.com/blog/teaching-ai-to-translate-100s-of-spoken-and-written-languages-in-real-time/