This post continues from the previous article, “AI Became a Language Genius: the Polyglot Model (1).” I recommend reading that one first.
Go to “AI Became a Language Genius: the Polyglot Model (1)”
As we saw in the previous article, AI translation has been centered on English: most machine translation systems used English as an intermediate (pivot) language. This approach can be effective for training and processing (for a few selected languages), but it makes it difficult to capture the unique characteristics of each language.
Recently, however, the development of multilingual language and translation models has been on the rise. It is a meaningful step toward the long-dreamed-of world where everyone can communicate without language barriers. It is especially needed by the many people around the world who speak low-resource languages and have been inconvenienced as a result.
Now, let's take a look at a few examples.
Meta AI: from a many-to-many dataset to a multilingual translation model
In 2021, Meta (then Facebook) AI released FLORES-101, a many-to-many dataset covering 101 of the world's languages, as open source. They stated that this was meant to break down knowledge gaps, cultural differences, and language barriers and bring people closer together. The research was published as a paper*, and the dataset was released through GitHub**.
This was a step that would greatly help AI researchers study multilingual translation models and develop more diverse and powerful AI translation tools. Thanks to it, researchers could benchmark 10,100 different translation directions. Evaluating and comparing the performance of models and systems is a crucial part of the research process, and it lays the foundation for extending translation models to more languages later.
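The 10,100 figure follows directly from counting ordered language pairs: with 101 languages, each one can be translated into each of the other 100. A quick illustration (the function name is my own, for clarity):

```python
def translation_directions(num_languages: int) -> int:
    # Each ordered (source, target) pair with source != target
    # counts as one translation direction.
    return num_languages * (num_languages - 1)

print(translation_directions(101))  # 10100 directions for FLORES-101
```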
Then last July (after continued data and model updates in the meantime), NLLB (No Language Left Behind)-200, which can translate text across 200 languages, was released as open source***. As the name suggests, it supports many languages that other AI translation systems do not. Where the existing major translation tools supported fewer than 25 African languages, NLLB-200 supports 55.
In addition, to support this, they built FLORES-200, which expands the existing FLORES-101 dataset****; it covers 40,000 different translation directions among the 200 languages. It, too, was released as open source so the model's performance can be evaluated and improved, and so it can be applied to outside research and development.
Ultimately, Meta AI seems to want to build a single model that supports all languages and dialects around the world.
HuggingFace, open source language model BLOOM
Last June, the public collaboration project BigScience unveiled BLOOM, an open-source language model that answers the limitations of existing large language models built by big tech companies. In scale it is a very large model comparable to GPT-3, and it is an open-source multilingual model. What stands out in particular is that more than 1,000 academic volunteers from around the world joined forces and transparently disclosed both the code and the data in order to reduce the bias and harmfulness of language models.
BLOOM was also covered in the previous post “Can the open source language model BLOOM become the flower of AI democratization?” If you want to learn more, check it out.
Go to “Can the open source language model BLOOM become the flower of AI democratization?”
Google Translate: monolingual learning and a multilingual model
At I/O 2022, Google announced translation support for 24 additional minority languages. Google Translate likewise aims to remove language barriers and help people understand and communicate. The addition of minority languages from India, Africa, and South America has opened the door a little wider for the many people who have not benefited from technological advances to connect to the broader world.
Behind this is a training method called monolingual learning. Simply put, the model learns and understands a language in its own right, without going through English. In a situation where parallel text***** usable for translation was scarce, Google appears to have found an approach that can translate new languages that had never been translated before.
Supervised learning is bound to struggle when such data is hard to obtain. Instead, Google uses a form of unsupervised learning that relies on unlabeled data. In this way, an AI that has already learned high-resource languages well improves its performance by learning the target low-resource language directly from monolingual text.
“Translation accuracy scores for 638 of the languages supported in our model, using the metric we developed (RTTLangIDChrF), for both the higher-resource supervised languages and the low-resource zero-resource languages.******”
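The chrF part of that metric is a character n-gram F-score: instead of matching whole words, it measures how many character n-grams (up to order 6) the hypothesis shares with the reference, which makes it more forgiving for morphologically rich, low-resource languages. Below is a minimal sketch of just that chrF core, assuming the standard formulation (average of n-gram precisions and recalls, F-beta with beta=2); Google's actual RTTLangIDChrF additionally involves round-trip translation and language identification, which are not modeled here, and the function names are my own:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    # Character n-grams; whitespace is removed, as in common chrF variants.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram precision/recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # strings too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall more heavily than precision.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(chrf("hello world", "hello world"))  # 1.0 for a perfect match
```

For production evaluation one would use a maintained implementation such as the one in the sacreBLEU library rather than a hand-rolled score.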
Wrapping up
More than 300 million people are said to speak the 24 minority languages Google added this time. That likely means far more people still remain left out of technological progress. AI keeps transforming into a language genius, but there is clearly a long way to go before we reach a world without language barriers.
Twigfarm's language processing engine, LETR, is walking this path as well. Even at this moment, it is developing toward an unrivaled language processing engine centered on Asian languages. Going forward, the LETR team will keep working to create a better world with the digital technology we build and the influence it carries.
* https://arxiv.org/abs/2106.03193
** https://github.com/facebookresearch/flores
*** https://github.com/facebookresearch/fairseq/tree/nllb/
**** https://github.com/facebookresearch/flores
***** https://ko.wikipedia.org/wiki/병렬말뭉치
****** https://ai.googleblog.com/2022/05/24-new-languages-google-translate.html
Good content to read together
“AI Became a Language Genius: the Polyglot Model (1)”
“Can the open source language model BLOOM become the flower of AI democratization?”
“Why is artificial intelligence making Korean more difficult?”