Deep learning-based natural language processing research built on large-scale data has been booming recently, with industry and academia alike jumping in. Big tech companies such as Google and Meta, as well as open collaborative projects such as BigScience, are producing remarkable results.
Behind these achievements lies the Transformer*, pre-trained on massive corpus data. Since its introduction, many variants have appeared and performance has improved rapidly. And because most of these language models rely on unsupervised learning** over large amounts of corpus data, data acquisition has become critically important.
However, this rapidly advancing language model research has a disappointing side, especially from the perspective of those of us who were born in this country and live our lives in Korean. Broadly speaking, two factors have made Korean language model research difficult.
First, the linguistic characteristics of Korean are very different from those of English. Just as Japanese is generally easier for us to learn than English, an AI model trained on English will find Spanish far easier to process than Korean. I've covered this in a previous post, so check out the article below for details.
- Why does artificial intelligence find Korean harder?
Second, and crucially, the amount of training data is directly related to model performance. Low-resource languages such as Korean inevitably see relatively limited performance gains. I've also looked at this in past posts on large language models and multilingual models, so please check those out as well.
- Can the open source language model BLOOM become the flower of AI democratization?
- AI that became a linguistic genius: the multilingual (Polyglot) model (1)
- AI that became a linguistic genius: the multilingual (Polyglot) model (2)
Nevertheless, as the level of Korean natural language processing research rises, Korean-centered models are being studied and published in growing numbers. Leading domestic institutions and companies such as the Electronics and Telecommunications Research Institute (ETRI), Naver, and Kakao are releasing new models one after another: KorBERT, HyperCLOVA, KoGPT, EXAONE, and more have appeared, and research continues at this very moment.
So I'd like to take this opportunity to summarize the Korean language models released so far. Broadly speaking, I've grouped them into three families: encoder models (the BERT*** series), decoder models (the GPT**** series), and encoder-decoder models (the seq2seq***** series).
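To make the three families concrete, below is a minimal sketch using the Hugging Face transformers library. The checkpoint names are illustrative assumptions on my part (klue/bert-base for the encoder, skt/kogpt2-base-v2 for the decoder, and gogamza/kobart-base-v2, from reference [6], for the encoder-decoder); any comparable Korean checkpoint of each family would work the same way.

```python
# Minimal sketch of the three model families with Hugging Face transformers.
# Checkpoint names are illustrative assumptions; swap in any Korean model
# of the matching family.
from transformers import (
    AutoTokenizer,
    AutoModel,              # encoder-only: contextual embeddings (BERT-style)
    AutoModelForCausalLM,   # decoder-only: left-to-right generation (GPT-style)
    AutoModelForSeq2SeqLM,  # encoder-decoder: sequence-to-sequence (BART-style)
)

text = "한국어 언어 모델"  # "Korean language model"

# 1) Encoder model: encodes the whole sentence into hidden states,
#    typically fine-tuned for classification, NER, and similar tasks.
tok = AutoTokenizer.from_pretrained("klue/bert-base")
enc = AutoModel.from_pretrained("klue/bert-base")
hidden = enc(**tok(text, return_tensors="pt")).last_hidden_state
print(hidden.shape)  # (batch, sequence_length, hidden_size)

# 2) Decoder model: predicts the next token, so it can continue the text.
tok = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
dec = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")
out = dec.generate(**tok(text, return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0]))

# 3) Encoder-decoder model: reads the input with the encoder and generates
#    output with the decoder, suited to summarization and translation.
tok = AutoTokenizer.from_pretrained("gogamza/kobart-base-v2")
s2s = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")
out = s2s.generate(**tok(text, return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```

As the sketch suggests, the family largely determines the natural use case: encoders feed understanding tasks such as classification, decoders handle free-form generation, and encoder-decoders suit input-to-output transformations such as summarization and translation.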
I'll walk through the models group by group in the next post, so stay tuned.
* https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
** https://en.wikipedia.org/wiki/Unsupervised_learning
*** https://en.wikipedia.org/wiki/BERT_(language_model)
**** https://en.wikipedia.org/wiki/OpenAI#GPT
***** https://en.wikipedia.org/wiki/Seq2seq
References
[1] https://arxiv.org/abs/2112.03014
[2] https://aiopen.etri.re.kr/service_dataset.php
[3] https://github.com/SKTBrain/KoBERT
[4] https://github.com/monologg/HanBert-Transformers
[5] https://github.com/SKT-AI/KoGPT2
[6] https://huggingface.co/gogamza/kobart-base-v2
[7] https://arxiv.org/abs/2101.11363
[8] https://koreascience.kr/article/CFKO202130060717834.pdf
[9] https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5
[10] https://arxiv.org/abs/2105.09680
[11] https://arxiv.org/abs/2109.04650
[12] https://huggingface.co/kakaobrain/kogpt
[13] https://s-space.snu.ac.kr/handle/10371/175838