This post continues from the previous post, 'Korean Pre-trained Language Models (1)'. I recommend reading that post first before continuing here.
See: Korean Pre-trained Language Models (1)
As in other countries, many Korean language models have been built on Transformers pre-trained on large corpora. A variety of models have been announced, including KoBERT, KorBERT, HanBERT, KoELECTRA, KoGPT, and HyperCLOVA. In this article, I will first give a brief chronological summary of the main released models and their features, and then organize them into encoder, decoder, and encoder-decoder (seq2seq) families.
Korean Language Model Chronicles
2019
KorBERT (Korean Bidirectional Encoder Representations from Transformers)
This is the first Korean pre-trained language model, released by the Electronics and Telecommunications Research Institute (ETRI). It was trained on 23 GB of data extracted from Korean news and encyclopedias, and its parameter count is known to be about 100M. Both morpheme-based and WordPiece tokenizers were used, with vocabulary sizes of 30,349 (morpheme) and 30,797 (WordPiece). ETRI announced that it outperformed BERT because it reflects the characteristics of Korean, which is an agglutinative language.
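To give a feel for what morpheme-aware tokenization means, the toy snippet below splits a sentence into morphemes with KoNLPy's Okt analyzer. This is only an illustration: KorBERT relies on ETRI's own morphological analyzer and WordPiece vocabulary, which are not reproduced here.

```python
# Illustration only: morpheme-level segmentation with KoNLPy's Okt analyzer.
# KorBERT itself uses ETRI's own morphological analysis and WordPiece
# vocabulary, so this is just a stand-in for the general idea.
from konlpy.tag import Okt

okt = Okt()
sentence = "한국어는 교착어라서 조사와 어미가 어간에 붙습니다."
print(okt.morphs(sentence))
# e.g. ['한국어', '는', '교착어', '라서', '조사', '와', ...]
```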
References
https://arxiv.org/pdf/1810.04805.pdf
https://medium.com/towards-data-science/pre-trained-language-models-simplified-b8ec80c62217
https://wikidocs.net/166826
https://itec.etri.re.kr/itec/sub02/sub02_01_1.do?t_id=1110-2020-00231&nowPage=1&nowBlock=0&searchDate1=&searchDate2=&searchCenter=&m_code=&item=&searchKey=b_total&searchWord=KorBERT
https://www.etnews.com/20190611000321
KoBERT (Korean Bidirectional Encoder Representations from Transformers)
This model, released by SKT, was trained on 50 million sentences collected from Wikipedia, news, and other sources. To reflect the irregular morphological changes of Korean, a data-driven tokenization technique (the SentencePiece tokenizer) was applied; the vocabulary size is 8,002 and the model has 92M parameters.
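As a quick usage sketch, the checkpoint can reportedly be loaded through Hugging Face transformers with the skt/kobert-base-v1 identifier and the kobert_tokenizer helper distributed in the SKTBrain/KoBERT repository linked below; the details here are assumptions based on that README, so verify them there first.

```python
# Hedged sketch: load KoBERT via Hugging Face transformers. The checkpoint id
# (skt/kobert-base-v1) and the kobert_tokenizer helper come from the
# SKTBrain/KoBERT repository; check the README there for the install command.
import torch
from transformers import BertModel
from kobert_tokenizer import KoBERTTokenizer  # provided by the KoBERT repo

tokenizer = KoBERTTokenizer.from_pretrained("skt/kobert-base-v1")
model = BertModel.from_pretrained("skt/kobert-base-v1")

inputs = tokenizer("한국어 사전학습 모델을 불러옵니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```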
References
https://sktelecom.github.io/project/kobert/
https://github.com/SKTBrain/KoBERT
2020
HanBERT (Hangul Bidirectional Encoder Representations from Transformers)
This model, released by TwoBlock AI, was trained on 70 GB of general and patent documents. It is known to use the company's self-developed Moran tokenizer, with a vocabulary size of 54,000 and a parameter count of 128M.
References
https://twoblockai.files.wordpress.com/2020/04/hanbert-ed8ca8ed82a4eca780-ec868ceab09cec849c.pdf
https://www.stechstar.com/user/zbxe/study_SQL/72557
https://github.com/monologg/HanBert-Transformers
KoGPT2 (Korean Generative Pre-trained Transformer 2)
This is an open-source Korean GPT-2 model announced by SKT. Like GPT-2, it has a Transformer decoder architecture and is trained with next-token prediction. It was reportedly trained on 152M sentences extracted from various sources such as Korean Wikipedia, news, Namuwiki, and Naver movie reviews. The tokenizer uses character-level BPE (Byte Pair Encoding), and emoticons and emojis frequently used in conversation were added to the vocabulary to improve recognition. The vocabulary size is 51,200, and the base model has 125M parameters.
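For a quick feel of next-token prediction in practice, the published checkpoint can reportedly be loaded through Hugging Face transformers; the identifier skt/kogpt2-base-v2 and the special-token settings below follow the SKT-AI/KoGPT2 README linked underneath, so verify them there before use.

```python
# Hedged sketch: text generation with KoGPT2 via next-token prediction.
# The checkpoint id and special tokens follow the SKT-AI/KoGPT2 README.
import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "skt/kogpt2-base-v2",
    bos_token="</s>", eos_token="</s>", unk_token="<unk>",
    pad_token="<pad>", mask_token="<mask>",
)
model = GPT2LMHeadModel.from_pretrained("skt/kogpt2-base-v2")

input_ids = tokenizer.encode("근육이 커지기 위해서는", return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=64, repetition_penalty=2.0)
print(tokenizer.decode(output[0]))
```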
References
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
https://sktelecom.github.io/project/kogpt2/
https://github.com/SKT-AI/KoGPT2
KoBART (Korean Bidirectional and Auto-Regressive Transformers)
This is the Korean version of BART and the third Korean model released by SKT, following KoBERT and KoGPT2. KoBART has an encoder-decoder structure like BART and was pre-trained as a denoising autoencoder. It was trained on 0.27B of more diverse data than before, including Korean Wikipedia, news, books, the Modu Corpus (Everyone's Corpus), and Blue House national petitions.
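Conceptually, the denoising objective corrupts a sentence, feeds it to the encoder, and asks the decoder to reconstruct the original. The toy function below imitates BART-style text infilling; the mask token and span length are illustrative choices, not KoBART's exact noise settings.

```python
# Toy illustration of BART-style text infilling: a random span of tokens is
# replaced with a single <mask> token, and the training target is the
# original, uncorrupted sequence.
import random

def text_infilling(tokens, mask_token="<mask>", max_span=3):
    tokens = list(tokens)
    span = random.randint(1, min(max_span, len(tokens)))
    start = random.randint(0, len(tokens) - span)
    return tokens[:start] + [mask_token] + tokens[start + span:]

original = "한국어 위키백과 와 뉴스 데이터 로 사전 학습 했다".split()
corrupted = text_infilling(original)
print("encoder input :", " ".join(corrupted))
print("decoder target:", " ".join(original))
```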
References
https://arxiv.org/pdf/1910.13461.pdf
https://github.com/SKT-AI/KoBART
https://www.ajunews.com/view/20201210114639936
2021
KoreALBERT (Korean A Lite BERT)
This model, released by Samsung SDS, applies masked language modeling and sentence-order prediction during pre-training, as in ALBERT. It was trained on about 43 GB of data, including Korean Wikipedia, Namuwiki, news, and book plot summaries, with a vocabulary size of 32,000; a 12M-parameter base model and an 18M-parameter large model were released.
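The sentence-order prediction (SOP) objective is easy to sketch: two consecutive sentences in their original order form a positive example, and the same pair swapped forms a negative example. The snippet below is a toy construction, not Samsung SDS's actual data pipeline.

```python
# Toy construction of sentence-order prediction (SOP) pairs: consecutive
# sentences in the original order get label 1, the swapped order gets 0.
def make_sop_pairs(sentences):
    pairs = []
    for first, second in zip(sentences, sentences[1:]):
        pairs.append(((first, second), 1))  # correct order
        pairs.append(((second, first), 0))  # swapped order
    return pairs

document = ["첫 번째 문장입니다.", "두 번째 문장입니다.", "세 번째 문장입니다."]
for pair, label in make_sop_pairs(document):
    print(label, pair)
```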
References
https://www.samsungsds.com/kr/insights/techtoolkit_2021_korealbert.html
https://arxiv.org/pdf/2101.11363.pdf
https://arxiv.org/pdf/1909.11942.pdf
https://www.inews24.com/view/1316425
https://www.itbiznews.com/news/articleView.html?idxno=65720
https://www.itbiznews.com/news/articleView.html?idxno=66222
KE-T5
This is a Korean-English model based on the Text-to-Text Transfer Transformer (T5), released by the Korea Electronics Technology Institute (KETI). It is known to have been pre-trained with a mask-filling objective similar to T5's, using roughly 93 GB (92.92 GB in total) of Korean and English corpus. The SentencePiece tokenizer was used for preprocessing, with a vocabulary size of 64,000. Models of various sizes were released so that they can be selected according to model size and intended use.
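In T5-style mask filling (span corruption), masked spans in the input are replaced by sentinel tokens and the target lists each sentinel followed by the original span. The helper below is only a toy illustration of the input/target format, not KE-T5's actual preprocessing.

```python
# Toy illustration of T5-style span corruption: masked spans become sentinel
# tokens <extra_id_0>, <extra_id_1>, ... and the target reconstructs them.
def span_corrupt(tokens, spans):
    """spans: list of (start, length) index pairs to mask, in order."""
    inp, tgt, prev_end = [], [], 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp += tokens[prev_end:start] + [sentinel]
        tgt += [sentinel] + tokens[start:start + length]
        prev_end = start + length
    inp += tokens[prev_end:]
    return " ".join(inp), " ".join(tgt)

tokens = "한국어 와 영어 말뭉치 로 사전 학습 한 모델".split()
inp, tgt = span_corrupt(tokens, [(1, 1), (5, 2)])
print("input :", inp)   # 한국어 <extra_id_0> 영어 말뭉치 로 <extra_id_1> 한 모델
print("target:", tgt)   # <extra_id_0> 와 <extra_id_1> 사전 학습
```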
References
https://arxiv.org/abs/1910.10683
https://huggingface.co/tasks/fill-mask
https://github.com/google/sentencepiece
https://koreascience.kr/article/CFKO202130060717834.pdf
https://zdnet.co.kr/view/?no=20210427130809
KoGPT-Trinity
Released by SKT, this model is known to have been trained on Ko-DATA, a dataset built in-house. The model size is 1.2B parameters, a significant increase compared to KoGPT2; the vocabulary size is 51,200, and it was pre-trained with next-token prediction.
References
https://huggingface.co/skt/ko-gpt-trinity-1.2B-v0.5
HyperCLOVA
This is a large-scale model announced by Naver. It was trained on vast amounts of data extracted from documents collected through Naver services such as news, cafes, blogs, KnowledgeiN, web documents, and comments, as well as other sources such as the Modu Corpus and Korean Wikipedia. The training data consists of 561.8B tokens, and models of various sizes exist, including 1.3B, 6.9B, 13.0B, 39.0B, and 82.0B parameters.
References
https://www.etnews.com/20210525000052
https://tv.naver.com/v/20349558
https://arxiv.org/abs/2109.04650
KLUE-BERT
KLUE-BERT is the model used as the baseline for the KLUE benchmark. It was trained on 63 GB of data extracted from sources such as the Modu Corpus, CC-100-Kor, Namuwiki, news, and petitions. A morpheme-based subword tokenizer was used, the vocabulary size is 32,000, and the model has 111M parameters.
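Because the checkpoint is hosted on the Hugging Face Hub (see the klue/bert-base link below), masked-token prediction can be tried directly with the fill-mask pipeline; the example sentence mirrors the query embedded in that link.

```python
# Sketch: masked-token prediction with the klue/bert-base checkpoint using
# the Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="klue/bert-base")
for prediction in fill_mask("대한민국의 수도는 [MASK] 입니다."):
    print(prediction["token_str"], round(prediction["score"], 3))
```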
References
https://huggingface.co/klue/bert-base?text=%EB%8C%80%ED%95%9C%EB%AF%BC%EA%B5%AD%EC%9D%98+%EC%88%98%EB%8F%84%EB%8A%94+%5BMASK%5D+%EC%9E%85%EB%8B%88%EB%8B%A4.
https://github.com/KLUE-benchmark/KLUE
https://cpm0722.github.io/paper-review/an-empirical-study-of-tokenization-strategies-for-various-korean-nlp-tasks
KoGPT
This is a Korean model released by Kakao Brain, benchmarked against GPT-3. It is a large-scale 6B-parameter model trained on 200B tokens of Korean data, and its vocabulary size is 64,512.
References
https://github.com/kakaobrain/kogpt
https://huggingface.co/kakaobrain/kogpt
https://www.kakaocorp.com/page/detail/9600
http://www.aitimes.com/news/articleView.html?idxno=141575
ET5
Announced by ETRI, this T5-style model was pre-trained simultaneously with T5's mask-filling and GPT-3's next-token prediction objectives. It was trained on 136 GB of data extracted from Wikipedia, newspaper articles, broadcast scripts, and movie/drama scripts. It uses a SentencePiece-based tokenizer with a vocabulary size of 45,100, and the model has 60M parameters.
References
http://exobrain.kr/pages/ko/result/assignment.jsp
https://www.etnews.com/20211207000231
EXAONE (Expert AI for Everyone)
This is a multimodal model announced by LG AI Research, trained on text, speech, and images. It was trained on a corpus of 600 billion items together with more than 250 million high-resolution images that combine language and images, and has approximately 300 billion parameters, the largest in Korea. It has multimodal abilities to learn and handle the various kinds of information involved in human communication, such as turning language into images and images into language.
References
https://www.lgresearch.ai/blog/view?seq=183
https://www.aitimes.kr/news/articleView.html?idxno=23585
https://arxiv.org/pdf/2111.11133.pdf
Three types of Korean language models
Encoder-Centric Models: BERT series
Decoder-Centric Models: GPT series
Encoder-Decoder Models: seq2seq family
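In practice these three families map onto different Hugging Face Auto classes. The sketch below uses one representative checkpoint per family; klue/bert-base and skt/kogpt2-base-v2 are cited above, while the KoBART Hub identifier is an assumption on my part, so check the SKT-AI/KoBART repository for the official checkpoint name.

```python
# Hedged sketch: one representative Korean checkpoint per architecture family.
from transformers import (
    AutoModel,              # encoder-only (BERT family)
    AutoModelForCausalLM,   # decoder-only (GPT family)
    AutoModelForSeq2SeqLM,  # encoder-decoder (seq2seq family)
)

encoder = AutoModel.from_pretrained("klue/bert-base")
decoder = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2")
# Hub id below is assumed; see the SKT-AI/KoBART repository for the official one.
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("gogamza/kobart-base-v2")
```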
Related content
Korean Pre-trained Language Models (1)
AI That Became a Linguistic Genius: Multilingual (Polyglot) Models (1)
AI That Became a Linguistic Genius: Multilingual (Polyglot) Models (2)
Can the Open-Source Language Model BLOOM Become the Flower of AI Democratization?
Why Is Artificial Intelligence Making Korean More Difficult?