This post has been updated to match the latest trends as of 2023, so please refer to the article below.
NER's Present and Future Ver. 2: Korean NER Data Set Summary
This is the third installment in the 'NER's Present and Future' series, covering future development directions and goals. It follows the first installment, 'From Concepts to Various Approaches,' and the second, 'Model Structure and Data Set Status,' so we recommend reading those first.
NER's Present and Future: 01. From Concepts to Various Approaches
NER's Present and Future: 02. Model Structure and Data Set Status
Development direction of the NER model
In practice, the most effective approach is to obtain better results by further training an existing model.
The LETR team chose the ner_ontonotes_bert_mult model* from the DeepPavlov library for the following reasons:
1. It supports the largest number of languages (104).
2. It has the most diverse set of classes (18).
3. Its data processing speed is acceptable.
4. Its recall is noticeably high.
5. It is easy to use, so practitioners can adapt quickly.
The model's embedding is 700 MB, the model itself is 1.4 GB, and it recorded an F1 score of 88.8 on the OntoNotes dataset*. (DeepPavlov also provides separate NER models specialized for Russian and Vietnamese.)
For these reasons, we decided to develop our NER model by further training DeepPavlov's ner_ontonotes_bert_mult model (hereafter, the base model).
Necessity of organizing a Korean NER data set
Appropriate data is essential for model training, but Korean NER datasets are still scarce. In particular, there is no Korean NER dataset with the 18 NE types of the OntoNotes scheme used by the base model, which the LETR team requires. We therefore first propose a plan for constructing a Korean NER dataset, and then go further to propose a new direction for a Korean NER model.
How to secure original data
1. Data already in hand
- TED* Corpus
- Collection of Korean-English contracts
- English: 100,000 sentences
- AI HUB*: Korean-English parallel corpus, 1.6 million sentences
2. Data that can be obtained in the future
- AI HUB: Korean conversations (10,000 sentences), emotional conversations (270,000 sentences)
- 3 million sentences to be built through the data construction support project for AI training*
Procedure for organizing data
To improve the efficiency of data construction, we first run NER with the existing model and then have human operators inspect the results. This requires restructuring the data into a form suitable for inspection, and the inspected data must then be converted back into a form suitable for the model. Specifically, the data is organized in the following order.
1. Run NER with the existing model
2. Data purification (excluding sentences without NEs)
The base model is less accurate on Korean than on the languages it was trained for. Sentences it judges to contain no NE may therefore in fact contain one, so the following two methods are used.
(1) (For multilingual data) Run NER on the corresponding language pair and cross-check the results
(2) (Optional) Verify via crowdsourcing (label each sentence as containing an NE or not)
3. Data processing
Data is processed into a form that can be crowdsourced.
4. First-pass inspection by crowdsourced workers
5. Second-pass inspection by a manager
6. Convert the inspected sentences into a form that can be fed to the model
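As a minimal sketch of step 2 (purification), the filter below keeps only sentences whose predicted tags contain at least one named entity; the `tagged` sample and `has_entity` helper are illustrative assumptions standing in for real model output, not part of any specific library.

```python
# Sketch of step 2 (data purification): keep only sentences whose
# predicted BIO tags contain at least one named entity.
# `tagged` is hypothetical sample data standing in for model output.

def has_entity(tags):
    """Return True if any tag marks a named entity (i.e. is not 'O')."""
    return any(tag != "O" for tag in tags)

tagged = [
    (["Young-hee", "lives", "in", "Seoul", "."],
     ["B-PERSON", "O", "O", "B-GPE", "O"]),
    (["The", "weather", "is", "nice", "."],
     ["O", "O", "O", "O", "O"]),
]

kept = [(tokens, tags) for tokens, tags in tagged if has_entity(tags)]
# Only the first sentence survives; the second contains no NE.
```

In the multilingual case of method (1), the same check can be run on both sides of a language pair and the results compared.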
The specific form of data
1. Tagging scheme and NE types of the Korean NER dataset to be constructed
The tagging scheme for the Korean NER dataset also follows the OntoNotes rules. Using the BIO tagging scheme, NEs are classified into 18 categories, as shown in the table below.
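To illustrate the BIO scheme: the first token of an entity is tagged "B-&lt;type&gt;", subsequent tokens "I-&lt;type&gt;", and all other tokens "O". The sketch below converts token-level entity spans into BIO tags; the `to_bio` helper and its (start, end, type) span format are assumptions for illustration.

```python
# Illustrative BIO tagging: B- marks the first token of an entity,
# I- marks its continuation, and O marks tokens outside any entity.

def to_bio(tokens, entities):
    """Convert token-index spans (end exclusive) to a BIO tag sequence."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Young-hee", "was", "born", "on", "October", "26th"]
tags = to_bio(tokens, [(0, 1, "PERSON"), (4, 6, "DATE")])
# → ['B-PERSON', 'O', 'O', 'O', 'B-DATE', 'I-DATE']
```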
2. Types of data that can be fed to the model
The data fed to the model takes the following form.
As shown above, it consists entirely of text data: tokens and tags are separated by whitespace, and sentences are separated by empty lines.
The dataset is divided into train, test, and validation sets at an 8:1:1 ratio.
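The format and split described above can be sketched in plain Python. The `parse_conll` and `split_811` helper names are hypothetical; the whitespace-separated token/tag layout and the 8:1:1 ratio follow the description above.

```python
import random

# Sample data in the described format: one "token tag" pair per line,
# with a blank line between sentences.
sample = """Young-hee B-PERSON
lives O
in O
Seoul B-GPE

Hello O
"""

def parse_conll(text):
    """Parse whitespace-separated token/tag lines into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            token, tag = line.split()
            current.append((token, tag))
        elif current:          # blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:                # flush a trailing sentence without a blank line
        sentences.append(current)
    return sentences

def split_811(sentences, seed=0):
    """Shuffle and split into train/test/validation at an 8:1:1 ratio."""
    rng = random.Random(seed)
    data = sentences[:]
    rng.shuffle(data)
    n_train = int(len(data) * 0.8)
    n_test = int(len(data) * 0.1)
    return data[:n_train], data[n_train:n_train + n_test], data[n_train + n_test:]

sentences = parse_conll(sample)  # two sentences
```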
3. The type of data used during inspection
Information about the entity type is placed in angle brackets (< >) around the entity name.
(example)
Hello? My name is <PERSON>Young-hee</PERSON>. My birthday is <DATE>October 26th</DATE>. I live in <GPE>Seoul</GPE>. I am a <NORP>Korean</NORP> who speaks <LANGUAGE>Korean</LANGUAGE>.
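Reading this inspection format back into entity lists (step 6 of the procedure) can be sketched with a regular expression; the `extract_entities` helper and the regex are illustrative assumptions, not the team's actual tooling.

```python
import re

# Hypothetical reader for the inspection format above: the entity type
# appears in angle brackets around the entity text, XML-style.
TAG_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>")

def extract_entities(marked):
    """Return (entity_text, entity_type) pairs from an inspected sentence."""
    return [(m.group(2), m.group(1)) for m in TAG_RE.finditer(marked)]

sample = "My name is <PERSON>Young-hee</PERSON>. I live in <GPE>Seoul</GPE>."
# extract_entities(sample) → [('Young-hee', 'PERSON'), ('Seoul', 'GPE')]
```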
Calculating the target number of data sets
When 41,969 sentences were sampled from fields such as media, culture, science, anthropology, philosophy, and economics, named entities were recognized in 2,453 of them, a ratio of 5.8%. (Note that this is the ratio for written text; the ratio in colloquial speech may differ.)
In other words, if we simply assume that about 5% of the sentences in the entire corpus contain named entities, we can estimate that roughly 250,000 of approximately 5 million sentences do. We therefore aim for a dataset of 250,000 sentences containing named entities.
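The estimate above can be reproduced with simple arithmetic:

```python
# Reproducing the estimate from the sampled counts above.
sampled, with_entities = 41_969, 2_453
ratio = with_entities / sampled        # ≈ 0.058, i.e. about 5.8%
target = int(5_000_000 * 0.05)         # conservative 5% assumption
# target == 250_000 entity-bearing sentences
```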
At the end
As stated earlier, NER plays a very important role in information retrieval, so it is an active research area in natural language processing. In particular, because the names of people, organizations, and places can be detected automatically, NER not only improves translation quality by preventing translation errors, but can also greatly increase user satisfaction through field-specific customized translation.
Nevertheless, NER datasets specific to Korean are still scarce. To overcome this limitation, the LETR team is building a Korean-centered dataset and training a higher-performance Korean NER model on it to enable more accurate and natural translation.
Of course, machine translation at the level of a professional translator will not be possible right away. But as we continue to advance the technology, I believe we will soon create the better world we dream of, where everyone can communicate without language barriers.
* Data construction support project for AI training: a core project of the Digital New Deal 'Data Dam' organized by the Ministry of Science and ICT and the National Information Society Agency; Twig Farm was selected as an executing agency for the 'Building Data for AI Learning' project.
NER's present and future
NER's Present and Future: 01. From Concepts to Various Approaches
NER's Present and Future: 02. Model Structure and Data Set Status
NER's Present and Future: 03. Future Development Direction and Goals