This post has been updated to match the latest trends as of 2023, so please refer to the article below.
NER's Present and Future Ver. 2: Korean NER Data Set Summary
This is the third installment in the 'NER's Present and Future' series, covering future development directions and goals. It follows the first installment, 'From Concepts to Various Approaches,' and the second, 'Model Structure and Data Set Status,' so we recommend reading those first.
NER's Present and Future: 01. From Concepts to Various Approaches
NER's Present and Future: 02. Model Structure and Data Set Status
Development direction of the NER model
In practice, the most effective approach is to obtain better results by further training an existing model.
The LETR team chose the ner_ontonotes_bert_mult model* from the DeepPavlov library for the following reasons:
1. It supports the largest number of languages (104).
2. It has the most diverse set of classes (18).
3. Its data processing speed is acceptable.
4. Its recall is noticeably high.
5. It is easy to use, so practitioners can adapt quickly.
The model's embedding is 700 MB, the model itself is 1.4 GB, and it recorded an F1 score of 88.8 on the OntoNotes dataset*. (DeepPavlov also provides separate NER models specialized for Russian and Vietnamese.)
For these reasons, we decided to develop our NER model by further training DeepPavlov's ner_ontonotes_bert_mult model (hereafter, the base model).
Necessity of organizing a Korean NER data set
Appropriate data is essential for model training, but Korean NER datasets are still scarce. In particular, there is no Korean NER dataset with the 18 NE types of the OntoNotes scheme used by the base model, which the LETR team requires. We therefore first propose a plan for constructing a Korean NER dataset, and then go further to propose a new direction for a Korean NER model.
How to secure original data
1. Data already in hand
- TED* Corpus
- Collection of Korean-English contracts
- English: 100,000 sentences
- AI HUB*: Korean-English parallel corpus, 1.6 million sentences
2. Data that can be obtained in the future
- AI HUB: Korean conversations (10,000 sentences), emotional conversations (270,000 sentences)
- 3 million sentences to be built through the data construction support project for AI training*
Procedure for organizing data
To improve the efficiency of data construction, we first run NER with the existing model and then have human operators inspect the results. This requires restructuring the data into a form suitable for inspection, and the inspected data must then be converted back into a form suitable for the model. Specifically, the data is organized in the following order.
1. Run NER with the existing model
2. Data purification (excluding sentences without NEs)
The base model is less accurate on Korean than on the languages it was trained for. Sentences it judges to contain no NE may therefore in fact contain one, so the following two methods are used.
(1) (For multilingual data) Run NER on the corresponding language pair and cross-check the results
(2) (Optional) Verify via crowdsourcing (label each sentence as containing an NE or not)
3. Data processing
Data is processed into a form that can be crowdsourced.
4. First-pass inspection by crowdsourced workers
5. Second-pass inspection by a manager
6. Convert the inspected sentences into a form that can be fed to the model
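As a minimal sketch of step 2 (purification), the filter below keeps only sentences whose predicted tags contain at least one named entity; the `tagged` sample and `has_entity` helper are illustrative assumptions standing in for real model output, not part of any specific library.

```python
# Sketch of step 2 (data purification): keep only sentences whose
# predicted BIO tags contain at least one named entity.
# `tagged` is hypothetical sample data standing in for model output.

def has_entity(tags):
    """Return True if any tag marks a named entity (i.e. is not 'O')."""
    return any(tag != "O" for tag in tags)

tagged = [
    (["Young-hee", "lives", "in", "Seoul", "."],
     ["B-PERSON", "O", "O", "B-GPE", "O"]),
    (["The", "weather", "is", "nice", "."],
     ["O", "O", "O", "O", "O"]),
]

kept = [(tokens, tags) for tokens, tags in tagged if has_entity(tags)]
# Only the first sentence survives; the second contains no NE.
```

In the multilingual case of method (1), the same check can be run on both sides of a language pair and the results compared.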
The specific form of data
1. Tagging scheme and NE types of the Korean NER dataset to be constructed
The tagging scheme for the Korean NER dataset also follows the OntoNotes rules. Using the BIO tagging scheme, NEs are classified into 18 categories, as shown in the table below.
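To illustrate the BIO scheme: the first token of an entity is tagged "B-&lt;type&gt;", subsequent tokens "I-&lt;type&gt;", and all other tokens "O". The sketch below converts token-level entity spans into BIO tags; the `to_bio` helper and its (start, end, type) span format are assumptions for illustration.

```python
# Illustrative BIO tagging: B- marks the first token of an entity,
# I- marks its continuation, and O marks tokens outside any entity.

def to_bio(tokens, entities):
    """Convert token-index spans (end exclusive) to a BIO tag sequence."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

tokens = ["Young-hee", "was", "born", "on", "October", "26th"]
tags = to_bio(tokens, [(0, 1, "PERSON"), (4, 6, "DATE")])
# → ['B-PERSON', 'O', 'O', 'O', 'B-DATE', 'I-DATE']
```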
2. Types of data that can be fed to the model
The data fed to the model takes the following form.
As shown above, it consists entirely of text data: tokens and tags are separated by whitespace, and sentences are separated by empty lines.
The dataset is divided into train, test, and validation sets at an 8:1:1 ratio.
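The format and split described above can be sketched in plain Python. The `parse_conll` and `split_811` helper names are hypothetical; the whitespace-separated token/tag layout and the 8:1:1 ratio follow the description above.

```python
import random

# Sample data in the described format: one "token tag" pair per line,
# with a blank line between sentences.
sample = """Young-hee B-PERSON
lives O
in O
Seoul B-GPE

Hello O
"""

def parse_conll(text):
    """Parse whitespace-separated token/tag lines into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            token, tag = line.split()
            current.append((token, tag))
        elif current:          # blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:                # flush a trailing sentence without a blank line
        sentences.append(current)
    return sentences

def split_811(sentences, seed=0):
    """Shuffle and split into train/test/validation at an 8:1:1 ratio."""
    rng = random.Random(seed)
    data = sentences[:]
    rng.shuffle(data)
    n_train = int(len(data) * 0.8)
    n_test = int(len(data) * 0.1)
    return data[:n_train], data[n_train:n_train + n_test], data[n_train + n_test:]

sentences = parse_conll(sample)  # two sentences
```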
3. The type of data used during inspection
Information about the entity type is placed in angle brackets (< >) around the entity name.
(example)
Hello? My name is <PERSON>Young-hee</PERSON>. My birthday is <DATE>October 26th</DATE>. I live in <GPE>Seoul</GPE>. I am a <NORP>Korean</NORP> who speaks <LANGUAGE>Korean</LANGUAGE>.
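Reading this inspection format back into entity lists (step 6 of the procedure) can be sketched with a regular expression; the `extract_entities` helper and the regex are illustrative assumptions, not the team's actual tooling.

```python
import re

# Hypothetical reader for the inspection format above: the entity type
# appears in angle brackets around the entity text, XML-style.
TAG_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>")

def extract_entities(marked):
    """Return (entity_text, entity_type) pairs from an inspected sentence."""
    return [(m.group(2), m.group(1)) for m in TAG_RE.finditer(marked)]

sample = "My name is <PERSON>Young-hee</PERSON>. I live in <GPE>Seoul</GPE>."
# extract_entities(sample) → [('Young-hee', 'PERSON'), ('Seoul', 'GPE')]
```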
Calculating the target number of data sets
When 41,969 sentences were sampled from fields such as media, culture, science, anthropology, philosophy, and economics, named entities were recognized in 2,453 of them, a ratio of 5.8%. (Note that this is the ratio for written text; the ratio in colloquial speech may differ.)
In other words, if we simply assume that about 5% of the sentences in the entire corpus contain named entities, we can estimate that roughly 250,000 of approximately 5 million sentences do. We therefore aim for a dataset of 250,000 sentences containing named entities.
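The estimate above can be reproduced with simple arithmetic:

```python
# Reproducing the estimate from the sampled counts above.
sampled, with_entities = 41_969, 2_453
ratio = with_entities / sampled        # ≈ 0.058, i.e. about 5.8%
target = int(5_000_000 * 0.05)         # conservative 5% assumption
# target == 250_000 entity-bearing sentences
```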
At the end
As stated earlier, NER plays a very important role in information retrieval, so it is an active research area in natural language processing. In particular, because the names of people, organizations, and places can be detected automatically, NER not only improves translation quality by preventing translation errors, but can also greatly increase user satisfaction through field-specific customized translation.
Nevertheless, NER datasets specific to Korean are still scarce. To overcome this limitation, the LETR team is building a Korean-centered dataset and training a higher-performance Korean NER model on it to enable more accurate and natural translation.
Of course, machine translation at the level of a professional translator will not be possible right away. But as we continue to advance the technology, I believe we will soon create the better world we dream of, where everyone can communicate without language barriers.
* Data construction support project for AI training: a core project of the Digital New Deal 'Data Dam' organized by the Ministry of Science and ICT and the National Information Society Agency; Twig Farm was selected as an executing agency for the 'Building Data for AI Learning' project.
NER's present and future
NER's Present and Future: 01. From Concepts to Various Approaches
NER's Present and Future: 02. Model Structure and Data Set Status
NER's Present and Future: 03. Future Development Direction and Goals