This post has been updated to match the latest trends as of 2023, so please refer to the article below.
NER's Present and Future Ver. 2: Korean NER Data Set Summary
This post covers the second topic in the <NER의 현재와 미래> (NER's Present and Future) series, 'NER's model structure and data sets'. It follows on from the first topic, 'From concepts to diverse concepts', so if you have not read that post yet, we recommend reading it first.
NER's model structure
According to the paper 'A Survey on Deep Learning for Named Entity Recognition', the structure of the NER model can be divided into a three-step process as shown in the figure below.
(1) Distributed representations for input*
Pre-trained word embeddings, character-level embeddings, POS* tags, gazetteers, and similar features are used in the layer that represents the input data as vectors.
(2) Context Encoder
Models such as CNN*, RNN*, language model*, and Transformer* are used as layers to encode contextual information.
(3) Tag Decoder
Models such as Softmax, CRF*, RNN, and Pointer Network are used as layers to decode tag information.
However, not all models strictly follow this structure. Deep learning models in particular work end to end, so the steps are not always clearly separated. Still, once traditional approaches are included, the three steps above provide a convenient frame of reference (see the sketch below).
* Distributed representations for input: distributed representations of the inputs
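To make the three-step structure above concrete, here is a minimal sketch in PyTorch (our own illustration; the survey itself is framework-agnostic). The embedding-only input layer, the BiLSTM encoder, the softmax tag decoder, and all dimensions are illustrative assumptions; real systems often add character-level embeddings, gazetteer features, or a CRF decoder.

```python
# Minimal sketch of the three-step NER structure: (1) distributed input
# representation, (2) context encoder, (3) tag decoder. Illustrative only.
import torch
import torch.nn as nn

class SimpleNER(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        # (1) Distributed representation of the input: word embedding lookup
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # (2) Context encoder: bidirectional LSTM over the token sequence
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # (3) Tag decoder: per-token linear projection + softmax over tags
        self.decoder = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        x = self.embedding(token_ids)      # (batch, seq_len, emb_dim)
        h, _ = self.encoder(x)             # (batch, seq_len, 2*hidden_dim)
        logits = self.decoder(h)           # (batch, seq_len, num_tags)
        return logits.log_softmax(dim=-1)  # log-probabilities per tag

# Example: score a dummy batch of 2 sentences of length 5 against 9 BIO tags.
model = SimpleNER(vocab_size=10000, num_tags=9)
dummy = torch.randint(0, 10000, (2, 5))
print(model(dummy).shape)  # torch.Size([2, 5, 9])
```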
Status and performance evaluation of NER libraries
Currently it is hard to find an official NER library built only for Korean; instead, Korean is supported by most libraries released for multiple languages. Each library has the following characteristics:
The evaluation was then carried out with a data set* distributed on Kaggle*. Since the number of classes differed between the data set and each library, every library's classes first had to be mapped to the classes in the data set, and during this process it became clear that libraries able to classify more classes than the reference data set were inevitably penalized in precision. For this reason, precision and the F1-score derived from it were excluded from the criteria for judging NER performance, and library performance was assessed based only on recall and the time required. The results are as follows:
It can be seen that the Stanford NER Tagger falls behind in the time required (based on 1,000 reviews), while flair and polyglot show lower performance in terms of recall.
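As a rough illustration of how the two remaining criteria, recall and time required, can be measured, the sketch below times an arbitrary prediction function and computes entity-level recall against gold annotations. The function `predict_entities` and the data layout are hypothetical stand-ins, not the evaluation script actually used.

```python
# Hedged sketch: measure recall and elapsed time for any NER prediction function.
import time

def evaluate(predict_entities, reviews, gold_entities):
    """reviews: list of str; gold_entities: list of sets of (text, type) pairs."""
    start = time.perf_counter()
    predictions = [predict_entities(text) for text in reviews]
    elapsed = time.perf_counter() - start

    # Entity-level recall: how many gold entities were found by the library.
    found = sum(len(gold & pred) for gold, pred in zip(gold_entities, predictions))
    total = sum(len(gold) for gold in gold_entities)
    recall = found / total if total else 0.0
    return recall, elapsed

# Usage with a trivial dummy predictor (for illustration only):
dummy_predict = lambda text: {("Seoul", "LOC")} if "Seoul" in text else set()
reviews = ["I visited Seoul last week.", "Great service."]
gold = [{("Seoul", "LOC")}, set()]
print(evaluate(dummy_predict, reviews, gold))  # (1.0, <elapsed seconds>)
```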
Representative English NER data sets
(1) CoNLL 2003 (Sang and Meulder, 2003)* (format sketched after this list)
: Copyright Policy - DUA
: 1,393 news articles in English (mostly sports-related)
: 4 types of annotated* entities — {LOC (location), ORG (organization), PER (person), MISC (miscellaneous)}
* Annotated: having annotations (explanatory notes) added, as to a book or other text
(2) OntoNotes 5.0 (Weischedel et al., 2013)*
: Copyright Policy — LDC
: The types and amounts of data are as follows.
* Pivot text: Old Testament and New Testament text
: 18 types of annotated entities
(3) MUC-6 (Grishman and Sundheim, 1996)
: Copyright Policy — LDC
: News articles from the Wall Street Journal
: 3 types of Annotated Entities — {PER, LOC, ORG}
(4) WNUT 17: Emerging and Rare Entity Recognition (Derczynski et al., 2016)
: Copyright Policy — CC-BY 4.0
: Social media text (YouTube comments, Stack Overflow responses, Twitter posts, and Reddit comments)
: 6 types of Annotated Entities — {Person, Location, Group, Creative Work, Corporation, Product}
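For reference, CoNLL 2003 annotations are commonly distributed in a four-column, space-separated format (token, POS tag, chunk tag, NER tag), with blank lines separating sentences and -DOCSTART- lines marking document boundaries. The sketch below reads that format; the sample sentence is invented for illustration rather than taken from the corpus.

```python
# Hedged sketch: parse CoNLL 2003-style lines into (token, NER-tag) sentences.
sample = """\
John NNP B-NP B-PER
lives VBZ B-VP O
in IN B-PP O
London NNP B-NP B-LOC
. . O O
"""

def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, _pos, _chunk, ner = line.split()  # four columns per token line
        current.append((token, ner))
    if current:
        sentences.append(current)
    return sentences

print(read_conll(sample.splitlines()))
# [[('John', 'B-PER'), ('lives', 'O'), ('in', 'O'), ('London', 'B-LOC'), ('.', 'O')]]
```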
Representative Korean NER data sets
Korean NER data is very scarce. Currently, a total of three Korean NER data sets have been publicly released, and commercial use of all of them is restricted.
(1) National Institute of Korean Language NER data set
: A total of 3,555 items
: Uses the BIO tagging scheme (see the sketch after this list)
: 5 types of Annotated Entities — {Place (LC), Date (DT), Organization (OG), Time (TI), Person (PS)}
* National Institute of Korean Language Modu Corpus (모두의 말뭉치), https://corpus.korean.go.kr
(2) Korea Maritime University Natural Language Processing Laboratory NER data set
: A total of 23,964 items
: Uses the BIO tagging scheme
: 10 types of Annotated Entities — {Person (PER), Organization (ORG), Place Name (LOC), Other (POH), Date (DAT), Time (TIM), Duration (DUR), Currency (MNY), Ratio (PNT), Other Quantity Expressions (NOH)}
* Korea Maritime University Natural Language Processing Laboratory on GitHub, https://github.com/kmounlp
(3) NAVER NLP CHALLENGE 2018
: A total of 82,393 items
: Uses the BIO tagging scheme
: 14 types of Annotated Entities — {Person (PER), Field of Study (FLD), Artifact (AFW), Organization (ORG), Location (LOC), Civilization and Culture (CVL), Date (DAT), Time (TIM), Numbers (NUM), Developments and Events (EVT), Animals (ANM), Plants (PLT), Metals/Rocks/Chemicals (MAT), Medical Terms/IT Related Terms (TRM)}
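All three Korean data sets above use the BIO scheme, in which B- marks the first token of an entity, I- marks a continuation token, and O marks tokens outside any entity. The sketch below uses an invented example sentence with the National Institute of Korean Language tag names to show how BIO tags map back to entity spans.

```python
# Hedged sketch: recover (entity_text, entity_type) spans from BIO tags.
tokens = ["홍길동", "은", "2018년", "에", "서울", "로", "이사했다", "."]
bio_tags = ["B-PS", "O", "B-DT", "O", "B-LC", "O", "O", "O"]

def bio_to_spans(tokens, tags):
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # start of a new entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of the entity
            current.append(token)
        else:                              # outside any entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

print(bio_to_spans(tokens, bio_tags))
# [('홍길동', 'PS'), ('2018년', 'DT'), ('서울', 'LC')]
```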
This concludes 'Model structure and data set status', the second topic in the 'NER's Present and Future' series. The series will continue soon with its third topic, 'Future development direction and goals'.
NER's Present and Future
NER's Present and Future: 01. From concepts to diverse concepts
NER's Present and Future: 02. Model structure and data set status
NER's Present and Future: 03. Future development direction and goals