Unlike mechanical procedures that follow clear algorithms, human communication is extremely complex. Meaning does not reside in the words alone; it depends on the situation in which they are used, which we call context. As a result, the interpretation of even a single phrase can shift with context, and a small difference in spelling or punctuation can change it too.
Furthermore, human language has evolved over a long period and continues to change. Even within the same language community, spoken and written language differ by region and generation. For example, British, American, and Australian English all differ from one another, and older generations find it hard to understand the neologisms that younger people use.
Natural language processing: helping machines understand human language
If human language is this hard even among humans, how much harder must it be for a computer to understand it?
Conversely, most people don't understand code or machine language, which can be called the language of computers. Basically, the language of machines consists of many combinations of zeros and ones. Nowadays, when we say “Hey Siri” or “OK Google,” it's amazing that machines can understand human language and respond right away.
But how is this communication between machines and humans possible? It became possible as natural language processing (NLP) technology advanced with the arrival of AI based on deep learning. Machines using NLP technology can now interpret spoken or written human language, make judgments, and execute commands.
NLP is one of the fields where rapidly developing AI technology is most actively applied. Research began in the 1950s and progressed through rule-based and then statistical methods. After the 2000s, combining these with deep learning brought NLP to where it is today.
Why is natural language processing in Korean particularly difficult?
Compared to English and other languages, however, natural language processing is said to be more difficult in Korean. So why is that? Out of curiosity, and some regret, we looked into the reasons.
1 Korean is an agglutinative language.
In an agglutinative language, the function of a word is determined by its root and the affixes attached to it. For that reason, when a sentence is tokenized into space-delimited units, the number of distinct word forms that can occur increases enormously. The number of cases is bound to be much higher than in English, which has no particles (postpositions).
For example, even the single word 그녀 ("she") appears in many forms depending on the particle attached: 그녀가, 그녀를, 그녀와, 그녀는, 그녀도, 그녀처럼 (roughly "she" as subject, "her" as object, "with her", "she" as topic, "she too", "like her"), and so on. Therefore, in Korean, separating stems from particles during tokenization is an important task.
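The effect of particles on vocabulary size can be sketched in a few lines of Python. This is only a toy illustration: `PARTICLES` is a small hand-picked list, `split_particle` is a hypothetical helper, and the example forms of 그녀 ("she") are assumed for illustration. A real system would use a morphological analyzer rather than suffix matching.

```python
# Toy illustration only: real Korean tokenization needs a morphological analyzer.
# PARTICLES is a small, hand-picked (hypothetical) list, not a complete inventory.
PARTICLES = ["처럼", "가", "를", "와", "는", "도"]

def split_particle(eojeol):
    """Split a space-delimited unit into (stem, particle) by longest-suffix match."""
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if eojeol.endswith(particle) and len(eojeol) > len(particle):
            return eojeol[: -len(particle)], particle
    return eojeol, ""

# Six surface forms of the same word, each with a different particle attached.
forms = ["그녀가", "그녀를", "그녀와", "그녀는", "그녀도", "그녀처럼"]

surface_vocab = set(forms)                           # whitespace-level tokens
stem_vocab = {split_particle(f)[0] for f in forms}   # after stripping particles

print(len(surface_vocab), len(stem_vocab))  # 6 1
```

Without separating particles, a model sees six unrelated vocabulary entries; after separation it sees one stem plus a handful of reusable particles.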
2 Word order is flexible.
Korean sentences keep their meaning even when the word order changes. For example, "I study at school" and "At school I study" mean the same thing in Korean. Words can often be reordered like this without any problem, and even the subject can be omitted.
This flexibility is certainly convenient in everyday use. On the other hand, it makes natural language processing harder. Since almost any unit can appear after a particular word, it is difficult for probability-based language models to predict the next word.
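To make the point about next-word prediction concrete, here is a minimal sketch that measures how uncertain the word following 나는 ("I") is in two tiny corpora. The corpora and the helper `next_word_entropy` are assumptions made up for this illustration; it uses simple bigram counts, not any particular production language model.

```python
import math
from collections import Counter

def next_word_entropy(sentences, word):
    """Entropy in bits of the empirical distribution of tokens that follow `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            if current == word:
                counts[following] += 1
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())

# Hypothetical toy corpora: the same sentence, with fixed vs. free word order.
fixed_order = ["나는 학교에서 공부한다", "나는 학교에서 공부한다"]
free_order = ["나는 학교에서 공부한다", "학교에서 나는 공부한다"]

print(next_word_entropy(fixed_order, "나는"))  # 0.0
print(next_word_entropy(free_order, "나는"))   # 1.0
```

With a fixed order, the next word is fully predictable (0 bits). Allowing just one reordering already makes two continuations equally likely (1 bit), and the uncertainty grows as more orders become possible.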
3 Spacing rules are often not followed.
Compared to English, spacing is not consistently observed in Korean. The spacing rules are difficult enough that even native speakers cannot follow them strictly, and meaning is still conveyed even when no spaces are used at all. In fact, spacing itself was introduced only in the modern era, and the standard rules have kept changing since.
As a result, a tokenizer cannot rely on spaces to mark word boundaries in Korean, which makes natural language processing more difficult.
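The sketch below shows why unreliable spacing is a problem for whitespace tokenization and how a dictionary can partly recover the boundaries. The greedy longest-match strategy, the `segment` function, and the three-word `VOCAB` are assumptions chosen for illustration; they are not the method used by any specific Korean tokenizer.

```python
# Hypothetical toy dictionary; a real one would contain many thousands of entries.
VOCAB = {"나는", "학교에서", "공부한다"}

def segment(text, vocab=VOCAB):
    """Greedy longest-match segmentation that ignores the spaces in the input."""
    text = text.replace(" ", "")
    max_len = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary entry starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens

unspaced = "나는학교에서공부한다"
print(unspaced.split())   # ['나는학교에서공부한다']
print(segment(unspaced))  # ['나는', '학교에서', '공부한다']
```

Splitting on whitespace returns the whole sentence as a single token when the writer omits spaces, while the dictionary-based pass returns the same three tokens regardless of how the input was spaced.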
4 It is difficult to distinguish between interrogative and declarative sentences.
In fact, if you look only at the text without punctuation marks, it can be impossible to tell which meaning is intended. For example, the Korean sentences for "I ate." and "Did you eat?" can be written with exactly the same words (for instance, 밥 먹었어), so once you remove the period and the question mark, there is no distinction left.
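A short check makes the contrast with English visible. The Korean pair 밥 먹었어. / 밥 먹었어? is an assumed example chosen for this sketch, and `strip_punct` is a hypothetical helper that removes only sentence-final marks.

```python
def strip_punct(sentence):
    """Remove periods, question marks, and exclamation marks."""
    return sentence.translate(str.maketrans("", "", ".?!"))

korean_statement, korean_question = "밥 먹었어.", "밥 먹었어?"
english_statement, english_question = "I ate.", "Did you eat?"

# Korean: the two sentences become identical once punctuation is gone.
print(strip_punct(korean_statement) == strip_punct(korean_question))    # True
# English: word choice and word order still separate the two.
print(strip_punct(english_statement) == strip_punct(english_question))  # False
```

In the Korean pair, the punctuation mark (or intonation in speech) is the only signal, so a model that discards punctuation during preprocessing loses the distinction entirely.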
In closing
Judging from the points above alone, Korean may seem like an unusually difficult language for natural language processing. However, among the many languages in the world, there are plenty that are just as difficult as Korean. For example, Thai, another Asian language, uses no spaces between words, no question marks, and not even periods.
From that point of view, Korean appears difficult largely because natural language processing and machine translation have developed mainly around English. Just as Koreans can learn Japanese more easily than English, machines will find it easier to learn French and Spanish, which are similar to English. Another disadvantage is that Korean data is still relatively scarce compared to other languages.
It is said that a language reflects the characteristics and culture of the country or people that use it. From that point of view, I look forward to seeing natural language processing and machine translation of Korean advance well beyond where they are today. Recently, Korean language data has been enriched through full-scale data projects, and Korean researchers with a deep understanding of the language are continuing their efforts.