Getting started
Artificial intelligence actually has a long history. Because it is an advanced technology, it may seem to have appeared only recently, but research on artificial intelligence* began with the advent of computers in the late 1940s.
Perhaps for that reason, artificial intelligence already feels quite familiar. People have long imagined it in various forms: on screen, highly advanced AI wages war to control humanity, or converses freely with unknown aliens through a translator. Compared with the expectations and fears born of such vague imaginings, however, it is also true that artificial intelligence has not had a significant impact on our actual lives.
Then a major event occurred that made the presence of artificial intelligence feel close at hand: Lee Se-dol, a 9-dan professional and one of the strongest human Go players, lost to AlphaGo in a Go match in 2016. Machines had already beaten humans at chess, but Go has a vastly larger number of possible positions, so it was thought that surpassing humans there would be difficult.
What shattered this prejudice about the limits of artificial intelligence and allowed AlphaGo to shine was deep learning, a technique that raises the probability of solving a problem by having a machine learn from large amounts of data. AlphaGo started by learning from game records* of Go accumulated over a long period of time, and then greatly improved its performance through extensive self-play.
AI translation and corpus
Artificial intelligence has surpassed humans at Go, a game that requires complex strategic thinking. Why, then, do Google Translate and Papago still make so many translation errors instead of surpassing human translators?
Compared with Go, where the machine calculates over a finite (if enormous) number of cases, language is a far larger world. Linguistic expressions change with time, region, and even with the speaker and the situation. Even if humans set criteria for judging which expression is appropriate, there are so many variables that it is inherently difficult for a machine to make such judgments on its own.
Above all, unlike Go with its game records, there is not enough data for machine learning. English translation in specialized fields, where terminology is limited and relatively large amounts of data exist, is in a better position; but data for languages other than English, and for the colloquial language used in everyday life, is still scarce.
For that reason, the surest way to improve the performance of today's machine translators is to create good data. If there is high-quality data to serve as a textbook for training a translator, its performance will naturally improve. The training data for a Korean-English translator, for example, consists of pairs of sentences, one in Korean and one in English. In technical terms, such a set of sentence pairs is called a corpus*.
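To make this concrete, here is a minimal sketch in Python, purely for illustration and not LETR's actual data format, of what such a Korean-English corpus might look like: a collection of sentence pairs in which each Korean sentence is aligned with its English translation.

# A small, hypothetical Korean-English parallel corpus:
# each entry pairs a Korean sentence with its English translation.
korean_english_corpus = [
    ("안녕하세요.", "Hello."),
    ("오늘 날씨가 좋네요.", "The weather is nice today."),
    ("이 문장은 번역기 학습에 쓰입니다.", "This sentence is used to train a translator."),
]

# A translation model would be trained on many such pairs,
# treating the Korean side as the input and the English side as the target.
for source, target in korean_english_corpus:
    print(f"{source}\t{target}")

Real training corpora contain millions of such pairs, and their quality, consistent, accurate, and well-aligned translations, directly affects how well the resulting translator performs.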
An excellent model is of course a prerequisite, but building a good corpus is just as important for improving the performance of a machine translator. That is why LETR is putting great effort into securing as much high-quality corpus data as possible.
This concludes the first story I have prepared about corpora for training AI translators.
Next time, let's talk about corpus construction, that is, the actual process of building a corpus.
References
History of artificial intelligence: https://ko.wikipedia.org/wiki/인공지능#역사
Game record (기보): a record of a game of Go or janggi (Korean chess) (Source: Standard Korean Dictionary)
Corpus (말뭉치): a set of language samples collected for a specific purpose in natural language research. https://ko.wikipedia.org/wiki/말뭉치