Most artificial intelligence is trained to do only one thing well. Of course, a model specialized in that one thing can be very powerful; think of IBM's supercomputer Deep Blue, which beat the reigning chess world champion.
At the time, people watched a human lose and worried that computers would soon overwhelm us. But the real world is far more complicated than a chessboard, and Deep Blue could do nothing other than play chess. This is the limit of such technology, and it is also why AI has not yet surpassed humans.
“In fact, chess, a game ruled by logic and fixed rules, is relatively friendly territory for a computer. What is harder are the activities humans do naturally, without much conscious effort: everyday things like driving on a busy road, holding a conversation, or reading a friend's feelings.”*
Limitations of language models
OpenAI's GPT-3 has been called the best AI in human history. It can chat as if you were talking to a person, and it can even write long-form text. And not at a flimsy level: it has written columns for the media and even fiction.
However, even GPT-3, powerful as it is, shares the same limitation as conventional AI: it was trained only on text data (albeit on a huge scale). Judging by language skill alone it seems close to human, yet it makes surprisingly basic mistakes quite often, because it learned about the world only through writing.
Below are examples of GPT-3 mistakes collected by AI researchers abroad. They illustrate the limitation mentioned above: the model seems to lack what we commonly call “common sense.”**
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
The rise of multimodal AI
Humans, on the other hand, do not understand the world through text alone. Beyond text, we take in, share, and communicate information in many formats such as images, audio, and video. This is what makes rich, intuitive thinking possible, and what lets us empathize and converse with one another.
This is the backdrop against which multimodal AI, one of the hottest topics in the AI industry recently and the subject of this post, has emerged.
Multimodal AI accepts and uses multiple modalities*** at the same time, such as images, text, voice, and video. By taking in diverse data like voice, gestures, gaze, facial expressions, and biological signals, it can reason in a comprehensive way that more closely imitates a human. AI that comes closer to the way we take in the world can communicate with humans more naturally.
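To make the idea concrete, here is a minimal sketch of one common approach, often called late fusion, in which each modality gets its own encoder and the resulting vectors are combined into a single joint representation. This is only an illustrative toy, not how DALL-E or any particular product is built; all layer sizes, names, and shapes below are made up for the example.

```python
# Toy late-fusion model: encode text and an image separately, then
# concatenate the two embeddings into one joint representation.
import torch
import torch.nn as nn

class ToyMultimodalClassifier(nn.Module):
    def __init__(self, text_vocab=1000, img_channels=3, embed_dim=64, num_classes=5):
        super().__init__()
        # Text encoder: embed tokens and average them into one vector.
        self.text_embed = nn.Embedding(text_vocab, embed_dim)
        # Image encoder: a tiny conv net pooled down to one vector.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(img_channels, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Fusion: concatenate the two modality vectors and classify.
        self.head = nn.Linear(embed_dim * 2, num_classes)

    def forward(self, token_ids, image):
        text_vec = self.text_embed(token_ids).mean(dim=1)  # (batch, embed_dim)
        img_vec = self.img_encoder(image)                  # (batch, embed_dim)
        fused = torch.cat([text_vec, img_vec], dim=-1)     # joint representation
        return self.head(fused)

model = ToyMultimodalClassifier()
tokens = torch.randint(0, 1000, (2, 8))   # a batch of 2 short "sentences"
images = torch.rand(2, 3, 32, 32)         # a batch of 2 small RGB images
print(model(tokens, images).shape)        # torch.Size([2, 5])
```

Real multimodal systems fuse modalities in far more sophisticated ways (cross-attention, contrastive pretraining, and so on), but the core idea of turning each modality into a shared representation is the same.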
Multimodal AI can also do far more than write. It is evolving to do new things based on diverse data: for example, it can produce creative designs by learning from many images, or turn a simple piece of text into a video.
The era of multimodal AI
The first attempts gave sight to language models such as GPT-3. Computer vision, a field of sensory perception with a long history and tradition, was the first to be applied. The expectation was that linking words with visual information would improve not only the model's comprehension but also broaden its range of applications later on.
The announcement of DALL-E showed that this attempt was finally on track. Following GPT-3, OpenAI once again delivered remarkable results: by adding image processing on top of its NLP technology, DALL-E became able to create new images from text descriptions.
DALL-E 2, announced in 2022, then went a step further. It added new capabilities such as editing and retouching existing photos, and it can now create realistic, artistic, high-resolution images far more advanced than before.
Beyond DALL-E, a steady stream of multimodal AIs continues to appear. Google has released Imagen, a text-to-image diffusion model, and Korean companies are jumping in one after another: Kakao Brain's minDALL-E showed results comparable to DALL-E, and LG AI Research announced EXAONE, which can work bidirectionally between text and images.
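Since several of the models above, including Imagen and DALL-E 2, are diffusion models, a rough sketch of what a diffusion sampling loop looks like may help. This shows only the general shape of the idea under simplifying assumptions: real systems use large trained networks conditioned on text embeddings and carefully derived noise schedules, whereas the `denoiser` below is just a placeholder.

```python
# Sketch of the reverse (sampling) loop of a text-conditioned diffusion model.
import torch

def denoiser(noisy_image, t, text_embedding):
    # Placeholder for a trained network that predicts the noise present
    # in `noisy_image` at timestep `t`, conditioned on the text prompt.
    return torch.zeros_like(noisy_image)

def sample(text_embedding, steps=50, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t, text_embedding)
        # Each step removes a little of the predicted noise; real samplers
        # (DDPM, DDIM, ...) use carefully derived coefficients instead of 0.1.
        x = x - 0.1 * predicted_noise
    return x

image = sample(text_embedding=torch.rand(1, 128))
print(image.shape)  # torch.Size([1, 3, 64, 64])
```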
Side effects of multimodal AI
Like language models, multimodal AI is hard-pressed to stay free of ethical issues. It is a problem common to today's AI, which is inevitably shaped by biases in its training data. An AI that has absorbed the misconceptions about race and gender already circulating in the world can cause real problems.
As a result, most multimodal AI comes with restrictions on release or use. Harmful images are filtered, and in particular the generation of images of real people is strictly prohibited. Until a fundamental solution to bias is found, malicious users could otherwise produce offensive or sensationalized results.
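At its simplest, a use restriction like this can be pictured as a prompt-level filter sitting in front of the generator. The snippet below is a toy illustration only: actual services rely on trained classifiers and much richer policies, and the blocklist terms here are purely hypothetical.

```python
# Toy prompt-level filter placed in front of an image generator.
BLOCKED_TERMS = {"violence", "gore", "celebrity"}  # hypothetical example terms

def is_prompt_allowed(prompt: str) -> bool:
    words = set(prompt.lower().split())
    return not (words & BLOCKED_TERMS)

print(is_prompt_allowed("a watercolor painting of a lighthouse"))  # True
print(is_prompt_allowed("a photo of a celebrity at the beach"))    # False
```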
Researchers, however, are not simply standing by. Since multimodal AI appeared, it has kept improving as more examples, data, and feedback are collected. In the case of DALL-E, OpenAI is developing techniques for generating less biased images and making multi-pronged efforts such as strengthening the filters that block harmful images.
Multimodal AI holds a lot of potential. But like all AI technology, it must ultimately develop in a direction that benefits humanity. AI will need to be developed and used more ethically and responsibly so that it becomes an opportunity rather than a threat.
* Paraphrased from https://www.technologyreview.kr/ai의-과거를-통해-ai의-미래를-본다/
** Excerpted and summarized from https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html, https://multiverseaccordingtoben.blogspot.com/2020/07/gpt3-super-cool-but-not-path-to-agi.html
*** https://en.wikipedia.org/wiki/Modality_(human–computer_interaction)
References
[1] https://www.technologyreview.kr/ai의-과거를-통해-ai의-미래를-본다/
[2] https://www.blog.google/products/search/introducing-MUM/
[3] https://www.ted.com/talks/jeff_dean_ai_isn_t_as_smart_as_you_think_but_it_could_be
[4] https://openai.com/dall-e-2/
[5] https://openai.com/blog/dall-e-2-extending-creativity/
[6] http://www.aitimes.com/news/articleView.html?idxno=144897
[7] https://www.kakaobrain.com/contents?contentId=6c33343e-4c3c-4bf5-8927-7649d90bab98
[8] http://www.aitimes.com/news/articleView.html?idxno=141958
[9] http://www.aitimes.com/news/articleView.html?idxno=144483
[10] http://www.aitimes.com/news/articleView.html?idxno=145260
[11] https://openai.com/blog/reducing-bias-and-improving-safety-in-dall-e-2/