Forget chat: AI that can listen, see and click is already here

Chatting with an AI chatbot is a 2022 thing. The hottest new AI tools take advantage of multimodal models, which can handle several types of input at once, such as images, audio, and text.

Exhibit A: Google’s NotebookLM. NotebookLM is a research tool that the company launched without much fanfare a year ago. A few weeks ago, Google added an AI podcast tool called Audio Overview to NotebookLM, allowing users to create podcasts on any topic. Just add a link to, say, your LinkedIn profile, and the AI podcast hosts will boost your ego for nine minutes. The feature became an unexpected viral hit.

Multimodal AI-generated content has also improved a lot in a short time. In September 2022, I covered Meta’s first text-to-video model, Make-A-Video. Compared to today’s technology, those videos seem clumsy and crude. Meta just announced its competitor to OpenAI’s Sora, called Movie Gen. The tool allows users to use text prompts to create custom videos and sounds, edit existing videos, and turn images into videos.

The way we interact with AI systems is also changing, becoming less text-dependent. OpenAI’s new Canvas interface allows users to collaborate on projects with ChatGPT. Instead of relying on a traditional chat window, which requires multiple rounds of prompts and text regeneration to get the desired result, Canvas lets people select snippets of text or code to edit.

Even search is getting a multimodal update. In addition to inserting ads into AI Overviews, Google has rolled out a new feature where users can upload a video and use their voice to search for information. In a demo at Google I/O, the company showed how you can open the Google Lens app, record a video of fish swimming in an aquarium, and ask a question about them. Google’s Gemini model then searches the web and provides an answer in the form of a Google AI summary.

What unites these features is a more interactive and customizable interface, as well as the ability to apply AI tools to different types of materials. NotebookLM was the first AI product in a while that brought me delight and surprise, in part because of how different, realistic, and unexpected the AI voices were. But the fact that NotebookLM’s Audio Overview became a success despite being a secondary feature within a larger product just goes to show that AI developers don’t really know what they’re doing. Hard to believe now, but ChatGPT itself was an unexpected success for OpenAI.

We are a few years into the billion-dollar generative AI boom. The huge investment in AI has contributed to the rapid improvement in the quality of the resulting content. But we haven’t seen the “killer app” yet, and these new multimodal applications are a result of the immense pressure AI companies are facing to generate profit and results. Tech companies are releasing different AI tools to the public and seeing what sticks.

(Source: MIT Technology Review)