The arrival of generative artificial intelligence (AI) in the daily habits of millions of people has raised several questions about the veracity and reliability of the content this technology produces. The most recent episode in this controversy emerged in recent days, when it was discovered that Google’s AI Overviews, the texts generated by the Gemini AI that the search engine has been offering for a few weeks in response to user searches, in some cases produce arbitrary results.
It has been shown that, when asked about the “meaning” of non-existent or completely invented idioms, Mountain View’s AI provides detailed and seemingly well-founded explanations, as if these idioms actually existed. This is not the first time AI Overviews has made headlines in this regard: the feature has already drawn criticism for unreliable and potentially dangerous answers to health-related questions.
Overall, assessing the reliability of generative AI responses is complex, because performance is highly context-dependent and hard to express in absolute terms. According to the Massive Multitask Language Understanding (MMLU) benchmark, one of the tests used to gauge the reliability of generative AI, GPT-4o (OpenAI’s latest model) reportedly achieves an accuracy rate of 88.7%. However, these figures come from evaluation methods that AI experts tend to consider unreliable, unrepresentative, and overly generic, even though AI companies place great value on them.
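For context, a benchmark score such as MMLU’s is simply an aggregate accuracy figure: the model answers a fixed set of multiple-choice questions, and the score is the share it gets right. The sketch below illustrates that calculation, assuming a hypothetical ask_model() helper and dataset format; it is not the actual evaluation code used by OpenAI or by the benchmark’s maintainers.

```python
# Minimal sketch of how a benchmark accuracy figure such as MMLU's is
# computed: each item is a multiple-choice question, the model picks one
# option, and accuracy is the fraction answered correctly. The ask_model()
# helper and the dataset format are hypothetical placeholders.

def ask_model(question: str, options: list[str]) -> str:
    """Placeholder for a call to a language model that returns one option."""
    raise NotImplementedError

def benchmark_accuracy(dataset: list[dict]) -> float:
    correct = 0
    for item in dataset:
        prediction = ask_model(item["question"], item["options"])
        if prediction == item["answer"]:
            correct += 1
    return correct / len(dataset)

# A reported score of 88.7% means the model matched the expected answer on
# 88.7% of the benchmark's questions; it says nothing about truthfulness on
# open-ended queries outside the test set.
```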
The research available so far in specific areas, however, has yielded fairly telling results. Google’s generative AIs are not the only ones with veracity problems. ChatGPT has also produced unreliable results in response to queries on a variety of topics. In one case last November, a person in Norway was falsely accused by the chatbot of having a criminal record for murder, a case that has since turned into a legal controversy. In the health field, ChatGPT’s responses also appear to be unreliable. When it comes to news, Columbia University’s Tow Center for Digital Journalism found that the largest generative AI applications are not very good at finding and citing news sources. The same appears to be true of legal information. Even Meta’s AI, recently integrated into some of the company’s products, seems to have trouble with reality.
AI “Hallucinations”
The term “hallucinations” has often been used to describe this type of problematic response, as if it were a delusion on the part of the machine. The concept of “hallucination,” however, is particularly controversial and has been criticized by several experts for its medical connotations—inapplicable to a machine—and because the word presupposes the existence of a state of consciousness and knowledge from which the AI can erroneously deviate. Generative AIs, however, are neither conscious nor able to know what they are saying: consequently, they cannot even hallucinate.
The problem of the reliability of these tools is, in fact, largely a problem of the expectations we place on them. Some answers to these questions can be found directly on OpenAI’s website, where the company describes the capabilities of its AI. There we read, for example, that “results may be inaccurate, false, or misleading in some cases,” that the system “may occasionally provide incorrect answers,” and other similar warnings. Generative AI, in fact, represents an evolution of machine learning, in that it enriches models with the ability to generate new content, such as text or images, from the data on which they were trained. These models have no real understanding and are unable to distinguish between reality and invention, nor between what is correct and what is not.
Generative AI is very useful and often produces surprising results, with answers that come remarkably close to human language and reasoning, but it does so by performing probabilistic and statistical calculations, not through any real understanding of the questions posed to it. Despite the impressive technical improvements in the models underlying these systems, generative AI rests on a simple principle: it “predicts” the statistically most likely answer, not the true one.
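To make the “predicting the statistically most likely answer” point concrete, the sketch below shows the core step a language model performs at generation time: score every candidate next token, turn the scores into probabilities, and sample one. The tiny vocabulary, the scores, and the prompt are invented for illustration; real models do this over tens of thousands of tokens using learned weights.

```python
import math
import random

# Illustrative next-token step: a language model assigns a score (logit) to
# every token in its vocabulary, converts the scores to probabilities with a
# softmax, and samples the continuation. The vocabulary and scores below are
# invented for illustration only.
vocabulary = ["Paris", "Rome", "banana", "the"]
logits = [4.2, 2.1, -1.0, 0.5]  # hypothetical scores after "The capital of France is"

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
next_token = random.choices(vocabulary, weights=probs, k=1)[0]

print([(t, round(p, 3)) for t, p in zip(vocabulary, probs)], "->", next_token)
# The model outputs whichever token is statistically likely given its training
# data; there is no separate step that checks whether the answer is true.
```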
*Philip Di Salvo is a senior researcher and professor at the University of St. Gallen. His main research topics are the relationship between information and hacking, Internet surveillance and artificial intelligence. As a journalist, he writes for several newspapers.
(Source: RSI)
