For anyone accustomed to the chatbots that have gained popularity in recent years, one of their most helpful functions is certainly uploading and summarizing documents and texts, whether a short, simple file or an entire book.
Some, however, remain skeptical of this ability: do chatbots really understand what they are reading? The Washington Post decided to put them to the test.
The competition pitted the five most popular chatbots of the moment against one another: ChatGPT, Claude, Copilot, Meta AI and Gemini each read four very different types of text and then had their comprehension tested.
The readings spanned a novel, medical research, legal agreements and speeches by President Donald Trump. A panel of experts, including the original authors of the book and the scientific papers, judged the AIs.
In all, the five chatbots were asked 115 questions about the readings. Some of the answers were surprisingly satisfactory, but others contained misinformation.
All of the bots except one invented – or “hallucinated” – information, a persistent AI problem. Inventing facts was only part of the test: the AIs were also challenged to provide analysis, such as recommending improvements to contracts and identifying factual problems in Trump’s speeches.

Below, the performance of the chatbots in each topic, followed by the overall champion and the conclusions of the judges.
In literature, none were convincing
- Literature was the area where the AIs performed worst, and only Claude got all the facts right about the book analyzed, “The Jackal’s Mistress” by Chris Bohjalian.
- Gemini, for example, gave very short answers and was the one that most often committed what Bohjalian called inaccurate, misleading and sloppy reading.
- The best overall summary of the book came from ChatGPT, but even OpenAI’s model fell short: according to Bohjalian, its analysis discussed only three of the five main characters, ignoring the important role of the two formerly enslaved characters.
Reasonable performance when analyzing legal contracts
In the legal test, Sterling Miller, an experienced corporate lawyer, evaluated how well the chatbots understood two common legal contracts.
Meta AI and ChatGPT reduced complex parts of the contracts to one-line summaries, which Miller considered “useless”.
The AIs also missed significant nuances in these contracts: Meta AI skipped several sections entirely and left out crucial content, while ChatGPT failed to mention a fundamental clause in a contractor agreement.
Claude won overall, offering the most consistent answers and proving the most capable in the toughest challenge: suggesting changes to a lease agreement.
Miller praised one of Claude’s answers for capturing the nuances and laying things out exactly as he would have. He acknowledged that Anthropic’s AI came closest to replacing a lawyer, but stressed that none of the tools scored a 10 across the board.
Good performance in medicine
All of the AI tools scored better on the analysis of scientific research, which may be explained by the many scientific articles in the chatbots’ training data.
Researcher Eric Topol, the judge for this topic, gave Gemini a low score for its summary of a study on Parkinson’s disease: the answer contained no hallucinations, but it omitted important descriptions of the study and of why it mattered.
Claude, however, received the maximum score and won the category; Topol gave its summary of his own article on long covid a 10.
In politics, mixed results
Cat Zakrzewski, a White House reporter for the Washington Post, assessed whether the AIs could decipher President Donald Trump’s speeches.
While Copilot made factual mistakes in its answers, Meta AI produced more accurate analysis. The best on this topic, though, was ChatGPT, which even correctly cited which Democratic politicians would oppose what Trump proposed in the speeches.
Zakrzewski also praised how ChatGPT’s analysis “accurately checks Trump’s false allegations that he won the 2020 elections”.
The bots had more trouble conveying Trump’s tone. Copilot’s summary of one speech, for example, hallucinated no facts but failed to capture the explosive nature of the president’s rhetoric. “If you just read this summary, you may not believe that Trump made this speech,” Zakrzewski says.
Who won overall?
In the overall score across all subjects, Claude was named the best chatbot, and it was also the only AI that never hallucinated.
On a scale of 0 to 100, Claude earned 69.9, slightly ahead of ChatGPT’s 68.4. The gap to the other three chatbots was considerable: Gemini (49.7), Copilot (49.0) and Meta AI (45.0).
In the end, none of the chatbots scored above 70% overall, although some of Claude’s and ChatGPT’s results did impress the judges.
Beyond the hallucinations, the tests exposed a series of limitations, and a tool’s capability in one area did not necessarily translate to another: ChatGPT, for example, was the best in politics and literature, but it finished nearly last in law.
The inconsistency of these AIs is a reason to use them with caution, according to the judges. Chatbots can help in certain situations, but they do not replace professional help from lawyers and doctors, or even reading an important document yourself.

(Source: Olhar Digital)



