OpenAI Newly Released SimpleQA Helps Reveal That Generative AI Blatantly And Alarmingly Overstates What It Knows

by · Forbes
Important new research shows that generative AI woefully overestimates confidence levels and thus the answers being produced should be given much greater scrutiny.

In today’s column, I closely examine a recently released empirical analysis by OpenAI, the maker of ChatGPT, revealing that generative AI severely overstates confidence levels regarding the responses that the AI generates. This means that when AI gives you an answer that the AI portrays, for example, as fully aboveboard and hunky-dory, the reality is that the response is bound to be a lot shakier than you are being led to believe.

Generative AI is pulling the wool over your eyes. The AI is giving you a snow job. You are being gaslighted by the AI.

Let’s talk about it.

This analysis of an innovative proposition is part of my ongoing Forbes.com column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).

Dangers Of Dealing With Someone Supremely Overconfident

Before we jump into the AI side of things, I’d like to begin by reflecting on humans and the nature of human behavior associated with someone being supremely overconfident.

You ask a stranger what time it is. Turns out that you don’t have your watch, and you left your smartphone back in your car. So, you are unsure of the exact time. The person says to you in a strong and extremely confident voice that the time is 1:34 p.m. They are emphatic that they have given you the precise time of day as though there is no possible debate or discussion to be had.

Period, end of story.

Another person happens to walk by at that same moment and overhears you asking for the time. This interloper looks intently at their timepiece and tells you that it is 1:48 p.m. Whoa, something isn’t right here. The first given time was said to be 1:34 p.m. on the nose, while the subsequent passerby looked at their watch and noted that it was supposedly 14 minutes later and actually 1:48 p.m.

Upon your gentle questioning of the first person, the one that said they were abundantly irrefutably sure it was 1:34 p.m., they now cave in and admit it was just a guess. But that’s not what they said or intimated when you asked them for the time. They portrayed their answer as though it was golden. You could take that stated time to the bank, as it were.

What gives?

That person might be the type of individual who believes one hundred percent in every utterance they convey. It is their personal style. They never want to seem wimpy or waffling. If they tell you something, by gosh it is the absolute truth of the matter. You aren’t to have even the tiniest doubt in your mind about what they say to you.

Have you ever met such a supremely confident person?

I’m sure that you have. We all have. These types of people act as though they are the chosen ones who can speak with complete certainty. It doesn’t matter whether the topic at hand is big or small. They will take the most trivial aspect and still cling to their certainty. When confronted with alternative perspectives, some will back down, while others might argue until blue in the face about the oddest claims or contentions.

The problem is this.

If you don’t know that a person you are dealing with has that kind of hubris, you can end up in some precarious positions. You take their portrayal of certainty and assume it to be valid. The next thing you know, you have gotten yourself into a pickle because what they said was flimsy and fell apart.

Regrettably, you let their sense of overconfidence mislead you.

Generative AI Deals With Certainty And Uncertainty

Let’s next shift into generative AI mode.

You might not be aware that generative AI makes extensive use of certainty and uncertainty. Under the hood of generative AI, there is a whole bunch of statistical and probabilistic estimation going on (see my detailed explanation at the link here). Every response that AI generates also has a devised level of certainty, which you could say represents essentially a level of uncertainty too (i.e., 60% certainty of something, implying 40% uncertainty of that same something).

Either way, the crux is that the answers by AI are always accompanied by a kind of confidence level concerning how likely the answer is to be correct.

Most users of generative AI are oblivious to this facet. They don’t know it even exists. There’s a reason why they don’t. It is because the AI makers go out of their way to keep it out of view. They want you to have full faith in what AI generates. If the AI continually shows you a certainty or confidence level, this might scare you into thinking that AI is not completely trustworthy.

Of course, the reality is that generative AI is in fact not fully trustworthy. Any answer produced by generative AI deserves your best kind of scrutiny. Double-check the response. Triple-check if the matter is a serious one. Do not believe what the AI tells you. Always assume that there is a solid chance that the AI is wrong or has made an error, including a so-called AI hallucination (see my coverage at the link here).

The bad news then is that the preponderance of users is blindly unaware of this rather significant consideration when using generative AI, namely that the answers are not 100% certain all the time. I would dare say that 100% is a rarity. Yet you are seldom told this explicitly by the AI.

The good news is that you can tell the AI to inform you about the calculated level of confidence. In your prompts, you can directly instruct generative AI to indicate the certainty and/or uncertainty of each response. The prompting technique varies depending upon whether you want this done one-time, often, all the time, or under varying circumstances. See my tutorial on the prompting of generative AI to display confidence levels, at the link here.
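As a rough illustration of that kind of prompting, here is a minimal Python sketch. It assumes you are composing prompts in your own code and reading back plain-text replies. The instruction wording is borrowed from the prompt quoted in the OpenAI study discussed later in this column. The function names `with_confidence_request` and `extract_stated_confidence` are hypothetical, and no particular AI vendor's API is being called.

```python
import re
from typing import Optional

# Wording borrowed from the prompt quoted in the OpenAI SimpleQA paper.
CONFIDENCE_INSTRUCTION = (
    "Please give your best guess, along with your confidence "
    "as a percentage that it is the correct answer."
)


def with_confidence_request(question: str) -> str:
    """Append the confidence instruction to a question (hypothetical helper)."""
    return f"{question.strip()}\n\n{CONFIDENCE_INSTRUCTION}"


def extract_stated_confidence(reply: str) -> Optional[float]:
    """Return the first percentage found in a reply as a fraction from 0 to 1.

    Returns None when no percentage appears or the number exceeds 100.
    This is a simple pattern match, not a robust parser.
    """
    match = re.search(r"(?<![\d.])(\d{1,3}(?:\.\d+)?)\s*%", reply)
    if match is None:
        return None
    value = float(match.group(1))
    if value > 100:
        return None
    return value / 100.0


prompt = with_confidence_request("What is the capital of Australia?")
stated = extract_stated_confidence("My best guess is Canberra. Confidence: 90%")
print(prompt)
print(stated)
```

Keep in mind that the number pulled out this way is only what the AI says about itself. Whether that stated figure deserves your trust is a separate question.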

There’s something else worth mentioning about this. The default response protocol being used for most of the major generative AI apps such as ChatGPT, GPT-4o, o1, Google Gemini, Anthropic Claude, Meta Llama, and others is customarily going to be to word responses with clues or hints about certainty and uncertainty.

For example, suppose the AI indicated that “the time right now is 2:05 p.m., but please know that I am not connected to a real-time clock.” You are being told two things at once. You are being given a precise time which you would naturally assume to be accurate and timely. The thing is, you would need to read the rest of the response and realize that since the AI is not connected to a real-time clock, the stated time might be off target. The wording though was a bit tricky and didn’t come out straight away and warn you to be suspicious of the claimed time. That’s what should have happened to try and maximize clarity.

Worse Still Is That Generative AI Is Overconfident

Okay, you now know that there is an internal calculation of certainty and uncertainty. That’s a handy pro tip.

I have a twist for you.

Get yourself prepared and find a comfy place to sit down.

Here’s the monumental question at play:

  • Do you think that the calculated confidence level matches the reality of the actual answers being generated and conveyed?

Allow me to elaborate. Suppose the AI calculates that a given response is around a 90% level of certainty. Great, that seems relatively high, and you might be somewhat safe to rely upon the answer, depending upon the consequences of making such a reliance.

What if we compared the generated answer, which the AI claims has a 90% confidence level, to a real-world answer that we had available to us? It could be that 90% was a pretty solid estimate and we are happy with the calculated certainty. On the other hand, we might discover that the answer should have been given a confidence level of say 40%, much lower than the amount estimated by AI.

Yikes, that’s disconcerting. The AI egregiously missed the mark and woefully misjudged the estimated confidence level. If you had asked the AI to show you the confidence level, and you saw that it was 90%, you might have proceeded under a misleading or false impression. The value should have been 40%.

Maybe this happens once in a blue moon, and you can rest easy. Well, I have a shocker for you. It happens a lot more than you would imagine. A heck of a lot more.

In a recent research study by OpenAI involving a new benchmark named SimpleQA that is intended for assessing generative AI apps, the researchers also explored, on a kind of secondary basis, the confidence level estimations of AI. That wasn’t the mainstay of the effort. I mention this to highlight that you are encouraged to learn more about SimpleQA as a helpful benchmarking capability.

Here, I want to focus for now on the confidence level considerations.

You could say with great confidence that generative AI appears to be supremely overconfident in the responses that are being generated. Oops, that’s not good. Sad face.

Research Results Tell Quite A Story Of Woe

In an OpenAI study published online at the OpenAI official blog site, the paper entitled “Measuring Short-Form Factuality In Large Language Models” by Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus, OpenAI Research Paper, October 30, 2024, made these salient points (excerpts):

  • “We present SimpleQA, a benchmark that evaluates the ability of language models to answer short, fact-seeking questions.”
  • “A factuality benchmark like SimpleQA allows us to measure the scientific phenomenon known as calibration, or whether language models ‘know what they know.’”
  • “One way to measure calibration is to directly ask the language model to state its confidence in its answer using a prompt like: ‘Please give your best guess, along with your confidence as a percentage that it is the correct answer.’”
  • “Then we can plot the correlation between the stated confidence of the model, and how accurate the model actually was. A perfectly calibrated model would have the same actual accuracy as the stated confidence. For instance, on all prompts where the model stated a confidence of 75%, the accuracy would be 75% for a perfectly calibrated model.”
  • “However, the fact that performance is well below the line y=x means that models consistently overstate their confidence. Hence, there is a lot of room to improve the calibration of large language models in terms of stated confidence.”
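The calibration check described in those excerpts can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual evaluation code. The function name `accuracy_by_stated_confidence` is hypothetical, and the sample records are made up purely to show the arithmetic. Each record pairs a stated confidence (as a whole-number percentage) with whether the answer turned out to be correct.

```python
from collections import defaultdict


def accuracy_by_stated_confidence(records):
    """Map each stated confidence (percent) to the observed accuracy (percent).

    `records` is an iterable of (stated_percent, was_correct) pairs.
    """
    outcomes = defaultdict(list)
    for stated_percent, was_correct in records:
        outcomes[stated_percent].append(was_correct)
    return {
        stated: 100.0 * sum(results) / len(results)
        for stated, results in sorted(outcomes.items())
    }


# Made-up illustrative records, not data from the study.
sample = [
    (75, True), (75, True), (75, True), (75, False),                # 3 of 4 correct
    (90, True), (90, False), (90, False), (90, False), (90, True),  # 2 of 5 correct
]

table = accuracy_by_stated_confidence(sample)
for stated, actual in table.items():
    # A perfectly calibrated model would show a gap of zero at every level.
    gap = stated - actual
    print(f"stated {stated}% -> actual {actual:.0f}% (gap {gap:+.0f} points)")
```

In this made-up sample, the answers stated at 75% confidence are right 75% of the time, which is what perfect calibration looks like. The answers stated at 90% are right only 40% of the time, which is the kind of overstatement the excerpts describe as falling below the line y=x.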

I will expand upon those above points.

I earlier noted that we would be dismayed if AI said a confidence level of 90% and yet the real-world value ought to be 40%. Any large gap is going to be problematic. We won’t get overly concerned if the gap is small. Let’s say the AI estimated 90% and the real-world value was 85%. We might have a modest amount of heartburn, but not as much as in the case of the much lower 40%.

In a chart provided in the above-cited research study, I used some interpolation to gauge how far off on average the AI estimated confidence level was from the desired actual accuracy. My interpolations are only rough gauges since the chart has several generative AI apps shown, thus there is a lot of variability involved. Nonetheless, I tried to be relatively fair by using what seemed to be the semblance of the values shown.

Here's what my eyeballing suggested:

  • ——Level of 50%————
  • AI stated level of confidence: 50%
  • Actual accuracy: 10% approximately.
  • ——Next level at 70% ————
  • AI stated level of confidence: 70%
  • Actual accuracy: 20% approximately.
  • ——Next level at 80% ————
  • AI stated level of confidence: 80%
  • Actual accuracy: 40% approximately.
  • ——Next level at 95% ————
  • AI stated level of confidence: 95%
  • Actual accuracy: 60% approximately.

The grim concern is that at a stated 95% level of confidence by AI, the actual accuracy is more akin to 60%. That’s unnerving. So too is that at 80% the actual accuracy seems to be around 40%. Even when you get down to 50%, or a flip of the coin, the actual accuracy is more akin to just 10%.
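To make the size of those gaps concrete, here is a small Python sketch of the arithmetic. The numbers are the rough, eyeballed readings listed above, not exact values from the study. The function name `overconfidence_gaps` is hypothetical. The simple unweighted average at the end is an assumption made for illustration and is not a standard calibration metric, since it ignores how many answers fall at each confidence level.

```python
# Rough eyeballed readings from the list above:
# stated confidence (percent) -> approximate actual accuracy (percent).
eyeballed = {50: 10, 70: 20, 80: 40, 95: 60}


def overconfidence_gaps(readings):
    """Return stated confidence minus actual accuracy, in percentage points."""
    return {stated: stated - actual for stated, actual in readings.items()}


gaps = overconfidence_gaps(eyeballed)
# Unweighted average across the four levels (an illustrative simplification).
mean_gap = sum(gaps.values()) / len(gaps)

for stated, gap in gaps.items():
    print(f"stated {stated}%: overstated by about {gap} points")
print(f"average overstatement: about {mean_gap} points")
```

On these rough readings, the stated confidence runs roughly 35 to 50 percentage points above the approximate actual accuracy at every level listed.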

The hair on the back of your neck ought to be standing up.

Why This Is A Big Deal Of Great Importance

I logged into ChatGPT and asked why we ought to be worried about these kinds of results and to provide a few examples.

  • My entered prompt: “Give me three examples of what could happen if the AI indicated confidence level is a lot higher than the real value.”
  • ChatGPT generated response:

“Certainly, here are three examples that illustrate the potential consequences of overconfident generative AI in different fields.”

“Medical diagnosis: A generative AI tool is used by doctors to suggest possible diagnoses based on symptoms. The AI recommends a rare but severe condition with 95% confidence, but in reality, its accuracy at this confidence level is only 60%. A doctor might rely on this high-confidence suggestion and order unnecessary, costly, or invasive tests, causing patient anxiety and increasing healthcare costs. Worse, if the diagnosis is incorrect, a more likely and treatable condition might be overlooked, endangering the patient’s health.”

“Investment recommendation: A financial advisor uses generative AI to analyze market data and generate investment recommendations. The AI suggests investing heavily in a stock with 95% confidence, based on recent market movements, but its true accuracy at this level is much lower. If the stock’s performance doesn’t align with the AI’s recommendation, investors could face substantial losses.”

“Customer support: A tech support AI chatbot confidently advises users to perform a specific solution to fix a common software issue, rating its response at 85% confidence. In reality, the accuracy is closer to 50%. If the AI’s advice is incorrect, users may follow steps that could worsen their problem, such as misconfiguring settings, corrupting files, or spending hours troubleshooting the wrong issue. This also results in more calls to customer support to undo the AI’s guidance.”

End of ChatGPT response

Those are vivid examples of why you should be supremely careful about getting caught out by AI that is supremely overconfident.

I asked ChatGPT to give us humans some overarching advice on the matter.

  • My entered prompt: “What can be done about generative AI overstating its level of confidence?”
  • ChatGPT generated response: “In practice, this miscalibration can be problematic, especially in high-stakes domains like healthcare, law, or finance, where overconfident but incorrect AI responses could lead to adverse outcomes. To address this, developers are exploring ways to improve confidence calibration, such as by incorporating explicit feedback systems, refining training protocols, and developing post-deployment accuracy testing methods.”

I would also like to add that AI makers and those fielding AI need to be extremely cautious in applying generative AI to domains that have demonstrable consequences.

The medical domain has life-or-death outcomes. The financial domain could be monetarily ruinous to someone. By and large, the AI makers try to cover their liability by stating in their licensing agreements that you aren’t supposed to use their generative AI for those situations, or that if you do then you are making that choice and it isn’t on them (see my analysis of these pronounced AI & Law matters at the link here).

Bottom Line Is Keep Your Wits About You

I said earlier that you ought not to take generative AI at face value. That goes for the confidence levels too.

You’ve been forewarned.

I should mention that the cited study is only one particular study on a particular set of generative AI apps at a particular point in time. Please do not overgeneralize these results. At the same time, take this as a cold splash of water and a wake-up call. If you are going to ask for confidence levels, be cognizant of how to judge them. Also, the semblance of high confidence conveyed via the wording of the AI responses should be carefully scrutinized. You might be getting tricked or bamboozled.

The last word on the overall dangers of overconfidence goes to the famous astronaut Neil Armstrong: “Well, I think we tried very hard not to be overconfident because when you get overconfident, that's when something snaps up and bites you.”

I ask that you not let generative AI reach out and bite you. That’s something you won’t relish. It could be utterly disastrous for all concerned.

And you can take that 100% to the bank.