One of the selling points of Google’s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can process and analyze. Google has repeatedly claimed in press briefings and demos that the models can perform previously impossible tasks thanks to their “long context,” such as summarizing hundreds of pages of documents or searching for scenes in film footage.
But new research suggests that the models aren't actually very good at those things.
Two separate studies examined how well Google's Gemini models, among others, make sense of enormous amounts of data — think works the length of "War and Peace." Both found that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large data sets correctly; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time.
“While models like Gemini 1.5 Pro are technically capable of handling long contexts, there were many instances where the model did not really ‘understand’ the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author of one of the studies, told TechCrunch.
Gemini's context window is lacking.
The context, or context window, of a model refers to the input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question like “Who won the 2020 US presidential election?” can serve as context, as can a movie script, a show, or an audio clip. And the larger the context window, the larger the document that can fit inside it.
The latest versions of Gemini can take in over 2 million tokens as context. (“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas,” and “tic” in the word “fantastic.”) That’s roughly 1.4 million words, two hours of video, or 22 hours of audio. It’s the largest context of any commercially available model.
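The basic idea of tokens and context windows can be sketched in a few lines of Python. The tokenizer below is purely illustrative; production models use learned subword vocabularies (e.g., SentencePiece-style tokenizers), not fixed-width chunks:

```python
def toy_tokenize(text: str) -> list[str]:
    """Naive tokenizer: split each word into 3-character chunks,
    loosely mimicking how "fantastic" becomes "fan", "tas", "tic".
    Real models use learned subword vocabularies instead."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def fits_in_context(text: str, context_window: int) -> bool:
    """True if the tokenized text fits within the model's window."""
    return len(toy_tokenize(text)) <= context_window

print(toy_tokenize("fantastic"))                # ['fan', 'tas', 'tic']
print(fits_in_context("fantastic", 2_000_000))  # True
```

A real 2-million-token window works the same way in principle: the input is tokenized, and anything beyond the token budget simply cannot be considered by the model.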
At a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast (about 402 pages) for a jokey quote, then find a scene in the telecast that resembled a pencil sketch.
Oriol Vinyals, vice president of research at Google DeepMind, who led the briefing, described the model as “magical.”
“[1.5 Pro] does this kind of inference work for every word, every page,” he said.
That might have been an exaggeration.
In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers at the Allen Institute for AI and Princeton, asked the models to evaluate true/false statements about novels written in English. The researchers chose recent works so the models couldn’t “cheat” by relying on foreknowledge, and they peppered the statements with references to specific details and plot points that would be incomprehensible without reading the books in their entirety.
Given a statement like “Using her skills as Apoth, Nusis can reverse engineer the type of portal opened by the reagent key found in Rona’s wooden chest,” Gemini 1.5 Pro and 1.5 Flash, having ingested the relevant book, had to say whether the statement was true or false and explain their reasoning.

When tested on one book roughly 260,000 words (about 520 pages) long, the researchers found that 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. In other words, a coin flip would answer questions about the book significantly more accurately than Google’s latest machine learning model. Averaging across all the benchmark results, neither model managed better than random chance in terms of question-answering accuracy.
“We noticed that the model had a harder time verifying claims that required consideration of larger sections of a book or the entire book, compared to claims that could be resolved by searching for sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the model had a harder time verifying claims about implicit information that was clear to human readers but not explicitly stated in the text.”
The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” video, that is, to search through and answer questions about its content.
The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the model, they built slideshow-like videos by selecting one of the images at random and inserting “distractor” images before and after it.
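The general setup the co-authors describe might look like the following sketch. This is a hypothetical reconstruction of the idea, not the study’s actual code, and all file names and parameters are illustrative:

```python
import random

def build_slideshow(target_image, distractor_pool, total_frames, seed=None):
    """Place one target image at a random position among randomly
    chosen "distractor" images, producing a slideshow of frames."""
    rng = random.Random(seed)
    frames = rng.sample(distractor_pool, total_frames - 1)
    position = rng.randrange(total_frames)
    frames.insert(position, target_image)
    return frames, position

# Illustrative pool of distractor frames.
pool = [f"distractor_{i}.jpg" for i in range(100)]
frames, pos = build_slideshow("birthday_cake.jpg", pool, total_frames=25)
# The model would see all 25 frames, then be asked a question tied to
# the target, e.g., "What cartoon character is on this cake?"
```

The point of the distractors is that the model can’t succeed by summarizing everything; it has to locate the one relevant frame and answer about it specifically.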
Flash didn't do so well. In tests where the model had to transcribe six handwritten digits from a “slideshow” of 25 images, Flash got the transcriptions right about 50 percent of the time. With eight digits, accuracy dropped to about 30 percent.
“On the real question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and one of the study’s co-authors, told TechCrunch. “That small amount of reasoning, recognizing that a number is in a frame and reading it, might be what is breaking the model.”
Google is overpromising with Gemini.
Neither study has been peer-reviewed, nor did either probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro performance-wise; Google advertises it as a low-cost alternative.
Even so, both studies add fuel to concerns that Google has been overpromising with Gemini, and under-delivering, from the beginning. None of the models the researchers tested, including OpenAI’s GPT-4o and Anthropic’s Claude 3.5 Sonnet, performed well. But Google is the only model provider that has given the context window top billing in its advertising.
“There’s nothing wrong with simply saying, ‘Our model can use X number of tokens,’ based on objective technical details,” Saxon said. “But the question is, can you do something useful with it?”
Broadly speaking, there is growing scrutiny of generative AI as businesses (and investors) become frustrated with the technology’s limitations.
In two recent surveys from Boston Consulting Group, about half of respondents (all C-suite executives) said they don’t expect generative AI to deliver substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that, for the second consecutive quarter, early-stage generative AI dealmaking has declined, plummeting 76% from its peak in the third quarter of 2023.
Faced with meeting-summarizing chatbots that conjure up fictitious details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up to its generative AI rivals, has been desperate to make Gemini’s context one of those differentiators.
But the bet appears to have been premature.
“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without knowledge of how long context processing is implemented, which companies do not share, it is difficult to say how realistic these claims are.”
Google did not respond to a request for comment.
Saxon and Karpinska both believe the antidote to hyped-up claims around generative AI is better benchmarks and, along the same lines, greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context (which Google cites liberally in its marketing materials), the “needle in the haystack,” only measures a model’s ability to retrieve particular pieces of information, such as names and numbers, from a data set, not its ability to answer complex questions about that information.
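The needle-in-the-haystack test Saxon refers to is, at its core, simple retrieval. A minimal sketch makes his criticism concrete (illustrative only; real harnesses vary the needle’s depth and the context length, send the haystack to an actual model, and score its reply):

```python
def build_haystack(needle: str, filler: str, n_chunks: int, depth: float) -> str:
    """Bury one known fact (the needle) at a relative depth
    (0.0 = start, 1.0 = end) inside repeated filler text."""
    chunks = [filler] * n_chunks
    chunks.insert(int(depth * n_chunks), needle)
    return " ".join(chunks)

def scores_as_retrieved(answer: str, expected: str) -> bool:
    # Simple substring scoring: this checks recall of a planted fact,
    # not the ability to reason over the surrounding text, which is
    # exactly the limitation Saxon describes.
    return expected in answer

haystack = build_haystack(
    needle="The secret number is 7481.",
    filler="The grass is green and the sky is blue.",
    n_chunks=1000,
    depth=0.5,
)
# The haystack would be sent to a model with a prompt like
# "What is the secret number?" and the reply checked for "7481".
```

Passing this test says nothing about whether a model can synthesize or reason over the full context, only that it can fish one planted string back out.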
“Every scientist and most engineers who use these models essentially agree that the existing benchmark culture is broken,” Saxon said. “Take it with a grain of salt.”