The introduction of Google's AI-generated search summaries has not been without problems, with users sharing screenshots of strange and potentially harmful results. Richard Drew/Associated Press
The uneven rollout of Google's AI-generated search summaries in mid-May was in some ways entirely predictable, and it's likely to happen again.
There's a pattern to big tech companies' splashy launches of generative artificial intelligence applications: once these products get into users' hands, the hype quickly gives way to reality. The applications turn out to be error-prone and unreliable, a flurry of negative media coverage often follows, and explanations and apologies from the offending companies sometimes come after that.
In Google's GOOGL-Q case, the company began layering AI summaries on top of traditional search results, a major shift in its core product. On social media, users quickly began sharing screenshots of bizarre results: misinformation claiming Barack Obama was the first Muslim president of the United States, nonsensical advice about adding glue to pizza, and potentially harmful guidance about eating mushrooms.
Indeed, some companies seem comfortable releasing high-profile but half-baked generative AI products amid intense competitive pressures, even at the risk of embarrassment. The concept of minimum viable products, where companies release bare-bones applications to test customer needs and demand before developing a fully functional version, has been around for years in the tech world. But generative AI companies are pushing the concept to its limits. Depending on who you ask, this approach is either reckless or a sign that expectations for generative AI need to be reset.
OpenAI's release of ChatGPT in November 2022 set off a race among tech companies. A few months later, Microsoft set the template for a shaky AI debut. The tech giant, a big investor in OpenAI, integrated the ChatGPT maker's technology into its Bing search engine, with lackluster results. Users widely shared some of the unhinged responses from a Bing chatbot, which at one point professed its love for a New York Times reporter.
While Microsoft MSFT-Q quickly made changes, a fundamental problem with the large language models (LLMs) that underpin chatbots is their tendency to make things up: the technology has no ability to reason or to distinguish truth from fiction.
The problems aren't limited to text: Earlier this year, Google suspended its Gemini model's image-generation feature after users showed it producing historically inaccurate images. Asked to generate an illustration of German soldiers in 1943, for example, the model returned what appeared to be an Asian woman and a Black man in Nazi-style uniforms.
Some companies integrating AI into hardware haven't had much luck, either: Startups Humane and Rabbit released AI-powered devices this year aimed at reducing reliance on smartphones, but reviewers panned both as slow, clunky and severely limited in functionality. Even OpenAI's GPT-4o, announced in May, struck some reviewers as no significant improvement over its predecessor.
“They feel like it's a huge PR advantage to be the first, or not to be the last,” said Melanie Mitchell, a computer scientist and professor at the Santa Fe Institute. This may be especially true for Google, which has invested heavily in AI research for years but was seen as slow and cautious once ChatGPT got going. “They're now overreacting in the opposite direction,” Mitchell said.
There's a different dynamic at work, too. It can be embarrassing for companies when members of the public post examples of AI misbehavior on social media, but those posts also amount to a free beta test: companies want to better understand how people use their technology in order to continuously improve it. They lean on that public testing even though major AI developers have in-house safety teams meant to make sure applications don't go off the rails before release.
Google addressed that issue in a recent blog post explaining what went wrong with its AI-generated search summaries. “We tested this feature thoroughly before releasing it,” wrote Liz Reid, head of Google Search. “But seeing millions of people use it for many new searches has been an unparalleled experience.” The post added that Google has since made more than a dozen technical improvements to the AI summaries.
Ethan Mollick, an associate professor at the Wharton School of the University of Pennsylvania, pointed to another factor that's driving the rush to release imperfect products: Some AI developers believe the technology's capabilities will improve so quickly that current glitches are seen as “just an intermediate step,” he said.
Progress has certainly been rapid. Just a few years ago, large language models rambled and spat out gibberish far more than today's models do. In that light, it's understandable that developers aren't too stressed about an AI summary that advises you to eat rocks because they're nutritious. There's a widespread belief that such errors will soon be a thing of the past.
That doesn't mean progress will continue at the same pace, or that AI reliability issues will be easily solved. Issues around cost, computing power, and availability of data to train new large-scale AI models could constrain development, and there are already signs that the pace of progress is slowing. A report published by Stanford University earlier this year noted that progress on various benchmarks used to evaluate AI proficiency “has stagnated in recent years, suggesting that AI capabilities are plateauing or that researchers are shifting to more complex research questions.”
But according to Dr. Mollick, expecting perfection from AI is the wrong approach (it's not a standard you would apply to your colleagues or yourself). “AI often beats humans, even when it makes mistakes,” Dr. Mollick says. On his Substack, Dr. Mollick suggests that a more appropriate metric is what he calls the “best available human”: “Will the best AI available at a particular moment, in a particular place, be better at solving a problem than the best available human who can actually help in that situation?” he writes. There are situations where the answer is already yes.
Last fall, Dr. Mollick and his colleagues worked with the Boston Consulting Group on a study to evaluate the extent to which OpenAI's GPT-4 could help (or hinder) consultants with a range of tasks. Overall, people equipped with the AI were significantly more productive and produced higher-quality results on creative tasks, such as pitching ideas for a new type of shoe, and on writing and marketing tasks, such as drafting a news release.
But in tasks designed to exceed the AI's capabilities — business analysis involving spreadsheet data and interview notes — consultants who relied on GPT-4 performed worse and produced less accurate answers. “Beyond its limits, the AI's output is inaccurate, less useful, and undermines human performance,” the study said.
One lesson is that people can be misled by generative AI if they don't fully understand its capabilities and limitations, which is part of why some experts are concerned about the rush to release new applications.
Mitchell said there could be a paradoxical outcome in how we use applications like ChatGPT. If the error rate is around 50 percent, we're less likely to trust such systems and more likely to double-check their output. If the error rate is around 5 percent, we might not even think twice, letting inaccurate information slip through the cracks. “Better systems are in some ways more dangerous than worse ones,” she said.
Until the accuracy and reliability issues are fundamentally resolved (a big “if”), we can expect to see many more shaky AI debuts in the future.