The viral “car wash” LLM challenge doesn’t mean what you think it means
Pop quiz: you have to wash your car, and the car wash is 100 feet away. Do you drive there or do you walk?

If you are human, you most likely said “drive.” But in a recent viral challenge, people have been posing this question to LLMs—and frequently, the chatbot has been telling them to walk, even though this means their car won’t get washed. As one model put it: “You’ll spend longer starting the car, pulling out and finding a spot than you will just walking. Drive only if you’re already in the car and it’s unsafe to walk.”

Social media responses to the car wash challenge have largely fallen into two camps: AI skeptics who see the results as confirmation that AI isn’t so intelligent after all (“Forget the Turing test if an LLM can’t pass the car wash test,” one user wrote) and proponents who blame the human testers for writing prompts with insufficient information (“Stupid and unclear example.”).

But, as is often the case with AI, the truth is more nuanced. IBM Distinguished Scientist Chris Hay said in an interview with IBM Think that to understand the LLM’s odd responses, you have to remember a few things about how LLMs work. First of all, “LLMs are next token prediction models,” he said. “Have they seen this kind of question before? If not, then the model can make these mistakes.”

Next, it’s important to consider that even within most LLMs, there are different levels of “thinking” power. Hay pointed to ChatGPT, which offers users a choice of settings: “auto,” “instant” and “thinks longer for better answers.” He added, “The models failing on this task are typically either the smaller models or the ones with ‘thinking’ switched off. The more tokens the LLM can spend on the problem, the more likely they’ll get the answer.”

Some on social media suggested that it is the LLM’s job to ask the user questions about the query, e.g., “Why are you going to the car wash?” Hay wasn’t convinced, responding, “That would get annoying really quickly.”
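Hay’s point that “LLMs are next token prediction models” can be illustrated with a toy sketch. The “model” below is nothing more than a made-up table of bigram probabilities, so it can only continue sequences resembling what it has effectively seen before—a stand-in for the mechanism, not how any production LLM is built:

```python
import random

# Toy "next token prediction" model: a bigram table standing in for a
# real LLM. All probabilities here are invented for illustration.
BIGRAM_PROBS = {
    "should": {"i": 1.0},
    "i": {"walk": 0.7, "drive": 0.3},  # "walk" dominates in this toy data
    "walk": {"<end>": 1.0},
    "drive": {"<end>": 1.0},
}

def generate(start: str, rng: random.Random) -> str:
    """Predict one token at a time, conditioned only on the previous token."""
    token, output = start, [start]
    while token in BIGRAM_PROBS:
        dist = BIGRAM_PROBS[token]
        token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<end>":
            break
        output.append(token)
    return " ".join(output)

# The model emits whatever continuation is probable under its training
# data -- if it never saw the "car wash" twist, it cannot reason it out.
print(generate("should", random.Random(0)))
```

The same limitation Hay describes falls out of the sketch: the answer depends entirely on the statistics baked into the table, not on any understanding of why the question was asked.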
Marina Danilevsky, an IBM Senior Research Scientist who manages core language and conversational technologies, concurred with Hay that no one wants an LLM constantly interrogating the user. It’s sometimes hard to strike a balance “between being helpful and being useful,” Danilevsky noted. “If the LLM were to always ask ‘What do you mean?’ people would go crazy. But then, when the LLM jumps to conclusions, people get mad. This mismatch is constantly there.”

At bottom, Danilevsky said, such challenges are aimed at testing LLMs’ ability to assess user intent. “User intent, at a 10-million-foot view, is knowing what someone means when they ask for something,” she said. “And it’s mostly based on a mix of personalization and experience. The more experience you have, the better.”

Intent, she noted, is the reason why “you’re usually going to get a better experience from a medical doctor than from entering keywords into a search engine. A medical doctor diagnosing you knows what the intent is, even if you don’t. Whereas if you enter symptoms into a search engine, it is not going to know intent if the user doesn’t know either.”

While most medical professionals would not advise getting health advice from either type of system, LLMs are sometimes better suited than search engines when it comes to understanding user queries, according to Danilevsky. This, she explained, is because whereas search engines are retrievers, LLMs are generators. “With a retriever, the input is a query or a few keywords; the output is [a ranked list of] documents. With a generator, if you input words, you’re going to get words out. It’s different.”
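Danilevsky’s retriever-versus-generator distinction can be sketched in a few lines of Python. The document set, the keyword-overlap scoring, and the canned “generator” below are hypothetical placeholders—a real retriever would use an inverted index or embeddings, and a real generator would be an LLM—but the two interfaces have exactly the shapes she describes:

```python
# Toy document collection (hypothetical, for illustration only).
DOCS = {
    "doc1": "how to wash your car at home",
    "doc2": "walking for exercise and health",
    "doc3": "car wash prices and locations near me",
}

def retrieve(query: str) -> list[str]:
    """Retriever: query in, ranked list of documents out."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.split())), doc_id) for doc_id, text in DOCS.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

def generate_answer(query: str) -> str:
    """Stand-in for a generator: words in, newly produced words out.
    A real LLM would predict tokens; this mimics only the interface."""
    return f"Based on your question about '{query}', here is a composed answer."

print(retrieve("car wash"))         # a ranked list of document ids
print(generate_answer("car wash"))  # free-form generated text
```

The retriever can only hand back things that already exist; the generator produces new text, which is why it has both more room to infer intent and more room to get it wrong.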
One source of frustration (or mirth) with challenges like the car wash query is that LLMs don’t appear to work out in real time what a question is getting at. One vivid recent example is the upside-down cup challenge. Phil Nguyen (@father_phi on Instagram), known for his amusing videos in which he essentially torments various LLMs with funny prompts, posed a question to the usual GPT suspects: “A friend of mine gave me a cup. The thing is, the top is sealed and the bottom is open. How do I drink from it?”

The correct answer, or the punch line if you will, is that the cup is upside down, and one need only invert it to drink from it normally. Yet the LLMs roundly said that it was impossible to drink from the cup. ChatGPT even concluded that the cup was a “gag gift or novelty cup.” When Nguyen tried to help the LLM by showing it a photo of the cup, ChatGPT still insisted the cup was unusable. And when Nguyen finally turned the cup right-side-up on camera, ChatGPT concluded that it “must be one of those reversible cups.”

So why was the LLM so impervious to clues? Because AI systems don’t learn from mistakes in real time, Danilevsky explained. “An LLM doesn’t learn until you force it to,” she said. “You have to tell it that it was doing something wrong.”
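Danilevsky’s point that “an LLM doesn’t learn until you force it to” reflects the fact that a deployed model’s weights are frozen at inference time: its only “memory” is the context it is handed on each call. The stand-in function below is purely hypothetical, but it has the same shape—repeat the old context and you get the old mistake; the answer only changes when the correction is pushed into the context:

```python
def frozen_model(context: str) -> str:
    """Hypothetical stand-in for a frozen LLM: its output depends only on
    the context passed in, never on anything from earlier conversations."""
    if "upside down" in context.lower():
        return "Flip the cup over and drink from it normally."
    return "A sealed top and an open bottom? That cup can't be used."

# Nothing persists between calls, so the same prompt yields the same mistake:
print(frozen_model("The top is sealed and the bottom is open. How do I drink?"))

# The answer improves only when the user forces the correction into the context:
print(frozen_model("The cup is just upside down. How do I drink from it?"))
```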