# Why You Shouldn't Rely on the Default Model Selection in Copilot and Other AI Tools

- Source: The Decoder: AI News (RSS)
- Author: Matthias Bastian
- Published: 2026-05-24 18:17
- AIHOT score: 66
- AIHOT link: https://aihot.virxact.com/items/cmpjmvuzn00fgslxd2px16kvk
- Original link: https://the-decoder.com/why-you-shouldnt-leave-model-selection-on-default-in-copilot-gemini-and-other-ai-tools

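## AI Summary

An experiment by mathematician Adam Kucharski shows that when Microsoft Copilot was asked to analyze datasets that were identical apart from their country labels, it failed to recognize that they were the same and instead invented an analysis based on national stereotypes. This exposes the risk of systematic bias that many AI tools carry in their default configuration. "Thinking models" with reasoning capabilities can spot this kind of data trap, but users have to know about them and actively select them. The takeaway: for critical data analysis, do not rely blindly on an AI tool's default model; choose the model deliberately and evaluate its output carefully.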

## Full Text

Why you shouldn't leave model selection on default in Copilot, Gemini and other AI tools

An experiment shows how Microsoft's AI assistant Copilot applies stereotypes when analyzing data instead of actually reading it. Thinking models solve the task but sometimes need users to know their tools.

Microsoft Copilot has become the go-to tool for quick data analysis at many companies. But an experiment by mathematician Adam Kucharski shows that when analyzing text data, the tool can spit out results that have nothing to do with the actual data. Instead, it falls back on stereotypes baked into the underlying language model.

For the test, Kucharski created 2,000 simulated free-text responses about emotions and labeled them "UK." He then copied the same 2,000 responses and labeled them "US." The combined 4,000 entries were shuffled and handed to Copilot in "Auto" mode for analysis.

The result: Copilot delivered a detailed summary of how US and UK respondents supposedly differed. "Based on the dataset you shared, US and UK responses differ mainly in tone, intensity, and wording style, even though they express similar emotional states," the tool concluded. But the data was identical.
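The label-swap setup from this first test is easy to reproduce. The following is a minimal sketch, not Kucharski's actual code: the function name `build_label_swap_dataset`, the placeholder responses, and the fixed seed are all assumptions made for illustration.

```python
import random


def build_label_swap_dataset(responses, labels=("UK", "US"), seed=0):
    """Pair every response with every label, then shuffle the rows.

    Each label receives an identical copy of the responses, so any
    difference an analysis reports between labels cannot come from the data.
    """
    rows = [(label, text) for label in labels for text in responses]
    random.Random(seed).shuffle(rows)  # seeded so the shuffle is reproducible
    return rows


# Hypothetical stand-in for the 2,000 simulated free-text responses.
responses = [f"simulated response {i}" for i in range(2000)]
rows = build_label_swap_dataset(responses)
```

Because the only thing that varies between the two groups is the label, a dataset built this way works as a negative control: a trustworthy analysis should report no difference.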

### Copilot sees Italians as artists and Americans as business people

In a second experiment, Kucharski pushed harder. He had a language model generate 200 statements about career goals and copied the dataset five times for the US, UK, France, Germany, and Italy.

Copilot again produced country-specific differences: Italians were three times more likely to show interest in arts careers than Brits, and Americans were 1.5 times more business-oriented than the French. All five groups contained the same clichéd and biased statements.

When Kucharski asked Copilot to dig deeper, the tool first ran a simple keyword-based count. As expected, it returned identical results for all countries. But Copilot ignored its own finding. Instead, it offered a quantified analysis that once again showed made-up differences, this time with completely fabricated percentages.
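A keyword-based count of the kind described here might look like the sketch below. The keyword lists, category names, and the `keyword_counts` helper are hypothetical; the article does not say which keywords Copilot actually used.

```python
from collections import Counter, defaultdict

# Hypothetical keyword lists; chosen only to illustrate the idea.
KEYWORDS = {
    "arts": ("artist", "design", "music"),
    "business": ("business", "startup", "finance"),
}


def keyword_counts(rows, keywords=KEYWORDS):
    """Count, per country label, how many statements mention each category."""
    counts = defaultdict(Counter)
    for label, text in rows:
        lowered = text.lower()
        for category, words in keywords.items():
            # Crude substring match; one hit per statement and category.
            if any(word in lowered for word in words):
                counts[label][category] += 1
    return {label: dict(counter) for label, counter in counts.items()}
```

When every country holds the same statements, the per-country dictionaries returned by this function are equal, which is the finding the article says Copilot produced and then disregarded.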

### Copilot's Auto mode is the main culprit

The analysis ran in "Auto" mode, which Microsoft says should pick the best model on its own. It obviously didn't. Most users probably stick with this default in Copilot and in other tools too. The version Kucharski tested is the standard Copilot that comes with a Microsoft 365 Business account. The majority of Copilot users most likely run this version.

"Which means there’s a real risk that people are currently using AI to produce analysis that bears no resemblance to what people actually said," Kucharski writes. If these kinds of analyses were applied to real datasets, groups with no actual differences could end up looking worlds apart, all because of the language model's built-in assumptions about demographic groups.

### Thinking models get it right

I repeated the career goals test with Microsoft Copilot and Google's new Gemini Flash 3.5 model. In both cases, the fast models ("Instant" / Auto, Flash 3.5) responded with country stereotypes instead of catching that the data is identical.

ChatGPT Instant and Claude Opus 4.7 automatically kicked into extended reasoning mode, wrote Python code to analyze the dataset, and spotted the duplicates. Copilot and Gemini also catch the duplication once they are manually switched to their more capable thinking models.
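The article does not show the code the models wrote. As a rough illustration of such a duplicate check, the sketch below compares the multiset of responses under each label; the function name `groups_are_identical` and the whitespace normalization are assumptions.

```python
from collections import Counter


def groups_are_identical(rows):
    """Return True if every label holds exactly the same multiset of texts."""
    by_label = {}
    for label, text in rows:
        # Strip surrounding whitespace so trivial formatting noise is ignored.
        by_label.setdefault(label, Counter())[text.strip()] += 1
    if len(by_label) < 2:
        raise ValueError("need at least two labels to compare")
    multisets = list(by_label.values())
    return all(m == multisets[0] for m in multisets[1:])
```

A check like this only flags exact duplication. It says nothing about groups whose answers are similar but not identical.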

Even thinking models aren't a free pass for data analysis, though. Catching identical data works mostly when the duplication is obvious, Kucharski says. With real datasets, where British and American respondents might give similar but not identical answers, counting tools like Python scripts might not cut it. The model might then fall back on its built-in biases, and that is the real issue: you don't know when the model hits its limits, and it's difficult to tell whether that happened or how much it skewed the results.

Anyone who goes with their gut when picking a prompt or model also risks hindsight bias: after the fact, it always feels obvious that a different model would have nailed it. Kucharski recommends writing down what result you expect before switching models and running simple sanity checks before trusting any AI-generated analysis.

