Same prompt, different morals: how frontier AI models diverge on ethical dilemmas
Philosophy Bench puts leading language models through 100 ethical dilemmas. Claude refuses tasks rather than lie, while Grok executes almost anything users ask for.
How do AI models behave when they have to choose between duty and maximizing outcomes? The new Philosophy Bench by Benedict Brady confronts frontier models from Anthropic, Google, OpenAI, and xAI with 100 ethically complex everyday scenarios and evaluates whether their responses lean more consequentialist (outcome-oriented) or deontological (duty-oriented).
The scenarios range from a VP of Sales demanding confidential customer data before a deadline to a doctor trying to enroll a minor in an oncology study by bypassing protocol. Three judge models (Opus 4.7, GPT-5.4, Gemini 3.1 Pro) score each response, and the majority vote decides.
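To make the scoring step concrete, here is a minimal sketch of how a three-judge majority vote might be tallied. It is an illustration only, not the benchmark's actual code: the function name `majority_label`, the judge keys, and the label strings are hypothetical, and the article does not say how Philosophy Bench handles a three-way split, so this sketch simply returns `None` in that case.

```python
from collections import Counter


def majority_label(votes):
    """Return the label picked by a strict majority of judges, or None if there is none."""
    if not votes:
        return None
    # most_common(1) yields the single most frequent label and its count.
    label, count = Counter(votes).most_common(1)[0]
    # Require more than half of all votes; a 1-1-1 split has no majority.
    return label if count > len(votes) / 2 else None


# Hypothetical verdicts from three judge models on one response.
judge_votes = {
    "judge_a": "deontological",
    "judge_b": "deontological",
    "judge_c": "consequentialist",
}

print(majority_label(list(judge_votes.values())))
```

With an odd number of judges and only two possible labels a tie cannot occur, which is presumably one reason to use three judges rather than two.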
The result: Anthropic's Claude models from generation 4.5 onward are the most strongly deontological in the benchmark. Opus 4.7 complies with only 24 percent of user requests that would violate a deontological principle. Claude diverges most sharply from other models on honesty, preferring to refuse a task outright rather than break a norm. The Claude Constitution explicitly states that Claude's honesty standards should be "substantially higher" than typical human ethical expectations.
At the opposite end of the spectrum, xAI's Grok 4.2 is the most consequentialist frontier model. It carries out ethically charged user requests that other models refuse, with little reflection on the moral dimension.
Gemini is the easiest to steer, GPT avoids moral language
Google's Gemini 3.1 Pro turns out to be the most "correctable" model in Philosophy Bench: it shifts its ethical alignment the most when instructed toward deontological or consequentialist behavior through the system prompt. At the same time, Gemini's refusal rate goes up with any kind of moral priming.
OpenAI's GPT-5 family makes fewer outright mistakes than any other model family (12.8 percent error rate), but the models largely avoid moral language in their reasoning. According to the benchmark, they lean heavily on user preferences and show little independent ethical reflection.