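The KellyBench benchmark tests how well leading LLMs handle long-horizon prediction and risk management when betting across a Premier League season. Every model tested lost money, and some lost their entire bankroll. Claude Opus 4.6 performed best with a -11% ROI, followed by GPT-5.4 at -13.6%. By simulating a dynamic season of 100-150 matchdays, the test exposes significant weaknesses in how current AI sustains coherent decisions, adapts to new data, and controls risk.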
AI models betting on the Premier League are losing badly.
A new betting benchmark suggests today's best AI models still unravel when prediction has to survive a whole season.
In KellyBench, every tested model lost money, and some went completely bust.
KellyBench forced agents through a changing season of 100-150 matchdays, where they had to predict outcomes, size bets, and protect a £100,000 bankroll.
That setup tests something normal benchmarks miss: whether an LLM can stay coherent, adapt to new data, and manage risk over time.
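To make "size bets" concrete, here is a minimal sketch of the Kelly criterion, the classic stake-sizing rule the benchmark's name appears to reference. The source does not say that the agents used this rule or how KellyBench scores sizing; the function names, the `scale` parameter, and the example probability and odds are all hypothetical.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    p_win is the bettor's own estimated win probability;
    decimal_odds is the bookmaker's price (2.0 means even money).
    """
    b = decimal_odds - 1.0          # net profit per unit staked on a win
    if b <= 0:
        return 0.0                  # no payout, so never bet
    q = 1.0 - p_win                 # estimated probability of losing
    f = (b * p_win - q) / b         # Kelly formula: f* = (bp - q) / b
    return max(0.0, f)              # a negative edge means no bet


def stake(bankroll: float, p_win: float, decimal_odds: float,
          scale: float = 0.5) -> float:
    """Stake in currency units; scale < 1 is 'fractional Kelly' (an assumption here)."""
    return bankroll * scale * kelly_fraction(p_win, decimal_odds)


if __name__ == "__main__":
    # Hypothetical bet: 60% estimated win chance at even-money odds.
    print(kelly_fraction(0.6, 2.0))
    print(stake(100_000, 0.6, 2.0))
```

The rule only helps if `p_win` is well calibrated: an overconfident probability estimate inflates the stake, which is one way a bankroll can drain over a long season.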
Claude Opus 4.6 was best at -11% ROI, GPT-5.4 came next at -13.6%, and several models hit -100%.
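For a sense of scale, the sketch below converts those ROI figures into ending bankrolls. It assumes ROI is measured against the £100,000 starting bankroll, which the source does not state explicitly; if ROI were instead measured against total stakes, the figures would differ. The function name is hypothetical.

```python
START_BANKROLL = 100_000  # pounds, from the benchmark description


def final_bankroll(start: float, roi_percent: float) -> float:
    """Ending bankroll, assuming ROI is a percentage of the starting bankroll."""
    return start * (1 + roi_percent / 100)


if __name__ == "__main__":
    # ROI figures as reported; the -100% row stands for the models that went bust.
    for label, roi in [("Claude Opus 4.6", -11.0),
                       ("GPT-5.4", -13.6),
                       ("bust models", -100.0)]:
        print(f"{label}: £{final_bankroll(START_BANKROLL, roi):,.0f}")
```

Under that assumption, the best result still gives up roughly £11,000 of the starting £100,000.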