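Researchers built FINSABER, a stricter testing framework that evaluates LLM trading agents such as FinMem and FinAgent over roughly 20 years, across many stocks, and under conditions that guard against cherry-picked results. The results show that LLM strategies look good in narrow tests but usually fail in long-term, fair tests against simple baselines such as Buy and Hold, rule-based trading, forecasting models, and reinforcement learning. The LLMs are too cautious when the market rises and too risky when it falls, which indicates that understanding financial text is not the same as reliably timing the market. The paper notes that current LLMs may be unable to beat simple market strategies over the long term.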
LLM trading agents mostly fail when stock-market tests become long, broad, and fair.
The authors built FINSABER, a stricter testing setup that checks LLM trading over about 20 years, across a broader set of stocks, and with better protection against cherry-picked results.
They tested LLM systems such as FinMem and FinAgent against simple baselines like Buy and Hold, rule-based trading, forecasting models, and reinforcement learning methods.
The main result is that LLM strategies can look good in narrow tests, but they usually fail to beat simple market strategies once the test becomes longer and fairer.
The paper also finds that these LLMs adapt poorly to changing market conditions: they are too cautious when stocks are rising and too risky when stocks are falling.
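The gap between a narrow test and a longer, fairer one can be illustrated with a toy comparison against Buy and Hold. The sketch below is not FINSABER's code: the function names, the rolling-window scheme, the price series, and the positions are all hypothetical and chosen only to show how a strategy can win in one short window yet lose over the full period.

```python
# Toy illustration only. Prices and positions are made up, and none of
# these helpers come from FINSABER or from any LLM trading agent.

def buy_and_hold_return(prices):
    """Total return from buying at the first price and selling at the last."""
    return prices[-1] / prices[0] - 1.0


def strategy_return(prices, positions):
    """Total return when positions[t] (0 or 1) is the exposure held
    from day t to day t + 1."""
    growth = 1.0
    for t in range(len(prices) - 1):
        daily = prices[t + 1] / prices[t] - 1.0
        growth *= 1.0 + positions[t] * daily
    return growth - 1.0


def rolling_windows(n, window, step):
    """Start/end index pairs for every window of `window` prices."""
    return [(s, s + window) for s in range(0, n - window + 1, step)]


def compare(prices, positions, window, step, margin=1e-9):
    """Count the windows in which the strategy beats Buy and Hold.
    The small margin keeps floating-point ties from counting as wins."""
    spans = rolling_windows(len(prices), window, step)
    wins = 0
    for start, end in spans:
        p = prices[start:end]
        pos = positions[start:end - 1]
        if strategy_return(p, pos) - buy_and_hold_return(p) > margin:
            wins += 1
    return wins, len(spans)


# Hypothetical series: one early dip, then a steady rise.
prices = [100, 90, 99, 110, 121, 133]
# Hypothetical agent: sidesteps the dip, catches one rebound day,
# then stays out of the market for the rest of the rally.
positions = [0, 1, 0, 0, 0]

wins, total = compare(prices, positions, window=3, step=1)
print(f"windows won: {wins} of {total}")
print(f"full period, strategy: {strategy_return(prices, positions):.3f}")
print(f"full period, buy and hold: {buy_and_hold_return(prices):.3f}")
```

Reporting only the first window would make this hypothetical agent look strong, since it avoids the dip. Scoring every window and the full period shows it trailing Buy and Hold because it sits out the rally, which is the kind of selection effect that a longer and broader evaluation is meant to expose.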