# University of Luxembourg and LIH Study Reveals a Key Flaw in LLM Constrained Reasoning

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-04-22 14:28
- AIHOT link: https://aihot.virxact.com/items/cmo9ot8ht03pnsls2glnsjaeb
- Original link: https://x.com/rohanpaul_ai/status/2046838411604939228

## AI Summary

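A study from the University of Luxembourg and LIH reveals a key flaw in how LLMs handle structured reasoning under constraints. Tests on Optimal Power Flow problems found that constraint satisfaction plateaus at 55%-60% across models, with the main bottleneck being a failure to satisfy the physical constraint equations of the power system. The study indicates that models learn only the "shape of a solution" without actually performing the constrained search, so their outputs look plausible (correct format, small error) yet are physically infeasible. Supervised fine-tuning improves surface metrics but does not improve physical feasibility, and reinforcement learning also has limited effect. The study's warning: fluent approximation is not constrained optimization, and "looks plausible" is a dangerous standard.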

## Body

New University of Luxembourg + LIH paper reveals a critical gap in LLMs' ability to handle structured reasoning under constraints.

It checks whether LLMs can solve Optimal Power Flow problems end to end, and finds that they mostly cannot do so in a physically coherent way.

Across models and sizes, constraint satisfaction stayed stuck at about 55 to 60 percent.

The interesting result here is not that LLMs miss a hard engineering problem.

It is that they miss it in a very specific way.

Optimal Power Flow is a brutal test of real reasoning because it is not just about getting numbers close to a target, but about satisfying a web of physical constraints at the same time, from generator limits to bus voltages to the power-flow equations themselves.

That sounds minor until you look at the mechanism.

A model can produce an answer that looks clean, uses the right JSON, and even lands near the right values on mean squared error, while still violating the equations that make the grid physically coherent.

This paper shows exactly that failure mode.
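A minimal sketch of that failure mode, assuming a hypothetical 3-bus lossless DC power-flow toy rather than the paper's actual setup. The network, limits, tolerance, and candidate values below are all invented for illustration, and voltage-magnitude limits are left out because the DC simplification has none.

```python
# Hypothetical 3-bus DC power-flow toy; every number here is illustrative,
# not taken from the paper.
LINES = [(0, 1, 10.0), (1, 2, 10.0), (0, 2, 10.0)]  # (from_bus, to_bus, susceptance in p.u.)
DEMAND = [0.0, 0.0, 1.0]    # load at each bus (p.u.)
GEN_MIN = [0.0, 0.0, 0.0]   # generator lower limits (p.u.)
GEN_MAX = [1.0, 1.0, 0.0]   # generator upper limits (bus 2 has no generator)


def balance_residuals(gen, theta):
    """Per-bus mismatch: generation - demand - power flowing out over lines."""
    residuals = []
    for bus in range(len(DEMAND)):
        outflow = 0.0
        for i, j, b in LINES:
            if i == bus:
                outflow += b * (theta[i] - theta[j])
            elif j == bus:
                outflow += b * (theta[j] - theta[i])
        residuals.append(gen[bus] - DEMAND[bus] - outflow)
    return residuals


def within_gen_limits(gen):
    """True when every generator output lies inside its box limits."""
    return all(lo <= g <= hi for g, lo, hi in zip(gen, GEN_MIN, GEN_MAX))


def is_feasible(gen, theta, tol=1e-6):
    """Feasible only if limits hold AND every bus balances within tol."""
    worst = max(abs(r) for r in balance_residuals(gen, theta))
    return within_gen_limits(gen) and worst <= tol


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# A dispatch that satisfies the toy power-flow equations exactly.
reference = {"gen": [0.5, 0.5, 0.0], "theta": [0.0, 0.0, -0.05]}
# A "plausible" answer: close to the reference, inside limits, but unbalanced.
candidate = {"gen": [0.52, 0.45, 0.0], "theta": [0.0, 0.0, -0.05]}

if __name__ == "__main__":
    print("MSE vs reference:", mse(candidate["gen"], reference["gen"]))
    print("within generator limits:", within_gen_limits(candidate["gen"]))
    print("balance residuals:", balance_residuals(candidate["gen"], candidate["theta"]))
    print("feasible:", is_feasible(candidate["gen"], candidate["theta"]))
```

In this toy the candidate passes the generator-limit check and has an MSE just under 0.001 against the reference dispatch, yet its bus-balance residuals reach about 0.05 p.u., so `is_feasible` rejects it.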

Across several model families and sizes, constraint satisfaction sits in a stubborn band around 55 to 60 percent, and the main bottleneck is the power-flow constraints, while generator and voltage limits are often satisfied far more easily, as the table on page 12 makes plain.

Here's the part most people miss.

That pattern is not a small bug in prompting.

It suggests the models are learning the shape of a solution without actually carrying out the constrained search that the problem demands.

The ablations make the point sharper.

Supervised fine-tuning improves formatting and often lowers MSE, but it does not materially improve physical feasibility, and even a more elaborate system prompt barely moves the numbers, which is about as clean a rejection of "prompting will fix it" as you can ask for.

Reinforcement learning with a reward for valid structure and satisfied constraints helps a bit, especially on the 30-bus case, but even there the gains are modest rather than transformative, as the study overview on page 2 and results plots on pages 7 and 8 show.
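One way to picture a reward "for valid structure and satisfied constraints" is the sketch below. It is not the paper's reward function: the name `shaped_reward`, the `format_weight` value, the JSON keys `pg` and `vm`, and the single-bus checks are all hypothetical choices made for illustration.

```python
import json


def shaped_reward(raw_output, checks, format_weight=0.2):
    """Hypothetical reward: credit for parseable JSON plus the fraction of
    constraint checks that pass. Weights are illustrative assumptions."""
    if not checks:
        raise ValueError("at least one constraint check is required")
    try:
        solution = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # invalid structure earns nothing
    if not isinstance(solution, dict):
        return 0.0

    satisfied = 0
    for check in checks:
        try:
            if check(solution):
                satisfied += 1
        except (TypeError, KeyError):
            pass  # a malformed field counts as a violated constraint
    return format_weight + (1.0 - format_weight) * satisfied / len(checks)


# Illustrative single-bus checks (assumed values: demand 0.8 p.u.).
checks = [
    lambda s: 0.0 <= s["pg"] <= 1.0,          # generator limit
    lambda s: 0.95 <= s["vm"] <= 1.05,        # voltage-magnitude limit
    lambda s: abs(s["pg"] - 0.8) <= 1e-6,     # power balance: generation == demand
]

if __name__ == "__main__":
    print(shaped_reward('{"pg": 0.8, "vm": 1.0}', checks))   # all checks pass
    print(shaped_reward('{"pg": 0.79, "vm": 1.0}', checks))  # balance violated
    print(shaped_reward("not json", checks))                 # invalid structure
```

Under these assumed weights, the near-miss output that violates only the balance check still collects roughly 0.73 of the maximum reward, which shows how a model can score well on such a signal while remaining infeasible.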

So the real lesson is not that LLMs cannot reason.

It is that fluent approximation is not the same thing as optimization under law, and until models can reliably honor the constraints that define a system, "looks plausible" remains a very dangerous standard.

----

Paper Link - arxiv.org/abs/2603.23004v1

Paper Title: "Can LLMs Reason and Optimize Under Constraints?"