# 形式化验证：你能对软件做出哪些保证？

- 来源：Hacker News 热门（buzzing.cc 中文翻译）
- 作者：eatonphil
- 发布时间：2026-06-30 10:50
- AIHOT 分数：48
- AIHOT 链接：https://aihot.virxact.com/items/cmr02ep9v0390slki1eyvl1x2
- 原文链接：https://queue.acm.org/detail.cfm?id=3819084

## AI 摘要

形式化验证的成本和工具已进入可广泛使用阶段，AI消除了编写证明的最大障碍。与仅覆盖有限输入的测试不同，形式化验证通过验证语言（如Dafny、Lean、Rocq）编写属性和代码，由SMT求解器自动检查所有可达状态是否满足规范——例如权限系统的子集不变性，确保派生权限始终是父权限的子集。验证保证是绝对的：只要规范正确，代码在所有状态下都不会违反属性。过去编写证明需要博士级技能且工具缓慢，如今这一门槛正在降低。

## 正文

We are redesigning the Queue website.

Take a look and let us know what you think.

June 22, 2026

Volume 24, issue 3 PDF

You Don’t Know Jack About Formal Verification

What can you confidently guarantee about your software?

Fernanda Graciolli and Nada Amin

The cost and tooling of formal verification have reached the point of widespread use. By eliminating the cost of writing proofs, AI is removing the biggest barrier to formal verification. This makes it easier than ever to build software with critical business rules that are guaranteed to be mathematically correct, rather than merely tested.

If your program has a complex set of business rules, can you guarantee that no combination of valid inputs will produce an invalid outcome?

Consider a secrets-management platform with a permission system. Rules control who can read, edit, or delete secrets across environments. When a user creates a custom role, the system checks that the new role’s permissions are a subset of the user’s own (i.e., you shouldn’t be able to grant access you don’t have). The boundary check covers every combination of permission operators. It has a thorough test suite, and all are passing.

But there’s a case that the tests never tried: A user with access scoped to a single environment (say, QA) could create a role scoped to “not development.” That sounds narrow, but “not development” matches every environment except development (e.g., user acceptance testing, sandbox, canary, production, etc.)—meaning a single-environment permission escalated to near-universal access. The boundary check approves it and the tests don’t catch it, because every test uses overlapping values, which means the excluded value always appears in the parent’s set. The bug triggers only when they don’t overlap.

A correctly designed and implemented permission system has one fundamental semantic invariant: Permission derivation is subset invariant. That is, the set of environments that a derived permission matches must always be a subset of the environments the granting permission matches. That property says nothing about how permissions are represented. It covers finite sets, negated sets, and any representation that might be added later. With this invariant set, permission derivation can’t fail at runtime because a bug of this class would be made impossible by construction.

The distinction between “the code didn’t fail at runtime” and “the code can’t fail” is what this article is about.

What Formal Verification Actually Is

At its core, formal verification is a simple concept. You start with properties you want your code to uphold. These are contracts for input and output, functions and methods. For a shopping cart, you might want to prove the following:

– The balance never goes negative.

– Every item in the cart is reflected in the total.

– Only one coupon code can be applied per order.

A verifier can’t work with English, so those properties need to be expressed in a verification-aware language—a programming language designed so that specifications and proofs are treated as first-class citizens alongside implementation code. Several exist, each with different strengths: Dafny uses an imperative style familiar to most developers and automates much of the proof work via SMT (satisfiability modulo theories) solvers; Lean has roots in pure mathematics and is rapidly gaining traction for both theorem proving and verified software; Rocq (formerly Coq) and Isabelle are interactive proof assistants with decades of use in research and high-assurance systems; F* targets verified systems programming with extraction to C and OCaml; and TLA+ focuses on specifying and model checking distributed protocols.

They all work differently, but the core idea is the same: You write your code and your properties in the same system, and the tool checks that one satisfies the other. Here’s what that looks like in Dafny. For “the balance never goes negative” property, you would write:

method ApplyCoupon(balance: int, discount: int) returns (newBalance: int) requires discount >= 0 requires balance >= discount ensures newBalance >= 0{ newBalance := balance - discount;}

The requires clauses are preconditions, which means that this must be true before the function runs. The ensures clause is a postcondition; this is what the function guarantees when it finishes. The verifier doesn’t run this code. Instead, it reasons about its structure and hands the verification conditions to an automated oracle (in this case, an SMT solver) that determines whether the postcondition holds for every input that satisfies the preconditions. If it does, the proof goes through. If there’s even one reachable state where the property breaks, the code doesn’t compile.

The guarantee is only as good as the spec, but when the spec is right, the guarantee is absolute: The property holds in every reachable state of the program. You can now guarantee that this verified piece of code will not break under any circumstance (that is, assuming preconditions are met, postconditions are ensured).

This is fundamentally different from testing. With property tests, you can sample one million random inputs into a system and catch some nasty bugs. Your software sits in production with high confidence levels until a user runs a sequence that would have been caught by the 1,000,001 st random input.

Verification tells you, “There is no bug to find here.” That 1,000,001 st sequence was already covered before the code compiled, and so was every sequence after it. The proof didn’t check them one by one—it ruled them all out at once.

So, why haven’t you used it? Because, historically, the cost of writing proofs dwarfed the cost of writing the code. You would state an “obvious” property and then spend days coaxing a proof assistant into accepting it. The tools were slow, the process was painstakingly tedious and required Ph.D.-level skill, and this all produced error messages that were inscrutable. The guarantee was real, but the price was too high, which is why formal verification lived almost exclusively in avionics, chip design, nuclear systems, and cryptographic protocols—domains where a bug costs lives or fortunes. For everyone else, the reasonable conclusion was: Tests are good enough.

Now, Formal Verification Is AI’s Problem

The conclusion that testing is enough no longer holds because AI can now handle the exact bottleneck that prevented this system from being used in mainstream development in the first place. The bottleneck was not the verification itself since a lot of the tools can check proofs automatically. The bottleneck had to do with writing the proofs: translating an intuitive requirement into the precise logical form a verifier demands, and then spending hours or days wrestling with it when the solver can’t automatically confirm what you know to be true. That’s the part that made the cost prohibitive.

Since Opus 4.5, most frontier LLMs can draft formal specifications from natural-language requirements, propose proof strategies, and—critically—iterate relatively fast on failing lemmas in a tight loop with the verifier.

AI proposes implementation candidates, and a deterministic, mechanical process checks each candidate’s correctness. If the proof is wrong, the verifier rejects it and AI tries again. Trust in AI is reduced (to the specifications) because the verifier is an external authority. The human stays in the loop for the part that requires judgment: deciding which properties are worth guaranteeing and system design. Meanwhile, the machine handles the labor of producing a verifiably correct implementation.

This changes who and what formal verification is for. It’s no longer just for safety-critical systems with the budget for specialized proof engineers. It’s for anyone who has a property worth proving.

Formal Verification in Practice

Consider an e-commerce order that follows a state machine: Cart → Placed → Shipping → Delivered, with Cancelled reachable from Placed or Shipping. Placing an order charges the customer’s card; the warehouse begins shipping items; and at any point before delivery, the customer can cancel and the system must issue a refund.

The fundamental property to verify is financial conservation: On cancellation, the refund equals exactly the value of unshipped items. Equivalently, the net amount the customer pays must always equal the value of the items they receive.

In Dafny:

datatype State = Cart | Placed | Shipping | Delivered | Cancelleddatatype Item = Item(name: string, price: nat)datatype Order = Order(state: State, items: seq<Item>, shippedValue: nat, charged: nat, refunded: nat)predicate Valid(o: Order){ o.shippedValue <= Total(o.items) && match o.state case Cancelled => o.charged == Total(o.items) && o.refunded == o.charged - o.shippedValue // ... other states}

The critical clause: in Cancelled, refunded == charged - shippedValue. One line encodes the financial conservation law for every possible cancellation: nothing shipped, half shipped, or all but one shipped. The invariant covers every case at once.

Each state transition is a method with preconditions and postconditions:

method CancelOrder(o: Order) returns (o': Order) requires Valid(o) requires o.state == Placed || o.state == Shipping ensures Valid(o') && o'.state == Cancelled{ var refund := Total(o.items) - o.shippedValue; o' := o.(state := Cancelled, refunded := refund);}

Every state transition preserves Valid. A subsequent FinanciallySound ghost predicate stating that the customer pays for exactly what they received (charged - refunded == shippedValue) discharges automatically from Valid’s Cancelled case. So does a NoMoneyLost lemma: If a charge exists, then either items shipped, a refund was issued, or the order is still in fulfillment.

No sequence of operations—adding items, placing, shipping, cancelling—can produce an invalid order. The cancellation refund equals exactly the value of unshipped items. These guarantees hold for every possible order at any price, shipment pattern, and cancellation timing.

Catching Drift and Preserving Intent

When the accounting is simple, the FinanciallySound property holds up trivially for terminal states Cancelled and Delivered. An engineer with some knowledge of the e-commerce domain, however, knows that features such as discounts, batched and recurring orders, or return windows can add complexity to the accounting.

Regardless of what’s added later, there is no scenario in which you would want a user to be over- or undercharged. By encoding the financial conservation property upfront, you ensure that no future code inside this model can violate this fundamental property.

Let’s say that, six months later, the product adds order-level coupons: a “$20 off $100” promotion. The change looks small: a discount field on Order and charged becomes Total(items) - discount. The test fixtures are updated, and the cancellation tests still pass.

A bug is introduced where the customer is now overcharged in certain cancellation scenarios. (The detailed calculation is left as an exercise for the reader.)

Of course, since this product’s order lifecycle is being implemented as a verified domain, the discount feature—as it is part of this lifecycle—must be implemented inside the same system. In this case, a discount calculation would be added to the FinanciallySound predicate.

The invariant refunded == charged - shippedValue is correct when charged == Total(items). Once discount > 0, charged, and Total(items) diverge, the invariant silently encodes the wrong financial law. Adding a discount field changes the meaning of charged without changing its name, and every invariant that refers to charged comes to mean something different (a drift in intent).

The verifier says nothing about whether the meaning is preserved. If Valid is the only invariant, the implementation still satisfies it. What the verifier does catch, however, is inconsistency between invariants.

The error doesn’t fix the bug, but the failure indicates a gap in design that needs to be addressed so that the fundamental property put in place earlier—holding that the user should not be over- or undercharged—cannot be violated. In this case, you need to consider how the discount is allocated across items. Proportionally? Shipped first? Unshipped first?

Because the discount feature is implemented inside a verified system, no implementation of this feature will even compile if it violates the invariant.

The Proof Is Only as Good as the Spec

Everything stated in this article so far assumes the spec is right. That’s not a small assumption—it’s the hardest part and the crux of formal verification. No tool can fully automate the design decisions you want your app to adopt. The verifier will ruthlessly enforce whatever you tell it to, but it can’t tell you what to enforce. Brainstorming and planning with AI, along with spec-conflict audits during the formal spec-writing step, facilitates the design process.

Consider the example set forth in this article. The secrets-management bug is catchable only if someone has stated the subset invariant: Derived permissions must always match a subset of the granting permission’s environments. If the spec had instead said something narrower—“finite permissions can contain only other finite permissions”—the proof would pass, but the guarantee wouldn’t cover future permission representations. The invariant needs to be stated at the right level of abstraction, and the subsequent proof needs to prove it at the right level of depth. Getting both of these right requires a deep understanding of the domain, even if you never need to touch a formal spec or proof and your review is focused on the artifacts produced by the results of both.

What Formal Verification Can and Can’t Do

The properties you choose to verify are a statement about what matters most in your system. Once you get that right, the guarantees are absolute.

As of today, the formally verified core can handle most effect-free logic—invariants, transitions, conflict resolution. But the UI, network calls, and database interactions typically sit outside the verification boundary. Verification makes the core airtight but doesn’t guarantee end-to-end correctness.

Some tools exist, and the cost of proofs is now the cost of the tokens it takes to write them. With the right tooling, formal verification is ready to be integrated into mainstream AI workflows.

Acknowledgment

We thank Anastasiya Kravchuk–Kirilyuk for reading and discussing drafts.

Related Material

Proving the Correctness of Nonblocking Data Structures

Nonblocking synchronization can yield astonishing results in terms of scalability and realtime response, but at the expense of verification state space.

Mathieu Desnoyers

https://queue.acm.org/detail.cfm?id=2490873

The Verification of a Distributed System

A practitioner’s guide to increasing confidence in system correctness

Caitie McCaffrey

https://queue.acm.org/detail.cfm?id=2889274

You Don’t Know Jack About AI

And ChatGPT probably doesn’t either

Sonja Johnson-Yu and Sanket Shah

https://queue.acm.org/detail.cfm?id=3703128

Fernanda Graciolli is co-founder of Midspiral. After an early career in neuroscience research, she pivoted to web development and product design, focusing on making complex technologies accessible to end users. Since 2020, she has built several AI tools, including, most recently, co-creating both the methodology for applying formal verification to web application patterns and lemmafit, an npm package that employs that methodology in Claude Code to build end-to-end apps.

Nada Amin is an assistant professor of computer science at Harvard, where she combines PL and AI to create systems that are correct by construction. Prior to Harvard, she was a University Lecturer at the University of Cambridge and completed her PhD at EPFL as part of the Scala team. She has received two Amazon Research Awards in Automated Reasoning and best papers awards at NeurIPS Math-AI and PLDI. With Fernanda Graciolli, she has recently co-founded Midspiral to bring verification to mainstream development with AI.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Copyright © 2026 held by the owner/author(s). Publication rights licensed to ACM.

Originally published in Queue vol. 24, no. 3

More related articles: Mark A. Overton - The IDAR Graph

UML is the de facto standard for representing object-oriented designs. It does a fine job of recording designs, but it has a severe problem: its diagrams don’t convey what humans need to know, making them hard to understand. This is why most software developers use UML only when forced to. People understand an organization, such as a corporation, in terms of a control hierarchy. When faced with an organization of people or objects, the first question usually is, "What’s controlling all this?" Surprisingly, UML has no concept of one object controlling another. Consequently, in every type of UML diagram, no object appears to have greater or lesser control than its neighbors.

Eric Bouwers, Joost Visser, Arie Van Deursen - Getting What You Measure

Software metrics - helpful tools or a waste of time? For every developer who treasures these mathematical abstractions of software systems there is a developer who thinks software metrics are invented just to keep project managers busy. Software metrics can be very powerful tools that help achieve your goals but it is important to use them correctly, as they also have the power to demotivate project teams and steer development in the wrong direction.

Duncan Johnston-Watt - Under New Management

In an increasingly competitive global environment, enterprises are under extreme pressure to reduce operating costs. At the same time they must have the agility to respond to business opportunities offered by volatile markets.

Derek Miers - Best Practice (BPM)

Just as BPM (business process management) technology is markedly different from conventional approaches to application support, the methodology of BPM development is markedly different from traditional software implementation techniques. With CPI (continuous process improvement) as the core discipline of BPM, the models that drive work through the company evolve constantly. Indeed, recent studies suggest that companies fine-tune their BPM-based applications at least once a quarter (and sometimes as often as eight times per year). The point is that there is no such thing as a "finished" process; it takes multiple iterations to produce highly effective solutions. Every working BPM-based process is just a starting point for the future.

© ACM, Inc. All Rights Reserved.
