Neural Proposal and Symbolic Verification

Motivation

Two pieces of evidence shape this design pattern. First, large language models are very good at generating candidate solutions — proofs, programs, geometric constructions, conjectures — but bad at producing only correct ones. Second, for many domains there exist sound symbolic verifiers: type checkers, theorem provers, SAT/SMT solvers, unit tests, formal proof assistants. Checking a candidate is easy; producing only correct candidates is hard.

The propose-and-verify pattern combines them. The neural model proposes many candidates; the symbolic verifier rejects the wrong ones; whatever survives carries the verifier’s guarantee. Crucially, the neural component never needs to be reliable — its job is to find candidates, not to certify them. The symbolic component never needs to be creative — its job is to check, not to generate.

Recent high-profile systems use this pattern in different forms: AlphaGeometry (Trinh et al. 2024) for olympiad geometry, FunSearch (Romera-Paredes et al. 2023) for mathematical discovery, AlphaProof for formal theorem proving, and a long tail of program-synthesis systems for code. All share the same skeleton.

The Pattern

loop:
    candidates ← neural_proposer(problem)         # sample many candidates
    failed ← []
    for c in candidates:
        if symbolic_verifier(c, problem):         # sound check
            return c
        failed.append(c)
    update_proposer(failed, problem)              # optional: learn from failure

Two ingredients matter:

  • The proposer can be wrong. Most proposals are. As long as a non-negligible fraction of proposals is correct, sampling enough of them surfaces a winner.
  • The verifier is sound. A verifier that accepts a wrong proposal poisons the entire system; soundness is the contract that makes the architecture work.

This inverts the LLM-with-tools pattern. There, the LLM drives and tools are consulted on demand; here, the symbolic verifier is the gate through which every output must pass.

What Verifiers Look Like

The pattern requires a domain where checking is much easier than solving. Common verifiers:

Domain                    Proposer outputs                           Verifier
Formal theorem proving    A Lean / Coq / Isabelle proof script       The proof assistant’s kernel
Olympiad geometry         A sequence of auxiliary constructions      A symbolic deduction engine
Program synthesis         A function in some language                A unit test suite or specification
SAT/SMT problem solving   An assignment / certificate                Substitution into the formula
Mathematical search       A Python function returning a candidate    An evaluator scoring the candidate
Code generation           A program                                  A type checker plus tests

Each verifier supplies the soundness guarantee. Lean’s kernel accepts a proof only if every step type-checks under the dependent type theory it implements; running a test suite against a generated function rules out every program that fails it. The verifier may be incomplete (it might reject correct candidates expressed in unusual ways) but it must not be unsound.
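The SAT/SMT row is the easiest to make concrete: verifying a candidate assignment is just substitution into the formula. A minimal sketch in Python, using DIMACS-style signed-integer literals (the function name is illustrative):

```python
# A CNF formula is a list of clauses; each clause is a list of signed
# integers in DIMACS style: 3 means x3, -3 means NOT x3.

def verify_assignment(cnf, assignment):
    """Sound check: True only if the assignment satisfies every clause."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

# (x1 OR NOT x2) AND (x2 OR x3)
cnf = [[1, -2], [2, 3]]
print(verify_assignment(cnf, {1: True, 2: False, 3: True}))    # True
print(verify_assignment(cnf, {1: False, 2: False, 3: False}))  # False
```

The check runs in time linear in the formula size, regardless of how hard the instance was to solve — exactly the checking-is-easier-than-solving asymmetry the pattern relies on.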

The Search Loop

The neural component is rarely a single forward pass. Different systems organize the search differently:

Sampling with rejection

The simplest version: sample \(k\) candidates, run the verifier on each, return the first that passes. Works when proposals are independent and a reasonable fraction are right.
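A minimal sketch of this loop, with a toy number-theoretic problem standing in for both sides (all names are illustrative; a real proposer would be a trained model):

```python
import random

def propose_and_verify(proposer, verifier, problem, k=1000, seed=0):
    """Sample up to k candidates; return the first that passes the sound check."""
    rng = random.Random(seed)
    for _ in range(k):
        c = proposer(problem, rng)
        if verifier(c, problem):
            return c
    return None  # search budget exhausted

# Toy instance: find a square root of 1 modulo n. Checking is one
# multiplication, and solutions are common enough here that even a
# uniform "proposer" succeeds.
n = 143
x = propose_and_verify(
    proposer=lambda m, rng: rng.randrange(1, m),
    verifier=lambda c, m: (c * c) % m == 1,
    problem=n,
)
```

Note the contract: the function either returns a verified candidate or `None` — it never returns an unchecked guess.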

Evolutionary search over programs

FunSearch (Romera-Paredes et al. 2023) iterates: an LLM proposes new candidate programs, an evaluator scores them on a benchmark, the highest-scoring programs are added to a pool, and the LLM is prompted with the pool to generate new candidates. The pool functions as evolutionary memory; the evaluator is the soundness check. Over many iterations, the pool drifts toward better programs — including programs that beat human-written baselines on cap-set and bin-packing problems.
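The loop above can be sketched as follows; the `propose` callable stands in for the LLM, and the integer "programs" are a toy stand-in for the Python functions FunSearch actually evolves:

```python
import random

def evolutionary_search(propose, evaluate, seed_program, rounds=200, pool_size=10):
    """FunSearch-shaped loop (sketch): propose from the pool, score soundly,
    keep the highest-scoring programs as evolutionary memory."""
    pool = [(evaluate(seed_program), seed_program)]
    for _ in range(rounds):
        candidate = propose([p for _, p in pool])  # prompt with current pool
        score = evaluate(candidate)                # sound scorer
        if score is not None:                      # None = candidate failed to run
            pool.append((score, candidate))
            pool.sort(key=lambda t: t[0], reverse=True)
            del pool[pool_size:]                   # keep only the best
    return pool[0]

# Toy stand-in: "programs" are integers, the evaluator rewards closeness
# to 42, and the "LLM" just perturbs the current best.
rng = random.Random(0)
best_score, best = evolutionary_search(
    propose=lambda programs: programs[0] + rng.choice([-3, -1, 1, 3]),
    evaluate=lambda x: -abs(x - 42),
    seed_program=0,
)
```

The structure to notice: the pool never admits an unscored candidate, so the best element is always backed by the evaluator, even though individual proposals are mostly useless.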

Tree search with neural priors

AlphaGeometry (Trinh et al. 2024) interleaves a transformer that proposes new auxiliary points with a symbolic deduction engine that derives all consequences of the current diagram. The proposer’s job is just to suggest the next construction; deduction does the rest. The system solved 25 of 30 IMO geometry problems — approaching the level of an IMO gold medalist — using a training corpus generated by symbolic synthesis, with no human-written proofs.

Proof-state-guided generation

In formal theorem proving (AlphaProof, Lean-style systems), the proposer is prompted with the current proof state, which the proof assistant supplies after every tactic. The system is a tree search in which the verifier gives feedback after each step, not just at the end. This is the variant closest to classical heuristic search over a symbolic state space.
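Concretely, the per-step feedback is the goal state the proof assistant displays after each tactic. A minimal Lean 4 sketch (using `Nat.add_comm` from the standard library); the kernel, not the proposer, is the final arbiter:

```lean
-- The kernel accepts this proof only if every step type-checks.
example (a b : Nat) : a + b = b + a := by
  -- goal shown to a proposer at this point: ⊢ a + b = b + a
  exact Nat.add_comm a b
```

A proposer that emits a wrong tactic simply gets an error back and can try another branch; nothing unverified ever enters the proof.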

Why It Works

Four properties make the pattern robust:

  • The verifier reframes the problem. “Generate a correct proof” becomes “generate any proof and let the verifier reject the bad ones.” Correctness is no longer a property the neural model has to internalize.
  • Failures are training signal. A rejected candidate is a labeled negative example; an accepted one is a labeled positive. Fine-tuning on (problem, accepted proof) pairs improves the proposer over time without a human in the loop.
  • Sound certificates compose. A proof verified by Lean’s kernel can be embedded in a larger proof; a function that passes its tests can be called by other functions. Verified outputs are reusable, unlike hallucinations.
  • The neural model can be wrong indefinitely without harm. Unlike LLM tool use, where a wrong tool call may surface as a wrong answer, propose-and-verify never returns an answer that hasn’t passed the verifier. The system’s output is as reliable as the verifier.
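The second bullet can be sketched as a data-collection loop; the toy domain (finding nontrivial divisors) and all names are illustrative:

```python
import random

def collect_pairs(problems, proposer, verifier, k=500):
    """Sort proposals by verifier verdict: accepted -> positives,
    rejected -> negatives. The positives become fine-tuning data
    with no human annotation involved."""
    positives, negatives = [], []
    for problem in problems:
        for _ in range(k):
            c = proposer(problem)
            (positives if verifier(c, problem) else negatives).append((problem, c))
    return positives, negatives

# Toy domain: propose a nontrivial divisor of n; the check is one modulo.
rng = random.Random(0)
pos, neg = collect_pairs(
    problems=[35, 77],
    proposer=lambda n: rng.randrange(2, n),
    verifier=lambda c, n: n % c == 0,
)
```

Because the verifier is sound, every pair in `pos` is a correct training example by construction, however unreliable the proposer was.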

Where the Pattern Breaks

Two conditions are required:

  • A verifier must exist. For most natural-language tasks — open-ended writing, opinion, persuasion — there is no sound check, and the pattern doesn’t apply. Tasks with verifiable correctness (mathematics, code, formal logic) are where it shines.
  • Search must be tractable. If the verifier accepts one in \(10^{20}\) candidates and the proposer is uniform, no amount of compute will find a winner. The pattern requires the proposer to be a useful prior over the space: its samples must concentrate in regions where acceptable candidates are plausibly dense.
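The tractability condition can be made quantitative. Assuming independent samples with per-sample acceptance probability \(p\), \(k\) samples succeed with probability \(1-(1-p)^k\), and the expected number of verifier calls before a success is \(1/p\):

```python
import math

def success_probability(p, k):
    """P(at least one of k independent proposals passes the verifier),
    computed stably even when p is astronomically small."""
    return -math.expm1(k * math.log1p(-p))

# A weak-but-useful prior: 1-in-1000 acceptance, 10,000 samples.
print(success_probability(1e-3, 10_000))   # ~0.99995: near certainty
# A uniform proposer over a 10^20 space: a billion samples achieve nothing.
print(success_probability(1e-20, 10**9))   # ~1e-11
```

The two prints are the two regimes in the bullet above: a proposer only three orders of magnitude better than chance makes the loop practical, while a uniform one leaves it hopeless at any budget.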

When both conditions hold, propose-and-verify makes the system’s outputs exactly as trustworthy as the verifier — limited only by whether the verifier is actually sound — at a compute cost inversely proportional to the acceptance rate.

Relationship to Older Ideas

The pattern is not new. Generate-and-test is one of the oldest patterns in AI; what is new is the quality of the generator. Earlier systems used random sampling or hand-tuned heuristics as the proposer; modern systems use neural models trained on the domain. The verifier is unchanged — the same theorem provers, the same SAT solvers, the same type checkers — but the proposer has gotten dramatically better.

The connection to neurosymbolic AI is direct: the pattern is the cleanest possible coupling of neural and symbolic systems. The interface is “candidate in, accept/reject out,” which any solver in any framework can implement without modification. The bet is that most of the value in pairing neural and symbolic systems comes from exactly this loose coupling, and that more entangled architectures (differentiable logic, end-to-end trained hybrids) buy comparatively less.

When to Use It

Situation                                                Approach
Formal mathematics, code generation, formula synthesis   Propose-and-verify with the domain’s standard verifier (Lean, type checker, tests).
Tasks where exact computation is needed mid-dialogue     LLM with symbolic tools — pay tool latency per turn, not per candidate.
Open-ended generation without ground truth               Neither — the verifier doesn’t exist. RLHF and human evaluation remain the only signal.
Domains with verifiers but rare success                  Apply, but invest in proposer training; sampling alone won’t bridge a \(10^{-20}\) acceptance rate.

The pattern is closing the gap between AI systems and parts of mathematics and software engineering where the neural model alone cannot be trusted, but where a sound check exists. The expected near-term outcome is more domains acquiring verifiers explicitly to make this loop work — a quiet inversion of the usual “verifier follows the prover” sequence.

References

Romera-Paredes, Bernardino, Mohammadamin Barekatain, Alexander Novikov, et al. 2023. “Mathematical Discoveries from Program Search with Large Language Models.” Nature 625 (7995): 468–75. https://doi.org/10.1038/s41586-023-06924-6.
Trinh, Trieu H., Yuhuai Wu, Quoc V. Le, He He, and Thang Luong. 2024. “Solving Olympiad Geometry Without Human Demonstrations.” Nature 625 (7995): 476–82. https://doi.org/10.1038/s41586-023-06747-5.