Project  ·  Research Build  ·  2025–2026

Deterministic Compilation for Structured LLM Workflows

A typed-node-registry architecture that restricts the LLM to planning, then validates and compiles the resulting workflow deterministically before execution.

Type
Research build
Benchmark
300 tasks across 6 sets
Result
278/300 compiled successes
Method
Single-plan JSON → static validation → deterministic codegen
278/300
COMPILED SUCCESSES
202/300
GPT-4.1 BASELINE
187/300
CLAUDE SONNET 4.6
5/6
SETS LED
// motivation

The problem with free-form LLM code generation

LLMs are reasonably good at single-step coding tasks. The harder problem is multi-step pipelines — ingest data, apply transformations, persist to a database, export an artifact. In these settings, free-form code generation has a characteristic failure mode: errors accumulate across steps. A hallucinated import in step two propagates through steps three and four. A mismatched column name introduced at ingestion causes silent failures at query time. Each step may look locally correct while the composed pipeline fails.

Many popular LLM workflow frameworks focus primarily on orchestration and runtime recovery, rather than enforcing strong pre-execution guarantees over the workflow representation itself. The core question here is not whether a model can produce a plausible intermediate plan, but what correctness guarantees that representation boundary actually provides.

More precisely: which failure modes does the architecture eliminate by construction, which does it defer to runtime, and which can it not catch at all?

// architecture

How the compiler works

The system imposes a strict separation between planning and execution. The LLM is permitted to do exactly one thing: select nodes from a pre-verified registry and supply their required parameters, expressed as a typed JSON plan. Everything after that point is deterministic.

┌─────────────────────────────────────────────────────────────────┐
│                            USER TASK                            │
│  "ingest CSV → normalize columns → aggregate → export to SQL"   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│          LLM PLANNER (single call, constrained output)          │
│                                                                 │
│  Input:  Task description + node registry (typed)               │
│  Output: JSON plan - node selections + parameter bindings       │
│                                                                 │
│  The LLM cannot invent new nodes or execute repair loops.       │
│  It emits a plan and stops there.                               │
└──────────────────────────┬──────────────────────────────────────┘
                           │ JSON plan
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│           STATIC VALIDATOR (7 checks, deterministic)            │
│                                                                 │
│  ✓ Node existence           ✓ Acyclicity                        │
│  ✓ Edge validity            ✓ Orphan detection                  │
│  ✓ Type compatibility       ✓ Input arity                       │
│  ✓ Required parameter presence                                  │
│                                                                 │
│  Fails here -> reject, log, return. No execution.               │
└──────────────────────────┬──────────────────────────────────────┘
                           │ validated plan
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│          COMPILER (topological sort → Python assembly)          │
│                                                                 │
│  Assembles executable Python from pre-verified node templates.  │
│  The LLM is not called again after planning.                    │
│  No runtime repair loops. No output inspection.                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │ executable Python
                           ▼
                    deterministic run
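The typed registry the planner selects from is not reproduced in this write-up. As a rough sketch only, with a hypothetical `NodeSpec` shape and field names that are not the project's actual schema, a registry entry might look like:

```python
from dataclasses import dataclass

# Hypothetical registry entry schema -- illustrative, not the
# project's actual implementation.
@dataclass(frozen=True)
class NodeSpec:
    name: str                 # node type the planner may select
    input_types: tuple        # types consumed from upstream nodes
    output_type: str          # type produced for downstream nodes
    required_params: tuple    # parameters the plan must bind

REGISTRY = {
    "CSVIngestor": NodeSpec(
        name="CSVIngestor",
        input_types=(),              # source node: no upstream inputs
        output_type="DataFrame",
        required_params=("path",),
    ),
    "ColumnNormalizer": NodeSpec(
        name="ColumnNormalizer",
        input_types=("DataFrame",),
        output_type="DataFrame",
        required_params=("strategy",),
    ),
}
```

Because each entry declares its input and output types, checks like type compatibility and input arity reduce to lookups against this table.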

The seven static checks

If all seven pass, the compiler runs. If any fail, the plan is rejected before a single line of user-affecting code executes. The claim is first-pass correctness under the stated constraints: the system either compiles successfully or rejects cleanly. No partial execution, no silent patching, no mid-run repair.
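A minimal sketch of how a few of these checks could be implemented, assuming the plan shape of the JSON example later in this write-up; the function and registry layout here are hypothetical, not the project's actual code:

```python
from collections import deque

def validate_plan(plan, registry):
    """Run deterministic structural checks; return a list of errors.
    An empty list means the plan may proceed to compilation."""
    errors = []
    nodes = {n["id"]: n for n in plan["nodes"]}

    # Node existence: every selected type must be registered.
    for n in plan["nodes"]:
        if n["type"] not in registry:
            errors.append(f"unknown node type: {n['type']}")

    # Edge validity: endpoints must refer to declared nodes.
    for e in plan["edges"]:
        for end in (e["from"], e["to"]):
            if end not in nodes:
                errors.append(f"edge references undeclared node: {end}")
                return errors  # graph checks below assume valid endpoints

    # Required parameter presence.
    for n in plan["nodes"]:
        spec = registry.get(n["type"], {})
        for p in spec.get("required_params", ()):
            if p not in n.get("params", {}):
                errors.append(f"{n['id']}: missing required param {p!r}")

    # Acyclicity via Kahn's algorithm: if the peel order does not
    # cover every node, a cycle exists.
    indegree = {nid: 0 for nid in nodes}
    succs = {nid: [] for nid in nodes}
    for e in plan["edges"]:
        indegree[e["to"]] += 1
        succs[e["from"]].append(e["to"])
    queue = deque(nid for nid, d in indegree.items() if d == 0)
    seen = 0
    while queue:
        nid = queue.popleft()
        seen += 1
        for s in succs[nid]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if seen != len(nodes):
        errors.append("cycle detected")
    return errors
```

Orphan detection, input arity, and type compatibility follow the same pattern: pure traversals of the plan graph against the registry, with no model call anywhere in the loop.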


// implementation

What a plan actually looks like

This is the full interface between the LLM and the rest of the system. The model produces a structured JSON object — node names, parameter bindings, and edge declarations. Nothing else passes through.

// Example: LLM-emitted plan for a 4-node pipeline
{
  "nodes": [
    {
      "id": "ingest_1",
      "type": "CSVIngestor",
      "params": { "path": "data/sales.csv", "delimiter": "," }
    },
    {
      "id": "norm_1",
      "type": "ColumnNormalizer",
      "params": { "strategy": "snake_case" }
    },
    {
      "id": "agg_1",
      "type": "Aggregator",
      "params": { "group_by": "region", "agg_fn": "sum", "column": "revenue" }
    },
    {
      "id": "sql_1",
      "type": "SQLExporter",
      "params": { "table": "sales_summary", "if_exists": "replace" }
    }
  ],
  "edges": [
    { "from": "ingest_1", "to": "norm_1" },
    { "from": "norm_1",  "to": "agg_1"  },
    { "from": "agg_1",   "to": "sql_1"  }
  ]
}

The validator runs all seven checks against this object. If it passes, the compiler assembles the corresponding Python by substituting each node ID with its pre-verified template and wiring outputs to inputs in topological order.
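A deterministic assembly pass of this kind might be sketched as follows; the templates and helper names (`read_csv`, `normalize_columns`) are hypothetical stand-ins, since the actual pre-verified node templates are not shown here:

```python
import graphlib  # stdlib topological sorter (Python 3.9+)

# Hypothetical pre-verified templates: each maps a node type to a line
# of Python whose placeholders are filled from the validated plan.
TEMPLATES = {
    "CSVIngestor":      "{out} = read_csv({path!r})",
    "ColumnNormalizer": "{out} = normalize_columns({in0}, {strategy!r})",
}

def compile_plan(plan):
    """Assemble Python source from a validated plan, in topological order."""
    preds = {n["id"]: [] for n in plan["nodes"]}
    for e in plan["edges"]:
        preds[e["to"]].append(e["from"])
    nodes = {n["id"]: n for n in plan["nodes"]}
    lines = []
    # static_order() emits each node after all of its predecessors.
    for nid in graphlib.TopologicalSorter(preds).static_order():
        node = nodes[nid]
        bindings = dict(node["params"], out=nid)
        # Wire upstream outputs into positional inputs in0, in1, ...
        for i, pred in enumerate(preds[nid]):
            bindings[f"in{i}"] = pred
        lines.append(TEMPLATES[node["type"]].format(**bindings))
    return "\n".join(lines)
```

Every variable name in the emitted source is a node ID from the plan, so the wiring is fully determined by the validated graph and nothing depends on a second model call.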


// evaluation

Benchmark results

300 tasks across six benchmark sets, each targeting a distinct failure mode. Baselines: GPT-4.1 and Claude Sonnet 4.6 generating free-form Python. A run is counted as successful only if the full pipeline executes end-to-end and produces the expected task output or artifact.

Across the six sets, the compiled system leads in five and trails only in SQL roundtrip tasks, where the intentionally unconstrained SQL surface becomes the dominant remaining failure mode.

Benchmark harness, task sets, and reproduction code are available in the repository linked above.

Set   Description                          Compiled   GPT-4.1   Claude Sonnet 4.6
A     Shallow depth (3–4 nodes)            50/50      38/50     30/50
B     Medium depth (5–6 nodes)             50/50      38/50     23/50
C     Deep (7–9 nodes)                     44/50      38/50     26/50
D     Very deep (10+ nodes)                48/50      38/50     36/50
E     Schema drift (column perturbations)  44/50      20/50     26/50
F     SQL roundtrip (state persistence)    42/50      36/50     45/50

The compiled system's failures concentrate almost entirely in Set F, where the unconstrained SQL parameter remains available. Sets A and B are 100%. Schema drift (Set E) shows the largest advantage: 88% vs. 40% for GPT-4.1. Claude Sonnet 4.6 outperforms the compiled system on SQL roundtrip, which is the single clear exception in the benchmark and is directly tied to the open surface described below.


// key finding

Constraint evasion under partial enforcement

The node registry constrains all surfaces of the plan except one: the raw SQL string passed to the QueryEngine node. In complex tasks where the Aggregator node would be the structurally correct choice, the planner systematically routes computation into the SQL parameter instead — exploiting the only unconstrained surface available.
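To illustrate the pattern, a reconstructed plan fragment in the same JSON format as above (illustrative, not taken from the benchmark logs) shows the aggregation that the Aggregator node should perform being pushed into the raw SQL string instead:

```json
{
  "id": "query_1",
  "type": "QueryEngine",
  "params": {
    "sql": "SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region"
  }
}
```

All seven static checks pass on such a plan, because none of them inspect the contents of the SQL string: the node exists, the edges are valid, and the required parameter is present.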

FINDING

Constraint evasion is predictable, not random

When you impose constraints on most surfaces of a generation system while leaving others open, the LLM will route computation into whichever surface remains unconstrained — even when structurally inferior options exist. This is not the model being adversarial. It is a direct consequence of optimising for task completion within whatever degrees of freedom are still available.

This is a design-level concern, not a prompting problem. More specific prompts do not fix it. The only fix is either closing the unconstrained surface or explicitly accepting that the remaining freedom will be used.

The SQL surface was not overlooked — it was intentionally left open because SQL is genuinely expressive and constraining it fully would have required a dedicated parser. The result is a measurable tradeoff: the open surface becomes the locus of nearly all remaining failure in an otherwise highly constrained system.

This generalises beyond SQL. Any system that relies on partial symbolic constraint as a correctness mechanism will exhibit constraint evasion proportional to the expressiveness of its unconstrained surfaces. The lesson is not simply "constrain everything"; it is to know exactly where the open surfaces are and what the planner can still push into them.


// failure modes

How the baselines fail

Analysis of the baseline failures across all 300 tasks identified two systematic causes that together account for the majority of free-form generation errors:

F1

Output length instability

Models generate variable-length outputs for functionally identical tasks. In multi-step pipelines this causes inconsistent function signatures, variable naming conventions that drift across steps, and import statements that appear in some generations but not others. The errors are non-deterministic and hard to reproduce.

F2

Prompt underspecification

As pipeline depth increases, the model's attention to early-step constraints degrades. Column names introduced in step one are misreferenced in step four. Schema assumptions made at ingestion are silently violated at persistence. Prompting more carefully helps at shallow depth but does not eliminate the degradation curve.

The compiled system eliminates both by construction. Output length instability does not affect plan structure because the JSON schema is fixed. Prompt underspecification is contained because the validator catches structural mismatches before execution.


// scope

Limitations and current scope

The current evaluation focuses on structured workflow synthesis under a fixed node registry rather than open-ended agentic tasks. These results should therefore be interpreted as evidence about constrained pipeline compilation, not as a claim about general program synthesis or unrestricted coding agents.

This is v1. The following are deliberate exclusions, not accidental omissions: