Project  ·  Research Build  ·  2025–2026

Deterministic Compilation for Structured LLM Workflows

A typed-node-registry architecture that restricts the LLM to planning, then validates and compiles the resulting workflow deterministically before execution.

Type
Research build
Benchmark
300 tasks across 6 sets
Result
278/300 compiled successes
Method
Single-plan JSON → static validation → deterministic codegen
278/300
COMPILED SUCCESSES
202/300
GPT-4.1 BASELINE
187/300
CLAUDE SONNET 4.6
5/6
SETS LED
// motivation

The problem with free-form LLM code generation

LLMs are reasonably good at single-step coding tasks. The harder problem is multi-step pipelines — ingest data, apply transformations, persist to a database, export an artifact. In these settings, free-form code generation has a characteristic failure mode: errors accumulate across steps. A hallucinated import in step two propagates through steps three and four. A mismatched column name introduced at ingestion causes silent failures at query time. Each step may look locally correct while the composed pipeline fails.

Many popular LLM workflow frameworks focus primarily on orchestration and runtime recovery, rather than enforcing strong pre-execution guarantees over the workflow representation itself. The core question here is not whether a model can produce a plausible intermediate plan, but what correctness guarantees that representation boundary actually provides.

More precisely: which failure modes does the architecture eliminate by construction, which does it defer to runtime, and which can it not catch at all?

// architecture

How the compiler works

The system imposes a strict separation between planning and execution. The LLM is permitted to do exactly one thing: select nodes from a pre-verified registry and supply their required parameters, expressed as a typed JSON plan. Everything after that point is deterministic.

┌─────────────────────────────────────────────────────────────────┐
│                            USER TASK                            │
│  "ingest CSV → normalize columns → aggregate → export to SQL"   │
└──────────────────────────┬──────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│          LLM PLANNER (single call, constrained output)          │
│                                                                 │
│  Input:  Task description + node registry (typed)               │
│  Output: JSON plan - node selections + parameter bindings       │
│                                                                 │
│  The LLM cannot invent new nodes or execute repair loops.       │
│  It emits a plan and stops there.                               │
└──────────────────────────┬──────────────────────────────────────┘
                           │ JSON plan
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│           STATIC VALIDATOR (7 checks, deterministic)            │
│                                                                 │
│  ✓ Node existence           ✓ Acyclicity                        │
│  ✓ Edge validity            ✓ Orphan detection                  │
│  ✓ Type compatibility       ✓ Input arity                       │
│  ✓ Required parameter presence                                  │
│                                                                 │
│  Fails here -> reject, log, return. No execution.               │
└──────────────────────────┬──────────────────────────────────────┘
                           │ validated plan
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│          COMPILER (topological sort → Python assembly)          │
│                                                                 │
│  Assembles executable Python from pre-verified node templates.  │
│  The LLM is not called again after planning.                    │
│  No runtime repair loops. No output inspection.                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │ executable Python
                           ▼
                    deterministic run
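The typed registry the planner selects from is not reproduced in this write-up. As a rough sketch only, with a hypothetical `NodeSpec` shape and field names that are not the project's actual schema, a registry entry might look like:

```python
from dataclasses import dataclass

# Hypothetical registry entry schema -- illustrative, not the
# project's actual implementation.
@dataclass(frozen=True)
class NodeSpec:
    name: str                 # node type the planner may select
    input_types: tuple        # types consumed from upstream nodes
    output_type: str          # type produced for downstream nodes
    required_params: tuple    # parameters the plan must bind

REGISTRY = {
    "CSVIngestor": NodeSpec(
        name="CSVIngestor",
        input_types=(),              # source node: no upstream inputs
        output_type="DataFrame",
        required_params=("path",),
    ),
    "ColumnNormalizer": NodeSpec(
        name="ColumnNormalizer",
        input_types=("DataFrame",),
        output_type="DataFrame",
        required_params=("strategy",),
    ),
}
```

Because each entry declares its input and output types, checks like type compatibility and input arity reduce to lookups against this table.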

The seven static checks

If all seven pass, the compiler runs. If any fail, the plan is rejected before a single line of user-affecting code executes. The claim is first-pass correctness under the stated constraints: the system either compiles successfully or rejects cleanly. No partial execution, no silent patching, no mid-run repair.
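A minimal sketch of how a few of these checks could be implemented, assuming the plan shape of the JSON example later in this write-up; the function and registry layout here are hypothetical, not the project's actual code:

```python
from collections import deque

def validate_plan(plan, registry):
    """Run deterministic structural checks; return a list of errors.
    An empty list means the plan may proceed to compilation."""
    errors = []
    nodes = {n["id"]: n for n in plan["nodes"]}

    # Node existence: every selected type must be registered.
    for n in plan["nodes"]:
        if n["type"] not in registry:
            errors.append(f"unknown node type: {n['type']}")

    # Edge validity: endpoints must refer to declared nodes.
    for e in plan["edges"]:
        for end in (e["from"], e["to"]):
            if end not in nodes:
                errors.append(f"edge references undeclared node: {end}")
                return errors  # graph checks below assume valid endpoints

    # Required parameter presence.
    for n in plan["nodes"]:
        spec = registry.get(n["type"], {})
        for p in spec.get("required_params", ()):
            if p not in n.get("params", {}):
                errors.append(f"{n['id']}: missing required param {p!r}")

    # Acyclicity via Kahn's algorithm: if the peel order does not
    # cover every node, a cycle exists.
    indegree = {nid: 0 for nid in nodes}
    succs = {nid: [] for nid in nodes}
    for e in plan["edges"]:
        indegree[e["to"]] += 1
        succs[e["from"]].append(e["to"])
    queue = deque(nid for nid, d in indegree.items() if d == 0)
    seen = 0
    while queue:
        nid = queue.popleft()
        seen += 1
        for s in succs[nid]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if seen != len(nodes):
        errors.append("cycle detected")
    return errors
```

Orphan detection, input arity, and type compatibility follow the same pattern: pure traversals of the plan graph against the registry, with no model call anywhere in the loop.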


// implementation

What a plan actually looks like

This is the full interface between the LLM and the rest of the system. The model produces a structured JSON object — node names, parameter bindings, and edge declarations. Nothing else passes through.

// Example: LLM-emitted plan for a 4-node pipeline
{
  "nodes": [
    {
      "id": "ingest_1",
      "type": "CSVIngestor",
      "params": { "path": "data/sales.csv", "delimiter": "," }
    },
    {
      "id": "norm_1",
      "type": "ColumnNormalizer",
      "params": { "strategy": "snake_case" }
    },
    {
      "id": "agg_1",
      "type": "Aggregator",
      "params": { "group_by": "region", "agg_fn": "sum", "column": "revenue" }
    },
    {
      "id": "sql_1",
      "type": "SQLExporter",
      "params": { "table": "sales_summary", "if_exists": "replace" }
    }
  ],
  "edges": [
    { "from": "ingest_1", "to": "norm_1" },
    { "from": "norm_1",  "to": "agg_1"  },
    { "from": "agg_1",   "to": "sql_1"  }
  ]
}

The validator runs all seven checks against this object. If it passes, the compiler assembles the corresponding Python by substituting each node ID with its pre-verified template and wiring outputs to inputs in topological order.
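A deterministic assembly pass of this kind might be sketched as follows; the templates and helper names (`read_csv`, `normalize_columns`) are hypothetical stand-ins, since the actual pre-verified node templates are not shown here:

```python
import graphlib  # stdlib topological sorter (Python 3.9+)

# Hypothetical pre-verified templates: each maps a node type to a line
# of Python whose placeholders are filled from the validated plan.
TEMPLATES = {
    "CSVIngestor":      "{out} = read_csv({path!r})",
    "ColumnNormalizer": "{out} = normalize_columns({in0}, {strategy!r})",
}

def compile_plan(plan):
    """Assemble Python source from a validated plan, in topological order."""
    preds = {n["id"]: [] for n in plan["nodes"]}
    for e in plan["edges"]:
        preds[e["to"]].append(e["from"])
    nodes = {n["id"]: n for n in plan["nodes"]}
    lines = []
    # static_order() emits each node after all of its predecessors.
    for nid in graphlib.TopologicalSorter(preds).static_order():
        node = nodes[nid]
        bindings = dict(node["params"], out=nid)
        # Wire upstream outputs into positional inputs in0, in1, ...
        for i, pred in enumerate(preds[nid]):
            bindings[f"in{i}"] = pred
        lines.append(TEMPLATES[node["type"]].format(**bindings))
    return "\n".join(lines)
```

Every variable name in the emitted source is a node ID from the plan, so the wiring is fully determined by the validated graph and nothing depends on a second model call.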


// evaluation

Benchmark results

300 tasks across six benchmark sets, each targeting a distinct failure mode. Baselines: GPT-4.1 and Claude Sonnet 4.6 generating free-form Python. A run is counted as successful only if the full pipeline executes end-to-end and produces the expected task output or artifact.

Across the six sets, the compiled system leads in five and trails only in SQL roundtrip tasks, where the intentionally unconstrained SQL surface becomes the dominant remaining failure mode.

Benchmark harness, task sets, and reproduction code are available in the repository linked above.

Set   Description                          Compiled   GPT-4.1   Claude Sonnet 4.6
A     Shallow depth (3–4 nodes)            50/50      38/50     30/50
B     Medium depth (5–6 nodes)             50/50      38/50     23/50
C     Deep (7–9 nodes)                     44/50      38/50     26/50
D     Very deep (10+ nodes)                48/50      38/50     36/50
E     Schema drift (column perturbations)  44/50      20/50     26/50
F     SQL roundtrip (state persistence)    42/50      36/50     45/50

The compiled system's failures concentrate almost entirely in Set F, where the unconstrained SQL parameter remains available. Sets A and B are 100%. Schema drift (Set E) shows the largest advantage: 88% vs. 40% for GPT-4.1. Claude Sonnet 4.6 outperforms the compiled system on SQL roundtrip, which is the single clear exception in the benchmark and is directly tied to the open surface described below.


// key finding

Constraint evasion under partial enforcement

The node registry constrains all surfaces of the plan except one: the raw SQL string passed to the QueryEngine node. In complex tasks where the Aggregator node would be the structurally correct choice, the planner systematically routes computation into the SQL parameter instead — exploiting the only unconstrained surface available.
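To illustrate the pattern, a reconstructed plan fragment in the same JSON format as above (illustrative, not taken from the benchmark logs) shows the aggregation that the Aggregator node should perform being pushed into the raw SQL string instead:

```json
{
  "id": "query_1",
  "type": "QueryEngine",
  "params": {
    "sql": "SELECT region, SUM(revenue) AS revenue FROM sales GROUP BY region"
  }
}
```

All seven static checks pass on such a plan, because none of them inspect the contents of the SQL string: the node exists, the edges are valid, and the required parameter is present.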

FINDING

Constraint evasion is predictable, not random

When you impose constraints on most surfaces of a generation system while leaving others open, the LLM will route computation into whichever surface remains unconstrained — even when structurally inferior options exist. This is not the model being adversarial. It is a direct consequence of optimising for task completion within whatever degrees of freedom are still available.

This is a design-level concern, not a prompting problem. More specific prompts do not fix it. The only fix is either closing the unconstrained surface or explicitly accepting that the remaining freedom will be used.

The SQL surface was not overlooked — it was intentionally left open because SQL is genuinely expressive and constraining it fully would have required a dedicated parser. The result is a measurable tradeoff: the open surface becomes the locus of nearly all remaining failure in an otherwise highly constrained system.

This generalises beyond SQL. Any system that relies on partial symbolic constraint as a correctness mechanism will exhibit constraint evasion proportional to the expressiveness of its unconstrained surfaces. The lesson is not simply "constrain everything"; it is to know exactly where the open surfaces are and what the planner can still push into them.


// failure modes

How the baselines fail

Analysis of the baseline failures across all 300 tasks identified two systematic causes that together account for the majority of free-form generation errors:

F1

Output length instability

Models generate variable-length outputs for functionally identical tasks. In multi-step pipelines this causes inconsistent function signatures, variable naming conventions that drift across steps, and import statements that appear in some generations but not others. The errors are non-deterministic and hard to reproduce.

F2

Prompt underspecification

As pipeline depth increases, the model's attention to early-step constraints degrades. Column names introduced in step one are misreferenced in step four. Schema assumptions made at ingestion are silently violated at persistence. Prompting more carefully helps at shallow depth but does not eliminate the degradation curve.

The compiled system eliminates both by construction. Output length instability does not affect plan structure because the JSON schema is fixed. Prompt underspecification is contained because the validator catches structural mismatches before execution.


// scope

Limitations and current scope

The current evaluation focuses on structured workflow synthesis under a fixed node registry rather than open-ended agentic tasks. These results should therefore be interpreted as evidence about constrained pipeline compilation, not as a claim about general program synthesis or unrestricted coding agents.

This is v1. The following are deliberate exclusions, not accidental omissions: