Project · Research Build · 2025–2026
A typed-node-registry architecture that restricts the LLM to planning, then validates and compiles the resulting workflow deterministically before execution.
LLMs are reasonably good at single-step coding tasks. The harder problem is multi-step pipelines — ingest data, apply transformations, persist to a database, export an artifact. In these settings, free-form code generation has a characteristic failure mode: errors accumulate across steps. A hallucinated import in step two propagates through steps three and four. A mismatched column name introduced at ingestion causes silent failures at query time. Each step may look locally correct while the composed pipeline fails.
Many popular LLM workflow frameworks focus primarily on orchestration and runtime recovery, rather than enforcing strong pre-execution guarantees over the workflow representation itself. The core question here is not whether a model can produce a plausible intermediate plan, but what correctness guarantees that representation boundary actually provides.
The system imposes a strict separation between planning and execution. The LLM is permitted to do exactly one thing: select nodes from a pre-verified registry and supply their required parameters, expressed as a typed JSON plan. Everything after that point is deterministic.
Every plan is subjected to seven validation checks. If all seven pass, the compiler runs. If any check fails, the plan is rejected before a single line of user-affecting code executes. The claim is first-pass correctness under the stated constraints: the system either compiles successfully or rejects cleanly. No partial execution, no silent patching, no mid-run repair.
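The seven checks are not enumerated in this section, so the sketch below shows only a representative structural subset: known node types, required parameters, and edge references. The registry contents, check logic, and error strings are illustrative assumptions, not the project's actual code.

```python
# Hypothetical registry of pre-verified node types and their required
# parameters (contents assumed for illustration).
REGISTRY = {
    "CSVIngestor": {"required": {"path"}},
    "ColumnNormalizer": {"required": {"strategy"}},
    "Aggregator": {"required": {"group_by", "agg_fn", "column"}},
    "SQLExporter": {"required": {"table"}},
}

def validate(plan: dict) -> list:
    """Return a list of violations; an empty list means the plan passes."""
    errors = []
    ids = {n["id"] for n in plan["nodes"]}
    for n in plan["nodes"]:
        spec = REGISTRY.get(n["type"])
        if spec is None:
            # Check: every node type must exist in the registry.
            errors.append(f"unknown node type: {n['type']}")
            continue
        # Check: all required parameters must be supplied.
        missing = spec["required"] - n["params"].keys()
        if missing:
            errors.append(f"{n['id']}: missing params {sorted(missing)}")
    for e in plan["edges"]:
        # Check: edges may only reference declared node IDs.
        if e["from"] not in ids or e["to"] not in ids:
            errors.append(f"edge references undeclared node: {e}")
    return errors
```

A failing plan produces a non-empty error list and is rejected outright; nothing downstream of the validator ever sees it.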
This is the full interface between the LLM and the rest of the system. The model produces a structured JSON object — node names, parameter bindings, and edge declarations. Nothing else passes through.
```json
// Example: LLM-emitted plan for a 4-node pipeline
{
  "nodes": [
    { "id": "ingest_1", "type": "CSVIngestor",
      "params": { "path": "data/sales.csv", "delimiter": "," } },
    { "id": "norm_1", "type": "ColumnNormalizer",
      "params": { "strategy": "snake_case" } },
    { "id": "agg_1", "type": "Aggregator",
      "params": { "group_by": "region", "agg_fn": "sum", "column": "revenue" } },
    { "id": "sql_1", "type": "SQLExporter",
      "params": { "table": "sales_summary", "if_exists": "replace" } }
  ],
  "edges": [
    { "from": "ingest_1", "to": "norm_1" },
    { "from": "norm_1", "to": "agg_1" },
    { "from": "agg_1", "to": "sql_1" }
  ]
}
```
The validator runs all seven checks against this object. If it passes, the compiler assembles the corresponding Python by substituting each node ID with its pre-verified template and wiring outputs to inputs in topological order.
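The compile step can be sketched as a topological ordering (Kahn's algorithm) over the validated plan, followed by template substitution. Only the ordering and substitution mechanism reflects what the article describes; the `TEMPLATES` table and the emitted code shape are hypothetical placeholders.

```python
from collections import defaultdict, deque

# Placeholder template table: one pre-verified code template per node type.
# Real templates would exist for every registry entry.
TEMPLATES = {
    "CSVIngestor": "df_{id} = pd.read_csv({path!r})",
}

def topo_order(plan: dict) -> list:
    """Order node IDs so every edge goes from earlier to later (Kahn's algorithm)."""
    indegree = {n["id"]: 0 for n in plan["nodes"]}
    successors = defaultdict(list)
    for e in plan["edges"]:
        successors[e["from"]].append(e["to"])
        indegree[e["to"]] += 1
    ready = deque(i for i, d in indegree.items() if d == 0)
    order = []
    while ready:
        nid = ready.popleft()
        order.append(nid)
        for nxt in successors[nid]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        # A cycle means the plan is rejected, never partially executed.
        raise ValueError("cycle detected: plan rejected before execution")
    return order

def compile_plan(plan: dict) -> str:
    """Substitute each node ID with its template, in topological order."""
    nodes = {n["id"]: n for n in plan["nodes"]}
    lines = []
    for nid in topo_order(plan):
        node = nodes[nid]
        lines.append(TEMPLATES[node["type"]].format(id=nid, **node["params"]))
    return "\n".join(lines)
```

Because both steps are deterministic, the same validated plan always compiles to the same program.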
The evaluation covers 300 tasks across six benchmark sets, each targeting a distinct failure mode. Baselines are GPT-4.1 and Claude Sonnet 4.6 generating free-form Python. A run counts as successful only if the full pipeline executes end-to-end and produces the expected task output or artifact.
Across the six sets, the compiled system leads in five and trails only in SQL roundtrip tasks, where the intentionally unconstrained SQL surface becomes the dominant remaining failure mode.
Benchmark harness, task sets, and reproduction code are available in the repository linked above.
| Set | Description | Compiled | GPT-4.1 | Claude Sonnet 4.6 |
|---|---|---|---|---|
| A | Shallow depth (3–4 nodes) | 50/50 | 38/50 | 30/50 |
| B | Medium depth (5–6 nodes) | 50/50 | 38/50 | 23/50 |
| C | Deep (7–9 nodes) | 44/50 | 38/50 | 26/50 |
| D | Very deep (10+ nodes) | 48/50 | 38/50 | 36/50 |
| E | Schema drift (column perturbations) | 44/50 | 20/50 | 26/50 |
| F | SQL roundtrip (state persistence) | 42/50 | 36/50 | 45/50 |
The compiled system's failures concentrate almost entirely in Set F, where the unconstrained SQL parameter remains available. Sets A and B are 100%. Schema drift (Set E) shows the largest advantage: 88% vs. 40% for GPT-4.1. Claude Sonnet 4.6 outperforms the compiled system on SQL roundtrip, which is the single clear exception in the benchmark and is directly tied to the open surface described below.
The node registry constrains all surfaces of the plan except one: the raw SQL string passed to the QueryEngine node. In complex tasks where the Aggregator node would be the structurally correct choice, the planner systematically routes computation into the SQL parameter instead — exploiting the only unconstrained surface available.
When you impose constraints on most surfaces of a generation system while leaving others open, the LLM will route computation into whichever surface remains unconstrained — even when structurally inferior options exist. This is not the model being adversarial. It is a direct consequence of optimising for task completion within whatever degrees of freedom are still available.
This is a design-level concern, not a prompting problem. More specific prompts do not fix it. The only fix is either closing the unconstrained surface or explicitly accepting that the remaining freedom will be used.
The SQL surface was not overlooked — it was intentionally left open because SQL is genuinely expressive and constraining it fully would have required a dedicated parser. The result is a measurable tradeoff: the open surface becomes the locus of nearly all remaining failure in an otherwise highly constrained system.
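As an illustration of what detecting this routing could look like short of a full parser, a crude lint over the SQL parameter might flag aggregation that the Aggregator node could have expressed. The pattern list is an assumption made for illustration, not the project's rule set, and a real fix would require proper SQL parsing as noted above.

```python
import re

# Illustrative heuristics: SQL constructs whose work the Aggregator node
# could express structurally. Assumed patterns, not the project's rules.
AGG_PATTERNS = [r"\bGROUP\s+BY\b", r"\bSUM\s*\(", r"\bAVG\s*\(", r"\bCOUNT\s*\("]

def routed_into_sql(query: str) -> bool:
    """True if the query appears to smuggle aggregation into the SQL surface."""
    return any(re.search(p, query, re.IGNORECASE) for p in AGG_PATTERNS)
```

A keyword lint like this narrows the open surface without closing it; queries that aggregate through subqueries or window functions would still slip past, which is exactly why the tradeoff is stated as a design decision rather than solved.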
This generalises beyond SQL. Any system that relies on partial symbolic constraint as a correctness mechanism will exhibit constraint evasion proportional to the expressiveness of its unconstrained surfaces. The lesson is not simply "constrain everything"; it is to know exactly where the open surfaces are and what the planner can still push into them.
Analysing the baseline failures across all 300 tasks identified two systematic causes that account for the majority of free-form generation errors:
Models generate variable-length outputs for functionally identical tasks. In multi-step pipelines this causes inconsistent function signatures, variable naming conventions that drift across steps, and import statements that appear in some generations but not others. The errors are non-deterministic and hard to reproduce.
As pipeline depth increases, the model's attention to early-step constraints degrades. Column names introduced in step one are misreferenced in step four. Schema assumptions made at ingestion are silently violated at persistence. Prompting more carefully helps at shallow depth but does not eliminate the degradation curve.
The compiled system eliminates both causes by construction. Output-length instability cannot affect plan structure because the JSON schema is fixed, and depth-related constraint drift is contained because the validator catches structural mismatches before execution.
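The fixed-schema gate can be sketched as an exact-key check over the plan object: anything the model emits beyond the declared fields fails conformance. Field names follow the example plan shown earlier; the check itself is an assumed illustration, not the project's validator.

```python
def conforms(plan: object) -> bool:
    """True only if the plan has exactly the fixed top-level and per-item keys."""
    if not isinstance(plan, dict) or set(plan) != {"nodes", "edges"}:
        return False
    nodes_ok = all(
        isinstance(n, dict) and set(n) == {"id", "type", "params"}
        for n in plan["nodes"]
    )
    edges_ok = all(
        isinstance(e, dict) and set(e) == {"from", "to"}
        for e in plan["edges"]
    )
    return nodes_ok and edges_ok
```

However verbose or terse a given generation is, only objects of this exact shape reach the validator; extra keys, commentary, or free-form code are rejected at the boundary.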
The current evaluation focuses on structured workflow synthesis under a fixed node registry rather than open-ended agentic tasks. These results should therefore be interpreted as evidence about constrained pipeline compilation, not as a claim about general program synthesis or unrestricted coding agents.
This is v1. The following are deliberate exclusions, not accidental omissions:
- QueryEngine parameter (the known open surface)