Published as a workshop paper at ICLR 2026
Large Language Models are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks. QuanBench+ introduces a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. Using executable functional tests, KL-divergence-based acceptance, and feedback-based repair, this study reveals that reliable multi-framework quantum code generation remains an open challenge.
LLMs have achieved impressive results on classical code generation benchmarks like HumanEval, but quantum code generation presents unique challenges. Unlike classical programs that produce deterministic outputs, quantum programs yield probabilistic measurement statistics. A qubit exists as a superposition \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\), and correctness must be defined in terms of output distributions rather than exact values.
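As a reminder of the standard Born rule (textbook material, not something specific to QuanBench+), the amplitudes fix the measurement probabilities in the computational basis:

\[
P(0) = |\alpha|^2, \qquad P(1) = |\beta|^2, \qquad |\alpha|^2 + |\beta|^2 = 1.
\]

A test can therefore only compare estimated outcome frequencies against these probabilities; a single measurement result says very little.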
In classical programming, if you run a function twice with the same input, you get the same output. Quantum programs work differently — they produce probability distributions rather than fixed answers. Think of it like rolling a specially weighted die: the program defines the weights, but each run gives a different result. This means you can't simply check if output == expected_output. Instead, you need statistical methods to verify that the distribution of results is close enough to what's expected.
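The weighted-die picture can be made concrete with a minimal sketch in plain Python. The 60/40 weights, the shot count, and the helper name `sample_counts` are illustrative assumptions, not part of the benchmark.

```python
import random
from collections import Counter

def sample_counts(weights, shots, seed):
    """Draw `shots` outcomes from a fixed distribution and tally them."""
    rng = random.Random(seed)
    outcomes = list(weights)
    draws = rng.choices(outcomes, weights=[weights[o] for o in outcomes], k=shots)
    return Counter(draws)

# Hypothetical one-qubit "program": 60% chance of "0", 40% chance of "1".
weights = {"0": 0.6, "1": 0.4}
SHOTS = 1000

run_a = sample_counts(weights, SHOTS, seed=1)
run_b = sample_counts(weights, SHOTS, seed=2)

# Same program, different runs: the raw counts generally differ,
# so comparing them with == is not a meaningful correctness check.
print(run_a, run_b)

# The estimated frequencies, however, both land near the defined weight.
freq_a = run_a["0"] / SHOTS
freq_b = run_b["0"] / SHOTS
print(freq_a, freq_b)
```

Fixing the seed makes each individual run reproducible, but a correctness check still has to tolerate the spread between runs.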
Several quantum code benchmarks exist (Qiskit HumanEval, QHackBench, QCircuitBench, QuanBench), but most evaluate models within a single framework only. This makes it impossible to tell whether failures come from weak quantum reasoning or simply unfamiliarity with a specific API.
A multi-framework benchmark is essential because it exposes two distinct failure modes: (i) conceptual errors in quantum reasoning — such as incorrect algorithmic structure or measurement logic — and (ii) framework-specific API errors — such as calling nonexistent methods or misusing parameter conventions.
Pass@k measures functional correctness: the model generates k code samples and passes if at least one produces the correct output. Pass@1 tests single-shot accuracy while Pass@5 allows the model five attempts.
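A minimal sketch of this definition, assuming each task has a recorded list of pass/fail outcomes per sample. The task names and outcomes are hypothetical, and the source does not say whether the benchmark uses this direct count or a statistical estimator of Pass@k.

```python
def pass_at_k(results, k):
    """Fraction of tasks where at least one of the first k samples passed.

    `results` maps a task name to a list of booleans, one per generated sample.
    """
    if not results:
        raise ValueError("no tasks to score")
    solved = sum(1 for samples in results.values() if any(samples[:k]))
    return solved / len(results)

# Hypothetical outcomes for three tasks, five samples each.
results = {
    "bell_state": [True, True, False, True, True],
    "grover_2q": [False, False, True, False, False],
    "qft_3q": [False, False, False, False, False],
}

print(pass_at_k(results, 1))  # only bell_state passes on its first sample
print(pass_at_k(results, 5))  # grover_2q is also solved within five samples
```

In this toy data the score rises from one task in three to two in three, which is the kind of Pass@1 to Pass@5 gap that signals inconsistent generation.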
KL-Divergence Acceptance handles probabilistic quantum outputs. Since quantum measurements are inherently stochastic, exact output matching doesn't work. Instead, QuanBench+ computes the KL divergence \(D_{KL}(P_{\text{ref}} \| P)\) between the reference distribution and the model's output distribution. If the divergence falls below a threshold \(\tau = 0.05\) (calibrated at the 0.997-quantile of a null distribution), the output is accepted.
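Written out with the standard definition of KL divergence (natural logarithm, so the result is in nats), the acceptance rule reads as follows, where \(x\) ranges over the measured outcomes. How the benchmark treats outcomes with \(P(x) = 0\) is not stated here.

\[
D_{KL}(P_{\text{ref}} \| P) = \sum_{x} P_{\text{ref}}(x)\,\ln\frac{P_{\text{ref}}(x)}{P(x)}, \qquad \text{accept if } D_{KL}(P_{\text{ref}} \| P) < \tau = 0.05.
\]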
KL-divergence measures how different two probability distributions are. Imagine you have a reference coin that lands heads 60% of the time. If a model's quantum program produces heads 58% of the time, the KL-divergence would be very small (close match). But if it produces heads 90% of the time, the KL-divergence would be large (poor match). QuanBench+ uses a threshold of 0.05 nats — if the difference is below this level, the output is considered correct. This threshold was carefully calibrated: even running the exact same correct program twice produces slight differences due to sampling noise, and the threshold sits just above that natural variation.
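The coin example can be checked numerically, along with a toy version of the noise calibration. This is a sketch under stated assumptions: the smoothing floor `eps`, the 1000 shots per run, and the 500 null trials are illustrative choices, not the benchmark's actual settings.

```python
import math
import random

TAU = 0.05  # acceptance threshold in nats, as quoted above

def kl_divergence(p_ref, p, eps=1e-12):
    """D_KL(p_ref || p) in nats, summed over outcomes with nonzero reference mass.

    `eps` is an assumed floor so an outcome missing from `p` does not divide by zero.
    """
    total = 0.0
    for outcome, ref_prob in p_ref.items():
        if ref_prob > 0:
            total += ref_prob * math.log(ref_prob / max(p.get(outcome, 0.0), eps))
    return total

def empirical(p, shots, rng):
    """Sample `shots` outcomes from `p` and return the observed frequencies."""
    outcomes = list(p)
    draws = rng.choices(outcomes, weights=[p[o] for o in outcomes], k=shots)
    return {o: draws.count(o) / shots for o in outcomes}

reference = {"H": 0.6, "T": 0.4}
close = kl_divergence(reference, {"H": 0.58, "T": 0.42})
far = kl_divergence(reference, {"H": 0.90, "T": 0.10})
print(f"{close:.4f} {far:.4f}")  # 0.0008 0.3112

# Toy null distribution: KL between the reference and finite samples of itself.
rng = random.Random(0)
null = sorted(
    kl_divergence(reference, empirical(reference, 1000, rng)) for _ in range(500)
)
quantile = null[int(0.997 * (len(null) - 1))]
print(quantile, quantile < TAU)
```

The 58% coin is accepted and the 90% coin is rejected, and in this toy setup the sampling noise of a correct program stays well below the 0.05 threshold.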
Why not fidelity? State fidelity requires access to the full quantum state vector, which is unavailable on real quantum hardware. QuanBench+ deliberately uses measurement-based correctness criteria that would work on actual quantum devices.
All 42 tasks are aligned across Qiskit, PennyLane, and Cirq with standardized prompts and output normalization. Tasks include classic algorithms like Grover's search, Shor's algorithm, Quantum Fourier Transform, and VQE, as well as gate decomposition challenges and quantum state preparation problems. Each task is evaluated in an isolated sandbox environment with deterministic seed control.
Think of it this way: Qiskit, PennyLane, and Cirq are like three different programming languages that all describe the same quantum operations. An AI that learned quantum programming mainly from Qiskit code examples will know the Qiskit API well but struggle with PennyLane's different function names and conventions — even for the exact same quantum algorithm.
For example, creating a simple quantum NOT gate is qc.x(0) in Qiskit but qml.PauliX(wires=0) in PennyLane. The concept is identical, but the code looks completely different. The gap between the best Qiskit Pass@1 score (59.5%) and the best PennyLane Pass@1 score (42.9%) suggests LLMs are partly memorizing API patterns rather than truly understanding quantum computing.
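The NOT-gate example can be written out for all three frameworks. This is only an illustrative sketch, not a benchmark task: the function names are made up for the comparison, and each import sits inside its function so the definitions load even when a framework is not installed.

```python
def x_gate_qiskit():
    """Qiskit: gates are methods on a QuantumCircuit object."""
    from qiskit import QuantumCircuit

    qc = QuantumCircuit(1, 1)
    qc.x(0)           # Pauli-X (NOT) on qubit 0
    qc.measure(0, 0)  # measure qubit 0 into classical bit 0
    return qc

def x_gate_pennylane():
    """PennyLane: gates are operations called inside a QNode function."""
    import pennylane as qml

    dev = qml.device("default.qubit", wires=1)

    @qml.qnode(dev)
    def circuit():
        qml.PauliX(wires=0)        # Pauli-X (NOT) on wire 0
        return qml.probs(wires=0)  # outcome probabilities for wire 0

    return circuit

def x_gate_cirq():
    """Cirq: gates are applied to qubit objects and collected into a Circuit."""
    import cirq

    q = cirq.LineQubit(0)
    return cirq.Circuit(cirq.X(q), cirq.measure(q, key="m"))
```

Beyond the gate name, the three versions differ in how qubits are addressed (index, wire, qubit object) and in how results are requested, which is where framework-specific errors tend to creep in.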
Pass@1 means the model gets exactly one attempt — like a student taking a test with no do-overs. Pass@5 gives the model five attempts and counts success if any of the five works. In real-world development, you often generate multiple suggestions and pick the best one, so Pass@5 reflects practical usage better. The gap between Pass@1 and Pass@5 reveals how consistent a model is: a small gap means it reliably generates correct code, while a large gap means it sometimes "gets lucky."
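When models receive error messages from failed executions or wrong-answer feedback, they can revise their code for up to 5 rounds. This feedback loop raises Pass@1 for nearly every model and framework; the one exception in the table is GPT 4.1 on Qiskit, which slips from 45.2 to 42.9. All values are percentages, and FB marks the score after feedback: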
| Model | Qiskit Pass@1 | Qiskit Pass@1 (FB) | Cirq Pass@1 | Cirq Pass@1 (FB) | PennyLane Pass@1 | PennyLane Pass@1 (FB) |
|---|---|---|---|---|---|---|
| GPT 5.1 | 59.5 | 73.8 | 54.8 | 76.2 | 40.5 | 66.7 |
| DeepSeek R1 | 57.1 | 83.3 | 52.4 | 73.8 | 42.9 | 66.7 |
| GLM 4.7 | 50.0 | 71.4 | 45.2 | 61.9 | 33.3 | 52.4 |
| Gemini 3 Pro | 47.6 | 69.0 | 38.1 | 57.1 | 26.2 | 38.1 |
| Claude 3.7 Sonnet | 45.2 | 57.1 | 35.7 | 59.5 | 26.2 | 47.6 |
| Kimi K2 Thinking | 50.0 | 57.1 | 33.3 | 57.1 | 23.8 | 45.2 |
| GPT 4.1 | 45.2 | 42.9 | 28.6 | 40.5 | 31.0 | 45.2 |
| DeepSeek Chat | 42.9 | 69.0 | 38.1 | 61.9 | 23.8 | 64.3 |
| Llama 4 Maverick | 40.5 | 61.9 | 35.7 | 50.0 | 23.8 | 40.5 |
| Gemini 2.5 Flash | 38.1 | 54.8 | 28.6 | 42.9 | 19.0 | 38.1 |
| MiniMax M2.1 | 28.6 | 57.1 | 23.8 | 47.6 | 31.0 | 47.6 |
| Qwen 2.5 7B | 16.7 | 19.0 | 4.8 | 7.1 | 11.9 | 19.0 |
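The repair loop behind the FB columns can be sketched as follows. The `generate` and `evaluate` callables are hypothetical stand-ins for a model call and a sandboxed test run, and treating the five rounds as coming after one initial attempt is an assumption about how rounds are counted.

```python
from typing import Callable, Optional, Tuple

def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_rounds: int = 5,
) -> Tuple[bool, int]:
    """Return (solved, repair_rounds_used) for one task."""
    code = generate(prompt, None)  # initial attempt, no feedback yet
    passed, feedback = evaluate(code)
    rounds = 0
    while not passed and rounds < max_rounds:
        rounds += 1
        code = generate(prompt, feedback)  # revise using the last error message
        passed, feedback = evaluate(code)
    return passed, rounds

# Stub model: hallucinates a method first, corrects it once it sees feedback.
def fake_generate(prompt, feedback):
    return "qc.ccx(0, 1, 2)" if feedback else "qc.toffoli(0, 1, 2)"

# Stub evaluator: passes only code that uses the existing method.
def fake_evaluate(code):
    if "ccx" in code:
        return True, ""
    return False, "AttributeError: 'QuantumCircuit' object has no attribute 'toffoli'"

outcome = repair_loop(fake_generate, fake_evaluate, "Build a Toffoli circuit")
print(outcome)  # (True, 1)
```

An error message that names the missing attribute is easy to act on, whereas a bare wrong-answer signal gives the model much less to work with.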
Analyzing the 977 failed task attempts across all models and frameworks reveals a clear hierarchy of error types. The most common failure is producing a wrong answer (46.7%), meaning the code runs but produces incorrect quantum states or measurements. Logic errors (25.0%) involve flawed circuit construction, while missing method/gate errors (11.8%) indicate that models hallucinate nonexistent API functions.
The six error categories reveal where AI struggles with quantum code. A missing method/gate error, for example, is calling qc.toffoli() when the correct method is qc.ccx(). After feedback, syntax errors nearly vanish (4.7% → 1.5%) but wrong answers increase in share (46.7% → 53.4%), showing that deep conceptual errors resist simple fixes.
After feedback-based repair, only 665 tasks remain unsolved. Notably, syntax errors drop from 4.7% to 1.5% and missing method/gate errors from 11.8% to 3.8%, while the share of wrong answers increases to 53.4%. This shows that feedback effectively resolves surface-level issues but struggles with deeper conceptual errors in quantum reasoning.
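A quick back-of-the-envelope check shows what these shares mean in absolute terms. The counts below are derived from the rounded percentages quoted above, so they are approximations and not figures reported by the study.

```python
FAILED_BEFORE, FAILED_AFTER = 977, 665

# Shares of all failures, in percent, before and after feedback.
shares_before = {"wrong answer": 46.7, "missing method/gate": 11.8, "syntax": 4.7}
shares_after = {"wrong answer": 53.4, "missing method/gate": 3.8, "syntax": 1.5}

counts = {}
for category in shares_before:
    before = round(FAILED_BEFORE * shares_before[category] / 100)
    after = round(FAILED_AFTER * shares_after[category] / 100)
    counts[category] = (before, after)
    print(f"{category}: ~{before} -> ~{after}")
# wrong answer: ~456 -> ~355
# missing method/gate: ~115 -> ~25
# syntax: ~46 -> ~10
```

Wrong answers fall in absolute terms but far more slowly than the other two categories, which is why their share of the remaining failures grows.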
The prefill experiment tests whether giving models the beginning of the solution (import statements and function signature) improves code generation. This simulates a scenario where developers have already set up the basic structure and the model completes the implementation.
Prefill is like giving a student the first few lines of an essay and asking them to continue. In code generation, this means providing the import statements and function signature so the model only needs to write the actual logic. For example, instead of generating from scratch, the model receives from qiskit import QuantumCircuit followed by def solve(): and continues from there. This removes the "boilerplate burden" and tests pure problem-solving ability.
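A prefill prompt of this kind might be assembled as in the sketch below. The helper `build_prefill` and the sample completion are hypothetical, and the exact prefix format used in the experiment is not given here.

```python
def build_prefill(imports, signature):
    """Join import lines and a function signature into the text the model continues."""
    # The trailing four spaces open the indented function body.
    return "\n".join(list(imports) + ["", signature, "    "])

prefix = build_prefill(["from qiskit import QuantumCircuit"], "def solve():")

# Hypothetical continuation a model might produce for a one-qubit NOT circuit.
completion = "qc = QuantumCircuit(1)\n    qc.x(0)\n    return qc\n"

candidate = prefix + completion
print(candidate)
```

The evaluated program is the prefix plus the continuation, so the model never has to choose the import path or the function name itself.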
Results show that prefill effects vary significantly by model and framework. Some models benefit greatly (Gemini 3 Pro sees a +11.3% boost in Cirq), while others show minimal change or even slight regressions. The effect is most pronounced in PennyLane, the hardest framework, suggesting that import hints help most when models are least familiar with the API.
These heatmaps show Pass@1 results for every model-task combination in each framework. Each row represents an LLM, each column a task, with blue cells indicating success and white cells indicating failure. The pattern reveals that some tasks are universally difficult while others are easy for most models.
These charts track how many tasks each model solves as it receives more feedback attempts (1 through 5). Most improvement occurs in the first 2-3 rounds, with diminishing returns afterward. GPT 5.1 and DeepSeek R1 converge fastest across all frameworks.
The multi-framework evaluation reveals that high performance in one framework does not guarantee competence in another. Models appear to memorize framework-specific API patterns rather than developing true quantum understanding. The large performance gap between Qiskit (most popular, likely most represented in training data) and PennyLane (less common) supports this interpretation. Feedback-based repair proves highly effective: the best Qiskit score rises from 59.5% (GPT 5.1) to 83.3% (DeepSeek R1, up from its own 57.1%), suggesting that LLMs can use error signals to correct their code within a session.
This research has a practical implication: if you're using AI tools to write quantum code for your projects, don't trust single-framework results. GPT 5.1, for instance, scores 59.5% in Qiskit but only 40.5% in PennyLane. The good news is that feedback loops help tremendously: iteratively running code, reading error messages, and revising can push Qiskit accuracy from just under 60% to over 80% in the best case. This mirrors how human developers work: write, test, debug, repeat. The challenge for the field is building AI that truly understands quantum concepts rather than just memorizing API documentation.
QuanBench+ is the first multi-framework quantum code generation benchmark, spanning Qiskit, PennyLane, and Cirq with 42 aligned tasks. The evaluation of 12 state-of-the-art LLMs reveals both progress and significant remaining challenges.