Published as a workshop paper at ICLR 2026

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem — American University of Beirut & KAUST

Large Language Models are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks. QuanBench+ introduces a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. Using executable functional tests, KL-divergence-based acceptance, and feedback-based repair, this study reveals that reliable multi-framework quantum code generation remains an open challenge.

59.5% Best Pass@1 (Qiskit)

83.3% Best After Feedback

42 Aligned Tasks

Keywords: LLM, Quantum Programming, Benchmarking, Qiskit, PennyLane, Cirq


Introduction

LLMs have achieved impressive results on classical code generation benchmarks like HumanEval, but quantum code generation presents unique challenges. Unlike classical programs that produce deterministic outputs, quantum programs yield probabilistic measurement statistics. A qubit exists as a superposition \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\), and correctness must be defined in terms of output distributions rather than exact values.

What makes quantum code different?

In classical programming, if you run a function twice with the same input, you get the same output. Quantum programs work differently — they produce probability distributions rather than fixed answers. Think of it like rolling a specially weighted die: the program defines the weights, but each run gives a different result. This means you can't simply check if output == expected_output. Instead, you need statistical methods to verify that the distribution of results is close enough to what's expected.
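To make this concrete, the sketch below simulates measuring a Bell state using only Python's standard library. The `sample_counts` helper, the shot count, and the seeds are illustrative assumptions and are not part of QuanBench+.

```python
import random
from collections import Counter

# Ideal measurement distribution of a Bell state: "00" and "11" with equal probability.
BELL_PROBS = {"00": 0.5, "11": 0.5}

def sample_counts(probs, shots, seed):
    """Draw `shots` measurement outcomes from `probs` using a seeded RNG."""
    rng = random.Random(seed)
    outcomes = rng.choices(list(probs), weights=list(probs.values()), k=shots)
    return Counter(outcomes)

run_a = sample_counts(BELL_PROBS, shots=1000, seed=1)
run_b = sample_counts(BELL_PROBS, shots=1000, seed=2)

# Exact equality is the wrong test: two correct runs rarely agree count for count.
print("identical counts:", run_a == run_b)

# Relative frequencies, however, both sit near the ideal 0.5.
freq_a = run_a["00"] / 1000
freq_b = run_b["00"] / 1000
print("P(00) in run A:", freq_a, "| run B:", freq_b)
```

Both runs come from the same correct "program", so any acceptance rule has to tolerate the small gap between their frequencies.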

Several quantum code benchmarks exist (Qiskit HumanEval, QHackBench, QCircuitBench, QuanBench), but most evaluate models within a single framework only. This makes it impossible to tell whether failures come from weak quantum reasoning or simply unfamiliarity with a specific API.

A multi-framework benchmark is essential because it exposes two distinct failure modes: (i) conceptual errors in quantum reasoning — such as incorrect algorithmic structure or measurement logic — and (ii) framework-specific API errors — such as calling nonexistent methods or misusing parameter conventions.

Key Contributions

Unified multi-framework benchmark — 42 tasks aligned across Qiskit, PennyLane, and Cirq, ensuring identical quantum problems are tested in each framework
Executable functional testing — automated evaluation using Pass@1, Pass@5, and KL-divergence-based acceptance for probabilistic outputs
Feedback-based repair evaluation — measuring how much models improve when given runtime error messages or wrong-answer feedback
12 state-of-the-art LLMs evaluated — including GPT 5.1, DeepSeek R1, Claude 3.7 Sonnet, Gemini 3 Pro, and others

Methodology

Correctness Metrics

Pass@k measures functional correctness: the model generates k code samples and passes if at least one produces the correct output. Pass@1 tests single-shot accuracy while Pass@5 allows the model five attempts.
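The definition above can be sketched directly in code. The function name, the task names, and the toy outcomes are hypothetical. Chen et al. (2021) also describe an unbiased estimator for the case where more than k samples are drawn, whereas this sketch simply follows the plain definition given here.

```python
def pass_at_k(results, k):
    """Fraction of tasks solved by at least one of the first k samples.

    `results` maps a task name to a list of booleans, one per generated sample.
    """
    if not results:
        raise ValueError("no tasks to score")
    solved = sum(1 for samples in results.values() if any(samples[:k]))
    return solved / len(results)

# Hypothetical outcomes for three tasks, five samples each (not benchmark data).
toy_results = {
    "bell_state": [True, True, False, True, True],
    "grover_2q": [False, False, True, False, False],
    "qft_3q": [False, False, False, False, False],
}

print(pass_at_k(toy_results, k=1))  # only bell_state passes on the first sample
print(pass_at_k(toy_results, k=5))  # grover_2q is rescued by its third sample
```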

KL-Divergence Acceptance handles probabilistic quantum outputs. Since quantum measurements are inherently stochastic, exact output matching doesn't work. Instead, QuanBench+ computes the KL divergence \(D_{KL}(P_{\text{ref}} \| P)\) between the reference distribution and the model's output distribution. If the divergence falls below a threshold \(\tau = 0.05\) (calibrated at the 0.997-quantile of a null distribution), the output is accepted.
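A minimal sketch of this acceptance rule follows, assuming both distributions arrive as dictionaries of measurement counts. The helper names and the `eps` floor for outcomes the model never produces are assumptions made for illustration. The benchmark's actual handling of zero-probability outcomes is not described here.

```python
import math

TAU = 0.05  # acceptance threshold in nats, as stated in the text

def normalize(counts):
    """Turn raw measurement counts into a probability distribution."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

def kl_divergence(p_ref, p_model, eps=1e-12):
    """D_KL(p_ref || p_model) in nats.

    `eps` is an assumed floor that keeps the result finite when the model
    never produces an outcome the reference expects.
    """
    total = 0.0
    for outcome, p in p_ref.items():
        if p > 0.0:
            q = max(p_model.get(outcome, 0.0), eps)
            total += p * math.log(p / q)
    return total

def accept(ref_counts, model_counts, tau=TAU):
    """Accept the output when its divergence from the reference is below tau."""
    return kl_divergence(normalize(ref_counts), normalize(model_counts)) < tau

reference = {"00": 500, "11": 500}
close = {"00": 520, "11": 480}   # small sampling deviation
wrong = {"00": 1000}             # never produces "11"

print(accept(reference, close))
print(accept(reference, wrong))
```

The direction of the divergence matters: with the reference as the first argument, an output that misses an expected outcome entirely is penalized heavily.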

KL-Divergence in plain terms

KL-divergence measures how different two probability distributions are. Imagine you have a reference coin that lands heads 60% of the time. If a model's quantum program produces heads 58% of the time, the KL-divergence would be very small (close match). But if it produces heads 90% of the time, the KL-divergence would be large (poor match). QuanBench+ uses a threshold of 0.05 nats — if the difference is below this level, the output is considered correct. This threshold was carefully calibrated: even running the exact same correct program twice produces slight differences due to sampling noise, and the threshold sits just above that natural variation.
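The coin example and the calibration idea can be checked numerically. The sketch below is only an illustration of the principle: the shot count, the number of trials, and the function names are assumptions, and the paper's actual calibration setup is not reproduced here.

```python
import math
import random

def kl_bernoulli(p, q):
    """D_KL between two coins with heads probabilities p (reference) and q, in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(round(kl_bernoulli(0.60, 0.58), 4))  # close match, about 0.0008
print(round(kl_bernoulli(0.60, 0.90), 4))  # poor match, about 0.3112

def null_quantile(p_ref, shots, trials, quantile, seed=0):
    """Estimate a quantile of the KL values a correct coin produces from sampling noise alone."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        heads = sum(rng.random() < p_ref for _ in range(shots))
        # Clamp away from 0 and 1 so the logarithm stays finite.
        q = min(max(heads / shots, 1e-9), 1 - 1e-9)
        values.append(kl_bernoulli(p_ref, q))
    values.sort()
    index = min(int(quantile * trials), trials - 1)
    return values[index]

noise_floor = null_quantile(p_ref=0.60, shots=1000, trials=2000, quantile=0.997)
print(noise_floor < 0.05)
```

For this two-outcome toy case the noise floor lands well below 0.05. Distributions with more outcomes or fewer shots produce larger null values, which is why a threshold has to be calibrated rather than guessed.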

Why not fidelity? State fidelity requires access to the full quantum state vector, which is unavailable on real quantum hardware. QuanBench+ deliberately uses measurement-based correctness criteria that would work on actual quantum devices.

**Figure 1:** QuanBench+ benchmarking workflow — choose framework, send API requests via OpenRouter, parse code responses, execute in isolated sandbox, and validate against canonical solutions
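Two steps of this workflow, parsing the code out of a model reply and executing it away from the harness, might look like the sketch below. Everything here is a hypothetical stand-in: the function names are invented, the API call to OpenRouter is omitted, and a child process is a much weaker boundary than the isolated sandbox the benchmark uses.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # three backticks, built here to keep this example readable

def extract_code(response):
    """Return the first fenced code block in an LLM reply, or the whole reply if none."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response

def run_isolated(code, timeout=30):
    """Execute code in a separate interpreter process and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr

# A made-up reply standing in for what a model might send back.
reply = "Here is the solution:\n" + FENCE + "python\nprint('00', '11')\n" + FENCE
returncode, stdout, stderr = run_isolated(extract_code(reply))
print(returncode, stdout.strip())
```

The captured `stderr` is what a feedback-based repair step would pass back to the model when execution fails.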

Task Categories

31 Quantum Algorithms

6 Gate Decomposition

5 State Preparation

All 42 tasks are aligned across Qiskit, PennyLane, and Cirq with standardized prompts and output normalization. Tasks include classic algorithms like Grover's search, Shor's algorithm, Quantum Fourier Transform, and VQE, as well as gate decomposition challenges and quantum state preparation problems. Each task is evaluated in an isolated sandbox environment with deterministic seed control.

Results

Why does the same AI score so differently across frameworks?

Think of it this way: Qiskit, PennyLane, and Cirq are like three different programming languages that all describe the same quantum operations. An AI that learned quantum programming mainly from Qiskit code examples will know the Qiskit API well but struggle with PennyLane's different function names and conventions — even for the exact same quantum algorithm.

For example, creating a simple quantum NOT gate is qc.x(0) in Qiskit but qml.PauliX(wires=0) in PennyLane. The concept is identical, but the code looks completely different. The gap between the best Qiskit score (59.5%) and the best PennyLane score (42.9%) suggests LLMs are partly memorizing API patterns rather than truly understanding quantum computing.
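For comparison, here is the same single-qubit NOT gate written in all three frameworks. This is a minimal sketch that assumes recent releases of each library. Imports are placed inside the functions so the snippet loads even when only one framework is installed, and the function names are made up for this example.

```python
def x_gate_qiskit():
    from qiskit import QuantumCircuit
    qc = QuantumCircuit(1)
    qc.x(0)  # the gate is a method on the circuit object
    return qc

def x_gate_pennylane():
    import pennylane as qml
    dev = qml.device("default.qubit", wires=1)

    @qml.qnode(dev)
    def circuit():
        qml.PauliX(wires=0)  # the gate is an operation applied inside a QNode
        return qml.probs(wires=[0])

    return circuit()  # measurement probabilities for the single wire

def x_gate_cirq():
    import cirq
    qubit = cirq.LineQubit(0)
    return cirq.Circuit(cirq.X(qubit))  # the gate is applied to a qubit object
```

The three versions differ in where the gate lives (circuit method, free-standing operation, gate applied to a qubit) and in how qubits are addressed. These are exactly the conventions a model has to recall per framework.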

Cross-Framework Performance (RQ1)

Best one-shot scores: Qiskit 59.5% (GPT 5.1), Cirq 54.8% (GPT 5.1), PennyLane 42.9% (DeepSeek R1)
Qiskit is the easiest framework for all but one model (MiniMax M2.1), and PennyLane is usually the hardest, although GPT 4.1, MiniMax M2.1, and Qwen 2.5 7B score higher in PennyLane than in Cirq
Performance varies significantly by framework — models that excel in one framework can struggle in another, suggesting API familiarity matters as much as quantum understanding
GPT 5.1 leads in Qiskit and Cirq, but DeepSeek R1 is competitive in Cirq and PennyLane, especially with feedback

**Figure 2:** Pass@1 scores across 12 LLMs for Qiskit, Cirq, and PennyLane, showing that performance drops sharply from Qiskit to PennyLane

What is Pass@1 vs Pass@5?

Pass@1 means the model gets exactly one attempt — like a student taking a test with no do-overs. Pass@5 gives the model five attempts and counts success if any of the five works. In real-world development, you often generate multiple suggestions and pick the best one, so Pass@5 reflects practical usage better. The gap between Pass@1 and Pass@5 reveals how consistent a model is: a small gap means it reliably generates correct code, while a large gap means it sometimes "gets lucky."

Feedback-Based Repair (RQ3)

When models receive error messages from failed executions or wrong-answer feedback, they can revise their code for up to 5 rounds. This feedback loop dramatically improves performance across all frameworks:

Qiskit 59.5% → 83.3%

Cirq 54.8% → 76.2%

PennyLane 42.9% → 66.7%
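The repair procedure can be sketched as a simple loop. The `generate` and `evaluate` callables are hypothetical stand-ins for the model call and the test harness, the prompt wording is invented, and counting the first attempt as round 1 is an assumption about how the five rounds are tallied.

```python
def repair_loop(task_prompt, generate, evaluate, max_rounds=5):
    """Ask for code, test it, and feed failures back until it passes or rounds run out.

    `generate(prompt)` returns code; `evaluate(code)` returns (passed, message).
    Both are hypothetical callables supplied by the caller.
    """
    prompt = task_prompt
    for round_number in range(1, max_rounds + 1):
        code = generate(prompt)
        passed, message = evaluate(code)
        if passed:
            return code, round_number
        # Include the failing code and the feedback so the next attempt can revise it.
        prompt = (
            f"{task_prompt}\n\nPrevious attempt:\n{code}\n\n"
            f"Feedback:\n{message}\n\nPlease fix the code."
        )
    return None, max_rounds

# Toy stand-ins: the fake model only answers correctly once it has seen feedback.
def fake_generate(prompt):
    return "qc.ccx(0, 1, 2)" if "Feedback" in prompt else "qc.toffoli(0, 1, 2)"

def fake_evaluate(code):
    if "toffoli" in code:
        return False, "AttributeError: 'QuantumCircuit' object has no attribute 'toffoli'"
    return True, ""

fixed_code, rounds_used = repair_loop("Build a Toffoli gate.", fake_generate, fake_evaluate)
print(fixed_code, rounds_used)
```

Runtime errors give the model a precise message to act on, while wrong-answer feedback can only say that the output distribution was off.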

**Figure 3:** Pass@1 after feedback-based repair. DeepSeek R1 reaches 83.3% in Qiskit, showing the power of iterative debugging

Detailed Results

Model	Qiskit Pass@1 (%)	Qiskit Pass@1 with feedback (%)	Cirq Pass@1 (%)	Cirq Pass@1 with feedback (%)	PennyLane Pass@1 (%)	PennyLane Pass@1 with feedback (%)
GPT 5.1	59.5	73.8	54.8	76.2	40.5	66.7
DeepSeek R1	57.1	83.3	52.4	73.8	42.9	66.7
GLM 4.7	50.0	71.4	45.2	61.9	33.3	52.4
Gemini 3 Pro	47.6	69.0	38.1	57.1	26.2	38.1
Claude 3.7 Sonnet	45.2	57.1	35.7	59.5	26.2	47.6
Kimi K2 Thinking	50.0	57.1	33.3	57.1	23.8	45.2
GPT 4.1	45.2	42.9	28.6	40.5	31.0	45.2
DeepSeek Chat	42.9	69.0	38.1	61.9	23.8	64.3
Llama 4 Maverick	40.5	61.9	35.7	50.0	23.8	40.5
Gemini 2.5 Flash	38.1	54.8	28.6	42.9	19.0	38.1
MiniMax M2.1	28.6	57.1	23.8	47.6	31.0	47.6
Qwen 2.5 7B	16.7	19.0	4.8	7.1	11.9	19.0
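Because every score is a fraction of 42 tasks, the percentages in the table can be converted back into task counts and compared across frameworks. The snippet below copies the table values and is only a reading aid. The variable and function names are made up for this example.

```python
TASKS = 42

# (model, Qiskit, Qiskit FB, Cirq, Cirq FB, PennyLane, PennyLane FB), copied from the table.
ROWS = [
    ("GPT 5.1", 59.5, 73.8, 54.8, 76.2, 40.5, 66.7),
    ("DeepSeek R1", 57.1, 83.3, 52.4, 73.8, 42.9, 66.7),
    ("GLM 4.7", 50.0, 71.4, 45.2, 61.9, 33.3, 52.4),
    ("Gemini 3 Pro", 47.6, 69.0, 38.1, 57.1, 26.2, 38.1),
    ("Claude 3.7 Sonnet", 45.2, 57.1, 35.7, 59.5, 26.2, 47.6),
    ("Kimi K2 Thinking", 50.0, 57.1, 33.3, 57.1, 23.8, 45.2),
    ("GPT 4.1", 45.2, 42.9, 28.6, 40.5, 31.0, 45.2),
    ("DeepSeek Chat", 42.9, 69.0, 38.1, 61.9, 23.8, 64.3),
    ("Llama 4 Maverick", 40.5, 61.9, 35.7, 50.0, 23.8, 40.5),
    ("Gemini 2.5 Flash", 38.1, 54.8, 28.6, 42.9, 19.0, 38.1),
    ("MiniMax M2.1", 28.6, 57.1, 23.8, 47.6, 31.0, 47.6),
    ("Qwen 2.5 7B", 16.7, 19.0, 4.8, 7.1, 11.9, 19.0),
]

def tasks_solved(percent, total=TASKS):
    """Convert a Pass@1 percentage back to a whole number of tasks."""
    return round(percent / 100 * total)

# Best one-shot and best post-feedback Qiskit scores expressed as task counts.
print(tasks_solved(59.5), tasks_solved(83.3))

# Mean one-shot Pass@1 per framework across the 12 models.
for name, column in (("Qiskit", 1), ("Cirq", 3), ("PennyLane", 5)):
    mean = sum(row[column] for row in ROWS) / len(ROWS)
    print(f"{name}: {mean:.1f}")

# Models that score higher in PennyLane than in Cirq on the first attempt.
exceptions = [row[0] for row in ROWS if row[5] > row[3]]
print(exceptions)
```

In task terms, the best Qiskit result moves from 25 to 35 of 42 tasks with feedback.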

Error Analysis

Analyzing the 977 failed task attempts across all models and frameworks reveals a clear hierarchy of error types. The most common failure is producing a wrong answer (46.7%), meaning the code runs but produces incorrect quantum states or measurements. Logic errors (25.0%) involve flawed circuit construction, while missing method/gate errors (11.8%) indicate that models hallucinate nonexistent API functions.

Understanding the error types

The six error categories reveal where AI struggles with quantum code. The four most frequent are:

Wrong answer (46.7%): The code runs without crashing but produces incorrect quantum states. This is the hardest type to fix because the logic looks plausible.
Logic errors (25.0%): The quantum circuit is constructed incorrectly — wrong gate order, missing entanglement, etc.
Missing method/gate (11.8%): The model "hallucinates" API functions that don't exist — like calling qc.toffoli() when the correct method is qc.ccx().
Shape mismatch (8.0%): Output has wrong dimensions — e.g., measuring 3 qubits when the test expects 4.
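Some of these categories can be recognized mechanically from how a run fails. The sketch below uses assumed rules chosen for illustration, not the paper's classification procedure. Separating logic errors from plain wrong answers requires inspecting the circuit itself, so this sketch does not attempt it.

```python
def classify_failure(error, output_bits=None, expected_bits=None):
    """Map one failed run to a coarse category (assumed rules, not the paper's)."""
    if isinstance(error, SyntaxError):
        return "syntax error"
    if isinstance(error, AttributeError):
        return "missing method/gate"
    if error is not None:
        return "other runtime error"
    if output_bits is not None and output_bits != expected_bits:
        return "shape mismatch"
    return "wrong answer"

class FakeCircuit:
    """Stand-in for a circuit object that has ccx() but no toffoli()."""
    def ccx(self, a, b, c):
        return self

try:
    FakeCircuit().toffoli(0, 1, 2)  # hallucinated method name
except AttributeError as exc:
    print(classify_failure(exc))

print(classify_failure(None, output_bits=3, expected_bits=4))  # measured too few qubits
print(classify_failure(None, output_bits=4, expected_bits=4))  # ran fine, result still failed
```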

After feedback, syntax errors nearly vanish (4.7% → 1.5%) but wrong answers increase in share (46.7% → 53.4%), showing that deep conceptual errors resist simple fixes.

**Figure 4:** Error distribution before feedback. The 977 failed task attempts are dominated by wrong answers (46.7%) and logic errors (25.0%)

**Figure 5:** Error distribution after feedback — 665 remaining failures shift toward wrong answers (53.4%), as easy-to-fix errors are resolved first

After feedback-based repair, 665 of the original 977 failed attempts remain unsolved. Syntax errors drop from 4.7% to 1.5% of failures and missing method/gate errors from 11.8% to 3.8%, while the share of wrong answers rises to 53.4%. Feedback therefore resolves surface-level issues effectively but struggles with deeper conceptual errors in quantum reasoning.
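Shares of a shrinking total can mislead, so it helps to convert them into approximate counts. The figures below are derived from the rounded percentages quoted in the text and are therefore estimates, not values reported by the paper.

```python
def approx_count(total_failures, share_percent):
    """Approximate number of failures in a category from its rounded percentage share."""
    return round(total_failures * share_percent / 100)

BEFORE_TOTAL, AFTER_TOTAL = 977, 665

# (share before feedback, share after feedback), in percent, as reported in the text.
SHARES = {
    "wrong answer": (46.7, 53.4),
    "missing method/gate": (11.8, 3.8),
    "syntax error": (4.7, 1.5),
}

for category, (before, after) in SHARES.items():
    n_before = approx_count(BEFORE_TOTAL, before)
    n_after = approx_count(AFTER_TOTAL, after)
    print(f"{category}: about {n_before} -> about {n_after}")
```

Wrong answers fall from roughly 456 to roughly 355 attempts. Their share grows only because the other categories shrink much faster.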

Prefill vs No-Prefill

The prefill experiment tests whether giving models the beginning of the solution (import statements and function signature) improves code generation. This simulates a scenario where developers have already set up the basic structure and the model completes the implementation.

What is "prefill"?

Prefill is like giving a student the first few lines of an essay and asking them to continue. In code generation, this means providing the import statements and function signature so the model only needs to write the actual logic. For example, instead of generating from scratch, the model receives the import line "from qiskit import QuantumCircuit" followed by the signature "def solve():" and continues from there. This removes the "boilerplate burden" and tests pure problem-solving ability.
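One way to picture the prefill setting is as two pieces of text: the prompt, and a prefix the model must continue. The sketch below is hypothetical throughout. The prompt wording, the helper names, and the sample completion are invented, and only the Qiskit prefix comes from the example in the text.

```python
PREFILLS = {
    # Qiskit prefill taken from the example in the text; other frameworks would need their own.
    "qiskit": "from qiskit import QuantumCircuit\n\ndef solve():\n",
}

def build_request(task_description, framework, use_prefill):
    """Return (prompt, prefill): the prefill is text the model must continue from."""
    prompt = f"Solve the following task in {framework}.\n\n{task_description}"
    prefill = PREFILLS[framework] if use_prefill else ""
    return prompt, prefill

def assemble_solution(prefill, completion):
    """Join the prefill and the model's continuation into one source string."""
    return prefill + completion

prompt, prefill = build_request("Prepare a Bell state.", "qiskit", use_prefill=True)
# Hypothetical continuation a model might return for the function body.
completion = "    qc = QuantumCircuit(2)\n    qc.h(0)\n    qc.cx(0, 1)\n    return qc\n"
source = assemble_solution(prefill, completion)
compile(source, "<solution>", "exec")  # checks syntax only; nothing is executed
print(source)
```

With prefill, the model cannot get the import or the function name wrong, so any remaining failure has to come from the body it writes.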

**Figure 6:** Prefill vs No-Prefill comparison for Cirq. Gemini 3 Pro jumps from 51.2% to 62.5% with prefill, the largest improvement

Results show that prefill effects vary significantly by model and framework. Some models benefit greatly (Gemini 3 Pro gains 11.3 percentage points in Cirq), while others show minimal change or even slight regressions. The effect is most pronounced in PennyLane, the hardest framework, suggesting that import hints help most when models are least familiar with the API.

Per-Task Performance

These heatmaps show Pass@1 results for every model-task combination in each framework. Each row represents an LLM, each column a task, with blue cells indicating success and white cells indicating failure. The pattern reveals that some tasks are universally difficult while others are easy for most models.

**Qiskit:** Pass@1 heatmap showing the densest success pattern

**Cirq:** Pass@1 heatmap with noticeably more white (failure) cells

**PennyLane:** Pass@1 heatmap showing the sparsest success pattern

Feedback Learning Curves

These charts track how many tasks each model solves as it receives more feedback attempts (1 through 5). Most improvement occurs in the first 2-3 rounds, with diminishing returns afterward. GPT 5.1 and DeepSeek R1 converge fastest across all frameworks.

**Qiskit:** Feedback learning curves showing rapid early improvement

**Cirq:** Feedback learning curves with similar convergence pattern

**PennyLane:** Feedback learning curves showing the steepest improvement

Discussion

Key Insights

The multi-framework evaluation reveals that high performance in one framework does not guarantee competence in another. Models appear to rely on memorized framework-specific API patterns rather than on framework-independent quantum understanding. The large performance gap between Qiskit (most popular, likely most represented in training data) and PennyLane (less common) supports this interpretation. Feedback-based repair proves highly effective: the best Qiskit score rises from 59.5% one-shot (GPT 5.1) to 83.3% after feedback (DeepSeek R1), suggesting that LLMs can use error signals to correct their own code within a session.

Threats to Validity

Benchmark size: 42 tasks provide meaningful coverage but cannot capture the full diversity of quantum programming challenges
API non-determinism: LLM API responses may vary across runs, though Pass@5 helps mitigate this
Framework versions: Results are tied to specific framework versions and may shift with updates

Limitations & Future Work

Only 3 frameworks covered (no Q#, Amazon Braket, or other emerging platforms)
All evaluation is simulator-based — real quantum hardware introduces additional noise and constraints
Future directions include expanding to more frameworks, adding hardware-aware tasks, and studying few-shot and retrieval-augmented approaches

Conclusion

The big picture for AI + quantum computing

This research has a practical implication: if you're using AI tools to write quantum code for your projects, don't trust single-framework results. GPT 5.1, which scores 59.5% in Qiskit, achieves only 40.5% in PennyLane. The good news is that feedback loops help tremendously: iteratively running code, reading error messages, and revising can push the best accuracy from about 60% to over 80%. This mirrors how human developers work: write, test, debug, repeat. The challenge for the field is building AI that truly understands quantum concepts rather than just memorizing API documentation.

QuanBench+ is the first multi-framework quantum code generation benchmark, spanning Qiskit, PennyLane, and Cirq with 42 aligned tasks. The evaluation of 12 state-of-the-art LLMs reveals both progress and significant remaining challenges:

Best one-shot accuracy is 59.5% (Qiskit), demonstrating that quantum code generation remains a challenging frontier for LLMs
Performance gaps across frameworks are substantial — even the strongest models drop significantly on less-familiar frameworks, suggesting API memorization rather than deep quantum understanding
Feedback-based repair boosts the best score to 83.3%, indicating that iterative debugging is a powerful strategy for quantum code generation
Reliable multi-framework quantum code generation remains unsolved and continues to depend heavily on framework-specific knowledge rather than quantum reasoning alone

References

Chen et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
Aleksandrowicz et al. (2019). Qiskit: An Open-source Framework for Quantum Computing.
Bergholm et al. (2018). PennyLane: Automatic differentiation of hybrid quantum-classical computations. arXiv:1811.04968.
Cirq Developers (2021). Cirq: A python framework for creating, editing, and invoking quantum circuits.
Nielsen & Chuang (2010). Quantum Computation and Quantum Information. Cambridge University Press.
Vishwakarma et al. (2024). Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generation.
Basit et al. (2025). QHackBench: Benchmarking Quantum Computing Code Generation.
Wang et al. (2024). QCircuitBench: A Benchmark for Quantum Circuit Generation.
Guo et al. (2025). QuanBench: Benchmarking LLMs on Quantum Computing.
Achiam et al. (2023). GPT-4 Technical Report. OpenAI.
DeepSeek-AI (2024). DeepSeek-R1. Technical Report.
Google (2025). Gemini 3 Pro. Technical Report.
Anthropic (2025). Claude 3.7 Sonnet. Model Card.