Published as a workshop paper at ICLR 2026

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Ali Slim, Haydar Hamieh, Jawad Kotaich, Yehya Ghosn, Mahdi Chehimi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem — American University of Beirut & KAUST

Large Language Models are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks. QuanBench+ introduces a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. Using executable functional tests, KL-divergence-based acceptance, and feedback-based repair, this study reveals that reliable multi-framework quantum code generation remains an open challenge.

59.5% Best Pass@1 (Qiskit)
83.3% Best After Feedback
42 Aligned Tasks
Keywords: LLM, Quantum Programming, Benchmarking, Qiskit, PennyLane, Cirq

Introduction

LLMs have achieved impressive results on classical code generation benchmarks like HumanEval, but quantum code generation presents unique challenges. Unlike classical programs that produce deterministic outputs, quantum programs yield probabilistic measurement statistics. A qubit exists as a superposition \(|\psi\rangle = \alpha|0\rangle + \beta|1\rangle\), and correctness must be defined in terms of output distributions rather than exact values.

What makes quantum code different?

In classical programming, if you run a function twice with the same input, you get the same output. Quantum programs work differently — they produce probability distributions rather than fixed answers. Think of it like rolling a specially weighted die: the program defines the weights, but each run gives a different result. This means you can't simply check if output == expected_output. Instead, you need statistical methods to verify that the distribution of results is close enough to what's expected.
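The weighted-die picture can be made concrete with a small simulation. This is a minimal sketch using only the Python standard library, not a quantum simulator: `run_program`, `close_to_expected`, and the tolerance value are hypothetical names and choices made for illustration.

```python
import random
from collections import Counter

def run_program(p_one: float, shots: int, rng: random.Random) -> Counter:
    """Sample `shots` measurements of a qubit that reads 1 with probability p_one."""
    return Counter("1" if rng.random() < p_one else "0" for _ in range(shots))

def close_to_expected(counts: Counter, p_one: float, tol: float) -> bool:
    """Statistical check: is the observed frequency of '1' within tol of p_one?"""
    shots = sum(counts.values())
    return abs(counts["1"] / shots - p_one) <= tol

rng = random.Random(7)  # seeded only so the sketch is repeatable
run_a = run_program(0.5, 1000, rng)
run_b = run_program(0.5, 1000, rng)

# Both runs come from the same "program", yet their counts rarely coincide,
# so an equality check between runs is the wrong test.
identical_counts = run_a == run_b

# Comparing each run against the expected distribution is the right kind of test.
both_plausible = close_to_expected(run_a, 0.5, tol=0.1) and close_to_expected(run_b, 0.5, tol=0.1)
```

A fixed tolerance on one frequency is the simplest possible statistical check; the benchmark itself uses a KL-divergence criterion, described in the Methodology section.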

Several quantum code benchmarks exist (Qiskit HumanEval, QHackBench, QCircuitBench, QuanBench), but most evaluate models within a single framework only. This makes it impossible to tell whether failures come from weak quantum reasoning or simply unfamiliarity with a specific API.

A multi-framework benchmark is essential because it exposes two distinct failure modes: (i) conceptual errors in quantum reasoning — such as incorrect algorithmic structure or measurement logic — and (ii) framework-specific API errors — such as calling nonexistent methods or misusing parameter conventions.

Key Contributions

  • Unified multi-framework benchmark — 42 tasks aligned across Qiskit, PennyLane, and Cirq, ensuring identical quantum problems are tested in each framework
  • Executable functional testing — automated evaluation using Pass@1, Pass@5, and KL-divergence-based acceptance for probabilistic outputs
  • Feedback-based repair evaluation — measuring how much models improve when given runtime error messages or wrong-answer feedback
  • 12 state-of-the-art LLMs evaluated — including GPT 5.1, DeepSeek R1, Claude 3.7 Sonnet, Gemini 3 Pro, and others

Methodology

Correctness Metrics

Pass@k measures functional correctness: the model generates k code samples and passes if at least one produces the correct output. Pass@1 tests single-shot accuracy while Pass@5 allows the model five attempts.
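As a rough illustration of the definition above, Pass@k can be computed from a table of per-sample pass/fail outcomes. This sketch follows the plain "at least one of k samples passes" reading given in the text; whether the paper uses this direct count or a sampling-based estimator is not stated here, and the task names and outcomes below are invented.

```python
from typing import Dict, List

def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k samples passed."""
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(1 for samples in results.values() if any(samples[:k]))
    return solved / len(results)

# Hypothetical outcomes: one list of five sample results per task.
outcomes = {
    "bell_state": [True, True, False, True, True],
    "grover_2q": [False, False, True, False, False],
    "qft_3q": [False, False, False, False, False],
}

pass_1 = pass_at_k(outcomes, k=1)  # only bell_state passes on the first sample
pass_5 = pass_at_k(outcomes, k=5)  # grover_2q is also solved within five samples
```

In this toy data the gap between `pass_1` and `pass_5` comes entirely from `grover_2q`, which succeeds only on its third sample.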

KL-Divergence Acceptance handles probabilistic quantum outputs. Since quantum measurements are inherently stochastic, exact output matching doesn't work. Instead, QuanBench+ computes the KL divergence \(D_{KL}(P_{\text{ref}} \| P)\) between the reference distribution and the model's output distribution. If the divergence falls below a threshold \(\tau = 0.05\) (calibrated at the 0.997-quantile of a null distribution), the output is accepted.
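The acceptance rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the benchmark's implementation: the function names are hypothetical, the inputs are assumed to be measurement counts keyed by bitstring, and the small floor `eps` used to avoid taking the logarithm of zero is a choice made here, not something the source specifies.

```python
import math
from typing import Dict

TAU = 0.05  # acceptance threshold in nats, as quoted in the text

def to_distribution(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw measurement counts into probabilities."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

def kl_divergence(p_ref: Dict[str, float], p: Dict[str, float], eps: float = 1e-9) -> float:
    """D_KL(p_ref || p) in nats; eps is an assumed floor that avoids log(0)."""
    total = 0.0
    for outcome, ref_prob in p_ref.items():
        if ref_prob > 0:
            total += ref_prob * math.log(ref_prob / max(p.get(outcome, 0.0), eps))
    return total

def accept(ref_counts: Dict[str, int], out_counts: Dict[str, int], tau: float = TAU) -> bool:
    """Accept the output when its divergence from the reference is below tau."""
    return kl_divergence(to_distribution(ref_counts), to_distribution(out_counts)) < tau

# Hypothetical Bell-state check: the reference splits evenly between 00 and 11.
reference = {"00": 500, "11": 500}
noisy_but_correct = {"00": 480, "11": 520}
wrong_state = {"00": 1000}
```

Here `accept(reference, noisy_but_correct)` holds because the divergence is roughly 0.0008 nats, while `wrong_state` is rejected because it never produces the outcome 11.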

KL-Divergence in plain terms

KL-divergence measures how different two probability distributions are. Imagine you have a reference coin that lands heads 60% of the time. If a model's quantum program produces heads 58% of the time, the KL-divergence would be very small (close match). But if it produces heads 90% of the time, the KL-divergence would be large (poor match). QuanBench+ uses a threshold of 0.05 nats — if the difference is below this level, the output is considered correct. This threshold was carefully calibrated: even running the exact same correct program twice produces slight differences due to sampling noise, and the threshold sits just above that natural variation.
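The coin example can be checked numerically, and the idea of calibrating a threshold against sampling noise can be sketched by simulation. The calibration below is only a guess at the general approach: the shot count, trial count, and the choice to compare the exact reference against its own samples are assumptions, so `tau_hat` is not expected to reproduce the paper's 0.05.

```python
import math
import random

def kl_coin(p_ref: float, p: float) -> float:
    """D_KL between two coins with heads probabilities p_ref and p, in nats."""
    return (p_ref * math.log(p_ref / p)
            + (1 - p_ref) * math.log((1 - p_ref) / (1 - p)))

close = kl_coin(0.60, 0.58)  # about 0.0008 nats, well under 0.05
far = kl_coin(0.60, 0.90)    # about 0.31 nats, well over 0.05

def null_threshold(p_ref: float, shots: int, trials: int,
                   quantile: float, rng: random.Random) -> float:
    """Empirical quantile of the KL between a reference coin and its own samples."""
    values = []
    for _ in range(trials):
        heads = sum(rng.random() < p_ref for _ in range(shots))
        # Clamp so an all-heads or all-tails sample cannot produce log(0).
        p_hat = min(max(heads / shots, 1e-9), 1 - 1e-9)
        values.append(kl_coin(p_ref, p_hat))
    values.sort()
    return values[min(int(quantile * trials), trials - 1)]

# Hypothetical settings: 1000 repetitions of a 500-shot run of the correct program.
tau_hat = null_threshold(0.60, shots=500, trials=1000,
                         quantile=0.997, rng=random.Random(0))
```

The point of the simulation is that a correct program still shows a nonzero divergence from its own reference, and a high quantile of that noise gives a natural place to put the threshold.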

Why not fidelity? State fidelity requires access to the full quantum state vector, which is unavailable on real quantum hardware. QuanBench+ deliberately uses measurement-based correctness criteria that would work on actual quantum devices.

Figure 1: QuanBench+ benchmarking workflow — choose framework, send API requests via OpenRouter, parse code responses, execute in isolated sandbox, and validate against canonical solutions
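The parse and execute steps of this workflow might look roughly like the sketch below. It is not the benchmark's harness: `extract_code` and `run_isolated` are hypothetical helpers, and a child Python process with a timeout is only a stand-in for a real sandbox, since it does not restrict file or network access.

```python
import re
import subprocess
import sys
from typing import Tuple

FENCE = "`" * 3  # built at runtime so this sketch contains no literal fence

def extract_code(response: str) -> str:
    """Return the first fenced code block in a model response, or the raw text."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response

def run_isolated(code: str, timeout_s: float = 10.0) -> Tuple[bool, str, str]:
    """Run code in a fresh interpreter and return (ok, stdout, stderr)."""
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
    return proc.returncode == 0, proc.stdout, proc.stderr

# Hypothetical model response wrapping a one-line program in a fenced block.
response = "Here is the solution:\n" + FENCE + "python\nprint(1 + 1)\n" + FENCE
ok, stdout, stderr = run_isolated(extract_code(response))
```

The captured `stderr` is what a feedback step would hand back to the model when `ok` is false.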

Task Categories

31 Quantum Algorithms
6 Gate Decomposition
5 State Preparation

All 42 tasks are aligned across Qiskit, PennyLane, and Cirq with standardized prompts and output normalization. Tasks include classic algorithms like Grover's search, Shor's algorithm, Quantum Fourier Transform, and VQE, as well as gate decomposition challenges and quantum state preparation problems. Each task is evaluated in an isolated sandbox environment with deterministic seed control.

Results

Why does the same AI score so differently across frameworks?

Think of it this way: Qiskit, PennyLane, and Cirq are like three different programming languages that all describe the same quantum operations. An AI that learned quantum programming mainly from Qiskit code examples will know the Qiskit API well but struggle with PennyLane's different function names and conventions — even for the exact same quantum algorithm.

For example, creating a simple quantum NOT gate is qc.x(0) in Qiskit but qml.PauliX(wires=0) in PennyLane. The concept is identical, but the code looks completely different. The gap between Qiskit scores (59.5%) and PennyLane scores (42.9%) suggests LLMs are partly memorizing API patterns rather than truly understanding quantum computing.

Cross-Framework Performance (RQ1)

  • Best one-shot scores: Qiskit 59.5% (GPT 5.1), Cirq 54.8% (GPT 5.1), PennyLane 42.9% (DeepSeek R1)
  • Qiskit is the easiest framework for every model except MiniMax M2.1, and PennyLane is the hardest for most; GPT 4.1, MiniMax M2.1, and Qwen 2.5 7B score higher on PennyLane than on Cirq
  • Performance varies significantly by framework — models that excel in one framework can struggle in another, suggesting API familiarity matters as much as quantum understanding
  • GPT 5.1 leads in Qiskit and Cirq, but DeepSeek R1 is competitive in Cirq and PennyLane, especially with feedback
Figure 2: Pass@1 scores across 12 LLMs for Qiskit, Cirq, and PennyLane — showing that performance drops sharply from Qiskit to PennyLane

What is Pass@1 vs Pass@5?

Pass@1 means the model gets exactly one attempt — like a student taking a test with no do-overs. Pass@5 gives the model five attempts and counts success if any of the five works. In real-world development, you often generate multiple suggestions and pick the best one, so Pass@5 reflects practical usage better. The gap between Pass@1 and Pass@5 reveals how consistent a model is: a small gap means it reliably generates correct code, while a large gap means it sometimes "gets lucky."

Feedback-Based Repair (RQ3)

When models receive error messages from failed executions or wrong-answer feedback, they can revise their code for up to 5 rounds. This feedback loop dramatically improves performance across all frameworks:

| Framework | Best Pass@1 | Best Pass@1 after feedback |
|---|---|---|
| Qiskit | 59.5% | 83.3% |
| Cirq | 54.8% | 76.2% |
| PennyLane | 42.9% | 66.7% |
Figure 3: Pass@1 after feedback-based repair — DeepSeek R1 reaches 83.3% in Qiskit, showing the power of iterative debugging
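The repair procedure described in this section can be sketched as a short loop. The structure below is an assumption about how such a loop is typically wired: `generate` and `evaluate` are hypothetical callables standing in for the model call and the test harness, and counting the first attempt as round 1 is a choice made here, since the source only says "up to 5 rounds".

```python
from typing import Callable, Optional, Tuple

def repair_loop(generate: Callable[[str, Optional[str]], str],
                evaluate: Callable[[str], Tuple[bool, str]],
                prompt: str,
                max_rounds: int = 5) -> Optional[int]:
    """Return the 1-based round at which the code passed, or None if it never did."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt, feedback)   # the model sees the last error, if any
        passed, message = evaluate(code)    # runtime error or wrong-answer report
        if passed:
            return round_no
        feedback = message
    return None

# Stub model: wrong at first, correct once it has seen any feedback.
def stub_model(prompt: str, feedback: Optional[str]) -> str:
    return "qc.ccx(0, 1, 2)" if feedback else "qc.toffoli(0, 1, 2)"

# Stub harness: accepts only the ccx call and reports an error otherwise.
def stub_harness(code: str) -> Tuple[bool, str]:
    if "ccx" in code:
        return True, ""
    return False, "AttributeError: 'QuantumCircuit' object has no attribute 'toffoli'"

solved_in = repair_loop(stub_model, stub_harness, "Build a Toffoli gate")
```

With these stubs the loop succeeds in the second round, which mirrors the kind of missing-method error that feedback tends to fix quickly.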

Detailed Results

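All values are percentages; FB = after feedback-based repair.

| Model | Qiskit Pass@1 | Qiskit Pass@1 (FB) | Cirq Pass@1 | Cirq Pass@1 (FB) | PennyLane Pass@1 | PennyLane Pass@1 (FB) |
|---|---|---|---|---|---|---|
| GPT 5.1 | 59.5 | 73.8 | 54.8 | 76.2 | 40.5 | 66.7 |
| DeepSeek R1 | 57.1 | 83.3 | 52.4 | 73.8 | 42.9 | 66.7 |
| GLM 4.7 | 50.0 | 71.4 | 45.2 | 61.9 | 33.3 | 52.4 |
| Gemini 3 Pro | 47.6 | 69.0 | 38.1 | 57.1 | 26.2 | 38.1 |
| Claude 3.7 Sonnet | 45.2 | 57.1 | 35.7 | 59.5 | 26.2 | 47.6 |
| Kimi K2 Thinking | 50.0 | 57.1 | 33.3 | 57.1 | 23.8 | 45.2 |
| GPT 4.1 | 45.2 | 42.9 | 28.6 | 40.5 | 31.0 | 45.2 |
| DeepSeek Chat | 42.9 | 69.0 | 38.1 | 61.9 | 23.8 | 64.3 |
| Llama 4 Maverick | 40.5 | 61.9 | 35.7 | 50.0 | 23.8 | 40.5 |
| Gemini 2.5 Flash | 38.1 | 54.8 | 28.6 | 42.9 | 19.0 | 38.1 |
| MiniMax M2.1 | 28.6 | 57.1 | 23.8 | 47.6 | 31.0 | 47.6 |
| Qwen 2.5 7B | 16.7 | 19.0 | 4.8 | 7.1 | 11.9 | 19.0 |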
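Because the benchmark has only 42 tasks, each percentage corresponds to a small whole number of tasks, and it can help to read the best scores that way. The conversion below assumes each reported score is a single-run fraction of the 42 tasks, which the rounded percentages are consistent with but the source does not state outright. Note that the best score before and after feedback can come from different models.

```python
TASKS = 42

# Best Pass@1 before and after feedback, in percent, as reported for each framework.
reported = {
    "Qiskit": (59.5, 83.3),
    "Cirq": (54.8, 76.2),
    "PennyLane": (42.9, 66.7),
}

def to_tasks(pct: float) -> int:
    """Convert a percentage of the 42 tasks into the nearest whole task count."""
    return round(pct / 100 * TASKS)

# For each framework: (tasks solved before, tasks solved after, tasks gained).
summary = {}
for framework, (before, after) in reported.items():
    solved_before, solved_after = to_tasks(before), to_tasks(after)
    summary[framework] = (solved_before, solved_after, solved_after - solved_before)
```

Read this way, the best Qiskit result moves from 25 to 35 solved tasks, Cirq from 23 to 32, and PennyLane from 18 to 28, so a single task is worth about 2.4 percentage points.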

Error Analysis

Analyzing the 977 failed task attempts across all models and frameworks reveals a clear hierarchy of error types. The most common failure is producing a wrong answer (46.7%), meaning the code runs but produces incorrect quantum states or measurements. Logic errors (25.0%) involve flawed circuit construction, while missing method/gate errors (11.8%) indicate that models hallucinate nonexistent API functions.

Understanding the error types

Four of the six error categories account for 91.5% of failures and show where AI struggles with quantum code:

  • Wrong answer (46.7%): The code runs without crashing but produces incorrect quantum states. This is the hardest type to fix because the logic looks plausible.
  • Logic errors (25.0%): The quantum circuit is constructed incorrectly — wrong gate order, missing entanglement, etc.
  • Missing method/gate (11.8%): The model "hallucinates" API functions that don't exist — like calling qc.toffoli() when the correct method is qc.ccx().
  • Shape mismatch (8.0%): Output has wrong dimensions — e.g., measuring 3 qubits when the test expects 4.

After feedback, syntax errors nearly vanish (4.7% → 1.5%) but wrong answers increase in share (46.7% → 53.4%), showing that deep conceptual errors resist simple fixes.

Figure 4: Error distribution before feedback — 977 wrong tasks dominated by wrong answers (46.7%) and logic errors (25.0%)
Figure 5: Error distribution after feedback — 665 remaining failures shift toward wrong answers (53.4%), as easy-to-fix errors are resolved first

After feedback-based repair, 665 of the original 977 failures remain. Notably, syntax errors drop from 4.7% to 1.5% of failures and missing method/gate errors from 11.8% to 3.8%, while the share of wrong answers increases to 53.4%. This shows that feedback effectively resolves surface-level issues but struggles with deeper conceptual errors in quantum reasoning.
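A rising share does not mean a rising count, because the total number of failures shrinks at the same time. The arithmetic below converts the reported shares into approximate failure counts; the results are estimates, since they are derived from rounded percentages rather than from the paper's raw data.

```python
FAILURES_BEFORE = 977
FAILURES_AFTER = 665

# Share of failures in percent, before and after feedback, as reported in the text.
shares = {
    "wrong answer": (46.7, 53.4),
    "missing method/gate": (11.8, 3.8),
    "syntax": (4.7, 1.5),
}

# Approximate absolute counts: share of the failures that existed at each stage.
counts = {
    name: (round(before / 100 * FAILURES_BEFORE), round(after / 100 * FAILURES_AFTER))
    for name, (before, after) in shares.items()
}
```

On these estimates, wrong answers fall from about 456 to about 355 even though their share grows, while missing method/gate errors fall from about 115 to about 25 and syntax errors from about 46 to about 10. Feedback reduces every category, but it removes the surface-level ones far faster.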

Prefill vs No-Prefill

The prefill experiment tests whether giving models the beginning of the solution (import statements and function signature) improves code generation. This simulates a scenario where developers have already set up the basic structure and the model completes the implementation.

What is "prefill"?

Prefill is like giving a student the first few lines of an essay and asking them to continue. In code generation, this means providing the import statements and function signature so the model only needs to write the actual logic. For example, instead of generating from scratch, the model receives:

    from qiskit import QuantumCircuit
    def solve():

and continues from there. This removes the "boilerplate burden" and tests pure problem-solving ability.

Figure 6: Prefill vs No-Prefill comparison for Cirq — Gemini 3 Pro jumps from 51.2% to 62.5% with prefill, the largest improvement

Results show that prefill effects vary significantly by model and framework. Some models benefit greatly (Gemini 3 Pro gains 11.3 percentage points in Cirq, from 51.2% to 62.5%), while others show minimal change or even slight regressions. The effect is most pronounced in PennyLane, the hardest framework, suggesting that import hints help most when models are least familiar with the API.

Per-Task Performance

These heatmaps show Pass@1 results for every model-task combination in each framework. Each row represents an LLM, each column a task, with blue cells indicating success and white cells indicating failure. The pattern reveals that some tasks are universally difficult while others are easy for most models.

Qiskit — Pass@1 heatmap showing the densest success pattern
Cirq — Pass@1 heatmap with noticeably more white (failure) cells
PennyLane — Pass@1 heatmap showing the sparsest success pattern

Feedback Learning Curves

These charts track how many tasks each model solves as it receives more feedback attempts (1 through 5). Most improvement occurs in the first 2-3 rounds, with diminishing returns afterward. GPT 5.1 and DeepSeek R1 converge fastest across all frameworks.

Qiskit — Feedback learning curves showing rapid early improvement
Cirq — Feedback learning curves with similar convergence pattern
PennyLane — Feedback learning curves showing the steepest improvement

Discussion

Key Insights

The multi-framework evaluation reveals that high performance in one framework does not guarantee competence in another. Models appear to memorize framework-specific API patterns rather than developing true quantum understanding. The large performance gap between Qiskit (most popular, likely most represented in training data) and PennyLane (less common) supports this interpretation. Feedback-based repair proves highly effective, with the best Qiskit score rising from 59.5% to 83.3%, suggesting that LLMs can use error signals to correct their own code within a session.

Threats to Validity

  • Benchmark size: 42 tasks provide meaningful coverage but cannot capture the full diversity of quantum programming challenges
  • API non-determinism: LLM API responses may vary across runs, though Pass@5 helps mitigate this
  • Framework versions: Results are tied to specific framework versions and may shift with updates

Limitations & Future Work

  • Only 3 frameworks covered (no Q#, Amazon Braket, or other emerging platforms)
  • All evaluation is simulator-based — real quantum hardware introduces additional noise and constraints
  • Future directions include expanding to more frameworks, adding hardware-aware tasks, and studying few-shot and retrieval-augmented approaches

Conclusion

The big picture for AI + quantum computing

This research has a practical implication: if you're using AI tools to write quantum code for your projects, don't trust single-framework results. The best Pass@1 score drops from 59.5% in Qiskit to 42.9% in PennyLane, and GPT 5.1 alone falls from 59.5% to 40.5%. The good news is that feedback loops help substantially: iteratively running code, reading error messages, and revising raises the best Qiskit score from 59.5% to 83.3%. This mirrors how human developers work: write, test, debug, repeat. The challenge for the field is building AI that truly understands quantum concepts rather than just memorizing API documentation.

QuanBench+ is the first multi-framework quantum code generation benchmark, spanning Qiskit, PennyLane, and Cirq with 42 aligned tasks. The evaluation of 12 state-of-the-art LLMs reveals both progress and significant remaining challenges:

  1. Best one-shot accuracy is 59.5% (Qiskit), demonstrating that quantum code generation remains a challenging frontier for LLMs
  2. Performance gaps across frameworks are substantial — even the strongest models drop significantly on less-familiar frameworks, suggesting API memorization rather than deep quantum understanding
  3. Feedback-based repair boosts the best score to 83.3%, showing that iterative debugging is a powerful strategy for quantum code generation
  4. Reliable multi-framework quantum code generation remains unsolved and continues to depend heavily on framework-specific knowledge rather than quantum reasoning alone

References

  1. Chen et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.
  2. Aleksandrowicz et al. (2019). Qiskit: An Open-source Framework for Quantum Computing.
  3. Bergholm et al. (2018). PennyLane: Automatic differentiation of hybrid quantum-classical computations. arXiv:1811.04968.
  4. Cirq Developers (2021). Cirq: A python framework for creating, editing, and invoking quantum circuits.
  5. Nielsen & Chuang (2010). Quantum Computation and Quantum Information. Cambridge University Press.
  6. Vishwakarma et al. (2024). Qiskit HumanEval: An Evaluation Benchmark For Quantum Code Generation.
  7. Basit et al. (2025). QHackBench: Benchmarking Quantum Computing Code Generation.
  8. Wang et al. (2024). QCircuitBench: A Benchmark for Quantum Circuit Generation.
  9. Guo et al. (2025). QuanBench: Benchmarking LLMs on Quantum Computing.
  10. Achiam et al. (2023). GPT-4 Technical Report. OpenAI.
  11. DeepSeek-AI (2024). DeepSeek-R1. Technical Report.
  12. Google (2025). Gemini 3 Pro. Technical Report.
  13. Anthropic (2025). Claude 3.7 Sonnet. Model Card.
