---
arxiv_id: 2601.23265
title: "PaperBanana: Automating Academic Illustration for AI Scientists"
authors:
  - Dawei Zhu
  - Rui Meng
  - Yale Song
  - Xiyu Wei
  - Sujian Li
  - Tomas Pfister
  - Jinsung Yoon
difficulty: Intermediate
tags:
  - Agent
  - Vision
  - Benchmark
published_at: 2026-01-30
flecto_url: https://flecto.zer0ai.dev/papers/2601.23265/
lang: en
---

## Hero Badge

### AI Agents

### Academic Illustration

## Hero H1

### PaperBanana: Automating Academic Illustration for AI Scientists

## Hero Authors

### Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon

## Hero Institutions

### Peking University &middot; Google Cloud AI Research

## Hero Abstract

PaperBanana is an agentic framework that automates the generation of publication-ready academic illustrations. By orchestrating five specialized agents &mdash; Retriever, Planner, Stylist, Visualizer, and Critic &mdash; it transforms scientific content into high-quality methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark provides 292 test cases for rigorous evaluation across four dimensions.

## Hero Metric

### Faithfulness

### Conciseness

### Readability

### Aesthetics

## Hero Figcaption

Figure 1: Examples of methodology diagrams and statistical plots generated by PaperBanana, demonstrating the framework's ability to produce diverse, publication-ready academic illustrations

## Introduction H2

### Introduction

## Introduction P

Autonomous scientific discovery is a long-standing pursuit of artificial general intelligence. With the rapid evolution of Large Language Models (LLMs), autonomous AI Scientists have demonstrated the potential to automate many facets of the research lifecycle &mdash; from literature review and idea generation to experiment design and paper writing.

However, generating publication-ready illustrations remains a labor-intensive bottleneck. Prior code-based approaches using TikZ or Matplotlib produce results that lack visual aesthetics and fail to match the quality expected in modern academic manuscripts.

PaperBanana bridges this gap by automating the production of high-quality academic illustrations. Given a methodology description and diagram caption as input, it orchestrates specialized agents to produce publication-ready visual representations.

## Introduction Card H3

### PaperBanana Framework

### PaperBananaBench

### Superior Performance

## Introduction Card P

A fully automated agentic framework orchestrating five specialized agents &mdash; Retriever, Planner, Stylist, Visualizer, and Critic &mdash; for generating publication-ready academic illustrations from textual descriptions.

A comprehensive benchmark with 292 test cases curated from NeurIPS 2025 papers. Evaluation covers four dimensions: Faithfulness, Conciseness, Readability, and Aesthetics.

Consistently outperforms leading baselines across all evaluation dimensions, achieving gains of +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics.

## Task H2

### Task Formulation

## Task P

The paper formalizes automated academic illustration generation as learning a mapping from a source context and a communicative intent to a visual representation.

Among various types of academic illustrations, this paper focuses on methodology diagrams, which require translating complex technical concepts and logical flows from textual descriptions into high-fidelity, visually appealing diagrams. The framework also extends to statistical plots.

## Task Math

The source context \(S\) contains the essential methodology information, while the communicative intent \(C\) (typically the figure caption) specifies what the illustration should convey. The mapping function \(F\) produces the illustration: \(I = F(S, C; \mathcal{E})\), optionally guided by reference examples \(\mathcal{E} = \{E_n\}_{n=1}^{N}\) where each \(E_n = (S_n, C_n, I_n)\).
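The mapping above can also be read as a type signature. The following is an illustrative sketch only; the names `Example`, `Illustrator`, and `dummy_F` are not from the paper:

```python
from typing import Callable, NamedTuple

class Example(NamedTuple):
    source: str   # S_n: methodology description
    intent: str   # C_n: figure caption
    image: bytes  # I_n: reference illustration

# I = F(S, C; E): from source context and communicative intent,
# optionally guided by reference examples, to an illustration.
Illustrator = Callable[[str, str, list[Example]], bytes]

def dummy_F(source: str, intent: str, examples: list[Example]) -> bytes:
    # Placeholder body; a real system would return rendered image bytes.
    return f"<figure: {intent}>".encode()

F: Illustrator = dummy_F
```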

## Methodology H2

### Methodology

## Methodology Figcaption

Figure 2: Overview of the PaperBanana framework. Given source context and communicative intent, the system operates through a Linear Planning Phase (Retriever &rarr; Planner &rarr; Stylist) followed by an Iterative Refinement Loop (Visualizer &harr; Critic, T=3 rounds)

## Methodology P

PaperBanana orchestrates a collaborative team of five specialized agents. The framework operates in two phases: a Linear Planning Phase where the Retriever, Planner, and Stylist agents sequentially process the input, followed by an Iterative Refinement Loop where the Visualizer and Critic agents collaborate for T=3 rounds to produce the final illustration.
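The two-phase control flow can be sketched in a few lines. The agent names mirror the paper's roles, but the bodies are placeholders standing in for the actual prompts and models:

```python
# Stub agents: names mirror the paper's roles; bodies are placeholders.
def retrieve(source, intent):        return []                        # Retriever
def make_plan(source, intent, refs): return f"plan({intent})"         # Planner
def apply_style(plan):               return f"styled({plan})"         # Stylist
def render(description):             return f"image({description})"   # Visualizer
def critique(image, source, intent): return None                      # Critic: None = accept

def run_paperbanana(source, intent, T=3):
    # Linear Planning Phase: Retriever -> Planner -> Stylist
    refs = retrieve(source, intent)
    description = apply_style(make_plan(source, intent, refs))
    # Iterative Refinement Loop: Visualizer <-> Critic, up to T rounds
    image = None
    for _ in range(T):
        image = render(description)
        feedback = critique(image, source, intent)
        if feedback is None:  # no issues found; accept the illustration
            break
        description = feedback  # regenerate from the Critic's refined description
    return image
```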

## Methodology Agent H3

### Retriever Agent

### Planner Agent

### Stylist Agent

### Visualizer Agent

### Critic Agent

## Methodology Agent P

Identifies the most relevant reference examples from a fixed set using VLM-based ranking. The VLM is instructed to rank candidates by matching both research domain (e.g., Agent & Reasoning) and diagram type (e.g., pipeline, architecture), with visual structure prioritized over topic similarity. This provides a concrete foundation for both structural logic and visual style.

The cognitive core of the system. Takes the source context, communicative intent, and retrieved examples as inputs. By performing in-context learning from the demonstrations, the Planner translates the source into a structured description of the target diagram, specifying components, connections, layout, and logical flow.

Acts as a design consultant ensuring academic aesthetic standards. Uses auto-summarized style guides derived from analyzing hundreds of human-drawn diagrams. The Stylist optimizes the planned description with specific visual instructions for color palettes, typography, icons, and layout refinements.

Renders academic illustrations from the stylistically optimized description. Leverages image generation models (Nano-Banana-Pro or GPT-Image-1.5). For statistical plots, the Visualizer generates executable Python Matplotlib code instead, ensuring numerical precision.

Forms a closed-loop refinement mechanism with the Visualizer. Examines the generated image at each iteration, identifies issues with content accuracy, visual clarity, and style adherence, then provides a refined description for regeneration. Runs for T=3 iterations to ensure quality.
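The Retriever's ranking priority (visual structure over topic) can be sketched as a scoring rule. This is a hypothetical stand-in for the VLM ranking, not the paper's actual prompt or scoring:

```python
def rank_examples(candidates, query_domain, query_diagram_type):
    """Rank reference examples: a diagram-type (visual structure) match
    outweighs a research-domain (topic) match, per the Retriever's
    stated priority. Weights here are illustrative assumptions."""
    def score(ex):
        return (2 * (ex["diagram_type"] == query_diagram_type)
                + 1 * (ex["domain"] == query_domain))
    return sorted(candidates, key=score, reverse=True)

candidates = [
    {"id": "a", "domain": "Agent & Reasoning",  "diagram_type": "architecture"},
    {"id": "b", "domain": "Vision & Perception", "diagram_type": "pipeline"},
    {"id": "c", "domain": "Agent & Reasoning",  "diagram_type": "pipeline"},
]
ranked = rank_examples(candidates, "Agent & Reasoning", "pipeline")
```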

## Methodology Callout H3

### Extension to Statistical Plots

## Methodology Callout P

The framework extends to statistical plots by adjusting the Visualizer and Critic agents. The Visualizer converts descriptions into executable Python Matplotlib code for numerical precision. The Critic verifies both visual quality and data accuracy by comparing against source tabular data, ensuring faithfulness in the generated plots.
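A minimal sketch of the Critic's data-accuracy check for statistical plots, assuming a simple key-value interface for both the source table and the plotted values (the paper does not specify this interface):

```python
def verify_plot_data(plotted: dict, source: dict) -> list:
    """Return (key, plotted_value, expected_value) for every mismatch
    between values drawn in the plot and the source tabular data."""
    return [(k, plotted.get(k), v)
            for k, v in source.items()
            if plotted.get(k) != v]

source_table = {"Method A": 41.0, "Method B": 47.5, "Ours": 60.2}
faithful     = {"Method A": 41.0, "Method B": 47.5, "Ours": 60.2}
drifted      = {"Method A": 41.0, "Method B": 45.0, "Ours": 60.2}
```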

## Benchmark H2

### Benchmark Construction

## Benchmark P

The lack of dedicated benchmarks has hindered rigorous evaluation of automated diagram generation. To address this, the authors introduce PaperBananaBench, a comprehensive benchmark curated from NeurIPS 2025 methodology diagrams, comprising 292 test cases that capture the sophisticated aesthetics and diverse logical structures of modern academic papers.

Referenced Scoring: A VLM judge compares the model-generated diagram against the human reference, determining Model wins (score 100), Tie (50), or Human wins (0) for each dimension.
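The referenced scoring rule maps each per-dimension verdict to a number and averages over the test set; a minimal sketch:

```python
# Verdict -> score, per the benchmark's referenced scoring:
# model wins = 100, tie = 50, human wins = 0.
VERDICT_SCORE = {"model": 100, "tie": 50, "human": 0}

def dimension_score(verdicts: list) -> float:
    """Average score for one dimension over a set of test cases."""
    return sum(VERDICT_SCORE[v] for v in verdicts) / len(verdicts)
```

Under this scheme a system that ties the human reference on every case scores exactly 50, which is why 50.0 serves as the human-parity baseline.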

## Benchmark Step H3

### Collection & Parsing

### Filtering

### Categorization

### Human Curation

## Benchmark Step P

2,000 papers randomly sampled from 5,275 NeurIPS 2025 publications. The MinerU toolkit extracts text content and figures from PDF files.

Papers without methodology diagrams are discarded (yielding 1,359 valid candidates). Aspect ratios restricted to [1.5, 2.5], resulting in 292 final test cases.

Four categories based on visual topology: Agent & Reasoning (31.5%), Vision & Perception (25.0%), Generative & Learning (25.0%), Science & Application (18.5%).

Annotators verify methodology descriptions, captions, diagram correctness, and category labels to guarantee data integrity and quality.
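The filtering step above reduces to a simple predicate over the parsed figures; a sketch in which the field names are assumptions, not the extraction toolkit's actual schema:

```python
def keep_figure(fig: dict) -> bool:
    """Benchmark filtering rule: keep methodology diagrams whose
    width/height ratio falls in [1.5, 2.5]."""
    ratio = fig["width"] / fig["height"]
    return fig["is_methodology"] and 1.5 <= ratio <= 2.5
```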

## Benchmark Figcaption

Figure 3: Statistics of the PaperBananaBench test set (292 samples). Left: category distribution. Right: width-height ratio distribution

## Benchmark H3

### Evaluation Protocol

## Benchmark Eval H4

### Content Dimensions

### Presentation Dimensions

## Benchmark Eval Li

Faithfulness: Alignment with the source context (methodology description) and communicative intent (caption)

Conciseness: Focus on core information without visual clutter or redundant elements

Readability: Intelligible layouts, legible text, no excessive crossing lines

Aesthetics: Adherence to the stylistic norms of academic manuscripts

## Experiments H2

### Experiments & Results

## Experiments P

Three baseline settings are compared: (1) Vanilla &mdash; directly prompting the image generation model; (2) Few-shot &mdash; vanilla plus reference examples; (3) Agentic Frameworks &mdash; DiagramAgent, SciDraw, and PaperBanana. The VLM backbone is Gemini-3-Pro, with Nano-Banana-Pro and GPT-Image-1.5 as image generators.

The evaluation protocol is validated through inter-model agreement (Kendall's tau > 0.4 between Gemini-3-Pro judge and GPT-5) and human alignment (72% agreement with human annotators on 50 samples).
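Kendall's tau measures how often two judges order pairs of samples the same way. A self-contained sketch of the tau-a variant (ignoring tie corrections; the paper does not state which variant it uses):

```python
from itertools import combinations

def kendall_tau(a: list, b: list) -> float:
    """Tau-a: (concordant - discordant) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Two hypothetical judges that agree on 5 of 6 pairwise orderings:
scores_judge1 = [60, 55, 70, 40]
scores_judge2 = [58, 50, 65, 52]
```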

PaperBanana consistently outperforms all baselines across all metrics. Over the Vanilla Nano-Banana-Pro baseline, it achieves +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics, contributing to an overall improvement of +48.7%.

DiagramAgent and SciDraw, which rely on TikZ code generation, underperform significantly. The code-based approach struggles to capture the visual sophistication expected in modern academic manuscripts. Despite overall progress, PaperBanana still underperforms the human reference in faithfulness, with fine-grained connectivity errors remaining the primary challenge.

The ablation study reveals the contribution of each agent component:

## Experiments Figcaption

Table 1: Main results on PaperBananaBench. PaperBanana achieves the highest scores across all dimensions, with an Overall score of 60.2 (vs. Human baseline of 50.0)

Figure 4: Performance comparison across evaluation dimensions &mdash; Vanilla vs. PaperBanana vs. Human reference

Table 2: Ablation study on PaperBananaBench. Each agent component is systematically removed to assess its contribution

## Experiments H3

### Ablation Study

## Experiments Li

Retriever Agent: Semantic retrieval significantly outperforms random and no-retriever baselines. Without reference examples, the system loses its structural foundation.

Stylist Agent: Boosts Conciseness (+17.5%) and Aesthetics (+4.7%) but slightly reduces Faithfulness (-8.5%), as visual polishing can sometimes sacrifice fine-grained accuracy.

Critic Agent: Additional iterations substantially enhance all metrics, ensuring a balance between aesthetics and technical accuracy. The 3-iteration default provides the best overall trade-off.

## Plots H2

### Statistical Plots Generation

## Plots P

PaperBanana extends to statistical plot generation by adjusting the Visualizer and Critic agents. For statistical plots, the Visualizer generates executable Python Matplotlib code, ensuring numerical precision. The Critic verifies both visual quality and data accuracy.

On the curated test set, PaperBanana consistently outperforms vanilla Gemini-3-Pro across all dimensions. The image generation approach produces more visually appealing plots but can introduce faithfulness errors (incorrect data values, duplicated categories), while the code-based approach ensures data accuracy at the cost of visual sophistication.

## Plots Figcaption

Figure 5: Code-based vs. image-based generation for statistical plots. The image approach yields better aesthetics but may introduce data faithfulness issues

## Discussion H2

### Discussion

## Discussion H3

### Enhancing Aesthetics of Human-Drawn Diagrams

### Coding vs. Image Generation for Statistical Plots

## Discussion P

An intriguing application: can PaperBanana's auto-summarized aesthetic guidelines enhance existing human-drawn diagrams? The system identifies specific improvement areas &mdash; color palette, font, icons, connectors, line weight, and shape &mdash; and applies them through Nano-Banana-Pro to refine the original diagram while preserving its content.

For statistical plots, code-based approaches demonstrate remarkable efficacy for data accuracy, while image generation excels in visual aesthetics. The choice depends on priorities: when numerical precision is critical, code-based generation is preferred; when visual appeal and design quality matter more, image generation offers advantages &mdash; though at the risk of occasional faithfulness errors.

## Discussion Figcaption

Figure 6: Enhancing aesthetics of human-drawn diagrams. Left: original diagram. Center: suggested improvements. Right: enhanced version

## Related H2

### Related Work

## Related H3

### Automated Academic Diagram Generation

### Coding-Based Data Visualization

## Related P

Prior work primarily adopts code-based generation using TikZ (DeTikZify, AutomaTikZ, TikZero) to produce vector graphics. Recent image generation models (Nano-Banana-Pro, GPT-Image-1.5) have achieved remarkable progress in synthesizing high-fidelity figures. The closest benchmark to PaperBananaBench is SridBench, which evaluates automated diagram generation across multiple domains.

From early LSTM-based approaches (Data2Vis) to LLM-powered tools like LIDA, MatplotAgent, and CoDa, the field has evolved toward using language models to generate visualization code from data and natural language descriptions. These tools demonstrate the growing capability of AI systems in producing accurate, customizable statistical visualizations.

## Conclusion H2

### Conclusion

## Conclusion P

PaperBanana is an agentic framework designed to automate the generation of publication-ready academic illustrations. By orchestrating specialized agents &mdash; Retriever, Planner, Stylist, Visualizer, and Critic &mdash; the approach transforms scientific content into high-fidelity methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark enables rigorous evaluation, and comprehensive experiments demonstrate significant improvements over existing methods.

## Conclusion H3

### Limitations & Future Directions

## Conclusion Card H4

### Raster Output

### Style vs. Diversity

### Faithfulness Gap

### Evaluation Challenges

### Single Output

## Conclusion Card P

Output is raster (not vector), making editing difficult. Future work envisions element extraction for SVG reconstruction and GUI Agents for vector design software.

The unified style guide ensures consistency but reduces stylistic diversity. Future work: user-customizable style preferences and diverse output options.

Fine-grained connectivity errors remain the primary challenge. Future: specialized verification models and structured output formats for structural correctness.

Reference-based VLM-as-a-Judge has inherent limitations. Future work: reference-free evaluation metrics and multi-dimensional assessment frameworks.

Currently produces one output per query. Future: test-time scaling with diverse generative samples to satisfy varied aesthetic preferences.

## Appendix H2

### Case Studies

## Appendix Figcaption

Figure 7: Detailed case studies comparing Human-created diagrams, Nano-Banana-Pro output, and PaperBanana results. PaperBanana demonstrates advantages in modern color palettes, information conciseness, and logical clarity through color-coded modules

Figure 8: Style enhancement showcase &mdash; Original diagrams (left) vs. PaperBanana-enhanced versions (right). The system applies improved color palettes, typography, layout organization, and visual hierarchy

Figure 9: Detailed comparison of image-based vs. code-based statistical plot generation. Image generation produces more visually appealing results but may introduce data faithfulness issues, while code generation ensures accuracy

## References H2

### References

