Peking University · Google Cloud AI Research
PaperBanana is an agentic framework that automates the generation of publication-ready academic illustrations. By orchestrating five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic), it transforms scientific content into high-quality methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark provides 292 test cases for rigorous evaluation across four dimensions.
Autonomous scientific discovery is a long-standing pursuit of artificial general intelligence. With the rapid evolution of Large Language Models (LLMs), autonomous AI Scientists have demonstrated the potential to automate many facets of the research lifecycle, from literature review and idea generation to experiment design and paper writing.
However, generating publication-ready illustrations remains a labor-intensive bottleneck. Prior code-based approaches using TikZ or Matplotlib produce results that lack visual polish and fall short of the quality expected in modern academic manuscripts.
PaperBanana bridges this gap by automating the production of high-quality academic illustrations. Given a methodology description and diagram caption as input, it orchestrates specialized agents to produce publication-ready visual representations.
A fully automated agentic framework orchestrating five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic) for generating publication-ready academic illustrations from textual descriptions.
A comprehensive benchmark with 292 test cases curated from NeurIPS 2025 papers. Evaluation covers four dimensions: Faithfulness, Conciseness, Readability, and Aesthetics.
Consistently outperforms leading baselines across all evaluation dimensions, achieving gains of +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics.
The paper formalizes automated academic illustration generation as learning a mapping from a source context and a communicative intent to a visual representation.
The source context \(S\) contains the essential methodology information, while the communicative intent \(C\) (typically the figure caption) specifies what the illustration should convey. The mapping function \(F\) produces the illustration: \(I = F(S, C; \mathcal{E})\), optionally guided by reference examples \(\mathcal{E} = \{E_n\}_{n=1}^{N}\) where each \(E_n = (S_n, C_n, I_n)\).
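The formalization above can be sketched in code. This is a minimal illustration of the interface, not the paper's implementation; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Example:
    """One reference example E_n = (S_n, C_n, I_n)."""
    source_context: str        # S_n: methodology description
    communicative_intent: str  # C_n: figure caption
    illustration: bytes        # I_n: rendered diagram (e.g., PNG bytes)

# F maps (S, C), optionally guided by examples E, to an illustration I.
GenerateFn = Callable[[str, str, Optional[List[Example]]], bytes]

def generate_illustration(F: GenerateFn, S: str, C: str,
                          examples: Optional[List[Example]] = None) -> bytes:
    """I = F(S, C; E): produce an illustration from context and intent."""
    return F(S, C, examples)

# A stand-in F for demonstration; a real F would call an image model.
dummy_F = lambda s, c, ex: b"IMG:" + s.encode()[:4]
image = generate_illustration(dummy_F, "ours", "An overview of our pipeline")
```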
Think of it like a translator for scientific ideas. You provide a source context S (your methodology description) and a communicative intent C (typically the figure caption).
The system then produces an illustration I that visually communicates your methodology. It's similar to how a graphic designer reads your paper and creates a diagram, but fully automated.
Among various types of academic illustrations, this paper focuses on methodology diagrams, which require interpreting complex technical concepts and logical flows from textual descriptions into high-fidelity, visually appealing diagrams. The framework also extends to statistical plots.
PaperBanana orchestrates a collaborative team of five specialized agents. The framework operates in two phases: a Linear Planning Phase where the Retriever, Planner, and Stylist agents sequentially process the input, followed by an Iterative Refinement Loop where the Visualizer and Critic agents collaborate for T=3 rounds to produce the final illustration.
Why five agents instead of one? Each agent specializes in a different aspect of diagram creation, similar to a design team: one person finds references, another plans the layout, a designer adds style, an illustrator draws it, and a reviewer gives feedback. This division of labor produces better results than asking a single model to do everything at once.
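The two-phase orchestration described above can be sketched as a simple control loop. The agent signatures and names below are assumptions for illustration; the paper's actual interfaces may differ.

```python
from typing import Callable, Tuple

# Hypothetical agent signatures (not the paper's actual interfaces).
Retrieve = Callable[[str, str], list]                # (S, C) -> reference examples
Plan     = Callable[[str, str, list], str]           # -> structured description
Style    = Callable[[str], str]                      # -> styled description
Render   = Callable[[str], bytes]                    # description -> image
Critique = Callable[[bytes, str], Tuple[bool, str]]  # -> (ok?, revised description)

def paperbanana_pipeline(S, C, retrieve, plan, style, render, critique, T=3):
    # Phase 1: linear planning (Retriever -> Planner -> Stylist).
    examples = retrieve(S, C)
    description = style(plan(S, C, examples))
    # Phase 2: iterative refinement (Visualizer <-> Critic, up to T rounds).
    image = render(description)
    for _ in range(T - 1):
        ok, description = critique(image, description)
        if ok:
            break
        image = render(description)
    return image

# Stub agents to exercise the control flow.
renders = []
img = paperbanana_pipeline(
    "S", "C",
    retrieve=lambda s, c: ["ex"],
    plan=lambda s, c, e: "plan",
    style=lambda d: d + "|styled",
    render=lambda d: renders.append(d) or d.encode(),
    critique=lambda image, d: (False, d + "+"),
    T=3,
)
```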
Identifies the most relevant reference examples from a fixed set using VLM-based ranking. The VLM is instructed to rank candidates by matching both research domain (e.g., Agent & Reasoning) and diagram type (e.g., pipeline, architecture), with visual structure being prioritized over topic similarity. This provides a concrete foundation for both structural logic and visual style.
VLM (Vision-Language Model) refers to AI models that can understand both images and text simultaneously. Here, the VLM examines reference diagrams to find ones with similar visual structure to what's needed, like browsing a portfolio to find matching design patterns.
The cognitive core of the system. Takes the source context, communicative intent, and retrieved examples as inputs. By performing in-context learning from the demonstrations, the Planner translates the source into a structured description of the target diagram, specifying components, connections, layout, and logical flow.
Acts as a design consultant ensuring academic aesthetic standards. Uses auto-summarized style guides derived from analyzing hundreds of human-drawn diagrams. The Stylist optimizes the planned description with specific visual instructions for color palettes, typography, icons, and layout refinements.
Auto-summarized style guides: Instead of manually defining what "good academic diagrams" look like, the system automatically analyzes hundreds of human-drawn diagrams from top conferences and extracts common design patterns: preferred color palettes, font choices, icon styles, and layout conventions.
Renders academic illustrations from the stylistically optimized description. Leverages image generation models (Nano-Banana-Pro or GPT-Image-1.5). For statistical plots, the Visualizer generates executable Python Matplotlib code instead, ensuring numerical precision.
Forms a closed-loop refinement mechanism with the Visualizer. Examines the generated image at each iteration, identifies issues with content accuracy, visual clarity, and style adherence, then provides a refined description for regeneration. Runs for T=3 iterations to ensure quality.
The framework extends to statistical plots by adjusting the Visualizer and Critic agents. The Visualizer converts descriptions into executable Python Matplotlib code for numerical precision. The Critic verifies both visual quality and data accuracy by comparing against source tabular data, ensuring faithfulness in the generated plots.
For statistical plots, generating code (Python Matplotlib) is preferred over generating images directly because code can precisely reproduce exact data values, axis scales, and labels. Image generation might produce visually beautiful charts but can get the actual numbers wrong.
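The Visualizer's code-generation path can be illustrated with a toy sketch: instead of asking an image model to draw a chart, it emits an executable Matplotlib script whose values come straight from the source table. This is not the paper's actual Visualizer; the function and data below are made up for illustration.

```python
# Toy sketch: emit executable Matplotlib code from tabular data, so every
# plotted value is copied verbatim from the source table rather than
# hallucinated by an image model.
def emit_bar_chart_code(labels, values, title):
    return "\n".join([
        "import matplotlib.pyplot as plt",
        f"labels = {labels!r}",
        f"values = {values!r}",
        "fig, ax = plt.subplots()",
        "ax.bar(labels, values)",  # bar heights are the exact data values
        f"ax.set_title({title!r})",
        "fig.savefig('plot.png', dpi=300)",
    ])

script = emit_bar_chart_code(["Method A", "Method B"], [60.2, 48.7],
                             "Overall score")
```

Because the data values are interpolated into the script as literals, a Critic agent can verify faithfulness by simple string or AST comparison against the source table.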
The lack of dedicated benchmarks has hindered rigorous evaluation of automated diagram generation. To address this, the authors introduce PaperBananaBench, a comprehensive benchmark curated from NeurIPS 2025 methodology diagrams, comprising 292 test cases that capture the sophisticated aesthetics and diverse logical structures of modern academic papers.
2,000 papers randomly sampled from 5,275 NeurIPS 2025 publications. The MinerU toolkit extracts text content and figures from PDF files.
Papers without methodology diagrams are discarded (yielding 1,359 valid candidates). Aspect ratios restricted to [1.5, 2.5], resulting in 292 final test cases.
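The aspect-ratio filter in the curation step above can be sketched as a simple predicate. The candidate dimensions below are made up for illustration.

```python
# Sketch of the benchmark's aspect-ratio filter: keep only figures whose
# width/height ratio lies in [1.5, 2.5].
def keep_figure(width_px: int, height_px: int,
                lo: float = 1.5, hi: float = 2.5) -> bool:
    ratio = width_px / height_px
    return lo <= ratio <= hi

# Hypothetical candidates: a wide pipeline figure, a square figure, a banner.
candidates = [(1600, 800), (1200, 1200), (2000, 900)]
kept = [dims for dims in candidates if keep_figure(*dims)]
```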
Four categories based on research domain: Agent & Reasoning (31.5%), Vision & Perception (25.0%), Generative & Learning (25.0%), Science & Application (18.5%).
Annotators verify methodology descriptions, captions, diagram correctness, and category labels to guarantee data integrity and quality.
Referenced Scoring: A VLM judge compares the model-generated diagram against the human reference, determining Model wins (score 100), Tie (50), or Human wins (0) for each dimension.
The evaluation uses a "VLM-as-a-Judge" approach, where an AI model acts as an expert judge. For each diagram, it compares the machine-generated version against the human-created original:
So when PaperBanana scores 60.2 overall, it means it slightly outperforms human-created diagrams on average, a remarkable achievement.
Three baseline settings are compared: (1) Vanilla: directly prompting the image generation model; (2) Few-shot: vanilla plus reference examples; (3) Agentic Frameworks: DiagramAgent, SciDraw, and PaperBanana. The VLM backbone is Gemini-3-Pro, with Nano-Banana-Pro and GPT-Image-1.5 as image generators.
The evaluation protocol is validated through inter-model agreement (Kendall's tau > 0.4 between Gemini-3-Pro judge and GPT-5) and human alignment (72% agreement with human annotators on 50 samples).
Kendall's tau is a statistical measure of agreement between two rankings (range: -1 to +1). A value above 0.4 indicates relatively strong agreement, meaning different AI judges tend to rank the diagrams in a similar order โ confirming the evaluation is reliable.
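Kendall's tau can be computed directly from pairwise comparisons of two rankings. A self-contained sketch (the tau-a variant, assuming no ties; the paper does not specify which variant it uses):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length rankings (no ties assumed)."""
    assert len(a) == len(b) and len(a) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

# Two judges ranking four diagrams, disagreeing only on the last pair:
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

Here 5 of the 6 pairs agree and 1 disagrees, giving tau = (5 - 1) / 6 ≈ 0.67, comfortably above the 0.4 agreement threshold mentioned above.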
PaperBanana consistently outperforms all baselines across all metrics. Over the Vanilla Nano-Banana-Pro baseline, it achieves +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics, contributing to an overall improvement of +48.7%.
DiagramAgent and SciDraw, which rely on TikZ code generation, underperform significantly. The code-based approach struggles to capture the visual sophistication expected in modern academic manuscripts. Despite overall progress, PaperBanana still underperforms the human reference in faithfulness, with fine-grained connectivity errors remaining the primary challenge.
TikZ is a LaTeX package for creating vector graphics through code. While it produces precise, scalable diagrams, the code is complex and the resulting figures often look rigid and dated compared to modern image-generation approaches.
The ablation study reveals the contribution of each agent component; one notable finding is that adding styling improves visual quality while reducing accuracy.
Why does adding style reduce accuracy? This is a common trade-off in visualization: making a diagram more visually polished (cleaner layout, fewer labels, simplified connections) can sometimes sacrifice technical precision. It's like the difference between a detailed engineering blueprint and a sleek marketing infographic: the latter looks better but may omit subtle technical details.
PaperBanana extends to statistical plot generation by adjusting the Visualizer and Critic agents. For statistical plots, the Visualizer generates executable Python Matplotlib code, ensuring numerical precision. The Critic verifies both visual quality and data accuracy.
On the curated test set, PaperBanana consistently outperforms vanilla Gemini-3-Pro across all dimensions. The image generation approach produces more visually appealing plots but can introduce faithfulness errors (incorrect data values, duplicated categories), while the code-based approach ensures data accuracy at the cost of visual sophistication.
An intriguing application: can PaperBanana's auto-summarized aesthetic guidelines enhance existing human-drawn diagrams? The system identifies specific improvement areas (color palette, font, icons, connectors, line weight, and shape) and applies them through Nano-Banana-Pro to refine the original diagram while preserving its content.
For statistical plots, code-based approaches demonstrate remarkable efficacy for data accuracy, while image generation excels in visual aesthetics. The choice depends on priorities: when numerical precision is critical, code-based generation is preferred; when visual appeal and design quality matter more, image generation offers advantages, though at the risk of occasional faithfulness errors.
The accuracy vs. aesthetics dilemma: Image generation models like Nano-Banana-Pro create visually stunning charts but occasionally fabricate data (e.g., drawing a bar at the wrong height or duplicating categories). Code generation is "boring but reliable": it always plots exactly what the data says, but the visual design is limited to what Matplotlib templates offer.
PaperBanana is an agentic framework designed to automate the generation of publication-ready academic illustrations. By orchestrating specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic), the approach transforms scientific content into high-fidelity methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark enables rigorous evaluation, and comprehensive experiments demonstrate significant improvements over existing methods.
Output is raster (not vector), making editing difficult. Future work envisions element extraction for SVG reconstruction and GUI Agents for vector design software.
The unified style guide ensures consistency but reduces stylistic diversity. Future work: user-customizable style preferences and diverse output options.
Fine-grained connectivity errors remain the primary challenge. Future: specialized verification models and structured output formats for structural correctness.
Reference-based VLM-as-a-Judge has inherent limitations. Future work: reference-free evaluation metrics and multi-dimensional assessment frameworks.
Currently produces one output per query. Future: test-time scaling with diverse generative samples to satisfy varied aesthetic preferences.