Peking University · Google Cloud AI Research
PaperBanana is an agentic framework that automates the generation of publication-ready academic illustrations. By orchestrating five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic), it transforms scientific content into high-quality methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark provides 292 test cases for rigorous evaluation across four dimensions.
Autonomous scientific discovery is a long-standing pursuit of artificial general intelligence. With the rapid evolution of Large Language Models (LLMs), autonomous AI Scientists have demonstrated the potential to automate many facets of the research lifecycle, from literature review and idea generation to experiment design and paper writing.
However, generating publication-ready illustrations remains a labor-intensive bottleneck. Prior code-based approaches using TikZ or Matplotlib produce results that lack visual polish and fall short of the quality expected in modern academic manuscripts.
PaperBanana bridges this gap by automating the production of high-quality academic illustrations. Given a methodology description and diagram caption as input, it orchestrates specialized agents to produce publication-ready visual representations.
A fully automated agentic framework orchestrating five specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic) for generating publication-ready academic illustrations from textual descriptions.
A comprehensive benchmark with 292 test cases curated from NeurIPS 2025 papers. Evaluation covers four dimensions: Faithfulness, Conciseness, Readability, and Aesthetics.
Consistently outperforms leading baselines across all evaluation dimensions, achieving gains of +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics.
The paper formalizes automated academic illustration generation as learning a mapping from a source context and a communicative intent to a visual representation.
The source context \(S\) contains the essential methodology information, while the communicative intent \(C\) (typically the figure caption) specifies what the illustration should convey. The mapping function \(F\) produces the illustration: \(I = F(S, C; \mathcal{E})\), optionally guided by reference examples \(\mathcal{E} = \{E_n\}_{n=1}^{N}\) where each \(E_n = (S_n, C_n, I_n)\).
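The formalization above can be sketched in code. This is a minimal illustration of the interface, not the paper's implementation; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Example:
    """One reference example E_n = (S_n, C_n, I_n)."""
    source_context: str        # S_n: methodology description
    communicative_intent: str  # C_n: figure caption
    illustration: bytes        # I_n: rendered diagram (e.g., PNG bytes)

# F maps (S, C), optionally guided by examples E, to an illustration I.
GenerateFn = Callable[[str, str, Optional[List[Example]]], bytes]

def generate_illustration(F: GenerateFn, S: str, C: str,
                          examples: Optional[List[Example]] = None) -> bytes:
    """I = F(S, C; E): produce an illustration from context and intent."""
    return F(S, C, examples)

# A stand-in F for demonstration; a real F would call an image model.
dummy_F = lambda s, c, ex: b"IMG:" + s.encode()[:4]
image = generate_illustration(dummy_F, "ours", "An overview of our pipeline")
```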
Think of it like a translator for scientific ideas. You provide a source context S (your methodology description) and a communicative intent C (typically the figure caption).
The system then produces an illustration I that visually communicates your methodology. It's similar to how a graphic designer reads your paper and creates a diagram, but fully automated.
Among various types of academic illustrations, this paper focuses on methodology diagrams, which require interpreting complex technical concepts and logical flows from textual descriptions into high-fidelity, visually appealing diagrams. The framework also extends to statistical plots.
PaperBanana orchestrates a collaborative team of five specialized agents. The framework operates in two phases: a Linear Planning Phase where the Retriever, Planner, and Stylist agents sequentially process the input, followed by an Iterative Refinement Loop where the Visualizer and Critic agents collaborate for T=3 rounds to produce the final illustration.
Why five agents instead of one? Each agent specializes in a different aspect of diagram creation, similar to a design team: one person finds references, another plans the layout, a designer adds style, an illustrator draws it, and a reviewer gives feedback. This division of labor produces better results than asking a single model to do everything at once.
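The two-phase orchestration described above can be sketched as a simple control loop. The agent signatures and names below are assumptions for illustration; the paper's actual interfaces may differ.

```python
from typing import Callable, Tuple

# Hypothetical agent signatures (not the paper's actual interfaces).
Retrieve = Callable[[str, str], list]                # (S, C) -> reference examples
Plan     = Callable[[str, str, list], str]           # -> structured description
Style    = Callable[[str], str]                      # -> styled description
Render   = Callable[[str], bytes]                    # description -> image
Critique = Callable[[bytes, str], Tuple[bool, str]]  # -> (ok?, revised description)

def paperbanana_pipeline(S, C, retrieve, plan, style, render, critique, T=3):
    # Phase 1: linear planning (Retriever -> Planner -> Stylist).
    examples = retrieve(S, C)
    description = style(plan(S, C, examples))
    # Phase 2: iterative refinement (Visualizer <-> Critic, up to T rounds).
    image = render(description)
    for _ in range(T - 1):
        ok, description = critique(image, description)
        if ok:
            break
        image = render(description)
    return image

# Stub agents to exercise the control flow.
renders = []
img = paperbanana_pipeline(
    "S", "C",
    retrieve=lambda s, c: ["ex"],
    plan=lambda s, c, e: "plan",
    style=lambda d: d + "|styled",
    render=lambda d: renders.append(d) or d.encode(),
    critique=lambda image, d: (False, d + "+"),
    T=3,
)
```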
Identifies the most relevant reference examples from a fixed set using VLM-based ranking. The VLM is instructed to rank candidates by matching both research domain (e.g., Agent & Reasoning) and diagram type (e.g., pipeline, architecture), with visual structure being prioritized over topic similarity. This provides a concrete foundation for both structural logic and visual style.
VLM (Vision-Language Model) refers to AI models that can understand both images and text simultaneously. Here, the VLM examines reference diagrams to find ones with similar visual structure to what's needed, like browsing a portfolio to find matching design patterns.
The cognitive core of the system. Takes the source context, communicative intent, and retrieved examples as inputs. By performing in-context learning from the demonstrations, the Planner translates the source into a structured description of the target diagram, specifying components, connections, layout, and logical flow.
Acts as a design consultant ensuring academic aesthetic standards. Uses auto-summarized style guides derived from analyzing hundreds of human-drawn diagrams. The Stylist optimizes the planned description with specific visual instructions for color palettes, typography, icons, and layout refinements.
Auto-summarized style guides: Instead of manually defining what "good academic diagrams" look like, the system automatically analyzes hundreds of human-drawn diagrams from top conferences and extracts common design patterns: preferred color palettes, font choices, icon styles, and layout conventions.
Renders academic illustrations from the stylistically optimized description. Leverages image generation models (Nano-Banana-Pro or GPT-Image-1.5). For statistical plots, the Visualizer generates executable Python Matplotlib code instead, ensuring numerical precision.
Forms a closed-loop refinement mechanism with the Visualizer. Examines the generated image at each iteration, identifies issues with content accuracy, visual clarity, and style adherence, then provides a refined description for regeneration. Runs for T=3 iterations to ensure quality.
The framework extends to statistical plots by adjusting the Visualizer and Critic agents. The Visualizer converts descriptions into executable Python Matplotlib code for numerical precision. The Critic verifies both visual quality and data accuracy by comparing against source tabular data, ensuring faithfulness in the generated plots.
For statistical plots, generating code (Python Matplotlib) is preferred over generating images directly because code can precisely reproduce exact data values, axis scales, and labels. Image generation might produce visually beautiful charts but can get the actual numbers wrong.
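The Visualizer's code-generation path can be illustrated with a toy sketch: instead of asking an image model to draw a chart, it emits an executable Matplotlib script whose values come straight from the source table. This is not the paper's actual Visualizer; the function and data below are made up for illustration.

```python
# Toy sketch: emit executable Matplotlib code from tabular data, so every
# plotted value is copied verbatim from the source table rather than
# hallucinated by an image model.
def emit_bar_chart_code(labels, values, title):
    return "\n".join([
        "import matplotlib.pyplot as plt",
        f"labels = {labels!r}",
        f"values = {values!r}",
        "fig, ax = plt.subplots()",
        "ax.bar(labels, values)",  # bar heights are the exact data values
        f"ax.set_title({title!r})",
        "fig.savefig('plot.png', dpi=300)",
    ])

script = emit_bar_chart_code(["Method A", "Method B"], [60.2, 48.7],
                             "Overall score")
```

Because the data values are interpolated into the script as literals, a Critic agent can verify faithfulness by simple string or AST comparison against the source table.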
The lack of dedicated benchmarks has hindered rigorous evaluation of automated diagram generation. To address this, the authors introduce PaperBananaBench, a comprehensive benchmark curated from NeurIPS 2025 methodology diagrams, comprising 292 test cases that capture the sophisticated aesthetics and diverse logical structures of modern academic papers.
2,000 papers randomly sampled from 5,275 NeurIPS 2025 publications. The MinerU toolkit extracts text content and figures from PDF files.
Papers without methodology diagrams are discarded (yielding 1,359 valid candidates). Aspect ratios restricted to [1.5, 2.5], resulting in 292 final test cases.
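The aspect-ratio filter in the curation step above can be sketched as a simple predicate. The candidate dimensions below are made up for illustration.

```python
# Sketch of the benchmark's aspect-ratio filter: keep only figures whose
# width/height ratio lies in [1.5, 2.5].
def keep_figure(width_px: int, height_px: int,
                lo: float = 1.5, hi: float = 2.5) -> bool:
    ratio = width_px / height_px
    return lo <= ratio <= hi

# Hypothetical candidates: a wide pipeline figure, a square figure, a banner.
candidates = [(1600, 800), (1200, 1200), (2000, 900)]
kept = [dims for dims in candidates if keep_figure(*dims)]
```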
Four categories based on research domain: Agent & Reasoning (31.5%), Vision & Perception (25.0%), Generative & Learning (25.0%), Science & Application (18.5%).
Annotators verify methodology descriptions, captions, diagram correctness, and category labels to guarantee data integrity and quality.
Referenced Scoring: A VLM judge compares the model-generated diagram against the human reference, determining Model wins (score 100), Tie (50), or Human wins (0) for each dimension.
The evaluation uses a "VLM-as-a-Judge" approach, where an AI model acts as an expert judge. For each diagram, it compares the machine-generated version against the human-created original:
So when PaperBanana scores 60.2 overall, it means it slightly outperforms human-created diagrams on average, a remarkable achievement.
Three baseline settings are compared: (1) Vanilla: directly prompting the image generation model; (2) Few-shot: vanilla plus reference examples; (3) Agentic Frameworks: DiagramAgent, SciDraw, and PaperBanana. The VLM backbone is Gemini-3-Pro, with Nano-Banana-Pro and GPT-Image-1.5 as image generators.
The evaluation protocol is validated through inter-model agreement (Kendall's tau > 0.4 between Gemini-3-Pro judge and GPT-5) and human alignment (72% agreement with human annotators on 50 samples).
Kendall's tau is a statistical measure of agreement between two rankings (range: -1 to +1). A value above 0.4 indicates relatively strong agreement, meaning different AI judges tend to rank the diagrams in a similar order โ confirming the evaluation is reliable.
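Kendall's tau can be computed directly from pairwise comparisons of two rankings. A self-contained sketch (the tau-a variant, assuming no ties; the paper does not specify which variant it uses):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length rankings (no ties assumed)."""
    assert len(a) == len(b) and len(a) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1   # the pair is ordered the same way in both
        elif s < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

# Two judges ranking four diagrams, disagreeing only on the last pair:
tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])
```

Here 5 of the 6 pairs agree and 1 disagrees, giving tau = (5 - 1) / 6 ≈ 0.67, comfortably above the 0.4 agreement threshold mentioned above.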
PaperBanana consistently outperforms all baselines across all metrics. Over the Vanilla Nano-Banana-Pro baseline, it achieves +2.8% Faithfulness, +37.2% Conciseness, +12.9% Readability, and +6.6% Aesthetics, contributing to an overall improvement of +48.7%.
DiagramAgent and SciDraw, which rely on TikZ code generation, underperform significantly. The code-based approach struggles to capture the visual sophistication expected in modern academic manuscripts. Despite overall progress, PaperBanana still underperforms the human reference in faithfulness, with fine-grained connectivity errors remaining the primary challenge.
TikZ is a LaTeX package for creating vector graphics through code. While it produces precise, scalable diagrams, the code is complex and the resulting figures often look rigid and dated compared to modern image-generation approaches.
The ablation study reveals the contribution of each agent component; one notable finding is that adding styling improves visual quality while reducing accuracy.
Why does adding style reduce accuracy? This is a common trade-off in visualization: making a diagram more visually polished (cleaner layout, fewer labels, simplified connections) can sometimes sacrifice technical precision. It's like the difference between a detailed engineering blueprint and a sleek marketing infographic: the latter looks better but may omit subtle technical details.
PaperBanana extends to statistical plot generation by adjusting the Visualizer and Critic agents. For statistical plots, the Visualizer generates executable Python Matplotlib code, ensuring numerical precision. The Critic verifies both visual quality and data accuracy.
On the curated test set, PaperBanana consistently outperforms vanilla Gemini-3-Pro across all dimensions. The image generation approach produces more visually appealing plots but can introduce faithfulness errors (incorrect data values, duplicated categories), while the code-based approach ensures data accuracy at the cost of visual sophistication.
An intriguing application: can PaperBanana's auto-summarized aesthetic guidelines enhance existing human-drawn diagrams? The system identifies specific improvement areas (color palette, font, icons, connectors, line weight, and shape) and applies them through Nano-Banana-Pro to refine the original diagram while preserving its content.
For statistical plots, code-based approaches demonstrate remarkable efficacy for data accuracy, while image generation excels in visual aesthetics. The choice depends on priorities: when numerical precision is critical, code-based generation is preferred; when visual appeal and design quality matter more, image generation offers advantages, though at the risk of occasional faithfulness errors.
The accuracy vs. aesthetics dilemma: Image generation models like Nano-Banana-Pro create visually stunning charts but occasionally fabricate data (e.g., drawing a bar at the wrong height or duplicating categories). Code generation is "boring but reliable": it always plots exactly what the data says, but the visual design is limited to what Matplotlib templates offer.
PaperBanana is an agentic framework designed to automate the generation of publication-ready academic illustrations. By orchestrating specialized agents (Retriever, Planner, Stylist, Visualizer, and Critic), the approach transforms scientific content into high-fidelity methodology diagrams and statistical plots. The accompanying PaperBananaBench benchmark enables rigorous evaluation, and comprehensive experiments demonstrate significant improvements over existing methods.
Output is raster (not vector), making editing difficult. Future work envisions element extraction for SVG reconstruction and GUI Agents for vector design software.
The unified style guide ensures consistency but reduces stylistic diversity. Future work: user-customizable style preferences and diverse output options.
Fine-grained connectivity errors remain the primary challenge. Future: specialized verification models and structured output formats for structural correctness.
Reference-based VLM-as-a-Judge has inherent limitations. Future work: reference-free evaluation metrics and multi-dimensional assessment frameworks.
Currently produces one output per query. Future: test-time scaling with diverse generative samples to satisfy varied aesthetic preferences.