AI Papers

Curated research — 48 papers

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language World Models

Can AI agents handle real professional work? OccuBench evaluates agents across 100 tasks in 65 specialized domains using language world models, revealing critical gaps in professional task performance.

2026-04-13

🟡 Intermediate: Vision, Reasoning, Diffusion

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

What if reward models could explain their reasoning? RationalRewards teaches reward models to produce explicit critiques before scoring, turning passive evaluators into active optimization tools that improve visual generation at both training and test time.

2026-04-13

🟡 Agent, Benchmark

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

GameWorld introduces a standardized benchmark for evaluating multimodal AI agents in browser-based video games, tackling heterogeneous action interfaces and heuristic verification challenges.

2026-04-08 arXiv

🔴 LLM, Reasoning

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance

What if the secret to better LLM reasoning is giving hints that are just enough — not too much, not too little? KnowRL breaks problems into atomic Knowledge Points and uses Constrained Subset Search to find the minimal hint that unblocks exploration without leaking answers. On a 1.5B model, it beats GRPO by +9.63 points across 8 benchmarks.

2026-04-16 arXiv

🔴 Agent, Reasoning

Toward Autonomous Long-Horizon Engineering for ML Research

AiScientist treats long-horizon ML research engineering as a systems problem: thin orchestrator control over thick durable state. The File-as-Bus workspace delivers +10.54 pts on PaperBench and 81.82 Any Medal% on MLE-Bench Lite.

2026-04-15 arXiv

🟡 Agent, Benchmark

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

Open-source full-stack framework for GUI agents: online RL training, reproducible evaluation across 6 benchmarks and 11+ models, and real-device deployment — +17.1% on ScreenSpot-Pro.

2026-04-13 arXiv

🟡 Transformer, Attention Mechanism

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

First comprehensive survey on Attention Sink — why Transformers waste attention on uninformative tokens. Covers utilization (StreamingLLM), interpretation, and mitigation across 200+ papers.

2026-04-11 arXiv

🔴 Reinforcement Learning, LLM Training

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

A memory-enhanced reward shaping framework that clusters recurring LLM errors and penalizes repetition, boosting pass@1 by up to +4.13 on math benchmarks.

2026-04-13 arXiv

🟡 LLM, Benchmark

QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

Can LLMs generate correct quantum code across multiple frameworks? QuanBench+ benchmarks code generation for Qiskit, PennyLane, and Cirq, revealing that feedback-based repair boosts Pass@1 from 59.5% to 83.3%.

2026-03-25 arXiv

🟡 Vision-Language, LLM

EXAONE 4.5 Technical Report

2026-04-09 arXiv

🟡 Vision, Diffusion

RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

RefineAnything surgically fixes distorted text, logos, and faces in AI-generated images — a bounding box plus a reference image is all it takes to restore pixel-perfect local details.

2026-04-08 arXiv

🟡 Benchmark, Vision

FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

FORGE reveals that domain knowledge—not visual grounding—is the bottleneck for manufacturing AI. A 3B model fine-tuned on FORGE matches a model 78× larger.

2026-04-08 arXiv

🔴 Computer Vision, 3D Detection

WildDet3D: Scaling Promptable 3D Detection in the Wild

WildDet3D brings 3D object detection in the wild — any object, any prompt, any image — with a 10× leap over prior state-of-the-art.

2026-04-09 arXiv

🟡 Agent, LLM

SkillClaw: Let Skills Evolve Collectively with Agentic Evolver

What if every agent failure made all agents smarter? SkillClaw shows collective skill evolution is possible.

2026-04-09 arXiv

🔴 Reasoning, LLM

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

SFT doesn't just memorize — it generalizes conditionally. Three factors decide when: optimization depth, data quality, and model capability. The cost: reasoning improves, but safety degrades.

2026-04-08 arXiv

🔴 Agent, Security

SoK: Agentic Skills — Beyond Tool Use in LLM Agents

The first systematic map of the agentic skill layer — from formal definition to marketplace attacks. This SoK reveals 7 design patterns, introduces trust tiers for skill governance, and documents the ClawHavoc supply-chain attack that compromised 36.8% of marketplace users.

2026-02-24 arXiv

🟡 LLM, NLP

Adam's Law: Textual Frequency Law on Large Language Models

High-frequency text isn't just easier to read — it makes LLMs significantly smarter. Adam's Law proposes TFL, TFD, and CTFT to harness this principle across 4 NLP tasks.

2026-04-02 arXiv

🔴 Agent, Reasoning

RAGEN-2: Reasoning Collapse in Agentic RL

RL-trained LLM agents silently collapse into repetitive templates despite high entropy. Mutual information (+0.39 Spearman) beats entropy (-0.14) as a diagnostic, and SNR-Aware Filtering restores diverse reasoning across 4 environments.

2026-04-08 arXiv

🔴 Vision, Multimodal

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

What if AI could paint like a human artist — sketching, inspecting, and refining step by step? This paper introduces process-driven image generation with BAGEL-7B, achieving GenEval 0.83 (+5%) and WISE 0.76 (+6%) through Plan→Sketch→Inspect→Refine cycles.

2026-04-08 arXiv

🟡 Agent, LLM

SkillNet: Create, Evaluate, and Connect AI Skills

SkillNet introduces an open infrastructure for creating, evaluating, and connecting AI agent skills at scale, featuring a unified ontology over 200,000+ skills that boosts average rewards by 40% and cuts execution steps by 30% on ALFWorld, WebShop, and ScienceWorld.

2026-02-26T14:24:02+00:00 arXiv

🔴 LLM, Memory

MemOS: A Memory OS for AI System

What if LLMs had their own operating system for memory? MemOS unifies plaintext, KV cache, and model weights as schedulable resources—achieving state-of-the-art on all major memory benchmarks.

2025-07-04T17:21:46+00:00 arXiv

🔴 Agent, Retrieval

Learning to Retrieve from Agent Trajectories

A new training paradigm for IR systems: learning to retrieve from agent trajectories bridges the gap between human-designed search and LLM-powered agent consumption.

2026-03-30 arXiv

🟡 Agent, Benchmark

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Claw-Eval introduces trajectory-aware grading, safety evaluation, and multimodal coverage to build trustworthy benchmarks for autonomous LLM agents.

2026-04-07 arXiv

🔴 Benchmark, Video AI

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Video-MME-v2 exposes a 41-point AI-human gap in video understanding, using group-based evaluation to reveal hidden failures in consistency and multimodal reasoning.

2026-04-06 arXiv

🟡 Audio, LLM

VibeVoice Technical Report

VibeVoice synthesizes 90-minute, 4-speaker conversations using next-token diffusion with a 7.5 Hz tokenizer that compresses speech 80× vs Encodec — making long-form multi-speaker TTS feasible in a standard LLM context window.

2025-08-26T17:09:12Z arXiv

🔴 LLM, Reasoning

InCoder-32B-Thinking: Industrial Code World Model for Thinking

InCoder-32B-Thinking bridges general and industrial code intelligence through Error-driven Chain-of-Thought synthesis and an Industrial Code World Model, achieving 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

arXiv

🔴 Agent, Benchmark

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Current top AI models score just 56.3% on tasks humans ace at 93.8% — this benchmark finally exposes exactly where and why multimodal agents fail.

2026-04-03 arXiv

🟡 Multimodal, Spatial Reasoning

Token Warping Helps MLLMs Look from Nearby Viewpoints

Token warping — rearranging ViT image tokens rather than pixels — enables MLLMs to reason from nearby viewpoints without fine-tuning, consistently outperforming all baselines on the new ViewBench benchmark.

2026-04-03 arXiv

🔴 LLM, Training

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

DataFlex unifies dynamic data selection, domain mixture optimization, and sample reweighting into a single LLaMA-Factory-compatible framework, enabling reproducible data-centric LLM training with consistent MMLU gains.

2026-03-27T08:28:02+00:00 arXiv

🟡 Video, VLM

A Simple Baseline for Streaming Video Understanding

A minimal 4-frame sliding-window baseline beats all published streaming video models with half the GPU memory.

2026-04-02 arXiv

🔴 LLM, Agent

Self-Distilled RLVR

RLSD solves the information leakage problem of on-policy self-distillation by repurposing the teacher as a token-level magnitude evaluator, achieving state-of-the-art on 5 multimodal reasoning benchmarks.

2026-04-03 arXiv

🟡 Multimodal, Audio

LTX-2: Efficient Joint Audio-Visual Foundation Model

A unified foundation model that jointly generates synchronized audio and video from text prompts, eliminating the need for separate audio and video pipelines.

2026-01-06T18:24:41+00:00 arXiv

🔴 LLM, Reasoning

Attention Residuals

A simple architectural modification to Transformers that feeds attention outputs back as residuals, improving reasoning and long-context performance without additional parameters.

2026-03-16T09:32:21+00:00 arXiv

🟡 Agent, Reasoning

ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers

A three-layer security framework for autonomous AI agents that protects against data leakage, privilege escalation, and malicious tool execution in real-time.

2026-03-25T15:27:54+00:00 arXiv

🔴 Agent, LLM

Natural-Language Agent Harnesses

This paper introduces Natural-Language Agent Harnesses (NLAHs), showing that agent control logic can be expressed in editable text rather than code — achieving a 55% performance boost when migrating from code to natural language.

2026-03-26 arXiv

🟡 LLM, Benchmark

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory provides a unified framework for fine-tuning 100+ language models with minimal code, supporting LoRA, QLoRA, RLHF and more out of the box.

arXiv

🟡 Agent, Vision

PaperBanana: Automating Academic Illustration for AI Scientists

PaperBanana automates publication-ready academic illustrations using VLM-powered agents — a potential game-changer for the AI research workflow.

2026-01-30T18:33:37+00:00 arXiv

🔴 Agent, Multimodal

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

CARLA-Air unifies drone flight and autonomous driving in a single simulation, enabling air-ground cooperative AI research without co-simulation overhead.

arXiv

🔴 Agent, LLM

AutoHarness: Improving LLM Agents by Automatically Synthesizing a Code Harness

Smaller models beat larger ones by auto-synthesizing code harnesses that eliminate illegal moves — a new paradigm for efficient agent design.

2026-03-04 arXiv

🔴 LLM, harness engineering

Meta-Harness: End-to-End Optimization of Model Harnesses

A coding agent that automatically discovers better LLM harnesses—achieving rank #1 on TerminalBench-2 and +7.7 points over ACE on text classification, using filesystem access for causal diagnosis.

2026-03-28T17:59:04+00:00 arXiv

🔴 LLM, Reasoning

Tool Building as a Path to "Superintelligence"

Could AI achieve superintelligence by building its own tools? This paper argues yes — through the Diligent Learner framework combining test-time search with tool-building.

2026-02-25 arXiv

🟡 Agent, LLM

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

ACE (Agentic Context Engineering) treats LLM contexts as evolving playbooks, achieving +10.6% on agent benchmarks through systematic context optimization.

2025-10-06 arXiv

🟡 Vision, Diffusion

RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models

RealRestorer leverages large-scale image editing models to achieve generalizable real-world image restoration, tackling complex degradations that previous methods couldn't handle.

2026-03-26T14:39:39+00:00 arXiv

🟡 Audio, Multimodal

Voxtral TTS

Voxtral TTS by Mistral AI generates highly natural multilingual speech from minimal data, setting a new standard for expressive text-to-speech.

2026-03-26T15:23:34+00:00 arXiv

🔴 Diffusion, LLM

Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

Calibri reveals hidden capacity in Diffusion Transformers through lightweight parameter-efficient calibration, achieving significant quality gains with minimal compute overhead.

2026-03-25T20:19:50+00:00 arXiv