---
arxiv_id: 2603.28032
title: "CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence"
authors:
  - Tianle Zeng
  - Hanxuan Chen
  - Yanci Wen
  - Hong Zhang
difficulty: Advanced
tags:
  - Agent
  - Multimodal
  - Reasoning
published_at: 2026-04-05
flecto_url: https://flecto.zer0ai.dev/papers/2603.28032/
lang: en
---

> A Unified Infrastructure for Air-Ground Embodied Intelligence

**Authors**: Tianle Zeng, Hanxuan Chen, Yanci Wen, Hong Zhang &mdash; Southern University of Science and Technology & Hunan University

## Abstract


The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates a growing need for simulation infrastructure that can jointly model aerial and ground agents within a single, physically coherent environment. Existing open-source platforms remain domain-specific: urban driving simulators provide rich traffic but no aerial dynamics, while multirotor simulators offer physics-accurate flight but lack realistic ground scenes.

CARLA-Air is an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. It preserves both CARLA and AirSim Python APIs, enables photorealistic rendering with up to 18 synchronized sensor modalities, and supports custom asset import for diverse scenarios. The platform is validated through performance benchmarks and five representative application workflows spanning cooperative landing, vision-language navigation, multi-modal dataset collection, cross-view perception, and reinforcement learning.

Figure 1: CARLA-Air enables drones, vehicles, and pedestrians to coexist in a shared urban simulation world, supporting custom assets, vision-language navigation, multi-modal sensors, and diverse scenario coverage.

At a glance:

- Air + ground agents
- Single UE4 process
- 18 sensor modalities
- Custom asset import
- Open source
- 5 application workflows

## Introduction

### Why CARLA-Air?

Three converging frontiers are reshaping autonomous systems research: the low-altitude economy demands scalable infrastructure for urban air mobility and drone logistics; embodied intelligence requires agents that perceive and act in photorealistic environments; and air-ground cooperation calls for joint aerial-terrestrial reasoning within a shared world.

The problem: Existing simulators are domain-specific. CARLA excels at urban driving but has no aerial dynamics. AirSim provides physics-accurate flight but lacks realistic ground traffic. Bridge-based co-simulation approaches (running both simulators as separate processes) introduce synchronization overhead and cannot guarantee strict spatiotemporal consistency &mdash; per-frame data transfer latency grows from 1ms to over 20ms as sensors increase.

CARLA-Air solves this by integrating both platforms into a single Unreal Engine 4 process. Instead of bridging separate processes, it inherits CARLA's ground subsystems and composes AirSim's aerial flight actor within a unified GameMode. This eliminates inter-process communication entirely, achieving sub-millisecond data transfer regardless of sensor count.
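
Because both RPC servers live inside one process, a single Python script can drive the ground and aerial APIs side by side. A minimal connection sketch, assuming the stock default ports of each SDK (2000 for the CARLA client, 41451 for AirSim's RPC server); the SDK imports are deferred so the helper reads without either package installed:

```python
def endpoints(host="127.0.0.1", carla_port=2000, airsim_port=41451):
    # Both RPC servers run inside the same UE4 process, so the host is shared.
    return {"carla": (host, carla_port), "airsim": (host, airsim_port)}

def connect(host="127.0.0.1"):
    # Deferred imports: `pip install carla airsim` provides both clients.
    import carla
    import airsim

    eps = endpoints(host)
    ground = carla.Client(*eps["carla"])   # ground-side API (vehicles, traffic, weather)
    ground.set_timeout(5.0)
    world = ground.get_world()

    air = airsim.MultirotorClient(ip=eps["airsim"][0], port=eps["airsim"][1])
    air.confirmConnection()                # aerial-side API (flight controller)
    air.enableApiControl(True)
    return world, air
```

From here, one script can spawn traffic through `world` and fly the drone through `air`, the pattern the cooperative-landing workflow relies on.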

Figure 2: Simulator landscape comparison. CARLA-Air occupies the previously empty quadrant of high-fidelity simulation with multi-domain agent support, combining the strengths of both CARLA (ground) and AirSim (aerial).

## Conclusion


Simulation platforms for autonomous systems have historically fragmented along domain boundaries, forcing researchers to maintain inter-process bridge infrastructure or accept capability compromises. CARLA-Air resolves this fragmentation by integrating high-fidelity urban driving (CARLA) and physics-accurate multirotor flight (AirSim) within a single Unreal Engine process.

The central technical contribution is the composition-based GameMode design that resolves UE4's single-GameMode constraint. This enables zero inter-process latency, full API compatibility with both CARLA and AirSim, and a shared rendering pipeline. The platform maintains stable operation at ~20 FPS under joint workloads, with zero crashes over 3-hour continuous runs.

Five representative workflows demonstrate capabilities that are structurally inaccessible in single-domain platforms: cooperative air-ground landing, vision-language navigation from aerial perspectives, synchronized multi-modal dataset collection, cross-view perception, and RL training. CARLA-Air is released as open source with prebuilt binaries and full source code.


## Feature Comparison


CARLA-Air is the only platform that achieves full coverage across all key capabilities: ground vehicles, aerial agents, high-fidelity physics, rich sensor suites, multi-agent support, weather simulation, custom asset import, and open-source availability.

## Architecture

### System Architecture

CARLA-Air integrates CARLA and AirSim within a single Unreal Engine process through a minimal bridging layer. The key insight is a composition-based design: CARLAAirGameMode inherits from CARLA's GameModeBase to acquire all ground simulation subsystems (episode management, weather, traffic, actors, scenario recorder), while AirSim's aerial flight actor (physics engine, flight pawn, aerial sensor suite) is composed as a standard world entity during BeginPlay. Both Python APIs connect through separate RPC servers to the same UE4 process, sharing a unified rendering pipeline for RGB, depth, segmentation, and weather effects.

Figure 3: CARLA-Air system architecture. Python CARLA and AirSim clients connect via separate RPC servers to a single UE4 process running CARLAAirGameMode, which unifies ground subsystems (inherited) and aerial flight actor (composed).

### Key Design Decisions

### GameMode Conflict Resolution

Unreal Engine only allows one active GameMode per world. A naive approach of loading both CARLA and AirSim GameModes fails silently &mdash; one mode is discarded, making its API inoperative. CARLA-Air solves this through inheritance + composition: CARLAAirGameMode inherits CARLA's GameModeBase (getting ground subsystems) and composes AirSim's flight actor as a separate entity, both fitting within the single GameMode slot.

### Coordinate System Mapping

UE4/CARLA uses a left-handed coordinate system (Z-up, centimeters), while AirSim uses NED (North-East-Down) with meters. CARLA-Air maintains a real-time coordinate transform layer that handles the Z-flip and scale conversion (cm &harr; m), ensuring both APIs report consistent positions.
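
A sketch of that transform layer as pure functions. The Z-flip and cm&harr;m scaling follow directly from the text; the exact axis pairing (UE +X to North, UE +Y to East) is an assumption borrowed from AirSim's usual UE convention, not confirmed by the paper:

```python
CM_PER_M = 100.0

def ue_to_ned(x_cm, y_cm, z_cm):
    """UE4 (left-handed, Z-up, centimeters) -> NED (meters).
    Assumed pairing: UE +X -> North, UE +Y -> East; Z is flipped (up -> down)."""
    return (x_cm / CM_PER_M, y_cm / CM_PER_M, -z_cm / CM_PER_M)

def ned_to_ue(north_m, east_m, down_m):
    """Inverse transform: NED (meters) -> UE4 (centimeters)."""
    return (north_m * CM_PER_M, east_m * CM_PER_M, -down_m * CM_PER_M)

# Example: a drone hovering 3 m above the UE origin sits at z = +300 cm
# on the UE side and at down = -3.0 m on the NED side.
```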

### Asset Import Pipeline

CARLA-Air provides a streamlined pipeline for importing custom 3D assets (vehicles, robots, drones) into the simulation. Users can bring their own models &mdash; from mobile robots to sports cars &mdash; enabling diverse research scenarios beyond the default asset library.

## Performance

### Performance Evaluation

All benchmarks were conducted on a single workstation with an NVIDIA RTX 4090 GPU, running CARLA 0.9.15 on Unreal Engine 4.26. Three experiments evaluate frame rate scaling, memory stability, and communication latency.

Headline results:

- Multi-domain simulation at 1280&times;720 with ground vehicles, pedestrians, and a drone
- Over 3 hours of continuous operation with 357 reset cycles &mdash; no memory leak detected
- Per-frame data transfer under 0.5 ms, compared to 20 ms for bridge-based co-simulation with 16 sensors

### Frame Rate & Resource Scaling

Adding the aerial domain introduces minimal overhead: ground-only CARLA achieves 28.4 FPS while multi-domain CARLA-Air maintains 19.8&ndash;26.3 FPS under comparable workloads &mdash; a less than 5% FPS drop attributable to the integration layer. The aerial-only configuration runs at 44.7 FPS, confirming that the flight physics engine is lightweight.

### Memory Stability

Figure 7: VRAM usage over a 3-hour continuous run. The early-phase mean is 3,862 MiB and the late-phase mean is 3,878 MiB &mdash; a growth of only ~0.4%.

During a 3-hour continuous stability test with 357 spawn/destroy cycles, VRAM usage remained stable at approximately 3,870 MiB. The growth from early to late phase was only ~0.4% (16 MiB), confirming the absence of memory leaks. The simulation completed with zero crashes, demonstrating production-grade reliability.

### Communication Latency

Figure 8: Per-frame data transfer comparison. Bridge-based co-simulation latency grows linearly with sensor count (1&ndash;20ms), while CARLA-Air maintains <0.5ms regardless of sensor count.

Because CARLA-Air keeps all data within a single process, per-frame data transfer stays below 0.5 milliseconds regardless of the number of concurrent sensors. In contrast, bridge-based co-simulation latency grows linearly &mdash; reaching 20ms with 16 sensors. Individual API operations (world state queries, actor spawning, image capture) are 4&ndash;10&times; faster than bridge equivalents.

## Applications

### Representative Applications

CARLA-Air enables a wide range of research workflows that require both aerial and ground agents operating in a shared environment. Five representative applications demonstrate the platform's versatility:

### W1: Air-Ground Cooperative Precision Landing

A drone autonomously tracks and lands on a moving ground vehicle using the shared world state. The system uses a unified Python script that controls both the ground client (vehicle trajectory) and aerial client (drone flight controller). The drone's approach, descent, and landing phases are coordinated through real-time shared positioning, achieving horizontal convergence within 0.5m tolerance.
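
The tracking logic can be illustrated with a toy proportional pursuit law &mdash; a hypothetical stand-in for the paper's controller, not its actual implementation: the commanded drone velocity feeds forward the pad's velocity and adds a proportional pull on the horizontal error, so the error contracts geometrically even while the vehicle keeps moving.

```python
def landing_step(drone_xy, pad_xy, pad_vel_xy, dt, gain=1.5):
    """One control tick: velocity command = pad feedforward + P-term on error."""
    ex, ey = pad_xy[0] - drone_xy[0], pad_xy[1] - drone_xy[1]
    vx = pad_vel_xy[0] + gain * ex
    vy = pad_vel_xy[1] + gain * ey
    # Integrate the drone's horizontal position over one tick.
    return (drone_xy[0] + vx * dt, drone_xy[1] + vy * dt)

def horizontal_error(drone_xy, pad_xy):
    return ((pad_xy[0] - drone_xy[0]) ** 2 + (pad_xy[1] - drone_xy[1]) ** 2) ** 0.5

# Toy scenario: pad drives at 5 m/s along x; drone starts 10 m behind.
drone, pad, vel, dt = (-10.0, 0.0), (0.0, 0.0), (5.0, 0.0), 0.05
for _ in range(60):  # 3 simulated seconds
    drone = landing_step(drone, pad, vel, dt)
    pad = (pad[0] + vel[0] * dt, pad[1] + vel[1] * dt)
```

With gain 1.5 and a 50 ms tick, the error shrinks by a factor of 0.925 per step and falls below the 0.5 m touchdown tolerance within roughly two seconds.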

Figure 9: W1 workflow &mdash; a single Python script coordinates both ground and aerial RPC clients within the shared UE4 process.

Figure 10: Precision landing results: (a) camera frames during approach, (b) 3D trajectory showing approach &rarr; descent &rarr; touchdown, (c) altitude profile, (d) horizontal error converging to <0.5m.

### W2: Embodied Navigation & VLN/VLA Data Generation

CARLA-Air generates training data for Vision-Language Navigation (VLN) and Vision-Language-Action (VLA) models. Drones navigate urban environments following natural language instructions like "Fly across the bridge to the city center." The platform collects paired visual observations and action trajectories, enabling researchers to train embodied agents that understand both visual scenes and language commands from aerial perspectives.

### W3: Synchronized Multi-Modal Dataset Collection

CARLA-Air simultaneously collects time-synchronized, multi-modal data from both aerial and ground perspectives. Because both sensor suites operate within the same physics tick, the data is perfectly aligned &mdash; no post-hoc synchronization needed. The platform captures up to 18 sensor modalities including RGB, depth, semantic segmentation, optical flow, surface normals, and LiDAR from both viewpoints.
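
The alignment guarantee can be pictured with a stub collection loop (the sensor callables below are hypothetical stand-ins for the real sensor APIs): because every modality is sampled inside the same tick, all readings in a frame carry one timestamp by construction.

```python
def collect_synchronized(sensors, num_ticks, dt=0.05):
    """Sample every sensor at every simulation tick.
    `sensors` maps a modality name to a callable(tick) returning a reading;
    all readings in one frame share that tick's timestamp."""
    frames = []
    for tick in range(num_ticks):
        frame = {"t": tick * dt}
        for name, read in sensors.items():
            frame[name] = read(tick)
        frames.append(frame)
    return frames

# Stub sensors standing in for the real RGB / depth / LiDAR streams.
stubs = {
    "rgb":   lambda k: f"rgb-frame-{k}",
    "depth": lambda k: f"depth-map-{k}",
    "lidar": lambda k: f"point-cloud-{k}",
}
frames = collect_synchronized(stubs, num_ticks=3)
```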

Figure 13: Multi-modal sensor outputs from CARLA-Air showing RGB, depth, segmentation, optical flow, and more from both ground and aerial perspectives.

### W4: Air-Ground Cross-View Perception

This workflow trains models to match aerial and ground-level views of the same scene. CARLA-Air supports systematic data collection across 6 maps &times; 7 weather conditions (Clear Noon, Cloudy, Dense Fog, Hard Rain, Night, Soft Rain, Sunset), providing comprehensive visual diversity for robust cross-view perception research.

Figure 14: Cross-view perception dataset samples across 6 CARLA towns and 7 weather conditions, demonstrating the environmental diversity available for training robust perception models.

### W5: Reinforcement Learning Training Environment

CARLA-Air provides a Gymnasium-compatible RL environment for drone navigation tasks. The observation space includes drone RGB images, depth maps, pose information, and NPC vehicle positions. The action space controls velocity commands (&Delta;v_x, &Delta;v_y, &Delta;v_z). A reward function combines progress toward the target, altitude bonuses, and collision penalties: r = r_progress + r_altitude &minus; r_collision.
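
A minimal Gymnasium-style skeleton of such an environment (no gymnasium dependency; the point-mass dynamics, reward weights, and thresholds are illustrative stand-ins, not CARLA-Air's actual values):

```python
class DroneNavEnvSketch:
    """W5-style task: fly a point-mass drone toward a target position."""

    def __init__(self, target=(10.0, 0.0, 5.0), dt=0.1):
        self.target, self.dt = target, dt
        self.pos = None

    def _dist(self):
        return sum((p - t) ** 2 for p, t in zip(self.pos, self.target)) ** 0.5

    def reset(self):
        self.pos = [0.0, 0.0, 2.0]
        # The real observation also includes RGB, depth, and NPC positions.
        return {"pose": tuple(self.pos)}

    def step(self, action):
        # action = (dvx, dvy, dvz): velocity command integrated for one tick.
        before = self._dist()
        for i in range(3):
            self.pos[i] += action[i] * self.dt
        r_progress = before - self._dist()                 # reward moving closer
        r_altitude = 0.1 if self.pos[2] > 1.0 else 0.0     # safe-altitude bonus
        r_collision = 1.0 if self.pos[2] <= 0.0 else 0.0   # ground-strike penalty
        reward = r_progress + r_altitude - r_collision
        done = self._dist() < 0.5 or r_collision > 0.0
        return {"pose": tuple(self.pos)}, reward, done, {}
```

The same `reset`/`step` interface drops into any Gymnasium-compatible training loop once the stub dynamics are replaced by the simulator's flight physics.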

Figure 15: W5 RL pipeline &mdash; CARLA-Air provides observation and action spaces to a policy network, with rewards based on drone-vehicle proximity and collision avoidance.

## Limitations

### Limitations & Future Work

The current release of CARLA-Air is validated for the five workflows presented above. Several constraints remain as active engineering targets:

Actor density: Performance is characterized at moderate traffic loads; high-density scenes with large simultaneous actor populations are still being optimized.

Environment resets: Map switching requires a full process restart due to independent actor lifecycle management. Staged in-session resets are planned for a future release.

Multi-drone scale: Configurations beyond two drones are functional but not yet formally validated across a wide range of scenarios.

### Future Directions

Near-term work includes physics-state synchronization between the two engines and a ROS 2 bridge for broader ecosystem integration. Longer-term goals include GPU-parallel multi-environment execution (similar to Isaac Lab) to increase episode throughput for large-scale RL, and potential migration to Unreal Engine 5. Since AirSim's upstream development has been archived, CARLA-Air maintains the aerial stack independently with bug fixes and feature extensions.
