CARLA-Air: Fly Drones Inside a CARLA World

Abstract

The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates a growing need for simulation infrastructure that can jointly model aerial and ground agents within a single, physically coherent environment. Existing open-source platforms remain domain-specific: urban driving simulators provide rich traffic but no aerial dynamics, while multirotor simulators offer physics-accurate flight but lack realistic ground scenes.

CARLA-Air is an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. It preserves both CARLA and AirSim Python APIs, enables photorealistic rendering with up to 18 synchronized sensor modalities, and supports custom asset import for diverse scenarios. The platform is validated through performance benchmarks and five representative application workflows spanning cooperative landing, vision-language navigation, multi-modal dataset collection, cross-view perception, and reinforcement learning.

CARLA-Air teaser figure — **Figure 1:** CARLA-Air enables drones, vehicles, and pedestrians to coexist in a shared urban simulation world, supporting custom assets, vision-language navigation, multi-modal sensors, and diverse scenario coverage.

Air + Ground Agents Single UE4 Process 18 Sensor Modalities Custom Asset Import Open Source 5 Application Workflows

Why CARLA-Air?

Three converging frontiers are reshaping autonomous systems research: the low-altitude economy demands scalable infrastructure for urban air mobility and drone logistics; embodied intelligence requires agents that perceive and act in photorealistic environments; and air-ground cooperation demands joint aerial-terrestrial reasoning within a shared world.

The problem: Existing simulators are domain-specific. CARLA excels at urban driving but has no aerial dynamics. AirSim provides physics-accurate flight but lacks realistic ground traffic. Bridge-based co-simulation approaches (running both simulators as separate processes) introduce synchronization overhead and cannot guarantee strict spatiotemporal consistency — per-frame data transfer latency grows from 1ms to over 20ms as sensors increase.

What is "bridge-based co-simulation"?

Imagine running two separate video games on your computer and trying to sync them in real-time. That's essentially what "bridge-based co-simulation" does — it runs CARLA (driving sim) and AirSim (drone sim) as two independent programs and shuttles data between them over a network connection. The problem? Every time the drone needs to "see" a car, that visual data has to travel between processes, adding delay. With 16 sensors, this delay hits 20ms per frame — enough to break real-time requirements. CARLA-Air eliminates this by putting everything in one program.

CARLA-Air solves this by integrating both platforms into a single Unreal Engine 4 process. Instead of bridging separate processes, it inherits CARLA's ground subsystems and composes AirSim's aerial flight actor within a unified GameMode. This eliminates inter-process communication entirely, achieving sub-millisecond data transfer regardless of sensor count.

Simulator comparison landscape — **Figure 2:** Simulator landscape comparison. CARLA-Air occupies the previously empty quadrant of high-fidelity simulation with multi-domain agent support, combining the strengths of both CARLA (ground) and AirSim (aerial).

System Architecture

CARLA-Air integrates CARLA and AirSim within a single Unreal Engine process through a minimal bridging layer. The key insight is a composition-based design: CARLAAirGameMode inherits from CARLA's GameModeBase to acquire all ground simulation subsystems (episode management, weather, traffic, actors, scenario recorder), while AirSim's aerial flight actor (physics engine, flight pawn, aerial sensor suite) is composed as a standard world entity during BeginPlay. Both Python APIs connect through separate RPC servers to the same UE4 process, sharing a unified rendering pipeline for RGB, depth, segmentation, and weather effects.

Understanding the GameMode Design Pattern

In Unreal Engine (the game engine powering both CARLA and AirSim), a GameMode is like the "operating system" of a simulation — it controls what actors exist, how physics works, and what rules apply. The catch: UE4 only allows one GameMode at a time.

The Problem: CARLA has its own GameMode (for cars, traffic, pedestrians) and AirSim has its own (for drones, flight physics). You can't run both simultaneously.
The Solution: CARLA-Air creates a new GameMode (CARLAAirGameMode) that inherits CARLA's ground capabilities (like a child class) and composes AirSim's aerial capabilities (like adding a plugin). This is a classic software engineering pattern — inheritance + composition.
Real-world analogy: Think of it like a smartphone that can run both iOS apps and Android apps by building one OS that natively understands both app formats, rather than running two phones side by side.

**Figure 3:** CARLA-Air system architecture. Python CARLA and AirSim clients connect via separate RPC servers to a single UE4 process running CARLAAirGameMode, which unifies ground subsystems (inherited) and aerial flight actor (composed).

Key Design Decisions

GameMode Conflict Resolution

Unreal Engine only allows one active GameMode per world. A naive approach of loading both CARLA and AirSim GameModes fails silently — one mode is discarded, making its API inoperative. CARLA-Air solves this through inheritance + composition: CARLAAirGameMode inherits CARLA's GameModeBase (getting ground subsystems) and composes AirSim's flight actor as a separate entity, both fitting within the single GameMode slot.

Coordinate System Mapping

UE4/CARLA uses a left-handed coordinate system (Z-up, centimeters), while AirSim uses NED (North-East-Down) with meters. CARLA-Air maintains a real-time coordinate transform layer that handles the Z-flip and scale conversion (cm ↔ m), ensuring both APIs report consistent positions.

NED Coordinate System

NED stands for North-East-Down — a standard coordinate system in aviation where X points North, Y points East, and Z points downward. This is opposite to UE4's gaming convention where Z points up. CARLA-Air automatically converts between these two systems so that drone positions reported by AirSim's API match the corresponding locations in CARLA's world.

Asset Import Pipeline

CARLA-Air provides a streamlined pipeline for importing custom 3D assets (vehicles, robots, drones) into the simulation. Users can bring their own models — from mobile robots to sport cars — enabling diverse research scenarios beyond the default asset library.

Feature Comparison

CARLA-Air is the only platform that achieves full coverage across all key capabilities: ground vehicles, aerial agents, high-fidelity physics, rich sensor suites, multi-agent support, weather simulation, custom asset import, and open-source availability.

Platform	Ground	Aerial	Physics	Sensors	Multi-Agent	Weather	Custom Assets	Open Source
CARLA [2]	✓	✗	✓	✓	✓	✓	✓	✓
AirSim [15]	✗	✓	✓	✓	✓	~	✓	✓
Isaac Lab [11]	✓	~	✓	✓	✓	✗	✓	✓
Habitat [14]	✗	✗	~	✓	✓	✗	✓	✓
SUMO [8]	~	✗	✗	✗	✓	✗	✗	✓
MetaDrive [7]	✓	✗	~	✓	✓	~	~	✓
OmniDrones [19]	✗	✓	✓	✓	✓	✗	✓	✓
TranSimHub [17]	✓	~	~	~	✓	~	~	✓
CARLA-Air (Ours)	✓	✓	✓	✓	✓	✓	✓	✓

Performance Evaluation

All benchmarks were conducted on a single workstation with an NVIDIA RTX 4090 GPU, running CARLA 0.9.15 on Unreal Engine 4.26. Three experiments evaluate frame rate scaling, memory stability, and communication latency.

30+

FPS

Multi-domain simulation at 1280×720 with ground vehicles, pedestrians, and a drone

~0.4%

VRAM growth

Over 3 hours of continuous operation with 357 reset cycles — no memory leak detected

<0.5ms

latency

Per-frame data transfer, compared to 20ms for bridge-based co-simulation with 16 sensors

Frame Rate & Resource Scaling

Adding the aerial domain introduces minimal overhead: ground-only CARLA achieves 28.4 FPS while multi-domain CARLA-Air maintains 19.8–26.3 FPS under comparable workloads — a less than 5% FPS drop attributable to the integration layer. The aerial-only configuration runs at 44.7 FPS, confirming that the flight physics engine is lightweight.

Why is "less than 5% FPS drop" impressive?

When you add an entire flight simulation system (physics engine, drone sensors, aerial rendering) on top of an already-complex driving simulator, you'd expect a significant performance hit. The fact that CARLA-Air only loses ~5% of frame rate means the integration layer is extremely lightweight. For context, running the same simulations as separate processes (bridge approach) would add 20ms of communication overhead per frame, which effectively halves the usable frame rate at 30 FPS.

Configuration	Scenario	FPS	VRAM (MiB)	CPU (%)
Ground only	3 vehicles + 2 peds; 8 sensors @ 1280x720	28.4 ±1.2	3,821 ±10	31 ±3
Aerial only	1 drone; 8 sensors @ 1280x720	44.7 ±2.1	2,941 ±8	29 ±3
Baseline	Town10HD; no actors; no sensors	60.0 ±0.4	3,702 ±8	12 ±2
Multi-domain	3 vehicles + 2 peds; 8 sensors @ 1280x720	26.3 ±1.4	3,831 ±11	38 ±4
Multi-domain	3 vehicles + 2 peds + 1 drone; 8 sensors @ 1280x720	19.8 ±1.1	3,870 ±13	54 ±5
Multi-domain	8 autopilot vehicles + 1 drone; 1 RGB @ 1920x1080	20.1 ±1.8	3,874 ±15	61 ±6
Stability test	Moderate joint; 357 reset cycles; 3 hr	19.7 ±1.3	3,878 ±17	55 ±5

Memory Stability

During a 3-hour continuous stability test with 357 spawn/destroy cycles, VRAM usage remained stable at approximately 3,870 MiB. The growth from early to late phase was only ~0.4% (16 MiB), confirming the absence of memory leaks. The simulation completed with zero crashes, demonstrating production-grade reliability.

Communication Latency

Because CARLA-Air keeps all data within a single process, per-frame data transfer stays below 0.5 milliseconds regardless of the number of concurrent sensors. In contrast, bridge-based co-simulation latency grows linearly — reaching 20ms with 16 sensors. Individual API operations (world state queries, actor spawning, image capture) are 4–10× faster than bridge equivalents.

Why microseconds matter here

At 30 FPS, each frame has a budget of ~33 milliseconds. A bridge co-simulation using 16 sensors consumes 20ms just for data transfer — leaving only 13ms for actual computation. CARLA-Air's sub-0.5ms transfer means nearly the entire frame budget goes to useful work (physics, rendering, AI), making real-time multi-sensor simulation practical.

Operation	Bridge (μs)	CARLA-Air (μs)
World state snapshot	320	40
Actor transform query	280	35
Actor spawn	1,850	210
Image capture (1 RGB)	3,200	380
Velocity command	490	60
Cross-process sync (ref.)	3,000	2,000

Representative Applications

CARLA-Air enables a wide range of research workflows that require both aerial and ground agents operating in a shared environment. Five representative applications demonstrate the platform's versatility:

W1: Air-Ground Cooperative Precision Landing

A drone autonomously tracks and lands on a moving ground vehicle using the shared world state. The system uses a unified Python script that controls both the ground client (vehicle trajectory) and aerial client (drone flight controller). The drone's approach, descent, and landing phases are coordinated through real-time shared positioning, achieving horizontal convergence within 0.5m tolerance.

W1 workflow diagram — **Figure 9:** W1 workflow — a single Python script coordinates both ground and aerial RPC clients within the shared UE4 process.

W1 precision landing results — **Figure 10:** Precision landing results: (a) camera frames during approach, (b) 3D trajectory showing approach → descent → touchdown, (c) altitude profile, (d) horizontal error converging to <0.5m.

W2: Embodied Navigation & VLN/VLA Data Generation

CARLA-Air generates training data for Vision-Language Navigation (VLN) and Vision-Language-Action (VLA) models. Drones navigate urban environments following natural language instructions like "Fly across the bridge to the city center." The platform collects paired visual observations and action trajectories, enabling researchers to train embodied agents that understand both visual scenes and language commands from aerial perspectives.

VLN and VLA Explained

Vision-Language Navigation (VLN) is a research area where an AI agent navigates a physical environment by following natural language instructions — like telling a drone "fly across the bridge to the city center." The agent must understand both the language and the visual scene to navigate correctly.

Vision-Language-Action (VLA) extends this by also generating motor commands — the agent doesn't just plan a path, it directly outputs control signals (velocity, direction) based on what it sees and hears. CARLA-Air provides the training ground for these models by generating paired visual-linguistic-action data at scale.

W3: Synchronized Multi-Modal Dataset Collection

CARLA-Air simultaneously collects time-synchronized, multi-modal data from both aerial and ground perspectives. Because both sensor suites operate within the same physics tick, the data is perfectly aligned — no post-hoc synchronization needed. The platform captures up to 18 sensor modalities including RGB, depth, semantic segmentation, optical flow, surface normals, and LiDAR from both viewpoints.

W3 multi-modal sensor grid — **Figure 13:** Multi-modal sensor outputs from CARLA-Air showing RGB, depth, segmentation, optical flow, and more from both ground and aerial perspectives.

W4: Air-Ground Cross-View Perception

This workflow trains models to match aerial and ground-level views of the same scene. CARLA-Air supports systematic data collection across 6 maps × 7 weather conditions (Clear Noon, Cloudy, Dense Fog, Hard Rain, Night, Soft Rain, Sunset), providing comprehensive visual diversity for robust cross-view perception research.

W4 weather and map variations — **Figure 14:** Cross-view perception dataset samples across 6 CARLA towns and 7 weather conditions, demonstrating the environmental diversity available for training robust perception models.

W5: Reinforcement Learning Training Environment

CARLA-Air provides a Gymnasium-compatible RL environment for drone navigation tasks. The observation space includes drone RGB images, depth maps, pose information, and NPC vehicle positions. The action space controls velocity commands (Δv_x, Δv_y, Δv_z). A reward function combines progress toward the target, altitude bonuses, and collision penalties: r = r_progress + r_altitude − r_collision.

RL Environment Design Basics

Reinforcement Learning (RL) trains an agent by trial and error — like teaching a dog new tricks with treats. The key components CARLA-Air provides:

Observation Space: What the drone "sees" — RGB images, depth maps, its own position, and where other vehicles are
Action Space: What the drone can "do" — velocity changes in 3D (forward/backward, left/right, up/down)
Reward Function: The "treat" — positive reward for getting closer to the target vehicle, bonus for maintaining good altitude, penalty for collisions

The Gymnasium-compatible interface means researchers can plug in any standard RL algorithm (PPO, SAC, DQN) without custom integration work.

W5 RL environment — **Figure 15:** W5 RL pipeline — CARLA-Air provides observation and action spaces to a policy network, with rewards based on drone-vehicle proximity and collision avoidance.

Limitations & Future Work

The current release of CARLA-Air is validated for the five workflows presented above. Several constraints remain as active engineering targets:

Actor density: Performance is characterized at moderate traffic loads; high-density scenes with large simultaneous actor populations are still being optimized.
Environment resets: Map switching requires a full process restart due to independent actor lifecycle management. Staged in-session resets are planned for a future release.
Multi-drone scale: Configurations beyond two drones are functional but not yet formally validated across a wide range of scenarios.

Future Directions

Near-term work includes physics-state synchronization between the two engines and a ROS 2 bridge for broader ecosystem integration. Longer-term goals include GPU-parallel multi-environment execution (similar to Isaac Lab) to increase episode throughput for large-scale RL, and potential migration to Unreal Engine 5. Since AirSim's upstream development has been archived, CARLA-Air maintains the aerial stack independently with bug fixes and feature extensions.

Conclusion

Simulation platforms for autonomous systems have historically fragmented along domain boundaries, forcing researchers to maintain inter-process bridge infrastructure or accept capability compromises. CARLA-Air resolves this fragmentation by integrating high-fidelity urban driving (CARLA) and physics-accurate multirotor flight (AirSim) within a single Unreal Engine process.

Why this matters for the field

Before CARLA-Air, researchers studying air-ground cooperation had to either (a) maintain complex bridge infrastructure between separate simulators, accepting performance penalties and synchronization bugs, or (b) build simplified custom environments that sacrifice realism. CARLA-Air is the first platform that makes high-fidelity, multi-domain simulation accessible without these compromises — and being open source, it can become a shared standard for the community.

The central technical contribution is the composition-based GameMode design that resolves UE4's single-GameMode constraint. This enables zero inter-process latency, full API compatibility with both CARLA and AirSim, and a shared rendering pipeline. The platform maintains stable operation at ~20 FPS under joint workloads, with zero crashes over 3-hour continuous runs.

Five representative workflows demonstrate capabilities that are structurally inaccessible in single-domain platforms: cooperative air-ground landing, vision-language navigation from aerial perspectives, synchronized multi-modal dataset collection, cross-view perception, and RL training. CARLA-Air is released as open source with prebuilt binaries and full source code.

References (20 papers)

[1] Amini et al. VISTA 2.0: An open, data-driven simulator for multimodal sensing and policy learning. IEEE RA-L, 2022.
[2] Dosovitskiy et al. CARLA: An open urban driving simulator. CoRL, 2017.
[3] Epic Games. Unreal Engine 4 documentation, 2021.
[4] Furrer et al. RotorS — a modular Gazebo MAV simulator framework. ROS, 2016.
[5] Guerra et al. FlightGoggles: Photorealistic sensor simulation. IROS, 2019.
[6] Koenig & Howard. Design and use paradigms for Gazebo. IROS, 2004.
[7] Li et al. MetaDrive: Composing diverse driving scenarios. IEEE TPAMI, 2022.
[8] Lopez et al. Microscopic traffic simulation using SUMO. ITSC, 2018.
[9] Macenski et al. Robot Operating System 2. Science Robotics, 2022.
[10] Makoviychuk et al. Isaac Gym: High performance GPU-based physics simulation. NeurIPS, 2021.
[11] NVIDIA. Isaac Lab: A unified framework for robot learning. arXiv, 2025.
[12] Panerati et al. Learning to fly — a gym environment with PyBullet. CoRL, 2021.
[13] Rong et al. LGSVL Simulator. ITSC, 2020.
[14] Savva et al. Habitat: A platform for embodied AI research. ICCV, 2019.
[15] Shah et al. AirSim: High-fidelity visual and physical simulation. FSR, 2017.
[16] Song et al. Flightmare: A flexible quadrotor simulator. CoRL, 2020.
[17] Wang et al. TranSimHub. arXiv, 2024.
[18] Xiang et al. SAPIEN: A simulated part-based interactive environment. CVPR, 2020.
[19] Xu et al. OmniDrones: RL in drone control. arXiv, 2023.
[20] Zhu et al. robosuite: A modular simulation framework. arXiv, 2020.