A Unified Infrastructure for Air-Ground Embodied Intelligence
The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates a growing need for simulation infrastructure that can jointly model aerial and ground agents within a single, physically coherent environment. Existing open-source platforms remain domain-specific: urban driving simulators provide rich traffic but no aerial dynamics, while multirotor simulators offer physics-accurate flight but lack realistic ground scenes.
CARLA-Air is an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. It preserves both CARLA and AirSim Python APIs, enables photorealistic rendering with up to 18 synchronized sensor modalities, and supports custom asset import for diverse scenarios. The platform is validated through performance benchmarks and five representative application workflows spanning cooperative landing, vision-language navigation, multi-modal dataset collection, cross-view perception, and reinforcement learning.
Three converging frontiers are reshaping autonomous systems research: the low-altitude economy demands scalable infrastructure for urban air mobility and drone logistics; embodied intelligence requires agents that perceive and act in photorealistic environments; and air-ground cooperation demands joint aerial-terrestrial reasoning within a shared world.
The problem: existing simulators are domain-specific. CARLA excels at urban driving but has no aerial dynamics. AirSim provides physics-accurate flight but lacks realistic ground traffic. Bridge-based co-simulation approaches (running both simulators as separate processes) introduce synchronization overhead and cannot guarantee strict spatiotemporal consistency: per-frame data transfer latency grows from 1ms to over 20ms as sensors increase.
Imagine running two separate video games on your computer and trying to sync them in real time. That's essentially what "bridge-based co-simulation" does: it runs CARLA (driving sim) and AirSim (drone sim) as two independent programs and shuttles data between them over a network connection. The problem? Every time the drone needs to "see" a car, that visual data has to travel between processes, adding delay. With 16 sensors, this delay hits 20ms per frame, enough to break real-time requirements. CARLA-Air eliminates this by putting everything in one program.
CARLA-Air solves this by integrating both platforms into a single Unreal Engine 4 process. Instead of bridging separate processes, it inherits CARLA's ground subsystems and composes AirSim's aerial flight actor within a unified GameMode. This eliminates inter-process communication entirely, achieving sub-millisecond data transfer regardless of sensor count.
CARLA-Air integrates CARLA and AirSim within a single Unreal Engine process through a minimal bridging layer. The key insight is a composition-based design: CARLAAirGameMode inherits from CARLA's GameModeBase to acquire all ground simulation subsystems (episode management, weather, traffic, actors, scenario recorder), while AirSim's aerial flight actor (physics engine, flight pawn, aerial sensor suite) is composed as a standard world entity during BeginPlay. Both Python APIs connect through separate RPC servers to the same UE4 process, sharing a unified rendering pipeline for RGB, depth, segmentation, and weather effects.
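Concretely, this means a single simulator process can serve both client libraries at once. The following minimal sketch, assuming a running CARLA-Air binary and the default RPC ports (2000 for CARLA, 41451 for AirSim), attaches both Python clients to the same world:

```python
import carla
import airsim

# Ground client: CARLA's standard Python API (default RPC port 2000).
ground = carla.Client("localhost", 2000)
ground.set_timeout(10.0)
world = ground.get_world()

# Aerial client: AirSim's standard Python API (default RPC port 41451),
# pointed at the very same CARLA-Air process.
air = airsim.MultirotorClient(ip="127.0.0.1")
air.confirmConnection()
air.enableApiControl(True)

# Both views describe one shared world: spawn a vehicle through CARLA,
# then read the drone's pose through AirSim, with no bridge in between.
bp = world.get_blueprint_library().filter("vehicle.*")[0]
vehicle = world.spawn_actor(bp, world.get_map().get_spawn_points()[0])
drone_pose = air.simGetVehiclePose()
```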
In Unreal Engine (the game engine powering both CARLA and AirSim), a GameMode is like the "operating system" of a simulation: it controls what actors exist, how physics works, and what rules apply. The catch: UE4 only allows one GameMode at a time.
CARLA-Air's answer is a hybrid GameMode (CARLAAirGameMode) that inherits CARLA's ground capabilities (like a child class) and composes AirSim's aerial capabilities (like adding a plugin). This is a classic software engineering pattern: inheritance + composition.
Unreal Engine only allows one active GameMode per world. A naive approach of loading both CARLA and AirSim GameModes fails silently: one mode is discarded, making its API inoperative. CARLA-Air solves this through inheritance + composition: CARLAAirGameMode inherits CARLA's GameModeBase (getting ground subsystems) and composes AirSim's flight actor as a separate entity, both fitting within the single GameMode slot.
UE4/CARLA uses a left-handed coordinate system (Z-up, centimeters), while AirSim uses NED (North-East-Down) with meters. CARLA-Air maintains a real-time coordinate transform layer that handles the Z-flip and the scale conversion between centimeters and meters, ensuring both APIs report consistent positions.
NED stands for North-East-Down, a standard coordinate system in aviation where X points North, Y points East, and Z points downward. This is opposite to UE4's gaming convention where Z points up. CARLA-Air automatically converts between these two systems so that drone positions reported by AirSim's API match the corresponding locations in CARLA's world.
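A minimal sketch of that conversion, assuming the transform reduces to the Z-flip and unit scaling described above (the helper names are illustrative, not CARLA-Air's internals):

```python
def ned_to_ue4(x_ned, y_ned, z_ned):
    """AirSim NED (meters) -> UE4 world frame (centimeters, Z-up).

    Negating one axis (Down -> Up) also converts the right-handed NED
    frame into UE4's left-handed convention.
    """
    return 100.0 * x_ned, 100.0 * y_ned, -100.0 * z_ned


def ue4_to_ned(x_ue, y_ue, z_ue):
    """UE4 world frame (centimeters, Z-up) -> AirSim NED (meters)."""
    return x_ue / 100.0, y_ue / 100.0, -z_ue / 100.0
```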
CARLA-Air provides a streamlined pipeline for importing custom 3D assets (vehicles, robots, drones) into the simulation. Users can bring their own models โ from mobile robots to sport cars โ enabling diverse research scenarios beyond the default asset library.
CARLA-Air is the only platform that achieves full coverage across all key capabilities: ground vehicles, aerial agents, high-fidelity physics, rich sensor suites, multi-agent support, weather simulation, custom asset import, and open-source availability.
| Platform | Ground | Aerial | Physics | Sensors | Multi-Agent | Weather | Custom Assets | Open Source |
|---|---|---|---|---|---|---|---|---|
| CARLA [2] | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| AirSim [15] | ✗ | ✓ | ✓ | ✓ | ✓ | ~ | ✗ | ✓ |
| Isaac Lab [11] | ✓ | ~ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Habitat [14] | ✓ | ✗ | ~ | ✓ | ✓ | ✗ | ✓ | ✓ |
| SUMO [8] | ~ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| MetaDrive [7] | ✓ | ✗ | ~ | ✓ | ✓ | ~ | ~ | ✓ |
| OmniDrones [19] | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| TranSimHub [17] | ✓ | ~ | ~ | ~ | ✓ | ~ | ~ | ✓ |
| CARLA-Air (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
All benchmarks were conducted on a single workstation with an NVIDIA RTX 4090 GPU, running CARLA 0.9.15 on Unreal Engine 4.26. Three experiments evaluate frame rate scaling, memory stability, and communication latency.
Multi-domain simulation at 1280×720 with ground vehicles, pedestrians, and a drone
Over 3 hours of continuous operation with 357 reset cycles; no memory leak detected
Per-frame data transfer, compared to 20ms for bridge-based co-simulation with 16 sensors
Adding the aerial domain introduces minimal overhead: ground-only CARLA achieves 28.4 FPS while multi-domain CARLA-Air maintains 19.8–26.3 FPS under comparable workloads, a less than 5% FPS drop attributable to the integration layer. The aerial-only configuration runs at 44.7 FPS, confirming that the flight physics engine is lightweight.
When you add an entire flight simulation system (physics engine, drone sensors, aerial rendering) on top of an already-complex driving simulator, you'd expect a significant performance hit. The fact that CARLA-Air only loses ~5% of frame rate means the integration layer is extremely lightweight. For context, running the same simulations as separate processes (bridge approach) would add 20ms of communication overhead per frame, which effectively halves the usable frame rate at 30 FPS.
| Configuration | Scenario | FPS | VRAM (MiB) | CPU (%) |
|---|---|---|---|---|
| Ground only | 3 vehicles + 2 peds; 8 sensors @ 1280x720 | 28.4 ±1.2 | 3,821 ±10 | 31 ±3 |
| Aerial only | 1 drone; 8 sensors @ 1280x720 | 44.7 ±2.1 | 2,941 ±8 | 29 ±3 |
| Baseline | Town10HD; no actors; no sensors | 60.0 ±0.4 | 3,702 ±8 | 12 ±2 |
| Multi-domain | 3 vehicles + 2 peds; 8 sensors @ 1280x720 | 26.3 ±1.4 | 3,831 ±11 | 38 ±4 |
| Multi-domain | 3 vehicles + 2 peds + 1 drone; 8 sensors @ 1280x720 | 19.8 ±1.1 | 3,870 ±13 | 54 ±5 |
| Multi-domain | 8 autopilot vehicles + 1 drone; 1 RGB @ 1920x1080 | 20.1 ±1.8 | 3,874 ±15 | 61 ±6 |
| Stability test | Moderate joint workload; 357 reset cycles; 3 hr | 19.7 ±1.3 | 3,878 ±17 | 55 ±5 |
During a 3-hour continuous stability test with 357 spawn/destroy cycles, VRAM usage remained stable at approximately 3,870 MiB. The growth from early to late phase was only ~0.4% (16 MiB), confirming the absence of memory leaks. The simulation completed with zero crashes, demonstrating production-grade reliability.
Because CARLA-Air keeps all data within a single process, per-frame data transfer stays below 0.5 milliseconds regardless of the number of concurrent sensors. In contrast, bridge-based co-simulation latency grows linearly, reaching 20ms with 16 sensors. Individual API operations (world state queries, actor spawning, image capture) are 4–10× faster than bridge equivalents.
At 30 FPS, each frame has a budget of ~33 milliseconds. A bridge co-simulation using 16 sensors consumes 20ms just for data transfer, leaving only 13ms for actual computation. CARLA-Air's sub-0.5ms transfer means nearly the entire frame budget goes to useful work (physics, rendering, AI), making real-time multi-sensor simulation practical.
| Operation | Bridge (μs) | CARLA-Air (μs) |
|---|---|---|
| World state snapshot | 320 | 40 |
| Actor transform query | 280 | 35 |
| Actor spawn | 1,850 | 210 |
| Image capture (1 RGB) | 3,200 | 380 |
| Velocity command | 490 | 60 |
| Cross-process sync (ref.) | 3,000 | 2,000 |
CARLA-Air enables a wide range of research workflows that require both aerial and ground agents operating in a shared environment. Five representative applications demonstrate the platform's versatility:
A drone autonomously tracks and lands on a moving ground vehicle using the shared world state. The system uses a unified Python script that controls both the ground client (vehicle trajectory) and aerial client (drone flight controller). The drone's approach, descent, and landing phases are coordinated through real-time shared positioning, achieving horizontal convergence within 0.5m tolerance.
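A condensed sketch of such a control loop, assuming both frames are already aligned by the platform's transform layer; the gain and update rate are illustrative choices, not the paper's published controller:

```python
import math
import airsim

def track_and_land(air, vehicle, tolerance=0.5, gain=1.0):
    """Proportional pursuit of a CARLA vehicle, then landing."""
    air.takeoffAsync().join()
    while True:
        target = vehicle.get_location()           # CARLA ground client
        pose = air.simGetVehiclePose().position   # AirSim aerial client
        dx = target.x - pose.x_val                # shared world state
        dy = target.y - pose.y_val
        if math.hypot(dx, dy) < tolerance:        # 0.5 m horizontal tolerance
            break
        # Chase the vehicle with a velocity command proportional to error.
        air.moveByVelocityAsync(gain * dx, gain * dy, 0.0, duration=0.1).join()
    air.landAsync().join()
```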
CARLA-Air generates training data for Vision-Language Navigation (VLN) and Vision-Language-Action (VLA) models. Drones navigate urban environments following natural language instructions like "Fly across the bridge to the city center." The platform collects paired visual observations and action trajectories, enabling researchers to train embodied agents that understand both visual scenes and language commands from aerial perspectives.
Vision-Language Navigation (VLN) is a research area where an AI agent navigates a physical environment by following natural language instructions โ like telling a drone "fly across the bridge to the city center." The agent must understand both the language and the visual scene to navigate correctly.
Vision-Language-Action (VLA) extends this by also generating motor commands โ the agent doesn't just plan a path, it directly outputs control signals (velocity, direction) based on what it sees and hears. CARLA-Air provides the training ground for these models by generating paired visual-linguistic-action data at scale.
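As a sketch, one recording step might pair the current instruction, the drone's forward RGB frame, and the commanded action; the record schema here is an assumption for illustration, not a published CARLA-Air format:

```python
import airsim

def record_vln_step(air, instruction, action, step_id, dataset):
    """Append one (instruction, observation, action) triple to a dataset."""
    # Forward-facing RGB frame, returned as compressed PNG bytes.
    response = air.simGetImages([
        airsim.ImageRequest("0", airsim.ImageType.Scene, False, True)
    ])[0]
    frame_path = f"frame_{step_id:06d}.png"
    with open(frame_path, "wb") as f:
        f.write(response.image_data_uint8)
    dataset.append({
        "instruction": instruction,  # e.g. "Fly across the bridge to the city center."
        "frame": frame_path,
        "action": action,            # commanded velocity/heading at this step
    })
```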
CARLA-Air simultaneously collects time-synchronized, multi-modal data from both aerial and ground perspectives. Because both sensor suites operate within the same physics tick, the data is perfectly aligned; no post-hoc synchronization is needed. The platform captures up to 18 sensor modalities including RGB, depth, semantic segmentation, optical flow, surface normals, and LiDAR from both viewpoints.
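This alignment falls out of CARLA's standard synchronous mode, which in CARLA-Air steps the aerial and ground domains together; a minimal sketch (sensor setup elided):

```python
import carla

client = carla.Client("localhost", 2000)
world = client.get_world()

# Lockstep simulation: nothing advances until the client calls tick().
settings = world.get_settings()
settings.synchronous_mode = True
settings.fixed_delta_seconds = 0.05   # one physics tick = 50 ms of sim time
world.apply_settings(settings)

for _ in range(100):
    frame = world.tick()  # ground and aerial sensors fire on the same tick,
                          # so their data share this frame id by construction
```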
This workflow trains models to match aerial and ground-level views of the same scene. CARLA-Air supports systematic data collection across 6 maps × 7 weather conditions (Clear Noon, Cloudy, Dense Fog, Hard Rain, Night, Soft Rain, Sunset), providing comprehensive visual diversity for robust cross-view perception research.
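A sweep over that map-by-weather grid can be scripted directly against the CARLA API; in this sketch the town list is abbreviated and the preset names follow carla.WeatherParameters (conditions like Dense Fog and Night would presumably map onto custom parameter sets):

```python
import carla

TOWNS = ["Town01", "Town03", "Town10HD"]                  # abbreviated list
PRESETS = ["ClearNoon", "CloudyNoon", "HardRainNoon",
           "SoftRainNoon", "ClearSunset"]                 # built-in presets

client = carla.Client("localhost", 2000)
client.set_timeout(30.0)
for town in TOWNS:
    world = client.load_world(town)
    for preset in PRESETS:
        world.set_weather(getattr(carla.WeatherParameters, preset))
        # ... run the synchronized cross-view capture for this cell
```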
CARLA-Air provides a Gymnasium-compatible RL environment for drone navigation tasks. The observation space includes drone RGB images, depth maps, pose information, and NPC vehicle positions. The action space controls velocity commands (Δvx, Δvy, Δvz). A reward function combines progress toward the target, altitude bonuses, and collision penalties: r = r_progress + r_altitude − r_collision.
Reinforcement Learning (RL) trains an agent by trial and error, like teaching a dog new tricks with treats. CARLA-Air provides the key RL components: the observation space, action space, and reward function described above.
The Gymnasium-compatible interface means researchers can plug in any standard RL algorithm (PPO, SAC, DQN) without custom integration work.
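A minimal, self-contained sketch of such an environment; the observation shapes, bounds, and reward terms here are illustrative assumptions standing in for live CARLA-Air queries:

```python
import gymnasium as gym
import numpy as np

class DroneNavEnv(gym.Env):
    """Toy stand-in for the CARLA-Air drone navigation environment."""

    def __init__(self):
        self.observation_space = gym.spaces.Dict({
            "rgb":  gym.spaces.Box(0, 255, (720, 1280, 3), np.uint8),
            "pose": gym.spaces.Box(-np.inf, np.inf, (6,), np.float32),
        })
        # Velocity increments (dvx, dvy, dvz).
        self.action_space = gym.spaces.Box(-1.0, 1.0, (3,), np.float32)
        self._goal = np.array([50.0, 0.0, -10.0], np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._pos = np.zeros(3, np.float32)
        return self._obs(), {}

    def step(self, action):
        self._pos += action  # stand-in for a simulator velocity command
        r_progress = -float(np.linalg.norm(self._goal - self._pos))
        r_altitude = 0.1 if self._pos[2] < -5.0 else 0.0  # NED: negative z is up
        r_collision = 0.0   # stand-in for a collision sensor query
        reward = r_progress + r_altitude - r_collision
        terminated = bool(np.linalg.norm(self._goal - self._pos) < 1.0)
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        return {"rgb": np.zeros((720, 1280, 3), np.uint8),
                "pose": np.concatenate([self._pos, np.zeros(3, np.float32)])}
```

Because the interface is standard Gymnasium, an off-the-shelf implementation such as Stable-Baselines3's PPO can train on this environment unchanged.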
The current release of CARLA-Air is validated for the five workflows presented above. Several constraints remain as active engineering targets.
Near-term work includes physics-state synchronization between the two engines and a ROS 2 bridge for broader ecosystem integration. Longer-term goals include GPU-parallel multi-environment execution (similar to Isaac Lab) to increase episode throughput for large-scale RL, and potential migration to Unreal Engine 5. Since AirSim's upstream development has been archived, CARLA-Air maintains the aerial stack independently with bug fixes and feature extensions.
Simulation platforms for autonomous systems have historically fragmented along domain boundaries, forcing researchers to maintain inter-process bridge infrastructure or accept capability compromises. CARLA-Air resolves this fragmentation by integrating high-fidelity urban driving (CARLA) and physics-accurate multirotor flight (AirSim) within a single Unreal Engine process.
Before CARLA-Air, researchers studying air-ground cooperation had to either (a) maintain complex bridge infrastructure between separate simulators, accepting performance penalties and synchronization bugs, or (b) build simplified custom environments that sacrifice realism. CARLA-Air is the first platform that makes high-fidelity, multi-domain simulation accessible without these compromises, and being open source, it can become a shared standard for the community.
The central technical contribution is the composition-based GameMode design that resolves UE4's single-GameMode constraint. This enables zero inter-process latency, full API compatibility with both CARLA and AirSim, and a shared rendering pipeline. The platform maintains stable operation at ~20 FPS under joint workloads, with zero crashes over 3-hour continuous runs.
Five representative workflows demonstrate capabilities that are structurally inaccessible in single-domain platforms: cooperative air-ground landing, vision-language navigation from aerial perspectives, synchronized multi-modal dataset collection, cross-view perception, and RL training. CARLA-Air is released as open source with prebuilt binaries and full source code.