MARL trained the policy in weeks. Sim-to-real took the rest of the year.
In 2021 we trained a MARL system for tactical deconfliction — the runtime problem of keeping drones out of each other's way once they're already in flight. The setup was conventional in places and deliberately unconventional in others: centralized training with decentralized execution, a single policy shared across all agents, every agent broadcasting its position with a cryptographic signature, and all agents jointly optimizing the energy consumed by their 3D maneuvers.
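The shared-policy, decentralized-execution part of that setup is simple to picture in code. This is a minimal sketch, not the production stack — the network shape and dimensions are invented for illustration; the point is only that one set of weights is copied to every agent, and each agent runs it on local observations alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of weights shared by every agent. Centralized training produces
# these weights; decentralized execution just copies them to each drone.
W1 = rng.standard_normal((9, 64)) * 0.1   # obs_dim=9, hidden=64 (illustrative sizes)
W2 = rng.standard_normal((64, 3)) * 0.1   # act_dim=3

def act(obs: np.ndarray) -> np.ndarray:
    """Each agent evaluates the same tiny policy on its own observation."""
    return np.tanh(obs @ W1) @ W2

# Four agents, each acting on local state only — no central node at runtime.
observations = rng.standard_normal((4, 9))
actions = np.vstack([act(o) for o in observations])  # shape (4, 3)
```

At runtime nothing is coordinated except what the agents broadcast to each other; the coordination was baked in during training.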
The MARL part was the easy half.
Why learned over rule-based
Rule-based tactical deconfliction breaks combinatorially. Two drones in a corridor is a textbook problem. Twenty drones over a dense operations zone is a problem nobody enumerates ahead of time: n drones give n(n−1)/2 pairwise conflict geometries, and the interactions among simultaneous resolutions compound beyond that — the rule book you'd have to spec out grows faster than you can write it. A learned policy generalizes across density regimes; a rule book does not.
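The pairwise count alone makes the point. A few lines of arithmetic on the drone counts mentioned above:

```python
from math import comb

# Every unordered pair of drones is a potential conflict geometry you
# would have to write a hand-crafted resolution rule for.
for n in (2, 5, 10, 20):
    print(f"{n} drones -> {comb(n, 2)} pairwise conflicts")

# 2 drones  -> 1 pair (the textbook corridor problem)
# 20 drones -> 190 pairs, before considering three-way interactions
```

And 190 is the floor: simultaneous resolutions interact, so the true rule space is larger still.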
We were also deliberate about trust. Each agent's position broadcast was signed and verified before being consumed by anyone else's policy. MARL papers tend to assume the comm channel is honest. In airspace shared with hardware you don't own, it isn't.
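The verify-before-consume gate looks roughly like this. A minimal sketch only: it uses an HMAC with a shared demo key as a stand-in for whatever asymmetric signature scheme a real deployment would use, and the function names are invented for illustration — the structural point is that an unverifiable broadcast never reaches the policy:

```python
import hmac
import hashlib
import json

SHARED_KEY = b"demo-key"  # hypothetical; a real system would use per-agent keys

def sign_broadcast(agent_id, pos):
    """Sender side: attach an authentication tag to the position payload."""
    payload = json.dumps({"id": agent_id, "pos": list(pos)}).encode()
    tag = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}

def verify_broadcast(msg):
    """Receiver side: reject anything whose tag doesn't check out
    before the policy ever sees it."""
    expected = hmac.new(SHARED_KEY, msg["payload"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, msg["tag"]):
        return None  # dropped, never fed to the policy
    return json.loads(msg["payload"])
```

The design choice that matters is where the check sits: authentication is a precondition of the observation space, not a post-hoc filter on actions.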
The simulator was the lie
Then came sim-to-real.
In simulation, wind was a parameter. In flight, wind was a regime. In simulation, the energy model was the airframe's spec sheet. In flight, the airframe burned 20–30% more under load and gusting. In simulation, position-broadcast latency was deterministic. In the real network it had a long tail that occasionally fell off a cliff.
Policies that converged cleanly in sim brittle-failed in the field. The reference work that helped us most — Osipychev et al. on RL-based air traffic deconfliction — is explicit that the integration effort is the contribution, not the algorithm. We learned the same thing the same way: by being burned by everything in the real system we hadn't bothered to randomize in the simulator.
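The fix for each of those burns was the same: randomize it. A sketch of per-episode randomization covering the three mismatches above — the ranges and distributions here are illustrative, loosely matched to the gaps described, not the actual tuned values:

```python
import random

def randomize_episode(rng):
    """Sample one episode's worth of the real-world effects the sim left out.

    Illustrative ranges only: gusting wind regimes rather than a wind
    parameter, a 20-30% energy overshoot above the spec sheet, and
    broadcast latency with a heavy tail instead of a constant.
    """
    wind = {
        "mean_mps": rng.uniform(0.0, 12.0),
        "gust_mps": rng.uniform(0.0, 8.0),   # gusts layered on the mean
    }
    energy_multiplier = rng.uniform(1.0, 1.35)  # spec sheet is the floor, not the truth
    # Lognormal latency: most broadcasts arrive fast, some fall off a cliff.
    latency_ms = rng.lognormvariate(3.0, 1.0)
    return {"wind": wind, "energy_multiplier": energy_multiplier,
            "latency_ms": latency_ms}
```

Every parameter in that function corresponds to something that bit us because it was a constant in the simulator.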
The marginal hour on simulator fidelity beats the marginal hour on algorithm choice.
The lesson
If you're shipping RL into a physical system, the simulator is the artifact you're actually building. The policy is downstream of it. We picked our MARL setup in two weeks. The simulator was never finished.