Preliminary results — manuscript in preparation
Deployed humanoids run on lithium-ion packs whose terminal voltage sags and whose battery-management current limit tightens as the state of charge falls — shrinking the torque each joint can deliver. Yet RL locomotion policies are almost always trained with an idealized, constant torque limit. We close that gap with a physically-grounded battery–actuator co-simulation for legged RL: a 1-RC equivalent-circuit pack feeds an explicit voltage-limited DC-motor envelope, so degradation arises from first principles rather than from tuned knobs.
Credibility is the hard part: it is easy to make a robot fall by exaggerating the battery. So the circuit is grounded externally, without fitting to our own locomotion results. It is cross-validated against a Doyle–Fuller–Newman electrochemical model of the same cell in PyBaMM (open-circuit voltage agreeing to within 35 mV) and pinned to the published pack datasheet. The cross-validation raises the internal resistance by a factor of $3.6$, removing any need for an aged-pack assumption.
Under this model, locomotion degrades in two regimes: a graceful marginal-power band, and a hard physical cliff near empty. On standard bipedal metrics (velocity tracking, cost of transport, push recovery), we find that training in the battery environment provides most of the robustness, and that conditioning the policy on SOC adds energy efficiency. A sensor-free distilled student inherits that benefit without reading the battery, though no method beats the physical cliff.
Unitree G1 flat walking in Isaac Lab, 50 Hz, command 1.0 m/s. Healthy (DFN-grounded) pack — no aging crutch. Free-walk metrics, three-seed means.
In the marginal band (SOC 0.07), all battery-trained policies stay upright through a push — but the SOC-aware policy walks faster and spends the least energy to cover the same distance.
The novelty is a single shared power source whose voltage every joint jointly pulls down, feeding a motor envelope that shrinks with SOC. Three relations, once per 50 Hz control step:
At full charge the envelope sits far above the gait's torque demand, so walking is identical to the stock policy. Drain the pack and the same demanded torques pull more current, $V_{dc}$ sags, and $\tau_{\max}$ collapses toward the demand, first degrading tracking and then triggering falls. On top of the ohmic sag, the battery-management system derates the deliverable current by a factor $\eta$ as the open-circuit voltage approaches the cell cutoff. Because the OCV / $R_{\text{int}}$ curves are externally grounded and the envelope is textbook BLDC, the failure point is a property of the physics, not a chosen number.
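As a rough illustration of how such a per-step update could look, the sketch below couples a shared pack voltage to a voltage-limited torque envelope. It is not the implementation used here: it keeps only the ohmic sag (the RC branch of the 1-RC model is omitted), applies $\eta$ as a scale on a per-joint current ceiling (an assumption), and every constant (`R_INT`, `KT`, `R_PHASE`, `I_JOINT_MAX`) is a hypothetical placeholder rather than a grounded parameter.

```python
import numpy as np

# Hypothetical placeholder parameters, NOT the grounded values from the study.
R_INT = 0.10        # pack internal resistance [ohm]
KT = 1.0            # joint torque constant [N*m/A], also used as back-EMF constant [V*s/rad]
R_PHASE = 0.5       # winding resistance [ohm]
I_JOINT_MAX = 40.0  # per-joint current ceiling at full charge [A]


def torque_envelope(omega, v_dc, eta):
    """Voltage-limited DC-motor envelope tau_max(omega, V_dc).

    Back-EMF (KT * |omega|) eats into the bus voltage; the remainder drives
    current through the winding. Assumption: the BMS derate eta simply scales
    the per-joint current ceiling.
    """
    i_voltage_limited = np.clip((v_dc - KT * np.abs(omega)) / R_PHASE, 0.0, None)
    i_limit = np.minimum(i_voltage_limited, eta * I_JOINT_MAX)
    return KT * i_limit


def control_step(tau_cmd, omega, ocv, eta, v_dc):
    """One control step: clip torques, then update the shared bus voltage."""
    tau_max = torque_envelope(omega, v_dc, eta)
    tau = np.clip(tau_cmd, -tau_max, tau_max)

    # Electrical power drawn by all joints: mechanical power plus copper loss.
    i_joint = np.abs(tau) / KT
    p_elec = np.sum(tau * omega + i_joint**2 * R_PHASE)

    # Every joint pulls on the same pack, so the currents add up on one bus.
    i_pack = max(p_elec, 0.0) / max(v_dc, 1e-6)
    v_dc_next = ocv - i_pack * R_INT  # ohmic sag only
    return tau, v_dc_next


tau_cmd = np.array([20.0, -20.0])
omega = np.array([2.0, -2.0])
tau_full, v_full = control_step(tau_cmd, omega, ocv=54.0, eta=1.0, v_dc=54.0)
tau_low, v_low = control_step(tau_cmd, omega, ocv=44.0, eta=0.2, v_dc=44.0)
print(tau_full, v_full)
print(tau_low, v_low)
```

With these placeholder numbers the commanded torques pass through unclipped at full charge, while the derated low-voltage case clips them to the shrunken envelope.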
To keep the result credible, the lumped circuit is checked against an independent high-fidelity electrochemical model and the manufacturer datasheet — with no parameter tuned to the locomotion outcome.
The grounded BMS derate is flat through the mid range and falls steeply only near empty ($\eta = 0.59,\,0.35,\,0.20$ at SOC 0.10, 0.07, 0.05). Locomotion follows suit.
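For quick experiments, the three reported derate points can be turned into a lookup by linear interpolation. This is only a sketch: the knee at which $\eta$ returns to 1.0 (`KNEE_SOC`) and the value 1.0 for the flat mid range are assumptions, and the grounded curve is not necessarily piecewise linear.

```python
import numpy as np

# (SOC, eta) points quoted in the text.
SOC_PTS = np.array([0.05, 0.07, 0.10])
ETA_PTS = np.array([0.20, 0.35, 0.59])

# Assumption: eta is 1.0 from this SOC upward. The value 0.20 is hypothetical.
KNEE_SOC = 0.20


def bms_derate(soc):
    """Piecewise-linear eta(SOC) through the reported points.

    np.interp clamps outside the knot range, so SOC below 0.05 returns 0.20
    here; the real curve presumably keeps falling toward cutoff.
    """
    xs = np.append(SOC_PTS, KNEE_SOC)
    ys = np.append(ETA_PTS, 1.0)
    return np.interp(soc, xs, ys)


for soc in (0.50, 0.10, 0.07, 0.06, 0.05):
    print(soc, float(bms_derate(soc)))
```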
Two clean, separable claims emerge. Battery-environment training is decisive: the vanilla policy, which never saw the limits, tracks $\sim$3× worse and is $\sim$3× less efficient than the policies trained inside the battery model, and falls first. SOC observation buys efficiency: on top of equal battery-environment training, reading the SOC lets the policy pace itself rather than command torques that clip. In the low-SOC band above the cliff (SOC 0.10 and 0.07) it has the lowest cost of transport, along with slightly better marginal-band tracking.
In a humanoid the battery effect shows up not only as tracking error but as a loss of balance margin — the largest lateral push the policy can absorb and still recover.
The SOC-aware policy reads a privileged battery channel a fielded robot may not expose. So we distill it (teacher→student, RMA-style) into a policy that reads only proprioceptive history — no battery sensor — and must infer the battery-limited dynamics on its own.
Free-walk metrics vs SOC: velocity-tracking error (m/s) / cost of transport / fall rate. Lower is better. All columns are three-seed means; the student is distilled once per teacher seed and reads no battery channel.
| SOC | Vanilla | Blind | Aware (teacher) | Student (no sensor) |
|---|---|---|---|---|
| 0.50 | 0.14 / 0.58 / 0 | 0.04 / 0.22 / 0 | 0.04 / 0.26 / 0 | 0.04 / 0.24 / 0 |
| 0.10 | 0.13 / 0.52 / 0 | 0.06 / 0.24 / 0 | 0.05 / 0.16 / 0 | 0.06 / 0.19 / 0 |
| 0.07 | 0.09 / 0.51 / 0 | 0.14 / 0.33 / 0 | 0.12 / 0.25 / 0 | 0.12 / 0.27 / 0 |
| 0.05 | fall | 0.19 / 0.34 / 0 | 0.25 / 0.39 / 0.27 | 0.27 / 0.44 / 0.35 |
| 0.04 | fall | fall | fall | fall |
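To read the cost-of-transport entries as relative savings, the short sketch below computes each policy's fractional change against the blind policy. It uses only the rounded table values, so the percentages are approximate; the dictionary layout and function name are illustrative.

```python
# Cost of transport copied from the table, for the rows where the three
# battery-trained policies all have numeric entries.
COT = {
    0.50: {"blind": 0.22, "aware": 0.26, "student": 0.24},
    0.10: {"blind": 0.24, "aware": 0.16, "student": 0.19},
    0.07: {"blind": 0.33, "aware": 0.25, "student": 0.27},
    0.05: {"blind": 0.34, "aware": 0.39, "student": 0.44},
}


def relative_cot(soc, policy, baseline="blind"):
    """Fractional change in cost of transport versus the baseline policy.

    Negative values mean the policy spends less energy per distance.
    """
    row = COT[soc]
    return (row[policy] - row[baseline]) / row[baseline]


for soc in COT:
    changes = {p: round(relative_cot(soc, p), 2) for p in ("aware", "student")}
    print(soc, changes)
```

On these rounded entries the SOC-aware teacher spends about a third less energy than the blind policy at SOC 0.10 and about a quarter less at SOC 0.07.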
Unitree G1 flat-walking task in Isaac Lab (built-in Isaac-Velocity-Flat-G1-v0 stack), 50 Hz
control. The leg and torso actuator groups are switched from the stock implicit PD to an explicit
voltage-limited DC-motor model carrying the $\tau_{\max}(\omega, V_{dc})$ envelope; arms remain implicit. SOC
is held fixed within each episode (it changes by only ≈0.003 over a 20 s episode, so a per-episode constant
is an adequate modeling primitive) and is randomized at reset, over-sampling the low-SOC band. The pack is grounded
to a 13S / 54 V / 9 Ah datasheet (run time within 5%), and the internal resistance to the
DFN cross-validation — no aging multiplier. Each policy is trained from scratch for
1500 iterations with PPO (rsl-rl) to remove any shared-initialization confound; the student is distilled by
DAgger. Simulation only, flat ground, no hardware calibration yet.
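As a sanity check on the fixed-SOC simplification, the quoted ΔSOC can be converted into an implied mean pack current by Coulomb counting, $I = \Delta\mathrm{SOC} \cdot Q \cdot 3600 / t$. The sketch assumes a constant current over the episode and the full 9 Ah datasheet capacity; the function name is illustrative.

```python
def implied_mean_current(delta_soc, capacity_ah, duration_s):
    """Mean pack current [A] that drains delta_soc of capacity_ah in duration_s.

    Coulomb counting with constant current: charge drawn = I * t,
    and capacity in coulombs = capacity_ah * 3600.
    """
    return delta_soc * capacity_ah * 3600.0 / duration_s


# Values from the text: delta SOC of about 0.003 over a 20 s episode, 9 Ah pack.
i_mean = implied_mean_current(delta_soc=0.003, capacity_ah=9.0, duration_s=20.0)
print(f"implied mean pack current: {i_mean:.2f} A")
```

Under those assumptions the episode draws a mean current of just under 5 A, which is why the SOC barely moves within a single 20 s rollout.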
@misc{bae2026socaware,
title = {SOC-Aware Humanoid Locomotion under
Physically-Grounded Battery Voltage Sag},
author = {Bae, Hanbin},
year = {2026},
note = {Manuscript in preparation}
}