Work in Progress · 2026

SOC-Aware Humanoid Locomotion under Physically-Grounded Battery Voltage Sag

Hanbin Bae¹

¹anemptyship.com

Preliminary results — manuscript in preparation


Filmstrip: at SOC 6%, the vanilla policy collapses while the battery-trained policies keep walking

Free walking on a Unitree G1 at a low battery state of charge (SOC 6%) under a 1.0 m/s forward command. The policy that never trained with the battery limits (Vanilla) collapses within ~1.3 s; the three policies trained inside the battery environment, namely Blind (SOC-unaware), the SOC-aware teacher, and the sensor-free Student, all keep walking, consistent with the three-seed statistics. Real-time snapshots.

Abstract

Deployed humanoids run on lithium-ion packs whose terminal voltage sags and whose battery-management current limit tightens as the state of charge falls — shrinking the torque each joint can deliver. Yet RL locomotion policies are almost always trained with an idealized, constant torque limit. We close that gap with a physically-grounded battery–actuator co-simulation for legged RL: a 1-RC equivalent-circuit pack feeds an explicit voltage-limited DC-motor envelope, so degradation arises from first principles rather than from tuned knobs.

Credibility is the hard part — it is easy to make a robot fall by exaggerating the battery. So the circuit is grounded externally and without fitting to our own locomotion results: cross-validated against a Doyle–Fuller–Newman electrochemical model of the same cell in PyBaMM (open-circuit voltage to 35 mV), and pinned to the published pack datasheet. The cross-validation raises the internal resistance $3.6\times$, removing any need for an aged-pack assumption.

Under this model, locomotion degrades in two regimes: a graceful marginal-power band, and a hard physical cliff near empty. On standard bipedal metrics (velocity tracking, cost of transport, push recovery), we find that training in the battery environment provides most of the robustness and that conditioning the policy on SOC adds energy efficiency in the marginal band. A sensor-free distilled student inherits most of that benefit without reading the battery, though no method beats the physical cliff.

At a Glance

Unitree G1 flat walking in Isaac Lab, 50 Hz, command 1.0 m/s. Healthy (DFN-grounded) pack — no aging crutch. Free-walk metrics, three-seed means.

2 regimes

a graceful marginal-power band, then a hard physical cliff — both emergent from physics

≈33% lower

cost of transport in the marginal band: SOC-aware vs SOC-blind (0.16 vs 0.24 at SOC 0.10); the sensor-free student recovers most of it (0.19)

0 sensors

the distilled student recovers most of the teacher's efficiency while reading only proprioceptive history, with no battery sensor needed

35 mV

ECM↔DFN open-circuit agreement; internal resistance grounded, not fit to our results

What SOC-Awareness Buys

In the marginal band (SOC 0.07), all battery-trained policies stay upright through a push — but the SOC-aware policy walks faster and spends the least energy to cover the same distance.

SOC 0.07, real-time. Top: textured renders of the vanilla (no battery training), blind (SOC-unaware), and SOC-aware policies. Bottom-left: forward speed against the commanded 1.0 m/s (dotted), with the lateral push at $t=3$ s. Bottom-right: cumulative electrical energy vs distance travelled — lower is more efficient. The SOC-aware policy (green) holds the lowest energy-per-metre while tracking speed best; the vanilla policy (red) is both slowest and most wasteful.
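Energy per metre and cost of transport differ only by the robot's weight. The sketch below uses the standard dimensionless definition $E / (m\,g\,d)$. That the cost of transport reported here is computed from electrical energy in exactly this way is an assumption, and the 35 kg mass is a round placeholder rather than a figure from this work.

```python
G = 9.81  # gravitational acceleration [m/s^2]


def cost_of_transport(energy_j, mass_kg, distance_m):
    """Dimensionless cost of transport: energy spent per unit weight per unit distance."""
    if distance_m <= 0.0:
        raise ValueError("distance must be positive")
    return energy_j / (mass_kg * G * distance_m)


def energy_per_metre(cot, mass_kg):
    """Invert the definition: joules needed to travel one metre at a given CoT."""
    return cot * mass_kg * G


ASSUMED_MASS_KG = 35.0  # hypothetical placeholder, not taken from the source

for name, cot in (("aware", 0.25), ("blind", 0.33)):  # CoT at SOC 0.07
    print(f"{name}: {energy_per_metre(cot, ASSUMED_MASS_KG):.0f} J/m")
```

Under those assumptions, the gap at SOC 0.07 between the aware (0.25) and blind (0.33) policies corresponds to roughly 86 vs 113 J per metre.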

The Coupling

The novelty is a single shared power source whose voltage all joints pull down together, feeding a motor envelope that shrinks with SOC. Three relations are evaluated once per 50 Hz control step:

pack current $I_{\text{pack}} \;=\; \displaystyle\sum_j \frac{|\tau_j|}{K_{t,j}} \;+\; \frac{P_{\text{mech}}}{\eta\,V_{dc}} \;+\; I_{\text{q}}$
bus voltage $V_{dc} \;=\; \mathrm{OCV}(\mathrm{SOC}) \;-\; I_{\text{pack}}\,R_{\text{int}}(\mathrm{SOC})$
torque envelope $\tau_{\max}(\omega, V_{dc}) \;=\; K_t\,\mathrm{clamp}\!\Big(\dfrac{V_{dc} - K_t\,|\omega|}{R_\phi},\; 0,\; I_{\text{stall}}\Big)$

At full charge the envelope sits far above the gait's torque demand, so walking is identical to the stock policy. Drain the pack and the same demanded torques pull more current, $V_{dc}$ sags, and $\tau_{\max}$ collapses toward the demand, first degrading tracking and then triggering falls. On top of the ohmic sag, the battery-management system derates the deliverable current by a factor $\eta$ as the open-circuit voltage approaches the cell cutoff. Because the OCV / $R_{\text{int}}$ curves are externally grounded and the envelope is textbook BLDC, the failure point is a property of the physics, not a chosen number.
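The three relations can be sketched as a single per-step function. This is a minimal illustration under stated assumptions, not the implementation used here: the OCV and $R_{\text{int}}$ tables are made-up placeholders (not the DFN-grounded curves), the previous step's $V_{dc}$ is used to break the algebraic loop between pack current and bus voltage, $P_{\text{mech}}$ is taken as $\sum_j \tau_j \omega_j$ clamped at zero (no regeneration), and $\eta$ and $I_{\text{q}}$ are passed in rather than modelled. All function and field names are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


def interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped at both ends of the table."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


# Hypothetical placeholder curves, NOT the grounded values from the text.
SOC_GRID = [0.0, 0.1, 0.5, 1.0]
OCV_V = [39.0, 44.0, 48.0, 54.0]       # pack open-circuit voltage [V]
R_INT_OHM = [0.30, 0.20, 0.12, 0.10]   # pack internal resistance [ohm]


@dataclass
class Motor:
    k_t: float      # torque constant [N m / A]; also used as back-EMF constant
    r_phi: float    # phase resistance [ohm]
    i_stall: float  # upper current clamp [A]


def pack_current(taus, motors, p_mech, v_dc_prev, eta, i_q):
    """I_pack = sum_j |tau_j| / K_t,j + P_mech / (eta * V_dc) + I_q."""
    i_torque = sum(abs(tau) / m.k_t for tau, m in zip(taus, motors))
    return i_torque + p_mech / (eta * v_dc_prev) + i_q


def bus_voltage(soc, i_pack):
    """V_dc = OCV(SOC) - I_pack * R_int(SOC)."""
    return interp(soc, SOC_GRID, OCV_V) - i_pack * interp(soc, SOC_GRID, R_INT_OHM)


def torque_envelope(motor, omega, v_dc):
    """tau_max = K_t * clamp((V_dc - K_t * |omega|) / R_phi, 0, I_stall)."""
    i_avail = (v_dc - motor.k_t * abs(omega)) / motor.r_phi
    return motor.k_t * min(max(i_avail, 0.0), motor.i_stall)


def coupling_step(soc, taus, omegas, motors, v_dc_prev, eta, i_q):
    """One control step: demanded torques -> pack current -> bus voltage -> limits."""
    # Assumption: only positive mechanical power draws current (no regeneration).
    p_mech = max(sum(tau * w for tau, w in zip(taus, omegas)), 0.0)
    i_pack = pack_current(taus, motors, p_mech, v_dc_prev, eta, i_q)
    v_dc = bus_voltage(soc, i_pack)
    limits = [torque_envelope(m, w, v_dc) for m, w in zip(motors, omegas)]
    clipped = [max(-lim, min(lim, tau)) for tau, lim in zip(taus, limits)]
    return v_dc, limits, clipped
```

With a hypothetical motor (`k_t=1.5`, `r_phi=0.4`, `i_stall=60`) spinning at 25 rad/s, the envelope falls from about 47 N·m at a 50 V bus to about 9 N·m at 40 V, which illustrates how a modest sag can remove most of the torque available at speed.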

Grounding, Not Fitting

To keep the result credible, the lumped circuit is checked against an independent high-fidelity electrochemical model and the manufacturer datasheet — with no parameter tuned to the locomotion outcome.

ECM versus DFN cross-validation: open-circuit voltage and internal resistance versus SOC

Figure 1. Our 1-RC equivalent circuit (ECM) versus an independent Doyle–Fuller–Newman (DFN) model of the same LG M50 cell in PyBaMM. Open-circuit curves agree to a 35 mV RMS error; the resistance shape matches, but the ECM magnitude was initially $3.6\times$ too low — so we re-grounded it to the DFN value. With the corrected resistance, a healthy pack alone produces graded degradation, and the earlier aged-pack assumption is no longer needed.
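For readers unfamiliar with the lumped model, a 1-RC equivalent circuit is conventionally an open-circuit source in series with an ohmic resistance and one parallel RC branch. The sketch below steps that textbook form and computes an RMS error between two sampled curves, the statistic quoted in Figure 1. The parameter names (`r0_ohm`, `r1_ohm`, `c1_f`) and any values passed in are illustrative assumptions; the fitted parameters and the discretization used in this work are not given here.

```python
import math


def rc_branch_step(v_rc, current_a, dt_s, r1_ohm, c1_f):
    """Advance the RC-branch voltage by dt_s, holding current constant over the step."""
    decay = math.exp(-dt_s / (r1_ohm * c1_f))
    return decay * v_rc + r1_ohm * (1.0 - decay) * current_a


def terminal_voltage(ocv_v, current_a, r0_ohm, v_rc):
    """Terminal voltage of a 1-RC circuit; discharge current is positive."""
    return ocv_v - current_a * r0_ohm - v_rc


def rms_error(curve_a, curve_b):
    """Root-mean-square difference between two curves sampled on the same grid."""
    if len(curve_a) != len(curve_b) or not curve_a:
        raise ValueError("curves must be non-empty and the same length")
    squared = sum((a - b) ** 2 for a, b in zip(curve_a, curve_b))
    return math.sqrt(squared / len(curve_a))
```

Under a constant current the branch voltage settles to $I R_1$, so the steady-state sag of this circuit is $I\,(R_0 + R_1)$.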

Two Regimes of Degradation

The grounded BMS derate is flat through the mid range and falls steeply only near empty ($\eta = 0.59,\,0.35,\,0.20$ at SOC 0.10, 0.07, 0.05). Locomotion follows suit.

Free-walk metrics versus SOC: velocity tracking, cost of transport, completion time, fall rate

Figure 2. Free-walk metrics vs SOC (mean±std over three seeds) for vanilla, blind and aware. In a marginal-power band (SOC ≈0.07–0.15) the gait degrades gracefully: speed drops, cost of transport rises, but the robot keeps walking. Below it, at SOC ≈0.04, the pack reaches a physical cliff where the deliverable torque cannot support locomotion and every policy falls; the cliff is a property of the pack, not the controller. The vanilla policy is worst throughout; within the marginal band the aware policy has the lowest cost of transport.

Two clean, separable claims emerge. Battery-environment training is decisive: the vanilla policy, which never saw the limits, tracks ≈3× worse at mid charge (0.14 vs 0.04 m/s error at SOC 0.50), is roughly 2–3× less efficient (cost of transport 0.58 vs 0.22–0.26 at SOC 0.50), and falls first. SOC observation buys efficiency: on top of equal battery-environment training, reading the SOC lets the policy pace itself rather than command torques that clip, giving the lowest cost of transport in the marginal band (SOC 0.07–0.10) and slightly better tracking there.

Disturbance Rejection

In a humanoid the battery effect shows up not only as tracking error but as a loss of balance margin — the largest lateral push the policy can absorb and still recover.

Maximum recoverable lateral push versus SOC

Figure 3. Maximum recoverable lateral push vs SOC. SOC-awareness extends disturbance rejection through the marginal band; all methods collapse at the cliff.

An honest trade-off. The advantage is not uniform. At the very edge (SOC 0.05) the SOC-blind policy is in fact more robust — it survives with zero falls, whereas the aware policy, seeking efficiency, falls in one of three seeds. We report this rather than suppress it. The clean claim is therefore narrower than “SOC-awareness dominates”: battery-environment training gives robustness, and the SOC observation adds efficiency.

Deployable Without a Battery Sensor

The SOC-aware policy reads a privileged battery channel a fielded robot may not expose. So we distill it (teacher→student, RMA-style) into a policy that reads only proprioceptive history — no battery sensor — and must infer the battery-limited dynamics on its own.

SOC 0.07, real-time. Blind, SOC-aware teacher, and the sensor-free student. Reading no battery channel, the student matches the teacher's velocity tracking and recovers most of its efficiency: its cost of transport is close to the teacher's and well below the SOC-blind policy's (student 0.27, teacher 0.25, blind 0.33 at SOC 0.07; 0.19, 0.16, 0.24 at SOC 0.10), averaged over three distillations. The SOC benefit therefore survives distillation into a deployable, sensor-free policy. At the very edge (SOC 0.05) the student is no more robust than the teacher; the physical cliff is method-invariant.

Results Table

Free-walk metrics vs SOC. Each cell lists velocity-tracking error (m/s) / cost of transport / fall rate; lower is better for all three. All columns are three-seed means; the student is distilled once per teacher seed and reads no battery channel.

SOC	Vanilla	Blind	Aware (teacher)	Student (no sensor)
0.50	0.14 / 0.58 / 0	0.04 / 0.22 / 0	0.04 / 0.26 / 0	0.04 / 0.24 / 0
0.10	0.13 / 0.52 / 0	0.06 / 0.24 / 0	0.05 / 0.16 / 0	0.06 / 0.19 / 0
0.07	0.09 / 0.51 / 0	0.14 / 0.33 / 0	0.12 / 0.25 / 0	0.12 / 0.27 / 0
0.05	fall	0.19 / 0.34 / 0	0.25 / 0.39 / 0.27	0.27 / 0.44 / 0.35
0.04	fall	fall	fall	fall
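The efficiency comparisons quoted in the text can be reproduced from the cost-of-transport entries in the table. The snippet below is only a convenience for that arithmetic; the dictionary layout and the helper name `relative_reduction` are illustrative, and the numbers are copied from the rows above (the SOC 0.05 and 0.04 rows are omitted because they include falls).

```python
# Cost of transport (middle entry of each cell), copied from the results table.
COT = {
    0.50: {"vanilla": 0.58, "blind": 0.22, "aware": 0.26, "student": 0.24},
    0.10: {"vanilla": 0.52, "blind": 0.24, "aware": 0.16, "student": 0.19},
    0.07: {"vanilla": 0.51, "blind": 0.33, "aware": 0.25, "student": 0.27},
}


def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline` (positive = lower)."""
    return (baseline - value) / baseline


for soc, row in COT.items():
    for name in ("aware", "student"):
        r = relative_reduction(row["blind"], row[name])
        print(f"SOC {soc:.2f}  {name:8s} vs blind: {100 * r:+.0f}%")
```

Relative to the blind policy, the aware teacher and the student are about 33% and 21% lower at SOC 0.10, and about 24% and 18% lower at SOC 0.07. At SOC 0.50 the sign flips: the blind policy has the lowest cost of transport in that row.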

Setup

Unitree G1 flat-walking task in Isaac Lab (built-in Isaac-Velocity-Flat-G1-v0 stack), 50 Hz control. The leg and torso actuator groups are switched from the stock implicit PD to an explicit voltage-limited DC-motor model carrying the $\tau_{\max}(\omega, V_{dc})$ envelope; arms remain implicit. SOC is held fixed within each episode, because it drifts by only ≈0.003 over a 20 s episode, and is randomized at reset with the low-SOC band over-sampled. The pack is grounded to a 13S / 54 V / 9 Ah datasheet (run time within 5%) and the internal resistance to the DFN cross-validation, with no aging multiplier. Each policy is trained from scratch for 1500 iterations with PPO (rsl-rl) to remove any shared-initialization confound; the student is distilled by DAgger. Simulation only, flat ground, no hardware calibration yet.
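The fixed-SOC simplification can be sanity-checked by coulomb counting, and the reset distribution can be sketched as a two-component mixture. The 9 Ah capacity, the 20 s episode, and the ≈0.003 drift come from the text; the mixture form, the band edges, and `p_low` are hypothetical choices, since the actual sampling scheme is not specified here.

```python
import random

CAPACITY_AH = 9.0  # pack capacity quoted in the setup
EPISODE_S = 20.0   # episode length quoted in the setup


def delta_soc(avg_current_a, duration_s=EPISODE_S, capacity_ah=CAPACITY_AH):
    """Coulomb-counting SOC change: charge drawn divided by pack capacity."""
    return avg_current_a * duration_s / 3600.0 / capacity_ah


def implied_current(d_soc, duration_s=EPISODE_S, capacity_ah=CAPACITY_AH):
    """Average pack current [A] that would produce a given SOC change."""
    return d_soc * capacity_ah * 3600.0 / duration_s


def sample_reset_soc(rng, low_band=(0.04, 0.15), full_range=(0.04, 1.0), p_low=0.5):
    """Hypothetical reset sampler: with probability p_low draw from the low band."""
    lo, hi = low_band if rng.random() < p_low else full_range
    return rng.uniform(lo, hi)


print(f"implied average current: {implied_current(0.003):.2f} A")
rng = random.Random(0)
print([round(sample_reset_soc(rng), 3) for _ in range(5)])
```

A drift of 0.003 over 20 s on a 9 Ah pack corresponds to an average draw of about 4.9 A, small enough that treating SOC as constant within an episode changes the operating point negligibly.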

Citation

@misc{bae2026socaware,
  title  = {SOC-Aware Humanoid Locomotion under
            Physically-Grounded Battery Voltage Sag},
  author = {Bae, Hanbin},
  year   = {2026},
  note   = {Manuscript in preparation}
}