Inner Monologue: Embodied Reasoning through Planning with Language Models

Think out loud, then act

The agent generates an internal narrative: what it intends, what it observes, what went wrong. Language becomes the scratchpad for closed-loop control, not just user-facing chat.

Feedback closes the loop

Success detectors, scene descriptions, and human hints feed back into the monologue. The model replans from where it is instead of restarting the task. This is closer to how a person recovers from a failed grasp than one-shot command execution is.
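
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_inner_monologue`, `plan_next`, `execute`, and `describe_scene` are hypothetical names, the planner is a hand-written stub standing in for an LLM, and the first grasp is hard-coded to fail so the recovery path is visible.

```python
from typing import Callable, List


def run_inner_monologue(
    instruction: str,
    plan_next: Callable[[List[str]], str],   # stand-in for the LLM planner
    execute: Callable[[str], bool],          # runs a skill, returns success
    describe_scene: Callable[[], str],       # stand-in for a scene describer
    max_steps: int = 10,
) -> List[str]:
    """Append actions and feedback to one running text log, replanning each step."""
    monologue = [f"Human: {instruction}"]
    for _ in range(max_steps):
        action = plan_next(monologue)        # planner sees all feedback so far
        if action == "done":
            monologue.append("Robot: done")
            break
        monologue.append(f"Robot: {action}")
        monologue.append(f"Success: {execute(action)}")
        monologue.append(f"Scene: {describe_scene()}")
    return monologue


def demo() -> List[str]:
    """Toy run: the first grasp fails, and the stub planner retries it."""
    steps = ["pick up the sponge", "place the sponge in the sink"]
    attempts = {"count": 0}

    def plan_next(monologue: List[str]) -> str:
        # Hypothetical stub: advance only after a reported success,
        # so a failed step is proposed again on the next turn.
        completed = sum(line == "Success: True" for line in monologue)
        return steps[completed] if completed < len(steps) else "done"

    def execute(action: str) -> bool:
        attempts["count"] += 1
        return attempts["count"] != 1        # hard-coded: first attempt fails

    def describe_scene() -> str:
        return "sponge, sink"

    return run_inner_monologue(
        "put the sponge in the sink", plan_next, execute, describe_scene
    )


if __name__ == "__main__":
    print("\n".join(demo()))
```

In this toy run the log records the failed pick, then the same pick retried, then the place step. Nothing is reset between attempts; the failure is one more line of context for the next planning call.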

With SayCan

Read this alongside SayCan. SayCan grounds skill selection in affordances; Inner Monologue adds reasoning across steps and recovery from failure. Both treat the LLM as a planner, not a remote control.
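
For contrast, affordance grounding in the SayCan style can be sketched as a product of two scores: how useful the language model thinks a skill is for the instruction, and how feasible that skill is in the current state. The function name `select_skill` and every number below are made up for illustration; a real system would take these values from an LLM and a learned value function.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Pick the skill with the highest (LLM score x affordance) product."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Illustrative values only: the LLM prefers grabbing the sponge,
# but no sponge is in view, so that skill has low affordance.
LLM_SCORES = {
    "pick up the sponge": 0.6,
    "find a sponge": 0.3,
    "go to the sink": 0.1,
}
AFFORDANCES = {
    "pick up the sponge": 0.1,
    "find a sponge": 0.9,
    "go to the sink": 0.8,
}

if __name__ == "__main__":
    print(select_skill(LLM_SCORES, AFFORDANCES))
```

With these numbers the selected skill is "find a sponge": the language model's first choice is discounted because it is not currently feasible. This scoring picks one step at a time; the feedback log from Inner Monologue is what carries information about past failures into the next choice.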