Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Language without grounding fails

A language model can say "pick up the can" eloquently and still propose moves the robot cannot perform. SayCan splits the problem: a language model scores how relevant each skill is to the instruction, and a value function scores how feasible each skill is in the current scene. The two scores are multiplied, and the robot executes the skill with the highest product.
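
The split between relevance and feasibility can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `select_skill`, the skill strings, and every number are hypothetical, and in a real system the two dictionaries would come from a language model and a learned value function.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float], affordances: Dict[str, float]) -> str:
    """Return the skill with the highest relevance * feasibility product.

    llm_scores:  how relevant the language model thinks each skill is.
    affordances: how feasible each skill is in the current scene.
    A skill missing from `affordances` is treated as infeasible (0.0).
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    return max(combined, key=combined.get)


# Hypothetical scores: the LM prefers picking up the can,
# but no can is within reach, so feasibility is low.
llm_scores = {"pick up the can": 0.6, "go to the counter": 0.3, "place the can": 0.1}
affordances = {"pick up the can": 0.1, "go to the counter": 0.9, "place the can": 0.05}

best = select_skill(llm_scores, affordances)
print(best)
```

With these made-up numbers the language model alone would pick "pick up the can", while the product favors "go to the counter", the step that is both useful and possible.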

Skills as the interface

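The robot executes a library of low-level skills (grasp, move, place), each paired with a short text description. The LM chooses among them one step at a time, and each chosen skill is appended to the prompt before the next one is scored. Because every candidate is weighted by its feasibility, hallucinated steps are suppressed before they reach the hardware.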
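
The step-by-step selection over a skill library can be sketched as a short loop. This is a sketch under stated assumptions, not the paper's code: `plan`, `toy_relevance`, `toy_feasibility`, the skill names, the "done" termination skill, and all scores are hypothetical stand-ins. The toy feasibility function looks at the skills chosen so far as a proxy for the scene; a real robot would evaluate feasibility from its observations after executing each skill.

```python
from typing import Callable, List

Relevance = Callable[[str, List[str], str], float]
Feasibility = Callable[[List[str], str], float]


def plan(instruction: str, skills: List[str], relevance: Relevance,
         feasibility: Feasibility, max_steps: int = 10) -> List[str]:
    """Greedily pick one skill per step until the "done" skill wins."""
    history: List[str] = []
    for _ in range(max_steps):
        scores = {
            s: relevance(instruction, history, s) * feasibility(history, s)
            for s in skills
        }
        best = max(scores, key=scores.get)
        if best == "done":
            break
        history.append(best)  # a real system would execute the skill here
    return history


def toy_relevance(instruction: str, history: List[str], skill: str) -> float:
    """Hypothetical stand-in for language model scores."""
    if skill == "done":
        return 0.9 if "bring it to the user" in history else 0.1
    if skill in history:
        return 0.01  # already done, so unlikely to be the next step
    return {"find a can": 0.2, "pick up the can": 0.5,
            "bring it to the user": 0.2}[skill]


def toy_feasibility(history: List[str], skill: str) -> float:
    """Hypothetical stand-in for a value function."""
    prerequisite = {"pick up the can": "find a can",
                    "bring it to the user": "pick up the can"}
    needed = prerequisite.get(skill)
    if needed is None:
        return 1.0
    return 0.9 if needed in history else 0.05


skills = ["find a can", "pick up the can", "bring it to the user", "done"]
steps = plan("bring me a can", skills, toy_relevance, toy_feasibility)
print(steps)
```

In this toy run the language model stub initially prefers "pick up the can", but its low feasibility pushes "find a can" to the front, and the loop ends once "done" scores highest.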

Embodied AI direction

This paper connects the LLM wave to robotics without pretending text alone is enough. Grounding lives in the intersection of language, perception, and control. Inner Monologue and related work extend the same thread by feeding feedback from the environment back into the language model.