LookPlanGraph: Embodied Instruction Following with VLM Graph Augmentation

Anatoly O. Onishchenko², Alexey K. Kovalev¹,², Aleksandr I. Panov¹,²
¹ Cognitive AI Lab, Moscow, Russia  ² IAI MIPT, Moscow, Russia

Problem

Static planners rely solely on a predefined scene graph, making them brittle when objects are missing or misplaced. Dynamic planners instead explore the environment at execution time and update the scene graph with newly discovered objects, so the task can succeed even when the initial graph is stale.

Figure: Static vs dynamic planning under environment changes.

Figure: Task execution example.

Overview

LookPlanGraph cycles through LM decision-making, simulation feasibility checks, environment execution, and VLM-driven graph augmentation until the task is marked complete. It maintains a compact scene memory graph (SMG), derived from the initial scene graph with objects initially marked as unseen, which keeps prompts short and the context relevant. A Scene Graph Simulator (SGS) validates and, when necessary, corrects LM actions (e.g., inserting open(container) before pickup). When the LM chooses the discover objects action, a VLM processes egocentric images to add or validate object nodes, states, and relations in the SMG.
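
The control flow can be summarized as a short loop. Below is a minimal sketch of that loop; the lm, sgs, env, and vlm interfaces are hypothetical stand-ins, not names from the paper.

```python
def look_plan_graph(task, smg, env, lm, sgs, vlm, max_steps=50):
    """Minimal sketch of the plan/check/execute/augment loop."""
    for _ in range(max_steps):
        action = lm.next_action(task, smg)             # LM decides over the compact SMG
        if action.name == "done":
            return True
        if action.name == "discover_objects":
            vlm.augment(smg, env.egocentric_images())  # add/validate nodes, states, relations
            continue
        feasible, plan = sgs.validate(action, smg)     # feasibility check; may prepend
                                                       # preconditions such as open(container)
        if not feasible:
            lm.feedback(f"{action} is infeasible")     # let the LM replan
            continue
        for a in plan:
            env.execute(a)
            smg.apply(a)                               # keep the SMG in sync with the world
    return False
```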

Figure: Pipeline overview (LM planning, simulation checks, and VLM-driven graph augmentation).
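
One way the discover objects step could be realized is to ask the VLM to describe the current view as JSON and merge the reply into the SMG. The prompt wording and the merge_node/merge_edge helpers below are illustrative assumptions, not taken from the paper.

```python
import json

# Illustrative prompt; the actual prompt is not reproduced here.
DISCOVER_PROMPT = (
    "List the objects visible in this image as JSON: "
    '{"nodes": [{"name": "...", "state": "..."}], '
    '"edges": [{"src": "...", "rel": "inside|ontop", "dst": "..."}]}'
)

def augment_smg(smg, images, vlm_query):
    """Merge VLM detections from egocentric images into the SMG."""
    for image in images:
        reply = vlm_query(DISCOVER_PROMPT, image)  # e.g. GPT-4o or Llama3.2-90B-vision
        graph = json.loads(reply)
        for node in graph.get("nodes", []):
            smg.merge_node(node["name"], state=node.get("state"), seen=True)
        for edge in graph.get("edges", []):
            smg.merge_edge(edge["src"], edge["rel"], edge["dst"])
```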

GraSIF Dataset

We present a benchmark of 514 tasks, paired with a lightweight, automated validation framework, for evaluating graph-based instruction following in household manipulation scenarios.

GraSIF integrates tasks from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow to provide lightweight evaluation of scene-graph-based planners across diverse scenarios. Each task includes a natural-language instruction together with initial and goal scene graphs. Success is checked by a simulator-based comparison of changed nodes and asset states.
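
With graphs in hand, validation reduces to a graph diff. A minimal sketch, with graphs as plain dicts (the attribute layout is an assumption):

```python
def task_success(initial, achieved, goal):
    """Graph-based success check: only the nodes that the goal changes
    relative to the initial graph are compared. Graphs map a node name
    to its attributes, e.g. {"parent": "fridge", "state": "closed"}."""
    changed = {node for node, attrs in goal.items() if initial.get(node) != attrs}
    return all(achieved.get(node) == goal[node] for node in changed)
```

For a put-in-fridge task, for example, only the moved object's parent and the fridge's state would be compared.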

  • Scale: 10 scenes, 514 tasks
  • Focus: Mobile manipulation; predicates like inside and ontop
  • Validation: Fast graph-based evaluation; no heavy physics required
GraSIF dataset statistics (Rooms, Nodes, and Actions are per-task averages).
Subset            Tasks   Rooms   Nodes   Actions
SayPlan Office       29      37   202.6       4.2
BEHAVIOR-1K         177    1.23    12.1       9.8
VirtualHome         308       4   195.7       3.1

SayPlan Office offers the largest environment, spanning 37 rooms. BEHAVIOR‑1K features long‑horizon tasks, and VirtualHome is a densely populated environment.

Figure: Graph examples.

Results

Evaluation and metrics. Success Rate (SR): the fraction of tasks in which all graph nodes reach the goal configuration. For evaluation we use GPT-4o and Llama3.2-90B-vision as VLMs, and Llama3.3-70B in the static settings.

Dynamic settings

Setup. VirtualHome: 50 tasks across three categories (put in fridge, heat in microwave, wash in dishwasher). OmniGibson: the SGS emulates high-level actions, and perception uses images collected per asset (50 BEHAVIOR-1K tasks). In both environments, object positions in the initial scene graph are perturbed for half the tasks.
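
The perturbation can be pictured as re-parenting a few objects in the initial graph so that it disagrees with the actual environment. A minimal sketch; the paper's exact procedure is not specified here, and the movable/parent attributes are assumptions.

```python
import random

def perturb_initial_graph(graph, containers, fraction=0.3, seed=0):
    """Re-parent a random subset of movable objects (illustrative only)."""
    rng = random.Random(seed)
    movable = [n for n, attrs in graph.items() if attrs.get("movable")]
    if not movable:
        return
    for obj in rng.sample(movable, k=max(1, int(fraction * len(movable)))):
        graph[obj]["parent"] = rng.choice(containers)  # stale prior for the planner
```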

Method          VH (L3.2)   VH (GPT-4o)   OG (L3.2)   OG (GPT-4o)
LLM-as-P             0.16          0.44        0.16          0.33
LLM+P                0.00          0.32        0.21          0.37
SayPlan              0.21          0.38        0.10          0.12
SayPlan Lite         0.39          0.48        0.27          0.40
ReAct                0.30          0.22        0.33          0.41
LookPlanGraph        0.60          0.52        0.35          0.42

Key results. LookPlanGraph achieves the highest SR across configurations.

Static settings (GraSIF)

Setup. Since GraSIF provides only graphs, the augmentation module is adapted to add nodes adjacent to explored nodes. We evaluate across SayPlan Office, BEHAVIOR-1K, and RobotHow.
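
A minimal sketch of this graph-only discovery, assuming hypothetical neighbors, merge_node, and merge_edge interfaces:

```python
def discover_adjacent(smg, gt_graph, explored_node):
    """Graph-only stand-in for VLM discovery: reveal the ground-truth
    neighbors of a node the agent has just explored."""
    for neighbor, relation in gt_graph.neighbors(explored_node):
        smg.merge_node(neighbor, seen=True)
        smg.merge_edge(explored_node, relation, neighbor)
```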

Method          SayPlan Office   BEHAVIOR-1K   RobotHow
LLM-as-P                  0.47          0.39       0.44
LLM+P                     0.07          0.33       0.30
SayPlan                   0.46          0.36       0.86
SayPlan Lite              0.53          0.61       0.84
ReAct                     0.38          0.47       0.89
LookPlanGraph             0.62          0.60       0.87

Key results. LookPlanGraph leads in SayPlan Office and remains competitive in BEHAVIOR‑1K and RobotHow, offering strong SR with compact prompts.

Graph augmentation capability

Setup. Using images gathered during simulation, the VLMs reconstruct scene graphs. We report F1 for nodes and edges, and Success, which requires every node and edge to match the ground truth.
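
Node and edge F1 can be computed as a multiset match against ground truth; a minimal sketch:

```python
from collections import Counter

def f1(predicted, ground_truth):
    """Multiset F1 over graph elements: nodes as names,
    edges as (src, rel, dst) triples."""
    pred, gt = Counter(predicted), Counter(ground_truth)
    tp = sum((pred & gt).values())  # elements present in both
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(pred.values()), tp / sum(gt.values())
    return 2 * precision * recall / (precision + recall)
```

Success then corresponds to both node and edge F1 reaching 1.0.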

VirtualHome
Model                  F1 (Nodes)   F1 (Edges)   Success
Llama3.2-90B-vision          0.78         0.73      0.26
GPT-4o                       0.87         0.84      0.44

OmniGibson
Model                  F1 (Nodes)   F1 (Edges)   Success
Llama3.2-90B-vision          0.67         0.67      0.33
GPT-4o                       0.85         0.88      0.59

Key results. GPT‑4o produces more accurate graphs and higher Success; Llama3.2‑90B‑vision tends to undercount repeated instances but remains usable.

Ablations

Setup. SayPlan Office (GraSIF); we report SR and Tokens Per Action (TPA). We ablate key components to isolate their effect on planning:

  • JSON graph: Represent the scene memory graph (SMG) as JSON instead of natural-language descriptions.
  • Full Graph: Provide the entire scene graph to the LM (no SMG pruning).
  • No Priors: Remove object/location priors from prompts (no heuristic hints for likely locations).
  • No Correction: Disable SGS action correction (no precondition insertion like “open(container)” before “pickup”); see the sketch after this list.
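
The correction being ablated can be pictured as precondition insertion. A minimal sketch with actions as (name, argument) tuples; the real SGS logic is richer.

```python
def correct(action, smg):
    """Prepend a missing precondition, e.g. open(container) before
    pickup(object). smg maps objects to {"parent": ..., "state": ...}."""
    name, obj = action
    if name == "pickup":
        container = smg[obj].get("parent")
        if smg.get(container, {}).get("state") == "closed":
            return [("open", container), (name, obj)]
    return [action]
```

For example, correct(("pickup", "milk"), {"milk": {"parent": "fridge"}, "fridge": {"state": "closed"}}) yields [("open", "fridge"), ("pickup", "milk")].
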
Method / Ablation   SR     TPA
LookPlanGraph       0.62   1989
JSON graph          0.55   2599
Full Graph          0.65   4044
No Priors           0.51   1767
No Correction       0.24   1682

Key results. SGS action correction is critical: removing it drops SR from 0.62 to 0.24. Full Graph gives a marginal SR gain (0.65 vs 0.62) at roughly double the token cost, and the JSON representation hurts SR while raising TPA. The SMG balances accuracy and token cost.

BibTeX

@article{LookPlanGraph,
    title   = {LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation},
    author  = {Onishchenko, Anatoly O. and Kovalev, Alexey K. and Panov, Aleksandr I.},
    journal = {arXiv preprint},
    year    = {2025}
}

Contributions

  • LookPlanGraph: An adaptive, context-aware planning framework that maintains a dynamic scene memory and excels in changing environments.
  • VLM Graph Augmentation: A module that grounds planning with egocentric images to discover and validate objects and relations in real time.
  • GraSIF Benchmark: A dataset of 514 instruction-following tasks with automated, graph-based validation for reproducible evaluation.