LookPlanGraph: Embodied Instruction Following with VLM Graph Augmentation

Anatoly O. Onishchenko², Alexey K. Kovalev¹,², Aleksandr I. Panov¹,²
¹ Cognitive AI Lab, Moscow, Russia  ² IAI MIPT, Moscow, Russia

Problem

Static planners rely solely on a predefined scene graph, making them brittle when objects are missing or misplaced. Dynamic planners instead explore the environment at execution time and update the scene graph with newly discovered objects, so the task can succeed even when the initial graph is stale.

Figure: Static vs dynamic planning under environment changes.

Figure: Task execution example.

Overview

LookPlanGraph cycles through LM decision-making, simulation feasibility checks, environment execution, and VLM-driven graph augmentation until the task is marked complete. It maintains a compact scene memory graph (SMG), derived from the initial scene graph with objects initially marked as unseen, which keeps prompts short and the context relevant. A Scene Graph Simulator (SGS) validates and, when necessary, corrects LM actions (e.g., inserting open(container) before pickup). When the LM chooses the discover objects action, a VLM processes egocentric images to add or validate object nodes, states, and relations in the SMG.
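
The control flow can be summarized as a short loop. Below is a minimal sketch of that loop; the lm, sgs, env, and vlm interfaces are hypothetical stand-ins, not names from the paper.

```python
def look_plan_graph(task, smg, env, lm, sgs, vlm, max_steps=50):
    """Minimal sketch of the plan/check/execute/augment loop."""
    for _ in range(max_steps):
        action = lm.next_action(task, smg)             # LM decides over the compact SMG
        if action.name == "done":
            return True
        if action.name == "discover_objects":
            vlm.augment(smg, env.egocentric_images())  # add/validate nodes, states, relations
            continue
        feasible, plan = sgs.validate(action, smg)     # feasibility check; may prepend
                                                       # preconditions such as open(container)
        if not feasible:
            lm.feedback(f"{action} is infeasible")     # let the LM replan
            continue
        for a in plan:
            env.execute(a)
            smg.apply(a)                               # keep the SMG in sync with the world
    return False
```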

Figure: Pipeline overview (LM planning, simulation checks, and VLM-driven graph augmentation).
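
One way the discover objects step could be realized is to ask the VLM to describe the current view as JSON and merge the reply into the SMG. The prompt wording and the merge_node/merge_edge helpers below are illustrative assumptions, not taken from the paper.

```python
import json

# Illustrative prompt; the actual prompt is not reproduced here.
DISCOVER_PROMPT = (
    "List the objects visible in this image as JSON: "
    '{"nodes": [{"name": "...", "state": "..."}], '
    '"edges": [{"src": "...", "rel": "inside|ontop", "dst": "..."}]}'
)

def augment_smg(smg, images, vlm_query):
    """Merge VLM detections from egocentric images into the SMG."""
    for image in images:
        reply = vlm_query(DISCOVER_PROMPT, image)  # e.g. GPT-4o or Llama3.2-90B-vision
        graph = json.loads(reply)
        for node in graph.get("nodes", []):
            smg.merge_node(node["name"], state=node.get("state"), seen=True)
        for edge in graph.get("edges", []):
            smg.merge_edge(edge["src"], edge["rel"], edge["dst"])
```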

GraSIF Dataset

We present a benchmark of 514 tasks, paired with a lightweight, automated validation framework, for evaluating graph-based instruction following in household manipulation scenarios.

GraSIF integrates tasks from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow to provide lightweight evaluation of scene-graph-based planners across diverse scenarios. Each task includes a natural-language instruction together with initial and goal scene graphs. Success is checked by a simulator-based comparison of changed nodes and asset states.
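
With graphs in hand, validation reduces to a graph diff. A minimal sketch, with graphs as plain dicts (the attribute layout is an assumption):

```python
def task_success(initial, achieved, goal):
    """Graph-based success check: only the nodes that the goal changes
    relative to the initial graph are compared. Graphs map a node name
    to its attributes, e.g. {"parent": "fridge", "state": "closed"}."""
    changed = {node for node, attrs in goal.items() if initial.get(node) != attrs}
    return all(achieved.get(node) == goal[node] for node in changed)
```

For a put-in-fridge task, for example, only the moved object's parent and the fridge's state would be compared.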

  • Scale: 10 scenes, 514 tasks
  • Focus: Mobile manipulation; predicates like inside and ontop
  • Validation: Fast graph-based evaluation; no heavy physics required
GraSIF dataset statistics (Rooms, Nodes, and Actions are per-task averages).
Subset            Tasks   Rooms   Nodes   Actions
SayPlan Office       29      37   202.6       4.2
BEHAVIOR-1K         177    1.23    12.1       9.8
VirtualHome         308       4   195.7       3.1

SayPlan Office offers the largest environment, spanning 37 rooms. BEHAVIOR‑1K features long‑horizon tasks, and VirtualHome is a densely populated environment.

Figure: Graph examples.

Results

Evaluation and metrics. Success Rate (SR): the fraction of tasks in which all graph nodes reach the goal configuration. For evaluation we use GPT-4o and Llama3.2-90B-vision as VLMs, and Llama3.3-70B in the static settings.

Dynamic settings

Setup. VirtualHome: 50 tasks across three categories (put in fridge, heat in microwave, wash in dishwasher). OmniGibson: the SGS emulates high-level actions, and perception uses images collected per asset (50 BEHAVIOR-1K tasks). In both environments, object positions in the initial scene graph are perturbed for half the tasks.
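
The perturbation can be pictured as re-parenting a few objects in the initial graph so that it disagrees with the actual environment. A minimal sketch; the paper's exact procedure is not specified here, and the movable/parent attributes are assumptions.

```python
import random

def perturb_initial_graph(graph, containers, fraction=0.3, seed=0):
    """Re-parent a random subset of movable objects (illustrative only)."""
    rng = random.Random(seed)
    movable = [n for n, attrs in graph.items() if attrs.get("movable")]
    if not movable:
        return
    for obj in rng.sample(movable, k=max(1, int(fraction * len(movable)))):
        graph[obj]["parent"] = rng.choice(containers)  # stale prior for the planner
```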

Method          VH (L3.2)   VH (GPT-4o)   OG (L3.2)   OG (GPT-4o)
LLM-as-P             0.16          0.44        0.16          0.33
LLM+P                0.00          0.32        0.21          0.37
SayPlan              0.21          0.38        0.10          0.12
SayPlan Lite         0.39          0.48        0.27          0.40
ReAct                0.30          0.22        0.33          0.41
LookPlanGraph        0.60          0.52        0.35          0.42

Key results. LookPlanGraph achieves the highest SR across configurations.

Static settings (GraSIF)

Setup. Since GraSIF provides only graphs, the augmentation module is adapted to add nodes adjacent to explored nodes. We evaluate across SayPlan Office, BEHAVIOR-1K, and RobotHow.
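
A minimal sketch of this graph-only discovery, assuming hypothetical neighbors, merge_node, and merge_edge interfaces:

```python
def discover_adjacent(smg, gt_graph, explored_node):
    """Graph-only stand-in for VLM discovery: reveal the ground-truth
    neighbors of a node the agent has just explored."""
    for neighbor, relation in gt_graph.neighbors(explored_node):
        smg.merge_node(neighbor, seen=True)
        smg.merge_edge(explored_node, relation, neighbor)
```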

Method          SayPlan Office   BEHAVIOR-1K   RobotHow
LLM-as-P                  0.47          0.39       0.44
LLM+P                     0.07          0.33       0.30
SayPlan                   0.46          0.36       0.86
SayPlan Lite              0.53          0.61       0.84
ReAct                     0.38          0.47       0.89
LookPlanGraph             0.62          0.60       0.87

Key results. LookPlanGraph leads in SayPlan Office and remains competitive in BEHAVIOR‑1K and RobotHow, offering strong SR with compact prompts.

Graph augmentation capability

Setup. Using images gathered during simulation, the VLMs reconstruct scene graphs. We report F1 for nodes and edges, and Success, which requires every node and edge to match the ground truth.
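
Node and edge F1 can be computed as a multiset match against ground truth; a minimal sketch:

```python
from collections import Counter

def f1(predicted, ground_truth):
    """Multiset F1 over graph elements: nodes as names,
    edges as (src, rel, dst) triples."""
    pred, gt = Counter(predicted), Counter(ground_truth)
    tp = sum((pred & gt).values())  # elements present in both
    if tp == 0:
        return 0.0
    precision, recall = tp / sum(pred.values()), tp / sum(gt.values())
    return 2 * precision * recall / (precision + recall)
```

Success then corresponds to both node and edge F1 reaching 1.0.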

VirtualHome
Model                  F1 (Nodes)   F1 (Edges)   Success
Llama3.2-90B-vision          0.78         0.73      0.26
GPT-4o                       0.87         0.84      0.44

OmniGibson
Model                  F1 (Nodes)   F1 (Edges)   Success
Llama3.2-90B-vision          0.67         0.67      0.33
GPT-4o                       0.85         0.88      0.59

Key results. GPT‑4o produces more accurate graphs and higher Success; Llama3.2‑90B‑vision tends to undercount repeated instances but remains usable.

Ablations

Setup. SayPlan Office (GraSIF); we report SR and Tokens Per Action (TPA). We ablate key components to isolate their effect on planning:

  • JSON graph: Represent the scene memory graph (SMG) as JSON instead of natural-language descriptions.
  • Full Graph: Provide the entire scene graph to the LM (no SMG pruning).
  • No Priors: Remove object/location priors from prompts (no heuristic hints for likely locations).
  • No Correction: Disable SGS action correction (no precondition insertion like “open(container)” before “pickup”); see the sketch after this list.
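
The correction being ablated can be pictured as precondition insertion. A minimal sketch with actions as (name, argument) tuples; the real SGS logic is richer.

```python
def correct(action, smg):
    """Prepend a missing precondition, e.g. open(container) before
    pickup(object). smg maps objects to {"parent": ..., "state": ...}."""
    name, obj = action
    if name == "pickup":
        container = smg[obj].get("parent")
        if smg.get(container, {}).get("state") == "closed":
            return [("open", container), (name, obj)]
    return [action]
```

For example, correct(("pickup", "milk"), {"milk": {"parent": "fridge"}, "fridge": {"state": "closed"}}) yields [("open", "fridge"), ("pickup", "milk")].
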
Method / Ablation   SR     TPA
LookPlanGraph       0.62   1989
JSON graph          0.55   2599
Full Graph          0.65   4044
No Priors           0.51   1767
No Correction       0.24   1682

Key results. SGS action correction is critical: removing it drops SR from 0.62 to 0.24. Full Graph gives a marginal SR gain (0.65 vs 0.62) at roughly double the token cost, and the JSON representation hurts SR while raising TPA. The SMG balances accuracy and token cost.

BibTeX

@article{LookPlanGraph,
    title   = {LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation},
    author  = {Onishchenko, Anatoly O. and Kovalev, Alexey K. and Panov, Aleksandr I.},
    journal = {arXiv preprint},
    year    = {2025}
}

Contributions

  • LookPlanGraph: An adaptive, context-aware planning framework that maintains a dynamic scene memory and excels in changing environments.
  • VLM Graph Augmentation: A module that grounds planning with egocentric images to discover and validate objects and relations in real time.
  • GraSIF Benchmark: A dataset of 514 instruction-following tasks with automated, graph-based validation for reproducible evaluation.