Results
Evaluation and metrics.
Success Rate (SR): the fraction of tasks in which all graph nodes reach the goal configuration.
Models: GPT‑4o and Llama3.2‑90B‑vision serve as the VLMs; Llama3.3 80B is used for the static settings.
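The SR definition above can be made concrete with a minimal sketch. It assumes each graph is a dictionary from node id to a hashable state; this representation and the names `task_succeeded` and `success_rate` are illustrative assumptions, not taken from the evaluation code.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical representation: node id -> state (e.g. parent location plus attributes).
Graph = Dict[str, Hashable]


def task_succeeded(final: Graph, goal: Graph) -> bool:
    """True only if every node listed in the goal is present with exactly the goal state."""
    return all(node in final and final[node] == state for node, state in goal.items())


def success_rate(episodes: List[Tuple[Graph, Graph]]) -> float:
    """Fraction of (final, goal) pairs in which the whole goal configuration is met."""
    if not episodes:
        return 0.0
    return sum(task_succeeded(f, g) for f, g in episodes) / len(episodes)


episodes = [
    ({"apple": ("fridge", "closed")}, {"apple": ("fridge", "closed")}),  # goal reached
    ({"apple": ("table", "none")}, {"apple": ("fridge", "closed")}),     # goal missed
]
print(success_rate(episodes))
```

A task counts as a success only when every goal node matches, so a plan that gets most nodes right still scores zero for that task.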
Dynamic settings
Setup.
VirtualHome: 50 tasks across three categories (put in fridge, heat in microwave, wash in dishwasher).
OmniGibson: SGS emulates high‑level actions; perception uses images collected per asset (50 BEHAVIOR‑1K tasks).
For both environments, object positions in the initial scene graph are perturbed for half the tasks.
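The perturbation step can be sketched as follows. The source does not say how positions are perturbed, so this example assumes that one randomly chosen object is moved to a different location in the planner's initial graph; `perturb`, `build_initial_graphs`, and the object-to-location dictionary are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical scene graph: object name -> name of the location node that holds it.
SceneGraph = Dict[str, str]


def perturb(graph: SceneGraph, locations: List[str], rng: random.Random) -> SceneGraph:
    """Return a copy in which one randomly chosen object is moved elsewhere."""
    perturbed = dict(graph)
    obj = rng.choice(sorted(perturbed))
    alternatives = [loc for loc in locations if loc != perturbed[obj]]
    perturbed[obj] = rng.choice(alternatives)
    return perturbed


def build_initial_graphs(graphs: List[SceneGraph], locations: List[str],
                         seed: int = 0) -> List[SceneGraph]:
    """Perturb exactly half of the tasks (rounded down), chosen at random."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(graphs)), k=len(graphs) // 2))
    return [perturb(g, locations, rng) if i in chosen else dict(g)
            for i, g in enumerate(graphs)]


tasks = [{"apple": "table", "plate": "sink"} for _ in range(4)]
locations = ["table", "sink", "fridge", "counter"]
initial = build_initial_graphs(tasks, locations)
print(sum(a != b for a, b in zip(initial, tasks)))  # number of perturbed tasks
```

In the perturbed tasks the planner starts from a graph that disagrees with the environment, which is what makes the setting dynamic.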
| Method | VirtualHome (Llama3.2) | VirtualHome (GPT‑4o) | OmniGibson (Llama3.2) | OmniGibson (GPT‑4o) |
| --- | --- | --- | --- | --- |
| LLM-as-P | 0.16 | 0.44 | 0.16 | 0.33 |
| LLM+P | 0.00 | 0.32 | 0.21 | 0.37 |
| SayPlan | 0.21 | 0.38 | 0.10 | 0.12 |
| SayPlan Lite | 0.39 | 0.48 | 0.27 | 0.40 |
| ReAct | 0.30 | 0.22 | 0.33 | 0.41 |
| LookPlanGraph | 0.60 | 0.52 | 0.35 | 0.42 |
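Key results. LookPlanGraph achieves the highest SR in all four environment/model configurations, with the largest margin on VirtualHome with Llama3.2 (0.60 vs. 0.39 for SayPlan Lite).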
Static settings (GraSIF)
Setup.
Since GraSIF provides only graphs, the augmentation module is adapted to add the nodes adjacent to explored nodes.
We evaluate across SayPlan Office, BEHAVIOR‑1K, and RobotHow.
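A minimal sketch of the adapted augmentation, assuming the ground-truth graph is available as an adjacency map and the planner keeps a set of known nodes. The function name `augment` and the data layout are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Set

# Hypothetical ground-truth graph as an adjacency map (node -> neighbouring nodes).
Adjacency = Dict[str, Set[str]]


def augment(known: Set[str], explored: str, ground_truth: Adjacency) -> Set[str]:
    """Reveal the ground-truth neighbours of the node that was just explored."""
    return known | {explored} | ground_truth.get(explored, set())


ground_truth = {
    "kitchen": {"fridge", "table"},
    "fridge": {"kitchen", "milk"},
    "table": {"kitchen"},
    "milk": {"fridge"},
}
known = {"kitchen"}
known = augment(known, "kitchen", ground_truth)  # fridge and table become visible
known = augment(known, "fridge", ground_truth)   # milk is revealed only now
print(sorted(known))
```

Here the graph lookup stands in for perception: a node enters the planner's memory only after one of its neighbours has been explored.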
| Method | SayPlan Office | BEHAVIOR‑1K | RobotHow |
| --- | --- | --- | --- |
| LLM-as-P | 0.47 | 0.39 | 0.44 |
| LLM+P | 0.07 | 0.33 | 0.30 |
| SayPlan | 0.46 | 0.36 | 0.86 |
| SayPlan Lite | 0.53 | 0.61 | 0.84 |
| ReAct | 0.38 | 0.47 | 0.89 |
| LookPlanGraph | 0.62 | 0.60 | 0.87 |
Key results. LookPlanGraph leads on SayPlan Office (0.62 vs. 0.53 for SayPlan Lite) and remains competitive on BEHAVIOR‑1K (0.60 vs. 0.61) and RobotHow (0.87 vs. 0.89), offering strong SR with compact prompts.
Graph augmentation capability
Setup.
Using images gathered during simulation, VLMs reconstruct scene graphs.
We report F1 for nodes/edges and Success when all nodes and edges match ground truth.
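The two metrics can be sketched as below. The matching rule is not specified in this section, so the example assumes exact matches over sets of node ids and (source, relation, target) edge triples; the function names are hypothetical.

```python
from typing import Hashable, Set


def f1(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over exact set matches."""
    if not predicted and not truth:
        return 1.0
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def graph_success(pred_nodes, pred_edges, true_nodes, true_edges) -> bool:
    """Success only when both nodes and edges match the ground truth exactly."""
    return pred_nodes == true_nodes and pred_edges == true_edges


true_nodes = {"table", "apple_1", "apple_2"}
true_edges = {("apple_1", "on", "table"), ("apple_2", "on", "table")}
pred_nodes = {"table", "apple_1"}           # one repeated instance is missed
pred_edges = {("apple_1", "on", "table")}

print(f1(pred_nodes, true_nodes), f1(pred_edges, true_edges))
print(graph_success(pred_nodes, pred_edges, true_nodes, true_edges))
```

Missing a single repeated instance still leaves a fairly high F1 but makes Success false, which is one way Success can sit well below the F1 scores.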
VirtualHome
| Model | F1 (Nodes) | F1 (Edges) | Success |
| --- | --- | --- | --- |
| Llama3.2‑90B‑vision | 0.78 | 0.73 | 0.26 |
| GPT‑4o | 0.87 | 0.84 | 0.44 |
OmniGibson
| Model | F1 (Nodes) | F1 (Edges) | Success |
| --- | --- | --- | --- |
| Llama3.2‑90B‑vision | 0.67 | 0.67 | 0.33 |
| GPT‑4o | 0.85 | 0.88 | 0.59 |
Key results. GPT‑4o produces more accurate graphs and higher Success; Llama3.2‑90B‑vision tends to undercount repeated instances but remains usable.
Ablations
Setup.
Ablations are run on SayPlan Office (GraSIF); we report SR and Tokens Per Action (TPA).
We ablate key components to isolate their effect on planning:
- JSON graph: Represent the scene memory graph (SMG) as JSON instead of natural-language descriptions.
- Full Graph: Provide the entire scene graph to the LM (no SMG pruning).
- No Priors: Remove object/location priors from prompts (no heuristic hints for likely locations).
- No Correction: Disable SGS action correction (no precondition insertion like “open(container)” before “pickup”).
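To illustrate what the No Correction ablation removes, the sketch below inserts the missing `open(container)` precondition before a `pickup` from a closed container. It is a simplified stand-in under assumed data structures (`inside`, `is_open`, and the `(verb, target)` action format are hypothetical), not the SGS implementation.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (verb, target), e.g. ("pickup", "milk")


def correct_plan(
    plan: List[Action],
    inside: Dict[str, str],    # object -> container holding it (hypothetical state)
    is_open: Dict[str, bool],  # container -> whether it is currently open
    enabled: bool = True,      # False mimics the "No Correction" ablation
) -> List[Action]:
    """Insert open(container) before a pickup from a closed container."""
    open_state = dict(is_open)  # work on a copy so the caller's state is untouched
    corrected: List[Action] = []
    for verb, target in plan:
        container = inside.get(target)
        needs_open = (
            enabled
            and verb == "pickup"
            and container is not None
            and not open_state.get(container, True)
        )
        if needs_open:
            corrected.append(("open", container))
            open_state[container] = True
        # Track explicit open/close actions already present in the plan.
        if verb == "open":
            open_state[target] = True
        elif verb == "close":
            open_state[target] = False
        corrected.append((verb, target))
    return corrected


plan = [("pickup", "milk"), ("pickup", "juice")]
inside = {"milk": "fridge", "juice": "fridge"}
is_open = {"fridge": False}
print(correct_plan(plan, inside, is_open))
print(correct_plan(plan, inside, is_open, enabled=False))
```

With correction enabled the fridge is opened once before the first pickup; with it disabled the plan is passed through unchanged and would fail on its first action in a simulator that enforces preconditions.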
| Method / Ablation | SR | TPA |
| --- | --- | --- |
| LookPlanGraph | 0.62 | 1989 |
| JSON graph | 0.55 | 2599 |
| Full Graph | 0.65 | 4044 |
| No Priors | 0.51 | 1767 |
| No Correction | 0.24 | 1682 |
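Key results. SGS action correction is critical: disabling it drops SR from 0.62 to 0.24. Removing priors lowers SR to 0.51. Full Graph slightly improves SR (0.65) but roughly doubles TPA (4044 vs. 1989); the JSON graph lowers SR (0.55) and raises TPA (2599). The SMG balances accuracy and token cost.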
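The trade-off can be read directly off the ablation table. The short script below copies the reported (SR, TPA) pairs and expresses each variant relative to the full LookPlanGraph configuration; the helper name `relative_to_baseline` is illustrative.

```python
# Values copied from the ablation table: (SR, TPA).
results = {
    "LookPlanGraph": (0.62, 1989),
    "JSON graph": (0.55, 2599),
    "Full Graph": (0.65, 4044),
    "No Priors": (0.51, 1767),
    "No Correction": (0.24, 1682),
}

base_sr, base_tpa = results["LookPlanGraph"]


def relative_to_baseline(name: str):
    """Return (absolute SR change, TPA as a multiple of the baseline)."""
    sr, tpa = results[name]
    return round(sr - base_sr, 2), round(tpa / base_tpa, 2)


for name in results:
    print(name, relative_to_baseline(name))
```

Full Graph gains 0.03 SR at about 2.03 times the tokens per action, while No Correction saves roughly 15% of the tokens but loses 0.38 SR.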