Hybrid State Representation for Video Procedure Planning
Abstract
Accurate state representation is critical for effective procedure planning from visual inputs. Existing methods typically rely on sentence-level natural language descriptions to represent states. However, such representations are often ambiguous and fail to capture fine-grained object interactions, leading to errors in complex scenarios. To overcome these limitations, we propose a Hybrid State Representation (HSR), inspired by human-like reasoning, which models procedural states with both structural precision and contextual clarity. HSR integrates two complementary modalities: (1) Semantic State Graphs (SSGs), which explicitly encode objects, attributes, and their relations, and (2) contextual Question-Answer (QA) pairs, which act as semantic probes to disambiguate critical state transitions. We further design a heterogeneous encoder to fuse these components and introduce a visual-state alignment objective to ground the hybrid representation in the visual context. Extensive experiments on the COIN, CrossTask, and NIV benchmarks demonstrate that our method establishes a new state of the art, achieving significant gains on the strict Success Rate (SR) metric. Ablation studies confirm that both the structural (SSG) and contextual (QA) components of HSR are essential for the observed performance gains.
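To make the fusion and alignment steps described above concrete, the following is a minimal sketch of one plausible instantiation: a heterogeneous encoder that combines graph-derived SSG node features with QA-pair embeddings, and an InfoNCE-style contrastive loss standing in for the visual-state alignment objective. All names (e.g., HybridStateEncoder, visual_state_alignment_loss), dimensions, and the choice of a single message-passing step and cross-attention fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridStateEncoder(nn.Module):
    """Sketch of a heterogeneous encoder fusing an SSG branch (graph of
    objects/attributes/relations) with a contextual QA branch.
    Hypothetical design; not the authors' released code."""

    def __init__(self, node_dim=256, qa_dim=256, hidden_dim=512):
        super().__init__()
        # SSG branch: project node features into a shared hidden space.
        self.node_proj = nn.Linear(node_dim, hidden_dim)
        # QA branch: project pooled QA-pair embeddings (e.g., from a frozen
        # text encoder) into the same space.
        self.qa_proj = nn.Linear(qa_dim, hidden_dim)
        # Fusion: cross-attention lets QA tokens attend to SSG nodes.
        self.fuse = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, node_feats, adj, qa_feats):
        # node_feats: (B, N, node_dim); adj: (B, N, N); qa_feats: (B, Q, qa_dim)
        h = self.node_proj(node_feats)
        # One round of neighborhood averaging as a stand-in for a full
        # graph neural network over the Semantic State Graph.
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        h = h + torch.bmm(adj, h) / deg
        q = self.qa_proj(qa_feats)
        fused, _ = self.fuse(query=q, key=h, value=h)   # (B, Q, hidden_dim)
        state = self.out(fused.mean(dim=1))             # (B, hidden_dim)
        return F.normalize(state, dim=-1)


def visual_state_alignment_loss(state_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE loss as one possible visual-state alignment
    objective; the paper's exact formulation may differ."""
    visual_emb = F.normalize(visual_emb, dim=-1)
    logits = state_emb @ visual_emb.t() / temperature   # (B, B) similarity
    targets = torch.arange(state_emb.size(0), device=state_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The cross-attention direction (QA tokens querying SSG nodes) and the symmetric contrastive alignment are design choices assumed here for illustration; the same interface could accommodate other fusion or grounding objectives.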