Isolating the Role of Temporal Information in Video Saliency: A Controlled Experimental Analysis
Abstract
The role of temporal information in predicting human gaze in dynamic scenes remains a critical open question, underscored by the paradoxical finding that strong static models can outperform complex video-based models. This suggests that the true contribution of temporal cues has been obscured by confounding architectural variables. To resolve this, we present a rigorous, controlled experiment centered on a minimal architectural pair: a spatio-temporal saliency model (UniformerSal-ST) and an otherwise identical spatial-only counterpart (UniformerSal-S), designed to unambiguously isolate the impact of temporal feature integration. Our results demonstrate that principled temporal fusion yields a substantial Information Gain (IG) of +0.20 bits on temporally coherent datasets such as LEDOV. Crucially, our controlled comparison also uncovers a key failure mode: on datasets with frequent hard cuts, such as DIEM, the same mechanism degrades performance, incurring an IG deficit of 0.07 bits. We provide a mechanistic explanation for this dichotomy, revealing how certain visual scenarios (scene discontinuities, rapid camera zooms) can disrupt current temporal fusion approaches. By precisely quantifying both the benefits and drawbacks of temporal processing, our work provides the community with clear, actionable insights into when and why temporal information should be modeled for more robust and accurate video saliency prediction.
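As context for the IG figures quoted above, a commonly used fixation-based formulation of Information Gain (the abstract does not fix the exact variant; the baseline map B, e.g. a center prior, and the regularizer ε are assumptions here) evaluates a model map P at the N fixated locations x_i of a frame:

\[
\mathrm{IG}(P, B) \;=\; \frac{1}{N} \sum_{i=1}^{N} \Big[ \log_2\!\big(\varepsilon + P(x_i)\big) \;-\; \log_2\!\big(\varepsilon + B(x_i)\big) \Big],
\]

where P and B are normalized to probability distributions over the image. Positive values mean the model explains the observed fixations better than the baseline, measured in bits per fixation, which is the sense in which gains such as +0.20 bits or deficits of 0.07 bits are reported.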