MSRTrack: LLM-Powered Object Tracking with Motion and Semantic Reasoning
Abstract
State-of-the-art object trackers primarily model appearance relations between the template image and the search region with Siamese networks. However, this well-established approach has a limited ability to leverage both motion and semantic cues of the target object, leading to accumulating errors in challenging scenarios such as drastic appearance changes and similar-looking distractors. To address these weaknesses, we propose a novel tracking framework with Motion and Semantic Reasoning (MSRTrack), which integrates short-term motion modeling and distinctive semantic features for robust tracking across diverse conditions. Powered by vision large language models (VLLMs) and the Segment Anything Model 2 (SAM2), MSRTrack identifies unique semantic attributes of the target, exploits motion cues across consecutive frames, and complements appearance-based trackers with strong semantic and dynamic reasoning capabilities. Unlike previous vision-language tracking (VLT) methods that rely on broad captioning, MSRTrack automatically focuses on a concise set of key semantic attributes of the target, substantially improving recovery from target loss and rejection of distractors. MSRTrack achieves state-of-the-art performance across multiple tracking benchmarks, with improvements of 2.2% on LaSOT, 9.5% on VastTrack, and 1.4% on TNL2K.