TM-Adapter: Temporal Merge Adapter for Efficient Global Temporal Modeling
Abstract
We propose the Temporal Merge Adapter (TM-Adapter), a novel framework for image-to-video parameter-efficient transfer learning (PETL), specifically designed for temporal representation learning in video understanding. PETL has emerged as a practical strategy for adapting large-scale vision models to video tasks under limited computational budgets. However, existing PETL approaches suffer from local redundancy caused by highly similar consecutive frames, which limits the modeling of diverse temporal dependencies. To address this limitation, we introduce a lightweight merge-unmerge mechanism that temporally aggregates and redistributes token embeddings, enabling the model to capture diverse temporal patterns by mitigating redundancy. Furthermore, to handle temporal dependencies across different time scales, TM-Adapter introduces a single adapter module with two parallel branches, a local adapter and a global adapter, each specialized in capturing complementary patterns at a different temporal range. We validate our approach through experiments on Kinetics-400, Something-Something V2, and HMDB-51, demonstrating competitive performance compared with existing methods while maintaining high parameter efficiency.
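To make the merge-unmerge and dual-branch ideas concrete, the following is a minimal PyTorch sketch of one possible TM-Adapter block. It is an illustration under stated assumptions, not the paper's exact design: the class names, the averaging-based merge over groups of consecutive frames, the merge factor `merge_size`, and the bottleneck ratio are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Standard adapter bottleneck: down-project, nonlinearity, up-project."""

    def __init__(self, dim: int, ratio: float = 0.25):
        super().__init__()
        hidden = int(dim * ratio)
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class TMAdapterSketch(nn.Module):
    """Illustrative TM-Adapter block (hypothetical implementation).

    Input x has shape (B, T, N, D): batch, frames, tokens per frame, channels.
    The merge step averages each group of `merge_size` consecutive frames to
    suppress local redundancy; the global branch operates on the merged
    tokens, and the unmerge step redistributes its output back to all T
    frames. The local branch operates on the original, unmerged tokens.
    """

    def __init__(self, dim: int, merge_size: int = 2, ratio: float = 0.25):
        super().__init__()
        self.r = merge_size
        self.local_adapter = Bottleneck(dim, ratio)
        self.global_adapter = Bottleneck(dim, ratio)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Local branch: short-range patterns from per-frame tokens.
        local_out = self.local_adapter(x)
        # Merge: average groups of r consecutive frames (assumes T % r == 0).
        merged = x.view(B, T // self.r, self.r, N, D).mean(dim=2)
        # Global branch: longer-range patterns on the deduplicated sequence.
        global_out = self.global_adapter(merged)
        # Unmerge: broadcast each merged token back to its r source frames.
        unmerged = global_out.repeat_interleave(self.r, dim=1)
        # Residual combination of both branches.
        return x + local_out + unmerged


# Usage sketch: 8 frames of 196 tokens with 768 channels.
if __name__ == "__main__":
    block = TMAdapterSketch(dim=768, merge_size=2)
    tokens = torch.randn(2, 8, 196, 768)
    print(block(tokens).shape)  # torch.Size([2, 8, 196, 768])
```

In this sketch, only the two small bottlenecks are trainable, which mirrors the parameter-efficiency goal: the frozen backbone's tokens pass through untouched except for the residual adapter outputs.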