IMPACT: Interpretable Most Important Person Analysis and Classification using Transformer-based Models
Akshat Rampuria · Kamakshya Nayak · Kamalakar Thakare · Tushar Joshi · Aditya Singh · Haesol Park · Heeseung Choi · Debi Dogra · Ig-Jae Kim
Abstract
Identifying the Most Important Person (MIP) in complex social and sports events remains a challenging problem due to the dynamic nature of group interactions, subtle visual cues, and context-dependent semantics. Traditional methods often struggle to capture the interplay between individuals and the overarching activity, especially in unstructured real-world environments. The lack of strong supervision and the need for deeper contextual understanding further complicate the task. In this work, we propose IMPACT, a novel multi-modal framework that leverages recent advances in vision-language models to bridge the gap between visual perception and semantic reasoning. Our approach integrates structured scene understanding, natural language generation, and cross-modal learning to jointly model activity recognition and MIP localization. By combining language, vision, and spatial reasoning, the proposed method delivers interpretable and robust performance in sports-centric group activity scenarios. Comprehensive experiments on the C-Sports and NCAA datasets demonstrate that the framework significantly improves both the localization of key individuals and the accuracy of activity prediction, laying the groundwork for holistic scene understanding in human-centric video and image analysis. Our proposed method achieves an accuracy of 81.6\% against human annotator markings and improves mAP scores for MIP identification by $\sim 5\%$.