Beyond Faces: Multimodal Person Clustering for Unconstrained Environments
Abstract
The increasing demand for on-device AI, driven by privacy concerns and the need for real-time processing, poses new challenges for fundamental computer vision tasks. This paper addresses one such task, person clustering in photo galleries, which has traditionally relied on server-side computation or simplistic on-device models. We introduce the Multimodal Person Clustering Architecture (MPCA), a research framework designed to explore the feasibility of a high-performance, multimodal clustering pipeline operating entirely under mobile constraints. Our framework makes three principal contributions: (1) Multimodal Appearance-Assisted Identity Recovery (MAIR), a late-fusion strategy that leverages temporal consistency to recover identities when facial data is unreliable; (2) the Language-Guided Appearance Extractor (LGAE), which adapts a vision-language paradigm to construct robust appearance representations efficiently; and (3) Sequential Graph-Density Clustering (SGDC), a novel algorithm that combines graph-based and density-based methods to handle the high variance of appearance data. Extensive experiments show that our on-device framework achieves an average recall of 87.97\%, significantly outperforming the leading cloud-based commercial system, Google Photos (77.74\%), as well as the on-device systems Apple Photos (67.84\%) and Samsung Gallery (83.39\%). This work provides a blueprint for future research in privacy-preserving, efficient, and robust person clustering, highlighting a viable path for deploying next-generation computer vision applications directly on mobile devices.