Oral Session
Oral Session 4B: Machine Learning I
Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
Ziqiang Shi ⋅ Rujie Liu ⋅ Shanshan Yu ⋅ Satoshi Munakata ⋅ Koichi Shirahata
Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, a phenomenon termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation, only a single decoding step.
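As a rough illustration of the entropic optimal transport step mentioned in the abstract above, the following minimal sketch runs Sinkhorn iterations to couple the components of two mixtures. It is not the authors' implementation: the 1-D component means, the squared-distance cost between means, the regularization strength `eps`, and the function name `sinkhorn` are all assumptions made for illustration.

```python
import math

def sinkhorn(a, b, cost, eps=0.1, n_iters=200):
    """Entropic OT plan between weight vectors a and b (assumed sketch)."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical mixture weights and 1-D component means (illustrative only)
halluc_weights, halluc_means = [0.5, 0.5], [0.0, 1.0]
trust_weights, trust_means = [0.5, 0.5], [0.1, 0.9]
cost = [[(h - t) ** 2 for t in trust_means] for h in halluc_means]

plan = sinkhorn(halluc_weights, trust_weights, cost)
# For each hallucination component, pick the trust component receiving most mass
mapping = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

Here `mapping[i]` is the index of the trust component that receives most of the mass from hallucination component `i`. A full treatment of Gaussian components would use a cost that also accounts for covariances, which this sketch omits.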
Unified Alignment Protocol: Making Sense of the Unlabeled Data in New Domains
Sabbir Ahmed ⋅ Mamshad Nayeem Rizve ⋅ Abdullah Al Arafat ⋅ Jacqueline Liu ⋅ Rahim Hossain ⋅ Mohaiminul Nahian ⋅ Adnan Siraj Rakin
Semi-Supervised Federated Learning (SSFL) is gaining popularity over conventional Federated Learning in many real-world applications. Because labeled data is scarce on the client side, SSFL assumes that participating clients train with unlabeled data and that only the central server has the resources to access limited labeled data, making it an ideal fit for real-world applications (e.g., healthcare). However, traditional SSFL assumes that the data distributions in the training and testing phases are the same. In practice, domain shifts frequently occur, making it essential for SSFL to incorporate generalization capabilities to remain practical. The core challenge is improving model generalization to new, unseen domains while clients participate in SSFL. The decentralized setup of SSFL and unsupervised client training necessitate innovation to achieve improved generalization across domains. To this end, we propose a novel framework called the Unified Alignment Protocol (UAP), which consists of an alternating two-stage training process. The first stage trains the server model to learn and align its features with a parametric distribution, which is subsequently communicated to clients without additional communication overhead. The second stage uses a novel training algorithm that aligns client features with the server feature distribution. Our extensive experiments on standard domain generalization benchmark datasets across multiple model architectures show that the proposed UAP achieves SOTA generalization performance in the SSFL setting.
Feedback Alignment Meets Low-Rank Manifolds: A Structured Recipe for Local Learning
Arani Roy ⋅ Marco P. E. Apolinario ⋅ Shristi Biswas Biswas ⋅ Kaushik Roy
Training deep neural networks (DNNs) with backpropagation (BP) achieves state-of-the-art accuracy but requires global error propagation and full parameterization, leading to substantial memory and computational overhead. Direct Feedback Alignment (DFA) enables local, parallelizable updates with lower memory requirements but is limited by unstructured feedback and poor scalability in deeper architectures, especially convolutional neural networks. To address these limitations, we propose a structured local learning framework that operates directly on low-rank manifolds defined by the Singular Value Decomposition (SVD) of weight matrices. Each layer is trained in its decomposed form, with updates applied to the SVD components using a composite loss that integrates cross-entropy, subspace alignment, and orthogonality regularization. Feedback matrices are constructed to match the SVD structure, ensuring consistent alignment between forward and feedback pathways. Our method reduces the number of trainable parameters relative to the original DFA model, without relying on pruning or post hoc compression. Experiments on CIFAR-10, CIFAR-100, and ImageNet show that our method achieves accuracy comparable to that of BP. Ablation studies confirm the importance of each loss term in the low-rank setting. These results establish local learning on low-rank manifolds as a principled and scalable alternative to full-rank gradient-based training.
Learning from Unknown for Open-Set Test-Time Adaptation
Taki Hasan Rafi ⋅ Amit Agarwal ⋅ Hitesh Patel ⋅ Dong-Kyu Chae
Deep learning models often struggle to maintain performance when the training and testing data come from different distributions. Test-time adaptation (TTA) addresses this by adapting a pre-trained model to an unlabeled target domain under distribution shifts. A more challenging setting is open-set TTA (OSTTA), where the target domain may contain unknown samples outside the source classes. Existing OSTTA methods primarily detect and discard such unknowns, relying only on known samples for adaptation. In this work, we argue that unknown samples can also provide valuable cues for improving adaptation. We propose LU-OSTTA (learning from unknown for OSTTA), a simple yet effective framework that leverages both in-distribution and semantically useful out-of-distribution samples. Our approach introduces: (i) a class-conditioned dynamic energy threshold to separate OOD samples more reliably, (ii) an optimal transport–based pseudo-label refinement to mitigate noise under distribution shifts, and (iii) an adaptive prototype weighting strategy that emphasizes semantically aligned target samples while down-weighting harmful ones. Extensive experiments on CIFAR-C and Tiny-ImageNet-C benchmarks demonstrate that LU-OSTTA consistently outperforms state-of-the-art TTA and OSTTA methods, highlighting the benefits of utilizing rather than discarding unknown samples.
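To make the idea of a class-conditioned energy threshold in item (i) above more concrete, here is a small sketch. The abstract does not specify the rule, so everything below is an assumption: the energy score is taken as the negative log-sum-exp of the logits, and the hypothetical `ClassEnergyThreshold` keeps one exponential moving average of energies per predicted class, flagging a sample as unknown when its energy exceeds that average by a margin.

```python
import math

def energy(logits):
    """Energy score: negative log-sum-exp of the logits (lower = more confident)."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

class ClassEnergyThreshold:
    """Hypothetical per-class moving threshold; not the paper's exact rule."""

    def __init__(self, num_classes, init=-5.0, momentum=0.9, margin=2.0):
        self.thresholds = [init] * num_classes
        self.momentum = momentum
        self.margin = margin

    def is_unknown(self, logits):
        pred = max(range(len(logits)), key=logits.__getitem__)
        e = energy(logits)
        unknown = e > self.thresholds[pred] + self.margin
        if not unknown:
            # Update the running energy level only with samples treated as known
            self.thresholds[pred] = (
                self.momentum * self.thresholds[pred] + (1 - self.momentum) * e
            )
        return unknown

detector = ClassEnergyThreshold(num_classes=3)
confident = detector.is_unknown([10.0, 0.0, 0.0])  # sharply peaked logits
flat = detector.is_unknown([0.0, 0.0, 0.0])        # uninformative logits
```

In this sketch the peaked prediction is kept as known and lowers the threshold for its class, while the flat prediction is flagged as unknown. The values of `init`, `momentum`, and `margin` are arbitrary illustrative choices.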
Streaming Real-Time Trajectory Prediction Using Endpoint-Aware Modeling
Alexander Prutsch ⋅ David Schinagl ⋅ Horst Possegger
Future trajectories of neighboring traffic agents have a significant influence on the path planning and decision-making of autonomous vehicles. While trajectory forecasting is a well-studied field, research mainly focuses on snapshot-based prediction, where each scenario is treated independently of its global temporal context. However, real-world autonomous driving systems need to operate in a continuous setting, requiring real-time processing of data streams with low latency and consistent predictions over successive timesteps. We leverage this continuous setting to propose a lightweight yet highly accurate streaming-based trajectory forecasting approach. We integrate valuable information from previous predictions with a novel endpoint-aware modeling scheme. Our temporal context propagation uses the trajectory endpoints of the previous forecasts as anchors to extract targeted scenario context encodings. Our approach efficiently guides its scene encoder to extract highly relevant context information without needing refinement iterations or segment-wise decoding. Our experiments highlight that our approach effectively relays information across consecutive timesteps. Unlike methods using multi-stage refinement processing, our approach significantly reduces inference latency, making it well-suited for real-world deployment. We achieve state-of-the-art streaming trajectory prediction results on the Argoverse 2 multi-agent and single-agent benchmarks, while requiring substantially fewer resources.