Cross-Modal Event Encoder: Bridging Image–Text Knowledge to Event Streams
Abstract
We introduce a robust event-centric encoder that expands the practical reach of event-based data across diverse tasks. To overcome the scarcity of large-scale event datasets, our approach adapts CLIP’s representation space to the event domain, preserving zero-shot learning and text alignment while mitigating catastrophic forgetting. By explicitly aligning event and image embeddings, the proposed encoder retains CLIP’s core strengths and delivers competitive performance on object recognition as well as zero-shot and few-shot benchmarks. Moreover, it generalizes effectively to event streams derived from video datasets without additional training. Finally, we show that the encoder integrates seamlessly into cross-modal architectures, enabling accurate event–image retrieval and unlocking new applications for the event modality.
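The abstract does not specify the exact training objective, but the core idea of explicitly aligning event embeddings with (frozen) CLIP image embeddings can be illustrated with a short sketch. The code below is an assumption-laden illustration, not the paper's implementation: `event_encoder`, `clip_image_encoder`, the paired `events`/`images` batch, and the combination of a contrastive and a regression term are all hypothetical choices used only to make the alignment idea concrete.

```python
# Minimal sketch of event-to-CLIP embedding alignment (illustrative only).
import torch
import torch.nn.functional as F


def alignment_loss(event_encoder, clip_image_encoder, events, images,
                   temperature=0.07):
    """Align trainable event embeddings with frozen CLIP image embeddings.

    All module and tensor names are placeholders; the paper's exact loss
    and architecture may differ.
    """
    # Frozen teacher: CLIP image embeddings for the frames paired with the events.
    with torch.no_grad():
        img_emb = F.normalize(clip_image_encoder(images), dim=-1)

    # Trainable student: embeddings of the corresponding event streams.
    evt_emb = F.normalize(event_encoder(events), dim=-1)

    # Symmetric InfoNCE over the batch: matching event/image pairs are positives.
    logits = evt_emb @ img_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets)
                         + F.cross_entropy(logits.t(), targets))

    # Direct cosine regression pulls each event embedding onto its paired image
    # embedding, keeping the encoder close to CLIP's original geometry
    # (one plausible way to mitigate catastrophic forgetting).
    regression = (1.0 - (evt_emb * img_emb).sum(dim=-1)).mean()

    return contrastive + regression
```

Because the alignment target is CLIP's own image embedding space, text prompts encoded by CLIP's text tower remain directly comparable to event embeddings, which is what would enable the zero-shot recognition and event–image retrieval described above; this reading is an inference from the abstract rather than a stated design detail.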