RobustFormer: Noise-Robust Pre-training for Images and Videos
Abstract
While deep learning-based models such as transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit to noisy patterns rather than learning robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that are easily corrupted. To address this, we leverage the discrete wavelet transform (DWT) for its ability to decompose signals into multi-resolution subbands, isolating noise primarily in the high-frequency components while preserving the essential low-frequency information needed for resilient feature learning. Conventional DWT-based methods, however, suffer from computational inefficiency because they require a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using the DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets show that, compared to the baseline, our approach achieves up to an 8% increase in Top-1 classification accuracy under severe noise on the standard ImageNet-C benchmark, up to 2.7% on ImageNet-P, and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations, while maintaining comparable accuracy on clean datasets. We also observe a reduction in computational complexity of up to 4.4% from removing the IDWT step, relative to the VideoMAE baseline, without a significant performance drop.
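To make the core downsampling idea concrete, the following is a minimal sketch (not the paper's exact formulation) of a single-level 2D Haar DWT: it splits an image into a low-frequency LL subband, which is passed on at half resolution, and three high-frequency detail subbands that can be de-emphasized, with no IDWT reconstruction needed. The function name `haar_dwt2d` and the choice of the Haar wavelet are illustrative assumptions.

```python
import torch


def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar DWT computed by mixing 2x2 pixel blocks.

    Illustrative sketch only; the paper's actual wavelet and
    implementation may differ.

    x: tensor of shape (B, C, H, W) with even H and W.
    Returns (LL, LH, HL, HH), each of shape (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # low-frequency approximation (kept)
    lh = (a - b + c - d) / 2  # horizontal detail
    hl = (a + b - c - d) / 2  # vertical detail
    hh = (a - b - c + d) / 2  # diagonal detail (where noise concentrates)
    return ll, lh, hl, hh


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    ll, lh, hl, hh = haar_dwt2d(img)
    print(ll.shape)  # torch.Size([2, 3, 112, 112]) -- a 2x downsampling
```

Because the LL subband already provides a noise-attenuated, half-resolution representation, attention can operate directly on it, which is what allows the expensive IDWT step to be dropped.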