Feature Inversion as a Lens on Vision Encoders
Abstract
Vision encoders power modern vision-only and vision-language systems, yet the geometry of their internal features remains opaque. In this work, we introduce a simple, general approach to analyzing vision encoder latents: reconstruct images from frozen encoder features and treat reconstructability as a proxy for retained information and feature organization. Concretely, we train a lightweight reconstructor to invert feature tensors and use it to compare several vision encoders: CLIP-based ViT, SigLIP, SAM, and InternViT. We rank models by the informativeness of their features and find that image-centric training objectives and higher spatial resolution consistently yield more informative features. Beyond measurement, controlled manipulations in feature space produce predictable pixel-level edits: orthogonal rotations of the channel dimension (rather than spatial transformations) implement channel permutations and drive systematic color changes; linear contractions implement channel suppression; and a learned linear map enables plausible colorization of grayscale inputs. VLM-based experiments confirm that feature-space color swaps translate into semantic color changes in reconstructions. Our approach is encoder-agnostic in principle (demonstrated here on ViT-based models), requires only access to features, and offers a practical diagnostic of what encoders remember, how that information is organized, and how it can be manipulated.
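
As a rough illustration of the pipeline the abstract describes, the sketch below inverts frozen encoder features with a small convolutional decoder and applies two feature-space manipulations before decoding: a channel permutation (an orthogonal rotation of the channel axis) and a linear contraction (channel suppression). The decoder architecture, feature shapes, and names such as LightweightReconstructor are illustrative assumptions rather than the paper's actual implementation, and random tensors stand in for real CLIP/SigLIP/SAM/InternViT features.

    import torch
    import torch.nn as nn

    # Illustrative feature shape: (batch, channels, height, width), e.g. a ViT
    # token grid reshaped into a spatial map. All dimensions are assumptions.
    B, C, H, W = 1, 768, 16, 16

    class LightweightReconstructor(nn.Module):
        """Minimal convolutional decoder mapping frozen encoder features back to
        pixels. A stand-in for the paper's reconstructor, not its real design."""
        def __init__(self, in_channels: int, upsample_steps: int = 4):
            super().__init__()
            layers, ch = [], in_channels
            for _ in range(upsample_steps):
                layers += [
                    nn.Upsample(scale_factor=2, mode="nearest"),
                    nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1),
                    nn.GELU(),
                ]
                ch //= 2
            layers.append(nn.Conv2d(ch, 3, kernel_size=3, padding=1))
            self.decoder = nn.Sequential(*layers)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.decoder(feats)

    reconstructor = LightweightReconstructor(C)

    # Random features stand in for frozen encoder outputs.
    feats = torch.randn(B, C, H, W)

    # Baseline reconstruction from unmodified features.
    recon = reconstructor(feats)

    # Channel permutation: a particular orthogonal rotation of the channel axis,
    # applied in feature space to probe how channels map to pixel-level properties.
    perm = torch.randperm(C)
    recon_permuted = reconstructor(feats[:, perm])

    # Linear contraction: scale a subset of channels toward zero (channel
    # suppression) and observe the corresponding change in the reconstruction.
    scale = torch.ones(C)
    scale[:64] = 0.1  # which channels to contract is an arbitrary illustrative choice
    recon_suppressed = reconstructor(feats * scale.view(1, C, 1, 1))

    print(recon.shape, recon_permuted.shape, recon_suppressed.shape)

In this sketch the reconstructor is untrained; in practice it would first be trained to invert the frozen features, after which the same feature-space edits can be compared against the baseline reconstruction.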