DICE: Discrete Inversion Enabling Controllable Editing for Masked Generative Models
Abstract
Recent advances in discrete diffusion models have demonstrated strong performance in image generation and masked language modeling, yet they remain limited in their capacity for controlled content editing. We propose DICE (Discrete Inversion for Controllable Editing), a novel framework that pioneers precise inversion capabilities for discrete diffusion models, including both masked generative and multinomial diffusion variants. Our key innovation lies in capturing noise sequences and masking patterns during reverse diffusion process, enabling both accurate reconstruction and flexible editing without relying on predefined masks or attention-based manipulations. Through comprehensive experiments across image and text modalities using models such as Paella, VQ-Diffusion, RoBERTa and LLaDA, we demonstrate that DICE successfully maintains high fidelity to the original data while significantly expanding editing capabilities. These results establish new possibilities for fine-grained content manipulation in discrete spaces