T2LF: LLM-Guided Multimodal Diffusion for Text-to-Light Field Synthesis
Abstract
We present a novel text-driven approach for light field (LF) synthesis. Existing methods typically generate LFs from given images, requiring users to find suitable reference images; this makes it difficult to construct a desired scene directly and limits scene diversity. Moreover, existing methods are largely constrained to the narrow baselines of their training datasets, which restricts the achievable viewpoint changes and thus the flexibility of motion. In contrast, our method synthesizes LFs directly from user-provided text descriptions by leveraging the scene understanding capabilities of a multimodal large language model (LLM) and the generative power of a diffusion model. Given a text prompt describing the desired LF, the multimodal LLM extracts information relevant to LF synthesis, which then guides a diffusion model to produce diverse scenes and motions. This approach enables LF synthesis even with a pre-trained model not originally designed for this purpose, requiring only minimal fine-tuning. The proposed framework thus produces visually diverse LFs from text input alone. Experimental results demonstrate that the synthesized LFs exhibit geometric consistency and achieve higher synthesis quality than existing methods.
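To make the described flow concrete, the minimal sketch below illustrates one plausible reading of the pipeline: a multimodal LLM parses the prompt into a structured scene and camera description, which then conditions a diffusion model that renders the sub-aperture views of the light field. All names here (`parse_prompt_with_llm`, `LFDiffusionModel`, the `SceneSpec` fields) are hypothetical placeholders introduced for illustration, not the paper's actual interfaces.

```python
"""Illustrative sketch of an LLM-guided text-to-light-field pipeline.

Every class and function below is a hypothetical placeholder standing in for
(1) a multimodal LLM that extracts LF-relevant information from the prompt and
(2) a pre-trained diffusion model conditioned on that information; none of
this reproduces the paper's implementation.
"""
from dataclasses import dataclass
from typing import List, Tuple
import random


@dataclass
class SceneSpec:
    """Structured information the LLM is assumed to extract from the prompt."""
    scene_caption: str            # refined caption passed to the diffusion model
    objects: List[str]            # salient objects and their rough layout
    baseline: float               # desired camera baseline between adjacent views
    grid: Tuple[int, int] = (5, 5)  # angular resolution (views per axis)


def parse_prompt_with_llm(prompt: str) -> SceneSpec:
    """Placeholder for the multimodal LLM that maps a free-form prompt to the
    structured information guiding LF synthesis."""
    return SceneSpec(
        scene_caption=prompt,
        objects=["foreground subject", "background"],
        baseline=0.1,
    )


class LFDiffusionModel:
    """Placeholder for a minimally fine-tuned, pre-trained diffusion model that
    generates one sub-aperture image per angular position."""

    def generate_view(self, spec: SceneSpec, u: int, v: int) -> List[float]:
        # A real model would denoise an image conditioned on the caption and
        # the view offset (u, v) scaled by the baseline; here we emit dummy pixels.
        return [random.random() for _ in range(16)]


def synthesize_light_field(prompt: str) -> List[List[List[float]]]:
    """Full pipeline: prompt -> LLM-extracted spec -> per-view diffusion."""
    spec = parse_prompt_with_llm(prompt)
    model = LFDiffusionModel()
    rows, cols = spec.grid
    return [[model.generate_view(spec, u, v) for v in range(cols)]
            for u in range(rows)]


if __name__ == "__main__":
    lf = synthesize_light_field("a wooden desk with a potted plant by a window")
    print(f"Synthesized {len(lf)}x{len(lf[0])} sub-aperture views")
```

In this reading, the LLM output acts purely as conditioning, which is consistent with the abstract's claim that a pre-trained diffusion model can be reused for LF synthesis with only minimal fine-tuning.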