SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis
Abstract
Despite recent advances in text-conditioned 3D indoor scene generation, evaluation of these methods remains limited. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how faithfully a scene follows the input text and whether it meets implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements, including object counts, attributes, and spatial relationships, as well as complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these metrics provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. The dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches with SceneEval and show that it provides detailed assessments of the generated scenes, highlighting strengths and areas for improvement along multiple dimensions. Our results reveal significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.
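To make the explicit-requirement metrics concrete, the sketch below shows one way an object-count check could be scored against an annotated description. The data structures (CountRequirement, a flat list of object category labels) and the matching rule are illustrative assumptions for this sketch, not SceneEval's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CountRequirement:
    """Hypothetical annotation: the description asks for `count` objects of `category`."""
    category: str
    count: int


def object_count_score(scene_objects: list[str], requirements: list[CountRequirement]) -> float:
    """Fraction of count requirements satisfied by a generated scene.

    `scene_objects` holds the category label of each object placed in the scene.
    A requirement counts as satisfied when at least the requested number of
    objects of that category is present; this rule is an assumption made here
    for illustration only.
    """
    if not requirements:
        return 1.0
    counts = Counter(scene_objects)
    satisfied = sum(1 for req in requirements if counts[req.category] >= req.count)
    return satisfied / len(requirements)


if __name__ == "__main__":
    # Example description: "a bedroom with one bed and two nightstands"
    requirements = [CountRequirement("bed", 1), CountRequirement("nightstand", 2)]
    scene = ["bed", "nightstand", "wardrobe"]  # one nightstand missing
    print(object_count_score(scene, requirements))  # 0.5
```

Implicit-expectation metrics such as collision or navigability checks would operate on scene geometry rather than category labels, but they can report scores in the same normalized form so that results remain comparable across metrics.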