HumanBench: Two Heads, No Legs, But Mostly Human: The State of Generative Capabilities in T2I Models
Abstract
Despite rapid advances, text-to-image (T2I) models still falter at generating anatomically coherent and semantically grounded humans. We introduce HumanBench, a large-scale (35K-image), privacy-friendly benchmark that rigorously evaluates T2I models along four axes: template consistency, spatial reasoning, action understanding, and texture recognition. To quantify alignment, we propose two novel metrics, Agreement and Distinction, which capture both fidelity to the prompt and semantic contrast with counterfactuals and negations. Evaluating six leading models, we uncover persistent failures, including disfigurements, species leakage, texture-object mismatches, and counting errors, especially under compound prompts. A complementary human study shows that image realism and correctness degrade as prompt complexity increases, corroborating our automated assessments. HumanBench offers the first comprehensive audit of human-centric T2I generation, setting a new standard for benchmarking anatomical accuracy, compositional reasoning, and trustworthiness in generative models.