Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
Abstract
Vision-Language Models (VLMs) have demonstrated capabilities in interpreting visual content, but their reliability in safety-critical everyday scenarios remains insufficiently explored. We introduce VERI (Visual Emergency Recognition Dataset), a diagnostic benchmark comprising 200 images organized into 100 contrastive pairs, in which each emergency scene is paired with a visually similar but safe counterpart through human verification and refinement. Using a two-stage evaluation protocol covering risk identification and emergency response, we assess 17 VLMs (from open-source models to commercial APIs) across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall in detecting genuine emergencies (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were universally misclassified by all models, regardless of scale. This "better-safe-than-sorry" bias stems primarily from contextual overinterpretation (88-98% of errors), challenging the reliability of VLMs in safety-critical applications. As VLMs increasingly power real-world applications, from smart home monitoring to autonomous systems, understanding and addressing these systematic biases becomes critical for safe deployment. Our results demonstrate the need for strategies that specifically improve contextual reasoning in ambiguous visual situations.