RampWatch: An In-the-Wild Dataset and Text-Guided Detection Framework for Recreational Vessels
Abstract
Detecting small recreational vessels in coastal environments remains a persistent challenge due to complex backgrounds, dynamic lighting conditions, and the scarcity of annotated data for non-commercial maritime traffic. Despite their socio-economic significance, recreational boats are underrepresented in existing datasets and are poorly detected by standard object detectors, particularly in open-vocabulary scenarios. To address this gap, we present RampWatch, an in-the-wild dataset curated from surveillance footage at multiple boat ramps. RampWatch provides instance-level annotations across seven categories of recreational vessels, captured under diverse weather, lighting, and occlusion conditions. To benchmark detection in this domain, we introduce YOLO-TG, a novel detection framework that augments YOLOv11 with a text encoder for open-vocabulary recognition and a self-attention module for enhanced spatial reasoning. YOLO-TG adopts a dual-stream design: visual features are extracted by a hierarchical YOLO backbone, while semantic embeddings of natural language prompts are produced by a frozen language encoder. The two streams are fused via a lightweight cross-modal attention module, enabling text-guided detection without retraining. YOLO-TG achieves a 12% relative improvement in mAP@50–95 over strong YOLOv11 baselines on RampWatch and demonstrates robust cross-domain generalization, with gains of +22% on the Singapore Maritime Dataset and +4.3% on the Split Port Ship Classification Dataset. These results highlight the effectiveness of cross-modal grounding and domain-specific datasets for advancing open-world maritime surveillance.
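To make the fusion step concrete, the following is a minimal sketch of cross-modal attention in which flattened visual feature maps attend to frozen prompt embeddings. It is not the released YOLO-TG implementation; the module name (CrossModalFusion), feature widths, head count, and the seven-prompt setup are illustrative assumptions.

```python
# Minimal sketch of text-guided cross-modal fusion (NOT the authors' code).
# CrossModalFusion, vis_dim, txt_dim, and num_heads are illustrative
# assumptions; the paper's actual module structure is not specified here.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuse YOLO visual features with frozen text embeddings via cross-attention."""

    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, num_heads: int = 4):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)   # align text width to visual width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feats: torch.Tensor, txt_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, C, H, W) from the YOLO backbone
        # txt_embeds: (B, K, txt_dim) from a frozen language encoder (K prompts)
        b, c, h, w = vis_feats.shape
        q = vis_feats.flatten(2).transpose(1, 2)      # (B, H*W, C) visual queries
        kv = self.txt_proj(txt_embeds)                # (B, K, C) text keys/values
        fused, _ = self.attn(q, kv, kv)               # queries attend to prompt embeddings
        out = self.norm(q + fused)                    # residual connection + layer norm
        return out.transpose(1, 2).reshape(b, c, h, w)


# Usage: prompts such as "a small fishing boat" would be encoded once by the
# frozen text encoder; here random tensors stand in for both streams.
vis = torch.randn(2, 256, 20, 20)                     # one backbone feature level
txt = torch.randn(2, 7, 512)                          # 7 vessel-category prompts
print(CrossModalFusion()(vis, txt).shape)             # torch.Size([2, 256, 20, 20])
```

Because the language encoder stays frozen, swapping the prompt list re-targets detection without any gradient updates, which is the property the abstract describes as text-guided detection without retraining.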