Timestamp Query Transformer for Temporal Action Segmentation
Abstract
This work addresses action segmentation in videos under sparse timestamp supervision, where only a single frame per action segment---referred to as a timestamp---is labeled during training. We propose the Timestamp Query Transformer (TQT) that treats timestamps as learnable class query tokens. While existing approaches rely on iterative, multi-step generation of framewise pseudo-labels, TQT directly predicts temporal segmentation masks by leveraging query-feature cross-attention. This design enables fully end-to-end learning and maximizes the utility of sparse labels from the entire training dataset, rather than relying on only a few local timestamps within each training video as in prior work. Experiments on the GTEA, 50Salads, and Breakfast datasets demonstrate that TQT outperforms SOTA methods by up to 5.8\% in accuracy and 7.7\% in F1@50. The model and code will be released.