Understanding human actions and events from video has long been a central challenge in computer vision, driven by the fundamental difficulty of making sense of images over time. This keynote traces the history of action recognition and video understanding through the lens of the field's most persistent obstacles. It begins with early approaches that relied on carefully engineered spatiotemporal features, where the core challenge was how to represent motion, dynamics, and temporal structure in a form suitable for learning. It then turns to the rise of CNNs, which brought the shift from handcrafted features to data-driven representations but also coupled progress to the availability, scale, and diversity of video datasets and the practical limits these imposed on training deep models. The talk concludes with the field's current challenge of aligning video with language, where the problem extends beyond recognition to semantic grounding, multimodal representation learning, and reasoning across visual, temporal, and linguistic abstractions. Throughout, it shows that while video understanding has matured, the challenge of making sense of visual data over time has persisted. By revisiting the history of these challenges, the keynote aims to clarify how past constraints shape today's solutions and to offer perspective on the open problems that will define the next generation of video understanding systems.