See, Record, Do: Automated Generation of UI Workflows from Tutorial Videos
Abstract
System designers and developers need data-driven approaches for user-interface (UI) development and testing. They need trace-based workflows that support UI navigation agents and serve as inputs for UI code generation. Given the high production costs of manually constructing such workflows, the UI agent research community has explored automated, fully-synthetic UI workflow generation. However, there is an open need to characterize the fidelity and effectiveness of these synthetic approaches relative to current manual approaches. In this work, we aim to provide larger-scale synthetic workflows grounded in past human usage. We focus on the desktop application modality, since prior synthetic generation has mostly targeted mobile or web applications. Using video tutorials with permissive licenses, we derive the associated UI behaviors to construct a synthetic dataset of UI workflows in desktop applications. We provide data from these videos to a large language model (LLM) to generate a set of ``task list'' instructions that replicate the videos' actions. We execute the task-list instructions with a UI agent inside an instrumented desktop operating system. Using the sensor data from that environment, we derive simpler actuation scripts that replay the tasks. Along with detailing our approach, we provide a dataset of over 5,000 workflows across a range of desktop applications at roughly 51% to 70% of the cost of manual construction.
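The following is a minimal, illustrative sketch of the pipeline the abstract describes: an LLM turns tutorial-video data into a task list, a UI agent executes that list inside an instrumented environment, and the captured sensor events form a replayable trace. All names here (`TaskStep`, `WorkflowTrace`, `generate_task_list`, `execute_and_record`, and the `llm`, `agent`, and `sensors` objects) are hypothetical placeholders and not the authors' actual implementation.

```python
# Hypothetical sketch of the video -> task list -> executed workflow pipeline.
# None of these classes or functions come from the paper's codebase.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskStep:
    """One natural-language instruction derived from a tutorial video."""
    description: str  # e.g. "Open the File menu and choose Export"


@dataclass
class WorkflowTrace:
    """Sensor data captured while an agent executes a task list."""
    app_name: str
    steps: List[TaskStep]
    ui_events: List[dict] = field(default_factory=list)  # clicks, keystrokes, window state


def generate_task_list(video_transcript: str, llm: Callable[[str], str]) -> List[TaskStep]:
    """Ask an LLM to rewrite a tutorial's narration as a concrete task list.

    `llm` is any callable mapping a prompt string to a response string
    (a stand-in for whatever model API is actually used).
    """
    prompt = (
        "Rewrite the following tutorial narration as a numbered list of "
        "concrete UI actions, one per line:\n" + video_transcript
    )
    response = llm(prompt)
    return [TaskStep(line.strip()) for line in response.splitlines() if line.strip()]


def execute_and_record(steps: List[TaskStep], agent, sensors) -> WorkflowTrace:
    """Run each instruction through a UI agent in an instrumented OS and
    collect the low-level events needed for later replay."""
    trace = WorkflowTrace(app_name=agent.app_name, steps=steps)
    for step in steps:
        agent.perform(step.description)          # agent grounds and actuates the step
        trace.ui_events.extend(sensors.drain())  # accumulate sensor events for replay
    return trace
```

The resulting `WorkflowTrace` stands in for the replayable workflows in the released dataset; in practice the actuation scripts would be derived from the recorded `ui_events` rather than stored alongside them.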