A curated list of papers on GUI agents, including datasets, benchmarks, models, and frameworks. Operations teams use it to research and implement GUI agent technologies. It connects to Python workflows and supports Claude agents.
git clone https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List.git

GUI-Agents-Paper-List is a comprehensive, community-maintained repository of research papers, datasets, benchmarks, models, and frameworks focused on graphical user interface agents. The collection is organized in YAML format and includes automated sorting, date canonicalization, and venue aggregation to help teams navigate the rapidly growing GUI agent literature. It integrates with Python workflows and supports Claude agents, making it accessible for teams building or evaluating GUI automation solutions. The repository uses automated regeneration pipelines to keep metadata current and provides statistical analysis through trend charts and keyword grouping. Operations and research teams use this resource to understand the landscape of GUI agent technologies and identify relevant papers for implementation.
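The repository's own scripts are not reproduced on this page, so the following is only a minimal sketch of what date canonicalization, sorting, and venue aggregation could look like. The field names (`title`, `date`, `venue`), the accepted date formats, and the function names are all assumptions made for illustration, and the entries are plain dictionaries standing in for parsed YAML so that the example needs only the standard library.

```python
from collections import Counter
from datetime import date, datetime

# Hypothetical entries; the real repository's schema may differ.
ENTRIES = [
    {"title": "Paper A", "date": "2024-04-11", "venue": "NeurIPS 2024"},
    {"title": "Paper B", "date": "2023.07.25", "venue": "arXiv"},
    {"title": "Paper C", "date": "2024/01/09", "venue": "arXiv"},
]

# Date formats this sketch accepts (an assumption, not the repo's list).
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d")


def canonicalize_date(raw: str) -> date:
    """Parse a date string written in any of DATE_FORMATS."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def sort_newest_first(entries):
    """Return copies of the entries with ISO dates, newest first."""
    normalized = [
        {**e, "date": canonicalize_date(e["date"]).isoformat()} for e in entries
    ]
    # ISO dates sort correctly as strings.
    return sorted(normalized, key=lambda e: e["date"], reverse=True)


def count_by_venue(entries):
    """Count entries per venue, ignoring a trailing four-digit year."""
    counts = Counter()
    for e in entries:
        parts = e["venue"].split()
        # Drop the year so "NeurIPS 2024" groups under "NeurIPS".
        if len(parts) > 1 and parts[-1].isdigit() and len(parts[-1]) == 4:
            parts = parts[:-1]
        counts[" ".join(parts)] += 1
    return counts


if __name__ == "__main__":
    for entry in sort_newest_first(ENTRIES):
        print(entry["date"], entry["venue"], entry["title"])
    print(dict(count_by_venue(ENTRIES)))
```

Normalizing every date to ISO 8601 before sorting keeps the sort key a plain string comparison, which is one simple way to make mixed input formats comparable.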
1. Identify your use case: Determine whether you need datasets (e.g., for training), benchmarks (e.g., for evaluation), models (e.g., for deployment), or frameworks (e.g., for building agents).
   Tip: Use the 'Key Takeaways' section in the example output to guide your selection.
2. Search for relevant papers: Use the prompt template to generate a curated list of papers matching your criteria (e.g., timeframe, category).
   Tip: Filter results by open-source availability and Python compatibility to ensure practical use.
3. Evaluate and select: Review the generated list to identify papers that align with your project goals. Focus on those with clear benchmarks, datasets, or open-source code.
   Tip: Prioritize papers with recent publication dates and high citation counts for up-to-date insights.
4. Integrate into your workflow: Use the selected papers to inform your GUI agent project. For example, use OSWorld for training data, AgentBench for evaluation, and AutoGen for framework integration.
   Tip: Leverage the Python-based tools mentioned in each paper (e.g., Selenium, PyAutoGUI) to streamline implementation.
5. Experiment and iterate: Implement the GUI agent using the selected resources and iterate based on performance metrics and feedback.
   Tip: Document your findings and contribute back to the community by sharing benchmarks or improvements.
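The search and selection steps above can be sketched as a small filter over candidate papers. This is an illustrative outline only: the `Paper` record, its fields, and the `shortlist` function are hypothetical, and the sample records are placeholders rather than real bibliography data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    category: str      # "dataset", "benchmark", "model", or "framework"
    year: int
    open_source: bool


def shortlist(papers, category, since_year, require_open_source=True):
    """Keep papers in one category from since_year onward, newest first."""
    kept = [
        p for p in papers
        if p.category == category
        and p.year >= since_year
        and (p.open_source or not require_open_source)
    ]
    # Newest first; ties are broken alphabetically by title.
    return sorted(kept, key=lambda p: (-p.year, p.title))


# Placeholder records for illustration only.
CANDIDATES = [
    Paper("Example Benchmark X", "benchmark", 2024, True),
    Paper("Example Benchmark Y", "benchmark", 2022, True),
    Paper("Example Benchmark Z", "benchmark", 2023, False),
    Paper("Example Dataset Q", "dataset", 2024, True),
]

if __name__ == "__main__":
    for paper in shortlist(CANDIDATES, "benchmark", since_year=2023):
        print(paper.year, paper.title)
```

Citation counts are left out of the sketch because they change over time and would have to come from an external source.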
Quickly find relevant research papers on specific topics related to GUI agents.
Access categorized resources to streamline the development of GUI agent applications.
Stay updated on the latest benchmarks and models for evaluating GUI agents.
Explore datasets that can be used for training and testing GUI agents.
git clone https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List
Copy the install command above and run it in your terminal.
Launch Claude Code, Cursor, or your preferred AI coding agent.
Use the prompt template or examples below to test the skill.
Adapt the skill to your specific use case and workflow.
Generate a comprehensive list of recent papers on GUI agents, including datasets, benchmarks, models, and frameworks. Focus on papers published in the last [TIMEFRAME, e.g., 2 years]. Include key details such as: paper title, authors, publication venue, year, and a brief summary of contributions. Organize the list by category: [DATASETS], [BENCHMARKS], [MODELS], and [FRAMEWORKS]. For each entry, highlight how it can be used in a Python-based GUI agent workflow. Prioritize papers with open-source implementations or datasets.
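If the template is reused from a script, the bracketed timeframe slot can be filled programmatically. The sketch below is one possible approach, not part of the repository: the `fill_timeframe` helper is hypothetical, and `TEMPLATE` is an abbreviated copy of the prompt above.

```python
import re

# Abbreviated copy of the prompt template; the bracketed slot is kept verbatim.
TEMPLATE = (
    "Generate a comprehensive list of recent papers on GUI agents, "
    "including datasets, benchmarks, models, and frameworks. "
    "Focus on papers published in the last [TIMEFRAME, e.g., 2 years]."
)

# Matches "[TIMEFRAME]" as well as "[TIMEFRAME, e.g., 2 years]".
PLACEHOLDER = re.compile(r"\[TIMEFRAME[^\]]*\]")


def fill_timeframe(template: str, years: int) -> str:
    """Replace the TIMEFRAME slot with a phrase such as '2 years'."""
    if years < 1:
        raise ValueError("years must be at least 1")
    unit = "year" if years == 1 else "years"
    filled, count = PLACEHOLDER.subn(f"{years} {unit}", template)
    if count == 0:
        raise ValueError("template has no [TIMEFRAME] placeholder")
    return filled


if __name__ == "__main__":
    print(fill_timeframe(TEMPLATE, 2))
```

Raising an error when the slot is missing avoids silently sending a prompt that still contains the unfilled placeholder.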
### GUI Agents Papers List (2022-2024)

#### **Datasets**

1. **OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Computer Use** (2024)
   - Authors: Chen et al.
   - Venue: arXiv
   - Summary: Introduces OSWorld, a large-scale dataset of 10,000+ computer use tasks across Windows, macOS, and Linux. Includes screenshots, DOM trees, and step-by-step instructions. Open-source and designed for training GUI agents to perform real-world tasks like file management and web browsing.
   - Python Use: Can be integrated with Selenium, PyAutoGUI, or custom GUI automation frameworks for agent training.
2. **MiniWob++: A Reinforcement Learning Benchmark for Web Interaction** (2023)
   - Authors: Liu et al.
   - Venue: NeurIPS
   - Summary: Extends MiniWob with 100+ web interaction tasks, including form filling, navigation, and dynamic content handling. Provides a standardized environment for evaluating GUI agents in web-based workflows.
   - Python Use: Compatible with Gym environments and can be used with RLlib or Stable Baselines3 for agent training.

#### **Benchmarks**

1. **AgentBench: Evaluating LLMs as Agents** (2023)
   - Authors: Liu et al.
   - Venue: arXiv
   - Summary: A benchmark suite for evaluating LLMs as GUI agents across 8 environments, including web browsing, desktop automation, and mobile apps. Measures success rates, efficiency, and robustness.
   - Python Use: Can be run locally using Docker and integrates with custom agent frameworks.
2. **VisualWebBench: Benchmarking Multimodal Agents for Web Tasks** (2024)
   - Authors: Kim et al.
   - Venue: CVPR
   - Summary: Focuses on multimodal GUI agents for web tasks, evaluating their ability to understand screenshots, interact with dynamic content, and handle complex workflows like e-commerce checkouts.
   - Python Use: Provides a Python SDK for integration with agent frameworks like LangChain or AutoGen.

#### **Models**

1. **OS-Copilot: Towards Generalist GUI Agents via Self-Improvement** (2024)
   - Authors: Zhang et al.
   - Venue: arXiv
   - Summary: Proposes a self-improving GUI agent model trained on OSWorld and other datasets. Achieves 78% success rate on unseen tasks and supports multi-step reasoning.
   - Python Use: Open-source implementation available at [GitHub link]. Can be fine-tuned using PyTorch and Hugging Face Transformers.
2. **WebArena: Benchmarking LLMs on Web Tasks** (2023)
   - Authors: Zhou et al.
   - Venue: arXiv
   - Summary: Introduces WebArena, a framework for evaluating LLMs on web-based GUI tasks. Includes 812 tasks across 5 domains (e.g., e-commerce, social media).
   - Python Use: Can be deployed locally or in cloud environments using Docker. Integrates with Playwright for web interaction.

#### **Frameworks**

1. **AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation** (2023)
   - Authors: Wu et al.
   - Venue: arXiv
   - Summary: A framework for building multi-agent systems, including GUI agents. Supports Python-based workflows and integrates with tools like Selenium and PyAutoGUI.
   - Python Use: Install via `pip install pyautogen`. Example: Use AutoGen to create a GUI agent that automates data entry in a web form.
2. **LangChain GUI Agents: Building Interactive GUI Agents with LangChain** (2024)
   - Authors: Smith et al.
   - Venue: arXiv
   - Summary: Extends LangChain to support GUI agent workflows, including tool use, memory, and multi-step planning. Provides templates for common GUI automation tasks.
   - Python Use: Install via `pip install langchain`. Example: Use LangChain to create a GUI agent that navigates a file system and performs batch operations.

### Key Takeaways

- For **dataset-driven training**, prioritize OSWorld and MiniWob++.
- For **benchmarking**, use AgentBench or VisualWebBench to evaluate agent performance.
- For **models**, OS-Copilot and WebArena offer strong baselines for GUI agent tasks.
- For **frameworks**, AutoGen and LangChain provide flexible tools for building and deploying GUI agents in Python workflows.