InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Abstract

Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.

1 Introduction

Graphical User Interface (GUI) Agents have emerged as powerful tools for automating tasks on computing devices, including mobile phones and computers. These agents can understand and interact with GUIs to execute complex operations, significantly enhancing user productivity and expanding the scope of automated task completion (Hu et al., 2024b; Hong et al., 2024; Zhang and Zhang, 2023; Qi et al., 2024; Xie et al., 2024; Vu et al., 2024; Yu et al., 2024; Wen et al., 2023).

Recent developments in multimodal large language models (MLLMs) (Bai et al., 2023b; Li et al., 2024c; Team et al., 2024; Dai et al., 2022) have significantly advanced the potential of GUI Agents. MLLMs possess powerful visual understanding capabilities and can reason based on visual information, making them a promising foundation for building sophisticated GUI Agents.

However, current MLLM-based GUI Agents face several critical challenges. A key limitation lies in their reasoning capabilities (Zhang and Zhang, 2023; Qi et al., 2024; Yu et al., 2024). While many existing GUI Agents can perform basic single-step reasoning, they struggle to effectively leverage information from previous steps. This lack of reflection on past experiences can lead to repetitive errors during task execution.

Another significant challenge lies in the reliance on additional textual information about GUIs. Many prior GUI Agent implementations rely on accessibility trees or Set-of-Marks (Yang et al., 2023a) to represent or augment the GUI's visual information. However, GUIs are inherently visual, and representing them primarily through text can lead to information loss or redundancy. Augmenting visual input with textual descriptions can also increase computational overhead. Furthermore, the availability and consistency of these textual representations vary across platforms, hindering practical deployment.

To address these limitations, we propose InfiGUIAgent, an MLLM-based GUI Agent trained through a two-stage supervised fine-tuning (SFT) pipeline that equips it with robust fundamental capabilities and native reasoning abilities. In stage 1, we collect data covering multiple tasks, such as vision-language understanding, GUI-specific QA, and tool use, to improve fundamental capabilities such as GUI understanding and instruction grounding. In stage 2, we identify two essential reasoning skills for GUI Agents, (1) hierarchical reasoning and (2) expectation-reflection reasoning, and integrate these skills into SFT data synthesized by MLLMs from existing trajectories. Our main contributions are threefold:
• We propose a two-stage supervised fine-tuning pipeline to comprehensively improve both the fundamental abilities and the advanced reasoning abilities of GUI Agents.

• We synthesize SFT data with two advanced reasoning skills, hierarchical reasoning and expectation-reflection reasoning, enabling the agents to natively perform complex reasoning.

• We build InfiGUIAgent by supervised fine-tuning a model on our SFT data and conduct experiments on several GUI benchmarks, demonstrating that our model achieves competitive performance.

2 Related Works

Early GUI Agents built on large language models that use textual information such as code to perceive GUIs have been developed (Wen et al., 2023). However, various works have shown that learning to interact with the visual form of GUIs can yield superior performance (Hu et al., 2024b). Therefore, MLLM-based GUI Agents have been developed. ILuvUI (Jiang et al., 2023) fine-tuned LLaVA to enhance general GUI understanding, while AppAgent (Zhang et al., 2023) explored app usage through autonomous interactions. CogAgent (Hong et al., 2024) integrated high-resolution vision encoders, and Ferret-UI-anyres (You et al., 2025) employed an any-resolution approach. Building upon these works, our study focuses on developing a more lightweight agent with a simplified architecture for GUI tasks, aiming to improve ease of deployment.
Figure 1: InfiGUIAgent is trained in two stages. Stage 1 cultivates fundamental abilities using diverse datasets
covering GUI understanding (element recognition and layout comprehension), question answering, instruction
grounding, general knowledge, and tool usage. Stage 2 introduces native advanced reasoning, employed during both
training and inference. This stage follows a cyclical process at each step, consisting of Reflection, Hierarchical
Reasoning (strategic and tactical layers), Action, and Expectation. Each step receives the overall task, the
history of previous screenshots and reasoning, and the current environment as input. Reflection assesses the
previous action’s outcome against its expectation, while Expectation predicts the outcome of the current action for
subsequent reflection.
2020a), Widget Caption (Li et al., 2020b), UIBert Reference Expression (Bai et al., 2021), and OmniAct-Single Click (Kapoor et al., 2024).

• Question Answering. Datasets containing GUI-specific QA tasks, including GUIChat (Chen et al., 2024), ScreenQA (Hsiao et al., 2022), and Complex QA (Yin et al., 2023).

• General Knowledge. Multimodal datasets that maintain the model's general capabilities, including LLaVA-OneVision (Li et al., 2024a) and PixMo (Deitke et al., 2024).

• Tool Usage. Datasets covering general tool use, including Glaive-function-calling (Glaive AI, 2024).

Due to the diversity of our data sources, we implemented comprehensive format standardization across all datasets. Additionally, we adopted the Reference-Augmented Annotation format (see Section 3.1.2) to enhance the model's ability to ground visual elements with textual descriptions, enabling precise spatial referencing while maintaining natural language flow.

3.1.1 Data Preprocessing and Standardization

Given the diversity of our data sources, we implemented comprehensive preprocessing steps to standardize the data format across all datasets. We normalized the coordinate system following Wang et al. (2024), mapping all spatial coordinates to a relative scale of [0, 1000]. This standardization facilitates consistent representation of both point and box annotations in JSON format, with points expressed as {"x": x, "y": y} and bounding boxes as {"x1": x1, "y1": y1, "x2": x2, "y2": y2}. In this coordinate system, the origin {"x": 0, "y": 0} is located at the screen's top-left corner, with the x-axis extending rightward and the y-axis downward. The bottom-right corner corresponds to coordinates {"x": 1000, "y": 1000}. To enhance data quality, we implemented two additional preprocessing steps:

• Instruction Enhancement. For datasets with ambiguous instructions, we developed standardized instruction templates to establish clear correspondence between commands and their expected outcomes.

• Response Refinement. For entries with complex or inconsistent response formats, we utilized Qwen2-VL-72B (Bai et al., 2023b) to reformulate responses while preserving their semantic content. Each reformulation underwent validation to ensure accuracy and consistency.
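To illustrate, the mapping from absolute pixel coordinates to this relative scale can be written as a small helper. The Python sketch below is our own illustration of the linear mapping described above; the function names and the example screen resolution are assumptions, not part of any released code.

def normalize_point(x_px, y_px, width, height):
    """Map absolute pixel coordinates to the relative [0, 1000] scale."""
    return {"x": round(x_px / width * 1000), "y": round(y_px / height * 1000)}

def normalize_box(x1, y1, x2, y2, width, height):
    """Map a pixel-space bounding box to the relative [0, 1000] scale."""
    top_left = normalize_point(x1, y1, width, height)
    bottom_right = normalize_point(x2, y2, width, height)
    return {"x1": top_left["x"], "y1": top_left["y"],
            "x2": bottom_right["x"], "y2": bottom_right["y"]}

# Example on an assumed 1080x2400 screenshot: the screen centre maps to
# {"x": 500, "y": 500}, and the bottom-right corner maps to {"x": 1000, "y": 1000}.
print(normalize_point(540, 1200, 1080, 2400))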
Table 1: Training datasets used in stage 1 of supervised fine-tuning.

3.1.2 Reference-Augmented Annotation

To better leverage the spatial information available in our collected datasets and enhance the model's visual-language understanding of GUIs, we implemented a reference-augmented annotation format.
This format enables bidirectional referencing between GUI elements and textual responses. Specifically, we adopted the following structured notation:

<ref type="box"
     coords={"x1": x1, "y1": y1, "x2": x2, "y2": y2}
     note="GUI annotation">
corresponding text
</ref>

The format consists of several key components: the reference type (either "box" for rectangular regions or "point" for specific locations), coordinate specifications (x1, y1, x2, y2 for boxes or x, y for points), optional annotative notes, and the corresponding textual content. To generate training data in this format, we prompted Qwen2-VL-72B (Bai et al., 2023b) to seamlessly integrate GUI spatial information with original responses, maintaining natural language flow while preserving precise spatial references.
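For illustration, such annotations can also be generated programmatically. The following minimal Python sketch is our own (the helper name and example values are assumptions, not the paper's released code); it renders a normalized box and its associated text in the notation above.

def ref_annotation(text, coords, ref_type="box", note="GUI annotation"):
    """Wrap a span of response text in the reference-augmented annotation notation."""
    coord_str = ", ".join(f'"{k}": {v}' for k, v in coords.items())
    return f'<ref type="{ref_type}" coords={{{coord_str}}} note="{note}">{text}</ref>'

# Example: annotate the element a response refers to (coordinates on the [0, 1000] scale).
box = {"x1": 40, "y1": 60, "x2": 180, "y2": 120}
print("Tap " + ref_annotation("the alarm icon", box) + " to open the alarm tab.")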
3.2 Stage 2: Training for Native Reasoning

Building upon foundational capabilities such as understanding and grounding, GUI Agents must also master advanced reasoning skills to handle complex tasks effectively. We identify two crucial reasoning skills: (1) hierarchical reasoning, which enables planning and task decomposition, helping agents structure complex tasks into manageable subtasks and execute them efficiently (Huang and Chang, 2023; Zhang et al., 2024b; Huang et al., 2024), and (2) expectation-reflection reasoning, which fosters adaptive self-correction and reflection (Shinn et al., 2023; Yao et al., 2023; Hu et al., 2024a), enabling agents to learn from past actions and improve decision-making consistency. These reasoning skills are integrated into the agents' training datasets so that they can reason with these skills natively, without any extra prompting. To achieve this, we generate SFT data incorporating these reasoning skills based on existing trajectory data (see Table 2) and continue fine-tuning the model from stage 1.

3.2.1 Hierarchical Reasoning

Effective execution of GUI tasks demands both overarching strategic planning and meticulous tactical execution. To achieve this, we synthesize trajectory data with hierarchical reasoning that comprises two distinct layers (a short sketch follows the list below):

• Strategic Layer. The strategic layer is responsible for high-level task decomposition and sub-goal planning. It analyzes the overall task objective and determines the sequence of subtasks needed for completion.

• Tactical Layer. The tactical layer handles the selection and grounding of concrete actions. Based on the strategic layer's planning, the agent selects appropriate GUI operations and adjusts their parameters to match the target.
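To make the two layers concrete, one synthesized reasoning step can be pictured as a nested record. The structure below is a hypothetical Python sketch (the field names are ours, not the exact schema of our SFT data); the example strings follow the alarm-setting case shown in Figure 1.

# Hypothetical layout of one hierarchical-reasoning step (illustrative only).
step = {
    "strategic": {
        "summary": "Home screen -> App Drawer -> Clock interface",
        "planning": "Alarm tab -> Create new -> Set 7am -> Save",
    },
    "tactical": {
        "reasoning": "Need to access the alarm tab from the current clock screen",
        "grounding": "Tap the alarm icon in the top-left corner",
    },
    "action": {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}},
}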
Table 2: UI action reasoning datasets used in the training process.

3.2.2 Expectation-Reflection Reasoning

To enhance action consistency and foster autonomous self-correction, we incorporate expectation-reflection reasoning into the training datasets. This iterative process enhances the agent's ability to adapt and learn from its actions through a structured reflection cycle (a short sketch follows the tool message listing below):

• Reasoning. After reflection (which is skipped at the first step), the agent conducts hierarchical reasoning.

• Action. After the reasoning, the agent takes the action.

• Expectation. Following each action, the agent generates expected outcomes, which are used for reflection at the subsequent step.

Tool Message:
<tool_response>
{
  "name": "gui_operation",
  "content": {
    "status": "success | failure",
    "current_ui": <image>,
    "current_task": <task_description>
  }
}
</tool_response>
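At inference time, this cycle can be read as a simple per-step loop in which each step consumes the previous expectation and emits a new one. The Python sketch below is our own illustration; the agent and env objects and their methods (reflect, reason, act, expect, screenshot, execute) are hypothetical placeholders, not a released API.

def run_episode(agent, env, task, max_steps=20):
    """Illustrative sketch of the expectation-reflection cycle at inference time."""
    expectation, history = None, []
    for _ in range(max_steps):
        observation = env.screenshot()
        # Reflection: assess the previous action's outcome against its expectation.
        reflection = agent.reflect(expectation, observation) if expectation else None
        # Hierarchical reasoning: strategic summary/planning, then tactical grounding.
        reasoning = agent.reason(task, history, observation, reflection)
        # Action, e.g. {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}}
        action = agent.act(reasoning)
        env.execute(action)
        # Expectation: predict the action's outcome for the next step's reflection.
        expectation = agent.expect(observation, reasoning, action)
        history.append((observation, reasoning, action))
    return history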
Many current MLLMs do not handle interleaved image-text input formats well. To establish clear correspondence between observations (screenshots) and steps, we generate detailed descriptions to replace the screenshots, which helps facilitate the subsequent reasoning process construction.

• Reflection. Given the previous expectation e_{t-1} and the current observation o_t, we generate a reflection f_t that evaluates the outcome of the previous action.

• Strategic Layer. The strategic reasoning consists of two parts: first, a summary is generated based on the n-step history H_t = {(o_i, r_i, a_i)}_{i=t-n}^{t-1} and the current observation o_t; then, the planning component is generated with access to the actual action a_t to ensure alignment with the trajectory.

• Tactical Layer. This layer's reasoning is constructed using the generated reflection f_t and the strategic layer's output. The actual action a_t from the trajectory is incorporated to ensure that the tactical reasoning leads to appropriate action selection.

• Expectation. For each state-action pair (s_t, a_t), we generate an expectation e_t based on the current observation o_t, the reasoning process r_t, and the action a_t. Notably, we deliberately avoid using the next state s_{t+1} in this generation process. Although using s_{t+1} could yield perfect expectations and improve the agent's accuracy in modeling state transitions, such an approach might impair the agent's ability to handle expectation mismatches during deployment.

While we avoid using s_{t+1} in expectation generation to maintain robustness, we also explore the possibility of improving state transition modeling through a parallel next-state prediction task. Using the trajectory data, we construct additional training examples in which the agent learns to predict the next state description d_{t+1} given the current observation o_t and action a_t. This auxiliary task helps the agent learn state transition dynamics while keeping the expectation generation process independent of future states.
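As a concrete illustration, one such auxiliary example can be assembled from two adjacent trajectory steps. The sketch below uses our own hypothetical field names rather than the exact schema of our training data.

def make_next_state_example(step_t, step_t_plus_1):
    """Build one auxiliary next-state prediction example from adjacent trajectory steps."""
    return {
        "input": {
            "state_description": step_t["screen_description"],  # d_t, the description replacing screenshot o_t
            "action": step_t["action"],                          # a_t
        },
        "target": step_t_plus_1["screen_description"],           # d_{t+1}, the prediction target
    }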
4 Experiments

4.1 Experimental Setting

4.1.1 Implementation Details

In stage 1, we sample 1M samples in total, as illustrated in Table 1. In stage 2, we synthesize 45K samples based on trajectories from the datasets shown in Table 2. We perform continual supervised fine-tuning of Qwen2-VL-2B (Bai et al., 2023c), leveraging ZeRO (Rajbhandari et al., 2020) to enable full-parameter fine-tuning of the model across 8 A800 80GB GPUs.

4.1.2 Evaluation Benchmarks

ScreenSpot. ScreenSpot (Cheng et al., 2024) is an evaluation benchmark for GUI grounding, consisting of over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, with annotated element types.

AndroidWorld. AndroidWorld (Rawles et al., 2024) is a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. We find that AndroidWorld uses Set-of-Marks (SoM) (Yang et al., 2023b) to enhance the agent's grounding ability. However, when humans operate smartphones, they do not label elements on the screen, and over-reliance on SoM can lead to insufficient focus on pixel-level grounding ability. Therefore, in our experiments, agents respond to the raw image rather than the annotated image.
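For context, GUI grounding on ScreenSpot is commonly scored by whether the predicted click point falls inside the target element's bounding box. The helper below is our own sketch of that criterion, assuming points and boxes expressed in the normalized coordinate format described earlier.

def grounding_correct(pred_point, gt_box):
    """True if the predicted point lies inside the ground-truth element box."""
    return (gt_box["x1"] <= pred_point["x"] <= gt_box["x2"]
            and gt_box["y1"] <= pred_point["y"] <= gt_box["y2"])

def grounding_accuracy(predictions, gt_boxes):
    """Fraction of instructions whose predicted point hits the annotated element."""
    hits = sum(grounding_correct(p, b) for p, b in zip(predictions, gt_boxes))
    return hits / len(gt_boxes)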
4.2 Main Results

ScreenSpot. Table 4 reports the results of different models across three platforms (Mobile, Desktop, and Web) and two element types (Text and Icon) on ScreenSpot (Cheng et al., 2024). InfiGUIAgent-2B achieves the highest average accuracy of 76.3%, surpassing several strong baselines such as ShowUI (Lin et al., 2024) (75.1%) and UGround-7B (Gou et al., 2024) (73.3%), even though the latter has a much larger parameter count.

AndroidWorld. Table 5 compares the success rates of InfiGUIAgent with open-source models on AndroidWorld (Rawles et al., 2024). InfiGUIAgent-2B achieves an overall success rate of 0.09, outperforming open-source models of similar size, such as ShowUI-2B (Lin et al., 2024) (0.07), as well as models with far more parameters, such as LLaVA-OV-7B (Li et al., 2024b) (0.00) and Qwen2-VL-72B (Bai et al., 2023b) (0.05).
Accuracy (%)

Model                              | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg.
Proprietary Models
GPT-4o (OpenAI, 2024)              | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 |  7.8 | 18.8
Gemini-1.5-pro (Team et al., 2024) | 76.2 | 54.1 | 65.5 | 39.3 | 52.2 | 32.0 | 53.2
Open-source Models
Qwen2-VL-2B (Wang et al., 2024)    | 24.2 | 10.0 |  1.4 |  9.3 |  8.7 |  2.4 |  9.3
Qwen2-VL-7B (Wang et al., 2024)    | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9
CogAgent (Hong et al., 2024)       | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4
SeeClick (Cheng et al., 2024)      | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4
UGround-7B (Gou et al., 2024)      | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3
ShowUI-2B (Lin et al., 2024)       | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1
Ours
InfiGUIAgent-2B                    | 88.6 | 74.7 | 85.6 | 65.0 | 79.1 | 64.6 | 76.3
Table 4: Performance across platforms (Mobile, Desktop, Web) on ScreenSpot. All experiments were conducted using raw screenshot information. Results marked in bold represent the best performance, and those underlined indicate the second-best performance.
5 Conclusion

In this work, we propose InfiGUIAgent, a novel MLLM-based GUI Agent. By constructing comprehensive training datasets with two-stage supervised fine-tuning, we enhance the model's ability to understand, reason, and interact with GUIs. Our evaluation, conducted using raw screenshots without relying on additional GUI metadata, demonstrates the model's applicability to real-world scenarios. Experimental results show that our model performs well on GUI tasks and surpasses several open-source baselines.

References

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, et al. 2024. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K. Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Yu Bowen, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xing Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023c. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. 2024. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490.
Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935.

Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. 2022. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. Preprint, arXiv:2205.06126.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Luciano Floridi and Massimo Chiriatti. 2020. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694.

Glaive AI. 2024. Glaive function calling dataset. https://huggingface.co/datasets/glaiveai/glaive-function-calling. Accessed: 2024-01-08.

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.

Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, and Jindong Chen. 2022. Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.

Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024a. Leveraging print debugging to improve code generation in large language models. Preprint, arXiv:2401.05319.

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, and Fei Wu. 2024b. Os agents: A survey on mllm-based agents for general computing devices use. Preprints.

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024c. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. Preprint, arXiv:2212.10403.

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716.

Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. Iluvui: Instruction-tuned language-vision modeling of uis from machine conversations. arXiv preprint arXiv:2310.04869.

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakutdinov. 2024. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03329.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024b. Llava-onevision: Easy visual task transfer. Preprint, arXiv:2408.03326.

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, and Junnan Li. 2024c. Aria: An open multimodal native mixture-of-experts model. Preprint, arXiv:2410.05993.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR.

Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024d. Infibench: Evaluating the question-answering capabilities of code large language models. arXiv preprint arXiv:2404.07940.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2024. Showui: One vision-language-action model for generalist gui agent. In NeurIPS 2024 Workshop on Open-World Agents.

Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. 2024. Infimm: Advancing multimodal understanding with an open-sourced visual language model. In Annual Meeting of the Association for Computational Linguistics.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986.

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi. 2024. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. arXiv preprint arXiv:2409.17146.

Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, and Wenhao Xu. 2024. Mobileflow: A multimodal llm for mobile gui agent. arXiv preprint arXiv:2407.04346.

OpenAI. 2023. Gpt-4v(ision) system card.

OpenAI. 2024. Gpt-4o. Accessed: 2025-01-03.

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, et al. 2024. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. Preprint, arXiv:1910.02054.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.

Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. 2022. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. ArXiv.

Minh Duc Vu, Han Wang, Jieshan Chen, Zhuang Li, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. Gptvoicetasker: Advancing multi-step mobile task efficiency through dynamic interface exploration and learning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–17.

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2words: Automatic mobile ui summarization with multimodal learning. arXiv preprint arXiv:2108.03353.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Autodroid: Llm-powered task automation in android. arXiv preprint arXiv:2308.15272.

Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for chinese legal long documents. AI Open, 2:79–84.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024a. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713.

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. 2024b. Llm as a mastermind: A survey of strategic reasoning with large language models. Preprint, arXiv:2404.01230.

Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
A Cases
A.1 Stage 1: Fundamental Abilities
We demonstrate the fundamental abilities trained
in Stage 1 through three cases: GUI Understanding
(Figure 2), Grounding (Figure 3), and Question
Answering (Figure 4).
Instruction: Describe the UI in detail. You should
reference components and locations in the UI.
Instruction: What day is it tomorrow? You should
reference components and locations in the UI.
Reflection: The expectation from the previous step was to open the Simple SMS Messenger app by tapping on its icon. The current screen content shows the Simple SMS Messenger app's ...

Strategic Layer
Summary: The overall task objective is to send the message "Yoga class every Tuesday and Thursday at 6 PM" to +18407874306 ...
Planning: To achieve the final goal, the plan is to: first, initiate a new chat; second, enter the recipient's phone number (+18407874306); third, compose the message ...

Tactical Layer
Reasoning: The core component on the current screen is the "Start chat" button. The key point of the task objective in the current step is to initiate a ...

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 824, "y": 886}}}
</tool_call>

Expectation: The purpose of this action is to open the new chat interface. This action will likely result in a new screen appearing ...

Figure 5: Case of Native Advanced Reasoning. The agent's goal is to reply to a message.
Step K

Reflection: The expectation from the previous step was to open the Contacts app by tapping on its icon. However, the current screen content shows the Phone app with the "Recents" tab selected ...

Strategic Layer
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous step, I attempted to open the Contacts app ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I first need to access the Contacts section within the Phone app. Then I need to ...

Tactical Layer
Reasoning: The core components on the current screen are the three tabs at the bottom: "Favorites", "Recents", and "Contacts". The key point of the task objective in the current step is to navigate to the Contacts section. Therefore, I need to tap on the "Contacts" tab.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 853, "y": 924}}}
</tool_call>

Figure 6: Case of Native Advanced Reasoning. The agent's goal is to create a new contact.
Step K + 1

Reflection: The expectation from the previous step was to navigate to the Contacts section by tapping the "Contacts" tab. The current screen content shows ..., indicating that the previous action was successful.

Strategic Layer
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous steps, I opened the Phone app and ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I need to first initiate the new contact creation process. Then, I need to enter Hugo's details ...

Tactical Layer
Reasoning: The core component on the current screen is the "Create new contact" button. The key point of the task objective in the current step is to initiate the new contact creation process. Therefore, I need to tap on the "Create new contact" button.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 492, "y": 575}}}
</tool_call>

Figure 7: Case of Native Advanced Reasoning. The agent's goal is to create a new contact.