
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

Yuhang Liu1, Pengxiang Li2, Zishu Wei1, Congkai Xie3, Xueyu Hu1, Xinchen Xu1, Shengyu Zhang1, Xiaotian Han4, Hongxia Yang5, Fei Wu1
1Zhejiang University, 2Dalian University of Technology, 3Reallm Labs, 4ByteDance Inc, 5The Hong Kong Polytechnic University
[email protected], [email protected], [email protected]

arXiv:2501.04575v1 [cs.AI] 8 Jan 2025

Abstract

Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.

1 Introduction

Graphical User Interface (GUI) Agents have emerged as powerful tools for automating tasks on computing devices, including mobile phones and computers. These agents can understand and interact with GUIs to execute complex operations, significantly enhancing user productivity and expanding the scope of automated task completion (Hu et al., 2024b; Hong et al., 2024; Zhang and Zhang, 2023; Qi et al., 2024; Xie et al., 2024; Vu et al., 2024; Yu et al., 2024; Wen et al., 2023).

Recent developments in multimodal large language models (MLLMs) (Bai et al., 2023b; Li et al., 2024c; Team et al., 2024; Dai et al., 2022) have significantly advanced the potential of GUI Agents. MLLMs possess powerful visual understanding capabilities and can reason over visual information, making them a promising foundation for building sophisticated GUI Agents. These models can interpret complex interface elements and adapt to a wide range of tasks, leading to more efficient and robust automation (Hong et al., 2024; Jiang et al., 2023; You et al., 2025; Nong et al., 2024; Vu et al., 2024).

However, current MLLM-based GUI Agents face several critical challenges. A key limitation lies in their reasoning capabilities (Zhang and Zhang, 2023; Qi et al., 2024; Yu et al., 2024). While many existing GUI Agents can perform basic single-step reasoning, they struggle to effectively leverage information from previous steps. This lack of reflection on past experiences can lead to repetitive errors during task execution.

Another significant challenge is the reliance on additional GUI information. Many prior GUI Agent implementations rely on accessibility trees or Set-of-Marks (Yang et al., 2023a) to represent or augment the GUI's visual information. However, GUIs are inherently visual, and representing them primarily through text can lead to information loss or redundancy. Augmenting visual input with textual descriptions can also increase computational overhead. Furthermore, the availability and consistency of these textual representations vary across platforms, hindering practical deployment.

To address these limitations, we propose InfiGUIAgent, an MLLM-based GUI Agent trained through a two-stage supervised fine-tuning (SFT) method that provides robust fundamental capabilities and native reasoning abilities. In stage 1, we collect data covering multiple tasks, such as vision-language understanding, GUI-specific QA, and tool use, to improve fundamental capabilities such as GUI understanding and instruction grounding. In stage 2, we identify two essential reasoning skills for GUI Agents, (1) hierarchical reasoning and (2) expectation-reflection reasoning, and integrate them into SFT data synthesized by MLLMs from existing trajectories. Our main contributions are threefold:

• We propose a two-stage supervised fine-tuning pipeline to comprehensively improve both the fundamental abilities and the advanced reasoning abilities of GUI Agents.

• We synthesize SFT data with two advanced reasoning skills, hierarchical reasoning and expectation-reflection reasoning, enabling the agents to natively perform complex reasoning.

• We build InfiGUIAgent by supervised fine-tuning a model on our SFT data and conduct experiments on several GUI benchmarks, demonstrating that our model achieves competitive performance.

2 Related Works

2.1 Multimodal LLMs

Large Language Models (LLMs) (Floridi and Chiriatti, 2020; Touvron et al., 2023; Bai et al., 2023a; Xiao et al., 2021) have significantly enhanced the capabilities of AI systems in tackling a wide range of tasks (Hu et al., 2024c; Li et al., 2024d), thanks to their exceptional ability to process complex semantic and contextual information. The remarkable power of LLMs has also inspired exploration into their potential for processing multimodal data, such as images. Typically, the architecture of Multimodal Large Language Models (MLLMs) consists of three main components: a pre-trained large language model, a trained modality encoder, and a modality interface that connects the LLM with the encoded modality features. Various vision encoders, such as ViT (Dosovitskiy et al., 2021), CLIP (Radford et al., 2021), and ConvNeXt (Liu et al., 2022), extract visual features, which are integrated using techniques like adapter networks (Liu et al., 2023), cross-attention layers (Alayrac et al., 2022), and visual expert modules (Wang et al., 2023). These methods have facilitated the development of high-performing MLLMs, such as Qwen-VL (Bai et al., 2023b), GPT-4 Vision (OpenAI, 2023), BLIP-2 (Li et al., 2023), and InfiMM (Liu et al., 2024), thus opening new avenues for LLMs in processing GUI tasks.

2.2 MLLM-based GUI Agents

Agents are AI systems that perceive their environments, make decisions, and take actions to complete specific tasks. LLMs approaching human-level intelligence have greatly enhanced the ability to build agents. For GUI tasks, LLMs that read HTML code to perceive GUIs have been developed (Wen et al., 2023). However, various works have shown that learning to interact with the visual form of GUIs can yield superior performance (Hu et al., 2024b). MLLM-based GUI Agents have therefore been developed. ILuvUI (Jiang et al., 2023) fine-tuned LLaVA to enhance general GUI understanding, while AppAgent (Zhang et al., 2023) explored app usage through autonomous interactions. CogAgent (Hong et al., 2024) integrated high-resolution vision encoders, and Ferret-UI-anyres (You et al., 2025) employed an any-resolution approach. Building upon these works, our study focuses on developing a more lightweight agent with a simplified architecture for GUI tasks, aiming to improve ease of deployment.

3 Method

In this section, we introduce our two-stage supervised fine-tuning strategy for building InfiGUIAgent, as shown in Figure 1. In stage 1, we focus on improving fundamental abilities such as understanding and grounding, particularly considering the complexity of GUIs. In stage 2, we move on to improving the native reasoning abilities of agents for handling complicated GUI tasks.

3.1 Stage 1: Training for Fundamental Abilities

GUIs are complex: they involve diverse data formats such as HTML code and high-resolution interfaces cluttered with small icons and text, and general MLLMs lack fundamental abilities in both understanding GUIs and grounding actions. To address this, we first collected a range of existing vision-language and GUI datasets for supervised fine-tuning in stage 1. We gathered data covering several GUI tasks from multiple sources to ensure a comprehensive improvement of capabilities (see Table 1).
[Figure 1 illustration: for the task "Set an alarm for 7am," the figure walks through one step, showing the Reflection on the prior expectation, the Strategic Layer (Summary: Home screen → App Drawer → Clock interface; Planning: Alarm tab → Create new → Set 7am → Save), the Tactical Layer (Reasoning: need to access the alarm tab from the current clock screen; Grounding: tap the alarm icon in the top-left corner), the Action {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}}, and the Expectation that the alarm tab will open showing the new alarm option.]
Figure 1: InfiGUIAgent is trained in two stages. Stage 1 cultivates fundamental abilities using diverse datasets
covering GUI understanding (element recognition and layout comprehension), question answering, instruction
grounding, general knowledge, and tool usage. Stage 2 introduces native advanced reasoning, employed during both
training and inference. This stage follows a cyclical process at each step, consisting of Reflection, Hierarchical
Reasoning (strategic and tactical layers), Action, and Expectation. Each step receives the overall task, the
history of previous screenshots and reasoning, and the current environment as input. Reflection assesses the
previous action’s outcome against its expectation, while Expectation predicts the outcome of the current action for
subsequent reflection.

The datasets can be categorized into five parts:

• GUI Understanding. Datasets focusing on GUI element recognition, layout comprehension, and semantic interpretation, including Screen2Words (Wang et al., 2021) and Screen Annotation (Baechler et al., 2024).

• Grounding. Datasets capturing various user interaction sequences and operation patterns, including GUIEnv (Chen et al., 2024), RICO Semantic Annotation (Sunkara et al., 2022), SeeClick-Web (Cheng et al., 2024), RICO SCA (Li et al., 2020a), Widget Caption (Li et al., 2020b), UIBert Reference Expression (Bai et al., 2021), and OmniAct-Single Click (Kapoor et al., 2024).

• Question Answering. Datasets containing GUI-specific QA tasks, including GUIChat (Chen et al., 2024), ScreenQA (Hsiao et al., 2022), and Complex QA (Yin et al., 2023).

• General Knowledge. Multimodal datasets that maintain the model's general capabilities, including LLaVA-OneVision (Li et al., 2024a) and PixMo (MDeitke et al., 2024).

• Tool Usage. Datasets covering general tool use, including Glaive-function-calling (Glaive AI, 2024).

Due to the diversity of our data sources, we implemented comprehensive format standardization across all datasets. Additionally, we adopted the Reference-Augmented Annotation format (see Section 3.1.2) to enhance the model's ability to ground visual elements in textual descriptions, enabling precise spatial referencing while maintaining natural language flow.

3.1.1 Data Preprocessing and Standardization

Given the diversity of our data sources, we implemented comprehensive preprocessing steps to standardize the data format across all datasets. We normalized the coordinate system by following Wang et al. (2024), mapping all spatial coordinates to a relative scale of [0, 1000]. This standardization facilitates consistent representation of both point and box annotations in JSON format, with points expressed as {"x": x, "y": y} and bounding boxes as {"x1": x1, "y1": y1, "x2": x2, "y2": y2}. In this coordinate system, the origin {"x": 0, "y": 0} is located at the screen's top-left corner, with the x-axis extending rightward and the y-axis downward; the bottom-right corner corresponds to {"x": 1000, "y": 1000} (a small code sketch of this mapping is given at the end of this subsection). To enhance data quality, we implemented two additional preprocessing steps:

• Instruction Enhancement. For datasets with ambiguous instructions, we developed standardized instruction templates to establish a clear correspondence between commands and their expected outcomes.

• Response Refinement. For entries with complex or inconsistent response formats, we utilized Qwen2-VL-72B (Bai et al., 2023b) to reformulate responses while preserving their semantic content. Each reformulation underwent validation to ensure accuracy and consistency.
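To make the coordinate standardization concrete, the following is a minimal sketch (not the authors' released code) of how absolute pixel coordinates could be mapped to the relative [0, 1000] scale and serialized in the JSON point and box formats described above; the helper names are illustrative.

import json

def normalize(value: float, size: int, scale: int = 1000) -> int:
    # Map an absolute pixel coordinate to the relative [0, scale] range.
    return max(0, min(scale, round(value / size * scale)))

def point_annotation(x_px, y_px, width, height):
    # Origin is the top-left corner; x grows rightward, y grows downward.
    return {"x": normalize(x_px, width), "y": normalize(y_px, height)}

def box_annotation(x1_px, y1_px, x2_px, y2_px, width, height):
    return {
        "x1": normalize(x1_px, width), "y1": normalize(y1_px, height),
        "x2": normalize(x2_px, width), "y2": normalize(y2_px, height),
    }

# Example: a 1080x2400 screenshot with a tap target at pixel (540, 1200).
print(json.dumps(point_annotation(540, 1200, 1080, 2400)))  # {"x": 500, "y": 500}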
Table 1: Training datasets used in stage 1 of supervised fine-tuning.

Dataset Platform Category # of Samples


GUI-related Datasets
GUIEnv (Chen et al., 2024) Webpage Grounding 150,000
RICO Semantic Annotation (Sunkara et al., 2022) Mobile Grounding 150,000
SeeClick-Web (Cheng et al., 2024) Webpage Grounding 100,000
RICO SCA (Li et al., 2020a) Mobile Grounding 100,000
Widget Caption (Li et al., 2020b) Mobile Grounding 70,000
GUIChat (Chen et al., 2024) Webpage QA 40,000
ScreenQA (Hsiao et al., 2022) Mobile QA 17,000
UIBert Reference Expression (Bai et al., 2021) Mobile Grounding 16,000
Screen2Words (Wang et al., 2021) Mobile Understanding 12,000
Complex QA (Yin et al., 2023) Mobile QA 11,000
Screen Annotation (Baechler et al., 2024) Mobile Understanding 5,400
OmniAct-Single Click (Kapoor et al., 2024) Webpage & Desktop Grounding 4,800
Non-GUI Datasets
LLaVA-OneVision (Li et al., 2024a) - General 250,000
PixMo (MDeitke et al., 2024) - General 68,800
Glaive-function-calling (Glaive AI, 2024) - Tool Usage 5,000

3.1.2 Reference-Augmented Annotation

To better leverage the spatial information available in our collected datasets and enhance the model's visual-language understanding of GUIs, we implemented a reference-augmented annotation format. This format enables bidirectional referencing between GUI elements and textual responses. Specifically, we adopted the following structured notation:

<ref type="box"
     coords={"x1": x1, "y1": y1, "x2": x2, "y2": y2}
     note="GUI annotation">
  corresponding text
</ref>

The format consists of several key components: the reference type (either "box" for rectangular regions or "point" for specific locations), the coordinate specification (x1, y1, x2, y2 for boxes or x, y for points), an optional annotative note, and the corresponding textual content. To generate training data in this format, we prompted Qwen2-VL-72B (Bai et al., 2023b) to seamlessly integrate GUI spatial information into the original responses, maintaining natural language flow while preserving precise spatial references.
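As an illustration of how such annotations can be consumed downstream, the snippet below is a minimal, hypothetical parser (not part of the paper's pipeline) that extracts the reference type, coordinates, optional note, and annotated text from a response containing <ref> tags.

import json
import re

REF_PATTERN = re.compile(
    r'<ref type="(?P<type>box|point)"\s+coords=(?P<coords>\{.*?\})'
    r'(?:\s+note="(?P<note>[^"]*)")?\s*>(?P<text>.*?)</ref>',
    re.DOTALL,
)

def extract_refs(response: str):
    # Return a list of (type, coords dict, note, referenced text) tuples.
    refs = []
    for m in REF_PATTERN.finditer(response):
        coords = json.loads(m.group("coords"))  # e.g. {"x1": 262, "y1": 766, ...}
        refs.append((m.group("type"), coords, m.group("note"), m.group("text").strip()))
    return refs

answer = ('Tap the <ref type="box" coords={"x1": 262, "y1": 766, "x2": 360, "y2": 814} '
          'note="message app icon">message app icon</ref> at the bottom left.')
print(extract_refs(answer))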
3.2 Stage 2: Training for Native Reasoning

Building upon foundational capabilities such as understanding and grounding, GUI Agents must also master advanced reasoning skills to handle complex tasks effectively. We identify two crucial reasoning skills: (1) hierarchical reasoning, which enables planning and task decomposition, helping agents structure complex tasks into manageable subtasks and execute them efficiently (Huang and Chang, 2023; Zhang et al., 2024b; Huang et al., 2024), and (2) expectation-reflection reasoning, which fosters adaptive self-correction and reflection (Shinn et al., 2023; Yao et al., 2023; Hu et al., 2024a), enabling agents to learn from past actions and improve decision-making consistency. These reasoning skills are integrated into the agents' training datasets so that the agents can reason with them natively, without any extra prompting. To achieve this, we generate SFT data incorporating these reasoning skills from existing trajectory data (see Table 2) and continue fine-tuning the model obtained in stage 1.

3.2.1 Hierarchical Reasoning

Effective execution of GUI tasks demands both overarching strategic planning and meticulous tactical execution. To achieve this, we synthesize trajectory data with hierarchical reasoning organized in two distinct layers (a sketch of the resulting per-step record is given after the list):

• Strategic Layer. The strategic layer is responsible for high-level task decomposition and sub-goal planning. It analyzes the overall task objective and determines the sequence of subtasks needed for completion.

• Tactical Layer. The tactical layer handles the selection and grounding of concrete actions. Based on the strategic layer's planning, the agent selects appropriate GUI operations and adjusts their parameters to match the target.
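Drawing on the fields shown in Figure 1 and the appendix cases (Figures 5-7), one plausible way to represent a single step's hierarchical reasoning record is sketched below; the exact schema is an assumption for illustration, not the paper's released format.

# Hypothetical per-step reasoning record assembled from trajectory data.
step_record = {
    "reflection": "Expected to open the clock app by tapping its icon... (succeeded)",
    "strategic_layer": {
        "summary": "Home screen -> App Drawer -> Clock interface",
        "planning": "Alarm tab -> Create new -> Set 7am -> Save",
    },
    "tactical_layer": {
        "reasoning": "Need to access the alarm tab from the current clock screen",
        "grounding": "Tap the alarm icon in the top-left corner",
    },
    "action": {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}},
    "expectation": "The alarm tab will open, showing the option to create a new alarm.",
}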
3.2.2 Expectation-Reflection Reasoning

To enhance action consistency and foster autonomous self-correction, we incorporate expectation-reflection reasoning into the training datasets.
Table 2: UI action reasoning datasets used in the training process

Dataset Platform # of Samples


GUIAct (Chen et al., 2024) Webpage & Mobile 10,000
AMEX (Chai et al., 2024) Mobile 3,000
Android in the Zoo (Zhang et al., 2024a) Mobile 2,000
Composition: Stage 1-aligned - 30,000

Category Operations
Single-point operations tap, click, hover, select
Two-point operations swipe, select_text
Directional operations scroll
Text input input, point_input
Parameterless operations remember, enter, home, back
State settings set_task_status

Table 3: Categorization of actions in the action space.

This iterative process enhances the agent's ability to adapt and learn from its actions through a structured reflection cycle:

• Reasoning. After reflection (except at the first step), the agent conducts hierarchical reasoning.

• Action. After the reasoning, the agent takes the action.

• Expectation. Following each action, the agent generates expected outcomes, which are verified at the next step.

• Reflection. The agent evaluates whether its actions achieved the expected results and generates a textual summary of the reflection.

3.2.3 Agent-Environment Interface

We formulate GUI interaction as a process in which an agent interacts with a mobile environment. Let s_t ∈ S denote the environment state at step t, where S represents the state space. The agent observes the state through a screenshot observation o_t and performs actions a_t ∈ A, where A is the action space. The environment transitions from s_t to s_{t+1} following s_{t+1} ~ P(·|s_t, a_t), where P represents the transition probability function.

The agent receives a task goal g and maintains access to a history window of size n. At each step t, the agent's input consists of:

• Goal g
• Current observation o_t
• Historical context H_t = {(o_i, r_i, a_i)}_{i=t-n}^{t-1}, where r_i represents the reasoning process

Based on these inputs, the agent generates a reasoning process r_t and predicts an action a_t. The interaction follows a standard protocol using function calls and responses:

Assistant Message:
<tool_call>
{
  "name": "action_name",
  "arguments": {"action_parameters"}
}
</tool_call>

Tool Message:
<tool_response>
{
  "name": "gui_operation",
  "content": {
    "status": "success | failure",
    "current_ui": <image>,
    "current_task": <task_description>
  }
}
</tool_response>
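The following is a minimal sketch, under assumptions of our own (hypothetical env and model helpers, no real device backend), of how the expectation-reflection cycle and the function-call protocol above fit together in an interaction loop with a history window of size n.

import json
from collections import deque

def run_episode(goal, env, model, n=4, max_steps=20):
    # Sketch of the reflection -> reasoning -> action -> expectation cycle.
    history = deque(maxlen=n)      # H_t = {(o_i, r_i, a_i)}_{i=t-n}^{t-1}
    obs = env.reset()              # initial screenshot observation o_0
    expectation = None             # no expectation before the first action
    for _ in range(max_steps):
        # The model reflects on the previous expectation, reasons hierarchically,
        # selects an action, and states an expectation for the next step.
        reasoning, action, expectation = model.step(
            goal=goal,
            observation=obs,
            history=list(history),
            previous_expectation=expectation,
        )
        history.append((obs, reasoning, action))
        tool_call = {"name": action["name"], "arguments": action["arguments"]}
        response = env.execute(json.dumps(tool_call))  # <tool_response>-style payload
        obs = response["content"]["current_ui"]
        if action["name"] == "set_task_status":        # agent marks the task finished
            break
    return history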
3.2.4 Modular Action Space

Given the diverse action spaces across the collected datasets, we categorized and standardized the actions by unifying their names and parameters, merging similar operations where appropriate. The resulting action space A consists of independent, composable operations that can be flexibly combined based on task requirements, as shown in Table 3. This modular design allows for dynamic action space configuration while maintaining a consistent interface across different platforms and scenarios.
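One way to realize such a composable action space is a simple registry keyed by the categories in Table 3, from which a per-task tool list can be assembled; the structure below is a hypothetical sketch, not the paper's implementation.

# Hypothetical registry mirroring the categories in Table 3.
ACTION_SPACE = {
    "single_point": ["tap", "click", "hover", "select"],
    "two_point": ["swipe", "select_text"],
    "directional": ["scroll"],
    "text_input": ["input", "point_input"],
    "parameterless": ["remember", "enter", "home", "back"],
    "state": ["set_task_status"],
}

def build_action_space(categories):
    # Compose a task-specific action list from the selected categories.
    return [name for c in categories for name in ACTION_SPACE[c]]

# A mobile task that only needs taps, scrolling, text entry, and task status updates.
mobile_actions = build_action_space(["single_point", "directional", "text_input", "state"])
print(mobile_actions)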
3.2.5 Reasoning Process Construction

To construct high-quality reasoning data that stimulates the model's native reasoning capabilities, we leverage more capable MLLMs (e.g., Qwen2-VL-72B) to generate structured reasoning processes based on existing interaction trajectories. The construction process involves several key components:
• Screenshot Description. For each observation o_t in the trajectory, we generate a detailed description d_t. This step addresses the limitation that some MLLMs do not handle interleaved image-text input formats well. To establish a clear correspondence between observations (screenshots) and steps, we generate detailed descriptions to replace the screenshots, which facilitates the subsequent reasoning process construction.

• Reflection. Given the previous expectation e_{t-1} and the current observation o_t, we generate a reflection f_t that evaluates the outcome of the previous action.

• Strategic Layer. The strategic reasoning consists of two parts. First, a summary is generated based on the n-step history H_t = {(o_i, r_i, a_i)}_{i=t-n}^{t-1} and the current observation o_t. Then, the planning component is generated with access to the actual action a_t to ensure alignment with the trajectory.

• Tactical Layer. This layer's reasoning is constructed using the generated reflection f_t and the strategic layer output. The actual action a_t from the trajectory is incorporated to ensure that the tactical reasoning leads to the appropriate action selection.

• Expectation. For each state-action pair (s_t, a_t), we generate an expectation e_t based on the current observation o_t, the reasoning process r_t, and the action a_t. Notably, we deliberately avoid using the next state s_{t+1} in this generation process. Although using s_{t+1} would yield perfect expectations and more accurate state-transition modeling, such an approach might impair the agent's ability to handle expectation mismatches during deployment.

While we avoid using s_{t+1} in expectation generation to maintain robustness, we also explore improving state-transition modeling through a parallel next-state prediction task. Using the trajectory data, we construct additional training examples in which the agent learns to predict the next state description d_{t+1} given the current observation o_t and action a_t. This auxiliary task helps the agent learn state-transition dynamics while keeping the expectation generation process independent of future states.
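As a concrete reading of this construction, the sketch below pairs each trajectory step with two training targets: an expectation generated without access to s_{t+1}, and a separate next-state prediction example that does use the following observation. The helper names are assumptions for illustration.

def build_step_examples(trajectory, describe, generate_expectation):
    # trajectory: list of (observation, reasoning, action) tuples for one task.
    expectation_examples, next_state_examples = [], []
    for t, (obs, reasoning, action) in enumerate(trajectory):
        # Expectation target: deliberately constructed WITHOUT the next state.
        expectation_examples.append({
            "input": {"observation": describe(obs), "reasoning": reasoning, "action": action},
            "target": generate_expectation(obs, reasoning, action),
        })
        # Auxiliary next-state prediction target: uses the following observation.
        if t + 1 < len(trajectory):
            next_obs = trajectory[t + 1][0]
            next_state_examples.append({
                "input": {"observation": describe(obs), "action": action},
                "target": describe(next_obs),   # d_{t+1}
            })
    return expectation_examples, next_state_examples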
4 Experiments

4.1 Experimental Setting

4.1.1 Implementation Details

In stage 1, we sample 1M samples in total, as illustrated in Table 1. In stage 2, we synthesized 45K samples based on trajectories from the datasets shown in Table 2. We continue supervised fine-tuning from Qwen2-VL-2B (Bai et al., 2023c), leveraging ZeRO (Rajbhandari et al., 2020) to enable full-parameter fine-tuning of the model across 8 A800 80GB GPUs.

4.1.2 Evaluation Benchmarks

ScreenSpot. ScreenSpot (Cheng et al., 2024) is an evaluation benchmark for GUI grounding, consisting of over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, with annotated element types.

AndroidWorld. AndroidWorld (Rawles et al., 2024) is a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. AndroidWorld uses Set-of-Marks (SoM) (Yang et al., 2023b) to enhance the agent's grounding ability. However, when humans operate smartphones, they do not rely on labeled elements on the screen, and over-reliance on SoM can lead to insufficient pixel-level grounding ability. Therefore, in our experiments, agents respond to the raw image rather than the annotated image.

4.2 Main Results

ScreenSpot. Table 4 reports the results of different models across three platforms (Mobile, Desktop, and Web) and two element types (Text and Icon) on ScreenSpot (Cheng et al., 2024). InfiGUIAgent-2B achieves the highest average accuracy of 76.3%, surpassing several strong baselines such as ShowUI (Lin et al., 2024) (75.1%) and UGround-7B (Gou et al., 2024) (73.3%), even though the latter has far more parameters.

AndroidWorld. Table 5 compares the success rates of InfiGUIAgent with open-source models on AndroidWorld (Rawles et al., 2024). InfiGUIAgent-2B achieves an overall success rate of 0.09, outperforming open-source models of similar size, such as ShowUI-2B (Lin et al., 2024) (0.07), as well as models with many more parameters, such as LLaVA-OV-7B (Li et al., 2024b) (0.00) and Qwen2-VL-72B (Bai et al., 2023b) (0.05).
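For context on how the grounding accuracy in Table 4 is typically computed, the snippet below scores a predicted click point as correct when it falls inside the annotated bounding box of the target element, using the normalized coordinates of Section 3.1.1. This is a common convention for ScreenSpot-style evaluation and is shown as an assumption rather than the paper's exact harness.

def point_in_box(point, box):
    # point: {"x": ..., "y": ...}; box: {"x1": ..., "y1": ..., "x2": ..., "y2": ...}
    return box["x1"] <= point["x"] <= box["x2"] and box["y1"] <= point["y"] <= box["y2"]

def grounding_accuracy(predictions, targets):
    # Fraction of predicted points that land inside their target boxes.
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets) if targets else 0.0

preds = [{"x": 500, "y": 500}, {"x": 10, "y": 10}]
boxes = [{"x1": 450, "y1": 450, "x2": 550, "y2": 550},
         {"x1": 900, "y1": 900, "x2": 1000, "y2": 1000}]
print(grounding_accuracy(preds, boxes))  # 0.5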

Model Mobile-Text Mobile-Icon Desktop-Text Desktop-Icon Web-Text Web-Icon Avg. (accuracy, %)
Proprietary Models
GPT-4o (OpenAI, 2024) 30.5 23.2 20.6 19.4 11.1 7.8 18.8
Gemini-1.5-pro (Team et al., 2024) 76.2 54.1 65.5 39.3 52.2 32.0 53.2
Open-source Models
Qwen2-VL-2B (Wang et al., 2024) 24.2 10.0 1.4 9.3 8.7 2.4 9.3
Qwen2-VL-7B (Wang et al., 2024) 61.3 39.3 52.0 45.0 33.0 21.8 42.9
CogAgent (Hong et al., 2024) 67.0 24.0 74.2 20.0 70.4 28.6 47.4
SeeClick (Cheng et al., 2024) 78.0 52.0 72.2 30.0 55.7 32.5 53.4
UGround-7B (Gou et al., 2024) 82.8 60.3 82.5 63.6 80.4 70.4 73.3
ShowUI-2B (Lin et al., 2024) 92.3 75.5 76.3 61.1 81.7 63.6 75.1
Ours
InfiGUIAgent-2B 88.6 74.7 85.6 65.0 79.1 64.6 76.3

Table 4: Performance of various models on ScreenSpot across platforms (Mobile, Desktop, Web). All experiments were conducted using raw screenshot information. Results marked in bold represent the best performance, and those underlined indicate the second-best performance.

Model Easy Middle Hard Overall
Qwen2-VL-2B (Wang et al., 2024) 0.00 0.00 0.00 0.00
Qwen2-VL-7B (Wang et al., 2024) 0.00 0.00 0.05 0.05
Qwen2-VL-72B (Wang et al., 2024) 0.08 0.00 0.05 0.05
LLaVa-OV-7B (Li et al., 2024b) 0.00 0.00 0.00 0.00
ShowUI-2B (Lin et al., 2024) 0.18 0.00 0.00 0.07
Ours
InfiGUIAgent-2B 0.25 0.00 0.00 0.09

Table 5: Success rates on AndroidWorld.

5 Conclusion

In this work, we propose InfiGUIAgent, a novel MLLM-based GUI Agent. By constructing comprehensive training datasets for two-stage supervised fine-tuning, we enhance the model's ability to understand, reason, and interact with GUIs. Our evaluation, conducted using raw screenshots without relying on additional GUI metadata, demonstrates the model's applicability to real-world scenarios. Experimental results show that our model performs well on GUI tasks and surpasses several open-source baselines.

References

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. 2024. ScreenAI: A vision-language model for UI and infographics understanding. arXiv preprint arXiv:2402.04615.

Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. 2021. UIBert: Learning generic multimodal representations for UI understanding. arXiv preprint arXiv:2107.13731.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K. Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Yu Bowen, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xing Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-VL: A frontier large vision-language model with versatile abilities. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023c. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. 2024. AMEX: Android multi-annotation expo dataset for mobile GUI agents. arXiv preprint arXiv:2407.17490.
Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. GUICourse: From general vision language models to versatile GUI agents. arXiv preprint arXiv:2406.11317.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935.

Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. 2022. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. Preprint, arXiv:2205.06126.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations (ICLR 2021). OpenReview.net.

Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694.

Glaive AI. 2024. Glaive function calling dataset. https://huggingface.co/datasets/glaiveai/glaive-function-calling. Accessed: 2024-01-08.

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. CogAgent: A visual language model for GUI agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.

Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, and Jindong Chen. 2022. ScreenQA: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.

Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024a. Leveraging print debugging to improve code generation in large language models. Preprint, arXiv:2401.05319.

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, and Fei Wu. 2024b. OS agents: A survey on MLLM-based agents for general computing devices use. Preprints.

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024c. InfiAgent-DABench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. Preprint, arXiv:2212.10403.

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716.

Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. ILuvUI: Instruction-tuned language-vision modeling of UIs from machine conversations. arXiv preprint arXiv:2310.04869.

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakutdinov. 2024. OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03329.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024b. LLaVA-OneVision: Easy visual task transfer. Preprint, arXiv:2408.03326.

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, and Junnan Li. 2024c. Aria: An open multimodal native mixture-of-experts model. Preprint, arXiv:2410.05993.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR.

Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024d. InfiBench: Evaluating the question-answering capabilities of code large language models. arXiv preprint arXiv:2404.07940.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping natural language instructions to mobile UI action sequences. arXiv preprint arXiv:2005.03776.

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2024. ShowUI: One vision-language-action model for generalist GUI agent. In NeurIPS 2024 Workshop on Open-World Agents.

Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. 2024. InfiMM: Advancing multimodal understanding with an open-sourced visual language model. In Annual Meeting of the Association for Computational Linguistics.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986.

Matt MDeitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. 2024. Molmo and PixMo: Open weights and open data for state-of-the-art vision-language models. arXiv preprint arXiv:2409.17146.

Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, and Wenhao Xu. 2024. MobileFlow: A multimodal LLM for mobile GUI agent. arXiv preprint arXiv:2407.04346.

OpenAI. 2023. GPT-4V(ision) system card.

OpenAI. 2024. GPT-4o. Accessed: 2025-01-03.

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, et al. 2024. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. Preprint, arXiv:1910.02054.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.

Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. 2022. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. ArXiv.

Minh Duc Vu, Han Wang, Jieshan Chen, Zhuang Li, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. GPTVoiceTasker: Advancing multi-step mobile task efficiency through dynamic interface exploration and learning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–17.

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2Words: Automatic mobile UI summarization with multimodal learning. arXiv preprint arXiv:2108.03353.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. CogVLM: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. AutoDroid: LLM-powered task automation in Android. arXiv preprint arXiv:2308.15272.

Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for Chinese legal long documents. AI Open, 2:79–84.

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. 2024. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972.

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023a. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. Preprint, arXiv:2310.11441.

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023b. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. Preprint, arXiv:2210.03629.

Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin. 2023. Agent Lumos: Unified and modular training for open-source language agents. arXiv preprint arXiv:2311.05657.

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2025. Ferret-UI: Grounded mobile UI understanding with multimodal LLMs. In European Conference on Computer Vision, pages 240–255. Springer.

Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, and Zhou Yu. 2024. ExACT: Teaching AI agents to explore with reflective-MCTS and exploratory learning. arXiv preprint arXiv:2410.02052.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. AppAgent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024a. Android in the Zoo: Chain-of-action-thought for GUI agents. arXiv preprint arXiv:2403.02713.

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. 2024b. LLM as a mastermind: A survey of strategic reasoning with large language models. Preprint, arXiv:2404.01230.

Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.

A Cases
A.1 Stage 1: Fundamental Abilities
We demonstrate the fundamental abilities trained
in Stage 1 through three cases: GUI Understanding
(Figure 2), Grounding (Figure 3), and Question
Answering (Figure 4).

A.2 Stage 2: Native Reasoning

We provide two representative cases to demonstrate the reasoning and interaction process of InfiGUIAgent.
Reply to a Message Figure 5 illustrates a step
where the agent needs to reply to a specific message
in a messaging application. The reasoning process
involves identifying the "Start chat" button and
grounding the action to initiate the reply process.
Creating a New Contact Figure 6 and Figure 7
demonstrate sequential steps for creating a new con-
tact. In the first step (Step K), the agent navigates
to the "Contacts" section by reasoning and ground-
ing the action to the corresponding tab. In the
following step (Step K+1), the agent initiates the
contact creation process by identifying and tapping
the "Create new contact" button. These sequential
steps highlight the agent’s hierarchical reasoning
and grounding abilities.

Instruction: Describe the UI in detail. You should
reference components and locations in the UI.

Answer: This screenshot shows the settings menu of an Android device, displaying various configuration options.
The top of the screen features a navigation bar with a back
arrow, indicating the ability to return to the previous screen.
The main content area lists several settings categories,
each with a brief description and icon for easy
identification.

Key UI components and elements present on the screen include:
- A <ref type="box" coords={"x1": 0, "y1": 31, "x2": 131,
"y2": 110} note="return to the previous screen">back
arrow</ref> at the top left corner, allowing the user to
navigate back to the previous screen.
- The <ref type="box" coords={"x1": 175, "y1": 54, "x2":
404, "y2": 98} note="title of the current
screen">Settings</ref> title at the top center, indicating the
current page.

Figure 2: Case of GUI Understanding.

Instruction: Where can I send a message to someone? You should reference components and locations in the UI.

Answer: To send a message, you need to access the messaging app. The key components on the home screen
are the app icons. Specifically, you should look for the <ref
type="box" coords={"x1": 262, "y1": 766, "x2": 360, "y2":
814} note="message app icon">message app icon</ref>
located at the bottom left of the screen. This icon is
typically used to open the messaging app. Once you tap on
this icon, you will be taken to the messaging app where
you can send a message.

Figure 3: Case of Grounding.

Instruction: What day is it tomorrow? You should
reference components and locations in the UI.

Answer: To determine the day of the week, we need to look at the date displayed at the top of the screen. The date
is shown as <ref type="box" coords={"x1": 248, "y1": 94,
"x2": 422, "y2": 119} note="day of the
week">Sunday</ref>. This indicates that today is Sunday.
Therefore, tomorrow will be Monday.

Figure 4: Case of Question Answering.

Reflection: The expectation from the previous step was to open the Simple SMS Messenger app by tapping on its icon. The current screen content shows the Simple SMS Messenger app's ...

[Strategic Layer]
Summary: The overall task objective is to send the message "Yoga class every Tuesday and Thursday at 6 PM" to +18407874306 ...
Planning: To achieve the final goal, the plan is to: first, initiate a new chat; second, enter the recipient's phone number (+18407874306); third, compose the message ...

[Tactical Layer]
Reasoning: The core component on the current screen is the "Start chat" button. The key point of the task objective in the current step is to initiate a new chat. Therefore, I need to tap on the "Start chat" button.
Grounding: Based on the reasoning, the current action should be to tap on the "Start chat" button. The button is located at the bottom right of the screen, a prominent blue circle with a white chat bubble icon. Therefore, I should tap on the "Start chat" button.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 824, "y": 886}}}
</tool_call>

Expectation: The purpose of this action is to open the new chat interface. This action will likely result in a new screen appearing ...

Figure 5: Case of Native Advanced Reasoning. The agent's goal is to reply to a message.

Step K

Reflection: The expectation from the previous step was to open the Contacts app by tapping on its icon. However, the current screen content shows the Phone app with the "Recents" tab selected ...

[Strategic Layer]
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous step, I attempted to open the Contacts app ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I first need to access the Contacts section within the Phone app. Then I need to ...

[Tactical Layer]
Reasoning: The core components on the current screen are the three tabs at the bottom: "Favorites", "Recents", and "Contacts". The key point of the task objective in the current step is to navigate to the Contacts section. Therefore, I need to tap on the "Contacts" tab.
Grounding: Based on the reasoning, the current action should be to tap on the "Contacts" tab. The "Contacts" tab is located at the bottom right of the screen. Therefore, I should tap on the "Contacts" tab.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 853, "y": 924}}}
</tool_call>

Expectation: The purpose of this action is to navigate to the Contacts section of the Phone app ...

Figure 6: Case of Native Advanced Reasoning (Step K). The agent's goal is to create a new contact.

Step K + 1

Reflection: The expectation from the previous step was to navigate to the Contacts section by tapping the "Contacts" tab. The current screen content shows ... indicating that the previous action was successful.

[Strategic Layer]
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous steps, I opened the Phone app and ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I need to first initiate the new contact creation process. Then, I need to enter Hugo's details ...

[Tactical Layer]
Reasoning: The core component on the current screen is the "Create new contact" button. The key point of the task objective in the current step is to initiate the new contact creation process. Therefore, I need to tap on the "Create new contact" button.
Grounding: Based on the reasoning, the current action should be to tap on the "Create new contact" button. The button is located in the center of the screen ...

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 492, "y": 575}}}
</tool_call>

Expectation: The purpose of this action is to initiate the process of creating a new contact ...

Figure 7: Case of Native Advanced Reasoning (Step K+1). The agent's goal is to create a new contact.

