InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Abstract

Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.

1 Introduction

Graphical User Interface (GUI) Agents have emerged as powerful tools for automating tasks on computing devices, including mobile phones and computers. These agents can understand and interact with GUIs to execute complex operations, significantly enhancing user productivity and expanding the scope of automated task completion (Hu et al., 2024b; Hong et al., 2024; Zhang and Zhang, 2023; Qi et al., 2024; Xie et al., 2024; Vu et al., 2024; Yu et al., 2024; Wen et al., 2023).

Recent developments in multimodal large language models (MLLMs) (Bai et al., 2023b; Li et al., 2024c; Team et al., 2024; Dai et al., 2022) have significantly advanced the potential of GUI Agents. MLLMs possess powerful visual understanding capabilities and can reason based on visual information, making them a promising foundation for building sophisticated GUI Agents.

However, current MLLM-based GUI Agents face several critical challenges. A key limitation lies in their reasoning capabilities (Zhang and Zhang, 2023; Qi et al., 2024; Yu et al., 2024). While many existing GUI Agents can perform basic single-step reasoning, they struggle to effectively leverage information from previous steps. This lack of reflection on past experiences can lead to repetitive errors during task execution.

Another significant challenge lies in the reliance on additional textual information about GUIs. Many prior GUI Agent implementations rely on accessibility trees or Set-of-Marks (Yang et al., 2023a) to represent or augment the GUI's visual information. However, GUIs are inherently visual, and representing them primarily through text can lead to information loss or redundancy. Augmenting visual input with textual descriptions can also increase computational overhead. Furthermore, the availability and consistency of these textual representations vary across platforms, hindering practical deployment.

To address these limitations, we propose InfiGUIAgent, an MLLM-based GUI Agent trained through a two-stage supervised fine-tuning (SFT) pipeline that equips it with robust fundamental capabilities and native reasoning abilities. In stage 1, we collect data covering multiple tasks, such as vision-language understanding, GUI-specific QA, and tool use, to improve fundamental capabilities such as GUI understanding and instruction grounding. In stage 2, we identify two essential reasoning skills for GUI Agents, (1) hierarchical reasoning and (2) expectation-reflection reasoning, and integrate these skills into SFT data synthesized by MLLMs from existing trajectories. Our main contributions are threefold:
• We propose a two-stage supervised fine-tuning pipeline to comprehensively improve both the fundamental abilities and the advanced reasoning abilities of GUI Agents.

• We synthesize SFT data with two advanced reasoning skills, hierarchical reasoning and expectation-reflection reasoning, enabling the agents to natively perform complex reasoning.

• We build InfiGUIAgent by supervised fine-tuning a model on our SFT data and conduct experiments on several GUI benchmarks, demonstrating that our model achieves competitive performance.

2 Related Works

Early GUI Agents built on large language models that use textual information such as code to perceive GUIs have been developed (Wen et al., 2023). However, various works have shown that learning to interact with the visual form of GUIs can yield superior performance (Hu et al., 2024b). Therefore, MLLM-based GUI Agents have been developed. ILuvUI (Jiang et al., 2023) fine-tuned LLaVA to enhance general GUI understanding, while AppAgent (Zhang et al., 2023) explored app usage through autonomous interactions. CogAgent (Hong et al., 2024) integrated high-resolution vision encoders, and Ferret-UI-anyres (You et al., 2025) employed an any-resolution approach. Building upon these works, our study focuses on developing a more lightweight agent with a simplified architecture for GUI tasks, aiming to improve ease of deployment.
Figure 1: InfiGUIAgent is trained in two stages. Stage 1 cultivates fundamental abilities using diverse datasets
covering GUI understanding (element recognition and layout comprehension), question answering, instruction
grounding, general knowledge, and tool usage. Stage 2 introduces native advanced reasoning, employed during both
training and inference. This stage follows a cyclical process at each step, consisting of Reflection, Hierarchical
Reasoning (strategic and tactical layers), Action, and Expectation. Each step receives the overall task, the
history of previous screenshots and reasoning, and the current environment as input. Reflection assesses the
previous action’s outcome against its expectation, while Expectation predicts the outcome of the current action for
subsequent reflection.
2020a), Widget Caption (Li et al., 2020b), UIBert Reference Expression (Bai et al., 2021), and OmniAct-Single Click (Kapoor et al., 2024).

• Question Answering. Datasets containing GUI-specific QA tasks, including GUIChat (Chen et al., 2024), ScreenQA (Hsiao et al., 2022), and Complex QA (Yin et al., 2023).

• General Knowledge. Multimodal datasets that maintain the model's general capabilities, including LLaVA-OneVision (Li et al., 2024a) and PixMo (Deitke et al., 2024).

• Tool Usage. Datasets covering general tool use, including Glaive-function-calling (Glaive AI, 2024).

Due to the diversity of our data sources, we implemented comprehensive format standardization across all datasets. Additionally, we adopted the Reference-Augmented Annotation format (see Section 3.1.2) to enhance the model's ability to ground visual elements with textual descriptions, enabling precise spatial referencing while maintaining natural language flow.

3.1.1 Data Preprocessing and Standardization

Given the diversity of our data sources, we implemented comprehensive preprocessing steps to standardize the data format across all datasets. We normalized the coordinate system following Wang et al. (2024), mapping all spatial coordinates to a relative scale of [0, 1000]. This standardization facilitates consistent representation of both point and box annotations in JSON format, with points expressed as {"x": x, "y": y} and bounding boxes as {"x1": x1, "y1": y1, "x2": x2, "y2": y2}. In this coordinate system, the origin {"x": 0, "y": 0} is located at the screen's top-left corner, with the x-axis extending rightward and the y-axis downward. The bottom-right corner corresponds to coordinates {"x": 1000, "y": 1000}. To enhance data quality, we implemented two additional preprocessing steps:

• Instruction Enhancement. For datasets with ambiguous instructions, we developed standardized instruction templates to establish clear correspondence between commands and their expected outcomes.

• Response Refinement. For entries with complex or inconsistent response formats, we utilized Qwen2-VL-72B (Bai et al., 2023b) to reformulate responses while preserving their semantic content. Each reformulation underwent validation to ensure accuracy and consistency.
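To illustrate, the mapping from absolute pixel coordinates to this relative scale can be written as a small helper. The Python sketch below is our own illustration of the linear mapping described above; the function names and the example screen resolution are assumptions, not part of any released code.

def normalize_point(x_px, y_px, width, height):
    """Map absolute pixel coordinates to the relative [0, 1000] scale."""
    return {"x": round(x_px / width * 1000), "y": round(y_px / height * 1000)}

def normalize_box(x1, y1, x2, y2, width, height):
    """Map a pixel-space bounding box to the relative [0, 1000] scale."""
    top_left = normalize_point(x1, y1, width, height)
    bottom_right = normalize_point(x2, y2, width, height)
    return {"x1": top_left["x"], "y1": top_left["y"],
            "x2": bottom_right["x"], "y2": bottom_right["y"]}

# Example on an assumed 1080x2400 screenshot: the screen centre maps to
# {"x": 500, "y": 500}, and the bottom-right corner maps to {"x": 1000, "y": 1000}.
print(normalize_point(540, 1200, 1080, 2400))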
Table 1: Training datasets used in stage 1 of supervised fine-tuning.

3.1.2 Reference-Augmented Annotation

To better leverage the spatial information available in our collected datasets and enhance the model's visual-language understanding of GUIs, we implemented a reference-augmented annotation format.
This format enables bidirectional referencing between GUI elements and textual responses. Specifically, we adopted the following structured notation:

<ref type="box"
     coords={"x1": x1, "y1": y1, "x2": x2, "y2": y2}
     note="GUI annotation">
corresponding text
</ref>

The format consists of several key components: the reference type (either "box" for rectangular regions or "point" for specific locations), coordinate specifications (x1, y1, x2, y2 for boxes or x, y for points), optional annotative notes, and the corresponding textual content. To generate training data in this format, we prompted Qwen2-VL-72B (Bai et al., 2023b) to seamlessly integrate GUI spatial information with original responses, maintaining natural language flow while preserving precise spatial references.
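For illustration, such annotations can also be generated programmatically. The following minimal Python sketch is our own (the helper name and example values are assumptions, not the paper's released code); it renders a normalized box and its associated text in the notation above.

def ref_annotation(text, coords, ref_type="box", note="GUI annotation"):
    """Wrap a span of response text in the reference-augmented annotation notation."""
    coord_str = ", ".join(f'"{k}": {v}' for k, v in coords.items())
    return f'<ref type="{ref_type}" coords={{{coord_str}}} note="{note}">{text}</ref>'

# Example: annotate the element a response refers to (coordinates on the [0, 1000] scale).
box = {"x1": 40, "y1": 60, "x2": 180, "y2": 120}
print("Tap " + ref_annotation("the alarm icon", box) + " to open the alarm tab.")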
3.2 Stage 2: Training for Native Reasoning

Building upon foundational capabilities such as understanding and grounding, GUI Agents must also master advanced reasoning skills to handle complex tasks effectively. We identify two crucial reasoning skills: (1) hierarchical reasoning, which enables planning and task decomposition, helping agents structure complex tasks into manageable subtasks and execute them efficiently (Huang and Chang, 2023; Zhang et al., 2024b; Huang et al., 2024), and (2) expectation-reflection reasoning, which fosters adaptive self-correction and reflection (Shinn et al., 2023; Yao et al., 2023; Hu et al., 2024a), enabling agents to learn from past actions and improve decision-making consistency. These reasoning skills are integrated into the agents' training datasets so that they can reason with these skills natively, without any extra prompting. To achieve this, we generate SFT data incorporating these reasoning skills based on existing trajectory data (see Table 2) and continue fine-tuning the model from stage 1.

3.2.1 Hierarchical Reasoning

Effective execution of GUI tasks demands both overarching strategic planning and meticulous tactical execution. To achieve this, we synthesize trajectory data with hierarchical reasoning that comprises two distinct layers (a short sketch follows the list below):

• Strategic Layer. The strategic layer is responsible for high-level task decomposition and sub-goal planning. It analyzes the overall task objective and determines the sequence of subtasks needed for completion.

• Tactical Layer. The tactical layer handles the selection and grounding of concrete actions. Based on the strategic layer's planning, the agent selects appropriate GUI operations and adjusts their parameters to match the target.
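To make the two layers concrete, one synthesized reasoning step can be pictured as a nested record. The structure below is a hypothetical Python sketch (the field names are ours, not the exact schema of our SFT data); the example strings follow the alarm-setting case shown in Figure 1.

# Hypothetical layout of one hierarchical-reasoning step (illustrative only).
step = {
    "strategic": {
        "summary": "Home screen -> App Drawer -> Clock interface",
        "planning": "Alarm tab -> Create new -> Set 7am -> Save",
    },
    "tactical": {
        "reasoning": "Need to access the alarm tab from the current clock screen",
        "grounding": "Tap the alarm icon in the top-left corner",
    },
    "action": {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}},
}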
Table 2: UI action reasoning datasets used in the training process.

3.2.2 Expectation-Reflection Reasoning

To enhance action consistency and foster autonomous self-correction, we incorporate expectation-reflection reasoning into the training datasets. This iterative process enhances the agent's ability to adapt and learn from its actions through a structured reflection cycle (a short sketch follows the tool message listing below):

• Reasoning. After reflection (which is skipped at the first step), the agent conducts hierarchical reasoning.

• Action. After the reasoning, the agent takes the action.

• Expectation. Following each action, the agent generates expected outcomes, which are used for reflection at the subsequent step.

Tool Message:
<tool_response>
{
  "name": "gui_operation",
  "content": {
    "status": "success | failure",
    "current_ui": <image>,
    "current_task": <task_description>
  }
}
</tool_response>
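At inference time, this cycle can be read as a simple per-step loop in which each step consumes the previous expectation and emits a new one. The Python sketch below is our own illustration; the agent and env objects and their methods (reflect, reason, act, expect, screenshot, execute) are hypothetical placeholders, not a released API.

def run_episode(agent, env, task, max_steps=20):
    """Illustrative sketch of the expectation-reflection cycle at inference time."""
    expectation, history = None, []
    for _ in range(max_steps):
        observation = env.screenshot()
        # Reflection: assess the previous action's outcome against its expectation.
        reflection = agent.reflect(expectation, observation) if expectation else None
        # Hierarchical reasoning: strategic summary/planning, then tactical grounding.
        reasoning = agent.reason(task, history, observation, reflection)
        # Action, e.g. {"name": "tap", "arguments": {"point": {"x": 115, "y": 67}}}
        action = agent.act(reasoning)
        env.execute(action)
        # Expectation: predict the action's outcome for the next step's reflection.
        expectation = agent.expect(observation, reasoning, action)
        history.append((observation, reasoning, action))
    return history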
Many current MLLMs do not handle interleaved image-text input formats well. To establish clear correspondence between observations (screenshots) and steps, we generate detailed descriptions to replace the screenshots, which helps facilitate the subsequent reasoning process construction.

• Reflection. Given the previous expectation e_{t-1} and the current observation o_t, we generate a reflection f_t that evaluates the outcome of the previous action.

• Strategic Layer. The strategic reasoning consists of two parts: first, a summary is generated based on the n-step history H_t = {(o_i, r_i, a_i)}_{i=t-n}^{t-1} and the current observation o_t; then, the planning component is generated with access to the actual action a_t to ensure alignment with the trajectory.

• Tactical Layer. This layer's reasoning is constructed using the generated reflection f_t and the strategic layer's output. The actual action a_t from the trajectory is incorporated to ensure that the tactical reasoning leads to appropriate action selection.

• Expectation. For each state-action pair (s_t, a_t), we generate an expectation e_t based on the current observation o_t, the reasoning process r_t, and the action a_t. Notably, we deliberately avoid using the next state s_{t+1} in this generation process. Although using s_{t+1} could yield perfect expectations and improve the agent's accuracy in modeling state transitions, such an approach might impair the agent's ability to handle expectation mismatches during deployment.

While we avoid using s_{t+1} in expectation generation to maintain robustness, we also explore the possibility of improving state transition modeling through a parallel next-state prediction task. Using the trajectory data, we construct additional training examples in which the agent learns to predict the next state description d_{t+1} given the current observation o_t and action a_t. This auxiliary task helps the agent learn state transition dynamics while keeping the expectation generation process independent of future states.
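As a concrete illustration, one such auxiliary example can be assembled from two adjacent trajectory steps. The sketch below uses our own hypothetical field names rather than the exact schema of our training data.

def make_next_state_example(step_t, step_t_plus_1):
    """Build one auxiliary next-state prediction example from adjacent trajectory steps."""
    return {
        "input": {
            "state_description": step_t["screen_description"],  # d_t, the description replacing screenshot o_t
            "action": step_t["action"],                          # a_t
        },
        "target": step_t_plus_1["screen_description"],           # d_{t+1}, the prediction target
    }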
4 Experiments

4.1 Experimental Setting

4.1.1 Implementation Details

In stage 1, we sample 1M samples in total, as illustrated in Table 1. In stage 2, we synthesize 45K samples based on trajectories from the datasets shown in Table 2. We perform continual supervised fine-tuning of Qwen2-VL-2B (Bai et al., 2023c), leveraging ZeRO (Rajbhandari et al., 2020) to enable full-parameter fine-tuning of the model across 8 A800 80GB GPUs.

4.1.2 Evaluation Benchmarks

ScreenSpot. ScreenSpot (Cheng et al., 2024) is an evaluation benchmark for GUI grounding, consisting of over 1,200 instructions from iOS, Android, macOS, Windows, and Web environments, with annotated element types.

AndroidWorld. AndroidWorld (Rawles et al., 2024) is a fully functional Android environment that provides reward signals for 116 programmatic tasks across 20 real-world Android apps. We find that AndroidWorld uses Set-of-Marks (SoM) (Yang et al., 2023b) to enhance the agent's grounding ability. However, when humans operate smartphones, they do not label elements on the screen, and over-reliance on SoM can lead to insufficient focus on pixel-level grounding ability. Therefore, in our experiments, agents respond to the raw image rather than the annotated image.
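For context, GUI grounding on ScreenSpot is commonly scored by whether the predicted click point falls inside the target element's bounding box. The helper below is our own sketch of that criterion, assuming points and boxes expressed in the normalized coordinate format described earlier.

def grounding_correct(pred_point, gt_box):
    """True if the predicted point lies inside the ground-truth element box."""
    return (gt_box["x1"] <= pred_point["x"] <= gt_box["x2"]
            and gt_box["y1"] <= pred_point["y"] <= gt_box["y2"])

def grounding_accuracy(predictions, gt_boxes):
    """Fraction of instructions whose predicted point hits the annotated element."""
    hits = sum(grounding_correct(p, b) for p, b in zip(predictions, gt_boxes))
    return hits / len(gt_boxes)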
4.2 Main Results

ScreenSpot. Table 4 reports the results of different models across three platforms (Mobile, Desktop, and Web) and two element types (Text and Icon) on ScreenSpot (Cheng et al., 2024). InfiGUIAgent-2B achieves the highest average accuracy of 76.3%, surpassing several strong baselines such as ShowUI (Lin et al., 2024) (75.1%) and UGround-7B (Gou et al., 2024) (73.3%), even though the latter has a much larger parameter count.

AndroidWorld. Table 5 compares the success rates of InfiGUIAgent with open-source models on AndroidWorld (Rawles et al., 2024). InfiGUIAgent-2B achieves an overall success rate of 0.09, outperforming open-source models of similar size, such as ShowUI-2B (Lin et al., 2024) (0.07), as well as models with far more parameters, such as LLaVA-OV-7B (Li et al., 2024b) (0.00) and Qwen2-VL-72B (Bai et al., 2023b) (0.05).
Accuracy (%)

Model                              | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg.
Proprietary Models
GPT-4o (OpenAI, 2024)              | 30.5 | 23.2 | 20.6 | 19.4 | 11.1 |  7.8 | 18.8
Gemini-1.5-pro (Team et al., 2024) | 76.2 | 54.1 | 65.5 | 39.3 | 52.2 | 32.0 | 53.2
Open-source Models
Qwen2-VL-2B (Wang et al., 2024)    | 24.2 | 10.0 |  1.4 |  9.3 |  8.7 |  2.4 |  9.3
Qwen2-VL-7B (Wang et al., 2024)    | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.9
CogAgent (Hong et al., 2024)       | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4
SeeClick (Cheng et al., 2024)      | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4
UGround-7B (Gou et al., 2024)      | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3
ShowUI-2B (Lin et al., 2024)       | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1
Ours
InfiGUIAgent-2B                    | 88.6 | 74.7 | 85.6 | 65.0 | 79.1 | 64.6 | 76.3
Table 4: Performance across platforms (Mobile, Desktop, Web) on ScreenSpot. All experiments were conducted using raw screenshot information. Results marked in bold represent the best performance, and those underlined indicate the second-best performance.
5 Conclusion

In this work, we propose InfiGUIAgent, a novel MLLM-based GUI Agent. By constructing comprehensive training datasets with two-stage supervised fine-tuning, we enhance the model's ability to understand, reason, and interact with GUIs. Our evaluation, conducted using raw screenshots without relying on additional GUI metadata, demonstrates the model's applicability to real-world scenarios. Experimental results show that our model performs well on GUI tasks and surpasses several open-source baselines.

References

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.

Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, et al. 2024. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K. Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Yu Bowen, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xing Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023a. Qwen technical report. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-vl: A frontier large vision-language model with versatile abilities. ArXiv.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023c. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. 2024. Amex: Android multi-annotation expo dataset for mobile gui agents. arXiv preprint arXiv:2407.17490.
Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935.

Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, and Shuming Shi. 2022. One model, multiple modalities: A sparsely activated approach for text, sound, image, video and code. Preprint, arXiv:2205.06126.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.

Luciano Floridi and Massimo Chiriatti. 2020. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694.

Glaive AI. 2024. Glaive function calling dataset. https://huggingface.co/datasets/glaiveai/glaive-function-calling. Accessed: 2024-01-08.

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290.

Yu-Chung Hsiao, Fedir Zubach, Gilles Baechler, Victor Carbune, Jason Lin, Maria Wang, Srinivas Sunkara, Yun Zhu, and Jindong Chen. 2022. Screenqa: Large-scale question-answer pairs over mobile app screenshots. arXiv preprint arXiv:2209.08199.

Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024a. Leveraging print debugging to improve code generation in large language models. Preprint, arXiv:2401.05319.

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, and Fei Wu. 2024b. Os agents: A survey on mllm-based agents for general computing devices use. Preprints.

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024c. Infiagent-dabench: Evaluating agents on data analysis tasks. arXiv preprint arXiv:2401.05507.

Jie Huang and Kevin Chen-Chuan Chang. 2023. Towards reasoning in large language models: A survey. Preprint, arXiv:2212.10403.

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716.

Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. Iluvui: Instruction-tuned language-vision modeling of uis from machine conversations. arXiv preprint arXiv:2310.04869.

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakutdinov. 2024. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024a. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03329.

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. 2024b. Llava-onevision: Easy visual task transfer. Preprint, arXiv:2408.03326.

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Fan Zhou, Chengen Huang, Yanpeng Li, Chongyan Zhu, Xiaoyi Ren, Chao Li, Yifan Ye, Lihuan Zhang, Hanshu Yan, Guoyin Wang, Bei Chen, and Junnan Li. 2024c. Aria: An open multimodal native mixture-of-experts model. Preprint, arXiv:2410.05993.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR.

Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024d. Infibench: Evaluating the question-answering capabilities of code large language models. arXiv preprint arXiv:2404.07940.
Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. Widget captioning: Generating natural language description for mobile user interface elements. arXiv preprint arXiv:2010.04295.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2024. Showui: One vision-language-action model for generalist gui agent. In NeurIPS 2024 Workshop on Open-World Agents.

Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. 2024. Infimm: Advancing multimodal understanding with an open-sourced visual language model. In Annual Meeting of the Association for Computational Linguistics.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986.

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi. 2024. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. arXiv preprint arXiv:2409.17146.

Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, and Wenhao Xu. 2024. Mobileflow: A multimodal llm for mobile gui agent. arXiv preprint arXiv:2407.04346.

OpenAI. 2023. Gpt-4v(ision) system card.

OpenAI. 2024. Gpt-4o. Accessed: 2025-01-03.

Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Xinyue Yang, Jiadai Sun, Yu Yang, Shuntian Yao, Tianjie Zhang, et al. 2024. Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. Preprint, arXiv:1910.02054.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Preprint, arXiv:2303.11366.

Srinivas Sunkara, Maria Wang, Lijuan Liu, Gilles Baechler, Yu-Chung Hsiao, Abhanshu Sharma, James Stout, et al. 2022. Towards better semantic understanding of mobile interfaces. arXiv preprint arXiv:2210.02663.

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. ArXiv.

Minh Duc Vu, Han Wang, Jieshan Chen, Zhuang Li, Shengdong Zhao, Zhenchang Xing, and Chunyang Chen. 2024. Gptvoicetasker: Advancing multi-step mobile task efficiency through dynamic interface exploration and learning. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–17.

Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. Screen2words: Automatic mobile ui summarization with multimodal learning. arXiv preprint arXiv:2108.03353.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. Autodroid: Llm-powered task automation in android. arXiv preprint arXiv:2308.15272.

Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for chinese legal long documents. AI Open, 2:79–84.

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024a. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713.

Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. 2024b. Llm as a mastermind: A survey of strategic reasoning with large language models. Preprint, arXiv:2404.01230.

Zhuosheng Zhang and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
A Cases
A.1 Stage 1: Fundamental Abilities
We demonstrate the fundamental abilities trained
in Stage 1 through three cases: GUI Understanding
(Figure 2), Grounding (Figure 3), and Question
Answering (Figure 4).
Instruction: Describe the UI in detail. You should
reference components and locations in the UI.
Instruction: What day is it tomorrow? You should
reference components and locations in the UI.
Reflection: The expectation from the previous step was to open the Simple SMS Messenger app by tapping on its icon. The current screen content shows the Simple SMS Messenger app's ...

Strategic Layer
Summary: The overall task objective is to send the message "Yoga class every Tuesday and Thursday at 6 PM" to +18407874306 ...
Planning: To achieve the final goal, the plan is to: first, initiate a new chat; second, enter the recipient's phone number (+18407874306); third, compose the message ...

Tactical Layer
Reasoning: The core component on the current screen is the "Start chat" button. The key point of the task objective in the current step is to initiate a ...

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 824, "y": 886}}}
</tool_call>

Expectation: The purpose of this action is to open the new chat interface. This action will likely result in a new screen appearing ...

Figure 5: Case of Native Advanced Reasoning. The agent's goal is to reply to a message.
Step K

Reflection: The expectation from the previous step was to open the Contacts app by tapping on its icon. However, the current screen content shows the Phone app with the "Recents" tab selected ...

Strategic Layer
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous step, I attempted to open the Contacts app ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I first need to access the Contacts section within the Phone app. Then I need to ...

Tactical Layer
Reasoning: The core components on the current screen are the three tabs at the bottom: "Favorites", "Recents", and "Contacts". The key point of the task objective in the current step is to navigate to the Contacts section. Therefore, I need to tap on the "Contacts" tab.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 853, "y": 924}}}
</tool_call>

Figure 6: Case of Native Advanced Reasoning. The agent's goal is to create a new contact.
Step K + 1

Reflection: The expectation from the previous step was to navigate to the Contacts section by tapping the "Contacts" tab. The current screen content shows ..., indicating that the previous action was successful.

Strategic Layer
Summary: The overall task objective is to create a new contact for Hugo Pereira. In the previous steps, I opened the Phone app and ...
Planning: To achieve the final goal of creating a new contact for Hugo Pereira, I need to first initiate the new contact creation process. Then, I need to enter Hugo's details ...

Tactical Layer
Reasoning: The core component on the current screen is the "Create new contact" button. The key point of the task objective in the current step is to initiate the new contact creation process. Therefore, I need to tap on the "Create new contact" button.

Action:
<tool_call>
{"name": "tap", "arguments": {"point": {"x": 492, "y": 575}}}
</tool_call>

Figure 7: Case of Native Advanced Reasoning. The agent's goal is to create a new contact.