MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents

Ruihan Chen¹†, Qiming Li¹†, Xiaocheng Feng¹,²,
Xiaoliang Yang¹, Weihong Zhong¹, Yuxuan Gu¹, Zekun Zhou¹, Bing Qin¹,²
¹Harbin Institute of Technology ²Peng Cheng Laboratory
{rhchen,qmli}@ir.hit.edu.cn
†Equal Contribution

Abstract

With the advancement of computational resources, Large Vision–Language Models (LVLMs) exhibit impressive Perception and Reasoning (P&R) performance on Graphical User Interface (GUI) tasks. However, although they demonstrate strong P&R capabilities in English GUI scenarios, their performance in multilingual settings has received little attention, which limits their global applications. Moreover, existing studies on GUI tasks lack fine-grained analyses, including widget functions and elements’ spatial relationships, which are fundamental for more targeted improvements. To tackle these issues, we propose MPR-GUI-Bench, a Multilingual fine-grained Perception and Reasoning GUI Benchmark to evaluate GUI agents’ P&R capabilities. Evaluation results demonstrate that LVLMs exhibit significantly worse P&R performance in non-English languages than in English. To address these gaps, we propose GUI-XLI, a GUI Cross-Lingual Intervention method that applies interventions to the hidden states at P&R capability-related layers to mitigate the gaps between English and other languages, building on previous research showing that the hidden states of different language inputs exhibit significant differences in the latent space. Experimental results indicate that our method improves GUI agents’ multilingual P&R capability by 6.5% on average.


1 Introduction

Figure 1: Performance of GUI agents on our MPR-GUI-Bench benchmark. The left figure illustrates that all GUI agents exhibit the strongest performance in English, while the right one shows their fine-grained P&R capabilities across multiple dimensions.

Dataset	Languages / Capability / Fine-grained / Platform (Web., Mob., Desk.)	Size	Type
GUI-WORLD	✔ ✗ ✔ ✗ ✔	12,379	dataset
AndroidWorld	✔ ✗ ✔ ✗ ✔ ✗	116	env.
Mobile-Agent-Bench	✔ ✗ ✔ ✗	100	env.
ScreenSpot	✔ ✗ ✔ ✗ ✔	1200+	dataset
GUI-Odyssey	✔ ✗ ✔ ✗ ✔ ✗	7735	dataset
SPA-Bench	✔ ✗ ✔ ✗	340	env.
MacOSWorld	✔ ✗ ✔ ✗ ✔	201+29	env.
MPR-GUI-Bench	✔ ✗ ✔ ✔ ✔ ✗	12,936	dataset

Table 1: Comparison between MPR-GUI-Bench and other GUI benchmarks. In the Capability category, P means the benchmark involves perception capabilities and R means it involves reasoning capabilities. The Fine-grained category indicates whether the benchmark involves fine-grained analysis of GUI agents’ P&R capabilities. In the Platform category, Web. means websites, Mob. means mobile devices, and Desk. means computer desktops. In the Type category, env. means an interactive environment that evaluates GUI agents’ real-time performance, while dataset means a benchmark that evaluates task-specific performance on images, as in traditional multimodal benchmarks.

With the rapid development of Large Vision–Language Models (LVLMs) (nguyen2024guiagentssurvey), they exhibit strong Perception and Reasoning (P&R) capabilities across various Graphical User Interface (GUI) benchmarks. However, as presented in Table 1, existing GUI benchmarks remain limited in two aspects: (1) evaluations are limited to English environments (rawles2024androidworlddynamicbenchmarkingenvironment; wang2024mobileagentbenchefficientuserfriendlybenchmark; chen2025spabench), which conflicts with the global need for multilingual support; (2) current benchmarks lack systematic evaluation of GUI agents’ fine-grained P&R capabilities (cheng2024seeclick; lu2024guiodyssey; chen2024gui) because they overlook the inherent features of GUI scenarios (i.e., sparse visual elements and concise layouts). To address these limitations, we propose the Multilingual fine-grained Perception and Reasoning GUI Benchmark (MPR-GUI-Bench), the first benchmark to systematically evaluate the multilingual fine-grained P&R capabilities of GUI agents, featuring identical evaluation settings for each language. As illustrated in Figure 2, to construct MPR-GUI-Bench, we propose a semi-automatic pipeline that combines human annotators with GPT-4o to generate VQA (Visual Question Answering) samples and expand them to other languages, which significantly reduces manual effort while ensuring data quality. As presented in Figure 1, the evaluation results of seven baselines on MPR-GUI-Bench reveal a significant gap in fine-grained P&R capabilities between English and non-English languages, with average FPR-ACC scores of 75.3% and 67.7%, respectively.

Previous works (ye2025claim; peng2025debiasingmultilingualllmscrosslingual; chang-etal-2022-geometry) have shown that the hidden-state distribution of LVLMs for English inputs differs from that for other languages, and that aligning the distributions of other languages with English can mitigate the competence gaps. Building on these findings, we propose the GUI Cross-Lingual Intervention (GUI-XLI) method. GUI-XLI applies interventions to the hidden states of non-English inputs at P&R capability-related layers to mitigate the cross-lingual gaps. Experimental results demonstrate that GUI-XLI significantly improves the performance of open-source LVLMs on non-English GUI tasks by 6.5% on average with low inference latency.
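To make the idea of aligning non-English hidden states with English ones concrete, the following is a minimal sketch of a generic mean-shift intervention on toy vectors. It is an illustration under stated assumptions, not the GUI-XLI formulation itself: the function names, the scaling factor `alpha`, and the use of a simple difference of means are all hypothetical choices made for this example.

```python
from statistics import fmean


def mean_vector(states):
    """Column-wise mean of a list of equal-length hidden-state vectors."""
    return [fmean(column) for column in zip(*states)]


def shift_toward_english(target_states, english_states, alpha=1.0):
    """Add alpha * (mean_EN - mean_target) to every target-language vector.

    Hypothetical helper: a plain difference-of-means shift, used here only
    to illustrate what 'intervening on hidden states' can look like.
    """
    mu_en = mean_vector(english_states)
    mu_target = mean_vector(target_states)
    direction = [e - t for e, t in zip(mu_en, mu_target)]
    return [
        [h + alpha * d for h, d in zip(vector, direction)]
        for vector in target_states
    ]


# Toy 2-dimensional "hidden states" for one layer (made-up numbers).
english = [[1.0, 2.0], [3.0, 4.0]]
thai = [[0.0, 0.0], [2.0, 0.0]]

shifted = shift_toward_english(thai, english, alpha=1.0)
print(mean_vector(shifted))  # with alpha=1.0 the shifted mean matches the English mean
```

With `alpha=1.0` the shifted vectors have the same mean as the English ones; smaller values would move them only part of the way. In a real model the same operation would be applied to per-layer activations rather than to hand-written lists.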

Our contributions are summarized as follows:

• We propose a semi-automatic pipeline that leverages compositional prompting with GPT-4o to construct multilingual GUI datasets, reducing manual effort while ensuring data quality.
• We present MPR-GUI-Bench, the first multilingual benchmark to evaluate GUI agents’ fine-grained P&R capabilities on mobile devices.
• We propose GUI-XLI, a training-free representation engineering method that mitigates LVLMs’ cross-lingual P&R capability gaps.

2 Related Work

Multimodal LLM-based Agents

The continuous advancement of LVLMs in P&R capabilities reveals their potential as Multimodal LLM-based Agents (MLAs). In GUI scenarios, they can be grouped into three categories: (1) closed-source LVLM-based GUI agents relying on standardized protocols (yan2025mcpworldunifiedbenchmarkingtestbed); (2) open-source LVLM-based GUI agents strengthened by incorporating GUI data into training corpora (chen2024internvl; yao2024minicpm); and (3) other GUI agents (hong2023cogagent; qin2025ui; ariaui) trained on GUI datasets, which gain stronger instruction-following and GUI-grounding capabilities at the cost of reduced generalization and reasoning capabilities. The first two categories aim to directly transfer the general competence of foundational LVLMs to real-time GUI scenarios, overlooking the intrinsic properties of GUI tasks.

GUI Agent Benchmarks

As presented in Table 1, existing GUI agent benchmarks generally fall into two types: interactive environments and static datasets (nguyen2024guiagentssurvey). Environment-based benchmarks (rawles2024androidworlddynamicbenchmarkingenvironment; wang2024mobileagentbenchefficientuserfriendlybenchmark; chen2025spabench) treat each Status-Action-Operation cycle as a whole, providing limited analysis of the agents’ P&R capabilities. Dataset-based benchmarks, which are composed of static screenshots (cheng2024seeclick; lu2024guiodyssey; chen2024gui), provide limited analysis of GUI agents’ perception process. Most recent benchmarks focus only on English. However, with the increasing demand from users in different linguistic environments, GUI agents must possess balanced P&R capabilities across multilingual contexts to achieve broader applications (tang2025surveymllmbasedguiagents; nguyen2024guiagentssurvey). Some benchmarks (macosworld; chen2025spabench) involve multilingual settings; however, none of them systematically evaluates fine-grained P&R capabilities, resulting in a lack of targeted explainability. To address this gap, we propose MPR-GUI-Bench, the first benchmark to systematically evaluate GUI agents’ fine-grained P&R capabilities in multilingual GUI scenarios.

3 MPR-GUI-Bench

Existing studies of LVLMs in GUI applications have mostly neglected fine-grained P&R capabilities, leading to disparities in their development. Even fewer studies have focused on these capabilities in multilingual contexts. As technology advances, users from diverse linguistic backgrounds have an increasingly urgent demand for LVLMs. Therefore, to achieve broader applications in GUI scenarios, it is crucial for LVLMs to eliminate multilingual bias. To this end, we introduce the Multilingual fine-grained Perception and Reasoning GUI Benchmark (MPR-GUI-Bench), a benchmark evaluating these capabilities in diverse multilingual GUI tasks.

3.1 Data Source

As shown in Figure 2, annotators collect parallel screenshots in six languages: English, Chinese, French, Russian, Japanese, and Thai (EN, ZH, FR, RU, JA, TH). The screenshots span 39 distinct real-world GUI scenarios under two operating systems, iOS and Android, on six mobile device models.

3.2 Task Definitions

As shown in Figure 3, we design eight fine-grained dimensions to evaluate LVLMs’ (1) perception capabilities, covering the interactive components (widgets) within screenshots and users’ interactive actions; and (2) reasoning capabilities, covering spatial reasoning about the locations of elements and the spatial relationships between them, as well as integrated reasoning over synthesized perception information. The eight fine-grained dimensions are defined as follows:

Perception Capabilities Evaluation Dimensions

• Widget Function Comprehension (WF) evaluates LVLMs’ perception of the function of GUI elements and the meaning of visual cues.
• Widget Interaction Comprehension (WI) evaluates LVLMs’ perception of the most suitable way for users to interact with widgets.
• Action Understanding (AU) evaluates LVLMs’ perception of the consequences of executed actions, including interface changes, system feedback, and impacts on future interactions.
• Action Prediction (AP) evaluates LVLMs’ perception of action organization (e.g., types, targets, order, input content) to accomplish goals.

Reasoning Capabilities Evaluation Dimensions

• Absolute Element Location (AEL) evaluates LVLMs’ reasoning capability to correctly locate UI elements and analyze their global positions.
• Relative Element Location (REL) evaluates LVLMs’ reasoning capability regarding relative spatial relationships between GUI elements.
• Rich Information (RI) evaluates LVLMs’ capability to synthesize and reason over long screenshot sequences with rich perception information, which requires a strong grasp of the fine-grained perception capabilities.
• Sparse Information (SI) evaluates LVLMs’ capability to synthesize and reason about users’ intentions based on shorter screenshot sequences with less information and fewer visual cues than RI, which makes the task more difficult.

Notably, the first six dimensions involve samples based on single screenshots, while the last two involve those based on sequential screenshots.
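The task taxonomy above can be summarized as a small data structure. The sketch below is only an illustration: the class name `Sample`, its field names, and the rule that a "sequence" means at least two screenshots are assumptions made for this example, not the benchmark's actual schema. The dimension and language codes are the ones listed in the text.

```python
from dataclasses import dataclass
from typing import List

PERCEPTION = {"WF", "WI", "AU", "AP"}
REASONING = {"AEL", "REL", "RI", "SI"}
SEQUENTIAL = {"RI", "SI"}  # dimensions built on screenshot sequences
LANGUAGES = {"EN", "ZH", "FR", "RU", "JA", "TH"}


@dataclass
class Sample:
    """One hypothetical VQA sample; field names are illustrative only."""
    dimension: str
    language: str
    screenshots: List[str]
    question: str
    answer: str

    def __post_init__(self):
        if self.dimension not in PERCEPTION | REASONING:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if self.language not in LANGUAGES:
            raise ValueError(f"unknown language: {self.language}")
        if self.dimension in SEQUENTIAL:
            # Assumption: a screenshot sequence has at least two frames.
            if len(self.screenshots) < 2:
                raise ValueError("RI/SI samples need a screenshot sequence")
        elif len(self.screenshots) != 1:
            raise ValueError("single-screenshot dimensions take one image")

    @property
    def capability(self) -> str:
        return "perception" if self.dimension in PERCEPTION else "reasoning"


sample = Sample("WF", "TH", ["settings.png"], "What does the gear icon do?", "B")
print(sample.capability)  # perception
```

Encoding the single-screenshot versus sequence distinction as a validation rule keeps malformed samples out of the dataset at construction time.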

3.3 Benchmark Construction Pipeline

Figure 2 illustrates our semi-automatic dataset construction pipeline, which leverages GPT-4o (openai2024gpt4ocard) to reduce human effort. Annotators are required to collect screenshots, design prompts, and check GPT-4o’s output to ensure data quality.

Step 1: Screenshot Collection. Annotators are required to collect parallel screenshots across six languages and distinct GUI scenarios following the strict guidelines shown in Appendix A.6.

Model	Lang	Perception				Reasoning				FPR-ACC
		AU	AP	WF	WI	AEL	REL	RI	SI
Open-source LVLMs
Intern2.5VL-8B	EN	81.2	89.9	79.5	92.1	82.0	82.0	80.0	44.0	75.2
	ZH	72.4	85.5	75.1	88.0	78.4	67.8	64.0	60.0	71.9
	FR	77.1	83.9	75.6	88.5	72.7	76.5	80.0	52.0	73.5
	RU	70.2	81.4	70.4	83.3	68.3	66.9	80.0	48.0	69.1
	JA	64.2	82.8	72.9	80.6	73.2	69.1	64.0	44.0	66.0
	TH	57.9	67.5	52.6	72.7	42.9	38.3	80.0	52.0	58.5
MiniCPM-o 2.6	EN	83.9	83.6	81.6	91.0	77.9	84.4	84.0	64.0	79.6
	ZH	76.5	82.0	75.1	90.2	75.1	73.8	80.0	64.0	75.9
	FR	76.5	79.0	77.8	89.9	74.3	76.8	76.0	60.0	74.6
	RU	74.6	79.5	72.3	86.6	72.4	73.8	64.0	56.0	70.2
	JA	72.7	77.6	74.0	85.8	74.0	67.2	68.0	64.0	71.7
	TH	66.4	74.0	60.6	78.1	63.1	42.6	52.0	40.0	57.1
Qwen-2.5-VL-7B-Instruct	EN	86.1	89.4	86.0	93.4	86.0	81.6	96.0	72.0	87.1
	ZH	83.6	88.8	77.8	88.8	79.2	74.3	68.0	68.0	80.4
	FR	81.7	83.6	80.0	91.3	76.5	79.0	72.0	72.0	80.3
	RU	77.6	86.1	76.7	89.6	77.3	75.1	72.0	72.0	80.4
	JA	81.7	87.7	79.2	90.7	77.3	69.1	88.0	68.0	79.5
	TH	76.8	82.5	77.5	88.8	73.5	65.3	76.0	72.0	75.7
Keye-VL-7B	EN	88.3	83.9	81.6	93.7	79.0	73.8	48.0	64.0	73.7
	ZH	82.0	81.4	73.4	89.3	77.3	68.0	40.0	52.0	66.9
	FR	84.2	82.5	76.4	91.0	72.4	71.6	44.0	44.0	66.5
	RU	80.1	82.0	72.1	86.6	75.7	68.3	44.0	28.0	61.8
	JA	78.1	82.2	71.2	86.6	72.7	58.2	36.0	40.0	61.4
	TH	75.1	77.3	66.0	84.4	68.3	44.0	36.0	36.0	57.0
GUI Agents
CogAgent-9B	EN	63.8	78.1	63.8	81.9	52.0	40.8	44.0	36.0	54.6
	ZH	62.8	74.3	59.3	78.1	54.0	31.6	60.0	36.0	55.0
	FR	56.9	69.7	58.9	68.0	43.2	38.2	52.0	36.0	51.0
	RU	55.2	72.4	52.4	67.2	43.9	31.6	56.0	52.0	53.8
	JA	48.3	73.5	54.8	68.0	36.0	26.3	40.0	44.0	47.9
	TH	50.0	69.7	54.8	68.6	32.0	31.1	40.0	20.0	42.8
Closed-source LVLMs
Gemini-1.5-Flash	EN	85.0	85.8	76.2	93.4	71.6	61.5	64.0	40.0	68.4
	ZH	86.2	81.4	68.5	89.9	64.5	49.2	68.0	64.0	70.5
	FR	84.4	80.6	74.0	90.4	64.8	65.0	64.0	36.0	66.0
	RU	80.1	81.2	72.6	89.9	66.1	59.3	60.0	48.0	66.9
	JA	80.3	82.8	71.2	88.8	52.0	44.7	64.0	40.0	62.7
	TH	77.9	79.5	67.7	86.3	59.0	40.4	64.0	48.0	63.5
Gemini-2.5-Pro	EN	85.0	90.7	85.0	93.2	84.7	93.2	96.0	80.0	88.0
	ZH	78.4	85.8	82.5	81.2	82.5	71.0	92.0	84.0	82.9
	FR	86.9	64.5	81.6	92.9	81.4	65.3	88.0	76.0	79.6
	RU	63.4	90.4	54.0	92.1	75.4	70.8	92.0	80.0	78.3
	JA	85.2	84.7	53.4	65.0	60.1	62.3	68.0	76.0	70.0
	TH	82.8	71.3	81.6	83.4	81.6	83.7	72.0	80.0	79.2
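As a quick sanity check on the English versus non-English gap, the FPR-ACC column of the table above can be averaged directly. The snippet below is a small sketch that copies only that column; the dictionary layout and variable names are choices made for this example. Averaged this way, the seven English scores come to roughly 75.2 and the 35 non-English scores to roughly 67.7, close to the 75.3% and 67.7% quoted in the introduction (the small difference for English is presumably a rounding effect).

```python
from statistics import fmean

# FPR-ACC per model and language, copied from the results table.
FPR_ACC = {
    "Intern2.5VL-8B": {"EN": 75.2, "ZH": 71.9, "FR": 73.5, "RU": 69.1, "JA": 66.0, "TH": 58.5},
    "MiniCPM-o 2.6": {"EN": 79.6, "ZH": 75.9, "FR": 74.6, "RU": 70.2, "JA": 71.7, "TH": 57.1},
    "Qwen-2.5-VL-7B-Instruct": {"EN": 87.1, "ZH": 80.4, "FR": 80.3, "RU": 80.4, "JA": 79.5, "TH": 75.7},
    "Keye-VL-7B": {"EN": 73.7, "ZH": 66.9, "FR": 66.5, "RU": 61.8, "JA": 61.4, "TH": 57.0},
    "CogAgent-9B": {"EN": 54.6, "ZH": 55.0, "FR": 51.0, "RU": 53.8, "JA": 47.9, "TH": 42.8},
    "Gemini-1.5-Flash": {"EN": 68.4, "ZH": 70.5, "FR": 66.0, "RU": 66.9, "JA": 62.7, "TH": 63.5},
    "Gemini-2.5-Pro": {"EN": 88.0, "ZH": 82.9, "FR": 79.6, "RU": 78.3, "JA": 70.0, "TH": 79.2},
}

# Mean over the seven English scores.
en_mean = fmean(row["EN"] for row in FPR_ACC.values())

# Mean over all 35 non-English scores (7 models x 5 languages).
non_en_mean = fmean(
    score
    for row in FPR_ACC.values()
    for lang, score in row.items()
    if lang != "EN"
)

# Per-language mean across models, to see which languages lag most.
per_language = {
    lang: fmean(row[lang] for row in FPR_ACC.values())
    for lang in ("EN", "ZH", "FR", "RU", "JA", "TH")
}

print(round(en_mean, 1), round(non_en_mean, 1))
for lang, score in sorted(per_language.items(), key=lambda kv: kv[1]):
    print(lang, round(score, 1))
```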

Rater Agreement Distribution	Frequency
(Comp. vs. Non-Comp.)	(Items Num.)
6 vs. 0	1693
5 vs. 1	291
4 vs. 2	110
3 vs. 3	42
2 vs. 4	16
1 vs. 5	1
0 vs. 6	3
Total	2156
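The agreement counts in the table above are the input to the Fleiss' kappa computation that Appendix A.1 outlines in three steps. The following is a minimal sketch of the standard two-category Fleiss' kappa applied to these counts. It assumes that every item was judged by six raters (read off the "6 vs. 0" row) and that "Comp." and "Non-Comp." are the only two categories; the function and variable names are illustrative.

```python
# Number of items for each count of "Comp." votes (out of six raters).
COUNTS = {6: 1693, 5: 291, 4: 110, 3: 42, 2: 16, 1: 1, 0: 3}
N_RATERS = 6


def fleiss_kappa(counts, n_raters):
    """Return (P_bar, P_e, kappa) for a two-category rating task."""
    n_items = sum(counts.values())
    pairs = n_raters * (n_raters - 1)

    # Step 1: overall observed agreement, the mean of per-item agreement.
    p_bar = sum(
        freq * (k * (k - 1) + (n_raters - k) * (n_raters - k - 1)) / pairs
        for k, freq in counts.items()
    ) / n_items

    # Step 2: agreement expected by chance from the category proportions.
    p_comp = sum(k * freq for k, freq in counts.items()) / (n_items * n_raters)
    p_e = p_comp ** 2 + (1 - p_comp) ** 2

    # Step 3: Fleiss' kappa.
    kappa = (p_bar - p_e) / (1 - p_e)
    return p_bar, p_e, kappa


p_bar, p_e, kappa = fleiss_kappa(COUNTS, N_RATERS)
print(f"P_bar={p_bar:.4f}  P_e={p_e:.4f}  kappa={kappa:.4f}")
```

Because one category dominates the votes, the chance-agreement term is large, so kappa can be much lower than the raw observed agreement suggests; both quantities are worth reading together.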

Translation Path	Accuracy (%)
Original (EN)	87.2
ZH $\rightarrow$ EN	87.2
JA $\rightarrow$ EN	86.6
RU $\rightarrow$ EN	86.0
FR $\rightarrow$ EN	87.0
TH $\rightarrow$ EN	86.2


Abstract

1 Introduction

2 Related Work

Multimodal LLM-based Agents

GUI Agent Benchmarks

3 MPR-GUI-Bench

3.1 Data Source

3.2 Task Definitions

Perception Capabilities Evaluation Dimensions

Reasoning Capabilities Evaluation Dimensions

3.3 Benchmark Construction Pipeline

3.4 Evaluation Metrics

3.5 Experiment Setup

Baseline

Implementation Details

3.6 Evaluation Result

4 GUI-XL-Intervention

4.1 Preliminary

4.2 GUI-XL-Memory

4.3 Cross-lingual Representation Intervention

5 Experiment

5.1 Setup

Baseline Models

Memory for RI&SI

5.2 Main Results

(1) Effective Multilingual P&R Capability Enhancement

(2) Data and Model Generalizability

(3) Significant Improvements across Tasks

6 Analysis

6.1 Visualization of Multilingual Hidden State

6.2 Ablation Studies

6.3 Reasoning Enhancement of GUI-XLI

7 Conclusion

Limitations

Appendix A Additional Details of MPR-GUI-Bench

A.1 Inter-Rater Reliability Analysis using Fleiss’ Kappa

1. Data Summary

2. Calculation of Fleiss’ Kappa

Step 1: Overall Observed Agreement ( $\bar{P}$ )

Step 2: Agreement Expected by Chance ( $\bar{P}_{e}$ )

Step 3: Fleiss’ Kappa ( $\kappa$ )

A.2 Validation on GPT-4o Translation

A.3 Details about FPR-ACC

A.4 Prompts

A.5 Case Study

A.6 Data Collection Guidelines

Appendix B Overview of GUI-XLI
