[EXP]: Benchmark Results Discrepancy for GPT-4o in Sotopia vs Sotopia-RL

### Description


Hi,

I made some adjustments on top of sotopia and was able to run the benchmark successfully. However, my results differ from those reported in the Sotopia-RL paper.

Here are my reproduced results (using **GPT-4o** as both agent and evaluator):

| Setting          | Believability | Relationship | Knowledge   | Secret       | Social Rules | Financial & Material Benefits | Goal        | Overall Score | Setting Num | Episode Count |
|------------------|---------------|--------------|-------------|--------------|--------------|-------------------------------|-------------|---------------|-------------|---------------|
| **Sotopia-all**  | 8.99 ± 0.01   | 2.53 ± 0.06  | 5.31 ± 0.13 | -0.10 ± 0.04 | -0.12 ± 0.03 | 0.60 ± 0.05                   | 7.11 ± 0.10 | 3.47 ± 0.04   | 90          | 562           |
| **Sotopia-hard** | 8.88 ± 0.13   | 1.24 ± 0.22  | 4.55 ± 0.60 | 0.00 ± 0.00  | -0.12 ± 0.16 | 0.42 ± 0.14                   | 5.04 ± 0.31 | 2.86 ± 0.14   | 14          | 85            |
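As a quick sanity check on my own numbers, here is a minimal sketch that recomputes the Overall Score from the seven dimension means in the table above. It assumes the Overall Score is the unweighted mean of the seven dimensions; that is my assumption, and the helper name `overall_from_dimensions` is only illustrative.

```python
from statistics import mean

# Mean values copied from the reproduced table above (the ± terms are dropped).
# Order: Believability, Relationship, Knowledge, Secret, Social Rules,
#        Financial & Material Benefits, Goal
reproduced = {
    "Sotopia-all": ([8.99, 2.53, 5.31, -0.10, -0.12, 0.60, 7.11], 3.47),
    "Sotopia-hard": ([8.88, 1.24, 4.55, 0.00, -0.12, 0.42, 5.04], 2.86),
}


def overall_from_dimensions(scores):
    """Unweighted mean of the seven dimension scores (assumed definition)."""
    return mean(scores)


for setting, (scores, table_overall) in reproduced.items():
    computed = overall_from_dimensions(scores)
    print(f"{setting}: computed {computed:.2f}, table {table_overall:.2f}")
```

Under that assumption the recomputed values round to 3.47 and 2.86, which matches the Overall Score column, so the aggregation on my side looks internally consistent.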

And here are the results from **Sotopia-RL** (using **GPT-4o as partner**):

| Model         | Setting      | Goal  | Average Score |
|---------------|--------------|-------|---------------|
| **GPT-4o**    | **Sotopia-all** | 8.19  | 3.76          |
| **GPT-4o**    | **Sotopia-hard**| 6.97  | 3.46          |
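To make the size of the discrepancy explicit, here is a small sketch that computes the gap between my reproduced means and the Sotopia-RL numbers. It assumes that the paper's "Average Score" corresponds to the "Overall Score" column in my table; the function name `gap` is only illustrative.

```python
# (reproduced, reported in Sotopia-RL) pairs copied from the two tables above.
# Assumption: the paper's "Average Score" is comparable to my "Overall Score".
results = {
    "Sotopia-all": {"Goal": (7.11, 8.19), "Overall": (3.47, 3.76)},
    "Sotopia-hard": {"Goal": (5.04, 6.97), "Overall": (2.86, 3.46)},
}


def gap(reproduced, reported):
    """Return the absolute shortfall and the shortfall as a percentage of the reported score."""
    diff = reported - reproduced
    return diff, 100.0 * diff / reported


for setting, metrics in results.items():
    for name, (mine, paper) in metrics.items():
        diff, pct = gap(mine, paper)
        print(f"{setting:13s} {name:8s} {mine:.2f} vs {paper:.2f}  gap {diff:.2f} ({pct:.1f}%)")
```

The Goal gap is 1.08 on Sotopia-all and 1.93 on Sotopia-hard, both well outside the ± ranges in my table, which is why I suspect a configuration difference rather than run-to-run noise.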

For the benchmark, I used the following command:

```bash
sotopia benchmark --only-show-performance \
  --models openrouter/openai/gpt-4o \
  --partner-model openrouter/openai/gpt-4o \
  --evaluator-model openrouter/openai/gpt-4o \
  --batch-size 10 \
  --task all \
  --push-to-db \
  --output-to-jsonl \
  --save-dir ./logs \
  --print-logs
```


The tag I’m using is **v0.1.4**.

Would you mind confirming:

1. Whether there were any specific hyperparameter or configuration differences (e.g., temperature, sampling strategy, or evaluator settings) between your released benchmark and the Sotopia-RL paper version?
2. Whether there is a recommended config file or branch for reproducing the **Sotopia-RL** results reported in Table 1?

Thanks again for your excellent work and for maintaining this project!


I also switched to v0.1.0-rc5, but got:

<img width="1555" height="120" alt="Image" src="https://github.com/user-attachments/assets/5b2d0693-fc73-41cc-957a-0c3c86e6cbb6" />

<img width="1552" height="91" alt="Image" src="https://github.com/user-attachments/assets/f05eb651-b37d-422b-accd-5347d9c82ae1" />

### Additional Information

_No response_
