Hi,
I’ve made some adjustments based on Sotopia and was able to run the benchmark successfully. However, my results differ noticeably from those reported in the Sotopia-RL paper.
Here are my reproduced results (using GPT-4o as both agent and evaluator):
| Setting | Believability | Relationship | Knowledge | Secret | Social Rules | Financial & Material Benefits | Goal | Overall Score | Setting Num | Episode Count |
|---|---|---|---|---|---|---|---|---|---|---|
| Sotopia-all | 8.99 ± 0.01 | 2.53 ± 0.06 | 5.31 ± 0.13 | -0.10 ± 0.04 | -0.12 ± 0.03 | 0.60 ± 0.05 | 7.11 ± 0.10 | 3.47 ± 0.04 | 90.00 ± 0.00 | 562.00 ± 0.00 |
| Sotopia-hard | 8.88 ± 0.13 | 1.24 ± 0.22 | 4.55 ± 0.60 | 0.00 ± 0.00 | -0.12 ± 0.16 | 0.42 ± 0.14 | 5.04 ± 0.31 | 2.86 ± 0.14 | 14 | 85 |
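(For reference on how I read these numbers: I am assuming the Overall Score is the simple mean of the seven dimension scores; this is my assumption, not something I found documented. A quick sanity check against the rows above:)

```python
# Sanity check: is "Overall Score" the mean of the seven dimensions?
# (Assumption on my part; the values below are copied from my table.)
dimensions_all = [8.99, 2.53, 5.31, -0.10, -0.12, 0.60, 7.11]   # Sotopia-all row
dimensions_hard = [8.88, 1.24, 4.55, 0.00, -0.12, 0.42, 5.04]   # Sotopia-hard row

print(round(sum(dimensions_all) / len(dimensions_all), 2))    # 3.47, matches the table
print(round(sum(dimensions_hard) / len(dimensions_hard), 2))  # 2.86, matches the table
```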
And here are the results reported in the Sotopia-RL paper (using GPT-4o as the partner model):
| Model | Setting | Goal | Average Score |
|---|---|---|---|
| GPT-4o | Sotopia-all | 8.19 | 3.76 |
| GPT-4o | Sotopia-hard | 6.97 | 3.46 |
For the benchmark, I used the following command:
sotopia benchmark --only-show-performance \
--models openrouter/openai/gpt-4o \
--partner-model openrouter/openai/gpt-4o \
--evaluator-model openrouter/openai/gpt-4o \
--batch-size 10 \
--task all \
--push-to-db \
--output-to-jsonl \
--save-dir ./logs \
--print-logs

The tag I’m using is v0.1.4.
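(In case it matters, this is how I confirmed the installed release in my environment; a minimal check with the standard library, assuming the distribution is installed under the name `sotopia`:)

```python
from importlib.metadata import version

# Print which sotopia release is actually installed in this environment
# (assumes the distribution name is "sotopia").
print(version("sotopia"))  # expected: 0.1.4
```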
Would you mind confirming:
- Whether there were any specific hyperparameter or configuration differences (e.g., temperature, sampling strategy, or evaluator settings) between your released benchmark and the version used for the Sotopia-RL paper? (See the sketch after this list for the kind of setting I mean.)
- Is there a recommended config file or branch for reproducing the Sotopia-RL results reported in Table 1?
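To make the first question concrete, here is a minimal sketch of the kind of sampling configuration I am asking about, calling the OpenAI-compatible OpenRouter endpoint directly. The temperature and top_p values here are illustrative assumptions on my part, not the settings from the paper or this repo:

```python
import os
from openai import OpenAI

# Illustrative only: which generation settings were used for the agent and evaluator?
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="openai/gpt-4o",
    temperature=0.7,  # assumption: was this pinned (e.g., 0.7 vs 1.0) in the paper runs?
    top_p=1.0,        # assumption: or was a different sampling strategy used?
    messages=[{"role": "user", "content": "..."}],
)
print(response.choices[0].message.content)
```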
Thanks again for your excellent work and for maintaining this project!
I also switched to v0.1.0-rc5, but got: