[EXP]: Benchmark Results Discrepancy for GPT-4o in Sotopia vs Sotopia-RL #206

@Babylonehy

Description

Hi:

I made some adjustments on top of sotopia and was able to run the benchmark successfully. However, my results differ from those reported in the Sotopia-RL paper.

Here are my reproduced results (using GPT-4o as both agent and evaluator):

| Setting | Believability | Relationship | Knowledge | Secret | Social Rules | Financial & Material Benefits | Goal | Overall Score | Setting Num | Episode Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sotopia-all | 8.99 ± 0.01 | 2.53 ± 0.06 | 5.31 ± 0.13 | -0.10 ± 0.04 | -0.12 ± 0.03 | 0.60 ± 0.05 | 7.11 ± 0.10 | 3.47 ± 0.04 | 90.00 ± 0.00 | 562.00 ± 0.00 |
| Sotopia-hard | 8.88 ± 0.13 | 1.24 ± 0.22 | 4.55 ± 0.60 | 0.00 ± 0.00 | -0.12 ± 0.16 | 0.42 ± 0.14 | 5.04 ± 0.31 | 2.86 ± 0.14 | 14 | 85 |
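
As a quick sanity check on the reproduced numbers, the sketch below tests whether the Overall Score column matches the unweighted mean of the seven dimension scores. The unweighted mean is an assumption here, not something stated in this issue, and the helper name `overall_from_dimensions` is hypothetical. The values are copied from the reproduced results above.

```python
from statistics import mean

# Mean per-dimension scores copied from the reproduced results above.
# Order: Believability, Relationship, Knowledge, Secret, Social Rules,
# Financial & Material Benefits, Goal. The second tuple item is the
# Overall Score reported in the same row.
reproduced = {
    "Sotopia-all": ([8.99, 2.53, 5.31, -0.10, -0.12, 0.60, 7.11], 3.47),
    "Sotopia-hard": ([8.88, 1.24, 4.55, 0.00, -0.12, 0.42, 5.04], 2.86),
}


def overall_from_dimensions(scores):
    """Unweighted mean of the dimension scores (assumed aggregation rule)."""
    return round(mean(scores), 2)


for setting, (scores, reported_overall) in reproduced.items():
    computed = overall_from_dimensions(scores)
    print(f"{setting}: computed={computed:.2f} reported={reported_overall:.2f}")
```

If the computed and reported values agree, the Overall Score gap is fully explained by the per-dimension gaps, so the comparison with the paper can focus on individual dimensions such as Goal.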

And here are the results from Sotopia-RL (using GPT-4o as partner):

| Model | Setting | Goal | Average Score |
| --- | --- | --- | --- |
| GPT-4o | Sotopia-all | 8.19 | 3.76 |
| GPT-4o | Sotopia-hard | 6.97 | 3.46 |
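
To make the size of the discrepancy explicit, here is a minimal sketch that subtracts the Sotopia-RL numbers from the reproduced ones. It assumes the reproduced "Overall Score" and the paper's "Average Score" measure the same quantity. The helper name `gaps` is hypothetical.

```python
# (goal, overall) pairs copied from the two sets of results above.
reproduced = {"Sotopia-all": (7.11, 3.47), "Sotopia-hard": (5.04, 2.86)}
paper = {"Sotopia-all": (8.19, 3.76), "Sotopia-hard": (6.97, 3.46)}


def gaps(mine, reference):
    """Return {setting: (goal_gap, overall_gap)}, reproduced minus reported."""
    return {
        setting: (
            round(mine[setting][0] - reference[setting][0], 2),
            round(mine[setting][1] - reference[setting][1], 2),
        )
        for setting in mine
    }


for setting, (goal_gap, overall_gap) in gaps(reproduced, paper).items():
    print(f"{setting}: goal {goal_gap:+.2f}, overall {overall_gap:+.2f}")
```

The Goal gap on Sotopia-all is roughly ten times the ± 0.10 reported next to the reproduced Goal score, which suggests the difference is not just run-to-run noise.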

For the benchmark, I used the following command:

sotopia benchmark --only-show-performance \
  --models openrouter/openai/gpt-4o \
  --partner-model openrouter/openai/gpt-4o \
  --evaluator-model openrouter/openai/gpt-4o \
  --batch-size 10 \
  --task all \
  --push-to-db \
  --output-to-jsonl \
  --save-dir ./logs \
  --print-logs

The tag I’m using is v0.1.4.

Would you mind confirming:

  1. Whether there were any specific hyperparameter or configuration differences (e.g., temperature, sampling strategy, or evaluator settings) between your released benchmark and the Sotopia-RL paper version?
  2. Whether there is a recommended config file or branch for reproducing the Sotopia-RL results reported in Table 1?

Thanks again for your excellent work and for maintaining this project!

I also switched to v0.1.0-rc5, but got the output shown in the two attached images (not reproduced here).
