
reproduce issue #93

Description

@yanan1116

Not sure whether anyone else has run into the same problem reproducing Table 1 of the paper.
For the ALFWorld benchmark, I ran the currently released main branch without changing any code,
and these are my results:

  • Qwen3.6-35B-A3B as target model:

No Skill: 64.2%
Human Skill: 72.4%

  • gpt-5.4-mini as target model:

No Skill: 57.5%
Human Skill: 68.7%

The human skill comes from https://github.com/microsoft/SkillOpt/blob/main/skillopt/envs/alfworld/skills/initial.md

This finding is the opposite of Table 1 in the paper.
