Has anyone else had trouble reproducing Table 1 of the paper?
On the ALFWorld benchmark,
running the currently released main branch with no code changes,
I get the following results:
- Qwen3.6-35B-A3B as target model:
  - No Skill: 64.2%
  - Human Skill: 72.4%
- gpt-5.4-mini as target model:
  - No Skill: 57.5%
  - Human Skill: 68.7%

The human skill comes from https://github.com/microsoft/SkillOpt/blob/main/skillopt/envs/alfworld/skills/initial.md

This finding is the opposite of what Table 1 in the paper reports.
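
For reference, the gap between the two settings can be computed with a small sketch like the one below. The dictionary layout and the names `results` and `gains` are my own, used only for illustration, and are not part of the SkillOpt code; the numbers are the ones reported above.

```python
# Success rates (%) copied from the runs reported above.
# The structure and key names here are hypothetical, for illustration only.
results = {
    "Qwen3.6-35B-A3B": {"no_skill": 64.2, "human_skill": 72.4},
    "gpt-5.4-mini": {"no_skill": 57.5, "human_skill": 68.7},
}

# Gain of Human Skill over No Skill, in percentage points.
# Rounded to one decimal to avoid floating-point noise.
gains = {
    model: round(r["human_skill"] - r["no_skill"], 1)
    for model, r in results.items()
}

for model, gain in gains.items():
    print(f"{model}: {gain:+.1f} points with the human skill")
```

In both of my runs the human skill raises the success rate over the no-skill setting.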