Reproduce issue #93

I'm not sure whether anyone else has run into the same problem reproducing Table 1 of the paper.
On the ALFWorld benchmark, I ran the currently released main branch code without any changes,
and here are my results:

- Qwen3.6-35B-A3B as target model:

> `No Skill`: 64.2%
> `Human Skill`: 72.4%

- gpt-5.4-mini as target model:

> `No Skill`: 57.5%
> `Human Skill`: 68.7%

The human skill comes from `https://github.com/microsoft/SkillOpt/blob/main/skillopt/envs/alfworld/skills/initial.md`.
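
To make the gap explicit, here is a minimal sketch that computes the difference between the two settings from the numbers I reported above. The dictionary layout and the `skill_gain` helper are only my own illustration and are not part of the SkillOpt code.

```python
# Success rates (%) copied from the results reported in this issue.
results = {
    "Qwen3.6-35B-A3B": {"No Skill": 64.2, "Human Skill": 72.4},
    "gpt-5.4-mini": {"No Skill": 57.5, "Human Skill": 68.7},
}


def skill_gain(scores):
    """Absolute gain, in percentage points, of Human Skill over No Skill.

    Hypothetical helper for illustration only.
    """
    return round(scores["Human Skill"] - scores["No Skill"], 1)


gains = {model: skill_gain(scores) for model, scores in results.items()}
for model, gain in gains.items():
    print(f"{model}: {gain:+.1f} points")
```

In my runs, `Human Skill` is ahead of `No Skill` by 8.2 points for Qwen3.6-35B-A3B and by 11.2 points for gpt-5.4-mini.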

This finding is the opposite of what Table 1 of the paper reports.


