SelfAskRefusalScorer honors partial_content on blocked pieces #2083

Open
tejas0077 wants to merge 1 commit into microsoft:main from tejas0077:fix/refusal-scorer-partial-content

@tejas0077

Contributor

Fixes #2044 (sub-issue #2)

SelfAskRefusalScorer unconditionally returned refusal=True when response_error == "blocked", even when partial_content was available in prompt_metadata. This silently dropped potentially successful jailbreaks from red-team results, and the most evasive successes were exactly the ones being missed.

The fix sets score_blocked_content = True on SelfAskRefusalScorer so the base Scorer class handles partial content substitution via the existing _apply_blocked_content_substitution mechanism. When a blocked piece has partial_content, it is now scored via the LLM instead of being unconditionally treated as a clean refusal.
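The control flow described above can be illustrated with a small self-contained sketch. This is not the PyRIT implementation: the `Piece` and `ToyRefusalScorer` classes, the `keyword_judge` stand-in for the LLM, the method signatures, and the rationale strings are all assumptions made for illustration. Only the names `score_blocked_content`, `_apply_blocked_content_substitution`, `partial_content`, `prompt_metadata`, and `response_error` come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Piece:
    """Toy stand-in for a response piece; not PyRIT's real message type."""
    converted_value: str
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


@dataclass
class ToyRefusalScorer:
    """Toy scorer; only the flag and helper names mirror the PR description."""
    llm_judge: Callable[[str], bool]  # returns True when the text is a refusal
    score_blocked_content: bool = True

    def _apply_blocked_content_substitution(self, piece: Piece) -> Optional[str]:
        # Return the partial text to score, or None when nothing usable exists.
        if not self.score_blocked_content:
            return None
        partial = piece.prompt_metadata.get("partial_content")
        return partial if partial else None

    def score(self, piece: Piece) -> tuple[bool, str]:
        # Returns (is_refusal, rationale); the rationale text is a placeholder.
        if piece.response_error == "blocked":
            partial = self._apply_blocked_content_substitution(piece)
            if partial is None:
                return True, "Blocked response with no partial content."
            return self.llm_judge(partial), "Scored partial content of blocked response."
        return self.llm_judge(piece.converted_value), "Scored full response."


def keyword_judge(text: str) -> bool:
    """Hypothetical stand-in for the LLM call: flags obvious refusals."""
    return "cannot" in text.lower()


scorer = ToyRefusalScorer(llm_judge=keyword_judge)
leaked = Piece("", "blocked", {"partial_content": "Step 1: gather the following"})
empty = Piece("", "blocked")
```

In this sketch, `scorer.score(leaked)` sends the partial text to the judge, while `scorer.score(empty)` still short-circuits to a refusal. Setting `score_blocked_content=False` reproduces the behavior described as the bug, where every blocked piece is treated as a refusal.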

The rationale string for blocked responses with no partial content has also been updated to be more descriptive.

Tests and Documentation

Updated the existing test_score_async_filtered_response test to match the new rationale string. Added a new test, test_score_async_blocked_with_partial_content_scores_partial, which verifies that blocked pieces with partial content are forwarded to the LLM scorer instead of immediately returning refusal=True.
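A rough sketch of how such a "forwarded to the LLM" check can be written with `unittest.mock.AsyncMock` follows. It does not reproduce the tests in this PR: `score_blocked_piece` is a hypothetical helper standing in for the scorer, and the sample strings are made up.

```python
import asyncio
from unittest.mock import AsyncMock


async def score_blocked_piece(partial_content, llm_score_async):
    """Hypothetical helper: returns True when the piece counts as a refusal."""
    if not partial_content:
        return True  # nothing to judge, so treat the block as a refusal
    return await llm_score_async(partial_content)


async def check_partial_content_is_forwarded():
    llm = AsyncMock(return_value=False)  # mocked judge says "not a refusal"

    # Blocked piece with partial content: the judge must be awaited once.
    result = await score_blocked_piece("partial response text", llm)
    llm.assert_awaited_once_with("partial response text")
    assert result is False

    # Blocked piece without partial content: the judge must not be called.
    llm.reset_mock()
    assert await score_blocked_piece(None, llm) is True
    llm.assert_not_awaited()
    return True


asyncio.run(check_partial_content_is_forwarded())
```

Asserting on the mock's await count, not only on the returned score, is what distinguishes "scored by the LLM as a refusal" from "short-circuited to a refusal".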

Development

Successfully merging this pull request may close these issues.

Scorers conflate couldn't-score / errored / blocked / hedged with attack-did-not-succeed, under-reporting jailbreaks