SelfAskRefusalScorer honors partial_content on blocked pieces #2083
Open
tejas0077 wants to merge 1 commit into
Fixes #2044 (sub-issue #2)
SelfAskRefusalScorer unconditionally returned refusal=True when response_error == "blocked", even when partial_content was available in prompt_metadata. This silently dropped potentially successful jailbreaks from red-team results, and the most evasive successes were exactly the ones being missed.
The fix sets score_blocked_content = True on SelfAskRefusalScorer so the base Scorer class handles partial content substitution via the existing _apply_blocked_content_substitution mechanism. When a blocked piece has partial_content, it is now scored via the LLM instead of being unconditionally treated as a clean refusal.
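As a rough illustration of the decision described above, here is a minimal, self-contained sketch. It does not use the real PyRIT classes: `Piece` and `text_to_score` are hypothetical stand-ins, and the real substitution lives in the base Scorer's `_apply_blocked_content_substitution`, whose exact signature is not shown in this PR description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Piece:
    """Hypothetical stand-in for a response piece; not the real PyRIT class."""
    converted_value: str
    response_error: str = "none"
    prompt_metadata: dict = field(default_factory=dict)


def text_to_score(piece: Piece, score_blocked_content: bool) -> Optional[str]:
    """Return the text to send to the LLM scorer, or None to short-circuit as a refusal.

    Assumption: this mirrors the behavior described in the PR, not the library's code.
    """
    if piece.response_error != "blocked":
        # Normal responses are always scored on their own content.
        return piece.converted_value
    partial = piece.prompt_metadata.get("partial_content")
    if score_blocked_content and partial:
        # Blocked, but some content leaked before the filter: score that content.
        return partial
    # Blocked with nothing to score: treat as a refusal without calling the LLM.
    return None
```

With `score_blocked_content=False` (the old behavior in this sketch), a blocked piece always yields `None`, which is how a partially successful jailbreak would end up recorded as a refusal.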
The rationale string for blocked responses with no partial content has also been updated to be more descriptive.
Tests and Documentation
Updated the existing test_score_async_filtered_response test to match the new rationale string, and added a new test, test_score_async_blocked_with_partial_content_scores_partial, which verifies that blocked pieces with partial content are forwarded to the LLM scorer instead of immediately returning refusal=True.
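For readers who want the shape of such a test without opening the diff, the following is a hedged sketch only. `ToyRefusalScorer` and its `score_async` signature are invented for illustration and are not the PyRIT API; the point is the assertion pattern, where a mocked LLM judge must be awaited with the partial content.

```python
import asyncio
from unittest.mock import AsyncMock


class ToyRefusalScorer:
    """Hypothetical scorer used only to illustrate the test shape."""

    def __init__(self, llm_judge):
        self._llm_judge = llm_judge  # async callable: text -> bool (True means refusal)

    async def score_async(self, response_error: str, metadata: dict, text: str) -> bool:
        if response_error == "blocked":
            partial = metadata.get("partial_content")
            if not partial:
                return True  # nothing to judge, so report a refusal
            text = partial  # substitute the partial content for scoring
        return await self._llm_judge(text)


async def check_blocked_with_partial_content() -> bool:
    judge = AsyncMock(return_value=False)  # the mocked judge says "not a refusal"
    scorer = ToyRefusalScorer(judge)
    refused = await scorer.score_async("blocked", {"partial_content": "partial answer"}, "")
    # The key assertion: the judge was called, and with the partial content.
    judge.assert_awaited_once_with("partial answer")
    return refused


result = asyncio.run(check_blocked_with_partial_content())
```

The complementary case, a blocked piece with no partial content, would assert that the judge is never awaited and that the result is a refusal.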