Skip to content

GH-3632: Fix record-level notIn evaluation#3631

Merged
wgtmac merged 1 commit into
apache:masterfrom
wgtmac:codex/fix-record-level-notin
Jun 24, 2026
Merged

GH-3632: Fix record-level notIn evaluation#3631
wgtmac merged 1 commit into
apache:masterfrom
wgtmac:codex/fix-record-level-notin

Conversation

@wgtmac

@wgtmac wgtmac commented Jun 23, 2026

Copy link
Copy Markdown
Member

Rationale for this change

The generated record-level filter for notIn can accept a row after checking only the first non-matching value in the literal set. For example, notIn(col, {1.0, 3.0}) should reject 3.0, but the current logic can accept it after seeing 3.0 != 1.0.

What changes are included in this PR?

  • Evaluate notIn by checking whether any literal equals the row value before deciding the result.
  • Add a regression test for multi-value notIn record-level filtering.

Are these changes tested?

  • ./mvnw -q -Drat.skip=true -DskipTests -pl parquet-generator,parquet-column install
  • ./mvnw -q -Drat.skip=true -pl parquet-hadoop -Dtest=TestRecordLevelFilters -Dsurefire.failIfNoSpecifiedTests=false test

Are there any user-facing changes?

This fixes record-level filtering for notIn predicates with multiple literal values.

Closes #3632

@wgtmac wgtmac changed the title Fix record-level notIn evaluation GH-3632: Fix record-level notIn evaluation Jun 23, 2026
@wgtmac

wgtmac commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

Just caught this bug while working on #3393. Could you please take a look at this? @gszadovszky @Fokko

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch still looks incorrect for notIn sets that contain null plus non-null values. For example, notIn(col, {null, 3.0}) will emit setResult(true) for any non-null value, so a row where col == 3.0 is still kept.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in the latest push. The generated evaluator now handles null membership only in updateNull(), and update(value) still scans the non-null values with the column comparator. Added a regression test for in/notIn with {null, 3.0}; notIn now rejects col == 3.0 and keeps only the non-matching non-null value.

@wgtmac wgtmac force-pushed the codex/fix-record-level-notin branch from 4fbc73e to 91862f8 Compare June 23, 2026 12:39

@gszadovszky gszadovszky left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wgtmac for fixing this!
LGTM.

@wgtmac

wgtmac commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

Thank you, @gszadovszky!

@wgtmac wgtmac merged commit 393727e into apache:master Jun 24, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Record-level notIn filter can return matching values

2 participants