GH-3632: Fix record-level notIn evaluation#3631
Conversation
|
Just caught this bug while working on #3393. Could you please take a look at this? @gszadovszky @Fokko |
There was a problem hiding this comment.
This branch still looks incorrect for notIn sets that contain null plus non-null values. For example, notIn(col, {null, 3.0}) will emit setResult(true) for any non-null value, so a row where col == 3.0 is still kept.
There was a problem hiding this comment.
Fixed in the latest push. The generated evaluator now handles null membership only in updateNull(), and update(value) still scans the non-null values with the column comparator. Added a regression test for in/notIn with {null, 3.0}; notIn now rejects col == 3.0 and keeps only the non-matching non-null value.
4fbc73e to
91862f8
Compare
gszadovszky
left a comment
There was a problem hiding this comment.
Thanks @wgtmac for fixing this!
LGTM.
|
Thank you, @gszadovszky! |
Rationale for this change
The generated record-level filter for
notIncan accept a row after checking only the first non-matching value in the literal set. For example,notIn(col, {1.0, 3.0})should reject3.0, but the current logic can accept it after seeing3.0 != 1.0.What changes are included in this PR?
notInby checking whether any literal equals the row value before deciding the result.notInrecord-level filtering.Are these changes tested?
./mvnw -q -Drat.skip=true -DskipTests -pl parquet-generator,parquet-column install./mvnw -q -Drat.skip=true -pl parquet-hadoop -Dtest=TestRecordLevelFilters -Dsurefire.failIfNoSpecifiedTests=false testAre there any user-facing changes?
This fixes record-level filtering for
notInpredicates with multiple literal values.Closes #3632