[SC 16935] Fix IQROutliersBarPlot failure on boolean numeric #526

Merged
AnilSorathiya merged 3 commits into main from anilsorathiya/sc-16935/fix-iqroutliersbarplot-failure-on-boolean on Jun 29, 2026

Conversation

@AnilSorathiya

Contributor

Pull Request Description

What and why?

IQROutliersBarPlot failed with TypeError: numpy boolean subtract when a dataset included boolean columns in feature_columns_numeric (pandas treats bool as numeric).

Before: Boolean/binary columns could still be processed when building outlier_counts_by_feature, causing quantile() to fail on boolean data.
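The error text comes from NumPy itself, which refuses to apply the `-` operator to boolean arrays. The following is a minimal sketch of that underlying behavior only. It does not reproduce the ValidMind code path, and the helper name `boolean_subtract_error` is hypothetical.

```python
import numpy as np

flags = np.array([True, False, True, True])


def boolean_subtract_error(values):
    """Return the TypeError message raised by `values - values`, or None if it succeeds."""
    try:
        values - values
    except TypeError as exc:
        return str(exc)
    return None


# Subtracting boolean arrays raises a TypeError that mentions "boolean subtract".
message = boolean_subtract_error(flags)
print(message)

# The same values cast to integers subtract without error.
as_int = flags.astype(int)
print(boolean_subtract_error(as_int))
```

Any interpolation step that subtracts neighbouring values can hit this when the input is still boolean.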

After: Boolean and binary features are excluded from IQR outlier calculations, for both plots and raw data output. A column is kept only if is_bool_dtype() is false for it and nunique() > 2.
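The exclusion rule can be sketched as a small column filter. This is an illustration under stated assumptions, not the code merged in this PR: the helper names `iqr_eligible_columns` and `iqr_outlier_count`, the 1.5 threshold, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_bool_dtype


def iqr_eligible_columns(df, columns):
    """Hypothetical helper: keep non-bool columns with more than two distinct values."""
    return [
        col
        for col in columns
        if not is_bool_dtype(df[col]) and df[col].nunique() > 2
    ]


def iqr_outlier_count(series, threshold=1.5):
    """Count values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - threshold * iqr, q3 + threshold * iqr
    return int(((series < lower) | (series > upper)).sum())


rng = np.random.default_rng(0)
df = pd.DataFrame({
    "numeric": rng.normal(size=100),              # continuous, kept
    "flag": rng.choice([True, False], size=100),  # bool dtype, excluded
    "binary": rng.integers(0, 2, size=100),       # 0/1 integers, excluded
})

eligible = iqr_eligible_columns(df, list(df.columns))
counts = {col: iqr_outlier_count(df[col]) for col in eligible}
print(eligible)
print(counts)
```

Checking the dtype first means a boolean column is never passed to quantile(), while the distinct-value check also covers binary columns stored as 0/1 integers.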

How to test

pytest tests/unit_tests/data_validation/test_IQROutliersBarPlot.py -v

Or manually:

import numpy as np
import pandas as pd
import validmind as vm
from validmind.tests.data_validation.IQROutliersBarPlot import IQROutliersBarPlot

df = pd.DataFrame({
    "numeric": np.random.randn(100),
    "flag": np.random.choice([True, False], 100),
})
dataset = vm.init_dataset(input_id="repro", dataset=df, __log=False)
IQROutliersBarPlot(dataset)  # should complete without error

What needs special review?

  • Confirm excluding boolean columns from IQR analysis is the intended behavior (IQR is not meaningful for boolean features).

Dependencies, breaking changes, and deployment notes

  • No dependencies.
  • No breaking changes.
  • Boolean columns will no longer appear in IQROutliersBarPlot output.

Release notes

bug

Fixed a failure in IQROutliersBarPlot when datasets contain boolean feature columns. The test now skips boolean and binary features instead of raising a numpy boolean subtract error during IQR calculations.

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

AnilSorathiya requested a review from juanmleng on June 26, 2026 14:08
AnilSorathiya added the bug label on Jun 26, 2026
@juanmleng

Contributor

Nice fix overall — the is_bool_dtype guard is the right call and the intent is much clearer than the old unique() > 2 alone.

Two things worth addressing before merge:

  • The new test is checking the wrong attribute. test_boolean_dtype_excluded_from_raw_data asserts "flag" not in raw_data.outlier_counts_by_feature.index, but outlier_counts_by_feature is a DataFrame — .index holds row integers, not column names. So this test always passes regardless of whether the fix is working. It should be .columns instead.

  • IQROutliersTable has the same bug. It uses len(df[col].unique()) <= 2 without the is_bool_dtype guard, so a bool-dtype column would hit the same crash there. Worth patching in the same PR for consistency.
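The first point can be illustrated with a small DataFrame. This is a sketch only: the frame below is a hypothetical stand-in for raw_data.outlier_counts_by_feature, assuming one column per feature as described above, and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in: one column per feature, one row of outlier counts.
outlier_counts = pd.DataFrame({"numeric": [3], "flag": [0]})

# The default index holds row positions (0, 1, ...), so a feature name is never in it.
in_index = "flag" in outlier_counts.index

# Feature names live in .columns, which is where the membership check belongs.
in_columns = "flag" in outlier_counts.columns

print(in_index, in_columns)
```

Here "flag" is present as a column, yet an assertion of the form `"flag" not in outlier_counts.index` would still pass, which is why the check needs to target .columns.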

@github-actions

Contributor

PR Summary

This pull request refines the logic for outlier detection within the IQR methods by improving the handling of boolean and binary features. Previously, the code was only excluding binary features based on the number of unique values; now, it explicitly checks for boolean data types using pandas' is_bool_dtype, ensuring that both boolean and binary columns are omitted from the outlier calculations.

Key changes include:

  • Correcting the IQROutliersBarPlot test so that it asserts the 'flag' column is absent from the columns of the raw data's outlier counts, rather than from the index.
  • Adding a new test in IQROutliersTable to validate that boolean columns (specifically the 'flag' column) are excluded from both the outlier summary and the raw outlier records.
  • Updating the implementation in IQROutliersBarPlot and IQROutliersTable to use a more reliable check with pd.api.types.is_bool_dtype, which prevents errors (e.g., issues with numpy boolean subtraction) during the quantile computation for boolean columns.

These enhancements ensure that the outlier detection functionality will correctly process datasets containing boolean features without runtime errors, preserving the integrity of the analysis.

Test Suggestions

  • Add tests with datasets where all numeric features are also categorical (e.g., only two unique values) to ensure proper exclusion.
  • Include tests with mixed data types (boolean, binary, and continuous numerical features) to verify that only appropriate columns are processed.
  • Test edge cases where boolean columns might include non-standard representations (e.g., 0/1 instead of True/False) to ensure consistent behavior.
  • Perform performance tests on larger datasets to validate that the updated exclusion logic does not introduce significant overhead.

AnilSorathiya merged commit 1fbbd6b into main on Jun 29, 2026
34 of 35 checks passed
AnilSorathiya deleted the anilsorathiya/sc-16935/fix-iqroutliersbarplot-failure-on-boolean branch on June 29, 2026 13:13
hunner mentioned this pull request on Jun 30, 2026
11 tasks
Labels

bug Something isn't working

2 participants