Fix wild cluster bootstrap p-value / CI inconsistency (WCR + test inversion, #543) by igerber · Pull Request #546 · igerber/diff-diff

igerber · 2026-06-24T17:00:36Z

Summary

Fixes #543 — DifferenceInDifferences/TwoWayFixedEffects with inference="wild_bootstrap" returned a p-value that contradicted its own confidence interval (e.g. CI [2.30, 2.64] excluding 0, yet p = 0.86).

Root cause: wild_bootstrap_se claimed to run the Wild Cluster Restricted (WCR) bootstrap but never imposed the null — it re-fit the full design (keeping the treatment column) to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap centered on the estimate instead of 0. The p-value mean(|t*| ≥ |t₀|) then measured noise around the estimate (≈0.5–0.86 regardless of significance) while the percentile CI happened to look fine.
Fix: genuinely impose H₀ (drop the coefficient's column for the restricted fit), studentize with the analytical CR1 SE, and derive the CI by test inversion so 0 ∈ CI ⟺ p ≥ alpha holds exactly. Few-cluster Rademacher cases enumerate the full sign-vector set (deterministic), matching R's fwildclusterboot::boottest.
New p_val_type parameter ("two-tailed" default | "equal-tailed"), surfaced on WildBootstrapResults and DiDResults (summary / to_dict / get_params).
Behavior change: prior wild_bootstrap p_value/conf_int are corrected (a true effect is now significant); reported se is now the analytical cluster-robust (CR1) SE (≈unchanged in well-behaved cases).
Bonus robustness: saturated cluster/HC1 designs (no residual DOF) now return NaN inference instead of raising ZeroDivisionError (shared linalg guard); a genuinely unbounded inverted CI is ±inf, never mixed finite/NaN.

Methodology references (required if estimator / math changes)

Method name(s): Wild Cluster Restricted (WCR) bootstrap for DifferenceInDifferences / TwoWayFixedEffects.
Paper / source link(s): Cameron, Gelbach & Miller (2008, REStat); Roodman, MacKinnon, Nielsen & Webb (2019, Stata J. — boottest); R package fwildclusterboot.
Any intentional deviations from the source (and why): (1) the p-value is floored at 1/(n_valid+1) to avoid an exact 0, but only when that floor stays below alpha, so it never flips a verdict relative to the inverted CI (boottest can report 0); (2) se/t_stat are analytical CR1 while p_value/conf_int come from the bootstrap inversion (the boottest reporting convention). Both documented in docs/methodology/REGISTRY.md §"Wild cluster bootstrap (WCR)".

Validation

Tests added/updated: tests/test_wild_bootstrap.py — the [Bug]: Result of Wild Cluster Bootstrap is inconsistent to Cluster-robust #543 significance regression, p↔CI consistency, null calibration, enumeration determinism, p_val_type propagation, low-effective-draw floor consistency, degenerate/saturated-design NaN handling, an independent full-refit WCR brute-force equivalence test, and a skip-guarded fwildclusterboot::boottest golden parity test.
R parity: validated against fwildclusterboot::boottest() defaults (benchmarks/R/generate_wild_cluster_boot_golden.R + checked-in golden) — bootstrap t-distribution matches to ~6e-14; se/t/interior-p exact; inverted CI to ~1e-4.
Suites run locally (pure-Python + Rust backends): wild bootstrap (57), linalg/estimators/vcov_type/twfe/hc2_bm/wls_cr2 (531), Rust equivalence (70); 01_basic_did.ipynb re-executes (nbmake). Local AI review (codex/gpt-5.5, full-scope) converged clean over 5 rounds.

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

…#543) `DifferenceInDifferences`/`TwoWayFixedEffects` with `inference="wild_bootstrap"` returned a p-value that contradicted its own confidence interval (issue #543: CI [2.30, 2.64] excluding 0, yet p=0.86). `wild_bootstrap_se` claimed to run the Wild Cluster Restricted (WCR) bootstrap but never imposed the null — it re-fit the full design to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap centered on the estimate instead of 0. Now genuinely imposes H0 (drops the coefficient column), studentizes with the analytical CR1 SE, and derives the CI by test inversion so p and CI are exactly consistent (0 in CI <=> p >= alpha). Few-cluster Rademacher cases enumerate the full sign-vector set (deterministic), matching R `fwildclusterboot::boottest`. - New `p_val_type` param ("two-tailed" default | "equal-tailed"). - Reported `se` is now the analytical cluster-robust (CR1) SE. - R-parity: bootstrap t-distribution matches boottest to ~6e-14; se/t/interior-p exact; CI to ~1e-4. Validated via committed golden + independent brute-force. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…val_type propagation) - P1: p-value floor no longer breaks p/CI consistency. The floor 1/(n_valid+1) is now applied only when it is < alpha; with very few valid draws it can exceed alpha, and flooring there would flip a bootstrap-significant result (0 outside the inverted CI) to "non-significant", re-creating the #543 contradiction. When the floor would cross alpha the raw p (which boottest also reports) is used. - P1: p_val_type is now surfaced on WildBootstrapResults (+ summary) and on the high-level DiDResults (DiD + TWFE: field, summary, to_dict; None for analytical). - P2: docs/docstrings/tutorial/REGISTRY corrected — the two-tailed inverted CI may be asymmetric (was incorrectly called "symmetric"). - P2: deferred the CI-inversion B-by-n allocation perf note to TODO.md. - Tests: low-effective-draw floor-consistency regression (RNG B=9 and enumerated G=4), p_val_type-on-result assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ish) - P2: document the n_clusters<=20 enumeration cap in REGISTRY + wild_bootstrap_se and _wild_weight_matrix docstrings (the deterministic-enumeration claim holds for G<=20; beyond that, sampling is used). - P3: clarify the now-ignored `residuals` parameter docstring (retained for backward compatibility; the WCR path recomputes residuals internally). - P3: add a TwoWayFixedEffects wild-bootstrap integration test asserting p_val_type propagation + p/CI consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- P0: never return mixed finite/NaN inference. A genuinely unbounded inverted CI is now represented with +/-inf (not NaN), preserving 0 in CI <=> p >= alpha; the (se,t,p,CI) family stays all-or-nothing. - P0: saturated designs (no residual DOF, n == k_eff) no longer raise ZeroDivisionError. wild_bootstrap_se detects it before requesting the cluster vcov; the shared linalg HC1/CR1 adjustment (NumPy) guards n_eff <= k -> NaN vcov; and the estimator stores a NaN vcov when the bootstrap is degenerate. Both backends now return NaN (not a crash / not +inf) on saturated designs. - Tests: saturated 2x2 (G=2, n==k) -> all-NaN, no crash; near-zero-SE 2-cluster -> no mixed finite/NaN, consistent (unbounded +/-inf CI). - P3: fix stale results.py inference-label comment (WCR reports analytical CR1 SE with bootstrap p/CI); document the enumeration condition in tutorial 01. - CHANGELOG: note the saturated-design CR1 robustness fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Clarify the Rademacher enumeration trigger across the wild_bootstrap_se and _wild_weight_matrix docstrings, the few-cluster warning, and the R golden generator comment: enumeration fires when 2**(n_clusters-1) <= n_bootstrap (the boottest trigger, = the number of distinct |t| sign-classes), and the library then materializes all 2**n_clusters raw sign-vectors, reporting n_bootstrap = 2**n_clusters. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T17:07:07Z

PR Review: ⚠️ Needs Changes

Executive Summary

P1 methodology/default-behavior issue: Rademacher enumeration trigger is documented as matching fwildclusterboot, but the implementation enumerates at 2**(G-1) <= n_bootstrap, while the cited package docs describe enumeration when B exceeds 2**G.
The core WCR fix otherwise follows the registry: null imposition, CR1 studentization, test inversion, NaN all-or-nothing handling, and p_val_type propagation are wired through DiD/TWFE.
Documented deviations in REGISTRY.md (p-value floor and mixed CR1 SE with bootstrap p/CI) are not defects under the review rules.
I could not run tests in this sandbox because numpy is not installed.

Methodology

Finding 1: Rademacher enumeration trigger does not match the cited R behavior

Severity: P1
Location: diff_diff/utils.py:L617-L620, docs/methodology/REGISTRY.md:L107-L110
Impact: The implementation enumerates all 2**G Rademacher sign vectors when 2**(G-1) <= n_bootstrap. The registry states this is the same full-enumeration trigger as fwildclusterboot::boottest, but the cited package documentation says full enumeration happens when B exceeds the number of possible draw combinations, 2**G. This changes default behavior at G=10, where n_bootstrap=999 will enumerate and report 1024 draws here, while the documented R behavior would still sample 999 draws. That can change p-values/CIs and makes the parity claim incomplete.
Concrete fix: Either change the trigger to 2**n_clusters <= n_bootstrap (or strict > if matching the R docs exactly), or add a clearly labeled **Deviation from R:** note in REGISTRY.md and update docs/tests to cover the intentional G=10, n_bootstrap=999 behavior.

No other unmitigated methodology findings. The WCR null imposition, CR1 studentization, test inversion CI, p-value floor, and CR1 SE reporting are either aligned with the registry or explicitly documented there.

Code Quality

No P1+ findings.

P3 informational: wild_bootstrap_se() is public, but the new reduced-model construction can leave X_reduced with zero columns for a one-regressor design (diff_diff/utils.py:L834-L837). Estimator-internal DiD/TWFE paths do not hit this, but a small standalone utility test would be useful.

Performance

No blocking findings.

P3 informational: The CI inversion allocation cost is already tracked in TODO.md:L169, so no action required for this PR.

Maintainability

No P1+ findings.

The p_val_type parameter is propagated through constructor params, bootstrap results, high-level results, summaries, dictionaries, and TWFE inheritance.

Tech Debt

No untracked blocking tech debt found.

The main performance deferral for repeated _t_star(r) allocations is tracked in TODO.md.

Security

No findings.

Documentation/Tests

Finding 2: Missing parity coverage for the changed enumeration boundary

Severity: P2
Location: tests/test_wild_bootstrap.py:L934-L941, diff_diff/utils.py:L617-L620
Impact: Existing enumeration tests cover G=6, where both 2**(G-1) and 2**G are below the default-style bootstrap count. They do not cover the boundary where the implementation differs from the cited R docs, especially G=10, n_bootstrap=999.
Concrete fix: Add a test that pins the intended behavior at G=10, n_bootstrap=999: either assert sampled behavior matching R docs, or assert deterministic enumeration and document it as a deviation.

Path to Approval

Resolve the Rademacher enumeration-trigger mismatch by either matching the cited R trigger or documenting it as **Deviation from R:** in docs/methodology/REGISTRY.md.
Add a boundary test for G=10, n_bootstrap=999 that verifies the chosen behavior and the reported n_bootstrap.

…rboot (CI review P1) The Rademacher full-enumeration trigger was `2**(n_clusters-1) <= n_bootstrap`, but fwildclusterboot::boottest switches to enumeration only when n_bootstrap reaches the number of possible draws, `2**n_clusters` (verified empirically: for G=10, boottest samples at B=1023 and enumerates at B=1024). The old trigger made diff-diff enumerate (and report 2**G draws) at e.g. G=10, B=999 where R samples 999 — a real R-parity divergence. Trigger is now `2**n_clusters <= n_bootstrap`, matching boottest exactly at the boundary; docstrings, the few-cluster warning, REGISTRY, tutorial, and the R golden comment updated accordingly. (Only 2**(G-1) of the enumerated draws have distinct |t*|, but the full set is materialized — that count is not the trigger.) Also guard a degenerate single-regressor design (reduced model has zero columns) so wild_bootstrap_se no longer raises IndexError; the restricted residuals are then the variables themselves. Tests: enumeration-boundary parity at G=10 (sampled at B=999, deterministic at B=1024) and a single-regressor no-crash test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T17:53:20Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c9dcf2267bcf714e924987742bc9202d2694dcc2

PR Review: ✅ Looks Good

Overall Assessment

No unmitigated P0 or P1 findings.

Executive Summary

Previous P1 finding is resolved: Rademacher enumeration now uses 2**n_clusters <= n_bootstrap, and the registry/test boundary coverage now pins G=10, B=999/1024.
WCR methodology is documented in docs/methodology/REGISTRY.md:L93-L145 and implemented consistently in diff_diff/utils.py:L633-L997.
Documented deviations, including the p-value floor and mixed CR1 SE with bootstrap p/CI, are clearly labeled in the registry and are not defects.
p_val_type propagation covers constructor, get_params/set_params, bootstrap calls, DiD/TWFE results, summary(), and to_dict().
I could not run tests in this sandbox because importing the package fails with ModuleNotFoundError: No module named 'numpy'.

Methodology

No unmitigated findings.

Severity: P3 informational
Impact: The prior enumeration-trigger mismatch has been addressed. _wild_weight_matrix() now enumerates only when 2**n_clusters <= n_bootstrap, and the registry documents the same boundary. Boundary coverage exists in tests/test_wild_bootstrap.py:L1384-L1400.
Concrete fix: No action required.

Code Quality

No findings.

Performance

Severity: P3 informational
Impact: CI inversion repeatedly materializes bootstrap arrays in _t_star(r), which can be costly for large panels or large B. This is already tracked in TODO.md:L169, so it is not blocking under the deferred-work rules.
Concrete fix: No action required for this PR.

Maintainability

No findings.

Tech Debt

No untracked blocking tech debt found. The new wild-bootstrap allocation concern is tracked in TODO.md:L169.

Security

No findings.

Documentation/Tests

No blocking findings.

The PR adds methodology registry coverage, changelog notes, R golden fixtures, independent brute-force WCR checks, p/CI consistency tests, degenerate NaN handling, p_val_type propagation tests, and the prior G=10 enumeration boundary test. Tests were not executed here because numpy is unavailable in the review environment.

…ducibility tolerances Two CI failures, both non-logic: 1. Sphinx `-W` docs build: the new `p_val_type` parameter docstring had a bulleted list (with a multi-line item) not preceded by a blank line, which docutils flagged as "Block quote ends without a blank line". Added the required blank line before the list. 2. Wild-bootstrap reproducibility/determinism tests asserted bit-exact equality on `se` / `conf_int`, which now derive from the analytical cluster-robust (CR1) vcov solve. That solve is bit-reproducible only to ~1e-13 on threaded BLAS / the Rust backend, so two identical runs differed at ~1e-13–1e-15 and the exact `==` failed on every Rust-backend leg (the old `se` was a pure-numpy bootstrap-coef std and avoided that path). Relaxed the four affected assertions to `pytest.approx(rel=1e-9)` — far above the noise, still a strong reproducibility check — consistent with how the suite already tolerates BLAS-level swings. `p_value` (count-based) and `n_bootstrap` stay exact. Verified: tests/test_wild_bootstrap.py passes on both the python and rust backends (59/59). Pure Python Fallback, all Rust legs, notebooks, and docs legs already passed on the prior run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T19:18:54Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7907d83740db513a3b824a0a03fdda1ea7b6de6f

PR Review: ⚠️ Needs Changes

Overall Assessment

One newly identified unmitigated P1 remains: the estimator wrapper can still crash on rank-deficient nuisance designs after the low-level WCR helper has successfully handled them.

Executive Summary

Previous enumeration-boundary finding appears resolved: Rademacher enumeration now uses 2**n_clusters <= n_bootstrap, matching the registry.
The WCR methodology is documented in docs/methodology/REGISTRY.md:L93-L145, including the CR1 SE / bootstrap p-CI convention and p-value floor deviation.
p_val_type propagation is covered across constructor, get_params(), bootstrap call, result object, summary(), and to_dict().
[Newly identified] The wrapper still computes storage vcov with compute_robust_vcov(X, ...) on the original full matrix, which can raise on rank-deficient nuisance columns despite the WCR helper dropping them internally.
I could not run tests in this sandbox because numpy is unavailable.

Methodology

Finding 1: [Newly identified] Rank-deficient nuisance handling is not preserved by the estimator wrapper

Severity: P1

Impact: The registry says WCR drops rank-deficient nuisance columns so the identified ATT and bootstrap stay finite (docs/methodology/REGISTRY.md:L102-L106). wild_bootstrap_se() implements that by identifying kept columns and operating on X_eff (diff_diff/utils.py:L795-L829). However, after successful bootstrap inference, _run_wild_bootstrap_inference() recomputes the stored vcov via compute_robust_vcov(X, residuals, cluster_ids) on the original full design (diff_diff/estimators.py:L829-L837). compute_robust_vcov() does not apply the solve_ols rank-deficient dropped-column expansion path; it directly inverts/solves against the supplied full matrix (diff_diff/linalg.py:L1439-L1648, diff_diff/linalg.py:L2682-L2694). For TWFE full-dummy designs, this is reachable with collinear nuisance columns; the code itself documents that such columns are expected and should not poison ATT estimation (diff_diff/twfe.py:L319-L329).

Concrete fix: Compute the storage vcov through the same rank-aware path as the bootstrap, for example solve_ols(X, y, cluster_ids=cluster_ids, return_vcov=True, rank_deficient_action="silent"), and use its expanded vcov. Alternatively, have wild_bootstrap_se() return the CR1 vcov/kept mask it already computes. Ensure tests/test_estimators_vcov_type.py::test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy passes and consider adding the same case for DifferenceInDifferences(... fixed_effects=...).

Code Quality

No additional findings.

Performance

Severity: P3 informational

Impact: CI inversion repeatedly materializes bootstrap arrays inside _t_star(r), which can be costly for large panels or large B. This is tracked in TODO.md:L169, so it is not blocking.

Concrete fix: No action required for this PR.

Maintainability

No additional findings.

Tech Debt

The wild-bootstrap allocation concern is properly tracked in TODO.md:L169. The P1 above is not deferrable because it contradicts the documented rank-deficient nuisance contract and can raise in production.

Security

No findings.

Documentation/Tests

Severity: P2

Impact: The relevant rank-deficient TWFE fixture already exists (tests/test_estimators_vcov_type.py:L830-L879), but I could not execute it here because importing numpy fails with ModuleNotFoundError.

Concrete fix: After fixing the wrapper vcov path, run at least pytest tests/test_estimators_vcov_type.py::TestTWFEVcovType::test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy tests/test_wild_bootstrap.py.

Path to Approval

Replace _run_wild_bootstrap_inference()’s storage vcov computation with a rank-aware CR1 vcov path, such as solve_ols(..., cluster_ids=cluster_ids, return_vcov=True, rank_deficient_action="silent").
Confirm the existing TWFE rank-deficient full-dummy wild-bootstrap test passes.
Add or extend a base DifferenceInDifferences fixed-effects/covariate rank-deficient nuisance test to cover the shared wrapper outside TWFE.

The estimator stored the cluster-robust vcov via compute_robust_vcov(X, ...) on the full design, which inverts X'X directly and raises ValueError (or returns garbage) when a nuisance column is collinear — e.g. a fixed-effect dummy collinear with treatment on a full-dummy design — even though the ATT is identified and wild_bootstrap_se itself drops such columns internally. Verified: the storage call receives a rank-deficient X (rank 22 of 23) in the existing TWFE full-dummy test, and compute_robust_vcov raises on an exactly-singular X. Fix: compute the stored vcov through the rank-aware solve_ols(..., rank_deficient_action="silent") path, which drops collinear columns and NaN-expands the vcov for them — bit-identical to compute_robust_vcov on full-rank designs (verified, ~5e-17). Removed the now-unused compute_robust_vcov import. Test: a DiD fixed_effects design with a dummy that EXACTLY duplicates the treatment indicator (singular X'X) — wild-bootstrap fit stays finite, no crash, stored vcov NaN-expanded for the dropped column. Existing TWFE rank-deficient full-dummy test still passes (both backends). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T19:41:41Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 91a458d4fed9837bcdc4c9202f4046e712bab66f

PR Review: ⚠️ Needs Changes

Overall Assessment

One unmitigated P1 remains: the new saturated-design HC1/CR1 guard can bypass the existing cluster-count assumption check for analytical cluster-robust variance.

Executive Summary

Previous P1 is resolved: _run_wild_bootstrap_inference() now stores vcov through rank-aware solve_ols(..., rank_deficient_action="silent"), matching the WCR helper’s rank-deficient nuisance handling.
WCR methodology is now documented in docs/methodology/REGISTRY.md:L93-L145, including null imposition, CR1 studentization, test-inversion CI, enumeration, and documented R deviations.
p_val_type propagation appears complete across constructor, get_params(), bootstrap call, DiDResults, TwoWayFixedEffects, summary(), and to_dict().
[Newly identified] The shared HC1/CR1 saturated-design guard returns NaN before validating that cluster-robust SEs have at least two clusters.
I could not run tests in this sandbox: pytest is unavailable and importing numpy fails.

Methodology

Finding 1: [Newly identified] Saturated CR1 guard masks the “at least two clusters” assumption check

Severity: P1

Impact: compute_robust_vcov() now returns an all-NaN vcov whenever n_eff - k <= 0 before it checks cluster_ids. For a saturated analytical cluster-robust fit with only one cluster, this silently changes the contract from “invalid cluster-robust inference, raise ValueError” to “degenerate NaN inference.” Cluster-robust SEs require at least two clusters, and the pre-existing check still exists but is now unreachable in saturated one-cluster cases. The WCR helper itself still checks n_clusters < 2, so this is not a WCR p-value bug; it is a shared CR1/HC1 assumption-check regression introduced by the new guard. See diff_diff/linalg.py:L2626-L2636 before the cluster validation at diff_diff/linalg.py:L2649-L2671.

Concrete fix: Move the saturated NaN return after cluster-count validation on the clustered path. For cluster_ids is not None, compute effective clusters first, raise if < 2, then return NaN only if residual DOF are exhausted. Add a regression test such as solve_ols(np.eye(2), y, cluster_ids=np.zeros(2)) or direct compute_robust_vcov(np.eye(2), residuals, np.zeros(2)) expecting ValueError.

Informational: WCR deviations are documented

Severity: P3 informational

Impact: The analytical CR1 se/t_stat plus bootstrap-inverted p_value/conf_int, and the conditional p-value floor, are explicitly documented as intentional in docs/methodology/REGISTRY.md:L136-L145. Under the review rules, these are not defects.

Concrete fix: No action required.

Code Quality

No additional findings.

Performance

Finding: WCR inversion allocation cost is tracked

Severity: P3 informational

Impact: _t_star(r) is repeatedly evaluated during CI inversion and materializes large intermediate arrays. This is already tracked in TODO.md:L169, so it is not blocking.

Concrete fix: No action required for this PR.

Maintainability

No findings. The p_val_type surface is consistently threaded through the affected estimator/result paths.

Tech Debt

No untracked blocking tech debt found. The WCR allocation concern is properly recorded in TODO.md:L169.

Security

No findings.

Documentation/Tests

Finding: Missing regression for saturated one-cluster CR1 assumption check

Severity: P2

Impact: Existing cluster-count tests cover non-saturated one-cluster designs, but the new n_eff - k <= 0 branch needs a direct test to prevent masking the same assumption check in saturated designs.

Concrete fix: Add a focused linalg test for saturated n == k with one cluster that asserts ValueError("at least 2 clusters").

Verification note: git diff --check passed. I could not execute pytest because pytest is not installed and numpy import fails in this sandbox.

Path to Approval

In diff_diff/linalg.py, validate effective cluster count before returning NaN for saturated clustered HC1/CR1 designs.
Add a regression test for a saturated one-cluster clustered fit that must raise the existing “at least 2 clusters” error.
Run at least tests/test_linalg.py and tests/test_wild_bootstrap.py after the fix.

…(CI review P1) The saturated-design guard added earlier (n_eff <= k -> NaN vcov) was placed before the cluster-robust "need >= 2 clusters" validation, so a 1-cluster *saturated* cluster-robust request returned all-NaN instead of raising the documented ValueError (verified: compute_robust_vcov(eye(2), 0, zeros(2)) now raises; a non-saturated 1-cluster fit always raised). The cluster-count check (with the zero-total-weight effective-cluster exclusion) now runs first on the clustered path; the saturated NaN return only applies once that assumption holds. Multi-cluster saturated designs still return NaN (no ZeroDivisionError), and full-rank fits are unchanged. Tests: 1-cluster saturated -> ValueError; 2-cluster saturated -> NaN vcov (tests/test_linalg.py::TestComputeRobustVcov). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T19:51:35Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8bd7171b566532a68aa7bdab18396b1e05423f74

PR Review: ✅ Looks Good

Overall Assessment

✅ Looks good. No unmitigated P0/P1 findings remain. The prior P1 about the saturated-design CR1 guard masking the one-cluster assumption check is resolved.

Executive Summary

Previous P1 is addressed: cluster-count validation now runs before the saturated-design NaN return, with a direct regression test.
WCR changes match the methodology registry: null imposition, CR1 studentization, strict-tail p-values, Rademacher enumeration, and test-inversion CIs are documented and implemented.
The analytical CR1 se/t_stat plus bootstrap-inverted p_value/conf_int mix is explicitly documented, so it is not a defect.
p_val_type propagation appears complete across constructor, get_params(), DiD/TWFE fitting, result objects, summaries, and to_dict().
One P2 robustness issue remains in the duplicated early cluster-count check for direct public calls with pandas Series cluster IDs and zero weights.
I could not run tests in this sandbox: pytest is not installed and numpy is unavailable.

Methodology

Finding: Prior saturated one-cluster CR1 issue is resolved

Severity: P3 informational
Impact: The earlier assumption-check regression is fixed. compute_robust_vcov() now validates effective cluster count before returning NaN for saturated HC1/CR1 designs, so a saturated one-cluster cluster-robust request still raises the documented “at least 2 clusters” error. See diff_diff/linalg.py:L2626-L2652 and tests/test_linalg.py:L534-L548.
Concrete fix: No action required.

Finding: WCR deviations are documented

Severity: P3 informational
Impact: The WCR methodology changes affect DifferenceInDifferences and inherited TwoWayFixedEffects under inference="wild_bootstrap". The registry documents the intentional choices: restricted null imposition, CR1 studentization, strict-tail p-values, test inversion, Rademacher enumeration, analytical CR1 se/t_stat, and the conditional p-value floor. These match the implementation in diff_diff/utils.py:L650-L667 and diff_diff/utils.py:L831-L998, and the documented deviations in docs/methodology/REGISTRY.md:L93-L145 are not defects under the review rules.
Concrete fix: No action required.

Code Quality

Finding: Early weighted cluster-count check should normalize `cluster_ids`

Severity: P2
Impact: The new early cluster-count validation duplicates the later cluster branch but does not convert cluster_ids to an array before pd.Series(weights).groupby(cluster_ids).sum(). For direct public calls that pass cluster_ids as a pandas Series with a non-RangeIndex and zero p/aweights, pandas can align by index against the newly-created Series(weights), miscounting effective clusters before the later branch’s np.asarray(cluster_ids) normalization would run. See diff_diff/linalg.py:L2632-L2636.
Concrete fix: In the early check, set cluster_ids_arr = np.asarray(cluster_ids) and use that for both np.unique(...) and the weighted groupby. Add a regression with non-default-index pd.Series cluster IDs and zero weights.

Performance

Finding: WCR inversion allocation cost is tracked

Severity: P3 informational
Impact: _t_star(r) repeatedly materializes large bootstrap arrays during CI inversion. This is already tracked in TODO.md:L169, so it is accepted deferred work.
Concrete fix: No action required for this PR.

Maintainability

No findings. The new p_val_type surface is consistently threaded through the affected estimator and result paths.

Tech Debt

No untracked blocking tech debt found. The WCR performance limitation is tracked in TODO.md.

Security

No findings. I did not see credential-like material in the changed files.

Documentation/Tests

No blocking findings. The PR adds registry documentation, tutorial prose, R golden-generation fixtures, WCR parity tests, p-value/CI consistency tests, p_val_type propagation tests, and saturated-design regressions. Verification was limited by the sandbox: pytest and numpy are unavailable here.

…eview P2) The early cluster-count validation (added in the prior commit) grouped the zero-weight check with the raw `cluster_ids`. For a direct public compute_robust_vcov call passing `cluster_ids` as a non-default-index pandas Series alongside zero p/aweights, `pd.Series(weights).groupby(cluster_ids)` index-aligns the Series grouper against the freshly-created Series(weights) (default RangeIndex) and miscounts effective clusters (verified: counts 0 for a valid 3-cluster design), wrongly raising "need >= 2 clusters". Now normalize via np.asarray(cluster_ids) first, matching the later cluster branch. (Estimator paths pass numpy arrays and were unaffected.) Test: non-default-index Series cluster IDs + a zero weight on a valid 3-cluster design returns a finite vcov, no wrongful raise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-24T19:57:04Z

🔁 AI review rerun (requested by @igerber)

Head SHA: dec0df2b8efb7dbc79a657078ce5c03b18c6beb3

PR Review: ✅ Looks Good

Overall Assessment

✅ Looks good. I found no unmitigated P0/P1 issues in the PR changes.

Executive Summary

The prior one-cluster saturated CR1 concern is resolved: cluster-count validation now runs before the saturated-design NaN guard.
The prior pandas Series cluster-id / zero-weight miscount concern is also resolved and regression-tested.
WCR changes for DifferenceInDifferences and TwoWayFixedEffects match the new Methodology Registry contract: null imposition, CR1 studentization, strict-tail p-values, Rademacher enumeration, and test-inversion CIs.
Documented WCR deviations from R are properly labeled in REGISTRY.md, so they are informational, not defects.
p_val_type propagation appears complete across constructor, get_params() / set_params(), low-level WCR call, results, summary, to_dict(), and TWFE inheritance.
I could not run tests in this sandbox: pytest and numpy are not installed.

Methodology

Finding: WCR methodology matches documented contract

Severity: P3 informational
Impact: Affected methods are DifferenceInDifferences and inherited TwoWayFixedEffects under inference="wild_bootstrap". The implementation now imposes the null through reduced-design residualization, studentizes with analytical CR1, uses strict tail counts, enumerates few-cluster Rademacher sign vectors, and builds CIs by test inversion. These are documented in docs/methodology/REGISTRY.md:L93-L145 and implemented in diff_diff/utils.py:L831-L998.
Concrete fix: No action required.

Finding: Documented deviations from R are acceptable

Severity: P3 informational
Impact: The analytical CR1 se / t_stat plus bootstrap p_value / conf_int mix, and the conditional p-value floor, are explicitly documented with Note / Deviation from R labels in docs/methodology/REGISTRY.md:L136-L145. Under the review rules, these are not defects.
Concrete fix: No action required.

Code Quality

Finding: Previous cluster-count normalization issue is resolved

Severity: P3 informational
Impact: The early CR1 cluster-count check now normalizes cluster_ids with np.asarray(...) before grouping weights, preventing pandas index-alignment miscounts. See diff_diff/linalg.py:L2632-L2644 and regression coverage in tests/test_linalg.py:L550-L567.
Concrete fix: No action required.

Performance

Finding: WCR inversion allocation cost is tracked

Severity: P3 informational
Impact: _t_star(r) materializes large bootstrap arrays repeatedly during CI inversion. This is real overhead but already tracked as deferred technical debt in TODO.md:L169.
Concrete fix: No action required for this PR.

Maintainability

No findings. The new p_val_type surface is consistently threaded through estimator parameters and result metadata: diff_diff/estimators.py:L180-L219, diff_diff/estimators.py:L808-L820, diff_diff/estimators.py:L1019-L1028, diff_diff/twfe.py:L657-L723, and diff_diff/results.py:L135-L140.

Tech Debt

No untracked blocking tech debt found. The only material deferred item I saw is the WCR inversion allocation cost, already tracked in TODO.md.

Security

No findings. I did not see credential-like material or unsafe security-relevant changes in the diff.

Documentation/Tests

Finding: Verification limited by sandbox dependencies

Severity: P3 informational
Impact: I could not execute the targeted tests because pytest is unavailable, and local numerical checks are blocked because numpy is also unavailable. Static review found focused coverage for WCR correctness, R parity fixtures, p/CI consistency, low-draw floor consistency, p_val_type propagation, saturated-design NaN handling, and rank-deficient storage-vcov behavior in tests/test_wild_bootstrap.py:L896-L1482 and tests/test_linalg.py:L534-L567.
Concrete fix: No PR change required; rely on CI or run the affected test files in an environment with project dependencies installed.

igerber and others added 5 commits June 24, 2026 12:05

igerber added the ready-for-ci Triggers CI test workflows label Jun 24, 2026

igerber removed the ready-for-ci Triggers CI test workflows label Jun 24, 2026

igerber added the ready-for-ci Triggers CI test workflows label Jun 24, 2026

igerber merged commit 35f56f6 into main Jun 24, 2026
33 of 34 checks passed

igerber deleted the issue-543 branch June 24, 2026 22:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix wild cluster bootstrap p-value / CI inconsistency (WCR + test inversion, #543)#546

Fix wild cluster bootstrap p-value / CI inconsistency (WCR + test inversion, #543)#546
igerber merged 10 commits into
mainfrom
issue-543

igerber commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Jun 24, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented Jun 24, 2026

PR Review: ⚠️ Needs Changes

Executive Summary

Methodology

Finding 1: Rademacher enumeration trigger does not match the cited R behavior

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Finding 2: Missing parity coverage for the changed enumeration boundary

Path to Approval

Uh oh!

github-actions Bot commented Jun 24, 2026

PR Review: ✅ Looks Good

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented Jun 24, 2026

PR Review: ⚠️ Needs Changes

Overall Assessment

Executive Summary

Methodology

Finding 1: [Newly identified] Rank-deficient nuisance handling is not preserved by the estimator wrapper

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Path to Approval

Uh oh!

github-actions Bot commented Jun 24, 2026

PR Review: ⚠️ Needs Changes

Overall Assessment

Executive Summary

Methodology

Finding 1: [Newly identified] Saturated CR1 guard masks the “at least two clusters” assumption check

Informational: WCR deviations are documented

Code Quality

Performance

Finding: WCR inversion allocation cost is tracked

Maintainability

Tech Debt

Security

Documentation/Tests

Finding: Missing regression for saturated one-cluster CR1 assumption check

Path to Approval

Uh oh!

github-actions Bot commented Jun 24, 2026

PR Review: ✅ Looks Good

Overall Assessment

Executive Summary

Methodology

Finding: Prior saturated one-cluster CR1 issue is resolved

Finding: WCR deviations are documented

Code Quality

Finding: Early weighted cluster-count check should normalize cluster_ids

Performance

Finding: WCR inversion allocation cost is tracked

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

Finding: Early weighted cluster-count check should normalize `cluster_ids`