Conversation
…#543) `DifferenceInDifferences`/`TwoWayFixedEffects` with `inference="wild_bootstrap"` returned a p-value that contradicted its own confidence interval (issue #543: CI [2.30, 2.64] excluding 0, yet p=0.86). `wild_bootstrap_se` claimed to run the Wild Cluster Restricted (WCR) bootstrap but never imposed the null — it re-fit the full design to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap centered on the estimate instead of 0. Now genuinely imposes H0 (drops the coefficient column), studentizes with the analytical CR1 SE, and derives the CI by test inversion so p and CI are exactly consistent (0 in CI <=> p >= alpha). Few-cluster Rademacher cases enumerate the full sign-vector set (deterministic), matching R `fwildclusterboot::boottest`. - New `p_val_type` param ("two-tailed" default | "equal-tailed"). - Reported `se` is now the analytical cluster-robust (CR1) SE. - R-parity: bootstrap t-distribution matches boottest to ~6e-14; se/t/interior-p exact; CI to ~1e-4. Validated via committed golden + independent brute-force. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…val_type propagation) - P1: p-value floor no longer breaks p/CI consistency. The floor 1/(n_valid+1) is now applied only when it is < alpha; with very few valid draws it can exceed alpha, and flooring there would flip a bootstrap-significant result (0 outside the inverted CI) to "non-significant", re-creating the #543 contradiction. When the floor would cross alpha the raw p (which boottest also reports) is used. - P1: p_val_type is now surfaced on WildBootstrapResults (+ summary) and on the high-level DiDResults (DiD + TWFE: field, summary, to_dict; None for analytical). - P2: docs/docstrings/tutorial/REGISTRY corrected — the two-tailed inverted CI may be asymmetric (was incorrectly called "symmetric"). - P2: deferred the CI-inversion B-by-n allocation perf note to TODO.md. - Tests: low-effective-draw floor-consistency regression (RNG B=9 and enumerated G=4), p_val_type-on-result assertions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ish) - P2: document the n_clusters<=20 enumeration cap in REGISTRY + wild_bootstrap_se and _wild_weight_matrix docstrings (the deterministic-enumeration claim holds for G<=20; beyond that, sampling is used). - P3: clarify the now-ignored `residuals` parameter docstring (retained for backward compatibility; the WCR path recomputes residuals internally). - P3: add a TwoWayFixedEffects wild-bootstrap integration test asserting p_val_type propagation + p/CI consistency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- P0: never return mixed finite/NaN inference. A genuinely unbounded inverted CI is now represented with +/-inf (not NaN), preserving 0 in CI <=> p >= alpha; the (se,t,p,CI) family stays all-or-nothing. - P0: saturated designs (no residual DOF, n == k_eff) no longer raise ZeroDivisionError. wild_bootstrap_se detects it before requesting the cluster vcov; the shared linalg HC1/CR1 adjustment (NumPy) guards n_eff <= k -> NaN vcov; and the estimator stores a NaN vcov when the bootstrap is degenerate. Both backends now return NaN (not a crash / not +inf) on saturated designs. - Tests: saturated 2x2 (G=2, n==k) -> all-NaN, no crash; near-zero-SE 2-cluster -> no mixed finite/NaN, consistent (unbounded +/-inf CI). - P3: fix stale results.py inference-label comment (WCR reports analytical CR1 SE with bootstrap p/CI); document the enumeration condition in tutorial 01. - CHANGELOG: note the saturated-design CR1 robustness fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clarify the Rademacher enumeration trigger across the wild_bootstrap_se and _wild_weight_matrix docstrings, the few-cluster warning, and the R golden generator comment: enumeration fires when 2**(n_clusters-1) <= n_bootstrap (the boottest trigger, = the number of distinct |t| sign-classes), and the library then materializes all 2**n_clusters raw sign-vectors, reporting n_bootstrap = 2**n_clusters. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR Review:
|
…rboot (CI review P1) The Rademacher full-enumeration trigger was `2**(n_clusters-1) <= n_bootstrap`, but fwildclusterboot::boottest switches to enumeration only when n_bootstrap reaches the number of possible draws, `2**n_clusters` (verified empirically: for G=10, boottest samples at B=1023 and enumerates at B=1024). The old trigger made diff-diff enumerate (and report 2**G draws) at e.g. G=10, B=999 where R samples 999 — a real R-parity divergence. Trigger is now `2**n_clusters <= n_bootstrap`, matching boottest exactly at the boundary; docstrings, the few-cluster warning, REGISTRY, tutorial, and the R golden comment updated accordingly. (Only 2**(G-1) of the enumerated draws have distinct |t*|, but the full set is materialized — that count is not the trigger.) Also guard a degenerate single-regressor design (reduced model has zero columns) so wild_bootstrap_se no longer raises IndexError; the restricted residuals are then the variables themselves. Tests: enumeration-boundary parity at G=10 (sampled at B=999, deterministic at B=1024) and a single-regressor no-crash test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR Review: ✅ Looks GoodOverall AssessmentNo unmitigated P0 or P1 findings. Executive Summary
MethodologyNo unmitigated findings. Severity: P3 informational Code QualityNo findings. PerformanceSeverity: P3 informational MaintainabilityNo findings. Tech DebtNo untracked blocking tech debt found. The new wild-bootstrap allocation concern is tracked in SecurityNo findings. Documentation/TestsNo blocking findings. The PR adds methodology registry coverage, changelog notes, R golden fixtures, independent brute-force WCR checks, p/CI consistency tests, degenerate NaN handling, |
…ducibility tolerances Two CI failures, both non-logic: 1. Sphinx `-W` docs build: the new `p_val_type` parameter docstring had a bulleted list (with a multi-line item) not preceded by a blank line, which docutils flagged as "Block quote ends without a blank line". Added the required blank line before the list. 2. Wild-bootstrap reproducibility/determinism tests asserted bit-exact equality on `se` / `conf_int`, which now derive from the analytical cluster-robust (CR1) vcov solve. That solve is bit-reproducible only to ~1e-13 on threaded BLAS / the Rust backend, so two identical runs differed at ~1e-13–1e-15 and the exact `==` failed on every Rust-backend leg (the old `se` was a pure-numpy bootstrap-coef std and avoided that path). Relaxed the four affected assertions to `pytest.approx(rel=1e-9)` — far above the noise, still a strong reproducibility check — consistent with how the suite already tolerates BLAS-level swings. `p_value` (count-based) and `n_bootstrap` stay exact. Verified: tests/test_wild_bootstrap.py passes on both the python and rust backends (59/59). Pure Python Fallback, all Rust legs, notebooks, and docs legs already passed on the prior run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR Review:
|
The estimator stored the cluster-robust vcov via compute_robust_vcov(X, ...) on the full design, which inverts X'X directly and raises ValueError (or returns garbage) when a nuisance column is collinear — e.g. a fixed-effect dummy collinear with treatment on a full-dummy design — even though the ATT is identified and wild_bootstrap_se itself drops such columns internally. Verified: the storage call receives a rank-deficient X (rank 22 of 23) in the existing TWFE full-dummy test, and compute_robust_vcov raises on an exactly-singular X. Fix: compute the stored vcov through the rank-aware solve_ols(..., rank_deficient_action="silent") path, which drops collinear columns and NaN-expands the vcov for them — bit-identical to compute_robust_vcov on full-rank designs (verified, ~5e-17). Removed the now-unused compute_robust_vcov import. Test: a DiD fixed_effects design with a dummy that EXACTLY duplicates the treatment indicator (singular X'X) — wild-bootstrap fit stays finite, no crash, stored vcov NaN-expanded for the dropped column. Existing TWFE rank-deficient full-dummy test still passes (both backends). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR Review:
|
…(CI review P1) The saturated-design guard added earlier (n_eff <= k -> NaN vcov) was placed before the cluster-robust "need >= 2 clusters" validation, so a 1-cluster *saturated* cluster-robust request returned all-NaN instead of raising the documented ValueError (verified: compute_robust_vcov(eye(2), 0, zeros(2)) now raises; a non-saturated 1-cluster fit always raised). The cluster-count check (with the zero-total-weight effective-cluster exclusion) now runs first on the clustered path; the saturated NaN return only applies once that assumption holds. Multi-cluster saturated designs still return NaN (no ZeroDivisionError), and full-rank fits are unchanged. Tests: 1-cluster saturated -> ValueError; 2-cluster saturated -> NaN vcov (tests/test_linalg.py::TestComputeRobustVcov). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR Review: ✅ Looks GoodOverall Assessment✅ Looks good. No unmitigated P0/P1 findings remain. The prior P1 about the saturated-design CR1 guard masking the one-cluster assumption check is resolved. Executive Summary
MethodologyFinding: Prior saturated one-cluster CR1 issue is resolvedSeverity: P3 informational Finding: WCR deviations are documentedSeverity: P3 informational Code QualityFinding: Early weighted cluster-count check should normalize
|
…eview P2) The early cluster-count validation (added in the prior commit) grouped the zero-weight check with the raw `cluster_ids`. For a direct public compute_robust_vcov call passing `cluster_ids` as a non-default-index pandas Series alongside zero p/aweights, `pd.Series(weights).groupby(cluster_ids)` index-aligns the Series grouper against the freshly-created Series(weights) (default RangeIndex) and miscounts effective clusters (verified: counts 0 for a valid 3-cluster design), wrongly raising "need >= 2 clusters". Now normalize via np.asarray(cluster_ids) first, matching the later cluster branch. (Estimator paths pass numpy arrays and were unaffected.) Test: non-default-index Series cluster IDs + a zero weight on a valid 3-cluster design returns a finite vcov, no wrongful raise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: PR Review: ✅ Looks GoodOverall Assessment✅ Looks good. I found no unmitigated P0/P1 issues in the PR changes. Executive Summary
MethodologyFinding: WCR methodology matches documented contractSeverity: P3 informational Finding: Documented deviations from R are acceptableSeverity: P3 informational Code QualityFinding: Previous cluster-count normalization issue is resolvedSeverity: P3 informational PerformanceFinding: WCR inversion allocation cost is trackedSeverity: P3 informational MaintainabilityNo findings. The new Tech DebtNo untracked blocking tech debt found. The only material deferred item I saw is the WCR inversion allocation cost, already tracked in SecurityNo findings. I did not see credential-like material or unsafe security-relevant changes in the diff. Documentation/TestsFinding: Verification limited by sandbox dependenciesSeverity: P3 informational |
Summary
Fixes #543 —
DifferenceInDifferences/TwoWayFixedEffectswithinference="wild_bootstrap"returned a p-value that contradicted its own confidence interval (e.g. CI[2.30, 2.64]excluding 0, yetp = 0.86).wild_bootstrap_seclaimed to run the Wild Cluster Restricted (WCR) bootstrap but never imposed the null — it re-fit the full design (keeping the treatment column) to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap centered on the estimate instead of 0. The p-valuemean(|t*| ≥ |t₀|)then measured noise around the estimate (≈0.5–0.86 regardless of significance) while the percentile CI happened to look fine.0 ∈ CI ⟺ p ≥ alphaholds exactly. Few-cluster Rademacher cases enumerate the full sign-vector set (deterministic), matching R'sfwildclusterboot::boottest.p_val_typeparameter ("two-tailed"default |"equal-tailed"), surfaced onWildBootstrapResultsandDiDResults(summary /to_dict/get_params).wild_bootstrapp_value/conf_intare corrected (a true effect is now significant); reportedseis now the analytical cluster-robust (CR1) SE (≈unchanged in well-behaved cases).ZeroDivisionError(sharedlinalgguard); a genuinely unbounded inverted CI is±inf, never mixed finite/NaN.Methodology references (required if estimator / math changes)
DifferenceInDifferences/TwoWayFixedEffects.boottest); R packagefwildclusterboot.1/(n_valid+1)to avoid an exact 0, but only when that floor stays belowalpha, so it never flips a verdict relative to the inverted CI (boottest can report 0); (2)se/t_statare analytical CR1 whilep_value/conf_intcome from the bootstrap inversion (the boottest reporting convention). Both documented indocs/methodology/REGISTRY.md§"Wild cluster bootstrap (WCR)".Validation
tests/test_wild_bootstrap.py— the [Bug]: Result of Wild Cluster Bootstrap is inconsistent to Cluster-robust #543 significance regression, p↔CI consistency, null calibration, enumeration determinism,p_val_typepropagation, low-effective-draw floor consistency, degenerate/saturated-design NaN handling, an independent full-refit WCR brute-force equivalence test, and a skip-guardedfwildclusterboot::boottestgolden parity test.fwildclusterboot::boottest()defaults (benchmarks/R/generate_wild_cluster_boot_golden.R+ checked-in golden) — bootstrap t-distribution matches to ~6e-14;se/t/interior-pexact; inverted CI to ~1e-4.01_basic_did.ipynbre-executes (nbmake). Local AI review (codex/gpt-5.5, full-scope) converged clean over 5 rounds.Security / privacy
🤖 Generated with Claude Code