feat(simd): real WASM SIMD128 backend (fill simd_wasm.rs scalar scaffolding) #225
…olding)

`src/simd_wasm.rs` was commented-out scaffolding; wasm32 fell back to the pure-scalar SIMD types. Fill it with a native `core::arch::wasm32` v128 backend, mirroring `simd_neon::aarch64_simd`'s proven split (native v128 for the float/byte hot path, scalar fallback for the long tail).

simd_wasm::wasm32_simd (gated wasm32 + target_feature="simd128"):
- F32x16 / F64x8 as [v128;4] + F32Mask16 / F64Mask8: full API parity with the scalar macro (arith/reduce/min-max/clamp/compare-mask/to-from-bits/cast/round/floor/sqrt/abs/mul_add + operator impls + Mask::select).
- I8x16 (one v128) = union of the scalar + NEON method sets (add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs).
- v128 hot kernels mirroring the NEON ones: dot_f32x4, popcount_u8x16, hamming_u8x16, hamming_u8x64 (Fingerprint<256> distance via i8x16_popcnt), base17_l1 (i16->i32 sign-extend before subtract, no overflow), codebook_gather_f32x4, bf16_to_f32_batch.
- mul_add: f32x4_relaxed_madd under +relaxed-simd, else mul+add (base simd128 has no FMA). round = f32x4_nearest (ties-even, =NEON). NaN in min/max follows IEEE (=NEON). All cross-backend divergences documented.

Dispatch (simd.rs): wasm32+simd128 arm re-exports the 8 native names from wasm32_simd and the remainder from scalar; the full-scalar fallback arm now excludes that case. Added wasm32 PREFERRED_*_LANES (128-bit widths) and .cargo/config-wasm.toml (-Ctarget-feature=+simd128).

Also unblock the wasm build (pre-existing x86-only AMX leaks, not the SIMD scaffolding): gate the x86-only amx_matmul / simd_amx re-exports in simd.rs to target_arch="x86_64", and route backend::gemm_bf16 through the portable hpc::quantized::bf16_gemm_f32 on non-x86 (bit-equivalent to the AMX dispatcher's own scalar fallback). x86 paths are untouched by construction.

Verified: cargo build -p ndarray --lib for wasm32 (+simd128 native / scalar / no_std) and x86_64 all green; a faithful standalone copy of wasm32_simd run under node passes 51 numeric checks (mask bit-patterns, saturating_abs(i8::MIN)=127, Hamming, Base17 incl. |a-b|=60000 overflow, bf16 shift); 217 SIMD + 85 backend/bf16 x86 tests pass; clippy -D warnings and fmt clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NSk1qJ28eh6A2hd4JUWFRr
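The numeric checks named in the commit message (the Base17 `|a-b|=60000` overflow case, Hamming, and `saturating_abs(i8::MIN)=127`) can be illustrated with a portable scalar sketch. This is not the PR's code: the function names, the 17-lane `i16` layout assumed for Base17, and the 64-byte fingerprint shape are assumptions made for illustration only.

```rust
/// Hypothetical scalar reference for a Base17 L1 distance: each i16 lane is
/// widened to i32 *before* the subtract, so the difference cannot overflow.
fn base17_l1_widened(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| (i32::from(x) - i32::from(y)).unsigned_abs())
        .sum()
}

/// The overflow-prone shape for comparison: subtract in i16 (wraps), then widen.
fn base17_l1_narrow(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| u32::from(x.wrapping_sub(y).unsigned_abs()))
        .sum()
}

/// Hamming distance over 64 bytes: popcount of the byte-wise XOR.
fn hamming_u8x64(a: &[u8; 64], b: &[u8; 64]) -> u32 {
    a.iter().zip(b.iter()).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let mut a = [0i16; 17];
    let mut b = [0i16; 17];
    a[0] = 30_000;
    b[0] = -30_000; // true |a - b| in lane 0 is 60000, which does not fit in i16

    assert_eq!(base17_l1_widened(&a, &b), 60_000);
    // 60000 wraps to 60000 - 65536 = -5536 in i16, so the narrow path is wrong.
    assert_eq!(base17_l1_narrow(&a, &b), 5_536);

    // Every bit differs across 64 bytes: 64 * 8 = 512.
    assert_eq!(hamming_u8x64(&[0u8; 64], &[0xFFu8; 64]), 512);

    // Saturating abs clamps i8::MIN to 127 instead of wrapping back to -128.
    assert_eq!(i8::MIN.saturating_abs(), 127);
    println!("all scalar reference checks passed");
}
```

A native v128 kernel would be expected to agree with references of this kind lane for lane, which is what the commit message's node-run checks compare against.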
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 500d57e6a1
pub use crate::simd_wasm::wasm32_simd::{f32x16, f64x8, i8x16, F32Mask16, F32x16, F64Mask8, F64x8, I8x16};
#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
pub use scalar::{
    batch_packed_i4_16, f32x8, f64x4, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, palette_lookup_u8x8,
Use the wasm I8x16 in batch_packed_i4_16
When building wasm32 with +simd128, crate::simd::I8x16 is the native v128 type re-exported just above, but this line re-exports scalar::batch_packed_i4_16, whose closure receives simd::scalar::I8x16 (src/simd_scalar.rs:1630). That makes portable callers fail only on wasm SIMD if they pass a helper taking crate::simd::I8x16 or write into out: &mut [I8x16], and it also bypasses the new native byte-lane backend for this W1a primitive. Re-export a wasm implementation of batch_packed_i4_16 built on wasm32_simd::I8x16, or keep I8x16 scalar in this dispatch arm.
Summary
`src/simd_wasm.rs` was commented-out scaffolding, so `wasm32` silently fell back to the pure-scalar SIMD types. This fills it with a native `core::arch::wasm32` v128 backend, mirroring `simd_neon::aarch64_simd`'s proven split (native v128 for the float/byte hot path, scalar fallback for the long tail).

It also unblocks the wasm build itself: the crate did not compile for `wasm32` at all before this change (pre-existing x86-only AMX leaks), so the new backend would have been unreachable.

What's new

`simd_wasm::wasm32_simd` (gated `target_arch="wasm32" + target_feature="simd128"`):

- `F32x16` / `F64x8` as `[v128; 4]` + `F32Mask16` / `F64Mask8`: full API parity with the scalar macro (arith, reduce, min/max/clamp, all six compare-masks, `to_bits`/`from_bits`, `cast_i32`, `round`/`floor`/`sqrt`/`abs`/`mul_add`, operator impls, `Mask::select`).
- `I8x16` (one `v128`) = union of the scalar + NEON method sets (`add`/`sub`/`min`/`max`/`cmp_gt` + `from_i4_packed_u64`/`lane_i8`/`saturating_abs`), so consumers stay portable across every backend.
- `dot_f32x4`, `popcount_u8x16`, `hamming_u8x16`, `hamming_u8x64` (Fingerprint<256> distance via `i8x16_popcnt`), `base17_l1` (sign-extends i16→i32 before the subtract, so no overflow), `codebook_gather_f32x4`, `bf16_to_f32_batch`.

Documented cross-backend divergences (consistent with how NEON/AVX already differ):

- `mul_add` is `mul+add` unless `+relaxed-simd` (then `f32x4_relaxed_madd`).
- `round` = `f32x4_nearest` (ties-to-even, same as NEON `vrndnq_*`).
- … `simd_exp_f32` NaN save/restore already absorbs this).

Wiring:

- `simd.rs` gets a `wasm32 + simd128` dispatch arm (re-exports the 8 native names from `wasm32_simd`, the remainder from `scalar`); the full-scalar fallback arm now excludes that case.
- `PREFERRED_*_LANES` get wasm 128-bit widths (F32=4 / F64=2 / U64=2 / I16=8).
- `.cargo/config-wasm.toml` enables `+simd128`.

Unblocking the wasm build (pre-existing x86-only AMX leaks, not the SIMD scaffolding): the x86-only `amx_matmul` / `simd_amx` re-exports in `simd.rs` were unconditional, and `backend::gemm_bf16` called `amx_matmul::matmul_bf16_to_f32` directly. Gated both to `target_arch="x86_64"`, and routed `gemm_bf16` through the portable `hpc::quantized::bf16_gemm_f32` on non-x86 (bit-equivalent to the AMX dispatcher's own scalar fallback). The x86 path is byte-identical by construction (the original block now lives under `cfg(target_arch="x86_64")`).

Verification

- `cargo build -p ndarray --lib` green for `wasm32` +simd128 (native), scalar, --no-default-features (no_std), and x86_64 default.
- `wasm32_simd` compiled to `wasm32+simd128` and run under node: 51 numeric checks pass, covering exact comparison-mask bit-patterns, `saturating_abs(i8::MIN)=127`, Hamming=512, Base17 vs scalar (incl. a pathological `|a-b|=60000` overflow case), bf16 shift, codebook gather.
- `cargo clippy -p ndarray --lib -- -D warnings` clean; `cargo fmt --check` clean.

Adversarial review (3-angle Opus)

- … alpha=1, beta=0 overwrites C).
- `base17_l1_wasm` now widens i16→i32 via `i32x4_extend_{low,high}_i16x8` before the subtract, matching the scalar reference for the full i16 range (regression test added).
- `pub mod simd` is itself `#[cfg(feature="std")]`, so the native wasm arm is transitively std-gated; the `--no-default-features` wasm build is empirically clean.

Notes

- `U8x64`/`I32x16`/`U64x8` stay scalar on wasm, exactly as NEON keeps them; the free Hamming/Base17 kernels cover those hot paths.
- … `getrandom 0.3` issue via `ndarray-rand`/`numeric-tests`; `-p ndarray --lib` is the correct wasm surface and is green.

🤖 Generated with Claude Code
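The `mul_add` divergence listed in this description (plain multiply-then-add on base simd128, possibly fused under `+relaxed-simd`) can be made concrete with a scalar sketch that runs on any target. It is illustrative only: the function names are hypothetical, and it uses `f32::mul_add` from the Rust standard library as a stand-in for a fused lane operation rather than any wasm intrinsic.

```rust
/// Unfused shape: the product is rounded to f32, then the sum is rounded again.
fn mul_then_add(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Fused shape: `f32::mul_add` computes a * b + c with a single rounding.
fn fused_mul_add(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

fn main() {
    // a = 1 + 2^-23, so the exact product a * a is 1 + 2^-22 + 2^-46.
    let a = 1.0f32 + f32::EPSILON;
    // c cancels everything except the 2^-46 term.
    let c = -(1.0f32 + 2.0 * f32::EPSILON);

    let unfused = mul_then_add(a, a, c);
    let fused = fused_mul_add(a, a, c);

    // Unfused: the 2^-46 term is rounded away in the product, leaving exactly 0.
    assert_eq!(unfused, 0.0);
    // Fused: the low-order term survives the single rounding (2^-46 = EPSILON^2).
    assert_eq!(fused, f32::EPSILON * f32::EPSILON);

    println!("unfused = {unfused:e}, fused = {fused:e}");
}
```

Results that cancel this closely are where the two backends can disagree in the last bits, so cross-backend tests of `mul_add` generally need a tolerance rather than bit-exact comparison.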