Conversation
📊 Benchmark & performance —
|
| N | fused | NumPy 2.4.2 | NumSharp speedup |
|---|---|---|---|
| 64 | 0.547 | 4.322 | 7.9× |
| 4,096 | 0.549 | 0.660 | 1.20× |
| 262,144 | 0.557 | 1.419 | 2.55× |
The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.
All 11 ops on this path — speedup vs NumPy @262K (f64):
abs 3.37× negate 3.15× floor 3.07× trunc 3.03× round 3.00×
sqrt 2.55× rad2deg 2.41× deg2rad 2.22× square 2.18× reciprocal 1.72×
Verified 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.
Note: this is a
DirectILKernelGeneratorwhole-array kernel that bypasses NpyIter by design — the fusion (gather folded intoVector.Create) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.
2. Official NumSharp-vs-NumPy benchmark (6038990f)
Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.
The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.
Headline — geomean (NumSharp ÷ NumPy, lower = better):
slower ◄───────── 1.0 (parity) ─────────► faster
1K ████████████████████ 1.96× (102 win / 212 lose)
100K ██████████████████▎ 1.83× (109 win / 196 lose)
10M ██████████▏ ........ 1.00× (166 win / 36 lose) ◄ PARITY
At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-element dispatch + result-allocation tax (~2×).
Per-suite geomean by size:
| suite | 1K | 100K | 10M |
|---|---|---|---|
| Statistics | 0.19× | 0.68× | 0.48× ✅ |
| Sorting | 0.41× | 1.13× | 0.45× ✅ |
| Comparison | 1.27× | 2.22× | 0.50× ✅ |
| Bitwise | 8.16× | 1.16× | 0.61× ✅ |
| Reduction | 0.48× | 0.94× | 0.91× ✅ |
| Arithmetic | 3.09× | 2.62× | 1.25× 🟡 |
| Unary | 3.50× | 4.44× | 1.53× 🟡 |
| Creation | 12.26× | 2.92× | 2.24× 🟠 |
| LinearAlgebra | 2.76× | 1.66× | 4.02× 🔴 |
🏆 Biggest wins (@10m, real ms):
| op | dtype | NumPy | NumSharp | speedup |
|---|---|---|---|---|
average |
f32 | 9.60 | 0.94 | 10.2× |
nansum |
f32 | 14.35 | 1.49 | 10.0× |
nanprod |
f32 | 18.52 | 1.90 | 9.7× |
var |
f32 | 16.96 | 2.60 | 6.5× |
count_nonzero |
f64 | 22.61 | 3.74 | 6.0× |
nanmean |
f64 | 33.47 | 5.69 | 5.9× |
🎯 Biggest gaps (@10m) — optimization targets:
| op | dtype | NumPy | NumSharp | gap |
|---|---|---|---|---|
sum axis=1 |
uint8 | 3.12 | 49.74 | 16.0× |
dot |
f64 | 1.23 | 16.46 | 13.4× |
matmul |
f64 | 0.72 | 4.26 | 5.9× |
argsort |
int32 | 369 | 2162 | 5.9× |
→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.
Per-dtype @10m (geomean):
int64 0.91 uint64 0.92 f32 0.93 f64 0.98 uint8 1.00 uint32 0.99 ◄ strong
int32 1.11 int16 1.14 uint16 1.24 bool 1.60 ◄ weak (bool, narrow-uint)
Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.
Reproducibility
- Reusable cross-platform runner:
python benchmark/run_benchmark.py(builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives). - Full report:
benchmark/benchmark-report.md(1,311 rows). - Provenance snapshot keyed by date+hash:
benchmark/history/2026-06-05_6038990f/(manifest + report + NumPy timings).
…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NDExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NDIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.
… stat Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp faster — bars grow right = faster, the original visual), AND every row now also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP. render_dashboard.py: * r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time). * headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g. statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum 12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP. * status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower 200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column narrow (NN% under 1000%, else NN×NP). benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550 ×54). Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py) still uses NS/NP "lower is better", and the nditer sheet / cards use NP/NS without the %NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step, pending confirmation.
Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%". Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was "lower is better"): * ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV). * get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20). * Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×). * Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled. nditer sheet (nditer_sheet.py) + cards (nditer_cards.py) — already NP/NS, ADDED 🕐 %NumPy: * sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time"; re-rendered nditer_results.md (--render-only, AV block intact). * cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered. Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention block, a %NumPy column to the by-class table, and a caption sentence. DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated report + sheet; benchmarks.md updated; docfx build clean (0 errors). The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af), so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README — is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
The clock sat before the figure with the right-align padding landing between them
("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across
every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding
now sits before the number (where it belongs) instead of after the emoji.
* render_dashboard.py / nditer_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of
NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated.
* merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column,
Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated.
* benchmarks.md + README: convention line / table column / caption "%NumPy🕐".
* DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐".
docfx build clean.
The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu
has no clock glyph — so there was never a gap to fix there).
… form pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads "87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of NumPy's time like every other cell. The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded, docfx build clean.
…g from the report The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K) read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs 87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between the two committed surfaces. Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the report. The report is correct (true ratio of the measured means); the dashboard was a rounding artifact of its own recompute. Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON (the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans key off the same stored ratios the report's Summary-by-size uses. Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page; header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%, quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
…not run" cells
normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label
and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark
not run" for ops that WERE run. Three archive-safe alias passes (applied identically to
both sides, so they only ever merge a true pair):
* empty "()" — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
* "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
* np.around — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy
emits np.round, so the whole np.round family was ⚪ despite real data
Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round
family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape
2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions
(no previously-matched cell lost), 0 new key collisions.
Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells,
0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved
byte-for-byte). The dashboard stays cell-consistent with the report via the canonical
ratio/pct fix from the prior commit.
NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has
no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/
maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean
on uint*/int16 lack a counterpart on one side.
…lex (NumPy parity)
These six complex ufuncs previously threw NotSupportedException from the
EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex
loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires
them up with full NumPy 2.4.2 parity.
Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2
pattern): a bit-exact probe over a finite battery showed System.Numerics.
Complex matches NumPy to a few ULP on the finite interior, but diverges at
86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs
instead of the C99 Annex G values, drops the sign of zero on branch cuts,
and mishandles arctan's imaginary-axis cut. So:
- NDComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite
interior to the BCL and add the C99 fixups:
* Non-finite inputs: special-value tables ported from NumPy's msun
npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities
asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
* Branch-cut/signed-zero fixups (empirically derived against NumPy and
verified on a 64-point signed-zero grid): asin negates Re on x=-0 and
Im on y=-0; acos negates Im on the y=+0 cut; atan sets
Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
* Where this NumPy build's system libm diverges from msun at infinities
(sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y)
imaginary part, tanh's sign(y) zero, and the genuinely-unspecified
zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,
Asin,Acos,Atan} (pointing at NDComplexMath, not Complex.* directly) and
add the six cases to EmitUnaryComplexOperation.
Verification: a bit-exact harness over a 117-point battery (finite + signed
zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy
2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within
<=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny
input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate
than NumPy's log1p-based one; an accepted, documented divergence.
Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch
cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is
corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held
to a tight 4-ULP gate (a real regression fails) instead of the blanket
complex-unary excuse; arctan stays under the documented blanket for its
accepted BCL-interior divergence.
All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus
cases for the five tightly-gated ops are all within 4 ULP.
…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NpyIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NpyIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).
… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.
…ndentals Adds AggressiveInlining/AggressiveOptimization to the complex hyperbolic and inverse-trig helpers and restructures them into a hot/cold split, so the JIT folds the per-element math into the IL-emitted unary kernel without a call frame: - Sinh/Cosh/Tanh/Asin/Acos/Atan (+ Abs and the tiny IsNegZero/IsPosZero/ HypotInf/ClogLarge helpers) are marked AggressiveInlining. Each public op is now a tiny finite-path wrapper (finite check -> Complex.* + fixups, or a cold-helper call) so it fits the inliner's budget. - The non-finite C99 special-value tables move into cold helpers (SinhSpecial/CoshSpecial/TanhSpecial/CasinhNonFinite/CacosNonFinite/ CatanhNonFinite) marked AggressiveOptimization -- kept out-of-line (so the hot wrapper stays inlineable) and fully optimized when actually hit. Behavior is identical to the prior inline form (verified below). IL-inlining experiment (the "emit the formula instead of call" question): benchmarked complex sinh both ways over 4M finite elements, median of 15 reps. The real-decomposition formula (Math.Sinh(x)*Math.Cos(y), Math.Cosh(x)*Math. Sin(y)) is bit-identical to Complex.Sinh (0/4M mismatches) but only 1.15x faster than the call; cosh 1.06x; asin/acos/atan have no real-Math.* formula (dominated by complex log/sqrt) so inlining would only drop a wrapper frame. The per-element cost is dominated by the transcendental itself, so emitting ~6 hand-written IL formulas is not worth the duplication/risk -- especially as the call-based kernel is already ~1.56x faster than NumPy 2.4.2 (np.sinh: 26.1 ns/elem vs NumPy 40.9). Decision: keep the handwritten methods; the inlining attributes capture the (small, safe) wrapper-elimination gain. Verified: NewDtypesUnaryTests + Fuzz UnaryExtra (4-ULP complex gate) green (62/62); the hot/cold split changes no results.
…exponential The np.allclose / np.random.exponential working-set leak guards (np.allclose.UsingTests, np.random.exponential.UsingTests) failed in CI with hundreds of MB of working-set growth (e.g. 551 MB on Linux, threshold 20 MB), while passing on Windows. Root cause: both functions allocate several NDArray intermediates per call and never dispose them — the unmanaged buffers ride the finalizer queue instead of being released synchronously. In a tight loop the managed wrappers are tiny so the GC rarely runs, leaving the intermediates LIVE between collections; the allocator can't reuse that memory, so the high-water mark balloons. On glibc (Linux) freed pages are retained in the arena, so the process RSS stays high even after the test's final GC.Collect()+WaitForPendingFinalizers() — hence the large WorkingSet64 delta. np.isclose: |a-b| <= atol + rtol*|b| materialized ~5 float64 temps (≈400 KB each at 50K elements) plus several bool temps, none disposed. Wrapped every fresh allocation in `using` (the elementwise operators/ufuncs each return a new array). x/y come from astype(copy:false), which returns the input itself when no conversion is needed, so they are caller-owned and never disposed here. The final combined array is captured in `using` too: MakeGeneric<bool>() takes its own refcount on the shared buffer, so disposing the backing temp on return keeps the result alive while keeping that last buffer off the finalizer queue. np.random.exponential: β·(-log(1-U)) left uniform, (1-U) and negate() intermediates un-disposed (only the log result was released). Now disposes all of them; only the trailing `* scale` allocates the fresh array returned to the caller. Effect (measured, peak WorkingSet64 growth across a 1000-iter no-GC loop — the CI failure mode): allclose 551 MB -> 3 MB, exponential -> 1 MB. Behavior is unchanged: full suite green on net8.0 and net10.0 (9718 passed / 0 failed under the CI filter), including the Logic fuzz corpus and the isclose/allclose/ exponential unit tests.
…ity with NumPy 2.4.2) The complex (complex128) overloads of the unary math ops deferred to System.Numerics.Complex for their finite interior. The BCL transcendentals diverge from NumPy on a wide range of edge inputs — large magnitudes, the unit circle, tiny/subnormal values, branch cuts and signed zeros — because they do NOT implement the careful FreeBSD msun algorithms NumPy uses. This replaces those deferrals with direct ports of NumPy's own routines in NDComplexMath, verified by a 504-point bit-exact sweep (Python struct-packed int64 references) classifying every result as exact / <=3 ULP / signed-zero / special / sign-flip. Result: 18 of 20 complex unary ops are now at full parity (0 divergence beyond <=3 ULP): exp, log, log10, log2, log1p, expm1, exp2, sin, cos, tan, sqrt, square, reciprocal, negative, sinh, cosh, tanh, arcsin, arctan. Algorithms ported / fixes (src/NumSharp.Core/Utilities/NDComplexMath.cs): - Log (npy_clog): real part = log|z| with the four-regime rescale — |z| huge (x2 down), subnormal (x2^53 up), near the unit circle (0.71<=|z|<=1.73 uses 0.5*log1p((m-1)(m+1)+n^2) via a Goldberg MathLog1p), and 0. Complex.Log cancels the real part to 0 near |z|=1 (e.g. log(1+1e-10 i).real must be 5e-21, not 0). ComplexLog is repointed here, so np.log, np.log2 and np.log10 all inherit the accuracy. - Tanh (npy_ctanh): Kahan's algorithm (t=tan(y); beta=1+t^2; s=sinh(x); rho=sqrt(1+s^2); tanh=(beta*rho*s + i t)/(1+beta*s^2)) plus the |x|>=22 overflow-safe branch. The BCL Complex.Tanh drifts ~33 ULP (tan(1.5) through the tan(z)=-i*tanh(iz) identity). - Sin/Cos/Tan: now route ALWAYS through Sinh/Cosh/Tanh, exactly as NumPy defines npy_csin/ccos/ctan (= -i*sinh(iz) / cosh(iz) / -i*tanh(iz)), so they match NumPy bit-for-bit instead of only on the BCL's finite interior. Fixes sin(0+1e300 i).real = NaN (BCL did cosh(huge)*0); the Sinh/Cosh y==0 guard returns (sinh(x), y)/(cosh(x), x*y) so a large real no longer yields inf*0 = NaN. - Expm1 (nc_expm1): real = expm1(x)*cos(y) - 2 sin^2(y/2), imag = exp(x)*sin(y); the real expm1 fallback uses the Goldberg identity (e^x-1)*x/log(e^x) which recovers the ~10 digits exp(x)-1 cancels and avoids underflow (expm1(1e-300)=1e-300, not 0). Fixes the non-finite imaginary (expm1(+Inf+0i).imag = exp(+Inf)*sin(0) = NaN) and origin signed zeros. - Square (z*z with FMA contraction): (fma(re,re,-(im*im)), fma(re,im,im*re)). NumPy's complex multiply is FMA-contracted, so square(1e-10+1e-10 i).real = -2.275e-37 (exact re^2 minus rounded im^2) and square(1e300+1e300 i).real = -inf; Complex.op_Multiply (no FMA) returned 0 and NaN. - Atan (npy_catanh, full): atanh(x) on the real axis, atan(y) on the imaginary axis, and the log1p(4|x|/sumsq(|x|-1,|y|))/4 interior, plus _sum_squares and an exponent-classified _real_part_reciprocal (raw biased-exponent field, NOT Math.ILogB which maps 0/Inf to int.MinValue/MaxValue and overflows the subtraction). Complex.Atan cancelled / underflowed the tiny imaginary part (arctan(0+1e-10 i).imag must be 1e-10). - Exp (npy_cexp): exp(-Inf + I(Inf|NaN)).imag = copysign(0, y) so exp(-inf-inf i).imag = -0 (the system libm keeps sign(y); npy_cexp's flat (0,0) dropped it). exp2 inherits this. - Reciprocal already used Smith's nc_recip (overflow-safe, correct signed zeros). Engine wiring (DirectILKernelGenerator[.Unary.Decimal].cs): ComplexLog repointed to NDComplexMath.Log; new cached methods ComplexExpm1 and ComplexSquare; the Expm1 and Square cases in EmitUnaryComplexOperation now call the ported helpers instead of inline Complex.Exp(z)-1 / Complex.op_Multiply. Accepted residuals (pathological inputs only, documented in code + the fuzz registry): - cos/sin with a NaN imaginary part: the resulting zero's sign is C99-UNSPECIFIED; the platform libm and the npy_ccos identity pick opposite signs (2 cases). - arccos with a sub-DBL_MIN imaginary part: Complex.Acos flushes the denormal real part to 0 where cacos's _do_hard_work keeps it (~5.8e-309); a denormal-range edge (4 cases). - sinh/cosh at the overflow boundary |x| in [710, 710.13]: Windows' CRT sinh overflows to inf while .NET Math.Sinh stays finite (a platform-libm boundary, absent on glibc). Tests: NewDtypesUnaryTests.cs adds 11 NumPy-2.4.2-verified cases for the huge-imaginary sin/cos, large-real sinh/cosh overflow, Kahan tan accuracy, near-unit-circle clog, scaled log10/log2/log1p, Goldberg expm1, exp2, FMA square, reciprocal signed zeros, catanh tiny/large arctan, and the exp -inf signed zero. Fuzz/MisalignedRegistry.cs tightens the complex-unary gate to <=3 ULP across the whole set (was a 4-ULP gate on 5 ops + a blanket excuse for the rest), narrows the >3-ULP excuse to the named pathological ops, and adds a separate entry for the (pre-existing) complex reduction/scan NaN-ordering divergence the old blanket covered. Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9729 passed, 0 failed. net8.0 + net10.0 both build clean.
…-run
Adds the benchmark definitions that were missing on one side of the op-matrix join (so the
ops showed ⚪ "not run" or were discarded as C#-only), then re-runs the whole official suite
(all 14 comparison suites x 3 cache tiers, ~3h) to fill them in with live numbers. Result:
⚪ no-data 130 -> 76, and the headline moves from a stale 0.74x to 1.08x geomean (93%🕐)
over 1386 credible cells — the stale figure was dragged down by a broken searchsorted and by
simply missing most of NumSharp's fast reductions.
NumPy side (numpy_benchmark.py) — C# already benchmarked these; NumPy didn't:
* unary: np.tan, np.exp2, np.expm1, np.log2, np.log1p, np.clip(a,-10,10),
np.power(a,2|3|0.5)
* reduction: np.cumsum (all arithmetic dtypes), np.prod + np.prod axis=0/1, and the axis
variants np.amax/np.amin/np.mean axis=0(/1) and np.var/np.std axis=0
All names normalize to the existing C# [Benchmark(Description=...)] so they join 1:1.
C# side:
* ProdBenchmarks: was non-standard sizes (100/1000/10000) + method-form names (a.prod());
nothing could join it. Switched to the standard Small/Medium/Large tiers and function-form
np.prod(a)/np.prod(a, axis=k) — values stay in [0.5,1.0] so the product is overflow-safe at
every size. prod now has full + axis coverage (18 cells).
* MeanBenchmarks: CommonTypes -> ArithmeticTypes, closing the np.mean uint*/int16 ⚪ holes
(15 cells) — matches SumBenchmarks/MinMaxBenchmarks.
* LogicBenchmarks: isnan/isinf/isfinite/maximum/minimum/array_equal now join (54 cells).
Verified on the fresh run: searchsorted is purged of the 0.0000ms / >1e6x rows (now real,
1.16-1.44x faster), prod/cumsum/all axis reductions/the 6 predicates/mean-on-uint* all matched.
Regenerated benchmark-report.{md,json,csv} + dashboard and re-seeded the matrix + dashboard
DocFX pages.
KNOWN BUG surfaced (left as ⚪): np.isclose and np.allclose DETERMINISTICALLY segfault NumSharp
with the unmanaged-storage AccessViolation — each crashes even run alone, and in-class it killed
the whole logic suite before BenchmarkDotNet could export anything (took the 6 working predicates
down with it). Disabled both in LogicBenchmarks with a documented note; re-enable once the
NumSharp isclose/allclose lifetime bug is fixed. The 6 predicates were recovered by running each
in its own process (the same per-section isolation the NDIter harness uses for its AV).
…segfault) + doc review Parity review of the complex unary math overloads (commit 416affc). Verified all 20 affected ops across memory layouts (contiguous / F-contiguous / strided / transposed / both negative-stride directions / sliced-offset / broadcast / 0-d / empty) with a fresh bit-exact sweep — every op is layout-correct with 0 divergence — and confirmed the out=/where= ufunc parameters compose bit-exactly with the new complex kernels (exp returns the same out instance; sqrt's where=mask preserves masked-off slots). The review surfaced a pre-existing MEMORY-SAFETY bug (segfault), now fixed: np.exp(complex_array, dtype=float64) # and sqrt/log/log2/log10/log1p/expm1/exp2/sin/cos/tan/ # sinh/cosh/tanh/arcsin/arccos/arctan segfaulted instead of raising. Root cause: ResolveUnaryFloatReturnType honored an explicit dtype= override after only rejecting integer/bool targets (over < Single -> "No loop matching"). It never checked that the INPUT can reach the requested loop dtype by a same_kind cast. For a complex input + real-float dtype=, it returned the real type, ExecuteUnaryOp allocated an 8-byte/element output buffer, and the 16-byte/element complex kernel overran it. NumPy 2.4.2 raises instead: "Cannot cast ufunc 'exp' input from dtype('complex128') to dtype('float64') with casting rule 'same_kind'" Fix: ResolveUnaryFloatReturnType now calls the existing ValidateUnaryInputCast (already used by square/reciprocal/negative, which were NOT affected) on the override path. This reuses NDIterCasting.CanCast(SAME_KIND), so it allows the legal narrowings (int->float32, float64-> float32, float->complex) unchanged and rejects only the cross-kind complex->real cast, emitting NumPy's verbatim message. Probe matrix (complex/float/int inputs x float/complex/int dtype=) now matches NumPy across all 17 float-producing complex ufuncs; the order is preserved (integer dtype= still raises "No loop matching" before the cast check). Also refreshes the NDComplexMath class doc comment, which still described the old fork state ("sinh/cosh/tanh/asin/acos/atan delegate straight to System.Numerics.Complex", "arctan's BCL interior is the lone documented divergence") — it now lists the actual ported algorithms (npy_clog, Kahan ctanh, csinh/ccosh, npy_catanh, npy_cexp/csqrt, nc_expm1/Goldberg, FMA square, nc_recip), the two ops still delegating (asin/acos at parity), and the three accepted pathological residuals. Tests: NewDtypesUnaryTests.cs adds Complex_FloatUfunc_NarrowingDtype_RaisesCastError_NotSegfault (exp/log/sqrt/sin/tanh/arctan: complex+dtype=float64 raises the verbatim cast error, complex+ dtype=int64 raises "No loop matching", complex+dtype=complex128 returns complex). Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9730 passed, 0 failed. net8.0 + net10.0 build clean. Note: ceil/floor/round/trunc on complex reject cleanly (no segfault) but with NumSharp's own message rather than NumPy's "ufunc not supported for the input types" — left as-is (out of scope; NumPy has no complex loop for them either). The int->exp2 InvalidProgramException (Single-output kernel) remains a separate, already-tracked bug (fuzz registry W3-C), unrelated to complex.
…o (match/beat NumPy 2.4.2) np.zeros was ~1000x slower than NumPy for large arrays (10M float64: 14.3 ms vs NumPy 0.011 ms). Root cause: it allocated an uninitialized buffer and then ran an eager per-element Fill loop that touched (and zeroed) every byte. NumPy instead delegates zeroing to the OS: PyDataMem_NEW_ZEROED -> calloc, whose demand-zero pages are committed and zeroed lazily on first write, so allocating zeros is effectively O(1) regardless of size (numpy/_core/src/multiarray/alloc.c npy_alloc_cache_zero: small sizes use a cache+memset, large sizes calloc). This ports NumPy's structure. The zeroing is now done by the allocator/OS, never an element loop — correct for all 15 dtypes because the all-zero bit pattern equals default(T) for every one of them (incl. Half, Single, Double, Decimal, Complex). Implementation -------------- - ArraySlice.Allocate(..., fillDefault: true) and Allocate<T>(..., true) now route to UnmanagedMemoryBlock<T>.AllocateZeroed instead of `new UnmanagedMemoryBlock<T>(count, default)` (Take + scalar Fill). All np.zeros overloads, np.zeros_like, np.eye/np.identity, and every internal fill-with-default allocation flow through here. - SizeBucketedBufferPool.TakeZeroed: NativeMemory.AllocZeroed (calloc) with no dirty-bucket reuse — a recycled buffer would force a full memset, discarding the lazy demand-zero win for large sizes and being no cheaper than calloc for small ones. - OsVirtualMemory (new, Windows-only): the Windows process heap eager-commits and memsets mid-size calloc requests (~256 KiB-2 MiB, ~0.05 ms for 800 KiB), unlike glibc/macOS which mmap large blocks lazily. For >= 128 KiB on Windows AllocateZeroed uses VirtualAlloc(MEM_COMMIT) (copy-on-write zero pages, ~0.002 ms) and a new Disposer AllocationType.Virtual that releases straight to the OS via VirtualFree (not pooled). Non-Windows and small sizes stay on calloc, which is already lazy/cheap there. Benchmark fix (pre-existing bug) -------------------------------- CreationBenchmarks returned each created array without disposing, leaking one buffer per op. NumPy's harness (numpy_benchmark.py) discards each result inside the timed loop, so CPython refcount frees it immediately — i.e. NumPy measures alloc+free while the C# benchmark measured alloc-only (unfair) and leaked. Under BenchmarkDotNet's thousands-of-ops-per-iteration, every untouched-but-committed buffer still charges Windows commit, so any fast creation op OOMs at 10M (np.empty(10M) already did; the old np.zeros only escaped by being slow enough to throttle BDN to a couple ops/iteration). All creation benchmarks now dispose per op, matching NumPy and bounding resident memory. Results (this machine, vs NumPy 2.4.2; BDN alloc+free) ------------------------------------------------------ - 10M float64: 14.3 ms -> 0.0033 ms (was ~1000x slower; now 3.1x faster) - medium (100K): 1.7-3.8x faster across i32/i64/f32/f64 - large (10M): 1.1-3.5x faster across i32/i64/f32/f64 - small (1K): ~1.5-2x slower — bounded by NDArray object construction (NDArray/Storage/Shape/ArraySlice/Disposer), shared by all creation APIs and sub-microsecond; the allocation itself is optimal. Tests ----- New Creation/np.zeros.AllocationTests.cs (12 tests): all 15 dtypes zeroed across heap/VirtualAlloc size regimes, full-scan of a multi-MB array, VirtualAlloc writeability/commit correctness, OwnsData, non-aliasing, reuse-after-dispose, multi-dim/high-rank/empty/sliced, default dtype, all overloads. Full CI suite (net8.0 + net10.0, excluding OpenBugs/HighMemory) green: 0 failed, 9742 passed.
…gramException) np.exp2 threw InvalidProgramException for EVERY call resolving to a float32 (Single) output — int16/uint16/char/float32 inputs (via ResolveUnaryFloatReturnType) or any input with dtype=float32. exp2->float64 and exp2->float16 were unaffected. Root cause: DirectILKernelGenerator.Unary.Math.cs, EmitExp2Call, the type == Single branch emitted one spurious instruction. Math.Pow takes (base, exponent), so the exponent must be stashed in a local and pushed after the base. The Single branch pushed an EXTRA `Ldc_R8 2.0` before the Stloc, so: Conv_R8 ; [x_d] Ldc_R8 2.0 ; [x_d, 2.0] <- spurious push Stloc locExp ; [x_d] locExp = 2.0 (saved the constant, NOT the exponent) Ldc_R8 2.0 ; [x_d, 2.0] Ldloc locExp ; [x_d, 2.0, 2.0] Call Math.Pow ; [x_d, 4.0] Pow(2,2)=4, pops 2 Conv_R4 ; [x_d, 4.0f] <- TWO values left on the stack The body left a net +1 on the evaluation stack, so the DynamicMethod failed IL verification -> InvalidProgramException at JIT time. (Latent second defect, masked by the crash: locExp held the constant 2.0, so even a balanced version would have returned Pow(2,2)=4 for every element.) Fix: delete the spurious push so the Single branch mirrors the (correct) Double branch wrapped in Conv_R8/Conv_R4 — Conv_R8 -> save exponent -> push base 2.0 -> push exponent -> Pow -> Conv_R4. exp2(float32[0,.5,1,-1,3]) now returns [1, 1.4142135, 2, 0.5, 8], bit-matching NumPy 2.4.2. Around-the-bug audit (no new findings): - Code: the decimal exp2 path (Unary.Decimal.cs) already saves the exponent correctly, and every other unary transcendental routes through EmitMathCall (MathF.X for Single — one arg, balanced), so exp2's Single branch was the only emitter with this idiom. - Empirical: swept all 20 float-producing unary ufuncs x 12 input dtypes (240 cells) vs NumPy -> 0 InvalidProgram crashes, 0 dtype mismatches; the only value differences are the pre-existing, accepted 1-float32-ULP transcendental gaps (exp/sin/cos/rad2deg, MathF vs NumPy's float32 libm), unrelated to exp2 (which is now exact). Tests: - NpApiOverloadTests_UnaryMath: removed [OpenBugs] from Exp2_WithType_Compiles / Exp2_WithNPTypeCode_Compiles (they reproduced this bug) and strengthened them to assert the 2^x values [1,2,4], not just no-crash (guards the latent locExp=2.0 wrong-value defect). - NewDtypesUnaryTests: new "Exp2 Single-output" region — int16/uint16/char inputs, float32 preserve (incl. fractional sqrt(2)), dtype=float32 override from int and double, a strided float32 view, and the unaffected float64/float16 branches. - MisalignedRegistry: removed the W3-C excuse so any regression of the crash now fails the fuzz gate. Verified: full CI-style suite (net10.0, exclude OpenBugs/HighMemory) 9749 passed / 0 failed; exp2 tests pass on net8.0 + net10.0 (runtime IL emission checked on both JITs).
np.power with a float16 (Half) scalar exponent threw InvalidCastException on the scalar-broadcast path — e.g. np.power(f16[2,3], f16(2)). NumPy 2.4.2 returns float16 ([4,9]). Root cause: Default.Power.cs ReadScalarAsDouble read the size-1 exponent via Convert.ToDouble(object). System.Half does not implement IConvertible, so the boxed Half threw InvalidCastException — even though the scalar-exponent fast paths (ScalarEqualsExact) explicitly list Half as supported. The throw fired from TryScalarExponentFastPath -> IsScalarValueZero -> ScalarEqualsExact -> ReadScalarAsDouble for ANY power call whose exponent value is a Half. Fix: ReadScalarAsDouble casts a boxed Half directly to double (Half is the only supported non-IConvertible dtype that reaches it; Complex is excluded upstream by ScalarEqualsExact's default and the float-only fast-path guard). With this, the fast paths resolve correctly: power(f16[2,3], f16 2) -> [4, 9] (exp=2 multiply) power(f16[2,3], f16 3) -> [8, 27] (general path) power(f16[4,9], f16 0.5) -> [2, 3] (sqrt fast path) power(f16[2,4], f16 -1) -> [0.5, 0.25] (reciprocal fast path) power(f16 2, f16[2,3,4]) -> [4, 8, 16] (scalar-broadcast base) all float16, bit-matching NumPy. Tests: NewDtypesComparisonTests.Half_Power_HalfExponent_ScalarBroadcast covers the formerly-throwing Half-exponent cells (the existing Half_Power used an int exponent, which is IConvertible and never hit the bug). MisalignedRegistry: removed the now-dead W1-B "power(float16): Half scalar path InvalidCast" excuse so a regression of the crash fails the fuzz gate; the binary_divmod_power corpus tally no longer lists it (was 3x). The kept power excuses are separate bugs: W1-F power(*,float16) dtype widening and W1-C power(uint64,int64) negative-exponent throw. Verified: power/Half/Fuzz suites 188 passed / 0 failed (net10.0); net8.0 + net10.0 build clean. This is the documented NEP50 weak-scalar behavior for an INT exponent (power(f16, int_scalar)->f16 vs NumPy float64) is unchanged and out of scope (registry entry #1).
… (W11-A, W6-D)
np.maximum / np.minimum (and np.clip) must propagate NaN per NumPy 2.4.2 — the
result is NaN whenever EITHER operand is NaN. NumSharp routes all three through
the clip SIMD kernel, whose hardware MAXPS/MINPD intrinsics (Avx.Max / Avx.Min,
plus the Sse/Avx512 equivalents) return the SECOND operand on an unordered (NaN)
compare, silently dropping the NaN on every full SIMD vector. The scalar tail
(Math.Max/Min, HalfMaxNaN, ComplexMaxNaN) already propagated, so the bug only
surfaced once an array reached the vector width; small arrays (and earlier
size-2 spot checks) passed by luck.
Reproduction (float64, >= one V256<double> = 4 lanes):
np.maximum([1,nan,3,nan,5,nan,7,8], [9,2,nan,4,nan,6,nan,nan])
before: [9, 2, NaN, 4, NaN, 6, NaN, NaN] (a==NaN lanes return b)
after : [9, NaN, NaN, NaN, NaN, NaN, NaN, NaN] == NumPy
np.clip([nan,1,2,3,4,5,6,7], 2, 5)
before: [2, 2, 2, 3, 4, 5, 5, 5] (NaN clamped to a_min)
after : [NaN, 2, 2, 3, 4, 5, 5, 5] == NumPy
Fix (DirectILKernelGenerator):
EmitVectorMinOrMax gains a `propagateNaN` flag. For Single/Double lanes it now
emits the NaN-restoring select
result = ConditionalSelect(Equals(a, a), hwMinMax(a, b), a)
i.e. keep the hardware min/max where a is non-NaN, else re-inject a (= NaN).
This is robust regardless of whether the underlying min/max already propagates
(verified against Avx.Max/Min AND the cross-platform Vector{N}.Max/Min fallback).
The raw hardware emit is factored into EmitRawVectorMinOrMax and is still used
verbatim for integer lanes (no NaN) and any opt-out caller, so integer
maximum/minimum are byte-for-byte unchanged. The two clip call sites
(Max(vec,lo) / Min(vec,hi)) pass propagateNaN: true, fixing maximum (clip
a_min), minimum (clip a_max), and clip (both bounds) in one place. clip with a
NaN BOUND also propagates, matching NumPy's minimum(maximum(a,lo),hi)
composition.
Scope: fmax/fmin (W7-B, the NaN-IGNORING opposite contract) still share the clip
path and remain divergent — a separate, already-tracked bug. This change does
not regress them (their fuzz excuse keys on a NaN appearing in NumSharp's output,
which still holds). The F-contiguous/strided element-pairing defect (W7-A) is
likewise untouched.
Registry: removed the now-dead W11-A (maximum_out/minimum_out), clip_out, and
W6-D (clip) NaN excuses from MisalignedRegistry so the differential matrix
verifies NaN propagation bit-exact; W7-A and W7-B excuses preserved.
Tests:
- np.minimum.Test.cs: maximum/minimum NaN propagation on the SIMD path
(float64, float32), strided views, the W11-A out=alias case
maximum(a, roll(a), out=a), and an int64 no-NaN regression guard.
- np.clip.Test.cs: clip(NaN) with scalar bounds, array bounds, and out=alias.
All expected values verified against NumPy 2.4.2.
Validation: full suite green (9760 passed / 0 failed); FuzzMatrix green incl.
Logic, Stat, Aliasing, Tail — the Aliasing corpus now reports zero documented
divergences for maximum_out / minimum_out / clip_out.
…ng and float16 signed-zero (W7-A)
np.maximum / np.minimum / np.clip (and fmax/fmin, which share the same kernel) all route
through the flat clip IL kernel in DirectILKernelGenerator.Clip. That kernel walks
src/dst/lo/hi linearly and pairs them by C-order (row-major) index. The engine, however,
handed it operands in whatever layout they arrived in:
- PrepareBound built an array bound via broadcast_to(...).astype(outType). astype keeps
the source layout (order='K'), so an F-contiguous / strided / broadcast bound stayed
non-C-contiguous and was read in raw buffer order.
- The general src/dst path used Cast(lhs, outType, copy:true), which likewise preserves
the source layout, so an F-contiguous / strided source produced a non-C-contiguous
working buffer.
- A non-C-contiguous user out= was clipped in place in its own (F/strided) buffer, again
mismatching the C-order bound.
Result: every element was paired with the wrong bound whenever a second operand (or the
source, or out=) was F-contiguous / transposed / strided / negative-strided / broadcast.
maximum(a, b.T) returned b's raw buffer instead of the element-wise max. This is the W7-A
fuzz divergence (34x in logic.jsonl).
Fix (Default.ClipNDArray): normalize every operand the flat array-bounds kernel touches to
C-contiguous offset-0 before the call.
- PrepareBound copies the prepared bound to C-order when it isn't already C-contiguous
offset-0.
- The general path re-materializes a C-order copy of dst when lhs is F-contiguous /
strided / offset.
- A non-C-contiguous out= is clipped into a C-order temp and copied back via np.copyto
(which honors the out's strides), preserving the return-the-out instance contract.
- Scalar bounds are order-insensitive (one value per lane) and keep their existing fast
paths untouched.
Half signed-zero tie-break (DirectILKernelGenerator.Clip): the residual W7-A divergence
after the pairing fix was float16-only and showed even on both-contiguous inputs - a
separate bug. NumPy's float16 maximum/minimum return the FIRST operand on a tie, so
maximum(+0,-0)=+0, maximum(-0,+0)=-0 (likewise minimum). HalfMaxNaN/HalfMinNaN used strict
>/< and returned the SECOND operand on the +0/-0 tie (wrong sign). Switched to >= / <= so
the tie resolves to the first operand, matching NumPy. Only the signed-zero result changes;
every non-zero equal pair is bit-identical. float32/64 use the SIMD path / Math.Max and
were already correct.
Verified bit-exact vs NumPy 2.4.2 across C / F / strided / transposed / negative-stride /
broadcast / 3-D layouts, mixed-dtype bounds, two-bound clip, and non-C-contiguous out=. The
logic.jsonl fuzz corpus now reports zero W7-A divergences (was 34x); maximum/minimum are
bit-exact on every layout.
MisalignedRegistry: removed the now-dead W7-A pairing excuse (the fallback for extrema value
divergences). fmax/fmin keep only the NaN-propagation excuse (W7-B, a separate ignore-NaN
bug); the isclose strided-complex excuse (its own comparison kernel, not clip-routed)
remains.
Tests (np.minimum.Test.cs): added F-contiguous bound, F-contiguous source, two-bound
F-contiguous clip, strided-slice, non-C-contiguous out=, 3-D transposed, float16
F-contiguous, and float16 signed-zero tie-break regressions. Added a flat (logical C-order)
assertion helper since GetDouble(int) on a multi-dimensional result indexes the first axis,
not the flat buffer.
Full net10.0 suite: 9769 passed / 0 failed.
The Half/Complex clamp helpers (HalfMaxNaN, HalfMinNaN, ComplexIsNaN, ComplexMaxNaN,
ComplexMinNaN) are invoked once per element from the clip IL kernel's inner loop - the
non-SIMD scalar path for Half/Complex, which have no Vector<T> Min/Max. Marked them
[MethodImpl(AggressiveInlining | AggressiveOptimization)]:
- AggressiveInlining lets the JIT fold them into the generated kernel where it can
(e.g. ComplexIsNaN into ComplexMax/MinNaN, and the helpers into the dynamic-method
body), removing the per-element call.
- AggressiveOptimization gives them full tier-1 codegen from the first call when they
are invoked standalone via the kernel's Call opcode, skipping the tier-0 quick-JIT
penalty on a hot path.
Scope: only the per-element helpers in the 'Per-Element Helpers for Non-SIMD Types (called
from generated IL)' region. The one-time IL-codegen methods (Generate, Emit*, etc.) are
intentionally left untouched - they run once per (dtype, mode, kind) and are cached, so
AggressiveOptimization would only add JIT startup cost with no steady-state benefit.
… op helpers Following the clip per-element helpers (d72822f), tag the other small per-element scalar/generic op helpers that run inside kernel inner loops with [MethodImpl(AggressiveInlining | AggressiveOptimization)] - inline into the loop where the JIT can, full tier-1 codegen otherwise (skip the tier-0 quick-JIT penalty on the hot path). Tagged (8 helpers, 3 files): - ComplexDivideNumPy (DirectILKernelGenerator.cs) - per-element complex divide called from the binary IL kernel. - HalfSignHelper (Unary.Decimal.cs) - per-element Half sign called from the unary IL kernel. - Reduction.Axis.cs accumulator helpers, each called once per element inside the axis-reduction loops: ReadAsDouble, CombineScalarsPromoted, ComplexLexPick, DivideByCount, ConvertToDouble<T>, ConvertFromDouble<T>. Their generic typeof(T) branches constant-fold to a single cast per instantiation, so the inlined bodies are tiny. Deliberately NOT tagged: - Whole-array / whole-axis loop helpers (Nan*SimdHelper, Arg*Helper, CopyGeneralSameType, the *ReduceAxis* loops, scan helpers): they ARE the loop, called once per op/axis with large bodies - AggressiveInlining is ignored on them, and .NET on-stack-replacement already promotes their long loops to tier-1. Can be revisited with AggressiveOptimization-only if profiling shows a tier-0 entry cost. - One-time IL-codegen methods (Generate/Emit*): run once per cached kernel key, so AggressiveOptimization would only add JIT startup cost with no steady-state benefit. JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix 42/42.
…per-element op helpers Sweep the remaining small per-element / per-vector / per-T op helpers in the kernel hot core so they all carry [MethodImpl(AggressiveInlining | AggressiveOptimization)], extending the clip (d72822f) and reduction-accumulator (06d942c) passes. Verified the two flags compose with NO penalty: a clean per-variant microbenchmark (dedicated AggressiveOptimization loop per variant, 1.5e9 iters) measured AggressiveInlining = AggressiveInlining|AggressiveOptimization = AggressiveOptimization-only = 740-741 ms, vs NoInlining = 1106 ms. AggressiveOptimization does NOT suppress the inlining that AggressiveInlining requests; it only adds tier-1-from-first-call for the standalone version (relevant when a helper is reached via the kernels' generated-IL Call rather than inlined). Changes: - Upgraded 131 helpers that were already AggressiveInlining-only to add AggressiveOptimization (Reduction.Axis.Simd.cs, Reduction.Axis.Widening.cs, Reduction.Axis.VarStd.cs, Quantile.cs, Modf.cs, Where.cs, IndexCollector.cs, SimdMatMul.cs, SimdMatMul.Strided.cs, StrideDetector.cs). - Added the full attribute to 6 previously-untagged per-element / per-vector / seed helpers: CompareGreater<T>, CompareLess<T> (Reduction.Axis.Arg.cs - argmax/argmin per-element compare), GetNanIdentityValue<T> (Reduction.Axis.NaN.cs), GetIdentityValueTyped<T> (Reduction.Axis.cs), HorizontalReduce256<T>, HorizontalReduce128<T> (Reduction.Axis.Simd.cs - per-vector finalize). - Added using System.Runtime.CompilerServices; to the three files that lacked it. No AggressiveInlining-only attributes remain in Backends/Kernels/ (150 now carry both flags). JIT-only change - no behavior change. Full net10.0 suite: 9769 passed / 0 failed; FuzzMatrix 42/42.
…elpers codebase-wide Completes the sweep started in the kernels (e3685be): every method already marked [MethodImpl(MethodImplOptions.AggressiveInlining)] now also carries AggressiveOptimization. The two flags compose with no penalty (benchmarked: AggressiveInlining = both = AggressiveOptimization-only at 740-741ms, all far ahead of NoInlining at 1106ms) and the change is JIT-only / behavior-neutral. 396 helpers across 48 files upgraded - Primitives (NDArray operators), Backends/Iterators (NDIter/NDIterator per-element traversal), Utilities + Utilities/SpanSource, Backends/Default/Math + BLAS, Manipulation, Creation, RandomSampling, Backends/Unmanaged/Pooling, DateTime64, etc. No AggressiveInlining-only attributes remain in NumSharp.Core (550 now carry both flags). AggressiveOptimization is a no-op for methods that inline (most of these); its benefit is tier-1-from-first-call for the standalone version when a helper is reached via a non-inlined path (generated-IL Call, delegate, or a body too large to inline). Every changed line is the attribute swap - no behavior or body changes. Full net10.0 suite: 9769 passed / 0 failed; net8.0 core builds clean.
… axis-reduction perf Investigation artifacts only — no library code changed. complex128 axis reductions were the next-worst op cluster after np.zeros (sum/mean axis=0/1 ran 19–64× slower than NumPy 2.4.2). These POCs find the root cause and benchmark every candidate fix against NumPy 2.4.2, all re-runnable via `dotnet run -c Release` in benchmark/poc. Root cause (complex_axis_probe / complex_alloc_probe / complex_why_probe): - NOT boxing. np.sum(complex) allocates ~0 B/elem; the (object) casts in CombineScalarsPromoted JIT-elide. The earlier "30M boxes" theory was wrong. - The real cost is scalar System.Numerics.Complex add where NumPy SIMD-vectorizes the reduction (complex == contiguous double pairs). A double-pair Vector256 add reaches NumPy parity: 7.52 vs 7.59 ms @10m, and beats it @1m (0.31 vs 0.33). - Secondary: the scalar fallback's non-inlined per-element combine + op-switch + cache-hostile per-output column gather (axis=0); mean's MeanAxisComplex allocates an NDArray view + iterator per output row (6 B/elem). - Small-N gap is compute (scalar), not allocation: np.zeros((N,)) output is ~4% of the op; pure compute is ~93%. Candidate benchmark (complex_reduce_poc) — baseline / NDAxisIter / NDIter-2op via ForEach / NDIter-2op via ExecuteGeneric / monolithic generic-math, both axes, 10K–10M, plus strided input, Half/Decimal sweep, and alloc/op: - Monolithic generic-math is the most uniform (1.3–2.5× NumPy, never bad). - NDAxisIter beats NumPy on inner-reduce (axis=1) but is cache-hostile on outer-reduce (axis=0, 6.6× @10m — per-output strided column gather). - NDIter 2-op REDUCE is good on axis=0 but 5.8× on axis=1 (BUFFER+REDUCE inner double-loop unfinished) and allocates 2344 B/op for iterator construction. - Half needs a wider <TInput,TAccum> accumulator (same-type is precision-wrong); Decimal/Complex are correct same-type. generic_math_proto: proves .NET generic math (IAdditionOperators / IAdditiveIdentity) yields ONE zero-boxing kernel covering Complex AND all numerics; np.cumsum(complex) already rides this exact path (CumSumAxisKernel<T>). Fallback handling (complex_fallback_poc): layout-aware dispatch keyed on which LOGICAL axis is contiguous (NumPy's strategy), not the physical last axis — output-axis stride==1 -> SIMD slab-fold reduce-axis stride==1 -> SIMD 2-lane (re,im) contiguous reduce neither -> strided scalar, smaller-stride axis innermost Covers C-contig / row-sliced a[::2,:] / col-sliced a[:,::2] / transposed a.T, all correct, 0.79–1.23× NumPy (beats it on 4 of 8). Mirrors the fast-path dispatch AxisReductionSimdHelper already has for numeric dtypes, so the real fix = reuse that dispatch + plug in two complex double-pair SIMD primitives. Decision pending: (A) extend AxisReductionSimdHelper with Complex-as-double-pairs (fastest, lowest risk, matches existing reduction design) vs (B) migrate reductions to NDIter (long-term target; needs inner-reduce double-loop + small-N overhead work).
…e open item Updates the combinatorial-indexing handover to the executed reality: all five divergence buckets (R1 value-broadcast on empty/scalar selections, B2 0-d-base over-index, B3 empty advanced gather, B4 empty-slice assignment, R2 non-consecutive 0-d-bool placement + the Shape.Broadcast hash collision it uncovered) are fixed, committed (aea9fc78..7e968f5e), and pinned by the new Indexing.CombinatorialParity [FuzzMatrix] gate. The random sweep is 0 divergences across every measurable window. The only open item is R3, a pre-existing flaky teardown heap-corruption. Records the diagnostics gathered this session that supersede the prior "pinned managed" guess: it is a specific corpus shape (not allocation volume), it is NOT a pooled native-buffer overrun (end-aligned guard pages run clean), and it resists deterministic repro (crash point varies 6285-9879; ~1/3 even in the tightest window). Next-session plan: a per-case red-zone on FromArray / the direct-VirtualAlloc zeroed path, or a dotnet-dump capture. Index_Random stays [OpenBugs] purely on the crash, not parity.
Mechanical, behavior-preserving rename of NumSharp's iterator/expression
stack from the `Npy*` prefix to `ND*`. ND* matches NumSharp's house style
(`NDArray` ↔ numpy.ndarray, the retired NDIterator) and NumPy's Python
`nditer` name. The old `Npy*` prefix mirrored NumPy's C struct name
(`NpyIter`); `ND*` is the user-facing convention used everywhere else.
Scope (NumSharp-owned only):
- Types/delegates/enums: NpyIter→NDIter, NpyIterRef→NDIterRef,
NpyExpr→NDExpr, NpyIterState/Flags/PerOpFlags/GlobalFlags/OpFlags,
NpyFlatIterator→NDFlatIterator, NpyAxisIter→NDAxisIter,
NpyMemOverlap→NDMemOverlap, reduction-kernel structs + interfaces
(INpy…→IND…), NpyArrayMethodFlags→NDArrayMethodFlags, and the
NpyIter_* C-API-style method names → NDIter_*.
- Utilities: NpyComplexMath→NDComplexMath, NpyDivision→NDDivision,
NpyIntegerPower→NDIntegerPower.
- Benchmark subsystem: benchmark/npyiter → benchmark/nditer
(npyiter_{bench,sheet,cards,results,headline} → nditer_*,
--skip-npyiter → --skip-nditer).
- 65 files renamed via git mv; ~190 files content-swept; website docs,
docs/numpy notes, and frozen benchmark/history snapshots included.
Preserved (genuine NumPy references, NOT the stack):
- src/numpy/** (the upstream clone — NpyIter is NumPy's real C type).
- The .npy/.npz file format: `#region NpyFormat` (np.save/np.load) and the
SaveAndLoadWithNpyFileExt test.
- NumPy's C function names quoted in docs (npyiter_allocate_arrays,
npyiter_coalesce_axes, … kept verbatim).
Build: solution green (0 errors). Tests: 10980 passed, 0 failed, 11 skipped
(net10.0, CI filter TestCategory!=OpenBugs&!=HighMemory).
Branch commit messages were rewritten Npy→ND separately (message-only
history rewrite; file blobs in historical commits untouched). This commit
is registered in .git-blame-ignore-revs as a mechanical rename.
…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NDIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NDIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).
… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NDIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
Rename the two standalone benchmark projects to names reflecting what they are: - NumSharp.Benchmark.GraphEngine → NumSharp.Benchmark.CSharp — the comprehensive BenchmarkDotNet operation suite; pairs with NumSharp.Benchmark.Python (numpy baseline). - NumSharp.Benchmark.Exploration → NumSharp.Benchmark.Sandbox — the SIMD & broadcasting experiment playground. Project dirs, .csproj (AssemblyName/RootNamespace), all namespace/using decls (incl. the partial-qualifier Exploration.Infrastructure → Sandbox.Infrastructure), and external referencers. Neither project is in SciSharp.NumSharp.sln. Both build clean (0 errors). NOTE: benchmark/CLAUDE.md excluded — concurrent uncommitted edits by another actor; its references will be swept in a follow-up. Left as prose: the Sandbox banner "Performance Exploration Suite" and a "#541" reference.
…3.0; rename GraphEngine->CSharp refs RELEASE_0.60.0.md (new) — comprehensive release notes for NumSharp 0.60.0, the first stable, consolidating everything since 0.50.0 (the nditer engine, np.evaluate fusion, full advanced-indexing parity, byte-exact printing, C/F/A/K, np.random MT19937 parity, 36+ new np.* APIs, stride-native matmul, ARC, differential fuzzing). Benchmark claims pinned to the committed 2026-06-23 history snapshot: - TL;DR geomean corrected to 1.26x @10m (397 faster / 150 close / 42 slower of 615 ops; 1.14x @1k, 0.90x @100k) -- was the stale 1.00x/409-op June-5 figure. - Added a "Benchmark harness" section in section 8: the op/dtype/N matrix (1,851 cells) plus the five iterator/layout/operand/cast/fusion subsystems, history snapshots + latest symlink, and the NPY/NS convention. - section 8 cast row refreshed to the final campaign state (716 -> 129 lagging of 1,568 comparable; 1,439 winning) with the per-src narrow-int geomeans. - section 15 now discloses the benchmark-surfaced slower paths (small-N dispatch, scalar Half/Decimal, large-N any full-scan, comparison->bool, fancy gather). - np.random (MT19937 + exactly 24 new distribution files) and the split funcs verified new-since-0.50.0 against ground truth; np.random added to the opening. - Named the renamed C# bench project (benchmark/NumSharp.Benchmark.CSharp). benchmark/CLAUDE.md — brought the development guide from v2.0 (2026-02-13) to v3.0: run_benchmark.py documented as THE orchestrator; metrics/type-system/categories/ python-suite/directory sections updated to current state -- 15 dtypes (added int8/float16/complex128), ~615 ops/size (1,851 cells), 14 comparison suites + 5 appended subsystems, 3 cache tiers (1K/100K/10M), NPY/NS convention, committable history/ snapshots; run-benchmarks.ps1 marked legacy. All 13 NumSharp.Benchmark.GraphEngine references -> NumSharp.Benchmark.CSharp (folder/csproj/namespace/paths) for the rename in 002bc62. .github/workflows/benchmark.yml — stale "builds GraphEngine+Core" step comment -> "builds the CSharp bench project + Core".
Deleted the np-tests skill (write/migrate NumPy tests for NumSharp): - .claude/skills/np-tests/SKILL.md (the skill definition, ~3 KB) - .agents/skills/np-tests/SKILL.md (symlink to the above) Nothing else references the skill; the np-function skill is unaffected.
Release prep for the 0.60.0 major release — the nditer rewrite that aligns NumSharp's compute core with NumPy 2.4.2. - Version bump 0.40.0 -> 0.60.0 across NumSharp.Core and NumSharp.Bitmap (AssemblyVersion / FileVersion / Version / PackageVersion). The tagged CI release also injects the version from the v* tag; this keeps the repo metadata correct for local and non-tagged builds. - Add docs/releases/RELEASE_0.60.0.md — the full 0.60.0 release notes: NumPy nditer port, np.evaluate fusion, full advanced-indexing parity, byte-exact array printing, C/F/A/K memory layout, np.random rebuilt on MT19937 + 24 new distributions, all 15 dtypes (int8/float16/complex128), stride-native matmul, the astype SIMD campaign, ARC memory management, and differential fuzzing. - CI (build-and-release.yml): the GitHub Release body is now composed from docs/releases/RELEASE_<version>.md (install header + packages table + notes), falling back to auto-generated notes only when no notes file is present. - CLAUDE.md refreshed to the final/complete project state. - Website docs updated (il-generation, index, theme CSS).
|
Yes, I reviewed 600 files. Deal with it. |
The progress to 0.60.0
A from-scratch port of NumPy 2.4.2's
nditerengine and the entire compute stack built on it — merging thenditerbranch intomaster. It adds a fused-expression DSL (np.evaluate), full advanced-indexing parity with NumPy (down to the memory-safety and error-text level), byte-exact NumPy array printing, C/F/A/K memory-layout support wired through the whole API, stride-native matmul, NumPy-seed-compatiblenp.random(MT19937), all 15 NumPy dtypes, 36+ newnp.*functions, deterministic memory management, and a differential-fuzz pipeline that proves bit-exactness against NumPy. It ships as NumSharp 0.60.0 — the first stable (non-prerelease) release in the 0.x line.620 commits · 1,309 files · +330,971 / −273,664 (vs
master). The net is a large body of new engine code offset by an equally large deletion of legacy generated code (the Regen template engine,NDIterator, andMultiIteratorare all gone). Everything below was validated against NumPy 2.4.2 ground truth — a ~40,000-case differential corpus, 566 iterator-parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.TL;DR
NDIter— full port of NumPy 2.4.2'snditer(~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. It is the production execution engine for the elementwise/reduce core, at or faster than NumPy on every probed aspect.NDExprDSL + three-tier custom-op API, exposed as the publicnp.evaluate— write your own ufunc (raw IL / element-wise SIMD / composable expression trees) and run fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPyresult_typetyping and fused reductions.prepare_index+ unified advanced-index gather/scatter takes the get/set surface from ~697 → 0 divergences (full parity) against a new committed differential index oracle, closes a class of memory-safety bugs (out-of-bounds gather/scatter on exotic mixed indices), and aligns everyIndexError/ValueErrortext with NumPy.NDArray.ToString()is now a 1-to-1 port of NumPy 2.4.2'sarray_str/array_repr/array2string+ Dragon4 float formatting; new publicnp.array2string/np.array_repr/np.array_str/np.set_printoptions/np.printoptions/np.format_float_positional/np.format_float_scientific. ~18,000 fuzz cases byte-identical to NumPy.out=/where=/dtype=ufunc kwargs across the elementwise API — binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plusnp.bitwise_and/or/xor,np.positive, and first-classnp.maximum/minimum/fmax/fmin.int8andfloat16and promotescomplex128from a stub to first-class (was 12 dtypes), each wired through creation /astype/ reductions / comparisons / IL kernels with full 15×15 parity; the six Complex transcendentals are implemented. NumPy 2.x type aliases (np.byte→int8,np.complex64throws,np.intp/int_/uint) realigned.np.*APIs —sort/argsort,pad(11 modes),tile,median/percentile/quantile(all 13 interpolation methods) +nan*variants,average,ptp,take/put/place,extract/compress,diagonal/trace,argwhere/flatnonzero,unravel_index/ravel_multi_index/indices,delete/insert/append,split/array_split/hsplit/vsplit/dsplit,diff/ediff1d,asfortranarray/ascontiguousarray,np.multithreading, plus the printing APIs above.np.randomrebuilt for NumPy RNG parity — the legacy Knuth subtractive generator is replaced by MT19937 (NumPy's Mersenne Twister) for 1-to-1 seed/state compatibility, plus 24 new distribution samplers (weibull,vonmises,pareto,laplace,gumbel,dirichlet,multivariate_normal,noncentral_chisquare/noncentral_f,standard_t/standard_cauchy/standard_gamma/standard_exponential,triangular,zipf,logseries,rayleigh,wald,power,f,logistic,hypergeometric,multinomial,negative_binomial). Output is byte-identical to NumPy 2.4.2 at a given seed.Shapeunderstands F-contiguity,OrderResolverresolves NumPy order modes, ~68 layout bugs fixed.astype) faster than NumPy across the board — the entire copy/retype/cast surface is unified on oneNDIter.CopyAscore (the 2,226-line legacy per-element cast loop deleted), then a SIMD campaign took the full 15×8×15astypematrix from 716 → 129 lagging cells of 1,568 comparable (1,439 winning ≥1.0× vs NumPy 2.4.2).IDisposableonNDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).FuzzMatrixgate, and a nightly soak (§10).MultiIterator,NDIterator(interface + class +AsIterator), and the Regen template engine (87 inline#if _REGENblocks across 35 files) are all gone.FuzzMatrixgate. 177 formerly-[OpenBugs]reproductions were promoted into regular CI tests as their bugs were fixed.1.
NDIter— full NumPynditerportFrom-scratch C# port of NumPy 2.4.2's iterator machinery under
src/NumSharp.Core/Backends/Iterators/(~12,557 lines), promoted to public API with NDArray overloads. The public surface includes the NumPy-named flag enums (NDIterFlags/NDIterOpFlags/NDIterGlobalFlags,NPY_ORDER,NPY_CASTING), theNDIterRefkernel handle, and theNDInnerLoopFuncper-chunk delegate; the standalone flat iterator isNDFlatIterator(drivesnp.broadcast(...).iters).MULTI_INDEX,C_INDEX,F_INDEX,RANGE(parallel chunking),GotoIndex/GotoMultiIndex/GotoIterIndexDELAY_BUFALLOC, buffered-reduce double-loop (incl.bufferSize < coreSize)op_axeswith-1reduction axes,REDUCE_OK,IsFirstVisit,REUSE_REDUCE_LOOPSslab accumulationCOPY_IF_OVERLAPvia a port of NumPy'smem_overlapsolver (NDMemOverlap.cs) — overlapping in/out operands no longer silently corruptWRITEMASKED+ARRAYMASKexecuted — the buffered window flush writes back only mask-nonzero elements;VIRTUALoperands construct with NumPy 2.x semanticsNPY_MAXARGS=64) and unlimited dimensions (NumPy caps atNPY_MAXDIMS=64) via dynamic allocationCopy,GetIterView,RemoveAxis,RemoveMultiIndex,ResetBasePointers,IterRange,DebugPrint, fixed/axis stride queries,GetValue<T>/SetValue<T>, …NDIterCasting.CanCastmatches NumPy'ssafe/same_kindlattice exactlyValidated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, plus a permanent variation-probe harness. Dozens of parity bugs found and fixed against NumPy ground truth (negative-stride flipping,
NO_BROADCASTenforcement,F_INDEXcoalescing, buffered-reduction stride inversion, K-order on broadcast inputs, the size-1 stride-0 invariant,op_axesout-of-bounds reads on stretched size-1 axes, write-broadcast validation, unit-axis absorption) — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.Execution at NumPy speed
NDIterisn't just correct — it is the production execution engine:DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NDIter Tier-3B shell.a*b+c10M(a-b)/(a+b)10MKey mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (
vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels.2.
NDExprDSL +np.evaluate(fusion)User-extensible kernel layer on top of
NDIter— the public answer to "how do I write my own ufunc":ExecuteRawIL: emit raw IL against the NumPy ufunc signaturevoid(void** dataptrs, long* strides, long count, void* aux).ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.ExecuteExpression: composeNDExprtrees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons,Min/Max/Clamp/Where), plusCall()to splice any delegate/MethodInfointo a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.Exposed publicly as
np.evaluate(expr[, operands][, out]):result_typetyping — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly:(i4*i4)+f8wraps the multiply in int32 (→1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4,i4/2 → f8) with NumPy's exactOverflowError, and special resolvers (true_divide,arctan2, negative-integer-literalpower→ValueError, booladd=OR/multiply=AND).NDExpr.Sum/Prod/Min/Max/Meancompile a one-pass inner loop;sum(a*b)readsaandbonce and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).out=joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing throughCOPY_IF_OVERLAP); anEXTERNAL_LOOPguard prevents the silentcount==1slow path.a*b+c3.2×,(a-b)/(a+b)6.1×,sum(a*b)3.6×,sum f322.9×,i4*2+f83.5× faster. Permanent gate inbenchmark/fusion/.Floor/Ceil/Round/Truncate) vectorizes (float/double only, and only where the BCL providesVector{N}.<op>at the active width), so fused integer rounding no longer crashes and SIMD is used wherever it exists on the running runtime.3. Full advanced-indexing parity + memory-safety
A differential index oracle (NumPy 2.4.2 as the sole oracle) proved the common get/set surface was bit-exact but exposed ~697 divergences across exotic mixed advanced-index combinations (boolean-array + fancy, multi-dim fancy + slice, 0-d-bool + fancy, multi-fancy, empty combos) plus a flaky heap-corruption crash. This PR replaces the ad-hoc per-shape "Try*" patchwork with a faithful port of NumPy's
mapping.cmodel and drives the oracle from ~697 → 0 divergences (full parity). The final fixes closed the remaining divergence categories — value-broadcast on empty/scalar selections, 0-d-base over-indexing, empty-advanced gather, empty-slice assignment, and non-consecutive 0-d-bool axis placement (which uncovered and fixed a latentShape.Broadcasthash collision) — all pinned by a newIndexing.CombinatorialParityFuzzMatrixgate; the curated, dtype, and seeded-random oracle tiers are bit-exact. The one known open item is a flaky teardown heap-corruption crash on a single corpus shape (Index_Randomstays[OpenBugs]for the crash, not for any parity gap).PrepareIndex(Selection/NDArray.Indexing.PrepareIndex.cs) — a port of NumPy'sprepare_index: one up-front pass that classifies the whole index tuple (HAS_* bitmask) and validates it before any kernel runs. Raises NumPy-verbatim texts for too-many-indices, boolean-array-dim mismatch, integer/array value out-of-bounds, and un-broadcastable advanced blocks.TryBuildMultiAdvancedGrid— NumPy's general advanced-index algorithm (block broadcast + consec-aware axis placement). All advanced tuples (single or multiple fancy, fancy + slice/newaxis, 0-d bool joining the block, pure-advanced) now route through one gather/scatter; only pure-basic tuples fall through to the view path. Correct axis placement for ≥2 advanced indices mixed with slices (consecutive → in place; separated → advanced axes to front).a[0,4]on a 3×4 now raises instead of returning a neighbour); too-many advanced indices raise instead of walking strides past the array; the negative-stride gather offset bound was corrected (valid rows ona[::-1]no longer rejected); block-copy bounds guards + an opt-in Windows page-heap (NUMSHARP_GUARD_PAGES=1) backstop the gather/scatter.arr[mask, int],arr[:, mask],arr[mask, 1:3], leading k-D mask + basic) now works for get and set, expanding each mask to itsnonzero()components like NumPy.could not broadcast …ValueError); a valid smaller value tiles like NumPy; a broadcastable value scatters correctly into a ≥2-D fancy subspace; empty-into-non-empty raises, empty-into-empty no-ops.bool[]/bool[,](any rank) andIEnumerable<bool>are recognized as masks; C# tuples (nd[(1,2)]) spread to coordinates;List<int>/ArrayList/anyIEnumerablecoerce to a fancy index;uint64scalar indices andint8fancy-index dtype are accepted (all eight integer dtypes now index identically).FuzzMatrixgate (test/oracle/gen_index_oracle.py+Fuzz/IndexOracleTests.cs):index_curated.jsonl(2,265) +index_dtype.jsonl(104) run in CI bit-exact, and the 10,000-case seeded random tier is now 0 divergences too (replayed in soak; held out of the per-PR gate only by the teardown crash above). The combinatorial fixes are additionally pinned by theIndexing.CombinatorialParitygate.4. Byte-exact NumPy array printing
NDArray.ToString()is refactored from the legacy Python-list output ([0, 1, 2], no alignment/precision/summarization/dtype) to a 1-to-1 port of NumPy 2.4.2's array printing (numpy/_core/arrayprint.py+dragon4.c).ToString()/ToString(false)→np.array_str→"[0 1 2]"(thestr()form).ToString(true)→np.array_repr→"array([0, 1, 2], dtype=…)"(therepr()form).src/NumSharp.Core/Backends/Printing/:PrintOptions(AsyncLocal context mirroring NumPy'sformat_options),Dragon4(positional + scientific float→string),ElementFormatters(bool/int/float/complex with the two-pass exp-format decision, per-dtype cutoffs, column sizing, nan/inf fields),ArrayFormatter(recursive layout, line-wrap atlinewidth, summarization withedgeitems, the 0-dstr-vs-reprasymmetry, repr dtype/shape suffixes).ToString("R")(== Dragon4 unique) but routes all rounding throughToString("F"|"E"+precision)(rounds the true binary value, IEEE half-to-even) — never the shortest string (which diverges ~50% on adversarial ties).APIs/np.array2string.cs):np.array2string,np.array_str,np.array_repr,np.set_printoptions,np.get_printoptions,np.printoptions(IDisposable context),np.format_float_positional,np.format_float_scientific.Chardtype (string storage, no NumPy equivalent) keeps its legacy rendering, soGetString/AsStringare unaffected.Validated by ~18,000 differential-fuzz cases byte-identical to NumPy 2.4.2 across all dtypes (incl. float16/32/64 scientific, adversarial ties/carries), 1-D–5-D, sliced/broadcast/transposed views, complex, nan/inf, summarization, line-wrap, and print options; ~174 strict parity tests on net8.0 + net10.0.
5. C/F/A/K memory-layout support
Shapenow tracks F-contiguity with NumPy-convention contiguity computation; newOrderResolverresolvesC/F/A/Kfor every API with anorderparameter.copy,array,asarray,asanyarray,*_like,astype,flatten,ravel,reshape,eye,concatenate,cumsum,argsort,tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.np.asfortranarray,np.ascontiguousarray.np.whereselects C/F output layout the way NumPy does;ravel('F')of an F-contig source returns a view (was a 3,000× copy).fortran_order, Decimal scalar path, fancy-write isolation, …).6. New & completed
np.*APIsNew functions:
np.evaluate(fused expressions — §2),np.bitwise_and,np.bitwise_or,np.bitwise_xor,np.positive, first-classnp.maximum/np.minimum/np.fmax/np.fminnp.array2string,np.array_str,np.array_repr,np.set_printoptions,np.get_printoptions,np.printoptions,np.format_float_positional,np.format_float_scientificnp.sort(+ndarray.sort;np.argsortreimplemented) — radix line-kernel on NDIter, stable, NaN-last, all axes / orders (closes a long-standing Missing Function)np.pad(all 11 NumPy modes + callable),np.tile,np.delete,np.insert,np.appendnp.split,np.array_split(uneven splits),np.hsplit,np.vsplit,np.dsplitnp.take,np.put,np.place,np.extract,np.compress,np.argwhere,np.flatnonzero,np.diagonal,np.trace,np.unravel_index,np.ravel_multi_index,np.indicesnp.median,np.percentile,np.quantile(all 13 interpolation methods, tuple axis,out=,keepdims, QuickSelect),np.average(weights,returned, tuple-axis),np.ptp,np.nanmedian,np.nanpercentile,np.nanquantilenp.diff,np.ediff1dnp.asfortranarray,np.ascontiguousarraynp.multithreading(enabled, max_threads)— opt-in threaded kernelsRebuilt to full NumPy 2.x parity:
np.clip(min=/max=aliases, None bounds, 2.x promotion,out=, 4×-unrolled SIMD kernel),np.unique(5 missing kwargs, NaN partitioning, up to 43× faster),np.searchsorted(side=/sorter=, IL binary-search 5–25× faster),np.copyto(casting=/where=, overlap-safe, SIMD fast paths),np.asarray(copy=/like=/device=/string dtype),np.concatenate,np.all/np.any(tuple-axis,out=,where=),np.expand_dims(tuple axis),np.repeat(axis=),np.power(integer-power semantics + crash fix),np.broadcast(N-operand0..∞, live cursor, lazy.iters/.numiter). Engine completeness: bool/charmax/min, Complex quantile,IsInfimplemented, and the six Complex transcendentalssinh/cosh/tanh/arcsin/arccos/arctan(wereNotSupportedException).np.maximum/minimum/fmax/fminare now first-class binary ufuncs (NEP50 promotion, broadcasting,out=,where=, every execution path) instead ofclip-based shims — which fixes a real correctness bug:fmax/fminnow ignore NaN (return the finite operand) whilemaximum/minimumpropagate it.out=/where=/dtype=ufunc kwargs (NumPy parity)The kwargs on every NumPy ufunc now span the elementwise core — binary (
add,subtract,multiply,divide,true_divide,mod,power,floor_divide), unary-math (sqrt,exp,log,sin,cos,tan,abs/absolute,negative,square, …), the six comparisons, predicates (isnan/isfinite/isinf), bitwise,invert,arctan2— each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:outjoins the broadcast but never stretches (mismatched/stretchableoutraise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50);outonly needs a same_kind cast; the provided instance is returned (reference identity).wheremust be exactlybool; it broadcasts over operands and participates in output shape; mask-false slots keep prioroutcontents.outaliasing an input is well-defined viaCOPY_IF_OVERLAP.dtype=computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the booladd→OR /multiply→AND remap keyed off the final loop dtype.Random sampling — NumPy RNG parity (
np.random)np.randomis rebuilt around NumPy's own bit generator and legacy distribution algorithms, so a seeded NumSharp stream is byte-identical to NumPy 2.4.2.Seed(uint)/SeedByArray(uint[]), 53-bitNextDouble, rejection-sampled bounded ints, 624-word stateClone/SetState) with 1-to-1 seed/state compatibility:get_state()/set_state()use NumPy's state tuple format (Algorithm,Key[624],Pos,HasGauss,CachedGaussian). Gaussian generation moved from Box-Muller to the Marsaglia polar method with the_hasGauss/_gaussCachecarry, matching NumPy'srandom_standard_normal(and its cached state) exactly.dirichlet,f,gumbel,hypergeometric,laplace,logistic,logseries,multinomial,multivariate_normal,negative_binomial,noncentral_chisquare,noncentral_f,pareto,power,rayleigh,standard_cauchy,standard_exponential,standard_gamma,standard_t,triangular,vonmises,wald,weibull,zipf— joining the existing 16 (rand,randn,randint,uniform,normal,bernoulli,beta,binomial,chisquare,choice,exponential,gamma,geometric,lognormal,poisson,permutation/shuffle).RandomStatealgorithms so they consume the RNG in the same order and emit byte-identical values:geometric(NumPy's search algorithm; previously returned negatives),beta(Jöhnk's algorithm fora,b ≤ 1),chisquare(SampleStandardGammawith Vaduva's algorithm forshape < 1), thelognormal(mean=0)NaN fix, andsize=0empty-array support across every sampler.multivariate_normaluses an SVD transform (Jacobi eigendecomposition, eigenvalues sorted descending) matching NumPy's implementation; identity covariance is an exact sequence match.int64—poisson,binomial,geometric,hypergeometric,zipf,logseries,negative_binomial(matching NumPy dtypes); scalar overloads return a 0-dNDArray;longis the canonical size/index type throughout (allintdowncasts removed). Size/axis/seed validation, the return contract (size=None→ scalar,size=()→ 0-d), and the dual-overload API shape all follow NumPy.7. Linear algebra
Vector256FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxingGetValuefallback chain.matmulgufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a differential matrix (816 cases).np.multithreading— opt-in parallel 1-D dot (1M float dot 172 → 60 µs, ~2× NumPy's default). Off by default; bitwise-identical summation order when off.8. Performance (engine, reductions, shifts, sort, casts)
sum(int16, axis=1)1058 ms → 2.7 ms (389×); also fixes a uint32 axis-sum corruption bugmean/var/std(axis)count_nonzero20×min/max(f64/f32)Avx.Min/Maxdrops the JIT's redundant NaN fixup with a separate finite-mask + cold scan: 0.64×/0.69× → 1.55×/1.73× @100k; broadcast/neg-stride axis min/max SIMD-routed (was 0.07–0.17×)np.left_shift/right_shiftVPSLLV*/VPSRAV*via Tier-3B): common cases 2–4× NumPy (was 0.05–0.34× scalar); fixes 7 correctness bugs incl. NEP50 promotion, negative-count, bool→int8, sbyte SIMD; adds<</>>operator overloads onNDArraynp.sort/argsortnp.cumsumadd.accumulatewith KEEPORDER output (C-src→C, F-src→F); 1.6–11.8× by axis; fixes empty/0-d/size-1 NEP50 widening + negative-axis validationnp.nonzerosum(broadcast_to(...))~534–700× faster, bit-exactsum/mean(float)np.add.reducebit-for-bit; unblocks float32)np.any/np.all(bool/char)np.zeroscalloc/demand-zero — O(1) (10M f64: 14.3 ms → ~0.01 ms)astype)np.absdtype=parity (complex magnitude `High-performance type casting
The single largest engineering effort in this PR rebuilt the
astype/copy/retype machinery from the ground up.NDIter.CopyAs(dstType, src, order)— it resolvesC/F/A/KviaOrderResolver, allocates the destination, and fills it through the in-placeCopyprimitive.NDArray.copy()andDefaultEngine.Cast(astype) both fold into it; the old scalar /(1,)/ same-type-Clone/ F-contig-special /CastCrossTypebranch maze and the 2,226-line per-elementConverts.FindConvertercast loop are deleted. ATryCopySameTypefast path fills scalar-broadcast and gap-free contiguous destinations with one typedInitBlock/memset/SIMD fill (6–8× for 1-byte, ~2× wider).astypecombinations at 1M against NumPy 2.4.2, producing a lagging-cell worklist that successive waves drove from 716 → 129 of 1,568 comparable cells (1,439 now win ≥1.0×). The kernels:float→narrow-int(cvtt+ truncatingNarrow),float/int→bool(!=0compare), Half↔ via the Giesen bit-fiddle (widen/narrow, sNaN-preserving),complex→int/bool(real deinterleave), sub-word strided/reversed lane shuffles (VPSHUFB/VPACKUS), fusedVPGATHERwhole-array kernels for strided float→narrow, an IL-emitted scalar cast kernel for the Vector-less dtypes (directConverts.ToX, 0.65 → 1.5–2.6×), and a KEEPORDER same-type copy. Per-src geomeans: f16→narrow 3.8×, f32→narrow 2.0×, f64→narrow 1.4×.double→intconverters,ToUInt32(double)overflow → 0,DateTime/TimeSpanconversions, and the Half/Complex/char converter paths, and replaced theIConvertibleconstraint with aConverts<T>path — all pinned by the cast tier of the differential-fuzz corpus (full 15×15).Also in this PR:
np.wheregained an IL-generated AVX2/SSE4.1 (Neon-safe) SIMD kernel — 5.4× over the old call-based kernel and ~3.9× faster than NumPy at 1M — plus NEP50 weak-scalar type promotion;np.asanyarraynow accepts every built-in C# collection (List/HashSet/Queue/Stack/ImmutableArray/LinkedList/Memory<T>/ArraySegment/LINQ results, non-genericIEnumerable/IEnumeratorlikeArrayList,Tuple/ValueTuple, and mixedobject[]) with NumPy-like type promotion; and the test suite was migrated to MSTest v3.All figures in this section come from the committed, reproducible NumSharp-vs-NumPy benchmark harness — the op/dtype/N matrix plus five subsystems (iterator / layout / operand / cast / fusion), with per-release
benchmark/history/provenance snapshots and the NPY/NS convention (>1 = NumSharp faster). Full details in §10.9. Memory management — ARC +
IDisposableNDArrayimplementsIDisposablebacked by atomic reference counting on the unmanaged block: CAS-drivenTryAddRef/Release, idempotentDispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.dotat 100K: 446 collections → 0).GC.AddMemoryPressure/RemoveMemoryPressure) so the runtime sees the true unmanaged footprint and collects on time — fixing a runaway-growth bug (a 1M-array loop peaked at 10+ GB → ~54 MB).10. Validation & benchmarking infrastructure (NumPy as the oracle)
Two committed, reproducible systems back every correctness and performance claim in this PR. Both treat NumPy 2.4.2 as the sole oracle, both run with no Python at test time or in CI, and both ship as first-class, regeneratable infrastructure.
The NumPy differential oracle (correctness)
A property-style differential fuzzer that proves every NDIter-backed op is bit-identical to NumPy across the input space — caught systematically, not by hand-picked cases. (The motivating failure: a cast saturate-vs-wrap bug, latent in
where/copyto/concatenate, that example-based tests missed. It must be impossible to ship again.)How it works. Generators under
test/oracle/run real NumPy 2.4.2 and emit a committed, bytes-exact JSONL corpus; the C# harness undertest/NumSharp.UnitTest/Fuzz/replays the operand bytes and bit-compares. NumPy never runs at test time — the committed corpus is its frozen answer.(dtype, shape, element-strides, element-offset, bufferSize, base-buffer-hex)plus NumPy'sexpected(result dtype + shape + C-contiguous result bytes) — or the exception NumPy raised.FuzzCorpus.Reconstructrebuilds the exact logical array from the base bytes alone — broadcast (stride-0), negative strides, offset slices and all — by aliasing a contiguous storage with the operand's view shape (no validation, so even layouts the public API would normalize away replay faithfully).layout_catalog.py, mirrored 1:1 in C#): 26 single-array builders (C/F-contig 1–3-D, transposed, strided/step-2, negative-stride, offset slice, broadcast & scalar-broadcast stride-0, 0-D view at non-zero offset, singleton dims, newaxis, empty, high-rank 5-D, reshape-view), 9 pairwise builders (the SimdFull / SimdChunk / General / scalar-left-right / broadcast-row-col / negstride dispatch classes), and 5where-triple builders. Value pools are edge-loaded — NaN, ±inf, −0.0, type min/max, and the float→int overflow boundaries are front-loaded so even an 8-element operand exercises thecvttsentinel paths. Every builder self-validates that the operand is a true view into its base buffer, so the bytes fully determine the array.OpRegistry): astype (15×15), binary arithmetic (NEP50), floor_divide/mod/power, the six comparisons, ~30 unary-math kernels (incl. the transcendental / hyperbolic / inverse-trig stragglers), every reduction + the NaN-aware family, cumulative scans + diff, statistics (median/percentile/quantile/average/ptp/count_nonzero/clip), logic & extrema (isnan/isinf/isfinite/maximum/minimum/fmax/fmin/isclose), bitwise + shift, where/place, ~25 manipulation ops, modf multi-output, sorting/searching, matmul/dot/outer, andcopyto(incl. overlapping same-buffer copies). Every case is checked on three axes against NumPy: result dtype (NEP50), result shape (broadcasting), and the result bytes (bit-exact, NaN tokenized) — plus error-side parity (a tier where NumPy raises asserts NumSharp also throws).MisalignedRegistry(intended NEP50/algorithm divergences and a few tracked bugs — excused but printed, never hidden), or a failure (any unknown divergence — the gate goes red). A failing element-wise case is auto-shrunk to a 1-element minimal repro.gen_index_oracle.py+IndexOracleTests): a portable token-encoded index tuple over 15 base recipes (scalar, empty, 2-/3-D, transposed, strided, negative-stride, offset, broadcast), replayed for get and set with full error-side parity. Three tiers —index_curated(2,265, CI gate),index_dtype(104 forms × 13 dtypes, CI gate), andindex_random(10,000 seeded, now 0 divergences). This is the gate that drove advanced indexing from ~697 → 0 divergences (§3).MetamorphicTests) that need no NumPy:-(-a)==a,(a+b)-b==a,a.T.T==a, reshape round-trip, widening-cast round-trip,a*1==a/a+0==a,absidempotence, sum-all == flat-sum, concatenate split-free, argsort-of-sorted = identity, equality reflexivity. They catch internal-consistency bugs the differential corpus structurally cannot.[FuzzMatrix]gate on each CI (Windows/Ubuntu/macOS), replaying the committed corpus with no Python. A nightly fuzz-soak (fuzz_random.py, deterministic from its seed) sweeps ~1M fresh cases/night; a divergence prints a shrunk repro to drop intocorpus/regressions/, which then pins it on every CI thereafter. Regenerate withnumpy==2.4.2;Char/Decimal(no NumPy analog) are covered by the separateConverts-oracle tests.The benchmark harness (performance)
A committed, reproducible NumSharp-vs-NumPy comparison driven by one entry point (
benchmark/run_benchmark.py, NumPy 2.4.2 pinned). It builds and runs the BenchmarkDotNet op/dtype/N matrix (benchmark/NumSharp.Benchmark.CSharp) — 1K / 100K / 10M elements × all 15 dtypes, ~615 ops per size (1,851 measured cells: ✅ 792 / 🟡 357 / 🟠 177 / 🔴 72) — joined per(op, dtype, N)to a warm NumPy process, then appends five subsystems that fill the axes the matrix can't express:benchmark/nditer)NDItermachinery itself — construction, traversal, reductions, selection, dtypes, pathologies, dividends — vsnp.nditer, aspect × cache-tiernp.nditer;bcast_reduce538×benchmark/layout)benchmark/operand)benchmark/cast)astypesrc→dst × 8 layouts at 1M (no op-matrix coverage)benchmark/fusion)np.evaluatefused single-pass vs unfused np.* chainsMethodology guards keep the numbers honest: an InProcessEmit BenchmarkDotNet toolchain (sibling worktrees hold same-named projects the out-of-process toolchain refuses), a 25 ms-capped 50-iteration job (so µs–ms array ops skip BDN's nanosecond invocation ramp), and an asserted
-c Releasebuild (file-baseddotnet rundefaults to Debug, which silently halves hand-written kernels). Each run writes a committablebenchmark/history/<date>_<sha>/snapshot (report + every subsystem sheet + cards + a provenance MANIFEST) and repoints alatestsymlink; a decoupled post-releasebenchmark.ymlregenerates it, renders the DocFX pages, and commits it to master. The convention is NPY/NS throughout — ratio = NumPy ÷ NumSharp time, >1 = NumSharp faster (the headline figures above are the2026-06-23snapshot on an i9-13900K).11. Legacy stacks deleted
MultiIteratordeleted; all callers migrated toNDIter.Copy/ multi-operand execution.NDIterator(interface +NDIterator<T>+AsIteratorextensions) deleted entirely; production iteration runs throughNDIter/NDIterRef/GetAtIndex/NDFlatIterator.#if _REGEN … #else <generated> #endifblocks across 35 files collapsed to the generated code with the template DSL preserved as reference comments; all*.template.cs/.tt/GenerateCode.ps1and their dead<Compile Remove>csproj guards are gone.~400per-dtypeNPTypeCodeswitch sites were replaced by a genericNpFuncdispatch utility, and dead-code sweeps removed the 24[Obsolete(error:true)]tombstones plus 10 confirmed-dead private methods.12. New dtypes —
int8,float16,complex128(+Char8/DateTime64)NumSharp now supports all 15 NumPy dtypes (was 12). This PR adds
SByte(int8) andHalf(float16), and promotesComplex(complex128) from a stub to a first-class dtype — each mapped to its NumPy name and wired through array creation,np.dtype("int8"/"float16"/"complex128")parsing,astype, reductions, comparisons, and the IL kernels.NPTypeCode.Single =>switch sites found the three dtypes silently dropping out of ~23 files (9 production crash sites + 5 perf gaps); each was filled so the full 15×15 dtype matrix works end-to-end —NDIterCasting/NDIterBufferManager(safe-cast, buffered iteration, theComplex→realComplexWarningdrop),np.repeat,np.any/np.allaxis,argmax/argminaxis (Half first-NaN-wins, Complex lexicographic real-then-imag), and the reduction identity / min-max kernels. The six Complex transcendentalssinh/cosh/tanh/arcsin/arccos/arctanwere implemented (wereNotSupportedException).np.dtypeparser rewrite.np.dtype(string)is now aFrozenDictionarylookup;finfo/iinfoextend to the new dtypes; and the class-level aliases match NumPy 2.4.2:np.byte→ int8 (sbyte),np.int_→intp,np.uint→uintp,np.intp/np.uintp→ long/ulong on 64-bit, andnp.complex64/np.csinglenow throw instead of silently widening to complex128. See Breaking Changes.Vector<Half>in the BCL) and routes conversions through double; Complex is excluded from the inherently-real ops (unique/clip/randint) per NumPy. Reductions for the three dtypes run through the NDIter path rather than a boxed loop (Decimal axis ops 5–13×, Half mean 1.6–3.7×), with bit-exact accumulation (Half/Complex flat sum/mean accumulate in a wider type and cast back).Conversion-helper primitives — two further NumPy-parity types adapted from vendored .NET BCL sources under
src/dotnet/, standalone helpers for now (not yet wired intoNPTypeCode):Char8— the NumPyS1/ Pythonbytes(1)equivalent (readonly struct, 1 byte): conversions, operators, span helpers, and 100% PythonbytesAPI parity validated against a Python oracle.DateTime64— the NumPydatetime64equivalent (readonly structover alongtick count): fulllongrange and aNaTsentinel (long.MinValue) with IEEE-NaN-style propagation and NumPy comparison semantics (any ordering involvingNaTisfalse,!=istrue, whileEqualsstays hash-contract-compliant). Implicit interop fromDateTime/DateTimeOffset/long, checked conversion back, andConverts.ToDateTime64/ToX(DateTime64)matching NumPy exactly; calendar arithmetic delegates toSystem.DateTime.13. Examples — trainable MNIST MLP
New
examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive and a fused implementation (single-NDIterbias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). The stride-native GEMM made the old "copy transposed views beforenp.dot" workaround unnecessary; converges to >99% test accuracy in the bundled demo.14. Cross-platform
macOS / Apple-Silicon (ARM64) reduction parity fixed (6 failing tests on
macos-latest):maximum/minimumsigned-zero ties (ARMFMAX/FMINNMvs x86MAXPS/MINPS) now resolve to the second operand via an explicit strict-compare +ConditionalSelect;negate(+0.0)is an explicit sign-bit XOR (was0 - x→+0.0on ARM); narrow-intsum/prodreductions use exact integer accumulators on ARM (the AVX2-gated widening kernel fell back to a saturating double path). Reproduced under linux/arm64 via QEMU and pinned as a committed parity suite.15. Tests & CI
[OpenBugs]reproductions promoted into regular CI tests as their bugs were fixed (each asserts NumPy-2.4.2-correct behavior). Deliberately kept flagged: 7 AVX-512-only and 2 timing-dependent repros.np.ArrayPrint.ParityTests), cumsum parity (54), abs parity (33), shift parity,np.evaluate/out=/where=/dtype=parity, NDIter battletests (566), order-support sections, ARC lifecycle, and the macOS/ARM64 signed-zero parity suite.FuzzMatrixgate (the differential oracle + index oracle + metamorphic tiers — §10) runs on Windows/Ubuntu/macOS; nightlyfuzz-soak.yml; a decoupled post-releasebenchmark.ymlruns the whole NumSharp-vs-NumPy harness (op/dtype/N matrix + the five iterator/layout/operand/cast/fusion subsystems — §10), renders the DocFX benchmark pages, and auto-commits the refreshed report + cards + a committablebenchmark/history/<date>_<sha>/provenance snapshot (and itslatestsymlink) to master.flip/fliplr/flipud/rot90,diag,gradient, andround(the function form;np.round_/np.aroundexist); the benchmark-surfaced slower paths, all tracked in the committed sheets — small-N (~1K) per-call dispatch overhead, the scalarHalf/Decimalelement paths (no BCLVector<Half>/Vector<decimal>), large-Nnp.anyfull-scan, comparison→boolstores, and fancy gather/scatter; and a handful of iterator/indexing edge cases pinned as[OpenBugs].Key highlights since 0.40.0
0.60.0 caps a three-release arc that rebuilt NumSharp's compute core from the ground up and aligned it with NumPy 2.x. For anyone upgrading from 0.40.0, the cumulative picture across the prerelease line:
0.41.0 — IL Kernel Generator (the compute-core rewrite, Mar 2026)
System.Reflection.Emit.DynamicMethod) replaced the ~600 K-line Regen template engine with ~19 K lines — a net −533 K lines — withVector128/256/512SIMD and runtime width detection across every op.Parallel.*).nan*reductions,cbrt,floor_divide,left/right_shift,cumprod,count_nonzero,isnan/isfinite/isinf/isclose, and thenp.comparison+np.logicalmodules — plus the comparison/bitwise operators (==…>=,&/|/^) implemented for the first time.0.50.0 — Long Indexing (>2 GB arrays + the type system, Apr 2026)
Shape/NDArray/Storage/iterators/IL kernels — arrays beyond 2.1 billion elements (>2 GB) now work;np.argmax/np.nonzeroreturnlong. NewUnmanagedSpan<T>(long-lengthSpanparity),LongIntroSort, and unmanaged long index buffers.can_cast,promote_types,result_type,min_scalar_type,common_type,issubdtype,finfo,iinfo,isreal/iscomplex/isrealobj/iscomplexobj.np.arange(10)→int64,NPTypeHierarchy(bool not under Number), and 0-D scalar arrays (np.array(5)→ 0-D).__contains__/__len__/__iter__/__getitem__/__setitem__, plustolist()/item();np.frombufferrewritten to the full NumPy signature (count/offset/big-endian/view semantics).ValueType→objectscalar migration; operator-overload cleanup (−74%); 600+ battle tests.0.60.0 — nditer (this PR — the first stable)
nditerport as the execution engine;np.evaluatefusion (3.2–6.1×); full advanced-indexing parity + a differential index oracle; byte-exact array printing;out=/where=/dtype=ufunc kwargs; C/F/A/K memory layout; 36+ newnp.*APIs (sort, pad, percentile/quantile, take/put, split, …);np.randomrebuilt on MT19937 for NumPy seed parity + 24 new distributions; stride-native matmul and anastypeSIMD campaign that both beat NumPy; ARC memory management + buffer pool; differential fuzzing vs NumPy; and the legacy iterator stack and the Regen engine deleted outright. (All detailed in the sections above.)Breaking changes
int[]/long[]as the sole index is now FANCYnd[new int[]{0,2}]selects rows 0 and 2 (shape(2,…)), not the element at coordinate(0,2)nd.GetData(0, 2)for coordinate access.nd[0,2](separate ints) is unchanged;NDArray<T>.this[int[]]is unchangedNDArray.ToString()format changed to NumPyarray_str/array_repr[0 1 2](str) /array([0, 1, 2], dtype=int64)(repr) instead of[0, 1, 2]ToString()output; use the typed accessors/GetDatafor valuesnp.left_shift/right_shiftresult dtype is nowresult_type(lhs, rhs)int8 << int32→int32(wasint8, overflowing);bool << bool→int8np.fmax/fminnow ignore NaNmaximum/minimumstill propagate NaNnp.cumsumof empty / 0-d / size-1 integer input widens to int64 and never returns 0-dcumsum(empty int32)→ freshint64;cumsum(0-d)→(1,)np.add.accumulate)IncorrectShapeException(count, 1)value to fill one value per selected rowIndexError(too many indices)A[:, :, :]on a 2-D arrayIndexErrorA[0,4]on a 3×4, fancy-7on a size-6 axis (was wrong value / OOB read)np.fullargument order flipped tonp.full(shape, fill_value, dtype)np.full(fill_value, shape, dtype))dtypestays thirdbool - bool,-bool,np.negative(bool)now throw^/ cast to int first<=/>=returnsFalsenp.isnanexplicitlyfloor_divide/moddivide-by-zero & floored results;np.negative(uint)wrapsnp.power(int, negative int)raisesValueErrorcomplex→bool/float→inttruncation); transcendental NEP50 width promotion;np.clip/quantile dtype promotionnp.byteis nowint8(sbyte)uint8(byte)charconvention; usenp.ubyte/np.uint8for unsignednp.complex64/np.csinglenow thrownp.complex128np.int_/np.uint/np.intp/np.uintprealigned to NumPy 2.x (intp/uintp; long/ulong on 64-bit)np.randomseed sequences changed (Knuth subtractive generator → MT19937)np.randomdistributions returnint64poisson/binomial/geometric/hypergeometric/zipf/logseries/negative_binomialweredouble[1].copy()to writeMultiIteratorandNDIterator(+NDIterator<T>,AsIterator) removedNDIter/NDIter.Copy/NDFlatIteratorNDIter:MaxOperands=8and 64-dim limits removednp.copytounwriteable-destination error type correctedEverything above was validated against NumPy 2.4.2 ground truth — by ~40,000 differential corpus cases, 566 iterator parity scenarios, a 12,000+-case index oracle, ~18,000 array-print fuzz cases, and per-feature battle tests run on actual NumPy output.
Closes: #435 #439 #456 #477 #480 #495 #501 #508 #515 #542 #567 #568 #604 #605 #608