HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor)#441

Open
johnml1135 wants to merge 11 commits into master from fst-advisor

johnml1135 commented Jun 26, 2026

Collaborator

What

Accelerates HermitCrab morphological analysis with a precompiled FST, behind a caching front end
that keeps the engine as the source of truth. No second morphology engine, no reimplemented constraints.

Entry point — CachingMorphologicalAnalyzer (fast + slow + cache):

  • Default AnalyzeWord = guaranteed complete (backwards-compatible). On a certified grammar
    (FST-closed per the census and FST==engine set-parity over a corpus) the FST is proven complete,
    so this runs FST-only with no full search. Otherwise it returns the cached engine result, or runs
    the engine on a miss and caches it. Either way: complete.
  • AnalyzeWordFast = opt-in immediate. Cached-complete if warm (or if the grammar is certified),
    else a sound but possibly under-generating verified-FST result, flagged IsComplete=false. Never
    runs the engine.
  • Warm(corpus) fills the cache in parallel; AnalysisCacheSerializer persists it across
    sessions (fixed corpora), keyed by MorphemeRegistry and guarded by a grammar-version string
    (stale cache rejected → re-warm). Confirmed non-words are cached too.
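
The three entry points above can be modeled in a few lines. This is a minimal sketch in Python, used only as a neutral illustration language; `CachingAnalyzerSketch`, `Result`, `toy_engine`, and `toy_fst` are hypothetical names, analyses are reduced to plain strings, and `warm` is sequential where the PR describes a parallel warm. It is not the C# `CachingMorphologicalAnalyzer` API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable

Analyses = FrozenSet[str]  # illustration only: one analysis = one string


@dataclass(frozen=True)
class Result:
    analyses: Analyses
    is_complete: bool


class CachingAnalyzerSketch:
    """Hypothetical model of the fast + slow + cache front end."""

    def __init__(self, engine: Callable[[str], Analyses],
                 verified_fst: Callable[[str], Analyses], certified: bool):
        self._engine = engine        # complete but slow
        self._fst = verified_fst     # sound, may under-generate
        self._certified = certified  # True when the FST is proven complete
        self._cache: Dict[str, Analyses] = {}
        self.engine_calls = 0

    def analyze_word(self, word: str) -> Result:
        """Default path: always complete."""
        if self._certified:
            return Result(self._fst(word), True)
        if word not in self._cache:
            self.engine_calls += 1
            # Empty results (confirmed non-words) are cached as well.
            self._cache[word] = self._engine(word)
        return Result(self._cache[word], True)

    def analyze_word_fast(self, word: str) -> Result:
        """Opt-in path: never runs the engine."""
        if self._certified:
            return Result(self._fst(word), True)
        if word in self._cache:
            return Result(self._cache[word], True)
        return Result(self._fst(word), False)

    def warm(self, corpus: Iterable[str]) -> None:
        """Sequential here for brevity."""
        for word in corpus:
            self.analyze_word(word)


def toy_engine(word: str) -> Analyses:
    return frozenset({"walk+PST", "walk+PTCP"}) if word == "walked" else frozenset()


def toy_fst(word: str) -> Analyses:
    return frozenset({"walk+PST"}) if word == "walked" else frozenset()


analyzer = CachingAnalyzerSketch(toy_engine, toy_fst, certified=False)
cold = analyzer.analyze_word_fast("walked")  # under-generates, flagged incomplete
analyzer.warm(["walked", "xyzzy"])
warm = analyzer.analyze_word_fast("walked")  # now served from the cache
```

In this toy run the cold fast call returns one analysis with `is_complete` false, and the same call after warming returns both analyses as complete without touching the engine again.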

The FST pipeline behind the fast path: FstTemplateAnalyzer (proposer; immutable, shared;
derivation depth tunable) → VerifiedFstAnalyzer (confirms each candidate by restricted re-analysis,
FstReplay, against HC's own engine from a MorpherPool; emits the genuine HC analysis) →
CompleteHybridMorpher (certified→FST / else→engine, with per-word AnalyzeWord(word, useFst)).
GrammarFstAdvisor + GrammarFstClosure are the grammar census/linter (this PR's original core).
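
As a rough illustration of the propose-then-verify step, the sketch below keeps a candidate only if a restricted re-analysis reproduces it exactly. Everything here is an assumption made for the example: `ENGINE_TABLE` stands in for the engine, `restricted_engine` for the restricted re-analysis, and an analysis is modeled as an ordered tuple of morpheme names. Python is used only as a neutral sketch language.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Analysis = Tuple[str, ...]  # ordered morpheme names; a simplification

# Toy "engine": the complete set of analyses per surface word (hypothetical data).
ENGINE_TABLE = {"sagd": [("sag", "PST")]}


def restricted_engine(word: str, allowed: FrozenSet[str]) -> List[Analysis]:
    """Stand-in for restricted re-analysis: only morphemes in `allowed` may be used."""
    return [a for a in ENGINE_TABLE.get(word, []) if set(a) <= allowed]


def verify(word: str, candidates: Iterable[Analysis],
           reanalyze: Callable[[str, FrozenSet[str]], List[Analysis]]) -> List[Analysis]:
    """Keep only candidates that the engine itself reproduces for this word."""
    confirmed: List[Analysis] = []
    for cand in candidates:
        for analysis in reanalyze(word, frozenset(cand)):
            if analysis == cand and analysis not in confirmed:
                confirmed.append(analysis)
    return confirmed


# The proposer may over-generate; verify prunes what the engine does not confirm.
proposed = [("sag", "PST"), ("sag", "PL")]
confirmed = verify("sagd", proposed, restricted_engine)
```

Because every emitted analysis comes from the engine's own output, a wrong proposal can cost time but cannot surface as a result.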

Guarantees

  • Correctness equals the engine — the cache never invents or hides an analysis; the default path is
    always complete. The fast path is sound (0 false positives on 50 generated non-words), a yes-only
    detector for "is this a word" (can under-generate, even to zero on un-built constructs), which the
    cache/certification corrects.
  • Proven-complete → no full search. A certified grammar never runs the engine; on the 60-word Sena
    corpus that certifies, the default complete path runs at ~18 ms/word (~11×).
  • Fast — ~13× on Sena's FST path; 98% of words covered by the fast path (the rest resolve via
    engine/cache, no silent miss).
  • Thread-safe — shared proposer + pooled engine + concurrent cache; 0 parallel-vs-sequential
    mismatches on 200 words (and a CI test).
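
A parallel-vs-sequential mismatch count of the kind quoted above could be measured along these lines. This is a hedged sketch, not the benchmark in the PR: `LockedCacheAnalyzer`, `toy_analyze`, and `count_mismatches` are hypothetical, and a lock-guarded dictionary stands in for the concurrent cache.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Analyses = Tuple[str, ...]


class LockedCacheAnalyzer:
    """Hypothetical thread-safe wrapper: a shared cache guarded by a lock."""

    def __init__(self, analyze: Callable[[str], Analyses]):
        self._analyze = analyze
        self._cache: Dict[str, Analyses] = {}
        self._lock = threading.Lock()

    def __call__(self, word: str) -> Analyses:
        with self._lock:
            if word in self._cache:
                return self._cache[word]
        result = self._analyze(word)  # computed outside the lock
        with self._lock:
            # If another thread won the race, keep its (identical) entry.
            return self._cache.setdefault(word, result)


def toy_analyze(word: str) -> Analyses:
    return (word.upper(),)


def count_mismatches(make_analyzer: Callable[[], Callable[[str], Analyses]],
                     words: List[str], workers: int = 8) -> int:
    """Analyze the same words sequentially and in parallel, then compare per word."""
    seq = make_analyzer()
    sequential = [seq(w) for w in words]
    par = make_analyzer()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parallel = list(pool.map(par, words))
    return sum(1 for s, p in zip(sequential, parallel) if s != p)


words = [f"w{i % 50}" for i in range(200)]
mismatches = count_mismatches(lambda: LockedCacheAnalyzer(toy_analyze), words)
```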

Tests

CI unit tests on the in-repo toy grammar cover the proposer, the verify chain, the caching/persistence
layer (default==engine, provisional→complete after warm, certified-skip, round-trip + version guard),
soundness/negatives, the category fix, per-word opt-out, and thread-safety; plus advisor/closure. An
[Explicit] benchmark measures speed/parity/soundness/concurrency/certification on an external grammar.
Full unit suite green (93).
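
The "round-trip + version guard" behavior can be pictured as follows. This is only a sketch under stated assumptions: JSON is chosen for the example, `save_cache` and `load_cache` are hypothetical names, and analyses are stored as strings, whereas the PR keys the persisted cache by `MorphemeRegistry`.

```python
import json
from typing import Dict, FrozenSet, Optional

Cache = Dict[str, FrozenSet[str]]


def save_cache(cache: Cache, grammar_version: str) -> str:
    """Serialize the cache together with the grammar version it was built from."""
    entries = {word: sorted(analyses) for word, analyses in cache.items()}
    return json.dumps({"version": grammar_version, "entries": entries})


def load_cache(blob: str, grammar_version: str) -> Optional[Cache]:
    """Return the cache, or None when it is stale and the caller must re-warm."""
    data = json.loads(blob)
    if data.get("version") != grammar_version:
        return None
    return {word: frozenset(analyses) for word, analyses in data["entries"].items()}


original: Cache = {
    "walked": frozenset({"walk+PST", "walk+PTCP"}),
    "xyzzy": frozenset(),  # a confirmed non-word is persisted too
}
blob = save_cache(original, grammar_version="v1")
restored = load_cache(blob, grammar_version="v1")
stale = load_cache(blob, grammar_version="v2")
```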

Honest limitations / out of scope

  • ~2% of Sena words use constructs the FST proposer doesn't build yet — compounding (the main one),
    depth-3 derivation, one suffix-order case. They resolve via the engine/cache (no silent miss) and keep
    the full corpus from certifying. Compounding is the highest-value next coverage build (an additive,
    shared-root-chain design) and is scoped as a focused follow-on.
  • The per-stem completeness proof (proving the fast path complete without the engine) was explored
    and abandoned; completeness is delivered by certification + cache + engine.
  • Deferred: the generator (reverse/synthesis direction); compounding; a 2-way-FST treatment of the
    residual.

Design + research record: docs/HERMITCRAB_FST_PLAN.md (§13 = caching front end); advisor:
docs/HERMITCRAB_FST_ADVISOR.md.

🤖 Generated with Claude Code


Copilot AI left a comment

Pull request overview

Adds a new static “FST-readiness” grammar linter for the HermitCrab parser (GrammarFstAdvisor.Analyze(Language)) that walks a compiled grammar, emits per-rule advisories (Escape/Cost/Info + reclaim notes like Regular/Probeable), and produces an overall tier verdict intended for authoring-time/CI use. This lays groundwork for future FST compilation work by making “what blocks FST / what is slow today” visible and actionable.

Changes:

  • Introduces GrammarFstAdvisor, GrammarFstReport, and GrammarAdvisory to classify expensive/non-FST-able constructs across morphological and phonological rules.
  • Adds NUnit tests covering concatenative cases, reduplication (bounded/unbounded), infixation, rewrite-rule harmony behavior, and opacity/probe-ability.
  • Adds planning docs for the advisor and the broader HermitCrab FST acceleration roadmap, plus an explicit local benchmark test for running the advisor on an external grammar.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorTests.cs | Adds coverage for the advisor’s tiering and key escape classifications (reduplication/infix/harmony/opacity). |
| tests/SIL.Machine.Morphology.HermitCrab.Tests/GrammarFstAdvisorBenchmark.cs | Adds an [Explicit] helper test to run and print the advisor report on an external HC XML grammar. |
| src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | Implements the advisor, report model, and the core static analyses for affix and phonological rules. |
| HERMITCRAB_FST_PLAN.md | Documents the planned FST compiler/runtime approach, tiered hybrid design, and decision gate. |
| fst.md | Documents the advisor’s classification rules, tier model, and the orthogonal Regular/Probeable axes. |

Comment thread src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs
Comment thread src/SIL.Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs
johnml1135 changed the title from "HermitCrab FST acceleration: grammar FST-readiness advisor (groundwork)" to "HermitCrab FST acceleration: sound, fast, verify-by-re-analysis analyzer (+ grammar advisor)" on Jun 26, 2026
johnml1135 and others added 2 commits June 27, 2026 08:13
Four parallel audits (formal-language status × HC impl × FST impl) of every HC construct,
classified covered / partial / coverable / not-coverable, with architecture proposals and
appendices on closing the non-regular gap.

Headline findings:
- Almost all of HC is REGULAR (Kaplan-Kay) hence 1-way-FST-able; the only genuinely
  non-regular core is unbounded full-stem reduplication ({ww}) + an unbounded self-feeding
  rewrite cycle (HC caps at 256).
- Critical coverage ceiling: the proposer is only correct for 0-PHONOLOGY grammars (arcs are
  underlying segments, walk is surface) — it silently under-generates (fails safe; parity gate
  refuses to certify) for any grammar with phonological rules. Phonology-by-composition is the
  biggest coverage win.
- Robustness bug: the proposer THROWS on infix/circumfix/reduplication/process slots, aborting
  the whole build instead of degrading to the engine. Graceful degradation is the top this-PR fix.
- Other gaps: true zero-segment affix dropped; bounded compounding needs proposer + FstReplay
  changes; MPR/co-occurrence/env/stemname correctly left to verify (sound).

Appendix A: length-cap fold / detect-and-peel (compile-replace) / 2-way FST (Dolatian-Heinz) /
engine backstop for the non-FST-able constructs. Appendix B: verify-by-re-analysis + escape-aware
codec + certified-skip interlock all HELP later non-regular work; only the 2-way reduplication
solution would need a new execution model.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eview fixes

A — Graceful degradation: the proposer no longer THROWS on infix/circumfix/reduplication/
process slots; it skips the unbuildable construct, builds the rest, and sets
CoversAllConstructs=false so the grammar can't certify (those words fall to the
engine/cache; the parity gate enforces it). Was a NotSupportedException that aborted the
whole build, making the FST unusable on any grammar with such a slot.

B — True zero-segment affix (CopyFromInput only, no InsertSegments) now emits its morpheme
token with no segment arcs instead of throwing / being silently dropped. SlotOp treats a
zero-only slot as a position-less suffix so it still builds.

Certification guard: FromLanguage (Caching + CompleteHybrid) now requires
proposer.CoversAllConstructs in addition to closed + parity — a degraded build can't certify.
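
The three-part certification condition (closed, parity, full construct coverage) can be written as one predicate. A minimal sketch, with `can_certify` and the toy analyzers as hypothetical names and parity modeled as set equality over a corpus:

```python
from typing import Callable, FrozenSet, Iterable

Analyses = FrozenSet[str]


def can_certify(fst_closed: bool, covers_all_constructs: bool, corpus: Iterable[str],
                fst_analyze: Callable[[str], Analyses],
                engine_analyze: Callable[[str], Analyses]) -> bool:
    """Certify only if the grammar is closed, nothing was skipped, and FST == engine."""
    if not (fst_closed and covers_all_constructs):
        return False  # a degraded build can never certify
    return all(fst_analyze(w) == engine_analyze(w) for w in corpus)


def engine(word: str) -> Analyses:
    return frozenset({"sag"}) if word == "sag" else frozenset()


def good_fst(word: str) -> Analyses:
    return engine(word)


def weak_fst(word: str) -> Analyses:
    return frozenset()  # under-generates on "sag"


corpus = ["sag", "zzz"]
certified = can_certify(True, True, corpus, good_fst, engine)
degraded = can_certify(True, False, corpus, good_fst, engine)
no_parity = can_certify(True, True, corpus, weak_fst, engine)
```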

Copilot review fixes (advisor, still in PR):
- Examine RealizationalAffixProcessRule (it implements IMorphologicalRule + has Allomorphs;
  can encode reduplication/infix) — previously silently skipped, undercounting escapes.
  AnalyzeAffix refactored to (name, allomorphs) and the switch handles both rule types.
- GrammarFstReport counts are now PER-RULE (group advisories by Rule/Stratum/Kind, worst
  severity) instead of per-advisory, so per-allomorph advisories don't overcount and the
  partitions are consistent (Probeable+Opaque = Escape, Regular+NonRegular = Escape).

Tests: Build_ReduplicationSlot_DegradesGracefully_DoesNotThrow,
Analyze_ZeroSegmentSuffix_IsEmitted_NotDropped, Analyze_RealizationalReduplication_IsExamined.
Unit suite 96 green; Sena unchanged (certifies, 0 parallel mismatches, 0 false positives).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@johnml1135

Collaborator Author

Thanks @copilot — both advisor comments addressed in 7832611f:

  1. Per-rule counting (GrammarFstReport): counts now group advisories by (Rule, Stratum, Kind) and take each rule's worst severity, so per-allomorph advisories no longer overcount. The escape sub-counts are now exact partitions — Probeable + Opaque = Escape and Regular + NonRegular = Escape (a rule is opaque/non-regular if any of its escape advisories is, the conservative aggregate).

  2. RealizationalAffixProcessRule skipped: the morphological-rule switch in Analyze now handles it alongside AffixProcessRule (both have Allomorphs and can encode reduplication/infixation). AnalyzeAffix was refactored to take (name, allomorphs) so both route through the same examination, and AffixRulesExamined now counts them. Regression test added: Analyze_RealizationalReduplication_IsExamined.
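
The per-rule counting in point 1 can be sketched as below. The grouping key (Rule, Stratum, Kind) and the severities Escape/Cost/Info come from the thread; the severity ordering, the `Advisory` fields `probeable` and `regular`, and the sample data are assumptions made for the illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Assumed ordering: Escape is worse than Cost, which is worse than Info.
SEVERITY_RANK = {"Info": 0, "Cost": 1, "Escape": 2}


@dataclass(frozen=True)
class Advisory:
    rule: str
    stratum: str
    kind: str
    severity: str
    probeable: bool = True
    regular: bool = True


def summarize(advisories: Iterable[Advisory]) -> Dict[str, int]:
    """Count rules (not advisories), each at its worst severity."""
    groups: Dict[Tuple[str, str, str], List[Advisory]] = defaultdict(list)
    for adv in advisories:
        groups[(adv.rule, adv.stratum, adv.kind)].append(adv)

    counts = dict.fromkeys(
        ["Info", "Cost", "Escape", "Probeable", "Opaque", "Regular", "NonRegular"], 0)
    for group in groups.values():
        worst = max(group, key=lambda a: SEVERITY_RANK[a.severity]).severity
        counts[worst] += 1
        if worst == "Escape":
            escapes = [a for a in group if a.severity == "Escape"]
            # Conservative aggregate: one opaque or non-regular advisory taints the rule.
            opaque = any(not a.probeable for a in escapes)
            non_regular = any(not a.regular for a in escapes)
            counts["Opaque" if opaque else "Probeable"] += 1
            counts["NonRegular" if non_regular else "Regular"] += 1
    return counts


sample = [
    Advisory("redup", "morph", "Affix", "Escape", probeable=True, regular=False),
    Advisory("redup", "morph", "Affix", "Escape", probeable=False, regular=False),
    Advisory("redup", "morph", "Affix", "Cost"),
    Advisory("voicing", "phon", "Rewrite", "Cost"),
    Advisory("plural", "morph", "Affix", "Info"),
]
counts = summarize(sample)
```

Three advisories on the same rule count once, and each escaping rule lands in exactly one bucket per axis, which is what makes the two partitions add up.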

…lure)

CI runs `dotnet csharpier check .` and the new/edited FST files were not formatted.
Ran `dotnet csharpier format .` (1.2.6) — only the 11 FST/advisor/test files changed;
no unrelated files touched. Unit suite 96 green; csharpier check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
codecov-commenter commented Jun 27, 2026

Codecov Report

❌ Patch coverage is 83.47458% with 273 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.68%. Comparing base (9a21ddd) to head (196b712).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...chine.Morphology.HermitCrab/FstTemplateAnalyzer.cs | 86.51% | 54 Missing and 16 partials ⚠️ |
| ...Machine.Morphology.HermitCrab/GrammarFstAdvisor.cs | 82.05% | 54 Missing and 16 partials ⚠️ |
| ...e.Morphology.HermitCrab/AnalysisCacheSerializer.cs | 67.10% | 14 Missing and 11 partials ⚠️ |
| ....Machine.Morphology.HermitCrab/MorphemeRegistry.cs | 54.76% | 15 Missing and 4 partials ⚠️ |
| ...phology.HermitCrab/CachingMorphologicalAnalyzer.cs | 75.80% | 13 Missing and 2 partials ⚠️ |
| ...ine.Morphology.HermitCrab/ReduplicationProposer.cs | 74.46% | 6 Missing and 6 partials ⚠️ |
| ...SIL.Machine.Morphology.HermitCrab/InfixProposer.cs | 83.33% | 7 Missing and 4 partials ⚠️ |
| ...L.Machine.Morphology.HermitCrab/MorphTokenCodec.cs | 85.13% | 7 Missing and 4 partials ⚠️ |
| src/SIL.Machine.Morphology.HermitCrab/FstReplay.cs | 78.26% | 6 Missing and 4 partials ⚠️ |
| ...L.Machine.Morphology.HermitCrab/FstVerification.cs | 85.71% | 8 Missing and 2 partials ⚠️ |

... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #441      +/-   ##
==========================================
+ Coverage   73.20%   73.68%   +0.47%     
==========================================
  Files         440      458      +18     
  Lines       36931    38583    +1652     
  Branches     5077     5311     +234     
==========================================
+ Hits        27037    28429    +1392     
- Misses       8781     8969     +188     
- Partials     1113     1185      +72     

johnml1135 and others added 7 commits June 27, 2026 09:31
Let the FST proposer match phonologically-altered surfaces, the C-internal
tier of Solution 1 (surface-allomorph precompile, docs/FST_FULL_COVERAGE_PLAN.md
Appendix C). For each root the grammar allows to stand bare, build a proposer arc
not just for the underlying shape but for every bare surface realization HC
synthesizes (phonology applied) — reusing the obligatoriness GenerateWords call,
so zero extra build cost. The emitted token is always the underlying morpheme;
verify re-runs HC with real phonology to confirm.

- FstTemplateAnalyzer: _bareRootValid -> _bareRootSurfaces; add BareRootSurfaces,
  UnderlyingForm, BuildRootChainFromSurface. Underlying arcs kept (union), so the
  0-phonology path is unchanged.

Fix a latent verify bug this exposed: AnalysisRewriteRule/AnalysisMetathesisRule
gate on Morpher.RuleSelector, and FstReplay pinned the selector to just the
candidate's morphological rules — silently disabling ALL phonology during verify.
The propose-and-verify spine could therefore never confirm any phonologically-
altered candidate. Phonological rules are obligatory deterministic rewrites, not a
fan-out choice, so FstReplay now always lets IPhonologicalRule through; the
morphological fan-out is still collapsed by gating the leaf rules + root, and
soundness is still enforced by the unchanged candidate-signature match.

Add Verified_CoversPhonologicallyAlteredBareRoot: an unconditional t->d rule makes
bare root "dat" surface only as "dad"; a baseline assertion proves the underlying-
only proposer misses "dad", the surface-precompile proposer covers it, verify
confirms it as a genuine HC analysis, and a non-word still yields nothing. Full
HermitCrab suite green (97 passed).
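
The selector change can be illustrated with the same "dat" to "dad" example. This is a simplified sketch: `Rule`, `pinned_selector`, `replay_selector`, and `synthesize` are hypothetical stand-ins for the real rule-selector mechanism, and the rules are plain string functions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Rule:
    name: str
    is_phonological: bool
    apply: Callable[[str], str]


# Toy grammar: one morphological suffix rule, then an unconditional t->d rewrite.
RULES: List[Rule] = [
    Rule("PL", False, lambda form: form + "i"),
    Rule("t->d", True, lambda form: form.replace("t", "d")),
]


def pinned_selector(candidate_rules: FrozenSet[str]) -> Callable[[Rule], bool]:
    """Behavior before the fix, as described: only the candidate's own rules pass."""
    return lambda rule: rule.name in candidate_rules


def replay_selector(candidate_rules: FrozenSet[str]) -> Callable[[Rule], bool]:
    """Behavior after the fix: phonology always passes, morphology stays gated."""
    return lambda rule: rule.is_phonological or rule.name in candidate_rules


def synthesize(root: str, selector: Callable[[Rule], bool]) -> str:
    form = root
    for rule in RULES:
        if selector(rule):
            form = rule.apply(form)
    return form


bare = frozenset()  # candidate: bare root, no morphological rules
before_fix = synthesize("dat", pinned_selector(bare))
after_fix = synthesize("dat", replay_selector(bare))
```

With the pinned selector the replay yields "dat" and can never confirm the surface word "dad"; letting phonology through yields "dad" while the plural rule stays disabled for a bare-root candidate.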

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FST_FULL_PLAN.md — implementation plan for the four expansion points. The
propose-and-verify split means correctness lives in verify + certification, never
in the proposer, so coverage expansion can only change the acceleration ratio,
never produce a wrong answer. Architecture: a CompositeProposer unions candidate
generators (FST + reduplication + infix scanners) into the one verify gate.

- Point 2 (infix) and Point 3 (reduplication): bounded candidate generators that
  strip/remove their material and RECURSE the residual through the FST proposer
  (so inflected reduplicants / infixed forms are covered), feeding the verify gate.
- Point 1 (all phonology): affix surface-precompile + C-boundary neighbor context,
  extending the shipped bare-root C-internal tier.
- Point 4 (C-exact composition): design recorded + deferred with rationale — it is
  a spine redesign (token side-table -> transducer outputs) whose only marginal
  gain over C-boundary is rare cross-boundary opacity that already falls back to
  the engine correctly. C-boundary subsumes its practical value.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reduplication (copy the whole base, surface = base·base) is the one provably
non-regular construct — an FST cannot represent it. Handle it BESIDE the FST: a
bounded candidate generator feeds the same propose-and-verify gate, so it is sound
without being regular.

- CompositeProposer: unions several proposers (FST + generators) into one
  IMorphologicalAnalyzer, deduping candidates by order-sensitive morpheme-identity
  signature before the verify gate. Aggregates coverage at the MorphOp level
  (CoversAllConstructs = FST's uncovered ops minus what generators cover) so a
  grammar can certify once a sibling generator covers the FST's skipped construct.
  New IConstructProposer interface lets a generator declare its covered ops.
- ReduplicationProposer (IConstructProposer): detects an adjacent doubling X·X,
  strips one copy, RECURSES the residual through the FST proposer (so an inflected
  reduplicant is covered, not just a bare root), and appends the reduplication
  morpheme in HC application order (root·…·RED). A coincidental doubling is pruned
  by verify (HC synthesis won't reproduce it).
- FstTemplateAnalyzer: replace the _hasUnbuiltConstructs bool with an
  _uncoveredOps set (records WHICH MorphOp was skipped — slot rules, in-slot
  affixes, and standalone morphological rules); expose UncoveredOps.
  CoversAllConstructs == (UncoveredOps empty).
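
The union, dedup, and coverage aggregation described above might look like this in outline. The function names and the op labels ("Reduplication", "Infix", "Circumfix") are assumptions for the sketch; a candidate is again an ordered tuple of morpheme names, which doubles as its order-sensitive signature.

```python
from typing import Callable, Iterable, List, Set, Tuple

Candidate = Tuple[str, ...]
Proposer = Callable[[str], List[Candidate]]


def composite_propose(word: str, proposers: Iterable[Proposer]) -> List[Candidate]:
    """Union all proposers' candidates, dropping repeats by ordered signature."""
    seen: Set[Candidate] = set()
    merged: List[Candidate] = []
    for propose in proposers:
        for cand in propose(word):
            signature = tuple(cand)  # order matters: (a, b) differs from (b, a)
            if signature not in seen:
                seen.add(signature)
                merged.append(cand)
    return merged


def covers_all_constructs(fst_uncovered_ops: Iterable[str],
                          generator_covered_ops: Iterable[Iterable[str]]) -> bool:
    """True when every op the FST skipped is covered by some sibling generator."""
    remaining = set(fst_uncovered_ops)
    for ops in generator_covered_ops:
        remaining -= set(ops)
    return not remaining


def fst_proposer(word: str) -> List[Candidate]:
    return [("sag", "RED")] if word == "sagsag" else []


def redup_generator(word: str) -> List[Candidate]:
    return [("sag", "RED"), ("RED", "sag")] if word == "sagsag" else []


merged = composite_propose("sagsag", [fst_proposer, redup_generator])
```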

Test: a full-reduplication grammar; the FST alone misses "sagsag" (and reports
not-fully-covered), the composite covers it (and reports covered), verify confirms
the genuine HC analysis, and a non-word still yields nothing. Full suite green (98).
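
A bounded reduplication generator in the spirit of the description can be sketched as follows. `propose_reduplication` and `toy_base_proposer` are hypothetical; the sketch tries every adjacent doubling in the surface string, strips one copy, and hands the residual to a base proposer, leaving coincidental doublings for verify to prune.

```python
from typing import Callable, List, Set, Tuple

Candidate = Tuple[str, ...]


def propose_reduplication(surface: str,
                          propose_base: Callable[[str], List[Candidate]],
                          red: str = "RED") -> List[Candidate]:
    """For each adjacent doubling X·X, strip one X and recurse on the residual."""
    seen: Set[Candidate] = set()
    candidates: List[Candidate] = []
    for size in range(1, len(surface) // 2 + 1):
        for start in range(0, len(surface) - 2 * size + 1):
            first = surface[start:start + size]
            second = surface[start + size:start + 2 * size]
            if first != second:
                continue
            residual = surface[:start + size] + surface[start + 2 * size:]
            for base in propose_base(residual):
                cand = base + (red,)  # reduplication morpheme appended last
                if cand not in seen:
                    seen.add(cand)
                    candidates.append(cand)
    return candidates


def toy_base_proposer(word: str) -> List[Candidate]:
    """Stand-in for the FST proposer: knows a single bare root."""
    return [("sag",)] if word == "sag" else []


covered = propose_reduplication("sagsag", toy_base_proposer)
non_word = propose_reduplication("sagdag", toy_base_proposer)
```

The candidate count is bounded by the number of doublings in the string, so the generator stays cheap even though it is not a finite-state device.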

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Infixation (an affix inserted inside the stem, e.g. Tagalog -um-) is regular; the
FST proposer recognizes but does not build infix slots. Handle it as a sibling
generator feeding the same propose-and-verify gate.

- InfixProposer (IConstructProposer): for each infix and each interior position
  where the infix's surface segments occur, remove them and RECURSE the residual
  through the FST proposer (so an infixed form of an inflected stem is covered),
  then append the infix morpheme in HC application order (root·…·INF).
  Over-approximation — every interior occurrence is tried; verify prunes the wrong
  splits. O(surface-length × infixes) candidates, bounded.
- First cut: the infix must be a single contiguous run of inserted segments,
  matched against its underlying representation. Templatic multi-slot infixes and
  phonologically-altered infix surfaces are left to the engine (parity gate keeps
  results correct).

Test: an "a"-infix grammar ("sag" -> "saag"); the FST alone misses "saag" (and
reports not-fully-covered), the composite covers it (and reports covered), verify
confirms the genuine HC analysis, and a non-word still yields nothing. Full suite
green (99).
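
The infix scan can be sketched the same way. `propose_infix`, the infix table, and `toy_base_proposer` are hypothetical; the infix is assumed to be one contiguous run of segments matched against its underlying form, as in the first cut described above.

```python
from typing import Callable, Dict, List, Set, Tuple

Candidate = Tuple[str, ...]


def propose_infix(surface: str, infixes: Dict[str, str],
                  propose_base: Callable[[str], List[Candidate]]) -> List[Candidate]:
    """Try removing each infix at each interior position, then recurse on the rest."""
    seen: Set[Candidate] = set()
    candidates: List[Candidate] = []
    for name, segments in infixes.items():
        # Interior only: at least one segment must remain on each side.
        for start in range(1, len(surface) - len(segments)):
            if surface[start:start + len(segments)] != segments:
                continue
            residual = surface[:start] + surface[start + len(segments):]
            for base in propose_base(residual):
                cand = base + (name,)  # infix morpheme appended in application order
                if cand not in seen:
                    seen.add(cand)
                    candidates.append(cand)
    return candidates


def toy_base_proposer(word: str) -> List[Candidate]:
    """Stand-in for the FST proposer: knows a single bare root."""
    return [("sag",)] if word == "sag" else []


covered = propose_infix("saag", {"INF": "a"}, toy_base_proposer)
non_word = propose_infix("sug", {"INF": "a"}, toy_base_proposer)
```

For "saag" both interior positions of "a" leave the residual "sag", so the two splits collapse to a single candidate after deduplication.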

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend the surface-allomorph precompile from bare roots to AFFIXES: build each
affix's segment arcs from its underlying form AND each phonologically-altered
surface realization, so an affix whose surface differs from its underlying
segments (e.g. a suffix devoiced/changed by a rule) is matched by the proposer.

- SurfacePhonology: a forward-phonology helper that compiles each stratum's
  synthesis phonological rules (reusing HC's CompileSynthesisRule, exactly what
  SynthesisStratumRule runs) and applies them to a segment string in isolation,
  returning the distinct surface variants (C-internal tier: catches edge- and
  morpheme-internal alternations; cross-boundary ones ride the engine).
- FstTemplateAnalyzer.BuildAffixArcs: shared by both affix-arc sites (derivational
  layers + template slots) — builds the underlying path plus a path per altered
  surface variant. Default ctor passes an identity variant function, so the
  0-phonology path is byte-identical; the morpher ctor wires SurfacePhonology.

Tests: Proposer_CoversPhonologicallyAlteredAffix (a "t" suffix that surfaces only
as "d" via t->d: the underlying-only proposer misses "sagd", the surface-precompile
proposer covers it, verify stays sound) and SurfacePhonology_AppliesRulesForward.
Full suite green (101). FST_FULL_PLAN.md updated with the shipped/deferred matrix.
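
The effect of passing a variant function to the affix-arc builder can be shown with the "t" suffix that surfaces as "d". This is a toy model: `affix_arc_forms`, `identity`, `t_to_d`, and `proposes` are hypothetical, and segment strings stand in for arcs.

```python
from typing import Callable, List


def identity(segments: str) -> List[str]:
    """0-phonology case: the only variant is the underlying form itself."""
    return [segments]


def t_to_d(segments: str) -> List[str]:
    """Toy forward phonology: an unconditional t->d rewrite."""
    return [segments.replace("t", "d")]


def affix_arc_forms(underlying: str, variants: Callable[[str], List[str]]) -> List[str]:
    """Underlying form first, then each distinct altered surface variant."""
    forms = [underlying]
    for variant in variants(underlying):
        if variant not in forms:
            forms.append(variant)
    return forms


def proposes(word: str, root: str, suffix_forms: List[str]) -> bool:
    """Toy proposer: root followed by any known form of the suffix."""
    return word.startswith(root) and word[len(root):] in suffix_forms


underlying_only = affix_arc_forms("t", identity)
with_surface = affix_arc_forms("t", t_to_d)
```

With the identity function the form list is unchanged, which mirrors the claim that the 0-phonology path is unaffected; with the rewrite, "sagd" becomes reachable and verify still decides whether it is a genuine analysis.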

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Apply CSharpier to FstTemplateAnalyzer.cs (a reflowed method signature) so
Check-formatting passes, and update the now-stale class summary: the proposer
precompiles bounded phonology into its arcs and degrades gracefully on constructs
it cannot model (recording the MorphOp in UncoveredOps for the composite), rather
than throwing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The reduplication/infix generators were only constructed in test code — both
production factories built a bare FstTemplateAnalyzer, so a reduplicating/infixing
grammar never certified and the generators never ran. Wire them in.

- CompositeProposer.ForLanguage(language, fst): the standard production proposer
  (FST + reduplication + infix generators). Inert for grammars without those
  constructs (generators hold no rules, yield nothing; CoversAllConstructs is
  vacuously true) — near-zero overhead, byte-identical behavior.
- CompleteHybridMorpher.FromLanguage and CachingMorphologicalAnalyzer.FromLanguage
  now build the composite and certify on its CoversAllConstructs.

Integration test CompleteHybrid_WiresGenerators_...: a reduplicating grammar
certifies through the production factory and the fast path matches the engine on
bare/reduplicated/homograph/non-word — the test whose absence let the feature be
inert. Docs note the wiring + the extended empirical-certification caveat (a
certified grammar skips the engine, so the certification corpus must exercise the
reduplication/infix patterns). Full suite green (102).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3 participants