fix(knowledge): lint worker orphan self-heal did nothing on real data #210
Merged
Verifying KnowledgeLintWorker against the prod 77k corpus (per the 'do the cron jobs actually work' check) showed its orphan self-heal was effectively a no-op:
- 1247 of 1433 orphans HAVE embeddings, but their nearest neighbor sits at ~0.55-0.59, a near-miss of the global 0.6 link threshold. Re-linking at 0.6 produced ZERO links on every sampled orphan; they would stay isolated forever.
- The other 186 orphans have NO embedding, so ArticleLinkingWorker silently no-ops on them entirely.

Fixes:
- ArticleLinkingWorker accepts an optional per-job threshold (query-shaped arg, falls back to the global config).
- KnowledgeLintWorker now splits orphans: embedded ones are re-linked at a lenient threshold (:knowledge_lint_orphan_link_threshold, default 0.5) so an isolated article connects to its closest relative instead of near-missing forever; un-embedded ones are sent to ArticleEmbeddingWorker (which embeds and chains to linking). The audit event now reports orphans_relinked AND orphans_embedding_enqueued.

Tests: a near-miss neighbor (cos ~0.55) links only with the lenient override; the lint worker embeds an un-embedded orphan and re-links embedded ones. Full local gate green.
What real-data verification found
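Per the 'do the scheduled jobs actually work?' check, I ran KnowledgeLintWorker against the prod 77k corpus. Its orphan self-heal was effectively a no-op:
- 1247 of 1433 orphans have embeddings, but their nearest neighbor sits at ~0.55-0.59, a near-miss of the global 0.6 link threshold. Re-linking at 0.6 produced zero links on every sampled orphan; they would stay isolated forever.
- The other 186 orphans have no embedding, so ArticleLinkingWorker silently no-ops on them entirely.
Mocked unit tests passed; only real data exposed this. Same lesson as the classifier bug.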
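To illustrate the near-miss, here is a minimal sketch of a threshold filter. The module, function name, and similarity values are hypothetical (the values are picked from the ~0.55-0.59 band reported above, not taken from the corpus), and the actual linking query may look quite different.

```elixir
defmodule NearMissSketch do
  # Hypothetical helper: keep only neighbors whose similarity clears the threshold.
  def linkable(neighbors, threshold) do
    Enum.filter(neighbors, fn {_id, similarity} -> similarity >= threshold end)
  end
end

# Illustrative similarities only; not real data from the corpus.
neighbors = [{"article_a", 0.58}, {"article_b", 0.55}, {"article_c", 0.41}]

# At the global 0.6 threshold nothing qualifies, so the orphan stays isolated.
IO.inspect(NearMissSketch.linkable(neighbors, 0.6), label: "global 0.6")

# At a lenient 0.5 the two closest relatives qualify.
IO.inspect(NearMissSketch.linkable(neighbors, 0.5), label: "lenient 0.5")
```

With these sample values the first call returns an empty list and the second returns article_a and article_b, which is the behavior the finding describes: a neighbor exists, but it never clears 0.6.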
Fix
- ArticleLinkingWorker accepts an optional per-job threshold (query-shaped arg, falls back to global config).
- KnowledgeLintWorker now splits orphans:
  - embedded ones are re-linked at a lenient threshold, :knowledge_lint_orphan_link_threshold (default 0.5), so an isolated article connects to its closest relative instead of near-missing forever;
  - un-embedded ones are sent to ArticleEmbeddingWorker (embeds, then chains to linking).
- The audit event now reports orphans_relinked and orphans_embedding_enqueued.
Tests
A near-miss neighbor (cos ≈ 0.55) links only with the lenient override; the lint worker embeds an un-embedded orphan and re-links embedded ones. Full local gate green (2999 tests, credo, dialyzer 0).
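The orphan split and per-job threshold described under Fix can be sketched roughly as follows. This is not the PR's code: the module name, the "threshold" and "article_id" arg keys, the worker atoms, and the orphan map shape are all assumptions, and the sketch returns plain job descriptions instead of enqueuing anything.

```elixir
defmodule OrphanHealSketch do
  # Assumed values mirroring the numbers quoted in this PR.
  @global_link_threshold 0.6
  @orphan_link_threshold 0.5

  # Per-job threshold: use the job arg when it is a number,
  # otherwise fall back to the global value.
  def resolve_threshold(args) do
    case Map.get(args, "threshold") do
      t when is_number(t) -> t
      _ -> @global_link_threshold
    end
  end

  # Split orphans by whether they carry an embedding, then describe
  # the job each group would receive, plus the audit counts.
  def plan(orphans) do
    {embedded, unembedded} =
      Enum.split_with(orphans, fn orphan -> orphan.embedding != nil end)

    relink_jobs =
      Enum.map(embedded, fn orphan ->
        %{
          worker: :article_linking,
          args: %{"article_id" => orphan.id, "threshold" => @orphan_link_threshold}
        }
      end)

    embed_jobs =
      Enum.map(unembedded, fn orphan ->
        %{worker: :article_embedding, args: %{"article_id" => orphan.id}}
      end)

    %{
      jobs: relink_jobs ++ embed_jobs,
      audit: %{
        orphans_relinked: length(embedded),
        orphans_embedding_enqueued: length(unembedded)
      }
    }
  end
end

# Hypothetical orphans: two with embeddings, one without.
orphans = [
  %{id: 1, embedding: [0.1, 0.9]},
  %{id: 2, embedding: nil},
  %{id: 3, embedding: [0.7, 0.3]}
]

plan = OrphanHealSketch.plan(orphans)
IO.inspect(plan.audit, label: "audit")
IO.inspect(OrphanHealSketch.resolve_threshold(%{"article_id" => 1}), label: "no override")
```

Keeping the lenient value as a per-job arg, not a change to the global setting, is what keeps ordinary linking at 0.6 while orphans get the looser cut.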
Note
0.5 is a contained default (orphans only, never global linking). Tune :knowledge_lint_orphan_link_threshold if you want stricter or looser orphan rescue.
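As a hedged example of tuning, assuming the key is read from the application environment: the app name :my_app below is a placeholder, and the file the setting actually lives in is not stated in this PR.

```elixir
# config/config.exs (sketch; :my_app stands in for the real OTP app name)
import Config

# Stricter orphan rescue than the 0.5 default, still below the global 0.6.
config :my_app, knowledge_lint_orphan_link_threshold: 0.55
```

Under that assumption, the worker would read it with something like Application.get_env(:my_app, :knowledge_lint_orphan_link_threshold, 0.5), so leaving the key unset keeps the default.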