feat(knowledge): expand category taxonomy + single source of truth (P3 stage 1) #206
Merged
…ce of truth

P3 stage 1 of the second-brain taxonomy work. Adds the richer category set and, crucially, ends the 6-way duplication of the category list that was silently drifting (Article schema, both LLM extractor prompts + their @valid_categories guards, and two ingestion workers each carried their own copy).

New module Loopctl.Knowledge.Categories is the canonical source:
- active/0 -> pattern, decision, finding, reference, playbook, insight, entity, idea, quote, question (the 4 kept + 6 new). What new content is classified as.
- retired/0 -> convention. Still DB-valid (the category column is a plain string and ~13% of the corpus is convention; dropping the enum value before the 77k reclassification backfill would make those rows fail to load), but not offered to new content.
- definition/1 + prompt_fragment/0 -> per-category definitions injected into the extraction prompts so the model classifies into the full taxonomy correctly (e.g. playbook = the steps vs pattern = the shape).

Wired through:
- Article.@category_values -> Categories.all() (Ecto.Enum now can't drift)
- ClaudeContentExtractor / LlmExtractor: prompt rebuilt from Categories at compile time; @valid_categories -> Categories.all_strings() (lenient gate; prompt only offers active categories)
- ReviewKnowledgeWorker / ContentIngestionWorker: @valid_categories -> Categories.all()

scale_seed.ex is intentionally left alone: it is synthetic scale-test data whose category round-robin feeds index-selectivity gates; changing its list length is orthogonal to the product taxonomy and risks scale-gate flakiness.

No DB migration: category is a :string column, so new enum values need no schema change.

Tests: Categories unit (composition, string helpers, definitions, prompt fragment, and an enum-vs-canonical drift guard via Ecto.Enum.values/2) + Article changeset accepts every new category.

Full local gate green (format, credo --strict, dialyzer 0 errors).

Next P3 stages (separate PRs): CategoryClassifier behaviour + DI; resumable, cost-ceilinged, dry-run-first backfill worker over the 77k corpus (write-on-confident-change, always reclassify convention); cross-repo synth.py / publish.py + user CLAUDE.md category list.
What
P3 stage 1 of the second-brain taxonomy work: expand the category set and centralize it as a single source of truth.
The corpus only had 5 categories (pattern/convention/decision/finding/reference). This adds the richer taxonomy you approved and, just as importantly, kills the 6-way duplication of the category list that was silently drifting.

Taxonomy
New module Loopctl.Knowledge.Categories is canonical:
- active/0: pattern, decision, finding, reference, playbook, insight, entity, idea, quote, question (the 4 kept + 6 new). What new content is classified as.
- retired/0: convention. Still DB-valid (the category column is a plain string and ~13% of the corpus is convention; dropping the enum value before the reclassification backfill would make those rows fail to load), but not offered to new content.
- definition/1 + prompt_fragment/0: per-category definitions injected into the LLM extraction prompts so the model classifies into the full taxonomy correctly (e.g. playbook = the steps vs pattern = the shape).

Wired through (drift eliminated)
- Article.@category_values → Categories.all() (Ecto.Enum can't drift)
- ClaudeContentExtractor prompt + @valid_categories → prompt built from Categories at compile time; gate → Categories.all_strings()
- LlmExtractor prompt + @valid_categories → same as ClaudeContentExtractor
- ReviewKnowledgeWorker.@valid_categories → Categories.all()
- ContentIngestionWorker.@valid_categories → Categories.all()

scale_seed.ex is intentionally left alone: it's synthetic scale-test data whose category round-robin feeds index-selectivity gates; changing its list length is orthogonal to the product taxonomy and would risk scale-gate flakiness.

No DB migration: category is a :string column, so new enum values need no schema change.

Tests
- categories_test.exs: composition (all = active ++ retired, disjoint), the expanded active set, string helpers, definitions cover every active category, prompt fragment includes all active / excludes retired, and an enum-vs-canonical drift guard (Ecto.Enum.values(Article, :category) == Categories.all()).
- article_test.exs: changeset accepts every new category.
- knowledge_test.exs + knowledge_stats_controller_test.exs: the dense by_category stats maps grew from 5→11 keys (correct: stats derives the dense map from the enum). Updated the expectations to derive from Categories rather than re-hardcode, so a future taxonomy change can't silently break them again.

Full local gate green: format, credo --strict, dialyzer (0 errors), 2987 tests / 0 failures.
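For readers who want the shape of the canonical module and its composition checks in one place, here is a minimal sketch in plain Elixir. It is not the code from this PR: the module names CategoriesSketch and ConsumerSketch, the definition strings (only the playbook/pattern contrast comes from the description above), and the prompt line format are all assumptions. The checks at the end compare plain lists rather than calling Ecto.Enum.values/2, so the snippet runs without Ecto.

```elixir
defmodule CategoriesSketch do
  @moduledoc "Illustrative stand-in for Loopctl.Knowledge.Categories (wording assumed)."

  # Offered to new content: the 4 kept categories followed by the 6 new ones.
  @active [:pattern, :decision, :finding, :reference, :playbook,
           :insight, :entity, :idea, :quote, :question]

  # Still valid for existing rows, but never offered to new content.
  @retired [:convention]

  # Hypothetical one-line definitions, written for this sketch only.
  @definitions %{
    pattern: "the shape of a recurring solution",
    decision: "a choice that was made and why",
    finding: "something observed or measured",
    reference: "a pointer to external material",
    playbook: "the steps to carry something out",
    insight: "a non-obvious takeaway",
    entity: "a person, project, tool, or organization",
    idea: "a proposal not yet acted on",
    quote: "a verbatim statement worth keeping",
    question: "an open question"
  }

  def active, do: @active
  def retired, do: @retired
  def all, do: @active ++ @retired
  def all_strings, do: Enum.map(all(), &Atom.to_string/1)

  # Definitions exist only for active categories.
  def definition(category) when category in @active do
    Map.fetch!(@definitions, category)
  end

  # One line per active category; retired categories are left out of the prompt.
  def prompt_fragment do
    Enum.map_join(@active, "\n", fn cat -> "- #{cat}: #{definition(cat)}" end)
  end
end

defmodule ConsumerSketch do
  # A consumer derives its gate at compile time instead of carrying its own copy.
  @valid_categories CategoriesSketch.all_strings()

  def valid?(category) when is_binary(category), do: category in @valid_categories
end

# Composition checks in the spirit of the unit tests listed above.
true = (CategoriesSketch.all() == CategoriesSketch.active() ++ CategoriesSketch.retired())
true = (CategoriesSketch.active() -- CategoriesSketch.retired()) == CategoriesSketch.active()

IO.puts(CategoriesSketch.prompt_fragment())
```

The gate in ConsumerSketch is lenient in the same sense as described above: it accepts the retired value, while the prompt fragment only ever offers the active ones.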
Next P3 stages (separate PRs)
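- CategoryClassifier behaviour + config-DI (Claude-backed) + Mox.
- Resumable, cost-ceilinged, dry-run-first backfill worker over the 77k corpus (write-on-confident-change, always reclassify convention); emits a proposed-distribution report before committing.
- Cross-repo: synth.py / publish.py category enum (claude-config) + user CLAUDE.md category list.
- Once the backfill brings convention to 0, a follow-up drops it from active/retired.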
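None of these later stages exist yet, so the following is only a sketch of how a classifier behaviour, config-based injection, and a dry-run pass could fit together. Every name here is hypothetical (ClassifierSketch, StubClassifier, BackfillSketch, the :loopctl_sketch config key, the 0.8 confidence threshold, the row shape), and reading "always reclassify convention" as "ignore the confidence threshold for convention rows" is an interpretation, not something this PR specifies.

```elixir
defmodule ClassifierSketch do
  @moduledoc "Hypothetical behaviour; the real CategoryClassifier contract is not defined in this PR."
  @callback classify(text :: String.t()) :: {:ok, atom(), float()} | {:error, term()}
end

defmodule StubClassifier do
  @moduledoc "Deterministic stand-in for a Claude-backed implementation."
  @behaviour ClassifierSketch

  @impl true
  def classify(text) do
    if String.contains?(text, "step") do
      {:ok, :playbook, 0.9}
    else
      {:ok, :finding, 0.4}
    end
  end
end

defmodule BackfillSketch do
  # Assumed confidence threshold for "write-on-confident-change".
  @threshold 0.8

  # Dry run: computes a proposed category per row and the resulting
  # distribution. Nothing is written anywhere.
  def dry_run(rows) do
    # Config-based injection: the implementation is looked up at runtime,
    # falling back to the stub when nothing is configured.
    classifier = Application.get_env(:loopctl_sketch, :category_classifier, StubClassifier)
    proposals = Enum.map(rows, &propose(&1, classifier))
    %{proposals: proposals, distribution: Enum.frequencies_by(proposals, & &1.proposed)}
  end

  defp propose(%{category: current, text: text} = row, classifier) do
    proposed =
      case classifier.classify(text) do
        # Retired rows take the new category regardless of confidence (assumed reading).
        {:ok, new, _confidence} when current == :convention -> new
        # Other rows change only when the classifier is confident.
        {:ok, new, confidence} when confidence >= @threshold -> new
        # Low confidence or an error keeps the current category.
        _ -> current
      end

    Map.put(row, :proposed, proposed)
  end
end

rows = [
  %{id: 1, category: :convention, text: "always run mix format"},
  %{id: 2, category: :pattern, text: "step 1: install, step 2: run"},
  %{id: 3, category: :decision, text: "we chose Postgres"}
]

report = BackfillSketch.dry_run(rows)
IO.inspect(report.distribution, label: "proposed distribution")
```

Returning the distribution alongside the per-row proposals keeps the commit step separate, so the report can be reviewed before any row is updated.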