design-proposal: unified TLS and PKI model#19
design-proposal: unified TLS and PKI model#19Aleksei Sviridkin (lexfrei) wants to merge 2 commits into
Conversation
|
Warning Review limit reached
More reviews will be available in 1 hour. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA new design proposal document is added at ChangesUnified TLS/PKI Design Proposal
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal for a unified TLS and PKI model for managed applications in Cozystack, outlining a two-tier architecture that separates edge-tier wildcard certificates from interior operator-owned PKI. Feedback on the proposal suggests programmatically sanitizing destination Secrets in the CA-distribution controller to explicitly omit private keys, and emitting Kubernetes Warning Events during naming collisions to prevent silent TLS verification failures.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
| For **operator-owned engines** (postgres, kafka, mongodb), the operator already publishes a CA-bearing Secret synchronously enough to consume, and for kafka it is already key-free. The work is to converge each on the canonical `<release>-ca-cert` shape and label — no new controller is required. | ||
|
|
||
| For **cert-manager-minting engines** (nats, qdrant, and the four open per-app pull requests), none of the chart-time paths work: `valuesFrom` is pinned, `lookup` is gone, and the CA is asynchronous (see "The problem"). The proposed mechanism is a small **watch-driven CA-distribution controller**, modelled directly on the existing wildcard-secret reconciler (`internal/controller/wildcardsecret/reconciler.go`, from `cozystack/cozystack#2990`). That reconciler already demonstrates the exact pattern: it watches a source Secret, projects copies into the right namespaces, marks every copy it owns with a management label (`cozystack.io/wildcard-secret-copy: "true"`), garbage-collects stale copies, and refuses to touch a foreign Secret that happens to share the name. The CA-distribution controller does the analogous thing one level down: it watches the per-release `<release>-ca` Secret that cert-manager creates asynchronously, and projects a `<release>-ca-cert` object carrying only `ca.crt` with the tenant-resource label. Because it is watch-driven, it sees the CA appear after render; because it writes a Secret the engine and the projection already understand, it needs neither `lookup` nor a custom `valuesFrom`. |
There was a problem hiding this comment.
To ensure robust security, the CA-distribution controller must programmatically sanitize the destination Secret. It should explicitly copy only the ca.crt field and guarantee that any private keys (such as tls.key or ca.key) are omitted, rather than relying on external validation. This prevents accidental exposure of private keys through the label-filtered tenantsecrets projection.
|
|
||
| - The helper's input PEM contains a private-key header → render fails closed; the chart does not deploy a key-bearing Secret. | ||
| - The per-release CA Secret does not exist yet → the controller waits for the watch event; it does not error or busy-loop. | ||
| - A foreign Secret already occupies the target name → the controller leaves it untouched (management-label guard). |
There was a problem hiding this comment.
Leaving a colliding foreign Secret untouched prevents overwriting, but it will cause silent TLS verification failures for the tenant because the correct CA certificate is never projected. The controller should emit a Kubernetes Warning Event on the corresponding application or namespace to alert operators of the naming collision.
Record a written target design for the TLS/PKI convergence: two-tier (edge issuance + interior operator-owned PKI), a per-engine PKI-ownership contract, and a delivery mechanism for the cert-manager-minting engines where charts cannot wire the CA-only helper directly. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
d4fdaf8 to
f1966d9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
design-proposals/unified-tls-pki/README.md (1)
177-179: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDecide naming/scope before controller implementation starts.
The object naming/namespace and controller scope are still open questions, but rollout proceeds directly to implementation. Consider adding an explicit decision gate before Line 173 so follow-up PRs don’t diverge on contract shape.
Also applies to: 172-174
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@design-proposals/unified-tls-pki/README.md` around lines 177 - 179, Add an explicit decision gate before implementation begins by clarifying the canonical `<release>-ca-cert` naming/namespace contract and the controller scope in the unified TLS PKI README, since these are still open. Update the nearby section around the existing open questions to state that the CA-distribution controller shape must be decided first, and tie it to the cert-manager-minting engines versus a generalized projector, plus the per-tenant wildcard propagation mechanism so future PRs follow one agreed contract.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@design-proposals/unified-tls-pki/README.md`:
- Line 20: The README has conflicting status for cozystack/cozystack#2990
between the “Edge” note and the later rollout summary, so update the wording to
use one consistent status or date-qualified phrasing. Fix the entries around the
Edge status and the later section that mentions `#2990` so they agree, using the
existing “Edge, open” / “done” labels and the issue reference to make the
rollout state unambiguous.
---
Nitpick comments:
In `@design-proposals/unified-tls-pki/README.md`:
- Around line 177-179: Add an explicit decision gate before implementation
begins by clarifying the canonical `<release>-ca-cert` naming/namespace contract
and the controller scope in the unified TLS PKI README, since these are still
open. Update the nearby section around the existing open questions to state that
the CA-distribution controller shape must be decided first, and tie it to the
cert-manager-minting engines versus a generalized projector, plus the per-tenant
wildcard propagation mechanism so future PRs follow one agreed contract.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d9ffcbdd-488c-485b-8826-c7cbb0c4a005
📒 Files selected for processing (1)
design-proposals/unified-tls-pki/README.md
|
Strong proposal — pinning the two-tier model and the per-engine PKI-ownership table on paper is exactly what was missing, and §"The problem" nails the real delivery gap (pinned My one substantive concern is §5 (Delivery). The CA-distribution controller is justified by a What's already in place (all in cozystack/cozystack)The "which secrets are tenant-facing" contract is already declarative and is not name-based:
The consequence: if the projected The contract that's actually missingOnly the production of a key-free object is genuinely new (nothing in the platform strips a key today; the projection is whole-
One caveat on the ownerRef targetDon't owner-ref the source NetThe irreducible new work is the extraction step itself — read the CA secret, write the key-free copy, re-copy on rotation — which is the same job whether an operator does it natively (kafka's (Separately, a more principled root fix worth a sentence in "Alternatives": give the projection/selector a field filter so a single One factual check for the table: it lists CNPG |
The PKI-ownership table claimed postgres already delivers a key-free ca.crt. In fact CNPG's <release>-credentials Secret carries the server tls.key and is the object surfaced to tenants, so postgres is the canonical CA / private-key coupling case, not a converged engine. Fix the table, the takeaway, and the interior-state text, and state that convergence for postgres means a dedicated key-free <release>-ca-cert plus removing <release>-credentials from the tenant-facing surface. Replace the name-convention CA-distribution controller with an explicit source-selection label plus a small extraction controller, reusing the lineage webhook's spec.secrets label selector, owner-reference walk, and authoritative tenantresource stamping for marking and garbage collection. The controller sanitises to ca.crt at write time, owner-refs the projected Secret to the application instance, and emits a Warning Event on a name collision instead of failing silently. Record the field-filter projection as the principled root fix, align the per-tenant-propagation PR status across the document, and note the clustersecret-operator overlap with cross-namespace wildcard propagation. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
|
Adopted the whole §5 restructuring — this is the right call. The proposal now drops the name convention for an explicit On the CNPG Two security details I tightened alongside: the extraction controller sanitises to |
Andrei Kvapil (kvaps)
left a comment
There was a problem hiding this comment.
Direction is right and the doc is well-grounded — these are non-blocking comments, no objection to the approach.
One factual correction — postgres is not the coupling case.
The doc makes postgres the canonical CA / private-key coupling instance ("carries the server tls.key", "the live instance of this risk") and sequences it first. On main that doesn't hold:
packages/apps/postgres/templates/init-script.yaml:18-27renders<release>-credentialsas anOpaqueSecret whosestringDatais onlyuser: passwordpairs — notls.key, noca.crt.packages/apps/postgres/templates/dashboard-resourcemap.yaml:16-22surfaces exactly that passwords Secret to the tenant.- The "CNPG-created, carries
ca.crt" claim traces to the comment atpackages/apps/postgres/templates/db.yaml:20-22, which contradictsinit-script.yaml:<release>-credentialsis chart-owned, and CNPG'sca.crtlives in<release>-ca/-server/-app, none of which are in the dashboard Role.
So there's no private-key-to-tenant leak in any merged engine today. postgres is the same trust-anchor delivery gap as nats/qdrant (the tenant gets passwords, never ca.crt), not a coupling. Please fix the table row, the Security section, and the Upgrade step — removing <release>-credentials would drop the tenant's connection passwords, so convergence should add a key-free CA object, not remove that Secret. The db.yaml comment is wrong on its own; I'll fix it separately.
Kamaji belongs in the issuer enumeration, not just the exclusion row. It's the fourth independent PKI owner (control-plane CA for managed kubernetes), and it's the one that's fundamentally non-swappable — the kubeconfig pins that cluster CA, so a public edge cert is meaningless there. That's the strongest argument for "unify the interface, not the CA". Listing it only as "excluded" hides the point — I'd present it as a first-class example of the fork.
I'd prefer the field-filter over the extraction controller. The doc lists "a field filter on the projection itself" under Alternatives and defers it as touching the tenant-facing API. That's exactly where it belongs — the projection (pkg/registry/core/tenantsecret/rest.go, secretToTenant copies the whole Data) is the actual trust boundary. Extending ApplicationDefinition.spec.secrets.include with a per-key allow-list (e.g. keys: [ca.crt]) and applying it in the projection would:
- remove the need for a new controller, a second
<release>-ca-certobject, and the runtime no-private-key re-check; - work uniformly for operator-owned engines too (project
ca.crtstraight out of the CNPG secret — no async / pinned-valuesFrom/lookupproblem); - keep the security boundary in one authoritative place instead of relying on every chart plus a runtime guard.
It's backward-compatible (no keys → today's whole-Data behavior). Cross-namespace wildcard distribution (#2820) stays separate — clustersecret-operator already covers that class.
The framing flip, the two-tier model, the whole-Data projection as the security pivot, and the kafka reference shape are all on point.
Andrei Kvapil (kvaps)
left a comment
There was a problem hiding this comment.
Direction is right and the doc is well-grounded — these are non-blocking comments, no objection to the approach.
One factual correction — postgres is not the coupling case.
The doc makes postgres the canonical CA / private-key coupling instance ("carries the server tls.key", "the live instance of this risk") and sequences it first. On main that doesn't hold:
packages/apps/postgres/templates/init-script.yaml:18-27renders<release>-credentialsas anOpaqueSecret whosestringDatais onlyuser: passwordpairs — notls.key, noca.crt.packages/apps/postgres/templates/dashboard-resourcemap.yaml:16-22surfaces exactly that passwords Secret to the tenant.- The "CNPG-created, carries
ca.crt" claim traces to the comment atpackages/apps/postgres/templates/db.yaml:20-22, which contradictsinit-script.yaml:<release>-credentialsis chart-owned, and CNPG'sca.crtlives in<release>-ca/-server/-app, none of which are in the dashboard Role.
So there's no private-key-to-tenant leak in any merged engine today. postgres is the same trust-anchor delivery gap as nats/qdrant (the tenant gets passwords, never ca.crt), not a coupling. Please fix the table row, the Security section, and the Upgrade step — removing <release>-credentials would drop the tenant's connection passwords, so convergence should add a key-free CA object, not remove that Secret. The db.yaml comment is wrong on its own; I'll fix it separately.
Kamaji belongs in the issuer enumeration, not just the exclusion row. It's the fourth independent PKI owner (control-plane CA for managed kubernetes), and it's the one that's fundamentally non-swappable — the kubeconfig pins that cluster CA, so a public edge cert is meaningless there. That's the strongest argument for "unify the interface, not the CA". Listing it only as "excluded" hides the point — I'd present it as a first-class example of the fork.
I'd prefer the field-filter over the extraction controller. The doc lists "a field filter on the projection itself" under Alternatives and defers it as touching the tenant-facing API. That's exactly where it belongs — the projection (pkg/registry/core/tenantsecret/rest.go, secretToTenant copies the whole Data) is the actual trust boundary. Extending ApplicationDefinition.spec.secrets.include with a per-key allow-list (e.g. keys: [ca.crt]) and applying it in the projection would:
- remove the need for a new controller, a second
<release>-ca-certobject, and the runtime no-private-key re-check; - work uniformly for operator-owned engines too (project
ca.crtstraight out of the CNPG secret — no async / pinned-valuesFrom/lookupproblem); - keep the security boundary in one authoritative place instead of relying on every chart plus a runtime guard.
It's backward-compatible (no keys → today's whole-Data behavior). Cross-namespace wildcard distribution (#2820) stays separate — clustersecret-operator already covers that class.
The framing flip, the two-tier model, the whole-Data projection as the security pivot, and the kafka reference shape are all on point.
|
On the field-filter: I think it's a good idea, and I'd rather not fold it into this proposal — not because it's wrong, but because it's a bigger and largely separate thing that happens to also solve the CA case. The field-filter is a change to the tenant-secret projection contract itself, not to the TLS model. Today The reason I don't want to couple them is concrete: the §5 mechanism already adopted here — the So my suggestion: spin field-level projection out as its own design proposal. It stands on its own merits — it generalises to any multi-key secret and would let operator-owned engines project |
Adds a design proposal recording the target architecture for the TLS/PKI convergence tracked by epic cozystack/cozystack#2811.
The edge of that convergence has landed; the interior — who owns the PKI per managed engine and how a tenant receives the trust anchor — has not. This proposal pins it down before the per-app TLS pull requests land, and decides the two forks at its center on paper:
ca.crtconsume contract, rather than "cert-manager as the single issuer".It also documents the verified delivery gap (pinned valuesFrom, removed lookup, async per-release CA) and proposes a watch-driven CA-distribution controller modelled on the existing wildcard-secret reconciler.
Status: Draft.
Summary by CodeRabbit