
design-proposal: compute plane for untrusted-code workloads#17

Open
Andrei Kvapil (kvaps) wants to merge 2 commits into main from proposal/compute-plane


@kvaps Andrei Kvapil (kvaps) commented Jun 23, 2026

Member

Adds a design proposal for compute planes.

What

A compute plane is a Cozystack-managed Kubernetes cluster that a tenant does not see or manage, onto which untrusted-code workloads (notebooks, workflow "code" nodes, plugin systems, custom components) are placed instead of into the tenant namespace on the management cluster. The compute plane has no credentials to and no network path to the management/infra control plane; the management cluster applies workloads into it one-way via Flux, and tenant access is proxied back through the normal ingress entry point.

Why

Cozystack's model treats a managed app as a single-purpose barrier the tenant cannot cross (you can't run an arbitrary binary inside your managed Postgres). A growing class of apps breaks that by design — their feature is arbitrary code execution. Co-locating those with the management plane is unsafe; this proposal extends the barrier property to them instead of weakening it. No known exploit — a latent gap to close before such apps reach shared/production clusters.

How it maps to existing primitives

Built entirely on things that already exist: the managed kubernetes app (Kamaji control plane + KubeVirt nodes, GPU node groups, autoscaler), tenant modules, and Flux remote apply via HelmRelease.spec.kubeConfig.secretRef (already used by the kubernetes app for its own addons). Delivered as a tenant module (computePlane: "<profile>", a single-string profile reference — one source of truth, no inline override blob). App routing is a new placement: { ManagementPlane | ComputePlane } enum on ApplicationDefinition (default ManagementPlane = today's behavior).

Related proposals

#4 (tenant-module-overrides), #7 (cross-cluster-tenant-mesh), #8 / #9 (kubernetes-nodes).

Rendered: design-proposals/compute-plane/README.md

Summary by CodeRabbit

  • Documentation
    • Added a new design proposal for ComputePlane-based workload isolation on a separate managed Kubernetes cluster.
    • Described how tenant workloads can be placed on ComputePlane, how access is routed, and how service-to-service connectivity is restricted.
    • Included rollout phases, failure scenarios, test plans, and alternatives considered.

Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Walkthrough

A new design proposal document (design-proposals/compute-plane/README.md) is added, defining the ComputePlane concept: a Cozystack-managed Kubernetes cluster that isolates untrusted workloads, delivered as a tenant module, with apps routed via remote Flux apply and inbound access proxied through the tenant entry point.

Changes

ComputePlane Design Proposal

| Layer / File(s) | Summary |
|---|---|
| Context, background, and problem framing (`design-proposals/compute-plane/README.md`) | Document metadata, overview of existing Cozystack tenant isolation and remote Flux apply primitives, core problem statement, goals, and non-goals (no cross-tenant sharing, no Kubernetes-alone multi-tenancy claims). |
| Core design: ComputePlane kind, tenant module, remote apply, and network access (`design-proposals/compute-plane/README.md`) | ComputePlane defined as a distinct CRD kind, surfaced as a computePlane tenant module with single-tenant non-inheritance semantics; placement: ComputePlane apps routed via HelmRelease kubeConfig.secretRef; inbound access proxied through the tenant entry point; ComputePlane-to-tenant egress constrained by per-service CiliumNetworkPolicy (explicitly no kube-apiserver reachability). |
| Compatibility, security guarantees, and failure/edge cases (`design-proposals/compute-plane/README.md`) | Additive upgrade/rollback compatibility described; security guarantees enumerated (no management credentials, no kube-API path back, separate identity/RBAC, single-tenant isolation); failure/edge cases defined including readiness ordering, closed-fail on missing kubeconfig, admission rejection without a ComputePlane, GPU exhaustion, and deletion finalizer ordering. |
| Testing plan, rollout phases, open questions, and alternatives (`design-proposals/compute-plane/README.md`) | Unit, two-cluster integration, network/security, and end-to-end test plans enumerated; two-phase rollout outlined (core primitive then cozyllm consumers); open questions listed on naming, sharing, observability, billing; rejected alternatives documented including container hardening, gVisor, shared execution clusters, and per-module valuesOverride shape. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 Hops across the cluster wall,
A ComputePlane to catch the fall,
No credentials sneak back through,
Each tenant sealed with careful glue.
Remote Flux applies with flair,
Untrusted code runs over there!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: a design proposal for compute planes for untrusted-code workloads. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a design proposal for ComputePlane, a Cozystack-managed Kubernetes cluster designed to isolate untrusted-code workloads from the management control plane. The reviewer feedback highlights three key areas for improvement: ensuring that the generated HelmRelease has spec.install.createNamespace: true enabled to prevent installation failures on the remote cluster, clarifying the exact data-path proxying mechanism to guarantee secure network isolation, and specifying how deletion ordering is enforced via a finalizer to prevent Flux finalizers from blocking indefinitely when a tenant is deleted.


```yaml
values: { ... }
```

The routing decision is driven by the `placement` enum on the `ApplicationDefinition` — `ManagementPlane` (default) applies into the tenant namespace on the management cluster as today; `ComputePlane` injects the ComputePlane `kubeConfig.secretRef`. The two values name the two symmetric planes. This keeps the routing policy declarative and out of per-app charts.


medium

When using spec.kubeConfig to remote-apply a HelmRelease to the ComputePlane, the Helm controller will attempt to install the release into the target namespace (e.g., tenant-<name>). Since the ComputePlane is a freshly provisioned, separate cluster, this namespace will not exist by default. It is important to specify that spec.install.createNamespace: true must be enabled on the generated HelmRelease to prevent installation failures.

Suggested change
The routing decision is driven by the `placement` enum on the `ApplicationDefinition`: `ManagementPlane` (default) applies into the tenant namespace on the management cluster as today; `ComputePlane` injects the ComputePlane `kubeConfig.secretRef`. The two values name the two symmetric planes. This keeps the routing policy declarative and out of per-app charts.
The routing decision is driven by the `placement` enum on the `ApplicationDefinition`: `ManagementPlane` (default) applies into the tenant namespace on the management cluster as today; `ComputePlane` injects the ComputePlane `kubeConfig.secretRef`. The two values name the two symmetric planes. This keeps the routing policy declarative and out of per-app charts.
Additionally, because the ComputePlane is a separate cluster, the target namespace (e.g., `tenant-<name>`) will not exist by default. The generated `HelmRelease` must ensure `spec.install.createNamespace: true` is set so Flux can provision the namespace on the remote cluster during installation.
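
As a rough illustration of the suggestion above, a generated `HelmRelease` for a `placement: ComputePlane` app might look like the sketch below. The `spec.kubeConfig.secretRef` and `spec.install.createNamespace` fields are standard Flux fields; the app name, namespace, chart source, and Secret name are all hypothetical placeholders and do not come from the proposal.

```yaml
# Sketch only: every name below is a hypothetical placeholder.
# apiVersion may need adjusting to the Flux version actually installed.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: jupyterhub                 # hypothetical app name
  namespace: tenant-example        # tenant namespace on the management cluster
spec:
  interval: 5m
  chart:
    spec:
      chart: jupyterhub            # hypothetical chart name
      sourceRef:
        kind: HelmRepository
        name: example-apps         # hypothetical chart source
  # Remote apply: Flux uses this kubeconfig instead of the local cluster.
  kubeConfig:
    secretRef:
      name: compute-plane-kubeconfig   # hypothetical Secret name
  # Namespace on the ComputePlane; it does not exist on a fresh cluster.
  targetNamespace: tenant-example
  install:
    createNamespace: true          # let Helm create the target namespace remotely
  values: {}
```

With `placement: ManagementPlane`, the same object would simply omit `spec.kubeConfig`, which is what keeps the routing decision out of the per-app chart.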


### 4. Access is proxied back through the tenant's normal entry point

Workloads expose themselves on the ComputePlane via standard Ingress/Gateway, and the ComputePlane's ingress is wired back to the tenant's existing entry point so the user reaches the app at a normal hostname. The user never receives ComputePlane credentials; only HTTP(S) app traffic crosses back, through the proxy/ingress path — not the kube-API path.


medium

To strengthen the security guarantees, it would be beneficial to clarify the exact data-path plumbing for the proxying mechanism. Since the ComputePlane runs on KubeVirt VMs on the management cluster, explicitly stating how the management cluster's ingress routes traffic to the ComputePlane's ingress (e.g., via the VM's internal service/bridge IP) helps ensure there is no accidental reverse network path exposed to the untrusted workloads.

Suggested change
Workloads expose themselves on the ComputePlane via standard Ingress/Gateway, and the ComputePlane's ingress is wired back to the tenant's existing entry point so the user reaches the app at a normal hostname. The user never receives ComputePlane credentials; only HTTP(S) app traffic crosses back, through the proxy/ingress path — not the kube-API path.
Workloads expose themselves on the ComputePlane via standard Ingress/Gateway, and the ComputePlane's ingress is wired back to the tenant's existing entry point so the user reaches the app at a normal hostname. The user never receives ComputePlane credentials; only HTTP(S) app traffic crosses back, through the proxy/ingress path — not the kube-API path. Specifically, the management cluster's ingress controller proxies traffic directly to the KubeVirt VM's internal service/bridge IP, ensuring no reverse network path is exposed to the workload.

- **ComputePlane kubeconfig Secret missing/rotated** → remote apply fails closed (no fallback to local apply on the management cluster); status reflects the error. Failing closed is the security-correct behavior.
- **App declares `placement: ComputePlane` but no ComputePlane module set in the tenant chain** → reject at admission / surface a clear status error rather than silently deploying locally (which would re-introduce the risk). Inheritance walk: use the nearest enabling ancestor; if none, reject.
- **GPU exhaustion** → cluster-autoscaler adds GPU node groups up to `maxReplicas`; beyond that the workload pends, as in any autoscaled cluster.
- **Tenant deletion** → ComputePlane and its workloads are torn down with the tenant; ordering must delete remote HelmReleases before deprovisioning the ComputePlane to avoid orphaned remote resources.


medium

Enforcing that remote HelmReleases are deleted before the ComputePlane is deprovisioned is a common challenge with Flux remote apply. If the ComputePlane cluster is destroyed first, Flux's HelmRelease will block indefinitely on its finalizer because it can no longer connect to the target API to perform cleanup. Specifying how this ordering is enforced (e.g., via a finalizer on the ComputePlane custom resource) makes the design robust against this failure mode.

Suggested change
- **Tenant deletion** → ComputePlane and its workloads are torn down with the tenant; ordering must delete remote HelmReleases before deprovisioning the ComputePlane to avoid orphaned remote resources.
- **Tenant deletion** → ComputePlane and its workloads are torn down with the tenant; ordering must delete remote HelmReleases before deprovisioning the ComputePlane to avoid orphaned remote resources. This ordering is enforced by a finalizer on the `ComputePlane` custom resource, which blocks its deletion until all associated remote `HelmRelease` resources have been successfully cleaned up.
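
A minimal sketch of what the suggested finalizer could look like on the resource itself. The proposal does not give the ComputePlane API group, version, or finalizer name, so all three are invented here purely for illustration; only `metadata.finalizers` is a standard Kubernetes mechanism.

```yaml
# Sketch only: the API group/version and the finalizer name are hypothetical.
apiVersion: example.cozystack.io/v1alpha1
kind: ComputePlane
metadata:
  name: compute-plane
  namespace: tenant-example        # hypothetical tenant namespace
  finalizers:
    # While this entry is present, the API server keeps the object (and thus
    # the cluster it owns) in "Terminating" instead of removing it.
    - example.cozystack.io/remote-helmreleases-cleanup
```

Under this assumption, the owning controller would remove the finalizer only once no `HelmRelease` referencing this ComputePlane's kubeconfig Secret remains, so Flux can still reach the remote API server while it uninstalls each release.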

@lllamnyp

Timofei Larkin (lllamnyp) commented Jun 24, 2026

Member

Review: ComputePlane design proposal

The core idea is sound and the mechanism choice is right: routing placement: ComputePlane apps onto a Kamaji+KubeVirt cluster via remote Flux apply reuses proven plumbing, and putting untrusted code behind a VM boundary is the correct trust boundary. The doc is thorough on the management→ComputePlane direction. My remarks are about where the framing over-claims, where an analogy imports a security bug, and one omission that I think is actually load-bearing for whether the headline apps work at all.

1. "Hidden from the tenant" is conflated with the security boundary — and it isn't one

The doc leans hard on invisibility as a selling point ("a tenant does not see and does not manage," "hidden cluster," repeated in Overview, Goals, §1, User-facing changes). But two separable properties are being bundled:

  • (a) The ComputePlane has no credentials and no network path back to the management plane. This is the load-bearing security property. It's directional: ComputePlane → management is denied.
  • (b) The tenant cannot see or touch the ComputePlane. This is a UX/product choice, and it's the opposite direction: tenant → ComputePlane.

(b) does not follow from (a) and is not itself a security guarantee. The isolation argument is entirely about the ComputePlane→management direction; handing the tenant a scoped, read-only view of their own ComputePlane (it's their untrusted workloads, on a cluster that already has no path back to management) weakens nothing in that argument.

And the cost of full opacity is real. When the Jupyter pod CrashLoops, OOMs, or the GPU node won't schedule (the doc's own "GPU exhaustion → pod pends" failure case), the tenant has zero kubectl logs/describe/events. Every such incident becomes a platform-operator support ticket, and the design forecloses the alternative by making invisibility a hard requirement rather than a default. I share the concern that this is the kind of thing we get wrong the first time and then can't walk back.

If hiding is intentional, the justification needs to be stated, and I think the real one is narrower than "hide everything." The legitimate reason to withhold access is tamper-resistance: the platform wants to deploy hardening into the ComputePlane — restricted PSA, egress NetworkPolicies, admission control — that the tenant must not be able to remove, because removing it is what makes untrusted code dangerous. But that argues for withholding admin/write, not visibility. A read-only or namespace-scoped kubeconfig keeps the hardening tamper-proof while preserving debuggability.

Suggested resolution: decouple the two properties explicitly. Keep "no credentials/path back to management" as the security guarantee; demote "tenant cannot see the ComputePlane" to a default, and design a scoped read/observability path so users aren't operating a black box. If full opacity really is intended, say so and give the tamper-resistance argument — but I'd push back on it.

2. The "1-click" contrast doesn't distinguish from the rejected alternative; and the target needn't be a hidden ComputePlane (nit)

Two smaller things.

The 1-click framing. §2 says the ComputePlane module is enabled by the parent tenant at child-creation time. So the real flow is: someone provisions a full Kamaji control plane + KubeVirt node groups first, and only then does the end user get their one click. The provisioning cost isn't eliminated, it's relocated to the parent/admin. That makes the rejection of "expose a managed Kubernetes and let them install the app" weaker than stated — the ComputePlane is then "little more than a template": pre-baked secure defaults plus routing glue over the same managed-Kubernetes substrate. That's a fine thing to be, but the honest differentiators are (i) secure defaults the tenant can't misconfigure and (ii) placement-routing that auto-targets the right cluster — not the click count. I'd soften the 1-click contrast and lead with those.

The target generalizes. The routing mechanism (HelmRelease with kubeConfig.secretRef → some cluster) is fully generic; nothing requires the target to be the special hidden cluster. A tenant who already runs a managed kind: Kubernetes cluster (their own GPU cluster, say) could host their own JupyterHub there rather than being forced to stand up a second cluster. So the natural shape is placement targeting a named cluster — which may be a hidden ComputePlane or a tenant-owned cluster the tenant explicitly chooses — rather than hard-coding "ComputePlane" as the only non-management target. This also dovetails with point #1: if the target is a tenant-owned cluster, visibility comes for free. Worth listing as a placement-target option. (Nit — flag, don't block.)

3. Inheriting a ComputePlane to subtenants re-creates the exact problem ComputePlanes exist to solve (objection)

§2, Open questions, and Phase 3 all propose that child tenants may reuse an ancestor's ComputePlane, "matching the existing service-inheritance direction" (ingress/monitoring/etcd). The inheritance walk is encoded concretely: §2 "the chain walks up to the parent," Failure cases "use the nearest enabling ancestor." I think this is wrong and should be cut from the architecture, not merely "blocked in iteration 1, open later."

The premise of the whole proposal (Overview, Security) is that untrusted code must not share an isolation domain with what it could escalate toward. The management cluster is effectively the root ComputePlane — the place managed workloads run, which ComputePlane exists to keep untrusted code off of. Now take parent tenant P with CP_P, and child C whose untrusted workloads run on CP_P. C's untrusted code now shares a kube-API and host kernel with P's workloads. From P's vantage, C is exactly what the tenant was to the management cluster: an untrusted party whose container escape inside CP_P lands in P's compute environment — escape → host root on CP_P → P's workloads and whatever creds CP_P holds. That is the identical attack chain the proposal opens with, pushed one level down the tenant tree, not solved.

The service-inheritance analogy is precisely the flaw. ingress/monitoring/etcd are shared infrastructure services that sit at a trust level above their consumers — the tenant trusts its ingress. A ComputePlane is the inverse: a containment vessel for code its owner does not trust, sitting below the owner. You inherit a shared service; you do not share a containment boundary between mutually-distrusting parties (P and C need not trust each other) — that's just removing the containment. Each tenant running untrusted code needs its own ComputePlane, the same way each tenant gets its own isolation from the management cluster.

Suggested resolution: state explicitly that a ComputePlane serves exactly one tenant, full stop; a child that wants untrusted compute provisions its own. Remove the parent-walk from §2 and Failure cases — a placement: ComputePlane app in a tenant with no ComputePlane of its own should reject, not climb to an ancestor's. Drop the Phase 3 "allow child tenants to reuse an ancestor's ComputePlane."

One distinction worth preserving so this doesn't read as anti-efficiency: sharing the physical node pool / capacity across tenants (the "managed RDS runs many DBs on shared instance types" analogy in Open questions) is fine — that's infrastructure-layer resource pooling. Sharing a ComputePlane (a kube cluster / isolation domain) is not. The doc currently conflates these; only the former is safe, and saying so explicitly would strengthen the section.

4. No story for ComputePlane workloads reaching tenant-namespace services — and that's the whole point of the apps (omission)

The doc defers "credential propagation" (how a managed Postgres connection secret reaches a ComputePlane workload) to Open questions / Non-goals. But that's the lesser half of the problem and it's mis-framed as a secrets-plumbing issue. The harder half is network reachability, and it's in direct tension with the security model.

The headline workloads — an LLM, a notebook, an n8n flow — are close to useless in isolation; their entire value is connecting to the tenant's data: "my Jupyter notebook wants to talk to my Postgres." But the tenant's managed Postgres runs in the tenant namespace on the management/infra cluster. So "let my notebook reach my database" means opening a path from the ComputePlane into the management cluster — the exact thing Security §2 forbids ("ComputePlane pods are denied egress to the management/infra kube-apiserver," management→ComputePlane is the only allowed direction). A database connection is a ComputePlane→management flow, and there is no design for it. The apps that justify the feature don't function as specified.

The good news — and the reason this is an omission rather than a dead end — is that the connectivity primitive already exists in the design space and the doc cites it. PR #7 (cross-cluster-tenant-mesh) builds exactly a cross-cluster data-plane path (Kilo mesh-granularity=cross, TenantMeshLink CRD) that lets a managed tenant cluster reach host services (Ceph) while still denying host API access — one-way trust, no host-cluster API. A ComputePlane is a managed Kubernetes cluster, so mechanically the same link could carry ComputePlane→tenant-Postgres. The threat model just isn't the same: PR #7's consumer is a tenant-trusted cluster reaching shared infra; the ComputePlane's consumer is untrusted code, so a wide-open node-to-node mesh would hand untrusted workloads broad reach into the infra network. The scoping is the design work that's missing.

The key distinction the doc should make explicit: the threat is "no access to the management kube-API / no creds to escalate," not "no packets ever." Those are different planes. You can allow ComputePlane → a specific service endpoint (the Postgres pod IP:5432) while still denying ComputePlane → kube-apiserver — and the CiliumNetworkPolicy machinery the doc already cites (§Context, allow-to-apiserver) expresses precisely that shape.

Suggested resolution: promote this from a one-line Open question to a real "Connectivity to tenant services" section in the Design. It should (a) draw the kube-API-access vs. data-plane-reachability distinction so the isolation guarantee is stated precisely, (b) sketch the brokered path — narrowly-scoped per-service egress (reusing/constraining the PR #7 mesh, or an outbound mirror of the §4 ingress proxy), with who authorizes each endpoint and how the policy is generated — and (c) reconcile it with Security §2, which currently reads as a blanket denial. Every such hole is a path back toward infra, so it needs to be narrow and audited by construction, which is exactly why it deserves design rather than deferral.


Summary

| # | Severity | Ask |
|---|---|---|
| 3 | Objection | Make ComputePlanes single-tenant; remove parent-walk inheritance and the Phase 3 cross-tenant reuse. Allow shared node pools, never a shared cluster. |
| 4 | Omission | Add a "Connectivity to tenant services" design section; distinguish kube-API access from data-plane reachability; reconcile with Security §2. The cited PR #7 mesh is the building block. |
| 1 | Open question | Decouple "no creds back to management" (keep) from "tenant can't see it" (reconsider; offer scoped read access, or justify opacity via tamper-resistance and design the debug path). |
| 2 | Nit | Soften the 1-click contrast; consider letting placement target any named cluster, not only a hidden ComputePlane. |

Net: the mechanism is right and the proposal is close. №3 and №4 are the two I'd want resolved before this is implementable — one removes a security regression hiding inside a convenient analogy, the other fills in the connectivity half of the design without which the motivating apps don't work.

@kvaps

Member Author

Thanks for the thorough review — agree the mechanism is right, and these are the right things to push on. Point by point.

#3 (single-tenant) — accepted. You're right that inheriting a ComputePlane to a child re-creates the exact escalation one level down, and that "block now, unblock later" is itself the hole. Making it single-tenant by design: removing the parent-walk from §2 and the Failure-cases inheritance — a placement: ComputePlane app in a tenant with no ComputePlane of its own rejects, it does not climb to an ancestor — and dropping the Phase 3 cross-tenant reuse. I'll keep your distinction explicit: sharing the physical node pool / capacity across tenants is fine (infra-layer pooling); sharing a ComputePlane (a cluster / isolation domain) is not.

#4 (connectivity): agree it needs a real section; I don't think it needs a mesh. The path physically already exists: tenant-cluster worker nodes are KubeVirt VMs on the management Cilium pod network (`packages/apps/kubernetes/templates/cluster.yaml`: `networks: - name: default; pod: {}`), so there is L3 adjacency by default, gated by CiliumNetworkPolicy. So the kube-API-access vs data-plane-reachability distinction you draw is exactly the framing, and this is a NetworkPolicy-scoping problem rather than a new mesh: allow ComputePlane → a specific service endpoint (the tenant's Postgres) while denying ComputePlane → kube-apiserver, the same shape as the existing policy.cozystack.io/allow-to-apiserver (packages/apps/tenant/templates/networkpolicy.yaml). For exposure we reuse what the managed kubernetes app already ships: exposeMethod: Proxied (management ingress → tenant NodePort) and kubevirt-ccm provisioning Service type: LoadBalancer from the management cluster, plus the kubevirt-csi storage path. Per-service egress is also narrower by construction than a node-to-node mesh, which addresses the scoping worry. I'll promote this to a "Connectivity to tenant services" section: draw the API-vs-data-plane line, specify the scoped per-service egress (who authorizes each endpoint, how the policy is generated), and reconcile Security §2, which currently reads as a blanket denial.

#1 (visibility vs security) — decoupling accepted. The load-bearing property is "no creds / no path back to management"; visibility is a separable UX / tamper-resistance concern, not a security guarantee, and I'll stop presenting invisibility as one. Today Kamaji already provisions an admin kubeconfig held by cluster-admins (not tenants), so operator-side debugging exists; a tenant-facing scoped read / observability path is a worthwhile extension and I'll record it as such rather than baking full opacity in.

#2 (placement target) — recording as an option. Letting placement target a named cluster — including a tenant's own existing managed Kubernetes cluster — is appealing and gives visibility for free. Caveat: on their own cluster the tenant holds full admin, so we can't guarantee SLA / automate updates / keep it truly managed. The alternative is a spec.managedDataplane: true mode on the Kubernetes app where Cozystack withholds the admin kubeconfig (effectively ComputePlane as a mode of the managed-Kubernetes app). Both have trade-offs, and either way the dependencies provisioned into such a cluster (ingress-nginx, cert-manager, …) need defining. I'll capture these as alternatives rather than hard-coding ComputePlane as the only non-management target.

I'll revise the proposal along these lines (and the cozyllm-specific doc inherits the single-tenant fix). Thanks again.

…ign, connectivity to tenant services, decouple visibility from the security boundary

Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
@kvaps

Member Author

Pushed a revision (bcd9abb) addressing the review:

  • #3 (single-tenant): now single-tenant by design, not a temporary block. Removed the parent-walk from §2 and the Failure-cases inheritance (a placement: ComputePlane app in a tenant without its own ComputePlane now rejects, it never climbs to an ancestor), dropped the Phase 3 cross-tenant reuse, and stated the distinction explicitly: sharing the node pool / capacity across tenants is fine, sharing a ComputePlane (isolation domain) is not.
  • #4 (connectivity): new Design §5 "Connectivity to tenant services". No mesh needed: ComputePlane nodes already sit on the management Cilium pod network (`packages/apps/kubernetes/templates/cluster.yaml`: `networks: pod: {}`), so it's a scoped per-service CiliumNetworkPolicy: allow ComputePlane → the tenant's Postgres endpoint, deny ComputePlane → kube-apiserver (the same shape as allow-to-apiserver), with outward exposure via exposeMethod: Proxied / kubevirt-ccm. Security §2 is reworded to make the kube-API-plane vs data-plane distinction explicit, so it no longer reads as a blanket egress denial.
  • #1 (visibility): decoupled from the security boundary. New "Visibility vs. the security boundary" subsection: the guarantee is "no creds / no kube-API path back to management"; what the platform withholds is admin/write (tamper-resistance), not visibility. Cluster-admins hold the Kamaji admin kubeconfig for debugging today, and a tenant-facing scoped read/observability view is an explicitly-allowed extension rather than baked-in opacity.
  • #2 (placement target): recorded as an open option rather than hard-coding a dedicated ComputePlane: placement could name a target cluster, either a tenant's own managed Kubernetes (visibility for free, but they hold full admin → we can't guarantee SLA / managed updates) or a spec.managedDataplane: true mode that withholds the admin kubeconfig to recover the managed guarantees.

Also folded in the inline nits: spec.install.createNamespace: true on the generated HelmRelease, a finalizer on the ComputePlane resource to enforce delete-ordering, and the management-ingress → KubeVirt-VM data-path specifics in §4.

The connectivity-authorization details (#4) and the placement-target choice (#2) are left as Open questions. Re-review welcome — thanks again.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@design-proposals/compute-plane/README.md`:
- Around line 149-160: Section 5 only defines network reachability; it still
leaves the ComputePlane-to-tenant-service credential flow unresolved, which
makes the rollout promise incomplete. Update the proposal around the
“Connectivity to tenant services” section to explicitly define how a workload
gets the managed Postgres secret/connection string, using the same terminology
and objects already named there (ComputePlane workloads, tenant service,
per-service CiliumNetworkPolicy, `exposeMethod: Proxied` if relevant). If that
mechanism is not ready, move the Phase 1 rollout commitment behind the
credential-delivery design instead of presenting it as solved.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41068252-fe85-427a-ab0f-649c845bc159

📥 Commits

Reviewing files that changed from the base of the PR and between abb40d0 and bcd9abb.

📒 Files selected for processing (1)
  • design-proposals/compute-plane/README.md

Comment on lines +149 to +160
### 5. Connectivity to tenant services

The headline workloads are nearly useless in isolation — a notebook, an LLM, an n8n flow exist to reach the tenant's data ("my Jupyter notebook talks to my managed Postgres"). But the tenant's managed Postgres runs as a Service in the tenant namespace **on the management cluster**, so reaching it is a ComputePlane→management flow — the direction Security §2 otherwise restricts. The resolution is to be precise about *which* plane is restricted.

**The guarantee is "no kube-API access / no creds to escalate," not "no packets ever."** Those are different planes. A database connection (ComputePlane → `postgres-pod:5432`) is a *data-plane* flow and can be allowed while ComputePlane → kube-apiserver stays denied.

**No mesh is required.** ComputePlane worker nodes are KubeVirt VMs attached to the management cluster's Cilium pod network (`packages/apps/kubernetes/templates/cluster.yaml` → `networks: - name: default; pod: {}`), so L3 adjacency already exists and is gated by `CiliumNetworkPolicy`. Connectivity is therefore a **scoping** problem, expressed with machinery already in the platform:

- **Egress to a tenant service** is granted by a narrow, per-service `CiliumNetworkPolicy`: allow ComputePlane workloads → a specific endpoint (the tenant's Postgres Service), deny ComputePlane → kube-apiserver. This is the same shape as the existing `policy.cozystack.io/allow-to-apiserver` label policy, pointed at a data-plane endpoint instead of the API. Per-service egress is narrower by construction than a node-to-node mesh, so untrusted workloads never get broad reach into the infra network.
- **Exposing a ComputePlane workload outward** reuses what the managed `kubernetes` app already ships (Design §4): `exposeMethod: Proxied` and kubevirt-ccm `Service type: LoadBalancer`; persistent storage uses the `kubevirt-csi-driver` path.

Open: who authorizes each tenant-service endpoint and how the per-service policy is generated (a tenant-scoped allowlist vs. an explicit "expose this service to my ComputePlane" action). The remaining secret-delivery half — getting the Postgres connection string into the workload — is tracked under Open questions; the **network** half is solved by the scoped policy above, and every such opening is narrow and audited by construction.
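
The per-service egress rule described in the first bullet above can be sketched as a single namespaced policy. This is an illustration under stated assumptions, not the proposal's actual manifest: the policy name, namespace, and both label selectors are hypothetical, and it assumes the policy selects the virt-launcher pods that back the ComputePlane worker VMs, since those are the endpoints Cilium sees on the management cluster.

```yaml
# Sketch only: names and labels are hypothetical placeholders.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: computeplane-to-postgres          # hypothetical
  namespace: tenant-example               # hypothetical tenant namespace
spec:
  # Source: pods backing the ComputePlane worker nodes.
  endpointSelector:
    matchLabels:
      example.cozystack.io/compute-plane: "default"     # hypothetical label
  egress:
    # Data-plane allowance: one tenant service, one port.
    # toEndpoints in a namespaced policy matches pods in the same namespace.
    - toEndpoints:
        - matchLabels:
            example.cozystack.io/service: postgres-example   # hypothetical label
      toPorts:
        - ports:
            - port: "5432"
              protocol: TCP
  # Kube-API plane stays closed; deny rules take precedence over allow rules.
  egressDeny:
    - toEntities:
        - kube-apiserver
```

This covers only the network half. It does not deliver the Postgres credentials to the workload, which the section leaves open.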

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Define the credential-delivery path, not just the network path.

Section 5 solves reachability, but the proposal still defers how a ComputePlane workload gets the managed-service secret it needs. That makes the motivating workloads incomplete, and Phase 1 is currently promising a capability whose authz/plumbing model is still open. Please either specify that mechanism here or move the rollout commitment behind it.

Also applies to: 216-228

@lllamnyp

Member

Thanks for the revision (bcd9abb) — the single-tenant-by-design fix (#3), the new Connectivity-to-tenant-services section (#4), and decoupling visibility from the security boundary (#1) all landed well, and the no-mesh framing is right: ComputePlane nodes already share the management Cilium pod network, so it's a scoped-policy problem, not a Kilo mesh.

I want to be clear up front about the verdict: I'm happy to see this built. ComputePlane is an excellent UX / managed-service offering — one-click managed deployment of code-executing apps that go beyond the existing platform catalog, with sane defaults the tenant doesn't have to assemble, on a cluster the operator keeps managed and can bill for. That's a real and worthwhile product.

My remaining objection is narrow but I think it's important: the Security section claims an isolation boundary that ComputePlane does not actually provide. It should be rewritten so the doc doesn't sell security where there isn't any.

The six guarantees are inherited from the managed-Kubernetes substrate, not provided by ComputePlane

From the management plane's point of view, a ComputePlane and a regular managed kind: Kubernetes cluster are the same object: Kamaji control plane + KubeVirt-VM workers + operator-held super-admin + the tenant-namespace network policy. Every Security guarantee is a property of that substrate:

| # | Guarantee | Actually provided by |
|---|-----------|----------------------|
| 1 | No mgmt creds in the cluster | Any managed cluster: the kubeConfig lives management-side, nothing is copied in |
| 2 | No kube-API path to mgmt | The existing <tenant>-egress CiliumClusterwideNetworkPolicy, which already selects every pod in the tenant namespace, including the KubeVirt virt-launcher pods that are the cluster's nodes |
| 3 | VM boundary | KubeVirt (identical substrate) |
| 4 | Separate identity domain | Kamaji (identical) |
| 5 | Single-tenant | A regular kind: Kubernetes cluster is single-tenant too |
| 6 | No new tenant input to mgmt | The apps.cozystack.io/* write path (unchanged) |

None of this is new to ComputePlane. The doc enumerates the managed-Kubernetes trust model and credits it to ComputePlane.

The one ComputePlane-specific control provides no platform protection

The single thing a ComputePlane has that a tenant-run cluster doesn't is tamper-proof hardening: the tenant isn't cluster-admin, so platform-applied PSA / network policy / admission can't be stripped. The doc presents this as protecting the platform from a malicious tenant. It can't — and not just because the VM boundary already contains the escape. It can't in principle, because the hardened venue is optional for the attacker:

A tenant who actually holds a management-hijacking payload doesn't run it in the hardened ComputePlane. They provision a regular managed kind: Kubernetes — same KubeVirt/Kamaji substrate, one click away, except they're admin and the hardening is absent — and run it there. Hardening present in one of two attacker-reachable venues, where the attacker picks the venue, contributes zero platform protection.

The motivating threat ("container escape → host root → management API → every tenant's secrets") is a property of the substrate, which both venues share, so the fork is:

  • If that chain is real, ComputePlane doesn't close it — the regular managed cluster is already the open door.
  • If the substrate contains it (hypervisor + the <tenant>-egress policy that already governs the VM pods), it was already contained on a vanilla managed cluster, and there was never a platform-safety reason for a hardened venue.

Either way the hardening adds nothing to platform safety. It would only protect the platform if tenants were also forbidden from provisioning their own unhardened managed clusters — which isn't the case and isn't proposed (that would be a far bigger change, and is the actual lever if the threat is believed).

So the hardening's only coherent security scope is intra-cluster: protecting the tenant from their own app's users (JupyterHub students, LLM-generated code), who are confined to whichever venue the tenant deployed. As a platform / multi-tenant boundary it isn't merely redundant — it's circular: it only "works" against an attacker who has agreed to attack from inside the box you hardened.

Suggested edits

  1. Reattribute guarantees #1–#6 to the managed-Kubernetes substrate ("ComputePlane inherits the existing managed-Kubernetes isolation: …") rather than presenting them as ComputePlane's own contribution.
  2. Scope the tamper-resistance bullet explicitly to protecting the tenant from their own app's users / their own misconfiguration, with one sentence stating why it is not a platform boundary (a tenant can always run the same payload on an unhardened same-substrate managed cluster). Drop any implication that it isolates one tenant's code from another's or from the management plane.
  3. Reframe the Overview from "closing a latent isolation gap" to what it actually is: managed UX + tamper-proof secure defaults (tenant-scoped) + operator-retained management. The isolation boundary is the pre-existing managed-Kubernetes + VM + tenant-egress model; ComputePlane is a curated, locked-down, operator-managed mode of it — which is genuinely useful and worth shipping, just not a new security boundary.

This connects to the placement-target open question, which already floats managedDataplane: true as a mode of kind: Kubernetes. That's arguably the honest shape of the whole feature: ComputePlane ≈ kind: Kubernetes + managedDataplane: true + a hardened profile + placement routing. Whether it needs to be a distinct kind at all is a fair question once the security framing is corrected — though there's a reasonable UX case for the separate kind regardless.
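
To make that equivalence concrete, here is a purely hypothetical sketch of how the four pieces could look as manifests. Only the names managedDataplane, placement, ComputePlane, and kind: Kubernetes come from this thread; the API versions, the hardeningProfile field, the field positions, and all resource names are assumptions made for illustration.

```yaml
# Hypothetical shapes only; not a confirmed schema.
apiVersion: apps.cozystack.io/v1alpha1    # assumed API version
kind: Kubernetes
metadata:
  name: compute-example                   # hypothetical
  namespace: tenant-example               # hypothetical
spec:
  # Floated in the placement-target open question:
  # the operator withholds the admin kubeconfig from the tenant.
  managedDataplane: true
  # Hypothetical field standing in for "a hardened profile"
  # (PSA, network policy, admission applied by the platform).
  hardeningProfile: restricted
---
# Placement routing: the application type declares where it runs.
apiVersion: cozystack.io/v1alpha1         # assumed API group and version
kind: ApplicationDefinition
metadata:
  name: notebook-example                  # hypothetical
spec:
  # Enum proposed in the PR description; ManagementPlane is the default.
  placement: ComputePlane
```

Written this way, the separate kind becomes a UX question (one object versus a flag plus a profile) rather than a security one.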

To be explicit: none of this blocks building it. I just want the design doc to describe the value as UX / managed-service rather than as an isolation guarantee the substrate already provides and the one new control can't enforce.
