Follow-up from #713. Not a trust-model gap — verification is fail-closed (KDS throttling denies release, never forges one) — but it's an availability foot-gun, especially now that --platform auto can land on SNP on AMD hosts.
What's wrong
sev-snp-qvl/src/lib.rs fetches AMD KDS collateral (cert chain + VCEK) with:
- reqwest::blocking::Client::new() per request (lib.rs:374, lib.rs:395): a fresh client every call, from inside an async verification path.
- no request timeout: a hung or throttling KDS (HTTP 429 is documented on lab hosts) stalls verification with no bound.
- no caching — every verification re-fetches the same per-product cert chain and per-(chip_id, TCB) VCEK.
What to do
- use an async HTTP client (or run the blocking fetch on a dedicated pool), reusing one client.
- set explicit connect + request timeouts.
- cache collateral by (product, chip_id, reported_tcb); cert chains are per-product and long-lived, VCEKs are stable per (chip, TCB).
- keep collateral validation fail-closed; the pinned ARK (builtin_ark()) stays the trust root regardless of what KDS returns.
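A minimal sketch of the caching item, using only the standard library so it stays independent of whichever HTTP client is chosen. The names `CollateralKey`, `Collateral`, `CollateralCache`, and `get_or_fetch` are hypothetical and do not exist in sev-snp-qvl today. The `fetch` closure stands in for the real KDS request, which is where a single shared client with explicit connect and request timeouts would live. Fetch errors are returned to the caller and never cached, so a throttled or hung KDS still denies verification.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Cache key proposed in this issue: (product, chip_id, reported_tcb).
/// Field types are assumptions made for illustration.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct CollateralKey {
    pub product: String,
    pub chip_id: Vec<u8>,
    pub reported_tcb: u64,
}

/// Hypothetical container for fetched collateral (raw certificate bytes).
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Collateral {
    pub cert_chain: Vec<u8>,
    pub vcek: Vec<u8>,
}

/// In-memory collateral cache that can be shared across verifications.
#[derive(Default)]
pub struct CollateralCache {
    entries: Mutex<HashMap<CollateralKey, Collateral>>,
}

impl CollateralCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the cached collateral for `key`, or calls `fetch` on a miss.
    /// A fetch error is propagated and nothing is stored, so the caller
    /// still fails closed when KDS is unavailable.
    pub fn get_or_fetch<E>(
        &self,
        key: &CollateralKey,
        fetch: impl FnOnce(&CollateralKey) -> Result<Collateral, E>,
    ) -> Result<Collateral, E> {
        // The lock guard is dropped at the end of this statement, so a slow
        // fetch below does not block lookups for other keys.
        // `unwrap` panics only if another thread panicked while holding the lock.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return Ok(hit);
        }

        let fetched = fetch(key)?;
        self.entries
            .lock()
            .unwrap()
            .insert(key.clone(), fetched.clone());
        Ok(fetched)
    }
}
```

The cache only saves the network round trip. Collateral returned from it would still need to go through the existing chain validation against the pinned ARK on every verification. Two concurrent misses on the same key can both call `fetch` in this sketch; de-duplicating in-flight requests is left out for brevity.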
The DSTACK_AMD_KDS_PROXY_URL / core.sev_snp.amd_kds_proxy_url mirror path already exists for throttled labs; this issue is about making the default path robust, not about the proxy.