docs: Document idle/unhealthy endpoint auto scale-down and image name validation #674
…ation

- Add idle endpoint scale-down policy (3d->2 workers+email, 7d->0) to endpoint settings
- Note long-term idle and unhealthy auto scale-down in worker states table
- Update Unhealthy worker state row to reflect auto scale-down behavior
- Add troubleshooting entry for unexpected endpoint scale-down
- Document image name validation at endpoint creation (runpod/*:latest rejected)
- Add June 2026 release note for automatic endpoint scale-down

Refs DOCS-452 (SLS-7, SLS-121, SLS-8, SLS-238)
| How long a worker stays active after completing a request before shutting down. You're billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds. | ||
| ### Idle endpoint scale-down |
Added the Idle endpoint scale-down subsection from DOCS-452 (SLS-7): endpoints idle for 3 days have max_workers auto-reduced to 2 with an email sent, and idle for 7 days have max_workers set to 0; this is system-driven and users must raise max workers manually to resume.
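The idle policy described in this comment can be sketched as a small function. This is only an illustration of the two thresholds stated in the PR (3 days and 7 days), not the platform's implementation: the function and constant names are hypothetical, and both the inclusive comparison and the choice never to raise an already lower max workers value are assumptions.

```python
from datetime import timedelta

# Thresholds from the PR description; all names here are hypothetical.
REDUCE_AFTER = timedelta(days=3)    # idle this long: max workers reduced to 2
DISABLE_AFTER = timedelta(days=7)   # idle this long: max workers set to 0
REDUCED_MAX_WORKERS = 2


def max_workers_after_idle(idle_for: timedelta, max_workers: int) -> int:
    """Return the max workers value an endpoint would have after being idle.

    Assumption: thresholds are inclusive, and an endpoint whose max workers
    is already below 2 is not raised to 2 at the 3-day mark.
    """
    if idle_for >= DISABLE_AFTER:
        return 0
    if idle_for >= REDUCE_AFTER:
        return min(max_workers, REDUCED_MAX_WORKERS)
    return max_workers


if __name__ == "__main__":
    for days in (1, 3, 7):
        print(days, max_workers_after_idle(timedelta(days=days), 10))
```

Because the scale-down is system-driven, nothing in this sketch restores the original value: per the comment above, the user has to raise max workers manually to resume.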
| | **Throttled** | Temporarily unable to run due to host <MachineTooltip /> resource constraints | No | | ||
| | **Outdated** | Marked for replacement after update | Yes (while processing) | | ||
| | **Unhealthy** | Crashed; auto-retries for up to 7 days | No | | ||
| | **Unhealthy** | Crashed; auto-retries, then scales the endpoint down after repeated failures (email sent) | No | |
Updated the Unhealthy worker-state row per DOCS-452 (SLS-121): the platform now auto-scales an endpoint down after it consistently produces unhealthy workers (to stop billing and reduce thrashing) and sends an email, replacing the prior "auto-retries for up to 7 days" description.
@Promptless Don't replace the text in the table. Add as a note after the table "automatically scales down an endpoint that is consistently producing unhealthy workers, to stop billing and reduce thrashing. "
| ## Endpoint scaling issues | ||
| ### My endpoint was scaled down unexpectedly |
Added the My endpoint was scaled down unexpectedly troubleshooting entry covering both DOCS-452 scale-down causes (SLS-7 prolonged inactivity and SLS-121 repeated unhealthy workers), each of which triggers an email, and how to resume the endpoint.
| * Document the specific image version or SHA in your deployment documentation. | ||
| * Keep images as small as possible for faster startup times. | ||
| Runpod validates your image name when you create an endpoint, and a reference that doesn't resolve to a published image is rejected at creation time. For example, `runpod/pytorch:latest` is rejected because `:latest` is not a published tag for Runpod's base images. When you deploy a Runpod base image, specify a tag that actually exists. You can browse the available tags for each image on [Docker Hub](https://hub.docker.com/u/runpod). For your own worker images, use a specific version or SHA tag as described above rather than `:latest`. |
Documented image-name validation per DOCS-452 (SLS-8): Layer 2 image verification flipped from observe to enforce, so image names are now validated at endpoint creation and runpod/pytorch:latest (and other runpod/*:latest references) are rejected because :latest is not a published tag for these base images.
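A local pre-check for the one pattern this PR names explicitly (`runpod/*:latest`) might look like the sketch below. It is not Runpod's validator, which per the text above checks whether the reference resolves to a published image; the helper names are hypothetical, and the check only covers references written with the bare `runpod/` prefix (no registry host).

```python
from typing import Optional, Tuple


def split_image_reference(image: str) -> Tuple[str, Optional[str]]:
    """Split 'repository[:tag][@digest]' into (repository, tag).

    The tag is None when the reference has no explicit tag.
    """
    name = image.split("@", 1)[0]       # drop any '@sha256:...' digest
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    # A colon after the last slash separates the tag; an earlier colon
    # belongs to a registry host and port such as 'localhost:5000'.
    if colon > last_slash:
        return name[:colon], name[colon + 1:]
    return name, None


def is_runpod_latest(image: str) -> bool:
    """True for references of the form 'runpod/<name>:latest'."""
    repository, tag = split_image_reference(image)
    return repository.startswith("runpod/") and tag == "latest"


if __name__ == "__main__":
    # 'example/worker:v1.2.0' is a made-up reference for illustration.
    for ref in ("runpod/pytorch:latest", "example/worker:v1.2.0"):
        print(ref, is_runpod_latest(ref))
```

A check like this only catches the documented `:latest` case early; any other tag still has to exist on Docker Hub to pass validation at endpoint creation.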
Revert the table-text replacement for the Unhealthy worker state and keep the unhealthy scale-down behavior described in the note after the table.
| - **Repeated unhealthy workers**: An endpoint that consistently produces unhealthy workers is scaled down automatically to stop billing and reduce thrashing, and Runpod sends you an email. See [troubleshooting](/serverless/troubleshooting#my-endpoint-was-scaled-down-unexpectedly). | ||
| </Update> | ||
@Promptless Don't add anything to the release notes for now
Reviewer requested no changelog entry for now; keep all other Serverless documentation updates intact.
Thanks for the review! I've updated the suggestion on both points:
The PR has been updated with these changes.
Documents three recent Serverless platform changes from DOCS-452: (1) idle endpoint auto-scaling lifecycle (3 days → 2 workers + email, 7 days → 0 workers); (2) automatic scale-down on repeated unhealthy workers; (3) image name validation now enforced at endpoint creation (runpod/*:latest rejected). Updates endpoint settings, worker states, troubleshooting, deploy, and the changelog.