docs: Document idle/unhealthy endpoint auto scale-down and image name validation #674

Open
promptless[bot] wants to merge 5 commits into main from promptless/serverless-idle-unhealthy-scaling-image-validation

promptless Bot commented Jun 24, 2026

Contributor

Documents three recent Serverless platform changes from DOCS-452: (1) idle endpoint auto-scaling lifecycle (3 days → 2 workers + email, 7 days → 0 workers); (2) automatic scale-down on repeated unhealthy workers; (3) image name validation now enforced at endpoint creation (runpod/*:latest rejected). Updates endpoint settings, worker states, troubleshooting, deploy, and the changelog.

…ation

- Add idle endpoint scale-down policy (3d->2 workers+email, 7d->0) to endpoint settings
- Note long-term idle and unhealthy auto scale-down in worker states table
- Update Unhealthy worker state row to reflect auto scale-down behavior
- Add troubleshooting entry for unexpected endpoint scale-down
- Document image name validation at endpoint creation (runpod/*:latest rejected)
- Add June 2026 release note for automatic endpoint scale-down

Refs DOCS-452 (SLS-7, SLS-121, SLS-8, SLS-238)

How long a worker stays active after completing a request before shutting down. You're billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.

### Idle endpoint scale-down

Contributor Author

Added the "Idle endpoint scale-down" subsection from DOCS-452 (SLS-7). Endpoints idle for 3 days have `max_workers` automatically reduced to 2 and an email is sent; endpoints idle for 7 days have `max_workers` set to 0. The change is system-driven, and users must raise max workers manually to resume.

Source: https://linear.app/runpod/issue/DOCS-452/fip-document-idle-endpoint-lifecycle-unhealthy-worker-auto-scale-down
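
For readers following this thread, the idle lifecycle described above can be sketched as a small decision function. This is an illustration only, not Runpod's implementation: the function name, the constants, and the choice to never raise an endpoint that is already below 2 workers are assumptions. Only the 3-day and 7-day thresholds and the resulting worker counts come from the comment above.

```python
# Hypothetical sketch of the idle scale-down policy described in this PR.
# Thresholds (3 days, 7 days) and targets (2 workers, 0 workers) come from
# the PR text; the names and the min() behavior are assumptions.
REDUCE_AFTER_DAYS = 3
DISABLE_AFTER_DAYS = 7
REDUCED_MAX_WORKERS = 2


def max_workers_after_idle(idle_days: float, current_max_workers: int) -> int:
    """Return the max_workers value an endpoint would end up with."""
    if idle_days >= DISABLE_AFTER_DAYS:
        # Idle for 7 days or more: the endpoint is scaled to zero.
        return 0
    if idle_days >= REDUCE_AFTER_DAYS:
        # Idle for 3 days or more: reduced to 2 (assumed: never raised).
        return min(current_max_workers, REDUCED_MAX_WORKERS)
    # Not idle long enough: the setting is left alone.
    return current_max_workers
```

Under these assumptions, an endpoint configured with 5 max workers would drop to 2 after three idle days and to 0 after seven, and stays there until the user raises the setting manually.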

Comment thread serverless/workers/overview.mdx Outdated
| **Throttled** | Temporarily unable to run due to host <MachineTooltip /> resource constraints | No |
| **Outdated** | Marked for replacement after update | Yes (while processing) |
- | **Unhealthy** | Crashed; auto-retries for up to 7 days | No |
+ | **Unhealthy** | Crashed; auto-retries, then scales the endpoint down after repeated failures (email sent) | No |

Contributor Author

Updated the Unhealthy worker-state row per DOCS-452 (SLS-121): the platform now auto-scales an endpoint down after it consistently produces unhealthy workers (to stop billing and reduce thrashing) and sends an email, replacing the prior "auto-retries for up to 7 days" description.

Source: https://linear.app/runpod/issue/DOCS-452/fip-document-idle-endpoint-lifecycle-unhealthy-worker-auto-scale-down

Contributor

@Promptless Don't replace the text in the table. Add as a note after the table "automatically scales down an endpoint that is consistently producing unhealthy workers, to stop billing and reduce thrashing. "


## Endpoint scaling issues

### My endpoint was scaled down unexpectedly

Contributor Author

Added the "My endpoint was scaled down unexpectedly" troubleshooting entry. It covers both DOCS-452 scale-down causes (SLS-7 prolonged inactivity and SLS-121 repeated unhealthy workers), each of which triggers an email, and explains how to resume the endpoint.

Source: https://linear.app/runpod/issue/DOCS-452/fip-document-idle-endpoint-lifecycle-unhealthy-worker-auto-scale-down

* Document the specific image version or SHA in your deployment documentation.
* Keep images as small as possible for faster startup times.

Runpod validates your image name when you create an endpoint, and a reference that doesn't resolve to a published image is rejected at creation time. For example, `runpod/pytorch:latest` is rejected because `:latest` is not a published tag for Runpod's base images. When you deploy a Runpod base image, specify a tag that actually exists. You can browse the available tags for each image on [Docker Hub](https://hub.docker.com/u/runpod). For your own worker images, use a specific version or SHA tag as described above rather than `:latest`.

Contributor Author

Documented image-name validation per DOCS-452 (SLS-8): Layer 2 image verification flipped from observe to enforce, so image names are now validated at endpoint creation and runpod/pytorch:latest (and other runpod/*:latest references) are rejected because :latest is not a published tag for these base images.

Source: https://linear.app/runpod/issue/DOCS-452/fip-document-idle-endpoint-lifecycle-unhealthy-worker-auto-scale-down
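
As a rough pre-flight check, the rule described above can be approximated locally before creating an endpoint. This is a sketch under stated assumptions, not the platform's validator: both function names are hypothetical, the parsing is simplified, and treating an untagged reference as `:latest` follows Docker's default rather than anything stated in this PR. It also only matches references that start with `runpod/`, so a registry-prefixed form would not be caught.

```python
from typing import Optional, Tuple


def split_image_reference(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split an image reference into (name, tag, digest). Simplified parser."""
    name, _, digest = ref.partition("@")
    # A tag colon can only appear in the last path segment; a colon earlier
    # in the reference belongs to a registry port such as localhost:5000.
    last_segment = name.rsplit("/", 1)[-1]
    tag = None
    if ":" in last_segment:
        name, _, tag = name.rpartition(":")
    return name, tag, digest or None


def is_runpod_latest(ref: str) -> bool:
    """True for runpod/* references that resolve to the :latest tag."""
    name, tag, digest = split_image_reference(ref)
    if digest is not None:
        # Pinned by digest, so the tag is irrelevant.
        return False
    # Assumption: a missing tag is treated as "latest", as Docker does.
    return name.startswith("runpod/") and (tag or "latest") == "latest"
```

A deploy script could call `is_runpod_latest("runpod/pytorch:latest")` and fail early with a message pointing to the published tags on Docker Hub, instead of waiting for endpoint creation to reject the reference.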

Revert the table-text replacement for the Unhealthy worker state and keep
the unhealthy scale-down behavior described in the note after the table.

mintlify Bot commented Jun 24, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| runpod-docs | 🟢 Ready | View Preview | Jun 24, 2026, 8:30 PM |

Comment thread release-notes.mdx Outdated
- **Repeated unhealthy workers**: An endpoint that consistently produces unhealthy workers is scaled down automatically to stop billing and reduce thrashing, and Runpod sends you an email. See [troubleshooting](/serverless/troubleshooting#my-endpoint-was-scaled-down-unexpectedly).

</Update>

Contributor

@Promptless Don't add anything to the release notes for now

Reviewer requested no changelog entry for now; keep all other Serverless
documentation updates intact.

promptless Bot commented Jun 24, 2026

Contributor Author

Thanks for the review! I've updated the suggestion on both points:

  • Unhealthy worker table: Restored the original table cell text and instead added a note after the table describing that the platform automatically scales down an endpoint that is consistently producing unhealthy workers, to stop billing and reduce thrashing.
  • Release notes: Removed the changelog/release-notes addition for now, per your request. The remaining documentation updates (endpoint settings, worker states, troubleshooting, and deploy) are unchanged.

The PR has been updated with these changes.
