AIR CLI Integration: Adding support for air run configuration#5657
AIR CLI Integration: Adding support for air run configuration#5657riddhibhagwat-db wants to merge 1 commit into
Conversation
Waiting for approvalCould not determine reviewers from git history. Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules. |
Integration test reportCommit: 226d41a
21 interesting tests: 13 SKIP, 8 RECOVERED
Top 20 slowest tests (at least 2 minutes):
|
73088e7 to
373988d
Compare
Port the run YAML schema and its structural validation from the Python CLI's sdk/config.py: the top-level runConfig plus the environment, docker_image, code_source/snapshot/git, and permission blocks. loadRunConfig decodes a YAML file with KnownFields (mirroring pydantic extra="forbid") and runs the validation pass. "Structural" covers types, required fields, and format/cross-field rules that need no workspace access. Online checks (compute pool resolution, GPU availability), git/filesystem checks, _bases_ composition, and CLI --override handling are deferred to later milestones. Two deliberate divergences from the Python schema, both following from the training-service-only port: the compute pool fields were already dropped, and the top-level priority field is dropped here since it is a node-pool queue-ordering knob with no meaning for serverless workloads. Co-authored-by: Isaac
373988d to
226d41a
Compare
|
|
||
| // validate runs structural validation over the whole config, returning the first | ||
| // failure. Fields are checked in declaration order to keep error output stable. | ||
| func (c *runConfig) validate() error { |
There was a problem hiding this comment.
| DockerImage *dockerImageConfig `yaml:"docker_image"` | ||
| } | ||
|
|
||
| func (e *environmentConfig) validate() error { |
There was a problem hiding this comment.
| URL string `yaml:"url"` | ||
| } | ||
|
|
||
| func (d *dockerImageConfig) validate() error { |
There was a problem hiding this comment.
| Snapshot *snapshotSourceConfig `yaml:"snapshot"` | ||
| } | ||
|
|
||
| func (c *codeSourceConfig) validate() error { |
There was a problem hiding this comment.
| IncludePaths []string `yaml:"include_paths"` | ||
| } | ||
|
|
||
| func (s *snapshotSourceConfig) validate() error { |
There was a problem hiding this comment.
| Remote gitRemote `yaml:"remote"` | ||
| } | ||
|
|
||
| func (g *gitRef) validate() error { |
There was a problem hiding this comment.
| Level string `yaml:"level"` | ||
| } | ||
|
|
||
| func (p *permission) validate() error { |
There was a problem hiding this comment.
| "go.yaml.in/yaml/v3" | ||
| ) | ||
|
|
||
| // loadRunConfig reads a run YAML config file, decodes it into the schema, and |
There was a problem hiding this comment.
| assert.Equal(t, 1, cfg.Compute.NumAccelerators) | ||
| } | ||
|
|
||
| func TestLoadRunConfig_FullFeatured(t *testing.T) { |
There was a problem hiding this comment.
vinchenzo-db
left a comment
There was a problem hiding this comment.
maybe a question for @maggiewang-db or @ben-hansen-db, is there a way to make sure this run config doesn't diverge from the run config in the python cli?
| // | ||
| // The `_bases_` composition feature and CLI `--override` handling are not yet | ||
| // ported; a config using `_bases_` is currently rejected as an unknown field. | ||
| func loadRunConfig(path string) (*runConfig, error) { |
There was a problem hiding this comment.
I would split the decode step and the validate step here. Technically, decodeRunConfig + validateRunConfig makes more sense here (AIP-wise)
| type gitRef struct { | ||
| Branch *string `yaml:"branch"` | ||
| Commit *string `yaml:"commit"` | ||
| Remote gitRemote `yaml:"remote"` |
There was a problem hiding this comment.
I'm not very good at Go but is the gitRemote flag used anywhere outside of tests?
| EnvVariables map[string]string `yaml:"env_variables"` | ||
| Secrets map[string]string `yaml:"secrets"` |
There was a problem hiding this comment.
Make sure you check that these aren't pointing to the same env variable.... and if so I imagine EnvVariable should take precedence, but ask @maggiewang-db
Reasoning is that if user accidentally sets env var and secret, and doesn't know which one, if secret takes precedence and they try to read it they make accidentally leak their secret.
| MaxRetries *int `yaml:"max_retries"` | ||
| TimeoutMinutes *int `yaml:"timeout_minutes"` | ||
| IdempotencyToken *string `yaml:"idempotency_token"` | ||
| Parameters map[string]any `yaml:"parameters"` |
There was a problem hiding this comment.
Are you sure map[string]any is best here? Maybe some validation to make sure people aren't injecting bad inputs.
| // validateExperimentName enforces the Databricks Jobs API task_key constraints: | ||
| // the experiment_name becomes a task key, which caps at 100 characters and allows | ||
| // only alphanumerics, hyphens, and underscores. |
There was a problem hiding this comment.
do we need to do something similar with mlflow_run_name?
Changes
Ports the air run YAML config schema and its structural validation from the Python CLI (cli/sdk/config.py) to Go, under experimental/air/cmd/.
Two deliberate divergences from the Python schema, both following from the training-service-only port:
Why
"Structural" validation (types, required fields, format/cross-field rules) needs no workspace access, so it's a self-contained, fully unit-testable unit that's worth landing on its own ahead of the launch logic. Splitting it out keeps the upcoming handle_run PR focused on orchestration rather than mixing in ~900 lines of schema.
The extra="forbid" / KnownFields behavior is load-bearing: it's what turns a typo'd or stale config key into an actionable error instead of a silently-ignored field, so it's preserved faithfully. This is stacked on air-integration-m2-1 (the compute model).
Tests
New unit tests in runconfig_test.go (62 subtests, table-driven), covering:
go test ./experimental/air/... passes; ./task lint-q reports 0 issues.