Support CPMS test role in shiftstack-qa automation#9
Conversation
imatza-rh
left a comment
There was a problem hiding this comment.
Good port from the IR-plugin. Nice improvements: OpenStack resource cleanup in always, must-gather on test failure. Nit: PR description says "branch fallback" but prepare_openshift_tests.yml has no fallback - same as all other roles, works fine.
Please run ./gate.sh (ansible-lint + pre-commit inside the shiftstack-client container) - no CI checks are configured on GitHub PRs. yamllint passes locally, but ansible-lint needs the container.
| - post | ||
| - verification | ||
| - day2ops | ||
| - cpms_test |
There was a problem hiding this comment.
Is cpms_test needed here? This is a Jenkins job definition — the Zuul integration job uses osp_verification.yaml. The e2e-periodic rolls all 3 masters (5h ginkgo timeout) and has been a systemic timeout in Jenkins since 4.18.
There was a problem hiding this comment.
You're right, this is a Jenkins job definition and the e2e-periodic 5h timeout would be problematic here. Removed cpms_test from stages, cpms_replace_attrs from day2ops_procedures, and the cpms_replacements vars. The CPMS tests will only run via the Zuul integration job using osp_verification.yaml.
| - cpms_replacements.sg_name in item.security_groups | json_query('[*].name') | ||
| with_items: "{{ master_after }}" | ||
|
|
||
| rescue: |
There was a problem hiding this comment.
The day2ops wrapper run_procedure.yml already runs must-gather + records failure on rescue. This inner rescue duplicates that. See how moving-etcd-to-ephemeral.yml handles it — no inner rescue, relies on the wrapper. Consider removing this rescue block (keep the always — the restore logic is correct and needed).
There was a problem hiding this comment.
Agreed. run_procedure.yml already handles must-gather and failure recording in its rescue block. Removed the inner rescue to follow the same pattern as moving-etcd-to-ephemeral.yml. The always block with the CPMS restore logic is kept since that's procedure-specific cleanup that the wrapper can't handle.
Co-authored-by: Cursor <cursoragent@cursor.com>
…inner rescue Co-authored-by: Cursor <cursoragent@cursor.com>
…n on test failure
f88cbb5 to
cce04e6
Compare
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…or OCP binaries Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ror to eliminate file-cache dependency
Co-authored-by: Cursor <cursoragent@cursor.com>
…sts, add 900s timeout per extract attempt Co-authored-by: Cursor <cursoragent@cursor.com>
imatza-rh
left a comment
There was a problem hiding this comment.
Three concerns before merging:
1. Timing budget - The osp_verification Zuul job currently runs 7-8.6h against an 8h timeout (recent SUCCESS builds). The CPMS e2e-periodic target has a 5h ginkgo timeout and rolls all 3 masters. Adding both presubmit + periodic would push the total well past the timeout. Consider running only e2e-presubmit initially, or running CPMS tests in a separate periodic job. Has this been tested on a live environment? Duration numbers would help.
2. Suggest splitting the binary download rewrite - The tools_get_openshift_release rewrite (commits bda356e-8548bec, merged from the use-oc-adm-release-extract-for-binaries branch) is independent of the CPMS test role. It changes a shared role used by all job definitions and introduces a hard dependency on rhoso_kubeconfig for pull secret extraction - the old code just did an HTTP GET with no cluster access needed. Splitting into a separate PR would let the CPMS test merge independently and give the shared role change its own review cycle.
3. cpms_replace_attrs wiring - The procedure is added but not wired into any day2ops_procedures list (it was removed from 4.17_ovnkubernetes_ipi.yaml per previous review). The cpms_replacements variable definition was also removed in that commit. Worth adding defaults and wiring it in osp_verification.yaml if it's intended to run, or noting in the description that it'll be wired separately.
| dest: "{{ cpms_results_subdir }}/junit_e2e_{{ cpms_tests_type }}.xml" | ||
| remote_src: yes | ||
| mode: u=rw,g=rw,o=r | ||
| when: (cpms_results_subdir + '/' + cpms_test_junit_filename) is file |
There was a problem hiding this comment.
The is file Jinja2 test evaluates on the Ansible controller, not the remote host where the test ran. The codebase uses ansible.builtin.stat for remote file checks (e.g., prepare_rhcos_image.yml:47). This works today because deploy_installer_host: false (controller = target), but would silently skip post-processing on environments with a remote installer host.
Consider:
- name: Check if JUnit file exists
ansible.builtin.stat:
path: "{{ cpms_results_subdir }}/{{ cpms_test_junit_filename }}"
register: _cpms_junit_fileThen use when: _cpms_junit_file.stat.exists below.
| key_for_filtering_results: "cpms" | ||
| test_name: "{{ cpms_test_name }}-{{ cpms_tests_type }}" | ||
| results_dir: "{{ cpms_results_subdir }}" | ||
| when: (cpms_results_subdir + '/' + cpms_test_junit_filename) is file |
There was a problem hiding this comment.
Same is file concern as line 48 - would use the same _cpms_junit_file.stat.exists check.
…t_cpms_test_role_tj
Summary
cpms_teststage role that runs the upstreamcluster-control-plane-machine-set-operatore2e tests (presubmit + periodic) with branch fallback logic for newer OCP versionscpms_replace_attrsday2ops procedure ported from the openshift-ir-plugin, which validates CPMS reconciliation by patching failure domains, networks, and security groups on control plane nodesocp_testing.yamland enable it inosp_verificationand4.17_ovnkubernetes_ipijob definitionsDetails
The CPMS operator manages the lifecycle of OCP control plane machines. This PR ports the existing Jenkins/IR-based CPMS testing into shiftstack-qa's Ansible automation framework.
New
cpms_teststage role:mainfor versions without a release branch)make e2e-presubmitandmake e2e-periodicwithOPENSHIFT_CI=truefor JUnit outputNew
cpms_replace_attrsday2ops procedure: