We have job-level retry on failure now, and will continue to need to in
order to work around fd.o infrastructure flakes. If we stop doing retry
inside the job, then we can crank down the gitlab-level timeouts on test
jobs to be closer to our CI guidelines and avoid blocking a runner for an
hour when things go wrong (for example, cheza #16 failing to boot in a
recognized way and continuously looping due to the intra-job retry).
Plus, the job logs will be more readable when you don't have two boots in
one job, and we'll get the flakes surfaced in our monitoring dashboards.
If internal retries were really doing useful work we may see an increase
in flakes as a result of this. I'm committing to turning off boards or
reducing coverage as necessary to handle this.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25790>
1. Move block used for detecting R8152 problems to the bootloader
phase where it belongs. Also remove requirement to 100 failures and just
retry immediatelly.
2. Consider job failed after 10 errors, not 100. From the logs on
cheza-14, ~ 30 errors is enough to fail.
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25285>
Modify the build process for the images to include the build to have ci-kdl
available in the Mesa jobs. Modify also the init-stage2 to launch in the
background the process that will collect data and store a json file with the
relative changes on the recorded data.
Signed-off-by: Sergi Blanch Torne <sergi.blanch.torne@collabora.com>
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24177>
Better error handling is more reliable.
Options:
-L, follow location
--retry, number of retries
--retry-all-errors, does not fail on ALL errors, that's why there is -f
-f, fail fast with no output at all on server errors
--retry-delay, make curl sleep this amount of time before each retry
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/20788>
This is a farm of 5 (6, but one fails) TK1 boards for nouveau testing,
hosted and maintained by me. Currently it runs GLES dEQP.
I've been using ./.gitlab-ci/bin/ci_run_n_monitor.py --stress --target
gk20a to test it and am pretty confident of the skips/flakes list. Last
night it ran 318 jobs without fail, and prior to that there were two sets
of runs in the 100-200 range where only the one failing runner failed any
jobs.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/18497>
If we got a "Reached the end of the CPU serial log without finding a
result" because the test phase timed out, then the CPU serial would have
been closed as part of the timeout process, so we need to close the rest
and re-instantiate the servo run class.
fastboot and poe already re-instantiate the class on retry.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17689>
This should help with "marge got stuck for an hour and all I got was this
failed job with no results/" when a system intermittently wedges.
This replaces the BM_POE_TIMEOUT ("did we get something on serial in the
last 3 minutes?") that rpi had, in favor of checking that the whole test
job gets through in 20 minutes.
Acked-by: Juan A. Suarez <jasuarez@igalia.com>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17096>
If the SerialBuffers can just feed the same line queue, then we don't need
the extra threads reading line queues into a new merged line queue.
Less python threading code is always better. Plus, now we can pass args
to SerialBuffer.lines() for timeout/phase.
Acked-by: Juan A. Suarez <jasuarez@igalia.com>
Reviewed-by: Christian Gmeiner <christian.gmeiner@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17096>
The test suite is full of flakes around transform feedback, atomics, and
tess. But, I hope it can be useful for regression testing core Mesa
reworks.
This required updating the kernel to 5.16.12 to get a more stable boot
process. That kernel rebuild caused an update of the container with
piglit which that was missed in a previous MR, so we got new xfails in x86
swrast.
Acked-by: Ilia Mirkin <imirkin@alum.mit.edu> (nouveau)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15201>
This required updating the kernel to 5.16.12 to get a more stable boot
process. That kernel rebuild caused an update of the container with
piglit which that was missed in a previous MR, so we got new xfails in x86
swrast. Also, including modules on arm64 exposed a bug in v3d's
poe-powered.sh rsyncing of modules.
Acked-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/15201>
For every CI job, put JWT content into a file and unset CI_JOB_JWT
environment var
=======
* virgl jobs:
- Share JWT token file to crosvm instance
- Keep using `export -p` due to high complexity in the scripts
of these jobs. At least, the CI_JOB_JWT will not be leaked,
since it is being unset at the `before_script` phase of each
Mesa CI job.
* iris jobs: Update lava_job_submitter to take token file as argument
- generate-env with CI_JOB_JWT_TOKEN_FILE
- create token file during baremetal init stage
* baremetal jobs: Copy token file to bare-metal NFS
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Reviewed-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14004>
Using suites makes load-balancing our jobs much easier, keeps the CPU busy
handling the a630_gles_others.sh test sets (and improves the output and
baseline handling for them), and makes it trivial to add in more short
test sets.
a306: still 5 jobs, and we add KHR-GLES2 (KHR-GLES3 is unstable)
a530: still 5 jobs, added KHR-GLES*
a630_gl: 5 jobs becomes 4, and we add KHR-GLES*
a630_vk: still 3 jobs, now 1/3 of all VK instead of 1/4.
a630_vk_full: still 2 jobs, now includes full bypass testing, partial
no-force testing, and testing of pre-merge-skipped tests.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12256>
Whilst we want to reuse the same init and job environment for LAVA and
bare-metal, LAVA needs to additionally inject wget and tar jobs, so we
can actually get our per-job environment, as the rootfs we run in is
just the container-generated base rootfs.
Split the init script into two stages, with the first stage doing very
base bringup of devices and networking, and the second stage setting the
job environment and running the jobs.
Signed-off-by: Daniel Stone <daniels@collabora.com>
Acked-by: Martin Peres <martin.peres@mupuf.org>
Acked-by: Emma Anholt <emma@anholt.net>
Reviewed-by: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/11337>