Improves the error logging in the LAVA job submitter by capturing and
logging the exception message rather than just the exception type when a
job fails to run.
Additionally, introduces a clearer script interruption
message to aid in debugging and immediate understanding of job
submission failures.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28778>
This commit refactors the exception hierarchy to differentiate between
retriable and fatal errors in the CI pipeline, specifically within the
LAVA job submission process.
A new base class, `MesaCIRetriableException`, is introduced for
exceptions that should trigger a retry of the CI job, while
`MesaCIFatalException` is added for non-recoverable errors that halt the
process immediately.
Additionally, the logic for deciding whether a job should be retried or
not is updated to check for instances of `MesaCIRetriableException`,
improving the robustness and reliability of the CI job execution
strategy.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28778>
The r8152 error detection is now considering any order of the known
patterns to detect variations of the r8152 issues during the test phase.
This includes a small refactoring for eventual new issues.
Additionally, adjusted the timing for setting the `start_time` in
`test_lava_job_submitter.py` to ensure consistency and reliability in
test execution, aligning the start time closer to the job submission
process.
With this fix, the bad state shown in the following job will be
detected:
https://gitlab.freedesktop.org/drm/msm/-/jobs/55033953
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27688>
In our process of monitoring LAVA logs, we typically skip numerous
messages to enhance log clarity. We already exclude `feedback` messages
from display. These messages were just used as a heartbeat signal,
indicating that if we are receiving them, the Device Under Test (DUT) is
active.
Practically, if we continuously receive feedback messages without any
other message level (either `debug` or `target`) for several minutes,
this could be a cause for concern, as it likely indicates the device is
in a kind of livelock state.
Therefore, it is more prudent to ignore feedback messages, as they tend
to occur frequently in unstable scenarios. However, it's important to
note that any other message level will still be considered as a
heartbeat signal.
Real case where several minutes of feedback messages indicate a hang:
https://gitlab.freedesktop.org/mesa/mesa/-/jobs/53546660
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26996>
We were just detecting if a log like
[ 143.080663] r8152 2-1.3:1.0 eth0: Tx status -71
happened once before
[ 316.389695] nfs: server 192.168.201.1 not responding, still trying
But we can use a counter to be more assured that the device is
struggling to recover and we can add let this detection happen during
the boot phase.
This mimics how other freedreno devices deal with this problem, see
`cros_servo_run.py:64` for example.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
We need update kernel often. We need test kernel changes often.
Introduced `KERNEL_EXTERNAL_TAG` to differ between `KERNEL_TAG` which is
also used to rebuild the containers. We don't need rebuild containers
for the external kernel, so this way we don't have to.
Updating kernel goes wruuuuuum.
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23563>
Add two unit tests related to the LAVA job definition.
test_generate_lava_job_definition_sanity checks for the most important
fields, deploy actions, namespaces etc.
test_lava_job_definition compares the generated definition with static
skeleton YAML files committed inside tests/data folder.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Simplify both UART and SSH job definitions module to share common
building blocks themselves.
- generate_lava_yaml_payload is now a LAVAJobDefinition method, so
dropped the Strategy pattern between both modules
- if SSH is supported and UART is not enforced, default to SSH
- when SSH is enabled, wrap the last deploy action to run the SSH server
and rewrite the test actions, which should not change due to the boot
method
- create a constants module to load environment variables
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Break it to smaller pieces with variable size (fastboot has 3 deploy
actions and uboot only one) to build the base definition nicely in the
end.
Extract kernel/dtb attachment and init_stage1 extraction into functions
to be later reused by SSH job definition.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
Modify the build process for the images to include the build to have ci-kdl
available in the Mesa jobs. Modify also the init-stage2 to launch in the
background the process that will collect data and store a json file with the
relative changes on the recorded data.
Signed-off-by: Sergi Blanch Torne <sergi.blanch.torne@collabora.com>
Reviewed-by: Eric Engestrom <eric@engestrom.ch>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24177>
- To keep things organized, create a base hidden jobs for alpine images,
as we have 2 now
- This image will have an SSH client ran by LAVA dispatchers
(x86_64-only) who will serve as a bypass channel of output and dmesg
and an alternative for UART.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
Our LAVA farm is currently experiencing issues with running and pulling
docker. LAVA has been detecting (with a low rate) timeouts during these
commands, causing some jobs to fail with infrastructure errors.
Increasing the failure_retry will make the job retry run the container
when LAVA detects the failure without losing its place in the job queue.
We are currently investigating why docker times out. But, when LAVA
fails to detect it, we cancel the job on our side and resubmit it to the
job queue. For more information, please refer to following dashboard:
https://ci-stats-grafana.freedesktop.org/goto/VjZvaA_4z?orgId=1
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23534>
Already in hard-freeze, so we don't have to worry about breaking changes.
Significant changes:
- LLVM 15 is used instead of 11 or 13
- /dev/shm has to be manually mounted
- Debian 12 uses libdrm 2.4.114
- reworked creating of rootfs, from debootstrap to mmdebstrap
- split `create-rootfs.sh` into `lava_build.sh`, `setup-rootfs.sh`, and `strip-rootfs.sh`
- dropped winehq repository for now (Debian wine is up-to-date enough)
- we use wine now, no need to call explicitly call wine64
- bumped libasan from version 6 to 8
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21977>
To ensure proper SSH functioning, the device IP should be added to the
LAVA device dictionary by setting device_ip. LAVA will then map the
value to lava-target-ip.
meson-g12b-a311d-khadas-vim3-cbg-4 has an IP in the dictionary, while
sun50i-h6-pine-h64-cbg-1 and meson-g12b-a311d-khadas-vim3-cbg-2 do not.
Since some devices are not yet properly configured, and device tag
fixing is not an option here, let's temporarily switch to a job
definition based on UART, until it gets fixed.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>
To use the supported job definition depending on some Mesa CI job
characteristics.
The strategy here, is to use LAVA with a containerized SSH session to
follow the job output, escaping from dumping data to the UART, which
proves to be error prone in some devices.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>
Create a separate job definition that runs the job via SSH session.
The DUT test only sets up the SSH server via dropbear, and another
deployed docker runner in LAVA dispatcher access the DUT via SSH with
pseudo-terminal to propagate the logs in real time.
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22870>