8 Commits

Author SHA1 Message Date
Valentine Burley
f6dce6dee1 ci: Add a minimal Alpine container for running LAVA jobs
Compared to the existing Debian-based x86_64_pyutils container, this
Alpine-based variant reduces the image size by approximately 83%.

Include all the necessary python artifacts, including lava_job_submitter
in the container to avoid having to download them at the start of each
test job.

Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34980>
2025-05-26 17:25:40 +00:00
Deborah Brouwer
72c182f873 ci/lava: Detect a6xx gpu recovery failures
Sporadically a6xx gpu will fail to recover causing the lava job
a660_vk_full to loop on error messages for three hours before timing
out.

A few sporadic error messages may still be recoverable, but when multiple
errors occur over a short period, successful recovery is unlikely. Parse
the logs to look for repeated error messages within a short time period.
If found, cancel the lava job and rerun it.

Also add unit tests for this behaviour.

cc: mesa-stable

Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
2024-07-19 23:41:13 +00:00
Guilherme Gallo
41cd32d10e ci/lava: Broader R8152 error handling
The r8152 error detection is now considering any order of the known
patterns to detect variations of the r8152 issues during the test phase.
This includes a small refactoring for eventual new issues.

Additionally, adjusted the timing for setting the `start_time` in
`test_lava_job_submitter.py` to ensure consistency and reliability in
test execution, aligning the start time closer to the job submission
process.

With this fix, the bad state shown in the following job will be
detected:
https://gitlab.freedesktop.org/drm/msm/-/jobs/55033953

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27688>
2024-02-20 00:48:24 +00:00
Guilherme Gallo
ffe2b31f9a ci/lava: Detect hard resets during test phase
Hard resets should not occur during the test phase. Therefore, let's
detect them through specific log messages and raise an exception for a
known issue if it occurs. Without this detection, the job will continue
running on both Gitlab and LAVA until a timeout occurs.

Real case:
https://gitlab.freedesktop.org/mesa/mesa/-/jobs/53546660

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/26995>
2024-01-23 22:47:24 +00:00
Guilherme Gallo
de2c847c24 ci/lava: Detect r8152 issue during boot phase
This week we found that the r8152 issue can happen during the boot
phase, make the necessary adjustments to detect it.

https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/53651940

Notes:
- The kernel messages during the boot phase is being redirected to the
feedback messages due to the namespaces from the SSH job.
- Update the unit tests:
  - Add boot phase detection
  - Correctly set the boot phase when mocking LogFollower

Reported-by: Vignesh Raman <vignesh.raman@collabora.com>
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
2024-01-16 17:22:04 +00:00
Guilherme Gallo
bfd50f72eb ci/lava: Turn the r8152 issue check into a counter
We were just detecting if a log like
[  143.080663] r8152 2-1.3:1.0 eth0: Tx status -71
happened once before
[  316.389695] nfs: server 192.168.201.1 not responding, still trying

But we can use a counter to be more assured that the device is
struggling to recover and we can add let this detection happen during
the boot phase.

This mimics how other freedreno devices deal with this problem, see
`cros_servo_run.py:64` for example.

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
2024-01-16 17:22:04 +00:00
Guilherme Gallo
70f1291d8e ci/lava: Add canceled job status
We should be explicit that we are cancelling jobs once the script finds
some log messages that are linked with known issues. That means the
script preemptively retried the job without giving chances to recover.

Adds magenta color to cancelled jobs.

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17389>
2022-07-08 12:26:05 +00:00
Guilherme Gallo
2c51b7a9c9 ci/lava: Detect R8152 issues preemptively and retry
Implement a log-based retry hint for R8152 issue described in #6681,
which is based on detecting these two consecutive lines:

```
r8152 <USB> eth0: Tx status -71
nfs: server <IP> not responding, still trying
```

Where <IP> and <USB> could be any IP and USB addresses, respectfully.

This commit is a temporary fix since it requires a section-aware log
follower, implemented in !16323. When the cited MR is merged, one will
make a proper fix on top of that.

Closes: #6681

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/17389>
2022-07-08 12:26:05 +00:00