7 Commits

Author SHA1 Message Date
Valentine Burley
f6dce6dee1 ci: Add a minimal Alpine container for running LAVA jobs
Compared to the existing Debian-based x86_64_pyutils container, this
Alpine-based variant reduces the image size by approximately 83%.

Include all the necessary python artifacts, including lava_job_submitter
in the container to avoid having to download them at the start of each
test job.

Signed-off-by: Valentine Burley <valentine.burley@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/34980>
2025-05-26 17:25:40 +00:00
Guilherme Gallo
422e65557d ci/lava: Tweak timeouts
LAVA actions follow a hierarchical structure, where most subactions have
their timeouts overridden if the parent action supports a retry
mechanism, such as the `depthcharge-retry` action.

The timeout is calculated as: [1]

```
parent action timeout / failure_retry value
```

To adjust a subaction's timeout, we need to modify the nearest parent
action.

[1]
https://gitlab.collabora.com/lava/lava/-/blob/collabora/production/lava_dispatcher/action.py#L149
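
As a rough illustration of the formula above, the effective subaction
timeout can be sketched as follows. The function name and the example
values are hypothetical and are not taken from the LAVA sources; the
exact rounding LAVA applies is not shown here.

```python
from datetime import timedelta


def subaction_timeout(parent_timeout: timedelta, failure_retry: int) -> timedelta:
    """Split a parent action's timeout evenly across its retry attempts.

    Hypothetical helper mirroring the formula in the commit message:
    parent action timeout / failure_retry value.
    """
    if failure_retry < 1:
        raise ValueError("failure_retry must be at least 1")
    return parent_timeout / failure_retry


# Example values chosen for illustration only.
per_attempt = subaction_timeout(timedelta(minutes=10), failure_retry=5)
```

Under this reading, giving a subaction more time means raising the
parent timeout or lowering the parent's retry count.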

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/33906>
2025-03-10 05:44:25 +00:00
Deborah Brouwer
72c182f873 ci/lava: Detect a6xx gpu recovery failures
Sporadically, an a6xx GPU fails to recover, causing the LAVA job
a660_vk_full to loop on error messages for three hours before timing
out.

A few sporadic error messages may still be recoverable, but when multiple
errors occur over a short period, successful recovery is unlikely. Parse
the logs to look for repeated error messages within a short time period.
If found, cancel the lava job and rerun it.

Also add unit tests for this behaviour.
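
The idea of "repeated errors within a short time period" can be
sketched with a sliding window over kernel log timestamps. This is a
minimal sketch, not the code from this commit: the class name, the
error pattern, the threshold, and the window length are all
assumptions made for illustration.

```python
import re
from collections import deque

# Kernel log lines start with a timestamp such as "[  143.080663]".
KERNEL_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


class RepeatedErrorDetector:
    """Report True once `max_errors` matches fall inside `window_seconds`."""

    def __init__(self, pattern: str, max_errors: int = 3, window_seconds: float = 60.0):
        self.pattern = re.compile(pattern)  # hypothetical error pattern
        self.max_errors = max_errors        # hypothetical threshold
        self.window_seconds = window_seconds
        self.timestamps: deque = deque()

    def feed(self, line: str) -> bool:
        ts_match = KERNEL_TS.match(line)
        if ts_match is None or not self.pattern.search(line):
            return False
        now = float(ts_match.group(1))
        self.timestamps.append(now)
        # Drop matches that are older than the window.
        while now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_errors
```

A caller would feed each log line to `feed()` and cancel and resubmit
the job the first time it returns True. An isolated error never
reaches the threshold, which matches the observation that a few
sporadic messages may still be recoverable.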

cc: mesa-stable

Reported-by: Valentine Burley <valentine.burley@gmail.com>
Acked-by: Daniel Stone <daniel.stone@collabora.com>
Reviewed-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Signed-off-by: Deborah Brouwer <deborah.brouwer@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/30032>
2024-07-19 23:41:13 +00:00
Guilherme Gallo
41cd32d10e ci/lava: Broader R8152 error handling
The r8152 error detection now considers the known patterns in any
order, so that it detects variations of the r8152 issue during the test
phase. This includes a small refactoring to accommodate future issues.
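
Order-independent matching can be sketched by recording which known
patterns have been seen instead of expecting them in sequence. This is
an illustrative sketch only: the pattern names and the function are
hypothetical, and the two regular expressions are modelled on the
r8152 and NFS log lines quoted elsewhere in this history.

```python
import re

# Hypothetical pattern table; a new issue would add an entry here.
KNOWN_PATTERNS = {
    "tx_status": re.compile(r"r8152 \S+ eth0: Tx status -71"),
    "nfs_timeout": re.compile(r"nfs: server \S+ not responding, still trying"),
}


def all_patterns_seen(lines) -> bool:
    """Return True once every known pattern has matched, in any order."""
    seen = set()
    for line in lines:
        for name, pattern in KNOWN_PATTERNS.items():
            if pattern.search(line):
                seen.add(name)
        if len(seen) == len(KNOWN_PATTERNS):
            return True
    return False
```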

Additionally, adjusted the timing for setting the `start_time` in
`test_lava_job_submitter.py` to ensure consistency and reliability in
test execution, aligning the start time closer to the job submission
process.

With this fix, the bad state shown in the following job will be
detected:
https://gitlab.freedesktop.org/drm/msm/-/jobs/55033953

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27688>
2024-02-20 00:48:24 +00:00
Guilherme Gallo
de2c847c24 ci/lava: Detect r8152 issue during boot phase
This week we found that the r8152 issue can also happen during the boot
phase. Make the necessary adjustments to detect it.

https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/53651940

Notes:
- The kernel messages during the boot phase are being redirected to the
feedback messages due to the namespaces from the SSH job.
- Update the unit tests:
  - Add boot phase detection
  - Correctly set the boot phase when mocking LogFollower

Reported-by: Vignesh Raman <vignesh.raman@collabora.com>
Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
2024-01-16 17:22:04 +00:00
Guilherme Gallo
bfd50f72eb ci/lava: Turn the r8152 issue check into a counter
We were just detecting if a log like
[  143.080663] r8152 2-1.3:1.0 eth0: Tx status -71
happened once before
[  316.389695] nfs: server 192.168.201.1 not responding, still trying

But we can use a counter to be more assured that the device is
struggling to recover, and we can also let this detection happen during
the boot phase.

This mimics how other freedreno devices deal with this problem, see
`cros_servo_run.py:64` for example.

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27081>
2024-01-16 17:22:04 +00:00
Guilherme Gallo
654f7f783f ci/lava: Make SSH definition wrap the UART one
Simplify the UART and SSH job definition modules so that they share
common building blocks.

- generate_lava_yaml_payload is now a LAVAJobDefinition method, so
  dropped the Strategy pattern between both modules
- if SSH is supported and UART is not enforced, default to SSH
- when SSH is enabled, wrap the last deploy action to run the SSH server
  and rewrite the test actions, which should not change due to the boot
  method
- create a constants module to load environment variables
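
The selection rule in the list above, combined with reading settings
from the environment, can be sketched as follows. The function names
and the accepted flag values are hypothetical and do not come from the
modules touched by this commit.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")


def choose_boot_method(ssh_supported: bool, force_uart: bool) -> str:
    """Default to SSH when it is supported and UART is not enforced."""
    if ssh_supported and not force_uart:
        return "ssh"
    return "uart"
```

Keeping the environment lookups in one place means the job definition
code only deals with plain booleans, which is easier to unit test.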

Signed-off-by: Guilherme Gallo <guilherme.gallo@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25912>
2023-11-02 03:31:50 +00:00