Make sure that both per-vertex and per-primitive attribute
ring stores are finished before position or primitive export
instructions are executed.
This is necessary because we need to ensure that mesh shader
waves work correctly when they have either vertex-only or
primitive-only waves.
Cc: mesa-stable
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
(cherry picked from commit 93b4f200de)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25157>
Cleanup the code that generates the two channels of the
primitive export instruction, and move storing the built-in
per-primitive outputs out to match how vertex attributes work.
Prepares the mesh shader lowering for a workaround that
affect export instructions.
Cc: mesa-stable
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
(cherry picked from commit 0721784b78)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25157>
This is a HW bug workaround for some (all?) GFX11 chips.
On these chips, rasterization can start before the attribute ring
stores are finished, which can cause issues.
As a workaround, wait for attribute ring stores to finish
before doing the position export.
Mesh shaders will be taken care of in another commit.
Cc: mesa-stable
Signed-off-by: Timur Kristóf <timur.kristof@gmail.com>
Reviewed-by: Qiang Yu <yuq825@gmail.com>
Reviewed-by: Rhys Perry <pendingchaos02@gmail.com>
(cherry picked from commit edd51655f0)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25157>
Up until now, the mesh pipeline assumed it would be always linked to the
fragment shader, and so the calculated MUE map would always be
available.
That is not the case for fast linked pipeline libraries, so the URB
setup needs to account for this. We do this by replicating what's done
for non-mesh pipelines, defining the URB based on the FS inputs, and
always assuming they will be laid out in order of varying number, except
that we also account for per-primitive attributes.
Fixes all GPL using tests under dEQP-VK.mesh_shader.ext.smoke.*
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
(cherry picked from commit 4eddeea7bf)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25188>
The compaction introduced in a252123363 ("intel/compiler/mesh: compactify MUE layout")
is not suitable for the case where graphics pipeline libraries are fast
linked, as the fragment shader won't receive the mue_map to know where
to locate its inputs.
For that case, keep doing what we did before and lay things down in the
order varyings are defined, which is also how it works for the non-mesh
case.
Fixes dEQP-VK.fragment_shading_rate.*fast_linked_library*.ms
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
(cherry picked from commit b200e5765c)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25188>
Move mesh URB allocations together with the other stages.
This fixes a hang that started happening with mesh enabled after
419531c5d9 ("intel/blorp: add a new flag to communicate PSS sync need")
Bspec 45352 says:
L3 Space allocation can only be changed when the GPU pipeline is
completely flushed.
It's likely that the PIPE_CONTROL added in that commit was breaking that
assumption and the URB allocation happening afterwards at the end of the
pipeline emission would then hang. And before that, we were probably
just getting lucky.
Reviewed-by: Caio Oliveira <caio.oliveira@intel.com>
(cherry picked from commit bcde58ea86)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25188>
Both per-primitive and per-vertex space is allocated in MUE in 8 dword
chunks and those 8-dword chunks (granularity of
3DSTATE_SBE_MESH.Per[Primitive|Vertex]URBEntryOutputReadLength)
are passed to fragment shaders as inputs (either non-interpolated
for per-primitive and flat vertex attributes or interpolated
for non-flat vertex attributes).
Some attributes have a special meaning and must be placed in separate
8/16-dword slot called Primitive Header or Vertex Header.
Primitive Header contains 4 such attributes (Cull Primitive,
ViewportIndex, RTAIndex, CPS), leaving 4 dwords (the rest of 8-dword
slot) potentially unused.
Vertex Header is similar - it starts with 3 unused dwords, 1 dword for
Point Size (but if we declare that shader doesn't produce Point Size
then we can reuse it), followed by 4 dwords for Position and optionally
8 dwords for clip distances.
This means we have an interesting optimization problem - we can put
some user attributes into holes in Primitive and Vertex Headers, which
may lead to smaller MUE size and potentially more mesh threads running
in parallel, but we have to be careful to use those holes only when
we need it, otherwise we could force HW to pass too much data to
fragment shader.
Example 1:
Let's assume that Primitive Header is enabled and user defined
12 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(12, 8) = 24 dwords of
MUE space and pass ALIGN(12, 8) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(12 - 4, 8) = 16 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(12 - 4, 8) = 16 dwords to
fragment shader.
16/16 is better than 24/16, so packing makes sense.
Example 2:
Now let's assume that Primitive Header is enabled and user defined
16 dwords of per-primitive attributes.
Without packing we would consume 8 + ALIGN(16, 8) = 24 dwords of
MUE space and pass ALIGN(16, 16) = 16 dwords to fragment shader.
With packing, we'll consume 4 + 4 + ALIGN(16 - 4, 8) = 24 dwords of
MUE space and pass ALIGN(4, 8) + ALIGN(16 - 4, 8) = 24 dwords to
fragment shader.
24/24 is worse than 24/16, so packing doesn't make sense.
This change doesn't affect vk_meshlet_cadscene in default configuration,
but it speeds it up by up to 25% with "-extraattributes N", where
N is some small value divisible by 2 (by default N == 1) and we
are bound by URB size.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
(cherry picked from commit c1685f08dd)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25188>
Instead of using 4 dwords for each output slot, use only the amount
of memory actually needed by each variable.
There are some complications from this "obvious" idea:
- flat and non-flat variables can't be merged into the same vec4 slot,
because flat inputs mask has vec4 stride
- multi-slot variables can have different layout:
float[N] requires N 1-dword slots, but
i64vec3 requires 1 fully occupied 4-dword slot followed by 2-dword slot
- some output variables occur both in single-channel/component split
and combined variants
- crossing vec4 boundary requires generating more writes, so avoiding them
if possible is beneficial
This patch fixes some issues with arrays in per-vertex and per-primitive data
(func.mesh.ext.outputs.*.indirect_array.q0 in crucible)
and by reduction in single MUE size it allows spawning more threads at
the same time.
Note: this patch doesn't improve vk_meshlet_cadscene performance because
default layout is already optimal enough.
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
(cherry picked from commit a252123363)
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25188>
The vulkan spec says all conversions are correctly rounded, so if the input
is larger than the largest fp16 value, we need to return MAX_FLOAT/inf
instead of cutting off the msbs.
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24826>
(cherry picked from commit 6d949e18fd)
For z surfaces, flags.texture should be based on
RADEON_SURF_TC_COMPATIBLE_HTILE alone. Otherwise, addrlib could pick a
_X/_T swizzle mode for a MSAA depth texture, which is said to be broken:
When _X/_T swizzle mode was used for MSAA depth texture, TC will get zplane
equation from wrong address within memory range a tile covered and use the
garbage data for compressed Z reading which finally leads to corruption.
Fixes: de0885cdb8 ("amd/surface: add RADEON_SURF_NO_TEXTURE flag")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24767>
(cherry picked from commit e74c3dbb70)
It uses a poll function that waits for a second hoping for another thread
to catch up, which is not a reliable way to do synchronization. The test
has been spuriously failing merges on a regular basis recently.
This is issue #9222, which I'm leaving open until the author can fix the test.
Fixes: 3b69b67545 ("util/fossilize_db: add runtime RO foz db loading via FOZ_DBS_DYNAMIC_LIST")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24755>
(cherry picked from commit 4dfd306454)