Compare commits

...

56 Commits

Author SHA1 Message Date
Eric Engestrom
d6695f1641 VERSION: bump for 24.0.8 2024-05-22 18:48:39 +02:00
Eric Engestrom
8f3dfb0aaa docs: add release notes for 24.0.8 2024-05-22 18:48:26 +02:00
David Heidelberg
c15886ec43 winsys/i915: depends on intel_wa.h
Prevent a compilation failure due to the not-yet-generated intel_wa.h header.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11174
Cc: mesa-stable

Reviewed-by: Mark Janes <markjanes@swizzler.org>
Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29252>
(cherry picked from commit 08659a0baa)
2024-05-20 12:40:12 +02:00
Mike Blumenkrantz
8c913751a2 nir/linking: fix nir_assign_io_var_locations for scalarized dual blend
this would previously assign all scalar variables to the highest
driver location

cc: mesa-stable

Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28753>
(cherry picked from commit ffe54ca293)
2024-05-20 12:40:12 +02:00
Mike Blumenkrantz
16c64c184c nir/lower_aaline: fix for scalarized outputs
this otherwise was broken

cc: mesa-stable

Acked-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28753>
(cherry picked from commit e28061c502)
2024-05-20 12:40:11 +02:00
Karol Herbst
d5675923b3 nir/lower_cl_images: set binding also for samplers
Fixes https://github.com/darktable-org/darktable/issues/16717 on radeonsi.

Fixes: 31ed24cec7 ("nir/lower_images: extract from clover")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29230>
(cherry picked from commit 564e569072)
2024-05-20 12:40:11 +02:00
David Rosca
0fd4f15601 radeonsi/vcn: Ensure at least one reference for H264 P/B frames
The original fix from

0f3370eede ("raseonsi/vcn: fix a h264 decoding issue")

would in some cases also trigger for I frames with interlaced streams.
Instead of checking used_for_reference_flags, use slice type and
only add one reference for P/B frames if needed.
This change still fixes playback of the sample from the original issue,
avoids the issue with interlaced streams, and also fixes the case where the
application provides no references at all.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11060
Cc: mesa-stable
Reviewed-by: Leo Liu <leo.liu@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29055>
(cherry picked from commit 5f4a6b5b00)
2024-05-20 12:40:11 +02:00
David Rosca
c3f11a4011 radeonsi/vcn: Allow duplicate buffers in DPB
In case of missing frames (e.g. when decoding corrupted streams), there
will be duplicate buffers, and all of them need to be in the DPB to keep
the layout correct for decoding.

Cc: mesa-stable
Reviewed-by: Leo Liu <leo.liu@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29055>
(cherry picked from commit 2ef3a34f1a)
2024-05-20 12:40:11 +02:00
David Rosca
ea6c84849d radeonsi/vcn: Ensure DPB has as many buffers as references
In case of corrupted streams (or application bugs), the number
of references may not equal the DPB size. This needs to be fixed by
filling the missing slots with dummy buffers.

Cc: mesa-stable
Reviewed-by: Leo Liu <leo.liu@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29055>
(cherry picked from commit 47b6ca47d0)
2024-05-20 12:40:11 +02:00
David Rosca
93572f4e31 frontends/va: Store slice types for H264 decode
Cc: mesa-stable
Reviewed-by: Leo Liu <leo.liu@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29055>
(cherry picked from commit 9837dab4bd)
2024-05-20 12:40:11 +02:00
Patrick Lerda
450ad166c0 r600: fix vertex state update clover regression
This change handles the case where "vertex_fetch_shader.cso" is null,
restoring the previous behavior in that specific case. This situation
occurs with clover.

For instance, this issue is triggered with "piglit/bin/cl-custom-buffer-flags":
==6467==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000000c (pc 0x7ff92908fe6e bp 0x7ffe86ae5ad0 sp 0x7ffe86ae5a30 T0)
==6467==The signal is caused by a READ memory access.
==6467==Hint: address points to the zero page.
    #0 0x7ff92908fe6e in evergreen_emit_vertex_buffers ../src/gallium/drivers/r600/evergreen_state.c:2123
    #1 0x7ff92908444b in r600_emit_atom ../src/gallium/drivers/r600/r600_pipe.h:627
    #2 0x7ff92908444b in compute_emit_cs ../src/gallium/drivers/r600/evergreen_compute.c:798
    #3 0x7ff92908444b in evergreen_launch_grid ../src/gallium/drivers/r600/evergreen_compute.c:927
    #4 0x7ff9349f9350 in clover::kernel::launch(clover::command_queue&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&) ../src/gallium/frontends/clover/core/kernel.cpp:105
    #5 0x7ff9349c331d in std::function<void (clover::event&)>::operator()(clover::event&) const /usr/include/c++/11.4.0/bits/std_function.h:590
    #6 0x7ff9349c331d in clover::event::trigger() ../src/gallium/frontends/clover/core/event.cpp:54
    #7 0x7ff9349c82f1 in clover::hard_event::hard_event(clover::command_queue&, unsigned int, clover::ref_vector<clover::event> const&, std::function<void (clover::event&)>) ../src/gallium/frontends/clover/core/event.cpp:138
    #8 0x7ff9348daa47 in create<clover::hard_event, clover::command_queue&, int, clover::ref_vector<clover::event>&, clEnqueueNDRangeKernel(cl_command_queue, cl_kernel, cl_uint, const size_t*, const size_t*, const size_t*, cl_uint, _cl_event* const*, _cl_event**)::<lambda(clover::event&)> > ../src/gallium/frontends/clover/util/pointer.hpp:241
    #9 0x7ff9348daa47 in clEnqueueNDRangeKernel ../src/gallium/frontends/clover/api/kernel.cpp:334

Fixes: 659b7eb2 ("r600: better tracking for vertex buffer emission")
Related: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10079
Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29163>
(cherry picked from commit f8a1d9f787)
2024-05-20 12:40:11 +02:00
David Rosca
331d440811 radeonsi: Update buffer for other planes in si_alloc_resource
The buffer is shared with all planes, so it needs to be updated
in all other planes. This is already done in si_texture_create_object
when creating the buffer, but it was missing when reallocating
in si_texture_invalidate_storage.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11155
Cc: mesa-stable
Reviewed-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29216>
(cherry picked from commit c522848d5a)
2024-05-20 12:40:11 +02:00
Karol Herbst
1bf184747e rusticl/mesa/context: flush context before destruction
Drivers might still be busy and might not properly clean things up.

Fixes a rare crash on application exit with some drivers.

Fixes: 50e981a050 ("rusticl/mesa: add fencing support")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29223>
(cherry picked from commit f1662e9bc9)
2024-05-20 12:40:11 +02:00
Karol Herbst
30ddd43e6d event: break long dependency chains on drop
This prevents stack overflows on drop without making it expensive to read
from dependencies (e.g. my attempt to use Weak instead).

Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29190>
(cherry picked from commit 48c752d3e0)
2024-05-20 12:40:11 +02:00
Karol Herbst
11f595f5e7 Revert "rusticl/event: use Weak refs for dependencies"
I didn't like the solution and I _think_ it even introduced a potential
regression involving releasing failed events, causing dependents
to run and succeed regardless.

This reverts commit a45f199086.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29190>
(cherry picked from commit 2f1f98e846)
2024-05-20 12:40:11 +02:00
Lionel Landwerlin
eac0e52cdb nir/divergence: add missing load_printf_buffer_address
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Cc: mesa-stable
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25814>
(cherry picked from commit 8d336f069e)
2024-05-20 12:40:10 +02:00
Lionel Landwerlin
a0e1b5f436 anv: fix push constant subgroup_id location
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Fixes: 7c76125db2 ("anv: use 2 different buffers for surfaces/samplers in descriptor sets")
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25814>
(cherry picked from commit 3716bd704f)
2024-05-20 12:40:10 +02:00
Yiwei Zhang
2c1cbf296e turnip: virtio: fix racy gem close for re-imported dma-buf
Similar to the prior fix for msm. On the dma-buf import path, tu_bo_init
could be outside the vma lock, but it is left inside for code simplicity.

Fixes: f17c5297d7 ("tu: Add virtgpu support")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29093>
(cherry picked from commit 43bb989070)
2024-05-20 12:40:10 +02:00
Yiwei Zhang
699ec9c4c7 turnip: virtio: fix iova leak upon found already imported dmabuf
There's a success path when the dmabuf is found where the iova isn't
cleaned up. This change defers the iova allocation until a lookup miss,
and also prepares for the later racy dmabuf re-import fix.

Also documented a potential leak on the error path, since without
refcounting it's impossible to tell whether a gem handle should be closed.

Fixes: f17c5297d7 ("tu: Add virtgpu support")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29093>
(cherry picked from commit 6ca192f586)
2024-05-20 12:40:10 +02:00
Yiwei Zhang
352d44ce5a turnip: virtio: fix error path in virtio_bo_init
Fixes: f17c5297d7 ("tu: Add virtgpu support")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29093>
(cherry picked from commit 585a87ae53)
2024-05-20 12:40:10 +02:00
David Rosca
69c7b25037 frontends/va: Only increment slice offset after first slice parameters
Fixes the slice offset when an app submits exactly one data buffer
followed by parameter buffers.

Fixes: 6746d4df6e ("frontends/va: Fix AV1 slice_data_offset with multiple slice data buffers")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11133
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11138
Tested-by: Marcus Seyfarth <m.seyfarth@gmail.com>
Reviewed-by: Boyuan Zhang <Boyuan.Zhang@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29124>
(cherry picked from commit b33bb4077d)
2024-05-20 12:40:10 +02:00
Friedrich Vock
1aaec51f56 aco/spill: Insert p_start_linear_vgpr right after p_logical_end
If p_start_linear_vgpr allocates a VGPR that is already blocked, RA
will try moving the blocking VGPR somewhere else. If
p_start_linear_vgpr is inserted right before the branch, that move will
be inserted after exec has been overwritten, which might cause the move
to be skipped for some threads.

Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28041>
(cherry picked from commit 590ea76104)
2024-05-20 12:40:10 +02:00
Friedrich Vock
d3b8a28357 aco/tests: Insert p_logical_start/end in reduce_temp tests
Linear VGPR insertion will depend on a p_logical_end existing in the
blocks the VGPR is inserted in.

Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28041>
(cherry picked from commit 84c1870b65)
2024-05-20 12:40:10 +02:00
Marek Olšák
368892e9e2 util: shift the mask in BITSET_TEST_RANGE_INSIDE_WORD to be relative to b
so that users don't have to shift it at every use. It was supposed to be
like this from the beginning.

Fixes: fb994f44d9 - util: make BITSET_TEST_RANGE_INSIDE_WORD take a value to compare with

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29187>
(cherry picked from commit 5502ecd771)
2024-05-20 12:40:10 +02:00
Faith Ekstrand
ba4462df44 vulkan/wsi: Bind memory planes, not YCbCr planes.
Reviewed-by: Joshua Ashton <joshua@froggi.es>
Fixes: f5433e4d6c ("vulkan/wsi: Add modifiers support to wsi_create_native_image")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10176
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24795>
(cherry picked from commit 28342a581f)
2024-05-20 11:02:02 +02:00
Faith Ekstrand
115598022a nouveau/winsys: Add back nouveau_ws_bo_new_tiled()
This reverts commit ce1cccea98.  In this
new version, we also add a query for whether or not tiled BOs are
supported by nouveau.ko.

Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24795>
(cherry picked from commit 3bb531d245)
2024-05-20 11:00:53 +02:00
Faith Ekstrand
20595e465b drm-uapi: Sync nouveau_drm.h
Taken from drm-misc-next-fixes:

    commit 959314c438caf1b62d787f02d54a193efda38880
    Author: Mohamed Ahmed <mohamedahmedegypt2001@gmail.com>
    Date:   Thu May 9 23:43:52 2024 +0300

        drm/nouveau: use tile_mode and pte_kind for VM_BIND bo allocations

Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24795>
(cherry picked from commit 03c4a46fe5)
2024-05-20 11:00:53 +02:00
Faith Ekstrand
6eaf495a19 nouveau/winsys: Take a reference to BOs found in the cache
Fixes: c370260a8f ("nouveau/winsys: Add dma-buf import support")
Reviewed-by: Dave Airlie <airlied@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24795>
(cherry picked from commit 19b143b7bc)
2024-05-20 10:55:02 +02:00
David Heidelberg
5dbc8d493d freedreno/ci: move the disabled jobs from include to the main file
Accidentally moved.

Fixes: 9442571664 ("ci: separate hiden jobs to -inc.yml files")

Signed-off-by: David Heidelberg <david.heidelberg@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29155>
(cherry picked from commit d9a0373a65)
2024-05-20 10:54:56 +02:00
Yiwei Zhang
0598097e8a turnip: msm: fix racy gem close for re-imported dma-buf
For dma-buf, if the import and finish occur back-to-back for the same
dma-buf, zombie vma cleanup will unexpectedly close the re-imported
dma-buf gem handle. This change fixes it by trying to resurrect from
zombie vmas on the dma-buf import path.

Fixes: 63904240f2 ("tu: Re-enable bufferDeviceAddressCaptureReplay")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29173>
(cherry picked from commit a1392394ba)
2024-05-20 10:54:18 +02:00
Yiwei Zhang
74802851d2 turnip: msm: clean up iova on error path
Fixes: e23c4fbd9b ("tu: Switch to userspace iova allocations if kernel supports it")
Signed-off-by: Yiwei Zhang <zzyiwei@chromium.org>
Reviewed-by: Danylo Piliaiev <dpiliaiev@igalia.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29173>
(cherry picked from commit 3909803849)
2024-05-20 10:52:20 +02:00
Patrick Lerda
028dc8957f clover: fix memory leak related to optimize
Indeed, the object returned by LLVMCreatePassBuilderOptions()
was not freed.

For instance, this issue is triggered with "piglit/bin/cl-api-build-program":
Direct leak of 32 byte(s) in 1 object(s) allocated from:
    #0 0x7f6b15abdf57 in operator new(unsigned long) (/usr/lib64/libasan.so.6+0xb2f57)
    #1 0x7f6afff6529e in LLVMCreatePassBuilderOptions llvm-18.1.5/lib/Passes/PassBuilderBindings.cpp:83
    #2 0x7f6b1186ee41 in optimize ../src/gallium/frontends/clover/llvm/invocation.cpp:521
    #3 0x7f6b1186ee41 in clover::llvm::link_program(std::vector<clover::binary, std::allocator<clover::binary> > const&, clover::device const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&) ../src/gallium/frontends/clover/llvm/invocation.cpp:554
    #4 0x7f6b1150ce67 in link_program ../src/gallium/frontends/clover/core/compiler.hpp:78
    #5 0x7f6b1150ce67 in clover::program::link(clover::ref_vector<clover::device> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, clover::ref_vector<clover::program> const&) ../src/gallium/frontends/clover/core/program.cpp:78
    #6 0x7f6b11401a2b in clBuildProgram ../src/gallium/frontends/clover/api/program.cpp:283

Fixes: 2d4fe5f229 ("clover/llvm: move to modern pass manager.")
Signed-off-by: Patrick Lerda <patrick9876@free.fr>
Reviewed-by: Karol Herbst <kherbst@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29164>
(cherry picked from commit df39994d51)
2024-05-20 10:52:19 +02:00
Romain Naour
c291a73202 glxext: don't try zink if not enabled in mesa
Commit 7d9ea77b45 ("glx: add automatic zink fallback loading between hw and sw drivers")
added an automatic zink fallback even when the zink gallium driver is not
enabled at build time.

This leads to an unexpected error log when loading the drisw driver
while zink is not installed on the rootfs:

  MESA-LOADER: failed to open zink: /usr/lib/dri/zink_dri.so

Fixes: 7d9ea77b45 ("glx: add automatic zink fallback loading between hw and sw drivers")

Signed-off-by: Romain Naour <romain.naour@smile.fr>
Reviewed-by: Antoine Coutant <antoine.coutant@smile.fr>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27478>
(cherry picked from commit 02ab51a61e)
2024-05-20 10:52:18 +02:00
Antoine Coutant
c2739fefe3 drisw: fix build without dri3
commit 1887368df4 ("glx/sw: check for modifier support in the kopper path")
added the dri3_priv.h header and the dri3_check_multibuffer() function to
drisw, which can be built without dri3.

Commit 4477139ec2 added a guard around the dri3_check_multibuffer()
function but not around the dri3_priv.h header.

Add a HAVE_DRI3 guard around the dri3_priv.h header.

Fixes: 1887368df4 ("glx/sw: check for modifier support in the kopper path")

v2: Remove the guard around dri3_check_multibuffer() function.

Signed-off-by: Romain Naour <romain.naour@smile.fr>
Signed-off-by: Antoine Coutant <antoine.coutant@smile.fr>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27478>
(cherry picked from commit 3163b65ba7)
2024-05-20 10:52:17 +02:00
Eric Engestrom
e462e3cc39 .pick_status.json: Update to a31996ce5a 2024-05-20 10:51:15 +02:00
Mike Blumenkrantz
abf8b28b65 zink: clean up semaphore arrays on batch state destroy
cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29152>
(cherry picked from commit 604573cf0a)
2024-05-13 11:12:34 +02:00
Konstantin Seurer
9eb14991f9 radv: Zero initialize capture replay group handles
radv_serialized_shader_arena_block is not tightly packed and using an
initializer list leaves the gaps uninitialized.

cc: mesa-stable

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28961>
(cherry picked from commit 406dda70e7)
2024-05-13 11:12:33 +02:00
Konstantin Seurer
c64129a0bd radv: Remove arenas from capture_replay_arena_vas
Avoids a use-after-free when looking up an arena.

cc: mesa-stable

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28961>
(cherry picked from commit df82221bb3)
2024-05-13 11:10:13 +02:00
Konstantin Seurer
ca6431d9d7 radv: Fix radv_shader_arena_block list corruption
Remove it from the previous list before adding it to a new one.

cc: mesa-stable

Reviewed-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28961>
(cherry picked from commit e050abc961)
2024-05-13 11:10:12 +02:00
Bas Nieuwenhuizen
df810add64 radv: Use zerovram for Enshrouded.
Two users have now reported that zerovram fixes the hangs.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10500
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29149>
(cherry picked from commit 79cb884275)
2024-05-13 11:10:12 +02:00
Faith Ekstrand
d68141bd68 nvk/meta: Restore set_sizes[0]
Fixes: af3e7ba105 ("nvk: Stash descriptor set sizes")
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29147>
(cherry picked from commit c834644c4e)
2024-05-13 11:10:11 +02:00
Faith Ekstrand
0a312787cd nvk: Re-emit sample locations when rasterization samples changes
We need them for the case where explicit sample locations are not
enabled.  While we're at it, fix the case where rasterization_samples=0.
This can happen when rasterizer discard is enabled.  This fixes MSAA
resolves with NVK+Zink.  In particular, it fixes MSAA for the Unigine
Heaven and Valley benchmarks.

This also fixes all of the spec@arb_texture_float@multisample-formats
piglit tests.

Fixes: 41d094c2cc ("nvk: Support dynamic state for enabling sample locations")
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10786
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29147>
(cherry picked from commit a160c2a14e)
2024-05-13 11:10:10 +02:00
Mike Blumenkrantz
ce7e1ca1fa frontends/dri: always init opencl_func_mutex in InitScreen hooks
this otherwise leads to a mismatch where some types of screen may have
the mutex initialized while others don't, in which case dri_release_screen()
will attempt to destroy an uninitialized mutex

cc: mesa-stable

Reviewed-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29021>
(cherry picked from commit bc15c95c7a)
2024-05-13 11:10:00 +02:00
Mike Blumenkrantz
254b300f6b frontends/dri: only release pipe when screen init fails
the caller (driCreateNewScreen3) will always call dri_destroy_screen()
when these functions return failure, so releasing the screen
is always wrong

cc: mesa-stable

Reviewed-by: Adam Jackson <ajax@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29021>
(cherry picked from commit a1225e81c9)
2024-05-13 11:09:58 +02:00
Eric Engestrom
6d23f70e79 .pick_status.json: Mark ae8fbe220a as denominated 2024-05-13 11:09:56 +02:00
Eric Engestrom
9343ede6a2 .pick_status.json: Update to e154f90aa9 2024-05-13 11:09:20 +02:00
Rhys Perry
5be42f6982 aco/waitcnt: fix DS/VMEM ordered writes when mixed
Signed-off-by: Rhys Perry <pendingchaos02@gmail.com>
Reviewed-by: Georg Lehmann <dadschoorse@gmail.com>
Reviewed-by: Daniel Schürmann <daniel@schuermann.dev>
Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28981>
(cherry picked from commit 5b1b09ad42)
2024-05-10 22:56:59 +02:00
Mike Blumenkrantz
020d145f4a u_blitter: stop leaking saved blitter states on no-op blits
drivers expect blitter to clean up after itself

cc: mesa-stable

Reviewed-by: Marek Olšák <marek.olsak@amd.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29122>
(cherry picked from commit cd004defd4)
2024-05-10 22:53:08 +02:00
Mike Blumenkrantz
cb375dfe03 zink: add a batch ref for committed sparse resources
this ensures that the sparse commit will complete before the resource
is destroyed

cc: mesa-stable

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29123>
(cherry picked from commit 67a356742f)
2024-05-10 22:53:07 +02:00
Georg Lehmann
57198d2ca9 zink: use bitcasts instead of pack/unpack double opcodes
The pack/unpack double opcodes may flush denorms, and the nir ops are pure
bitcasts.

Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29125>
(cherry picked from commit 925fff229f)
2024-05-10 22:53:06 +02:00
Mike Blumenkrantz
90012f1f66 egl/x11: disable dri3 with LIBGL_KOPPER_DRI2=1 as expected
cc: mesa-stable

Acked-by: Daniel Stone <daniels@collabora.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29106>
(cherry picked from commit 568807cf88)
2024-05-10 22:52:51 +02:00
Karol Herbst
38485a49f4 rusticl/event: use Weak refs for dependencies
This fixes a potential stack overflow when the dep chain of events gets
too long and is dropped all at once.

Cc: mesa-stable
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29089>
(cherry picked from commit a45f199086)
2024-05-10 22:52:47 +02:00
Lionel Landwerlin
33d6e6f9a2 anv: fix ycbcr plane indexing with indirect descriptors
We need to add the plane index to compute the address from which to
load the descriptor (anv_sampled_image_descriptor in this case).

This was likely broken before we added direct descriptor support so
that gets a stable backport.

Cc: mesa-stable
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11125
Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Reviewed-by: Ivan Briano <ivan.briano@intel.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/29111>
(cherry picked from commit 665cad6408)
2024-05-10 22:52:46 +02:00
José Expósito
6fc67be3a5 meson: Update proc_macro2 meson.build patch
Update the proc-macro2/meson.build to include the changes from v1.0.81.

Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/11071
Signed-off-by: José Expósito <jexposit@redhat.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28923>
(cherry picked from commit 18c5315731)
2024-05-10 22:52:42 +02:00
Eric Engestrom
58dfa780c1 .pick_status.json: Update to 18c5315731 2024-05-10 22:52:36 +02:00
Eric Engestrom
4047c51834 docs: add sha256sum for 24.0.7 2024-05-08 16:29:01 +02:00
60 changed files with 7273 additions and 251 deletions

File diff suppressed because it is too large


@@ -1 +1 @@
-24.0.7
+24.0.8


@@ -3,6 +3,7 @@ Release Notes
The release notes summarize what's new or changed in each Mesa release.
- :doc:`24.0.8 release notes <relnotes/24.0.8>`
- :doc:`24.0.7 release notes <relnotes/24.0.7>`
- :doc:`24.0.6 release notes <relnotes/24.0.6>`
- :doc:`24.0.5 release notes <relnotes/24.0.5>`
@@ -415,6 +416,7 @@ The release notes summarize what's new or changed in each Mesa release.
:maxdepth: 1
:hidden:
24.0.8 <relnotes/24.0.8>
24.0.7 <relnotes/24.0.7>
24.0.6 <relnotes/24.0.6>
24.0.5 <relnotes/24.0.5>


@@ -19,7 +19,7 @@ SHA256 checksum
 ::
-TBD.
+7454425f1ed4a6f1b5b107e1672b30c88b22ea0efea000ae2c7d96db93f6c26a  mesa-24.0.7.tar.xz
New features

docs/relnotes/24.0.8.rst Normal file

@@ -0,0 +1,155 @@
Mesa 24.0.8 Release Notes / 2024-05-22
======================================
Mesa 24.0.8 is a bug fix release which fixes bugs found since the 24.0.7 release.
Mesa 24.0.8 implements the OpenGL 4.6 API, but the version reported by
glGetString(GL_VERSION) or glGetIntegerv(GL_MAJOR_VERSION) /
glGetIntegerv(GL_MINOR_VERSION) depends on the particular driver being used.
Some drivers don't support all the features required in OpenGL 4.6. OpenGL
4.6 is **only** available if requested at context creation.
Compatibility contexts may report a lower version depending on each driver.
Mesa 24.0.8 implements the Vulkan 1.3 API, but the version reported by
the apiVersion property of the VkPhysicalDeviceProperties struct
depends on the particular driver being used.
SHA256 checksum
---------------
::
TBD.
New features
------------
- None
Bug fixes
---------
- [24.1-rc4] fatal error: intel/dev/intel_wa.h: No such file or directory
- vcn: rewinding attached video in Totem cause [mmhub] page fault
- When using amd gpu deinterlace, tv bt709 properties mapping to 2 chroma
- VCN decoding freezes the whole system
- [RDNA2 [AV1] [VAAPI] hw decoding glitches in Thorium 123.0.6312.133 after https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28960
- WSI: Support VK_IMAGE_ASPECT_MEMORY_PLANE_i_BIT_EXT for DRM Modifiers in Vulkan
- radv: Enshrouded GPU hang on RX 6800
- NVK Zink: Wrong color in Unigine Valley benchmark
- [anv] FINISHME: support YUV colorspace with DRM format modifiers
- 24.0.6: build fails
Changes
-------
Antoine Coutant (1):
- drisw: fix build without dri3
Bas Nieuwenhuizen (1):
- radv: Use zerovram for Enshrouded.
David Heidelberg (2):
- freedreno/ci: move the disabled jobs from include to the main file
- winsys/i915: depends on intel_wa.h
David Rosca (6):
- frontends/va: Only increment slice offset after first slice parameters
- radeonsi: Update buffer for other planes in si_alloc_resource
- frontends/va: Store slice types for H264 decode
- radeonsi/vcn: Ensure DPB has as many buffers as references
- radeonsi/vcn: Allow duplicate buffers in DPB
- radeonsi/vcn: Ensure at least one reference for H264 P/B frames
Eric Engestrom (5):
- docs: add sha256sum for 24.0.7
- .pick_status.json: Update to 18c53157318d6c8e572062f6bb768dfb621a55fd
- .pick_status.json: Update to e154f90aa9e71cc98375866c3ab24c4e08e66cb7
- .pick_status.json: Mark ae8fbe220ae67ffdce662c26bc4a634d475c0389 as denominated
- .pick_status.json: Update to a31996ce5a6b7eb3b324b71eb9e9c45173953c50
Faith Ekstrand (6):
- nvk: Re-emit sample locations when rasterization samples changes
- nvk/meta: Restore set_sizes[0]
- nouveau/winsys: Take a reference to BOs found in the cache
- drm-uapi: Sync nouveau_drm.h
- nouveau/winsys: Add back nouveau_ws_bo_new_tiled()
- vulkan/wsi: Bind memory planes, not YCbCr planes.
Friedrich Vock (2):
- aco/tests: Insert p_logical_start/end in reduce_temp tests
- aco/spill: Insert p_start_linear_vgpr right after p_logical_end
Georg Lehmann (1):
- zink: use bitcasts instead of pack/unpack double opcodes
José Expósito (1):
- meson: Update proc_macro2 meson.build patch
Karol Herbst (5):
- rusticl/event: use Weak refs for dependencies
- Revert "rusticl/event: use Weak refs for dependencies"
- event: break long dependency chains on drop
- rusticl/mesa/context: flush context before destruction
- nir/lower_cl_images: set binding also for samplers
Konstantin Seurer (3):
- radv: Fix radv_shader_arena_block list corruption
- radv: Remove arenas from capture_replay_arena_vas
- radv: Zero initialize capture replay group handles
Lionel Landwerlin (3):
- anv: fix ycbcr plane indexing with indirect descriptors
- anv: fix push constant subgroup_id location
- nir/divergence: add missing load_printf_buffer_address
Marek Olšák (1):
- util: shift the mask in BITSET_TEST_RANGE_INSIDE_WORD to be relative to b
Mike Blumenkrantz (8):
- egl/x11: disable dri3 with LIBGL_KOPPER_DRI2=1 as expected
- zink: add a batch ref for committed sparse resources
- u_blitter: stop leaking saved blitter states on no-op blits
- frontends/dri: only release pipe when screen init fails
- frontends/dri: always init opencl_func_mutex in InitScreen hooks
- zink: clean up semaphore arrays on batch state destroy
- nir/lower_aaline: fix for scalarized outputs
- nir/linking: fix nir_assign_io_var_locations for scalarized dual blend
Patrick Lerda (2):
- clover: fix memory leak related to optimize
- r600: fix vertex state update clover regression
Rhys Perry (1):
- aco/waitcnt: fix DS/VMEM ordered writes when mixed
Romain Naour (1):
- glxext: don't try zink if not enabled in mesa
Yiwei Zhang (5):
- turnip: msm: clean up iova on error path
- turnip: msm: fix racy gem close for re-imported dma-buf
- turnip: virtio: fix error path in virtio_bo_init
- turnip: virtio: fix iova leak upon found already imported dmabuf
- turnip: virtio: fix racy gem close for re-imported dma-buf


@@ -54,11 +54,42 @@ extern "C" {
*/
#define NOUVEAU_GETPARAM_EXEC_PUSH_MAX 17
/*
* NOUVEAU_GETPARAM_VRAM_BAR_SIZE - query bar size
*
* Query the VRAM BAR size.
*/
#define NOUVEAU_GETPARAM_VRAM_BAR_SIZE 18
/*
* NOUVEAU_GETPARAM_VRAM_USED
*
* Get remaining VRAM size.
*/
#define NOUVEAU_GETPARAM_VRAM_USED 19
/*
* NOUVEAU_GETPARAM_HAS_VMA_TILEMODE
*
* Query whether tile mode and PTE kind are accepted with VM allocs or not.
*/
#define NOUVEAU_GETPARAM_HAS_VMA_TILEMODE 20
struct drm_nouveau_getparam {
__u64 param;
__u64 value;
};
/*
* Those are used to support selecting the main engine used on Kepler.
* This goes into drm_nouveau_channel_alloc::tt_ctxdma_handle
*/
#define NOUVEAU_FIFO_ENGINE_GR 0x01
#define NOUVEAU_FIFO_ENGINE_VP 0x02
#define NOUVEAU_FIFO_ENGINE_PPP 0x04
#define NOUVEAU_FIFO_ENGINE_BSP 0x08
#define NOUVEAU_FIFO_ENGINE_CE 0x30
struct drm_nouveau_channel_alloc {
__u32 fb_ctxdma_handle;
__u32 tt_ctxdma_handle;
@@ -81,6 +112,18 @@ struct drm_nouveau_channel_free {
__s32 channel;
};
struct drm_nouveau_notifierobj_alloc {
__u32 channel;
__u32 handle;
__u32 size;
__u32 offset;
};
struct drm_nouveau_gpuobj_free {
__s32 channel;
__u32 handle;
};
#define NOUVEAU_GEM_DOMAIN_CPU (1 << 0)
#define NOUVEAU_GEM_DOMAIN_VRAM (1 << 1)
#define NOUVEAU_GEM_DOMAIN_GART (1 << 2)
@@ -238,34 +281,32 @@ struct drm_nouveau_vm_init {
struct drm_nouveau_vm_bind_op {
/**
* @op: the operation type
*
* Supported values:
*
* %DRM_NOUVEAU_VM_BIND_OP_MAP - Map a GEM object to the GPU's VA
* space. Optionally, the &DRM_NOUVEAU_VM_BIND_SPARSE flag can be
* passed to instruct the kernel to create sparse mappings for the
* given range.
*
* %DRM_NOUVEAU_VM_BIND_OP_UNMAP - Unmap an existing mapping in the
* GPU's VA space. If the region the mapping is located in is a
* sparse region, new sparse mappings are created where the unmapped
* (memory backed) mapping was mapped previously. To remove a sparse
* region the &DRM_NOUVEAU_VM_BIND_SPARSE must be set.
*/
__u32 op;
/**
* @DRM_NOUVEAU_VM_BIND_OP_MAP:
*
* Map a GEM object to the GPU's VA space. Optionally, the
* &DRM_NOUVEAU_VM_BIND_SPARSE flag can be passed to instruct the kernel to
* create sparse mappings for the given range.
*/
#define DRM_NOUVEAU_VM_BIND_OP_MAP 0x0
/**
* @DRM_NOUVEAU_VM_BIND_OP_UNMAP:
*
* Unmap an existing mapping in the GPU's VA space. If the region the mapping
* is located in is a sparse region, new sparse mappings are created where the
* unmapped (memory backed) mapping was mapped previously. To remove a sparse
* region the &DRM_NOUVEAU_VM_BIND_SPARSE must be set.
*/
#define DRM_NOUVEAU_VM_BIND_OP_UNMAP 0x1
/**
* @flags: the flags for a &drm_nouveau_vm_bind_op
*
* Supported values:
*
* %DRM_NOUVEAU_VM_BIND_SPARSE - Indicates that an allocated VA
* space region should be sparse.
*/
__u32 flags;
/**
* @DRM_NOUVEAU_VM_BIND_SPARSE:
*
* Indicates that an allocated VA space region should be sparse.
*/
#define DRM_NOUVEAU_VM_BIND_SPARSE (1 << 8)
/**
* @handle: the handle of the DRM GEM object to map
@@ -301,17 +342,17 @@ struct drm_nouveau_vm_bind {
__u32 op_count;
/**
* @flags: the flags for a &drm_nouveau_vm_bind ioctl
*
* Supported values:
*
* %DRM_NOUVEAU_VM_BIND_RUN_ASYNC - Indicates that the given VM_BIND
* operation should be executed asynchronously by the kernel.
*
* If this flag is not supplied the kernel executes the associated
* operations synchronously and doesn't accept any &drm_nouveau_sync
* objects.
*/
__u32 flags;
/**
* @DRM_NOUVEAU_VM_BIND_RUN_ASYNC:
*
* Indicates that the given VM_BIND operation should be executed asynchronously
* by the kernel.
*
* If this flag is not supplied the kernel executes the associated operations
* synchronously and doesn't accept any &drm_nouveau_sync objects.
*/
#define DRM_NOUVEAU_VM_BIND_RUN_ASYNC 0x1
/**
* @wait_count: the number of wait &drm_nouveau_syncs


@@ -10,7 +10,4 @@ dEQP-VK.draw.renderpass.multi_draw.mosaic.indexed_mixed.max_draws.stride_extra_1
dEQP-VK.pipeline.*line_stipple_enable
dEQP-VK.pipeline.*line_stipple_params
# New CTS flakes in 1.3.6.3
dEQP-VK.ray_tracing_pipeline.pipeline_library.configurations.(single|multi)threaded_compilation.*_check_(all|capture_replay)_handles
dEQP-VK.query_pool.statistics_query.host_query_reset.geometry_shader_invocations.secondary.32bits_triangle_list_clear_depth


@@ -1,5 +0,0 @@
# New CTS flakes in 1.3.6.3
dEQP-VK.ray_tracing_pipeline.pipeline_library.configurations.multithreaded_compilation.*_check_all_handles
dEQP-VK.ray_tracing_pipeline.pipeline_library.configurations.multithreaded_compilation.*_check_capture_replay_handles
dEQP-VK.ray_tracing_pipeline.pipeline_library.configurations.singlethreaded_compilation.*_check_all_handles
dEQP-VK.ray_tracing_pipeline.pipeline_library.configurations.singlethreaded_compilation.*_check_capture_replay_handles


@@ -428,18 +428,20 @@ check_instr(wait_ctx& ctx, wait_imm& wait, alu_delay_info& delay, Instruction* i
if (it == ctx.gpr_map.end())
continue;
wait_imm reg_imm = it->second.imm;
/* Vector Memory reads and writes return in the order they were issued */
uint8_t vmem_type = get_vmem_type(instr);
if (vmem_type && ((it->second.events & vm_events) == event_vmem) &&
it->second.vmem_types == vmem_type)
continue;
reg_imm.vm = wait_imm::unset_counter;
/* LDS reads and writes return in the order they were issued. same for GDS */
if (instr->isDS() &&
(it->second.events & lgkm_events) == (instr->ds().gds ? event_gds : event_lds))
continue;
reg_imm.lgkm = wait_imm::unset_counter;
wait.combine(it->second.imm);
wait.combine(reg_imm);
}
}
}


@@ -118,11 +118,16 @@ setup_reduce_temp(Program* program)
* would insert at the end instead of using this one. */
} else {
assert(last_top_level_block_idx < block.index);
/* insert before the branch at last top level block */
/* insert after p_logical_end of the last top-level block */
std::vector<aco_ptr<Instruction>>& instructions =
program->blocks[last_top_level_block_idx].instructions;
instructions.insert(std::next(instructions.begin(), instructions.size() - 1),
std::move(create));
auto insert_point =
std::find_if(instructions.rbegin(), instructions.rend(),
[](const auto& iter) {
return iter->opcode == aco_opcode::p_logical_end;
})
.base();
instructions.insert(insert_point, std::move(create));
inserted_at = last_top_level_block_idx;
}
}
@@ -161,8 +166,13 @@ setup_reduce_temp(Program* program)
assert(last_top_level_block_idx < block.index);
std::vector<aco_ptr<Instruction>>& instructions =
program->blocks[last_top_level_block_idx].instructions;
instructions.insert(std::next(instructions.begin(), instructions.size() - 1),
std::move(create));
auto insert_point =
std::find_if(instructions.rbegin(), instructions.rend(),
[](const auto& iter) {
return iter->opcode == aco_opcode::p_logical_end;
})
.base();
instructions.insert(insert_point, std::move(create));
vtmp_inserted_at = last_top_level_block_idx;
}
}
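Both hunks above switch from "insert before the block's final branch" to searching backwards for `p_logical_end` with reverse iterators. The subtlety is that `reverse_iterator::base()` points one element *past* the match, which is exactly the forward position that inserts directly after it. A minimal sketch of that idiom, with plain ints standing in for instructions:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert `value` immediately after the last occurrence of `marker`,
// mirroring how the hunks above place instructions after p_logical_end.
static std::vector<int> insert_after_marker(std::vector<int> v, int marker, int value)
{
   // Search backwards for the marker. base() of a reverse iterator that
   // points at the match is the forward iterator one past it, so
   // insert() there lands right after the marker (and before any
   // trailing branch). If the marker is absent, rend().base() == begin()
   // and the value is inserted at the front.
   auto insert_point =
      std::find_if(v.rbegin(), v.rend(),
                   [marker](int instr) { return instr == marker; })
         .base();
   v.insert(insert_point, value);
   return v;
}
```

Here 42 plays the role of `p_logical_end` and 99 the trailing branch: inserting 7 yields `{1, 2, 42, 7, 99}`, i.e. after the marker but before the branch.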


@@ -1845,10 +1845,16 @@ assign_spill_slots(spill_ctx& ctx, unsigned spills_to_vgpr)
instructions.emplace_back(std::move(create));
} else {
assert(last_top_level_block_idx < block.index);
/* insert before the branch at last top level block */
/* insert after p_logical_end of the last top-level block */
std::vector<aco_ptr<Instruction>>& block_instrs =
ctx.program->blocks[last_top_level_block_idx].instructions;
block_instrs.insert(std::prev(block_instrs.end()), std::move(create));
auto insert_point =
std::find_if(block_instrs.rbegin(), block_instrs.rend(),
[](const auto& iter) {
return iter->opcode == aco_opcode::p_logical_end;
})
.base();
block_instrs.insert(insert_point, std::move(create));
}
}
@@ -1885,10 +1891,16 @@ assign_spill_slots(spill_ctx& ctx, unsigned spills_to_vgpr)
instructions.emplace_back(std::move(create));
} else {
assert(last_top_level_block_idx < block.index);
/* insert before the branch at last top level block */
/* insert after p_logical_end of the last top-level block */
std::vector<aco_ptr<Instruction>>& block_instrs =
ctx.program->blocks[last_top_level_block_idx].instructions;
block_instrs.insert(std::prev(block_instrs.end()), std::move(create));
auto insert_point =
std::find_if(block_instrs.rbegin(), block_instrs.rend(),
[](const auto& iter) {
return iter->opcode == aco_opcode::p_logical_end;
})
.base();
block_instrs.insert(insert_point, std::move(create));
}
}


@@ -53,3 +53,71 @@ BEGIN_TEST(insert_waitcnt.ds_ordered_count)
finish_waitcnt_test();
END_TEST
BEGIN_TEST(insert_waitcnt.waw.mixed_vmem_lds.vmem)
if (!setup_cs(NULL, GFX10))
return;
Definition def_v4(PhysReg(260), v1);
Operand op_v0(PhysReg(256), v1);
Operand desc0(PhysReg(0), s4);
//>> BB0
//! /* logical preds: / linear preds: / kind: top-level, */
//! v1: %0:v[4] = buffer_load_dword %0:s[0-3], %0:v[0], 0
bld.mubuf(aco_opcode::buffer_load_dword, def_v4, desc0, op_v0, Operand::zero(), 0, false);
//>> BB1
//! /* logical preds: / linear preds: / kind: */
//! v1: %0:v[4] = ds_read_b32 %0:v[0]
bld.reset(program->create_and_insert_block());
bld.ds(aco_opcode::ds_read_b32, def_v4, op_v0);
bld.reset(program->create_and_insert_block());
program->blocks[2].linear_preds.push_back(0);
program->blocks[2].linear_preds.push_back(1);
program->blocks[2].logical_preds.push_back(0);
program->blocks[2].logical_preds.push_back(1);
//>> BB2
//! /* logical preds: BB0, BB1, / linear preds: BB0, BB1, / kind: uniform, */
//! s_waitcnt lgkmcnt(0)
//! v1: %0:v[4] = buffer_load_dword %0:s[0-3], %0:v[0], 0
bld.mubuf(aco_opcode::buffer_load_dword, def_v4, desc0, op_v0, Operand::zero(), 0, false);
finish_waitcnt_test();
END_TEST
BEGIN_TEST(insert_waitcnt.waw.mixed_vmem_lds.lds)
if (!setup_cs(NULL, GFX10))
return;
Definition def_v4(PhysReg(260), v1);
Operand op_v0(PhysReg(256), v1);
Operand desc0(PhysReg(0), s4);
//>> BB0
//! /* logical preds: / linear preds: / kind: top-level, */
//! v1: %0:v[4] = buffer_load_dword %0:s[0-3], %0:v[0], 0
bld.mubuf(aco_opcode::buffer_load_dword, def_v4, desc0, op_v0, Operand::zero(), 0, false);
//>> BB1
//! /* logical preds: / linear preds: / kind: */
//! v1: %0:v[4] = ds_read_b32 %0:v[0]
bld.reset(program->create_and_insert_block());
bld.ds(aco_opcode::ds_read_b32, def_v4, op_v0);
bld.reset(program->create_and_insert_block());
program->blocks[2].linear_preds.push_back(0);
program->blocks[2].linear_preds.push_back(1);
program->blocks[2].logical_preds.push_back(0);
program->blocks[2].logical_preds.push_back(1);
//>> BB2
//! /* logical preds: BB0, BB1, / linear preds: BB0, BB1, / kind: uniform, */
//! s_waitcnt vmcnt(0)
//! v1: %0:v[4] = ds_read_b32 %0:v[0]
bld.ds(aco_opcode::ds_read_b32, def_v4, op_v0);
finish_waitcnt_test();
END_TEST


@@ -41,6 +41,11 @@ BEGIN_TEST(setup_reduce_temp.divergent_if_phi)
if (!setup_cs("s2 v1", GFX9))
return;
//>> p_logical_start
//>> p_logical_end
bld.pseudo(aco_opcode::p_logical_start);
bld.pseudo(aco_opcode::p_logical_end);
//>> lv1: %lv = p_start_linear_vgpr
emit_divergent_if_else(
program.get(), bld, Operand(inputs[0]),


@@ -1320,9 +1320,6 @@ radv_DestroyDevice(VkDevice _device, const VkAllocationCallbacks *pAllocator)
if (!device)
return;
if (device->capture_replay_arena_vas)
_mesa_hash_table_u64_destroy(device->capture_replay_arena_vas);
radv_device_finish_perf_counter_lock_cs(device);
if (device->perf_counter_bo)
device->ws->buffer_destroy(device->ws, device->perf_counter_bo);
@@ -1372,6 +1369,8 @@ radv_DestroyDevice(VkDevice _device, const VkAllocationCallbacks *pAllocator)
radv_finish_trace(device);
radv_destroy_shader_arenas(device);
if (device->capture_replay_arena_vas)
_mesa_hash_table_u64_destroy(device->capture_replay_arena_vas);
radv_sqtt_finish(device);


@@ -943,8 +943,12 @@ radv_GetRayTracingCaptureReplayShaderGroupHandlesKHR(VkDevice device, VkPipeline
uint32_t recursive_shader = rt_pipeline->groups[firstGroup + i].recursive_shader;
if (recursive_shader != VK_SHADER_UNUSED_KHR) {
struct radv_shader *shader = rt_pipeline->stages[recursive_shader].shader;
if (shader)
data[i].recursive_shader_alloc = radv_serialize_shader_arena_block(shader->alloc);
if (shader) {
data[i].recursive_shader_alloc.offset = shader->alloc->offset;
data[i].recursive_shader_alloc.size = shader->alloc->size;
data[i].recursive_shader_alloc.arena_va = shader->alloc->arena->bo->va;
data[i].recursive_shader_alloc.arena_size = shader->alloc->arena->size;
}
}
data[i].non_recursive_idx = rt_pipeline->groups[firstGroup + i].handle.any_hit_index;
}


@@ -982,6 +982,7 @@ alloc_block_obj(struct radv_device *device)
static void
free_block_obj(struct radv_device *device, union radv_shader_arena_block *block)
{
list_del(&block->pool);
list_add(&block->pool, &device->shader_block_obj_pool);
}
@@ -1267,7 +1268,6 @@ radv_free_shader_memory(struct radv_device *device, union radv_shader_arena_bloc
remove_hole(free_list, hole_prev);
hole_prev->size += hole->size;
list_del(&hole->list);
free_block_obj(device, hole);
hole = hole_prev;
@@ -1280,7 +1280,6 @@ radv_free_shader_memory(struct radv_device *device, union radv_shader_arena_bloc
hole_next->offset -= hole->size;
hole_next->size += hole->size;
list_del(&hole->list);
free_block_obj(device, hole);
hole = hole_next;
@@ -1293,6 +1292,18 @@ radv_free_shader_memory(struct radv_device *device, union radv_shader_arena_bloc
radv_rmv_log_bo_destroy(device, arena->bo);
device->ws->buffer_destroy(device->ws, arena->bo);
list_del(&arena->list);
if (device->capture_replay_arena_vas) {
struct hash_entry *arena_entry = NULL;
hash_table_foreach (device->capture_replay_arena_vas->table, entry) {
if (entry->data == arena) {
arena_entry = entry;
break;
}
}
_mesa_hash_table_remove(device->capture_replay_arena_vas->table, arena_entry);
}
free(arena);
} else if (free_list) {
add_hole(free_list, hole);
@@ -1301,18 +1312,6 @@ radv_free_shader_memory(struct radv_device *device, union radv_shader_arena_bloc
mtx_unlock(&device->shader_arena_mutex);
}
struct radv_serialized_shader_arena_block
radv_serialize_shader_arena_block(union radv_shader_arena_block *block)
{
struct radv_serialized_shader_arena_block serialized_block = {
.offset = block->offset,
.size = block->size,
.arena_va = block->arena->bo->va,
.arena_size = block->arena->size,
};
return serialized_block;
}
union radv_shader_arena_block *
radv_replay_shader_arena_block(struct radv_device *device, const struct radv_serialized_shader_arena_block *src,
void *ptr)
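The first hunk above adds a `list_del()` before `free_block_obj()`'s `list_add()`: an intrusive list node that may still be linked must be unlinked before it joins another list, or both lists end up corrupted. A reduced sketch with Mesa-style semantics (`list_init`/`list_add`/`list_del` here are local stand-ins, not the real `util/list.h`):

```cpp
#include <cassert>

// Minimal circular doubly-linked intrusive list, modeled on Mesa's
// util/list.h conventions: an empty head points at itself, and
// list_add() inserts the item right after the head.
struct node { node* prev; node* next; };

static void list_init(node* head) { head->prev = head->next = head; }

static void list_del(node* n)
{
   n->prev->next = n->next;
   n->next->prev = n->prev;
}

static void list_add(node* n, node* head)
{
   n->next = head->next;
   n->prev = head;
   head->next->prev = n;
   head->next = n;
}

static unsigned list_length(node* head)
{
   unsigned len = 0;
   for (node* n = head->next; n != head; n = n->next)
      len++;
   return len;
}
```

Moving a node to a second list without the `list_del()` would leave the first list's neighbors still pointing at the node, which is exactly the block-pool corruption the commit message describes.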


@@ -775,8 +775,6 @@ union radv_shader_arena_block *radv_replay_shader_arena_block(struct radv_device
const struct radv_serialized_shader_arena_block *src,
void *ptr);
struct radv_serialized_shader_arena_block radv_serialize_shader_arena_block(union radv_shader_arena_block *block);
void radv_free_shader_memory(struct radv_device *device, union radv_shader_arena_block *alloc);
struct radv_shader *radv_create_trap_handler_shader(struct radv_device *device);


@@ -216,6 +216,7 @@ visit_intrinsic(nir_shader *shader, nir_intrinsic_instr *instr)
case nir_intrinsic_load_rasterization_primitive_amd:
case nir_intrinsic_load_global_constant_uniform_block_intel:
case nir_intrinsic_cmat_length:
case nir_intrinsic_load_printf_buffer_address:
is_divergent = false;
break;


@@ -1492,7 +1492,7 @@ nir_assign_io_var_locations(nir_shader *shader, nir_variable_mode mode,
unsigned *size, gl_shader_stage stage)
{
unsigned location = 0;
unsigned assigned_locations[VARYING_SLOT_TESS_MAX];
unsigned assigned_locations[VARYING_SLOT_TESS_MAX][2];
uint64_t processed_locs[2] = { 0 };
struct exec_list io_vars;
@@ -1584,7 +1584,7 @@ nir_assign_io_var_locations(nir_shader *shader, nir_variable_mode mode,
if (processed) {
/* TODO handle overlapping per-view variables */
assert(!var->data.per_view);
unsigned driver_location = assigned_locations[var->data.location];
unsigned driver_location = assigned_locations[var->data.location][var->data.index];
var->data.driver_location = driver_location;
/* An array may be packed such that is crosses multiple other arrays
@@ -1605,7 +1605,7 @@ nir_assign_io_var_locations(nir_shader *shader, nir_variable_mode mode,
unsigned num_unallocated_slots = last_slot_location - location;
unsigned first_unallocated_slot = var_size - num_unallocated_slots;
for (unsigned i = first_unallocated_slot; i < var_size; i++) {
assigned_locations[var->data.location + i] = location;
assigned_locations[var->data.location + i][var->data.index] = location;
location++;
}
}
@@ -1613,7 +1613,7 @@ nir_assign_io_var_locations(nir_shader *shader, nir_variable_mode mode,
}
for (unsigned i = 0; i < var_size; i++) {
assigned_locations[var->data.location + i] = location + i;
assigned_locations[var->data.location + i][var->data.index] = location + i;
}
var->data.driver_location = location;
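The change above widens `assigned_locations` to a second dimension keyed by `var->data.index`, so the two dual-source-blend outputs that share a location no longer clobber each other's driver location once scalarized. A reduced sketch of the bookkeeping, using hypothetical fixed-size tables rather than the real NIR data structures:

```cpp
#include <cassert>

// Toy model: each output variable has a location, a dual-source blend
// index (data.index), and a driver_location to be assigned. Variables
// sharing (location, index) must get the same driver location;
// variables differing only in index must get distinct ones — which a
// 1-D table keyed only on location cannot express.
struct io_var { unsigned location; unsigned index; unsigned driver_location; };

static void assign_locations(io_var* vars, unsigned count)
{
   unsigned assigned[8][2] = {};
   bool processed[8][2] = {};
   unsigned next = 0;
   for (unsigned i = 0; i < count; i++) {
      io_var* v = &vars[i];
      if (processed[v->location][v->index]) {
         // Already seen this (location, index) pair — reuse its slot.
         v->driver_location = assigned[v->location][v->index];
         continue;
      }
      assigned[v->location][v->index] = next;
      processed[v->location][v->index] = true;
      v->driver_location = next++;
   }
}
```

With the old 1-D table, the scalarized index-1 output would overwrite the entry for index 0, and every later lookup at that location would return the highest driver location — the bug the commit message describes.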


@@ -161,6 +161,7 @@ nir_lower_cl_images(nir_shader *shader, bool lower_image_derefs, bool lower_samp
assert(var->data.location > last_loc);
last_loc = var->data.location;
var->data.driver_location = num_samplers++;
var->data.binding = var->data.driver_location;
} else {
/* CL shouldn't have any sampled images */
assert(!glsl_type_is_sampler(var->type));


@@ -1518,7 +1518,8 @@ dri2_initialize_x11_swrast(_EGLDisplay *disp)
*/
dri2_dpy->driver_name = strdup(disp->Options.Zink ? "zink" : "swrast");
if (disp->Options.Zink &&
!debug_get_bool_option("LIBGL_DRI3_DISABLE", false))
!debug_get_bool_option("LIBGL_DRI3_DISABLE", false) &&
!debug_get_bool_option("LIBGL_KOPPER_DRI2", false))
dri3_x11_connect(dri2_dpy);
if (!dri2_load_driver_swrast(disp))
goto cleanup;


@@ -292,25 +292,6 @@
tags:
- google-freedreno-db410c
# New jobs. Leave it as manual for now.
.a306_piglit:
extends:
- .piglit-test
- .a306-test
- .google-freedreno-manual-rules
variables:
HWCI_START_XORG: 1
# Something happened and now this hangchecks and doesn't recover. Unknown when
# it started.
.a306_piglit_gl:
extends:
- .a306_piglit
variables:
PIGLIT_PROFILES: quick_gl
BM_KERNEL_EXTRA_ARGS: "msm.num_hw_submissions=1"
FDO_CI_CONCURRENT: 3
# 8 devices (2023-04-15)
.a530-test:
extends:


@@ -10,6 +10,25 @@ a306_gl:
FDO_CI_CONCURRENT: 6
parallel: 5
# New jobs. Leave it as manual for now.
.a306_piglit:
extends:
- .piglit-test
- .a306-test
- .google-freedreno-manual-rules
variables:
HWCI_START_XORG: 1
# Something happened and now this hangchecks and doesn't recover. Unknown when
# it started.
.a306_piglit_gl:
extends:
- .a306_piglit
variables:
PIGLIT_PROFILES: quick_gl
BM_KERNEL_EXTRA_ARGS: "msm.num_hw_submissions=1"
FDO_CI_CONCURRENT: 3
a306_piglit_shader:
extends:
- .a306_piglit


@@ -20,5 +20,8 @@ IncludeCategories:
- Regex: '.*'
Priority: 1
ForEachMacros:
- u_vector_foreach
SpaceAfterCStyleCast: true
SpaceBeforeCpp11BracedList: true


@@ -15,12 +15,12 @@
struct tu_u_trace_syncobj;
struct vdrm_bo;
enum tu_bo_alloc_flags
{
enum tu_bo_alloc_flags {
TU_BO_ALLOC_NO_FLAGS = 0,
TU_BO_ALLOC_ALLOW_DUMP = 1 << 0,
TU_BO_ALLOC_GPU_READ_ONLY = 1 << 1,
TU_BO_ALLOC_REPLAYABLE = 1 << 2,
TU_BO_ALLOC_DMABUF = 1 << 4,
};
/* Define tu_timeline_sync type based on drm syncobj for a point type


@@ -321,44 +321,68 @@ tu_free_zombie_vma_locked(struct tu_device *dev, bool wait)
last_signaled_fence = vma->fence;
}
/* Ensure that internal kernel's vma is freed. */
struct drm_msm_gem_info req = {
.handle = vma->gem_handle,
.info = MSM_INFO_SET_IOVA,
.value = 0,
};
if (vma->gem_handle) {
/* Ensure that internal kernel's vma is freed. */
struct drm_msm_gem_info req = {
.handle = vma->gem_handle,
.info = MSM_INFO_SET_IOVA,
.value = 0,
};
int ret =
drmCommandWriteRead(dev->fd, DRM_MSM_GEM_INFO, &req, sizeof(req));
if (ret < 0) {
mesa_loge("MSM_INFO_SET_IOVA(0) failed! %d (%s)", ret,
strerror(errno));
return VK_ERROR_UNKNOWN;
int ret =
drmCommandWriteRead(dev->fd, DRM_MSM_GEM_INFO, &req, sizeof(req));
if (ret < 0) {
mesa_loge("MSM_INFO_SET_IOVA(0) failed! %d (%s)", ret,
strerror(errno));
return VK_ERROR_UNKNOWN;
}
tu_gem_close(dev, vma->gem_handle);
util_vma_heap_free(&dev->vma, vma->iova, vma->size);
}
tu_gem_close(dev, vma->gem_handle);
util_vma_heap_free(&dev->vma, vma->iova, vma->size);
u_vector_remove(&dev->zombie_vmas);
}
return VK_SUCCESS;
}
static bool
tu_restore_from_zombie_vma_locked(struct tu_device *dev,
uint32_t gem_handle,
uint64_t *iova)
{
struct tu_zombie_vma *vma;
u_vector_foreach (vma, &dev->zombie_vmas) {
if (vma->gem_handle == gem_handle) {
*iova = vma->iova;
/* mark to skip later gem and iova cleanup */
vma->gem_handle = 0;
return true;
}
}
return false;
}
static VkResult
msm_allocate_userspace_iova(struct tu_device *dev,
uint32_t gem_handle,
uint64_t size,
uint64_t client_iova,
enum tu_bo_alloc_flags flags,
uint64_t *iova)
msm_allocate_userspace_iova_locked(struct tu_device *dev,
uint32_t gem_handle,
uint64_t size,
uint64_t client_iova,
enum tu_bo_alloc_flags flags,
uint64_t *iova)
{
VkResult result;
mtx_lock(&dev->vma_mutex);
*iova = 0;
if ((flags & TU_BO_ALLOC_DMABUF) &&
tu_restore_from_zombie_vma_locked(dev, gem_handle, iova))
return VK_SUCCESS;
tu_free_zombie_vma_locked(dev, false);
result = tu_allocate_userspace_iova(dev, size, client_iova, flags, iova);
@@ -372,8 +396,6 @@ msm_allocate_userspace_iova(struct tu_device *dev,
result = tu_allocate_userspace_iova(dev, size, client_iova, flags, iova);
}
mtx_unlock(&dev->vma_mutex);
if (result != VK_SUCCESS)
return result;
@@ -386,6 +408,7 @@ msm_allocate_userspace_iova(struct tu_device *dev,
int ret =
drmCommandWriteRead(dev->fd, DRM_MSM_GEM_INFO, &req, sizeof(req));
if (ret < 0) {
util_vma_heap_free(&dev->vma, *iova, size);
mesa_loge("MSM_INFO_SET_IOVA failed! %d (%s)", ret, strerror(errno));
return VK_ERROR_OUT_OF_HOST_MEMORY;
}
@@ -420,8 +443,8 @@ tu_bo_init(struct tu_device *dev,
assert(!client_iova || dev->physical_device->has_set_iova);
if (dev->physical_device->has_set_iova) {
result = msm_allocate_userspace_iova(dev, gem_handle, size, client_iova,
flags, &iova);
result = msm_allocate_userspace_iova_locked(dev, gem_handle, size,
client_iova, flags, &iova);
} else {
result = tu_allocate_kernel_iova(dev, gem_handle, &iova);
}
@@ -445,6 +468,8 @@ tu_bo_init(struct tu_device *dev,
if (!new_ptr) {
dev->bo_count--;
mtx_unlock(&dev->bo_mutex);
if (dev->physical_device->has_set_iova)
util_vma_heap_free(&dev->vma, iova, size);
tu_gem_close(dev, gem_handle);
return VK_ERROR_OUT_OF_HOST_MEMORY;
}
@@ -506,6 +531,20 @@ tu_bo_set_kernel_name(struct tu_device *dev, struct tu_bo *bo, const char *name)
}
}
static inline void
msm_vma_lock(struct tu_device *dev)
{
if (dev->physical_device->has_set_iova)
mtx_lock(&dev->vma_mutex);
}
static inline void
msm_vma_unlock(struct tu_device *dev)
{
if (dev->physical_device->has_set_iova)
mtx_unlock(&dev->vma_mutex);
}
static VkResult
msm_bo_init(struct tu_device *dev,
struct tu_bo **out_bo,
@@ -541,9 +580,15 @@ msm_bo_init(struct tu_device *dev,
struct tu_bo* bo = tu_device_lookup_bo(dev, req.handle);
assert(bo && bo->gem_handle == 0);
assert(!(flags & TU_BO_ALLOC_DMABUF));
msm_vma_lock(dev);
VkResult result =
tu_bo_init(dev, bo, req.handle, size, client_iova, flags, name);
msm_vma_unlock(dev);
if (result != VK_SUCCESS)
memset(bo, 0, sizeof(*bo));
else
@@ -591,11 +636,13 @@ msm_bo_init_dmabuf(struct tu_device *dev,
* to happen in parallel.
*/
u_rwlock_wrlock(&dev->dma_bo_lock);
msm_vma_lock(dev);
uint32_t gem_handle;
int ret = drmPrimeFDToHandle(dev->fd, prime_fd,
&gem_handle);
if (ret) {
msm_vma_unlock(dev);
u_rwlock_wrunlock(&dev->dma_bo_lock);
return vk_error(dev, VK_ERROR_INVALID_EXTERNAL_HANDLE);
}
@@ -604,6 +651,7 @@ msm_bo_init_dmabuf(struct tu_device *dev,
if (bo->refcnt != 0) {
p_atomic_inc(&bo->refcnt);
msm_vma_unlock(dev);
u_rwlock_wrunlock(&dev->dma_bo_lock);
*out_bo = bo;
@@ -611,13 +659,14 @@ msm_bo_init_dmabuf(struct tu_device *dev,
}
VkResult result =
tu_bo_init(dev, bo, gem_handle, size, 0, TU_BO_ALLOC_NO_FLAGS, "dmabuf");
tu_bo_init(dev, bo, gem_handle, size, 0, TU_BO_ALLOC_DMABUF, "dmabuf");
if (result != VK_SUCCESS)
memset(bo, 0, sizeof(*bo));
else
*out_bo = bo;
msm_vma_unlock(dev);
u_rwlock_wrunlock(&dev->dma_bo_lock);
return result;
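The `tu_restore_from_zombie_vma_locked` helper added above handles re-importing a dma-buf whose BO is still on the zombie list: the old iova is reused, and `gem_handle` is zeroed so the deferred cleanup pass skips that entry instead of closing the GEM handle out from under the new import. A reduced sketch of the pattern, with `std::vector` standing in for `u_vector`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy zombie entry: a BO pending deferred destruction. Real entries
// also carry a fence; that is omitted here.
struct zombie_vma { uint32_t gem_handle; uint64_t iova; };

// If the handle is still on the zombie list, hand its iova back to the
// new import and zero gem_handle so the cleanup pass skips this entry.
static bool restore_from_zombie(std::vector<zombie_vma>& zombies,
                                uint32_t gem_handle, uint64_t* iova)
{
   for (zombie_vma& vma : zombies) {
      if (vma.gem_handle == gem_handle) {
         *iova = vma.iova;
         vma.gem_handle = 0; /* mark: skip later gem close + iova free */
         return true;
      }
   }
   return false;
}
```

Zeroing the handle rather than removing the entry keeps the list stable for the in-flight cleanup walk; real GEM handles are nonzero, so 0 can safely serve as the "already reclaimed" marker.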


@@ -412,14 +412,16 @@ tu_free_zombie_vma_locked(struct tu_device *dev, bool wait)
last_signaled_fence = vma->fence;
}
set_iova(dev, vma->res_id, 0);
u_vector_remove(&dev->zombie_vmas);
struct tu_zombie_vma *vma2 = (struct tu_zombie_vma *)
u_vector_add(&vdev->zombie_vmas_stage_2);
if (vma->gem_handle) {
set_iova(dev, vma->res_id, 0);
*vma2 = *vma;
struct tu_zombie_vma *vma2 =
(struct tu_zombie_vma *) u_vector_add(&vdev->zombie_vmas_stage_2);
*vma2 = *vma;
}
}
/* And _then_ close the GEM handles: */
@@ -434,19 +436,44 @@ tu_free_zombie_vma_locked(struct tu_device *dev, bool wait)
return VK_SUCCESS;
}
static bool
tu_restore_from_zombie_vma_locked(struct tu_device *dev,
uint32_t gem_handle,
uint64_t *iova)
{
struct tu_zombie_vma *vma;
u_vector_foreach (vma, &dev->zombie_vmas) {
if (vma->gem_handle == gem_handle) {
*iova = vma->iova;
/* mark to skip later vdrm bo and iova cleanup */
vma->gem_handle = 0;
return true;
}
}
return false;
}
static VkResult
virtio_allocate_userspace_iova(struct tu_device *dev,
uint64_t size,
uint64_t client_iova,
enum tu_bo_alloc_flags flags,
uint64_t *iova)
virtio_allocate_userspace_iova_locked(struct tu_device *dev,
uint32_t gem_handle,
uint64_t size,
uint64_t client_iova,
enum tu_bo_alloc_flags flags,
uint64_t *iova)
{
VkResult result;
mtx_lock(&dev->vma_mutex);
*iova = 0;
if (flags & TU_BO_ALLOC_DMABUF) {
assert(gem_handle);
if (tu_restore_from_zombie_vma_locked(dev, gem_handle, iova))
return VK_SUCCESS;
}
tu_free_zombie_vma_locked(dev, false);
result = tu_allocate_userspace_iova(dev, size, client_iova, flags, iova);
@@ -460,8 +487,6 @@ virtio_allocate_userspace_iova(struct tu_device *dev,
result = tu_allocate_userspace_iova(dev, size, client_iova, flags, iova);
}
mtx_unlock(&dev->vma_mutex);
return result;
}
@@ -571,12 +596,8 @@ virtio_bo_init(struct tu_device *dev,
.size = size,
};
VkResult result;
result = virtio_allocate_userspace_iova(dev, size, client_iova,
flags, &req.iova);
if (result != VK_SUCCESS) {
return result;
}
uint32_t res_id;
struct tu_bo *bo;
if (mem_property & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) {
if (mem_property & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) {
@@ -601,6 +622,16 @@ virtio_bo_init(struct tu_device *dev,
if (flags & TU_BO_ALLOC_GPU_READ_ONLY)
req.flags |= MSM_BO_GPU_READONLY;
assert(!(flags & TU_BO_ALLOC_DMABUF));
mtx_lock(&dev->vma_mutex);
result = virtio_allocate_userspace_iova_locked(dev, 0, size, client_iova,
flags, &req.iova);
mtx_unlock(&dev->vma_mutex);
if (result != VK_SUCCESS)
return result;
/* tunneled cmds are processed separately on host side,
* before the renderer->get_blob() callback.. the blob_id
* is used to link the created bo to the get_blob() call
@@ -611,27 +642,28 @@ virtio_bo_init(struct tu_device *dev,
vdrm_bo_create(vdev->vdrm, size, blob_flags, req.blob_id, &req.hdr);
if (!handle) {
util_vma_heap_free(&dev->vma, req.iova, size);
return vk_error(dev, VK_ERROR_OUT_OF_DEVICE_MEMORY);
result = VK_ERROR_OUT_OF_DEVICE_MEMORY;
goto fail;
}
uint32_t res_id = vdrm_handle_to_res_id(vdev->vdrm, handle);
struct tu_bo* bo = tu_device_lookup_bo(dev, res_id);
res_id = vdrm_handle_to_res_id(vdev->vdrm, handle);
bo = tu_device_lookup_bo(dev, res_id);
assert(bo && bo->gem_handle == 0);
bo->res_id = res_id;
result = tu_bo_init(dev, bo, handle, size, req.iova, flags, name);
if (result != VK_SUCCESS)
if (result != VK_SUCCESS) {
memset(bo, 0, sizeof(*bo));
else
*out_bo = bo;
goto fail;
}
*out_bo = bo;
/* We don't use bo->name here because for the !TU_DEBUG=bo case bo->name is NULL. */
tu_bo_set_kernel_name(dev, bo, name);
if (result == VK_SUCCESS &&
(mem_property & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) &&
if ((mem_property & VK_MEMORY_PROPERTY_HOST_CACHED_BIT) &&
!(mem_property & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT)) {
tu_bo_map(dev, bo);
@@ -644,6 +676,12 @@ virtio_bo_init(struct tu_device *dev,
tu_sync_cache_bo(dev, bo, 0, VK_WHOLE_SIZE, TU_MEM_SYNC_CACHE_TO_GPU);
}
return VK_SUCCESS;
fail:
mtx_lock(&dev->vma_mutex);
util_vma_heap_free(&dev->vma, req.iova, size);
mtx_unlock(&dev->vma_mutex);
return result;
}
@@ -666,11 +704,6 @@ virtio_bo_init_dmabuf(struct tu_device *dev,
/* iova allocation needs to consider the object's *real* size: */
size = real_size;
uint64_t iova;
result = virtio_allocate_userspace_iova(dev, size, 0, TU_BO_ALLOC_NO_FLAGS, &iova);
if (result != VK_SUCCESS)
return result;
/* Importing the same dmabuf several times would yield the same
* gem_handle. Thus there could be a race when destroying
* BO and importing the same dmabuf from different threads.
@@ -678,8 +711,10 @@ virtio_bo_init_dmabuf(struct tu_device *dev,
* to happen in parallel.
*/
u_rwlock_wrlock(&dev->dma_bo_lock);
mtx_lock(&dev->vma_mutex);
uint32_t handle, res_id;
uint64_t iova;
handle = vdrm_dmabuf_to_handle(vdrm, prime_fd);
if (!handle) {
@@ -689,6 +724,7 @@ virtio_bo_init_dmabuf(struct tu_device *dev,
res_id = vdrm_handle_to_res_id(vdrm, handle);
if (!res_id) {
/* XXX gem_handle potentially leaked here since no refcnt */
result = vk_error(dev, VK_ERROR_INVALID_EXTERNAL_HANDLE);
goto out_unlock;
}
@@ -702,21 +738,25 @@ virtio_bo_init_dmabuf(struct tu_device *dev,
goto out_unlock;
}
result = tu_bo_init(dev, bo, handle, size, iova,
TU_BO_ALLOC_NO_FLAGS, "dmabuf");
if (result != VK_SUCCESS)
memset(bo, 0, sizeof(*bo));
else
*out_bo = bo;
out_unlock:
u_rwlock_wrunlock(&dev->dma_bo_lock);
result = virtio_allocate_userspace_iova_locked(dev, handle, size, 0,
TU_BO_ALLOC_DMABUF, &iova);
if (result != VK_SUCCESS) {
mtx_lock(&dev->vma_mutex);
util_vma_heap_free(&dev->vma, iova, size);
mtx_unlock(&dev->vma_mutex);
vdrm_bo_close(dev->vdev->vdrm, handle);
goto out_unlock;
}
result =
tu_bo_init(dev, bo, handle, size, iova, TU_BO_ALLOC_NO_FLAGS, "dmabuf");
if (result != VK_SUCCESS) {
util_vma_heap_free(&dev->vma, iova, size);
memset(bo, 0, sizeof(*bo));
} else {
*out_bo = bo;
}
out_unlock:
mtx_unlock(&dev->vma_mutex);
u_rwlock_wrunlock(&dev->dma_bo_lock);
return result;
}


@@ -177,6 +177,9 @@ lower_aaline_instr(nir_builder *b, nir_instr *instr, void *data)
return false;
if (var->data.location < FRAG_RESULT_DATA0 && var->data.location != FRAG_RESULT_COLOR)
return false;
uint32_t mask = nir_intrinsic_write_mask(intrin) << var->data.location_frac;
if (!(mask & BITFIELD_BIT(3)))
return false;
nir_def *out_input = intrin->src[1].ssa;
b->cursor = nir_before_instr(instr);
@@ -223,12 +226,10 @@ lower_aaline_instr(nir_builder *b, nir_instr *instr, void *data)
tmp = nir_fmul(b, nir_channel(b, tmp, 0),
nir_fmin(b, nir_channel(b, tmp, 1), max));
tmp = nir_fmul(b, nir_channel(b, out_input, 3), tmp);
tmp = nir_fmul(b, nir_channel(b, out_input, out_input->num_components - 1), tmp);
nir_def *out = nir_vec4(b, nir_channel(b, out_input, 0),
nir_channel(b, out_input, 1),
nir_channel(b, out_input, 2),
tmp);
nir_def *out = nir_vector_insert_imm(b, out_input, tmp,
out_input->num_components - 1);
nir_src_rewrite(&intrin->src[1], out);
return true;
}
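The first hunk above bails out unless the store actually writes alpha: with scalarized outputs a store covers only some components, so its write mask must be shifted by the store's starting component (`location_frac`) before testing bit 3. A small sketch of that check:

```cpp
#include <cassert>

// Does this (possibly scalarized) output store touch the alpha channel?
// write_mask is relative to the store's first component, so shift it by
// location_frac to express it in whole-vec4 component space first.
static bool store_writes_alpha(unsigned write_mask, unsigned location_frac)
{
   unsigned mask = write_mask << location_frac;
   return (mask & (1u << 3)) != 0;
}
```

A full vec4 store (`mask 0xf, frac 0`) and a lone scalar store of `.w` (`mask 0x1, frac 3`) both pass; an `.xyz`-only store does not, so the pass correctly skips it.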


@@ -2014,6 +2014,7 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
unsigned dst_sample)
{
struct blitter_context_priv *ctx = (struct blitter_context_priv*)blitter;
unsigned count = 0;
struct pipe_context *pipe = ctx->base.pipe;
enum pipe_texture_target src_target = src->target;
unsigned src_samples = src->texture->nr_samples;
@@ -2038,7 +2039,7 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
/* Return if there is nothing to do. */
if (!dst_has_color && !dst_has_depth && !dst_has_stencil) {
return;
goto out;
}
bool is_scaled = dstbox->width != abs(srcbox->width) ||
@@ -2170,7 +2171,6 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
}
/* Set samplers. */
unsigned count = 0;
if (src_has_depth && src_has_stencil &&
(dst_has_color || (dst_has_depth && dst_has_stencil))) {
/* Setup two samplers, one for depth and the other one for stencil. */
@@ -2223,7 +2223,8 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
do_blits(ctx, dst, dstbox, src, src_width0, src_height0,
srcbox, dst_has_depth || dst_has_stencil, use_txf, sample0_only,
dst_sample);
util_blitter_unset_running_flag(blitter);
out:
util_blitter_restore_vertex_states(blitter);
util_blitter_restore_fragment_states(blitter);
util_blitter_restore_textures_internal(blitter, count);
@@ -2232,7 +2233,6 @@ void util_blitter_blit_generic(struct blitter_context *blitter,
pipe->set_scissor_states(pipe, 0, 1, &ctx->base.saved_scissor);
}
util_blitter_restore_render_cond(blitter);
util_blitter_unset_running_flag(blitter);
}
void
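The fix above hoists `count` to the top of the function and turns the early `return` into `goto out`, so the no-op path still restores the saved blitter state instead of leaking it. The hoist matters: any variable read at the cleanup label must be initialized before the first `goto` that can jump over the code that would otherwise set it. A reduced sketch of the pattern (`g_saved_states` is a hypothetical stand-in for the saved state, not real u_blitter API):

```cpp
#include <cassert>

static int g_saved_states = 0; // >0 means saved state is leaked

static void save_states(void)            { g_saved_states++; }
static void restore_states(unsigned count) { (void)count; g_saved_states--; }

// Shared-cleanup shape of util_blitter_blit_generic after the fix:
// every exit path, including the "nothing to do" one, reaches `out`.
static void blit(bool nothing_to_do)
{
   save_states();
   unsigned count = 0;  /* initialized before any `goto out` */
   if (nothing_to_do)
      goto out;         /* previously a bare return -> state leak */
   count = 2;           /* samplers bound on the real blit path */
out:
   restore_states(count);
}
```

With the old early `return`, the no-op path left `g_saved_states` at 1 forever; routing it through `out` balances every `save_states()` with a `restore_states()`.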


@@ -2137,7 +2137,8 @@ static void evergreen_emit_vertex_buffers(struct r600_context *rctx,
{
struct radeon_cmdbuf *cs = &rctx->b.gfx.cs;
struct r600_fetch_shader *shader = (struct r600_fetch_shader*)rctx->vertex_fetch_shader.cso;
uint32_t dirty_mask = state->dirty_mask & shader->buffer_mask;
uint32_t buffer_mask = shader ? shader->buffer_mask : ~0;
uint32_t dirty_mask = state->dirty_mask & buffer_mask;
while (dirty_mask) {
struct pipe_vertex_buffer *vb;
@@ -2176,7 +2177,7 @@ static void evergreen_emit_vertex_buffers(struct r600_context *rctx,
radeon_emit(cs, radeon_add_to_buffer_list(&rctx->b, &rctx->b.gfx, rbuffer,
RADEON_USAGE_READ | RADEON_PRIO_VERTEX_BUFFER));
}
state->dirty_mask &= ~shader->buffer_mask;
state->dirty_mask &= ~buffer_mask;
}
static void evergreen_fs_emit_vertex_buffers(struct r600_context *rctx, struct r600_atom * atom)


@@ -239,15 +239,13 @@ static rvcn_dec_message_avc_t get_h264_msg(struct radeon_decoder *dec,
}
}
/* if reference picture exists, however no reference picture found at the end
curr_pic_ref_frame_num == 0, which is not reasonable, should be corrected. */
if (result.used_for_reference_flags && (result.curr_pic_ref_frame_num == 0)) {
for (i = 0; i < ARRAY_SIZE(result.ref_frame_list); i++) {
result.ref_frame_list[i] = pic->ref[i] ?
(uintptr_t)vl_video_buffer_get_associated_data(pic->ref[i], &dec->base) : 0xff;
if (result.ref_frame_list[i] != 0xff) {
/* need at least one reference for P/B frames */
if (result.curr_pic_ref_frame_num == 0 && pic->slice_parameter.slice_info_present) {
for (i = 0; i < pic->slice_count; i++) {
if (pic->slice_parameter.slice_type[i] % 5 != 2) {
result.curr_pic_ref_frame_num++;
result.non_existing_frame_flags &= ~(1 << i);
result.ref_frame_list[0] = 0;
result.non_existing_frame_flags &= ~1;
break;
}
}
@@ -279,7 +277,8 @@ static rvcn_dec_message_avc_t get_h264_msg(struct radeon_decoder *dec,
dec->ref_codec.bts = CODEC_8_BITS;
dec->ref_codec.index = result.decoded_pic_idx;
dec->ref_codec.ref_size = 16;
memset(dec->ref_codec.ref_list, 0xff, sizeof(dec->ref_codec.ref_list));
dec->ref_codec.num_refs = result.curr_pic_ref_frame_num;
STATIC_ASSERT(sizeof(dec->ref_codec.ref_list) == sizeof(result.ref_frame_list));
memcpy(dec->ref_codec.ref_list, result.ref_frame_list, sizeof(result.ref_frame_list));
}
@@ -292,7 +291,7 @@ static rvcn_dec_message_hevc_t get_h265_msg(struct radeon_decoder *dec,
struct pipe_h265_picture_desc *pic)
{
rvcn_dec_message_hevc_t result;
unsigned i, j;
unsigned i, j, num_refs = 0;
memset(&result, 0, sizeof(result));
result.sps_info_flags = 0;
@@ -413,9 +412,10 @@ static rvcn_dec_message_hevc_t get_h265_msg(struct radeon_decoder *dec,
result.poc_list[i] = pic->PicOrderCntVal[i];
if (ref)
if (ref) {
ref_pic = (uintptr_t)vl_video_buffer_get_associated_data(ref, &dec->base);
else
num_refs++;
} else
ref_pic = 0x7F;
result.ref_pic_list[i] = ref_pic;
}
@@ -469,7 +469,8 @@ static rvcn_dec_message_hevc_t get_h265_msg(struct radeon_decoder *dec,
CODEC_10_BITS : CODEC_8_BITS;
dec->ref_codec.index = result.curr_idx;
dec->ref_codec.ref_size = 15;
memset(dec->ref_codec.ref_list, 0x7f, sizeof(dec->ref_codec.ref_list));
dec->ref_codec.num_refs = num_refs;
STATIC_ASSERT(sizeof(dec->ref_codec.ref_list) == sizeof(result.ref_pic_list));
memcpy(dec->ref_codec.ref_list, result.ref_pic_list, sizeof(result.ref_pic_list));
}
return result;
@@ -507,7 +508,7 @@ static rvcn_dec_message_vp9_t get_vp9_msg(struct radeon_decoder *dec,
struct pipe_vp9_picture_desc *pic)
{
rvcn_dec_message_vp9_t result;
unsigned i ,j;
unsigned i, j, num_refs = 0;
memset(&result, 0, sizeof(result));
@@ -641,9 +642,13 @@ static rvcn_dec_message_vp9_t get_vp9_msg(struct radeon_decoder *dec,
get_current_pic_index(dec, target, &result.curr_pic_idx);
for (i = 0; i < 8; i++) {
result.ref_frame_map[i] =
(pic->ref[i]) ? (uintptr_t)vl_video_buffer_get_associated_data(pic->ref[i], &dec->base)
: 0x7f;
uintptr_t ref_frame;
if (pic->ref[i]) {
ref_frame = (uintptr_t)vl_video_buffer_get_associated_data(pic->ref[i], &dec->base);
num_refs++;
} else
ref_frame = 0x7f;
result.ref_frame_map[i] = ref_frame;
}
result.frame_refs[0] = result.ref_frame_map[pic->picture_parameter.pic_fields.last_ref_frame];
@@ -669,6 +674,7 @@ static rvcn_dec_message_vp9_t get_vp9_msg(struct radeon_decoder *dec,
CODEC_10_BITS : CODEC_8_BITS;
dec->ref_codec.index = result.curr_pic_idx;
dec->ref_codec.ref_size = 8;
dec->ref_codec.num_refs = num_refs;
memset(dec->ref_codec.ref_list, 0x7f, sizeof(dec->ref_codec.ref_list));
memcpy(dec->ref_codec.ref_list, result.ref_frame_map, sizeof(result.ref_frame_map));
}
@@ -959,7 +965,7 @@ static rvcn_dec_message_av1_t get_av1_msg(struct radeon_decoder *dec,
struct pipe_av1_picture_desc *pic)
{
rvcn_dec_message_av1_t result;
unsigned i, j;
unsigned i, j, num_refs = 0;
uint16_t tile_count = pic->picture_parameter.tile_cols * pic->picture_parameter.tile_rows;
memset(&result, 0, sizeof(result));
@@ -1151,9 +1157,13 @@ static rvcn_dec_message_av1_t get_av1_msg(struct radeon_decoder *dec,
result.order_hint_bits = pic->picture_parameter.order_hint_bits_minus_1 + 1;
for (i = 0; i < NUM_AV1_REFS; ++i) {
result.ref_frame_map[i] =
(pic->ref[i]) ? (uintptr_t)vl_video_buffer_get_associated_data(pic->ref[i], &dec->base)
: 0x7f;
uintptr_t ref_frame;
if (pic->ref[i]) {
ref_frame = (uintptr_t)vl_video_buffer_get_associated_data(pic->ref[i], &dec->base);
num_refs++;
} else
ref_frame = 0x7f;
result.ref_frame_map[i] = ref_frame;
}
for (i = 0; i < NUM_AV1_REFS_PER_FRAME; ++i)
result.frame_refs[i] = result.ref_frame_map[pic->picture_parameter.ref_frame_idx[i]];
@@ -1300,6 +1310,7 @@ static rvcn_dec_message_av1_t get_av1_msg(struct radeon_decoder *dec,
dec->ref_codec.bts = pic->picture_parameter.bit_depth_idx ? CODEC_10_BITS : CODEC_8_BITS;
dec->ref_codec.index = result.curr_pic_idx;
dec->ref_codec.ref_size = 8;
dec->ref_codec.num_refs = num_refs;
memset(dec->ref_codec.ref_list, 0x7f, sizeof(dec->ref_codec.ref_list));
memcpy(dec->ref_codec.ref_list, result.ref_frame_map, sizeof(result.ref_frame_map));
}
@@ -1816,6 +1827,7 @@ static unsigned rvcn_dec_dynamic_dpb_t2_message(struct radeon_decoder *dec, rvcn
size = size * 3 / 2;
list_for_each_entry_safe(struct rvcn_dec_dynamic_dpb_t2, d, &dec->dpb_ref_list, list) {
bool found = false;
for (i = 0; i < dec->ref_codec.ref_size; ++i) {
if (((dec->ref_codec.ref_list[i] & 0x7f) != 0x7f) && (d->index == (dec->ref_codec.ref_list[i] & 0x7f))) {
if (!dummy)
@@ -1829,10 +1841,10 @@ static unsigned rvcn_dec_dynamic_dpb_t2_message(struct radeon_decoder *dec, rvcn
dynamic_dpb_t2->dpbAddrLo[i] = addr;
dynamic_dpb_t2->dpbAddrHi[i] = addr >> 32;
++dynamic_dpb_t2->dpbArraySize;
break;
found = true;
}
}
if (i == dec->ref_codec.ref_size) {
if (!found) {
if (d->dpb.res->b.b.width0 * d->dpb.res->b.b.height0 != size) {
list_del(&d->list);
list_addtail(&d->list, &dec->dpb_unref_list);
@@ -1887,6 +1899,23 @@ static unsigned rvcn_dec_dynamic_dpb_t2_message(struct radeon_decoder *dec, rvcn
list_addtail(&dpb->list, &dec->dpb_ref_list);
}
if (dynamic_dpb_t2->dpbArraySize < dec->ref_codec.num_refs) {
struct rvcn_dec_dynamic_dpb_t2 *d =
list_first_entry(&dec->dpb_ref_list, struct rvcn_dec_dynamic_dpb_t2, list);
addr = dec->ws->buffer_get_virtual_address(d->dpb.res->buf);
if (!addr && dummy)
addr = dec->ws->buffer_get_virtual_address(dummy->dpb.res->buf);
assert(addr);
for (i = 0; i < dec->ref_codec.num_refs; ++i) {
if (dynamic_dpb_t2->dpbAddrLo[i] || dynamic_dpb_t2->dpbAddrHi[i])
continue;
dynamic_dpb_t2->dpbAddrLo[i] = addr;
dynamic_dpb_t2->dpbAddrHi[i] = addr >> 32;
++dynamic_dpb_t2->dpbArraySize;
}
assert(dynamic_dpb_t2->dpbArraySize == dec->ref_codec.num_refs);
}
dec->ws->cs_add_buffer(&dec->cs, dpb->dpb.res->buf,
RADEON_USAGE_READWRITE | RADEON_USAGE_SYNCHRONIZED, RADEON_DOMAIN_VRAM);
addr = dec->ws->buffer_get_virtual_address(dpb->dpb.res->buf);


@@ -120,6 +120,7 @@ struct radeon_decoder {
} bts;
uint8_t index;
unsigned ref_size;
unsigned num_refs;
uint8_t ref_list[16];
} ref_codec;


@@ -176,6 +176,15 @@ bool si_alloc_resource(struct si_screen *sscreen, struct si_resource *res)
util_range_set_empty(&res->valid_buffer_range);
res->TC_L2_dirty = false;
if (res->b.b.target != PIPE_BUFFER && !(res->b.b.flags & SI_RESOURCE_AUX_PLANE)) {
/* The buffer is shared with other planes. */
struct si_resource *plane = (struct si_resource *)res->b.b.next;
for (; plane; plane = (struct si_resource *)plane->b.b.next) {
radeon_bo_reference(sscreen->ws, &plane->buf, res->buf);
plane->gpu_address = res->gpu_address;
}
}
/* Print debug information. */
if (sscreen->debug_flags & DBG(VM) && res->b.b.target == PIPE_BUFFER) {
fprintf(stderr, "VM start=0x%" PRIX64 " end=0x%" PRIX64 " | Buffer %" PRIu64 " bytes | Flags: ",


@@ -1933,13 +1933,7 @@ emit_alu(struct ntv_context *ctx, nir_alu_instr *alu)
result = emit_builtin_unop(ctx, GLSLstd450PackHalf2x16, get_def_type(ctx, &alu->def, nir_type_uint), src[0]);
break;
case nir_op_unpack_64_2x32:
assert(nir_op_infos[alu->op].num_inputs == 1);
result = emit_builtin_unop(ctx, GLSLstd450UnpackDouble2x32, get_def_type(ctx, &alu->def, nir_type_uint), src[0]);
break;
BUILTIN_UNOPF(nir_op_unpack_half_2x16, GLSLstd450UnpackHalf2x16)
BUILTIN_UNOPF(nir_op_pack_64_2x32, GLSLstd450PackDouble2x32)
#undef BUILTIN_UNOP
#undef BUILTIN_UNOPF
@@ -2125,9 +2119,11 @@ emit_alu(struct ntv_context *ctx, nir_alu_instr *alu)
/* those are all simple bitcasts, we could do better, but it doesn't matter */
case nir_op_pack_32_4x8:
case nir_op_pack_32_2x16:
case nir_op_pack_64_2x32:
case nir_op_pack_64_4x16:
case nir_op_unpack_32_4x8:
case nir_op_unpack_32_2x16:
case nir_op_unpack_64_2x32:
case nir_op_unpack_64_4x16: {
result = emit_bitcast(ctx, dest_type, src[0]);
break;


@@ -309,6 +309,11 @@ zink_batch_state_destroy(struct zink_screen *screen, struct zink_batch_state *bs
util_dynarray_fini(&bs->bindless_releases[0]);
util_dynarray_fini(&bs->bindless_releases[1]);
util_dynarray_fini(&bs->acquires);
util_dynarray_fini(&bs->signal_semaphores);
util_dynarray_fini(&bs->wait_semaphores);
util_dynarray_fini(&bs->wait_semaphore_stages);
util_dynarray_fini(&bs->fd_wait_semaphores);
util_dynarray_fini(&bs->fd_wait_semaphore_stages);
util_dynarray_fini(&bs->acquire_flags);
unsigned num_mfences = util_dynarray_num_elements(&bs->fence.mfences, void *);
struct zink_tc_fence **mfence = bs->fence.mfences.data;


@@ -4846,8 +4846,11 @@ zink_resource_commit(struct pipe_context *pctx, struct pipe_resource *pres, unsi
VkSemaphore sem = VK_NULL_HANDLE;
bool ret = zink_bo_commit(ctx, res, level, box, commit, &sem);
if (ret) {
if (sem)
if (sem) {
zink_batch_add_wait_semaphore(&ctx->batch, sem);
zink_batch_reference_resource_rw(&ctx->batch, res, true);
ctx->batch.has_work = true;
}
} else {
check_device_lost(ctx);
}


@@ -513,6 +513,7 @@ namespace {
LLVMRunPasses(wrap(&mod), opt_str, tm, opts);
LLVMDisposeTargetMachine(tm);
LLVMDisposePassBuilderOptions(opts);
}
std::unique_ptr<Module>


@@ -2385,7 +2385,7 @@ dri2_init_screen(struct dri_screen *screen)
pscreen = pipe_loader_create_screen(screen->dev);
if (!pscreen)
goto fail;
return NULL;
dri_init_options(screen);
screen->throttle = pscreen->get_param(pscreen, PIPE_CAP_THROTTLE);
@@ -2419,7 +2419,7 @@ dri2_init_screen(struct dri_screen *screen)
return configs;
fail:
dri_release_screen(screen);
pipe_loader_release(&screen->dev, 1);
return NULL;
}


@@ -546,6 +546,8 @@ drisw_init_screen(struct dri_screen *screen)
struct pipe_screen *pscreen = NULL;
const struct drisw_loader_funcs *lf = &drisw_lf;
(void) mtx_init(&screen->opencl_func_mutex, mtx_plain);
screen->swrast_no_present = debug_get_option_swrast_no_present();
if (loader->base.version >= 4) {
@@ -565,7 +567,7 @@ drisw_init_screen(struct dri_screen *screen)
pscreen = pipe_loader_create_screen(screen->dev);
if (!pscreen)
goto fail;
return NULL;
dri_init_options(screen);
configs = dri_init_screen(screen, pscreen);
@@ -593,7 +595,7 @@ drisw_init_screen(struct dri_screen *screen)
return configs;
fail:
dri_release_screen(screen);
pipe_loader_release(&screen->dev, 1);
return NULL;
}


@@ -115,6 +115,8 @@ kopper_init_screen(struct dri_screen *screen)
const __DRIconfig **configs;
struct pipe_screen *pscreen = NULL;
(void) mtx_init(&screen->opencl_func_mutex, mtx_plain);
if (!screen->kopper_loader) {
fprintf(stderr, "mesa: Kopper interface not found!\n"
" Ensure the versions of %s built with this version of Zink are\n"
@@ -134,7 +136,7 @@ kopper_init_screen(struct dri_screen *screen)
pscreen = pipe_loader_create_screen(screen->dev);
if (!pscreen)
goto fail;
return NULL;
dri_init_options(screen);
screen->unwrapped_screen = trace_screen_unwrap(pscreen);
@@ -167,7 +169,7 @@ kopper_init_screen(struct dri_screen *screen)
return configs;
fail:
dri_release_screen(screen);
pipe_loader_release(&screen->dev, 1);
return NULL;
}


@@ -10,6 +10,7 @@ use mesa_rust_util::static_assert;
use rusticl_opencl_gen::*;
use std::collections::HashSet;
use std::mem;
use std::slice;
use std::sync::Arc;
use std::sync::Condvar;
@@ -272,6 +273,27 @@ impl Event {
}
}
impl Drop for Event {
// implement drop in order to prevent stack overflows of long dependency chains.
//
// This abuses the fact that `Arc::into_inner` only succeeds when there is one strong reference
// so we turn a recursive drop chain into a drop list for events having no other references.
fn drop(&mut self) {
if self.deps.is_empty() {
return;
}
let mut deps_list = vec![mem::take(&mut self.deps)];
while let Some(deps) = deps_list.pop() {
for dep in deps {
if let Some(mut dep) = Arc::into_inner(dep) {
deps_list.push(mem::take(&mut dep.deps));
}
}
}
}
}
// TODO worker thread per device
// Condvar to wait on new events to work on
// notify condvar when flushing queue events to worker


@@ -658,6 +658,7 @@ impl PipeContext {
impl Drop for PipeContext {
fn drop(&mut self) {
self.flush().wait();
unsafe {
self.pipe.as_ref().destroy.unwrap()(self.pipe.as_ptr());
}


@@ -1033,7 +1033,8 @@ vlVaRenderPicture(VADriverContextP ctx, VAContextID context_id, VABufferID *buff
case VASliceDataBufferType:
vaStatus = handleVASliceDataBufferType(context, buf);
slice_offset += buf->size;
if (slice_idx)
slice_offset += buf->size;
break;
case VAProcPipelineParameterBufferType:


@@ -186,6 +186,7 @@ void vlVaHandleSliceParameterBufferH264(vlVaContext *context, vlVaBuffer *buf)
assert(context->desc.h264.slice_count < max_pipe_h264_slices);
context->desc.h264.slice_parameter.slice_info_present = true;
context->desc.h264.slice_parameter.slice_type[context->desc.h264.slice_count] = h264->slice_type;
context->desc.h264.slice_parameter.slice_data_size[context->desc.h264.slice_count] = h264->slice_data_size;
context->desc.h264.slice_parameter.slice_data_offset[context->desc.h264.slice_count] = h264->slice_data_offset;


@@ -411,6 +411,7 @@ struct pipe_h264_picture_desc
{
bool slice_info_present;
uint32_t slice_count;
uint8_t slice_type[128];
uint32_t slice_data_size[128];
uint32_t slice_data_offset[128];
enum pipe_slice_buffer_placement_type slice_data_flag[128];


@@ -28,5 +28,5 @@ libi915drm = static_library(
inc_include, inc_src, inc_gallium, inc_gallium_aux, inc_gallium_drivers
],
link_with : [libintel_common],
dependencies : [dep_libdrm, dep_libdrm_intel],
dependencies : [dep_libdrm, dep_libdrm_intel, idep_intel_dev_wa],
)


@@ -32,7 +32,9 @@
#include <dlfcn.h>
#include "dri_common.h"
#include "drisw_priv.h"
#ifdef HAVE_DRI3
#include "dri3_priv.h"
#endif
#include <X11/extensions/shmproto.h>
#include <assert.h>
#include <vulkan/vulkan_core.h>


@@ -908,9 +908,11 @@ __glXInitialize(Display * dpy)
#endif /* HAVE_DRI3 */
if (!debug_get_bool_option("LIBGL_DRI2_DISABLE", false))
dpyPriv->dri2Display = dri2CreateDisplay(dpy);
#if defined(HAVE_ZINK)
if (!dpyPriv->dri3Display && !dpyPriv->dri2Display)
try_zink = !debug_get_bool_option("LIBGL_KOPPER_DISABLE", false) &&
!getenv("GALLIUM_DRIVER");
#endif /* HAVE_ZINK */
}
#endif /* GLX_USE_DRM */
if (glx_direct)


@@ -743,7 +743,7 @@ build_desc_addr_for_res_index(nir_builder *b,
static nir_def *
build_desc_addr_for_binding(nir_builder *b,
unsigned set, unsigned binding,
nir_def *array_index,
nir_def *array_index, unsigned plane,
const struct apply_pipeline_layout_state *state)
{
const struct anv_descriptor_set_binding_layout *bind_layout =
@@ -759,6 +759,10 @@ build_desc_addr_for_binding(nir_builder *b,
array_index,
bind_layout->descriptor_surface_stride),
bind_layout->descriptor_surface_offset);
if (plane != 0) {
desc_offset = nir_iadd_imm(
b, desc_offset, plane * bind_layout->descriptor_data_surface_size);
}
return nir_vec4(b, nir_unpack_64_2x32_split_x(b, set_addr),
nir_unpack_64_2x32_split_y(b, set_addr),
@@ -766,14 +770,21 @@ build_desc_addr_for_binding(nir_builder *b,
desc_offset);
}
case nir_address_format_32bit_index_offset:
case nir_address_format_32bit_index_offset: {
nir_def *desc_offset =
nir_iadd_imm(b,
nir_imul_imm(b,
array_index,
bind_layout->descriptor_surface_stride),
bind_layout->descriptor_surface_offset);
if (plane != 0) {
desc_offset = nir_iadd_imm(
b, desc_offset, plane * bind_layout->descriptor_data_surface_size);
}
return nir_vec2(b,
nir_imm_int(b, state->set[set].desc_offset),
nir_iadd_imm(b,
nir_imul_imm(b,
array_index,
bind_layout->descriptor_surface_stride),
bind_layout->descriptor_surface_offset));
desc_offset);
}
default:
unreachable("Unhandled address format");
@@ -827,7 +838,8 @@ build_surface_index_for_binding(nir_builder *b,
set_offset = nir_imm_int(b, 0xdeaddead);
nir_def *desc_addr =
build_desc_addr_for_binding(b, set, binding, array_index, state);
build_desc_addr_for_binding(b, set, binding, array_index,
plane, state);
surface_index =
build_load_descriptor_mem(b, desc_addr, 0, 1, 32, state);
@@ -908,7 +920,8 @@ build_sampler_handle_for_binding(nir_builder *b,
set_offset = nir_imm_int(b, 0xdeaddead);
nir_def *desc_addr =
build_desc_addr_for_binding(b, set, binding, array_index, state);
build_desc_addr_for_binding(b, set, binding, array_index,
plane, state);
/* This is anv_sampled_image_descriptor, the sampler handle is always
* in component 1.
@@ -1384,7 +1397,8 @@ lower_load_accel_struct_desc(nir_builder *b,
struct res_index_defs res = unpack_res_index(b, res_index);
nir_def *desc_addr =
build_desc_addr_for_binding(b, set, binding, res.array_index, state);
build_desc_addr_for_binding(b, set, binding, res.array_index,
0 /* plane */, state);
/* Acceleration structure descriptors are always uint64_t */
nir_def *desc = build_load_descriptor_mem(b, desc_addr, 0, 1, 64, state);
@@ -1613,7 +1627,7 @@ lower_image_size_intrinsic(nir_builder *b, nir_intrinsic_instr *intrin,
}
nir_def *desc_addr = build_desc_addr_for_binding(
b, set, binding, array_index, state);
b, set, binding, array_index, 0 /* plane */, state);
b->cursor = nir_after_instr(&intrin->instr);


@@ -3180,6 +3180,12 @@ struct anv_push_constants {
/** Dynamic offsets for dynamic UBOs and SSBOs */
uint32_t dynamic_offsets[MAX_DYNAMIC_BUFFERS];
/* Robust access pushed registers. */
uint64_t push_reg_mask[MESA_SHADER_STAGES];
/** Ray query globals (RT_DISPATCH_GLOBALS) */
uint64_t ray_query_globals;
union {
struct {
/** Dynamic MSAA value */
@@ -3200,16 +3206,12 @@ struct anv_push_constants {
*
* This is never set by software but is implicitly filled out when
* uploading the push constants for compute shaders.
*
* This *MUST* be the last field of the anv_push_constants structure.
*/
uint32_t subgroup_id;
} cs;
};
/* Robust access pushed registers. */
uint64_t push_reg_mask[MESA_SHADER_STAGES];
/** Ray query globals (RT_DISPATCH_GLOBALS) */
uint64_t ray_query_globals;
};
struct anv_surface_state {


@@ -1438,13 +1438,15 @@ nvk_flush_ms_state(struct nvk_cmd_buffer *cmd)
});
}
if (BITSET_TEST(dyn->dirty, MESA_VK_DYNAMIC_MS_SAMPLE_LOCATIONS) ||
if (BITSET_TEST(dyn->dirty, MESA_VK_DYNAMIC_MS_RASTERIZATION_SAMPLES) ||
BITSET_TEST(dyn->dirty, MESA_VK_DYNAMIC_MS_SAMPLE_LOCATIONS) ||
BITSET_TEST(dyn->dirty, MESA_VK_DYNAMIC_MS_SAMPLE_LOCATIONS_ENABLE)) {
const struct vk_sample_locations_state *sl;
if (dyn->ms.sample_locations_enable) {
sl = dyn->ms.sample_locations;
} else {
sl = vk_standard_sample_locations_state(dyn->ms.rasterization_samples);
const uint32_t samples = MAX2(1, dyn->ms.rasterization_samples);
sl = vk_standard_sample_locations_state(samples);
}
for (uint32_t i = 0; i < sl->per_pixel; i++) {


@@ -130,6 +130,7 @@ nvk_meta_end(struct nvk_cmd_buffer *cmd,
{
if (save->desc0) {
cmd->state.gfx.descriptors.sets[0] = save->desc0;
cmd->state.gfx.descriptors.set_sizes[0] = save->desc0->size;
cmd->state.gfx.descriptors.root.sets[0] = nvk_descriptor_set_addr(save->desc0);
cmd->state.gfx.descriptors.sets_dirty |= BITFIELD_BIT(0);
cmd->state.gfx.descriptors.push_dirty &= ~BITFIELD_BIT(0);


@@ -10,6 +10,9 @@
#include <sys/mman.h>
#include <xf86drm.h>
#include "nvidia/classes/cl9097.h"
#include "nvidia/classes/clc597.h"
static void
bo_bind(struct nouveau_ws_device *dev,
uint32_t handle, uint64_t addr,
@@ -170,9 +173,10 @@ nouveau_ws_bo_new_mapped(struct nouveau_ws_device *dev,
}
static struct nouveau_ws_bo *
nouveau_ws_bo_new_locked(struct nouveau_ws_device *dev,
uint64_t size, uint64_t align,
enum nouveau_ws_bo_flags flags)
nouveau_ws_bo_new_tiled_locked(struct nouveau_ws_device *dev,
uint64_t size, uint64_t align,
uint8_t pte_kind, uint16_t tile_mode,
enum nouveau_ws_bo_flags flags)
{
struct drm_nouveau_gem_new req = {};
@@ -205,6 +209,9 @@ nouveau_ws_bo_new_locked(struct nouveau_ws_device *dev,
if (flags & NOUVEAU_WS_BO_NO_SHARE)
req.info.domain |= NOUVEAU_GEM_DOMAIN_NO_SHARE;
req.info.tile_flags = (uint32_t)pte_kind << 8;
req.info.tile_mode = tile_mode;
req.info.size = size;
req.align = align;
@@ -242,19 +249,29 @@ fail_gem_new:
}
struct nouveau_ws_bo *
nouveau_ws_bo_new(struct nouveau_ws_device *dev,
uint64_t size, uint64_t align,
enum nouveau_ws_bo_flags flags)
nouveau_ws_bo_new_tiled(struct nouveau_ws_device *dev,
uint64_t size, uint64_t align,
uint8_t pte_kind, uint16_t tile_mode,
enum nouveau_ws_bo_flags flags)
{
struct nouveau_ws_bo *bo;
simple_mtx_lock(&dev->bos_lock);
bo = nouveau_ws_bo_new_locked(dev, size, align, flags);
bo = nouveau_ws_bo_new_tiled_locked(dev, size, align,
pte_kind, tile_mode, flags);
simple_mtx_unlock(&dev->bos_lock);
return bo;
}
struct nouveau_ws_bo *
nouveau_ws_bo_new(struct nouveau_ws_device *dev,
uint64_t size, uint64_t align,
enum nouveau_ws_bo_flags flags)
{
return nouveau_ws_bo_new_tiled(dev, size, align, 0, 0, flags);
}
static struct nouveau_ws_bo *
nouveau_ws_bo_from_dma_buf_locked(struct nouveau_ws_device *dev, int fd)
{
@@ -265,8 +282,11 @@ nouveau_ws_bo_from_dma_buf_locked(struct nouveau_ws_device *dev, int fd)
struct hash_entry *entry =
_mesa_hash_table_search(dev->bos, (void *)(uintptr_t)handle);
if (entry != NULL)
return entry->data;
if (entry != NULL) {
struct nouveau_ws_bo *bo = entry->data;
nouveau_ws_bo_ref(bo);
return bo;
}
/*
* If we got here, no BO exists for the retrieved handle. If we error


@@ -68,6 +68,11 @@ struct nouveau_ws_bo *nouveau_ws_bo_new_mapped(struct nouveau_ws_device *,
enum nouveau_ws_bo_flags,
enum nouveau_ws_bo_map_flags map_flags,
void **map_out);
struct nouveau_ws_bo *nouveau_ws_bo_new_tiled(struct nouveau_ws_device *,
uint64_t size, uint64_t align,
uint8_t pte_kind,
uint16_t tile_mode,
enum nouveau_ws_bo_flags);
struct nouveau_ws_bo *nouveau_ws_bo_from_dma_buf(struct nouveau_ws_device *,
int fd);
void nouveau_ws_bo_destroy(struct nouveau_ws_bo *);


@@ -380,3 +380,13 @@ nouveau_ws_device_destroy(struct nouveau_ws_device *device)
close(device->fd);
FREE(device);
}
bool
nouveau_ws_device_has_tiled_bo(struct nouveau_ws_device *device)
{
uint64_t has = 0;
if (nouveau_ws_param(device->fd, NOUVEAU_GETPARAM_HAS_VMA_TILEMODE, &has))
return false;
return has != 0;
}


@@ -64,6 +64,8 @@ struct nouveau_ws_device {
struct nouveau_ws_device *nouveau_ws_device_new(struct _drmDevice *drm_device);
void nouveau_ws_device_destroy(struct nouveau_ws_device *);
bool nouveau_ws_device_has_tiled_bo(struct nouveau_ws_device *device);
#ifdef __cplusplus
}
#endif


@@ -208,5 +208,9 @@ Application bugs worked around in this file:
<application name="Half-Life Alyx" application_name_match="hlvr">
<option name="dual_color_blend_by_location" value="true" />
</application>
<application name="Enshrouded" executable="enshrouded.exe">
<option name="radv_zero_vram" value="true"/>
</application>
</device>
</driconf>


@@ -209,7 +209,8 @@ __bitset_shl(BITSET_WORD *x, unsigned amount, unsigned n)
*/
#define BITSET_TEST_RANGE_INSIDE_WORD(x, b, e, mask) \
(BITSET_BITWORD(b) == BITSET_BITWORD(e) ? \
(((x)[BITSET_BITWORD(b)] & BITSET_RANGE(b, e)) == mask) : \
(((x)[BITSET_BITWORD(b)] & BITSET_RANGE(b, e)) == \
(((BITSET_WORD)mask) << (b % BITSET_WORDBITS))) : \
(assert (!"BITSET_TEST_RANGE: bit range crosses word boundary"), 0))
#define BITSET_SET_RANGE_INSIDE_WORD(x, b, e) \
(BITSET_BITWORD(b) == BITSET_BITWORD(e) ? \


@@ -532,7 +532,7 @@ wsi_create_native_image_mem(const struct wsi_swapchain *chain,
for (uint32_t p = 0; p < image->num_planes; p++) {
const VkImageSubresource image_subresource = {
.aspectMask = VK_IMAGE_ASPECT_PLANE_0_BIT << p,
.aspectMask = VK_IMAGE_ASPECT_MEMORY_PLANE_0_BIT_EXT << p,
.mipLevel = 0,
.arrayLayer = 0,
};


@@ -41,6 +41,15 @@ endif
if rc.version().version_compare('< 1.57')
rust_args += ['--cfg', 'no_is_available']
endif
if rc.version().version_compare('< 1.66')
rust_args += ['--cfg', 'no_source_text']
endif
if rc.version().version_compare('< 1.79')
rust_args += [
'--cfg', 'no_literal_byte_character',
'--cfg', 'no_literal_c_string',
]
endif
u_ind = subproject('unicode-ident').get_variable('lib')