118 lines
6.4 KiB
ReStructuredText
118 lines
6.4 KiB
ReStructuredText
Instancing
|
|
==========
|
|
|
|
The attribute descriptor lets the attribute unit compute the address of an
|
|
attribute given the vertex and instance ID. Unfortunately, the way this works is
|
|
rather complicated when instancing is enabled.
|
|
|
|
To explain this, first we need to explain how compute and vertex threads are
|
|
dispatched. When a quad is dispatched, it receives a single, linear index.
|
|
However, we need to translate that index into a (vertex id, instance id) pair.
|
|
One option would be to do:
|
|
|
|
.. math::
|
|
\text{vertex id} = \text{linear id} \% \text{num vertices}
|
|
|
|
\text{instance id} = \text{linear id} / \text{num vertices}
|
|
|
|
but this involves a costly division and modulus by an arbitrary number.
|
|
Instead, we could pad ``num_vertices``. We dispatch
|
|
:math:`\text{padded_num_vertices} \cdot \text{num_instances}` threads instead
|
|
of :math:`\text{num_vertices} \cdot \text{num_instances}`, which results
|
|
in some "extra" threads with :math:`\text{vertex_id} \geq \text{num_vertices}`,
|
|
which we have to discard. The more we pad ``num_vertices``, the more "wasted"
|
|
threads we dispatch, but the division is potentially easier.
|
|
|
|
One straightforward choice is to pad ``num_vertices`` to the next power
|
|
of two, which means that the division and modulus are just simple bit shifts
|
|
and masking. But the actual algorithm is a bit more complicated. The thread
|
|
dispatcher has special support for dividing by 3, 5, 7, and 9, in addition
|
|
to dividing by a power of two. As a result, ``padded_num_vertices`` can
|
|
be 1, 3, 5, 7, or 9 times a power of two. This results in less wasted threads,
|
|
since we need less padding.
|
|
|
|
``padded_num_vertices`` is picked by the hardware. The driver just specifies
|
|
the actual number of vertices. Note that ``padded_num_vertices`` is a multiple
|
|
of four (presumably because threads are dispatched in groups of 4). Also,
|
|
``padded_num_vertices`` is always at least one more than ``num_vertices``,
|
|
which seems like a quirk of the hardware. For larger ``num_vertices``, the
|
|
hardware uses the following algorithm: using the binary representation of
|
|
``num_vertices``, we look at the most significant set bit as well as the
|
|
following 3 bits. Let n be the number of bits after those 4 bits. Then we
|
|
set ``padded_num_vertices`` according to the following table:
|
|
|
|
========== =======================
|
|
high bits ``padded_num_vertices``
|
|
========== =======================
|
|
1000 :math:`9 \cdot 2^n`
|
|
1001 :math:`5 \cdot 2^{n+1}`
|
|
101x :math:`3 \cdot 2^{n+2}`
|
|
110x :math:`7 \cdot 2^{n+1}`
|
|
111x :math:`2^{n+4}`
|
|
========== =======================
|
|
|
|
For example, if :math:`\text{num_vertices} = 70` is passed to
|
|
:c:func:`glDraw()`, its binary representation is 1000110, so :math:`n = 3`
|
|
and the high bits are 1000, and therefore
|
|
:math:`\text{padded_num_vertices} = 9 \cdot 2^3 = 72`.
|
|
|
|
The attribute unit works in terms of the original ``linear_id``. if
|
|
:math:`\text{num_instances} = 1`, then they are the same, and everything
|
|
is simple. However, with instancing things get more complicated. There are
|
|
four possible modes, two of them we can group together:
|
|
|
|
1. Use the ``linear_id`` directly. Only used when there is no instancing.
|
|
|
|
2. Use the ``linear_id`` modulo a constant. This is used for per-vertex
|
|
attributes with instancing enabled by making the constant equal
|
|
``padded_num_vertices``. Because the modulus is always ``padded_num_vertices``,
|
|
this mode only supports a modulus that is a power of 2 times 1, 3, 5, 7,
|
|
or 9. The shift field specifies the power of two, while the ``extra_flags``
|
|
field specifies the odd number. If :math:`\text{shift} = n` and
|
|
:math:`\text{extra_flags} = m`, then the modulus is
|
|
:math:`(2m + 1) \cdot 2^n`. As an example, if
|
|
:math:`\text{num_vertices} = 70`, then as computed above,
|
|
:math:`\text{padded_num_vertices} = 9 \cdot 2^3`, so we should set
|
|
:math:`\text{extra_flags} = 4` and :math:`\text{shift} = 3`. Note that we
|
|
must exactly follow the hardware algorithm used to get ``padded_num_vertices``
|
|
in order to correctly implement per-vertex attributes.
|
|
|
|
3. Divide the ``linear_id`` by a constant. In order to correctly implement
|
|
instance divisors, we have to divide ``linear_id`` by ``padded_num_vertices``
|
|
times to user-specified divisor. So first we compute ``padded_num_vertices``,
|
|
again following the exact same algorithm that the hardware uses, then multiply
|
|
it by the GL-level divisor to get the hardware-level divisor. This case is
|
|
further divided into two more cases. If the hardware-level divisor is a
|
|
power of two, then we just need to shift. The shift amount is specified by
|
|
the shift field, so that the hardware-level divisor is just
|
|
:math:`2^\text{shift}`.
|
|
|
|
If it isn't a power of two, then we have to divide by an arbitrary integer.
|
|
For that, we use the well-known technique of multiplying by an approximation
|
|
of the inverse. The driver must compute the magic multiplier and shift
|
|
amount, and then the hardware does the multiplication and shift. The
|
|
hardware and driver also use the "round-down" optimization as described in
|
|
https://ridiculousfish.com/files/faster_unsigned_division_by_constants.pdf.
|
|
The hardware further assumes the multiplier is between :math:`2^{31}` and
|
|
:math:`2^{32}`, so the high bit is implicitly set to 1 even though it is set
|
|
to 0 by the driver -- presumably this simplifies the hardware multiplier a
|
|
little. The hardware first multiplies ``linear_id`` by the multiplier and
|
|
takes the high 32 bits, then applies the round-down correction if
|
|
:math:`\text{extra_flags} = 1`, then finally shifts right by the shift field.
|
|
|
|
There are some differences between ridiculousfish's algorithm and the Mali
|
|
hardware algorithm, which means that the reference code from ridiculousfish
|
|
doesn't always produce the right constants. Mali does not use the pre-shift
|
|
optimization, since that would make a hardware implementation slower (it
|
|
would have to always do the pre-shift, multiply, and post-shift operations).
|
|
It also forces the multiplier to be at least :math:`2^{31}`, which means
|
|
that the exponent is entirely fixed, so there is no trial-and-error.
|
|
Altogether, given the divisor d, the algorithm the driver must follow is:
|
|
|
|
1. Set :math:`\text{shift} = \lfloor \log_2(d) \rfloor`.
|
|
2. Compute :math:`m = \lceil 2^{shift + 32} / d \rceil` and :math:`e = 2^{shift + 32} % d`.
|
|
3. If :math:`e \leq 2^{shift}`, then we need to use the round-down algorithm.
|
|
Set :math:`\text{magic_divisor} = m - 1` and :math:`\text{extra_flags} = 1`.
|
|
4. Otherwise, set :math:`\text{magic_divisor} = m` and
|
|
:math:`\text{extra_flags} = 0`.
|