Discussion: Memory dependency microbenchmark
Anton Ertl
2023-11-03 09:15:58 UTC
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.

You can find the text portion of this page below; there are a few
links on the page that are not produced below.

If you have information on any of the things I don't know (or don't
remember), please let me know about it.

- anton

Memory Dependency Microbenchmark

This microbenchmark tests some aspects of how hardware deals with
dependences through memory. It performs the following loop:

for (i=0; i<250000000; i++) {
  *b = *a+1;
  *d = *c+1;
  *f = *e+1;
  *h = *g+1;
}

The eight parameters of the binary allow you to determine to which
memory location the pointers a b c d e f g h refer, in particular, if
some of them refer to the same memory location or not. Each parameter
corresponds to one pointer, and two pointers refer to the same memory
location iff the corresponding parameters are the same.

This microbenchmark uses pointers that have been in their registers
for a long time, so it does not test speculative alias detection by
the hardware.

I have measured using the following parameter combinations (in the
order a b c d e f g h):

* A 0 1 2 3 4 5 6 7: All computations (statements in the loop body)
are completely independent. Ideally they can be performed as fast
as the resources allow.
* B 0 1 1 2 2 3 3 4: A sequence of 4 dependent computations, but
the next iteration does not depend on the results of the previous
one, so again, the work can be performed as fast as the resources
allow. However, in order to achieve this performance, the loads
of the next iteration have to be started while several
architecturally earlier stores still have to be executed, and,
comparing the results of B to those of A, all measured cores have
more difficulty with that, resulting in slower performance.
* C 0 0 2 2 4 4 6 6: 1-computation recurrences (i.e., the
computation in the next iteration depends on a computation in the
current iteration), four of those. So at most 4 of these
computations (plus the loop overhead) can be performed in
parallel.
* D 0 1 1 0 2 3 3 2: 2-computation recurrences (i.e., two dependent
computations in an iteration, and the first of those in the
current iteration depends on the second one in the previous
iteration), 2 of those. So at most two of these computations
(plus the loop overhead) can be performed in parallel.
* E 0 1 2 3 1 0 3 2: The same data flow as D, but the computations
are arranged differently: Here we first have two independent
computations, and then two computations that depend on the
earlier computations.
* F 0 1 1 2 2 0 3 3: A 3-computation recurrence and a 1-computation
recurrence. In the best case you see the latency of three
computations per iteration.
* G 0 1 1 2 3 3 2 0: The same data flow as F, but the computations
are arranged differently.
* H 0 0 0 0 0 0 0 0: One 4-computation recurrence. These
computations can only be performed sequentially; only the loop
overhead can be performed in parallel with them.

The results for different CPU cores are shown in the following. The
numbers are cycles per computation (statement in the loop body). To
get the cycles per iteration, multiply by 4.

A B C D E F G H microarchitecture CPU
1.17 2.11 1.46 3.28 3.67 4.50 4.50 6.93 K8 Athlon 64 X2 4600+
1.00 1.26 2.00 4.00 4.00 6.01 6.01 8.00 Zen Ryzen 5 1600X
1.00 1.19 2.00 4.00 4.00 6.00 6.01 8.00 Zen2 Ryzen 9 3900X
0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3 Ryzen 7 5800X
1.00 1.52 1.99 3.70 3.66 4.65 4.62 7.42 Sandy Bridge Xeon E3-1220
1.00 1.26 1.61 3.06 3.00 4.50 4.50 6.85 Haswell Core i7-4790K
1.00 1.12 1.43 2.90 2.84 4.03 4.05 5.53 Skylake Core i5-6600K
0.78 0.92 1.01 1.81 1.13 2.00 2.00 2.25 Rocket Lake Xeon W-1370P
0.81 0.92 0.83 0.86 0.81 0.82 0.81 1.01 Tiger Lake Core i5-1135G7
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A53 Amlogic S905
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A55 RK3588
1.25 2.26 2.06 3.95 4.00 6.18 6.18 7.67 Cortex-A72 RK3399
1.50 1.77 1.63 3.00 3.03 4.10 4.07 5.51 Cortex-A76 RK3588
1.03 1.66 1.50 3.00 3.00 4.50 4.50 7.84 IceStorm Apple M1 (variations from 6.0-8.75 for H)
3.01 4.52 3.01 4.01 3.01 4.01 4.01 5.02 U74 JH7100 (RISC-V)

There can be different sophistication levels of CPU cores, both in
terms of dealing with aliases, and in terms of forwarding from stores
to loads. And of course there is the difference between in-order
execution and out-of-order execution.

Aliases

The simplest approach would be to assume that all memory accesses may
refer to the same memory location. In that case we would expect that
the core performs all parameter combinations as badly as H. None of
the measured cores exhibits this behaviour, so obviously they are
more sophisticated than that.

The next approach is to allow architecturally later loads to be
performed at the same time as, or, with OoO, earlier than, stores to a
different address. All tested cores seem to do this.

Finally, there is the issue of what to do when the load is to the
same address as an architecturally preceding store. This brings us to
the next section:

Store to load forwarding

The simplest approach is to wait until the data arrives in the cache
and then load it from there. I dimly remember seeing 15 cycles per
iteration for a loop that incremented a memory location in every
iteration, but none of the cores measured above take that long
(although, for Zen and Zen2 one might wonder).
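
For reference, a minimal sketch of such an increment loop (assuming a
volatile pointer so that the compiler keeps the load and store in every
iteration):

void increment_loop(volatile long *p, long n)
{
  long i;
  /* each iteration loads *p, adds 1, and stores it back; the next
     iteration's load depends on this iteration's store */
  for (i = 0; i < n; i++)
    *p = *p + 1;
}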

The next approach is to let the load read from the store buffer if
the data is there (and first wait until it is there). In this case
the whole sequence of store-load has a total latency that is not much
higher than the load latency. It seems that most cores measured here
do that. We see this best in the H results; e.g., a Skylake has 4
cycles of load latency and 1 cycle of add latency, and we see 5.53
cycles of latency for store-load-add, meaning that the store
contributes an average latency of 0.53 cycles. There are some
complications in case the load only partially overlaps the store (I
should put an appropriate chipsncheese link here).

Finally, the core might detect that the data that is loaded is coming
from a store that has not been retired yet, so the physical register
used by the store can be directly renamed into the register of the
load result, and the load does not need to access the store buffer or
cache at all (what is the name of this feature?). As a result, in the
ideal case we see only the 1-cycle latency of the add in case H. In
the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully;
Rocket Lake probably also does that, but either does not succeed in
all cases (the different results between D and E speak for this
theory), or there is an additional source of latency.

I expect that Firestorm (the performance core of the Apple M1) also
has this feature, but unfortunately the cycles performance counter
does not work for Firestorm on Linux 6.4.0-asahi-00571-gda70cd78bc50.

In-order vs. out-of-order execution

With in-order execution (on Cortex A53 and A55, and on the U74), the
loads cannot be executed before architecturally earlier stores, even
if both refer to different memory locations. So even A is relatively
slow on in-order cores. In-order execution also means that B is
almost as slow as H, while with out-of-order execution it can
theoretically be executed as fast as A (in practice B is slightly
slower than A, but certainly much faster than H).

With OoO, we see much better performance in cases where there are
independent computation chains. Given the size of the buffers in the
various OoO microarchitectures (hundreds of instructions in the
reorder buffer, dozens in schedulers), it is surprising that B is
slower than A given the small size of each iteration (~15
instructions); and even D is slower than E on some
microarchitectures, most notably Rocket Lake.

Measuring your own hardware

You can download the contents of this directory and run the benchmark
with

wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
unzip memdep.zip
cd memdep
make

If you want to do your own parameter combinations, you can run the
binary with

./memdep-`uname -m` a b c d e f g h

where a b c d e f g h are integers and correspond to the pointers in
the loop. If you want to get results like in the table above, run it
like this:

perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
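
The line above appears to be the Makefile recipe (CYCLES and ARCH are
Makefile variables, and $$ escapes to $). A hand-typed equivalent might
look like this, assuming an x86-64 build and the generic cycles event:

perf stat --log-fd 3 -x, -e cycles ./memdep-x86_64 0 1 1 0 2 3 3 2 3>&1 | awk -F, '{printf("%5.2f\n",$1/1000000000)}'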

Future work

Make a good microbenchmark that produces the addresses so late that
either the core has to wait for the addresses or has to speculate
whether the load accesses the same address as the store or not. Do it
for predictable aliasing and for unpredictable aliasing. I remember a
good article about this predictor (that actually looked at Intel's
patent), but don't remember the URL; Intel calls this technique, as
implemented in Ice Lake and later P-cores, the Fast Store Forwarding
Predictor (FSFP) (but my memory says that the article I read looked
at older microarchitectures that have a similar feature). AMD
describes such a hardware feature under the name predictive store
forwarding (PSF), which they added in Zen3.

Related

One thing I remember is that I have done a microbenchmark that was
intended to measure predictive store forwarding, but it (also?)
measured the forward-at-register-level technique described above.
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2023-11-03 17:13:41 UTC
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
You can find the text portion of this page below; there are a few
links on the page that are not produced below.
If you have information on any of the things I don't know (or don't
remember), please let me know about it.
- anton
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

extern void memdep(long *a, long *b, long *c, long *d,
                   long *e, long *f, long *g, long *h);

long a[1000];       /* backing array */
long offsets[8];    /* element offsets for the eight pointers */

for (i=0; i<1000; i++)
  a[i] = 0;
memdep(a+offsets[0], a+offsets[1], a+offsets[2], a+offsets[3],
       a+offsets[4], a+offsets[5], a+offsets[6], a+offsets[7]);
Anton Ertl
2023-11-04 17:01:32 UTC
Post by EricP
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
The numbers are in 8-byte granularity, so to get different cache lines
n*8 is good enough, and for different pages n*512. However, the
array is only 8000 bytes long, so you can use only numbers in the
range 0...999. Anyway, I made another make target "ericp", so when
you "make ericp", you get the following parameter combinations:

X 0 8 16 24 32 40 48 56

Different cache lines, but always the same "bank" in a cache line;
this should hurt K8, maybe also others.

Y 0 9 18 27 36 45 54 63

Different cache lines, and different banks for different accesses.

Z 0 513 994 11 524 989 22 535

All different cache lines and different banks; the second access is on
a different page than the first, and the third is likely on a
different page. Then start with the first page again. These are all
independent accesses.
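
For reference, the X combination corresponds to a direct invocation
along these lines (assuming an x86-64 build; Y and Z are analogous):

./memdep-x86_64 0 8 16 24 32 40 48 56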

Results:

X Y Z A B C D E F G H
1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8

We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
the dependence-wise equivalent A. My guess is that these CPUs can
only store to one cache line per cycle, and can optimize the stores in
some cases for A. If that is the case, that is a resource thing, not
a failure to recognize independence.

We see a slowdown for X on K8, which is somewhat expected; however,
thinking about it, I wonder: It seems as if the K8 can do only one load
or store per cycle if they are to the same bank, but several to
different banks; it has been too long since I read how it worked. Y
and Z are the same speed as A, showing that the distance between
addresses does not influence the no-alias detection (at least for
these distances).

Is this what you had in mind?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2023-11-05 15:09:34 UTC
Post by Anton Ertl
Post by EricP
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
The numbers are in 8-byte granularity, so to get different cache lines
n*8 is good enough, and for different pages n*512. However, the
array is only 8000 bytes long, so you can use only numbers in the
range 0...999. Anyway, I made another make target "ericp", so when
X 0 8 16 24 32 40 48 56
Different cache lines, but always the same "bank" in a cache line;
this should hurt K8, maybe also others.
Y 0 9 18 27 36 45 54 63
Different cache lines, and different banks for different accesses.
Z 0 513 994 11 524 989 22 535
All different cache lines and different banks; the second access is on
a different page than the first, and the third is likely on a
different page. Then start with the first page again. These are all
independent accesses.
X Y Z A B C D E F G H
1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8
We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
the dependence-wise equivalent A. My guess is that these CPUs can
only store to one cache line per cycle, and can optimize the stores in
some cases for A. If that is the case, that is a resource thing, not
a failure to recognize independence.
It might be that A is occasionally combining the multiple stores
to the same line in the store buffer whereas X Y Z do not.
So maybe A needs 25% less cache accesses.
Post by Anton Ertl
We see a slowdown for X on K8, which is somewhat expected; however,
thinking about it, I wonder: It seems as if the K8 can do only one load
or store per cycle if they are to the same bank, but several to
different banks; it has been too long since I read how it worked. Y
and Z are the same speed as A, showing that the distance between
addresses does not influence the no-alias detection (at least for
these distances).
Is this what you had in mind?
- anton
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.

First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.

Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.

And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Chris M. Thomasson
2023-11-05 20:46:36 UTC
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW. Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
EricP
2023-11-08 14:26:23 UTC
Post by Chris M. Thomasson
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW.
Sure, because it's not an RMW. It's just a store which under x86-TSO
becomes visible after prior stores to the protected section.

This rule seems to impose some particular design complexities on the
order in which a Load-Store Queue and cache can perform store operations.

Say we have two cache lines A and B.

If there is a store to cache line A, ST [A+0],
then to a different line B, ST [B+0], then another ST [A+1],
and if the first ST [A+0] hits cache but second ST [B+0] misses,
then under TSO the third ST [A+1] must appear to stall so that it
does not become visible until after the ST [B+0] has been performed,
even though line A is in the cache.

ST [A+0],r1 <- this hits cache
ST [B+0],r2 <- this misses cache
ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
invalidate msg might get in and transfer ownership of line A to a different
core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.

And the cache access is pipelined so all of this is asynchronous.
When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
So even in the simple "stall until older stores done" approach it needs
even more logic to detect this and NAK the following stores back to LSQ,
and later wake them up and replay them when the ST [B+0] is done.
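
A litmus-test sketch of that constraint in C11 atomics (a sketch only,
with hypothetical names; on x86 all of these accesses compile to plain
MOVs). Under TSO, an observer that sees the new [A+1] must also see the
new [B+0]:

#include <stdatomic.h>

/* A0/A1 assumed to share one cache line, B0 assumed to sit on another */
_Atomic long A0, A1, B0;

void writer(void)                      /* core C1 */
{
  atomic_store_explicit(&A0, 1, memory_order_release);  /* hits cache   */
  atomic_store_explicit(&B0, 1, memory_order_release);  /* misses cache */
  atomic_store_explicit(&A1, 1, memory_order_release);  /* youngest ST  */
}

int forbidden(void)                    /* core C2 */
{
  long a1 = atomic_load_explicit(&A1, memory_order_acquire);
  long b0 = atomic_load_explicit(&B0, memory_order_acquire);
  return a1 == 1 && b0 == 0;           /* must never be true under TSO */
}
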
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
MitchAlsup
2023-11-10 01:42:15 UTC
Post by EricP
Post by Chris M. Thomasson
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW.
Sure, because it's not an RMW. It's just a store which under x86-TSO
becomes visible after prior stores to the protected section.
This rule seems to impose some particular design complexities on the
order in which a Load-Store Queue and cache can perform store operations.
Say we have two cache lines A and B.
If there is a store to cache line A, ST [A+0],
then to a different line B, ST [B+0], then another ST [A+1],
and if the first ST [A+0] hits cache but second ST [B+0] misses,
then under TSO the third ST [A+1] must appear to stall so that it
does not become visible until after the ST [B+0] has been performed,
even though line A is in the cache.
ST [A+0],r1 <- this hits cache
ST [B+0],r2 <- this misses cache
ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish
On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
invalidate msg might get in and transfer ownership of line A to a different
core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Post by EricP
And the cache access is pipelined so all of this is asynchronous.
When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
So even in the simple "stall until older stores done" approach it needs
even more logic to detect this and NAK the following stores back to LSQ,
and later wake them up and replay them when the ST [B+0] is done.
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
EricP
2023-11-11 21:43:54 UTC
Post by MitchAlsup
Post by EricP
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
MDM = Memory Dependency Matrix
TC = Temporal Cache
CC = Conditional Cache

The Conditional Cache sounds similar to the Store Buffers that other
designs refer to but with multiple versions, as you outline below.
This seems to duplicate many existing functions of the Load Store Queue.

I'm thinking it would be simpler to keep everything in one circular
load/store queue and hold store data there after retire until it
can be handed off to the cache. See below.
Post by MitchAlsup
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Yes, but that seems a rather expensive approach.
The problem I have with the CC is that it requires some pretty complex
logic to track multiple versions of cache lines, journal before-image copies,
track their inter-dependencies, and decide when to commit updates.
And much of this duplicates functionality that the LSQ already has to
support store-to-load forwarding and load-store bypass address matching.
All those buffers need free lists and allocators, another dependency matrix,
CAM's to match addresses to the assigned buffers.

It's not clear to me that all this is significantly better than
a simpler approach.

Instead I was thinking of having a unified LSQ as a single circular buffer
with all the load and store entries in (circular) program order.
Store data is held in the LSQ after the store instruction retires
until it is accepted by the cache. While it is in the LSQ it can still be
forwarded to younger loads to the same address.

LSQ stores are sent to the pipelined cache which indicates back to LSQ
a hit/miss after the pipeline latency.
If it hits then the entry is removed from LSQ.
If it misses then the store data is held in LSQ and the cache blocks further
stores until the miss resolves. However, LSQ continues to send future
store addresses to trigger line prefetches until all miss buffers are busy.
If a load misses then all further loads must stop because
load-load bypassing is not allowed under TSO.

When the missed line arrives, cache sends a wakeup signal to LSQ
which restarts sending entries from where it left off.
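
A rough sketch of what one entry in such a unified circular queue might
hold (field names are mine, not a claim about any particular design):

#include <stdint.h>

struct lsq_entry {        /* one slot, kept in (circular) program order */
  uint64_t addr;          /* effective address                          */
  uint64_t data;          /* store data, kept here after retire until
                             the cache accepts it; forwarding source    */
  uint8_t  size;          /* access size in bytes                       */
  uint8_t  is_store;      /* load or store                              */
  uint8_t  retired;       /* a retired store may still be waiting here  */
  uint8_t  issued;        /* sent to the pipelined cache, hit/miss pending */
};
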
MitchAlsup
2023-11-11 23:32:30 UTC
Post by EricP
Post by MitchAlsup
Post by EricP
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
MDM = Memory Dependency Matrix
TC = Temporal Cache
CC = Conditional Cache
The Conditional Cache sounds similar to the Store Buffers that other
designs refer to but with multiple versions, as you outline below.
This seems to duplicate many existing functions of the Load Store Queue.
I'm thinking it would be simpler to keep everything in one circular
load/store queue and hold store data there after retire until it
can be handed off to the cache. See below.
<
a) it is circular and tied intimately to the execution window.
b) it integrates the LDs into the queue {just in case a store has
to be replayed all dependent LDs get replayed too and for solving
memory order problems dynamically.}
Post by EricP
Post by MitchAlsup
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Yes, but that seems a rather expensive approach.
The problem I have with the CC is that it requires some pretty complex
logic to track multiple versions of cache lines, journal before-image copies,
track their inter-dependencies, and decide when to commit updates.
<
We did not track all that stuff (Mc 88120, circa 1991). What we did was to copy
data from the cache/memory into everybody needing that data; then stores could
overwrite that data in the entries of younger accesses, so most of the STs did
not actually write into DC because we knew a younger store would do that for us.
<
Post by EricP
And much of this duplicates functionality that the LSQ already has to
support store-to-load forwarding and load-store bypass address matching.
<
The CC is the LD/ST queue, but by integrating certain features, most of the
control logic vanishes, and the component "pretty much" manages itself. It
does speak back to the Execution window logic to control the advancement of
the consistent point and the retire point of the window.
<
Post by EricP
All those buffers need free lists and allocators, another dependency matrix,
CAM's to match addresses to the assigned buffers.
<
Nope:: dedicated storage. If the storage is not available, insert stalls.
<
Post by EricP
It's not clear to me that all this is significantly better than
a simpler approach.
<
Think of it like a reservation station for a horizon of memory reference
instructions. Instead of CAMing on renamed registers, you CAM on portion
of VA from the CPU side, and on PA for the snooping side. The MDM handles
the memory renaming in a way that can be undone {Branch prediction repair
and Snoop discovering memory order has been violated.}
Post by EricP
Instead I was thinking of having a unified LSQ as a single circular buffer
with all the load and store entries in (circular) program order.
<
Yep, that is what a CC does. However, once 2 addresses discover "we cannot
refer to the same cache line container" they become independent.
<
Post by EricP
Store data is held in the LSQ after the store instruction retires
<
The ST must deliver its data to DC before the ST can retire. Until the ST
delivers its data to DC, the ST is subject to being replayed upon detection
of memory order violations.
<
Post by EricP
until it is accepted by the cache. While it is in the LSQ it can still be
forwarded to younger loads to the same address.
LSQ stores are sent to the pipelined cache which indicates back to LSQ
a hit/miss after the pipeline latency.
<
And hit data is written into all CC entries waiting for it.
<
Post by EricP
If it hits then the entry is removed from LSQ.
<
Entries are inserted into CC at issue and removed from CC at retire. They
remain replayable until consistent.
<
Post by EricP
If it misses then the store data is held in LSQ and cache blocks further
stores until the miss resolves. However LSQ continues to send future
<
Mc 88120 kept byte writes in the CC.data so STs could write data into CC
even before data arrives from the memory hierarchy.
<
Post by EricP
store address to trigger line prefetches until all miss buffers are busy.
<
The CC is the Miss Buffer !! and entries that missed but got translated
are selected for external access in much the same way that instructions
are selected from reservation stations.
<
Post by EricP
If a load misses then all further loads must stop because
load-load bypassing is not allowed under TSO.
<
But the MDM effectively allows multiple LDs to miss and be serviced
independently {except config, MMI/O, ATOMIC} and repair back to TSO
only if an external event requires actual TSO behavior.
<
Post by EricP
When the missed line arrives, cache sends a wakeup signal to LSQ
which restarts sending entries from where it left off.
<
When miss data returns with the PA, a portion of the PA is used and everybody who
is waiting on the returning data snarfs it up in the same write (into CC)
cycle.
Chris M. Thomasson
2023-11-10 20:57:36 UTC
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
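
A minimal sketch of that store-then-load sequence in C11 atomics
(hypothetical names; the seq_cst fence is the membar, which on x86
compiles to MFENCE or a LOCK'ed instruction):

#include <stdatomic.h>

_Atomic(void *) shared;    /* pointer being protected      */
_Atomic(void *) hazard;    /* this thread's hazard pointer */

void *protect(void)
{
  void *p;
  do {
    p = atomic_load(&shared);
    atomic_store(&hazard, p);                    /* store: publish the hazard */
    atomic_thread_fence(memory_order_seq_cst);   /* order the store before... */
  } while (p != atomic_load(&shared));           /* ...this re-checking load  */
  return p;
}
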
MitchAlsup
2023-11-10 23:35:55 UTC
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes
to memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Chris M. Thomasson
2023-11-11 04:07:21 UTC
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes to
memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
MitchAlsup
2023-11-12 19:13:07 UTC
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes to
memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
<
I watched membar evolve during my time at AMD and decided to make a
machine that never needs to use them--but still gets the right thing
done.
<
Chris M. Thomasson
2023-11-12 22:01:46 UTC
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86?
It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes
to memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
<
I watched membar evolve during my time at AMD and decided to make a
machine that never needs to use them--but still gets the right thing
done.
<
TSO? Oh, wait. I need to think back about your system. It's been some
years since we conversed about it. Btw, what happened to the Mill?
Anyway, I digress.

Fwiw, imho, it helps to think about proper alignment and boundaries, and to
work with a given architecture, not against it.

A highly relaxed memory model can be beneficial for certain workloads.
Kent Dickey
2023-11-12 22:33:11 UTC
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.

I believe that statement to be false. Can you describe some of these
workloads?

Kent
Chris M. Thomasson
2023-11-12 22:39:11 UTC
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
One example... Basically, think along the lines of RCU. Usually,
read-mostly, write-rarely workloads can _greatly_ benefit from not using
any memory barriers. And this is on rather relaxed systems. Think SPARC
in RMO mode. DEC Alpha is a "special case" wrt RCU.
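
A minimal sketch of such a barrier-free read side (RCU-style,
hypothetical names; the address dependency alone orders the two loads
on everything except Alpha):

#include <stdatomic.h>

struct node { int value; };
_Atomic(struct node *) head;

int reader(void)
{
  /* dependent load: no barrier instruction on the read side;
     compilers typically promote consume to acquire */
  struct node *n = atomic_load_explicit(&head, memory_order_consume);
  return n->value;
}
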
Chris M. Thomasson
2023-11-12 22:43:18 UTC
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
MitchAlsup
2023-11-13 00:20:51 UTC
Post by Chris M. Thomasson
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than the
right answer later. There are very few algorithms with this property.
Chris M. Thomasson
2023-11-13 00:28:20 UTC
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
unlock. On SPARC that would be:

acquire = MEMBAR #LoadStore | #LoadLoad

release = MEMBAR #LoadStore | #StoreStore

Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
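
To make that concrete, here is a minimal C++11-style spinlock sketch (not
from the thread, just the std::atomic equivalent of the MEMBAR pairs above):
lock() only needs acquire, unlock() only needs release, and no #StoreLoad /
seq_cst fence appears anywhere.

#include <atomic>

struct spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        // acquire: later loads/stores may not move above the lock,
        // roughly MEMBAR #LoadStore | #LoadLoad on SPARC
        while (locked.exchange(true, std::memory_order_acquire))
            ;                            // spin
    }

    void unlock() {
        // release: earlier loads/stores may not move below the unlock,
        // roughly MEMBAR #LoadStore | #StoreStore on SPARC
        locked.store(false, std::memory_order_release);
    }
};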

Also, RCU prides itself on not having to use any memory barriers for
its read side. If RCU were forced to use seq_cst, basically a LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
Kent Dickey
2023-11-13 05:44:43 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms with the exact opposite meaning from how I would
understand them in computer architecture.

We'll just assume 3 choices for CPU ordering:

- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.

So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
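
As a concrete sketch of the "special cases like Lamport's algorithm"
mentioned above (illustrative C++, not from the thread): in the classic
store-buffering pattern each thread stores one flag and loads the other,
and that store-then-load ordering is exactly what TSO does not give you
for free. With seq_cst (or an explicit #StoreLoad fence) the outcome
r0 == 0 && r1 == 0 is forbidden; with anything weaker it is allowed.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void t0() { x.store(1, std::memory_order_seq_cst);
            r0 = y.load(std::memory_order_seq_cst); }
void t1() { y.store(1, std::memory_order_seq_cst);
            r1 = x.load(std::memory_order_seq_cst); }

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    assert(r0 == 1 || r1 == 1);  // holds under seq_cst, can fail under weaker orders
}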

A relaxed ordering SYSTEM says: rather than spending a few thousand
gates getting ordering right in hardware in the CPU, we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45-page academic paper using all of the Greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is to put in too many barriers
and wait for someone to nitpick your code and fix it for you.

[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.]

A very simple thought exercise shows that even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), a Relaxed Ordering
system WILL be slower than TSO or Sequential Consistency for workloads
which often use barriers (instructions tagged with acquire/release are
barriers). In a Relaxed Ordering system, the barriers will not be as
efficient as the automatic ordering of TSO/SC (otherwise, why not just do
that?), so if barriers are executed often, performance will be lower than
hardware TSO/SC, even if there is no contention or any work for the
barriers to do. In fact, performance could be significantly lower.

People know this; it's why they keep trying to get rid of barriers in
their code. So get rid of all of them and demand TSO ordering.

Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.

Relaxed Ordering is a mistake.

Kent
a***@littlepinkcloud.invalid
2023-11-13 11:49:31 UTC
Permalink
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.

The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
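
A hedged message-passing sketch of what that buys you (illustrative names,
not from the thread): on a causally consistent, multi-copy-atomic machine
one release/acquire pair publishes the payload, and a reader that sees the
flag is guaranteed to see the data it protects.

#include <atomic>
#include <cassert>
#include <thread>

int payload;                                        // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // publish
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {    // observe the publication
        // Release/acquire plus multi-copy atomicity means the payload write
        // is visible here; no Alpha-style going-backwards-in-time.
        assert(payload == 42);
    }
}

int main() { std::thread p(producer), c(consumer); p.join(); c.join(); }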
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.

Andrew.
MitchAlsup
2023-11-13 19:16:41 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.
The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
<
The control logic to perform it may be that small, but the state to maintain
it is at least (5×48^2)×1.2 gates.
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
<
Andy Glew used to report that there were surprisingly few instructions
performed out of order on a GBOoO machine.
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.
Andrew.
Chris M. Thomasson
2023-11-13 21:11:18 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.
The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.
Andrew.
a***@littlepinkcloud.invalid
2023-11-14 10:25:22 UTC
Permalink
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.

(For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
Algorithm, which needs something stronger.)

Because of this, the assertion that programming a non-TSO machine is
"harder" doesn't IMO stand up, at least in C programs, because the
same data-race bugs can manifest themselves as either compiler
optimizations or hardware reorderings. And a compiler can, at least in
theory, do things that are far weirder than any memory system
does.
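
As a hedged illustration (not from the thread) of the compiler being the
first reorderer you meet: with plain variables, the compiler alone is
already allowed to break the naive flag/data idiom on any machine, TSO
included, which is why the races have to be removed at the source level
before the hardware memory model even matters.

int data;
bool ready;                 // plain, non-atomic: a data race on purpose

int consumer_plain() {
    // The compiler may legally hoist the load of 'ready' out of the loop
    // (nothing in the loop changes it), producing an infinite loop or a
    // stale 'data' -- a reordering no memory model, TSO or otherwise,
    // can undo. Undefined behaviour per the C/C++ standards.
    while (!ready) { }
    return data;
}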

Andrew.
Chris M. Thomasson
2023-11-14 20:44:44 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough.
Well, I think I was being a bit too dense here and missed your main
point. Sorry.
Post by a***@littlepinkcloud.invalid
If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering.
Agreed.
Post by a***@littlepinkcloud.invalid
If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
(For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
Algorithm, which needs something stronger.)
Indeed, it does.
Post by a***@littlepinkcloud.invalid
Because of this, the assertion that programming a non-TSO machine is
"harder" doesn't IMO stand up, at least in C programs, because the
same data-race bugs can manifest themselves as either compiler
optimizations or hardware reorderings. And a compiler can, at least in
theory, can do things that are far weirder than any memory system
does.
Kent Dickey
2023-11-15 20:32:16 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
What you are saying is:

As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.

If you assume you've already solved the problem, then you find the
problem is solved! Magic!

What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery. This acquire/release
nonsense is all weakly ordered brain damage. The problem is that on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem. Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slowdown, not ordering the requests
properly.

If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.

It also would be nice if multithreaded programs could be written in C99,
or pre-C++11. It's kinda surprising we've only been able to write threaded
programs for about 10 years.

Kent
Chris M. Thomasson
2023-11-15 20:42:13 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved! Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery. This acquire/release
nonsense is all weakly ordered brain damage.
Huh? Just because you have a problem with memory ordering does not mean
that all std::memory_order_* should default to seq_cst! You are being
radically obtuse here. It's as if you have no idea what you are
writing about.

[...]
MitchAlsup
2023-11-15 20:56:39 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved! Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Post by Kent Dickey
The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
detecting the start of the ATOMIC event (LL for example), stay SC until
the ATOMIC event is done, then revert back to weakly ordered as it pleases.
The single threaded code gets its performance while the multithreaded code
gets its SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
<
But you see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC, not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order; the (multiplicity
of) memory controllers do. The CPUs just have to figure out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11. It's kinda surprising we've only been able to write threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:00:06 UTC
Permalink
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think it's brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:35:41 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:41:06 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
pseudo-code:

std::atomic<node*> m_node = create_stack();



// Read-Copy Update (RCU)
// reader thread iteration...

node* current = m_node.load(std::memory_order_relaxed);
while (current)
{
node* next = current->m_next.load(std::memory_order_relaxed);
compute(current);
current = next;
}


Now, you are trying to tell me that I should use seq_cst instead of
those relaxed loads for the iteration? That would DESTROY
performance, big time... Really BAD!
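
For contrast, here is a hedged sketch of a matching writer side (the node
type and the single-writer assumption are mine, invented to fit the reader
loop above): the writer pays for one release store when it publishes a new
node, so the cost lives on the rarely executed path while the readers stay
barrier-free.

#include <atomic>

struct node {
    int m_value{0};
    std::atomic<node*> m_next{nullptr};
};

std::atomic<node*> m_node{nullptr};   // list head, as in the reader sketch

// Writer side (sketch, assuming a single writer or writers otherwise
// serialized, as is typical for RCU): fully construct the node, then
// publish it with a single release store. Readers using relaxed loads
// rely on the address dependency through the pointer -- fine everywhere
// except DEC Alpha, which needs a read-side barrier (or consume).
void push_front(int value) {
    node* n = new node;
    n->m_value = value;
    n->m_next.store(m_node.load(std::memory_order_relaxed),
                    std::memory_order_relaxed);
    m_node.store(n, std::memory_order_release);    // publish
}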


Listen to all of this:

http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 21:56:31 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be
(probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes
them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to
figure out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
   node* next = current->m_next.load(std::memory_order_relaxed);
   compute(current);
   current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
Heck, if I knew I was on a DEC Alpha and depending on the writer side of
the algorithm I was using, std::memory_order_consume would be in order.
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
MitchAlsup
2023-11-15 23:32:12 UTC
Permalink
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
node* next = current->m_next.load(std::memory_order_relaxed);
compute(current);
current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered. It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
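
A hedged sketch of that pattern in portable C++ terms (the names and the
'claimed' flag are illustrative, not Mitch's ESM): traverse with relaxed
loads, and spend the one strong atomic operation only on the element being
claimed.

#include <atomic>

struct node {
    std::atomic<bool> claimed{false};
    std::atomic<node*> next{nullptr};
};

node* claim_first_unclaimed(std::atomic<node*>& head) {
    for (node* n = head.load(std::memory_order_relaxed); n != nullptr;
         n = n->next.load(std::memory_order_relaxed)) {      // unordered scan
        bool expected = false;
        if (n->claimed.compare_exchange_strong(expected, true,
                                               std::memory_order_seq_cst))
            return n;        // exclusive access established only here
    }
    return nullptr;
}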
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 23:54:00 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered.
It has to use atomic loads and stores. However, RCU makes it so we do
not need to use any memory barriers (DEC Alpha aside for a moment) while
iterating it.
Post by MitchAlsup
It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
Even a mutex does not need sequential consistency.

Dekker aside for a moment because it depends on a store followed by a
load to another location. TSO cannot even handle this without a membar.
Post by MitchAlsup
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 23:56:50 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered.
It has to use atomic loads and stores. However, RCU makes it so we do
not need to use any memory barriers (dec alpha aside for a moment) while
iterating it.
Post by MitchAlsup
It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
Even a mutex does not need sequential consistency.
Dekker aside for a moment because it depends on a store followed by a
load to another location. TSO cannot even handle this without a membar.
Post by MitchAlsup
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Afaict, std::memory_order_consume was added to support RCU-compatible
algorithms.
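
A hedged sketch of what a consume-based read side looks like (illustrative
names; note that current compilers typically promote consume to acquire,
but the intent is dependency ordering at zero cost on most hardware):

#include <atomic>

struct node { int value; std::atomic<node*> next{nullptr}; };
std::atomic<node*> head{nullptr};

int read_first() {
    // consume: ordering is carried by the data dependency
    // head -> n -> n->value, so no barrier is needed on most hardware
    // (DEC Alpha being the exception the standard had to accommodate).
    node* n = head.load(std::memory_order_consume);
    return n ? n->value : -1;
}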
a***@littlepinkcloud.invalid
2023-11-17 09:03:24 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded
accesses are only through atomic variables, and you label every
access to an atomic variable properly (although my point is: exactly
what should that be??), then there is no problem.
Well, this is definitely true. But it's not exactly a plan: in
practice, people use careful synchronization boundaries and immutable
data structures.
Post by Kent Dickey
What I'm arguing is: the CPU should behave as if
memory_order_seq_cst is set on all accesses with no special
trickery.
What I'm saying is: that isn't sufficient if you are using an
optimizing compiler. And if you are programming for an optimizing
compiler you have to follow the rules anyway. And the optimizing
compiler can reorder stores and loads as much, if not more, than the
hardware does.
Post by Kent Dickey
This acquire/release nonsense is all weakly ordered brain
damage. The problem is on weakly ordered CPUs, performance
definitely does matter in terms of getting this stuff right, but
that's their problem. Being weakly ordered makes them slower when
they have to execute barriers for correctness, but it's the barriers
themselves that are the slow down, not ordering the requests
properly.
How is that? All the barriers do is enforce the ordering.

Take the Apple M1 as an example. It has a TSO mode bit. It also has TSO
stores and loads, intended for when TSO mode is turned off. Are you
saying that the TSO stores and loads use a different mechanism to
enforce ordering from the one used when TSO is on by default?

Andrew.
Stefan Monnier
2023-11-17 15:44:45 UTC
Permalink
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.


Stefan
MitchAlsup
2023-11-17 18:37:17 UTC
Permalink
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, let's say you want to
move an element from one doubly linked list to another place in some
other doubly linked list:: you would write::
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;



if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}

In order to change this into a fully qualified ATOMIC event, the code
is decorated as::

BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}

Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
get the job(s) done.
Chris M. Thomasson
2023-11-17 19:21:13 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
Interesting. Any chance of livelock wrt esmINTERFERENCE() ?
Chris M. Thomasson
2023-11-17 19:22:31 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
^^^^^^^^^^^^^^^^^^^^^^^^

Why perform an esmLOCK here? Please correct my confusion.
Post by MitchAlsup
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
MitchAlsup
2023-11-17 20:49:08 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
^^^^^^^^^^^^^^^^^^^^^^^^
Why perform an esmLOCK here? Please correct my confusion.
esmLOCK on the last participating Store is what tells the HW that the ATOMIC
event is finished. So, the first esmLOCK begins the ATOMIC event and the
one esmLOCK on an outgoing (ST or PostPush) memory reference ends the event.
Post by Chris M. Thomasson
Post by MitchAlsup
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
Scott Lurndal
2023-11-18 00:03:15 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;
if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
to get the job(s) done.
That looks suspiciously like transactional memory.
MitchAlsup
2023-11-18 01:03:26 UTC
Permalink
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;
if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
to get the job(s) done.
That looks suspiciously like transactional memory.
It has some flavors of such, but::
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;

But it does have the property that all interested 3rd parties see the
participating memory either entirely in the before state or entirely in
the completely-after state.
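As a sketch of how a caller might use the failure path given these
properties (mine, not from the post; the retry bound, lock()/unlock(),
list_mutex and move_element_locked() are hypothetical), the control
transfer on interference simply lands on the FALSE return, so software
can bound its retries and fall back:

void MoveElementOrFallBack( Element *fr, Element *to )
{
    for( int tries = 0; tries < 8; tries++ )
        if( MoveElement( fr, to ) )      // the ESM routine shown above
            return;                      // ATOMIC event committed

    lock( list_mutex );                  // persistent interference:
    move_element_locked( fr, to );       // fall back to a conventional lock
    unlock( list_mutex );
}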
Chris M. Thomasson
2023-11-18 21:47:31 UTC
Permalink
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see
participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?

https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

?

Check this out the old thread:

https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
MitchAlsup
2023-11-19 19:32:07 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
Easily done:
esmLOCK( c1 = p1->condition1 );
esmLOCK( c2 = p2->condition2 );
...
if( c1 == C1 && c2 == C2 && c3 == C3 ... )
...
esmLOCK( some data );

Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
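Filling in the elisions as a concrete (and hedged) rendering of that
sketch: a 3-compare single-swap, where the first compared location is
also the swap target. esmLOCK()/esmINTERFERENCE() are the ESM primitives
from this thread; everything else is a made-up name:

BOOLEAN KCSS3( long *p1, long C1,        // p1 is also the swap target
               long *p2, long C2,
               long *p3, long C3,
               long newval )
{
    long c1, c2, c3;

    esmLOCK( c1 = *p1 );                 // event begins; lines participate
    esmLOCK( c2 = *p2 );
    esmLOCK( c3 = *p3 );

    if( !esmINTERFERENCE() && c1 == C1 && c2 == C2 && c3 == C3 )
    {
        esmLOCK( *p1 = newval );         // single swap; last store ends the event
        return TRUE;
    }
    return FALSE;
}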
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
Chris M. Thomasson
2023-11-19 19:47:01 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all
multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want
to move an element from one doubly linked list to another place in
some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
So, can false sharing on a participating cache line make
esmINTERFERENCE() return true?
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
    esmLOCK( c1 = p1->condition1 );
    esmLOCK( c2 = p2->condition2 );
    ...
    if( c1 == C1 && C2 == C2 && c2 == C3 ... )
        ...
        esmLOCK( some data );
Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
I can see how that would work. The problem is that I am not exactly sure
how esmINTERFERENCE works internally... Can it detect/prevent live lock?
I remember hearing from my friend Joe Seigh, who worked at IBM, that
they had some sort of logic that would prevent live lock in a compare
and swap wrt their free pool manipulation logic. Iirc, it was somewhat
related to ABA, hard to remember right now, sorry. I need to find that
old thread back in comp.programming.threads.
Post by MitchAlsup
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
MitchAlsup
2023-11-19 22:23:21 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all
multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want
to move an element from one doubly linked list to another place in
some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
So, can false sharing on a participating cache line make
esmINTERFERENCE() return true?
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
    esmLOCK( c1 = p1->condition1 );
    esmLOCK( c2 = p2->condition2 );
    ...
    if( c1 == C1 && C2 == C2 && c2 == C3 ... )
        ...
        esmLOCK( some data );
Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
I can see how that would work. The problem is that I am not exactly sure
how esmINTERFERENCE works internally... Can it detect/prevent live lock?
esmINTERFERENCE is a branch on interference instruction. This is a conditional
branch instruction that queries whether any of the participating cache lines
has seen a read-with-intent or coherent-invalidate. In effect the branch
logic reaches out to the miss buffer and asks if any of the participating
cache lines has been snooped-for-write:: and you can't do this in 2
instructions or you lose ATOMICity.

If it has, control is transferred and the ATOMIC event is failed
If it has not, and all participating cache lines are present, then this
core is allowed to NAK all requests to those participating cache lines
{and control is not transferred}.

So, you gain control over where flow goes on failure, and essentially
commit the whole event to finish.

So, while it does not eliminate live/dead-lock situations, it allows SW
to be constructed to avoid live/dead-lock situations:: esmWHY provides a
value when an ATOMIC event fails. 0 means success, negative values are
spurious failures (buffer overflows,...), while positive values give the
number of competing threads. So the following code skips elements on a
linked list to decrease future interference.

Element* getElement( unsigned Key )
{
    Element *p, *prev, *next;
    int count = 0;

    for( p = structure.head; p ; p = p->next )
    {
        if( p->Key == Key )
        {
            if( count-- < 0 )
            {
                esmLOCK( p );                    // participating lines
                prev = p->prev;
                esmLOCK( prev );
                next = p->next;
                esmLOCK( next );
                if( !esmINTERFERENCE() )
                {
                    prev->next = next;           // unlink p from the list
                    next->prev = prev;
                    p->prev = NULL;              // scrub p's own links
                    esmLOCK( p->next = NULL );   // last store ends the event
                    return p;
                }
                else
                {
                    count = esmWHY();            // number of competing threads
                    p = structure.head;          // restart, skipping count matches
                }
            }
        }
    }
    return NULL;
}

Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
that happens when a timer goes off and n threads all want access to the
work queue down to BigO( 3 ), i.e. constant; in practice it comes out
around BigO( ln( n ) ) when requesters arrive in random order at random times.
Post by Chris M. Thomasson
I remember hearing from my friend Joe Seigh, who worked at IBM, that
they had some sort of logic that would prevent live lock in a compare
and swap wrt their free pool manipulation logic. Iirc, it was somewhat
related to ABA, hard to remember right now, sorry. I need to find that
old thread back in comp.programming.threads.
Depending on system size, there can be several system function units that
grant "order" for ATOMIC events. These are useful for 64+ node systems
and unnecessary for systems with fewer than 8 nodes. Disjoint memory spaces
can use independent ATOMIC arbiters and whether they are in use or not is
invisible to SW.
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
Stefan Monnier
2023-11-20 16:32:00 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
[...]
Post by MitchAlsup
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event.
[...]
Post by MitchAlsup
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
This is nice, but the onus is still on the programmer to manually make
sure they don't forget to always `esmLOCK` all the shared data, that
they don't `esmLOCK` the data that doesn't need it, ...

In contrast, Haskell's STM will emit a type error if you ever try to use
a shared var in a non-atomic sequence (or if you try to do things like
`printf` from within an atomic sequence, ...).

I see Haskell's STM as a kind of "ideal API" for the programmer.
And it works tolerably in many cases. But reconciling this kind of
abstraction with the performance you can get by low-level twiddling is
the hard part :-)


Stefan
EricP
2023-11-13 18:22:44 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In
general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms in the exact opposite meaning as I would understand
computer architecture.
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45 page academic paper using all the greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is put in too many barriers
and wait for someone to nitpick your code and fix it for you.
[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.].
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if you
believe that, a Relaxed Ordering system WILL be slower than TSO or
Sequential for workloads which often use barriers (instructions tagged
with acquire/release are barriers). In a Relaxed ordering system, the
barriers will not be as efficient as the automatic barriers of TSO/SC
(otherwise, why not just do that?), so if barriers are executed often,
performance will be lower than hardware TSO/SC, even if there are no
contentions or any work for the barriers to do. In fact, performance
could be significantly lower.
People know this, it's why they keep trying to get rid of barriers in
their code. So get rid of all them and demand TSO ordering.
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.

Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.

This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.

But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END instructions.
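As a sketch of what that might look like to software (mine; TSO_START and
TSO_END are the hypothetical instructions proposed here, and
__tso_start()/__tso_end() stand in for whatever intrinsics a compiler
would expose for them), the lock-free code inside the bracket is written
with plain relaxed atomics and relies on the TSO mode for ordering:

#include <stdatomic.h>

typedef struct Node { struct Node *next; } Node;

void lockfree_push( _Atomic(Node *) *top, Node *n )
{
    __tso_start();                       // hypothetical: core now follows TSO
    Node *t = atomic_load_explicit( top, memory_order_relaxed );
    do {
        n->next = t;
    } while( !atomic_compare_exchange_weak_explicit( top, &t, n,
                 memory_order_relaxed, memory_order_relaxed ) );
    __tso_end();                         // hypothetical: back to relaxed rules
}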

On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if it's only really
needed for this rare kind of programming?

But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.

If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 to obey them.

Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
MitchAlsup
2023-11-13 19:29:30 UTC
Permalink
Post by EricP
Post by Kent Dickey
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In
general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms in the exact opposite meaning as I would understand
computer architecture.
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45 page academic paper using all the greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is put in too many barriers
and wait for someone to nitpick your code and fix it for you.
[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.].
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if you
believe that, a Relaxed Ordering system WILL be slower than TSO or
Sequential for workloads which often use barriers (instructions tagged
with acquire/release are barriers). In a Relaxed ordering system, the
barriers will not be as efficient as the automatic barriers of TSO/SC
(otherwise, why not just do that?), so if barriers are executed often,
performance will be lower than hardware TSO/SC, even if there are no
contentions or any work for the barriers to do. In fact, performance
could be significantly lower.
People know this, it's why they keep trying to get rid of barriers in
their code. So get rid of all them and demand TSO ordering.
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
<
I suggest this switch between modes be done without executing any extra
instructions. I do the switches based on the address space of the access
{DRAM, MMI/O, config, ROM} and I also switch to SC when an ATOMIC event
begins.
<
Post by EricP
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
<
I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
<
Post by EricP
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures.
<
It is hard for people who have a completely von Neumann thinking pattern
to reason about these things; it is not hard for someone whose entire
career was spent doing a multiplicity of things concurrently.
<
SW languages (and debuggers,...) teach people to think:: this happens, then
that happens, then something else happens. HW languages teach people to think
"crap, all of this is happening at once, how do I make sense out of it?"
<
In CPU design (or chip design in general) one is NEVER given the illusion
that a single <vast> state describes the moment. It is a shame that SW
did not follow a similar route.
<
Post by EricP
Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
<
I made this exact argument to SUN circa 1993.....
<
Post by EricP
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
So, anything written in JavaScript........
Chris M. Thomasson
2023-11-13 21:31:27 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]

Intel still requires an explicit membar for hazard pointers as-is. SPARC
in TSO mode still requires a membar for this. SPARC needs a MEMBAR
#StoreLoad for the store-followed-by-a-load-to-another-location
relationship to hold; Intel needs a LOCK'ed atomic or MFENCE to handle this.
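A minimal C11 sketch of that pattern (my example, not from the post): the
shape is hazard-pointer publication, where a store must be ordered before
a later load from a different location, which is exactly the one ordering
TSO does not give for free. hazard_slot and shared_ptr are hypothetical:

#include <stdatomic.h>

_Atomic(void *) hazard_slot;             // this thread's hazard pointer
_Atomic(void *) shared_ptr;              // pointer being protected

void *acquire_hazard( void )
{
    void *p, *q;
    do {
        p = atomic_load_explicit( &shared_ptr, memory_order_relaxed );
        atomic_store_explicit( &hazard_slot, p, memory_order_relaxed );

        /* the store above must be visible before the re-load below:
           store->load ordering, i.e. #StoreLoad; on x86 this is the
           one case that still costs an MFENCE or a LOCK'ed RMW */
        atomic_thread_fence( memory_order_seq_cst );

        q = atomic_load_explicit( &shared_ptr, memory_order_relaxed );
    } while( p != q );                   // retry if the pointer moved
    return p;
}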
Branimir Maksimovic
2023-11-15 04:10:08 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]
Intel still requires an explicit membar for hazard pointers as-is. Sparc
in TSO mode still requires a membar for this. Spard needs a #StoreLoad
wrt the store followed by a load to another location relationship to
hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
I think that the Apple M1 requires it, too. I had problems without a membar.
--
7-77-777, Evil Sinner!
https://www.linkedin.com/in/branimir-maksimovic-6762bbaa/
Chris M. Thomasson
2023-11-17 19:31:53 UTC
Permalink
Post by Branimir Maksimovic
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]
Intel still requires an explicit membar for hazard pointers as-is. Sparc
in TSO mode still requires a membar for this. Spard needs a #StoreLoad
wrt the store followed by a load to another location relationship to
hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
I think that Apple M1 requires, too. I has problems wihout membar.
Afaict, any TSO model needs an explicit barrier for the
store-followed-by-a-load-to-another-location relationship to hold. If the
program does not rely on this relationship, then it does not need those
barriers on TSO.
Paul A. Clayton
2023-11-20 16:00:30 UTC
Permalink
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that
relaxed rules
allow, because the local core always sees its own memory as
consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END
instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire
sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
Another possibility is for non-shared memory to be handled
differently. (This is similar to My 66000's handling of memory
types and things mentioned by Mitch Alsup here.)

Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.

Various memory partitioning schemes theoretically can provide
similar benefits for systems with shared memory controllers where
programs do not share most modifiable data with other programs.
Even something like a web hosting system might be able to benefit
from lack of coherence (much less consistency) between different
web hosts.
Scott Lurndal
2023-11-20 17:26:43 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.

Objects allocated on the stack in 'main', for instance.
MitchAlsup
2023-11-21 00:52:08 UTC
Permalink
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.
Even Thread Local Store is not private to the thread if the thread
creates a pointer into it and allows others to see the pointer.

The only thing the HW can validate as non-shared is that portion of
the stack containing callee save registers (and the return address),
but only 2 known architectures place these chunks of memory in an
address space where threads cannot read-write-or-execute that chunk.
Post by Scott Lurndal
Objects allocated on the stack in 'main', for instance.
EricP
2023-11-21 14:29:43 UTC
Permalink
Post by MitchAlsup
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.
Even Thread Local Store is not private to the thread if the thread
creates a pointer into it and allows others to see the pointer.
The only thing the HW can validate as non-shared is that portion of
the stack containing callee save registers (and the return address)
but only 2 known architectures have these chunks of memory is an
address space where threads cannot read-write-or-execute that chunk.
The callee save area may be R-W-E page protected against its own thread,
but that doesn't prevent a privileged thread from concurrently accessing
that save area (say, to edit the stack to deliver a signal),
so the same coherence rules apply there too.
EricP
2023-11-20 18:26:39 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END
instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
Another possibility is for non-shared memory to be handled
differently. (This is similar to My 66000's handling of memory
types and things mentioned by Mitch Alsup here.)
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Various memory partitioning schemes theoretically can provide
similar benefits for systems with shared memory controllers where
programs do not share most modifiable data with other programs.
Even something like a web hosting system might be able to benefit
from lack of coherence (much less consistency) between different
web hosts.
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
And this is independent of language; it's to do with program structure.

You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code is modularized this way.

But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.

Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread?
What if the variable is on the thread stack or in a heap?

On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.

And this is where TSO makes the difference.
I don't have to know exactly every byte that might be updated concurrently
in some context at some time (and that context can change dynamically).
a***@littlepinkcloud.invalid
2023-11-21 13:44:56 UTC
Permalink
Post by EricP
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
That's more about program design. In a multi-threaded program this is
something you really should know.
Post by EricP
And this is independent of language, its to do with program structure.
You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code in modularized this way.
But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.
Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread.
What if the variable is on the thread stack or in a heap?
On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.
In this case, no, I don't think you do. Barriers only control the
ordering between accesses, not when they become visible, and here
there's only one access. If there are at least two, and you really
need to see one before the other, then you need a barrier. And even on
a TSO machine, you're going to have to do something on both the reader
and the writer sides if you need ordering to be protected from a
compiler.

Andrew.
EricP
2023-11-21 17:11:00 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by EricP
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
That's more about program design. In a multi-threaded program this is
something you really should know.
Post by EricP
And this is independent of language, its to do with program structure.
You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code in modularized this way.
But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.
Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread.
What if the variable is on the thread stack or in a heap?
On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.
In this case, no, I don't think you do. Barriers only control the
ordering between accesses, not when they become visible, and here
there's only one access. If there are at least two, and you really
need to see one before the other, then you need a barrier.
The barriers also ensure the various local buffers, pipelines, and
inbound and outbound comms command and reply message queues are drained.
A barrier ensures that the operations that came before it have reached
their coherency point - the cache controller - and that any
outstanding asynchronous operations are complete.
And that in turn controls when values become visible.

On a weak order cpu with no store ordering, the cpu is not required
to propagate any store into the cache within any period of time.
It can stash it in a write combine buffer waiting to see if more
updates to the same line appear.

Weak order requires a membar after a store to force it into the cache,
triggering the coherence handshake which invalidates other copies,
so that when remote cores reread a line they see the updated value.

In other words, to retire the membar instruction the core must force the
prior store values into the coherent cache making them globally visible.

The difference for TSO is that a store has implied membars before it to
prevent it bypassing (executing before) older loads and stores.
Post by a***@littlepinkcloud.invalid
And even on
a TSO machine, you're going to have to do something on both the reader
and the writer sides if you need ordering to be protected from a
compiler.
Andrew.
Compilers are a different discussion.
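
A small C11 illustration (mine, not from the thread) of the writer/reader
pairing being discussed: under a weak model the writer needs release
ordering after filling in the payload and the reader needs acquire
ordering on the flag, or the reader may see the flag set but read stale
data; under TSO both orderings come for free. data and ready are
hypothetical:

#include <stdatomic.h>

int         data;                        // payload, a plain variable
atomic_bool ready;                       // flag the reader polls

void writer( void )
{
    data = 42;                           // 1: fill in the payload
    atomic_store_explicit( &ready, 1,    // 2: publish; release orders the
                           memory_order_release );  // payload before the flag
}

void reader( void )
{
    while( !atomic_load_explicit( &ready, memory_order_acquire ) )
        ;                                // acquire orders the flag before use
    /* here data == 42 is guaranteed */
}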

Anton Ertl
2023-11-13 07:48:35 UTC
Permalink
Post by Chris M. Thomasson
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
Proof by claim?

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Stefan Monnier
2023-11-13 15:36:52 UTC
Permalink
Post by Anton Ertl
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
IIRC the main argument for the Mill wasn't that it was going to be
faster but that it would give a better performance per watt by avoiding
the administrative cost of managing those hundreds of reordered
in-flight instructions, without losing too much peak performance.

The fact that performance per watt of in-order ARM cores is not really
lower than that of OOO cores suggests that the Mill wouldn't deliver on
this "promise" either.
Still, I really would like to see how it plays out in practice, instead
of having to guess.


Stefan
MitchAlsup
2023-11-13 19:11:51 UTC
Permalink
Post by Anton Ertl
Post by Chris M. Thomasson
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
Proof by claim?
I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
<
Not faster, but easier to do with acceptable HW costs. The pipeline
is 1-3 stages longer, but HW has dynamic information that SW cannot have.
<
Post by Anton Ertl
2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
<
So did I in Mc 88100--just as wrong then as it is now.
<
Post by Anton Ertl
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).
<
I recall saying it is about 2% of the gate count of an FMAC unit.
<
Post by Anton Ertl
3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.
Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.
But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
<
The problem that the weak consistency models created comes from the
fact that the model was universal over all accesses. However, the TLB can be used
to solve that problem so each access has its own model and the HW has
to perform with that model, often across a multiplicity of memory
references. For my part I have 4 memory models and the CPUs switch to
the appropriate model upon detection without needing instructions. So
when the first instruction of an ATOMIC event is detected (decode),
all weaker outstanding requests are allowed to complete, all of
the ATOMIC requests are performed in a sequentially consistent manner,
and afterwards the memory model is weakened again.
<
Post by Anton Ertl
[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.
- anton
Paul A. Clayton
2023-11-20 15:50:34 UTC
Permalink
On 11/13/23 2:48 AM, Anton Ertl wrote:
[snip]
Post by Anton Ertl
I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.
This is not just a hardware-software wall problem, though that
wall and its abuse are usually well-established.
micro-optimization orientation, I know I need more external
awareness, but as a non-practicing entity what I think or present
has little effect/danger. Even in my case, there is some danger of
spreading a falsehood (or dangerously incomplete truths), so
external correction is valuable (and I value it myself as I
dislike being wrong; being corrected early hurts, but less
than being corrected after the inaccuracy has become well-
established in my own and others' minds).

System-aware optimization also interacts with interface layering.
Isolating concerns reduces design complexity and from a given
complexity allows exploiting "don't care" aspects. The "don't
care" aspects can be painful when the interface user does care;
sometimes these can force a violation of the abstraction,
introducing a dependency on a specific implementation (which can
then introduce an informal interface [performance compatibility
is a common informal interface]).
Post by Anton Ertl
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
Yet there does not seem to be a strong push to develop a dataflow-
oriented interface/ISA (that does not require genius
programmers or super-genius compilers). I am not certain what such
an interface would look like, but I suspect something closer to a
transport-triggered architecture (TTA) would be an early step. A
TTA-like architecture would compactly encode single use values and
provide some routing information while supporting (possible)
multiple use and some sense of use deferment (loads and stores).

Value prediction (including branch/predicate prediction) also
seems to be required to be included in design considerations.

Such an ISA would also probably blur the boundaries between
threads and naturally support speculative multithreading, which is
in some sense a distant/variable deferment communication/dataflow.

[snip]
Post by Anton Ertl
Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).
I think that cost estimate assumes the inclusion of (single
rounding) FMADD. Single-rounding FMADD was not common for RISCs
when the Alpha designers made their choice.

I am **certainly not** a numerical analyst, but I had the
impression that flush-to-zero was not horrible for analyzing for
correctness and (for double precision) not commonly a problem. Yet
I also think that having multiply round based on an "integer"
power-of-two high result (without carry-in) — where the hardware
could also be used for integer multiply by reciprocal — might have
been "better", so my opinion should probably be taken with a mine
of salt.

I would not be surprised if special-purpose low-power DSPs not
only use non-IEEE formats but also use inexact rounding. Even using
inexact computation might be justified for extreme cases.
Post by Anton Ertl
3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
Yet Intel has been using byte parity for L1 Dcaches, so that
design choice was perhaps not *entirely* irrational. (I disagree
with that choice, having hindsight, but I can appreciate the
reasoning.) Parity-only L1 Dcaches are not that bad since the
SRAM design will likely be more robust to allow faster access (I
think) and dirty values will tend to be either evicted quickly or
checked often.

If smaller writes are rare, hardware RMW in a writeback cache
would not have been that expensive, but the cost would have no
value if smaller writes are never necessary.

(I do wonder if there is an interface that would allow software to
reduce hardware RMW costs — often a value is read before being
modified — without introducing more complexity than benefit.
Exploiting the standard double-wide read used for unaligned
accesses to access a double-wide aligned memory seems similarly
desirable. While idiom-detection would allow this to be done in
hardware without changing the interface, idiom detection is more
complex than direct encoding and typically relies on software to
reduce that complexity — e.g., only detecting short contiguous
idioms.)

The different memory regions trick is also used for bit-granular
accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
Even without side-effects for accesses, non-atomicity might be a
concern. (Of course, one could architect that all simple load-op-
store sequences on that type of memory are atomic, using three
instruction idiom detection.)
Post by Anton Ertl
Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.
The tricky part seems to be in discerning when (and where) extra
effort is justified. This also depends on how easily the
difficulty can be encapsulated. Can a compiler reliably "do the
right thing" (without having to have been written by a supergenius
AI)? Can a library reliably provide the necessary extra
functionality — splitting the difficulty between application
programmer discipline and difficulty of developing the system
software — without requiring genius system programmers and highly
competent application programmers?

Someone who writes lock-free methods for fun is probably not well-
positioned to estimate the difficulty/lack-of-fun of such for most
programmers. Communication between different interest groups seems
critical, but communication also requires data and not just
anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
do have value!)

[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

When an effort is considered hard, like parallel programming, there
seems to be a spectrum of viewpoints, from the UNIX/"real
programmers" perspective of limiting the effort to experts, to
simplifying the system interface so that almost anyone can do almost anything.
The extreme positions have obvious cultural issues (where
expertise is either required for worth or expertise is despised as
arrogance) as well as mechanical issues (expertise is naturally
limited by finite knowledge — where vast knowledge implies
communication overhead even within a single supercomputer
complex).
Post by Anton Ertl
[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.
This is one strong reason for complexity to be shifted toward
hardware, but I think that there is a danger of "toward" becoming
"into".

I do not know nearly enough about memory ordering considerations
in hardware and software to have more than an opinion based on
which experts I believe (and a tiny amount of rational/data
consistency inference on my part). From what I have read, TSO
seems to be the best tradeoff of hardware overhead and software
difficulty, but I suspect the best set of fine-grained guarantees
may be somewhat different than provided by simple TSO. I also
think there may be noticeable advantages for allowing use that is
outside of recipes/the formal interface.

Documenting an interface often brings an assumption of continuity,
so (besides the cost of writing documentation) there is a
disincentive to expose internals that leak through the abstraction
layer.

(That was a very wordy response.)
MitchAlsup
2023-11-20 18:51:49 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.
All accesses being universally sequentially consistent is way too
much ordering; however, the ability to detect the start and end of
ATOMIC events and switch to SC gives the programmer all the
order he needs without constraining non-concurrent memory
at all.

Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.

On the other hand true ROM needs no ordering whatsoever--so why
impose any ??

One size does not fit all !!
Chris M. Thomasson
2023-11-20 20:29:01 UTC
Permalink
Post by MitchAlsup
Post by Paul A. Clayton
[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
good architecture says what happens to the architectural state when
software performs some instruction.  From that perspective sequential
consistency is architecturally best.  Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.
All accesses being universally sequentially consistent is way too
much ordering, however, the ability to detect the start-end of
ATOMIC events and switching to SC gives the programmer all the
order he needs without constraining the non-concurrent memory
at all.
Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.
On the other hand true ROM needs no ordering whatsoever--so why
impose any ??
One size does not fit all !!
Fwiw, I remember posting an idea of so-called tagged memory barriers on
this group some years ago. I need to try to dig it up.
MitchAlsup
2023-11-13 00:19:21 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
<
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
<
Post by Kent Dickey
Kent
Kent Dickey
2023-11-13 05:54:31 UTC
Permalink
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).

Relaxed memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).

Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.

It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".

Kent
Chris M. Thomasson
2023-11-13 21:36:37 UTC
Permalink
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
exist? I must be misunderstanding your point here. Sorry if I am. ;^o
Kent Dickey
2023-11-15 21:09:30 UTC
Permalink
Post by Chris M. Thomasson
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
exist? I must be misunderstanding you point here. Sorry if I am. ;^o
You have internalized weakly ordered memory, and you're having trouble
seeing beyond it.

CPUs with weakly ordered memory are the ones that need all those flags.
Yes, you need the flags if you want to use those CPUs. I'm pointing out:
we could all just require better memory ordering and get rid of all this
cruft. Give the flag, don't give the flag, the program is still correct
and works properly.

It's like FP denorms--it's generally been decided the hardware cost
to implement it is small, so hardware needs to support it at full speed.
No need to write code in a careful way to avoid denorms, to use funky CPU-
specific calls to turn on flush-to-0, etc., it just works, we move on to
other topics. But we still have flush-to-0 calls available--but you don't
need to bother to use them. In my opinion, memory ordering is much more
complex for programmers to handle. I maintain it's actually so
complex most people cannot get it right in software for non-trivial
interactions. I've found many hardware designers have a very hard time
reasoning about this as well when I report bugs (since the rules are so
complex and poorly described). There are over 100 pages describing memory
ordering in the Arm Architecture Reference Manual, and it is very
complex (Dependency through registers and memory; Basic Dependency;
Address Dependency; Data Dependency; Control Dependency; Pick Basic
dependency; Pick Address Dependency; Pick Data Dependency; Pick
Control Dependency, Pick Dependency...and this is just from the definition
of terms). It's all very abstract and difficult to follow. I'll be
honest: I do not understand all of these rules, and I don't care to.
I know how to implement a CPU, so I know what they've done, and that's
much simpler to understand. But writing a threaded application is much
more complex than it should be for software.
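The flush-to-0 calls mentioned above are, on x86, the MXCSR FTZ/DAZ
controls. A hedged sketch of the opt-in call, shown only because the
analogy comes up here:

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (x86/SSE specific)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz(void) {
    // Flush denormal results to zero and treat denormal inputs as zero.
    // On cores that handle denormals at full speed this buys nothing,
    // which is exactly the point being made above.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}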

The cost to do TSO is some out-of-order tracking structures need to get
a little bigger, and some instructions have to stay in queues longer
(which is why they may need to get bigger), and allow re-issuing loads
which now have stale data. The difference between TSO and Sequential
Consistency is to just disallow loads seeing stores queued before they
write to the data cache (well, you can speculatively let loads happen,
but you need to be able to walk it back, which is not difficult). This
is why I say the performance cost is low--normal code missing caches and
not being pestered by other CPUs can run at the same speed. But when
other CPUs begin pestering us, the interference can all be worked out as
efficiently as possible using hardware, and barriers just do not
compete.

Kent
Chris M. Thomasson
2023-11-15 21:25:30 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
exist? I must be misunderstanding you point here. Sorry if I am. ;^o
You have internalized weakly ordered memory, and you're having trouble
seeing beyond it.
Really? Don't project yourself on me. Altering all of the memory
barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.
Post by Kent Dickey
CPUs with weakly ordered memory are the ones that need all those flags.
we could all just require better memory ordering and get rid of all this
cruft. Give the flag, don't give the flag, the program is still correct
and works properly.
Huh? Just cruft? wow. Just because it seems hard for you does not mean
we should eliminate it. Believe it or not there are people out there
that know how to use memory barriers. I suppose you would use seq_cst to
load each node of a lock-free stack iteration in a RCU read-side region.
This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta, seems
like you do not have all that much experience with them. Humm...
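For concreteness, the kind of traversal being talked about looks
something like this rough C++ sketch (rcu_read_lock/rcu_read_unlock are
just placeholders for whatever RCU implementation is in use, and node
reclamation is not shown; the point is only that the per-node loads do
not need seq_cst):

#include <atomic>

struct Node {
    int value;
    Node* next;   // written before the node is published, immutable afterwards
};

std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    // The release CAS publishes the node, ordering its fields before it.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ;
}

int sum_stack() {
    // rcu_read_lock();   // placeholder: RCU read-side critical section begins
    int sum = 0;
    // One acquire load of the head; following ->next needs no further
    // barriers because each node was published with a release store and
    // is never modified after publication.
    for (Node* n = head.load(std::memory_order_acquire); n; n = n->next)
        sum += n->value;
    // rcu_read_unlock(); // placeholder: reclamation is deferred to RCU
    return sum;
}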
Post by Kent Dickey
It's like FP denorms--it's generally been decided the hardware cost
to implement it is small, so hardware needs to support it at full speed.
No need to write code in a careful way to avoid denorms, to use funky CPU-
specific calls to turn on flush-to-0, etc., it just works, we move on to
other topics. But we still have flush-to-0 calls available--but you don't
need to bother to use them. In my opinion, memory ordering is much more
complex for programmers to handle. I maintain it's actually so
complex most people cannot get it right in software for non-trivial
interactions. I've found many hardware designers have a very hard time
reasoning about this as well when I report bugs (since the rules are so
complex and poorly described). There are over 100 pages describing memory
ordering in the Arm Architectureal Reference Manual, and it is very
complex (Dependency through registers and memory; Basic Dependency;
Address Dependency; Data Dependency; Control Dependency; Pick Basic
dependency; Pick Address Dependency; Pick Data Dependency; Pick
Control Dependency, Pick Dependency...and this is just from the definition
of terms). It's all very abstract and difficult to follow. I'll be
honest: I do not understand all of these rules, and I don't care to.
I know how to implement a CPU, so I know what they've done, and that's
much simpler to understand. But writing a threaded application is much
more complex than it should be for software.
The cost to do TSO is some out-of-order tracking structures need to get
a little bigger, and some instructions have to stay in queues longer
(which is why they may need to get bigger), and allow re-issuing loads
which now have stale data. The difference between TSO and Sequential
Consistency is to just disallow loads seeing stores queued before they
write to the data cache (well, you can speculatively let loads happen,
but you need to be able to walk it back, which is not difficult). This
is why I say the performance cost is low--normal code missing caches and
not being pestered by other CPUs can run at the same speed. But when
other CPUs begin pestering us, the interference can all be worked out as
efficiently as possible using hardware, and barriers just do not
compete.
Having access to fine grain memory barriers is a very good thing. Of
course we can use C++ right now and make everything seq_cst, but that is
moronic. Why would you want to use seq_cst everywhere when you do not
have to? There are rather massive performance implications.

Are you thinking about a magic arch that we cannot use right now?
Stefan Monnier
2023-11-15 22:40:15 UTC
Permalink
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.

We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.

I suspect there can be very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).

So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.

BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?


Stefan
Chris M. Thomasson
2023-11-15 23:01:42 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
Oh, shit. This is the main point of my confusion. When I say it's bad, I
am referring to using std::memory_order_seq_cst all over the place on
_existing_ architectures.
Post by Stefan Monnier
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Offering sequential consistency and mandating it are different things.
We can all get sequential consistency just by using
std::memory_order_seq_cst all of the time. If an arch can be created
that is 100% sequential consistency and beat out existing finely tuned
algorithms, like RCU based lock-free algorithms, then, I would love to
see it.

As of now, wrt our current archs, say, iterating a lock-free stack in an
RCU read-side region using seq_cst is horrible.
Chris M. Thomasson
2023-11-15 23:15:10 UTC
Permalink
Post by Chris M. Thomasson
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
Oh, shit. This is the main point of my confusion. When I say its bad, I
am referring to using std::memory_order_seq_cst all over the place on
_existing_ architectures.
Post by Stefan Monnier
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it.  I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Offering sequential consistency and mandating it are different things.
We can all get sequential consistency just by using
std::memory_order_seq_cst all of the time. If an arch can be created
that is 100% sequential consistency and beat out existing finely tuned
algorithms, like RCU based lock-free algorithms, then, I would love to
see it.
As of now, wrt are current arch's, say, iterating a lock-free stack in a
RCU read side region using seq_cst is horrible.
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
Stefan Monnier
2023-11-15 23:39:28 UTC
Permalink
Post by Chris M. Thomasson
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
No, but it all depends on the hardware cost.
SGI's big multiprocessors (starting with their R10K processor) obeyed
sequential consistency and AFAIK that did not prevent them from providing
top-notch performance.

So if it's not terribly costly, it may end up being *faster* because the
memory barriers become noops (not to mention the fact that performance
engineers can spend time on other things).

I suspect Intel and others did look into it at some point and ended up
deciding it's not actually faster, but I agree with Kent that the
complexity cost for programmers might not really be warranted and that
maybe we'd all be better off if CPU manufacturers followed SGI's lead.

According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
the cost of supporting SC can be around 6% using tricks known in 2012.
I'd be surprised if it can't be brought further down.


Stefan
Chris M. Thomasson
2023-11-16 00:03:31 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
No, but it all depends on the hardware cost.
SGI's big multiprocessors (starting with their R10K processor) obeyed
sequential consistency and AFAIK that did not prevent them from providing
top-notch performance.
So if it's not terribly costly, it may end up being *faster* because the
memory barriers become noops (not to mention the fact that performance
engineers can spend time on other things).
I suspect Intel and others did look into it at some point and ended up
deciding it's not actually faster, but I agree with Kent that the
complexity cost for programmers might not really be warranted and that
maybe we'd all be better off if CPU manufacturers followed SGI's lead.
According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
the cost of supporting SC can be around 6% using tricks known in 2012.
I'd be surprised if it can't be brought further down.
Actually, for some reason this line of thinking reminds me of the
following funny scene in a movie called Dirty Rotten Scoundrels, where
they had to put corks on the forks of Ruprecht (Steve Martin) to
prevent him from hurting himself. So, basically, Ruprecht would be the
programmer and the Michael Caine character would be the arch designer
thinking that relaxed models are too complex, and all programmers are
mainly morons:



http://youtu.be/SKDX-qJaJ08

Afaict, std::memory_order_consume was added to support RCU compatible
algorithms.
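A rough sketch of the intended source-level idiom (in practice most
compilers currently promote consume to acquire, but the pattern is the
rcu_dereference() one; the names here are made up):

#include <atomic>

struct Config { int timeout; int retries; };

std::atomic<Config*> current_config{nullptr};

// Writer: fill in the object, then publish it with a release store.
void publish(Config* c) {
    current_config.store(c, std::memory_order_release);
}

// Reader: consume is meant to order only the loads that are
// data-dependent on the loaded pointer, i.e. the dereference itself.
int get_timeout() {
    Config* c = current_config.load(std::memory_order_consume);
    return c ? c->timeout : -1;
}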
Stefan Monnier
2023-11-16 00:37:01 UTC
Permalink
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.

But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.

The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.

You don't need to be a moron to be baffled by the complexity of relaxed
memory models.


Stefan
Chris M. Thomasson
2023-11-16 02:21:04 UTC
Permalink
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.

Fwiw, I like the SPARC model where one could put the processor in RMO or
TSO mode. I think there was another mode, but I cannot remember it right
now. A SEQ mode should be doable in SPARC. However, it would incur a
performance penalty.

The funny thing is that in lock/wait-free programming, we always look
for the weakest barriers that can be used and still be correct. So, if
everything was suddenly seq_cst with much better performance on a new
magic arch, well, I would need to check it out for sure!
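As a trivial example of picking the weakest barrier that is still
correct (a sketch, nothing more): a statistics counter needs no
ordering at all, while publishing an object needs release, and neither
needs seq_cst:

#include <atomic>

std::atomic<long> hits{0};

void count_hit() {
    // Pure counter: no other data is being published, so relaxed is enough.
    hits.fetch_add(1, std::memory_order_relaxed);
}

struct Msg { int data; };
std::atomic<Msg*> mailbox{nullptr};

void post(Msg* m) {
    // Publication: release makes the Msg contents visible to an acquire
    // reader, but nothing here needs the single total order of seq_cst.
    mailbox.store(m, std::memory_order_release);
}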
Scott Lurndal
2023-11-16 14:54:49 UTC
Permalink
Post by Chris M. Thomasson
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.
We ran into an issue with Linux circa 2006 or thereabouts where
accessing an skb (network stack buffer)
required adding a barrier (AMD64 processor). Details are fuzzy two decades later....
Chris M. Thomasson
2023-11-16 22:07:16 UTC
Permalink
Post by Scott Lurndal
Post by Chris M. Thomasson
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.
We ran into an issue with linux circa 2006 or thereabouts where
there was an issue accessing an skb (networks stack buffer) that
required adding a barrier (AMD64 processor). Details are fuzzy two decades later....
I know that RCU is being used for dynamic routing tables, and a lot more
in the Linux kernel. I cannot remember if the skb code used it, but I bet it did.
For some damn reason, this makes me think of eieio on the PPC...
Kent Dickey
2023-11-16 01:41:56 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, was
sequentially consistent and needed no barriers. It was not a problem;
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.

Since the original HP PA-RISC MP designs were sequentially consistent, all
the implementations afterward kept it up since no one wanted any existing
code to break. The architects defined the architecture to allow weak
ordering, but no implementation (by HP at least) did so. These
architects then went on to IA-64, where it really is weak. Since IA-64 was
in-order, weak ordering has more of a payoff there, and IA-64 didn't want to
spend hardware on this (they had 50 other things to waste it on); IA-64 is
full of bad ideas and ideas just not implemented well.

Sun was TSO, which is weaker. Sun was never a performance champion,
other than by throwing the most cores at a parallel problem. So being
TSO relative to Sequential Consistency didn't seem to buy Sun much. DEC
Alpha was the poster child of weakly ordered (so weak they didn't maintain
single CPU consistent ordering with itself) and it was often a performance
champion on tech workloads, but that had more to do with their
much higher clock frequencies, and that edge went away once out-of-order
took off. DEC Alpha was never a big player in TPC-C workloads, where
everybody made their money in the 90's. Technical computing is nice and
fun, but there was not a lot of profit in it compared to business workloads
in the 90s. IA64's reason to exist was to run SGI/SUN/DEC out of business,
and it effectively did (while hurting HP about as much).

I don't know if SGI was sequentially consistent. It's possible, since
software developed for other systems might have pushed them to support it,
but the academic RISC folks were pretty big on weakly ordered.

A problem with weak ordering is that no implementation is THAT weakly
ordered. To maximize doing things in a bad order requires "unlucky" cache
misses, and these are just not common in practice. So "Store A; Store B"
often appears to be done in that order with no barriers on weakly ordered
systems. It's hard to feel confident you've written anything complex
right, so most algorithms are kept relatively simple to make it more likely
they are well tested.

HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
using just load and store. I forget the details, but it was something
like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
your byte to "iwant", then read the word. If everyone else is 0, then
set your byte to "iwon", then read the word. If everyone else is still
0, you've won, do what you want. And if you see other bytes getting
set, then you move to a backoff algorithm to determine the winner (you
have 256 states you can move through). Note that when you release the
lock, you can pick the next winner with a single store. What I
understood was this was actually faster than a simple compare-and-swap
since it let software immediately see if there was contention, and move
to a back-off algorithm right away (and you can see who's contending,
and deal with that as well). Spinlocks tend to lead to cache
invalidation storms and are hard to tune, but this was much more
tuneable. It scaled to 64 CPUs by doing the lock in two steps and
moving to a 64-bit word.
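A rough C sketch of the scheme described above (names invented, not the
actual HP-UX code; note that it leans on the sequential consistency
described above, and on a weakly ordered machine it would need barriers):

#include <stdint.h>

enum { FREE = 0, IWANT = 1, IWON = 2 };

typedef union {
    volatile uint8_t  byte[4];            /* one byte per CPU */
    volatile uint32_t word;
} simple_lock;

static int others_clear(simple_lock *l, int me)
{
    union { uint32_t w; uint8_t b[4]; } snap;
    snap.w = l->word;                     /* one 32-bit load sees all CPUs */
    for (int i = 0; i < 4; i++)
        if (i != me && snap.b[i] != FREE)
            return 0;
    return 1;
}

/* Returns nonzero if we took the lock; on contention the caller falls
   back to the (not shown) backoff protocol described above. */
int simple_lock_try(simple_lock *l, int me)
{
    l->byte[me] = IWANT;                  /* announce interest */
    if (!others_clear(l, me)) return 0;   /* someone else wants it: backoff */
    l->byte[me] = IWON;                   /* claim the lock */
    if (!others_clear(l, me)) return 0;   /* raced with someone: backoff */
    return 1;
}

void simple_lock_release(simple_lock *l, int me)
{
    l->byte[me] = FREE;                   /* a single store; one could also
                                             hand the lock to the next winner
                                             with a single store */
}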

Kent
Chris M. Thomasson
2023-11-16 02:09:28 UTC
Permalink
Post by Kent Dickey
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, were
sequentially consistent and needed no barriers. It was not a problem,
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.
Since the original HP PA-RISC MP designs were sequentially consistent, all
the implementations afterward kept it up since no one wanted any existing
code to break. The architects defined the architecture to allow weak
ordering, but no implementation (by HP at least) did so. These
architects then went on to IA-64, where it really is weak, since IA64 was
in order, so it has more of a payoff there since IA64 didn't want to spend
hardware on this (they had 50 other things to waste it on), and IA-64 is
full of bad ideas and ideas just not implemented well.
Sun was TSO, which is weaker. [...]
Sun had RMO mode as well. Avoiding #StoreLoad membars was a must: only use
them when you absolutely have to.
Michael S
2023-11-18 19:10:47 UTC
Permalink
On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Post by Kent Dickey
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same
kinds of reasons why some systems allow nested transactions (e.g.
when you have a transaction with two calls to `gensym`: it doesn't
matter whether the two calls really return consecutive symbols (as
would be guaranteed if the code were truly run atomically), all that
matters is that those symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, were
sequentially consistent and needed no barriers. It was not a problem,
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.
Since the original HP PA-RISC MP designs were sequentially
consistent, all the implementations afterward kept it up since no one
wanted any existing code to break. The architects defined the
architecture to allow weak ordering, but no implementation (by HP at
least) did so. These architects then went on to IA-64, where it
really is weak, since IA64 was in order, so it has more of a payoff
there since IA64 didn't want to spend hardware on this (they had 50
other things to waste it on), and IA-64 is full of bad ideas and
ideas just not implemented well.
Sun was TSO, which is weaker. Sun was never a performance champion,
other than by throwing the most cores at a parallel problem. So being
TSO relative to Sequential Consistency didn't seem to buy Sun much.
DEC Alpha was the poster child of weakly ordered (so weak they didn't
maintain single CPU consistent ordering with itself) and it was often
a performance champion on tech workloads, but that had more to do
with their much higher clock frequencies, and that edge went away
once out-of-order took off. DEC Alpha was never a big player in
TPC-C workloads, where everybody made their money in the 90's.
Technical computing is nice and fun, but there was not a lot of
profit in it compared to business workloads in the 90s. IA64's
reason to exist was to run SGI/SUN/DEC out of business, and it
effectively did (while hurting HP about as much).
I don't know if SGI was sequentially consistent.
AFAIK, Stefan Monnier is correct. SGI MIPS gear was SC.
Their later gear (Itanium and x86) were not, because of underlying
cores.
Post by Kent Dickey
It's possible, since
software developed for other systems might have pushed them to
support it, but the academic RISC folks were pretty big on weakly
ordered.
A problem with weakly ordered is no implementation is THAT weakly
ordered. To maximize doing things in a bad order requires "unlucky"
cache misses, and these are just not common in practice. So "Store
A; Store B" often appears to be done in that order with no barriers
on weakly ordered systems. It's hard to feel confident you've
written anything complex right, so most algorithms are kept
relatively simple to make it more likely they are well tested.
HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
using just load and store. I forget the details, but it was something
like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
your byte to "iwant", then read the word. If everyone else is 0, then
set your byte to "iwon", then read the word. If everyone else is
still 0, you've won, do what you want. And if you see other bytes
getting set, then you move to a backoff algorithm to determine the
winner (you have 256 states you can move through). Note that when
you release the lock, you can pick the next winner with a single
store. What I understood was this was actually faster than a simple
compare-and-swap since it let software immediately see if there was
contention, and move to a back-off algorithm right away (and you can
see who's contending, and deal with that as well). Spinlocks tend to
lead to cache invalidation storms, and it's hard to tune, but this
was much more tuneable. It scaled to 64-CPUs by doing the lock in
two steps, and moving to a 64-bit word.
Kent
I don't see how a high-end SC core can match the performance of a
high-end TSO-or-weaker core; the weak point of SC would be single-thread
performance in the case where code encounters a few L1D misses intermixed
with plenty of L1D hits.
Of course, I don't know all the tricks that high-end core designers are
using today, or even 10% of their tricks. So maybe what I consider
impossible is in fact possible.
But as a matter of fact nobody has designed a brand new high-end SC core
in this century. The closest were probably the MIPS cores from Scott
Lurndal's employer Cavium, but I think they were always at least a factor
of 1.5x slower than the contemporary state of the art in single-thread
performance, and more commonly a factor of 2+.
MitchAlsup
2023-11-19 19:38:47 UTC
Permalink
Post by Michael S
On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Kent
I don't see how high-end SC core can match performance of high-end
TSO-or-weaker core with weak point of SC being single-thread
performance in the case where code encounters few L1D misses intermixed
with plenty of L1D hits.
They cannot--but does it really matter ??

Back in 1993 I ran an experiment on the Mc 88120 where we did not consider
a memory reference performed until we got a signal back from one of the
multiple DRAM controllers (compared to fire and forget), and on big
complicated applications the loss in performance was about 2%--2% perf
loss allowed one to be in a position to recover from an ECC failure in
the transport of write data to DRAM.

This was still not TSO but it is a relevant data point.
Post by Michael S
Of course, I don't know all tricks that high-end core designers are
using today or even 10% of their tricks. So, may be, what I consider
impossible is in fact possible.
But as a matter of fact nobody designed brand new high-end SC core in
these century. The closest were probably MIPS cores from Scott
Lurndal's employee Cavium, but I think they were always at least factor
of 1.5x slower than contemporary state of the art in single-thread
performance and more commonly factor of 2+.
Right now, nobody is competitive with x86-64 at the high end. You are
going to need a 5GHz core operating at 4 instructions per cycle to be
within spitting distance.
Thomas Koenig
2023-11-19 21:17:20 UTC
Permalink
Post by MitchAlsup
Right now, nobody is competitive with x86-64 at the high end. You are
going to need a 5GHz core operating 4 instructions per cycle to be
within spitting distance.
Power10 isn't doing badly at 4 GHz if I read the SPECint values
right, but it (very probably) cannot compete on performance per
currency unit.
Michael S
2023-11-19 22:10:53 UTC
Permalink
On Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
Post by Thomas Koenig
Post by MitchAlsup
Right now, nobody is competitive with x86-64 at the high end. You
are going to need a 5GHz core operating 4 instructions per cycle to
be within spitting distance.
Power10 isn't doing badly at 4 GHz if I read the SPECint values
right, but it (very probably) cannot compete on performance per
currency unit.
You do not read the SPECint values right.
For starters, there are no POWER "speed" submissions at all. All
published numbers are embarrassingly parallel "rate" scores, where since
POWER7 POWER has had the huge advantage of 8 hardware threads vs all
competitors except Oracle having only 1 or 2 threads.
Since POWER8 IBM applies another trick of calling what is essentially 2
cores a single core.
My uneducated guess is that in single-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake, i.e.
below the fastest Intel, AMD and Apple chips from 2021.
Even more so relative to 2023.
And by 2023Q4, apart from those three mentioned above, we have Qualcomm
near the top, and even ARM Inc (Cortex X3), while behind the other four,
is probably ahead of IBM.
Anton Ertl
2023-11-20 07:34:09 UTC
Permalink
Post by Michael S
My uneducated guess is that in sigle-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake
Slightly more educated:

gforth-fast onebench.fs (development version of Gforth), lower is better:

sieve bubble matrix fib fft version; machine
0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1

Another benchmark I can run quickly is the LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>:

On the Power10:

487.14 msec task-clock # 0.998 CPUs utilized
3 context-switches # 6.158 /sec
1 cpu-migrations # 2.053 /sec
590 page-faults # 1.211 K/sec
1897813538 cycles # 3.896 GHz
4411333480 instructions # 2.32 insn per cycle
542280135 branches # 1.113 G/sec
11170586 branch-misses # 2.06% of all branches

0.488029554 seconds time elapsed

0.477731000 seconds user
0.010164000 seconds sys

On the Core i5-6600K (Skylake):

396.04 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
7091 page-faults # 0.018 M/sec
1584116718 cycles # 4.000 GHz
3015599019 instructions # 1.90 insn per cycle
564834472 branches # 1426.217 M/sec
8230928 branch-misses # 1.46% of all branches

0.396422458 seconds time elapsed

0.380444000 seconds user
0.016018000 seconds sys

One disadvantage of this LaTeX benchmark is that how much work it has to
do depends on the LaTeX version and on the installed packages.
In the present case both systems use TeX Live 2020, so that should
make no difference. The Skylake has a lot of packages installed and
executes more instructions for the LaTeX benchmark than any other
AMD64 system where I measured the executed instructions. I do not
know how many packages are installed on the Power10 system.

Comparing the number of branches is interesting. If we assume that
compiler transformations like if-conversion and loop unrolling do
*not* cause a significant difference in branches, the similar number
of branches suggests that the workload is comparable.

In any case, both results indicate that the 3900MHz Power10 has a
lower single-thread performance than a 4GHz Skylake. So you buy a
Power10 if you want to process large numbers of threads or processes
and want to put in lots of RAM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2023-11-20 14:40:05 UTC
Permalink
On Mon, 20 Nov 2023 07:34:09 GMT
Post by Anton Ertl
Post by Michael S
My uneducated guess is that in sigle-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake
sieve bubble matrix fib fft version; machine
0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1
Another benchmark I can run quickly is the LaTeX benchmark
487.14 msec task-clock # 0.998 CPUs utilized
3 context-switches # 6.158 /sec
1 cpu-migrations # 2.053 /sec
590 page-faults # 1.211 K/sec
1897813538 cycles # 3.896 GHz
4411333480 instructions # 2.32 insn per cycle
542280135 branches # 1.113 G/sec
11170586 branch-misses # 2.06% of all branches
0.488029554 seconds time elapsed
0.477731000 seconds user
0.010164000 seconds sys
396.04 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
7091 page-faults # 0.018 M/sec
1584116718 cycles # 4.000 GHz
3015599019 instructions # 1.90 insn per cycle
564834472 branches # 1426.217 M/sec
8230928 branch-misses # 1.46% of all branches
0.396422458 seconds time elapsed
0.380444000 seconds user
0.016018000 seconds sys
One disadvantage of this LaTeX benchmark is that it depends on the
LaTeX version, and on the installed packages how much it has to do.
In the present case both systems use Tex Live 2020, so that should
make no difference. The Skylake has a lot of packages installed and
executes more instructions for the LaTeX benchmark than any other
AMD64 system where I measured the executed instructions. I do not
know how many packages are installed on the Power10 system.
Comparing the number of branches is interesting. If we assume that
compiler transformations like if-conversion and loop unrolling do
*not* cause a significant difference in branches, the similar number
of branches suggests that the workload is comparable.
In any case, both results indicate that the 3900MHz Power10 has a
lower single-thread performance than a 4GHz Skylake. So you buy a
Power10 if you want to process large numbers of threads or processes
and want to put in lots of RAM.
- anton
That's interesting, thank you.

Don't your benchmarks have rather small datasets?
If so, that would put the likes of POWER10 (or of Skylake-SP/Skylake-X)
with their big, slow L2 caches at a disadvantage relative to Skylake
Client, which has a small, fast L2 cache.

But if your numbers are representative, then POWER10 can be matched by
rather ancient Intel CPUs, e.g. by the 9.5 y.o. i7-4790. Even my Xeon
E3-1271 v3 would stand a chance. Or by AMD Zen1.
Anton Ertl
2023-11-20 15:50:08 UTC
Permalink
Post by Michael S
Do not your benchmarks have rather small datasets?
If true, it will put likes of POWER10 (or of Skylake-SP/Skylake-X)
with their big, slow L2 caches at disadvantage relatively to Skylake
Client that has small fast L2 cache.
These benchmarks don't miss L1 much, so the size and latency of the L2
have little influence. For the LaTeX benchmark:

Performance counter stats for 'latex bench':

2963866155 instructions:u
874909912 L1-dcache-loads
13846145 L1-dcache-load-misses #1.58% of all L1-dcache accesses
2754917 LLC-loads
227951 LLC-stores

0.394554268 seconds time elapsed

0.370677000 seconds user
0.023914000 seconds sys

Interestingly, ~20% of the D-cache load misses become LLC loads
(i.e. L2 load misses), so a larger L2 may have an advantage. Let's
see on a Tiger Lake:

2955784200 instructions:u
864268345 L1-dcache-loads
11751007 L1-dcache-load-misses #1.36% of all L1-dcache accesses
393018 LLC-loads
13868 LLC-stores

0.306192273 seconds time elapsed

0.289893000 seconds user
0.016105000 seconds sys

So, going from 256KB L2 to 1280KB L2 reduces the number of L2 load
misses by a factor of 7 on this benchmark (well, there is TeX Live
2022 and possibly a different set of packages on the Tiger Lake
machine, so there is also the question of comparability).

The results I see on the Power10 machine are strange:

Performance counter stats for 'latex bench':

9622223 L1-dcache-load-misses
9847374 LLC-loads

0.476681995 seconds time elapsed

0.466409000 seconds user
0.010139000 seconds sys

Maybe prefetches are counted as LLC-loads.

For the small gforth benchmarks on Skylake:

2205651226 instructions:u
793549469 L1-dcache-loads
4115040 L1-dcache-load-misses #0.52% of all L1-dcache accesses
852553 LLC-loads
26356 LLC-stores

0.265190457 seconds time elapsed

0.260044000 seconds user
0.004036000 seconds sys

An even smaller D-cache miss rate, and a similar 20% L2 miss rate
(which constitutes an even smaller proportion of execution time).
Post by Michael S
But if your numbers are representative then POWER10 can be matched by
rather ancient Intel CPUs. E.g. by 9.5 y.o. i7-4790. Even my Xeon
E3-1271 v3 would stay a chance. Or by AMD Zen1.
Certainly on these benchmarks. There may be applications that benefit
from what Power10 has to offer, but my guess is that for most
applications, you are better off with a somewhat recent Intel or AMD
CPU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Chris M. Thomasson
2023-11-19 03:44:50 UTC
Permalink
On 11/15/2023 5:41 PM, Kent Dickey wrote:
[...]

Btw, thank you for Kegs! Excellent.
Chris M. Thomasson
2023-11-03 22:29:32 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]

Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
Anton Ertl
2023-11-04 17:40:57 UTC
Permalink
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.

You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
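To illustrate what "data-dependent loads" usually refers to: pointer
publication, sketched here with C11 atomics (an illustrative sketch, not
part of the microbenchmark). On most weakly ordered machines the
consumer's load of p->payload is ordered after the load of the pointer by
the data dependency alone; Alpha's model is weak enough that even this
case needs a barrier, which is what memory_order_consume was meant to
capture (compilers promote it to acquire in practice).

#include <stdatomic.h>

struct node { int payload; };

static _Atomic(struct node *) head;

void publish(struct node *n)
{
    n->payload = 42;
    atomic_store_explicit(&head, n, memory_order_release);
}

int consume(void)
{
    struct node *p = atomic_load_explicit(&head, memory_order_acquire);
    /* memory_order_consume would express just the data dependency */
    return p ? p->payload : -1;
}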

@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
BGB
2023-11-04 19:48:04 UTC
Permalink
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and made
them undefined behavior...

However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I can
tell that code steps on these scenarios very often (particularly in
prolog sequences).


This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to loads in cases where one is accessing a "just
accessed" cache line.

Disabling this forwarding (and forcing a stall instead) tends to have a
more noticeable adverse effect on performance (but is often needed to
try to pass timing at higher clock speeds).

Granted, I guess it is possible I could fiddle with the compiler to try
to improve this situation, say:
Only use MOV.X with a 16B alignment;
Stagger even/odd stores in pairs when possible.

Say, as opposed to:
MOV.X R12, (SP, 160)
MOV.X R10, (SP, 144)
MOV.X R8, (SP, 128)
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)
Say:
MOV.X R10, (SP, 144)
MOV.X R12, (SP, 160)
MOV.X R30, (SP, 112)
MOV.X R8, (SP, 128)
MOV.X R26, (SP, 80)
MOV.X R28, (SP, 96)
MOV.X R24, (SP, 64)

Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent stores
within the same 32 bytes).


But, I sort of ended my 75MHz experiment for now, and fell back to
50MHz, where I can more easily afford to have this forwarding (in which
case the above becomes mostly N/A).


As for cache-coherence between cores, this basically still does not
exist yet.

It also seems like it would require a bit more ringbus traffic to pull
off, say:
If a core wants to write to a line, it needs to flag its intention to do so;
If a line was previously fetched for read, but is now being written, the
line would need to be flushed and re-fetched with the new intention;
If a line is being fetched for write, we somehow need to signal it to be
flushed from whatever L1 cache was last holding it;
...



But, the alternative (what I currently have) effectively disallows
traditional forms of (writable) memory sharing between threads (and
"volatile" memory access seems to be pretty slow at present). So, for
multithreaded code, one would need to have the threads work mostly
independently (without any shared mutable structures), and then trigger
an L1 flush when each is done with its work (and the receiving thread also
needs to trigger an L1 flush).

Granted, something similar ended up happening with the rasterizer module in
TKRA-GL, where cache-flushing is needed whenever handing off the
framebuffer between the main CPU and the rasterizer module (otherwise,
graphical garbage and incompletely drawn geometry may result); and one
also needs to flush the L1 cache and trigger the rasterizer module to
flush its texture cache whenever uploading a new texture.

And, as-is, texture uploading has ended up being horridly slow. So, even
if I can (sort of) afford to run GLQuake with lightmaps now, dynamic
lightmaps need to be disabled as it is too slow.

Though, an intermediate possibility could be to store lightmaps with
dynamic lights in (only) one of 2 states, with two different textures
(Dynamic light ON/Max, Dynamic light OFF/Min). However, this would
effectively disallow effects like "slow pulsate" (which could only
alternate between fully-on and fully-off), as well as updates from
dynamic light sources like rockets (where, as-is, firing a rocket with
dynamic lightmaps eats the CPU something hard).

Not entirely sure how something like a Pentium-1 managed all this either.

Granted, faster in this case would still be to do everything with vertex
lighting (as is currently still the default in my GLQuake port), albeit
not as good, and a bit of a hack, as Quake wasn't really
designed for vertex lighting.
Post by Anton Ertl
@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}
- anton
MitchAlsup
2023-11-10 01:36:31 UTC
Permalink
Post by BGB
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and made
them undefined behavior...
<
Intractable in practice.
<
What you CAN do is dynamically solve this problem in HW, by constructing
a memory dependency matrix and not allowing a memory reference to complete
unless it is safe to do so.
<
Post by BGB
However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I can
tell that code steps on these scenarios very often (particularly in
prolog sequences).
<
It only sucks when there is a REAL dependency, and saves your neck every time
a dependency is not found (mid-to-high 90%-ile).
<
Post by BGB
This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to load in cases where one is accessing a "just
accessed" cache line.
<
I use a MDM (above) attached to a temporal cache.
<
When an instruction is inserted in the MDM/TC==CC, it is made dependent on all
instructions that are already in the CC that have not completed, and is installed
in the Unknown Address state.
<
When an address is AGENed, the address is sent to the data cache and a portion
is sent to the CC and that portion CAMs against all the other addresses. The
result of the comparison is either "Is the same line" or "Can't be the same line"
and the state changes to Known Address.
<
While the compares are taking place, the MDM is relaxed: any entry that "can't
be the same line" is removed as a dependency. When all dependencies have been
removed, the instruction can complete.
<
If either CC or DC returns with data, the state advances to Known Data. Stores
with Known data are allowed to complete into CC. Modified CC data migrates back
to DC when the ST instruction becomes retireable.
<
At this point the ST has completed as seen by the CPU, but has not completed
as seen by "interested Observers", and is subject to prediction repair, AGEN
replay, and those rare situations where one changes the memory ordering model
{ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its
done" while the external world can think "It has not happened yet".
<
Thus, memory references that alias on a cache line basis are performed in order,
while those that are not run independently.
<
The conditional (temporal) cache holds the address and <at least a portion> of
the associated cache line. Any reference can hit on that data in CC even if it
is port or bank blocked in the DC. CC hits a bit over 50% of the time, effectively
reducing cache port/bank conflicts for lines that have been touched within the
reach of the temporal cache.
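A rough software model of that bookkeeping (illustrative only, with an
invented 16-entry window and 64-byte lines; not the actual hardware):

#include <stdbool.h>
#include <stdint.h>

#define WINDOW     16
#define LINE_SHIFT 6

typedef struct {
    bool     valid, addr_known, completed;
    uint64_t line;              /* address >> LINE_SHIFT, once AGENed */
    uint32_t dep;               /* bitmask of entries this one waits on */
} mem_op;

static mem_op win[WINDOW];

/* Insert: initially dependent on every entry already in the window
   that has not completed (Unknown Address state). */
void mdm_insert(int i)
{
    uint32_t dep = 0;
    for (int j = 0; j < WINDOW; j++)
        if (j != i && win[j].valid && !win[j].completed)
            dep |= 1u << j;
    win[i] = (mem_op){ .valid = true, .dep = dep };
}

/* AGEN: CAM the new line address against the others and relax the
   matrix for every pair that "can't be the same line". */
void mdm_agen(int i, uint64_t addr)
{
    win[i].line = addr >> LINE_SHIFT;
    win[i].addr_known = true;
    for (int j = 0; j < WINDOW; j++) {
        if (j == i || !win[j].valid || !win[j].addr_known) continue;
        if (win[j].line != win[i].line) {
            win[i].dep &= ~(1u << j);     /* i no longer waits on j */
            win[j].dep &= ~(1u << i);     /* and vice versa */
        }
    }
}

/* An op may complete only when its dependency mask is empty;
   completing it releases the entries that were waiting on it. */
bool mdm_try_complete(int i)
{
    if (!win[i].valid || win[i].completed || win[i].dep != 0)
        return false;
    win[i].completed = true;
    for (int j = 0; j < WINDOW; j++)
        win[j].dep &= ~(1u << i);
    return true;
}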
<
Post by BGB
Disabling this forwarding (and forcing a stall instead), tends to have a
more noticeable adverse effect on performance (but is often needed to
try to pass timing at higher clock speeds).
Granted, I guess it is possible I could fiddle with the compiler to try
Only use MOV.X with a 16B alignment;
Stagger even/odd stores in pairs when possible.
MOV.X R12, (SP, 160)
MOV.X R10, (SP, 144)
MOV.X R8, (SP, 128)
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)
MOV.X R10, (SP, 144)
MOV.X R12, (SP, 160)
MOV.X R30, (SP, 112)
MOV.X R8, (SP, 128)
MOV.X R26, (SP, 80)
MOV.X R28, (SP, 96)
MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
in ISA::
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Post by BGB
Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent stores
within the same 32 bytes).
<
Solved by CC in my case.
<
BGB
2023-11-10 04:10:54 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures.  You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of.  This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and
made them undefined behavior...
<
Intractable in practice.
<
What you CAN do is to dynamically solve this problem in HW. By constructing
a memory dependency matrix and not allowing a memory reference to complete
unless it is safe to do so.
<
Probably true.

My options were one of:
Forward results from an in-progress Store back to a following Load/Store
(more expensive, but faster);
Stall the pipeline until the prior Store can complete (cheaper but slower).


Granted, the "not bother" option could be cheaper still, but carries an
unreasonable level of risk (and in practice would likely mean that any
store through a free pointer followed by another memory access would end
up needing to use NOP padding or similar).

Or, handling it in the CPU, by generating an interlock stall whenever a
Store is followed by another memory operation. But, this would probably
be the worst possible option in terms of performance...
Post by MitchAlsup
Post by BGB
However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I
can tell that code steps on these scenarios very often (particularly
in prolog sequences).
<
It only suck when there is a REAL dependency and saves you neck every time
a dependency is not found (mid-to-high 90%-ile).
<
OK.
Post by MitchAlsup
Post by BGB
This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to load in cases where one is accessing a
"just accessed" cache line.
<
I use a MDM (above) attached to a temporal cache. <
When an instruction is inserted in the MDM/TC==CC, it is made dependent on all
instructions that are already in the CC that have not completed. and installed
in the Unknown Address state.
<
When an address is AGENed, the address is sent to the data cache and a portion
is sent to the CC and that portion CAMs against all the other addresses. The
result of the comparison is either "Is the same line" or "Can't be the same line"
and the state changes to Known Address.
<
While the compares are taking place, the MDM is relaxed. And entry that "can't
be the same as" is removed as a dependency. When all dependencies have been
removed, the instruction can complete.
<
If either CC or DC returns with data, the state advances to Known Data. Stores
with Known data are allowed to complete into CC. Modified CC data migrates back
to DC when the ST instruction becomes retireable. <
At this point the ST has completed as seen by the CPU, but has not completed
as seen by "interested Observers", and is subject to prediction repair, AGEN
replay, and those rare situations where one changes the memory ordering model
{ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its
done" while the external world can think "It has not happened yet".
<
Thus, memory references that alias on a cache line basis are performed in order,
while those that are not run independently.
<
The conditional (temporal) cache holds the address and <at least a portion> of
the associated cache line. Any reference can hit on that data in CC even if it
is port or bank blocked in the DC. CC hits a bit over 50% of the time, effectively
reducing cache port/bank conflicts that have been touched in the time of the
temporal cache.
<
In my pipeline, it is simpler...

Just, the L1 cache sees that the next stage holds the results of a store
to the same cache line, and either forwards the result or generates a
stall until the store completes.

Where, forwarding is faster, but stalling is cheaper.
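Roughly, the decision looks like this (a toy C model; all names are
invented for illustration):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5                       /* assume 32-byte lines */

typedef struct {
    bool     store_pending;                /* stage ahead still holds a store */
    uint64_t store_line;                   /* line address of that store */
} next_stage;

typedef enum { USE_CACHE, USE_FORWARD, STALL } load_action;

load_action resolve_load(const next_stage *ns, uint64_t load_addr,
                         bool forwarding_enabled)
{
    if (!ns->store_pending ||
        (load_addr >> LINE_SHIFT) != ns->store_line)
        return USE_CACHE;                  /* no conflict with in-flight store */
    return forwarding_enabled ? USE_FORWARD  /* faster, costs logic/timing */
                              : STALL;       /* cheaper, wait for the store */
}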
Post by MitchAlsup
Post by BGB
Disabling this forwarding (and forcing a stall instead), tends to have
a more noticeable adverse effect on performance (but is often needed
to try to pass timing at higher clock speeds).
Granted, I guess it is possible I could fiddle with the compiler to
   Only use MOV.X with a 16B alignment;
   Stagger even/odd stores in pairs when possible.
   MOV.X R12, (SP, 160)
   MOV.X R10, (SP, 144)
   MOV.X R8,  (SP, 128)
   MOV.X R30, (SP, 112)
   MOV.X R28, (SP, 96)
   MOV.X R26, (SP, 80)
   MOV.X R24, (SP, 64)
   MOV.X R10, (SP, 144)
   MOV.X R12, (SP, 160)
   MOV.X R30, (SP, 112)
   MOV.X R8,  (SP, 128)
   MOV.X R26, (SP, 80)
   MOV.X R28, (SP, 96)
   MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Still haven't gotten around to this, but it could potentially help
performance with the cheaper cache options (namely, configurations
without the internal forwarding).

But, yeah, the compiler does still need to deal with this sometimes.
Post by MitchAlsup
Post by BGB
Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent
stores within the same 32 bytes).
<
Solved by CC in my case.
OK.
Post by MitchAlsup
<
MitchAlsup
2023-11-10 23:31:20 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
Granted, I guess it is possible I could fiddle with the compiler to
   Only use MOV.X with a 16B alignment;
   Stagger even/odd stores in pairs when possible.
   MOV.X R12, (SP, 160)
   MOV.X R10, (SP, 144)
   MOV.X R8,  (SP, 128)
   MOV.X R30, (SP, 112)
   MOV.X R28, (SP, 96)
   MOV.X R26, (SP, 80)
   MOV.X R24, (SP, 64)
   MOV.X R10, (SP, 144)
   MOV.X R12, (SP, 160)
   MOV.X R30, (SP, 112)
   MOV.X R8,  (SP, 128)
   MOV.X R26, (SP, 80)
   MOV.X R28, (SP, 96)
   MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Still haven't gotten around to this, but it could potentially help
performance with the cheaper cache options (namely, configurations
without the internal forwarding).
<
Imagine a scenario where you are returning from one function call only
to call another function*. Since EXIT performs all the state reload
(and the RET), the return IP is read first so the front end can
fetch-decode to feed the machine..... (*) possibly moving registers
around to form the argument list.
<
NOW imagine that the front end encounters a CALL and the instruction
at the target of the CALL is ENTER.
a) the restore process can stop-or-complete
b) the save process can mostly be skipped since memory already contains
the correct bit patterns (My 66000 ABI)
<
Saving cycles a compiler cannot.....
<
Post by BGB
But, yeah, the compiler does still need to deal with this sometimes.
MitchAlsup
2023-11-10 01:43:04 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures.
<
Absolutely Brilliant.
<
Well Done, and Thanks.
Anton Ertl
2023-11-15 18:32:40 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
I have now added parameter combinations for detecting something that I
currently call memory renaming: whether the microarchitecture can overlap
independent accesses to the same memory location (analogous to
register renaming).

Does anybody know if there is a commonly-used term for this feature?
I would also be interested in a commonly-used term for bypassing the
store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
Lake).

Concerning memory renaming, it turns out that Intel added it in
Nehalem, and AMD added it between K10 and Excavator (probably in the
Bulldozer). ARM has not implemented this in any microarchitecture I
have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
P-core), but not in Icestorm (M1 E-Core).

One interesting benefit of memory renaming is that when you are bad at
register allocation, and keep local variables in memory (or virtual
machine registers, or virtual machine stack items), independent
computations that use the same local variable still can execute in
parallel. Say, stuff like

a = b[1]; c = a+1;
a = b[2]; d = a-1;

The hardware can execute them in parallel, even if a is located in
memory.
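Spelled out as a compilable fragment (hypothetical; 'a' is forced into
memory here with volatile just for illustration):

void demo(int *b, int *out_c, int *out_d)
{
    volatile int a;            /* keep the local in memory */
    a = b[1]; *out_c = a + 1;  /* first use of the location */
    a = b[2]; *out_d = a - 1;  /* independent reuse of the same location;
                                  memory renaming lets both proceed in
                                  parallel despite the reused address */
}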

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2023-11-15 19:05:42 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
I have now added parameter combinations for detecting something that I
currently call memory renaming: If the microarchitecture can overlap
independent accesses to the same memory location (analogous to
register renaming).
Does anybody know if there is a commonly-used term for this feature?
<
I use the term Conditional Cache, I have heard others call it a Memory
Reorder Buffer.
<
Post by Anton Ertl
I would also be interested in a commonly-used term for bypassing the
store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
Lake).
<
Store-to-Load-Forwarding is the only term I know here.
<
Post by Anton Ertl
Concerning memory renaming, it turns out that Intel added it in
Nehalem, and AMD added it between K10 and Excavator (probably in the
Bulldozer). ARM has not implemented this in any microarchitecture I
have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
P-core), but not in Icestorm (M1 E-Core).
One interesting benefit of memory renaming is that when you are bad at
register allocation,
<
Or bad at alias analysis...
<
Post by Anton Ertl
and keep local variables in memory (or virtual
machine registers, or virtual machine stack items), independent
computations that use the same local variable still can execute in
parallel. Say, stuff like
a = b[1]; c = a+1;
a = b[2]; d = a-1;
The hardware can execute them in parallel, even if a is located in
memory.
<
And why not, they are independent containers.
Post by Anton Ertl
- anton