Discussion: Memory dependency microbenchmark
Anton Ertl
2023-11-03 09:15:58 UTC
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.

You can find the text portion of this page below; there are a few
links on the page that are not produced below.

If you have information on any of the things I don't know (or don't
remember), please let me know about it.

- anton

Memory Dependency Microbenchmark

This microbenchmark tests some aspects of how hardware deals with
dependences through memory. It performs the following loop:

for (i=0; i<250000000; i++) {
  *b = *a+1;
  *d = *c+1;
  *f = *e+1;
  *h = *g+1;
}

The eight parameters of the binary allow you to determine to which
memory location the pointers a b c d e f g h refer, in particular, if
some of them refer to the same memory location or not. Each parameter
corresponds to one pointer, and two pointers refer to the same memory
location iff the corresponding parameters are the same.

This microbenchmark uses pointers that have been in their registers
for a long time, so it does not test speculative alias detection by
the hardware.

I have measured using the following parameter combinations (in the
order a b c d e f g h):

* A 0 1 2 3 4 5 6 7: All computations (statements in the loop body)
are completely independent. Ideally they can be performed as fast
as the resources allow.
* B 0 1 1 2 2 3 3 4: A sequence of 4 dependent computations, but
the next iteration does not depend on the results of the previous
one, so again, the work can be performed as fast as the resources
allow. However, in order to achieve this performance, the loads
of the next iteration have to be started while several
architecturally earlier stores still have to be executed, and,
comparing the results of B to those of A, all measured cores have
more difficulty with that, resulting in slower performance.
* C 0 0 2 2 4 4 6 6: 1-computation recurrences (i.e., the
computation in the next iteration depends on a computation in the
current iteration), four of those. So at most 4 of these
computations (plus the loop overhead) can be performed in
parallel.
* D 0 1 1 0 2 3 3 2: 2-computation recurrences (i.e., two dependent
computations in an iteration, and the first of those in the
current iteration depends on the second one in the previous
iteration), 2 of those. So at most two of these computations
(plus the loop overhead) can be performed in parallel.
* E 0 1 2 3 1 0 3 2: The same data flow as D, but the computations
are arranged differently: Here we first have two independent
computations, and then two computations that depend on the
earlier computations.
* F 0 1 1 2 2 0 3 3: A 3-computation recurrence and a 1-computation
recurrence. In the best case you see the latency of three
computations per iteration.
* G 0 1 1 2 3 3 2 0: The same data flow as F, but the computations
are arranged differently.
* H 0 0 0 0 0 0 0 0: One 4-computation recurrence. These
computations can only be performed sequentially; only the loop
overhead can be performed in parallel with them.

The results for different CPU cores are shown in the following. The
numbers are cycles per computation (statement in the loop body). To
get the cycles per iteration, multiply by 4.

A B C D E F G H microarchitecture CPU
1.17 2.11 1.46 3.28 3.67 4.50 4.50 6.93 K8 Athlon 64 X2 4600+
1.00 1.26 2.00 4.00 4.00 6.01 6.01 8.00 Zen Ryzen 5 1600X
1.00 1.19 2.00 4.00 4.00 6.00 6.01 8.00 Zen2 Ryzen 9 3900X
0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3 Ryzen 7 5800X
1.00 1.52 1.99 3.70 3.66 4.65 4.62 7.42 Sandy Bridge Xeon E3-1220
1.00 1.26 1.61 3.06 3.00 4.50 4.50 6.85 Haswell Core i7-4790K
1.00 1.12 1.43 2.90 2.84 4.03 4.05 5.53 Skylake Core i5-6600K
0.78 0.92 1.01 1.81 1.13 2.00 2.00 2.25 Rocket Lake Xeon W-1370P
0.81 0.92 0.83 0.86 0.81 0.82 0.81 1.01 Tiger Lake Core i5-1135G7
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A53 Amlogic S905
3.25 4.00 3.25 3.75 3.25 3.75 3.50 4.00 Cortex-A55 RK3588
1.25 2.26 2.06 3.95 4.00 6.18 6.18 7.67 Cortex-A72 RK3399
1.50 1.77 1.63 3.00 3.03 4.10 4.07 5.51 Cortex-A76 RK3588
1.03 1.66 1.50 3.00 3.00 4.50 4.50 7.84 IceStorm Apple M1 (variations from 6.0-8.75 for H)
3.01 4.52 3.01 4.01 3.01 4.01 4.01 5.02 U74 JH7100 (RISC-V)

There can be different sophistication levels of CPU cores, both in
terms of dealing with aliases, and in terms of forwarding from stores
to loads. And of course there is the difference between in-order
execution and out-of-order execution.

Aliases

The simplest approach would be to assume that all memory accesses may
refer to the same memory location. In that case we would expect that
the core performs all parameter combinations as badly as H. None of
the measured cores exhibits this behaviour, so obviously they are
more sophisticated than that.

The next approach is to allow architecturally later loads to be
performed at the same time as, or, with OoO, earlier than, stores to a
different address. All tested cores seem to do this.

Finally, there is the issue of what to do when the load is to the
same address as an architecturally preceding store. This brings us to
the next section:

Store to load forwarding

The simplest approach is to wait until the data arrives in the cache
and then load it from there. I dimly remember seeing 15 cycles per
iteration for a loop that incremented a memory location in every
iteration, but none of the cores measured above take that long
(although, for Zen and Zen2 one might wonder).
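
For reference, a minimal sketch of such an increment loop (assuming a
volatile pointer so that the compiler keeps the load and store in every
iteration):

void increment_loop(volatile long *p, long n)
{
  long i;
  /* each iteration loads *p, adds 1, and stores it back; the next
     iteration's load depends on this iteration's store */
  for (i = 0; i < n; i++)
    *p = *p + 1;
}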

The next approach is to let the load read from the store buffer if
the data is there (and first wait until it is there). In this case
the whole sequence of store-load has a total latency that is not much
higher than the load latency. It seems that most cores measured here
do that. We see this best in the H results; e.g., a Skylake has 4
cycles of load latency and 1 cycle of add latency, and we see 5.53
cycles of latency for store-load-add, meaning that the store
contributes an average latency of 0.53 cycles. There are some
complications in case the load only partially overlaps the store (I
should put an appropriate chipsncheese link here).

Finally, the core might detect that the data that is loaded is coming
from a store that has not been retired yet, so the physical register
used by the store can be directly renamed into the register of the
load result, and the load does not need to access the store buffer or
cache at all (what is the name of this feature?). As a result, in the
ideal case we see only the 1-cycle latency of the add in case H. In
the measured cores, Zen3 and Tiger Lake exhibit this behaviour fully;
Rocket Lake probably also does that, but either does not succeed in
all cases (the different results between D and E speak for this
theory), or there is an additional source of latency.

I expect that Firestorm (the performance core of the Apple M1) also
has this feature, but unfortunately the cycles performance counter
does not work for Firestorm on Linux 6.4.0-asahi-00571-gda70cd78bc50.

In-order vs. out-of-order execution

With in-order execution (on Cortex A53 and A55, and on the U74), the
loads cannot be executed before architecturally earlier stores, even
if both refer to different memory locations. So even A is relatively
slow on in-order cores. In-order execution also means that B is
almost as slow as H, while with out-of-order execution it can
theoretically be executed as fast as A (in practice B is slightly
slower than A, but certainly much faster than H).

With OoO, we see much better performance in cases where there are
independent computation chains. Given the size of the buffers in the
various OoO microarchitectures (hundreds of instructions in the
reorder buffer, dozens in schedulers), it is surprising that B is
slower than A given the small size of each iteration (~15
instructions); and even D is slower than E on some
microarchitectures, most notably Rocket Lake.

Measuring your own hardware

You can download the contents of this directory and run the benchmark
with

wget http://www.complang.tuwien.ac.at/anton/memdep/memdep.zip
unzip memdep.zip
cd memdep
make

If you want to do your own parameter combinations, you can run the
binary with

./memdep-`uname -m` a b c d e f g h

where a b c d e f g h are integers and correspond to the pointers in
the loop. If you want to get results like in the table above, run it
like this:

perf stat --log-fd 3 -x, -e $(CYCLES) ./memdep-$(ARCH) a b c d e f g h 3>&1 | awk -F, '{printf("%5.2f\n",$$1/1000000000)}'
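
The line above appears to be the Makefile recipe (CYCLES and ARCH are
Makefile variables, and $$ escapes to $). A hand-typed equivalent might
look like this, assuming an x86-64 build and the generic cycles event:

perf stat --log-fd 3 -x, -e cycles ./memdep-x86_64 0 1 1 0 2 3 3 2 3>&1 | awk -F, '{printf("%5.2f\n",$1/1000000000)}'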

Future work

Make a good microbenchmark that produces the addresses so late that
either the core has to wait for the addresses or has to speculate
whether the load accesses the same address as the store or not. Do it
for predictable aliasing and for unpredictable aliasing. I remember a
good article about this predictor (that actually looked at Intel's
patent), but don't remember the URL; Intel calls this technique, as
implemented in Ice Lake and later P-cores, the Fast Store Forwarding
Predictor (FSFP) (but my memory says that the article I read looked
at older microarchitectures that have a similar feature). AMD
describes such a hardware feature under the name predictive store
forwarding (PSF), which they added in Zen3.

Related

One thing I remember is that I have done a microbenchmark that was
intended to measure predictive store forwarding, but it (also?)
measured the forward-at-register-level technique described above.
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2023-11-03 17:13:41 UTC
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
You can find the text portion of this page below; there are a few
links on the page that are not produced below.
If you have information on any of the things I don't know (or don't
remember), please let me know about it.
- anton
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096

extern void memdep(long *a, long *b, long *c, long *d,
                   long *e, long *f, long *g, long *h);

long a[1000];       /* backing array */
long offsets[8];    /* element offsets for the eight pointers */

for (i=0; i<1000; i++)
  a[i] = 0;
memdep(a+offsets[0], a+offsets[1], a+offsets[2], a+offsets[3],
       a+offsets[4], a+offsets[5], a+offsets[6], a+offsets[7]);
Anton Ertl
2023-11-04 17:01:32 UTC
Post by EricP
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
The numbers are in 8-byte granularity, so to get different cache lines
n*8 is good enough, and for different pages n*512. However, the
array is only 8000 bytes long, so you can use only numbers in the
range 0...999. Anyway, I made another make target "ericp", so when
you "make ericp", you get the following parameter combinations:

X 0 8 16 24 32 40 48 56

Different cache lines, but always the same "bank" in a cache line;
this should hurt K8, maybe also others.

Y 0 9 18 27 36 45 54 63

Different cache lines, and different banks for different accesses.

Z 0 513 994 11 524 989 22 535

All different cache lines and different banks; the second access is on
a different page than the first, and the third is likely on a
different page. Then start with the first page again. These are all
independent accesses.
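
For reference, the X combination corresponds to a direct invocation
along these lines (assuming an x86-64 build; Y and Z are analogous):

./memdep-x86_64 0 8 16 24 32 40 48 56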

Results:

X Y Z A B C D E F G H
1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8

We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
the dependence-wise equivalent A. My guess is that these CPUs can
only store to one cache line per cycle, and can optimize the stores in
some cases for A. If that is the case, that is a resource thing, not
a failure to recognize independence.

We see a slowdown for X on K8, which is somewhat expected; however,
thinking about it, I wonder: It seems as if the K8 can do only one load
or store per cycle if they are to the same bank, but several to
different banks; it has been too long since I read how it worked. Y
and Z are the same speed as A, showing that the distance between
addresses does not influence the no-alias detection (at least for
these distances).

Is this what you had in mind?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2023-11-05 15:09:34 UTC
Post by Anton Ertl
Post by EricP
I notice that all your 8 references are within the same cache line.
Store to load forwarding might behave differently across lines
and 4kB page boundaries.
Instead of 0..7 you might try 0*64..7*64 and 0*4096..7*4096
The numbers are in 8-byte granularity, so to get different cache lines
n*8 is good enough, and for different pages n*512. However, the
array is only 8000 bytes long, so you can use only numbers in the
range 0...999. Anyway, I made another make target "ericp", so when
X 0 8 16 24 32 40 48 56
Different cache lines, but always the same "bank" in a cache line;
this should hurt K8, maybe also others.
Y 0 9 18 27 36 45 54 63
Different cache lines, and different banks for different accesses.
Z 0 513 994 11 524 989 22 535
All different cache lines and different banks; the second access is on
a different page than the first, and the third is likely on a
different page. Then start with the first page again. These are all
independent accesses.
X Y Z A B C D E F G H
1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Zen3
1.00 1.00 1.00 0.78 0.91 1.02 1.81 1.13 2.00 2.00 2.25 Rocketlake
2.00 1.17 1.17 1.17 2.03 1.25 3.30 3.69 4.50 4.50 6.91 K8
We indeed see a slowdown on Zen3 and Rocketlake on X Y Z compared to
the dependence-wise equivalent A. My guess is that these CPUs can
only store to one cache line per cycle, and can optimize the stores in
some cases for A. If that is the case, that is a resource thing, not
a failure to recognize independence.
It might be that A is occasionally combining the multiple stores
to the same line in the store buffer whereas X Y Z do not.
So maybe A needs 25% less cache accesses.
Post by Anton Ertl
We see a slowdown for X on K8, which is somewhat expected; however,
thinking about it, I wonder: It seems as if the K8 can do only one load
or store per cycle if they are to the same bank, but several to
different banks; it has been too long since I read how it worked. Y
and Z are the same speed as A, showing that the distance between
addresses does not influence the no-alias detection (at least for
these distances).
Is this what you had in mind?
- anton
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.

First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.

Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.

And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Chris M. Thomasson
2023-11-05 20:46:36 UTC
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW. Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
EricP
2023-11-08 14:26:23 UTC
Post by Chris M. Thomasson
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW.
Sure, because it's not an RMW. It's just a store which under x86-TSO
becomes visible after prior stores to the protected section.

This rule seems to impose some particular design complexities on the
order in which a Load-Store Queue and cache can perform store operations.

Say we have two cache lines A and B.

If there is a store to cache line A, ST [A+0],
then to a different line B, ST [B+0], then another ST [A+1],
and if the first ST [A+0] hits cache but second ST [B+0] misses,
then under TSO the third ST [A+1] must appear to stall so that it
does not become visible until after the ST [B+0] has been performed,
even though line A is in the cache.

ST [A+0],r1 <- this hits cache
ST [B+0],r2 <- this misses cache
ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish

On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
invalidate msg might get in and transfer ownership of line A to a different
core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].

In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.

And the cache access is pipelined so all of this is asynchronous.
When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
So even in the simple "stall until older stores done" approach it needs
even more logic to detect this and NAK the following stores back to LSQ,
and later wake them up and replay them when the ST [B+0] is done.
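
A litmus-test sketch of that constraint in C11 atomics (a sketch only,
with hypothetical names; on x86 all of these accesses compile to plain
MOVs). Under TSO, an observer that sees the new [A+1] must also see the
new [B+0]:

#include <stdatomic.h>

/* A0/A1 assumed to share one cache line, B0 assumed to sit on another */
_Atomic long A0, A1, B0;

void writer(void)                      /* core C1 */
{
  atomic_store_explicit(&A0, 1, memory_order_release);  /* hits cache   */
  atomic_store_explicit(&B0, 1, memory_order_release);  /* misses cache */
  atomic_store_explicit(&A1, 1, memory_order_release);  /* youngest ST  */
}

int forbidden(void)                    /* core C2 */
{
  long a1 = atomic_load_explicit(&A1, memory_order_acquire);
  long b0 = atomic_load_explicit(&B0, memory_order_acquire);
  return a1 == 1 && b0 == 0;           /* must never be true under TSO */
}
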
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
MitchAlsup
2023-11-10 01:42:15 UTC
Post by EricP
Post by Chris M. Thomasson
Post by EricP
Yes. My thought was that by targeting the same cache line you
might be triggering alternate mechanisms that cause serialization.
First was that x86-TSO coherence allows a younger load to bypass (execute
before) an older store to a non-overlapping address, otherwise it is serial.
The detection of "same address" could be as high resolution as 8-byte
operand or as low as a cache line. So by targeting separate cache lines
it could allow more load-store bypassing and concurrency.
Also, as you noted, by targeting the same cache line it would serialize
on the same cache bank port, if it has multiple banks.
And I just suggested different pages to disable any "same page"
virtual address translation optimizations (if there are any).
Iirc, one can release a spinlock using an atomic store on x86, no
LOCK'ED RMW.
Sure, because it's not an RMW. It's just a store which under x86-TSO
becomes visible after prior stores to the protected section.
This rule seems to impose some particular design complexities on the
order in which a Load-Store Queue and cache can perform store operations.
Say we have two cache lines A and B.
If there is a store to cache line A, ST [A+0],
then to a different line B, ST [B+0], then another ST [A+1],
and if the first ST [A+0] hits cache but second ST [B+0] misses,
then under TSO the third ST [A+1] must appear to stall so that it
does not become visible until after the ST [B+0] has been performed,
even though line A is in the cache.
ST [A+0],r1 <- this hits cache
ST [B+0],r2 <- this misses cache
ST [A+1],r3 <- this waits for B to arrive and store to [B+0] to finish
On core C1, if ST [A+1] was allowed to be performed before [B+0] then an
invalidate msg might get in and transfer ownership of line A to a different
core C2, allowing the new value of [A+1] to be visible at C2 before [B+0].
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Post by EricP
And the cache access is pipelined so all of this is asynchronous.
When ST [B+0] misses cache, ST [A+1] might already be in the pipeline.
So even in the simple "stall until older stores done" approach it needs
even more logic to detect this and NAK the following stores back to LSQ,
and later wake them up and replay them when the ST [B+0] is done.
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an
x86? It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
EricP
2023-11-11 21:43:54 UTC
Post by MitchAlsup
Post by EricP
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
MDM = Memory Dependency Matrix
TC = Temporal Cache
CC = Conditional Cache

The Conditional Cache sounds similar to the Store Buffers that other
designs refer to but with multiple versions, as you outline below.
This seems to duplicate many existing functions of the Load Store Queue.

I'm thinking it would be simpler to keep everything in one circular
load/store queue and hold store data there after retire until it
can be handed off to the cache. See below.
Post by MitchAlsup
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Yes, but that seems a rather expensive approach.
The problem I have with the CC is that it requires some pretty complex
logic to track multiple versions of cache lines, journal before-image copies,
track their inter-dependencies, and decide when to commit updates.
And much of this duplicates functionality that the LSQ already has to
support store-to-load forwarding and load-store bypass address matching.
All those buffers need free lists and allocators, another dependency matrix,
CAM's to match addresses to the assigned buffers.

It's not clear to me that all this is significantly better than
a simpler approach.

Instead I was thinking of having a unified LSQ as a single circular buffer
with all the load and store entries in (circular) program order.
Store data is held in the LSQ after the store instruction retires
until it is accepted by the cache. While it is in the LSQ it can still be
forwarded to younger loads to the same address.

LSQ stores are sent to the pipelined cache which indicates back to LSQ
a hit/miss after the pipeline latency.
If it hits then the entry is removed from LSQ.
If it misses then the store data is held in LSQ and the cache blocks further
stores until the miss resolves. However, LSQ continues to send future
store addresses to trigger line prefetches until all miss buffers are busy.
If a load misses then all further loads must stop because
load-load bypassing is not allowed under TSO.

When the missed line arrives, cache sends a wakeup signal to LSQ
which restarts sending entries from where it left off.
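
A rough sketch of what one entry in such a unified circular queue might
hold (field names are mine, not a claim about any particular design):

#include <stdint.h>

struct lsq_entry {        /* one slot, kept in (circular) program order */
  uint64_t addr;          /* effective address                          */
  uint64_t data;          /* store data, kept here after retire until
                             the cache accepts it; forwarding source    */
  uint8_t  size;          /* access size in bytes                       */
  uint8_t  is_store;      /* load or store                              */
  uint8_t  retired;       /* a retired store may still be waiting here  */
  uint8_t  issued;        /* sent to the pipelined cache, hit/miss pending */
};
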
MitchAlsup
2023-11-11 23:32:30 UTC
Post by EricP
Post by MitchAlsup
Post by EricP
In order to prevent this under TSO, either LSQ actually stalls ST [A+1],
<
If you have an MDM+TC == CC, the CPU can perform the ST into CC where it
awaits "ordering" while the external world is left believing it has not
started yet {This is 1991 technology}. CC can effectively eliminate ST
ordering stalls seen from the SPU while preserving all of the TSO-ness
the external observers need.
MDM = Memory Dependency Matrix
TC = Temporal Cache
CC = Conditional Cache
The Conditional Cache sounds similar to the Store Buffers that other
designs refer to but with multiple versions, as you outline below.
This seems to duplicate many existing functions of the Load Store Queue.
I'm thinking it would be simpler to keep everything in one circular
load/store queue and hold store data there after retire until it
can be handed off to the cache. See below.
<
a) it is circular and tied intimately to the execution window.
b) it integrates the LDs into the queue {just in case a store has
to be replayed all dependent LDs get replayed too and for solving
memory order problems dynamically.}
Post by EricP
Post by MitchAlsup
<
Post by EricP
or it allows it to proceed to the cache but pins line A until the
update to B is done. If it uses the second pinning approach then it
must also deal with all the potential deadlock/livelock possibilities.
<
In a conditional Cache, every instruction has (at least a portion of)
its Data Cache associated line. So every ST has a place to deposit its
data; and that place can be subject to backup and cancellation {based on
external stuff happening}.
<
After the ST has reached the complete state (ready to retire), CC data is
migrated to DC data as porting and banking permit.
<
Yes, but that seems a rather expensive approach.
The problem I have with the CC is that it requires some pretty complex
logic to track multiple versions of cache lines, journal before-image copies,
track their inter-dependencies, and decide when to commit updates.
<
We did not track all that stuff (Mc 88120, circa 1991). What we did was to copy
data from the cache/memory into everybody needing that data; then stores could
overwrite that data in the entries of younger accesses, so most of the STs did
not actually write into DC because we knew a younger store would do that for us.
<
Post by EricP
And much of this duplicates functionality that the LSQ already has to
support store-to-load forwarding and load-store bypass address matching.
<
The CC is the LD/ST queue, but by integrating certain features, most of the
control logic vanishes, and the component "pretty much" manages itself. It
does speak back to the Execution window logic to control the advancement of
the consistent point and the retire point of the window.
<
Post by EricP
All those buffers need free lists and allocators, another dependency matrix,
CAM's to match addresses to the assigned buffers.
<
Nope:: dedicated storage. If the storage is not available, insert stalls.
<
Post by EricP
It's not clear to me that all this is significantly better than
a simpler approach.
<
Think of it like a reservation station for a horizon of memory reference
instructions. Instead of CAMing on renamed registers, you CAM on portion
of VA from the CPU side, and on PA for the snooping side. The MDM handles
the memory renaming in a way that can be undone {Branch prediction repair
and Snoop discovering memory order has been violated.}
Post by EricP
Instead I was thinking of having a unified LSQ as a single circular buffer
with all the load and store entries in (circular) program order.
<
Yep, that is what a CC does. However, once 2 addresses discover "we cannot
refer to the same cache line container" they become independent.
<
Post by EricP
Store data is held in the LSQ after the store instruction retires
<
The ST must deliver its data to DC before the ST can retire. Until the ST
delivers its data to DC, the ST is subject to being replayed upon detection
of memory order violations.
<
Post by EricP
until it is accepted by the cache. While it is in the LSQ it can still be
forwarded to younger loads to the same address.
LSQ stores are sent to the pipelined cache which indicates back to LSQ
a hit/miss after the pipeline latency.
<
And hit data is written into all CC entries waiting for it.
<
Post by EricP
If it hits then the entry is removed from LSQ.
<
Entries are inserted into CC at issue and removed from CC at retire. They
remain replayable until consistent.
<
Post by EricP
If it misses then the store data is held in LSQ and cache blocks further
stores until the miss resolves. However LSQ continues to send future
<
Mc 88120 kept byte writes in the CC.data so STs could write data into CC
even before data arrives from the memory hierarchy.
<
Post by EricP
store address to trigger line prefetches until all miss buffers are busy.
<
The CC is the Miss Buffer !! and entries that missed but got translated
are selected for external access in much the same way that instructions
are selected from reservation stations.
<
Post by EricP
If a load misses then all further loads must stop because
load-load bypassing is not allowed under TSO.
<
But the MDM effectively allows multiple LDs to miss and be serviced
independently {except config, MMI/O, ATOMIC} and repair back to TSO
only if an external event requires actual TSO behavior.
<
Post by EricP
When the missed line arrives, cache sends a wakeup signal to LSQ
which restarts sending entries from where it left off.
<
When miss data returns with the PA, a portion of the PA is used and everybody who
is waiting on the returning data snarfs it up in the same write (into CC)
cycle.
Chris M. Thomasson
2023-11-10 20:57:36 UTC
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
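
A minimal sketch of that store-then-load sequence in C11 atomics
(hypothetical names; the seq_cst fence is the membar, which on x86
compiles to MFENCE or a LOCK'ed instruction):

#include <stdatomic.h>

_Atomic(void *) shared;    /* pointer being protected      */
_Atomic(void *) hazard;    /* this thread's hazard pointer */

void *protect(void)
{
  void *p;
  do {
    p = atomic_load(&shared);
    atomic_store(&hazard, p);                    /* store: publish the hazard */
    atomic_thread_fence(memory_order_seq_cst);   /* order the store before... */
  } while (p != atomic_load(&shared));           /* ...this re-checking load  */
  return p;
}
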
MitchAlsup
2023-11-10 23:35:55 UTC
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes
to memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Chris M. Thomasson
2023-11-11 04:07:21 UTC
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes to
memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
MitchAlsup
2023-11-12 19:13:07 UTC
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86? It
requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes to
memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
<
I watched membar evolve during my time at AMD and decided to make a
machine that never needs to use them--but still gets the right thing
done.
<
Chris M. Thomasson
2023-11-12 22:01:46 UTC
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by EricP
Post by Chris M. Thomasson
Btw, have you ever tried to implement hazard pointers on an x86?
It requires an explicit memory barrier.
That lock-free stuff makes my brain hurt.
Iirc, hazard pointers require a store followed by a load to another
location to be honored. This requires a membar on x86.
<
It is stuff like this that made My 66000 architecture define changes
to memory order based on several pieces of state (thus no membars)
<
Accesses to ROM are unordered
Accesses to config space are strongly ordered
Accesses to MMI/O space are sequentially consistent
Participating accesses (ATOMIC) are sequentially consistent
everything else is causal.
And the HW tracks this on a per memory reference basis--in effect
all orderings are in effect all the time.
<
Performance guys get what they want,
Lamport guys (atomic) get what they want
Device drivers get what they want on the accesses that need it
..
Nice! Well, there is a way to avoid the explicit membar on x86. It
involves a marriage of RCU and Hazard Pointers.
<
I watched membar evolve during my time at AMD and decided to make a
machine that never needs to use them--but still gets the right thing
done.
<
TSO? Oh, wait. I need to think back about your system. It's been some
years since we conversed about it. Btw, what happened to the Mill?
Anyway, I digress.

Fwiw, imho, it helps to think about proper alignment and boundaries, and to
work with a given architecture, not against it.

A highly relaxed memory model can be beneficial for certain workloads.
Kent Dickey
2023-11-12 22:33:11 UTC
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.

I believe that statement to be false. Can you describe some of these
workloads?

Kent
Chris M. Thomasson
2023-11-12 22:39:11 UTC
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
One example... Basically, think along the lines of RCU. Usually,
read-mostly, write-rarely workloads can _greatly_ benefit from not using
any memory barriers. And this is on rather relaxed systems. Think SPARC
in RMO mode. DEC Alpha is a "special case" wrt RCU.
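
A minimal sketch of such a barrier-free read side (RCU-style,
hypothetical names; the address dependency alone orders the two loads
on everything except Alpha):

#include <stdatomic.h>

struct node { int value; };
_Atomic(struct node *) head;

int reader(void)
{
  /* dependent load: no barrier instruction on the read side;
     compilers typically promote consume to acquire */
  struct node *n = atomic_load_explicit(&head, memory_order_consume);
  return n->value;
}
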
Chris M. Thomasson
2023-11-12 22:43:18 UTC
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
MitchAlsup
2023-11-13 00:20:51 UTC
Post by Chris M. Thomasson
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than the
right answer later. There are very few algorithms with this property.
Chris M. Thomasson
2023-11-13 00:28:20 UTC
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
unlock. On SPARC that would be:

acquire = MEMBAR #LoadStore | #LoadLoad

release = MEMBAR #LoadStore | #StoreStore

Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
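
To make that concrete, here is a minimal C++11-style spinlock sketch (not
from the thread, just the std::atomic equivalent of the MEMBAR pairs above):
lock() only needs acquire, unlock() only needs release, and no #StoreLoad /
seq_cst fence appears anywhere.

#include <atomic>

struct spinlock {
    std::atomic<bool> locked{false};

    void lock() {
        // acquire: later loads/stores may not move above the lock,
        // roughly MEMBAR #LoadStore | #LoadLoad on SPARC
        while (locked.exchange(true, std::memory_order_acquire))
            ;                            // spin
    }

    void unlock() {
        // release: earlier loads/stores may not move below the unlock,
        // roughly MEMBAR #LoadStore | #StoreStore on SPARC
        locked.store(false, std::memory_order_release);
    }
};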

Also, RCU prides itself on not having to use any memory barriers for
its read side. If RCU were forced to use seq_cst, basically a LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
Kent Dickey
2023-11-13 05:44:43 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms with the exact opposite meaning from how I would
understand them in computer architecture.

We'll just assume 3 choices for CPU ordering:

- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.

So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
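
As a concrete sketch of the "special cases like Lamport's algorithm"
mentioned above (illustrative C++, not from the thread): in the classic
store-buffering pattern each thread stores one flag and loads the other,
and that store-then-load ordering is exactly what TSO does not give you
for free. With seq_cst (or an explicit #StoreLoad fence) the outcome
r0 == 0 && r1 == 0 is forbidden; with anything weaker it is allowed.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void t0() { x.store(1, std::memory_order_seq_cst);
            r0 = y.load(std::memory_order_seq_cst); }
void t1() { y.store(1, std::memory_order_seq_cst);
            r1 = x.load(std::memory_order_seq_cst); }

int main() {
    std::thread a(t0), b(t1);
    a.join(); b.join();
    assert(r0 == 1 || r1 == 1);  // holds under seq_cst, can fail under weaker orders
}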

A relaxed ordering SYSTEM says: rather than spending a few thousand
gates getting ordering right in hardware in the CPU, we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45-page academic paper using all of the Greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is to put in too many barriers
and wait for someone to nitpick your code and fix it for you.

[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.]

A very simple thought exercise shows that even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), a Relaxed Ordering
system WILL be slower than TSO or Sequential Consistency for workloads
which often use barriers (instructions tagged with acquire/release are
barriers). In a Relaxed Ordering system, the barriers will not be as
efficient as the automatic ordering of TSO/SC (otherwise, why not just do
that?), so if barriers are executed often, performance will be lower than
hardware TSO/SC, even if there is no contention or any work for the
barriers to do. In fact, performance could be significantly lower.

People know this; it's why they keep trying to get rid of barriers in
their code. So get rid of all of them and demand TSO ordering.

Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.

Relaxed Ordering is a mistake.

Kent
a***@littlepinkcloud.invalid
2023-11-13 11:49:31 UTC
Permalink
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.

The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
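
A hedged message-passing sketch of what that buys you (illustrative names,
not from the thread): on a causally consistent, multi-copy-atomic machine
one release/acquire pair publishes the payload, and a reader that sees the
flag is guaranteed to see the data it protects.

#include <atomic>
#include <cassert>
#include <thread>

int payload;                                        // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                   // plain store
    ready.store(true, std::memory_order_release);   // publish
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {    // observe the publication
        // Release/acquire plus multi-copy atomicity means the payload write
        // is visible here; no Alpha-style going-backwards-in-time.
        assert(payload == 42);
    }
}

int main() { std::thread p(producer), c(consumer); p.join(); c.join(); }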
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.

Andrew.
MitchAlsup
2023-11-13 19:16:41 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.
The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
<
The control logic to perform it may be that small, but the state to maintain
it is at least (5×48^2)×1.2 gates.
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
<
Andy Glew used to report that there were surprisingly few instructions
performed out of order on a GBOoO machine.
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.
Andrew.
Chris M. Thomasson
2023-11-13 21:11:18 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
But this isn't true. Real processors aren't anywhere near as wildly
chaotic as this.
The common form of relaxed memory we see today is causally consistent
and multi-copy atomic (and cache coherent). So, all other threads see
stores to a single location in the same order, and you don't get the
extraordinary going-backwards-in-time behaviour of DEC Alpha.
Post by Kent Dickey
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're
going to require software to put in some very difficult to
understand barriers.
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Post by a***@littlepinkcloud.invalid
Post by Kent Dickey
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if
you believe that, a Relaxed Ordering system WILL be slower than TSO
or Sequential for workloads which often use barriers (instructions
tagged with acquire/release are barriers). In a Relaxed ordering
system, the barriers will not be as efficient as the automatic
barriers of TSO/SC (otherwise, why not just do that?),
Whyever not? They do the same thing.
Andrew.
a***@littlepinkcloud.invalid
2023-11-14 10:25:22 UTC
Permalink
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.

(For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
Algorithm, which needs something stronger.)

Because of this, the assertion that programming a non-TSO machine is
"harder" doesn't IMO stand up, at least in C programs, because the
same data-race bugs can manifest themselves as either compiler
optimizations or hardware reorderings. And a compiler can, at least in
theory, do things that are far weirder than any memory system
does.
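
As a hedged illustration (not from the thread) of the compiler being the
first reorderer you meet: with plain variables, the compiler alone is
already allowed to break the naive flag/data idiom on any machine, TSO
included, which is why the races have to be removed at the source level
before the hardware memory model even matters.

int data;
bool ready;                 // plain, non-atomic: a data race on purpose

int consumer_plain() {
    // The compiler may legally hoist the load of 'ready' out of the loop
    // (nothing in the loop changes it), producing an infinite loop or a
    // stale 'data' -- a reordering no memory model, TSO or otherwise,
    // can undo. Undefined behaviour per the C/C++ standards.
    while (!ready) { }
    return data;
}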

Andrew.
Chris M. Thomasson
2023-11-14 20:44:44 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough.
Well, I think I was being a bit too dense here and missed your main
point. Sorry.
Post by a***@littlepinkcloud.invalid
If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering.
Agreed.
Post by a***@littlepinkcloud.invalid
If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
(For pedants: Mostly, but not completely, even on TSO, e.g. Dekker's
Algorithm, which needs something stronger.)
Indeed, it does.
Post by a***@littlepinkcloud.invalid
Because of this, the assertion that programming a non-TSO machine is
"harder" doesn't IMO stand up, at least in C programs, because the
same data-race bugs can manifest themselves as either compiler
optimizations or hardware reorderings. And a compiler can, at least in
theory, can do things that are far weirder than any memory system
does.
Kent Dickey
2023-11-15 20:32:16 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
What you are saying is:

As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.

If you assume you've already solved the problem, then you find the
problem is solved! Magic!

What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery. This acquire/release
nonsense is all weakly ordered brain damage. The problem is that on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem. Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slowdown, not ordering the requests
properly.

If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.

It also would be nice if multithreaded programs could be written in C99,
or pre-C++11. It's kinda surprising we've only been able to write threaded
programs for about 10 years.

Kent
Chris M. Thomasson
2023-11-15 20:42:13 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved! Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery. This acquire/release
nonsense is all weakly ordered brain damage.
Huh? Just because you have a problem with memory ordering does not mean
that all std::memory_order_* should default to seq_cst! You are being
radically obtuse here. It's as if you have no idea what you are
writing about.

[...]
MitchAlsup
2023-11-15 20:56:39 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved! Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Post by Kent Dickey
The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
detecting the start of the ATOMIC event (LL for example), stay SC until
the ATOMIC event is done, then revert back to weakly ordered as it pleases.
The single threaded code gets its performance while the multithreaded code
gets its SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
<
But you see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC, not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order; the (multiplicity
of) memory controllers do. The CPUs just have to figure out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11. It's kinda surprising we've only been able to write threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:00:06 UTC
Permalink
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think it's brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:35:41 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
Chris M. Thomasson
2023-11-15 21:41:06 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to figure
out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
pseudo-code:

std::atomic<node*> m_node = create_stack();



// Read-Copy Update (RCU)
// reader thread iteration...

node* current = m_node.load(std::memory_order_relaxed);
while (current)
{
node* next = current->m_next.load(std::memory_order_relaxed);
compute(current);
current = next;
}


Now, you are trying to tell me that I should use seq_cst instead of
those relaxed loads for the iteration? That would DESTROY
performance, big time... Really BAD!
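
For contrast, here is a hedged sketch of a matching writer side (the node
type and the single-writer assumption are mine, invented to fit the reader
loop above): the writer pays for one release store when it publishes a new
node, so the cost lives on the rarely executed path while the readers stay
barrier-free.

#include <atomic>

struct node {
    int m_value{0};
    std::atomic<node*> m_next{nullptr};
};

std::atomic<node*> m_node{nullptr};   // list head, as in the reader sketch

// Writer side (sketch, assuming a single writer or writers otherwise
// serialized, as is typical for RCU): fully construct the node, then
// publish it with a single release store. Readers using relaxed loads
// rely on the address dependency through the pointer -- fine everywhere
// except DEC Alpha, which needs a read-side barrier (or consume).
void push_front(int value) {
    node* n = new node;
    n->m_value = value;
    n->m_next.store(m_node.load(std::memory_order_relaxed),
                    std::memory_order_relaxed);
    m_node.store(n, std::memory_order_release);    // publish
}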


Listen to all of this:

http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 21:56:31 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be
(probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
If you assume you've already solved the problem, then you find the
problem is solved!  Magic!
What I'm arguing is: the CPU should behave as if memory_order_seq_cst
is set on all accesses with no special trickery.
<
Should appear as if.....not behave as if.
<
Post by Kent Dickey
                                                 This acquire/release
nonsense is all weakly ordered brain damage.
<
Agreed ...
<
Why do you "seem" to think its brain damage? Knowing how to use them
properly is a good thing.
Post by MitchAlsup
Post by Kent Dickey
                                              The problem is on weakly
ordered CPUs, performance definitely does matter in terms of getting this
stuff right, but that's their problem.
<
Not necessarily !! If you have a weakly ordered machine and start doing
something ATOMIC, the processor can switch to sequential consistency upon
the detection of the ATOMIC event starting (LL for example) stay SC until
the ATOMIC even is done, then revert back to weakly ordered as it pleases.
The single threaded code gets is performance while the multithreaded code
gets is SC (as Lamport demonstrated was necessary).
<
Post by Kent Dickey
                                        Being weakly ordered makes
them
slower when they have to execute barriers for correctness, but it's the
barriers themselves that are the slow down, not ordering the requests
properly.
Huh? Avoiding as many memory barriers as possible (aka... ideally
relaxed) is key to making sync algorithms faster! Or using an acquire
instead of seq_cst is great. Only use seq_cst when you absolutely have to.
Post by Chris M. Thomasson
Post by MitchAlsup
<
But You see, doing it my way gets rid of the MemBar instructions but not
their necessary effects. In addition, in my model, every access within
an ATOMIC event is SC not just a MemBar at the front and end.
<
Post by Kent Dickey
If the CPU takes ownership of ordering, then the only rule is: you just
have to use atomic properly (even then, you can often get away with
volatile for most producer/consumer cases), and these subtypes for all
accesses don't matter for correctness or performance.
<
But a CPU is not in a position to determine Memory Order, the
(multiplicity of) memory controllers do, the CPUs just have to
figure out how to put up with
the imposed order based on the kind of things the CPU is doing at that instant.
<
Post by Kent Dickey
It also would be nice if multithreaded programs could be written in C99,
or pre-C++11.  It's kinda surprising we've only been able to write
threaded
programs for about 10 years.
<
I wrote multithreaded programs on an 8085.....
<
Post by Kent Dickey
Kent
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
   node* next = current->m_next.load(std::memory_order_relaxed);
   compute(current);
   current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
Heck, if I knew I was on a DEC Alpha and depending on the writer side of
the algorithm I was using, std::memory_order_consume would be in order.
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
MitchAlsup
2023-11-15 23:32:12 UTC
Permalink
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
node* next = current->m_next.load(std::memory_order_relaxed);
compute(current);
current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered. It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
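
A hedged sketch of that pattern in portable C++ terms (the names and the
'claimed' flag are illustrative, not Mitch's ESM): traverse with relaxed
loads, and spend the one strong atomic operation only on the element being
claimed.

#include <atomic>

struct node {
    std::atomic<bool> claimed{false};
    std::atomic<node*> next{nullptr};
};

node* claim_first_unclaimed(std::atomic<node*>& head) {
    for (node* n = head.load(std::memory_order_relaxed); n != nullptr;
         n = n->next.load(std::memory_order_relaxed)) {      // unordered scan
        bool expected = false;
        if (n->claimed.compare_exchange_strong(expected, true,
                                               std::memory_order_seq_cst))
            return n;        // exclusive access established only here
    }
    return nullptr;
}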
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 23:54:00 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered.
It has to use atomic loads and stores. However, RCU makes it so we do
not need to use any memory barriers (DEC Alpha aside for a moment) while
iterating it.
Post by MitchAlsup
It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
Even a mutex does not need sequential consistency.

Dekker aside for a moment because it depends on a store followed by a
load to another location. TSO cannot even handle this without a membar.
Post by MitchAlsup
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Chris M. Thomasson
2023-11-15 23:56:50 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
std::atomic<node*> m_node = create_stack();
// Read-Copy Update (RCU)
// reader thread iteration...
node* current = n_node.load(std::memory_order_relaxed);
while (current)
{
    node* next = current->m_next.load(std::memory_order_relaxed);
    compute(current);
    current = next;
}
Now, you are trying to tell me that I should use seq_cst instead of
those relaxed memory barriers for the iteration? That would DESTROY
performance, big time... Really BAD!
<
The scan of the concurrent data structure's linked list does not have
to be atomic, or even ordered.
It has to use atomic loads and stores. However, RCU makes it so we do
not need to use any memory barriers (dec alpha aside for a moment) while
iterating it.
Post by MitchAlsup
It is only once you have identified an
element you want exclusive access to that sequential consistency is
needed.
Even a mutex does not need sequential consistency.
Dekker aside for a moment because it depends on a store followed by a
load to another location. TSO cannot even handle this without a membar.
Post by MitchAlsup
Post by Chris M. Thomasson
http://youtu.be/DZJPqTTt7MA
Afaict, std::memory_order_consume was added to support RCU-compatible
algorithms.
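
A hedged sketch of what a consume-based read side looks like (illustrative
names; note that current compilers typically promote consume to acquire,
but the intent is dependency ordering at zero cost on most hardware):

#include <atomic>

struct node { int value; std::atomic<node*> next{nullptr}; };
std::atomic<node*> head{nullptr};

int read_first() {
    // consume: ordering is carried by the data dependency
    // head -> n -> n->value, so no barrier is needed on most hardware
    // (DEC Alpha being the exception the standard had to accommodate).
    node* n = head.load(std::memory_order_consume);
    return n ? n->value : -1;
}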
a***@littlepinkcloud.invalid
2023-11-17 09:03:24 UTC
Permalink
Post by Kent Dickey
Post by a***@littlepinkcloud.invalid
Post by Chris M. Thomasson
Post by a***@littlepinkcloud.invalid
I don't think that's really true. The reorderings we see in currently-
produced hardware are, more or less, a subset of the same reorderings
that C compilers perform. Therefore, if you see a confusing hardware
reordering in a multi-threaded C program it may well be (probably is!)
a bug (according to the C standard) *even on a TSO machine*. The only
common counter-example to this is for volatile accesses.
Be sure to keep C++ std::atomic in mind... Also, std::memory_order_*
Maybe I wasn't clear enough. If you use std::atomic and
std::memory_order_* in such a way that there are no data races, your
concurrent program will be fine on both TSO and relaxed memory
ordering. If you try to fix data races with volatile instead of
std::atomic and std::memory_order_*, that'll mostly fix things on a
TSO machine, but not on a machine with relaxed memory ordering.
As long as you fully analyze your program, ensure all multithreaded
accesses are only through atomic variables, and you label every
access to an atomic variable properly (although my point is: exactly
what should that be??), then there is no problem.
Well, this is definitely true. But it's not exactly a plan: in
practice, people use careful synchronization boundaries and immutable
data structures.
Post by Kent Dickey
What I'm arguing is: the CPU should behave as if
memory_order_seq_cst is set on all accesses with no special
trickery.
What I'm saying is: that isn't sufficient if you are using an
optimizing compiler. And if you are programming for an optimizing
compiler you have to follow the rules anyway. And the optimizing
compiler can reorder stores and loads as much, if not more, than the
hardware does.
Post by Kent Dickey
This acquire/release nonsense is all weakly ordered brain
damage. The problem is on weakly ordered CPUs, performance
definitely does matter in terms of getting this stuff right, but
that's their problem. Being weakly ordered makes them slower when
they have to execute barriers for correctness, but it's the barriers
themselves that are the slow down, not ordering the requests
properly.
How is that? All the barriers do is enforce the ordering.

Take the Apple M1 as an example. It has a TSO mode bit. It also has TSO
stores and loads, intended for when TSO mode is turned off. Are you
saying that the TSO stores and loads use a different mechanism to
enforce ordering from the one used when TSO is on by default?

Andrew.
Stefan Monnier
2023-11-17 15:44:45 UTC
Permalink
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.


Stefan
MitchAlsup
2023-11-17 18:37:17 UTC
Permalink
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, let's say you want to
move an element from one doubly linked list to another place in some
other doubly linked list:: you would write::
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;



if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}

In order to change this into a fully qualified ATOMIC event, the code
is decorated as::

BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}

Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
get the job(s) done.
Chris M. Thomasson
2023-11-17 19:21:13 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
Interesting. Any chance of livelock wrt esmINTERFERENCE() ?
Chris M. Thomasson
2023-11-17 19:22:31 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
^^^^^^^^^^^^^^^^^^^^^^^^

Why perform an esmLOCK here? Please correct my confusion.
Post by MitchAlsup
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
MitchAlsup
2023-11-17 20:49:08 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
^^^^^^^^^^^^^^^^^^^^^^^^
Why perform an esmLOCK here? Please correct my confusion.
esmLOCK on the last participating Store is what tells the HW that the ATOMIC
event is finished. So, the first esmLOCK begins the ATOMIC event and the
one esmLOCK on an outgoing (ST or PostPush) memory reference ends the event.
Post by Chris M. Thomasson
Post by MitchAlsup
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to get
the job(s) done.
Scott Lurndal
2023-11-18 00:03:15 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;
if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
to get the job(s) done.
That looks suspiciously like transactional memory.
MitchAlsup
2023-11-18 01:03:26 UTC
Permalink
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell. This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
fn = fr->next;
fp = fr->prev;
tn = to->next;
if( TRUE )
{
fp->next = fn;
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
fr->next = tn;
return TRUE;
}
return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to
to get the job(s) done.
That looks suspiciously like transactional memory.
It has some flavors of such, but::
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;

But it does have the property that all interested 3rd parties see the
participating memory either entirely in the before state or entirely in
the completely-after state.
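As a sketch of how a caller might use the failure path given these
properties (mine, not from the post; the retry bound, lock()/unlock(),
list_mutex and move_element_locked() are hypothetical), the control
transfer on interference simply lands on the FALSE return, so software
can bound its retries and fall back:

void MoveElementOrFallBack( Element *fr, Element *to )
{
    for( int tries = 0; tries < 8; tries++ )
        if( MoveElement( fr, to ) )      // the ESM routine shown above
            return;                      // ATOMIC event committed

    lock( list_mutex );                  // persistent interference:
    move_element_locked( fr, to );       // fall back to a conventional lock
    unlock( list_mutex );
}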
Chris M. Thomasson
2023-11-18 21:47:31 UTC
Permalink
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see
participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?

https://people.csail.mit.edu/shanir/publications/K-Compare.pdf

?

Check this out the old thread:

https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
MitchAlsup
2023-11-19 19:32:07 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want to
move an element from one doubly linked list to another place in some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
Easily done:
esmLOCK( c1 = p1->condition1 );
esmLOCK( c2 = p2->condition2 );
...
if( c1 == C1 && c2 == C2 && c3 == C3 ... )
...
esmLOCK( some data );

Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
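Filling in the elisions as a concrete (and hedged) rendering of that
sketch: a 3-compare single-swap, where the first compared location is
also the swap target. esmLOCK()/esmINTERFERENCE() are the ESM primitives
from this thread; everything else is a made-up name:

BOOLEAN KCSS3( long *p1, long C1,        // p1 is also the swap target
               long *p2, long C2,
               long *p3, long C3,
               long newval )
{
    long c1, c2, c3;

    esmLOCK( c1 = *p1 );                 // event begins; lines participate
    esmLOCK( c2 = *p2 );
    esmLOCK( c3 = *p3 );

    if( !esmINTERFERENCE() && c1 == C1 && c2 == C2 && c3 == C3 )
    {
        esmLOCK( *p1 = newval );         // single swap; last store ends the event
        return TRUE;
    }
    return FALSE;
}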
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
Chris M. Thomasson
2023-11-19 19:47:01 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all
multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want
to move an element from one doubly linked list to another place in
some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
So, can false sharing on a participating cache line make
esmINTERFERENCE() return true?
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
    esmLOCK( c1 = p1->condition1 );
    esmLOCK( c2 = p2->condition2 );
    ...
    if( c1 == C1 && C2 == C2 && c2 == C3 ... )
        ...
        esmLOCK( some data );
Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
I can see how that would work. The problem is that I am not exactly sure
how esmINTERFERENCE works internally... Can it detect/prevent live lock?
I remember hearing from my friend Joe Seigh, who worked at IBM, that
they had some sort of logic that would prevent live lock in a compare
and swap wrt their free pool manipulation logic. Iirc, it was somewhat
related to ABA, hard to remember right now, sorry. I need to find that
old thread back in comp.programming.threads.
Post by MitchAlsup
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
MitchAlsup
2023-11-19 22:23:21 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Scott Lurndal
Post by MitchAlsup
Post by Stefan Monnier
Post by Kent Dickey
As long as you fully analyze your program, ensure all
multithreaded accesses
are only through atomic variables, and you label every access to an
atomic variable properly (although my point is: exactly what should that
be??), then there is no problem.
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
I'm thinking here of languages like Rust or the STM library of
Haskell.  This also solves the problem that memory accesses can be
reordered by the compiler, since in that case the compiler is fully
aware of which accesses can be reordered and which can't.
        Stefan
<
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event. So, lets say you want
to move an element from one doubly linked list to another place in
some
<
BOOLEAN MoveElement( Element *fr, Element *to )
{
    fn = fr->next;
    fp = fr->prev;
    tn = to->next;
    if( TRUE )
    {
             fp->next = fn;
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
             fr->next = tn;
             return TRUE;
    }
    return FALSE;
}
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
    esmLOCK( fn = fr->next );         // get data
    esmLOCK( fp = fr->prev );
    esmLOCK( tn = to->next );
    esmLOCK( fn );                    // touch data
    esmLOCK( fp );
    esmLOCK( tn );
    if( !esmINTERFERENCE() )
    {
             fp->next = fn;           // move the bits around
             fn->prev = fp;
             to->next = fr;
             tn->prev = fr;
             fr->prev = to;
    esmLOCK( fr->next = tn );
             return TRUE;
    }
    return FALSE;
}
Having a multiplicity of containers participate in an ATOMIC event
is key to making ATOMIC stuff fast and needing fewer ATOMICs to to
get the job(s) done.
That looks suspiciously like transactional memory.
Indeed, it does. Worried about live lock wrt esmINTERFERENCE().
esmINTERFERENCE() is an actual instruction in My 66000 ISA. It is a
conditional branch where the condition is delivered from the miss
buffer (where I detect interference wrt participating cache lines.)
So, can false sharing on a participating cache line make
esmINTERFERENCE() return true?
Post by MitchAlsup
Post by Chris M. Thomasson
Post by MitchAlsup
it has no nesting,
it has a strict limit of 8 participating cache lines,
it automagically transfers control when disruptive interference is detected,
it is subject to timeouts;
But does have the property that all interested 3rd parties see participating
memory only in the before or only in the completely after states.
Are you familiar with KCSS? K-Compare Single Swap?
https://people.csail.mit.edu/shanir/publications/K-Compare.pdf
    esmLOCK( c1 = p1->condition1 );
    esmLOCK( c2 = p2->condition2 );
    ...
    if( c1 == C1 && C2 == C2 && c2 == C3 ... )
        ...
        esmLOCK( some data );
Esm was designed to allow any known synchronization means (in 2013)
to be directly implemented in esm either inline or via subroutine
calls.
I can see how that would work. The problem is that I am not exactly sure
how esmINTERFERENCE works internally... Can it detect/prevent live lock?
esmINTERFERENCE is a branch on interference instruction. This is a conditional
branch instruction that queries whether any of the participating cache lines
has seen a read-with-intent or coherent-invalidate. In effect the branch
logic reaches out to the miss buffer and asks if any of the participating
cache lines has been snooped-for-write:: and you can't do this in 2
instructions or you lose ATOMICity.

If it has, control is transferred and the ATOMIC event is failed
If it has not, and all participating cache lines are present, then this
core is allowed to NAK all requests to those participating cache lines
{and control is not transferred}.

So, you gain control over where flow goes on failure, and essentially
commit the whole event to finish.

So, while it does not eliminate live/dead-lock situations, it allows SW
to be constructed to avoid live/dead-lock situations:: esmWHY provides a
value when an ATOMIC event fails. 0 means success, negative values are
spurious failures (buffer overflows,...), while positive values give the
number of competing threads. So the following code skips elements on a
linked list to decrease future interference.

Element* getElement( unsigned Key )
{
    Element *p, *prev, *next;
    int count = 0;

    for( p = structure.head; p ; p = p->next )
    {
        if( p->Key == Key )
        {
            if( count-- < 0 )
            {
                esmLOCK( p );                    // participating lines
                prev = p->prev;
                esmLOCK( prev );
                next = p->next;
                esmLOCK( next );
                if( !esmINTERFERENCE() )
                {
                    prev->next = next;           // unlink p from the list
                    next->prev = prev;
                    p->prev = NULL;              // scrub p's own links
                    esmLOCK( p->next = NULL );   // last store ends the event
                    return p;
                }
                else
                {
                    count = esmWHY();            // number of competing threads
                    p = structure.head;          // restart, skipping count matches
                }
            }
        }
    }
    return NULL;
}

Doing ATOMIC things like this means one can take the BigO( n^3 ) activity
that happens when a timer goes off and n threads all want access to the
work queue down to BigO( 3 ), i.e. constant; in practice it comes out
around BigO( ln( n ) ) when requesters arrive in random order at random times.
Post by Chris M. Thomasson
I remember hearing from my friend Joe Seigh, who worked at IBM, that
they had some sort of logic that would prevent live lock in a compare
and swap wrt their free pool manipulation logic. Iirc, it was somewhat
related to ABA, hard to remember right now, sorry. I need to find that
old thread back in comp.programming.threads.
Depending on system size, there can be several system function units that
grant "order" for ATOMIC events. These are useful for 64+ node systems
and unnecessary for systems with fewer than 8 nodes. Disjoint memory spaces
can use independent ATOMIC arbiters and whether they are in use or not is
invisible to SW.
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
?
https://groups.google.com/g/comp.arch/c/shshLdF1uqs/m/VLmZSCBGDTkJ
Stefan Monnier
2023-11-20 16:32:00 UTC
Permalink
Post by MitchAlsup
Post by Stefan Monnier
BTW, the above sounds daunting when writing in C because you have to do
that analysis yourself, but there are programming languages out there
which will do that analysis for you as part of type checking.
[...]
Post by MitchAlsup
I created the Exotic Synchronization Method such that you could just
write the code needed to do the work, and then decorate those accesses
which are participating in the ATOMIC event.
[...]
Post by MitchAlsup
In order to change this into a fully qualified ATOMIC event, the code
BOOLEAN MoveElement( Element *fr, Element *to )
{
esmLOCK( fn = fr->next ); // get data
esmLOCK( fp = fr->prev );
esmLOCK( tn = to->next );
esmLOCK( fn ); // touch data
esmLOCK( fp );
esmLOCK( tn );
if( !esmINTERFERENCE() )
{
fp->next = fn; // move the bits around
fn->prev = fp;
to->next = fr;
tn->prev = fr;
fr->prev = to;
esmLOCK( fr->next = tn );
return TRUE;
}
return FALSE;
}
This is nice, but the onus is still on the programmer to manually make
sure they don't forget to always `esmLOCK` all the shared data, that
they don't `esmLOCK` the data that doesn't need it, ...

In contrast, Haskell's STM will emit a type error if you ever try to use
a shared var in a non-atomic sequence (or if you try to do things like
`printf` from within an atomic sequence, ...).

I see Haskell's STM as a kind of "ideal API" for the programmer.
And it works tolerably in many cases. But reconciling this kind of
abstraction with the performance you can get by low-level twiddling is
the hard part :-)


Stefan
EricP
2023-11-13 18:22:44 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In
general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms in the exact opposite meaning as I would understand
computer architecture.
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45 page academic paper using all the greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is put in too many barriers
and wait for someone to nitpick your code and fix it for you.
[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.].
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if you
believe that, a Relaxed Ordering system WILL be slower than TSO or
Sequential for workloads which often use barriers (instructions tagged
with acquire/release are barriers). In a Relaxed ordering system, the
barriers will not be as efficient as the automatic barriers of TSO/SC
(otherwise, why not just do that?), so if barriers are executed often,
performance will be lower than hardware TSO/SC, even if there are no
contentions or any work for the barriers to do. In fact, performance
could be significantly lower.
People know this, it's why they keep trying to get rid of barriers in
their code. So get rid of all them and demand TSO ordering.
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.

Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.

This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.

But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END instructions.
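As a sketch of what that might look like to software (mine; TSO_START and
TSO_END are the hypothetical instructions proposed here, and
__tso_start()/__tso_end() stand in for whatever intrinsics a compiler
would expose for them), the lock-free code inside the bracket is written
with plain relaxed atomics and relies on the TSO mode for ordering:

#include <stdatomic.h>

typedef struct Node { struct Node *next; } Node;

void lockfree_push( _Atomic(Node *) *top, Node *n )
{
    __tso_start();                       // hypothetical: core now follows TSO
    Node *t = atomic_load_explicit( top, memory_order_relaxed );
    do {
        n->next = t;
    } while( !atomic_compare_exchange_weak_explicit( top, &t, n,
                 memory_order_relaxed, memory_order_relaxed ) );
    __tso_end();                         // hypothetical: back to relaxed rules
}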

On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if it's only really
needed for this rare kind of programming?

But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.

If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 to obey them.

Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
MitchAlsup
2023-11-13 19:29:30 UTC
Permalink
Post by EricP
Post by Kent Dickey
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In
general, it
is assumed to be true without proof.
I believe that statement to be false. Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would
ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather than
the right answer later. There are very few algorithms with this property.
That does not make any sense to me. Think of a basic mutex. It basically
requires an acquire membar for the lock and a release membar for the
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it would
require a damn #StoreLoad barrier for both acquire and release. This is
not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers for
its read side. If RCU was forced to use a seq cst, basically LOCKED RMW
or MFENCE on Intel, it would completely ruin its performance.
You are using the terms in the exact opposite meaning as I would understand
computer architecture.
- Sequential consistency (SC). Hardware does everything, there are no
barriers needed, ld/st instructions appear to happen in some
global order.
- Total Store Ordering (TSO) (x86). Stores appear to be done in program
order, but a CPU can peek and see its own local store results
before other CPUs can. Loads appear to be done in some total
program order (not counting hitting its own local stores).
TSO is like SC except there's effectively a store queue, and
stores can finish when placed in the queue, and the queue drains
in FIFO order. Needs no barriers, except for special cases like
Lamport's algorithm (it's easy to avoid barriers).
- Relaxed. Loads and stores are not ordered, users have to put in memory
barriers and hope they did it right.
So, a highly relaxed memory model is the ONLY model which needs barriers.
If you want to get rid of barriers, use a better memory model.
A relaxed ordering SYSTEM says rather than spending a few thousand
gates getting ordering right by hardware in the CPU, instead we're going
to require software to put in some very difficult to understand barriers.
And we're going to have a 45 page academic paper using all the greek
alphabet to describe when you need to put in barriers. Literally no one
understands all the rules, so the best bet is put in too many barriers
and wait for someone to nitpick your code and fix it for you.
[I have a theorem: there is no correct non-trivial multithreaded program
on an architecture which requires barriers for correctness.].
A very simple thought exercise shows even if Sequential Consistency
and/or TSO were slower (and I maintain they are not), but even if you
believe that, a Relaxed Ordering system WILL be slower than TSO or
Sequential for workloads which often use barriers (instructions tagged
with acquire/release are barriers). In a Relaxed ordering system, the
barriers will not be as efficient as the automatic barriers of TSO/SC
(otherwise, why not just do that?), so if barriers are executed often,
performance will be lower than hardware TSO/SC, even if there are no
contentions or any work for the barriers to do. In fact, performance
could be significantly lower.
People know this, it's why they keep trying to get rid of barriers in
their code. So get rid of all them and demand TSO ordering.
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
<
I suggest this switch between modes be done without executing any extra
instructions. I do the switches based on the address space of the access
{DRAM, MMI/O, config, ROM} and I also switch to SC when an ATOMIC event
begins.
<
Post by EricP
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
<
I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
<
Post by EricP
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures.
<
It is hard for people who have a completely von Neumann thinking pattern
to reason about these things; it is not hard for someone whose entire
career was spent doing a multiplicity of things concurrently.
<
SW languages (and debuggers,...) teach people to think:: this happens, then
that happens, then something else happens. HW languages teach people to think
"crap, all of this is happening at once, how do I make sense out of it?"
<
In CPU design (or chip design in general) one is NEVER given the illusion
that a single <vast> state describes the moment. It is a shame that SW
did not follow a similar route.
<
Post by EricP
Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
<
I made this exact argument to SUN circa 1993.....
<
Post by EricP
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
So, anything written in JavaScript........
Chris M. Thomasson
2023-11-13 21:31:27 UTC
Permalink
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]

Intel still requires an explicit membar for hazard pointers as-is. SPARC
in TSO mode still requires a membar for this. SPARC needs a MEMBAR
#StoreLoad for the store-followed-by-a-load-to-another-location
relationship to hold; Intel needs a LOCK'ed atomic or MFENCE to handle this.
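A minimal C11 sketch of that pattern (my example, not from the post): the
shape is hazard-pointer publication, where a store must be ordered before
a later load from a different location, which is exactly the one ordering
TSO does not give for free. hazard_slot and shared_ptr are hypothetical:

#include <stdatomic.h>

_Atomic(void *) hazard_slot;             // this thread's hazard pointer
_Atomic(void *) shared_ptr;              // pointer being protected

void *acquire_hazard( void )
{
    void *p, *q;
    do {
        p = atomic_load_explicit( &shared_ptr, memory_order_relaxed );
        atomic_store_explicit( &hazard_slot, p, memory_order_relaxed );

        /* the store above must be visible before the re-load below:
           store->load ordering, i.e. #StoreLoad; on x86 this is the
           one case that still costs an MFENCE or a LOCK'ed RMW */
        atomic_thread_fence( memory_order_seq_cst );

        q = atomic_load_explicit( &shared_ptr, memory_order_relaxed );
    } while( p != q );                   // retry if the pointer moved
    return p;
}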
Branimir Maksimovic
2023-11-15 04:10:08 UTC
Permalink
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]
Intel still requires an explicit membar for hazard pointers as-is. Sparc
in TSO mode still requires a membar for this. Spard needs a #StoreLoad
wrt the store followed by a load to another location relationship to
hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
I think that the Apple M1 requires it, too. I had problems without a membar.
--
7-77-777, Evil Sinner!
https://www.linkedin.com/in/branimir-maksimovic-6762bbaa/
Chris M. Thomasson
2023-11-17 19:31:53 UTC
Permalink
Post by Branimir Maksimovic
Post by Chris M. Thomasson
Post by Chris M. Thomasson
Post by MitchAlsup
Post by Chris M. Thomasson
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true.  In
general, it
is assumed to be true without proof.
I believe that statement to be false.  Can you describe some of these
workloads?
Also, think about converting any sound lock-free algorithm's
finely tuned memory barriers to _all_ sequential consistency...
That would ruin performance right off the bat... Think about it.
<
Assuming you are willing to accept the wrong answer fast, rather
than the right answer later. There are very few algorithms with
this property.
That does not make any sense to me. Think of a basic mutex. It
basically requires an acquire membar for the lock and a release
acquire = MEMBAR #LoadStore | #LoadLoad
release = MEMBAR #LoadStore | #StoreStore
Okay, fine. However, if I made them sequentially consistent, it
would require a damn #StoreLoad barrier for both acquire and
release. This is not good and should be avoided when possible.
Also, RCU prides itself with not having to use any memory barriers
for its read side. If RCU was forced to use a seq cst, basically
LOCKED RMW or MFENCE on Intel, it would completely ruin its
performance.
[...]
< I suspect SUN lost significant performance by always running TSO and
it still required barrier instructions.
[...]
Intel still requires an explicit membar for hazard pointers as-is. Sparc
in TSO mode still requires a membar for this. Spard needs a #StoreLoad
wrt the store followed by a load to another location relationship to
hold. Intel needs a LOCK'ed atomic or MFENCE to handle this.
I think that Apple M1 requires, too. I has problems wihout membar.
Afaict, any TSO model needs an explicit barrier for the
store-followed-by-a-load-to-another-location relationship to hold. If the
program does not rely on this relationship, then it does not need those
barriers on TSO.
Paul A. Clayton
2023-11-20 16:00:30 UTC
Permalink
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that
relaxed rules
allow, because the local core always sees its own memory as
consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END
instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire
sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
Another possibility is for non-shared memory to be handled
differently. (This is similar to My 66000's handling of memory
types and things mentioned by Mitch Alsup here.)

Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.

Various memory partitioning schemes theoretically can provide
similar benefits for systems with shared memory controllers where
programs do not share most modifiable data with other programs.
Even something like a web hosting system might be able to benefit
from lack of coherence (much less consistency) between different
web hosts.
Scott Lurndal
2023-11-20 17:26:43 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.

Objects allocated on the stack in 'main', for instance.
MitchAlsup
2023-11-21 00:52:08 UTC
Permalink
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push
weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers.  It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.
Even Thread Local Store is not private to the thread if the thread
creates a pointer into it and allows others to see the pointer.

The only thing the HW can validate as non-shared is that portion of
the stack containing callee save registers (and the return address),
but only 2 known architectures place these chunks of memory in an
address space where threads cannot read-write-or-execute that chunk.
Post by Scott Lurndal
Objects allocated on the stack in 'main', for instance.
EricP
2023-11-21 14:29:43 UTC
Permalink
Post by MitchAlsup
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Why do you think that 'stack' would be thread private? It's
quite common to allocate long-lived data structures on the
stack and pass the address of the object to code that may
be executing in the context of other threads. So long as the lifetime of
the object extends beyond the last reference, of course.
Even Thread Local Store is not private to the thread if the thread
creates a pointer into it and allows others to see the pointer.
The only thing the HW can validate as non-shared is that portion of
the stack containing callee save registers (and the return address)
but only 2 known architectures have these chunks of memory is an
address space where threads cannot read-write-or-execute that chunk.
The callee save area may be R-W-E page protected against its own thread,
but that doesn't prevent a privileged thread from concurrently accessing
that save area (say, to edit the stack to deliver a signal),
so the same coherence rules apply there too.
EricP
2023-11-20 18:26:39 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by EricP
Post by Kent Dickey
Thus, the people trapped in Relaxed Ordering Hell then push weird schemes
on everyone else to try to come up with algorithms which need fewer
barriers. It's crazy.
Relaxed Ordering is a mistake.
Kent
I suggest something different: the ability to switch between TSO and
relaxed with non-privileged user mode instructions.
Non-concurrent code does not see the relaxed ordering, and should benefit
from extra concurrency in the Load Store Queue and cache that relaxed rules
allow, because the local core always sees its own memory as consistent.
For example, relaxed ordering allows multiple LD and ST to be in
multiple pipelines to multiple cache banks at once without regard
as to the exact order the operations are applied.
This is fine for non concurrently accessed data structures,
either non-shared data areas or shared but guarded by mutexes.
But relaxed is hard for people to reason about for concurrently accessed
lock free data structures. Now these don't just appear out of thin air so
it is reasonable for a program to emit TSO_START and TSO_END
instructions.
On the other hand, almost no code is lock-free or ever will be.
So why have all the extra HW logic to support TSO if its only really
needed for this rare kind of programming.
But there is also a category of memory area that is not covered by the
above rules, one where one core thinks its memory is local and not shared
but in fact it is being accessed concurrently.
If thread T1 (say an app) on core C1 says its memory is relaxed, and calls
a subroutine passing a pointer to a value on T1's stack, and that pointer
is passed to thread T2 (a driver) on core C2 which accesses that memory,
then even if T2 declared itself to be using TSO rules it would not force
T1 on C1 obey them.
Where this approach could fail is the kind of laissez-faire sharing done
by many apps, libraries, and OS's behind the scenes in the real world.
Another possibility is for non-shared memory to be handled
differently. (This is similar to My 66000's handling of memory
types and things mentioned by Mitch Alsup here.)
Even with a multithreaded program, stack and TLS would be "thread
private" and not require the same consistency guarantees.
Various memory partitioning schemes theoretically can provide
similar benefits for systems with shared memory controllers where
programs do not share most modifiable data with other programs.
Even something like a web hosting system might be able to benefit
from lack of coherence (much less consistency) between different
web hosts.
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
And this is independent of language; it's to do with program structure.

You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code is modularized this way.

But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.

Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread?
What if the variable is on the thread stack or in a heap?

On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.

And this is where TSO makes the difference.
I don't have to know exactly every byte that might be updated concurrently
in some context at some time (and that context can change dynamically).
a***@littlepinkcloud.invalid
2023-11-21 13:44:56 UTC
Permalink
Post by EricP
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
That's more about program design. In a multi-threaded program this is
something you really should know.
Post by EricP
And this is independent of language, its to do with program structure.
You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code in modularized this way.
But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.
Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread.
What if the variable is on the thread stack or in a heap?
On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.
In this case, no, I don't think you do. Barriers only control the
ordering between accesses, not when they become visible, and here
there's only one access. If there are at least two, and you really
need to see one before the other, then you need a barrier. And even on
a TSO machine, you're going to have to do something on both the reader
and the writer sides if you need ordering to be protected from a
compiler.

Andrew.
EricP
2023-11-21 17:11:00 UTC
Permalink
Post by a***@littlepinkcloud.invalid
Post by EricP
But this is my point: in many programs there is no memory that
you can point to and say it is always private to a single thread.
That's more about program design. In a multi-threaded program this is
something you really should know.
Post by EricP
And this is independent of language, its to do with program structure.
You can say a certain memory range is shared and guarded by locks,
or shared and managed by lock-free code.
And we can point to this because the code in modularized this way.
But the opposite of 'definitely shared' is not 'definitely private',
it's 'dont know' or 'sometimes'.
Eg: Your code pops up a dialog box, prompts the user for an integer value,
and writes that value to a program global variable.
Do you know whether that dialog box is a separate thread?
Should you care? What if the dialog starts out in the same thread,
then a new release changes it to a separate thread.
What if the variable is on the thread stack or in a heap?
On a weak ordered system I definitely need to know because
I need barriers to ensure I can read the variable properly.
In this case, no, I don't think you do. Barriers only control the
ordering between accesses, not when they become visible, and here
there's only one access. If there are at least two, and you really
need to see one before the other, then you need a barrier.
The barriers also ensure the various local buffers, pipelines, and
inbound and outbound comms command and reply message queues are drained.
A barrier ensures that the operations that came before it have reached
their coherency point - the cache controller - and that any
outstanding asynchronous operations are complete.
And that in turn controls when values become visible.

On a weak order cpu with no store ordering, the cpu is not required
to propagate any store into the cache within any period of time.
It can stash it in a write combine buffer waiting to see if more
updates to the same line appear.

Weak order requires a membar after a store to force it into the cache,
triggering the coherence handshake which invalidates other copies,
so that when remote cores reread a line they see the updated value.

In other words, to retire the membar instruction the core must force the
prior store values into the coherent cache making them globally visible.

The difference for TSO is that a store has implied membars before it to
prevent it bypassing (executing before) older loads and stores.
Post by a***@littlepinkcloud.invalid
And even on
a TSO machine, you're going to have to do something on both the reader
and the writer sides if you need ordering to be protected from a
compiler.
Andrew.
Compilers are a different discussion.
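
A small C11 illustration (mine, not from the thread) of the writer/reader
pairing being discussed: under a weak model the writer needs release
ordering after filling in the payload and the reader needs acquire
ordering on the flag, or the reader may see the flag set but read stale
data; under TSO both orderings come for free. data and ready are
hypothetical:

#include <stdatomic.h>

int         data;                        // payload, a plain variable
atomic_bool ready;                       // flag the reader polls

void writer( void )
{
    data = 42;                           // 1: fill in the payload
    atomic_store_explicit( &ready, 1,    // 2: publish; release orders the
                           memory_order_release );  // payload before the flag
}

void reader( void )
{
    while( !atomic_load_explicit( &ready, memory_order_acquire ) )
        ;                                // acquire orders the flag before use
    /* here data == 42 is guaranteed */
}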

Anton Ertl
2023-11-13 07:48:35 UTC
Permalink
Post by Chris M. Thomasson
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
Proof by claim?

I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.

1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.

2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).

3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.

Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.

Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.

But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.

[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Stefan Monnier
2023-11-13 15:36:52 UTC
Permalink
Post by Anton Ertl
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
IIRC the main argument for the Mill wasn't that it was going to be
faster but that it would give a better performance per watt by avoiding
the administrative cost of managing those hundreds of reordered
in-flight instructions, without losing too much peak performance.

The fact that performance per watt of in-order ARM cores is not really
lower than that of OOO cores suggests that the Mill wouldn't deliver on
this "promise" either.
Still, I really would like to see how it plays out in practice, instead
of having to guess.


Stefan
MitchAlsup
2023-11-13 19:11:51 UTC
Permalink
Post by Anton Ertl
Post by Chris M. Thomasson
Also, think about converting any sound lock-free algorithm's finely
tuned memory barriers to _all_ sequential consistency... That would ruin
performance right off the bat... Think about it.
Proof by claim?
I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
<
Not faster, but easier to do with acceptable HW costs. The pipeline
is 1-3 stages longer, but HW has dynamic information that SW cannot have.
<
Post by Anton Ertl
2) A little earlier, the Alpha designers thought that they could gain
speed by not implementing denormal numbers and by implementing
imprecise exceptions for FP operations, so that it is not possible to
implement denormal numbers through a software fixup, either. For
<
So did I in Mc 88100--just as wrong then as it is now.
<
Post by Anton Ertl
dealing properly with denormal numbers, you had to insert a trap
barrier right after each FP instruction, and presumably this cost a
lot of performance on early Alpha implementations. However, when I
measured it on the 21264 (released six years after the first Alpha),
the cost was like that of a nop; I guess that the trap barrier was
actually a nop on the 21264, because, as an OoO processor, the 21264
performs precise exceptions without breaking a sweat. And the 21264
is faster than the models where the trap barrier actually does
something. Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).
<
I recall saying it is about 2% of the gate count of an FMAC unit.
<
Post by Anton Ertl
3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.
Given that multi-processors come out of supercomputing, it's no
surprise that the supercomputing attitude is particularly strong
there.
But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
<
The problem that the weak consistency models created comes from the
fact that the model was universal over all accesses. However, the TLB can be used
to solve that problem so each access has its own model and the HW has
to perform with that model, often across a multiplicity of memory
references. For my part I have 4 memory models and the CPUs switch to
the appropriate model upon detection without needing instructions. So
when the first instruction of an ATOMIC event is detected (decode),
all weaker outstanding requests are allowed to complete, all of
the ATOMIC requests are performed in a sequentially consistent manner,
and afterwards the memory model is weakened again.
<
Post by Anton Ertl
[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.
- anton
Paul A. Clayton
2023-11-20 15:50:34 UTC
Permalink
On 11/13/23 2:48 AM, Anton Ertl wrote:
[snip]
Post by Anton Ertl
I think about several similar instances, where people went for
simple-minded hardware designs and threw the complexity over the wall
to the software people, and claimed that it was for performance; I
call that the "supercomputing attitude", and it may work in areas
where the software crisis has not yet struck[1], but is a bad attitude
in areas like general-purpose computing where it has struck.
This is not just a hardware-software wall problem, though that
wall and its abuse are usually well-established.
micro-optimization orientation, I know I need more external
awareness, but as a non-practicing entity what I think or present
has little effect/danger. Even in my case, there is some danger of
spreading a falsehood (or dangerously incomplete truths), so
external correction is valuable (and I value it myself as I
dislike being wrong; being corrected early hurts, but less
than being corrected after the inaccuracy has become well-
established in my own and others' minds).

System-aware optimization also interacts with interface layering.
Isolating concerns reduces design complexity and from a given
complexity allows exploiting "don't care" aspects. The "don't
care" aspects can be painful when the interface user does care;
sometimes these can force a violation of the abstraction,
introducing a dependency on a specific implementation (which can
then introduce an informal interface [performance compatibility
is a common informal interface]).
Post by Anton Ertl
1) People thought that they could achieve faster hardware by throwing
the task of scheduling instructions for maximum instruction-level
parallelism over to the compiler people. Several companies (in
particular, Intel, HP, and Transmeta) invested a lot of money into
this dream (and the Mill project relives this dream), but it turned
out that doing the scheduling in hardware is faster.
Yet there does not seem to be a strong push to develop a dataflow-
oriented interface/ISA (that does not require genius
programmers or super-genius compilers). I am not certain what such
an interface would look like, but I suspect something closer to a
transport-triggered architecture (TTA) would be an early step. A
TTA-like architecture would compactly encode single use values and
provide some routing information while supporting (possible)
multiple use and some sense of use deferment (loads and stores).

Value prediction (including branch/predicate prediction) also
seems to be required to be included in design considerations.

Such an ISA would also probably blur the boundaries between
threads and naturally support speculative multithreading, which is
in some sense a distant/variable deferment communication/dataflow.

[snip]
Post by Anton Ertl
Meanwhile, Mitch Alsup also has posted that he can
implement fast denormal numbers with IIRC 30 extra gates (which is
probably less than what is needed for implementing the trap barrier).
I think that cost estimate assumes the inclusion of (single
rounding) FMADD. Single-rounding FMADD was not common for RISCs
when the Alpha designers made their choice.

I am **certainly not** a numerical analyst, but I had the
impression that flush-to-zero was not horrible for analyzing for
correctness and (for double precision) not commonly a problem. Yet
I also think that having multiply round based on an "integer"
power-of-two high result (without carry-in) — where the hardware
could also be used for integer multiply by reciprocal — might have
been "better", so my opinion should probably be taken with a mine
of salt.

I would not be surprised if special-purpose low-power DSPs not
only use non-IEEE formats but also use inexact rounding. Even using
inexact computation might be justified for extreme cases.
Post by Anton Ertl
3) The Alpha is a rich source of examples of the supercomputer
attitude: It started out without instructions for accessing 8-bit and
16-bit data in memory. Instead, the idea was that for accessing
memory, you would use instruction sequences, and for accessing I/O
devices, the device was mapped three times or so: In one address range
you performed bytewise access, in another address range 16-bit
accesses, and in the third address range 32-bit and 64-bit accesses;
I/O driver writers had to write or modify their drivers for this
model. The rationale for that was that they required ECC for
permanent storage and that would supposedly require slow RMW accesses
for writing bytes to write-back caches. Now the 21064 and 21164 had a
write-through D-cache. That made it easy to add byte and word
accesses (BWX) in the 21164A (released 1996), but they could have done
it from the start. The 21164A is in no way slower than the 21164; it
has the same IPC and a higher clock rate.
Yet Intel has been using byte parity for L1 Dcaches, so that
design choice was perhaps not *entirely* irrational. (I disagree
with that choice, having hindsight, but I can appreciate the
reasoning.) Parity-only L1 Dcaches are not that bad since the
SRAM design will likely be more robust to allow faster access (I
think) and dirty values will tend to be either evicted quickly or
checked often.

If smaller writes are rare, hardware RMW in a writeback cache
would not have been that expensive, but the cost would have no
value if smaller writes are never necessary.

(I do wonder if there is an interface that would allow software to
reduce hardware RMW costs — often a value is read before being
modified — without introducing more complexity than benefit.
Exploiting the standard double-wide read used for unaligned
accesses to access a double-wide aligned memory seems similarly
desirable. While idiom-detection would allow this to be done in
hardware without changing the interface, idiom detection is more
complex than direct encoding and typically relies on software to
reduce that complexity — e.g., only detecting short contiguous
idioms.)

The different memory regions trick is also used for bit-granular
accesses in some ISAs (e.g., ARM) mainly for I/O device accesses.
Even without side-effects for accesses, non-atomicity might be a
concern. (Of course, one could architect that all simple load-op-
store sequences on that type of memory are atomic, using three
instruction idiom detection.)
Post by Anton Ertl
Some people welcome and celebrate the challenges that the
supercomputer attitude poses for software, and justify it with
"performance", but as the examples above show, such claims often turn
out to be false when you actually invest effort into more capable
hardware.
The tricky part seems to be in discerning when (and where) extra
effort is justified. This also depends on how easily the
difficulty can be encapsulated. Can a compiler reliably "do the
right thing" (without having to have been written by a supergenius
AI)? Can a library reliably provide the necessary extra
functionality — splitting the difficulty between application
programmer discipline and difficulty of developing the system
software — without requiring genius system programmers and highly
competent application programmers?

Someone who writes lock-free methods for fun is probably not well-
positioned to estimate the difficulty/lack-of-fun of such for most
programmers. Communication between different interest groups seems
critical, but communication also requires data and not just
anecdotes or traditional wisdom. (Anecdotes and traditional wisdom
do have value!)

[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
interface) perspective, weak consistency is just bad architecture:
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.

When an effort is considered hard, like parallel programming, there
seems to be a spectrum of viewpoints, from the UNIX/"real
programmers" perspective of limiting the effort to experts, to
simplifying the system interface so that almost anyone can do almost anything.
The extreme positions have obvious cultural issues (where
expertise is either required for worth or expertise is despised as
arrogance) as well as mechanical issues (expertise is naturally
limited by finite knowledge — where vast knowledge implies
communication overhead even within a single supercomputer
complex).
Post by Anton Ertl
[1] The software crisis is that software costs are higher than
hardware costs, and supercomputing with its gigantic hardware costs
notices the software crisis much later than general-purpose computing.
This is one strong reason for complexity to be shifted toward
hardware, but I think that there is a danger of "toward" becoming
"into".

I do not know nearly enough about memory ordering considerations
in hardware and software to have more than an opinion based on
which experts I believe (and a tiny amount of rational/data
consistency inference on my part). From what I have read, TSO
seems to be the best tradeoff of hardware overhead and software
difficulty, but I suspect the best set of fine-grained guarantees
may be somewhat different than provided by simple TSO. I also
think there may be noticeable advantages for allowing use that is
outside of recipes/the formal interface.

Documenting an interface often brings an assumption of continuity,
so (besides the cost of writing documentation) there is a
disincentive to expose internals that leak through the abstraction
layer.

(That was a very wordy response.)
MitchAlsup
2023-11-20 18:51:49 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
good architecture says what happens to the architectural state when
software performs some instruction. From that perspective sequential
consistency is architecturally best. Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.
All accesses being universally sequentially consistent is way too
much ordering; however, the ability to detect the start and end of
ATOMIC events and switch to SC gives the programmer all the
order he needs without constraining non-concurrent memory
at all.

Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.

On the other hand true ROM needs no ordering whatsoever--so why
impose any ??

One size does not fit all !!
Chris M. Thomasson
2023-11-20 20:29:01 UTC
Permalink
Post by MitchAlsup
Post by Paul A. Clayton
[snip]
Post by Anton Ertl
But if you look at it from an architecture (i.e., hardware/software
good architecture says what happens to the architectural state when
software performs some instruction.  From that perspective sequential
consistency is architecturally best.  Weaker consistency models
describe how the architecture does not provide the sequential
consistency guarantees that are so easy to describe; the weaker the
model, the more deviations it has to describe.
I am not convinced that sequential consistency is the best
interface. My66000 does not provide sequential consistency for
ordinary memory. While Mitch Alsup would have difficulty
empathizing with most programmers, he has enough experience to
write specifications for "hostile" engineers so he probably
understands the tradeoffs on both sides of the interface fairly
well.
All accesses being universally sequentially consistent is way too
much ordering, however, the ability to detect the start-end of
ATOMIC events and switching to SC gives the programmer all the
order he needs without constraining the non-concurrent memory
at all.
Over at config-space control registers--these need more than TSO or SC,
these need strong ordering.
On the other hand true ROM needs no ordering whatsoever--so why
impose any ??
One size does not fit all !!
Fwiw, I remember posting an idea of so-called tagged memory barriers on
this group some years ago. I need to try to dig it up.
MitchAlsup
2023-11-13 00:19:21 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
<
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
<
Post by Kent Dickey
Kent
Kent Dickey
2023-11-13 05:54:31 UTC
Permalink
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).

Relaxed memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).

Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.

It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".

Kent
Chris M. Thomasson
2023-11-13 21:36:37 UTC
Permalink
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing std::memory_order_relaxed to even
exist? I must be misunderstanding your point here. Sorry if I am. ;^o
Kent Dickey
2023-11-15 21:09:30 UTC
Permalink
Post by Chris M. Thomasson
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
exist? I must be misunderstanding you point here. Sorry if I am. ;^o
You have internalized weakly ordered memory, and you're having trouble
seeing beyond it.

CPUs with weakly ordered memory are the ones that need all those flags.
Yes, you need the flags if you want to use those CPUs. I'm pointing out:
we could all just require better memory ordering and get rid of all this
cruft. Give the flag, don't give the flag, the program is still correct
and works properly.

It's like FP denorms--it's generally been decided the hardware cost
to implement it is small, so hardware needs to support it at full speed.
No need to write code in a careful way to avoid denorms, to use funky CPU-
specific calls to turn on flush-to-0, etc., it just works, we move on to
other topics. But we still have flush-to-0 calls available--but you don't
need to bother to use them. In my opinion, memory ordering is much more
complex for programmers to handle. I maintain it's actually so
complex most people cannot get it right in software for non-trivial
interactions. I've found many hardware designers have a very hard time
reasoning about this as well when I report bugs (since the rules are so
complex and poorly described). There are over 100 pages describing memory
ordering in the Arm Architecture Reference Manual, and it is very
complex (Dependency through registers and memory; Basic Dependency;
Address Dependency; Data Dependency; Control Dependency; Pick Basic
dependency; Pick Address Dependency; Pick Data Dependency; Pick
Control Dependency, Pick Dependency...and this is just from the definition
of terms). It's all very abstract and difficult to follow. I'll be
honest: I do not understand all of these rules, and I don't care to.
I know how to implement a CPU, so I know what they've done, and that's
much simpler to understand. But writing a threaded application is much
more complex than it should be for software.
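The flush-to-0 calls mentioned above are, on x86, the MXCSR FTZ/DAZ
controls. A hedged sketch of the opt-in call, shown only because the
analogy comes up here:

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (x86/SSE specific)
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

void enable_ftz_daz(void) {
    // Flush denormal results to zero and treat denormal inputs as zero.
    // On cores that handle denormals at full speed this buys nothing,
    // which is exactly the point being made above.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}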

The cost to do TSO is some out-of-order tracking structures need to get
a little bigger, and some instructions have to stay in queues longer
(which is why they may need to get bigger), and allow re-issuing loads
which now have stale data. The difference between TSO and Sequential
Consistency is to just disallow loads seeing stores queued before they
write to the data cache (well, you can speculatively let loads happen,
but you need to be able to walk it back, which is not difficult). This
is why I say the performance cost is low--normal code missing caches and
not being pestered by other CPUs can run at the same speed. But when
other CPUs begin pestering us, the interference can all be worked out as
efficiently as possible using hardware, and barriers just do not
compete.

Kent
Chris M. Thomasson
2023-11-15 21:25:30 UTC
Permalink
Post by Kent Dickey
Post by Chris M. Thomasson
Post by Kent Dickey
Post by MitchAlsup
Post by Kent Dickey
Post by Chris M. Thomasson
A highly relaxed memory model can be beneficial for certain workloads.
I know a lot of people believe that statement to be true. In general, it
is assumed to be true without proof.
<
In its most general case, relaxed order only provides a performance advantage
when the code is single threaded.
I believe a Relaxed Memory model provides a small performance improvement
ONLY to simple in-order CPUs in an MP system (if you're a single CPU,
there's nothing to order).
Relazed Memory ordering provides approximately zero performance improvement
to an OoO CPU, and in fact, might actually lower performance (depends on
how barriers are done--if done poorly, it could be a big negative).
Yes, the system designers of the world have said: let's slow down our
fastest most expensive most profitable CPUs, so we can speed up our cheapest
lowest profit CPUs a few percent, and push a ton of work onto software
developers.
It's crazy.
Post by MitchAlsup
Post by Kent Dickey
I believe that statement to be false. Can you describe some of these
workloads?
<
Relaxed memory order fails spectacularly when multiple threads are accessing
data.
Probably need to clarify with "accessing modified data".
Kent
Huh? So, C++ is crazy for allowing for std::memory_order_relaxed to even
exist? I must be misunderstanding you point here. Sorry if I am. ;^o
You have internalized weakly ordered memory, and you're having trouble
seeing beyond it.
Really? Don't project yourself on me. Altering all of the memory
barriers of a finely tuned lock-free algorithm to seq_cst is VERY bad.
Post by Kent Dickey
CPUs with weakly ordered memory are the ones that need all those flags.
we could all just require better memory ordering and get rid of all this
cruft. Give the flag, don't give the flag, the program is still correct
and works properly.
Huh? Just cruft? wow. Just because it seems hard for you does not mean
we should eliminate it. Believe it or not there are people out there
that know how to use memory barriers. I suppose you would use seq_cst to
load each node of a lock-free stack iteration in a RCU read-side region.
This is terrible! Really bad, bad, BAD! Afaict, it kinda, sorta, seems
like you do not have all that much experience with them. Humm...
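For concreteness, the kind of traversal being talked about looks
something like this rough C++ sketch (rcu_read_lock/rcu_read_unlock are
just placeholders for whatever RCU implementation is in use, and node
reclamation is not shown; the point is only that the per-node loads do
not need seq_cst):

#include <atomic>

struct Node {
    int value;
    Node* next;   // written before the node is published, immutable afterwards
};

std::atomic<Node*> head{nullptr};

void push(Node* n) {
    n->next = head.load(std::memory_order_relaxed);
    // The release CAS publishes the node, ordering its fields before it.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ;
}

int sum_stack() {
    // rcu_read_lock();   // placeholder: RCU read-side critical section begins
    int sum = 0;
    // One acquire load of the head; following ->next needs no further
    // barriers because each node was published with a release store and
    // is never modified after publication.
    for (Node* n = head.load(std::memory_order_acquire); n; n = n->next)
        sum += n->value;
    // rcu_read_unlock(); // placeholder: reclamation is deferred to RCU
    return sum;
}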
Post by Kent Dickey
It's like FP denorms--it's generally been decided the hardware cost
to implement it is small, so hardware needs to support it at full speed.
No need to write code in a careful way to avoid denorms, to use funky CPU-
specific calls to turn on flush-to-0, etc., it just works, we move on to
other topics. But we still have flush-to-0 calls available--but you don't
need to bother to use them. In my opinion, memory ordering is much more
complex for programmers to handle. I maintain it's actually so
complex most people cannot get it right in software for non-trivial
interactions. I've found many hardware designers have a very hard time
reasoning about this as well when I report bugs (since the rules are so
complex and poorly described). There are over 100 pages describing memory
ordering in the Arm Architectureal Reference Manual, and it is very
complex (Dependency through registers and memory; Basic Dependency;
Address Dependency; Data Dependency; Control Dependency; Pick Basic
dependency; Pick Address Dependency; Pick Data Dependency; Pick
Control Dependency, Pick Dependency...and this is just from the definition
of terms). It's all very abstract and difficult to follow. I'll be
honest: I do not understand all of these rules, and I don't care to.
I know how to implement a CPU, so I know what they've done, and that's
much simpler to understand. But writing a threaded application is much
more complex than it should be for software.
The cost to do TSO is some out-of-order tracking structures need to get
a little bigger, and some instructions have to stay in queues longer
(which is why they may need to get bigger), and allow re-issuing loads
which now have stale data. The difference between TSO and Sequential
Consistency is to just disallow loads seeing stores queued before they
write to the data cache (well, you can speculatively let loads happen,
but you need to be able to walk it back, which is not difficult). This
is why I say the performance cost is low--normal code missing caches and
not being pestered by other CPUs can run at the same speed. But when
other CPUs begin pestering us, the interference can all be worked out as
efficiently as possible using hardware, and barriers just do not
compete.
Having access to fine grain memory barriers is a very good thing. Of
course we can use C++ right now and make everything seq_cst, but that is
moronic. Why would you want to use seq_cst everywhere when you do not
have to? There are rather massive performance implications.

Are you thinking about a magic arch that we cannot use right now?
Stefan Monnier
2023-11-15 22:40:15 UTC
Permalink
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.

We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.

I suspect there can be very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).

So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.

BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?


Stefan
Chris M. Thomasson
2023-11-15 23:01:42 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
Oh, shit. This is the main point of my confusion. When I say it's bad, I
am referring to using std::memory_order_seq_cst all over the place on
_existing_ architectures.
Post by Stefan Monnier
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Offering sequential consistency and mandating it are different things.
We can all get sequential consistency just by using
std::memory_order_seq_cst all of the time. If an arch can be created
that is 100% sequential consistency and beat out existing finely tuned
algorithms, like RCU based lock-free algorithms, then, I would love to
see it.

As of now, wrt our current archs, say, iterating a lock-free stack in an
RCU read-side region using seq_cst is horrible.
Chris M. Thomasson
2023-11-15 23:15:10 UTC
Permalink
Post by Chris M. Thomasson
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
Oh, shit. This is the main point of my confusion. When I say its bad, I
am referring to using std::memory_order_seq_cst all over the place on
_existing_ architectures.
Post by Stefan Monnier
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it.  I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Offering sequential consistency and mandating it are different things.
We can all get sequential consistency just by using
std::memory_order_seq_cst all of the time. If an arch can be created
that is 100% sequential consistency and beat out existing finely tuned
algorithms, like RCU based lock-free algorithms, then, I would love to
see it.
As of now, wrt are current arch's, say, iterating a lock-free stack in a
RCU read side region using seq_cst is horrible.
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
Stefan Monnier
2023-11-15 23:39:28 UTC
Permalink
Post by Chris M. Thomasson
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
No, but it all depends on the hardware cost.
SGI's big multiprocessors (starting with their R10K processor) obeyed
sequential consistency and AFAIK that did not prevent them from providing
top-notch performance.

So if it's not terribly costly, it may end up being *faster* because the
memory barriers become noops (not to mention the fact that performance
engineers can spend time on other things).

I suspect Intel and others did look into it at some point and ended up
deciding it's not actually faster, but I agree with Kent that the
complexity cost for programmers might not really be warranted and that
maybe we'd all be better off if CPU manufacturers followed SGI's lead.

According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
the cost of supporting SC can be around 6% using tricks known in 2012.
I'd be surprised if it can't be brought further down.


Stefan
Chris M. Thomasson
2023-11-16 00:03:31 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Just because programming sync algorithms on a weakly ordered arch can be
difficult for some people does not mean we have to get rid of it
altogether...
No, but it all depends on the hardware cost.
SGI's big multiprocessors (starting with their R10K processor) obeyed
sequential consistency and AFAIK that did not prevent them from providing
top-notch performance.
So if it's not terribly costly, it may end up being *faster* because the
memory barriers become noops (not to mention the fact that performance
engineers can spend time on other things).
I suspect Intel and others did look into it at some point and ended up
deciding it's not actually faster, but I agree with Kent that the
complexity cost for programmers might not really be warranted and that
maybe we'd all be better off if CPU manufacturers followed SGI's lead.
According to https://dl.acm.org/doi/abs/10.1145/2366231.2337220
the cost of supporting SC can be around 6% using tricks known in 2012.
I'd be surprised if it can't be brought further down.
Actually, for some reason this line of thinking reminds me of the
following funny scene in a movie called Dirty Rotten Scoundrels, where
they had to put corks on the forks of Ruprecht (Steve Martin) to
prevent him from hurting himself. So, basically, Ruprecht would be the
programmer and the Michael Caine character would be the arch designer
thinking that relaxed models are too complex, and all programmers are
mainly morons:



http://youtu.be/SKDX-qJaJ08

Afaict, std::memory_order_consume was added to support RCU compatible
algorithms.
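A rough sketch of the intended source-level idiom (in practice most
compilers currently promote consume to acquire, but the pattern is the
rcu_dereference() one; the names here are made up):

#include <atomic>

struct Config { int timeout; int retries; };

std::atomic<Config*> current_config{nullptr};

// Writer: fill in the object, then publish it with a release store.
void publish(Config* c) {
    current_config.store(c, std::memory_order_release);
}

// Reader: consume is meant to order only the loads that are
// data-dependent on the loaded pointer, i.e. the dereference itself.
int get_timeout() {
    Config* c = current_config.load(std::memory_order_consume);
    return c ? c->timeout : -1;
}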
Stefan Monnier
2023-11-16 00:37:01 UTC
Permalink
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.

But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.

The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.

You don't need to be a moron to be baffled by the complexity of relaxed
memory models.


Stefan
Chris M. Thomasson
2023-11-16 02:21:04 UTC
Permalink
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.

Fwiw, I like the SPARC model where one could put the processor in RMO or
TSO mode. I think there was another mode, but I cannot remember it right
now. A SEQ mode should be doable in SPARC. However, it would incur a
performance penalty.

The funny thing is that in lock/wait-free programming, we always look
for the weakest barriers that can be used and still be correct. So, if
everything was suddenly seq_cst with much better performance on a new
magic arch, well, I would need to check it out for sure!
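As a trivial example of picking the weakest barrier that is still
correct (a sketch, nothing more): a statistics counter needs no
ordering at all, while publishing an object needs release, and neither
needs seq_cst:

#include <atomic>

std::atomic<long> hits{0};

void count_hit() {
    // Pure counter: no other data is being published, so relaxed is enough.
    hits.fetch_add(1, std::memory_order_relaxed);
}

struct Msg { int data; };
std::atomic<Msg*> mailbox{nullptr};

void post(Msg* m) {
    // Publication: release makes the Msg contents visible to an acquire
    // reader, but nothing here needs the single total order of seq_cst.
    mailbox.store(m, std::memory_order_release);
}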
Scott Lurndal
2023-11-16 14:54:49 UTC
Permalink
Post by Chris M. Thomasson
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.
We ran into an issue with Linux circa 2006 or thereabouts where
accessing an skb (network stack buffer)
required adding a barrier (AMD64 processor). Details are fuzzy two decades later....
Chris M. Thomasson
2023-11-16 22:07:16 UTC
Permalink
Post by Scott Lurndal
Post by Chris M. Thomasson
Post by Stefan Monnier
Actually, for some reason this line of thinking reminds me of the following
funny scene in a movie called Dirty Rotten Scoundrels. Where they had to
put corks on the forks of Ruprecht (Steve Martin), to prevent him from
hurting himself. So, basically, Ruprecht would be the programmer and the
Micheal Caine character would be the arch designer thinking that relaxed
Of course, cognitive dissonance would bite those people who have invested
efforts into learning about all the intricacies of non-sequential
memory models.
But if we can make SC's efficiency sufficiently close to that of TSO,
the benefit could be significant for all those people who have not
invested such efforts.
The evolution of computer programming is littered with steps that reduce
performance in exchange for a cleaner programming model.
You don't need to be a moron to be baffled by the complexity of relaxed
memory models.
Mainly, the kernel guys have to know all about them in order to try to
squeeze as much performance as they can from a given arch. Lock/Wait
free algorithms are used quite a bit. Creating them is never easy,
relaxed model or not.
We ran into an issue with linux circa 2006 or thereabouts where
there was an issue accessing an skb (networks stack buffer) that
required adding a barrier (AMD64 processor). Details are fuzzy two decades later....
I know that RCU is being used for dynamic routing tables, and a lot more
in the Linux kernel. I cannot remember if the skb code used it, but I bet it did.
For some damn reason, this makes me think of eieio on the PPC...
Kent Dickey
2023-11-16 01:41:56 UTC
Permalink
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, was
sequentially consistent and needed no barriers. It was not a problem;
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.

Since the original HP PA-RISC MP designs were sequentially consistent, all
the implementations afterward kept it up since no one wanted any existing
code to break. The architects defined the architecture to allow weak
ordering, but no implementation (by HP at least) did so. These
architects then went on to IA-64, where it really is weak. Since IA-64 was
in-order, weak ordering has more of a payoff there, and IA-64 didn't want to
spend hardware on this (they had 50 other things to waste it on); IA-64 is
full of bad ideas and ideas just not implemented well.

Sun was TSO, which is weaker. Sun was never a performance champion,
other than by throwing the most cores at a parallel problem. So being
TSO relative to Sequential Consistency didn't seem to buy Sun much. DEC
Alpha was the poster child of weakly ordered (so weak they didn't maintain
single CPU consistent ordering with itself) and it was often a performance
champion on tech workloads, but that had more to do with their
much higher clock frequencies, and that edge went away once out-of-order
took off. DEC Alpha was never a big player in TPC-C workloads, where
everybody made their money in the 90's. Technical computing is nice and
fun, but there was not a lot of profit in it compared to business workloads
in the 90s. IA64's reason to exist was to run SGI/SUN/DEC out of business,
and it effectively did (while hurting HP about as much).

I don't know if SGI was sequentially consistent. It's possible, since
software developed for other systems might have pushed them to support it,
but the academic RISC folks were pretty big on weakly ordered.

A problem with weak ordering is that no implementation is THAT weakly
ordered. To maximize doing things in a bad order requires "unlucky" cache
misses, and these are just not common in practice. So "Store A; Store B"
often appears to be done in that order with no barriers on weakly ordered
systems. It's hard to feel confident you've written anything complex
right, so most algorithms are kept relatively simple to make it more likely
they are well tested.

HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
using just load and store. I forget the details, but it was something
like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
your byte to "iwant", then read the word. If everyone else is 0, then
set your byte to "iwon", then read the word. If everyone else is still
0, you've won, do what you want. And if you see other bytes getting
set, then you move to a backoff algorithm to determine the winner (you
have 256 states you can move through). Note that when you release the
lock, you can pick the next winner with a single store. What I
understood was this was actually faster than a simple compare-and-swap
since it let software immediately see if there was contention, and move
to a back-off algorithm right away (and you can see who's contending,
and deal with that as well). Spinlocks tend to lead to cache
invalidation storms and are hard to tune, but this was much more
tuneable. It scaled to 64 CPUs by doing the lock in two steps and
moving to a 64-bit word.
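A rough C sketch of the scheme described above (names invented, not the
actual HP-UX code; note that it leans on the sequential consistency
described above, and on a weakly ordered machine it would need barriers):

#include <stdint.h>

enum { FREE = 0, IWANT = 1, IWON = 2 };

typedef union {
    volatile uint8_t  byte[4];            /* one byte per CPU */
    volatile uint32_t word;
} simple_lock;

static int others_clear(simple_lock *l, int me)
{
    union { uint32_t w; uint8_t b[4]; } snap;
    snap.w = l->word;                     /* one 32-bit load sees all CPUs */
    for (int i = 0; i < 4; i++)
        if (i != me && snap.b[i] != FREE)
            return 0;
    return 1;
}

/* Returns nonzero if we took the lock; on contention the caller falls
   back to the (not shown) backoff protocol described above. */
int simple_lock_try(simple_lock *l, int me)
{
    l->byte[me] = IWANT;                  /* announce interest */
    if (!others_clear(l, me)) return 0;   /* someone else wants it: backoff */
    l->byte[me] = IWON;                   /* claim the lock */
    if (!others_clear(l, me)) return 0;   /* raced with someone: backoff */
    return 1;
}

void simple_lock_release(simple_lock *l, int me)
{
    l->byte[me] = FREE;                   /* a single store; one could also
                                             hand the lock to the next winner
                                             with a single store */
}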

Kent
Chris M. Thomasson
2023-11-16 02:09:28 UTC
Permalink
Post by Kent Dickey
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same kinds of
reasons why some systems allow nested transactions (e.g. when you have
a transaction with two calls to `gensym`: it doesn't matter whether the
two calls really return consecutive symbols (as would be guaranteed if
the code were truly run atomically), all that matters is that those
symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, were
sequentially consistent and needed no barriers. It was not a problem,
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.
Since the original HP PA-RISC MP designs were sequentially consistent, all
the implementations afterward kept it up since no one wanted any existing
code to break. The architects defined the architecture to allow weak
ordering, but no implementation (by HP at least) did so. These
architects then went on to IA-64, where it really is weak, since IA64 was
in order, so it has more of a payoff there since IA64 didn't want to spend
hardware on this (they had 50 other things to waste it on), and IA-64 is
full of bad ideas and ideas just not implemented well.
Sun was TSO, which is weaker. [...]
Sun had RMO mode as well. Avoiding #StoreLoad membars was a must: only use
them when you absolutely have to.
Michael S
2023-11-18 19:10:47 UTC
Permalink
On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Post by Kent Dickey
Post by Stefan Monnier
Post by Chris M. Thomasson
Are you thinking about a magic arch that we cannot use right now?
Yes, he is, obviously.
So when you say "it's bad", please tell us why.
We know it would run slow on existing CPUs, that's not the question.
The question is: why would it be impossible or very hard to
make a CPU that could execute such code efficiently.
I suspect there can be a very valid reasons, maybe for the same
kinds of reasons why some systems allow nested transactions (e.g.
when you have a transaction with two calls to `gensym`: it doesn't
matter whether the two calls really return consecutive symbols (as
would be guaranteed if the code were truly run atomically), all that
matters is that those symbols are unique).
So maybe with sequential consistency, there could be some forms of
parallelism which we'd completely disallow, whereas a weaker form of
consistency would allow it. I'm having a hard time imagining what it
could be, tho.
BTW, IIRC, SGI's MIPS was defined to offer sequential consistency on
their big supercomputers, no?
Stefan
Every HP PA-RISC system, from workstation to the largest server, were
sequentially consistent and needed no barriers. It was not a problem,
I never thought it was even considered any sort of performance issue.
Once you decide to support it, you just throw some hardware at it,
and you're done.
Since the original HP PA-RISC MP designs were sequentially
consistent, all the implementations afterward kept it up since no one
wanted any existing code to break. The architects defined the
architecture to allow weak ordering, but no implementation (by HP at
least) did so. These architects then went on to IA-64, where it
really is weak, since IA64 was in order, so it has more of a payoff
there since IA64 didn't want to spend hardware on this (they had 50
other things to waste it on), and IA-64 is full of bad ideas and
ideas just not implemented well.
Sun was TSO, which is weaker. Sun was never a performance champion,
other than by throwing the most cores at a parallel problem. So being
TSO relative to Sequential Consistency didn't seem to buy Sun much.
DEC Alpha was the poster child of weakly ordered (so weak they didn't
maintain single CPU consistent ordering with itself) and it was often
a performance champion on tech workloads, but that had more to do
with their much higher clock frequencies, and that edge went away
once out-of-order took off. DEC Alpha was never a big player in
TPC-C workloads, where everybody made their money in the 90's.
Technical computing is nice and fun, but there was not a lot of
profit in it compared to business workloads in the 90s. IA64's
reason to exist was to run SGI/SUN/DEC out of business, and it
effectively did (while hurting HP about as much).
I don't know if SGI was sequentially consistent.
AFAIK, Stefan Monnier is correct. SGI MIPS gear was SC.
Their later gear (Itanium and x86) were not, because of underlying
cores.
Post by Kent Dickey
It's possible, since
software developed for other systems might have pushed them to
support it, but the academic RISC folks were pretty big on weakly
ordered.
A problem with weakly ordered is no implementation is THAT weakly
ordered. To maximize doing things in a bad order requires "unlucky"
cache misses, and these are just not common in practice. So "Store
A; Store B" often appears to be done in that order with no barriers
on weakly ordered systems. It's hard to feel confident you've
written anything complex right, so most algorithms are kept
relatively simple to make it more likely they are well tested.
HP PA-RISC had poor atomic support, so HP-UX used a simple spin-lock
using just load and store. I forget the details, but it was something
like this: Each of 4 CPUs got one byte in a 4-byte word. First, set
your byte to "iwant", then read the word. If everyone else is 0, then
set your byte to "iwon", then read the word. If everyone else is
still 0, you've won, do what you want. And if you see other bytes
getting set, then you move to a backoff algorithm to determine the
winner (you have 256 states you can move through). Note that when
you release the lock, you can pick the next winner with a single
store. What I understood was this was actually faster than a simple
compare-and-swap since it let software immediately see if there was
contention, and move to a back-off algorithm right away (and you can
see who's contending, and deal with that as well). Spinlocks tend to
lead to cache invalidation storms, and it's hard to tune, but this
was much more tuneable. It scaled to 64-CPUs by doing the lock in
two steps, and moving to a 64-bit word.
Kent
I don't see how a high-end SC core can match the performance of a
high-end TSO-or-weaker core; the weak point of SC would be single-thread
performance in the case where code encounters a few L1D misses intermixed
with plenty of L1D hits.
Of course, I don't know all the tricks that high-end core designers are
using today, or even 10% of their tricks. So maybe what I consider
impossible is in fact possible.
But as a matter of fact nobody has designed a brand new high-end SC core
in this century. The closest were probably the MIPS cores from Scott
Lurndal's employer Cavium, but I think they were always at least a factor
of 1.5x slower than the contemporary state of the art in single-thread
performance, and more commonly a factor of 2+.
MitchAlsup
2023-11-19 19:38:47 UTC
Permalink
Post by Michael S
On Thu, 16 Nov 2023 01:41:56 -0000 (UTC)
Kent
I don't see how high-end SC core can match performance of high-end
TSO-or-weaker core with weak point of SC being single-thread
performance in the case where code encounters few L1D misses intermixed
with plenty of L1D hits.
They cannot--but does it really matter ??

Back in 1993 I ran an experiment on the Mc 88120 where we did not consider
a memory reference performed until we got a signal back from one of the
multiple DRAM controllers (compared to fire and forget), and on big
complicated applications the loss in performance was about 2%--2% perf
loss allowed one to be in a position to recover from an ECC failure in
the transport of write data to DRAM.

This was still not TSO but it is a relevant data point.
Post by Michael S
Of course, I don't know all tricks that high-end core designers are
using today or even 10% of their tricks. So, may be, what I consider
impossible is in fact possible.
But as a matter of fact nobody designed brand new high-end SC core in
these century. The closest were probably MIPS cores from Scott
Lurndal's employee Cavium, but I think they were always at least factor
of 1.5x slower than contemporary state of the art in single-thread
performance and more commonly factor of 2+.
Right now, nobody is competitive with x86-64 at the high end. You are
going to need a 5GHz core operating at 4 instructions per cycle to be
within spitting distance.
Thomas Koenig
2023-11-19 21:17:20 UTC
Permalink
Post by MitchAlsup
Right now, nobody is competitive with x86-64 at the high end. You are
going to need a 5GHz core operating 4 instructions per cycle to be
within spitting distance.
Power10 isn't doing badly at 4 GHz if I read the SPECint values
right, but it (very probably) cannot compete on performance per
currency unit.
Michael S
2023-11-19 22:10:53 UTC
Permalink
On Sun, 19 Nov 2023 21:17:20 -0000 (UTC)
Post by Thomas Koenig
Post by MitchAlsup
Right now, nobody is competitive with x86-64 at the high end. You
are going to need a 5GHz core operating 4 instructions per cycle to
be within spitting distance.
Power10 isn't doing badly at 4 GHz if I read the SPECint values
right, but it (very probably) cannot compete on performance per
currency unit.
You do not read the SPECint values right.
For starters, there are no POWER "speed" submissions at all. All
published numbers are embarrassingly parallel "rate" scores, where since
POWER7 POWER has had the huge advantage of 8 hardware threads vs all
competitors except Oracle having only 1 or 2 threads.
Since POWER8 IBM applies another trick of calling what is essentially 2
cores a single core.
My uneducated guess is that in single-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake, i.e.
below the fastest Intel, AMD and Apple chips from 2021.
Even more so relative to 2023.
And by 2023Q4, apart from those three mentioned above, we have Qualcomm
near the top, and even ARM Inc (Cortex X3), while behind the other four,
is probably ahead of IBM.
Anton Ertl
2023-11-20 07:34:09 UTC
Permalink
Post by Michael S
My uneducated guess is that in sigle-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake
Slightly more educated:

gforth-fast onebench.fs (development version of Gforth), lower is better:

sieve bubble matrix fib fft version; machine
0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1

Another benchmark I can run quickly is the LaTeX benchmark
<http://www.complang.tuwien.ac.at/anton/latex-bench/>:

On the Power10:

487.14 msec task-clock # 0.998 CPUs utilized
3 context-switches # 6.158 /sec
1 cpu-migrations # 2.053 /sec
590 page-faults # 1.211 K/sec
1897813538 cycles # 3.896 GHz
4411333480 instructions # 2.32 insn per cycle
542280135 branches # 1.113 G/sec
11170586 branch-misses # 2.06% of all branches

0.488029554 seconds time elapsed

0.477731000 seconds user
0.010164000 seconds sys

On the Core i5-6600K (Skylake):

396.04 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
7091 page-faults # 0.018 M/sec
1584116718 cycles # 4.000 GHz
3015599019 instructions # 1.90 insn per cycle
564834472 branches # 1426.217 M/sec
8230928 branch-misses # 1.46% of all branches

0.396422458 seconds time elapsed

0.380444000 seconds user
0.016018000 seconds sys

One disadvantage of this LaTeX benchmark is that how much work it has to
do depends on the LaTeX version and on the installed packages.
In the present case both systems use TeX Live 2020, so that should
make no difference. The Skylake has a lot of packages installed and
executes more instructions for the LaTeX benchmark than any other
AMD64 system where I measured the executed instructions. I do not
know how many packages are installed on the Power10 system.

Comparing the number of branches is interesting. If we assume that
compiler transformations like if-conversion and loop unrolling do
*not* cause a significant difference in branches, the similar number
of branches suggests that the workload is comparable.

In any case, both results indicate that the 3900MHz Power10 has a
lower single-thread performance than a 4GHz Skylake. So you buy a
Power10 if you want to process large numbers of threads or processes
and want to put in lots of RAM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2023-11-20 14:40:05 UTC
Permalink
On Mon, 20 Nov 2023 07:34:09 GMT
Post by Anton Ertl
Post by Michael S
My uneducated guess is that in sigle-threaded integer performance
POWER10 should be approximately on par with the best Intel Skylake
sieve bubble matrix fib fft version; machine
0.075 0.099 0.042 0.112 0.033 20231116; Power10 3900MHz; gcc-11.4.1
0.061 0.090 0.019 0.056 0.021 20231116; Core i5-6600K 4000MHz; gcc-10.1
Another benchmark I can run quickly is the LaTeX benchmark
487.14 msec task-clock # 0.998 CPUs utilized
3 context-switches # 6.158 /sec
1 cpu-migrations # 2.053 /sec
590 page-faults # 1.211 K/sec
1897813538 cycles # 3.896 GHz
4411333480 instructions # 2.32 insn per cycle
542280135 branches # 1.113 G/sec
11170586 branch-misses # 2.06% of all branches
0.488029554 seconds time elapsed
0.477731000 seconds user
0.010164000 seconds sys
396.04 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.005 K/sec
0 cpu-migrations # 0.000 K/sec
7091 page-faults # 0.018 M/sec
1584116718 cycles # 4.000 GHz
3015599019 instructions # 1.90 insn per cycle
564834472 branches # 1426.217 M/sec
8230928 branch-misses # 1.46% of all branches
0.396422458 seconds time elapsed
0.380444000 seconds user
0.016018000 seconds sys
One disadvantage of this LaTeX benchmark is that it depends on the
LaTeX version, and on the installed packages how much it has to do.
In the present case both systems use Tex Live 2020, so that should
make no difference. The Skylake has a lot of packages installed and
executes more instructions for the LaTeX benchmark than any other
AMD64 system where I measured the executed instructions. I do not
know how many packages are installed on the Power10 system.
Comparing the number of branches is interesting. If we assume that
compiler transformations like if-conversion and loop unrolling do
*not* cause a significant difference in branches, the similar number
of branches suggests that the workload is comparable.
In any case, both results indicate that the 3900MHz Power10 has a
lower single-thread performance than a 4GHz Skylake. So you buy a
Power10 if you want to process large numbers of threads or processes
and want to put in lots of RAM.
- anton
That's interesting, thank you.

Don't your benchmarks have rather small datasets?
If so, that would put the likes of POWER10 (or of Skylake-SP/Skylake-X)
with their big, slow L2 caches at a disadvantage relative to Skylake
Client, which has a small, fast L2 cache.

But if your numbers are representative, then POWER10 can be matched by
rather ancient Intel CPUs, e.g. by the 9.5 y.o. i7-4790. Even my Xeon
E3-1271 v3 would stand a chance. Or by AMD Zen1.
Anton Ertl
2023-11-20 15:50:08 UTC
Permalink
Post by Michael S
Do not your benchmarks have rather small datasets?
If true, it will put likes of POWER10 (or of Skylake-SP/Skylake-X)
with their big, slow L2 caches at disadvantage relatively to Skylake
Client that has small fast L2 cache.
These benchmarks don't miss L1 much, so the size and latency of the L2
have little influence. For the LaTeX benchmark:

Performance counter stats for 'latex bench':

2963866155 instructions:u
874909912 L1-dcache-loads
13846145 L1-dcache-load-misses #1.58% of all L1-dcache accesses
2754917 LLC-loads
227951 LLC-stores

0.394554268 seconds time elapsed

0.370677000 seconds user
0.023914000 seconds sys

Interestingly, ~20% of the D-cache load misses become LLC loads
(i.e. L2 load misses), so a larger L2 may have an advantage. Let's
see on a Tiger Lake:

2955784200 instructions:u
864268345 L1-dcache-loads
11751007 L1-dcache-load-misses #1.36% of all L1-dcache accesses
393018 LLC-loads
13868 LLC-stores

0.306192273 seconds time elapsed

0.289893000 seconds user
0.016105000 seconds sys

So, going from 256KB L2 to 1280KB L2 reduces the number of L2 load
misses by a factor of 7 on this benchmark (well, there is TeX Live
2022 and possibly a different set of packages on the Tiger Lake
machine, so there is also the question of comparability).

The results I see on the Power10 machine are strange:

Performance counter stats for 'latex bench':

9622223 L1-dcache-load-misses
9847374 LLC-loads

0.476681995 seconds time elapsed

0.466409000 seconds user
0.010139000 seconds sys

Maybe prefetches are counted as LLC-loads.

For the small gforth benchmarks on Skylake:

2205651226 instructions:u
793549469 L1-dcache-loads
4115040 L1-dcache-load-misses #0.52% of all L1-dcache accesses
852553 LLC-loads
26356 LLC-stores

0.265190457 seconds time elapsed

0.260044000 seconds user
0.004036000 seconds sys

An even smaller D-cache miss rate, and a similar 20% L2 miss rate
(which constitutes an even smaller proportion of execution time).
Post by Michael S
But if your numbers are representative then POWER10 can be matched by
rather ancient Intel CPUs. E.g. by 9.5 y.o. i7-4790. Even my Xeon
E3-1271 v3 would stay a chance. Or by AMD Zen1.
Certainly on these benchmarks. There may be applications that benefit
from what Power10 has to offer, but my guess is that for most
applications, you are better off with a somewhat recent Intel or AMD
CPU.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Chris M. Thomasson
2023-11-19 03:44:50 UTC
Permalink
On 11/15/2023 5:41 PM, Kent Dickey wrote:
[...]

Btw, thank you for Kegs! Excellent.
Chris M. Thomasson
2023-11-03 22:29:32 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]

Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
Anton Ertl
2023-11-04 17:40:57 UTC
Permalink
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.

You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
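To illustrate what "data-dependent loads" usually refers to: pointer
publication, sketched here with C11 atomics (an illustrative sketch, not
part of the microbenchmark). On most weakly ordered machines the
consumer's load of p->payload is ordered after the load of the pointer by
the data dependency alone; Alpha's model is weak enough that even this
case needs a barrier, which is what memory_order_consume was meant to
capture (compilers promote it to acquire in practice).

#include <stdatomic.h>

struct node { int payload; };

static _Atomic(struct node *) head;

void publish(struct node *n)
{
    n->payload = 42;
    atomic_store_explicit(&head, n, memory_order_release);
}

int consume(void)
{
    struct node *p = atomic_load_explicit(&head, memory_order_acquire);
    /* memory_order_consume would express just the data dependency */
    return p ? p->payload : -1;
}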

@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
BGB
2023-11-04 19:48:04 UTC
Permalink
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and made
them undefined behavior...

However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I can
tell that code steps on these scenarios very often (particularly in
prolog sequences).


This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to loads in cases where one is accessing a "just
accessed" cache line.

Disabling this forwarding (and forcing a stall instead) tends to have a
more noticeable adverse effect on performance (but is often needed to
try to pass timing at higher clock speeds).

Granted, I guess it is possible I could fiddle with the compiler to try
to improve this situation, say:
Only use MOV.X with a 16B alignment;
Stagger even/odd stores in pairs when possible.

Say, as opposed to:
MOV.X R12, (SP, 160)
MOV.X R10, (SP, 144)
MOV.X R8, (SP, 128)
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)
Say:
MOV.X R10, (SP, 144)
MOV.X R12, (SP, 160)
MOV.X R30, (SP, 112)
MOV.X R8, (SP, 128)
MOV.X R26, (SP, 80)
MOV.X R28, (SP, 96)
MOV.X R24, (SP, 64)

Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent stores
within the same 32 bytes).


But, I sort of ended my 75MHz experiment for now, and fell back to
50MHz, where I can more easily afford to have this forwarding (in which
case the above becomes mostly N/A).


As for cache-coherence between cores, this basically still does not
exist yet.

It also seems like it would require a bit more ringbus traffic to pull
off, say:
If a core wants to write to a line, it needs to flag its intention to do so;
If a line was previously fetched for read, but is now being written, the
line would need to be flushed and re-fetched with the new intention;
If a line is being fetched for write, we somehow need to signal it to be
flushed from whatever L1 cache was last holding it;
...



But, the alternative (what I currently have) effectively disallows
traditional forms of (writable) memory sharing between threads (and
"volatile" memory access seems to be pretty slow at present). So, for
multithreaded code, one would need to have the threads work mostly
independently (without any shared mutable structures), and then trigger
an L1 flush when each is done with its work (and the receiving thread also
needs to trigger an L1 flush).

Granted, something similar ended up happening with the rasterizer module in
TKRA-GL, where cache-flushing is needed whenever handing off the
framebuffer between the main CPU and the rasterizer module (otherwise,
graphical garbage and incompletely drawn geometry may result); and one
also needs to flush the L1 cache and trigger the rasterizer module to
flush its texture cache whenever uploading a new texture.

And, as-is, texture uploading has ended up being horridly slow. So, even
if I can (sort of) afford to run GLQuake with lightmaps now, dynamic
lightmaps need to be disabled as it is too slow.

Though, an intermediate possibility could be to store lightmaps with
dynamic lights in (only) one of 2 states, with two different textures
(Dynamic light ON/Max, Dynamic light OFF/Min). However, this would
effectively disallow effects like "slow pulsate" (which could only
alternate between fully-on and fully-off), as well as updates from
dynamic light sources like rockets (where, as-is, firing a rocket with
dynamic lightmaps eats the CPU something hard).

Not entirely sure how something like a Pentium-1 managed all this either.

Granted, faster in this case would still be to do everything with vertex
lighting (as is currently still the default in my GLQuake port), albeit
not as good, and a bit of a hack, as Quake wasn't really
designed for vertex lighting.
Post by Anton Ertl
@TechReport{adve&gharachorloo95,
author = {Sarita V. Adve and Kourosh Gharachorloo},
title = {Shared Memory Consistency Models: A Tutorial},
institution = {Digital Western Research Lab},
year = {1995},
type = {WRL Research Report},
number = {95/7},
annote = {Gives an overview of architectural features of
shared-memory computers such as independent memory
banks and per-CPU caches, and how they make the (for
programmers) most natural consistency model hard to
implement, giving examples of programs that can fail
with weaker consistency models. It then discusses
several categories of weaker consistency models and
actual consistency models in these categories, and
which ``safety net'' (e.g., memory barrier
instructions) programmers need to use to work around
the deficiencies of these models. While the authors
recognize that programmers find it difficult to use
these safety nets correctly and efficiently, it
still advocates weaker consistency models, claiming
that sequential consistency is too inefficient, by
outlining an inefficient implementation (which is of
course no proof that no efficient implementation
exists). Still the paper is a good introduction to
the issues involved.}
}
- anton
MitchAlsup
2023-11-10 01:36:31 UTC
Permalink
Post by BGB
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of. This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and made
them undefined behavior...
<
Intractable in practice.
<
What you CAN do is dynamically solve this problem in HW, by constructing
a memory dependency matrix and not allowing a memory reference to complete
unless it is safe to do so.
<
Post by BGB
However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I can
tell that code steps on these scenarios very often (particularly in
prolog sequences).
<
It only sucks when there is a REAL dependency, and saves your neck every time
a dependency is not found (mid-to-high 90%-ile).
<
Post by BGB
This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to load in cases where one is accessing a "just
accessed" cache line.
<
I use a MDM (above) attached to a temporal cache.
<
When an instruction is inserted in the MDM/TC==CC, it is made dependent on all
instructions that are already in the CC that have not completed, and is installed
in the Unknown Address state.
<
When an address is AGENed, the address is sent to the data cache and a portion
is sent to the CC and that portion CAMs against all the other addresses. The
result of the comparison is either "Is the same line" or "Can't be the same line"
and the state changes to Known Address.
<
While the compares are taking place, the MDM is relaxed: any entry that "can't
be the same line" is removed as a dependency. When all dependencies have been
removed, the instruction can complete.
<
If either CC or DC returns with data, the state advances to Known Data. Stores
with Known data are allowed to complete into CC. Modified CC data migrates back
to DC when the ST instruction becomes retireable.
<
At this point the ST has completed as seen by the CPU, but has not completed
as seen by "interested Observers", and is subject to prediction repair, AGEN
replay, and those rare situations where one changes the memory ordering model
{ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its
done" while the external world can think "It has not happened yet".
<
Thus, memory references that alias on a cache line basis are performed in order,
while those that are not run independently.
<
The conditional (temporal) cache holds the address and <at least a portion> of
the associated cache line. Any reference can hit on that data in CC even if it
is port or bank blocked in the DC. CC hits a bit over 50% of the time, effectively
reducing cache port/bank conflicts for lines that have been touched within the
reach of the temporal cache.
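A rough software model of that bookkeeping (illustrative only, with an
invented 16-entry window and 64-byte lines; not the actual hardware):

#include <stdbool.h>
#include <stdint.h>

#define WINDOW     16
#define LINE_SHIFT 6

typedef struct {
    bool     valid, addr_known, completed;
    uint64_t line;              /* address >> LINE_SHIFT, once AGENed */
    uint32_t dep;               /* bitmask of entries this one waits on */
} mem_op;

static mem_op win[WINDOW];

/* Insert: initially dependent on every entry already in the window
   that has not completed (Unknown Address state). */
void mdm_insert(int i)
{
    uint32_t dep = 0;
    for (int j = 0; j < WINDOW; j++)
        if (j != i && win[j].valid && !win[j].completed)
            dep |= 1u << j;
    win[i] = (mem_op){ .valid = true, .dep = dep };
}

/* AGEN: CAM the new line address against the others and relax the
   matrix for every pair that "can't be the same line". */
void mdm_agen(int i, uint64_t addr)
{
    win[i].line = addr >> LINE_SHIFT;
    win[i].addr_known = true;
    for (int j = 0; j < WINDOW; j++) {
        if (j == i || !win[j].valid || !win[j].addr_known) continue;
        if (win[j].line != win[i].line) {
            win[i].dep &= ~(1u << j);     /* i no longer waits on j */
            win[j].dep &= ~(1u << i);     /* and vice versa */
        }
    }
}

/* An op may complete only when its dependency mask is empty;
   completing it releases the entries that were waiting on it. */
bool mdm_try_complete(int i)
{
    if (!win[i].valid || win[i].completed || win[i].dep != 0)
        return false;
    win[i].completed = true;
    for (int j = 0; j < WINDOW; j++)
        win[j].dep &= ~(1u << i);
    return true;
}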
<
Post by BGB
Disabling this forwarding (and forcing a stall instead), tends to have a
more noticeable adverse effect on performance (but is often needed to
try to pass timing at higher clock speeds).
Granted, I guess it is possible I could fiddle with the compiler to try
Only use MOV.X with a 16B alignment;
Stagger even/odd stores in pairs when possible.
MOV.X R12, (SP, 160)
MOV.X R10, (SP, 144)
MOV.X R8, (SP, 128)
MOV.X R30, (SP, 112)
MOV.X R28, (SP, 96)
MOV.X R26, (SP, 80)
MOV.X R24, (SP, 64)
MOV.X R10, (SP, 144)
MOV.X R12, (SP, 160)
MOV.X R30, (SP, 112)
MOV.X R8, (SP, 128)
MOV.X R26, (SP, 80)
MOV.X R28, (SP, 96)
MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
in ISA::
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Post by BGB
Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent stores
within the same 32 bytes).
<
Solved by CC in my case.
<
BGB
2023-11-10 04:10:54 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by Anton Ertl
Post by Chris M. Thomasson
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures.  You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
[...]
Is the only arch out there that does not require an explicit memory
barrier for data-dependent loads a DEC Alpha? I think so.
I don't know any architecture that requires memory barriers for
single-threaded programs that access just memory, not even Alpha.
You may be thinking of the memory consistency model of Alpha, which is
even weaker than everything else I know of.  This is not surprising,
given that a prominent advocacy paper for weak consistency
[adve&gharachorloo95] came out of DEC.
Hmm, one could maybe save some logic and also make timing a little
easier if they disallowed RAW and WAW (in the same cache line), and
made them undefined behavior...
<
Intractable in practice.
<
What you CAN do is to dynamically solve this problem in HW. By constructing
a memory dependency matrix and not allowing a memory reference to complete
unless it is safe to do so.
<
Probably true.

My options were one of:
Forward results from an in-progress Store back to a following Load/Store
(more expensive, but faster);
Stall the pipeline until the prior Store can complete (cheaper but slower).


Granted, the "not bother" option could be cheaper still, but carries an
unreasonable level of risk (and in practice would likely mean that any
store through a free pointer followed by another memory access would end
up needing to use NOP padding or similar).

Or, handling it in the CPU, by generating an interlock stall whenever a
Store is followed by another memory operation. But, this would probably
be the worst possible option in terms of performance...
Post by MitchAlsup
Post by BGB
However, this would suck for software (and would effectively either
mandate strict stack alignment to keep prologs/epilogs working, or
require saving registers in a convoluted order). And, in my case, I
can tell that code steps on these scenarios very often (particularly
in prolog sequences).
<
It only suck when there is a REAL dependency and saves you neck every time
a dependency is not found (mid-to-high 90%-ile).
<
OK.
Post by MitchAlsup
Post by BGB
This is why I bother with a "somewhat expensive" feature of forwarding
stored cache lines back to load in cases where one is accessing a
"just accessed" cache line.
<
I use a MDM (above) attached to a temporal cache. <
When an instruction is inserted in the MDM/TC==CC, it is made dependent on all
instructions that are already in the CC that have not completed. and installed
in the Unknown Address state.
<
When an address is AGENed, the address is sent to the data cache and a portion
is sent to the CC and that portion CAMs against all the other addresses. The
result of the comparison is either "Is the same line" or "Can't be the same line"
and the state changes to Known Address.
<
While the compares are taking place, the MDM is relaxed. And entry that "can't
be the same as" is removed as a dependency. When all dependencies have been
removed, the instruction can complete.
<
If either CC or DC returns with data, the state advances to Known Data. Stores
with Known data are allowed to complete into CC. Modified CC data migrates back
to DC when the ST instruction becomes retireable. <
At this point the ST has completed as seen by the CPU, but has not completed
as seen by "interested Observers", and is subject to prediction repair, AGEN
replay, and those rare situations where one changes the memory ordering model
{ATOMICs, MMI/O, Config space}. This is the key point--the CPU can think "its
done" while the external world can think "It has not happened yet".
<
Thus, memory references that alias on a cache line basis are performed in order,
while those that are not run independently.
<
The conditional (temporal) cache holds the address and <at least a portion> of
the associated cache line. Any reference can hit on that data in CC even if it
is port or bank blocked in the DC. CC hits a bit over 50% of the time, effectively
reducing cache port/bank conflicts that have been touched in the time of the
temporal cache.
<
In my pipeline, it is simpler...

Just, the L1 cache sees that the next stage holds the results of a store
to the same cache line, and either forwards the result or generates a
stall until the store completes.

Where, forwarding is faster, but stalling is cheaper.
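Roughly, the decision looks like this (a toy C model; all names are
invented for illustration):

#include <stdbool.h>
#include <stdint.h>

#define LINE_SHIFT 5                       /* assume 32-byte lines */

typedef struct {
    bool     store_pending;                /* stage ahead still holds a store */
    uint64_t store_line;                   /* line address of that store */
} next_stage;

typedef enum { USE_CACHE, USE_FORWARD, STALL } load_action;

load_action resolve_load(const next_stage *ns, uint64_t load_addr,
                         bool forwarding_enabled)
{
    if (!ns->store_pending ||
        (load_addr >> LINE_SHIFT) != ns->store_line)
        return USE_CACHE;                  /* no conflict with in-flight store */
    return forwarding_enabled ? USE_FORWARD  /* faster, costs logic/timing */
                              : STALL;       /* cheaper, wait for the store */
}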
Post by MitchAlsup
Post by BGB
Disabling this forwarding (and forcing a stall instead), tends to have
a more noticeable adverse effect on performance (but is often needed
to try to pass timing at higher clock speeds).
Granted, I guess it is possible I could fiddle with the compiler to
   Only use MOV.X with a 16B alignment;
   Stagger even/odd stores in pairs when possible.
   MOV.X R12, (SP, 160)
   MOV.X R10, (SP, 144)
   MOV.X R8,  (SP, 128)
   MOV.X R30, (SP, 112)
   MOV.X R28, (SP, 96)
   MOV.X R26, (SP, 80)
   MOV.X R24, (SP, 64)
   MOV.X R10, (SP, 144)
   MOV.X R12, (SP, 160)
   MOV.X R30, (SP, 112)
   MOV.X R8,  (SP, 128)
   MOV.X R26, (SP, 80)
   MOV.X R28, (SP, 96)
   MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Still haven't gotten around to this, but it could potentially help
performance with the cheaper cache options (namely, configurations
without the internal forwarding).

But, yeah, the compiler does still need to deal with this sometimes.
Post by MitchAlsup
Post by BGB
Say, to try to avoid two adjacent stores within the same 32-byte
paired-line (for 64-bit load, one would try to avoid two adjacent
stores within the same 32 bytes).
<
Solved by CC in my case.
OK.
Post by MitchAlsup
<
MitchAlsup
2023-11-10 23:31:20 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
Granted, I guess it is possible I could fiddle with the compiler to
   Only use MOV.X with a 16B alignment;
   Stagger even/odd stores in pairs when possible.
   MOV.X R12, (SP, 160)
   MOV.X R10, (SP, 144)
   MOV.X R8,  (SP, 128)
   MOV.X R30, (SP, 112)
   MOV.X R28, (SP, 96)
   MOV.X R26, (SP, 80)
   MOV.X R24, (SP, 64)
   MOV.X R10, (SP, 144)
   MOV.X R12, (SP, 160)
   MOV.X R30, (SP, 112)
   MOV.X R8,  (SP, 128)
   MOV.X R26, (SP, 80)
   MOV.X R28, (SP, 96)
   MOV.X R24, (SP, 64)
<
I will use this as an example as to Why you want save/restore instructions
a) so the compiler does not need to deal with ordering problems
b) so fewer instructions are produced.
Still haven't gotten around to this, but it could potentially help
performance with the cheaper cache options (namely, configurations
without the internal forwarding).
<
Imagine a scenario where you are returning from one function call only
to call another function*. Since EXIT performs all the state reload
(and the RET), the return IP is read first so the front end can
fetch-decode to feed the machine..... (*) possibly moving registers
around to form the argument list.
<
NOW imagine that the front end encounters a CALL and the instruction
at the target of the CALL is ENTER.
a) the restore process can stop-or-complete
b) the save process can mostly be skipped since memory already contains
the correct bit patterns (My 66000 ABI)
<
Saving cycles a compiler cannot.....
<
Post by BGB
But, yeah, the compiler does still need to deal with this sometimes.
MitchAlsup
2023-11-10 01:43:04 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures.
<
Absolutely Brilliant.
<
Well Done, and Thanks.
Anton Ertl
2023-11-15 18:32:40 UTC
Permalink
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
I have now added parameter combinations for detecting something that I
currently call memory renaming: whether the microarchitecture can overlap
independent accesses to the same memory location (analogous to
register renaming).

Does anybody know if there is a commonly-used term for this feature?
I would also be interested in a commonly-used term for bypassing the
store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
Lake).

Concerning memory renaming, it turns out that Intel added it in
Nehalem, and AMD added it between K10 and Excavator (probably in the
Bulldozer). ARM has not implemented this in any microarchitecture I
have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
P-core), but not in Icestorm (M1 E-Core).

One interesting benefit of memory renaming is that when you are bad at
register allocation, and keep local variables in memory (or virtual
machine registers, or virtual machine stack items), independent
computations that use the same local variable still can execute in
parallel. Say, stuff like

a = b[1]; c = a+1;
a = b[2]; d = a-1;

The hardware can execute them in parallel, even if a is located in
memory.
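Spelled out as a compilable fragment (hypothetical; 'a' is forced into
memory here with volatile just for illustration):

void demo(int *b, int *out_c, int *out_d)
{
    volatile int a;            /* keep the local in memory */
    a = b[1]; *out_c = a + 1;  /* first use of the location */
    a = b[2]; *out_d = a - 1;  /* independent reuse of the same location;
                                  memory renaming lets both proceed in
                                  parallel despite the reused address */
}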

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2023-11-15 19:05:42 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
I have written a microbenchmark for measuring how memory dependencies
affect the performance of various microarchitectures. You can find it
along with a description and results on
<http://www.complang.tuwien.ac.at/anton/memdep/>.
I have now added parameter combinations for detecting something that I
currently call memory renaming: If the microarchitecture can overlap
independent accesses to the same memory location (analogous to
register renaming).
Does anybody know if there is a commonly-used term for this feature?
<
I use the term Conditional Cache, I have heard others call it a Memory
Reorder Buffer.
<
Post by Anton Ertl
I would also be interested in a commonly-used term for bypassing the
store buffer in store-to-load forwarding (as we see in Zen3 and Tiger
Lake).
<
Store-to-Load-Forwarding is the only term I know here.
<
Post by Anton Ertl
Concerning memory renaming, it turns out that Intel added it in
Nehalem, and AMD added it between K10 and Excavator (probably in the
Bulldozer). ARM has not implemented this in any microarchitecture I
have measured (up to Cortex-A76), and Apple has it in Firestorm (M1
P-core), but not in Icestorm (M1 E-Core).
One interesting benefit of memory renaming is that when you are bad at
register allocation,
<
Or bad at alias analysis...
<
Post by Anton Ertl
and keep local variables in memory (or virtual
machine registers, or virtual machine stack items), independent
computations that use the same local variable still can execute in
parallel. Say, stuff like
a = b[1]; c = a+1;
a = b[2]; d = a-1;
The hardware can execute them in parallel, even if a is located in
memory.
<
And why not, they are independent containers.
Post by Anton Ertl
- anton