Discussion:
Another security vulnerability
Stephen Fuld
2024-03-24 17:40:18 UTC
Permalink
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/

So, is there a way to fix this while maintaining the feature's
performance advantage?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup1
2024-03-24 18:20:06 UTC
Permalink
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
So, is there a way to fix this while maintaining the feature's
performance advantage?
They COULD start by not putting prefetched data into the cache
until after the predicting instruction retires. {{I have a note
from about 20 months ago where this feature was publicized and
the note indicates a potential side-channel.}}

An alternative is to notice that [*]cryption instructions are
being processed and turn DMP off during those intervals of time.
{Or both}.

Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision timer
makes cache line [dis]placement visible; so either take away the HPT
or don't alter cache visible state too early.
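
To make the side channel concrete, here is a minimal sketch of the
measurement, assuming x86 and the TSC as the high-precision timer
(the helper names and the threshold are illustrative, not from any
particular exploit):

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence */

/* Time a single load; a short latency means the line was already
   cached, i.e. something (a prefetcher?) placed it there. */
static uint64_t time_load(const volatile uint8_t *p)
{
    unsigned aux;
    _mm_lfence();                  /* keep the timer reads around the load */
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                      /* the probed load */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* Illustrative threshold; a real attack calibrates it per machine. */
static int line_is_cached(const volatile uint8_t *p)
{
    return time_load(p) < 80;
}

Remove either ingredient -- the timer or the early change of cache
state -- and the probe reads nothing.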

And we are off to the races, again.....
Stefan Monnier
2024-03-25 16:18:18 UTC
Permalink
Post by MitchAlsup1
Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision timer
makes cache line [dis]placement visible; so either take away the HPT
or don't alter cache visible state too early.
And parallelism (e.g. multicores) can be used to emulate HPT, so "take
away the HPT" is not really an option.
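
A minimal sketch of that emulation, assuming POSIX threads (all names
illustrative): a sibling thread spins on a shared counter, and two reads
of the counter around a probed access give a duration even when rdtsc
or cntvct access is denied.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t ticks;             /* the emulated timer */

static void *counter_thread(void *arg)     /* run on a sibling core */
{
    (void)arg;
    for (;;)                               /* roughly one tick per cycle */
        atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
    return NULL;
}

static inline uint64_t now(void)
{
    return atomic_load_explicit(&ticks, memory_order_relaxed);
}

/* usage sketch:
   pthread_t t; pthread_create(&t, NULL, counter_thread, NULL);
   uint64_t t0 = now(); (void)*probe; uint64_t dt = now() - t0; */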


Stefan
Michael S
2024-03-26 16:57:30 UTC
Permalink
On Mon, 25 Mar 2024 12:18:18 -0400
Post by Stefan Monnier
Post by MitchAlsup1
Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision
timer makes cache line [dis]placement visible; so either take away
the HPT or don't alter cache visible state too early.
And parallelism (e.g. multicores) can be used to emulate HPT, so "take
away the HPT" is not really an option.
Stefan
That's exactly how they measured time in the PoC.
Michael S
2024-03-26 16:56:19 UTC
Permalink
On Sun, 24 Mar 2024 18:20:06 +0000
Post by MitchAlsup1
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
So, is there a way to fix this while maintaining the feature's
performance advantage?
They COULD start by not putting prefetched data into the cache
until after the predicting instruction retires. {{I have a note
from about 20 months ago where this feature was publicized and
the note indicates a potential side-channel.}}
An alternative is to notice that [*]cryption instructions are
being processed and turn DMP off during those intervals of time.
{Or both}.
Their PoC attacks public-key crypto algorithms - RSA-2048, DHKE-2048, and
a couple of exotic new algorithms that nobody uses.
I think neither RSA-2048 nor DHKE-2048 uses any special crypto
instructions.
On Intel/AMD it's likely that these crypto routines use MULX
and ADX instructions much more often than non-crypto code, but that's not
guaranteed.
On ARM64 you don't even have that much, because the equivalents of MULX
and of ADX are part of the base instruction set.
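
For concreteness, the kind of carry-chain inner step meant here, sketched
with the Intel intrinsics (requires BMI2/ADX; the helper itself is
illustrative, not from any particular library):

#include <immintrin.h>   /* _mulx_u64 (BMI2), _addcarry_u64 */

/* acc[0..1] += a * b: one limb step of a schoolbook bignum multiply.
   _mulx_u64 yields the full 128-bit product without touching the
   flags; _addcarry_u64 propagates the carry into the high limb. */
static inline void mul_acc(unsigned long long acc[2],
                           unsigned long long a, unsigned long long b)
{
    unsigned long long hi;
    unsigned long long lo = _mulx_u64(a, b, &hi);
    unsigned char c = _addcarry_u64(0, acc[0], lo, &acc[0]);
    (void)_addcarry_u64(c, acc[1], hi, &acc[1]);
}

Non-crypto code rarely leans on such carry chains, which is the only
(weak) instruction-level signature these routines leave.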
Post by MitchAlsup1
Principle:: an Architecturally visible unit of data can only become
visible after the causing instruction retires. A high precision timer
makes cache line [dis]placement visible; so either take away the HPT
or don't alter cache visible state too early.
And we are off to the races, again.....
Lawrence D'Oliveiro
2024-03-24 21:47:17 UTC
Permalink
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
Even if it’s fixed, how does that help existing users with broken
machines?
Stefan Monnier
2024-03-25 16:26:30 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
Even if it’s fixed, how does that help existing users with broken
machines?
Of course, the current computing world loves it even more if it pushes
users to buy new machines.

This said, so far manufacturers don't seem to consider side-channel
attacks something worth spending much effort to avoid, so I'd expect
future machines to be just as vulnerable :-(


Stefan
Stephen Fuld
2024-03-25 16:50:28 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost, and require some, presumably small, changes to crypto
code.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Lawrence D'Oliveiro
2024-03-25 21:22:35 UTC
Permalink
Post by Stephen Fuld
Post by Lawrence D'Oliveiro
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
MitchAlsup1
2024-03-25 22:17:55 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Stephen Fuld
Post by Lawrence D'Oliveiro
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
Would you like to inform us of how it can be done otherwise ?
Lawrence D'Oliveiro
2024-03-26 00:27:34 UTC
Permalink
Post by MitchAlsup1
Post by Lawrence D'Oliveiro
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
Stephen Fuld
2024-03-27 16:22:12 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by MitchAlsup1
Post by Lawrence D'Oliveiro
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really a
good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
But microcode is generally slower than dedicated hardware, and most
people seem to be unwilling to give up performance all the time to gain
an advantage in a situation that occurs infrequently and mostly never.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Lawrence D'Oliveiro
2024-03-28 00:46:13 UTC
Permalink
Post by Stephen Fuld
Post by Lawrence D'Oliveiro
Post by MitchAlsup1
Post by Lawrence D'Oliveiro
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
But microcode is generally slower than dedicated hardware, and most
people seem to be unwilling to give up performance all the time to gain
an advantage in a situation that occurs infrequently and mostly never.
Bruce Schneier has a saying: “attacks never get worse, they can only get
better”.
MitchAlsup1
2024-03-28 01:22:06 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Stephen Fuld
Post by Lawrence D'Oliveiro
Post by MitchAlsup1
Post by Lawrence D'Oliveiro
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Would you like to inform us of how it can be done otherwise ?
Upgradeable firmware/software, of course.
But microcode is generally slower than dedicated hardware, and most
people seem to be unwilling to give up performance all the time to gain
an advantage in a situation that occurs infrequently and mostly never.
S.E.L. 32/65, 32/67, 32/87 were all microcoded. 95% of the instructions*
ran down the pipeline without using the microcode (which was only there
to pick up the pieces after HW logic sequencers got too complicated).

(*) closer to 97% of the dynamic instruction stream.

Microcode IS generally slower, but not always. PDP-11 or VAX microcode
is too much; IBM's or S.E.L.'s is not too much.
Post by Lawrence D'Oliveiro
Bruce Schneier has a saying: “attacks never get worse, they can only get
better”.
Which is why you have to design for attackers (to be thwarted).

Observing the last 4-odd years, it appears to me that CPU designers will
not be in a position to "give a rat's ass" until there are performance-
competitive, cost-competitive alternatives. Given that Apple, AMD, ARM, and
Intel don't give a rat's ass, it's going to have to come from somewhere
else.
Michael S
2024-03-25 23:34:45 UTC
Permalink
On Mon, 25 Mar 2024 21:22:35 -0000 (UTC)
Post by Lawrence D'Oliveiro
Post by Stephen Fuld
Post by Lawrence D'Oliveiro
Even if it’s fixed, how does that help existing users with broken
machines?
The article gives several suggestions, but they all come at a
performance cost ...
The basic problem is that building all this complex, bug-prone
functionality into monolithic, nonupgradeable hardware is not really
a good idea.
Maybe not a good idea. But the best.
Thomas Koenig
2024-03-25 06:37:31 UTC
Permalink
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
It's Groundhog Day, all over again!
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
From what is written in the article, nothing is currently known.

For new silicon, people could finally implement Mitch's suggestion
of not committing speculative state before the instruction retires.
(It would be interesting to see how much area and power this
would cost with the hundreds of instructions in flight with
modern micro-architectures).

For existing silicon - run crypto on efficiency cores, or just
make sure not to run untrusted code on your machine :-(
Anton Ertl
2024-03-25 08:37:51 UTC
Permalink
Post by Thomas Koenig
For existing silicon - run crypto on efficiency cores
That doesn't help against the recent vulnerability that affects Intel's
efficiency cores.
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-03-25 17:07:16 UTC
Permalink
Post by Anton Ertl
Post by Thomas Koenig
For existing silicon - run crypto on efficiency cores
Not the recent vulnerability that affects Intel's efficiency cores.
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
Lawrence D'Oliveiro
2024-03-26 02:32:46 UTC
Permalink
Post by Scott Lurndal
Run it in non-cacheable memory. Slow but safe.
But 99% of the performance speedups of the last 20-30 years have involved
caching of some kind.
Scott Lurndal
2024-03-26 14:12:09 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Scott Lurndal
Run it in non-cacheable memory. Slow but safe.
But 99% of the performance speedups of the last 20-30 years have involved
caching of some kind.
So what? Running the crypto algorithms (when not offloaded to
on-chip accelerators) using non-cacheable memory as a workaround
until the hardware issues are ameliorated doesn't imply that
all other code needs to run non-cacheable.
Anton Ertl
2024-03-26 16:36:26 UTC
Permalink
Post by Scott Lurndal
Post by Scott Lurndal
Run it in non-cacheable memory. Slow but safe.
...
Post by Scott Lurndal
Running the crypto algorithms (when not offloaded to
on-chip accelerators) using non-cacheable memory as a workaround
until the hardware issues are ameliorated doesn't imply that
all other code needs to run non-cacheable.
Then your crypto code is slow and unsafe. The attacker will use the
rest of the application to extract the crypto keys, whether through
the cache side-channel of Spectre, or a prefetcher-based side channel.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Anton Ertl
2024-03-26 09:18:36 UTC
Permalink
Post by Scott Lurndal
Post by Anton Ertl
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
To eliminate this particular vulnerability, it's sufficient to disable
the data-dependent prefetcher.

I wonder where these ideas are coming from that call for the worst
possible fix for a vulnerability?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-03-26 14:12:39 UTC
Permalink
Post by Anton Ertl
Post by Scott Lurndal
Post by Anton Ertl
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
To eliminate this particular vulnerability, it's sufficient to disable
the data-dependent prefetcher.
That assumes that chicken bit(s) are available to do that.
Anton Ertl
2024-03-26 16:40:38 UTC
Permalink
Post by Scott Lurndal
Post by Anton Ertl
Post by Scott Lurndal
Post by Anton Ertl
Also, if the prefetcher works with data in a shared cache (I don't
know whether the data-dependent prefetchers do that), it may not
matter on which core the code runs.
Run it in non-cacheable memory. Slow but safe.
To eliminate this particular vulnerability, it's sufficient to disable
the data-dependent prefetcher.
That assumes that chicken bit(s) are available to do that.
The hardware designers have put in the chicken bit(s); it's highly
unlikely that they have unconditionally enabled the data-dependent
prefetcher on M1 and M2, and only added a chicken bit on M3. Now that
the hardware indeed turns out to be broken, they just need to activate
it/them.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-03-25 11:27:32 UTC
Permalink
On Mon, 25 Mar 2024 06:37:31 -0000 (UTC)
Post by Thomas Koenig
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
It's Groundhog Day, all over again!
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
From what is written in the article, nothing is currently known.
For new silicon, people could finally implement Mitch's suggestion
of not committing speculative state before the instruction retires.
(It would be interesting to see how much area and power this
would cost with the hundreds of instructions in flight with
modern micro-architectures).
For existing silicon - run crypto on efficiency cores, or just
make sure not to run untrusted code on your machine :-(
I would trust your second solution more than the first one.
Scott Lurndal
2024-03-25 17:06:35 UTC
Permalink
Post by Thomas Koenig
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
It's Groundhog Day, all over again!
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
From what is written in the article, nothing is currently known.
For new silicon, people could finally implement Mitch's suggestion
of not committing speculative state before the instruction retires.
To be fair, that's been suggested for quite a few years by
various semiconductor designers. The difficulties lie in
efficient implementation thereof.
Post by Thomas Koenig
(It would be interesting to see how much area and power this
would cost with the hundreds of instructions in flight with
modern micro-architectures).
Yes.
Post by Thomas Koenig
For existing silicon - run crypto on efficiency cores, or just
make sure not to run untrusted code on your machine :-(
Anton Ertl
2024-03-25 07:25:34 UTC
Permalink
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
That's a pretty bad article, but at least one can read it without
JavaScript, unlike the web page of the vulnerability
<https://gofetch.fail/>.

Classically hardware prefetchers have based their predictions on the
addresses of earlier accesses. If they base their predictions only on
architectural accesses and (by coding the cryptographic code
appropriately) don't have information about the secrets in the
addresses, the prefetcher cannot reveal these secrets through a side
channel. Cryptographic code has been written that way for quite a
while.
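
A sketch of what "written that way" means, with an illustrative helper
(not from any particular library): replace a secret-indexed table lookup
with a scan whose address stream is independent of the secret.

#include <stdint.h>
#include <stddef.h>

/* Avoided pattern: return table[secret]; -- the address leaks the secret.
   Constant-time alternative: load every entry, select with a mask. */
static uint32_t ct_lookup(const uint32_t *table, size_t n, size_t secret)
{
    uint32_t result = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = (uint32_t)0 - (uint32_t)(i == secret);
        result |= table[i] & mask;   /* mask is all-ones only at i == secret */
    }
    return result;
}

Note that this discipline only hides the *addresses*; a prefetcher keyed
on loaded *contents*, as described next, walks right past it.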

These prefetchers are not so great on pointer-chasing code (unless the
data happens to be allocated at regular distances), so apparently
Apple engineers, and according to this article also Intel engineers,
added a prefetcher that prefetches based on the contents of the data
that it fetches. To anyone who knows how a cache side-channel works, it is
crystal-clear that this makes it possible to reveal the *contents* of
accessed memory through a cache side channel. Even if only
architectural accesses are used for that prediction, the possibility
is still there, because cryptographic code has to access the secrets.

This should be clear to anyone who understands Spectre, and actually to
anyone who understands classical (architectural) side-channel attacks;
and it should have been very much on the minds of the hardware designers
ever since Spectre was discovered.

The contribution of the GoFetch researchers is that they demonstrate
that this is not just a theoretical possibility.

If Intel added this vulnerability in Raptor Lake (as the article
states), they have to take the full blame. On the positive side, the
GoFetch researchers have not found a way to exploit Raptor Lake's
data-dependent prefetcher. Yet. But I would not bet on there not
being a way to exploit this.

Apple's designers at least have the excuse that at the time when they
laid the groundwork for the M1, Spectre was not known, and when it
became known, it was too late to eliminate this prefetcher from the
design (but not too late to disable it through a chicken bit).
Post by Stephen Fuld
So, is there a way to fix this while maintaining the feature's
performance advantage?
What is the performance advantage? People who have tried to use
software prefetching have often been disappointed by the results. I
expect that a data-dependent prefetcher will usually not be
beneficial, either. There will be a few cases where it may help, but
the average performance advantage will be small.

On the GoFetch web page the researchers suggest using the chicken bit
in the cryptographic library. I would not be surprised if there was a
combination of speculation and data-dependent prefetching, or of
address-dependent prefetching and data-dependent prefetching that
allows all code (not just cryptographic code) to perform
data-dependent prefetches based on the secret data that only crypto
code accesses architecturally. But whether that's the case depends on
the hardware design; plus, if speculative accesses from other codes to
this data are possible, the data can usually be revealed through a
speculative load even in the absence of a data-dependent prefetcher
(but there may be software mitigations for that scenario that the
data-dependent prefetcher would circumvent).
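
As a sketch of "using the chicken bit in the cryptographic library" from
the software side: on the M3, setting the ARMv8.4 Data Independent Timing
bit reportedly also disables the DMP (treat that coupling as an assumption
on any other core), so a library could scope it around key material:

/* AArch64, needs FEAT_DIT (armv8.4-a); on older cores the MSR is
   undefined, so a runtime feature check would be required. */
static inline void dit_on(void)  { __asm__ volatile("msr dit, #1" ::: "memory"); }
static inline void dit_off(void) { __asm__ volatile("msr dit, #0" ::: "memory"); }

void crypto_op_guarded(void (*op)(void *), void *ctx)
{
    dit_on();        /* DMP assumed off while secrets are live */
    op(ctx);
    dit_off();       /* restore normal prefetching */
}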

The web page also mentions "input blinding". I wonder how that can be
made to work reliably. If the attacker has access to all the loaded
data (through GoFetch) and knows how the blinded data is processed,
the attacker can do everything that the crypto code can do,
e.g. create a session key. Of course, if the attacker has to
reconstruct the key from several pieces of data, the attack becomes
more difficult, but relying on it being too difficult has not been a
recipe for success in the past (e.g., before Bernstein's cache timing
attack on AES, it was thought to be too difficult to exploit).

So what other solutions might there be? The results of the
data-dependent prefetches could be loaded into a special cache so that
they don't evict entries of other caches. If a load architecturally
accesses this data, it is transferred to the regular cache. That
special cache should be fully associative, to avoid revealing bits of
the addresses of other accesses (i.e., data) through the accessed set.
That leaves the possibility of revealing something by evicting
something from this special cache just based on capacity, but I don't
see how that could be exploited.
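
A toy software model of that structure, to make the promotion rule
concrete (sizes and names are arbitrary):

#include <stdint.h>
#include <stdbool.h>

#define PF_ENTRIES 16          /* fully associative: no set index to leak */

static struct { uint64_t line; bool valid; } pf_buf[PF_ENTRIES];
static unsigned pf_next;       /* FIFO replacement: capacity evictions only */

/* A data-dependent prefetch lands only here, never in the real cache. */
void pf_insert(uint64_t line)
{
    pf_buf[pf_next].line  = line;
    pf_buf[pf_next].valid = true;
    pf_next = (pf_next + 1) % PF_ENTRIES;
}

/* Only an architectural access promotes the line to the regular cache. */
bool pf_promote(uint64_t line)
{
    for (unsigned i = 0; i < PF_ENTRIES; i++)
        if (pf_buf[i].valid && pf_buf[i].line == line) {
            pf_buf[i].valid = false;
            /* cache_fill(line);  -- hypothetical hook into a cache model */
            return true;
        }
    return false;
}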

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-03-26 16:29:41 UTC
Permalink
On Mon, 25 Mar 2024 07:25:34 GMT
Post by Anton Ertl
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
That's a pretty bad article, but at least one can read it without
JavaScript, unlike the web page of the vulnerability
<https://gofetch.fail/>.
In case you missed it, the web page contains a link to the PDF:
https://gofetch.fail/files/gofetch.pdf
Thomas Koenig
2024-03-27 08:20:49 UTC
Permalink
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking at the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.

Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
EricP
2024-03-27 13:44:34 UTC
Permalink
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking at the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.

I used this approach for the Atomic Fetch-and-OP instructions.
They only need one or two data types and one address mode.

I also considered the same single [reg] address mode for privileged
Load & Store to Physical Address, though these would need to
support 1,2,4, and 8 byte ints, and need some cache control bits.
Thomas Koenig
2024-03-27 15:05:36 UTC
Permalink
Post by EricP
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking at the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
Hm, I'm not sure it would actually be used rarely, at least not
the way I thought about it.

I envisage a "ldp" (load pointer) instruction, which turns on
prefetching, for everything that looks like

foo_t *p = some_expr;

which could also mean something like

p = ptrarray[i];

with a scaled and indexed load (for example), where prefetching
is turned on, and a "ldd" (load double data) instruction where,
for

long int n = some_other_expr;

prefetching is explicitly disabled. (Apart from the security
implications, this could also save a tiny bit of power.)
EricP
2024-03-27 16:08:05 UTC
Permalink
Post by Thomas Koenig
Post by EricP
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking at the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
Hm, I'm not sure it would actually be used rarely, at least not
the way I thought about it.
I'm referring to your load with prefetch disable.
For these particular loads its users could likely tolerate the
"overhead" of an extra LEA instruction to calculate the address,
and don't need all 7 integer data types.
Post by Thomas Koenig
I envisage a "ldp" (load pointer) instruction, which turns on
prefetching, for everything that looks like
foo_t *p = some_expr;
which could also mean something like
p = ptrarray[i];
So this would be
LEA r0,[rBase+rIndex*8+offset]
LDAPQ r0,[r0] // load quad with auto pointer prefetch

though I'm not really sold on the need for this if you have an instruction
(below) that explicitly disables pointer auto-prefetch.
Plus all this does is eliminate an explicit PREFCHR Prefetch-for-Read.
Post by Thomas Koenig
with a scaled and indexed load (for example), where prefetching
is turned on, and a "ldd" (load double data) instruction where,
for
long int n = some_other_expr;
prefetching is explicitly disabled. (Apart from the security
implications, this could also save a tiny bit of power.)
LEA r0,[rBase+offset]
LDNPQ r0,[r0] // load quad no-auto-prefetch

costs 1 opcode as it only supports int64 data type and
doesn't need a corresponding STNPQ store.
Thomas Koenig
2024-03-27 20:15:25 UTC
Permalink
Post by EricP
Post by Thomas Koenig
Post by EricP
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking at the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
Hm, I'm not sure it would actually be used rarely, at least not
the way I thought about it.
I'm referring to your load with prefetch disable.
For these particular loads its users could likely tolerate the
"overhead" of an extra LEA instruction to calculate the address,
and don't need all 7 integer data types.
If it was LEA-only, it would need some kind of pragma in the code
which said "use this more cumbersome and slower, but safer
version".

For that reason, I would probably prefer a separate version
which is implicitly safe and does not have any other drawbacks
for performance, with no code changes.

If it's worth the opcode space...
Anton Ertl
2024-03-27 18:14:11 UTC
Permalink
Post by EricP
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2024-03-27 19:42:34 UTC
Permalink
Post by Anton Ertl
Post by EricP
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
I'm pointing out that not all instructions need to be orthogonal.
There can be savings in opcode space by tempering that based on
expected frequency of occurrence.

The normal LD and ST have all their address modes and data types
because these functions occur frequently enough that we deem it
worthwhile to support these all in one instruction,
such as supporting both sign and zero extended loads
or scaled index addressing.

I note there is this class of relatively rarely used special purpose
memory access instructions that don't need to have all singing and all
dancing address modes and/or data types like the regular LD and ST.

Since I need a LEA Load Effective Address instruction anyway
which does rBase+rIndex*scale+offset calculation
(plus I have others, like where rBase is RIP or an absolute address),
then I can drop all but the [reg] address mode for these rare instructions
and in many cases drop some sign or zero extend types for loads.

For example, I use just two opcodes for Atomic Fetch Add int64 and int32
AFADD8 rDst,rSrc,[rAddr]
AFADD4 rDst,rSrc,[rAddr]
MitchAlsup1
2024-03-27 21:27:16 UTC
Permalink
Post by EricP
Post by Anton Ertl
Post by EricP
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
I'm pointing out that not all instructions need to be orthogonal.
There can be savings in opcode space by tempering that based on
expected frequency of occurrence.
The normal LD and ST have all their address modes and data types
because these functions occur frequently enough that we deem it
worthwhile to support these all in one instruction,
such as supporting both sign and zero extended loads
or scaled index addressing.
I note there is this class of relatively rarely used special purpose
memory access instructions that don't need to have all singing and all
dancing address modes and/or data types like the regular LD and ST.
Since I need a LEA Load Effective Address instruction anyway
which does rBase+rIndex*scale+offset calculation
(plus I have others, like where rBase is RIP or an absolute address),
then I can drop all but the [reg] address mode for these rare instructions
and in many cases drop some sign or zero extend types for loads.
It seems to me that once the core has identified an address, and an offset
from that address contains another address (foo->next, foo->prev), only
those are prefetched. So this depends on placing next as the first
member of a structure, and remains dependent on chasing next a lot more
often than chasing prev.

Otherwise, knowing a loaded value contains a pointer to a structure (or array)
one cannot predict what to prefetch unless one can assume the offset into the
struct (or array).
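
In C terms, the layout that such a heuristic rewards (field names
illustrative):

struct node {
    struct node *next;   /* at offset 0: the one place a simple DMP looks */
    struct node *prev;   /* chased far less often */
    int payload;
};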

Now Note:: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is now architectural not µArchitectural and you are allowed to damage the
cache or TLB when/after the instruction retires.
Post by EricP
For example, I use just two opcodes for Atomic Fetch Add int64 and int32
AFADD8 rDst,rSrc,[rAddr]
AFADD4 rDst,rSrc,[rAddr]
EricP
2024-03-28 16:52:07 UTC
Permalink
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Post by EricP
It doesn't need to eat opcode space if you only support one data type,
64-bit ints, and one address mode, [register].
Other address modes can be calculated using LEA.
Since these are rare instructions to solve a particular problem,
they won't be used that often, so a few extra instructions shouldn't matter.
You lost me here. Do you mean that a load with address mode
[register] is considered to be a non-address load and not followed by
the data-dependent prefetcher? So how would an address load be
encoded if the natural expression would be [register]?
- anton
I'm pointing out that not all instructions need to be orthogonal.
There can be savings in opcode space by tempering that based on
expected frequency of occurrence.
The normal LD and ST have all their address modes and data types
because these functions occur frequently enough that we deem it
worthwhile to support these all in one instruction,
such as supporting both sign and zero extended loads
or scaled index addressing.
I note there is this class of relatively rarely used special purpose
memory access instructions that don't need to have all singing and all
dancing address modes and/or data types like the regular LD and ST.
Since I need a LEA Load Effective Address instruction anyway
which does rBase+rIndex*scale+offset calculation
(plus I have others, like where rBase is RIP or an absolute address),
then I can drop all but the [reg] address mode for these rare
instructions
and in many cases drop some sign or zero extend types for loads.
It seems to me that once the core has identified an address, and an offset
from that address contains another address (foo->next, foo->prev), only
those are prefetched. So this depends on placing next as the first
member of a structure, and remains dependent on chasing next a lot more
often than chasing prev.
Otherwise, knowing a loaded value contains a pointer to a structure (or array)
one cannot predict what to prefetch unless one can assume the offset into the
struct (or array).
Right, this is the problem that these "data memory-dependent" prefetchers,
like the one described in that "Intel Programmable and Integrated Unified
Memory Architecture (PIUMA)" paper referenced by Paul Clayton, are trying to solve.

The pointer field to chase can be
(a) at an +- offset from the current pointer virtual address
(b) at a different offset for each iteration
(c) conditional on some other field at some other offset

and most important:

(d) any new pointers are virtual addresses that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a),(b) and (c) above.

Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.

So I find the simplistic, blithe data-dependent auto prefetching
described there questionable.
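
To make (a)-(c) above concrete, a node type where the naive "contents
look like a pointer, chase them" heuristic has nothing reliable to grab
(types illustrative):

#include <stdint.h>

enum vkind { PLAIN, ALIAS };

struct vertex {
    uint32_t kind;            /* (c): decides whether/what to chase */
    uint32_t weight;
    struct vertex *next;      /* (a): a pointer at a nonzero offset */
    struct vertex *alias;     /* (b): a second candidate at another offset */
};

static const struct vertex *step(const struct vertex *v)
{
    /* the "right" pointer requires interpreting fetched data, and the
       chosen virtual address still needs translation -- point (d) */
    return v->kind == ALIAS ? v->alias : v->next;
}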
Post by MitchAlsup1
Now Note:: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is now architectural not µArchitectural and you are allowed to damage the
cache or TLB when/after the instruction retires.
In the PIUMA case those pointers were to sparse data sets
so part of the problem was rolling over the cache, as well as
(and the PIUMA paper didn't mention this) the TLB.

After reading the PIUMA paper I had an idea for a small modification
to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
can specify the upper page table levels are cached in the TLB but
lower levels are not because they would always roll over the TLB.
However the non-TLB cached PTE's may optionally still be cached
in L1 or L2, or not at all.

This allows one to hold the top page table levels in the TLB,
the upper middle levels in L1, lower middle levels in L2,
and leaf PTE's and sparse code/data not cached at all.
BUT, as PIUMA proposes, we also allow the memory subsystem to read and write
individual aligned 8-byte values from DRAM, rather than whole cache lines,
so we only move the actual 8-byte values we need.

Also note that page table walks are also graph structure walks,
but chasing physical addresses at some simple calculated offsets.
These physical addresses might be cached in L1 or L2, so we can't
just chase these pointers in the memory controller; if one wants
to do this, one has to do so in the cache controller.
MitchAlsup1
2024-03-28 19:59:45 UTC
Permalink
Post by EricP
Post by MitchAlsup1
It seems to me that once the core has identified an address, and an offset
from that address contains another address (foo->next, foo->prev), only
those are prefetched. So this depends on placing next as the first
member of a structure, and remains dependent on chasing next a lot more
often than chasing prev.
Otherwise, knowing a loaded value contains a pointer to a structure (or array)
one cannot predict what to prefetch unless one can assume the offset into the
struct (or array).
Right, this is the problem that these "data memory-dependent" prefetchers,
like the one described in that "Intel Programmable and Integrated Unified
Memory Architecture (PIUMA)" paper referenced by Paul Clayton, are trying to solve.
The pointer field to chase can be
(a) at an +- offset from the current pointer virtual address
(b) at a different offset for each iteration
(c) conditional on some other field at some other offset
(d) any new pointers are virtual addresses that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a),(b) and (c) above.
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
Latency cost is identical, instruction issue/retire costs are lower.
Post by EricP
So I find the simplistic, blithe data-dependent auto prefetching
described as questionable.
K9 built a SW model of such a prefetcher. For the first 1B cycles of a
SPEC benchmark from ~2004 it performed quite well. {{We later figured out
that this was initialization that built a GB data structure.}} Later on
in the benchmark the pointers got scrambled and the performance of the
prefetcher fell on its face.

Moral: you need close to 1T cycles of simulation to qualify a prefetcher.
Post by EricP
Post by MitchAlsup1
Now Note:: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is now architectural not µArchitectural and you are allowed to damage the
cache or TLB when/after the instruction retires.
In the PIUMA case those pointers were to sparse data sets
so part of the problem was rolling over the cache, as well as
(and the PIUMA paper didn't mention this) the TLB.
After reading the PIUMA paper I had an idea for a small modification
to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
can specify the upper page table levels are cached in the TLB but
lower levels are not because they would always roll over the TLB.
However the non-TLB cached PTE's may optionally still be cached
in L1 or L2, or not at all.
This allows one to hold the top page table levels in the TLB,
the upper middle levels in L1, lower middle levels in L2,
and leaf PTE's and sparse code/data not cached at all.
Given the 2-level TLBs currently in vogue, the first level might
have 32-64 PTEs, while the second might have 2048 PTEs. With this
number of PTEs available, does your scheme still give benefit ??
Post by EricP
BUT, as PIUMA proposes, we also allow the memory subsystem to read and write
individual aligned 8-byte values from DRAM, rather than whole cache lines,
so we only move the actual 8-byte values we need.
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
Post by EricP
Also note that page table walks are also graph structure walks,
but chasing physical addresses at some simple calculated offsets.
These physical addresses might be cached in L1 or L2, so we can't
just chase these pointers in the memory controller; if one wants
to do this, one has to do so in the cache controller.
Yes, this is why the K9 prefetcher was in the L2 where it had access
to the L2 TLB.
Paul A. Clayton
2024-03-29 01:51:51 UTC
Permalink
[snip]
Post by MitchAlsup1
Post by EricP
(d) any new pointers are virtual addresses that have to start back at
     the Load Store Queue for VA translation and forwarding testing
     after applying (a),(b) and (c) above.
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design I suspect), a memory controller
may have to send the prefetch request to another memory controller
anyway. If the prefetch has to take a trip on the on-chip network,
a "minor side trip" for translation might not be horrible (though
it seems distasteful to me).

With the Mill having translation at last level cache miss, such
prefetching may be more natural *but* distributing the virtual
address translations and the memory controllers seems challenging
when one wants to minimize hops.

[snip]
Post by MitchAlsup1
Post by EricP
BUT, as PIUMA proposes, we also allow the memory subsystem to read and write
individual aligned 8-byte values from DRAM, rather than whole cache lines,
so we only move the actual 8-byte values we need.
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
The Intel research chip was special-purpose targeting
cache-unfriendly code. Reading 64 bytes when 99% of the time 56
bytes would be unused is rather wasteful (and having more memory
channels helps under high thread count).

However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)

In-cache compression might also nudge the tradeoffs. Being able to
have higher effective bandwidth when data is transmitted in a
compressed form might be useful. "Lossy compression", where the
recipient does not care about much of the data, would allow
compression even when the data itself is not compressible. For
contiguous useful data, this is comparable to a smaller cache
line.
Scott Lurndal
2024-03-29 14:15:23 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by EricP
(d) any new pointers are virtual addresses that have to start back at
     the Load Store Queue for VA translation and forwarding testing
     after applying (a),(b) and (c) above.
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design I suspect), a memory controller
may have to send the prefetch request to another memory controller
anyway.
Which is usually handled by the LLC when the address space is
striped across multiple memory controllers.
Post by Paul A. Clayton
Post by MitchAlsup1
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
The Intel research chip was special-purpose targeting
cache-unfriendly code. Reading 64 bytes when 99% of the time 56
bytes would be unused is rather wasteful (and having more memory
channels helps under high thread count).
Given the lack of both spatial and temporal locality in that
workload, one wonders if the data should be cached at all.
Post by Paul A. Clayton
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Paul A. Clayton
2024-03-29 21:24:45 UTC
Permalink
[snip]
Post by Scott Lurndal
Post by Paul A. Clayton
Post by MitchAlsup1
This is the tidbit that prevents doing prefetches at/in the DRAM controller.
The address so fetched needs translation !! And this requires dragging
stuff over to DRC that is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design I suspect), a memory controller
may have to send the prefetch request to another memory controller
anyway.
Which is usually handled by the LLC when the address space is
striped across multiple memory controllers.
For data-dependent (pointer chasing) prefetching, one would like
for the prefetch to start as soon as the data was available. For a
cache miss that would be in the memory controller. Having to do
address translation can be inconvenient even for LLC.
Post by Scott Lurndal
Post by Paul A. Clayton
Post by MitchAlsup1
Busses on cores are reaching the stage where an entire cache line
is transferred in 1-cycle. With such busses, why define anything
smaller than a cache line ?? {other than uncacheable accesses}
The Intel research chip was special-purpose targeting
cache-unfriendly code. Reading 64 bytes when 99% of the time 56
bytes would be unused is rather wasteful (and having more memory
channels helps under high thread count).
Given the lack of both spatial and temporal locality in that
workload, one wonders if the data should be cached at all.
Most of the data was intended not to be cached. Instructions and
stack would still have spatial locality and some temporal
locality.
Post by Scott Lurndal
Post by Paul A. Clayton
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Yes, but if the width of the on-chip network — which is what Mitch
was referring to in transferring a cache line in one cycle — is
c.72 bytes (64 bytes for the data and 8 bytes for control
information) it seems that short messages would either have to be
grouped (increasing latency) or waste a significant fraction of
the potential bandwidth for that transfer. Compressed cache lines
would also not save bandwidth. These may not be significant
considerations, but this is an answer to "why define anything
smaller than a cache line?", i.e., seemingly reasonable
motivations may exist.
Scott Lurndal
2024-03-29 22:34:44 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by Scott Lurndal
Post by Paul A. Clayton
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Yes, but if the width of the on-chip network — which is what Mitch
was referring to in transferring a cache line in one cycle — is
c.72 bytes (64 bytes for the data and 8 bytes for control
information) it seems that short messages would either have to be
grouped (increasing latency) or waste a significant fraction of
the potential bandwidth for that transfer. Compressed cache lines
would also not save bandwidth. These may not be significant
considerations, but this is an answer to "why define anything
smaller than a cache line?", i.e., seemingly reasonable
motivations may exist.
It's not uncommon for the bus/switch/mesh -structure- to be 512-bits wide,
which indeed will support a full cache line transfer in a single transaction;
it also supports high-volume DMA operations (either memory to memory or
device to memory).

Most of the interconnect (bus, switched or point-to-point) implementations
have an overlaying protocol (including the cache coherency
protocol) and are effectively message based, with agents posting requests
that don't need a reply and expecting a reply for the rest.

That doesn't require that every transaction over that bus
utilize the full width of the bus.
MitchAlsup1
2024-03-30 01:06:23 UTC
Permalink
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by Scott Lurndal
Post by Paul A. Clayton
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line size. (Such are rare compared with cache line loads
from memory or other caches, but a design might have narrower
connections for coherence, interrupts, etc. that could be used for
small data communication.)
So long as the data transfer is cachable, the atomics can be handled
at the LLC, rather than the memory controller.
Yes, but if the width of the on-chip network — which is what Mitch
was referring to in transferring a cache line in one cycle — is
c.72 bytes (64 bytes for the data and 8 bytes for control
information) it seems that short messages would either have to be
grouped (increasing latency) or waste a significant fraction of
the potential bandwidth for that transfer. Compressed cache lines
would also not save bandwidth. These may not be significant
considerations, but this is an answer to "why define anything
smaller than a cache line?", i.e., seemingly reasonable
motivations may exist.
It's not uncommon for the bus/switch/mesh -structure- to be 512-bits wide,
which indeed will support a full cache line transfer in a single transaction;
It is not the transaction, it is a single beat of the clock. One can have
narrower bus widths and simply divide the cache line size by the bus width
to get the number of required beats.
Post by Scott Lurndal
it also supports high-volume DMA operations (either memory to memory or
device to memory).
Most of the interconnect (bus, switched or point-to-point) implementations
have an
or more than one
Post by Scott Lurndal
overlaying protocol (including the cache coherency
protocol) and are effectively message based, with agents posting requests
that don't need a reply and expecting a reply for the rest.
Many older busses read PTPs and PTEs from memory sizeof( PTE ) at a time,
some of them requesting write permission so that used and modified bits
can be written back immediately. {{Which skirts the distinction between
cacheable and uncacheable in several ways.}}
Post by Scott Lurndal
That doesn't require that every transaction over that bus
utilize the full width of the bus.
In my wide bus situation, the line width is used to gang up multiple
responses (from different end-points) into a single beat==message.
For example the chip-to-chip transport can carry multiple independent
SNOOP responses in a single beat (saving cycles and lowering latency).
Stefan Monnier
2024-04-03 18:10:41 UTC
Permalink
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.


Stefan
Thomas Koenig
2024-04-03 20:05:02 UTC
Permalink
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches), but ought to be quite good
at predicting prefetches. If a pointer is loaded, chances are
very high that it will be dereferenced.
MitchAlsup1
2024-04-03 21:37:13 UTC
Permalink
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
Post by Thomas Koenig
but ought to be quite good
at predicting prefetches.
What makes you think programmers understand prefetches any better than
exceptions ??
Post by Thomas Koenig
If a pointer is loaded, chances are
very high that it will be dereferenced.
What if the value loaded is NULL.
Thomas Koenig
2024-04-04 17:51:19 UTC
Permalink
Post by MitchAlsup1
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
On the second or third time, certainly. Hmmm... given hot/cold
splitting which is fairly standard by now, do branch predictors
take this into account?
Post by MitchAlsup1
Post by Thomas Koenig
but ought to be quite good
at predicting prefetches.
What makes you think programmers understand prefetches any better than
exceptions ??
Pointers are used in many common data structures: linked lists,
trees, ... A programmer who does not know about dereferencing
pointers should be kept away from computer keyboards, preferably
at a distance of at least 3 m.
Post by MitchAlsup1
Post by Thomas Koenig
If a pointer is loaded, chances are
very high that it will be dereferenced.
What if the value loaded is NULL?
Then it should be trivially predicted that it should not be prefetched.
MitchAlsup1
2024-04-04 20:09:39 UTC
Permalink
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
There I mean that the programmer wrote the code::

if( no error so far )
{
then continue
}
else
{
deal with the error
}

Many times, the "deal with the error" code is never even fetched.
Post by Thomas Koenig
On the second or third time, certainly. Hmmm... given hot/cold
splitting, which is fairly standard by now, do branch predictors
take this into account?
First we are talking about predicting branches at compile time and
the way the programmer writes the source code, not about the dynamic
predictions of HW.

Given that it is compile time, one either predicts it is taken
(loops) or not taken (errors and initialization) and arrange
the code such that fall through is the predicted pattern (except
for loops).

Then at run time, all these branches are predicted with the standard
predictors present in the core.
Initialization stuff is mispredicted once or twice
error code is only mispredicted when an error occurs
loops are mispredicted once or twice per entrance.

Also note:: With an ISA like My 66000, one can perform branching using
predication and neither predict the branch nor modify where FETCH is
fetching. Ideally, predication should deal with hard to predict branches
and all flow control where the then and else clauses are short. When
these are removed from the predictor, prediction should improve--maybe
not in the number of predictions that are correct, but in the total time
wasted on branching (including both repair and misfetching overheads).
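As a minimal C sketch of the kind of short-clause flow control that
maps naturally onto predication (illustrative only; whether a compiler
emits predicated code or a branch is implementation-dependent):

/* Both clauses are trivial, so a predicated/branchless form
   (e.g., a conditional select) consumes no branch-predictor state. */
int clamp_to_zero(int x)
{
    return (x < 0) ? 0 : x;
}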
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
but ought to be quite good
at predicting prefetches.
What makes you think programmers understand prefetches any better than
exceptions ??
Pointers are used in many common data structures; linked list,
trees, ... A programmer who does not know about dereferencing
pointers should be kept away from computer keyboards, preferably
at a distance of at least 3 m.
3Km ??
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
If a pointer is loaded, chances are
very high that it will be dereferenced.
What if the value loaded is NULL?
Then it should be trivially predicted that it should not be prefetched.
Scott Lurndal
2024-04-04 20:35:12 UTC
Permalink
Post by MitchAlsup1
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
if( no error so far )
{
then continue
}
else
{
deal with the error
}
Many times, the "deal with the error" code is never even fetched.
Post by Thomas Koenig
On the second or third time, certainly. Hmmm... given hot/cold
splitting, which is fairly standard by now, do branch predictors
take this into account?
First we are talking about predicting branches at compile time and
the way the programmer writes the source code, not about the dynamic
predictions of HW.
gcc provides a way to "annotate" a condition with the expected
common result:

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)


if (likely(bus_enable.s.enabled)) {
do something
} else {
do something else
}

This will affect the layout of the code (e.g. deferring generation
of the else clause with the result that it ends up in a different
cache line or page).

It's used in the linux kernel, and in certain cpu bound applications.
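A minimal compilable sketch of that pattern (assuming GCC or Clang;
the function and its error condition are illustrative, not taken from
any particular codebase):

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Errors are rare, so the error clause is marked unlikely; the
   compiler moves it off the fall-through path, keeping the hot
   path dense in the instruction cache. */
static int process(int req)
{
    if (unlikely(req < 0)) {
        fprintf(stderr, "bad request %d\n", req);
        return -1;
    }
    return req * 2;       /* hot path: laid out as the fall-through */
}

int main(void)
{
    long sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += process(i);
    printf("%ld\n", sum);
    return 0;
}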
MitchAlsup1
2024-04-04 20:57:45 UTC
Permalink
Post by Scott Lurndal
Post by MitchAlsup1
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches),
Which are always predicted to have no error.
if( no error so far )
{
then continue
}
else
{
deal with the error
}
Many times, the "deal with the error" code is never even fetched.
Post by Thomas Koenig
On the second or third time, certainly. Hmmm... given hot/cold
splitting, which is fairly standard by now, do branch predictors
take this into account?
First we are talking about predicting branches at compile time and
the way the programmer writes the source code, not about the dynamic
predictions of HW.
gcc provides a way to "annotate" a condition with the expected
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
if (likely(bus_enable.s.enabled)) {
do something
} else {
do something else
}
This will affect the layout of the code (e.g. deferring generation
of the else clause with the result that it ends up in a different
cache line or page).
It's used in the linux kernel, and in certain cpu bound applications.
Thank you for pointing this out.
Paul A. Clayton
2024-04-07 21:20:11 UTC
Permalink
On 4/4/24 4:09 PM, MitchAlsup1 wrote:
> Thomas Koenig wrote:
[snip]
Post by MitchAlsup1
Given that it is compile time, one either predicts it is taken
(loops) or not taken (errors and initialization) and arrange
the code such that fall through is the predicted pattern (except
for loops).
"Boolean Formula-based Branch Prediction for Future Technologies"
(Daniel A. Jiménez et al., 2001) presents a mechanism for branch
prediction based on a static formula using global history, so *in
theory* a compiler could provide more than just taken/not-taken
predictions. ("Static Methods in Hybrid Branch Prediction", Dirk
Grunwald et al., 1998, proposed static predictor selection,
removing the selector table; Donald Lindsay, the second author of
that paper, had a PhD thesis "Static Methods in Branch Prediction".)

(A static prediction can also be used only for an agree predictor.
It is even conceivable that the dominant direction might not be
the best static agreeing prediction; e.g., if aliases would train
to an agree/disagree state that more often mispredicts the branch
[and the branch has few global entries]. Such would be a very
fragile optimization requiring considerable analysis, so it seems
very unlikely to be practical. [I have wondered if a hash of the
branch address might be XORed with predictor bits to provide a
quasi-tagging. XORing with the prediction could cause the most
interference, good for destructive aliasing; XORing with counter
or hysteresis bits would seem to moderate interference, allowing a
prediction to be stable such that a selector would transfer a
mispredicting branch to another table. If an expected/agree sense
was available, such tagging might be more useful.])
Post by MitchAlsup1
Then at run time, all these branches are predicted with the standard
predictors present in the core.
Initialization stuff is mispredicted once or twice
Or not at all if a BTB miss generates a correct not-taken prediction.
(Even without that mechanism, a first encounter might be
correctly predicted by accident.)
Post by MitchAlsup1
error code is only mispredicted when an error occurs
loops are mispredicted once or twice per entrance.
Or never mispredicted for only-counted loops for which the
counter value is available early enough in the pipeline. (Gshare
can also predict loop exit/entry correctly for small loop counts
with few internal branches. TAGE can do better.)
Post by MitchAlsup1
Also note:: With an ISA like My 66000, one can perform branching using
predication and neither predict the branch nor modify where FETCH is
fetching. Ideally, predication should deal with hard to predict branches
and all flow control where the then and else clauses are short. When
these are removed from the predictor, prediction should improve--maybe
not in the number of predictions that are correct, but in the total time
wasted on branching (including both repair and misfetching overheads).

Rarely-executed blocks should presumably use branches even when
short to remove the rarely-executed code from the normal
instruction stream. I would guess that exceptional actions are
typically longer/more complicated.

(Consistent timing would also be important for some real-time
tasks and for avoiding timing side channels.)

The best performing choice would also seem to be potentially
microarchitecture-dependent. Obviously the accuracy of branch
prediction and the cost of aliasing would matter (and
perversely mispredicting a branch can _potentially_ improve
performance, though not on My 66000, I think, because more
persistent microarchitectural state is not updated until
instruction commitment).

If the predicate value is delayed and predicated operations
wait in the scheduler for this operand and the operands of one
path are available before the predicate value, branch prediction
might allow deeper speculation. For highly dynamically
predictable but short branches, deeper speculation might help
more than fetch irregularity hurts. (The predicate could be
predicted — and this prediction is needed later in the pipeline —
but not distinguishing between prediction-allowed predication
and normal predication might prevent prediction from being
implemented to avoid data-dependent timing of predication.)

(The cost of speculation can also be variable. With underutilized
resources (thermal, memory bandwidth, etc.) speculation would
generally be less expensive than with high demand on resources.)
MitchAlsup1
2024-04-07 23:50:25 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Also note:: With an ISA like My 66000, one can perform branching using
predication and neither predict the branch nor modify where FETCH is
fetching. Ideally, predication should deal with hard to predict branches
and all flow control where the then and else clauses are short. When
these are removed from the predictor, prediction should improve--maybe
not in the number of predictions that are correct, but in the total time
wasted on branching (including both repair and misfetching overheads).
Rarely-executed blocks should presumably use branches even when
short to remove the rarely-executed code from the normal
instruction stream. I would guess that exceptional actions are
typically longer/more complicated.
If you will arrive at the join point by simply fetching, there
is no reason to use a branch.
Post by Paul A. Clayton
(Consistent timing would also be important for some real-time
tasks and for avoiding timing side channels.)
Predication is closer to fixed timing than any branching.
Post by Paul A. Clayton
The best performing choice would also seem to be potentially
microarchitecture-dependent. Obviously the accuracy of branch
prediction and the cost of aliasing would matter (and
perversely mispredicting a branch can _potentially_ improve
performance, though not on My 66000, I think, because more
persistent microarchitectural state is not updated until
instruction commitment).
While not committed, it is still available.
Post by Paul A. Clayton
If the predicate value is delayed and predicated operations
wait in the scheduler for this operand and the operands of one
path are available before the predicate value, branch prediction
might allow deeper speculation.
Just like data forwarding, branch prediction forwarding is
easily achieved in the pipeline.

Post by Paul A. Clayton
For highly dynamically predictable but short branches, deeper
speculation might help more than fetch irregularity hurts.
What can possibly be more regular than constant time ??
Post by Paul A. Clayton
(The predicate could be
predicted — and this prediction is needed later in the pipeline —
One HAS TO assume that GBOoO machines will predict predicates.
Post by Paul A. Clayton
but not distinguishing between prediction-allowed predication
and normal predication might prevent prediction from being
implemented to avoid data-dependent timing of predication.)
(The cost of speculation can also be variable. With underutilized
resources (thermal, memory bandwidth, etc.) speculation would
generally be less expensive than with high demand on resources.)
Stefan Monnier
2024-04-05 16:54:50 UTC
Permalink
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches), but ought to be quite good
at predicting prefetches. If a pointer is loaded, chances are
very high that it will be dereferenced.
I don't think it's that simple: prefetches only bring the data into L1
cache, so they're only useful if:

- The data is not already in L1.
- The data will be used soon (i.e. before it gets thrown away from the cache).
- The corresponding load doesn't occur right away.

In all other cases, the prefetch will be just wasted work.

It's easy for programmers to "predict" those (dependent) loads which will occur
right away, but those don't really benefit from a prefetch.
E.g. if the dependent load is done 2 cycles later, performing a prefetch
lets you start the memory access 2 cycles early, but since that access
is not in L1 it'll take more than 10 cycles, so shaving
2 cycles off isn't of great benefit.

Given that we're talking about performing a prefetch on the result of
a previous load, and loads tend to already have a fairly high latency
(3-5 cycles), "2 cycles later" really means "5-7 cycles after the
beginning of the load of that pointer". That can easily translate to 20
instructions later.
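To make that concrete, a minimal sketch (assuming GCC's
__builtin_prefetch and an illustrative node layout):

#include <stddef.h>

struct node {
    struct node *next;
    long payload[7];              /* illustrative: ~1 cache line */
};

long sum_list(const struct node *n)
{
    long sum = 0;
    while (n != NULL) {
        /* Start fetching the next node; this only pays off if the
           work below covers a useful part of the miss latency. */
        __builtin_prefetch(n->next);
        for (int i = 0; i < 7; i++)
            sum += n->payload[i];
        n = n->next;
    }
    return sum;
}

Here the seven additions are the only independent work available, so
the prefetch hides almost none of the miss latency.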

My gut feeling is that it's difficult for programmers to predict what
will happen more than 20 instructions further without looking at
detailed profiling.


Stefan
MitchAlsup1
2024-04-05 21:28:59 UTC
Permalink
Post by Stefan Monnier
Post by Thomas Koenig
Post by Stefan Monnier
Post by EricP
Since each chased pointer starts back at LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a),(b) and (c)
having been applied first.
I thought the important difference is that the decision to prefetch or
not can be done dynamically based on past history.
Programmers and compilers are notoriously bad at predicting
branches (except for error branches), but ought to be quite good
at predicting prefetches. If a pointer is loaded, chances are
very high that it will be dereferenced.
I don't think it's that simple: prefetches only bring the data into L1
cache, so they're only useful if:
- The data is not already in L1.
- The data will be used soon (i.e. before it gets thrown away from the cache).
- The corresponding load doesn't occur right away.
In all other cases, the prefetch will be just wasted work.
It's easy for programmers to "predict" those (dependent) loads which will occur
right away, but those don't really benefit from a prefetch.
E.g. if the dependent load is done 2 cycles later, performing a prefetch
lets you start the memory access 2 cycles early, but since that access
is not in L1 it'll take more than 10 cycles, so shaving
2 cycles off isn't of great benefit.
Given that we're talking about performing a prefetch on the result of
a previous load, and loads tend to already have a fairly high latency
(3-5 cycles), "2 cycles later" really means "5-7 cycles after the
beginning of the load of that pointer". That can easily translate to 20
instructions later.
My gut feeling is that it's difficult for programmers to predict what
will happen more than 20 instructions further without looking at
detailed profiling.
Difficult becomes impossible when the code has to operate "well" over
a range of implementations. {{With some suitable definition of "well"}}

Consider deciding how far into the future (counting instructions) a
prefetch has to be placed so that the data arrives before use. Then
consider the smallest execution window is 16 instructions and the
largest execution window is 300 instructions; and you want the same
code to be semi-optimal on both.
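A sketch of that tuning problem for a simple streaming loop (PF_DIST
is illustrative; no single value is right for both the 16-entry and
the 300-entry machine):

#include <stddef.h>

/* Elements to run ahead; would have to be retuned per implementation. */
#define PF_DIST 16

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST]);
        sum += a[i];
    }
    return sum;
}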
Post by Stefan Monnier
Stefan
Anton Ertl
2024-03-27 17:52:30 UTC
Permalink
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
Only works for new versions of an architecture, and supporting
compilers, but no code change would be required. And, of course,
it would eat up opcode space.
The other way 'round seems to be a better approach: mark those loads
that load addresses, and then prefetch at most based on the data
loaded by these instructions (of course, the data-dependent prefetcher
may choose to ignore the data based on history). That means that
existing programs are immune to GoFetch, but also don't benefit from
the data-dependent prefetcher (which is a minor issue IMO).

As for opcode space, we already have prefetch instructions, so one
could implement this by letting every load that actually loads an
address be followed by a prefetch instruction. But of course that
would consume more code space and decoding bandwidth than just adding
a load-address instruction.
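In source terms, the paired-instruction variant might look like the
following sketch (__builtin_prefetch stands in for the prefetch
instruction a compiler would emit after each address-loading load;
the node type is illustrative):

struct node { struct node *next; long value; };

long next_value(const struct node *n)
{
    struct node *p = n->next;   /* this load loads an address...    */
    __builtin_prefetch(p);      /* ...so it is paired with a hint
                                   the hardware may honor or ignore */
    return p->value;
}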

In any case, passing the prefetch hints to hardware that may ignore
the hint based on history may help reduce the performance
disadvantages that have been seen when using prefetch hint
instructions.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-03-27 20:52:13 UTC
Permalink
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
It is worth noting (from the paper's Introduction):

In particular, Augury reported that the [Apple M-series ed.] DMP only activates
in the presence of a rather idiosyncratic program memory
access pattern (where the program streams through an array
of pointers and architecturally dereferences those pointers).
This access pattern is not typically found in security critical
software such as side-channel hardened constant-time code--
hence making that code impervious to leakage through the
DMP.
Scott Lurndal
2024-03-27 20:53:25 UTC
Permalink
Post by Scott Lurndal
Post by Thomas Koenig
Post by Michael S
https://gofetch.fail/files/gofetch.pdf
Looking the paper, it seems that a separate "load value" instruction
(where it is guaranteed that no pointer prefetching will be done)
could fix this particular issue. Compilers know what type is being
loaded from memory, and could issue the corresponding instruction.
This would not impact performance.
In particular, Augury reported that the [Apple M-series ed.] DMP only activates
in the presence of a rather idiosyncratic program memory
access pattern (where the program streams through an array
of pointers and architecturally dereferences those pointers).
This access pattern is not typically found in security critical
software such as side-channel hardened constant-time code--
hence making that code impervious to leakage through the
DMP.
Reminder to self, read rest of article before commenting.

Never mind.
John Savard
2024-03-25 13:26:35 UTC
Permalink
On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
So, is there a way to fix this while maintaining the feature's
performance advantage?
While the article was pessimistic about a downloadable microcode fix,
the description didn't suggest to me that it would be hard to correct
in future designs. Maybe my memory is faulty.

John Savard
Michael S
2024-03-25 13:53:06 UTC
Permalink
On Mon, 25 Mar 2024 07:26:35 -0600
Post by John Savard
On Sun, 24 Mar 2024 10:40:18 -0700, Stephen Fuld
Post by Stephen Fuld
https://arstechnica.com/security/2024/03/hackers-can-extract-secret-encryption-keys-from-apples-mac-chips/
So, is there a way to fix this while maintaining the feature's
performance advantage?
While the article was pessimistic about a downloadable microcode fix,
the description didn't suggest to me that it would be hard to correct
in future designs. Maybe my memory is faulty.
John Savard
The description suggests that it wouldn't be hard to correct by
introducing a new mode bit that can be turned on and off by crypto
libraries.
That is not going to help existing binaries.

Now, personally I don't believe that for a single-user platform like
the Mac the threat is even remotely real, or that any fix is needed.
And I am not even an Apple fanboy, more like a mild hater.
Lawrence D'Oliveiro
2024-03-26 02:31:38 UTC
Permalink
Post by Michael S
Now, personally I don't believe that for a single-user platform like
the Mac the threat is even remotely real, or that any fix is needed.
It could be exploited via, for example, some future web browser or other
app vulnerability.

Remember Schneier’s dictum: attacks only ever get worse; they never get
better.