Discussion:
AMD Cache speed funny
Vir Campestris
2024-01-30 16:36:17 UTC
I've knocked up a little utility program to try to work out some
performance figures for my CPU.

It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384kb L1 cache

What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.

A C++ fragment is this. I can post the whole thing if it would help.

// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;

Stopwatch s;
s.start();
while (1)       // until break when mask runs out
{
        for (size_t index = 0; index < storeWordCount; ++index)
        {
                // read and write a word in store.
                Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time

        if (mask == 0) break;   // Stop if we've run out of mask

        mask >>= 1;             // shrink the mask
}

As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.

All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.

But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the speed I get with a large block.

Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.

A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.
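Roughly, the read-only and write-only inner loops were along these lines
(a sketch rather than the exact code; the volatile sink is only there so
the compiler can't optimise the read loop away):

static volatile Word sink;      // defeats dead-code elimination

// Read only: load each word and accumulate.
Word sum = 0;
for (size_t index = 0; index < storeWordCount; ++index)
{
        sum += Raw[index & mask];
}
sink = sum;

// Write only: store without depending on the previous contents.
for (size_t index = 0; index < storeWordCount; ++index)
{
        Raw[index & mask] = index;
}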

What am I missing?

Thanks
Andy
Anton Ertl
2024-01-30 17:20:59 UTC
Post by Vir Campestris
for (size_t index = 0; index < storeWordCount; ++index)
{
// read and write a word in store.
Raw[index & mask] ^= index;
}
...
Post by Vir Campestris
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.
Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.
A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.
What am I missing?
When you do

raw[0] ^= index;

in every step you read the result of the previous iteration, xor it,
and store it again. This means that you have one chain of RMW data
dependences, with one RMW per iteration. On the Zen2 (which your
3400G has), this requires 8 cycles (see column H of
<http://www.complang.tuwien.ac.at/anton/memdep/>). With mask=1, you
get 2 chains, each with one 8-cycle RMW every second iteration, so you
need 4 cycles per iteration (see my column C). With mask=3, you get 4
chains and 2 cycles per iteration. Looking at my results, I would
expect another doubling with mask=7, but maybe your loop is running
into resource limits at that point (mine does 4 RMWs per iteration).
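To make the chains concrete, the two extreme cases reduce to the
following (an illustrative sketch, not the code from the posting above):

// mask == 0: a single serial load-xor-store chain; each iteration
// waits for the previous store, about 8 cycles per iteration.
for (size_t index = 0; index < storeWordCount; ++index)
        Raw[0] ^= index;

// mask == 3: iterations i, i+4, i+8, ... each hit their own word,
// giving 4 independent RMW chains that the core can overlap,
// about 2 cycles per iteration.
for (size_t index = 0; index < storeWordCount; ++index)
        Raw[index & 3] ^= index;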

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-01-30 17:38:15 UTC
On Tue, 30 Jan 2024 16:36:17 +0000
Post by Vir Campestris
I've knocked up a little utility program to try to work out some
performance figures for my CPU.
4MB L3 cache
2MB L2 cache
384kb L1 cache
That's for the whole chip and it includes L1I caches.
For an individual core, and excluding L1I, the numbers are:
4MB L3 cache
512 KB L2 cache
32 KB L1D cache
Post by Vir Campestris
What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.
A C++ fragment is this. I can post the whole thing if it would help.
// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;
Stopwatch s;
s.start();
while (1) // until break when mask runs out
{
for (size_t index = 0; index < storeWordCount; ++index)
{
// read and write a word in store.
Raw[index & mask] ^= index;
}
s.lap(mask); // records the current time
if (mask == 0) break; // Stop if we've run out of mask
mask >>= 1; // shrink the mask
}
As you can see it starts with a large mask (in fact for a whole GB)
and halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large
mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as
the mask gets smaller. No apparent effect when it gets under the L1
cache size.
But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.
Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the
L1 data cache.
A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this
odd slow down with small masks.
What am I missing?
Thanks
Andy
First, I'd look at the generated asm.
If the compiler was doing a good job then at mask <= 4095 (32 KB) you should
see slightly less than 1 iteration of the loop per cycle, i.e. assuming a
4.2 GHz clock, approximately 30 GB/s.
Since you see less, it's a sign that the compiler did a less than perfect job.
Try to help it with manual loop unrolling.
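For example, something like this (a sketch of what I mean, using the
names from your fragment; it assumes storeWordCount is a multiple of 4):

for (size_t index = 0; index < storeWordCount; index += 4)
{
        // Four independent read-modify-writes per iteration; with a
        // large enough mask they touch different words, so the core
        // is not serialised on a single RMW chain.
        Raw[(index + 0) & mask] ^= index + 0;
        Raw[(index + 1) & mask] ^= index + 1;
        Raw[(index + 2) & mask] ^= index + 2;
        Raw[(index + 3) & mask] ^= index + 3;
}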

As to the problem with lower performance at very small masks, it's
expected. The CPU tries to execute loads speculatively, out of order, under
the assumption that they don't alias with preceding stores. So the actual
loads run a few loop iterations ahead of the stores. We can't say for sure how
many iterations ahead, but 7 to 10 iterations sounds like a good guess.
When your mask=7 (32 bytes) then aliasing starts to happen. On old
primitive CPUs, like the Pentium 4, it causes a massive slowdown, because
those early loads have to be replayed after a rather significant delay
of about 20 cycles (the length of the pipeline). Your Zen1+ CPU is much
smarter: it detects that things are no good and stops the wild speculation.
So you don't see a huge slowdown. But without speculation every load starts
only after all stores that preceded it in program order were either
committed into the L1D cache or their address was checked against the
speculative load address and no aliasing was found. Since you see only
a mild slowdown, it seems that the latter is done rather effectively and
your CPU is still able to run loads speculatively, but now only 2 or 3
steps ahead, which is not enough to get the same performance as before.
MitchAlsup1
2024-01-30 20:11:42 UTC
Post by Vir Campestris
As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.
The execution window is apparently able to absorb the latency of L3 miss,
and stream L3->L1 accesses.

Anton answered the question regarding small masks.
Michael S
2024-01-30 20:37:05 UTC
On Tue, 30 Jan 2024 20:11:42 +0000
Post by MitchAlsup1
Post by Vir Campestris
As you can see it starts with a large mask (in fact for a whole GB)
and halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large
mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
as the mask gets smaller. No apparent effect when it gets under the
L1 cache size.
The execution window is apparently able to absorb the latency of L3
miss, and stream L3->L1 accesses.
That sounds unlikely. L3 latency is too big to be covered by the execution
window. Much more likely they have adequate HW prefetch from L3 to L2 and
maybe (less likely) even to L1D.
Post by MitchAlsup1
Anton answered the question regarding small masks.
Terje Mathisen
2024-01-31 06:59:41 UTC
Post by Vir Campestris
I've knocked up a little utility program to try to work out some
performance figures for my CPU.
4MB L3 cache
2MB L2 cache
384kb L1 cache
What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.
A C++ fragment is this. I can post the whole thing if it would help.
// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;
Stopwatch s;
s.start();
while (1)       // until break when mask runs out
{
        for (size_t index = 0; index < storeWordCount; ++index)
        {
                // read and write a word in store.
                Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time
        if (mask == 0) break;   // Stop if we've run out of mask
        mask >>= 1;             // shrink the mask
}
As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large mask,
at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
gets smaller. No apparent effect when it gets under the L1 cache size.
But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.
Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the L1
data cache.
A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this odd
slow down with small masks.
Mitch, Anton and Michael have already answered, I just want to add that
we have one additional potential factor:

Rowhammer protection:

It is possible that the pattern of re-XORing the same or a small number
of locations over and over could trigger a pattern detector which was
designed to mitigate against Rowhammer.

OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?

I.e. for normal (write combining) memory, it would (afaik) be legal to
delay the actual writes to RAM for a significant time, long enough to
merge multiple memory writes.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Anton Ertl
2024-01-31 08:17:13 UTC
It is possible that the pattern of re-XORing the same or a small number
of locations over and over could trigger a pattern detector which was
designed to mitigate against Rowhammer.
I don't think that memory controller designers have actually
implemented Rowhammer protection: I would expect that the processor
manufacturers would have bragged about that if they had. They have
not. And even RAM manufacturers have stopped mentioning anything
about Rowhammer in their specs. It seems that all hardware
manufacturers have decided that Rowhammer is something that will just
disappear from public knowledge (and therefore from what they have to
deal with) if they just ignore it long enough. It appears that they
are right.

They seem to take the same approach wrt Spectre-family attacks. In
that case, however, new variants appear all the time, so maybe the
approach won't work here.

However, in the present case "the same small number of locations" is
not hammered, because a small number of memory locations fits into the
cache in the adjacent access pattern that this test uses, and all
writes will just be to the cache.
OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?
We have had write-back caches (at the L2 or L1 level, and certainly at
the LLC level) since the later 486 years.
I.e. for normal (write combining) memory
Normal memory is write-back. AFAIK write combining is for stuff like
graphics card memory.
it would (afaik) be legal to
delay the actual writes to RAM for a significant time, long enough to
merge multiple memory writes.
And this is what actually happens, through the magic of write-back
caches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-01-31 11:13:53 UTC
On Wed, 31 Jan 2024 07:59:41 +0100
Post by Terje Mathisen
Post by Vir Campestris
I've knocked up a little utility program to try to work out some
performance figures for my CPU.
4MB L3 cache
2MB L2 cache
384kb L1 cache
What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.
A C++ fragment is this. I can post the whole thing if it would help.
// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;
Stopwatch s;
s.start();
while (1)       // until break when mask runs out
{
        for (size_t index = 0; index < storeWordCount; ++index)
        {
                // read and write a word in store.
                Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time
        if (mask == 0) break;   // Stop if we've run out of mask
        mask >>= 1;             // shrink the mask
}
As you can see it starts with a large mask (in fact for a whole GB)
and halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large
mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
as the mask gets smaller. No apparent effect when it gets under the
L1 cache size.
But...
When the mask is very small (3) it slows to 18GB/s. With 1 it
halves again, and with zero (so it only operates on the same word
over and over) it's half again. A fifth of the size with a large
block.
Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the
L1 data cache.
A late thought was to replace that ^= index with something that
reads the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this
odd slow down with small masks.
Mitch, Anton and Michael have already answered, I just want to add
It is possible that the pattern of re-XORing the same or a small
number of locations over and over could trigger a pattern detector
which was designed to mitigate against Rowhammer.
OTOH, this would much more easily be handled with memory range based
coalescing of write operations in the last level cache, right?
I.e. for normal (write combining) memory, it would (afaik) be legal
to delay the actual writes to RAM for a significant time, long enough
to merge multiple memory writes.
Terje
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. That's what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."

By now, it seems obvious that making the CLFLUSH instruction non-privileged
and pretty much unrestricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH has existed since the very early 2000s, it is
understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Scott Lurndal
2024-01-31 15:04:50 UTC
Post by Michael S
On Wed, 31 Jan 2024 07:59:41 +0100
By now, it seems obvious that making CLFLUSH instruction non-privilaged
and pretty much non-restricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH exists since very early 2000s, it is
understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
ARMv8 has a control bit that can be set to allow EL0 access
to the DC system instructions. By default it is a privileged
instruction. It is up to the operating software to enable
it for user-mode code.
Anton Ertl
2024-01-31 17:17:21 UTC
Post by Michael S
I have very little to add to very good response by Anton.
That little addition is: the most if not all Rowhammer POC examples rely
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making CLFLUSH instruction non-privilaged
and pretty much non-restricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH exists since very early 2000s, it is
understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).

However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
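To illustrate: for a simple physically indexed cache with a plain
modulo set index (real L3s often hash the index, so the geometry below
is only an assumption for the sketch), lines that are setCount *
lineSize bytes apart share a set, and touching ways+1 of them is enough
to force evictions:

#include <cstddef>

// Illustrative geometry only: 16 ways, 64-byte lines, 4 MB capacity.
constexpr std::size_t kWays      = 16;
constexpr std::size_t kLineSize  = 64;
constexpr std::size_t kCacheSize = 4 * 1024 * 1024;
constexpr std::size_t kSets      = kCacheSize / (kLineSize * kWays);  // 4096

// base must point to at least (kWays + 1) * kSets * kLineSize bytes.
void touch_one_set(volatile unsigned char *base)
{
    // 17 lines that share a set index: each pass over them is
    // guaranteed to evict at least one line from a 16-way set.
    for (std::size_t i = 0; i <= kWays; ++i) {
        std::size_t offset = i * kSets * kLineSize;
        unsigned char v = base[offset];          // read...
        base[offset] = (unsigned char)(v ^ 1);   // ...modify, write back
    }
}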

The first paper on Rowhammer already outlined how the memory
controller could count how often the rows adjacent to a given row are
accessed, and thus how much that row has been weakened. This approach
needs a little adjustment for Double Rowhammer and for rows that are
not immediate neighbours, but otherwise seems to me to be the way to
go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
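As a sketch of the counting idea (illustrative only: per-row activation
counters with a refresh of the neighbours at a threshold; the limit,
blast radius and refresh hook are placeholders, not real DRAM
parameters):

#include <cstddef>
#include <cstdint>
#include <vector>

struct RowGuard {
    std::vector<uint16_t> count;              // one counter per row
    static constexpr uint16_t kLimit  = 4800; // placeholder threshold
    static constexpr int      kRadius = 2;    // placeholder blast radius

    explicit RowGuard(std::size_t rows) : count(rows, 0) {}

    void refresh_row(std::size_t /*row*/) { /* issue an extra refresh */ }

    void on_activate(std::size_t row) {
        if (++count[row] < kLimit) return;
        // The neighbours are the potential victims; an extra refresh
        // restores their charge, then the counter starts over.
        for (int d = -kRadius; d <= kRadius; ++d) {
            std::size_t victim = row + d;
            if (d != 0 && victim < count.size())
                refresh_row(victim);
        }
        count[row] = 0;
    }
};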

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2024-01-31 20:12:15 UTC
Post by Anton Ertl
Post by Michael S
I have very little to add to very good response by Anton.
That little addition is: the most if not all Rowhammer POC examples rely
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making CLFLUSH instruction non-privilaged
and pretty much non-restricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH exists since very early 2000s, it is
understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices.
I have wondered for a while about why device access is not to coherent
space. If it were so, then no CFLUSH functionality is needed, I/O can
just read/write an address and always get the freshest copy. {{Maybe
not the device itself, but the PCIe Root could translate from device
access space to memory access space (coherent).}}
Post by Anton Ertl
An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
However, AFAIK this is insufficient for fixing Rowhammer.
If L3 (LLC) is not a processor cache but a great big read/write buffer
for DRAM, then Rowhammering is significantly harder to accomplish.
Post by Anton Ertl
Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines}. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.

So, the trick is to detect the RowHammering and insert refresh commands.
Post by Anton Ertl
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
- anton
EricP
2024-02-02 17:03:41 UTC
Post by MitchAlsup
Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.

Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.

And the threshold when it triggers has been changing as drams become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.

So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.
MitchAlsup
2024-02-02 19:34:25 UTC
Post by EricP
Post by MitchAlsup
Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
My understanding is that RowHammer has to access the same row multiple times
to disrupt bits in an adjacent row. This sounds like a charge sharing problem.
A long time ago we found a problem with one manufacturer's SRAM: when the same
row was hit >6,000 times, there was enough charge sharing that the adjacent
dynamic word decoder also fired, so we had 2 or 3 word lines active at the
same time. We encountered this when a LD missed the cache and was sent down
through NorthBridge, SouthBridge, onto another bus, finally out to the device
and back, while the CPU was continuing to read the ICache every cycle.

My limited understanding of RowPress is that you should not keep the Row open
for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit). My bet is
that this is a leakage issue on the bit line made sensitive by the word line.
Post by EricP
Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.
DRAMs are funny like this. Adjacent bit lines store data differently. Even
bits store 0 as 0 and 1 as 1 while odd cells store 0 as 1 and 1 as 0. They
do this so the sense amplifier has a differential to sense: either the even
cell or the odd cell is asserted on the bit line pair and the sense amp then
has a differential to sense. One line goes up a little or down a little while
the other bit line stays where it is.
Post by EricP
And the threshold when it triggers has been changing as drams become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.
So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.
EricP
2024-02-02 22:20:51 UTC
Post by MitchAlsup
Post by EricP
Post by MitchAlsup
Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
My understanding is that RowHammer has to access the same row multiple times
to disrupt bits in an adjacent row. This sounds like a charge sharing problem.
Yes, as I understand it charge migration.
I had a nice document on the root cause of Rowhammer but I can't seem to
find it again. This one is a little heavy on the semiconductor physics:

On dram rowhammer and the physics of insecurity, 2020
https://ieeexplore.ieee.org/iel7/16/9385809/09366976.pdf

"Experimental evidence points to two mechanisms for the RH disturb,
namely cell transistor subthreshold leakage and electron injection
into the p-well of the DRAM array from the hammered cell transistors
and their subsequent capture by the storage node (SN) junctions [13].

Regarding the subthreshold leakage, lower cell transistor threshold
voltages have been shown to correlate with higher susceptibility to RH.
This is consistent with crosstalk between the switching aggressor wordline
and the victim wordlines pulling up the latter sufficiently in the
potential to drain away some of the victim cell’s stored charge [14], [15].

Regarding the injected electrons from the hammered cell transistors,
the blame for these has been placed on two different origins.
The first describes a collapsing inversion layer associated with the
hammered cell transistor where a population of electrons is injected
into the p-well as the transistor’s gate turns off [16]. The second
describes electron injection from charge traps near the silicon/gate
dielectric interface of the cell select transistor [13], [17].
Several studies look into techniques for hampering the migration of
these injected electrons."
Post by MitchAlsup
A long time ago We found a problem with one manufactures SRAM when the same
row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down
through NorthBridge, SouthBridge, onto another bus, finally out to the device
and back, while the CPU was continuing to read the ICache every cycle.
I think of this as aging: each activation ages the rows up to some distance
by amounts depending on the distance due to charge migration.

Originally it was found by activating rows immediately adjacent to the
victim but then they looked and found it further out to +-4 rows.
This effect appears to be called the Rowhammer "blast radius".

This paper is from 2023 but I'm sure I've seen mention of this effect
before but not called blast radius.

BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
https://www.research-collection.ethz.ch/handle/20.500.11850/617284
https://dramsec.ethz.ch/papers/blaster.pdf

"In particular, we show for the first time that BLASTER significantly
reduces the number of necessary activations to the victim-adjacent
aggressors using other aggressor rows that are up to four rows away
from the victim."
Post by MitchAlsup
My limited understanding of RowPress is that you should not keep the Row open
for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit). My bet is
that this is a leakage issue on the bit line made sensitive by the word line.
Yes, from what I read the factors affecting Rowhammer vulnerability are:

1) DRAM chip temperature, 2) aggressor row active time,
and 3) victim DRAM cell’s physical location.
Post by MitchAlsup
Post by EricP
Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.
DRAMs are funny like this. Adjacent bit lines store data differently. Even
bits store 0 as 0 and 1 as 1 while odd cells store 0 as 1 and 1 as 0. They
do this so the sense amplified has a differential to sense, either the even
cell or the odd cell is asserted on the bit line pair and the sense amp then
has a differential to sense. One line goes up a little or down a little while
the other bit line stays where it is.
Post by EricP
And the threshold when it triggers has been changing as drams become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.
So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.
EricP
2024-02-02 23:24:10 UTC
Post by EricP
Post by MitchAlsup
A long time ago We found a problem with one manufactures SRAM when the same
row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down
through NorthBridge, SouthBridge, onto another bus, finally out to the device
and back, while the CPU was continuing to read the ICache every cycle.
I think of this as aging: each activation ages the rows up to some distance
by amounts depending on the distance due to charge migration.
Originally it was found by activating rows immediately adjacent to the
victim but then they looked and found it further out to +-4 rows.
This effect appears to be called the Rowhammer "blast radius".
This paper is from 2023 but I'm sure I've seen mention of this effect
before but not called blast radius.
BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
https://www.research-collection.ethz.ch/handle/20.500.11850/617284
https://dramsec.ethz.ch/papers/blaster.pdf
"In particular, we show for the first time that BLASTER significantly
reduces the number of necessary activations to the victim-adjacent
aggressors using other aggressor rows that are up to four rows away
from the victim."
To elaborate a bit, as I understand it this means that if a dram
has a blast radius of +-3 and we take 7 rows A B C D E F G,
and assuming the aging factor is linear, then any read or refresh
of row D resets its age to 0 but ages C&E by 3, B&F by 2, A&G by 1.
If any row age total hits 15,000 its data dies.

This is why I thought canary bits might work, because they integrate the
sum of all adjacent activates while taking blast distance into account.
As long as the canary _reliably_ dies at age 12,000 and the data at 15,000
then the dram could transparently refresh the aged-out rows.
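In code, that toy aging model might look like this (my sketch; the
3/2/1 weights and the 12,000/15,000 thresholds are just the numbers
from the example above):

#include <cstddef>
#include <cstdint>
#include <vector>

struct AgingModel {
    std::vector<uint32_t> age;                  // accumulated age per row
    static constexpr uint32_t kCanary = 12000;  // canary bit dies here
    static constexpr uint32_t kData   = 15000;  // real data would die here

    explicit AgingModel(std::size_t rows) : age(rows, 0) {}

    void refresh(std::size_t row) { age[row] = 0; }   // refresh resets the age

    void activate(std::size_t row) {
        static constexpr uint32_t weight[4] = {0, 3, 2, 1};  // by distance
        refresh(row);                   // activation refreshes the row itself
        for (std::size_t d = 1; d <= 3; ++d) {
            if (row >= d)              bump(row - d, weight[d]);
            if (row + d < age.size())  bump(row + d, weight[d]);
        }
    }

    void bump(std::size_t row, uint32_t w) {
        if ((age[row] += w) >= kCanary)
            refresh(row);   // canary died first; refresh well before kData
    }
};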
EricP
2024-02-03 17:12:03 UTC
Post by MitchAlsup
Post by EricP
Post by MitchAlsup
Rowhammer happens when you beat on the same cache line multiple times
{causing a charge sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
My understanding is that RowHammer has to access the same row multiple times
to disrupt bits in an adjacent row. This sounds like a charge sharing problem.
A long time ago We found a problem with one manufactures SRAM when the same
row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down
through NorthBridge, SouthBridge, onto another bus, finally out to the device
and back, while the CPU was continuing to read the ICache every cycle.
My limited understanding of RowPress is that you should not keep the Row open
for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit). My bet is
that this is a leakage issue on the bit line made sensitive by the word line.
Ah I see from the RowPress paper that it is different from RowHammer.
RowHammer is based on activation counts and RowPress on activation time.
Previously papers had just said that activation time correlated with
bit flips and I guess everyone just assumed it was the same mechanism.
But the RowPress paper shows it affects different bits from RowHammer.
Also RowPress and RowHammer tend to flip in different directions,
RowHammer flips 0 to 1 and RowPress 1 to 0 (taking the true and anti
cell logic states into account). Possibly one is doing electron injection
and the other hole injection.
Michael S
2024-01-31 20:49:15 UTC
On Wed, 31 Jan 2024 17:17:21 GMT
Post by Anton Ertl
Post by Michael S
I have very little to add to very good response by Anton.
That little addition is: the most if not all Rowhammer POC examples
"Executions of the CLFLUSH instruction are ordered with respect to
each other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making CLFLUSH instruction
non-privilaged and pretty much non-restricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH exists since very early
2000s, it is understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.
Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
and that at that time all Intel's PCI/AGP root hubs were already fully
I/O-coherent for several years, I find your theory unlikely.

Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly
- entering deep sleep state in which CPU caches are powered down and
DRAM put in self-refresh mode.

Of course, this particular use case does not require *non-priviledged*
CLFLUSH, so obviously Intel had different reason.
Post by Anton Ertl
This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.

Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.

Today we have yet another variant called RowPress that bypasses TRR
mitigation more reliably than mult-rate RH. I think this one would be
practically impossible without CLFLUSH., esp. when system under attack
carries other DRAM accesses in parallel with attackers code.
Post by Anton Ertl
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory controller
does not even know which rows are physically adjacent. POC authors
typically run lengthy tests in order to figure it out.
Post by Anton Ertl
With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
- anton
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
MitchAlsup
2024-01-31 23:22:38 UTC
Post by Michael S
On Wed, 31 Jan 2024 17:17:21 GMT
Post by Anton Ertl
Post by Michael S
I have very little to add to very good response by Anton.
That little addition is: the most if not all Rowhammer POC examples
"Executions of the CLFLUSH instruction are ordered with respect to
each other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line.1 They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making CLFLUSH instruction
non-privilaged and pretty much non-restricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH exists since very early
2000s, it is understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.
Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
and that at that time all Intel's PCI/AGP root hubs were already fully
I/O-coherent for several years, I find your theory unlikely.
Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly
- entering deep sleep state in which CPU caches are powered down and
DRAM put in self-refresh mode.
Of course, this particular use case does not require *non-priviledged*
CLFLUSH, so obviously Intel had different reason.
There was no assumption that this could result in a side channel or
attack vector at the time of its non-privileged inclusion. Afterwards
there was no reason to make it privileged until 2017, and by then the
ability to do anything about it had vanished.

Me, personally, I see this as a violation of the principle that the cache
is there to reduce memory latency and thereby improve performance.
Post by Michael S
Post by Anton Ertl
This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily share paired word lines.
The word lines could be as far as ½ the block away from each other.

The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also happens
in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 word lines in a row.
Post by Michael S
Today we have yet another variant called RowPress that bypasses TRR
mitigation more reliably than mult-rate RH. I think this one would be
practically impossible without CLFLUSH., esp. when system under attack
carries other DRAM accesses in parallel with attackers code.
Post by Anton Ertl
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all thise solutions are pure fantasy, because memory controller
does not even know which rows are physically adjacent.
Different DIMMs and even different DRAMs on the same DIMM may not
share that correspondence. {There is a lot of bit line and a little
word line repair done at the tester.}
Post by Michael S
POC authors
typically run lengthy tests in order to figure it out.
Post by Anton Ertl
With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
- anton
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
EricP
2024-02-02 17:15:21 UTC
Post by MitchAlsup
Post by Michael S
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily shared paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a grey-like-code
imposed on the logical-address to physical-word-line. This also happens
in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 words lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.

I was wondering if each row could have a "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.

Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Thomas Koenig
2024-02-02 18:09:14 UTC
Post by EricP
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
That would look... interesting.

How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like

nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.

Wiring would be interesting as well...
MitchAlsup
2024-02-02 19:18:12 UTC
Post by Thomas Koenig
Post by EricP
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
That would look... interesting.
How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like
nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))
Close, but NANDs come with 4-inputs and NORs come with 3*, so you get
a 3×4 = 12:1 reduction per pair of stages.

2985984->248832->20736->1728->144->12->1
Post by Thomas Koenig
which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.
6 not 7
Post by Thomas Koenig
Wiring would be interesting as well...
That is why we have 10 layers of metal--oh wait DRAMs don't have that
much metal.....

(*) NANDs having 4 inputs while NORs only have 3 is a consequence of
P-channel transistors having lower transconductance and higher body
effects, and there are differences between planar transistors and
finFETs here, too.
Anton Ertl
2024-02-03 09:28:14 UTC
Post by EricP
Post by MitchAlsup
Post by Michael S
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily shared paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a grey-like-code
imposed on the logical-address to physical-word-line. This also happens
in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 words lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and then
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).

Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.

This would take much less memory, but require finding the entry with
the lowest count. By dividing the cache into sets, this becomes more
realistic; upon reaching a limit, only the rows in the blast radius of
the lines in a set need to be refreshed.
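Sketched in C++ (an illustration of the scheme just described; the
entry count and the limit are made-up numbers):

#include <cstddef>
#include <cstdint>
#include <unordered_map>

class RowCounterCache {
    static constexpr std::size_t kEntries = 64;    // N tracked rows
    static constexpr uint32_t    kLimit   = 4800;  // placeholder limit
    std::unordered_map<uint64_t, uint32_t> counts; // row -> count

public:
    // Returns true when the limit is reached and a refresh (of every
    // row, or of the blast radius of a set in the set-divided variant)
    // should be triggered.
    bool on_activate(uint64_t row) {
        auto it = counts.find(row);
        if (it != counts.end()) {
            ++it->second;
        } else if (counts.size() < kEntries) {
            it = counts.emplace(row, 1).first;
        } else {
            // Evict the entry with the lowest count C and start the
            // newly loaded row at C + 1.
            auto victim = counts.begin();
            for (auto j = counts.begin(); j != counts.end(); ++j)
                if (j->second < victim->second)
                    victim = j;
            uint32_t c = victim->second;
            counts.erase(victim);
            it = counts.emplace(row, c + 1).first;
        }
        return it->second >= kLimit;
    }

    void on_refresh(uint64_t row) { counts.erase(row); }  // drop refreshed rows
};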
Post by EricP
I was wondering if each row could have "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
difficulties:

* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?

* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary is necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2024-02-03 17:13:23 UTC
Post by Anton Ertl
Post by EricP
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
You know what its value should be and you raise hell when it is not as
expected. This may require 2 canary bits.
Post by Anton Ertl
* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary is necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.
- anton
Anton Ertl
2024-02-03 17:45:31 UTC
Post by MitchAlsup
Post by Anton Ertl
Post by EricP
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
You know what its value should be and you raise hell when it is not as
expected.
So that is about detecting Rowhammer after the fact. Yes, you could
do that when the row is refreshed. The only problem is that by then
the attacker could have extracted the secret(s) with the
Rowhammer-based attack. Better than nothing, but still not a very
attractive approach.

I prefer a solution that detects that a row might suffer a bit flip
after several more accesses, and refreshes the row before that happens.
And I don't think that this can be implemented with an analog canary
that works like a DRAM cell; but I am not a solid-state physicist,
maybe there is a way.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2024-02-03 19:10:56 UTC
Post by Anton Ertl
Post by MitchAlsup
Post by Anton Ertl
Post by EricP
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
You know what its value should be and you raise hell when it is not as
expected.
So that is about detecting Rowhammer after the fact. Yes, you could
do that when the row is refreshed. The only problem is that by then
the attacker could have extracted the secret(s) with the
Rowhammer-based attack. Better than nothing, but still not a very
attractive approach.
I prefer a solution that detects that a row might suffer a bit flip
after several more accesses, and refreshes the row before that happens.
And I don't think that this can be implemented with an analog canary
that works like a DRAM cell; but I am not a solid-state physicist,
maybe there is a way.
Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.

Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS; it got sped up, pipelined,
double data rated, and each step added address bits to RAS and CAS.

I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently:
Bank-Row-Column or block-bank-row-column.

In addition each building block will be internally self timed, no
external need to refresh the bank-row, and the only non access
command in the arsenal is power-down and power-up.

You can only put so much lipstick on a pig.
Post by Anton Ertl
- anton
Anton Ertl
2024-02-05 09:48:52 UTC
Post by MitchAlsup
Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.
Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS, it got speed up, pipelined,
double data rated, and each step added address bits to RAS and CAS.
I don't know about DDR6, but the DDR5 command interface is
significantly more complex
<https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
asynchronous DRAM.
Post by MitchAlsup
I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently
Bank-Row-Column or block-bank-row-column.
Looking at the commands from the link above, Activate already
transfers the row in two pieces, and the read and write are also
transferred in two pieces.
Post by MitchAlsup
In addition each building block will be internally self timed, no
external need to refresh the bank-row, and the only non access
command in the arsenal is power-down and power-up.
Self-refresh is already there, but AFAIK only used when processing is
suspended.

However, there are many commands, many more than in the 16kx1 DRAMs of
old. What would make them go in the direction of simplifying the
interface? The hardest part these days seems to be getting the high
transfer rates to work, the rest of the interface is probably
comparatively easy.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-02-05 22:30:19 UTC
Post by Anton Ertl
Post by MitchAlsup
Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.
Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS, it got speed up, pipelined,
double data rated, and each step added address bits to RAS and CAS.
I don't know about DDR6, but the DDR5 command interface is
significantly more complex
<https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
asynchronous DRAM.
Post by MitchAlsup
I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently
Bank-Row-Column or block-bank-row-column.
Looking at the commands from the link above, Activate already
transfers the row in two pieces, and the read and write are also
transferred in two pieces.
Post by MitchAlsup
In addition each building block will be internally self timed, no
external need to refresh the bank-row, and the only non access
command in the arsenal is power-down and power-up.
Self-refresh is already there, but AFAIK only used when processing is
suspended.
My DRAM controller (AMD Opteron rev G) used ACTivate commands instead of
refresh commands to refresh rows in DDR2 DRAM. The timings were better.
It just did not come back and ask for data from the RASed row.
Post by Anton Ertl
However, there are many commands, many more than in the 16kx1 DRAMs of
old. What would make them go in the direction of simplifying the
interface?
Pins that are less expensive.
Post by Anton Ertl
The hardest part these days seems to be getting the high
transfer rates to work, the rest of the interface is probably
comparatively easy.
This is from DDR4 and onward where one has to control drive strength
and clock edge offsets (with a DLL) to transfer data that fast.
Post by Anton Ertl
- anton
EricP
2024-02-04 05:00:12 UTC
Permalink
Post by Anton Ertl
Post by EricP
Post by MitchAlsup
Post by Michael S
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily shared paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a grey-like-code
imposed on the logical-address to physical-word-line. This also happens
in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 words lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and the
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.

And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.
Post by Anton Ertl
Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.
That would be a CAM or assoc sram and would have to hold a large
number of entries. This would have to be in the memory controller.
Post by Anton Ertl
This would take much less memory, but require finding the entry with
the lowest count. By dividing the cache into sets, this becomes more
realistic; upon reaching a limit, only the rows in the blast radius of
the lines in a set need to be refreshed.
Post by EricP
I was wondering if each row could have "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
The canary would have to be a little more complicated than a standard
storage cell because it has to compare the cell to the expected value
and then drive an output transistor to pull down a dynamic bit line
for a wired-OR of all the canaries in a bank.
Hopefully that would isolate the canary from its read bit line changes.

Fitting this into a dram row could be a problem.
This would all have the same height as a normal row to fit horizontally
along a dram row so it didn't bugger up the row spacing.
Post by Anton Ertl
* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.
- anton
Each canary might be 3 cells with alternating patterns,
even row numbers are initialized to 010 and odd rows to 101,
positioned in vertical columns. Presumably this would put
the maximum and a predictable stress on the center bit.
Since the expected value for each row is hard wired it is
easy to test if it changes.
Anton Ertl
2024-02-05 09:35:21 UTC
Permalink
Post by EricP
Post by Anton Ertl
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and the
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.
So only 10 bits of counter are necessary, reducing the overhead to
0.125%:-).
Post by EricP
And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.
Yes.
Post by EricP
Post by Anton Ertl
Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.
That would be a CAM or assoc sram and would have to hold a large
number of entries. This would have to be in the memory controller.
Possibly. Recent DRAMs also support self-refresh (to allow powering
down the connection to the memory controller); this kind of stuff
could also be on the DRAM device, avoiding all the problems that
memory controllers have with knowing the characteristics of the DRAM
device.
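A minimal C++ sketch of that counter-cache idea as I read it; the table size,
the eviction rule and the 600-activation trigger are assumptions taken from
numbers mentioned in this thread, not a description of any real controller
or DRAM:

// Per-bank cache of activation counters for recently activated rows.
// Evict the lowest count and seed the newcomer with that count + 1, so a
// heavily hammered row cannot hide by being briefly evicted.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <unordered_map>

class RowCounterCache {
    static constexpr std::size_t N = 64;    // tracked rows per bank (assumption)
    static constexpr uint32_t LIMIT = 600;  // trigger threshold (from the thread)
    std::unordered_map<uint32_t, uint32_t> count;  // row -> activation count
public:
    // Returns true when the blast radius of 'row' should be refreshed now.
    bool on_activate(uint32_t row)
    {
        auto it = count.find(row);
        if (it == count.end()) {
            uint32_t seed = 0;
            if (count.size() >= N) {
                auto min_it = std::min_element(count.begin(), count.end(),
                    [](const auto &a, const auto &b) { return a.second < b.second; });
                seed = min_it->second;      // newcomer inherits the evicted count
                count.erase(min_it);
            }
            it = count.emplace(row, seed).first;
        }
        if (++it->second >= LIMIT) {
            count.erase(it);                // row and neighbours get refreshed
            return true;
        }
        return false;
    }
    void on_refresh(uint32_t row) { count.erase(row); }  // periodic refresh resets
};

int main()
{
    RowCounterCache cache;
    for (int i = 1; i <= 600; ++i)
        if (cache.on_activate(12345))
            std::printf("refresh blast radius of row 12345 at activation %d\n", i);
}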
Post by EricP
Post by Anton Ertl
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
The canary would have to be a little more complicated than a standard
storage cell because it has to compare the cell to the expected value
Maybe capacitive coupling (as used for flash AFAIK) could be used to
measure the contents of the canary without discharging it. There
still would be tunneling, as in Rowhammer itself, but I guess one
could account for that.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-02-10 23:24:18 UTC
Permalink
Post by Anton Ertl
Post by EricP
Post by Anton Ertl
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and the
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.
So only 10 bits of counter are necessary, reducing the overhead to
0.125%:-).
Post by EricP
And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.
Yes.
If the DRAM manufacturers placed a Faraday shield over the DRAM arrays
{a ground plane}, the blast radius goes from a linear charge-sharing issue
to a quadratic charge-sharing issue. Such a ground plane is a layer of
metal with a single <never changing> voltage on it. This might change the
blast radius from 8 to 2.

{{We did this kind of things for SRAM so we could run large signal count
busses over the SRAM arrays.}}
Post by Anton Ertl
- anton
MitchAlsup1
2024-02-04 21:12:57 UTC
Permalink
Post by Anton Ertl
Post by EricP
Post by MitchAlsup
Post by Michael S
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily shared paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a grey-like-code
imposed on the logical-address to physical-word-line. This also happens
in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 words lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word. My
guess is that you are off by 20× (should be close to 4%)
Anton Ertl
2024-02-05 09:08:34 UTC
Permalink
Post by MitchAlsup1
Post by Anton Ertl
Post by EricP
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word.
I was thinking about counting each access only when the cache line is
accessed. Then there needs to be only one incrementor per bank, and
the counter can be stored in DRAM like the payload data.

But thinking about it again, I wonder how counters would be reset.
Maybe, when the counter reaches the limit, all lines in its blast
radius are refreshed, and the counter of the present line is reset to
0.

Another disadvantage would be that we have to make decisions about
possible rowhammering only based on one counter, and have to trigger
refreshes of all lines in the blast radius based on worst-case
scenarios (i.e., assuming that other rows in the blast radius have any
count up to the limit).

Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.

Alternatively, if you want to invest more, one could follow your idea
and have counter SRAM (maybe including counting circuitry) for each
row; each refresh of a line would increment the counters in the blast
radius by an appropriate amount, and when a counter reaches its limit,
it would trigger a refresh of that row.
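A rough sketch of that per-row variant; the blast radius and the threshold
are assumptions taken from figures quoted earlier in this thread, not from
any datasheet:

// Per-row disturb counters: each ACTivate bumps the counters of the
// neighbours inside the blast radius; a counter reaching LIMIT means
// that row gets refreshed and its counter reset.
#include <cstdint>
#include <cstdio>
#include <vector>

class PerRowCounters {
    static constexpr int BLAST = 4;          // +-4 rows (assumption)
    static constexpr uint32_t LIMIT = 4800;  // per-victim threshold (from the thread)
    std::vector<uint32_t> c;
public:
    explicit PerRowCounters(std::size_t rows) : c(rows, 0) {}

    // Called on every ACTivate of 'row'; returns the rows to refresh now.
    std::vector<std::size_t> on_activate(std::size_t row)
    {
        std::vector<std::size_t> refresh;
        for (int d = -BLAST; d <= BLAST; ++d) {
            if (d == 0) continue;
            long long n = static_cast<long long>(row) + d;
            if (n < 0 || n >= static_cast<long long>(c.size())) continue;
            if (++c[static_cast<std::size_t>(n)] >= LIMIT) {
                c[static_cast<std::size_t>(n)] = 0;   // about to be refreshed
                refresh.push_back(static_cast<std::size_t>(n));
            }
        }
        return refresh;
    }
};

int main()
{
    PerRowCounters counters(1u << 16);       // 65536 rows for the demo
    for (int i = 1; i <= 4800; ++i)
        for (std::size_t victim : counters.on_activate(1000))
            std::printf("refresh row %zu at activation %d\n", victim, i);
}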
Post by MitchAlsup1
My guess is that you are off by 20× (should be close to 4%)
Even 4% is not "impractical".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2024-02-06 21:41:00 UTC
Permalink
Post by Anton Ertl
Post by MitchAlsup1
Post by Anton Ertl
Post by EricP
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word.
I was thinking about counting each access only when the cache line is
accessed. Then there needs to be only one incrementor per bank, and
the counter can be stored in DRAM like the payload data.
Dram row reads are destructive so a single row activate command
internally has three cycles: read, sense and redrive, restore.

The counter could be stored in the dram cells and the
N-bit incrementer integrated into the bit line sense amp latches,
such that when the activate command does its restore cycle
it writes back the incremented counter.
The incremented counter would also be available in the row buffer.

Since the next precharge can't happen for 40-50 ns we have some
time to decide what to do next.
Post by Anton Ertl
But thinking about it again, I wonder how counters would be reset.
Maybe, when the counter reaches the limit, all lines in its blast
radius are refereshed, and the counter of the present line is reset to
0.
On a row read if the counter hits its threshold limit the restore
cycle writes back a count of 0, otherwise the incremented counter.

The problem is with the +-4 blast radius refreshes. Each of those refreshes
ages its neighbors which we need to track, so we can't reset those counters.
This could cause a write amplification where each refresh repeatedly
triggers 4 more refreshes.

It is possible to use the counter as a state machine.
Something like...
1) For normal, periodic refreshes set count to some initial value.
2) For reads increment count and if carry-out then reset to initial value
and schedule immediate blast refresh of +-4 neighbor rows.
3) For blast row refresh increment count but don't check for overflow.
If there is a count overflow it gets detected on its next row read.
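Read literally, that state machine might look like the sketch below; the
counter width and initial value are placeholders of mine, chosen so that 600
reads produce a carry-out, not values from any real part:

// Counter-as-state-machine, one counter per row, updated during the
// restore phase of an activate. Modelled with a wide integer so that an
// unchecked overflow from case 3 can still be noticed on the next read.
#include <cstdint>
#include <cstdio>

constexpr uint32_t BITS    = 10;             // counter width (assumption)
constexpr uint32_t WRAP    = 1u << BITS;     // carry-out point
constexpr uint32_t INITIAL = WRAP - 600;     // so ~600 reads reach carry-out

struct RowState { uint32_t count = INITIAL; };

// 1) Normal periodic refresh: write back the initial value.
void on_periodic_refresh(RowState &r) { r.count = INITIAL; }

// 2) Normal read/activate: increment; on carry-out, reset to the initial
//    value and ask for an immediate blast refresh of the +-4 neighbours.
bool on_read(RowState &r)                    // true -> schedule blast refresh
{
    ++r.count;
    if (r.count >= WRAP) { r.count = INITIAL; return true; }
    return false;
}

// 3) Blast refresh of a neighbour: increment without checking for overflow;
//    a wrapped counter is noticed on that row's next normal read.
void on_blast_refresh(RowState &r) { ++r.count; }

int main()
{
    RowState r;
    for (int i = 1; i <= 600; ++i)
        if (on_read(r))
            std::printf("blast refresh of +-4 neighbours after read %d\n", i);
}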
Post by Anton Ertl
Another disadvantage would be that we have to make decisions about
possible rowhammering only based on one counter, and have to trigger
refreshes of all lines in the blast radius based on worst-case
scenarios (i.e., assuming that other rows in the blast radius have any
count up to the limit).
Yes, unless there is a way to infer the total counts for the neighbors.
Bloom filter?
But see below.
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Let's see how bad this is.
 
The single-line threshold of 4800 and a blast radius of 8 give a trigger count of 600.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms, resetting all the counters,
so the counts are not cumulative.

That overhead is only going to grow as dram density increases.
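Spelled out as code, using only the numbers stated above:

// Trigger threshold and refresh overhead from the figures in this post.
#include <cstdio>

int main()
{
    const double single_row_threshold = 4800;  // activations before a flip
    const double blast_rows = 8;               // +-4 neighbours
    const double trigger  = single_row_threshold / blast_rows;  // 600
    const double overhead = blast_rows / trigger;               // 8/600
    std::printf("trigger every %.0f activations, overhead %.2f%%\n",
                trigger, overhead * 100.0);                     // ~1.3%
}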
MitchAlsup1
2024-02-10 23:20:17 UTC
Permalink
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the counters
so the counts are not cumulative.
I think what RowPress tells us is that waiting ~60 ms and then refreshing every row
is worse for data retention than spreading the refreshes out rather evenly
over the 64 ms max interval.
Post by EricP
That overhead is only going to grow as dram density increases.
So are all the attack vectors.
Anton Ertl
2024-02-11 13:20:50 UTC
Permalink
Post by MitchAlsup1
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
That's not the issue at hand here. The issue at hand here is whether
the relatively cheap mechanism I described has an acceptable number of
additional refreshes during normal operation, or whether a more
expensive (in terms of area) mechanism is needed to fix Rowhammer.

Concerning ECC, many computers do not have ECC memory, and for those
that have it, ECC does not reliably fix Rowhammer; if it did, the fix
would be simple: Use ECC, which is a good idea anyway, even if it
costs 25% more chips in case of DDR5 DIMMs.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2024-02-11 15:46:02 UTC
Permalink
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the counters
so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then refreshing every row
is worse for data retention than spreading the refreshes out over the 64ms max
interval rather evenly.
Would any memory controller actually do that, i.e. refresh the whole dram
in one big burst instead of periodically by row?
I would expect doing so to introduce big stalls into memory access.

64 ms / 8192 rows per block = 7.8125 us row interval.
Let's say 50 ns row refresh time.
So that's either 50 ns every 7.8 us
versus 8192*50 ns = 409.6 us memory stall every 64 ms.
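For reference, the same numbers in code (the 50 ns per-row refresh time is
the assumption stated above):

// Distributed refresh versus one big burst, with the figures in this post.
#include <cstdio>

int main()
{
    const double retention_s   = 64e-3;   // whole device refreshed every 64 ms
    const double rows          = 8192;    // rows per block
    const double row_refresh_s = 50e-9;   // assumed time to refresh one row

    std::printf("distributed: one %.0f ns refresh every %.4f us\n",
                row_refresh_s * 1e9, retention_s / rows * 1e6);      // 7.8125 us
    std::printf("burst: a %.1f us stall every %.0f ms\n",
                rows * row_refresh_s * 1e6, retention_s * 1e3);      // 409.6 us
}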
MitchAlsup1
2024-02-11 19:57:34 UTC
Permalink
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the counters
so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then refreshing every row
is worse for data retention than spreading the refreshes out over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically by row?
I would expect doing so would introduce big stalls into memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set to about 7µs and if the
bank was active it would allow REF to slip. But on a second timer event
it would interrupt data transfer and induce 2 refreshes to catch up. In
general, this worked well as it almost never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.

When one changes page boundaries the HoB address bits are essentially
randomized by the TLB:: why not just close the row at that point ?
Post by EricP
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Michael S
2024-02-12 15:27:59 UTC
Permalink
On Sun, 11 Feb 2024 19:57:34 +0000
Post by MitchAlsup1
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the
counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then refreshing every row
is worse for data retention than spreading the refreshes out over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically by
row? I would expect doing so would introduce big stalls into memory
access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
back was active it would allow REF to slip. But on a second timer
event it would interrupt data transfer and induce 2 refreshes to
catch up. In general, this worked well as it almost never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
A DDR5 channel is 32-bit.
4096B / (4B/T * 6e9 T/s) = 0.171 usec,
or 0.204 usec for a more realistic rate of 5e9 T/s.
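In code form, under the same assumption of a 32-bit channel (4 bytes per
transfer):

// Time to move a 4096-byte page over a 32-bit DDR5 channel.
#include <cstdio>
#include <initializer_list>

int main()
{
    const double bytes = 4096;
    const double bytes_per_transfer = 4;        // 32-bit channel
    for (double rate : {6e9, 5e9})              // transfers per second
        std::printf("%5.0f MT/s: %.3f usec per 4096-byte page\n",
                    rate / 1e6, bytes / (bytes_per_transfer * rate) * 1e6);
}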
Post by MitchAlsup1
When one changes page boundaries the HoB address bits are essentially
randomized by the TLB:: why not just close the row at that point ?
Because the memory controller is not aware of CPU page boundaries.
Besides, in the aarch64 world 16KB pages are rather common, and in the x86
world "transparent huge pages" are rather common.
Post by MitchAlsup1
Post by EricP
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Scott Lurndal
2024-02-12 20:27:13 UTC
Permalink
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Because memory controller is not aware of CPU page boundaries.
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
AArch64 supports translation granules of 4k, 16k and 64k. 4K
and 64K are the most common. While the architecture defines
16k, an implementation is free to not support it and I'm not aware of any
widespread usage.
Michael S
2024-02-12 21:12:50 UTC
Permalink
On Mon, 12 Feb 2024 20:27:13 GMT
Post by Scott Lurndal
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Because memory controller is not aware of CPU page boundaries.
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
AArch64 supports translation granules of 4k, 16k and 64k. 4K
and 64K are the most common. While the architecture defines
16k, an implementation is free to not support it and I'm not aware of
any widespread usage.
I think 16KB is the main page size on Apple. Android is trying the
same, but so far has problems.
Apple+Android == approximately 101% of the AArch64 total.
MitchAlsup1
2024-02-12 22:45:08 UTC
Permalink
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Post by MitchAlsup1
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the
counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then refreshing every row
is worse for data retention than spreading the refreshes out over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically by
row? I would expect doing so would introduce big stalls into memory
access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
back was active it would allow REF to slip. But on a second timer
event it would interrupt data transfer and induce 2 refreshes to
catch up. In general, this worked well as it almost never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
Post by MitchAlsup1
When one changes page boundaries the HoB address bits are essentially
randomized by the TLB:: why not just close the row at that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Post by Michael S
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
Neither of which prevents closing the row to avoid memory retention
issues.
Post by Michael S
Post by MitchAlsup1
Post by EricP
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Michael S
2024-02-12 23:15:28 UTC
Permalink
On Mon, 12 Feb 2024 22:45:08 +0000
Post by MitchAlsup1
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Post by MitchAlsup1
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary
to prevent Rowhammer, but that approach may still be good
enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole dram is refreshed every 64 ms reseting
all the counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically
by row? I would expect doing so would introduce big stalls into
memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if
the back was active it would allow REF to slip. But on a second
timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
Post by MitchAlsup1
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row at
that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Do you always answer one statement before reading the next statement?
Post by MitchAlsup1
Post by Michael S
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
What scenario of attack do you have in mind?
I would think that in neither "classic" multi-sided Row Hammer nor Row
Press does the attacker have to cross CPU page boundaries. If the attacker
happens to know that the memory controller likes to close DRAM rows on any
particular address boundary, then he can easily avoid accessing the last
cache line before that boundary.

BTW, all these attacks (or should I say, all these POCs, because I don't
think anybody has ever caught a real RH/RP attack launched by a real bad
guy) rather heavily depend on big or huge pages. They are close to
impossible with small pages, even when "small" means 16 KB rather than
4 KB.
Post by MitchAlsup1
Post by Michael S
Post by MitchAlsup1
Post by EricP
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
MitchAlsup1
2024-02-13 00:19:18 UTC
Permalink
Post by Michael S
On Mon, 12 Feb 2024 22:45:08 +0000
Post by MitchAlsup1
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Post by MitchAlsup1
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than necessary
to prevent Rowhammer, but that approach may still be good
enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600
trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole dram is refreshed every 64 ms reseting
all the counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out
over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically
by row? I would expect doing so would introduce big stalls into
memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if
the back was active it would allow REF to slip. But on a second
timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
Post by MitchAlsup1
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row at
that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Do you always answer one statement before reading the next statement?
I actually wrote the above after writing the below.
Post by Michael S
Post by MitchAlsup1
Post by Michael S
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
What scenario of attack do you have in mind?
RowPress depends on keeping the row open too long--clearly evident in the
charts in the document.
Post by Michael S
I would think that neither in "classic" multi-side Row Hammer nor in Row
Press attacker has to cross CPU page boundaries. If he (attacker)
happens to know that memory controller likes to close DRAMraws on any
particular address boundary, then he can easily avoid accessing last
cache line before that particular boundary.
RowHammer depends on closing the row too often.

Performance (single CPU) depends on allowing the open row to service
several pending requests streaming data at CAS access speeds.

There is a balance to be found by preventing RowHammer from opening
nearby rows too often and in preventing RowPress from holding them
open for too long.

I happen to think (without evidence beyond that of the RowPress document)
that the balance is distributing refreshes evenly across the refresh
interval (as evidenced in the charts in the RowPress document). It ends up
that with modern DDR this enables about 4096 bytes to be read/written
to a row before closing it (within a factor of 2-4).
Post by Michael S
BTW, all this attacks (or should I say, all this POCs, because I don't
think that somebody ever caught real RH/RP attack launched by real bad
guy) rather heavily depend on big or huge pages. They are close to
impossible with small pages, even when "small" means 16 KB rather than
4 KB.
Post by MitchAlsup1
Post by Michael S
Post by MitchAlsup1
Post by EricP
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Michael S
2024-02-13 15:19:16 UTC
Permalink
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
Post by Michael S
On Mon, 12 Feb 2024 22:45:08 +0000
Post by MitchAlsup1
Post by Michael S
On Sun, 11 Feb 2024 19:57:34 +0000
Post by MitchAlsup1
Post by EricP
Post by MitchAlsup1
Post by EricP
Post by Anton Ertl
Both disadvantages lead to far more refreshes than
necessary to prevent Rowhammer, but that approach may
still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Post by MitchAlsup1
Post by EricP
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 =
600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole dram is refreshed every 64 ms
reseting all the counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out
over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of
periodically by row? I would expect doing so would introduce
big stalls into memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and
if the back was active it would allow REF to slip. But on a
second timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Post by EricP
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
Post by MitchAlsup1
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row
at that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Do you always answer one statement before reading the next
statement?
I actually wrote the above after writing the below.
Post by Michael S
Post by MitchAlsup1
Post by Michael S
Besides, in aarch64 world 16KB pages are rather common. And in
x86 world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
What scenario of attack do you have in mind?
RowPress depends on keeping the row open too long--clearly evident in
the charts in the document.
Clarification for casual observers who didn't bother to read the RowPress
paper: the RowPress attack does not depend on keeping the row open
continuously.
Short interruptions actually improve the effectiveness of the attack,
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
For a given interruption rate, longer interruptions reduce the effectiveness
of the attack, but not dramatically so. For example, for the most practically
important interruption rate of 128 KHz (period = 7.81 usec), increasing the
duration of the off interval from the absolute minimum allowed by the protocol
(~50 ns) to 2 usec reduces the efficiency of the attack only by a factor of 2 or 3.
Post by MitchAlsup1
Post by Michael S
I would think that neither in "classic" multi-side Row Hammer nor
in Row Press attacker has to cross CPU page boundaries. If he
(attacker) happens to know that memory controller likes to close
DRAMraws on any particular address boundary, then he can easily
avoid accessing last cache line before that particular boundary.
RowHammer depends on closing the row too often.
Yes, except that it is unknown whether the major RH impact comes from
closing the row or from opening it. The latter is more likely. But since
the rate of opening and closing is the same, this finer difference is
not important.
Post by MitchAlsup1
Performance (single CPU) depends on allowing the open row to service
several pending requests streaming data at CAS access speeds.
There is a balance to be found by preventing RowHammer from opening
nearby rows too often and in preventing RowPress from holding them
open for too long.
There is no balance. Opening nearby rows too often helps both variants
of attack.
Post by MitchAlsup1
I happen to think (without evidence beyond that of the rRowPress
document) that the balance is distributing refreshes evenly across
the refresh interval (as evidenced in the charts in RowPress
document. It ends up that with modern DDR this enables about 4096
bytes to be read/written to a row before closing it (within a factor
of 2-4).
Huh?
A DDR4-3200 channel transfers data at a rate approaching 25.6 GB/s. DDR5
will be the same when it reaches its projected maximum speed of 6400.
25.6 GB/s * 7.81 usec = 200,000 bytes. That's a factor of 49 rather than
2-4.
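The same back-of-the-envelope calculation in code, assuming peak channel
bandwidth:

// Bytes that can stream from one open row during a 7.81 us refresh slot
// at DDR4-3200 peak bandwidth, compared with the 4096 bytes suggested above.
#include <cstdio>

int main()
{
    const double bw_bytes_per_s = 25.6e9;   // DDR4-3200 channel peak
    const double slot_s         = 7.81e-6;  // interval between per-row refreshes
    const double bytes = bw_bytes_per_s * slot_s;        // ~200,000 bytes
    std::printf("%.0f bytes per slot, factor %.0f over 4096\n",
                bytes, bytes / 4096.0);                  // ~49
}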
EricP
2024-02-13 16:24:10 UTC
Permalink
Post by Michael S
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
RowPress depends on keeping the row open too long--clearly evident in
the charts in the document.
Clarification for casual observers that didn't bother to read Row Press
paper: RowPress attack does not depends on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of attack
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer lowers
the number of opens (RowHammers) needed before adjacent rows suffer bit flips.
Also the paper notes that DRAM manufacturers, e.g. Micron and Samsung,
already document that keeping a row open longer can cause read-disturbance.
What's new is that the paper documents the interaction between row activation
time and the subsequent number of opens (RowHammers) needed to flip a bit.

Also note that different bits are susceptible to RowPress and RowHammer.
See section 4.3

RowPress Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

"RowPress breaks memory isolation by keeping a DRAM row open for a long
period of time, which disturbs physically nearby rows enough to cause
bitflips. We show that RowPress amplifies DRAM’s vulnerability to
read-disturb attacks by significantly reducing the number of row
activations needed to induce a bitflip by one to two orders of
magnitude under realistic conditions. In extreme cases, RowPress induces
bitflips in a DRAM row when an adjacent row is activated only once."

"We show that keeping a DRAM row (i.e., aggressor row) open for a long
period of time (i.e., a large aggressor row ON time, tAggON) disturbs
physically nearby DRAM rows. Doing so induces bitflips in the victim row
without requiring (tens of) thousands of activations to the aggressor row."
Post by Michael S
For a given interruption rate, longer interruptions reduce effectiveness
of attack, but not dramatically so. For example, for most practically
important interruption rate of 128 KHz (period=7.81 usec) increasing
duration of off interval from absolute minimum allowed by protocol
(~50ns) to 2 usec reduces efficiency of attack only by factor of 2 o 3.
Reduced by a factor of up to 363. Under figure 1.

"We observe that as tAggON increases, compared to the most effective
RowHammer pattern, the most effective Row-Press pattern reduces ACmin
1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).

Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON increases."
Michael S
2024-02-13 17:00:30 UTC
Permalink
On Tue, 13 Feb 2024 11:24:10 -0500
Post by EricP
Post by Michael S
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
RowPress depends on keeping the row open too long--clearly evident
in the charts in the document.
Clarification for casual observers that didn't bother to read Row
Press paper: RowPress attack does not depends on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of attack
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer lowers
the limit on the adjacent rows before opens (RowHammers) causes bit flips.
Correct, but irrelevant.
Post by EricP
Also the paper notes that DRAM manufacturers, eg Micron and
Samsung, already document that keeping a row open longer can cause
read-disturbance. What's new is the paper documents the interaction
between row activation time and the subsequent number of opens
(RowHammers) needed to flip a bit.
Correct and relevant, but not to the issue at hand which is criticism
of Mitch's ideas of mitigation.
Post by EricP
Also note that different bits are susceptible to RowPress and
RowHammer. See section 4.3
RowPress Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf
"RowPress breaks memory isolation by keeping a DRAM row open for a
long period of time, which disturbs physically nearby rows enough to
cause bitflips. We show that RowPress amplifies DRAM’s vulnerability
to read-disturb attacks by significantly reducing the number of row
activations needed to induce a bitflip by one to two orders of
magnitude under realistic conditions. In extreme cases, RowPress
induces bitflips in a DRAM row when an adjacent row is activated only
once."
"We show that keeping a DRAM row (i.e., aggressor row) open for a long
period of time (i.e., a large aggressor row ON time, tAggON) disturbs
physically nearby DRAM rows. Doing so induces bitflips in the victim
row without requiring (tens of) thousands of activations to the
aggressor row."
Post by Michael S
For a given interruption rate, longer interruptions reduce
effectiveness of attack, but not dramatically so. For example, for
most practically important interruption rate of 128 KHz
(period=7.81 usec) increasing duration of off interval from
absolute minimum allowed by protocol (~50ns) to 2 usec reduces
efficiency of attack only by factor of 2 o 3.
Reduced by a factor of up to 363. Under figure 1.
"We observe that as tAggON increases, compared to the most effective
RowHammer pattern, the most effective Row-Press pattern reduces ACmin
1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON increases."
ACmin by itself is the wrong measure of the efficiency of an attack.
The right measure is the reciprocal of the total duration of the attack.
At any given duty cycle, the reciprocal of the total duration of the attack
grows with an increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants; the only
difference is that the dependency on hammering rate is somewhat weaker.

Relatively weak influence of duty cycle itself is shown in figure 22.

The practical significance of RowPress is due to two factors.
(1) is the one you mentioned above - it can flip different bits from
those flippable by other RH variants.
(2) is that it is not affected at all by the DDR4 TRR mitigation attempt.
 
The third, less important factor is that RowPress appears quite robust
to differences between major manufacturers.
 
However, one should not overlook that the efficiency of RowPress attacks,
when measured by the most important criterion of BER per duration of
attack, is many times lower than that of the earlier techniques of
double-sided and multi-sided hammering.
Michael S
2024-02-14 17:46:36 UTC
Permalink
On Wed, 14 Feb 2024 10:51:47 -0500
Post by Michael S
On Tue, 13 Feb 2024 11:24:10 -0500
Post by EricP
Post by Michael S
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
RowPress depends on keeping the row open too long--clearly
evident in the charts in the document.
Clarification for casual observers that didn't bother to read Row
Press paper: RowPress attack does not depends on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of
attack significantly increasing BER for a given duration of
attack. After all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer
lowers the limit on the adjacent rows before opens (RowHammers)
causes bit flips.
Correct, but irrelevant.
It was kinda the whole point of the RowPress paper.
Post by Michael S
Post by EricP
Also the paper notes that DRAM manufacturers, eg Micron and
Samsung, already document that keeping a row open longer can cause
read-disturbance. What's new is the paper documents the interaction
between row activation time and the subsequent number of opens
(RowHammers) needed to flip a bit.
Correct and relevant, but not to the issue at hand which is
criticism of Mitch's ideas of mitigation.
Post by EricP
Also note that different bits are susceptible to RowPress and
RowHammer. See section 4.3
RowPress Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf
I just found out that there are two different versions of the RowPress paper:
RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://arxiv.org/pdf/2306.17061.pdf
Post by Michael S
Post by EricP
Post by Michael S
For a given interruption rate, longer interruptions reduce
effectiveness of attack, but not dramatically so. For example, for
most practically important interruption rate of 128 KHz
(period=7.81 usec) increasing duration of off interval from
absolute minimum allowed by protocol (~50ns) to 2 usec reduces
efficiency of attack only by factor of 2 o 3.
Reduced by a factor of up to 363. Under figure 1.
"We observe that as tAggON increases, compared to the most
effective RowHammer pattern, the most effective Row-Press pattern
reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
as large as the refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON increases."
ACmin by itself is a wrong measure of efficiency of attack.
I'm not interested in the efficiency of the attack.
ACmin, the minimum absolute count of opens above which we lose data,
is the number I'm interested in.
You may be interested, but I don't understand why.
For me, the important thing is how much time it takes until the probability
of a flip becomes significant.
Suppose attack (A) hammers at 5 MHz and has ACmin=5e4, and attack (B)
hammers at 0.13 MHz (typical for RP in a real-world setup) and has
ACmin=3e3.
Then I'd say that attack (A) is 2.3 times more dangerous.

Back to the real world: researchers have demonstrated that multi-sided
hammering can have an ACmin significantly lower than our imaginary
attack (A), so the only remaining question is how fast we can hammer
without triggering TRR. My 5 MHz number is probably hard for an
attacker to achieve, but 2-3 MHz sounds doable.
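The comparison written out; the rates and ACmin values are the hypothetical
ones from the example above, not measurements:

// Time until ACmin activations have been issued, for the two hypothetical
// attacks compared above.
#include <cstdio>

int main()
{
    struct Attack { const char *name; double acmin; double rate_hz; };
    const Attack a{"A (fast multi-sided RH)", 5e4, 5e6};
    const Attack b{"B (RowPress-like)",       3e3, 0.13e6};
    const double ta = a.acmin / a.rate_hz;                  // 10 ms
    const double tb = b.acmin / b.rate_hz;                  // ~23 ms
    std::printf("%s: %.1f ms  %s: %.1f ms  ratio %.1f\n",
                a.name, ta * 1e3, b.name, tb * 1e3, tb / ta);   // ~2.3
}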
Post by Michael S
The right measure is reciprocal of the total duration of attack.
At any given duty cycle reciprocal of the total duration of attack
grows with increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants, the only
difference that dependency on hammering rate is somewhat weaker.
Relatively weak influence of duty cycle itself is shown in figure 22.
Looking at figure 22 on the arxiv version of the paper,
this is a completely different test. This test was to explain the
discrepancy between the RowPress results and the earlier cited papers.
BER is the fraction of DRAM cells in a DRAM row that experience
bitflips. It's a different measure because RowPress detects when ANY
data loss begins, not the fraction of lost data bits (efficiency)
after it kicks in.
Obsv 16 explains it: the BER for the bottom two lines,
which are the ones with a long total tA2A, goes up in all graphs
by a factor of between 10 and about 500, which is the RowPress effect.
To my eye what this test shows is the PRE phase may *heal* some of the
damaging effects that the ACT phase causes, but only to a certain
point. Possibly the PRE phase scavenges the ACT hot injection
carriers.
Post by Michael S
The practical significance of RowPress is due to two factors.
(1) is the factor is the one you mentioned above - it can flip
different bits from those flippable by other RH variants.
(2) is that it is not affected at all by DDR4 TRR
attempt of mitigation.
I take away something completely different: there are multiple
interacting error mechanisms at work here. RowHammer and RowPress are
likely completely different physics and fixing one won't fix the
other.
Different like coupling in different frequency bands - yes.
But both caused by insufficient isolation.
It also suggests there may be other similar mechanisms waiting to be
found.
Post by Michael S
The third, less important factor is that RowPress appears quite
robust to differences between major manufacturers.
However, one should not overlook that efficiency of RowPress attacks
when measured by the most important criterion of BER per duration of
attack is many times lower than earlier techniques of double-sided
and multi-sided hammering.
For me the BER is irrelevant if it is above 0.0.
I want to know where the errors start, which is ACmin.
So, call it time to first flip. The principle is the same.
Still, MSRH causes harm faster than RP.

EricP
2024-02-13 17:05:18 UTC
Permalink
Post by Michael S
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
RowHammer depends on closing the row too often.
Yes, except that it is unknown whether major RH impact is done by
closing the row or by opening it. The later is more likely. But since
the rate of opening and closing is the same, this finer difference is
not important.
A Deeper Look into RowHammer's Sensitivities: Experimental Analysis
of Real DRAM Chips and Implications on Future Attacks and Defenses, 2021
https://arxiv.org/pdf/2110.10291

That paper pre-dates the RowPress one and notes:

"6.1 Impact of Aggressor Row’s On-Time

Obsv. 8. As the aggressor row stays active longer (i.e., tAggON increases),
more DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at lower hammer counts."

Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
increases in DRAM chips from all four manufacturers.

6.2 Impact of Aggressor Row’s Off-Time

Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF increases),
fewer DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at higher hammer counts.

Obsv. 11. RowHammer vulnerability consistently reduces as
tAggOFF increases in DRAM chips from all four manufacturers."
Michael S
2024-02-14 08:50:28 UTC
Permalink
On Tue, 13 Feb 2024 12:05:18 -0500
Post by EricP
Post by Michael S
On Tue, 13 Feb 2024 00:19:18 +0000
Post by MitchAlsup1
RowHammer depends on closing the row too often.
Yes, except that it is unknown whether major RH impact is done by
closing the row or by opening it. The later is more likely. But
since the rate of opening and closing is the same, this finer
difference is not important.
A Deeper Look into RowHammers Sensitivities Experimental Analysis
of Real DRAM Chips and Implications on Future Attacks and Defenses,
2021 https://arxiv.org/pdf/2110.10291
"6.1 Impact of Aggressor Row’s On-Time
Obsv. 8. As the aggressor row stays active longer (i.e., tAggON
increases), more DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at lower hammer counts."
Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
increases in DRAM chips from all four manufacturers.
6.2 Impact of Aggressor Row’s Off-Time
Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF
increases), fewer DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at higher hammer counts.
Obsv. 11. RowHammer vulnerability consistently reduces as
tAggOFF increases in DRAM chips from all four manufacturers."
novaBBS has not been updating since yesterday, so Mitch is not aware of
our latest posts.
EricP
2024-02-01 14:20:24 UTC
Permalink
Post by Michael S
On Wed, 31 Jan 2024 17:17:21 GMT
Post by Anton Ertl
Post by Michael S
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples
"Executions of the CLFLUSH instruction are ordered with respect to
each other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.
Considering that CLFLUSH was introduced by Intel in 2000 or 2001,
and that at that time all of Intel's PCI/AGP root hubs had already been
fully I/O-coherent for several years, I find your theory unlikely.
Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly
- entering a deep sleep state in which the CPU caches are powered down
and DRAM is put into self-refresh mode.
CLFLUSH wouldn't be useful for that, as it flushes by virtual address.
It also allows all sorts of reorderings that you don't want to think about
during a (possibly emergency) cache sync.

The privileged WBINVD and WBNOINVD instructions are intended for that.
It sounds like they basically halt the core for the duration of the
write back of all modified lines.
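To make that concrete, a minimal ring-0 sketch of the flush-before-deep-sleep
step (the surrounding suspend sequence is assumed, not shown; WBNOINVD is
newer and would need a CPUID feature check first):

// Kernel/ring-0 only: write back and invalidate every modified line in
// this core's caches before they lose power. The core stalls until the
// write-back has completed.
static inline void flush_caches_before_deep_sleep()
{
    asm volatile("wbinvd" ::: "memory");
}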
Anton Ertl
2024-02-03 08:42:18 UTC
Permalink
Post by Michael S
On Wed, 31 Jan 2024 17:17:21 GMT
Post by Anton Ertl
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory controller
does not even know which rows are physically adjacent. POC authors
typically run lengthy tests in order to figure it out.
Given that the attackers can find out, it is just a lack of
communication between DRAM manufacturers and memory controller
manufacturers that results in that ignorance. Not a valid excuse.

There is a standardization committee (JEDEC) that documents how
various DRAM types are accessed, refreshed, etc. They put information
about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
the DIMMs, so they can also put information about row adjacency
there.
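To make the counting idea concrete, here is a rough conceptual sketch of such
a controller-side mitigation (a toy model only; the adjacency table, the
threshold and the targeted-refresh hook are invented for illustration, not
anything JEDEC specifies today):

#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model of a counter-based mitigation in the memory controller.
struct RowGuard {
    // Victim rows that are disturbed when a given row is activated,
    // e.g. derived from adjacency data published via the SPD ROM.
    std::unordered_map<uint32_t, std::vector<uint32_t>> neighbours;
    std::unordered_map<uint32_t, uint32_t> disturb_count;  // per victim row
    uint32_t threshold = 50000;      // chosen well below the real HCfirst

    void on_activate(uint32_t row)   // called on every ACT command
    {
        for (uint32_t victim : neighbours[row])
            if (++disturb_count[victim] >= threshold) {
                targeted_refresh(victim);        // extra refresh of that row
                disturb_count[victim] = 0;
            }
    }
    void targeted_refresh(uint32_t /*row*/) { /* issue REF/ACT to the row */ }
    // Counters would be cleared whenever the normal refresh window expires.
};

Published proposals (PARA, Graphene, the in-DRAM TRR variants) differ mainly
in how they bound this counter state, but all of them need the adjacency
knowledge discussed above.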
Post by Michael S
Post by Anton Ertl
With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
...
Post by Michael S
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
Yes. However, looking at Table III of
<https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
seem to be significant differences between manufacturers A and D on
one hand, and B and C on the other, with exploits taking much longer
for B and C, and failing in some cases.

One may wonder whether the DRAM manufacturers could have put their
physicists to the task of identifying the conditions under which bit
flips can occur, and of identifying the minimum refreshes necessary
to prevent these conditions from occurring. If they have not
done so, or if they have not implemented the resulting recommendations
(or passed them to the memory controller people), a certain amount of
blame rests on them.

Anyway, never mind the blame, looking into the future, I find it
worrying that I did not find any mention of Rowhammer protection in
the specs of DIMMs when I last looked.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2024-02-03 17:10:30 UTC
Permalink
Post by Anton Ertl
Post by Michael S
On Wed, 31 Jan 2024 17:17:21 GMT
Post by Anton Ertl
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory controller
does not even know which rows are physically adjacent. POC authors
typically run lengthy tests in order to figure it out.
Given that the attackers can find out, it is just a lack of
communication between DRAM manufacturers and memory controller
manufacturers that results in that ignorance. Not a valid excuse.
There is a standardization committee (JEDEC) that documents how
various DRAM types are accessed, refreshed, etc. They put information
about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
the DIMMs, so they can also put information about row adjacency
there.
Post by Michael S
Post by Anton Ertl
With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
....
Post by Michael S
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
Yes. However, looking at Table III of
<https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
seem to be significant differences between manufacturers A and D on
one hand, and B and C on the other, with exploits taking much longer
for B and C, and failing in some cases.
One may wonder whether the DRAM manufacturers could have put their
physicists to the task of identifying the conditions under which bit
flips can occur, and of identifying the minimum refreshes necessary
to prevent these conditions from occurring. If they have not
done so, or if they have not implemented the resulting recommendations
(or passed them to the memory controller people), a certain amount of
blame rests on them.
Anyway, never mind the blame, looking into the future, I find it
worrying that I did not find any mention of Rowhammer protection in
the specs of DIMMs when I last looked.
My information is that they (DRAM mfgs) looked and said they could not
fix a problem that emanated from the DRAM controller.
Post by Anton Ertl
- anton
EricP
2024-02-01 14:05:19 UTC
Permalink
Post by Anton Ertl
Post by Michael S
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
The text in the Intel Vol. 1 architecture manual indicates they viewed all
these cache-control instructions - PREFETCH, CLFLUSH, and CLFLUSHOPT -
as part of SSE for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.

Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
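As a small illustration of how user code pushes data toward DRAM with those
SSE-era primitives on x86-64 (a sketch only; buf is a placeholder for some
cacheable buffer):

#include <immintrin.h>

// Two ways to reach DRAM past the cache hierarchy from user space:
//  - a non-temporal store (MOVNTI) goes through write-combining buffers
//    instead of allocating a cache line;
//  - CLFLUSH writes back and evicts a line that is already cached.
void touch_dram(long long* buf)
{
    _mm_stream_si64(buf, 42);   // non-temporal store, bypasses the caches
    _mm_sfence();               // make the weakly ordered store globally visible

    buf[1] ^= 1;                // ordinary cached read-modify-write
    _mm_clflush(&buf[1]);       // write the line back to DRAM and evict it
}

The Rowhammer relevance is simply that both paths defeat the cache's natural
tendency to absorb repeated accesses to the same line.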
Michael S
2024-02-01 14:30:27 UTC
Permalink
On Thu, 01 Feb 2024 09:05:19 -0500
Post by EricP
Post by Anton Ertl
Post by Michael S
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC
"Executions of the CLFLUSH instruction are ordered with respect to
each other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to
the same cache line.1 They are not ordered with respect to
executions of CLFLUSHOPT to different cache lines."
By now, it seems obvious that making CLFLUSH instruction
non-privilaged and pretty much non-restricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH exists since very early
2000s, it is understandable.
IIRC, ARMv8 did the same mistake a decade later. It is less
understandable.
Ideally caches are fully transparent microarchitecture, then you
don't need stuff like CLFLUSH. My guess is that CLFLUSH is there
for getting DRAM up-to-date for DMA from I/O devices. An
alternative would be to let the memory controller remember which
lines are modified, and if the I/O device asks for that line, get
the up-to-date data from the cache line using the cache-consistency
protocol. This would turn CLFLUSH into a noop (at least as far as
writing to DRAM is concerned, the ordering constraints may still be
relevant), so there is a way to fix this mistake (if it is one).
The text in the Intel Vol. 1 architecture manual indicates they viewed all
these cache-control instructions - PREFETCH, CLFLUSH, and CLFLUSHOPT -
as part of SSE for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.
Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
According to Wikipedia, CLFLUSH was not introduced with SSE.
It was introduced together with SSE2, but formally is not part of it.
CLFLUSHOPT came much, much, much later and was likely related to the
Optane DIMM aspirations of the late 2010s.
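That matches how CLFLUSHOPT is typically used in persistent-memory code:
flush a batch of lines without CLFLUSH's serialization, then order them with
a single fence. A rough sketch (the record pointer and length are
placeholders; assumes a CPU with CLFLUSHOPT and a 64-byte line):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Push a freshly written log record out of the caches so it is durable
// in persistent memory before execution continues.
void flush_log_record(const void* rec, size_t len)
{
    const uintptr_t line = 64;
    uintptr_t p   = (uintptr_t)rec & ~(line - 1);
    uintptr_t end = (uintptr_t)rec + len;
    for (; p < end; p += line)
        _mm_clflushopt((const void*)p); // weakly ordered, flushes can overlap
    _mm_sfence();                       // order the flushes before later stores
}

CLWB was added later for the same pattern, with the advantage that the line
stays cached after the write-back.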
Chris M. Thomasson
2024-02-01 20:48:59 UTC
Permalink
Post by EricP
Post by Anton Ertl
Post by Michael S
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices.  An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.  This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
The text in the Intel Vol. 1 architecture manual indicates they viewed all
these cache-control instructions - PREFETCH, CLFLUSH, and CLFLUSHOPT -
as part of SSE for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.
Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
Then there are the LFENCE, SFENCE and MFENCE for write back memory.
Non-temporal stores, iirc.
Chris M. Thomasson
2024-02-01 20:49:46 UTC
Permalink
Post by Chris M. Thomasson
Post by EricP
Post by Anton Ertl
Post by Michael S
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH.  My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices.  An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.  This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
The text in the Intel Vol. 1 architecture manual indicates they viewed all
these cache-control instructions - PREFETCH, CLFLUSH, and CLFLUSHOPT -
as part of SSE for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.
Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
Then there are the LFENCE, SFENCE and MFENCE for write back memory.
Non-temporal stores, iirc.
Oops, non-write back memory! IIRC. Sorry.
a***@littlepinkcloud.invalid
2024-02-01 09:39:13 UTC
Permalink
Post by Michael S
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence, when
using non-volatile memory you need an efficient way to flush dcache to
the point of persistence. You need that in order to make sure that a
transaction has been written to a log.

With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent. Which is a good idea, but not everyone does it.

Andrew.
Michael S
2024-02-01 13:36:46 UTC
Permalink
On Thu, 01 Feb 2024 09:39:13 +0000
Post by a***@littlepinkcloud.invalid
Post by Michael S
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence,
when using non-volatile memory you need an efficient way to flush
dcache to the point of persistence. You need that in order to make
sure that a transaction has been written to a log.
With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent.
For the latter, a privileged flush instruction sounds sufficient.

For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction) - Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that the user-mode
accessibility of this instruction is controlled by the same bit as that of
two much more dangerous variants of DC (DC CVAC and DC CIVAC).
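For reference, the usual user-space sequence built on DC CVAU for the
code-modification case looks roughly like this (a sketch; assumes a 64-byte
line and that EL0 is allowed to issue the maintenance ops - real code reads
CTR_EL0 for the line sizes, or just calls __builtin___clear_cache):

#include <cstdint>

// AArch64: make instructions just written to [begin, end) visible to
// the instruction stream (JIT / self-modifying code case).
void sync_icache(char* begin, char* end)
{
    const uintptr_t line = 64;
    uintptr_t b = (uintptr_t)begin & ~(line - 1);
    uintptr_t e = (uintptr_t)end;
    for (uintptr_t p = b; p < e; p += line)
        asm volatile("dc cvau, %0" :: "r"(p) : "memory"); // clean dcache to PoU
    asm volatile("dsb ish" ::: "memory");                 // wait for the cleans
    for (uintptr_t p = b; p < e; p += line)
        asm volatile("ic ivau, %0" :: "r"(p) : "memory"); // invalidate icache to PoU
    asm volatile("dsb ish" ::: "memory");
    asm volatile("isb" ::: "memory");                     // resynchronize fetch
}

On cores that report I/D coherence in CTR_EL0 (the Neoverse N1 case mentioned
just below), the clean/invalidate loops can be skipped, but a barrier is still
needed before executing the new code.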
Post by a***@littlepinkcloud.invalid
Which is a good idea, but not everyone does it.
Andrew.
Neoverse N1 had it. I don't know about the rest of Neoverse series.
a***@littlepinkcloud.invalid
2024-02-02 10:20:17 UTC
Permalink
Post by Michael S
On Thu, 01 Feb 2024 09:39:13 +0000
Post by a***@littlepinkcloud.invalid
Post by Michael S
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence,
when using non-volatile memory you need an efficient way to flush
dcache to the point of persistence. You need that in order to make
sure that a transaction has been written to a log.
With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent.
For the latter, a privileged flush instruction sounds sufficient.
Does it? You're trying for high throughput, and a full system call
wouldn't help with that. And besides, if userspace can ask the kernel to
do something on its behalf, you haven't added any security by making
it privileged.
Post by Michael S
For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction) - Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that the user-mode
accessibility of this instruction is controlled by the same bit as that of
two much more dangerous variants of DC (DC CVAC and DC CIVAC).
Ah, thanks.

Andrew.