Vir Campestris
2024-01-30 16:36:17 UTC
I've knocked up a little utility program to try to work out some
performance figures for my CPU.
It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384KB L1 cache
What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.
A C++ fragment is this. I can post the whole thing if it would help.
    // Calculate a bit mask for the entire store
    Word mask = storeWordCount - 1;

    Stopwatch s;
    s.start();
    while (1)                   // until break when mask runs out
    {
        for (size_t index = 0; index < storeWordCount; ++index)
        {
            // read and write a word in store.
            Raw[index & mask] ^= index;
        }
        s.lap(mask);            // records the current time
        if (mask == 0)
            break;              // stop if we've run out of mask
        mask >>= 1;             // shrink the mask
    }
As you can see it starts with a large mask (in fact for a whole GB) and
halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large mask;
at 4MB it goes up to 15GB/s, at 8MB up to 23GB/s. It holds that rate as
the mask gets smaller, with no apparent change when it drops below the
L1 cache size.
But...
When the mask is very small (3) it slows to 18GB/s. With a mask of 1 it
halves again, and with zero (so it only operates on the same word over
and over) it halves once more, to roughly a fifth of the speed I get
with a large block.
Something odd is happening when I hammer the same few locations (32
bytes and on down): it gets slower, yet those words ought to be sitting
in the L1 data cache.
A late thought was to replace that ^= index with something that reads
the memory only, or writes it only, instead of doing a
read-modify-write cycle. Writes turn out much faster than reads, and
neither the read-only nor the write-only version shows this odd
slowdown with small masks.
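The read-only and write-only variants are along these lines (a sketch, not the exact code I ran; the checksum in the read loop is only there so the compiler can't optimise the loads away):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint64_t;

// Read-only pass: loads each masked word and folds it into a checksum,
// so the loads have an observable result and survive optimisation.
Word readOnly(const std::vector<Word>& raw, Word mask)
{
    Word sum = 0;
    for (size_t index = 0; index < raw.size(); ++index)
        sum ^= raw[index & mask];       // load only
    return sum;
}

// Write-only pass: stores the index without reading the old value,
// so there is no load feeding the store.
void writeOnly(std::vector<Word>& raw, Word mask)
{
    for (size_t index = 0; index < raw.size(); ++index)
        raw[index & mask] = index;      // store only
}
```

Timing those two with the same shrinking-mask sweep is what showed me writes beating reads, with no small-mask dip in either.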
What am I missing?
Thanks
Andy