Post by Robert Myers
Post by David L. Craig
If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
If no one ever goes blue sky and asks: what is even physically
possible without worrying what may or may not be already in the works
at Intel, then we are forever limited, even in the imagination, to
what a marketdroid at Intel believes can be sold at Intel's customary
margins.
Coupling this to stuff we said earlier about
a) sequential access patterns, brute force - neither of us considers that
interesting
b) random access patterns
c) what you, Robert, said you were most interested in, and rather nicely
called "crystalline" access patterns. By the way, I rather like that
term: it is much more accurate than saying "stride-N", and encapsulates
several sorts of regularity.
Now, I think it can be said that a machine that does random access
patterns efficiently also does "crystalline" access patterns. Yes?
I can imagine optimizations specific to the crystalline access patterns,
that do not help true random access. But I'd like to kill two birds
with one stone.
So, how can we make these access patterns more effective?
Perhaps we should lose the cache line orientation, which wastes bandwidth
transferring data bytes that aren't needed.
I envision an interconnect fabric that is completely scatter/gather
oriented. We don't do away with burst or block operations: we always
transfer, say, 64 bytes at a time. But into those 64 bytes we might
pack, say, 4 pairs of 64-bit address and 64-bit data, for stores. Or
perhaps bursts of 128 bytes, mixing tuples of 64-bit address and 128-bit
data. Or maybe... compression, whatever. Stores are the complicated
case; reads are relatively simple: vectors of, say, 8 64-bit addresses.
By the way, this is where strided or crystalline access patterns might
have some advantages: they may compress better.
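
To make the packing concrete, here is a rough C sketch of what such
64-byte bursts might look like. The names and layout are purely
illustrative (and leave out routing headers and the byte valids I
discuss below):

#include <stdint.h>

#define SG_STORES_PER_BURST 4
#define SG_READS_PER_BURST  8

/* One 64-byte store burst: 4 tuples of 64-bit address + 64-bit data. */
struct sg_store_tuple {
    uint64_t addr;              /* byte address being stored to */
    uint64_t data;              /* 64 bits of store data        */
};

struct sg_store_burst {
    struct sg_store_tuple t[SG_STORES_PER_BURST];  /* 4 * 16 = 64 bytes */
};

/* One 64-byte read-request burst: a vector of 8 64-bit addresses. */
struct sg_read_burst {
    uint64_t addr[SG_READS_PER_BURST];
};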
Your basic processing element produces such scatter/gather load or store
requests. Particularly if it has scatter/gather vector instructions
like Larrabee (per Wikipedia), or if it is a SIMT coherent threaded
architecture like the GPUs. The scatter/gather operations emitted by a
processor need not be directed at a single target - they may be split
and merged as they flow through the fabric.
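
A toy sketch of that splitting, with a made-up routing function and
packet layout; partially filled output bursts would then be merged with
other traffic headed the same way:

#include <stdint.h>

#define TUPLES_PER_BURST 4
#define FABRIC_PORTS     4

struct sg_tuple { uint64_t addr, data; };
struct sg_burst { int n; struct sg_tuple t[TUPLES_PER_BURST]; };

/* Pick an output port from high address bits - say, 4 memory
   controllers interleaved at 1 GB granularity.  Purely illustrative. */
static unsigned port_of(uint64_t addr)
{
    return (addr >> 30) % FABRIC_PORTS;
}

/* Split one incoming burst across the output ports. */
void split_burst(const struct sg_burst *in, struct sg_burst out[FABRIC_PORTS])
{
    for (int p = 0; p < FABRIC_PORTS; p++)
        out[p].n = 0;
    for (int i = 0; i < in->n; i++) {
        unsigned p = port_of(in->t[i].addr);
        out[p].t[out[p].n++] = in->t[i];
    }
}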
In order to eliminate unnecessary full-cache line flow, we do not
require read-for-ownership. But we don't go the stupid way of
write-through. I lean towards having a valid bit per byte, in these
scatter-gather requests, and possibly in the caches. As I have
discussed in this newsgroup before, this allows us to have writeback
caches where multiple processors can write to the same memory location
simultaneously. The byte valids allow us to live with weak memory
ordering, while doing away with the bad problem of losing data when people
write to different bytes of the same line simultaneously. In fact,
depending on the interconnection fabric topology, you might even have
processor ordering. But basically it eliminates the biggest source of
overhead in cache coherency.
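
Here is a toy sketch of the merge that the byte valids make possible:
each writer marks only the bytes it actually wrote, so writebacks from
different processors to the same line can be combined without losing
data. Names and layout are illustrative only.

#include <stdint.h>

#define LINE_BYTES 64

struct cache_line {
    uint8_t  data[LINE_BYTES];
    uint64_t byte_valid;        /* bit i set => byte i was written here */
};

/* Merge a writeback 'wb' into the memory/home copy 'home'. */
void merge_writeback(struct cache_line *home, const struct cache_line *wb)
{
    for (int i = 0; i < LINE_BYTES; i++)
        if (wb->byte_valid & (1ULL << i))
            home->data[i] = wb->data[i];
    home->byte_valid |= wb->byte_valid;
}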
Of course, you want to handle non-cache friendly memory access patterns.
I don't think you can safely get rid of caches; but I think that there
should be a full suite of cache control operations, such as today's
ISAs only partially provide.
Such a scatter/gather memory subsystem might exist in the fabric. It
works best with processor support to generate and handle the
scatter/gather requests and replies. (Yes, the main thing is in the
interconnect; but some processor support is needed, to get crap out of
the way of the fabric).
The scatter/gather interconnect fabric might be interfaced to
conventional DRAMs, with their block transfers of 64 or 128 bytes. If
so, I would be tempted to create a memory side cache - a cache that is
in the memory controller, not the processor - seeking to leverage some
of the wasted parts of cache lines. With cache control, of course.
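
Something like the following sketch, where everything - names, sizes,
the direct-mapped lookup - is invented for illustration: sub-line reads
that hit the same DRAM burst get served from a single fetch.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define DRAM_BURST 64
#define MSC_LINES  1024                 /* tiny direct-mapped example */

struct msc_line {
    bool     valid;
    uint64_t tag;                       /* line (burst) number        */
    uint8_t  data[DRAM_BURST];
};

static struct msc_line msc[MSC_LINES];

/* Stand-in for a real 64-byte DRAM burst read. */
static void dram_fetch(uint64_t line_addr, uint8_t out[DRAM_BURST])
{
    (void)line_addr;
    memset(out, 0, DRAM_BURST);
}

/* Return 8 bytes at 'addr' (assumed 8-byte aligned, so the access never
   crosses a burst), going to DRAM only when the memory-side cache misses. */
uint64_t msc_read8(uint64_t addr)
{
    uint64_t line = addr / DRAM_BURST;
    struct msc_line *e = &msc[line % MSC_LINES];

    if (!e->valid || e->tag != line) {
        dram_fetch(line * DRAM_BURST, e->data);
        e->tag = line;
        e->valid = true;
    }
    uint64_t v;
    memcpy(&v, &e->data[addr % DRAM_BURST], sizeof v);
    return v;
}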
However, if there is any chance of getting DRAM architectures to be more
scatter/gather friendly, great. But the people who can really talk
about that are Tom Pawlowski at Micron, and his counterpart at Samsung.
I've not been at a company that could influence DRAM much, since
Motorola in the late 1980s. And I dare say that Mitch didn't make much
headway there. I've mentioned Tom Pawlowski's vision, as presented at
SC09 and elsewhere, of an abstract DRAM interface for stacked DRAM+logic
units. I think the scatter/gather approach I describe above should be
a candidate for such an abstract interface.
If there is anyone that thinks that there is a great new memory
technology coming down the pike that will make the bandwidth wars
easier, I'd love to hear about it. For that matter, the impending
integration of non-volatile memory is great - but as I understand
things, it will probably make the memory hierarchy even more sequential
bandwidth oriented, unfriendly to other access patterns.
On this fabric, we also pass messages - probably with instruction set
support to directly produce messages, and mechanisms such as TLBs to
route them without OS intervention.
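
Roughly along these lines - all names hypothetical - where a user-level
"mailbox" id is translated, TLB-style, to a fabric node and receive
queue, so a send needs no OS involvement on the fast path:

#include <stdint.h>
#include <stdio.h>

#define MSG_TLB_ENTRIES 64

struct msg_route {
    int      valid;
    uint32_t node;                      /* destination fabric node          */
    uint32_t queue;                     /* hardware receive queue at node   */
};

/* Filled in by the OS once; consulted by hardware on every send. */
static struct msg_route msg_tlb[MSG_TLB_ENTRIES];

/* Stand-in for injecting a message burst into the fabric. */
static void fabric_send(uint32_t node, uint32_t queue,
                        const void *payload, int bytes)
{
    (void)payload;
    printf("send %d bytes to node %u queue %u\n", bytes, node, queue);
}

/* User-level send: the mailbox id is translated like a TLB lookup;
   an invalid entry would trap to the OS, much like a TLB miss.        */
int msg_send(uint32_t mailbox, const void *payload, int bytes)
{
    struct msg_route r = msg_tlb[mailbox % MSG_TLB_ENTRIES];
    if (!r.valid)
        return -1;
    fabric_send(r.node, r.queue, payload, bytes);
    return 0;
}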
I.e. my overall approach is: eliminate unnecessary full cache line
transfers, emphasize scatter/gather. Make the most efficient use of
what we have.
Now, I remain an unrepentant mass market computer architect. Some
people want to design the fastest supercomputer in the world; I want to
design the computer my mother uses. But, I'm not so far removed from
the buildings-full-of-stuff supercomputers that Robert Myers describes.
First, I have worked on such. But, second, I'm interested in much of
this not just because it is relevant to cost no barrier supercomputers,
but also because it is relevant to mass markets.
Most specifically, datacenters. Although datacenters tend not to use
large-scale shared memory, and tend to be unwilling to compromise the
memory ordering and cache coherency guidelines in their small-scale
shared-memory nodes, I suspect that PGAS has applications, e.g. to
Hadoop-like map/reduce. Moreover, much of this scatter/gather is also
what network routers want - that OTHER form of computing system that can
occupy large buildings, but which also comes in smaller flavors.
Finally, the above applies even to moderate-sized multiprocessor
systems, say 16 or 32 processors, in manycore chips.
I.e. I am interested in such scatter/gather memory and interconnect,
which make the most efficient use of bandwidth, because they apply to the
entire range, from cost-no-barrier supercomputers to the mass market.