Discussion:
Effects of Memory Latency and Bandwidth on Supercomputer Application Performance
Robert Myers
2010-07-23 03:53:48 UTC
Permalink
Thanks to all who contributed their thoughts on high-bandwidth computing.

A wiki and mailing list are on their way for those who wish to pursue
the discussion.

I have previously posted a link here to this article

On the Effects of Memory Latency and Bandwidth on Supercomputer
Application Performance. Richard Murphy, Sandia National Laboratories.
www.sandia.gov/~rcmurph/doc/latency.pdf

Google groups search is so broken that I can't find my previous post or
the discussion around it.

Robert.
Brett Davis
2010-07-25 04:03:47 UTC
Permalink
Post by Robert Myers
On the Effects of Memory Latency and Bandwidth on Supercomputer
Application Performance. Richard Murphy. Sandia National Laboratories.
www.sandia.gov/~rcmurph/doc/latency.pdf
This paper has pretty graphs, but like most academic papers it has a
fatal flaw: they clearly did not scale the fetch-ahead to match the
bandwidth, so not a single benchmark improved with greater bandwidth.
Newer PowerPCs are available with greater bandwidth, and the
performance of those chips clearly would not match these graphs.

I know everyone wants reduced latency, but technology favors
increasing latency to get more bandwidth. At least until RRAM gets
embedded on the CPU die, then you will get reduced latency and
even more bandwidth, whether you want it or not. ;)

Of course, if your problem does not fit in the X gigabytes that are on die,
then you will have to wait for the 4K page to be streamed in.
Can you say really bad latency? Yes, I know you can. ;)

Brett
Robert Myers
2010-07-25 04:31:38 UTC
Permalink
Post by Brett Davis
Post by Robert Myers
On the Effects of Memory Latency and Bandwidth on Supercomputer
Application Performance. Richard Murphy. Sandia National Laboratories.
www.sandia.gov/~rcmurph/doc/latency.pdf
This paper has pretty graphs, but like most academic papers it has a
fatal flaw: they clearly did not scale the fetch-ahead to match the
bandwidth, so not a single benchmark improved with greater bandwidth.
Newer PowerPCs are available with greater bandwidth, and the
performance of those chips clearly would not match these graphs.
I know everyone wants reduced latency, but technology favors
increasing latency to get more bandwidth. At least until RRAM gets
embedded on the CPU die, then you will get reduced latency and
even more bandwidth, whether you want it or not. ;)
Of course, if your problem does not fit in the X gigabytes that are on die,
then you will have to wait for the 4K page to be streamed in.
Can you say really bad latency? Yes, I know you can. ;)
I was asked if I had ever looked at studies of the effects of
bandwidth by others, and the answer is that I have.

Since I don't get to sit in on the deliberations of the national labs
as to how they will turn money into carbon emissions, I consider
worthwhile any kind of evidence as to their thinking, no matter how
superficial.

I agree. The plots are splendid. They always are.

Robert.
Brett Davis
2010-07-25 05:27:38 UTC
Permalink
In article
Post by Robert Myers
Post by Brett Davis
Post by Robert Myers
On the Effects of Memory Latency and Bandwidth on Supercomputer
Application Performance. Richard Murphy. Sandia National Laboratories.
www.sandia.gov/~rcmurph/doc/latency.pdf
This paper has pretty graphs, but like most academic papers it has a
fatal flaw: they clearly did not scale the fetch-ahead to match the
bandwidth, so not a single benchmark improved with greater bandwidth.
Newer PowerPCs are available with greater bandwidth, and the
performance of those chips clearly would not match these graphs.
I was asked if I had ever looked at studies of the effects of
bandwidth by others, and the answer is that I have.
I agree. The plots are splendid. They always are.
Stanford online has a truly awesome related iTunes talk by Bill Dally:
http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.edu.3692287061.03692287064.3946505769?i=2123730674
From:
http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.edu.3692287061.03692287064
iTunes store search term "The future of throughput".
Track 13 for the course "Programming Massively Parallel Processors with CUDA (CS193G)".

A related PDF:
http://cva.stanford.edu/people/dally/ISSCC2005.pdf

Brett
Robert Myers
2010-07-26 03:12:46 UTC
Permalink
Post by Brett Davis
In article
Post by Robert Myers
Post by Brett Davis
Post by Robert Myers
On the Effects of Memory Latency and Bandwidth on Supercomputer
Application Performance. Richard Murphy. Sandia National Laboratories.
www.sandia.gov/~rcmurph/doc/latency.pdf
This paper has pretty graphs, but like most academic papers it has a
fatal flaw: they clearly did not scale the fetch-ahead to match the
bandwidth, so not a single benchmark improved with greater bandwidth.
Newer PowerPCs are available with greater bandwidth, and the
performance of those chips clearly would not match these graphs.
I was asked if I had ever looked at studies of the effects of
bandwidth by others, and the answer is that I have.
I agree.  The plots are splendid.  They always are.
Stanford online has a truly awesome related iTunes talk by Bill Dally: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
From: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
iTunes store search term "The future of throughput".
Track 13 for the course "Programming Massively Parallel Processors with CUDA (CS193G)".
I got stopped on Bill Dally's first slide, where he says
"efficiency=locality."

He claims that his slide tells you everything you need to know about
the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.

THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
INTERESTING PROBLEMS ARE NOT LINEAR.

A problem is embarrassingly parallel IFF it is local IFF it is linear.

If a problem is linear, then there is a representation in which it is
both local and embarrassingly parallel. If a problem is not linear,
then there is, in general, no such representation. Does one need a
PhD from Cal Tech to understand this?

I suppose that it is inevitable that EE's should believe that
interesting things can happen in a linear universe, but, from a
mathematical point of view, linear systems are as interesting as the
aging of paintings hanging in the Louvre.

Robert.
Post by Brett Davis
A related PDF:http://cva.stanford.edu/people/dally/ISSCC2005.pdf
Brett
Andy Glew <"newsgroup at comp-arch.net">
2010-07-26 05:33:36 UTC
Permalink
Post by Robert Myers
I got stopped on Bill Dally's first slide, where he says
"efficiency=locality."
He claims that his slide tells you everything you need to know about
the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
INTERESTING PROBLEMS ARE NOT LINEAR.
A problem is embarrassingly parallel IFF it is local IFF it is linear.
If a problem is linear, then there is a representation in which it is
both local and embarrassingly parallel. If a problem is not linear,
then there is, in general, no such representation. Does one need a
PhD from Cal Tech to understand this?
(1) Is there a proof for this? Not that there are non-linear systems
that are not embarrassingly parallel, but that there are no interesting
non-linear systems that are not amenable to parallel solutions.

E.g. the N-body problem, gravitational dynamics?

(2) If what you and Dally say is true, Robert, then you may be tilting
at windmills. There may be no computationally efficient way of doing
what you want.

I don't believe this, because I do not equate computationally efficient
with embarrassingly parallel.

Also: even though locality really matters, nevertheless we have used
that as an excuse to pessimize non-local activities. At the very least
we can influence the constant factor by removing these pessimizations.
As I have attempted to do with my proposal for a scatter/gather based
memory subsystem.
Robert Myers
2010-07-26 17:16:21 UTC
Permalink
Post by Robert Myers
I got stopped on Bill Dally's first slide, where he says
"efficiency=locality."
He claims that his slide tells you everything you need to know about
the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
INTERESTING PROBLEMS ARE NOT LINEAR.
A problem is embarrassingly parallel IFF it is local IFF it is linear.
If a problem is linear, then there is a representation in which it is
both local and embarrassingly parallel.  If a problem is not linear,
then there is, in general, no such representation.  Does one need a
PhD from Cal Tech to understand this?
(1) Is there a proof for this?  Not that there are non-linear systems
that are not embarrassingly parallel, but that there are no interesting
non-linear systems that are not amenable to parallel solutions.
E.g. the N-body problem, gravitational dynamics?
First, you have to understand that what I think of as a huge problem
the DoE (for example) does not even recognize as a problem at all,
never mind as a huge problem.

Bill Dally later whisks through the logic, where he talks about
calculating fluxes that never make it off the graphics card. If you
think way back to the divergence theorem, you can see what they're
doing. You reduce the nonlinear terms in the Navier-Stokes (or other
equation involving transport) to fluxes through the surface of a
control volume. Voila! The non-linear term is local.
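
In symbols (a standard statement of the divergence theorem, with F the
flux of the transported quantity; my notation, not Dally's):

$$\int_V \nabla\cdot\mathbf{F}\,dV \;=\; \oint_{\partial V} \mathbf{F}\cdot\mathbf{n}\,dS,$$

so each finite-volume cell needs only the surface fluxes F.n on its own
faces, and the update of the cell average is purely local.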

Since you've got all these flops and not much bandwidth, you only ever
transmit and receive fluxes across the boundary. What's more, if
you're the DoE, you can add all kinds of chemical reactions and
constitutive equations to keep the flops occupied, EVEN THOUGH YOU ARE
USING A TERRIBLE APPROXIMATION OF THE TRANSPORT EQUATIONS, even without
the complication of all those chemical reactions (or whatever).

Control volumes, of course, do not appear in the Navier-Stokes
equations, which contain differential operators. Any accurate
approximation to the differential operators is non-local, and, in
fact, the only one with quantifiable limitations in the face of non-
linearity is global (a spectral representation of the operator).
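
To make that concrete: apply second-order centered differencing to a
single Fourier mode $u_j = e^{ikx_j}$ on a grid of spacing $h$:

$$\frac{u_{j+1}-u_{j-1}}{2h} = i\,\frac{\sin(kh)}{h}\,e^{ikx_j},$$

so the discrete operator multiplies by the "modified wavenumber"
$\sin(kh)/h$ instead of the exact $k$. The error is negligible only for
$kh \ll 1$ and is catastrophic at the shortest resolved scales, whereas
the spectral representation multiplies by exactly $ik$ at every resolved $k$.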

This all sounds very arcane, I'm sure, and the DoE wants you to think
it is arcane, but the physics are driven by the details of how the
shortest scales interact with the longest scales, and when you
misrepresent the differential operators with a crude approximation,
you are changing the exact physics you want to look at.

If you can't build computers that can actually do global FFT's, then
you will never be able to examine what the numerics are actually
doing, and I do truly believe that that's the way the supercomputer
bandwagon wants things. That way, they can publish gorgeous pictures
of the Rayleigh-Taylor instability, even while using methods of
unknowable accuracy.

It's important enough that, in the middle of his sales pitch, Bill
Dally feels compelled to whisk through an admission of what's really
going on, even though I strongly suspect that he has no idea of the
mathematical consequences of the methods he is using. This whole
"supercomputer" shenanigan (and with it miraculous GPU's that will
unravel the secrets of the universe with practically no bandwidth) is
built on a crude fudge. The fudge is SO convenient, though, that the
empire builders and software writers just want to get on with it,
without worrying that they might have lost contact with reality before
they even started.
(2) If what you and Dally say is true, Robert, then you may be tilting
at windmills. There may be no computationally efficient way of doing
what you want.
I don't believe this, because I do not equate computationally efficient
with embarrassingly parallel.
Also: even though locality really matters, nevertheless we have used
that as an excuse to pessimize non-local activities.   At the very least
we can influence the constant factor by removing these pessimizations.
As I have attempted to do with my proposal for a scatter/gather based
memory subsystem.
I may well be tilting at windmills. I've understood that all along.

Even if you can build computers with sufficient global bandwidth (can
afford the hardware), the energy costs might kill you. I don't really
know.

The FFT isn't local, but it diagonalizes the derivative (DOES TRULY
MAKE THE CALCULATION LOCAL), and it can be done with very great
efficiency, IFF you have sufficient global bandwidth.

Personally, I think that if I'm tilting at windmills, it's not because
of any actual physical limitation or any real financial limitation
that can't be overcome, but because dealing with the foundational
reality I keep harping on will interfere with the march of petaflops
and the endless flow of pretty pictures.

Robert.
Chris Gray
2010-07-26 23:29:38 UTC
Permalink
Post by Robert Myers
First, you have to understand that what I think of as a huge problem
the DoE (for example) does not even recognize as a problem at all,
never mind as a huge problem.
... etc ...

I know virtually nothing about the math and coding of these things, and so
should just stay out of this. But:

1) surely someone reading this newsgroup has a simulation/whatever that can
be fairly easily converted from the "bad" locality assumptions into one
that runs globally. Can't you just hack some of the constants so that it
only has one "cell" instead of many? Don't the environments support things
like large arrays spanning multiple system nodes?

2) surely someone reading this newsgroup has some kind of access to a system
that can try to run the resulting simulation. I figure that back in the
old days at Myrias we could have found a way to do it. We could likely
have done a weekend run on 1024 nodes, longer on fewer nodes.

It could even be in the interests of hardware vendors to do this. If the
run proves Robert right, then there could be lots of new money coming to
research systems able to run globally.

Likely one run wouldn't be enough, but it would at least be a start.
--
Chris Gray
Rick Jones
2010-07-26 23:16:48 UTC
Permalink
Post by Chris Gray
Post by Robert Myers
First, you have to understand that what I think of as a huge problem
the DoE (for example) does not even recognize as a problem at all,
never mind as a huge problem.
... etc ...
I know virtually nothing about the math and coding of these things, and so
1) surely someone reading this newsgroup has a simulation/whatever that can
be fairly easily converted from the "bad" locality assumptions into one
that runs globally. Can't you just hack some of the constants so that it
only has one "cell" instead of many? Don't the environments support things
like large arrays spanning multiple system nodes?
2) surely someone reading this newsgroup has some kind of access to a system
that can try to run the resulting simulation. I figure that back in the
old days at Myrias we could have found a way to do it. We could likely
have done a weekend run on 1024 nodes, longer on fewer nodes.
It could even be in the interests of hardware vendors to do this. If the
run proves Robert right, then there could be lots of new money coming to
research systems able to run globally.
Likely one run wouldn't be enough, but it would at least be a start.
Perhaps the folks at ScaleMP?

rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Robert Myers
2010-07-27 00:24:51 UTC
Permalink
Post by Chris Gray
Post by Robert Myers
First, you have to understand that what I think of as a huge problem
the DoE (for example) does not even recognize as a problem at all,
never mind as a huge problem.
... etc ...
I know virtually nothing about the math and coding of these things, and so
1) surely someone reading this newsgroup has a simulation/whatever that can
   be fairly easily converted from the "bad" locality assumptions into one
   that runs globally. Can't you just hack some of the constants so that it
   only has one "cell" instead of many? Don't the environments support things
   like large arrays spanning multiple system nodes?
2) surely someone reading this newsgroup has some kind of access to a system
   that can try to run the resulting simulation. I figure that back in the
   old days at Myrias we could have found a way to do it. We could likely
   have done a weekend run on 1024 nodes, longer on fewer nodes.
   It could even be in the interests of hardware vendors to do this. If the
   run proves Robert right, then there could be lots of new money coming to
   research systems able to run globally.
Likely one run wouldn't be enough, but it would at least be a start.
It's not as if any of this just bubbled up out of the ground like
methane around the Deepwater Horizon well.

People who need to get the physics right know how to do it:

http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf

It would be fascinating to see what would happen if you tried to
reproduce these kinds of fundamental calculations using the kind of
differencing that Dally apparently takes for granted. It would be a
significant undertaking, and I don't think anyone would be surprised
that such calculations would yield results that would be wrong for
reasons that one could easily predict ahead of time: the real physics
of isotropic turbulence would be swamped by numerical diffusion.

In my admittedly cynical view of the world, when you need to get it
right, and you know it will be obvious that you have gotten it wrong,
no one uses the methods that the DoE apparently designs its computers
around. If all you need is a calculation that isn't obviously wrong
without doing some subtle checking, crude differencing schemes are the
rule and highly accurate differencing schemes are the exception. At
one time, fluid mechanical calculations were so expensive that you
were happy to get something that just looked right.

We shouldn't have to live that way any more. I can't guarantee that
the kind of comparison you'd think would be obvious to do hasn't been
done, but the incentive to do it is low and the risks very high. As
someone who really does know has said to me in a friendly way: I'm
likely to make a lot of enemies.

A careful comparison of local, low-order differencing results with
global, high-order differencing results is an obvious thing to do, but
if there is any such study with any computer even remotely state-of-
the-art (and, remember, state of the art "supercomputers" are
*terrible* for FFT's), I'm not familiar with it. It's an obvious
thing to do.

Robert.
Andy Glew <"newsgroup at comp-arch.net">
2010-07-27 02:34:54 UTC
Permalink
Post by Robert Myers
A careful comparison of local, low-order differencing results with
global, high-order differencing results is an obvious thing to do, but
if there is any such study with any computer even remotely state-of-
the-art (and, remember, state of the art "supercomputers" are
*terrible* for FFT's), I'm not familiar with it. It's an obvious
thing to do.
Can you remind us (okay, me, the lazy one) what the scaling of bandwidth
requirements with problem size is for FFTs?

I.e. for an NxNxN 3D FFT, DP coordinates, what is the bandwidth? Phrase
it solely in terms of sending around 64-bit DP numbers.

And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other. I.e. do NOT neglect bandwidth to
processor local memory.

Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors. Which is where my scatter/gather interconnect would help.

We might want to create 3D tiles.

I'm staring at this thinking compression. Adjacent points are probably
close in value. Now, compression of cache lines doesn't help that much,
unless you have sequential access patterns even bigger than a cache
line. But in my scatter/gather world, any compression might help, since
it can be packed with other odd sized packets.
Robert Myers
2010-07-27 04:06:38 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
Post by Robert Myers
A careful comparison of local, low-order differencing results with
global, high-order differencing results is an obvious thing to do, but
if there is any such study with any computer even remotely state-of-
the-art (and, remember, state of the art "supercomputers" are
*terrible* for FFT's), I'm not familiar with it.  It's an obvious
thing to do.
Can you remind us (okay, me, the lazy one) what the scaling of bandwidth
requirements with problem size is for FFTs?
FFT pseudospectral code on the Cray-1 (~1 byte/flop--Mitch can correct
me, I'm sure) was memory-bound in vector mode. The Cray moved 64-bit
words.
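
As a rough back-of-the-envelope for the scaling question (my arithmetic,
assuming a plain three-sweep transform with no cache reuse between
sweeps): an $N^3$ complex DP grid is $16N^3$ bytes and costs about
$5N^3\log_2 N^3 = 15N^3\log_2 N$ flops. Each of the three axis sweeps
reads and writes the whole array, so main-memory traffic is at least
$6\times 16N^3 = 96N^3$ bytes, i.e.

$$\frac{96N^3}{15N^3\log_2 N} = \frac{96}{15\log_2 N}\ \text{bytes/flop},$$

about 0.5 bytes/flop at $N = 8192$. The required ratio falls only
logarithmically with problem size.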
Post by Andy Glew <"newsgroup at comp-arch.net">
I.e. for an NxNxN 3D FFT, DP coordinates, what is the bandwidth?  Phrase
it solely in terms of sending around 64-bit DP numbers.
And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other.  I.e. do NOT neglect bandwidth to
processor local memory.
Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors.  Which is where my scatter/gather interconnect would help.
Lots of possibilities for cleverness, in particular by simply renaming
rather than moving--e.g. who cares if things are physically in bit
reversed order under certain conditions. Also, you don't need a
general interconnect if what you really want to do are FFT's.
Post by Andy Glew <"newsgroup at comp-arch.net">
We might want to create 3D tiles.
Very likely. So much depends on the details of the interconnect.
I've done some speculation on possibilities, which I will pass along.
Post by Andy Glew <"newsgroup at comp-arch.net">
I'm staring at this thinking compression.  Adjacent points are probably
close in value.  Now, compression of cache lines doesn't help that much,
unless you have sequential access patterns even bigger than a cache
line.  But in my scatter/gather world, any compression might help, since
it can be packed with other odd sized packets.
The really tormented sections of data are likely to be compact in
physical space, if that helps. In spectral space, things will be wild
because of phase. One might learn a lot about turbulence just by
looking at the possibilities for compression. Might be some serious
prizes or at least some noteworthy papers along that line of inquiry
if you found anything interesting.

I'm sorry if this seems a tad superficial. This discussion takes a
lot out of me.

Robert.
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-27 07:26:39 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other. I.e. do NOT neglect bandwidth to
processor local memory.
Nice way of putting it: All accesses are just as expensive, i.e. very
expensive.
Post by Andy Glew <"newsgroup at comp-arch.net">
Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors. Which is where my scatter/gather interconnect would help.
We might want to create 3D tiles.
This sounds a lot more reasonable: 1D FFT is very nice from a cache
viewpoint, while 2D and 3D are much worse.

Could we, instead of 3D tiles, use 3-way address interleaving, similar
to what some graphics formats use?

I.e. Larrabee, which has been (at least for the time being) repurposed as
an HPC platform, has a couple of opcodes for two-way and three-way bit
interleaving.

Using this, my naive interpretation would be that you would get
enormously better cache behavior than from a regular (power-of-two
stride) 2D or 3D FFT?
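
For reference, a minimal software sketch of the 3-way interleave (the
standard Morton-order bit tricks; the Larrabee opcodes would presumably
do this in one instruction):

#include <stdint.h>

/* Spread the low 10 bits of v so there are two zero bits between
   successive data bits; the magic masks are the standard ones. */
static uint32_t spread3(uint32_t v)
{
    v = (v | (v << 16)) & 0x030000FFu;
    v = (v | (v <<  8)) & 0x0300F00Fu;
    v = (v | (v <<  4)) & 0x030C30C3u;
    v = (v | (v <<  2)) & 0x09249249u;
    return v;
}

/* 3-way bit interleave: (x,y,z) with x,y,z < 1024 -> 30-bit Morton
   index, so points close in 3D space land close in memory. */
uint32_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}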
Post by Andy Glew <"newsgroup at comp-arch.net">
I'm staring at this thinking compression. Adjacent points are probably
close in value. Now, compression of cache lines doesn't help that much,
unless you have sequential access patterns even bigger than a cache
line. But in my scatter/gather world, any compression might help, since
it can be packed with other odd sized packets.
I'm afraid that during the FFT run you would convert all the numbers
in-place, into values where your nice local compressibility might disappear.

I have more faith in uniform compression, i.e. making do with fewer
bits/node, simply because that is easy to emulate and check where
numerical instability would rear its ugly head.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2010-07-27 07:56:14 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Andy Glew <"newsgroup at comp-arch.net">
And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other. I.e. do NOT neglect bandwidth to
processor local memory.
Nice way of putting it: All accesses are just as expensive, i.e. very
expensive.
And it makes reasoning about consistency easier - Lamport used that
model in his seminal paper on sequential consistency.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Andy Glew <"newsgroup at comp-arch.net">
Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors. Which is where my scatter/gather interconnect would help.
We might want to create 3D tiles.
This sounds a lot more reasonable: 1D FFT is very nice from a cache
viewpoint, while 2D and 3D are much worse.
Now you have lost me. That is exactly the converse of my understanding.

1-D FFTs are pretty dire from a caching viewpoint, because there is
virtually no data reuse, except in some special cases and with code
designed to use them. Or have you found a new approach?

2- and 3-D FFTs are quite good for caching, if the code is changed
to make use of that. For a weak meaning of 'quite good', true. And,
yes, you need a blocked transpose operation.

I think that a good scatter/gather interconnect would help with both,
but the devil is in the details.


Regards,
Nick Maclaren.
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-27 12:34:41 UTC
Permalink
Post by n***@cam.ac.uk
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Andy Glew <"newsgroup at comp-arch.net">
And phrase it in terms of a dancehall configuration - all memory on one
side, all processors on the other. I.e. do NOT neglect bandwidth to
processor local memory.
Nice way of putting it: All accesses are just as expensive, i.e. very
expensive.
And it makes reasoning about consistency easier - Lamport used that
model in his seminal paper on sequential consistency.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Andy Glew <"newsgroup at comp-arch.net">
Once you get to a certain level of the FFT, we have lots of nice cache
line sized accesses. Unless they are being distributed to different
processors. Which is where my scatter/gather interconnect would help.
We might want to create 3D tiles.
This sounds a lot more reasonable: 1D FFT is very nice from a cache
viewpoint, while 2D and 3D are much worse.
Now you have lost me. That is exactly the converse of my understanding.
1-D FFTs are pretty dire from a caching viewpoint, because there is
virtually no data reuse, except in some special cases and with code
designed to use them. Or have you found a new approach?
Not a new approach afaik; this is just from memory, from the IMDCT I
worked on for an audio codec:

For each of the log2(n) iterations there are just two pointers which
traverse the data array, and they do so sequentially, so caching
behavior should be close to optimal as long as the cache is at least 2-way.

OK, I do see that when the data size is significantly larger than the
outermost cache layer, this becomes a pure streaming vector job, with no
reuse at all between those log2(n) iterations. :-(

For arrays that do fit in some cache layer, sequential access is
optimal, but for significantly larger data sizes you would like to
overlap multiple iterations...
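
For illustration, a textbook in-place radix-2 FFT skeleton (not the
codec's actual code) showing that pass structure: each of the log2(n)
outer iterations streams the whole array once, reading and writing two
sequential regions at a time.

#include <complex.h>
#include <math.h>
#include <stddef.h>

/* In-place iterative radix-2 Cooley-Tukey FFT; n must be a power of 2. */
void fft(double complex *x, size_t n)
{
    /* Bit-reversal permutation. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            double complex t = x[i]; x[i] = x[j]; x[j] = t;
        }
    }
    /* log2(n) passes; pass 'len' streams the array once. */
    const double pi = acos(-1.0);
    for (size_t len = 2; len <= n; len <<= 1) {
        double complex wlen = cexp(-2.0 * pi * I / (double)len);
        for (size_t i = 0; i < n; i += len) {
            double complex w = 1.0;
            /* Two sequential streams: x[i+k] and x[i+k+len/2]. */
            for (size_t k = 0; k < len / 2; k++) {
                double complex u = x[i + k];
                double complex v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}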
Post by n***@cam.ac.uk
2- and 3-D FFTs are quite good for caching, if the code is changed
to make use of that. For a weak meaning of 'quite good', true. And,
yes, you need a blocked transpose operation.
I'll have to look into this at some point in time.
Post by n***@cam.ac.uk
I think that a good scatter/gather interconnect would help with both,
but the devil is in the details.
S/G is very useful, IFF it can increase the total available bandwidth,
but as long as DRAMs keep getting more and more dependent upon big block
transfers, it sounds like a hard problem. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
James Van Buskirk
2010-07-27 13:22:11 UTC
Permalink
Post by n***@cam.ac.uk
1-D FFTs are pretty dire from a caching viewpoint, because there is
virtually no data reuse, except in some special cases and with code
designed to use them. Or have you found a new approach?
2- and 3-D FFTs are quite good for caching, if the code is changed
to make use of that. For a weak meaning of 'quite good', true. And,
yes, you need a blocked transpose operation.
The difference between 1-D and the others is that for a 1-D FFT you
have to perform a preliminary transpose because the data that will be
combined on the first half of the passes is scattered all over the
initial array, so you need 3 transposes assuming cache is big enough
to hold a sqrt(N) sized array. For the others, one dimension's worth
of data is already contiguous in memory so you only need one transpose
per dimension. For an FFT all access to main memory (not to cache)
takes place in the transpose operations.
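
For reference, the factorization behind that transpose count (the classic
four-/six-step scheme; write N = N1*N2, n = N2*n1 + n2, k = k1 + N1*k2,
with $\omega_M = e^{-2\pi i/M}$):

$$X(k_1 + N_1 k_2) = \sum_{n_2=0}^{N_2-1} \omega_{N_2}^{n_2 k_2} \left[ \omega_{N}^{n_2 k_1} \sum_{n_1=0}^{N_1-1} x(N_2 n_1 + n_2)\, \omega_{N_1}^{n_1 k_1} \right].$$

The inner sums are N2 independent FFTs of length N1, followed by a
twiddle multiply and N1 FFTs of length N2; with N1 ~ N2 ~ sqrt(N), each
short FFT fits in cache, and the main-memory traffic concentrates in the
transposes that make each stage's data contiguous.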
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
Thomas Womack
2010-07-30 17:47:49 UTC
Permalink
Post by Robert Myers
http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf
Are http://code.google.com/p/p3dfft/ and
http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
this case? Large 3D FFTs decomposing over lots of processors into
(N/x)*(N/y)*N bricks, just over 100 seconds for 8192^3 on 2^15 CPUs
(2048 quad-quad-opterons) at TACC. The TACC machine 'Ranger' is a
load of racks plus a monolithic 3456-port 40GBit Infiniband switch
from Sun, so doesn't look that dissimilar to a national-labs machine.

I suppose the question is what counts as awful - it's 5% of peak, but
(figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
it's not ludicrous to call scalable.

Tom
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-31 06:22:36 UTC
Permalink
Post by Thomas Womack
Post by Robert Myers
http://www.o3d.org/abracco/annual_rev_3dnumerical.pdf
Are http://code.google.com/p/p3dfft/ and
http://www.sdsc.edu/us/resources/p3dfft/docs/TG08_DNS.pdf relevant in
this case? Large 3D FFTs decomposing over lots of processors into
(N/x)*(N/y)*N bricks, just over 100 seconds for 8192^3 on 2^15 CPUs
(2048 quad-quad-opterons) at TACC. The TACC machine 'Ranger' is a
load of racks plus a monolithic 3456-port 40GBit Infiniband switch
from Sun, so doesn't look that dissimilar to a national-labs machine.
I suppose the question is what counts as awful - it's 5% of peak, but
(figure 2 of TG08_DNS) it's 5% of peak from 2^9 to 2^15 CPUs, which
it's not ludicrous to call scalable.
No, not at all.

To me, the most interesting part of that paper was the way they had to
tune their 2D decomposition for the actual core layout, i.e. with 16
cores/node the most efficient setup was with 4xN "pencils", almost
certainly due to those 16 cores coming from 4 4-core CPUs.

The other highly significant piece of information was that they got away
with single-prec numbers!

Using DP instead would double memory and communication sizes, while it
would reduce FP throughput by an order of magnitude on something like a
Cell or most GPUs. OTOH, this would still be fast enough to keep up with
the communication network, right?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Myers
2010-08-05 18:55:53 UTC
Permalink
Post by Robert Myers
I got stopped on Bill Dally's first slide, where he says
"efficiency=locality."
He claims that his slide tells you everything you need to know about
the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
INTERESTING PROBLEMS ARE NOT LINEAR.
A problem is embarrassingly parallel IFF it is local IFF it is linear.
If a problem is linear, then there is a representation in which it is
both local and embarrassingly parallel.  If a problem is not linear,
then there is, in general, no such representation.  Does one need a
PhD from Cal Tech to understand this?
(1) Is there a proof for this?  Not that there are non-linear systems
that are not embarassingly parallel, but that there are no interesting
non-linear systems that are not amenable to parallel solutions.
E.g. the N-body problem, gravitational dynamics?
I ducked a direct answer to this question on the first pass.

A linear problem is one for which, if f and g are solutions, then f+g
is a solution. Since the previous statement is a definition of
linear, it's an if and only if property, as all definitions are. In
general, you can divide the problem up into many parallel problems and
add it all up at the end, so the problem is embarrassingly parallel.
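
In symbols, for a linear operator L:

$$L(f)=0,\ L(g)=0 \;\Rightarrow\; L(\alpha f+\beta g)=\alpha L(f)+\beta L(g)=0,$$

so you can solve for the pieces independently and superpose.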

As to the non-locality of nonlinear problems, imagine a Fourier space
representation. If you expand the nonlinearity in a power series, the
first nonlinear term will have a product of states h1*h2. If you
start out with h1 and h2 as pure states with spatial frequency f1 and
f2, then the non-linear term will produce mixed states at spatial
frequency f1+f2, and the interaction of those states will produce
still more states.
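
Concretely, for pure modes:

$$h_1 h_2 = e^{i f_1 x}\, e^{i f_2 x} = e^{i(f_1+f_2)x},$$

so every quadratic term couples pairs of modes and generates wavenumbers
that neither factor contained.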

Sometimes there exists a transformation that will expose an apparently
non-linear problem as linear in some representation, but, in general,
no such transformation exists.

You ask about the n-body problem, which could be gravitation
(astrodynamics) or the coulomb force (molecular dynamics). You can
obtain a Poisson equation for the potential

div grad (potential) = charge density (up to sign and physical constants).

If you are willing to represent the second derivative by centered
differences, then you end up with a banded, nearest-neighbor problem
(tridiagonal along each dimension) that is well-suited to the linear
library machines the Top500 list favors.

Once you have the potential, you can find the force on any particle as

force = -grad (potential)

There is a name for this approximation in molecular dynamics, but
since the approach had already been thought of in many other contexts,
a separate name is hardly justified. As to the accuracy of the
approximation, that is another matter.
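
Spelled out in one dimension, with the signs and constants I suppressed
above made explicit under the usual electrostatics convention:

$$\frac{\phi_{i+1}-2\phi_i+\phi_{i-1}}{h^2} = -\frac{\rho_i}{\varepsilon_0}, \qquad \mathbf{F} = -q\,\nabla\phi,$$

which couples each unknown only to its nearest neighbors: tridiagonal in
1D, a 7-point banded stencil in 3D, exactly the kind of nearest-neighbor
problem a mesh-interconnected machine handles well.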

I assume that using the Poisson equation approach (with centered
differences) is how IBM managed to give Blue Gene the name it did, in
spite of the fact that, on the face of it, the machine is unsuited to
calculating long-range interactions like the Coulomb force because of
limited global bandwidth.

Robert.

Brett Davis
2010-07-26 06:49:24 UTC
Permalink
In article
Post by Robert Myers
Post by Brett Davis
Post by Robert Myers
I agree.  The plots are splendid.  They always are.
Stanford online has a truly awesome related iTunes talk by Bill Dally: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
From: http://deimos3.apple.com/WebObjects/Core.woa/Browse/itunes.stanford.e...
iTunes store search term "The future of throughput".
Track 13 for the course "Programming Massively Parallel Processors with CUDA (CS193G)".
I got stopped on Bill Dally's first slide, where he says
"efficiency=locality."
He claims that his slide tells you everything you need to know about
the future of computing, AND THAT BELIEF AND PROSELYTIZING FOR THAT
BELIEF IS WHY I HAVE USED UP SO MUCH SPACE HERE.
THE MOST INTERESTING PROBLEMS ARE NOT LOCAL, BECAUSE THE MOST
INTERESTING PROBLEMS ARE NOT LINEAR.
A problem is embarrassingly parallel IFF it is local IFF it is linear.
If a problem is linear, then there is a representation in which it is
both local and embarrassingly parallel. If a problem is not linear,
then there is, in general, no such representation. Does one need a
PhD from Cal Tech to understand this?
A simple programmer will tell you: fine, but your performance is going
to be horrible.
Several things have been tried:

The company that bought Cray had a machine that was 1000 way
interleaved to deal with 1000 cycle latencies from RAM.
That design is dead, and buried?

Sun had its 8 way threaded multiprocessor for a more traditional
approach to the same problem. Dead.

IBM sells 4-way multithreaded processors; it is too early to say if this
is a fad and failure for IBM.

AMD would allow you to split the 128 bit memory bus into two
independent 64 bit busses, for better transaction throughput.
To the best of my knowledge no one turns this mode on, as one
bus is faster for everyone?

For Bulldozer with its 256-bit memory bus I could see one
128 bit bus for the OS and app image, and a pair of 64 bit
buses for the database to use. But that is a stretch when
AMD may have decided the whole idea is stupid.

Intel has two-way HypeThreading, which, while no longer a net
negative turned off at boot by default like the NetBust design
was, is still a wash at best. Bold hopes for 2x speed
increases for high latency code have returned no speed boost
at all. You are hard pressed to find a 5% improvement in a
non-rigged benchmark.
Now that we have more CPUs than threads of work to do for most
desktop CPUs, the very idea of fine grain threading is dead.
(Outside of Intel marketing, where it is just Blue Crystals.)
Fine grain threading costs transistors for all those things
you have to double, and otherwise increase the size of.
Costing power, which costs clock, which costs performance.

AMD is quite vocal that they will never do HypeThreading,
saying that it is a net negative, and I believe them.
Post by Robert Myers
I suppose that it is inevitable that EE's should believe that
interesting things can happen in a linear universe, but, from a
mathematical point of view, linear systems are as interesting as the
aging of paintings hanging in the Louvre.
Robert.
The one problem set I know of that really wants small random
access is dictionary-based AI, the last remaining approach
to emulating the human mind, as all the other approaches have
failed. (Think ELIZA with a terabyte database.)

But ultimately this is a kludge to get the same results that
the human mind does; the human mind is massively parallel
and soft-plugboard wired up between neurons.

On the scale between ASIC-DSP-CPU, the human mind is way off to the
left of ASIC.

So what problem is it that you want this functionality for?

Fringe designs and apps are interesting mental exercises, fun.
Post by Robert Myers
Post by Brett Davis
A related PDF:http://cva.stanford.edu/people/dally/ISSCC2005.pdf
Brett
Ken Hagan
2010-07-26 11:28:26 UTC
Permalink
... dictionary-based AI, the last remaining approach
to emulating the human mind, as all the other approaches have
failed. (Think ELIZA with a terabyte database.)
That would be "last remaining that I've thought of", with a strong
implication that it has survived this long simply because the other
failures were tried first.
But ultimately this is a kludge to get the same results that
the human mind does; the human mind is massively parallel
and soft-plugboard wired up between neurons.
I think we can be pretty certain that the human mind is not a *soft*
plugboard on the sort of timescales that it solves intellectual problems.
On the question of its parallelism, I'll wait until someone comes up with
a plausible model for how it works. (Come to that, it doesn't make much
sense to take lessons in computer architecture from the brain either, for
the same reason.)
So what problem is it that you want this functionality for?
Fringe designs and apps are interesting mental exercises, fun.
Simulating almost any natural process, as Robert said.

Picking up on his "Stanford PhD" remark...

In the physical sciences, there is an unstated assumption amongst
examiners that unless a question includes a calculation it lacks "rigour".
(Those of you who set exam questions may wish to contradict me on this
point, but it was certainly true when/where I did my degree.)

Nearly all of the calculations in undergraduate physics concern linear
systems, because non-linear ones can't be done. (Well, perhaps they can,
but probably not many and not by undergraduates.)

So students spend lots of time solving linear problems during their course
and graduate if they pass an exam that is mostly filled with linear
problems.

If those students stay with their subject for long enough, reality will
kick in and they will slowly come to appreciate that linear approximations
are just that. But how long does this process take? Has the "Stanford PhD"
spent enough time in serious research to unlearn their undergraduate
degree?

And on a more frivolous note, would I be correct in saying that the domain
of applicability of a linear approximation shrinks as you perform its
calculations to ever greater precision? If so, the current trend in
super-computing may be to contract down to a singularity of usefulness. :)
Brett Davis
2010-07-27 06:11:08 UTC
Permalink
Post by Ken Hagan
... dictionary-based AI, the last remaining approach
to emulating the human mind, as all the other approaches have
failed. (Think ELIZA with a terabyte database.)
That would be "last remaining that I've thought of", with a strong
implication that it has survived this long simply because the other
failures were tried first.
The PBS series NOVA did a show on AI. Back in the 60s/70s there were
a dozen major promising approaches and another dozen on the fringe.
As faster computers and more PhDs were thrown at the problems, those
approaches were proved wrong one by one.

The dictionary approach was not one of the promising approaches; any
PhD would tell you that the computation required would not be available
for 50 to 100 years, and that someone would come up with a breakthrough
long before the dictionary approach became viable to even test properly.

Here we are 50 years later... ;)
Post by Ken Hagan
But ultimately this is a kludge to get the same results that
the human mind does; the human mind is massively parallel
and soft-plugboard wired up between neurons.
I think we can be pretty certain that the human mind is not a *soft*
plugboard on the sort of timescales that it solves intellectual problems.
On the question of its parallelism, I'll wait until someone comes up with
a plausible model for how it works. (Come to that, it doesn't make much
sense to take lessons in computer architecture from the brain either, for
the same reason.)
I did not mean to imply realtime rewiring; I assume a few thousand
neurons are rewired each night as you sleep. Unused connections die,
and underutilized neurons seek out new connections, semi-randomly.
Ergo parts of the brain that lose use due to a limb or other sense
getting chopped off are re-assigned over time, which is what scientists
see. All a SWAG of course.

We know how picture memories are stored in the brain, and retrieved.
It's an extreme form of pattern matching against similar images, scaled.
And the images have idea tags for searching.

This technique is quite similar to a compressed dictionary lookup
with cross correlation to related ideas.

Intelligence would seem to be nothing more than a critical mass of
data plus connections, allowed to self-explore and learn. Quite a shock.

No God or soul would seem to be needed; those are created afterwards,
to explain that which cannot be explained. (This is an old idea; it
comes up often in Bible studies.)

Brett
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-27 07:35:47 UTC
Permalink
Post by Brett Davis
Intelligence would seem to be nothing more than a critical mass of
data plus connections, allowed to self explore and learn. Quite a shock.
I believe intelligence (in humans) is very closely correlated with the
size of your associative memory, i.e. how large a problem you can grok
at once.
Post by Brett Davis
No God or soul would seem to be needed; those are created afterwards,
to explain that which cannot be explained. (This is an old idea; it
comes up often in Bible studies.)
Just like creation myths, these ideas seem to crop up in every society.

It is still extremely saddening to me to keep reading articles like the
one yesterday about a town in Louisiana where the school board was
unanimous in wanting "creative design" alongside or instead of evolution.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2010-07-27 08:14:50 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
It is still extremely saddening to me to keep reading articles like the
one yesterday about a town in Louisiana where the school board was
unanimous in wanting "creative design" alongside or instead of evolution.
It's bloody terrifying! Those imbeciles seem to have read the first
and last books of the bible (in Authorized Version order), taken them
as gospel, and ignored everything in between, including the Gospels.
And they are coming very close to controlling most of the military
power of the planet, including almost all the nuclear weapons.


Regards,
Nick Maclaren.
Jason Riedy
2010-07-26 16:13:23 UTC
Permalink
Post by Brett Davis
The company that bought Cray had a machine that was 1000 way
interleaved to deal with 1000 cycle latencies from RAM.
That design is dead, and buried?
Nope. The Cray XMT exists; I use the one at PNNL almost daily. The
XMT2 is on its way. And your numbers are off by a factor of 10. The
Tera machine had ~140 cycle memory latency (iirc) and carried 128
threads per cpu. The XMT's latency is far worse, and you have fewer
useful threads per processor (user code limited to 100). Some of that
is slated to change with the XMT2, but I don't know (and wouldn't be
able to say yet) hard numbers.

I'm not making any claims about commercial viability, however. But it's
far easier to obtain decent performance on massive-scale combinatorial
graph analysis on the XMT than elsewhere... at the moment. Amdahl
giggles a bit at practical potential...

Jason
Brett Davis
2010-07-27 05:14:50 UTC
Permalink
Post by Jason Riedy
Post by Brett Davis
The company that bought Cray had a machine that was 1000 way
interleaved to deal with 1000 cycle latencies from RAM.
That design is dead, and buried?
Nope. The Cray XMT exists; I use the one at PNNL almost daily. The
XMT2 is on its way. And your numbers are off by a factor of 10. The
Tera machine had ~140 cycle memory latency (iirc) and carried 128
threads per cpu. The XMT's latency is far worse, and you have fewer
useful threads per processor (user code limited to 100). Some of that
is slated to change with the XMT2, but I don't know (and wouldn't be
able to say yet) hard numbers.
I stand corrected.
I based the death on this model, which sold one unit:
http://en.wikipedia.org/wiki/Cray_MTA-2

That info is old however, a new model is out:
http://en.wikipedia.org/wiki/Cray_XMT

It is of course limited by the socket it uses, another reason I
thought it would stay dead. The 1000 threads was a rumor for a proposed
follow-up; it's what you need at the predicted 10 GHz speeds CPUs
were supposed to be running at today, to completely hide 1000-cycle
memory latency.

This should be as close to exactly what Robert Myers wants as
he can get.
Post by Jason Riedy
I'm not making any claims about commercial viability, however. But it's
far easier to obtain decent performance on massive-scale combinatorial
graph analysis on the XMT than elsewhere... at the moment. Amdahl
giggles a bit at practical potential...
Jason
Andy Glew <"newsgroup at comp-arch.net">
2010-07-27 02:51:59 UTC
Permalink
Post by Brett Davis
The one problem set I know of that really wants small random
access is dictionary-based AI, the last remaining approach
to emulating the human mind, as all the other approaches have
failed. (Think ELIZA with a terabyte database.)
Network routing (big routers, lots of packets).

People who are modelling large social networks, large networks of
actors. Inter-relationships.
Benny Amorsen
2010-07-27 06:50:21 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with 256 possible
"gateways".

You can simply replicate the whole routing table to local memory on all
processors; updates are a challenge but it's usually ok if packets go to
the wrong route for a few ms.


/Benny
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-27 07:46:49 UTC
Permalink
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)

The alternative of multi-layer tables will save a lot of memory, but
double the number of lookups:

First-level table is /16, returning a 16-bit number which is either a
route/interface/gateway or a pointer to one of the secondary tables.

This way the 128K first-level table fits nicely in processor cache, and
handles almost all the long-distance routes, while the secondary tables
can be of adaptive size, i.e. the header contains the number of bits to
use for lookup.

The one problem with this approach is that it has variable latency, so
that makes it harder to use in a pure HW fast path. :-(
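
A minimal sketch of that two-level lookup in C (hypothetical layout and
field encodings, just to fix ideas):

#include <stdint.h>

/* Hypothetical encoding: a level-1 entry with the top bit clear is a
   final 15-bit next-hop id; with the top bit set, the low 15 bits
   index a second-level table. */
struct l2_table {
    unsigned bits;        /* number of further address bits to index on */
    uint16_t entry[];     /* 2^bits next-hop ids */
};

static uint16_t l1[1 << 16];              /* 128KB: fits in cache */
static struct l2_table *l2[1 << 15];

uint16_t route_lookup(uint32_t daddr)
{
    uint16_t e = l1[daddr >> 16];
    if (!(e & 0x8000u))
        return e;                          /* common case: one lookup */
    struct l2_table *t = l2[e & 0x7FFFu];
    uint32_t idx = (daddr >> (16 - t->bits)) & ((1u << t->bits) - 1);
    return t->entry[idx];                  /* variable-latency second hop */
}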
Post by Benny Amorsen
You can simply replicate the whole routing table to local memory on all
processors; updates are a challenge but it's usually ok if packets go to
the wrong route for a few ms.
Right, the routing update could have arrived a few ms sooner or later
anyway.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Kai Harrekilde-Petersen
2010-07-27 20:59:13 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot). Each Gbit of Ethernet can
generate 1.488M packets per second.

The DRAMs may be cheap enough, but the pins and the power to drive
multiple banks sure ain't cheap.

Remember that you want to do this in the switch/router hardware path (ie
no CPU should touch the packet), and at wire-speed for all ports at the
same time.

Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-28 07:04:04 UTC
Permalink
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
Post by Kai Harrekilde-Petersen
Each Gbit of Ethernet can generate 1.488M packets per second.
And unless you can promise wire-speed for any N/2->N/2 full duplex mesh
you should not try to sell the product, right?
Post by Kai Harrekilde-Petersen
The DRAMs may be cheap enough, but the pins and the power to drive
multiple banks sure ain't cheap.
Remember that you want to do this in the switch/router hardware path (ie
no CPU should touch the packet), and at wire-speed for all ports at the
same time.
Obviously.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Kai Harrekilde-Petersen
2010-07-28 10:11:43 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
I had to go back to some old documentation to remember it. So take the
following as a real life example, but not absolute gospel.

The first thing you need to do in a router is to identify which logical
port you're on (think link aggregation) and whether you need to do
routing or not (a VID,MAC lookup).

If you're doing routing (L3 forwarding), next you need to do an ingress
VLAN check and then a CIDR Longest-prefix lookup for IPv4 unicast + an
ARP lookup, and another lookup for IPv4 multicast (the ARP and multicast
tables can share storage). Oh, and you need to do a egress VLAN lookup
as well.

The CIDR lookup has around 55 bits, whereas the ARP was a straight 13bit
index.

At Layer 2, you have a (VID, MAC) table that is around 75-80 bits wide
(12+48-bit [VID,MAC] plus the data you need to store, such as autolearning
status, aging info & destination, CPU copy/move, mirroring etc), and has
as many entries as you want to throw at it. 8K-16K entries is the norm in
the low-end, door-stopper boxes.

I've probably forgotten a couple of minor lookups.

The size of the tables also depends on whether you put real port bitmaps
in all the tables, or you put an Id in there, and then have secondary
Id-> port bitmap conversion tables later. We did the latter.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Each Gbit of Ethernet can generate 1.488M packets per second.
And unless you can promise wire-speed for any N/2->N/2 full duplex
mesh you should not try to sell the product, right?
Correct. The ability to do this has become cheap enough that it's become
a tickmark requirement (at least at the low-port count end). For the big
modular switches used at the campus/corporate/WAN level, this is not
feasible.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
Morten Reistad
2010-07-28 14:38:10 UTC
Permalink
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
Or use some red-black trees. Just make them vertical in memory, not
horizontal.
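
(For reference, the flat top-24-bit table from the quoted scheme is a
single indexed load; a minimal C sketch, with invented names:)

  #include <stdint.h>

  /* 2^24 one-byte entries = 16 MB: the top 24 bits of the destination
     address index straight in, the byte picks one of 256 "gateways". */
  static uint8_t next_hop[1 << 24];

  static inline uint8_t route(uint32_t dst_ip)
  {
      return next_hop[dst_ip >> 8];  /* longer-than-/24 routes need the
                                        special handling mentioned above */
  }
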
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
I had to go back to some old documentation to remember it. So take the
following as a real life example, but not absolute gospel.
The first thing you need to do in a router is to identify which logical
I thought this was about building clusters? There you switch where you
can, route where you must. You should get to the four-nines point of traffic
just doing switching.

You need a route-cache first. There you make a hash of some sort, and
get a MAC address back. This should be around 1.2 memory accesses per
packet. Next, you push it out the right interface with the right mac.
If you link aggregate, you need to update a counter. All of this is
in the end node.
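
A minimal C sketch of that kind of route cache (the hash and the sizes
are invented for illustration):

  #include <stdint.h>

  struct rc_entry {
      uint32_t ip;      /* destination IP, 0 = empty slot */
      uint8_t  mac[6];  /* next-hop MAC */
      uint8_t  port;    /* exit interface */
  };

  #define RC_SIZE 4096  /* power of two */
  static struct rc_entry rcache[RC_SIZE];

  /* Usually one probe, occasionally a refill: that is how you end up
     near the ~1.2 memory accesses per packet mentioned above. */
  static struct rc_entry *rc_lookup(uint32_t dst_ip)
  {
      uint32_t h = (dst_ip * 2654435761u) & (RC_SIZE - 1);
      struct rc_entry *e = &rcache[h];
      return (e->ip == dst_ip) ? e : 0;  /* miss -> slow path fills it */
  }
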
Post by Kai Harrekilde-Petersen
port you're on (think link aggregation) and whether you need to do
routing or not (a VID,MAC lookup).
If you're doing routing (L3 forwarding), next you need to do an ingress
VLAN check and then a CIDR Longest-prefix lookup for IPv4 unicast + an
ARP lookup, and another lookup for IPv4 multicast (the ARP and multicast
tables can share storage). Oh, and you need to do an egress VLAN lookup
as well.
And then you cache the result. IP address -> exit point and MAC. With
a hashing cache to a small bucket you should be able to get this to be
reasonably fast.
Post by Kai Harrekilde-Petersen
The CIDR lookup has around 55 bits, whereas the ARP was a straight 13bit
index.
At Layer two, you have a (VID, MAC) table that is around 75-80 bits wide
(12+48 bit [VID,MAC] plus the data you need to store, such as autolearning
status, aging info & destination, CPU copy/move, mirroring etc), and it has
as many entries as you want to throw at it. 8K-16K entries is the norm in
the low end, door-stopper boxes.
The fast boxes, and a lot of consumer stuff as well, have associative
MAC caches. Enter a MAC, exit one short int you can plug into port selection.

All that other stuff goes in the CPU. The CPU has a shadow table that runs
3-5 orders of magnitude slower. The associative index is generated from the
shadow.

Consumer boxes have around 400 entries in the associative index,
large boxes have low tens of thousands. That is _one_ access to get
exit port. To speed things up you normally do a cut-through lookup so
the next hop is ready when the packet is in memory.
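
A rough C model of such an associative lookup (4 ways and the index
hash are placeholders; the hardware compares all ways in parallel):

  #include <stdint.h>
  #include <string.h>

  #define WAYS 4
  #define SETS 128  /* sizes are placeholders, not any particular box */

  struct mac_way { uint8_t mac[6]; uint16_t port; };
  static struct mac_way mac_cache[SETS][WAYS];

  static int mac_lookup(const uint8_t mac[6], uint16_t *port)
  {
      unsigned set = (mac[4] ^ mac[5]) & (SETS - 1);  /* toy index hash */
      for (int w = 0; w < WAYS; w++)
          if (memcmp(mac_cache[set][w].mac, mac, 6) == 0) {
              *port = mac_cache[set][w].port;
              return 1;
          }
      return 0;  /* miss: punt to the CPU's shadow table */
  }
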
Post by Kai Harrekilde-Petersen
I've probably forgotten a couple of minor lookups.
The size of the tables also depends on whether you put real port bitmaps
in all the tables, or you put an Id in there, and then have secondary
Id-> port bitmap conversion tables later. We did the latter.
But in the cache you want the direct exit data.
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Each Gbit of Ethernet can generate 1.488M packets per second.
And unless you can promise wire-speed for any N/2->N/2 full duplex
mesh you should not try to sell the product, right?
Correct. The ability to do this has become cheap enough that it's become
a tickmark requirement (at least at the low-port count end). For the big
modular switches used at the campus/corporate/WAN level, this is not
feasible.
Layer 3 switching is an abomination when you want performance. Switch
where you can, route where you must.

-- mrr
Kai Harrekilde-Petersen
2010-07-28 15:40:51 UTC
Permalink
Post by Morten Reistad
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
Or use some red-black trees. Just make them vertical in memory, not
horizontal.
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
I had to go back to some old documentation to remember it. So take the
following as a real life example, but not absolute gospel.
The first thing you need to do in a router is to identify which logical
I thought this was about building clusters? There you switch where you
can, route where you must. You should get to the four-nines point of traffic
just doing switching.
You need a route-cache first. There you make a hash of some sort, and
get a MAC address back. This should be around 1.2 memory accesses per
packet.
Obviously, there are many ways to do these things. But on an L2/L3 fully
integrated switch, one of the most direct and simplest ways of handling
these lookups is to assign separate (internal) physical RAMs for each
lookup.
Post by Morten Reistad
Next, you push it out the right interface with the right mac.
If you link aggregate, you need to update a counter. All of this is
in the end node.
And then you cache the result. IP address -> exit point and MAC. With
a hashing cache to a small bucket you should be able to get this to be
reasonably fast.
If you do things Right(tm), you don't need caching and you don't get the
problems coming from caching: you simply build a system which is fast
enough to handle the maximum frame/packet rate. With a 24 x 1Gbps + 2 x
10Gbps system, that's just under 66M packets per second.
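
The arithmetic behind that number, for anyone checking (a minimum-size
frame plus preamble and inter-frame gap):

  /* 64-byte frame + 8 bytes preamble/SFD + 12 bytes inter-frame gap
     = 84 bytes = 672 bit times per minimum-size frame. */
  double worst_case_pps(void)
  {
      double pps_1g  = 1e9  / 672;  /* ~1.488 Mpps per 1 Gbps port  */
      double pps_10g = 1e10 / 672;  /* ~14.88 Mpps per 10 Gbps port */
      return 24 * pps_1g + 2 * pps_10g;  /* ~65.5M: "just under 66M" */
  }
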
Post by Morten Reistad
Post by Kai Harrekilde-Petersen
The CIDR lookup has around 55 bits, whereas the ARP was a straight 13bit
index.
At Layer two, you have a (VID, MAC) table that is around 75-80 bits wide
(12+48 bit [VID,MAC] plus the data you need to store, such as autolearning
status, aging info & destination, CPU copy/move, mirroring etc), and it has
as many entries as you want to throw at it. 8K-16K entries is the norm in
the low end, door-stopper boxes.
The fast boxes, and a lot of consumer stuff as well, have associative
MAC caches. Enter a MAC, exit one short int you can plug into port selection.
Of course. Typically 4-8 sets. More sets burn more power, and get
unwieldy even inside an ASIC due to routing distances.
Post by Morten Reistad
All that other stuff goes in the CPU. The CPU has a shadow table that runs
3-5 orders of magnitude slower. The associative index is generated from the
shadow.
Not necessarily. The CPU just needs to know how to search/insert for a
given (VID,MAC) tuple.
Post by Morten Reistad
Consumer boxes have around 400 entries in the associative index,
large boxes have low tens of thousands.
No. The consumer boxes have MAC and ARP tables on the order of 4,000-8,000. I
believe the newer generations go up to around 16K entries (still far
more than really needed in a consumer network).
Post by Morten Reistad
That is _one_ access to get
exit port. To speed things up you normally do a cut-through lookup so
the next hop is ready when the packet is in memory.
Post by Kai Harrekilde-Petersen
I've probably forgotten a couple of minor lookups.
The size of the tables also depends on whether you put real port bitmaps
in all the tables, or you put an Id in there, and then have secondary
Id-> port bitmap conversion tables later. We did the latter.
But in the cache you want the direct exit data.
Not necessarily: think of the update scenario. If you have the direct
port bitmap stored, you need to traverse the entire memory every time
someone plugs a cable in or out. With an Id, you just edit that one
location.
Post by Morten Reistad
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Each Gbit of Ethernet can generate 1.488M packets per second.
And unless you can promise wire-speed for any N/2->N/2 full duplex
mesh you should not try to sell the product, right?
Correct. The ability to do this has become cheap enough that it's become
a tickmark requirement (at least at the low-port count end). For the big
modular switches used at the campus/corporate/WAN level, this is not
feasible.
Layer 3 switching is an abomination when you want performance. Switch
where you can, route where you must.
Why is L3 switching an abomination when you can do it at the same speed
as L2 switching, i.e. wire-speed on all ports? It's really not that difficult
to do L3 switching if you design it in from the start.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
Bakul Shah
2010-07-29 06:36:20 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
The sum is too large to allow a single access in a realistic
system. You use whatever tricks you have to, to achieve wire-
speed switching/IP forwarding while staying within given
constraints. Nowadays ternary CAMs are used to do the heavy
lifting for lookups, but for indexing some sort of trie is also
used. In addition to the things mentioned above, you also
have to deal with QoS (based on some subset of {src, dst}
{ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
tags), policing, shaping, scheduling, counter updates etc., and
everything requires access to memory or TCAM! Typically a
heavily pipelined network processor or special purpose ASIC
is used. The nice thing is packet forwarding is almost
completely parallelizable.
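
For the trie side, a minimal binary-trie longest-prefix match in C; a
real NPU would use multi-bit strides and/or TCAM, this is just the idea:

  #include <stdint.h>

  struct trie {
      struct trie *child[2];
      int16_t nexthop;  /* -1 = no route terminates at this node */
  };

  /* Walk one address bit per level, remembering the last node that
     carried a route: that is the longest matching prefix. */
  static int16_t lpm(const struct trie *t, uint32_t ip)
  {
      int16_t best = -1;
      for (int bit = 31; t && bit >= 0; bit--) {
          if (t->nexthop >= 0) best = t->nexthop;
          t = t->child[(ip >> bit) & 1];
      }
      if (t && t->nexthop >= 0) best = t->nexthop;  /* full /32 match */
      return best;
  }
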

To bring this back to comp.arch, if you have on-chip L1, L2 &
L3 caches, chances are you have to xfer large chunks of data
on every access of external RAM for efficiency reasons. Seems
to me, you may as well wrap each such request/response in an
ethernet frame by putting multiple ethernet framers onboard!
Similarly, memory modules should provide an ethernet interface.
Note that there is already AOE (ATA over Ethernet) to connect
disks via ethernet. And since external memories are now like
disk (or tape)... :-)
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-29 07:57:27 UTC
Permalink
Post by Bakul Shah
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Kai Harrekilde-Petersen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Even going to a full /32 table is just 4GB, which is very cheap these
days. :-)
The problem that I saw when I was designing Ethernet switch/routers 5
years ago, wasn't one particular lookup, but the fact that you need to
do *several* quick lookups for each packet (DMAC, 2*SMAC (rd+wr for
learning), DIP, SIP, VLAN, ACLs, whatnot).
I sort of assumed that not all of these would require the same size
table. What is the total index size, i.e. sum of all the index bits?
The sum is too large to allow a single access in a realistic
Sorry, I was unclear, but the OP did understand what I meant:

I was really asking for the full list of lookups, with individual sizes,
needed to do everything a router has to do these days.
Post by Bakul Shah
system. You use whatever tricks you have to, to achieve wire-
speed switching/IP forwarding while staying within given
constraints. Nowadays ternary CAMs are used to do the heavy
lifting for lookups, but for indexing some sort of trie is also
used. In addition to the things mentioned above, you also
have to deal with QoS (based on some subset of {src, dst}
{ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
tags), policing, shaping, scheduling, counter updates etc., and
everything requires access to memory or TCAM! Typically a
heavily pipelined network processor or special purpose ASIC
is used. The nice thing is packet forwarding is almost
completely parallelizable.
Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?

(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)

BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
Post by Bakul Shah
To bring this back to comp.arch, if you have on-chip L1, L2 &
L3 caches, chances are you have to xfer large chunks of data
on every access of external RAM for efficiency reasons. Seems
to me, you may as well wrap each such request, response in an
ethernet frame by putting multiple ethernet framers onboard!
Similarly memory modules should provide an ethernet internface.
Note that there is already AOE (ATA over Ethernet) to connect
disks via ethernet. And since external memories are now like
disk (or tape)... :-)
Well, what does the current cross-cpu protocols look like?

AMD or Intel doesn't seem to matter, there's still a nice little HW
network stack in there.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Bakul Shah
2010-07-29 08:27:30 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Bakul Shah
system. You use whatever tricks you have to, to achieve wire-
speed switching/IP forwarding while staying within given
constraints. Nowadays ternary CAMs are used to do the heavy
lifting for lookups, but for indexing some sort of trie is also
used. In addition to the things mentioned above, you also
have to deal with QoS (based on some subset of {src, dst}
{ether-addr, ip-addr, port}, ether-type, protocol, vlan, mpls
tags), policing, shaping, scheduling, counter updates etc., and
everything requires access to memory or TCAM! Typically a
heavily pipelined network processor or special purpose ASIC
is used. The nice thing is packet forwarding is almost
completely parallelizable.
Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?
You still have the same time budget: worst case you still have to send
out 64 byte packets back to back. Most lookups can be done as soon as the
NPU can get at the header.
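
Put as a per-port time budget (64 bytes of frame plus 20 bytes of
preamble and inter-frame gap = 672 bit times):

  /* The lookup pipeline has to sustain one result per slot,
     regardless of when in the frame the lookups start. */
  double slot_ns(double gbps)
  {
      return 672 / gbps;  /* 672 ns at 1 Gbps, 67.2 ns at 10 Gbps */
  }
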
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)
BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
You can drop a bad CRC packet at a later point in the pipeline but before
sending it out.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Bakul Shah
To bring this back to comp.arch, if you have on-chip L1, L2 &
L3 caches, chances are you have to xfer large chunks of data
on every access of external RAM for efficiency reasons. Seems
to me, you may as well wrap each such request/response in an
ethernet frame by putting multiple ethernet framers onboard!
Similarly, memory modules should provide an ethernet interface.
Note that there is already AOE (ATA over Ethernet) to connect
disks via ethernet. And since external memories are now like
disk (or tape)... :-)
Well, what does the current cross-cpu protocols look like?
AMD or Intel doesn't seem to matter, there's still a nice little HW
network stack in there.
Ethernet frames seem to have become the most common denominator in
networks, so I was speculating that maybe that'd be the cheapest way to
shove lots of data around?
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-29 16:04:09 UTC
Permalink
Post by Bakul Shah
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?
You still have the same time budget: worst case you still have to send
out 64 byte packets back to back. Most lookups can be done as soon as the
NPU can get at the header.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)
BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
You can drop a bad CRC packet at a later point in the pipeline but before
sending it out.
I meant sending out _before_ you have received it, as soon as you have
the dest address.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Morten Reistad
2010-07-30 16:37:03 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Bakul Shah
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
(Skipping/delaying the checksum test would mean depending upon the end
node to detect the error.)
BTW, is anyone doing this? Maybe in order to win benchmarketing tests?
And bring down latency. A lot, actually. Naive store and forward of a
12400 bit (framing included) frame on a gigabit link uses around 86 us.
Add some store/handling logic, and you are close to 100.

Modern gigabit switches bring this latency down to below 10 us. And, yes, there
are procedures to handle a bad CRC by signalling BAD at the end of a frame
to the next hop.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Bakul Shah
You can drop a bad CRC packet at a later point in the pipeline but before
sending it out.
I meant sending out _before_ you have received it, as soon as you have
the dest address.
Yes, it is called "cut-through switching". Now go read the promo material
about it. And, yes, it really helps on protocols like NFS.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Terje
- mrr
Benny Amorsen
2010-07-30 18:34:44 UTC
Permalink
Post by Morten Reistad
Yes, it is called "cut-through switching". Now go read the promo material
about it. And, yes, it really helps on protocols like NFS.
But is there anyone who does it for routing? And why did cut-through
pretty much disappear for gigabit switching, only to reappear for 10
Gbps? Except for HP who seem to be stuck with store-and-forward, but
maybe that is why they bought 3com.


/Benny
Benny Amorsen
2010-07-29 11:21:17 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?
There are several layers of checksums at play here. If we stick to IP,
only the header has a checksum, and for IPv6 even that has been removed.
So there isn't really a chance to take, because you have the checksum
before you start receiving the payload (and the payload isn't
protected).

There is a whole-packet checksum at the ethernet level (if the physical
layer happens to be ethernet, of course). Switches used to pretty much
universally do cut-through switching until gigabit switches arrived.
Almost all gigabit switches are store-and-forward, but somehow latency
was rediscovered in 10Gbps-switches, so quite a few of those are
cut-through.

Unfortunately "cut-through routing" refers to something entirely
different from "cut-through switching". I haven't been able to find any
products claiming to do anything but store-and-forward routing.


/Benny
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-29 16:16:19 UTC
Permalink
Post by Benny Amorsen
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Yes, but if you want to take a chance and skip the trailing checksum
test, in order to forward packets as soon as you have the header, then
you would have even more severe timing restrictions, right?
There are several layers of checksums at play here. If we stick to IP,
only the header has a checksum, and for IPv6 even that has been removed.
Mea culpa, or wishful thinking on my part. :-(
Post by Benny Amorsen
So there isn't really a chance to take, because you have the checksum
before you start receiving the payload (and the payload isn't
protected).
There is a whole-packet checksum at the ethernet level (if the physical
layer happens to be ethernet, of course). Switches used to pretty much
universally do cut-through switching until gigabit switches arrived.
Almost all gigabit switches are store-and-forward, but somehow latency
was rediscovered in 10Gbps-switches, so quite a few of those are
cut-through.
:-)
Post by Benny Amorsen
Unfortunately "cut-through routing" refers to something entirely
different from "cut-through switching". I haven't been able to find any
products claiming to do anything but store-and-forward routing.
OK, that was really what I was asking about.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Morten Reistad
2010-07-27 23:30:59 UTC
Permalink
Post by Benny Amorsen
Post by Andy Glew <"newsgroup at comp-arch.net">
Network routing (big routers, lots of packets).
Well, for loose coupling, ethernet switching is a better option
as long as you can keep the MAC table within the associative
memory of the latest switches. This means around 8000 as
an upper limit. You can push this somewhat with fancy
interconnect patterns, but hardly by more than an order
of magnitude.

So the hardware can get you a few tens of thousands of
processors around 50 us away from each other.
Post by Benny Amorsen
With IPv4 you can usually get away with routing on the top 24 bits + a
bit of special handling of the few local routes on longer than 24-bit.
That's a mere 16MB table if you can make do with possible 256
"gateways".
Stop thinking within the limits of IPv4. We have had a more
scalable version for a decade now. It even works in the last
three versions of Windows.
Post by Benny Amorsen
You can simply replicate the whole routing table to local memory on all
processors; updates are a challenge but it's usually ok if packets go to
the wrong route for a few ms.
Why? Map the MAC addresses into the IP address, and let the
switches do the heavy lifting.

-- mrr
Tim McCaffrey
2010-07-28 15:18:56 UTC
Permalink
Post by Brett Davis
AMD would allow you to split the 128 bit memory bus into two
independent 64 bit busses, for better transaction throughput.
To the best of my knowledge no one turns this mode on, as one
bus is faster for everyone?
For the Opteron system we used, the BIOS default was unganged (two
64 bit busses). The vendor confirmed that this usually gives
better performance.

- Tim
MitchAlsup
2010-07-28 21:32:11 UTC
Permalink
Post by Brett Davis
AMD would allow you to split the 128 bit memory bus into two
independent 64 bit busses, for better transaction throughput.
To the best of my knowledge no one turns this mode on, as one
bus is faster for everyone?
The wide memory bus is invariably faster, especially with a small number
of DIMMs.

What the dual bus approach does is allow all stuffings of the DIMMs
to end up as usable memory systems with whatever amount of memory
got plugged in. This is in effect what the aftermarket customer wants,
and not what the performance customer wants. Which leads to:

The memory system is always fastest when all the DIMMs have the same
timing numbers.

Mitch
Paul A. Clayton
2010-07-29 00:40:11 UTC
Permalink
On Jul 28, 5:32 pm, MitchAlsup <***@aol.com> wrote:
[snip]
Post by MitchAlsup
The wide memory bus is invariably faster, especially with a small number
of DIMMs.
Wouldn't having twice as many potentially active DRAM banks (two
independent channels vs. two DIMM channels merged to a single
addressed channel) be a significant benefit for many multithreaded
and some single-threaded applications where bank conflicts might be
more common (especially with a "small number of DIMMs")?



Paul A. Clayton
just a technophile
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-29 07:34:10 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup
The wide memory bus is invariably faster, especially with a small number
of DIMMs.
Wouldn't having twice as many potentially active DRAM banks (two
independent channels vs. two DIMM channels merged to a single
addressed channel) be a significant benefit for many multithreaded
and some single-threaded applications where bank conflicts might be
more common (especially with a "small number of DIMMs")?
It would, unless the width of the combined channel is still less than
or equal to the size of a cache line:

When a single channel can deliver 64 bits and a combined channel 128
bits, and a cache line (at 512+ bits) is the smallest unit of transfer,
then you _want_ wider channels.
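
A worked example under those assumptions (a 64-byte line, so the beat
counts are the whole story):

  /* One 512-bit cache line as bus beats: */
  enum {
      BEATS_NARROW = 512 / 64,   /* 8 beats on a single 64-bit channel */
      BEATS_WIDE   = 512 / 128,  /* 4 beats on the ganged 128-bit pair */
  };
  /* The first beat shows up at the same time either way; the last
     beat, which the line fill has to wait for, takes half as many
     beats on the wide arrangement. */
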

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup
2010-07-29 12:45:23 UTC
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup
The wide memory bus is invariably faster, especially with a small number
of DIMMs.
Wouldn't having twice as many potentially active DRAM banks (two
independent channels vs. two DIMM channels merged to a single
addressed channel) be a significant benefit for many multithreaded
and some single-threaded applications where bank conflicts might be
more common (especially with a "small number of DIMMs")?
This was extensively simulated, and to our surprise:

The first data beat arrives at the same point in time on both the wide
and narrow arrangements. The last data beat arrives a lot later on the
narrow arrangement. It is the last data beat which
governs the sending of the line through the Crossbar. This is not
inevitably necessary, but it is on the crossbar in Opteron. Once the
memory controller starts sending the first line, it cannot switch to
another line and use the BW available in the fabric router. So,
getting the whole line out into the fabric is the key, and why the
dual DIMM bus does not work as well as one would expect.

It is perfectly reasonable to build the fabric where this property is
not a limiting factor and actually get better performance with more
banks.

Mitch