Discussion:
High-bandwidth computing interest group
(too old to reply)
Robert Myers
2010-07-18 22:02:49 UTC
Permalink
I have lamented, at length, the proliferation of flops at the expense of
bytes-per-flop in what are currently styled as supercomputers.

This subject came up recently on the Fedora Users' Mailing List when
someone claimed that GPUs are just what the doctor ordered to make
high-end computation pervasively available. Even I have fallen into
that trap, in this forum, and I was quickly corrected. In the most
general circumstance, GPUs seem practically to have been invented to
expose bandwidth starvation.

At least one person on the Fedora list got it and says that he has
encountered similar issues in his own work (what is in short supply is
not flops, but bytes per flop). He also seems to understand that the
problem is fundamental and cannot be made to go away with an endless
proliferation of press releases, photographs of "supercomputers," and an
endless procession of often meaningless color plots.

Since the issue is only tangentially related to the list, he suggested a
private mailing list to pursue the issue further without annoying others
with a topic that most are manifestly not interested in.

The subject is really a mix of micro and macro computer architecture,
the physical limitations of hardware, the realities of what is ever
likely to be funded, and the grubby details of computational mathematics.

Since I have talked most about the subject here and gotten the most
valuable feedback here, I thought to solicit advice as to what kind of
forum would seem most plausible/attractive to pursue such a subject. I
could probably host a mailing list myself, but would that be the way to
go about it and would anyone else be interested?

Email me privately if you don't care to respond publicly.

Thanks.

Robert.
Edward Feustel
2010-07-19 09:54:03 UTC
Permalink
On Sun, 18 Jul 2010 18:02:49 -0400, Robert Myers
Post by Robert Myers
I have lamented, at length, the proliferation of flops at the expense of
bytes-per-flop in what are currently styled as supercomputers.
---
Post by Robert Myers
Since I have talked most about the subject here and gotten the most
valuable feedback here, I thought to solicit advice as to what kind of
forum would seem most plausible/attractive to pursue such a subject. I
could probably host a mailing list myself, but would that be the way to
go about it and would anyone else be interested?
Email me privately if you don't care to respond publicly.
Thanks.
Robert.
This is an important subject. I would suggest that everything be
archived in a searchable environment. Keywords and tightly focused
discussion would be helpful (if possible). Please let me know if you
decide to do a wiki or e-mail list.

Ed Feustel
Dartmouth College
jacko
2010-07-19 15:02:00 UTC
Permalink
I might click a look see.
MitchAlsup
2010-07-19 15:36:18 UTC
Permalink
It seems to me that having less than 8 bytes of memory bandwidth per
flop leads to an endless series of cache exercises.**

It also seems to me that nobody is going to be able to put the
required 100 GB/s/processor pin interface on the part.*

Nor does it seem it would have the latency needed to strip-mine main
memory continuously, were the required BW made available.

Thus, we are in essence screwed.

* current bandwidths:
a) 3 GHz processors with 2 FP pipes running 128-bit DP operations
(a la SSE): this gives 12 GFlop/s/processor
b) 12 GFlop/s/processor demands 100 GByte/s/processor
c) DDR3 can achieve 17 GBytes/s/channel
d) high-end PC processors can afford 2 memory channels
e) therefore we are screwed:
e.1) The memory system can supply only 1/3rd of what a single
processor wants
e.2) There are 4 and growing numbers of processors
e.3) Therefore the memory system can support less than 1/12 as much BW
as required.

Mitch

** The Ideal memBW/Flop is 3 memory operations per flop, and back in
the Cray-1 to XMP transition much of the vectorization gain occurred
from the added memBW and the better chaining.
nik Simpson
2010-07-19 19:44:07 UTC
Permalink
Post by MitchAlsup
d) high end PC processors can afford 2 memory channels
Not quite as screwed as that: the top-end Xeon & Opteron parts have 4
DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
DDR3 memory channels for typical server processors. Of course, the move
to on-chip memory controllers means that the scope for additional memory
channels is pretty much zero, but that's the price you pay for
commodity parts: they are designed to meet the needs of the majority of
customers, and it's hard to justify the costs of additional memory
channels at the processor and board-layout levels just to satisfy the
needs of bandwidth-crazy HPC apps ;-)
--
Nik Simpson
jacko
2010-07-19 22:21:23 UTC
Permalink
Why do memory channels have to be wired with inverter chains and
relatively long track interconnect on the circuit board? Microwave
pipework from chip-top to chip-top is perhaps possible; maintaining
enough bandwidth over the microwave channel means many GHz, but the
link is short, so the radiant power is low!

Flops or not? Let's generalize and call them nops, said he with a touch
of sarcasm: Non-specific Operations, needing GB/s.

Cheers Jacko
Robert Myers
2010-07-20 00:18:50 UTC
Permalink
Post by nik Simpson
Post by MitchAlsup
d) high end PC processors can afford 2 memory channels
Not quite as screwed as that: the top-end Xeon & Opteron parts have 4
DDR3 memory channels, but still screwed. For the 2-socket space, it's 3
DDR3 memory channels for typical server processors. Of course, the move
to on-chip memory controllers means that the scope for additional memory
channels is pretty much zero, but that's the price you pay for
commodity parts: they are designed to meet the needs of the majority of
customers, and it's hard to justify the costs of additional memory
channels at the processor and board-layout levels just to satisfy the
needs of bandwidth-crazy HPC apps ;-)
Maybe the capabilities of high-end x86 are and will continue to be so
compelling that, unless IBM is building the machine, that's what we're
looking at for the foreseeable future.

I don't understand the economics of less mass-market designs, but
maybe the perfect chip would be some iteration of an "open" core,
maybe less heat-intensive, less expensive, and soldered-down with more
attention to memory and I/O resources.

Or maybe you could dual port or route memory, accepting whatever cost
in latency there is, and at least allow some pure DMA device to
perform I/O and gather/scatter chores so as to maximize what processor
bandwidth there is.

I'd like some blue sky thinking.

Robert.
Andrew Reilly
2010-07-20 04:43:46 UTC
Permalink
The memory system can supply only 1/3rd of what a single processor wants
If that's the case (and down-thread Nik Simpson suggests that the best
case might even be twice as "good", or 2/3 of a single processor's worst-
case demand), then that's amazingly better than has been available, at
least in the commodity processor space, for quite a long time. I
remember when I started moving DSP code onto PCs, and finding anything
with better than 10MB/s memory bandwidth was not easy. These days my
problem set typically doesn't get out of the cache, so that's not
something I personally worry about much any more. If your problem set is
driven by stream-style vector ops, then you might as well switch to low-
power critters like Atoms, and match the flops to the available
bandwidth, and save some power.

On the other hand, I have a lot of difficulty believing that even for
large-scale vector-style code, a bit of loop fusion, blocking or code
factoring can't bring value-reuse up to a level where even (0.3/nProcs)
available bandwidth is plenty.
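A toy illustration of the loop-fusion point: two separate passes move roughly twice the data of the fused version for the same answer (illustrative only; real compilers apply this across far larger kernels):

```python
# Loop fusion raising value reuse: the two-pass version writes and then
# re-reads the intermediate y; the fused loop touches each input once.

def two_passes(x):
    y = [2.0 * v for v in x]        # pass 1: reads x, writes y
    z = [v + 1.0 for v in y]        # pass 2: reads y, writes z
    return z                        # traffic: ~4N values moved

def fused(x):
    return [2.0 * v + 1.0 for v in x]   # reads x once, writes z: ~2N

x = [float(i) for i in range(8)]
assert two_passes(x) == fused(x)    # same result, about half the traffic
```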

That's single-threaded application-think. Where you *really* need that
bandwidth, I suspect, is for the inter-processor communication between
your hordes of cooperating (ha!) cores.

Cheers,
--
Andrew
jacko
2010-07-20 05:44:42 UTC
Permalink
Post by Andrew Reilly
The memory system can supply only 1/3rd of what a single processor wants
If that's the case (and down-thread Nik Simpson suggests that the best
case might even be twice as "good", or 2/3 of a single processor's worst-
case demand), then that's amazingly better than has been available, at
least in the commodity processor space, for quite a long time.  I
remember when I started moving DSP code onto PCs, and finding anything
with better than 10MB/s memory bandwidth was not easy.  These days my
problem set typically doesn't get out of the cache, so that's not
something I personally worry about much any more.  If your problem set is
driven by stream-style vector ops, then you might as well switch to low-
power critters like Atoms, and match the flops to the available
bandwidth, and save some power.
Or run a bigger network off the same power.
Post by Andrew Reilly
On the other hand, I have a lot of difficulty believing that even for
large-scale vector-style code, a bit of loop fusion, blocking or code
factoring can't bring value-reuse up to a level where even (0.3/nProcs)
available bandwidth is plenty.
Prob(able)ly - sick perverse hanging on to a longer word in the post
quantum age.
Post by Andrew Reilly
That's single-threaded application-think.  Where you *really* need that
bandwidth, I suspect, is for the inter-processor communication between
your hordes of cooperating (ha!) cores.
Maybe. I think much of the problem is not vectors, as these usually
have a single index; it's matrix and tensor problems, which have 2 or
n indexes: T[a,b,c,d]
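To see why multi-index problems fight the cache, consider the row-major strides of a hypothetical T[a,b,c,d]: only the last index is unit-stride, so sweeping any other index jumps through memory in large steps (a sketch; the shape is made up):

```python
# Row-major strides for a 4-index tensor T[a,b,c,d]. Only the last
# index moves one element at a time; sweeping any earlier index strides
# through memory -- the access pattern a linear cache line cannot help.

def row_major_strides(dims):
    strides, s = [], 1
    for d in reversed(dims):
        strides.append(s)
        s *= d
    return list(reversed(strides))

def offset(strides, idx):
    # linear address of element T[idx] given per-index strides
    return sum(s * i for s, i in zip(strides, idx))

dims = (4, 6, 8, 10)                 # hypothetical shape
strides = row_major_strides(dims)
print(strides)                       # [480, 80, 10, 1]
# Sweeping d advances 1 element per step; sweeping a advances 480.
```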

Many product sums run over different indexes. Even with
transpose-elimination coding (automatic switching between row- and
column-major order based on the linear sequencing of a write target, or
for the best read/write/read/write performance) in the prefetch
context, gather/scatter remains limited.

Maybe even some multi-store (slightly wasteful of memory cells) with
differing address-bit swappings? The high bits as an address-map
translation selector, with a bank write-and-read combo 'union'
operation (* or +)?

Ummm.
George Neuner
2010-07-20 14:33:55 UTC
Permalink
On Mon, 19 Jul 2010 08:36:18 -0700 (PDT), MitchAlsup
Post by MitchAlsup
It seems to me that having less than 8 bytes of memory bandwidth per
flop leads to an endless series of cache exercises.**
It also seems to me that nobody is going to be able to put the
required 100 GB/s/processor pin interface on the part.*
Nor does it seem it would have the latency needed to strip-mine main
memory continuously, were the required BW made available.
Thus, we are in essence screwed.
* current bandwidths:
a) 3 GHz processors with 2 FP pipes running 128-bit DP operations
(a la SSE): this gives 12 GFlop/s/processor
b) 12 GFlop/s/processor demands 100 GByte/s/processor
c) DDR3 can achieve 17 GBytes/s/channel
d) high-end PC processors can afford 2 memory channels
e.1) The memory system can supply only 1/3rd of what a single
processor wants
e.2) There are 4 and growing numbers of processors
e.3) Therefore the memory system can support less than 1/12 as much BW
as required.
Mitch
** The Ideal memBW/Flop is 3 memory operations per flop, and back in
the Cray-1 to XMP transition much of the vectorization gain occurred
from the added memBW and the better chaining.
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. Yes there was a lot of latency (and I
know you [Mitch] and Robert Myers are dead set against latency too)
but the staging data movement provided a lot of opportunity to overlap
with real computation.

YMMV, but I think pipeline vector units need to make a comeback. I am
not particularly happy at the thought of using them again, but I don't
see a good way around it.

George
n***@cam.ac.uk
2010-07-20 14:41:13 UTC
Permalink
Post by George Neuner
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. Yes there was a lot of latency (and I
know you [Mitch] and Robert Myers are dead set against latency too)
but the staging data movement provided a lot of opportunity to overlap
with real computation.
Yes.
Post by George Neuner
YMMV, but I think pipeline vector units need to make a comeback. I am
not particularly happy at the thought of using them again, but I don't
see a good way around it.
NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.


Regards,
Nick Maclaren.
jacko
2010-07-20 14:54:58 UTC
Permalink
Post by George Neuner
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers.  Yes there was a lot of latency (and I
know you [Mitch] and Robert Myers are dead set against latency too)
but the staging data movement provided a lot of opportunity to overlap
with real computation.
Yes.
Post by George Neuner
YMMV, but I think pipeline vector units need to make a comeback.  I am
not particularly happy at the thought of using them again, but I don't
see a good way around it.
NO chance!  It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.
Regards,
Nick Maclaren.
Maybe he needs an FPGA card with many single-cycle Booth multipliers
on chip. A bit slow due to routing delays, but highly parallel.
There really should be a way to queue mul-mac pairs with a reset to
zero (or the nilpotent).
George Neuner
2010-07-21 22:18:19 UTC
Permalink
Post by n***@cam.ac.uk
Post by George Neuner
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. ...
... the staging data movement provided a lot of opportunity to
overlap with real computation.
YMMV, but I think pipeline vector units need to make a comeback.
NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.
Hi Nick,

Actually I'm a bit skeptical of the cost argument ... obviously it's
not feasible to make large banks of vector registers fast enough for
multiple GHz FPUs to fight over, but what about a vector FPU with a
few dedicated registers?

There are a number of (relatively) low-cost DSPs in the up-to-~300MHz
range that have large (32KB and up, 4K double floats) 1ns dual-ported
SRAM, are able to sustain 1 or more flops/SRAM cycle, and which match
or exceed the sustainable FP performance of much faster CPUs. Some of
these DSPs are $5-$10 in industrial quantities and some are even cheap
in hobby quantities.

Given the economics of mass production, it would seem that creating
some kind of vector coprocessor combining an FPU, address units, and a
few banks of SRAM with host DMA access should be relatively cheap if
the FPU is kept under 500MHz.

Obviously, it could not have the peak performance of the GHz host FPU,
but a suitable problem could easily keep several such processors
working. Crays were a b*tch, but when the problem suited them ...
With several vector coprocessors on a plug-in board, this isn't very
different from the GPU model, other than having more flexibility in
staging data.


The other issue is this: what exactly are we talking about in this
thread ... are we trying to have the fastest FPUs possible or do we
want a low cost machine with (very|extremely) high throughput?

No doubt I've overlooked something (or many things 8) pertaining to
economics or politics or programming - I don't think there is any
question that there are plenty of problems (or subproblems) suitable
for solving on vector machines. So please feel free to enlighten me.

George
jacko
2010-07-21 23:31:40 UTC
Permalink
Post by George Neuner
No doubt I've overlooked something (or many things 8) pertaining to
economics or politics or programming - I don't think there is any
question that there are plenty of problems (or subproblems) suitable
for solving on vector machines.  So please feel free to enlighten me.
I think it's that FPU speed is not the bottleneck at present. It's
keeping it fed with data, and shifting data around memory in suitably
ordered patterns. Maybe not fetching data as a linear cache-line unit,
but with a generic step n (not just powers of 2) as a generic scatter/
gather, with n changeable on the virtual cache line before a save, say.

Maybe it's about what an address is and can it specify process to
smart memories on read and write.

It's definitely about reducing latency when this is possible, or about
how this may be possible.

And it's about cache structures which may help in any or all of the
above, by preventing an onset of thrashing.

SIMD is part of this, as the program size drops. But even vector units
have to be kept fed with data.
n***@cam.ac.uk
2010-07-22 08:46:44 UTC
Permalink
Post by George Neuner
Post by n***@cam.ac.uk
Post by George Neuner
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. ...
... the staging data movement provided a lot of opportunity to
overlap with real computation.
YMMV, but I think pipeline vector units need to make a comeback.
NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.
Actually I'm a bit skeptical of the cost argument ... obviously it's
not feasible to make large banks of vector registers fast enough for
multiple GHz FPUs to fight over, but what about a vector FPU with a
few dedicated registers?
'Tain't the computation that's the problem - it's the memory access,
as "jacko" said.

Many traditional vector units had enough bandwidth to keep an AXPY
running at full tilt - nowadays, one would need 1 TB/sec for a low
end vector computer, and 1 PB/sec for a high-end one. Feasible,
but not cheap.
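Nick's AXPY figure is easy to reproduce: y = a*x + y moves three doubles per element (load x, load y, store y) for two flops, i.e. 12 bytes of traffic per flop. A sketch of what that implies for sustained bandwidth:

```python
# Bandwidth needed to run DAXPY (y = a*x + y) at full tilt:
# 3 doubles moved per element, 2 flops per element.

BYTES = 8                                # double precision

def axpy_bw_gbs(gflops):
    """GB/s of memory traffic to sustain DAXPY at `gflops` GFlop/s."""
    elements_per_s = gflops / 2          # 2 flops per element
    return elements_per_s * 3 * BYTES    # 3 doubles moved per element

print(axpy_bw_gbs(12))      # a 12 GFlop/s core needs 144 GB/s
print(axpy_bw_gbs(100e3))   # a 100 TFlop/s machine: 1.2e6 GB/s = 1.2 PB/s
```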

Also, the usefulness of such things was very dependent on whether
they would allow 'fancy' vector operations, such as strided and
indexed vectors, gather/scatter and so on. The number of programs
that need only simple vector operations is quite small.

I believe that, by the end, 90% of the cost of such machines was in
the memory management and only 10% in the computation. At very
rough hand-waving levels.


Regards,
Nick Maclaren.
George Neuner
2010-07-22 16:39:27 UTC
Permalink
Also, the usefulness of [vector processors] was very dependent on
whether they would allow 'fancy' vector operations, such as strided
and indexed vectors, gather/scatter and so on. The number of programs
that need only simple vector operations is quite small.
I believe that, by the end, 90% of the cost of such machines was in
the memory management and only 10% in the computation. At very
rough hand-waving levels.
I get that ... but the near impossibility (with current technology) of
feeding several FPUs - vector or otherwise - from a shared memory
feeds right back into my argument that they need separate memories.

Reading further down the thread, Robert seems to be mainly concerned
with keeping his FPUs fed in a shared memory environment. I don't
really care whether the "Right Thing" architecture is a loose gang of
AL/FP units with programmable interconnects fed by scatter/gather DMA
channels[*]. Data placement and staging are, IMO, mainly software
issues (though, naturally, I appreciate any help the hardware can
give).

George

[*] Many DSPs can pipe one or more of their DMA channels through the
ALU to do swizzling, packing/unpacking and other rearrangements. Some
DSP permit general ALU operations on the DMA stream for use in real
time data capture.
n***@cam.ac.uk
2010-07-22 16:49:27 UTC
Permalink
Post by George Neuner
Also, the usefulness of [vector processors] was very dependent on
whether they would allow 'fancy' vector operations, such as strided
and indexed vectors, gather/scatter and so on. The number of programs
that need only simple vector operations is quite small.
I believe that, by the end, 90% of the cost of such machines was in
the memory management and only 10% in the computation. At very
rough hand-waving levels.
I get that ... but the near impossibility (with current technology) of
feeding several FPUs - vector or otherwise - from a shared memory
feeds right back into my argument that they need separate memories.
And, as I said, the killer with that is the very small number of
programs that can make use of such a system. The requirement for
'fancy' vector operations was largely to provide facilities to
transfer elements between locations in vectors.


Regards,
Nick Maclaren.
jacko
2010-07-22 17:58:34 UTC
Permalink
Post by n***@cam.ac.uk
Post by George Neuner
Also, the usefulness of [vector processors] was very dependent on
whether they would allow 'fancy' vector operations, such as strided
and indexed vectors, gather/scatter and so on.  The number of programs
that need only simple vector operations is quite small.
I believe that, by the end, 90% of the cost of such machines was in
the memory management and only 10% in the computation.  At very
rough hand-waving levels.
I get that ... but the near impossibility (with current technology) of
feeding several FPUs - vector or otherwise - from a shared memory
feeds right back into my argument that they need separate memories.  
And, as I said, the killer with that is the very small number of
programs that can make use of such a system.  The requirement for
'fancy' vector operations was largely to provide facilities to
transfer elements between locations in vectors.
Regards,
Nick Maclaren.
Let's call this the data shovelling to data processing ratio. SPR.
George Neuner
2010-07-22 15:37:56 UTC
Permalink
On Wed, 21 Jul 2010 18:18:19 -0400, George Neuner
Some stuff
I think the exchange among Robert, Mitch and Andy that just appeared
answered most of my question.

George
Robert Myers
2010-07-22 18:41:22 UTC
Permalink
Post by George Neuner
On Wed, 21 Jul 2010 18:18:19 -0400, George Neuner
Some stuff
I think the exchange among Robert, Mitch and Andy that just appeared
answered most of my question.
I feel like I'm walking a tightrope in some of these discussions.

At a time when vector processors were still a fading memory (even in
the US), an occasional article would mention that "vector computers"
were easier to use for many scientists than thousands of COTS
processors hooked together by whatever.

The real problem is not in how the computation is organized, but in
how memory is accessed. Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.

Robert.
n***@cam.ac.uk
2010-07-22 18:47:43 UTC
Permalink
Post by Robert Myers
At a time when vector processors were still a fading memory (even in
the US), an occasional article would mention that "vector computers"
were easier to use for many scientists than thousands of COTS
processors hooked together by whatever.
Yup. And more recently, among the very few people who used them.
Post by Robert Myers
The real problem is not in how the computation is organized, but in
how memory is accessed. Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.
Hitachi did pretty well with the SR2201 and SR8000. But it was
still expensive.


Regards,
Nick Maclaren.
MitchAlsup
2010-07-22 22:26:28 UTC
Permalink
Post by Robert Myers
The real problem is not in how the computation is organized, but in
how memory is accessed.  Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.
One of the reasons CRAY machines lived <somewhat> longer than some of
the other vector supercomputers was that <at least> the CRAY I/O system
could operate at vastly higher performance levels than its Japanese
counterparts. Thus, while the CPU was crunching, the I/O system could
be shoveling data around at vast I/O rates, so that when the CPU was
done crunching, the next unit of work was ready to be tackled.

This is not easy with TeraByte sized data footprints and sub-GigaByte
main memory footprints.

Mitch
Thomas Womack
2010-07-23 17:19:48 UTC
Permalink
Post by Robert Myers
At a time when vector processors were still a fading memory (even in
the US), an occasional article would mention that "vector computers"
were easier to use for many scientists than thousands of COTS
processors hooked together by whatever.
Yes, this is certainly true. Earth Simulator demonstrated that you
could build a pretty impressive vector processor, which (Journal of
the Earth Simulator - one of the really good resources since it talks
about both the science and the implementation issues) managed 90%
performance on lots of tasks, partly because using it was very
prestigious and you weren't allowed to use the whole machine on jobs
which didn't manage very high performance on a 10% subset. But it was
a $400 million project to build a 35Tflops machine, and the subsequent
project to spend a similar amount this decade on a heftier machine
came to nothing.

I've worked at an establishment with an X1, and it was woefully
under-used because the problems that came up didn't fit the vector
organisation terribly well; it is not at all clear why they bought the
X1 in the first place.
Post by Robert Myers
The real problem is not in how the computation is organized, but in
how memory is accessed. Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.
I _think_ this starts, particularly for the crystalline memory access
case, to be almost a language-design issue.

Tom
Robert Myers
2010-07-23 18:30:32 UTC
Permalink
Post by Robert Myers
At a time when vector processors were still a fading memory (even in
the US), an occasional article would mention that "vector computers"
were easier to use for many scientists than thousands of COTS
processors hooked together by whatever.
Yes, this is certainly true.  Earth Simulator demonstrated that you
could build a pretty impressive vector processor, which (Journal of
the Earth Simulator - one of the really good resources since it talks
about both the science and the implementation issues) managed 90%
performance on lots of tasks, partly because using it was very
prestigious and you weren't allowed to use the whole machine on jobs
which didn't manage very high performance on a 10% subset.  But it was
a $400 million project to build a 35Tflops machine, and the subsequent
project to spend a similar amount this decade on a heftier machine
came to nothing.
I've worked at an establishment with an X1, and it was woefully
under-used because the problems that came up didn't fit the vector
organisation terribly well; it is not at all clear why they bought the
X1 in the first place.
So, if you can cheaply build a machine with lots of flops that
sometimes you can't use, who cares if the flops you *can* use are
still more plentiful and less expensive than, say, an Earth Simulator
style effort, especially if there are lots of problems for which the
magnificently awesome vector processor is useless? That's essentially
the argument to defend the purchasing decisions that are being made at
a national level in the US.

I would agree, if only I could wrestle a tiny concession from the
empire-builders. The machines they are building are *not* scalable,
and I wish they'd stop claiming they are. It would be like my cable
company claiming that its system is scalable because it can hang as
many users off the same cable as it can get away with. It's all very
well until too many try to use the bandwidth at once.

Having the bandwidth per flop drop to zero is no different from having
the bandwidth per user drop to zero, and even my cable company, which
has lots of gall, wouldn't have the gall to claim that it's not a
problem and that they don't have to worry about it, because they do.
Post by Robert Myers
The real problem is not in how the computation is organized, but in
how memory is accessed.  Replicating the memory access style of the
early Cray architectures isn't possible beyond a very limited memory
size, but it sure would be nice to figure out a way to simulate the
experience.
I _think_ this starts, particularly for the crystalline memory access
case, to be almost a language-design issue.
Engineers apparently find Matlab easy to use. No slight to Matlab,
but the disconnect with the hardware can be painful. I don't think
the hardware and software issues can be separated.

Robert.
George Neuner
2010-07-24 09:10:10 UTC
Permalink
On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
I don't think the hardware and software issues can be separated.
Thank goodness someone said that.

At least where HPC is concerned, I've been convinced for some time
that we are fighting hardware rather than leveraging it. I've spent a
number of years with DSPs and FPGAs and I've come to believe that we
(or at least compilers) need to be deliberately programming memory
interfaces as well as the ALU/FPU operations.

The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful. I
agree that there *does* need to be support for those features, but I
believe it should be in the memory subsystem rather than in the
processor.

I'm supposing that there are vector registers accessible to
scatter/gather DMA and further supposing that there are several DMA
channels - ideally 1 channel per register. The programmer's indexing
loop code is compiled into instructions that program DMA to
read/gather a block of operands into the source registers, execute the
vector operation(s), and finally DMA write/scatters the results back
to memory.
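George's gather / operate / scatter cycle can be sketched in software; plain lists stand in for memory and the vector register, and the function names are hypothetical:

```python
# Software sketch of the cycle described above: DMA gathers operands
# into a "vector register", the vector op runs, and results are
# scattered back to memory. Lists model memory; nothing here is a
# real DMA API.

def dma_gather(memory, indices):
    return [memory[i] for i in indices]       # stage operands in

def dma_scatter(memory, indices, register):
    for i, v in zip(indices, register):       # stage results out
        memory[i] = v

memory = [float(i) for i in range(16)]
idx = [0, 3, 6, 9]                            # a strided access pattern

vx = dma_gather(memory, idx)                  # gather into the register
vy = [2.0 * v for v in vx]                    # the vector operation itself
dma_scatter(memory, idx, vy)                  # scatter the results back

assert memory[3] == 6.0 and memory[6] == 12.0
```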

I do understand that problems have to have "crystalline" access
patterns and enough long vector(izable) operations to absorb the
latency of data staging. I know there are plenty of problems that
don't fit that model.

The main issue would be having a main memory that could tolerate
concurrent DMA - but I know that lots of things are possible with
appropriate design: I once worked with a system which had a
proprietary FPGA based memory controller that sustained 1400MB/s - 700
in and out - using banked 100MHz SDRAM (the old kind, not DDR).

I used to have a 40MHz ADI SHARC 21060 (120 MFlops sustained) on a bus-
mastering PCI board in a 450MHz Pentium II desktop. I had a number of
programs that turned the DSP into a long vector processor (512 to 4096
element "registers") and used overlapped DMA to move data in and out
while processing. Given a large enough data set that 40MHz DSP could
handily outperform the host's 450MHz CPU.

George
n***@cam.ac.uk
2010-07-24 10:01:18 UTC
Permalink
Post by George Neuner
On Fri, 23 Jul 2010 11:30:32 -0700 (PDT), Robert Myers
I don't think the hardware and software issues can be separated.
Thank goodness someone said that.
At least where HPC is concerned, I've been convinced for some time
that we are fighting hardware rather than leveraging it. I've spent a
number of years with DSPs and FPGAs and I've come to believe that we
(or at least compilers) need to be deliberately programming memory
interfaces as well as the ALU/FPU operations.
The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful. I
agree that there *does* need to be support for those features, but I
believe it should be in the memory subsystem rather than in the
processor.
I believe that you have taken the first step on the path to True
Enlightenment, but need to have the courage of your convictions
and proceed further on :-)

I.e. I agree, and what we need is architectures which are designed
to provide data management first and foremost, and which attach
the computation onto that. I.e. turn the traditional approach on
its head. And I don't think that is limited to HPC, either.
I can't see any of the decent computer architects having any great
problem with this concept, but I doubt that the benchmarketers and
execudroids would swallow it.

It would also need a comparable revolution in programming languages
and paradigms, though there have been a lot of exploratory ones that
show the concepts are viable.


Regards,
Nick Maclaren.
Robert Myers
2010-07-24 17:02:02 UTC
Permalink
Post by n***@cam.ac.uk
Post by George Neuner
At least where HPC is concerned, I've been convinced for some time
that we are fighting hardware rather than leveraging it.  I've spent a
number of years with DSPs and FPGAs and I've come to believe that we
(or at least compilers) need to be deliberately programming memory
interfaces as well as the ALU/FPU operations.
The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful.  I
agree that there *does* need to be support for those features, but I
believe it should be in the memory subsystem rather than in the
processor.
I believe that you have taken the first step on the path to True
Enlightenment, but need to have the courage of your convictions
and proceed further on :-)
I.e. I agree, and what we need is architectures which are designed
to provide data management first and foremost, and which attach
the computation onto that.  I.e. turn the traditional approach on
its head.  And I don't think that is limited to HPC, either.
I can't see any of the decent computer architects having any great
problem with this concept, but I doubt that the benchmarketers and
execudroids would swallow it.
Ok. So here's a half-baked guess.

The reason that doesn't happen isn't to be found in the corner office,
but in your thread about RDMA and Andy's comments in that thread, in
particular.

Today's computers are *not* designed around computation, but around
coherent cache. Now that the memory controller is on the die, the
takeover is complete. Nothing moves efficiently without notice and
often unnecessary involvement of the real Von Neumann bottleneck,
which is the cache.

Cache snooping is the one ring that rules them all.

I doubt if an implausible journey through Middle Earth by fantastic
creatures would help, but probably some similarly wild exercise of the
imagination is called for.

Currently, you cluster processors when you can't conveniently jam them
all into a single coherence domain. The multiple coherence domains
that result are an annoyance to someone like me who would desperately
like to think in terms of one big, flat memory space, but they also
allow new possibilities, like moving data around without bothering
other processors and other coherence domains. Maybe you want multiple
coherence domains even when you aren't forced into it by the size of a
board or a rack or a mainframe.

Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.

Robert.
n***@cam.ac.uk
2010-07-24 20:24:52 UTC
Permalink
Post by Robert Myers
Today's computers are *not* designed around computation, but around
coherent cache. Now that the memory controller is on the die, the
takeover is complete. Nothing moves efficiently without notice and
often unnecessary involvement of the real Von Neumann bottleneck,
which is the cache.
Yes and no. Their interfaces are still designed around computation,
and the coherent cache is designed to give the impression that
programmers need not concern themselves with programming memory
access - it's all transparent.
Post by Robert Myers
Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That is, indeed, part of what I do mean.


Regards,
Nick Maclaren.
MitchAlsup
2010-07-24 23:52:22 UTC
Permalink
Post by Robert Myers
Today's computers are *not* designed around computation, but around
coherent cache.  Now that the memory controller is on the die, the
takeover is complete.  Nothing moves efficiently without notice and
often unnecessary involvement of the real Von Neumann bottleneck,
which is the cache.
Yes and no.  Their interfaces are still designed around computation,
and the coherent cache is designed to give the impression that
programmers need not concern themselves with programming memory
access - it's all transparent.
If cache were transparent, then instructions such as PREFETCH<...>
would not exist! Streaming stores would not exist!
Compilers would not go to extraordinary pains to use these, or to find
out the configuration parameters of the cache hierarchy.
Memory Controllers would not be optimized around cache lines.
Memory Ordering would not be subject to Cache Coherence rules.

But <as usual> I digress....

I think what Robert is getting at is that lumping everything under a
coherent cache is running into a vonNeumann wall.

Mitch
Brett Davis
2010-07-25 04:31:35 UTC
Permalink
In article
Post by MitchAlsup
If cache were transparent, then instructions such as PREFETCH<...>
would not exist! Streaming stores would not exist!
Compilers would not go to extraordinary pains to use these
I have never in my life seen a compiler issue a PREFETCH instruction.
I have several times mocked the usefulness of PREFETCH as implemented
for CPUs in the embedded market. (Locking up one of the two read
ports makes good performance impossible without resorting to assembly.)

I would think that the fetch ahead engine on high end x86 and POWER
would make PREFETCH just about as useless, except to prime the pump
at the start of a new data set being streamed in.

How is PREFETCH used by which compilers today?
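For reference, hand-inserted software prefetch - the thing such compiler passes try to automate - looks like this with GCC/Clang's __builtin_prefetch (a real intrinsic; the prefetch distance constant is an illustrative guess, and tuning it per machine is exactly the hard part):

```c
#include <stddef.h>

/* Prefetch distance in elements: ~8 cache lines ahead for 8-byte
 * doubles on 64-byte lines. An illustrative guess, not a tuned value. */
#define PFDIST 64

/* Dot product with explicit software prefetch. __builtin_prefetch is
 * a GCC/Clang intrinsic: arg 2 is read(0)/write(1), arg 3 is expected
 * temporal locality (0 = streaming, don't pollute the cache). */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PFDIST < n) {                     /* stay inside the arrays */
            __builtin_prefetch(&a[i + PFDIST], 0, 0);
            __builtin_prefetch(&b[i + PFDIST], 0, 0);
        }
        s += a[i] * b[i];
    }
    return s;
}
```

On a machine with a capable hardware stride prefetcher this buys little, as Brett suggests; it mostly pays off at the start of a stream or on access patterns the hardware cannot predict.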
Rick Jones
2010-07-26 17:16:16 UTC
Permalink
Post by Brett Davis
I have never in my life seen a compiler issue a PREFETCH instruction.
I have several times mocked the usefulness of PREFETCH as implemented
for CPUs in the embedded market. (Locking up one of the two read
ports makes good performance impossible without resorting to assembly.)
I would think that the fetch ahead engine on high end x86 and POWER
would make PREFETCH just about as useless, except to prime the pump
at the start of a new data set being streamed in.
How is PREFETCH used by which compilers today?
Exactly how it is used I do not know, but this:

http://www.spec.org/cpu2006/results/res2010q1/cpu2006-20100301-09740.html

and the linked description of the -opt-prefetch flag:

http://www.spec.org/cpu2006/results/res2010q1/cpu2006-20100301-09740.flags.html#user_CXXbase_f-opt-prefetch

Suggests that compilers do have that feature.

rick jones
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
n***@cam.ac.uk
2010-07-26 17:47:06 UTC
Permalink
Post by Rick Jones
Post by Brett Davis
I have never in my life seen a compiler issue a PREFETCH instruction.
I have several times mocked the usefulness of PREFETCH as implemented
for CPUs in the embedded market. (Locking up one of the two read
ports makes good performance impossible without resorting to assembly.)
I would think that the fetch ahead engine on high end x86 and POWER
would make PREFETCH just about as useless, except to prime the pump
at the start of a new data set being streamed in.
How is PREFETCH used by which compilers today?
http://www.spec.org/cpu2006/results/res2010q1/cpu2006-20100301-09740.html
http://www.spec.org/cpu2006/results/res2010q1/cpu2006-20100301-09740.flags.html#user_CXXbase_f-opt-prefetch
Suggests that compilers do have that feature.
Don't believe everything that you are told! I have set such flags
for several compilers on several architectures, and looked for the
inserted instructions. Sometimes they are inserted, but often not.
My tests included comparing the sizes of a large number of modules
of typical scientific code, and giving them trivial examples which
were ideally suited for the technique.

A suspicious and cynical old sod, aren't I?


Regards,
Nick Maclaren.
n***@cam.ac.uk
2010-07-25 08:16:55 UTC
Permalink
Post by MitchAlsup
Post by Robert Myers
Today's computers are *not* designed around computation, but around
coherent cache. Now that the memory controller is on the die, the
takeover is complete. Nothing moves efficiently without notice and
often unnecessary involvement of the real Von Neumann bottleneck,
which is the cache.
Yes and no. Their interfaces are still designed around computation,
and the coherent cache is designed to give the impression that
programmers need not concern themselves with programming memory
access - it's all transparent.
If cache were transparent, then instructions such as PREFETCH<...>
would not exist! Streaming stores would not exist!
Few things in real life are absolute. Those are late and rather
unsatisfactory extras that have never been a great success.
Post by MitchAlsup
Compilers would not go to extraordinary pains to use these, or to find
out the configuration parameters of the cache hierarchy.
Most don't. The few that do don't succeed very well.
Post by MitchAlsup
Memory Controllers would not be optimized around cache lines.
Memory Ordering would not be subject to Cache Coherence rules.
No, those support my point. Their APIs to the programmer are
intended to be as transparent as possible - the visibility exists
only because full transparency is not feasible.
Post by MitchAlsup
I think what Robert is getting at is that lumping everything under a
coherent cache is running into a vonNeumann wall.
Precisely. That's been clear for a long time. My point is that
desperate solutions need desperate remedies, and it's about time
that we accepted that the wall is at the end of a long cul de sac.
There isn't any way round or through, so the only technical
solution is to back off a long way and try a different route.

No, I don't expect to live to see it happen.


Regards,
Nick Maclaren.
Andrew Reilly
2010-07-26 03:43:07 UTC
Permalink
Post by MitchAlsup
I think what Robert is getting at is that lumping everything under a
coherent cache is running into a vonNeumann wall.
Coherence is clearly complicated, but it doesn't seem necessarily to be
sequential. Are there theoretical limits to how parallelisable coherence
can be? Is the main issue speed-of-light limits to round-trip
communication between distributed cache controllers?

Cheers,
--
Andrew
n***@cam.ac.uk
2010-07-26 07:05:39 UTC
Permalink
Post by Andrew Reilly
Post by MitchAlsup
I think what Robert is getting at is that lumping everything under a
coherent cache is running into a vonNeumann wall.
Coherence is clearly complicated, but it doesn't seem necessarily to be
sequential. Are there theoretical limits to how parallelisable coherence
can be? Is the main issue speed-of-light limits to round-trip
communication between distributed cache controllers?
Yes, and no, respectively. However, the theoretical limits that I
know of in this area are much weaker than the practical ones.


Regards,
Nick Maclaren.
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-25 08:42:12 UTC
Permalink
Post by Robert Myers
Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Myers
2010-07-25 18:05:03 UTC
Permalink
On Jul 25, 4:42 am, Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Maybe you want more programmable control over coherence domains.  If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(
A PhD thesis that we both saw presented

www.bunniestudios.com/bunnie/phdthesis.pdf

dealt with a similar train of thought, except taking the exact
opposite turn: doing even more to hide the hardware from the user of
distributed memory. I understand much better now what he was trying
to do. ;-)

Robert.
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-25 18:23:14 UTC
Permalink
On Jul 25, 4:42 am, Terje Mathisen<"terje.mathisen at tmsw.no">
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Robert Myers
Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(
A PhD thesis that we both saw presented
www.bunniestudios.com/bunnie/phdthesis.pdf
Hmmm... Andrew "Bunnie" Huang I presume?

Yes!

I remember seeing that thesis, but I was shocked just now when I
realized that it was from way back in 2002!
dealt with a similar train of thought, except taking the exact
opposite turn: doing even more to hide the hardware from the user of
distributed memory. I understand much better now what he was trying
to do. ;-)
Trying to define an architecture which could bypass as many of the
(data) transport-related problems as possible, by making it extremely
cheap to migrate sw threads instead?

Terje
Robert.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Robert Myers
2010-07-25 19:06:55 UTC
Permalink
On Jul 25, 2:23 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
On Jul 25, 4:42 am, Terje Mathisen<"terje.mathisen at tmsw.no">
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Maybe you want more programmable control over coherence domains.  If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(
A PhD thesis that we both saw presented
www.bunniestudios.com/bunnie/phdthesis.pdf
Hmmm... Andrew "Bunnie" Huang I presume?
Yes!
I remember seeing that thesis, but I was shocked just now when I
realized that it was from way back in 2002!
dealt with a similar train of thought, except taking the exact
opposite turn: doing even more to hide the hardware from the user of
distributed memory.  I understand much better now what he was trying
to do. ;-)
Trying to define an architecture which could bypass as many of the
(data) transport-related problems as possible, by making it extremely
cheap to migrate sw threads instead?
Move the instructions to the data, I gather, rather than the other way
around.

That wouldn't help my problems.

Robert.
n***@cam.ac.uk
2010-07-26 08:31:35 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Robert Myers
Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(
I believe that tackling it as a "computer science" problem is a large
part of the reason that it has never got anywhere. The thesis posted
later is a fairly typical example of the better research - let's skip
over the worse research, holding our noses and averting our gaze!
The killer isn't that it wouldn't work. The killer is how to map a
sufficient class of problems to it to make it worthwhile - the three
examples used are all well-known to be easily optimised by a wide
range of architectures (parallel and other). And, like Robert, I
don't see it doing so AS IT STANDS - though it might well be a
starting point for a viable design.

I believe that something COULD be done, but I don't believe that
anything WILL be done for the foreseeable future. Benchmarketing and
existing spaghetti code rule too much decision making. Also, as I
have posted ad tedium, the architecture is of little use without
tackling the programming paradigms used.


Regards,
Nick Maclaren.
George Neuner
2010-07-26 19:13:58 UTC
Permalink
On Sun, 25 Jul 2010 10:42:12 +0200, Terje Mathisen <"terje.mathisen at
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Robert Myers
Maybe you want more programmable control over coherence domains. If
you're not going to scrap cache and cache snooping, maybe you can
wrestle some control away from the hardware and give it to the
software.
That sounds like software-controlled distributed shared memory, a
concept that generates a lot more research papers and PhDs than actual
useful products, at least so far. :-(
The hardware-controlled version, the KSR-1, went belly up.

George

Andy Glew <"newsgroup at comp-arch.net">
2010-07-25 05:06:57 UTC
Permalink
Post by George Neuner
The problem most often cited for vector units is that they need to
support non-consecutive and non-uniform striding to be useful. I
agree that there *does* need to be support for those features, but I
believe it should be in the memory subsystem rather than in the
processor.
This is why, in a recent post, I have proposed creating a memory
subsystem and interconnect that is designed around scatter/gather.

I think this can be done fairly straightforwardly for the interconnect.

It is harder to do for the DRAMs themselves. Modern DRAMs are oriented
towards providing bursts of 4 to 8 cycles worth of data. If you have 64
to 128 bit wide interfaces, that means that you are always going to be
reading or writing 32 to 128 consecutive, contiguous bytes at a go.
Whether you need it or not.

--

I'm still happy to have processor support to give me the scatter
gathers. Either in the form of vector instructions, or GPU-style
SIMD/SIMT/CIMT, or in the form of out-of-order execution. Lacking
these, with in-order scalar processing you have to do a lot of work to
get to where you can start scatter/gather - circa 8 times more
processors being needed.

But once you have got s/g requests, howsoever generated, then the real
action is in the interconnect. Or it could be.

--

Having said that all modern DRAMs are oriented towards 4 to 8 cycle
bursts...

Maybe we can build scatter/gather friendly memory subsystems.

Instead of building 64-128-256 bit wide DRAM channels, maybe we should
be building 8 or 16 bit wide DRAM channels. That can give us 64 or 128
bits in any transfer, over 4 to 8 clocks. It will add to latency, but
maybe the improved s/g performance will be worth it.

Such a narrow DRAM channel system would probably consume at least 50%
more power than a wide DRAM system. Would that be outweighed by wasting
less bandwidth on unnecessary parts of cachelines?

It would also cost a lot more, not using commodity DIMMs.
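The bandwidth half of that trade-off is easy to put in numbers. A sketch, using the burst length and channel widths above and a deliberately pessimal gather in which every element lands in a different burst:

```c
/* Bytes moved vs. bytes used for a worst-case gather of 8-byte
 * elements, where every element hits a different DRAM burst. The
 * 8-beat burst and the channel widths are the figures quoted above;
 * the all-misses gather pattern is a deliberately pessimal assumption. */
static int bytes_per_burst(int channel_bytes, int beats) {
    return channel_bytes * beats;     /* minimum transfer granularity */
}

/* Integer percentage of each burst that the gathered element uses. */
static int gather_efficiency_pct(int channel_bytes, int beats, int elem_bytes) {
    return 100 * elem_bytes / bytes_per_burst(channel_bytes, beats);
}

/* 128-bit channel (16 B) x 8 beats: 128 B moved per 8 useful ->  6%.
 *  64-bit channel ( 8 B) x 8 beats:  64 B moved per 8 useful -> 12%.
 *  16-bit channel ( 2 B) x 8 beats:  16 B moved per 8 useful -> 50%. */
```

On those assumptions the 16-bit channel moves an eighth of the data per gathered element that the 128-bit channel does - that is the bandwidth saving being weighed against the extra power and the loss of commodity DIMMs.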
Andy Glew <"newsgroup at comp-arch.net">
2010-07-23 19:01:59 UTC
Permalink
Post by George Neuner
Post by n***@cam.ac.uk
Post by George Neuner
ISTM bandwidth was the whole point behind pipelined vector processors
in the older supercomputers. ...
... the staging data movement provided a lot of opportunity to
overlap with real computation.
YMMV, but I think pipeline vector units need to make a comeback.
NO chance! It's completely infeasible - they were dropped because
the vendors couldn't make them for affordable amounts of money any
longer.
Actually I'm a bit skeptical of the cost argument ... obviously it's
not feasible to make large banks of vector registers fast enough for
multiple GHz FPUs to fight over, but what about a vector FPU with a
few dedicated registers?
I have been reading this thread somewhat bemused.

To start, full disclosure: I have proposed having pipelined vector
instructions making a comeback, in my postings to this group and my
presentations, e.g. at Berkeley Parlab (linked to on some web page).
Reason: not to improve performance, but to reduce costs compared to what
is done now.

What is done now?

There are CPUs with FPUs pipelined circa 5-7 cycles deep. Commonly 2
sets of 4 32-bit SP elements wide, sometimes 8 or 16 wide.

There are GPUs with 256-1024 SP FPUs on them. I'm not so sure about
pipeline depth, but it is indicated to be deep by recommendations that
dependent ops not be closer together than 40 cycles.

The GPUs often have 16KB of registers for each group of 32 or so FPUs.

I.e. we are building systems with more FPUs, more deeply pipelined FPUs,
and more registers than the vector machines I am most familiar with,
Cray-1 era machines. I don't know by heart the specs for the last few
generations of vector machines before they died back, but I suspect that
modern CPUs and, especially, GPUs, are comparable.

Except
(1) they are not organized as vector machines, and
(2) the memory subsystems are less powerful, in proportion to the FPUs,
than in the old days.

I'm going to skate past the memory subsystems since we have talked about
this at length elsewhere, and since that will be the topic of Robert
Myers' new mailing list. Except to say (a) high end GPUs often have
memory separate from the main CPU memory, made with more expebsive
GDRAMs rather than conventional DRAMs, and (b) modern DRAMs emphasize
sequential burst accesses in ways that Cray-1's SRAM based memory
subsystem did not. Basically, commodity DRAM does not lend itself to
non-unit-stride access patterns. And building a big system out of
non-commodity memory is much more prohibitive than back in the day of
the Cray-1. This was becoming evident in the last years of the old
vector processors.

But let's get back to the fact that these modern machines, with more
FPUs more deeply pipelined, and with more registers, than the classic
vector machines, are not organized as pipelined vector machines. To
some limited extent they are small parallel vector machines - operating
on 4 32b SP in a given operation, in parallel in one instruction. The
actual FPU operation is pipelined. They may be a small degree of vector
pipelining, e.g. spreading an 8 element vector over 2 cycles. But not
the same degree of vector pipelining as in the old days, where a single
instruction may be pipelined over 16 or 64 cycles.

Why aren't modern CPUs and GPUs vector pipelined? I think one of the
not widely recognized things is that we are significantly better at
pipelining now than in the old days. The Cray-1 had 8 gate delays per
cycle. I suspect that one of the motivations for vectors was that it
was a challenge to decode back to back dependent instructions at that
rate, whereas it was easier to decode an instruction, set up a vector,
and then run that vector instruction for 16 to 64 cycles. Yes, arranging
chaining, and yes, I know that one of the Cray-1's claims to fame was
better scalar instruction performance.

If you can run individual scalar instructions as fast as you can run
vector instructions, giving the same FLOPS, wouldn't you rather? Why
use vectors rather than scalars?
I'll answer my own question: (a) vectors allow you to use the same
number of register bits to specify a lot more registers -
#vector-registers * #elements per vector. (b) vectors save power - you
only decode the instruction once, and the decoding and scheduling logic
gets amortized over the entire vector.
But if you aren't increasing the register set or adding new types
of registers, and if you aren't that worried about power, then you don't
need vectors.
But we are worried about power, aren't we?
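Point (a) is easy to quantify with Cray-1 figures (8 vector registers of 64 elements each, named by a 3-bit field):

```c
/* How many architectural words a register field of a given width can
 * reach when each register name denotes a whole vector. Figures below
 * are the Cray-1's: 3-bit fields, 64-element vector registers. */
static int words_reachable(int field_bits, int elements_per_vector) {
    return (1 << field_bits) * elements_per_vector;
}

/* words_reachable(3, 64) == 512: a 3-bit field reaches 512 words,
 * where a scalar ISA would need 9-bit fields for the same reach.
 * Point (b) is the same arithmetic applied to the front end: one
 * instruction decode amortized over all 64 elements. */
```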

Why aren't modern GPUs vector pipelined? Basically because they are
SIMD, or, rather, SIMD in its modern evolution of SIMT, CIMT, Coherent
Threading. This nearly always gets 4 cycles' worth of amortization of
instruction decode and schedule cost. And it seems to be easier to
program. And it promotes portability.

When I started working on GPUs, I thought, like many on this newsgroup,
that vector ISAs were easier to program than SIMD GPUs. I was quite
surprised to find out that this is NOT the case. Graphics programmers
consistently prefer the SIMD programming model. Or, rather, they
consistently prefer to have lots of little threads executing scalar or
moderate VLIW or short vector instructions, rather than fewer,
heavyweight, threads executing longer vector instructions. Partly
because their problems tend to be short vector, 4 element, rather than
long vector operations. Perhaps because SIMD is what they are familiar
with - although, again I emphasize that SIMT/CIMT is not the same as
classic Illiac-IV SIMD. I think that one of the most important aspects
is that SIMD/SIMT/CIMT code is more portable - it runs fairly well on
both GPUs and CPUs. And it runs on GPUs no matter whether the parallel
FPUs, what would be the vector FPUs, are 16 wide x 4 cycles, or 8 wide x
8 cycles, or ...
Andy Glew <"newsgroup at comp-arch.net">
2010-07-24 00:28:11 UTC
Permalink
The workday is officially over at 5pm, so I can continue the post I started
at lunch. (Although I am pretty sure to get back to work this evening.)

Continuing the discussion of the advantages of vector instruction sets
and hardware.

Vector ISAs allow you to have a whole lot of registers accessible from
relatively small register numbers in the instruction. GPU
SIMD/SIMT/CIMT get the same effect by having a whole lot of threads,
each given a variable number of registers. Basically, reducing the
number of registers allocated to threads (which run in warps or
wavefronts, say 16 wide over 4 cycles) is equivalent to, and probably
better than, a variable vector length, variable on a per-vector-register
basis. I'm not aware of many classic vector ISAs doing this -
and if they did, they would lose the next advantage.

Vector register files can be cheaper than ordinary register files.
Instead of having to allow any register to be accessed at full speed,
vector ISAs require only that the first element of a vector be indexed
fast; subsequent elements can stream along with greater latency. However, I'm
not aware of any recent vector hardware uarch that has taken advantage
of this possibility. Usually they build just a great big wide register file.

Vector ISAs are on a slippery slope of ISA complexity. First you have
vector+vector ->vector ops. Then you add vector sum reductions. Inner
products. Prefix calculations. Operate under mask. Etc. This slippery
slope seems much less slippery for CIMT, since most of these
operations can be synthesized simply out of the scalar operations that
are their basis.
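To make the synthesis point concrete, here is a minimal Python sketch (my own illustration, not any vendor's ISA) of a warp-wide sum reduction built from nothing but per-lane scalar adds and a log2 pairing schedule:

```python
# Hypothetical illustration: a sum reduction synthesized from scalar
# per-lane adds, as coherent-threading code can do, rather than via a
# dedicated vector-reduction opcode. Warp width must be a power of two.
def warp_sum(lanes):
    vals = list(lanes)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):          # lane i adds its partner's value
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(warp_sum(range(16)))   # 120 == 0 + 1 + ... + 15
```

Each step halves the number of active lanes, so a 16-wide warp finishes in four scalar-add rounds.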

Vector chaining is a source of performance - and complexity. It happens
somewhat for free with Nvidia style scalar SIMT, and the equivalent of
more complicated chaining complexes can be set up using ATI/AMD's VLIW SIMT.

All this being said, why would I be interested in reviving vector ISAs?

Mainly because vector ISAs allow the cost of instruction decode and
scheduling to be amortized.

But also because, as I discussed in my Berkeley Parlab presentation of
Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
somewhat the deficiencies of coherent threading, specifically the
problem of divergence.
Terje Mathisen <"terje.mathisen at tmsw.no">
2010-07-24 06:48:41 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
Vector register files can be cheaper than ordinary register files.
Instead of having to allow any register to be accessed, vector ISAs
allow you to only have to index the first element of a vector fast;
subsequent elements can stream along with greater latency. However, I'm
not aware of any recent vector hardware uarch that has taken advantage
of this possibility. Usually they build just a great big wide register file.
And this is needed!

If you check actual SIMD type code, you'll notice that various forms of
permutations are _very_ common, i.e. you need to rearrange the order of
data in one or more vector registers:

If vectors were processed in streaming mode, we would have the same
situation as for the Pentium4 which did half a register in each half
cycle in the fast core, but had to punt each time you did a right shift
(or any other operations which could not be processed in LE order).

I have once seen a reference to AltiVec code that used the in-register
permute operation more than any other opcode.
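The permute Terje describes can be sketched in a few lines; this is in the spirit of AltiVec's vperm, with simplified one-source semantics (the real instruction selects from a pair of registers):

```python
# Sketch of an in-register vector permute (simplified vperm-like
# semantics): a control vector of indices rearranges source elements.
def vperm(src, control):
    """result[i] = src[control[i]] -- arbitrary rearrangement."""
    return [src[c] for c in control]

reg = [10, 20, 30, 40]
print(vperm(reg, [3, 2, 1, 0]))   # [40, 30, 20, 10] - element reversal
```

A streaming, element-at-a-time vector pipeline cannot do this cheaply: result element 0 may need the *last* source element first, which is the same reason the Pentium 4's half-register fast core had to punt on right shifts.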
Post by Andy Glew <"newsgroup at comp-arch.net">
Vector ISAs are on a slippery slope of ISA complexity. First you have
vector+vector ->vector ops. Then you add vector sum reductions. Inner
products. Prefix calculations. Operate under mask. Etc. This slippery
slope seems much less slippery for CIMT, since most of these operations
can be synthesized simply out of the scalar operations that are their
basis.
Except that even scalar code needs prefix/mask type operations in order
to get rid of some branches, right?

All (most of?) the others seem to boil down to a need for a fast vector
permute...
Post by Andy Glew <"newsgroup at comp-arch.net">
Vector chaining is a source of performance - and complexity. It happens
somewhat for free with Nvidia style scalar SIMT, and the equivalent of
more complicated chaining complexes can be set up using ATI/AMD's VLIW SIMT.
All this being said, why would I be interested in reviving vector ISAs?
Mainly because vector ISAs allow the cost of instruction decode and
scheduling to be amortized.
But also because, as I discussed in my Berkeley Parlab presentation of
Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
somewhat the deficiencies of coherent threading, specifically the
problem of divergence.
Please tell!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2010-07-24 10:05:31 UTC
Permalink
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
If you check actual SIMD type code, you'll notice that various forms of
permutations are _very_ common, i.e. you need to rearrange the order of
If vectors were processed in streaming mode, we would have the same
situation as for the Pentium4 which did half a register in each half
cycle in the fast core, but had to punt each time you did a right shift
(or any other operations which could not be processed in LE order).
I have once seen a reference to AltiVec code that used the in-register
permute operation more than any other opcode.
Except that even scalar code needs prefix/mask type operations in order
to get rid of some branches, right?
All (most of?) the others seem to boil down to a need for a fast vector
permute...
Yes. My limited investigations indicated that the viability of
vector systems usually boiled down to whether the hardware's ability
to do that was enough to meet the software's requirements. If not,
it spent most of its time in scalar code.
Post by Terje Mathisen <"terje.mathisen at tmsw.no">
Post by Andy Glew <"newsgroup at comp-arch.net">
But also because, as I discussed in my Berkeley Parlab presentation of
Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
somewhat the deficiencies of coherent threading, specifically the
problem of divergence.
Please tell!
Indeed, yes, please do!


Regards,
Nick Maclaren.
Andy Glew <"newsgroup at comp-arch.net">
2010-07-25 05:34:30 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
But also because, as I discussed in my Berkeley Parlab presentation of
Aug 2009 on GPUs, I can see how to use vector ISAs to ameliorate
somewhat the deficiencies of coherent threading, specifically the
problem of divergence.
You want me to repeat what I already said in those slides?

Sure, if it's necessary.

But the pictures in the slides say it better than I can on comp.arch.

BTW, here is a link to the slides, on Google Docs using their
incomprehensible URLs.

---

I'll give a brief overview:

First off, a basic starter: coherent threading is better than
predication or masking for complicated IF structures. Even in a warp of
64 threads, coherent threading only ever executes paths that one of the
threads is taking. Whereas predication executes all paths.
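That cost difference is easy to quantify with a toy model (issue slots only, ignoring mask-management overhead; the path lengths are made-up numbers):

```python
# Toy model: predication pays for every side of an IF structure, while
# coherent threading pays only for sides some thread actually takes.
PATH_LEN = {'A': 10, 'B': 10}   # instruction counts for the two sides

def predicated_slots():
    # Predication executes both sides under mask, no matter what.
    return sum(PATH_LEN.values())

def coherent_slots(paths_taken):
    # Coherent threading issues each path once per warp, only if taken.
    return sum(PATH_LEN[p] for p in set(paths_taken))

converged = ['A'] * 64             # whole warp takes the if-side
diverged  = ['A'] * 32 + ['B'] * 32
print(predicated_slots())          # 20 issue slots either way
print(coherent_slots(converged))   # 10 - the else-side is never issued
print(coherent_slots(diverged))    # 20 - divergence costs both paths
```

In the fully converged case coherent threading issues half the work; only under divergence does it degrade to predication's cost.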

The big problem with GPU style coherent threading, aka SIMD, aka SIMT,
maybe CIMT, is divergence. But when you get into the coherent threading
mindset, I can think of dozens of ways to ameliorate divergence.

So, here's a first "optimization": create a loop buffer at each lane, or
group of lanes. So it is not really SIMT, single instruction multiple
thread, any more - it is really single instruction to the loop buffer,
independent execution thereafter. Much divergence can thereby be
tolerated - instruction issue slots that would otherwise be wasted by
SIMD lockstep can be used.

Trouble is, loop buffers per lane lose much of the cost reduction of CIMT.

I know: what's a degenerate loop buffer? A vector instruction.

By slide 14 I am showing how, if the individual instructions of CIMT are
time vectors, you can distribute instructions from divergent paths while
others are executing. I.e. you might lose an instruction issue cycle,
but if instruction execution takes 4 or more cycles, 50% divergence need
only lose 10% or less of instruction execution bandwidth.
This is a use of vector instructions.
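One plausible accounting of that arithmetic (my model, not the slides'): each vector instruction needs one issue cycle but occupies its lanes for E execution cycles, so a divergent second path can be issued into slots that execution is already covering.

```python
# Illustrative model of divergence cost with time-pipelined vectors:
# extra paths each steal at most one issue cycle from execution.
def divergence_loss(exec_cycles, paths):
    """Fraction of execution bandwidth lost to issuing extra paths."""
    extra_issue_cycles = paths - 1
    return extra_issue_cycles / (paths * exec_cycles)

print(divergence_loss(4, 2))   # 0.125  - 50% divergence, 4-cycle vectors
print(divergence_loss(8, 2))   # 0.0625 - longer vectors hide it further
```

The exact figure depends on the machine's issue and pipeline structure; the point is only that the loss shrinks as execution cycles per instruction grow.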

Slide 13 depicts an optimization independent of vector instructions.
Most GPUs distribute a warp or wavefront over multiple cycles,
typically 4. If you can rejigger threads within a wavefront, moving
them between cycles, then converged threads can execute together.
This is not a use of vector instructions.

But, vector instructions make things easier to rejigger - since threads
themselves are already spread over several cycles.

Related, not in the slidedeck: spatial rejiggering. Making a "lane" of
instructions map to one or multiple lanes of ALUs.
E.g. if all threads in a warp are converged, then assign each
thread to one and only one ALU lane.
But if you have divergence, spread instruction execution for a
thread across several ALU lanes. E.g. the classic 50% divergence would
get swallowed up.
As usual, vector instructions would make it easy, although it
probably could be done without.

I've already mentioned how you can think of adding coherent instruction
dispatch to multicluster multithreading.

Slides 34 and 35 talk about how time pipelined vectors with VL vector
length control eliminate the vector length wastage of parallel vector
architectures, such as Larrabee was reported to be in the SIGGRAPH paper.

Slide 71 talks about how time pipelined vectors operating planar, SOA
style, save power by reducing toggling.
Thomas Womack
2010-07-19 15:54:03 UTC
Permalink
Post by Robert Myers
I have lamented, at length, the proliferation of flops at the expense of
bytes-per-flop in what are now currently styled as supercomputers.
This subject came up recently on the Fedora User's Mailing List when
someone claimed that GPU's are just what the doctor ordered to make
high-end computation pervasively available. Even I have fallen into
that trap, in this forum, and I was quickly corrected. In the most
general circumstance, GPU's seem practically to have been invented to
expose bandwidth starvation.
Yes, they've got a very low peak bandwidth:peak flops ratio; but the
peak bandwidth is reasonably high in absolute terms - the geforce 480
peak bandwidth is about that of a Cray T916.

(the chip has about 2000 balls on the bottom, 384 of which are memory
I/O running at 4GHz)

I don't think it makes sense to complain about low bw:flops ratios;
you could always make the ratio higher by removing ALUs, getting you a
machine which is less capable at the many jobs that can be made to
need flops but not bytes.

Tom
Robert Myers
2010-07-19 16:31:19 UTC
Permalink
Post by Thomas Womack
Post by Robert Myers
I have lamented, at length, the proliferation of flops at the expense of
bytes-per-flop in what are now currently styled as supercomputers.
This subject came up recently on the Fedora User's Mailing List when
someone claimed that GPU's are just what the doctor ordered to make
high-end computation pervasively available.  Even I have fallen into
that trap, in this forum, and I was quickly corrected.  In the most
general circumstance, GPU's seem practically to have been invented to
expose bandwidth starvation.
Yes, they've got a very low peak bandwidth:peak flops ratio; but the
peak bandwidth is reasonably high in absolute terms - the geforce 480
peak bandwidth is about that of a Cray T916.
(the chip has about 2000 balls on the bottom, 384 of which are memory
I/O running at 4GHz)
I don't think it makes sense to complain about low bw:flops ratios;
you could always make the ratio higher by removing ALUs, getting you a
machine which is less capable at the many jobs that can be made to
need flops but not bytes.
It doesn't make much sense, as I have repeatedly been reminded, simply
to complain, if that's all you ever do.

If nothing else, I've been on a one-man crusade to stop the
misrepresentation of current "supercomputers." The designs are *not*
scalable, except with respect to a set of problems that are
embarrassingly parallel in the global sense, or so close to
embarrassingly parallel that the wimpy global bandwidth that's
available is not a serious handicap.

If you can't examine the most interesting questions about the
interaction between the largest and smallest scales the machine can
represent without making indefensible mathematical leaps, then why
bother building the machine at all? Because there are a bunch of
almost embarrassingly parallel problems that you *can* do?

I don't think we're ever going to agree on this. Your ongoing
annoyance has been noted. I'd like to explore what can and what
cannot be done so that everyone understands the consequences of the
decisions being made about computational frontiers that, from the way
we are going now, will never be explored.

Maybe we've reached a brick wall. If so, I'm mostly the only one
talking about it, and I'd like to broaden the discussion without
annoying people who don't want to hear it.

Robert.
David L. Craig
2010-07-19 17:42:23 UTC
Permalink
I am new to comp.arch and so am unclear of the pertinent history of this
discussion, so please bear with me and don't take any offense at
anything I say, as that is quite unintended.

Is the floating point bandwidth issue only being applied to one
architecture; e.g., x86? If so, why? Is this not a problem with other
designs? Also, why single out floating point bandwidth? For instance,
what about the maximum number of parallel RAM accesses architectures can
support, which has major impacts on balancing cores' use with I/O use?

If everyone thinks a different group is called for, that's fine with me.
I just want to understand the reasons this type of discussion doesn't
fit here.
Robert Myers
2010-07-19 18:59:23 UTC
Permalink
Post by David L. Craig
I am new to comp.arch and so am unclear of the pertinent history of this
discussion, so please bear with me and don't take any offense at
anything I say, as that is quite unintended.
Is the floating point bandwidth issue only being applied to one
architecture; e.g., x86? If so, why? Is this not a problem with other
designs?
Some of my harshest criticism has been aimed at computers built around
the Power architecture, one of which briefly owned the top spot on the
Top 500 list. The problem is not peculiar to any ISA.
Post by David L. Craig
Also, why single out floating point bandwidth? For instance, what about
the maximum number of parallel RAM accesses architectures can support,
which has major impacts on balancing cores' use with I/O use?
I have no desire to limit the figures of merit that deserve
consideration. I just want to provide some corrective to the "Wow! A
gazillion flops!" talk that comes without even an asterisk.

Right now, people present, brag about, and plan for just one figure of
merit: linpack flops. That makes sense to some, I gather, but it makes
no sense to me.

Computation is more or less a solved problem. Most of the challenges
left have to do with moving data around, with latency and not bandwidth
having gotten the lion's share of attention (for good reason). I
believe that moving data around will ultimately be the limiting factor
with regard to reducing power consumption.
Post by David L. Craig
If everyone thinks a different group is called for, that's fine with me.
I just want to understand the reasons this type of discussion doesn't
fit here.
The safest answer that I can think of to this question is that it is
really an interdepartmental problem.

The computer architects here have been relatively tolerant of my
excursions of thought as to why the computers currently being built
don't really cut it, but a proper discussion of all the pro's and con's
would take the discussion and perhaps the list far beyond any normal
definition of computer architecture.

Even leaving aside justifying why expensive bandwidth is not optional,
there is little precedent here for in-depth explorations of blue-sky
proposals. A fair fraction of the blue-sky propositions brought here
can't be taken seriously, and my sense of this group is that it wants to
keep the thinking mostly inside the box, not for want of imagination,
but to avoid science fiction and rambling, uninformed discussion.

Robert.
Andy Glew <"newsgroup at comp-arch.net">
2010-07-20 15:31:46 UTC
Permalink
I am new to comp.arch and so am unclear of the pertinent history of this
discussion
This is a bit of a tired discussion. Not because the solution is known,
but because the solutions that we think we know aren't commercially
feasible. We need to break out of the box.

We welcome new blood, and new ideas.
Also, why single out floating point bandwidth? For instance, what about the
maximum number of parallel RAM accesses architectures can support, which has
major impacts on balancing cores' use with I/Os use?
Computation is more or less a solved problem. Most of the challenges
left have to do with moving data around, with latency and not bandwidth
having gotten the lion's share of attention (for good reason). I believe
that moving data around will ultimately be the limiting factor with
regard to reducing power consumption.
I'm with you, David. Maximizing what I call the MLP, the memory level
parallelism, the number of DRAM accesses that can be concurrently in
flight, is one of the things that we can do.
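The MLP point can be put as a Little's-law sketch (my framing, with illustrative round numbers): sustained bandwidth is capped by concurrent misses times line size divided by memory latency, so more misses in flight means more of the pin bandwidth actually gets used.

```python
# Little's-law view of memory-level parallelism: the bandwidth you can
# sustain is bounded by (misses in flight x line size) / latency,
# regardless of the channel's peak rate.
def sustained_gb_per_s(misses_in_flight, line_bytes, latency_ns):
    return misses_in_flight * line_bytes / latency_ns  # bytes/ns == GB/s

print(sustained_gb_per_s(10, 64, 100))   # 6.4 GB/s with 10 misses in flight
print(sustained_gb_per_s(32, 64, 100))   # 20.48 GB/s with 32
```

With too few outstanding misses, the channels idle no matter how wide they are; that is why MLP is a bandwidth-efficiency lever and not just a latency trick.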

But Robert's comment is symptomatic of the discussion. Robert says most
work has been on latency, by which I think that he means caches, and
maybe integrating the memory controller. I say MLP to Robert, but he
glides on by.

Robert is interested in brute force bandwidth. Mitch points out that
modern CPUs have 1-4 DRAM channels, which defines the bandwidth that you
get, assuming fairly standard JEDEC DRAM interfaces. GPUs may have more
channels, 6 being a possibility, wider, etc., so higher bandwidth is a
possibility.

Me, I'm just the MLP guy: give me a certain number of channels and
bandwidth, I try to make the best use of them. MLP is one of the ways
of making more efficient use of whatever limited bandwidth you have. I
guess that's my mindset - making the most of what you have. Not because
I don't want to increase the overall memory bandwidth. But because I
don't have any great ideas on how to do so, apart from
a) More memory channels
b) Wider memory channels
c) Memory channels/DRAMs that handle short bursts/high address
bandwidth efficiently
d) DRAMs with a high degree of internal banking
e) aggressive DRAM scheduling
Actually, c,d,e are really ways of making more efficient use of
bandwidth, i.e. preventing pins from going idle because the burst length
is giving you a lot of data you don't want.
f) stacking DRAMs
g) stacking DRAMs with an interface chip such as Tom Pawlowski of
Micron proposes, and a new abstract DRAM interface, enabling all of the
good stuff above but keeping DRAM a commodity
h) stacking DRAMs with an interface chip and a processor chip (with
however many processors you care to build).
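For options (a) and (b), the arithmetic is simple enough to sketch; the figures below are illustrative round numbers, not a claim about any specific product:

```python
# Peak-bandwidth arithmetic for more/wider channels:
# channels x bus width (bytes) x transfer rate.
def peak_gb_per_s(channels, bus_bits, megatransfers_per_s):
    return channels * (bus_bits // 8) * megatransfers_per_s / 1000

print(peak_gb_per_s(2, 64, 1600))   # 25.6  - dual-channel DDR3-1600
print(peak_gb_per_s(6, 64, 4000))   # 192.0 - a 384-bit GPU board at 4 GT/s
```

Options (c) through (e) don't change these peak numbers at all; they change how much of the peak survives contact with a real access pattern.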

Actually, I think that it is inaccurate to say that Robert Myers just
wants brute force memory bandwidth. I think that he would be unhappy
with a machine that achieved brute force memory bandwidth by having 4KiB
burst transfers - because while that machine might be good for DAXPY, it
would not be good for most of the codes Robert wants.
I think that Robert does not want brute force sequential bandwidth.
I think that he needs random access pattern bandwidth.

Q: is that so, Robert?
Even leaving aside justifying why expensive bandwidth is not optional,
there is little precedent here for in-depth explorations of blue-sky
proposals. A fair fraction of the blue-sky propositions brought here
can't be taken seriously, and my sense of this group is that it wants to
keep the thinking mostly inside the box, not for want of imagination,
but to avoid science fiction and rambling, uninformed discussion.
I'm game for blue-sky SCIENCE FICTION. I.e. imaginings based on
science. That have some possibility of being true.

I'm not so hot on science FANTASY, imaginings based on wishful thinking.
MitchAlsup
2010-07-20 16:35:05 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
Actually, I think that it is inaccurate to say that Robert Myers just
wants brute force memory bandwidth.  
Undoubtedly correct.

As to why vector machines fell out of fashion: vectors were architected
to absorb memory latency. Early Crays had 64-entry vectors and 20-ish
cycle main memory. Later, as the CPUs got faster and the memories
larger and more interleaved, the latency, in cycles, to main memory
increased. And once the Vector machines got to where main memory
latency, in cycles, was greater than the vector length, their course
had been run.

Nor can OoO machines create vector performance rates unless the
latency to <whatever layer in the memory hierarchy supplies the data>
can be absorbed by the size of the execution window. Thus, the
execution window needs to be 2.5-3 times the number of flops being
crunched per loop iteration. We are at the point where, even when the
L2 cache supplies data, there are too many latency cycles for the
machine to be able to efficiently strip mine data. {And in most cases
the cache hierarchy is not designed to efficiently strip mine data,
either.}

Neither
a) high latency with adequate bandwidth
nor
b) low latency with inadequate bandwidth
enables vector execution rates--that is, getting the most out of the FP
computation capabilities.
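Mitch's vector-length argument reduces to a one-line inequality; the numbers below are the thread's own (early Crays: 64-element vectors, roughly 20-cycle memory), the model is mine:

```python
# A pipelined vector register hides memory latency only while the
# latency, in cycles, stays at or below the vector length.
def latency_hidden(vector_len, mem_latency_cycles):
    """True if a vector load's latency fits under the vector's length."""
    return mem_latency_cycles <= vector_len

print(latency_hidden(64, 20))    # True  - early Cray regime
print(latency_hidden(64, 200))   # False - latency has outrun the vector
```

Once the inequality flips, the vector unit stalls waiting on memory just like a scalar machine, which is the point at which Mitch says the vector machines' course had been run.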

Mitch
Robert Myers
2010-07-20 16:54:04 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
I think that Robert does not want brute force sequential bandwidth.
I think that he needs random access pattern bandwidth.
Q: is that so, Robert?
The problems that I personally am most interested in are fairly
"crystalline": very regular access patterns across very regular data
structures.

So the data access patterns are neither random nor sequential. The fact
that processors and memory controllers want to deal with cache lines and
not with 8-byte words is a serious constraint. No matter how you
arrange a multi-dimensional array, some kinds of access are going to
waste huge amounts of bandwidth, even though the access pattern is far
from random.

In the ideal world, you don't want scientists and engineers worrying
about where things are, and more and more problems involve access
patterns that are hard to anticipate. If you can't make random access
free (as fast as sequential access), then at least you can aim at making
hardware naiveté less costly (a factor of, say, two penalty for having
majored in physics rather than computer science, rather than a factor
of, say, ten or more).

Problems that require truly random (or hard to anticipate) access are
(as I perceive things) far more frequent than they were in the early
Cray days, and the costs of dealing with them increasingly painful.

To attempt to be concise: I have no doubt that the needs of media
stream processors will be met without my worrying about it. Any kind of
more complicated access (I speculate) is now so far down the priority
list that, from the POV of COTS processor manufacturers, it is in the
noise. So I'm interested in talking about any kind of calculation that
can't feed a GPU without some degree of hard work or even magic.

If I seem a tad blasé about the parts of the problem you understand the
most about (or are most interested in), it's because my concerns extend
far beyond a standard rack mounted board and even beyond the rack to the
football-field sized installations that get the most press in HPC.
There are so many pieces to this problem, that even making a
comprehensive list is a challenge. At one time, you could enter a room
and see a Cray 2 (not including the supporting plumbing). Now you'd
have to take the roof off a building and rent a helicopter to get a
similar view of a state of the art "supercomputer." There's a lot to
think about.

I'm also interested in what you can build that doesn't occupy
significant real estate and require a rewiring of the nation's electric
grid, so I'm interested in what you can jam onto a single board or into
a single rack. No shortage of things to talk about.

A final word about latency and bandwidth. I really want to keep my mind
as open as possible. The more latency you can tolerate, perhaps with
some of the kinds of exotic approaches (e.g. huge instruction windows)
that interest you, the more options you have for approaching the problem
of bandwidth. I know that most everyone here understands that. I just
want to make it clear that I understand it, too.

Robert.
David L. Craig
2010-07-20 17:49:49 UTC
Permalink
On Jul 20, 11:31 am, Andy Glew <"newsgroup at comp-arch.net">
Post by Andy Glew <"newsgroup at comp-arch.net">
We welcome new blood, and new ideas.
These are new ideas? I hope not.
Post by Andy Glew <"newsgroup at comp-arch.net">
I'm with you, David. Maximizing what I call the MLP, the
memory level parallelism, the number of DRAM accesses that
can be concurrently in flight, is one of the things that
we can do.
Me, I'm just the MLP guy: give me a certain number of
channels and bandwidth, I try to make the best use of
them. MLP is one of the ways of making more efficient
use of whatever limited bandwidth you have. I guess that's
my mindset - making the most of what you have. Not because
I don't want to increase the overall memory bandwidth.
But because I don't have any great ideas on how to do so,
apart from
a) More memory channels
b) Wider memory channnels
c) Memory channels/DRAMs that handle short bursts/high
address bandwidth efficiently
d) DRAMs with a high degree of internal banking
e) aggressive DRAM scheduling
Actually, c,d,e are really ways of making more efficient
use of bandwidth, i.e. preventing pins from going idle
because the burst length is giving you a lot of data you
don't want.
f) stacking DRAMs
g) stacking DRAMs with an interface chip such as Tom
Pawlowski of micron proposes, and a new abstract
DRAM interface, enabling all of the good stuff
above but keeping DRAM a comodity
h) stacking DRAMs with an interface chip and a
processor chip (with however many processors you
care to build).
If we're talking about COTS design, FP bandwidth is
probably not the area in which to increase production
costs for better performance. As Mitch Alsup observed
a little after the post I've been quoting became
Post by Andy Glew <"newsgroup at comp-arch.net">
We are at the point where, even when the L2 cache
supplies data, there are too many latency cycles for
the machine to be able to efficiently strip mine
data. {And in most cases the cache hierarchy is not
designed to efficiently strip mine data, either.}
Have performance runs using various cache disablements
indicated any gains could be realized therein? If so,
I think that makes the case for adding circuits to
increase RAM parallelism as the cores fight it out for
timely data in and data out operations.

If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
jacko
2010-07-20 18:48:45 UTC
Permalink
Robert Myers
2010-07-20 18:49:03 UTC
Permalink
Post by David L. Craig
If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
If no one ever goes blue sky and asks: what is even physically
possible without worrying what may or may not be already in the works
at Intel, then we are forever limited, even in the imagination, to
what a marketdroid at Intel believes can be sold at Intel's customary
margins. There is always IBM, of course, and AMD seems willing to try
anything that isn't guaranteed to put it out of business, but, for the
most part, the dollars just aren't there, unless the government
supplies them.

As far as I'm concerned, the roots of the current climate for HPC can
be found in some DoD memos from the early nineties. I'm pretty sure I
have already offered links to some of those documents here.

In all fairness to those memos and to the semiconductor industry in
the US, the markets have delivered well beyond the limits I feared
when those memos first came out. I doubt if mass-market x86
hypervisors ever crossed the imagination at IBM, even as the
barbarians were at the gates.

Also, to be fair to markets, the cost-no-object exercises the
government undertook even after those early 90's memos delivered
almost nothing of any real use. Lots of money has been squandered on
some really dumb ideas. The national labs and others have tried the
same idea (glorified Beowulf) with practically every plausible
processor and interconnect on offer and pretty much the same result
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).

Moving the discussion to some place slightly less visible than
comp.arch might not produce more productive flights of fancy, but I,
for one, am interested in what is physically possible and not just
what can be built with the consent of Sen. Mikulski--a lady I have
always admired, to be sure, from her earliest days in politics, just
not the person I'd cite as intellectual backup for technical
decisions.

Robert.
MitchAlsup
2010-07-20 21:07:46 UTC
Permalink
An example of the subtle microarchitectural optimization that is in
Robert's favor was tried in one of my previous designs.

The L1 cache was organized to cache the width of the bus returning
from the L2 on die cache.

The L2 cache was organized at the width of your typical multi-beat
cache line returning from main memory. Thus, one L2 cache line would
occupy 4 L1 cache sub-lines when fully 'in' the L1. Some horseplay at
the cache coherence protocol prevented incoherence.

With the L1-to-L2 interface suitably organized, one could strip mine
data from the L2 through the L1 through the computation units back to
the L1. L1 victims were transferred back to the L2 as L2 data arrived
and forwarded into execution.

Here, the execution window had to absorb only the L2 transfer delay
plus the floating point computation delay. And for that, this execution
window worked just fine. DAXPY and DGEMM on suitably sized vectors
would strip mine data footprints as big as the L2 cache at vector
rates.
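For reference, the DAXPY kernel Mitch names is tiny; here it is in plain Python (the BLAS routine computes y := a*x + y). Its two streamed reads and one streamed write per element are exactly the access pattern the L1/L2 organization above was built to feed:

```python
# DAXPY: y := a*x + y, elementwise over two equal-length vectors.
def daxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 10.0, 10.0]))   # [12.0, 14.0, 16.0]
```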

Mitch
David L. Craig
2010-07-20 21:31:28 UTC
Permalink
There is always IBM, of course[...]
Ah, yes, for over a hundred years so far, anyway. ;-)
But do you mention them as the designers of AIX-POWER,
OS/400-iSeries, whatever-x86, or big iron? I have noticed
the x86 boxes have always been trying to catch up with the
mainframes but the gap really doesn't change much.
I doubt if mass-market x86 hypervisors ever crossed the
imagination at IBM, even as the barbarians were at the
gates.
You'd be wrong. A lot of IBMers and customer VMers were
watching what Intel was going to do with the 80386 next
generations to support machine virtualization. While
Intel claimed it was coming, by mainframe standards, they
showed they just weren't serious. Not only can x86 not
fully virtualize itself, it has known design flaws that
can be exploited to compromise the integrity of its
guests and the hypervisor. That it is used widely as a
consolidation platform boggles the minds of those in the
know. We're waiting for the eventual big stories.
Also, to be fair to markets, the cost-no-object
exercises the government undertook even after
those early 90's memos delivered almost nothing of
any real use.  Lots of money has been squandered on
some really dumb ideas.
Moving the discussion to some place slightly less
visible than comp.arch might not produce more
productive flights of fancy, but I, for one, am
interested in what is physically possible [...].
Some ideas are looking to be not so dumb; e.g., quantum
computing. I wonder what JVN would make of them if he were
still around? I suspect it's hard to get more blue-sky
physically possible than those beasties.
Robert Myers
2010-07-20 23:11:53 UTC
Permalink
Post by David L. Craig
There is always IBM, of course[...]
Ah, yes, for over a hundred years so far, anyway. ;-)
But do you mention them as the designers of AIX-POWER,
OS/400-iSeries, whatever-x86, or big iron?
I'm thinking of IBM as the general contractor for, say, Blue Waters.
The CPU will be Power 7, but the OS will apparently be Linux. My
assumption is that, as a matter of national policy, the US government
wants to keep IBM as the non-x86 option.
Post by David L. Craig
 I have noticed
the x86 boxes have always been trying to catch up with the
mainframes but the gap really doesn't change much.
I doubt if mass-market x86 hypervisors ever crossed the
imagination at IBM, even as the barbarians were at the
gates.
You'd be wrong.  A lot of IBMers and customer VMers were
watching what Intel was going to do with the 80386 next
generations to support machine virtualization.  While
Intel claimed it was coming, by mainframe standards, they
showed they just weren't serious.  Not only can x86 not
fully virtualize itself, it has known design flaws that
can be exploited to compromise the integrity of its
guests and the hypervisor.  That it is used widely as a
consolidation platform boggles the minds of those in the
know.  We're waiting for the eventual big stories.
Well, *I* never thought they were serious. I assumed that, if
virtualization other than a VMware-type hack ever came to Intel, it
would be a feature of IA-64, where I assumed that virtualization had
been penciled in from the beginning.

I'm waiting for the big stories, too. At this point, building secure
systems is surely a bigger national priority than having the most
flops.
Post by David L. Craig
Also, to be fair to markets, the cost-no-object
exercises the government undertook even after
those early 90's memos delivered almost nothing of
any real use.  Lots of money has been squandered on
some really dumb ideas.
Moving the discussion to some place slightly less
visible than comp.arch might not produce more
productive flights of fancy, but I, for one, am
interested in what is physically possible [...].
Some ideas are looking to be not so dumb; e.g., quantum
computing.  I wonder what JVN would make of them if he were
still around?  I suspect it's hard to get more blue-sky
physically possible than those beasties.
Maybe quantum entanglement is the answer to moving data around.

Robert.
David L. Craig
2010-07-21 14:58:41 UTC
Permalink
Post by Robert Myers
Maybe quantum entanglement is the answer to moving data around.
Sigh... I wonder how many decades we are from that being standard in
COTS hardware (assuming the global underpinnings of R&D hold up that
long). Probably more than I've got (unless medical R&D also grows
by leaps and bounds and society deems me worthy of being kept around).

I'd like simultaneous backup 180 degrees around the planet and on the
Moon, that's for sure.
Robert Myers
2010-07-21 17:17:00 UTC
Permalink
Post by Robert Myers
Maybe quantum entanglement is the answer to moving data around.
Sigh...  I wonder how many decades we are from that being standard in
COTS hardware (assuming the global underpinnings of R&D hold up that
long).  Probably more than I've got (unless medical R&D also grows
by leaps and bounds and society deems me worthy of being kept around).
I'd like simultaneous backup 180 degrees around the planet and on the
Moon, that's for sure.
There are several boundaries to this problem, which is a mixture of
electrical engineering, applied physics, computer architecture, device
electronics, and computational mathematics (and some I've probably
left out).

The applied physics boundary is, without a doubt, both the most
potentially interesting and the leakiest with respect to wild
speculation.

The actual history of the transistor extends to well before my
lifetime, and the full history of the transition from flaky and
limited understanding to commodity devices is instructive.

I don't intend to pursue the applied physics boundary myself because I
don't really know enough even to engage in knowledgeable speculation,
but, if you intend to go outside the box, you always run the risk of
intellectual trajectories that head off into the cosmos, never to
return. Since most of the people who have expressed an interest so
far are considerably more down to earth than that, I don't expect the
conversation to run much risk of becoming dominated by possibilities
that no one currently knows how to manufacture, but I'm willing to run
the risk.

Robert.
jacko
2010-07-21 17:44:33 UTC
Permalink
Post by Robert Myers
Post by Robert Myers
Maybe quantum entanglement is the answer to moving data around.
Sigh...  I wonder how many decades we are from that being standard in
COTS hardware (assuming the global underpins of R&D hold up that
long).  Probably more than I've got (unless the medical R&D also grows
by leaps and bounds and society deems me worthy of being kept around).
I don't intend to pursue the applied physics boundary myself because I
don't really know enough even to engage in knowledgeable speculation,
but, if you intend to go outside the box, you always run the risk of
intellectual trajectories that head off into the cosmos,  never to
return.  Since most of the people who have expressed an interest so
far are considerably more down to earth than that, I don't expect the
conversation to run much risk of becoming dominated by possibilities
that no one currently knows how to manufacture, but I'm willing to run
the risk.
And I thought you were going to make the quantum memory joke ... Well
it allows us to store all values in all addresses, and it performs all
possible calculations in under 1 clock cycle.
Alex McDonald
2010-07-21 22:46:22 UTC
Permalink
Post by David L. Craig
I doubt if mass-market x86 hypervisors ever crossed the
imagination at IBM, even as the barbarians were at the
gates.
You'd be wrong.  A lot of IBMers and customer VMers were
watching what Intel was going to do with the 80386 next
generations to support machine virtualization.  While
Intel claimed it was coming, by mainframe standards, they
showed they just weren't serious.  Not only can x86 not
fully virtualize itself, it has known design flaws that
can be exploited to compromise the integrity of its
guests and the hypervisor.  That it is used widely as a
consolidation platform boggles the minds of those in the
know.  We're waiting for the eventual big stories.
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
Andy Glew <"newsgroup at comp-arch.net">
2010-07-22 04:48:38 UTC
Permalink
Post by Alex McDonald
Post by David L. Craig
I doubt if mass-market x86 hypervisors ever crossed the
imagination at IBM, even as the barbarians were at the
gates.
You'd be wrong. A lot of IBMers and customer VMers were
watching what Intel was going to do with the 80386 next
generations to support machine virtualization. While
Intel claimed it was coming, by mainframe standards, they
showed they just weren't serious. Not only can x86 not
fully virtualize itself, it has known design flaws that
can be exploited to compromise the integrity of its
guests and the hypervisor. That it is used widely as a
consolidation platform boggles the minds of those in the
know. We're waiting for the eventual big stories.
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
The 80386 and other processors, up until recently, were incompletely
self virtualizing.

However, as far as I know, with the addition of VMX at Intel, and
Pacifica at AMD, the x86 processors are now completely self virtualizing.
George Neuner
2010-07-22 18:16:02 UTC
Permalink
On Wed, 21 Jul 2010 21:48:38 -0700, Andy Glew <"newsgroup at
Post by Andy Glew <"newsgroup at comp-arch.net">
Post by Alex McDonald
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
The 80386 and other processors, up until recently, were incompletely
self virtualizing.
However, as far as I know, with the addition of VMX at Intel, and
Pacifica at AMD, the x86 processors are now completely self virtualizing.
Yes, but last year there was a claim of some rootkit type hack that
could take control of Intel's hypervisor. I don't know any details, I
just remember seeing a headline.

George
Robert Myers
2010-07-22 20:10:06 UTC
Permalink
Post by George Neuner
On Wed, 21 Jul 2010 21:48:38 -0700, Andy Glew <"newsgroup at
Post by Andy Glew <"newsgroup at comp-arch.net">
Post by Alex McDonald
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
The 80386 and other processors, up until recently, were incompletely
self virtualizing.
However, as far as I know, with the addition of VMX at Intel, and
Pacifica at AMD, the x86 processors are now completely self virtualizing.
Yes, but last year there was a claim of some rootkit type hack that
could take control of Intel's hypervisor. I don't know any details, I
just remember seeing a headline.
Yes, there are proposed exploits.

One big reason that IBM mainframes are not a plausible target for
hackers is that not many hackers own one. Monoculture in hardware is
risky in the same way that monoculture in agriculture is risky:
parasitic invaders can expend more energy exploring a narrow and
well-defined target.

I assume that hacking of x86 hypervisors will be a much bigger problem
than hacking of IBM proprietary hypervisors has ever been, but only in
part because IBM is historically marginally more careful about what it
sells.

Robert.
r***@yahoo.com
2010-07-23 05:21:32 UTC
Permalink
Post by George Neuner
On Wed, 21 Jul 2010 21:48:38 -0700, Andy Glew <"newsgroup at
Post by Andy Glew <"newsgroup at comp-arch.net">
Post by Alex McDonald
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
The 80386 and other processors, up until recently, were incompletely
self virtualizing.
However, as far as I know, with the addition of VMX at Intel, and
Pacifica at AMD, the x86 processors are now completely self virtualizing.
Yes, but last year there was a claim of some rootkit type hack that
could take control of Intel's hypervisor.  I don't know any details, I
just remember seeing a headline.
You probably mean the SubVirt hack. That exploited the virtualization
hardware to hack the target OS. IOW, the idea was to inject a VM
hypervisor, and then continue running what the user thought was the
native OS under that. And of course a hypervisor is capable of pretty
much any level of mischief.

There have also been attacks on several hypervisors, but most
vulnerabilities occur because you can send commands of various sorts
from a guest to the hypervisor (for example, "attach virtual disk
'someotheros/sysvol' to me"), that were often not properly secured.
And before anyone objects, being able to issue commands to the
hypervisor is very useful - you can, for example, request access to a
particular real device, or change the amount of memory in your guest,
etc.
r***@yahoo.com
2010-07-23 05:25:59 UTC
Permalink
Post by Andy Glew <"newsgroup at comp-arch.net">
Post by Alex McDonald
Post by David L. Craig
I doubt if mass-market x86 hypervisors ever crossed the
imagination at IBM, even as the barbarians were at the
gates.
You'd be wrong.  A lot of IBMers and customer VMers were
watching what Intel was going to do with the 80386 next
generations to support machine virtualization.  While
Intel claimed it was coming, by mainframe standards, they
showed they just weren't serious.  Not only can x86 not
fully virtualize itself, it has known design flaws that
can be exploited to compromise the integrity of its
guests and the hypervisor.  That it is used widely as a
consolidation platform boggles the minds of those in the
know.  We're waiting for the eventual big stories.
Can you be more explicit on this? I understand the lack of complete
virtualization is an issue with the x86, but I'm fascinated by your
claim of exploitable design flaws; what are they?
The 80386 and other processors, up until recently, were incompletely
self virtualizing.
However, as far as I know, with the addition of VMX at Intel, and
Pacifica at AMD, the x86 processors are now completely self virtualizing.
It's not clear to me that the VMX extensions themselves virtualize
well (IOW, can you emulate VMX for the guests, so that you can have
nested VMs using VMX?). In fact, as I read the docs, they don't.
Andy Glew <"newsgroup at comp-arch.net">
2010-07-21 05:03:07 UTC
Permalink
Post by Robert Myers
Post by David L. Craig
If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
If no one ever goes blue sky and asks: what is even physically
possible without worrying what may or may not be already in the works
at Intel, then we are forever limited, even in the imagination, to
what a marketdroid at Intel believes can be sold at Intel's customary
margins.
Coupling this to stuff we said earlier about

a) sequential access patterns, brute force - neither of us consider that
interesting

b) random access patterns

c) what you, Robert, said you were most interested in, and rather nicely
called "crystalline" access patterns. By the way, I rather like that
term: it is much more accurate than saying "stride-N", and encapsulates
several sorts of regularity.

Now, I think it can be said that a machine that does random access
patterns efficiently also does "crystalline" access patterns. Yes?

I can imagine optimizations specific to the crystalline access patterns,
that do not help true random access. But I'd like to kill two birds
with one stone.

So, how can we make these access patterns more effective?

Perhaps we should lose the cache-line orientation, which transfers data
bytes that aren't needed.

I envision an interconnect fabric that is completely scatter/gather
oriented. We don't do away with burst or block operations: we always
transfer, say, 64 bytes at a time. But into that 64 bytes we might
pack, say, 4 pairs of 64 bit address and 64 bit data, for stores. Or
perhaps bursts of 128 bytes, mixing tuples of 64 bit address and 128 bit
data. Or maybe... compression, whatever. Stores are the complicated
one; reads are relatively simple, vectors of, say, 8 64 bit addresses.

By the way, this is where strided or crystalline access patterns might
have some advantages: they may compress better.

Your basic processing element produces such scatter gather load or store
requests. Particularly if it has scatter/gather vector instructions
like Larrabee (per wikipedia), or if it is a CIMT coherent threaded
architecture like the GPUs. The scatter/gather operations emitted by a
processor need not be directed at a single target - they may be split
and merged as they flow through the fabric.
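As a concrete sketch of the store packing above (Python, with made-up
field widths, and an all-zero pair used as burst padding; this is an
illustration, not a wire-format proposal):

```python
import struct

BURST = 64  # bytes per fabric burst (hypothetical)

def pack_stores(pairs):
    """Pack (address, data) 64-bit pairs into 64-byte bursts:
    4 pairs x 16 bytes = 64 bytes, as described above."""
    bursts = []
    for i in range(0, len(pairs), 4):
        chunk = pairs[i:i + 4]
        buf = b"".join(struct.pack("<QQ", a, d) for a, d in chunk)
        bursts.append(buf.ljust(BURST, b"\0"))  # zero-pad a short burst
    return bursts

def unpack_stores(burst):
    """Recover the (address, data) pairs; an all-zero pair is padding."""
    out = []
    for off in range(0, BURST, 16):
        a, d = struct.unpack_from("<QQ", burst, off)
        if a or d:
            out.append((a, d))
    return out

reqs = [(0x1000, 7), (0x2040, 8), (0x30C0, 9), (0x4000, 10), (0x5008, 11)]
bursts = pack_stores(reqs)
assert len(bursts) == 2  # 5 pairs -> one full burst plus one padded one
assert [p for b in bursts for p in unpack_stores(b)] == reqs
```

Strided or crystalline address streams would, of course, invite a
delta-compressed encoding of the address fields, which is where the
compression advantage mentioned above would come in.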

In order to eliminate unnecessary full-cache line flow, we do not
require read-for-ownership. But we don't go the stupid way of
write-through. I lean towards having a valid bit per byte, in these
scatter-gather requests, and possibly in the caches. As I have
discussed in this newsgroup before, this allows us to have writeback
caches where multiple processors can write to the same memory location
simultaneously. The byte valids allow us to live with weak memory
ordering, but do away with the bad problem of losing data when people
write to different bytes of the same line simultaneously. In fact,
depending on the interconnection fabric topology, you might even have
processor ordering. But basically it eliminates the biggest source of
overhead in cache coherency.
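A toy model of those per-byte valid bits (Python; the sizes and the
dictionary representation are mine, purely for illustration) shows why
two simultaneous writers to the same line no longer lose data:

```python
# Per-byte valid bits on a 64-byte writeback line (hypothetical size).
LINE = 64

def make_entry():
    """A private cache-line copy: data plus one valid bit per byte."""
    return {"data": bytearray(LINE), "valid": [False] * LINE}

def local_write(entry, offset, payload):
    """Write into the local copy without read-for-ownership,
    marking only the touched bytes valid."""
    for i, b in enumerate(payload):
        entry["data"][offset + i] = b
        entry["valid"][offset + i] = True

def merge_into_memory(memory, base, entry):
    """On writeback, only valid bytes overwrite memory, so writers
    to different bytes of the same line both survive."""
    for i in range(LINE):
        if entry["valid"][i]:
            memory[base + i] = entry["data"][i]

memory = bytearray(128)
a, b = make_entry(), make_entry()
local_write(a, 0, b"AAAA")   # processor A writes bytes 0..3
local_write(b, 8, b"BBBB")   # processor B writes bytes 8..11
merge_into_memory(memory, 0, a)
merge_into_memory(memory, 0, b)
assert memory[0:4] == b"AAAA" and memory[8:12] == b"BBBB"
```

With full-line writebacks and no read-for-ownership, the second
writeback would have clobbered the other processor's bytes with stale
zeros; the byte valids are what prevent that.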

Of course, you want to handle non-cache friendly memory access patterns.
I don't think you can safely get rid of caches; but I think that there
should be a full suite of cache control operations, such as is partially
listed at
http://semipublic.comp-arch.net/wiki/Andy_Glew's_List_of_101_Cache_Control_Operations

Such a scatter/gather memory subsystem might exist in the fabric. It
works best with processor support to generate and handle the
scatter/gather requests and replies.  (Yes, the main thing is in the
interconnect; but some processor support is needed, to get crap out of
the way of the fabric).

The scatter/gather interconnect fabric might be interfaced to
conventional DRAMs, with their block transfers of 64 or 128 bytes. If
so, I would be tempted to create a memory side cache - a cache that is
in the memory controller, not the processor - seeking to leverage some
of the wasted parts of cache lines. With cache control, of course.

However, if there is any chance of getting DRAM architectures to be more
scatter/gather friendly, great. But the people who can really talk
about that are Tom Pawlowski at Micron, and his counterpart at Samsung.
I've not been at a company that could influence DRAM much, since
Motorola in the late 1980s. And I dare say that Mitch didn't make much
headway there. I've mentioned Tom Pawlowski's vision, as presented at
SC09 and elsewhere, of an abstract DRAM interface for stacked DRAM+logic
units.  I think the scatter/gather approach I describe above should be
a candidate for such an abstract interface.

If there is anyone that thinks that there is a great new memory
technology coming down the pike that will make the bandwidth wars
easier, I'd love to hear about it. For that matter, the impending
integration of non-volatile memory is great - but as I understand
things, it will probably make the memory hierarchy even more sequential
bandwidth oriented, unfriendly to other access patterns.

--

On this fabric, also pass messages - probably with instruction set
support to directly produce messages, and mechanisms such as TLBs to
route them without OS intervention.

--

I.e. my overall approach is - eliminate unnecessary full cache line
transfers, emphasize scatter gather. Make the most efficient use of
what we have.

--

Now, I remain an unrepentant mass market computer architect. Some
people want to design the fastest supercomputer in the world; I want to
design the computer my mother uses. But I'm not so far removed from
the buildings full of supercomputing hardware that Robert Myers describes.
First, I have worked on such. But, second, I'm interested in much of
this not just because it is relevant to cost no barrier supercomputers,
but also because it is relevant to mass markets.
Most specifically, datacenters. Although datacenters tend not to use
large scale shared memory, and tend to be unwilling to compromise the
memory ordering and cache coherency guidelines in their small scale
shared memory nodes, I suspect that PGAS has applications, e.g. to
Hadoop like map/reduce. Moreover, much of this scatter/gather is also
what network routers want - that OTHER form of computing system that can
occupy large buildings, but which also comes in smaller flavors.
Finally, the above applies even to moderate sized, say 16 or 32,
multiprocessor systems in manycore chips.

I.e. I am interested in such scatter/gather memory and interconnect,
that make the most efficient use of bandwidth, because they apply to the
entire spectrum.
Andrew Reilly
2010-07-21 09:30:04 UTC
Permalink
Post by Robert Myers
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).
Have you, or anyone else here, ever read any studies of the sensitivities
of the latter performance figure to differences in interconnect bandwidth/
expense? I.e., does plugging another fat IB tree into every node in
parallel, doubling cross section bandwidth, raise the second figure to
20%?

Is 10% (of peak FP throughput, I would guess) really representative of
the real production code used by the typical buyer of these large-scale
HPC systems? I'm not counting the build-it-and-they-will-come
installations at places like Universities, but the built-to-solve-problem-
X ones at places like oil companies, weather forecasters and (I guess)
weapons labs. I don't work in any of those kinds of environments, so I
don't know anything about the code that they run.

Would moving that efficiency number higher be better than making 10%-
efficiency machines less expensive?

Cheers,
--
Andrew
n***@cam.ac.uk
2010-07-21 10:26:26 UTC
Permalink
Post by Andrew Reilly
Post by Robert Myers
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).
Have you, or anyone else here, ever read any studies of the sensitivities
of the latter performance figure to differences in interconnect bandwidth/
expense? I.e., does plugging another fat IB tree into every node in
parallel, doubling cross section bandwidth, raise the second figure to
20%?
A little, and I have done a bit of testing. It does help, sometimes
considerably, but the latency is at least as important as the bandwidth.
Post by Andrew Reilly
Would moving that efficiency number higher be better than making 10%-
efficiency machines less expensive?
It's marginal. The real killer is the number of programs where even
a large improvement would allow only a small increase in scalability.


Regards,
Nick Maclaren.
Jeremy Linton
2010-07-21 15:42:25 UTC
Permalink
Post by n***@cam.ac.uk
Post by Andrew Reilly
Post by Robert Myers
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).
Have you, or anyone else here, ever read any studies of the sensitivities
of the latter performance figure to differences in interconnect bandwidth/
expense? I.e., does plugging another fat IB tree into every node in
parallel, doubling cross section bandwidth, raise the second figure to
20%?
A little, and I have done a bit of testing. It does help, sometimes
considerably, but the latency is at least as important as the bandwidth.
With regard to latency, I've wondered for a while why no one has built a
large InfiniBand (like?) switch with a large closely attached memory.
It probably won't help the MPI guys, but those beasts are only used for
the HPC market anyway. Why not modify them to shave a hop off and admit
that some segment of the HPC market could use it? Is the HPC market so
cost-sensitive that it cannot afford a slight improvement, at a
disproportionate cost, for one component in the system?
n***@cam.ac.uk
2010-07-21 16:28:12 UTC
Permalink
Post by Jeremy Linton
Post by n***@cam.ac.uk
Post by Andrew Reilly
Post by Robert Myers
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).
Have you, or anyone else here, ever read any studies of the sensitivities
of the latter performance figure to differences in interconnect bandwidth/
expense? I.e., does plugging another fat IB tree into every node in
parallel, doubling cross section bandwidth, raise the second figure to
20%?
A little, and I have done a bit of testing. It does help, sometimes
considerably, but the latency is at least as important as the bandwidth.
With regard to latency, I've wondered for a while why no one has built a
large InfiniBand (like?) switch with a large closely attached memory.
It probably won't help the MPI guys, but those beasts are only used for
the HPC market anyway. Why not modify them to shave a hop off and admit
that some segment of the HPC market could use it? Is the HPC market that
cost sensitive that they cannot afford a slight improvement, at a
disproportionate cost, for one component in the system?
It's been done, but network attached memory isn't really viable, as
local memory is so cheap and so much faster.

It sounds as if you think it would reduce latency, but of what?
I.e. what would you use it for?


Regards,
Nick Maclaren.
Robert Myers
2010-07-21 16:27:31 UTC
Permalink
Post by Andrew Reilly
Post by Robert Myers
(90%+ efficiency for Linpack, 10% for anything even slightly more
interesting).
Have you, or anyone else here, ever read any studies of the sensitivities
of the latter performance figure to differences in interconnect bandwidth/
expense? I.e., does plugging another fat IB tree into every node in
parallel, doubling cross section bandwidth, raise the second figure to
20%?
I have read such studies, yes, and I've even posted some of what I've
found here on comp.arch, where there has been past discussion of just
those kinds of questions.

That's an argument for why this material shouldn't be limited to being
scattered through comp.arch. I have a hard time finding even my own
posts with Google groups search.

A place has generously been offered to host what will probably be a
mailing list and a wiki. I'll be glad to continue to pursue the conversation here
to try to generate as wide interest as possible, but, since I've already
worn the patience of some thin by repeating myself, I'd rather focus on
finding a relatively quiet gathering place for those who are really
interested.

I have neither interest in nor intention of moderating a group or
limiting the membership, so whatever is done should be available to
whoever is interested. Whatever I do will be clearly announced here.

Robert.
nedbrek
2010-07-21 12:05:55 UTC
Permalink
Hello all,
Post by Robert Myers
Post by David L. Craig
If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
Moving the discussion to some place slightly less visible than
comp.arch might not produce more productive flights of fancy, but I,
for one, am interested in what is physically possible and not just
what can be built with the consent of Sen. Mikulski--a lady I have
always admired, to be sure, from her earliest days in politics, just
not the person I'd cite as intellectual backup for technical
decisions.
If we are only limited by physics, a lot is possible...

Can you summarize the problem space here?
1) Amount of data - fixed (SPEC), or grows with performance (TPC)
2) Style of access - you mentioned this some, regular (not random) but not
really suitable for sequential (or cache line) structures. Is it sparse
array? Linked lists? What percentage is pointers vs. FMAC inputs?
3) How branchy is it?

I think that should be enough to get some juices going...

Ned
jacko
2010-07-21 17:00:00 UTC
Permalink
Post by nedbrek
If we are only limited by physics, a lot is possible...
At least one hopes so.
Post by nedbrek
Can you summarize the problem space here?
1) Amount of data - fixed (SPEC), or grows with performance (TPC)
2) Style of access - you mentioned this some, regular (not random) but not
really suitable for sequential (or cache line) structures.  Is it sparse
array?  Linked lists?  What percentage is pointers vs. FMAC inputs?
3) How branchy is it?
1) As an example finite element modeling of large systems. 4D => x, y,
z, t. Say heatflow patterns in a multilayer silicon device.
2) Start with something like a 2D matrix determinant or inversion.
Often sparse. Linked lists are better at sparse data. Very few pointers,
mainly double floats.
3) Surprisingly not that branchy, but very loopy.
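For instance, a minimal 2D heat-flow sweep (Python, with sizes and a
diffusion constant chosen purely for illustration) has exactly this
shape: essentially no branches, but tight loops and many bytes moved
per flop:

```python
# Explicit 5-point stencil for heat flow on an N x N grid.
N = 64
u = [[0.0] * N for _ in range(N)]
u[N // 2][N // 2] = 100.0      # point heat source in the middle
alpha = 0.1                    # diffusion constant (illustrative)

def step(u):
    v = [row[:] for row in u]
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            # ~6 flops per point against 5 reads + 1 write of 8-byte
            # doubles: bytes-per-flop dominates, not flops.
            v[i][j] = u[i][j] + alpha * (
                u[i - 1][j] + u[i + 1][j] +
                u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j])
    return v

for _ in range(10):
    u = step(u)
assert u[N // 2][N // 2] < 100.0   # heat has diffused outward
```

The inner loop is branch-free and perfectly regular ("crystalline"),
yet each flop drags several bytes along with it, which is the whole
bandwidth complaint in miniature.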

Cheers Jacko
Robert Myers
2010-07-21 18:12:10 UTC
Permalink
Post by jacko
Post by nedbrek
If we are only limited by physics, a lot is possible...
At least one hopes so.
Post by nedbrek
Can you summarize the problem space here?
1) Amount of data - fixed (SPEC), or grows with performance (TPC)
2) Style of access - you mentioned this some, regular (not random) but not
really suitable for sequential (or cache line) structures. Is it sparse
array? Linked lists? What percentage is pointers vs. FMAC inputs?
3) How branchy is it?
1) As an example finite element modeling of large systems. 4D => x, y,
z, t. Say heatflow patterns in a multilayer silicon device.
2) Start with something like a 2D matrix determinant or inversion.
Often sparse. Linked lists are better at sparse data. Very few pointers,
mainly double floats.
3) Surprisingly not that branchy, but very loopy.
That's an example of the kind of problem that (in my perception) has
come to dominate computational physics and, at the macro level (a
warehouse full of processors with some kind of wires connecting them) is
reasonably well-served by current "supercomputers."

There is still plenty left to be done in what I would call
boundary-dominated computational physics, but, in the problems I'm
concerned about, I doubt very much if the free (boundary-free) field is
being calculated correctly. It would be as if you were trying to do
scattering in a Born approximation and didn't even get the zeroth term
(the incident plane wave) right.

complex geometries -> linked lists, sparse, irregular matrices.

non-linear free field -> dense matrices that lend themselves to clever
manipulation and that, in many cases, can be diagonalized at relatively
low cost.

The actual problem -> accurate representation of a nonlinear free field
+ non-trivial geometry == bureaucrats apparently prefer to pretend that
the problem doesn't exist, or at least not to scrutinize too closely
what's behind the plausible-looking pictures that come out.

Robert.
jacko
2010-07-21 18:59:13 UTC
Permalink
Post by Robert Myers
Post by jacko
Post by nedbrek
If we are only limited by physics, a lot is possible...
At least one hopes so.
Post by nedbrek
Can you summarize the problem space here?
1) Amount of data - fixed (SPEC), or grows with performance (TPC)
2) Style of access - you mentioned this some, regular (not random) but not
really suitable for sequential (or cache line) structures.  Is it sparse
array?  Linked lists?  What percentage is pointers vs. FMAC inputs?
3) How branchy is it?
1) As an example finite element modeling of large systems. 4D => x, y,
z, t. Say heatflow patterns in a multilayer silicon device.
2) Start with domething like a matrix 2D determinant or inversion.
Often sparse. Link lists are better at sparce data. Very few pointers,
mainly double floats.
3) Suprisingly not that branchy, but very loopy.
That's an example of the kind of problem that (in my perception) has
come to dominate computational physics and, at the macro level (a
warehouse full of processors with some kind of wires connecting them) is
reasonably well-served by current "supercomputers."
There is still plenty left to be done in what I would call
boundary-dominated computational physics, but, in the problems I'm
concerned about, I doubt very much if the free (boundary-free) field is
being calculated correctly.  It would be as if you were trying to do
scattering in a Born approximation and didn't even get the zeroth term
(the incident plane wave) right.
complex geometries -> linked lists, sparse, irregular matrices.
non-linear free field -> dense matrices that lend themselves to clever
manipulation and that, in many cases, can be diagonalized at relatively
low cost.
The actual problem -> accurate representation of a nonlinear free field
+ non-trivial geometry == bureaucrats apparently prefer to pretend that
the problem doesn't exist, or at least not to scrutinize too closely
what's behind the plausible-looking pictures that come out.
Robert.
Umm, I think a need for up to cubic fields is reasonable in modelling.
Certain effects do not show in the quadratic or linear approximations.
This can be done by tripling the variable count, and lots more
computation, but surely there must be ways.

Quartic modelling may not serve that much of an extra purpose, as a
cusp catastrophe is within the cubic. Mapping the field to x, and
performing an inverse map to find applied force can linearize certain
problems.
Robert Myers
2010-07-21 19:25:54 UTC
Permalink
Post by jacko
Post by Robert Myers
The actual problem -> accurate representation of a nonlinear free field
+ non-trivial geometry == bureaucrats apparently prefer to pretend that
the problem doesn't exist, or at least not to scrutinize too closely
what's behind the plausible-looking pictures that come out.
Robert.
Umm, I think a need for up to cubic fields is reasonable in modelling.
Certain effects do not show in the quadratic or linear approximations.
This can be done by tripling the variable count, and lots more
computation, but surely there must be ways.
Quartic modelling may not serve that much of an extra purpose, as a
cusp catastrophe is within the cubic. Mapping the field to x, and
performing an inverse map to find applied force can linearize certain
problems.
I don't want to alienate the computer architects here by turning this
into a forum on computational mathematics.

Maybe you know something about nonlinear equations that I don't. If you
know enough, maybe you want to look into the million dollar prize on
offer from the Clay Institute for answering some very fundamental
questions about either the Navier-Stokes or the Euler equations.

Truncation of the hierarchy of equations for turbulence by assuming that
the fourth cumulant is zero leads to unphysical results, like negative
energies in the spectral energy distribution. I'm a tad muddy on the
actual history now, but I knew that result decades ago.

There is, as far as I know, no ab initio or even natural truncation of
the infinite hierarchy of conserved quantities that isn't problematical.
There are various hacks that work--sort of. Every single plot that
you see that purports to represent the calculation of a fluid flow at a
reasonable Reynolds number depends on some kind of hack.

For the Navier-Stokes equations, nature provides a natural cut-off scale
in length, the turbulent dissipation scale, and ab initio calculations
at interesting turbulent Reynolds numbers do exist up to Re~10,000.
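The cost of those ab initio (DNS) calculations can be put in rough numbers with standard Kolmogorov scaling (my back-of-the-envelope sketch, not a claim from the post): the dissipation scale shrinks like Re**(-3/4), so the grid needs roughly Re**(9/4) points in 3-D.

```python
# Back-of-the-envelope sketch of why Re ~ 10,000 is near the practical
# ceiling for direct simulation. The scaling is the standard Kolmogorov
# estimate; the absolute numbers are illustrative only.

def dns_grid_points(re):
    """Rough 3-D point count for direct simulation at Reynolds number re."""
    points_per_dim = re ** 0.75  # L / eta ~ Re^(3/4)
    return points_per_dim ** 3   # ~ Re^(9/4)

for re in (1000, 10000, 100000):
    print("Re=%6d: ~%.2e grid points" % (re, dns_grid_points(re)))
```

A tenfold increase in Re multiplies the point count by 10**2.25, roughly 178x, before counting the extra time steps the finer grid demands.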

As I've tried (unsuccessfully) to explain here, the interaction between
the longest and shortest scales in a problem that is more than weakly
non-linear (problems for which expansions in the linear free-field
propagator do not converge) is not some arcane mathematical nit, but
absolutely fundamental to the understanding of lots of questions that
one would really like the answer to.

Even if people continue to build careers based on calculations that
blithely ignore a fundamental reality of the governing equations, and
even if Al Gore could go through another ten reincarnations without
understanding what I'm talking about, the reality won't go away because
the computers to address it are inconveniently expensive.

Robert.
jacko
2010-07-21 20:03:55 UTC
Permalink
Navier-Stokes is one of the hardest, and most useful to model.

My limit went to phrasing the Re expression as an inequality to a
constant, and applying the initial steps of Uncertain Geometry to make
a possible strong 'turbulence as uncertainty' idea.
http://sites.google.com/site/jackokring for uncertain geometry.

But yes, more emphasis should be placed on nonlinear fluid modelling as
a test benchmark of GPU-style arrays.
nedbrek
2010-07-22 12:31:00 UTC
Permalink
Hello all,
Post by Robert Myers
Post by jacko
Post by Robert Myers
The actual problem -> accurate representation of a nonlinear free field
+ non-trivial geometry == bureaucrats apparently prefer to pretend that
the problem doesn't exist, or at least not to scrutinize too closely
what's behind the plausible-looking pictures that come out.
Umm, I think a need for upto cubic fields is resonable in modelling.
Certain effects do not show in the quadratic or linear approximations.
This can be done by tripling the variable count, and lots more
computation, but surely there must be ways.
Quartic modelling may not serve that much of an extra purpose, as a
cusp catastrophy is within the cubic. Mapping the field to x, and
performing an inverse map to find applied force can linearize certain
problems.
Truncation of the hierarchy of equations for turbulence by assuming that
the fourth cumulant is zero leads to unphysical results, like negative
energies in the spectral energy distribution. I'm a tad muddy on the
actual history now, but I knew that result decades ago.
There is, as far as I know, no ab initio or even natural truncation of the
infinite hierarchy of conserved quantities that isn't problematical.  There
are various hacks that work--sort of. Every single plot that you see that
purports to represent the calculation of a fluid flow at a reasonable
Reynolds number depends on some kind of hack.
For the Navier-Stokes equations, nature provides a natural cut-off scale
in length, the turbulent dissipation scale, and ab initio calculations at
interesting turbulent Reynolds numbers do exist up to Re~10,000.
I'm not following very well...

I think you are saying the problem is resistant to mathematical models
(which is fine with me, I am skeptical of mathematical models of physical
processes). jacko is suggesting some sort of simulation (finite elements?)
where 4D is necessary, although you might be able to reduce it to 3D. You
seem certain that 4D is necessary. 4D will add a lot of data, and
operations, but should be doable.

My main concern is the "infinite hierarchy" and Re~10,000. If there are
infinities involved, some sort of mathematical analysis will be necessary -
we cannot simulate infinite :) If Re~10,000 means we need to do 10,000
operations at each node - I think that is doable (although expensive). If
it means 10,000 dimensions for data - that is probably too much.

Ned
jacko
2010-07-22 12:16:56 UTC
Permalink
Post by nedbrek
I'm not following very well...
I think you are saying the problem is resistant to mathematical models
(which is fine with me, I am skeptical of mathematical models of physical
processes).  jacko is suggesting some sort of simulation (finite elements?)
where 4D is necessary, although you might be able to reduce it to 3D.  You
seem certain that 4D is necessary.  4D will add a lot of data, and
operations, but should be doable.
My main concern is the "infinite hierarchy" and Re~10,000.  If there are
infinites involved, some sort of mathematical analysis will be necessary -
we cannot simulate infinite :)  If Re~10,000 means we need to do 10,000
operations at each node - I think that is doable (although expensive).  If
it means 10,000 dimensions for data - that is probably too much.
Simply put, although doing so might make practical use prone to
mis-simulation, Re is a turbulence merit figure: below it all's fine,
above it it's as swirly as a cavitation soup. It's probably linked to
the Bernoulli effect. The hierarchy is due to the fractal swirls of
the turbulence. It needs to be 4D if all fluid flow is modelled, and
no unexpected effects are to be occurring.

It is good for a benchmark, as many sub-problems such as the EM wave
equation and the 'heat in a solid' equation, which are more ordered in
terms of no turbulence, will still have similar memory access patterns.
jacko
2010-07-20 18:55:08 UTC
Permalink
Post by David L. Craig
On Jul 20, 11:31 am, Andy Glew <"newsgroup at comp-arch.net">
Post by Andy Glew <"newsgroup at comp-arch.net">
We welcome new blood, and new ideas.
These are new ideas?  I hope not.
Post by Andy Glew <"newsgroup at comp-arch.net">
I'm with you, David. Maximizing what I call the MLP, the
memory level parallelism, the number of DRAM accesses that
can be concurrently in flight, is one of the things that
we can do.
Me, I'm just the MLP guy:  give me a certain number of
channels and bandwidth, I try to make the best use of
them.  MLP is one of the ways of making more efficient
use of whatever limited bandwidth you have. I guess that's
my mindset - making the most of what you have.  Not because
I don't want to increase the overall memory bandwidth.
But because I don't have any great ideas on how to do so,
apart from
  a) More memory channels
  b) Wider memory channels
  c) Memory channels/DRAMs that handle short bursts/high
     address bandwidth efficiently
  d) DRAMs with a high degree of internal banking
  e) aggressive DRAM scheduling
Actually, c,d,e are really ways of making more efficient
use of bandwidth, i.e. preventing pins from going idle
because the burst length is giving you a lot of data you
don't want.
  f) stacking DRAMs
  g) stacking DRAMs with an interface chip such as Tom
     Pawlowski of micron proposes, and a new abstract
     DRAM interface, enabling all of the good stuff
      above but keeping DRAM a commodity
  h) stacking DRAMs with an interface chip and a
     processor chip (with however many processors you
     care to build).
If we're talking about COTS design, FP bandwidth is
probably not the area in which to increase production
costs for better performance.  As Mitch Alsup observed
a little after the post I've been quoting became
Post by Andy Glew <"newsgroup at comp-arch.net">
We are at the point where, even when the L2 cache
supplies data, there are too many latency cycles for
the machine to be able to efficiently strip mine
data. {And in most cases the cache hierarchy is not
designed to efficiently strip mine data, either.}
Have performance runs with various caches disabled
indicated that any gains could be realized there?  If so,
I think that makes the case for adding circuits to
increase RAM parallelism as the cores fight it out for
timely data-in and data-out operations.
If we're talking about custom, never-mind-the-cost
designs, then that's the stuff that should make this
a really fun group.
Why want in a explcit eans be< (short) all functors line up to allign.
Edward Feustel
2010-07-21 10:06:06 UTC
Permalink
On Tue, 20 Jul 2010 08:31:46 -0700, Andy Glew <"newsgroup at
Post by Andy Glew <"newsgroup at comp-arch.net">
Me, I'm just the MLP guy: give me a certain number of channels and
bandwidth, I try to make the best use of them. MLP is one of the ways
of making more efficient use of whatever limited bandwidth you have. I
guess that's my mindset - making the most of what you have. Not because
I don't want to increase the overall memory bandwidth. But because I
don't have any great ideas on how to do so, apart from
a) More memory channels
b) Wider memory channels
c) Memory channels/DRAMs that handle short bursts/high address
bandwidth efficiently
d) DRAMs with a high degree of internal banking
e) aggressive DRAM scheduling
Actually, c,d,e are really ways of making more efficient use of
bandwidth, i.e. preventing pins from going idle because the burst length
is giving you a lot of data you don't want.
f) stacking DRAMs
g) stacking DRAMs with an interface chip such as Tom Pawlowski of
micron proposes, and a new abstract DRAM interface, enabling all of the
good stuff above but keeping DRAM a commodity
h) stacking DRAMs with an interface chip and a processor chip (with
however many processors you care to build).
It is interesting as to what we thought the original poster was
interested in. I was intrigued by the notion of higher-bandwidth
inter-processor communication a la the CM-2, not just higher bandwidth
a la the STAR 100 or the various CRAYs. Our use of many processors would
appear to cry out for this.

What about "higher-level constructs" that permit processors to "know"
what they are trying to "obtain"/"give" and that permit the processors
to overlap/schedule operations on things that are larger than a few
bytes. I realize this takes more gates, but we appear to have gates
to spare.
Ed
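The MLP point quoted above can be put in rough numbers with Little's law (the latency, line size, and peak figures below are illustrative assumptions of mine, not measurements): sustained bandwidth from one core is capped at bytes in flight divided by memory latency.

```python
# Little's-law sketch of memory-level parallelism. All constants are
# illustrative assumptions, not measured values for any real machine.

def sustained_bw_gbs(in_flight, line_bytes=64, latency_ns=100.0,
                     peak_gbs=50.0):
    """Per-core bandwidth: N concurrent misses * line size / latency,
    capped at the channel peak (bytes/ns == GB/s)."""
    return min(in_flight * line_bytes / latency_ns, peak_gbs)

for n in (1, 4, 16, 64):
    print("%3d in flight -> %5.2f GB/s" % (n, sustained_bw_gbs(n)))
```

With one miss in flight the core sees 0.64 GB/s under these assumptions, no matter how many channels exist; that is why raising MLP and raising raw channel count are complementary rather than interchangeable.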
j***@cix.compulink.co.uk
2010-07-19 19:29:56 UTC
Permalink
Post by Robert Myers
Since I have talked most about the subject here and gotten the most
valuable feedback here, I thought to solicit advice as to what kind
of forum would seem most plausible/attractive to pursue such a
subject.
A mailing list seems the most plausible to me. When the subject doesn't
have a well-defined structure (as yet), a wiki or web BBS tends to get
in the way of communication.
--
John Dallman, ***@cix.co.uk, HTML mail is treated as probable spam.