Discussion:
The Technology of PS3
subsystem
2003-11-13 01:55:57 UTC
old but otherwise interesting read :)

http://www.xboxrules.com/yabbse/index.php?threadid=47


The Technology of PS3
Eddie Edwards, April 2003
Foreword

Recent news articles have explained that the patent application for
the technology on which PS3 will presumably be based is now available
online. I've spent some time examining the patent and I have formed
some theories and educated guesses as to what it all means in
practice. This document describes the patent and outlines my ideas.
Some of these guesses are informed by my knowledge of PS2 (I was one
of the VU coders on Naughty Dog's Jak & Daxter although I do not work
for Sony now). You may wish to refer to Paul Zimmons' PowerPoint
presentation which has diagrams that might make some of this stuff
clearer. Also, until I get told to take it down, I have made the
patent itself available in a more easily downloadable form (a 2MB ZIP
containing 61 TIF files).

The technology of PS3 is based on what IBM call the "Cell
Architecture". This architecture is being developed by a team of 300
engineers from Sony, IBM and Toshiba. PS2 was developed by Sony and
Toshiba. Sony appear to have designed the basic architecture, while
Toshiba have figured out how to implement it in silicon. The new
consortium includes IBM, who for PS3 will use their advanced
fabrication technologies to build the chips faster and smaller than
would otherwise have been possible. In addition, the effort is
supposedly a holistic approach whereby tools and applications are
being developed alongside the hardware. IBM have particular expertise
in building applications and operating systems for massively parallel
systems - I expect IBM to have significant input into the software for
this system.

There is a lot of PS2 in the Cell Architecture. It is the PS2 flavour
that is most apparent to me when I read the patent. However, IBM must
be bringing a significant amount of stuff to the table too. The patent
for instance refers to a VLIW processor with 4 FPUs, rather than a
dual-issue processor with a single SIMD vector FPU. Does this imply
that the chips are based on an IBM-style VLIW ALU set? Or does it just
mean that it's a fast VU with a "very long instruction word" of only 2
instructions? Furthermore, note that IBM have been making and selling
massively parallel supercomputers for several decades now. IBM
experts' input on the programming paradigms and tool set are going to
be invaluable. And the host processor finally drops the MIPS ISA in
favour of IBM's own PowerPC instruction set. But we may not get to
program the PPCs inside the PS3 anyway.

I have had to make assumptions. Forgive them. If anyone with insight
or knowledge wishes to enlighten me, please do.
Contents

* Foreword
* Cells
* APUs
o Instruction Width
* Winnie the PU
* PEs
* The Broadband Engine
* Visualizers
* Will the Real PS3 Please Stand Up?
* Memory : Sandboxes
* Memory : Producer / Consumer Synchronization
* Memory : Random Access, Caches, etc.
* Forward and Sideways Compatibility
* Graphics
o Modelling
* Programming PS3
* Jazzing with Blue Gene
* Stream Processing
* Readers' Comments
* Links and References

Cells

(There is some confusion as to what a "cell" is in this patent. The
media is generally using the term "cell" for what the patent calls a
"processing element" or "PE". In the patent, the term "cell" refers to
a unit of software and data, while the term "PE" refers to a
processing element that contains multiple processing units. I will use
that nomenclature here.)

Cells are central to the PS3's network architecture. A cell can
contain program code and/or data. Thus, a cell could be a packet in an
MPEG data stream (if you were streaming a movie online) or it could be
a part of an application (e.g. part of the rendering engine for a PS3
game). The format of a cell is loosely defined in the patent. All
software is made up of cells (and here I use software in its most
general sense to include programs and data). Furthermore, a cell can
run anywhere on the network - on a server, on a client, on a PDA, etc.

Say, for instance, that a website wanted to stream a TV signal to you
in their new improved format DivY. They could send you a cell that
contained the program instructions for decoding the DivY stream into a
regular TV picture. Then they send you the DivY-encoded picture
stream. This would work if you had a PS3 or if you had a digital TV,
or even if you had a powerful enough PDA - assuming their design
followed the new standard.

Depending on how "open" Sony make this it might be easy or impossible
to program your own PS3 just by sending it data packets you want it to
run. (Note that Sony's history in this respect is interesting - their
PSX Yaroze and PS2 Linux projects do show some willingness to open
their machines up to hobbyists.)
APUs

Cells run on one or more "attached processing units" or APUs (I
pronounce this after the character in the Simpsons!) An APU is
architecturally very similar to the vector unit (VU) found in PS2, but
bigger and more uniform:

* 128-bit processor
* 1024-bit external bus
* 128K (8192 x 128-bit words) of local RAM
* 128 x 128-bit registers
* 4-way floating point vector unit giving 32GFLOPS
* 4-way integer vector unit giving 32GIOPS

(Compare this to the VU's 128-bit external bus, 16K of code RAM, 16K
of data RAM, 32 x 128-bit registers, single way 16-bit integer unit,
and only 1.2GFLOPS.)

The APU is a very long instruction word (VLIW) processor. Each cycle
it can issue one instruction to the floating point vector unit and one
to the integer vector unit simultaneously. It is much more similar to
a traditional DSP than to a CPU like the Pentium III - it does no
dynamic analysis of the instruction stream, no reordering. The
register set is imbued with enough ports that the FPU and the IPU can
each read 3 registers and write one register on each cycle. Unlike the
VU, the integer unit on the APU is vectorized: each vector element is
a 32-bit int (VU was only 16-bit), and the register set is shared with
the FPU (in VU there is a smaller dedicated integer register set). APU
should therefore be somewhat easier to program and much more
general-purpose than the VU.

Unlike the VU, which used a Harvard architecture (separate program and
data memories), the APU seems to use a traditional (von Neumann)
architecture where the 128K of local RAM is shared by code and data.
The local RAM appears to be triple-ported so that a single load or
store can occur in parallel with an instruction fetch, mitigating the
von Neumannism (the other port is for DMA). The connection is 256 bits
wide (2 x 128 bits), so only one load or store can occur per cycle -
it seems reasonable to assume therefore that the load/store
instructions only occur on the integer side of the VLIW instruction,
as was the case on the VU. Since there is no distinction between
integer and floating point registers this works out just fine. The
third RAM port attaches the APU to other components in the system and
allows data to be DMAed in or out of the chip 1024 bits at a time.
These DMAs can be triggered by the APU itself, which differs from the
PS2 where only the host processor could trigger a DMA.

Note that the APU is not a coprocessor but a processor in its own
right. Once loaded with a program and data it can sit there for years
running it independently of the rest of the system. Cells can be
written to use one or more APUs, thus multiple APUs can cooperate to
perform a single logical task. A telling example given in the patent
is where three APUs convert 3D models into 2D representations, and one
APU then converts this into pixels. The implication is that PS3 will
perform pure software rendering.

The declared speed of these APUs is awesome - 32GFLOPS + 32GIOPS (32
billion floating-point operations and 32 billion integer
operations per second). I expect Sony consider a 4-way vectorized
multiply-accumulate instruction to be 8 FLOPs, so the clock speed of
the APU is 4GHz, as has been reported elsewhere in the media. This is
very much faster than the PS2's sedate 300MHz clock - by about 13
times. I presume that the FPUs are pipelined (i.e. you can issue one
instruction per cycle but it takes, say, four cycles to come up with
the answer). But if PS2 had a 4-stage pipeline for the multipliers at
300MHz, what's the pipeline depth going to be at 4GHz? 8 stages? 16
stages? The details of this will depend on the precise design of the
APU and this is not covered by the patent, but it is worth noting that
naked pipelines are hard to code for at a depth of 4; at a depth of
greater than this it may simply be unfeasible to write optimal code
for these parts.
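
The clock-speed inference above is plain arithmetic, and a short sketch makes it easy to check. The figure of 8 FLOPs per cycle is my assumption from the text (a 4-way multiply-accumulate counted as 8 operations), not something the patent states.

```python
# Back-of-envelope check of the APU clock inference.
# Assumption (from the text, not the patent): one 4-way vector
# multiply-accumulate per cycle is counted as 8 FLOPs.
APU_GFLOPS = 32              # declared peak per APU
FLOPS_PER_CYCLE = 4 * 2      # 4 lanes x (multiply + add)

apu_clock_ghz = APU_GFLOPS / FLOPS_PER_CYCLE
ps2_clock_ghz = 0.3          # PS2's 300MHz clock
clock_ratio = apu_clock_ghz / ps2_clock_ghz

print(f"Inferred APU clock: {apu_clock_ghz:g} GHz")
print(f"Clock ratio vs PS2: {clock_ratio:.1f}x")
```

If Sony count FLOPs differently (say, 4 per cycle), the same sum would give a different clock, so the result is only as good as that assumption.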

Note: the APUs may instead be using an IBM-style VLIW architecture
where each ALU (4 floating point and 4 integer) is operable
independently from different parts of the instruction word. However,
the word size of the registers is 128 bits, so each floating point unit
must access part of the same register. This seriously limits the
effectiveness of a VLIW architecture and makes it rather difficult to
program for. I therefore assume that the ALUs are acting like typical
4-way vector SIMD units.

One interesting departure from PS2 is that all software cells run on
APUs. On PS2 there were two VUs but also one general-purpose CPU (a
MIPS chip). This chip was the only chip in the system capable of the
128-bit vector integer operations (necessary for fast construction of
drawlists), and this functionality is now subsumed into the APU. There
is a non-APU processor in the new system but it only runs OS code, not
cells, so its precise architecture is irrelevant - it could be
anything, and the same software cells would still run on the APUs just
fine.
Instruction Width

Given 128 registers, it takes 7 bits to identify a register. Each
instruction can have 3 inputs and 1 output which is 28 bits. I am
presuming they are keeping the extremely useful vector element masks
which would add 4 bits to the FPU side. Only in the case of MAC
(multiply-accumulate) are 3 inputs actually needed, but say you
specify a MAC on both the IPU and FPU - that's 60 bits for register
specifications alone. I therefore doubt that the instruction length is
64 bits - I think the VLIW on the APU must be 128 bits wide, which is
reasonable since that's the word length and since there is bandwidth
to read 128 bits out of memory per cycle as well as do a load/store
to/from memory at the same time. But this is probably going to mean
code is not overly compact - only 8,192 instructions will fit into the
whole of APU RAM, with no room for data in that case.

On the other hand, 128 bits is a lot of bits for an instruction given
that only 60 are used so far. Assuming 256 distinct instructions per
side (which is very very generous) that's 8 bits per side making 76.
My guess is they may have another 16 bits to mask integer operations,
just as 4 bits mask the FPU operations. 16 bits enables you to isolate
any given byte(s) in the register. That's 92.
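
To keep track of the running total so far, here is a small tally of the guessed fields. Every width in it is my speculation from the paragraphs above; the patent does not specify the instruction encoding, and the field names are invented for illustration.

```python
# Tally of the guessed instruction fields. All widths are speculation;
# the field names are hypothetical labels, not patent terminology.
REGISTER_BITS = 7            # 128 registers need 7 bits each
OPERANDS = 4                 # 3 inputs + 1 output

budget = {
    "FPU register fields": OPERANDS * REGISTER_BITS,
    "FPU element mask": 4,
    "IPU register fields": OPERANDS * REGISTER_BITS,
    "FPU opcode": 8,         # 256 instructions per side
    "IPU opcode": 8,
    "IPU byte mask": 16,     # one bit per byte of a 128-bit register
}

total = 0
for field, bits in budget.items():
    total += bits
    print(f"{field:22s} {bits:3d} bits (running total {total})")

# With 128-bit instructions, how many fit in the 128K of local RAM?
LOCAL_RAM_BYTES = 128 * 1024
max_instructions = LOCAL_RAM_BYTES // (128 // 8)
print("128-bit instructions that fit in local RAM:", max_instructions)
```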

Another cool feature they might employ is conditional execution like
on the ARM - 4 bits would control each instruction's execution
according to the standard condition codes. I was surprised not to see
this on the VU in PS2 (perhaps ARM have a patent?) because it helps to
avoid a lot of petty little branches. If the PPC is influencing the
design, they may just throw a barrel shifter in after every
instruction too (that would be quite ARM-like as well). So even
without unaligned memory accesses you can isolate any field in a
128-bit word in a single mask-and-shift instruction. Another 7 bits
there too (integer only) ... that's still only 99 bits - 29 bits are
still available.

What seems to be common on classic VLIW chips is to have parts of the
ALU directly controlled by the instruction code. On a classic CPU
instructions are decoded to generate the control lines for the ALU
(for instance, to select which part of the ALU should calculate the
result). With a VLIW chip you can encode the control lines directly -
known as horizontal encoding. This makes the chips simpler and the
instructions more powerful - you can do unusual things with
instructions that you couldn't do on a regular CPU. Regular
instructions are present as special cases. It's more like you're
directly controlling the ALU with bitfields than you're issuing
"instructions". This can make the processor difficult for humans to
write code for - but in some ways easier for machines (e.g.
compilers). It is possible that the other 29 bits go towards this
encoding.

However, the patent does not go into much detail about any of this, so
you should treat the above few paragraphs with some suspicion until
more information comes to light.
Winnie the PU

As mentioned above, there is a secondary type of processing unit which
is called just "processing unit" or PU. The patent says almost nothing
about the internals of this component - media reports suggest that
current incarnations will be based on the PowerPC chip, but this is
not particularly relevant. The PU never runs user code, only OS code.
It is responsible for coordinating activity between a set of APUs -
for instance deciding which APUs will run which cells. It is also
responsible for memory management - deciding which areas of RAM can be
used by which APUs. The PU must run trusted code because, as I explain
later, it is the PU that sets up the "sandboxes" which protect the
whole system from viruses and similar malicious code downloaded off
the internet.
PEs

The patent then puts several APUs and a PU together to make a
"processing element" or PE. A typical PE will contain:

* A master PU which coordinates the activities in the PE
* A direct memory access controller or DMAC which deals with memory
accesses
* A number of APUs, typically 8

The PE is the smallest thing you would actually make into a chip (or
you can put multiple PEs onto a chip - see later). It contains a
1024-bit bus for direct attachment to DRAM. It also contains an
internal bus, the PE-bus. I am inferring that this bus is 1024 bits
wide since it attaches directly to the memory interface of the APUs,
which are 1024 bits wide. However, the patent says almost nothing in
detail about the PE-bus.

The DMAC appears to provide a simultaneous DMA channel for each APU -
that is, 8 simultaneous 1024-bit channels. The DRAM itself is split
into 8 sectors and for simultaneity each APU must be accessing a
different sector. Nominally the DRAM is 64MB and each sector is 8MB
large. Sectors themselves consist of 1MB banks configured as just 8192
1024-bit words.
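
The DRAM geometry above can be checked with a few lines. The sizes are the nominal ones from the text; the `locate` helper and its linear address mapping are purely my assumption for illustration, since the patent does not say how addresses map onto sectors and banks.

```python
# DRAM geometry as described in the text (nominal figures).
MB = 1024 * 1024
DRAM_BYTES = 64 * MB
SECTORS = 8
BANK_BYTES = 1 * MB
WORD_BITS = 1024
WORD_BYTES = WORD_BITS // 8

sector_bytes = DRAM_BYTES // SECTORS
banks_per_sector = sector_bytes // BANK_BYTES
words_per_bank = BANK_BYTES // WORD_BYTES

def locate(byte_address):
    """Hypothetical linear mapping of a byte address to (sector, bank, word)."""
    sector = byte_address // sector_bytes
    bank = (byte_address % sector_bytes) // BANK_BYTES
    word = (byte_address % BANK_BYTES) // WORD_BYTES
    return sector, bank, word

print("sector size (MB):", sector_bytes // MB)
print("banks per sector:", banks_per_sector)
print("1024-bit words per bank:", words_per_bank)
print("last byte lives at:", locate(DRAM_BYTES - 1))
```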

A PE with 8 APUs is theoretically capable of 256GFLOPS or 1/4 TFLOPS.
Surely that's enough power for a next gen console? Not according to
Sony ...
The Broadband Engine

Now we put together four PEs in one chip and get what the patent calls
a "Broadband Engine" or BE. Instead of each PE having its own DRAM the
four PEs share it. The PE-busses of each PE are joined together in the
BE-bus, and included on the same chip are optional I/O blocks. The
interface to the DRAM is still indicated as being external, but still
I assume the DRAM must be on the same die to accommodate the 8192-wire
interface.

Each PE in a BE gets only 1/4 of the memory bandwidth of a stand-alone
PE, since four PEs share the same DRAM. The sharing is done using a
crosspoint switch
whereby each of the 8 channels on each PE can be attached to any of
the 8 sectors on the DRAM.

Additionally, each BE has 8 external DMA channels which can be
attached to the DRAM through the same crosspoint mechanism. This
allows BEs to be attached together and to directly access each other's
DRAM (presumably with some delay). The patent discusses connecting BEs
up in various topologies.

One thing the patent talks about is grafting an optical waveguide
directly onto the BE chip so that BEs can be interconnected optically
- literally, the chip packaging would include optical ports where
optical fibres could be attached directly. Think about that! If the BE
plus DRAM was a self-contained unit, there would be no need at all for
high-frequency electrical interfaces in a system built of BEs, and
therefore board design should become much much easier than it is
today. Note that the patent makes it clear that the optical interface
is an option - it may never actually appear - but it would be very
useful in building clusters of these things, for instance in
supercomputers.

A BE with 4 PEs is theoretically capable of 1 TFLOPS - about 400 times
faster than a PS2.
Visualizers

Visualizers (VSs) are mentioned a few times through the document. A
visualizer is like a PE where 4 of the APUs are removed and in their
place is put some video memory (VRAM), a video controller (CRTC) and a
"pixel engine". Almost no details are given but it is a fair
assumption that one or more of these will form the graphical "backend"
for the PS3. I would presume the pixel engine performs simple
operations such as those performed by the backend of a regular
graphics pipeline - check and update Z, check and update stencil,
check and update alpha, write pixel. The existence of the VS is
further evidence to suggest that PS3 is designed for software
rendering only.

Diagrams in the patent suggest that visualizers can be used in groups
- presumably each VS does a quarter of the output image (similar to
the GS Cube).

In my section on graphics later, I describe the software rendering
techniques I believe PS3 will use. These techniques use at least 16x
oversampling of the image (i.e. a 2560 x 1920 image instead of a
640x480 image), and the obvious hardware implementation might be
capable of drawing up to 16 pixels simultaneously - which is
equivalent to 1 pixel of a 640x480 image per cycle. Since the NTSC
output is 640x480 I call these "pixels" while the 2560 x 1920 image is
composed of "superpixels", with 16 superpixels per pixel.
Will the Real PS3 Please Stand Up?

So what is PS3 to be, then? The patent mentions several different
possible system architectures, from PDAs (a single VS) through what it
terms "graphical workstations" which are one or two PEs and one or two
VSs, to massive systems made up of 5 BEs connected optically to each
other. Which one is the PS3?

The most revealing diagram to me is Figure 6 in the patent, which is
described as two chips - one being 4 PEs (i.e. a BE) and one being 4
VSs with an I/O processor which is rather coincidentally named IOP -
the same name as the I/O processor in PS2 (this component will still
be required to talk to the disk drive, joypads, USB ports, etc.) The
bus between the two main chips looks like it's meant to be electrical.
Oddly, each major chip has the 64MB of DRAM attached (on chip?) and
this only gives 128MB of total system RAM. That seems very very low. I
would expect a more practical system to have maybe 64MB per PE or VS
giving a total of 512MB of RAM - much more reasonable. So perhaps the
128MB is merely a type of "secondary cache" - fast on-chip RAM. Then
lots of slower RAM could be attached to the system using a regular
memory controller. This slow RAM would be much cheaper than the
"secondary cache" RAM and would probably not have the 8,192-wire
interface. In fact, looking at the PS2's GS design, there we have 4MB
of VRAM which has a 1024-bit bus to the texture cache - so perhaps the
64MB per PE is an extension of this VRAM design? On the other hand,
VRAM tends to be fast and low-latency, whereas the patent specifically
calls the 64MB per PE "slow DRAM".

So how powerful is this machine? Well the 4 PEs give us 1 TFLOPS. The
4 VSs give another 1/2 TFLOPS. Add integer instructions in, and call
what the pixel engine does "integer operations" too, and you pretty
soon see a machine that really is capable of trillions of operations
per second - a superbly ludicrous amount.

Assuming the pixel engine can handle a pixel (16 superpixels) per
cycle, at 4GHz with 4 VSs that's a fillrate of 16GPPS - enough to draw
a 640 x 480 x 60Hz screen with 800x overdraw. Lovely. (However, note
that when drawing triangles smaller than a pixel, a certain amount of
"overdraw" is required just to fill the screen - so the available
depth complexity is "only" of the order of 100 or so).
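
The peak figures for this guessed-at system can be totted up as follows. This is only a sketch: the 4GHz clock is inferred rather than stated in the patent, and the one-pixel-per-cycle pixel engine is my assumption.

```python
# Peak figures for the two-chip system guessed at from Figure 6.
APU_GFLOPS = 32
CLOCK_HZ = 4e9                        # inferred clock, not from the patent

pe_gflops = 8 * APU_GFLOPS            # 8 APUs per PE
be_gflops = 4 * pe_gflops             # 4 PEs per BE
vs_gflops = 4 * APU_GFLOPS            # a VS keeps 4 of the 8 APUs
system_gflops = be_gflops + 4 * vs_gflops

# Assumption: each pixel engine retires one pixel (16 superpixels) per cycle.
fill_rate_pps = 4 * CLOCK_HZ          # 4 VSs
screen_pps = 640 * 480 * 60           # pixels needed per second at 60Hz
overdraw = fill_rate_pps / screen_pps

print(f"BE: {be_gflops} GFLOPS, 4 VSs: {4 * vs_gflops} GFLOPS")
print(f"System: {system_gflops} GFLOPS")
print(f"Fill rate: {fill_rate_pps / 1e9:g} GPPS, overdraw about {overdraw:.0f}x")
```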
Memory : Sandboxes

The DRAM used in this system is not actually 1024-bits wide but 1024 +
N bits where N is extra control information. This extra information is
used in 2 ways - to provide hardware "sandboxing" whereby regions of
memory can be set up to allow access from only a certain subset of
APUs, and to provide hardware producer-consumer synchronization, which
I discuss later.

Sandboxes are implemented using the following logical test:
(REQID & REQIDMASK) == (MEMID & MEMIDMASK)

Here, REQID and REQIDMASK are an ID and mask associated with the APU
making the request; MEMID and MEMIDMASK are an ID and mask associated
with the memory location being read or written. If the results are
equal the access goes ahead, otherwise it is blocked.
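
As a minimal sketch, the test can be written out directly. The 8-bit IDs and the "upper four bits name a group" layout in the example are hypothetical choices of mine, made only to show how private and shared regions could fall out of the same comparison.

```python
def access_allowed(req_id, req_id_mask, mem_id, mem_id_mask):
    """The sandbox test: masked requester ID must equal masked memory ID."""
    return (req_id & req_id_mask) == (mem_id & mem_id_mask)

# Hypothetical 8-bit IDs: upper 4 bits = group, lower 4 bits = APU in group.
GROUP_MASK = 0xF0
EXACT_MASK = 0xFF

apu_a = 0x35      # group 3, APU 5
apu_b = 0x36      # group 3, APU 6
apu_c = 0x45      # group 4, APU 5

# Memory shared by everyone in group 3: compare group bits only.
shared_ok = access_allowed(apu_b, GROUP_MASK, 0x30, GROUP_MASK)
shared_blocked = access_allowed(apu_c, GROUP_MASK, 0x30, GROUP_MASK)

# Memory private to APU 0x35: compare every bit.
private_ok = access_allowed(apu_a, EXACT_MASK, 0x35, EXACT_MASK)
private_blocked = access_allowed(apu_b, EXACT_MASK, 0x35, EXACT_MASK)

print(shared_ok, shared_blocked, private_ok, private_blocked)
```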

This system allows for APUs to have private memory, memory shared with
a specific set of other APUs, and a quite open-ended set of other
permutations. It's not clear how this facility interacts with the
facility for BEs to directly read the memory of other BEs - one would
imagine 32 APUs per BE would mean the IDs and masks were 32 bits wide
with one bit per APU - but if a potentially unlimited set of APUs in
other BEs can access the DRAM then how are the IDs set up, I wonder?
Memory : Producer / Consumer Synchronization

The DRAM also performs another special function, that is to allow
automatic synchronization between an APU that is producing information
and an APU that is consuming that information. The synchronization
works per 1024-bit word. Essentially, the system is set up so that the
producer issues a DMA "sync write" to memory and the consumer issues a
DMA "sync read" from memory. If the sync write occurs first, all is
well. If the sync read occurs first, the consumer is stalled until the
write occurs.

What does that actually mean? Well, there is a bit in each memory
location internal to the APU. (We're talking here about 1024-bit
locations not 128-bit locations.) This bit is set when a "sync read"
is pending to that memory location. The patent description fluffs the
explanation of this a little, but I am inferring that the APU issues a
sync read and then carries on until code in the APU attempts to access
the data that has been read. If the data has not yet arrived from RAM,
the APU stops working until the data is available. (The patent seems
to imply that the APU stalls immediately but that doesn't make a whole
lot of sense since the extra bit internal to the RAM would not then be
necessary.)

This mechanism is very important - it means that you can prefetch data
into memory and as long as you can keep working on other stuff the
data will arrive when it is ready and the APU need not stall. So the
synchronization can be free in cycles (seamless) and free in terms of
PU overhead. The overhead in DRAM for each memory location is about 19
bits - 1 free/empty bit, an APU ID (5 bits) and 1 destination address
(13 bits). But note what I said above about there being more than 32
APUs accessing the same amount of DRAM - perhaps more than 5 bits is
necessary for the APU ID?
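
Here is a toy model of how I read the sync mechanism, using just the three fields listed above (free/empty bit, APU ID, destination address). It is a sketch under my own assumptions, not the patent's logic: in particular, I assume a word goes back to empty once it has been consumed, and all the names are invented.

```python
class SyncWord:
    """One 1024-bit DRAM location plus its control bits (modelled)."""
    def __init__(self):
        self.full = False     # the free/empty bit
        self.data = None
        self.pending = None   # (apu_id, local_address) of a waiting sync read

def sync_read(word, apu_id, local_addr, local_rams):
    """Return True if data arrived at once, False if the read is left pending."""
    if word.full:
        local_rams[apu_id][local_addr] = word.data
        word.full, word.data = False, None   # assumed: reading empties the word
        return True
    word.pending = (apu_id, local_addr)      # remember who is waiting, and where
    return False

def sync_write(word, data, local_rams):
    """Deliver to a waiting reader if there is one, else mark the word full."""
    if word.pending is not None:
        apu_id, local_addr = word.pending
        local_rams[apu_id][local_addr] = data
        word.pending = None
    else:
        word.full, word.data = True, data

local_rams = {0: {}, 1: {}}                  # local RAM of two APUs

# Case 1: the consumer's sync read arrives first and has to wait.
early = SyncWord()
had_to_wait = not sync_read(early, 1, 0x40, local_rams)
sync_write(early, "frame 0", local_rams)

# Case 2: the producer's sync write arrives first; the read completes at once.
late = SyncWord()
sync_write(late, "frame 1", local_rams)
immediate = sync_read(late, 1, 0x41, local_rams)

print(had_to_wait, immediate, local_rams[1])
```

In both cases the PU does nothing: the bookkeeping lives entirely in the extra bits attached to the memory word.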

Parity DRAM will already provide an extra 128 bits for every 1024 bit
word - this is more than enough to provide the 40 bits required by the
sandboxing and the synching, with 88 bits left for ECC (ECC is not
mentioned in the patent, but it is reasonable to assume it could be a
feature - ECC places an error correcting code into the spare bits of
DRAM so that in the unlikely event that a cosmic ray changes a bit
pattern in your RAM the system can detect and correct the error.
Honestly! I'm not making this up!)

The patent makes a fuss about how this synching will make it trivial
to read data from an I/O device. But it seems to me that the
synching's main function is to make it trivial to stream data around
the APUs with no intervention from the PU. You can set up arbitrary
producer-consumer graphs and have them work as if by magic. It's a
great feature for things like video processing where several APUs
might be doing MPEG compression of images that are read from a
digitizer and processed by other APUs (e.g. the addition of menus or
even special effects). Each stage must wait for data from the previous
stage, and this synchronization allows this to be done with minimum
hassle. As I discuss in the programming section, streams of data are
going to be a key concept in PS3 programming.
Memory : Random Access, Caches, etc.

Now I've said before that the PU doesn't run user code. Except, maybe
it does. A lot of a game can be shoehorned into the APU model -
certainly the whole graphics engine, probably much AI code, sound
code. But it's a case of writing a piece of code that handles a 128K
working set (minus code size). What about cases where you really,
really, really need random access to memory? What about just porting
some monolithic C onto the platform? How do we do that? Surely we need
a traditional processor for "regular" tasks?

Well it's still not clear that you do. Random access to memory is
going to screw you up. It would probably be faster to make a list of
memory locations you need to access first (in local RAM), sort that,
and do the memory accesses sequentially! A cache won't help in this
case.
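
A minimal sketch of that sort-then-fetch idea, assuming byte addresses and 1024-bit (128-byte) DRAM words. The function name and the plan format are hypothetical; the point is only that each DRAM word is fetched once, in ascending order.

```python
WORD_BYTES = 1024 // 8    # one 1024-bit DRAM word

def plan_fetches(addresses):
    """Group byte addresses by the DRAM word holding them, in ascending order.

    Returns {word_index: [byte offsets within that word]} so each word
    need only be DMAed into local RAM once.
    """
    plan = {}
    for addr in sorted(addresses):
        plan.setdefault(addr // WORD_BYTES, []).append(addr % WORD_BYTES)
    return plan

scattered = [5000, 130, 4999, 12, 129]
plan = plan_fetches(scattered)
for word_index, offsets in plan.items():
    print(f"fetch word {word_index}, then read offsets {offsets}")
```

Five scattered accesses collapse into three sequential word fetches here.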

What about porting monolithic C? I think the answer to this runs deep.
I think the answer is: you can't. I think that writing code for this
beast is fundamentally different. You have to break your code into
processes that can work with only 128K of data at one time. The PS3
will require a new approach to programming. I have my theories about
what this new approach is, which I describe later, but ultimately the
new approach should actually be more uniform than our current
approach. It will force modularity upon us, which might not be a
totally bad thing.

What the system may offer for trusted programs (e.g. off an encoded
DVD-ROM rather than off the internet) is the ability to modify the PU
operating system to e.g. change the algorithm it uses to distribute
cells among APUs. Or the ability to add drivers directly to the I/O
processor (IOP) on the PS3. It seems likely Sony will offer some level
of control to games programmers - control that is simply not required
by simple purveyors of streaming media. On the other hand, this would
destroy their ability to swap the PU over to a different processor
architecture.

Something which is not addressed by the patent is the question, what
granularity do APUs see in their local memory? Can they do byte
stores? 32-bit loads? My guess is that they can only transfer 1024-bit
quantities from main DRAM to local RAM, and can only transfer 128-bit
quantities from local RAM to registers, but that in a single
instruction they can isolate any bitfield from a pair of 128-bit
registers. But it is only a guess.
Forward and Sideways Compatibility

By forward compatibility I mean the programs can run on future revs of
the hardware without error. By sideways compatibility I mean the
programs can run on hardware that implements the same instruction set
but that is made by different manufacturers to different designs. In
both cases, we're talking about running programs on chips that have
different timing characteristics to the chips it was written on.

The patent discusses a timer that is provided on each PU. You tell it
how long you think an APU program ought to take, and if it takes less
time (say on a faster APU) then it waits until the specified time has
passed - so the program will never be faster than it should be.
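
As I read it, the timer behaves like the sketch below. The microsecond units and the function name are my own; the patent only describes a time budget, and says nothing about what happens on an overrun.

```python
def completion_time_us(actual_us, budget_us):
    """Completion as seen by the rest of the system: never before the budget."""
    return max(actual_us, budget_us)

BUDGET_US = 100

on_original_apu = completion_time_us(100, BUDGET_US)   # runs exactly to budget
on_faster_apu = completion_time_us(60, BUDGET_US)      # finishes early, then waits
on_stalled_apu = completion_time_us(130, BUDGET_US)    # overruns: nothing holds it back

print(on_original_apu, on_faster_apu, on_stalled_apu)
```

Note the asymmetry: the budget is a floor on completion time, not a ceiling.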

I don't get this part, really. At first I thought perhaps the timer
was for synching processes, but it only guarantees "no earlier than"
completion, so synching would be impossible since some processes may
not have "arrived" yet. Even if this is the intention, it would only
work if the time budget referred only to APU processing and not to
memory accesses - since the DRAM is shared with other PEs each APU
only has 1/4 of a DMAC channel available to it ... so stalls may occur
based on what the other APUs in the system are doing. You may easily
blow your timeframe this way, through no fault of your own, and
suddenly you're not synched up any more.

So what is this for, then? Perhaps it's to define timebases for your
game program overall - you use the timer to specify that the game runs
at 60Hz no faster even on a fast CPU. That seems unlikely, though,
because standard game programming repertoire includes ways to make
games run at real-time speed and higher frame rate (if the display can
support that) when the processing ability of the machine improves. So
maybe the timer is used to define the 60Hz NTSC output frequency - and
maybe the subfrequencies off that. Remember this is a 4GHz part. It
might use an APU to generate the entire NTSC picture. But it doesn't
seem to though since there is a CRTC specified in each VS.

Ideas anyone?
Graphics

Software rendering appears to be the order of the day, although it may
be premature to make this observation since they could always add some
other GPU if the performance of the software rendering failed to
impress. It might seem like a waste to do scan conversion in software
when Sony already have hardware that can scan convert 75 million
triangles per second in PS2.

But PS3 isn't going to do 75 million triangles per second. Oh no. It's
going to do a lot more than that. I'm going to stick my neck out and
say that PS3 will, at peak, be capable of 1 billion triangles per
second. But before I justify that figure, let us just assume that it
will do a lot of triangles. So many triangles that the average size of
one is less than a pixel. So what's to scan convert? It's a dot,
right? Well, no. You could draw a dot and you'd get an image but the
image would look pretty odd - more like a very high-resolution mosaic
than a properly anti-aliased CG image. The system will need to do
subpixel rendering and average out to get a nice image. 4x4
supersampling of a 640x480 image gives 2560x1920 superpixels -
1280x960 on each of the 4 VS units. Now, if we ever want to draw a
triangle larger than a pixel in width we subdivide it. This is all
done in code on the APU. Once the triangle is less than 4x4
superpixels in size there are algorithms you can use to very rapidly
determine which subpixels it covers. You keep bitmasks for every
possible (4x4x4x4 = 256) edge and you mask them together to give the
triangle coverage. Since the triangle is less than a pixel in size
there is no point texturing it - you just fill it with a single
colour. So we're rendering flatshaded polys. We can do a lot of these
in software. We end up with a nicely antialiased image which has the
appearance of texture mapping only because we referred to a texture
map when we decided each triangle's colour. We write APU programs to
determine this colour - these programs are called shaders. Anyone
familiar with RenderMan should be starting to understand what's going
on. In a sense the rendering capabilities of PS3 are very much akin to
a real-time Reyes RenderMan engine.
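
The edge-bitmask trick can be sketched as follows. This is my own reading of it, with assumptions labelled: an "edge" is a directed pair of points on a 4x4 grid (hence 16 x 16 = 256 table entries), samples sit on the same grid points, and edges are inclusive. The vertices must be given in a consistent winding order for the masks to intersect correctly.

```python
from itertools import product

GRID = 4
POINTS = list(product(range(GRID), repeat=2))    # 16 (x, y) grid positions

def edge_mask(a, b):
    """16-bit mask of samples on the non-negative side of directed edge a -> b."""
    mask = 0
    for bit, (sx, sy) in enumerate(POINTS):
        cross = (b[0] - a[0]) * (sy - a[1]) - (b[1] - a[1]) * (sx - a[0])
        if cross >= 0:                           # inclusive: on-edge samples count
            mask |= 1 << bit
    return mask

# Precomputed once: one mask per (start, end) pair, 4x4x4x4 = 256 entries.
EDGE_TABLE = {(a, b): edge_mask(a, b) for a in POINTS for b in POINTS}

def coverage(a, b, c):
    """Triangle coverage = AND of its three edge masks (three table lookups)."""
    return EDGE_TABLE[a, b] & EDGE_TABLE[b, c] & EDGE_TABLE[c, a]

big = coverage((0, 0), (3, 0), (0, 3))
small = coverage((0, 0), (1, 0), (0, 1))
print(f"table entries: {len(EDGE_TABLE)}")
print(f"big triangle covers {bin(big).count('1')} samples")
print(f"small triangle covers {bin(small).count('1')} samples")
```

A real implementation would need a proper fill rule so that triangles sharing an edge do not both claim the samples lying on it; the inclusive test here is only a simplification.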

So how on Earth do we get 1 billion triangles per second? Well, the
hardware to calculate the triangle fragments is small and fast. Then
all you need is the basic pixel operations we have already on GPUs -
z-test, stencil-test, alpha-test, alpha-blend, etc. Assuming 4x4
supersampling, each triangle covers up to 16 superpixels (actually
never more than 10). At 1 superpixel per cycle that is 1/16 triangle
per VS per cycle, which across 4 VSs at 4GHz gives 1 billion triangles
per second. (These 16 operations can even be
parallelized if the VRAM is split into 16 banks so we may even get a
theoretical 16 billion triangles per second, but 4x 4GHz APUs could
never drive this many triangles out so it seems a little pointless).
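
For what it's worth, the triangle-rate sums work out as below. The 4GHz clock and the 16 superpixel slots per triangle are the assumptions used throughout this section, not figures from the patent.

```python
CLOCK_HZ = 4e9                     # inferred clock
VS_UNITS = 4
SUPERPIXELS_PER_TRIANGLE = 16      # worst-case slots under 4x4 supersampling

# One superpixel per cycle per VS: a triangle takes 16 cycles.
serial_rate = CLOCK_HZ * VS_UNITS / SUPERPIXELS_PER_TRIANGLE

# If VRAM were split into 16 banks, all 16 superpixels could go at once.
banked_rate = serial_rate * 16

print(f"serial: {serial_rate / 1e9:g} billion triangles/s")
print(f"banked: {banked_rate / 1e9:g} billion triangles/s")
```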

It is possible to drive this from a simple 16-bit input mask, so
software needs to determine coverage and pixel position for each small
triangle; however this will take several cycles, while the hardware to
"scan convert" such polygons is small and fast (a 256x16 triple-ported
ROM plus some basic coordinate shifting logic). It's possible
therefore that the pixel engines are actually capable of
"scan-converting" these subpixel polygons, easing the software burden.

But a big caveat with all these triangle counts is the complexity of
the shader itself. The simplest shader might be able to kick out a
polygon every 16 cycles from each APU, pushing the pixel engines to
the max, but anything more complex (e.g. with texturing or shading or
fogging) will require more cycles. So as with all graphics systems, the
practical triangle count will likely be less than 20% of the
theoretical maximum.
Modelling

Procedural graphics has got to be the way to go with these things. No
matter how fast the DRAM is, it's not going to be comparable to the
32GFLOPS available on each APU. Memory will be very slow, processing
will be very fast. Pal Engstad at Naughty Dog pointed out to me that
VU programming on the PS2 is akin to old-fashioned tape-based
mainframe programming - you can read a small number of records from
the tape into local RAM and you can only read them sequentially or
efficiency suffers badly. There are algorithms for sorting records
held on tapes and these algorithms can be applied equally well to the
problem of sorting large arrays in memory using just the 128K
available to an APU. But at the end of the day, you're going to have
oodles of cycles just being wasted while you wait for memory.
Procedural models can be instanced from minimum data in RAM using
oodles of cycles. There's a natural match here.
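
To make the tape analogy concrete, here is a rough Python sketch of a two-pass external merge sort. local_capacity stands in for however many records fit in an APU's local RAM; the function name and the use of Python lists for main RAM are assumptions of the sketch, not anything taken from the patent.

```python
import heapq


def sort_with_small_local_ram(data, local_capacity):
    """Sort `data` while only ever sorting `local_capacity` records at once."""
    # Pass 1: read sequential chunks into "local RAM", sort each one,
    # and write it back out as a sorted run.
    runs = [
        sorted(data[i:i + local_capacity])
        for i in range(0, len(data), local_capacity)
    ]
    # Pass 2: k-way merge. Each run is consumed strictly front to back,
    # which is the access pattern a tape (or a DMA stream) likes.
    return list(heapq.merge(*runs))


records = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
result = sort_with_small_local_ram(records, local_capacity=4)
```

Every access to the big array is sequential, so the memory system streams rather than seeks. A real version would also have to bound the number of runs merged at once, which this sketch does not do.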
Programming PS3

It is truly unknown how PS3 will be programmed. There are many
possible models, because the architecture is so flexible. In
particular, hybrid models are most likely - i.e. not all APUs will be
programmed using the same model. For instance, fast stream "pipeline"
code (audio code, rendering code, decompression code) might be
written in native APU assembler. These assembler programs
will run on dedicated APUs and be controlled by other parts of the
program ... so the interactions are much simplified. In effect each
piece of code like this is functioning a lot like a piece of hardware,
and particularly in a simple "slave" mode. The massively parallel
nature of PS3 should not trouble the authors of this code too much.

On the other hand, some code, such as AI code, is heavily
object-oriented and relies on heavy intercommunication between
objects. This code will have to be written in an object-oriented
environment (that is, language plus run-time components). If a project
can run all AI code on a single APU, things will be simple. But this
defeats the point of having such a system in the first place, and it
defeats the scalability of the larger server systems (think of the
servers running an online game - also based on PS3 technology). So the
overall programming environment will need to handle the parallel
nature of the system for the programmer.

Collected here are some ideas about programming PS3.
Objects can be Locked using Memory-based Synch

The producer/consumer synching can be used to provide locking on
objects in RAM:

1. When an object is created it is written using Sync Write. This puts
the memory containing the object into full state.
2. If an APU wants to access the object, it issues a Sync Read.
Assuming the object is initialized the object memory is now in empty
state and the APU's local RAM contains a copy of the object.
3. When the APU is finished, it writes the local copy back to main RAM
using a Sync Write.
4. If another APU attempts to access the object while the other APU is
working, it is stalled until the Sync Write occurs.
5. A second APU attempting to access the object while this APU is
stalled will cause an error on the second APU (and hopefully the PU's
OS will handle this error and cause it to retry).
6. Deleting the object involves a Sync Read without a following Sync
Write.
7. The granularity of this technique is only 1024-bit DRAM words ...
128 byte blocks of RAM. Multiple objects could share a RAM block but
they could only be locked as a unit.
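
The steps above can be modelled with a toy Python class. This is a single-threaded simulation of my reading of the scheme, not the patent's actual semantics: a "stall" is represented by sync_read returning None, and the names SyncBlock and BusyError are invented for the example.

```python
class BusyError(Exception):
    """A second APU tried to wait on a block that already has a stalled reader."""


class SyncBlock:
    """One lockable 128-byte block of main RAM with a full/empty state."""

    def __init__(self):
        self.full = False    # full/empty state of the block
        self.data = None
        self.waiter = None   # id of the single APU allowed to stall here

    def sync_write(self, data):
        """Store data. Returns the id of a stalled APU that receives it, else None."""
        self.data = data
        if self.waiter is not None:
            # Hand the data straight to the stalled reader; the block stays
            # empty because that reader now holds the "lock".
            woken, self.waiter = self.waiter, None
            return woken
        self.full = True
        return None

    def sync_read(self, apu):
        """Return the data and empty the block, or None if the APU must stall."""
        if self.full:
            self.full = False
            return self.data
        if self.waiter is not None:
            raise BusyError(f"APU {apu}: block already has a stalled reader")
        self.waiter = apu
        return None


obj = SyncBlock()
obj.sync_write({"hp": 10})        # step 1: create the object
copy = obj.sync_read(apu=0)       # step 2: APU 0 takes the lock
stalled = obj.sync_read(apu=1)    # step 4: APU 1 stalls (None)
woken = obj.sync_write(copy)      # step 3: APU 0 writes back, waking APU 1
```

After the last line, woken is 1 and the block is still empty, because APU 1 now owns the object until it does its own Sync Write.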

Jazzing with Blue Gene

This will make you laugh. In 1999, IBM began the 5-year Blue Gene
project (the sequel to the chess-playing Deep Blue). Read about it.
The idea was to make a PetaFLOPS computer by taking processors at
1GFLOPS and placing 32 on a chip to make a 32GFLOPS chip. 64 of these
chips on a board would give 2 TFLOPS. A tower of 8 boards would give
16 TFLOPS and a room of 64 towers would make a PFLOPS. Enough to do
protein folding at interactive rates.

By 2005, PS3 will have APUs as powerful as Blue Gene's chips and chips
half as powerful as Blue Gene's boards. But Blue Gene is due at about
the same time as PS3. Is it possible that Blue Gene will simply be a
really really big PS3? Using Broadband Engine chips on the Blue Gene
boards would provide 64 TFLOPS per board, 512 TFLOPS per tower, so a
room of 64 towers would provide 32 PFLOPS. You could fit a mere PFLOPS
in a closet! The optical interface could be used to link towers (note
that at 4GHz light only travels 3 inches per clock cycle - so there is
quite a latency even at lightspeed!). 32 PFLOPS is about 2^55
floating-point operations per second.
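
The scaling arithmetic is easy to check in Python. The sketch below uses the chip, board, tower and room counts quoted above and follows the text in rounding with binary prefixes (64 x 32 GFLOPS = 2048 GFLOPS, called 2 TFLOPS); the function and variable names are just for this example.

```python
GFLOPS_PER_TFLOPS = 1024  # binary prefixes, matching the rounding in the text


def scale_up(chip_gflops, chips_per_board=64, boards_per_tower=8, towers=64):
    """Return (board, tower, room) peak GFLOPS for a given per-chip figure."""
    board = chip_gflops * chips_per_board
    tower = board * boards_per_tower
    room = tower * towers
    return board, tower, room


blue_gene = scale_up(32)      # 32 processors x 1 GFLOPS per chip
cell_based = scale_up(1024)   # a 1 TFLOPS Broadband Engine in each chip slot

blue_gene_pflops = blue_gene[2] / GFLOPS_PER_TFLOPS ** 2
cell_based_pflops = cell_based[2] / GFLOPS_PER_TFLOPS ** 2

# Distance light covers during one 4GHz clock cycle, in inches.
SPEED_OF_LIGHT_M_S = 299_792_458
inches_per_clock = SPEED_OF_LIGHT_M_S / 4e9 / 0.0254
```

This gives 1 PFLOPS for the original Blue Gene layout, 32 PFLOPS for the same room filled with Broadband Engines, and a little under 3 inches of light travel per clock.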

But if you're not even using your PS3, all that power could be used for
something like ***@Home or ***@Home - you run software that
connects your PS3 to other PS3s to form what's in effect a gigantic
supercomputer. Only 32,000 PS3s need be connected to match the
32 PFLOPS Blue Gene imagined above. If people are motivated, millions
of PS3s may be available at any one time.

One major application of this incredible computer power is to do
protein folding experiments - experiments which help to find the
causes of and hopefully cures to certain illnesses including Cystic
Fibrosis, Alzheimer's Disease and some cancers. But another is to
simulate nuclear bombs or hydrogen bombs. In the future these
supercomputers are also likely to be useful in genetic engineering
design, or in the design of fusion reactors (cheap clean power). It is
up to you whether or not you wish to support any of these causes, and
it should be up to you whether or not the machine you paid for is used
towards them. Remember that software cells can be transmitted to your
machine over the network, and it's up to Sony's OS when this is
allowed to happen. It would not be impossible for Sony to code the OS
so that any spare power is automatically given to any cause Sony
choose as long as you're online. It is important that consumers are
aware of the issue, and can donate their spare cycles to whoever they
choose.
Stream Processing

While researching this article I came across the Imagine stream
processor developed by William J. Dally's team at Stanford University.
This is a chip which is about 10x faster again (relative to clock
speed) than the PS3 chips, and uses a somewhat similar parallel
design. Another team at Stanford is doing related research into
Streaming Supercomputers which are just large batches of these chips
connected directly together. (It's not clear at this stage whether or
not Stanford patents for these designs might form "prior art" against
the Sony patents).

The Stanford team have come up with tools and methods for dealing with
programming on the stream architecture - they write "kernels"
(effectively APU code) in a language called KernelC which is compiled
and loop-unrolled (a la VU code) to target the VLIW architecture of
the Imagine processor. They then chain these kernels together with
streams using StreamC, which makes a regular program that runs on the
host processor (a PowerPC or ARM chip in this case). Note that the
Imagine system has mainly been used to accelerate specific tasks -
e.g. rendering - and not to run entire games which include rendering,
audio generation, AI and physics all at once.
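
The kernel/stream split can be imitated with Python generators. To be clear, this is not KernelC or StreamC syntax, which I am not reproducing here; it is only a sketch of the shape of such a program, with made-up kernel names and audio-style sample data.

```python
def gain_kernel(stream, gain):
    """A "kernel": consumes one input stream, produces one output stream."""
    for sample in stream:
        yield sample * gain


def clamp_kernel(stream, lo, hi):
    """A second kernel that limits each sample to the range [lo, hi]."""
    for sample in stream:
        yield max(lo, min(hi, sample))


def run_pipeline(samples):
    """The "host program": chains kernels together with streams."""
    amplified = gain_kernel(samples, 2.0)
    limited = clamp_kernel(amplified, -1.0, 1.0)
    return list(limited)


output = run_pipeline([0.25, 0.75, -0.9])
```

Each kernel only ever sees the record in front of it, which is what lets a stream machine run many kernels in parallel and keep the data moving sequentially.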
Readers' Comments

Please let me know if you have any comments or questions about this
page by emailing me at ***@tinyted.net. I will reproduce and answer
the most interesting comments and questions here. Let me know if you
wish to remain anonymous (I will not print email addresses, only
names).

In particular I would dearly love to hear from anyone on the Cell
Architecture team! I have two major questions:

1. What is the instruction encoding?
2. Can the PUs run user code or just system code?

Links and References
The Patent

The patent application (USPTO)
The patent application (single ZIP)
Paul Zimmons' PowerPoint presentation
Reyes / Renderman

Computer Graphics: Principles and Practice (2e)
Comparing Reyes and OpenGL on a Stream Architecture
The Reyes Image Rendering Architecture
Stanford Research

The Imagine stream processor homepage
Streaming Supercomputers
Protein Folding

The Science of ***@Home
Unravelling the Mystery of Protein Folding
Programming Protein Folding Simulations
-------------------------------------------------------------------------------
Hans de Vries
2003-11-13 12:51:18 UTC
Permalink
"subsystem" <***@nobody.net> wrote in message news:<NwBsb.21177$***@newssrv26.news.prodigy.com>...
> old but otherwise interesting read :)
>
> http://www.xboxrules.com/yabbse/index.php?threadid=47
>
>
> The Technology of PS3
> Eddie Edwards, April 2003
> Foreword


The only practical way to implement 4 Power PC's and 32 Cell Processors
each with 128 bit (4x32) functional units on a single chip in 2006 with
a 65 nm process and a 100W budget is to use virtual processors. This
would be consistent with future PowerPC processors and IBM's Blue Gene work.

The 4 PowerPC's could be a single IBM Power6 core running 4 threads
at twice the frequency a Power5 would run at in the same process.
That would be 8 GHz in 65 nm.

The 32 PE's have a combined performance of 32 GFlops or 1 GFlop each
according to this presentation of Sony Entertainment's CEO here.
http://www.watch.impress.co.jp/game/docs/20020921/tgsf.htm

Have a look at this image:
http://www.watch.impress.co.jp/game/docs/20020921/tgsf15.jpg

This presentation uses large data centers to get at these 1 TeraFlop
and even 1 PetaFlop marketing numbers. This Sony presentation seems
to be a clarification after the 1 TeraFlop rumor stories: "PS3 will
be more than 100 times more powerful than a Pentium 4"

A single "Altivec" or "PS2" like SIMD unit with four 32 bit Floating
Point units and four 32 bit Integer units running also at 8 GHz in
65 nm could be used to implement 32 virtual PE's working from one 4 MB
local memory.

Each PE would run at an effective 250 MHz with 1 GFlop (as stated in
the presentation). Each PE would be able to fetch, decode and execute a
single SIMD instruction before loading the next one, thereby eliminating all
the branch prediction, out-of-order and load/store overhead of modern
processors. 80% of such a unit would be functional units (floating point
and integer) and 20% would be control logic. In modern OOO processors
it is more like the reverse.

The patent application revived the 1 TeraFlop rumors by saying that the
"preferred" performance of each PE would be 32 GFlops and 32 GIops. Sony's
own PS3 presentation however clearly says 1 GFlop per PE for the first
implementation. 1 GFlop per PE suggests that the PE's are implemented as
virtual PE's, possibly in the way described above.
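
The numbers behind this guess can be checked with a few lines of Python. The 8 GHz clock, the 32 virtual PE's and the four 32 bit floating point units are the figures above; the strict round-robin issue order is an assumption made only for this sketch.

```python
CLOCK_HZ = 8_000_000_000   # one shared SIMD unit at 8 GHz
VIRTUAL_PES = 32           # contexts sharing that unit
FP_LANES = 4               # four 32 bit floating point units


def issue_slots(cycles, contexts=VIRTUAL_PES):
    """Count the issue slots each context gets under round-robin issue."""
    slots = [0] * contexts
    for cycle in range(cycles):
        slots[cycle % contexts] += 1
    return slots


per_pe_hz = CLOCK_HZ // VIRTUAL_PES        # effective clock per virtual PE
per_pe_flops = per_pe_hz * FP_LANES        # one SIMD op = 4 flops
total_flops = per_pe_flops * VIRTUAL_PES   # all 32 virtual PE's together
```

That works out to 250 MHz and 1 GFlop per virtual PE, and 32 GFlops in total, with every context getting an equal share of the issue slots.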

Regards, Hans.
subsystem
2003-11-13 13:11:01 UTC
Permalink
In the patent, it was preferred that each *APU* have 32 GFLOPs performance.
Not each PE.

There would be 1 PU/CPU per PE, and 8 APUs - which would give *256 GFLOPs*
per PE.

Then 4 PEs (256 GFLOPs each) are put onto a single chip to form a BroadBand
Engine. That is where the 1 TFLOPs came from.

The BroadBand Engine would be the main CPU of PS3.



Now in this new presentation, KK shows 1 PE having performance of 1 GFLOP.
This does not make sense at all. That's less than the Emotion Engine of
PS2 which has 6.2 GFLOPs performance.

The slides are 2-3 years old, that is why. They are the SAME slides that IBM
showed for the Blue Gene project, IIRC.

If one PE (Processor Element) can only achieve 1 GFLOPs, then
Sony-IBM-Toshiba are going BACKWARDS not FORWARDS in performance.

256 GFLOPs in patent down to 1 GFLOPs makes no sense whatsoever.
Alexis Cousein
2003-11-13 14:06:09 UTC
Permalink
subsystem wrote:

> In the patent, it was prefered that each *APU* have 32 GFLOPs performance.
> Not each PE.


Yes -- *preferred*. I'd also prefer to have 1 Tflops on my desktop.

Also, it's a *patent* application, so even the "preferred" numbers are
probably not talking about the first implementation.


--
Alexis Cousein Senior Systems Engineer
***@sgi.com SGI/Silicon Graphics Brussels
<opinions expressed here are my own, not those of my employer>
<SGI sells IRIX *and* Linux systems -- I'm not flaming either>
Martin Høyer Kristiansen
2003-11-13 14:10:38 UTC
Permalink
Alexis Cousein wrote:

> subsystem wrote:
>
>> In the patent, it was prefered that each *APU* have 32 GFLOPs
>> performance.
>> Not each PE.
>
>
>
> Yes -- *preferred*. I'd also prefer to have 1 Tflops on my desktop.
>
> Also, it's a *patent* application, so even the "preferred" numbers are
> probably not talking about the first implementation.

Yeah I also think that running 32 SIMD processors @ 4GHz within a 100 W
envelope is kind of optimistic for the 2006 timeframe.

It will probably launch at 1 to 2 GHz which will still result in TEH
CRAZY peak floating point performance.

Cheers
Martin
lmurata
2003-11-13 19:57:45 UTC
Permalink
"subsystem" <***@nobody.net> wrote in message news:<FpLsb.37$***@newssrv26.news.prodigy.com>...
> In the patent, it was prefered that each *APU* have 32 GFLOPs performance.
> Not each PE.
>
> There would be 1 PU/CPU per PE, and 8 APUs - which would give *256 GFLOPs*
> per PE.
>
> Then 4 PEs (256 GFLOPs each) are put onto a single chip to form a BroadBand
> Engine. that is where the 1 TFLOPs came from.
>
> The BroadBand Engine would be the main CPU of PS3.
>
>
>
> Now in this new presentation, KK shows 1 PE having performance of 1 GFLOP.
> this does not make sense at all. that's less than the Emotion Engine of
> PS2 which has 6.2 GFLOPs performance.
>
> The slides are 2-3 years old, that is why. they are the SAME slides that IBM
> showed for the Blue Gene project, IIRC.


I wouldn't trust what this guy says. I've seen the original IBM image,
and it IS different. Subsystem, why don't you give a link to the
original IBM image, so that others can compare it with the slide?
Hans de Vries
2003-11-13 23:15:13 UTC
Permalink
***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...
> "subsystem" <***@nobody.net> wrote in message news:<FpLsb.37$***@newssrv26.news.prodigy.com>...
> > In the patent, it was prefered that each *APU* have 32 GFLOPs performance.
> > Not each PE.
> >
> > There would be 1 PU/CPU per PE, and 8 APUs - which would give *256 GFLOPs*
> > per PE.
> >
> > Then 4 PEs (256 GFLOPs each) are put onto a single chip to form a BroadBand
> > Engine. that is where the 1 TFLOPs came from.
> >
> > The BroadBand Engine would be the main CPU of PS3.
> >
> >
> >
> > Now in this new presentation, KK shows 1 PE having performance of 1 GFLOP.
> > this does not make sense at all. that's less than the Emotion Engine of
> > PS2 which has 6.2 GFLOPs performance.
> >
> > The slides are 2-3 years old, that is why. they are the SAME slides that IBM
> > showed for the Blue Gene project, IIRC.
>
>
> I wouldn't trust what this guy says. I've seen the original IBM image,
> and it IS different. Subsystem, why don't you give a link to the
> original IBM image, so that others can compare it with the slide?
>

Actually, it IS the same slide. So Sony Cell == Blue Gene ?!?

Sony Cell presentation, September 20, 2002, from Sony Entertainment's CEO:

http://www.watch.impress.co.jp/game/docs/20020921/tgsf15.jpg
http://www.watch.impress.co.jp/game/docs/20020921/tgsf.htm

Various IBM Blue Gene presentations:

www-8.ibm.com/solutions/au/downloads/bluegene_03.pdf
www.ibm.com/jp/software/s390/conf2001/data/zentai.pdf
www.ibm.com/de/entwicklung/academia/archive_2000/october26_snir.pdf

Amazing....

Regards, Hans.
lmurata
2003-11-14 06:33:07 UTC
Permalink
***@chip-architect.com (Hans de Vries) wrote in message news:<***@posting.google.com>...
> ***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...
> > "subsystem" <***@nobody.net> wrote in message news:<FpLsb.37$***@newssrv26.news.prodigy.com>...
> > > In the patent, it was prefered that each *APU* have 32 GFLOPs performance.
> > > Not each PE.
> > >
> > > There would be 1 PU/CPU per PE, and 8 APUs - which would give *256 GFLOPs*
> > > per PE.
> > >
> > > Then 4 PEs (256 GFLOPs each) are put onto a single chip to form a BroadBand
> > > Engine. that is where the 1 TFLOPs came from.
> > >
> > > The BroadBand Engine would be the main CPU of PS3.
> > >
> > >
> > >
> > > Now in this new presentation, KK shows 1 PE having performance of 1 GFLOP.
> > > this does not make sense at all. that's less than the Emotion Engine of
> > > PS2 which has 6.2 GFLOPs performance.
> > >
> > > The slides are 2-3 years old, that is why. they are the SAME slides that IBM
> > > showed for the Blue Gene project, IIRC.
> >
> >
> > I wouldn't trust what this guy says. I've seen the original IBM image,
> > and it IS different. Subsystem, why don't you give a link to the
> > original IBM image, so that others can compare it with the slide?
> >
>
> Actually, It IS the same slide, So Sony Cell == Blue Gene ?!?

What I meant was that the slide was changed to refer to cell and not
blue gene. Most think that Kutaragi simply made a mistake and forgot
to change the numbers. I usually ask why anyone would change the
text on the slide but forget to change the numbers as well.
Hans de Vries
2003-11-14 10:21:19 UTC
Permalink
***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...

> >
> > Actually, It IS the same slide, So Sony Cell == Blue Gene ?!?
>
> What I meant was that the slide was changed to refer to cell and not
> blue gene. Most think that Kutaragi simply made a mistake and forgot
> to change the numbers. I usually point out why one would change the
> text on the slide but forget to change the numbers also.
>

Blue Gene has changed a lot in the meantime... but it still doesn't
look like a PS2 or PS3.

http://sc-2002.org/paperpdfs/pap.pap207.pdf

2.8 or 5.6 GFlops (64 bit) per chip. (page 17)

Regards, Hans.
lmurata
2003-11-14 14:32:46 UTC
Permalink
***@chip-architect.com (Hans de Vries) wrote in message news:<***@posting.google.com>...
> ***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...
>
> > >
> > > Actually, It IS the same slide, So Sony Cell == Blue Gene ?!?
> >
> > What I meant was that the slide was changed to refer to cell and not
> > blue gene. Most think that Kutaragi simply made a mistake and forgot
> > to change the numbers. I usually point out why one would change the
> > text on the slide but forget to change the numbers also.
> >
>
> Blue Gene has changed a lot in the mean time... but it still doesn't
> look like a PS2 or PS3.


Could Microsoft license Cell technology from IBM for Xbox2? Some have
suggested that Sony's lawyers would never allow it. Others have said
that IBM does not own the technology.
wogston
2003-11-14 23:53:47 UTC
Permalink
> Could Microsoft license Cell technology from IBM for Xbox2? Some have
> suggested that Sony's lawyers would never allow it. Others have said
> that IBM does not own the technology.

I didn't know lawyers made any executive decisions or had any say in
allowing or disallowing executive decisions. I thought they were just
an interface for leeching money from other companies and individuals.
Hans de Vries
2003-11-15 16:54:57 UTC
Permalink
***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...
> ***@chip-architect.com (Hans de Vries) wrote in message news:<***@posting.google.com>...
> > ***@hotmail.com (lmurata) wrote in message news:<***@posting.google.com>...
> >
> > > >
> > > > Actually, It IS the same slide, So Sony Cell == Blue Gene ?!?
> > >
> > > What I meant was that the slide was changed to refer to cell and not
> > > blue gene. Most think that Kutaragi simply made a mistake and forgot
> > > to change the numbers. I usually point out why one would change the
> > > text on the slide but forget to change the numbers also.
> > >
> >
> > Blue Gene has changed a lot in the mean time... but it still doesn't
> > look like a PS2 or PS3.
>
>

> Could Microsoft license Cell technology from IBM for Xbox2? Some have
> suggested that Sony's lawyers would never allow it. Others have said
> that IBM does not own the technology.
>

It seems so. It seems that Nintendo will use it as well.
See this press release here:

http://www.forbes.com/home_asia/newswire/2003/11/14/rtr1147955.html


" IBM vice president of technology and strategy Irving
Wladawsky-Berger said that the supercomputer used 1,000
microprocessors that are based on PowerPC microchip
technology. The PowerPC chip is currently used in Apple
Computer Inc. computers.

It is also the technology that will be the foundation
of the next generation of gaming consoles from Nintendo
Co. and Sony Corp., which IBM is working on, he said.

He said the chips were less expensive and consumed less
power than traditional microprocessors, making it possible
to pack the same amount of computing power into a smaller
space. Producing the chips in volume for gaming will help
offset the costs of building supercomputers, he said"

So indeed, Blue Gene/L == Cell now.

(maybe there will be Sony specific APU's although the fact that they
use the same presentation slides seems to suggest otherwise)

I didn't see any PS2/PS3 like 128 bit (4x32 bit) SIMD in Blue Gene/L.
Rather 2x64 with two independent 64 bit Floating Point units. However,
it is relatively simple to implement dual 32 bit on those units by
re-using much of the hardware (like the multiplier Wallace trees).
Ideal would be something which is also compatible with Apple's
Altivec. IBM realizes what mass-production can do, I guess.

Blue Gene/L: http://sc-2002.org/paperpdfs/pap.pap207.pdf

Blue Gene/L released right now:

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&safe=off&selm=bp2s7u%24k3m%241%40news.rchland.ibm.com

Just imagine what it could mean, with Microsoft now included, if
Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
Or even on Nintendo game consoles?

That would be quite an historical turning point.

Regards, Hans
Niels Jørgen Kruse
2003-11-15 22:52:25 UTC
Permalink
In article <***@posting.google.com>,
***@chip-architect.com (Hans de Vries) wrote:

> ***@hotmail.com (lmurata) wrote in message
> news:<***@posting.google.com>...
>> Could Microsoft license Cell technology from IBM for Xbox2? Some have
>> suggested that Sony's lawyers would never allow it. Others have said
>> that IBM does not own the technology.
>>
>
> It seems so, It seems that Nintendo will use it as well.
> See this press-release here:
>
> http://www.forbes.com/home_asia/newswire/2003/11/14/rtr1147955.html
>
>
> " IBM vice president of technology and strategy Irving
> Wladawsky-Berger said that the supercomputer used 1,000
> microprocessors that are based on PowerPC microchip
> technology. The PowerPC chip is currently used in Apple
> Computer Inc. computers.

Equating the 440 core with the POWER4 core means that he is talking in
*very* broad terms.

> It is also the technology that will be the foundation
> of the next generation of gaming consoles from Nintendo
> Co. and Sony Corp., which IBM is working on, he said.

This just means that some sort of PPC is in all of these.

> He said the chips were less expensive and consumed less
> power than traditional microprocessors, making it possible
> to pack the same amount of computing power into a smaller
> space. Producing the chips in volume for gaming will help
> offset the costs of building supercomputers, he said"

Using the same cores in multiple products reduces development costs. The
silicon is *not* the same.

> So indeed, Blue Gene/L == Cell now.
>
> (maybe there will be Sony specific APU's although the fact that they
> use the same presentation slides seems to suggest otherwise)
>
> I didn't see any PS2/PS3 like 128 bit (4x32) bit SIMD in Blue Gene/L.
> Rather 2x64 with two independent 64 bit Floating Point units. However,
> it is relatively simple to implement dual 32 bit on those units by
> re-using much of the hardware (like the multiplier Wallace trees).
> Ideally would be something which is also compatible with Apple's
> Altivec. IBM's realizes what mass-production can do I guess.
>
> Blue Gene/L: http://sc-2002.org/paperpdfs/pap.pap207.pdf
>
> Blue Gene/L released right now:
>
>
> http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&safe=off&selm=bp2s7u%24k3m%241%40news.rchland.ibm.com
>
> Just imagine what it could mean, with Microsoft now included, If
> Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
> or even on Nintendo game consoles?
>
> That would be quite an historical turning point.

You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
of SIMD, you can forget about emulating Xbox1 games.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Nate Edel
2003-11-16 03:10:14 UTC
Permalink
In comp.arch Niels Jørgen Kruse <***@get2net.dk> wrote:
> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> of SIMD, you can forget about emulating Xbox1 games.

How much can adding a cheap celeron cost? The hard part will be the video
system, but if it's pretty well virtualized with DirectX already and not
using NVidia-specific primitives, it should be able to use the rest of the
XBox2 processor core to do as well as the existing GPU.

--
Nate Edel http://www.nkedel.com/

"But Marge! I've never felt so accepted in my life. These people looked deep
into my soul and assigned me a number based on the order in which I joined."
Niels Jørgen Kruse
2003-11-16 15:27:36 UTC
Permalink
I artiklen <md6j81-***@mail.sfchat.org> , ***@sfchat.org (Nate
Edel) skrev:

> In comp.arch Niels Jørgen Kruse <***@get2net.dk> wrote:
>> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
>> of SIMD, you can forget about emulating Xbox1 games.
>
> How much can adding a cheap celeron cost? The hard part will be the video
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> system, but if it's pretty well virtualized with DirectX already and not
> using NVidia-specific primitives, it should be able to use the rest of the
> XBox2 processor core to do as well as the existing GPU.

Too much for a console, both in $ and Watts. It also has to be attached
somehow. Microsoft is probably going for a single chip solution.

My guess would be a POWER4 core (a shrink or 2 down the road) plus a Radeon
core plus various IO controllers. No Altivec (the Radeon can do the SIMD),
but possibly some extra instructions to help IA32 emulation and do sneaky
DRM. Shared onchip memory controller for GPU and CPU (you don't have the
pinnage for more and a console can't afford more DRAM than can be reasonably
hooked up to a GPU anyway).

It could also be a POWER5 core if they don't mind the 24% extra area
(Altivec adds 35-40% to a POWER4 core).

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Keith R. Williams
2003-11-16 19:44:02 UTC
Permalink
In article <md6j81-***@mail.sfchat.org>, ***@sfchat.org
says...
> In comp.arch Niels Jørgen Kruse <***@get2net.dk> wrote:
> > You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> > of SIMD, you can forget about emulating Xbox1 games.
>
> How much can adding a cheap celeron cost?

A lot when you don't already own one.

> The hard part will be the video
> system, but if it's pretty well virtualized with DirectX already and not
> using NVidia-specific primitives, it should be able to use the rest of the
> XBox2 processor core to do as well as the existing GPU.

--
Keith
Keith R. Williams
2003-11-16 19:43:11 UTC
Permalink
In article <xdytb.16762$***@news.get2net.dk>,
***@get2net.dk says...
> I artiklen <***@posting.google.com> ,
> ***@chip-architect.com (Hans de Vries) skrev:

<snipping much>

> > Just imagine what it could mean, with Microsoft now included, If
> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
> > or even on Nintendo game consoles?
> >
> > That would be quite an historical turning point.
>
> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> of SIMD, you can forget about emulating Xbox1 games.

Why? Emulation of x86 on PPC is fairly well known to work.

--
Keith
Niels Jørgen Kruse
2003-11-16 22:07:40 UTC
Permalink
I artiklen <***@enews.newsguy.com> , Keith R.
Williams <***@attglobal.net> skrev:

> In article <xdytb.16762$***@news.get2net.dk>,
> ***@get2net.dk says...
>> I artiklen <***@posting.google.com> ,
>> ***@chip-architect.com (Hans de Vries) skrev:
>
> <snipping much>
>
>> > Just imagine what it could mean, with Microsoft now included, If
>> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
>> > or even on Nintendo game consoles?
>> >
>> > That would be quite an historical turning point.
>>
>> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
>> of SIMD, you can forget about emulating Xbox1 games.
>
> Why? Emulation of x86 on PPC is fairly well known to work.

....s...l...o...w...l...y....

Emulating a ~700 MHz PIII with a 440, so that the difference wouldn't be too
painfully obvious, would be very difficult. Having copious SIMD resources is
no help.

Some say that backwards compatibility is unimportant for consoles. On the
other hand, there are limits to how many consoles it is practical to have
hooked up at once.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Rupert Pigott
2003-11-16 23:05:15 UTC
Permalink
"Niels Jørgen Kruse" <***@get2net.dk> wrote in message
news:pzStb.5917$***@news.get2net.dk...
> I artiklen <***@enews.newsguy.com> , Keith R.
> Williams <***@attglobal.net> skrev:
>
> > In article <xdytb.16762$***@news.get2net.dk>,
> > ***@get2net.dk says...
> >> I artiklen <***@posting.google.com> ,
> >> ***@chip-architect.com (Hans de Vries) skrev:
> >
> > <snipping much>
> >
> >> > Just imagine what it could mean, with Microsoft now included, If
> >> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
> >> > or even on Nintendo game consoles?
> >> >
> >> > That would be quite an historical turning point.
> >>
> >> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> >> of SIMD, you can forget about emulating Xbox1 games.
> >
> > Why? Emulation of x86 on PPC is fairly well known to work.
>
> ....s...l...o...w...l...y....
>
> Emulating a ~700 MHz PIII with a 440, so that the difference wouldn't be too
> painfully obvious, would be very difficult. Having copious SIMD resources is
> no help.

Very possible if most of the time that PIII is executing
OS / library code... FX!32 all over again. Old hat.

Cheers,
Rupert
Keith R. Williams
2003-11-17 02:41:03 UTC
Permalink
In article <pzStb.5917$***@news.get2net.dk>,
***@get2net.dk says...
> I artiklen <***@enews.newsguy.com> , Keith R.
> Williams <***@attglobal.net> skrev:
>
> > In article <xdytb.16762$***@news.get2net.dk>,
> > ***@get2net.dk says...
> >> I artiklen <***@posting.google.com> ,
> >> ***@chip-architect.com (Hans de Vries) skrev:
> >
> > <snipping much>
> >
> >> > Just imagine what it could mean, with Microsoft now included, If
> >> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
> >> > or even on Nintendo game consoles?
> >> >
> >> > That would be quite an historical turning point.
> >>
> >> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> >> of SIMD, you can forget about emulating Xbox1 games.
> >
> > Why? Emulation of x86 on PPC is fairly well known to work.
>
> ....s...l...o...w...l...y....

Now wait just a minute. You stated that:

Equating the 440 core with the POWER4 core means that he is
talking in *very* broad terms.

Are you stating that you're talking about PPC here? Or are you
shifting to the specific core?

> Emulating a ~700 MHz PIII with a 440, so that the difference wouldn't be too
> painfully obvious, would be very difficult. Having copious SIMD resources is
> no help.

Since (at least the mill I've been listening to) the 440 is
DEADBEEF, I don't think that's what is being proposed.

> Some say that backwards compatibility is unimportant for consoles. On the
> other hand, there are limits to how many consoles it is practical to have
> hooked up at once.

I don't see this as the problem. I haven't kept my P5s, but not
because I can't hook them up at once. OTOH, my son has his PC,
GC, PS2, and X-Box all hooked up.
Niels Jørgen Kruse
2003-11-17 20:37:01 UTC
Permalink
I artiklen <***@enews.newsguy.com> , Keith R.
Williams <***@attglobal.net> skrev:

> In article <pzStb.5917$***@news.get2net.dk>,
> ***@get2net.dk says...
>> I artiklen <***@enews.newsguy.com> , Keith R.
>> Williams <***@attglobal.net> skrev:
>>
>> > In article <xdytb.16762$***@news.get2net.dk>,
>> > ***@get2net.dk says...
>> >> I artiklen <***@posting.google.com> ,
>> >> ***@chip-architect.com (Hans de Vries) skrev:
>> >
>> > <snipping much>
>> >
>> >> > Just imagine what it could mean, with Microsoft now included, If
>> >> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
>> >> > or even on Nintendo game consoles?
>> >> >
>> >> > That would be quite an historical turning point.
>> >>
>> >> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
>> >> of SIMD, you can forget about emulating Xbox1 games.
>> >
>> > Why? Emulation of x86 on PPC is fairly well known to work.
>>
>> ....s...l...o...w...l...y....
>
> Now wait just a minute. You stated that:
>
> Equating the 440 core with the POWER4 core means that he is
> talking in *very* broad terms.
>
> Are you stating that you're talking about PPC here? Or are you
> shifting to the specific core?

My statement about emulation was conditional on a 440 core scenario.

Hans de Vries was seeing the same core everywhere, and BlueGene uses a 440.

>> Emulating a ~700 MHz PIII with a 440, so that the difference wouldn't be too
>> painfully obvious, would be very difficult. Having copious SIMD resources is
>> no help.
>
> Since (at least the mill I've been listening to) the 440 is
> DEADBEEF, I don't think that's what is being proposed.

Uhm, unallocated memory? Are you saying in a roundabout way that speculation
leading to a 440 core in the Xbox2 took a wrong turn somewhere?

I agree that Microsoft is likely to go for a desktop-like programming
environment, where they can reuse technology from their home turf.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Derek Gladding
2003-11-21 08:10:31 UTC
Permalink
Keith R. Williams wrote:

[snip]

> Why? Emulation of x86 on PPC is fairly well known to work.

It's possible to emulate the functionality OK, but very difficult to
emulate the timing. For game code, this can be a big problem.

- Derek
Hans de Vries
2003-11-17 04:16:47 UTC
Permalink
"Niels Jørgen Kruse" <***@get2net.dk> wrote in message news:<xdytb.16762$***@news.get2net.dk>...
> I artiklen <***@posting.google.com> ,
> ***@chip-architect.com (Hans de Vries) skrev:
>
> > ***@hotmail.com (lmurata) wrote in message
> > news:<***@posting.google.com>...
> >> Could Microsoft license Cell technology from IBM for Xbox2? Some have
> >> suggested that Sony's lawyers would never allow it. Others have said
> >> that IBM does not own the technology.
> >>
> >
> > It seems so, It seems that Nintendo will use it as well.
> > See this press-release here:
> >
> > http://www.forbes.com/home_asia/newswire/2003/11/14/rtr1147955.html
> >
> >
> > " IBM vice president of technology and strategy Irving
> > Wladawsky-Berger said that the supercomputer used 1,000
> > microprocessors that are based on PowerPC microchip
> > technology. The PowerPC chip is currently used in Apple
> > Computer Inc. computers.
>
> Equating the 440 core with the POWER4 core means that he is talking in
> *very* broad terms.
>
> > It is also the technology that will be the foundation
> > of the next generation of gaming consoles from Nintendo
> > Co. and Sony Corp., which IBM is working on, he said.
>
> This just means that some sort of PPC are in all of these.
>
> > He said the chips were less expensive and consumed less
> > power than traditional microprocessors, making it possible
> > to pack the same amount of computing power into a smaller
> > space. Producing the chips in volume for gaming will help
> > offset the costs of building supercomputers, he said"
>
> Using the same cores in multiple products reduce development costs. The
> silicon is *not* the same.

How many different PowerPCs can there be? How many resources can you
waste... One for Apple, one for Sony, one for Nintendo, one for Microsoft,
one for their own servers...

Which one of the game chips should carry the overhead of the super-
computer interconnect. Or will buyers have a choice between Nintendo,
Play-station and Xbox flavored supercomputers? :^)

The use of a modified (32 bit) 440 core seems more like a prototyping
vehicle. The original BlueGene design used simple but multithreaded
processors (8 threads per processor) and 32 processors per chip.

One can only guess what the final product will be in 2006. I do not think
that either one of the two options above could successfully get Microsoft,
Sony and Nintendo to bet their game-console business on it.

I do think that a Power6 core could do just that. It should be less than
half the size of today's G5 based on the Power4 core when implemented in
65 nm technology. The Power6 will have a much higher clock speed (by
splitting the pipeline stages in two?) and is likely to support more (4?) threads.


>
> > So indeed, Blue Gene/L == Cell now.
> >
> > (maybe there will be Sony specific APU's although the fact that they
> > use the same presentation slides seems to suggest otherwise)
> >
> > I didn't see any PS2/PS3 like 128 bit (4x32) bit SIMD in Blue Gene/L.
> > Rather 2x64 with two independent 64 bit Floating Point units. However,
> > it is relatively simple to implement dual 32 bit on those units by
> > re-using much of the hardware (like the multiplier Wallace trees).
> > Ideally would be something which is also compatible with Apple's
> > Altivec. IBM's realizes what mass-production can do I guess.
> >
> > Blue Gene/L: http://sc-2002.org/paperpdfs/pap.pap207.pdf
> >
> > Blue Gene/L released right now:
> >
> >
> > http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&safe=off&selm=bp2s7u%24k3m%241%40news.rchland.ibm.com
> >
> > Just imagine what it could mean, with Microsoft now included, If
> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
> > or even on Nintendo game consoles?
> >
> > That would be quite an historical turning point.
>
> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
> of SIMD, you can forget about emulating Xbox1 games.

Games spend 90%+ of their time in DirectX-like libraries. The libraries don't
need emulation and can run in native PPC code.

Regards, Hans
Niels Jørgen Kruse
2003-11-17 20:59:17 UTC
Permalink
I artiklen <***@posting.google.com> ,
***@chip-architect.com (Hans de Vries) skrev:

> "Niels Jørgen Kruse" <***@get2net.dk> wrote in message
> news:<xdytb.16762$***@news.get2net.dk>...
>> I artiklen <***@posting.google.com> ,
>> ***@chip-architect.com (Hans de Vries) skrev:
>>
>> > ***@hotmail.com (lmurata) wrote in message
>> > news:<***@posting.google.com>...
>> >> Could Microsoft license Cell technology from IBM for Xbox2? Some have
>> >> suggested that Sony's lawyers would never allow it. Others have said
>> >> that IBM does not own the technology.
>> >>
>> >
>> > It seems so, It seems that Nintendo will use it as well.
>> > See this press-release here:
>> >
>> > http://www.forbes.com/home_asia/newswire/2003/11/14/rtr1147955.html
>> >
>> >
>> > " IBM vice president of technology and strategy Irving
>> > Wladawsky-Berger said that the supercomputer used 1,000
>> > microprocessors that are based on PowerPC microchip
>> > technology. The PowerPC chip is currently used in Apple
>> > Computer Inc. computers.
>>
>> Equating the 440 core with the POWER4 core means that he is talking in
>> *very* broad terms.
>>
>> > It is also the technology that will be the foundation
>> > of the next generation of gaming consoles from Nintendo
>> > Co. and Sony Corp., which IBM is working on, he said.
>>
>> This just means that some sort of PPC are in all of these.
>>
>> > He said the chips were less expensive and consumed less
>> > power than traditional microprocessors, making it possible
>> > to pack the same amount of computing power into a smaller
>> > space. Producing the chips in volume for gaming will help
>> > offset the costs of building supercomputers, he said"
>>
>> Using the same cores in multiple products reduce development costs. The
>> silicon is *not* the same.
>
> How many different PowerPCs can there be? How many resources can you
> waste... One for Apple, one for Sony, one for Nintendo, one for Microsoft,
> one for their own servers...

Why should it be so expensive to modify a working core, particularly if it
is designed to be extensible? IBM doesn't handtweak every circuit.

> Which one of the game chips should carry the overhead of the super-
> computer interconnect. Or will buyers have a choice between Nintendo,
> Play-station and Xbox flavored supercomputers? :^)

Do you think IBM is losing money on their e-servers?

> The use of a modified (32 bit) 440 core seems more like a prototyping
> vehicle. The original BlueGene design used simple but multithreaded
> processors (8 threads per processor) and 32 processors per chip.

Did you read the description of the final item in
<http://www.llnl.gov/asci/platforms/bluegenel/pdf/software.pdf>?

> One can only guess what the final product will be in 2006. I do not think
> that either one of the two options above could successfully get Microsoft,
> Sony and Nintendo to bet their game-console business on it.

You still think there is a connection beyond the tenuous one of basic ISA.

> I do think that a Power6 core could do just that. It should be less than
> half the size of today's G5 based on the Power4 core when implemented in
> 65 nm technology. The Power6 will have a much higher clock speed (by
> splitting the pipeline stages in two?) and is likely to support more (4?)
> threads.

Do you have any information on the POWER6 core? The dramatic clock increase
projected in that timeframe, I suspect has to do with IBM's HOT technology
which will benefit static (synthesized) logic much more than dynamic logic.

>>
>> > So indeed, Blue Gene/L == Cell now.
>> >
>> > (maybe there will be Sony specific APU's although the fact that they
>> > use the same presentation slides seems to suggest otherwise)
>> >
>> > I didn't see any PS2/PS3 like 128 bit (4x32) bit SIMD in Blue Gene/L.
>> > Rather 2x64 with two independent 64 bit Floating Point units. However,
>> > it is relatively simple to implement dual 32 bit on those units by
>> > re-using much of the hardware (like the multiplier Wallace trees).
>> > Ideally would be something which is also compatible with Apple's
>> > Altivec. IBM's realizes what mass-production can do I guess.
>> >
>> > Blue Gene/L: http://sc-2002.org/paperpdfs/pap.pap207.pdf
>> >
>> > Blue Gene/L released right now:
>> >
>> >
>> > http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&safe=off&selm=bp2s7u%24k3m%241%40news.rchland.ibm.com
>> >
>> > Just imagine what it could mean, with Microsoft now included, If
>> > Windows XP 64 runs not only on Xbox, but also on Playstation 3 !?
>> > or even on Nintendo game consoles?
>> >
>> > That would be quite an historical turning point.
>>
>> You got carried away. Certainly, if the Xbox2 chip is a 440 core plus lots
>> of SIMD, you can forget about emulating Xbox1 games.
>
> Games spend 90%+ of their time in directX-like libraries. The libraries don't
> need emulation and can run in native PPC code.

If you happen to run into a massive lump of 10% code, you will notice. I
don't know how important emulation is, but the current market leader is
backwards compatible. (It is not clear if the PS3 will be.)
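
To put rough numbers on the 90%/10% split: a minimal sketch, assuming run time divides cleanly into a share spent in natively running libraries and a share spent in emulated game code, and that emulated code runs slower by some fixed factor. The function name and the slowdown factors are illustrative assumptions, not measurements of any real emulator.

```python
def relative_runtime(native_fraction: float, emulation_slowdown: float) -> float:
    """Run time relative to the original machine (1.0 = same speed).

    native_fraction: share of the original run time spent in libraries that
        are assumed to run natively at the original speed.
    emulation_slowdown: assumed factor by which the remaining game code
        runs slower under emulation.
    """
    if not 0.0 <= native_fraction <= 1.0:
        raise ValueError("native_fraction must be between 0 and 1")
    if emulation_slowdown <= 0:
        raise ValueError("emulation_slowdown must be positive")
    emulated_fraction = 1.0 - native_fraction
    return native_fraction + emulated_fraction * emulation_slowdown


if __name__ == "__main__":
    # Hypothetical slowdown factors for the emulated 10%.
    for slowdown in (2, 5, 10):
        print(slowdown, round(relative_runtime(0.9, slowdown), 2))
```

Under these assumptions a 10x slowdown on the emulated tenth nearly doubles total run time, and the whole penalty lands wherever the emulated code happens to run.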

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
Rupert Pigott
2003-11-17 22:09:29 UTC
Permalink
"Niels Jørgen Kruse" <***@get2net.dk> wrote in message
news:lFaub.8262$***@news.get2net.dk...
> I artiklen <***@posting.google.com> ,
> ***@chip-architect.com (Hans de Vries) skrev:

[SNIP - hopefully I got this snip right :/ ]

> > Games spend 90%+ of their time in directX-like libraries. The libraries don't
> > need emulation and can run in native PPC code.
>
> If you happen to run into a massive lump of 10% code, you will notice. I
> don't know how important emulation is, but the current market leader is
> backwards compatible. (It is not clear if the PS3 will be.)

Storage & memory are cheap(ish)... Surely you could translate the
code statically the first time the game is played, stash it
away in a persistent cache and then just rely on the translation &
native libraries to yield the required performance.
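
The translate-once idea can be sketched roughly as follows. This is only an illustration under stated assumptions: `translate_block` is a hypothetical stand-in for a real binary translator, the on-disk layout is made up, blocks are keyed by a SHA-256 hash of their original bytes, and translation happens lazily per block rather than for the whole game image up front.

```python
import hashlib
import tempfile
from pathlib import Path


def translate_block(x86_code: bytes) -> bytes:
    """Hypothetical stand-in for an expensive x86-to-native translator."""
    return b"NATIVE:" + x86_code[::-1]


class TranslationCache:
    """Persist translated blocks on disk so each is translated only once."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.translations_done = 0  # counts cache misses, for illustration

    def get(self, x86_code: bytes) -> bytes:
        key = hashlib.sha256(x86_code).hexdigest()
        path = self.cache_dir / key
        if path.exists():
            return path.read_bytes()  # hit: reuse the stored translation
        native = translate_block(x86_code)
        path.write_bytes(native)      # miss: translate and persist
        self.translations_done += 1
        return native


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        block = b"\x55\x89\xe5"  # arbitrary example bytes
        first_run = TranslationCache(Path(tmp))
        first_run.get(block)
        second_run = TranslationCache(Path(tmp))  # simulates a later session
        second_run.get(block)
        print(first_run.translations_done, second_run.translations_done)
```

The second cache object stands in for a later play session: it finds the stored translation and does no translation work of its own.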

Cheers,
Rupert
Hans de Vries
2003-11-18 04:32:17 UTC
Permalink
"Niels Jørgen Kruse" <***@get2net.dk> wrote in message news:<lFaub.8262$***@news.get2net.dk>...
> I artiklen <***@posting.google.com> ,
> ***@chip-architect.com (Hans de Vries) skrev:
>
> > How many different PowerPCs can there be? How many resources can you
> > waste... One for Apple, one for Sony, one for Nintendo, one for Microsoft,
> > one for their own servers...
>
> Why should it be so expensive to modify a working core, particularly if it
> is designed to be extensible? IBM doesn't handtweak every circuit.
>

The reason to use a single die for a range of products is almost
always logistics. So in the next three months your customers may need:

Sony 5.5 million processors
Microsoft 2.5 million processors
Nintendo 1.5 million processors
Supercomputers 0.5 million processors

or

Microsoft 4.5 million processors
Nintendo 3.5 million processors
Sony 3.0 million processors
Supercomputers 2.5 million processors

But how do you know? Worse, you had to start production months ago.
A single die that you can use for every product, even if the products
are not exactly the same, allows you to react faster to the needs of
your customers. It keeps you from being left with inventory you have
to dump. Less inventory means better cash flow, et cetera.
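
The two demand scenarios above can be compared with a small sketch. It assumes, purely for illustration, that production was sized to the first scenario and the second one is what actually happens; the function names are hypothetical and quantities are in units of 100,000 processors to keep the arithmetic in integers.

```python
# Demand in units of 100,000 processors, taken from the two scenarios above.
FORECAST = {"Sony": 55, "Microsoft": 25, "Nintendo": 15, "Supercomputers": 5}
ACTUAL = {"Sony": 30, "Microsoft": 45, "Nintendo": 35, "Supercomputers": 25}


def dedicated_dies(built: dict, demand: dict) -> tuple:
    """One die per customer: surplus of one die cannot cover another's shortfall.

    Returns (unsold inventory, unmet demand).
    """
    unsold = sum(max(built[c] - demand[c], 0) for c in built)
    shortfall = sum(max(demand[c] - built[c], 0) for c in built)
    return unsold, shortfall


def shared_die(built: dict, demand: dict) -> tuple:
    """One die for every customer: only the totals matter.

    Returns (unsold inventory, unmet demand).
    """
    total_built = sum(built.values())
    total_demand = sum(demand.values())
    return max(total_built - total_demand, 0), max(total_demand - total_built, 0)


if __name__ == "__main__":
    print("dedicated:", dedicated_dies(FORECAST, ACTUAL))
    print("shared:   ", shared_die(FORECAST, ACTUAL))
```

With dedicated dies the sketch leaves 2.5 million unsold parts and 6.0 million in unmet demand; with a shared die nothing is left over and the shortfall drops to 3.5 million, which is just the gap between the two totals.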

> > Which one of the game chips should carry the overhead of the super-
> > computer interconnect. Or will buyers have a choice between Nintendo,
> > Play-station and Xbox flavored supercomputers? :^)
>
> Do you think IBM is losing money on their e-servers?
>

I don't know any exact numbers but did see people talking about IBM
Microelectronics losing something like $1 billion last year. More
than enough reason to pay attention to the economics...

> > The use of a modified (32 bit) 440 core seems more like a prototyping
> > vehicle. The original BlueGene design used simple but multithreaded
> > processors (8 threads per processor) and 32 processors per chip.
>
> Did you read the description of the final item in
> <http://www.llnl.gov/asci/platforms/bluegenel/pdf/software.pdf>?
>

This one describes the same hardware:
http://sc-2002.org/paperpdfs/pap.pap207.pdf

> > One can only guess what the final product will be in 2006. I do not think
> > that either one of the two options above could successfully get Microsoft,
> > Sony and Nintendo to bet their game-console business on it.
>
> You still think there is a connection beyond the tenuous one of basic ISA.
>

The connection would rather be in the ISA than in the underlying
micro-architecture. So Blue-Gene/L does introduce something 128 bit
SSE2-like to the PowerPC's ISA. A single 32 bit instruction word
executes either a single 64 bit FP op or a dual 2x64 bit SIMD op.

Most x86 SSE2 implementations get the second 64 bit FP for "free".
A highly pipelined 64 bit FP unit generally has a lot of unused
timeslots that can be used to handle the 2nd 64 bit part. That's
actually what makes it tempting to add these strange 2x64 bit
instructions to the ISA. The 440 based version demonstrates that
it's a prototype ISA implementation by actually adding a real
second 64 bit FP unit...

The final game computer version may have a single, more deeply pipelined
64 bit FP unit which can handle the 4x32 stuff too by re-using parts of
the 64 bit circuits for the dual 32 bit FP operations (think Wallace
trees).

>
> > I do think that a Power6 core could do just that. It should be less than
> > half the size of today's G5 based on the Power4 core when implemented in
> > 65 nm technology. The Power6 will have a much higher clock speed (by
> > splitting the pipeline stages in two?) and is likely to support more (4?)
> > threads.
>
> Do you have any information on the POWER6 core? The dramatic clock increase
> projected in that timeframe, I suspect has to do with IBM's HOT technology
> which will benefit static (synthesized) logic much more than dynamic logic.
>

I guess everybody is using "ping-pong dynamic logic" now:
two dynamic circuits for every circuit. One evaluates while the
other is recharged, and vice versa in the next cycle.

See for instance Paul DeMone's articles: "The Stuff Dreams Are Made
Of". Dynamic logic is handled at the end of the second article.

http://www.realworldtech.com/page.cfm?ArticleID=RWT050802020022
http://www.realworldtech.com/page.cfm?ArticleID=RWT090402005224

These circuits can be synthesized like static logic.

> >
> > Games spend 90%+ of their time in directX-like libraries. The libraries don't
> > need emulation and can run in native PPC code.
>
> If you happen to run into a massive lump of 10% code, you will notice. I
> don't know how important emulation is, but the current market leader is
> backwards compatible. (It is not clear if the PS3 will be.)

It's one of the reasons why it makes sense to use an up to date
high performance core rather than a modified embedded 440.

Regards, Hans.
Iain McClatchie
2003-11-18 22:23:17 UTC
Permalink
Hans> I guess everybody is using "ping-pong dynamic logic" now

Was that overheard information or are there actually references?

I would be kind of surprised to see dynamic logic circuits used
in an arrangement where the inputs of two gates running on
alternate clock cycles were tied together, and their outputs run
into a static NAND. Is that what you were describing?

I _can_ imagine unrolling a tight pipeline by a factor of two,
in order to cut the clock frequency by a factor of two and thus
give the gates time to recharge. I did that on a DES loop in
1999.
Niels Jørgen Kruse
2003-11-19 20:22:14 UTC
Permalink
I artiklen <***@posting.google.com> ,
***@chip-architect.com (Hans de Vries) skrev:

> "Niels Jørgen Kruse" <***@get2net.dk> wrote in message
> news:<lFaub.8262$***@news.get2net.dk>...
>> I artiklen <***@posting.google.com> ,
>> ***@chip-architect.com (Hans de Vries) skrev:
>>
>> > How many different PowerPCs can there be? How many resources can you
>> > waste... One for Apple, one for Sony, one for Nintendo, one for Microsoft,
>> > one for their own servers...
>>
>> Why should it be so expensive to modify a working core, particularly if it
>> is designed to be extensible? IBM doesn't handtweak every circuit.
>>
>
> The reason to use a single die for a range of products is almost
> always logistics. So in the next three months your customers may need:
>
> Sony 5.5 million processors
> Microsoft 2.5 million processors
> Nintendo 1.5 million processors
> Supercomputers 0.5 million processors
>
> or
>
> Microsoft 4.5 million processors
> Nintendo 3.5 million processors
> Sony 3.0 million processors
> Supercomputers 2.5 million processors
>
> But how do you know?

Because they tell you? Do you think IBM speculatively manufactures custom
silicon without the intended recipient being committed to buying?

Consoles are manufactured unchanged for years and development is paid up front,
so I don't see why a bit of inventory of custom chips is a problem.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark
kevin getting
2003-11-19 23:21:50 UTC
Permalink
On Wed, 19 Nov 2003, Niels Jørgen Kruse wrote:

> I artiklen <***@posting.google.com> ,
> ***@chip-architect.com (Hans de Vries) skrev:
>
> > The reason to use a single die for a range of products is almost
> > always logistics. So in the next three months your customers may need:
> >
> > Sony 5.5 million processors
> > Microsoft 2.5 million processors
> > Nintendo 1.5 million processors
> > Supercomputers 0.5 million processors
> >
> > or
> >
> > Microsoft 4.5 million processors
> > Nintendo 3.5 million processors
> > Sony 3.0 million processors
> > Supercomputers 2.5 million processors
> >
> > But how do you know?
>
> Because they tell you? Do you think IBM speculatively manufactures custom
> silicon without the intended recipient being committed to buying?
>
> Consoles are manufactured unchanged for years and development paid up front,
> so I don't see why a bit of inventory of custom chips is a problem.

Console hardware can change over its life time. Take a look at the
original PSX and compare it to the PSone. They play the same games and
the hardware specs are the same but what's under the hood is different.
Sony announced a while back that they are aiming to merge the Emotion
Engine CPU and the Graphics Synthesiser GPU in the PS2 into one huge
custom chip. I'm not sure if they are actually shipping the EE/GS chip,
much less using it in PS2s currently.
Will R
2003-11-29 06:11:43 UTC
Permalink
<< If you happen to run into a massive lump of 10% code, you will notice. I
don't know how important emulation is, but the current market leader is
backwards compatible. (It is not clear if the PS3 will be.)

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark >>

I just thought of something. What if MS made the X Box2 backwards compatible
to the PS2? The current X-Box is fast enough to mostly emulate N64 games, so
if they were going to write an emulator for XB1, why stop there?

Would the fact that the playstations use MIPS processors instead of X86 make
emulation much more difficult? (More registers, etc.) Is the PS2 architecture
just so whacked that it would be a serious PITA to try to emulate it on
anything?
Chris Morgan
2003-12-05 18:48:32 UTC
Permalink
***@aol.com (Will R) writes:

> Would the fact that the playstations use MIPS processors instead of X86 make
> emulation much more difficult? (More registers, etc.) Is the PS2 architecture
> just so whacked that it would be a serious PITA to try to emulate it on
> anything?

The shape of its performance envelope is unusual. I seem to recall it
having less pixel fill rate than XBox but higher peak memory
bandwidth; the DRAM bus on PS2 is claimed to be 2560 bits wide.

Chris
--
Chris Morgan
"Post posting of policy changes by the boss will result in
real rule revisions that are irreversible"

- anonymous correspondent
George William Herbert
2003-11-18 01:42:22 UTC
Permalink
Hans de Vries <***@chip-architect.com> wrote:
>How many different PowerPC's can there be.

A lot, though the field seems to be shrinking over time
to fewer models.


-george william herbert
***@retro.com
George William Herbert
2003-11-18 01:39:04 UTC
Permalink
Keith R. Williams <***@attglobal.net> wrote:
>Since (at least the mill I've been listening to) the 440 is
>DEADBEEF, I don't think that's what is being proposed.

Until IBM offers a better replacement embedded CPU option,
qualified on their ASIC fab processes as well as available
as a standalone CPU, the 440 isn't going anywhere.

It probably won't even if a 64-bit version (640? ;-)
intended for embedded applications shows up; sometimes the
extra 1-2 mm^2 matter, and 64-bittedness doesn't.


-george william herbert
***@retro.com
Hans de Vries
2003-11-13 20:28:42 UTC
Permalink
"subsystem" <***@nobody.net> wrote in message news:<FpLsb.37$***@newssrv26.news.prodigy.com>...
> In the patent, it was preferred that each *APU* have 32 GFLOPs performance.
> Not each PE.

OK. I should have used "APU" instead of "PE" in Cell terms.

The 4 PowerPCs are the "PU"s in the patent and the 32 PS2-like SIMD
processors are the "APU"s.

>
> There would be 1 PU/CPU per PE, and 8 APUs - which would give *256 GFLOPs*
> per PE.
>
> Then 4 PEs (256 GFLOPs each) are put onto a single chip to form a BroadBand
> Engine. that is where the 1 TFLOPs came from.
>
> The BroadBand Engine would be the main CPU of PS3.
>

The 32 GFlops and 32 GIops are only mentioned once in the patent on
page 52 referring to figure 4:

"Floating point units 412 preferably operate at a speed of 32
billion floating point operations per second (32 GFLOPS), and
integer units 414 preferably operate at a speed of 32 billion
operations per second (32 GOPS)"

Figure 4 shows four floating point units and four integer units.
The text of the patent may also be interpreted in other ways, such as:
each floating point unit operates at 32 GFlops (= 128 GFlops per APU),
or, in yet another interpretation, all floating point units together
operate at 32 GFlops (= 32 GFlops per chip).
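
The competing readings of that one sentence can be laid side by side. This is a minimal sketch using only the counts mentioned in this thread (4 floating point units per APU, 8 APUs per PE, 4 PEs per chip); the function name and the reading labels are made up for illustration.

```python
FP_UNITS_PER_APU = 4   # Figure 4 of the patent, as described above
APUS_PER_PE = 8
PES_PER_CHIP = 4
QUOTED_GFLOPS = 32     # the single figure given in the patent text


def chip_gflops(reading: str) -> int:
    """Chip-level GFLOPS under each reading of the 32 GFLOPS figure."""
    apus_per_chip = APUS_PER_PE * PES_PER_CHIP
    if reading == "per_apu":        # 32 GFLOPS for each APU
        return QUOTED_GFLOPS * apus_per_chip
    if reading == "per_fp_unit":    # 32 GFLOPS for each floating point unit
        return QUOTED_GFLOPS * FP_UNITS_PER_APU * apus_per_chip
    if reading == "per_chip":       # 32 GFLOPS for all units together
        return QUOTED_GFLOPS
    raise ValueError(f"unknown reading: {reading}")


if __name__ == "__main__":
    for reading in ("per_apu", "per_fp_unit", "per_chip"):
        print(reading, chip_gflops(reading))
```

The per-APU reading reproduces the roughly 1 TFLOPS figure (1024 GFLOPS); the other two readings differ from it by a factor of 4 upward and a factor of 32 downward, which shows how much hangs on one ambiguous sentence.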

>
> Now in this new presentation, KK shows 1 PE having performance of 1 GFLOP.
> this does not make sense at all. that's less than the Emotion Engine of
> PS2 which has 6.2 GFLOPs performance.


Each APU would have 1 GFlops. The entire chip would have 32 APUs
with a combined 32 GFlops, according to:
http://www.watch.impress.co.jp/game/docs/20020921/tgsf15.jpg

>
> The slides are 2-3 years old, that is why. they are the SAME slides that IBM
> showed for the Blue Gene project, IIRC.
>

The patent application is from March 22, 2001 while this presentation
from the President of Sony's Entertainment division (=PlayStation)
is from September 20, 2002.


> If one PE (Processor Element) can only achieve 1 GFLOPs, then
> Sony-IBM-Toshiba are going BACKWARDS not FORWARDS in performance.
>
> 256 GFLOPs in patent down to 1 GFLOPs makes no sense whatsoever.

So it's 1 GFlop per APU, 8 GFlops per PE and 32 GFlops per Chip.
It is still a huge improvement over the PS2. The PS2 runs at 300 MHz
while each of the 32 (virtual) APU's would run at 250 MHz.
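
The figures can be checked with a few lines of Python. This is only a
sketch under the assumptions made in this thread: one physical SIMD unit
with four floating point lanes clocked at a speculated 8 GHz, time-sliced
over 32 virtual APUs, and one floating point operation per lane per
cycle. The 6.2 GFlops value for the PS2 is the one quoted above. The
variable names are made up for the example.

```python
# Assumptions from this thread, not confirmed by Sony or the patent.
PHYSICAL_CLOCK_HZ = 8_000_000_000  # speculated 8 GHz SIMD unit in 65 nm
VIRTUAL_APUS = 32                  # virtual APUs sharing that unit
FP_LANES = 4                       # four 32-bit floating point units
APUS_PER_PE = 8
PS2_EE_GFLOPS = 6.2                # Emotion Engine figure quoted above

# Round-robin time slicing: each virtual APU gets every 32nd cycle.
apu_clock_hz = PHYSICAL_CLOCK_HZ // VIRTUAL_APUS

# One floating point operation per lane per cycle (an assumption).
apu_gflops = apu_clock_hz * FP_LANES / 1e9
pe_gflops = apu_gflops * APUS_PER_PE
chip_gflops = apu_gflops * VIRTUAL_APUS

print(f"APU clock: {apu_clock_hz / 1e6:.0f} MHz")
print(f"per APU:   {apu_gflops:.0f} GFlops")
print(f"per PE:    {pe_gflops:.0f} GFlops")
print(f"per chip:  {chip_gflops:.0f} GFlops")
print(f"vs. PS2:   {chip_gflops / PS2_EE_GFLOPS:.1f}x")
```

Under these assumptions the chip comes out at roughly five times the
quoted PS2 figure.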


Regards, Hans




>
> "Hans de Vries" <***@chip-architect.com> wrote in message
> news:***@posting.google.com...
> > "subsystem" <***@nobody.net> wrote in message
> > news:<NwBsb.21177$***@newssrv26.news.prodigy.com>...
> > > old but otherwise interesting read :)
> > >
> > > http://www.xboxrules.com/yabbse/index.php?threadid=47
> > >
> > >
> > > The Technology of PS3
> > > Eddie Edwards, April 2003
> > > Foreword
> >
> >
> > The only practical way to implement 4 PowerPCs and 32 Cell Processors
> > each with 128 bit (4x32) functional units on a single chip in 2006 with
> > a 65 nm process and a 100W budget is to use virtual processors. This
> > would be consistent with future PowerPC processors and IBM's Blue Gene
> > work.
> >
> > The 4 PowerPCs could be a single IBM Power6 core running 4 threads at
> > twice the frequency a Power5 would reach in the same process.
> > That would be 8 GHz in 65 nm.
> >
> > The 32 PE's have a combined performance of 32 GFlops or 1 GFlop each
> > according to this presentation of Sony Entertainment's CEO here.
> > http://www.watch.impress.co.jp/game/docs/20020921/tgsf.htm
> >
> > Have a look at this image:
> > http://www.watch.impress.co.jp/game/docs/20020921/tgsf15.jpg
> >
> > This presentation uses large data centers to get at these 1 TeraFlop
> > and even 1 PetaFlop marketing numbers. This Sony presentation seems
> > to be a clarification after the 1 TeraFlop rumor stories: "PS3 will
> > be more than 100 times more powerful than a Pentium 4"
> >
> > A single "Altivec" or "PS2" like SIMD unit with four 32 bit Floating
> > Point units and four 32 bit Integer units running also at 8 GHz in
> > 65 nm could be used to implement 32 virtual PE's working from one 4 MB
> > local memory.
> >
> > Each PE would run at an effective 250 MHz with 1 GFlop (as stated in
> > the presentation). Each PE would be able to fetch, decode and execute a
> > single SIMD instruction before loading the next one, thereby eliminating
> > all the branch prediction, out of order and load/store overhead of modern
> > processors. 80% of such a unit would be Functional units, Floating Point
> > and Integer, and 20% would be control logic. In modern OOO processors
> > it is more like the reverse.
> >
> > The patent application revived the 1 TeraFlop rumors by saying that the
> > "preferred" performance of each PE would be 32 GFlops and 32 GIops. Sony's
> > own PS3 presentation however clearly says 1 GFlop per PE for the first
> > implementation. 1 GFlop per PE suggests that the PE's are implemented as
> > virtual PE's, possibly in the way described above.
> >
> > Regards, Hans.