Discussion:
A Very Bad Idea
Quadibloc
2024-02-05 06:48:59 UTC
Permalink
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.

Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.

These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.

So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.

It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.

But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.

John Savard
Chris M. Thomasson
2024-02-05 06:51:08 UTC
Permalink
On 2/4/2024 10:48 PM, Quadibloc wrote:
[...]
Post by Quadibloc
The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
The CELL wrt the PlayStation was an interesting arch: DMA to the cell
processors. Some games did not even use them; they used the PPCs instead.
Anton Ertl
2024-02-05 07:44:24 UTC
Permalink
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
512-bit registers to the 256-bit units in two successive cycles.
Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
register and AFAIK it split 256-bit operations into two 128-bit
operations that could be scheduled arbitrarily by the OoO engine
(while Zen4 treats the 512-bit operation as a unit that consumes two
cycles of a pipelined 256-bit unit). Similar things have been done by
Intel and AMD in other CPUs, implementing 256-bit operations with
128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or
implementing 128-bit operations with 64-bit units (e.g., on the K8).

Why are they not using longer vectors with the same FUs or narrower
FUs? For Gracemont, that's really the question; they even disabled
AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
with disabled Gracemont) because Gracemont does not do AVX-512.
Supposedly the reason is that Gracemont does not have enough physical
128-bit registers for AVX-512 (128 such registers would be needed to
implement the 32 logical ZMM registers, and probably some more to
avoid deadlocks and maybe for some microcoded operations;
<https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/>
reports 191+16 XMM registers and 95+16 YMM registers, which makes me
doubt that explanation).

Anyway, the size of the register files is one reason for avoiding
longer vectors.

Also, the question is how much it buys. For Zen4, I remember seeing
results showing that coding the same stuff as two 256-bit instructions
rather than one 512-bit instruction increased power consumption a
little, resulting in the CPU (running at the power limit) lowering the
clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
benefit. How much would the benefit be from longer vectors? Probably
not more than another 100MHz: From 256-bit instructions to 512-bit
instructions already halves the number of instructions to process in
the front end; eliminating the other half would require infinitely
long vectors.
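
To make the front-end point concrete, here is a minimal sketch in C
intrinsics (assuming AVX2 and AVX-512F are available and n is a
multiple of 16): the 512-bit loop issues half as many instructions per
element, but on a double-pumped implementation like Zen4 each 512-bit
instruction still occupies the 256-bit unit for two cycles, so the
execution cost per element is essentially unchanged.

/* Same element-wise add at 256-bit vs 512-bit width.
   Assumes AVX2 and AVX-512F; n is a multiple of 16. */
#include <immintrin.h>
#include <stddef.h>

void add_avx2(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {        /* 8 floats per vector op */
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
}

void add_avx512(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {       /* 16 floats per vector op */
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
}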
Post by Quadibloc
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues.
My memory is that he mentioned memory latency. He did not
explain why he thinks so, but caches and prefetchers seem to be doing
ok for bridging the latency from DRAM to L2 or L1.

As for main memory bandwidth, that is certainly a problem for
applications that have frequent cache misses (many, but not all HPC
applications are among them). And once you are limited by main memory
bandwidth, the ISA makes little difference.

But for those applications where caches work (e.g., dense matrix
multiplication in the HPC realm), I don't see a reason why a
long-vector architecture would be unworkable. It's just that, as
discussed above, the benefits are small.
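
For illustration, the usual reason dense matrix multiplication is cache
friendly: with a blocked loop nest, every tile brought into cache is
reused on the order of the tile size before it is evicted. A minimal
sketch (BS is an illustrative tile size, not tuned for any particular
cache):

/* Cache blocking for C += A*B (row-major, n x n). */
#define BS 64

void matmul_blocked(double *C, const double *A, const double *B, int n)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* each BS x BS tile of A and B is reused BS times
                   while it is resident in cache */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}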
Post by Quadibloc
The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
Caches work well for most applications. So mainstream CPUs are
designed with a certain amount of cache and enough main-memory
bandwidth to satisfy most applications. For the niche that needs more
main-memory bandwidth, there are GPGPUs which have high bandwidth
because their original application needs it (and AFAIK GPGPUs have
long vectors). For the remaining niche, having a CPU with several
stacks of HBM memory attached (like the NEC vector CPUs) is a good
idea; and given that there is legacy software for NEC vector CPUs,
providing that ISA also covers that need.
Post by Quadibloc
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth; and why would you go for explicitly managed
local memory (which deservedly vanished from the market, see below)
rather than the well-working setup of cache and prefetchers? BTW,
Raptor Cove gives you 2MB of private L2.
Post by Quadibloc
The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
The Playstation 3 has a separate GPU based on the Nvidia G70
<https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.

What I heard/read about the Cell CPU is that the SPEs were too hard to
make good use of and that consequently they were not used much.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Quadibloc
2024-02-05 13:19:41 UTC
Permalink
Post by Anton Ertl
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth;
Well, the original Cray I had a main memory of eight megabytes, and the
Cray Y-MP had up to 512 megabytes of memory.

I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done (but
then I wouldn't be addressing Mitch Alsup's objection).

Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an
eight-channel memory bus.

Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels would already be expensive for a consumer product.

Given Mitch Alsup's objection, though, I threw out the opposite kind
of design, one patterned after the CELL, as one that maybe could allow
a vector CPU to churn out more FLOPs. But as I noted, it seems to have
the fatal flaw of very little capacity for any kind of useful work...
which is kind of the whole point of any CPU.

John Savard
John Dallman
2024-02-05 14:34:00 UTC
Permalink
Post by Quadibloc
I was keeping as close to the original CELL design as possible
That, I think, was a mistake. Given how unsuccessful it was, it's fair
evidence that strategy isn't useful for very much.

There's an approach that isn't specifically for vectors, but gives good
memory bandwidth, and is having some commercial success. Apple's M-series
ARM SoCs have limited RAM (8GB to 24GB) on the SoC, but it's much larger
than caches, or a Cray-1's main memory. They then use fast SSDs for
swapping, but there's no reason, apart from cost, that you couldn't have
a layer with 1 TB or so of DRAM between the SoC and the SSD.

John
Anton Ertl
2024-02-05 13:44:56 UTC
Permalink
Post by Quadibloc
Post by Anton Ertl
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth;
Well, the original Cray I had a main memory of eight megabytes
If you want to compete with a 1976 supercomputer, Megabytes may be
enough. However, if you want to compete with something from 2024,
better look at how much local memory the likes of these NEC cards, or
Nvidia or AMD GPGPUs provide. And that's Gigabytes.
Post by Quadibloc
I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done
You don't have to look for the Xeon Phi. The lowly Athlon 64 X2 or
Pentium D from 2005 already have several cores sharing access to the
external memory (and the UltraSPARC T1 from the same year even has 8
cores).

The Xeon Phis are interesting:

* Knight's Corner is a PCIe card with up to 16GB local memory and
bandwidths up to 352GB/s (plus access to the host system's DRAM and
anemic bandwidth (PCIe 2.0 x16)).

* Knight's Landing was available as PCIe card or as socketed CPU with
16GB of local memory with "400+ GB/s" bandwidth and up to 384GB of
DDR4 memory with 102.4GB/s.

* Knight's Mill was only available in a socketed version with similar
specs.

* Eventually they were replaced by the big mainstream Xeons without
local memory, stuff like the Xeon Platinum 8180 with about 128GB/s
DRAM bandwidth.

It seems that running the HPC processor as a coprocessor was not good
enough for the Xeon Phi, and that the applications that needed lots of
bandwidth to local memory also did not provide enough revenue to
sustain Xeon Phi development; OTOH, Nvidia has great success with its
GPGPU line, so maybe the market is there, but the Xeon Phi was
uncompetitive.

If you are interested in such things, the recently announced AMD
Instinct MI300A (CPUs+GPUs) with 128GB local memory or MI300X (GPUs
only) with 192GB local memory with 5300GB/s bandwidth may be of
interest to you.
Post by Quadibloc
Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an
eight-channel memory bus.
IBM sells Power systems with few cores and the full memory system.
Similarly, you can buy AMD EPYCs with few active cores and the full
memory system. Some of them have a lot of cache, too (e.g., 72F3 and
7373X).
Post by Quadibloc
Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels will already be expensive for a consumer product.
If you want high bandwidth in a consumer product, buy a graphics card.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-02-05 19:46:55 UTC
Permalink
Post by Quadibloc
Post by Anton Ertl
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth;
Well, the original Cray I had a main memory of eight megabytes, and the
Cray Y-MP had up to 512 megabytes of memory.
CRAY-1 could access one 64-bit memory container per cycle continuously.
CRAY-XMP could access 3 64-bit memory containers (2 LD, 1 ST) per cycle
continuously.
Memory started at about 16 cycles away (12.5 ns version) and ended up
about 30 cycles away (6 ns version), and a memory bank could be accessed
about every 7 cycles.
Post by Quadibloc
I was keeping as close to the original CELL design as possible, but
certainly one could try to improve. After all, if Intel could make
a device like the Xeon Phi, having multiple CPUs on a chip all sharing
access to external memory, however inadequate, could still be done (but
then I wouldn't be addressing Mitch Alsup's objection).
Instead of imitating the CELL, or the Xeon Phi, for that matter, what
I think of as a more practical way to make a consumer Cray-like chip
would be to put only one core in a package, and give that core an
eight-channel memory bus.
Some older NEC designs used a sixteen-channel memory bus, but I felt
that eight channels would already be expensive for a consumer product.
Given Mitch Alsup's objection, though, I threw out the opposite kind
of design, one patterned after the CELL, as one that maybe could allow
a vector CPU to churn out more FLOPs. But as I noted, it seems to have
the fatal flaw of very little capacity for any kind of useful work...
which is kind of the whole point of any CPU.
John Savard
MitchAlsup1
2024-02-05 19:30:20 UTC
Permalink
Post by Anton Ertl
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
To some extent, it is: Zen4 performs 512-bit SIMD by feeding its
512-bit registers to the 256-bit units in two successive cycles.
Earlier Zen used 2 physical 128-bit registers as one logical 256-bit
register and AFAIK it split 256-bit operations into two 128-bit
operations that could be scheduled arbitrarily by the OoO engine
(while Zen4 treats the 512-bit operation as a unit that consumes two
cycles of a pipelined 256-bit unit). Similar things have been done by
Intel and AMD in other CPUs, implementing 256-bit operations with
128-bit units (Gracemont, Bulldozer-Excavator, Jaguar and Puma), or
implementing 128-bit operations with 64-bit units (e.g., on the K8).
Why are they not using longer vectors with the same FUs or narrower
FUs? For Gracemont, that's really the question; they even disabled
AVX-512 on Alder Lake and Raptor Lake completely (even on Xeon CPUs
with disabled Gracemont) because Gracemont does not do AVX-512.
They wanted to keep core power under some <thermal> limit; 256 bits
fit under this limit, 512 did not.
Post by Anton Ertl
Supposedly the reason is that Gracemont does not have enough physical
128-bit registers for AVX-512 (128 such registers would be needed to
implement the 32 logical ZMM registers, and probably some more to
avoid deadlocks and maybe for some microcoded operations;
<https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/>
reports 191+16 XMM registers and 95+16 YMM registers, which makes me
doubt that explanation).
Anyway, the size of the register files is one reason for avoiding
longer vectors.
Also, the question is how much it buys. For Zen4, I remember seeing
results that coding the same stuff as using two 256-bit instructions
rather than one 512-bit instruction increased power consumption a
little, resulting in the CPU (running at the power limit) lowering the
clock rate of the cores from IIRC 3700MHz to 3600MHz; not a very big
benefit. How much would the benefit be from longer vectors? Probably
not more than another 100MHz: From 256-bit instructions to 512-bit
instructions already halves the number of instructions to process in
the front end; eliminating the other half would require infinitely
long vectors.
Post by Quadibloc
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues.
My memory is that he mentioned memory latency. He did not
explain why he thinks so, but caches and prefetchers seem to be doing
ok for bridging the latency from DRAM to L2 or L1.
As seen by scalar cores, yes; as seen by vector cores (like CRAY), no.

I might note:: RISC-V has a CRAY-like vector extension and a SIMD-like
vector extension. ... make of that what you may.
Post by Anton Ertl
As for main memory bandwidth, that is certainly a problem for
applications that have frequent cache misses (many, but not all HPC
applications are among them). And once you are limited by main memory
bandwidth, the ISA makes little difference.
My point in the previous post.
Post by Anton Ertl
But for those applications where caches work (e.g., dense matrix
multiplication in the HPC realm), I don't see a reason why a
long-vector architecture would be unworkable. It's just that, as
discussed above, the benefits are small.
TeraByte 2D and 3D FFTs are not cache friendly...
Post by Anton Ertl
Post by Quadibloc
The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
Caches work well for most applications. So mainstream CPUs are
designed with a certain amount of cache and enough main-memory
bandwidth to satisfy most applications. For the niche that needs more
main-memory bandwidth, there are GPGPUs which have high bandwidth
because their original application needs it (and AFAIK GPGPUs have
And can afford to absorb the latency.
Post by Anton Ertl
long vectors). For the remaining niche, having a CPU with several
stacks of HBM memory attached (like the NEC vector CPUs) is a good
idea; and given that there is legacy software for NEC vector CPUs,
providing that ISA also covers that need.
Post by Quadibloc
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
Who would buy such a microprocessor? Megabytes? Laughable. If
that's intended to be a buffer for main memory, you need the
main-memory bandwidth; and why would you go for explicitly managed
local memory (which deservedly vanished from the market, see below)
rather than the well-working setup of cache and prefetchers? BTW,
Raptor Cove gives you 2MB of private L2.
Post by Quadibloc
The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
The Playstation 3 has a separate GPU based on the Nvidia G70
<https://en.wikipedia.org/wiki/PlayStation_3_technical_specifications#Graphics_processing_unit>.
What I heard/read about the Cell CPU is that the SPEs were too hard to
make good use of and that consequently they were not used much.
- anton
BGB
2024-02-05 09:32:16 UTC
Permalink
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
One doesn't need to disallow access to external RAM, but maybe:
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".

Or, say, there is a region on RAM that is divided among the cores, where
the core has fast access to its own local chunk, but slow access to any
of the other chunks (which are treated more like external RAM).

Here, threads would be assigned to particular cores, and the scheduler
may not move a thread from one core to another if it is assigned to a
given core.
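
As a rough sketch of the thread-pinning part (Linux pthread affinity;
local_chunk[] is a hypothetical table of per-core fast-RAM regions,
since how the local chunks get exposed to software is exactly the open
question here):

/* Pin worker threads to fixed cores so each works mostly in "its"
   chunk of RAM (Linux-specific; needs _GNU_SOURCE). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define MAX_CORES 64
static void *local_chunk[MAX_CORES];      /* hypothetical per-core regions */

struct worker_arg { int core; };

static void *worker(void *p)
{
    struct worker_arg *arg = p;
    void *mine = local_chunk[arg->core];  /* fast, core-local storage */
    (void)mine;                           /* ... work against 'mine' here ... */
    return 0;
}

int start_pinned_worker(pthread_t *tid, struct worker_arg *arg)
{
    pthread_attr_t attr;
    cpu_set_t cpus;

    pthread_attr_init(&attr);
    CPU_ZERO(&cpus);
    CPU_SET(arg->core, &cpus);            /* never migrated off this core */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
    return pthread_create(tid, &attr, worker, arg);
}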


As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
is cheap and simple.

The Cell cores were, if anything, more of a "SIMD First, ALU Second"
approach, building it around 128-bit registers but only using part of
these for integer code.

I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general purpose
code, rather than used almost entirely for SIMD.


I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...

It is arguably more scalable, but as I see it, much past 64 or 128 bit
vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.

Or, maybe a hybrid strategy, where the vector operations are applied
over types that may just so happen to include SIMD vectors.



Granted, with the usual caveat that one needs to be careful in the
design of SIMD to not allow it to eat too much of one's encoding space.


Well, and there seems to be a trap of people trying to stick "extra
stuff" into the SIMD ops that bloats their requirements, such as gluing
range-saturation or type-conversions onto the SIMD operations.

Granted, I did experiment with bolting shuffle onto SIMD ops, but this
was exclusive to larger encodings (64 and 96 bit operations). And,
likely, not really worth it outside of some niche cases.
Post by Quadibloc
But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
Yes, tradeoffs.

But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the
area of doing an OpenGL implementation. Needing to fall back to scalar
operations really hurts things (even as much as the RV64 F/D extensions
are "fairly aggressive" in some areas, with a lot of features that I
would have otherwise deemed "mostly unnecessary").

Or, some features it does have, partly backfire:
The ABI and compiler treat 'float' and 'double' as distinct types;
In turn requiring a lot of conversion back and forth.

My own approach was to always keep scalar values as Binary64 in
registers, since this limits having a bunch of back-and-forth conversion
for 'float' in local variables (this type was more relevant to in-memory
storage, and as a hint for when and where it is safe to trade off
precision for speed).


Though, looks like "-ffast-math" in GCC does help this issue (the float
variables remain as float, rather than endlessly bouncing back and forth
between float and double).
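
For illustration, the conversion issue is already visible at the C
level, independent of the ISA: mixing a 'float' variable with a double
constant promotes the whole expression, and with distinct float/double
register types each such expression costs converts in both directions.

/* Standard C promotion rules, not anything ISA-specific. */
float scale_bad(float x)
{
    return x * 0.5 + 1.0;   /* x converted to double, math done in double,
                               result converted back to float */
}

float scale_good(float x)
{
    return x * 0.5f + 1.0f; /* stays in single precision throughout */
}

Writing the constants as 0.5f/1.0f keeps the whole expression in single
precision; keeping everything as Binary64 in registers makes the
converts free instead.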
MitchAlsup1
2024-02-05 19:43:15 UTC
Permalink
Post by BGB
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".
Or, say, there is a region on RAM that is divided among the cores, where
the core has fast access to its own local chunk, but slow access to any
of the other chunks (which are treated more like external RAM).
Large FFTs do not fit in this category. FFTs are one of the most valuable
means of calculating Great Big Physics "stuff". We used FFT back in the
NMR lab to change a BigO( n^3 ) problem into a 2×BigO( n×log(n) ) problem.
VERY Many big physics simulations do similarly.

That problem was matrix-matrix multiplication !!

MultipliedMatrix = IFFT( ConjugateMultiply( FFT( matrix ), pattern ) );

{where pattern was FFTd earlier }

Lookup the data access pattern and apply that knowledge to TB sized
matrixes and then ask yourself if caches bring anything to the party ?
Post by BGB
Here, threads would be assigned to particular cores, and the scheduler
may not move a thread from one core to another if it is assigned to a
given core.
As for SIMD vs vectors, as I see it, SIMD seems to make sense in that it
is cheap and simple.
If you are happy adding 1,000+ instructions to your ISA, then yes.
Post by BGB
The Cell cores were, if anything, more of a "SIMD First, ALU Second"
approach, building it around 128-bit registers but only using part of
these for integer code.
I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general purpose
code, rather than used almost entirely for SIMD.
I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...
Vector design is a lot more about the memory system (feeding the beast)
than the core (the beast) consuming memory BW.
Post by BGB
It is arguably more scalable, but as I see it, much past 64 or 128 bit
vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.
Architecture is more about "what to leave OUT" as about "what to put in".
Post by BGB
But, I can also note that even for semi-general use, an ISA design like
RV64G suffers a significant disadvantage, say, vs my own ISA, in the
They disobeyed the "what to leave out" and "what to put in" rules.
BGB-Alt
2024-02-05 22:51:33 UTC
Permalink
Post by MitchAlsup1
Post by BGB
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
Memory coherence is fairly weak for these cores;
The local RAM addresses are treated as "strongly preferable".
Or, say, there is a region on RAM that is divided among the cores,
where the core has fast access to its own local chunk, but slow access
to any of the other chunks (which are treated more like external RAM).
Large FFTs do not fit in this category. FFTs are one of the most valuable
means of calculating Great Big Physics "stuff". We used FFT back in the
NMR lab to change a BigO( n^3 ) problem into a 2×BigO( n×log(n) ) problem.
VERY Many big physics simulations do similarly.
That problem was matrix-matrix multiplication !!
MultipliedMatrix = IFFT( ConjugateMultiply( FFT( matrix ), pattern ) );
{where pattern was FFTd earlier }
Lookup the data access pattern and apply that knowledge to TB sized
matrixes and then ask yourself if caches bring anything to the party ?
AFAIK, it is typical for FFT-style transforms (such as DCT) beyond a
certain size to decompose them into a grid and then perform it in
multiple stages and layers?...

Say, if you want a 64x64, you decompose it into 8x8, then apply a second
level on the DC coefficients of the first level.
Post by MitchAlsup1
Post by BGB
Here, threads would be assigned to particular cores, and the scheduler
may not move a thread from one core to another if it is assigned to a
given core.
As for SIMD vs vectors, as I see it, SIMD seems to make sense in that
it is cheap and simple.
If you are happy adding 1,000+ instructions to your ISA, then yes.
If you leave out stuff like op+convert, saturating arithmetic, ...
The number of needed instructions drops significantly.

Basically, most of the stuff that leads to an MxN expansion in the
number of instructions.

So, for example, I left out packed-byte operations and saturating ops,
instead:
  Packed Integer:
    4x Int16
    2x|4x Int32
  Packed Float:
    4x Binary16
    2x|4x Binary32
    2x Binary64

I also decided that there would be no native vector sizes other than 64
and 128 bits.
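
As an illustration of how cheap the non-saturating subset is, a
wrapping 4 x 16-bit packed add is little more than a masked 64-bit add;
this SWAR sketch in plain C is roughly what a minimal SIMD ALU does per
lane (not any particular ISA's actual operation), and it is the
saturating/converting variants that multiply the opcode count:

/* Non-saturating 4 x 16-bit packed add within one 64-bit word. */
#include <stdint.h>

uint64_t padd16x4(uint64_t a, uint64_t b)
{
    const uint64_t low15 = 0x7FFF7FFF7FFF7FFFULL; /* low 15 bits of each lane */

    /* add the low 15 bits of every lane, then fix up the top bit of each
       lane separately so carries never cross a lane boundary */
    uint64_t sum  = (a & low15) + (b & low15);
    uint64_t msbs = (a ^ b) & ~low15;
    return sum ^ msbs;
}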
Post by MitchAlsup1
Post by BGB
The Cell cores were, if anything, more of a "SIMD First, ALU Second"
approach, building it around 128-bit registers but only using part of
these for integer code.
I went a slightly different direction, using 64-bit registers that may
be used in pairs for 128-bit ops. This may make more sense if one
assumes that the core is going to be used for a lot more general
purpose code, rather than used almost entirely for SIMD.
I have some hesitation about "vector processing", as it seems fairly
alien to how this stuff normally sort of works; seems more complicated
than SIMD for an implementation; ...
Vector design is a lot more about the memory system (feeding the beast)
than the core (the beast) consuming memory BW.
Yeah.

Seems to have the weirdness that the vectors are often more defined in
terms of memory arrays, rather than the SIMD-like "well, this register
has 4 datums".

SIMD works well if one also has a lot of data with 3 or 4 elements,
which is typical.

A lot of the larger-vector SIMD and autovectorization stuff seems more
focused on the vector-style usage, rather than datasets based on 3D and
4D vector math though (say, with a lot of DotProduct and CrossProduct
and similar).


But, as for implementation:
SIMD is basically the same as normal scalar stuff...
Vector is not.
Post by MitchAlsup1
Post by BGB
It is arguably more scalable, but as I see it, much past 64 or 128 bit
vectors, SIMD rapidly goes into diminishing returns, and it makes more
sense to be like "128-bit is good enough" than to try to chase after
ever wider SIMD vectors.
Architecture is more about "what to leave OUT" as about "what to put in".
Yeah.

As I see it, AVX-256 was already in diminishing returns.
Then AVX-512 is more sort of a gimmick that ends up left out of most of
the CPUs anyways, and for most embedded CPUs, they are 64 or 128-bit
vectors and call it done.
Post by MitchAlsup1
Post by BGB
But, I can also note that even for semi-general use, an ISA design
like RV64G suffers a significant disadvantage, say, vs my own ISA,
in the
They disobeyed the "what to leave out" and "what to put in" rules.
As I can note, needing to add a whole bunch of stuff to the FPU, and
still not seeing much gain (even if I simulate superscalar operation in
the emulator, still "not particularly fast" at this).


Granted, GCC seems pretty bent on using FMADD.S and similar, even though
in my core this is slower than separate FMUL.S+FADD.S would be, and it
seems that options for other targets (like "-fno-fused-multiply-add" or
"-fno-madd4" and similar) are not recognized for RISC-V.

It is kind of a crappy balance in some areas.
Does pretty good for code-size at least, but performance seems lacking
in everything I am testing.

Though, ironically, it is not so much a "uniform badness", as often the
"hot spots" seems to be more concentrated, like, over-all it is doing
OK, except in the spots where it ends up shooting itself in the foot...

Contrast, for a lot of the same programs on BJX2, things are slightly
less concentrated on the hot-spots.


Well, and the other difference is that code for BJX2 seems to spend
roughly 3x as many clock cycles in memory ops (rather than being more
ALU bound; even after adding a special case to reduce ALU ops to 1 cycle).

So, Doom's breakdown for RV64 was more like:
  ADD/Shift
  Branch
  Mem
Quake's is more like:
  ADD
  FMADD/etc, FDIV.S
  Mem
  Branch

For BJX2, it was generally:
  Mem
  Branch
  ALU and CMP
  ...


For Software Quake, it should have been more fair, since both are mostly
using the C version of Quake's software renderer, which doesn't really
make much use of any special features in my ISA.

RV64G still doesn't win...

For scalar-FPU intensive stuff, theoretically RV64G could have had an
advantage, as it has a lot more special-case ops, vs, say:
Well, you got Binary64 FADD, FSUB, and FMUL?...

Well, and Compare, and a few converter ops, ...
And, then nominally running all the "float" operations through the
double precision FPU operations, ...


Meanwhile, for GLQuake, TKRA-GL makes more significant use of features
in my ISA, and it shows. It is mostly high-single-digit to
low-double-digit on BJX2, and around 0/1 fps on RV64G (and, basically,
entirely unplayable; by another stat, averaging roughly 0.8 fps).

Maybe one could do a decent OpenGL implementation with the 'V'
extension, but I have my doubts, and don't currently have any intent to
mess with the V extension.

Main difference between RV64G and BJX2 regarding OpenGL:
Mostly that BJX2 has SIMD ops;
Well, and twice the GPRs; this code is very register-hungry,
with some functions having 100+ local variables.

And, for the RISC-V case, it needs to fall back to scalar ops and a lot
more bit twiddling.

Granted, this code was also an area where, scaled relative to
clock-speed, BJX2 was beating my Ryzen; this is not true of most other code.

Though, while x86-64 also has SIMD, it lacks some other nifty features,
like RGB555 pack/unpack ops, or helper ops for texture-compression.


Theoretically, should be possible to use the hardware rasterizer module,
which should at least be "less terrible" (it looked like GLQuake was
being rendered using software rasterization in this case, will need to
look into it some more).

...
MitchAlsup1
2024-02-05 19:20:51 UTC
Permalink
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
Memory LATENCY issues, not BW issues. The length of the vector
has to be able to absorb a miss at all cache levels without
stalling the core. 5GHz processors and 60 ns DRAM access times
mean the minimum vector length is 300 registers in a single
vector. Which also means it takes a loop of count 300+ to
reach peak efficiency.

To a certain extent the B registers of the CRAY 2 were to
do that (absorb longer and longer memory latencies) but
this B register set is now considered a failure.
Post by Quadibloc
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
They also increased vector length as memory latency increased.
Ending up at (IIRC) 256 entry VRF[k].
Post by Quadibloc
These days, Moore's Law has limped along well enough to allow
putting a lot of cache memory on a single die and so on.
Consider FFT:: sooner or later you are reading and writing
vastly scattered memory containers considerably smaller than
any cache line. FFT is one you want peak efficiency on !
So, if you want FFT to run at core peak efficiency, your
interconnect has to be able to pass either containers from
different memory banks on alternating cycles, or whole
cache lines in a single cycle. {{The latter is easier to do}}

A vector machine (done properly) is a bandwidth machine rather
than a latency based machine (which can be optimized by cache
hierarchy).
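
The access pattern behind the FFT point, sketched with just the radix-2
butterfly addressing (data, twiddles and bit-reversal omitted): each
pass pairs elements 'span' apart, and 'span' grows to half the array,
so on a terabyte-sized transform the later passes fetch cache lines
that each contribute one useful element.

/* Radix-2 butterfly addressing pattern only. */
#include <stddef.h>

void fft_access_pattern(size_t n, void (*butterfly)(size_t i, size_t j))
{
    for (size_t span = 1; span < n; span *= 2)          /* log2(n) passes */
        for (size_t base = 0; base < n; base += 2 * span)
            for (size_t k = 0; k < span; k++)
                butterfly(base + k, base + k + span);   /* partners 'span' apart */
}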
Post by Quadibloc
So, perhaps it might be possible to design a chip that is
basically similar to the IBM/SONY CELL microprocessor,
except that the satellite processors handle Cray-style vectors,
and have multiple megabytes of individual local storage.
Generally microprocessors are pin limited, as are DRAM chips,
so in order to get the required BW--2LD1ST per cycle continuously
with latency less than vector length--you end up needing a way
to access 16-to-64 DRAM DIMMs simultaneously. You might be able
to do this with PCIe 6.0 if you have 64 twisted quads, one for
each DRAM DIMM. Minimum memory size is 64 DIMMs !

A processor box with 64 DIMMs (as its minimum) is not mass market.

One reason CRAY sold a lot of supercomputers is that its I/O
system was also up to the task--CRAY YMP had 4× the I/O BW
of NEC SX{4,5,6} so when the application became I/O bound
the 6ns YMP was faster than the SX.

It is perfectly OK to try to build a CRAY-like vector processor.
But designing a vector processor is a lot more about the memory
system (feeding the beast) than about the processor (the beast).
Post by Quadibloc
It might be possible to design such a chip. The main processor
with access to external DRAM would be a conventional processor,
with only ordinary SIMD vector capabilities. And such a chip
might well be able to execute lots of instructions if one runs
a suitable benchmark on it.
If you figure this out, there is a market for 100-200 vector
supercomputer mainframes per year. If you can build a company
that makes money on this volume -- go for it !
Post by Quadibloc
But try as I might, I can't see a useful application for such
a chip. The restricted access to memory would basically hobble
it for anything but a narrow class of embarrassingly parallel
applications. The original CELL was thought of as being useful
for graphics applications, but GPUs are much better at that.
John Savard
Lawrence D'Oliveiro
2024-02-05 22:56:20 UTC
Permalink
I am very fond of the vector architecture of the Cray I and similar
machines, because it seems to me the one way of increasing computer
performance that proved effective in the past that still isn't being
applied to microprocessors today.
Mitch Alsup, however, has noted that such an architecture is unworkable
today due to memory bandwidth issues.
RISC-V has a long-vector feature very consciously modelled on the Cray
one. It eschews the short-vector SIMD fashion that has infested so many
architectures these days precisely because the resulting combinatorial
explosion in added instructions makes a mockery of the “R” in “RISC”.
MitchAlsup1
2024-02-10 23:27:35 UTC
Permalink
Post by Lawrence D'Oliveiro
I am very fond of the vector architecture of the Cray I and similar
machines, because it seems to me the one way of increasing computer
performance that proved effective in the past that still isn't being
applied to microprocessors today.
Mitch Alsup, however, has noted that such an architecture is unworkable
today due to memory bandwidth issues.
RISC-V has a long-vector feature very consciously modelled on the Cray
one. It eschews the short-vector SIMD fashion that has infested so many
architectures these days precisely because the resulting combinatorial
explosion in added instructions makes a mockery of the “R” in “RISC”.
So does the C extension--it's all redundant...
Marcus
2024-02-13 18:57:28 UTC
Permalink
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).

My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.

Thus, it always feels like using vector instructions would not give any
performance gains. Yet, every time I vectorize a scalar loop (basically
change scalar registers for vector registers), I see a very healthy
performance increase.

I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
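
Concretely, the effect shows up even in a plain C sketch of
strip-mining, where VL stands in for the hardware vector length and the
inner loop below is what a single vector instruction replaces: the
compare/branch/index-update overhead is paid once per VL elements
instead of once per element.

#define VL 8   /* illustrative vector length, e.g. 8 x 32-bit elements */

void saxpy_scalar(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i++)          /* n branches, n index updates */
        y[i] += a * x[i];
}

void saxpy_strip_mined(float *y, const float *x, float a, int n)
{
    for (int i = 0; i < n; i += VL) {    /* n/VL branches and index updates */
        int m = (n - i < VL) ? n - i : VL;
        /* on MRISC32-style hardware the next loop is one vector
           multiply-add, issued once and executed over m cycles */
        for (int j = 0; j < m; j++)
            y[i + j] += a * x[i + j];
    }
}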

(I know very well that VVM gives similar gains without the VRF)

I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by simply
adding MRISC32-style vector support. I think that the gains would be
even bigger for non-pipelined machines, that could start "pumping" the
execute stage on every cycle when processing vectors, skipping the fetch
and decode cycles.

BTW, I have also noticed that I often only need a very limited number of
vector registers in the core vectorized loops (e.g. 2-4 registers), so I
don't think that the VRF has to be excruciatingly big to add value to a
small core. I also envision that for most cases you never have to
preserve vector registers over function calls. I.e. there's really no
need to push/pop vector registers to the stack, except for context
switches (which I believe should be optimized by tagging unused vector
registers to save on stack bandwidth).

/Marcus
Quadibloc
2024-02-14 05:24:27 UTC
Permalink
Post by Marcus
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?

The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.

But there's also one very bad thing about a vector register file.

Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.

So, the vector register file being a _large shared resource_ (a
Cray-style file of eight 64-element, 64-bit registers is already 4 KB
of state per context), one faces the dilemma... make extra copies for
as many programs as may be running, or save and restore it.

I've come up with _one_ possible solution. Remember the Texas Instruments
9900, which kept its registers in memory, because it was a 16-bit CPU
back when there weren't really enough gates on a die to make one
possible... leading to fast context switching?

Well, why not have an on-chip memory, smaller than L2 cache but made
of similar memory cells, and use it for multiple vector register files,
indicated by a pointer register?
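
A sketch of the idea in C (all names hypothetical): the vector register
file is just a slice of a larger on-chip SRAM, and a context switch
changes a base pointer instead of copying kilobytes of state.

#include <stdint.h>

#define VREGS        8
#define VLEN         64
#define NUM_CONTEXTS 16                   /* how many VRFs the SRAM holds */

typedef struct {
    uint64_t v[VREGS][VLEN];              /* one Cray-style VRF: 4 KB */
} vrf_t;

static vrf_t onchip_sram[NUM_CONTEXTS];   /* stands in for dedicated SRAM */
static vrf_t *vrf_base;                   /* the "workspace pointer" */

void switch_vector_context(int ctx)
{
    /* no copying: the running thread's vector registers are simply
       whichever slice of the SRAM vrf_base points at */
    vrf_base = &onchip_sram[ctx];
}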

But then the on-chip memory has to be divided into areas locked off
from different users, just like external DRAM, and _that_ becomes
a bit painful to contemplate.

The Cray I was intended to be used basically in *batch* mode. Having
a huge vector register file in an ISA meant for *timesharing* is the
problem.

Perhaps what is really needed is VVM combined with some very good
cache hinting mechanisms. I don't have the expertise needed to work
that out, so I'll have to settle for something rather more kludgey
instead.

Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxiliary processor after the fashion of those
in the IBM/Sony CELL processor. But one wants high-bandwidth access
to DRAM, not no access to DRAM!

The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
inside a module that looks a lot like a video card. You just have to
settle for 48 gigabytes of memory that won't be expandable.

Some database computers, of course, have as much as a terabyte of
DRAM - which used to be the size of a large magnetic hard drive.

People who can afford a terabyte of DRAM can also afford an eight-channel
memory bus, so it should be possible to manage something.

John Savard
Quadibloc
2024-02-14 05:53:02 UTC
Permalink
Post by Quadibloc
Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea"
Didn't look very carefully. That's _this_ thread.

John Savard
Thomas Koenig
2024-02-14 17:13:28 UTC
Permalink
Post by Quadibloc
Post by Marcus
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database servers, though - fast I/O, cheaper than
an IBM machine of the same performance.

Or so I heard, ~ 30 years ago.
MitchAlsup1
2024-02-14 20:45:36 UTC
Permalink
Post by Thomas Koenig
Post by Quadibloc
Post by Marcus
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
They were used as database server, though - fast I/O, cheaper than
an IBM machine of the same performance.
The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.
Post by Thomas Koenig
Or so I heard, ~ 30 years ago.
Should be closer to 40 years ago.
Quadibloc
2024-02-15 11:21:14 UTC
Permalink
Post by MitchAlsup1
Post by Thomas Koenig
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances.
The Cray systems weren't used as general purpose timesharing systems.
I wasn't intending this as a criticism of the Cray systems, but
of my plan to copy their vector architecture in a chip intended
for general purpose desktop computer use.
Post by MitchAlsup1
Post by Thomas Koenig
They were used as database servers, though - fast I/O, cheaper than
an IBM machine of the same performance.
Interesting.
Post by MitchAlsup1
The only thing they lacked for timesharing was paging:: CRAYs had a
base and bounds memory map. They made up for lack of paging with a
stupidly fast I/O system.
Good to know; the Cray I was a success, so it's good to learn from
it.

John Savard
Marcus
2024-02-15 18:44:27 UTC
Permalink
Post by Quadibloc
Post by Marcus
(I know very well that VVM gives similar gains without the VRF)
Other than the Cray I being around longer than VVM, what good is
a vector register file?
The obvious answer is that it's internal storage, rather than main
memory, so it's useful for the same reason that cache memory is
useful - access to frequently used values is much faster.
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.

My current vision (not MRISC32), which is a very simple
microcontroller type implementation (basically in the same ballpark as
Cortex-M or small RV32I implementations), would have a relatively
limited vector register file.

I scribbled down a suggestion here:

* https://gitlab.com/-/snippets/3673883

In particular, pay attention to the sections "Vector state on context
switches" and "Thread context".

My idea is not new, but I think that it takes some old ideas a few steps
further. So here goes...

There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
for a grand total of 128 bytes of vector thread context state. To start
with, this is not an enormous amount of state (it's the same size as the
integer register file of RV32I).

Each vector register is associated with a "vector in use" flag, which is
set as soon as the vector register is written to.

The novel part (AFAIK) is that all "vector in use" flags are cleared as
soon as a function returns (rts) or another function is called (bl/jl),
which takes advantage of the ABI that says that all vector registers are
scratch registers.

I then predict that the ISA will have some sort of intelligent store
and restore state instructions, that will only waste memory cycles
for vector registers that are marked as "in use". I also predict that
most vector registers will be unused most of the time (except for
threads that use up 100% CPU time with heavy data processing, which
should hopefully be in minority - especially in the kind of systems
where you want to put a microcontroller style CPU).

I do not yet know if this will fly, though...
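
For what it's worth, a sketch of what the context-switch path might
look like in C. The hardware-maintained "in use" flags and the vector
registers themselves are modelled with plain variables, and the marked
copies would be the intelligent save/restore instructions; everything
here is hypothetical.

#include <stdint.h>
#include <string.h>

#define NUM_VREGS  4
#define VREG_BYTES 32                       /* 8 x 32 bits per register */

static uint8_t hw_vreg[NUM_VREGS][VREG_BYTES]; /* stands in for V1..V4 */
static uint8_t hw_in_use;                      /* set on write, cleared on call/return */

struct vctx {
    uint8_t in_use;
    uint8_t data[NUM_VREGS][VREG_BYTES];
};

void vctx_save(struct vctx *c)
{
    c->in_use = hw_in_use;
    for (int r = 0; r < NUM_VREGS; r++)
        if (hw_in_use & (1u << r))          /* store only live registers */
            memcpy(c->data[r], hw_vreg[r], VREG_BYTES);
}

void vctx_restore(const struct vctx *c)
{
    hw_in_use = c->in_use;
    for (int r = 0; r < NUM_VREGS; r++)
        if (c->in_use & (1u << r))          /* reload only what was live */
            memcpy(hw_vreg[r], c->data[r], VREG_BYTES);
}

In the common case where no vector register has been written since the
last call/return boundary, both loops degenerate to copying a single
flag byte.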
Post by Quadibloc
So, the vector register file being a _large shared resource_, one
faces the dilemma... make extra copies for as many programs as may
be running, or save and restore it.
I've come up with _one_ possible solution. Remember the Texas Instruments
9900, which kept its registers in memory, because it was a 16-bit CPU
back when there weren't really enough gates on a die to make one
possible... leading to fast context switching?
Well, why not have an on-chip memory, smaller than L2 cache but made
of similar memory cells, and use it for multiple vector register files,
indicated by a pointer register?
I have had a similar idea for "big" implementations that have a huge
vector register file. My idea, though, is more of a hybrid: Basically
keep a few copies (e.g. 4-8 copies?) of vector registers for hot threads
that can be quickly switched between (no cost - just a logical "vector
register file ID" that is changed), and then have a more or less
separate memory path to a bigger vector register file cache, and swap
register file copies in/out of the hot storage asynchronously.

I'm not sure if it would be feasible to either implement next-thread
prediction in hardware, or get help from the OS in the form of hints
about the next likely thread(s) to execute, but the idea is that it
should be possible to hide most of the context switch overhead this way.
Post by Quadibloc
But then the on-chip memory has to be divided into areas locked off
from different users, just like external DRAM, and _that_ becomes
a bit painful to contemplate.
Wouldn't a kernel space "thread ID" or "vector register file ID" do?
Post by Quadibloc
The Cray I was intended to be used basically in *batch* mode. Having
a huge vector register file in an ISA meant for *timesharing* is the
problem.
Perhaps what is really needed is VVM combined with some very good
cache hinting mechanisms. I don't have the expertise needed to work
that out, so I'll have to settle for something rather more kludgey
instead.
Of course, if a Cray I is a *batch* processing computer, that sort
of justifies the notion I came up with earlier - in a thread I
aptly titled "A Very Bad Idea" - of making a Cray I-like CPU with
vector registers an auxilliary processor after the fashion of those
in the IBM/Sony CELL processor. But one wants high-bandwidth access
to DRAM, not no access to DRAM!
The NEC SX-Aurora TSUBASA solves the issue by putting all its DRAM
inside a module that looks a lot like a video card. You just have to
settle for 48 gigabytes of memory that won't be expandable.
Some database computers, of course, have as much as a terabyte of
DRAM - which used to be the size of a large magnetic hard drive.
People who can afford a terabyte of DRAM can also afford an eight-channel
memory bus, so it should be possible to manage something.
John Savard
MitchAlsup1
2024-02-15 19:12:00 UTC
Permalink
Post by Marcus
Post by Quadibloc
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
My current vision (not MRISC32), which is a very simple
microcontroller type implementation (basically in the same ballpark as
Cortex-M or small RV32I implementations), would have a relatively
limited vector register file.
* https://gitlab.com/-/snippets/3673883
In particular, pay attention to the sections "Vector state on context
switches" and "Thread context".
My idea is not new, but I think that it takes some old ideas a few steps
further. So here goes...
There are four vector registers (V1-V4), each consisting of 8 x 32 bits,
for a grand total of 128 bytes of vector thread context state. To start
with, this is not an enormous amount of state (it's the same size as the
integer register file of RV32I).
Each vector register is associated with a "vector in use" flag, which is
set as soon as the vector register is written to.
The novel part (AFAIK) is that all "vector in use" flags are cleared as
soon as a function returns (rts) or another function is called (bl/jl),
which takes advantage of the ABI that says that all vector registers are
scratch registers.
I then predict that the ISA will have some sort of intelligent store
and restore state instructions, that will only waste memory cycles
for vector registers that are marked as "in use". I also predict that
most vector registers will be unused most of the time (except for
threads that use up 100% CPU time with heavy data processing, which
should hopefully be in minority - especially in the kind of systems
where you want to put a microcontroller style CPU).
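
For concreteness, the "intelligent store" described above might look roughly
like this in the context-switch path. This is only a software model, using the
sizes given above (four registers of 8 x 32 bits); the struct layout and
function name are invented for the sketch:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define NVREG  4                 /* V1-V4                    */
#define VBYTES 32                /* 8 x 32 bits per register */

struct vstate {                  /* model of the architected state       */
    uint8_t in_use;              /* one "vector in use" bit per register */
    uint8_t v[NVREG][VBYTES];
};

/* Store only the registers whose "in use" flag is set; the flags are
 * always saved so the matching restore knows what to load back. */
size_t save_vector_state(const struct vstate *hw, uint8_t *ctx)
{
    size_t n = 0;
    ctx[n++] = hw->in_use;
    for (int r = 0; r < NVREG; r++)
        if (hw->in_use & (1u << r)) {
            memcpy(ctx + n, hw->v[r], VBYTES);
            n += VBYTES;
        }
    return n;                    /* 1..129 bytes instead of always 129 */
}

The matching restore would read the flag byte first and load back only the
registers it names.
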
VVM is designed such that even ISRs can use the vectorized parts of the
implementation. Move data, clear pages, string.h, ... so allowing GuestOSs
to use vectorization falls out for free.
Post by Marcus
I do not yet know if this will fly, though...
Quadibloc
2024-02-16 04:10:21 UTC
Permalink
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.

Why not just use Mitch Alsup's wonderful VVM?

It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).

But I remain skeptical, because the historical precedent seems to indicate
otherwise, and because data forwarding, while very definitely a good thing
(and, indeed, necessary to have for best performance _on_ a vector register
machine too), has its limits.

What _could_ substitute for vector registers isn't data forwarding,
it's the cache, since that does the same thing vector registers do:
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.

My first question, though, is whether or not we can really make caches
that good.

But skepticism about VVM isn't actually helpful if Cray-style vectors
can no longer be made to work given current memory speeds.

The basic way in which I originally felt I could make it work was really
quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that
access the vector registers.

The details of how one may have to make use of that capability... well,
that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!
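
A sketch of that software policy, with invented names (PSW.VEN for the enable
bit, plus a single-owner bookkeeping routine); nothing here is from a real OS:

#include <stdbool.h>

static int vector_owner = -1;     /* pid of the one process allowed to
                                     touch the vector registers        */

/* Called when a process asks to run vector code.  On success the
 * scheduler would set PSW.VEN (the hypothetical enable bit) whenever
 * this pid runs, and keep it clear for everyone else, so no other
 * process can ever clobber the vector register file. */
bool request_vector_unit(int pid)
{
    if (vector_owner != -1 && vector_owner != pid)
        return false;             /* vectors already granted elsewhere */
    vector_owner = pid;
    return true;
}

void release_vector_unit(int pid)
{
    if (vector_owner == pid)
        vector_owner = -1;
}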

Hey, the GPU in a computer these days is also a singular resource.

Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.

Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style
vectors, so that the best possible vector performance for conventional
non-batch programs is also available.

Now, what would I think of as being "something very much like VVM" without
actually being VVM?

Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.

So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.

So my substitute for VVM should now be obvious - explicit memory-to-memory
vector instructions, like on an old STAR-100.

John Savard
Anton Ertl
2024-02-16 07:27:36 UTC
Permalink
Post by Quadibloc
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop

for (i=0; i<n; i++) {
double b = a[i];
c[i] = b;
d[i] = b;
}

a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?), and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures. I think he also allows
recurrences (in particular, reductions), but I don't understand how
his hardware auto-vectorizes that; e.g.:

double r=0.0;
for (i=0; i<n; i++)
r += a[i];

This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
Post by Quadibloc
So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember,
My understanding is that there is no need to remember much. Just
remember that it has to be a simple loop, and mark it. But, as in all
auto-vectorization schemes, it works better in some cases than
in others.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Stephen Fuld
2024-02-16 13:11:42 UTC
Permalink
Post by Anton Ertl
Post by Quadibloc
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
double b = a[i];
c[i] = b;
d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Post by Anton Ertl
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
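
To make "could be parallelized" concrete: max is associative, so a programmer
(or a vectorizer that is allowed to reorder) can keep several independent
partial maxima, which the strictly ordered FP sum above cannot do without
changing the rounding. A plain C illustration, not VVM code:

#include <stddef.h>

/* 4-way split max reduction; requires n >= 1. */
double max_reduce(const double *a, size_t n)
{
    double m0 = a[0], m1 = a[0], m2 = a[0], m3 = a[0];
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {          /* four independent chains */
        if (a[i]     > m0) m0 = a[i];
        if (a[i + 1] > m1) m1 = a[i + 1];
        if (a[i + 2] > m2) m2 = a[i + 2];
        if (a[i + 3] > m3) m3 = a[i + 3];
    }
    for (; i < n; i++)                    /* leftover elements       */
        if (a[i] > m0) m0 = a[i];
    if (m1 > m0) m0 = m1;                 /* combine the partials    */
    if (m2 > m0) m0 = m2;
    if (m3 > m0) m0 = m3;
    return m0;
}
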
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup1
2024-02-16 18:53:00 UTC
Permalink
Post by Stephen Fuld
Post by Anton Ertl
Post by Quadibloc
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
I don't think that's a proper characterization of VVM. One advantage
that vector registers have over memory-memory machines is that vector
registers, once loaded, can be used several times. And AFAIK VVM has
that advantage, too. E.g., if you have the loop
for (i=0; i<n; i++) {
double b = a[i];
c[i] = b;
d[i] = b;
}
a[i] is loaded only once (also in VVM), while a memory-memory
formulation would load a[i] twice. And on the microarchitectural
level, VVM may work with vector registers, but the nice part is that
it's only microarchitecture, and it avoids all the nasty consequences
of making it architectural, such as more expensive context switches.
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Bookends on the loop provide the information the HW needs, the VEC
instruction at the top provides the IP for the LOOP instruction at
the bottom to branch to, and also provides a bit map of registers
which are live-out of the loop, discarding other used loop registers.
Post by Stephen Fuld
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
All of that is correct.
Post by Stephen Fuld
Post by Anton Ertl
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Predicated control flow--yes, branch flow-control no.
Post by Stephen Fuld
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
Right, register and memory dependencies are observed and obeyed. So,
in the above loop, the recurrence slows the loop down to the latency of
FADD, but the LD, ADD-CMP-BC run concurrently; so, you are still faster
than if you ran the loop without VVM.
MitchAlsup1
2024-02-16 18:57:11 UTC
Permalink
Post by Stephen Fuld
Post by Anton Ertl
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty. Maybe we'll see it in some Intel or AMD CPU
in the coming years.
Post by Stephen Fuld
Post by Anton Ertl
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
Post by Stephen Fuld
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware. For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
double r;
double r0=0.0;
....
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
.... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
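
Spelled out with four accumulators instead of sixteen (and with the tail
handled), the hand reassociation sketched above looks like this. It is a
programmer-visible transformation, and it does change the rounding relative to
the sequential loop, which is exactly why hardware cannot apply it silently:

#include <stddef.h>

double sum4(const double *a, size_t n)
{
    double r0 = 0.0, r1 = 0.0, r2 = 0.0, r3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     /* four independent FADD chains */
        r0 += a[i];
        r1 += a[i + 1];
        r2 += a[i + 2];
        r3 += a[i + 3];
    }
    double r = (r0 + r1) + (r2 + r3);
    for (; i < n; i++)               /* deal with the remaining iterations */
        r += a[i];
    return r;
}
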
VVM is also adept at vectorizing str* and mem* functions from the C
library, and as such, you have to do it in a way that even ISRs can
use VVM (when it is to their advantage).
- anton
Stephen Fuld
2024-02-16 19:03:26 UTC
Permalink
Post by Stephen Fuld
Post by Anton Ertl
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
IANAHG, but if it were that simple, I would think Mitch would have
implemented it that way.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
One can hope!
Post by Stephen Fuld
Post by Anton Ertl
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
OK, but my point is that you can do conditional execution within a VVM loop.
Post by Stephen Fuld
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware
recognizing that the loop contains no instructions that can't be
parallelized. There are also some issues like doing a sum of signed
integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
Agreed.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup
2024-02-16 23:22:08 UTC
Permalink
Post by Stephen Fuld
Post by Stephen Fuld
Post by Anton Ertl
Post by Quadibloc
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
My understanding is that he requires explicit marking (why?),
Of course, Mitch can answer for himself, but ISTM that the explicit
marking allows a more efficient implementation, specifically the
instructions in the loop can be fetched and decoded only once, it allows
the HW to elide some register writes, and saves an instruction by
combining the loop count decrement and test and the return branch into a
single instruction. Perhaps the HW could figure out all of that by
analyzing a "normal" instruction stream, but that seems much harder.
Compared to the rest of the VVM stuff, recognizing it in hardware does
not add much difficulty.
IANAHG, but if it were that simple, I would think Mitch would have
implemented it that way.
Maybe we'll see it in some Intel or AMD CPU
in the coming years.
One can hope!
Post by Stephen Fuld
Post by Anton Ertl
and that
the loop can do almost anything, but (I think) it must be a simple
loop without further control structures.
It allows predicated instructions within the loop
Sure, predication is not a control structure.
OK, but my point is that you can do conditional execution within a VVM loop.
Post by Stephen Fuld
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
From what I understand, while you can do reductions in a VVM loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention. That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure. ISTM, and again, IANAHG, that the problem for VVM is the hardware
recognizing that the loop contains no instructions that can't be
parallelized. There are also some issues like doing a sum of signed
integer values and knowing whether overflow occurred, etc. The
programmer may know that overflow cannot occur, but the HW doesn't.
The HW does not need preceding knowledge. If an exception happens, the
vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
Post by Stephen Fuld
For FP addition that should give the same
result as the sequential code, it's probably much harder. Of course,
double r;
double r0=0.0;
...
double r15=0.0;
for (i=0; i<n-15; i+=16) {
r0 += a[i];
...
r15 += a[i+15];
}
... deal with the remaining iterations ...
r = r0+...+r15;
But then the point of auto-vectorization is that the programmers are
unaware of what's going on behind the curtain, and that promise is not
kept if they have to write code like above.
Agreed.
Terje Mathisen
2024-02-17 09:34:05 UTC
Permalink
snip
Post by MitchAlsup
Post by Anton Ertl
I think he also allows
recurrences (in particular, reductions), but I don't understand how
double r=0.0;
for (i=0; i<n; i++)
    r += a[i];
This is particularly nasty given that FP addition is not associative;
but even if you allow fast-math-style reassociation, doing this in
hardware seems to be quite a bit harder than the rest of VVM.
 From what I understand, while you can do reductions in a VVM
loop, and
it takes advantage of wide fetch etc., it doesn't auto parallelize the
reduction, thus avoids the problem you mention.  That does cost
performance if the reduction could be parallelized, e.g. find the max
value in an array.
My feeling is that, for max it's relatively easy to perform a wide
reduction in hardware.
Sure.  ISTM, and again, IANAHG, that the problem for VVM is the
hardware recognizing that the loop contains no instructions that
can't be parallelized.  There are also some issues like doing a sum
of signed integer values and knowing whether overflow occurred,
etc.  The programmer may know that overflow cannot occur, but the HW
doesn't.
The HW does not need preceding knowledge. If an exception happens, the
vectorized loop collapses into a scalar loop precisely, and can be
handled in the standard fashion.
I think you might have missed my point.  If you are summing the signed
integer elements of an array, whether you get an overflow or not can
depend on the order the additions are done.  Thus, without knowledge
that only the programmer has (i.e. that with the size of the actual data
used, overflow is impossible) the hardware cannot parallelize such an
operation.  If the programmer knows that overflow cannot occur, he has
no way to communicate that to the VVM hardware, such that the HW could
parallelize the summation.
I am not sure, but I strongly believe that VVM cannot be caught out this
way, simply because it would observe the accumulator loop dependency.

I.e. it could do all the other loop instructions (load/add/loop counter
decrement & branch) completely overlapped, but the actual adds to the
accumulator register would limit total throughput to the ADD-to-ADD latency.

So, on the first hand, VVM cannot automagically parallelize this to use
multiple accumulators, on the other hand a programmer would be free to
use a pair of wider accumulators to sidestep the issue.

On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Anton Ertl
2024-02-17 18:03:53 UTC
Permalink
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?

If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.

If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Terje Mathisen
2024-02-17 18:58:19 UTC
Permalink
Post by Anton Ertl
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Sorry to be unclear:

I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.

int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}

will overflow if called with data = [127, 1, -2], right?

while if you implement it with

int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}

then you would be OK, and the final result would be mathematically correct.

For this particular example, you would also get the correct answer with
wrapping arithmetic, even if that by default is UB in modern C.
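
A quick way to see both behaviours side by side: the two functions above,
restated with invented names sum8/sumw so they can share one file. The
out-of-range conversion in sum8 is implementation-defined, but wraps on every
common implementation:

#include <stdio.h>
#include <stdint.h>

static int8_t sum8(int len, int8_t data[])   /* int8_t accumulator */
{
    int8_t s = 0;
    for (int i = 0; i < len; i++) s += data[i];
    return s;
}

static int8_t sumw(int len, int8_t data[])   /* int accumulator    */
{
    int s = 0;
    for (int i = 0; i < len; i++) s += data[i];
    return (int8_t) s;
}

int main(void)
{
    int8_t d[] = {127, 1, -2};
    /* sum8 stores 128 into an int8_t mid-loop (wraps to -128 on the
       usual implementations); sumw keeps 128 in the int accumulator.
       With wrapping, both end up printing 126, which matches the remark
       above about this particular input. */
    printf("sum8 = %d, sumw = %d\n", sum8(3, d), sumw(3, d));
    return 0;
}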

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-02-17 20:03:01 UTC
Permalink
Post by Terje Mathisen
Post by Anton Ertl
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
Yes, and it should not be vectorized when your vector resource has
CRAY-like vector registers--however, it can be vectorized with VVM
like resources.
Post by Terje Mathisen
while if you implement it with
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
when len > 2^24 it may still not be mathematically correct for 32-bit ints,
or len > 2^56 for 64-bit ints.
Post by Terje Mathisen
For this particular example, you would also get the correct answer with
wrapping arithmetic, even if that by default is UB in modern C.
Terje
Tim Rentsch
2024-02-17 23:36:31 UTC
Permalink
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
MitchAlsup1
2024-02-18 01:03:23 UTC
Permalink
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
Tim Rentsch
2024-02-18 04:01:21 UTC
Permalink
Post by MitchAlsup1
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Opus
2024-02-18 08:24:23 UTC
Permalink
Post by Tim Rentsch
Post by MitchAlsup1
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Of course the conversion will be done implicitly. C converts almost
anything implicitly. Not that this is its greatest feature.

The explicit cast is still useful: 1/ to express intent (it shows that
the potential loss of data is intentional) and then 2/ to avoid compiler
warnings (if you enable -Wconversion, which I usually recommend) or
warning from any serious static analyzer too (which I highly recommend
using too).
Scott Lurndal
2024-02-18 16:10:52 UTC
Permalink
Post by Tim Rentsch
Post by MitchAlsup1
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
I suspect most experienced C programmers know that.

Yet, the 'superfluous' cast is also documentation that the
programmer _intended_ that the return value would be narrowed.
MitchAlsup1
2024-02-18 17:48:09 UTC
Permalink
Post by Tim Rentsch
Post by MitchAlsup1
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
But the return statement is where overflow (if any) is detected.
The cast is superfluous because a conversion to int8_t will be
done in any case, since the return type of the function is
int8_t.
Missing my point:: which was::

The summation loop will not overflow, and overflow is detected at
the smash from int to int8_t.
Terje Mathisen
2024-02-18 10:26:09 UTC
Permalink
Post by Tim Rentsch
Post by Terje Mathisen
[...]
int8_t sum(int len, int8_t data[])
{
int s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return (int8_t) s;
}
The cast in the return statement is superfluous.
I am normally writing Rust these days, where UB is far less common, but
casts like this are mandatory.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Anton Ertl
2024-02-18 07:47:13 UTC
Permalink
Post by Terje Mathisen
Post by Anton Ertl
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
...
Post by Terje Mathisen
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
As mentioned, Java defines addition of the integral base types to use
modulo (aka wrapping) arithmetic, i.e., overflow is fully defined with
nice properties. In particular, the associative law holds for modulo
addition, which allows all kinds of reassociations that are helpful
for parallelizing reduction.
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
int8_t s = 0;
for (unsigned i = 0; i < len; i++) {
s += data[i];
}
return s;
}
will overflow if called with data = [127, 1, -2], right?
I don't think that int8_t or unsigned are Java types.

If that is C code: C standard lawyers will tell you what the C
standard says about storing 128 into s (the addition itself does not
overflow, because it uses ints).
Post by Terje Mathisen
For this particular example, you would also get the correct answer with
wrapping arithmetic, even if that by default is UB in modern C.
The standardized subset of C is not relevant for discussing Java.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
David Brown
2024-02-19 13:11:51 UTC
Permalink
Post by Terje Mathisen
Post by Anton Ertl
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
Post by Terje Mathisen
I was specifically talking about adding a bunch of integers together,
some positive and some negative, so that by doing them in program order
you will get an overflow, but if you did them in some other order, or
with a double-wide accumulator, the final result would in fact fit in
the designated target variable.
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
    s += data[i];
  }
  return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow. The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is
outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)

All real-life implementations will define the conversion as
modulo/truncation/wrapping, however you prefer to think of it, though it
is not specified in the standards.
Post by Terje Mathisen
while if you implement it with
int8_t sum(int len, int8_t data[])
{
  int s = 0;
  for (unsigned i = 0; i < len; i++) {
    s += data[i];
  }
  return (int8_t) s;
}
then you would be OK, and the final result would be mathematically correct.
Converting the "int" to "int8_t" will give the correct value whenever it
is in the range of int8_t. But if we assume that the implementation
does out-of-range conversions as two's complement wrapping, then the
result will be the same no matter when the modulo operations are done.
Post by Terje Mathisen
For this particular example, you would also get the correct answer with
wrapping arithmetic, even if that by default is UB in modern C.
There's no UB in either case. Only IB.
Terje Mathisen
2024-02-19 22:21:57 UTC
Permalink
Post by David Brown
Post by Anton Ertl
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(

If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?

What I was trying to illustrate was the principle that by having a wider
accumulator you could aggregate a series of numbers, both positive and
negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.

I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-02-20 01:10:23 UTC
Permalink
Post by Terje Mathisen
Post by David Brown
Post by Anton Ertl
On the third (i.e gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing
performance and making it that much harder to tickle the overflow exception.
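
Written out at the source level, the transformation amounts to something like
the following sketch (names invented; the equivalence of the two assumes
wrapping semantics, e.g. -fwrapv or unsigned arithmetic, for the narrow
version):

#include <stdint.h>
#include <stddef.h>

int32_t sum_narrow(const int32_t *a, size_t n)
{
    int32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                /* result narrowed every iteration */
    return s;
}

int32_t sum_wide(const int32_t *a, size_t n)   /* the hoisted version */
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                /* accumulate in 64 bits            */
    return (int32_t)s;            /* smash back to int exactly once   */
}
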
Post by Terje Mathisen
What I was trying to illustrate was the principle that by having a wider
accumulator you could aggregate a series of numbers, both positive and
negative, and get the correct (in-range) result, even if the input
happened to be arranged in such a way that it would temporarily overflow
the target int type.
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)

We are also in an era where the dust on dusty decks is starting to show
its accumulated depth.
Post by Terje Mathisen
I think it is much better to do it this way and then get a conversion
size trap at the very end when/if the final sum is in fact too large for
the result type.
Same argument holds for Kahan-Babuška summation.
Post by Terje Mathisen
Terje
Anton Ertl
2024-02-20 07:32:40 UTC
Permalink
Post by MitchAlsup1
Post by Terje Mathisen
If we have a platform where the default integer size is 32 bits and long
is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily:: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
What "accumulation rules"?

Certainly with twos-complement modulo arithmetic, the following
distributive law holds:

(a+b) mod 2^n = ((a mod 2^n) + (b mod 2^n)) mod 2^n

and this also holds for a "signed mod" operator smod that represents
the congruence classes modulo 2^n by the numbers -2^(n-1)..2^(n-1)-1
instead of 0..2^n-1. I actually would prefer to write the equivalence
above as a congruence modulo 2^n, which would avoid the need to
explain that separately, but I don't see a good way to do it. Maybe:

a+b is congruent with (a mod 2^n) + (b mod 2^n) modulo 2^n

but of course this still uses the mod operator that produces values
0..2^n-1.

We also have:

a is congruent to a mod 2^m modulo 2^n if m>=n

So, the result is that, yes, if we only have a wider addition
instruction, and need a narrower result at some point, we can
undistribute the narrowing operation (sign extension or zero
extension) and just apply it to the end result.
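
A tiny spot check of that undistribution for n = 32, using unsigned arithmetic
so every step is well defined in C:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t a = 0x123456789abcdef0u, b = 0xfedcba9876543210u;
    uint32_t narrow_last  = (uint32_t)(a + b);          /* narrow the sum   */
    uint32_t narrow_first = (uint32_t)a + (uint32_t)b;  /* narrow, then add */
    printf("%08x %08x\n", (unsigned)narrow_last, (unsigned)narrow_first);
    return 0;                     /* prints the same value twice */
}
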
Post by MitchAlsup1
In effect, this hoists the smashes back to int out of the loop, increasing
performance and making it that much harder to tickle the overflow exception.
What overflow exception? I have yet to see a C compiler use an
addition instruction that causes an overflow exception on integer
overflow, unless specifically asked to do so with -ftrapv. And if the
programmer explicitly asked for traps on overflow, then the C compiler
should not try to "optimize" them away. Note that the stuff above is
true for modulo arithmetic, not (in general) for trapping arithmetic.
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s. We have
been paying with additional sign-extension and zero-extension
operations ever since then, and it has even deformed architectures:
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Anton Ertl
2024-02-20 12:00:29 UTC
Permalink
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s. We have
been paying with additional sign-extension and zero-extension
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit
arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64. RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom. After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes). But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the
compressed extension (16-bit encodings of frequent instructions).

BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either. E.g., consider the function

void sext(int M, int *ic, int *is)
{
int k;
for (k = 1; k <= M; k++) {
ic[k] += is[k];
}
}

which is based on the only loop (from 456.hmmer) in SPECint 2006 where
the difference between -fwrapv and the default produces a measurable
performance difference (as reported in section 3.3 of
<https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
created variations of this function, where the types of M and k were
changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
looks as follows on RV64GC:

int                   unsigned              (u)intptr_t
.L3:                  .L8:                  .L15:
slli a5,a4,0x2        slli a5,a4,0x20       lw a5,0(a1)
add a6,a1,a5          srli a5,a5,0x1e       lw a4,4(a2)
add a5,a5,a2          add a6,a1,a5          addi a1,a1,4
lw a3,0(a6)           add a5,a5,a2          addi a2,a2,4
lw a5,0(a5)           lw a3,0(a6)           addw a5,a5,a4
addiw a4,a4,1         lw a5,0(a5)           sw a5,-4(a1)
addw a5,a5,a3         addiw a4,a4,1         bne a2,a3,54 <.L15>
sw a5,0(a6)           addw a5,a5,a3
bge a0,a4,6 <.L3>     sw a5,0(a6)
                      bgeu a0,a4,28 <.L8>

There is no difference between the intptr_t loop body and the
uintptr_t loop. And without -fwrapv, the int loop looks just like the
(u)intptr_t loop (because the C compiler then assumes that signed
integer overflow does not happen).

So, if you don't have a specific reason to choose int or unsigned,
better use intptr_t or uintptr_t, respectively. In this way you can
circumvent some of the damage that I32LP64 has done.
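
For reference, variation c) is just the function above with the types of M and
k changed; only the function name here is mine:

#include <stdint.h>

void sext_intptr(intptr_t M, int *ic, int *is)
{
    intptr_t k;
    for (k = 1; k <= M; k++) {    /* no 32->64-bit extension needed in the loop */
        ic[k] += is[k];
    }
}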

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
David Brown
2024-02-20 16:25:18 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
Post by Anton Ertl
Post by Anton Ertl
We have
been paying with additional sign-extension and zero-extension
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit
arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64. RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom. After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes). But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the
compressed extension (16-bit encodings of frequent instructions).
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need). These adapt suitably for different
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
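
For example (a sketch; on a typical I32LP64 glibc system int_fast32_t happens
to be 64 bits wide, on other systems it may stay at 32):

#include <stdint.h>

void scale(int_fast32_t n, int32_t *a)
{
    for (int_fast32_t i = 0; i < n; i++)   /* index in a "fast" type */
        a[i] *= 2;
}
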
Post by Anton Ertl
E.g., consider the function
void sext(int M, int *ic, int *is)
{
int k;
for (k = 1; k <= M; k++) {
ic[k] += is[k];
}
}
which is based on the only loop (from 456.hmmer) in SPECint 2006 where
the difference between -fwrapv and the default produces a measurable
performance difference (as reported in section 3.3 of
<https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>). I
created variations of this function, where the types of M and k were
changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops". The loop body
int                   unsigned              (u)intptr_t
slli a5,a4,0x2        slli a5,a4,0x20       lw a5,0(a1)
add a6,a1,a5          srli a5,a5,0x1e       lw a4,4(a2)
add a5,a5,a2          add a6,a1,a5          addi a1,a1,4
lw a3,0(a6)           add a5,a5,a2          addi a2,a2,4
lw a5,0(a5)           lw a3,0(a6)           addw a5,a5,a4
addiw a4,a4,1         lw a5,0(a5)           sw a5,-4(a1)
addw a5,a5,a3         addiw a4,a4,1         bne a2,a3,54 <.L15>
sw a5,0(a6)           addw a5,a5,a3
bge a0,a4,6 <.L3>     sw a5,0(a6)
                      bgeu a0,a4,28 <.L8>
There is no difference between the intptr_t loop body and the
uintptr_t loop. And without -fwrapv, the int loop looks just like the
(u)intptr_t loop (because the C compiler then assumes that signed
integer overflow does not happen).
So, if you don't have a specific reason to choose int or unsigned,
better use intptr_t or uintptr_t, respectively. In this way you can
circumvent some of the damage that I32LP64 has done.
I would say the takeaways here are :

If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.

Don't use unsigned types for counters and indexes unless you actually
need them. Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing. (And
it lets you check for overflow bugs using run-time sanitizers.)

It is not just RISCV - this advice applies to all the 64-bit
architectures I tried on <https://godbolt.org>.
Scott Lurndal
2024-02-20 16:27:56 UTC
Permalink
Post by David Brown
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
Anton Ertl
2024-02-20 18:27:40 UTC
Permalink
Post by Scott Lurndal
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?

In any case, I32LP64 caused breakage for my code, and I expect that
there was more code around with the assumption sizeof(int)==sizeof(int
*) than with the assumption sizeof(int)==4. Of course, we worked
around this misfeature of the C compilers on Digital OSF/1, but those
who assumed sizeof(int)==4 would have adapted their code if the
decision had been for ILP64.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-02-20 19:12:39 UTC
Permalink
Post by Anton Ertl
Post by Scott Lurndal
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
Most of the problems encountered when moving unix (System V)
from 16-bit to 32-bit were more around missing typedefs for certain
data types (e.g. uids, gids, pids, etc), so there was
a lot of code that declared these as shorts, but the 32-bit
kernels defined these as 32-bit (unsigned) values.

That's when uid_t, pid_t, gid_t were added.

Then there were the folks who used 'short [int]' instead of 'int'
since they were the same size on the PDP-11.

The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
MitchAlsup1
2024-02-20 21:03:59 UTC
Permalink
Post by Scott Lurndal
Post by Anton Ertl
Post by Scott Lurndal
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
Most of the problems encountered when moving unix (System V)
from 16-bit to 32-bit were more around missing typedefs for certain
data types (e.g. uids, gids, pids, etc), so there was
a lot of code that declared these as shorts, but the 32-bit
kernels defined these as 32-bit (unsigned) values.
That's when uid_t, pid_t, gid_t were added.
Then there were the folks who used 'short [int]' instead of 'int'
since they were the same size on the PDP-11.
This is a good reason to build a time machine and send back a few
of the Sopranos to take care of those programmers.
Post by Scott Lurndal
The Unix code ported relatively easily to I32LP64 because uintptr_t
had been used extensively rather than assumptions about
sizeof(int) == sizeof(int *).
At least that shows a sense of forward progress.
MitchAlsup1
2024-02-20 21:01:45 UTC
Permalink
Post by Anton Ertl
Post by Scott Lurndal
Or maybe changing int from 32-bit to 64-bit would have caused
as many (or likely more) problems as changing from 16-bit to 32-bit did back in the
day.
In Unix sizeof(int) == sizeof(int *) on both 16-bit and 32-bit
architectures. Given the history of C, that's not surprising: BCPL
and B have a single type, the machine word, and it eventually became
C's int. You see this in "int" declarations being optional in various
places. So code portable between 16-bit and 32-bit systems could not
assume that int has a specific size (such as 32 bits), but if it
assumed that sizeof(int) == sizeof(int *), that would port fine
between 16-bit and 32-bit Unixes. There may have been C code that
assumed that sizeof(int)==4, but why cater to this kind of code which
did not even port to 16-bit systems?
In any case, I32LP64 caused breakage for my code, and I expect that
# define int int64_t
# define long int64_t
# define unsigned uint64_t
# define ...
Post by Anton Ertl
there was more code around with the assumption sizeof(int)==sizeof(int
*) than with the assumption sizeof(int)==4. Of course, we worked
around this misfeature of the C compilers on Digital OSF/1, but those
who assumed sizeof(int)==4 would have adapted their code if the
decision had been for ILP64.
- anton
Anton Ertl
2024-02-20 17:47:37 UTC
Permalink
Post by David Brown
Post by Anton Ertl
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
I know of no implementation of a 64-bit architecture where ALU operations
(except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
their reasons:

Guess 1: There was more software that depended on sizeof(int)==4 than
software that depended on sizeof(int)==sizeof(char *).

Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.

My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor. I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
Post by David Brown
Post by Anton Ertl
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need).
Sure, and then the program might break when an array has more than 2^31
elements; or it might work on one platform and break on another one.

By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason why to use
something else for a loop counter.

For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough. So I don't see a reason for int_fast32_t etc.
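I.e., something like this (a minimal sketch):

#include <stdint.h>

/* Exact-width type only where many values are stored; GPR-width type
   for the counter/index, so the loop needs no extra extension. */
int32_t sum32(const int32_t *a, intptr_t n)
{
    int32_t s = 0;
    for (intptr_t i = 0; i < n; i++)
        s += a[i];
    return s;
}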

Post by David Brown
These adapt suitably for different
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"? As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
Post by David Brown
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
Post by David Brown
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.

Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?

I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
Post by David Brown
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these
"sanitizers" work?

Anyway, I certainly have code that relies on modulo arithmetic.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-02-20 20:58:58 UTC
Permalink
Post by Anton Ertl
Post by David Brown
Post by Anton Ertl
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time? Or do you know any other reason? Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
I know of no implementation of a 64-bit architecture where ALU operations
(except maybe division, where present) are slower in 64 bits than in 32
bits. I would have chosen ILP64 at the time, so I can only guess at
The ALU operations were slower because, when you IADD x and y to give z,
if you then want z to have no significance above 32 bits you need
a second instruction to smash the value back into the container range.
So an ALU operation done in 64 bits takes 1 instruction, while one
restricted to 32 bits requires 2. This is the reason RISC-V has 32-bit
ALU instructions.
Post by Anton Ertl
Guess 1: There was more software that depended on sizeof(int)==4 than
software that depended on sizeof(int)==sizeof(char *).
There remains a lot of SW that is dependent on INT being smaller than LONG.
Post by Anton Ertl
Guess 2: When benchmarketing without adapting the source code (as is
usual), I32LP64 produced better numbers than ILP64 for some
benchmarks, because arrays and other data structures with int elements
are smaller and have better cache hit rates.
Some truth in here.
Post by Anton Ertl
My guess is that it was a mixture of 1 and 2, with 2 being the
decisive factor. I have certainly seen a lot of writing about how
64-bit (pointers) hurt performance, and it even led to the x32
nonsense (which never went anywhere, not surprising to me). These
days support for 32-bit applications is eliminated from ARM cores,
another indication that the performance advantages of 32-bit pointers
are minor.
Try programming array codes where the arrays are TB in size with P32.

My ONLY argument is that int was SUPPOSED to be the fastest arithmetic.
It was in PDP-11 days and VAX days, but it is no longer.
Post by Anton Ertl
Post by David Brown
Post by Anton Ertl
BTW, some people here have advocated the use of unsigned instead of
int. Which of the two results in better code depends on the
architecture. On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better. On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better. But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need).
The semantics of unsigned are better when you are twiddling bits than
when signed. My programming practice is to use unsigned everywhere that
a negative number is unexpected.
Post by Anton Ertl
Sure, and then the program might break when an array has more than 2^31
elements; or it might work on one platform and break on another one.
By contrast, with (u)intptr_t, on modern architectures you use the
type that's as wide as the GPRs. And I don't see a reason why to use
something else for a loop counter.
This is equivalent to my argument above--int SHOULD be the fastest kind
of arithmetic; sadly this tenet has been cast aside.
Post by Anton Ertl
For a type of which you store many in an array or other data
structure, you probably prefer int32_t rather than int_fast32_t if 32
bits is enough. So I don't see a reason for int_fast32_t etc.
Post by David Brown
These adapt suitably for different
targets. If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
Why do you bring up "long long int"? As for int64_t, that tends to be
slow (if supported at all) on 32-bit platforms, and it is more than
what is necessary for indexing arrays and for loop counters that are
used for indexing into arrays.
Post by David Brown
If you want fast local variables, use C's [u]int_fastN_t types. That's
what they are for.
I don't see a point in those types. What's wrong with (u)intptr_t IYO?
In reality, this is a dusty deck problem.
Post by Anton Ertl
Post by David Brown
Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.
Thank you for giving a demonstration for Scott Lurndal. I assume that
you claim to be a programmer.
Anyway, if I have made a mistake in my code, why would letting the
compiler assume that I did not make a mistake be a good thing?
There are more people willing to pay good money for a wrong answer fast
than there are willing to pay good money for a correct answer slow.
Post by Anton Ertl
I OTOH prefer if the compiler behaves consistently, so I use -fwrapv,
and for good performance, I write the code appropriately (e.g., by
using intptr_t instead of int).
Everything I do is inherently 64-bits except for interfaces to OS
services.
Post by Anton Ertl
Post by David Brown
(And
it lets you check for overflow bugs using run-time sanitizers.)
If the compiler assumes that overflow does not happen, how do these
"sanitizers" work?
Anyway, I certainly have code that relies on modulo arithmetic.
- anton
Brian G. Lucas
2024-02-20 19:18:20 UTC
Permalink
Post by Anton Ertl
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at the
time?  Or do you know any other reason?  Maybe some of the early 64-bit cpus
were faster at 32-bit, just as some early 32-bit cpus were faster at 16-bit.
Post by Anton Ertl
Post by Anton Ertl
We have
been paying with additional sign-extension and zero-extension
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit
arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64.  RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom.  After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes).  But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the
compressed extension (16-bit encodings of frequent instructions).
BTW, some people here have advocated the use of unsigned instead of
int.  Which of the two results in better code depends on the
architecture.  On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better.  On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better.  But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes, picked
to fit the range you need).  These adapt suitably for different targets.  If
you want to force the issue, then "int64_t" is IMHO clearer than "long long
int" and does not give a strange impression where you are using a type aimed at
pointer arithmetic for general integer arithmetic.
What I would like is a compiler flag that did "IFF when an int (or unsigned)
ends up in a register, promote it to the 'fast' type". This would be great
when compiling dusty C decks. (Was there ever C code on punched cards?)
Anton Ertl
2024-02-20 21:49:39 UTC
Permalink
Post by Brian G. Lucas
What I would like is a compiler flag that did "IFF when an int (or unsigned)
ends up in a register, promote it to the 'fast' type".
That used to be the point of C's promotion-to-int rule, until the
I32LP64 mistake. Now, despite architectural workarounds like RISC-V's
addw, we see the fallout of this mistake.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
BGB
2024-02-20 19:23:55 UTC
Permalink
Post by David Brown
Post by Anton Ertl
Post by Anton Ertl
Post by MitchAlsup1
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
C has been in that era since the bad I32LP64 decision of the people
who did the first 64-bit Unix compilers in the early 1990s.
I presume the main reason for this was the size and cost of memory at
the time?  Or do you know any other reason?  Maybe some of the early
64-bit cpus were faster at 32-bit, just as some early 32-bit cpus were
faster at 16-bit.
I never really questioned it, but then again, when I was starting out,
everything was ILP32, and going to I32LP64 made the most sense.

MSVC went from ILP32 to IL32P64, which was wonky. But, then again, some
amount of originally 16-bit code had used 'long' whenever they wanted a
32-bit value; seemingly more often than expecting 'long' to be the same
size as a pointer (the other common assumption).
Post by David Brown
Post by Anton Ertl
Post by Anton Ertl
We have
been paying with additional sign-extension and zero-extension
ARM A64 has addressing modes that include sign- or zero-extending a
32-bit index, and RISC-V's selected SLLI, SRLI, SRAI for their
compressed extension, probably because they are so frequent because
they are used in RISC-V's idioms for sign and zero extension.
Also, these architectures probably would not have the so-called 32-bit
arithmetic instructions (like RV64G's addw) if the mainstream C world
had decided to use ILP64.  RV64G could have left these instructions
away and replaced them with a sequence of add, slli, srli, i.e., a
64-bit addition followed by a sign-extension idiom.  After all, RISC-V
seems to favour sequences of more general instructions over having
more specialized instructions (and addressing modes).  But apparently
the frequency of 32-bit additions is so high thanks to I32LP64 that
they added addw and addiw to RV64G; and they even occupy space in the
compressed extension (16-bit encodings of frequent instructions).
In my case, I had dedicated instructions for sign and zero extension of
the various types.

But, yeah, like RISC-V, there are both 32 and 64 bit variants of many
instructions, mostly because some amount of code will break if an
overflow happens, and does not behave as it would on an authentic 32-bit
machine (IOW: wrap on overflow).


Though, some of this reminds me:
Another instruction SuperH had that was not carried over: ADDV,
which would do an ADD and update SR.T based on overflow.


But, this scenario was rare, and one could usually detect overflow like:
if (b >= 0)
{
    c = a + b;
    if (c < a) { ...overflow... }
}
else
{
    c = a + b;
    if (c > a) { ...overflow... }
}

Meanwhile, cases where one would need an instruction like ADDV would not
happen in C compiler output.
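Though, if one does not want to lean on wrap-on-overflow semantics at
all, the same check can be done purely in unsigned arithmetic (a minimal
sketch, 32-bit operands assumed):

#include <stdint.h>

/* Signed add overflows iff a and b have the same sign and the result's
   sign differs from it; doing the arithmetic in unsigned keeps every
   step well defined. */
static int add_overflows_s32(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t uc = ua + ub;              /* wraps mod 2^32 by definition */
    return (((ua ^ uc) & (ub ^ uc)) >> 31) != 0;
}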

For detecting 32-bit overflow, in premise one could also do something like:
ADD R4, R5, R3
EXTS.L R3, R2
CMPQEQ R3, R2
BF .overflow

Or, with JCMP extension (*1):
ADD R4, R5, R3
EXTS.L R3, R2
BRNE R3, R2, .overflow

Or, RV:
ADD A6, A0, A1
ADDW A7, A6, 0
BNE A6, A7, .overflow


*1: Seems I actually need this to more consistently match/beat RV
performance (unlike my prior estimates that this case was mostly
unnecessary and that 2-op sequences were sufficient).


But, more work is still needed.
And, it seemed, a recent ISA tweak to improve performance (by making the
immediate for AND signed) had unintentionally broken compiler output in
the Baseline+XGPR mode, as when writing the code, I had overlooked that
this case was N/E in XGPR mode (or, that it was effectively another "XG2
Only" feature).


Then I had to do a last-minute update to my GitHub project, as I had noted
that I had uploaded stuff with BGBCC producing broken output, which is
not ideal (and then hope that none of the other stuff I was working on
right at the moment breaks anything, as I had otherwise recently been
working a bit on trying to squeeze more performance out of BGBCC's
compiler output).


Well, since "barely faster than RISC-V in some cases" (or, potentially
losing in some other cases) isn't really a strong enough win as I see it...


But, it seems the margins are close enough that things like missing out on
12-15% of AND immediate values being negative, cycles spent in relative
comparisons for branches, etc., are enough to visibly affect results.

...
Post by David Brown
Post by Anton Ertl
BTW, some people here have advocated the use of unsigned instead of
int.  Which of the two results in better code depends on the
architecture.  On AMD64 where the so-called 32-bit instructions
perform a 32->64-bit zero-extension, unsigned is better.  On RV64G
where the so-called 32-bit instructions perform a 32->64-bit sign
extension, signed int is better.  But actually the best way is to use
a full-width type like intptr_t or uintptr_t, which gives better
results than either.
I would suggest C "fast" types like int_fast32_t (or other "fast" sizes,
picked to fit the range you need).  These adapt suitably for different
targets.  If you want to force the issue, then "int64_t" is IMHO clearer
than "long long int" and does not give a strange impression where you
are using a type aimed at pointer arithmetic for general integer arithmetic.
A lot of my code was written mostly in an alternate timeline where
"stdint.h" was mostly absent for a long time (cough, MSVC); so a
convention arose of using names like "s64" or "u64" instead for the
64-bit types (well, with similar for most of the other explicit sizes,
and other typedefs for things like trying to match the size of a
pointer, ...).

Where, say, C99 didn't really make its way over to the Windows side of
things until well over a decade after the fact (early on, I was using
Cygwin, which had gotten stuck with an old version of GCC for a fairly long
time).

Though, at least for things like the C and runtime libraries, arguably
it would make more sense to move over to C99 conventions.


Then again, even now, it is sometimes difficult to not be at least a
little weirded out by things like putting declarations anywhere other
than at the top of a function.
Post by David Brown
Post by Anton Ertl
 E.g., consider the function
void sext(int M, int *ic, int *is)
{
   int k;
   for (k = 1; k <= M; k++) {
     ic[k] += is[k];
   }
}
which is based on the only loop (from 456.hmmer) in SPECint 2006 where
the difference between -fwrapv and the default produces a measurable
performance difference (as reported in section 3.3 of
<https://people.eecs.berkeley.edu/~akcheung/papers/apsys12.pdf>).  I
created variations of this function, where the types of M and k were
changed to b) unsigned, c) intptr_t, d) uintptr_t and compiled the
code with "gcc -Wall -fwrapv -O3 -c -fno-unroll-loops".  The loop body
    int                   unsigned             (u)intptr_t
slli  a5,a4,0x2       slli  a5,a4,0x20       lw    a5,0(a1)
add   a6,a1,a5        srli  a5,a5,0x1e       lw    a4,4(a2)
add   a5,a5,a2        add   a6,a1,a5         addi  a1,a1,4
lw    a3,0(a6)        add   a5,a5,a2         addi  a2,a2,4
lw    a5,0(a5)        lw    a3,0(a6)         addw  a5,a5,a4
addiw a4,a4,1         lw    a5,0(a5)         sw    a5,-4(a1)
addw  a5,a5,a3        addiw a4,a4,1          bne   a2,a3,54 <.L15>
sw    a5,0(a6)        addw  a5,a5,a3
bge   a0,a4,6 <.L3>   sw    a5,0(a6)
                       bgeu  a0,a4,28 <.L8>
There is no difference between the intptr_t loop body and the
uintptr_t loop.  And without -fwrapv, the int loop looks just like the
(u)intptr_t loop (because the C compiler then assumes that signed
integer overflow does not happen).
So, if you don't have a specific reason to choose int or unsigned,
better use intptr_t or uintptr_t, respectively.  In this way you can
circumvent some of the damage that I32LP64 has done.
If you want fast local variables, use C's [u]int_fastN_t types.  That's
what they are for.
Don't use unsigned types for counters and indexes unless you actually
need them.  Don't use "-fwrapv" unless you actually need it - in most
code, if your arithmetic overflows, you have a mistake in your code, so
letting the compiler assume that will not happen is a good thing.  (And
it lets you check for overflow bugs using run-time sanitizers.)
The "-fwrapv" option is needed for a lot of old software to behave as
expected though. In BGBCC, I had adopted "-fwrapv" semantics as the
assumed default (well, along with "-fno-strict-aliasing", where TBAA
semantics are also kinda of a situation of "playing roulette").

But, I guess it could work if one had a way to set trap-on-overflow, along
with some way to get values back into int32 range (without itself
triggering a trap, or triggering some other UB).

Otherwise, it is mostly an issue of things like trying to hunt down the
cause of demo desync and similar in the Doom engine games, ...
Post by David Brown
It is not just RISCV - this advice applies to all the 64-bit
architectures I tried on <https://godbolt.org>.
Can say though that, for my ISA, one is better off not trying to use any
type larger than 32 bits as an array index...

There was an optional feature to allow this, but under the default
rules, the index is assumed to be 33 bits (covering both int32 and
uint32 range), with the assumption that one would need to fall back to
ALU operations for "long" indexing (like how array indexing in-general
works in RISC-V).

Though, this is more a case of saving some LUTs and getting slightly better
timing, when the need for a large array index is almost non-existent.


Well, at least excluding some future time where RAM is cheap enough that
one can justify using lots of multi-GB arrays...
David Brown
2024-02-20 12:42:22 UTC
Permalink
Post by MitchAlsup1
Post by Terje Mathisen
Post by David Brown
Post by Anton Ertl
On the third (i.e., gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
I haven't really been following this thread, but there's a few things
here that stand out to me - at least as long as we are talking about C.
I realized a bunch of messages ago that it was a bad idea to write
(pseudo-)C to illustrate a general problem. :-(
If we have a platform where the default integer size is 32 bits and
long is 64 bits, then afaik the C promotion rules will use int as the
accumulator size, right?
Not necessarily: accumulation rules allow the promotion of int->long
inside a loop if the long is smashed back to int immediately after the
loop terminates. A compiler should be able to perform this transformation.
In effect, this hoists the smashes back to int out of the loop, increasing
performance and making it that much harder to tickle the overflow exception.
A compiler can make any transformations it wants, as long as the final
observable behaviour is unchanged. And since signed integer arithmetic
overflow is undefined behaviour, the compiler can do whatever it likes
if that happens. So if your target has 32-bit int and that is the type
you use in the source code for the accumulator, the compiler can
certainly use a 64-bit integer type for implementation. It could also
use a double floating point type. All that matters is that if the user
feeds in numbers that never lead to a 32-bit signed integer overflow,
the final output is correct.
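A sketch of what that freedom means in practice (illustrative only):

/* The source says int, but on a 64-bit target the compiler may keep
   the running sum in a 64-bit register: a run whose sum stays in range
   cannot tell the difference, and a run that overflows has undefined
   behaviour anyway. */
int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}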

It is not normal to have exceptions on integer overflow. If a compiler
supports that as an extension (perhaps for debugging), then that is
giving defined behaviour to signed integer overflow, and now it is
observable - so the compiler cannot make optimisations that "make it
much harder to tickle". It would stop quite a few optimisations, in
fact - but certainly can be useful for debugging code.
Post by MitchAlsup1
Post by Terje Mathisen
What I was trying to illustrate was the principle that by having a
wider accumulator you could aggregate a series of numbers, both
positive and negative, and get the correct (in-range) result, even if
the input happened to be arranged in such a way that it would
temporarily overflow the target int type.
We are in an era where long has higher performance than ints (except for
cache footprint overheads.)
"long" on many systems (Windows, and all 32-bit systems - which I think
have now overtaken 8-bit systems as the biggest market segment based on
unit volumes) is the same size as "int". But assuming you mean that
64-bit arithmetic has higher performance than 32-bit arithmetic on
modern "big" processors, that is sometimes correct. As well as the
aforementioned cache and memory bandwidth differences, 32-bit can still
be faster for some types of operation (such as division) or if
operations can be vectorised. But it is not surprising that on 64-bit
systems, "int_fast32_t" is usually 64-bit, as that is typically faster
for most operations on local variables.
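For example (a sketch; the exact widths are the implementation's choice,
and the values in the comment are what a typical x86-64 glibc setup
reports):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    printf("int32_t: %zu bytes, int_fast32_t: %zu bytes\n",
           sizeof(int32_t), sizeof(int_fast32_t));   /* e.g. 4 and 8 */
    return 0;
}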
Thomas Koenig
2024-02-20 06:31:59 UTC
Permalink
Post by David Brown
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.

Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting; it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .

In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
Post by David Brown
Post by Terje Mathisen
    s += data[i];
  }
  return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
Depending on len and the data...
Post by David Brown
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is
outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
And that is one of the things that bugs me, in languages like C
and Fortran both.

Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
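For reference, the gcc/clang builtin looks like this in use (a minimal
sketch):

#include <stdio.h>

int main(void)
{
    int r;
    /* returns true if the mathematically correct sum does not fit in r;
       r then holds the wrapped result */
    if (__builtin_add_overflow(2000000000, 2000000000, &r))
        puts("overflow");
    else
        printf("%d\n", r);
    return 0;
}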
Anton Ertl
2024-02-20 08:15:22 UTC
Permalink
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.

Those ideas are that integer overflows do not happen and that a
competent programmer proactively prevents them from happening, by
sizing the types accordingly, and checking the inputs. Therefore,
there is no need to check if an addition overflows, and many
architectures do not provide an easy way:

* The old ones-complement and sign-magnitude machines trapped on
overflow AFAIK.

* S/360 has some mode bits that allow trapping on overflow, or setting
some bits in some register (where the meaning of these bits depends
on the instructions that set them).

* MIPS and Alpha provide trap-on-signed-overflow addition,
subtraction, and (Alpha) multiplication, but no easy way to check
whether one of these operations would overflow. In MIPS64r6 (2014)
MIPS finally added BOVC/BNVC, which branches if adding two registers
would produce a signed overflow. RISC-V eliminates the
trap-on-signed-overflow instructions, but otherwise follows the
original MIPS and Alpha approach to overflow.

Over time both unsigned and signed overflow have become more
important: High-level programming languages nowadays often support
Bignums (arbitrarily long (signed) integers), and cryptography needs
fixed-width wide arithmetics. For Bignums, you need to detect
(without trapping) overflows of signed-integer arithmetics, for wide
addition and subtraction (for both the wide path of Bignums, and for
cryptography), you need carry/borrow, i.e. unsigned
overflow/underflow.

And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
MIPS64r6 has added BOVC/BNVC, the 88000 and Power have a carry bit
(and, for Power, IIRC a sticky overflow bit) outside their usual
handling of comparison results.

But of course, C standardization is barely moving into the 1970s; from
what I read, they finally managed to standardize twos-complement
arithmetics (as present in the S/360 in 1964 and pretty much every
architecture that was not just an extension of an earlier architecture
since then). Leave some time until the importance of overflow
handling dawns on them; it was not properly present in the S/360, and
is not really supported in RISC-V (and in MIPS only since 2014), so
it's obviously much too early to standardize such things in the
standardized subset of C, a language subset that targeted
ones-complement and sign-magnitude dinosaurs until recently.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-02-20 14:46:10 UTC
Permalink
Post by Anton Ertl
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.

COBOL:
ADD 1 TO TALLY ON OVERFLOW ...

BPL:
IF OVERFLOW ...
Post by Anton Ertl
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Post by Anton Ertl
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960s had a flags register with an overflow bit.
Terje Mathisen
2024-02-20 15:17:21 UTC
Permalink
Post by Scott Lurndal
Post by Anton Ertl
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
ADD 1 TO TALLY ON OVERFLOW ...
IF OVERFLOW ...
Post by Anton Ertl
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Post by Anton Ertl
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960s had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.

Not only that, these cpus also had a dedicated single-byte opcode INTO
(hex 0xCE) to allow you to implement exception-style overflow handling
with very little impact on the mainline program: just emit that INTO
opcode directly after any program sequence where the compiler believed
that an overflow which should be handled might happen.

Terje
PS. INTO was removed in AMD64; I don't remember exactly what the opcode
was repurposed for.
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
David Brown
2024-02-20 16:28:45 UTC
Permalink
Post by Terje Mathisen
Post by Scott Lurndal
Post by Anton Ertl
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran.  Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language.  Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
    ADD 1 TO TALLY ON OVERFLOW ...
    IF OVERFLOW ...
Post by Anton Ertl
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Post by Anton Ertl
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960s had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and
superscaling, overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
Post by Terje Mathisen
Not only that, these cpus also had a dedicated single-byte opcode INTO
(hex 0xCE) to allow you to implement exception-style overflow handling
with very little impact on the mainline program: just emit that INTO
opcode directly after any program sequence where the compiler believed
that an overflow which should be handled might happen.
Terje
PS.INTO was removed in AMD64, I don't remember exactly what the opcode
was repurposed for?
Scott Lurndal
2024-02-20 16:37:39 UTC
Permalink
Post by David Brown
Post by Terje Mathisen
Post by Scott Lurndal
Post by Anton Ertl
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran.  Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language.  Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
    ADD 1 TO TALLY ON OVERFLOW ...
    IF OVERFLOW ...
Post by Anton Ertl
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Post by Anton Ertl
And this is reflected in more recent architectures: Many architectures
since about 1970 have a flags register with carry and overflow bits,
Architectures in the 1960s had a flags register with an overflow bit.
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC, as flag registers are a bottleneck for OOO and
superscaling, overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
ARM often has two encodings for the math instructions - one that sets the flags and one
that doesn't.
Anton Ertl
2024-02-20 18:42:17 UTC
Permalink
Post by David Brown
Post by Terje Mathisen
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have them for their 64-bit architectures.
Post by David Brown
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require
execution resources, however.
Post by David Brown
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision
arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-02-20 19:31:20 UTC
Permalink
Post by Anton Ertl
Post by David Brown
Post by Terje Mathisen
x86 has had an 'O' (Overflow) flags bit since the very beginning, along
with JO and JNO for Jump on Overflow and Jump if Not Overflow.
Many processors had something similar. But I think they fell out of
fashion for 64-bit RISC,
No, it didn't. All the RISCs that had flags registers for their
32-bit architectures still have it for their 64-bit architectures.
Post by David Brown
as flag registers are a bottleneck for OOO and
superscaling
No, it isn't, as demonstrated by the fact that architectures with
flags registers (AMD64, ARM A64) handily outperform architectures
without (but probably not because they have a flags register).
Implementing a flags register in an OoO microarchitecture does require
execution resources, however.
These implementations have several 200-man design teams and decades
of architectural and µArchitectural understanding that upstart
RISC designs cannot afford (until they <also> reach 100M chips sold
per year--where you reach the required amount of cubic dollars).

The fact that some designs with flags can perform at the top or near
the top is only indicative that flags are "not that much" of an
impediment to performance at the scale of current µprocessors
(3-4 wide)
Post by Anton Ertl
Post by David Brown
overflow is a lot less common for 64-bit arithmetic, and
people were not really using the flag except for implementation of
64-bit arithmetic.
That's nonsense. People use carry for implementing multi-precision
arithmetic (e.g., for cryptography) and for Bignums, and they use
overflow for implementing Bignums. And the significance of these
features has increased over time.
You can implement BigNums efficiently without a RAW serializing
carry bit in some control register (see My 66000 CARRY instruction-
modifier).
Post by Anton Ertl
- anton
Anton Ertl
2024-02-20 16:39:16 UTC
Permalink
Post by Scott Lurndal
Post by Anton Ertl
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition,
overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to
non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.

Supposedly other architectures have instructions that trap on signed
integer overflow; if you cannot disable that feature on them, you
cannot use overflow-checking programs; if you can disable that
feature, you have the same problem as the S/360.

In which case the question is: What is the result of such a silent
overflowing addition/subtraction/multiplication? On 2s-complement
architectures, the result likely is defined by modulo arithmetic, but
what about ones-complement and sign-magnitude machines?

Certainly the S/360 design indicates that the architect does not
expect an overflow to happen in regular execution, and that
overflow-checking in programs is not expected by the architect,
either.

Moreover, addition with carry-in was only added in ESA/390 in 1990.
So they certainly did not expect multi-precision arithmetic or Bignums
before then.
Post by Scott Lurndal
ADD 1 TO TALLY ON OVERFLOW ...
IF OVERFLOW ...
This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?
Post by Scott Lurndal
Post by Anton Ertl
Those ideas are that integer overflows do not happen and that a
Can't say that I've known a programmer who thought that way.
Just read into the discussions about the default treatment of signed
integer overflow by, e.g., gcc and clang. The line of argument goes
like this: The C standard does not define the behaviour on signed
overflow, therefore a C program must not have such overflows,
therefore it is not just correct but also desirable for a compiler to
assume that such overflows do not happen. I don't know whether people
who argue that way are programmers, but at least they pose as such on
the 'net.
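A minimal example of the consequence (illustrative; what actually happens
is up to the compiler and its options):

/* With the default options a compiler may fold this to "return 0",
   since x + 1 is assumed not to overflow; with -fwrapv it has to keep
   the comparison and return 1 for x == INT_MAX. */
int wraps_around(int x)
{
    return x + 1 < x;
}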
Post by Scott Lurndal
Architectures in the 1960s had a flags register with an overflow bit.
S/360 certainly has no dedicated overflow bit, just multi-purpose bits
that sometimes mean "overflow".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2024-02-20 17:24:55 UTC
Permalink
Post by Anton Ertl
Post by Scott Lurndal
Post by Anton Ertl
It seems to me that this is based on the ideas people in the old days
had about integer overflows, and these ideas are also reflected in the
architectures of old.
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition,
overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to
non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
The contemporaneous B3500 had a three-bit field for the 'flags'
called COMS/OVF (Comparison Toggles (COMH/COML) and Overflow Toggle).

The overflow toggle was sticky and only reset by the Branch on
Overflow (OFL) instruction.

There were no traps.
Post by Anton Ertl
Moreover, addition with carry-in was only added in ESA/390 in 1990.
So they certainly did not expect multi-precision arithmetic or Bignums
before then.
The B3500 was BCD, with one to 100 digit operands. Effectively
bignums. The optional floating point instruction set had a two
digit exponent and a hundred digit fraction.
Post by Anton Ertl
Post by Scott Lurndal
ADD 1 TO TALLY ON OVERFLOW ...
IF OVERFLOW ...
This BPL: <https://academic.oup.com/comjnl/article/25/3/289/369715>?
No. Burroughs Programming Language.

https://en.wikipedia.org/wiki/Burroughs_Medium_Systems
John Levine
2024-02-20 19:44:57 UTC
Permalink
Post by Anton Ertl
Post by Scott Lurndal
Architectures of old _expected_ integer overflows and had
mechanisms in the languages to test for them.
IIRC S/360 has two modes of operation: One where, on signed addition,
overflow traps, and one where it sets some flag; and the flag-setting
is not as consistent as say the NZCV flags on modern architectures;
instead, there are two bits that can mean anything at all, depending
on the instruction that sets them. In any case, if you use a program
that checks for overflows, then you either have to change the mode to
non-trapping before the addition and change it back afterwards, or all
signed overflows that are not checked explicitly are ignored.
Not quite. It had regular add and subtract (A, AR, S, SR) and logical
(AL, ALR, SL, SLR). The former set the condition code as
negative, zero, positive, or overflow, and interrupted if the overflow
interrupt was enabled. The latter set the condition code as zero no
carry, nonzero no carry, zero carry, or nonzero carry, and never
overflowed. There weren't instructions to do add or subtract with
carry but it was pretty easy to fake by doing a branch on no carry
around an instruction to add or subtract 1.

Multiplication was always signed and took two single length operands
and produced a double length product. It couldn't overflow but you
could with some pain check to see if the high word of the product had
any significant bits.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Thomas Koenig
2024-02-20 17:43:18 UTC
Permalink
Post by Scott Lurndal
Architectures in the 1960s had a flags register with an overflow bit.
POWER still has a (sticky) overflow bit.
David Brown
2024-02-20 12:58:46 UTC
Permalink
Post by Thomas Koenig
Post by David Brown
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Yes. "len" would be converted to "unsigned" by addition of 2^n (2^32
for the sizes given here) before the comparison. (It is not an infinite
loop, however.)
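Spelled out (a sketch):

#include <stdio.h>

int main(void)
{
    int len = -1;
    unsigned i = 0;
    /* in "i < len", len is converted to unsigned; with 32-bit int it
       becomes 4294967295, so the comparison is true and a loop with
       this condition runs for a very long time */
    printf("%d\n", i < (unsigned)len);   /* prints 1 */
    return 0;
}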
Post by Thomas Koenig
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
Wouldn't it be better to forbid mixing of signedness? I don't know
Fortran, so that might be a silly question!

For my C programming, I like to have "gcc -Wconversion -Wsign-conversion
-Wsign-compare" to catch unintended mixes of signedness.
Post by Thomas Koenig
Post by David Brown
Post by Terje Mathisen
    s += data[i];
  }
  return s;
}
will overflow if called with data = [127, 1, -2], right?
No. In C, int8_t values will be promoted to "int" (which is always at
least 16 bits, on any target) and the operation will therefore not
overflow.
Depending on len and the data...
Yes, but I think that was given by Terje in the specification for the
code: the intermediate calculations could be too big for "int8_t",
but not too big for "int".
Post by Thomas Koenig
Post by David Brown
The conversion of the result of "s + data[i]" from int to
int8_t, implicit in the assignment, also cannot "overflow" since that
term applies only to the evaluation of operators. But if this value is
outside the range for int8_t, then the conversion is
implementation-defined behaviour. (That is unlike signed integer
overflow, which is undefined behaviour.)
And that is one of the things that bugs me, in languages like C
and Fortran both.
Me too.

Even now that two's complement is the only allowed representation for
signed integers in C (from when C23 is officially released), conversion
does not need to be wrapping - conversion to a signed integer type from
a value that is outside its range is "implementation-defined or an
implementation-defined signal is raised". I like that there is the
option of raising a signal - that lets you have debug options in the
compiler to find run-time issues. But I'd prefer that the alternative
to that was specified as modulo or wrapping behaviour.

(I like that signed integer overflow is UB, however.)
Post by Thomas Koenig
Signed integer overflow is undefined behavior in C and prohibited
in Fortran. Yet, there is no straightforward, standard-compliant
way to check for signed overflow (and handle this appropriately)
in either language. Gcc adds stuff like __builtin_add_overflow,
but this kind of thing really belongs in the core language.
You'll be glad that this is now in C23:

#include <stdckdint.h>
bool ckd_add(type1 *result, type2 a, type3 b);
bool ckd_sub(type1 *result, type2 a, type3 b);
bool ckd_mul(type1 *result, type2 a, type3 b);

(Basically, it's the gcc/clang extensions that have been standardised.)
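Usage is straightforward (a sketch, assuming a toolchain that already
ships <stdckdint.h>):

#include <stdckdint.h>
#include <stdio.h>

int main(void)
{
    int a = 2000000000, b = 2000000000, sum;
    if (ckd_add(&sum, a, b))            /* true means the sum did not fit */
        puts("overflow");
    else
        printf("%d\n", sum);
    return 0;
}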
Thomas Koenig
2024-02-20 17:42:24 UTC
Permalink
Post by David Brown
Post by Thomas Koenig
Even further on the side: I wrote up a proposal for finally
introducing a wrapping UNSIGNED type to Fortran, which hopefully
will be considered in the next J3 meeting, it can be found at
https://j3-fortran.org/doc/year/24/24-102.txt .
In this proposal, I intended to forbid UNSIGNED variables in
DO loops, especially for this sort of reason.
Wouldn't it be better to forbid mixing of signedness?
That is also in the proposal.
Post by David Brown
I don't know
Fortran, so that might be a silly question!
Not at all - mixing signed and unsigned arithmetic is a source
of headache for C, and I see no reason to impose the same sort of
headache on Fortran (and I see that were are in agreement here,
or you would not have written
Post by David Brown
For my C programming, I like to have "gcc -Wconversion -Wsign-conversion
-Wsign-compare" to catch unintended mixes of signedness.
:-)

Not sure if it will pass, though - unsigned ints have been brought
up in the past, and rejected. Maybe, with the support of DIN, this
has a better chance.
Scott Lurndal
2024-02-20 14:39:10 UTC
Permalink
Post by Thomas Koenig
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
Terje Mathisen
2024-02-20 15:09:36 UTC
Permalink
Post by Scott Lurndal
Post by Thomas Koenig
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
It was a pure typo from my side. In Rust array indices are always of
the "usize" (size_t-like) type; you have to explicitly convert/cast
anything else before you can use it in a lookup:

opcodes[addr as usize]

when I needed addr (an i64 variable) to be able to take on negative
values, but here (knowing that it is now in fact positive) I am using it
as an array index.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Thomas Koenig
2024-02-20 17:37:09 UTC
Permalink
Post by Terje Mathisen
It was a pure typo from my side. In Rust array indices are always of
the "usize" (size_t-like) type; you have to explicitly convert/cast anything else before
opcodes[addr as usize]
No arbitrary array indices, then?

I sometimes find array bounds like

a(-3:3)

convenient, but the lack of them is not a killer argument against using
a language (I use both C and Perl :-)
Thomas Koenig
2024-02-20 17:32:59 UTC
Permalink
Post by Scott Lurndal
Post by Thomas Koenig
Post by Terje Mathisen
int8_t sum(int len, int8_t data[])
{
  int8_t s = 0;
  for (unsigned i = 0; i < len; i++) {
Just a side remark: This loop can get very long for len < 0.
Which is why len should have been declared as size_t. A negative
array length is nonsensical.
Not in all languages. It can be just a shorthand for a zero-sized
array:

$ cat array.f90
program main
real, dimension(1:-1) :: a
print *,size(a)
end program main
$ gfortran array.f90 && ./a.out
0
$

The argument is the same as for a DO loop like

DO I=1,-1
...
END DO

which is also executed zero times (and not -3 times :-)
BGB
2024-02-17 21:03:41 UTC
Permalink
Post by Anton Ertl
On the third (i.e. gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
Yes.

Trap on overflow is not really a thing in the JVM, the basic integer
types are modulo, and don't actually distinguish signed from unsigned
(unsigned arithmetic is merely faked in some cases with special
operators, with signed arithmetic assumed as the default).
Post by Anton Ertl
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
Yeah. No traps, only NaNs.


FWIW: My own languages, and BGBCC, also partly followed Java's model in
this area. But, it wasn't hard: This is generally how C behaves as well
on most targets.

Well, except that C will often trap for things like divide by zero and
similar, at least on x86. Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.


On BJX2, there isn't currently any divide-by-zero trap, since:
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.

So, IIRC, integer divide-by-zero will just give 0, and FP divide-by-zero
will give Inf or NaN.
Post by Anton Ertl
- anton
MitchAlsup1
2024-02-17 22:08:30 UTC
Permalink
Post by BGB
Post by Anton Ertl
On the third (i.e. gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
Yes.
Trap on overflow is not really a thing in the JVM, the basic integer
types are modulo, and don't actually distinguish signed from unsigned
(unsigned arithmetic is merely faked in some cases with special
operators, with signed arithmetic assumed as the default).
Post by Anton Ertl
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
People skilled in numerical analysis hate java FP semantics.
Post by BGB
Yeah. No traps, only NaNs.
FWIW: My own languages, and BGBCC, also partly followed Java's model in
this area. But, it wasn't hard: This is generally how C behaves as well
on most targets.
Well, except that C will often trap for things like divide by zero and
similar, at least on x86. Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
This case doesn't happen in normal program execution;
Handling it with a trap would cost more than not bothering.
This sounds like it should make your machine safe to program and use,
but it does not.
Post by BGB
So, IIRC, integer divide-by-zero will just give 0, and FP divide-by-zero
will give Inf or NaN.
Can I volunteer this as the worst possible value for int/0? [Un]signedMAX
is only trivially harder to implement.
Post by BGB
Post by Anton Ertl
- anton
BGB
2024-02-17 23:36:43 UTC
Permalink
Post by MitchAlsup1
Post by BGB
Post by Anton Ertl
On the third (i.e. gripping) hand you could have a language like Java
where it would be illegal to transform a temporarily trapping loop into
one that would not trap and give the mathematically correct answer.
What "temporarily trapping loop" and "mathematically correct answer"?
If you are talking about integer arithmetic, the limited integers in
Java have modulo semantics, i.e., they don't trap, and BigIntegers certainly
don't trap.
Yes.
Trap on overflow is not really a thing in the JVM, the basic integer
types are modulo, and don't actually distinguish signed from unsigned
(unsigned arithmetic is merely faked in some cases with special
operators, with signed arithmetic assumed as the default).
Post by Anton Ertl
If you are talking about FP (like I did), by default FP addition does
not trap in Java, and any mention of "mathematically correct" in
connection with FP needs a lot of further elaboration.
People skilled in numerical analysis hate java FP semantics.
Post by BGB
Yeah. No traps, only NaNs.
FWIW: My own languages, and BGBCC, also partly followed Java's model
in this area. But, it wasn't hard: This is generally how C behaves as
well on most targets.
Well, except that C will often trap for things like divide by zero and
similar, at least on x86. Though, off-hand, I don't remember whether
or not JVM throws an exception on divide-by-zero.
   This case doesn't happen in normal program execution;
   Handling it with a trap would cost more than not bothering.
This sounds like it should make your machine safe to program and use,
but it does not.
It is more concerned with "cheap" than "safe".

Trap on divide-by-zero would require having a way for the divider unit
to signal divide-by-zero has been encountered (say, so some external
logic can raise the corresponding exception code). This is not free.

Granted, since the unit is slow, could potentially add a few cycles of
latency without much ill effect.
Post by MitchAlsup1
Post by BGB
So, IIRC, integer divide-by-zero will just give 0, and FP
divide-by-zero will give Inf or NaN.
Can I volunteer this as the worst possible value for int/0? [Un]signedMAX
is only trivially harder to implement.
Probably depends on what one wants.
With 0, the value just sorta goes poof and disappears...

Not exactly the most numerically correct answer, granted.

Probably at least still better than "divide by zero gives some totally
random garbage value", which is, technically, the cheaper option...



Though, at least Inf has the advantage that it just sorta appears on its
own when one tries to find the reciprocal of 0 (it initially turns into
a very huge value, and quickly turns into Inf in the N-R stages; with
Inf as a special case for whenever the exponent goes out of range).

The NaN then appears if one special-cases 0*Inf to produce NaN.
x/0 => x*rcp(0) => x*Inf => Inf
0/0 => 0*rcp(0) => 0*Inf => NaN

...
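
As a small illustration of those two corner cases under IEEE 754
defaults (a sketch in plain C, standing in for the reciprocal path
described above):

#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;   /* volatile so the compiler keeps the divide */
    double rcp = 1.0 / zero;      /* +Inf; no trap in the default FP environment */
    printf("5.0 * rcp(0) = %g\n", 5.0 * rcp);   /* inf */
    printf("0.0 * rcp(0) = %g\n", 0.0 * rcp);   /* nan: 0 * Inf is invalid */
    return 0;
}
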
Post by MitchAlsup1
Post by BGB
Post by Anton Ertl
- anton
MitchAlsup1
2024-02-18 01:06:46 UTC
Permalink
Post by BGB
Post by MitchAlsup1
Post by BGB
   This case doesn't happen in normal program execution;
   Handling it with a trap would cost more than not bothering.
This sounds like it should make your machine safe to program and use,
but it does not.
It is more concerned with "cheap" than "safe".
Trap on divide-by-zero would require having a way for the divider unit
to signal divide-by-zero has been encountered (say, so some external
logic can raise the corresponding exception code). This is not free.
Most result busses have a bit that carries an exception indication to the
retire end of the pipeline. The retire stage looks at the bit, sees a DIV
instruction and knows what exception was raised. FP generally needs 3 such
bits on the result bus.
Anton Ertl
2024-02-18 08:00:18 UTC
Permalink
Post by BGB
Well, except that C will often trap for things like divide by zero and
similar, at least on x86.
The division instructions of IA-32 and AMD64 trap on divide-by-zero
and when the result is out of range. Unsurprisingly, C compilers
usually use these instructions when compiling division on these
architectures. One interesting case is what C compilers do when you
write

long foo(long x)
{
return x/-1;
}

Both gcc and clang compile this to

0: 48 89 f8 mov %rdi,%rax
3: 48 f7 d8 neg %rax
6: c3 retq

and you don't get a trap when you call foo(LONG_MIN), while you would
if the compiler did not know that the divisor is -1 (and it was -1 at
run-time).

By contrast, when I implemented division-by-constant optimization in
Gforth, I decided not to "optimize" the division by -1 case, so you get
the ordinary division operation and its behaviour. If a programmer
codes a division by -1 rather than just NEGATE, they probably want
something other than NEGATE.
Post by BGB
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
says:

|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.

I expect that the JVM has matching wording.

So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather then just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or
equivalent could then check if the divisor was 0 (and then throw an
ArithmeticException) or -1 (and then produce a long_min result and
continue execution).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-02-18 18:14:05 UTC
Permalink
On Sun, 18 Feb 2024 08:00:18 GMT
Post by Anton Ertl
Post by BGB
Well, except that C will often trap for things like divide by zero
and similar, at least on x86.
The division instructions of IA-32 and AMD64 trap on divide-by-zero
and when the result is out of range. Unsurprisingly, C compilers
usually use these instructions when compiling division on these
architectures. One interesting case is what C compilers do when you
write
long foo(long x)
{
return x/-1;
}
Both gcc and clang compile this to
0: 48 89 f8 mov %rdi,%rax
3: 48 f7 d8 neg %rax
6: c3 retq
and you don't get a trap when you call foo(LONG_MIN), while you would
if the compiler did not know that the divisor is -1 (and it was -1 at
run-time).
By contrast, when I implemented division-by-constant optimization in
Gforth, I decided not to "optimize" the division by -1 case, so you get
the ordinary division operation and its behaviour. If a programmer
codes a division by -1 rather than just NEGATE, they probably want
something other than NEGATE.
Post by BGB
Though, off-hand, I don't remember whether or
not JVM throws an exception on divide-by-zero.
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather then just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or
equivalent could then check if the divisor was 0 (and then throw an
ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
I don't understand why the case of LONG_MIN/-1 would possibly require
special handling. IMHO, the regular iAMD64 64-bit integer division sequence,
i.e. CQO followed by IDIV, will produce the result expected by the Java spec
without any overflow.
Anton Ertl
2024-02-18 22:40:08 UTC
Permalink
Post by Michael S
On Sun, 18 Feb 2024 08:00:18 GMT
Post by Anton Ertl
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
|if the dividend is the negative integer of largest possible magnitude
|for its type, and the divisor is -1, then integer overflow occurs and
|the result is equal to the dividend. Despite the overflow, no
|exception is thrown in this case. On the other hand, if the value of
|the divisor in an integer division is 0, then an ArithmeticException
|is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather then just generating the
divide instruction. Alternatively, the generated code could just
produce a division instruction, and the signal handler (on Unix) or
equivalent could then check if the divisor was 0 (and then throw an
ArithmeticException) or -1 (and then produce a long_min result and
continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division sequence,
i.e. CQO followed by IDIV, will produce result expected by Java spec
without any overflow.
Try it. E.g., in gforth-fast /S performs this sequence:

see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov 0x8(%r13),%rax
0x00005614dd335635 <gforth_engine+3221>: add $0x8,%r13
0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code

And when I divide LONG_MIN by -1, I get a trap:

$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-02-19 16:47:32 UTC
Permalink
On Mon, 19 Feb 2024 01:20:09 +0200
On Sun, 18 Feb 2024 22:40:08 GMT
Post by Anton Ertl
Post by Michael S
On Sun, 18 Feb 2024 08:00:18 GMT
Post by Anton Ertl
Reading up on Java,
<https://docs.oracle.com/javase/specs/jls/se21/html/jls-15.html#jls-15.17.2>
|if the dividend is the negative integer of largest possible
magnitude |for its type, and the divisor is -1, then integer
overflow occurs and |the result is equal to the dividend. Despite
the overflow, no |exception is thrown in this case. On the other
hand, if the value of |the divisor in an integer division is 0,
then an ArithmeticException |is thrown.
I expect that the JVM has matching wording.
So on, e.g., AMD64 the JVM has to generate code that catches the
long_min/-1 case and produces long_min rather then just
generating the divide instruction. Alternatively, the generated
code could just produce a division instruction, and the signal
handler (on Unix) or equivalent could then check if the divisor
was 0 (and then throw an ArithmeticException) or -1 (and then
produce a long_min result and continue execution).
- anton
I don't understand why case of LONG_MIN/-1 would possibly require
special handling. IMHO, regular iAMD64 64-bit integer division
sequence, i.e. CQO followed by IDIV, will produce result expected
by Java spec without any overflow.
see /s
Code /s
0x00005614dd33562d <gforth_engine+3213>: add $0x8,%rbx
0x00005614dd335631 <gforth_engine+3217>: mov
0x8(%r13),%rax 0x00005614dd335635 <gforth_engine+3221>: add
$0x8,%r13 0x00005614dd335639 <gforth_engine+3225>: cqto
0x00005614dd33563b <gforth_engine+3227>: idiv %r8
0x00005614dd33563e <gforth_engine+3230>: mov %rax,%r8
0x00005614dd335641 <gforth_engine+3233>: mov (%rbx),%rax
0x00005614dd335644 <gforth_engine+3236>: jmp *%rax
end-code
$8000000000000000 -1 /s
*the terminal*:12:22: error: Division by zero
$8000000000000000 -1 >>>/s<<<
- anton
You are right.
LONG_MIN/1 works, but LONG_MIN/-1 crashes, to my surprise.
Seems like I didn't RTFM with regard to IDIV for too many years.
Most likely, back when I was reading the manual for the first time, I
read DIV paragraph thoroughly and then just looked briefly at IDIV
assuming that it is about the same and not paying attention to the
differences in corner cases.
BGB
2024-02-16 08:10:57 UTC
Permalink
Post by Quadibloc
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
But skepticism about VVM isn't actually helpful if Cray-style vectors
are now impossible to be made to work given current memory speeds.
Possibly true.

One could push a lot more data through a SIMD unit if one could, say,
perform two memory loads and one memory store per cycle. Given that, the
idea of vector registers with a memory address and implicit SIMD vectors
makes sense.

But, this still leaves the problem of "what happens when all the data no
longer fits in the L1 cache?..."

Doing vector calculations at memcpy speed doesn't really gain much over
doing SIMD calculations at memcpy speed. SIMD is, meanwhile, easier to
implement, and does have obvious merit over scalar calculations in many
contexts.


So, say, I have a 200 MFLOP SIMD unit, but with Binary16, realistically
I could only get ~ 35 MFLOP for a large vector operation at L2 speeds
(or 17 MFLOP if using Binary32). The limit may jump to 70 MFLOP if the
operation resembles a large vector dot-product.

But, yeah, if the SIMD code is carefully written, it is possible for the
SIMD code to operate at near memcpy speeds (despite the inherent
inefficiency of needing to "waste" clock-cycles on things like memory
loads).


OTOH, theoretically, a computer pushing 400MB/s for memcpy could get 100
MFLOP at Binary32 precision (but, might end up being less if working on
it using x87). Or, if the calculation is more involved, slower still.

If the operation resembles a 4-element vector multiply-accumulate,
cycling over each element in SIMD-like patterns (say, the "fake the SIMD
operations using structs of floats" strategy), x87 eats it hard; and is
a bigger bottleneck than the memory bandwidth.
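
For reference, a minimal sketch of the "structs of floats" pattern being
referred to here (plain C; whether a compiler maps it onto SIMD or x87
is exactly the issue above):

typedef struct { float x, y, z, w; } vec4;

/* c += a * b, element-wise */
static inline vec4 vec4_fma(vec4 a, vec4 b, vec4 c)
{
    c.x += a.x * b.x;
    c.y += a.y * b.y;
    c.z += a.z * b.z;
    c.w += a.w * b.w;
    return c;
}
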


OTOH: If one is doing a bunch of large branching scalar calculations
that each produce a single output value, x87 makes a lot more sense
(IOW, the sorts of things people more traditionally think of when they
think of "math", as opposed to a whole lot of "c[i]=a[i]*b[i]+c[i];" or
similar...).
Post by Quadibloc
The basic way in which I originally felt I could make it work was really
quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that
access the vector registers.
The details of how one may have to make use of that capability... well,
that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!
Hey, the GPU in a computer these days is also a singular resource.
Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.
If a computer can't effectively run an algorithm due to ISA design, this
is a design fail.


Say, "define mechanism, not policy" and the like. Some vector ISA's seem
unattractive though in that the data needs to be made to fit the
operations (even more so than SIMD would otherwise imply), or would
require structuring things in ways that would not be cache friendly.


As I see it, they also don't really "solve" much that couldn't otherwise
be solved with SIMD if one had multiple memory ports, but doing so is
still potentially rendered moot if one doesn't have enough memory
bandwidth to keep everything fed.


Though, luckily, most "common" algorithms can fit most of the code and
data in the L1 cache, and are more limited by processing the data than
by memory bandwidth (and, in this case, it might be easier to justify
having multiple memory ports to get data into and out of registers more
quickly).
Post by Quadibloc
Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style
vectors, so that the best possible vector performance for conventional
non-batch programs is also available.
Now, what would I think of as being "something very much like VVM" without
actually being VVM?
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.
So my substitute for VVM should now be obvious - explicit memory-to-memory
vector instructions, like on an old STAR-100.
One has the issue of likely needing arcane implementation magic.

The other that it is likely to be limiting and cache unfriendly.


For better/worse, I partly followed the MMX / SSE model.
But, cutting a lot of corners to make things cheaper.

Generally leaving out things like packed byte operations or saturating
arithmetic, which while potentially useful, are not particularly cheap
either.


I generally try to approach SIMD in a similar way to the integer ISA,
where it is usually better to add features reluctantly (or else end
up with the ISA clogged up with stuff that does not meaningfully
contribute to performance).

Well, nevermind stuff that exists mostly as a workaround to other design
choices (say, instructions which only need to exist due to the lack of
an architectural zero register, or because of my choice to use pointer
tagging, ...).

But, as noted, if I were doing it all again, I might consider having ZR,
LR, and GBR, in the GPR space.


But, then I get back to fiddling with my CMP3R extension:
Have noted that enabling it helps with Doom, but somewhat hurts
Dhrystone, for reasons that aren't entirely obvious (enabling it should
presumably not hurt Dhrystone, given all it is really doing is replacing
2-op sequences with 1-op in a way that should theoretically reduce
4-cycle dependent pairs down to 1|2 cycles).

Have replaced the "CMPQGE Rm, Imm5u, Rn" instruction with "CMPQLT Rm,
Imm5u, Rn", noting that the latter can usefully encode more cases than
the former ("GE Imm" can be replaced with "GT Imm-1" to similar effect,
whereas "LT" can encode both the LT and LE cases, even if it breaks
symmetry with the 3R cases, where LT and LE are expressed by flipping
the arguments).

Well, and also in the process finding some cases where the existing
compiler logic was compiling stuff inefficiently (was using FF Op64
encodings where FE jumbo-encodings would have been better in XG2
Mode, and because Op64 only had a 17-bit immediate, was *also* leading
to loading the immediate values into temporary registers).

...


Though, can note that I still seem to be losing to RV64 in terms of
code-density, and it seems like figuring out more of the code-density
delta may potentially lead to a bigger advantage over RV64 in terms of
performance (as-is, the performance delta is "not sufficiently large" as
I see it, when it comes to "plain C integer-oriented code").

Like, would be easier to justify my project if it could have a win on
nearly every metric (well, beyond just "around 20% faster in Doom, and
does an OpenGL style software rasterizer acceptably well...").

Would have been happier if it was more like 70% faster or similar...

Seems like I have to be happy at the moment with "if I pull the stops,
it averages ~ 29 fps..." (mostly stays over 20fps, sometimes hits the
35fps limiter, ...).

But, then it means I have to resort to "kinda weak" options, like
enabling full compare+branch ops and disabling stack canary checks. But,
then it is seeming like these sorts of "little things" may actually
matter (and most of the obvious "low hanging fruit" is also gone at this
point in terms of integer performance; well, unless I want to start
reviving old experimental features like 4R integer multiply-accumulate
or similar).

...
MitchAlsup1
2024-02-16 18:35:01 UTC
Permalink
Post by Quadibloc
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
VVM on My 66000 remains a RISC ISA--what it does is provide an implementation
freedom to perform multiple loop iterations (SIMD-style) at the same time.
CRAY nomenclature would call this "lanes".
Post by Quadibloc
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
Cache buffers to be more precise.
Post by Quadibloc
My first question, though, is whether or not we can really make caches
that good.
Once a memory reference in a vectorized loop starts to miss, you quit
storing the data in the cache and just strip mine it through the cache
buffers, and avoid polluting the DCache with data that will be displaced
before the loop completes.
Post by Quadibloc
But skepticism about VVM isn't actually helpful if Cray-style vectors
are now impossible to be made to work given current memory speeds.
The basic way in which I originally felt I could make it work was really
quite simple. The operating system, from privileged code, could set a
bit in the PSW that turns on, or off, the ability to run instructions that
access the vector registers.
The details of how one may have to make use of that capability... well,
that's software. So maybe the OS has to stipulate that one can only have
one process at a time that uses these vectors - and that process has to
run as a batch process!
Hey, the GPU in a computer these days is also a singular resource.
Having resources that have to be treated that way is not really what
people are used to, but a computer that _can_ run your CFD codes
efficiently is better than a computer that *can't* run your CFD codes.
Given _that_, obviously if VVM is a better fit to the regular computer
model, and it offers nearly the same performance, then what I should do
is offer VVM or something very much like it _in addition_ to Cray-style
vectors, so that the best possible vector performance for conventional
non-batch programs is also available.
Now, what would I think of as being "something very much like VVM" without
actually being VVM?
Basically, Mitch has his architecture designed for implementation on
CPUs that are smart enough to notice certain combinations of instructions
and execute them as though they're single instructions doing the same
thing, which can then be executed more efficiently.
So this makes those exact combinations part of the... ISA syntax...
which I think is too hard for assembler programmers to remember, and
I think it's also too hard for at least some implementors. I see it
as asking for trouble in a way that I'd rather avoid.
So my substitute for VVM should now be obvious - explicit memory-to-memory
vector instructions, like on an old STAR-100.
Gasp........
Post by Quadibloc
John Savard
Quadibloc
2024-02-17 04:30:37 UTC
Permalink
Post by MitchAlsup1
Post by Quadibloc
So my substitute for VVM should now be obvious - explicit memory-to-memory
vector instructions, like on an old STAR-100.
Gasp........
Oh, dear. But, yes, old-style memory-to-memory vector instructions omit
at least one very important thing that VVM provides, which I do indeed
want to make sure I include.

So there would need to be instructions like

multiply v1 by v2 giving scratch-1
add scratch-1 to scratch-2 giving scratch-3
divide scratch-2 by v1 giving v4

... that is, instead of vector registers, there would still be another
kind of thing that isn't a vector in memory, but instead an *explicit*
reference to a forwarding node.

And so these vector instructions would have to be in explicitly
delimited groups (since forwarding nodes, unlike vector registers, aren't
intended to be _persistent_, so a group of vector instructions would have
to combine into a clause which for some purposes acts like a single
instruction)... which then makes it look a whole lot _more_ like VVM,
even though the inside of the sandwich is now special instructions,
instead of ordinary arithmetic instructions as in VVM.

I think there _may_ have been something like this already in the
original Concertina.

John Savard
MitchAlsup1
2024-02-16 18:45:59 UTC
Permalink
Post by Quadibloc
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar
performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.

Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is RAW hazard for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).
With vector operations you pause the front end (fetch and decode) while
iterating over vector elements, which eliminates branch misprediction
penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the
previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment,
compare, branch).
VVM does not use branch prediction--it uses a zero-loss ADD-CMP-BC
instruction I call LOOP.

And you do not have to lose precise exceptions, either.
As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.
More energy efficient, but consumes more energy because it is running
more data in less time.
As I said, VVM has many similar properties, but I am currently exploring
if a VRF solution can be made sufficiently cheap to be feasible in this
very low end space, where I believe that VVM may be a bit too much (this
assumption is mostly based on my own ignorance, so take it with a grain
of salt).
For reference, the microarchitectural complexity that I'm thinking about
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v
/Marcus
EricP
2024-02-16 19:27:19 UTC
Permalink
Post by MitchAlsup1
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
Is that done by disambiguation logic, checking for same cache line access?
EricP
2024-02-17 12:49:34 UTC
Permalink
Post by EricP
Post by MitchAlsup1
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since
the width of the machine is 64-bits, we read or write in 128-bit quantities
so that we have enough bits to extract the misaligned data from, or a
container large enough to store a 64-bit value into. {{And there are all
the associated corner cases}}
Second: over in VVM-land, the implementation can decide to read and write
wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double precision FP values, or quads
of 32-bit values, octets of 16-bit values, and hexademials of 8-bit values.
This supports loops of 6IPC or greater in a 1-wide machine. This machine
would process suitable loops at 128-bits per cycle--depending on "other
things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
per cycle. This supports 20 IPC or greater in the 6-wide machine. As
many as
8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths
to support this rate.
Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop.
That's more straightforward than fusing LD's or ST's in LSQ.
Post by EricP
Is that done by disambiguation logic, checking for same cache line access?
I have said before that the front end observes the first iteration of
the loop and makes some determinations as to how wide the loop can be
run on the machine at hand. One of those observations is whether memory
addresses are dense, whether they all go in the same direction, and what
registers carry loop-to-loop dependencies.
How does it know when to use LDP/STP uOps?
That decision would have to be made early in the front end, likely Decode
and before Rename because you have to know how many dest registers you need.

But the decision on the legality to use LDP/STP depends on knowing the
current loop counter >= 2 and address(es) aligned on a 16 byte boundary,
which are multiple dynamic, possibly calculated, values only available
much later to the back end.
MitchAlsup1
2024-02-17 17:18:36 UTC
Permalink
Post by EricP
Post by EricP
Post by MitchAlsup1
You should think of it like:: VVM can execute as many operations per
cycle as it has function units. In particular, the low end machine
can execute a LD, and FMAC, and the ADD-CMP-BC loop terminator every
cycle. LDs operate at 128-bits wide, so one can execute a LD on even
cycles and a ST on odd cycles--giving 6-IPC on a 1 wide machine.
Bigger implementations can have more cache ports and more FMAC units;
and include "lanes" in SIMD-like fashion.
Regarding the 128-bit LD and ST, are you saying the LSQ recognizes
two consecutive 64-bit LD or ST to consecutive addresses and merges
them into a single cache access?
first: memory is inherently misaligned in My 66000 architecture. So, since
the width of the machine is 64-bits, we read or write in 128-bit quantities
so that we have enough bits to extract the misaligned data from or a
container
large enough to store a 64-bit value into. {{And there are all the
associated
corner cases}}
Second: over in VVM-land, the implementation can decide to read and write
wider, but is architecturally constrained not to shrink below 128-bits.
A 1-wide My66160 would read pairs of double precision FP values, or quads
of 32-bit values, octets of 16-bit values, and hexademials of 8-bit values.
This supports loops of 6IPC or greater in a 1-wide machine. This machine
would process suitable loops at 128-bits per cycle--depending on "other
things" that are generally allowable.
A 6-wide My66650 would read a cache line at a time, and has 3 cache ports
per cycle. This supports 20 IPC or greater in the 6-wide machine. As
many as
8 DP FP calculations per cycle are possible, with adequate LD/ST bandwidths
to support this rate.
Ah, so it can emit Load/Store Pair LDP/STP (or wider) uOps inside the loop.
That's more straight forward than fusing LD's or ST's in LSQ.
Post by EricP
Is that done by disambiguation logic, checking for same cache line access?
Before I have said that the front end observes the first iteration of
the loop and makes some determinations as to how wide the loop can be
run on
the machine at hand. One of those observations is whether memory addresses
are dense, whether they all go in the same direction, and what registers
carry loop-to-loop dependencies.
How does it know when to use LDP/STP uOps?
It does not have LDP/STP ops to use.
It uses the width of the cache port it has.
It just so happens that the low end machine has a cache width of 128-bits.
But each implementation gets to choose its own width.
Post by EricP
That decision would have to be made early in the front end, likely Decode
and before Rename because you have to know how many dest registers you need.
It is not using a register, although it is using flip-flops. It is
using something that is not visible to SW but is visible to HW.
Post by EricP
But the decision on the legality to use LDP/STP depends on knowing the
current loop counter >= 2 and address(es) aligned on a 16 byte boundary,
which are multiple dynamic, possibly calculated, values only available
much later to the back end.
It does not need to see the address aligned to a 16-byte boundary.
BGB
2024-02-16 22:18:47 UTC
Permalink
Post by Quadibloc
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But because the historical precedent seems to indicate otherwise, and
because while data forwarding is very definitely a good thing (and,
indeed, necessary to have for best performance _on_ a vector register
machine too) it has its limits.
What _could_ substitute for vector registers isn't data forwarding,
it brings in vector operands closer to the CPU where they're more
quickly accessible. So a STAR-100 with a *really good cache* as well
as data forwarding could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching similar
performance levels as a pipelined 1 CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
I guess possible.

Ironically, the first designs I did were pipelined, except some early
versions of my core did not use pipelined memory access (so, Load/Store
would have been a lot more expensive), in combination with a painfully
slow bus design.
Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is RAW hazard for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).
Yeah, the above is more like my core design...
With vector operations you pause the front end (fetch and decode) while
iterating over vector elements, which eliminates branch misprediction
penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the
previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment,
compare, branch).
As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.
FWIW:
The first version of SIMD operations I did was to feed each vector
element through the main FPU.

So:
Scalar, 6-cycles, with 5-cycle internal latency, 1 overhead cycle.
2-wide SIMD, 8 cycles, 1+6+1
4-wide SIMD, 10 cycles, 1+8+1
The relative LUT cost increase of supporting SIMD in this way is fairly
low (and, still worthwhile, as it is faster than non-pipelined FPU ops).


For the later low-precision unit, latency was reduced by implementing
low-precision FADD/FMUL units with a 3-cycle latency, and running 4 sets
in parallel. This is not cheap, but faster.

Another feature was to allow the low-precision unit to accept
double-precision values, though processing them at low precision.

This was the incentive to adding FADDA/FSUBA/FMULA, since these can do
low-precision ops on Binary64 with a 3-cycle latency (just with the
tradeoff of being partial precision and truncate only).


Though, through experimentation, I noted that for single-precision stuff
one needs at least full single precision, as, say, S.E8.F16.Z7 is not
sufficient...
Both Quake and my experimental BGBTech3 engine started breaking in
obvious ways with a 16-bit mantissa for "float".

Internally, IIRC, I ended up using a 25 bit mantissa, as this was just
enough to get Binary32 working correctly.

Though, the low-precision unit doesn't currently support integer
conversion, where it can be noted that the mantissa needs to be wide
enough to handle the widest supported integer type (so, say, if one wants
Int32, one also needs a 32-bit mantissa).


Going from my original SIMD implementation to a vector implementation
would likely require some mechanism for the FPU to access memory. This,
however, is the "evil" part.

Also, annoyingly, adding memory ports is not as easy as simply adding
duplicate L1 caches, as then one has to deal with memory coherence
between them.


Similarly, Block-RAM arrays don't support multiple read ports, so trying
to simply access the same Block-RAM again will effectively duplicate all
of the block-RAM for the L1 cache (horrible waste).

Though, simply mirroring the L1 contents across both ports would be the
simplest approach...



One possible way to have a multi-port L1 cache might be, say:
Add an L0 cache level, which deals with the Load/Store mechanism;
Natively caches a small number of cache lines (say, 2..8);
Would likely need to be fully-associative, for "reasons";
An L0 miss would be handled by initiating a request to the L1 cache.
Each memory port gets its own L0 cache,
with the L0 caches arbitrating access to the L1 cache.
A store into an L0 cache causes it to signal the others about it.
Each L0 cache will then need to flush the offending cache-line.

The obvious drawbacks:
More expensive;
One needs to effectively hold several cache lines of data in FFs.
Will add latency.
2x 4-way lines, would miss often;
Will add 1+ cycles for every L0 miss.



One other considered variant being:
The Lane1 port accesses the L1 directly (does Load/Store);
The Lane2 port has a small asymmetric cache, and only does Load.
Store via Lane 1 port invalidates lines in the Lane 2 L0.
In this case, the Lane2 cache could be bigger and use LUTRAM.
Say, 2x 32-lines, 1K.

However, the harder part is how to best arbitrate misses to the Lane 2 port.

IIRC, I had experimentally implemented this latter strategy, with an
approach like:
A Lane 2 miss signals a cache-miss as before.
During a Lane 2 miss, the Lane 1 array-fetch is redirected to Lane 2's
index;
If Lane 2's miss is serviced from Lane 1, it is copied over to Lane 2;
Else, an L1 miss happens as before, and the line is added to both the
main L1 cache and the Lane 2 sub-cache.


I think one other idea I had considered was to implement a 2-way
set-associative L1 (rather than direct-mapped), but then allow dual-lane
memory access to use each way of the set independently (in the case of a
single memory access, it functions as a 2-way cache).

If an L1 miss occurs, both "ways" will be brought to look at the same
index (corresponding to the lane that had missed), and then the L1 miss
handling would do its thing (either resolving within the L1 cache, or
sending a request out on the bus, loading the cache-line into the way
associated with the lane that had missed).

This approach could be preferable, but would require writing a new L1 cache.


Though, thus far, multi-port memory access hasn't really seemed "worth it".

And, this still wouldn't deal with vectors, merely the possibility of
allowing things like, say:
MOV.Q (R4, 64), R8 | MOV.Q (R5, 192), R10

Generally, disallowing Load|Store or Store|Store, as this would open up
another level of pain, vs Load|Load.


As for reducing looping overhead:
Luckily, this part can be handled purely in software (such as via manual
or automatic loop unrolling). Currently BGBCC requires manual unrolling,
as automatic unrolling takes too much cleverness, but, ...
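
For example, a minimal sketch of manual 4x unrolling of the kind of loop
mentioned earlier (assuming, for simplicity, that n is a multiple of 4):

/* c[i] = a[i]*b[i] + c[i], unrolled 4x to cut loop overhead */
void vec_mac4(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        c[i + 0] += a[i + 0] * b[i + 0];
        c[i + 1] += a[i + 1] * b[i + 1];
        c[i + 2] += a[i + 2] * b[i + 2];
        c[i + 3] += a[i + 3] * b[i + 3];
    }
}
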
As I said, VVM has many similar properties, but I am currently exploring
if a VRF solution can be made sufficiently cheap to be feasible in this
very low end space, where I believe that VVM may be a bit too much (this
assumption is mostly based on my own ignorance, so take it with a grain
of salt).
Admittedly, I don't actually understand how VVM would work in hardware.
And, if in the trivial case it decays to behaving like scalar code, this
is something, but kinda meh.
For reference, the microarchitectural complexity that I'm thinking about
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v
Interesting in a way that such small cores are possible.
Admittedly, my core is quite significantly more complicated.

And, pretty much all of my designs had significantly more Verilog code.


The smallest cores I had done ended up being around 5k LUT, mostly using
a minimalist / stripped-down version of SH-2 (with an interrupt
mechanism more like the SH-4, *1).

Though, some of the tweaks used in this design were later used as a
basis of "BtSR1", which was in turn used as the design basis for the
original form of BJX2. Though, BtSR1 did change around the 16-bit
instruction layout from SuperH, which was likely a mistake in
retrospect, but doesn't matter as much at this point.


Though, I am left to realize a few things could be better:
If the Imm16 encodings were consistent with the rest of the ISA;
If all of the encoding blocks had the registers in the same places.

For example, the Imm16 block encodes the Rn register field differently
from all of the other blocks. And, the 1R block has the Rn field where
Ro is located in the 3R blocks (it might have been better to instead
treat the 1R block as a subset of the 2R block).


*1: The interrupt mechanism in the SH-2 was more like that in the 8086,
essentially pushing PC and SR onto the stack for an interrupt, and
loading the entry-point address from a vector table. Comparably, the
approach used in the SH-4 was simpler/cheaper as it did not require the
exception-handler mechanism to be able to access memory...

Conversely, SH-2A went in a very different direction regarding exception
handling (IIRC, effectively having multiple sets of the register file,
with a mechanism to spill/reload all the registers from memory as
needed, in hardware, via the associated TBR address; assigning to TBR
effectively performing a context switch on interrupt return). Well,
along with a separate register space for interrupt handling (kinda like
RISC-V), and still pushing stuff onto the stack/popping from the stack
(like 8086). In all, a seemingly fairly complicated/expensive
mechanism.


Though, eventually did end up with a mechanism superficially similar (at
the level of the C code), but differing mostly in that the whole
mechanism is handled in software (via the ISR prolog/epilog; and a lot
of Load/Store instructions).

Differs slightly in that in my ABI, TBR has a pointer (at a fixed
location) to the register save area; IIRC, in SH-2A, TBR was itself a
pointer to the register save area (and any other task/process context
stuff would necessarily need to be located after the register save area).


As can be noted:
SH style 16-bit layouts:
zzzz-nnnn-mmmm-zzzz 2R
zzzz-nnnn-zzzz-zzzz 1R
zzzz-nnnn-iiii-iiii 2RI (Imm8)
zzzz-zzzz-iiii-iiii Imm8
zzzz-iiii-iiii-iiii Imm12 (BRA/BSR)
BtSR1/BJX2 16-bit layouts
zzzz-zzzz-nnnn-mmmm 2R
zzzz-zzzz-nnnn-zzzz 1R
zzzz-zzzz-nnnn-iiii 2RI (Imm4)
zzzz-nnnn-iiii-iiii 2RI (Imm8)
zzzz-zzzz-iiii-iiii Imm8
zzzz-iiii-iiii-iiii Imm12 (LDIZ/LDIN)

Both SH-2A and BJX1 had effectively thrown 32-bit encodings ad-hoc into
the 16-bit opcode space.

For BJX2, they had been consolidated:
If the high 3 bits of the 16-bit word were 111, it was the start of a
32-bit encoding.


As noted, 32-bit encodings:
NMOP-zwzz-nnnn-mmmm zzzz-qnmo-oooo-zzzz //3R
NMZP-zwzz-nnnn-mmmm zzzz-qnmz-zzzz-zzzz //2R
ZZNP-zwzz-zzzz-zzzz zzzz-qzzn-nnnn-zzzz //1R (Inconsistent, Ro->Rn)
NMIP-zwzz-nnnn-mmmm zzzz-qnmi-iiii-iiii //3RI (Imm9/Disp9)
NMIP-zwzz-nnnn-zzzz zzzz-qnii-iiii-iiii //2RI (Imm10)
NZZP-zwzz-zzzn-nnnn iiii-iiii-iiii-iiii //2RI (Imm16)
IIIP-zwzz-iiii-iiii zzzz-iiii-iiii-iiii //Disp20 (BRA/BSR)
ZZZP-zwzz-iiii-iiii iiii-iiii-iiii-iiii //Imm24 (LDIZ/LDIN; Jumbo)

In Baseline, the high 3 bits are always 1 for 32-bit ops; in XG2, these
are repurposed as extension bits, but are encoded in an inverted form
(such that the Baseline encodings are still valid in XG2).

Where, here, the uppercase bits are those that are inverted.
n/m/o: register
i: Disp/Immed
z: Opcode


Admittedly, the space is less cleanly organized compared with RISC-V,
so I suspect something like superscalar would be harder. It might have
been preferable if Rn were always in the same spot (since Rn is needed
for the superscalar checks, and it is more of an issue if one first needs
to figure out the encoding block in order to figure out a possible Rn alias).

Well, or check every possible register field, possibly ignoring the high
bits for simplicity:
xxxx-xxxx-nnnn-mmmm xxxx-xxxx-oooo-xxxx //Virtual layout
Where: x=don't care.
Where, one can be conservative and reject cases where any possible alias
could happen (even if it is a false alias).
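
As a rough C sketch of that conservative check (field positions follow
the virtual layout above, reading the first 16-bit word as the high half
of the 32-bit instruction word; this just illustrates the idea, not the
actual issue logic):

  #include <stdint.h>
  #include <stdbool.h>

  static inline uint32_t fld4(uint32_t w, int sh) { return (w >> sh) & 0xF; }

  /* Compare every slot that *might* hold a destination in op A (the
     nnnn slot, plus the oooo slot used as Rn by the 1R block) against
     every slot that might hold a register in op B, using only the low
     4 bits.  Any match rejects the pair, even if it is a false alias. */
  bool may_pair_conservative(uint32_t opA, uint32_t opB)
  {
      uint32_t a_dst[2] = { fld4(opA, 20), fld4(opA, 4) };
      uint32_t b_reg[3] = { fld4(opB, 20), fld4(opB, 16), fld4(opB, 4) };

      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 3; j++)
              if (a_dst[i] == b_reg[j])
                  return false;  /* possible alias: fall back to scalar issue */
      return true;
  }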

But, this would likely reject a lot more cases than if the logic
detects which encoding block is in use (and uses this to enable/disable
the various alias checks).
Say:
xxxx-0x00-nnnn-mmmm xxxx-xxxx-oooo-xxxx // (F0/F3/F9)
xxxx-0x00-xxxx-xxxx 0011-xxxx-nnnn-0000 //Virtual layout (F0-1R)
xxxx-0xxx-nnnn-mmmm xxxx-xxxx-xxxx-xxxx // (F1/F2)
xxxx-1x00-xxxx-nnnn xxxx-xxxx-xxxx-xxxx //Virtual layout (F8)

So, the idea is to match these cases and turn them into enable bits for
the alias checks (and maybe at least try to check the full 6-bit register
fields, since once the block is determined, the ambiguity that would
require dropping to 4 bits is eliminated).


With RISC-V, since the destination register doesn't move, and always has
the same bits, no logic is needed to figure out where it could be.
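
For example (the RISC-V formats that produce a result keep rd in bits
11:7, so no format decode is needed to find it; S/B formats simply have
no rd):

  #include <stdint.h>

  static inline uint32_t rv_rd(uint32_t insn)
  {
      return (insn >> 7) & 0x1Fu;   /* rd field, same place in R/I/U/J forms */
  }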

Likely, more logic would be needed to determine which instructions are
valid prefixes and suffixes.


But, would it be worthwhile?
Less certain, as admittedly my compiler mostly already deals with this.
But, it could catch cases where the compiler misses bundling opportunities
that are allowed on the core, or where there is a possible mismatch
between what the core supports and the bundling rules assumed by the
compiler (though this is likely only relevant for higher-end cores; for
low-end cores the strategy of falling back to scalar execution is likely
more sensible).

As-is, doing superscalar in the CPU is unlikely to match the bundling
done by the compiler (since the compiler can do more accurate checks).

Seems to work OK for RISC-V though:
There is basically no other viable option in this case;
The encoding space seems to have been organized reasonably well for
this, making this part easier to pull off.
MitchAlsup1
2024-02-17 22:03:16 UTC
Permalink
But, I am not entirely sure how one would go about implementing it, as
MOV.Q (R4), R16
MOV.Q (R5), R17
ADD 8, R4
ADD 8, R5
PADD.H R16, R17, R18
MOV.Q R18, (R6)
ADD 8, R6
All in a single instruction.
With the proper instruction set, the above is::

VEC R9,{}
LDSH R10,[R1,Ri<<1]
LDSH R11,[R2,Ri<<1]
ADD R12,R10,R11
STH R12,[R3,Ri<<1]
LOOP LT,Ri,#1,Rmax

Once you see that there is no loop recurrence, then the loops can be run
concurrently as wide as you have arithmetic capabilities and cache BW--
in this case we have an arithmetic capability of 4 Halfword ADDs per cycle
and a memory capability of 128-bits every cycle creating a BW of 4×5 inst
every 1.5 cycles or 13.3 IPC and we are memory limited, not arithmetic
limited.
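(A quick back-of-the-envelope check of those numbers in C, assuming 48
bits of memory traffic per element, i.e. two halfword loads plus one
halfword store:)

  #include <stdio.h>

  int main(void)
  {
      double alu_elems = 4.0;            /* 4 halfword ADDs per cycle       */
      double mem_bits  = 128.0;          /* cache bandwidth, bits per cycle */
      double elem_bits = 16 + 16 + 16;   /* two LDSH + one STH per element  */
      double mem_elems = mem_bits / elem_bits;        /* ~2.67 elem/cycle   */
      double elems = (mem_elems < alu_elems) ? mem_elems : alu_elems;
      printf("IPC ~= %.1f\n", elems * 5.0);  /* 5 insts per element: ~13.3  */
      return 0;
  }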
PADD.H R16, R17, R18
You will find the requisite patterns harder to recognize when the memory
reference size is NOT the calculation size. In your case, the calculation
is .H while memory reference is .Q .
BGB
2024-02-18 01:59:19 UTC
Permalink
But, I am not entirely sure how one would go about implementing it, as
   MOV.Q   (R4), R16
   MOV.Q   (R5), R17
   ADD     8, R4
   ADD     8, R5
   PADD.H  R16, R17, R18
   MOV.Q   R18, (R6)
   ADD     8, R6
All in a single instruction.
    VEC     R9,{}
    LDSH    R10,[R1,Ri<<1]
    LDSH    R11,[R2,Ri<<1]
    ADD     R12,R10,R11
    STH     R12,[R3,Ri<<1]
    LOOP    LT,Ri,#1,Rmax
Once you see that there is no loop recurrence, then the loops can be run
concurrently as wide as you have arithmetic capabilities and cache BW--
in this case we have an arithmetic capability of 4 Halfword ADDs per cycle
and a memory capability of 128-bits every cycle creating a BW of 4×5 inst
every 1.5 cycles or 13.3 IPC and we are memory limited, not arithmetic
limited.
Theoretically, yes; but how would the hardware "actually work"?...

Even if it lets the hardware know that a loop exists, how does the
hardware know to schedule these ops in any way more efficient than
simply looping over them?

And, simply looping over them, would not gain much here.


My difficulty is that I can't seem to imagine a mechanism that would
actually pull this off (at least, within the limits of a hardware class
vaguely similar to what I am dealing with).
   PADD.H  R16, R17, R18
You will find the requisite patterns harder to recognize when the memory
reference size is NOT the calculation size. In your case, the calculation
is .H while memory reference is .Q .
PADD.H does 4x Binary16 (64-bit vector), so Q works (also 64-bit).

256x PADD.H would operate on a vector of 1024 elements.



Though, this contrived case does seem to make a possible use-case for
auto-increment addressing.

I would almost entirely dismiss this possibility, apart from the
realization that my performance lead over RV64G (if both are running on
the same pipeline, and RV64G has dual-issue superscalar) is a lot
smaller than I would prefer.

Well, I could always drop back to a 2-cycle ALU and 3-cycle Load to make
RV64G suffer worse in relation; but I am excluding cheap tricks like that,
which merely put RV64G at a disadvantage.

But, I am not quite desperate enough to re-add auto-increment, which,
while it could potentially help performance, is not clearly a good thing
for the ISA design as a whole.


As noted, I did end up needing to break out the "BRxx Rm, Rn, Lbl"
instructions to reclaim the lead in Dhrystone, but even then, the lead
isn't all that large...


There are also now the CMP3R instructions. The basic idea behind CMP3R is
that the CMPxx output that would normally go to SR.T is instead sent to
an output GPR (sorta like "SLT" in RISC-V, but expanded out slightly more).
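
Roughly, in C terms (my reading of the description; the operand order is
an assumption, following the SH convention that "CMP/GT Rm,Rn" sets T
when Rn > Rm):

  #include <stdint.h>

  static int sr_t;   /* stand-in for the SR.T status bit */

  /* Flag form: the compare sets SR.T, which a later instruction
     (branch, MOVT, etc.) then consumes. */
  void cmpgt_flag(int64_t rm, int64_t rn) { sr_t = (rn > rm); }

  /* CMP3R form: the boolean result goes straight to a destination GPR,
     much like RISC-V's SLT, so no separate flag read is needed. */
  int64_t cmpgt_3r(int64_t rm, int64_t rn) { return (rn > rm) ? 1 : 0; }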


It looks like one other case that may need addressing is the current
inability to encode a negative immediate for AND (within a single 32-bit
instruction), which seems to be more common than originally expected:
  y = x & (~3);
And similar.
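
The pattern shows up any time code clears low bits or aligns a pointer
or size downward; a few generic C examples of the shape (illustrative,
not specific compiler output):

  #include <stdint.h>
  #include <stddef.h>

  uint32_t  clear_low2(uint32_t x)    { return x & ~3u; }              /* y = x & (~3)      */
  uintptr_t align_down16(uintptr_t p) { return p & ~(uintptr_t)15; }   /* pointer alignment */
  size_t    round_up8(size_t n)       { return (n + 7) & ~(size_t)7; } /* size rounding     */

  /* In each case the mask has all of its upper bits set (~3 is
     0xFFFFFFFC, and so on), which is exactly the negative-immediate
     form that the 32-bit AND encoding cannot currently hold. */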

I originally assumed this case to be "almost never", but this isn't
quite right (it is a less common case, granted, but can still occur
quite often on the scale of a program), and at the moment requires
either a 64-bit encoding or putting the constant into a temporary
register (and the main spot it might make sense to put this is already
used by the "RSUB" instruction).


There are also inefficiencies due to my compiler failing to realize that
NULL is basically equivalent to a constant 0 (and so does not need to be
handled using temporary registers and the general-purpose mechanisms one
generally needs for "real" pointers). I tried poking at this case before,
but then stuff started breaking.


But, I guess, at least Doom is getting a little bit faster; gradually
spending more of its time up against the 35 fps limiter.

...
Michael S
2024-02-14 09:14:22 UTC
Permalink
On Tue, 13 Feb 2024 19:57:28 +0100
Post by Marcus
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).
Does it mean that you have 8 VRs and each VR is 2048 bits?
Post by Marcus
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix-multiply types of algorithms, then
8 VRs will not take you very far. 16 VRs are a LOT better. More than 16
VRs can help somewhat, but the difference between 32 and 16 (in this
type of kernel) is much, much smaller than the difference between 8 and
16.
Radix-4 and mixed-radix FFT are probably similar, except that I never
profiled them as thoroughly as I did SGEMM.
Post by Marcus
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
If CRAY-style VRs work for you, that is no proof that lighter VRs, e.g. ARM
Helium-style, would not work as well or better.
My personal opinion is that even for low-end in-order cores the
CRAY-like huge ratio between VR width and execution width is far from
optimal. A ratio of 8 looks more optimal when performance of
vectorized loops is a top priority. A ratio of 4 is a wise choice
otherwise.
Marcus
2024-02-15 19:00:20 UTC
Permalink
Post by Michael S
On Tue, 13 Feb 2024 19:57:28 +0100
Post by Marcus
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).
Does it means that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been thinking
about reducing it to 16 vector registers, and finding some clever use for
the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not there yet.

The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).

Each vector element is 32 bits wide.

So, in total: 32 x 16 x 32 bits = 16384 bits

This is, incidentally, exactly the same as for AVX-512.
Post by Michael S
Post by Marcus
My machine can start and finish at most one 32-bit operation on every
clock cycle, so it is very simple. The same thing goes for vector
operations: at most one 32-bit vector element per clock cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards, reduced
I$ pressure and possibly improved cache locality and reduced register
pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities in
the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited number
of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm then
8 VRs would not take you very far. 16 VRs are ALOT better. More than 16
VR can help somewhat, but the difference between 32 and 16 (in this
type of kernels) is much much smaller than difference between 8 and
16.
Radix-4 and mixed-radix FFT are probably similar except that I never
profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that they
would even have floating-point support. It's more a question of simple
loop optimizations - e.g. the kinds you find in libc or software
rasterization kernels. For those you will often get lots of work done
with just four vector registers.
Post by Michael S
Post by Marcus
I also envision that for most cases
you never have to preserve vector registers over function calls. I.e.
there's really no need to push/pop vector registers to the stack,
except for context switches (which I believe should be optimized by
tagging unused vector registers to save on stack bandwidth).
/Marcus
If CRAY-style VRs work for you it's no proof than lighter VRs, e.g. ARM
Helium-style, would not work as well or better.
My personal opinion is that even for low ens in-order cores the
CRAY-like huge ratio between VR width and execution width is far from
optimal. Ratio of 8 looks like more optimal in case when performance of
vectorized loops is a top priority. Ratio of 4 is a wise choice
otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four. That
seems to eliminate most RAW hazards as execution pipelines tend to be at
most four stages long (or thereabout). So, with a pipeline width of 128
bits (which seems to be the go-to width for many implementations), you
want registers that have 4 x 128 = 512 bits, which is one of the reasons
that I mandate at least 512-bit vector registers in MRISC32.

Of course, nothing is set in stone, but so far that has been my
thinking.

/Marcus
Michael S
2024-02-15 21:00:33 UTC
Permalink
On Thu, 15 Feb 2024 20:00:20 +0100
Post by Marcus
Post by Michael S
On Tue, 13 Feb 2024 19:57:28 +0100
Post by Marcus
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).
Does it means that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
Post by Michael S
Post by Marcus
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are ALOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never
profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
Post by Michael S
Post by Marcus
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you it's no proof than lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low ens in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. Ratio of 8 looks like more optimal in case when
performance of vectorized loops is a top priority. Ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the goto width for many
implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
Sounds quite reasonable, but I wouldn't call it "Cray-style".
Marcus
2024-02-16 11:37:55 UTC
Permalink
Post by Michael S
On Thu, 15 Feb 2024 20:00:20 +0100
Post by Marcus
Post by Michael S
On Tue, 13 Feb 2024 19:57:28 +0100
Post by Marcus
Post by Quadibloc
I am very fond of the vector architecture of the Cray I and
similar machines, because it seems to me the one way of
increasing computer performance that proved effective in
the past that still isn't being applied to microprocessors
today.
Mitch Alsup, however, has noted that such an architecture is
unworkable today due to memory bandwidth issues. The one
extant example of this architecture these days, the NEC
SX-Aurora TSUBASA, keeps its entire main memory of up to 48
gigabytes on the same card as the CPU, with a form factor
resembling a video card - it doesn't try to use the main
memory bus of a PC motherboard. So that seems to confirm
this.
FWIW I would just like to share my positive experience with MRISC32
style vectors (very similar to Cray 1, except 32-bit instead of 64-bit).
Does it means that you have 8 VRs and each VR is 2048 bits?
No. MRISC32 has 32 VRs. I think it's too much, but it was the natural
number of registers as I have five-bit vector address fields in the
instruction encoding (because 32 scalar registers). I have been
thinking about reducing it to 16 vector registers, and find some
clever use for the MSB (e.g. '1'=mask, '0'=don't mask), but I'm not
there yet.
The number of vector elements in each register is implementation
defined, but currently the minimum number of vector elements is set to
16 (I wanted to set it relatively high to push myself to come up with
solutions to problems related to large vector registers).
Each vector element is 32 bits wide.
So, in total: 32 x 16 x 32 bits = 16384 bits
This is, incidentally, exactly the same as for AVX-512.
Post by Michael S
Post by Marcus
My machine can start and finish at most one 32-bit operation on
every clock cycle, so it is very simple. The same thing goes for
vector operations: at most one 32-bit vector element per clock
cycle.
Thus, it always feels like using vector instructions would not give
any performance gains. Yet, every time I vectorize a scalar loop
(basically change scalar registers for vector registers), I see a
very healthy performance increase.
I attribute this to reduced loop overhead, eliminated hazards,
reduced I$ pressure and possibly improved cache locality and
reduced register pressure.
(I know very well that VVM gives similar gains without the VRF)
I guess my point here is that I think that there are opportunities
in the very low end space (e.g. in order) to improve performance by
simply adding MRISC32-style vector support. I think that the gains
would be even bigger for non-pipelined machines, that could start
"pumping" the execute stage on every cycle when processing vectors,
skipping the fetch and decode cycles.
BTW, I have also noticed that I often only need a very limited
number of vector registers in the core vectorized loops (e.g. 2-4
registers), so I don't think that the VRF has to be excruciatingly
big to add value to a small core.
It depends on what you are doing.
If you want good performance in matrix multiply type of algorithm
then 8 VRs would not take you very far. 16 VRs are ALOT better.
More than 16 VR can help somewhat, but the difference between 32
and 16 (in this type of kernels) is much much smaller than
difference between 8 and 16.
Radix-4 and mixed-radix FFT are probably similar except that I never
profiled as thoroughly as I did SGEMM.
I expect that people will want to do such things with an MRISC32 core.
However, for the "small cores" that I'm talking about, I doubt that
they would even have floating-point support. It's more a question of
simple loop optimizations - e.g. the kinds you find in libc or
software rasterization kernels. For those you will often get lots of
work done with just four vector registers.
Post by Michael S
Post by Marcus
I also envision that for most cases
you never have to preserve vector registers over function calls.
I.e. there's really no need to push/pop vector registers to the
stack, except for context switches (which I believe should be
optimized by tagging unused vector registers to save on stack
bandwidth).
/Marcus
If CRAY-style VRs work for you it's no proof than lighter VRs, e.g.
ARM Helium-style, would not work as well or better.
My personal opinion is that even for low ens in-order cores the
CRAY-like huge ratio between VR width and execution width is far
from optimal. Ratio of 8 looks like more optimal in case when
performance of vectorized loops is a top priority. Ratio of 4 is a
wise choice otherwise.
For MRISC32 I'm aiming for splitting a vector operation into four.
That seems to eliminate most RAW hazards as execution pipelines tend
to be at most four stages long (or thereabout). So, with a pipeline
width of 128 bits (which seems to be the goto width for many
implementations), you want registers that have 4 x 128 = 512 bits,
which is one of the reasons that I mandate at least 512-bit vector
registers in MRISC32.
Of course, nothing is set in stone, but so far that has been my
thinking.
/Marcus
Sounds quite reasonable, but I wouldn't call it "Cray-style".
Then what would you call it?

I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISAs, GPU-style vector ISAs and STAR-style
memory-memory vector ISAs, etc.

/Marcus
Michael S
2024-02-16 12:04:25 UTC
Permalink
On Fri, 16 Feb 2024 12:37:55 +0100
Post by Marcus
Then what would you call it?
I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISA:s, GPU-style vector ISA:s and STAR-style
memory-memory vector ISA:s, etc.
/Marcus
I'd call it a variant of SIMD.
For me, everything with a vector-register-width to ALU-width ratio <= 4 is
SIMD. 8 is borderline; above 8 is vector.
It means that sometimes I classify by implementation instead of by
architecture, which in theory is problematic. But I don't care, I am not
in academia.
Marcus
2024-02-16 12:27:00 UTC
Permalink
Post by Michael S
On Fri, 16 Feb 2024 12:37:55 +0100
Post by Marcus
Then what would you call it?
I just use the term "Cray-style" to differentiate the style of vector
ISA from explicit SIMD ISA:s, GPU-style vector ISA:s and STAR-style
memory-memory vector ISA:s, etc.
/Marcus
I'd call it a variant of SIMD.
For me everything with vector register width to ALU width ratio <= 4 is
SIMD. 8 is borderline, above 8 is vector.
It means that sometimes I classify by implementation instead of by
architecture which in theory is problematic. But I don't care, I am not
in academy.
Ok, I am generally talking about the ISA, which dictates the semantics
and what kinds of implementations are possible (or at least
feasible).

For my current MRISC32-A1 implementation, the vector register width to
ALU width ratio is 16, so it would definitely qualify as "vector" then.

The ISA is designed, however, to support wider execution; the idea
is to *not require* very wide execution, but rather to encourage sequential
execution (up to the point where things like hazard resolution become
less of a problem and OoO is not really a necessity for high
throughput).

/Marcus