Discussion:
MRISC32 question
Thomas Koenig
2021-05-19 18:49:31 UTC
Permalink
I just looked at the MRISC32 at github, and I have a question.

There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.

Did I miss something?
MitchAlsup
2021-05-19 19:18:41 UTC
Permalink
Post by Thomas Koenig
I just looked at the MRISC32 at github, and I have a question.
There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.
Did I miss something?
ADDH adds the 2 operands and shifts the 33-bit result down by one bit,
so the HoB (high-order bit) of the result is the carry out of an ADD of the same operands.

Less clunky than most..........
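
In C, the trick reads roughly as follows (a sketch of the behavior as
described above; check the MRISC32 documentation for the exact definition):

#include <stdint.h>

uint32_t addh(uint32_t a, uint32_t b)
{
    uint64_t sum33 = (uint64_t)a + (uint64_t)b;  // the full 33-bit sum
    return (uint32_t)(sum33 >> 1);               // bit 31 of the result is the carry out
}

uint32_t carry_of_add(uint32_t a, uint32_t b)
{
    return addh(a, b) >> 31;                     // extract the HoB
}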
Terje Mathisen
2021-05-20 08:53:33 UTC
Permalink
Post by MitchAlsup
Post by Thomas Koenig
I just looked at the MRISC32 at github, and I have a question.
There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.
Did I miss something?
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
thus the HoB of this is the carry of the ADD of the same operands.
Less clunky than most..........
Particularly if they also have "branch_if_negative" or
"predicate_on_negative".

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Quadibloc
2021-05-24 16:33:56 UTC
Permalink
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?

John Savard
MitchAlsup
2021-05-24 17:28:15 UTC
Permalink
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
Post by Quadibloc
John Savard
Thomas Koenig
2021-05-24 18:34:43 UTC
Permalink
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
At the risk of belaboring the _really_ obvious...

Full addition including carry really is an operation of the form

(a,carry_out) = f(b,c,carry_in)

so to model that to its full extent, you need five operands (or
four if carry_in can equal carry_out).

So, if you are restricted to classical three-operand instructions,
you need to hide the carry flags somewhere (like in a register
flag, and then use "add with carry") or use additional instructions.
If you don't want a status register, you have to think of something
else, such as doing the operation twice.

The situation is similar to when you need both the upper and the
lower part of a multiplication.
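
Spelled out in C for 32-bit words, the operation above is (a sketch):

#include <stdint.h>

uint32_t full_add32(uint32_t b, uint32_t c, uint32_t carry_in,
                    uint32_t *carry_out)
{
    uint64_t wide = (uint64_t)b + c + carry_in;  // at most 33 bits
    *carry_out = (uint32_t)(wide >> 32);         // 0 or 1
    return (uint32_t)wide;                       // the 32-bit sum
}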
MitchAlsup
2021-05-24 21:08:41 UTC
Permalink
Post by Thomas Koenig
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
At the danger of belaboring the _really_ obvious...
Full addition including carry really is an operation of
(a,carry_out) = f(b,c,carry_in)
<
{carry,a} = {b,carry}+c;
Post by Thomas Koenig
so to model that to its full extent, you need five operands (or
four if carry_in can equal carry_out).
<
Yes, this is what you get (3-operand, 2-result) with the CARRY instruction-
modifier in My 66000.
<
The CARRY instruction has a Register specifier that designates the register
associated with carry information (both as operand and as result). Then
the CARRY i-m has a 16-bit field that designates the CARRY behavior on the
subsequent 8 instructions. {} means the instruction is not involved with carry,
{I} means the carry is an operand to this instruction, {O} means the carry is
a result of this instruction, and {IO} means carry is both an operand and a result.
Post by Thomas Koenig
So, if you are restricted to classical three-operand instructions,
you need to hide the carry flags somewhere (like in a register
flag, and then use "add with carry") or use additional instructions.
If you don't want a status register, you have to think of something
else, such as doing the operation twice.
<
The CARRY i-m provides the register specifier for a chain of carry calculations.
Post by Thomas Koenig
The situation is similar to when you need both the upper and the
lower part of a multiplication.
<
CARRY provides access to the HoB of integer multiplication {O}, provides
access to integer MAC in short form {I} and wide form {IO}, along with
rotates, shifts-under-shadow-of-CARRY, and multiprecision shifts {IO}; basically
everything that cannot be calculated in 1 result but can be in 2 registers is
accessed via the CARRY i-m.
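
Modeled in C (a sketch of the data flow only, not actual My 66000 syntax,
and shown with 32-bit words for brevity), a CARRY-chained three-word add
with the pattern {O}{IO}{I} computes:

#include <stdint.h>

void add96(const uint32_t a[3], const uint32_t b[3], uint32_t s[3])
{
    uint32_t carry;                                   // models the register named by CARRY
    uint64_t w;
    w = (uint64_t)a[0] + b[0];                        // ADD {O} : produces carry
    s[0] = (uint32_t)w;  carry = (uint32_t)(w >> 32);
    w = (uint64_t)a[1] + b[1] + carry;                // ADD {IO}: consumes and produces carry
    s[1] = (uint32_t)w;  carry = (uint32_t)(w >> 32);
    w = (uint64_t)a[2] + b[2] + carry;                // ADD {I} : consumes carry only
    s[2] = (uint32_t)w;
}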
Marcus
2021-05-24 19:15:23 UTC
Permalink
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
Post by Quadibloc
John Savard
It works, but I can't think of a solution that's as efficient as SLTU.

Example calculating s6:s5 = s2:s1 + s4:s3

; Using SLTU (4 instructions)
add s5,s1,s3
add s6,s2,s4
sltu s7,s5,s1 ; s7=-1 if carry
sub s6,s6,s7

; Using ADDH + branch (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
bge s7,1f ; Skip addition if s7>=0
add s6,s6,#1
1:

; Using ADDH + shift (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
lsr s7,s7,#31 ; Shift carry bit into bit #0
add s6,s6,s7

; Using ADDH + SLT (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
slt s7,s7,#0 ; s7=-1 if carry
sub s6,s6,s7
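
For reference, the SLTU variant modeled in C (a sketch; the variable names
follow the register names above, and the compare yields -1 for TRUE as in
MRISC32):

#include <stdint.h>

void add64_via_sltu(uint32_t s1, uint32_t s2, uint32_t s3, uint32_t s4,
                    uint32_t *s5, uint32_t *s6)
{
    *s5 = s1 + s3;                          // add  s5,s1,s3
    *s6 = s2 + s4;                          // add  s6,s2,s4
    uint32_t s7 = (*s5 < s1) ? ~0u : 0u;    // sltu s7,s5,s1  (-1 if carry)
    *s6 = *s6 - s7;                         // sub  s6,s6,s7  (subtracting -1 adds 1)
}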

/Marcus
MitchAlsup
2021-05-24 21:15:36 UTC
Permalink
Post by Marcus
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
Post by Quadibloc
John Savard
It works, but I can't think of a solution that's as efficient as SLTU.
Example calculating s6:s5 = s2:s1 + s4:s3
; Using SLTU (4 instructions)
add s5,s1,s3
add s6,s2,s4
sltu s7,s5,s1 ; s7=-1 if carry
sub s6,s6,s7
<
; using My 66000 2-calculation instructions, 1 instruction-modifier.
CARRY R6,{{O}{I}}
ADD R5,R1,R3 // R6 gets the carry out of this instruction
ADD R6,R2,R4 // R6 provides carry in for this instruction
<
The CARRY instruction does not actually execute; it provides additional semantic information
to other instructions which DO execute. In narrow, small implementations it may actually
execute (pass through the pipeline); in larger ones it surely will not.
Post by Marcus
; Using ADDH + branch (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
bge s7,1f ; Skip addition if s7>=0
add s6,s6,#1
; Using ADDH + shift (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
lsr s7,s7,#31 ; Shift carry bit into bit #0
add s6,s6,s7
; Using ADDH + SLT (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
slt s7,s7,#0 ; s7=-1 if carry
sub s6,s6,s7
/Marcus
Terje Mathisen
2021-05-25 06:15:34 UTC
Permalink
Post by MitchAlsup
Post by Marcus
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
Post by Quadibloc
John Savard
It works, but I can't think of a solution that's as efficient as SLTU.
Example calculating s6:s5 = s2:s1 + s4:s3
; Using SLTU (4 instructions)
add s5,s1,s3
add s6,s2,s4
sltu s7,s5,s1 ; s7=-1 if carry
sub s6,s6,s7
<
; using My 66000 2-calculation instructions, 1 instruction-modifier.
CARRY R6,{{O}{I}}
ADD R5,R1,R3 // R6 gets the carry out of this instruction
ADD R6,R2,R4 // R6 provides carry in for this instruction
I would strongly prefer using CARRY with {IO} for the last op, since
that will leave any outgoing carry in R6, where it can either be used
inline as a check & branch to handle wraparound, or when inside a
debugger, you could use it as a combined trap on the next instruction:

Stop here if R6 != 0
Post by MitchAlsup
The CARRY instruction does not actually execute, it provides additional semantic information
to other instructions which DO execute. In narrow, small implementations, it may actually
execute (pass through the pipeline), in larger ones it surely will not.
Post by Marcus
; Using ADDH + branch (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
bge s7,1f ; Skip addition if s7>=0
add s6,s6,#1
I have only seen one place where it makes sense to use branching to
handle carry propagation, even on architectures with a visible carry flag
and ADD/ADC instructions:

When doing bigint/arbitrary-precision work, where the input to be added can
have fewer terms than the accumulator, I would consider propagating the
carry blindly for just one extra term, and then branch past the rest
unless there was another outgoing carry:

With 64-bit terms, the odds of a random "ADC term,0" actually creating a
new carry are very small indeed, so unless constant time is needed,
use a branch:

#include <stdint.h>
#include <immintrin.h>      // _addcarryx_u64(): x86 ADX intrinsic (-madx on GCC/Clang)

unsigned long_add(uint64_t acc[], unsigned alen,
                  uint64_t b[], unsigned blen)
{
    unsigned char c = 0;
    unsigned long long sum;
    unsigned i;
    for (i = 0; i < blen; i++) {        // acc[] += b[] with carry chaining
        c = _addcarryx_u64(c, acc[i], b[i], &sum);
        acc[i] = sum;
    }
    while (c && i < alen) {             // ripple any final carry up the accumulator
        c = _addcarryx_u64(c, acc[i], 0, &sum);
        acc[i] = sum;
        i++;
    }
    return alen + c;
    // Output just overflowed if c != 0, so the caller must
    // then increase the size of the accumulator array and
    // put a '1' in the new top bin.
}

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup
2021-05-25 15:58:56 UTC
Permalink
Post by Terje Mathisen
Post by MitchAlsup
Post by Marcus
Post by MitchAlsup
Post by Quadibloc
Post by MitchAlsup
ADDH adds the 2 operands and shifts the 33-bit word down by one bit
...doesn't that mean that you will have to do the addition _twice_ to get
the carry as well as the full 32-bit sum?
<
Yes, once to get carry, once to get normal sum.
Post by Quadibloc
John Savard
It works, but I can't think of a solution that's as efficient as SLTU.
Example calculating s6:s5 = s2:s1 + s4:s3
; Using SLTU (4 instructions)
add s5,s1,s3
add s6,s2,s4
sltu s7,s5,s1 ; s7=-1 if carry
sub s6,s6,s7
<
; using My 66000 2-calculation instructions, 1 instruction-modifier.
CARRY R6,{{O}{I}}
ADD R5,R1,R3 // R6 gets the carry out of this instruction
ADD R6,R2,R4 // R6 provides carry in for this instruction
<
Post by Terje Mathisen
I would strongly prefer using CARRY with {IO} for the last op, since
that will leave any outgoing carry in R6, where it can either be used
inline as a check & branch to handle wraparound, or when inside a
<
When you do not absorb the final width result, and the operation is signed,
the machine performs overflow checking (for free) and in most cases
overflow checking checks for any significance in the undelivered result.
<
And besides, you can specify {O} as you please.
Post by Terje Mathisen
Stop here if R6 != 0
<
Post by Terje Mathisen
Post by MitchAlsup
The CARRY instruction does not actually execute, it provides additional semantic information
to other instructions which DO execute. In narrow, small implementations, it may actually
execute (pass through the pipeline), in larger ones it surely will not.
Post by Marcus
; Using ADDH + branch (5 instructions)
add s5,s1,s3
add s6,s2,s4
addh s7,s1,s3
bge s7,1f ; Skip addition if s7>=0
add s6,s6,#1
I have only seen one place where it makes sense to use branching to
handle carry propagation, even on architectures with visible carry flag
When doing bigint/arbitrary precision and the input to be added can have
less terms than the accumulator, then I would consider propagating the
carry blindly for just one extra term, and then branch past the rest
With 64-bit terms, the odds of a random ADC term,0 actually creating a
new carry is very small indeed, so unless constant time is needed then
uint64_t *long_add(uint64_t acc[], unsigned alen,
uint64_t b[], unsigned blen)
{
carry_type c = 0;
unsigned i;
for (i = 0; i < blen; i++) {
c = _addcarryx_u64(c, acc[i], b[i], &acc[i]);
{c,acc[i]} = {acc[i],c}+b[i];
Post by Terje Mathisen
}
while (c && i < alen) {
c = _addcarryx_u64(c, acc[i], 0, &acc[i]);
{c,acc[i]}=acc[i]+c;
Post by Terje Mathisen
}
return alen + c;
// Output just overflowed if c != 0, so the caller must
// then increase the size of the accumulator array and
// put a '1' in the new top bin.
}
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Bernd Linsel
2021-05-19 19:26:01 UTC
Permalink
Post by Thomas Koenig
I just looked at the MRISC32 at github, and I have a question.
There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.
Did I miss something?
You don't need flags to perform multi-word arithmetic; they are just a
convenient (and perhaps more performant) way to do it.

Addition: A carry occurred if the sum is less than either of the two
addends.

Subtraction: A borrow occurred if the difference is greater than the
minuend (equivalently, if the subtrahend is greater than the minuend).
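
In C, for 32-bit words, these checks look like this (a sketch):

#include <stdint.h>

uint32_t add_c(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint32_t sum = a + b;      // wraps modulo 2^32
    *carry = (sum < a);        // carry out iff the sum wrapped
    return sum;
}

uint32_t sub_b(uint32_t a, uint32_t b, uint32_t *borrow)
{
    uint32_t diff = a - b;
    *borrow = (diff > a);      // borrow iff the wrapped difference exceeds the minuend
    return diff;               // (equivalently, iff b > a)
}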

--
Bernd
MitchAlsup
2021-05-19 19:58:09 UTC
Permalink
Post by Bernd Linsel
Post by Thomas Koenig
I just looked at the MRISC32 at github, and I have a question.
There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.
Did I miss something?
You don't need flags to perform multi-word arithmetics, they are just a
convenient (and perhaps, more performant) way to do them.
<
And also, perhaps not !
Post by Bernd Linsel
Addition: A carry occurred, if the sum is less than either of the two
addends.
Subtraction: A borrow occured, if the difference is greater than the
subtrahend.
--
Bernd
Marcus
2021-05-20 11:53:10 UTC
Permalink
Post by Bernd Linsel
Post by Thomas Koenig
I just looked at the MRISC32 at github, and I have a question.
There is no register flag, and offhand I do not see an
easy way to generate a carry for a multi-word addition.
Did I miss something?
You don't need flags to perform multi-word arithmetics, they are just a
convenient (and perhaps, more performant) way to do them.
Addition: A carry occurred, if the sum is less than either of the two
addends.
Subtraction: A borrow occured, if the difference is greater than the
subtrahend.
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.

add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
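
This is the kind of source that the sequence above implements on a 32-bit
target (a sketch):

#include <stdint.h>

uint64_t add_u64(uint64_t a, uint64_t b)
{
    return a + b;   // low-word add, carry via SLTU, high-word add, carry adjust
}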

I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.

/Marcus
Post by Bernd Linsel
--
Bernd
Thomas Koenig
2021-05-20 16:11:43 UTC
Permalink
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.

A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.

This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).

Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.

All of that only makes sense if you expect a lot of 64-bit or wider
arithmetic on your architecture. So, choices, choices...
Marcus
2021-05-20 18:45:36 UTC
Permalink
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture. So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).

That is one reason why MRISC32 does not have a flags register - it's
simply a bad match for vector operations. It's also the reason for
having the compare-instructions return 0 for FALSE and -1 (i.e.
all bits set) for TRUE: it works well for masks in vector operations,
e.g. in combination with the "SEL" instruction that does bitwise
selection, which can be used for implementing conditional selection
in scalar operations, vector operations and even packed operations,
e.g:

SLT.B V3,V1,V2 ; Byte-wise "set if less than", "V3 = V1 < V2"
SEL V3,V4,V5 ; Bit-wise select V4 (if true) or V5 (if false)

...which will select each byte from V4 or V5 depending on whether the
corresponding (signed) byte in V1 is less than the corresponding byte in V2.

Obviously you can do the same thing for regular scalar word-size
(unpacked) integers too:

SLT S3,S1,S2
SEL S3,S4,S5
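
A bitwise model of SEL in C (a sketch of my reading of the examples above;
in the instruction the destination register supplies the mask, here it is
passed explicitly):

#include <stdint.h>

uint32_t sel(uint32_t mask, uint32_t t, uint32_t f)
{
    return (mask & t) | (~mask & f);   // per bit: mask ? t : f
}

// e.g. sel(slt_result, s4, s5) picks s4 where the compare was TRUE (-1)
// and s5 where it was FALSE (0).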

I have still not decided on what carry mechanism to use (the current
state of affairs is that it already works - but with a slight
overhead compared to the optimal two-instruction sequence). As you
said it depends on how frequent 64-bit operations are. Actually, as
I aim to one day create a 64-bit version of the ISA ("MRISC64"),
this is one of the areas where I do not want to over-optimize for
32-bit ALU operations ATM.

/Marcus
Thomas Koenig
2021-05-22 08:54:14 UTC
Permalink
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture. So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).
That would mean that you would need an additional carry bit for each
vector register.

Adding a carry bit for each SIMD sub-register (which you also have)
would probably not be warranted. If users want to do addition
which could create a carry, they could just use the next wider
type.
Post by Marcus
That is one reason why MRISC32 does not have a flags register - it's
simply a bad match for vector operations. It's also the reason for
having the compare-instructions return 0 for FALSE and -1 (i.e.
all bits set) for TRUE: it works well for masks in vector operations,
e.g. in combination with the "SEL" instruction that does bitwise
selection, which can be used for implementing conditional selection
in scalar operations, vector operations and even packed operations,
[...]

Sounds like a good idea.
Post by Marcus
I have still not decided on what carry mechanism to use (the current
state of affairs is that it already works - but with a slight
overhead compared to the optimal two-instruction sequence). As you
said it depends on how frequent 64-bit operations are.
Public-key cryptography depends on multi-precision integers, so
having an efficient implementation could well be a plus.
Ivan Godard
2021-05-22 09:29:55 UTC
Permalink
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture. So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
Marcus
2021-05-23 18:08:47 UTC
Permalink
Post by Ivan Godard
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
    add    s3,s1,s3
    sltu    s1,s3,s1
    add    s2,s2,s4
    sub    s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture.  So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
As a SW developer, I'm fed up with having to deal with differences
between scalar and SIMD. For instance in some of the code that I have
worked with it is common to use scalar operations for corner cases that
can not easily be handled in SIMD (this may include memory & array
alignment and similar), and it's really annoying when the result of
your algorithm depends on whether or not data goes via the scalar or
the SIMD code. E.g. writing unit tests that cover all possible cases
of buffer alignment etc. is a PITA.

So yes, having scalar and vector operations that produce exactly the
same results was high priority for me.

(it should also lower the threshold for compiler auto-vectorization)
MitchAlsup
2021-05-23 19:06:32 UTC
Permalink
Post by Marcus
Post by Ivan Godard
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
As a SW developer, I'm fed up with having to deal with differences
between scalar and SIMD. For instance in some of the code that I have
worked with it is common to use scalar operations for corner cases that
can not easily be handled in SIMD (this may include memory & array
alignment and similar), and it's really annoying when the result of
your algorithm depends on whether or not data goes via the scalar or
the SIMD code. E.g. writing unit tests that cover all possible cases
of buffer alignment etc. is a PITA.
So yes, having scalar and vector operations that produce exactly the
same results was high priority for me.
<
You would enjoy using VVM.
Post by Marcus
(it should also lower the threshold for compiler auto-vectorization)
<
Vectorization should vectorize ANY loop that a compiler can compile
into 16 or fewer SCALAR instructions.
<
Vectorization should not sacrifice any access to exception handling,
either.
Ivan Godard
2021-05-24 00:01:59 UTC
Permalink
Post by MitchAlsup
Post by Marcus
Post by Ivan Godard
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
As a SW developer, I'm fed up with having to deal with differences
between scalar and SIMD. For instance in some of the code that I have
worked with it is common to use scalar operations for corner cases that
can not easily be handled in SIMD (this may include memory & array
alignment and similar), and it's really annoying when the result of
your algorithm depends on whether or not data goes via the scalar or
the SIMD code. E.g. writing unit tests that cover all possible cases
of buffer alignment etc. is a PITA.
So yes, having scalar and vector operations that produce exactly the
same results was high priority for me.
<
You would enjoy using VVM.
Post by Marcus
(it should also lower the threshold for compiler auto-vectorization)
<
Vectorization should vectorize ANY loop that a compiler can compile
into 16 or fewer SCALAR instructions.
Why have any limit?
Post by MitchAlsup
Vectorization should not sacrifice any access to exception handling,
either.
MitchAlsup
2021-05-24 01:04:13 UTC
Permalink
Post by Ivan Godard
Post by MitchAlsup
Post by Marcus
Post by Ivan Godard
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
As a SW developer, I'm fed up with having to deal with differences
between scalar and SIMD. For instance in some of the code that I have
worked with it is common to use scalar operations for corner cases that
can not easily be handled in SIMD (this may include memory & array
alignment and similar), and it's really annoying when the result of
your algorithm depends on whether or not data goes via the scalar or
the SIMD code. E.g. writing unit tests that cover all possible cases
of buffer alignment etc. is a PITA.
So yes, having scalar and vector operations that produce exactly the
same results was high priority for me.
<
You would enjoy using VVM.
Post by Marcus
(it should also lower the threshold for compiler auto-vectorization)
<
Vectorization should vectorize ANY loop that a compiler can compile
into 16 or fewer SCALAR instructions.
Why have any limit?
<
This gets at the heart of what vectorization is: Is vectorization to be applied
at the instruction level, creating a mass of state, or should vectorization
be applied at the loop level, retaining all of the scalar properties we know
and love--like precise interrupts--with essentially no increase in state (6 bits)?
<
In the case one chooses to vectorize loops, there is a certain number of
instructions that can be "in" the machine at any point in time (sometimes
known as the size of the execution window), and if the loop exceeds
this number the performance of the loop may be degraded.
<
When one chooses to vectorize loops, one ends up vectorizing (very) many
of the kinds of things that programs do almost all the time--structure
assignments, string handling, memory manipulation, general purpose calculations
of all sorts--which are not often treated as first class citizens of "vector
machines".
<
But Ivan is correct in that <any given number> is artificial--but so is the length
of the vector register artificial, and the loss of precise exceptions burdensome.
<
In the case of My 66000 if the loop is larger than the size of the execution
window, the front of the loop can be performed as wide as the implementation
provides resources, and beyond this amount, the remaining portion might run
closer to scalar performance.
<
Post by Ivan Godard
Post by MitchAlsup
Vectorization should not sacrifice any access to exception handling,
either.
Ivan Godard
2021-05-24 02:47:21 UTC
Permalink
Post by MitchAlsup
Post by Ivan Godard
Post by MitchAlsup
Post by Marcus
Post by Ivan Godard
Excellent. That way you don't get people complaining when their code
that "worked" in scalar gets overflows when the vectorizing is enabled
in the compiler.
As a SW developer, I'm fed up with having to deal with differences
between scalar and SIMD. For instance in some of the code that I have
worked with it is common to use scalar operations for corner cases that
can not easily be handled in SIMD (this may include memory & array
alignment and similar), and it's really annoying when the result of
your algorithm depends on whether or not data goes via the scalar or
the SIMD code. E.g. writing unit tests that cover all possible cases
of buffer alignment etc. is a PITA.
So yes, having scalar and vector operations that produce exactly the
same results was high priority for me.
<
You would enjoy using VVM.
Post by Marcus
(it should also lower the threshold for compiler auto-vectorization)
<
Vectorization should vectorize ANY loop that a compiler can compile
into 16 or fewer SCALAR instructions.
Why have any limit?
<
This gets at the heart of what vectorization is: Is vectorization to be applied
at the instruction level and creating a mass of state, or should vectorization
be applied at the loop level and retain all of the scalar properties we know
and love--like precise interrupts--with essentially no increase in state (6-bits).
<
In the case one chooses to vectorize loops, there is a certain number of
instructions that can be "in" the machine at any point in time (sometimes
known of as the size of the execution window) and if the loop exceeds
this number the performance of the loop may be degraded.
<
When one chooses to vectorize loops, one ends up vectorizing (very) many
of the kinds of things that programs do almost all the time--structure assign-
ments, string handling, memory manipulation, general purpose calculations
of all sorts--which are not often treated as first class citizens of "vector
machines"
<
But Ivan is correct in that <any given number> is artificial--but so is the length
of the vector register artificial, and the loss of precise exceptions burdensome.
<
In the case of My 66000 if the loop is larger than the size of the execution
window, the front of the loop can be performed as wide as the implementation
provides resources, and beyond this amount, the remaining portion might run
closer to scalar performance.
So by recognizing a loop in hardware the vectors are limited to the
hardware recognition capacity, while if recognized by software there is
no such limit because software recognition is unbounded. The drawback to
software recognition is that it is sensitive to the eventual hardware
capacity which must be taken into account in the generated code, while
the hardware recognition can use code that is relatively agnostic,
scalar even in your case. To achieve hardware's agnostic code, the
software code must be redone at install time, which has complications
due to heterogeneous machines, work flow, and the like.

Any reason the two approaches can't be blended? Software hooking several
hardware pipes together end-to-end, or something?
Anton Ertl
2021-05-24 13:29:30 UTC
Permalink
Post by Ivan Godard
So by recognizing a loop in hardware the vectors are limited to the
hardware recognition capacity, while if recognized by software there is
no such limit because software recognition is unbounded. The drawback to
software recognition being that it is sensitive to the eventual hardware
capacity which must be taken into account in the generated code, while
the hardware recognition can use code that is relatively agnostic,
scalar even in your case.
In my experiments with vectorization, the number of SIMD register
names limited the amount of "software vector chaining" that could be
performed. It does not help if the software could have chained more,
but the register names are not there. And register names are not easy
to add, while, as far as I understand, VVM is almost transparent, and
I hope that, like OoO, it can be extended to more capable hardware
without requiring architecture changes.
Post by Ivan Godard
Any reason the two approaches can't be blended? Software hooking several
hardware pipes together end-to-end, or something?
Software might recognize a loop that may profit from VVM, and split it
into several loops that are within the target hardware's VVM limit.
Of course, you cannot split all loops, just like you cannot fuse all
loops.

But it seems to me that if this is a significant problem, it's better
to throw more hardware at it than to work around it in software.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Marcus
2021-05-24 19:38:17 UTC
Permalink
Post by Anton Ertl
Post by Ivan Godard
So by recognizing a loop in hardware the vectors are limited to the
hardware recognition capacity, while if recognized by software there is
no such limit because software recognition is unbounded. The drawback to
software recognition being that it is sensitive to the eventual hardware
capacity which must be taken into account in the generated code, while
the hardware recognition can use code that is relatively agnostic,
scalar even in your case.
In my experiments with vectorization, the number of SIMD register
names limited the amount of "software vector chaining" that could be
performed. It does not help if the software could have chained more,
but the register names are not there. And register names are not easy
to add, while, as far as I understand, VVM is almost transparent, and
I hope that, like OoO, it can be extended to more capable hardware
without requiring architecture changes.
Post by Ivan Godard
Any reason the two approaches can't be blended? Software hooking several
hardware pipes together end-to-end, or something?
Software might recognize a loop that may profit from VVM, and split it
into several loops that are within the target hardware's VVM limit.
Of course, you cannot split all loops, just like you cannot fuse all
loops.
But it seems to me that if this is a significant problem, it's better
to throw more hardware at it than to work around it in software.
- anton
The MRISC32 ISA provides 31 vector registers, which sets an upper limit
on how complex a loop you can vectorize without spilling registers to
memory.

On the other hand the hardware implementation is free to select what
level of parallelism it wants (and how deep its pipelines are),
since the ISA does not impose any upper limit on how many
elements the vector registers can have, and vector operations can be
processed in any parallel x serial configuration. E.g. if an
implementation uses vector registers with 16 elements each, it can
choose to implement a vector operation as one of:

* 1 lane x 16 steps (this is what the MRISC32-A1 does)
* 2 lanes x 8 steps
* 4 lanes x 4 steps
* 8 lanes x 2 steps
* 16 lanes x 1 step

Of course, an implementation can use other vector register sizes too,
and the same compiled binary will work on any configuration - no
recompile necessary.
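
A sketch of that lanes-x-steps freedom for a single vector add, with VL
elements per register and LANES hardware lanes (both names are illustrative,
and LANES is assumed to divide VL):

#include <stdint.h>

void vadd(const int32_t *a, const int32_t *b, int32_t *dst, int vl, int lanes)
{
    for (int step = 0; step < vl / lanes; step++)
        for (int lane = 0; lane < lanes; lane++) {
            int i = step * lanes + lane;   // the 'lanes' inner iterations execute in parallel
            dst[i] = a[i] + b[i];
        }
}

// e.g. vadd(a, b, d, 16, 4) corresponds to the "4 lanes x 4 steps" case above.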

I think that there are pros and cons with all vector solutions. As
a SW developer I prefer the MRISC32 model over any of the contemporary
SIMD ISA:s, but I can also see clear benefits of VVM.

/Marcus
Marcus
2021-05-23 17:58:49 UTC
Permalink
Post by Thomas Koenig
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture. So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).
That would mean that you would need an additional carry bit for each
vector register.
Adding a carry bit for each SIMD sub-register (which you also have)
would probably not be warranted.
Yes, I don't think that adding extra bits like that works well for my
vector & packed data model.
Post by Thomas Koenig
If users want to do addition
which could create a carry, they could just use the next wider
type.
Not necessarily. Expanding types comes with a cost - you have to use
"unpack" instructions e.g. to decompose one vector register of byte
sized data into two vector registers consisting of half-word sized
data. Whether or not this cost is higher or lower than using carry
arithmetic depends on the situation.

Also, I really prefer having equivalent functionality at all levels
(scalar vs vector vs packed), so if I add functionality for dealing
with carry for scalar I will most likely make it work for vector and
packed data too.
Post by Thomas Koenig
Post by Marcus
That is one reason why MRISC32 does not have a flags register - it's
simply a bad match for vector operations. It's also the reason for
having the compare-instructions return 0 for FALSE and -1 (i.e.
all bits set) for TRUE: it works well for masks in vector operations,
e.g. in combination with the "SEL" instruction that does bitwise
selection, which can be used for implementing conditional selection
in scalar operations, vector operations and even packed operations,
[...]
Sounds like a good idea.
Post by Marcus
I have still not decided on what carry mechanism to use (the current
state of affairs is that it already works - but with a slight
overhead compared to the optimal two-instruction sequence). As you
said it depends on how frequent 64-bit operations are.
Public-key cryptography depends on multi-precision integers, so
having an efficient implementation could well be a plus.
That's a good point.
MitchAlsup
2021-05-23 19:04:27 UTC
Permalink
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Post by Thomas Koenig
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
A few weeks ago, a suggestion that registers could have extra bits,
or that you could have per-register status bits, was discussed here.
A carry bit for each register, with an extra "add with carry"
instruction, would cut this down to two.
Yes, I followed that discussion from a distance.
Post by Thomas Koenig
This could also be implemented as an extra 32-bit condition register
(this would probably be easier to model in gcc).
Of course, if you go down that route, you might also do the same
for signed overflow, another 32-bit register.
All of that only makes sense if you expect a lot of 64-bit
or arithmetic on your architecture. So, choices, choices...
One of the things that influences my ISA choices is that I want all
instructions to have the same behavior for vector registers as for
scalar registers (excluding branches that are purely scalar
operations).
That would mean that you would need an additional carry bit for each
vector register.
Adding a carry bit for each SIMD sub-register (which you also have)
would probably not be warranted.
Yes, I don't think that adding extra bits like that works well for my
vector & packed data model.
Post by Thomas Koenig
If users want to do addition
which could create a carry, they could just use the next wider
type.
Not necessarily. Expanding types comes with a cost - you have to use
"unpack" instructions e.g. to decompose one vector register of byte
sized data into two vector registers consisting of half-word sized
data. Whether or not this cost is higher or lower than using carry
arithmetic depends on the situation.
<
Captain Obvious would like to point out that this is a better argument
that vector instruction sets were misdesigned than an argument about
adds with carry.
Post by Marcus
Also, I really prefer having equivalent functionality at all levels
(scalar vs vector vs packed), so if I add functionality for dealing
with carry for scalar I will most likely make it work for vector and
packed data too.
<
Which is why VVM is the way it is, even to the point of having precise
exceptions while operating in vector mode.
Post by Marcus
Post by Thomas Koenig
Post by Marcus
That is one reason why MRISC32 does not have a flags register - it's
simply a bad match for vector operations. It's also the reason for
having the compare-instructions return 0 for FALSE and -1 (i.e.
all bits set) for TRUE: it works well for masks in vector operations,
e.g. in combination with the "SEL" instruction that does bitwise
selection, which can be used for implementing conditional selection
in scalar operations, vector operations and even packed operations,
[...]
Sounds like a good idea.
Post by Marcus
I have still not decided on what carry mechanism to use (the current
state of affairs is that it already works - but with a slight
overhead compared to the optimal two-instruction sequence). As you
said it depends on how frequent 64-bit operations are.
Public-key cryptography depends on multi-precision integers, so
having an efficient implementation could well be a plus.
That's a good point.
Anton Ertl
2021-05-20 17:39:23 UTC
Permalink
Post by Marcus
Correct. MRISC32 currently does not have a short-hand solution for
dealing with carry, but relies on the standard 4-instruction sequence
that is generated by GCC, for instance.
add s3,s1,s3
sltu s1,s3,s1
add s2,s2,s4
sub s2,s2,s1
I will probably look into more optimal ways of implementing carry.
The My66000 solution is interesting. Another possibility would be
to use some sort of 3-source ADD instruction, which could reduce
the sequence to three instructions instead of four.
Another option is the n+1st bit we discussed here recently
<***@mips.complang.tuwien.ac.at> ff. With that you can
have a two-instruction sequence for full-blown carry-in-carry-out
addition if you only support two read operands, or one add-with-carry
instruction if you support three read operands.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>