Discussion:
"Mini" tags to reduce the number of op codes
Stephen Fuld
2024-04-03 16:43:44 UTC
There has been discussion here about the benefits of reducing the number
of op codes. One reason not mentioned before is if you have fixed
length instructions, you may want to leave as many codes as possible
available for future use. Of course, if you are doing a 16-bit
instruction design, where instruction bits are especially tight, you may
save enough op codes to free up a bit, perhaps allowing a larger register
specifier field, or allowing more instructions in the smaller subset.

It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory. I worked through this idea
using the My 6600 as an example “substrate” for two reasons. First, it
has several features that are “friendly” to the idea. Second, I know
Mitch cares about keeping the number of op codes low.

Please bear in mind that this is just the germ of an idea. It is
certainly not fully worked out. I present it here to stimulate
discussions, and because it has been fun to think about.

The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.

When executing arithmetic instructions, if the tag bits of both sources
of an instruction are the same, do the appropriate operation (floating
or integer), and set the tag bit of the result register appropriately.
If the tag bits of the two sources are different, I see several
possibilities.

1. Generate an exception.
2. Use the sense of source 1 for the arithmetic operation, but perform
the appropriate conversion on the second operand first, potentially
saving an instruction
3. Always do the operation in floating point and convert the integer
operand prior to the operation. (Or, if you prefer, change floating
point to integer in the above description.)
4. Same as 2 or 3 above, but don’t do the conversions.  I suspect this
is the least useful choice.

I am not sure which is the best option.
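
As a concrete illustration of option 3, with the shared ADD opcode and the
hypothetical tag behavior sketched above:

    // tag[1] = 1 (R1 holds a double), tag[2] = 0 (R2 holds an integer)
    ADD   R3,R1,R2    // tags differ: convert R2 to double, perform an FADD,
                      // and set tag[3] = 1 on the result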

Given that, use the same op code for the floating-point and fixed-point
versions of the same operations.  That lets eight op codes be shared: the
four arithmetic operations, max, min, abs and compare.  Allowing for the
two new load instructions, that is so far a net savings of six op codes.

But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same
op-code. There may be several more of these.

I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
tag value for callee saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring, in the same data structure they use for the registers
(yes, it adds 32 bits to that structure – minimal cost).  The same
mechanism works for interrupts that take control away from a running
process.
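
A rough C sketch of what the modified save area might look like; the
layout is purely illustrative, not the actual My 66000 ENTER/EXIT format:

    #include <stdint.h>

    struct enter_save_area {        /* hypothetical layout */
        uint32_t saved_tags;        /* one tag bit per register: the extra 32 bits */
        uint64_t saved_regs[32];    /* the registers ENTER was asked to save */
    };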

I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code. For
example, Oring a register with itself could be used to set the tag bit
and Oring a register with zero could clear it. These should be pretty rare.
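
For instance (hypothetical encodings, following the repurposing described
above):

    OR    R7,R7,R7    // R7 unchanged; repurposed to set tag[7] (mark as FP)
    OR    R7,R7,#0    // R7 unchanged; repurposed to clear tag[7] (mark as non-FP)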

That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use? If it causes an
extra cycle per instruction, then it is almost certainly not worth it.
IANAHG, so I don’t know. But even if it doesn’t cost any performance, I
think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).

Anyway, it has been fun thinking about this, so I hope you don’t mind
the, probably too long, post.
Any comments are welcome.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Anton Ertl
2024-04-03 17:24:05 UTC
Post by Stephen Fuld
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.
...
Post by Stephen Fuld
But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same
op-code. There may be several more of these.
Certainly makes reading disassembler output fun (or writing the
disassembler). This reminds me of the work on SafeTSA [amme+01] where
they encode only programs that are correct (according to some notion
of correctness).
Post by Stephen Fuld
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
tag value for callee saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost).
That's expensive in an OoO CPU.  There you want each tag to be stored
alongside the other 64 bits of the register, because they should
be renamed at the same time.  So the ENTER instruction would depend on
all the registers that it saves (or maybe on all registers).  And upon
EXIT the restored registers have to be reassembled (which is not that
expensive).

I have a similar problem for the carry and overflow bits in
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
Post by Stephen Fuld
The same
mechanism works for interrupts that take control away from a running
process.
For context switches one cannot get around the problem, but they are
much rarer than calls and returns, so requiring a pipeline drain for
them is not so bad.

Concerning interrupts, as long as nesting is limited, one could just
treat the physical registers of the interrupted program as taken, and
execute the interrupt with the remaining physical registers. No need
to save any architectural registers or their tag, carry, or overflow
bits.
Post by Stephen Fuld
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?
In an OoO CPU, that's pretty heavy.

But actually, your idea does not need any computation results for
determining the tag bits of registers (except during EXIT), so you
probably can handle the tags in the front end (decoder and renamer).
Then the tags are really separate and not part of the registers that
have to be renamed, and you don't need to perform any waiting on
ENTER.

However, in EXIT the front end would have to wait for the result of
the load/store unit loading the 32 bits, unless you add a special
mechanism for that. So EXIT would become expensive, one way or the
other.

@InProceedings{amme+01,
author = {Wolfram Amme and Niall Dalton and Jeffery von Ronne
and Michael Franz},
title = {Safe{TSA}: A Type Safe and Referentially Secure
Mobile-Code Representation Based on Static Single
Assignment Form},
crossref = {sigplan01},
pages = {137--147},
annote = {The basic ideas in this representation are:
variables are named as the pair (distance in the
dominator tree, assignment within basic block);
variables are separated by type, with operations
referring only to variables of the right type (like
integer and FP instructions and registers in
assemblers); memory references use types to encode
that a null-pointer check and/or a range check has
already occurred, allowing optimizing these
operations; the resulting code is encoded (using
text compression methods) in a way that supports
only correct code. These ideas are discussed mostly
in a general way, with some Java-specifics, but the
representation supposedly also supports Fortran95
and Ada95. The representation supports some CSE, but
not for address computation operations. The paper
also gives numbers on size (usually a little smaller
than Java bytecode), and some other static metrics,
especially wrt. the effect of optimizations.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup1
2024-04-14 23:25:52 UTC
Post by Anton Ertl
I have a similar problem for the carry and overflow bits in
< http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
My 66000 ISA can encode the mpn_add_n() inner loop in 5 instructions
whereas RISC-V encodes the inner loop in 11 instructions.

Source code:

void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
{
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
        {c, sum[i]} = a[i] + b[i] + c;
    }
    return;
}

Assembly code::

.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i

VEC R7,{}
LDD R8,[R2,Ri<<3]
LDD R9,[R3,Ri<<3]
CARRY R5,{{IO}}
ADD R10,R8,R9
STD R10,[R1,Ri<<3]
LOOP LT,R6,#1,R4
RET

So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well designed ISA gives you a
performance gain of 2.00× !! {{moral: don't stop too early}}

Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.

As I count executing instructions, VEC does not execute, nor does
CARRY.  CARRY causes the subsequent ADD to take C as its carry input, and
the carry produced by the ADD goes back into C.  LOOP performs the ADD-CMP-
BC sequence in a single instruction and in a single clock.
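
In plain C, a hedged model of the loop, spelling out the carry that the
CARRY prefix threads through the ADD (equivalent to the {c, sum[i]}
notation above, not the ISA definition):

    #include <stdint.h>

    void mpn_add_n_model(uint64_t *sum, const uint64_t *a,
                         const uint64_t *b, int n)
    {
        uint64_t c = 0;                    // carry between iterations
        for (int i = 0; i < n; i++) {
            uint64_t s  = a[i] + b[i];
            uint64_t c1 = (s < a[i]);      // carry out of a[i] + b[i]
            s += c;                        // add the incoming carry (0 or 1)
            c1 |= (s < c);                 // carry out of the +c step
            sum[i] = s;
            c = c1;
        }
    }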
Terje Mathisen
2024-04-15 08:02:46 UTC
Post by MitchAlsup1
Post by Anton Ertl
I have a similar problem for the carry and overflow bits in
< http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
whereas RISC-V encodes the inner loop in 11 instructions.
void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
{
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
         {c, sum[i]} = a[i] + b[i] + c;
    }
    return
}
    .global mpn_add_n
    MOV   R5,#0     // c
    MOV   R6,#0     // i
    VEC   R7,{}
    LDD   R8,[R2,Ri<<3]
    LDD   R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD   R10,R8,R9
    STD   R10,[R1,Ri<<3]
    LOOP  LT,R6,#1,R4
    RET
So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well designed ISA gives you a
performance gain of 2.00× !! {{moral: don't stop too early}}
Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.
As I count executing instructions, VEC does not execute, nor does
CARRY--CARRY causes the subsequent ADD to take C input as carry and
the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
BC sequence in a single instruction and in a single clock.
; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
xor rax,rax ;; Clear carry
next:
mov rax,[rsi+rcx*8]
adc rax,[rdx+rcx*8]
mov [rdi+rcx*8],rax
inc rcx
jnz next

The code above is 5 instructions, or 6 if we avoid the load-op, doing
two loads and one store, so it should only be limited by the latency of
the ADC, i.e. one or two cycles.

In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
loop something like this:

next:
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle

mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle

inc ecx
jnz next ; Third cycle

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen
2024-04-15 09:16:15 UTC
Post by MitchAlsup1
Post by Anton Ertl
I have a similar problem for the carry and overflow bits in
< http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
let those bits not survive across calls; if there was a cheap solution
for the problem, it would eliminate this drawback of my idea.
My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
whereas RISC-V encodes the inner loop in 11 instructions.
void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
{
    uint64_t c = 0;
    for( int i = 0; i < n; i++ )
    {
        {c, sum[i]} = a[i] + b[i] + c;
    }
    return;
}
    .global mpn_add_n
    MOV   R5,#0     // c
    MOV   R6,#0     // i
    VEC   R7,{}
    LDD   R8,[R2,Ri<<3]
    LDD   R9,[R3,Ri<<3]
    CARRY R5,{{IO}}
    ADD   R10,R8,R9
    STD   R10,[R1,Ri<<3]
    LOOP  LT,R6,#1,R4
    RET
So, adding a few "bells and whistles" to RISC-V does give you a
performance gain (1.38×); using a well designed ISA gives you a
performance gain of 2.00× !! {{moral: don't stop too early}}
Note that all the register bookkeeping has disappeared !! because
of the indexed memory reference form.
As I count executing instructions, VEC does not execute, nor does
CARRY--CARRY causes the subsequent ADD to take C input as carry and
the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
BC sequence in a single instruction and in a single clock.
  ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
  xor rax,rax ;; Clear carry
  mov rax,[rsi+rcx*8]
  adc rax,[rdx+rcx*8]
  mov [rdi+rcx*8],rax
  inc rcx
   jnz next
The code above is 5 instructions, or 6 if we avoid the load-op, doing
two loads and one store, so it should only be limited by the latency of
the ADC, i.e. one or two cycles.
In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
  adc eax,ebx
  mov ebx,[edx+ecx*4]    ; First cycle
  mov [edi+ecx*4],eax
  mov eax,[esi+ecx*4]    ; Second cycle
  inc ecx
   jnz next        ; Third cycle
In the same bad old days, the standard way to speed it up would have
been unrolling, but until we got more registers, that would have run out
of steam very quickly.  With AVX2 we could use 4 64-bit slots in a 32-byte
register, but then we would have needed to handle the carry propagation
manually, and that would take longer than a series of ADC/ADX instructions.

next4:
mov eax,[esi]
adc eax,[esi+edx]
mov [esi+edi],eax
mov eax,[esi+4]
adc eax,[esi+edx+4]
mov [esi+edi+4],eax
mov eax,[esi+8]
adc eax,[esi+edx+8]
mov [esi+edi+8],eax
mov eax,[esi+12]
adc eax,[esi+edx+12]
mov [esi+edi+12],eax
lea esi,[esi+16]
dec ecx
jnz next4

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-04-15 20:55:53 UTC
Post by Terje Mathisen
In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
adc eax,ebx
mov ebx,[edx+ecx*4] ; First cycle
mov [edi+ecx*4],eax
mov eax,[esi+ecx*4] ; Second cycle
inc ecx
jnz next ; Third cycle
Terje
As opposed to::

.global mpn_add_n
mpn_add_n:
MOV R5,#0 // c
MOV R6,#0 // i

VEC R7,{}
LDD R8,[R2,Ri<<3] // Load 128-to-512 bits
LDD R9,[R3,Ri<<3] // Load 128-to-512 bits
CARRY R5,{{IO}}
ADD R10,R8,R9 // Add pair to add octal
STD R10,[R1,Ri<<3] // Store 128-to-512 bits
LOOP LT,R6,#1,R4 // increment 2-to-8 times
RET

--------------------------------------------------------

LDD R8,[R2,Ri<<3] // AGEN cycle 1
LDD R9,[R3,Ri<<3] // AGEN cycle 2 data cycle 4
CARRY R5,{{IO}}
ADD R10,R8,R9 // cycle 4
STD R10,[R1,Ri<<3] // AGEN cycle 3 write cycle 5
LOOP LT,R6,#1,R4 // cycle 3

OR

LDD LDd
LDD LDd
ADD
ST STd
LOOP
LDD LDd
LDD LDd
ADD
ST STd
LOOP

10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.

40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
Terje Mathisen
2024-04-16 06:44:26 UTC
Post by MitchAlsup1
Post by Terje Mathisen
In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
   adc eax,ebx
   mov ebx,[edx+ecx*4]    ; First cycle
   mov [edi+ecx*4],eax
   mov eax,[esi+ecx*4]    ; Second cycle
   inc ecx
   jnz next        ; Third cycle
Terje
    .global mpn_add_n
    MOV   R5,#0     // c
    MOV   R6,#0     // i
    VEC   R7,{}
    LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
    LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // Add pair to add octal
    STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
    LOOP  LT,R6,#1,R4         // increment 2-to-8 times
    RET
--------------------------------------------------------
    LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
    LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // cycle 4
    STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
    LOOP  LT,R6,#1,R4         // cycle 3
OR
    LDD       LDd
         LDD       LDd                    ADD
              ST        STd
              LOOP
                   LDD       LDd
                        LDD       LDd
ADD
                             ST        STd
                             LOOP
10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.
40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
It all comes down to the carry propagation, right?

The way I understood the original code, you are doing a very wide
unsigned add, so you need a carry to propagate from each and every block
to the next, right?

If you can do that at half a clock cycle per 64 bit ADD, then consider
me very impressed!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-04-16 18:14:39 UTC
Post by Terje Mathisen
Post by MitchAlsup1
Post by Terje Mathisen
In the non-OoO (i.e Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
   adc eax,ebx
   mov ebx,[edx+ecx*4]    ; First cycle
   mov [edi+ecx*4],eax
   mov eax,[esi+ecx*4]    ; Second cycle
   inc ecx
   jnz next        ; Third cycle
Terje
    .global mpn_add_n
    MOV   R5,#0     // c
    MOV   R6,#0     // i
    VEC   R7,{}
    LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
    LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // Add pair to add octal
    STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
    LOOP  LT,R6,#1,R4         // increment 2-to-8 times
    RET
--------------------------------------------------------
    LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
    LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
    CARRY R5,{{IO}}
    ADD   R10,R8,R9           // cycle 4
    STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
    LOOP  LT,R6,#1,R4         // cycle 3
OR
    LDD       LDd
         LDD       LDd                    ADD
              ST        STd
              LOOP
                   LDD       LDd
                        LDD       LDd
ADD
                             ST        STd
                             LOOP
10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.
40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
It all comes down to the carry propagation, right?
The way I understood the original code, you are doing a very wide
unsigned add, so you need a carry to propagate from each and every block
to the next, right?
Most ST pipelines have an align stage to align the data to be stored to
where it needs to go.  One can extend the carry into this stage if needed,
capturing both a+b and a+b+1 and using the carry in to select one or the other.
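
A minimal C model of that carry-select idea: both candidate sums are formed
before the carry-in arrives, so the late carry only has to steer a mux
(the function and names are illustrative only):

    #include <stdint.h>

    static uint64_t add_select(uint64_t a, uint64_t b, unsigned cin,
                               unsigned *cout)
    {
        uint64_t s0 = a + b;       // sum assuming carry-in = 0
        uint64_t s1 = a + b + 1;   // sum assuming carry-in = 1
        unsigned c0 = s0 < a;      // carry-out for the cin = 0 case
        unsigned c1 = s1 <= a;     // carry-out for the cin = 1 case
        *cout = cin ? c1 : c0;
        return cin ? s1 : s0;
    }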
Post by Terje Mathisen
If you can do that at half a clock cycle per 64 bit ADD, then consider
me very impressed!
Terje
Stephen Fuld
2024-04-16 22:02:13 UTC
Post by Anton Ertl
Post by Stephen Fuld
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.
...
Post by Stephen Fuld
But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same
op-code. There may be several more of these.
Certainly makes reading disassembler output fun (or writing the
disassembler).
Good point.  It probably isn't too bad for the arithmetic operations,
etc., but once you extend it as I suggested in the last paragraph, it gets
ugly.  :-(


big snip
Post by Anton Ertl
Post by Stephen Fuld
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?
In an OoO CPU, that's pretty heavy.
OK, but in the vast majority of cases (i.e. unless there is something
like a conditional branch that uses floating point or integer depending
upon whether the branch is taken), the tag bit that a register will
have can be known well in advance.  As I said, IANAHG, but that might
make it easier.
Post by Anton Ertl
But actually, your idea does not need any computation results for
determining the tag bits of registers (except during EXIT),
But even here, you almost certainly know what the tag bit for any given
register is long before you execute the EXIT instruction.  And remember,
on My 66000 EXIT is performed lazily, so you have time, and the mechanism
is in place to wait if needed.
Post by Anton Ertl
so you
probably can handle the tags in the front end (decoder and renamer).
Then the tags are really separate and not part of the registers that
have to be renamed, and you don't need to perform any waiting on
ENTER.
However, in EXIT the front end would have to wait for the result of
the load/store unit loading the 32 bits, unless you add a special
mechanism for that. So EXIT would become expensive, one way or the
other.
Yes.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
EricP
2024-04-03 18:44:27 UTC
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the number
of op codes. One reason not mentioned before is if you have fixed
length instructions, you may want to leave as many codes as possible
available for future use. Of course, if you are doing a 16-bit
instruction design, where instruction bits are especially tight, you may
save enough op-codes to save a bit, perhaps allowing a larger register
specifier field, or to allow more instructions in the smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory. I worked through this idea
using the My 6600 as an example “substrate” for two reasons. First, it
has several features that are “friendly” to the idea. Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea. It is
certainly not fully worked out. I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.). There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register. Non-floating-point loads would clear the
tag bit. As I show below, I don’t think you need any special "store
tag" instructions.
If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISA's
have both int32 and int64 ALU operations for result compatibility.
Post by Stephen Fuld
When executing arithmetic instructions, if the tag bits of both sources
of an instruction are the same, do the appropriate operation (floating
or integer), and set the tag bit of the result register appropriately.
If the tag bits of the two sources are different, I see several
possibilities.
1. Generate an exception.
2. Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction
3. Always do the operation in floating point and convert the integer
operand prior to the operation. (Or, if you prefer, change floating
point to integer in the above description.)
4. Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice. I am not sure which is the
best option.
Given that, use the same op code for the floating-point and fixed
versions of the same operations. So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare. So far, a net
savings of six opcodes.
But we can go further. There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions. And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts. Given the tag bit, these could share the same
op-code. There may be several more of these.
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
tag value for callee saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost). The same
mechanism works for interrupts that take control away from a running
process.
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code. For
example, Oring a register with itself could be used to set the tag bit
and Oring a register with zero could clear it. These should be pretty rare.
That is as far as I got. I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad. Is it
worth it? To me, a major question is the effect on performance. What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use? If it causes an
extra cycle per instruction, then it is almost certainly not worth it.
IANAHG, so I don’t know. But even if it doesn’t cost any performance, I
think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
Anyway, it has been fun thinking about this, so I hope you don’t mind
the, probably too long, post.
Any comments are welcome.
Currently the opcode data type can tell the uArch how to route
the operands internally without knowing the data values.
For example, FPU reservation stations monitor float operands
and schedule for just the FPU FADD or FMUL units.

Dynamic data typing would change that to be data dependent routing.
It means, for example, you can't begin to schedule a uOp
until you know all its operand types and opcode.

Looks like it makes such distributed decisions impossible.
Probably everything winds up in a big pile of logic in the center,
which might be problematic for those things whose complexity grows N^2.
Not sure how significant that is.
Stephen Fuld
2024-04-16 22:06:48 UTC
Post by EricP
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the
number of op codes.  One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use.  Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing a
larger register specifier field, or to allow more instructions in the
smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory.  I worked through this idea
using the My 6600 as an example “substrate” for two reasons.  First,
it has several features that are “friendly” to the idea.  Second, I
know Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value, set
the tag bit for the destination register.  Non-floating-point loads
would clear the tag bit.  As I show below, I don’t think you need any
special "store tag" instructions.
If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISA's
have both int32 and int64 ALU operations for result compatibility.
Not needed for My 66000, as all floating point loads convert the loaded
value to double precision.

big snip
Post by EricP
Currently the opcode data type can tell the uArch how to route
the  operands internally without knowing the data values.
For example, FPU reservation stations monitor float operands
and schedule for just the FPU FADD or FMUL units.
Dynamic data typing would change that to be data dependent routing.
It means, for example, you can't begin to schedule a uOp
until you know all its operand types and opcode.
Seems right.
Post by EricP
Looks like it makes such distributed decisions impossible.
Probably everything winds up in a big pile of logic in the center,
which might be problematic for those things whose complexity grows N^2.
Not sure how significant that is.
Could be. Again, IANAHG.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup1
2024-04-17 01:11:12 UTC
Post by Stephen Fuld
Post by EricP
If you are adding a float/int data type flag you might as well
also add operand size for floats at least, though some ISA's
have both int32 and int64 ALU operations for result compatibility.
Not needed for My 66000, as all floating point loads convert the loaded
value to double precision.
Insufficient verbal precision::

My 66000 only cares about the size of a value being loaded from memory
(or ST into memory).

While (float) LDs load the 32-bit value from memory, they remain (float)
while residing in the register; and the High Order 32-bits are ignored.
The (float) register can be consumed by a (float) FP calculation and it
remains (float) after processing.

Small immediates, when consumed by FP instructions, are converted from
integer to <sized> FP during DECODE. So::

FADD R7,R7,#1

adds 1.0D0 to the (double) value in R7 (and takes one 32-bit instruction),
while:

FADDs R7,R7,#1

Adds 1.0E0 to the (float) value in R7.
Thomas Koenig
2024-04-03 20:02:25 UTC
Stephen Fuld <***@alumni.cmu.edu.invalid> wrote:

[saving opcodes]
Post by Stephen Fuld
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.).
I don't think this would save a lot of opcode space, which
is the important thing.

A typical RISC design has a six-bit major opcode.
Having three registers takes away fifteen bits, leaving
eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions. Compare with
https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
where DEC actually left out three bits because they did not
need them.

What is _really_ eating up opcode space are many- (usually 16-) bit
constants in the instructions.
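
Written out, for a 32-bit instruction with three 5-bit register fields:

    [ 6-bit major opcode | Rd(5) | Rs1(5) | Rs2(5) | 11 bits left over ]
      6 + 5 + 5 + 5 = 21 bits, leaving 11 bits of minor opcode / function field.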
Stephen Fuld
2024-04-16 22:08:58 UTC
Post by Thomas Koenig
[saving opcodes]
Post by Stephen Fuld
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag. If
set, the bit indicates that the corresponding register contains a
floating-point value. Clear indicates not floating point (integer,
address, etc.).
I don't think this would save a lot of opcode space, which
is the important thing.
A typical RISC design has a six-bit major opcode.
Having three registers takes away fifteen bits, leaving
eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions. Compare with
https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
where DEC actually left out three bits because they did not
need them.
I think that is probably true for 32 bit instructions, but what about 16
bit?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
BGB-Alt
2024-04-16 22:46:00 UTC
Post by Stephen Fuld
Post by Thomas Koenig
[saving opcodes]
Post by Stephen Fuld
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.  If
set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).
I don't think this would save a lot of opcode space, which
is the important thing.
A typical RISC design has a six-bit major opcode.
Having three registers takes away fifteen bits, leaving
eleven, which is far more than anybody would ever want as
minor opcode for arithmetic instructions.  Compare with
https://en.wikipedia.org/wiki/DEC_Alpha#Instruction_formats
where DEC actually left out three bits because they did not
need them.
I think that is probably true for 32 bit instructions, but what about 16
bit?
At least, as I see it...


If 4 bit registers:
16-4-4 => 8
If 5 bit registers:
16-5-5 => 6
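
Written out as rough layouts (only the opcode/register split matters here;
field positions are arbitrary):

    16-bit op, two 4-bit register fields:  [ opcode(8) | Rm(4) | Rn(4) ]
    16-bit op, two 5-bit register fields:  [ opcode(6) | Rm(5) | Rn(5) ]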

Realistically, I don't think 6 bits of opcode is enough except if the
purpose of the 16-bit ops is merely to shorten some common 32-bit ops.

But, a subset of instructions can use 5-bit fields (say, MOV, EXTS.L,
and common Load/Store ops).

Say (in my notation):
MOV Rm, Rn
EXTS.L Rm, Rn
MOV.L (SP, Disp), Rn
MOV.Q (SP, Disp), Rn
MOV.X (SP, Disp), Xn
MOV.L Rn, (SP, Disp)
MOV.Q Rn, (SP, Disp)
MOV.X Xn, (SP, Disp)
As, these tend to be some of the most commonly used instructions.

For most everything else, one can limit things either to the first 16
registers, or the most commonly used 16 registers (if not equivalent to
the first 16).

Though, for 1R ops, it can make sense to have 5-bit registers.


I don't really think 3-bit register fields are worth bothering with;
even if limited to the most common registers. Granted, being limited to
2R encodings is also limiting.

Granted, both Thumb and RVC apparently thought 3-bit register fields
were worthwhile, so...

Similarly, not worth bothering (at all) with 6-bit register fields in
16-bit ops.


Though, if one has 16-bit VLE, a question is how best to split up the 16
vs 32-bit encoding space.

...
BGB-Alt
2024-04-03 20:25:01 UTC
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the number
of op codes.  One reason not mentioned before is if you have fixed
length instructions, you may want to leave as many codes as possible
available for future use.  Of course, if you are doing a 16-bit
instruction design, where instruction bits are especially tight, you may
save enough op-codes to save a bit, perhaps allowing a larger register
specifier field, or to allow more instructions in the smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory.  I worked through this idea
using the My 6600 as an example “substrate” for two reasons.  First, it
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.  If
set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register.  Non-floating-point loads would clear the
tag bit.  As I show below, I don’t think you need any special "store
tag" instructions.
When executing arithmetic instructions, if the tag bits of both sources
of an instruction are the same, do the appropriate operation (floating
or integer), and set the tag bit of the result register appropriately.
If the tag bits of the two sources are different, I see several
possibilities.
1.    Generate an exception.
2.    Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction
3.    Always do the operation in floating point and convert the integer
operand prior to the operation.  (Or, if you prefer, change floating
point to integer in the above description.)
4.    Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice.  I am not sure which is the
best option.
Given that, use the same op code for the floating-point and fixed
versions of the same operations.  So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare.  So far, a net
savings of six opcodes.
But we can go further.  There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions.  And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts.  Given the tag bit, these could share the same
op-code.  There may be several more of these.
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data.  But what happens with
separate compilations?  The called function probably doesn’t know the
tag value for callee saved registers.  Fortunately, the My 66000
architecture comes to the rescue here.  You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost).  The same
mechanism works for interrupts that take control away from a running
process.
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code.   For
example, Oring a register with itself could be used to set the tag bit
and Oring a register with zero could clear it.  These should be pretty
rare.
That is as far as I got.  I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad.  Is it
worth it?  To me, a major question is the effect on performance.  What
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use?  If it causes an
extra cycle per instruction, then it is almost certainly not worth it.
IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
Anyway, it has been fun thinking about this, so I hope you don’t mind
the, probably too long, post.
Any comments are welcome.
FWIW:
This doesn't seem too far off from what would be involved with dynamic
typing at the ISA level, but with many of the same sorts of drawbacks...



Say, for example, top 2 bits of a register:
  00: Object Reference
    Next 2 bits:
      00: Pointer (with type-tag)
      01: ?
      1z: Bounded Array
  01: Fixnum (route to ALU)
  10: Flonum (route to FPU)
  11: Other types
    00: Smaller value types
      Say: int/uint, short/ushort, ...
    ...

One issue:
Decoding based on register tags would mean needing to know the register
tag bits at the same time the instruction is being decoded. In this
case, one is likely to need two clock-cycles to fully decode the opcode.

ID1: Unpack instruction to figure out register fields, etc.
ID2: Fetch registers, specialize variable instructions based on tag bits.

For timing though, one ideally doesn't want to do anything with the
register values until the EX stages (since ID2 might already be tied up
with the comparably expensive register-forwarding logic), but asking for
3 cycles for decode is a bit much.

Otherwise, if one does not know which FU should handle the operation
until EX1, this has its own issues. Or, possibly, the FUs decide
whether to accept the operation:
ALU: Accepts operation if both are fixnum, FPU if both are Flonum.

But a proper dynamic language allows mixing fixnum and flonum, with the
result implicitly converted to flonum; from the FPU's POV, this would
effectively require two chained FADD operations (one for the Fixnum to
Flonum conversion, one for the FADD itself).

Many other cases could get hairy, but to have any real benefit, the CPU
would need to be able to deal with them. In cases where the compiler
deals with everything, the type-tags become mostly moot (or potentially
detrimental).


But, then, there is another issue:
C code expects C type semantics to be respected, say:
Signed int overflow wraps at 32 bits (sign extending);
Unsigned int overflow wraps at 32 bits (zero extending);
Variables may not hold values out-of-range for that type;
The 'long long' and 'unsigned long long' types are exactly 64-bit;
...
...

If one has tagged 64-bit registers, then fixnum might not hold the
entire range of 'long long'. If one has 66 or 68 bit registers, then
memory storage is a problem.
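
Concretely, with a 2-bit tag on a 64-bit register the fixnum payload is 62
bits, so (assuming a straightforward signed payload):

    fixnum range:    -2^61 .. 2^61 - 1
    long long range: -2^63 .. 2^63 - 1
    e.g. LLONG_MAX is not representable as a fixnum and needs boxing or a
    separate untagged path.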

If one has untagged registers for cases where they are needed, one has
not saved any encoding space.

And, if one type-tags statically-typed variables, there is no real
"value-added" here (and saving a little encoding space at the cost of
making the rest of the CPU more complicated and expensive isn't much of
a win).

Better as I see it, to leave the CPU itself mostly working with raw
untagged values.


It can make sense to have helper-ops for type-tags, but these don't save
any encoding space; rather, they make the cases that deal with type-tagged
data a little faster.

Say:
Sign-extending a fixnum to 64 bits;
Setting the tag bits for a fixnum;
Doing the twiddling to convert between Flonum and Double;
Setting the tag for various bit patterns;
Checking the tag(s) against various bit patterns;
...

Where, on a more traditional ISA, the logic to do the bit-twiddling for
type-checking and tag modification is a significant part of the runtime
cost of a dynamically typed language.
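
For a sense of what that bit-twiddling costs, a hedged C sketch of a
dynamically typed add over a 2-bit top-of-register tag (tag values and
helper names are made up for illustration, loosely following the layout
sketched earlier):

    #include <stdint.h>

    #define TAG_FIXNUM 1u            /* illustrative tag value only */

    static unsigned tag_of(uint64_t v)     { return (unsigned)(v >> 62); }
    static int64_t  fixnum_val(uint64_t v) { return ((int64_t)(v << 2)) >> 2; }
    static uint64_t make_fixnum(int64_t i)
        { return ((uint64_t)i & 0x3FFFFFFFFFFFFFFFu) | ((uint64_t)TAG_FIXNUM << 62); }

    /* every dynamically typed add pays for two tag checks before the
       actual arithmetic; the mixed and flonum paths (conversion plus FP
       add) are omitted to keep the sketch short */
    uint64_t dyn_add(uint64_t a, uint64_t b)
    {
        if (tag_of(a) == TAG_FIXNUM && tag_of(b) == TAG_FIXNUM)
            return make_fixnum(fixnum_val(a) + fixnum_val(b));
        return 0;  /* placeholder for the slow path */
    }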

With luck, one can have dynamic typing that isn't horribly slow.
But, one still isn't likely to see serious use of dynamic typing in
systems-level programming (if anything, Haskell style type-systems seem
to be more in fashion in this space at present, where trying to get the
code to be accepted by the compiler is itself an exercise in pain).

Well, and those of us who would prefer a more ActionScript or Haxe like
approach in a systems-level language (at least as an option for when it
is useful to do so) are likely kind of the minority.

Well, and having a C dialect where one can be like, say:
__variant obj;
obj = __var { .x=3, .y=4 }; //ex-nihilo object
obj.z=5; //implicitly creates a new 'z' field in the object.
Is, not exactly standard...

And, I end up limiting its use, as any code which touches this stuff can
only be compiled in BGBCC (for example, getting the TestKern core to
build in GCC ended up needing to disable things like the BASIC
interpreter and similar, as I had used some of these features to
implement the interpreter).


Though, personally I still prefer being a little more strict than JS/ES
in some areas, like any of ""==null or 0==null or 0=="" or null=="null"
or similar being true, is a misfeature as far as I am concerned (my
original scripting language had quietly dropped this feature, despite
otherwise resembling JS/ES).


Though, in my case, the ISA is not tagged, so all of this stuff is built
on top of implicit runtime calls. There is not currently a garbage
collector, but adding stuff to support precise GC could be possible in
theory (and in my current project would be a tempting alternative to the
use of conservative GC, though likely neither could be used in a
real-time context).

In one of my own languages, I had instead defined rules to try to allow
the compiler to figure out object lifetimes in various cases (how an
object is created implicitly also gave some information about its
semantics and lifetime).

...
MitchAlsup1
2024-04-03 21:30:02 UTC
Post by Stephen Fuld
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the number
of op codes.  One reason not mentioned before is if you have fixed
length instructions, you may want to leave as many codes as possible
available for future use.  Of course, if you are doing a 16-bit
instruction design, where instruction bits are especially tight, you may
save enough op-codes to save a bit, perhaps allowing a larger register
specifier field, or to allow more instructions in the smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory.  I worked through this idea
using the My 6600 as an example “substrate” for two reasons.  First, it
66000
Post by Stephen Fuld
Post by Stephen Fuld
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.  If
set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load single
floating and load double floating, which work the same as the other 32-
and 64-bit loads, but in addition to loading the value, set the tag bit
for the destination register.  Non-floating-point loads would clear the
tag bit.  As I show below, I don’t think you need any special "store
tag" instructions.
What do you do when you want an FP bit pattern interpreted as an integer,
or vice versa?
Post by Stephen Fuld
Post by Stephen Fuld
When executing arithmetic instructions, if the tag bits of both sources
of an instruction are the same, do the appropriate operation (floating
or integer), and set the tag bit of the result register appropriately.
If the tag bits of the two sources are different, I see several
possibilities.
1.    Generate an exception.
2.    Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction
Conversions to/from FP often require a rounding mode. How do you specify that?
Post by Stephen Fuld
Post by Stephen Fuld
3.    Always do the operation in floating point and convert the integer
operand prior to the operation.  (Or, if you prefer, change floating
point to integer in the above description.)
4.    Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice.  I am not sure which is the
best option.
Given that, use the same op code for the floating-point and fixed
versions of the same operations.  So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare.  So far, a net
savings of six opcodes.
But we can go further.  There are some opcodes that only make sense for
FP operands, e.g. the transcendental instructions.  And there are some
operations that probably only make sense for non-FP operands, e.g. POP,
FF1, probably shifts.  Given the tag bit, these could share the same
op-code.  There may be several more of these.
Hands waving:: "Danger Will Robinson, Danger" more waving of hands.
Post by Stephen Fuld
Post by Stephen Fuld
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data.  But what happens with
separate compilations?  The called function probably doesn’t know the
The compiler will certainly have a function prototype. In any event, if FP
and integers share a register file, the lack of a prototype is much less
stressful to the compiler/linking system.
Post by Stephen Fuld
Post by Stephen Fuld
tag value for callee saved registers.  Fortunately, the My 66000
architecture comes to the rescue here.  You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost).  The same
mechanism works for interrupts that take control away from a running
process.
Yes, but we do just fine without the tag and without the stuff mentioned
above. Neither ENTER nor EXIT care about the 64-bit pattern in the register.
Post by Stephen Fuld
Post by Stephen Fuld
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some other
instructions to do this, without requiring another op-code.   For
example, Oring a register with itself could be used to set the tag bit
and Oring a register with zero could clear it.  These should be pretty
rare.
That is as far as I got.  I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad.  Is it
worth it? 
No.
Post by Stephen Fuld
To me, a major question is the effect on performance.  What
Post by Stephen Fuld
is the cost of having to decode the source registers and reading their
respective tag bits before knowing which FU to use? 
The problem is you have made decode dependent on dynamic pipeline
information. I suggest you don't want to do that. Consider a change from
int to FP as a predicated instruction: the pipeline cannot DECODE the
instruction at hand until the predicate resolves. Yech.
Post by Stephen Fuld
If it causes an
Post by Stephen Fuld
extra cycle per instruction, then it is almost certainly not worth it.
IANAHG, so I don’t know.  But even if it doesn’t cost any performance, I
think the overall gains are pretty small, and probably not worth it
unless the op-code space is really tight (which, for My 66000 it isn’t).
Anyway, it has been fun thinking about this, so I hope you don’t mind
the, probably too long, post.
Any comments are welcome.
It is actually an interesting idea if you want to limit your architecture
to 1-wide.
Post by Stephen Fuld
This doesn't seem too far off from what would be involved with dynamic
typing at the ISA level, but with many of same sorts of drawbacks...
00: Object Reference
00: Pointer (with type-tag)
01: ?
1z: Bounded Array
01: Fixnum (route to ALU)
10: Flonum (route to FPU)
11: Other types
00: Smaller value types
Say: int/uint, short/ushort, ...
...
Decoding based on register tags would mean needing to know the register
tag bits at the same time the instruction is being decoded. In this
case, one is likely to need two clock-cycles to fully decode the opcode.
More importantly, you added a cycle AFTER register READ/Forward before
you can start executing (more when OoO is in use).

And finally, the compiler KNOWS what the type is at compile time.
Terje Mathisen
2024-04-04 08:32:48 UTC
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the
number of op codes.  One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use.  Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing a
larger register specifier field, or to allow more instructions in the
smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory.  I worked through this idea
using the My 6600 as an example “substrate” for two reasons.  First, it
               66000
Post by Stephen Fuld
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value, set
the tag bit for the destination register.  Non-floating-point loads
would clear the tag bit.  As I show below, I don’t think you need any
special "store tag" instructions.
What do you do when you want an FP bit pattern interpreted as an integer,
or vice versa?
This is why, if you want to copy Mill, you have to do it properly:

Mill does NOT care about the type of data loaded into a particular belt
slot, only the size and if it is a scalar or a vector filling up the
full belt slot. In either case you will also have marker bits for
special types like None and NaR.

So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
last being the same as the scalar anyway).

Only load ops and explicit widening/narrowing ops set the size tag
bits; from that point any op where it makes sense will do the right
thing for either a scalar or a short vector, so you can add 16+16 8-bit
vars with the same ADD encoding as you would use for a single 64-bit ADD.
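A rough C model of that "one ADD encoding" idea (illustrative only, not
Mill code): the shape of the operation comes from the size tag the load
attached to the operand, not from the opcode. Only the 64-bit scalar and
8x8-bit vector cases are shown.

#include <stdint.h>

typedef struct { uint64_t bits; uint8_t elem_bytes; } slot;  /* tag: 8 = scalar, 1 = 8-bit lanes */

static slot belt_add(slot a, slot b)          /* assumes both slots carry the same tag */
{
    slot r = { 0, a.elem_bytes };
    if (a.elem_bytes == 8) {                  /* one 64-bit scalar add */
        r.bits = a.bits + b.bits;
    } else {                                  /* eight independent 8-bit adds */
        for (int i = 0; i < 8; i++) {
            uint64_t x = (a.bits >> (8 * i)) & 0xFF;
            uint64_t y = (b.bits >> (8 * i)) & 0xFF;
            r.bits |= ((x + y) & 0xFF) << (8 * i);
        }
    }
    return r;
}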

We do NOT make any attempt to interpret the actual bit patterns stored
within each belt slot; that is up to the instructions. This means that
there is no difference between loading a float or an int32_t, and it also
means that it is perfectly legal (and supported) to use bit operations
on an FP variable. This can be very useful, not just to fake exact
arithmetic by splitting a double into two 26-bit mantissa parts.
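For what it's worth, a minimal illustrative sketch (not Mill code) of the
kind of splitting alluded to here: ordinary bit operations on the stored
double produce a high half with the low 26 mantissa bits cleared, and the
subtraction that recovers the low half is then exact. This is the usual
building block for Dekker-style exact products.

#include <stdint.h>
#include <string.h>

static void split26(double x, double *hi, double *lo)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);         /* reuse the FP bit pattern as an integer */
    bits &= ~((UINT64_C(1) << 26) - 1);     /* clear the low 26 mantissa bits */
    memcpy(hi, &bits, sizeof bits);
    *lo = x - *hi;                          /* exact for finite x: the low part
                                               has at most 26 significant bits */
}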

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Michael S
2024-04-04 13:47:44 UTC
Reply
Permalink
On Thu, 4 Apr 2024 10:32:48 +0200
Post by Terje Mathisen
We do NOT make any attempt
Terje
Does the present tense mean that you are still involved in the Mill project?
Terje Mathisen
2024-04-04 19:13:21 UTC
Reply
Permalink
Post by Michael S
On Thu, 4 Apr 2024 10:32:48 +0200
Post by Terje Mathisen
We do NOT make any attempt
Terje
Does the present tense mean that you are still involved in the Mill project?
I am much less active than I used to be, but I still get the weekly conf
call invites and respond to any interesting subject on our mailing list.

So, yes, I do consider myself to still be involved.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Michael S
2024-04-04 19:25:30 UTC
Reply
Permalink
On Thu, 4 Apr 2024 21:13:21 +0200
Post by Terje Mathisen
Post by Michael S
On Thu, 4 Apr 2024 10:32:48 +0200
Post by Terje Mathisen
We do NOT make any attempt
Terje
Does the present tense mean that you are still involved in the Mill project?
I am much less active than I used to be, but I still get the weekly
conf call invites and respond to any interesting subject on our
mailing list.
So, yes, I do consider myself to still be involved.
Terje
Thank you
BGB-Alt
2024-04-04 22:28:43 UTC
Reply
Permalink
Post by Terje Mathisen
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the
number of op codes.  One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use.  Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing
a larger register specifier field, or to allow more instructions in
the smaller subset.
It is in this spirit that I had an idea, partially inspired by
Mill’s use of tags in registers, but not memory.  I worked through
this idea using the My 6600 as an example “substrate” for two
reasons.  First, it
                66000
Post by Stephen Fuld
has several features that are “friendly” to the idea.  Second, I
know Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value,
set the tag bit for the destination register.  Non-floating-point
loads would clear the tag bit.  As I show below, I don’t think you
need any special "store tag" instructions.
What do you do when you want a FP bit pattern interpreted as an
integer, or vice versa.
Mill does NOT care about the type of data loaded into a particular belt
slot, only the size and if it is a scalar or a vector filling up the
full belt slot. In either case you will also have marker bits for
special types like None and NaR.
So scalar 8/16/32/64/128 and vector 8x16/16x8/32x4/64x2/128x1 (with the
last being the same as the scalar anyway).
Only load ops and explicit widening/narrowing ops sets the size tag
bits, from that point any op where it makes sense will do the right
thing for either a scalar or a short vector, so you can add 16+16 8-bit
vars with the same ADD encoding as you would use for a single 64-bit ADD.
We do NOT make any attempt to interpret the actual bit patterns stored
within each belt slot, that is up to the instructions. This means that
there is no difference between loading a float or an int32_t, it also
means that it is perfectly legal (and supported) to use bit operations
on a FP variable. This can be very useful, not just to fake exact
I guess useful to know.

Haven't heard much about Mill in a while, so don't know what if any
progress is being made.


As I can note, in my actual ISA, any type-tagging in the registers was
explicit and opt-in, generally managed by the compiler/runtime/etc; in
this case, the ISA merely provides facilities to assist with this.


The main exception would likely have been the possible "Bounds Check
Enforce" mode, which would still need a bit of work to implement, and is
not likely to be terribly useful. The most complicated and expensive part
is that it would require implicit register and memory tagging (to flag
capabilities). Though, the cheaper option is simply not to enable it, in
which case things behave as before, with the new functionality
essentially being a NOP. Much of the work still needed on this would be
getting the 128-bit ABI working, and adding some new tweaks to the ABI
to play well with the capability addressing (effectively it requires
partly reworking how global variables are accessed).


The type-tagging scheme used in my case is very similar to that used in
my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
off of an early version of the BGBScript VM, and effectively using a lax
hybrid typesystem masquerading as C). Though, it has long since moved to
a more proper C style typesystem, with dynamic types more as an optional
extension.


But, as can be noted, since dynamic typing is implemented via runtime
calls, it is slower than the use of static types. But, this is likely to
be unavoidable with any kind of conventional-ish architecture (and, some
structures, like bounded array objects and ex-nihilo objects, are
difficult to make performance competitive with bare pointers and structs).

Though, it is not so much that I think it is justifiable to forbid their
existence entirely (as is more the philosophy in many strict static
languages), or to mandate that programs roll their own (as typical in C
and C++ land). Where, with compiler and runtime support, it is possible
to provide them in ways that are higher performance than a plain C
implementation.

Well, and also the annoyance that seemingly every dynamic-language VM
takes a different approach to the implementation of its dynamic
typesystem (along with language differences, ...).

For example, Common Lisp is very different from Smalltalk, despite both
being categorically similar in this sense (or, versus Python, or versus
JavaScript, or, ...). Not likely viable to address all of them in the
same runtime (and would likely result in a typesystem that doesn't
really match with any of them, ...).


Though, annoyingly, there are not really any mainstream languages in the
"hybrid" category (say, in the gray area between C and ActionScript).
And, then it ends up being a question of which is better in a choice
between C with AS-like features, or "like AS but with C features".

So, alas...
Post by Terje Mathisen
Terje
MitchAlsup1
2024-04-05 01:48:33 UTC
Reply
Permalink
Post by BGB-Alt
As I can note, in my actual ISA, any type-tagging in the registers was
explicit and opt-in, generally managed by the compiler/runtime/etc; in
this case, the ISA merely providing facilities to assist with this.
The main exception would likely have been the possible "Bounds Check
Enforce" mode, which would still need a bit of work to implement, and is
not likely to be terribly useful.
A while back (and maybe in the future) My 66000 had what I called the
Foreign Access Mode. When the HoB of the pointer was set, the first
entry in the translation table was a 4-doubleword structure: a Root
pointer, the Lowest addressable Byte, the Highest addressable Byte,
and a DW of access rights, permissions, ... While sort-of like a capability,
I don't think it was close enough to actually be a capability or used as
one.

So, it fell out of favor, and it was not clear how it fit into the
HyperVisor/SuperVisor model, either.
Post by BGB-Alt
Most complicated and expensive parts
are that it will require implicit register and memory tagging (to flag
capabilities). Though, cheaper option is simply to not enable it, in
which case things either behave as before, with the new functionality
essentially being NOP. Much of the work still needed on this would be
getting the 128-bit ABI working, and adding some new tweaks to the ABI
to play well with the capability addressing (effectively it requires
partly reworking how global variables are accessed).
The type-tagging scheme used in my case is very similar to that used in
my previous BGBScript VMs (where, as I can note, BGBCC was itself a fork
off of an early version of the BGBScript VM, and effectively using a lax
hybrid typesystem masquerading as C). Though, it has long since moved to
a more proper C style typesystem, with dynamic types more as an optional
extension.
In general, any time one needs to change the type, you waste an instruction
compared to typeless registers.
BGB
2024-04-05 05:54:54 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
As I can note, in my actual ISA, any type-tagging in the registers was
explicit and opt-in, generally managed by the compiler/runtime/etc; in
this case, the ISA merely providing facilities to assist with this.
The main exception would likely have been the possible "Bounds Check
Enforce" mode, which would still need a bit of work to implement, and
is not likely to be terribly useful.
A while back (and maybe in the future) My 66000 had what I called the
Foreign Access Mode. When the HoB of the pointer was set, the first
entry in the translation table was a 4 doubleword structure, A Root
pointer, the Lowest addressable Byte, the Highest addressable Byte,
and a DW of access rights, permissions,... While sort-of like a capability
I don't think it was close enough to actually be a capability or used as
one.
So, it fell out of favor, and it was not clear how it fit into the
HyperVisor/SuperVisor model, either.
Possibly true.

The idea with BCE mode would be that the pointers would contain an
address along with an upper and lower bound, and possibly a few access
flags. It would disable the narrower 64-bit pointer instructions,
forcing the use of the 128-bit pointer instructions, which would perform
bounds checks, and some instructions would gain some additional semantics.
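Purely as an illustration of the shape being described (field names,
widths and the check are guesses, not the actual layout; a real 128-bit
format would have to compress the bounds to fit):

#include <stdint.h>

typedef struct {
    uint64_t addr;        /* the pointer value                                   */
    uint64_t lower;       /* lowest valid address (compressed in the real thing) */
    uint64_t upper;       /* one past the highest valid address                  */
    uint32_t flags;       /* access flags                                        */
} bce_ptr;

/* What a bounds-checked load/store of 'size' bytes would verify first: */
static int bce_ok(const bce_ptr *p, uint64_t size)
{
    return p->addr >= p->lower && p->addr + size <= p->upper;
}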

In addition, the Boot SRAM and DRAM gain some special "Tag Bits" areas.


However, it is unclear if the enforcing mode gains much over the normal
optional bounds checking to justify the extra cost. The main "merit"
case is that, in theory, it could offer some additional protection
against hostile machine code (whereas the non-enforcing mode is mostly
useful for detecting out-of-bounds memory accesses).

However, the optional mode is compatible with the use of 64-bit pointers
and the existing C ABI, so there is less overhead.
Post by MitchAlsup1
Post by BGB-Alt
                                  Most complicated and expensive parts
are that it will require implicit register and memory tagging (to flag
capabilities). Though, cheaper option is simply to not enable it, in
which case things either behave as before, with the new functionality
essentially being NOP. Much of the work still needed on this would be
getting the 128-bit ABI working, and adding some new tweaks to the ABI
to play well with the capability addressing (effectively it requires
partly reworking how global variables are accessed).
The type-tagging scheme used in my case is very similar to that used
in my previous BGBScript VMs (where, as I can note, BGBCC was itself a
fork off of an early version of the BGBScript VM, and effectively
using a lax hybrid typesystem masquerading as C). Though, it has long
since moved to a more proper C style typesystem, with dynamic types
more as an optional extension.
In general, any time one needs to change the type you waste an instruction
compared to typeless registers.
In my case, both types of values are used:
int x; //x is a bare register
void *p; //may or may not have tag, high 16 bits 0000 if untagged
__variant y; //y is tagged
auto z; //may be tagged or untagged

Here, untagged values will generally be used for non-variant types,
whereas tagged values for variant types.

Here, 'auto' and 'variant' differ, in that variant says "the type is
only known at runtime", whereas 'auto' assumes that a type exists and
may optionally be resolved at compile time (or, alternatively, it may
decay into variant; assumption being that one may not use auto in ways
that are incompatible with variant). In terms of behavior, both cases
may appear superficially similar.

Though:
auto z = expr;
Would instead define 'z' as a type inferred from the expression (in a
similar way to how it works in C++).


Note that:
__var x;
Would also give a variable of type variant, but is not exactly the same
("__variant" is the type, where "__var" is a statement/expression
keyword that just so happens to declare a variable of type "__variant"
when used in this way).



Say, non-variant:
int, long, double
void*, char*, Foo*, ...
__m128, __vec4f, ...
Variant:
__variant, __object, __fixnum, __string, ...

Where, for example:
__variant
May hold (nearly) any type of value at runtime.
Though, with some semantic restrictions.
__object
Tagged value, like variant;
But does not allow using operators on it directly.
__fixnum
Represents a 62-bit signed integer value.
Always exists in tagged form.
__flonum
Represents a 62-bit floating-point value.
Effectively a tagged Binary64 shifted-right by 2 bits (see the sketch after this list).
__string
Holds a string;
Essentially 'char*' but with a type-tagged pointer.
Defaults to CP-1252 at present, but may also hold a UCS-2 string.
Strings are assumed to be a read-only character array.
...
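As a bit-level sketch of the tagging just described (the placement of the
two tag bits at the top of the word is an assumption here, and the names
are not BGBCC's):

#include <stdint.h>
#include <string.h>

#define TAG_FIXNUM  UINT64_C(1)      /* matches the 01/10 pattern shown earlier */
#define TAG_FLONUM  UINT64_C(2)

typedef uint64_t variant;

static variant box_flonum(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits >> 2) | (TAG_FLONUM << 62);   /* drops the 2 low mantissa bits */
}

static double unbox_flonum(variant v)
{
    uint64_t bits = v << 2;                    /* the dropped bits read back as 0 */
    double d;
    memcpy(&d, &bits, sizeof bits);
    return d;
}

static variant box_fixnum(int64_t i)           /* assumes i fits in 62 bits */
{
    return ((uint64_t)i & (~UINT64_C(0) >> 2)) | (TAG_FIXNUM << 62);
}

static int64_t unbox_fixnum(variant v)
{
    return (int64_t)(v << 2) >> 2;             /* sign-extend the 62-bit payload
                                                  (assumes arithmetic right shift) */
}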


So, say:
int x, z;
__variant y;

y=x; //implicit int -> __fixnum -> __variant
z=(int)y; //coerces y to 'int'

There are some operators that exist for variant types but not for
non-variant types, such as __instanceof.

if(y __instanceof __fixnum)
{
//y is known to be a fixnum here
}

Where __instanceof can also be used on class instances:
__class Foo __extends Bar __implements IBaz {
... class members ...
};


In theory, could add a header to #define a lot of these keywords in
non-prefixed forms, in which case one could theoretically write, say:
public class Foo extends Bar implements IBaz {
private int x, y;
public int someMethod()
{ return x+y; }
public void setX(int val)
{ x=val; }
...
};

And, if one has, say:
IBaz baz;
...
if(baz instanceof Foo)
{
//baz is an instance of the Foo class
}

Though, will note that object instances are pass-by-reference here (like
in Java and C#) and not by-value. Though, if one is familiar with Java,
probably not too hard to figure out how some of this works. Also, as can
be noted, the object model is more like Java family languages than like C++.

However, unlike Java (and more like ActionScript), one can throw a
'dynamic' (or '__dynamic') keyword on a class, in which case it is
possible to create new members in the object instances merely by
assigning to them (where any members created this way will default to
being 'public variant').


Object member access will differ depending on the type of object.
Direct access to a non-dynamic class member will use a fixed
displacement (like when accessing a struct). Dynamic members will
implicitly access an ex-nihilo object that exists as a hidden member in
the class instance (and using the 'dynamic' modifier on a class will
implicitly create this member).

In this case, interfaces are pulled off by sticking an interface VTable
pointer onto the end of the object, and then encoding the Interface
reference as a pointer to the pointer to this vtable (with the VTable
encoding the offset to adjust the object pointer to give a pointer to
the base class for the virtual method). Note that (unlike in the JVM),
what interfaces a class implements is fixed at compile time ("interface
injection" is not possible in BGBCC).



There was an experimental C++ mode, which tries to mimic C++ syntax and
semantics (kinda), sort of trying to awkwardly fake C++'s object system
on top of the Java-like object system (with POD classes decaying into C
structs; value objects faked with object cloning, ...). Will not take
much to see through this illusion though (and almost doesn't really seem
worth it).


If ex-nihilo objects are used, these are treated as separate from the
instance-of-class objects. In the current implementation, these objects
are represented as small B-Trees representing key/value associations.
Here, each key is a 16-bit number (associated with a "symbol") and the
value is a 64-bit value (variant). Each object has a fixed capacity (16
members), and if exceeded, splits apart into a tree (say, a 2-level tree
representing up to 256 members; with the keys in the top-level node
encoding the ranges of keys present in each sub-node).

At present, there is a limit of 64K unique symbols, but this isn't too
big of an issue in practice (each symbol can be seen as a mapping
between a 16-bit number and an ASCII string representing the symbol's name).
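The shape of such a node, as described, might look like this (field names
and exact layout are illustrative):

#include <stdint.h>

#define ENIH_CAP 16                      /* fixed capacity before splitting */

typedef struct enih_node enih_node;
struct enih_node {
    uint16_t   count;                    /* entries in use                    */
    uint16_t   is_leaf;                  /* 1 = key/value pairs, 0 = subnodes */
    uint16_t   key[ENIH_CAP];            /* 16-bit symbol numbers, sorted     */
    union {
        uint64_t    val[ENIH_CAP];       /* leaf: variant values              */
        enih_node  *sub[ENIH_CAP];       /* inner node: children; key[i] is the
                                            highest symbol reachable under sub[i] */
    } u;
};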

If accessing a normal class member, it will be accessed as a direct
memory load or store, or if it is a dynamic member, an implicit runtime
call will be used.



For dynamic types (variant), pretty much all operations involve runtime
calls. These calls will perform a dynamically-typed dispatch based on
the tags of the values they are given.

Similarly, getting/setting a member in an ex-nihilo object is
accomplished via a runtime call.

For performance reasons, high-traffic areas of the dynamic-type runtime
were written in ASM.


Though, for performance, the rule here is to avoid using variant except
in cases where one actually needs dynamic types (and based on whether or
not compatibility with mainline C compilers is needed).


On the other side of the language border (BS), the syntax differs slightly:
function foo(x:int, y:int):int
{
var z:int;
z=x+y;
...
}

And, another language (BS2) of mine had switched to a more Java-like
syntax (with parts of C syntax bolted on). Though, the practical
difference between BS2 and the extended C variant is small (if you
#define the keywords, it is possible to do similar things with mostly
only minor syntactic differences).


Similarly, the extended C variant has another advantage:
It is backwards compatible with C.

Though, not quite so compatible with C++, and I don't expect C++ fans to
be all that interested in BGBCC's non-standard dialect.

OTOH: I will argue that it is at least much less horrid looking than
Objective-C.

...
Stephen Fuld
2024-04-16 23:44:27 UTC
Reply
Permalink
Post by Stephen Fuld
There has been discussion here about the benefits of reducing the
number of op codes.  One reason not mentioned before is if you have
fixed length instructions, you may want to leave as many codes as
possible available for future use.  Of course, if you are doing a
16-bit instruction design, where instruction bits are especially
tight, you may save enough op-codes to save a bit, perhaps allowing a
larger register specifier field, or to allow more instructions in the
smaller subset.
It is in this spirit that I had an idea, partially inspired by Mill’s
use of tags in registers, but not memory.  I worked through this idea
using the My 6600 as an example “substrate” for two reasons.  First, it
               66000
Sorry. Typo.
Post by Stephen Fuld
has several features that are “friendly” to the idea.  Second, I know
Mitch cares about keeping the number of op codes low.
Please bear in mind that this is just the germ of an idea.  It is
certainly not fully worked out.  I present it here to stimulate
discussions, and because it has been fun to think about.
The idea is to add 32 bits to the processor state, one per register
(though probably not physically part of the register file) as a tag.
If set, the bit indicates that the corresponding register contains a
floating-point value.  Clear indicates not floating point (integer,
address, etc.).  There would be two additional instructions, load
single floating and load double floating, which work the same as the
other 32- and 64-bit loads, but in addition to loading the value, set
the tag bit for the destination register.  Non-floating-point loads
would clear the tag bit.  As I show below, I don’t think you need any
special "store tag" instructions.
What do you do when you want a FP bit pattern interpreted as an integer,
or vice versa.
As I said below, if you need that, you can use an otherwise "useless"
instruction, such as ORing a register with itself, to modify the tag bits.
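A sketch, in C, of how that decode rule might read (illustrative only, not
anything defined for My 66000): the two otherwise-pointless OR forms are
the only encodings that change the tag without changing the value.

#include <stdint.h>

uint64_t reg[32];
uint32_t fp_tag;                              /* proposed: one tag bit per register */

/* OR rd,rs,rs  -> mark rd as floating point (reinterpret the bits as FP)
   OR rd,rs,#0  -> mark rd as integer
   any other OR -> ordinary integer OR, result tagged as integer          */
void exec_or(int rd, int rs1, int rs2, int has_imm, uint64_t imm)
{
    reg[rd] = reg[rs1] | (has_imm ? imm : reg[rs2]);

    if (!has_imm && rs1 == rs2)
        fp_tag |=  (1u << rd);
    else
        fp_tag &= ~(1u << rd);
}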
Post by Stephen Fuld
When executing arithmetic instructions, if the tag bits of both
sources of an instruction are the same, do the appropriate operation
(floating or integer), and set the tag bit of the result register
appropriately.
If the tag bits of the two sources are different, I see several
possibilities.
1.    Generate an exception.
2.    Use the sense of source 1 for the arithmetic operation, but
perform the appropriate conversion on the second operand first,
potentially saving an instruction
Conversions to/from FP often require a rounding mode. How do you specify that?
Good point.
Post by Stephen Fuld
3.    Always do the operation in floating point and convert the
integer operand prior to the operation.  (Or, if you prefer, change
floating point to integer in the above description.)
4.    Same as 2 or 3 above, but don’t do the conversions.
I suspect this is the least useful choice.  I am not sure which is
the best option.
Given that, use the same op code for the floating-point and fixed
versions of the same operations.  So we can save eight op codes, the
four arithmetic operations, max, min, abs and compare.  So far, a net
savings of six opcodes.
But we can go further.  There are some opcodes that only make sense
for FP operands, e.g. the transcendental instructions.  And there are
some operations that probably only make sense for non-FP operands,
e.g. POP, FF1, probably shifts.  Given the tag bit, these could share
the same op-code.  There may be several more of these.
Hands waving:: "Danger Will Robinson, Danger" more waving of hands.
Agreed.
Post by Stephen Fuld
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data.  But what happens with
separate compilations?  The called function probably doesn’t know the
The compiler will certainly have a function prototype. In any event, if FP
and Integers share a register file, the lack of prototype is much less
stressful to the compiler/linking system.
Post by Stephen Fuld
tag value for callee saved registers.  Fortunately, the My 66000
architecture comes to the rescue here.  You would modify the Enter
and Exit instructions to save/restore the tag bits of the registers
they are saving or restoring in the same data structure it uses for
the registers (yes, it adds 32 bits to that structure – minimal
cost).  The same mechanism works for interrupts that take control
away from a running process.
Yes, but we do just fine without the tag and without the stuff mentioned
above. Neither ENTER nor EXIT care about the 64-bit pattern in the register.
I think you need it for callee-saved registers to ensure the tag is set
correctly for the calling program upon return to it.
Post by Stephen Fuld
I don’t think you need to set or clear the tag bits without doing
anything else, but if you do, I think you could “repurpose” some
other instructions to do this, without requiring another op-code.
For example, Oring a register with itself could be used to set the
tag bit and Oring a register with zero could clear it.  These should
be pretty rare.
That is as far as I got.  I think you could net save perhaps 8-12 op
codes, which is about 10% of the existing op codes - not bad.  Is it
worth it?
No.
           To me, a major question is the effect on performance.  What
is the cost of having to decode the source registers and reading
their respective tag bits before knowing which FU to use?
The problem is that you have made decode dependent on dynamic pipeline information.
I suggest you don't want to do that. Consider a change from int to FP instruction
as a predicated instruction: the pipeline cannot DECODE the instruction at
hand until the predicate resolves. Yech.
Good point.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup1
2024-04-03 21:53:26 UTC
Reply
Permalink
Post by BGB-Alt
This doesn't seem too far off from what would be involved with dynamic
typing at the ISA level, but with many of same sorts of drawbacks...
00: Object Reference
00: Pointer (with type-tag)
01: ?
1z: Bounded Array
01: Fixnum (route to ALU)
10: Flonum (route to FPU)
11: Other types
00: Smaller value types
Say: int/uint, short/ushort, ...
...
So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
This solves nobody's problems; not even LISP.
Post by BGB-Alt
Decoding based on register tags would mean needing to know the register
tag bits at the same time the instruction is being decoded. In this
case, one is likely to need two clock-cycles to fully decode the opcode.
Not good. But what if you don't know the tag until the register is delivered
from a latent FU: do you stall DECODE, or do you launch and make the instruction
queue element have to deal with all outcomes?
Post by BGB-Alt
ID1: Unpack instruction to figure out register fields, etc.
ID2: Fetch registers, specialize variable instructions based on tag bits.
For timing though, one ideally doesn't want to do anything with the
register values until the EX stages (since ID2 might already be tied up
with the comparably expensive register-forwarding logic), but asking for
3 cycles for decode is a bit much.
Otherwise, if one does not know which FU should handle the operation
until EX1, this has its own issues.
Real-friggen-ely
Post by BGB-Alt
Or, possible, the FU's decide
ALU: Accepts operation if both are fixnum, FPU if both are Flonum.
What if IMUL is performed in FMAC, IDIV in FDIV,... Int<->FP routing is
based on calculation capability {Even CDC 6600 performed int × in the
FP × unit (not in Thornton's book, but via conversation with 6600 logic
designer at Asilomar some time ago. All they had to do to get FP × to
perform int × was disable 1 gate.......)
Post by BGB-Alt
But, a proper dynamic language allows mixing fixnum and flonum with the
result being implicitly converted to flonum, but from the FPU's POV,
this would effectively require two chained FADD operations (one for the
Fixnum to Flonum conversion, one for the FADD itself).
That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
a string to an integer and the string would be converted to int before.....
Post by BGB-Alt
Many other cases could get hairy, but to have any real benefit, the CPU
would need to be able to deal with them. In cases where the compiler
deals with everything, the type-tags become mostly moot (or potentially
detrimental).
You are arguing that the added complexity would somehow pay for itself.
I can't see it paying for itself.
Post by BGB-Alt
Signed int overflow wraps at 32 bits (sign extending);
maybe
Post by BGB-Alt
Unsigned int overflow wraps at 32 bits (zero extending);
maybe
Post by BGB-Alt
Variables may not hold values out-of-range for that type;
LLVM does this; GCC does not.
Post by BGB-Alt
The 'long long' and 'unsigned long long' types are exactly 64-bit;
At least 64-bit not exactly.
Post by BGB-Alt
...
...
If one has tagged 64-bit registers, then fixnum might not hold the
entire range of 'long long'. If one has 66 or 68 bit registers, then
memory storage is a problem.
Ya think ?
Post by BGB-Alt
If one has untagged registers for cases where they are needed, one has
not saved any encoding space.
I give up--not worth trying to teach a cosmetologist why the color of the
lipstick going on the pig is not the problem.....
Scott Lurndal
2024-04-03 23:20:46 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
But, a proper dynamic language allows mixing fixnum and flonum with the
result being implicitly converted to flonum, but from the FPU's POV,
this would effectively require two chained FADD operations (one for the
Fixnum to Flonum conversion, one for the FADD itself).
That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
a string to an integer and the string would be converted to int before.....
The Burroughs B3500 would simply ignore the zone digit when adding
a string to an integer, based on the address controller for the
operand.

ADD 1225 010000(UN) 020000(UA) 030000(UN)

Would add the 12 unsigned numeric nibbles at address 10000
to the 25 numeric digits of the 8-bit EBCDIC/ASCII data at address 20000
and store the result as 25 numeric nibbles at address 30000.

ADD 0507 010000(UN) 020000(UN) 030000(UA)

Would add the 5 unsigned numeric nibbles at 10000 to
the 7 unsigned numeric nibbles at 20000 and store them
as 8-bit EBCDIC bytes at 30000 (inserting the zone digit @F@
before each numeric nibble). A processor mode toggle selected
whether the inserted zone digit should be @F@ (EBCDIC) or @3@ (ASCII).

Likewise for SUB, INC, DEC, MPY, DIV and data movement instructions.

The data movement instructions would left- or right-align the destination
field (MVN (move numeric) would right justify and MVA (move alphanumeric) would
left justify) when the destination and source field lengths differ.

Floating point was BCD with an exponent sign digit, two exponent digits,
a mantissa sign digit and a variable length mantissa of up
to 100 digits in length. The integer instructions could be used
on either the mantissa or exponent individually, as they were
just fields in memory.
BGB
2024-04-04 00:27:59 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
This doesn't seem too far off from what would be involved with dynamic
typing at the ISA level, but with many of same sorts of drawbacks...
   00: Object Reference
       00: Pointer (with type-tag)
       01: ?
       1z: Bounded Array
   01: Fixnum (route to ALU)
   10: Flonum (route to FPU)
   11: Other types
     00: Smaller value types
       Say: int/uint, short/ushort, ...
     ...
So, you either have 66-bit registers, or you have 62-bit FP numbers ?!?
This solves nobody's problems; not even LISP.
Yeah, there is likely no way to make this worthwhile...
Post by MitchAlsup1
Post by BGB-Alt
Decoding based on register tags would mean needing to know the
register tag bits at the same time the instruction is being decoded.
In this case, one is likely to need two clock-cycles to fully decode
the opcode.
Not good. But what if you don't know the tag until the register is
delivered from a latent FU, do you stall DECODE, or do you launch and
make the instruction
queue element have to deal with all outcomes.
It is likely that the pipeline would need to stall until results are
available.

It is also likely that such a CPU would have a minimum effective latency
of 2 or 3 clock cycles for *every* instruction (and probably 4 or 5
cycles for memory load), in addition to requiring pipeline stalls.
Post by MitchAlsup1
Post by BGB-Alt
ID1: Unpack instruction to figure out register fields, etc.
ID2: Fetch registers, specialize variable instructions based on tag bits.
For timing though, one ideally doesn't want to do anything with the
register values until the EX stages (since ID2 might already be tied
up with the comparably expensive register-forwarding logic), but
asking for 3 cycles for decode is a bit much.
Otherwise, if one does not know which FU should handle the operation
until EX1, this has its own issues.
Real-friggen-ely
These issues could be a deal-breaker for such a CPU.
Post by MitchAlsup1
Post by BGB-Alt
                                    Or, possible, the FU's decide
   ALU: Accepts operation if both are fixnum, FPU if both are Flonum.
What if IMUL is performed in FMAC, IDIV in FDIV,... Int<->FP routing is
based on calculation capability {Even CDC 6600 performed int × in the FP
× unit (not in Thornton's book, but via conversation with 6600 logic
designer at Asilomar some time ago. All they had to do to get FP × to
perform int × was disable 1 gate.......)
Then you have a mess...

So, probably need to sort it out before EX in any case.
Post by MitchAlsup1
Post by BGB-Alt
But, a proper dynamic language allows mixing fixnum and flonum with
the result being implicitly converted to flonum, but from the FPU's
POV, this would effectively require two chained FADD operations (one
for the Fixnum to Flonum conversion, one for the FADD itself).
That is a LANGUAGE problem not an ISA problem. SNOBOL allowed one to add
a string to an integer and the string would be converted to int before.....
If you have dynamic types in hardware in this way, then effectively the
typesystem mechanics switch from being a language issue to a hardware issue.


One may also end up with, say, a CPU that can run Scheme or JavaScript
or similar, but likely couldn't run C without significant hassles.
Post by MitchAlsup1
Post by BGB-Alt
Many other cases could get hairy, but to have any real benefit, the
CPU would need to be able to deal with them. In cases where the
compiler deals with everything, the type-tags become mostly moot (or
potentially detrimental).
You are arguing that the added complexity would somehow pay for itself.
I can't see it paying for itself.
One either goes all in, or abandons the idea entirely.
There isn't really a middle option in this scenario (then one just ends
up with something that is bad at everything).

I was not saying it could work, but in a way, pointing out the issues
that would likely make this unworkable.


Though, that said, there could be possible merit in a CPU core that
could run a language like ECMAScript at roughly C like speeds, even if
it was basically unusable for pretty much anything else.

Though, for ECMAScript, one could also make a case for taking the SpiderMonkey
option and largely abandoning the use of an integer ALU (instead running
all of the integer math through the FPU, which could be modified to
support bitwise integer operations and similar as well).
Post by MitchAlsup1
Post by BGB-Alt
     Signed int overflow wraps at 32 bits (sign extending);
maybe
Post by BGB-Alt
     Unsigned int overflow wraps at 32 bits (zero extending);
maybe
I am dealing with some code that has a bad habit of breaking if integer
overflows don't happen in the expected ways (say, the ROTT engine is
pretty bad about this one...).

When I first started working on my ROTT port, there was also a lot of
wackiness where the engine would go out of bounds, then behavior would
depend on what other things in memory it encountered when it did so.


I have mostly managed to fix up all the out-of-bounds issues, but this
isn't enough to keep the demos from desyncing (a similar issue applies
with my Doom port).

Apparently, other engines like ZDoom and similar needed to do a bit of
"heavy lifting" to get the demos from all of the various WAD versions to
play without desync; as Doom was also dependent on the behavior of
out-of-bounds memory accesses, these had to be turned into
in-bounds accesses (to larger memory objects), with the memory contents
of the out-of-bounds accesses being faked.

Of course, the other option is just to "fix" the out-of-bounds accesses,
and live with a port where the demo playback desyncs.



Meanwhile, Quake entirely avoided this issue:
The demo playback is based on recording the location and orientation of
the player and any enemies at every point in time and similar, rather
than based on recording and replaying the original sequence of keyboard
inputs (and assuming that everything always happens exactly the same
each time).


Then again, these sorts of issues are not unique to these games. Have
watched more than a few speed-runs involving using glitches either to
leave the playable parts of the map, or using convoluted sequences of
actions to corrupt memory in such a way as to achieve a desired effect
(such as triggering a warp to the end of the game).

Like, during normal gameplay, these games are seemingly just sorta
corrupting memory all over the place but, for the most part, no one
notices until something goes more obviously wrong...
Post by MitchAlsup1
Post by BGB-Alt
     Variables may not hold values out-of-range for that type;
LLVM does this GCC does not.
Post by BGB-Alt
     The 'long long' and 'unsigned long long' types are exactly 64-bit;
At least 64-bit not exactly.
C only requires at least 64 bits.
I suspect that in practice, most code expects exactly 64 bits.
Post by MitchAlsup1
Post by BGB-Alt
       ...
     ...
If one has tagged 64-bit registers, then fixnum might not hold the
entire range of 'long long'. If one has 66 or 68 bit registers, then
memory storage is a problem.
Ya think ?
Both options suck, granted.
Post by MitchAlsup1
Post by BGB-Alt
If one has untagged registers for cases where they are needed, one has
not saved any encoding space.
I give up--not worth trying to teach a cosmetologist why the color of the
lipstick going on the pig is not the problem.....
I was not trying to claim that this idea wouldn't suck.


In my case, I went a different route that works a little better:
Leaving all this stuff mostly up to software...
John Savard
2024-04-05 03:13:13 UTC
Reply
Permalink
On some older CPUs, there might be one set of integer opcodes and one
set of floating-point opcodes, with a status register containing the
integer precision, and the floating-point precision, currently in use.

The idea was that this would be efficient because most programs only
use one size of each type of number, so the number of opcodes would be
the most appropriate, and that status register wouldn't need to be
reloaded too often.

It's considered dangerous, though, to have a mechanism for changing
what instructions mean, since this could let malware alter what
programs do in a useful and sneaky fashion. Memory bandwidth is no
longer a crippling constraint the way it was back in the days of core
memory and discrete transistors - at least not for program code, even
if memory bandwidth for _data_ often limits the processing speed of
computers.

This is basically because any program that does any real work, taking
any real length of time to do its job, is going to mostly consist of
loops that fit in cache. So letting program code be verbose if there
are other benefits obtained thereby is the current conventional
wisdom.

John Savard
BGB-Alt
2024-04-05 19:46:35 UTC
Reply
Permalink
Post by John Savard
On some older CPUs, there might be one set of integer opcodes and one
set of floating-point opcodes, with a status register containing the
integer precision, and the floating-point precision, currently in use.
The idea was that this would be efficient because most programs only
use one size of each type of number, so the number of opcodes would be
the most appropriate, and that status register wouldn't need to be
reloaded too often.
It's considered dangerous, though, to have a mechanism for changing
what instructions mean, since this could let malware alter what
programs do in a useful and sneaky fashion. Memory bandwidth is no
longer a crippling constraint the way it was back in the days of core
memory and discrete transistors - at least not for program code, even
if memory bandwidth for _data_ often limits the processing speed of
computers.
This is basically because any program that does any real work, taking
any real length of time to do its job, is going to mostly consist of
loops that fit in cache. So letting program code be verbose if there
are other benefits obtained thereby is the current conventional
wisdom.
This was how the FPU worked in SH-4. Reloading some bits in FPSCR would
effectively bank out the current set of FPU instructions (say, between
Single and Double, etc).


Also it was how 64-bit operations worked in early versions of 64-bit
versions of BJX1.

Say, there were DQ and JQ bits added to the control register:
DQ=0: 32-bit for variable-sized operations (like SH-4)
DQ=1: 64-bit for variable-sized operations.
JQ=0: 32-bit addressing (SH-4 memory map)
JQ=1: 48-bit addressing (like the later BJX2 memory map).

The DQ bit would also affect whether one had MOV.W or MOV.Q operations
available.
DQ=0: Interpret ops as MOV.W (16-bit)
DQ=1: Interpret ops as MOV.Q (64-bit)
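A tiny decode fragment showing the effect described (illustrative C, not
BJX1 HDL): the same 16-bit encoding is read differently depending on the
DQ mode bit, which is why the compiler had to keep toggling it.

#include <stdint.h>

enum op { OP_MOV_W, OP_MOV_Q };

/* 'bits' is a 16-bit instruction that decodes as a MOV of the
   mode-dependent width; dq comes from the control register. */
static enum op decode_mov(uint16_t bits, int dq)
{
    (void)bits;                    /* same encoding either way */
    return dq ? OP_MOV_Q : OP_MOV_W;
}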

In the DQ=JQ=0 case, it would have been mostly equivalent to SH-4 (and
could still run GCC's compiler output). This was a similar situation to
switching the FPU mode.


Though, a later version of the BJX1 ISA had dropped and repurposed some
encodings, allowing MOV.W and MOV.Q to coexist (and avoiding the need
for the compiler to endlessly toggle this bit), albeit with fewer
addressing modes for the latter.

All this was an issue mostly because SH-4 had used fixed-length 16-bit
instructions, and the encoding space was effectively almost entirely
full when I started (so new instructions required either sacrificing
existing instructions, or using mode bits).

Though, BJX1 did end up with some 32-bit ops, some borrowed from SH-2A
and similar. These were mostly stuck into awkward ad-hoc places in the
16-bit map, so decoding was kind of a pain.

...


When I later rebooted things as my BJX2 project, I effectively dropped
this whole mess and started over (with the caveat that it lost SH-4
compatibility). However, it has since gained RISC-V compatibility, for
better or worse; at least RISC-V is likely to get slightly better
performance than SH-4 (and both ISAs can be 64-bit).

...
Post by John Savard
John Savard
MitchAlsup1
2024-04-05 21:34:16 UTC
Reply
Permalink
Post by John Savard
On some older CPUs, there might be one set of integer opcodes and one
set of floating-point opcodes, with a status register containing the
integer precision, and the floating-point precision, currently in use.
The idea was that this would be efficient because most programs only
use one size of each type of number, so the number of opcodes would be
the most appropriate, and that status register wouldn't need to be
reloaded too often.
Most programs I write use bytes (mostly unsigned), a few halfwords (mostly
signed), a useful count of integers (both signed and unsigned--mainly as
already defined arguments/returns), and a vast majority of doublewords
(invariably unsigned).

Early in My 66000 LLVM development Brian looked at the cost of having
only 1 FP OpCode set--and it did not look good--so we went back to the
standard way of an OpCode for each FP size × calculation.
Post by John Savard
It's considered dangerous, though, to have a mechanism for changing
what instructions mean, since this could let malware alter what
programs do in a useful and sneaky fashion. Memory bandwidth is no
longer a crippling constraint the way it was back in the days of core
memory and discrete transistors - at least not for program code, even
if memory bandwidth for _data_ often limits the processing speed of
computers.
This is basically because any program that does any real work, taking
any real length of time to do its job, is going to mostly consist of
loops that fit in cache. So letting program code be verbose if there
are other benefits obtained thereby is the current conventional
wisdom.
John Savard
John Savard
2024-04-07 03:30:47 UTC
Reply
Permalink
Post by MitchAlsup1
Early in My 66000 LLVM development Brian looked at the cost of having
only 1 FP OpCode set--and it did not look good--so we went back to the
standard way of an OpCode for each FP size × calculation.
I do tend to agree.

However, a silly idea has now occurred to me.

256 bits can contain eight instructions that are 32 bits long.

Or they can also contain seven instructions that are 36 bits long,
with four bits left over.

So they could contain *nine* instructions that are 28 bits long, also
with four bits left over.

Thus, instead of having mode bits, one _could_ do the following:

Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.

But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.

While that's a theoretical possibility, I don't view it as being
worthwhile in practice.

John Savard
MitchAlsup1
2024-04-07 20:41:45 UTC
Reply
Permalink
Post by John Savard
Post by MitchAlsup1
Early in My 66000 LLVM development Brian looked at the cost of having
only 1 FP OpCode set--and it did not look good--so we went back to the
standard way of an OpCode for each FP size × calculation.
I do tend to agree.
However, a silly idea has now occurred to me.
256 bits can contain eight instructions that are 32 bits long.
Or they can also contain seven instructions that are 36 bits long,
with four bits left over.
So they could contain *nine* instructions that are 28 bits long, also
with four bits left over.
I agree with the arithmetic going into this statement. What I don't
have sufficient data concerning is "whether these extra formats pay
for themselves". For example, how many of the 36-bit encodings are
irredundant with the 32-bit ones, and so on with the 28-bit ones.

Take::

ADD R7,R7,#1

I suspect there is a 28-bit form, a 32-bit form, and a 36-bit form
for this semantic step, that you pay for multiple times in decoding
and possibly pipelining. {{There may also be other encodings for
this; such as:: INC R7}}
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
How do you attach 32-bit or 64-bit constants to 28-bit instructions ??

How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
set of 256-bit instruction decodes ??
Post by John Savard
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
In complicated if-then-else codes (and switches) I often see one
instruction followed by a branch to a common point. Does your encoding deal
with these efficiently ?? That is:: what happens when you jump to the
middle of a block of 36-bit instructions ??
Post by John Savard
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
Agreed.............
Post by John Savard
John Savard
John Savard
2024-04-08 13:05:35 UTC
Reply
Permalink
Post by MitchAlsup1
How do you attach 32-bit or 64-bit constants to 28-bit instructions ??
Yes, that's a problem. Presumably, I would have to do without
immediates.

An option would be to reserve some 16-bit codes to indicate a block
consisting of one 28-bit instruction and seven 32-bit instructions,
but that means a third instruction set.
Post by MitchAlsup1
How do you switch from 64-bit to Byte to 32-bit to 16-bit in one
set of 256-bit instruction decodes ??
By using 36-bit instructions instead of 28-bit instructions.
Post by MitchAlsup1
In complicated if-then-else codes (and switches) I often see one
instruction followed by a branch to a common point. Does your encoding deal
with these efficiently ?? That is:: what happens when you jump to the
middle of a block of 36-bit instructions ??
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions. So the computer knows where the instructions are;
and thus a convention can be applied, such as addressing each 36-bit
instruction by the addresses of the first seven 32-bit positions in
the block.

In the case of 28-bit instructions, the first eight correspond to the
32-bit positions, the ninth corresponds to the last 16 bits of the
block.
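Read that way, the byte offset inside a 256-bit block maps to an
instruction slot roughly as follows (an illustrative sketch of the
convention, not a definition):

/* offset = branch target address & 31, i.e. the position within the
   32-byte block; the 4-bit block header has already given the format. */
static int slot_28bit(int offset)        /* nine 28-bit instructions   */
{
    return (offset == 30) ? 8 : offset / 4;  /* slots 0..7 at the 32-bit
                                                positions, slot 8 in the
                                                last 16 bits of the block */
}

static int slot_36bit(int offset)        /* seven 36-bit instructions  */
{
    return offset / 4;                   /* only the first seven 32-bit
                                            positions (0, 4, ..., 24) are used */
}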

John Savard
Thomas Koenig
2024-04-08 17:25:38 UTC
Reply
Permalink
Post by John Savard
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions.
Do you think that instructions which require a certain size (almost)
always happen to be situated together so they fit in a block?
John Savard
2024-04-17 21:06:05 UTC
Reply
Permalink
On Mon, 8 Apr 2024 17:25:38 -0000 (UTC), Thomas Koenig
Post by Thomas Koenig
Post by John Savard
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions.
Do you think that instructions which require a certain size (almost)
always happen to be situated together so they fit in a block?
Well, floating-point and integer instructions of one size each can be
arbitrarily mixed. And when different sizes need to mix, going to
36-bit instructions is low overhead.

John Savard
MitchAlsup1
2024-04-08 19:56:27 UTC
Reply
Permalink
Post by John Savard
Post by MitchAlsup1
In complicated if-then-else codes (and switches) I often see one inst-
ruction followed by a branch to a common point. Does your encoding deal
with these efficiently ?? That is:: what happens when you jump to the
middle of a block of 36-bit instructions ??
Well, when the computer fetches a 256-bit block of code, the first
four bits indicates whether it is composed of 36-bit instructions or
28-bit instructions. So the computer knows where the instructions are;
and thus a convention can be applied, such as addressing each 36-bit
instruction by the addresses of the first seven 32-bit positions in
the block.
So, instead of using the branch target address, one rounds it down to
a 256-bit boundary, reads 256 bits and looks at the first 4 bits to
determine the format, and then uses the branch offset to pick a
container which will become the first instruction executed.

Sounds more complicated than necessary.
Post by John Savard
In the case of 28-bit instructions, the first eight correspond to the
32-bit positions, the ninth corresponds to the last 16 bits of the
block.
John Savard
John Savard
2024-04-17 21:07:18 UTC
Reply
Permalink
Post by MitchAlsup1
So, instead of using the branch target address, one rounds it down to
a 256-bit boundary, reads 256 bits and looks at the first 4 bits to
determine the format, and then uses the branch offset to pick a
container which will become the first instruction executed.
Sounds more complicated than necessary.
Yes, I don't disagree. I'm just pointing out that it's possible to
make the mini tags idea work that way, since it lets you easily turn
mini tags off when you need to.

John Savard
Thomas Koenig
2024-04-07 21:01:15 UTC
Reply
Permalink
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).

Did that look promising? Not really; the 21 bits offered a lot
of useful opcode space for two-register operations and even for
a few of the often-used three-register, but 42 bits was really
a bit too long, so the advantage wasn't great. And embedding
32-bit or 64-bit instructions in the code stream does not really
fit the 21-bit raster well, so compared to an ISA which can do so
(like My 66000) it came out at a disadvantage. Might be possible
to beat RISC-V, though.
MitchAlsup1
2024-04-07 21:22:50 UTC
Reply
Permalink
Post by Thomas Koenig
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
for 64 OpCodes. Now if you have floats and doubles and signed and
unsigned, you get 16 of each and we have not looked at memory
references or branching.
Post by Thomas Koenig
Did that look promising? Not really; the 21 bits offered a lot
of useful opcode space for two-register operations and even for
a few of the often-used three-register, but 42 bits was really
a bit too long, so the advantage wasn't great. And embedding
32-bit or 64-bit instructions in the code stream does not really
fit the 21-bit raster well, so compared to an ISA which can do so
(like My 66000) it came out at a disadvantage. Might be possible
to beat RISC-V, though.
But beating RISC-V is easy, try getting your instruction count down
to VAX counts without losing the ability to pipeline and parallel
instruction execution.

At handwaving accuracy::
VAX has 1.0 instructions
My 66000 has 1.1 instructions
RISC-V has 1.5 instructions
Thomas Koenig
2024-04-08 06:21:43 UTC
Reply
Permalink
Post by MitchAlsup1
Post by Thomas Koenig
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Post by MitchAlsup1
Otherwise one has 3×5-bit registers = 15-bits leaving only 6-bits
for 64 OpCodes.
There could have been a case for adding this (maybe just for
a few frequent ones: "add r1,r2,r3", "add r1,r2,-r3", "add
r1,r2,#num" and "add r1,r2,#-num"), but I did not pursue that
further.

I looked at load and store instructions with short offsets
(these would then have been scaled), and short branches. But
the 21-bit opcode space filled up really, really rapidly.

Also, it is easy to synthesize a 3-register operation from
a 2-register operation and a memory move. If the decoder is
set up for 42 bits anyway, instruction fusion is also a possibility.
This got a bit weird.
Post by MitchAlsup1
Now if you have floats and doubles and signed and
unsigned, you get 16 of each and we have not looked at memory
references or branching.
As somebody who does Fortran, I find the frequency of floating-point
instructions surprisingly low, even in Fortran code.
Post by MitchAlsup1
Post by Thomas Koenig
Did that look promising? Not really; the 21 bits offered a lot
of useful opcode space for two-register operations and even for
a few of the often-used three-register, but 42 bits was really
a bit too long, so the advantage wasn't great. And embedding
32-bit or 64-bit instructions in the code stream does not really
fit the 21-bit raster well, so compared to an ISA which can do so
(like My 66000) it came out at a disadvantage. Might be possible
to beat RISC-V, though.
But beating RISC-V is easy, try getting you instruction count down
to VAX counts without losing the ability to pipeline and parallel
instruction execution.
VAX has 1.0 instructions
My 66000 has 1.1 instructions
RISC-V has 1.5 instructions
To reach VAX instruction density, one would have to have things
like memory operands (with the associated danger that compilers
will not put intermediate results in registers, but since they have
been optimized for x86 for decades, they are probably better now)
and load with update, which would then have to be cracked
into two micro-ops. Not sure about the benefit.
Anton Ertl
2024-04-08 07:16:08 UTC
Reply
Permalink
Post by Thomas Koenig
Post by MitchAlsup1
But beating RISC-V is easy, try getting you instruction count down
to VAX counts without losing the ability to pipeline and parallel
instruction execution.
VAX has 1.0 instructions
My 66000 has 1.1 instructions
RISC-V has 1.5 instructions
To reach VAX instruction density
Note that in recent times Mitch Alsup is writing not about code
density (static code size or dynamically executed bytes), but about
instruction counts. It's unclear why instruction count would be a
primary metric, except that he thinks that he can score points for My
66000 with it. As VAX demonstrates, you can produce an instruction
set with low instruction counts that is bad at the metrics that really
count: cycles for executing the program (for a given CPU chip area in
a given manufacturing process), and, for very small systems, static
code size.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Thomas Koenig
2024-04-09 18:24:55 UTC
Reply
Permalink
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).

Having learned about M-Core in the meantime, a pure 32-register,
21-bit instruction ISA might actually work better.
BGB
2024-04-09 20:01:50 UTC
Reply
Permalink
Post by Thomas Koenig
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.

Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.


Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.

Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with 250+
local variables to make effective use of this, *, which probably isn't
going to happen).


*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of CPU
registers. If more local variables than this, then spill/fill rate goes
up significantly, and if less, then the registers aren't utilized as
effectively.

Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total number
of variables isn't all that large either.


Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
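
As a rough illustration (hypothetical C, invented names; which bucket a
given function lands in naturally depends on the target's register
budget and ABI):

/* Tiny leaf: no calls, few enough live values that everything can
   stay in scratch registers; no stack frame needed. */
int clamp16(int x)
{
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return x;
}

/* Leaf: still no calls, but enough simultaneously live values that,
   on a register-poor target, callee-save registers (and thus a stack
   frame) may already be needed. */
long dot_product(const long *a, const long *b, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

/* Non-leaf: calls another function, so it has a stack frame and must
   keep values needed after the call in callee-save registers. */
extern long read_item(void *ctx);   /* invented external function */
long sum_items(void *ctx, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += read_item(ctx);
    return total;
}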


There is a "static assign everything" case in my case, where all of the
variables are statically assigned to registers (for the scope of the
function). This case typically requires that everything fit into
callee-save registers, so (like the "tiny leaf" category) it requires
that the number of local variables be less than the number of
available registers.

On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.

In the non-static case, the top N variables might be static-assigned,
and the remaining variables dynamically assigned. Though, it appears
this is more an artifact of my naive register allocator, and might not
be as effective a strategy with an "actually clever" register
allocator (like those in GCC or LLVM), where purely dynamic allocation
may be better (they are able to carry dynamic assignments across basic
block boundaries, rather than needing to spill/fill everything whenever
a branch or label is encountered).

...
MitchAlsup1
2024-04-09 21:05:44 UTC
Reply
Permalink
Post by BGB
Post by Thomas Koenig
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with 250+
local variables to make effective use of this, *, which probably isn't
going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of
GPRs AND you have good access to constants.
Post by BGB
*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of CPU
registers. If more local variables than this, then spill/fill rate goes
up significantly, and if less, then the registers aren't utilized as
effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total number
of variables isn't all that large either.
The vast majority of leaf functions use fewer than 16 GPRs, given one has
an SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memmove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
Post by BGB
Everything fits in scratch registers, no stack frame, no calls.
No function calls (either explicit or implicit);
Will have a stack frame.
May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Post by BGB
There is a "static assign everything" case in my case, where all of the
variables are statically assigned to registers (for the scope of the
function). This case typically requires that everything fit into callee
save registers, so (like the "tiny leaf" category, requires that the
number of local variables is less than the available registers).
On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.
The apparent number of registers goes up when one does not waste a register
to hold a use-once constant.
BGB-Alt
2024-04-09 22:47:13 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Post by Thomas Koenig
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with
250+ local variables to make effective use of this, *, which probably
isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
of GPRs AND you have good access to constants.
On the main ISAs I had tried to generate code for, 16 GPRs was kind of
a pain as it resulted in fairly high spill rates.

Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).


My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my compiler
design, both function calls and branches terminate the current
basic-block).

On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.


This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with code
that has lots of spill-and-fill. This along with instructions having
access to 32-bit immediate values.
Post by MitchAlsup1
Post by BGB
*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of
CPU registers. If more local variables than this, then spill/fill rate
goes up significantly, and if less, then the registers aren't utilized
as effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total
number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
Yeah.

Things like memcpy/memmove/memset/etc. are function calls in cases where
they are not directly transformed into register load/store sequences.

Did end up with an intermediate "memcpy slide", which can handle medium
size memcpy and memset style operations by branching into a slide.



As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers. On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).


If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.

But, say, a 32 GPR machine could still do well here.


Note that there are reasons why I don't claim 64 GPRs as a large
performance advantage:
On programs like Doom, the difference is small at best.


It mostly affects things like GLQuake in my case, mostly because TKRA-GL
has a lot of functions with large numbers of local variables (some
exceeding 100 local variables).

Partly, though, this is because code that is highly inlined and unrolled
and uses lots of variables tends to perform better in my case (and
tightly looping code with lots of small functions, not so much...).
Post by MitchAlsup1
Post by BGB
     Everything fits in scratch registers, no stack frame, no calls.
     No function calls (either explicit or implicit);
     Will have a stack frame.
     May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.

In my case:
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.

Try/throw/catch:
Mostly N/A for leaf functions.

Any function that can "throw", is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also, no
longer a leaf function.

Need for GBR save/restore effectively excludes a function from being
tiny-leaf. This may happen, say, if a function accesses global variables
and may be called as a function pointer.
Post by MitchAlsup1
Post by BGB
There is a "static assign everything" case in my case, where all of
the variables are statically assigned to registers (for the scope of
the function). This case typically requires that everything fit into
callee save registers, so (like the "tiny leaf" category, requires
that the number of local variables is less than the available registers).
On a 32 register machine, if there are 14 available callee-save
registers, the limit is 14 variables. On a 64 register machine, this
limit might be 30 instead. This seems to have good coverage.
The apparent number of registers goes up when one does not waste a register
to hold a use-once constant.
Possibly true. In the "static assign everything" case, each constant
used is also assigned a register.


One "TODO" here would be to merge constants with the same "actual" value
into the same register. At present, they will be duplicated if the types
are sufficiently different (such as integer 0 vs NULL).
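
One way to approach that merge (a sketch only, not BGBCC's actual
structures; reg_alloc() is a hypothetical allocator hook) is to key the
per-function constant table on the raw 64-bit bit pattern rather than
on (type, value):

/* Hypothetical per-function constant pool keyed by raw bit pattern,
   so integer 0, NULL, and +0.0 (all-zero bits) share one register. */
typedef struct { unsigned long long bits; int reg; } ConstSlot;

static ConstSlot const_pool[256];
static int       const_count;

int reg_for_const(unsigned long long bits, int (*reg_alloc)(void))
{
    for (int i = 0; i < const_count; i++)
        if (const_pool[i].bits == bits)
            return const_pool[i].reg;    /* reuse existing assignment */
    int r = reg_alloc();                 /* hypothetical allocator hook */
    if (const_count < 256) {
        const_pool[const_count].bits = bits;
        const_pool[const_count].reg  = r;
        const_count++;
    }
    return r;
}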

For functions with dynamic assignment, immediate values are more likely
to be used. If the code-generator were clever, potentially it could
exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate. Currently,
BGBCC is not that clever.

Or, say:
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.


Well, and another weakness is with temporaries that exist as function
arguments:
If statically assigned, the "target variable directly to argument register"
optimization can't be used (it ends up needing to go into a callee-save
register and then be MOV'ed into the argument register; otherwise the
compiler breaks...).

Though, I guess one possibility could be that the compiler could try to
partition temporaries that are used exclusively as function arguments
into a different category from "normal" temporaries (or those whose
values may cross a basic-block boundary), and then avoid
statically-assigning them (and somehow not cause this to effectively
break the full-static-assignment scheme in the process).

Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.


But, yeah, compiler stuff is really fiddly...
MitchAlsup1
2024-04-10 00:28:02 UTC
Reply
Permalink
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code
density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with
250+ local variables to make effective use of this, *, which probably
isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind of
a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my compiler
design, both function calls and branches terminating the current
basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with code
that has lots of spill-and-fill. This along with instructions having
access to 32-bit immediate values.
Yes, x86 and any architecture (IBM 360, S.E.L. , Interdata, ...) that have
LD-Ops act as if they have 4-6 more registers than they really have. x86
with 16 GPRs acts like a RISC with 20-24 GPRs as does 360. Does not really
take the place of universal constants, but goes a long way.
Post by BGB-Alt
Post by MitchAlsup1
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once one
starts placing things like memove(), memset(), sin(), cos(), exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases when
not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences; MM is a single
instruction.
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle medium
size memcpy and memset style operations by branching into a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in between. This
means one can start (queue up) a SATA disk access without obtaining a lock
to the device--simply because one can fill in all the data of a command in
a single instruction which smells ATOMIC to all interested 3rd parties.
Post by BGB-Alt
As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting
totally screwed.
Post by BGB-Alt
On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
The data back in the R2000/R3000 days indicated that 32 GPRs had a 15%+
advantage over 16 GPRs, while 64 had only a 3% advantage.
Post by BGB-Alt
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
On programs like Doom, the difference is small at best.
It mostly effects things like GLQuake in my case, mostly because TKRA-GL
has a lot of functions with a large numbers of local variables (some
exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Post by MitchAlsup1
Post by BGB
     Everything fits in scratch registers, no stack frame, no calls.
     No function calls (either explicit or implicit);
     Will have a stack frame.
     May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
There is no frame pointer, as BGBCC doesn't use one;
Can't do PASCAL and other ALGOL-derived languages with block structure.
Post by BGB-Alt
All stack-frames are fixed size, VLA's and alloca use the heap;
longjmp() is at a serious disadvantage here.
Destructors are sometimes hard to position on the stack.
Post by BGB-Alt
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.
Mostly N/A for leaf functions.
Any function that can "throw", is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also, no
longer a leaf function.
You do realize that there is a set of #define-s that can implement
try-throw-catch without requiring any subroutines ?!?
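
One common reading of that is a setjmp()/longjmp()-based macro scheme;
a minimal sketch (single level only, no nesting or cleanup, and
setjmp/longjmp are of course themselves library routines, so this may
not be exactly what is meant here):

#include <setjmp.h>

static jmp_buf exc_env;
static int     exc_code;

#define TRY       if ((exc_code = setjmp(exc_env)) == 0)
#define CATCH(e)  else if (((e) = exc_code) != 0)
#define THROW(c)  longjmp(exc_env, (c))

/* Usage:
     int e;
     TRY {
         THROW(42);
     } CATCH(e) {
         ... handle e (== 42) ...
     }
*/
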
Post by BGB-Alt
Need for GBR save/restore effectively excludes a function from being
tiny-leaf. This may happen, say, if a function accesses global variables
and may be called as a function pointer.
------------------------------------------------------
Post by BGB-Alt
One "TODO" here would be to merge constants with the same "actual" value
into the same register. At present, they will be duplicated if the types
are sufficiently different (such as integer 0 vs NULL).
In practice, the upper 48 bits of an extern variable's address are
completely shared, whereas the lower 16 bits are unique.
Post by BGB-Alt
For functions with dynamic assignment, immediate values are more likely
to be used. If the code-generator were clever, potentially it could
exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate. Currently,
BGBCC is not that clever.
And then there are languages like PL/1 and FORTRAN where the compiler
has to figure out how big an intermediate array is, allocate it, perform
the math, and then deallocate it.
Post by BGB-Alt
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.
Well, and another weakness is with temporaries that exist as function
If static assigned, the "target variable directly to argument register"
optimization can't be used (it ends up needing to go into a callee-save
register and then be MOV'ed into the argument register; otherwise the
compiler breaks...).
Though, I guess possible could be that the compiler could try to
partition temporaries that are used exclusively as function arguments
into a different category from "normal" temporaries (or those whose
values may cross a basic-block boundary), and then avoid
statically-assigning them (and somehow not cause this to effectively
break the full-static-assignment scheme in the process).
Brian's compiler finds the largest argument list and the largest return
value list and merges them into a single area on the stack used only
for passing arguments and results across the call interface. And the
<static> SP points at this area.
Post by BGB-Alt
Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.
But, yeah, compiler stuff is really fiddly...
More orthogonality helps.
BGB
2024-04-10 07:24:40 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and
128 GPRs is wasteful (would likely need lots of monster functions
with 250+ local variables to make effective use of this, *, which
probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not
part of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind
of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my
compiler design, both function calls and branches terminating the
current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).

Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.

Also 8-bit branch displacements are kinda lame, ...


And, if one wanted a 16-bit branch:
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
.L0:
NOP // delay slot
.WORD $(Label - .L0)

Also kinda bad...
Post by MitchAlsup1
Post by BGB-Alt
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with
code that has lots of spill-and-fill. This along with instructions
having access to 32-bit immediate values.
Yes, x86 and any architecture (IBM 360, S.E.L. , Interdata, ...) that have
LD-Ops act as if they have 4-6 more registers than they really have. x86
with 16 GPRs acts like a RISC with 20-24 GPRs as does 360. Does not really
take the place of universal constants, but goes a long way.
Yeah.
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single inst-
ruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...


For small copies, can encode them inline, but past a certain size this
becomes too bulky.

A copy loop makes more sense for bigger copies, but has a high overhead
for small-to-medium copies.


So, there is a size range where doing it inline would be too bulky, but
a loop carries an undesirable level of overhead.

Ended up doing these with "slides", which end up eating roughly several
kB of code space, but this was more compact than using larger inline copies.


Say (IIRC):
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.

The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.

Though, this is only used for fixed-size copies (or "memcpy()" when the
size is constant).


Say:

__memcpy64_512_ua:
MOV.Q (R5, 480), R20
MOV.Q (R5, 488), R21
MOV.Q (R5, 496), R22
MOV.Q (R5, 504), R23
MOV.Q R20, (R4, 480)
MOV.Q R21, (R4, 488)
MOV.Q R22, (R4, 496)
MOV.Q R23, (R4, 504)

__memcpy64_480_ua:
MOV.Q (R5, 448), R20
MOV.Q (R5, 456), R21
MOV.Q (R5, 464), R22
MOV.Q (R5, 472), R23
MOV.Q R20, (R4, 448)
MOV.Q R21, (R4, 456)
MOV.Q R22, (R4, 464)
MOV.Q R23, (R4, 472)

...

__memcpy64_32_ua:
MOV.Q (R5), R20
MOV.Q (R5, 8), R21
MOV.Q (R5, 16), R22
MOV.Q (R5, 24), R23
MOV.Q R20, (R4)
MOV.Q R21, (R4, 8)
MOV.Q R22, (R4, 16)
MOV.Q R23, (R4, 24)
RTS
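
A rough C model of the slide-plus-tail behavior (illustrative only;
copy32() is a hypothetical stand-in for one 32-byte slide entry, and
for fixed-size copies the entry point is of course picked at compile
time rather than computed like this):

#include <stddef.h>

extern void copy32(unsigned char *d, const unsigned char *s); /* hypothetical */

static void memcpy_fixed(unsigned char *d, const unsigned char *s, size_t n)
{
    size_t body = n & ~(size_t)31;        /* part the slide handles */
    for (size_t i = body; i < n; i++)     /* 0..31 trailing bytes first */
        d[i] = s[i];
    for (size_t off = body; off >= 32; off -= 32)  /* reverse order, like the slide */
        copy32(d + off - 32, s + off - 32);
}
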
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in between. This
means one can start (queue up) a SATA disk access without obtaining a lock
to the device--simply because one can fill in all the data of a command in
a single instruction which smells ATOMIC to all interested 3rd parties.
My case, non-atomic, polling IO.


Code fragment:
  while(ct<cte)
  {
    P_SPI_QDATA=0xFFFFFFFFFFFFFFFFULL;            // dummy output word (all 1s)
    P_SPI_CTRL=tkspi_ctl_status|SPICTRL_XMIT8X;   // kick off an 8-byte transfer
    v=P_SPI_CTRL;
    while(v&SPICTRL_BUSY)                         // poll until transfer completes
      v=P_SPI_CTRL;
    *(u64 *)ct=P_SPI_QDATA;                       // store the 8 received bytes
    ct+=8;
  }

Where the MMIO interface allows sending/receiving 8 bytes at a time to
avoid bogging down at around 500 K/s or so (with 8B transfers, could
theoretically do 4 MB/s; though it is only ~ 1.5 MB/s with 12.5 MHz SPI).

Though, this is part of why I had ended up LZ compressing damn near
everything (LZ4 or RP2 being faster than sending ~ 3x as much data over
the SPI interface).


Hadn't generally used Huffman as the additional compression wasn't worth
the fairly steep performance cost (with something like Deflate, it would
barely be much faster than the bare SPI interface).



Did recently come up with a "pseudo entropic" coding that seems
promising in some testing:
  Rank symbols by probability, sending a table of the most common 128 symbols;
  Send the encoded symbols as table indices via bytes, say:
    00..78: Pair of symbol indices, each in 00..0A;
    7F:     Escaped byte (the next byte is raw)
    80..FF: Symbol index (00..7F into the table)

Which, while it seems like this would likely fail to do much of
anything, it "sorta works", and is much faster to unpack than Huffman.

Though, if the distribution is "too flat", one needs to be able to fall
back to raw bytes.
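
A sketch of the corresponding decoder, per my reading of the byte
ranges above (table[] holds the 128 ranked symbols, most frequent
first; byte values 79..7E are assumed unused):

#include <stddef.h>

size_t pse_decode(const unsigned char *src, size_t n,
                  unsigned char *dst, const unsigned char table[128])
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned b = src[i];
        if (b <= 0x78) {               /* pair of indices, each 0..10 */
            dst[j++] = table[b / 11];
            dst[j++] = table[b % 11];
        } else if (b == 0x7F) {        /* escape: next byte is raw */
            dst[j++] = src[++i];
        } else if (b >= 0x80) {        /* single index into the table */
            dst[j++] = table[b - 0x80];
        }
    }
    return j;                          /* number of bytes produced */
}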

Had experimentally written a compressor based around this scheme, and
while not as fast as LZ4, it did give compression much closer to Deflate.


Where, IME, on my current main PC:
  LZMA:    ~ 35 MB/s
    Bitwise range coder.
  Deflate: ~ 200 MB/s
    Huffman based, symbols limited to 15 bits.
  TKuLZ:   ~ 350 MB/s
    Resembles a Deflate / LZ4 hybrid.
    Huffman based, symbols limited to 12 bits.
  TKFLZH:  ~ 500 MB/s
    Similar to a more elaborate version of TKuLZ.
    Huffman symbols limited to 13 bits.
  TKDELZ:  ~ 700 MB/s
    Similar to the prior two, but:
      Splits symbols into separately-coded blocks;
      Uses an interleaved encoding scheme, decoding 4 symbols at a time.
  PSELZ:   ~ 1.0 GB/s
    Uses separate symbol blocks, with the "pseudo entropic" encoding.
  RP2:     ~ 1.8 GB/s
    Byte oriented.
  LZ4:     ~ 2.1 GB/s



Though, RP2 and LZ4 switch places on BJX2, where RP2 is both slightly
faster and gives slightly better compression.

I suspect this is likely because of differences in the relative cost of
byte loads and branch mispredicts.


Note that TKuLZ/TKFLZH/TKDELZ/PSELZ used a similar structure for
encoding LZ matches:
  TAG
  (Raw Length)
  (Match Length)
  Match Distance
  (Literal Bytes)
Where, TAG has a structure like:
  (7:5): Raw Length   (0..6,  7 = Separate length)
  (4:0): Match Length (3..33, 34 = Separate length)

Though, the former 3 were using a combined nybble-stream and bitstream.

Had considered a nybble stream for PSELZ, but ended up using bytes as
bytes are faster.
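
A sketch of emitting one match record in that layout (my reading of
the field packing; emit_length() and emit_distance() are stand-ins for
whatever variable-length encodings the streams actually use):

extern void emit_length(unsigned char **p, unsigned len);    /* stand-in */
extern void emit_distance(unsigned char **p, unsigned dist); /* stand-in */

void emit_match(unsigned char **p, unsigned rawlen, unsigned mlen,
                unsigned dist, const unsigned char *lits)
{
    unsigned rl = (rawlen <= 6) ? rawlen   : 7;   /* 7  = raw length follows  */
    unsigned ml = (mlen  <= 33) ? mlen - 3 : 31;  /* 31 = match length follows */
    *(*p)++ = (unsigned char)((rl << 5) | ml);    /* TAG byte */
    if (rl == 7)  emit_length(p, rawlen);
    if (ml == 31) emit_length(p, mlen);
    emit_distance(p, dist);
    for (unsigned i = 0; i < rawlen; i++)         /* raw (literal) bytes */
        *(*p)++ = lits[i];
}
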
Post by MitchAlsup1
Post by BGB-Alt
As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting
totally screwed.
OK.
Post by MitchAlsup1
Post by BGB-Alt
                    On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
The data back in the R2000/R3000 days indicated that 32 GPRs had a 15%+
advantage over 16 GPRs, while 64 had only a 3% advantage.
Probably true enough.
Post by MitchAlsup1
Post by BGB-Alt
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
On programs like Doom, the difference is small at best.
It mostly effects things like GLQuake in my case, mostly because
TKRA-GL has a lot of functions with a large numbers of local variables
(some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Post by MitchAlsup1
Post by BGB
     Everything fits in scratch registers, no stack frame, no calls.
     No function calls (either explicit or implicit);
     Will have a stack frame.
     May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
   There is no frame pointer, as BGBCC doesn't use one;
Can't do PASCAL and other ALOGO derived languages with block structure.
Nothing prevents having a frame pointer; it's just that BGBCC doesn't use
one, as it doesn't really gain anything with fixed-size stack frames (and
it is another register to be saved/restored).

Granted, it would make stack-walking easier.
As is, one needs to use a similar strategy to how one does stack
unwinding in the Win64 ABI (namely looking stuff up in a table, and then
parsing the instruction sequence).
Post by MitchAlsup1
Post by BGB-Alt
     All stack-frames are fixed size, VLA's and alloca use the heap;
longjump() is at a serious disadvantage here. desctructors are sometimes
hard to position on the stack.
Yeah... If you use longjmp, any VLA's or alloca's are gonna be leaked...

Nothing really mandates that longjmp + alloca not result in a memory
leak though (or that alloca can't be implemented as a fancy wrapper over
malloc).
Post by MitchAlsup1
Post by BGB-Alt
   GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
   TLS, accessed via TBR.
   Mostly N/A for leaf functions.
Any function that can "throw", is in effect no longer a leaf function.
Implicitly, any function which uses "variant" or similar is also, no
longer a leaf function.
You do realize that there is a set of #define-s that can implement
try-throw-catch without requiring any subroutines ?!?
?...

Throw was implemented as a runtime call in my case.

Though, try/catch involves arcane metadata and compiler magic (basically
borrowing a similar design to the WinCE and Win64 exception handling).

Though, could have in theory gone the SEH route and implemented it using
hidden runtime calls and a linked list. There are pros/cons either way
(and SEH would have an on-average lower overhead for C, since when it is
not used, its cost would effectively drop to 0; unlike the cost of
PE/COFF exception-handling tables, and needing to provide dummy
exception catch/continue blobs in the off-chance exceptions were used
and passed through C land).
Post by MitchAlsup1
Post by BGB-Alt
Need for GBR save/restore effectively excludes a function from being
tiny-leaf. This may happen, say, if a function accesses global
variables and may be called as a function pointer.
------------------------------------------------------
Post by BGB-Alt
One "TODO" here would be to merge constants with the same "actual"
value into the same register. At present, they will be duplicated if
the types are sufficiently different (such as integer 0 vs NULL).
In practice, the upper 48-bits of a extern variable is completely shared
whereas the lower 16-bits are unique.
Post by BGB-Alt
For functions with dynamic assignment, immediate values are more
likely to be used. If the code-generator were clever, potentially it
could exclude assigning registers to constants which are only used by
instructions which can encode them directly as an immediate.
Currently, BGBCC is not that clever.
And then there are languages like PL/1 and FORTRAN where the compiler
has to figure out how big an intermediate array is, allocate it, perform
the math, and then deallocate it.
I don't expect BJX2 and FORTRAN to cross paths...
Post by MitchAlsup1
Post by BGB-Alt
   y=x+31;  //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.
Well, and another weakness is with temporaries that exist as function
If static assigned, the "target variable directly to argument
register" optimization can't be used (it ends up needing to go into a
callee-save register and then be MOV'ed into the argument register;
otherwise the compiler breaks...).
Though, I guess possible could be that the compiler could try to
partition temporaries that are used exclusively as function arguments
into a different category from "normal" temporaries (or those whose
values may cross a basic-block boundary), and then avoid
statically-assigning them (and somehow not cause this to effectively
break the full-static-assignment scheme in the process).
Brian's compiler finds the largest argument list and the largest return
value list and merges them into a single area on the stack used only
for passing arguments and results across the call interface. And the
<static> SP points at this area.
The issue isn't with stack space; this part is straightforward.

Rather, it's that in the IR stage, one has something like (pseudocode):
t4 := t0 + 1;
t5 := t1 + 2;
t6 := func(t4, t5);
t7 := t6 + 3;

Where, at the ASM level, one could do, say:
ADD R8, 1, R4
ADD R9, 2, R5
BSR func
ADD R2, 3, R10

But... Pulling this off without the compiler and/or compiled program
exploding in the process, is easier said than done.

OTOH:
ADD R8, 1, R11
ADD R9, 2, R12
MOV R11, R4
MOV R12, R5
BSR func
MOV R2, R11
ADD R11, 3, R10

This is not quite as efficient but, despite some efforts, it is closer to
the current situation than the version above is. There are some cases where
these optimizations can be performed, but only if the variables in
question are not static-assigned, which forces the latter scenario.
Post by MitchAlsup1
Post by BGB-Alt
Though, IIRC, I had also considered the possibility of a temporary
"virtual assignment", allowing the argument value to be temporarily
assigned to a function argument register, then going "poof" and
disappearing when the function is called. Hadn't yet thought of a good
way to add this logic to the register allocator though.
But, yeah, compiler stuff is really fiddly...
More orthogonality helps.
These parts of my compiler are a horrible mess, and rather brittle...

Part of the reason there is no RISC-V support in BGBCC is that the ABI
is different, and the current design of the register allocator can't
deal with a different ABI design.
MitchAlsup1
2024-04-10 17:12:47 UTC
Reply
Permalink
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Post by BGB
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
Post by BGB
MOV.W (PC, 4), R0 //load a 16-bit branch displacement
BRA/F R0
NOP // delay slot
.WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Post by BGB
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single inst-
ruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
Post by BGB
For small copies, can encode them inline, but past a certain size this
becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high overhead
for small to medium copy.
So, there is a size range where doing it inline would be too bulky, but
a loop caries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
Post by BGB
Ended up doing these with "slides", which end up eating roughly several
kB of code space, but was more compact than using larger inline copies.
128 bytes or less: Inline Ld/St sequence
129 bytes to 512B: Slide
Over 512B: Call "memcpy()" or similar.
Versus::
1-infinity: use MM instruction.
Post by BGB
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
Does this remain sequentially consistent ??
Post by BGB
Though, this is only used for fixed-size copies (or "memcpy()" when
value is constant).
MOV.Q (R5, 480), R20
MOV.Q (R5, 488), R21
MOV.Q (R5, 496), R22
MOV.Q (R5, 504), R23
MOV.Q R20, (R4, 480)
MOV.Q R21, (R4, 488)
MOV.Q R22, (R4, 496)
MOV.Q R23, (R4, 504)
MOV.Q (R5, 448), R20
MOV.Q (R5, 456), R21
MOV.Q (R5, 464), R22
MOV.Q (R5, 472), R23
MOV.Q R20, (R4, 448)
MOV.Q R21, (R4, 456)
MOV.Q R22, (R4, 464)
MOV.Q R23, (R4, 472)
....
MOV.Q (R5), R20
MOV.Q (R5, 8), R21
MOV.Q (R5, 16), R22
MOV.Q (R5, 24), R23
MOV.Q R20, (R4)
MOV.Q R21, (R4, 8)
MOV.Q R22, (R4, 16)
MOV.Q R23, (R4, 24)
RTS
Duff's device by any other name.
Scott Lurndal
2024-04-10 17:29:22 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
BGB-Alt
2024-04-10 21:56:51 UTC
Reply
Permalink
Post by Scott Lurndal
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
FWIW, in my case:
32K I$ + 32K D$ does fairly well IME;

16K I$ + 32K D$ works well for Doom, but has a notably higher I$ miss
rate for Quake and similar (and most other non-Doom programs). Doom, by
contrast, seems to be pretty much D$ bound.


Constants are generally encoded inline in BJX2.

...
MitchAlsup1
2024-04-10 23:30:02 UTC
Reply
Permalink
Post by Scott Lurndal
Post by MitchAlsup1
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.

Besides:: if you had to LD the constant from memory, you use a LD instruction
and 1 or 2 words in DCache, while consuming a GPR. So, overall, it takes
fewer cycles, fewer GPRs, and fewer instructions.

Alternatively:: if you paste constants together (LUI, AUIPC) you have no
direct route to either 64-bit constants or 64-bit address spaces.

It looks to be a win-win !!
Michael S
2024-04-11 11:13:24 UTC
Reply
Permalink
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
Post by MitchAlsup1
In My 66000 case, the constant is the word following the
instruction. Easy to find, easy to access, no register pollution,
no DCache pollution.
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUPIC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
BGB
2024-04-11 18:35:41 UTC
Reply
Permalink
Post by Michael S
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
Post by MitchAlsup1
In My 66000 case, the constant is the word following the
instruction. Easy to find, easy to access, no register pollution,
no DCache pollution.
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUPIC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
FWIW:
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.

MOV Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn

Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.

An encoding that can MOV a 64-bit constant in 96 bits (12 bytes) and
1 cycle is preferable....
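
In C terms, each SHORI step behaves like "shift the register left 16
and OR in the next 16-bit immediate" (my reading, consistent with the
"Xn=(Xn<<32)|Imm32" fused form mentioned further down the thread), so
the four-instruction sequence builds a 64-bit constant piecewise:

unsigned long long build_const64(void)
{
    unsigned long long r;
    r = 0x0123;                  /* MOV   Imm16, Rn */
    r = (r << 16) | 0x4567;      /* SHORI Imm16, Rn */
    r = (r << 16) | 0x89AB;      /* SHORI Imm16, Rn */
    r = (r << 16) | 0xCDEF;      /* SHORI Imm16, Rn */
    return r;                    /* r == 0x0123456789ABCDEF */
}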




In misc news:

Some compiler fiddling has now dropped the ".text" overhead (vs RV64G)
from 10% to 5%.

This was mostly in the form of adding dependency tracking logic to ASM
code (albeit in a form where it needs to use ".global" and ".extern"
statements for things to work correctly), and no longer giving it a free
pass.

This in turn allowed it to effectively cull some parts of the dynamic
typesystem runtime and a bunch of the Binary128 support code (shaving
roughly 14K off of the Doom build).

Does have a non-zero code impact (mostly in the form of requiring adding
".global" and ".extern" lines to the ASM code in some cases where they
were absent).


Looks like a fair chunk of the dynamic types runtime is still present
though, which appears to be culled in the GCC build (since GCC doesn't
use the dynamic typesystem at all). Theoretically, Doom should not need
it, as Doom is entirely "plain old C".

Main part that ended up culled with this change was seemingly most of
the code for ex-nihilo objects and similar (which does not seem to be
reachable from any of the Doom code).

There is a printf extension for printing variant types, but this is
still present in the RV64G build (this would mostly include code needed
for the "toString" operation). I guess, one could debate whether printf
actually needs support for variant types (as can be noted, most normal C
code will not use it).

Though, I guess one option could be to modify it to call toString via a
function pointer which is only set if other parts of the dynamic
typesystem are initialized (could potentially save several kB off the
size of the binary it looks like). Might break stuff though if one tries
to printf a variant but had not used any types much beyond fixnum and
flonum, which would not have triggered the typesystem to initialize itself.
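
A sketch of that structure (not BGBCC's actual code; names invented):
printf's variant path calls through a pointer that stays NULL until the
dynamic typesystem initializes itself, so plain C programs never drag
the toString machinery in:

typedef struct variant_s variant;   /* opaque here */

/* Set by the dynamic typesystem's init code; NULL otherwise. */
const char *(*variant_tostring_hook)(variant *) = 0;

static const char *variant_to_cstr(variant *v)
{
    if (variant_tostring_hook)
        return variant_tostring_hook(v);
    return "#<variant>";            /* typesystem never initialized */
}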

Probably doesn't matter too much, as this code is not likely a factor in
the delta between the ISAs.


Note that if the size of Doom's ".text" section dropped by another 15K,
it would reach parity with the RV64G build (which was around 290K in the
relevant build ATM; goal being to keep the code fairly close to parity
in this case, with the differences mostly allowed for ISA specific stuff).

Though, this is ignoring that roughly 11K of this delta are Jumbo
prefixes (so the delta in instruction count is now roughly 1.3% at the
moment); and RV64G has an additional 24K in its ".rodata" section
(beyond what could be accounted for in string literals and similar).


So, in terms of text+rodata (+strtab *), my stuff is smaller at the moment.

*: Where GCC rolls its string literals into '.rodata', vs BGBCC having a
dedicated section for string literals.

...
MitchAlsup1
2024-04-11 18:46:54 UTC
Reply
Permalink
Post by BGB
Post by Michael S
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUPIC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.
Post by BGB
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
MOV Imm16. Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
As compared to::

CALK Rd,Rs1,#imm64

Which takes 3 words (12 bytes) and executes in CALK cycles; the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
Post by BGB
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
BGB-Alt
2024-04-11 20:42:59 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Post by Michael S
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the DCache
so the overall hit rate goes up !! At typical sizes, ICache miss rate
is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer instructions.
Alternatively:: if you paste constants together (LUI, AUPIC) you have
no direct route to either 64-bit constants or 64-bit address spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.
Post by BGB
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
   MOV Imm16. Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
    CALK   Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding. Also technically could be retrofitted onto
RISC-V without any significant change, unlike some other options (as
noted, I don't argue for adding Jumbo prefixes to RV, on the basis
that there is no real viable way to add them to RV, *).

Sadly, the closest thing to a viable option for RV would be to add the
SHORI instruction and optionally pattern-match it in the fetch/decode.

Or, say:
LUI Xn, Imm20
ADD Xn, Xn, Imm12
SHORI Xn, Imm16
SHORI Xn, Imm16

Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
only if the Imm12 is positive), and 2x SHORI into a combined
"Xn=(Xn<<32)|Imm32" operation.

This could potentially get it down to 2 clock cycles.
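
As a C model of what the two fused forms would compute (my reading;
hi20/lo12 are the LUI/ADD immediates, imm16a/imm16b the two SHORI
immediates):

unsigned long long fused_const64(unsigned hi20, unsigned lo12,
                                 unsigned imm16a, unsigned imm16b)
{
    /* LUI+ADD fused into one 32-bit constant load (Imm12 assumed positive) */
    unsigned long long x = ((unsigned long long)hi20 << 12) + lo12;
    /* 2x SHORI fused into Xn = (Xn << 32) | Imm32 */
    unsigned long long imm32 = ((unsigned long long)imm16a << 16) | imm16b;
    return (x << 32) | imm32;
}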



*: To add a jumbo prefix, one needs an encoding that:
Uses up a really big chunk of encoding space;
Is otherwise illegal and unused.
RISC-V doesn't have anything here.


Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal
encoding space mostly because if I put instructions in there (with the
existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16 ops
or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).

Though, 14x of these spaces would likely be alternate forms of Jumbo
prefix (with another 14 in unconditional-scalar-op land). No immediate
need to re-add an equivalent of the 40x2 encoding (from Baseline mode),
as most of what 40x2 addressed can be encoded natively in XG2 Mode.


Technically, I also have 2 unused bits in the Imm16 ops as well in XG2
Mode. I "could" in theory, if I wanted, use them to extend the:
MOV Imm17s, Rn
Case, to:
MOV Imm19s, Rn
Though, the other option is to leave them reserved if I later want more
Imm16 ops.

For now, current plan is to leave this stuff as reserved.
Post by MitchAlsup1
Post by BGB
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
Can be done, but thus far only 33-bit immediate values. Luckily, Imm33s seems
to address around 99% of uses (for normal ALU ops and similar).

Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By
themselves though, the difference doesn't seem enough to justify the cost.

Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
12 bytes (and allowing a 16-byte encoding would have too steep of a cost
increase to be worthwhile).

So, alas...
MitchAlsup1
2024-04-11 23:06:05 UTC
Reply
Permalink
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Post by Michael S
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in the
Rs1 position of the instruction.
Post by BGB
The LDSH / SHORI mechanism does provide a way to get 64-bit constants,
and needs less encoding space than the LUI route.
   MOV Imm16, Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
    CALK   Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding.
While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
just simplest. Then, once you start widening the microarchitecture, it
is better to fetch wider than decode-issue so that you suffer least
from boundary conditions. Once you start fetching wide OR have wide
decode-issue, you have ALL the infrastructure to do variable length
instructions. Thus, the complaint that VLE is hard has already been
eradicated.
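
As a toy model of that boundary-finding step (the length bits below are
invented placeholders, not BJX2 or RISC-V encodings; in hardware the
per-parcel length would be computed for every parcel of the fetch window in
parallel, and the serial loop here only chains the results):

#include <stdint.h>
#include <stddef.h>

/* hypothetical rule: top bits of the first 16-bit parcel give the size */
static unsigned parcel_len(uint16_t p)
{
    if ((p & 0xE000) == 0xE000) return 3;   /* assumed 48-bit form */
    if ((p & 0x8000) == 0x8000) return 2;   /* assumed 32-bit form */
    return 1;                               /* 16-bit form         */
}

/* mark the start offsets of the instructions inside one fetch window */
static size_t find_starts(const uint16_t *win, size_t nparcels,
                          size_t *starts, size_t max_out)
{
    size_t n = 0;
    for (size_t i = 0; i < nparcels && n < max_out; i += parcel_len(win[i]))
        starts[n++] = i;
    return n;
}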
Post by BGB-Alt
Also technically could be retrofitted onto
RISC-V without any significant change, unlike some other options (as
noted, I don't argue for adding Jumbo prefixes to RV under the basis
that there is no real viable way to add them to RV, *).
The issue is that once you do VLE, RISC-V's ISA is no longer helping you
get the job done, especially when you have to execute 40% more instructions.
Post by BGB-Alt
Sadly, the closest option to viable for RV would be to add the SHORI
instruction and optionally pattern match it in the fetch/decode.
LUI Xn, Imm20
ADD Xn, Xn, Imm12
SHORI Xn, Imm16
SHORI Xn, Imm16
Then, combine LUI+ADD into a 32-bit load in the decoder (though probably
only if the Imm12 is positive), and 2x SHORI into a combined
"Xn=(Xn<<32)|Imm32" operation.
This could potentially get it down to 2 clock cycles.
Universal constants gets this down to 0 cycles......
Post by BGB-Alt
Uses up a really big chunk of encoding space;
Is otherwise illegal and unused.
RISC-V doesn't have anything here.
Which is WHY you should not jump ship from SH to RV, but jump to an
ISA without these problems.
Post by BGB-Alt
Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal
encoding space mostly because if I put instructions in there (with the
existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16 ops
or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).
Just another reason not to stay with what you have developed.

In comparison, I reserve 6 major OpCodes so that a control transfer into
data is highly likely to get Undefined OpCode exceptions rather than
try to execute what is in that data. Then, as it is, I still have 21 slots
in the major OpCode group free (27 if you count the permanently reserved).

Much of this comes from side effects of Universal Constants.
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
Can be done, but thus far only 33-bit immediate values. Luckily, Imm33s seems
to address around 99% of uses (for normal ALU ops and similar).
What do you do when accessing data that the linker knows is more than 4GB
away from IP ?? or known to be outside of 0-4GB ?? externs, GOT, PLT, ...
Post by BGB-Alt
Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly allowed the Imm57s case. By
themselves though, the difference doesn't seem enough to justify the cost.
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
Post by BGB-Alt
Don't have enough bits in the encoding scheme to pull off a 3RI Imm64 in
12 bytes (and allowing a 16-byte encoding would have too steep of a cost
increase to be worthwhile).
And yet I did.
Post by BGB-Alt
So, alas...
Yes, alas..........
BGB
2024-04-12 01:07:08 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Post by Michael S
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Post by BGB
The LDSH / SHORI mechanism does provide a way to get 64-bit
constants, and needs less encoding space than the LUI route.
   MOV Imm16, Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
   SHORI Imm16, Rn
Granted, if each is a 1-cycle instruction, this still takes 4 clock cycles.
     CALK   Rd,Rs1,#imm64
Which takes 3 words (12 bytes) and executes in CALK cycles, the loading
of the constant is free !! (0 cycles) !! {{The above example uses at least
5 cycles to use the loaded/built constant.}}
The main reason one might want SHORI is that it can fit into a
fixed-length 32-bit encoding.
While 32-bit encoding is RISC mantra, it has NOT been shown to be best,
just simplest. Then, once you start widening the microarchitecture, it
is better to fetch wider than decode-issue so that you suffer least from
boundary conditions. Once you start fetching wide OR have wide
decode-issue, you have ALL the infrastructure to do variable length
instructions. Thus, the complaint that VLE is hard has already been
eradicated.
As noted, BJX2 is effectively VLE.
Just now split into two sub-variants.

So, as for lengths:
Baseline: 16/32/64/96
XG2: 32/64/96
Original version was 16/32/48.


But, the original 48-bit encoding was dropped, mostly to make the rest
of the encoding more orthogonal, and these were replaced with Jumbo
prefixes. An encoding space exists where 48-bit ops could in theory be
re-added to Baseline, but I have not done so, as it does not seem to be
justifiable in a cost/benefit sense (and would still have some of the
same drawbacks as the original 48-bit ops).

Had also briefly experimented with 24-bit ops, but these were quickly
dropped due to "general suckage" (though, an alternate 16/24/32/48
encoding scheme could have theoretically given better code-density).


However, RISC-V is either 32-bit, or 16/32.

For now, I am not bothering with the 16-bit C extension, not so much for
sake of difficulty of dealing with VLE (the core can already deal with
VLE), but more because the 'C' encodings are such a dog-chewed mess that
I don't feel terribly inclined to bother with them.


But, like, I can't really compare BJX2 Baseline with RV64G in terms of
code density, because this wouldn't be a fair comparison. Would need to
compare code-density between Baseline and RV64GC, which would imply
needing to actually support the C extension.

I could already claim a "win" here if I wanted, but as I see it, doing
so would not be valid.


Theoretically, encoding space exists for bigger ops in RISC-V, but no
one has defined ops there yet as far as I know. Also, the way RISC-V
represents larger ops is very different.

However, comparing fixed-length against VLE when the VLE only has larger
instructions, is still acceptable as I see it (even if larger
instructions can still allow a more compact encoding in some cases).


Say, for example, as I see it, SuperH vs Thumb2 would still be a fair
comparison, as would Thumb2 vs RV32GC, but Thumb2 vs RV32G would not.

Unless one only cares about "absolute code density" irrespective of
keeping parity in terms of feature-set.
Post by MitchAlsup1
Post by BGB-Alt
                              Also technically could be retrofitted
onto RISC-V without any significant change, unlike some other options
(as noted, I don't argue for adding Jumbo prefixes to RV under the
basis that there is no real viable way to add them to RV, *).
The issue is that once you do VLE, RISC-V's ISA is no longer helping you
get the job done, especially when you have to execute 40% more instructions.
Yeah.

As noted, I had already been beating RISC-V in terms of performance,
only there was a shortfall in terms of ".text" size (for the XG2 variant).


Initially this was around a 16% delta, now down to around 5%. Nearly all
of the size reduction thus far, has been due to fiddling with stuff in
my compiler.

In theory, BJX2 (XG2) should be able to win in terms of code-density, as
the only cases where RISC-V has an advantage do not appear to be
statistically significant.


As also noted, I am using "-ffunction-sections" and similar (to allow
GCC to prune unreachable functions), otherwise there is "no contest"
(easier to win against 540K than 290K...).
Post by MitchAlsup1
Post by BGB-Alt
Sadly, the closest option to viable for RV would be to add the SHORI
instruction and optionally pattern match it in the fetch/decode.
   LUI Xn, Imm20
   ADD Xn, Xn, Imm12
   SHORI Xn, Imm16
   SHORI Xn, Imm16
Then, combine LUI+ADD into a 32-bit load in the decoder (though
probably only if the Imm12 is positive), and 2x SHORI into a combined
"Xn=(Xn<<32)|Imm32" operation.
This could potentially get it down to 2 clock cycles.
Universal constants gets this down to 0 cycles......
Possibly.
Post by MitchAlsup1
Post by BGB-Alt
   Uses up a really big chunk of encoding space;
   Is otherwise illegal and unused.
RISC-V doesn't have anything here.
Which is WHY you should not jump ship from SH to RV, but jump to an
ISA without these problems.
Of the options that were available at the time:
  SuperH: Simple encoding and decent code density;
  RISC-V: Seemed like it would have had worse code density.
    Though, it seems that RV beats SH in this area.
  Thumb: Uglier encoding and some more awkward limitations vs SH.
    Also, condition codes, etc.
  Thumb2: Was still patent-encumbered at the time.
  PowerPC: Bleh.
  ...


The main reason for RISC-V support is not due to "betterness", but
rather because RISC-V is at least semi-popular (and not as bad as I
initially thought, in retrospect).
Post by MitchAlsup1
Post by BGB-Alt
Ironically, in XG2 mode, I still have 28x 24-bit chunks of encoding
space that aren't yet used for anything, but aren't usable as normal
encoding space mostly because if I put instructions in there (with the
existing encoding schemes), I couldn't use all the registers (and they
would not have predication or similar either). Annoyingly, the only
types of encodings that would fit in there at present are 2RI Imm16
ops or similar (or maybe 3R 128-bit SIMD ops, where these ops only use
encodings for R0..R31 anyways, interpreting the LSB of the register
field as encoding R32..R63).
Just another reason not to stay with what you have developed.
In comparison, I reserve 6-major OpCodes so that a control transfer into
data is highly likely to get Undefined OpCode exceptions rather than a
try to execute what is in that data. Then, as it is, I still have 21-slots
in the major OpCode group free (27 if you count the permanently reserved).
Much of this comes from side effects of Universal Constants.
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
An encoding that can MOV a 64-bit constant in 96-bits (12 bytes) and
1-cycle, is preferable....
A consuming instruction where you don't even use a register is better
still !!
Can be done, but thus far only 33-bit immediate values. Luckily, Imm33s
seems to address around 99% of uses (for normal ALU ops and similar).
What do you do when accessing data that the linker knows is more than
4GB away from IP ?? or known to be outside of 0-4GB ?? externs, GOT,
PLT, ...
Post by BGB-Alt
Had considered allowing an Imm57s case for SIMD immediates (4x S.E5.F8
or 2x S.E8.F19), which would have indirectly allowed the Imm57s case.
By themselves though, the difference doesn't seem enough to justify
the cost.
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically significant
enough to have a meaningful impact on performance.

Fraction of a percent edge-cases are not deal-breakers, as I see it.



I do at least have some confidence that my stuff can be made usable on
affordable FPGAs.

Some of the stuff you argue for, I don't feel is viable on this class of
hardware.


Like, the challenge would be to, say, make a soft-processor and fit all
of the stuff you are arguing for into an XC7S50 or similar (say, on one
of the Arty boards or something).

Or, some other sub $400 or so FPGA board (that can be targeted with the
free version of Vivado or similar...).
Something like a Lattice ECP5 is probably OK.

Though, Cyclone-V or Zynq is probably not, too much room for "cheating"
there by leveraging the ARM cores...
Post by MitchAlsup1
Post by BGB-Alt
Don't have enough bits in the encoding scheme to pull off a 3RI Imm64
in 12 bytes (and allowing a 16-byte encoding would have too steep of a
cost increase to be worthwhile).
And yet I did.
I am not saying it is impossible, only that I can't pull it off with my
existing encoding.


I guess it could be possible if I burnt all of the remaining encoding
bits on it (effectively 27-bit jumbo prefixes, + the WI bit in the final
instruction).

This would preclude using these bits for anything else though.
Debatable if it is "worth it".
Post by MitchAlsup1
Post by BGB-Alt
So, alas...
Yes, alas..........
MitchAlsup1
2024-04-12 01:40:27 UTC
Reply
Permalink
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically significant
enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
Idle speculation::

.globl r8_erf ; -- Begin function r8_erf
.type r8_erf,@function
r8_erf: ; @r8_erf
; %bb.0:
add sp,sp,#-128
std #4614300636657501161,[sp,88] // a[0]
std #4645348406721991307,[sp,104] // a[2]
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
; %bb.1:
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
; %bb.2:
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
; %bb.3:
fmul r3,r1,r1
fdiv r3,#1,r3
mov r4,#0x3F90B4FB18B485C7 // p[5]
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header: Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
; %bb.5:
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
LBB141_7:
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
; %bb.9:
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
LBB141_11:
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
Lfunc_end141:
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
BGB
2024-04-12 18:12:28 UTC
Reply
Permalink
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
These patterns seem rather unusual...
Don't really know the ABI.

Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants; in particular, ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar are rare).
.globl r8_erf ; -- Begin function r8_erf
add sp,sp,#-128
ADD -128, SP
std #4614300636657501161,[sp,88] // a[0]
MOV 0x400949FB3ED443E9, R3
MOV.Q R3, (SP, 88)
std #4645348406721991307,[sp,104] // a[2]
MOV 0x407797C38897528B, R3
MOV.Q R3, (SP, 104)
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
... pattern is obvious enough.
Each constant needs 12 bytes, so 16 bytes/store.
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
FABS R5, R6
FLDH 0x3780, R3 //A
FCMPGT R3, R6 //A
BT .LBB141_6 //A

Or (FP-IMM extension):

FABS R5, R6
FCMPGE 0x0DE, R6 //B (FP-IMM)
BF .LBB141_6 //B
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
FCMPGE 0x110, R6
BF .LBB141_7
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
MOV 0x403A8B020C49BA5E, R3
FCMPGT R3, R6
BT .LBB141_11

Where, FP-IMM won't work with that value.
fmul r3,r1,r1
FMUL R5, R5, R7
fdiv r3,#1,r3
Skip, operation gives identity?...
mov r4,#0x3F90B4FB18B485C7 // p[5]
Similar.
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
Turns into 4 constants, 7 FPU instructions (if no FMAC extension, 4 with
FMAC). Though, at present, FMAC is slower than separate FMUL+FADD.

So, between 8 and 11 instructions.
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
These can map 1:1.
LBB141_4: ; %._crit_edge11
; =>This Inner Loop Header: Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
Could be mapped to a scalar loop, pretty close to 1:1.

Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
guessing 2 copies for a 4-element vector?...
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
Same patterns as before.
Would need ~ 10 ops.

Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
but this would only give "Binary32 truncated to 29 bits" precision for
the immediate values.

Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
give ~ 53 bits of precision. But, if one needs full Binary64, this will
not gain much in this case.
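
For what it's worth, the natural reading of "Binary32 truncated to 29 bits"
can be sketched as follows; the exact field placement is an assumption here,
not the actual BJX2 encoding:

#include <stdint.h>
#include <string.h>

/* Sketch: the 29-bit immediate is taken as the top 29 bits of an IEEE
   Binary32 pattern, the dropped low fraction bits are zero, and the result
   is widened exactly to Binary64. */
static double expand_fpimm29(uint32_t imm29)
{
    uint32_t bits32 = imm29 << 3;     /* restore the Binary32 bit layout */
    float    f;
    memcpy(&f, &bits32, sizeof f);    /* reinterpret as Binary32         */
    return (double)f;                 /* exact widening to Binary64      */
}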
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
LBB141_8: ; =>This Inner Loop Header: Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
Don't really have time at the moment to comment on the rest of this...


In other news, found a bug in the function dependency-walking code.

Fixing this bug got things a little closer to break-even with RV64G GCC
output regarding ".text" size (though it was still not sufficient to
entirely close the gap).


This was mostly based on noting that the compiler output had included
some things that were not reachable from within the program being
compiled (namely, noticing that the Doom build had included a copy of
the MS-CRAM video decoder and similar, which was not reachable from
anywhere within Doom).

Some more analysis may be needed.

...
MitchAlsup1
2024-04-12 23:46:33 UTC
Reply
Permalink
Post by BGB
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
These patterns seem rather unusual...
Don't really know the ABI.
Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants, and particular ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar, are rare).
.globl r8_erf ; -- Begin function r8_erf
add sp,sp,#-128
ADD -128, SP
std #4614300636657501161,[sp,88] // a[0]
MOV 0x400949FB3ED443E9, R3
MOV.Q R3, (SP, 88)
std #4645348406721991307,[sp,104] // a[2]
MOV 0x407797C38897528B, R3
MOV.Q R3, (SP, 104)
std #4659275911028085274,[sp,112] // a[3]
std #4595861367557309218,[sp,120] // a[4]
std #4599171895595656694,[sp,40] // p[0]
std #4593699784569291823,[sp,56] // p[2]
std #4580293056851789237,[sp,64] // p[3]
std #4559215111867327292,[sp,72] // p[4]
std #4580359811580069319,[sp,80] // p[4]
std #4612966212090462427,[sp] // q[0]
std #4602930165995154489,[sp,16] // q[2]
std #4588882433176075751,[sp,24] // q[3]
std #4567531038595922641,[sp,32] // q[4]
.... pattern is obvious enough.
Each constant needs 12 bytes, so 16 bytes/store.
But 2 instructions instead of 1 and 16 bytes instead of 12.
Post by BGB
fabs r2,r1
fcmp r3,r2,#0x3EF00000 // thresh
bnlt r3,.LBB141_6
FABS R5, R6
FLDH 0x3780, R3 //A
FCMPGT R3, R6 //A
BT .LBB141_6 //A
FABS R5, R6
FCMPGE 0x0DE, R6 //B (FP-IMM)
BF .LBB141_6 //B
fcmp r3,r2,#4 // xabs <= 4.0
bnlt r3,.LBB141_7
FCMPGE 0x110, R6
BF .LBB141_7
fcmp r3,r2,#0x403A8B020C49BA5E // xbig
bngt r3,.LBB141_11
MOV 0x403A8B020C49BA5E, R3
FCMPGT R3, R6
BT .LBB141_11
Where, FP-IMM won't work with that value.
Value came from source code.
Post by BGB
fmul r3,r1,r1
FMUL R5, R5, R7
fdiv r3,#1,r3
Skip, operation gives identity?...
It is a reciprocal: R3 = #1.0/R3
Post by BGB
mov r4,#0x3F90B4FB18B485C7 // p[5]
Similar.
fmac r4,r3,r4,#0x3FD38A78B9F065F6 // p[0]
fadd r5,r3,#0x40048C54508800DB // q[0]
fmac r6,r3,r4,#0x3FD70FE40E2425B8 // p[1]
fmac r4,r3,r5,#0x3FFDF79D6855F0AD // q[1]
Turns into 4 constants, 7 FPU instructions (if no FMAC extension, 4 with
FMAC). Though, at present, FMAC is slower than separate FMUL+FADD.
So, between 8 and 11 instructions.
Instead of 4.....
Post by BGB
fmul r4,r3,r4
fmul r6,r3,r6
mov r5,#2
add r7,sp,#40 // p[*]
add r8,sp,#0 // q[*]
These can map 1:1.
LBB141_4: ; %._crit_edge11
Depth=1
vec r9,{r4,r6}
ldd r10,[r7,r5<<3,0] // p[*]
ldd r11,[r8,r5<<3,0] // q[*]
fadd r6,r6,r10
fadd r4,r4,r11
fmul r4,r3,r4
fmul r6,r3,r6
loop ne,r5,#4,#1
Could be mapped to a scalar loop, pretty close to 1:1.
I have 7 instructions in the loop, you would have 9.
Post by BGB
Could possibly also be mapped over to 2x Binary64 SIMD ops, I am
guessing 2 copies for a 4-element vector?...
fadd r5,r6,#0x3F4595FD0D71E33C // p[4]
fmul r3,r3,r5
fadd r4,r4,#0x3F632147A014BAD1 // q[4]
fdiv r3,r3,r4
fadd r3,#0x3FE20DD750429B6D,-r3 // c[0]
fdiv r3,r3,r2
br .LBB141_10 // common tail
Same patterns as before.
Would need ~ 10 ops.
Well, could be expressed with fewer ops via jumbo-prefixed FP-IMM ops,
but this would only give "Binary32 truncated to 29 bits" precision for
the immediate values.
Theoretically, could allow an FE-FE-F0 encoding for FP-IMM, which could
give ~ 53 bits of precision. But, if one needs full Binary64, this will
not gain much in this case.
LBB141_6: ; %._crit_edge
fmul r3,r1,r1
fcmp r2,r2,#0x3C9FFE5AB7E8AD5E // xsmall
sra r2,r2,<1:13>
cvtsd r4,#0
mux r2,r2,r3,r4
mov r3,#0x3FC7C7905A31C322 // a[4]
fmac r3,r2,r3,#0x400949FB3ED443E9 // a[0]
fmac r3,r2,r3,#0x405C774E4D365DA3 // a[1]
ldd r4,[sp,104] // a[2]
fmac r3,r2,r3,r4
fadd r4,r2,#0x403799EE342FB2DE // b[0]
fmac r4,r2,r4,#0x406E80C9D57E55B8 // b[1]
fmac r4,r2,r4,#0x40940A77529CADC8 // b[2]
fmac r3,r2,r3,#0x40A912C1535D121A // a[3]
fmul r1,r3,r1
fmac r2,r2,r4,#0x40A63879423B87AD // b[3]
fdiv r2,r1,r2
mov r1,r2
add sp,sp,#128
ret // 68
fmul r3,r2,#0x3E571E703C5F5815 // c[8]
mov r5,#0
mov r4,r2
Depth=1
vec r6,{r3,r4}
ldd r7,[ip,r5<<3,.L__const.r8_erf.c]// c[*]
fadd r3,r3,r7
fmul r3,r2,r3
ldd r7,[ip,r5<<3,.L__const.r8_erf.d]// d[*]
fadd r4,r4,r7
fmul r4,r2,r4
loop ne,r5,#7,#1
fadd r3,r3,#0x4093395B7FD2FC8E // c[7]
fadd r4,r4,#0x4093395B7FD35F61 // d[7]
fdiv r3,r3,r4
LBB141_10: // common tail
fmul r4,r2,#0x41800000 // 16.0
fmul r4,r4,#0x3D800000 // 1/16.0
cvtds r4,r4 // (signed)double
cvtsd r4,r4 // (double)signed
fadd r5,r2,-r4
fadd r2,r2,r4
fmul r4,r4,-r4
fexp r4,r4 // exp()
fmul r2,r2,-r5
fexp r2,r2 // exp()
fmul r2,r4,r2
fadd r2,#0,-r2
fmac r2,r2,r3,#0x3F000000 // 0.5
fadd r2,r2,#0x3F000000 // 0.5
pflt r1,0,T
fadd r2,#0,-r2
mov r1,r2
add sp,sp,#128
ret
fcmp r1,r1,#0
sra r1,r1,<1:13>
cvtsd r2,#-1 // (double)-1
cvtsd r3,#1 // (double)+1
mux r2,r1,r3,r2
mov r1,r2
add sp,sp,#128
ret
.size r8_erf, .Lfunc_end141-r8_erf
; -- End function
Don't really have time at the moment to comment on the rest of this...
In other news, found a bug in the function dependency-walking code.
Fixing this bug got things a little closer to break-even with RV64G GCC
output regarding ".text" size (though, was still not sufficient to
entirely close the gap).
This was mostly based on noting that the compiler output had included
some things that were not reachable from within the program being
compiled (namely, noticing that the Doom build had included a copy of
the MS-CRAM video decoder and similar, which was not reachable from
anywhere within Doom).
Some more analysis may be needed.
....
MitchAlsup1
2024-04-13 03:17:43 UTC
Reply
Permalink
Post by BGB
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
These patterns seem rather unusual...
Don't really know the ABI.
Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants, and particular ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar, are rare).
You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
BGB
2024-04-13 06:12:53 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
These patterns seem rather unusual...
Don't really know the ABI.
Patterns don't really fit observations for typical compiler output
though (mostly in the FP constants, and particular ones that fall
outside the scope of what can be exactly represented as Binary16 or
similar, are rare).
You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
Some stats I have (for GLQuake):
14.9% of constants are floating-point.
10.99% are FP and can be expressed exactly as Binary16.
7.3% as Fp5 (E3.F2)
9.5% as Fp10 (S.E5.F6)
1.3% can be expressed in Binary32
2.7% need Binary64.

If scaled so that this is only FP constants:
73.0% are Binary16
8.7% are Binary32
18.1% are Binary64



Granted, this is inexact, as the stat is based on pattern recognition
rather than type. However, given that for Doom the total percentage of
constants flagged as FP drops to around 1%, probably not too far off.
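
The kind of check implied by that stat can be written compactly; this is
only a sketch (normal values only, with subnormals glossed over), not the
pattern-recognition code actually in BGBCC:

#include <math.h>
#include <stdbool.h>

/* true if v could be encoded exactly as an IEEE Binary16 value */
static bool fits_binary16(double v)
{
    if (v == 0.0)   return true;
    if (isnan(v))   return false;
    if (isinf(v))   return true;
    int e;
    double m = frexp(fabs(v), &e);      /* v = m * 2^e, with 0.5 <= m < 1       */
    if (e < -13 || e > 16)              /* outside the half normal exponent range */
        return false;
    double s = ldexp(m, 11);            /* 11 significant bits incl. hidden one */
    return s == floor(s);               /* fraction fits in 10 stored bits      */
}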


So, here, it seems it is common enough that the ability to load it into
a register in 1 cycle is worthwhile, but not so much that I am all that
worried about needing to spend an instruction to do so.

More so, when the 1 cycle spent on the constant load is overshadowed by
the 6 cycles it takes to do a Binary64 FADD or FMUL (faster only exists
for low-precision ops, or for Binary16/Binary32 SIMD).


Can also note, for integer immediate values:
  3RI Imm9un: 97% hit-rate
    2% turn into Jumbo Imm33s
    1% require a separate constant
  2RI Imm10un: 94% hit-rate
    4.4% turn into 2RI Imm16
    1.5% turn into Jumbo Imm33s
    0.1% require a separate constant
  Ld/St Disp9u: 96.4% hit-rate
    0.18% are negative
    3.42% turn into Jumbo Disp33s
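
Roughly, the classification behind those hit-rate numbers looks like the
sketch below (an illustration only; whether the 16-bit form is signed here
is an assumption, and the actual BGBCC logic surely differs):

#include <stdint.h>

enum imm_class { IMM9U, IMM10U, IMM16S, IMM33S, IMM_FULL };

static enum imm_class classify_imm(int64_t v)
{
    if (v >= 0 && v < (1 << 9))              return IMM9U;    /* 3RI Imm9un        */
    if (v >= 0 && v < (1 << 10))             return IMM10U;   /* 2RI Imm10un       */
    if (v >= -(1 << 15) && v < (1 << 15))    return IMM16S;   /* 2RI Imm16         */
    if (v >= -((int64_t)1 << 32) &&
        v <  ((int64_t)1 << 32))             return IMM33S;   /* Jumbo Imm33s      */
    return IMM_FULL;                                          /* separate constant */
}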


For RISC-V, the Imm12s case does result in a better hit rate for the
basic instructions, albeit the fallback case is worse (LUI+ADD or a
memory load).

Whereas, in my case, it is more a question of whether it ends up better
to load the immediate into a register or to use a jumbo prefix (where
the compiler may look forward and make a guess).
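
A hypothetical sketch of that kind of look-ahead guess (names and thresholds
invented, not BGBCC's code): if a large constant is reused within the next
few IR ops, materialize it in a register once; otherwise keep it inline as a
jumbo-prefixed immediate.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

struct ir_op { bool has_imm; uint64_t imm; };

/* decide at op i whether to load the constant into a register */
static bool should_materialize(const struct ir_op *ops, size_t n,
                               size_t i, size_t window, int min_reuse)
{
    int reuse = 0;
    for (size_t j = i + 1; j < n && j <= i + window; j++)
        if (ops[j].has_imm && ops[j].imm == ops[i].imm)
            reuse++;
    return reuse >= min_reuse;   /* enough nearby reuse to pay for the register */
}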



In a world where I could have the ability to directly store constants to
memory, or glue full 64 bit constants onto any instruction, etc, it
doesn't seem likely that this would have all that large of an impact on
either program size or performance.





Though, at the moment (in ongoing compiler fiddling), I am not seeing
much more evidence of unrelated / unreachable code being included in the
binary. Seems like this optimization case may be used up.

This leaves roughly another 4% remaining...

I guess, will see how much of a fight this last 4% puts up...


Though, it looks like I could in theory shave several kB off, mostly by
disabling the memcpy slide and making the limit for inline memcpy
smaller and similar, but this comes at a performance cost (needs to go
the slower route of "actually calling memcpy()" in more cases...). Will
continue to look for other options.
BGB-Alt
2024-04-15 21:14:24 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Post by BGB
Post by MitchAlsup1
While I admit that <basically> anything bigger than 50-bits will be fine
as displacements, they are not fine for constants and especially FP
constants and many bit twiddling constants.
The number of cases where this comes up is not statistically
significant enough to have a meaningful impact on performance.
Fraction of a percent edge-cases are not deal-breakers, as I see it.
These patterns seem rather unusual...
Don't really know the ABI.
The patterns don't really fit my observations of typical compiler output
though (mostly regarding the FP constants; in particular, constants that
fall outside what can be exactly represented as Binary16 or similar are
rare).
You are N E V E R going to find the coefficients of a Chebyshev
polynomial to fit in a small FP container; excepting the very
occasional C0 or C1 term {which are mostly 1.0 and 0.0}
FWIW:
I went and was able to find some odd corners to fit full Binary64
encodings into, but currently only in XG2 Mode (there will be
equivalents in Baseline, albeit effectively truncated to 56 bits).

So, went and burned the remaining bits in the Jumbo prefixes in XG2
Mode, which now allows:
ADD/SUB/AND/OR/XOR, and FADD/FMUL with full 64-bit immediate values.

Ended up going and burning the MULS.L/MULU.L and ADDS.L/ADDU.L encodings
on FADD and FMUL, so:
FEii-iiii FEii-iiii F2nm-2Gii FMUL Rm, Imm56f, Rn
FEii-iiii FEii-iiii F2nm-3Gii FADD Rm, Imm56f, Rn

With Imm56f encoded like Imm57s (extended to 64-bit), but:
Imm(55: 0) goes into (63: 8)
Imm(62:56) goes into ( 7: 1)
Imm( 63) goes into ( 0)
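As a rough C sketch (my own illustration of one reading of that mapping;
the field/value direction is my assumption), the decode side is just a bit
shuffle:

    #include <stdint.h>

    // Hypothetical decode helper: rebuild the Binary64 bit pattern from a
    // 64-bit immediate field laid out as described above.
    static inline uint64_t imm56f_decode(uint64_t imm)
    {
        uint64_t hi  = (imm & 0x00FFFFFFFFFFFFFFULL) << 8;  // Imm(55:0)  -> bits 63:8
        uint64_t mid = ((imm >> 56) & 0x7F) << 1;           // Imm(62:56) -> bits 7:1
        uint64_t lsb = (imm >> 63) & 1;                     // Imm(63)    -> bit 0
        return hi | mid | lsb;    // memcpy() the result into a double to use it
    }

Read this way, a Baseline form truncated to 56 bits would still carry the
sign, exponent, and top of the mantissa, which lines up with the truncation
note above.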

Effects on code-size and performance: Minimal.
Saves roughly 600 bytes from GLQuake;
No real effect on the size of Doom.



As for my fiddling with getting ".text" size down (testing mostly with
Doom):
XG2 : 293K
RV64G: 285K

Of this, there is still effectively 10K in Jumbo prefixes, and RV64G has
an additional 24K in ".rodata".

This implies, at the moment, XG2 instruction count is ~ 1% less than
RV64G, albeit text slightly larger due to average instruction size being
slightly bigger (~ 4.14 bytes).


Also noted that the BGBCC compiler output has ~ 40K worth of MOV RegReg
instructions, so MOV RegReg is around 14% of the total size of the
binary (I suspect a big chunk of the relative size reduction of Baseline
may be due to having a 16-bit encoding for this).

So, one ongoing goal on the compiler optimization front would be
continuing to work towards a reduction in the number of MOV instructions.

...
Michael S
2024-04-11 23:19:04 UTC
Reply
Permalink
On Thu, 11 Apr 2024 18:46:54 +0000
Post by MitchAlsup1
Post by Michael S
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the
DCache so the overall hit rate goes up !! At typical sizes,
ICache miss rate is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer
instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you
have no direct route to either 64-bit constants or 64-bit address
spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Maybe. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
It was 30 years ago and my memory is not what it used to be.
Scott Lurndal
2024-04-12 13:40:01 UTC
Reply
Permalink
Post by Michael S
On Thu, 11 Apr 2024 18:46:54 +0000
Post by MitchAlsup1
Post by Michael S
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Maybe. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
ARM has LDADD - negate one argument and it becomes a subtract.
Michael S
2024-04-12 15:08:33 UTC
Reply
Permalink
On Fri, 12 Apr 2024 13:40:01 GMT
Post by Scott Lurndal
Post by Michael S
On Thu, 11 Apr 2024 18:46:54 +0000
Post by MitchAlsup1
Post by Michael S
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it
depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Maybe. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
ARM has LDADD - negate one argument and it becomes a subtract.
ARM LDADD is not a LD-OP instruction. It is RMW.
MitchAlsup1
2024-04-15 19:03:34 UTC
Reply
Permalink
Post by Michael S
On Thu, 11 Apr 2024 18:46:54 +0000
Post by MitchAlsup1
Post by Michael S
On Wed, 10 Apr 2024 23:30:02 +0000
Post by MitchAlsup1
Post by Scott Lurndal
It does occupy some icache space, however; have you boosted the
icache size to compensate?
The space occupied in the ICache is freed up from being in the
DCache so the overall hit rate goes up !! At typical sizes,
ICache miss rate is about ¼ the miss rate of DCache.
Besides:: if you had to LD the constant from memory, you use a LD
instruction and 1 or 2 words in DCache, while consuming a GPR. So,
overall, it takes fewer cycles, fewer GPRs, and fewer
instructions.
Alternatively:: if you paste constants together (LUI, AUIPC) you
have no direct route to either 64-bit constants or 64-bit address
spaces.
It looks to be a win-win !!
Win-win under constraints of Load-Store Arch. Otherwise, it depends.
Never seen a LD-OP architecture where the inbound memory can be in
the Rs1 position of the instruction.
Maybe. But out of 6 major integer OPs it matters only for SUB.
By now I don't remember for sure, but I think that I had seen an LD-OP
architecture that had a SUBR instruction. Maybe the TI TMS320C30?
It was 30 years ago and my memory is not what it used to be.
That a SUBR instruction exists does not refute my statement that
the inbound memory reference was never in the Rs1 position.
Terje Mathisen
2024-04-11 10:22:47 UTC
Reply
Permalink
Post by Scott Lurndal
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
It does occupy some icache space, however; have you boosted the icache
size to compensate?
Except it pretty rarely does so (increases icache pressure):

mov temp_reg, offset const_table
mov reg,qword ptr [temp_reg+const_offset]

looks to me like at least 5 bytes for the first instruction and probably
6 for the second, for a total of 11 (could be as low as 8 for a very
small offset), all on top of the 8 bytes of dcache needed to hold the
64-bit value loaded.

In My 66000 this should be a single 32-bit instruction followed by the
8-byte const, so 12 bytes total and no lookaside dcache interference.

It is only when you do a lot of 64-bit data loads, all gathered in a
single 256-byte buffer holding up to 32 such values, and you can afford
to allocate a fixed register pointing to the middle of that range, that
you actually gain some total space: Each load can now just do a

mov reg,qword ptr [fixed_base_reg+byte_offset]

which, due to the need for a 64-bit prefix, will probably need 4
instruction bytes on top of the 8 bytes from dcache. At this point we
are touching exactly the same number of bytes (12) as My 66000, but from
two different caches, so much more likely to suffer dcache misses.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
BGB-Alt
2024-04-10 20:51:07 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Yeah.

This was why some of the first things I did when I started extending
SH-4 were:
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).


Did have a mess when I later extended the ISA to 32 GPRs, as (like with
BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
Post by MitchAlsup1
Post by BGB
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
I didn't design SuperH, Hitachi did...

All of this stuff was apparently sufficient for the SEGA
32X/Saturn/Dreamcast consoles... (vs the Genesis/MegaDrive using a
M68000, and Master System using a Z80).


I guess for a while it was also popular in CD-ROM and HDD controllers.
I guess after SEGA left the game-console market, they continued
using it for a while in arcade machines, before apparently later jumping
over to x86 via low-end PC motherboards (I guess since the mid/late 2000s it
has been cheaper to build an arcade machine with off-the-shelf PC parts).

Saw a video where a guy was messing with one of these, where I guess
despite being built with low-end PC parts (and an entry-level graphics
card), the parts were balanced well enough that it still gave fairly
decent gaming performance.



But, with BJX1, I had added Disp16 branches.

With BJX2, they were replaced with 20 bit branches. These have the merit
of being able to branch anywhere within a Doom or Quake sized binary.
Post by MitchAlsup1
Post by BGB
   MOV.W (PC, 4), R0  //load a 16-bit branch displacement
   BRA/F R0
   NOP    // delay slot
   .WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Yeah.
This sort of stuff created strong incentive for ISA redesign...

Granted, it is possible had I instead started with RISC-V instead of
SuperH, it is probable BJX2 wouldn't exist.


Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than RV32I
or similar. Turns out not really, as the penalty of the 16 bit ops was
needing almost twice as many on average.
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
But, at what cost...

I had generally avoided anything that will have required microcode or
shoving state-machines into the pipeline or similar.

Things like Load/Store-Multiple or
Post by MitchAlsup1
Post by BGB
For small copies, can encode them inline, but past a certain size this
becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop carries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).

Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...


For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
iteration or so to try to limit looping overhead.

Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.



For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).

Though, this is part of why my runtime library had added a
"_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
which can consolidate this rather than needing to do it one-off for each
LZ decoder (as I see it, it is a similar issue to not wanting code to
endlessly re-roll stuff for functions like memcpy or malloc/free, *).


*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.
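As a rough sketch of the _memlzcpy() idea (my own illustration, not the
actual BGBCC runtime code), assuming the destination may be overshot by a
few bytes and that dst > src, as it is when copying an LZ match forward:

    #include <stdint.h>
    #include <string.h>

    // Hypothetical helper in the spirit of _memlzcpy(): copy 'len' bytes of
    // an LZ match, allowed to run slightly past 'len'. Falls back to byte
    // copies when the match distance is too small for 8-byte chunks.
    static void lz_match_copy(uint8_t *dst, const uint8_t *src, size_t len)
    {
        if ((size_t)(dst - src) >= 8) {
            // Each chunk only reads bytes already written, so overlap is safe.
            for (size_t i = 0; i < len; i += 8)
                memcpy(dst + i, src + i, 8);
        } else {
            // Close overlap (RLE-like matches): copy byte by byte.
            for (size_t i = 0; i < len; i++)
                dst[i] = src[i];
        }
    }

The point being mostly that every LZ decoder can share one tuned copy
routine rather than re-rolling its own.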
Post by MitchAlsup1
Post by BGB
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
   128 bytes or less: Inline Ld/St sequence
   129 bytes to 512B: Slide
   Over 512B: Call "memcpy()" or similar.
    1-infinity: use MM instruction.
Yeah, but it makes the CPU logic more expensive.
Post by MitchAlsup1
Post by BGB
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
Does this remain sequentially consistent ??
Within a thread, it is fine.

Main wonk is that it does start copying from the high address first.
Presumably interrupts or similar won't be messing with application memory
mid memcpy.

The looping memcpy's generally work from low to high addresses though.
Post by MitchAlsup1
Post by BGB
Though, this is only used for fixed-size copies (or "memcpy()" when
value is constant).
   MOV.Q        (R5, 480), R20
   MOV.Q        (R5, 488), R21
   MOV.Q        (R5, 496), R22
   MOV.Q        (R5, 504), R23
   MOV.Q        R20, (R4, 480)
   MOV.Q        R21, (R4, 488)
   MOV.Q        R22, (R4, 496)
   MOV.Q        R23, (R4, 504)
   MOV.Q        (R5, 448), R20
   MOV.Q        (R5, 456), R21
   MOV.Q        (R5, 464), R22
   MOV.Q        (R5, 472), R23
   MOV.Q        R20, (R4, 448)
   MOV.Q        R21, (R4, 456)
   MOV.Q        R22, (R4, 464)
   MOV.Q        R23, (R4, 472)
....
   MOV.Q        (R5), R20
   MOV.Q        (R5, 8), R21
   MOV.Q        (R5, 16), R22
   MOV.Q        (R5, 24), R23
   MOV.Q        R20, (R4)
   MOV.Q        R21, (R4, 8)
   MOV.Q        R22, (R4, 16)
   MOV.Q        R23, (R4, 24)
   RTS
Duff's device in any other name.
More or less, though I think the idea of Duff's device is specifically
in the way that it abuses the while-loop and switch constructs.

This is basically just an unrolled slide.
So, where one branches into it, determines how much is copied.

For small-to-medium copies, the advantage is mostly that this avoids
looping overhead.
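A minimal C sketch of the distinction (assuming 8-byte units, and using a
switch as a stand-in for the computed branch; the real slide is unrolled
assembler working 32 bytes at a time, high addresses first):

    #include <stdint.h>

    // The entry point (the case selected) determines how much gets copied;
    // unlike Duff's device, there is no loop wrapped around the switch.
    static void copy_slide(uint64_t *dst, const uint64_t *src, int nwords)
    {
        switch (nwords) {
        case 4: dst[3] = src[3];   /* fall through */
        case 3: dst[2] = src[2];   /* fall through */
        case 2: dst[1] = src[1];   /* fall through */
        case 1: dst[0] = src[0];   /* fall through */
        case 0: break;
        }
    }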
MitchAlsup1
2024-04-10 21:19:20 UTC
Reply
Permalink
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of a
basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Yeah.
This was why some of the first things I did when I started extending
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).
My suggestion is that:: "Now that you have screwed around for a while,
Why not take that experience and do a new ISA without any of those
mistakes in it" ??
Post by BGB-Alt
Did have a mess when I later extended the ISA to 32 GPRs, as (like with
BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
Post by MitchAlsup1
Post by BGB
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
I didn't design SuperH, Hitachi did...
But you did not fix them en masse, and you complain about them
at least once a week. There comes a time when it takes less time
and less courage to do that big switch and clean up all that mess.
Post by BGB-Alt
But, with BJX1, I had added Disp16 branches.
With BJX2, they were replaced with 20 bit branches. These have the merit
of being able to branch anywhere within a Doom or Quake sized binary.
Post by MitchAlsup1
Post by BGB
   MOV.W (PC, 4), R0  //load a 16-bit branch displacement
   BRA/F R0
   NOP    // delay slot
   .WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Yeah.
This sort of stuff created strong incentive for ISA redesign...
Maybe consider now as the appropriate time to start.
Post by BGB-Alt
Granted, it is possible had I instead started with RISC-V instead of
SuperH, it is probable BJX2 wouldn't exist.
Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than RV32I
or similar. Turns out not really, as the penalty of the 16 bit ops was
needing almost twice as many on average.
My 66000 only requires 70% the instruction count of RISC-V,
Yours could too ................
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
But, at what cost...
You would not have to spend hours a week defending the indefensible !!
Post by BGB-Alt
I had generally avoided anything that will have required microcode or
shoving state-machines into the pipeline or similar.
Things as simple as IDIV and FDIV require sequencers.
But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!
Post by BGB-Alt
Things like Load/Store-Multiple or
If you like polluted ICaches..............
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
For small copies, can encode them inline, but past a certain size this
becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop carries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).
Consider that the predictor getting into the slide the first time
always mispredicts !!
Post by BGB-Alt
Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
yet a HW sequencer only has to avoid asserting a single byte write enable
once.
Post by BGB-Alt
For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
iteration or so to try to limit looping overhead.
On low end machines, you want to operate at cache port width,
On high end machines, you want to operate at cache line widths per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.
Post by BGB-Alt
Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.
What do you do when the SATA drive wants to write a whole page ??
Post by BGB-Alt
For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).
Though, this is part of why my runtime library had added a
"_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
which can consolidate this rather than needing to do it one-off for each
LZ decoder (as I see it, it is a similar issue to not wanting code to
endlessly re-roll stuff for functions like memcpy or malloc/free, *).
*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.
Post by MitchAlsup1
Post by BGB
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
   128 bytes or less: Inline Ld/St sequence
   129 bytes to 512B: Slide
   Over 512B: Call "memcpy()" or similar.
    1-infinity: use MM instruction.
Yeah, but it makes the CPU logic more expensive.
By what, 37-gates ??
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the last
bytes need to be handled externally prior to branching into the slide.
Does this remain sequentially consistent ??
Within a thread, it is fine.
What if a SATA drive is reading while you are writing !!
That is, DMA is no different than multi-threaded applications--except
DMA cannot perform locks.
Post by BGB-Alt
Main wonk is that it does start copying from the high address first.
Presumably interrupts or similar won't be messing with application memory
mid memcpy.
The only things wanting high-low access patterns are dumping stuff to the
stack. The fact you CAN get away with it most of the time is no excuse.
Post by BGB-Alt
The looping memcpy's generally work from low to high addresses though.
As does all string processing.
BGB
2024-04-11 03:14:33 UTC
Reply
Permalink
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Also the blob of constants needed to be within 512 bytes of the load
instruction, which was also kind of an evil mess for branch handling
(and extra bad if one needed to spill the constants in the middle of
a basic block and then branch over it).
In My 66000 case, the constant is the word following the instruction.
Easy to find, easy to access, no register pollution, no DCache pollution.
Yeah.
This was why some of the first things I did when I started extending
Adding mechanisms to build constants inline;
Adding Load/Store ops with a displacement (albeit with encodings
borrowed from SH-2A);
Adding 3R and 3RI encodings (originally Imm8 for 3RI).
My suggestion is that:: "Now that you have screwed around for a while,
Why not take that experience and do a new ISA without any of those
mistakes in it" ??
There was a reboot, it became BJX2.
This, of course, has developed some of its own hair...


Where, BJX1 was a modified SuperH, and BJX2 was a redesigned ISA design
that was "mostly backwards compatible" at the ASM level.

Granted, possibly I could have gone further, such as no longer having
the stack pointer in R15, but alas...


Though, in some areas, SH had features that I had dropped as well, such
as auto-increment addressing and delay slots.
Post by MitchAlsup1
Post by BGB-Alt
Did have a mess when I later extended the ISA to 32 GPRs, as (like
with BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.
Post by MitchAlsup1
Post by BGB
Usually they were spilled between basic-blocks, with the basic-block
needing to branch to the following basic-block in these cases.
Also 8-bit branch displacements are kinda lame, ...
Why do that to yourself ??
I didn't design SuperH, Hitachi did...
But you did not fix them en masse, and you complain about them
at least once a week. There comes a time when it takes less time
and less courage to do that big switch and clean up all that mess.
For the most part, BJX2 is using 20-bit branches for 32-bit ops.

Exceptions being the Compare-and-Branch, and Compare-Zero-and-Branch
ops, but this is mostly because there wasn't enough encoding space to
give them larger displacements.

BREQ.Q Rn, Disp11s
BREQ.Q Rm, Rn, Disp8s

There are Disp32s variants available, just that these involve using a
Jumbo prefix.
Post by MitchAlsup1
Post by BGB-Alt
But, with BJX1, I had added Disp16 branches.
With BJX2, they were replaced with 20 bit branches. These have the
merit of being able to branch anywhere within a Doom or Quake sized
binary.
Post by MitchAlsup1
Post by BGB
   MOV.W (PC, 4), R0  //load a 16-bit branch displacement
   BRA/F R0
   NOP    // delay slot
   .WORD $(Label - .L0)
Also kinda bad...
Can you say Yech !!
Yeah.
This sort of stuff created strong incentive for ISA redesign...
Maybe consider now as the appropriate time to start.
The above was for SuperH; this sort of thing is N/A for BJX2.

In this case, BJX2 can pull it off in a single instruction.


None the less, even with all this crap, the SuperH was still seen as
sufficient for the Sega 32X/Saturn/Dreamcast (and the Naomi and Hikaru
arcade machine boards, ...).

Though, it seems Sega later jumped ship from SuperH to using low-end x86
PC motherboards in later arcade machines.
Post by MitchAlsup1
Post by BGB-Alt
Granted, it is possible had I instead started with RISC-V instead of
SuperH, it is probable BJX2 wouldn't exist.
Though, at the time, the original thinking was that SuperH having
smaller instructions meant it would have better code density than
RV32I or similar. Turns out not really, as the penalty of the 16 bit
ops was needing almost twice as many on average.
My 66000 only requires 70% the instruction count of RISC-V,
Yours could too ................
At this point, I suspect the main issue for me not (entirely) beating
RV64G, is mostly compiler issues...


So, the ".text" section is still around 10% bigger, with some amount of
this being spent on Jumbo prefixes, and the rest due to cases where code
generation falls short.
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I have no high-level memory move/copy/set instructions.
Only loads/stores...
You have the power to fix it.........
But, at what cost...
You would not have to spend hours a week defending the indefensible !!
Post by BGB-Alt
I had generally avoided anything that will have required microcode or
shoving state-machines into the pipeline or similar.
Things as simple as IDIV and FDIV require sequencers.
But LDM, STM, MM require sequencers simpler than IDIV and FDIV !!
Not so much in my case.

IDIV and FDIV:
Feed inputs into Shift-Add unit;
Stall pipeline for a predefined number of clock cycles;
Grab result out of the other end (at which point, pipeline resumes).

In this case, the FDIV was based on noting that if one lets the
Shift-Add unit run for longer, it moves from doing an integer divide to
doing a fractional divide, so I could make it perform an FDIV merely by
feeding the mantissas into it (as two big integers) and doubling the
latency. Then glue on some extra logic to figure out the exponents and
pack/unpack Binary64, and, done.
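The fractional-divide observation can be illustrated in C (a sketch of the
idea, not the hardware): running the shift-subtract divider for N extra
steps is the same as pre-shifting the dividend by N bits, so the extra
quotient bits come out as fraction bits.

    #include <stdint.h>

    // 32 steps: ordinary integer quotient.
    static uint32_t div_int(uint32_t a, uint32_t b)  { return a / b; }

    // 64 steps: the same divide run 32 steps longer yields a/b as a 32.32
    // fixed-point value. For FDIV, the two 53-bit mantissas would be fed
    // through the same way, with exponent handling and Binary64 pack/unpack
    // bolted on separately.
    static uint64_t div_frac(uint32_t a, uint32_t b) { return ((uint64_t)a << 32) / b; }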


Not really the same thing at all...


Apart from it tending to get stomped every time one does an integer
divide, it could possibly also be used as an RNG, as it basically churns
over whatever random bits flow into it from the pipeline.
Post by MitchAlsup1
Post by BGB-Alt
Things like Load/Store-Multiple or
If you like polluted ICaches..............
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
For small copies, can encode them inline, but past a certain size
this becomes too bulky.
A copy loop makes more sense for bigger copies, but has a high
overhead for small to medium copy.
So, there is a size range where doing it inline would be too bulky,
but a loop carries an undesirable level of overhead.
All the more reason to put it (a highly useful unit of work) into an
instruction.
This is an area where "slides" work well, the main cost is mostly the
bulk that the slide adds to the binary (albeit, it is one-off).
Consider that the predictor getting into the slide the first time
always mispredicts !!
Possibly.

But, note that the paths headed into the slide are things like structure
assignment and "memcpy()" where the size is constant. So, in these
cases, the compiler already knows where it is branching.

So, say:
memcpy(dst, src, 512);
Gets compiled as, effectively:
MOV dst, R4
MOV src, R5
BSR __memcpy64_512_ua
Post by MitchAlsup1
Post by BGB-Alt
Which is why it is a 512B memcpy slide vs, say, a 4kB memcpy slide...
What if you only wanted to copy 63 bytes ?? Your DW slide fails miserably,
yet a HW sequencer only has to avoid asserting a single byte write enable
once.
Two strategies:
Compiler pads it to 64 bytes (typical for struct copy, where structs can
always be padded up to their natural alignment);
It emits the code for copying the last N bytes (modulo 32) and then
branches into the slide (typical for memcpy).


For variable memcpy, there is an extension:
_memcpyf(void *dst, void *src, size_t len);

Which is basically the "I don't care if it copies a little extra"
version (say, where it may pad the copy up to a multiple of 16 bytes).
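Roughly (a sketch of the contract as I read it, not the actual
implementation), assuming both buffers have enough slack that rounding the
length up is harmless:

    #include <stddef.h>
    #include <string.h>

    // Hypothetical illustration of the _memcpyf() contract: same as memcpy(),
    // but the length is padded up to a multiple of 16 so the copy never needs
    // a byte-granularity tail. The caller must tolerate the over-copy.
    static void *memcpyf_sketch(void *dst, const void *src, size_t len)
    {
        size_t padded = (len + 15) & ~(size_t)15;
        return memcpy(dst, src, padded);
    }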
Post by MitchAlsup1
Post by BGB-Alt
For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
iteration or so to try to limit looping overhead.
On low end machines, you want to operate at cache port width,
On high end machines, you want to operate at cache line widths per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.
Possible.

As is, it uses 64-bit load/store for unaligned copy, and 128-bit for
aligned copy (support for unaligned "MOV.X" is still an optional feature).

It mostly doesn't bother trying to sort this out for the slide, as for
the size ranges dealt with by the slide, trying to separate aligned from
unaligned at runtime will end up costing about as much as it saves.


Though, for larger copies, it makes more sense to figure it out.
Post by MitchAlsup1
Post by BGB-Alt
Though, leveraging the memcpy slide for the interior part of the copy
could be possible in theory as well.
What do you do when the SATA drive wants to write a whole page ??
?...

Presumably there aren't going to be that many pages being paged out
mid-memcpy.
Post by MitchAlsup1
Post by BGB-Alt
For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
shorter (a big part of LZ decoder performance mostly being in
fine-tuning the logic for the match copies).
Though, this is part of why my runtime library had added a
"_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
which can consolidate this rather than needing to do it one-off for
each LZ decoder (as I see it, it is a similar issue to not wanting
code to endlessly re-roll stuff for functions like memcpy or
malloc/free, *).
*: Though, nevermind that the standard C interface for malloc is
annoyingly minimal, and ends up requiring most non-trivial programs to
roll their own memory management.
Post by MitchAlsup1
Post by BGB
Ended up doing these with "slides", which end up eating roughly
several kB of code space, but was more compact than using larger
inline copies.
   128 bytes or less: Inline Ld/St sequence
   129 bytes to 512B: Slide
   Over 512B: Call "memcpy()" or similar.
     1-infinity: use MM instruction.
Yeah, but it makes the CPU logic more expensive.
By what, 37-gates ??
I will assume it is probably a bit more than this given there is not
currently any sort of mechanism that does anything similar.

Would need to add some sort of "inject synthesized instructions into the
pipeline" mechanism, my guess is this would probably be at least a few
kLUT. Well, unless it is put in ROM, but this would have no real
advantage over "just do it in software".


FWIW:
I had originally intended to put a page-table walker in ROM and then
pretend like it had a hardware page-walker, but we all know how this
turned out.

Though, part of this was because it was competing against arguably more
useful uses of ROM space, like the FAT driver, PE/COFF and ELF loaders,
and the boot-time sanity checks (eg: verify early on that I hadn't
broken fundamental parts of the CPU).
Post by MitchAlsup1
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
The slide generally has entry points in multiples of 32 bytes, and
operates in reverse order. So, if not a multiple of 32 bytes, the
last bytes need to be handled externally prior to branching into the
slide.
Does this remain sequentially consistent ??
Within a thread, it is fine.
What if a SATA drive is reading while you are writing !!
That is, DMA is no different than multi-threaded applications--except
DMA cannot perform locks.
Currently there is no DMA, only polling IO.
Also no SATA interface, nor PCIE, nor ...

IO to an SDcard is basically probing the MMIO interface and spinning in
a loop until it is done. The most elaborate part of this interface is that
there was a mechanism added to allow sending/receiving 8 bytes at a time
over SPI.
Post by MitchAlsup1
Post by BGB-Alt
Main wonk is that it does start copying from the high address first.
Presumably interrupts or similar won't be messing with application
memory mid memcpy.
The only things wanting high-low access patterns are dumping stuff to
the stack. The fact you CAN get away with it most of the time is no excuse.
AFAIK, there is no particular requirement for which direction "memcpy()"
goes.

And, high to low was more effective for the copy slide.
Post by MitchAlsup1
Post by BGB-Alt
The looping memcpy's generally work from low to high addresses though.
As does all string processing.
Granted.

The string handling functions are their own piles of fun...
Paul A. Clayton
2024-04-11 02:18:25 UTC
Reply
Permalink
[snip]
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
I still feel that this atomicity should somehow be included with
ESM just because they feel related, but the benefit seems likely
to be extremely small. How often would software want to copy
multiple regions atomically or combine region copying with
ordinary ESM atomicity?? There *might* be some use for an atomic
region copy and an updating of a separate data structure (moving a
structure and updating one or a very few pointers??). For
structures three cache lines in size where only one region
occupies four cache lines, ordinary ESM could be used.

My feeling based on "relatedness" is not a strong basis for such
an architectural design choice.

(Simple page masking would allow false conflicts when smaller
memory moves are used. If there is a separate pair of range
registers that is checked for coherence of memory moves, this
issue would only apply for multiple memory moves _and_ all eight
of the buffer entries could be used for smaller accesses.)

[snip]
Post by MitchAlsup1
Post by BGB-Alt
As noted, on a 32 GPR machine, most leaf functions can fit
entirely in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
getting totally screwed.
I wonder how many instructions would have to have access to such a
set of "special registers" and if a larger number of extra
registers would be useful. (One of the issues — in my opinion —
with PowerPC's link register and count register was that they
could not be directly loaded from or stored to memory [or loaded
with a constant from the instruction stream]. For counted loops,
loading the count register from the instruction stream would
presumably have allowed early branch determination even for deep
pipelines and small loop counts.) SP, FP, GOT, and TLS hold
"stable values", which might facilitate some microarchitectural
optimizations compared to more frequently modified register names.

(I am intrigued by the possibility of small contexts for some
multithreaded workloads, similar to how some GPUs allow variable
context sizes.)
BGB
2024-04-11 03:21:33 UTC
Reply
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in cases
when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single
instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
As noted, in my case, the whole thing of Ld/St sequences, and memcpy
slides, mostly applies to constant cases.

If the copy size is variable, the compiler merely calls "memcpy()",
which will then generally figure out which loop to use, and one has to
pay the penalty of the runtime overhead of memcpy needing to figure out
what it needs to do.
Post by Paul A. Clayton
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into a
slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in between.
I still feel that this atomicity should somehow be included with
ESM just because they feel related, but the benefit seems likely
to be extremely small. How often would software want to copy
multiple regions atomically or combine region copying with
ordinary ESM atomicity?? There *might* be some use for an atomic
region copy and an updating of a separate data structure (moving a
structure and updating one or a very few pointers??). For
structures three cache lines in size where only one region
occupies four cache lines, ordinary ESM could be used.
My feeling based on "relatedness" is not a strong basis for such
an architectural design choice.
(Simple page masking would allow false conflicts when smaller
memory moves are used. If there is a separate pair of range
registers that is checked for coherence of memory moves, this
issue would only apply for multiple memory moves _and_ all eight
of the buffer entries could be used for smaller accesses.)
All seems a bit complicated to me.

But, as noted, I went for a model of weak memory coherence and leaving
most of this stuff for software to sort out.
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by BGB-Alt
As noted, on a 32 GPR machine, most leaf functions can fit entirely
in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
getting totally screwed.
I wonder how many instructions would have to have access to such a
set of "special registers" and if a larger number of extra
registers would be useful. (One of the issues — in my opinion —
with PowerPC's link register and count register was that they
could not be directly loaded from or stored to memory [or loaded
with a constant from the instruction stream]. For counted loops,
loading the count register from the instruction stream would
presumably have allowed early branch determination even for deep
pipelines and small loop counts.) SP, FP, GOT, and TLS hold
"stable values", which might facilitate some microarchitectural
optimizations compared to more frequently modified register names.
(I am intrigued by the possibility of small contexts for some
multithreaded workloads, similar to how some GPUs allow variable context
sizes.)
In my case, yeah, there are two semi-separate register spaces here:
  GPRs: R0..R63
    R0, R1, and R15 are Special
    R0/DLR: Hard-coded register for some instructions;
      Assembler may stomp without warning for pseudo-instructions.
    R1/DHR:
      Was originally intended similar to DLR;
      Now mostly used as an auxiliary link register.
    R15/SP:
      Stack Pointer.
  CRs: C0..C63
    Various special purpose registers;
    Most are privileged only.
    LR, GBR, etc, are in CR space.


Though, internally, GPRs and CRs both exist within a combined register
space in the CPU:
00..3F: Mostly GPR space
40..7F: CR and SPR space.

Generally, CRs may only be accessed by certain register ports though.


By default, the only way to save/restore CRs is by shuffling them
through GPRs. There is an optional MOV.C instruction for this, but
generally it is not enabled as it isn't clear that it saves enough to be
worth the added LUT cost.

There is a subset version, where MOV.C exists, but is only really able
to be used with LR and GBR and similar. Generally, this version exists
as RISC-V Mode needs to be able to save/restore these registers (they
exist in the GPR space in RISC-V).


As I can note, if I did a new ISA, most likely the register assignment
scheme would differ, say:
R0: ZR / PC
R1: LR / TP (TBR)
R2: SP
R3: GP (GBR)
Where the interpretation of R0 and R1 would depend on context (ZR and LR
for most instructions, PC and TP when used as a Ld/St base address).


Though, some ideas had involved otherwise keeping a similar register
space layout to my existing ABI, mostly because significant ABI changes
would not be easy for my compiler as-is.
Scott Lurndal
2024-04-11 14:30:27 UTC
Reply
Permalink
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Post by Paul A. Clayton
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
BGB-Alt
2024-04-11 21:25:54 UTC
Reply
Permalink
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Probably.
One could argue that, likely, setting up a DMA'ed memmove would be
expensive enough to make it impractical for small copies (in the
category where I am using inline Ld/St sequences or slides).

And, larger copies (where it is most likely to bring benefit) at present
mostly seem to be bus/memory bound.



Sort of reminds me of the thing with the external rasterizer module:
The module itself draws stuff quickly, but getting things set up for it is
still expensive enough to limit its benefit. So the main benefit it
could bring is seemingly just using it to pull off multi-textured
lightmap rendering, which in this case can run at similar speeds to
vertex lighting (lightmapped rendering being a somewhat slower option
for the software rasterizer).

Well, along with me recently realizing a trick to mimic the look of
trilinear filtering without increasing the number of texture fetches
(mostly by distorting the interpolation coords, *). This trick could
potentially be added to the rasterizer module.

*: Traditional bilinear needs 4 texel fetches and 3 lerps (or, a poor
man's approximation with 3 fetches and 2 lerps). Traditional trilinear
needs 8 fetches and 7 lerps. The "cheap trick" version only needs the
same as bilinear.
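For reference, the fetch/lerp counts above come from the standard filters;
a plain C sketch (hypothetical helper names, and not the coordinate-
distortion trick itself):

    // Bilinear: 4 texel fetches, 3 lerps (x+1 and y+1 assumed in range).
    static float lerp(float a, float b, float t) { return a + (b - a) * t; }

    static float bilinear(const float *tex, int w, int x, int y, float fx, float fy)
    {
        float t00 = tex[y * w + x],       t01 = tex[y * w + x + 1];
        float t10 = tex[(y + 1) * w + x], t11 = tex[(y + 1) * w + x + 1];
        return lerp(lerp(t00, t01, fx), lerp(t10, t11, fx), fy);
    }

    // Trilinear: bilinear at two mip levels (8 fetches, 6 lerps) plus one
    // more lerp between the levels, for 7 lerps total.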

One thing that is still needed is a good, fast, and semi-accurate way to
pull off the Z=1.0/Z' calculation, as needed for perspective-correct
rasterization (affine requires subdivision, which adds cost to the
front-end, and interpolating Z directly adds significant distortion for
geometry near the near plane).

Granted, this would almost seem to create a need for an OpenGL
implementation designed around the assumption of a hardware rasterizer
module rather than software span drawing.


Rasterizer module also has its own caching, where it sometimes may be
needed to signal it to perform a cache flush (such as when updating the
contents of a texture, or needing to access the framebuffer for some
other reason, ...).

Potentially, the module could be used to copy/transform images in a
framebuffer (such as for GUI rendering), but would need to be somewhat
generalized for this (such as supporting using non-power-of-2
raster-images as textures).

Though, another possibility could be adding a dedicated DMA module, or
DMA+Image module, or glue dedicated DMA and Raster-Copy functionality
onto the rasterizer module (as a separate thing from its normal "walk
edges and blend pixels" functionality).
Post by Scott Lurndal
Post by Paul A. Clayton
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
Dunno there.

My stuff doesn't guarantee atomicity in general.

The only way to ensure that both parties agree on the contents of memory is
for both to flush their L1 caches or similar.

Or use "No Cache" memory accesses, which is basically implemented as the
L1 cache auto-flushing the line as soon as the request finishes; for
good effect one also needs to add a few NOPs after the memory access to
be sure the L1 has a chance to auto-flush it. Though, another
possibility could be to add dedicated non-caching memory access
instructions.
MitchAlsup1
2024-04-11 23:22:16 UTC
Reply
Permalink
Post by BGB-Alt
One thing that is still needed is a good, fast, and semi-accurate way to
pull off the Z=1.0/Z' calculation, as needed for perspective-correct
rasterization (affine requires subdivision, which adds cost to the
front-end, and interpolating Z directly adds significant distortion for
geometry near the near plane).
I saw a 10-cycle latency 1-cycle throughput divider at Samsung::
10 stages of 3-bit at a time SRT divider with some exponent stuff
on the side. 1.0/z is a lot simpler than that (float only). A lot
of these great big complicated calculations can be beaten into
submission with a clever attack of brute force HW.....FMUL and FMAC
being the most often cited cases.
MitchAlsup1
2024-04-11 23:12:25 UTC
Reply
Permalink
Post by Scott Lurndal
Post by Paul A. Clayton
[snip]
Post by MitchAlsup1
Post by BGB-Alt
Things like memcpy/memmove/memset/etc, are function calls in
cases when not directly transformed into register load/store
sequences.
My 66000 does not convert them into LD-ST sequences, MM is a
single instruction.
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Effectively, that is what HW does, even on the lower end machines,
the AGEN unit of the Cache access pipeline is repeatedly cycled,
and data is read and/or written. One can execute instructions not
needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
are in progress.

Moving this sequencer farther out would still require it to consume
all L1 BW in any event (snooping) for memory consistency reasons.
{Note: cache accesses are performed line-wide not register width wide}
Post by Scott Lurndal
Post by Paul A. Clayton
Post by MitchAlsup1
Post by BGB-Alt
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
The entire chunk of data traverses the interconnect as a single
transaction. All interested 3rd parties (not originator nor
recipient) see either the memory state before the transfer or
after the transfer.
Paul A. Clayton
2024-04-20 23:19:53 UTC
Reply
Permalink
[snip]
Post by MitchAlsup1
Post by Scott Lurndal
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches.   Likewise memset.
Effectively, that is what HW does, even on the lower end machines,
the AGEN unit of the Cache access pipeline is repeatedly cycled,
and data is read and/or written. One can execute instructions not
needing memory references while LDM, STM, ENTER, EXIT, MM, and MS
are in progress.
Moving this sequencer farther out would still require it to consume
all L1 BW in any event (snooping) for memory consistency reasons.
{Note: cache accesses are performed line-wide not register width wide}
If the data was not in L1 cache, only its absence would need to be
determined by the DMA engine. A snoop filter, tag-inclusive L2/L3
probing, or similar mechanism could avoid L1 accesses. Even if the
source or destination for a memory copy was in L1, only one L1
access per cache line might be needed.

I also wonder if the cache fill and/or spill mechanism might be
decoupled from the load/store such that if the cache had enough
banks/subarrays some loads and stores might be done in parallel
with a cache fill or spill/external-read-without-eviction. Tag
checking would limit the utility of such, though tags might also
be banked or access flexibly scheduled (at the cost of choosing a
victim early for fills). Of course, if the cache has such
available bandwidth, why not make it available to the core as well
even if it was rarely useful? (Perhaps higher register bandwidth
might be more difficult than higher cache bandwidth for banking-
friendly patterns?)

Deciding when to bypass cache seems difficult (for both software
developers and hardware). Overwriting cache lines within the same
memory copy is obviously silly. Filling a cache with a memory copy
is also suboptimal, but L1 hardware copy-on-write would probably
be too complicated even with page aligned copies. A copy from
cacheable memory to uncacheable memory (I/O) might be a strong
hint that the source should not be installed into L1 or L2 cache,
but I would guess that not installing the source would often be
the right choice.

I could also imagine a programmer wanting to use memory copy as a
prefetch *directive* for a large chunk of memory (by having source
and destination be the same). This idiom would be easy to detect
(from and to base registers being the same), but may be too niche
to be worth detecting (for most implementations).

(My 66000 might use an idiom with a prefetch instruction preceding
a memory move to indicate the cache level of the destination but
that only manages [some of] the difficulty of the hardware
choice.)

For memset, compression is also an obvious possibility. A memset
might not write any cache lines but rather cache the address range
and the set value and perform hardware copy on access into cache
lines.
Paul A. Clayton
2024-04-21 00:02:07 UTC
Reply
Permalink
On 4/11/24 10:30 AM, Scott Lurndal wrote:
[snip]
[snip]
Post by Scott Lurndal
Post by MitchAlsup1
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in a
SMP processor...
While Mitch Alsup's response ("The entire chunk of data traverses
the interconnect as a single transaction." — I am not certain how
that would work given reading up to a page and writing up to a
page) provides one mechanism and probably the best one,
theoretically the *data* does not need to be moved atomically but
only the "ownership" (the source does not have to be owned in the
traditional sense but needs to be marked as readable by the copier).
This is somewhat similar to My 66000's Exotic Synchronization
Mechanism in that once all the addresses involved are known (the
two ranges for memory copy), NAKs can be used for remote requests
for "owned" cache lines while the copy is made.

Only the visibility needs to be atomic.

Memory set provides optimization opportunities in that the source
is small. In theory, the set value could be sent to L3 with the
destination range and all monitoring could be done at L3 and
requested cache line sent immediately from L3 (hardware copy on
access) — the first and last part of the range might be partial
cache lines requiring read-for-ownership.

For cache line aligned copies, a cache which used indirection
between tags and data might not even copy the data but only the
tag-related metadata. Some forms of cache compression might allow
partial cache lines to be cached such that even unaligned copies
might partially share data by having one tag indicate lossy
compression with an indication of where the stored data is not
valid, but that seems too funky to be practical.

Chris M. Thomasson
2024-04-10 05:01:00 UTC
Reply
Permalink
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB
Post by Thomas Koenig
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most
"typical" code, mostly bringing a benefit if one is spending a lot of
CPU time in functions that have large numbers of local variables all
being used at the same time.
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128
GPRs is wasteful (would likely need lots of monster functions with
250+ local variables to make effective use of this, *, which probably
isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
of GPRs AND you have good access to constants.
On the main ISA's I had tried to generate code for, 16 GPRs was kind of
a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use
all of the registers at the same time without stepping on itself (such
as dealing with register allocation involving scratch registers while
also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for
variables in basic blocks which ended in a function call (in my compiler
design, both function calls and branches terminating the current
basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via
PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands
with most instructions, and the CPU tends to deal fairly well with code
that has lots of spill-and-fill. This along with instructions having
access to 32-bit immediate values.
Post by MitchAlsup1
Post by BGB
*: Where, it appears it is most efficient (for non-leaf functions) if
the number of local variables is roughly twice that of the number of
CPU registers. If more local variables than this, then spill/fill
rate goes up significantly, and if less, then the registers aren't
utilized as effectively.
Well, except in "tiny leaf" functions, where the criteria is instead
that the number of local variables be less than the number of scratch
registers. However, for many/most small leaf functions, the total
number of variables isn't all that large either.
The vast majority of leaf functions use less than 16 GPRs, given one has
a SP not part of GPRs {including arguments and return values}. Once
one starts placing things like memove(), memset(), sin(), cos(),
exp(), log()
in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases when
not directly transformed into register load/store sequences.
Did end up with an intermediate "memcpy slide", which can handle medium
size memcpy and memset style operations by branching into a slide.
As noted, on a 32 GPR machine, most leaf functions can fit entirely in
scratch registers. On a 64 GPR machine, this percentage is slightly
higher (but, not significantly, since there are few leaf functions
remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a
little harder though (as typically these need to cover both any
variables used by the function, and any temporaries used, ...). There
are a whole lot more leaf functions that exceed a limit of 6 than of 14.
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large
On programs like Doom, the difference is small at best.
It mostly effects things like GLQuake in my case, mostly because TKRA-GL
has a lot of functions with a large numbers of local variables (some
exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled
and uses lots of variables tending to perform better in my case (and
tightly looping code, with lots of small functions, not so much...).
Post by MitchAlsup1
Post by BGB
     Everything fits in scratch registers, no stack frame, no calls.
     No function calls (either explicit or implicit);
     Will have a stack frame.
     May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required
to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
  There is no frame pointer, as BGBCC doesn't use one;
    All stack-frames are fixed size, VLA's and alloca use the heap;
  GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
  TLS, accessed via TBR.[...]
alloca using the heap? Strange to me...
BGB
2024-04-10 07:41:01 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by BGB-Alt
[snip]
   There is no frame pointer, as BGBCC doesn't use one;
     All stack-frames are fixed size, VLA's and alloca use the heap;
   GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
   TLS, accessed via TBR.[...]
alloca using the heap? Strange to me...
Well, in this case:
The alloca calls are turned into runtime calls which allocate the
memory blob and add it to a linked list; when the function returns,
everything in the linked list is freed. Internally, this is done via
malloc and free.
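
A minimal sketch of that scheme, with hypothetical helper names (not
BGBCC's actual runtime entry points):

#include <stdlib.h>

/* Each heap-backed alloca block is prepended with a link node so the
   epilog can free everything the function allocated in one pass. */
struct alloca_node {
    struct alloca_node *next;
};

/* The compiler rewrites "alloca(n)" into a call like this, passing the
   address of a per-frame list head kept in the stack frame. */
static void *rt_alloca(struct alloca_node **frame_list, size_t n)
{
    struct alloca_node *node = malloc(sizeof *node + n);
    if (!node)
        return NULL;
    node->next = *frame_list;
    *frame_list = node;
    return node + 1;            /* user data follows the link node */
}

/* Inserted by the compiler into every return path of the function. */
static void rt_alloca_free_all(struct alloca_node *frame_list)
{
    while (frame_list) {
        struct alloca_node *next = frame_list->next;
        free(frame_list);
        frame_list = next;
    }
}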

Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack
overflow.

Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
heap allocation is not too slow in this case.


Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
Chris M. Thomasson
2024-04-10 18:57:14 UTC
Reply
Permalink
Post by BGB
Post by Chris M. Thomasson
[snip]
alloca using the heap? Strange to me...
The alloca calls are turned into calls which allocate the memory blob
and add it to a linked list;
when the function returns, everything in the linked list is freed;
Then, it internally pulls this off via malloc and free.
Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack
overflow.
Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
heap allocation is not too slow in this case.
Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
Sometimes alloca is useful wrt offsetting the stack to avoid false
sharing between stacks. Intel wrote a little paper that addresses this:

https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf

Remember that one?
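
The trick in that paper amounts to burning a thread-dependent amount
of stack on entry; a minimal sketch of the idea (illustrative, not
lifted from the paper; alloca.h is the glibc header):

#include <stddef.h>
#include <alloca.h>

/* At the top of each thread's entry function, consume a thread-dependent,
   cache-line-sized chunk of stack so that the hot stack frames of
   different threads do not end up with identical cache-line/set
   alignment (the false-sharing/aliasing issue the paper describes). */
void thread_entry(int thread_id)
{
    volatile char *pad = alloca(((size_t)(thread_id % 8) + 1) * 64);
    pad[0] = 0;   /* touch the pad so it is not optimized away */

    /* ... actual thread work ... */
}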
BGB-Alt
2024-04-10 21:53:32 UTC
Reply
Permalink
Post by Chris M. Thomasson
Post by BGB
Post by Chris M. Thomasson
[snip]
alloca using the heap? Strange to me...
The alloca calls are turned into calls which allocate the memory blob
and add it to a linked list;
when the function returns, everything in the linked list is freed;
Then, it internally pulls this off via malloc and free.
Also the typical default stack size in this case is 128K, so trying to
put big allocations on the stack is more liable to result in a stack
overflow.
Bigger stack needs more memory, so is not ideal for NOMMU use. Luckily
heap allocation is not too slow in this case.
Though, at the same time, ideally one limits use of language features
where the code-generation degenerates into a mess of hidden runtime
calls. These cases are not ideal for performance...
Sometimes alloca is useful wrt offsetting the stack to avoid false
https://www.intel.com/content/dam/www/public/us/en/documents/training/developing-multithreaded-applications.pdf
Remember that one?
This seems mostly N/A in my case, as the cores use a weak memory model
and there is no SMT.

Also, thread creation tends to offset stacks by a random amount as a
form of ASLR (IIRC, roughly 0..256 bytes, in multiples of 16).
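
A minimal sketch of that per-thread stack offset, with a hypothetical
helper name (not the actual thread-creation code):

#include <stdint.h>
#include <stdlib.h>

/* Nudge a new thread's initial stack pointer down by a random,
   16-byte-aligned amount in [0, 256] as a cheap form of stack ASLR. */
static void *randomize_stack_top(void *stack_top)
{
    uintptr_t offset = (uintptr_t)(rand() % 17) * 16;   /* 0..256, multiple of 16 */
    return (void *)((uintptr_t)stack_top - offset);
}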


As for reducing the cost of heap sharing between threads, there is
another option here:
Give each thread its own local version of the heap (essentially,
per-thread free-lists and similar). This can avoid the need to use mutex
locking and similar, though may have a penalty if one tries to free
memory objects that weren't allocated in the same thread.

In my case, the heap is split into multiple object sizes:
  Small:
    Under around 1K, allocated in terms of 16-byte cells;
    Allocated in chunks from the medium heap.
  Medium:
    Around 1K to 64K, allocated via subdividing a larger block (256K);
    Allocated via the large heap.
  Large:
    Bigger than 64K or so, allocated via pages (e.g., "mmap()").

For the most common sizes (small and medium), the free-list and similar
could be thread local; global locking then being used for large
allocation, or for allocating new heap blocks (for the medium heap).

As can be noted, objects in the small object heap tend to be padded to a
multiple of 16 bytes, and normally have a 16 byte object header (the
pointer to an allocated object points just after this header).

Note that objects in the large heap may instead store this metadata
externally.
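
A minimal sketch of that layout and the size-class dispatch, with
hypothetical helper names and illustrative header fields (thresholds
as described above):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-class allocators (free-list plumbing omitted). */
void *small_alloc_cells(size_t cells);    /* per-thread, 16-byte cells  */
void *medium_alloc(size_t n);             /* subdivides 256K blocks     */
void *large_alloc_pages(size_t n);        /* page granular, e.g. mmap() */

/* 16-byte header preceding each small/medium object; the pointer handed
   back to the program points just past it. Field names are illustrative. */
struct obj_header {
    uint32_t size_cells;   /* payload size in 16-byte cells  */
    uint32_t flags;        /* heap class, mark bits, etc.    */
    uint64_t extra;        /* e.g. owning thread / type info */
};

void *heap_alloc(size_t n)
{
    if (n < 1024) {
        /* Small: round up to 16-byte cells; chunks come from the medium heap. */
        return small_alloc_cells((n + 15) / 16);
    } else if (n < 65536) {
        /* Medium: carve out of a larger (e.g. 256K) block from the large heap. */
        return medium_alloc(n);
    } else {
        /* Large: go straight to page-granular allocation. */
        return large_alloc_pages(n);
    }
}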


Granted, yeah, mutex locking is fairly expensive with a weak memory
model, and shared memory is generally undesirable as there is little in
the way of memory coherence (absent explicit flushes, accessing memory
belonging to a different thread may result in stale data).


A similar trick was used in the past for my BGBScript VMs, mostly
because mutex locking is slow; and dynamic languages like this tend to
involve a lot of rapid-fire small granularity allocations and frees
(every object and array goes on the heap).


In BGBCC, it is merely just that large objects and arrays go on the heap
(along with VLAs and similar). If one creates a lambda, this also goes
on the heap.

If one wants to support proper lexical closures (N/A for both C++ and
Java style lambdas, *), the local variables may also end up on the heap.
And, if one wanted to support Scheme style call/cc
(call-with-current-continuation), the entire ABI frame needs to go on
the heap. However, a decision was made early on not to bother with
call/cc support in the BGBCC ABI (an ABI capable of supporting call/cc
would impose a severe performance penalty).

There was provision made for supporting exit-only continuations, which
can effectively leverage the same mechanism as try/catch and throw (the
continuation is effectively a self-throwing exception which will be
caught at a predefined location).
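
In plain C terms, an exit-only continuation behaves much like
setjmp/longjmp (or a self-thrown exception caught at a fixed place);
a minimal sketch of the idea, not BGBCC's actual mechanism:

#include <setjmp.h>
#include <stdio.h>

/* The "continuation" can only be invoked to unwind back out to the point
   where it was captured, never to re-enter already-exited frames -- the
   same shape as try/catch with a self-throwing exception. */
static jmp_buf exit_cont;

static void worker(int depth)
{
    if (depth == 0)
        longjmp(exit_cont, 1);   /* invoke the exit-only continuation */
    worker(depth - 1);
}

int main(void)
{
    if (setjmp(exit_cont) == 0) {
        worker(10);
        puts("not reached");
    } else {
        puts("unwound via exit-only continuation");
    }
    return 0;
}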


*: By default, lambdas in BGBCC do not use lexical capture, and instead
either capture by-value or capture-by-reference (like in C++ style
lambdas, or GCC inner functions, using C++ style syntax) but differ in
that the lambdas are callable as normal function pointers (and
heap-allocated, rather than RAII value-types, though will be auto-freed
when the originating function returns in the capture-by-reference case).

In my BGBScript2 language, capture-by-value had been the default, with
lambdas having an indeterminate lifespan (they may continue to exist
outside of the scope where the calling function had terminated; unlike
GCC inner functions).

...
Brian G. Lucas
2024-04-13 17:51:18 UTC
Reply
Permalink
Post by Thomas Koenig
Post by Thomas Koenig
Post by MitchAlsup1
Post by Thomas Koenig
Post by John Savard
Usually, have 28 bit instructions that are shorter because there's
only one opcode for each floating and integer operation. The first
four bits in a block give the lengths of data to be used.
But have one value for the first four bits in a block that indicates
36-bit instructions instead, which do include type information, so
that very occasional instructions for rarely-used types can be mixed
in which don't fill a whole block.
While that's a theoretical possibility, I don't view it as being
worthwhile in practice.
I played around a bit with another scheme: Encoding things into
128-bit blocks, with either 21-bit or 42-bit or longer instructions
(or a block header with six bits, and 20 or 40 bits for each
instruction).
Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
destructive operand model for the 21-bit encodings. Yes :: no ??
It was not very well developed, I gave it up when I saw there wasn't
much to gain.
Maybe one more thing: In order to justify the more complex encoding,
I was going for 64 registers, and that didn't work out too well
(missing bits).
Having learned about M-Core in the meantime, pure 32-register,
21-bit instruction ISA might actually work better.
If you want to know more about MCore, you can contact me.
I was the initial designer of the mcore ISA. It was targeted
at embedded processors, particularly control processors in phones
and radios. It was extended and found its way into GPS receivers and
set top boxes. Motorola licensed it to the Chinese and there it is
known as CSky ISAv1 (there is a different ISAv2). There is even a
supported Linux port of CSky v1.

brian
MitchAlsup1
2024-04-14 22:58:22 UTC
Reply
Permalink
Stephen Fuld wrote:
<snip>
Post by Stephen Fuld
I think this all works fine for a single compilation unit, as the
compiler certainly knows the type of the data. But what happens with
separate compilations? The called function probably doesn’t know the
tag value for callee saved registers. Fortunately, the My 66000
architecture comes to the rescue here. You would modify the Enter and
Exit instructions to save/restore the tag bits of the registers they are
saving or restoring in the same data structure it uses for the registers
(yes, it adds 32 bits to that structure – minimal cost). The same
mechanism works for interrupts that take control away from a running
process.
I had missed this until now:: The stack remains 64-bit aligned at all times,
so if you add 32-bits to the stack you actually add 64-bits to the stack.

Given this, you can effectively use a 2-bit tag {integral, floating, pointing,
describing}. The difference between pointing and describing is that pointing
is C-like, while describing is dope-vector-like. {{Although others may find
something else to put in the 4th slot.}}
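
The arithmetic works out neatly: 32 registers times 2 bits is exactly
one 64-bit save slot. A small illustrative sketch of packing and
unpacking such a tag word:

#include <stdint.h>

enum reg_tag { TAG_INTEGRAL = 0, TAG_FLOATING = 1,
               TAG_POINTING = 2, TAG_DESCRIBING = 3 };

/* 32 registers x 2-bit tags pack exactly into one 64-bit save slot,
   keeping the stack 64-bit aligned. */
static inline uint64_t set_tag(uint64_t tags, int reg, enum reg_tag t)
{
    tags &= ~((uint64_t)3 << (2 * reg));
    return tags | ((uint64_t)t << (2 * reg));
}

static inline enum reg_tag get_tag(uint64_t tags, int reg)
{
    return (enum reg_tag)((tags >> (2 * reg)) & 3);
}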
Post by Stephen Fuld
Any comments are welcome.