Discussion:
Concertina II Progress
Quadibloc
2023-11-08 21:33:59 UTC
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at

http://www.quadibloc.com/arch/ct17int.htm

As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.

I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.

So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.

This has now been dropped. Since I managed to squeeze the normal
(unaligned) memory-reference instructions into so much less opcode space
that there was also room for the aligned memory-reference format, without
compromising the basic instruction set, multiple instruction formats were
no longer needed.

I had to change the instructions longer than 32 bits to fit them into
the basic instruction format, so now they're less dense.

Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).

The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.

John Savard
BGB
2023-11-09 06:43:27 UTC
Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
Ironically, I am getting slightly better reach on average with scaled
9- and 10-bit displacements than RISC-V gets with 12 bits...

Say:
DWORD:
12s, Unscaled: +/- 2K
9u, 4B Scale : + 2K
10s, 4B Scale: +/- 2K (XG2)
QWORD:
12s, Unscaled: +/- 2K
9u, 8B Scale : + 4K
10s, 8B Scale: +/- 4K (XG2)

It was a pretty tight call between 10s and 10u, but 10s won out by a
slight margin mostly because the majority of structs and stack-frames
tend to be smaller than 4K (but, does create an incentive to use larger
storage formats for on-stack storage).

Though, for integer immediate instructions, RISC-V would have a slight
advantage. Where, say, roughly 9% of 3R integer immediate values miss
with the existing Imm9u/Imm9n scheme; but the sliver of "misses with 9
bits, but would hit with 12 bits" is relatively small (most of the
"miss" cases are much larger constants).

However, a fair chunk of these "miss" cases, could be handled with a
bit-set/bit-clear instruction, say:
y=x|0x02000000;
z=x&0xFDFFFFFF;
Turning into, say:
BIS R4, 25, R6
BIC R4, 25, R7

Unclear if this case is quite common enough to justify adding these
instructions though (granted, a case could be made for them).
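
Roughly, the check a compiler would need for this, as a C sketch (the
helper names are made up here, not BGBCC internals):

  #include <stdint.h>

  /* Does an OR immediate name a single set bit (BIS candidate), or an
     AND immediate clear exactly one bit (BIC candidate)? Returns the
     bit index, or -1 if the pattern doesn't apply. */
  static int bis_bit_index(uint32_t imm) {
      if (imm == 0 || (imm & (imm - 1)) != 0)
          return -1;               /* zero, or more than one bit set */
      int i = 0;
      while (!(imm & 1)) { imm >>= 1; i++; }
      return i;
  }

  static int bic_bit_index(uint32_t imm) {
      return bis_bit_index(~imm);  /* all-ones except one clear bit */
  }

So y=x|0x02000000 maps to BIS with bit 25, and z=x&0xFDFFFFFF to BIC
with bit 25.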


However, a few cases do typically need larger displacements:
PC relative, such as branches.
GBR relative, namely constant loads.


For PC relative, 20-bits is "mostly enough", but one program has hit the
20-bit limit (+/- 1MB). Recently, via a tweak, in current forms of the
ISA, the effective branch-displacement limit (for a 32-bit instruction
form) has been increased to 23 bits (+/- 8MB).
Baseline+XGPR: Unconditional BRA and BSR only.
Conditional branches still limited to 20 bits.
XG2: Also includes conditional branches.

In these cases, it was mostly because the bits that were being used to
extend the GPRs to 6 bits were N/A for their original purpose with
branch ops, and these could be repurposed for the displacement. The main
other alternatives would have been 22 bits + an alternate link register,
or a 3-bit LR field; however, the cost of supporting these would have
been higher than that of simply reassigning the bits to make the
displacement bigger.

Potentially a similar role could have been served by a conjoined "MOV
LR, R1 | BSR Disp" instruction (and/or allowing "MOV LR, R1" in Lane 2
as a special case for this, even if it would not otherwise be allowed
within the ISA rules). Though, would defeat the point if this encoding
foils the branch predictor.



Recently, had ended up adding some Disp11s Compare-with-Zero branches,
mostly as these branches turn out to be useful (in the face of 2-cycle
CMPxx), and 8 bits "wasn't quite enough". Say, Disp11s can cover a much
bigger if/else block or loop body (+/- 2K) than Disp8s (+/- 256B).


For GBR Relative:
The default 9-bit displacement was Byte scaled (for "reasons");
But, a 512B range isn't terribly useful;
Later forms ended up with Disp10u Scaled:
This gives 4K or 8K of range (in Baseline)
This increases to 8K and 16K in XG2.


If the compiler sorts primitive global variables by descending-usage
(and emits the top N specially, at the start of ".data"), then the
Scaled GBR cases can access a majority of the global variables (around
75-80% with a scaled 10-bit displacement).
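
As a C sketch, the sort of pass this implies (the types and threshold
are made up for illustration, not BGBCC's actual structures):

  #include <stdlib.h>

  typedef struct {
      const char *name;
      int         use_count;  /* static reference count from the compiler */
      int         direct;     /* 1: placed at the start of ".data",
                                 reachable via the scaled GBR disp */
  } GlobalVar;

  static int cmp_desc_usage(const void *a, const void *b) {
      return ((const GlobalVar *)b)->use_count -
             ((const GlobalVar *)a)->use_count;
  }

  /* Sort by descending usage, then flag everything that fits in the
     directly addressable window (e.g. 8K / element size entries). */
  void layout_globals(GlobalVar *vars, int n, int direct_max) {
      qsort(vars, n, sizeof(GlobalVar), cmp_desc_usage);
      for (int i = 0; i < n; i++)
          vars[i].direct = (i < direct_max);
  }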

Effectively, the remaining 20-25% or so need to be handled as one of:
Jumbo Disp33s (if Jumbo prefixes are available, most profiles);
2-op Disp25s (no jumbo, '.data'+'.bss' less than 16MB).
3-op Disp33s (else).


Though, as with the stack frames, these instructions do create an
incentive to effectively promote any small global variables to a larger
storage type (such as 'char' or 'short' to 'int'); just with implicit
sign (or zero) extensions to preserve the expected behavior of the
smaller type (though, strictly speaking, only zero-extensions would be
required by the C standard, given signed overflow is technically UB; but
there would be something "deeply wrong" with a 'char' variable being
able to hold, say, -4495213, or similar).

Though, does mean for normal variables, "just use int or similar" is
typically faster (say, because there are dedicated 32-bit sign and zero
extending forms of some of the common ALU ops, but not for 8 or 16 bit
cases).


A Disp16u case could maybe reach 256K or 512K, which could cover much of
a combined data+bss section. While in theory this could be better, to
make effective use of this would require effectively folding much of
".bss" into ".data", which is not such a good thing for the program
loader (as opposed to merely folding the top N most-used variables into
".data").

Then again, uninitialized global arrays could probably still be left in
".bss", which tend to be the main "bulking factor" for this section (as
opposed to normal variables).
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
Yeah.

If you want a Load/Store to have two 5-bit register fields and a 16-bit
displacement, only 6 bits are left in a 32-bit instruction word. This
is not a whole lot...

For a full set of Load/Store ops, this is 4 bits;
For a set of basic ALU ops, this is another 3 bits.

So, just for Load/Store and basic ALU ops, half the encoding space is
gone...

Would it be worth it?...
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
I had to change the instructions longer than 32 bits to get them in the
basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Such is a long-standing issue...


I am also annoyed sometimes at how complicated my design has gotten.
Still, it is within reason, and not too far outside the scope of many
existing RISC's.

But, as noted, the reason XG2 exists as-is was sort of a compromise:
I couldn't come up with any encoding which could actually give
everything I wanted, and the "most practical" option was effectively to
dust off an idea I had originally rejected:
Having an alternate encoding which dropped 16-bit ops in favor of
reusing these bits for more GPRs.


At first glance, RISC-V seems cleaner and simpler, but this falls on its
face once one goes outside the scope of RV64IM or similar.

And, it isn't tempting when, at least from my POV, RV64 seems "less
good" than what I have already (others may disagree; but at least to me,
some parts of RISC-V's design seem to me like kind of a trash fire).

The main tempting thing RV64 has is that, maybe, if one goes and
implements RV64GC and clones a bunch of SiFive's hardware interfaces,
then potentially one can run a mainline Linux on it.

There have apparently been some people that have gotten NOMMU Linux
working on RV32IM targets, which is possible (and, ironically, seemingly
basing these on the SuperH branch in the Linux kernel from what I had
seen...).


Seemingly, AMD/Xilinx is jumping over from MicroBlaze to an RV32
variant. But, granted, RV32 isn't too far from what MicroBlaze is
typically used for, so not really a huge stretch.

I sometimes wonder if maybe I would be better off jumping to RV, but
then I end up seeing examples where cores running at somewhat higher
clock speeds still manage to deliver relatively poor framerates in Doom.


Like, as-is, my MIPS scores are kinda weak, but I am still getting
around 30 fps in Doom at around 20-24 MIPS.

RV64IM seemingly needs significantly higher MIPS to get similar
framerates in Doom.

Say, for Doom:
BJX2 needs ~ 800k instructions / frame;
RV64IM seemingly needs nearly 2 million instructions / frame.

Not entirely sure what all is going on, but I have my suspicions.

Though, it does seem to be the inverse situation with Dhrystone.

Say:
BJX2: around 1.3 DMIPS per BJX2 instruction;
RV64: around 3.8 DMIPS per RV64 instruction.

Though, I can note that there seems to be "something weird" with
Dhrystone and GCC (in multiple scenarios, GCC gives Dhrystone scores
significantly above what could be "reasonably expected" or what would
agree with the scores given by other compilers, seemingly as if it is
optimizing away a big chunk of the benchmark...).

But, these results don't typically extend to other programs (where
scores are typically much closer together).


Actually, I have noted that if comparing BGBCC with MSVC, and BJX2 with
my Ryzen, performance seems to scale pretty close to linearly with
clock speed, albeit with some outliers.

There are cases where deviation has been noted:
Speed differences for TKRA-GL's software rasterizer backend are smaller
than the difference in clock-speed (74x clock-speed delta; 20x fill-rate
delta);
And cases where it is bigger: The performance delta for things like LZ4
decompression or some of my image codecs is somewhat larger than the
clock-speed delta (say: 74x clock-speed delta, 115x performance delta, *1).


*1: Though, LZ4 still operates near memcpy() speed in both cases; issue
is mostly that, relative to MHz, my BJX2 core has comparably slower
memory access.

Albeit somehow, this trend reverses for my early 2000s laptop, which has
slower RAM access. However, the SO-DIMM is 4x the width (64b vs 16b),
and 133MHz vs 50MHz; and this leads to a theoretical 10.64x ratio, which
isn't too far off from the observed memcpy() performance of the laptop.

So, laptop has 10.64x faster RAM, relative to 28x more MHz.


Whereas, say, my Ryzen has 2.64x more MHz (3.7 vs 1.4), but around 40x
more memory bandwidth (12.7x for single-thread memcpy).



Well, and if I did jump over to RV64, it would render much of what I
am doing entirely moot.

I *could* do a dedicated RV64 core, but would be unlikely to make it
"notable" enough to be worthwhile.

So, it seems like my options are either:
Continue on doing stuff mostly as is;
Drop it and probably go off to doing something else entirely.

...




But, don't have much else better to be doing, considering the typically
"meh" response to most of my 3D engine attempts. And my general
lackluster skills towards most types of "creative" endeavors (I suspect
"affective alexithymia" probably doesn't help too much for artistic
expression).

Well, and I have also recently noted other oddities, for example:
It seems I may have "reverse slope hearing loss", and my hearing is
seemingly notably poor for sounds much lower than about 1.5 or 2kHz
(lower-frequency sine waves are nearly inaudible, but I can still hear
square/triangle/sawtooth waves well; most of what I perceive as
low-frequency sounds seemingly being based on higher-frequency harmonics
of those sounds).

So, say:
2kHz..4kHz, loud, heard easily;
4kHz..8kHz, also heard readily;
8..15kHz, fades away and disappears.
But, OTOH, for sine waves:
1kHz: much quieter than 2kHz
500Hz: fairly mild at full volume
250Hz: relatively quiet
125Hz: barely audible.


But, for sounds much under around 200Hz, I can feel the vibrations, and
can associate these with sound (but, this effect is not localized to
ears, also works with hands and similar; this effect seems strongest at
around 50-100 Hz, but has a lower range of around 6-8Hz, below this
point, feeling becomes less sensitive to it, but visual perception can
take over at this point).


I can take audio and apply a fairly aggressive 2kHz high-pass filter
(say, -48 dB per octave, applied several times), and for the most part it
doesn't sound that much different, though it does sound a little more
tinny. This "tinny" effect is reduced with a 1kHz high-pass filter.

Most of what I had perceived as low-frequency sounds are still present
even after the filtering (and while entirely absent in a spectrum plot).
Zooming in generally shows patterns of higher frequency vibrations
following similar patterns to the low-frequency vibrations, which
seemingly I perceive "as" the low-frequency vibration.


And, in all this, I hadn't noticed that anything was amiss until looking
into it for other reasons.



I am left to wonder if some of this could be related to my preference
for the sound of ADPCM compression over that of MP3 at lower quality
levels (low bitrate MP3 sounds particularly awful, whereas ADPCM tends
to fare better; but seemingly other people disagree).


Does possibly explain some other past difficulties:
I can make a noise and hear the walls within a room;
But, trying to hit a metal tank to determine how much sand was in the
tank by hearing, was quite a bit more difficult (best I could do was hit
the tank, and then try to hear what parts of the tank had reduced echo;
but results were pretty mixed as the sand level did not significantly
change the echoes).

Apparently, it turns out, people were listening for "thud" vs "not
thud", but like, I couldn't really hear this part, and wasn't even
really aware there should be a "thud" (or even really what a "thud"
sounds like apart from the effects of, say, something hitting a chunk of
wood; hitting a sand-filled steel tank with a rubber mallet was nearly
silent, but, knuckles or tapping it with a screwdriver was easier to
hear, ...).


Well, also can't really understand what anyone is saying over the phone
(as the phone reduces everything to difficult to understand muffled noises).

Or, like the sound effects in Wolfenstein 3D: theoretically these are
voice clips saying stuff, but they come across more as things like
"aaaa uunn" or "aaaauuuu" or "uu aa uu" or similar, owing to the poor
audio quality.

Well, and my past failures to achieve any kind of intelligibility in
past experiments messing with formant synthesis.

And some experiments with vocoder like designs, noting that I could
seemingly discard pretty much everything much below 500Hz or 1kHz
without much ill effect; but theoretically there is "relevant stuff" in
these frequency ranges. Didn't really think of much at the time (it
seemed like all of this was a "bass frequency" where the combined
amplitude of everything could be averaged together and treated like a
single channel).

Had noted that, one thing that did sort of work, was, say:
Split the audio into 32 frequency bands;
Pick the top 2 or 3 bands, ignoring low-frequency or adjacent bands;
Say, anything below 1kHz is ignored.
Record the band number and relative volume.

Then, regenerate waveforms at each of these bands with the measured
volume (along with alternate versions spread across different octaves;
it worked better if higher power-of-2 frequencies were also synthesized,
albeit at lower intensities). Get back "mostly intelligible" speech.

IIRC, had mostly used 32 bands spread across 2 octaves (say, 1-2 kHz and
2-4kHz, or 2-4 kHz and 4-8 kHz).
Can also mix in sounds from the same relative position in other octaves.

Seemed to have best results with mostly evenly-spread frequency bands.
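
As a rough C sketch of the above (parameters assumed: Goertzel-style
per-band magnitudes, 32 bands spread evenly over 1-4 kHz; this is an
illustration of the idea, not the actual experiment code):

  #include <math.h>

  #define NBANDS 32
  #define PI 3.14159265358979323846

  /* Goertzel magnitude of one frequency within a frame. */
  static double band_mag(const float *frame, int n, double freq, double fs) {
      double c = 2.0 * cos(2.0 * PI * freq / fs);
      double s0, s1 = 0.0, s2 = 0.0;
      for (int i = 0; i < n; i++) {
          s0 = frame[i] + c * s1 - s2;
          s2 = s1; s1 = s0;
      }
      return sqrt(s1 * s1 + s2 * s2 - c * s1 * s2);
  }

  /* Keep the 2 loudest non-adjacent bands (the sub-1kHz range is
     already excluded by the band placement), then resynthesize sines
     at those frequencies, plus a quieter copy one octave up. */
  static void frame_resynth(const float *in, float *out, int n, double fs) {
      double mag[NBANDS], f0 = 1000.0, bw = 3000.0 / NBANDS;
      for (int b = 0; b < NBANDS; b++)
          mag[b] = band_mag(in, n, f0 + bw * (b + 0.5), fs);

      int b0 = 0, b1 = -1;
      for (int b = 1; b < NBANDS; b++)
          if (mag[b] > mag[b0]) b0 = b;
      for (int b = 0; b < NBANDS; b++) {
          if (b >= b0 - 1 && b <= b0 + 1) continue; /* skip b0 + neighbors */
          if (b1 < 0 || mag[b] > mag[b1]) b1 = b;
      }

      int pick[2] = { b0, b1 };
      for (int i = 0; i < n; i++) out[i] = 0.0f;
      for (int k = 0; k < 2; k++) {
          if (pick[k] < 0) continue;
          double f = f0 + bw * (pick[k] + 0.5);
          double a = 2.0 * mag[pick[k]] / n;      /* rough amplitude */
          for (int i = 0; i < n; i++)
              out[i] += (float)(a * (sin(2.0 * PI * f * i / fs) +
                                0.5 * sin(2.0 * PI * 2.0 * f * i / fs)));
      }
  }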


...
Thomas Koenig
2023-11-09 18:50:37 UTC
Post by Quadibloc
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
So, r1 = r2 + r3 + offset.

Three registers is 15 bits plus a 16-bit offset, which gives you
31 bits. You're left with one bit of opcode, one for load and
one for store.

The /360 had 12 bits for three registers plus 12 bits of offset, so
24 bits left eight bits for the opcode (the RX format).

So, if you want to do this kind of thing, why not go for a full 32-bit
offset in a second 32-bit word?

[...]
Post by Quadibloc
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Have you ever written an assembler for your ISA?
BGB-Alt
2023-11-09 21:36:12 UTC
Post by Thomas Koenig
Post by Quadibloc
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
So, r1 = r2 + r3 + offset.
Three registers is 15 bits plus a 16-bit offset, which gives you
31 bits. You're left with one bit of opcode, one for load and
one for store.
Oh, that is even worse than I understood it as, namely:
LDx Rd, (Rs, Disp16)
...

But, yeah, 1 bit of opcode clearly wouldn't work...
Post by Thomas Koenig
The /360 had 12 bits for three registers plus 12 bits of offset, so
24 bits left eight bits for the opcode (the RX format).
So, if you want to do this kind of thing, why not go for a full 32-bit
offset in a second 32-bit word?
Originally, I had turned any displacements that didn't fit into 9 bits
into a 2-op sequence:
MOV Imm25s, R0
MOV.x (Rb, R0), Rn

Actually, worse yet, the first form of BJX2 only had 5-bit Load/Store
displacements, but it didn't take long to realize that 5 bits wasn't
really enough (say, when roughly 2/3 of the load and store operations
can't fit in the displacement).


But, now, there are Jumbo-encodings, which can encode a full 33-bit
displacement in a 64-bit encoding. Not everything is perfect though,
mostly because these encodings are bigger and can't be used in a bundle.

But, still "less bad" in this sense than my original 48-bit encodings,
where "for reasons", these couldn't co-exist with bundles in the same
code block.

Despite the loss of 48-bit ops though:
The jumbo encodings give larger displacements (33s vs 24u or 17s);
They reuse the existing 32-bit decoders, rather than needing a dedicated
48-bit decoder.


But, yeah, "use another instruction word" if one needs a larger
displacement, is mostly the option that I would probably recommend.


At first, the 5-bit encodings went away, but later came back as a zombie
of sorts (cases emerged where their existence was still valuable).

But, then it later came down to a tradeoff (with the design of XG2):
Do I expand Disp9u to Disp10u, and keep the XGPR approach of using the
Disp5u encodings to encode a Disp6s case (for a small range of negative
displacements), or expand Disp9u to Disp10s?...

In this case, Disp10s won out by a small margin, as I needed non-trivial
negative displacements at least slightly more often than I needed 8K for
structs and stack frames and similar.


But, for most things, a 16-bit displacement would be a waste...
If I were going to go the route of using a signed 12-bit displacement
(like RISC-V), would probably still keep it scaled though, as 8K/16K is
still more useful than 2K.


Branch displacements are typically still hard-wired to a scale of 2
though, partly as the ISA started out with 16-bit ops, and switching XG2
over to a 4-byte scale would have broken its symmetry with the Baseline
ISA.


Though, could pull a cheap trick and repurpose the LSB of branch ops in
XG2, given as-is, it is effectively "Must Be Zero" (all instructions
have a 32-bit alignment in this mode, and branches to an odd address are
not allowed).

So, the idea of a BSR that uses R1 as an alternate Link-Register is
still not (entirely) dead (while at the same time allowing for the
'.text' section to be expanded to 8MB).


There are 64-bit Disp33s and Abs48 branch encodings, but, yeah, they
have costs:
They are 64-bit vs 32-bit, thus, bigger;
Are ignored by the branch predictor, thus, slower;
The Abs48 case is not PC relative
Using it within a program requires a base reloc;
Is generally useful for DLL imports and special cases though (*1).

*1: Its existence is mostly as an alternative in these cases to a more
expensive option:
MOV Addr64, R1
JMP R1
Which needs 128-bits, and is also ignored by the branch predictor.
Post by Thomas Koenig
[...]
Post by Quadibloc
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
Have you ever written an assembler for your ISA?
Yeah, whether someone can write an assembler, or disassembler/emulator,
and not drive themselves insane in the attempt, is possibly a test of
"sanity".

Granted, still not foolproof, as it isn't that bad to write an
assembler/disassembler for x86 either, but trying to decode it in
hardware would be nightmarish.

Best guess I can have would be a "preclassify" stage:
If this is an opcode byte, how long will it be, and will a Mod/RM
follow, ...?
If this is a Mod/RM byte, how many bytes will this add.

Then in theory, one can figure instruction length like:
Fetch OpLen for IP;
Fetch Mod/RM len for IP+OpLen if Mod/RM flag is set;
Add OpLen+ModRmLen.
Add an extra 2/4 bytes if an Immed is present for this opcode.
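
In C, the walk would look something like the following (the four tables
are hypothetical stand-ins to be filled in from the x86 opcode map;
prefixes and 0F escapes are ignored here):

  #include <stdint.h>

  /* These four tables would be filled in from the x86 opcode map: */
  static const uint8_t op_len[256]       = { 0 }; /* opcode byte count    */
  static const uint8_t op_has_modrm[256] = { 0 }; /* Mod/RM byte follows? */
  static const uint8_t op_imm_len[256]   = { 0 }; /* immediate bytes      */
  static const uint8_t modrm_len[256]    = { 0 }; /* Mod/RM+SIB+disp      */

  static int insn_length(const uint8_t *ip) {
      int len = op_len[ip[0]];          /* OpLen for IP                */
      if (op_has_modrm[ip[0]])          /* Mod/RM len for IP+OpLen     */
          len += modrm_len[ip[len]];
      return len + op_imm_len[ip[0]];   /* plus immediate, if present  */
  }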

Nicer to not bother.


For my 75 MHz experiment, did end up adding a similar sort of
"preclassify" logic to deal with instruction-lengths though, at the cost
that now L1 I$ cache-lines are specific to the operating mode in which
they were fetched (which now needs to be checked along with the address
and similar).

Mostly all this is a case of "looking up 4 bits of tag metadata" being
less latency than "feed 9 bits of instruction bits through some LUTs"
(or 12 bits if RISC-V decoding is enabled). There is still some latency
due to MUX'ing and similar, but this part is unavoidable.

So, former case:
8 bits: Classify BJX2 instruction length;
1 bit: Specify Baseline or XG2.
Latter case:
8 bits: Classify BJX2 instruction length;
2 bits: Classify RISC-V instruction length (16/32)
2 bits: Specify Baseline, XG2, RISC-V, or XG2RV.

Which map to 4 bits (IIRC):
(0): 16-bit
(1): (WEX && WxE) || Jumbo
(2): WEX
(3): Jumbo


As-is, after MUX'ing, this can effectively turn op-len determination
into a 4 or 6 bit lookup, say (tag bits 1:0 for two adjacent 32-bit words):
00zz: 32-bit
01zz: 16-bit
1000: 64-bit
1001: 48-bit (unused)
1010: 96-bit (*)
1011: Invalid
11zz: Invalid

*: Here, we just assume that the 3rd instruction word is 00.
Would actually need to check this if either 4-wide bundles or 80-bit
encodings were "actually a thing".
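
So the second stage amounts to something like this C sketch (the tag
codes follow the table above; the function name is made up):

  /* tag0/tag1: 2-bit preclassify codes for two adjacent 32-bit words.
     Returns the fetch length in bits, or -1 for an invalid combo. */
  static int op_len_bits(unsigned tag0, unsigned tag1) {
      switch (tag0) {
      case 0: return 32;              /* plain 32-bit op             */
      case 1: return 16;              /* 16-bit op                   */
      case 2:                         /* extended: check next word   */
          switch (tag1) {
          case 0: return 64;          /* 64-bit (bundle / jumbo)     */
          case 1: return 48;          /* 48-bit (unused)             */
          case 2: return 96;          /* assume 3rd word's tag is 00 */
          default: return -1;
          }
      default: return -1;
      }
  }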

Where, handling both XG2 and WXE (WEX Enable) in the preclassify step
greatly simplifies the logic during instruction fetch.

This could, in premise, be reduced further in an "XG2 only" core, or to
a lesser extent by eliminating the original XGPR scheme. These are not
currently planned though (say, the first-stage lookup width could be
reduced from 8 to 5 or 7 bits).

...
Quadibloc
2023-11-09 21:51:31 UTC
Post by BGB-Alt
Post by Thomas Koenig
So, r1 = r2 + r3 + offset.
Three registers is 15 bits plus a 16-bit offset, which gives you 31
bits. You're left with one bit of opcode, one for load and one for
store.
LDx Rd, (Rs, Disp16)
...
But, yeah, 1 bit of opcode clearly wouldn't work...
And indeed, he is correct, that is what I'm trying to do.

But I easily solve _most_ of the problem.

I just use 3 bits for the index register and the base register.

The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.

16-bit register-to-register instructions use eight bits to specify their
source and destination registers, so both registers must be from the same
group of eight registers.

This lends itself to writing code where four distinct threads are
interleaved, helping pipelining in implementations too cheap to have
out-of-order execution.

The index register can be one of registers 1 to 7 (0 means no indexing).

The base register can be one of registers 25 to 31. (24, or a 0 in the
three-bit base register field, indicates a special addressing mode.)

This is sort of reminiscent of System/360 coding conventions.

The special addressing modes do stuff like using registers 17 to 23 as
base registers with a 12 bit displacement, so that additional short
segments can be accessed.

As I noted, shaving off two bits each from two fields gives me four more
bits, and five bits is exactly what I need for the opcode field.

Unfortunately, I needed one more bit, because I also wanted 16-bit
instructions, and they take up too much space. That led me... to some
interesting gyrations, but I finally found a compromise that was
acceptable to me for saving those bits, so acceptable that I could drop
the option of using the block header to switch to using "full" instructions
instead. Finally!

John Savard
BGB-Alt
2023-11-09 23:49:03 UTC
Post by Quadibloc
Post by BGB-Alt
Post by Thomas Koenig
So, r1 = r2 + r3 + offset.
Three registers is 15 bits plus a 16-bit offset, which gives you 31
bits. You're left with one bit of opcode, one for load and one for
store.
LDx Rd, (Rs, Disp16)
...
But, yeah, 1 bit of opcode clearly wouldn't work...
And indeed, he is correct, that is what I'm trying to do.
But I easily solve _most_ of the problem.
I just use 3 bits for the index register and the base register.
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.

Unless, maybe, registers were being treated like a stack, but even then,
this is still gonna suck.

Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.


Theoretically, 32 registers should be "pretty good", but I ended up with
64 partly due to arguable weakness in my compiler's register allocation.

Say, 64 makes it possible to static-assign most of the variables in most
of the functions, which avoids the need for spill and fill (at least
with a register allocator that isn't smart enough to locally assign
registers across basic-block boundaries).

I am not sure if a more clever compiler (such as GCC) could also find
ways to make effective use of 64 GPRs.


I guess, IA-64 did have 128 registers in banks of 32. Not sure how well
this worked.
Post by Quadibloc
16-bit register-to-register instructions use eight bits to specify their
source and destination registers, so both registers must be from the same
group of eight registers.
When I added R32..R63, I ended up not bothering adding any way to access
them from 16-bit ops.

So:
R0..R15: Generally accessible for all of 16-bit land;
R16..R31: Accessible from a limited subset of 16-bit operations.
R32..R63: Inaccessible from 16-bit land.
Only accessible for an ISA subset for 32-bit ops in XGPR.

Things are more orthogonal in XG2:
No 16-bit ops;
All of the 32-bit ops can access R0..R63 in the same way.
Post by Quadibloc
This lends itself to writing code where four distinct threads are
interleaved, helping pipelining in implementations too cheap to have
out-of-order execution.
Considered variations on this in my case as well, just with static
control flow.

However, BGBCC is nowhere near clever enough to pull this off...

Best that can be managed is doing this sort of thing manually (this is
sort of how "functions with 100+ local variables" are born).

In theory, a compiler could infer when blocks of code or functions are
not sequentially dependent and inline everything and schedule it in
parallel, but alas, this sort of thing requires a bit of cleverness that
is hard to pull off.
Post by Quadibloc
The index register can be one of registers 1 to 7 (0 means no indexing).
The base register can be one of registers 25 to 31. (24, or a 0 in the
three-bit base register field, indicates a special addressing mode.)
This is sort of reminiscent of System/360 coding conventions.
OK.
Post by Quadibloc
The special addressing modes do stuff like using registers 17 to 23 as
base registers with a 12 bit displacement, so that additional short
segments can be accessed.
As I noted, shaving off two bits each from two fields gives me four more
bits, and five bits is exactly what I need for the opcode field.
Unfortunately, I needed one more bit, because I also wanted 16-bit
instructions, and they take up too much space. That led me... to some
interesting gyrations, but I finally found a compromise that was
acceptable to me for saving those bits, so acceptable that I could drop
the option of using the block header to switch to using "full" instructions
instead. Finally!
A more straightforward encoding would make things, more straightforward...


Main debates I think are, say:
Whether to start with the MSB of each word (what I had often done);
Or, start from the LSB (like RISC-V);
Whether 5 or 6 bit register fields;
How much bits for immediate and opcode fields;
...

Bundling and predication may eat a few bits, say:
00: Scalar
01: Bundle
10/11: If-True / If-False

In my case, this did leave an ugly hack case to support conditional ops
in bundles. Namely, the instruction to "Load 24 bits into R0" has
different interpretations in each case (Scalar: Load 24 bits into R0;
Bundle: Jumbo Prefix; If-True/If-False, repeat a different instruction
block, but understood as both conditional and bundled).

This could be fully orthogonal with 3 bits, but it seems, this is a big ask:
000, Unconditional, Scalar
001, Unconditional, Bundle
010, Special, Scalar (Eg: Large constant load or Branch)
011, Special, Bundle (Eg: Jumbo Prefix)
100, If-True, Scalar
101, If-True, Bundle
110, If-False, Scalar
111, If-False, Bundle


This leads to a lopsided encoding though, and it seems like things only
really fit together nicely with a limited combination of sizes.

Say, for an immediate field:
24+ 9 => 33s
24+24+16 => 64
This is almost magic...

Though:
26+ 7 => 33s
26+26+12 => 64
Could also work.


But, does end up with an ISA layout where immediate values are mostly 7u
or 7n, which is not nearly as attractive as 9u and 9n.

Say, for Load/Store displacement hit (rough approximations, from memory):
5u: 35%
7u: 65%
9u: 90%
...


All turns into a bit of an annoying numbers game sometimes...


But, this ended up as part of why I ended up with XG2, which didn't give
me everything I wanted, and the encodings of some things does have more
"dog chew" than I would like (I would have preferred if everything were
nice contiguous fields, rather than the bits for each register field
being scattered across the instruction word).

But, the numbers added up in a way that worked better than most of the
alternatives I could come up with (and happened to also be the "least
effort" implementation path).


Granted, I still keep half expecting people to be like "Dude, just jump
onto the RISC-V wagon...".

Or, failing this, at least implement enough of RISC-V to be able to run
Linux on it (but, this would require significant architectural changes;
being able to run a "stock" RV64GC Linux build would effectively require
partially cloning a bunch of SiFive's architectural choices or similar;
which is not something I would be happy with).

But, otherwise, pretty much any other option in this area would still
mean a porting effort...


Well, and the on/off consideration of trying to port a BSD variant, as
BSD seemed like potentially less effort (there are far fewer implicit
assumptions of GNU-related stuff being used).

...
Quadibloc
2023-11-10 04:37:16 UTC
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.

It's only in the 16-bit operate instructions that this splitting of
registers is actively present as a constraint. It is needed to make
16-bit operate instructions possible.

So the cure is that if a compiler finds this too much trouble, it
doesn't have to use the 16-bit instructions.

Of course, if compilers can't use them, that raises the question of
whether 16-bit instructions are worth having. Without them, the
complications that I needed to be happy about my memory-reference
instructions could have been entirely avoided.

John Savard
BGB
2023-11-10 06:46:43 UTC
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
It's only in the 16-bit operate instructions that this splitting of
registers is actively present as a constraint. It is needed to make
16-bit operate instructions possible.
FWIW: I went with 16-bit ops with 4-bit register fields (with a small
subset with 5-bit register fields).

Granted, layout was different than SH:
zzzz-nnnn-mmmm-zzzz //typical SH layout
zzzz-zzzz-nnnn-mmmm //typical BJX2 layout

Where, as noted, typical 32-bit layout in my case is:
111p-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ
And, in XG2:
NMOP-ZwZZ-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ




I guess, a "minor" reorganization might yield, say:
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R)
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R)
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI, Imm10)
PwZZ-ZZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI, Imm10)
PwZZ-ZZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)
PwZZ-ZZZZ-iiii-iiii-iiii-iiii-iiii-iiii (Imm24)

Which seems like actually a relatively nice layout thus far...


Possibly, going further:
Pw00-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ (3R Space)
Pw00-1111-ZZnn-nnnn-mmmm-mmZZ-ZZZZ-ZZZZ (2R Space)

Pw01-ZZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (Ld/St Disp10)

Pw10-0ZZZ-ZZnn-nnnn-mmmm-mmii-iiii-iiii (3RI Imm10, ALU Block)
Pw10-1ZZZ-ZZnn-nnnn-ZZZZ-ZZii-iiii-iiii (2RI Imm10)

Pw11-0ZZZ-ZZnn-nnnn-iiii-iiii-iiii-iiii (2RI, Imm16)

Pw11-1110-iiii-iiii-iiii-iiii-iiii-iiii BRA Disp24s (+/- 32MB)
Pw11-1111-iiii-iiii-iiii-iiii-iiii-iiii BSR Disp24s (+/- 32MB)

1111-111Z-iiii-iiii-iiii-iiii-iiii-iiii Jumbo


Though, might almost make sense for PrWEX to be N/E, as the PrWEX blocks
seem to be infrequently used in BJX2 (basically, for predicated
instructions that exist as part of an instruction bundle).

Say:
Scalar: 77.3%
WEX : 8.9%
Pred : 13.5%
PrWEX : 0.3%
Post by Quadibloc
So the cure is that if a compiler finds this too much trouble, it
doesn't have to use the 16-bit instructions.
Of course, if compilers can't use them, that raises the question of
whether 16-bit instructions are worth having. Without them, the
complications that I needed to be happy about my memory-reference
instructions could have been entirely avoided.
For performance optimized cases, I am starting to suspect 16-bit ops are
not worth it.

For size optimization, they make sense; but size optimization also means
mostly confining register allocation to R0..R15 in my case, with
heuristics for when to enable additional registers, where enabling the
higher registers effectively hinders the use of 16-bit instructions.


The other option I have found is that, rather than optimizing for
smaller instructions (as in an ISA with 16 bit instructions), one can
instead optimize for doing stuff in as few instructions as it is
reasonable to do so, which in turn further goes against the use of
16-bit instructions.


And, thus far, I am ending up building a lot of my programs in XG2 mode
despite the slightly worse code density (leaving the main "holdouts" for
the Baseline encoding mostly being the kernel and Boot ROM).

The kernel could go over to XG2 without too much issue, mostly leaving
the Boot ROM. Switching over the ROM would require some functional
tweaks (coming out of reset in a different mode), as well as probably
either increasing the size of the ROM or removing some stuff (building
the Boot ROM as-is in XG2 mode would exceed the current 32K limit).


Granted, the main things the ROM contains are a bunch of boot-time sanity
check stuff, a RAM counter, a FAT32 driver, and stuff to init the graphics
module (such as a boot-time ASCII font, *).

*: Though, this font saves some space by only encoding the ASCII-range
characters, and packing the character glyphs into 5x6 pixels (allowing
32 bits per glyph, rather than the 64 bits needed for an 8x8 glyph). This
won out aesthetically over using a 7-segment or 14-segment font (as well
as it taking more complex logic to unpack 7 or 14 segments into an 8x8
character cell).
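
For illustration, unpacking such a glyph might look like the following
(the bit order within the 32-bit word is an assumption here):

  #include <stdint.h>

  /* Unpack a 5x6 glyph (30 bits, row-major, 1 bit per pixel) from a
     32-bit word into an 8x8 cell (one byte per row). */
  static void unpack_glyph_5x6(uint32_t glyph, uint8_t cell[8]) {
      for (int y = 0; y < 8; y++)
          cell[y] = 0;                             /* blank padding rows */
      for (int y = 0; y < 6; y++) {
          uint8_t row = (glyph >> (y * 5)) & 0x1F; /* 5 pixels per row   */
          cell[y + 1] = (uint8_t)(row << 1);       /* center in the cell */
      }
  }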

Where, say, unlike a CGA or VGA, the initial font is not held in a
hardware ROM. There was originally, but it was cheaper to manage the
font in software, effectively using the VRAM as a plain color-cell
display in text mode.

...
MitchAlsup
2023-11-12 21:28:11 UTC
Post by BGB
For performance optimized cases, I am starting to suspect 16-bit ops are
not worth it.
<
BINGO:: another near convert.......
<
Post by BGB
For size optimization, they make sense; but size optimization also means
mostly confining register allocation to R0..R15 in my case, with
heuristics for when to enable additional registers, where enabling the
higher registers effectively hinders the use of 16-bit instructions.
The other option I have found is that, rather than optimizing for
smaller instructions (as in an ISA with 16 bit instructions), one can
instead optimize for doing stuff in as few instructions as it is
reasonable to do so, which in turn further goes against the use of
16-bit instructions.
<
This is the My 66000 path: execute fewer instructions even if they take
the same number of bytes in .text.
<
Scott Lurndal
2023-11-10 14:51:44 UTC
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.
BGB
2023-11-10 18:24:08 UTC
Post by Scott Lurndal
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.
Yeah.

Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
suck".

Or, even smaller cases, like, "most instructions can use all the
registers, but these ops only work on a subset" is kind of an annoyance
(this is a big part of why I bothered with the whole XG2 thing).


Much better to have a big flat register space.


Though, within reason.
Say:
* 8: Pain, can barely hold anything in registers.
** One barely has enough for working values for expressions, etc.
* 16: Not quite enough, still lots of spill/fill.
* 32: Can work well, with a good register allocator;
* 64: Can largely eliminate spill/fill, but a little much.
* 128: Too many.
* 256: Absurd.

So, say, 32 and 64 seem to be the "good" area, where with 32, a majority
of the functions can sit comfortably with most or all of their variables
held in registers. But, for functions with a large number of variables
(say, 100 or more), spill/fill becomes an issue (*).

Having 64 allows a majority of functions to use a "static assign
everything" strategy, where spill/fill can be eliminated entirely (apart
from the prolog/epilog sequences), and otherwise seems to deal better
with functions with large numbers of variables.


*: And is more of a pain with a register allocator design which can't
keep any non-static-assigned values in registers across basic-block
boundaries. This issue is, ironically, less obvious with 16 registers
(since spill/fill runs rampant anyways). But having nearly every basic
block start with a blob of stack loads, and end with a blob of stores,
only to reload them all again on the other side of a label, is fairly
obvious.

Having 64 registers does at least mostly hit this nail...


Meanwhile, for 128, there aren't really enough variables and temporaries
in most functions to make effective use of them. Also, 7-bit register
fields won't fit easily into a 32-bit instruction word.


As for register arguments:
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoo-in", except for the cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.

Though, admittedly, for a lot of my programs I had still ended up going
with 8 register arguments with 64 GPRs, mostly as the gain from 16
arguments is small relative to the cost of spending an additional 64
bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16-argument mode).

...



Current leaning is also that:
32-bit primary instruction size;
32/64/96 bit for variable-length instructions;
Is "pretty good".

In performance-oriented use cases, 16-bit encodings "aren't really worth
it".
In cases where you need a 32 or 64 bit value, being able to encode them
or load them quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).


As for addressing modes:
(Rb, Disp) : ~ 66-75%
(Rb, Ri) : ~ 25-33%
Can address the vast majority of cases.
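
In C terms, these map onto the two most common access patterns, e.g.:

  typedef struct { int x, y; } Point;

  int pick_sum(Point *p, int *arr, int i) {
      int a = p->y;     /* (Rb, Disp): base register + small scaled disp */
      int b = arr[i];   /* (Rb, Ri): base register + index register      */
      return a + b;
  }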

Displacements are most effective when scaled by the size of the element
type, as unaligned displacements are exceedingly rare. The vast majority
of displacements are also positive.

Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".

Most other possible addressing modes can be mostly ignored.
Auto-increment becomes moot if one has superscalar or VLIW;
(Rb, Ri, Disp) is only really applicable in niche cases
Eg, array inside struct, etc.
...



RISC-V did sort of shoot itself in the foot in several of these areas,
albeit with some workarounds in "Bitmanip":
SHnADD, can mimic a LEA, allowing array access in fewer ops.
PACK, allows an inline 64-bit constant load in 5 instructions...
LUI+ADD+LUI+ADD+PACK
...
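
Presumably "ADD" here means ADDI; as a sketch, the five immediate
fields fall out of the usual per-half LUI+ADDI split, with PACK then
joining the low 32 bits of the two halves:

  #include <stdint.h>

  /* LUI rA, hi20_lo ; ADDI rA, rA, lo12_lo
     LUI rB, hi20_hi ; ADDI rB, rB, lo12_hi
     PACK rd, rA, rB    -- rd = rB[31:0] : rA[31:0] */
  typedef struct { int32_t hi20_lo, lo12_lo, hi20_hi, lo12_hi; } ConstLoad5;

  static ConstLoad5 split_const64(uint64_t x) {
      ConstLoad5 c;
      uint32_t lo = (uint32_t)x, hi = (uint32_t)(x >> 32);
      /* +0x800 compensates for the sign-extended 12-bit ADDI immediate */
      c.hi20_lo = (int32_t)((lo + 0x800) >> 12);
      c.lo12_lo = (int32_t)(lo << 20) >> 20;
      c.hi20_hi = (int32_t)((hi + 0x800) >> 12);
      c.lo12_hi = (int32_t)(hi << 20) >> 20;
      return c;
  }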

Still not ideal...

An extra cycle for memory access is not ideal for a close second place
addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.

But, still better than the situation where one does not have these
instructions.

...
Stephen Fuld
2023-11-10 19:17:37 UTC
Post by BGB
Post by Scott Lurndal
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
As soon as you make 'general purpose registers' not 'general'
you've significantly complicated register allocation in compilers
and likely caused additional memory accesses due to the need to
spill registers unnecessarily.
Yeah.
Either banks of 8, or an 8 data + 8 address, or ... would kinda "rather
suck".
Or, even smaller cases, like, "most instructions can use all the
registers, but these ops only work on a subset" is kind of an annoyance
(this is a big part of why I bothered with the whole XG2 thing).
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions. So
an alternative is to break the requirement that all register specifier
fields in the instruction be the same length. So, for example, allow
access to all registers from one source operand position, but say only
half from the other source operand position. So, for a system with 32
registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
as with commutative operations like adds, this doesn't hurt at all.

Yes, this makes register allocation in the compiler harder. And
occasionally you might need an extra instruction to copy a value to the
half size field, but on high end systems, this can be done in the rename
stage without taking an execution slot.

A more extreme alternative is to only allow the destination field to
also be one bit smaller. Of course, this makes things even harder for
the compiler, and probably requires extra "copy" instructions more
frequently, but sometimes you just gotta do what you gotta do. :-(
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup
2023-11-11 18:11:04 UTC
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions. So
an alternative is to break the requirement that all register specifier
fields in the instruction be the same length. So, for example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
<
Post by Stephen Fuld
access to all registers from one source operand position, but say only
half from the other source operand position. So, for a system with 32
registers, you would need 5 plus 5 plus 4 bits. Much of the time, such
as with commutative operations like adds, this doesn't hurt at all.
Yes, this makes register allocation in the compiler harder. And
occasionally you might need an extra instruction to copy a value to the
half size field, but on high end systems, this can be done in the rename
stage without taking an execution slot.
A more extreme alternative is to only allow the destination field to
also be one bit smaller. Of course, this makes things even harder for
the compiler, and probably requires extra "copy" instructions more
frequently, but sometimes you just gotta do what you gotta do. :-(
BGB-Alt
2023-11-11 20:33:20 UTC
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length.  So, for
example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
<
Or, a similar role is served by my Jumbo-Op64 prefix.

So, there are two different Jumbo prefixes:
Jumbo-Imm, which mostly just makes the immed/disp field bigger;
Jumbo-Op64, which mostly extends the opcode and other things;
May extend immediate, but less so, and that is not its main purpose.

Op64 also, optionally:
Was the original mechanism to address R32..R63, before the XGPR and XG2
encodings were added, and is needed (in Baseline) for the parts of the
ISA not covered by the XGPR encodings;
Adds a potential 4th register, extra displacement (or a smaller immediate
extension), or rounding-mode / opcode bits (depending on the base
instruction).

As-is, 8 bits in the Op64 prefix are Must-Be-Zero; they are designated
specifically toward expanding the opcode space (with the 00 case mapping
to the same instruction as in the basic 32-bit encoding).
Post by MitchAlsup
Post by Stephen Fuld
access to all registers from one source operand position, but say only
half from the other source operand position.  So, for a system with 32
registers, you would need 5 plus 5 plus 4 bits.  Much of the time,
such as with commutative operations like adds, this doesn't hurt at all.
Yes, this makes register allocation in the compiler harder.  And
occasionally you might need an extra instruction to copy a value to
the half size field, but on high end systems, this can be done in the
rename stage without taking an execution slot.
A more extreme alternative is to only allow the destination field to
also be one bit smaller.  Of course, this makes things even harder for
the compiler, and probably requires extra "copy" instructions more
frequently, but sometimes you just gotta do what you gotta do. :-(
Stephen Fuld
2023-11-15 18:38:56 UTC
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length.  So, for
example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix. Worst case, this is the same as
my original proposal - one extra, not really executed, instruction
(prefix versus register to register move) for one where you need to use
it, but this idea might, by allowing the prefix to specify multiple
instructions, save more than one extra "instruction". The only downside
is it requires an additional op code.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup
2023-11-15 19:02:00 UTC
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length.  So, for
example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
<
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
<
< Worst case, this is the same as
Post by Stephen Fuld
my original proposal - one extra, not really executed, instruction
<
Which is why I use the term instruction-modifier.
<
Post by Stephen Fuld
(prefix versus register to register move) for one where you need to use
it, but this idea might, by allowing the prefix to specify multiple
instructions, save more than one extra "instruction". The only downside
is it requires an additional op code.
<
But by having an instruction-modifier that can add bits to several
succeeding instructions, you can avoid cluttering up the ISA with things
like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
enumeration space rather than consuming it.
Stephen Fuld
2023-11-15 19:58:25 UTC
Post by MitchAlsup
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length.  So, for
example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
<
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here. Can you show an example?
Post by MitchAlsup
<
<                                         Worst case, this is the same as
Post by Stephen Fuld
my original proposal - one extra, not really executed, instruction
<
Which is why I use the term instruction-modifier.
Agreed.
Post by MitchAlsup
<
Post by Stephen Fuld
(prefix versus register to register move) for one where you need to
use it, but this idea might, by allowing the prefix to specify
multiple instructions, save more than one extra "instruction".  The
only downside is it requires an additional op code.
<
But by having an instruction-modifier that can add bits to several
succeeding instructions, you can avoid cluttering up ISA with things
like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
enumeration space not consume it.
In the general case, I certainly agree. But here you need a different
op-code than CARRY, as this has different semantics, and I think the new
instruction modifier has no other use, hence it is an additional op code
versus the original proposal of using essentially a register copy
instruction, which already exists (i.e. a load with a zero displacement
and the source register as the address modifier).
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
MitchAlsup
2023-11-15 21:10:52 UTC
Permalink
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the instructions.
So an alternative is to break the requirement that all register
specifier fields in the instruction be the same length.  So, for
example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the prefix
instruction specify which register to use instead of the one specified
in the reduced register specifier for whichever instructions in its
shadow have the bit set in the prefix.
<
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here. Can you show an example?
Let us postulate a MoreBits instruction-modifier with a 16-bit immediate
field. Now each 16-bit instruction, which has access to only 8 registers,
takes 2 bits per specifier from that immediate, so all of its register
specifiers become 5 bits. The immediate supplies the bits, and as bits are
consumed the Decoder shifts the field down by the consumed amount. When the
last bit has been consumed, you would need another MB immediate to supply
more bits. Since only 16-bit instructions are "limited", one MB should last
about a basic block or extended basic block.

Note I don't care how the bits are apportioned, formatted, consumed, ...
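
To make that concrete, a minimal C sketch of the decoder side; the bit
order and apportionment here are my own assumptions, since the above
explicitly leaves them open:

#include <stdint.h>

typedef struct {
    uint16_t bits;   /* remaining MoreBits immediate */
    int      avail;  /* bits not yet consumed */
} MoreBits;

/* Widen a 3-bit register specifier to 5 bits by pulling two bits
   off the low end of the shared immediate. */
static int widen_specifier(MoreBits *mb, int spec3)
{
    int hi = mb->bits & 3;    /* take two bits for this specifier */
    mb->bits >>= 2;           /* Decoder shifts the field down */
    mb->avail -= 2;           /* when this hits 0, a new MB is needed */
    return (hi << 3) | spec3; /* full 5-bit register number */
}
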
Post by Stephen Fuld
Post by MitchAlsup
<
<                                         Worst case, this is the same as
Post by Stephen Fuld
my original proposal - one extra, not really executed, instruction
<
Which is why I use the term instruction-modifier.
Agreed.
Post by MitchAlsup
<
Post by Stephen Fuld
(prefix versus register to register move) for one where you need to
use it, but this idea might, by allowing the prefix to specify
multiple instructions, save more than one extra "instruction".  The
only downside is it requires an additional op code.
<
But by having an instruction-modifier that can add bits to several
succeeding instructions, you can avoid cluttering up ISA with things
like ADC, SBC, IMULD, DDIV, ....... So, in the end, you save OpCode
enumeration space not consume it.
In the general case, I certainly agree. But here you need a different
op-code than CARRY, as this has different semantics, and I think the new
instruction modifier has no other use, hence it is an additional op code
versus the original proposal of using essentially a register copy
instruction, which already exists (i.e. a load with a zero displacement
and the source register as the address modifier).
CARRY is your access to ALL extended precision calculations (saving 20+
OpCodes when you consider a robust commercial ISA rather than an Academic
ISA.) Carry accesses integer arithmetic, shifts, extracts, inserts, and
exact floating point calculations larger than 64-bits including Kahan-
Babuška summation. {{Not bad for 1 OpCode !!}}

Similarly:: VEC-LOOP provides access to 1,000+ SIMD instructions and 400+
Vector instructions at the cost of 2 units in the OpCode Space !! It also
allows a future implementation to execute wider (or narrower) than SIMD
with no change in the instruction sequence.
MoreBits is effectively just like REX except it can span instructions.
Stephen Fuld
2023-11-20 17:31:11 UTC
Permalink
Post by MitchAlsup
Post by MitchAlsup
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the
instructions. So an alternative is to break the requirement that
all register specifier fields in the instruction be the same
length.  So, for example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the
prefix instruction specify which register to use instead of the one
specified in the reduced register specifier for whichever
instructions in its shadow have the bit set in the prefix.
<
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here.  Can you show an example?
Let us postulate a MoreBits instruction-modifier with a 16-bit immediate
field. Now each 16-bit instruction, which has access to only 8 registers,
takes 2 bits per specifier from that immediate, so all of its register
specifiers become 5 bits. The immediate supplies the bits, and as bits are
consumed the Decoder shifts the field down by the consumed amount. When the
last bit has been consumed, you would need another MB immediate to supply
more bits. Since only 16-bit instructions are "limited", one MB should last
about a basic block or extended basic block.
Note I don't care how the bits are apportioned, formatted, consumed, ...
Oh, so you have changed the meaning of the "immediate bit map" from
specifying which of the following instructions it applies to (e.g.
CARRY) to the actual data. I like it!
If using 16 bit instructions, and if you only have one small register
field per instruction, I think it is better to make "MoreBits" a 16 bit
instruction modifier itself, with say a five bit op code and an eleven
bit immediate, which supplies the extra bit for the next 11
instructions. More compact than a 32 bit instruction, and almost as
"far reaching". If you need more than 11 bits, even if you add a second
MB instruction modifier 11 instructions later, you are still no worse
off than an instruction modifier plus a 16 bit immediate.
Of course, if you need more than one extra bit per instruction, then
more "drastic" measures, such as your proposal, are needed.
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
BGB
2023-11-20 23:51:46 UTC
Permalink
Post by Stephen Fuld
Post by MitchAlsup
Post by MitchAlsup
Post by Stephen Fuld
Post by MitchAlsup
Post by Stephen Fuld
Post by BGB
Much better to have a big flat register space.
Yes, but sometimes you just need "another bit" in the
instructions. So an alternative is to break the requirement that
all register specifier fields in the instruction be the same
length.  So, for example, allow
<
Another way to get a few more bits is to use a prefix-instruction like
CARRY for those seldom needed bits.
Good point. A combination of the two ideas could be to have the
prefix instruction specify which register to use instead of the one
specified in the reduced register specifier for whichever
instructions in its shadow have the bit set in the prefix.
<
You could have the prefix instruction supply the missing bits of all
shortened register specifiers.
I am not sure what you are proposing here.  Can you show an example?
Let us postulate a MoreBits instruction-modifier with a 16-bit immediate
field. Now each 16-bit instruction, which has access to only 8 registers,
takes 2 bits per specifier from that immediate, so all of its register
specifiers become 5 bits. The immediate supplies the bits, and as bits are
consumed the Decoder shifts the field down by the consumed amount. When the
last bit has been consumed, you would need another MB immediate to supply
more bits. Since only 16-bit instructions are "limited", one MB should last
about a basic block or extended basic block.
Note I don't care how the bits are apportioned, formatted, consumed, ...
Oh, so you have changed the meaning of the "immediate bit map" from
specifying which of the following instructions it applies to (e.g.
CARRY) to the actual data.  I like it!
If using 16 bit instructions, and if you only have one small register
field per instruction, I think it is better to make "MoreBits" a 16 bit
instruction modifier itself, with say a five bit op code and an eleven
bit immediate, which supplies the extra bit for the next 11
instructions.  More compact than a 32 bit instruction, and almost as
"far reaching".  If you need more than 11 bits, even if you add a second
MB instruction modifier 11 instructions later, you are still no worse
off than an instruction modifier plus a 16 bit immediate.
Of course, if you need more than one extra bit per instruction, then
more "drastic" measures, such as your proposal, are needed.
Ironically, this is closer to how 32-bit ops were originally intended to
work in BJX2, and how they worked in BJX1 (where most of the 32-bit ops
were basically prefixes on the existing 16-bit SuperH ops).

Say:
  ZnmZ      //typical layout of a 16-bit op, R0..R15
  8Ceo-ZnmZ //Op gains an extra register field, and R16..R31.

Then, in the original form of BJX2:
  ZZnm
  F0eo-ZZnm

For some ops, the 3rd register (Ro) would instead operate as a 5-bit
immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.

When I later added the Imm9 encodings, the encoding of the other ops was
changed to be more consistent with this:
  F0nm-ZeoZ
  F2nm-Zeii

This was originally designed as a possible successor ISA, but it seemed
"better" to back-fold it into my existing ISA (effectively replacing the
original encoding scheme in the process).

This encoding was relatively stable, until Jumbo prefixes were added and
shook things up a little more (and the more recent shakeup with XG2,
which has effectively fragmented the ISA into two sub-variants with
neither being a "clear winner", *).

*: The previous Baseline encoding is better for code density (due to
still having 16-bit ops), XG2 is better for performance (due to more
orthogonality, such as the ability to use every register from every
instruction, and adding a bit to the Immed/Displacement fields, or 3 in
the case of plain branches).

Had considered possible options for "make XG2's encoding less dog
chewed", but the issue is not as simple as shifting the bits around
(shuffling the bits would just make it dog-chewed in other ways).

So, existing encoding, expressed in bits, is roughly:
NMOP-ZwZZ-nnnn-mmmm ZZZZ-Qnmo-oooo-ZZZZ

And the possible revised form:
PwZZ-ZZZZ-ZZnn-nnnn-mmmm-mmoo-oooo-ZZZZ
However, what I have thus far would effectively amount to nearly a full
reboot of the encoding (which would be a huge pile of effort), so less
likely to be "worth it" in the name of a slightly less chewed encoding
scheme (and, hell, RISC-V is going along OK with its immediate fields
being effectively confetti).
Though, another option could be closer to a straight reshuffle:
NMOP-ZwZZ-nnnn-mmmm YYYY-Qnmo-oooo-XXXX
NMIP-ZwZZ-nnnn-mmmm YYYY-Qnmi-iiii-iiii
To:
PwZZ-ZQnn-nnnn-YYYY-mmmm-mmoo-oooo-XXXX
PwZZ-ZQnn-nnnn-YYYY-mmmm-mmii-iiii-iiii

So, the existing ISA listing could be mapped over mostly as-is, with the
main changes (besides the bit-reshuffle) being in the immediate field.

However:
DDDP-0w00-nnnn-mmmm 1100-dddd-dddd-dddd
To:
Pw00-0ddd-dddd-YYYY-dddd-dddd-dddd-dddd

Is gonna need some new relocs, ...

OTOH, it would allow making the F8 block's encoding consistent with the
rest of the ISA.

But, recently I am left feeling uncertain if any of this is anything
more than moot...

Did recently make a little bit of progress towards having a GUI in
TestKern, in that I now have a console window with a shell "sorta" able
to run inside this console.

Has partly opened the "Pandora's box" though of needing to deal with
multitasking, re-entrance, and the possible need for mutex locking
(as-is, it was "barely working" in that I had to carefully avoid
re-entrance in a few areas to keep the kernel from exploding; as none of
this stuff has mutexes).

Well, and then having to fix up issues like making the scheduler not try
to schedule the syscall-handler task and then promptly causing the "OS"
to explode (for now, these are special cased; I may need to come up with
a general way of flagging some tasks as "do not schedule", since they
will exist as special-cases to handle syscalls or specifically as the
target of inter-process VTable calls, as is the case with TKGDI, where
the call itself will schedule the task). Where, in this case, the
mechanism for inter-task control flow will take a form resembling that
of COM objects (it is likely that TKRA-GL may need to be reworked into
this form as well, *2).
Also looking like I will need to rework how the shell works.
Effectively, now, rather than the CLI running directly in the kernel, it
needs to be a userland (or "superuserland", *) task communicating with
the kernel via syscalls. So, the shell can no longer directly invoke the
PE/COFF loader, but will now need to use a "CreateProcess" call (and
then probably sleep-loop until the created process terminates).

*: Where a task is being run more like a userland task, but still
running in supervisor mode (the syscall handler task and TKGDI backend
run in this mode).

Where, say:
Thread: Logical thread of execution within some existing process;
Process: Distinct collection of 1 or more threads within a shared
address space and shared process identity (may have its own address
space, though as-of-yet, TestKern uses a shared global address space);
Task: Supergroup that includes Threads, Processes, and other thread-like
entities (such as call and method handlers), may be either thread-like
or process-like.
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
Though, will need to probably add more special case handling such that
the Syscall task can not yield or try to itself make a syscall (the only
valid exit point for this task being where it transfers control back to
the caller and awaits the next syscall to arrive; and it is not valid
for this task to try to syscall back into itself).
As-is, I am running a lot of tasks in userland, but for now there is
effectively no real memory protection in TestKern, but the plan is to
try to resolve this. This is itself work; needing to gradually weed out
programs accessing privileged resources; and in some system-level APIs
needing to distinguish "Local" from "Global" memory ("malloc" will give
local memory, whereas "tkgGlobalAlloc" will give global memory; the idea
being for now that global memory will be identity mapped and accessible
across process boundaries).

Doesn't "yet" matter, but easier to try to address this now than later.

*2: For TKRA-GL, it generally needs to work with physically mapped
memory and MMIO to access the rasterizer module, which means the backend
parts will likely need to run either in "superuserland" or in "kernel land".

Likely rework is to try to separate the OpenGL API front-end from some
backend machinery, which will be a more narrowly focused interface
mostly dealing with things like:
Uploading textures and similar;
Drawing vertex arrays.
All the things like glEnable/glDisable, matrix-stack manipulations, etc,
will need to be kept in the front-end (making a context switch every
time the program used glEnable or glColor4f or similar would be an
impractical level of overhead).

Though, in Windows, the division point seems to be a little higher
(closer to the level of the OpenGL API itself). To mimic the Windows
model, I would effectively need two division points:

A front-end interface whose purpose is mostly to wrap over a bunch of
"GetProcAddress" funk (with some way to plug in an interface to provide
the GetProcAddress backend). This isn't asking too much more, since one
needs to provide all the GetProcAddress cruft either way.

A division interface between the frontend part, which needs to run
directly in the userland task, and the backend part, which deals with
the "actually making stuff happen" parts.

One could design a lower-level API for this latter part, but
(ironically) it would probably end up sort of resembling some sort of
weird OpenGL/Direct3D hybrid...

Though, could still do like TKGDI and provide a C wrapper over the
internal VTable calls:
  HRESULT fooDooTheThing()
  {
      fooContext *ctx;
      ctx = fooGetCurrentContext();
      return (ctx->vt->DooTheThing(ctx));
  }
  ...

A lot of this stuff gets kind of annoying sometimes though...

Like, one can't just "do the thing", they end up needing a bunch of
layers and boilerplate getting from "the place where the thing needs to
be done" to "the place where the thing can be done" (but, I guess, the
other alternative being to effectively not have an OS at all).

...
MitchAlsup
2023-11-21 22:12:18 UTC
Permalink
Post by BGB
For some ops, the 3rd register (Ro) would instead operate as a 5-bit
immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.
Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit register
specifier into a 5-bit immediate of either positive or negative integer
value. This makes::
1<<n
~0<<n
container.bitfield = 7;
single instructions.
Post by BGB
Thread: Logical thread of execution within some existing process;
has a register file and a stack.
Post by BGB
Process: Distinct collection of 1 or more threads within a shared
has a memory map, a heap, and a vector of threads.
Post by BGB
address space and shared process identity (may have its own address
space, though as-of-yet, TestKern uses a shared global address space);
Task: Supergroup that includes Threads, Processes, and other thread-like
entities (such as call and method handlers), may be either thread-like
or process-like.
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Post by BGB
Though, will need to probably add more special case handling such that
the Syscall task can not yield or try to itself make a syscall (the only
valid exit point for this task being where it transfers control back to
the caller and awaits the next syscall to arrive; and it is not valid
for this task to try to syscall back into itself).
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
BGB
2023-11-22 03:36:30 UTC
Permalink
Post by MitchAlsup
Post by BGB
For some ops, the 3rd register (Ro) would instead operate as a 5-bit
immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.
Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit register
specifier into a 5-bit immediate of either positive or negative integer
value. This makes::
    1<<n
   ~0<<n
    container.bitfield = 7;
single instructions.
Originally, the pattern depended on the 16-bit operation, IIRC:
(Rm), Rn => (Rm, Disp5), Rn
(Rm, R0), Rn => (Rm, Ro), Rn
ALU Ops:
OP Rm, Rn => OP Rm, Ro, Rn
OP Rm, R0, Rn => OP Rm, Imm5u, Rn

Initially, BJX2 started out in a similar camp to BJX1, but when it
became obvious that the 16-bit and 32-bit encodings effectively needed
separate encoders, there was no real point keeping up the concept of
32-bit ops being prefix-extended 16-bit ops.

Then some other analysis/testing showed that for "general case
tradeoffs", it was better to have an ISA with primarily 32-bit encodings
with a 16-bit subset, than one with primarily 16-bit encodings with
32-bit extended forms (though, by this point, I had already settled on
the general encoding scheme).

The main practical consequence of this realization was that the ISA did
not need to be able to operate entirely within the limits of the 16-bit
encoding space (but, did need to be able to operate without any of the
16-bit encodings).

After more development, I now have:
Imm5u/Disp5u, some ops (Baseline)
Imm6s/Disp6s (XG2)
Imm9u: Typical ALU ops
Imm10u (XG2)
Imm9n: A few ALU ops
Imm10n (XG2)
Disp9u: LD/ST ops
Disp10s (XG2)
TBD if Disp10u+Disp6s would have been better.
Since negative displacements are still pretty rare.
Might have been better to have larger positive displacements.
Imm10{u/n}: Various 2RI ops
Imm11{u/n} {XG2}
Disp11s / Disp12s (XG2), Branch-Compare-Zero
Effectively uses an opcode bit as the sign bit.
Imm16u/Imm16n: Some 2RI ops.
Disp20s: BRA/BSR
Disp23s (XG2)
Imm24{u/n}: LDIZ/LDIN ("MOV Imm25s, R0")
However, they are only available in specific combinations.
Imm9u: ADD, ADDS.L, ADDU.L, AND, OR, XOR, SH{A/L}D{L/Q}, MULS, MULU
Imm9n: ADD, ADDS.L, ADDU.L

Which does mean, say:
  y=x&(~7);
Needs either to load a constant into a register, or use a jumbo prefix.

The Disp9u/Disp10s encoding exists on all basic Load/Store ops, however
"special" ops (like XMOV.x) only have Disp5u/Disp6s encodings (not a
huge loss though).

With a Jumbo-Imm prefix, many of the Disp/Imm cases expand to 33 bits
(except Disp5 which only goes to 29 bits).
Post by MitchAlsup
Post by BGB
Thread: Logical thread of execution within some existing process;
         has a register file and a stack.
Post by BGB
Process: Distinct collection of 1 or more threads within a shared
has a memory map, a heap, and a vector of threads.
Post by BGB
address space and shared process identity (may have its own address
space, though as-of-yet, TestKern uses a shared global address space);
Task: Supergroup that includes Threads, Processes, and other
thread-like entities (such as call and method handlers), may be either
thread-like or process-like.
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Yeah.

As-is, I have several major interrupt handlers:

Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other
things.

IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).
TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).

SYSCALL: Mostly initiates task switches and similar, and little else.

Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The closest
option is mostly to use the interrupt handlers to trigger context
switches (which is, ironically, slightly less of an issue, as many of
the "hard" parts of a context switch are already performed for sake of
dealing with the "rather minimalist" interrupt mechanism).
Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.
They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked registers
(apart from SP/SSP, which switch places when entering/leaving an ISR).
Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").

So, the interrupt handling is basically a computed branch; which was
basically about the cheapest mechanism I could come up with at the time.

Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are a
few CR's which capture the CPU state at the time the ISR happens (these
registers getting overwritten every time a new interrupt occurs).
So, say:
Interrupt entry:
Copy low bits of SR into high bits of EXSR;
Copy PC into SPC.
Copy fault address into TEA;
Swap SP and SSP (*1);
Set CPU flags to Supervisor+ISR mode;
CPU Mode bits now copied from high bits of VBR.
Computed branch relative to VBR.
Offset depends on interrupt category.
Interrupt return (RTE):
Copy EXSR bits back into SR;
Unswap SP/SSP (*1);
Branch to SPC.
*1: At the time, couldn't figure a good way to shave more logic off the
mechanism. Though, now, the most obvious candidate would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).

So, instead, the ISR entry point would do something like:
MOV SP, SSP
MOV 0xDE00, SP //Designated ISR stack SRAM
MOV.Q R0, (SP, 0)
MOV.Q R1, (SP, 8)
... Now save off everything else ...

But, didn't really think of it at the time.

There is already the trick of requiring VBR to be aligned (currently 64B
in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.

Not sure if many CPUs have a cheaper mechanism here...

Note that in my case, generally the interrupt handlers are written in C,
with the compiler managing all the ISR prolog/epilog stuff (mostly
saving/restoring pretty much the entire CPU state to the ISR stack).
Generally, the ISR's also need to deal with having a comparably small
stack (with 0.75K already used for the saved CPU state).
Where:
0000..7FFF: Boot ROM
8000..BFFF: (Optional) Extended Boot ROM
C000..DFFF: Boot/ISR SRAM
E000..FFFF: (Optional) Extended SRAM
Generally, much of the work of the context switch is pulled off using
"memcpy" calls (with the compiler providing a special "__arch_regsave"
variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
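
A rough sketch of what that amounts to; purely illustrative, with the
task structure and the exact type of "__arch_regsave" being stand-ins
for whatever BGBCC/TestKern actually define:

#include <string.h>

extern void *__arch_regsave;   /* where the ISR dumped the registers */

struct task_ctx {
    unsigned char regs[768];   /* ~0.75K of saved CPU state */
};

/* Swap the register-save area between outgoing and incoming tasks. */
void ctx_switch(struct task_ctx *prev, struct task_ctx *next)
{
    memcpy(prev->regs, __arch_regsave, sizeof(prev->regs)); /* save old */
    memcpy(__arch_regsave, next->regs, sizeof(next->regs)); /* load new */
}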
Though, I guess one other possibility would be if the compiler-generated
ISR code assumed TBR to always be valid (and then copied the registers
to a fixed location relative to TBR instead of the ISR stack), which
could in-theory allow for faster context switching (by eliminating the
need for the memcpy calls), but would be a bit more brittle (if TBR is
invalid, stuff is going to break pretty hard as soon as an interrupt
happens).
Would likely need special compiler attributes for this (would not make
sense for interrupts which do not, or are unlikely to, perform a context
switch).
Post by MitchAlsup
Post by BGB
Though, will need to probably add more special case handling such that
the Syscall task can not yield or try to itself make a syscall (the
only valid exit point for this task being where it transfers control
back to the caller and awaits the next syscall to arrive; and it is
not valid for this task to try to syscall back into itself).
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
No way to handle a syscall recursively in my case, partly because of how
the task works:
It gets started at a certain location, and switches off at the point
where it would receive a syscall request.
So, sort of like:
  ... //initial task setup
  TK_Task_SyscallReturnToUser(task);
  while(1)
  {
      TK_Task_SyscallGetArgs(&task, &sobj, &umsg, &rptr, &args);
      //handle the syscall
      TK_Task_SyscallReturnToUser(task);
  }
Whenever ReturnToUser returns, it expects there to be a syscall request
for it to handle. This call effectively transfers control back to the
caller task, with the syscall task ready to receive a new request.
SyscallGetArgs basically invokes "arcane magic" to fetch the parameters
for the task that performed the syscall (the dispatch mechanism stashes
the parameters in a designated location in the syscall handler's task
context).
However, if the Syscall task itself tries to invoke yield, or otherwise
triggers a context switch, then it will not be at the correct location
to handle a syscall if one were to arrive (at which point, the OS explodes).

Or, if it tries to perform a syscall, then the syscall attempt will
return immediately (since it effectively performs a context switch back
to itself).

Granted, it is possible that the SYSCALL dispatcher could be made to
dispatch among one of multiple SYSCALL tasks, which could then handle up
to N levels of recursion.
On a multi-core system, each core would also need its own syscall tasks
(well, and/or they operate round-robin, and the syscall is directed at
whichever task is in the correct state to handle a request).
There is a little flexibility here, at least in as far as pretty much
the whole mechanism is managed in software in this case (apart from the
ISR mechanism itself).
Note that for inter-task method-calls, a similar mechanism is used as
for normal syscalls, except:
A range of special syscall numbers is used as a VTable index;
The object's VTable implicitly encodes the PID of the task to
dispatch the request to.
So, instead of waiting for syscalls, it waits for method calls, and then
dispatches them as needed (locally) when they arrive.

On the receiver end, there is a mechanism to compose the VTable
interface, where the VTable is effectively composed of methods whose
sole purpose is to invoke a syscall, passing the argument list and
similar off to a handler, with the syscall number based on the method's
location within the VTable.

Then, the SYSCALL ISR sees this, and then fetches the corresponding task
to dispatch to, ...
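
A stub along these lines might look like the following; this is only a
sketch of the idea (VT_SYSCALL_BASE, __syscall3, and the structure
layout are invented here, not TestKern's actual definitions):

#define VT_SYSCALL_BASE 0x8000   /* assumed start of the VTable range */

extern long __syscall3(long num, long pid, void *args);

struct obj_vt { int pid; /* task the VTable dispatches to */ };

/* Slot N of the VTable forwards to syscall VT_SYSCALL_BASE+N, so the
   SYSCALL ISR can recover both the method index and the target task. */
static long vt_call(struct obj_vt *vt, int slot, void *args)
{
    return __syscall3(VT_SYSCALL_BASE + slot, vt->pid, args);
}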
MitchAlsup
2023-11-22 18:38:00 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Yeah.
Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other
things.
I call these checks:: a page fault is an unanticipated SysCall to the
Guest OS page fault handler; whereas a check is something that should
never happen but did (ECC repair fail): These trap to Real HV.
Post by BGB
IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).
Every My 66000 process has its own event table which combines exceptions,
interrupts, SysCalls, ... This means there is no table surgery when switching
between Guest OS and Guest Hypervisor and Real Hypervisor.
Post by BGB
TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).
HW table walking.
Post by BGB
SYSCALL: Mostly initiates task switches and similar, and little else.
Part of Event table.
Post by BGB
Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The closest
option is mostly to use the interrupt handlers to trigger context
switches (which is, ironically, slightly less of an issue, as many of
the "hard" parts of a context switch are already performed for sake of
dealing with the "rather minimalist" interrupt mechanism).
My 66000 can perform a context switch (user->user) in a single instruction.
Old state goes to memory, new state comes from memory; by the time
state has arrived, you are fetching instructions in the new context
under the new context MMU tables and privileges and priorities.
Post by BGB
Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.
They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked registers
(apart from SP/SSP, which switch places when entering/leaving an ISR).
Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").
So, the interrupt handling is basically a computed branch; which was
basically about the cheapest mechanism I could come up with at the time.
Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are a
few CR's which capture the CPU state at the time the ISR happens (these
registers getting overwritten every time a new interrupt occurs).
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Post by BGB
Copy low bits of SR into high bits of EXSR;
Copy PC into SPC.
Copy fault address into TEA;
Swap SP and SSP (*1);
Set CPU flags to Supervisor+ISR mode;
CPU Mode bits now copied from high bits of VBR.
Computed branch relative to VBR.
Offset depends on interrupt category.
Copy EXSR bits back into SR;
Unswap SP/SSP (*1);
Branch to SPC.
Interrupt Entry Point::
// by this point all the old registers have been saved where they
// are supposed to go, and the interrupt dispatcher registers are
// already loaded up and ready to go, and the CPU is running at
// whatever privilege level was specified.
HR R1<-WHY
LD IP,[IP,R1<<3,InterruptVectorTable] // Call through table
RTI
//
InterruptHandler0:
// do what is necessary
// note this can all be written in C
RET
Post by BGB
*1: At the time, couldn't figure a good way to shave more logic off the
mechanism. Though, now, the most obvious candidate would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).
MOV SP, SSP
MOV 0xDE00, SP //Designated ISR stack SRAM
MOV.Q R0, (SP, 0)
MOV.Q R1, (SP, 8)
... Now save off everything else ...
But, didn't really think of it at the time.
There is already the trick of requiring VBR to be aligned (currently 64B
in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.
Not sure if many CPUs have a cheaper mechanism here...
Treat the CPU state and the register state as cache lines and have
HW shuffle them in and out. You can even start the 5 cache line reads
before you start the CPU state writes; saving latency (which you cannot
do using SW-only methods).
Post by BGB
Note that in my case, generally the interrupt handlers are written in C,
with the compiler managing all the ISR prolog/epilog stuff (mostly
saving/restoring pretty much the entire CPU state to the ISR stack).
My 66000 compiler remains blissfully ignorant of ISR prologue and
epilogue and it still works.
Post by BGB
Generally, the ISR's also need to deal with having a comparably small
stack (with 0.75K already used for the saved CPU state).
0000..7FFF: Boot ROM
8000..BFFF: (Optional) Extended Boot ROM
C000..DFFF: Boot/ISR SRAM
E000..FFFF: (Optional) Extended SRAM
Generally, much of the work of the context switch is pulled off using
"memcpy" calls (with the compiler providing a special "__arch_regsave"
variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
Why not just make the HW push and pull cache lines.
Stefan Monnier
2023-11-22 22:17:30 UTC
Permalink
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
it's not also a pun on the TI 9900.
Stefan
MitchAlsup
2023-11-22 23:58:19 UTC
Permalink
Post by Stefan Monnier
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
Hmm... I thought the "66000" came from the CDC 6600 but now I wonder if
it's not also a pun on the TI 9900.
In reverence to CDC 6600, not came from.

Exchange Jump on the CDC 6600 caused a context switch that took 16+10
processor cycles (after the scoreboard cleared.) And on the 6600, NOS was
in the PPs and the CPUs were there to just crunch numbers.

I have a hard real time version of My 66000 where the lower levels of the OS
are in HW, and if you have fewer than 1024 threads running, you do not expend any
(zero, 0, nada, zilch) cycles in the OS performing context switches or priority
alterations. This system has the property that if an interrupt (or message)
arrives to unblock a waiting thread that is of higher priority than any CPU in
the affinity group of CPUs, then the lowest priority CPU in that group receives the
higher priority thread (without an excursion through the OS (damaging cache
state).)
I have a Linux friendly version where context switch is a single instruction.
When you write a context pointer that entire context is now available to support
whatever you want it to support. So, an unprivileged application can context
switch to another unprivileged application by writing a single control register
leaving Guest OS, Guest HV and Real HV in their original configuration. Guest
OS can context switch to a different Guest OS in a single instruction and then
the Guest OS receiving control needs to context switch to an application it wants
to run--so 20-ish cycles to perform a Guest OS switch. (This now costs typical
old architectures 10,000 cycles)

But nowhere does any thread receiving control have to execute any state or
register saving or restoring... Just like Exchange Jump.
Post by Stefan Monnier
Stefan
Scott Lurndal
2023-11-23 20:46:38 UTC
Permalink
Post by MitchAlsup
I have a Linux friendly version where context switch is a single instruction.
The Burroughs B3500 had a single such instruction, called
Branch Reinstate (BRE).
The task context (base register, limit register, accumulator, comparison
and overflow flags) was stored in a small region at absolute address 60,
and BRE would restore that state (and interrupts would save it).
Index registers were mapped to base-relative addresses 8, 16 and 24
(8 digits each).

The V-Series did a complete revamp of the processor architecture to
support larger memory sizes (both per task and systemwide) and
SMP. A segmentation scheme was adopted (for backward compatibility)
and seven additional base-limit pairs were added to support direct
access to 8 segments at any time (called an environment). There
could be up to 1,000,000 environments per task, each with up to
8 active memory areas (and 92 inactive memory areas accessible to
three special instructions for data movement and comparison).
The instruction was renamed Branch Reinstate Virtual (BRV) and would
read the task table entry and load all the relevant state, including
loading the active environment table into the processor base-limit
registers. BRV accessed a table in memory, indexed by task number,
that stored all the state of the task (200 digits worth).
At the same time, we added SMP support including an inter-cpu
communication instruction (my invention) similar to the
mechanism adopted a few years later when Intel added SMP
support for P5.
We also added hardware mutex and condition variable instructions;
the "LOK" instruction would atomically acquire the mutex, if
available, or interrupt to a microkernel scheduler if unavailable.
"UNLK" would interrupt if a higher priority task was waiting
for the lock. There were CAUS and WAIT instructions that
offered capabilities similar to POSIX condition variables.
Each defined lock had a canonical lock level (a 4-digit
number) and the hardware would fail a lock request where
the new lock's canonical lock number was less than that of the current
lock owned by the task (if any). Unlock enforced the
reverse. This prevented any A-B deadlock situations from
occurring, although with many locks in a large subsystem (e.g.
the MCP OS) it was sometimes tricky to assign lock numbers.
This also implicitly encouraged programmers to minimize
critical sections and avoid nested locking where possible.
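
In software terms the check behaves roughly like the sketch below; this
models the rule as described, not the actual B-series hardware, and the
per-task stack and function names are my own:

/* Acquires may not go below the currently held lock level; releases
   must come off in reverse order. The real machine trapped on this. */
struct task_locks { int top; int level[16]; };

static int lok(struct task_locks *t, int lvl)
{
    if (t->top > 0 && lvl < t->level[t->top - 1])
        return -1;              /* below current level: request fails */
    t->level[t->top++] = lvl;
    return 0;
}

static int unlk(struct task_locks *t, int lvl)
{
    if (t->top == 0 || t->level[t->top - 1] != lvl)
        return -1;              /* must release in reverse order */
    t->top--;
    return 0;
}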
The microkernel only handled scheduling and interrupts, all
MCP code ran in the context of either the task making the
request, or in an 'independent runner' (a kernel thread)
dispatched from the microkernel. I/O interrupts were dispatched
to two different independent runners, one for normal interrupts
and one for real-time interrupts. Real-time interrupts were
used for document sorters (e.g. MICR reader/sorters processing
checks/cheques/utility bills, etc) in order to be able to
select the destination pocket for each document in the
time interval from the read station to the pocket-select
station (at 2500 documents per minute - 42 per second,
one document every 24 milliseconds). We supported ten
active sorters per host. Even had one host installed
on an L-1011 with reader/sorters that processed
checks on coast-to-coast overnight flights.
MitchAlsup
2023-11-23 21:08:45 UTC
Permalink
Post by Scott Lurndal
Post by MitchAlsup
I have a Linux friendly version where context switch is a single instruction.
The Burroughs B3500 had a single such instruction, called
Branch Reinstate (BRE).
My 66000 has 4 privilege levels {application, Guest OS, Guest HV, Real HV};
each privilege level has its own {IP, RF, Root Pointer, CSP, Exception
{Enabled, Raised}, and a few more things} contained in 5 contiguous cache
lines.
The 4 privilege levels, each, have a pointer to those 5 cache lines. By
writing the control register (HR instruction) one can change the control
point for each level (of course you have to have appropriate permission--
but I decided that a user should have the ability to context switch to
another user without needing OS intervention--thus pthreads do not need
an excursion through the Guest OS to switch threads under the same memory
map {but do when crossing processes}).
Thus, all 4 privileges are always resident in the privilege hierarchy
at the cost of 4 DoubleWord registers instead of at the cost of 4 RFs.
With these levels all resident simultaneously, no table surgery is needed
to switch levels {Root pointers, MTRR,...} and no RF save/restore is
needed.
Paul A. Clayton
2023-11-23 22:13:03 UTC
Permalink
On 11/23/23 4:08 PM, MitchAlsup wrote:
[snip]
Post by MitchAlsup
The 4 privilege levels, each, have a pointer to those 5 cache
lines. By writing the control register (HR instruction) one
can change the control point for each level (of course you
have to have appropriate permission-- but I decided that a
user should have the ability to context switch to another
user without needing OS intervention--thus pthreads do not
need an excursion through the Guest OS to switch threads
under the same memory map {but do when crossing processes}.
My 66000 also has Port Holes, which seem to offer some
cross-protection-domain access.
While not significantly helpful, I also wonder if privilege
reducing operations could be lower cost by not involving the
OS. This would require the OS to store the allowed privilege
elsewhere, but this might be done anyway. It would also have
little use (I suspect) and still require OS involvement to
restore privilege. There might be some cases where privilege
is only needed in an initialization stage, but that seems
likely to be rare.
Writing to the accessed and dirty bits of a PTE would also
seem to be something that could, in theory, be allowed to a
user-level process. Clearing the dirty bit could be dangerous
if stale data was from another protection domain. Clearing
the accessed bit would seem to only "strongly hint" that the
page be victimized earlier; setting the dirty bit would not
be different than a "silent store" [not useful it seems since
a load/store instruction pair could accomplish the same] and
setting the accessed bit would seem the same as performing a
non-caching load to any location in the page acting as a
"keep me" hint [probably not useful]. Even with this little
thought, allowing these PTE changes seems not worthwhile.
BGB
2023-11-23 03:50:30 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by BGB
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to
context switch back to the task that made the request, or to yield
to another task, ...).
We call these things:: dispatchers.
Yeah.
Fault: Something has gone wrong, current handling is to stall the CPU
until reset (and/or terminate the emulator). Could in principle do other
things.
I call these checks:: a page fault is an unanticipated SysCall to the
Guest OS page fault handler; whereas a check is something that should
never happen but did (ECC repair fail): These trap to Real HV.
A lot of things here are things that could be handled, but are not
currently handled:
Invalid instructions;
Access to invalid memory regions;
Access to memory in a way which violates access protections;
A branch to an invalid address;
Code used the BREAK instruction or similar;
Etc.
Generally at present, if any of these happens, it means that something
has gone badly enough that I want to stall immediately and probably
debug it.
In a "real" OS, if this happens in userland, one would typically turn
this into "SEGFAULT" or similar.
For the emulator, if a BREAK occurs in ISR mode (or any other fault
happens in ISR mode), it causes the emulator to stop execution, dump a
backtrace and registers, and then terminate. Otherwise, exiting the
emulator normally will dump a bunch of profiling information (this part
is not done if the emulator terminates due to a fault).
Stalling the Verilog core causes it to dump the state of the
pipeline and some other things via "$display" (potentially relevant for
debugging). Or, allows seeing the crash PC on the 7-segment display on
the Nexys A7.
Post by MitchAlsup
Post by BGB
IRQ: Deals with timer, may potentially be used for preemptive task
scheduling (code is in place, but this is not currently enabled). Does
not currently perform any other "complex" actions (and the "practical"
use of IRQ's remains limited in my case, due in large part to the
limitations of interrupt handling).
Every My 66000 process has its own event table which combines exceptions
interrupts, SysCalls,... This means there is no table surgery when switching
between Guest OS and Guest Hypervisor and Real Hypervisor.
In my case, the VBR register is global (and set up during boot).
Any per-process event dispatching would need to be handled in software.
I didn't go with an x86-style IDT or similar partly because this would
have been significantly more expensive (in terms of Verilog code and
LUTs) than the existing mechanism. The role of an x86-style IDT could be
faked in software though.
So, VBR is sort of like:
(63:48): Encodes CPU state to use on ISR entry;
(47: 6): Encodes the ISR entry point.
In practice only (28:6) are "actually usable".
( 5: 0): Must be Zero
Where, low-order bits are replaced with an entry offset:
00: RESET
08: FAULT
10: IRQ
18: TLBMISS
20: SYSCALL
28: Reserved
The 8 bytes of space gives enough room to encode a relative or absolute
branch to the actual entry point (while not being so big as to be
needlessly wasteful).
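
Since VBR is aligned, the "computed branch" is just bit-splicing; a
small sketch of the idea (the mask assumes the 64B practical alignment
mentioned above):

#include <stdint.h>

enum { VEC_RESET = 0x00, VEC_FAULT = 0x08, VEC_IRQ = 0x10,
       VEC_TLBMISS = 0x18, VEC_SYSCALL = 0x20 };

/* Entry address = VBR's entry-point bits with the category offset
   spliced into the low bits; no memory load involved. */
static uint64_t isr_entry(uint64_t vbr, int category)
{
    uint64_t base = vbr & 0x0000FFFFFFFFFFC0ull; /* keep bits (47:6) */
    return base | (uint64_t)category;            /* 8B per entry slot */
}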
During CPU reset, VBR is cleared to 0, and then control is transferred
to 0, which branches to the ROM's entry point.

vector table would have required some mechanism for the CPU to perform a
memory load to get the address. Computed branch was easier, since no
special memory load is needed, just branch there, and assume this lands
on a branch instruction which takes control where it needs to go.
Post by MitchAlsup
Post by BGB
TLB Miss: Handles TLB miss and ACL Miss events, may initiate further
action if a "page fault" style event occurs (or something needs to be
paged in/paged out from the swapfile).
HW table walking.
Yeah, no page-table hardware in my case.
Had on/off considered an "Inverted Page-Table" like in IA-64, but this
still seemed to be annoyingly expensive vs the "Throw a TLB-Miss
Exception" route. Even if I eliminated the TLB-Miss logic, would then
need to have Page-Fault logic, which doesn't really save anything there
either.

There is a designated register though for the page-table: TTB.

With the considered inverted-page-table using a separate VIPT register,
the idea being that VIPT would point to a region of, say, 4096x4x128b
TLBE's (~256K), effectively functioning as a RAM-backed L3 TLB. If this
table lacked the requested TLBE, this would still result in a TLB Miss
fault.
Note that the idea was still that trying to use 96-bit virtual address
mode would require two TLBE's, effectively halving associativity. This
in turn requires plain modulo-addressing as hashing can create a "bad
situation" where a 2-way TLB will get stuck in an infinite loop (but
this infinite loop scenario is narrowly averted with modulo addressing).
Granted, 4-way is still better as it seems to result in a comparably
lower TLB miss rate.
It is still possible though to XOR the TLBE's index with a bit-pattern
derived from the ASID, to slightly reduce the cost of context switches
in some cases (if multiple address spaces were being used).
Note that the L1 I$ and D$ can get along reasonably well with an
optional 32-entry 1-way "Micro-TLB".
Post by MitchAlsup
Post by BGB
SYSCALL: Mostly initiates task switches and similar, and little else.
Part of Event table.
All software in my case.
Post by MitchAlsup
Post by BGB
Unlike x86, the design of the interrupt mechanisms means it isn't
practical to hang the whole OS off of an interrupt handler. The
closest option is mostly to use the interrupt handlers to trigger
context switches (which is, ironically, slightly less of an issue, as
many of the "hard" parts of a context switch are already performed for
sake of dealing with the "rather minimalist" interrupt mechanism).
My 66000 can perform a context switch (user->user) in a single instruction.
Old state goes to memory, new state comes from memory; by the time
state has arrived, you are fetching instructions in the new context
under the new context MMU tables and privileges and priorities.
Yeah, but that is not exactly minimalist in terms of the hardware.
Granted, burning around 1 kilocycle of overhead per syscall isn't ideal
either...
Eg:
Save registers to ISR stack;
Copy registers to User context;
Copy handler-task registers to ISR stack;
Reload registers from ISR stack;
Handle the syscall;
Save registers to ISR stack;
Copy registers to Syscall context;
Copy User registers to ISR stack;
Reload registers from ISR stack.
Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).
And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).
If the "memcpy's" could be eliminated, this could roughly halve the cost
of doing a syscall.
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
Worth the cost? Dunno.
Not too much different from modern Windows, where slow syscalls are still
fairly common (and despite the slowness of the mechanism, it seems like
BJX2 syscalls still manage to be around an order of magnitude faster than
Windows syscalls in terms of clock-cycle cost...).

Well, and the seeming absurdity of WaitForSingleObject() on a mutex
generally taking upwards of 1 million clock-cycles IIRC in past
experiments (when the mutex isn't already locked; and, if it is
locked... yeah...).
You could lock a mutex... or you could render an entire frame in Doom,
then checksum the frame image, and use the checksum as a hash key. In a
roughly similar time-scale.
Luckily, at least, the CriticalSection objects were not absurdly slow...
Post by MitchAlsup
Post by BGB
Basically, in this design, it isn't possible to enter a new interrupt
without first returning from the prior interrupt (at least not without
f*ing the CPU state). And, as-is, interrupts can only operate in
physically addressed mode.
They also need to manually save and restore all the registers, since
unlike either SuperH or RISC-V, BJX2 does not have any banked
registers (apart from SP/SSP, which switch places when
entering/leaving an ISR).
Unlike x86 (protected mode), it doesn't have TSS's either, and unlike
8086 real-mode, it doesn't implicitly push anything to the stack (nor
have an "interrupt vector table").
So, the interrupt handling is basically a computed branch; which was
basically about the cheapest mechanism I could come up with at the time.
Did create a little bit of a puzzle initially as to how to get the CPU
state saved off and restored with no free registers. Though, there are
a few CR's which capture the CPU state at the time the ISR happens
(these registers getting overwritten every time a new interrupt occurs).
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
Though, could make sense if one has a mechanism where a context switch
dumps the whole register file to Block-RAM, with some way to access
this RAM via an MMIO interface.
Pros/cons, seems like each possibility would also come with drawbacks:
As-is: Slowness due to needing to save/reload everything;
RISC-V: Expensive regfile, only works for limited cases;
MMIO Backed + RV-like: Faster U<->S, but slower task switching.
RAM Backed: Cache coherence becomes a critical feature.
The RISC-V like approach makes sense if one assumes:
There is a user process;
There is a kernel running under it;
We want to call from the user process into the kernel.
Doesn't make so much sense, say, for:
User Process A calls a VTable entry which calls into User Process B;
Service A uses a VTable to call into the VFS;
...
Say, where one is making use of horizontal context switches for control
flow between logical tasks. Which would still remain fairly expensive
under a RISC-V like model.
One could have enough register banks for N logical tasks, but supporting
4 or 8 copies of the register file is going to cost more than 2 or 3.
Granted, possibly, handling system calls via a mechanism along the
lines of a horizontal context switch is a bit unusual...
But, ironically, this sort of ended up seeming like the most
straightforward approach in my case.
Post by MitchAlsup
Post by BGB
     Copy low bits of SR into high bits of EXSR;
     Copy PC into SPC.
     Copy fault address into TEA;
     Swap SP and SSP (*1);
     Set CPU flags to Supervisor+ISR mode;
       CPU Mode bits now copied from high bits of VBR.
     Computed branch relative to VBR.
       Offset depends on interrupt category.
     Copy EXSR bits back into SR;
     Unswap SP/SSP (*1);
     Branch to SPC.
      // by this point all the old registers have been saved where they
      // are supposed to go, and the interrupt dispatcher registers are
      // already loaded up and ready to go, and the CPU is running at
      // whatever privilege level was specified.
      HR   R1<-WHY
      LD   IP,[IP,R1<<3,InterruptVectorTable] // Call through table
      RTI
//
      // do what is necessary
      // note this can all be written in C
      RET
Above, I was describing what the hardware was doing.

The software side is basically more like:
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
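
In (hypothetical) assembler terms, the prolog side might look roughly
like this; the save-area size, offsets, and the CR-move mnemonics are
illustrative, not actual BGBCC output:

   __isr_irq:
   ADD    -256, SP          // reserve the ISR save area (size illustrative)
   MOV.Q  R0, (SP, 0)       // free up R0/R1 first
   MOV.Q  R1, (SP, 8)
   MOV    SPC, R0           // capture CRs while R0/R1 are free
   MOV    EXSR, R1
   MOV.Q  R0, (SP, 16)
   MOV.Q  R1, (SP, 24)
   ...                      // then the remaining GPRs
   BSR    __isr_irq_main    // normal C ABI from here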


If I were to make the ISR mechanism assume that TBR was valid:
Branch from VBR-table to ISR entry point;
Get R0/R1/R8/R9 saved onto the stack;
Load the address of the register-save area from the current TBR;
Save CRs and GPRs to register save area
Copy over the values saved onto the stack.
Call into the main part of the ISR handler (using normal C ABI);
Restore everything from the potentially new TBR;
...

Pros:
Could speed up syscalls and task switches;
No hardware-level changes needed.

Cons:
Now the compiler would be hard-coded for TestKern's TBR layout (this
stuff would need to be baked into the ABI, *).


*: This structure being comparable to the TEB in Windows (and also holds
the location to find things like TLS variables and similar).

It differs slightly from the Windows TEB though:
The main part is Read-Only in Userland;
Holds a pointer to a Kernel-Only part;
This part holds the saved registers.
Holds another pointer to a User Modifiable part
This part holds the TLS variables and some execution-state stuff.
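
A rough C sketch of that split (all type and field names here are
illustrative, not the actual TestKern definitions):

   /* Illustrative sketch only; not the actual TestKern structures. */
   struct TaskInfoKern_s {               /* kernel-only part */
       unsigned long long regsave[64];   /* saved GPRs */
       unsigned long long crsave[16];    /* saved CRs */
   };
   struct TaskInfoUser_s {               /* user-modifiable part */
       void *tls_data;                   /* TLS vars, execution state, ... */
   };
   struct TaskInfo_s {                   /* main part: read-only in userland */
       struct TaskInfoKern_s *kern;      /* kernel-only part (saved regs) */
       struct TaskInfoUser_s *user;      /* user-modifiable part */
   };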


Likely, in C land, might look something like:
__interrupt __declspec(isr_regsave_tbr) void __isr_syscall(void)
{
...
}

With the "__declspec(isr_regsave_tbr)" signaling to BGBCC that it should
save registers directly into the TBR's register-save area rather than
onto the ISR stack.

Should be workable at least under the assumption that no one is going to
try to invoke a syscall without a valid TBR.
Post by MitchAlsup
Post by BGB
*1: At the time, couldn't figure a good way to shave more logic off
the mechanism. Though, now, the most obvious candidate would be to
eliminate the implicit SP/SSP swapping (this part is currently handled
in the instruction decoder).
   MOV    SP, SSP
   MOV    0xDE00, SP  //Designated ISR stack SRAM
   MOV.Q  R0, (SP, 0)
   MOV.Q  R1, (SP, 8)
   ... Now save off everything else ...
But, didn't really think of it at the time.
There is already the trick of requiring VBR to be aligned (currently
64B in practice; formally 256B), mostly so as to allow the "address
computation" to be done via bit-slicing.
Not sure if many CPUs have a cheaper mechanism here...
Treat the CPU state and the register state as cache lines and have
HW shuffle them in and out. You can even start the 5 cache line reads
before you start the CPU state writes, saving latency (which you cannot
do using SW-only methods).
I meant hardware-side cost.

But, yeah, software-side could be a fair bit faster...
Post by MitchAlsup
Post by BGB
Note that in my case, generally the interrupt handlers are written in
C, with the compiler managing all the ISR prolog/epilog stuff (mostly
saving/restoring pretty much the entire CPU state to the ISR stack).
My 66000 compiler remains blissfully ignorant of ISR prologue and
epilogue and it still works.
Post by BGB
Generally, the ISR's also need to deal with having a comparably small
stack (with 0.75K already used for the saved CPU state).
   0000..7FFF: Boot ROM
   8000..BFFF: (Optional) Extended Boot ROM
   C000..DFFF: Boot/ISR SRAM
   E000..FFFF: (Optional) Extended SRAM
Generally, much of the work of the context switch is pulled off using
"memcpy" calls (with the compiler providing a special "__arch_regsave"
variable giving the address of the location it has dumped the CPU
registers into; which in turn covers most of the core state that needs
to be saved/restored for a process context switch).
Why not just make the HW push and pull cache lines.
My current prediction is that the mechanism for doing this would make
the register file significantly more expensive, along with making for
more serious problems related to memory coherence if the CPU tries to
touch any of this (unlike the RAM-backed VRAM, I can't hand-wave this,
if things don't go perfectly, stuff is gonna explode).


Granted, going "true multicore" is likely to require addressing the
cache-coherence issues somehow (needing to manually invoke cache
flushes to deal with multithreaded code isn't really going to fly).
MitchAlsup
2023-11-23 16:53:04 UTC
Permalink
Post by BGB
Yeah, but that is not exactly minimalist in terms of the hardware.
Granted, burning around 1 kilocycle of overhead per syscall isn't ideal
either...
Save registers to ISR stack;
Copy registers to User context;
Copy handler-task registers to ISR stack;
Reload registers from ISR stack;
Handle the syscall;
Save registers to ISR stack;
Copy registers to Syscall context;
Copy User registers to ISR stack;
Reload registers from ISR stack.
Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).
Not at all--I have reduced SysCalls to just a bit slower than an actual
CALL, say around 10 cycles. Use them as often as you like.
Post by BGB
And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).
If the "memcpy's" could be eliminated, this could roughly halve the cost
of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
Post by BGB
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines;
There is a 5th cache line that contains all the other PSW stuff.
Post by BGB
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Post by BGB
Not too much different to modern Windows, where slow syscalls are still
fairly common (and despite the slowness of the mechanism, it seems like
BJX2 syscalls still manage to be around an order of magnitude faster than
Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Post by BGB
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
Post by BGB
Though, could make sense if one has a mechanism where a context switch
could have a mechanism to dump the whole register file to Block-RAM, and
some sort of mechanism to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Post by BGB
As-is: Slowness due to needing to save/reload everything;
RISC-V: Expensive regfile, only works for limited cases;
MMIO Backed + RV-like: Faster U<->S, but slower task switching.
RAM Backed: Cache coherence becomes a critical feature.
There is a user process;
There is a kernel running under it;
We want to call from the user process into the kernel.
So if you are running under a Real OS you don't need 2 sets of RFs in my
model.
Post by BGB
User Process A calls a VTable entry which calls into User Process B;
Service A uses a VTable to call into the VFS;
...
Say, where one is making use of horizontal context switches for control
flow between logical tasks. Which would still remain fairly expensive
under a RISC-V like model.
Yes, but PTHREADing can be done without privilege and in a single instruction.
Post by BGB
One could have enough register banks for N logical tasks, but supporting
4 or 8 copies of the register file is going to cost more than 2 or 3.
Above, I was describing what the hardware was doing.
Branch from VBR-table to ISR entry point;
Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
Post by BGB
Get some of the CRs saved off (we need R0 and R1 free here);
Get the rest of the GPRs saved onto the stack;
Call into the main part of the ISR handler (using normal C ABI);
Restore most of the GPRs;
Restore most of the CRs;
Restore R0 and R1;
Do an RTE.
Branch from VBR-table to ISR entry point;
Call into the main part of the ISR handler (using normal C ABI);
Do an RTE.
See what it saves ??
BGB
2023-11-23 21:53:47 UTC
Permalink
Post by MitchAlsup
Post by BGB
Yeah, but that is not exactly minimalist in terms of the hardware.
Granted, burning around 1 kilocycle of overhead per syscall isn't
ideal either...
   Save registers to ISR stack;
   Copy registers to User context;
   Copy handler-task registers to ISR stack;
   Reload registers from ISR stack;
   Handle the syscall;
   Save registers to ISR stack;
   Copy registers to Syscall context;
   Copy User registers to ISR stack;
   Reload registers from ISR stack.
Does mean that one needs to be economical with syscalls (say, doing
"printf" a whole line at a time, rather than individual characters, ...).
Not at all--I have reduced SysCalls to just a bit slower than actual CALL.
say around 10-cycles. Use them as often as you like.
OK.

Well, they aren't very fast in my case, in any case.
Post by MitchAlsup
Post by BGB
And, did create incentive to allow getting the microsecond-clock value
and hardware RNG values from CPUID rather than needing a syscall (say,
don't want to burn 20us to check the microsecond counter, ...).
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
None in my case...

But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.


Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).
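
In rough C terms, the hoped-for fast path is something like this sketch
(with "__set_tbr" as a hypothetical intrinsic / CR write):

   struct TKPE_TaskInfo_s;                           /* per-task context */
   extern void __set_tbr(struct TKPE_TaskInfo_s *);  /* hypothetical */

   /* The whole switch collapses to retargeting TBR; the ISR epilog
      then restores registers from the new task's save area. */
   void task_switch(struct TKPE_TaskInfo_s *next)
   {
       __set_tbr(next);
   }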
Post by MitchAlsup
Post by BGB
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines;
There is a 5th cache line that contains all the other PSW stuff.
No direct equivalent.


I was thinking sort of like the RISC-V Privileged spec, there are
User/Supervisor/Machine sets, with the mode affecting which of these is
visible.

Obvious drawback in my case is that this would effectively increase the
number of internal GPRs from 64 to 192 (and, at that point, may as well
go to 4 copies and have 256).

If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).

The increase in the number of CRs could be less, since only a few of
them actually need duplication.


But, don't want to go this way, and it would only be a partial solution
that also does not map well onto my current implementation.



Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.

Major differences:
SH-4 banked out R0..R7 when entering an interrupt;
The VBR relative entry-point offsets were a bit, ad-hoc.

There were some fairly arbitrary displacements based on the type of
interrupt. Almost like they designed their interrupt mechanism around a
particular chunk of ASM code or something. In my case, I kept a similar
idea, but just used a fixed 8-byte spacing, with the idea of these spots
branching to the actual entry point.
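
E.g., something like (labels and category order illustrative):

   vbr+0x00:  BRA  __isr_fault     // FAULT
   vbr+0x08:  BRA  __isr_irq       // IRQ
   vbr+0x10:  BRA  __isr_syscall   // SYSCALL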

Though, one other difference is in my case I ended up adding a dedicated
SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
have gone to the FAULT handler instead.


It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch, but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).

Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.


In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.

I think at one point, I had considered having tasks have both User and
Supervisor state (with two stacks and two copies of all the registers),
but ended up not going this way (and instead giving the syscalls their
own designated task context, which also saves on per-task memory overhead).
Post by MitchAlsup
Post by BGB
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Post by BGB
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 sycalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Looked into it a little more, realized that "an order of magnitude" may
have actually been a little conservative; seems like Windows syscalls
may be more in the area of 50-100k cycles.

Why exactly? Dunno.


This is still ignoring some of the "slow cases" which may take millions
of clock cycles.

It also seems like fast-ish syscalls may be more of a Linux thing.
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
OK.


Having only 1 set of registers is good...

Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.

Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.
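
(For scale: 64 GPRs × 64 bits = 4096 bits = 512 bytes, so even at that
128-bit/cycle best case, dumping or reloading the GPRs alone is >= 32
cycles each way, before any misses.)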


One bit of trickery would be, "what if" the Boot SRAM region were inside
the L1 cache rather than out on the ringbus?...

But, then one would have the cost of keeping 8K of SRAM close to the CPU
core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any
case...).

Though keeping it tied to a specific CPU core (and effectively processor
local) would avoid the ugly "what if" scenario of two CPU cores trying
to service an interrupt at the same time and potentially stepping on
each others' stacks. The main tradeoff vs putting the stacks in DRAM is
mostly that DRAM may have (comparably more expensive) L2 misses.


Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code from
the ISR stack or similar.
Post by MitchAlsup
Post by BGB
Though, could make sense if one has a mechanism where a context switch
could have a mechanism to dump the whole register file to Block-RAM,
and some sort of mechanism to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Possibly.

It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this route
(baking parts of it into the C ABI is at least "slightly" less evil).

Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.


I guess I can probably safely rule out MMIO under the basis that context
switching via moving registers via MMIO would be slower than the current
mechanism (of using a series of Load/Store instructions).
Post by MitchAlsup
Post by BGB
   As-is: Slowness due to needing to save/reload everything;
   RISC-V: Expensive regfile, only works for limited cases;
   MMIO Backed + RV-like: Faster U<->S, but slower task switching.
   RAM Backed: Cache coherence becomes a critical feature.
   There is a user process;
   There is a kernel running under it;
   We want to call from the user process into the kernel.
So if you ae running under a Real OS you don't need 2 sets of RFs in my
model.
OK.


Whether or not my "OS" is "Real" is still a bit debatable.
From what I can tell, it is sort of loosely in Win 3.x territory (at best).


As in: it can have multiple tasks and task switching; memory protection
is rather lacking; and it is still using cooperative scheduling (preemptive has
been experimented with, but at the moment is prone to cause stuff to
explode; I will need to "sort stuff out a bit more" and add things like
mutex locks around various things before this point).


Main obvious difference is:
  while(cond)
  {
    thrd_yield();
    cond=some_check();
  }
Is OK, but:
  while(cond)
    cond=some_check();

May potentially lock up the OS if it gets stuck in an infinite loop.


In my current "GUI experiments", its stability is an almost comedic
level of badness (to what extent things work at all).


But, then again, Win3.x in DOSBox is not exactly "rock solid" either, so
even as primitive as it is, it seems "almost within reach". Like, "It
may work, it may cause the video driver to corrupt itself (leading to a
screen of indecipherable garbage or similar), or the Windows install
might just decide to corrupt its files badly enough that one has to
reinstall it to make it work again, ...".


Though, ironically, I am still making some use of 16-color BMP images
and CRAM and similar. Slightly atypical, though, in that I am using
CRAM as a still-image format, and hacked things so that both formats
can support transparency.

Say: 16-color BMP: The "High Intensity Magenta" color can be used as a
transparent color if needed. For 8-bit CRAM, a 256-color palette is
used, with one of the colors (0x80 in this case) being used as a
transparent color.
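
Roughly, the checks amount to (helper names hypothetical; the 4-bpp
test is by color rather than by palette index):

   /* Sketch of the transparency convention described above. */
   int bmp4_transparent(unsigned r, unsigned g, unsigned b)
       { return (r == 255) && (g == 0) && (b == 255); }  /* hi-magenta */
   int cram8_transparent(int idx)
       { return idx == 0x80; }          /* per the convention above */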

Note that "actual Windows" can't load these CRAM BMP's (but, also can't
load a few of the "should work" formats either; like 2-bpp images or the
older BITMAPCOREHEADER format).

Then again, one could argue, maybe it doesn't make much sense for modern
programs to be able to load formats that haven't seen much use since the
days of CGA and Windows 1.x ?...
Post by MitchAlsup
Post by BGB
   User Process A calls a VTable entry which calls into User Process B;
   Service A uses a VTable to call into the VFS;
   ...
Say, where one is making use of horizontal context switches for
control flow between logical tasks. Which would still remain fairly
expensive under a RISC-V like model.
Yes, but PTHREADing can be done without privilege and in a single instruction.
OK.

Luckily, a thread-switch only needs to go 1-way, reducing it to around
500 cycles as-is in my case.

Theoretical minimum would be around 150-200 cycles, with most of the
savings based on eliminating around 1.5kB worth of "memcpy()"...

This need not involve an ISA change, could in theory be done by making
the SYSCALL ISR mandate that TBR be valid (and the associated compiler
changes, likely the main issue here).



Well, nevermind any cost of locating the next thread, but at the moment,
I am using a fairly simplistic round-robin scheduling strategy, so the
scheduler mostly starts at a given PID, and looks for the next PID that
holds a valid/running task (wrapping back to PID 1 if it hits the end,
and stopping the search if it gets back to the original PID).
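
Something like this minimal sketch (the task table and field names are
hypothetical, not TestKern's actual ones):

   #define TASK_MAX 256                     /* hypothetical table size */
   struct task { int running; /* ... */ };
   extern struct task *task_tab[TASK_MAX];

   int sched_next_pid(int cur_pid)
   {
       int pid = cur_pid;
       do {
           pid++;
           if (pid >= TASK_MAX)
               pid = 1;                     /* wrap back to PID 1 */
           if (task_tab[pid] && task_tab[pid]->running)
               return pid;                  /* next valid/running task */
       } while (pid != cur_pid);
       return cur_pid;                      /* nothing else runnable */
   }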


The high-level threading model wasn't based on pthreads in my case, but
rather C11 threads (and had implemented a lot of the "threads.h" stuff).

One could potentially mimic pthreads on top of C11 threads though.
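
Say, with a minimal signature-adapting shim over "threads.h" (error
handling and thread attributes omitted; names are illustrative):

   #include <stdlib.h>
   #include <threads.h>

   struct shim { void *(*fn)(void *); void *arg; };

   static int shim_start(void *p)
   {
       struct shim s = *(struct shim *)p;
       free(p);
       s.fn(s.arg);   /* pthread start routines return void *; C11 wants int */
       return 0;
   }

   int pthread_create_sketch(thrd_t *t, void *(*fn)(void *), void *arg)
   {
       struct shim *s = malloc(sizeof(*s));
       if (!s) return -1;
       s->fn = fn; s->arg = arg;
       if (thrd_create(t, shim_start, s) == thrd_success)
           return 0;
       free(s);
       return -1;
   }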

At the moment, I forget why I decided to go with C11 threads over
pthreads, but IIRC I felt at the time that C11 threads were a better fit.
Post by MitchAlsup
Post by BGB
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
   Branch from VBR-table to ISR entry point;
   Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
SP and SSP swap places on interrupt entry (currently by renumbering the
registers in the instruction decoder).

SSP is initialized early on to the SRAM stack, so when an interrupt
happens, the 'SP' register automatically becomes the SRAM stack.

Essentially, both SP and SSP are SPRs, but:
SP is mapped into R15 in the GPR space;
SSP is mapped into the CR space.

So, when executing an ISR, it is effectively using SSP as its SP.


If I were to eliminate this implicit register-swap mechanism, then the ISR
entry would likely need to reload a constant address each time. Though,
this change would also break binary compatibility with my existing code.

But, in theory, eliminating the register swap could allow demoting SP to
being a normal GPR.

Also, things like renumbering parts of the register space based on CPU
mode are expensive.


Though, some of my more recent design ideas would have gone over to an
ordering slightly more like RISC-V, say:
R0: ZR or PC (ALU or MEM)
R1: LR or TBR (ALU or MEM)
R2: SP
R3: GP (GBR)
R4 -R15: Scratch
R16-R31: Callee Save
R32-R47: Scratch
R48-R63: Callee Save

Would likely not adopt RISC-V's C ABI though.

Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
and RISC-V to still use the same register numbering.

This is already fairly close to the register numbering scheme used in
XG2RV, though the assumption was that XG2RV would have used RV's ABI,
but this was stalled out mostly due to compiler issues (getting BGBCC to
be able to follow RISC-V's C ABI rules would be a non-trivial level of
effort; but is rendered moot if one still needs to use call thunking).


The interpretation for R0 and R1 would depend on how they are used:
ALU or similar: ZR and LR (Zero and Link Register)
Load/Store Base: PC and TBR.

Idea being that in userland, TBR effectively still exists as a Read-Only
register (allowing userland to modify TBR would effectively also allow
userland to wreck the OS).


Thing is mostly that needing to renumber registers in the decoder based
on CPU mode isn't entirely free in terms of LUT cost or timing latency
(even if it only applies to a subset of the register space).

Note that for RV decoding:
X0..X31 -> R0 ..R31 (more or less)
F0..F31 -> R32..R63

But, RV's FPU instructions don't match up exactly 1:1, and some cases
would have semantic differences.

Though, it seems like most RV code could likely tolerate some deviation
in some areas (will it care that the high 32 bits of a Binary32 register
don't hold NaN? Will it care about the extra funkiness going on in LR? ...).
Post by MitchAlsup
Post by BGB
   Get some of the CRs saved off (we need R0 and R1 free here);
   Get the rest of the GPRs saved onto the stack;
   Call into the main part of the ISR handler (using normal C ABI);
   Restore most of the GPRs;
   Restore most of the CRs;
   Restore R0 and R1;
   Do an RTE.
   Branch from VBR-table to ISR entry point;
   Call into the main part of the ISR handler (using normal C ABI);
   Do an RTE.
See what it saves ??
This is fewer instructions.

But, hardware cost, and clock-cycle savings?...


As-is, I can't come up with much that is both:
Fairly cheap to implement in hardware;
Would save a lot of clock-cycles over software-based options.

As noted, the former is also why I had thus far mostly rejected the
RISC-V strategy (*).


*: Ironically, despite RISC-V having fewer GPRs, to implement the
Privileged spec, RISC-V would still end up needing a somewhat bigger
register file... Nevermind what exactly is going on with CSRs...

Say:
BJX2: 64 GPRs, ~ 14 CRs in use.
Some of the CRs defined (like the SMT set) don't currently exist.
TEAH is specific to Addr96 mode;
VIPT doesn't currently exist
Will only exist if/when inverted page tables are added.
STTB exists but isn't currently being used
Was intended for supervisor-mode page tables;
But, N/A if Supervisor Mode is reached via a task switch...

RISC-V: 3x ( 32 GPRs + 32 FPRs), 3x a bunch of CSRs.
So, theoretically, 192 registers, plus a bunch more CSRs.
Nevermind that the 'V' extension would add more registers.
Would we also need 3 copies of all the Vector registers, ... ?
MitchAlsup
2023-11-23 23:30:50 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
None in my case...
But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.
Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).
Post by MitchAlsup
Post by BGB
One other option would be to do like RISC-V's privileged spec and have
multiple copies of the register file (and likely instructions for
accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines;
There is a 5th cache line that contains all the other PSW stuff.
No direct equivalent.
I was thinking sort of like the RISC-V Privileged spec, there are
User/Supervisor/Machine sets, with the mode affecting which of these is
visible.
Obvious drawback in my case is that this would effectively increase the
number of internal GPRs from 64 to 192 (and, at that point, may as well
go to 4 copies and have 256).
If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).
Decode is not the problem; sensing 1-of-256 is a big problem: in
practice, even SRAMs only have 32 pairs of cells on a bit line, using
exotic timed sense amps.
{{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
the other ½ is fan-out buffering--driving the lines into the decoder is more
gates of delay than determining if a given select line should be asserted.}}
Post by BGB
The increase in the number of CRs could be less, since only a few of
them actually need duplication.
But, don't want to go this way, and it would only be a partial solution
that also does not map up well to my current implementation.
Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.
SH-4 banked out R0..R7 when entering an interrupt;
The VBR relative entry-point offsets were a bit, ad-hoc.
There were some fairly arbitrary displacements based on the type of
interrupt. Almost like they designed their interrupt mechanism around a
particular chunk of ASM code or something. In my case, I kept a similar
idea, but just used a fixed 8-byte spacing, with the idea of these spots
branching to the actual entry point.
Though, one other difference is in my case I ended up adding a dedicated
SYSCALL handler; on SH-4 they had used a TRAP instruction, which would
have gone to the FAULT handler instead.
It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch,
but why ?? the probability that control returns from a given IST to its
softIRQ is less than ½ in a loaded system.
Post by BGB
but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).
Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.
In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.
Stay away from anything you see in x86, except in using it as a moniker
of what to avoid.
Post by BGB
I think at one point, I had considered having tasks have both User and
Supervisor state (with two stacks and two copies of all the registers),
but ended up not going this way (and instead giving the syscalls their
own designated task context, which also saves on per-task memory overhead).
Post by MitchAlsup
Post by BGB
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Post by BGB
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 syscalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Looked into it a little more, realized that "an order of magnitude" may
have actually been a little conservative; seems like Windows syscalls
may be more in the area of 50-100k cycles.
Why exactly? Dunno.
This is still ignoring some of the "slow cases" which may take millions
of clock cycles.
It also seems like fast-ish syscalls may be more of a Linux thing.
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
OK.
Having only 1 set of registers is good...
Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.
6R6W RFs are as big as one can practically build. You can get as much
Read BW by duplication, but you only have "so much" Write BW (even when
you know each write is to a different register).
Post by BGB
Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.
6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
Post by BGB
One bit of trickery would be, "what if" the Boot SRAM region were inside
the L1 cache rather than out on the ringbus?...
2 things::
a) By giving threadstate an address you gain the ability to load the
initial RF image from ROM as the CPU comes out of reset--it comes out
with a complete RF, a complete thread.header, mapping tables, privilege
and priority.
b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
state (no underlying DRAM address available), so you have ~1MB to play
around with until you find DRAM, configure, initialize, and put it in
the free pool.
So, here, you HAVE "enough" storage to program BOOT activities in a HLL
(of your choice).
Post by BGB
But, then one would have the cost of keeping 8K of SRAM close to the CPU
core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any
case...).
Are the Icache and Dcache not close enough ?? If not then add L2 !!
Post by BGB
Though keeping it tied to a specific CPU core (and effectively processor
local) would avoid the ugly "what if" scenario of two CPU cores trying
to service an interrupt at the same time and potentially stepping on
each others' stacks. The main tradeoff vs putting the stacks in DRAM is
mostly that DRAM may have (comparably more expensive) L2 misses.
The interrupt (re)mapping table takes care of this prior to the CPU being
bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
table associated with the "Originating" thread (I/O-MMU). That interrupt
is logged into the table and if enabled its priority is used to determine
which set of CPUs should be bothered, the affinity mask of the "Originating"
thread is used to qualify which CPU from the priority set, and one of these
is selected. The selected CPU is tapped on the shoulder, and sends a get-
Interrupt request to the Interrupt table logic which sends back the priority
and number of a pending interrupt. If the CPU is still at lower priority
than the returning interrupt, the CPU <at this point> stops running code
from the old thread and begins running code on the new thread.
{{During the sending of the interrupt to the CPU and the receipt of the
claim-Interrupt message, that interrupt will not get handed to any other
CPU}} So, the CPU continues to run instructions while the CPUs contend
for and claim unique interrupts. There are 512 unique interrupts at each of
64 priority levels, and each process can have its own Interrupt Table.
These tables need no maintenance except when interrupts are created and
destroyed.}}

HV, Guest HV, Guest OS each have their own unique interrupt tables,
although it could be arranged such that all could use the same table.
Post by BGB
Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code from
the ISR stack or similar.
Post by MitchAlsup
Post by BGB
Though, could make sense if one has a mechanism where a context switch
could have a mechanism to dump the whole register file to Block-RAM,
and some sort of mechanism to access this RAM via an MMIO interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Possibly.
It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this route
(baking parts of it into the C ABI is at least "slightly" less evil).
My mechanism takes that struct task.....s (at least the part HW
needs to understand) and associates each one with a table entry that
points at DRAM. Now, when you want this thread to run, you load up the
pointer, set the e-bit (enabled), and write it into the current header
at its privilege level. Poof--all 5 cache lines of state from the
currently running thread go back to their permanent home in DRAM, and
the new thread's 5 cache lines of state are fetched.
a) you can start the reads before you start the writes
b) you can start the writes anytime you have outbound access to "the bus"
c) the writes can be no later than the ½ cycle before the reads get written.
Which is a lot faster than you can do in SW with LDs and STs.
Post by BGB
Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.
I config-space mapped all my CRs, so you get an unlimited number of them.
Post by BGB
I guess I can probably safely rule out MMIO under the basis that context
switching via moving registers via MMIO would be slower than the current
mechanism (of using a series of Load/Store instructions).
.................
Post by BGB
Post by MitchAlsup
Yes, but PTHREADing can be done without privilege and in a single instruction.
OK.
Luckily, a thread-switch only needs to go 1-way, reducing it to around
500 cycles as-is in my case.
In my case it is about MemoryLatency+5 cycles.

Yes, thread switch is a 1-way function--which is the reason you can
allow a user to preempt himself and allow a compatriot to run in his
place.....
Post by BGB
Theoretical minimum would be around 150-200 cycles, with most of the
savings based on eliminating around 1.5kB worth of "memcpy()"...
My Real Time version of MY 66000 does a 10-ish cycle context switch
(as seen at the CPU), but here a hunk of HW has gathered up those 5 cache
lines and sent them to the targeted CPU, and all the CPU has to do is push
out the old state (5 cache lines). So the data was heading towards the CPU
before the CPU even knew it wanted that data !!
Post by BGB
This need not involve an ISA change, could in theory be done by making
the SYSCALL ISR mandate that TBR be valid (and the associated compiler
changes, likely the main issue here).
Well, nevermind any cost of locating the next thread, but at the moment,
I am using a fairly simplistic round-robin scheduling strategy, so the
scheduler mostly starts at a given PID, and looks for the next PID that
holds a valid/running task (wrapping back to PID 1 if it hits the end,
and stopping the search if it gets back to the original PID).
The high-level threading model wasn't based on pthreads in my case, but
rather C11 threads (and had implemented a lot of the "threads.h" stuff).
One could potentially mimic pthreads on top of C11 threads though.
At the moment, I forgot why I decided to go with C11 threads over
pthreads, but IIRC I think I had felt at the time like C11 threads were
a better fit.
Post by MitchAlsup
Post by BGB
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
   Branch from VBR-table to ISR entry point;
   Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
SP and SSP swap places on interrupt entry (currently by renumbering the
registers in the instruction decoder).
So, in effect, you actually have 33 registers with only 32 visible at
any instant. I am just so glad not to have gone down that rabbit hole
this time......
Post by BGB
SSP is initialized early on to the SRAM stack, so when an interrupt
happens, the 'SP' register automatically becomes the SRAM stack.
SP is mapped into R15 in the GPR space;
SSP is mapped into the CR space.
So, when executing an ISR, it is effectively using SSP as its SP.
If I were to eliminate this implicit register-swap mechanism, then the ISR
entry would likely need to reload a constant address each time. Though,
this change would also break binary compatibility with my existing code.
But, in theory, eliminating the register swap could allow demoting SP to
being a normal GPR.
Also, things like renumbering parts of the register space based on CPU
mode is expensive.
Though, some of my more recent design ideas would have gone over to an
R0: ZR or PC (ALU or MEM)
R1: LR or TBR (ALU or MEM)
R2: SP
R3: GP (GBR)
R4 -R15: Scratch
R16-R31: Callee Save
R32-R47: Scratch
R48-R63: Callee Save
Would likely not adopt RISC-V's C ABI though.
R0:: GPR, Return Address, proxy for IP, proxy for 0
R1..R9 Arguments and results passed in registers
R10..R15 Temporary Registers (scratch)
R16..R29 Callee Save
R30 FP when in use, Callee Save
R31 SP
Post by BGB
Though, if one assumes R4..R63 are GPRs, this would allow both this ISA
and RISC-V to still use the same register numbering.
This is already fairly close to the register numbering scheme used in
XG2RV, though the assumption was that XG2RV would have used RV's ABI,
but this was stalled out mostly due to compiler issues (getting BGBCC to
be able to follow RISC-V's C ABI rules would be a non-trivial level of
effort; but is rendered moot if one still needs to use call thunking).
ALU or similar: ZR and LR (Zero and Link Register)
Load/Store Base: PC and TBR.
Idea being that in userland, TBR effectively still exists as a Read-Only
register (allowing userland to modify TBR would effectively also allow
userland to wreck the OS).
Thing is mostly that needing to renumber registers in the decoder based
on CPU mode isn't entirely free in terms of LUT cost or timing latency
(even if it only applies to a subset of the register space).
X0..X31 -> R0 ..R31 (more or less)
F0..F31 -> R32..R63
But, RV's FPU instructions don't match up exactly 1:1, and some cases
would have semantic differences.
Though, it seems like most RV code could likely tolerate some deviation
in some areas (will it care that the high 32 bits of a Binary32 register
don't hold NaN? Will it care about the extra funkiness going on in LR? ...).
Post by MitchAlsup
Post by BGB
   Get some of the CRs saved off (we need R0 and R1 free here);
   Get the rest of the GPRs saved onto the stack;
   Call into the main part of the ISR handler (using normal C ABI);
   Restore most of the GPRs;
   Restore most of the CRs;
   Restore R0 and R1;
   Do an RTE.
   Branch from VBR-table to ISR entry point;
   Call into the main part of the ISR handler (using normal C ABI);
   Do an RTE.
See what it saves ??
This is fewer instructions.
But, hardware cost,
the HW cost has already been purchased by the state machine that
writes out 5 cache lines and waits for 5 cache lines to arrive.

and clock-cycle savings?...
The reads can arrive before you start the writes; you can go so far
as to organize your pipeline so the read data being written pushes
out the write data that needs to return to memory, making the timing
brain-dead easy to achieve.
Post by BGB
Fairly cheap to implement in hardware;
Would save a lot of clock-cycles over software-based options.
As noted, the former is also why I had thus far mostly rejected the
RISC-V strategy (*).
Yet, you seem to be buying insurance as if you might need to head in that
direction.
Post by BGB
*: Ironically, despite RISC-V having fewer GPRs, to implement the
Privileged spec, RISC-V would still end up needing a somewhat bigger
register file... Nevermind what exactly is going on with CSRs...
Whereas that special State is only a dozen registers <with state>
in My 66000--the rest being either memory resident or memory mapped.
Robert Finch
2023-11-24 02:36:41 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by BGB
If the "memcpy's" could be eliminated, this could roughly halve the
cost of doing a syscall.
I have MM (memory move) as a 3-operand instruction.
None in my case...
But, a memcpy loop can move ~ 280-300 MB/s in the L1 cache at 50MHz.
Still might be better to not do a memcpy in these cases.
Say, if the ISR handler could "merely" reassign the TBR register to
switch from one task to another to perform the context switch (still
ignoring all the loads/stores hidden in the prolog and epilog).
Post by MitchAlsup
Post by BGB
One other option would be to do like RISC-V's privileged spec and
have multiple copies of the register file (and likely instructions
for accessing these alternate register files).
There is one CPU register file, and every running thread has an address
where that file comes from and goes to--just like a block of 4 cache lines;
There is a 5th cache line that contains all the other PSW stuff.
No direct equivalent.
I was thinking sort of like the RISC-V Privileged spec, there are
User/Supervisor/Machine sets, with the mode effecting which of these
is visible.
Obvious drawback in my case is that this would effectively increase
the number of internal GPRs from 64 to 192 (and, at that point, may as
well go to 4 copies and have 256).
If this were handled in the decoder, this would mean roughly a 9-bit
register selector field (vs the current 7 bits).
Decode is not the problem, sensing 1:256 is a big problem, in practice
even SRAMs only have 32-pairs of cells on a bit line using exotic timed
sense amps.
{{Decode is almost NEVER the logic delay problem:: ½ is situation recognition,
the other ½ is fan-out buffering--driving the lines into the decoder is more
gates of delay than determining if a given select line should be asserted.}}
Post by BGB
The increase in the number of CRs could be less, since only a few of
them actually need duplication.
But, don't want to go this way, and it would only be a partial
solution that also does not map up well to my current implementation.
Not sure how an OS on SH-4 would have managed all this, but I suspect
their interrupt model would have had similar limitations to mine.
   SH-4 banked out R0..R7 when entering an interrupt;
   The VBR relative entry-point offsets were a bit, ad-hoc.
There were some fairly arbitrary displacements based on the type of
interrupt. Almost like they designed their interrupt mechanism around
a particular chunk of ASM code or something. In my case, I kept a
similar idea, but just used a fixed 8-byte spacing, with the idea of
these spots branching to the actual entry point.
Though, one other difference is in my case I ended up adding a
dedicated SYSCALL handler; on SH-4 they had used a TRAP instruction,
which would have gone to the FAULT handler instead.
It is in-theory possible to jump from Interrupt Mode to normal
Supervisor Mode without a full context switch,
but why ?? the probability that control returns from a given IST to its
softIRQ is less than ½ in a loaded system.
Post by BGB
                                                but the specifics of
doing so would get a bit more hairy and arcane (which is sort of why I
just sorta ended up using a context switch).
Not sure what Linux on SH-4 had done, didn't really investigate this
part of the code all that much at the time.
In theory, the ISR handlers could be made to mimic the x86 TSS
mechanism, but this wouldn't gain much.
Stay away from anything you see in x86 except in using it a moniker
to avoid.
Post by BGB
I think at one point, I had considered having tasks have both User and
Supervisor state (with two stacks and two copies of all the
registers), but ended up not going this way (and instead giving the
syscalls their designated own task context; which also saves on
per-task memory overhead).
Post by MitchAlsup
Post by BGB
Worth the cost? Dunno.
In my opinion--Absolutely worth it.
Post by BGB
Not too much different to modern Windows, where slow syscalls are
still fairly common (and despite the slowness of the mechanism, it
seems like BJX2 sycalls still manage to be around an order of
magnitude faster than Windows syscalls in terms of clock-cycle cost...).
Now, just get it down to a cache missing {L1, L2} instruction fetch.
Looked into it a little more, realized that "an order of magnitude"
may have actually been a little conservative; seems like Windows
syscalls may be more in the area of 50-100k cycles.
Why exactly? Dunno.
This is still ignoring some of the "slow cases" which may take
millions of clock cycles.
It also seems like fast-ish syscalls may be more of a Linux thing.
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Why not just treat the RF as a cache with a known address in physical memory.
In MY 66000 that is what I do and then just push and pull 4 cache lines at a
time.
Possible, but poses its own share of problems...
Not sure how this could be implemented cost-effectively, or for that
matter, more cheaply than a RISC-V style mode-banked register-file.
1 RF of 32 entries is smaller than 4 RFs of 32 entries each. So, instead
of having 4 cache lines of state and 1 doubleword of address, you need
16 cache lines of state.
OK.
Having only 1 set of registers is good...
Issue is the mechanism for how to get all the contents in/out of the
register file, in a way that is both cost effective, and faster than
using a series of Load/Store instructions would have otherwise been.
6R6W RFs are as big as one can practically build. You can get as much
Read BW by duplication, but you only have "so much" Write BW (even when
you know each write is to a different register).
Post by BGB
Short of a pipeline redesign, it is unlikely to exceed a best case of
around 128 bits per clock cycle, with (in practice) there typically
being other penalties due to things like L1 misses and similar.
6R ports are 6*64-bits = 384-bits out and 384-bits in per cycle.
Post by BGB
One bit of trickery would be, "what if" the Boot SRAM region were
inside the L1 cache rather than out on the ringbus?...
a) By giving threadstate an address you gain the ability to load the
initial RF image from ROM as the CPU comes out of reset--it comes out
with a complete RF, a complete thread.header, mapping tables, privilege
and priority.
b) Those ROM-based TLB entries map to the L1 and L2 caches in Allocate
state (no underlying DRAM address availible) so you have ~1MB to play around
with until you find DRAM, configure, initialize, and put in fee-pool.)
So, here, you HAVE "enough" storage to program BOOT activities in a HLL
(of your choice).
Post by BGB
But, then one would have the cost of keeping 8K of SRAM close to the
CPU core that is mostly only ever used during interrupt handling (but,
probably still cheaper than making the register file 3x bigger, in any
case...).
Is the Icache and Dcache not close enough ?? If not then add L2 !!
Post by BGB
Though keeping it tied to a specific CPU core (and effectively
processor local) would avoid the ugly "what if" scenario of two CPU
cores trying to service an interrupt at the same time and potentially
stepping on each others' stacks. The main tradeoff vs putting the
stacks in DRAM is mostly that DRAM may have (comparably more
expensive) L2 misses.
The interrupt (re)mapping table takes care of this prior to the CPU being
bothered. A {CPU or device} sends an interrupt to the Interrupt mapping
table associated with the "Originating" thread. (IO/-MMU). That interrupt
is logged into the table and if enabled its priority is used to determine
which set of CPUs should be bothered, the affinity mask of the
"Originating"
thread is used to qualify which CPU from the priority set, and one of these
is selected. The selected CPU is tapped on the shoulder, and sends a get-
Interrupt request to the Interrupt table logic which sends back the priority
and number of a pending interrupt. If the CPU is still at lower priority
than the returning interrupt, the CPU <at this point> stops running code
from the old thread and begins running code on the new thread.
{{During the sending of the interrupt to the CPU and the receipt of the
claim-Interrupt message, that interrupt will not get handed to any other
CPU}} So, the CPU continues to run instructions while the CPUs contend
for and claim unique interrupts. There are 512 unique interrupt at each of
64 priority levels, and each process can have its own Interrupt Table.
These tables need no maintenance except when interrupts are created and
destroyed.}}
HV, Guest HV, Guest OS each have their own unique interrupt tables;
Although it could be arranged such that all could use the same table.
Post by BGB
Would add a potential "wonk" factor though, if this SRAM region were
only visible for D$ access, but inaccessible from the I$. But, I guess
one can argue, there isn't really a valid reason to try to run code
from the ISR stack or similar.
Post by MitchAlsup
Post by BGB
Though, could make sense if one has a mechanism where a context
switch could have a mechanism to dump the whole register file to
Block-RAM, and some sort of mechanism to access this RAM via an MMIO
interface.
Just put it in DRAM at SW controlled (via TLB) addresses.
Possibly.
It is also possible that some of the TBR / "struct TKPE_TaskInfo_s"
stuff could be baked into hardware... But, I don't want to go this
route (baking parts of it into the C ABI is at least "slightly" less
evil).
My mechanism is taking that struct task.....s (at least the part HW
needs to understand) and associating each one into a table that points
at DRAM. Now, when you want this thread to run, you load up the pointer
set the e-bit (enabled) and write it into the current header at its
privilege level. Poof--all 5 cache lines of state from the currently
running thread goes back to where it permanent home in DRAM is, and
the new thread fetches 5 cache lines of state of the new thread.
a) you can start the reads before you start the writes
b) you can start the writes anytime you have outbound access to "the bus"
c) the writes can be no late than the ½ cycle before the reads get written.
Which is a lot faster than you can do in SW with LDs and STs.
Post by BGB
Also possible could be to add another CR for "Dump context registers
here", this adds the costs of another CR though.
I config-space mapped all my CRs, so you get an unlimited number of them.
Post by BGB
I guess I can probably safely rule out MMIO under the basis that
context switching via moving registers via MMIO would be slower than
the current mechanism (of using a series of Load/Store instructions).
.................
Post by BGB
Post by MitchAlsup
Yes, but PTHREADing can be done without privilege and in a single instruction.
OK.
Luckily, a thread-switch only needs to go 1-way, reducing it to around
500 cycles as-is in my case.
In my case it is about MemoryLatency+5 cycles.
Yes, thread switch is a 1-way function--which is the reason you can
allow a user to preempt himself and allow a compatriot to run in his
place.....
Post by BGB
Theoretical minimum would be around 150-200 cycles, with most of the
savings based on eliminating around 1.5kB worth of "memcpy()"...
My Real Time version of MY 66000 does 10-ish cycle context switch
(as seen at the CPU) but here a hunk of HW has gathered up those 5 cache
lines and sent them to the targeted CPU and all the CPU has to do is push
out the old state (5-cache liens) So the data was heading towards the
CPU before the CPU even knew it wanted that data !!
Post by BGB
This need not involve an ISA change, could in theory be done by making
the SYSCALL ISR mandate that TBR be valid (and the associated compiler
changes, likely the main issue here).
Well, nevermind any cost of locating the next thread, but at the
moment, I am using a fairly simplistic round-robin scheduling
strategy, so the scheduler mostly starts at a given PID, and looks for
the next PID that holds a valid/running task (wrapping back to PID 1
if it hits the end, and stopping the search if it gets back to the
original PID).
The high-level threading model wasn't based on pthreads in my case,
but rather C11 threads (and had implemented a lot of the "threads.h"
stuff).
One could potentially mimic pthreads on top of C11 threads though.
At the moment, I forgot why I decided to go with C11 threads over
pthreads, but IIRC I think I had felt at the time like C11 threads
were a better fit.
Post by MitchAlsup
Post by BGB
One could have enough register banks for N logical tasks, but
supporting 4 or 8 copies of the register file is going to cost more
than 2 or 3.
Above, I was describing what the hardware was doing.
   Branch from VBR-table to ISR entry point;
   Get R0 and R1 saved onto the stack;
Where did you get the address of this stack ??
SP and SSP swap places on interrupt entry (currently by renumbering
the registers in the instruction decoder).
So, in effect, you actually have 33 registers with only 32 visible at
any instant. I am just so glad not to have gone down that rabbet hole
this time......
Post by BGB
SSP is initialized early on to the SRAM stack, so when an interrupt
happens, the 'SP' register automatically becomes the SRAM stack.
   SP is mapped into R15 in the GPR space;
   SSP is mapped into the CR space.
So, when executing an ISR, it is effectively using SSP as its SP.
If I were eliminate this implicit register-swap mechanism, then the
ISR entry would likely need to reload a constant address each time.
Though, this change would also break binary compatibility with my
existing code.
But, in theory, eliminating the register swap could allow demoting SP
to being a normal GPR.
Also, things like renumbering parts of the register space based on CPU
mode is expensive.
Though, some of my more recent design ideas would have gone over to an
   R0: ZR or PC  (ALU or MEM)
   R1: LR or TBR (ALU or MEM)
   R2: SP
   R3: GP (GBR)
   R4 -R15: Scratch
   R16-R31: Callee Save
   R32-R47: Scratch
   R48-R63: Callee Save
Would likely not adopt RISC-V's C ABI though.
R0::     GPR, Return Address, proxy for IP, proxy for 0
R1..R9   Arguments and results passed in registers
R10..R15 Temporary Registers (scratch)
R16..R29 Callee Save
R30      FP when in use, Callee Save
R31      SP
Post by BGB
Though, if one assumes R4..R63 are GPRs, this would allow both this
ISA and RISC-V to still use the same register numbering.
This is already fairly close to the register numbering scheme used in
XG2RV, though the assumption was that XG2RV would have used RV's ABI,
but this was stalled out mostly due to compiler issues (getting BGBCC
to be able to follow RISC-V's C ABI rules would be a non-trivial level
of effort; but is rendered moot if one still needs to use call thunking).
   ALU or similar: ZR and LR (Zero and Link Register)
   Load/Store Base: PC and TBR.
Idea being that in userland, TBR effectively still exists as a
Read-Only register (allowing userland to modify TBR would effectively
also allow userland to wreck the OS).
Thing is mostly that needing to renumber registers in the decoder
based on CPU mode isn't entirely free in terms of LUT cost or timing
latency (even if it only applies to a subset of the register space).
   X0..X31 -> R0 ..R31 (more or less)
   F0..F31 -> R32..R63
But, RV's FPU instructions don't match up exactly 1:1, and some cases
would have semantic differences.
Though, it seems like most RV code could likely tolerate some
deviation in some areas (will it care that the high 32 bits of a
Binary32 register don't hold the expected NaN-boxing pattern? Will it
care about the extra funkiness going on in LR? ...).
Post by MitchAlsup
Post by BGB
   Get some of the CRs saved off (we need R0 and R1 free here);
   Get the rest of the GPRs saved onto the stack;
   Call into the main part of the ISR handler (using normal C ABI);
   Restore most of the GPRs;
   Restore most of the CRs;
   Restore R0 and R1;
   Do an RTE.
   Branch from VBR-table to ISR entry point;
   Call into the main part of the ISR handler (using normal C ABI);
   Do an RTE.
See what it saves ??
This is fewer instructions.
But, hardware cost,
the HW cost has already been purchased by the state machine that writes
out 5-cache lines and waits for 5-cache lines to arrive.
and clock-cycle savings?...
The reads can arrive before you start the writes; you can go so far as
to organize your pipeline so the read data being written pushes
out the write data that needs to return to memory, making the timing
brain-dead easy to achieve.
Post by BGB
   Fairly cheap to implement in hardware;
   Would save a lot of clock-cycles over software-based options.
As noted, the former is also why I had thus far mostly rejected the
RISC-V strategy (*).
Yet, you seem to be buying insurance as if you might need to head in that
direction.
Post by BGB
*: Ironically, despite RISC-V having fewer GPRs, to implement the
Privileged spec, RISC-V would still end up needing a somewhat bigger
register file... Nevermind what exactly is going on with CSRs...
Whereas that special State is only a dozen registers <with state>
in My 66000--the rest being either memory resident or memory mapped.
My 68000 CPU core had a couple of task switching instructions added to
it. I made a dedicated task switch RAM wide enough to load or store all
the 68k registers in a single clock. Total task switch time was about
four clocks IIRC. The interrupt vector table was setup to be able to
automatically task switch on interrupt. The RAM had storage for up to
512 tasks, but it was dedicated inside the CPU core rather than storing
task information in the memory system.

Q+ has a 64 register file, so it would take eight or nine cache lines to
store the context. Q+ register file is 4w18r ATM. Getting from the
register file to or from a cache line is a challenge. To access groups
of eight registers at once would mean adding or using eight register
file ports. The register file has only four write ports so only ½ of a
cache line could be written to the file in a clock cycle. It is
appealing to handle multiple registers per clock. Read/write ports are
dedicated to specific function units, so making use of them for task
switching may involve additional logic. I called the CSR that stores the
task-state address the TS CSR.

As I understand it normally RISCV does not use multiple register files,
it has only a single file. There may be implementations out there that
do make use of multiple files, but I think the standard is setup to get
by with a single file.
MitchAlsup
2023-11-24 03:11:17 UTC
Permalink
Post by Robert Finch
Post by MitchAlsup
Whereas that special State is only a dozen register <with state>
in My 66000--the rest being either memory resident or memory mapped.
My 68000 CPU core had a couple of task switching instructions added to
it. I made a dedicated task switch RAM wide enough to load or store all
the 68k registers in a single clock. Total task switch time was about
four clocks IIRC. The interrupt vector table was setup to be able to
automatically task switch on interrupt. The RAM had storage for up to
512 tasks, but it was dedicated inside the CPU core rather than storing
task information in the memory system.
This is headed in the right direction. Make context switching something
easy to pull off.
Post by Robert Finch
Q+ has a 64 register file, so it would take eight or nine cache lines to
store the context. Q+ register file is 4w18r ATM. Getting from the
register file to or from a cache line is a challenge. To access groups
of eight registers at once would mean adding or using eight register
file ports. The register file has only four write ports so only ½ of a
cache line could be written to the file in a clock cycle. It is
appealing to handle multiple registers per clock. Read/write ports are
dedicated to specific function units, so making use of them for task
switching may involve additional logic. I called the CSR to store the
task state address the TS CSR.
4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
Yet you chose 18. Why ?

This is above and beyond the "typical" operand consumption of a RISC ISA.
Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
12R, allowing 1 FU to consume 3 registers and 1 FU having only 1 operand
(or forwarding)). What are you using the other 5 operands for ??
Post by Robert Finch
As I understand it normally RISCV does not use multiple register files,
RISC-V has a 32 entry GPR and a 32 entry FPR.
Post by Robert Finch
it has only a single file. There may be implementations out there that
do make use of multiple files, but I think the standard is setup to get
by with a single file.
Robert Finch
2023-11-24 04:37:54 UTC
Permalink
Post by MitchAlsup
Post by Robert Finch
Post by MitchAlsup
Whereas that special State is only a dozen register <with state>
in My 66000--the rest being either memory resident or memory mapped.
My 68000 CPU core had a couple of task switching instructions added to
it. I made a dedicated task switch RAM wide enough to load or store
all the 68k registers in a single clock. Total task switch time was
about four clocks IIRC. The interrupt vector table was setup to be
able to automatically task switch on interrupt. The RAM had storage
for up to 512 tasks, but it was dedicated inside the CPU core rather
than storing task information in the memory system.
This is headed in the right direction. Make context switching something
easy to pull off.
Post by Robert Finch
Q+ has a 64 register file, so it would take eight or nine cache lines
to store the context. Q+ register file is 4w18r ATM. Getting from the
register file to or from a cache line is a challenge. To access groups
of eight registers at once would mean adding or using eight register
file ports. The register file has only four write ports so only ½ of a
cache line could be written to the file in a clock cycle. It is
appealing to handle multiple registers per clock. Read/write ports are
dedicated to specific function units, so making use of them for task
switching may involve additional logic. I called the CSR to store the
task state address the TS CSR.
4W generally ends up with 4R and replications lead to 8R 12R 16R and 20R.
Yet you chose 18. Why ?
This is above and beyond the "typical" operand consumption of a RISC ISA.
Your typical 4-wide RISC ISA would have 8R (6-wide is better balanced at
12R allowing 1 FU to consume 3-registers and 1 FU having only 1-operand
(or forwarding). What are you using the other 5-operands for ??
Post by Robert Finch
As I understand it normally RISCV does not use multiple register files,
RISC-V has a 32 entry GPR and a 32 entry FPR.
Post by Robert Finch
it has only a single file. There may be implementations out there that
do make use of multiple files, but I think the standard is setup to
get by with a single file.
I have 4w1r replicated 18 times. That is enough read ports to supply
three operands each to six functional units. All six functional units
may be scheduled at the same time. I have thought of trying to use fewer
read ports by prioritizing the ports as it is unlikely that all ports
would be needed at the same time. The current design is simple, but not
resource efficient. Six function units are ALU0, ALU1, FPU, FCU, LOAD,
STORE. The FCU really only needs two source operands.

There is no forwarding in the design (yet). I have read this costs about
10% in performance. I think this may be made up for by a smaller design
that can operate at a higher fmax. I have found in the past that
forwarding muxes appear on the critical timing path. I have seen another
design eliminate forwarding; it made the difference between operating
at 50 MHz and 60 MHz+, a 20% gain in fmax. I think this may be an aspect of
an FPGA implementation.

Robert Finch
2023-11-22 03:47:26 UTC
Permalink
Post by MitchAlsup
Post by BGB
For some ops, the 3rd register (Ro) would instead operate as a 5-bit
immediate/displacement field. Which was initially a similar idea, with
the 32-bit space mirroring the 16-bit space.
Almost all My 66000 {1,2,3}-operand instructions can convert a 5-bit register
specifier into a 5-bit immediate of either positive or negative integer
value, making things like
    1<<n
   ~0<<n
    container.bitfield = 7;
single instructions.
Q+ CPU allows immediates of any length to be used in place of source
operand register values via postfix instructions. Virtually all
instructions may use immediates instead of registers. There are also
quick immediate form instructions that have the second source operand as
an immediate constant encoded directly in the instruction as this is the
most common use.

The postfix immediate instructions come in four lengths: 23-bit, 39-bit,
71-bit and 135-bit. Currently float values make use of only 32 or 64 bits
out of the 39 and 71-bit formats. I have been pondering having the float
immediates left-aligned with additional trailing bits. These bits are
zero for now.

Postfixes are treated as part of the current instruction by the CPU.
Post by MitchAlsup
Post by BGB
Thread: Logical thread of execution within some existing process;
         has a register file and a stack.
Process: Distinct collection of 1 or more threads within a shared
         address space and shared process identity (may have its own
         address space, though as-of-yet, TestKern uses a shared global
         address space); has a memory map, a heap, and a vector of threads.
Task: Supergroup that includes Threads, Processes, and other
      thread-like entities (such as call and method handlers); may be
      either thread-like or process-like.
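
Roughly, those relationships as C structs (a loose illustration; the
field names are assumptions, not TestKern's actual definitions):

  #include <stdint.h>

  struct thread {
      uint64_t regs[64];          /* saved register file */
      void *stack_base;
      struct process *owner;      /* containing process, if any */
  };

  struct process {
      void *memory_map;           /* may be the shared global space */
      void *heap;
      struct thread **threads;    /* vector of threads */
      int n_threads;
  };

  /* A task is the supergroup: thread-like or process-like. */
  struct task {
      int kind;                   /* thread, process, handler, ... */
      union {
          struct thread *thr;
          struct process *proc;
      } u;
  };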
Where, say, the Syscall interrupt handler doesn't generally handle
syscalls itself (since the ISRs will only have access to
physically-mapped addresses), but effectively instead initiates a
context switch to the task that can handle the request (or, to context
switch back to the task that made the request, or to yield to another
task, ...).
We call these things:: dispatchers.
Post by BGB
Though, I will probably need to add more special-case handling such that
the Syscall task cannot yield or itself try to make a syscall (the
only valid exit point for this task being where it transfers control
back to the caller and awaits the next syscall to arrive; it is
not valid for this task to try to syscall back into itself).
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
Does it follow the same way for hardware interrupts? I think RISCV goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.
MitchAlsup
2023-11-22 19:36:28 UTC
Permalink
Post by Robert Finch
Post by MitchAlsup
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
Does it follow the same way for hardware interrupts? I think RISCV goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.
It depends: there is the school of thought that you just deliver control to
someone who can always deal with it (Machine level in RISC-V), and there
is the other school of thought that some table should encode which level
of the system control is delivered to. The former allows SW to control
every step of the process; the latter gets rid of all the SW checking
and simplifies the process of getting to and back from interrupt handlers
(and their associated soft IRQs).
Scott Lurndal
2023-11-23 19:17:14 UTC
Permalink
Post by MitchAlsup
Post by Robert Finch
Post by MitchAlsup
In My 66000, every <effective> SysCall goes deeper into the privilege
hierarchy. So, Application SysCalls Guest OS, Guest OS SysCalls Guest HV,
Guest HV SysCalls real HV. No data structures need maintenance during
these transitions of the hierarchy.
Does it follow the same way for hardware interrupts? I think RISCV goes
to the deepest level first, machine level, then redirects to lower
levels as needed. I was planning on Q+ operating the same way.
It depends: there is the school of thought that you just deliver control to
someone who can always deal with it (Machine level in RISC-V), and there
is the other school of thought that some table should encode which level
of the system control is delivered to. The former allows SW to control
every step of the process; the latter gets rid of all the SW checking
and simplifies the process of getting to and back from interrupt handlers
(and their associated soft IRQs).
ARMv8 allows the interrupt and fast interrupt (IRQ, FIQ) signals to be
delivered to the EL1 (operating system) ring unless system registers at
higher (more privileged) exception levels trap the signal. EL3 (firmware)
level is the most privileged level and generally 'owns' the FIQ signal,
while the IRQ signal is owned by EL1 (bare metal OS) or EL2 (hypervisor).

The destination exception level of each signal is controlled by
bits in system registers (SCR_EL3 to direct them to EL3, HCR_EL2 to
direct them to EL2).

Interrupts can be assigned to one of two groups - group 0 which is
always delivered as an FIQ and group 1 which is delivered as an IRQ.

Group zero interrupts are considered "secure" interrupts and only
secure accesses can modify the configuration of such interrupts.

Group one interrupts can be either non-secure or secure depending on
the security state of the target exception level (secure or non-secure).

The higher-priority half of the interrupt priority range (8 bits) is
considered secure, the rest non-secure; thus secure interrupts will always
have higher priority than non-secure interrupts.

There is no software "checking" required.

Exception return (i.e. context switch) loads the PSR from SPSR_ELx and
the PC from ELR_ELx[*] and that's the entirety of the software visible state
handled by the hardware. Each exception level has its own page table
root registers (TTBR0_ELx, TTBR1_ELx for each half of the VA space), so
there is nothing for software to reload. Hardware manages the TLB entries
which are tagged with both security state and exception level.

[*] Both are system registers (flops, not ram)

[**] The secure flag (!SCR_EL3[NS]) acts like an 'invisible'
address bit at bit N (where N is the number of bits of supported
physical address). This provides two completely distinct N-bit
address spaces - one secure and one non-secure with SCR_EL3[NS]
controlling which space is used by accesses. NS only applies
to EL 0 - 2, EL3 is always considered secure. N is typically 48,
but can be up to 52 in the current versions of the architecture.
MitchAlsup
2023-11-12 21:35:11 UTC
Permalink
Post by BGB
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoe in", except for cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.
<
For the reasons stated above (some clipped) I agree with this whole block of
statements.
<
Since My 66000 has 32 registers, I went with up to 8 arguments in registers,
up to 8 results in registers, with the 9th of either on-the-stack in such a
way that if the callee is vararg the argument registers can be pushed on the
stack to form a memory-resident vector of arguments {{just perfect for
printf().}}
<
With 8 registers covering the 98%-ile of calls, there is too little left
to gain by making this boundary 12-16, both of which ARE still possible.
<
Post by BGB
Though, admittedly, for a lot of my programs I had still ended up going
with 8 register arguments with 64 GPRs, mostly as the gains of 16
arguments is small, relative of the cost of spending an additional 64
bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16 argument mode).
<
It is a delicate balance and it is easy to make the code look better
while actually running slower.
<
Post by BGB
....
32-bit primary instruction size;
32/64/96 bit for variable-length instructions;
Is "pretty good".
In performance-oriented use cases, 16-bit encodings "aren't really worth
it".
In cases where you need a 32 or 64 bit value, being able to encode them
or load them quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).
(Rb, Disp) : ~ 66-75%
(Rb, Ri) : ~ 25-33%
Can address the vast majority of cases.
Displacements are most effective when scaled by the size of the element
type, as unaligned displacements are exceedingly rare. The vast majority
of displacements are also positive.
Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".
Most other possible addressing modes can be mostly ignored.
Auto-increment becomes moot if one has superscalar or VLIW;
(Rb, Ri, Disp) is only really applicable in niche cases
Eg, array inside struct, etc.
...
RISC-V did sort of shoot itself in the foot in several of these areas,
SHnADD, can mimic a LEA, allowing array access in fewer ops.
PACK, allows an inline 64-bit constant load in 5 instructions...
LUI+ADD+LUI+ADD+PACK
...
Still not ideal...
An extra cycle for memory access is not ideal for a close second place
addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.
But, still better than the situation where one does not have these
instructions.
....
BGB
2023-11-13 02:21:13 UTC
Permalink
Post by MitchAlsup
Post by BGB
* Probably 8 or 16.
** 8 makes the most sense with 32 GPRs.
*** 16 is asking too much.
*** 8 deals with around 98% of functions.
** 16 makes sense with 64 GPRs.
*** Nearly all functions can use exclusively register arguments.
*** Gain is small though, if it only benefits 2% of functions.
*** It is almost a "shoe in", except for cost of fixed spill space
*** 128 bytes at the bottom of every non-leaf stack-frame is noticeable.
*** Though, an ABI could decide to not have a spill space in this way.
<
For the reasons stated above (some clipped) I agree with this whole
block of statements.
<
Since My 66000 has 32 registers, I went with up to 8 arguments in registers,
up to 8 results in registers, with the 9th of either on-the-stack in such a
way that if the callee is vararg the argument registers can be pushed on the
stack to form a memory-resident vector of arguments {{just perfect for
printf().}}
<
With 8 registers covering the 98%-ile of calls, there is too little left
to gain by making this boundary 12-16, both of which ARE still possible.
<
Yeah.

Short of things like using 128-bit pointers, or lots of 128-bit
arguments (with an ABI that expresses these in pairs), the 8 argument
ABI seems to be slightly ahead here (even with 64 registers).


Mostly, because 2% of functions needing to use memory arguments seems to
cost less than the indirect cost of every other non-leaf function
needing to reserve an extra 64 bytes in the stack frame.

Had considered a possible ABI tweak where functions that only call other
functions with fewer than 8 register arguments (likely excluding
vararg) only need to reserve space for the first 8 arguments.

But, the gains are likely to be rather small compared to the added
debugging effort.
Post by MitchAlsup
Post by BGB
Though, admittedly, for a lot of my programs I had still ended up
going with 8 register arguments with 64 GPRs, mostly as the gains of
16 arguments is small, relative of the cost of spending an additional
64 bytes in nearly every stack frame (and also there are still some
unresolved bugs when using 16 argument mode).
<
It is a delicate balance and it is easy to make the code look better
while actually running slower.
<
Yeah.

I suspect it is likely due mostly to something like L1 cache misses or
similar (bigger stack frame, more area for the L1 cache to miss).


OTOH: Had recently added the logic to shuffle prolog register-stores in
an attempt to reduce WAW stalls. Turned out, fully aligning stuff would
be a much bigger pain than initially hoped (the curse of multiple cases
of duplicated logic that needs to operate in lockstep).


Did come up with an intermediate option (see the sketch below):
Generate a temporary array of which registers are saved at which offsets;
Generate a permutation array for which order to store these registers;
Initial permutation uses simple XOR shuffling;
Have a function to model the WAW cost of each permutation;
Shuffle the permutations with a PRNG (up to N times);
Pick the permutation with the smallest WAW cost.

Mostly works OK, but granted, nearly any ordering is better at this
metric than saving them in a linear order.
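
As a rough C sketch of the above (the cost model here is a stand-in;
the real one would model the pipeline's actual WAW stall rules):

  #include <stdlib.h>

  /* Stand-in cost: penalize adjacent stores whose stack offsets land
     in the same aligned 16-byte pair. */
  static int waw_cost(const int *offs, const int *perm, int n) {
      int i, cost = 0;
      for (i = 0; i + 1 < n; i++)
          if ((offs[perm[i]] >> 4) == (offs[perm[i + 1]] >> 4))
              cost++;
      return cost;
  }

  /* Try 'tries' PRNG-shuffled permutations, keep the cheapest.
     Assumes n <= 64 (at most 64 saved registers). */
  void pick_store_order(const int *offs, int *best, int n, int tries) {
      int perm[64], i, best_cost;
      for (i = 0; i < n; i++)
          best[i] = perm[i] = i ^ 1;         /* simple XOR shuffle */
      if (n & 1)
          best[n - 1] = perm[n - 1] = n - 1; /* keep odd tail in range */
      best_cost = waw_cost(offs, best, n);
      while (tries-- > 0) {
          int a = rand() % n, b = rand() % n, c;
          int t = perm[a]; perm[a] = perm[b]; perm[b] = t;
          c = waw_cost(offs, perm, n);
          if (c < best_cost) {
              best_cost = c;
              for (i = 0; i < n; i++) best[i] = perm[i];
          }
      }
  }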

Though, doesn't really gain much if the forwarding option is enabled.



Relatedly, was also able to make Doom a little faster with another trick:
Instead of drawing into an off-screen buffer, and then copying this to
the screen in the form of a DIB Bitmap object...

There can be functions to request and release framebuffers for a given
Drawing-Context (with a supplied BITMAPINFOHEADER; this request failing
and returning NULL if the BITMAPINFOHEADER doesn't match the format used
by the HDC or similar; forcing fallback to the older method).

Similarly, there is a "SwapBuffers" style call, with these buffers
effectively operating in a double-buffering style.

In effect, it is an interface slightly more like what SDL uses.
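
The rough shape of that interface (the type and function names here are
guesses for illustration; only the BITMAPINFOHEADER check, the
NULL-on-mismatch fallback, and the SwapBuffers-style call are from the
description above):

  /* Minimal stand-ins for the DIB-style types. */
  typedef void *HDC;
  typedef struct {
      unsigned biSize;
      int biWidth;
      int biHeight;               /* negative => top-left origin */
      unsigned short biPlanes, biBitCount;
  } BITMAPINFOHEADER;

  /* Hypothetical entry points. */
  extern void *RequestFramebuffer(HDC hdc, BITMAPINFOHEADER *bmi);
  extern void ReleaseFramebuffer(HDC hdc, void *fb);
  extern void SwapBuffers(HDC hdc);

  static void draw_frame(HDC hdc, BITMAPINFOHEADER *bmi,
                         void (*render)(void *fb)) {
      /* Request fails (NULL) if bmi doesn't match the HDC's native
         format; the caller then falls back to the DIB-copy path. */
      void *fb = RequestFramebuffer(hdc, bmi);
      if (!fb)
          return;
      render(fb);                 /* draw directly into the buffer */
      SwapBuffers(hdc);           /* present; double-buffered flip */
      /* ReleaseFramebuffer(hdc, fb) at teardown, not per-frame. */
  }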


Was kind of a hassle to modify Doom to play well with double buffering
though; initially it was a strobe-filled / flickering mess, with the
status bar effectively having a seizure. Does still have the annoyance
that when one noclips through a wall, then whatever garbage is left over
is now prone to a strobe effect.

However, using shared buffers and then having Doom draw into them, does
reduce the amount of framebuffer copying needed for each screen update.


As-is, it will currently only work in 320x200 hi-color mode (where
biHeight==-200, negative height indicating an origin in the
top-left corner).


However, the DIB drawing method does allow more flexibility here (the
internal bitmap can be in a wider range of formats, and will be
converted as needed).

Granted, one can note that things like pixel format conversion and
similar aren't free.



Also recently encountered a video online where someone was running Doom
on a 386, and the framerates *sucked*... ( Like, mostly single-digit
territory, and with somewhat longer load-times as well. )

Can at least probably say, with reasonable confidence, that my BJX2 core
is faster than a 386...

Some other information implies that the speeds I am seeing are more
on-par with a high-end 486 or maybe a low-end Pentium.

( Nevermind that Quake performance is still crap in my case... )

( Somehow, it seems like old computers were generally worse and less
capable than my childhood self remembered. )




Formats supported in DIB form at present:
RGB555, RGB24, RGBA32, Indexed 1/2/4/8-bit, UTX2.

Formats used by the display hardware:
Color-Cell 8x8 as 4x 4x4x2bpp (2 endpoints per 4x4 cell);
Color-Cell 8x8x1 (2 color endpoints).
Also used for text-mode display.
4x4x16bit RGB555
4x4x8bit Indexed
(New/Experimental) Linear RGB555 and Indexed 8-bit
Framebuffer pixels now in a conventional linear raster ordering.
Also, the framebuffer is now movable, allowing double-buffering.
Framebuffer will require a 32 byte alignment though.
And needs to be in a physically-mapped address range.


Still don't have any "good" 256 color palettes:
6*6*6 and 6*7*6 (216 and 252 color)
Good for bright cartoony graphics, poor for much else.
Generally loses any detail in things like shading.
6*7*6 can't do grays effectively, only purple and green tints.
16 shades of 16 colors
Better "in general", obvious color distortion for cartoon images
13 shades of 19 colors (*1)
Slightly better than the previous
Mostly cutting off "near black" for additional colors.
Say: adding an Orange, Olive-Green, and Sky-Blue gradient.
Don't need 48 colors of "almost black"...

I don't know of any palette optimization algorithms that are fast enough
to run in real-time on the BJX2 core (I suspect "in the old days",
palette optimization was likely offline only).

Granted, other palettes are possible, mostly just the difficulty of
finding an organization that "looks good in the general case".

*1:
0z: Gray
1z: Blue (High Sat)
2z: Green (High Sat)
3z: Cyan (High Sat)
4z: Red (High Sat)
5z: Magenta (High Sat)
6z: Yellow (High Sat)
7z: Pink (Off-White)
8z: Beige (Off-White)
9z: Blue (Low Sat)
Az: Green (Low Sat)
Bz: Cyan (Low Sat)
Cz: Red (Low Sat)
Dz: Magenta (Low Sat)
Ez: Yellow (Low Sat)
Fz: Sky Blue (Off-White)

z0: Orange (Mid Sat)
z1: Olive (Mid Sat)
z2: Sky Blue (Mid Sat)

00: Black
01, 02: Very dark gray.
10/11/12/20/21/22: Various other "nearly black" colors.
Technically, the bottoms of the orange/olive/sky bars;
But, these can effectively "merge" the other colors.

In my fiddling, this was generally the "best performing" palette layout
I could seem to find thus far.
Post by MitchAlsup
Post by BGB
....
   32-bit primary instruction size;
   32/64/96 bit for variable-length instructions;
   Is "pretty good".
In performance-oriented use cases, 16-bit encodings "aren't really
worth it".
In cases where you need a 32 or 64 bit value, being able to encode
them or load them quickly into a register is ideal. Spending multiple
instructions to glue a value together isn't ideal, nor is needing to
load it from memory (this particularly sucks from the compiler POV).
   (Rb, Disp) : ~ 66-75%
   (Rb, Ri)   : ~ 25-33%
Can address the vast majority of cases.
Displacements are most effective when scaled by the size of the
element type, as unaligned displacements are exceedingly rare. The
vast majority of displacements are also positive.
Not having a register-indexed mode is shooting oneself in the foot, as
these are "not exactly rare".
Most other possible addressing modes can be mostly ignored.
   Auto-increment becomes moot if one has superscalar or VLIW;
   (Rb, Ri, Disp) is only really applicable in niche cases
     Eg, array inside struct, etc.
   ...
RISC-V did sort of shoot itself in the foot in several of these areas,
   SHnADD, can mimic a LEA, allowing array access in fewer ops.
   PACK, allows an inline 64-bit constant load in 5 instructions...
     LUI+ADD+LUI+ADD+PACK
   ...
Still not ideal...
An extra cycle for memory access is not ideal for a close second place
addressing mode; nor are 64-bit constants rare enough that one
necessarily wants to spend 5 or so clock cycles on them.
But, still better than the situation where one does not have these
instructions.
....
MitchAlsup
2023-11-10 18:29:56 UTC
Permalink
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
It's only in the 16-bit operate instructions that this splitting of
registers is actively present as a constraint. It is needed to make
16-bit operate instructions possible.
So the cure is that if a compiler finds this too much trouble, it
doesn't have to use the 16-bit instructions.
<
Then why are they there ??
<
I think you will find (like RISC-V is) that having and not mandating use
means you get a bit under ½ of what you think you are getting.
<
Post by Quadibloc
Of course, if compilers can't use them, that raises the question of
whether 16-bit instructions are worth having. Without them, the
complications that I needed to be happy about my memory-reference
instructions could have been entirely avoided.
<
There is a subset of RISC-V designers who want to discard the 16-bit
subset in order to solve the problems of the 32-bit set.
<
I might note: given the space of the compressed ISA in RISC-V, I could
install the entire My 66000 ISA and then not need any of the RISC-V
ISA.....
<
Post by Quadibloc
John Savard
Thomas Koenig
2023-11-10 22:03:23 UTC
Permalink
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
This breaks with the central tenet of the /360, the PDP-11,
the VAX, and all RISC architectures: (Almost) all registers are
general-purpose registers.

This would make your ISA very un-S/360-like.
MitchAlsup
2023-11-10 23:25:41 UTC
Permalink
Post by Thomas Koenig
Post by Quadibloc
Post by BGB-Alt
Post by Quadibloc
The 32 general registers aren't _quite_ general. They're divided into
four groups of eight.
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
For 32-bit instructions, the only implication is that the first few
integer registers would be used as index registers, and the last few
would be used as base registers, which is likely to be true in any
case.
This breaks with the central tenet of the /360, the PDP-11,
the VAX, and all RISC architectures: (Almost) all registers are
general-purpose registers.
<
But follows S.E.L 32/{...} series and several other minicomputers with
isolated base registers. In the 32/{..} series, there were 2 LDs and 2 STs:
1st LD was byte (signed) with 19-bit displacement
2nd LD was sized (signed) with the lower bits of displacement specifying size.
1st ST was byte <ibid>
2nd ST was sized <ibid>
<
only registers 1-7 could be used as base registers.
<
I saw several others using similar tricks but can't remember.....
<
Post by Thomas Koenig
This would make your ISA very un-S/360-like.
Quadibloc
2023-11-11 05:39:59 UTC
Permalink
Post by BGB-Alt
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
No doubt you're right.

As that means my 16-bit instructions, with the registers split into four
parts, are useless to compilers, now I have to go around in circles again.
I thought I had finally achieved a single instruction format that satisfied
my ambitions - and now I find it is fatally flawed.

One possibility is to go back to the full format for 32-bit memory
reference instructions. That will still leave me enough opcode space that a
four-bit prefix could precede three 20-bit short instructions. To avoid
creating a variable-length instruction set, which complicates decoding,
I would require such blocks to be aligned on 64-bit boundaries.

So now there's a nested block structure, of 64-bit blocks inside 256-bit
blocks!

John Savard
Quadibloc
2023-11-12 20:55:27 UTC
Permalink
Post by BGB-Alt
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
Unless, maybe, registers were being treated like a stack, but even then,
this is still gonna suck.
Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
This led me to seriously reconsider the path down which I was
heading.

I had tried, with all sorts of ingenious compromises of register spaces and
the like, to fit all the capabilities I wanted into the opcode space of a
single version of the instruction set, eliminating the need for blocks
which contained instructions belonging to alternate versions of the
instruction set.

But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.

At first, when I mulled over this, I came up with multiple ideas to address
it, each one crazier than the last.

Seeing, therefore, that this was a difficult nut to crack, and not wanting
to go down in another wrong direction... instead, I found a way to go that
seemed to me to be reasonably sensible.

Go back to uncompromised 32-bit instructions, even though that means there
are no 16-bit instructions.

Then, bring back short instructions - effectively 17 bits long - so as to
have room for full register specifications. This means an alternative block
format where 16, 32, 48, 64... bit instructions are all possible.

*But* because of the room 17-bit short instructions take up in the header,
the 32-bit instructions are the same regular format as in the other case.
Not some kind of 33-bit or 35-bit instruction with a new set of instruction
formats.

So, even though there are now two formats for code instead of one, one is
merely the 32-bit subset of the other, so that although I have taken a step
back in order to take steps forward, it still isn't too far back.

I'm _trying_ to keep a lid on the extravagances in Concertina II, even if
using the word "sanity" in the same breath with it may be considered
inappropriate...

John Savard
Anton Ertl
2023-11-12 22:09:24 UTC
Permalink
Post by Quadibloc
Post by BGB-Alt
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
...
Post by Quadibloc
Post by BGB-Alt
Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
...
Post by Quadibloc
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
It works for the RISC-V C (compressed) extension. Some of these
compressed instrutions use registers 8-15 (others use all 32
registers, but have other restrictions). But it works fine exactly
because, if your register usage does not fit the limitations of the
16-bit encoding, you just use the 32-bit version of the instruction.
It seems that they designed the ABI such that registers 8-15 occur
often in the code. Maybe the gcc maintainer also put some work into
preferring these registers.

OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
instruction sets with their A32/T32 instruction set(s), designed their
A64 instruction set to strictly use 32-bit instructions.

So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
32-bit instructions, why make your task harder by also implementing
short instructions? Of course, if that is your goal or you have fun
with this, why not? But if you want to make progress, it seems to be
something that can be skipped.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2023-11-13 00:10:44 UTC
Permalink
Post by Anton Ertl
Post by Quadibloc
Post by BGB-Alt
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
....
Post by Quadibloc
Post by BGB-Alt
Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
....
Post by Quadibloc
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
It works for the RISC-V C (compressed) extension. Some of these
compressed instructions use registers 8-15 (others use all 32
registers, but have other restrictions). But it works fine exactly
because, if your register usage does not fit the limitations of the
16-bit encoding, you just use the 32-bit version of the instruction.
It seems that they designed the ABI such that registers 8-15 occur
often in the code. Maybe the gcc maintainer also put some work into
preferring these registers.
OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
instruction sets with their A32/T32 instruction set(s), designed their
A64 instruction set to strictly use 32-bit instructions.
So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
32-bit instructions, why make your task harder by also implementing
short instructions? Of course, if that is your goal or you have fun
with this, why not? But if you want to make progress, it seems to be
something that can be skipped.
<
Sound
<
Post by Anton Ertl
- anton
BGB
2023-11-13 20:12:16 UTC
Permalink
Post by Anton Ertl
Post by Quadibloc
Post by BGB-Alt
Errm, splitting up registers like this is likely to hurt far more than
anything that 16-bit displacements are likely to gain.
...
Post by Quadibloc
Post by BGB-Alt
Much preferable for a compiler to have a flat space of 32 or 64
registers. Having 16 sorta works, but does still add a bit to spill and
fill.
...
Post by Quadibloc
But if the 16-bit instructions I'm making room for are useless to
compilers, that's questionable.
It works for the RISC-V C (compressed) extension. Some of these
compressed instrutions use registers 8-15 (others use all 32
registers, but have other restrictions). But it works fine exactly
because, if your register usage does not fit the limitations of the
16-bit encoding, you just use the 32-bit version of the instruction.
It seems that they designed the ABI such that registers 8-15 occur
often in the code. Maybe the gcc maintainer also put some work into
preferring these registers.
Yeah. They can be used by a compiler, and can make a difference for
code-density.

Just, it is more a case of, if one has a tradeoff of:
Fewer instructions but more bytes;
More instructions but fewer bytes.
Then the former is better for performance.

Things like reusing registers more aggressively and using a smaller
subset of the registers, are good for making 16-bit instructions usable,
but are less good for performance.

...



Though, granted, one doesn't want to try to reserve too many registers
(on an ISA with plenty of registers), as one may find that
saving/restoring them costs more than what is gained by having them
available for use.

Though, the partial workaround for this (in my case) was dividing the
registers up into sub-groups, and using heuristics to enable these
groups based on an estimate of the register pressure.

Say:
R8 ..R14: Always available, prioritized for size optimization ("/Os");
R24..R31: Enabled as needed for "/Os", always enabled for perf opt.
R40..R47: Enabled with high register pressure.
R56..R63: Enabled with very high register pressure.

Note:
BGBCC's command-line accepts both "/Os" and "-Os" style arguments.
"/Os": Size optimize
"/O1": Moderate speed (try to balance speed and size)
"/O2": Prioritize speed.
"/Z*": Mostly debug related options (like "-g" in GCC)
"/f*": Optional feature flags.
"/m*": Selects target arch/profile.
"/Fe*": Specify output binary (like "-o" in GCC)
Else, it will try to guess an output file name.
Eg: "foo.c" -> "foo.exe"
...

It does try to guess whether the '/' is part of an option or the start
of a filename: if it sees more than one '/', or sees a '.' or similar
without encountering an '=', it assumes it is a filename.


The register-pressure estimate is almost, but not quite, based on a
count of the in-use variables.

It helps to also apply a scale factor for each variable based on how
deeply nested in a loop it is (so that if one has a lot of variables in
use inside a deeply nested loop, the register pressure estimate will be
higher than if most are used outside of a loop).

Though, this scale-factor is nowhere near as severe as with the register
allocation priority (where the nesting level is effectively used as an
exponent). For pressure estimates, one can use a gentler scale, more
like, say: "scale=sqrt(deepest_nest_level+1.0);".
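
As a sketch, the pressure estimate might look something like this (the
structure and thresholds are invented for illustration):

  #include <math.h>

  struct var {                    /* hypothetical IR variable record */
      int live;                   /* currently in use? */
      int loop_depth;             /* deepest loop nesting it is used in */
  };

  /* Weighted count: variables deep in loops count for more, but only
     by ~sqrt(), far gentler than the allocation-priority exponent. */
  double reg_pressure(const struct var *v, int n) {
      double p = 0.0;
      for (int i = 0; i < n; i++)
          if (v[i].live)
              p += sqrt(v[i].loop_depth + 1.0);
      return p;
  }

  /* Illustrative thresholds for enabling register sub-groups. */
  int enable_r40_r47(double p) { return p > 24.0; }  /* high     */
  int enable_r56_r63(double p) { return p > 40.0; }  /* very high */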


For dynamically allocated variables in leaf blocks (basic block does not
contain a function call), it may make sense to allocate them in scratch
registers.

Scratch registers are similar:
R0..R1: Not used as GPRs by compiler;
R2..R3: Designated scratch, not used for reg alloc.
R4..R7: Always available;
R16..R17: Designated scratch, not used for reg alloc.
R18..R23: Available when R24..R31 are enabled (always for perf opt);
R32..R39, R48..R55: Available under high register pressure.
Always available if the registers are available and perf optimized.


In performance optimized code, in my case, the spread of the registers
is generally too disperse to really make any sort of small sub-setting
particularly effective.
Post by Anton Ertl
OTOH, ARM who have extensive experience with mixed 32-bit/16-bit
instruction sets with their A32/T32 instruction set(s), designed their
A64 instruction set to strictly use 32-bit instructions.
I guess it can also be noted that 64-bit ARM went all-in with a lot of
the sorts of features that RISC-V avoided. For example, it still has
some more complex addressing modes, etc.

I guess also they approached constants a little differently:
You can load a 16-bit value into 1 of 4 positions within a register,
with one of: zero fill, one fill, or keeping the prior contents.

This allows loading an arbitrary constant in between 1 and 4 instructions.
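
To make the 1-to-4 count concrete, a small C sketch of how many such
instructions an arbitrary 64-bit constant needs under that scheme
(first instruction zero-fills or one-fills; later ones keep the rest):

  #include <stdint.h>

  int const_insn_count(uint64_t v) {
      int nonzero = 0, nonones = 0;
      for (int i = 0; i < 4; i++) {
          uint16_t chunk = (uint16_t)(v >> (16 * i));
          if (chunk != 0x0000) nonzero++;
          if (chunk != 0xFFFF) nonones++;
      }
      /* Zero-fill start covers the all-zero chunks for free; one-fill
         covers the all-ones chunks. An all-zero value still needs 1. */
      int best = (nonzero < nonones) ? nonzero : nonones;
      return best ? best : 1;
  }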



Though, I did realize that with RISC-V's Bitmanip extensions, it is
possible to get a 64-bit constant load down to 5 instructions, which is
better than RV64I needing 6 (and in both cases, needing 2 registers).


In BJX2, with Jumbo, it is 3 instruction words and 1 clock cycle.
Without Jumbo, it is 4 instructions (albeit less flexible than the
mechanism in ARM).
Post by Anton Ertl
So if MIPS, SPARC, Power, Alpha, and ARM A64 went for fixed-width
32-bit instructions, why make your task harder by also implementing
short instructions? Of course, if that is your goal or you have fun
with this, why not? But if you want to make progress, it seems to be
something that can be skipped.
In my case, I am left with an awkward split in my ISA:
Baseline Mode, which has both 16 and 32-bit instructions (and bigger);
XG2, which is 32-bit (and bigger).


Some of my newer design variants had leaned towards 32-bit instructions
and 64 registers, mostly because the higher register count does help
performance (at least, performance per clock; not so sure it helps with
LUTs or timing constraints though, *).

*: Mostly because the 5-bit LUTRAMs work with 3 bits of data, but the
6-bit LUTRAMs only have 2 bits of data.
John Dallman
2023-11-11 06:50:00 UTC
Permalink
Post by Quadibloc
This lends itself to writing code where four distinct threads are
interleaved, helping pipelining in implementations too cheap to have
out-of-order execution.
This is not the conventional way of implementing threads, and seems to
have some drawbacks:

One of the uses of threads is to scale to the hardware resources
available. With this approach, the number of threads is baked in at
compile time.

Debugging such interleaved threads is likely to be even more confusing
than debugging multiple threads usually is.

Pipeline stalls affect every thread, rather than just the thread that
triggers them.

The common threading APIs also lack a way to set such threads to work,
but that's a far more soluble problem.

John
Quadibloc
2023-11-09 21:38:31 UTC
Permalink
Post by Thomas Koenig
So, r1 = r2 + r3 + offset.
Three registers is 15 bits plus a 16-bit offset, which gives you 31
bits. You're left with one bit of opcode, one for load and one for
store.
Yes, and obviously that isn't enough. So I do have to make some
compromises.

The offset is 16 bits, because the 68000 (and the 8086, and others) had 16
bit offsets!

But the base and index registers are each specified by only 3 bits - only
the destination register gets a 5-bit field.

I need 5 bits for the opcode. That lets me have load and store for four
floating-point types, load, store, unsigned load, and insert for four
integer types (the largest one only uses load and store).

So it is doable! 5 plus 5 plus 3 plus 3 equals 16, so I have 16 bits left
for the offset.
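
As a C bitfield, the layout just described (a sketch; the field order
within the word is illustrative):

  #include <stdint.h>

  /* 5 + 5 + 3 + 3 + 16 = 32 bits. */
  struct memref {
      uint32_t offset : 16;   /* 16-bit displacement */
      uint32_t base   :  3;   /* one of the last few registers */
      uint32_t index  :  3;   /* one of the first few registers */
      uint32_t dest   :  5;   /* full 5-bit destination register */
      uint32_t opcode :  5;   /* load/store x operand type */
  };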

But that leaves only 1/4 of the opcode space. Which would be fine for a
conventional RISC design, as that's plenty for the operate instructions.
But I needed to reserve _half_ the opcode space, because I needed another
1/4 of the opcode space for putting two 16-bit instructions in a 32-bit
word for more compact code.

That led me to look for compromises... and I found some that would not
overly impair the effectiveness of the memory reference instructions,
which I discussed previously. I ended up using _both_ of two alternatives
each of which alone would have given me the needed savings in opcode
space... that way, the compromised memory-reference instructions could be
accompanied by another complete set of memory-reference instructions with
_no_ compromise... except for only being able to specify aligned operands.
Post by Thomas Koenig
The /360 had 12 bits for three registers plus 12 bits of offset, so 24
bits left eight bits for the opcode (the RX format).
Oh, yes, I remember it well.
Post by Thomas Koenig
So, if you want to do this kind of thing, why not go for a full 32-bit
offset in a second 32-bit word?
Because the 360 only took 32 bits for a memory-reference instruction, so
using 32 bits for the offset alone is sinfully wasteful!

I want to "have my cake and eat it too" - to have a computer that's just
as good as a Power PC or a 68000 or a System/360, even though they have
different, incompatible, strengths that conflict with a computer being
able to be good at what each of them is good at simultaneously.

John Savard
Quadibloc
2023-11-09 21:42:43 UTC
Permalink
Post by Quadibloc
I want to "have my cake and eat it too" - to have a computer that's just
as good as a Power PC or a 68000 or a System/360, even though they have
different, incompatible, strengths that conflict with a computer being
able to be good at what each of them is good at simultaneously.
Actually, it's worse than that, since I also want the virtues of processors
like the TMS320C2000 or the Itanium.

John Savard
Quadibloc
2023-11-09 22:11:41 UTC
Permalink
Post by Quadibloc
Post by Quadibloc
I want to "have my cake and eat it too" - to have a computer that's
just as good as a Power PC or a 68000 or a System/360, even though they
have different, incompatible, strengths that conflict with a computer
being able to be good at what each of them is good at simultaneously.
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
And don't forget the Cray-I.

So the idea is to have *one* ISA that will serve for...

embedded microcontrollers,
data-base servers,
desktop workstations, and
HPC supercomputers.

Of course, these different tasks will require different implementations,
which focus on doing parts of the ISA well.

John Savard
John Dallman
2023-11-10 00:29:00 UTC
Permalink
Post by Quadibloc
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
What do you consider the virtues of Itanium to be?

No company ever seems to have taken it up on technical grounds, only as a
result of Intel and HP persuading commercial managers that it would
become widely used owing to their market power.

John
Quadibloc
2023-11-10 04:31:45 UTC
Permalink
Post by John Dallman
Post by Quadibloc
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
What do you consider the virtues of Itanium to be?
Well, I think that superscalar operation of microprocessors is a good
thing. Explicitly indicating which instructions may execute in parallel
is one way to facilitate that. Even if the Itanium was an unsuccessful
implementation of that principle.

John Savard
MitchAlsup
2023-11-10 18:26:20 UTC
Permalink
Post by John Dallman
Post by Quadibloc
Actually, it's worse than that, since I also want the virtues of
processors like the TMS320C2000 or the Itanium.
What do you consider the virtues of Itanium to be?
Itanic's main virtue was to consume several Intel design teams, over 20
years, preventing Intel from taking over the entire µprocessor market.

I, personally, don't believe in exposing the scalarity to the compiler,
nor the rotating register file to do what renaming does naturally,
nor the lack of proper FP instructions (FDIV, SQRT), ...

Academic quality at industrial prices.
John Dallman
2023-11-11 07:07:00 UTC
Permalink
Post by Quadibloc
Well, I think that superscalar operation of microprocessors is a
good thing.
Indeed.
Post by Quadibloc
Explicitly indicating which instructions may execute in parallel
is one way to facilitate that. Even if the Itanium was an
unsuccessful implementation of that principle.
Intel tried that with the Pentium, with its two pipelines and run-time
automatic instruction scheduling, to moderate success. They tried it with
the i860, with compiler scheduling and a comprehensive lack of success.
The Itanium tried the i860 method much harder, and was still unsuccessful.


In engineering, the gap between "Doing this would be good" and "Here it
is working" generally involves having a good idea about /how/ to do it.

Finding an example where explicit but non-automatic parallelism worked
for general-purpose code and figuring out how that was done should be
easier than inventing a method. In the absence of that, we have some
evidence that just hoping the software people will solve this problem for
you doesn't work.

John
John Levine
2023-11-12 10:34:04 UTC
Permalink
Post by Quadibloc
Post by John Dallman
What do you consider the virtues of Itanium to be?
Well, I think that superscalar operation of microprocessors is a good
thing. Explicitly indicating which instructions may execute in parallel
is one way to facilitate that. Even if the Itanium was an unsuccessful
implementation of that principle.
I knew the people at Yale who invented trace scheduling and started Multiflow.

It was and is a very clever technique for the kind of computers we could build
in the 1980s. It works really well for programs with regular memory access
patterns, not so well for programs without. Once we could build enough
transistors to do dynamic memory and instruction scheduling, why try to
do it at compile time?

I gather it is still useful for embedded or realtime applications which
are fairly regular and for cost or power reasons you want to minimize
the number of transistors.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Anton Ertl
2023-11-12 13:59:06 UTC
Permalink
Post by John Levine
I gather it is still useful for embedded or realtime applications which
are fairly regular and for cost or power reasons you want to minimize
the number of transistors.
Even there, VLIW-inspired CPUs like the Philips TriMedia were terminated,
and I have not heard much about TI's C6000 lately. Both NXP (spun off
from Philips) and TI seem to bet heavily on ARM.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2023-11-10 01:11:13 UTC
Permalink
Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
<
My 66000 has all of this.
<
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
<
The simple/easy ones definitely, the ones with longer displacements no.
<
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
<
Block headers are simply consuming entropy.
<
Post by Quadibloc
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
<
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
<
Post by Quadibloc
I had to change the instructions longer than 32 bits to get them in the
basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
<
Yet, mine remains simple and compact.
<
Post by Quadibloc
John Savard
BGB
2023-11-10 04:19:48 UTC
Permalink
Good to see you are back on here...
Post by MitchAlsup
Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
<
My 66000 has all of this.
<
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
<
The simple/easy ones definitely, the ones with longer displacements no.
<
Yes.

As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting entropy,
at least for 32-bit encodings.


12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.
Post by MitchAlsup
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
<
Block headers are simply consuming entropy.
<
Also yes.
Post by MitchAlsup
Post by Quadibloc
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
<
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
<
In my case, it is only for 128-bit load/store operations, which require
64-bit alignment.

Well, and an esoteric edge case:
   if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.

One can argue that aligned-only allows for a cheaper L1 D$, but also
"sucks pretty bad" for some tasks:
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Post by MitchAlsup
Post by Quadibloc
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
<
Yet, mine remains simple and compact.
<
Mostly similar.
Though, I guess some people could debate this in my case.

Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case with
RISC-V).

Well, and where there is a lot that is left up to the specific hardware
implementations in terms of stuff that one would need to "actually have
an OS run on it", ...
Post by MitchAlsup
Post by Quadibloc
John Savard
MitchAlsup
2023-11-10 18:22:43 UTC
Permalink
Post by BGB
Good to see you are back on here...
Post by MitchAlsup
Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
<
My 66000 has all of this.
<
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
<
The simple/easy ones definitely, the ones with longer displacements no.
<
Yes.
As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting entropy,
at least for 32-bit encodings.
<
Can you suggest something I could have done by sacrificing 16-bits
down to 12-bits that would have improved "something" in my ISA ??
{{You see I did not have any trouble in having all 16-bits for MEM
references--just like having 16-bits for integer, logical, and branch
offsets.}}
<
Post by BGB
12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.
<
IBM 360 found so, EMBench is replete with stack sizes and struct sizes
where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit...
Exactly the difference between 12-bits and 14-bits....
Post by BGB
Post by MitchAlsup
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
<
Block headers are simply consuming entropy.
<
Also yes.
Post by MitchAlsup
Post by Quadibloc
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
<
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
<
In my case, it is only for 128-bit load/store operations, which require
64-bit alignment.
<
VVM does all the wide stuff without necessitating the wide stuff in
registers or instructions.
<
Post by BGB
if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.
<
Ehhhhh...
<
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
Fast memcpy;
LZ decompression;
Huffman;
...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Post by BGB
Post by MitchAlsup
Post by Quadibloc
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
<
Yet, mine remains simple and compact.
<
Mostly similar.
Though, I guess some people could debate this in my case.
Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case with
RISC-V).
Well, and where there is a lot that is left up to the specific hardware
implementations in terms of stuff that one would need to "actually have
an OS run on it", ...
Post by MitchAlsup
Post by Quadibloc
John Savard
BGB
2023-11-10 18:48:10 UTC
Permalink
Post by MitchAlsup
Post by BGB
Good to see you are back on here...
Post by MitchAlsup
Post by Quadibloc
Some progress has been made in advancing a small step towards sanity
in the description of the Concertina II architecture described at
http://www.quadibloc.com/arch/ct17int.htm
As Mitch Alsup has rightly noted, I want to have my cake and eat it
too. I want an instruction format that is quick to fetch and decode,
like a RISC format. I want RISC-like banks of 32 registers, and I
want the CISC-like addressing modes of the IBM System/360, but with
16-bit displacements, not 12-bit displacements.
<
My 66000 has all of this.
<
Post by Quadibloc
I want memory-reference instructions to still fit in 32 bits, despite
asking for so much more capacity.
<
The simple/easy ones definitely, the ones with longer displacements no.
<
Yes.
As noted a few times, as I see it, 9 .. 12 is sufficient.
Much less than 9 is "not enough", much more than 12 is wasting
entropy, at least for 32-bit encodings.
<
Can you suggest something I could have done by sacrificing 16-bits
down to 12-bits that would have improved "something" in my ISA ??
{{You see I did not have any trouble in having all 16-bits for MEM
references--just like having 16-bits for integer, logical, and branch
offsets.}}
<
Post by BGB
12u-scaled would be "pretty good", say, being able to handle 32K for
QWORD ops.
<
IBM 360 found so, EMBench is replete with stack sizes and struct sizes
where My 66000 uses 1×32-bit instruction where RISC-V needs 2×32-bit...
Exactly the difference between 12-bits and 14-bits....
RISC-V is 12-bit signed unscaled (which can only do +/- 2K).

On average, 12-bit signed unscaled is actually worse than 9-bit unsigned
scaled (4K range, for QWORD).

So, ironically, despite BJX2 having smaller displacements than RISC-V,
it actually deals better with the larger stack frames.

But, if one could address 32K, this should cover the vast majority of
structs and stack-frames.

A 16-bit unsigned scaled displacement would cover 512K for QWORD ops,
which could be nice, but likely unnecessary.
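
As a quick sanity check of those reach figures, a minimal sketch in
plain C (the helper names are made up purely for illustration):

  #include <stdint.h>
  #include <stdio.h>

  /* Reach of a signed, unscaled N-bit displacement: +/- 2^(N-1) bytes. */
  static int64_t reach_signed_unscaled(int bits)
  {
      return (int64_t)1 << (bits - 1);
  }

  /* Reach of an unsigned displacement scaled by the access size:
     about 2^N * size bytes forward (max offset is (2^N - 1) * size). */
  static int64_t reach_unsigned_scaled(int bits, int size)
  {
      return ((int64_t)1 << bits) * size;
  }

  int main(void)
  {
      /* RISC-V style: 12-bit signed, unscaled -> +/- 2K for any size. */
      printf("12s unscaled : +/- %lld\n", (long long)reach_signed_unscaled(12));
      /* 9-bit unsigned, scaled by 8 for QWORD -> 4K forward. */
      printf("9u, 8B scale : +%lld\n", (long long)reach_unsigned_scaled(9, 8));
      /* 16-bit unsigned, scaled by 8 -> the 512K figure above. */
      printf("16u, 8B scale: +%lld\n", (long long)reach_unsigned_scaled(16, 8));
      return 0;
  }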
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by Quadibloc
So what I had done was, after squeezing as much as I could into a basic
instruction format, I provided for switching into alternate instruction
formats which made different compromises by using the block headers.
<
Block headers are simply consuming entropy.
<
Also yes.
Post by MitchAlsup
Post by Quadibloc
This has now been dropped. Since I managed to get the normal (unaligned)
memory-reference instruction squeezed into so much less opcode space that
I also had room for the aligned memory-reference format without compromises
in the basic instruction set, it wasn't needed to have multiple instruction
formats.
<
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
<
In my case, it is only for 128-bit load/store operations, which
require 64-bit alignment.
<
VVM does all the wide stuff without necessitating the wide stuff in
registers or instructions.
<
Post by BGB
   if((PC&0xE)==0xE)
You can't use a 96-bit encoding, and will need to insert a NOP if one
needs to do so.
<
Ehhhhh...
<
This is mostly due to a quirk in the L1 I$ design, where "fixing" it
costs more than just being like, "yeah, this case isn't allowed" (and
having the compiler emit a NOP in the rare edge cases it is encountered).
Post by MitchAlsup
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Wait, are you arguing for aligned-only memory ops here?...

But, yeah, for me, a major selling point for unaligned access is mostly
that I can copy blocks of memory around like:
  v0=((uint64_t *)cs)[0];
  v1=((uint64_t *)cs)[1];
  v2=((uint64_t *)cs)[2];
  v3=((uint64_t *)cs)[3];
  ((uint64_t *)ct)[0]=v0;
  ((uint64_t *)ct)[1]=v1;
  ((uint64_t *)ct)[2]=v2;
  ((uint64_t *)ct)[3]=v3;
  cs+=32; ct+=32;

For Huffman, some of the fastest strategies to implement the bitstream
reading/writing tend to be to casually make use of unaligned access
(shifting in and loading bytes is slower in comparison).

Though, all this falls on its face if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
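
For concreteness, a minimal sketch of that unaligned-load bitstream
trick (not BGB's actual code; the names are hypothetical, a
little-endian target is assumed, and end-of-buffer handling is omitted):

  #include <stdint.h>
  #include <string.h>

  /* Bit reader that refills its window with one unaligned 64-bit load,
     instead of shifting in one byte at a time. */
  typedef struct {
      const uint8_t *ptr;   /* current byte position */
      uint64_t win;         /* bit window */
      int      cnt;         /* valid bits in window */
  } BitReader;

  static void br_refill(BitReader *br)
  {
      uint64_t v;
      memcpy(&v, br->ptr, 8);        /* the unaligned 64-bit load */
      br->win |= v << br->cnt;       /* top up the window */
      br->ptr += (63 - br->cnt) >> 3;
      br->cnt |= 56;                 /* window now holds >= 56 valid bits */
  }

  /* Fetch nbits (nbits <= 56) from the stream. */
  static uint32_t br_get(BitReader *br, int nbits)
  {
      if (br->cnt < nbits)
          br_refill(br);
      uint32_t v = (uint32_t)(br->win & (((uint64_t)1 << nbits) - 1));
      br->win >>= nbits;
      br->cnt  -= nbits;
      return v;
  }

On a core that traps on unaligned loads, every br_refill() here turns
into a (potentially very expensive) fixup, which is the failure mode
described above.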
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by Quadibloc
I had to change the instructions longer than 32 bits to get them in
the basic instruction format, so now they're less dense.
Block structure is still used, but now for only the two things it's
actually needed for: reserving part of a block as unused for the
pseudo-immediates, and for VLIW features (explicitly indicating
parallelism, and instruction predication).
The ISA is still tremendously complicated, since I've put room in it for
a large assortment of instructions of all kinds, but I think it's
definitely made a significant stride towards sanity.
<
Yet, mine remains simple and compact.
<
Mostly similar.
Though, I guess some people could debate this in my case.
Granted, I specify the entire ISA in a single location, rather than
spreading it across a bunch of different documents (as was the case
with RISC-V).
Well, and where there is a lot that is left up to the specific
hardware implementations in terms of stuff that one would need to
"actually have an OS run on it", ...
Post by MitchAlsup
Post by Quadibloc
John Savard
MitchAlsup
2023-11-10 23:21:08 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Wait, are you arguing for aligned-only memory ops here?...
<

No, I am arguing that all memory references are inherently unaligned, but where
aligned references never suffer a stall penalty; and the compiler does not
need to understand whether the reference is aligned or unaligned.
<
Post by BGB
But, yeah, for me, a major selling point for unaligned access is mostly
v0=((uint64_t *)cs)[0];
v1=((uint64_t *)cs)[1];
v2=((uint64_t *)cs)[2];
v3=((uint64_t *)cs)[3];
((uint64_t *)ct)[0]=v0;
((uint64_t *)ct)[1]=v1;
((uint64_t *)ct)[2]=v2;
((uint64_t *)ct)[3]=v3;
cs+=32; ct+=32;
<
MM Rcs,Rct,#length // without the for loop
<
Post by BGB
For Huffman, some of the fastest strategies to implement the bitstream
reading/writing, tend to be to casually make use of unaligned access
(shifting in and loading bytes is slower in comparison).
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
<
Traps to perform unaligned are so 1985......either don't allow them at all
(SIGSEGV) or treat them as first class citizens. The former fails in the market.
<
BGB
2023-11-11 02:37:38 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Wait, are you arguing for aligned-only memory ops here?...
<
No, I am arguing that all memory references are inherently unaligned, but where
aligned references never suffer a stall penalty; and the compiler does not
need to understand whether the reference is aligned or unaligned.
<
OK, fair enough.

I don't have separate aligned/unaligned ops for anything QWORD or
smaller, as all these cases are implicitly unaligned.

Though, aligned is sometimes a little faster, due to playing better with
the L1 cache; but using misaligned memory access is generally faster
than any of the traditional workarounds (the difference being mostly a
slight increase in the probability of triggering an L1 cache miss).

The main exception is MOV.X requiring 64-bit alignment (for a 128-bit
memory access), but the unaligned fallback here is to use a pair of
MOV.Q instructions instead.

But, this was in part because of how the L1 caches were implemented, and
supporting fully unaligned 128-bit access would have been more expensive
(and the relative gain is smaller).

This does mean alternate logic for aligned vs unaligned "memcpy()", with
the unaligned case being a little slower as a result of needing to use
MOV.Q ops.

It is possible a case could be made for allowing fully unaligned MOV.X
as well.

Would mostly involve reworking how MOV.X is implemented relative to the
extract/insert logic (likely internally working with 192 bits rather
than 128; with, as-is, MOV.X implemented by bypassing the main
extract/insert logic).
Post by MitchAlsup
Post by BGB
But, yeah, for me, a major selling point for unaligned access is
   v0=((uint64_t *)cs)[0];
   v1=((uint64_t *)cs)[1];
   v2=((uint64_t *)cs)[2];
   v3=((uint64_t *)cs)[3];
   ((uint64_t *)ct)[0]=v0;
   ((uint64_t *)ct)[1]=v1;
   ((uint64_t *)ct)[2]=v2;
   ((uint64_t *)ct)[3]=v3;
   cs+=32; ct+=32;
<
    MM   Rcs,Rct,#length            // without the for loop
<
I typically use a "while()" loop or similar, but yeah...

At present, the fastest loop strategy is generally:
  while(n--)
  {
    ...
  }
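
Putting the two idioms together, a block copy might look like this
minimal sketch (hypothetical helper; assumes n is a multiple of 32, and
uses memcpy() so the unaligned 64-bit accesses are expressed portably):

  #include <stdint.h>
  #include <string.h>

  /* Copy n bytes, 32 bytes per iteration, via (possibly unaligned)
     64-bit loads and stores; n is assumed to be a multiple of 32. */
  static void copy32(void *dst, const void *src, size_t n)
  {
      const uint8_t *cs = src;
      uint8_t *ct = dst;
      size_t k = n >> 5;
      uint64_t v0, v1, v2, v3;

      while (k--)
      {
          memcpy(&v0, cs +  0, 8);  /* compilers lower these memcpy()s */
          memcpy(&v1, cs +  8, 8);  /* to plain unaligned loads/stores */
          memcpy(&v2, cs + 16, 8);  /* on targets that allow them      */
          memcpy(&v3, cs + 24, 8);
          memcpy(ct +  0, &v0, 8);
          memcpy(ct +  8, &v1, 8);
          memcpy(ct + 16, &v2, 8);
          memcpy(ct + 24, &v3, 8);
          cs += 32; ct += 32;
      }
  }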
Post by MitchAlsup
Post by BGB
For Huffman, some of the fastest strategies to implement the bitstream
reading/writing, tend to be to casually make use of unaligned access
(shifting in and loading bytes is slower in comparison).
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive
cores and similar).
<
Traps to perform unaligned are so 1985......either don't allow them at all
(SIGSEGV) or treat them as first class citizens. The former fails in the market.
<
Apparently SiFive went this way, for some reason...

Like, RISC-V requires unaligned access to work, but doesn't specify how,
and apparently they considered trapping to be an acceptable option; but
trapping sucks for performance.
Anton Ertl
2023-11-11 07:22:21 UTC
Permalink
Post by BGB
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
Post by BGB
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
Let's see what this SiFive U74 does:

[fedora-starfive:~/nfstmp/gforth-riscv:98397] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye "

Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x x ! x foo bye ':

     469832112   instructions:u   # 0.79 insn per cycle
     591015904   cycles:u

   0.609751748 seconds time elapsed

   0.533195000 seconds user
   0.061522000 seconds sys

[fedora-starfive:~/nfstmp/gforth-riscv:98398] perf stat -e instructions:u -e cycles:u gforth-fast -e "create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye "

Performance counter stats for 'gforth-fast -e create x 2 cells allot : foo 10000000 0 do @ @ @ @ @ @ @ @ @ @ loop ; x 1+ x 1+ ! x 1+ foo bye ':

   53533370273   instructions:u   # 0.77 insn per cycle
   69304924487   cycles:u

  69.368484169 seconds time elapsed

  69.256290000 seconds user
   0.049997000 seconds sys

So when we do aligned accesses (first command), the code performs 4.7
instructions and 5.9 cycles per load, while for unaligned accesses
(second command) the same code performs 535.3 instructions and 693.0
cycles per load. So apparently an unaligned load triggers >500
additional instructions, confirming your claim. Interestingly, all
that is attributed to user time; maybe the fixup is performed by a
user-level trap or microcode.

Still, the approach of having separate instructions for aligned and
unaligned accesses (typically with several instructions for the
unaligned case) has been tried and discarded. Software just does not
declare that some access will be unaligned.

Particularly strong evidence for this is that gas generated
non-working code for ustq (unaligned store quadword) on Alpha for
several years, and apparently nobody noticed until I gave an exercise
to my students where they should use ustq (so no production use,
either).

So every general-purpose architecture, including RISC-V, the
spiritual descendant of MIPS and Alpha (which had the division),
settled on having memory access instructions that perform both aligned
and unaligned accesses (with performance advantages for aligned
accesses).

If RISC-V implementations want to perform well for code that uses
unaligned accesses for memory copying, compression/decompression, or
hashing, they will eventually have to implement unaligned accesses
more efficiently; but at least the code works, and aligned accesses
are fast.

Why would you not go the same way? It would also save on instruction
encoding space.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Dallman
2023-11-11 08:37:00 UTC
Permalink
...
So apparently an unaligned load triggers >500 additional instructions,
confirming your claim.
Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
least obvious. Slowdowns like this will be a major drag on performance,
simply because finding them all is tricky.

John
Anton Ertl
2023-11-11 10:22:54 UTC
Permalink
Post by John Dallman
So apparently an unaligned load triggers >500 additional instructions,
confirming your claim.
Wow. I think I'd rather have SIGBUS on unaligned accesses. That is at
least obvious.
True, but that has been tried out; and in a world (like Linux) where
software is developed on a platform that supports unaligned accesses,
and then compiled by package maintainers (who often are not that
familiar with the software) on a lot of platforms, the end result was
that the kernel by default performed a fixup (and put a message in the
dmesg buffer) instead of delivering a SIGBUS.

There was a system call for switching to the SIGBUS behaviour. On
Tru64 OSF/1 (or whatever it is called this week), the default
behaviour was to SIGBUS, but it had the same system call, and a
shell-level tool "uac" to change the behaviour to fix it up. I
implemented a tool "uace" for Linux that can be used for running a
process with the SIGBUS behaviour that you desire:
<https://www.complang.tuwien.ac.at/anton/uace.c>. Maybe something
similar is possible on the U74.
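
On Linux, that per-process switch is exposed via prctl() on the
architectures that implement it (e.g. ia64, parisc, powerpc); a minimal
sketch, purely illustrative since it does not apply everywhere:

  #include <stdio.h>
  #include <sys/prctl.h>

  /* Ask the kernel to deliver SIGBUS on unaligned accesses instead of
     silently fixing them up. PR_SET_UNALIGN is only wired up on some
     architectures; elsewhere prctl() fails with EINVAL. */
  int main(void)
  {
      if (prctl(PR_SET_UNALIGN, PR_UNALIGN_SIGBUS, 0, 0, 0) != 0) {
          perror("PR_SET_UNALIGN not supported here");
          return 1;
      }
      /* ... run the unaligned-access test workload here ... */
      return 0;
  }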

Anyway, it seems that the problem was not a big one on Linux-Alpha
(messages about unaligned accesses were not that frequent).
Apparently the large majority of code performs aligned accesses. It's
just that there are a few unaligned ones.

I would not worry about cores like the U74 (and I have a program that
uses unaligned accesses for hashing); that's just a stepping stone for
getting more capable RISC-V cores, and at some point (before RISC-V
becomes mainstream) the trapping will be replaced with something more
efficient.

We have seen the same development on AMD64. The Penryn
(second-generation Core 2) takes 159 cycles for an unaligned load that
crosses a page boundary; the Sandy Bridge takes 28
<http://al.howardknight.net/?ID=143135464800>. The Sandy Bridge and
Ivy Bridge take 200 cycles for an unaligned page-crossing store;
Haswell and Skylake take 25 and 24.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Dallman
2023-11-11 16:53:00 UTC
Permalink
Post by Anton Ertl
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, SPARC Solaris,
and that isn't long for this world.

I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
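
For reference, the usual starting point for that experiment on x86-64
is flipping the AC bit (bit 18) in EFLAGS from user mode; a minimal
sketch (GCC/Clang inline asm, assuming the kernel has set CR0.AM, as
Linux does):

  #include <stdio.h>

  /* Set EFLAGS.AC (alignment check, bit 18). With CR0.AM set by the
     kernel, subsequent unaligned user-mode accesses raise SIGBUS. */
  static void enable_alignment_check(void)
  {
      __asm__ volatile(
          "pushfq\n\t"
          "orq $0x40000, (%%rsp)\n\t"
          "popfq"
          ::: "cc", "memory");
  }

  int main(void)
  {
      enable_alignment_check();
      char buf[16] = {0};
      /* A misaligned load now traps (SIGBUS) instead of succeeding. */
      volatile int x = *(int *)(buf + 1);
      (void)x;
      puts("no trap: alignment checking not in effect");
      return 0;
  }

The "huge job" is that glibc and compiler-vectorized code immediately
trip over this, which is exactly the problem discussed below.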

John
Scott Lurndal
2023-11-11 21:28:05 UTC
Permalink
Post by John Dallman
Post by Anton Ertl
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
it only affects code executing in usermode.

There may even already be some ELF flag that will set it when the
file is exec(2)'d.
John Dallman
2023-11-11 22:47:00 UTC
Permalink
Post by Scott Lurndal
Post by John Dallman
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on AArch64 is built to be run with
alignment trapping. Should it be EL0 for usermode?

John
Scott Lurndal
2023-11-12 17:21:51 UTC
Permalink
Post by John Dallman
Post by Scott Lurndal
Post by John Dallman
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on Aarch64 is built to be run with
alignment trapping. Should it be EL0 for usermode?
The EL1 in the register name describes the minimum exception level
allowed to access the register. SCTLR_EL1 includes control bits
for both EL1 and EL0.
John Dallman
2023-11-12 17:40:00 UTC
Permalink
Post by Scott Lurndal
Post by John Dallman
Post by Scott Lurndal
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on Aarch64 is built to be run
with alignment trapping. Should it be EL0 for usermode?
The EL1 in the register name describes the minimum exception level
allowed to access the register. SCTLR_EL1 includes control bits
for both EL1 and EL0.
Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
bit, and the stuff I work on is strictly application code.

John
Scott Lurndal
2023-11-13 14:44:15 UTC
Permalink
Post by John Dallman
Post by Scott Lurndal
Post by John Dallman
Post by Scott Lurndal
It might be easier with AArch64. Just set the A bit (bit 1) in
SCTLR_EL1; it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
I'll take a look, but I doubt glibc on Aarch64 is built to be run
with alignment trapping. Should it be EL0 for usermode?
The EL1 in the register name describes the minimum exception level
allowed to access the register. SCTLR_EL1 includes control bits
for both EL1 and EL0.
Aha. It's harder for ARM64: I'd have to be in supervisor mode to set that
bit, and the stuff I work on is strictly application code.
Unless the ELF flag trick is implemented. I haven't looked at the kernel
with respect to that.
Kent Dickey
2023-11-12 22:18:31 UTC
Permalink
Post by Scott Lurndal
Post by John Dallman
Post by Anton Ertl
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
it only affects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
On AArch64, with GCC at least, you also need to specify "-mstrict-align"
when compiling all source code, to prevent the compiler from assuming it
can access structure fields in an unaligned way, even if all of your
code accesses are fully aligned. GCC can mess around behind your back,
changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
64-bit write of ptr->array32[1] = 0, among other things. If the offset
of array32[1] wasn't 64-bit aligned, it's an alignment trap if
SCTLR_EL1.A=1.
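
A minimal sketch of that hazard (hypothetical struct layout; whether
GCC actually merges the two stores depends on version and options):

  #include <stdint.h>

  struct rec {
      uint32_t array32[4];  /* array32[1] is at offset 4: not 64-bit aligned */
  };

  /* Without -mstrict-align, GCC may merge these two 32-bit stores into
     one 64-bit store at offset 4; with SCTLR_EL1.A=1 that store traps,
     even though the source only ever performs aligned 32-bit accesses. */
  void clear_pair(struct rec *ptr)
  {
      ptr->array32[1] = 0;
      ptr->array32[2] = 0;
  }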

On all Arm systems, Device memory accesses must always be aligned. User code
in general does not get access to Device memory, so this does not affect
regular users.

Kent
MitchAlsup
2023-11-13 00:09:00 UTC
Permalink
Post by Kent Dickey
Post by Scott Lurndal
Post by John Dallman
Post by Anton Ertl
True, but that has been tried out and, in a world (like Linux) where
software is developed on a platform that supports unaligned
accesses, and then compiled by package maintainers (who often are
not that familiar with the software) on a lot of platforms, the end
result was that the kernel by default performed a fixup (and put a
message in the dmesg buffer) instead of delivering a SIGBUS.
Yup. The software I work on is meant, in itself, to work on platforms
that enforce alignment, and it was a useful catcher for some kinds of bug.
However, I'm now down to one that actually enforces it, in SPARC Solaris,
and that isn't long for this world.
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
It might be easier with AArch64. Just set the A bit (bit 1) in SCTLR_EL1;
it only effects code executing in usermode.
There may even already be some ELF flag that will set it when the
file is exec(2)'d.
On AArch64, with GCC at least, you also need to specify "-mstrict-align"
when compiling all source code, to prevent the compiler from assuming it
can access structure fields in an unaligned way, even if all of your
code accesses are fully aligned. GCC can mess around behind your back,
changing ptr->array32[1] = 0 and ptr->array32[2] = 0 into a single
64-bit write of ptr->array32[1] = 0, among other things. If the offset
of array32[1] wasn't 64-bit aligned, it's an alignment trap if
SCTLR_EL1.A=1.
On all Arm systems, Device memory accesses must always be aligned. User code
in general does not get access to Device memory, so this does not affect
regular users.
<
For all the same reasons one does not do misaligned accesses to ATOMIC
memory locations, one does not do misaligned accesses to device control
registers.
<
Post by Kent Dickey
Kent
Anton Ertl
2023-11-12 14:08:11 UTC
Permalink
Post by John Dallman
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
My first attempt was in the IA-32 days, and there I found that the
alignment requirements of the hardware were incompatible with the ABI
(which required only 4-byte alignment for 8-byte FP numbers).

My second attempt was with AMD64, and there I found that gcc produced
misaligned 16-bit memory accesses for stuff like strcpy(buf, "a"). I
did not try to disable this with a flag at the time, but maybe
-fno-tree-vectorize would help. But even if I use that for my code, I
would also have to recompile all the libraries with that flag.

Another problem (on both platforms) was memcpy, memmove, etc., but I
expected that one could link with alignment-clean versions. But I
don't know how many functions are affected.

I would be surprised if ARM A64 did not have the same problems (except
the idiotic incompatibility between Intel ABI and Intel hardware).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Levine
2023-11-12 14:54:56 UTC
Permalink
Post by Anton Ertl
Post by John Dallman
I dug into what it would take to have x86-64 Linux work with alignment
enforcement turned on, and it's a huge job.
I did a first attempt in the IA-32 days, and there I found that the
alignment requirements of the hardware were incompatible with the ABI
(which required 4-byte alignment for 8-byte FP numbers).
This is a very old problem. S/360 was the first byte-addressed machine
and required aligned operands. They immediately realized that Fortran
programs that used COMMON or EQUIVALENCE often forced 8-byte FP onto
4-byte boundaries. The Fortran library had a hack that caught the
alignment fault and fixed it up very slowly. But they quickly dealt
with it in hardware. The 360/85, which brought us caches, also had "byte
oriented operands", i.e. misaligned, and this was carried into all
subsequent 370 and later machines.

It makes some sense that they did so, since caches greatly decrease the
cost of misaligned operands.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
John Dallman
2023-11-12 16:24:00 UTC
Permalink
Post by Anton Ertl
Post by John Dallman
I dug into what it would take to have x86-64 Linux work with
alignment enforcement turned on, and it's a huge job.
I did a first attempt in the IA-32 days, and there I found that the
alignment requirements of the hardware were incompatible with the
ABI (which required 4-byte alignment for 8-byte FP numbers).
By the time I was running short of alignment-sensitive platforms, x86-64
was well established, and 64-bit is preferable for this kind of
bug-hunting since accidental correct alignment is rarer.
Post by Anton Ertl
My second attempt was with AMD64, and there I found that gcc
produced misaligned 16-bit memory accesses for stuff like
strcpy(buf, "a"). I did not try to disable this with a flag
at the time, but maybe -fno-tree-vectorize would help. But
even if I use that for my code, I would also have to recompile
all the libraries with that flag.
I reached similar conclusions, reckoning that I'd need to rebuild the
Linux userland for the job, at minimum. An alternative is to wrap all
calls to system libraries and turn alignment traps off and on there,
which would be easier, given I have a well-defined set of software to
test.
Post by Anton Ertl
I would be surprised if ARM A64 did not have the same problems
(except the idiotic incompatibility between Intel ABI and Intel
hardware).
Yup. I have a lot more x86-64 hardware available, so it would be the
choice, if I didn't have so many more urgent projects to do.

John
BGB
2023-11-11 09:03:18 UTC
Permalink
Post by Anton Ertl
Post by BGB
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
Possibly true.

Some of my data hash/checksum functions were along the lines of:

  #include <stdint.h>
  #include <stddef.h>

  /* wrapper name and signature added here just to make it compile */
  uint32_t checksum32(void *buf, size_t sz)
  {
    uint32_t *cs, *cse;
    uint64_t v0, v1, v;

    cs=buf; cse=cs+((sz+3)>>2);
    v0=1; v1=1;
    while(cs<cse)
    {
      v=*cs++;
      v0+=v;
      v1+=v0;
    }
    v0=((uint32_t)v0)+(v0>>32); //*
    v1=((uint32_t)v1)+(v1>>32);
    v0=((uint32_t)v0)+(v0>>32);
    v1=((uint32_t)v1)+(v1>>32);
    v=(uint32_t)(v0^v1);
    return (uint32_t)v;
  }

*: This step may seem frivolous, but seems to increase the strength of
the checksum.

There are faster variants, but this one gives the general idea.
I'm not aware of anyone else doing it this way, but it is faster than either
Adler32 or CRC32, while giving some similar properties (the second sum
detects various issues which would be missed with a single sum).

A faster variant of this is to run multiple sets of sums in parallel
and then combine the values at the end.
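
A minimal sketch of such a parallel-lane variant (hypothetical code,
not BGB's; note it yields a different checksum value than the
single-lane version, just with a similar structure):

  #include <stdint.h>
  #include <stddef.h>

  /* Two independent (v0,v1) lanes over even/odd words; breaking the
     serial v0->v1 dependency chain lets wider cores overlap the adds.
     As in the original, sz is rounded up to whole words. */
  uint32_t checksum32_x2(void *buf, size_t sz)
  {
      uint32_t *cs = buf, *cse = cs + (((sz + 7) >> 3) << 1);
      uint64_t a0 = 1, a1 = 1, b0 = 1, b1 = 1;

      while (cs < cse)
      {
          a0 += cs[0];  a1 += a0;
          b0 += cs[1];  b1 += b0;
          cs += 2;
      }
      /* combine the lanes, then fold as before */
      uint64_t v0 = a0 + b0, v1 = a1 + b1;
      v0 = ((uint32_t)v0) + (v0 >> 32);
      v1 = ((uint32_t)v1) + (v1 >> 32);
      v0 = ((uint32_t)v0) + (v0 >> 32);
      v1 = ((uint32_t)v1) + (v1 >> 32);
      return (uint32_t)(v0 ^ v1);
  }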
Post by Anton Ertl
Post by BGB
Though, all this falls on its face, if encountering a CPU that uses
traps to emulate unaligned access (apparently a lot of the SiFive cores
and similar).
469832112 instructions:u # 0.79 insn per cycle
591015904 cycles:u
0.609751748 seconds time elapsed
0.533195000 seconds user
0.061522000 seconds sys
53533370273 instructions:u # 0.77 insn per cycle
69304924487 cycles:u
69.368484169 seconds time elapsed
69.256290000 seconds user
0.049997000 seconds sys
So when we do aligned accesses (first command), the code performs 4.7
instructions and 5.9 cycles per load, while for unaligned accesses
(second command) the same code performs 535.3 instructions and 693.0
cycles per load. So apparently an unaligned load triggers >500
additional instructions, confirming your claim. Interestingly, all
that is attributed to user time; maybe the fixup is performed by a
user-level trap or microcode.
I wasn't that sure how it was implemented, but it is "kinda weak" in any
case.

On the BJX2 core, the performance impact of using misaligned load and
store is approximately 3% in my tests; I suspect mostly due to a
slightly higher incidence of L1 cache misses.
Post by Anton Ertl
Still, the approach of having separate instructions for aligned and
unaligned accesses (typically with several instructionf for the
unaligned case) has been tried and discarded. Software just does not
declare that some access will be unaligned.
A particularly strong evidence for this is that gas generated
non-working code for ustq (unaligned store quadword) on Alpha for
several years, and apparently nobody noticed until I gave an exercise
to my students where they should use ustq (so no production use,
either).
So, every general-purpose architecture, including RISC-V, the
spiritual descendent of MIPS and Alpha (which had the division),
settled on having memory access instructions that perform both aligned
and unaligned accesses (with performance advantages for aligned
accesses).
If RISC-V implementations want to perform well for code that uses
unaligned accesses for memory copying, compression/decompression, or
hashing, they will eventually have to implement unaligned accesses
more efficiently, but at least the code works, and aligned accesses
are fast.
Why would you not go the same way? It would also save on instruction
encoding space.
I was never claiming that one should have separate instructions (since,
if the L1 cache supports unaligned access, what is the point of having
aligned-only variants of the instructions?...).

Rather, that it might make sense to do an aligned-only core, and then
trap on misaligned (possibly allowing the access to be emulated, as on
the SiFive cores); mostly in the name of making the L1 cache cheaper.

A few of my small-core experiments had used aligned-only L1 caches, but
I mostly went with natively unaligned designs for my bigger ISA
designs, mostly as I tend to make frequent use of unaligned memory
access as a "performance trick".

However, BJX2 has a natively unaligned L1 cache (well, apart from MOV.X).

Have gone and added the logic to allow MOV.X to be unaligned as well,
which mostly has the effect of a minor increase in LUT cost and similar
(mostly as the internal extract/insert logic needed to be widened from
128 to 192 bits to deal with this; with MOV.X now being handled in a
similar way to MOV.Q when this feature is enabled).

Though, one open question is whether to "formally fix" the Op96 at
((PC&0xE)==0xE) issue. Ironically, in this case, the "fix" is already
present in the Verilog code; the restriction exists more as a
"break glass to save some LUTs" option.

Well, along with some other wonk, like leaving it undefined what
happens if the instruction stream is allowed to cross a 4GB boundary,
... Branching is fine; just the PC-increment logic can save some latency
by not bothering with the high 16 bits.

I guess, in an ideal world, there wouldn't be a lot of this wonk, but
needing to often battle with timing constraints and similar does create
an incentive for corner-cutting in various areas.
Post by Anton Ertl
- anton
Anton Ertl
2023-11-11 11:11:46 UTC
Permalink
Post by BGB
Post by Anton Ertl
Hashing
Possibly true.
Definitely true: The data you want to hash may be aligned only to byte
boundaries (e.g., strings), but a fast hash function loads it at the
largest granularity possible and also processes the loaded values at
the largest granularity possible.

And in contrast to block copying, where you can do some prelude, then
perform aligned accesses, and then a postlude (at least on one side of
the copying), for this kind of hashing you want to have, in the first
step, the first n bytes in a register, because the first byte
influences the hash function result differently than the second byte.

What you could do is load aligned into a shift buffer (in a register),
and then use something like AMD64's shld to get the data in the needed
form. Same for the second side of block copying. But is this faster
on modern CPUs?
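
In plain C (shift/or rather than shld), that aligned-load approach
looks something like this minimal sketch (hypothetical helper;
little-endian is assumed, and both enclosing aligned words must be
readable, so it can over-read near buffer ends):

  #include <stdint.h>

  /* Read a 64-bit value at an arbitrary byte address using only
     aligned 64-bit loads: load the two enclosing aligned words and
     merge them. (Aliasing rules are bent here, as such code does.) */
  static uint64_t load64_via_aligned(const uint8_t *p)
  {
      uintptr_t a = (uintptr_t)p;
      const uint64_t *w = (const uint64_t *)(a & ~(uintptr_t)7);
      unsigned sh = (unsigned)(a & 7) * 8;

      if (sh == 0)
          return w[0];
      return (w[0] >> sh) | (w[1] << (64 - sh));
  }

Whether the extra shifts beat a single hardware-supported unaligned
load on a modern CPU is exactly the open question above.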

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Chris M. Thomasson
2023-11-11 19:30:19 UTC
Permalink
Post by Anton Ertl
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
[...]

Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of basically any target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of an L2 cache
line (see the sketch below).

Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.

Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
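
Those two steps in C11 look something like this minimal sketch
(hypothetical names; 64 bytes is assumed for both the cache line and
the reservation granule):

  #include <stdalign.h>
  #include <stdint.h>

  #define CL_SIZE 64  /* assumed L2 cache line size */

  /* Each slot is aligned to, and padded out to, a full cache line, so
     slots never falsely share a line (or a 64-byte reservation
     granule). */
  struct slot {
      alignas(CL_SIZE) uint64_t value;
      char pad[CL_SIZE - sizeof(uint64_t)];
  };

  static struct slot slots[16];  /* line-sized, line-aligned elements */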
Thomas Koenig
2023-11-11 21:22:00 UTC
Permalink
Post by Chris M. Thomasson
Post by Anton Ertl
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
[...]
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache
line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
For elements smaller than a cache line, that makes little
sense as written. I think there is an unwritten assumption
"for elements larger than a cache line" there, or we would all
be using 64-byte bools.
Chris M. Thomasson
2023-11-11 22:23:51 UTC
Permalink
Post by Thomas Koenig
Post by Chris M. Thomasson
Post by Anton Ertl
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
[...]
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache
line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
For elements smaller than a cache line, that makes little
sense as written. I think there is an unwritten assumption
"for elements larger than a cache line" there, or we would all
be using 64-byte bools.
:^). Basically, I am thinking along the lines of cache line allocators
that return properly aligned and padded L2 lines. Aligning and padding
on L2 lines helps get rid of any nasty false sharing. Remember those
damn hyperthreaded Intel processors that had 128-byte L2 lines, but
could falsely share the low 64 bytes with the high 64 bytes? IIRC, Intel
had a workaround that involved offsetting a thread's stack using alloca.

Also, see what happens if you straddle an L2 cache line and use it for a
LOCK'ed atomic RMW on Intel. It just might assert a bus lock.
Chris M. Thomasson
2023-11-11 22:28:48 UTC
Permalink
Post by Thomas Koenig
Post by Chris M. Thomasson
Post by Anton Ertl
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
Hashing
[...]
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache
line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
For elements smaller than a cache line, that makes little
sense as written. I think there is an unwritten assumption
"for elements larger than a cache line" there, or we would all
be using 64-byte bools.
Also, think about the atomic state for a mutex. Say:

<pseudo-code>

struct mutex_atomic_state
{
    alignas(64) std::atomic<word> m_state;
    // the alignment pads the struct out to a full cache line
};

</pseudo-code>

Well, you want this state to be aligned on a cache line boundary and
padded up to the size of a cache line. You want to avoid false sharing
between this state and any user state used in the locked region.
MitchAlsup
2023-11-11 22:58:32 UTC
Permalink
Post by Thomas Koenig
Post by Chris M. Thomasson
Fwiw, proper alignment is very important wrt a programmer to gain some
of the benefits of, basically, virtually "any" target architecture. For
instance, _proper_ cache line alignment, say, L2, and its 64 bytes. So,
the programmer can set up an array that is aligned on a cache line
boundary and pad each element of said array up to the size of a L2 cache
line.
Two steps... Align your memory on a proper cache line boundary, and pad
the size of each element up to the size of a single cache line.
For elements smaller than a cache line, that makes little
sense as written. I think there is an unwritten assumption
"for elements larger than a cache line" there, or we would all
be using 64-byte bools.
<
Then consider a 4-way banked cache (¼ cache line per bank) and an access
that straddles a ¼ line boundary and multiple AGEN units. So one AGEN
unit creates the access to the container which straddles the boundary
while another creates an access into the second part of the spanning
access.
<
Then consider that "program order" information is not instantaneously
available, and the bank selector picks the second access. Now, that
spanning access is no longer ATOMIC, and might even see a Snoop between
its first access and its spanning access...............
<
MitchAlsup
2023-11-11 22:53:09 UTC
Permalink
Post by Chris M. Thomasson
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
<
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
<
Scott Lurndal
2023-11-12 17:27:36 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
<
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
<
That is indeed the case. Consider the effect of a page fault when
an unaligned access crosses a page boundary, for example; leaving
aside, of course, all the difficulties inherent in dealing with
atomicity when the access spans two cache lines.

ARM implementations of LL/SC (Load Exclusive/Store Exclusive) can
have an arbitrarily sized reservation granule (ARM's Cortex-M7,
for example, has a single reservation granule the size of the
full address space). Any store between the loadex and storex
instructions is allowed by the architecture (V7 and V8) to cause
the storex to fail.
Terje Mathisen
2023-11-13 15:10:20 UTC
Permalink
Post by MitchAlsup
Post by Chris M. Thomasson
Think of LL/SC... If one did not honor the reservation granule....
well... Shit.. False sharing on a reservation granule can cause live
lock and damage forward progress wrt some LL/SC setups.
<
One should NEVER (N. E. V. E. R.) attempt ATOMIC stuff on an unaligned
container. Only aligned containers possess ATOMIC-smelling properties.
This is so obviously correct that you should not have needed to mention
it. Hammering HW with unaligned (maybe even page-straddling) LOCKed
updates is something that should only ever be done for testing purposes.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup
2023-11-12 21:37:39 UTC
Permalink
Post by BGB
Post by MitchAlsup
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Wait, are you arguing for aligned-only memory ops here?...
<
I have not argued for aligned memory references since about 2000 (maybe as
early as 1991).
<
BGB
2023-11-13 01:28:51 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by MitchAlsup
Post by BGB
One can argue that aligned-only allows for a cheaper L1 D$, but also
   Fast memcpy;
   LZ decompression;
   Huffman;
   ...
<
Time found that HW can solve the problem way more than adequately--
obviating its inclusion entirely. {Sooner or later Reduced leads RISC}
<
Wait, are you arguing for aligned-only memory ops here?...
<
I have not argued for aligned memory references since about 2000 (maybe as
early as 1991).
<
Makes sense, but I was confused as to what was being argued here...

I prefer unaligned memory access, since it allows a lot of nifty stuff
to be done.

But, I can note that the main drawback it has is in terms of requiring a
more expensive L1 cache.

An aligned-only cache only needs:
  A single row of cache lines;
  To check a single address for hit/miss;
  A simpler set of MUXes for extract/insert.

Vs, say:
  Two rows of cache lines (say, even and odd);
  Needing to check two addresses;
  More complicated extract/insert logic.

But, say, if one needs to operate within the limits of an aligned-only
cache, then even something like an LZ4 decompressor is painfully slow,
as it has to basically do damn near everything 1 byte at a time (or, at
least, more so than it does already).

I once did have a compressor (FeLZ32) more designed for the constraints
of the SuperH ISA (and aligned-only memory access), but its main
"feature" was that pretty much everything was defined in terms of 32-bit
words (it was not copying bytes but rather 32-bit words, and the encoded
stream was itself an array of 32-bit words).

It also managed to beat out LZ4's performance by a fair margin on the
Piledriver I was using at the time.

But, this performance advantage effectively evaporated on my Ryzen
(where LZ4 speed increased significantly), and was also mostly N/A on
BJX2. In this case, the byte-oriented formats were more preferable, as
they got better compression.

Like, a lot of the performance tricks I had developed on the Piledriver
were effectively rendered moot.

Though, some amount of the tricks were mostly workarounds for "things
that were slow", which the newer CPU had made effectively unnecessary or
counterproductive.

...
Quadibloc
2023-11-10 04:43:14 UTC
Permalink
Post by MitchAlsup
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
Since I have a complete set of memory-reference instructions for which
unaligned memory-reference instructions are supported, the problem isn't
that I think unaligned fetches and stores take too many gates.

Rather, being able to only specify aligned accesses saves *opcode space*,
which lets me fit in one complete set of memory-reference instructions that
can use all the base registers, all the index registers, and always use all
the registers as destination registers.

While the unaligned-capable instructions, which also offer important
additional addressing modes, had to have certain restrictions to fit in.

So they use six out of the seven index registers, they can use only half
the registers as destination registers on indexed accesses, and they use
four out of the seven base registers.

Having 16-bit instructions for the possibility of more compact code meant
that I had to have at least one of the two restrictions noted above -
having both restrictions meant that I could offer the alternative of
aligned-only instructions with neither restriction, which may be far less
painful for some.

John Savard
MitchAlsup
2023-11-12 21:25:20 UTC
Permalink
Post by Quadibloc
Post by MitchAlsup
I never had any aligned memory references. The HW overhead to "fix" the
problem is so small as to be compelling.
Since I have a complete set of memory-reference instructions for which
unaligned memory-reference instructions are supported, the problem isn't
that I think unaligned fetches and stores take too many gates.
Rather, being able to only specify aligned accesses saves *opcode space*,
<
I am not buying this. Which takes more opcode space::
a) an ISA with unaligned only LDs and STs (11)
or
b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs (another 11)
<
It is a simple entropy (allocated counting) problem
<
Post by Quadibloc
which lets me fit in one complete set of memory-reference instructions that
can use all the base registers, all the index registers, and always use all
the registers as destination registers.
While the unaligned-capable instructions, that offer also important
additional addressing modes, had to have certain restrictions to fit in.
So they use six out of the seven index registers, they can use only half
the registers as destination registers on indexed accesses, and they use
four out of the seven base registers.
Having 16-bit instructions for the possibility of more compact code meant
that I had to have at least one of the two restrictions noted above -
having both restrictions meant that I could offer the alternative of
aligned-only instructions with neither restriction, which may be far less
painful for some.
John Savard
Quadibloc
2023-11-12 23:15:43 UTC
Permalink
Post by MitchAlsup
a) an ISA with unaligned only LDs and STs (11)
or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
(another 11)
That is true, *other things being equal*.

However, what I had was:

An ISA with unaligned loads and stores, that could use all 32 destination
registers, and all 8 index and base registers. (Call this A.)

That took up too much opcode space to allow 16-bit instructions.

So I made various compromises to shave one bit off the loads and stores,
and then I could have 16-bit instructions. (Call this B.)

But I didn't like the compromises.

So I made _more_ compromises, to shave _another_ bit off the loads and
stores. This way, I had enough opcode space to add aligned-only loads
and stores... that could use all 32 destination registers, and all 8
index and base registers. (Call this C.)

Since other things _were not equal_, it was perfectly possible for C
to use less opcode space than A, and about the same amount of opcode
space as B. So I got to use 16-bit instructions AND have a set of loads
and stores that used all 32 destination registers, and all 8 index and
base registers.

The compromises on the _unaligned_ loads and stores were painful, but
they were chosen so that code using them wouldn't have to be
significantly less efficient than code with the set of loads and stores
in A.

John Savard
MitchAlsup
2023-11-13 00:16:24 UTC
Permalink
Post by Quadibloc
Post by MitchAlsup
a) an ISA with unaligned only LDs and STs (11)
or b) an ISA with unaligned LDs and STs (11) and aligned LDs and STs
(another 11)
That is true, *other things being equal*.
<
A poorly chosen starting point (dark alley)
<
Post by Quadibloc
An ISA with unaligned loads and stores, that could use all 32 destination
registers, and all 8 index and base registers. (Call this A)
That took up too much opcode space to allow 16-bit instructions.
So I made various compromises to shave one bit off the loads and stores,
and then I could have 16 bit instructions. (Call this B)
But I didn't like the compromises.
<
Captain Obvious to the rescue::
<
Post by Quadibloc
So I made _more_ compromises, to shave _another_ bit off the loads and
stores. This way, I had enough opcode space to add aligned-only loads
and stores... that could use all 32 destination registers, and all 8
index and base registers. (Call this C)
<
Back out of the dark alley, and start from first principles again.
<
Post by Quadibloc
Since other things _were not equal_, it was perfectly possible for C
to use less opcode space than A, and about the same amount of opcode
space as B. So I got to use 16-bit instructions AND have a set of loads
and stores that used all 32 destnation registers, and all 8 index and
base registers.
<
Maybe "less opcode space" if you count bits, but it is "more opcode space"
if/when you enumerate all the opcodes within the space.
<
Post by Quadibloc
The compromises on the _unaligned_ loads and stores were painful, but
they were chosen so that code using them wouldn't have to be be
significantly less efficient than code with the set of loads and stores
in A.
<
Does your compiler agree with this assertion ??
<
Post by Quadibloc
John Savard
Quadibloc
2023-11-13 00:54:49 UTC
Permalink
Post by MitchAlsup
Does you compiler agree with this assertion ??
As I'm still only in the early stages of roughing out
the bare outlines of an ISA, I have not yet built such
advanced diagnostic tools, I must admit.

However, my original compromise had been to reduce
the number of index registers used with memory-reference
instructions to 3 from 7.

The two improved compromises I used in this later effort
were:

Compromise 1:

Reduce the number of base registers used with memory-reference
instructions (when using a 16-bit displacement) to 3 from 7.

I figured that _this_ was far less likely to reduce efficiency,
since normally not that many base registers were used in any
case.

Compromise 2:

When an instruction is not indexed, reduce the size of the index
register field to two bits, both containing 0.

When an instruction is indexed, reduce the size of the destination
register field to 4 bits from 5, thus allowing only 16 of the 32
registers to be used with indexed memory accesses.

This one is more painful, but it had historical precedent. One
consequence is that the number of index registers is reduced to six
from seven, because now index register 4 "looks like zero".
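
In C, a decode sketch of compromise 2 might look like this. The bit
layout (the index field's low two bits doubling as the not-indexed
flag, which is exactly what makes index register 4, binary 100,
unusable) is only a reconstruction for illustration; no field
positions are final.

   #include <stdint.h>

   /* One plausible layout of the 7 register-selection bits:
        bits [6:5] - low two bits of the index register number,
                     00 meaning "not indexed"
        bit  [4]   - high bit of the index number, or bit 4 of the
                     destination when not indexed
        bits [3:0] - low four bits of the destination register      */

   typedef struct { int indexed, xreg, dreg; } MemOp;

   static MemOp decode_regs(uint32_t insn)
   {
       MemOp op;
       unsigned x_lo = (insn >> 5) & 3;
       if (x_lo == 0) {                   /* flag is 00: not indexed */
           op.indexed = 0;
           op.xreg = 0;
           op.dreg = (int)(insn & 31);    /* full 5-bit destination  */
       } else {
           op.indexed = 1;
           op.xreg = (int)((((insn >> 4) & 1) << 2) | x_lo);
           op.dreg = (int)(insn & 15);    /* 16 destinations only    */
       }
       return op;
   }

The reachable index registers come out as 1, 2, 3, 5, 6 and 7:
register 4 (binary 100) has low bits 00 and so decodes as the
non-indexed case, which is the "six from seven" above.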

John Savard
Quadibloc
2023-11-13 02:44:57 UTC
Permalink
Post by MitchAlsup
A poorly chosen starting point (dark alley)
Back out of the dark alley, and start from first principles again.
By the way, I think you mean a _blind_ alley.

A dark alley is just a dangerous place, since robbers can attack you
there without being seen.

A _blind_ alley is one that had no exit, one that is a dead end. That
seems to better fit the context of your remarks.

John Savard
MitchAlsup
2023-11-13 03:06:03 UTC
Permalink
Post by Quadibloc
Post by MitchAlsup
A poorly chosen starting point (dark alley)
Back out of the dark alley, and start from first principles again.
By the way, I think you mean a _blind_ alley.
A dark alley is just a dangerous place, since robbers can attack you
there without being seen.
A _blind_ alley is one that had no exit, one that is a dead end. That
seems to better fit the context of your remarks.
<
Based on those definitions, I definitely meant dark, as in dangerous,
as opposed to no way out except backwards.
Post by Quadibloc
John Savard
Chris M. Thomasson
2023-11-13 21:58:06 UTC
Permalink
Post by Quadibloc
Post by MitchAlsup
A poorly chosen starting point (dark alley)
Back out of the dark alley, and start from first principles again.
By the way, I think you mean a _blind_ alley.
A dark alley is just a dangerous place, since robbers can attack you
there without being seen.
Expose the darkness to the light, before any adventures...? ;^)
Post by Quadibloc
A _blind_ alley is one that had no exit, one that is a dead end. That
seems to better fit the context of your remarks.
John Savard
Stefan Monnier
2023-11-13 16:46:47 UTC
Permalink
Post by Quadibloc
That took up too much opcode space to allow 16-bit instructions.
You might want to try and get fancy in your short instructions by
"randomizing" the subset of registers they can access.

E.g. allow both your short LD and ST instructions to access 16
registers, but not exactly the same 16.
Or allow your arithmetic instructions to access only 8 registers for their
input and output args but not exactly the same 8 for the two inputs
and/or for the output.

I suspect that if done well, it could give benefits similar to the
skewed-associative caches. The other upside is that it makes register
allocation *really* interesting, thus opening up opportunities to
spend a few more years working on that subproblem :-)

To up the ante, you could make the set of registers reachable from each
instruction depend not just on the opcode but also on the instruction's
address, so you can sometimes avoid a spill by swapping two
instructions. This would allow the register allocation to interact in
even more interesting ways with instruction scheduling.
There could be a few more PhDs worth of research there.
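
A toy version of the mapping in C (all details here are invented for
the example): a 4-bit register field widens to a 5-bit architectural
register number, with the missing high bit derived from the opcode
and, per the second idea, the instruction's address.

   #include <stdint.h>

   /* Widen a 4-bit register field to a 5-bit architectural register
      number. Which 16-register subset an instruction sees depends
      on its opcode and its address, so neighbouring instructions
      reach different, overlapping subsets.                          */
   static int map_reg(unsigned field4, unsigned opcode, uint64_t pc)
   {
       unsigned hi = (opcode ^ (unsigned)(pc >> 2)) & 1;  /* invented hash */
       return (int)((hi << 4) | (field4 & 15));           /* 0..31 */
   }

The register allocator then has to keep any value that is live across
two instructions in the intersection of the subsets the two can
reach, which is where it gets *really* interesting.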


Stefan
Quadibloc
2023-11-14 14:54:32 UTC
Permalink
Post by Stefan Monnier
You might want to try and get fancy in your short instructions by
"randomizing" the subset of registers they can access.
E.g. allow both your short LD and ST instructions to access 16
registers, but not exactly the same 16.
Or allow your arithmetic instructions to access only 8 registers for
their input and output args but not exactly the same 8 for the two
inputs and/or for the output.
I suspect that if done well, it could give benefits similar to the
skewed-associative caches. The other upside is that it makes register
allocation *really* interesting, thus opening up opportunities to spend
a few more years working on that subproblem :-)
I would like to be able to say that this idea was too bizarre even for
me.

However, one of the ideas I toyed with before settling on my current
iteration of Concertina II was to

- drop the aligned memory-reference instructions
- somehow squeeze the 32-bit operate instructions into the space left
over by the byte instructions in the family
- thereby doubling the space available for 16-bit instructions.

The instruction slots of the form 0-0- would be as before: two instructions
where both source and destination are in the same group of eight registers.

The instruction slots of the form 0-1- would contain two 16-bit
instructions where the source and destination register fields are each
four bits long, allowing (as in the indexed memory-reference
instructions) the use of the first four registers in each of the four
groups of eight registers.

Thus, one instruction type uses all the registers, and the other
allows transfers between the banks of eight registers.
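
In C, the two field expansions would have looked something like this
(the bit positions, and where the 0-0- form's bank number comes from,
are assumptions for the sketch):

   /* 0-1- form: a 4-bit field is 2 bits of bank plus 2 bits picking
      one of the first four registers in that bank.                  */
   static int reg_from_4bit(unsigned f4)
   {
       return (int)(((f4 >> 2) & 3) * 8 + (f4 & 3));
   }

   /* 0-0- form: source and destination share one bank of eight;
      how the bank is chosen is left open here.                      */
   static int reg_from_3bit(unsigned f3, unsigned bank)
   {
       return (int)((bank & 3) * 8 + (f3 & 7));
   }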

So, sadly, I actually *did* contemplate going there. Fortunately, I
thought better of it.
Post by Stefan Monnier
To up the ante, you could make the set of registers reachable from each
instruction depend not just on the opcode but also on the instruction's
address, so you can sometimes avoid a spill by swapping two
instructions. This would allow the register allocation to interact in
even more interesting ways with instruction scheduling.
There could be a few more PhDs worth of research there.
That would definitely be one trick to allow access to more registers than
the number of opcode bits allows.

John Savard