Post by Lawrence D'Oliveiro
A potential alternative would be something like a scaled-up 64-bit
variant of an ESP32 style design (or a 64-bit version of the Qualcomm
Hexagon).
Would you end up with something similar to RISC-V?
Well, like RISC-V with explicitly parallel instructions (rather than
implicitly via superscalar).
Both ESP32 and Hexagon use instruction words that can be tagged to
execute in parallel. RISC-V doesn't do this, so the CPU would need to
look at the instructions (and check for register conflicts) before
deciding to do so. For a cost-effective implementation, this logic is
necessarily conservative.
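As a rough illustration of what that conservative check involves (the
struct layout and names here are mine, purely illustrative, not from any
real core):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded-instruction record; fields are illustrative. */
typedef struct {
    uint8_t rd;        /* destination register (0 = x0, writes discarded) */
    uint8_t rs1, rs2;  /* source registers */
    bool    writes_rd;
} DecInsn;

/* Conservative dual-issue check: the second instruction may co-issue
   with the first only if it neither reads nor overwrites the first
   one's result. A real core would also have to check structural
   hazards (register-file ports, execute units), which is where the
   cost goes up with issue width. */
static bool can_pair(const DecInsn *a, const DecInsn *b)
{
    if (a->writes_rd && a->rd != 0) {
        if (b->rs1 == a->rd || b->rs2 == a->rd)  /* RAW hazard */
            return false;
        if (b->writes_rd && b->rd == a->rd)      /* WAW hazard */
            return false;
    }
    return true;
}
```

With tagged bundles (ESP32/Hexagon style), the compiler has already made
this decision, so the decoder can skip the comparison network entirely.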
Or, something like my own BJX2 ISA, but it seems it isn't *that* far
from RISC-V in some areas, and trying to support both in the same CPU
core has led to some amount of convergence (in many cases where RV64 had
a feature that BJX2 lacked, BJX2 has ended up gaining the feature in
question, albeit often slightly modified; the mechanisms often don't
demand that exactly the same instruction be implemented in exactly the
same way).
Say, for example: GT and LT are mirrors of each other; immediate size
and encoding matter a lot to the decoder but are mostly invisible to the
execute-stage logic; etc...
Well, and some major features, such as the presence or absence of ALU
status flags, didn't matter as neither ISA uses ALU status flags.
In my case, SR.T doesn't really count, as it is used almost exclusively
as a predication-control flag. If I were to do a new ISA with a similar
design, SR.T would likely be made exclusively for predication control,
and I would find some other way to do "ADD with Carry" and similar.
In my case, there are 1/4 as many hardware registers as IA-64, and 1/3
as would be needed for the RISC-V Privileged spec.
Though, would still face the potential issue that WEX (Wide-EXecute)
eats some amount of encoding space, and on a "good" superscalar
implementation, or with OoO, its existence would become mostly moot (it
mostly mattering for cores in the area of too cheap to do superscalar
effectively, but expensive enough to justify being able to execute
instructions in parallel, if the compiler helps them out).
Looks like scaling this up past a width of 2 or 3 becomes mostly no-go,
and say a core with 4+ lanes would almost invariably need to be OoO.
Say, the cost of the Execute stages goes up steadily, but the ability of
the compiler to static schedule things becomes steadily less effective
(seemingly doesn't really work much past 3-wide and a naive
strictly-in-order pipeline).
In practice, this part of the space seems to be mostly dominated by
higher-end microcontrollers, and DSPs (with non-budget "application
class" chips and above mostly going over to OoO).
Meanwhile, the low-end of the microcontroller space tends to be
dominated by 8/16 bit scalar processors, and this probably isn't going
to change anytime soon (like, if you don't need more than a 16-bit
ALU and 2K of RAM, why go for anything bigger?...).
Like, the processing requirements for keyboards and mice haven't changed
much (and the main thing they mostly need to deal with is the needless
complexity of the USB protocol).
In my case:
The gains of 3-wide over 2-wide are small; the main reason it is 3-wide
in my case is that, if I am already needing to pay for 96-bit decode and
a 6R3W register file to get full advantage of the 2-wide case, the
nominal cost increase of a 3rd ALU and similar is low.
Though, I did save some cost by eliminating the Lane 3 integer shift,
since a Lane 3 integer shift is rare and the integer shift logic isn't
cheap (at least vs ADD/SUB/AND/OR/XOR and MOV/EXTS/EXTU). Lane 3 exists
mostly for spare register ports and the occasional MOV or ALU op.
Granted, my compiler's strategy is fairly naive:
First emit code as-if it were a plain RISC style ISA;
Then feed it through a stage that tries to shuffle and bundle
instructions:
Shuffle first, to try to untangle RAW dependencies;
Then bundle, to try to increase ILP (though bundling typically brings
instructions with RAW dependencies back into adjacency).
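A toy version of the bundling step (greatly simplified to a greedy
2-wide pass over a flat op list; all names here are illustrative, not my
compiler's actual structures) might look like:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of an already-shuffled instruction stream. */
typedef struct { uint8_t rd, rs1, rs2; } Op;

/* b may join a's bundle only if it neither reads nor rewrites a's dest.
   (Toy check: ignores the r0 special case and structural limits.) */
static bool indep(const Op *a, const Op *b)
{
    return b->rs1 != a->rd && b->rs2 != a->rd && b->rd != a->rd;
}

/* Greedy 2-wide bundler: tag[i] nonzero means "ops[i] executes in the
   same bundle as ops[i-1]". Returns the resulting bundle count; fewer
   bundles means more ILP was exposed. */
static size_t bundle(const Op *ops, size_t n, uint8_t *tag)
{
    size_t bundles = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && tag[i - 1] == 0 && indep(&ops[i - 1], &ops[i])) {
            tag[i] = 1;   /* pair with the previous op */
        } else {
            tag[i] = 0;   /* starts a new bundle */
            bundles++;
        }
    }
    return bundles;
}
```

This is why a shuffle pass has to run first: a dependency chain bundles
to one op per bundle, whereas interleaved independent chains pack well.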
Code which has a big pile of mostly independent operations tends to do
better here (and code with lots of parallel expressions and lots of
variables seems to be an area where my ISA is beating RISC-V).
Note that for code with "small tight loops", it is a lot closer, and in
some cases, this is an area where some of RISC-V's design choices make
more sense.
For example, 2-register compare-and-branch operations are in general
kind of expensive, but seem to be useful in some cases with tightly
running loops:
while(cs<cse)
*ct++=*cs++;
But, when facing off against RISC-V on performance, I have ended up
partly re-evaluating them.
And, in the case of tight loops, even the limitation in my case of them
only having an 8-bit displacement is less of an issue:
XG2:
BRLT Rm, Rn, Disp8s //Branch if Rn<Rm, +/- 256B
BRLT Rn, Disp13s //Branch if Rn< 0, +/- 8K
CMPGT Rn, Rm; BT Disp23s //Branch if Rm>Rn, +/- 8MB
Baseline:
BRLT Rm, Rn, Disp8s //Branch if Rn<Rm, +/- 256B
BRLT Rn, Disp11s //Branch if Rn< 0, +/- 2K
CMPGT Rn, Rm; BT Disp20s //Branch if Rm>Rn, +/- 1MB
If the loop is tight, then the branch target is much more likely to be
within the 256 byte window (well, and if the CPU has the RV decoder; it
already needs to pay for the EX logic needed to support this
instruction). So, it is more an open question of "is it worthwhile to
have this instruction in a CPU that doesn't have a RISC-V decoder?".
But, looks like, "for best performance", it may be inescapable.
For the 32-bit encoding, displacement doesn't get any bigger, but it is
possible to use a jumbo-encoding to expand it to a 32-bit displacement.
Main weak points on the RV side are still the usual:
Lack of indexed load/store;
Poor handling of constants that don't fit in 12 bits.
So, say, Imm12s for ALU ops is arguably better than Imm9u/Imm9n/Imm10s
(or Imm10u/Imm10n/Imm11s), but not by enough to offset the ISA
effectively falling on its face when Imm12s fails.
Could be helped though if RV added a "LI Xd, Imm17s" instruction.
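As a sketch of the immediate-range issue (the fits_imm17s helper below
models the hypothetical "LI Xd, Imm17s", which does not exist in the
actual RISC-V spec):

```c
#include <stdint.h>

/* Does v fit in a signed 12-bit immediate (RISC-V I-type reach)? */
static int fits_imm12s(int64_t v) { return v >= -2048 && v <= 2047; }

/* Hypothetical "LI Xd, Imm17s": a signed 17-bit range. */
static int fits_imm17s(int64_t v) { return v >= -65536 && v <= 65535; }

/* Rough instruction count to materialize a 32-bit constant on RV:
   one op if within ADDI reach, otherwise a LUI+ADDI pair. Constants
   within the hypothetical 17-bit reach would drop from 2 ops to 1. */
static int rv_const_cost(int32_t v)
{
    if (fits_imm12s(v)) return 1;   /* ADDI rd, x0, imm */
    return 2;                       /* LUI rd, hi20; ADDI rd, rd, lo12 */
}
```

This ignores constants past 32 bits (which get worse still on RV64
without constant-pool loads or longer sequences), but it shows where the
12-bit cliff sits.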
The "SHnADD" instruction can help with indexed Load/Store, but would
not entirely "close the gap" for programs like Doom or similar (it may
or may not make a difference for Dhrystone, where things are pretty
tight; but seemingly much of this is due to a relative lack of
array-oriented code in Dhrystone, so SHnADD would, similarly, not gain
much here).
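For reference, the Zba shift-add semantics (rd = (rs1 << n) + rs2) can
be modeled as:

```c
#include <stdint.h>

/* RISC-V Zba shift-add semantics: rd = (rs1 << n) + rs2, n in {1,2,3},
   covering 2/4/8-byte element scaling. */
static uint64_t sh1add(uint64_t rs1, uint64_t rs2) { return (rs1 << 1) + rs2; }
static uint64_t sh2add(uint64_t rs1, uint64_t rs2) { return (rs1 << 2) + rs2; }
static uint64_t sh3add(uint64_t rs1, uint64_t rs2) { return (rs1 << 3) + rs2; }
```

So "arr[i]" for a 4-byte element becomes SH2ADD plus a plain load, i.e.
two ops instead of the three-op SLLI+ADD+load sequence; a true indexed
load/store would do it in one.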
But, yeah, higher priorities, if I were to redo things:
Put GBR and LR into GPR space;
Having these in CR space negatively affects prologs and epilogs;
Make encoding rules more consistent;
...
As noted, some of my ideas would make the first 4 registers special:
R0: ZR / PC
R1: LR / TBR (TP)
R2: SP
R3: GBR (GP)
But, likely keep similar register assignments to my existing ISA (but,
unlike my existing register space, could also be directly compatible
with the RISC-V ABI; without something like the existing XG2RV hack).
Putting LR and GBR in GPR space would help with prolog/epilog sequences;
A zero register would eliminate a lot of special cases.
But, this is not compatible with my existing ABI.
Where R2/R3 are used by the existing ABI, and R15 is SP.
But, unclear if it would be worthwhile, since any "redo" would still
have many of my existing issues, and is possibly moot *unless* it can
also either have a significant performance advantage over both RISC-V
and my existing ISA design, or if I instead switched over to RISC-V as
the primary ISA (and implemented the privileged spec) to potentially
leverage the ability to run an existing OS (such as Linux).
But, this latter point would likely only really matter if I found
detailed documentation, say, on SiFive's memory map and hardware
interfaces, and also cloned this (otherwise, a custom CPU core still
isn't going to be able to run an "off the shelf" Linux build; even if
the ISA itself matches up).
Though, a more intermediate option might just be to consider eliminating
the Read-Only pages range (from 0x00010000..0x00FFFFFF), and instead
moving this over to RAM (with 00000000..0000BFFF as ROM, and
0000C000..0000FFFF as SRAM), which would make the hardware memory map
more compatible with what GCC expects (though, would need to take care
such that loading the kernel doesn't stomp the Video RAM or similar,
which had otherwise been mapped into this "hidden" part of the RAM space).
Well, and any OS porting attempt would need to deal with things like the
different hardware interfaces and different approaches to interrupt
handling and similar (this mostly affects one's ASM code).
But, I guess, one limitation seems to be:
Vs the existing ISAs, there is little real way to gain much additional
performance for normal integer workloads (at the ISA level);
Similarly, not much obvious way to either make the core significantly
cheaper, nor to make significant gains in terms of clock-speed via ISA
level design choices.
Some of my design attempts had lost the PrWEX encodings, but, if one
allows for a superscalar implementation, the loss of PrWEX isn't too bad
(could try to use optional superscalar support to mop this up).
Similarly, PrWEX is a fairly small subset of the total instruction count.
Say, if seen as 2 bits in the instruction word:
00: Execute if True
01: Execute if False
10: Scalar
11: Wide-Execute (arguably redundant on OoO).
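A minimal decode of such a tag field might look like this (the bit
placement here is an assumption purely for illustration; the real
BJX2/XG2 encodings place these bits differently):

```c
#include <stdint.h>

/* Illustrative decode of the hypothetical 2-bit tag field. */
typedef enum {
    EXEC_IF_TRUE  = 0,  /* 00: predicated, runs when SR.T is set   */
    EXEC_IF_FALSE = 1,  /* 01: predicated, runs when SR.T is clear */
    EXEC_SCALAR   = 2,  /* 10: plain scalar instruction            */
    EXEC_WIDE     = 3   /* 11: wide-execute, bundled with neighbor */
} ExecTag;

static ExecTag decode_tag(uint32_t insn_word)
{
    /* Assumes the tag occupies the top two bits of the 32-bit word. */
    return (ExecTag)((insn_word >> 30) & 3);
}
```

On an OoO core, EXEC_WIDE would simply decode the same as EXEC_SCALAR,
which is why the encoding becomes mostly moot there.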
Though, a few "arguably very wasteful" instructions (Load 24 bit
constants into a fixed register), have served multiple purposes (serving
as both the Jumbo Prefixes and PrWEX blocks), and in redesign attempts,
trying to eliminate or replace this "obvious waste" has left the problem
of how to deal with PrWEX and Jumbo prefixes. The unconditional branch
doesn't quite fit, as it also needs to be able to be predicated.
Though, I guess it could also be left as a more generalized:
Unconditional and Scalar-Only block.
But, my existing ISA has continued on with incremental fiddling:
Most recent change was ending up tweaking the decoding rules for the AND
and RSUB instructions, as I was faced with an issue:
Negative immediates for AND and RSUB were a lot more common than
expected, but were not encodable with the prior rules;
There isn't really enough encoding space to add new one-extended
variants of these.
So, ended up changing the rules, noting that this seemingly would not
break existing binaries (the previous compiler output was not encoding
AND with immediate values between 512 and 1023), and would allow
encoding the apparent 12-15% of cases where the immediate was negative
(vs the roughly 1.5% of potential immediate values between 512 and 1023).
Also looked at OR and XOR (which were in a similar situation), but noted
that it seems that negative operands to OR and XOR are very rare
(roughly 1% across multiple programs), so it is better to leave these as
the prior rule (even if this now partially breaks symmetry between
AND/OR/XOR).
The percentage of negative inputs to AND was missed previously, as the
stats weren't really distinguishing things based on which operator was
being looked at.
Note that for ADD (with both ADD and SUB combined into one operation):
Balance is ~ 60% positive, 40% negative.
So, zero-extended immediate values would only make sense with separate
ADD/SUB.
It is harder to confirm whether my original choice of Disp10s for
Load/Store, rather than Disp10u, was better. The ratio between
displacements between -511..-64 vs 512..1023 seems to vary between
programs, and is pretty close either way (in any case, probably not
worth dealing with binary breakage over something with an epsilon of
around 1%, and supporting 32-bit ops with a negative displacement is
slightly better in-general than only supporting a positive displacement...).
...