Post by Quadibloc
Post by Marcus
Post by Quadibloc
But there's also one very bad thing about a vector register file.
Like any register file, it has to be *saved* and *restored* under
certain circumstances. Most especially, it has to be saved before,
and restored after, other user-mode programs run, even if they
aren't _expected_ to use vectors, as a program interrupted by
a real-time-clock interrupt to let other users do stuff has to
be able to *rely* on its registers all staying undisturbed, as if
no interrupts happened.
Yes, that is the major drawback of a vector register file, so it has to
be dealt with somehow.
Yes, and therefore I am looking into ways to deal with it somehow.
Why not just use Mitch Alsup's wonderful VVM?
It is true that the state of the art has advanced since the Cray I
was first introduced. So, perhaps Mitch Alsup has indeed found,
through improving data forwarding, as I understand it, a way to make
the performance of a memory-memory vector machine (like the Control
Data STAR-100) match that of one with vector registers (like the
Cray I, which succeeded where the STAR-100 failed).
But the historical precedent seems to indicate otherwise, and while
data forwarding is very definitely a good thing (and, indeed, necessary
to have for best performance _on_ a vector register machine too), it
has its limits.
What _could_ substitute for vector registers isn't data forwarding; it's
something that brings vector operands in closer to the CPU, where they're
more quickly accessible. So a STAR-100 with a *really good cache*, as
well as data forwarding, could, I suppose, compete with a Cray I.
My first question, though, is whether or not we can really make caches
that good.
I think that you are missing some of the points that I'm trying to make.
In my recent comments I have been talking about very low end machines,
the kinds that can execute at most one instruction per clock cycle, or
maybe less, and that may not even have a cache at all.
I'm saying that I believe that within this category there is an
opportunity for improving performance with very little cost by adding
vector operations.
E.g. imagine a non-pipelined implementation with a single memory port,
shared by instruction fetch and data load/store, that requires perhaps
two cycles to fetch and decode an instruction, and executes the
instruction in the third cycle (possibly accessing the memory, which
precludes fetching a new instruction until the fourth or even fifth
cycle).
Now imagine if a single instruction could iterate over several elements
of a vector register. This would mean that the execution unit could
execute up to one operation every clock cycle, approaching the
performance of a pipelined 1-CPI machine. The memory port would
be free for data traffic as no new instructions have to be fetched
during the vector loop. And so on.
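To make the idea concrete, here is a minimal Verilog sketch (purely
illustrative; the module, the signal names, and the fixed 8-element
vector are assumptions on my part, not any particular core) of an
execute unit that steps through a vector register on its own, holding
the front end off with a busy signal:

module vec_add_unit #(
    parameter ELEMS = 8,
    parameter WIDTH = 32
)(
    input  wire clk,
    input  wire reset,
    input  wire start,   // pulses when a vector ADD has been decoded
    output reg  busy     // front end pauses fetch/decode while high
);
    // Illustrative source/destination vector registers; a real core
    // would read and write a shared vector register file here.
    reg [WIDTH-1:0] va [0:ELEMS-1];
    reg [WIDTH-1:0] vb [0:ELEMS-1];
    reg [WIDTH-1:0] vd [0:ELEMS-1];

    reg [3:0] idx;

    always @(posedge clk) begin
        if (reset) begin
            busy <= 1'b0;
            idx  <= 4'd0;
        end else if (start && !busy) begin
            busy <= 1'b1;
            idx  <= 4'd0;
        end else if (busy) begin
            // One element per cycle: no per-element fetch/decode and no
            // increment/compare/branch instructions, so the shared
            // memory port stays free for data traffic.
            vd[idx] <= va[idx] + vb[idx];
            if (idx == ELEMS-1)
                busy <= 1'b0;
            idx <= idx + 4'd1;
        end
    end
endmodule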
I guess it's possible.
Ironically, the first designs I did were pipelined, except some early
versions of my core did not use pipelined memory access (so, Load/Store
would have been a lot more expensive), in combination with a painfully
slow bus design.
Similarly, imagine a very simple strictly in-order pipelined
implementation, where you have to resolve hazards by stalling the
pipeline every time there is RAW hazard for instance, and you have to
throw away cycles every time you mispredict a branch (which may be
quite often if you only have a very primitive predictor).
Yeah, the above is more like my core design...
With vector operations you pause the front end (fetch and decode) while
iterating over vector elements, which eliminates branch misprediction
penalties. You also magically do away with RAW hazards as by the time
you start issuing a new instruction the vector elements needed from the
previous instruction have already been written to the register file.
And of course you do away with loop overhead instructions (increment,
compare, branch).
As a bonus, I believe that a vector solution like that would be more
energy efficient, as less work has to be done for each operation than if
you have to fetch and decode an instruction for every operation that you
do.
FWIW:
The first version of SIMD operations I did was to feed each vector
element through the main FPU.
So:
Scalar, 6-cycles, with 5-cycle internal latency, 1 overhead cycle.
2-wide SIMD, 8 cycles, 1+6+1
4-wide SIMD, 10 cycles, 1+8+1
The relative LUT cost increase of supporting SIMD in this way is fairly
low (and, still worthwhile, as it is faster than non-pipelined FPU ops).
For the later low-precision unit, latency was reduced by implementing
low-precision FADD/FMUL units with a 3-cycle latency, and running 4 sets
in parallel. This is not cheap, but faster.
Another feature was to allow the low-precision unit to accept
double-precision values, though processing them at low precision.
This was the incentive for adding FADDA/FSUBA/FMULA, since these can do
low-precision ops on Binary64 with a 3-cycle latency (just with the
tradeoff of being partial precision and truncate-only).
Though, through experimentation, I noted that for single-precision stuff
one needs at least full single precision; say, S.E8.F16.Z7 is not
sufficient...
Both Quake and my experimental BGBTech3 engine started breaking in
obvious ways with a 16-bit mantissa for "float".
Internally, IIRC, I ended up using a 25 bit mantissa, as this was just
enough to get Binary32 working correctly.
Though, the low-precision unit doesn't currently support integer
conversion, where it can be noted that the mantissa needs to be wide
enough to handle the widest supported integer type (so, say, if one
wants Int32, one also needs a 32-bit mantissa).
Going from my original SIMD implementation to a vector implementation
would likely require some mechanism for the FPU to access memory. This,
however, is the "evil" part.
Also, annoyingly, adding memory ports is not as easy as simply adding
duplicate L1 caches, as then one has to deal with memory coherence
between them.
Similarly, Block-RAM arrays don't support multiple read ports, so trying
to simply access the same Block-RAM again will effectively duplicate all
of the block-RAM for the L1 cache (horrible waste).
Though, simply mirroring the L1 contents across both ports would be the
simplest approach...
One possible way to have a multi-port L1 cache might be, say:
Add an L0 cache level, which deals with the Load/Store mechanism;
Natively caches a small number of cache lines (say, 2..8);
Would likely need to be fully-associative, for "reasons";
An L0 miss would be handled by initiating a request to the L1 cache.
Each memory port gets its own L0 cache, and these arbitrate access to
the L1 cache.
A store into an L0 cache causes it to signal the others about it.
Each L0 cache will then need to flush the offending cache-line.
(A rough sketch of this is given after the drawbacks below.)
The obvious drawbacks:
More expensive;
One needs to effectively hold several cache lines of data in FFs.
Will add latency.
2x 4-way lines, would miss often;
Will add 1+ cycles for every L0 miss.
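For illustration, a rough Verilog sketch of just the tag/invalidate side
of one such L0 (hypothetical names and sizes; the data array, the miss
request to the L1, and the arbitration logic are all omitted):

module l0_tags #(
    parameter TAGW = 28    // address bits above the cache-line offset
)(
    input  wire            clk,
    input  wire            reset,
    // lookup from this port's Load/Store
    input  wire [TAGW-1:0] lookup_tag,
    output wire            hit,
    // line fill after servicing a miss from the L1
    input  wire            fill_en,
    input  wire [1:0]      fill_way,
    input  wire [TAGW-1:0] fill_tag,
    // snoop: the other port signals a store to this tag
    input  wire            snoop_en,
    input  wire [TAGW-1:0] snoop_tag
);
    // 4 fully-associative lines, tags and valid bits held in FFs
    reg [TAGW-1:0] tag   [0:3];
    reg            valid [0:3];

    reg     hit_r;
    integer i;
    always @(*) begin
        hit_r = 1'b0;
        for (i = 0; i < 4; i = i + 1)
            if (valid[i] && (tag[i] == lookup_tag))
                hit_r = 1'b1;
    end
    assign hit = hit_r;

    integer j;
    always @(posedge clk) begin
        if (reset) begin
            for (j = 0; j < 4; j = j + 1)
                valid[j] <= 1'b0;
        end else begin
            if (fill_en) begin
                tag[fill_way]   <= fill_tag;
                valid[fill_way] <= 1'b1;
            end
            // the other port stored to this line: flush (invalidate) it
            if (snoop_en)
                for (j = 0; j < 4; j = j + 1)
                    if (valid[j] && (tag[j] == snoop_tag))
                        valid[j] <= 1'b0;
        end
    end
endmodule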
One other considered variant being:
The Lane1 port accessed the L1 directly (Does Load/Store);
The Lane2 port has a small asymmetric cache, and only does Load.
Store via Lane 1 port invalidates lines in the Lane 2 L0.
In this case, the Lane2 cache could be bigger and use LUTRAM.
Say, 2x 32-lines, 1K.
However, the harder part is how to best arbitrate misses to the Lane 2 port.
IIRC, I had experimentally implemented this latter strategy, with an
approach like:
A Lane 2 miss signals a cache-miss as before.
During a Lane 2 miss, the Lane 1 array-fetch is redirected to Lane 2's
index;
If Lane 2's miss is serviced from Lane 1, it is copied over to Lane 2;
Else, an L1 miss happens as before, and the line is added to both the
main L1 cache and the Lane 2 sub-cache.
I think one other idea I had considered was to implement a 2-way
set-associative L1 (rather than direct-mapped), but then allow dual-lane
memory access to use each way of the set independently (in the case of a
single memory access, it functions as a 2-way cache).
If an L1 miss occurs, both "ways" will be brought to look at the same
index (corresponding to the lane that had missed), and then the L1 miss
handling would do its thing (either resolving within the L1 cache, or
sending a request out on the bus, loading the cache-line into the way
associated with the lane that had missed).
This approach could be preferable, but would require writing a new L1 cache.
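As a sketch of just the index-steering part of that idea (hypothetical
module and signal names; the tag compare, hit muxing, and miss handling
are left out):

module l1_way_index #(
    parameter IDXW = 8     // log2 of the number of sets
)(
    input  wire            dual_mode,    // two independent loads this cycle
    input  wire            miss_pending, // L1 miss: point both ways at one index
    input  wire [IDXW-1:0] lane1_idx,
    input  wire [IDXW-1:0] lane2_idx,
    input  wire [IDXW-1:0] miss_idx,     // index of the lane that missed
    output wire [IDXW-1:0] way0_idx,
    output wire [IDXW-1:0] way1_idx
);
    // Way 0 normally follows Lane 1, Way 1 follows Lane 2 (in dual mode).
    // For a single access, or while a miss is being handled, both ways see
    // the same index, so the array behaves as an ordinary 2-way cache.
    assign way0_idx = miss_pending ? miss_idx : lane1_idx;
    assign way1_idx = miss_pending ? miss_idx :
                      dual_mode    ? lane2_idx : lane1_idx;
endmodule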
Though, thus far, multi-port memory access hasn't really seemed "worth it".
And, this still wouldn't deal with vectors, merely the possibility of
allowing things like, say:
MOV.Q (R4, 64), R8 | MOV.Q (R5, 192), R10
Generally, disallowing Load|Store or Store|Store, as this would open up
another level of pain, vs Load|Load.
As for reducing looping overhead:
Luckily, this part can be handled purely in software (such as via manual
or automatic loop unrolling). Currently BGBCC requires manual unrolling,
as automatic unrolling takes too much cleverness, but, ...
As I said, VVM has many similar properties, but I am currently exploring
if a VRF solution can be made sufficiently cheap to be feasible in this
very low end space, where I believe that VVM may be a bit too much (this
assumption is mostly based on my own ignorance, so take it with a grain
of salt).
Admittedly, I don't actually understand how VVM would work in hardware.
And, if in the trivial case it decays to behaving like scalar code, this
is something, but kinda meh.
For reference, this is the sort of microarchitectural complexity that
I'm thinking about:
https://github.com/BrunoLevy/learn-fpga/blob/master/FemtoRV/RTL/PROCESSOR/femtorv32_quark.v
Interesting, in a way, that such small cores are possible.
Admittedly, my core is quite significantly more complicated.
And, pretty much all of my designs had significantly more Verilog code.
The smallest cores I had done ended up being around 5k LUT, mostly using
a minimalist / stripped-down version of SH-2 (with an interrupt
mechanism more like the SH-4, *1).
Though, some of the tweaks used in this design were later used as a
basis of "BtSR1", which was in turn used as the design basis for the
original form of BJX2. Though, BtSR1 did change around the 16-bit
instruction layout from SuperH, which was likely a mistake in
retrospect, but doesn't matter as much at this point.
Though, I am left to realize a few things could be better:
If the Imm16 encodings were consistent with the rest of the ISA;
If all of the encoding blocks had the registers in the same places.
For example, the Imm16 block encodes the Rn register field differently
from all of the other blocks. And, the 1R block has the Rn field where
Ro is located in the 3R blocks (it might have been better to instead
treat the 1R block as a subset of the 2R block).
*1: The interrupt mechanism in the SH-2 was more like that in the 8086,
essentially pushing PC and SR onto the stack for an interrupt, and
loading the entry-point address from a vector table. Comparably, the
approach used in the SH-4 was simpler/cheaper as it did not require the
exception-handler mechanism to be able to access memory...
Conversely, SH-2A went in a very different direction regarding exception
handling (IIRC, effectively having multiple sets of the register file,
with a mechanism to spill/reload all the registers from memory as
needed, in hardware, via the associated TBR address; assigning to TBR
effectively performing a context switch on interrupt return). Well,
along with a separate register space for interrupt handling (kinda like
RISC-V), and still pushing stuff onto the stack/popping from the stack
(like 8086). In all, seemingly a fairly complicated/expensive
mechanism.
Though, I eventually did end up with a mechanism superficially similar (at
the level of the C code), but differing mostly in that the whole
mechanism is handled in software (via the ISR prolog/epilog; and a lot
of Load/Store instructions).
It differs slightly in that, in my ABI, TBR has a pointer (at a fixed
location) to the register save area; IIRC, in SH-2A, TBR was itself a
pointer to the register save area (and any other task/process context
stuff would necessarily need to be located after the register save area).
As can be noted:
SH style 16-bit layouts:
zzzz-nnnn-mmmm-zzzz 2R
zzzz-nnnn-zzzz-zzzz 1R
zzzz-nnnn-iiii-iiii 2RI (Imm8)
zzzz-zzzz-iiii-iiii Imm8
zzzz-iiii-iiii-iiii Imm12 (BRA/BSR)
BtSR1/BJX2 16-bit layouts:
zzzz-zzzz-nnnn-mmmm 2R
zzzz-zzzz-nnnn-zzzz 1R
zzzz-zzzz-nnnn-iiii 2RI (Imm4)
zzzz-nnnn-iiii-iiii 2RI (Imm8)
zzzz-zzzz-iiii-iiii Imm8
zzzz-iiii-iiii-iiii Imm12 (LDIZ/LDIN)
Both SH-2A and BJX1 had effectively thrown 32-bit encodings ad-hoc into
the 16-bit opcode space.
For BJX2, they had been consolidated:
If the high 3 bits of the 16-bit word were 111, it was the start of a
32-bit encoding.
As noted, 32-bit encodings:
NMOP-zwzz-nnnn-mmmm zzzz-qnmo-oooo-zzzz //3R
NMZP-zwzz-nnnn-mmmm zzzz-qnmz-zzzz-zzzz //2R
ZZNP-zwzz-zzzz-zzzz zzzz-qzzn-nnnn-zzzz //1R (Inconsistent, Ro->Rn)
NMIP-zwzz-nnnn-mmmm zzzz-qnmi-iiii-iiii //3RI (Imm9/Disp9)
NMIP-zwzz-nnnn-zzzz zzzz-qnii-iiii-iiii //2RI (Imm10)
NZZP-zwzz-zzzn-nnnn iiii-iiii-iiii-iiii //2RI (Imm16)
IIIP-zwzz-iiii-iiii zzzz-iiii-iiii-iiii //Disp20 (BRA/BSR)
ZZZP-zwzz-iiii-iiii iiii-iiii-iiii-iiii //Imm24 (LDIZ/LDIN; Jumbo)
In Baseline, the high 3 bits are always 1 for 32-bit ops; in XG2, these
are repurposed as extension bits, but are encoded in an inverted form
(such that the Baseline encodings are still valid in XG2).
Where, here, the uppercase bits are those that are inverted.
n/m/o: register
i: Disp/Immed
z: Opcode
Admittedly, the space is less cleanly organized compared with RISC-V,
so I suspect something like superscalar would be harder. It might have
been preferable if Rn were always in the same spot (since Rn is needed
to do superscalar checks, and it is more of an issue if one first needs
to figure out the encoding block in order to figure out an Rn alias).
Well, or check every possible register field, possibly ignoring the high
bits for simplicity:
xxxx-xxxx-nnnn-mmmm xxxx-xxxx-oooo-xxxx //Virtual layout
Where: x=don't care.
Where, one can be conservative and reject cases where any possible alias
could happen (even if it is a false alias).
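A minimal sketch of that conservative check, taking the virtual layout
above with the first 16-bit word as bits [31:16] (an assumption on my
part, and purely illustrative rather than the actual BJX2 logic):

module bundle_alias_check (
    input  wire [31:0] ins_a,    // first (earlier) instruction of the pair
    input  wire [31:0] ins_b,    // second (later) instruction of the pair
    output wire        reject    // 1: don't bundle, fall back to scalar
);
    // Per the virtual layout, nnnn sits at [23:20], mmmm at [19:16],
    // and oooo at [7:4].
    wire [3:0] a_rn = ins_a[23:20];    // possible destination of ins_a

    wire [3:0] b_rn = ins_b[23:20];    // every field of ins_b that might
    wire [3:0] b_rm = ins_b[19:16];    // name a register, whether or not
    wire [3:0] b_ro = ins_b[ 7: 4];    // it actually does in this encoding

    // Conservative: any match (even a false alias) rejects the bundle.
    assign reject = (a_rn == b_rn) | (a_rn == b_rm) | (a_rn == b_ro);
endmodule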
But, this would likely reject a lot more cases than if the logic
detects which encoding block is in use (and uses this to enable/disable
the various alias checks).
Say:
xxxx-0x00-nnnn-mmmm xxxx-xxxx-oooo-xxxx // (F0/F3/F9)
xxxx-0x00-xxxx-xxxx 0011-xxxx-nnnn-0000 //Virtual layout (F0-1R)
xxxx-0xxx-nnnn-mmmm xxxx-xxxx-xxxx-xxxx // (F1/F2)
xxxx-1x00-xxxx-nnnn xxxx-xxxx-xxxx-xxxx //Virtual layout (F8)
So, matching these cases and turning them into enable-bits for the alias
checks (and, maybe, at least trying to check the full 6-bit register
fields; since once the block is determined, the ambiguity that would
otherwise require dropping to 4 bits is eliminated).
With RISC-V, since the destination register doesn't move, and always has
the same bits, no logic is needed to figure out where it could be.
Likely, more logic would be needed to determine which instructions are
valid prefixes and suffixes.
But, would it be worthwhile:
Less certain, as admittedly my compiler mostly already deals with this;
But, it could deal with cases either where the compiler misses cases
which are allowed on the core, or where there is a possible mismatch
between what the core supports and the bundling rules assumed by the
compiler (though, likely only relevant for higher-end cores; for low-end
cores the strategy of falling back to scalar execution is likely more
sensible).
As-is, doing superscalar in the CPU is unlikely to match the bundling
done by the compiler (since the compiler can do more accurate checks).
Seems to work OK for RISC-V though:
There is basically no other viable option in this case;
The encoding space seems to have been organized reasonably well for
this, making this part easier to pull off.