Discussion:
Misc: Preliminary (actual) performance: BJX2 vs RV64
BGB
2024-01-21 09:08:48 UTC
I have now gotten around to fully implementing the ability to boot BJX2
into RISC-V mode.

Though, this part wasn't the hard part; rather, it was porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode is not yet possible, but
enough got ported to at least be able to run programs "bare metal" in
RV64 Mode).

Both are using more or less the same C library (TestKern + modified
PDPCLIB).

For the BJX2 side, things are compiled with BGBCC.
For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).

This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.


So, first program tested was Doom, with preliminary results (average
framerate):
RV -O3 18.1
RV -Os 15.5
XG2 21.6
This is from running the first 3 demos and stopping at the same spot.

Both give "similar" MIPs values, but the mix differs:
BJX2: Dominated by memory Load/Store followed by branches;
RISC-V: Dominated by ALU operations (particularly ADD and Shift).
Load/Store, and Branches, are a little down the list.

RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.


Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
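
For illustration (a hand-written sketch, not actual GCC output), consider
an indexed array access:

  /* Illustrative only: on BJX2 the load below can be a single
     register-indexed load, whereas RV64 has to materialize the address
     with separate instructions, roughly:
         SLLI  X12, X11, 3     # scale the index by the element size
         ADD   X12, X10, X12   # form the effective address
         LD    X10, 0(X12)     # then do the load
     This is where a lot of those extra ADD/SLLI instructions come from. */
  long get_elem(long *arr, long i)
  {
      return arr[i];
  }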

It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).


I had expected RV64 to win for Dhrystone, as some earlier tests (albeit,
not running in my emulator) had implied that "GCC magic" would kick in
and make Dhrystone fast.

Actual testing did not agree.

Initial tests:
XG2 : 61538 (0.70 DMIPS/MHz)
RV64: 40816 (0.46 DMIPS/MHz).

The score for BJX2 has actually dropped a fair bit for some reason;
in the past, I had gotten it up to around 79k.
I suspect this may be a case of the optimizations that work for Doom not
necessarily being best for Dhrystone (well, also, various instruction
latency values had been increased).


However, this was "suspiciously bad" on RV64's part. It seemed that
performance was getting wrecked pretty bad by falling back to naive
character-by-character implementations of "strcpy()" and "strcmp()".

Switched these out for some less generic logic that works 8 bytes at a time:
RV64: 50632 (0.57 DMIPs/MHz)

This is at least more in-line with the Doom results.

General speedup was based on noting that one can do:
li=*(uint64_t *)cs;
lj=(li|(li+0x7F7F7F7F7F7F7F7FULL))&0x8080808080808080ULL;
while(lj==0x8080808080808080ULL)
...
as basically a way of detecting the presence/absence of a NUL byte, which
is faster than reading each character and checking it against NUL.
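
As a somewhat fuller sketch of the idea (not the exact library code; it
assumes both pointers are 8-byte aligned, a little-endian machine, and
text bytes below 0x80, since a byte with the high bit set can carry into
the next byte and hide a NUL; a carry-safe variant of the check exists):

  #include <stdint.h>

  /* Word-at-a-time strcpy sketch using the check above. */
  char *strcpy_words(char *dst, const char *src)
  {
      uint64_t *d = (uint64_t *)dst;
      const uint64_t *s = (const uint64_t *)src;
      uint64_t v = *s;

      /* Copy whole 8-byte words while no byte in the word is NUL. */
      while (((v | (v + 0x7F7F7F7F7F7F7F7FULL)) &
              0x8080808080808080ULL) == 0x8080808080808080ULL)
      {
          *d++ = v;
          v = *++s;
      }

      /* Copy the final word a byte at a time, including the NUL. */
      {
          char *dc = (char *)d;
          const char *sc = (const char *)s;
          while ((*dc++ = *sc++) != 0)
              ;
      }
      return dst;
  }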


Comparing other stats (Dhrystone):
XG2:
Bundle size: 1.10
MIPs : 25.1
Interlock : 12.81%
Cache Miss : 14.4% (Cycles)
L1 Hit Rate: 96.6%
Average Trace Length: 4.9 ops.
Mem Access : 23.1% (Total Cycles)
Branch Miss: 0.1% (Total Cycles)

RV64:
Bundle Size: 1.00
MIPs : 21.7
Interlock : 12.08%
Cache Miss : 3.5% (Cycles)
L1 Hit Rate: 99.0%
Average Trace Length: 4.8 ops.
Mem Access : 6.8% (Total Cycles)
Branch Miss: 4.8% (Total Cycles)


Here, RV64 seems to be spending less of its cycles accessing memory, and
more time running ALU ops and branching. BJX2 seems to be spending more
cycles on memory access instructions.

In this case, RV64 also seems to lose a big chunk of cycles doing a slower
64-bit multiply rather than a 32-bit widening multiply (which doesn't exist
in RV64); this is likely where much of the time is going (but the stats
don't currently track a "time spent in high-latency ops" category).
It seems to also be spending more cycles running DIV ops (with GCC seemingly
using multiply-by-reciprocal sparingly).

Have noted also that it tends to turn constants into memory loads rather
than encode them inline.


Granted, BJX2 does seem to still have a lot more stack spill-and-fill
than RV64 despite having twice as many GPRs. This is more an issue with
BGBCC though.

...



In any case, in some ways, closer than I would have expected.

RISC-V is still winning for a smaller ".text" section, albeit, not as
much for performance.

...

Any thoughts?...
BGB
2024-01-21 19:20:16 UTC
Post by BGB
I have now gotten around to fully implementing the ability to boot BJX2
into RISC-V mode.
Technically, the "boot" part was just fixing a few bugs in the Boot
ROM's ELF loader...

Also in the emulator, fixing a few other bugs:
JAL X0, Disp
Was, due to an edge case elsewhere in the emulator, still behaving as a
PC+4 branch, which was incorrect.

Also:
The SLT/SLTU/SLTI/SLTIU instructions were missing;
The MULW instruction was doing the wrong kind of multiply;
...

May see what happens if I attempt a bare metal boot in the Verilog
version. Should hopefully work, unless there are more bugs there as well
(very possible, hasn't been tested much for non-trivial code sequences).


For the bare metal boot, also needed to use a copy/paste edit of the
RV64 linker script, as by default it tried to start binaries at
0x00010000, which isn't valid RAM in my memory map.

Needed to modify the script to load at:
0x01100000
Where, RAM starts at:
0x01000000
But, generally the first 1MB is used for the boot-time stack, so:
0x01000000..0x010FFFF0: Boot Stack
0x01100000..0x011xxxxx: Kernel ".text" and ".data"/".bss"
...
The RAM following ".bss" to the end of the RAM space is generally first
grabbed by the page allocator, with kernel malloc implemented on top of
this.

If virtual memory is enabled (N/A for RV64 Mode ATM), another 32MB or
64MB chunk is allocated (32MB for 128MB of RAM, 64MB for 256MB of RAM),
and then used for the pagefile backed virtual memory.

Much of the rest is left for physical or direct-mapped ranges.
Physical:
Basically just allocating raw memory pages in physical memory;
Generally only accessible in supervisor and superuser mode.
Direct Mapped:
Part of the virtual address space, but not backed by the pagefile;
Basically, like virtual memory that will not be paged out (*).

*: For "reasons", have generally ended up needing to use this for
executable sections and program stacks. Generally, all the data/bss and
heap stuff can be put into normal virtual memory without issue.
Post by BGB
Though, this part wasn't the hard-part, rather, more, porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode will not yet be possible, but
got enough ported at least to be able to run programs "bare metal" in
RV64 Mode).
The stuff for the interrupt handlers is currently missing, so this means
no task switching, TLB miss handling, or SYSCALL interrupts.

However... If these did work, there would still be a problem:
RV64 mode won't be able to host BJX2 programs;
RV64 Mode doesn't have all of the registers that BJX2 programs use.
And, absent the ability to load programs anywhere in the address space,
it still can't load RV64 images either.


Granted, I may look into switching from "riscv64-unknown-elf" (with
RV64IMA) to "riscv64-unknown-linux-gnu" with RV64G, where theoretically
telling GCC that it is building for Linux and glibc will re-enable its
use of shared-objects and PIC/PIE binaries.

Otherwise, with "unknown-elf", can tell GCC to make binaries that still
contain ELF relocs, but... Need to go up the learning curve about ELF
relocs and how to get the image relocated on load to an arbitrary
location in the address space.


Might have preferred actually if GCC had supported ELF FDPIC on RISC-V.
Or, say, if it had supported base-relocatable PE/COFF for this target.
Both apparently existed as options for SuperH, but seemingly GCC only
supports limited combinations of object/binary format and target
architecture.


The main issue is that, without the ability to dynamically rebase the
binaries, one will need to put each binary instance in its own virtual
address space, which is undesirable.


But, yeah, all this is more a software-side issue, rather than a CPU/ISA
issue...


It was mostly this issue that had put a roadblock on things in the past,
as merely booting into RV64 mode didn't seem terribly useful. But, it does
at least allow verifying that this stuff works, and getting performance
measurements.

Likely running native RV64 would be better served though by a CPU
actually designed for running RISC-V.
Post by BGB
Both are using more or less the same C library (TestKern + modified
PDPCLIB).
For the BJX2 side, things are compiled with BGBCC.
  For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.
So, first program tested was Doom, with preliminary results (average
  RV -O3  18.1
  RV -Os  15.5
  XG2     21.6
This is from running the first 3 demos and stopping at the same spot.
  BJX2: Dominated by memory Load/Store followed by branches;
  RISC-V: Dominated by ALU operations (particularly ADD and Shift).
    Load/Store, and Branches, are a little down the list.
RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.
Part of the weaker performance is, I suspect, because the design doesn't
use any "clever trickery" to try to compensate for RV64's design
deficiencies.

The performance deltas here are generally larger than what could be
attributed to the WEXifier (or its absence in RV64 mode).

Could maybe add an experimental "what if RV64 superscalar were
supported?" option, but I suspect it won't make that big of a difference
in this case.



In any case, GCC does appear to do a good job at doing what it can with
what it has to work with.

But, generally, its performance appears to not beat an arguably "more
capable" ISA design paired with a comparably poor compiler (it is now
looking like the delta might be larger if I could eliminate more of the
stack spills and reduce the number of registers being saved/restored).

Though, presumably, the number of local variables and temporaries isn't
that much different between the ISAs (if starting from the same C code).
Post by BGB
Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).
I had expected RV64 to win for Dhrystone, as some earlier tests (albeit,
not running in my emulator) had implied that "GCC magic" would kick in
and make Dhystone fast.
Actual testing, did not agree.
  XG2 : 61538 (0.70 DMIPS/MHz)
  RV64: 40816 (0.46 DMIPS/MHz).
The score for BJX2 has actually dropped a fair bit for some reason.
In the past, had gotten it up to around 79k, but has dropped.
I suspect this may be a case of, what optimizations work for Doom, are
not necessarily best for Dhystone (well, also, various instruction
latency values had been increased as well).
However, this was "suspiciously bad" on RV64's part. It seemed that
performance was getting wrecked pretty bad by falling back to naive
character-by-character implementations of "strcpy()" and "strcmp()".
  RV64: 50632 (0.57 DMIPs/MHz)
This is at least more in-line with the Doom results.
  li=*(uint64_t *)cs;
  lj=(li|(li+0x7F7F7F7F7F7F7F7FULL))&0x8080808080808080ULL;
  while(lj==0x8080808080808080ULL)
    ...
As basically a way of detecting the presence/absence of a NUL byte, for
faster than reading each character and checking against NUL.
Can also note that GCC is clever enough to load the constants into
registers in advance. May need to do this optimization manually (using
variables) as BGBCC does not optimize this case, and seems to encode a
constant-load into a temporary register each time the constant is used
(so, it seems for "strcpy()", this optimization helped RV64 but slightly
hurt BJX2's score as a result; as well as burning 12 bytes of code space
for each instance of the constant).
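
For example (illustrative only; names made up), hoisting the two magic
constants into locals ahead of the loop is enough to get them loaded once
and kept in registers, even without the compiler doing it automatically:

  #include <stdint.h>

  /* Sketch: load the magic constants once, outside the loop, instead of
     spelling the 64-bit literals inline on every use (which BGBCC would
     otherwise re-materialize each time, at ~12 bytes of code apiece). */
  static const uint64_t *scan_for_nul(const uint64_t *p)
  {
      uint64_t k7f = 0x7F7F7F7F7F7F7F7FULL;  /* hoisted into a register */
      uint64_t k80 = 0x8080808080808080ULL;
      uint64_t v = *p;

      while (((v | (v + k7f)) & k80) == k80)
          v = *++p;
      return p;  /* points at the word containing the NUL byte */
  }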

Have also noted that for things like MMIO addresses, etc, GCC seems to
aggregate constants across all of the functions, rather than encode them
inline. So, as a cost, it involves loading constants from memory, but as
a benefit, many of these constants are reduced to only needing a single
32-bit instruction word (and if the constant is used multiple times, it
is kept pinned in a register, rather than reloaded each time it is used).
Post by BGB
    Bundle size: 1.10
    MIPs       : 25.1
    Interlock  : 12.81%
    Cache Miss : 14.4% (Cycles)
      L1 Hit Rate: 96.6%
    Average Trace Length: 4.9 ops.
    Mem Access : 23.1% (Total Cycles)
    Branch Miss: 0.1% (Total Cycles)
    Bundle Size: 1.00
    MIPs       : 21.7
    Interlock  : 12.08%
    Cache Miss : 3.5% (Cycles)
      L1 Hit Rate: 99.0%
    Average Trace Length: 4.8 ops.
    Mem Access : 6.8% (Total Cycles)
    Branch Miss: 4.8% (Total Cycles)
Note, bundle sizes:
1.10 (XG2): the WEXifier reports 9.85% of ops got WEX'ed.
So, this adds up; for C code this is usually about what it gets...
If one wants much more, one needs to write ASM.
But, Dhrystone is entirely C...
Also, strcpy/strcmp don't have much room for lots of ILP.
1.00 (RV64): RV64 doesn't have bundles, and is purely scalar in this case.
Post by BGB
Here, RV64 seems to be spending less of its cycles accessing memory, and
more time running ALU ops and branching. BJX2 seems to be spending more
cycles on memory access instructions.
In this case, RV64 also seems to lose a big chunk of cycles doing slower
64-bit multiply rather than 32-bit widening multiply (doesn't exist in
RV64). This is likely where a big chunk of cycles is going (but the
stats don't currently state a "time spent in high-latency ops" case).
Seems to also be spending more cycles running DIV ops (seemingly using
multiply-by-reciprocal sparingly).
It looks like it might make sense (if running RV64 mode seriously) to
consider extending the design of my "FAZDIV" mechanism to also cover
multiply, say, recognizing and special-casing instances where a 64-bit
multiply can be turned, internally, into a 32-bit widening multiply.

This was less of a priority on BJX2 mostly as code generally uses the
widening-multiply ops if a widening multiply is the desired output.
Post by BGB
Have noted also that it tends to turn constants into memory loads rather
than encode them inline.
Granted, BJX2 does seem to still have a lot more stack spill-and-fill
than RV64 despite having twice as many GPRs. This is more an issue with
BGBCC though.
...
In any case, in some ways, closer than I would have expected.
RISC-V is still winning for a smaller ".text" section, albeit, not as
much for performance.
...
Any thoughts?...
MitchAlsup1
2024-01-21 21:22:50 UTC
Post by BGB
I have now gotten around to fully implementing the ability to boot BJX2
into RISC-V mode.
Though, this part wasn't the hard-part, rather, more, porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode will not yet be possible, but
got enough ported at least to be able to run programs "bare metal" in
RV64 Mode).
Both are using more or less the same C library (TestKern + modified
PDPCLIB).
For the BJX2 side, things are compiled with BGBCC.
For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.
So, first program tested was Doom, with preliminary results (average
RV -O3 18.1
RV -Os 15.5
XG2 21.6
This is from running the first 3 demos and stopping at the same spot.
BJX2: Dominated by memory Load/Store followed by branches;
RISC-V: Dominated by ALU operations (particularly ADD and Shift).
Load/Store, and Branches, are a little down the list.
RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.
Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).
You might be the first person with a RISC-V that has 2 cycle ADDs.
BGB
2024-01-21 22:13:57 UTC
Post by MitchAlsup1
Post by BGB
I have now gotten around to fully implementing the ability to boot
BJX2 into RISC-V mode.
Though, this part wasn't the hard-part, rather, more, porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode will not yet be possible, but
got enough ported at least to be able to run programs "bare metal" in
RV64 Mode).
Both are using more or less the same C library (TestKern + modified
PDPCLIB).
For the BJX2 side, things are compiled with BGBCC.
   For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.
So, first program tested was Doom, with preliminary results (average
   RV -O3  18.1
   RV -Os  15.5
   XG2     21.6
This is from running the first 3 demos and stopping at the same spot.
   BJX2: Dominated by memory Load/Store followed by branches;
   RISC-V: Dominated by ALU operations (particularly ADD and Shift).
     Load/Store, and Branches, are a little down the list.
RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.
Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).
You might be the first person with a RISC-V that has 2 cycle ADDs.
Yeah, and probably not an ideal situation for RISC-V, as seemingly it is
one of the most common instructions:
MV Xd, Xs
LI Xd, Imm12s
->
ADDI Xd, Xs, 0
ADDI Xd, X0, Imm12s
...


Shift sees a lot of use as well, as it is also used both for indexed
addressing and for performing sign and zero extension.

Say:
j=(short)i;
Being, say:
SLLI X11, X10, 16
SRAI X11, X11, 16

As opposed to having dedicated instructions for a lot of these cases (as
in BJX2).

Oh well...


But, running RISC-V on the BJX2 Core necessarily means having the same
instruction timings as BJX2. Granted, one could argue that BJX2 would
also benefit from 1-cycle ADD and Shift (where the latter was recently
relaxed to 2 cycles, mostly because this increases the amount of timing
slack; and having dedicated instructions for a lot of other cases makes
the latency of these instructions less significant).


Perhaps unsurprisingly, trying to get a RISC-V build of Doom to boot in
the Verilog core is still needing a bit of debugging... (at the moment,
still crashing very early in start-up).

Granted, a lot of this stuff has thus far been "mostly untested" apart
from some fairly trivial code fragments...
MitchAlsup1
2024-01-21 23:40:16 UTC
Post by BGB
Post by MitchAlsup1
Post by BGB
I have now gotten around to fully implementing the ability to boot
BJX2 into RISC-V mode.
Though, this part wasn't the hard-part, rather, more, porting most of
TestKern to be able to build on RISC-V (some parts are still stubbed
out, so using it as a kernel in RV Mode will not yet be possible, but
got enough ported at least to be able to run programs "bare metal" in
RV64 Mode).
Both are using more or less the same C library (TestKern + modified
PDPCLIB).
For the BJX2 side, things are compiled with BGBCC.
   For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.
So, first program tested was Doom, with preliminary results (average
   RV -O3  18.1
   RV -Os  15.5
   XG2     21.6
This is from running the first 3 demos and stopping at the same spot.
   BJX2: Dominated by memory Load/Store followed by branches;
   RISC-V: Dominated by ALU operations (particularly ADD and Shift).
     Load/Store, and Branches, are a little down the list.
RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.
Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).
You might be the first person with a RISC-V that has 2 cycle ADDs.
Yeah, and probably not an ideal situation for RISC-V, as seemingly it is
MV Xd, Xs
LI Xd, Imm12s
->
ADDI Xd, Xs, 0
ADDI Xd, X0, Imm12s
....
For move they could use OR Rd,Rs,#0 or do you have 2 cycle logicals ??
Post by BGB
Shift sees a lot of use as well, as it is also used for both indexed
addressing, and for performing sign an zero extension.
j=(short)i;
SLLI X11, X10, 16
SRAI X11, X11, 16
Which I do in 1 instruction
SLL R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these cases (as
in BJX2).
See; mine are not dedicated, they just as easily perform

struct { long i : 17,
j : 9,
k : 3,
... } st;
short s = st.k;

SLL Rs,Rst,<3,26>
Post by BGB
Oh well...
I have the director of Northern Telecom circa 1984 for this. I BLEW the
88K implementation by putting the two 5-bit fields back to back and used
the 16-bit immediate encoding, wasting bits and tying my hands into
the future at the same time. My 66000 has essentially the same instrs;
but the immediate form is XOM7 and uses a 12-bit immediate field. When
this pattern is decoded, the two 5-bit fields are routed onto the Rs2
operand bus at position<37..32> and position<5..0>. No 32-bit or smaller
data value (replacing the immediate) can access the extract functionality
and the 64-bitters that can are limited to putting SANE bit patterns
there when they do. This lower field is limited to 0..63, the upper one
to 0..64, and all intermediate bits are checked for zeros.
BGB
2024-01-22 03:38:00 UTC
Post by BGB
Post by MitchAlsup1
Post by BGB
I have now gotten around to fully implementing the ability to boot
BJX2 into RISC-V mode.
Though, this part wasn't the hard-part, rather, more, porting most
of TestKern to be able to build on RISC-V (some parts are still
stubbed out, so using it as a kernel in RV Mode will not yet be
possible, but got enough ported at least to be able to run programs
"bare metal" in RV64 Mode).
Both are using more or less the same C library (TestKern + modified
PDPCLIB).
For the BJX2 side, things are compiled with BGBCC.
   For the RISC-V side, GCC 12.2.0 (riscv64-unknown-elf, RV64IMA).
This allows more accurate comparison than, say, on paper analysis or
comparing results between different emulators.
So, first program tested was Doom, with preliminary results (average
   RV -O3  18.1
   RV -Os  15.5
   XG2     21.6
This is from running the first 3 demos and stopping at the same spot.
   BJX2: Dominated by memory Load/Store followed by branches;
   RISC-V: Dominated by ALU operations (particularly ADD and Shift).
     Load/Store, and Branches, are a little down the list.
RV64 has a lot fewer SP-relative loads/stores compared with BJX2,
despite having fewer GPRs.
Meanwhile, ADD and SLLI seem to be the top two instructions used in
RISC-V (I will still continue to blame the lack of register-indexed
load/store on this one...).
It does seem to suffer more from spending a higher percentage of its
time with interlocks, particularly with ALU operations (doesn't seem
like a great situation to have 2-cycle latency on ADD and Shift
instructions...).
You might be the first person with a RISC-V that has 2 cycle ADDs.
Yeah, and probably not an ideal situation for RISC-V, as seemingly it
   MV Xd, Xs
   LI Xd, Imm12s
->
   ADDI Xd, Xs, 0
   ADDI Xd, X0, Imm12s
....
For move they could use OR  Rd,Rs,#0 or do you have 2 cycle logicals ??
All of the ALU ops are 2-cycle at present.

At present, the only 1-cycle ops in the BJX2 core are:
MOV Rm, Rn
LDIx Imm, Rn
EXTS.L / EXTU.L (Sign and Zero extend a 32-bit value)

Where, ironically, the RISC-V decoder doesn't use any of these.
Could potentially special-case ADDI/ORI with 0 as MOV in the decoder.

Most other ops are 2-cycle.
Things like Load, MUL, etc, are 3-cycle.


Granted, the core is pipelined, so they will behave like 1-cycle ops if
one doesn't try to use the results immediately.

Seems like GCC assumes that a lot of these ops are 1-cycle though.


There are define's that can switch various ops back to being 1-cycle,
but doing so comes at the cost of FPGA timing.
Post by BGB
Shift sees a lot of use as well, as it is also used for both indexed
addressing, and for performing sign an zero extension.
   j=(short)i;
   SLLI X11, X10, 16
   SRAI X11, X11, 16
Which I do in 1 instruction
    SLL  R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these cases
(as in BJX2).
See; mine are not dedicated, they just as easily perform
    struct { long i : 17,
                  j : 9,
                  k : 3,
                 ...      } st;
    short s = st.k;
    SLL     Rs,Rst,<3,26>
Possibly, if one has a big enough immediate field to encode it.
Could have made sense as a use for the 12-bit Immed fields in RISC-V,
but it can be noted that they did not do so (and chose instead to use
pairs of shifts).



In my case, I used the extra bits from the 9-bit immediate fields to
encode a few extra cases, mostly overloading shuffles with shifts and
similar.

where, the shifts are basically encoded as, say:
000..0FF: Shift
Understood as a signed value between -63 and 63.
The values +/- 64..127 are unused at present.
Or, 32..127 for 32-bit shifts.
100..1FF: Packed Shuffle

For the SHADX/SHLDX instructions, the full range is used (+/- 127).


I think, early on, I had considered using a sort of unary-coding scheme
to encode shift ranges, say:
0xxxxxxxx +/- 127 (128-bit)
10xxxxxxx +/- 63 (64-bit)
110xxxxxx +/- 31 (32-bit)

This could have allowed consolidating all 3 sizes into the same opcodes,
but IIRC decided against this as it would have made decoding more expensive.
Post by BGB
Oh well...
I have the director of Norther Telecom circa 1984 for this. I BLEW the
88K implementation by putting the two 5-bit fields back to back and used
the 16-bit immediate encoding, wasting bits and tying my hands into
the future at the same time. My 66000 has essentially the same instrs;
but the immediate form is XOM7 and uses a 12-bit immediate field. When
this pattern is decoded, the two 5-bit fields are routed onto the Rs2
operand bus at position<37..32> and position<5..0>. No 32-bit or smaller
data value (replacing the immediate) can access the extract functionality
and the 64-bitters than can are limited to putting SANE bit patterns
there when they do. This Lower field is limited from 0..63 the upper one
from 0..64, and all intermediate bits are checked for zeros.
OK.
Thomas Koenig
2024-01-22 18:46:41 UTC
Post by BGB
All of the ALU ops are 2-cycle at present.
You're imitating POWER, are you? :-)
BGB
2024-01-22 19:47:26 UTC
Post by Thomas Koenig
Post by BGB
All of the ALU ops are 2-cycle at present.
You're imitating POWER, are you? :-)
This makes it a lot easier to pass timing in the FPGA, and for the most
part the performance difference is "relatively minor" in the BJX2 ISA
(it was mostly the "MOV Reg,Reg" and "MOV Imm,Reg" instructions which
had a more obvious effect on performance).


However, 2-cycle ADD and Shift doesn't really help RISC-V's case, as the
ISA both uses these instructions a lot more heavily, and far more often
manages to step on the interlock penalties from the 2c latency (by using
the results directly, rather than interleaving them with other instructions).

Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these. Though, if "ORI" is 2-cycle as
well, doesn't really help much.


But, yeah, a combination of factors seems to lead to the RISC-V code
running at roughly 19 to 21 MIPs (at 50MHz) with the instruction timings
used in the BJX2 core (while, unlike BJX2 code, spending much less of
its time waiting for memory access).

But, yeah, it looks like if one were implementing a dedicated RISC-V
CPU, having 1-cycle latency on ALU ops and similar would be a priority...


Granted, yes, 1-cycle ALU ops would also help with performance for BJX2
code, but the gains would be smaller.

I suspect increasing some of the instruction latency values is why
Dhrystone had dropped from 79k to 61k for BJX2 at 50MHz, but had not
seen such an obvious drop in other contexts.
Anton Ertl
2024-01-22 22:09:46 UTC
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so? According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
page 13:

|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.

And on page 76:

|C.LI expands into addi rd, x0, imm[5:0]

C.LI is a separate instruction. I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
BGB-Alt
2024-01-22 23:09:18 UTC
Post by Anton Ertl
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so? According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.
|C.LI expands into addi rd, x0, imm[5:0]
C.LI is a separate instruction. I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".
OK.

I had thought when I had looked it up, that it had said that these
mapped to ORI.

But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.

In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map these
over to the MOV logic).
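
A minimal sketch of that kind of decoder special case (emulator-style C;
the uop names and helper here are hypothetical, not the actual BJX2
decoder):

  #include <stdint.h>

  enum { UOP_MOV, UOP_ADDI };   /* hypothetical internal op numbers */

  /* Decode step for an already-recognized OP-IMM/ADDI instruction: when
     the I-type immediate is 0, route it to the 1-cycle MOV path instead
     of the 2-cycle ALU path. */
  static int decode_addi(uint32_t insn, int *rd, int *rs1, int32_t *imm)
  {
      *rd  = (insn >> 7)  & 0x1F;
      *rs1 = (insn >> 15) & 0x1F;
      *imm = (int32_t)insn >> 20;   /* sign-extended 12-bit immediate */

      if (*imm == 0)
          return UOP_MOV;           /* "MV rd, rs1" == "ADDI rd, rs1, 0" */
      return UOP_ADDI;
  }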



I guess the more immediate priority is getting Doom to boot in the Verilog
implementation. As-is, it prints some stuff and then crashes. Need to
look at it some more.

For example, among other things, I ended up needing to tweak the
behavior of BLTU/BGEU, as (due to a minor logic issue) they seemed to be
doing LE and GT instead.


Did see mention of the possibility of using JALR rather than AUIPC to
get PC-relative addresses (though doing so was discouraged).

This was one concern as my implementation produces non-standard output
for JAL/JALR (it uses pointer tagging to encode that the return address
is in RISC-V Mode), and trying to use these values in address
calculations (at least for non-function-pointers) may lead to
incorrect results.

Though, GCC seems to always use AUIPC for forming PC-relative addresses,
which is good in this case. Otherwise, would have needed to further
tweak some things to better hide the "weirdness" associated with how
RISC-V mode operates on the BJX2 core (eg, not using pointer tagging for
JAL/JALR; but then needing extra care for any possible inter-ISA thunking).
Post by Anton Ertl
- anton
MitchAlsup1
2024-01-23 00:49:11 UTC
Post by BGB-Alt
Post by Anton Ertl
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so? According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.
|C.LI expands into addi rd, x0, imm[5:0]
C.LI is a separate instruction. I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".
OK.
I had thought when I had looked it up, that it had said that these
mapped to ORI.
But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.
In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map these
over to the MOV logic).
I made My 66000 have a MOV OpCode for a particular reason::
{MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
network--if your FUs are designed to put up with this as inputs.

It would have not been "that hard" to just special case decode,
but who is going to do this when the opcode set includes FP and
SIMD that needs these ??
BGB-Alt
2024-01-31 23:19:44 UTC
Post by MitchAlsup1
Post by BGB-Alt
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so?  According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.
|C.LI expands into addi rd, x0, imm[5:0]
C.LI is a separate instruction.  I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".
OK.
I had thought when I had looked it up, that it had said that these
mapped to ORI.
But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.
In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map these
over to the MOV logic).
{MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
network--if your FUs are designed to put up with this as inputs.
It would have not been "that hard" to just special case decode,
but who is going to do this when the opcode set includes FP and
SIMD that needs these ??
In this case, it is merely 1 cycle, vs 2 cycle for the ALU ops, but, yeah...

But, yeah, looks like RV64 performance issues are partly:
Needs to fake indexed load/store;
Lots of interlock penalties;
GCC seems to assume aligned-only access;


If one tries to do "memcpy(d, s, 8);", it handles it as eight single-byte moves,
which is bad in my case. Seems GCC is doing a function call for anything
bigger than 8 bytes.

Also it would appear as-if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.

So, at least part of the problem is that GCC is generating code that is
not ideal for my pipeline.


Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse, as then jumps up to around 23%-26%
interlock penalty (mostly eating any gains that superscalar would
bring). Where, it seems that superscalar (according to my CPU's rules)
would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
with '-Os').


On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.

Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.



In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases).
If both the LHS and RHS represent logical expressions with no side effects;
If the LHS and RHS are not "too expensive" according to a cost heuristic
(past a certain size, it is cheaper to use short-circuit branching
rather than ALU operations).

Internally, this added various pseudo operators to the compiler:
&&&, |||: Logical and expressed as bitwise.
!& : !(a&b)
!!&: !(!(a&b)), Normal TEST operator, with a logic result.
Exists to be distinct from normal bitwise AND.


This did at least help some with speed, but was initially bad for code
density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).

Did partly compensate for the code-size increase by adding some
experimental 3R CMPxx ops:
CMPQEQ, CMPQNE, CMPQGT, CMPQGE

Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
would require a 3R CMPQHI instruction, and is less likely to be used as
often).

Where:
CMPQEQ Rs, Rt, Rn
CMPQNE Rs, Rt, Rn
CMPQGT Rs, Rt, Rn
CMPQGE Rs, Rt, Rn
Does:
Rn = (Rs == Rt);
Rn = (Rs != Rt);
Rn = (Rs > Rt);
Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.


The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
have a comparably lower hit rate.

It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall-back to the
original CMPxx+MOVT/MOVNT.

At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
(though, the 3R CMPxx fallback is likely to be better when the value
falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).

...
Robert Finch
2024-02-01 02:27:19 UTC
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB-Alt
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so?  According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.
|C.LI expands into addi rd, x0, imm[5:0]
C.LI is a separate instruction.  I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".
OK.
I had thought when I had looked it up, that it had said that these
mapped to ORI.
But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.
In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map these
over to the MOV logic).
{MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
network--if your FUs are designed to put up with this as inputs.
It would have not been "that hard" to just special case decode,
but who is going to do this when the opcode set includes FP and
SIMD that needs these ??
In this case, it is merely 1 cycle, vs 2 cycle for the ALU ops, but, yeah...
  Needs to fake indexed load/store;
  Lots of interlock penalties;
  GCC seems to assume aligned-only access;
If one tries to do "memcpy(d, s, 8);", it handles it as 8 byte moves,
which is bad in my case. Seems GCC is doing a function call for anything
bigger than 8 bytes.
Also it would appear as-if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.
So, at least part of the problem is that GCC is generating code that is
not ideal for my pipeline.
Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse, as then jumps up to around 23%-26%
interlock penalty (mostly eating any gains that superscalar would
bring). Where, it seems that superscalar (according to my CPU's rules)
would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
with '-Os').
Is that with register renaming to remove dependencies?
Post by BGB-Alt
On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.
Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.
In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases).
If both the LHS and RHS represent logical expressions with no side effects;
If the LHS and RHS are not "too expensive" according to a cost heuristic
(past a certain size, it is cheaper to use short-circuit branching
rather than ALU operations).
  &&&, |||: Logical and expressed as bitwise.
  !& : !(a&b)
  !!&: !(!(a&b)), Normal TEST operator, with a logic result.
    Exists to be distinct from normal bitwise AND.
The arpl (cc64) compiler has the same ops, I think, using the same
symbols: called 'safe-and' and 'safe-or', which can be specified with &&&
and |||.
Post by BGB-Alt
This did at least help some with speed, but was initially bad for code
density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).
Did partly compensate for the code-size increase by adding some
  CMPQEQ, CMPQNE, CMPQGT, CMPQGE
Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
would require a 3R CMPQHI instruction, and is less likely to be used as
often).
  CMPQEQ Rs, Rt, Rn
  CMPQNE Rs, Rt, Rn
  CMPQGT Rs, Rt, Rn
  CMPQGE Rs, Rt, Rn
  Rn = (Rs == Rt);
  Rn = (Rs != Rt);
  Rn = (Rs >  Rt);
  Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.
These instructions are also called 'set' instructions in some
architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
zero or set (from the MMIX CPU) so they are not confused with
instructions that only set, which are called 'Sxx' instructions. I think
the Itanium calls them CMPxx instructions. I have been experimenting
with the option of having them cumulate values like the Itanium does.
Needs more opcode bits though.

Q+ has
Rt = (Ra==Rb) ? Rc : 0; // ZSEQ
Rt = (Ra==Rb) ? Imm8 : 0;
Rt = (Ra==Rb) ? Rc : Rt; // SEQ
Rt = (Ra==Rb) ? Imm8 : Rt;
Plus other ops besides ==
Post by BGB-Alt
The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
have a comparably lower hit rate.
It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall-back to the
original CMPxx+MOVT/MOVNT.
At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
(though, the 3R CMPxx fallback is likely to be better when the value
falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).
...
MitchAlsup
2024-02-01 03:10:42 UTC
Post by Robert Finch
Post by BGB-Alt
<snip>
Did partly compensate for the code-size increase by adding some
  CMPQEQ, CMPQNE, CMPQGT, CMPQGE
Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned 64-bit
would require a 3R CMPQHI instruction, and is less likely to be used as
often).
  CMPQEQ Rs, Rt, Rn
  CMPQNE Rs, Rt, Rn
  CMPQGT Rs, Rt, Rn
  CMPQGE Rs, Rt, Rn
  Rn = (Rs == Rt);
  Rn = (Rs != Rt);
  Rn = (Rs >  Rt);
  Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.
These instructions are also called 'set' instructions in some
architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
zero or set (from the MMIX CPU) so they are not confused with
instructions that only set, which are called 'Sxx' instructions. I think
the Itanium calls them CMPxx instructions. I have been experimenting
with the option of having them cumulate values like the Itanium does.
Needs more opcode bits though.
Q+ has
Rt = (Ra==Rb) ? Rc : 0; // ZSEQ
Rt = (Ra==Rb) ? Imm8 : 0;
Rt = (Ra==Rb) ? Rc : Rt; // SEQ
Rt = (Ra==Rb) ? Imm8 : Rt;
Plus other ops besides ==
My 66000 has compare instructions that generate a bit-vector of output
conditions:: one for all forms of integer, and one for FP. In the case of
FP, it generates a set bit when NaN comparisons should go to the else-clause
and a different bit when that same comparison should deliver NaNs to the
then-clause. This enables the compiler to flip the then-else-clauses when
it chooses to do so.

In addition, the integer version has range comparisons (0 <[=] Rs1, <[=] Rs2)
for array limit comparisons. Any Byte or Any Half and Either Word comparisons
can be added later should anyone choose, but it looks like for now VVM supersedes
these needs.

Thus I have 1 integer CMP and one FP CMP instruction rather than a multitude.

If you want True/False, you can extract the bit you want::

CMP Rt,Rs1,Rs3
SLL Rd,Rt,<1,EQ> // {0, +1}
SLLs Re,Rt,<1,EQ> // {0, -1}
Post by Robert Finch
Post by BGB-Alt
The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if it
will use the expansion to Imm6u or Imm6s in XG2 mode). Currently these
have a comparably lower hit rate.
It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall-back to the
original CMPxx+MOVT/MOVNT.
At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
(though, the 3R CMPxx fallback is likely to be better when the value
falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).
...
Robert Finch
2024-02-01 03:43:15 UTC
Post by MitchAlsup
Post by Robert Finch
Post by BGB-Alt
<snip>
Did partly compensate for the code-size increase by adding some
   CMPQEQ, CMPQNE, CMPQGT, CMPQGE
Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned
64-bit would require a 3R CMPQHI instruction, and is less likely to
be used as often).
   CMPQEQ Rs, Rt, Rn
   CMPQNE Rs, Rt, Rn
   CMPQGT Rs, Rt, Rn
   CMPQGE Rs, Rt, Rn
   Rn = (Rs == Rt);
   Rn = (Rs != Rt);
   Rn = (Rs >  Rt);
   Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.
These instructions are also called 'set' instructions in some
architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
zero or set (from the MMIX CPU) so they are not confused with
instructions that only set, which are called 'Sxx' instructions. I
think the Itanium calls them CMPxx instructions. I have been
experimenting with the option of having them cumulate values like the
Itanium does. Needs more opcode bits though.
Q+ has
    Rt = (Ra==Rb) ? Rc : 0;    // ZSEQ
    Rt = (Ra==Rb) ? Imm8 : 0;
    Rt = (Ra==Rb) ? Rc : Rt;   // SEQ
    Rt = (Ra==Rb) ? Imm8 : Rt;
Plus other ops besides ==
My 66000 has compare instructions that generate a bit-vector of output
conditions:: one for all forms of integer, and one for FP. In the case
of FP, it generates a set bit when NaN comparisons should go to the
else-clause
and a different bit when that same comparison should deliver NaNs to the
then-clause. This enables the compiler to flip the then-else-clauses when
it chooses to do so.
In addition, the integer version has range comparisons (0 <[=] Rs1, <[=] Rs2)
for array limit comparisons. Any Byte or Any Half and Either Word comparisons
can be added later should anyone choose, but it looks like for now VVM supersedes
these needs.
Thus I have 1 integer CMP and one FP CMP instruction rather than a multitude.
    CMP   Rt,Rs1,Rs3
    SLL   Rd,Rt,<1,EQ>  // {0, +1}
    SLLs  Re,Rt,<1,EQ>  // {0, -1}
Q+ also has the compare instructions returning bit vectors. In Q+'s case,
three: one for signed, one for unsigned, and one for FP ops. There were
separate signed and unsigned compares so the result vector would fit
into eight bits, allowing it to be used with SIMD instructions.

CMP Rt,Ra,Rb
EXTU Rd,Rt,EQ,EQ
EXT Rd,Rt,EQ,EQ

ZSEQ Rt,Ra,Rb,2
ZSEQ Rt,Ra,Rb,-2

The set instructions sometimes save an instruction over using a compare
then an extract, so they increase code density. It is a little redundant,
but ALU ops are inexpensive and the opcode space was available.
Post by MitchAlsup
Post by Robert Finch
Post by BGB-Alt
The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if
it will use the expansion to Imm6u or Imm6s in XG2 mode). Currently
these have a comparably lower hit rate.
It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall-back to the
original CMPxx+MOVT/MOVNT.
At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be
winning (though, the 3R CMPxx fallback is likely to be better when
the value falls outside the range of the "CMPxx Imm10{u/n}, Rn"
operations).
...
BGB
2024-02-01 05:54:40 UTC
Post by Robert Finch
Post by BGB-Alt
Post by MitchAlsup1
Post by BGB-Alt
Post by BGB
Also seemingly GCC seems to use "ADDI" for MV and LI, whereas the RISC-V
spec had said to use "ORI" for these.
What makes you think so?  According to
<https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf>,
|ADDI rd, rs1, 0 is used to implement the MV rd, rs1 assembler
|pseudo-instruction.
|C.LI expands into addi rd, x0, imm[5:0]
C.LI is a separate instruction.  I did not find anything about a
non-compact LI, but given how C.LI expands (why does the ISA manual
actually specify that?), I expect that LI is a pseudo-instruction that
is actually "addi rd, x0, imm".
OK.
I had thought when I had looked it up, that it had said that these
mapped to ORI.
But, if it is ADDI, then GCC is behaving according to the spec.
Either way, the end-result is the same in this case.
In theory, could hack over these in the decoder by
detecting/special-casing things when the immediate is 0 (to map
these over to the MOV logic).
{MOV, ABS, NEG, INV} can be performed in 0-cycles in the forwarding
network--if your FUs are designed to put up with this as inputs.
It would have not been "that hard" to just special case decode,
but who is going to do this when the opcode set includes FP and
SIMD that needs these ??
In this case, it is merely 1 cycle, vs 2 cycle for the ALU ops, but, yeah...
   Needs to fake indexed load/store;
   Lots of interlock penalties;
   GCC seems to assume aligned-only access;
If one tries to do "memcpy(d, s, 8);", it handles it as 8 byte moves,
which is bad in my case. Seems GCC is doing a function call for
anything bigger than 8 bytes.
Also it would appear as-if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.
So, at least part of the problem is that GCC is generating code that
is not ideal for my pipeline.
Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse, as then jumps up to around 23%-26%
interlock penalty (mostly eating any gains that superscalar would
bring). Where, it seems that superscalar (according to my CPU's rules)
would bundle around 10-15% of the RV64 ops with '-O3' (or, around
8-12% with '-Os').
Is that with register renaming to remove dependencies?
No.

Register renaming is far too advanced of a technology for my CPU core...


But, yeah, ideally one wants each newly used register to not conflict
with previously used registers (within a certain scope), so in my case
the register allocator uses heuristics to try to figure which register
to use:
It prefers to use registers that are reserved but not used in the
current basic-block, if available;
Otherwise, it uses ranking heuristics to evaluate which to evict;
If applicable (if it had to evict something), it evaluates whether to
reserve additional registers in the stack-frame.

Though, static-assigned variables will always be mapped to the same
registers.

Comparably, my compiler does seem to use a lot more registers than GCC,
but it seems, it is less prone to quickly reuse the same registers.
Post by Robert Finch
Post by BGB-Alt
On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.
Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.
Looks like a partial workaround at least is to use "-mtune" to claim
that my CPU is a "SiFive S76", which appears to have timing values
closer to my CPU core than the "Rocket Chip" (and seems to perform at
least slightly better).

Also apparently the SiFive chip was (like mine) designed around an
8-stage pipeline rather than a 5-stage pipeline.


It also looks (in some small-scale in-emulator experiments) like, if ALU
and Load latencies could be reduced to 1 and 2 cycles, there would be a
fairly significant speed-up (both for RISC-V and BJX2 code).

But, at the moment, this would be asking a bit much.
Post by Robert Finch
Post by BGB-Alt
In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases).
If both the LHS and RHS represent logical expressions with no side effects;
If the LHS and RHS are not "too expensive" according to a cost
heuristic (past a certain size, it is cheaper to use short-circuit
branching rather than ALU operations).
   &&&, |||: Logical and expressed as bitwise.
   !& : !(a&b)
   !!&: !(!(a&b)), Normal TEST operator, with a logic result.
     Exists to be distinct from normal bitwise AND.
The arpl (cc64) compiler has the same ops I think using the same
symbols. Called 'safe-and' and 'safe-or' which can be specified with &&&
and |||.
These don't exist at the language level, but are generated internally in
the "reducer" stage.


So, in this case, if && or || sees that both the LHS and RHS represent a
logical operator, it may quietly turn it into &&& or ||| (keeping the
original short-circuit operators if either side contains an expression
with side-effects or represents a non-logic result).

The new operators were added because, at this stage, the compiler needs to be
able to keep track of the difference between logical and bitwise
operators (this distinction is lost in the back-end; technically only &
and | exist, as both && and || decompose into if-goto logic in the RIL3
IR stage).
Post by Robert Finch
Post by BGB-Alt
This did at least help some with speed, but was initially bad for code
density (each compare needs 2 operations, CMPxx+MOVT/MOVNT).
Did partly compensate for the code-size increase by adding some
   CMPQEQ, CMPQNE, CMPQGT, CMPQGE
Currently only available in 64-bit forms, which can handle signed and
unsigned 32-bit values along with signed 64-bit values (unsigned
64-bit would require a 3R CMPQHI instruction, and is less likely to be
used as often).
   CMPQEQ Rs, Rt, Rn
   CMPQNE Rs, Rt, Rn
   CMPQGT Rs, Rt, Rn
   CMPQGE Rs, Rt, Rn
   Rn = (Rs == Rt);
   Rn = (Rs != Rt);
   Rn = (Rs >  Rt);
   Rn = (Rs >= Rt);
Where, < and <= can be done by flipping the arguments.
These instructions are also called 'set' instructions in some
architectures. Useful enough to include IMO. Q+ calls the 'ZSxx' for
zero or set (from the MMIX CPU) so they are not confused with
instructions that only set, which are called 'Sxx' instructions. I think
the Itanium calls them CMPxx instructions. I have been experimenting
with the option of having them cumulate values like the Itanium does.
Needs more opcode bits though.
Q+ has
   Rt = (Ra==Rb) ? Rc : 0;    // ZSEQ
   Rt = (Ra==Rb) ? Imm8 : 0;
   Rt = (Ra==Rb) ? Rc : Rt;   // SEQ
   Rt = (Ra==Rb) ? Imm8 : Rt;
Plus other ops besides ==
I called them CMPxx mostly because I already had the mnemonics, and the
distinction between a 2R op and a 3R op is "obvious enough" (and, unlike
RISC-V, I don't add new mnemonics to distinguish register from immediate
cases either).

But, yeah, SEQ and SLT are "basically the same thing" as this.

I included NE and GE cases mostly because, otherwise, to fake NE or GE
would require either using an additional XOR op, or adding some
instructions with bit-inverting logic.


In the 2R scheme, CMPxx clears or sets SR.T, which can be moved into a
GPR via MOVT or MOVNT:
MOVT Rn: Rn = { 63'h0, SR.T };
MOVNT Rn: Rn = { 63'h0, !SR.T };

Though, the former case did not need an NE case, since bit-inversion
doesn't add any cost.

But, EQ/NE/GT/GE can cover every possibility.

With 2R and SR.T, EQ and GT can cover every case; except if one operand
is an immediate, where one needs to add GE to compensate for the
inability to flip the arguments.


It isn't too far off to guess that the RV64's SLT/SLTI/SLTU/SLTIU
instructions are basically CMPGT/CMPHI under a different name (and with
the argument order flipped).
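
As a quick C illustration of the argument-flip trick (the helper
functions are just models, not the actual ISA semantics):

  #include <stdint.h>
  #include <assert.h>

  /* 3R compare models: result is 0 or 1 in a GPR. */
  static int64_t cmpqgt(int64_t rs, int64_t rt) { return rs >  rt; }
  static int64_t cmpqge(int64_t rs, int64_t rt) { return rs >= rt; }

  int main(void)
  {
      int64_t a = 3, b = 7;
      assert((a <  b) == cmpqgt(b, a));  /* LT via swapped GT */
      assert((a <= b) == cmpqge(b, a));  /* LE via swapped GE */
      return 0;
  }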
Post by Robert Finch
Post by BGB-Alt
The CMPQ{EQ/NE/GT} cases are also available in an Imm5u form (TBD if
it will use the expansion to Imm6u or Imm6s in XG2 mode). Currently
these have a comparably lower hit rate.
It is less clear if the "better" fallback case is to load a constant
into a register and use the 3R CMPxx ops, or to fall-back to the
original CMPxx+MOVT/MOVNT.
At present, the CMPxx+MOVT/MOVNT fallback strategy seems to be winning
(though, the 3R CMPxx fallback is likely to be better when the value
falls outside the range of the "CMPxx Imm10{u/n}, Rn" operations).
...
MitchAlsup
2024-02-01 03:01:36 UTC
Permalink
Post by BGB-Alt
<snip>
Also it would appear as-if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.
So, at least part of the problem is that GCC is generating code that is
not ideal for my pipeline.
Captain Obvious strikes again.
Post by BGB-Alt
Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse, as then jumps up to around 23%-26%
interlock penalty (mostly eating any gains that superscalar would
bring). Where, it seems that superscalar (according to my CPU's rules)
would bundle around 10-15% of the RV64 ops with '-O3' (or, around 8-12%
with '-Os').
You are running into the reasons CPU designers went OoO after the 2-wide
in-order machine generation.
Post by BGB-Alt
On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.
Your compiler is tuned to your pipeline.
But how do you tune your compiler to EVERY conceivable pipeline ??
Post by BGB-Alt
Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.
In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases).
If both the LHS and RHS represent logical expressions with no side effects;
If the LHS and RHS are not "too expensive" according to a cost heuristic
(past a certain size, it is cheaper to use short-circuit branching
rather than ALU operations).
&&&, |||: Logical and expressed as bitwise.
!& : !(a&b)
!!&: !(!(a&b)), Normal TEST operator, with a logic result.
Exists to be distinct from normal bitwise AND.
For the inexpensive cases, PRED was designed to handle the && and ||
of HLLs.
BGB
2024-02-01 07:11:31 UTC
Permalink
Post by MitchAlsup
Post by BGB-Alt
<snip>
Also it would appear as-if the scheduling is assuming 1-cycle ALU and
2-cycle load, vs 2-cycle ALU and 3-cycle load.
So, at least part of the problem is that GCC is generating code that
is not ideal for my pipeline.
Captain Obvious strikes again.
Post by BGB-Alt
Tried modeling what happens if RV64 had superscalar (in my emulator),
and the interlock issue gets worse, as then jumps up to around 23%-26%
interlock penalty (mostly eating any gains that superscalar would
bring). Where, it seems that superscalar (according to my CPU's rules)
would bundle around 10-15% of the RV64 ops with '-O3' (or, around
8-12% with '-Os').
You are running into the reasons CPU designers went OoO after the 2-wide
in-order machine generation.
At the moment, it is bad enough to make me question whether even 2-wide
superscalar makes sense for RV64.

Like, if Instructions/Bundle jumps by 10% but Interlock-Cost jumps by
9%, then it would only gain 1% in terms of Instructions/Clock.

This would suck, and would not be worth the cost of adding all the
plumbing needed to support superscalar.
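
(Reading those figures as ratios: throughput would scale by roughly
1.10/1.09, or about 1.01, i.e. only about a 1% net gain in instructions
per clock.)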
Post by MitchAlsup
Post by BGB-Alt
On the other hand, disabling WEX in BJX2 causes interlock penalties to
drop. So, it still maintains a performance advantage over RV, as the
drop in MIPs score is smaller.
Your compiler is tuned to your pipeline.
But how do you tune your compiler to EVERY conceivable pipeline ??
Possibly so.

Seems that since my CPU and compiler co-evolved, they fit together
reasonably well.

Meanwhile, GCC's output seems to assume a different-looking CPU, and is
at a natural disadvantage (independent of the respective "goodness" of
the ISAs in question).


So, it seems like, my ISA runs roughly 22% faster than RV64 on my CPU
design, with GCC's tuning being sub-optimal.


But, both would get a nice speed-up if the instruction latencies were
more in tune with what GCC seems to expect (and with what is apparently
delivered by many of the RV64 chips).

So, in part, the comparatively high latency values seem to be hurting
performance.
Post by MitchAlsup
Post by BGB-Alt
Otherwise, had started work on trying to get RV64G support working, as
this would support a wider variety of programs than RV64IMA.
In another experiment, had added logic to fold && and || operators to
use bitwise arithmetic for logical expressions (in certain cases).
If both the LHS and RHS represent logical expressions with no side effects;
If the LHS and RHS are not "too expensive" according to a cost
heuristic (past a certain size, it is cheaper to use short-circuit
branching rather than ALU operations).
   &&&, |||: Logical and expressed as bitwise.
   !& : !(a&b)
   !!&: !(!(a&b)), Normal TEST operator, with a logic result.
     Exists to be distinct from normal bitwise AND.
For the inexpensive cases, PRED was designed to handle the && and ||
of HLLs.
Mine didn't handle them, so generally predication only worked with
trivial conditionals:
  if(a<0)
    a=0;
Would use predication, but more complex cases:
  if((a<0) && (b>0))
    a=0;
Would not, and would always fall back to branching.


In the new mechanism, the latter case can partly be folded back into the
former, and can now allow parts of the conditional expression to be
subject to shuffling and bundling.

But, it seems that, say:
  CMPxx; MOVT; CMPxx; MOVT; AND; BNE
is bulkier than, say:
  CMPxx; BF; CMPxx; BF;
and is not always faster.


The CMP-3R ops partially address this, but the usefulness of the
immediate case is severely compromised by the small value range and the
few available forms.

But, I don't really have the encoding space left over (in the 32-bit
space) to add "better" versions.

Like, say:
CMPQEQ Rm, Imm9u, Rn
CMPQEQ Rm, Imm9n, Rn
CMPQNE Rm, Imm9u, Rn
CMPQNE Rm, Imm9n, Rn
CMPQGT Rm, Imm9u, Rn
CMPQGT Rm, Imm9n, Rn
CMPQGE Rm, Imm9u, Rn
CMPQGE Rm, Imm9n, Rn
CMPQLT Rm, Imm9u, Rn
CMPQLT Rm, Imm9n, Rn
CMPQLE Rm, Imm9u, Rn
CMPQLE Rm, Imm9n, Rn

Would deal with all of the cases effectively (and with a single op), but
at present, there is no encoding space to add these in the 32-bit space
(these would be a bit of an ask, even if the space did exist).


More viable would be (in XG2):
CMPQEQ Rm, Imm6s, Rn
CMPQNE Rm, Imm6s, Rn
CMPQGT Rm, Imm6s, Rn
CMPQGE Rm, Imm6s, Rn
CMPQLT Rm, Imm6s, Rn
CMPQLE Rm, Imm6s, Rn

This is a bit lame, but still more than the current:
CMPQEQ Rm, Imm5u, Rn
CMPQNE Rm, Imm5u, Rn
CMPQGT Rm, Imm5u, Rn
But, can maybe re-add the GE case:
CMPQGE Rm, Imm5u, Rn


Theoretically, 6s could get around a 60% hit-rate (vs 40% for 5u). The
hit-rate for 6u is also pretty close. Having both 6u and 6n cases would
have a better hit-rate, but is a bit more steep in terms of encoding
space (and is unlikely to matter enough to justify burning 12
instruction spots on it).

Though, there is still the option of throwing a Jumbo prefix on these
ops, getting, say:
CMPQEQ Rm, Imm29s, Rn //EQ, Wi=0
CMPQNE Rm, Imm29s, Rn //NE, Wi=0
CMPQGT Rm, Imm29s, Rn //GT, Wi=0
CMPQGE Rm, Imm29s, Rn //GE, Wi=0
CMPQLT Rm, Imm29s, Rn //GE, Wi=1
CMPQLE Rm, Imm29s, Rn //GT, Wi=1

CMPQHI Rm, Imm29s, Rn //EQ, Wi=1 (?)
CMPQHS Rm, Imm29s, Rn //NE, Wi=1 (?)

But... These would be 64-bit encodings, so would have the usual
tradeoffs/drawbacks of using 64-bit encodings...

Note that in XG2, the 'Wi' bit would otherwise serve as a sign extension
bit for the immediate (but, with a Jumbo-Imm prefix, the Ei bit serves
as the sign bit, and Wi would be left as a possible opcode bit, and/or
ignored...).


And, with WEX, these would be hit or miss versus loading the values
into registers, for the value range of +/- 65535.


Also, the main reason GE was left out of the current batch of Imm5
forms was that it seemed to have a comparably lower hit-rate than
EQ/NE/GT (though GE does better than GT for the 2-register case, it had
a lower hit-rate for compare-with-immediate).


Arguably, a case could be made for the unsigned compares; these were
left out here because 64-bit unsigned compare is comparatively much
rarer (and 64-bit signed compare works for 32-bit unsigned values when
the ABI keeps these values zero-extended, unlike the wonk of RV64
apparently sign-extending 32-bit unsigned values to 64 bits).

...
Robert Finch
2024-02-01 07:54:23 UTC
Permalink
Post by BGB
<snip>
Sounds like you hit the 32-bit encoding crunch. I think going with a
wider instruction format for a 64-bit machine is a reasonable choice. I
think they got that right with the Itanium. Being limited to constants
under 12 bits costs extra instructions. If a significant percentage of
the constants need extra instructions, does using 32 bits really save
space? A decent compare-and-branch can be built in 40 bits, and
compare-and-branch is about 10% of the instructions. If one counts all
the extra bits required when using a 32-bit instruction instead of a
40-bit one, the difference in code size is likely to be much smaller
than the 25% difference in instruction size. I have been wanting to
measure this for a while. I have also thought of switching to 41-bit
instructions, as three will fit into 128 bits, and it may be possible
to simplify the fetch stage if bundles of 128 bits are fetched for a
three-wide machine. But the software for 41 bits is more challenging.
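
As a rough back-of-envelope sketch of that code-size argument (purely
illustrative numbers; the 10% figure above is used here as the fraction
of instructions needing one extra 32-bit op):

  #include <stdio.h>

  int main(void)
  {
      double n = 1000.0;  /* instruction count (arbitrary) */
      double f = 0.10;    /* fraction needing one extra 32-bit op */

      double bits32 = 32.0 * n * (1.0 + f);  /* 32-bit ISA + extras  */
      double bits40 = 40.0 * n;              /* 40-bit ISA, no extras */

      /* With f = 0.10 the 40-bit code comes out roughly 13.6% larger
         rather than 25%; the break-even point is f = 0.25. */
      printf("32-bit: %.0f bits, 40-bit: %.0f bits, ratio %.3f\n",
             bits32, bits40, bits40 / bits32);
      return 0;
  }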
BGB
2024-02-01 09:45:48 UTC
Permalink
Post by Robert Finch
<snip>
Sounds like you hit the 32-bit encoding crunch. I think going with a
wider instruction format for a 64-bit machine is a reasonable choice. I
think they got that right with the Itanium. Being limited to constants <
12 bits uses extra instructions. If significant percentage of the
constants need extra instructions does using 32-bits really save space?
A decent compare-and-branch can be built in 40-bits. Compare-and-branch
is 10% of the instructions. If one looks at all the extra bits required
to use a 32-bit instruction instead of a 40-bit one, the difference in
code size is likely to be much smaller than the 25% difference in
instruction bit size. I have been wanting to measure this for a while. I
have thought of switching to 41-bit instructions as three will fit into
128-bits and it may be possible to simplify the fetch stage if bundles
of 128-bits are fetched for a three-wide machine. But the software for
41-bits is more challenging.
As can be noted, for 3RI Imm9 encodings in XG2, costs are:
2-bits: Bundle+Predicate Mode
12 bits: Rm/Rn register fields
9 bits: Immediate
9 bits: Remains for opcode/etc.

For 3R instructions:
2-bits: Bundle+Predicate Mode
18 bits: Rm/Ro/Rn register fields
12 bits: Remains for opcode/etc.

Though, given the ISA has other instructions:
32 spots: Load/Store (Disp9) and JCMP
16 spots: ALU 3R (Imm9)
The F2 block was split in half, with half going to 2RI Imm10 ops.

The F0 block holds all of the 3R ops, with a theoretical 9-bit opcode space.
Though, 1/4 of the space was carved initially for Branch ops.
In the original encoding, they were Disp20.
In XG2, they are effectively Disp23.
Half of this space has been semi-reclaimed though.
The BT/BF ops were redefined as being encoded as BRA?T / BRA?F

Parts of the 3R space were also carved out for 2R space, etc.


The encoding space can be extended with Jumbo Prefixes.
Currently defined as FE and FF, with 24 bits of payload.
FE is solely "Make Disp/Imm field bigger".
FF is mostly "Make Opcode bigger, maybe also extend Immed".

In XG2, there are theoretically a number of other jumbo prefixes:
1E/1F/3E/3F/5E/5F/7E/7F/9E/9F/BE/BF/DE/DF
But, these are not yet defined for anything, and are reserved.

There are also variants of the FA/FB block:
1A/1B/3A/3B/5A/5B/7A/7B/9A/9B/BA/BB/DA/DB
Which are similarly reserved (each with a potential of 24 bits of payload).


Status of the major blocks:
F0: Mostly full (3R Space)
0/1/2/3/4/5/6: Full
7/8/9: Partly used.
A/B: Still available
C/D: BRA/BSR
E/F: Semi reclaimed (former BT/BF ops)
F1: Basically full (LD/ST)
F2: Full as for 3RI Imm9 ops, some 2RI space remains.
F3: Unused, Intended as User-Extension-Block
Would likely follow same layout as F0 block.
F8: 2RI Imm16 ops, 6/8 used.
F9: Reserved
Likely more 3R space (similar to F0 Block)
May expand to F9 when F0 gets full.
Beyond then, dunno.
Probably no more Imm9 ops though.

Note that:
F4..F7 mirrors F0..F3 (but, with the WEX flag set)
FA/FB are somewhat niche, but are used indirectly for alternative uses.
FC/FD mirror F8/F9;
FE/FF: Jumbo Prefixes
The Ez block follows a similar layout, but represents predicated ops.
E0..E3: F0..F3, but Pred?T
E4..E7: F0..F3, but Pred?F
E8..E9: F8..F9, but Pred?T
EA..EB: F0, F2, but Pred?T and WEX
EC..ED: F8..F9, but Pred?F
EE..EF: F0, F2, but Pred?F and WEX

In XG2, all blocks other than Ez/Fz mirror Ez/Fz, but are used to
encode Bit 5 of the register fields.

In Baseline mode, these mostly encode 16-bit ops (where nominally,
everything uses 5-bit register fields, and the handling of R32..R63 is
hacky and only works with a limited subset of the ISA; having reclaimed
the 7z and 9z blocks from 16-bit land; these were reclaimed from a
defunct 24-bit instructions experiment, which had in turn used these
because initially "nothing of particular value" was in these parts of
the 16-bit map).


There has been a slowdown in adding new instructions, with more
conservatism when they are added, mostly because there isn't a whole
lot of encoding space left in the existing blocks.

Apart from F3 and F9, the existing 32-bit encoding space is mostly used up.


...
Robert Finch
2024-02-01 11:04:44 UTC
Permalink
Post by Robert Finch
Post by BGB
<snip>
Put some work into the compiler and got it to optimize some expressions
to use the dual-operation instructions. ATM it supports and_or, and_and,
or_or, and or_and. The HelloWorld! program produces the following.

integer main(integer argc, char* argv[])
begin
integer x;

for (x = 1; x < 10; x++) begin
if (argc > 10 and argc < 12 or argc==52)
puts("Hello World!\n");
end
end

.sdreg 29
_main:
enter 2,32
ldo s1,32[fp]
; for (x = 1; x < 10; x++) begin
ldi s0,1
ldi t1,10
bge s0,t1,.00039
.00038:
; if (argc > 10 and argc < 12 or argc==52)
zsgt t1,s1,10,1
zslt t2,s1,12,1
zseq t3,s1,52,1
and_or t0,t1,t2,t3
beqz t0,.00041
; puts("Hello World!\n");
sub sp,sp,8
lda t0,_main.00016[gp]
orm t0,_main.00016
sto t0,0[sp]
bsr _puts
.00041:
.00040:
ldi t1,10
iblt s0,t1,.00038
.00039:
.00037:
leave 2,16
.type _main,@function
.size _main,$-_main
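
For reference, a small C model of what that compare + and_or sequence
computes (the operand grouping is inferred from the source line, not
from a spec):

  #include <assert.h>

  /* t1 = argc > 10, t2 = argc < 12, t3 = argc == 52,
     t0 = (t1 & t2) | t3   -- presumed and_or grouping */
  static int cond(int argc)
  {
      int t1 = (argc > 10), t2 = (argc < 12), t3 = (argc == 52);
      return (t1 & t2) | t3;
  }

  int main(void)
  {
      assert(cond(11) == 1);   /* 10 < argc < 12 */
      assert(cond(52) == 1);   /* argc == 52 */
      assert(cond(13) == 0);
      return 0;
  }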
BGB
2024-02-02 09:43:08 UTC
Permalink
Post by Robert Finch
<snip>
Put some work into the compiler and got it to optimize some expressions
to use the dual-operation instructions. ATM it supports and_or, and_and,
or_or, and or_and. The HelloWorld! Program produces the following.
integer main(integer argc, char* argv[])
begin
    integer x;
    for (x = 1; x < 10; x++) begin
        if (argc > 10 and argc < 12 or argc==52)
            puts("Hello World!\n");
    end
end
    .sdreg    29
  enter 2,32
  ldo s1,32[fp]
; for (x = 1; x < 10; x++) begin
  ldi s0,1
  ldi t1,10
  bge s0,t1,.00039
; if (argc > 10 and argc < 12 or argc==52)
  zsgt t1,s1,10,1
  zslt t2,s1,12,1
  zseq t3,s1,52,1
  and_or t0,t1,t2,t3
  beqz t0,.00041
; puts("Hello World!\n");
  sub sp,sp,8
  lda t0,_main.00016[gp]
  orm t0,_main.00016
  sto t0,0[sp]
  bsr _puts
  ldi t1,10
  iblt s0,t1,.00038
  leave 2,16
    .size    _main,$-_main
Hmm...

Possible I guess, but 4R ALU ops aren't something my CPU can do as-is,
and I am not sure they would be used enough to make it worthwhile.


Though, I did go and try a different strategy:
I noted while skimming the SiFive S76 docs that they specified some
constraints on the timing of various ops. Memory load timing depended
on what was being loaded, as did ALU timing.

This gave me an idea.


I could add a "fast path" to the L1 cache where, if the memory access
satisfied certain requirements, it would be reduced to 2 cycle latency:
Aligned-Only, 32 or 64 bit Load;
Normal RAM access (not MMIO or similar);
Does not trigger a "read-after-write" dependency;
...
This case allows for cheaper memory-access logic which doesn't kill
the timing (with the result forwarded directly to the pipeline).

Basically, in this case, the L1D$ has an alternate output that is
directed to EX2 with a flag that encodes whether the value is valid. It
does not replace the logic in EX3, mostly because (unless something has
gone terribly wrong), both should always give the same output value.
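
A minimal C sketch of the qualification test (names and interface are
purely illustrative, not the actual Verilog):

  #include <stdint.h>

  /* Nonzero if a load may take the 2-cycle fast path: aligned 32/64-bit
     access, normal RAM, no read-after-write hazard. */
  static int load_fast_path_ok(uint64_t addr, int size_bytes,
                               int is_mmio, int raw_hazard)
  {
      int size_ok = (size_bytes == 4) || (size_bytes == 8);
      int aligned = (addr & (uint64_t)(size_bytes - 1)) == 0;
      return size_ok && aligned && !is_mmio && !raw_hazard;
  }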


Also added an alternate "fast case ALU", which reduces ALU latency to
1 cycle for a few common cases:
ADD{S/U}L, SUB{S/U}L
ADD/SUB if the input values fall safely into signed 32-bit range.
Currently +/- 2^30, as these can't overflow the signed 32-bit range.
Skips 64-bit mostly because low-latency 64-bit ADD is harder.
AND/OR/XOR
These handle full 64-bit though.

Currently, it ignores all the other operations and applies only to
Lane 1. As with Load, it doesn't modify the logic in EX2, mostly
because both should always produce the same result.
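
Similarly, a sketch of the ADD/SUB qualification (illustrative only):
with both inputs inside the +/- 2^30 window, the result cannot leave
the signed 32-bit range.

  #include <stdint.h>

  /* Nonzero if a+b (or a-b) cannot overflow the signed 32-bit range. */
  static int add_fast_path_ok(int64_t a, int64_t b)
  {
      const int64_t lim = (int64_t)1 << 30;
      return (a > -lim) && (a < lim) && (b > -lim) && (b < lim);
  }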


...
MitchAlsup
2024-02-02 19:39:47 UTC
Permalink
Post by BGB
Post by Robert Finch
<snip>
Hmm...
Possible I guess, but 4R ALU ops isn't something my CPU can do as-is,
and I am not sure it would be used enough to make it worthwhile.
I noted while skimming the SiFive S76 docs that it specified some
constraints on the timing of various ops. Memory Load timing depended on
what was being loaded, as did ALU timing.
This gave me an idea.
I could add a "fast path" to the L1 cache where, if the memory access
Aligned-Only, 32 or 64 bit Load;
Normal RAM access (not MMIO or similar);
Does not trigger a "read-after-write" dependency;
...
This case allowing for cheaper memory access logic which doesn't kill
the timing (if the result is forwarded directly to the pipeline).
The above was a question posed to me while interviewing with HP in 1988.

The right answer is:: "Do nothing that harms the frequency of the pipeline".
{{Which you may or may not be doing to yourself}}

The second correct answer is:: "Do nothing that adds 1 to the exponent
of test vector complexity". {{Which you invariably are doing to yourself}}
Post by BGB
Basically, in this case, the L1D$ has an alternate output that is
directed to EX2 with a flag that encodes whether the value is valid. It
does not replace the logic in EX3, mostly because (unless something has
gone terribly wrong), both should always give the same output value.
Also an alternate "fast case ALU", which reduces ALU to 1-cycle for a
ADD{S/U}L, SUB{S/U}L
ADD/SUB if the input values fall safely into signed 32-bit range.
Currently +/- 2^30, as this can't overflow the signed 32-bit.
Skips 64-bit mostly because low-latency 64-bit ADD is harder.
AND/OR/XOR
These handle full 64-bit though.
Currently, ignores all the other operations, and currently applies only
to Lane 1. As with Load, it doesn't modify the logic in EX2 mostly
because both should always produce the same result.
....
BGB
2024-02-02 22:34:00 UTC
Permalink
Post by MitchAlsup
Post by BGB
Post by Robert Finch
<snip>
Hmm...
Possible I guess, but 4R ALU ops isn't something my CPU can do as-is,
and I am not sure it would be used enough to make it worthwhile.
I noted while skimming the SiFive S76 docs that it specified some
constraints on the timing of various ops. Memory Load timing depended
on what was being loaded, as did ALU timing.
This gave me an idea.
Basically, it had specified:
32- and 64-bit loads may be 2 or 3 cycles, depending on various factors;
8- and 16-bit loads are 3 cycles.

Though, the SiFive cores appear to be aligned-only internally (with
unaligned cases triggering a severe performance penalty).
Post by MitchAlsup
Post by BGB
I could add a "fast path" to the L1 cache where, if the memory access
   Aligned-Only, 32 or 64 bit Load;
   Normal RAM access (not MMIO or similar);
   Does not trigger a "read-after-write" dependency;
   ...
This case allowing for cheaper memory access logic which doesn't kill
the timing (if the result is forwarded directly to the pipeline).
The above was a question posed to me while interviewing with HP in 1988.
The right answer is:: "Do nothing that harms the frequency of the pipeline".
{{Which you may or may not be doing to yourself}}
Reducing the latency in this way isn't ideal for LUT cost or timing,
but it's not as if I can get my core much faster than 50 MHz anyway,
so...

Supporting a subset of aligned-only 32/64-bit accesses with a shortcut
does at least offer a performance advantage (and is more viable than
trying to get the general case down to 2 cycles, which is almost
guaranteed to blow the timing constraints).


Though, yeah, the L1 shortcut and "fast ALU" do add roughly 4k LUTs to
the cost of the CPU core. I suspect part of this cost is that the
register forwarding path seems to mass-duplicate any combinatorial
logic connected to it (but the only way to avoid that would be to stay
with the 2-cycle ALU and 3-cycle Load, so, ...).
Post by MitchAlsup
The second correct right answer is:: "Do nothing that adds 1 to the
exponent of test vector complexity". {{Which you invariably are doing to
yourself}}
Well, if there is one good point of messing around with core mechanisms
of the CPU or pipeline, it is that if I screw something up, the core
will typically blow up pretty much immediately in simulation, making it
easier to debug.

Much harder to identify bugs which may take hours of simulation time
before they manifest (or, an unidentified bug where after several days
of running the Quake demo loop, Quake will seemingly try to jump to a
NULL address and crash; but this bug seemingly does not manifest in the
emulator).


I have also observed that the C version of my RP2 decoder breaks in both
the simulation and the emulator in RV64 mode; however, I had noted that
the same bug may also appear in an x86-64 build with GCC, and it seems
to depend on optimization level and if/when some variables are zeroed. I
think this may be more a case of "something in the code is playing badly
with GCC" though (I have not yet identified any "smoking gun" in terms
of UB; using "memcpy()" in place of pointer derefs does not fix the
issue, but it was how I realized that GCC inlines memcpy on RV64 using
byte loads/stores).

The bug seemingly goes away with "-O0" in GCC, but then Doom is
unbearably slow (runs at single-digit framerates). A partial workaround
for the RV case, for now, is to use the original uncompressed Doom WADs.



But, yeah, my core could be simpler...

Now supporting the common superset of both BJX2 and RV64G (excluding
privileged spec) probably doesn't exactly help.

Though, as noted, despite now extending to RV64G, the BJX2 core still
does not have separate FPRs; instead, the decoder just sort of maps
RV64's FPRs to R32..R63 ...



Some of this does reveal things it might have made sense to do
differently in retrospect, say:
Treating plain ALU ops, Compare ops, and Conversion ops as 3 different
entities (as opposed to all being mostly lumped under the ALU umbrella,
which needs separate ALU modules for Lanes 1/2/3 because the Lane 1 ALU
carries a lot of logic that is N/A for Lanes 2 and 3, ...).

Say:
  ALU, does exclusively ADD/SUB / AND/OR/XOR
    and closely related operations.
  CMP, does Integer and FPU comparison.
    Ideally with more orthogonal handling of SR.T or GPR output.
    As-is, the output-handling part is a little messy.
  CNV, does type conversion (likely always 2 cycles).
    Don't really need 1-cycle FP-SIMD convert or RGB555 pack/unpack, ...
  MOV, does register-MOV-like operations:
    MOV Reg, Reg
    MOV Imm, Reg
    EXTS.L and EXTU.L
    These need to be 1 cycle.
  Most other converter ops can remain 2 cycles.


As-is, my BJX2 ISA design is probably bigger and more complex than ideal.

Might have been better if some things were more orthogonal, but
eliminating some cases in favor of orthogonal alternatives requires
having an architectural zero register (with its own pros/cons).

But, my redesign attempts tend to be prone to losing PrWEX, which
although not highly used, is at least "still useful".

Some amount of the listing is used up by cruft from short-lived
experimental features.

For example, the 48-bit ALU ops turned out to be a bit of a dud: both
unexpectedly expensive for the CPU core, and not offering much of a
performance advantage over the prior workarounds using 64-bit ALU ops
(such as doing a 64-bit subtract and then sign-extending the result
from 48 to 64 bits).
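
In C terms, that workaround is roughly the following (a sketch,
assuming the quantity of interest lives in the low 48 bits):

  #include <stdint.h>

  /* 48-bit subtract done with 64-bit ops: a plain 64-bit subtract,
     followed by sign-extending bit 47 up through bit 63. */
  static int64_t sub48(int64_t a, int64_t b)
  {
      int64_t d = a - b;
      return (int64_t)((uint64_t)d << 16) >> 16;  /* sext from 48 bits */
  }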


Granted, this does still leave the annoyance that one either uses
zero-extended pointers in C, or needs to manually work around the
tagging if bounds-checking is enabled, which leaves a mismatch between
bounds-checked and non-bounds-checked code.

Where, say, relative pointer comparison, no bounds checking:
CMPQGT R5, R4
With bounds-checking:
SUB R4, R5, R2
MOVST R2, R2 //48-bit sign extension
CMPQGT 0, R2
Vs:
CMPPGT R5, R4 //Ignoring high 16 bits

But, despite the overhead of 2 extra ops, the relative performance
impact on code seems to be fairly modest.
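
In C terms, the bounds-checked sequence above amounts to comparing only
the low 48 bits (a sketch; assumes the tag/bounds metadata lives in the
high 16 bits):

  #include <stdint.h>

  /* "p greater-than q" with tagged pointers: take the 64-bit
     difference, sign-extend it from 48 bits so the tag bits drop out,
     then compare against zero. */
  static int tagged_ptr_gt(uint64_t p, uint64_t q)
  {
      int64_t d = (int64_t)(p - q);            /* 64-bit subtract       */
      d = (int64_t)((uint64_t)d << 16) >> 16;  /* 48-bit sign extension */
      return d > 0;                            /* compare against zero  */
  }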


Though, as-is, a similar annoyance comes up when comparing function
pointers, which remain tagged even without bounds-checking. I did tweak
the rules for some ops such that function pointers will at least always
give the same value for the same CPU mode (so == and != work as
expected).

This does mean there is wonk if one wants to do relative comparisons of
function pointers across ISA modes, or tries to use a function pointer
as a base address to access memory, but these are mostly non-issues in
practice.
Post by MitchAlsup
Post by BGB
Basically, in this case, the L1D$ has an alternate output that is
directed to EX2 with a flag that encodes whether the value is valid.
It does not replace the logic in EX3, mostly because (unless something
has gone terribly wrong), both should always give the same output value.
Also an alternate "fast case ALU", which reduces the ALU to 1-cycle for a
subset of cases:
   ADD{S/U}L, SUB{S/U}L
   ADD/SUB if the input values fall safely into signed 32-bit range.
     Currently +/- 2^30, as this can't overflow the signed 32-bit range.
     Skips 64-bit mostly because low-latency 64-bit ADD is harder.
   AND/OR/XOR
     These handle full 64-bit though.
Ironically, making ADDS.L and ADDU.L 1-cycle was originally part of the
intention, but it took a while to get to it.

ADDS.L basically does a sign-extending 32-bit ADD, where RV64's
equivalent is the ADDW instruction. ADDU.L is the zero-extending form;
apparently RV64 has ADDUW as part of BitManip.
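
In C terms, the intended semantics are roughly (a sketch; ADDW on the
RV64 side matches the sign-extending form):

  #include <stdint.h>

  /* ADDS.L (BJX2) / ADDW (RV64): add the low 32 bits, sign-extend. */
  static int64_t adds_l(uint64_t a, uint64_t b)
  {
      return (int64_t)(int32_t)(uint32_t)(a + b);
  }

  /* ADDU.L (BJX2): add the low 32 bits, zero-extend. */
  static uint64_t addu_l(uint64_t a, uint64_t b)
  {
      return (uint32_t)(a + b);
  }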

Likely, may add (from BitManip):
Zba, more or less maps over, though the semantics are not exact.
Could add SHnADD, mapped to LEA, but the semantics are not exact (*1).
Zbb, a large part maps over.
Zbkb, partly maps.
Zbs/Zbkc/Zbkx, don't map.

*1: Main potential problem case would be if GCC tried to use them as
generic 64-bit ALU ops, which would break if mapped over to the LEA.x
logic. The description in the spec seems to imply that they could also
be used for 64-bit ALU though.
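
For reference, as I read the Zba spec, the SHnADD ops compute roughly
the following (SH2ADD shown; a sketch of why routing them through the
LEA.x logic is risky):

  #include <stdint.h>

  /* Zba SH2ADD: rd = rs2 + (rs1 << 2), as a full 64-bit add. If this
     is mapped onto LEA-style address-generation logic that assumes
     in-range addresses, using it as a generic 64-bit ALU op is exactly
     the case that could break. */
  static uint64_t sh2add(uint64_t rs1, uint64_t rs2)
  {
      return rs2 + (rs1 << 2);
  }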


Though, there is concern over a lot of edge cases where my
implementation of RV64 differs from the RISC-V spec (in some areas,
these differences were necessary to "make it work", with the
presumption that "GCC won't likely notice"; though this does depend
somewhat on how exactly GCC uses the ISA, *2).

*2:
Does not seem to make use of the contents of link-register values;
Does not seem to use A's AMOxx instructions for normal output;
RV64IMA and RV64G output is the same as if A were not present.
Does not seem to use FMADD/FMSUB for F/D (absent "-ffast-math");
So, doesn't yet seem to matter that they are absent.
...



RV64 does bring its own wonk in its handling of unsigned 32-bit values
(whether they are sign- or zero-extended to 64 bits isn't entirely
consistent).

It seems the C ABI specified sign-extend-everything, but the RISC-V ISA
itself assumes zero-extended unsigned values, and deals with this wonk
in some cases by having explicit "ignore all the high-order bits"
instruction forms, including for cases where BJX2 only provides signed
64-bit inputs (since "unsigned int" maps cleanly into 64-bit range if
one assumes zero-extended values, making specifically 32-bit unsigned
instructions unnecessary in those cases).

Would have been "better" had the RV64 ABI spec specified zero-extended
unsigned values, avoiding this particular bit of wonk...
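
Concretely, the effect of sign-extend-everything is that a 32-bit
unsigned value with the high bit set is held sign-extended in a
register, so a zero-extension has to be made explicit wherever the full
64-bit value matters. A small illustrative sketch:

  #include <stdint.h>

  /* Under the RV64 C ABI, a uint32_t argument arrives sign-extended:
     0x80000000u sits in the register as 0xFFFFFFFF80000000, so using
     it as a 64-bit quantity needs an explicit zero-extension. */
  static uint64_t scale_index(uint32_t idx)
  {
      uint64_t wide = (uint64_t)idx;  /* forces the zero-extension */
      return wide * 8;                /* e.g. scaling an array index */
  }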

Though, some cases I am not going to bother with. If you pass a
negative input to a Float->UnsignedInt conversion, I am more inclined
to be like "Meh, whatever, the result will come out negative I guess"
(in my case, the Float->Int converter ops do not have range-clamped
outputs in general, and will just sorta produce a modulo output if
values go out of range; this responsibility was assumed to be left to
software).
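
Where clamped behavior actually matters, it can be handled in software
along these lines (a sketch; an out-of-range Float->UnsignedInt
conversion is undefined behavior in plain C anyway, so portable code
needs something like this regardless):

  #include <stdint.h>

  /* Clamp-then-convert: negative/NaN inputs go to 0, oversized inputs
     saturate, rather than relying on whatever modulo result the
     converter op happens to produce. */
  static uint32_t dtou32_clamped(double f)
  {
      if (!(f > 0.0))
          return 0;                    /* negative, zero, or NaN */
      if (f >= 4294967296.0)           /* 2^32 */
          return 0xFFFFFFFFu;
      return (uint32_t)f;
  }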
Post by MitchAlsup
Post by BGB
Currently, ignores all the other operations, and currently applies
only to Lane 1. As with Load, it doesn't modify the logic in EX2
mostly because both should always produce the same result.
....
MitchAlsup1
2024-01-22 19:01:35 UTC
Permalink
Post by BGB
Post by MitchAlsup1
Post by BGB
Shift sees a lot of use as well, as it is also used for both indexed
addressing, and for performing sign and zero extension.
   j=(short)i;
   SLLI X11, X10, 16
   SRAI X11, X11, 16
Which I do in 1 instruction
    SLL  R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these cases
(as in BJX2).
See; mine are not dedicated, they just as easily perform
    struct { long i : 17,
                  j : 9,
                  k : 3,
                 ...      } st;
    short s = st.k;
    SLL     Rs,Rst,<3,26>
Possibly, if one has a big enough immediate field to encode it.
It is 12-bits, 2×6-bit fields.
Post by BGB
Could have made sense as a use for the 12-bit Immed fields in RISC-V,
but it can be noted that they did not do so (and chose instead to use
pairs of shifts).
BGB
2024-01-22 19:58:00 UTC
Permalink
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB
Shift sees a lot of use as well, as it is also used for both indexed
addressing, and for performing sign and zero extension.
   j=(short)i;
   SLLI X11, X10, 16
   SRAI X11, X11, 16
Which I do in 1 instruction
     SLL  R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these cases
(as in BJX2).
See; mine are not dedicated, they just as easily perform
     struct { long i : 17,
                   j : 9,
                   k : 3,
                  ...      } st;
     short s = st.k;
     SLL     Rs,Rst,<3,26>
Possibly, if one has a big enough immediate field to encode it.
It is 12-bits, 2×6-bit fields.
Yes, but 12-bits was bigger than the 9-bit fields I was originally
using, or the Imm5 encodings in some other contexts.

Granted, XG2 expands these to 10 and 6 bits.

Or, could use a Jumbo encoding, or, ...

In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals with
all of the common cases (and is faster than using a pair of shifts,
particularly when these shifts each have a 2 cycle latency...).

Though, I ended up with only EXTS.L and EXTU.L having 1-cycle latency,
the B and W forms having 2-cycle; mostly because casts involving 'int'
and 'unsigned int' happened to be a lot more common than those
involving 'signed char' or 'short' and similar.

Comparably, arbitrary bit-fields are fairly rare, vs needing to make
sure a value is still in 'int' range (and preserves the de-facto
standard "wrap on overflow" semantics).
Post by MitchAlsup1
Post by BGB
Could have made sense as a use for the 12-bit Immed fields in RISC-V,
but it can be noted that they did not do so (and chose instead to use
pairs of shifts).
MitchAlsup1
2024-01-22 20:15:00 UTC
Permalink
Post by BGB
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB
Shift sees a lot of use as well, as it is also used for both indexed
addressing, and for performing sign and zero extension.
   j=(short)i;
   SLLI X11, X10, 16
   SRAI X11, X11, 16
Which I do in 1 instruction
     SLL  R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these cases
(as in BJX2).
See; mine are not dedicated, they just as easily perform
     struct { long i : 17,
                   j : 9,
                   k : 3,
                  ...      } st;
     short s = st.k;
     SLL     Rs,Rst,<3,26>
Possibly, if one has a big enough immediate field to encode it.
It is 12-bits, 2×6-bit fields.
Yes, but 12-bits was bigger than the 9-bit fields I was originally
using, or the Imm5 encodings in some other contexts.
Granted, XG2 expands these to 10 and 6 bits.
Or, could use a Jumbo encoding, or, ...
In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals with
all of the common cases (and is faster than using a pair of shifts,
particularly when these shifts each have a 2 cycle latency...).
And here we have the classical chicken and egg problem.

Bit fields are not as fast as {B,H,W,D} so few people use them;
Bit fields are not well supported in ISA so few compilers optimize them;
EVEN if they are ideal for the situation at hand.

When the HW cost of properly supporting them is essentially free !!
BGB
2024-01-24 10:42:57 UTC
Permalink
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB
Post by MitchAlsup1
Post by BGB
Shift sees a lot of use as well, as it is also used for both
indexed addressing, and for performing sign and zero extension.
   j=(short)i;
   SLLI X11, X10, 16
   SRAI X11, X11, 16
Which I do in 1 instruction
     SLL  R11,R10,<16,0>
{Extract the lower 16 bits at offset 0}
I started calling this a Smash -- Smash this long into a short.
This is what happens when shifts are a subset of bit manipulation
Post by BGB
As opposed to having dedicated instructions for a lot of these
cases (as in BJX2).
See; mine are not dedicated, they just as easily perform
     struct { long i : 17,
                   j : 9,
                   k : 3,
                  ...      } st;
     short s = st.k;
     SLL     Rs,Rst,<3,26>
Possibly, if one has a big enough immediate field to encode it.
It is 12-bits, 2×6-bit fields.
Yes, but 12-bits was bigger than the 9-bit fields I was originally
using, or the Imm5 encodings in some other contexts.
Granted, XG2 expands these to 10 and 6 bits.
Or, could use a Jumbo encoding, or, ...
In most cases, having EXT{S/U}.{B/W/L} works well enough, and deals
with all of the common cases (and is faster than using a pair of
shifts, particularly when these shifts each have a 2 cycle latency...).
And here we have the classical chicken and egg problem.
Bit fields are not as fast as {B,H,W,D} so few people use them;
Bit fields are not well supported in ISA so few compilers optimize them;
EVEN if they are ideal for the situation at hand.
When the HW cost of properly supporting them is essentially free !!
Possibly, but typical C type sizes and structure layouts are not
defined in bits, but rather as aggregates of power-of-two sized types,
typically also with power-of-two alignments.

So, whatever would make effective use of bitfield instructions probably
isn't typical C code (nor any of the other commonly used languages).
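
For example (illustrative only), the usual layout vs the bitfield form:

  #include <stdint.h>

  /* Typical C layout: every member is a power-of-two sized type, so a
     field access is just an aligned load plus sign/zero extension. */
  struct hdr_plain {
      uint32_t addr;
      uint16_t len;
      uint8_t  flags;
      uint8_t  kind;
  };

  /* Bitfield layout: denser, but each access needs a shift/mask pair
     (or a bitfield extract/insert instruction, where the ISA has one). */
  struct hdr_bits {
      unsigned addr  : 17;
      unsigned len   : 9;
      unsigned flags : 3;
      unsigned kind  : 3;
  };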