Post by Robert Finch
Post by BGB
Post by Robert Finch
Post by BGB
Post by MitchAlsup
Post by EricP
Post by Robert Finch
Figured it out. Each architectural register in the RAT must refer
to N physical registers, where N is the number of banks. Setting
N to 4 results in a RAT that is only about 50% larger than one
supporting only a single bank. The operating mode is used to
select the physical register. The first eight registers are
shared between all operating modes so arguments can be passed to
syscalls. It is tempting to have eight banks of registers, one
for each hardware interrupt level.
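Roughly, the bank selection might look something like the following C sketch (assuming the 4-bank, 8-shared-register scheme above; the architectural register count and the names are just placeholders):
  #include <stdint.h>
  #define NUM_BANKS       4   /* one bank per operating mode          */
  #define NUM_AREGS      32   /* placeholder architectural reg count  */
  #define NUM_SHARED      8   /* first 8 regs shared across modes     */
  /* One physical-register mapping per (bank, arch reg). */
  static uint8_t rat[NUM_BANKS][NUM_AREGS];
  /* Look up the physical register for arch register 'areg' in operating
     mode 'mode'; regs 0..7 ignore the mode and always use bank 0, so
     syscall arguments stay visible across modes. */
  static uint8_t rat_lookup(int mode, int areg)
  {
      int bank = (areg < NUM_SHARED) ? 0 : mode;
      return rat[bank][areg];
  }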
A consequence of multiple architectural register banks is that each extra
bank keeps a set of mostly unused physical registers attached to it.
A waste.....
Post by EricP
For example, if there are 2 modes User and Super and a bank for each,
since User and Super are mutually exclusive,
64 of your 256 physical registers will be sitting unused, tied to the
other mode's bank, so a max of 75% utilization efficiency.
If you have 8 register banks then only 3/10 of the physical registers
are available to use; the other 7/10 sit idle, attached to arch
registers in other modes, consuming power.
Also you don't have to play overlapped-register-bank games to pass
args to/from syscalls. You can have specific instructions that reach
into other banks: Move To User Reg, Move From User Reg.
Since only syscall passes args into the OS you only need to access
the user mode bank from the OS kernel bank.
A SysCall in My 66000 only saves and restores 24 of the 32 registers.
So when control arrives, there are 8 argument registers from the
Caller and 24 registers from Guest OS already loaded. So, SysCall
handler already has its stack, and a variety of pointers to data
structures it is interested in.
On the way back, RET only restores 24 registers so Guest OS can pass
back as many as 8 result registers.
I had handled it by saving/restoring 64 of the 64 registers...
For syscalls, it basically messes with the registers in the captured
register state for the calling task.
A newer change involves saving/restoring registers more directly
to/from the task context for syscalls, which reduces the task-switch
overhead by around 50% (but is mostly N/A for other kinds of interrupts).
...
I am toying with the idea of adding context save and restore
instructions. I would try to get them to work on a cache line's worth
of data, four registers accessed for read or write at the same time.
Context save / restore would be a macro instruction made up of
sixteen individual instructions, each of which saves or restores four
registers. It is a bit of a hoop to jump through for an infrequently
used operation. However, it is good to have clean context-switch code.
Added the REGS instruction modifier. The modifier causes the
following load or store instruction to repeat using the registers
specified in the register list bitmask for the source or target
register. In theory it can also be applied to other instructions but
that was not the intent. It is pretty much useless for other
instructions, but a register list could be supplied to the MOV
instruction to zero out multiple registers with a single instruction.
Or possibly the ADDI instruction could be used to load a constant
into multiple registers. I could put code in to disable REGS use with
anything other than load and store ops, but why add extra hardware?
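As a rough software rendering of the idea (not the actual Q+ sequencer; the function and array names are made up), expanding a REGS-prefixed store over a register-list bitmask might look like:
  #include <stdint.h>
  /* Hypothetical expansion of a REGS-prefixed store: walk the
     register-list bitmask and emit one store per set bit, bumping the
     destination slot as it goes. A hardware version would do this four
     registers (one cache-line access) at a time. */
  static void regs_store_expand(uint64_t reg_mask, uint64_t *mem,
                                const uint64_t regfile[64])
  {
      int slot = 0;
      for (int r = 0; r < 64; r++) {
          if (reg_mask & (1ull << r))
              mem[slot++] = regfile[r];   /* one store per listed register */
      }
  }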
In my case, it is partly a limitation of not really being able to make
it any wider than it already is without adding a 4th register write port
and likely imposing a 256-bit alignment requirement; and this for a task
that is mostly limited by L1 cache misses...
Like, saving registers would be ~ 40 cycles or so (with another ~ 40
to restore them), saving/restoring 2 registers per cycle with GPRs, if
not for all the L1 misses.
The reason it is not similar for normal function calls (besides these
only saving/restoring the normal registers) is that often the stack is
still "warm" in the L1 cache.
For interrupts, in the time from one interrupt to another, most of the
L1 cache contents from the previous interrupt are already gone.
So, these instruction sequences are around 80% L1 miss penalty, vs
around 5% for normal prologs/epilogs.
This is similar for the inner loops for "memcpy()", which average
roughly 90% L1 miss penalty.
And, say, "memcpy()" averages around 300MB/sec if just copying the
same small buffer over and over again, but then quickly drops to
70MB/sec if copying memory that falls outside the L1 cache.
Though, comparably, it seems that the drop-off from L2 cache to DRAM
is currently a little smaller.
So, the external DRAM interface can push ~ 100MB/sec with the current
interface (supports SWAP operations, moving 512 bits at a time, and
using a sequence number to transition from one request to another).
But, it is around 70MB/s for requests to make it around the ringbus.
Though, I have noted that if things stay within the limits of what
fits in the L2 cache, multiple parties can access the L2 cache at the
same time without too much impact on each other.
So, say, at modest resolutions, the screen refresh does not impact the
CPU, and the rasterizer module is also mostly independent.
Still, about the highest screen resolution it can really sustain
effectively is ~ 640x480 256-color, or ~ 18MB/sec.
This may be more timing related though, since for screen refresh there
is a relatively tight deadline between when the requests start being
sent, and when the L2 cache needs to hit for that request, and failing
this will result in graphical glitches.
Though, generally what it means is, if the framebuffer image isn't in
the L2 cache, it is gonna look like crap; and effectively the limit is
more "how big of a framebuffer can I fit in the L2 cache".
On the XC7A200T, I can afford a 512K L2 cache, which is just big enough
to fit 640x400 or 640x480 (but 800x600 is kinda pushing it, and fights a
bit more with the main CPU).
OTOH, it is likely the case that on the XC7A100T (which can only afford
a 256K L2 cache), 640x400 256-color is pushing it (but color cell mode
still works fine).
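For reference, the raw framebuffer sizes at 256 colors (1 byte/pixel)
work out to roughly:
  640x400 @ 8bpp: 256000 bytes (~250K)
  640x480 @ 8bpp: 307200 bytes (~300K)
  800x600 @ 8bpp: 480000 bytes (~469K)
Which lines up with 640x400/640x480 fitting in a 512K L2 with room to
spare, 800x600 nearly filling it, and 640x400 sitting right at the edge
of a 256K L2.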
Had noted though that setting the screen resolution at one point to
800x600 RGB555 (72 Hz), which pulls around 70MB/sec, was basically
almost entirely broken and seemingly bogged down the CPU (which could
no longer access memory in a timely manner).
Also, it seems stuff running on the CPU can cause screen artifacts in
these modes, presumably by knocking stuff out of the L2 cache.
Also, it seems like despite my ringbus being a lot faster than my
original bus, it has still managed to become an issue due to latency.
But, despite this, on average, things like interlocks and branch-miss
penalties and similar are now still weighing in a fair bit as well
(with interlock penalties closely following cache misses as the main
source of pipeline stalls).
Well, these two combined burn around 30% of the total clock-cycles,
with another ~ 2-3% or so being spent on branches, ...
Well, and my recent effort to improve FPGA timing enough to try to get
it up to 75MHz did have the drawback of "in general" increasing the
number of cycles spent on interlocks (but returning a lot of the
instructions to their original latency values would make the FPGA
timing-constraint issues a bit worse).
But, if I could entirely eliminate these sources of latency, this
would only gain ~30%, and at this point I would either need to somehow
increase the average bundle width, or find ways to reduce the total
number of instructions that need to be executed (both of these being
more compiler-related territory).
Though, OTOH, I have noted that in many cases I am beating RISC-V
(RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96 bit
encodings) when both are using the same C library, which implies that
I am probably "not doing too badly" on this front either (though,
ideally, I would be "more consistently" beating RISC-V at this metric,
*1).
*1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3"
are bigger; BJX2 Baseline does beat RV64IM, but this is not a fair
test as BJX2 Baseline has 16-bit ops).
Though BGBCC also has an "/Os" option, it seems to have very little
effect in XG2 Mode (it mostly does things to try to increase the
number of 16-bit ops used, which is N/A in XG2).
Where, here, one can use ".text" size as a stand-in for total
instruction count (and by extension, the number of instructions that
need to be executed).
Though, in some past tests, it seemed like RISC-V needed to execute a
larger number of instructions to render each frame in Doom, which
doesn't really follow if both have a roughly similar number of
instructions in the emitted binaries (and if both are essentially
running the same code).
So, something seems curious here...
...
For the Q+ MPU and SOC the bus system is organized like a tree with the
root being at the CPU. The system bus operates with asynchronous
transactions. The bus then fans out through bus bridges to various
system components. Responses coming back from devices are buffered and
merged together onto a more common bus when there are open spaces on
the bus. I think it is fairly fast (well, at least for homebrew FPGA).
Bus accesses are single cycle, but they may have a varying amount of
latency. Writes are “posted” so they are essentially single cycle. Reads
percolate back up the tree to the CPU. It operates at the CPU clock rate
(currently 40MHz) and transfers 128-bits at a time. Maximum peak
transfer rate would then be 640 MB/s. Copying memory is bound to be much
slower due to the read latency. Devices on the bus have a configuration
block which looks something like a PCI config block, so device
addressing may be controlled by the OS.
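As a hypothetical sketch of what such a config block might carry (field names and widths here are illustrative, not the actual Q+ layout):
  #include <stdint.h>
  /* Hypothetical per-device configuration block, loosely modeled on a
     PCI config header; the OS reads the IDs and programs the base
     address to place the device in the address map. */
  typedef struct {
      uint16_t vendor_id;
      uint16_t device_id;
      uint32_t device_class;   /* what kind of device this is      */
      uint64_t base_addr;      /* where the OS maps the device     */
      uint32_t irq_line;       /* which interrupt the device uses  */
  } dev_config_block;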
My original bus was fairly slow:
Put a request on the bus; as it propagates, each layer of the bus holds
the request until it reaches the destination, which sends back an OK
signal. The OK propagates back up the bus to the sender, the sender
then switches to sending an IDLE signal, and the whole process repeats
as the bus "tears down". When this is done, the OK signal switches to
READY, and the bus may then accept another request.
This bus could only handle a single active request at a time, and no
further requests could initiate (anywhere) until the prior request had
finished.
Experimentally, I was hard-pressed to get much over about 6MB/sec over
this bus with 128-bit transfers... (but could get it up to around
16MB/sec with 256-bit SWAP messages). As noted, this kinda sucked...
I then replaced this with a ring-bus:
Every node on the ring passes messages from input to output, and is
able to drop messages onto the bus, or remove/replace messages as
appropriate. If a message is not handled immediately, it circles the
ring until it can be handled.
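A minimal C sketch of the per-node, per-clock behavior (the message fields and names are made up; the real logic is Verilog and handles more cases):
  typedef struct { int valid; int dest; long long payload; } ring_msg;
  /* One ring node per clock: take a message addressed to this node off
     the ring, otherwise pass it along; if the slot is empty, inject a
     locally pending message. Unclaimed messages keep circling. */
  static ring_msg ring_node_step(ring_msg in, int my_id,
                                 ring_msg *pending, ring_msg *received)
  {
      if (in.valid && in.dest == my_id) {
          *received = in;          /* message is for this node: consume it */
          in.valid = 0;
      }
      if (!in.valid && pending->valid) {
          in = *pending;           /* empty slot: drop our message onto the ring */
          pending->valid = 0;
      }
      return in;                   /* forwarded to the next node on the ring */
  }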
This bus was considerably faster, but still seems to suffer from latency
issues.
In this case, the latency of the ring bus was higher than the original
bus, but had the advantage that the L1 cache could effectively drop 4
consecutive requests onto the bus and then (in theory) they could all be
handled within a single trip around the ring.
Theoretically, the bus could move 800MB/sec at 50MHz, but practically
it seems to achieve around 70MB/s (which is in turn affected by things
that affect ring latency, like enabling/disabling various "shortcut
paths" or enabling/disabling the second CPU core).
A point-to-point message-passing bus could be possible, and could have
lower latency, but was not done mostly because it seemed more
complicated and expensive than the ring design.
If one has two endpoints, both can achieve around 70MB/s if L2 hits, but
this drops off if the external RAM accesses become the limiting factor.
The RAM interface is using a modified version of the original bus, where
both the OPM and OK signals were augmented with sequence numbers, where
when the sent sequence number on OPM comes back via the OK signal, one
can immediately move to the next request (incrementing the sequence number).
While this interface still only allows a single request at a time, this
change effectively doubles the throughput. The main reason for using
this interface to talk to external RAM, is that the interface works
across clock-domain crossings (as-is, the ring-bus requests can't
survive a clock-domain crossing).
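Roughly, the requester side of that handshake might look like the following C-style sketch (the signal accessors are stand-ins for the actual OPM/OK wires, and the sequence-number width is a guess):
  /* Stand-ins for driving OPM (with sequence number) and sampling the
     sequence number echoed back on OK. */
  extern void drive_opm(int op, int seq, unsigned long long addr);
  extern int  ok_seq(void);
  /* Issue requests back-to-back: present a request plus sequence
     number, wait for the same number to come back across the clock
     domain on OK, then immediately move to the next request. */
  static void issue_requests(int op, int nreq, unsigned long long *addrs)
  {
      int seq = 0;
      for (int i = 0; i < nreq; i++) {
          drive_opm(op, seq, addrs[i]);
          while (ok_seq() != seq)
              ;                        /* wait for the echoed sequence number */
          seq = (seq + 1) & 7;         /* sequence field width is a guess */
      }
  }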
Most of the MMIO devices are still operating on a narrower version of
the original bus, say:
5b: OPM
28b: Addr
64b: DataIn
64b: DataOut
2b: OK
Where, OPM:
00-000: IDLE
00-zzz: Special Command (if zzz!=000)
01-010: Load DWORD (MMIO)
01-011: Load QWORD (MMIO)
01-111: Load TILE (RAM, Old)
10-010: Store DWORD (MMIO)
10-011: Store QWORD (MMIO)
10-111: Store TILE (RAM, Old)
11-010: Swap DWORD (MMIO, Unused)
11-011: Swap QWORD (MMIO, Unused)
11-111: Swap TILE (RAM, Old)
The ring-bus went over to an 8-bit OPM format, which increases the range
of messages that can be sent.
One advantage of the old bus is that the device-side logic is fairly
simple. Typically, the OPM/Addr/Data signals would be mirrored to all of
the devices, with each device having its own OK and DataOut signal.
A sort of crossbar existed, where whichever device sets its OK value to
something other than READY has its OK and Data signals passed back up
the bus.
Also it works because MMIO only allows a single active request at a time
(and the MMIO bus interface on the ringbus will effectively serialize
all accesses into the MMIO space on a "first come, first serve" basis).
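In C terms, that "crossbar" amounts to something like the following (really combinational logic; the names are made up):
  #define OK_READY 0
  typedef struct { int ok; unsigned long long data_out; } dev_resp;
  /* OPM/Addr/Data are mirrored to all devices; whichever device drives
     its OK to something other than READY gets its OK and DataOut
     passed back up the bus. With only one active MMIO request at a
     time, at most one device responds. */
  static dev_resp select_response(const dev_resp *dev, int ndev)
  {
      dev_resp r = { OK_READY, 0 };
      for (int i = 0; i < ndev; i++) {
          if (dev[i].ok != OK_READY)
              r = dev[i];
      }
      return r;
  }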
Note that accessing MMIO is comparably slow.
Some devices, like the display / VRAM module, have been partly moved
over to the ringbus (with the screen's frame-buffer mapped into RAM),
but still uses the MMIO interface for access to display control
registers and similar.
The SDcard interface still goes over MMIO, but ended up being modified
to allow sending/receiving 8 bytes at a time over SPI (with 8-bit
transfers, accessing the MMIO bus was a bigger source of latency than
actually sending bytes over SPI at 5MHz).
As-is, I am running the SDcard at 12.5 MHz:
16.7MHz and 25MHz did not work reliably;
Going over 25MHz was out-of-spec;
Even with 8-byte transfers, MMIO access can still become a bottleneck.
A UHS-II interface could in theory run at similar speeds to RAM, but
would likely need a different interface to make use of this.
One possibility would be to map the SDcard into the physical address
space as a huge non-volatile RAM-like space (on the ring-bus). Had
on/off considered this a few times, but didn't get to it.
Effectively, it would require redesigning the whole SDcard and
filesystem interface (essentially moving nearly all of the SDcard logic
into hardware).
Post by Robert Finch
Multiple devices access the main DRAM memory via a memory controller.
Several devices that are bus masters have their own ports to the memory
controller and do not use up time on the main system bus tree. The
frame buffer has a streaming data port. The frame buffer streaming cache
is 8kB and loaded in 1kB strips at 800MB/s from the DRAM IIRC. Other
devices share a system cache which is only 16kB due to the limited
number of block RAMs. There are about a half dozen read ports, so the block RAMs
are replicated. With all the ports accessing simultaneously there could
be 8*40*16 MB/s being transferred, or about 5.1 GB/s for reads.
I had put everything on the ring-bus, with the L2 also serving as the
bridge to access external DRAM (via a direct connection to the DDR
interface module).
Post by Robert Finch
The CPU itself has only L1 caches of 8kB I$ and 16kB D$. The D$ can be
dual ported, but is not configured that way ATM due to resource
limitations. The caches will request data in blocks the size of a cache
line. A cache line is broken into four consecutive 128-bit accesses. So,
data comes back from the boot ROM in a burst at 640 MB/s.
In my case:
L1 I$: 16K or 32K
32K helps notably with GLQuake and similar.
Doom works well with 16K.
L1 D$: 16K or 32K
Mostly 32K works well.
Had tried 64K, but bad for timing, and little effect on performance.
IIRC, had evaluated running the CPU at 25MHz with 128K L1 caches and a
small L2 cache, but modeling this had shown that performance would suck
(even if nearly all of the instructions had a 1-cycle latency).
Post by Robert Finch
IIRC there were no display issues with an 800x600x16 bpp display, but I
could not get Thor to do much more than clear the screen. So, it was a
display of random dots that was stable. There is a separate text display
controller with its own dedicated block RAM for displays.
My display module is a little weird, as it was based around a
cell-oriented design:
Cells are typically 128 or 256 bits, representing 8x8 pixels.
Text and 2bpp color-cell modes use 128-bit cells, say:
( 29: 0): Pair of 15-bit colors;
( 31:30): 10
( 61:32): Misc
( 63:62): 00
(127:64): Pixel bits, 8x8x1 bit, raster order
The 4bpp color-cell mode is more like:
( 29: 0): Colors A/B
( 31: 30): 11
( 61: 32): Colors C/D
( 63: 62): 11
( 93: 64): Colors E/F
( 95: 94): 00
(125: 96): Colors G/H
(127:126): 00
(159:128): Pixels A/B (4x4x2)
(191:160): Pixels C/D (4x4x2)
(223:192): Pixels E/F (4x4x2)
(255:224): Pixels G/H (4x4x2)
In the bitmapped modes:
128-bit cell selects 256-color modes (4x4 pixels)
256-bit cell selects hi-color modes (4x4 pixels)
So:
640x400 would be configured as 160x100 cells.
800x600 would be configured as 200x150 cells.
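As a hedged C sketch of decoding one pixel from the 128-bit text / 2-color cell layout above (which color is "A" vs "B", and the exact bit order within the pixel block, are assumptions):
  #include <stdint.h>
  /* 'lo' holds bits 63:0 of the cell (colors + misc), 'hi' holds
     bits 127:64 (the 8x8x1 pixel bits in raster order). */
  static uint16_t cell_pixel_1bpp(uint64_t lo, uint64_t hi, int x, int y)
  {
      uint16_t color_a = (uint16_t)( lo        & 0x7FFF);  /* bits 14:0  */
      uint16_t color_b = (uint16_t)((lo >> 15) & 0x7FFF);  /* bits 29:15 */
      int bit = (int)((hi >> (y * 8 + x)) & 1);            /* raster order */
      return bit ? color_b : color_a;
  }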
The 800x600 256-color mode held up OK when I had the display module
outputting at a non-standard 36Hz refresh, but increasing this to a more
standard 72Hz blows out the memory bandwidth.
Theoretically, the DDR RAM interface could support these resolutions if
all the timing and latency were good. But, it is not so good when
implemented by the display module hammering out a series of prefetch
requests over the ring-bus just ahead of the current raster position.
Though, the cell-oriented display modes still work better than my
attempt at a linear framebuffer mode (due to cache/timing issues, not
even a 320x200 linear framebuffer mode worked without looking like a
broken mess).
I suspect this is because, with the cell-oriented modes, each cell has 4
or 8 chances for the prefetch to succeed before it actually gets drawn,
whereas in the linear raster mode, there is only 1 chance.
It is likely that a linear framebuffer would require two stages:
Prefetch 1: Somewhat ahead of current raster position, hopefully gets
data into L2;
Prefetch 2: Closer to the raster position, intended to actually fetch
the pixel data.
Prefetches are used here rather than actual loads, mostly because these
will get cleaned up quickly, whereas with actual fetches, a back-log
scenario would result in the whole bus getting clogged up with
unresolved requests.
However, the CPU can use normal loads, since the CPU will patiently wait
for the previous request(s) to finish before doing anything else (and
thus avoids flooding the ring-bus with requests).
However, a downside of prefetches is that one has to keep asking the L2
cache each time whether or not it has the data in question yet.
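As a sketch of the idea (hypothetical, not the actual display logic; the probe/prefetch hooks are stand-ins):
  /* Stand-ins: ask the L2 to start pulling a line in, and check
     whether the L2 now has it. */
  extern void l2_prefetch(unsigned long long addr);
  extern int  l2_has(unsigned long long addr);
  extern void fetch_pixels(unsigned long long addr);
  /* Per step: stage 1 fires a prefetch well ahead of the raster
     position; stage 2 keeps asking the L2 about the nearer address and
     only issues the real fetch once the data is actually present. */
  static void raster_step(unsigned long long line_base, int raster_x,
                          int lead1, int lead2)
  {
      l2_prefetch(line_base + raster_x + lead1);   /* far ahead, fire and forget */
      unsigned long long near_addr = line_base + raster_x + lead2;
      if (l2_has(near_addr))                       /* if not ready, retry next step */
          fetch_pixels(near_addr);
  }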
As for the "BJX2 doesn't always generate smaller .text than RISC-V
issue", went looking at the ASM, and noted there is a big difference:
GCC "-Os" generates very tight and efficient code, but needs to work
within the limits of what the ISA provides;
BGBCC has a bit more to work with, but the relative quality of the
generated code is fairly poor in comparison.
Like, say:
MOV.Q R8, (SP, 40)
.lbl:
MOV.Q (SP, 40), R8
//BGBCC: "Sure why not?..."
...
MOV R2, R9
MOV R9, R2
BRA .lbl
//BGBCC: "Seems fine to me..."
So, I look at the ASM, and once again feel a groan at how crappy a lot
of it is.
Or:
if(!ptr)
...
Was failing to go down the logic path that would have allowed it to use
the BREQ/BRNE instructions (so was always producing a two-op sequence).
Have noticed that code that writes, say:
if(ptr==NULL)
...
Ends up using a 3-instruction sequence, because it doesn't recognize
this pattern as being the same as the "!ptr" case, ...
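One cheap way to handle this in a compiler (a hedged sketch, not how BGBCC is actually structured) is to canonicalize the comparison before branch selection, so both spellings reach the same compare-with-zero path:
  /* Hypothetical IR node and a tiny canonicalization: rewrite
     (x == 0) as !x and (x != 0) as x, so later codegen can pick a
     single compare-with-zero branch (BREQ/BRNE) for both forms. */
  typedef struct expr expr;
  struct expr { int op; expr *lhs, *rhs; long ival; };
  enum { OP_EQ, OP_NE, OP_NOT, OP_CONST, OP_VAR };
  static int is_zero_const(const expr *e)
  { return e && e->op == OP_CONST && e->ival == 0; }
  static expr *canon_null_test(expr *e)
  {
      if ((e->op == OP_EQ || e->op == OP_NE) && is_zero_const(e->rhs)) {
          if (e->op == OP_EQ) { e->op = OP_NOT; e->rhs = 0; }  /* (x==0) -> !x */
          else                { e = e->lhs; }                  /* (x!=0) -> x  */
      }
      return e;
  }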
Did at least find a few more "low hanging fruit" cases that shaved a few
more kB off the binary.
Well, and also added a case to partially optimize:
return(bar());
To merge the 3AC "RET" into the "CSRV" operation, and thus save the use
of a temporary (and roughly two otherwise unnecessary MOV instructions
whenever this happens).
But, ironically, it was still "mostly" generating code with fewer
instructions, despite the still relatively weak code generation at times.
Also it seems:
void foo()
{
//does nothing
}
void bar()
{
...
foo();
...
}
GCC seems to be clever enough to realize that "foo()" does nothing, and
will eliminate the function and function call entirely.
BGBCC has no such optimization.
...