Post by Robert Finch
Post by BGB
Post by Robert Finch
Post by BGB
Post by MitchAlsup
Post by EricP
Post by Robert Finch
Figured it out. Each architectural register in the RAT must refer
to N physical registers, where N is the number of banks. Setting
N to 4 results in a RAT that is only about 50% larger than one
supporting only a single bank. The operating mode is used to
select the physical register. The first eight registers are
shared between all operating modes so arguments can be passed to
syscalls. It is tempting to have eight banks of registers, one
for each hardware interrupt level.
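Roughly, the bank selection might look something like the following C sketch (assuming the 4-bank, 8-shared-register scheme above; the architectural register count and the names are just placeholders):
  #include <stdint.h>
  #define NUM_BANKS       4   /* one bank per operating mode          */
  #define NUM_AREGS      32   /* placeholder architectural reg count  */
  #define NUM_SHARED      8   /* first 8 regs shared across modes     */
  /* One physical-register mapping per (bank, arch reg). */
  static uint8_t rat[NUM_BANKS][NUM_AREGS];
  /* Look up the physical register for arch register 'areg' in operating
     mode 'mode'; regs 0..7 ignore the mode and always use bank 0, so
     syscall arguments stay visible across modes. */
  static uint8_t rat_lookup(int mode, int areg)
  {
      int bank = (areg < NUM_SHARED) ? 0 : mode;
      return rat[bank][areg];
  }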
A consequence of multiple architectural register banks is that each extra
bank keeps a set of mostly unused physical registers attached to it.
A waste.....
Post by EricP
For example, if there are 2 modes User and Super and a bank for each,
since User and Super are mutually exclusive,
64 of your 256 physical registers will be sitting unused, tied to the
other mode's bank, so a max of 75% utilization efficiency.
If you have 8 register banks then only 3/10 of the physical registers
are available to use; the other 7/10 sit idle, attached to arch
registers in other modes, consuming power.
Also you don't have to play overlapped-register-bank games to pass
args to/from syscalls. You can have specific instructions that reach
into other banks: Move To User Reg, Move From User Reg.
Since only syscall passes args into the OS you only need to access
the user mode bank from the OS kernel bank.
A SysCall in My 66000 only saves and restores 24 of the 32 registers.
So when control arrives, there are 8 argument registers from the
Caller and 24 registers from Guest OS already loaded. So, SysCall
handler already has its stack, and a variety of pointers to data
structures it is interested in.
On the way back, RET only restores 24 registers so Guest OS can pass
back as many as 8 result registers.
I had handled it by saving/restoring 64 of the 64 registers...
For syscalls, it basically messes with the registers in the captured
register state for the calling task.
A newer change involves saving/restoring registers more directly
to/from the task context for syscalls, which reduces the task-switch
overhead by around 50% (but is mostly N/A for other kinds of interrupts).
...
I am toying with the idea of adding context save and restore
instructions. I would try to get them to work on a cache line's worth
of data, four registers accessed for read or write at the same time.
Context save / restore would be a macro instruction made up of
sixteen individual instructions, each of which saves or restores four
registers. It is a bit of a hoop to jump through for an infrequently
used operation. However, it is good to have clean context-switch code.
Added the REGS instruction modifier. The modifier causes the
following load or store instruction to repeat using the registers
specified in the register list bitmask for the source or target
register. In theory it can also be applied to other instructions but
that was not the intent. It is pretty much useless for other
instructions, but a register list could be supplied to the MOV
instruction to zero out multiple registers with a single instruction.
Or possibly the ADDI instruction could be used to load a constant
into multiple registers. I could put code in to disable REGS use with
anything other than load and store ops, but why add extra hardware?
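As a rough software rendering of the idea (not the actual Q+ sequencer; the function and array names are made up), expanding a REGS-prefixed store over a register-list bitmask might look like:
  #include <stdint.h>
  /* Hypothetical expansion of a REGS-prefixed store: walk the
     register-list bitmask and emit one store per set bit, bumping the
     destination slot as it goes. A hardware version would do this four
     registers (one cache-line access) at a time. */
  static void regs_store_expand(uint64_t reg_mask, uint64_t *mem,
                                const uint64_t regfile[64])
  {
      int slot = 0;
      for (int r = 0; r < 64; r++) {
          if (reg_mask & (1ull << r))
              mem[slot++] = regfile[r];   /* one store per listed register */
      }
  }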
In my case, it is partly a limitation of not really being able to make
it any wider than it already is without adding a 4th register write port
and likely imposing a 256-bit alignment requirement; and this for a task
that is mostly limited by L1 cache misses...
Like, saving registers would be ~ 40 cycles or so (with another ~ 40
to restore them), saving/restoring 2 registers per cycle with GPRs, if
not for all the L1 misses.
The reason it is not similar for normal function calls (besides these
only saving/restoring the normal registers) is that often the stack is
still "warm" in the L1 cache.
For interrupts, in the time from one interrupt to another, most of the
L1 cache contents from the previous interrupt are already gone.
So, these instruction sequences are around 80% L1 miss penalty, vs
around 5% for normal prologs/epilogs.
This is similar for the inner loops for "memcpy()", which average
roughly 90% L1 miss penalty.
And, say, "memcpy()" averages around 300MB/sec if just copying the
same small buffer over and over again, but then quickly drops to
70MB/sec if copying memory that falls outside the L1 cache.
Though, comparably, it seems that the drop-off from L2 cache to DRAM
is currently a little smaller.
So, the external DRAM interface can push ~ 100MB/sec with the current
interface (supports SWAP operations, moving 512 bits at a time, and
using a sequence number to transition from one request to another).
But, it is around 70MB/s for requests to make it around the ringbus.
Though, I have noted that if things stay within the limits of what
fits in the L2 cache, multiple parties can access the L2 cache at the
same time without too much impact on each other.
So, say, at modest resolutions, the screen refresh does not impact the
CPU, and the rasterizer module is also mostly independent.
Still, about the highest screen resolution it can really sustain
effectively is ~ 640x480 256-color, or ~ 18MB/sec.
This may be more timing related though, since for screen refresh there
is a relatively tight deadline between when the requests start being
sent, and when the L2 cache needs to hit for that request, and failing
this will result in graphical glitches.
Though, generally what it means is, if the framebuffer image isn't in
the L2 cache, it is gonna look like crap; and effectively the limit is
more "how big of a framebuffer can I fit in the L2 cache".
On the XC7A200T, I can afford a 512K L2 cache, which is just big enough
to fit 640x400 or 640x480 (but 800x600 is kinda pushing it, and fights a
bit more with the main CPU).
OTOH, it is likely the case that on the XC7A100T (which can only afford
a 256K L2 cache), 640x400 256-color is pushing it (but color cell mode
still works fine).
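For reference, the raw framebuffer sizes at 256 colors (1 byte/pixel)
work out to roughly:
  640x400 @ 8bpp: 256000 bytes (~250K)
  640x480 @ 8bpp: 307200 bytes (~300K)
  800x600 @ 8bpp: 480000 bytes (~469K)
Which lines up with 640x400/640x480 fitting in a 512K L2 with room to
spare, 800x600 nearly filling it, and 640x400 sitting right at the edge
of a 256K L2.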
Had noted though that setting the screen resolution at one point to
800x600 RGB555 (72 Hz), which pulls around 70MB/sec, was basically
almost entirely broken and seemingly bogged down the CPU (which could
no longer access memory in a timely manner).
Also, it seems stuff running on the CPU can cause screen artifacts in
these modes, presumably by knocking stuff out of the L2 cache.
Also, it seems like despite my ringbus being a lot faster than my
original bus, it has still managed to become an issue due to latency.
But, despite this, on average, things like interlocks and branch-miss
penalties and similar are now still weighing in a fair bit as well
(with interlock penalties closely following cache misses as the main
source of pipeline stalls).
Well, these two combined burn around 30% of the total clock-cycles,
with another ~ 2-3% or so being spent on branches, ...
Well, and my recent effort to improve FPGA timing enough to try to get
it up to 75MHz did have the drawback of "in general" increasing the
number of cycles spent on interlocks (but returning a lot of the
instructions to their original latency values would make the FPGA
timing-constraint issues a bit worse).
But, if I could entirely eliminate these sources of latency, this
would only gain ~30%, and at this point I would either need to somehow
increase the average bundle width, or find ways to reduce the total
number of instructions that need to be executed (both of these being
more compiler-related territory).
Though, OTOH, I have noted that in many cases I am beating RISC-V
(RV64IM) in terms of total ".text" size in XG2 mode (only 32/64/96 bit
encodings) when both are using the same C library, which implies that
I am probably "not doing too badly" on this front either (though,
ideally, I would be "more consistently" beating RISC-V at this metric,
*1).
*1: RV64 "-Os -ffunction-sections -Wl,-gc-sections" is still able to
beat XG2 in terms of having smaller ".text" (though, "-O2" and "-O3"
are bigger; BJX2 Baseline does beat RV64IM, but this is not a fair
test as BJX2 Baseline has 16-bit ops).
Though BGBCC also has an "/Os" option, it seems to have very little
effect in XG2 Mode (it mostly does things to try to increase the
number of 16-bit ops used, which is N/A in XG2).
Where, here, one can use ".text" size as a stand-in for total
instruction count (and by extension, the number of instructions that
need to be executed).
Though, in some past tests, it seemed like RISC-V needed to execute a
larger number of instructions to render each frame in Doom, which
doesn't really follow if both have a roughly similar number of
instructions in the emitted binaries (and if both are essentially
running the same code).
So, something seems curious here...
...
For the Q+ MPU and SOC the bus system is organized like a tree with the
root being at the CPU. The system bus operates with asynchronous
transactions. The bus then fans out through bus bridges to various
system components. Responses coming back from devices are buffered and
merged together onto a more common bus when there are open spaces on
the bus. I think it is fairly fast (well, at least for homebrew FPGA).
Bus accesses are single cycle, but they may have a varying amount of
latency. Writes are “posted” so they are essentially single cycle. Reads
percolate back up the tree to the CPU. It operates at the CPU clock rate
(currently 40MHz) and transfers 128-bits at a time. Maximum peak
transfer rate would then be 640 MB/s. Copying memory is bound to be much
slower due to the read latency. Devices on the bus have a configuration
block which looks something like a PCI config block, so device
addressing may be controlled by the OS.
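As a hypothetical sketch of what such a config block might carry (field names and widths here are illustrative, not the actual Q+ layout):
  #include <stdint.h>
  /* Hypothetical per-device configuration block, loosely modeled on a
     PCI config header; the OS reads the IDs and programs the base
     address to place the device in the address map. */
  typedef struct {
      uint16_t vendor_id;
      uint16_t device_id;
      uint32_t device_class;   /* what kind of device this is      */
      uint64_t base_addr;      /* where the OS maps the device     */
      uint32_t irq_line;       /* which interrupt the device uses  */
  } dev_config_block;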
My original bus was fairly slow:
Put a request on the bus; as it propagates, each layer of the bus holds
the request until it reaches the destination, which sends back an OK
signal. The OK propagates back up the bus to the sender, the sender
then switches to sending an IDLE signal, and the whole process repeats
as the bus "tears down". When this is done, the OK signal switches to
READY, and the bus may then accept another request.
This bus could only handle a single active request at a time, and no
further requests could initiate (anywhere) until the prior request had
finished.
Experimentally, I was hard-pressed to get much over about 6MB/sec over
this bus with 128-bit transfers... (but could get it up to around
16MB/sec with 256-bit SWAP messages). As noted, this kinda sucked...
I then replaced this with a ring-bus:
Every node on the ring passes messages from input to output, and is
able to drop messages onto the bus, or remove/replace messages as
appropriate. If a message is not handled immediately, it circles the
ring until it can be handled.
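A minimal C sketch of the per-node, per-clock behavior (the message fields and names are made up; the real logic is Verilog and handles more cases):
  typedef struct { int valid; int dest; long long payload; } ring_msg;
  /* One ring node per clock: take a message addressed to this node off
     the ring, otherwise pass it along; if the slot is empty, inject a
     locally pending message. Unclaimed messages keep circling. */
  static ring_msg ring_node_step(ring_msg in, int my_id,
                                 ring_msg *pending, ring_msg *received)
  {
      if (in.valid && in.dest == my_id) {
          *received = in;          /* message is for this node: consume it */
          in.valid = 0;
      }
      if (!in.valid && pending->valid) {
          in = *pending;           /* empty slot: drop our message onto the ring */
          pending->valid = 0;
      }
      return in;                   /* forwarded to the next node on the ring */
  }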
This bus was considerably faster, but still seems to suffer from latency
issues.
In this case, the latency of the ring bus was higher than the original
bus, but had the advantage that the L1 cache could effectively drop 4
consecutive requests onto the bus and then (in theory) they could all be
handled within a single trip around the ring.
Theoretically, the bus could move 800MB/sec at 50MHz, but practically
it seems to achieve around 70MB/s (which is in turn affected by things
that affect ring latency, like enabling/disabling various "shortcut
paths" or enabling/disabling the second CPU core).
A point-to-point message-passing bus could be possible, and could have
lower latency, but was not done mostly because it seemed more
complicated and expensive than the ring design.
If one has two endpoints, both can achieve around 70MB/s if L2 hits, but
this drops off if the external RAM accesses become the limiting factor.
The RAM interface is using a modified version of the original bus, where
both the OPM and OK signals were augmented with sequence numbers, where
when the sent sequence number on OPM comes back via the OK signal, one
can immediately move to the next request (incrementing the sequence number).
While this interface still only allows a single request at a time, this
change effectively doubles the throughput. The main reason for using
this interface to talk to external RAM, is that the interface works
across clock-domain crossings (as-is, the ring-bus requests can't
survive a clock-domain crossing).
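Roughly, the requester side of that handshake might look like the following C-style sketch (the signal accessors are stand-ins for the actual OPM/OK wires, and the sequence-number width is a guess):
  /* Stand-ins for driving OPM (with sequence number) and sampling the
     sequence number echoed back on OK. */
  extern void drive_opm(int op, int seq, unsigned long long addr);
  extern int  ok_seq(void);
  /* Issue requests back-to-back: present a request plus sequence
     number, wait for the same number to come back across the clock
     domain on OK, then immediately move to the next request. */
  static void issue_requests(int op, int nreq, unsigned long long *addrs)
  {
      int seq = 0;
      for (int i = 0; i < nreq; i++) {
          drive_opm(op, seq, addrs[i]);
          while (ok_seq() != seq)
              ;                        /* wait for the echoed sequence number */
          seq = (seq + 1) & 7;         /* sequence field width is a guess */
      }
  }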
Most of the MMIO devices are still operating on a narrower version of
the original bus, say:
5b: OPM
28b: Addr
64b: DataIn
64b: DataOut
2b: OK
Where, OPM:
00-000: IDLE
00-zzz: Special Command (if zzz!=000)
01-010: Load DWORD (MMIO)
01-011: Load QWORD (MMIO)
01-111: Load TILE (RAM, Old)
10-010: Store DWORD (MMIO)
10-011: Store QWORD (MMIO)
10-111: Store TILE (RAM, Old)
11-010: Swap DWORD (MMIO, Unused)
11-011: Swap QWORD (MMIO, Unused)
11-111: Swap TILE (RAM, Old)
The ring-bus went over to an 8-bit OPM format, which increases the range
of messages that can be sent.
One advantage of the old bus is that the device-side logic is fairly
simple. Typically, the OPM/Addr/Data signals would be mirrored to all of
the devices, with each device having its own OK and DataOut signal.
A sort of crossbar existed, where whichever device sets its OK value to
something other than READY has its OK and Data signals passed back up
the bus.
Also it works because MMIO only allows a single active request at a time
(and the MMIO bus interface on the ringbus will effectively serialize
all accesses into the MMIO space on a "first come, first serve" basis).
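In C terms, that "crossbar" amounts to something like the following (really combinational logic; the names are made up):
  #define OK_READY 0
  typedef struct { int ok; unsigned long long data_out; } dev_resp;
  /* OPM/Addr/Data are mirrored to all devices; whichever device drives
     its OK to something other than READY gets its OK and DataOut
     passed back up the bus. With only one active MMIO request at a
     time, at most one device responds. */
  static dev_resp select_response(const dev_resp *dev, int ndev)
  {
      dev_resp r = { OK_READY, 0 };
      for (int i = 0; i < ndev; i++) {
          if (dev[i].ok != OK_READY)
              r = dev[i];
      }
      return r;
  }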
Note that accessing MMIO is comparably slow.
Some devices, like the display / VRAM module, have been partly moved
over to the ringbus (with the screen's frame-buffer mapped into RAM),
but still uses the MMIO interface for access to display control
registers and similar.
The SDcard interface still goes over MMIO, but ended up being modified
to allow sending/receiving 8 bytes at a time over SPI (with 8-bit
transfers, accessing the MMIO bus was a bigger source of latency than
actually sending bytes over SPI at 5MHz).
As-is, I am running the SDcard at 12.5 MHz:
16.7MHz and 25MHz did not work reliably;
Going over 25MHz was out-of-spec;
Even with 8-byte transfers, MMIO access can still become a bottleneck.
A UHS-II interface could in theory run at similar speeds to RAM, but
would likely need a different interface to make use of this.
One possibility would be to map the SDcard into the physical address
space as a huge non-volatile RAM-like space (on the ring-bus). Had
on/off considered this a few times, but didn't get to it.
Effectively, it would require redesigning the whole SDcard and
filesystem interface (essentially moving nearly all of the SDcard logic
into hardware).
Post by Robert Finch
Multiple devices access the main DRAM memory via a memory controller.
Several devices that are bus masters have their own ports to the memory
controller and do not use up time on the main system bus tree. The
frame buffer has a streaming data port. The frame buffer streaming cache
is 8kB and loaded in 1kB strips at 800MB/s from the DRAM IIRC. Other
devices share a system cache which is only 16kB due to the limited
number of block RAMs. There are about a half dozen read ports, so the block RAMs
are replicated. With all the ports accessing simultaneously there could
be 8*40*16 MB/s being transferred, or about 5.1 GB/s for reads.
I had put everything on the ring-bus, with the L2 also serving as the
bridge to access external DRAM (via a direct connection to the DDR
interface module).
Post by Robert Finch
The CPU itself has only L1 caches of 8kB I$ and 16kB D$. The D$ can be
dual ported, but is not configured that way ATM due to resource
limitations. The caches will request data in blocks the size of a cache
line. A cache line is broken into four consecutive 128-bit accesses. So,
data comes back from the boot ROM in a burst at 640 MB/s.
In my case:
L1 I$: 16K or 32K
32K helps notably with GLQuake and similar.
Doom works well with 16K.
L1 D$: 16K or 32K
Mostly 32K works well.
Had tried 64K, but bad for timing, and little effect on performance.
IIRC, had evaluated running the CPU at 25MHz with 128K L1 caches and a
small L2 cache, but modeling this had shown that performance would suck
(even if nearly all of the instructions had a 1-cycle latency).
Post by Robert Finch
IIRC there were no display issues with an 800x600x16 bpp display, but I
could not get Thor to do much more than clear the screen. So, it was a
display of random dots that was stable. There is a separate text display
controller with its own dedicated block RAM for displays.
My display module is a little weird, as it was based around a
cell-oriented design:
Cells are typically 128 or 256 bits, representing 8x8 pixels.
Text and 2bpp color-cell modes use 128-bit cells, say:
( 29: 0): Pair of 15-bit colors;
( 31:30): 10
( 61:32): Misc
( 63:62): 00
(127:64): Pixel bits, 8x8x1 bit, raster order
The 4bpp color-cell mode is more like:
( 29: 0): Colors A/B
( 31: 30): 11
( 61: 32): Colors C/D
( 63: 62): 11
( 93: 64): Colors E/F
( 95: 94): 00
(125: 96): Colors G/H
(127:126): 00
(159:128): Pixels A/B (4x4x2)
(191:160): Pixels C/D (4x4x2)
(223:192): Pixels E/F (4x4x2)
(255:224): Pixels G/H (4x4x2)
In the bitmapped modes:
128-bit cell selects 256-color modes (4x4 pixels)
256-bit cell selects hi-color modes (4x4 pixels)
So:
640x400 would be configured as 160x100 cells.
800x600 would be configured as 200x150 cells.
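As a hedged C sketch of decoding one pixel from the 128-bit text / 2-color cell layout above (which color is "A" vs "B", and the exact bit order within the pixel block, are assumptions):
  #include <stdint.h>
  /* 'lo' holds bits 63:0 of the cell (colors + misc), 'hi' holds
     bits 127:64 (the 8x8x1 pixel bits in raster order). */
  static uint16_t cell_pixel_1bpp(uint64_t lo, uint64_t hi, int x, int y)
  {
      uint16_t color_a = (uint16_t)( lo        & 0x7FFF);  /* bits 14:0  */
      uint16_t color_b = (uint16_t)((lo >> 15) & 0x7FFF);  /* bits 29:15 */
      int bit = (int)((hi >> (y * 8 + x)) & 1);            /* raster order */
      return bit ? color_b : color_a;
  }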
The 800x600 256-color mode held up OK when I had the display module
outputting at a non-standard 36Hz refresh, but increasing this to a more
standard 72Hz blows out the memory bandwidth.
Theoretically, the DDR RAM interface could support these resolutions if
all the timing and latency were good. But, it is not so good when
implemented by the display module hammering out a series of prefetch
requests over the ring-bus just ahead of the current raster position.
Though, the cell-oriented display modes still work better than my
attempt at a linear framebuffer mode (due to cache/timing issues, not
even a 320x200 linear framebuffer mode worked without looking like a
broken mess).
I suspect this is because, with the cell-oriented modes, each cell has 4
or 8 chances for the prefetch to succeed before it actually gets drawn,
whereas in the linear raster mode, there is only 1 chance.
It is likely that a linear framebuffer would require two stages:
Prefetch 1: Somewhat ahead of current raster position, hopefully gets
data into L2;
Prefetch 2: Closer to the raster position, intended to actually fetch
the pixel data.
Prefetches are used here rather than actual loads, mostly because these
will get cleaned up quickly, whereas with actual fetches, a back-log
scenario would result in the whole bus getting clogged up with
unresolved requests.
However, the CPU can use normal loads, since the CPU will patiently wait
for the previous request(s) to finish before doing anything else (and
thus avoids flooding the ring-bus with requests).
However, a downside of prefetches is that one has to keep asking the L2
cache each time whether or not it has the data in question yet.
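As a sketch of the idea (hypothetical, not the actual display logic; the probe/prefetch hooks are stand-ins):
  /* Stand-ins: ask the L2 to start pulling a line in, and check
     whether the L2 now has it. */
  extern void l2_prefetch(unsigned long long addr);
  extern int  l2_has(unsigned long long addr);
  extern void fetch_pixels(unsigned long long addr);
  /* Per step: stage 1 fires a prefetch well ahead of the raster
     position; stage 2 keeps asking the L2 about the nearer address and
     only issues the real fetch once the data is actually present. */
  static void raster_step(unsigned long long line_base, int raster_x,
                          int lead1, int lead2)
  {
      l2_prefetch(line_base + raster_x + lead1);   /* far ahead, fire and forget */
      unsigned long long near_addr = line_base + raster_x + lead2;
      if (l2_has(near_addr))                       /* if not ready, retry next step */
          fetch_pixels(near_addr);
  }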
As for the "BJX2 doesn't always generate smaller .text than RISC-V
issue", went looking at the ASM, and noted there is a big difference:
GCC "-Os" generates very tight and efficient code, but needs to work
within the limits of what the ISA provides;
BGBCC has a bit more to work with, but the relative quality of the
generated code is fairly poor in comparison.
Like, say:
MOV.Q R8, (SP, 40)
.lbl:
MOV.Q (SP, 40), R8
//BGBCC: "Sure why not?..."
...
MOV R2, R9
MOV R9, R2
BRA .lbl
//BGBCC: "Seems fine to me..."
So, I look at the ASM, and once again feel a groan at how crappy a lot
of it is.
Or:
if(!ptr)
...
Was failing to go down the logic path that would have allowed it to use
the BREQ/BRNE instructions (so was always producing a two-op sequence).
Have noticed that code that writes, say:
if(ptr==NULL)
...
Ends up using a 3-instruction sequence, because it doesn't recognize
this pattern as being the same as the "!ptr" case, ...
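One cheap way to handle this in a compiler (a hedged sketch, not how BGBCC is actually structured) is to canonicalize the comparison before branch selection, so both spellings reach the same compare-with-zero path:
  /* Hypothetical IR node and a tiny canonicalization: rewrite
     (x == 0) as !x and (x != 0) as x, so later codegen can pick a
     single compare-with-zero branch (BREQ/BRNE) for both forms. */
  typedef struct expr expr;
  struct expr { int op; expr *lhs, *rhs; long ival; };
  enum { OP_EQ, OP_NE, OP_NOT, OP_CONST, OP_VAR };
  static int is_zero_const(const expr *e)
  { return e && e->op == OP_CONST && e->ival == 0; }
  static expr *canon_null_test(expr *e)
  {
      if ((e->op == OP_EQ || e->op == OP_NE) && is_zero_const(e->rhs)) {
          if (e->op == OP_EQ) { e->op = OP_NOT; e->rhs = 0; }  /* (x==0) -> !x */
          else                { e = e->lhs; }                  /* (x!=0) -> x  */
      }
      return e;
  }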
Did at least find a few more "low hanging fruit" cases that shaved a few
more kB off the binary.
Well, and also added a case to partially optimize:
return(bar());
To merge the 3AC "RET" into the "CSRV" operation, and thus save the use
of a temporary (and roughly two otherwise unnecessary MOV instructions
whenever this happens).
But, ironically, it was still "mostly" generating code with fewer
instructions, despite the still relatively weak code generation at times.
Also it seems:
void foo()
{
//does nothing
}
void bar()
{
...
foo();
...
}
GCC seems to be clever enough to realize that "foo()" does nothing, and
will eliminate the function and function call entirely.
BGBCC has no such optimization.
...