Discussion:
Spills, Fills, and Kills
Jonathan Brandmeyer
2020-09-28 04:42:47 UTC
Permalink
In an older Itanium discussion on this newsgroup, someone mentioned this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It looks like a technical report (or draft?) that wasn't published at a conference or journal.

The headline numbers are interesting: A reduction in memory writes by 50% and reads by 30% or so through intercepting register fills and spills at call boundaries. Programs got faster by 5-40%. By leveraging their unique requirements, it's possible to provide special-purpose load and store instructions to maintain a separate spill/fill stack that is much like the program's own stack in most other respects. Given recent interest in splitting stacks as a barrier to ROP attacks, I found the idea to be extra relevant.

Uh, why was this a dead-end? It wasn't published, and recent architectures have omitted any special support for spilling and filling the machine registers. x86 has dedicated push/pop instructions, but high-performance implementations seem to get away with little more than acceleration of the stack pointer arithmetic.

My best guess is that the authors' baseline was too pessimistic. They don't mention the bias between callee-saved and call-clobbered in their assumed ABI. I'm guessing that the bias was entirely in favor of call-clobbered registers. Modern ABIs reserve about half of the architectural registers as callee-saved. This greatly reduces (but does not eliminate) save/restore traffic by making some of it lazy and hoisting it around the program. In effect, the architectural registers themselves are acting to intercept some of the spill/fill traffic that these authors were trying to save with the ISA.
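
To make that concrete, here is a toy C fragment (the ABI split in the comment is hypothetical, not any particular ISA). A value that is live across calls either gets spilled and reloaded around every call site, or parks in a callee-saved register and pays only one save/restore in the enclosing function's prologue and epilogue:

/* 'sum' is live across both calls to f().  With only call-clobbered
 * registers available, the compiler must store 'sum' before each call
 * and reload it afterwards: 2 stores + 2 loads per iteration.  With a
 * callee-saved register, 'sum' stays put for the whole loop and the
 * only memory traffic is one save/restore pair in loop_sum() itself,
 * hoisted out of the loop. */
extern long f(long x);

long loop_sum(const long *a, int n)
{
    long sum = 0;                  /* live across every call below */
    for (int i = 0; i < n; i++) {
        sum += f(a[i]);
        sum += f(a[i] ^ 1);
    }
    return sum;
}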

Are there more recent studies on the fraction of memory references that are spill/fill traffic? Speaking of which, whatever happened to Simplescalar? It seems to have died in the Great ISA Winnowing of the Twenty-Aughts. Was it ever replaced?

What are some good references that studied the impact of varying the fraction of registers which are callee-saved? Posts in this forum cite that the balance is mixed: Some applications prefer more callee-saved while others prefer less, so the 50/50 mix is a compromise. I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
MitchAlsup
2020-09-28 14:41:27 UTC
Permalink
Post by Jonathan Brandmeyer
In an older Itanium discussion on this newsgroup, someone mentioned this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It looks like a technical report (or draft?) that wasn't published at a conference or journal.
The headline numbers are interesting: A reduction in memory writes by 50% and reads by 30% or so through intercepting register fills and spills at call boundaries. Programs got faster by 5-40%. By leveraging their unique requirements, its possible to provide special-purpose load and store instructions to maintain a separate spill/fill stack that is much like the program's own stack in most other respects. Given recent interest in splitting stacks as a barrier to ROP attacks, I found the idea to be extra relevant.
Uh, why was this a dead-end? It wasn't published, and recent architectures have omitted any special support for spilling and filling the machine registers. x86 has dedicated push/pop instructions, but high-performance implementations seem to get away with little more than acceleration of the stack pointer arithmetic.
My 66000 has instructions that save and restore registers on the stack on
behalf of subroutines including setting up new stack and frame pointers.
Post by Jonathan Brandmeyer
My best guess is that the author's baseline was too pessimistic. They don't mention the bias between callee-saved and call-clobbered in their assumed ABI. I'm guessing that the bias was entirely in favor of call-clobbered registers.
Around 50% of called subroutines are leaf level and the vast majority of
these don't need any save/restore because they can "run" in the argument
and temporary (callee destroyed) registers.
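
For example (plain C; the register comments are a sketch of generic RISC-style codegen, not a specific ISA), the classic string-length loop is exactly such a leaf:

/* The argument arrives in an argument register, the cursor lives in a
 * call-clobbered temporary, and the routine never touches the stack:
 * no prologue, no epilogue, no save/restore at all. */
unsigned long toy_strlen(const char *s)   /* hypothetical name */
{
    const char *p = s;
    while (*p)
        p++;
    return (unsigned long)(p - s);
}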
Post by Jonathan Brandmeyer
Modern ABIs reserve about half of the architectural registers as callee-saved. This greatly reduces (but does not eliminate) save/restore traffic by making some of it lazy and hoisting it around the program. In effect, the architectural registers themselves are acting to intercept some of the spill/fill traffic that these authors were trying to save with the ISA.
It seems to me that as long as:
a) simple leaf-level subroutines can "run" without any overhead,
b) non-leaf subroutines can save/restore as few registers as needed,
c) hardware has the width to save/restore registers at cache-access-width
...per cycle, and
d) the ISA supports the ABI by compressing all the overhead into 1 instruction
...at each end of the call,
then the vast majority of the overhead has already disappeared.

Adding laziness to the save/restore process may not "buy" that much.
Post by Jonathan Brandmeyer
Are there more recent studies on the fraction of memory references that are spill/fill traffic? Speaking of which, whatever happened to Simplescalar? It seems to have died in the Great ISA Winnowing of the Twenty-Aughts. Was it ever replaced?
What are some good references that studied the impact of varying the fraction of registers which are callee-saved?
Stanford MIPS (pre company MIPS) had a few papers on stuff like this--studying
the number of registers versus performance. But I know of nothing recent.

In the My 66000 ABI I designed the support instructions with the flexibility
to increase/decrease the number of items carried in registers as arguments
and as results, and at the other end to increase/decrease the number of
callee-saved (preserved) registers--so that if someone comes to me with
a good argument that moving this boundary (or that one) will improve performance
(or decrease overhead), I can simply adjust the numbers in the ABI and the
support instructions need no change.

In addition, the support instructions have to allow for stack unwind
{TRY, THROW, CATCH} and destructors (C++ longjmp) to be implemented
efficiently.
Post by Jonathan Brandmeyer
Posts in this forum cite that the balance is mixed: Some applications prefer more callee-saved while others prefer less, so the 50/50 mix is a compromise.
It is not so much "applications" as individual subroutines.

The balance ALSO depends on the number of registers and their width (along
with pairing and sharing rules--if any). In a machine with thirty-two 64-bit
registers running in a 64-bit virtual address space, 50% seems to be a good
balance. Simple leaf subroutines can perform their duties in the unpreserved
register set, needing no stack or stack space. Non-simple leaf routines
can allocate stack space for local variables but might not need any of the
preserved registers as they have "enough" in the temporary registers.
(Obviously, x86 and x86-64 have none of these properties.)
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
From what I see, there is a fairly wide (low gradient) "curve"; one must have:

a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.

32 registers "seems" to be about the right number to fulfill all these goals,
especially if several of them have not been consumed (MIPS) for various
"other duties".

I did talk with a compiler "guy" who came from CRAY, and he had good arguments
as to why all the registers should be caller save.
Ivan Godard
2020-09-28 15:51:21 UTC
Permalink
Post by MitchAlsup
Post by Jonathan Brandmeyer
In an older Itanium discussion on this newsgroup, someone mentioned this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It looks like a technical report (or draft?) that wasn't published at a conference or journal.
The headline numbers are interesting: A reduction in memory writes by 50% and reads by 30% or so through intercepting register fills and spills at call boundaries. Programs got faster by 5-40%. By leveraging their unique requirements, its possible to provide special-purpose load and store instructions to maintain a separate spill/fill stack that is much like the program's own stack in most other respects. Given recent interest in splitting stacks as a barrier to ROP attacks, I found the idea to be extra relevant.
Uh, why was this a dead-end? It wasn't published, and recent architectures have omitted any special support for spilling and filling the machine registers. x86 has dedicated push/pop instructions, but high-performance implementations seem to get away with little more than acceleration of the stack pointer arithmetic.
My 66000 has instructions that save and restore registers on the stack on
behalf of subroutines including setting up new stack and frame pointers.
Post by Jonathan Brandmeyer
My best guess is that the author's baseline was too pessimistic. They don't mention the bias between callee-saved and call-clobbered in their assumed ABI. I'm guessing that the bias was entirely in favor of call-clobbered registers.
Around 50% of called subroutines are leaf level and the vast majority of
these don't need any save/restore because they can "run" in the argument
and temporary (callee destroyed) registers.
Post by Jonathan Brandmeyer
Modern ABIs reserve about half of the architectural registers as callee-saved. This greatly reduces (but does not eliminate) save/restore traffic by making some of it lazy and hoisting it around the program. In effect, the architectural registers themselves are acting to intercept some of the spill/fill traffic that these authors were trying to save with the ISA.
a) simple leaf level subroutines can "run" without any overhead,
b) non-leaf subroutines can save/restore as few registers as needed,
c) hardware has the width to save/restore registers at cache-access-width
...per cycle
d) ISA supports the ABI by compressing all the overhead into 1 instruction
...at each end of the call.
The vast majority of the overhead has already disappeared.
Adding laziness to the save/restore process may not "buy" that much.
Post by Jonathan Brandmeyer
Are there more recent studies on the fraction of memory references that are spill/fill traffic? Speaking of which, whatever happened to Simplescalar? It seems to have died in the Great ISA Winnowing of the Twenty-Aughts. Was it ever replaced?
What are some good references that studied the impact of varying the fraction of registers which are callee-saved?
Stanford MIPS (pre company MIPS) had a few papers on stuff like this--studying
the number of registers versus performance. But I know of nothing recent.
In My 66000 ABI I designed the support instructions with the flexibility
to increase/decrease the number of items carried in registers as arguments
and as results: and at the other end to increase/decrease the number of
callee saved (preserved) registers--So that if someone comes to me with
a good argument as moving this boundary (or that) will improve performance
(or decrease overhead) I can simply adjust the numbers in ABI and the
support instructions need no change.
In additions the support instructions have to allow for stack unwind
{TRY, THROW, CATCH} and destructors (C++ longjump) to be implemented
efficiently.
Post by Jonathan Brandmeyer
Posts in this forum cite that the balance is mixed: Some applications prefer more callee-saved while others prefer less, so the 50/50 mix is a compromise.
It is not so much "applications" but individual subroutines.
The balance ALSO depends on the number of registers and their width (along
with paring and sharing rules--if any). In a machine with thirty-two 64-bit
registers running in a 64-bit virtual address space, 50% seems to be a good
balance. Simple leaf subroutines can perform their duties in the unpreserved
register set, needing no stack or stack space. Non-simple leaf routines
can allocate stack space for local variables but might not need any of the
preserved registers as they have "enough" in the temporary registers.
(Obviously, x86 and x86-64 have none of these properties.)
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.
32 registers "seems" to be about the right number to fulfill all these goals,
especially is several of them have not been consumed (MIPS) for various
"other duties.
I did talk with a compiler "guy" when came from CRAY and he has good arguments
as to why all the registers should be caller save.
Mill is effectively 100% caller save, as each frame gets a whole new
belt populated with the actual arguments. 16 positions seems to be the
sweet spot for mid-range configurations; 32 seems overkill for even the
largest we have explored.

However, the Mill spiller spills more than the belt: there's the
scratchpad address space, in-flight long-latency results, and most
importantly in-flight loads. Mitch doesn't have an equivalent of the
Mill scratchpad, and from the discussion it seems like he has some
mechanism to let calls proceed even though not all of the pre-call
instructions have retired yet. However, it's not clear to me how he
handles in-flight loads.

If a call is allowed to proceed without all pre-call loads retiring then
the LSQ/LSB must be prepared to handle matches and faults that do not
belong to the current frame. Mill handles this by spilling the in-flight
loads (lazily) and reissuing the load after return if it runs out of
buffers during the callee.

So what does My 66000 do if it enters a call while there's an outstanding
load to a callee-save register, and the load faults after the registers
have been "saved"?
MitchAlsup
2020-09-28 16:53:36 UTC
Permalink
Post by Ivan Godard
Post by MitchAlsup
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.
32 registers "seems" to be about the right number to fulfill all these goals,
especially is several of them have not been consumed (MIPS) for various
"other duties.
I did talk with a compiler "guy" when came from CRAY and he has good arguments
as to why all the registers should be caller save.
Mill is effectively 100% caller save, as each frame gets a whole new
belt populated with the actual arguments. 16 positions seems to be the
sweet spot for mid-range configurations; 32 seems overkill for even the
largest we have explored.
I am curious as to what you do when calling those simple little subroutines
from the str*/mem* library, as these need only a few registers {arguments,
operands, results, return values}, typically 2, 3, or 4.
Post by Ivan Godard
However, the Mill spiller spills more than the belt: there's the
scratchpad address space, in-flight long-latency results, and most
importantly in-flight loads. Mitch doesn't have an equivalent of the
Mill scratchpad, and from the discussion it seems like he has some
mechanism to let calls proceed even though not all of the pre-call
instructions have retired yet. However, it's not clear to me how he
handles in-flight loads.
In-flight loads are handled by the standard RAW hazard scheme; different
implementations may do different things. The low end may simply search the
pipeline; higher-end machines will rename registers so the consumer always
gets the right version.

Yes, My 66000 has nothing akin to a scratch pad.

In a higher-end My 66000, the EXIT instruction loads the preserved registers
and performs the control transfer back to the caller. The return IP is accessed
first so that instructions at the return point can be fetched. The other
preserved registers are then loaded, and consumers of those registers yet to be
loaded will wait for the value to become available.

Going the other direction, the ENTER instruction saves the preserved registers;
if the value in a register is not yet available, the store remains pending
until the data shows up, at which time it can be performed. One can imagine a
high-end My 66000 simply rewriting the data-flow so that yet-to-arrive data gets
routed directly to the store and does not even have to pass through the RF.

Simple leaf subroutines allow the return address which arrives in R0 to remain
in R0, operate out of the temporary registers (R1-R15) and return (MOV IP,R0)
without saving or restoring ANYTHING. So support is there when you want it,
and absent when you don't need it. In the simple leaf, there is no actual
overhead (unless you count the return instruction itself as overhead.)
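
In rough C-level pseudocode, the architectural effect is something like this (a sketch only -- the register split, operand form, and exact stack layout here are illustrative, not the real encoding):

#include <stdint.h>
#include <stdio.h>

#define NREGS   32
#define NWORDS  1024                /* simulated stack, in 64-bit words   */

typedef struct {
    uint64_t r[NREGS];              /* r[0] carries the return IP at entry */
    uint64_t ip;
    int      sp, fp;                /* word indices into stk[], grow down  */
    uint64_t stk[NWORDS];
} cpu_t;

/* ENTER: save the return IP and the preserved registers r[first..31],
 * then establish new frame/stack pointers -- one architectural op.       */
static void enter(cpu_t *c, int first, int local_words)
{
    c->stk[--c->sp] = c->r[0];                  /* return IP              */
    for (int i = first; i < NREGS; i++)
        c->stk[--c->sp] = c->r[i];              /* HW may leave this store
                                                   pending until the data
                                                   actually arrives       */
    c->fp = c->sp;
    c->sp -= local_words;
}

/* EXIT: reload the preserved registers and transfer control back.
 * Hardware reads the return IP first so fetch can run ahead; in this
 * serial model the order is irrelevant.                                   */
static void exit_frame(cpu_t *c, int first)
{
    int p = c->fp;
    for (int i = NREGS - 1; i >= first; i--)
        c->r[i] = c->stk[p++];
    c->ip = c->stk[p++];                        /* return IP              */
    c->sp = p;
}

int main(void)
{
    cpu_t c = { .sp = NWORDS, .fp = NWORDS, .r = { [0] = 0x1234 } };
    c.r[20] = 42;                               /* a preserved value      */
    enter(&c, 16, 8);
    c.r[20] = 0xdead;                           /* callee clobbers it     */
    exit_frame(&c, 16);
    printf("ip=%#llx r20=%llu\n",
           (unsigned long long)c.ip, (unsigned long long)c.r[20]);
    return 0;                                   /* ip=0x1234 r20=42       */
}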
Post by Ivan Godard
If a call is allowed to proceed without all pre-call loads retiring then
the LSQ/LSB must be prepared to handle matches and faults that do not
belong to the current frame. Mill handles this by spilling the in-flight
loads (lazily) and reissuing the load after return if it runs out of
buffers during the callee.
So what does My6600 do if it enters a call while there's an outstanding
load to a callee-save register, and the load faults after the registers
have been "saved"?
It is implementation dependent; however, RAW and WAR (and WAW) dependencies
are obeyed in all implementations {as if each instruction were executed
serially, back to back}.

Faults will back the machine up to the instruction causing the fault.
{Atomic Events can back up to a slightly different, well defined, place}
Ivan Godard
2020-09-28 20:46:51 UTC
Permalink
Post by MitchAlsup
Post by Ivan Godard
Post by MitchAlsup
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.
32 registers "seems" to be about the right number to fulfill all these goals,
especially is several of them have not been consumed (MIPS) for various
"other duties.
I did talk with a compiler "guy" when came from CRAY and he has good arguments
as to why all the registers should be caller save.
Mill is effectively 100% caller save, as each frame gets a whole new
belt populated with the actual arguments. 16 positions seems to be the
sweet spot for mid-range configurations; 32 seems overkill for even the
largest we have explored.
I am curious as to what you do when calling those simple little subroutines
from the str* mem* library, as these need only a few registers {arguments,
operands, results, return values} typically 2-3-and-4.
They are treated just like big subroutines. Recall there is no register
file to spill, just a massive bypass that spills lazily, and Mill does
result replay, so it never needs to back up.

If a value is in (or is in-flight and will be delivered to) a result
latch, it just stays there until some other value will deliver to that
latch, whereupon it is moved to a spill buffer in the spiller. The spill
buffer itself acts like just another FU result latch w/r/t the bypass
network which is the realization of the abstract belt. The spill buffer
contents themselves are moved lazily to SRAM and thence to DRAM as
necessary, and back as demanded.

Hence if a small leaf function doesn't use a particular FU result path
then whatever is already in the path is never spilled, while a callee
that uses everything will eventually spill all the caller's data. For
one of the str* functions, likely the only spill will be the result
latch of one of the ALUs that is used for the index increment, and only
one of the spiller's spill buffers will be used.

Note that arguments are passed wherever they lie in the caller, so that
for example the base pointer in a str* call is never moved and never
spilled. Likewise results are returned wherever they lie, and are also
not moved, nor filled from the spiller if they are in a spill buffer,
whereas if they have been migrated to SRAM/DRAM they will migrate back
after return to the caller. The caller will stall if they haven't been
migrated back into reachability when the caller needs their value from
the belt. Migration stalls are very rare because results are usually the
last thing computed.

Lastly, kill is automatic on a Mill because values "fall off the belt".
A live value in a result latch may become dead before that latch is
again needed for some new FU result, and will be simply overwritten
rather than getting moved to a spill buffer.

In actual code, the spiller and its SRAM/DRAM logic doesn't get involved
except in rapidly nesting call sequences (or unnesting return
sequences), especially recursive nests, because values don't last long
enough to get past the spill buffers.
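
As a policy sketch (toy C, nothing like the real spiller datapath): a result parks in its latch until the latch is reclaimed, and only a still-live occupant costs a move to a spill buffer:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NBUF 8

typedef struct { uint64_t val; bool live; } latch_t;

static latch_t  alu_latch;              /* one FU result latch            */
static uint64_t spill_buf[NBUF];        /* spiller's buffers              */
static int      nspilled;               /* would drain lazily to SRAM     */

/* A new result wants the latch.  Only if the old occupant is still live
 * does it cost a (lazy) move into a spill buffer; dead values are simply
 * overwritten -- the automatic "kill".                                   */
static void deliver(latch_t *l, uint64_t v, bool new_is_live)
{
    if (l->live)
        spill_buf[nspilled++ % NBUF] = l->val;
    l->val  = v;
    l->live = new_is_live;
}

int main(void)
{
    deliver(&alu_latch, 1, true);   /* live result occupies the latch     */
    deliver(&alu_latch, 2, false);  /* old '1' was live -> one spill      */
    deliver(&alu_latch, 3, true);   /* old '2' was dead -> no spill       */
    printf("spills = %d\n", nspilled);   /* prints: spills = 1            */
    return 0;
}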
Post by MitchAlsup
Post by Ivan Godard
However, the Mill spiller spills more than the belt: there's the
scratchpad address space, in-flight long-latency results, and most
importantly in-flight loads. Mitch doesn't have an equivalent of the
Mill scratchpad, and from the discussion it seems like he has some
mechanism to let calls proceed even though not all of the pre-call
instructions have retired yet. However, it's not clear to me how he
handles in-flight loads.
In flight loads are handled by the standard RAW hazard scheme, different
implementations may do different things. Low end my simply search the
pipeline, higher end machines will rename registers so the consumer always
gets the right version.
Yes, My 66000 has nothing akin to a scratch pad.
In some ways the Mill scratchpad can be seen as a software-controlled
version of the separate spill space of the article the OP cited. Both have a
sharply limited (implicit base + manifest offset) addressing ability,
with explicit operations to allocate space for a frame and automatic
unload/reload on excess capacity demand.
Post by MitchAlsup
In a higher end My 66000, the EXIT instruction loads the preserved registers
and perform control transfer back to caller. The return IP is accessed first
so that instructions at the return point can be fetched. The other preserved
registers are loaded and consumers of those register yet to be loaded will
wait for the value to be available.
Going the other direction, the ENTER instruction saves preserved registers,
if the value in that register is not yet available, the store remains pending
until the data show up at which time it can be stored. One can imagine a high
end My 66000 simply rewriting the data-flow such that yet to arrive data gets
routed directly to store and does not even have to pass through the RF.
Simple leaf subroutines allow the return address which arrives in R0 to remain
in R0, operate out of the temporary registers (R1-R15) and return (MOV IP,R0)
without saving or restoring ANYTHING. So support is there when you want it,
and absent when you don't need it. In the simple leaf, there is no actual
overhead (unless you count the return instruction itself as overhead.)
Yes, we are in violent agreement that RISC theology is absurd w/r/t
call semantics, if not everywhere.

If one is stuck with spatial addressing and indefinite lifetime I think
yours is as good as can be done. Mill does better because of the
auto-kill we get from temporal addressing, and the use of result replay;
it has a lot that sits around and enables and improves those, like the
scratchpad, but those two are fundamental.
Post by MitchAlsup
Post by Ivan Godard
If a call is allowed to proceed without all pre-call loads retiring then
the LSQ/LSB must be prepared to handle matches and faults that do not
belong to the current frame. Mill handles this by spilling the in-flight
loads (lazily) and reissuing the load after return if it runs out of
buffers during the callee.
So what does My6600 do if it enters a call while there's an outstanding
load to a callee-save register, and the load faults after the registers
have been "saved"?
It is implementation dependent, however RAW and WAR (and WAW) dependencies
are obeyed in all implementations {as-if each instruction was executed
serially back to back.}
Faults will back the machine up to the instruction causing the fault.
{Atomic Events can back up to a slightly different, well defined, place}
So if both caller and callee load from the same address (but to
different registers) there's logic in the fault detector that can figure
out which was the earlier for backup? I'm a little hazy on the details
of issue replay, other than that it's complicated :-)
MitchAlsup
2020-09-28 21:00:40 UTC
Permalink
Post by Ivan Godard
Post by MitchAlsup
Post by Ivan Godard
Post by MitchAlsup
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can understand better how it relates to dedicated spill/fill support in the ISA. For example, is the balance point in the middle of a wide bathtub, or are some applications still affected by a sharp gradient in the optimization curve?
a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.
32 registers "seems" to be about the right number to fulfill all these goals,
especially is several of them have not been consumed (MIPS) for various
"other duties.
I did talk with a compiler "guy" when came from CRAY and he has good arguments
as to why all the registers should be caller save.
Mill is effectively 100% caller save, as each frame gets a whole new
belt populated with the actual arguments. 16 positions seems to be the
sweet spot for mid-range configurations; 32 seems overkill for even the
largest we have explored.
I am curious as to what you do when calling those simple little subroutines
from the str* mem* library, as these need only a few registers {arguments,
operands, results, return values} typically 2-3-and-4.
They are treated just like big subroutines. Recall there is no register
file to spill, just a massive bypass that spills lazily, and Mill does
result replay so never needs to back up.
If a value is in (or is in-flight and will be delivered to) a result
latch, it just stays there until some other value will deliver to that
latch, whereupon it is moved to a spill buffer in the spiller. The spill
buffer itself acts like just another FU result latch w/r/t the bypass
network which is the realization of the abstract belt. The spill buffer
contents themselves are moved lazily to SRAM and thence to DRAM as
necessary, and back as demanded.
Hence if a small leaf function doesn't use a particular FU result path
then whatever is already in the path is never spilled, while a callee
that uses everything will eventually spill all the caller's data. For
one of the str* functions, likely the only spill will be the result
latch of one of the ALUs that is used for the index increment, and only
one of the spiller's spill buffers will be used.
Note that arguments are passed wherever they lie in the caller, so that
for example the base pointer in a str* call is never moved and never
spilled. Likewise results are returned wherever they lie, and are also
not moved, nor filled from the spiller if they are in a spill buffer,
whereas if they have been migrated to SRAM/DRAM they will migrate back
after return to the caller. The caller will stall if they haven't been
migrated back into reachability when the caller needs their value from
the belt. Migration stalls are very rare because results are usually the
last thing computed.
Lastly, kill is automatic on a Mill because values "fall off the belt".
A live value in a result latch may become dead before that latch is
again needed for some new FU result, and will be simply overwritten
rather than getting moved to a spill buffer.
In actual code, the spiller and its SRAM/DRAM logic doesn't get involved
except in rapidly nesting call sequences (or unnesting return
sequences), especially recursive nests, because values don't last long
enough to get past the spill buffers.
Post by MitchAlsup
Post by Ivan Godard
However, the Mill spiller spills more than the belt: there's the
scratchpad address space, in-flight long-latency results, and most
importantly in-flight loads. Mitch doesn't have an equivalent of the
Mill scratchpad, and from the discussion it seems like he has some
mechanism to let calls proceed even though not all of the pre-call
instructions have retired yet. However, it's not clear to me how he
handles in-flight loads.
In flight loads are handled by the standard RAW hazard scheme, different
implementations may do different things. Low end my simply search the
pipeline, higher end machines will rename registers so the consumer always
gets the right version.
Yes, My 66000 has nothing akin to a scratch pad.
In some ways Mill scratchpad can be seen as a software-controlled
version of the separate spill space of the OP cited article. Both have a
sharply limited (implicit base + manifest offset) addressing ability,
with explicit operations to allocate space for a frame and automatic
unload/reloading excess capacity demand.
Post by MitchAlsup
In a higher end My 66000, the EXIT instruction loads the preserved registers
and perform control transfer back to caller. The return IP is accessed first
so that instructions at the return point can be fetched. The other preserved
registers are loaded and consumers of those register yet to be loaded will
wait for the value to be available.
Going the other direction, the ENTER instruction saves preserved registers,
if the value in that register is not yet available, the store remains pending
until the data show up at which time it can be stored. One can imagine a high
end My 66000 simply rewriting the data-flow such that yet to arrive data gets
routed directly to store and does not even have to pass through the RF.
Simple leaf subroutines allow the return address which arrives in R0 to remain
in R0, operate out of the temporary registers (R1-R15) and return (MOV IP,R0)
without saving or restoring ANYTHING. So support is there when you want it,
and absent when you don't need it. In the simple leaf, there is no actual
overhead (unless you count the return instruction itself as overhead.)
Yes, we are in violent agreement that RISC theology is absurd /w/r/t
call semantics, if not everywhere.
If one is stuck with spatial addressing and indefinite lifetime I think
yours is as good as can be done. Mill does better because of the
auto-kill we get from temporal addressing, and the use of result replay;
it has a lot that sits around and enables and improves those, like the
scratchpad, but those two are fundamental.
Post by MitchAlsup
Post by Ivan Godard
If a call is allowed to proceed without all pre-call loads retiring then
the LSQ/LSB must be prepared to handle matches and faults that do not
belong to the current frame. Mill handles this by spilling the in-flight
loads (lazily) and reissuing the load after return if it runs out of
buffers during the callee.
So what does My6600 do if it enters a call while there's an outstanding
load to a callee-save register, and the load faults after the registers
have been "saved"?
It is implementation dependent, however RAW and WAR (and WAW) dependencies
are obeyed in all implementations {as-if each instruction was executed
serially back to back.}
Faults will back the machine up to the instruction causing the fault.
{Atomic Events can back up to a slightly different, well defined, place}
So if both caller and callee load from the same address (but to
different registers) there's logic in the fault detector that can figure
out which was the earlier for backup? I'm a little hazy on the details
of issue replay, other than that it's complicated :-)
If the addresses of the two loads are "easy to calculate" then it is likely
that both hit or both miss. If they both hit*, they are completed and nothing
more transpires. If one takes a miss, the other will detect the miss, and
both will wait for the memory hierarchy response. If the instructions were
sourced to different memory units, both values may be delivered simultaneously
(on different result buses).

"Easy to calculate" means that the base address and the index value are
available without delay.

(*) in the case where these come from 2 different memory units to the cache,
one will see a bank conflict, and then detect that the accessed cache line
is the same one he is already looking for, and SNARF the data (2 or 3 reads
for the price of one access).

Finally, if one takes an exception, the exception is delivered along with the
result; later, at retirement, when the exception is recognized, the younger
instructions are flushed and we are left with the excepting instruction.
Ivan Godard
2020-09-28 21:09:39 UTC
Permalink
Post by MitchAlsup
Post by Ivan Godard
Post by MitchAlsup
Faults will back the machine up to the instruction causing the fault.
{Atomic Events can back up to a slightly different, well defined, place}
So if both caller and callee load from the same address (but to
different registers) there's logic in the fault detector that can figure
out which was the earlier for backup? I'm a little hazy on the details
of issue replay, other than that it's complicated :-)
If the addresses of the two loads are "easy to calculate" then it is likely
that both hit or both miss. If they both hit*, they are completed and nothing
more transpires. If one takes a miss, the other will detect the miss, and
both will wait for memory hierarchy response. If the instructions were
source to different memory units, both values may be delivered simultaneously
(on different result buses.)
"Easy to calculate" means that the base address and the index value are
available without delay.
(*) in the case where these come from 2 different memory units to the cache,
one will see a bank conflict, and then detect that the accessed cache line
is the same one he is already looking for, and SNARF the data (2 or 3 reads
for the price of one access).
Finally, if one takes an exception, the exception is delivered and all
result points, and later at retirement when the exception is recognized,
the younger instructions are flushed, and we are left with the excepting
instruction.
There's no chance that the elder exception will be delivered and be in
the process of being entered when the younger is delivered on top of it?
Stefan Monnier
2020-09-28 21:12:02 UTC
Permalink
There's no chance that the elder exception will be delivered and be in the
process of being entered when the younger is delivered on top of it?
I can't see how: retirement is done in-order and I don't think an
exception would be raised on an instruction until it reaches retirement,
since until that time, we don't really know if it will really be
architecturally executed.
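
In toy-C terms, what I have in mind is just this (an assumed structure, not any real implementation):

/* Exceptions are only *raised* when an instruction reaches in-order
 * retirement; a faulting instruction that was squashed (wrong path,
 * predicated off) never raises anything.                               */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  id;            /* program order                                */
    bool done;          /* finished executing                           */
    bool faulted;       /* logged a fault while executing               */
    bool squashed;      /* killed by misprediction / predication        */
} rob_entry;

/* Retire strictly in program order; the first un-squashed entry that
 * logged a fault is the one whose exception becomes architectural.     */
static int retire(rob_entry *rob, int n)
{
    for (int i = 0; i < n; i++) {
        if (!rob[i].done)
            return -1;                 /* stall: head not finished      */
        if (rob[i].squashed)
            continue;                  /* never architecturally exists  */
        if (rob[i].faulted)
            return rob[i].id;          /* take this exception and flush
                                          everything younger            */
    }
    return -1;                         /* nothing to raise              */
}

int main(void)
{
    rob_entry rob[] = {
        { 1, true, false, false },
        { 2, true, true,  true  },     /* faulted but squashed: ignored */
        { 3, true, true,  false },     /* oldest surviving fault        */
        { 4, true, true,  false },
    };
    printf("exception taken on insn %d\n", retire(rob, 4));  /* -> 3    */
    return 0;
}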


Stefan
MitchAlsup
2020-09-28 21:17:32 UTC
Permalink
Post by Ivan Godard
Post by MitchAlsup
Post by Ivan Godard
Post by MitchAlsup
Faults will back the machine up to the instruction causing the fault.
{Atomic Events can back up to a slightly different, well defined, place}
So if both caller and callee load from the same address (but to
different registers) there's logic in the fault detector that can figure
out which was the earlier for backup? I'm a little hazy on the details
of issue replay, other than that it's complicated :-)
If the addresses of the two loads are "easy to calculate" then it is likely
that both hit or both miss. If they both hit*, they are completed and nothing
more transpires. If one takes a miss, the other will detect the miss, and
both will wait for memory hierarchy response. If the instructions were
source to different memory units, both values may be delivered simultaneously
(on different result buses.)
"Easy to calculate" means that the base address and the index value are
available without delay.
(*) in the case where these come from 2 different memory units to the cache,
one will see a bank conflict, and then detect that the accessed cache line
is the same one he is already looking for, and SNARF the data (2 or 3 reads
for the price of one access).
Finally, if one takes an exception, the exception is delivered and all
result points, and later at retirement when the exception is recognized,
the younger instructions are flushed, and we are left with the excepting
instruction.
There's no chance that the elder exception will be delivered and be in
the process of being entered when the younger is delivered on top of it?
That is why one sorts out exceptions at retire time, so one has a view over
the instructions in program order. Some instructions may even have several
exceptions logged onto them; after detecting that this is the youngest
instruction containing an exception, the exceptions are picked in a
priority order and control gets transferred.

Branch misprediction, and Predicate shadowing can cause blocks (branch) or
individual instructions (PRED) to have been squashed. Of course these
instructions are prevented from raising the exceptions at hand.
Brett
2020-09-28 23:22:28 UTC
Permalink
Post by Ivan Godard
Post by MitchAlsup
Post by Jonathan Brandmeyer
In an older Itanium discussion on this newsgroup, someone mentioned
this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It
looks like a technical report (or draft?) that wasn't published at a
conference or journal.
The headline numbers are interesting: A reduction in memory writes by
50% and reads by 30% or so through intercepting register fills and
spills at call boundaries. Programs got faster by 5-40%.
I care about cache read/write ports, as that seems to be the limit for 6-wide
designs.

The Mill scratchpad seems to be a huge win here if it packs up register
reads/writes into cache-line reads/writes when it spills/fills call data
to the cache.

So why did Itanic fail so badly? Did they not pack the spiller data?

One would expect the next generation of ISAs to take advantage of stack
operations this way, like maybe Apple Silicon.

A simple buffer on the stack would seem to be all you need.
Though it seems this could be generalized for most read/writes with extra
effort.
Maybe the cache read/write limit is not so hard a limit as I assumed, and
the industry has just not made that leap yet?
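
Something like this toy model is what I'm imagining (the 64-byte line and 8-byte registers are just assumptions):

/* Accumulate individual 8-byte register saves into a 64-byte line
 * buffer and hand the cache one line-wide write instead of eight
 * narrow ones.                                                        */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_WORDS 8                    /* 64-byte line, 8-byte regs   */

static uint64_t line_buf[LINE_WORDS];
static int      fill;                   /* words currently buffered    */
static int      line_writes;            /* what actually hits the port */

static void flush_line(void)
{
    if (fill == 0)
        return;
    /* one wide access to the data cache instead of 'fill' narrow ones */
    line_writes++;
    memset(line_buf, 0, sizeof line_buf);
    fill = 0;
}

static void spill_reg(uint64_t v)
{
    line_buf[fill++] = v;
    if (fill == LINE_WORDS)
        flush_line();
}

int main(void)
{
    for (int r = 0; r < 14; r++)        /* save 14 registers at a call */
        spill_reg((uint64_t)r);
    flush_line();                       /* call boundary: drain        */
    printf("14 register saves -> %d line writes\n", line_writes); /* 2 */
    return 0;
}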
Post by Ivan Godard
Post by MitchAlsup
Post by Jonathan Brandmeyer
By leveraging their unique requirements, its possible to provide
special-purpose load and store instructions to maintain a separate
spill/fill stack that is much like the program's own stack in most
other respects. Given recent interest in splitting stacks as a barrier
to ROP attacks, I found the idea to be extra relevant.
Uh, why was this a dead-end? It wasn't published, and recent
architectures have omitted any special support for spilling and filling
the machine registers. x86 has dedicated push/pop instructions, but
high-performance implementations seem to get away with little more than
acceleration of the stack pointer arithmetic.
My 66000 has instructions that save and restore registers on the stack on
behalf of subroutines including setting up new stack and frame pointers.
Post by Jonathan Brandmeyer
My best guess is that the author's baseline was too pessimistic. They
don't mention the bias between callee-saved and call-clobbered in their
assumed ABI. I'm guessing that the bias was entirely in favor of
call-clobbered registers.
Around 50% of called subroutines are leaf level and the vast majority of
these don't need any save/restore because they can "run" in the argument
and temporary (callee destroyed) registers.
Post by Jonathan Brandmeyer
Modern ABIs reserve about half of the architectural registers as
callee-saved. This greatly reduces (but does not eliminate)
save/restore traffic by making some of it lazy and hoisting it around
the program. In effect, the architectural registers themselves are
acting to intercept some of the spill/fill traffic that these authors
were trying to save with the ISA.
a) simple leaf level subroutines can "run" without any overhead,
b) non-leaf subroutines can save/restore as few registers as needed,
c) hardware has the width to save/restore registers at cache-access-width
...per cycle
d) ISA supports the ABI by compressing all the overhead into 1 instruction
...at each end of the call.
The vast majority of the overhead has already disappeared.
Adding laziness to the save/restore process may not "buy" that much.
Post by Jonathan Brandmeyer
Are there more recent studies on the fraction of memory references that
are spill/fill traffic? Speaking of which, whatever happened to
Simplescalar? It seems to have died in the Great ISA Winnowing of the
Twenty-Aughts. Was it ever replaced?
What are some good references that studied the impact of varying the
fraction of registers which are callee-saved?
Stanford MIPS (pre company MIPS) had a few papers on stuff like this--studying
the number of registers versus performance. But I know of nothing recent.
In My 66000 ABI I designed the support instructions with the flexibility
to increase/decrease the number of items carried in registers as arguments
and as results: and at the other end to increase/decrease the number of
callee saved (preserved) registers--So that if someone comes to me with
a good argument as moving this boundary (or that) will improve performance
(or decrease overhead) I can simply adjust the numbers in ABI and the
support instructions need no change.
In additions the support instructions have to allow for stack unwind
{TRY, THROW, CATCH} and destructors (C++ longjump) to be implemented
efficiently.
Post by Jonathan Brandmeyer
Posts in this forum cite that the balance is mixed: Some applications
prefer more callee-saved while others prefer less, so the 50/50 mix is a compromise.
It is not so much "applications" but individual subroutines.
The balance ALSO depends on the number of registers and their width (along
with paring and sharing rules--if any). In a machine with thirty-two 64-bit
registers running in a 64-bit virtual address space, 50% seems to be a good
balance. Simple leaf subroutines can perform their duties in the unpreserved
register set, needing no stack or stack space. Non-simple leaf routines
can allocate stack space for local variables but might not need any of the
preserved registers as they have "enough" in the temporary registers.
(Obviously, x86 and x86-64 have none of these properties.)
Post by Jonathan Brandmeyer
I'd like to see what was actually measured in the studies so that I can
understand better how it relates to dedicated spill/fill support in the
ISA. For example, is the balance point in the middle of a wide
bathtub, or are some applications still affected by a sharp gradient in
the optimization curve?
a) enough argument and result registers that the 98%-ile subroutine needs
...no argument or result to be passed on the stack.
b) enough temporary (nonpreserved) registers for the leaf subroutine
...to perform its duties without needing any register in the preserved set.
c) enough preserved registers so that non-leaf subroutine can leave a
...substantial amount of data in registers "over" a subroutine call.
32 registers "seems" to be about the right number to fulfill all these goals,
especially is several of them have not been consumed (MIPS) for various
"other duties.
I did talk with a compiler "guy" when came from CRAY and he has good arguments
as to why all the registers should be caller save.
Mill is effectively 100% caller save, as each frame gets a whole new
belt populated with the actual arguments. 16 positions seems to be the
sweet spot for mid-range configurations; 32 seems overkill for even the
largest we have explored.
However, the Mill spiller spills more than the belt: there's the
scratchpad address space, in-flight long-latency results, and most
importantly in-flight loads. Mitch doesn't have an equivalent of the
Mill scratchpad, and from the discussion it seems like he has some
mechanism to let calls proceed even though not all of the pre-call
instructions have retired yet. However, it's not clear to me how he
handles in-flight loads.
If a call is allowed to proceed without all pre-call loads retiring then
the LSQ/LSB must be prepared to handle matches and faults that do not
belong to the current frame. Mill handles this by spilling the in-flight
loads (lazily) and reissuing the load after return if it runs out of
buffers during the callee.
So what does My6600 do if it enters a call while there's an outstanding
load to a callee-save register, and the load faults after the registers
have been "saved"?
EricP
2020-09-28 16:47:25 UTC
Permalink
Post by Jonathan Brandmeyer
Speaking of which, whatever happened to Simplescalar? It seems to have died in the Great ISA Winnowing of the Twenty-Aughts. Was it ever replaced?
I looked into it late last year when I started thinking
about building a simulator for my ideas.

http://www.simplescalar.com/

It seems it hasn't been maintained since 2011.
Many of the links above were broken, like the ones for the
compiled SPEC binaries, and the SPEC source code was not available.

After I built my own simulator I came across the name
GEM5 in my travels but did not investigate.
The source has been updated as recently as 3 months ago.

https://en.wikipedia.org/wiki/Gem5
https://www.gem5.org/
Anton Ertl
2020-09-28 16:45:24 UTC
Permalink
In an older Itanium discussion on this newsgroup, someone mentioned this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It looks like a technical report (or draft?) that wasn't published at a conference or journal.
The headline numbers are interesting: A reduction in memory writes by 50% and reads by 30% or so through intercepting register fills and spills at call boundaries. Programs got faster by 5-40%. By leveraging their unique requirements, it's possible to provide special-purpose load and store instructions to maintain a separate spill/fill stack that is much like the program's own stack in most other respects. Given recent interest in splitting stacks as a barrier to ROP attacks, I found the idea to be extra relevant.
Uh, why was this a dead-end? It wasn't published, and recent architectures have omitted any special support for spilling and filling the machine registers.
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows, IA-64's register stack, or 29K's
similar mechanism, the case for not providing such features was that
the benefit of these features (and the cost of not having these
features) can be reduced with compiler techniques like inlining,
interprocedural register allocation, and link-time optimization.
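
For example (plain C, nothing architecture-specific), once the callee is visible at the call site the call boundary disappears, and with it the spill/fill traffic the call would otherwise force:

static inline long clamp(long x, long lo, long hi)
{
    return x < lo ? lo : x > hi ? hi : x;
}

long clamp_sum(const long *a, int n)
{
    long acc = 0;                /* would be spilled around an opaque
                                    call; stays in a register once the
                                    callee is inlined                   */
    for (int i = 0; i < n; i++)
        acc += clamp(a[i], -100, 100);
    return acc;
}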

Maybe the most influential paper in this area was

@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}

However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows do not preclude
high clock rates.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
MitchAlsup
2020-09-28 17:20:06 UTC
Permalink
Post by Anton Ertl
In an older Itanium discussion on this newsgroup, someone mentioned this paper by Mattan Erez, Brian P. Towles, and William J. Dally. It looks like a technical report (or draft?) that wasn't published at a conference or journal.
The headline numbers are interesting: A reduction in memory writes by 50% and reads by 30% or so through intercepting register fills and spills at call boundaries. Programs got faster by 5-40%. By leveraging their unique requirements, it's possible to provide special-purpose load and store instructions to maintain a separate spill/fill stack that is much like the program's own stack in most other respects. Given recent interest in splitting stacks as a barrier to ROP attacks, I found the idea to be extra relevant.
Uh, why was this a dead-end? It wasn't published, and recent architectures have omitted any special support for spilling and filling the machine registers.
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows,
So why did SPARC underperform wrt MIPS when they were at comparable clock
rates?

I think the summary of SPARC vs. MIPS is that once the compiler gets
good enough, HW register windows are not a value add.
Post by Anton Ertl
IA-64's register stack, or 29K's
similar mechanism, the case for not providing such features was that
the benefit of these features (and the cost of not having these
features) can be reduced with compiler techniques like inlining,
interprocedural register allocation, and link-time optimization.
Maybe the most influential paper in this area was
@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}
However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows does not preclude
high clock rates.
But it does not buy any performance, either!
Post by Anton Ertl
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Anton Ertl
2020-09-28 17:40:49 UTC
Permalink
Post by MitchAlsup
Post by Anton Ertl
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows,
So why did SPARC under perform wrt MIPS when they were at comparative clock
rates ?
Did it? How do you know?
Post by MitchAlsup
I think the summary of the SPARC vs. MIPS is that once the compiler gets
good enough, HW register windows are not a value add.
Except that "compiler gets good enough" includes benchmark-only
Post by MitchAlsup
Post by Anton Ertl
Maybe the most influential paper in this area was
@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}
However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows does not preclude
high clock rates.
But is does not buy any performance, either !.
How do you know?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Jonathan Brandmeyer
2020-09-28 18:56:07 UTC
Permalink
Post by MitchAlsup
Post by Anton Ertl
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows,
So why did SPARC under perform wrt MIPS when they were at comparative clock
rates ?
Did it? How do you know?
Post by MitchAlsup
I think the summary of the SPARC vs. MIPS is that once the compiler gets
good enough, HW register windows are not a value add.
Except that "compiler gets good enough" includes benchmark-only
Post by MitchAlsup
Post by Anton Ertl
Maybe the most influential paper in this area was
@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}
However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows does not preclude
high clock rates.
But is does not buy any performance, either !.
How do you know?
- anton
Compare the same version of GCC targeting SPEC with and without LTO.

https://lnt.opensuse.org/db_default/v4/SPEC/13141
https://lnt.opensuse.org/db_default/v4/SPEC/13144

Some programs are a little slower, and some are a little faster. On average for this particular test suite it's a win, but a tiny one.

It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
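
For a concrete, small-scale version of that comparison, here is a minimal
two-file sketch (the file names, the scale() routine, and the loop count
are all invented for illustration): the callee sits in its own translation
unit, so the call can only be folded away when LTO lets the compiler see
across the file boundary.

  /* scale.c -- trivial callee in its own translation unit */
  double scale(double x, double k)
  {
      return x * k;                     /* prime candidate for inlining */
  }

  /* main.c -- caller; without LTO the compiler must emit a real call  */
  #include <stdio.h>

  double scale(double x, double k);     /* only a declaration is visible */

  int main(void)
  {
      double sum = 0.0;
      for (long i = 0; i < 100000000; i++)
          sum += scale((double)i, 0.5); /* call in the hot loop */
      printf("%f\n", sum);
      return 0;
  }

  /* Build both ways with the same compiler version:        */
  /*   gcc -O2       main.c scale.c -o nolto                 */
  /*   gcc -O2 -flto main.c scale.c -o lto                   */
  /* With -flto the call and its argument/return shuffling   */
  /* can vanish; without it, every iteration pays the call   */
  /* overhead this subthread is arguing about.               */
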

-Jonathan
Ivan Godard
2020-09-28 20:52:17 UTC
Permalink
Post by Jonathan Brandmeyer
Post by MitchAlsup
Post by Anton Ertl
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows,
So why did SPARC underperform wrt MIPS when they were at comparable clock
rates?
Did it? How do you know?
Post by MitchAlsup
I think the summary of the SPARC vs. MIPS is that once the compiler gets
good enough, HW register windows are not a value add.
Except that "compiler gets good enough" includes benchmark-only
Post by MitchAlsup
Post by Anton Ertl
Maybe the most influential paper in this area was
@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}
However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows does not preclude
high clock rates.
But it does not buy any performance, either!
How do you know?
- anton
Compare the same version of GCC targeting SPEC with and without LTO.
https://lnt.opensuse.org/db_default/v4/SPEC/13141
https://lnt.opensuse.org/db_default/v4/SPEC/13144
Some programs are a little slower, and some are a little faster. On average for this particular test suite it's a win, but a tiny one.
It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
The techniques - even inlining - mostly are a way to remove call
overhead. When the hardware has no call overhead (different ways in
different ISAs) then they don't pay.
MitchAlsup
2020-09-28 21:11:42 UTC
Permalink
Post by Ivan Godard
Post by Jonathan Brandmeyer
Post by MitchAlsup
Post by Anton Ertl
I have not looked at this particular paper, but in the general
discussion of reducing call overhead by using architectural features
like SPARC's register windows,
So why did SPARC underperform wrt MIPS when they were at comparable clock
rates?
Did it? How do you know?
Post by MitchAlsup
I think the summary of the SPARC vs. MIPS is that once the compiler gets
good enough, HW register windows are not a value add.
Except that "compiler gets good enough" includes benchmark-only
Post by MitchAlsup
Post by Anton Ertl
Maybe the most influential paper in this area was
@InProceedings{wall86,
author = "David W. Wall",
title = "Global Register Allocation at Link Time",
booktitle = "SIGPLAN '86 Symposium on Compiler Construction",
year = "1986",
pages = "264--275",
OPTorganization = "ACM SIGPLAN",
annote = "Uses annotations from the compiler to do fast
interprocedural register allocation at link time.
Speedups of 10--20\% are obtained. Most (52--99\%) of
the removable memory references are removed. The
improvement over intraprocedural coloring allocation
is 1--8\%."
}
However, I think that the way that benchmarks like SPEC are compiled
is not representative for the way that most real-world code is
compiled. Link-time optimization is mostly a benchmark feature. So
if you compare an architecture with and without register stack using
your typical SPEC setup, you will underestimate the benefit of the
register stack on real-world code. And as the latest Oracle and
Fujitsu (IIRC) SPARCs demonstrate, register windows does not preclude
high clock rates.
But it does not buy any performance, either!
How do you know?
- anton
Compare the same version of GCC targeting SPEC with and without LTO.
https://lnt.opensuse.org/db_default/v4/SPEC/13141
https://lnt.opensuse.org/db_default/v4/SPEC/13144
Some programs are a little slower, and some are a little faster. On average for this particular test suite it's a win, but a tiny one.
It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
The techniques - even inlining - mostly are a way to remove call
overhead. When the hardware has no call overhead (different ways in
different ISAs) then they don't pay.
In My 66000, the inherent overhead is::
a) control transfer to subroutine--can be as low as 0 if the call is found
before the last argument setup instruction enters execution.
b) control transfer back to calling point--same as above
c) R0 gets damaged because return address goes there
d) everything else is under SW control.

Now, when it IS desirable to save more than one preserved register, the HW
ENTER instruction is always more efficient than the equivalent series of
Stores; and the alterations of FP and SP take place "for free" in terms of cycles.
Returning, the HW EXIT instruction is always more efficient than a
series of Loads, and can transfer control back to the calling point by reading
R0 out of sequential order and transferring control while the rest of the
preserved registers are getting loaded.

When one enters a leaf subroutine, a good deal of the time there is no
manipulation of FP or SP and the routine can run directly out of R0-R15
with essentially zero overhead (subject to a-b-c above).

So, overall, there is little to be gained with inlining other than constant
recognition, and the avoidance of moving various registers around preceding
the CALL instruction. If I understand correctly, Mill avoids even these
register moves.
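
To make the leaf/non-leaf distinction concrete, here is a small C
illustration (plain C rather than My 66000 code; clamp(), helper(), and
accumulate() are invented names). The leaf routine can live entirely in
argument and scratch registers, while the non-leaf routine must keep a
value alive across a call, which on a conventional ABI becomes exactly the
per-register store/load sequence in the prologue and epilogue that ENTER
and EXIT are described as collapsing.

  /* Leaf: all work fits in argument/scratch registers, so a typical  */
  /* compiler emits no prologue, no epilogue, and touches no stack -- */
  /* the "essentially zero overhead" case described above.            */
  long clamp(long x, long lo, long hi)
  {
      if (x < lo) return lo;
      if (x > hi) return hi;
      return x;
  }

  long helper(long x);   /* some external routine (hypothetical) */

  /* Non-leaf: 'total' and the loop state must survive each call to   */
  /* helper(), so the compiler keeps them in callee-saved registers,  */
  /* which it must save on entry and restore on exit.  On most RISC   */
  /* ABIs that is one store in the prologue and one load in the       */
  /* epilogue per preserved register -- the sequence a single         */
  /* ENTER/EXIT pair is claimed to replace.                           */
  long accumulate(const long *v, long n)
  {
      long total = 0;
      for (long i = 0; i < n; i++)
          total += helper(clamp(v[i], 0, 255));
      return total;
  }
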
Ivan Godard
2020-09-28 21:48:46 UTC
Permalink
<snip>
Post by MitchAlsup
Post by Ivan Godard
Post by Jonathan Brandmeyer
It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
The techniques - even inlining - mostly are a way to remove call
overhead. When the hardware has no call overhead (different ways in
different ISAs) then they don't pay.
a) control transfer to subroutine--can be as low as 0 if the call is found
before the last argument setup instruction enters execution.
b) control transfer back to calling point--same as above
c) R0 gets damaged because return address goes there
d) everything else is under SW control.
Now, when it IS desirable to save more than one preserved register, the HW
ENTER instruction is always more efficient than the equivalent series of
Stores; and the alterations of FP and SP take place "for free" in terms of cycles.
Returning, the HW EXIT instruction is always more efficient than a
series of Loads, and can transfer control back to the calling point by reading
R0 out of sequential order and transferring control while the rest of the
preserved registers are getting loaded.
When one enters a leaf subroutine, a good deal of the time there is no
manipulation of FP or SP and the routine can run directly out of R0-R15
with essentially zero overhead (subject to a-b-c above).
So, overall, there is little to be gained with inlining other than constant
recognition, and the avoidance of moving various registers around preceding
the CALL instruction. If I understand correctly, Mill avoids even these
register moves.
It does.

We still do inlining because it shrinks the code by removing the call
and return instructions, and because the inlined callee's code can be
interleaved with the caller's in the machine width, giving a lower
latency overall schedule sometimes. That last shouldn't be an issue for
OOOs like yours.
MitchAlsup
2020-09-28 21:56:03 UTC
Permalink
Post by Ivan Godard
<snip>
Post by MitchAlsup
Post by Ivan Godard
Post by Jonathan Brandmeyer
It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
The techniques - even inlining - mostly are a way to remove call
overhead. When the hardware has no call overhead (different ways in
different ISAs) then they don't pay.
a) control transfer to subroutine--can be as low as 0 if the call is found
before the last argument setup instruction enters execution.
b) control transfer back to calling point--same as above
c) R0 gets damaged because return address goes there
d) everything else is under SW control.
Now, when it IS desirable to save more than one preserved register, the HW
ENTER instruction is always more efficient than the equivalent series of
Stores; and the alterations of FP and SP take place "for free" in terms of cycles.
Returning, the HW EXIT instruction is always more efficient than a
series of Loads, and can transfer control back to the calling point by reading
R0 out of sequential order and transferring control while the rest of the
preserved registers are getting loaded.
When one enters a leaf subroutine, a good deal of the time there is no
manipulation of FP or SP and the routine can run directly out of R0-R15
with essentially zero overhead (subject to a-b-c above).
So, overall, there is little to be gained with inlining other than constant
recognition, and the avoidance of moving various registers around preceding
the CALL instruction. If I understand correctly, Mill avoids even these
register moves.
It does.
We still do inlining because it shrinks the code by removing the call
and return instructions,
I can believe that only for called-once subroutines, or when the argument
count is such that a lot of MOV instructions vanish.
Post by Ivan Godard
and because the inlined callee's code can be
interleaved with the caller's in the machine width, giving a lower
latency overall schedule sometimes. That last shouldn't be an issue for
OOOs like yours.
OoO makes that problem vanish (even across implementations).
Ivan Godard
2020-09-28 22:19:54 UTC
Permalink
Post by MitchAlsup
Post by Ivan Godard
<snip>
Post by MitchAlsup
Post by Ivan Godard
Post by Jonathan Brandmeyer
It seems that interprocedural register allocation and specialization through constant propagation haven't been all that impactful for workstations and servers. In practice, the functions that benefit the most from those features just get inlined entirely.
The techniques - even inlining - mostly are a way to remove call
overhead. When the hardware has no call overhead (different ways in
different ISAs) then they don't pay.
a) control transfer to subroutine--can be as low as 0 if the call is found
before the last argument setup instruction enters execution.
b) control transfer back to calling point--same as above
c) R0 gets damaged because return address goes there
d) everything else is under SW control.
Now, when it IS desirable to save more than one preserved register, the HW
ENTER instruction is always more efficient than the equivalent series of
Stores; and the alterations of FP and SP take place "for free" in terms of cycles.
Returning, the HW EXIT instruction is always more efficient than a
series of Loads, and can transfer control back to the calling point by reading
R0 out of sequential order and transferring control while the rest of the
preserved registers are getting loaded.
When one enters a leaf subroutine, a good deal of the time there is no
manipulation of FP or SP and the routine can run directly out of R0-R15
with essentially zero overhead (subject to a-b-c above).
So, overall, there is little to be gained with inlining other than constant
recognition, and the avoidance of moving various registers around preceding
the CALL instruction. If I understand correctly, Mill avoids even these
register moves.
It does.
We still do inlining because it shrinks the code by removing the call
and return instructions,
I can believe that only for called-once subroutines, or when the argument
count is such that a lot of MOV instructions vanish.
No MOVs on a Mill; that reason applies only to the called-once case.
Post by MitchAlsup
Post by Ivan Godard
and because the inlined callee's code can be
interleaved with the caller's in the machine width, giving a lower
latency overall schedule sometimes. That last shouldn't be an issue for
OOOs like yours.
OoO makes that problem vanish (even across implementations).
Yes. This inlining can be beneficial even for call-many functions. We
currently have only a crude cost heuristic that takes some note of code
expansion (not done on -Os, for example), but takes no note of code-locality
improvement and makes only a rough guess at the latency change from interleaving.
For measurement and tuning later :-)
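
For what it's worth, the shape of such a heuristic is roughly the following
(a generic sketch, not the Mill tool chain's actual code; the structure,
field names, thresholds, and the opt_size flag are all invented): weigh
estimated code growth against the number of call sites, and refuse any
growth when optimizing for size. A real version would also fold in the
locality and interleaving-latency estimates mentioned above.

  /* Illustrative inlining cost heuristic -- invented numbers and names. */
  struct callee_info {
      int body_insns;     /* estimated size of the callee body           */
      int call_overhead;  /* insns removed per call site when inlined    */
      int call_sites;     /* number of static call sites                 */
  };

  /* Decide whether inlining one call site is worthwhile. */
  static int should_inline(const struct callee_info *c, int opt_size)
  {
      int growth = c->body_insns - c->call_overhead;  /* net growth here */

      if (growth <= 0)
          return 1;        /* inlining shrinks the code: always do it    */
      if (opt_size)
          return 0;        /* -Os-style policy: tolerate no growth       */

      /* Otherwise allow modest growth, scaled down as the number of     */
      /* call sites (and hence the total duplication) goes up.           */
      return growth * c->call_sites <= 64;           /* arbitrary budget */
  }
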