Discussion:
Maximum number of operands for the x86 and x64 instruction sets?
Skybuck Flying
2010-06-07 21:11:19 UTC
Permalink
Hello,

Suppose I want to design a new instruction for the x86 instruction set, or
perhaps its newer version/extension, the x64 instruction set... I have a
question about that...

The question is:

What's the maximum number of operands at my disposal for the design?

For example SHLD has 3 operands:

Two flexible ones and one hard coded one: CL
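For illustration, SHLD's semantics (not its encoding) can be modeled in a few lines of Python; `shld32` and the sample values are made up for this sketch:

```python
MASK32 = 0xFFFFFFFF

def shld32(dest, src, count):
    """Model of SHLD dest, src, CL for 32-bit operands: shift dest left,
    filling the vacated low bits from the high bits of src."""
    count &= 31  # the hardware masks the count to 5 bits for 32-bit operands
    if count == 0:
        return dest
    return ((dest << count) | (src >> (32 - count))) & MASK32

# The two flexible operands are dest and src; count plays the role of CL.
print(hex(shld32(0x80000001, 0xF0000000, 4)))  # 0x1f
```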

So what is the maximum number of "flexible operands"?

And what is the maximum number of "hardcoded operands"?

For hardcoded operands I imagine that there might be no limit... for example
a "machine/state reset" instruction could simply reset the entire
machine/state.

But I could also imagine that there might be hardware limitations, or
situations that are unwanted cost-wise... but for now I am looking for true
architectural maximums.

For the "flexible operands" I am not sure what the answer is... also, is it
a different answer for x86 vs x64?

Bye,
Skybuck.
Alexei A. Frounze
2010-06-07 23:47:58 UTC
Permalink
Post by Skybuck Flying
Hello,
Suppose I want to design a new instruction for the x86 instruction set or
perhaps the newer version/extension the x64 instruction set... I have a
question about that...
What's the maximum number of operands at my disposal for the design ?
Two flexible ones and one hard coded one: CL
So what is the maximum number of "flexible operands" ?
And what is the maximum number of "hardcoded operands" ?
For hardcoded operands I imagine that there might be no limit... for example
a "machine/state reset" instruction could simply reset the entire
machine/state.
But I could also imagine that there might be hardware limitations to do or
unwanted situations cost wise... but for now I am looking for true
architectural maximums.
For the "flexible operands" I am not sure what the answer is... ? also is it
a different answer for x86 vs x64 ?
Bye,
  Skybuck.
I think IRET would be your case for the maximum (or near-maximum) number
of implicit operands. Its description spans several pages and there
are many, many things involved.
Among the general purpose instructions, think of CMPXCHG8/16B, CPUID,
PUSHA/POPA, REP SCAS*.

Alex
Skybuck Flying
2010-06-08 01:44:51 UTC
Permalink
Ok, so implicit operands, which I would call the hardcoded operands, can be
many...

This leaves the question: how many explicit/flexible operands can
instructions have?

For now my guess would be 2 for general x86 instructions... for sse I am not
so sure...

Bye,
Skybuck.
Alexei A. Frounze
2010-06-08 09:01:47 UTC
Permalink
Post by Skybuck Flying
Ok, so implicit operands which I would call the hardcoded operands can be
many...
This leaves the question how many explicit/flexible operands can
instructions have ?
General-purpose ones have at most 2 because that's how many you can
specify in the ModR/M byte.
Post by Skybuck Flying
For now my guess would be 2 for general x86 instructions... for sse I am not
so sure...
If they use ModR/M and nothing else like it, 2 max.
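A quick sketch of why ModR/M caps it at two: the byte holds exactly one 3-bit "reg" field and one 3-bit "r/m" field (plus two "mod" bits choosing the addressing form), so there is nowhere to name a third operand. The decoder below is illustrative only:

```python
def decode_modrm(byte):
    """Split a ModR/M byte into its mod, reg and r/m fields."""
    mod = (byte >> 6) & 0b11   # addressing form of the r/m operand
    reg = (byte >> 3) & 0b111  # first (register) operand
    rm = byte & 0b111          # second operand: register or memory
    return mod, reg, rm

# 0xC3 = 11 000 011: register-direct form, reg=EAX, r/m=EBX
print(decode_modrm(0xC3))  # (3, 0, 3)
```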

Alex
Ken Hagan
2010-06-08 09:19:34 UTC
Permalink
On Tue, 08 Jun 2010 02:44:51 +0100, Skybuck Flying
Post by Skybuck Flying
Ok, so implicit operands which I would call the hardcoded operands can be
many...
This leaves the question how many explicit/flexible operands can
instructions have ?
For now my guess would be 2 for general x86 instructions... for sse I am
not so sure...
There's no need to guess, since Intel publish more documentation than you
could possibly want on the subject. See
http://www.intel.com/products/processor/manuals/index.htm.

There are instructions that modify many registers, but the registers
involved are always implicit. Generally there are just one or two explicit
arguments, and at most one of those can use an addressing mode that
involves multiple registers. This follows from the instruction format.

x86 instructions are a 1- or 2-byte "op-code" (more recently, but rarely, 3
bytes), which may be followed by a "modrm" byte, which may in turn be
followed by a "sib" byte. After those you might have 1, 2 or 4 bytes of
immediate data. In front of the whole instruction you might have one or
more prefix bytes.

That's it. (Since you asked the question, I'm guessing that you were
expecting something a little less constrained.) The MMX and SSE
instructions aren't any different from the traditional ones in terms of
encoding. They do work on different registers, but that's implicit in the
op-code. You can't freely mix MMX and SSE registers with normal
instructions. It's no accident that there are 8 general purpose registers,
8 FPU registers, 8 MMX registers and 8 SSE registers. They are all
shoe-horned into the same instruction format.
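The layout above can be walked through for one concrete instruction. This sketch decodes only the single path needed for `add eax, [ebx+ecx*4+8]` (opcode 0x03, ModR/M with a SIB byte and a disp8), and is not a general decoder:

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_add_r32_rm32(code):
    """Decode the bytes of ADD r32, r/m32 in its SIB + disp8 form."""
    assert code[0] == 0x03                 # opcode: ADD r32, r/m32
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b01 and rm == 0b100     # disp8 form, SIB byte follows
    sib = code[2]
    scale = 1 << (sib >> 6)
    index, base = (sib >> 3) & 7, sib & 7
    disp = code[3]                         # 1-byte displacement
    return f"add {REGS[reg]}, [{REGS[base]}+{REGS[index]}*{scale}+{disp}]"

print(decode_add_r32_rm32(bytes([0x03, 0x44, 0x8B, 0x08])))
# add eax, [ebx+ecx*4+8]
```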
Noob
2010-06-08 11:49:02 UTC
Permalink
Post by Ken Hagan
There are instructions that modify many registers, but the registers
involved are always implicit. Generally there are just one or two
explicit arguments, and at most one of those can use an addressing mode
that involves multiple registers. This follows from the instruction format.
AFAIU, AVX will introduce 3-operand instructions.

http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
http://en.wikipedia.org/wiki/VEX_prefix
http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/

Regards.
Ken Hagan
2010-06-09 09:03:37 UTC
Permalink
Post by Noob
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
http://en.wikipedia.org/wiki/VEX_prefix
http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-performance-improvements-and-energy-efficiency/
Gosh. It's like RISC never happened.
MitchAlsup
2010-06-10 14:24:48 UTC
Permalink
Post by Ken Hagan
Post by Noob
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
http://en.wikipedia.org/wiki/VEX_prefix
http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-p...
Gosh. It's like RISC never happened.
At best, the designers of today remember none of the lessons from that
era.

Mitch
Noob
2010-06-10 15:47:51 UTC
Permalink
Post by Ken Hagan
Post by Noob
http://en.wikipedia.org/wiki/Advanced_Vector_Extensions
http://en.wikipedia.org/wiki/VEX_prefix
http://software.intel.com/en-us/articles/intel-avx-new-frontiers-in-p...
Gosh. It's like RISC never happened.
At best, the designers of today remember none of the lessons from that era.
Were you aware that Google's lame Usenet interface truncates long URLs?
Michael S
2010-06-09 00:21:46 UTC
Permalink
Post by Skybuck Flying
Ok, so implicit operands which I would call the hardcoded operands can be
many...
This leaves the question how many explicit/flexible operands can
instructions have ?
For now my guess would be 2 for general x86 instructions...
Assuming by 'operands' you mean "register operands", the answer is 3
(LEA).
If you also count immediate operands, then LEA gets up to 4 and IMUL up
to 3. For the rest of the general-purpose non-memory instructions - 2.
Post by Skybuck Flying
for sse I am not so sure...
For the majority of SSE instructions - two. For some (e.g. shuffles), two +
one immediate. SSE4.1 BLENDVPx instructions have three register
operands, but one of the inputs always comes from XMM0.
For the coming AVX extension - three.
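The LEA count above comes from its effective-address form, base + index*scale + disp: three registers (counting the destination) plus an immediate in a single instruction. A sketch of the arithmetic, with made-up sample values:

```python
def lea(base, index, scale, disp):
    """What LEA dst, [base + index*scale + disp] computes (32-bit wrap)."""
    return (base + index * scale + disp) & 0xFFFFFFFF

# lea eax, [ebx + ecx*4 + 8] with ebx=0x1000, ecx=3
print(hex(lea(0x1000, 3, 4, 8)))  # 0x1014
```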
Torben Ægidius Mogensen
2010-06-08 07:52:59 UTC
Permalink
Post by Skybuck Flying
Suppose I want to design a new instruction for the x86 instruction set or
perhaps the newer version/extension the x64 instruction set... I have a
question about that...
What's the maximum number of operands at my disposal for the design ?
As many as you like. Since x86 uses variable-length instructions, there
is no fixed upper limit.

If you want to add your new instructions by a minimal modification to an
existing implementation of the ISA, the answer depends on which
implementation you choose to modify. If the implementation translates
an x86 instruction into micro instructions internally, the answer is
still that you can pretty much use as many operands as you like, as long
as your complex instruction can be expressed in terms of the existing
micro instructions. If not, it gets more complicated. If your
instruction actually requires a modification of the ALU, it gets even
more hairy. Adding an extra operand to the ALU is probably
prohibitively expensive, as it is likely to slow everything else down.

Torben
BGB / cr88192
2010-06-10 18:56:22 UTC
Permalink
Post by Torben Ægidius Mogensen
Post by Skybuck Flying
Suppose I want to design a new instruction for the x86 instruction set or
perhaps the newer version/extension the x64 instruction set... I have a
question about that...
What's the maximum number of operands at my disposal for the design ?
As many as you like. Since x86 uses variable-length instructions, there
is no fixed upper limit.
well, most things limit the max instruction length to 15 bytes, so this is a limit...
Post by Torben Ægidius Mogensen
If you want to add your new instructions by a minimal modification to an
existing implementation of the ISA, the answer depends on which
implementation you choose to modify. If the implementation translates
an x86 instruction into micro instructions internally, the answer is
still that you can pretty much use as many operands as you like, as long
as your complex instruction can be expressed in terms of the existing
micro instructions. If not, it gets more complicated. If your
instruction actually requires a modification of the ALU, it gets even
more hairy. Adding an extra operand to the ALU is probably
prohibitively expensive, as it is likely to slow everything else down.
well, x86 does have opcode encoding rules...

conventional arguments (allowed by typical ModRM bytes):
imm
reg
mem
reg, imm
mem, imm
reg, mem
mem, reg
reg, mem, imm
mem, reg, imm

XOP and AVX allow a few more register arguments.


this is excluding the possibility of using an otherwise unusual or a virtual
opcode encoding.

virtual-opcode:
use an existing opcode as a trigger, and switch to custom decoding for the
rest of the opcode.
this could be compared to a more extreme case of XOP.


this could be done within an OS for customized virtual opcodes:
a special magic opcode triggers a #UD.
the OS #UD handler examines the opcode, and decides whether to interpret it
as a special ISA extension.

however, I don't know of any which do this, since usually system-calls are
used instead.
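As a toy analog of that #UD scheme (purely illustrative - a bytecode interpreter standing in for the CPU and a handler table standing in for the OS #UD handler; none of the names are real APIs):

```python
class InvalidOpcode(Exception):
    pass

# opcode -> emulation function, playing the role of the OS #UD handler table
EXTENSION_HANDLERS = {}

def run(program, acc=0):
    """Execute (opcode, arg) pairs; unknown opcodes trap like #UD."""
    for op, arg in program:
        if op == "add":                    # the only "native" instruction
            acc += arg
        elif op in EXTENSION_HANDLERS:     # the handler emulates the opcode
            acc = EXTENSION_HANDLERS[op](acc, arg)
        else:
            raise InvalidOpcode(op)        # genuine #UD: nobody claims it
    return acc

# Register a "virtual opcode" the base interpreter does not know about:
EXTENSION_HANDLERS["mul3"] = lambda acc, arg: acc * 3 + arg

print(run([("add", 5), ("mul3", 1)]))  # 16
```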


another different hack I have used (in my projects), is to make some fake
opcodes (at the ASM level) which are converted into a function call (which
encodes all of the arguments into the function name), but this call uses a
special calling convention (namely, it preserves all registers/... except
those which it is specifically allowed to modify...).

the example would be like the big ugly URLs used for CGI scripts/... just
as function names.

typically, this strategy is used in combination with link-time and run-time
code generation...


or such...
Alexei A. Frounze
2010-06-11 07:42:57 UTC
Permalink
Post by BGB / cr88192
another different hack I have used (in my projects), is to make some fake
opcodes (at the ASM level) which are converted into a function call (which
encodes all of the arguments into the function name), but this call uses a
special calling convention (namely, it preserves all registers/... except
those which it is specifically allowed to modify...).
typically, this strategy is used in combination with link-time and run-time
code generation...
Apparently, both Windows and ReactOS patch the code containing the
prefetchnta instruction, e.g. RtlPrefetchMemoryNonTemporal(),
depending on whether or not the instruction is supported by the CPU:
http://www.computer.org/portal/web/csdl/doi/10.1109/HICSS.2010.182
(click on the PDF link, search for this fxn)
http://www.koders.com/c/fid4D23409E3EA6032D618D125732B4AC17A3E773DA.aspx
(search the code for this fxn)
There's a similar thing in Linux with the option of just skipping the
instruction (see handle_prefetch()):
http://lwn.net/Articles/8634/

From my experience, Microsoft's C/C++ compiler seems to generate
prefetchnta in x64 code unconditionally, thereby leaving it up to the
system to find a way around an instruction unsupported by the
CPU. And the system either patches the code or emulates/skips the
instruction.
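The patching idea can be sketched over a bytearray standing in for code memory (illustrative only - a real implementation would check CPUID and rewrite writable kernel text; `patch_prefetch` is a made-up name):

```python
NOP = 0x90
PREFETCHNTA_ESI = bytes([0x0F, 0x18, 0x06])  # prefetchnta [esi]

def patch_prefetch(code, offset, cpu_has_sse):
    """Replace a 3-byte PREFETCHNTA with NOPs when the CPU lacks SSE."""
    assert code[offset:offset + 3] == PREFETCHNTA_ESI
    if not cpu_has_sse:
        code[offset:offset + 3] = bytes([NOP] * 3)

code = bytearray(b"\x0f\x18\x06\xc3")  # prefetchnta [esi]; ret
patch_prefetch(code, 0, cpu_has_sse=False)
print(code.hex())  # 909090c3
```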

A number of virtualization products introduce otherwise illegal/non-
existent instructions to communicate between the guest OS and the host
and "emulate" those instructions: http://www.symantec.com/avcenter/reference/Virtual_Machine_Threats.pdf
(see sections starting with VI)

So, these "hacks" are there in the wild, for good and bad.

Alex
unknown
2010-06-11 11:11:41 UTC
Permalink
Post by Alexei A. Frounze
Apparently, both Windows and ReactOS patch the code containing the
prefetchnta instruction, e.g. RtlPrefetchMemoryNonTemporal(),
http://www.computer.org/portal/web/csdl/doi/10.1109/HICSS.2010.182
(click on the PDF link, search for this fxn)
http://www.koders.com/c/fid4D23409E3EA6032D618D125732B4AC17A3E773DA.aspx
(search the code for this fxn)
There's a similar thing in Linux with the option of just skipping the
http://lwn.net/Articles/8634/
I checked the relevant link and code, and I really don't think the author
understands just how hard it will be to get it right.

He does note that there are multiple problem areas related to SMP
systems, and that these make the solution much more complicated than it
would be for a single-core setup.

Anyway, his SMP hack to allow fixup of large instructions is to make
sure that all the opcodes used, that could fault, will do so based on
the first 4 opcode bytes only, i.e. independently of any following bytes.

With this restriction he can first use simple store instructions to
overwrite the tail, if any, and then use a locked update to fix the
first four bytes.

(He doesn't state it explicitly, but I assume he fixes 1-3 byte opcodes
by always writing 4 bytes, rewriting the current values into the
following bytes.)

The potential problem I noted here is that afaik, many systems only
guarantee the atomicity of locked writes if they are properly aligned,
and 75% of all opcodes will not start on a 4-byte boundary.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Alexei A. Frounze
2010-06-11 23:40:55 UTC
Permalink
On Jun 11, 4:11 am, Terje Mathisen <"terje.mathisen at tmsw.no">
Post by unknown
The potential problem I noted here is that afaik, many systems only
guarantee the atomicity of locked writes if they are properly aligned,
and 75% of all opcodes will not start on a 4-byte boundary.
Validating that code or suggesting its use wasn't my intention. :) I
just shared on the topic of instruction emulation, which is an
interesting thing.

Alex
unknown
2010-06-12 08:41:57 UTC
Permalink
On Jun 11, 4:11 am, Terje Mathisen<"terje.mathisen at tmsw.no">
Post by unknown
The potential probem I noted here is that afaik, many systems only
guarantee the atomicity of locked writes if they are properly aligned,
and 75% of all opcodes will not start on a 4-byte boundary.
Validating that code or suggesting its use wasn't my intention. :) I
just shared on the topic of instruction emulation, which is an
interesting thing.
Sure, I understood that!

It was indeed interesting, and it got me to think about how to solve
that particular problem (fixing up instructions without using a global
lock on an SMP machine).

I think the simplest (for some version of "simplest") method might be to
temporarily replace the debug interrupt handler with a function that
verifies the source of the interrupt, then either emulates the current
instruction or chains to the previous handler (so debuggers will keep
working etc.)

As soon as this is in place you can safely overwrite the first opcode
byte with an "INT3", then fixup the rest of the instruction and finally
set the first byte to the proper value.

With a fence between the INT3 write and the rest of the opcode bytes and
another before the final update of the first byte, this should be
perfectly safe even on SMP kernels, and the overhead would be very low
indeed.
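The sequence of writes can be sketched like this (the fences and the temporary INT3 handler are represented only as comments - Python obviously cannot express the real memory-ordering constraints):

```python
INT3 = 0xCC

def patch_instruction(code, offset, new_bytes):
    """Terje's scheme: INT3 first, then the tail, then the real first byte."""
    code[offset] = INT3                  # 1. any CPU arriving now traps to
                                         #    the temporary INT3 handler
    # --- fence: the INT3 must be visible before the tail changes ---
    code[offset + 1:offset + len(new_bytes)] = new_bytes[1:]  # 2. fix tail
    # --- fence: the tail must be complete before the head is restored ---
    code[offset] = new_bytes[0]          # 3. final single-byte write

code = bytearray(b"\x0f\x18\x06")            # old: prefetchnta [esi]
patch_instruction(code, 0, b"\x90\x90\x90")  # new: three NOPs
print(code.hex())  # 909090
```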

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Andy 'Krazy' Glew
2010-06-12 19:33:15 UTC
Permalink
Post by unknown
On Jun 11, 4:11 am, Terje Mathisen<"terje.mathisen at tmsw.no">
Post by unknown
The potential probem I noted here is that afaik, many systems only
guarantee the atomicity of locked writes if they are properly aligned,
and 75% of all opcodes will not start on a 4-byte boundary.
Validating that code or suggesting its use wasn't my intention. :) I
just shared on the topic of instruction emulation, which is an
interesting thing.
Sure, I understood that!
It was indeed interesting, and it got me to think about how to solve
that particular problem (fixing up instructions without using a global
lock on an SMP machine).
If you have kernel access so you can hook INT3, what's wrong with

FOR i FROM lowest byte of instruction TO highest byte DO
    b[i] = INT3 (single byte trap instruction)

FOR i FROM highest byte of instruction TO lowest byte of instruction DO
    b[i] = appropriate byte of new instruction

?


(Assuming that you don't have an issue with an instruction crossing a page boundary, with the pages (in particular, the
lowest addressed pages) being aliased. But that's just a fairly simple extension.)
unknown
2010-06-12 21:56:39 UTC
Permalink
Post by Andy 'Krazy' Glew
If you have kernel access so you can hook INT3, what's wrong with
FOR i FROM lowest byte of instruction TO highest byte DO
b[i] = INT3 (single byte trap instruction)
FOR i FROM highest byte of instruction TO lowest byte of instruction DO
b[i] = appropriate byte of new instruction
?
Nothing at all, it is just a lot more work than really needed:

Only the first (lowest) byte needs to be replaced with an INT3; then you
can write all the subsequent bytes with the new opcode bytes, and then
finally fix the INT3, i.e. just a single extra byte write.
Post by Andy 'Krazy' Glew
(Assuming that you don't have an issue with an instruction crossing a
page boundary, with the pages (in particular, the lowest addressed
pages) being aliased. But that's just a fairly simple extension.)
I am assuming that you're not trying to fix up an instruction which also
contains a jump target inside it, like the following code which I wrote
for the 486 once upon a time:

int func(...)
{
  asm {
    ... prolog
    ; the optimal inner loop is skewed so we need to skip the top
    ; of the loop on the first iteration:
    ;   JMP FIRST
    ; To avoid the JMP I use a dummy opcode which skips the top:

    cmp ax, 1234h  ;; 1-byte opcode + 2-byte immediate
    ORG $-2        ;; return to just before the 2-byte immediate
  next_iter:
    inc si
    inc di
  FIRST:           ;; First iteration starts here...
    ...
This code was optimal on the 486, but blew away the Icache instruction
boundaries on a Pentium.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Andy 'Krazy' Glew
2010-06-12 22:15:28 UTC
Permalink
Post by unknown
Only the first (lowest) byte needs to be replaced with an INT3; then you
can write all the subsequent bytes with the new opcode bytes, and then
finally fix the INT3, i.e. just a single extra byte write.
Not if there is a possibility that code might branch into the middle of the instruction, not just the first byte.

You can, of course, write-combine an aligned 2, 4, or even 8 bytes' worth of INT3s - I believe the new memory model says
that 64-bit aligned writes are atomic. You can't depend on split order, however.

Ooops, you talk about jumping into the middle of instructions below.
MitchAlsup
2010-06-12 23:28:48 UTC
Permalink
Post by Andy 'Krazy' Glew
If you have kernel access so you can hook INT3, what's wrong with
FOR i FROM lowest byte of instruction TO highest byte DO
     b[i] = INT3 (single byte trap instruction)
FOR i FROM highest byte of instruction TO lowest byte of instruction DO
     b[i] = appropriate byte of new instruction
Consider the case where an interested CPU has already fetched the
first byte (or first several bytes) of said instruction and one of
these fetched bytes happens to be a major opcode byte, but the rest of
the instruction fetch gets delayed by this or that. There is no
architectural specification that requires the fetch process to back up
when an instruction cache line is stolen.

Now your function comes in and writes INT 3 over the rest of the
instruction, snatching the cache line from the interested CPU.

Finally the delayed CPU finished fetching and now executes the
instruction with all of its minor opcodes, mod/rm, sib, and constants
containing INT 3 byte patterns.

Nothing good will come of this.

You have to prevent the "interested" CPU from fetching the first byte
of the instruction before smearing INT 3's over the opcode space. The
only chance you have of making this work is to align this particular
instruction on a cache line boundary... which one cannot do for a
random instruction.

Mitch
Andy 'Krazy' Glew
2010-06-13 00:17:49 UTC
Permalink
Post by MitchAlsup
You have to prevent the "interested" CPU from fetching the first byte
of the instruction before smearing INT 3's over the opcode space. The
only chance you have of making this work is to align this particular
instruction on a cache line boundary... which one cannot do for a
random instruction.
Fair enough.

There are no atomicity properties defined for instruction fetch.

Perhaps there should be.

Lacking this, the only safe way is to do a shootdown: stop all CPUs, do the write of the instruction bytes, perform a
serializing instruction on each CPU to flush all instruction caches and prefetch queues (that is architecturally defined), and
then restart.

Thanks, Mitch. I had gone over this with the Pin people, but had forgotten.

Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I
believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first
byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).) So I believe that the algorithm I
describe will work on Intel since P6, for WB memory. I think that it also will work for UC memory. (Glew's rule: any
architectural property should work when caches are disabled.)

But there is nothing official.

---

By the way, this is an example of where making self-modifying code be recognized immediately, as part of the memory
ordering model, simplifies things.
unknown
2010-06-13 11:31:14 UTC
Permalink
Now, there are de-facto instruction fetch atomicity properties. But
nothing official. (E.g. on Intel (since P6), and I believe AMD, any
ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
P6) will clear when the first byte is written; i.e. Intel recognizes SMC
Intel have done so since the 486 I believe, definitely since the Pentium!

From the 8088 and up to the 386 you could measure the size of the instruction
prefetch buffer by first doing a very long-running instruction that did
not touch memory (e.g. DIV), then a REP STOSB which would overwrite the
immediately following instructions with NOP bytes. Those bytes would all
start out as single-byte INC reg opcodes, so the number of them that got
executed was a pretty good indication of the size of the prefetch
buffer. (I believe this all ran within a CLI/STI block, i.e. with
interrupts disabled...)
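A toy simulation of that measurement (purely illustrative - real 8088..386 behavior depends on the actual prefetch hardware): bytes already inside the prefetch queue still execute as the INC opcodes they were fetched as, while everything remaining in memory has been overwritten with NOPs, so the number of INCs executed reveals the queue size.

```python
def run_after_overwrite(queue_size, stream_len):
    """Count INC opcodes that still execute after memory is overwritten."""
    memory = ["inc"] * stream_len        # single-byte INC reg opcodes
    queue = memory[:queue_size]          # already prefetched before the store
    memory[:] = ["nop"] * stream_len     # the REP STOSB overwrites memory
    executed = queue + memory[queue_size:]
    return executed.count("inc")

print(run_after_overwrite(queue_size=6, stream_len=16))  # 6
```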
immediately (as, I think, does AMD).) So I believe that the algorithm I
describe will work on Intel since P6, for WB memory. I think that it also
will work for UC memory. (Glew's rule: any architectural property should
work when caches are disabled.)
That's a _very_ good rule, since the opposite could make debugging the
cpu _far_ worse.
But there is nothing official.
:-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2010-06-13 12:23:12 UTC
Permalink
Post by unknown
Now, there are de-facto instruction fetch atomicity properties. But
nothing official. ...
(Glew's rule: any architectural property should
work when caches are disabled.)
That's a _very_ good rule, since the opposite could make debugging the
cpu _far_ worse.
All rules have exceptions, but they should be protected by barriers
proportional to the loss of sanity involved in breaking them. That
one should definitely be safe outside a maximum security enclosure
(e.g. features provided for use only in machine-check handlers may
need to break it, but those should NOT be used outside such code).
Post by unknown
But there is nothing official.
:-)
Personally, I think the author of any normal code (including kernel)
who relies on instruction fetch atomicity needs reeducation. Yes,
I have done that, but it was a long time ago, the constraints were
those of the 1970s, and I wouldn't do it again!


Regards,
Nick Maclaren.
Andy 'Krazy' Glew
2010-06-13 20:01:36 UTC
Permalink
Now, there are de-facto instruction fetch atomicity properties. But
nothing official. (E.g. on Intel (since P6), and I believe AMD, any
ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
P6) will clear when the first byte is written; i.e. Intel recognizes SMC
immediately (as, I think, does AMD).) So I believe that the algorithm I
describe will work on Intel since P6, for WB memory. I think that it also
will work for UC memory.
Correction: the algorithm I described should work for coherent UC memory. I.e. UC memory where writes are exposed to
other processors for snooping.

It will not work for non-coherent write-through memory. In such a case, the scenario Mitch described will cause a
failure of the protocol. In such a case, a shootdown will be necessary.
Andy 'Krazy' Glew
2010-06-13 20:03:51 UTC
Permalink
Post by unknown
Now, there are de-facto instruction fetch atomicity properties. But
nothing official. (E.g. on Intel (since P6), and I believe AMD, any
ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
P6) will clear when the first byte is written; i.e. Intel recognizes SMC
Intel have done so since the 486 I believe, definitely since the Pentium!
Are you sure? I have pretty clear recollection that there were codes that ran on P5 that failed on P6, because P6
caused the SMC to take effect immediately, at the next instruction boundary. I think on P5, if an instruction in the U
pipe overwrote the very next instruction, and that instruction was already in the V pipe, the effect was not immediately
seen.
unknown
2010-06-14 05:56:47 UTC
Permalink
Post by unknown
Now, there are de-facto instruction fetch atomicity properties. But
nothing official. (E.g. on Intel (since P6), and I believe AMD, any
ifetch entirely within an aligned 16 bytes is atomic. And Intel (since
P6) will clear when the first byte is written; i.e. Intel recognizes SMC
Intel have done so since the 486 I believe, definitely since the Pentium!
Are you sure? I have pretty clear recollection that there were codes
that ran on P5 that failed on P6, because P6 caused the SMC to take
effect immediately, at the next instruction boundary. I think on P5, if
an instruction in the U pipe overwrote the very next instruction, and
that instruction was already in the V pipe, the effect was not
immediately seen.
That is possible, but quite hard to achieve, since it would require you
to set up a loop in such a way that on the first N (N can be 1)
iterations it would _not_ overwrite (or modify in any other way) the
following opcode, but on the next iteration it would.

This is due to the P5 quirk where the first iteration of any particular
piece of code would bring it into the I-cache, while marking the
instruction boundaries; only on the second and later iterations, while
executing out of the I-cache, would both pipes be working.

It wouldn't be too hard to set up code to do just this, but it would
definitely never happen by accident:

mov al,'E' ; INC EBP opcode
mov ecx,2
mov edi, offset buffer + 4096 ; Start in next code page...
xor ebp,ebp
next:
mov [edi],al
buffer:
nop ; The second iteration overwrites this NOP

sub edi,4096
nop

dec ecx
jnz next

jmp done
nop
nop
nop
... 4080 NOPs skipped
nop
nop ; The first write lands around here...
nop
nop
nop
done:
test ebp,ebp ; Did it get updated or not?

Since the (next:) target is the top of the loop, it will always run in
the U pipe, while the following NOP can run in V.

The SUB EDI,4096 and NOP can pair during the next iteration, while DEC ECX
and JNZ NEXT match up for the last.

By placing the first write 4K beyond the current code, I'm pretty
certain that no current CPU will take this as a reason to shootdown any
I-cache data for the little loop I'm running twice.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
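Terje's P5 experiment is hard to reproduce today, but the behaviour he and
Andy are discussing -- a store into just-executed code taking effect on the
very next call -- can be demonstrated from user space. A minimal sketch,
assuming x86-64 Linux where a writable+executable anonymous mapping is
permitted (the demo bails out with None anywhere else):

```python
import ctypes
import mmap
import platform

def smc_demo():
    """Run a 2-instruction function, patch its immediate, run it again."""
    # The hand-assembled bytes below are x86-64 only; RWX mappings may
    # also be denied on hardened kernels, so bail out gracefully.
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None
    try:
        buf = mmap.mmap(-1, mmap.PAGESIZE,
                        prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    except (ValueError, OSError):
        return None

    buf.write(bytes([0xB8, 42, 0, 0, 0, 0xC3]))   # mov eax,42 ; ret
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    func = ctypes.CFUNCTYPE(ctypes.c_int)(addr)

    first = func()    # executes the original code
    buf[1] = 43       # overwrite the immediate: a store into live code
    second = func()   # on P6 and later the new byte is seen immediately
    return first, second

print(smc_demo())
```

On a stock x86-64 Linux box this prints (42, 43), i.e. the modification is
visible on the very next call, with no serializing instruction in between.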
MitchAlsup
2010-06-13 13:53:18 UTC
Permalink
Now, there are de-facto instruction fetch atomicity properties.  But nothing official.  (E.g. on Intel (since P6), and I
believe AMD, any ifetch entirely within an aligned 16 bytes is atomic.  And Intel (since P6) will clear when the first
byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).)  <snip>
But there is nothing official.
And there IS the problem. AMD Athlons and Opterons obey the self
modifying code checks wrt instruction fetch, but do not on third party
modifications of code simply because there is no spec as to what is
required (and the addresses to be checked are quite distant from the
addresses that need checking). {This was circa '07 and may have been
changed.} When we investigated this issue (circa '05) there was no
code sequence that would successfuly do this that would work on both
Intel and AMD machines {IBM java JIT compiler triggered the
investigation.} So, the compiler had to do a CPUID early and use
result this to pick a code sequence later, as needed.
(Glew's rule: any architectural property should work when caches are disabled.)
This rule will prevent bundling more than one instruction into a
single atomic sequence, should you ever want such a longer atomic
operation with multiple independent memory references. Thus something
like ASF can only work when the instructions and data are both
cacheable and both caches are turned on.

{Note: I am not disagreeing with the general principle involved, but
there are times.....}

Mitch
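The "CPUID early, pick a code sequence later" workaround Mitch describes is
essentially one-time dispatch through a function pointer. A hypothetical
sketch -- the vendor strings are the real CPUID ones, but both patch
strategies and their bodies are invented for illustration:

```python
def read_vendor():
    # Stand-in for executing CPUID leaf 0; a real JIT would run native code.
    return "GenuineIntel"

def patch_with_serializing_jump(code, offset, new_bytes):
    # Hypothetical strategy for CPUs that detect cross-modification cheaply.
    code[offset:offset + len(new_bytes)] = new_bytes
    return "patched-with-jump"

def patch_with_shootdown(code, offset, new_bytes):
    # Hypothetical strategy that asks every other thread to serialize first.
    code[offset:offset + len(new_bytes)] = new_bytes
    return "patched-with-shootdown"

STRATEGIES = {
    "GenuineIntel": patch_with_serializing_jump,
    "AuthenticAMD": patch_with_shootdown,
}

# Probed once, early; every later patch site just calls PATCH.
PATCH = STRATEGIES.get(read_vendor(), patch_with_shootdown)

code = bytearray(b"\x90" * 16)           # pretend code buffer full of NOPs
result = PATCH(code, 4, b"\xEB\x02")     # overwrite with a short JMP
print(result, code[4:6].hex())
```

The point is only the shape: one probe at startup, so the per-patch fast
path never has to branch on the vendor again.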
Andy 'Krazy' Glew
2010-06-13 20:08:29 UTC
Permalink
Post by MitchAlsup
(Glew's rule: any architectural property should work when caches are disabled.)
This rule will prevent bundling more than one instruction into a
single atomic sequence, should you ever want such a longer atomic
operation with multiple independent memory references. Thus something
like ASF can only work when the instructions and data are both
cacheable and both caches are turned on.
{Note: I am not disagreeing with the general principle involved, but
there are times.....}
That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
I have seen that use similar mechanisms. (E.g. many flavors of load-linked/store-conditional. N-way LLSC. etc.)

If you disable caches for debugging, they break.

However, you can fix this. As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
MitchAlsup
2010-06-13 21:26:49 UTC
Permalink
Post by Andy 'Krazy' Glew
That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
I have seen that use similar mechanisms.  (E.g. many flavors of load-linked/store-conditional.  N-way LLSC. etc.)
If you disable caches for debugging, they break.
However, you can fix this.  As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
So, what is your solution for building atomic primitives that
necessarily need multiple independent memory locations (up to 5)?

{When I looked at this, I could have stopped with DCASQ. But I
suspected that every year another atomic primitive would be requested.
So, it was either be in a position to continuously add instructions,
or deliver the cookbook to SW and let them figure out what they need.
So HW developers could get on with designing other things.}

But I have an even stiffer question: Why should the debugger allow
single stepping through a series of instructions that must appear to
be atomic?

{After all, once you are single stepping, other CPUs in the MP may
interact with something that needs to appear atomic at the other end.
Arguing that the debugger can catch all the other interacting
threads, shut them down for a while, single step, and then
finally wake up the other threads; will not uncover the actual cases
you will want to be debugging--the hard ones.

With ASF, by defining those memory locations that were participating
and those that were not, one could dump intermediate state to a memory
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}

Mitch
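Absent hardware support like ASF or DCASQ, the usual software fallback for
an atomic event spanning several independent locations is to serialize all
of them through one lock. A sketch of the 2-location "move an element" case
Mitch mentions -- with the obvious caveat that this serializes instead of
monitoring cache lines, so it is a semantic stand-in, not ASF:

```python
import threading

class MultiWordAtomic:
    """Several independent cells updated under a single lock."""

    def __init__(self, n):
        self._lock = threading.Lock()
        self.cells = [0] * n

    def move_element(self, src, dst):
        # Atomically "move" a value between two independent locations --
        # the 2-location case of the up-to-5-location events above.
        with self._lock:
            val = self.cells[src]
            self.cells[src] = 0
            self.cells[dst] = val

m = MultiWordAtomic(5)
m.cells[0] = 99
m.move_element(0, 3)
print(m.cells)
```

No observer who also takes the lock can ever see the value present in both
cells, or in neither -- which is exactly the guarantee the hardware
primitives are meant to give without the serialization.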
Andy 'Krazy' Glew
2010-06-13 23:04:57 UTC
Permalink
Post by MitchAlsup
Post by Andy 'Krazy' Glew
That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
I have seen that use similar mechanisms. (E.g. many flavors of load-linked/store-conditional. N-way LLSC. etc.)
If you disable caches for debugging, they break.
However, you can fix this. As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
So, what is your solution for building atomic primitives that
necessarily need multiple independent memory locations (up to 5)?
Why stop at 5?
Post by MitchAlsup
{When I looked at this, I could have stopped with DCASQ. But I
suspected that every year another atomic primitive would be requested.
So, it was either be in a position to continuously add instructions,
or deliver the cookbook to SW and let them figure out what they need.
So HW developers could get on with designing other things.}
But I have an even stiffer question: Why should the debugger allow
single stepping through a series of instructions that must appear to
be atomic?
I don't think it necessarily should.

Similar issues arise with transactional memory. Should a transaction be atomic wrt debugging and single stepping, or not?

Choices include

0) aborting the transaction - going to the top of the txn (doesn't work for single stepping)

1) single stepping, skipping all of the atomic region.

2) single stepping in the atomic region, with provision to track state involved as you have described, and/or shoot down
Post by MitchAlsup
{After all, once you are single stepping, other CPUs in the MP may
interact with something that needs to appear atomic at the other end.
Arguing that the debugger can catch all the other interacting
threads, shut them down for a while, single step, and then
finally wake up the other threads; will not uncover the actual cases
you will want to be debugging--the hard ones.
With ASF, by defining those memory locations that were participating
and those that were not, one could dump intermediate state to a memory
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}
Mitch
MitchAlsup
2010-06-14 14:51:40 UTC
Permalink
Post by Andy 'Krazy' Glew
Post by MitchAlsup
Post by Andy 'Krazy' Glew
That's been one of the problems I have had with ASF, and Transactional Memory, and any of the dozens of other proposals
I have seen that use similar mechanisms.  (E.g. many flavors of load-linked/store-conditional.  N-way LLSC. etc.)
If you disable caches for debugging, they break.
However, you can fix this.  As I said in my reply to Terje, you can have coherent (and ordered) UC memory.
So, what is your solution for building atomic primitives that
necessarily need multiple independent memory locations (up to 5)?
Why stop at 5?
We actually stopped at 7 -- the number of free cache miss buffers that
still allows forward progress on the other stuff the machine might be
up to, but the longest sequence of code that proved useful in large
synchronization events used 5 independent memory locations. Here we
could move an element in a concurrent data structure from one location
to another in a single atomic event.
Post by Andy 'Krazy' Glew
Post by MitchAlsup
{When I looked at this, I could have stopped with DCASQ. But I
suspected that every year another atomic primitive would be requested.
So, it was either be in a position to continuously add instructions,
or deliver the cookbook to SW and let them figure out what they need.
So HW developers could get on with designing other things.}
But I have an even stiffer question: Why should the debugger allow
single stepping through a series of instructions that must appear to
be atomic?
I don't think it necessarily should.
Similar issues arise with transactional memory.  Should a transaction be atomic wrt debugging and single stepping, or not?
Choices include
0) aborting the transaction - going to the top of the txn  (doesn't work for single stepping)
1) single stepping, skipping all of the atomic region.
2) single stepping in the atomic region, with provision to track state involved as you have described, and/or shoot down
ASF used the general philosophy of 1) above and by making a
distinction between memory locations participating in the atomic event,
and other memory locations not participating, we got 1) above and
enough of 2) to allow for print-level debug for after the fact
observation of what happened in the atomic event.

Mitch
unknown
2010-06-14 06:06:05 UTC
Permalink
Post by MitchAlsup
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}
Please excuse me, but DUH!

Are you saying that there are people who demand/require this ability?

I thought the definition of "atomic" was "indivisible", so obviously
even single-stepping code would have to treat such a block as a single
instruction, right?

Just like in the very old days where the single-step interrupt would
delay for an extra instruction if the first was one of those that were
defined to only occur as the first in a pair, i.e. like an update to the
stack segment register.

Until you wrote the paragraph above, I really didn't consider this to be
a problem. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2010-06-14 07:53:49 UTC
Permalink
Post by unknown
Post by MitchAlsup
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}
Please excuse me, but DUH!
Are you saying that there are people who demand/require this ability?
I thought the definition of "atomic" was "indivisible", so obviously
even single-stepping code would have to treat such a block as a single
instruction, right?
Think about it. There are millions of script-kiddies out there who
think they are programmers, and who can't debug even the simplest
code without a debugger to help. Some of those will want (and even
need) to write atomic sections. So how are they going to debug them?

Notice "they". I know how I would, and I can guess how you would.
Post by unknown
Just like in the very old days where the single-step interrupt would
delay for an extra instruction if the first was one of those that were
defined to only occur as the first in a pair, i.e. like an update to the
stack segment register.
That's NOT the very old days! In those, the debugger would misbehave
horribly. Yes, I know that we seem to have warped back there :-(

As an aside, I have been writing some revoltingly messy (inherently
so) code, and tried using a debugger again, as I do at intervals.
Bloody useless, as usual - not even a backtrace as soon as you make
an actual SIGSEGV mistake - dunno why, as some were read-only. So
back to the techniques I used in the 1960s and early 1970s ....


Regards,
Nick Maclaren.
MitchAlsup
2010-06-14 14:59:41 UTC
Permalink
That's NOT the very old days!  In those, the debugger would misbehave
horribly.  Yes, I know that we seem to have warped back there :-(
I remember trying to debug some messy interrupt code on a machine that
was not capable of taking a second interrupt during an interrupt.
Once you develop techniques for this stuff.....

Mitch
unknown
2010-06-14 15:39:27 UTC
Permalink
Post by MitchAlsup
Post by n***@cam.ac.uk
That's NOT the very old days! In those, the debugger would misbehave
horribly. Yes, I know that we seem to have warped back there :-(
I remember trying to debug some messy interrupt code on a machine that
was not capable of taking a second interrupt durring an interrupt.
Once you develop techniques for this stuff.....
I debugged IRQ handlers by first setting the display to a known text
mode, with a fixed (real) address, then the IRQ code could write info
directly to the frame buffer.

This worked even within the initial global interrupt disable portion.

You can do similar things with polled access to an RS232 port, writing
trace/debug to a monitoring machine, but this is _far_ slower than the
frame buffer access. OTOH it makes it easy to save the logs. :-)

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
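Terje's frame-buffer trick -- polled, allocation-free stores straight into
text-mode video memory -- can be sketched as follows. A bytearray stands in
for the real buffer at physical address 0xB8000; the 80x25 geometry and the
char-byte/attribute-byte cell layout are the classic VGA text-mode ones:

```python
COLS = 80
ROWS = 25
# Stand-in for the VGA text frame buffer at 0xB8000: one character byte
# plus one attribute byte per cell.
framebuf = bytearray(COLS * ROWS * 2)

def irq_trace(row, msg, attr=0x07):
    """Write a trace message to one row of the (simulated) text screen.

    No allocation, no OS calls -- the pattern works even inside an IRQ
    handler with interrupts globally disabled, exactly because it only
    performs plain stores.
    """
    off = row * COLS * 2
    for i, ch in enumerate(msg[:COLS]):
        framebuf[off + 2 * i] = ord(ch)      # character
        framebuf[off + 2 * i + 1] = attr     # attribute (grey on black)

irq_trace(0, "IRQ7 fired")
print(bytes(framebuf[0:8]))
```

The RS232 variant Terje mentions is the same idea with polled writes to the
UART data register instead of stores to the frame buffer.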
MitchAlsup
2010-06-14 14:54:39 UTC
Permalink
On Jun 14, 1:06 am, Terje Mathisen <"terje.mathisen at tmsw.no">
Post by unknown
Post by MitchAlsup
buffer that can be printed after the atomic event has transpired. It
seems to me there is no way to correctly give the illusion of
atomicity and allow single stepping through an atomic event.}
Please excuse me, but DUH!
{Please to excuse, but we had to explain this over and over and over.}
Post by unknown
Are you saying that there are people who demand/require this ability?
I thought the definition of "atomic" was "indivisible", so obviously
even single-stepping code would have to treat such a block as a single
instruction, right?
Yes, the whole atomic part of the block is either performed or not.
But the subtlety is that only those memory locations participating in
the atomic event are atomic. Other (non-participating) memory
locations are not necessarily atomic. This provides the leeway to dump
intermediate state to a buffer and print it to see what happened, and
why the atomic event failed.

Mitch
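Mitch's distinction between participating and non-participating locations
can be mimicked in software: buffer the participating stores until commit,
while a separate non-participating log records intermediate state that
survives even an abort and can be printed afterwards. Entirely illustrative
-- real ASF does this in hardware:

```python
class AtomicEvent:
    """Toy model: participating stores commit all-or-nothing; the log doesn't."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = {}   # participating stores, held until commit
        self.log = []       # non-participating: visible regardless of outcome

    def store(self, addr, val):
        self.pending[addr] = val
        self.log.append(f"store {addr} <- {val}")   # debug dump, not atomic

    def commit(self):
        self.memory.update(self.pending)            # all-or-nothing
        self.pending.clear()

    def abort(self):
        self.pending.clear()                        # memory untouched, log kept

mem = {0: 1, 1: 2}
ev = AtomicEvent(mem)
ev.store(0, 10)
ev.store(1, 20)
ev.abort()
print(mem, ev.log)
```

After the abort, mem is unchanged but the log still shows both intermediate
stores -- print-level debugging of an event that, atomically, never happened.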
Andy 'Krazy' Glew
2010-06-13 20:34:46 UTC
Permalink
Post by MitchAlsup
Post by Andy 'Krazy' Glew
Now, there are de-facto instruction fetch atomicity properties. But nothing official. (E.g. on Intel (since P6), and I
believe AMD, any ifetch entirely within an aligned 16 bytes is atomic. And Intel (since P6) will clear when the first
byte is written; i.e. Intel recognizes SMC immediately (as, I think, does AMD).)<snip>
But there is nothing official.
And there IS the problem. AMD Athlons and Opterons obey the self
modifying code checks wrt instruction fetch, but do not on third party
modifications of code simply because there is no spec as to what is
required (and the addresses to be checked are quite distant from the
addresses that need checking). {This was circa '07 and may have been
changed.} When we investigated this issue (circa '05) there was no
code sequence that would successfuly do this that would work on both
Intel and AMD machines {IBM java JIT compiler triggered the
investigation.} So, the compiler had to do a CPUID early and use
result this to pick a code sequence later, as needed.
Post by Andy 'Krazy' Glew
(Glew's rule: any architectural property should work when caches are disabled.)
This rule will prevent bundling more than one instruction into a
single atomic sequence, should you ever want such a longer atomic
operation with multiple independent memory references. Thus something
like ASF can only work when the instructions and data are both
cacheable and both caches are turned on.
{Note: I am not disagreeing with the general principle involved, but
there are times.....}
I just added this to

http://semipublic.comp-arch.net/wiki/Design_Principles_and_Rules_of_Thumb

If you want to add some...
Torben Ægidius Mogensen
2010-06-11 08:31:40 UTC
Permalink
Post by BGB / cr88192
well, x86 does have opcode encoding rules...
imm
reg
mem
reg, imm
mem, imm
reg, mem
mem, reg
reg, mem, imm
mem, reg, imm
XOP and AVX allow a few more register arguments.
Even within these limitations, you can have any number of implied
register operands (well, up to the number of registers the x86 has).
Additionally, you can split up an instruction into one or more
instructions that set up operands and one that performs the operation.
It is, of course, debatable if this is one instruction or two or more
instructions, but if they are required to be adjacent, I would still
count them as one.

Torben