Discussion:
indirection in old architectures
Anton Ertl
2023-12-29 17:20:43 UTC
Permalink
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.

The major question I have is why these architectures have this
feature.

The only use I can come up with for the arbitrarily repeated
indirection is the implementation of logic variables in Prolog.
However, Prolog was first implemented in 1970, and it did not become a
big thing until the 1980s (if then), so I doubt that this feature was
implemented for Prolog.
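
To make that concrete, here is a minimal C sketch of the dereference
loop a Prolog implementation performs on logic variables; the cell
layout and names are made up, not any particular system's, and the
flag stands in for the hardware indirect bit:

/* Made-up tagged cell: one flag plays the role of the indirect bit. */
struct cell {
    int          is_ref;   /* set: this cell just points at another cell */
    struct cell *ref;      /* next cell in the chain when is_ref is set  */
    long         value;    /* payload once the chain ends                */
};

/* Follow bound variables until a non-reference cell is reached; this is
   the same loop an indirect bit would perform in hardware on each access. */
static struct cell *deref(struct cell *c)
{
    while (c->is_ref)
        c = c->ref;
    return c;
}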

A use for a single indirection is the implementation of the memory
management in the original MacOS: Each dynamically allocated memory
block was referenced only from a single place (its handle), so that
the block could be easily relocated. Only the address of the handle
was freely passed around, and accessing the block then always required
double indirection. MacOS was implemented on the 68000, which did not
have the indirect bit; this demonstrates that the indirect bit is not
necessary for that. Nevertheless, such a usage pattern might be seen
as a reason to add the indirect bit. But is it enough?
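
As a rough C sketch of that handle pattern (not the real Memory
Manager API, just its shape): the block is reachable only through its
master pointer, so the allocator can relocate the block and patch that
one pointer, while clients always dereference twice.

#include <stdlib.h>
#include <string.h>

typedef void **Handle;                  /* address of the single master pointer */

Handle new_handle(size_t n)
{
    void **master = malloc(sizeof *master);  /* the fixed location           */
    if (master)
        *master = malloc(n);                 /* the movable block            */
    return master;
}

void compact(Handle h, size_t n)
{
    void *moved = malloc(n);                 /* pretend compaction moved it  */
    if (!moved)
        return;
    memcpy(moved, *h, n);
    free(*h);
    *h = moved;                              /* only the master pointer changes;
                                                every Handle holder still works */
}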

Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?

One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Scott Lurndal
2023-12-29 19:04:56 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
That's essentially accurate. The Burroughs medium systems
operands were described by an operand address that included
an 'address controller'. The address controller, a four-bit
field, specified two characteristics of the address; the
two-bit 'index' field contained the number of the index register
(there were three) to be used when calculating the final
address. The other two bits described how the data at the
final address should be treated by the processor
0b00 Unsigned Numeric Data [UN] (BCD)
0b01 Signed Numeric Data [SN] (BCD, first digit 0b1100 = "+", 0b1101 = '-').
0b10 Unsigned Alphanumeric Data [UA] (EBCDIC)
0b11 Indirect Address [IA]

Consider the operand 053251: this described an unsigned
numeric value starting at address 53251, with no indexing.

The operand 753251 described an address indexed by IX1
and of type 'indirect address', pointing to another
operand word (potentially resulting in infinite recursion,
which was detected by an internal timer that would terminate
the process when triggered).

The actual operand data type was determined by the
address controller of the first operand that was not
marked IA.
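
To make that concrete, a hypothetical C model of the resolution (not
Burroughs code; fetch_operand and the step limit are invented, the
limit standing in for the internal timer):

enum ctrl { UN, SN, UA, IA };              /* the two "type" bits            */

struct operand { enum ctrl type; unsigned addr; };

/* fetch_operand() is assumed to return the operand word stored at addr. */
extern struct operand fetch_operand(unsigned addr);

int resolve(struct operand op, unsigned *addr_out, enum ctrl *type_out)
{
    for (int steps = 0; steps < 10000; steps++) {
        if (op.type != IA) {               /* first non-IA operand wins      */
            *addr_out = op.addr;
            *type_out = op.type;
            return 0;
        }
        op = fetch_operand(op.addr);       /* follow the indirect address    */
    }
    return -1;                             /* "timer" fired: probable loop   */
}
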
Post by Anton Ertl
The major question I have is why these architectures have this
feature.
Primarily for flexibility in addressing without adding substantial
hardware support.
Post by Anton Ertl
The only use I can come up with for the arbitrarily repeated
indirection is the implementation of logic variables in Prolog.
The aforementioned system ran mostly COBOL code (with some BPL;
assemblers weren't generally provided to customers).
Post by Anton Ertl
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
Consider following a linked list to the final element as an
example usage.

The aforementioned system also had an SLL (Search Linked List)
instruction that would test each element for one of several conditions
and terminate the indirection when the condition was true.
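
Roughly, in C terms (just an analogy for the effect, not the actual
instruction semantics; the element layout is invented):

struct elem { struct elem *link; int key; };

/* Walk the chain until an element satisfies the condition or the list ends. */
struct elem *search_linked_list(struct elem *p, int wanted)
{
    while (p != 0 && p->key != wanted)
        p = p->link;
    return p;                   /* 0 if the condition never became true */
}
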
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
I guess I don't understand the question. It's just a pointer in
a linked list.
Lawrence D'Oliveiro
2024-01-17 05:28:46 UTC
Permalink
The [Burroughs] system ran mostly COBOL code (with some BPL;
assemblers weren't generally provided to customers).
For an interesting reason: privilege protection was enforced in software,
not hardware.
Scott Lurndal
2024-01-17 16:02:32 UTC
Permalink
[no assembler shipped to customers]
Post by Lawrence D'Oliveiro
The [Burroughs] system ran mostly COBOL code (with some BPL;
assemblers weren't generally provided to customers).
For an interesting reason: privilege protection was enforced in software,
not hardware.
Actually, that is not the case.

Burroughs had multiple lines of mainframes: small, medium and large.

Small systems (B1700/B1800/B1900) had a writeable control store, and the instruction set
would be dynamically loaded when the application was scheduled.

Medium systems were BCD systems (B[234][5789]xx), descended from the original line
of ElectroData Datatron systems that came with Burroughs' purchase of ElectroData
in the mid 1950s. They were designed to efficiently run COBOL code. These
are the systems I was referring to above. They had hardware-enforced
privilege protection.

Large systems (starting with the B5000/B5500) were stack systems
running ALGOL and ALGOL derivatives (DCALGOL, NEWP, etc.);
they also supported COBOL, Fortran, Basic, etc.

The systems you are thinking about were the Large systems. And
there were issues with that (a famous paper in the mid 1970s
showed how to set the 'compiler' flag on any application allowing
it to bypass security protections - put the application on a
tape, load it on an IBM system, patch the executable header,
and restore it on the Burroughs system).
John Levine
2023-12-29 19:36:00 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
More or less. Indirect addressing was always controlled by a bit in
the instruction. It was more common to have only a single level of
indirect addressing, just controlled by that instruction bit.
Multi-level wasn't much more useful and you had to have a way to break
address loops.
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
The CPU follows the indirect address chain to get the operand address
and then does the operation. On the PDP-10, this stores into the
word that FOO points to, perhaps after multiple indirections:

MOVEM AC,@FOO

while this stores into FOO itself:

MOVEM AC,FOO
Post by Anton Ertl
The major question I have is why these architectures have this
feature.
Let's say you want to add up a list of numbers and your machine
doesn't have any index registers. What else are you going to do?

Indirect addressing was a big improvement over patching the
instructions and index registers were too expensive for small
machines. The IBM 70x mainframes had index registers, the early DEC
PDP series didn't other than the mainframe-esque PDP-6 and -10. The
PDP-11 mini was a complete rethink a decade after the PDP-1 with eight
registers usable for indexing and no indirect addressing.
Post by Anton Ertl
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
They were also useful for argument lists, which were invariably in
memory on machines without a lot of registers (that is, all of them
before S/360 and the PDP-6). On many machines a Fortran subroutine call
would leave the return address in an index register and the addresses
of the arguments were in the words after the call. The routine would
use something like @3(X) to get the third argument. Nobody other than
maybe Lisp cared about reentrant or recursive code, and if the number
of arguments in the call didn't match the number the routine expected
and your program blew up, well, don't do that.

As you suggested, a lot of uses boiled down to providing a fixed
address for something that can move, so instructions could indirect
through that fixed address without having to load it into a register.

For most purposes, index registers do indirection better, and now that
everything has a lot of registers, you can use some of them for the
fixed->movable stuff like the GOT in Unix/Linux shared libraries.
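
A loose C analogy for that last point (not the actual ELF/GOT
mechanism, just its shape; bind_slot stands in for what the dynamic
linker does): code always goes through a fixed slot, and only the slot
is patched when the real object is bound or moved.

static int *got_slot;               /* fixed location the code indirects through */

void bind_slot(int *real_object)    /* stand-in for dynamic-linker fixup         */
{
    got_slot = real_object;
}

int read_object(void)
{
    return *got_slot;               /* one extra indirection per access          */
}
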
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Paul A. Clayton
2023-12-31 03:12:10 UTC
Permalink
Post by John Levine
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found.
[snip]
Post by John Levine
Post by Anton Ertl
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
[snip]
Post by John Levine
As you suggested, a lot of uses boiled down to providing a fixed
address for something that can move, so instructions could indirect
through that fixed address without having to load it into a register.
Paged virtual memory as commonly implemented introduces one level
of indirection at page (rather than word) granularity.
Virtualization systems using nested page tables introduce a second
indirection.

Hierarchical/multi-level page tables have multiple layers of
indirection: instead of the page table base pointer pointing to
a complete page table, it points to a typically page-sized array of
address and metadata entries, each of which points to a similar
array, eventually reaching the PTE.
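
A simplified C sketch of such a walk (x86-64-like geometry assumed:
4 levels, 512 entries per table, 4 KiB pages; read_entry and the entry
layout are illustrative only, not any real MMU's format):

#include <stdint.h>

#define LEVELS     4
#define IDX_BITS   9                          /* 512 entries per table      */
#define PAGE_SHIFT 12

/* read_entry() is assumed to return the 64-bit entry at index idx of the
   table whose physical base address is 'table'. */
extern uint64_t read_entry(uint64_t table, unsigned idx);

uint64_t translate(uint64_t root, uint64_t va)
{
    uint64_t table = root;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (va >> (PAGE_SHIFT + level * IDX_BITS))
                       & ((1u << IDX_BITS) - 1);
        uint64_t entry = read_entry(table, idx);  /* one indirection per level */
        if (!(entry & 1))                         /* present bit clear         */
            return 0;                             /* fault (simplified)        */
        table = entry & ~0xFFFull;                /* next table, or final page */
    }
    return table | (va & 0xFFF);                  /* physical address          */
}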

Even with page table caching (and workloads that play well with
this kind of virtual memory), this is not free but it can be
"cheap enough". Using large pages for virtual-physical to physical
translation can help a lot. Presumably having an OS bias placement
of its translation table pages into large quasi-pages would help
caching for VPA-to-PA, i.e., many VPAs used by the OS for paging
would be in the same large page (e.g., 2MiB for x86).

(Andy Glew had suggested using larger pages for intermediate nodes
rather than limiting such to the last node in a hierarchical page
table. This has the same level-reducing effect of huge pages that
short-circuit the translation indirection at the end but allows
eviction and permission control at base-page size, with the
consequent larger number of PTEs active if there is spatial
locality at huge page granularity. Such merely assumes that
locality potentially exists at the intermediate nodes rather than
exclusively at the last node. Interestingly, with such a page
table design one might consider having rather small pages; e.g., a
perhaps insane 64-byte base page size (at least for the tables)
would only provide 3 bits per level but each level could be
flattened to provide 6, 9, 12, etc. bits. Such extreme flexibility
may well not make sense, but it seems interesting to me.)
Post by John Levine
For most purposes, index registers do indirection better, and now that
everything has a lot of registers, you can use some of them for the
fixed->movable stuff like the GOT in Unix/linux shared libraries.
For x86-64 some of the segments can have non-zero bases, so these
provide an additional index register ("indirection").
MitchAlsup
2023-12-31 18:57:21 UTC
Permalink
Post by Paul A. Clayton
Post by John Levine
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found.
[snip]
Post by John Levine
Post by Anton Ertl
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
[snip]
Post by John Levine
As you suggested, a lot of uses boiled down to providing a fixed
address for something that can move, so instructions could indirect
through that fixed address without having to load it into a register.
Paged virtual memory as commonly implemented introduces one level
of indirection at page (rather than word) granularity.
Virtualization systems using nested page tables introduce a second
direction.
Hierarchical/multi-level page tables have multiple layers of
indirection where instead of a page table base pointer pointing to
a complete page table it points to a typically-page-sized array of
address and metadata entries where each entry points to a similar
array eventually reaching the PTE.
Even with page table caching (and workloads that play well with
this kind of virtual memory), this is not free but it can be
"cheap enough". Using large pages for virtual-physical to physical
translation can help a lot. Presumably having an OS bias placement
of its translation table pages into large quasi-pages would help
caching for VPA-to-PA, i.e., many VPAs used by the OS for paging
would be in the same large page (e.g., 2MiB for x86).
(Andy Glew had suggested using larger pages for intermediate nodes
rather than limiting such to the last node in a hierarchical page
table.
I had been thinking that, since my large-page translation tables have
a count of the number of pages, when forking off a new GuestOS
I would allocate the HyperVisor tables as a single 8GB large
page, and when it needs more, switch to a more treeified page
table. This leaves the second level of DRAM translation at one very
cacheable and TLB-able PTE, dramatically reducing the table-walking
overhead.

A single 8GB page mapping can allow access to one 8192B page up to
1M 8192B pages. Guest OS page tables can map any of these 8192B pages
to any virtual address it desires with permissions it desires.
Post by Paul A. Clayton
This has the same level-reducing effect of huge pages that
short-circuit the translation indirection at the end but allows
eviction and permission control at base-page size, with the
consequent larger number of PTEs active if there is spatial
locality at huge page granularity. Such merely assumes that
locality potentially exists at the intermediate nodes rather than
exclusively at the last node. Interestingly, with such a page
table design one might consider having rather small pages; e.g., a
perhaps insane 64-byte base page size (at least for the tables)
would only provide 3 bits per level but each level could be
flattened to provide 6, 9, 12, etc. bits. Such extreme flexibility
may well not make sense, but it seems interesting to me.)
Post by John Levine
For most purposes, index registers do indirection better, and now that
everything has a lot of registers, you can use some of them for the
fixed->movable stuff like the GOT in Unix/linux shared libraries.
For x86-64 some of the segments can have non-zero bases, so these
provide an additional index register ("indirection").
This has more to do with 16 registers being insufficient than indirection
(segmentation) being better.
MitchAlsup
2023-12-29 20:27:29 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this
feature.
It solves the memory access problem {arrays, nested arrays, linked lists, ...}.
The early machines had "insufficient" address generation means, and used
indirection as a trick to get around their limited memory addressing modes.
Post by Anton Ertl
The only use I can come up with for the arbitrarily repeated
indirection is the implementation of logic variables in Prolog.
However, Prolog was first implemented in 1970, and it did not become a
big thing until the 1980s (if then), so I doubt that this feature was
implemented for Prolog.
Some of the indirection machines had the indirection bit located in the
container at the address generated; others had the indirection in
the address calculation. In the case of the PDP-10 there was a time-
out counter, and there were applications that worked fine up to a
particular size and then simply failed when the indirection watchdog
counter kept "going off".
Post by Anton Ertl
A use for a single indirection is the implementation of the memory
management in the original MacOS: Each dynamically allocated memory
block was referenced only from a single place (its handle), so that
the block could be easily relocated. Only the address of the handle
was freely passed around, and accessing the block then always required
double indirection. MacOS was implemented on the 68000, which did not
have the indirect bit; this demonstrates that the indirect bit is not
necessary for that. Nevertheless, such a usage pattern might be seen
as a reason to add the indirect bit. But is it enough?
Two things: 1) the indirect bit is insufficient, 2) optimizing compilers
got to the point they were better at dereferencing linked lists than
the indirection machines were. {Reuse and all that rot.}
Post by Anton Ertl
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
Arrays, matrixes, scatter, gather, lists, queues, stacks, arguments,....
We did all sorts of infinite-indirect stuff in asm on the PDP-10 {KI}
when programming at college.

They went out of fashion when compilers got to the point they could
hold the intermediate addresses in registers and short circuit the
amount of indirection needed--improving performance due to accessing
fewer memory locations.

The large register files of RISC spelled their doom.
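
In C terms, the kind of transformation meant here is roughly the
following sketch: instead of re-walking a pointer chain on every
access (what the indirect bit did in hardware), the compiler loads the
intermediate address once and keeps it in a register.

struct node { struct node *next; int value; };

int sum_two(struct node *head)
{
    struct node *n = head->next;        /* chain followed once, kept in a register */
    return n->value + n->next->value;   /* no repeated walk from head              */
}
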
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
In the machines where the indirection was at the instruction level, this
was simple; in the machines where the indirection was at the target, it
was more difficult.
Post by Anton Ertl
- anton
Summary::

First the architects thought registers were expensive.
{Many doubled down with OP-Mem ISAs.}
The architects endowed memory addressing with insufficient capabilities.
{Many to satisfy the OP-Mem and Mem-OP ISAs they had imposed upon themselves.}
Then they added indirection to make up for insufficient addressing.
And then everyone waited until RISC showed up (1980) before realizing their
error in register counts.
{Along about this time, Compilers started getting good.}
John Levine
2023-12-29 21:59:25 UTC
Permalink
Post by MitchAlsup
Some of the indirection machines had indirection-bit located in the
container at the address generated, others had the indirection in
the address calculation. In the case of the PDP-10 there was a time-
out counter and there were applications that worked fine up to a
particular size, and then simply failed when the indirection watch
dog counter kept "going off".
No, that's what the GE 635 did, a watchdog timer reset each time it
started a new instruction. The PDP-6 and -10 could take an interrupt
each time it calculated an address and would restart the instruction
when the interrupt returned. This worked because unlike on the 635 the
address calculation didn't change anything. (Well, except for the ILDB
and IDPB instructions that needed the first part done flag. But I
digress.)

You could tell how long the time between clock interrupts was by
making an ever longer indirect address chain and seeing where your
program stalled. It wouldn't crash; it just stalled as the very long
address chain kept being interrupted and restarted. I'm not being
hypothetical here.
Post by MitchAlsup
Two things: 1) the indirect bit is insufficient, 2) optimizing compilers
got to the point they were better at dereferencing linked lists than
the indirection machines were. {Reuse and all that rot.}
More importantly, index registers are a lot faster than indirect
addressing and at least since the IBM 801, we have good algorithms to
do register scheduling.
Post by MitchAlsup
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
In the machines where the indirection is at the instruction level, this
was simple, in the machines where the indirection was at the target, it
was more difficult.
The indirection was always in the address word(s), not in the target.
It didn't matter if it was a load or a store.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
s***@alum.dartmouth.org
2024-01-01 20:31:28 UTC
Permalink
John Levine <***@taugh.com> wrote:

: More importantly, index registers are a lot faster than indirect
: addressing and at least since the IBM 801, we have good algorithms to
: do register scheduling.

Once upon a time saving an instruction was a big deal; the 801, and
RISC in general, was possible because memory got much cheaper.
Using index registers costs an extra instruction for loading the index
register.

Index registers were a scarce resource too (except for the Atlas) so
keeping all your pointers in index registers wasn't a good option
either.

sarr`
MitchAlsup
2024-01-04 01:36:35 UTC
Permalink
Post by s***@alum.dartmouth.org
: More importantly, index registers are a lot faster than indirect
: addressing and at least since the IBM 801, we have good algorithms to
: do register scheduling.
Once upon a time saving an instruction was a big deal; the 801, and
RISC in general, was possible because memory got much cheaper.
Using index registers costs an extra instrucion for loading the index
register.
Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions
as VAX and at 6× the frequency for a 4× improvement in performance.

Now, imagine a RISC ISA that only needs 1.1× as many instructions as
VAX with no degradation WRT operating frequency.
Post by s***@alum.dartmouth.org
Index registers were a scarce resource too (except for the Atlas) so
keeping all your pointers in index registers wasn't a good option
either.
sarr`
Lawrence D'Oliveiro
2024-01-17 06:34:55 UTC
Permalink
Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions as
VAX and at 6× the frequency for a 4× improvement in performance.
Mmm, maybe you got the last two multipliers the wrong way round?
Terje Mathisen
2024-01-17 07:14:34 UTC
Permalink
Post by Lawrence D'Oliveiro
Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions as
VAX and at 6× the frequency for a 4× improvement in performance.
Mmm, maybe you got the last two multipliers the wrong way round?
No, that seems correct: it needed 1.5 times as many instructions, so the
6X frequency must be divided by 1.5 for a final speedup of 4X.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-01-17 17:38:51 UTC
Permalink
Post by Lawrence D'Oliveiro
Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions as
VAX and at 6× the frequency for a 4× improvement in performance.
Mmm, maybe you got the last two multipliers the wrong way round?
Performance is in millions of instructions per second.

If the instruction count was 1.0× a 6× frequency would yield 6× gain.

So, since there were 1.5× as many instructions and 6× as many instructions per
second, 6 / 1.5 = 4×
EricP
2024-01-17 18:09:52 UTC
Permalink
Post by Lawrence D'Oliveiro
Mark Horowitz stated (~1983) MIPS executes 1.5× as many instructions as
VAX and at 6× the frequency for a 4× improvement in performance.
Mmm, maybe you got the last two multipliers the wrong way round?
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such. But no one had actually measured it.
When they finally did and found it was 0.5 MIPS they just changed to
calling that "1 VUP" or "VAX-780 Units of Processing".

This also showed up in the Dhrystone benchmarks:

https://en.wikipedia.org/wiki/Dhrystone_Results

"Another common representation of the Dhrystone benchmark is the
DMIPS (Dhrystone MIPS) obtained when the Dhrystone score is divided
by 1757 (the number of Dhrystones per second obtained on the VAX 11/780,
nominally a 1 MIPS machine)."

I suppose they should have changed that to DVUPS.

Stanford MIPS (16 registers) in 1984 ran at 4 MHz with a 5-stage pipeline.
The paper I'm looking at compares it to an 8 MHz 68000 and has
Stanford MIPS averaging 5 times faster on their Pascal benchmark.

The MIPS R2000 with 32 registers launched in 1986 at 8.3, 12.5 and 15 MHz.
It supposedly could sustain 1 reg-reg ALU operation per clock.
John Levine
2024-01-17 19:14:00 UTC
Permalink
Post by EricP
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such.
No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS. A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.
Post by EricP
When they finally did and found it was 0.5 MIPS they just changed to
calling that "1 VUP" or "VAX-780 Units of Processing".
Yeah, they got grief for the MIPS stuff.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
EricP
2024-01-17 19:32:45 UTC
Permalink
Post by John Levine
Post by EricP
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such.
No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS.
So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
Post by John Levine
A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.
And they define 1 VUP = 1 TOUP
Post by John Levine
Post by EricP
When they finally did and found it was 0.5 MIPS they just changed to
calling that "1 VUP" or "VAX-780 Units of Processing".
Yeah, they got grief for the MIPS stuff.
One just has to be careful comparing clock MIPS and VUPS.
John Levine
2024-01-17 19:55:11 UTC
Permalink
Post by EricP
Post by John Levine
Post by EricP
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such.
No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS.
So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
If you want. IBM mainframe MIPS was a well understood performance
measure at the time. In the mid 1970s, there were a few IBM clones
like Amdahl, but the other mainframe makers were already sinking into
obscurity. I can't think of anyone else making a 32 bit byte
addressable mainframe at the time that wasn't an IBM clone. I suppose
there were the Interdata machines but they were minis and sold mostly
for embedded realtime.
Post by EricP
Post by John Levine
A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.
And they define 1 VUP = 1 TOUP
Yes, but a TOUP really was an IBM MIPS.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
EricP
2024-01-17 20:19:05 UTC
Permalink
Post by John Levine
Post by EricP
Post by John Levine
Post by EricP
VAX-780 was 5 MHz, 200 ns clock and averaged 10 clocks per instruction
giving 0.5 MIPS. When it first came out they thought it was a 1 MIPS
machine and advertised it as such.
No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS.
So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
If you want. IBM mainframe MIPS was a well understood performance
measure at the time. In the mid 1970s, there were a few IBM clones
like Amdahl, but the other mainframe makers were already sinking into
obscurity. I can't think of anyone else making a 32 bit byte
addressable mainframe at the time that wasn't an IBM clone. I suppose
there were the Interdata machines but they were minis and sold mostly
for embedded realtime.
Post by EricP
Post by John Levine
A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.
And they define 1 VUP = 1 TOUP
Yes, but a TOUP really was an IBM MIPS.
Ok, but VAX-780 really was measured by DEC at 0.5 MIPS.
So either the assumption that a VUP = TOUP was wrong
or the assumption that a TOUP = MIPS was.

See section 5 and table 8.

Characterization of Processor Performance in the VAX-11/780, 1984
http://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf
John Levine
2024-01-17 20:27:45 UTC
Permalink
Post by EricP
Post by John Levine
Post by EricP
Post by John Levine
No, they knew how fast it was. It was about as fast as an IBM 370/158
which IBM rated at 1 MIPS.
So those were TOUPS or Three-seventy One-fifty-eight Units of Performance.
If you want. IBM mainframe MIPS was a well understood performance
measure at the time. In the mid 1970s, there were a few IBM clones
like Amdahl, but the other mainframe makers were already sinking into
obscurity. I can't think of anyone else making a 32 bit byte
addressable mainframe at the time that wasn't an IBM clone. I suppose
there were the Interdata machines but they were minis and sold mostly
for embedded realtime.
Post by EricP
Post by John Levine
A Vax instruction could do a lot more than
a 370 instruction so it wasn't implausible that the performance was
similar even though the instruction rate was about half.
And they define 1 VUP = 1 TOUP
Yes, but a TOUP really was an IBM MIPS.
Ok, but VAX-780 really was measured by DEC at 0.5 MIPS.
So either the assumption that a VUP = TOUP was wrong
or the assumption that a TOUP = MIPS was.
As I think I said above, a million IBM instructions did about as much
work as half a million VAX instructions. In that era, MIPS meant
either a million IBM instructions, or as some wag put it, Meaningless
Indication of Processor Speed.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Thomas Koenig
2024-01-17 22:37:00 UTC
Permalink
Post by John Levine
As I think I said above. a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
EricP
2024-01-17 23:14:06 UTC
Permalink
Post by Thomas Koenig
Post by John Levine
As I think I said above. a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
I was thinking the same thing. VAX address modes like auto-increment
would be equivalent to 2 instructions for each operand and likely used
in benchmarks.

VAX having 32-bit immediates and offsets and 64-bit float immediates
per operand vs 370 having to build constants or load them.

And POLY for transcendentals is one instruction.

All of those would add clocks to the VAX instruction execute time
but not its instruction count and MIPS.
Scott Lurndal
2024-01-17 23:56:24 UTC
Permalink
Post by EricP
Post by Thomas Koenig
Post by John Levine
As I think I said above. a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
I was thinking the same thing. VAX address modes like auto-increment
would be equivalent to 2 instructions for each operand and likely used
in benchmarks.
MOVC3 and MOVC5, perhaps?
Lawrence D'Oliveiro
2024-01-18 00:24:38 UTC
Permalink
Post by Scott Lurndal
MOVC3 and MOVC5, perhaps?
Interruptible instructions ... wot fun ...
Terje Mathisen
2024-01-18 09:08:47 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Scott Lurndal
MOVC3 and MOVC5, perhaps?
Interruptible instructions ... wot fun ...
REP MOVS is the classic x86 example: since all register usage is fixed
((r)SI, (r)DI, (r)CX), the CPU can always accept an interrupt at any point;
it just needs to update those three registers and take the interrupt.

When the instruction resumes, any remaining moves are performed.

This was actually an early 8086/8088 bug: if you had multiple prefix
bytes, like you would need if you were moving data to the Stack segment
instead of the Extra, and the encoding was REP SEGSS MOVS, then only the
last prefix byte was remembered in the saved IP/PC value.

I used to check for this bug by moving a block which was large enough
that it took over 55ms, so that a timer interrupt was guaranteed:

If the CX value wasn't zero after the instruction, then the bug had
happened.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
MitchAlsup1
2024-01-17 23:55:45 UTC
Permalink
Post by Thomas Koenig
Post by John Levine
As I think I said above. a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
Fancy addressing modes and instructions {indirection, pre-decrement,
post-increment, constants, displacements, indexing, ADD-CMP-branch,
CRC, bit manipulation, ...}.
You could say these contribute most of the gain.
John Levine
2024-01-18 02:05:30 UTC
Permalink
Post by Thomas Koenig
Post by John Levine
As I think I said above. a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
I don't think anyone used the fancy addressing modes or complex
instructions much. But here's an example. Let's say A, B, and C are
floats in addressable memory and you want to do A = B + C

370 code

LE R0,B
AE R0,C
STE R0,A

VAX code

ADDF3 B,C,A

The VAX may not be faster, but that's one instruction rather than 3.

Or that old Fortran favorite I = I + 1

370 code

L R1,I
LA R2,1
AR R1,R2
ST R1,I

VAX code

INCL I

or if you have a lousy optimizer

ADDL2 #1,I

or if you have a really lousy optimizer

ADDL3 #1,I,I

It's still one instruction rather than four.

In 370 code you often also needed extra instructions to make data
addressable since it had no direct addressing and address offsets in
instructions were only 12 bits.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Lawrence D'Oliveiro
2024-01-18 04:51:09 UTC
Permalink
Post by John Levine
ADDF3 B,C,A
The VAX may not be faster, but that's one instruction rather than 3.
If those were register operands, that instruction would be 4 bytes.

I think worst case, each operand could have an index register byte and a 4-byte
offset (in addition to the operand specifier byte), for a maximum
instruction length of 1 + 3 × 6 = 19 bytes.

So, saying “just one instruction” may not sound as good as you think.

Here’s an old example, from the VMS kernel itself. This instruction

PUSHR #^M<R0,R1,R2,R3,R4,R5>

pushes the first 6 registers onto the stack, and occupies just 2 bytes.
Whereas this sequence

PUSHL R5
PUSHL R4
PUSHL R3
PUSHL R2
PUSHL R1
PUSHL R0

does the equivalent thing, but takes up 2 × 6 = 12 bytes.

Guess which is faster?
EricP
2024-01-18 14:39:55 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by John Levine
ADDF3 B,C,A
The VAX may not be faster, but that's one instruction rather than 3.
If those were register operands, that instruction would be 4 bytes.
I think worst case, each operand could have an index register and a 4-byte
offset (in addition to the operand specifier byte), for a maximum
instruction length of 19 bytes.
The longest instruction I think might be an ADDH3 with two H-format
16-byte float immediates and an indexed destination with a 4-byte offset.

That should be something like 2 opcode, 1 opspec, 16 imm,
1 opspec, 16 imm, 1 opspec, 4 imm, 1 index = 42 bytes.

(Yes its a silly instruction but legal.)
Post by Lawrence D'Oliveiro
So, saying “just one instruction” may not sound as good as you think.
Here’s an old example, from the VMS kernel itself. This instruction
PUSHR #^M<R0,R1,R2,R3,R4,R5>
pushes the first 6 registers onto the stack, and occupies just 2 bytes.
Whereas this sequence
PUSHL R5
PUSHL R4
PUSHL R3
PUSHL R2
PUSHL R1
PUSHL R0
does the equivalent thing, but takes up 2 × 6 = 12 bytes.
Guess which is faster?
John Levine
2024-01-18 16:31:29 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by John Levine
ADDF3 B,C,A
The VAX may not be faster, but that's one instruction rather than 3.
If those were register operands, that instruction would be 4 bytes.
I think worst case, each operand could have an index register and a 4-byte
offset (in addition to the operand specifier byte), for a maximum
instruction length of 19 bytes.
So, saying “just one instruction” may not sound as good as you think.
I wasn't saying they were always better, just pointing out that there
were straightforward reasons that 500K VAX instructions could do the
same work as 1M 370 instructions.

Considering that the 370 is still alive and the VAX died decades ago,
it should be evident that instruction count isn't a very useful
metric across architectures.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Michael S
2024-01-19 14:23:09 UTC
Permalink
On Thu, 18 Jan 2024 16:31:29 -0000 (UTC)
Post by John Levine
Post by Lawrence D'Oliveiro
Post by John Levine
ADDF3 B,C,A
The VAX may not be faster, but that's one instruction rather than 3.
If those were register operands, that instruction would be 4 bytes.
I think worst case, each operand could have an index register and a
4-byte offset (in addition to the operand specifier byte), for a
maximum instruction length of 19 bytes.
So, saying “just one instruction” may not sound as good as you
think.
I wasn't saying they were always better, just pointing out that there
were straightforward reasons that 500K VAX instructions could do the
same work as 1M 370 instructions.
Considering that the 370 is still alive and the VAX died decades ago,
it should be evident that instruction count isn't a very useful
metric across architectures.
That's not totally fair.
S/360 continually reinvents itself. VAX could have done the same, but
voluntarily refused.
Anton Ertl
2024-01-19 16:43:30 UTC
Permalink
Post by Michael S
S/360 permanently reinvents itself. VAX could have done the same, but
voluntarily refused.
Voluntarily? The VAX 9000 project cost DEC billions. Ok, one can
imagine an alternative history where DEC has decided to avoid
switching to MIPS and Alpha, and where they would have followed up the
NVAX (which seems to be pipelined, but not superscalar, i.e., like the
486) with eventually an OoO implementation, and from then on might
have had an easier time competing with RISCs.

The question is how many customers would have defected to RISC-based
systems in the meantime, and if DEC could have survived competition
from ever more capable PCs that eliminated the RISC workstation
market and the RISC server market.

IBM z and i survive because of a legacy of system-specific software
(written in assembly or using other system-specific features); the
additional hardware cost is an acceptable price for being able to
continue to use this software.

Many VAX customers were flexible enough to switch to something else
when VAX was no longer competitive (that's why DEC did the MIPS-based
DECstations), so I doubt that DEC would have survived in the
alternative history I outlined, at least as a significant manufacturer
rather than a niche manufacturer like Unisys.

One interesting aspect is that NVAX was only released in 1991, while
the 486 was released in 1989, and the MIPS R2000 in 1986, so the VAX
instruction set did have a cost.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Dallman
2024-01-19 19:58:00 UTC
Permalink
Post by Anton Ertl
Voluntarily? The VAX 9000 project cost DEC billions. Ok, one can
imagine an alternative history where DEC has decided to avoid
switching to MIPS and Alpha, and where they would have followed up
the NVAX (which seems to be pipelined, but not superscalar, i.e., like
the 486) with eventually an OoO implementation, and from then on might
have had an easier time competing with RISCs.
The timeline doesn't work. DEC decided to adopt MIPS in 1989, because
they were losing market share worryingly quickly. NVAX was released in
1991, and they'd have had real trouble developing it without the cash
from MIPS-based systems.

<https://en.wikipedia.org/wiki/DEC_Alpha#PRISM>

They opted for Alpha because they felt VAX had enough overheads that it
would always be at a disadvantage compared to RISC chips. That is less
obvious now, but that's because of the huge amounts of money that have
gone into x86 development over the last thirty years. DEC's market for
VAX systems was much smaller than the market for x86 in 1995-2010.

<https://en.wikipedia.org/wiki/DEC_Alpha#RISCy_VAX>

John
Anton Ertl
2024-01-20 09:10:00 UTC
Permalink
Post by John Dallman
Post by Anton Ertl
Voluntarily? The VAX 9000 project cost DEC billions. Ok, one can
imagine an alternative history where DEC has decided to avoid
switching to MIPS and Alpha, and where they would have followed up
the NVAX (which seems to be pipelined, but not superscalar, i.e., like
the 486) with eventually an OoO implementation, and from then on might
have had an easier time competing with RISCs.
The timeline doesn't work. DEC decided to adopt MIPS in 1989, because
they were loosing market share worryingly quickly. NVAX was released in
1991, and they'd have had real trouble developing it without the cash
from MIPS-based systems.
I forgot that in this alternative reality DEC would have killed the
VAX 9000 project early, leaving them lots of cash for developing
NVAX. Still, it could easily have been that they would have lost
customers to the RISC competition until they finally managed to do the
OoO-VAX.

Would they have gotten those customers back, or would they have lost
to IA-32/AMD64 anyway? Probably the latter, unless they found a
business model that allowed them to milk the customer base that was
tied to VAX while at the same time being cheap enough to compete with
Intel. They tried to go for that on the Alpha: they used firmware for
market segmentation between VMS/Digital OSF/1 on the one hand and
Linux/Windows on the other; and they also offered some relatively
cheap boards, e.g. with the 21164PC, but those were probably too
limited to be successful.
Post by John Dallman
That is less
obvious now, but that's because of the huge amounts of money that have
gone into x86 development over the last thirty years. DEC's market for
VAX systems was much smaller than the market for x86 in 1995-2010.
For developing an OoO-VAX the relevant time is 1985-1995 (HPS wrote
their papers on OoO (with VAX as example) starting in 1985, the
Pentium Pro appeared in 1995). Of course, for OoO-VAXes to succeed in
the market, the relevant timespan was 1995-2005. Intel dropped the
64-bit IA-32 successor ball and AMD picked it up with the 2003
releases of Opteron and Athlon64.

VAX would have been extended to 64 bits some time in the early 1990s
in the alternative timeline, and DEC would have been tempted to use
the 64-bit extension for market segmentation, which again could have
resulted into DEC painting itself into a niche.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Dallman
2024-01-20 16:25:00 UTC
Permalink
Post by Anton Ertl
Post by John Dallman
The timeline doesn't work. DEC decided to adopt MIPS in 1989,
because they were loosing market share worryingly quickly.
NVAX was released in 1991, and they'd have had real trouble
developing it without the cash from MIPS-based systems.
I forgot that in this alternative reality DEC would have killed the
VAX 9000 project early, leaving them lots of cash for developping
NVAX. Still, it could easily have been that they would have lost
customers to the RISC competition until they finally managed to do
the OoO-VAX.
their papers on OoO (with VAX as example) starting in 1985, the
Pentium Pro appeared in 1995). Of course, for OoO-VAXes to succeed
in the market, the relevant timespan was 1995-2005. Intel dropped
the 64-bit IA-32 successor ball and AMD picked it up with the 2003
releases of Opteron and Athlon64.
This requires DEC to take notice of those papers and start developing OoO
quite quickly. They did not do that historically, and they seem to have
been confident that their way of working would carry on being effective,
until RISC demonstrated otherwise. This is the timeframe where IBM gave
up on building mainframes with competitive compute power, and settled for
them being capable data-movers.

If DEC go OoO and build an OoO Micro-VAX CPU by about 1988, they can get
somewhere. The MicroVAX 78032 of 1985 was 125K transistors; the 80386 was
275K transistors the same year, and the 80486 was 1.2M transistors in 1989,
so the transistor budget could be there.
Post by Anton Ertl
Would they have gotten those customers back, or would they have lost
to IA-32/AMD64 anyway? Probably the latter, unless they found a
business model that allowed them to milk the customer base that was
tied to VAX while at the same time being cheap enough to compete
with Intel.
I had experience of dealing with DEC from two different market segments.
In the early 1990s, I was working for a company based around MS-DOS
software. That was running pretty fast on 486 and Pentium machines. We
had contact with DEC because one of our large customers had DEC as their
primary IT supplier, and one of our managers had bought a DEC PC, from a
company who realised he was ignorant and unloaded obsolete hardware on
him at high prices.

If you weren't a major DEC customer, they were hell to deal with. They
just didn't do things, even after agreeing to do so. They charged
ludicrous prices for minor things. We needed a replacement key for the
anti-tamper lock on the DEC PC, because the chap had lost it. They were free,
but the delivery charge was about $60, by cab. Getting them to just post
it took a lengthy argument.

Getting a replacement Pentium for one that had the FDIV bug required
compiling a log of weeks of broken promises from the parts centre and
faxing it to DEC's personnel department, asking for it to be placed on
the relevant manager's file and considered at his next performance review.
We couldn't just get one from Intel: the necessary heat sink was
permanently bonded to the old chip, so we needed a new one with DEC's
specific heatsink.

At the customer who had DEC as an IT supplier, DEC staff didn't know
anything about PCs or MS-DOS. They only knew VMS, which seemed weird and
arcane to us, but the DEC staff were sure it was infinitely superior, and
could not explain why. They really did not make DEC seem attractive as a
supplier.

Then I changed jobs in 1995 to a company that supplied software for VAX
VMS, Alpha VMS, OSF/1 on Alpha and Windows on Alpha. Dealing with DEC
from there was much better. They were capable, helpful and efficient. But
they still didn't understand PCs, and Windows NT was effective at running
complex software and was far cheaper and more attractive to PC users than
VMS.

The OoO VAX alternate history changes a lot of things. It means PRISM
doesn't start, and the multiple-personality OS concept that became MICA
may or may not happen. The lack of a PRISM+MICA cancellation means Dave
Cutler probably doesn't move to Microsoft, and then Windows NT doesn't
happen, at least not in the same way.

The Mac still causes a shift to GUIs. If DEC can come up with, or buy in,
a good one then they may do very well, and Microsoft may not become
nearly so important. That would reduce the importance of Intel, which
might mean IA-64 never happens.
Post by Anton Ertl
They tried to go for that on the Alpha: they used firmware for
market segmentation between VMS/Digital OSF/1 on the one hand and
Linux/Windows on the other; and they also offered some relatively
cheap boards, e.g. with the 21164PC, but those were probably too
limited to be successful.
Producing software for Alpha Windows was reasonably straightforward, if
you had well-behaved software written in a HLL that there were compilers
for. This meant that people who were coming down from the Unix world
didn't have much trouble. Going upwards from the MS-DOS/Windows world was
harder: you couldn't hit the hardware, you had to rewrite any assembler
code, and FX!32 wasn't quite as good as it was cracked up to be. Alpha
Windows software was worth producing until about 1998, when its
performance advantage evaporated.
Post by Anton Ertl
VAX would have been extended to 64 bits some times in the early
1990s in the alternative timeline, and DEC would have been tempted
to use the 64-bit extension for market segmentation, which again
could have resulted into DEC painting itself into a niche.
Yup. Really, you have to get the traditional DEC management to all retire
before 1990, and the new management need to be brave /and/ lucky.

John
Anton Ertl
2024-01-20 18:15:43 UTC
Permalink
Post by John Dallman
Post by Anton Ertl
their papers on OoO (with VAX as example) starting in 1985, the
Pentium Pro appeared in 1995). Of course, for OoO-VAXes to succeed
in the market, the relevant timespan was 1995-2005. Intel dropped
the 64-bit IA-32 successor ball and AMD picked it up with the 2003
releases of Opteron and Athlon64.
This requires DEC to take notice of those papers and start developing OoO
quite quickly.
...
Post by John Dallman
If DEC go OoO and build an OoO Micro-VAX CPU by about 1988, they can get
somewhere. The MicroVAX 78032 of 1985 was 125K transistors; the 80386 was
275K transistors the same year, the 40486 was 1.2M transistors in 1989,
so the transistor budget could be there.
Not in a single chip. The CPU die of the Pentium Pro has 5.5M
transistors and was available in 1995. Nobody else was much earlier
on OoO, even with the RISC advantage. If DEC had picked up the HPS
ideas and invented what's missing from there, they might have had the
OoO VAX as a multi-chip thing in the early 1990s, and maybe gotten it
on a single chip by 1995. But its performance in the early 1990s
would have been great, so it could have won back customers.
Post by John Dallman
Yup. Really, you have to get the traditional DEC management to all retire
before 1990, and the new management need to be brave /and/ lucky.
Yes, you would basically need to have a whole bunch of managers and
tech team leaders take a time machine from, say, today, so they know
where to go, and they still would need to make and enforce good
decisions to make the company succeed in the long term rather than
painting itself into a corner by maximizing short-term revenue.

Your story about your experiences with DEC reminds me of one statement I
once read: DEC buys X, and the result is DEC. Compaq buys DEC, and the
result is DEC (as in, the DEC attitude won over the Compaq attitude).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Michael S
2024-01-20 20:19:51 UTC
Permalink
On Sat, 20 Jan 2024 18:15:43 GMT
Post by Anton Ertl
You story about your experiences with DEC remind me of one statement I
once read: DEC buy X, and the result is DEC. Compaq buys DEC, and the
result is DEC (as in, the DEC attitude won over the Compaq attitude).
- anton
But later on HP bought Compaq, and eventually the computing side of the
business became indistinguishable from Compaq. Both the DEC and HP parts
have already dissolved. The ex-SGI side is still hanging on, but likely not for long.
John Levine
2024-01-20 19:50:30 UTC
Permalink
Post by John Dallman
Yup. Really, you have to get the traditional DEC management to all retire
before 1990, and the new management need to be brave /and/ lucky.
DEC never really understood what business they were in. They had a pretty good
run selling hardware that was cheap and reliable, with software that was adequate.
But more often than not it was used with other software, Compuserve's system
and Tenex on the -10, and Unix on the -11 and Vax.

That worked fine while minicomputers were the cheapest way to do small scale
computing. Once micros came in, they weren't able to produce chips that
competed on their own (as opposed to being slightly cheaper versions of
their minis) and they deluded themselves that they could lock people in
with VMS the way IBM did with DOS and OS and AS/400.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Lawrence D'Oliveiro
2024-01-20 21:36:37 UTC
Permalink
Post by John Levine
DEC never really understood what business they were in.
They were a company run by engineers, selling to engineers and others
who understood technical stuff. That was a great business model from the
introduction of the PDP-1 in 1959 up to the coming of RISC and the IBM PC,
mid-1980s. That was a pretty good run, until you had to start thinking
about remaking yourself, which they had trouble doing.
Lawrence D'Oliveiro
2024-01-20 21:33:11 UTC
Permalink
... Dave Cutler probably doesn't move to Microsoft, and then Windows NT
doesn't happen, at least not in the same way.
Imagine if it hadn’t been created by a Unix-hater. But then, Microsoft had
already divested themselves of Xenix by then, hadn’t they? So they
probably didn’t have anyone left who understood the value of Unix.
Michael S
2024-01-21 13:19:22 UTC
Permalink
On Sat, 20 Jan 2024 21:33:11 -0000 (UTC)
Post by Lawrence D'Oliveiro
... Dave Cutler probably doesn't move to Microsoft, and then
Windows NT doesn't happen, at least not in the same way.
Imagine if it hadn’t been created by a Unix-hater. But then,
Microsoft had already divested themselves of Xenix by then, hadn’t
they? So they probably didn’t have anyone left who understood the
value of Unix.
I see nothing wrong in DC being a Unix hater.
Much, much worse was that he didn't understand that it was not the 1970s
any more and that in the 1990s plug&play support was a necessity,
including "hot" plug&play.
Because of that blind spot, the Win9x line, created by people who did
understand the value of plug&play (Brad Silverberg? I can't find much
info about the lead 9x architects on the Net) but very problematic
otherwise, lasted for much longer than it should have.
Lawrence D'Oliveiro
2024-01-21 21:28:13 UTC
Permalink
Post by Michael S
On Sat, 20 Jan 2024 21:33:11 -0000 (UTC)
Post by Lawrence D'Oliveiro
... Dave Cutler probably doesn't move to Microsoft, and then Windows
NT doesn't happen, at least not in the same way.
Imagine if it hadn’t been created by a Unix-hater. But then, Microsoft
had already divested themselves of Xenix by then, hadn’t they? So they
probably didn’t have anyone left who understood the value of Unix.
I see nothing wrong in DC being Unix hater.
WSL might not have been necessary. Microsoft would not now be struggling
to offer some semblance of Linux compatibility.
EricP
2024-01-21 13:43:30 UTC
Permalink
Post by John Dallman
Post by Anton Ertl
Post by John Dallman
The timeline doesn't work. DEC decided to adopt MIPS in 1989,
because they were losing market share worryingly quickly.
NVAX was released in 1991, and they'd have had real trouble
developing it without the cash from MIPS-based systems.
I forgot that in this alternative reality DEC would have killed the
VAX 9000 project early, leaving them lots of cash for developing
NVAX. Still, it could easily have been that they would have lost
customers to the RISC competition until they finally managed to do
the OoO-VAX.
(Papers on OoO, with the VAX as example, started appearing in 1985;
the Pentium Pro appeared in 1995.) Of course, for OoO-VAXes to succeed
in the market, the relevant timespan was 1995-2005. Intel dropped
the 64-bit IA-32 successor ball and AMD picked it up with the 2003
releases of Opteron and Athlon64.
This requires DEC to take notice of those papers and start developing OoO
quite quickly. They did not do that historically, and they seem to have
been confident that their way of working would carry on being effective,
until RISC demonstrated otherwise. This is the timeframe where IBM gave
up on building mainframes with competitive compute power, and settled for
them being capable data-movers.
If DEC go OoO and build an OoO Micro-VAX CPU by about 1988, they can get
somewhere. The MicroVAX 78032 of 1985 was 125K transistors; the 80386 was
275K transistors the same year, and the 80486 was 1.2M transistors in 1989,
so the transistor budget could be there.
There was also the CVAX in 1986, 134,000 transistors (out of 180,000 sites),
2um CMOS, 3 layers of interconnect, 90 ns clock, internal 1 kB 2-way associative cache.
Separate FPU coprocessor chip 65,000 transistors.

But these were only available in systems like the 6240: quad SMP processors,
256 kB L2 cache, up to 256 MB main memory, and up to 6 high-speed I/O buses,
in multiple cabinets.
Lawrence D'Oliveiro
2024-01-19 20:51:31 UTC
Permalink
Post by John Levine
Considering that the 370 is still alive and the VAX died decades ago,
it should be evident that instruction count isn't a very useful metric
across architectures.
The 360/370/xx/3090/yy/zSeries line only survives because of business
“legacy” deployments. It was never a performance-oriented architecture
(witness the trouncing by CDC). It is long obsolete, and those deployments
are dwindling, if not circling the plughole.

VAX was the next step forward in the “supermini” and later “workstation”
categories, and these were definitely about price-performance. So when
other better technologies came along, they rendered it obsolete, fairly
quickly.
EricP
2024-01-18 14:19:44 UTC
Permalink
Post by John Levine
Post by Thomas Koenig
Post by John Levine
As I think I said above, a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
I don't think anyone used the fancy addressing modes or complex
instructions much. But here's an example. Let's say A, B, and C are
floats in addressable memory and you want to do A = B + C
370 code
LE R0,B
AE R0,C
STE R0,A
VAX code
ADDF3 B,C,A
The VAX may not be faster, but that's one instruction rather than 3.
Or that old Fortran favorite I = I + 1
370 code
L R1,I
LA R2,1
AR R1,R2
ST R1,I
VAX code
INCL I
or if you have a lousy optimizer
ADDL2 #1,I
or if you have a really lousy optimizer
ADDL3 #1,I,I
It's still one instruction rather than four.
In 370 code you often also needed extra instructions to make data
addressable since it had no direct addressing and address offsets in
instructions were only 12 bits.
VAX Fortran77 could optimize a DO loop array index to an autoincrement,
I think they called it strength reduction of loop induction variables.

do i = 1, N
A(i) = A(i) + B(i)
end do

ADDD (rB)+, (rA)+

VAX usage stats for the Basic, Bliss, Cobol, Fortran, Pascal, and PL/1 compilers
show a usage frequency per operand specifier of ~4% for autoincrement and ~7% for index,
except that Basic has 17% for autoincrement.
There is almost no usage of deferred addressing (address of address of data).
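To make the strength reduction above concrete, here is a minimal C sketch
(mine, not from any DEC compiler documentation) of what the transformation
does: the loop index disappears and the array references become pointer
post-increments, which is exactly what the VAX (rA)+/(rB)+ autoincrement
specifiers encode in a single instruction.
 
/* before: for (i = 0; i < n; i++) a[i] = a[i] + b[i]; */
void add_arrays(double *a, const double *b, long n)
{
    while (n-- > 0)
        *a++ += *b++;     /* one ADDD (rB)+,(rA)+ per iteration */
}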
Lawrence D'Oliveiro
2024-01-18 21:01:19 UTC
Permalink
Post by EricP
do i = 1, N
A(i) = A(i) + B(i)
end do
ADDD (rB)+, (rA)+
... set up rA, rB, rI ...
BRB $9000
$1000:
ADDD (rB)+, (rA)+
$9000:
SOBGEQ rI, $1000

Why use SOBGEQ with the branch instead of SOBGTR? That way, if N =
0, the loop body never executes at all.
John Levine
2024-01-18 22:19:04 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by EricP
do i = 1, N
A(i) = A(i) + B(i)
end do
ADDD (rB)+, (rA)+
... set up rA, rB, rI ...
BRB $9000
ADDD (rB)+, (rA)+
SOBGEQ rI, $1000
Why use SOBGEQ with the branch instead of SOBGTR? That way, if N =
0, the loop body never executes at all.
Ah, that must have been a Fortran 77 or later DO loop. In Fortran 66 the
loop usually ran once regardless.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Terje Mathisen
2024-01-19 06:28:57 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by EricP
do i = 1, N
A(i) = A(i) + B(i)
end do
ADDD (rB)+, (rA)+
... set up rA, rB, rI ...
BRB $9000
ADDD (rB)+, (rA)+
SOBGEQ rI, $1000
Why use SOBGEQ with the branch instead of SOBGTR? That way, if N =
0, the loop body never executes at all.
This is the kind of tiny loop body where I would have considered
replacing the initial BRB $9000 with a dummy instruction (like a compare
reg with immediate) where the immediate value contained the ADDD loop body.

This assumes of course that such a dummy opcode would (on average) be
faster than a taken forward branch!

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Anton Ertl
2024-01-18 16:35:45 UTC
Permalink
Post by Thomas Koenig
Post by John Levine
As I think I said above, a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
It's interesting that these are the features you are thinking of,
especially because the IBM 801 research and the RISC research showed
that fancy addressing modes are rarely used. Table 4 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
shows that addressing modes that the S/360 or even MIPS does not
support are quite rare:

%
Auto-inc. (R)+ 2.1
Disp. Deferred @D(R) 2.7
Absolute @(PC) 0.6
Auto-inc.def. @(R)+ 0.3
Auto-dec. -(R) 0.9

for a total of 6.6% of the operand specifiers; there are about 1.5
operand specifiers per instruction (Table 3), so that's ~0.1 operand
specifier with a fancy addressing mode per instruction.

Back to why S/360 has more instructions than VAX, John Levine gave a
good answer.

One aspect (partially addressed by John Levine, but not discussed
explicitly) is that the VAX is a three-address machine, while S/360 is
a two-address machine, so the S/360 occasionally needs reg-reg moves
where VAX does not. Plus, S/360 usually requires one of its two
operands to be a register, so in some cases an additional load is
necessary on the S/360 that is not needed on the VAX.

Among the complex VAX instructions CALL/RET and multi-register push
and pop constitute 3.22% of the instructions according to Table 1 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
I expect that these correspond to multiple instructions on the S/360.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
EricP
2024-01-18 18:07:52 UTC
Permalink
Post by Anton Ertl
Post by Thomas Koenig
Post by John Levine
As I think I said above, a million IBM instructions did about as much
work as half a million VAX instructions.
Why the big difference? Were fancy addressing modes really used so
much? Or did the code for the VAX mostly run POLY instructions? :-)
It's interesting that these are the features you are thinking of,
especially because the IBM 801 research and the RISC research showed
that fancy addressing modes are rarely used. Table 4 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
shows that addressing modes that the S/360 or even MIPS does not
%
Auto-inc. (R)+ 2.1
Auto-dec. -(R) 0.9
for a total of 6.6% of the operand specifiers; there are about 1.5
operand specifiers per instruction (Table 3), so that's ~0.1 operand
specifier with a fancy addressing mode per instruction.
Back to why S/360 has more instructions than VAX, John Levine gave a
good answer.
One aspect (partially addressed by John Levine, but not discussed
explicitly) is that the VAX is a three-address machine, while S/360 is
a two-address machine, so the S/360 occasionally needs reg-reg moves
where VAX does not. Plus, S/360 usually requires one of its two
operands to be a register, so in some cases an additional load is
necessary on the S/360 that is not needed on the VAX.
Among the complex VAX instructions CALL/RET and multi-register push
and pop constitute 3.22% of the instructions according to Table 1 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
I expect that these correspond to multiple instructions on the S/360.
- anton
There is also a different paper with slightly different stats that,
amongst other things, shows address mode usage by compiled language.

A Case Study of VAX-11 Instruction Set Usage For Compiler Execution
Wiecek, 1982
https://dl.acm.org/doi/pdf/10.1145/960120.801841
John Levine
2024-01-18 19:08:47 UTC
Permalink
Post by Anton Ertl
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
shows that addressing modes that the S/360 or even MIPS does not
%
Auto-inc. (R)+ 2.1
Auto-dec. -(R) 0.9
That's not entirely fair. The VAX has an immediate address mode that
could encode constant values from 0 to 63. Both papers said it was
about 15% so it was definitely a success. The 370 had sort of a split
personality, a shotgun marriage of a register scientific machine
and a memory-to-memory commercial machine. There were a bunch of
instructions with immediate operands but they all were a one byte
immediate and a memory location. Hence the extra LA instructions to
get immediates into registers.

Both papers said the index mode, which added a scaled register to an
address computed any other way, was about 6% which was higher than I
would have expected. The 370 has a similar base+displacement+index
which I hear is almost never used.
Post by Anton Ertl
Among the complex VAX instructions CALL/RET and multi-register push
and pop constitute 3.22% of the instructions according to Table 1 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
I expect that these correspond to multiple instructions on the S/360.
The VAX had an all-singing, all-dancing CALLS/RET that saved registers
and set up a stack frame, and a simple JSB/RSB that just pushed the
return address and jumped. CALLS was extremely slow and did far
more than was usually needed so for the most part it was only used
for inter-module calls that had to use the official calling sequence,
and JSB for everything else.

The VAX instruction set was overoptimized for code size and a
simplistic idea of easy programming which meant among other things
that a fancy instruction was often slower than the equivalent sequence
of simple instructions, and a lot of the fancy instructions weren't
used very much.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Lawrence D'Oliveiro
2024-01-18 20:55:15 UTC
Permalink
The 370 had sort of a split personality, a shotgun marriage of a
register scientific machine and a memory-to-memory commercial machine.
That pins things down quite narrowly as to when it came into being,
doesn’t it? Up to about that point, “scientific” and “business” computing
were considered to be separate worlds, needing their own hardware and
software, and never the twain shall meet.
John Levine
2024-01-18 22:16:11 UTC
Permalink
Post by Lawrence D'Oliveiro
The 370 had sort of a split personality, a shotgun marriage of a
register scientific machine and a memory-to-memory commercial machine.
That pins things down quite narrowly as to when it came into being,
doesn’t it? Up to about that point, “scientific” and “business” computing
were considered to be separate worlds, needing their own hardware and
software, and never the twain shall meet.
Yes, the whole point of S/360 was to produce a unified architecture that IBM
could sell to all of their customers.

It may have been a shotgun marriage, but it's been a very long lasting one.

You can still run most S/360 application code unmodified on the latest zSeries.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Michael S
2024-01-19 14:40:19 UTC
Permalink
On Thu, 18 Jan 2024 22:16:11 -0000 (UTC)
Post by John Levine
Post by Lawrence D'Oliveiro
The 370 had sort of a split personality, a shotgun marriage of a
register scientific machine and a memory-to-memory commercial machine.
That pins things down quite narrowly as to when it came into being,
doesn’t it? Up to about that point, “scientific” and
“business” computing were considered to be separate worlds,
needing their own hardware and software, and never the twain shall
meet.
Yes, the whole point of S/360 was to produce a unified architecture
that IBM could sell to all of their customers.
It may have been a shotgun marriage, but it's been a very long
lasting one.
Was it?
As a younger observer from the outside, my impression is that in the
1st World people stopped using S/360 descendants for "heavy" scientific
calculations around 1980. In other parts of the world it lasted a few
years longer, but still no longer than 1990. Use of IBM mainframes for
CAD continued well into the 90s and maybe even into this century, but CAD
is not what people called "scientific computing" back when S/360 was
conceived.
Post by John Levine
You can still run most S/360 application code unmodified on the latest zSeries.
Anton Ertl
2024-01-19 16:22:22 UTC
Permalink
Post by Michael S
Being younger observer from the outside, my impression is that in the
1st World people stopped using S/360 descendents for "heavy" scientific
calculations around 1980. In other parts of the World it lasted few
years longer, but still no longer than 1990.
Meanwhile, in my part of the third world (Austria) politicians congratulated
themselves for buying a supercomputer from IBM. Searching for it, I
find <https://services.phaidra.univie.ac.at/api/object/o:573/get>, and
on page 2 it tells me that the inauguration of the supercomputer IBM
3090-400E VF (with two vector processors) happened on March 7, 1989.
That project was originally limited to two years, but a contract
signed on 1992-03-19 extended the run-time and upgraded the hardware to
a 6-processor ES/9000 720VF; that extension also included 20
RS/6000-550s, and they found that their cumulative computing power
exceeded that of the vector computer by far. The vector computer
was uninstalled in January 1995.

After the RS/6000 cluster they used an Alpha cluster from 1995 to
2001, and this was replaced in 2001 with a PC-based Linux cluster
(inaugurated on January 28, 2002) consisting of 160 nodes with an
Athlon XP 1700+ and 1GB RAM each.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Levine
2024-01-19 18:28:18 UTC
Permalink
Post by Michael S
Post by John Levine
It may have been a shotgun marriage, but it's been a very long lasting one.
Was it?
Being younger observer from the outside, my impression is that in the
1st World people stopped using S/360 descendents for "heavy" scientific
calculations around 1980. ...
It was earlier than that. The 370/195 was IBM's last attempt to build
a supercomputer, introduced in 1970 and never sold very well. They
added vector options on later machines which someone must use, since
they're still on zSeries, but they've never been competitive for
pure computing.

The point of a mainframe is that it has a balance between CPU and I/O.
A PDP-8 had a much faster CPU than a 360/30, but the 360 had an I/O
channel that connected card readers and printers and tapes and disks
so it could do data processing work that nobody did on a PDP-8. A
PDP-8 could also connect to those, but each needed an expensive I/O
interface to attach to the 8's simple I/O bus, so hardly anyone did.

Mainframes are also designed to be very reliable and maintainable. A
modern mainframe has dozens of CPUs some of which are only doing
maintenance oversight and others of which are hot spares that can
substitute for a failed processor in the middle of an instruction
stream. They're also designed so the vendor can do maintenance and
replace subsystems while the system is running. People expect them to
remain up and running constantly for years at a time.

Apropos another comment that the 360 has evolved but the Vax didn't,
that is certainly true, since zSeries is about 70% new stuff and
30% 360 stuff, but the 360 was a much better place to build from.
It is much easier to build a fast 360 than a fast Vax because
the instruction set, even with all the zSeries additions, is
more regular and amenable to pipelining.

The worst mistake they made from a performance point of view is
that the architecture says an instruction can modify the next
instruction and it is supposed to work. (Back in the 1960s
on machines with 8K of RAM that was not totally silly.) But
even that hardly matters since the vast majority of code
runs out of read-only pages where you can't do that.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Lawrence D'Oliveiro
2024-01-19 20:58:16 UTC
Permalink
Post by John Levine
The point of a mainframe is that it has a balance between CPU and I/O.
The point of a mainframe was that the CPU was expensive. So a lot of
effort went into complex I/O controllers that could perform chains of
multiple transfers before having to come back to the CPU to ask for more
work.

Such an architecture tends to prioritize high throughput over low latency.
Which made it unsuitable for this newfangled “interactive timesharing”
that began to be popular with the new hardware and software coming from
companies like DEC, DG etc.
Post by John Levine
Mainframes are also designed to be very reliable and maintainable.
They did it in a very expensive way, though. Think how Google manages
reliability and maintainability today: by having a cluster of half a
million servers (maybe more by now), each built from the cheapest parts in
all ways but one--the power supply.
John Levine
2024-01-20 02:38:36 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by John Levine
The point of a mainframe is that it has a balance between CPU and I/O.
The point of a mainframe was that the CPU was expensive. So a lot of
effort went into complex I/O controllers that could perform chains of
multiple transfers before having to come back to the CPU to ask for more
work.
On high-end machines, not so much small ones. On the 360/30, the same
microcode engine ran the CPU and the channel. When the channel was
working hard, the CPU pretty much stopped.
Post by Lawrence D'Oliveiro
Such an architecture tends to prioritize high throughput over low latency.
Yup.
Post by Lawrence D'Oliveiro
Which made it unsuitable for this newfangled “interactive timesharing”
that began to be popular with the new hardware and software coming from
companies like DEC, DG etc.
Depended on what model of interaction you wanted. If you wanted the computer
to respond to each character, DEC machines were good at that since they were
designed to do realtime stuff. If you wanted to do line at a time or screen
at a time interaction, mainframes did that just fine. In 1964 SABRE ran on
two IBM 7090s and provided snappy responses to 1500 terminals across the U.S.

I used CP/67 in the early 1970s and it also worked quite well, fast response
in line at a time mode.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
John Dallman
2024-01-19 19:58:00 UTC
Permalink
Post by Michael S
Being younger observer from the outside, my impression is that in
the 1st world people stopped using S/360 descendents for "heavy"
scientific calculations around 1980.
Yup. VAXes and other superminis got you a lot more CPU per dollar.
Post by Michael S
Use of IBM mainframes for CAD continued well into the 90s and may be
even into this century, but CAD is not what people called
"scientific computing" back when S/360 was conceived.
Some aspects of it are, but many are not. CAD has very uneven processor
usage: vast demands for brief periods when regenerating views or models,
then very little while the designer thinks and adds to the model. Running
this on a time-shared machine is frustrating, because when a few
designers need a lot of CPU at the same time, it gets very slow.
Individual machines keep the designers happier.

John
Scott Lurndal
2024-01-19 20:22:56 UTC
Permalink
Post by John Dallman
Post by Michael S
Being younger observer from the outside, my impression is that in
the 1st world people stopped using S/360 descendents for "heavy"
scientific calculations around 1980.
Yup. VAXes and other superminis got you a lot more CPU per dollar.
Post by Michael S
Use of IBM mainframes for CAD continued well into the 90s and may be
even into this century, but CAD is not what people called
"scientific computing" back when S/360 was conceived.
Some aspects of it are, but many are not. CAD has very uneven processor
usage: vast demands for brief periods when regenerating views or models,
then very little while the designer thinks and adds to the model. Running
this on a time-shared machine is frustrating, because when a few
designers need a lot of CPU at the same time, it gets very slow.
Individual machines keep the designers happier.
Modern chip development (RTL/Verilog) environments offload the
compute- and I/O-bound jobs to a compute grid with thousands of nodes;
even the visualization jobs use X11 tunnelling to get back to
the workstation display when examining waves, for example.

When you're dealing with billions of gates on a single chip.....
Thomas Koenig
2024-01-20 21:50:59 UTC
Permalink
Post by Michael S
Being younger observer from the outside, my impression is that in the
1st World people stopped using S/360 descendents for "heavy" scientific
calculations around 1980.
I certainly used S/360 descendants (Siemens 7881, then IBM
3090) for scientific work, but the latter was also often used
as the front end for the (also S/360-compatible) Fujitsu VP.
Hmm... looking around a bit, the IBM 3090 I worked on had 150
MFlops with its vector facility. That was not too bad when it
was purchased in 1989, but the workstations purchased soon after
eclipsed it in computing power for the individual user, and the
vector computers (the Fujitsu VP in Karlsruhe) also did so. The IBM
3090 was used mainly as a front end to the VP.
Anton Ertl
2024-01-19 08:39:16 UTC
Permalink
Post by John Levine
Post by Anton Ertl
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
shows that addressing modes that the S/360 or even MIPS does not
%
Auto-inc. (R)+ 2.1
Auto-dec. -(R) 0.9
That's not entirely fair. The VAX has an immediate address mode that
could encode constant values from 0 to 63. Both papers said it was
about 15% so it was definitely a success. The 370 had sort of a split
personality, a shotgun marriage of a register scientific machine
and a memory-to-memory commercial machine. There were a bunch of
instructions with immediate operands but they all were a one byte
immediate and a memory location. Hence the extra LA instructions to
get immediates into registers.
So this advantage of the VAX over S/360 was not a "fancy" addressing
mode, but the immediate addressing mode that S/360 does not have, but
that all RISCs have, even MIPS, Alpha and RISC-V (except that these
architectures define addi/addiu as separate instructions). VAX has
"short literal", as you explain (15.8% of the operands) as well as
"immediate" (2.4% of the operands). With 1.5 operands per
instruction, that alone is a factor 1.27 more instructions for S/360
than for VAX.
Post by John Levine
Post by Anton Ertl
Among the complex VAX instructions CALL/RET and multi-register push
and pop constitute 3.22% of the instructions according to Table 1 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
I expect that these correspond to multiple instructions on the S/360.
The VAX had an all singing and dancing CALLS/RET that saved registers
and set up a stack frame, and a simple JSB/RSB that just pushed the
return address and jumped. CALLS was extremely slow and did far
more than was usually needed so for the most part it was only used
for inter-module calls that had to use the official calling sequence,
and JSB for everything else.
That probably depends on the compiler. Table 2 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
lists 4.5% "subroutine call and return", and 2.4% "procedure call and
return"; I assume the latter is the all-singing all-dancing CALL and
RET instruction; the missing 0.82% to the 3.22% mentioned in Table 1
is probably the multi-register push and pop instructions.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
John Levine
2024-01-19 16:59:33 UTC
Permalink
Post by Anton Ertl
So this advantage of the VAX over S/360 was not a "fancy" addressing
mode, but the immediate addressing mode that S/360 does not have, but
that all RISCs have, even MIPS, Alpha and RISC-V (except that these
architectures define addi/addiu as separate instructions). VAX has
"short literal", as you explain (15.8% of the operands) as well as
"immediate" (2.4% of the operands). With 1.5 operands per
instruction, that alone is a factor 1.27 more instructions for S/360
than for VAX.
Looks that way. IBM apparently noticed it too since S/390 added 16 bit
immediate load, compare, add, subtract, and multiply, and zSeries
added immediate everything, such as add immediate to memory.
Post by Anton Ertl
That probably depends on the compiler. Table 2 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
lists 4.5% "subroutine call and return", and 2.4% "procedure call and
return"; I assume the latter is the all-singing all-dancing CALL and
RET instruction; the missing 0.82% to the 3.22% mentioned in Table 1
is probably the multi-register push and pop instructions.
Sounds right. I'm surprised the procedure call numbers were so high.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
EricP
2024-01-19 20:29:50 UTC
Permalink
Post by John Levine
Post by Anton Ertl
So this advantage of the VAX over S/360 was not a "fancy" addressing
mode, but the immediate addressing mode that S/360 does not have, but
that all RISCs have, even MIPS, Alpha and RISC-V (except that these
architectures define addi/addiu as separate instructions). VAX has
"short literal", as you explain (15.8% of the operands) as well as
"immediate" (2.4% of the operands). With 1.5 operands per
instruction, that alone is a factor 1.27 more instructions for S/360
than for VAX.
Looks that way. IBM apparently noticed it too since S/390 added 16 bit
immediate load, compare, add, subtract, and multiply, and zSeries
added immediate everything, such as add immediate to memory.
Post by Anton Ertl
That probably depends on the compiler. Table 2 of
<https://www.eecg.utoronto.ca/~moshovos/ACA06/readings/emer-clark-VAX.pdf>
lists 4.5% "subroutine call and return", and 2.4% "procedure call and
return"; I assume the latter is the all-singing all-dancing CALL and
RET instruction; the missing 0.82% to the 3.22% mentioned in Table 1
is probably the multi-register push and pop instructions.
Sounds right. I'm surprised the procedure call numbers were so high.
I found a set of LINPACK performance results for many different CPUs
from 1983 by Argonne National Laboratory, including a 370/158 (they don't
say which model) and a 780. The results show both the execution time and
MFLOPS, so that removes the variability due to the definition of "instruction".

Dongarra has many versions of this paper over the years.
This is just the one from 1983.

Performance of Various Computers Using Standard Linear Equations Software
in a Fortran Environment, Dongarra, 1983
https://dl.acm.org/doi/pdf/10.1145/859551.859555

For double precision the 158 running compiled code is about 50%
faster than 780 running "coded BLAS" (hand coded assembler)
and about 2 times faster than a 780 for compiled code.

For single precision the 780 is slightly faster for "coded BLAS"
and the 158 is about 50% faster for compiled code.
Lynn Wheeler
2024-01-20 02:17:32 UTC
Permalink
Post by EricP
For single precision the 780 is slightly faster for "coded BLAS"
and the 158 is about 50% faster for compiled code.
trivia: jan1979, I was asked to run cdc6600 rain benchmark on
(engineering) 4341 (before shipping to customers, the engineering 4341
was clocked about 10% slower than what shipped to customers) for
national lab that was looking at getting 70 for a compute farm (sort of
the leading edge of the coming cluster supercomputing tsunami). I also
ran it on 158-3 and 3031. A 370/158 ran both the 370 microcode and the
integrated channel microcode; a 3031 was two 158 engines, one with just
the 370 microcode and a 2nd with just the integrated channel microcode.

cdc6600: 35.77secs
158: 45.64secs
3031: 37.03secs
4341: 36.21secs

... 158 integrated channel microcode was using lots of processing
cycles, even when no i/o was going on.
--
virtualization experience starting Jan1968, online at home since Mar1970
Michael S
2024-01-21 13:06:43 UTC
Permalink
On Fri, 19 Jan 2024 16:17:32 -1000
Post by Lynn Wheeler
Post by EricP
For single precision the 780 is slightly faster for "coded BLAS"
and the 158 is about 50% faster for compiled code.
trivia: jan1979, I was asked to run cdc6600 rain benchmark on
(engineering) 4341 (before shipping to customers, the engineering 4341
was clocked about 10% slower than what shipped to customers) for
national lab that was looking at getting 70 for a compute farm (sort
of the leading edge of the coming cluster supercomputing tsunami). I
also ran it on 158-3 and 3031. A 370/158 ran both the 370 microcode
and the integrated channel microcode; a 3031 was two 158 engines, one
with just the 370 microcode and a 2nd with just the integrated
channel microcode.
cdc6600: 35.77secs
158: 45.64secs
3031: 37.03secs
4341: 36.21secs
... 158 integrated channel microcode was using lots of processing
cycles, even when no i/o was going on.
Did I read it right? A brand new mid-range IBM mainframe barely matched
a 15-year-old CDC machine that was 10 years out of production?
That sounds quite embarrassing.
Anton Ertl
2024-01-21 16:30:36 UTC
Permalink
Post by Michael S
On Fri, 19 Jan 2024 16:17:32 -1000
Post by Lynn Wheeler
cdc6600: 35.77secs
158: 45.64secs
3031: 37.03secs
4341: 36.21secs
...
Post by Michael S
Did I read it right? Brand new mid-range IBM mainframe barely matched
15 y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
That depends on the price, and there are also properties like size,
power consumption and cooling requirements. IBM mainframes were not
designed for HPC (with a few exceptions); if you wanted that, you
would have bought a Cray-1 in 1979 when the 4341 appeared.

There is also the thing about IBM's market: Amdahl said about the
(high-performance) ACS-360
<https://people.computing.clemson.edu/~mark/acs_end.html>:

|Yes, but the company decided not to build it because it would have
|destroyed the pricing structures. In the first place, it would have
|forced them to make higher-end machines. But with IBM's pricing
|structure, the market disappeared by the time performance got to a
|certain level. Any machine above that in performance or price could
|only lose money.

The ACS-360 was cancelled for that reason.

Also, remember that these were not the 1990s with their extreme
advances every year; instead, the performance advances were quite a
bit slower, just like we have seen in the last two decades. And if
you compare a 2023-vintage Rock 5B (with Cortex-A76 like the Raspi5)
with a 2008-vintage Core 2 Duo E8400 PC, the Rock 5B is slightly
slower when running LaTeX, but it's also much cheaper, smaller,
consumes much less power and actually works without a cooler (but we
provided one nonetheless; the Raspi 5 SoC is made in a less advanced
process and needs more cooling).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Lawrence D'Oliveiro
2024-01-21 21:27:23 UTC
Permalink
Did I read it right? Brand new mid-range IBM mainframe barely matched 15
y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
A minute’s silence for the hardware legend that was Seymour Cray.

And a minute’s jeering at IBM’s FUD campaign to try to put CDC out of
business.
Scott Lurndal
2024-01-21 22:01:34 UTC
Permalink
Did I read it right? Brand new mid-range IBM mainframe barely matched 15
y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
A minute’s silence for the hardware legend that was Seymour Cray.
He was a friend of my Godfather (who lived in Chippewa Falls), right around
the time I first had access to a computer (1974, B5500). I didn't
realize who he was until much later, however, and never had a chance to
discuss computers with him.
MitchAlsup1
2024-01-21 21:26:37 UTC
Permalink
Post by Michael S
On Fri, 19 Jan 2024 16:17:32 -1000
Post by Lynn Wheeler
Post by EricP
For single precision the 780 is slightly faster for "coded BLAS"
and the 158 is about 50% faster for compiled code.
trivia: jan1979, I was asked to run cdc6600 rain benchmark on
(engineering) 4341 (before shipping to customers, the engineering 4341
was clocked about 10% slower than what shipped to customers) for
national lab that was looking at getting 70 for a compute farm (sort
of the leading edge of the coming cluster supercomputing tsunami). I
also ran it on 158-3 and 3031. A 370/158 ran both the 370 microcode
and the integrated channel microcode; a 3031 was two 158 engines, one
with just the 370 microcode and a 2nd with just the integrated
channel microcode.
cdc6600: 35.77secs
158: 45.64secs
3031: 37.03secs
4341: 36.21secs
... 158 integrated channel microcode was using lots of processing
cycles, even when no i/o was going on.
Did I read it right? Brand new mid-range IBM mainframe barely matched
15 y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
Target market for 4341 was not scientific computing, either.
Thomas Koenig
2024-01-21 21:51:56 UTC
Permalink
Post by MitchAlsup1
Post by Michael S
On Fri, 19 Jan 2024 16:17:32 -1000
Post by Lynn Wheeler
Post by EricP
For single precision the 780 is slightly faster for "coded BLAS"
and the 158 is about 50% faster for compiled code.
trivia: jan1979, I was asked to run cdc6600 rain benchmark on
(engineering) 4341 (before shipping to customers, the engineering 4341
was clocked about 10% slower than what shipped to customers) for
national lab that was looking at getting 70 for a compute farm (sort
of the leading edge of the coming cluster supercomputing tsunami). I
also ran it on 158-3 and 3031. A 370/158 ran both the 370 microcode
and the integrated channel microcode; a 3031 was two 158 engines, one
with just the 370 microcode and a 2nd with just the integrated
channel microcode.
cdc6600: 35.77secs
158: 45.64secs
3031: 37.03secs
4341: 36.21secs
... 158 integrated channel microcode was using lots of processing
cycles, even when no i/o was going on.
Did I read it right? Brand new mid-range IBM mainframe barely matched
15 y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
Target market for 4341 was not scientific computing, either.
And yet, people used IBM mainframes for scientific computing...

For example, the IBM 4361 had, as an optional feature, the maximum
precision scalar product developed by the University of Karlsruhe.

Not sure why they went to IBM with it, maybe DEC would have been
a better choice. Then again, the people at the computer center
in Karlsruhe were very mainframe-oriented...
Lawrence D'Oliveiro
2024-01-21 23:54:01 UTC
Permalink
Not sure why they went to IBM with it, maybe DEC would have been a
better choice. Then again, the people at the computer center in
Karlsruhe were very mainframe-oriented...
There seemed to be a lot of people like that, who only knew IBM and saw
the whole world through IBM lenses. To the rest of us, IBM’s way of doing
things just seemed overcomplicated, unwieldy, inflexible ... and
expensive.
John Levine
2024-01-22 02:46:05 UTC
Permalink
Post by Lawrence D'Oliveiro
Not sure why they went to IBM with it, maybe DEC would have been a
better choice. Then again, the people at the computer center in
Karlsruhe were very mainframe-oriented...
There seemed to be a lot of people like that, who only knew IBM and saw
the whole world through IBM lenses. ...
IBM has a big development lab in Boeblingen which is about an hour from Karlsruhe.

At that time DEC had no labs outside the United States.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Lawrence D'Oliveiro
2024-01-22 03:21:21 UTC
Permalink
Post by John Levine
IBM has a big development lab in Boeblingen which is about an hour from Karlsruhe.
At one time, IBM were the world’s biggest holder of patents. Their
researchers came up with many clever ideas. But my impression was, very
few of those ideas actually made it into their products.
Terje Mathisen
2024-01-22 08:59:24 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by John Levine
IBM has a big development lab in Boeblingen which is about an hour from Karlsruhe.
At one time, IBM were the world’s biggest holder of patents. Their
researchers came up with many clever ideas. But my impression was, very
few of those ideas actually made it into their products.
My uni lecturer had a favorite:

IBM's patent for a zero-time sorting chip.

It was basically a DMA-style memory device that was set up as a big
ladder of comparators so that it could do a parallel bubble sort:

As each new item arrived it would be compared with the current top, and
the loser would be pushed down to the next ladder level, replacing the
item which had at the same time lost the comparison at that level.

By the time all items had been loaded, the top would be the overall
winner, right?

You would then reverse the direction, while keeping the comparators
active, so now you would stream out perfectly sorted items.

The real problem is of course that this is effectively very expensive
memory, and as soon as you ran out of space in the chip you would have
to fall back on multi-way merge between separate runs of chip-size chunks.

In pretty much every conceivable real-world situation you would much
rather have 10x more real memory and apply indexing to any data you
might want to retrieve quickly in some sorted order and/or sort it on
demand.
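A rough software model of the idea (my own sketch, not the patent text):
each arriving key is compared with the current top of the ladder, the loser
of each compare is pushed one level down, and after loading you stream the
ladder back out in order. In hardware every level compares in parallel; in
software this is plain insertion sort, O(n) work per arriving key, and the
chip-overflow/merge case is not modelled.
 
#include <stdio.h>
 
#define LADDER_SIZE 64                 /* capacity of the hypothetical chip */
 
static int ladder[LADDER_SIZE];
static int count;
 
static void ladder_insert(int key)     /* one key streaming into the chip */
{
    for (int level = 0; level < count; level++) {
        if (key < ladder[level]) {     /* loser of the compare moves down */
            int tmp = ladder[level];
            ladder[level] = key;
            key = tmp;
        }
    }
    ladder[count++] = key;             /* lands at the bottom of the ladder */
}
 
int main(void)
{
    int input[] = {42, 7, 19, 3, 88, 1};
    for (unsigned i = 0; i < sizeof input / sizeof input[0]; i++)
        ladder_insert(input[i]);
    for (int i = 0; i < count; i++)
        printf("%d\n", ladder[i]);     /* reverse direction: sorted stream out */
    return 0;
}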

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
John Levine
2024-01-22 16:42:50 UTC
Permalink
Post by John Levine
IBM has a big development lab in Boeblingen which is about an hour from Karlsruhe.
At one time, IBM were the world’s biggest holder of patents. Their
researchers came up with many clever ideas. But my impression was, very
few of those ideas actually made it into their products.
A lot of patents are defensive, you don't necessarily plan to use them
but you don't want anyone else to own them.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Thomas Koenig
2024-01-22 17:42:19 UTC
Permalink
Post by John Levine
Post by Lawrence D'Oliveiro
Post by John Levine
IBM has a big development lab in Boeblingen which is about an hour from Karlsruhe.
At one time, IBM were the world’s biggest holder of patents. Their
researchers came up with many clever ideas. But my impression was, very
few of those ideas actually made it into their products.
A lot of patents are defensive, you don't necessarily plan to use them
but you don't want anyone else to own them.
Or you want to be able to use them at a later date, so nobody else
can patent that particular invention.

This has led to some patents being filed in Luxemburg only, for example.

Another method, which is getting harder in the age of search
engines, is the "secret" publication by publishing it somewhere
where it is unlikely to be found, such as the (non-existent)
"Acta Physical Mongolica".
Thomas Koenig
2024-01-22 17:37:23 UTC
Permalink
Post by Lawrence D'Oliveiro
Not sure why they went to IBM with it, maybe DEC would have been a
better choice. Then again, the people at the computer center in
Karlsruhe were very mainframe-oriented...
There seemed to be a lot of people like that, who only knew IBM and saw
the whole world through IBM lenses. To the rest of us, IBM’s way of doing
things just seemed overcomplicated, unwieldy, inflexible ... and
expensive.
That wasn't the case here.

The mainframe they had at the computer center before was a UNIVAC
(don't know which model, it was decommissioned before I started
on the Siemens/Fujitsu mainframe there), and they had a Cyber 205.

So, maybe more mainframe-oriented, but not necessarily IBM.
But then again, the 4361 was not really a mainframe.

But the proximity of Karlsruhe to Böblingen (which John
L. mentioned) might well have been a factor. It is entirely
plausible that contacts existed, for example from students who
started to work there.

And, googling around for a bit, I find that the 4361 was indeed
developed at Böblingen. This probably settles it.
Lynn Wheeler
2024-01-22 02:37:40 UTC
Permalink
Post by Michael S
Did I read it right? Brand new mid-range IBM mainframe barely matched
15 y.o. CDC machine that was 10 years out of production ?
That sounds quite embarrassing.
national lab was looking at getting 70 because of price/performance
... sort of the leading edge of the coming cluster scale-up
supercomputing tsunami.

decade later had project originally HA/6000 for NYTimes to move their
newspaper system (ATEX) off (DEC) VaxCluster to RS/6000. I rename it
HA/CMP when I start doing technical/scientific cluster scale-up with
national labs and commercial cluster scale-up with RDBMS vendors
(Oracle, Sybase, Informix, Ingres). Early Jan1992, meeting with Oracle
CEO, who is told 16-way cluster mid-92 and 128-way cluster
year-end 92. However, end of Jan1992, cluster scaleup is transferred for
announce as IBM supercomputer (for technical/scientific *ONLY*, possibly
because of commercial cluster scaleup "threat") and we are told we
couldn't work on anything with more than four processors (we leave IBM a
few months later). A couple weeks later, IBM (cluster) supercomputer
group in the press (pg8)
https://archive.org/details/sim_computerworld_1992-02-17_26_7

First half 80s, IBM 4300s sold into the same mid-range market as VAX and
in about the same numbers for single and small unit orders ... big
difference was large companies ordering hundreds of 4300s at a time for
placing out in departmental areas (sort of the leading edge of the
coming distributed computing tsunami).

old archived post with vax sales, sliced and diced by model, year,
us/non-us
http://www.garlic.com/~lynn/2002f.html#0

2nd half of 80s, mid-range market was moving to workstation and large PC
servers ... affecting both VAX and 4300s
--
virtualization experience starting Jan1968, online at home since Mar1970
Anton Ertl
2024-01-18 16:24:13 UTC
Permalink
Post by EricP
The MIPS R2000 with 32 registers launched in 1986 at 8.3, 12.5 and 15 MHz.
It supposedly could sustain 1 reg-reg ALU operation per clock.
It could do at most one instruction per clock, and it certainly needed
to branch at some point, so no sustained 1/clock ALU instructions.
Also, a useful program would want to load or store at some point, so
even fewer ALU instructions. And with cache misses, also fewer than 1
IPC.
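For illustration only, with made-up but plausible numbers: if 20% of the
executed instructions are loads/stores and 15% are branches, a single-issue
pipeline tops out at about 0.65 ALU instructions per clock even before any
cache or TLB misses.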

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>
Lawrence D'Oliveiro
2024-01-18 20:56:08 UTC
Permalink
[MIPS] could do at most one instruction per clock, and it certainly
needed to branch at some point, so no sustained 1/clock ALU
instructions.
But it also had delayed branches, so perhaps it could sustain that rate
across a taken branch?
Paul A. Clayton
2024-01-18 23:16:02 UTC
Permalink
Post by Lawrence D'Oliveiro
[MIPS] could do at most one instruction per clock, and it certainly
needed to branch at some point, so no sustained 1/clock ALU
instructions.
But it also had delayed branches, so perhaps it could sustain that rate
across a taken branch?
Anton Ertl's point was that 1 **ALU** instruction per clock was
not sustainable. Control flow and memory access instructions are
not ALU instructions. (EricP had written "It supposedly could
sustain 1 reg-reg ALU operation per clock.", but useful programs
tend to access memory and have control flow operations.)

Even without data memory accesses, TLB misses — handled in
software — would prevent perfect performance for a straight-line
program unless the TLB was pre-loaded. (I do not remember if the
PC wrapped, but I think there were multiple segments that had
different mapping features that would have either prevented
such looping or made it "very interesting".)
Joe Pfeiffer
2023-12-30 19:26:02 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this
feature.
I'll hazard a guess that once you've got the indirect bit out in memory,
it's easier to just use the same logic on all memory reads than to only
let it happen once.
John Levine
2023-12-30 23:26:20 UTC
Permalink
Post by Joe Pfeiffer
I'll hazard a guess that once you've got the indirect bit out in memory,
it's easier to just use the same logic on all memory reads than to only
let it happen once.
That's not how indirect addressing worked.

There was always a bit in the instruction to say to do indirection.

Sometimes that was it, sometimes on machines where the word size was
bigger than the address size, it also looked at some other bit in the
indirect word to see whether to keep going. On the PDP-8, the words
were 12 bits and the addresses were 12 bits so there was no room, they
couldn't have done multilevel indirect if they wanted to.

As several of us noted, multilevel indirection needed something to
break loops, while single level didn't. In my experience, multiple
indirection wasn't very useful, I didn't miss it on the -8, and I
can't recall using it other than as a gimmick on the PDP-10.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Vir Campestris
2023-12-31 17:40:53 UTC
Permalink
Post by John Levine
and I
can't recall using it other than as a gimmick on the PDP-10.
It's a very long time ago, but I'm sure I do recall seeing it used on a
DECSystem10 for arrays of pointers for indirection.

The fact that 40 years later I can remember the @ being used in
assembler must mean something.

Modern machines don't like wasting space so much. On the '10 the word an
address pointed to was a 36-bit value with an 18-bit address in it. And the
indirection bit. There was space for things like this.

Andy
Scott Lurndal
2023-12-31 18:25:56 UTC
Permalink
Post by John Levine
Post by Joe Pfeiffer
I'll hazard a guess that once you've got the indirect bit out in memory,
it's easier to just use the same logic on all memory reads than to only
let it happen once.
That's not how indirect addressing worked.
There was always a bit in the instruction to say to do indirection.
In our case (B3500 et alia), there was a bit per operand, so a three-operand
instruction could have all three addresses indirect. The processor treated
the value at the indirect address as an operand address, allowing infinite
recursion (subject to a processor timer in case of loops).
Quadibloc
2023-12-31 08:00:14 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the word
is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this feature.
No doubt this answer has already been given.

The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.

So in order to access arrays and stuff like that, instead of doing surgery
on the short address inside an instruction, you can simply store a full
address in a word somewhere that points anywhere you would like.
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
Let's assume we do have an architecture that supports multi-level
indirection. So an instruction word looks like this:

(i)(x)(opcode)(p)(address)

and an address constant looks like this:

(i)(x)(address)

So in an address constant (some architectures that had index registers
kept indirection) you could specify indexing too, but now the address was
longer by the length of the opcode field.

If the address inside an instruction is too short to handle all of memory
(i.e. the word length is less than 24 bits) then you need a "page" bit in
the instruction: 0 means page zero, shared by the whole program, 1 means
the current page - the one the instruction is on.

Let's now say the instruction is a _store_ instruction. Then what? Well,
if the indirect bit is set, it acts like a *load* instruction, to fetch and
load the effective address. It only stores at the point where indirection
ends - where the address is now of the actual location to do the storing
in, rather than the location of the effective address, which must be read,
not written.
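A small C sketch of that behaviour (mine, with an invented word layout, not
any particular machine's): the chain is followed by reads only, and the one
write happens at the address delivered by the last word whose indirect bit
is clear. A real implementation would also need a loop or timer guard.
 
#include <stdint.h>
 
#define IND_BIT   0x8000u              /* hypothetical indirect flag        */
#define ADDR_MASK 0x7FFFu              /* hypothetical address field        */
 
static uint16_t memory[0x8000];
 
void store_indirect(uint16_t ea, int instr_indirect, uint16_t value)
{
    if (instr_indirect) {              /* the instruction's own (i) bit     */
        uint16_t word;
        do {
            word = memory[ea];         /* each level is read, never written */
            ea   = word & ADDR_MASK;   /* it supplies the next address      */
        } while (word & IND_BIT);      /* ...and whether to keep chasing    */
    }
    memory[ea] = value;                /* only the final location is stored */
}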

John Savard
MitchAlsup
2023-12-31 17:16:35 UTC
Permalink
Post by Quadibloc
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the word
is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this feature.
No doubt this answer has already been given.
The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.
This is a better explanation than above. Instead of paying the high price
needed for index registers, they use main memory as their index registers.
{{A lot like building linked lists in FORTRAN 66}}.
Post by Quadibloc
So in order to access arrays and stuff like that, instead of doing surgery
on the short address inside an instruction, you can simply store a full
address in a word somewhere that points anywhere you would like.
Post by Anton Ertl
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
Let's assume we do have an architecture that supports multi-level
(i)(x)(opcode)(p)(address)
(i)(x)(address)
So in an address constant (some architectures that had index registers
kept indirection) you could specify indexing too, but now the address was
longer by the length of the opcode field.
If the address inside an instruction is too short to handle all of memory
(i.e. the word length is less than 24 bits) then you need a "page" bit in
the instruction: 0 means page zero, shared by the whole program, 1 means
the current page - the one the instruction is on.
Going all PDP-8 on us now ??
Post by Quadibloc
Let's now say the instruction is a _store_ instruction. Then what? Well,
if the indirect bit is set, it acts like a *load* instruction, to fetch and
load the effective address. It only stores at the point where indirection
ends - where the address is now of the actual location to do the storing
in, rather than the location of the effective address, which must be read,
not written.
John Savard
Thomas Koenig
2023-12-31 17:54:44 UTC
Permalink
Post by MitchAlsup
Post by Quadibloc
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the word
is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this feature.
No doubt this answer has already been given.
The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.
This is a better explanation than above. Instead of paying the high price
needed for index registers, they use main memory as their index registers.
{{A lot like building linked lists in FORTRAN 66}}.
The PDP-10 had both a recursive indirect bit and index registers (aka
memory locations 1 to 15), if I remember the manuals correctly
(I did a bit of reading, but I've never even come close to one of
these machines).
MitchAlsup
2023-12-31 18:59:45 UTC
Permalink
Post by Thomas Koenig
Post by MitchAlsup
Post by Quadibloc
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the word
is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this feature.
No doubt this answer has already been given.
The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.
This is a better explanation than above. Instead of paying the high price
needed for index registers, they use main memory as their index registers.
{{A lot like building linked lists in FORTRAN 66}}.
The PDP-10 had both a recursive indirect bit and index registers (aka
memory locations 1 to 15), if I remember the manuals correctly
(I did a bit of reading, but I've never even come close to one of
these machines).
All of the PDP-10s at CMU had the register upgrade. {2×KI and 1×KL}
I believe that most PDP-10s ever sold had the register upgrade.
John Levine
2023-12-31 20:19:32 UTC
Permalink
Post by Thomas Koenig
The PDP-10 had both a recursive indirect bit and index registers (aka
memory locations 1 to 15), if I remember the manuals correctly
(I did a bit of reading, but I've never even come close to one of
these machines).
Yup. Each instruction had an 18-bit address, a four-bit index register field, and an indirect bit.
It took the address, and added the contents of the right half of the index register if non-zero.
If the indirect bit was off, that was the operand address. If the indirect bit was set, it
fetched the word at that location and did the whole thing over again, including the indexing.

You could in principle create extremely complicated address chains, but
it was so confusing that nobody did.
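
As a sketch in C (made-up fetch()/memory/ac helpers, the usual I/X/Y field
positions, and none of the extended-addressing or AC-addressing corner cases):

  #include <stdint.h>

  #define Y_OF(w)  ((uint32_t)((w) & 0777777))      /* bits 18-35: 18-bit address */
  #define X_OF(w)  ((unsigned)(((w) >> 18) & 017))  /* bits 14-17: index register */
  #define I_OF(w)  ((unsigned)(((w) >> 22) & 01))   /* bit 13: indirect bit       */

  static uint64_t ac[16];                /* accumulators; 1-15 double as index regs */
  static uint64_t memory[1u << 18];      /* toy 256K-word memory                    */
  static uint64_t fetch(uint32_t a) { return memory[a & 0777777]; }

  uint32_t effective_address(uint64_t iw)
  {
      for (;;) {
          uint32_t e = Y_OF(iw);
          if (X_OF(iw) != 0)             /* add the right half of the index register */
              e = (e + (uint32_t)(ac[X_OF(iw)] & 0777777)) & 0777777;
          if (!I_OF(iw))
              return e;                  /* indirect bit clear: E is the operand address */
          iw = fetch(e);                 /* indirect: fetch word at E and repeat, indexing included */
      }
  }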
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
MitchAlsup
2023-12-31 20:42:42 UTC
Permalink
Post by John Levine
Post by Thomas Koenig
The PDP-10 had both a recursive indirect bit and index registers (aka
memory locations 1 to 15), if I remember the manuals correctly
(I did a bit of reading, but I've never even come close to one of
these machines).
Yup. Each instruction had an 18 bit address, a four bit index register, and an indirect bit.
It took the address, and added the contents of the right half of the index register if non-zero.
If the indirect bit was off, that was the operand address. If the indirect bit was set, it
fetched the word at that location and did the whole thing over again, including the indexing.
You could in principle create extremely complicated address chains but
it was so confusing that nobody did.
At CMU I used this a lot for things like symbol table searches.
What I did not use was the index-register part of the indirection (except
at the first level).
Lawrence D'Oliveiro
2024-01-17 06:36:41 UTC
Permalink
... but I've never even come close to one of these
machines).
You could have one, or a software emulation of one, right in front of you,
just a SIMH install away.
Scott Lurndal
2023-12-31 18:28:07 UTC
Permalink
Post by Quadibloc
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the word
is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this feature.
No doubt this answer has already been given.
The reason these architectures had that feature was because of a feature
they _didn't_ have: an index register.
Not necessarily true. The B3500 had three index registers (special
locations in memory, not real registers). Later systems in the early
'80s added four more index registers, implemented as real hardware
registers, but continued to support indirect addressing.
EricP
2024-01-05 16:33:40 UTC
Permalink
Post by Anton Ertl
Some (many?) architectures of the 1960s (earlier? later?) have the
feature that, when loading from an address, if a certain bit in the
word is set, the CPU uses that word as the address to access, and this
repeats until a word without this bit is found. At least that's how I
understand the descriptions of this feature.
The major question I have is why these architectures have this
feature.
The only use I can come up with for the arbitrarily repeated
indirection is the implementation of logic variables in Prolog.
However, Prolog was first implemented in 1970, and it did not become a
big thing until the 1980s (if then), so I doubt that this feature was
implemented for Prolog.
A use for a single indirection is the implementation of the memory
management in the original MacOS: Each dynamically allocated memory
block was referenced only from a single place (its handle), so that
the block could be easily relocated. Only the address of the handle
was freely passed around, and accessing the block then always required
double indirection. MacOS was implemented on the 68000, which did not
have the indirect bit; this demonstrates that the indirect bit is not
necessary for that. Nevertheless, such a usage pattern might be seen
as a reason to add the indirect bit. But is it enough?
Were there any other usage patterns? What happened to them when the
indirect bit went out of fashion?
One other question is how the indirect bit works with stores. How do
you change the first word in the chain, the last one, or any word in
between?
- anton
PDP-11 and VAX had multiple address modes with a single level of indirection.
The VAX usage stats from 1984 show about 3% use on SPEC.

DG Nova had infinite indirection - if the Indirect bit was set in the
instruction, the word fetched at the computed address was examined: if its
msb was zero, it was the address of the 16-bit data; if its msb was 1, it
was the address of another address, looping until msb = 0.
I don't know how DG used it but, just guessing, because the Nova only had
4 registers it might have been to create a kind of virtual register set in memory.

The best use I have for single-level indirection is compilers & linkers.
The compiler emits a variable reference without knowing if it is local
to the linkage unit or imported from a DLL. The linker discovers it is a
DLL-exported variable, changes the assigned location to be a pointer
to the imported value that is patched by the loader,
and just flips the Indirect bit on the instruction.

Doing the same thing without address indirection requires inserting
extra LD instructions and having a spare register allocated to the
linker to work with.
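
A hedged sketch in C of the software-only equivalent (hypothetical names,
not any particular toolchain's ABI): the pointer slot the loader patches,
and the extra dereference that a per-instruction Indirect bit would have
absorbed:

  #include <stdint.h>

  /* Case 1: the variable turned out to be local to the linkage unit,
     so the reference stays direct. */
  static int32_t local_var;
  int32_t read_local(void) { return local_var; }

  /* Case 2: the variable turned out to be exported from another module.
     With no indirect bit to flip, the toolchain materializes a pointer
     slot (patched at load time -- faked here) and every reference pays
     an extra dereference. */
  static int32_t other_modules_var;       /* stand-in for the DLL's variable */
  static int32_t *import_slot;            /* the slot a loader would patch   */

  void fake_loader_fixup(void) { import_slot = &other_modules_var; }

  int32_t read_imported(void) { return *import_slot; }  /* the extra load */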
John Levine
2024-01-05 18:05:21 UTC
Permalink
Post by EricP
PDP-11 and VAX had multiple address modes with a single level of indirection.
The VAX usage stats from 1984 show about 3% use on SPEC.
The main place the PDP-11 used indirect addressing was in @(PC)+, which
was the idiom for absolute addressing. It fetched the next word in the
instruction stream as an immediate via (PC)+ and then used it as an
address via indirection. The assembler let you write @#123 to generate
that address mode and put the 123 in line.

It was also useful for threaded code, where you had a register,
typically R4, pointing at a list of routine addresses and dispatched
with JMP @(R4)+

If you were feeling clever you could do this coroutine switch: JSR PC,@(SP)+

That popped the top word off the stack, then pushed the current PC, then jumped
to the address it had popped.
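
C can't jump through a pointer the way JMP @(R4)+ does, but the usual
dispatch loop is a fair sketch of the idea (hypothetical op names):

  #include <stdio.h>

  typedef void (*op_t)(void);

  static void op_hello(void) { puts("hello"); }
  static void op_world(void) { puts("world"); }

  /* The "thread": a list of routine addresses; r4 plays the PDP-11 register. */
  static op_t thread[] = { op_hello, op_world, 0 };

  int main(void)
  {
      for (op_t *r4 = thread; *r4; r4++)   /* fetch the next address and call it, */
          (*r4)();                         /* roughly what JMP @(R4)+ achieved    */
      return 0;
  }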
Post by EricP
DG Nova had infinite indirection - if the Indirect bit was set in the
instruction then in the address register if the msb of the address was zero
then it was the address of the 16-bit data, if the msb of the address was 1
then it was the address of another address, looping until msb = 0.
I don't know how DG used it but, just guessing, because Nova only had
4 registers might be to create a kind of virtual register set in memory.
My guess is that it was cheap to implement and let them say look, here
is a cool thing that we do and DEC doesn't. I would be surprised if
there were many long indirect chains.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
MitchAlsup
2024-01-05 23:20:45 UTC
Permalink
Post by John Levine
Post by EricP
PDP-11 and VAX had multiple address modes with a single level of indirection.
The VAX usage stats from 1984 show about 3% use on SPEC.
was the idiom for absolute addressing. It fetched the next word in the
instruction stream as an immediate via (PC)+ and then used it as an
that address mode and put the 123 in line.
It was also useful for threaded code, where you had a register,
typically R4, pointing at a list of routine addresses and dispatched
That popped the top word off the stack, then pushed the current PC, then jumped
to the address it had popped.
I used this in a real-time OS I developed at CMU to deal with laser power control.
Post by John Levine
Post by EricP
DG Nova had infinite indirection - if the Indirect bit was set in the
instruction then in the address register if the msb of the address was zero
then it was the address of the 16-bit data, if the msb of the address was 1
then it was the address of another address, looping until msb = 0.
I don't know how DG used it but, just guessing, because Nova only had
4 registers might be to create a kind of virtual register set in memory.
My guess is that it was cheap to implement and let them say look, here
is a cool thing that we do and DEC doesn't. I would be surprised if
there were many long indirect chains.
Vir Campestris
2024-01-22 11:58:28 UTC
Permalink
<snip>
Post by John Levine
Post by EricP
DG Nova had infinite indirection - if the Indirect bit was set in the
instruction then in the address register if the msb of the address was zero
then it was the address of the 16-bit data, if the msb of the address was 1
then it was the address of another address, looping until msb = 0.
I don't know how DG used it but, just guessing, because Nova only had
4 registers might be to create a kind of virtual register set in memory.
My guess is that it was cheap to implement and let them say look, here
is a cool thing that we do and DEC doesn't. I would be surprised if
there were many long indirect chains.
As has been mentioned elsewhere recently, DEC did exactly this on the PDP-10.

Andy
John Levine
2024-01-22 16:46:45 UTC
Permalink
Post by Vir Campestris
Post by John Levine
My guess is that it was cheap to implement and let them say look, here
is a cool thing that we do and DEC doesn't. I would be surprised if
there were many long indirect chains.
As has been mentioned elsewhere recently DEC did exactly this on the PDP-10.
It was more complicated than that on the PDP-6/10. At each stage it not
only did indirection, it could also add in an index register. I can sort
of imagine how one might use all that for dynamically allocated array
rows but I never saw more than two levels in practice and never saw
indexing in indirect words.

In their defense, the addressing was very consistent: start with the
instruction word and keep indexing and indirecting until you come up
with the address.
--
Regards,
John Levine, ***@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
Anton Ertl
2024-01-22 17:48:38 UTC
Permalink
John Levine <***@taugh.com> writes:
[unbounded indirection:]
Post by John Levine
It was more complicated than that on the PDP-6/10. At each stage it not
only did indirection, it could also add in an index register. I can sort
of imagine how one might use all that for dynamically allocated array
rows but I never saw more than two levels in practice and never saw
indexing in indirect words.
The implementation of a logic variable is a parent-pointer tree where
you follow the parent pointers until you are at the root
(which is a free variable or instantiated to a value). The automatic
unbounded indirection of the PDP-6/10 and Nova appears to be ideal for
that. And actually the most influential Prolog for quite a number of
years was DEC-10 Prolog; I don't know whether it used that feature, but I
would be surprised if it did not. Still, Prolog could be implemented
on architectures without that feature.
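
For comparison, the dereference loop that the hardware indirection would
have given for free looks roughly like this in C (hypothetical cell layout,
not DEC-10 Prolog's actual representation):

  /* Hypothetical heap cell: a REF cell points at another cell (at itself
     when the variable is unbound); anything else is a concrete value. */
  typedef struct cell {
      enum { REF, VAL } tag;
      union { struct cell *ref; long val; } u;
  } cell;

  /* Follow the parent pointers until the root: a value, or an unbound
     variable (a REF cell pointing at itself).  This is the loop that
     "keep indirecting while the bit is set" performed in hardware. */
  cell *deref(cell *c)
  {
      while (c->tag == REF && c->u.ref != c)
          c = c->u.ref;
      return c;
  }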
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-***@googlegroups.com>