Discussion:
what makes a computer architect great?
mag
2013-01-23 16:10:07 UTC
Permalink
What makes a computer architect great? Is it just a matter of
memorizing Tomasulo's algorithm? Is that still necessary?

For those whose job title is "Computer Architect" (or something like
that) what would you say were the key achievements or skills that led
you to that position? Is it more about who you know than what you know?

What are the major problems in computing that you think will be best
solved in the future by computer architects? Why do you think this is
the case?
Ivan Godard
2013-01-23 17:06:48 UTC
Permalink
On 1/23/2013 8:10 AM, mag wrote:
> What makes a computer architect great? Is it just a matter of memorizing
> Tomasulo's algorithm? Is that still necessary?

A willingness to know but disregard precedent. And a management willing
to let you.

> For those whose job title is "Computer Architect" (or something like
> that) what would you say were the key achievements or skills that led
> you to that position? Is it more about who you know than what you know?

Who you know may matter in the Intels and IBMs of the world (I have no
experience there), but not in smaller shops. Otherwise, it's the same
skill set as for a language designer, which I used to be: aesthetics
grounded in practicality. Elegance wins.

> What are the major problems in computing that you think will be best
> solved in the future by computer architects? Why do you think this is
> the case?

Job security for compiler writers.
MitchAlsup
2013-01-24 04:48:12 UTC
Permalink
On Wednesday, January 23, 2013 10:10:07 AM UTC-6, mag wrote:
> What makes a computer architect great? Is it just a matter of
> memorizing Tomasulo's algorithm? Is that still necessary?

Tomasulo's is so 1965.

What makes a computer architect great is breadth over depth, closeness
to hardware over pie in the sky, and a vast exposure to the disease of
computer architecture in general.

> For those whose job title is "Computer Architect" (or something like
> that) what would you say were the key achievements or skills that led
> you to that position? Is it more about who you know than what you know?

I built my first calculator the year before TTL logic was introduced into
the market out of resistors, transistors, capacitors, a great big rotary
switch (the op code), and a telephone dial as the input device.

So, for me, it was the experience of getting in there and doing it.

Secondly, I had the opportunity to work in languages for about a decade
learning the ins and outs of what HLLs require of instruction sets.
This led me to disdain condition codes as typically implemented. So
the ISAs I have implemented did not have CCs but followed other paths
that better fit the semantic requirements of HLLs and HW at the same
time.

Having written compilers puts you in a different league as to what to
leave out of ISAs. A lot of the elegance of CA is in what gets left out,
rather than what gets thrown in.
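
To make the condition-code point concrete, here is a toy dependence
check in C (the register names and the encoding are invented for the
illustration, not taken from any real ISA): when every ALU op
implicitly writes one shared FLAGS register, two otherwise independent
adds still conflict through that hidden state, which is exactly the
kind of thing both the compiler and an out-of-order core have to fight.

    #include <stdbool.h>
    #include <stdio.h>

    enum { R0, R1, R2, R3, R4, R5, FLAGS };

    typedef struct {
        const char *text;   /* mnemonic, for readability only       */
        int dst;            /* register written                     */
        int src[2];         /* registers read                       */
        bool writes_cc;     /* implicit condition-code side effect  */
    } Insn;

    /* Can these two instructions be reordered freely? */
    static bool independent(const Insn *a, const Insn *b)
    {
        if (a->dst == b->dst) return false;
        for (int i = 0; i < 2; i++)
            if (a->src[i] == b->dst || b->src[i] == a->dst) return false;
        /* On a CC-style encoding both write FLAGS: an output dependence. */
        if (a->writes_cc && b->writes_cc) return false;
        return true;
    }

    int main(void)
    {
        Insn cc_add1 = { "add r1,r2,r3", R1, { R2, R3 }, true  };
        Insn cc_add2 = { "add r4,r5,r0", R4, { R5, R0 }, true  };
        Insn gp_add1 = { "add r1,r2,r3", R1, { R2, R3 }, false };
        Insn gp_add2 = { "add r4,r5,r0", R4, { R5, R0 }, false };

        printf("CC-style adds independent?              %s\n",
               independent(&cc_add1, &cc_add2) ? "yes" : "no");
        printf("compare-into-register adds independent? %s\n",
               independent(&gp_add1, &gp_add2) ? "yes" : "no");
        return 0;
    }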

> What are the major problems in computing that you think will be best
> solved in the future by computer architects? Why do you think this is
> the case?

Absorbing latency is the singular concern from the time of Stretch to the
present. There are a variety of ways to do this: DCA (a predecessor to
Tomasulo), reservation stations (Tomasulo), scoreboards, pipelining,
dispatch stacks, and caches. There are modern ways that throw threads
at absorbing latency; they work marvelously with embarrassing parallelism
and not so well for serial code.
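
As a sketch of what these mechanisms buy you, here is the core
bookkeeping in C. It is purely illustrative and closer to a
dataflow/reservation-station model than to any particular machine: an
instruction's issue time is limited only by when its operands are
ready, so independent work slides in under a long-latency load instead
of waiting behind it.

    #include <stdio.h>

    #define NREGS 8

    typedef struct {
        int dst, src1, src2;
        int latency;               /* cycles until the result is ready */
    } Insn;

    /* cycle at which each register's pending write completes */
    static int busy_until[NREGS];

    static int issue(const Insn *in, int now)
    {
        /* Wait for both sources (RAW) and any pending write to dst (WAW). */
        int ready = now;
        if (busy_until[in->src1] > ready) ready = busy_until[in->src1];
        if (busy_until[in->src2] > ready) ready = busy_until[in->src2];
        if (busy_until[in->dst]  > ready) ready = busy_until[in->dst];
        busy_until[in->dst] = ready + in->latency;
        return ready;
    }

    int main(void)
    {
        Insn prog[] = {
            { 1, 0, 0, 10 },   /* long-latency load into r1         */
            { 2, 1, 3,  1 },   /* needs r1, so it must wait         */
            { 4, 5, 6,  3 },   /* independent, can start right away */
        };
        for (int i = 0; i < 3; i++)
            printf("insn %d issues at cycle %d\n", i, issue(&prog[i], 0));
        return 0;
    }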

Other than absorbing latency, the only performance game in town is
exploiting parallelism. Never forget that when you exploit parallelism,
you don't want to get caught up in paralyzing. It's a subtle trade-off.

The real quest in computer architecture is a model for computation that does
not inherently have the von Neumann bottleneck. This is a pie-in-the-sky
arena for CA.

Mitch
mag
2013-01-24 06:10:30 UTC
Permalink
On 2013-01-24 04:48:12 +0000, MitchAlsup said:

> On Wednesday, January 23, 2013 10:10:07 AM UTC-6, mag wrote:
>> What makes a computer architect great? Is it just a matter of
>> memorizing Tomasulo's algorithm? Is that still necessary?
>
> Tomasulo's is so 1965.
>
> What makes a computer architect great is breadth over depth, closeness
> to hardware over pie in the sky, and a vast exposure to the disease of
> computer architecture in general.

[snip]

>
> I built my first calculator the year before TTL logic was introduced into
> the market out of resistors, transistors, capacitors, a great big rotary
> switch (the op code), and a telephone dial as the input device.
>
> So, for me, it was the experience of getting in there and doing it.

A question I have is how do you go about "getting in there" nowadays? My
experience is that it's very difficult to break into this field. You
almost have to be an architect to get an architect's job. No
opportunities for someone with potential to get his or her feet wet
exist, unless you have a friend who's a project director that wants to
turn you loose on something.

Say I wanted to get an architect's job in the next 12-18 months: What
self directed project should I take on that I spend maybe 4 hours a
week working at that at the end would prepare me to get through an
interview for a CPU architect position? Assume that I have taken
graduate level computer architecture courses, but haven't had the
opportunity to put it into much practice since graduating from college.
Also assume I have 15 years experience in ASIC design, including half
of that designing or verifying CPUs--ARM and x86 and others.

>
> Secondly, I had the opportunity to work in languages for about a decade
> learning the ins and outs of what HLLs require of instruction sets.
> This led me to disdain condition codes as typically implemented. So
> the ISAs I have implemented did not have CCs but followed other paths
> that better fit the semantic requirements of HLLs and HW at the same
> time.
>
> Having written compilers puts you in a different league as to what to
> leave out of ISAs. A lot of the elegance of CA is in what gets left out,
> rather than what gets thrown in.

It seems you need "closeness to the hardware" but also experience
with software (you mentioned above that having written a compiler gives you
valuable experience in designing an ISA--I agree, but writing even a
simple compiler is easier said than done, in my experience). Are there
more hardware-oriented computer architects versus software-oriented
architects? If you had to choose one to be more proficient in would it
be software or hardware when it comes down to architecting a CPU?

>
>> What are the major problems in computing that you think will be best
>> solved in the future by computer architects? Why do you think this is
>> the case?
>
> Absorbing latency is the singular concern from the time of Stretch to the
> present. There are a variety of ways to do this: DCA (a predecessor to
> Tomasulo), reservation stations (Tomasulo), scoreboards, pipelining,
> dispatch stacks, and caches. There are modern ways that throw threads
> at absorbing latency; they work marvelously with embarrassing parallelism
> and not so well for serial code.
>
> Other than absorbing latency, the only performance game in town is
> exploiting parallelism. Never forget that when you exploit parallelism,
> you don't want to get caught up in paralyzing. It's a subtle trade-off.

Do you mean exploiting more and more latent parallelism in applications
that were not designed to be parallel by improving the ability to do
the exploitation with compilers and the CPU?

>
> The real quest in computer architecture is a model for computation that does
> not inherently have the von Neumann bottleneck. This is a pie-in-the-sky
> arena for CA.

Sorry: how does the von Neumann bottleneck affect modern CPUs? Doesn't
a Harvard architecture with sufficiently large caches, and multiple
levels of caching, make this problem negligible?
>
> Mitch
unknown
2013-01-24 07:08:07 UTC
Permalink
mag wrote:
> A question I have is how do you go about "getting in there" nowadays? My
> experience is that it's very difficult to break into this field. You
> almost have to be an architect to get an architect's job. No
> opportunities for someone with potential to get his or her feet wet
> exist, unless you have a friend who's a project director that wants to
> turn you loose on something.
>
> Say I wanted to get an architect's job in the next 12-18 months: What
> self directed project should I take on that I spend maybe 4 hours a week

4 hours a week???

If you are at all passionate about this stuff you'll spend at least 4
hours/day (besides your daytime job).

I know I'm spending more time than that just reading and responding to
comp.arch posts. :-)

> working at that at the end would prepare me to get through an interview
> for a CPU architect position? Assume that I have taken graduate level
> computer architecture courses, but haven't had the opportunity to put it
> into much practice since graduating from college. Also assume I have 15
> years experience in ASIC design, including half of that designing or
> verifying CPUs--ARM and x86 and others.

And this part doesn't fit with the rest: If you've actually spent 7.5
years "designing or verifying CPUs", then you are already doing this
work, right?

>> Other than absorbing latency, the only performance game in town is
>> exploiting parallelism. Never forget that when you exploit parallelism,
>> you don't want to get caught up in paralyzing. It's a subtle trade-off.
>
> Do you mean exploiting more and more latent parallelism in applications
> that were not designed to be parallel by improving the ability to do the
> exploitation with compilers and the CPU?
>
>>
>> The real quest in computer architecture is a model for computation
>> that does
>> not inherently have the von Neumann bottleneck. This is a pie-in-the-sky
>> arena for CA.
>
> Sorry: how does the von Neumann bottleneck affect modern CPUs? Doesn't
> a Harvard architecture with sufficiently large caches, and multiple
> levels of caching, make this problem negligible?

Ouch!!!

This really doesn't fit with your stated 7.5 years of low-level cpu
experience. :-(

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Jon
2013-01-24 10:08:59 UTC
Permalink
> Terje said
>> working at that at the end would prepare me to get through an interview
>> for a CPU architect position? Assume that I have taken graduate level
>> computer architecture courses, but haven't had the opportunity to put it
>> into much practice since graduating from college. Also assume I have 15
>> years experience in ASIC design, including half of that designing or
>> verifying CPUs--ARM and x86 and others.
>
> And this part doesn't fit with the rest: If you've actually spent 7.5
> years "designing or verifying CPUs", then you are already doing this
> work, right?

Not really. There's a big difference between being an architect and designer in many companies. If you're a designer, you typically just get given a block to design (i.e. multiplier or bus interface, etc) that has been spec'ed fairly rigidly by someone else, and you just need to churn out the RTL or schematic to match. Not all that interesting. It could be for a CPU, or it could be for any other ASIC. The interesting bit for the OP is, I guess, specifying what blocks are needed, what they should do, the h/w s/w interface, etc. which is what the (micro)architects are doing.

> Mitch said
> So, for me, it was the experience of getting in there and doing it.

Yep. If you have ASIC design experience, why not knock up your own CPU in Verilog and put it in an FPGA? Do a port of GCC for it as well.

> Mag said
> what would you say were the key achievements or skills that led
> you to that position

Doing the above.

Cheers,
Jon
Walter Banks
2013-01-28 16:19:47 UTC
Permalink
Jon wrote:

> > Mitch said
> > So, for me, it was the experience of getting in there and doing it.
>
> Yep. If you have ASIC design experience, why not knock up your own CPU in Verilog and put it in an FPGA? Do a port of GCC for it as well.

Certainly create ISAs; there is nothing like that experience to focus
you on ISA implementation details.

I am not so sure that porting GCC is all that useful. GCC is most
effective processing historical ISAs and will give a skewed view of
what is important. Mitch's other point was to look at what is needed
in HLLs, a point I completely agree with.

One additional point is to have a clear view of what your intended
application for an ISA will be. There are some very interesting
problems in ISA design once you get away from mainstream general-purpose
processors.

w..


Ivan Godard
2013-01-24 08:37:43 UTC
Permalink
On 1/23/2013 10:10 PM, mag wrote:
> On 2013-01-24 04:48:12 +0000, MitchAlsup said:

Let me start by saying Amen! to Mitch's post.

>> On Wednesday, January 23, 2013 10:10:07 AM UTC-6, mag wrote:
>>> What makes a computer architect great? Is it just a matter of
>>> memorizing Tomasulo's algorithm? Is that still necessary?
>>
>> Tomasulo's is so 1965.
>>
>> What makes a computer architect great is breadth over depth, closeness
>> to hardware over pie in the sky, and a vast exposure to the disease of
>> computer architecture in general.
>
> [snip]
>
>>
>> I built my first calculator the year before TTL logic was introduced into
>> the market out of resistors, transistors, capacitors, a great big rotary
>> switch (the op code), and a telephone dial as the input device.
>>
>> So, for me, it was the experience of getting in there and doing it.
>
> A question I have is how do you go about "getting in there" nowadays? My
> experience is that it's very difficult to break into this field. You
> almost have to be an architect to get an architect's job.

You almost have to be a movie star to get a movie star's job too. Like
any other high-skill occupation it takes ten years to learn your trade,
which in practice means to learn whether you are any good at it. Few
will hire you without that knowledge and practice, because architect is
a very high-risk position for an employer. Your manager usually can't
judge your decisions himself, and many millions of dollars ride on those
decisions being right.

So how do you break in? Same way you break into Hollywood: wait on
tables while taking on anything remotely connected no matter how
demeaning. Write your own scripts (compiler), get together with friends
and put on free shows for old folks' homes (work on Open Source
projects), audit classes you can't afford (audit classes you can't
afford), hang out in places like this board, and post thoughtful, if
ignorant, questions.
> No
> opportunities for someone with potential to get his or her feet wet
> exist, unless you have a friend who's a project director that wants to
> turn you loose on something.

Find such friends. Show them your script (fleshed out design). Get them
to help you improve it. Don't expect to be paid.

Luck helps. The first real programming job I had was at Burroughs on the
B6500 project, then still in sim and two years from hardware. The
manager was Ben Dent, to this day probably the best manager I have ever
had. His way of doing employee interviews was to sit the candidate down
in front of a whiteboard and explain the machine, which was a truly
elegant design, tagged memory and all. Someways into the explanation I
stopped him and said "but that can't possibly work - if you passed one
of those as a parameter from over here you'd get the wrong thing when
you used it. You'd have to add .. the stack number into the format" He
got a funny look on his face, said "Well, I wasn't going to get into it
but what you just described is called a Stuffed Indirect Reference Word
and has tag 6". In my ignorance I couldn't have spotted something like
that in an explanation of Netburst - too complicated to grok. The
elegance of the B6500 permitted the understanding from first principles.

Ben was not only unusual as a manager and as an engineer but as a
*black* engineer, which is sadly even now a rare thing. He hired me,
though why they put up with me thereafter is unknown. I did the DCALGOL
compiler there, my first. But getting the job was luck. Grab luck when
you find it.

> Say I wanted to get an architect's job in the next 12-18 months: What
> self directed project should I take on that I spend maybe 4 hours a week
> working at that at the end would prepare me to get through an interview
> for a CPU architect position?

Well, start with 60 hours a week. Remember 10 years, at 2k hours/year,
is 20k hours. At 4 hours a week you are forgetting faster than you are
learning. And be aware that not a few people study piano for ten years
and get competent enough to realize that they never will be great.

> Assume that I have taken graduate level
> computer architecture courses, but haven't had the opportunity to put it
> into much practice since graduating from college. Also assume I have 15
> years experience in ASIC design, including half of that designing or
> verifying CPUs--ARM and x86 and others.

Read. Read. Read. I read parts of three or four theses a week. I say
"parts" because the signal-to-noise ratio of the average thesis is
pretty poor. In many cases I believe that I may be the only person aside
from the author to have read it - it's clear the review board never did.
Still, Google is your friend

>>
>> Secondly, I had the opportunity to work in languages for about a decade
>> learning the ins and outs of what HLLs require of instruction sets.
>> This led me to disdain condition codes as typically implemented. So
>> the ISAs I have implemented did not have CCs but followed other paths
>> that better fit the semantic requirements of HLLs and HW at the same
>> time.
>>
>> Having written compilers puts you in a different league as to what to
>> leave out of ISAs. A lot of the elegance of CA is in what gets left out,
>> rather than what gets thrown in.
>
> It seems you need "closeness to the hardware" but also experience
> with software (you mentioned above that having written a compiler gives you
> valuable experience in designing an ISA--I agree, but writing even a
> simple compiler is easier said than done, in my experience).

Funny you noticed :-) Start by working on an existing non-simple
compiler. gcc and clang both are always looking for volunteers.

> Are there
> more hardware-oriented computer architects versus software-oriented
> architects? If you had to choose one to be more proficient in would it
> be software or hardware when it comes down to architecting a CPU?

There's a common misconception that architecture is circuit design.
True, they overlap somewhat, but circuit design is its own expertise.
I'm currently in my fourth ISA, and I've never written a line of Verilog
in my life and intend to stay a virgin.

Now it is true that you can't do a decent design without circuit
expertise somewhere on the team. The Mill has been mostly Art Kahlich
and me. Art is a hardware guy; the bolts attach behind the ears. I'm a
software guy - language design (Algol 68, Ada, Mary, minor contributions
to Smalltalk, C++, etc.), a minicomputer OS, 9 compilers, an OODB. The
relationship with Art has been very productive because we each curb the
other's excesses. I'll come up with this great idea, and he'll say
"That'll cost 30% of the clock"; he'll do likewise and I'll say "yeah,
but the compiler will never find it". Decide where on that spectrum you
want to lie. Either is valid; what do you like?

>
>>
>>> What are the major problems in computing that you think will be best
>>> solved in the future by computer architects? Why do you think this is
>>> the case?
>>
>> Absorbing latency is the singular concern from the time of Stretch to the
>> present. There are a variety of ways to do this: DCA (a predecessor to
>> Tomasulo), reservation stations (Tomasulo), scoreboards, pipelining,
>> dispatch stacks, and caches. There are modern ways that throw threads
>> at absorbing latency; they work marvelously with embarrassing parallelism
>> and not so well for serial code.
>>
>> Other than absorbing latency, the only performance game in town is
>> exploiting parallelism. Never forget that when you exploit parallelism,
>> you don't want to get caught up in paralyzing. It's a subtle trade-off.
>
> Do you mean exploiting more and more latent parallelism in applications
> that were not designed to be parallel by improving the ability to do the
> exploitation with compilers and the CPU?

Rule of thumb: people will recompile their code for a 2x gain; they will
reconceptualize their problem for a 10X gain. Acceptance of new concepts
is generational; be prepared to wait. In the meantime, be saleable.

Another rule of thumb: you will make the first sale by giving people
what they want. You will make the follow-on sale by having given them
what they needed in the first sale, even if they objected. I got invited
to the Green team in France because Jean Ichbiah had asked me to
critique his then-language LIS. He told me later that he had thought my
report was the biggest crock of bull he'd ever read and he'd tossed it in
the file. So I asked why he sought me out for what became Ada: "Because
two years later I understood what you were talking about".
Quadibloc
2013-01-24 22:50:46 UTC
Permalink
On Jan 23, 11:10 pm, mag <***@nospam.net> wrote:

> Sorry: how does the von Neumann bottleneck affect modern CPUs? Doesn't
> a Harvard architecture with sufficiently large caches, and multiple
> levels of caching, make this problem negligible?

The von Neumann bottleneck is a very big issue, but the limited
bandwidth to external memory is a related problem that is currently
growing fastest; unlike the von Neumann bottleneck, it still gets
worse as we go to more cores, instead of staying the same.

John Savard
MitchAlsup
2013-01-25 03:19:55 UTC
Permalink
On Thursday, January 24, 2013 12:10:30 AM UTC-6, mag wrote:
> Sorry: how does the von Neumann bottleneck affect modern CPUs? Doesn't
> a Harvard architecture with sufficiently large caches, and multiple
> levels of caching, make this problem negligible?

Your thought train is stuck on one/few thread/s at a time processing.
This is so 1985.

One von Neumann bottleneck is the requirement to appear to fully execute
one instruction before starting to execute the next instruction.

Consider a processor where there are 10,000 threads on a single die,
and ask yourself the question: "What does it mean to single-step
this machine?" Do you want to single-step a single thread, a single core,
a single function unit, or something else?

How do you process 1,800 exceptions per <some small unit of time>
while letting the other 8,200 threads make forward progress?

Does the kernel operate with as many threads as the application(s)?

In von Neumann's day none of these questions needed to be asked.

Mitch
n***@cam.ac.uk
2013-01-25 09:37:13 UTC
Permalink
In article <6d8c8536-74db-4742-a0f6-***@googlegroups.com>,
MitchAlsup <***@aol.com> wrote:
>On Thursday, January 24, 2013 12:10:30 AM UTC-6, mag wrote:
>> Sorry: how does the von Neumann bottleneck affect modern CPUs? Doesn't
>> a Harvard architecture with sufficiently large caches, and multiple
>> levels of caching, make this problem negligible?
>
>Your thought train is stuck on one/few thread/s at a time processing.
>This is so 1985.

1975?

>One von Neumann bottleneck is the requirement to appear to fully execute
>one instruction before starting to execute the next instruction.
>
>Consider a processor where there are 10,000 threads on a single die,
>and ask yourself the question: "What does it mean to single-step
>this machine?" Do you want to single-step a single thread, a single core,
>a single function unit, or something else?

Yes. So we have to live without that.

>How do you process 1,800 exceptions per <some small unit of time>
>while letting the other 8,200 threads make forward progress?
>
>Does the kernel operate with as many threads as the application(s)?
>
>In von Neumann's day none of these questions needed to be asked.

Hence my preference for eliminating interrupts of the current form.
I still don't think it would be hard (technically) - the big problem
is political (people's mindset).


Regards,
Nick Maclaren.
Ivan Godard
2013-01-25 15:35:51 UTC
Permalink
On 1/25/2013 1:37 AM, ***@cam.ac.uk wrote:
> In article <6d8c8536-74db-4742-a0f6-***@googlegroups.com>,

<snip>

> Hence my preference for eliminating interrupts of the current form.
> I still don't think it would be hard (technically) - the big problem
> is political (people's mindset).

It's not hard to replace interrupts with queued messages or other
synchronization devices.

Exceptions are another matter however. It's hard to figure out what to
do when there's nothing to be done.
n***@cam.ac.uk
2013-01-25 17:56:07 UTC
Permalink
In article <kdu8om$gsp$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>
>> Hence my preference for eliminating interrupts of the current form.
>> I still don't think it would be hard (technically) - the big problem
>> is political (people's mindset).
>
>It's not hard to replace interrupts with queued messages or other
>synchronization devices.
>
>Exceptions are another matter however. It's hard to figure out what to
>do when there's nothing to be done.

Not really. Once you abandon the dogma that they have to be handled
by recoverable interrupts, there are plenty of alternatives.


Regards,
Nick Maclaren.
Quadibloc
2013-01-25 20:24:51 UTC
Permalink
On Jan 25, 10:56 am, ***@cam.ac.uk wrote:
> In article <kdu8om$***@dont-email.me>,
> Ivan Godard  <***@ootbcomp.com> wrote:

> >Exceptions are another matter however. It's hard to figure out what to
> >do when there's nothing to be done.
>
> Not really.  Once you abandon the dogma that they have to be handled
> by recoverable interrupts, there are plenty of alternatives.

It's certainly true that if a program fails on a divide-by-zero, it's
not clear what's gained by trying to continue running. But one would
still like to have as much information as possible to diagnose the
error.

On the other hand, if one attempts to access an address in virtual
memory that is currently in the swap file, one most definitely wants
to seamlessly continue.

It's true that our thinking may be blinkered by a fixed concept of
what a computer is supposed to do for the programmer. Moving the ISA
to a higher level could well change the possible options. But that was
tried in the '70s, and got abandoned for performance reasons. Which
still dominate now, despite performance being largely ignored at other
levels of the chain (the thinking being, I guess, that most programs
don't need to worry about performance, but one doesn't want to give up
the option of maximum performance for the few programs for which it
matters).

So we have CPUs designed to max out performance as if it were still
the 60s, and programs that are profligate with cycles running on them.

John Savard
BGB
2013-01-25 22:16:39 UTC
Permalink
On 1/25/2013 2:24 PM, Quadibloc wrote:
> On Jan 25, 10:56 am, ***@cam.ac.uk wrote:
>> In article <kdu8om$***@dont-email.me>,
>> Ivan Godard <***@ootbcomp.com> wrote:
>
>>> Exceptions are another matter however. It's hard to figure out what to
>>> do when there's nothing to be done.
>>
>> Not really. Once you abandon the dogma that they have to be handled
>> by recoverable interrupts, there are plenty of alternatives.
>
> It's certainly true that if a program fails on a divide-by-zero, it's
> not clear what's gained by trying to continue running. But one would
> still like to have as much information as possible to diagnose the
> error.
>

yep.

all speculative here, as I am by no means a HW engineer...


an alternative could be, say, if this operation resulted in a NaN or
some sort of sentinel value ('undefined'). granted, this is a problem
for normal two's complement integers, short of shaving off part of
their value range (say, we have 32-bit integers that only hold values
between -2147418112 and 2147418111 or similar), or adding/removing a few
bits for a tag (say, registers are internally 36 bits, but use 32 bits
in memory and for arithmetic).

either way, rather than an exception, the value just switches over to a
trap-value indicating what has occurred, which then propagates in
a no-op way until code detects and handles the event (errm, sort of like
the "everything has gone NaN" event when dealing with floating-point math).

on a related note, the CPU could set some status registers to indicate
the initial exception (type, PC value, ...), under the premise that the
program will do a "check and reset" operation at some point.

"hey yeah, the exception register was set at this point, and all of this
other stuff now contains 'undefined', ... but hey, we can jump back and
repeat the calculation if-needed...".
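
a rough software analogue of this "value goes bad, check once later"
model already exists for floating point in C99's fenv machinery; the
sketch below (compile with -lm; the volatile is only there to keep the
compiler from folding the division away) shows the poison propagating
through a computation with a single deferred check at the end, which
is more or less the integer behaviour being wished for above.

    #include <fenv.h>
    #include <stdio.h>

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);

        volatile double zero = 0.0;
        double x = zero / zero;      /* raises FE_INVALID, yields a NaN  */
        double y = x * 3.0 + 1.0;    /* the NaN just propagates, no trap */

        /* the deferred "check and reset": one test after the whole run */
        if (fetestexcept(FE_INVALID | FE_DIVBYZERO)) {
            printf("something went NaN along the way (y = %g)\n", y);
            feclearexcept(FE_ALL_EXCEPT);
        }
        return 0;
    }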


> On the other hand, if one attempts to access an address in virtual
> memory that is currently in the swap file, one most definitely wants
> to seamlessly continue.
>

yes, and probably detect this early as well:
we don't want the program to go awry because it encountered such a
state, unless it can be reliably rolled back to the state at the time of
the event, which would probably cost more than raising an exception I think.


say:
raising an exception captures the register state;
undefined values can't be written to memory (or, possibly, it disables
write-backs to memory in-general, or goes into a non-preserved
"shadow-mode").

then, when the exception is checked/handled some-cycles later, any
post-exception cache lines are flushed, and the register state is
restored back to its state at the point of the exception.

this way, any operations performed following the initial exception have
no visible results, and it looks to software as if execution had simply
stopped when the exception occurred, and the program can then continue
execution from this point.


if the architecture is largely parallel, all this could apply over the
whole processor, or possibly just over interconnected thread-groups or
similar.

then, in the case of an exception, the state of all of the threads can
be reverted to a "known good" state all at once.


maybe:
there is a special (mandatory? or maybe auto-inserted.) operation which
is basically either:
no-op, if no exception has occurred;
causes the thread to halt if an exception has occurred;
the state-reversion and handler-signalling then takes place when all the
threads in the thread-group have reached a halted state.


things get more complicated though if it isn't possible for threads to
all know about an exception at the same time (like, if there is
propagation delay for the exception status), though probably for
thread-groups, they would be close enough in hardware such that the
exception-event being raised seems nearly simultaneous.

but, so long as post-exception write-back can be prevented, it shouldn't
actually matter in terms of overall state, as a few threads running
ahead of the barrier will only cause a minor delay as the processor
waits for it to be able to do its revert-and-invoke-handler magic.


> It's true that our thinking may be blinkered by a fixed concept of
> what a computer is supposed to do for the programmer. Moving the ISA
> to a higher level could well change the possible options. But that was
> tried in the '70s, and got abandoned for performance reasons. Which
> still dominate now, despite performance being largely ignored at other
> levels of the chain (the thinking being, I guess, that most programs
> don't need to worry about performance, but one doesn't want to give up
> the option of maximum performance for the few programs for which it
> matters).
>
> So we have CPUs designed to max out performance as if it were still
> the 60s, and programs that are profligate with cycles running on them.
>

yeah, pretty much...

we don't care for the most part, when the computer is either wasting
cycles or just sitting idle.

but, as soon as something pops up where people do care, full speed isn't
something to be given up.


or such...
n***@cam.ac.uk
2013-01-26 10:50:54 UTC
Permalink
In article <d86e7b46-7d55-43f4-bc84-***@rm7g2000pbc.googlegroups.com>,
Quadibloc <***@ecn.ab.ca> wrote:
>
>> >Exceptions are another matter however. It's hard to figure out what to
>> >do when there's nothing to be done.
>>
>> Not really. Once you abandon the dogma that they have to be handled
>> by recoverable interrupts, there are plenty of alternatives.
>
>It's certainly true that if a program fails on a divide-by-zero, it's
>not clear what's gained by trying to continue running. But one would
>still like to have as much information as possible to diagnose the
>error.

Trap-diagnose-and-terminate is trivial to implement compared to
interrupt-fixup-and-recover.

>On the other hand, if one attempts to access an address in virtual
>memory that is currently in the swap file, one most definitely wants
>to seamlessly continue.

Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
1960s hack.


Regards,
Nick Maclaren.
John Levine
2013-01-26 15:32:44 UTC
Permalink
>>On the other hand, if one attempts to access an address in virtual
>>memory that is currently in the swap file, one most definitely wants
>>to seamlessly continue.
>
>Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>1960s hack.

But unifying paging and file access still has a lot of appeal.

--
Regards,
John Levine, ***@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. http://jl.ly
n***@cam.ac.uk
2013-01-26 15:38:41 UTC
Permalink
In article <ke0sus$2bu3$***@leila.iecc.com>, John Levine <***@iecc.com> wrote:
>>>On the other hand, if one attempts to access an address in virtual
>>>memory that is currently in the swap file, one most definitely wants
>>>to seamlessly continue.
>>
>>Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>>1960s hack.
>
>But unifying paging and file access still has a lot of appeal.

I am not denying the appeal - I am saying that the harm caused by
that superficially simple approach outweighs any benefits tens or
hundreds to one. We learnt that in the 1970s :-(


Regards,
Nick Maclaren.
Anne & Lynn Wheeler
2013-01-26 18:24:05 UTC
Permalink
***@cam.ac.uk writes:
> I am not denying the appeal - I am saying that the harm caused by
> that superficially simple approach outweighs any benefits tens or
> hundreds to one. We learnt that in the 1970s :-(

that was something that tss/360 didn't learn ... when I was doing
paged-mapped filesystem for cms in the early 70s ... I tried to avoid
all the things that tss/360 had done wrong.
http://www.garlic.com/~lynn/submain.html#mmap

this was in parallel when future system effort was trying to emulate the
single level store from tss/360 ... w/o any corrections ... which
possibly contributed to FS failure
http://www.garlic.com/~lynn/submain.html#futuresys

s/38 is periodically described as simplified future system
implementation ... however, for the s/38 entry level market ... the
enormous performance penalties of simple single-level store design was
never really an issue (simplicity outweighed performance).

the poor performance reputation of single-level-store from tss/360 and
future system possibly contributed to decisions to not allow me to ship
my cms paged mapped filesystem ... even tho i had benchmarks where it
significantly outperformed that of standard cms filesystem.

--
virtualization experience starting Jan1968, online at home since Mar1970
n***@cam.ac.uk
2013-01-26 18:40:54 UTC
Permalink
In article <***@garlic.com>,
Anne & Lynn Wheeler <***@garlic.com> wrote:
>
>> I am not denying the appeal - I am saying that the harm caused by
>> that superficially simple approach outweighs any benefits tens or
>> hundreds to one. We learnt that in the 1970s :-(
>
>that was something that tss/360 didn't learn ... when I was doing
>paged-mapped filesystem for cms in the early 70s ... I tried to avoid
>all the things that tss/360 had done wrong.
>http://www.garlic.com/~lynn/submain.html#mmap
>
>this was in parallel when future system effort was trying to emulate the
>single level store from tss/360 ... w/o any corrections ... which
>possibly contributed to FS failure
>http://www.garlic.com/~lynn/submain.html#futuresys
>
>s/38 is periodically described as simplified future system
>implementation ... however, for the s/38 entry level market ... the
>enormous performance penalties of simple single-level store design was
>never really an issue (simplicity outweighed performance).
>
>the poor performance reputation of single-level-store from tss/360 and
>future system possibly contributed to decisions to not allow me to ship
>my cms paged mapped filesystem ... even tho i had benchmarks where it
>significantly outperformed that of standard cms filesystem.

I regard the performance issue (important though it is) as trivial
compared with the correctness, RAS, security etc. one. System/38
and all that were NOT general-purpose systems, and could get away
with the assumption that there was a single set of co-designed
applications (with all of the simplification that entails).

No matter how you cut it, memory mapped files are updatable global
storage (both across programs and time), and experience has been
that updatable global data should be avoided. Once you add error
recovery, parallelism (and, worse, both), they become positive
nightmares. That's made vastly worse by the fact that they are
not simple objects and the metadata is also updatable - which, in
the Unix model, includes ownerships, permissions and more.

Even with the simplest form of mainframe model (single writer or
multiple readers, for the combined data and metadata), there were
enough RAS problems to cause major trouble. And the Unix model is
nothing short of disaster by design :-(

Why is this so tied in with memory mapping? Because memory mapping
forces precisely the semantics that are the cause of the trouble.


Regards,
Nick Maclaren.
Anne & Lynn Wheeler
2013-01-26 19:06:31 UTC
Permalink
***@cam.ac.uk writes:
> I regard the performance issue (important though it is) as trivial
> compared with the correctness, RAS, security etc. one. System/38
> and all that were NOT general-purpose systems, and could get away
> with the assumption that there was a single set of co-designed
> applications (with all of the simplification that entails).

re:
http://www.garlic.com/~lynn/2013.html#63 what makes a computer architect great?

I actually was able to improve RAS, security and parallelism ...
including support old-style filesystem semantics. a big performance
benefit was 360/370 style channel program paradigm was performance
disaster ... aka aligned the file system semantics with the virtual
memory paradigm; i supported "windowing" as paradigm to emulate multiple
buffer overlapped operation; could use "window" semantics to simulate
old-style buffer i/o ... or single large memory map with demand page
... or spectrum between the two.

--
virtualization experience starting Jan1968, online at home since Mar1970
Michael S
2013-01-26 18:38:54 UTC
Permalink
On Jan 26, 12:50 pm, ***@cam.ac.uk wrote:
> In article <d86e7b46-7d55-43f4-bc84-***@rm7g2000pbc.googlegroups.com>,
>
> Quadibloc  <***@ecn.ab.ca> wrote:
>
> >> >Exceptions are another matter however. It's hard to figure out what to
> >> >do when there's nothing to be done.
>
> >> Not really. Once you abandon the dogma that they have to be handled
> >> by recoverable interrupts, there are plenty of alternatives.
>
> >It's certainly true that if a program fails on a divide-by-zero, it's
> >not clear what's gained by trying to continue running. But one would
> >still like to have as much information as possible to diagnose the
> >error.
>
> Trap-diagnose-and-terminate is trivial to implement compared to
> interrupt-fixup-and-recover.
>
> >On the other hand, if one attempts to access an address in virtual
> >memory that is currently in the swap file, one most definitely wants
> >to seamlessly continue.
>
> Demand paging is SO 1970s.  Memory is cheap.  TLBs are a horrible
> 1960s hack.
>
> Regards,
> Nick Maclaren.

We were at it 1000 times already.
Today's OS designers like paging hardware not due to demand paging to
mass storage (although that is not totally obsolete either, especially
in IBM world, both z and POWER, and, with current proliferation of
SSDs, we are likely to see limited renaissance of demand paging in the
PC world as well) but because it greatly simplifies memory management.
Not having to care about fragmentation of physical memory is a Very
Good Thing as far as OS designers are concerned.
And it's not just TLBs. OS designers want both TLBs and hardware page
walkers.
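
For reference, the work a hardware page walker (or a software refill
routine) actually does is small. Here is a sketch of a classic
two-level, 4 KB-page walk in C; the entry layout and the
frame_to_table() helper are invented for the example, not taken from
any real MMU:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define ENTRIES    1024
    #define PRESENT    0x1u

    typedef uint32_t pte_t;   /* low bits: flags, high bits: frame number */

    /* Walk both levels; return the physical address, or 0 to signal a
       fault (a real walker reports faults properly, this is a sketch). */
    uint32_t walk(const pte_t *page_dir, uint32_t vaddr,
                  const pte_t *(*frame_to_table)(pte_t))
    {
        uint32_t dir_idx = (vaddr >> 22) & (ENTRIES - 1);
        uint32_t tab_idx = (vaddr >> PAGE_SHIFT) & (ENTRIES - 1);
        uint32_t offset  =  vaddr & ((1u << PAGE_SHIFT) - 1);

        pte_t pde = page_dir[dir_idx];
        if (!(pde & PRESENT)) return 0;            /* not mapped: page fault */

        const pte_t *table = frame_to_table(pde);  /* second memory access   */
        pte_t pte = table[tab_idx];
        if (!(pte & PRESENT)) return 0;            /* not mapped: page fault */

        return (pte & ~0xFFFu) | offset;
    }
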
unknown
2013-01-26 22:27:14 UTC
Permalink
Michael S wrote:
> On Jan 26, 12:50 pm, ***@cam.ac.uk wrote:
>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>> 1960s hack.
>>

Actual paging is of course totally unacceptable, but TLBs seems like a
reasonable way to handle virt-to-phys caching?

>
> We were at it 1000 times already.
> Today's OS designers like paging hardware not due to demand paging to
> mass storage (although that is not totally obsolete either, especially
> in IBM world, both z and POWER, and, with current proliferation of
> SSDs, we are likely to see limited renaissance of demand paging in the
> PC world as well) but because it greatly simplifies memory management.
> Not having to care about fragmentation of physical memory is a Very
> Good Thing as far as OS designers are concerned.

It is indeed a very good thing, unfortunately we don't quite have it any
more...

> And it's not just TLBs. OS designers want both TLBs and hardware page
> walkers.

Due to memory sizes outrunning reasonable TLB sizes, at least as long as
everything has a single page size, OSs have to try to merge multiple 4K
pages into larger superpages that can use a single TLB entry, right?

However, in order to do this they must first get rid of (physical) memory
fragmentation. :-)
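
The arithmetic behind that, as a sketch (the entry count is
illustrative, not any specific core): the reach of the same TLB with
small pages versus large ones.

    #include <stdio.h>

    int main(void)
    {
        const long long entries  = 1536;        /* an illustrative L2 TLB */
        const long long small_pg = 4LL << 10;   /* 4 KB pages             */
        const long long large_pg = 2LL << 20;   /* 2 MB superpages        */

        printf("reach with 4 KB pages: %lld MB\n", entries * small_pg >> 20);
        printf("reach with 2 MB pages: %lld GB\n", entries * large_pg >> 30);
        return 0;
    }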

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-26 22:52:12 UTC
Permalink
In article <2r1et9-***@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Michael S wrote:
>>
>>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>>> 1960s hack.
>
>Actual paging is of course totally unacceptable, but TLBs seems like a
>reasonable way to handle virt-to-phys caching?

TLBs as a hardware cache for page tables are fine; handling TLBs
by interrupt is not. It's a pretty fair RAS and often performance
disaster, as I have explained before.

On the old, serial computers, it was just about tolerable, but that's
not where we are today and not where we are going.


Regards,
Nick Maclaren.
Michael S
2013-01-26 23:00:58 UTC
Permalink
On Jan 27, 12:27 am, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Michael S wrote:
> > On Jan 26, 12:50 pm, ***@cam.ac.uk wrote:
> >> Demand paging is SO 1970s.  Memory is cheap.  TLBs are a horrible
> >> 1960s hack.
>
> Actual paging is of course totally unacceptable, but TLBs seems like a
> reasonable way to handle virt-to-phys caching?
>

Actual paging is still acceptable in some situations.
Think time sharing machine (e.g. Windows Terminal Server) with few
dozens of clients that mostly use interactive applications and often
leave them open without touching kbd/mouse for hours and days.
In such a situation, paging stuff out, especially to a [relatively small]
SSD and using freed memory for [huge rotating RAID] disk cache sounds
like very reasonable strategy for an OS.
With current consolidation trends the situation I am describing will
become more common than it was in last couple of decades.

>
>
> > We were at it 1000 times already.
> > Today's OS designers like paging hardware not due to demand paging to
> > mass storage (although that is not totally obsolete either, especially
> > in IBM world, both z and POWER, and, with current proliferation of
> > SSDs, we are likely to see limited renaissance of demand paging in the
> > PC world as well) but because it greatly simplifies memory management.
> > Not having to care about fragmentation of physical memory is a Very
> > Good Thing as far as OS designers are concerned.
>
> It is indeed a very good thing, unfortunately we don't quite have it any
> more...
>
> > And it's not just TLBs. OS designers want both TLBs and hardware page
> > walkers.
>
> Due to memory sizes outrunning reasonable TLB sizes, at least as long as
> everything has a single page size, OSs have to try to merge multiple 4K
> pages into larger superpages that can use a single TLB entry, right?
>
> However, in order to do this they must first get rid of (physical) memory
> fragmentation. :-)
>

No, OS should not use "big" pages at all except for special things
like frame buffers, bounce buffers for legacy 32-bit I/O device and
when explicitly requested by application.
Processors with too small TLBs should naturally lose market share to
processors with bigger TLBs and/or to processors that make TLB misses
cheaper by means of efficient caching of page tables in L2/L3 caches.

> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
unknown
2013-01-27 12:50:57 UTC
Permalink
Michael S wrote:
> On Jan 27, 12:27 am, Terje Mathisen <"terje.mathisen at tmsw.no">
>> Actual paging is of course totally unacceptable, but TLBs seems like a
>> reasonable way to handle virt-to-phys caching?
>
> Actual paging is still acceptable in some situations.
> Think time sharing machine (e.g. Windows Terminal Server) with few
> dozens of clients that mostly use interactive applications and often
> leave them open without touching kbd/mouse for hours and days.
> In such a situation, paging stuff out, especially to a [relatively small]
> SSD and using freed memory for [huge rotating RAID] disk cache sounds
> like very reasonable strategy for an OS.

What you are describing here isn't paging at all!

This is segment swapping, and should only require a single base+limit
descriptor to mark the application as either totally resident or not
there at all.

Wasting 100 K TLB entries (for a 400 MB minimal Terminal Server client)
just so that you can detect that it needs to be swapped back in seems
like a sub-optimal solution. :-(

(Yeah, I know that none of those entries will actually be loaded when
the client app is swapped out, but it still seems like a lot of overhead.)

>> Due to memory sizes outrunning reasonable TLB sizes, at least as long as
>> everything has a single page size, OSs have to try to merge multiple 4K
>> pages into larger superpages that can use a single TLB entry, right?
>>
>> However, in order to do this they must first get rid of (physical) memory
>> fragmentation. :-)
>>
>
> No, OS should not use "big" pages at all except for special things
> like frame buffers, bounce buffers for legacy 32-bit I/O device and
> when explicitly requested by application.

That seems like a bad approach:

As long as an application makes big enough memory requests (i.e. many
MBs) it seems like a very obvious idea to check if you can satisfy
(parts of) that request with one or more big pages?
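
On Linux that check can also be left to the kernel and merely hinted
at; a sketch, assuming a kernel with transparent huge pages enabled
(madvise(MADV_HUGEPAGE) is the real call, the sizes and the rest are
just illustration):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t len = 64UL << 20;   /* a 64 MB request */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        /* Ask for this range to be backed by 2 MB pages where possible. */
        if (madvise(p, len, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");

        ((char *)p)[0] = 1;      /* touch it so something actually gets mapped */
        munmap(p, len);
        return 0;
    }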

> Processors with too small TLBs should naturally lose market share to
> processors with bigger TLBs and/or to processors that make TLB misses
> cheaper by means of efficient caching of page tables in L2/L3 caches.

Here we agree 100%.

I still remember that awful PentiumII cpu which had 512 KB of L2 but
only 64x4=256 KB that could be described by 4KB TLB entries.

It had a performance knee around 256 KB working set which was almost as
bad as the one past 512 KB.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-27 12:59:59 UTC
Permalink
In article <iekft9-***@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>
>I still remember that awful PentiumII cpu which had 512 KB of L2 but
>only 64x4=256 KB that could be described by 4KB TLB entries.
>
>It had a performance knee around 256 KB working set which was almost as
>bad as the one past 512 KB.

I can't remember offhand which CPUs had that defect, but some were
markedly worse, and few didn't have it at all. But, as I said, the
performance knee is the minor problem - the RAS problems are the
really nasty ones.


Regards,
Nick Maclaren.
unknown
2013-01-27 20:10:40 UTC
Permalink
***@cam.ac.uk wrote:
> In article <iekft9-***@ntp-sure.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>
>> I still remember that awful PentiumII cpu which had 512 KB of L2 but
>> only 64x4=256 KB that could be described by 4KB TLB entries.
>>
>> It had a performance knee around 256 KB working set which was almost as
>> bad as the one past 512 KB.
>
> I can't remember offhand which CPUs had that defect, but some were
> markedly worse, and few didn't have it at all. But, as I said, the
> performance knee is the minor problem - the RAS problems are the
> really nasty ones.

I do know that there are CPUs which have sw TLB replacement, but I've
never written low-level code for any of them.

Afaik there has never been any x86 cpu with that approach, mainly
because it wouldn't really be x86, i.e. it could never boot existing
operating systems...

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-28 08:50:35 UTC
Permalink
In article <17egt9-***@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>>>
>>> I still remember that awful PentiumII cpu which had 512 KB of L2 but
>>> only 64x4=256 KB that could be described by 4KB TLB entries.
>>>
>>> It had a performance knee around 256 KB working set which was almost as
>>> bad as the one past 512 KB.
>>
>> I can't remember offhand which CPUs had that defect, but some were
>> markedly worse, and few didn't have it at all. But, as I said, the
>> performance knee is the minor problem - the RAS problems are the
>> really nasty ones.
>
>I do know that there are CPUs which have sw TLB replacement, but I've
>never written low-level code for any of them.

I haven't for most of them - that's not needed to see the problems!
Also, tuning at that level is the same for high- and low-level code.

>Afaik there has never been any x86 cpu with that approach, mainly
>because it wouldn't really be x86, i.e. it could never boot existing
>operating systems...

No interrupts at all? I didn't know that. The lack of interrupts
partially explains why I haven't seen the failures on x86, the other
part being that I haven't managed any highly-parallel shared memory
x86 systems.


Regards,
Nick Maclaren.
c***@googlemail.com
2013-01-28 23:07:13 UTC
Permalink
Am Sonntag, 27. Januar 2013 21:10:40 UTC+1 schrieb Terje Mathisen:
> I do know that there are CPUs which have sw TLB replacement, but I've
> never written low-level code for any of them.

If you are interested in a quick overview on MIPS TLB miss handlers, I can recommend Gernot Heiser's "Inside L4/MIPS" document (http://www.cse.unsw.edu.au/~disy/L4/MIPS/inside/inside.pdf), Chapter 4.1.
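
In C rather than MIPS assembly, and simplified (it ignores the paired
EntryLo0/EntryLo1 detail, and the lookup/insert routines are stand-ins
named here for the sketch, not real kernel interfaces), a software
refill handler is roughly:

    #include <stdint.h>

    typedef struct { uint32_t entry_hi; uint32_t entry_lo; } tlb_entry_t;

    /* stand-ins for the privileged / OS-specific pieces */
    extern uint32_t page_table_lookup(uint32_t bad_vaddr, uint32_t asid);
    extern void     tlb_write_random(tlb_entry_t e);
    extern void     deliver_page_fault(uint32_t bad_vaddr);

    void tlb_refill(uint32_t bad_vaddr, uint32_t asid)
    {
        uint32_t pte = page_table_lookup(bad_vaddr, asid);
        if (pte == 0) {                 /* no mapping: real page fault path */
            deliver_page_fault(bad_vaddr);
            return;
        }
        tlb_entry_t e = {
            .entry_hi = (bad_vaddr & ~0xFFFu) | asid,  /* VPN + address space */
            .entry_lo = pte,                           /* PFN + permissions   */
        };
        tlb_write_random(e);    /* drop it into a TLB slot, return from trap */
    }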

> Afaik there has never been any x86 cpu with that approach, mainly
> because it wouldn't really be x86, i.e. it could never boot existing
> operating systems...

Does the Crusoe count as x86? ISTR I read somewhere that the Code Morphing software (CMS) handled the TLB contents - which were relevant for x86 code/data accesses only, whereas the CMS ran in physical address space of the underlying VLIW. That's at least my memory of the Crusoe (which might well be inaccurate) - if someone has more information on this, I would love to hear about details.

-- Michael
Michael S
2013-01-28 09:53:19 UTC
Permalink
On Jan 27, 2:50 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
wrote:
> Michael S wrote:
> > On Jan 27, 12:27 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> >> Actual paging is of course totally unacceptable, but TLBs seems like a
> >> reasonable way to handle virt-to-phys caching?
>
> > Actual paging is still acceptable in some situations.
> > Think time sharing machine (e.g. Windows Terminal Server) with few
> > dozens of clients that mostly use interactive applications and often
> > leave them open without touching  kbd/mouse for hours and days.
> > In such a situation, paging stuff out, especially to a [relatively small]
> > SSD and using freed memory for [huge rotating RAID] disk cache sounds
> > like very reasonable strategy for an OS.
>
> What you are describing here isn't paging at all!
>
> This is segment swapping, and should only require a single base+limit
> descriptor to mark the application as either totally resident or not
> there at all.
>
> Wasting 100 K TLB entries (for a 400 MB minimal Terminal Server client)
> just so that you can detect that it needs to be swapped back in seems
> like a sub-optimal solution. :-(

Except than "sub-optimal solution" is far more universal.
And except than when you have SSD with decently sized IO request queue
as a backup store there is no measurable performance difference
between "optimal" and "sub-optimal".
And except than a single user more often then not has several
interactive applications open and is likely to touch them with very
different time patters, so swapping in and out the whole client is, in
fact, sub-optimal. And except than the whole process is mostly about
dynamically-allocated data, rather than text, so swapping per-
application segments would not work either.
And except than "sub-optimal solution" is far more universal. Oh, I
already said that, don't I?


>
> (Yeah, I know that none of those entries will actually be loaded when
> the client app is swapped out, but it still seems like a lot of overhead.)
>
> >> Due to memory sizes outrunning reasonable TLB sizes, at least as long as
> >> everything has a single page size, OSs have to try to merge multiple 4K
> >> pages into larger superpages that can use a single TLB entry, right?
>
> >> However, in order to do this they must first get rid of (physical) memory
> >> fragmentation. :-)
>
> > No, OS should not use "big" pages at all except for special things
> > like frame buffers, bounce buffers for legacy 32-bit I/O device and
> > when explicitly requested by application.
>
> That seems like a bad approach:
>
> As long as an application makes big enough memory requests (i.e. many
> MBs) it seems like a very obvious idea to check if you can satisfy
> (parts of) that request with one or more big pages?
>

Only if memory is "free", and only if you are very certain that in the
future you won't want to page *part* of this buffer out.
On [modern fat x86] CPUs, situations in which TLB misses account for
more than a couple of percent of the run time are very rare. So, IMHO,
the best strategy for the OS is to do absolutely nothing about it.

> > Processors with too small TLBs should naturally lose market share to
> > processors with bigger TLBs and/or to processors that make TLB misses
> > cheaper by means of efficient caching of page tables in L2/L3 caches.
>
> Here we agree 100%.
>
> I still remember that awful PentiumII cpu which had 512 KB of L2 but
> only 64x4=256 KB that could be described by 4KB TLB entries.
>
> It had a performance knee around 256 KB working set which was almost as
> bad as the one past 512 KB.

In microbenchmarks or in real-world useful app?
To see such defined knee you should have had near-perfect hit rate
within 512 KB, relatively good locality within 32-byte cache lines
(otherwise time would be dominated by L1D-miss-L2-hit) and, at the
same time, very poor locality within 4KB pages.

>
> Terje
> --
> - <Terje.Mathisen at tmsw.no>
> "almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-28 10:10:31 UTC
Permalink
In article <20edc988-5a08-4737-ac8b-***@m12g2000yqp.googlegroups.com>,
Michael S <***@yahoo.com> wrote:
>>
>> I still remember that awful PentiumII cpu which had 512 KB of L2 but
>> only 64x4=256 KB that could be described by 4KB TLB entries.
>>
>> It had a performance knee around 256 KB working set which was almost as
>> bad as the one past 512 KB.
>
>In microbenchmarks or in real-world useful app?

Real-world useful applications. It's a well-known problem in the
HPC community and has been for many decades.

>To see such defined knee you should have had near-perfect hit rate
>within 512 KB, relatively good locality within 32-byte cache lines
>(otherwise time would be dominated by L1D-miss-L2-hit) and, at the
>same time, very poor locality within 4KB pages.

No, you are over-simplifying. The Official Dogma is that the cost
of a TLB miss is dominated by the memory access, but it very often
isn't true, and is rarely true if they are handled by interrupt.

It is common for a TLB+cache miss to be 3-10 times as costly as a
cache miss and, on older systems, that was 10-100. It's murkier
on highly-parallel systems, because a TLB miss necessarily has to
take out a global lock on at least the relevant page table area.
And the system-dependent hacks to minimise the cost of THAT are
unspeakable (and usually buggy).


Regards,
Nick Maclaren.
Michael S
2013-01-28 11:26:56 UTC
Permalink
On Jan 28, 12:10 pm, ***@cam.ac.uk wrote:
> In article <20edc988-5a08-4737-ac8b-***@m12g2000yqp.googlegroups.com>,
> Michael S  <***@yahoo.com> wrote:
>
>
>
> >> I still remember that awful PentiumII cpu which had 512 KB of L2 but
> >> only 64x4=256 KB that could be described by 4KB TLB entries.
>
> >> It had a performance knee around 256 KB working set which was almost as
> >> bad as the one past 512 KB.
>
> >In microbenchmarks or in real-world useful app?
>
> Real-world useful applications.  It's a well-known problem in the
> HPC community and has been for many decades.
>
> >To see such defined knee you should have had near-perfect hit rate
> >within 512 KB, relatively good locality within 32-byte cache lines
> >(otherwise time would be dominated by L1D-miss-L2-hit) and, at the
> >same time, very poor locality within 4KB pages.
>
> No, you are over-simplifying.  The Official Dogma is that the cost
> of a TLB miss is dominated by the memory access, but it very often
> isn't true, and is rarely true if they are handled by interrupt.
>
> It is common for a TLB+cache miss to be 3-10 times as costly as a
> cache miss and, on older systems, that was 10-100.  It's murkier
> on highly-parallel systems, because a TLB miss necessarily has to
> take out a global lock on at least the relevant page table area.
> And the system-dependent hacks to minimise the cost of THAT are
> unspeakable (and usually buggy).
>
> Regards,
> Nick Maclaren.

But we are not talking about some abstract system. We are talking
about the very specific Pentium-II.
IIRC, on the Pentium-II, provided that the missing PT entry is present
in the L2 cache, a TLB miss is only about twice as costly as a regular
L1D-miss-L2-hit. On the other hand, it has no L3 cache, so an L2 miss
is relatively costly. At 400 MHz an L2 miss would cost ~5-6 times more
than an L1D-miss-L2-hit, and even that only with a very good chipset.
unknown
2013-01-28 19:10:50 UTC
Permalink
Michael S wrote:
> On Jan 27, 2:50 pm, Terje Mathisen <"terje.mathisen at tmsw.no">
>> What you are describing here isn't paging at all!
>>
>> This is segment swapping, and should only require a single base+limit
>> descriptor to mark the application as either totally resident or not
>> there at all.
>>
>> Wasting 100 K TLB entries (for a 400 MB minimal Terminal Server client)
>> just so that you can detect that it needs to be swapped back in seems
>> like a sub-optimal solution. :-(
>
> Except than "sub-optimal solution" is far more universal.

I know!

> And except than when you have SSD with decently sized IO request queue
> as a backup store there is no measurable performance difference
> between "optimal" and "sub-optimal".

I'm still looking forward to SSDs large enough that I can get away from
spinning rust as my main/primary disk.

> And except than a single user more often then not has several
> interactive applications open and is likely to touch them with very
> different time patters, so swapping in and out the whole client is, in
> fact, sub-optimal. And except than the whole process is mostly about
> dynamically-allocated data, rather than text, so swapping per-
> application segments would not work either.
> And except than "sub-optimal solution" is far more universal. Oh, I
> already said that, don't I?

About 5 times so far? :-)

Let's just agree to disagree here, I still believe RAM is cheap enough
that letting real people wait on pages that must be swapped in is a crime.

>> As long as an application makes big enough memory requests (i.e. many
>> MBs) it seems like a very obvious idea to check if you can satisfy
>> (parts of) that request with one or more big pages?
>>
>
> Only if memory is "free", and then you are very certain that in the
> future you wouldn't want to page *part* of this buffer out.

See above, I don't want to page.

> On [modern fat x86] CPUs situations in which TLB misses account for
> more than couple of percents of the run time are very rare. So, IMHO,
> the best strategy for OS is to do absolutely nothing about it.
>
>>> Processors with too small TLBs should naturally lose market share to
>>> processors with bigger TLBs and/or to processors that make TLB misses
>>> cheaper by means of efficient caching of page tables in L2/L3 caches.
>>
>> Here we agree 100%.
>>
>> I still remember that awful PentiumII cpu which had 512 KB of L2 but
>> only 64x4=256 KB that could be described by 4KB TLB entries.
>>
>> It had a performance knee around 256 KB working set which was almost as
>> bad as the one past 512 KB.
>
> In microbenchmarks or in real-world useful app?

Actual apps, afair a spreadsheet recalculation was among the tests.

Touches every cell, ran significantly slower when the working set passed
the TLB size but was still within L2.

> To see such defined knee you should have had near-perfect hit rate
> within 512 KB, relatively good locality within 32-byte cache lines
> (otherwise time would be dominated by L1D-miss-L2-hit) and, at the
> same time, very poor locality within 4KB pages.

Yep.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-28 20:22:34 UTC
Permalink
In article <s2vit9-***@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Michael S wrote:
>>
>> And except than "sub-optimal solution" is far more universal. Oh, I
>> already said that, don't I?
>
>About 5 times so far? :-)
>
>Let's just agree to disagree here, I still believe RAM is cheap enough
>that letting real people wait on pages that must be swapped in is a crime.

Yes - especially since it harms RAS :-(

>>> I still remember that awful PentiumII cpu which had 512 KB of L2 but
>>> only 64x4=256 KB that could be described by 4KB TLB entries.
>>>
>>> It had a performance knee around 256 KB working set which was almost as
>>> bad as the one past 512 KB.
>>
>> In microbenchmarks or in real-world useful app?
>
>Actual apps, afair a spreadsheet recalculation was among the tests.
>
>Touches every cell, ran significantly slower when the working set passed
>the TLB size but was still within L2.

The same phenomenon occurs when you have certain matrix algorithms
that need to access them both ways round.


Regards,
Nick Maclaren.
t***@aol.com
2013-01-28 21:05:27 UTC
Permalink
On Monday, January 28, 2013 4:53:19 AM UTC-5, Michael S wrote:
>Only if memory is "free", and then you are very certain that in the
>future you wouldn't want to page *part* of this buffer out.
>On [modern fat x86] CPUs situations in which TLB misses account for
>more than couple of percents of the run time are very rare. So, IMHO,
>the best strategy for OS is to do absolutely nothing about it.

I recall an experimental Linux kernel where it was changed to use only
large pages. It ran about 10% faster IIRC. I think they were using a P-III.

The reason TLB misses are so low these days is because Windows (and
probably Linux) have been tuned for years (as in close to a decade) to take advantage of large pages.

- Tim
Michael S
2013-01-28 22:09:38 UTC
Permalink
On Jan 28, 11:05 pm, ***@aol.com wrote:
> On Monday, January 28, 2013 4:53:19 AM UTC-5, Michael S wrote:
> >Only if memory is "free", and then you are very certain that in the
> >future you wouldn't want to page *part* of this buffer out.
> >On [modern fat x86] CPUs situations in which TLB misses account for
> >more than couple of percents of the run time are very rare. So, IMHO,
> >the best strategy for OS is to do absolutely nothing about it.
>
> I recall an experimental Linux kernel where it was changed to use only
> large pages.  It ran about 10% faster IIRC.  I think they were using a P-III.
>

I couldn't care less about how much faster the *kernel* runs.
Do I wait for the computer less? I'd guess not. I'd guess that with a
kernel like that I'd wait more.

> The reason TLB misses are so low these days is because Windows (and
> probably Linux) have been tuned for years (as in close to a decade) to take advantage of large pages.
>
>                 - Tim

Sorry, I don't believe that it matters.
If tomorrow large pages disappear nobody except big transactional
databases will pay attention.
t***@aol.com
2013-01-29 00:29:44 UTC
Permalink
On Monday, January 28, 2013 5:09:38 PM UTC-5, Michael S wrote:
> On Jan 28, 11:05 pm, ***@aol.com wrote:
> > On Monday, January 28, 2013 4:53:19 AM UTC-5, Michael S wrote:
>I can't care less about how much faster *kernel* runs.
>Do I wait for computer less? I'd guess not. I'd guess with kernel like
>that I'd wait more.

Who said they measured the "kernel"? They were measuring the performance of the applications.


>Sorry, I don't believe that it matters.
Oh, that is a great way to do performance tuning.

>If tomorrow large pages disappear nobody except big transactional
>databases will pay attention.

And you would be SOOOOO wrong. Why do you think the number of large page TLB entries increases in each generation of x86 processors?

- Tim
Paul A. Clayton
2013-01-29 01:14:32 UTC
Permalink
On Monday, January 28, 2013 7:29:44 PM UTC-5, ***@aol.com wrote:
> On Monday, January 28, 2013 5:09:38 PM UTC-5, Michael S wrote:
[snip]
>>If tomorrow large pages disappear nobody except big transactional
>>databases will pay attention.
>
> And you would be SOOOOO wrong. Why do you think the number
> of large page TLB entries increases in each generation of
> x86 processors?

[Self-promoting comment follows:]
So there is some slim hope that my idea of sharing TLB entries
for PDEs (and entries for higher levels of the page table) and
like-coverage huge pages might actually be adopted?! (More
recently the solution seems to be hash-rehash on a TLB that
also stores base-sized page translations, whereas previously
specialized huge-page TLBs--limited to L1?--were provided.)

[Childish whining follows:]
Why is it that with two of the dominant ISAs (x86 and ARM)
using hierarchical (tree) page tables with support for
node-coverage huge pages (and use of hardware page table
walkers such that a PDE cache would not involve software
changes) my idea seems to be unused?! (I presented it
publicly well over a year ago, so unless it is already part
of a now publicly available patent application or granted
patent by someone else--one which preceded my
presentation [yes, revoking a patent based on prior art
would involve effort]--it should not be encumbered
by a patent. [It is also _somewhat_ obvious.])

Such TLB entry sharing would not be a _huge_ win, but
it seems that it would have a modest benefit. (It is
also sufficiently obvious that it is difficult to
believe that no one at either Intel or ARM has
considered such a possibility. [comp.arch and the
OpenRISC discussion forum/mailing list might not be
heavily read by professional architects, so my
presentation of the idea may not have been noticed.
I would be thrilled if I actually inspired a
component of an actual hardware design, and this idea
seemed the most likely to accomplish such.])
Rick Jones
2013-01-29 01:01:39 UTC
Permalink
Michael S <***@yahoo.com> wrote:
> Sorry, I don't believe that it matters.
> If tomorrow large pages disappear nobody except big transactional
> databases will pay attention.

Oh, a few of the folks running SPEC (CPU, perhaps others also not big
transactional database) benchmarks might care, though finding the
specific results using large pages is not all that easy.

And I have some recollections of making a customer (non-transactional
database) rather happier with the performance of their PA-RISC systems
by employing large pages.

All but anecdote of course.

rick jones
--
I don't interest myself in "why." I think more often in terms of
"when," sometimes "where;" always "how much." - Joubert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
n***@cam.ac.uk
2013-01-29 08:27:09 UTC
Permalink
In article <ke771j$q9m$***@usenet01.boi.hp.com>,
Rick Jones <***@hp.com> wrote:
>Michael S <***@yahoo.com> wrote:
>> Sorry, I don't believe that it matters.
>> If tomorrow large pages disappear nobody except big transactional
>> databases will pay attention.
>
>Oh, a few of the folks running SPEC (CPU, perhaps others also not big
>transactional database) benchmarks might care, though finding the
>specific results using large pages is not all that easy.
>
>And I have some recollections of making a customer (non-transactional
>database) rather happier with the performance of their PA-RISC systems
>by employing large pages.
>
>All but anecdote of course.

I can do better. Count us among the last, by a large factor, and
not just on PA-RISC. And, yes, it was measured and, yes, the
application consumed a LOT of the time of the machine (well over
half in the PA-RISC case).


Regards,
Nick Maclaren.
Mike
2013-01-29 23:56:57 UTC
Permalink
<***@cam.ac.uk> wrote in message
news:ke814t$u00$***@needham.csi.cam.ac.uk...
| In article <ke771j$q9m$***@usenet01.boi.hp.com>,
| Rick Jones <***@hp.com> wrote:
| >Michael S <***@yahoo.com> wrote:
| >> Sorry, I don't believe that it matters.
| >> If tomorrow large pages disappear nobody except big transactional
| >> databases will pay attention.
| >
| >Oh, a few of the folks running SPEC (CPU, perhaps others also not
big
| >transactional database) benchmarks might care, though finding the
| >specific results using large pages is not all that easy.
| >
| >And I have some recollections of making a customer
(non-transactional
| >database) rather happier with the performance of their PA-RISC
systems
| >by employing large pages.
| >
| >All but anecdote of course.
|
| I can do better. Count us among the last, by a large factor, and
| not just on PA-RISC. And, yes, it was measured and, yes, the
| application consumed a LOT of the time of the machine (well over
| half in the PA-RISC case).
|
|
| Regards,
| Nick Maclaren.


This whole discussion thread seems to have much more hand waving and
reliance on anecdote than I would expect.

A 4KB page size for virtual memory has been almost ubiquitous since
the 70's. However, since then, PC's have grown to be tens of
thousands of times faster and larger than mainframes were. CPU cache
is as big or bigger than mainframe ram. Ram is as big or bigger than
70's disk farms, and disk has grown even more. (Luckily, ram is
still faster to access, in machine cycles, than disk was, or we
would just be spinning our wheels.) Finally, applications have
also grown tremendously but probably less in active set code size than
in data area.

Given the importance of balance in system design, I should think there
would be well established and almost universally accepted models of
all the reasonable memory management schemes. For a given class of
applications should you swap memory with large variable sized segments
or use fixed size paged virtual memory? For a given latency and
throughput access time in CPU cycles, how many pages should you use to
partition cache, main ram, and disk? What are the trade offs in TLB
vs. small and large pages? Should the whole system operate under one
scheme or should different object classes be stored in different areas
of a system with varying page sizes?

I am suspicious that most of the differences of opinion in this thread
are simply because of familiarity with applications that have
dramatically different memory usage patterns.

Mike
Rick Jones
2013-01-30 02:30:40 UTC
Permalink
I'll wave my hands a bit more and point out that HP-UX (perhaps other
OSes as well for their respective processors) put in OS support for
the variable page size support present in both PA-RISC and Itanium,
and it wasn't just an academic exercise :) The kernel can
automagically select larger page sizes on behalf of the running
process and/or take hints encoded into the executable. The handwaving
in my case at least stems from my not having been involved in HP-UX
for well over a decade now and the dimm memory not being able to
cough-up specific references upon demand.

Certainly, large transactional databases is a common use-case for
HP-UX systems, but that is not the only "benchmark" the HP-UX
performance folks ran/run. If one were to go trolling through the
SPECcpu submissions for HP-UX systems one might find some of the
compiler options used to control instruction and data page size being
used for various component benchmarks.

rick jones
--
portable adj, code that compiles under more than one compiler
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
Robert A Duff
2013-01-30 16:31:05 UTC
Permalink
Rick Jones <***@hp.com> writes:

>...The handwaving
> in my case at least stems from my not having been involved in HP-UX
> for well over a decade now and the dimm memory not being able to
> cough-up specific references upon demand.

You have dual inline memory modules in your brain?

;-)

- Bob
Rick Jones
2013-01-30 20:13:09 UTC
Permalink
Robert A Duff <***@shell01.theworld.com> wrote:
> Rick Jones <***@hp.com> writes:
> > ...The handwaving in my case at least stems from my not having
> > been involved in HP-UX for well over a decade now and the dimm
> > memory not being able to cough-up specific references upon demand.

> You have dual inline memory modules in your brain?

With increasing numbers of bit errors and only minimal ECC :)

rick jones
--
I don't interest myself in "why." I think more often in terms of
"when," sometimes "where;" always "how much." - Joubert
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
unknown
2013-01-30 05:48:52 UTC
Permalink
Mike wrote:
[snipped yet another TLB/page size message]
> This whole discussion thread seems to have much more hand waving and
> reliance on anecdote than I would expect.

Which is a pity.
>
> A 4KB page size for virtual memory has been almost ubiquitous since
> the 70's. However, since then, PC's have grown to be tens of
> thousands of times faster and larger than mainframes were. CPU cache
> is as big or bigger than mainframe ram. Ram is as big or bigger than
> 70's disk farms, and disk has grown even more. (Luckily, ram speed
> is still faster in number of machine cycles to access than disk was or
> we would be just spinning our wheels.) Finally, applications have
> also grown tremendously but probably less in active set code size than
> in data area.

Exactly right!

Code is only 10x or so larger, mostly due to OO coding with lots & lots
of libraries linked in.

Working set data otoh has grown by several orders of magnitude, only
partially ameliorated by a constant search for more cache-friendly
algorithms.
>
> Given the importance of balance in system design, I should think there
> would be well established and almost universally accepted models of
> all the reasonable memory management schemes. For a given class of

We're almost as bad as lawyers, there's nothing "universally accepted"
here. :-)

I think (probably like you) that it seems reasonable that the optimum
page size should increase along with the working set code and data
sizes, at least by a log(N) factor.

> applications should you swap memory with large variable sized segments
> or use fixed size paged virtual memory? For a given latency and
> throughput access time in CPU cycles, how many pages should you use to
> partition cache, main ram, and disk? What are the trade offs in TLB
> vs. small and large pages? Should the whole system operate under one
> scheme or should different object classes be stored in different areas
> of a system with varying page sizes?

All good questions.
>
> I am suspicious that most of the differences of opinion in this thread
> are simply because of familiarity with applications that have
> dramatically different memory usage patterns.

Our current reliance on cache-friendly behavior has made it very easy to
assume that there are no really important applications that cannot be
made to work well with caches. I know that I'm probably in that group
(see my .sig), but I try to keep an open mind.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
n***@cam.ac.uk
2013-01-30 08:46:34 UTC
Permalink
In article <4romt9-***@ntp-sure.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Mike wrote:
>
>Code is only 10x or so larger, mostly due to OO coding with lots & lots
>of libraries linked in.

That's deceptive. Measured as lines of source, perhaps. Measured
as complexity of source or bytes generated/loaded, it is very much
larger. But that's a RAS not a performance problem.

>I think (probably like you) that it seems reasonable that the optimum
>page size should increase along with the working set code and data
>sizes, at least by a log(N) factor.

It's not clear if you look at it in more detail.

>> I am suspicious that most of the differences of opinion in this thread
>> are simply because of familiarity with applications that have
>> dramatically different memory usage patterns.
>
>Our current reliance on cache-friendly behavior has made it very easy to
>assume that there are no really important applications that cannot be
>made to work well with caches. I know that I'm probably in that group
>(see my .sig), but I try to keep an open mind.

And I know that is overstated, but it's unclear by how much. There
are lots of applications that nobody has managed to make cache
friendly, but it's not generally possible to prove that complete
applications cannot be redesigned to be so.

What I find bizarre is that the previous poster seemed to say
that being familiar with a wide variety of applications means that
we are less competent to make informed and educated remarks rather
than more competent!

Anyway, the point I have been hammering on about is not caching per
se, but the harm caused by attempting to emulate caching by horrible
and highly privileged software hacks. Purely transparent hardware
caching is a performance tweak, pure and simple.


Regards,
Nick Maclaren.
Quadibloc
2013-01-30 13:33:58 UTC
Permalink
On Jan 30, 1:46 am, ***@cam.ac.uk wrote:
> In article <4romt9-***@ntp-sure.tmsw.no>,
> Terje Mathisen  <"terje.mathisen at tmsw.no"> wrote:

> >Our current reliance on cache-friendly behavior has made it very easy to
> >assume that there are no really important applications that cannot be
> >made to work well with caches. I know that I'm probably in that group
> >(see my .sig), but I try to keep an open mind.
>
> And I know that is overstated, but it's unclear by how much.  There
> are lots of applications that nobody has managed to make cache
> friendly, but it's not generally possible to prove that complete
> applications cannot be redesigned to be so.

And this, of course, reminds me of how, after devising an incredibly
elaborate scheme to let my imaginary architecture process 36-bit and
48-bit data in 32/64/128-bit memory, using the cache, I then racked my
brain for elaborate schemes to allow efficient large arrays of
nonstandard data items for the case where data is processed in a
random manner that doesn't work well with the cache.

John Savard
Paul A. Clayton
2013-01-30 16:42:56 UTC
Permalink
On Wednesday, January 30, 2013 3:46:34 AM UTC-5, ***@cam.ac.uk wrote:
> In article <4romt9-***@ntp-sure.tmsw.no>,
> Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
> >Mike wrote:
[snip]
>>> I am suspicious that most of the differences of opinion in this thread
>>> are simply because of familiarity with applications that have
>>> dramatically different memory usage patterns.
[snip]
> What I find is bizarre is that the previous poster seemed to say
> that being familiar with a wide variety of applications means that
> we are less competent to make informed and educated remarks rather
> than more competent!

I think Mike was trying to indicate that different
posters had different areas of knowledge and
ignorance ("familiarity with applications") and
the "differences of opinion" derive substantially
from limited *individual* knowledge. *Collectively*
the group may be "familiar with a wide variety of
applications", but inadequate communication and
*individual* specialization "means we are less
competent".

It seems to me (in my great ignorance) that the
common unix-like way of handling files
substantially encourages the support for pages
(rather than segments) of smallish size. In that
model (especially with memory mapping of files),
page-oriented file caching for block devices
seems to fit well with page-oriented "anonymous
mapping".

One point not mentioned seems to be the
possibility of separating translation from
permission. Such might facilitate the use of
larger than 4KiB translation pages but smaller or
larger permission coverage areas.

In addition to legacy software and architecture
issues discouraging change in virtual memory,
some memory system design choices make relocation
of smaller-page-sized chunks more expensive than
necessary (which makes forming larger pages less
practical).

Even without formally defining larger pages,
x86 and ARM could use the Alpha-like page group
design (without Alpha's page group size indicators
in PTEs) at least for cache block sized chunks of
a page table that map to contiguous aligned
memory by using a sectored TLB with a single
translation (what Madhusudhan Talluri called
partial subblocking in his PhD thesis "Use of
Superpages and Subblocking in the Address
Translation Hierarchy", 1995). ("CoLT:
Coalesced Large-Reach TLBs", 2012, Binh Pham et
al., removes the alignment requirement by
providing a small addend.) (Sadly, this does
not seem to be done.)
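
For concreteness, a minimal sketch (an illustration of the Talluri-style
partial subblocking described above, not any shipping design) of what
such a sectored entry might look like: one tag and one base frame cover
a group of eight contiguous, aligned 4 KB pages, so eight PTEs that
happen to map contiguous aligned physical memory share a single TLB
entry. Field widths and the group size of eight are assumptions.

#include <stdint.h>
#include <stdbool.h>

#define GROUP_PAGES 8            /* pages sharing one entry           */
#define PAGE_SHIFT  12           /* 4 KB base pages                   */

typedef struct {
    uint64_t vtag;               /* VA >> (PAGE_SHIFT + 3): group tag */
    uint64_t base_pfn;           /* frame of page 0 of the group      */
    uint8_t  valid;              /* one bit per page in the group     */
    uint8_t  dirty;
    uint16_t asid;
    uint8_t  perms;              /* permissions shared by the group   */
} subblock_tlb_entry;

/* Translate va if it hits in entry e; the group only maps pages whose
   frames are contiguous with base_pfn, which is what lets 8 PTEs share
   one entry.  Returns true and writes *pa on a hit. */
static bool sub_tlb_lookup(const subblock_tlb_entry *e, uint16_t asid,
                           uint64_t va, uint64_t *pa)
{
    uint64_t vpn   = va >> PAGE_SHIFT;
    uint64_t group = vpn / GROUP_PAGES;
    unsigned idx   = (unsigned)(vpn % GROUP_PAGES);

    if (e->asid != asid || e->vtag != group || !(e->valid & (1u << idx)))
        return false;
    *pa = ((e->base_pfn + idx) << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return true;
}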
Mike
2013-01-30 17:05:48 UTC
Permalink
<***@cam.ac.uk> wrote in message
news:keamla$oia$***@needham.csi.cam.ac.uk...
| In article <4romt9-***@ntp-sure.tmsw.no>,
| Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
| >Mike wrote:
| >
| >Code is only 10x or so larger, mostly due to OO coding with lots &
lots
| >of libraries linked in.
|
| What I find is bizarre is that the previous poster seemed to say
| that being familiar with a wide variety of applications means that
| we are less competent to make informed and educated remarks rather
| than more competent!


That is not what I meant. I was trying to say that familiarity with
a specific range of application behaviors, less than all possible,
probably creates a bias for one or another memory management scheme.

Since disk is soooo much slower than the CPU, with regard to fixed
size paged virtual memory, I think there is a short cut to select the
optimum page size. If the page size is too small, most time will be
spent waiting on disk seek time and rotational delay. If the page
size is too large, most time will be waiting on transfer time and more
un-needed data will be read in. Since the 70's, disk seek time has
dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
a 15k rpm drive. Over the same time transfer speeds have increased
to between 300 and 600 megabytes per second. Balancing these two
issues implies a compromise page size between 30 and 60 KB.

Mike
Anne & Lynn Wheeler
2013-01-30 17:22:22 UTC
Permalink
"Mike" <***@mike.net> writes:
> That is not what I meant. I was trying to say that familiarity with
> a specific range of application behaviors, less than all possible,
> probably creates a bias for one or another memory management scheme.
>
> Since disk is soooo much slower than the CPU, with regard to fixed
> size paged virtual memory, I think there is a short cut to select the
> optimum page size. If the page size is too small, most time will be
> spent waiting on disk seek time and rotational delay. If the page
> size is too large, most time will be waiting on transfer time and more
> un-needed data will be read in. Since the 70's, disk seek time has
> dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
> delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
> a 15k rpm drive. Over the same time transfer speeds have increased
> to between 300 and 600 megabytes per second. Balancing these two
> issues implies a compromise page size between 30 and 60 KB.

in the early 80s, ibm mainframe operating systems had support for "big
page" transfers ... basically full 3380 track of 10 4kbyte pages. for
page out ... virtual pages for the same virtual address space was
collected in groups of 10 pages at a time (hopefully being used
together) and written out with strategy somewhat similar to
log-structured file system ... i.e. closest unused track location to
current head position (in the direction of moving cursor).

the issue was that 3380 had 3mbyte/sec transfer ... but access was only
marginally better than 3330 that had 800kbyte/sec ... so the big page
strategy was to better leverage the higher transfer rate ... and
optimize everything to offset the relatively long access time. paging
area was typically configured to be ten times that needed
... significantly increasing probability unused track near the current
head position.

subsequent page fault ... for any member of a "big page" would bring in
the whole big page (tended to increase real storage requirement but
improved the number of pages fetched for disk arm access).

--
virtualization experience starting Jan1968, online at home since Mar1970
n***@cam.ac.uk
2013-01-30 17:44:40 UTC
Permalink
In article <***@earthlink.com>,
Mike <***@mike.net> wrote:
>
>| What I find is bizarre is that the previous poster seemed to say
>| that being familiar with a wide variety of applications means that
>| we are less competent to make informed and educated remarks rather
>| than more competent!
>
>That is not what I meant. I was trying to say that familiarity with
>a specific range of application behaviors, less than all possible,
>probably creates a bias for one or another memory management scheme.

Well, yes, it might. But several of us are familiar with quite a
wide range of application behaviors and system designs.

>Since disk is soooo much slower than the CPU, with regard to fixed
>size paged virtual memory, I think there is a short cut to select the
>optimum page size. If the page size is too small, most time will be
>spent waiting on disk seek time and rotational delay. If the page
>size is too large, most time will be waiting on transfer time and more
>un-needed data will be read in. Since the 70's, disk seek time has
>dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
>delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
>a 15k rpm drive. Over the same time transfer speeds have increased
>to between 300 and 600 megabytes per second. Balancing these two
>issues implies a compromise page size between 30 and 60 KB.

Sorry, nope. See the Wheelers' posting for one reason. Another
is that using a page size much larger than the granularity of data
access does not gain you anything, not even if it can be transferred
for very little more cost than smaller ones. Your calculation is
more reasonable for the actual transfer size used (though actually
the size would be more like 250 KB), but tying that to the page
size stopped making sense in the 1960s.

A far more important reason to use large page sizes nowadays is to
enable data streaming - many architectures can stream data only
within a page (POWER, PA-RISC and others) - and the benefits can
be large. But that is one of the main reasons that I assert that
paging is an outdated technology, and we should go back to segment
swapping only.


Regards,
Nick Maclaren.
EricP
2013-01-30 19:26:01 UTC
Permalink
Mike wrote:
>
> Since disk is soooo much slower than the CPU, with regard to fixed
> size paged virtual memory, I think there is a short cut to select the
> optimum page size. If the page size is too small, most time will be
> spent waiting on disk seek time and rotational delay. If the page
> size is too large, most time will be waiting on transfer time and more
> un-needed data will be read in. Since the 70's, disk seek time has
> dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
> delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
> a 15k rpm drive. Over the same time transfer speeds have increased
> to between 300 and 600 megabytes per second. Balancing these two
> issues implies a compromise page size between 30 and 60 KB.

Page fault clustering (reading the target page and a certain
number ahead or behind it) does the same and has been around for
a long time. It has the advantage of being a runtime tuning
parameter but does not cut down the number of TLB entries.
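
A minimal sketch of that clustering policy (illustrative only, not
lifted from any particular kernel): the fault handler widens the request
around the faulting page by tunable before/after counts, clamped to the
mapped region, and issues one disk request for the whole span. The
structure and parameter names are assumptions.

#include <stddef.h>

struct region { size_t first_page, page_count; };

/* Compute the span of pages to read in for a fault on fault_page. */
static void fault_cluster(const struct region *r, size_t fault_page,
                          size_t cluster_before, size_t cluster_after,
                          size_t *start, size_t *count)
{
    size_t lo  = fault_page > r->first_page + cluster_before
                     ? fault_page - cluster_before : r->first_page;
    size_t hi  = fault_page + cluster_after;
    size_t end = r->first_page + r->page_count - 1;
    if (hi > end)
        hi = end;
    *start = lo;
    *count = hi - lo + 1;    /* one disk request covers the whole span */
}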

Eric
Robert Wessel
2013-01-30 20:34:46 UTC
Permalink
On Wed, 30 Jan 2013 12:05:48 -0500, "Mike" <***@mike.net> wrote:

>
><***@cam.ac.uk> wrote in message
>news:keamla$oia$***@needham.csi.cam.ac.uk...
>| In article <4romt9-***@ntp-sure.tmsw.no>,
>| Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>| >Mike wrote:
>| >
>| >Code is only 10x or so larger, mostly due to OO coding with lots &
>lots
>| >of libraries linked in.
>|
>| What I find is bizarre is that the previous poster seemed to say
>| that being familiar with a wide variety of applications means that
>| we are less competent to make informed and educated remarks rather
>| than more competent!
>
>
>That is not what I meant. I was trying to say that familiarity with
>a specific range of application behaviors, less than all possible,
>probably creates a bias for one or another memory management scheme.
>
>Since disk is soooo much slower than the CPU, with regard to fixed
>size paged virtual memory, I think there is a short cut to select the
>optimum page size. If the page size is too small, most time will be
>spent waiting on disk seek time and rotational delay. If the page
>size is too large, most time will be waiting on transfer time and more
>un-needed data will be read in. Since the 70's, disk seek time has
>dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
>delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
>a 15k rpm drive. Over the same time transfer speeds have increased
>to between 300 and 600 megabytes per second. Balancing these two
>issues implies a compromise page size between 30 and 60 KB.


Rotational delays for 15K drives are more like 2ms.
unknown
2013-01-31 11:04:58 UTC
Permalink
Mike wrote:
> Since disk is soooo much slower than the CPU, with regard to fixed
> size paged virtual memory, I think there is a short cut to select the
> optimum page size. If the page size is too small, most time will be
> spent waiting on disk seek time and rotational delay. If the page
> size is too large, most time will be waiting on transfer time and more
> un-needed data will be read in. Since the 70's, disk seek time has

Obviously right.

> dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
> delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
> a 15k rpm drive. Over the same time transfer speeds have increased
> to between 300 and 600 megabytes per second. Balancing these two

That's not so: Single-spindle transfer rates are approximately constant
for all current disk types: Those that spin faster simply provide fewer
total bytes in the same area, making each bit larger (and hopefully more
fault-tolerant).

The numbers you quote above (300-600 MB/s) seem to be close to the
Mbit/s rate, except for being a little low. Your range is pretty much
spot on for modern SSD drives OTOH.

> issues implies a compromise page size between 30 and 60 KB.

If we have 100 MB/s transfer and 5 ms seek time, using a 50/50 split
between seek and transfer time gives 512 KB as a reasonable minimum disk
IO unit.
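
For reference, a back-of-envelope sketch (editorial, with assumed drive
parameters) of that 50/50 break-even point: the transfer size at which
time on the wire equals positioning time, i.e. where you reach half of
the drive's sustained rate. With 100 MB/s and 5 ms it lands just under
512 KB; longer seek-plus-rotation or faster media push it well into the
MB range.

#include <stdio.h>

/* Bytes moved during one average positioning delay, reported in KB. */
static double breakeven_kb(double mb_per_s, double access_ms)
{
    return mb_per_s * 1e6 * access_ms * 1e-3 / 1024.0;
}

int main(void)
{
    printf("100 MB/s, 5 ms  -> %.0f KB\n", breakeven_kb(100.0, 5.0));   /* ~488 KB  */
    printf("150 MB/s, 13 ms -> %.0f KB\n", breakeven_kb(150.0, 13.0));  /* ~1904 KB */
    return 0;
}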

I've read somewhere that Google is mostly using 64 MB as the block size
for GFS, giving 99+% transfer efficiency.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Stephen Fuld
2013-01-31 17:46:45 UTC
Permalink
On 1/31/2013 3:04 AM, Terje Mathisen wrote:
> Mike wrote:
>> Since disk is soooo much slower than the CPU, with regard to fixed
>> size paged virtual memory, I think there is a short cut to select the
>> optimum page size. If the page size is too small, most time will be
>> spent waiting on disk seek time and rotational delay. If the page
>> size is too large, most time will be waiting on transfer time and more
>> un-needed data will be read in. Since the 70's, disk seek time has
>
> Obviously right.
>
>> dropped from about 50 milliseconds to 5 milliseconds. Disk rotational
>> delay has dropped from 8 or 9 milliseconds to about .4 milliseconds in
>> a 15k rpm drive. Over the same time transfer speeds have increased
>> to between 300 and 600 megabytes per second. Balancing these two
>
> That's not so: Single-spindle transfer rates are approximately constant
> for all current disk types: Those that spin faster simply provide less
> total bytes in the same area, making each bit larger (and hopefully more
> fault-tolerant).
>
> The numbers you quote above (300-600 MB/s) seems to be close to the
> Mbit/s rate, except for being a little low. Your range is pretty much
> spot on for modern SSD drives OTOH.
>
>> issues implies a compromise page size between 30 and 60 KB.
>
> If we have 100 MB/s transfer and 5 ms seek time, using a 50/50 split
> between seek and transfer time gives 512 KB as a reasonable minimum disk
> IO unit.

While your math may be fine, your assumptions aren't correct. The
Seagate data sheet for their Barracuda class drives (of course others
are different)

Data sheet linked to from

http://www.seagate.com/internal-hard-drives/desktop-hard-drives/



gives an average seek time of 8.5 ms for reads / 9.5 ms for writes (reads
are faster because writes require a longer settling time to assure that
data is written exactly on track; if a read is off, it can always be
retried). In reality, the average seek time is less, because this number
assumes random seeks across the whole drive while most drives are less
than 100% full and data is clustered towards the front of the drive; but
the real number is hard to measure and inappropriate for a vendor to
quote as a spec. So let's say 9 ms.

The drive rotates at 7,200 RPM, so the average rotational delay is 4.17
ms (half a rotation).

For the transfer rate, you need to use the average rate, not the
instantaneous rate since you are talking about transfers larger than one
disk block. The sustained rate varies across the disk, but Seagate
gives the average number as 156 MB/sec. Again, this is probably
somewhat low as most of the data is toward the outer diameter of the
drive where the average rate is faster, but again, not a specable number.

So using the 50/50 numbers the total access time is ~13 ms and at 156
MB/sec we can transfer about 2 MB in that time. However, what makes the
50/50 rule magic? Also, this neglects the effect of caching within the
drive, which tends to decrease the access time and slightly increase the
data transfer rate. How should this be factored in?

Overall, while I agree with the general idea that faster drives should,
all other things being equal, lead to larger pages (even without
aggregating multiple smaller pages into super pages), I am skeptical of
using formulas to tell how much bigger.


> I've read somewhere that Google is mostly using 64 MB as the block size
> for GFS, giving 99+% transfer efficiency.


Could very well be, but almost certainly that is not for program paging.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)
unknown
2013-01-31 21:21:33 UTC
Permalink
Stephen Fuld wrote:
> On 1/31/2013 3:04 AM, Terje Mathisen wrote:
>> The numbers you quote above (300-600 MB/s) seems to be close to the
>> Mbit/s rate, except for being a little low. Your range is pretty much
>> spot on for modern SSD drives OTOH.
>>
>>> issues implies a compromise page size between 30 and 60 KB.
>>
>> If we have 100 MB/s transfer and 5 ms seek time, using a 50/50 split
>> between seek and transfer time gives 512 KB as a reasonable minimum disk
>> IO unit.
>
> While you math may be fine, your assumptions aren't correct. The
> Seagate data sheet for their Barracuda class drives (of course others
> are different)
>
> Data sheet linked to from
>
> http://www.seagate.com/internal-hard-drives/desktop-hard-drives/
>
> gives the average seek time of 8.5 for reads/9.5 for writes (reads are
> faster because writes require a longer settling time to assure that data
> is written exactly on track. If the read is off, it can always be
> retried). In reality, the average seek time is less because this number
> assumes random seeks across the whole drive, but most drives are less
> than 100 % full and data is clustered towards the front of the drive,
> but the real number is hard to measure and inappropriate for a vendor to
> quote as a spec.) So let's say 9 ms.
>
> The drive rotates at 7,200 RPM, so the average rotational delay is 4.17
> ms (half a rotation).
>
> For the transfer rate, you need to use the average rate, not the
> instantaneous rate since you are talking about transfers larger than one
> disk block. The sustained rate varies across the disk, but Seagate
> gives the average number as 156 MB/sec. Again, this is probably
> somewhat low as most of the data is toward the outer diameter of the
> drive where the average rate is faster, but again, not a specable number.

OK, so I was about 35% low on the transfer speed, and similarly
optimistic re seek times, but this still doesn't change the conclusion.
>
> So using the 50/50 numbers the total access time is ~13 ms and at 156
> MB/sec we can transfer about 2 MB in that time. However, what makes the
> 50/50 rule magic? Also, this neglects the effect of caching within the

The 50/50 rule is my (admittedly very simple) rule-of-thumb: This is the
point where we have gained at least half of the maximum possible
sustained IO rate.

> drive, which tends to decrease the access time and slightly increase the
> data transfer rate. How should this be factored in?
>
> Overall, while I agree with the general idea that faster drives should,
> all other things being equal, lead to larger pages (even without
> aggregating multiple smaller pages into super pages), I am skeptical of
> using formulas to tell how much bigger.

I'm not suggesting this at all, I was just using that calculation to
show that for actual paging, the effective block size needs to be _far_
larger than the longtime default 4KB page size.
>
>
>> I've read somewhere that Google is mostly using 64 MB as the block size
>> for GFS, giving 99+% transfer efficiency.
>
>
> Could very well be, but almost certainly that is not for program paging.

You've seen my posts, I believe actual paging to be totally
unacceptable: If you cannot afford enough ram to keep your entire
working set loaded, then you are doing something terribly wrong, like
trying to run two things at once when you should instead setup a batch
process.

One of the recent suggestions was that by keeping the swapped pages on
SSD you could handle applications that go dormant for long periods of
time, then quickly bring them back in: This is also bogus in that it (a)
requires SSD storage to be an order of magnitude cheaper than RAM, and
(b) needs a guarantee that you will never need to load in simultaneously
more of those dormant applications than will fit in the ram you do have.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Ivan Godard
2013-01-31 22:16:57 UTC
Permalink
On 1/31/2013 1:21 PM, Terje Mathisen wrote:
> Stephen Fuld wrote:

<snip>

>> Could very well be, but almost certainly that is not for program paging.
>
> You've seen my posts, I believe actual paging to be totally
> unacceptable: If you cannot afford enough ram to keep your entire
> working set loaded, then you are doing something terribly wrong, like
> trying to run two things at once when you should instead setup a batch
> process.

Well, sort of. You have given a good rule of thumb for when you have a
good handle on your working set in advance of purchase time. However,
the working set tends in most cases to be very fuzzy and to change
significantly over time horizons much shorter than the hardware upgrade
interval.

In such cases many users would prefer to gracefully degrade into longer
run times over simply halting with a memory-exhausted diagnostic. Paging
is a way to provide that degradation. Yes, your usual 30-second compile
is now eighteen minutes; so go for a walk. Paging lets you get something
done while accounting takes its usual three-week turnaround for more memory.
n***@cam.ac.uk
2013-01-31 22:35:18 UTC
Permalink
In article <keeqgk$df5$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>On 1/31/2013 1:21 PM, Terje Mathisen wrote:
>>
>> You've seen my posts, I believe actual paging to be totally
>> unacceptable: If you cannot afford enough ram to keep your entire
>> working set loaded, then you are doing something terribly wrong, like
>> trying to run two things at once when you should instead setup a batch
>> process.
>
>Well, sort of. You have given a good rule of thumb for when you have a
>good handle on your working set in advance of purchase time. However,
>the working set tends in most cases to be very fuzzy and to change
>significantly over time horizons much shorter than the hardware upgrade
>interval.
>
>In such cases many users would prefer to gracefully degrade into longer
>run times over simply halting with a memory-exhausted diagnostic. Paging
>is a way to provide that degradation. Yes, your usual 30-second compile
>is now eighteen minutes; so go for a walk. Paging lets you get something
>done while accounting take its usual three week turnaround for more memory.

That used to be true in the 1960s and 1970s, but is dubious today.
However, let it pass.

The other side is that, by requiring support for paging, you are
forcing a 10% performance hit on almost all programs, a MUCH
larger one on some important programs, and significantly degrading
RAS. That's not a good exchange.


Regards,
Nick Maclaren.
Ivan Godard
2013-02-01 00:32:56 UTC
Permalink
On 1/31/2013 2:35 PM, ***@cam.ac.uk wrote:
> In article <keeqgk$df5$***@dont-email.me>,
> Ivan Godard <***@ootbcomp.com> wrote:
>> On 1/31/2013 1:21 PM, Terje Mathisen wrote:
>>>
>>> You've seen my posts, I believe actual paging to be totally
>>> unacceptable: If you cannot afford enough ram to keep your entire
>>> working set loaded, then you are doing something terribly wrong, like
>>> trying to run two things at once when you should instead setup a batch
>>> process.
>>
>> Well, sort of. You have given a good rule of thumb for when you have a
>> good handle on your working set in advance of purchase time. However,
>> the working set tends in most cases to be very fuzzy and to change
>> significantly over time horizons much shorter than the hardware upgrade
>> interval.
>>
>> In such cases many users would prefer to gracefully degrade into longer
>> run times over simply halting with a memory-exhausted diagnostic. Paging
>> is a way to provide that degradation. Yes, your usual 30-second compile
>> is now eighteen minutes; so go for a walk. Paging lets you get something
>> done while accounting take its usual three week turnaround for more memory.
>
> That used to be true in the 1960s and 1970s, but is dubious today.
> However, let it pass.
>
> The other side is that, by requiring support for paging, you are
> forcing a 10% performance hit on almost all programs, a MUCH
> larger one on some important programs, and significantly degrading
> RAS. That's not a good exchange.

Nor is it an essential one.

When you view the goal as a graceful degradation in the event of excess
global memory demand (but not excess memory working set) then the
devices, OS and hardware, that supported 1970's time sharing are inapt
at best. The machinery back then, hardware and software, was
resource-constrained in all sorts of ways that are no longer relevant,
or look to be so very soon.

The granularity of locality, both temporal and physical, follows a
self-similar power law that can only be approximated by tilings. Caches
"page" at line granularity; MMPs "page" at process granularity. Grids of
fixed size blocks of address space are so Ferranti Atlas :-)

Ivan
n***@cam.ac.uk
2013-02-01 09:56:12 UTC
Permalink
In article <kef2fj$h5p$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>
>The granularity of locality, both temporal and physical, follows a
>self-similar power law that can only be approximated by tilings. Caches
>"page" at line granularity; MMPs "page" at process granularity. Grids of
>fixed size blocks of address space are so Ferranti Atlas :-)

That's the hardware, but my experience and investigations indicate
that it is not true for applications :-(

The problem is that there are several very different 'power laws'
in effect and the result is, the more memory there is, the wider
the variation between the best page size for each. And, of course,
the wider the variation in page sizes that each application needs
for different data areas or uses.

My approach is to adopt the way that the Atlas actually used memory;
segments were swappable, but not pageable. However, with modern
constraints, I don't believe the use of pages to avoid fragmentation
is needed any longer. Segments are extended rarely enough, and memory
is cheap enough, that separating them fairly widely and moving them
when needed is a better solution.


Regards,
Nick Maclaren.
Ivan Godard
2013-02-01 12:09:24 UTC
Permalink
On 2/1/2013 1:56 AM, ***@cam.ac.uk wrote:
> In article <kef2fj$h5p$***@dont-email.me>,
> Ivan Godard <***@ootbcomp.com> wrote:
>>
>> The granularity of locality, both temporal and physical, follows a
>> self-similar power law that can only be approximated by tilings. Caches
>> "page" at line granularity; MMPs "page" at process granularity. Grids of
>> fixed size blocks of address space are so Ferranti Atlas :-)
>
> That's the hardware, but my experience and investigations indicates
> that it is not true for applications :-(
>
> The problem is that there are several very different 'power laws'
> in effect and the result is, the more memory there is, the wider
> the variation between the best page size for each. And, of course,
> the wider the variation in page sizes that each application needs
> for different data areas or uses.
>
> My approach is to adopt the way that the Atlas actually used memory;
> segments were swappable, but not pageable. However, with modern
> constraints, I don't believe the use of pages to avoid fragmentation
> is needed any longer. Segments are extended rarely enough, and memory
> is cheap enough, that separating them fairly widely and moving them
> when needed is a better solution.

So you would abandon paging for segmentation? As an old Burroughs hand
you'll get few of the usual arguments from me - I agree that swap is
efficient enough.

But segmentation itself is not sufficient for good memory utilization
either. The problem is that whole segments don't track the power law
usages much better than pages do. Say you are working on a whopping big
matrix - and are spending nearly all your time near the diagonal. Or you
are running a browser - and are spending nearly all your time in the MP3
decoder.

Neither of these tile well with any particular size pages, but segments
at the matrix or load module level pull in vast quantities of idle
memory. And all of what people do with machines is like this, at rapidly
changing temporal scales.

There's an obvious solution for optimum use of physmem: paging, with
one-byte pages. We already do that, or near enough - we call the
hardware "caches". KSR and others have explored COMA at the page level,
but that's too coarse - would COMA work at the line level? No way I can
see to get the directory/swap overhead down below the lossage cost of
the inefficient use of memory at larger granularity. It seems cheaper to
waste memory than to keep track of the little pieces.

So if we give up on fine-grain COMA, the question becomes whether, over
the fuzzy "average" usage distribution, whole object swap (aka
segmentation) gives a better or worse mapping of actual memory
utilization than finer granularity tiled swap (aka paging).

This is a question on which reasonable engineers will differ. My own
take is that small objects tend to be used in full or not at all, and so
segmenting single string literals or single function code bodies works
well, while for larger objects, like big arrays, memory-mapped data
files, and load modules, paging does a better job of capturing local hot
spots. Of course, with fine-grained segments the number of things to
keep track of is vastly larger than even 4k pages, and cost is
proportional to count and not size.

The problem with segments is garbage collect. The Burroughs hardware had
a machine operation called MSEQ (masked search for equal) in which we
actually searched memory for copies of descriptors so we could fix them
up when a segment was swapped out. You could do that in 1968 with equal
core and CPU speeds and only 26 bits of memory. Today we need complex
structure and indirection to do segments, and memory is cheap enough
that the saving in memory is simply not enough to pay for the added cost
of segments over pages.

However, if segments bring added value in their own right for other
reasons then using them also for memory is a freebie that should be
taken. I'm with Andy on this - I drank the capability Kool-Aid too. If
you can sell caps on RAS and programability grounds then beyond doubt
segmentation is the way to go, because it falls out of caps for free.

Like Andy, I know how to build a caps machine that would brush your
teeth. Neither of us have ever sold anybody on building one though, and
I at least don't know how to sell the result to end users even if I
could get it built.

So the Mill uses a grant model rather than caps, because it's
unobtrusive enough to sell to people, and languages, that want an
emulated PDP-11.
Michael S
2013-02-01 12:30:11 UTC
Permalink
On Feb 1, 2:09 pm, Ivan Godard <***@ootbcomp.com> wrote:
> On 2/1/2013 1:56 AM, ***@cam.ac.uk wrote:
>
> So the Mill uses a grant model rather than caps, because it's
> unobtrusive enough to sell to people, and languages, that want an
> emulated PDP-11.

So the Mill is not a pure hobby? Do you have business plans about it?
n***@cam.ac.uk
2013-02-01 12:34:23 UTC
Permalink
In article <kegb8v$r4u$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>
>So you would abandon paging for segmentation? As an old Burroughs hand
>you'll get few of the usual arguments from me - I agree that swap is
>efficient enough.

Yes, precisely, though from a different background!

>But segmentation itself is not sufficient for good memory utilization
>either. The problem is that whole segments don't track the power law
>usages much better than pages do. Say you are working on a whopping big
>matrix - and are spending nearly all your time near the diagonal. Or you
>are running a browser - and are spending nearly all your time in the MP3
>decoder.

I don't see the problem. My point is not to abolish caching, which
is what helps there, but page tables as such. The problem with page
tables is NOT their use as transparent caches, but their non-transparent
effects which are usually handled by interrupt or other glitching.
And it's that aspect that is so harmful.

The hardware point is that this exchanges a large TLB of fixed-size
pages for a much smaller one of arbitrary-size pages. Roughly cost
neutral.
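
As a concrete reading of that trade (an editorial sketch, not anyone's
actual design), the translation structure becomes a handful of
base+limit entries, each covering one arbitrarily sized segment, rather
than thousands of fixed-size page entries. Entry count and field layout
are assumptions.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vbase;   /* virtual start of the segment            */
    uint64_t limit;   /* segment length in bytes                 */
    uint64_t pbase;   /* physical start (physically contiguous)  */
    uint8_t  perms;
    bool     valid;
} seg_entry;

#define NSEG 16       /* assumed: a few entries suffice          */

static bool seg_translate(const seg_entry tlb[NSEG], uint64_t va,
                          uint64_t *pa)
{
    for (int i = 0; i < NSEG; i++) {
        const seg_entry *e = &tlb[i];
        if (e->valid && va >= e->vbase && va - e->vbase < e->limit) {
            *pa = e->pbase + (va - e->vbase);
            return true;
        }
    }
    return false;     /* fault: load or relocate the segment     */
}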

>There's an obvious solution for optimum use of physmem: paging, with
>one-byte pages. We already do that, or near enough - we call the
>hardware "caches". KSR and others have explored COMA at the page level,
>but that's too coarse - would COMA work at the line level? No way I can
>see to get the directory/swap overhead down below the lossage cost of
>the inefficient use of memory at larger granularity. It seems cheaper to
>waste memory than to keep track of the little pieces.

Well, yes - but, like Terje, I regard that as an obsolete requirement.
I know of applications where physical memory is the main bottleneck,
and the tuning of every single one involves ensuring that they NEVER
page! Note that I regard the claims that some applications need to
load every library on the system as merely evidence of such gross
misdesign that it should be discounted. Yes, they exist, but they
are quite simply broken by design.

>The problem with segments is garbage collect. The Burroughs hardware had
>a machine operation called MSEQ (masked search for equal) in which we
>actually searched memory for copies of descriptors so we could fix them
>up when a segment was swapped out. You could do that in 1968 with equal
>core and CPU speeds and only 26 bits of memory. Today we need complex
>structure and indirection to do segments, and memory is cheap enough
>that the saving in memory is simply not enough to pay for the added cost
>of segments over pages.

Eh? That's what use counts are for! And, if you go to segments (by
which I mean Unix-style, not x86-style), they are swapped in and out
only as part of program or library loading or unloading, which is a
very heavyweight activity. It's only small pages that need hardware
support.

>Like Andy, I know how to build a caps machine that would brush your
>teeth. Neither of us have ever sold anybody on building one though, and
>I at least don't know how to sell the result to end users even if I
>could get it built.

Right - and, like you two, I have used such a system and believe that
it is an essential component of improving RAS, but there ain't no
chance, politically :-(


Regards,
Nick Maclaren.
Quadibloc
2013-02-01 14:03:59 UTC
Permalink
On Feb 1, 5:09 am, Ivan Godard <***@ootbcomp.com> wrote:

> There's an obvious solution for optimum use of physmem: paging, with
> one-byte pages. We already do that, or near enough - we call the
> hardware "caches".

If the next level after RAM is something that behaves like bulk core,
you could indeed turn the RAM into a giant cache. This may be doable
with flash memory as the next level, but it would not work at all well
if you're dealing with magnetic storage, specifically hard disks.

With hard disks, you had better bring stuff down in large chunks.

So the problem is that general computer designs can't be predicated on
the existence of a random-access mass storage layer between them and
the hard disks just yet; that kind of hardware is too new or too
expensive.

It's not like we really have something random-access which is _that_
much cheaper than DRAM to make it an obvious pick for having more
effective memory capacity.

John Savard
Bill Findlay
2013-02-01 15:13:13 UTC
Permalink
On 01/02/2013 09:56, in article keg3fs$7s1$***@needham.csi.cam.ac.uk,
"***@cam.ac.uk" <***@cam.ac.uk> wrote:


>
> My approach is to adopt the way that the Atlas actually used memory;
> segments were swappable, but not pageable. However, with modern
> constraints, I don't believe the use of pages to avoid fragmentation
> is needed any longer. Segments are extended rarely enough, and memory
> is cheap enough, that separating them fairly widely and moving them
> when needed is a better solution.
>

Are you talking about Titan here, Nick, rather than the Manchester Atlas?

--
Bill Findlay
with blueyonder.co.uk;
use surname & forename;
n***@cam.ac.uk
2013-02-01 15:15:43 UTC
Permalink
In article <CD318C09.24F3E%***@blueyonder.co.uk>,
Bill Findlay <***@blueyonder.co.uk> wrote:
>
>> My approach is to adopt the way that the Atlas actually used memory;
>> segments were swappable, but not pageable. However, with modern
>> constraints, I don't believe the use of pages to avoid fragmentation
>> is needed any longer. Segments are extended rarely enough, and memory
>> is cheap enough, that separating them fairly widely and moving them
>> when needed is a better solution.
>
>Are you talking about Titan here, Nick, rather than the Manchester Atlas?

Er, yes :-)

But it is also the way that many/most sites ran IBM MVS, insofar as
they could, after the initial enthusiasm for paging ran out.


Regards,
Nick Maclaren.
Quadibloc
2013-01-30 13:30:18 UTC
Permalink
On Jan 29, 4:56 pm, "Mike" <***@mike.net> wrote:

> A 4KB page size for virtual memory has been almost ubiquitous since
> the 70's.  However, since then, PC's have grown to be tens of
> thousands of times faster and larger than mainframes were.  CPU cache
> is as big or bigger than mainframe ram.  Ram is as big or bigger than
> 70's disk farms, and disk has grown even more.   (Luckily, ram speed
> is still faster in number of machine cycles to access than disk was or
> we would be just spinning our wheels.)  Finally, applications have
> also grown tremendously but probably less in active set code size than
> in data area.

All of that is true, and would seem to make a definitive case for
larger pages. However, one thing _hasn't_ grown. The bus on a Pentium
system may be 64 bits wide, the same as on a 360/195 with a maximum of
4 megabytes of main memory.

Well, at least hasn't grown _much_. I think we actually are on 128-bit
buses on some systems these days. OTOH, the 360/195 had deeper
interleaving than the current iteration of DDR, IIRC.

John Savard
Michael S
2013-01-30 14:15:09 UTC
Permalink
On Jan 30, 3:30 pm, Quadibloc <***@ecn.ab.ca> wrote:
> On Jan 29, 4:56 pm, "Mike" <***@mike.net> wrote:
>
> > A 4KB page size for virtual memory has been almost ubiquitous since
> > the 70's.  However, since then, PC's have grown to be tens of
> > thousands of times faster and larger than mainframes were.  CPU cache
> > is as big or bigger than mainframe ram.  Ram is as big or bigger than
> > 70's disk farms, and disk has grown even more.   (Luckily, ram speed
> > is still faster in number of machine cycles to access than disk was or
> > we would be just spinning our wheels.)  Finally, applications have
> > also grown tremendously but probably less in active set code size than
> > in data area.
>
> All of that is true, and would seem to make a definitive case for
> larger pages. However, one thing _hasn't_ grown. The bus on a Pentium
> system may be 64 bits wide, the same as on a 360/195 with a maximum of
> 4 megabytes of main memory.
>
> Well, at least hasn't grown _much_. I think we actually are on 128-bit
> buses on some systems these days. OTOH, the 360/195 had deeper
> interleaving than the current iteration of DDR, IIRC.
>
> John Savard

What is the data bus width? That's not an easy question, even if we ignore
the fact that the majority of today's processors have separate (and
different) buses for memory, I/O and SMP interconnects, and focus on the
memory side alone.
Take, for example, two recent generations of Intel's high end Xeon
chips, i.e. Westmere-EX and SandyBridge-E.
Both can attach 4 parallel DDR3 DIMM "channels". However when you look
at physical pins the picture is quite different:
On Westmere-EX each channel is connected to a scalable memory buffer (SMB)
by means of 23 differential pairs (10 outputs + 13 inputs).
SandyBridge-E, on the other hand, talks directly to DIMMs, so, for
each channel, there are 40 unidirectional control/address signals and
108 bidirectional data/parity/datastrobe signals.
In both cases I ignored clocks and clock-enable signals, which are not
identical either.

So is "data bus width" of Westmere-EX identical to SandyBridge-E or,
may be, not?
Andy (Super) Glew
2013-01-31 00:33:58 UTC
Permalink
On 1/28/2013 1:05 PM, ***@aol.com wrote:
> On Monday, January 28, 2013 4:53:19 AM UTC-5, Michael S wrote:
>> Only if memory is "free", and then you are very certain that in the
>> future you wouldn't want to page *part* of this buffer out.
>> On [modern fat x86] CPUs situations in which TLB misses account for
>> more than couple of percents of the run time are very rare. So, IMHO,
>> the best strategy for OS is to do absolutely nothing about it.
>
> I recall an experimental Linux kernel where it was changed to use only
> large pages. It ran about 10% faster IIRC. I think they were using a P-III.
>
> The reason TLB misses are so low these days is because Windows (and
> probably Linux) have been tuned for years (as in close to a decade) to take advantage of large pages.
>
> - Tim
>

Unless things have changed dramatically in the last two years, this is
not really true.

OS support for large pages was often a benchmark special, not used in
many real systems. Or used by OS, not by user processes.

As reflected by and caused by the ridiculously low number of large TLB
entries in many early machines. Oftentimes you could map less memory
with large than small.

But recent machines, since Nehalem, have had decent numbers of large TLB
entries.

Probably the best benefit has been

* PDE caching

* L2 TLBs.


Credit to Mitch for his TLB designs that fairly flexibly shared entries
between large and small TLB entries.

--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
Vince Weaver
2013-01-31 02:21:15 UTC
Permalink
On 2013-01-31, Andy (Super) Glew <***@SPAM.comp-arch.net> wrote:
>
> OS support for large pages was often a benchmark special, not used in
> many real systems. Or used by OS, not by user processes.

I keep seeing responses here like this where huge pages are seen as some
sort of lost technology.

Linux has added "Transparent Huge Page" support back in 2.6.38 or so,
about 2 years ago. So in theory if you're running a recent Linux
distribution you may be using huge pages now.

Some details here: http://lwn.net/Articles/423584/

I know there have been some rough edges getting the support to work smoothly,
so maybe it's not enabled everywhere by default yet.

Vince
Andy (Super) Glew
2013-01-31 07:18:34 UTC
Permalink
On 1/30/2013 6:21 PM, Vince Weaver wrote:
> On 2013-01-31, Andy (Super) Glew <***@SPAM.comp-arch.net> wrote:
>>
>> OS support for large pages was often a benchmark special, not used in
>> many real systems. Or used by OS, not by user processes.
>
> I keep seeing responses here like this where huge pages are seen as some
> sort of lost technology.
>
> Linux has added "Transparent Huge Page" support back in 2.6.38 or so,
> about 2 years ago. So in theory if you're running a recent Linux
> distribution you may be using huge pages now.
>
> Some details here: http://lwn.net/Articles/423584/
>
> I know there have been some rough edges getting the support to work smoothly
> so maybe it's not enabled by all by default yet.
>
> Vince
>

Maybe not lost, but really, really, slow to take off.

If they have finally taken off, great.

Cry wolf...



Large pages were introduced by P6 / Pentium Pro in 1997.

Actually, they were designed for the P5 (Pentium), but withdrawn at the
last minute
- see http://www.rcollins.org/articles/2mpages/2MPages.html
for mentions in the P5 manuals.


I.e. they were already "old hat", something to be assumed, when I
arrived at Intel in 1991.


But they took a long time to take off. Part of the reason being chicken
and egg:

* they were originally justified to map framebuffers
* which required only a very small number of large TLB entries
* which meant that using large TLB entries for other stuff was a loss
* which meant that SW/OS were reluctant to use it

etc., etc.



Chicken-and-egg. 1997 til ... what, 2011? 14 years.




So you may understand why so many of us got exasperated by the delays.



But I suppose that it is a lesson to me: things change. Not quickly,
but eventually.






--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
Marko Zec
2013-01-31 09:21:10 UTC
Permalink
"Andy (Super) Glew" <***@spam.comp-arch.net> wrote:
> On 1/30/2013 6:21 PM, Vince Weaver wrote:
>> On 2013-01-31, Andy (Super) Glew <***@SPAM.comp-arch.net> wrote:
>>>
>>> OS support for large pages was often a benchmark special, not used in
>>> many real systems. Or used by OS, not by user processes.
>>
>> I keep seeing responses here like this where huge pages are seen as some
>> sort of lost technology.
>>
>> Linux has added "Transparent Huge Page" support back in 2.6.38 or so,
>> about 2 years ago. So in theory if you're running a recent Linux
>> distribution you may be using huge pages now.
>>
>> Some details here: http://lwn.net/Articles/423584/
>>
>> I know there have been some rough edges getting the support to work smoothly
>> so maybe it's not enabled by all by default yet.
>>
>> Vince
>>
>
> Maybe not lost, but really, really, slow to take off.
>
> If they have finally taken off, great.
>
> Cry wolf...
>
> Large pages were introduced by P6 / Pentium Pro in 1997.
>
> Actually, they were designed for the P5 (Pentium), but withdrawn at the
> last minute
> - see http://www.rcollins.org/articles/2mpages/2MPages.html
> for mentions in the P5 manuals.
>
> I.e. they were already "old hat", something to be assumed, when I
> arrived at Intel in 1991.
>
> But they took a long time to take off. Part of the reason being chicken
> and egg:
>
> * they were originally justified to map framebuffers
> * which required only a very small number of large TLB entries
> * which meant that using large TLB entries for other stuff was a loss
> * which meant that SW/OS were reluctant to use it
>
> etc., etc.
>
> Chicken-and-egg. 1997 til ... what, 2011? 14 years.
>

FreeBSD has supported mapping kernel memory to superpages since circa
2004, and in 2009 got support for transparent on-the-fly mapping of
both kernel and user space virtual memory to superpages as well.

The first report from Rice on "Practical, Transparent, Operating System
Support for Superpages" on FreeBSD and Alpha 21264 was published at
OSDI 2002, so the delay wasn't as bad as you imply...


>
> So you may understand why so many of us got exasperated by the delays.
>
> But I suppose that it is a lesson to me: things change. Not quickly,
> but eventually.
>
Paul A. Clayton
2013-01-31 02:37:12 UTC
Permalink
On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:
> On 1/28/2013 1:05 PM, ***@aol.com wrote:
[snip]
>> I recall an experimental Linux kernel where it was changed to use only
>> large pages. It ran about 10% faster IIRC. I think they were using a
>> P-III.
>>
>> The reason TLB misses are so low these days is because Windows (and
>> probably Linux) have been tuned for years (as in close to a decade) to
>> take advantage of large pages.
>>
>> - Tim
>
> Unless things have changed dramatically in the last two years, this is
> not really true.
>
> OS support for large pages was often a benchmark special, not used in
> many real systems. Or used by OS, not by user processes.

I thought things had improved a little more recently
(e.g., Linux has a khugepaged kernel daemon), though it
seems to me that even with 2MiB large pages there are
some processes that could use a few--like a web browser,
at least a primitive single-process one such as the one
I am using, which appears to use enough memory for a
couple of hundred huge pages :-).

16KiB or even 64KiB pages would probably be much more
useful (and easier for the OS to provide)--or so I
would guess.

> As reflected by and caused by the ridiculously low number of large TLB
> entries in many early machines. Oftentimes you could map less memory
> with large than small.

Yes, "reflected by *AND* caused by" (emphasis added).

I would not be surprised if some OS used huge pages
almost as locked TLB entries (i.e., if the TLB
provided 8 entries, the OS only used 8 huge pages)

> But recent machines, since Nehalem, have had decent numbers of large TLB
> entries.
>
> Probably the best benefit has been
>
> * PDE caching
>
> * L2 TLBs.

That seems very likely to be the case. PDE caching
has an effect similar to a software TLB--single memory
access for common cases. Having 512 L2 entries would
certainly reduce the miss rate relative to, say, just
64 L1 entries.

The lack of use of sectoring, or even just of "partial
subblocks" (i.e., not including differing
translations), seems disappointing to me. (Even
just allowing one way to have two translations
per entry would seem likely to be a win, albeit
slightly irregular in layout.)

> Credit to Mitch for his TLB designs that fairly flexibly shared entries
> between large and small TLB entries.

Any details available on this? Hash-rehash seems most
likely for an L2 TLB with x86's disparate page sizes,
but perhaps a more clever arrangement might have been
designed. Also how was replacement handled. A page
with 512 times more coverage should (ISTM) be more
"sticky". (A skewed associative TLB with cuckoo
replacement could probably reduce issues with
associativity starvation which might be more common
with highly "sticky" entries, at least if such
entries are somewhat frequent.)
Andy (Super) Glew
2013-01-31 07:22:44 UTC
Permalink
On 1/30/2013 6:37 PM, Paul A. Clayton wrote:
> On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:

>> Credit to Mitch for his TLB designs that fairly flexibly shared entries
>> between large and small TLB entries.
>
> Any details available on this? Hash-rehash seems most
> likely for an L2 TLB with x86's disparate page sizes,
> but perhaps a more clever arrangement might have been
> designed. Also how was replacement handled. A page
> with 512 times more coverage should (ISTM) be more
> "sticky". (A skewed associative TLB with cuckoo
> replacement could probably reduce issues with
> associativity starvation which might be more common
> with highly "sticky" entries, at least if such
> entries are somewhat frequent.)

I'll let Mitch decide if his TLB designs are discussable.

However, what I can say is that my interest in hash-rehash - which you
and I, Paul, have discussed - was motivated by trying to fix the
deficiencies of TLBs that hold multiple page sizes.

I.e. I wasn't *totally* happy with Mitch's designs.


--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
EricP
2013-01-31 18:04:52 UTC
Permalink
Andy (Super) Glew wrote:
> On 1/30/2013 6:37 PM, Paul A. Clayton wrote:
>> On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:
>
>>> Credit to Mitch for his TLB designs that fairly flexibly shared entries
>>> between large and small TLB entries.
>>
>> Any details available on this? Hash-rehash seems most
>> likely for an L2 TLB with x86's disparate page sizes,
>> but perhaps a more clever arrangement might have been
>> designed. Also how was replacement handled. A page
>> with 512 times more coverage should (ISTM) be more
>> "sticky". (A skewed associative TLB with cuckoo
>> replacement could probably reduce issues with
>> associativity starvation which might be more common
>> with highly "sticky" entries, at least if such
>> entries are somewhat frequent.)
>
> I'll let Mitch decide if his TLB designs are discussable.
>
> However, what I can say is that my interest in hash-rehash - which you
> and I, Paul, have discussed - was motivated by trying to fix the
> deficiencies of TLBs that hold multiple page sizes.

I'm guessing that you are trying to avoid being forced
to use a fully assoc. tbl?

Eric
Andy (Super) Glew
2013-01-31 20:46:12 UTC
Permalink
Note: changing subject from
Re: what makes a computer architect great?
to
Multiple Page Sizes


On 1/31/2013 10:04 AM, EricP wrote:
> Andy (Super) Glew wrote:
>> On 1/30/2013 6:37 PM, Paul A. Clayton wrote:
>>> On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew
>>> wrote:
>>
>>>> Credit to Mitch for his TLB designs that fairly flexibly shared entries
>>>> between large and small TLB entries.
>>>
>>> Any details available on this? Hash-rehash seems most
>>> likely for an L2 TLB with x86's disparate page sizes,
>>> but perhaps a more clever arrangement might have been
>>> designed. Also how was replacement handled. A page
>>> with 512 times more coverage should (ISTM) be more
>>> "sticky". (A skewed associative TLB with cuckoo
>>> replacement could probably reduce issues with
>>> associativity starvation which might be more common
>>> with highly "sticky" entries, at least if such
>>> entries are somewhat frequent.)
>>
>> I'll let Mitch decide if his TLB designs are discussable.
>>
>> However, what I can say is that my interest in hash-rehash - which you
>> and I, Paul, have discussed - was motivated by trying to fix the
>> deficiencies of TLBs that hold multiple page sizes.
>
> I'm guessing that you are trying to avoid being forced
> to use a fully assoc.

Yes. See below.


>tbl?

I do not understand.

---

This prompted me to make a minor cleanup to
https://www.semipublic.comp-arch.net/wiki/Hash-rehash_TLBs_for_multiple_page_sizes


== The Problem: multiple page sizes in limited associativity TLBs ==

It is nice to have [[multiple TLB page sizes]], or [[superpages]].

It is fairly easy to have multiple page sizes in a [[fully associative
TLB]]
that can be directed to ignore the appropriate number of low-order bits.
I.e. a [[masked TLB]], or a TLB that uses the [[extra bit trick]] to
mask off the low order bits.
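
A rough C sketch of that masked match, with invented field names and
x86-ish sizes used purely as an example:

#include <stdbool.h>
#include <stdint.h>

/* One fully associative, masked TLB entry.  'mask' has 1s only in the
   address bits that participate in the compare, so a 2MiB entry simply
   clears nine more low-order bits of the mask than a 4KiB entry does. */
struct masked_tlb_entry {
    uint64_t vtag;    /* virtual address, offset bits zeroed */
    uint64_t mask;    /* which address bits to compare       */
    uint64_t pbase;   /* physical base of the (super)page    */
    bool     valid;
};

static bool masked_match(const struct masked_tlb_entry *e, uint64_t va,
                         uint64_t *pa)
{
    if (e->valid && ((va ^ e->vtag) & e->mask) == 0) {
        *pa = e->pbase | (va & ~e->mask);   /* keep the in-page offset */
        return true;
    }
    return false;
}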

However, fully associative TLBs are expensive.
And masked TLBs are even more expensive.
Many systems prefer to use lower associativity, e.g. 4-way or 16-way
associative.

But an N-way associative TLB cannot easily hold different page sizes,
since you typically want to index the TLB using low order bits,
and tag match using high order bits.
And the low order bits you want to index with are different for
different page sizes.

(You *could* use common bits to index.
If there are enough such common bits. (Perhaps true for big address
spaces, e.g. 64 bits. Not always true for small address spaces.)
But this only works if the entire address space is used uniformly.
If small page accesses are clustered in particular ranges of the address
bits used for indexing, it loses.)

== Hash-rehash to the rescue ==

[[Hash-rehash]], aka multiple probes,
provides one way of accommodating multiple page sizes in the same N-way
associative TLB.

In such a TLB,
you might index first using the index function of the most common page
size - say 4KiB pages.
If that misses, you may then index using the index function of the next
larger page size: say 2MiB pages.
If that misses... 1GiB... and up.
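
A bare-bones C sketch of that probe order; tlb_probe() is an assumed
stand-in for an ordinary set lookup plus tag compare at the given page
size, not a real interface:

#include <stddef.h>
#include <stdint.h>

/* Assumed helper: probe the set selected by the index function for
   'page_shift' and compare tags; returns nonzero on a hit. */
extern int tlb_probe(uint64_t va, unsigned page_shift, uint64_t *pa);

int hash_rehash_lookup(uint64_t va, uint64_t *pa)
{
    /* Probe with the most common size first: 4KiB, then 2MiB, then 1GiB. */
    static const unsigned shifts[] = { 12, 21, 30 };

    for (size_t i = 0; i < sizeof shifts / sizeof shifts[0]; i++)
        if (tlb_probe(va, shifts[i], pa))
            return 1;           /* hit on probe i            */
    return 0;                   /* miss: fall back to a walk */
}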

== L1/L2 TLBs, and hash-rehash ==

It is natural to consider making an L2 TLB hash-rehash.

The variable latency of hash-rehash is less important, since you are already
handling variable latency given the L1 TLB miss.

=== Design Point: L1 TLB uniform, L2 TLB hash-rehash ===

Obviously, for a single probe port, multiple probes
costs more time,
and produces variable latency.

One attractive way of coping with this is to make the L1 TLB the
smallest page size,
but to make the L2 TLB hash-rehash.

One might "split" large pages into smaller page size entries when moved
from L1 to L2.

Note that such TLB entry "fragging" (fragmenting) is occasionally needed
anyway. E.g. on Intel x86,
one often wants to use a single large TLB entry from [0,2M), i.e. for
low memory.
But, although this memory may not need finer grain *mappings*,
it often needs finer grain memory attributes,
since some parts are ordinary DRAM, and need to be mapped [[WB]],
while others are [[MMIO (memory mapped I/O)]] and need to be mapped [[UC]],
and still others are [[WP (write protected)]] memory type.
I.e. the memory types are inconsistent.
And if you want to cache the physical memory types in the TLB,
rather than accessing structures such as the [[MTRRs]] in addition to
the TLB...

=== Design Point: L1 TLB fully associative, L2 TLB hash-rehash ===

Another design point is to have a fully associative TLB
that supports multiple page sizes as the L1 TLB.
And a hash-rehash TLB for the L2.

== Design Point: Separate L1 TLBs by page size, hash-rehash L2 TLB ==

... Obvious.. although this exposes you to the usual sizing issues.

Hybrid approaches - where you fragment infrequently used large pages,
while loading frequently used large pages
into special TLB entries -
are
a) obvious,
but
b) hard to tune.

== Multiple Ports and Hash-Rehash ==

If you have multiple TLB translation ports

a) you might use them both for small page translations if needed

but

b) if only one translation is needed in a given cycle,
you might do a small translation on one,
and a large translation on the other.


== Implications for Virtual Memory Architecture ==

Multiple page sizes may be a good thing.
But allowing all page sizes - 4K, 8K, 16K ... 2M, 4M, ... 1GB, 2GB, ...
may be overkill.

Although implementable by a [[fully associative masked TLB]],
even there it is more expensive - more mask bits.

And hash-rehash implies that we want to have as few extra probes as
possible.

== See Also ==

* [[TLB Structures for Multiple Page Sizes]]
* [[TLB consistency for TLBs that may have multiple entries for a given
address]]
* [[page table structures for multiple page sizes]]


--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
Andy (Super) Glew
2013-01-31 20:53:20 UTC
Permalink
On 1/30/2013 11:22 PM, Andy (Super) Glew wrote:
> On 1/30/2013 6:37 PM, Paul A. Clayton wrote:
>> On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:
>
>>> Credit to Mitch for his TLB designs that fairly flexibly shared entries
>>> between large and small TLB entries.
>>
>> Any details available on this? Hash-rehash seems most
>> likely for an L2 TLB with x86's disparate page sizes,
>> but perhaps a more clever arrangement might have been
>> designed. Also how was replacement handled. A page
>> with 512 times more coverage should (ISTM) be more
>> "sticky". (A skewed associative TLB with cuckoo
>> replacement could probably reduce issues with
>> associativity starvation which might be more common
>> with highly "sticky" entries, at least if such
>> entries are somewhat frequent.)
>
> I'll let Mitch decide if his TLB designs are discussable.
>
> However, what I can say is that my interest in hash-rehash - which you
> and I, Paul, have discussed - was motivated by trying to fix the
> deficiencies of TLBs that hold multiple page sizes.
>
> I.e. I wasn't *totally* happy with Mitch's designs.

>> A page with 512 times more coverage should (ISTM) be more
>> "sticky".

I don't think this is true.

Or, rather: I think that "ideal LRU" would provide all the stickiness
needed.

The problem is that our LRU implementations are not ideal. Pseudo-LRU.
Clock-LRU.

It may motivate extra bits for LRU tracking.

Or, possibly, a fairshare type randomization - LRU, distorted by a
randomization biased by page size of the possible victims.





--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
Paul A. Clayton
2013-02-01 00:46:34 UTC
Permalink
On Thursday, January 31, 2013 3:53:20 PM UTC-5, Andy (Super) Glew wrote:
> On 1/30/2013 11:22 PM, Andy (Super) Glew wrote:
> > On 1/30/2013 6:37 PM, Paul A. Clayton wrote:
[snip]
>>> A page with 512 times more coverage should (ISTM) be more
>>> "sticky".
>
> I don't think this is truie.
>
> Or, rather: I think that "ideal LRU" would provide all the necessary
> stickiness needed.

I was thinking that informing the replacement
policy with frequency of use could be helpful.

In a four way set associative TLB, it would only
take access to four pages with the same index
before a huge page is kicked out if simple LRU
replacement is used.

Perhaps I am not thinking clearly in this, but
it seems one might run into something like a
birthday paradox.

ISTR that frequency-informed replacement works
better than strict LRU with larger caches and
with certain workload characteristics. Page
size alone might be an extremely simplistic
form of frequency quasi-information, but it is
freely available.

> The problem is that our LRU implementations are not ideal. Pseudo-LRU.
> Clock-LRU.

For 8 way or less with simple indexing, true
LRU might not be so bad; but with skewed
associativity (especially with cuckoo replacement)
true LRU can be as expensive as it would be for
a fully associative cache.

It bothers me that clock-LRU replacement is used
when entry-pair clock+LRU-within-pair has the
same storage and is probably not that much more
complex, though perhaps the intent is to support
something close to MRU or random replacement
when conflict misses are too high.

> It may motivate extra bits for LRU tracking.
>
> Or, possibly, a fairshare type randomization - LRU, distorted by a
> randomization biased by page size of the possible victims.
MitchAlsup
2013-02-01 07:21:58 UTC
Permalink
On Thursday, January 31, 2013 1:22:44 AM UTC-6, Andy (Super) Glew wrote:
> I'll let Mitch decide if his TLB designs are discussable.

I have done a couple of TLB designs where the page size was configurable.

In one design (SPARC V9), basically, the only way to do the TLB was to
have a bit in the TLB for every 6 bits in the TLB CAM to turn that part
of the CAM off, so it supported 8K pages, 512KB pages, 32MB pages, and so on.
The logic in the CAM was "not that bad" and had the effect of preserving
the FA nature of the TLB.

I never liked this particular arrangement, but it seemed forced upon us
by the architecture. So we trudged forward.

In another design, we were not allowed access to CAM technology, so we
simply used 4 SRAM arrays. When used with 4KB pages (only) we had 1024
entries of 4-way set associativity, and it worked rather well. When used
with 4KB and 2MB pages it worked as 3-way set associative on the 4KB pages
and DM for the 2MB pages. When used with 4KB, 2MB and 1GB pages it was
2-way SA at the 4KB level (512 pages), DM at the 2MB level (256 entries)
and DM at the 1GB level (also 256 entries).
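
Reading between the lines, a lookup in the three-page-size configuration
might dispatch the four SRAM arrays roughly as in this C sketch; the array
sizes follow the description above, but everything else, including
entry_hits(), is guessed for illustration (and in hardware all four arrays
would of course be probed in parallel):

#include <stdint.h>

struct tlb_entry { uint64_t vtag, pbase; uint8_t flags; };

/* Four 256-entry SRAM arrays: in the 4K+2M+1G mode, arrays 0-1 form a
   2-way set-associative 4KiB TLB (512 entries), array 2 is direct-mapped
   for 2MiB pages, array 3 is direct-mapped for 1GiB pages. */
static struct tlb_entry array[4][256];

/* Assumed helper: tag compare at the given page size. */
extern int entry_hits(const struct tlb_entry *e, uint64_t va,
                      unsigned page_shift, uint64_t *pa);

int lookup_4k_2m_1g(uint64_t va, uint64_t *pa)
{
    unsigned i4k = (va >> 12) & 0xff;   /* 4KiB set index */
    unsigned i2m = (va >> 21) & 0xff;   /* 2MiB index     */
    unsigned i1g = (va >> 30) & 0xff;   /* 1GiB index     */

    return entry_hits(&array[0][i4k], va, 12, pa)   /* 4K, way 0 */
        || entry_hits(&array[1][i4k], va, 12, pa)   /* 4K, way 1 */
        || entry_hits(&array[2][i2m], va, 21, pa)   /* 2M, DM    */
        || entry_hits(&array[3][i1g], va, 30, pa);  /* 1G, DM    */
}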

The standard 4KB pages worked rather well in this design, but there was almost
no realistic way to use all 256 entries for the 1GB pages (lack of main memory
being the main culprit). But, even here, it worked marvelously for the
databases which allocate 1/2 of main memory at boot time and never relinquish it.

Was this better than a 2-level TLB? That depends on the application!
Was this better than the alternatives? That depends on whether you are
allowed CAM technology (or not) in the TLB!

A great big TLB such as this, backed up with a table walker which accesses
the L2 cache for PTPs and PTEs, must be compared to a smaller TLB with more
associativity, refilled from either an L2 TLB or accesses to the L1 cache.

It all comes down to what kind of application profile you optimize for.

I did not "like" either of the design limitations imposed upon us. But
I tried to remain realistic as to waht we ould do with the technology at
hand. I face a rather similar circumstand now, where I need medium sized
ROM arrays and am being told to implement the bits in logic gates.....

Two other points:
A) SW table walking blows wind--HW table walkers can easily walk several
TLB misses simultaneously, and without having to fault all the way back to
supervisory code; doing the job in fewer cycles and with less opportunity
for SW to fail.
B) The atomic updates to the PTEs also "blow wind", causing all sorts of
irregularities in the memory ordering models.

Mitch
Joe keane
2013-02-01 01:47:35 UTC
Permalink
In article <***@SPAM.comp-arch.net>,
Andy (Super) Glew <***@SPAM.comp-arch.net> wrote:
>Probably the best benefit has been
>
>* PDE caching
>
>* L2 TLBs.

That's trying to solve the problem. Bad architect!

customer:

''I want a machine that supports virtual memory efficiently.''

architect:

''No you don't.''

customer:

''Well yes actually i do.''

architect:

''What about your frame buffer? Do you want *that* to get paged out?''

customer:

''No that doesn't even make sense.''

architect:

''What about your interrupt handling? What if *that* gets paged out?''

customer:

''I'm pretty sure that would be bad.''

architect:

''So see, you don't really want virtual memory after all.''

customer:

''Well yes actually i do.''

architect:

''No you don't.''
n***@cam.ac.uk
2013-01-31 09:43:19 UTC
Permalink
In article <31fa4dd5-a1bf-4f81-ab32-***@googlegroups.com>,
Paul A. Clayton <***@gmail.com> wrote:
>On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:
>
>> As reflected by and caused by the ridiculously low number of large TLB
>> entries in many early machines. Oftentimes you could map less memory
>> with large than small.
>
>Yes, "reflected by *AND* caused by" (emphasis added).
>
>I would not be surprised if some OS used huge pages
>almost as locked TLB entries (i.e., if the TLB
>provided 8 entries, the OS only used 8 huge pages)

Some did, and some may still do.


Regards,
Nick Maclaren.
Andy (Super) Glew
2013-01-31 20:49:05 UTC
Permalink
On 1/31/2013 1:43 AM, ***@cam.ac.uk wrote:
> In article <31fa4dd5-a1bf-4f81-ab32-***@googlegroups.com>,
> Paul A. Clayton <***@gmail.com> wrote:
>> On Wednesday, January 30, 2013 7:33:58 PM UTC-5, Andy (Super) Glew wrote:
>>
>>> As reflected by and caused by the ridiculously low number of large TLB
>>> entries in many early machines. Oftentimes you could map less memory
>>> with large than small.
>>
>> Yes, "reflected by *AND* caused by" (emphasis added).
>>
>> I would not be surprised if some OS used huge pages
>> almost as locked TLB entries (i.e., if the TLB
>> provided 8 entries, the OS only used 8 huge pages)
>
> Some did, and some may still do.

Yes.

And this is what I consider primitive, not-really-proper, support for
large pages. It may be the best they can do, given the hardware
restrictions.

But it is not the same as transparently supporting large and small page
sizes.



--
The content of this message is my personal opinion only. Although I am
an employee (currently of MIPS Technologies; in the past of companies
such as Intellectual Ventures and QIPS, Intel, AMD, Motorola, and
Gould), I reveal this only so that the reader may account for any
possible bias I may have towards my employer's products. The statements
I make here in no way represent my employers' positions on the issue,
nor am I authorized to speak on behalf of my employers, past or present.
EricP
2013-01-28 18:13:57 UTC
Permalink
Michael S wrote:
> On Jan 27, 12:27 am, Terje Mathisen <"terje.mathisen at tmsw.no">
> wrote:
>> Michael S wrote:
>>> On Jan 26, 12:50 pm, ***@cam.ac.uk wrote:
>>>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>>>> 1960s hack.
>> Actual paging is of course totally unacceptable, but TLBs seems like a
>> reasonable way to handle virt-to-phys caching?
>>
>
> Actual paging is still acceptable in some situations.
> Think time sharing machine (e.g. Windows Terminal Server) with few
> dozens of clients that mostly use interactive applications and often
> leave them open without touching kbd/mouse for hours and days.
> In such situation, paging stuff out, especially to [realatively small]
> SSD and using freed memory for [huge rotating RAID] disk cache sounds
> like very reasonable strategy for an OS.
> With current consolidation trends the situation I am describing will
> become more common than it was in last couple of decades.

For a global replacement approach the normal clock algorithm
will trim pages from dormant processes.
For local replacement (WinNT) a system thread can do this.

I'm thinking maybe smart phones shouldn't be trimming pages at all,
so all memory sections are what I call 'resident' sections
with load on demand. Pages page in but don't page out
unless the app stops or free memory gets too low.

>>> We were at it 1000 times already.
>>> Today's OS designers like paging hardware not due to demand paging to
>>> mass storage (although that is not totally obsolete either, especially
>>> in IBM world, both z and POWER, and, with current proliferation of
>>> SSDs, we are likely to see limited renaissance of demand paging in the
>>> PC world as well) but because it greatly simplifies memory management.
>>> Not having to care about fragmentation of physical memory is a Very
>>> Good Thing as far as OS designers are concerned.
>> It is indeed a very good thing, unfortunately we don't quite have it any
>> more...
>>
>>> And it's not just TLBs. OS designers want both TLBs and hardware page
>>> walkers.
>> Due to memory sizes outrunning reasonably TLB sizes, at least as long as
>> everything has a single page size, OSs have to try to merge multiple 4K
>> pages into larger superpages that can use a single TLB entry, right?
>>
>> However, in order to this they must first get rid of (physical) memory
>> fragmentation. :-)
>>
>
> No, OS should not use "big" pages at all except for special things
> like frame buffers, bounce buffers for legacy 32-bit I/O device and
> when explicitly requested by application.
> Processors with too small TLBs should naturally lose market share to
> processors with bigger TLBs and/or to processors that make TLB misses
> cheaper by means of efficient caching of page tables in L2/L3 caches.

I liked Andy's suggestion of using a Block Address Translation (BAT)
table (and I notice that MIPS64 has one) for bulk relocations.

There are multiple ways to do this. One way is to look up each virtual
address in the BAT and, if there is a hit, it inhibits the TLB lookup.
The BAT entry has a virtual address, length, and access and cache controls.
If there is a miss then check the TLB.

So it picks off chunks of virtual space and maps them
directly to physical address ranges.
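
For concreteness, a small C sketch of that lookup order; the BAT entry
layout and the helper names here are invented, not taken from MIPS64 or
PowerPC:

#include <stdbool.h>
#include <stdint.h>

/* One Block Address Translation entry: an arbitrary-length virtual range
   mapped flat onto a physical range, with its own access/cache controls. */
struct bat_entry {
    uint64_t vbase, len, pbase;
    uint32_t access, cache_ctl;
    bool     valid;
};

/* Assumed ordinary page-table-backed TLB lookup. */
extern bool tlb_lookup(uint64_t va, uint64_t *pa);

bool translate(const struct bat_entry *bat, int nbat,
               uint64_t va, uint64_t *pa)
{
    for (int i = 0; i < nbat; i++) {
        const struct bat_entry *b = &bat[i];
        if (b->valid && va >= b->vbase && va - b->vbase < b->len) {
            *pa = b->pbase + (va - b->vbase);  /* BAT hit inhibits the TLB */
            return true;
        }
    }
    return tlb_lookup(va, pa);                 /* BAT miss: normal path */
}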

This would allow the bulk of the non-pageable kernel to be
loaded into a contiguous range of physical memory
and then mapped into a virtual range with two BAT entries.
Drivers loaded at boot time could use this too.
Drivers loaded later would use page table mapping.

(This has implications for reliability. With page mapping if
a memory frame starts to go bad it is simple to move one page
to a new frame, change the PTE, and decommission the old frame.
With BAT mapping this is trickier as the OS is loaded contiguous.)

Another BAT entry could map all the device control registers,
assuming they are all located at high physical addresses.
Another could map the graphics memory.
And another could map all of RAM as an alias so the file system
can access any frame as file cache.

So there need be very little TLB activity for the kernel.

Eric
Paul A. Clayton
2013-01-29 00:45:39 UTC
Permalink
On Monday, January 28, 2013 1:13:57 PM UTC-5, EricP wrote:
[snip]
> There are multiple ways to do this. One way is to lookup each virtual
> address in the BAT and if there is a hit it inhibits the TLB lookup.
> The BAT entry has a virtual address, length, and access and cache controls.
> If there is a miss then check the TLB.
>
> So it picks off chunks of virtual space and maps
> directly to physical address range.
>
> This would allow the bulk of the non-pagable kernel to be
> loaded into a contiguous range of physical memory
> and then mapped into a virtual range with two BAT entries.
> Drivers loaded at boot time could use this too.
> Drivers loaded later would use page table mapping.
>
> (This has implications for reliability. With page mapping if
> a memory frame starts to go bad it is simple to move one page
> to a new frame, change the PTE, and decommission the old frame.
> With BAT mapping this is trickier as the OS is loaded contiguous.)

Something similar to Shadow Memory (proposed in "Increasing
TLB Reach Using Superpages Backed by Shadow Memory",
Mark Swanson et al., 1998--where a virtual large page is
remapped to multiple base-sized pages) might be used where
a secondary TLB near the memory controller (though locating
such at the memory controller would limit the flexibility
of remapping or force a redirection to another memory
controller) remaps some "physical" addresses to actual
physical addresses.

For pages that have gone bad, relatively few entries
would be required and they could be locked in place.
However, this could be extended to support large page
translations with exceptions where a few base-sized
page areas are either not present/invalid or are
mapped to a different location; with such a use
providing a page table (presumably with a format more
oriented to the difference in locality from an
ordinary page table) would seem to make sense. (Such
a TLB could more readily be software-managed without
interrupt concerns since the management software would
not be located at the core providing the memory
request.)
Ivan Godard
2013-01-26 19:23:59 UTC
Permalink
On 1/26/2013 2:50 AM, ***@cam.ac.uk wrote:
> In article <d86e7b46-7d55-43f4-bc84-***@rm7g2000pbc.googlegroups.com>,
> Quadibloc <***@ecn.ab.ca> wrote:
>>
>>>> Exceptions are another matter however. It's hard to figure out what to
>>>> do when there's nothing to be done.
>>>
>>>> Not really. Once you abandon the dogma that they have to be handled
>>> by recoverable interrupts, there are plenty of alternatives.
>>
>> It's certainly true that if a program fails on a divide-by-zero, it's
>> not clear what's gained by trying to continue running. But one would
>> still like to have as much information as possible to diagnose the
>> error.
>
> Trap-diagnose-and-terminate is trivial to implement compared to
> interrupt-fixup-and-recover.
>
>> On the other hand, if one attempts to access an address in virtual
>> memory that is currently in the swap file, one most definitely wants
>> to seamlessly continue.
>
> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
> 1960s hack.

Yes for virtual memory. No for COW, memory-mapped files, and protection.
n***@cam.ac.uk
2013-01-26 20:20:41 UTC
Permalink
In article <ke1age$suu$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>>
>>> On the other hand, if one attempts to access an address in virtual
>>> memory that is currently in the swap file, one most definitely wants
>>> to seamlessly continue.
>>
>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>> 1960s hack.
>
>Yes for virtual memory. No for COW, memory-mapped files, and protection.

I have already described the major problems caused by memory-mapped
files; COW is more debatable, but is only an efficiency hack built
on top of demand paging, not a primary function.

But why on earth does protection imply interrupts (in the sense of
interrupt-fixup-and-recover)?


Regards,
Nick Maclaren.
Ivan Godard
2013-01-27 01:39:53 UTC
Permalink
On 1/26/2013 12:20 PM, ***@cam.ac.uk wrote:
> In article <ke1age$suu$***@dont-email.me>,
> Ivan Godard <***@ootbcomp.com> wrote:
>>>
>>>> On the other hand, if one attempts to access an address in virtual
>>>> memory that is currently in the swap file, one most definitely wants
>>>> to seamlessly continue.
>>>
>>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>>> 1960s hack.
>>
>> Yes for virtual memory. No for COW, memory-mapped files, and protection.
>
> I have already described the major problems caused by memory-mapped
> files; COW is more debatable, but is only an efficiency hack built
> on top of demand paging, not a primary function.
>
> But why on earth does protection imply interrupts (in the sense of
> interrupt-fixup-and-recover)?

Begging your pardon, but you were talking about TLBs. Why on earth does
a TLB imply interrupts?

A TLB (in the location it is usually configured) is essential to enforce
protection. What you do for TLB replacement is a separable issue to that
function.

And what you do on a table (as opposed to TLB) miss is still a different
issue.

I grant that a descriptor model (or caps) is an alternative to a TLB. Do
you feel that in general protection should be attached to the access
rather than to the accessed object? If so I'm inclined to agree, but
there are a number of reasons, not all invalid, why TLBs are more common
than caps.
n***@cam.ac.uk
2013-01-27 12:16:54 UTC
Permalink
In article <ke20h8$rsp$***@dont-email.me>,
Ivan Godard <***@ootbcomp.com> wrote:
>>>>
>>>>> On the other hand, if one attempts to access an address in virtual
>>>>> memory that is currently in the swap file, one most definitely wants
>>>>> to seamlessly continue.
>>>>
>>>> Demand paging is SO 1970s. Memory is cheap. TLBs are a horrible
>>>> 1960s hack.
>>>
>>> Yes for virtual memory. No for COW, memory-mapped files, and protection.
>>
>> I have already described the major problems caused by memory-mapped
>> files; COW is more debatable, but is only an efficiency hack built
>> on top of demand paging, not a primary function.
>>
>> But why on earth does protection imply interrupts (in the sense of
>> interrupt-fixup-and-recover)?
>
>Begging your pardon, but you were talking about TLBs. Why on earth does
>a TLB imply interrupts?

Context, my dear sir, context!

Anyway, we are agreed. TLBs per se (i.e. NOT implemented as they
are at present) are merely caches for memory translation. All of
the major problems come from implementing them by interrupt. There
are many reasonable ways to use TLBs without that, as you say.


Regards,
Nick Maclaren.
Mark Thorson
2013-01-27 00:26:41 UTC
Permalink
Quadibloc wrote:
>
> It's certainly true that if a program fails on a divide-by-zero, it's
> not clear what's gained by trying to continue running. But one would
> still like to have as much information as possible to diagnose the
> error.

This is why an integer NaN encoding is needed.
If anybody cares about the result, they can test
for that. If the result is being used for audio,
video, or graphics, you can skip the test and not
worry about it. At worst, the user might see
a one-frame glitch or hear a pop.
Quadibloc
2013-01-27 01:53:16 UTC
Permalink
On Jan 26, 5:26 pm, Mark Thorson <***@sonic.net> wrote:

> This is why an integer NaN encoding is needed.

But that's impractical, as it would add significant complexity to the
design of integer arithmetic circuits, given that they're so simple
and straightforward, unlike floating-point ones.

Of course, decimal integer NaN codes are possible. Even when Chen-Ho
encoding is used.

John Savard
n***@cam.ac.uk
2013-01-27 12:12:58 UTC
Permalink
In article <11c629b6-1ce0-4068-ad6c-***@w3g2000yqj.googlegroups.com>,
Quadibloc <***@ecn.ab.ca> wrote:
>On Jan 26, 5:26 pm, Mark Thorson <***@sonic.net> wrote:
>
>> This is why an integer NaN encoding is needed.
>
>But that's impractical, as it would add significant complexity to the
>design of integer arithmetic circuits, given that they're so simple
>and straightforward, unlike floating-point ones.

No, it's not. It's been done, and is simple. What it does do is
to make it harder to pun integers and bit-patterns, which is a
pretty revolting facility in a high-level language anyway and
not half as useful as is usually claimed even at the lowest level.


Regards,
Nick Maclaren.
unknown
2013-01-27 12:55:16 UTC
Permalink
***@cam.ac.uk wrote:
> In article <11c629b6-1ce0-4068-ad6c-***@w3g2000yqj.googlegroups.com>,
> Quadibloc <***@ecn.ab.ca> wrote:
>> On Jan 26, 5:26 pm, Mark Thorson <***@sonic.net> wrote:
>>
>>> This is why an integer NaN encoding is needed.
>>
>> But that's impractical, as it would add significant complexity to the
>> design of integer arithmetic circuits, given that they're so simple
>> and straightforward, unlike floating-point ones.
>
> No, it's not. It's been done, and is simple. What it does do is
> to make it harder to pun integers and bit-patterns, which is a
> pretty revolting facility in a high-level language anyway and
> not half as useful as is usually claimed even at the lowest level.

I'd suggest keeping unsigned == a plain binary array of bits and making the
int range symmetrical, reserving the current MININT value (0x800...00) as a
NaN marker.

The unfortunate problem is of course that you must have separate
int/unsigned ADD/SUB/etc operations. :-(
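
A toy C sketch of what the separate signed ADD would then have to do
(purely illustrative; it leans on a GCC/Clang overflow builtin to stand in
for the hardware's overflow detect):

#include <stdint.h>

#define INT64_NAN INT64_MIN   /* 0x8000...0 reserved as the integer NaN */

/* Signed add over the symmetric range [-INT64_MAX, INT64_MAX]:
   NaN in gives NaN out, and overflow produces NaN rather than wrapping.
   Unsigned add would remain a plain binary add. */
int64_t nan_add(int64_t a, int64_t b)
{
    int64_t r;

    if (a == INT64_NAN || b == INT64_NAN)
        return INT64_NAN;                       /* propagate NaN   */
    if (__builtin_add_overflow(a, b, &r) || r == INT64_NAN)
        return INT64_NAN;                       /* overflow -> NaN */
    return r;
}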

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Mark Thorson
2013-01-27 00:21:54 UTC
Permalink
***@cam.ac.uk wrote:
>
> Ivan Godard <***@ootbcomp.com> wrote:
> >
> >Exceptions are another matter however. It's hard to figure out what to
> >do when there's nothing to be done.
>
> Not really. Once you abandon the dogma that they have to be handled
> by recoverable interrupts, there are plenty of alternatives.

Yes, just set the not-sure-about-these-results flag
and keep going.
n***@cam.ac.uk
2013-01-27 12:17:44 UTC
Permalink
In article <***@sonic.net>,
Mark Thorson <***@sonic.net> wrote:
>> >
>> >Exceptions are another matter however. It's hard to figure out what to
>> >do when there's nothing to be done.
>>
>> Not really. Once you abandon the dogma that they have to be handled
>> by recoverable interrupts, there are plenty of alternatives.
>
>Yes, just set the not-sure-about-these-results flag
>and keep going.

That's one approach, but not the only one.


Regards,
Nick Maclaren.
MitchAlsup
2013-01-26 05:31:58 UTC
Permalink
On Friday, January 25, 2013 9:35:51 AM UTC-6, Ivan Godard wrote:
> Exceptions are another matter however. It's hard to figure out what to
> do when there's nothing to be done.

And no general consensus as to what is the correct thing to do.

An exception is an event where, no matter what you define to happen
when one takes place, someone will take exception to your definition.
(probably Kahan).

Mitch
ChrisQ
2013-01-25 17:05:35 UTC
Permalink
On 01/24/13 06:10, mag wrote:

>
> A question I have is how do you go about "getting in there" nowadays? My
> experience is that it's very difficult to break into this field. You
> almost have to be an architect to get an architect's job. No
> opportunities for someone with potential to get his or her feet wet
> exist, unless you have a friend who's a project director that wants to
> turn you loose on something.

I'm not a computer architect, but above all, you need to have a passion
for your subject, otherwise you will never be any more than average. If
it's just a means to an end to earn a living, forget it, as that's
a fundamental conflict of interest. You need to be building stuff out
of work hours, eat sleep and breath it if necessary, just out of the
interest. Build up serious knowledge base to include hardware and software
and understand it from the highest to the lowest levels.

Not easy. Good engineers in any discipline take decades to get that far...

Regards,

Chris
MitchAlsup
2013-01-28 16:41:35 UTC
Permalink
On Thursday, January 24, 2013 12:10:30 AM UTC-6, mag wrote:
> A question I have is how do you go about "getting in there" nowadays? My
> experience is that it's very difficult to break into this field. You
> almost have to be an architect to get an architect's job.

Well, the industry has this problem: the work-life of a successful architect
is measured in decades, and thus openings for these kinds of positions
do not come up all that often.

Yet academia is producing several hundred purported computer architects
per year to fill the 6 available openings (US).

Back to the question: How do you break in?
You work as a designer, a compiler writer, a software engineer; making
good impressions on everyone you come in contact with; bide your time
for 1.5 decades and you just might get lucky.

Mitch
Mark Thorson
2013-01-25 23:13:03 UTC
Permalink
mag wrote:
>
> What makes a computer architect great? Is it just a matter of
> memorizing Tomasulo's algorithm? Is that still necessary?

Great to whom? Management? The key is to
get two or three times as much work done as
the next guy. There is a way to do that,
and make it look easy.

http://www.theregister.co.uk/2013/01/16/developer_oursources_job_china/
ChrisQ
2013-01-26 13:37:17 UTC
Permalink
On 01/25/13 23:13, Mark Thorson wrote:

> Great to whom? Management? The key is to
> get two or three times as much work done as
> the next guy. There is a way to do that,
> and make it look easy.
>
> http://www.theregister.co.uk/2013/01/16/developer_oursources_job_china/

I read about that. I guess they had to fire the guy to save face, but
showing such initiative is rare indeed. Great management and problem-solving
potential; he should be running his own business :-)...

Regards,

Chris