Discussion:
Non Power of 2 Cache Sizes
Vikas Mishra
2005-08-16 07:01:01 UTC
Permalink
Hi Folks,

We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.

But is there any other real reason besides this why "non power of 2
caches" can't be used? I seem to remember having read somewhere that
caches can't be non power of two, but I don't remember where. I have
searched Hennessy and Patterson ("A Quantitative Approach") but
wasn't able to find anything specific which said whether this is
possible or not.

In case there have been real/experimental designs with non power of 2
caches, could you point me to references.

Thanks & Regards,
Vikas
Anton Ertl
2005-08-16 07:44:02 UTC
Permalink
Post by Vikas Mishra
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.
SuperSPARC, 21164, Pentium 4, 21364, Itanium 2.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Grumble
2005-08-16 08:25:06 UTC
Permalink
Post by Anton Ertl
Post by Vikas Mishra
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.
SuperSPARC, 21164, Pentium 4, 21364, Itanium 2.
Do you mean the P4's trace cache?
Anton Ertl
2005-08-16 10:58:16 UTC
Permalink
Post by Grumble
Post by Anton Ertl
Post by Vikas Mishra
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.
SuperSPARC, 21164, Pentium 4, 21364, Itanium 2.
Do you mean the P4's trace cache?
Yes.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Dan Koren
2005-08-16 08:22:20 UTC
Permalink
Post by Anton Ertl
Post by Vikas Mishra
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.
SuperSPARC, 21164, Pentium 4, 21364, Itanium 2.
Some of the PA-RISC chips have 1.75 MB on chip.



dk
Patrick Schaaf
2005-08-16 08:28:51 UTC
Permalink
Post by Vikas Mishra
We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.
Not necessarily, I would think, except for direct-mapped caches.
Way selection is done with an associative tag structure, and I
don't see any reason why that would be more efficient when done
with a power-of-two number of ways. So you could have a 256k*7-way
cache, for a total of 1.75 MB. Some googling finds this page,
which describes exactly that for a PA-RISC processor:

http://www.arcade-eu.info/overview/2005/ev7.html

best regards
Patrick
Anton Ertl
2005-08-16 10:58:56 UTC
Permalink
Post by Patrick Schaaf
Way selection is done with an associative tag structure, and I
don't see any reason why that would be more efficient when done
with a power-of-two number of ways.
I guess there is some reason, because set-associative caches with a
power-of-2 number of ways are quite common.

Hmm, maybe the reason is marketing; if the Athlon 64 had 64*9=576KB of
L2 cache, nobody would be able to remember that :-).
Post by Patrick Schaaf
So you could have a 256k*7-way
cache, for a total of 1.75 MB. Some googling finds this page,
http://www.arcade-eu.info/overview/2005/ev7.html
The EV7 is an Alpha, not a PA-RISC CPU.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Vikas Mishra
2005-08-16 10:59:27 UTC
Permalink
Post by Patrick Schaaf
Post by Vikas Mishra
We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.
Not necessarily, I would think, except for direct-mapped caches.
Way selection is done with an associative tag structure, and I
don't see any reason why that would be more efficient when done
with a power-of-two number of ways. So you could have a 256k*7-way
cache, for a total of 1.75 MB. Some googling finds this page,
http://www.arcade-eu.info/overview/2005/ev7.html
Patrick thanks. So is there a specific restriction that you are aware of for
a Direct Mapped Cache ? Actually I should have been more specific in my
question - I was interested more in a direct mapped cache. My apologies for
the same.

Regards,
Vikas
John Mashey
2005-08-16 18:04:38 UTC
Permalink
Post by Vikas Mishra
Post by Patrick Schaaf
Not necessarily, I would think, except for direct-mapped caches.
Way selection is done with an associative tag structure, and I
don't see any reason why that would be more efficient when done
with a power-of-two number of ways. So you could have a 256k*7-way
cache, for a total of 1.75 MB. Some googling finds this page,
Patrick thanks. So is there a specific restriction that you are aware of for
a Direct Mapped Cache ? Actually I should have been more specific in my
question - I was interested more in a direct mapped cache. My apologies for
the same.
1) Maybe someone has implemented a non-power-of-2 direct-mapped cache,
but I've never heard of one. I've seen designs where there was some
simple [low-gate-delay] hashing into a power-of-2 cache, but that was
for performance (avoidance of hot spots, especially in virtual caches),
not to allow a non-power-of-2 cache. Addressing bits cost, whether for
registers, cachelines, or memory addresses, so people try not to waste
them.

Of course, I can't think of any real designer who would waste a second
of thought about putting a DIV/MOD in the address path. [Google
comp.arch: mashey powers 2 1994].

2) Somebody said, it doesn't really cost much to have non-power-of-2
set-associative caches, and there is a very good reason to do this
sometimes.

In real-world chip design, things like floorplans, wire lengths, gate
delays, design costs, time-to-market, etc, etc, actually matter.

Suppose you are considering a design with the usual collection of
units, and the data-cache is 2-way. You may well have filled the die.
On the other hand, you might be pad/bump limited, i.e., in order to get
the number of power/grounds and signals you need, there is a minimum
size for the chip, and there may be extra space left. What do you do
with the space?

Suppose there is not enough space to double the cache size, but there
is enough space to do 3-way. That can be a very attractive
alternative, as it is much easier to design than, say, redesigning the
pipeline, adding functional units, etc. Alternatively, if you had
planned an 8-way cache, and misestimated space on other units, it might
just be easier to go to 7-way, assuming that the layout let you make
use of the freed-up die space [sometimes it does, sometimes it
doesn't.]

Such flexibility is often quite useful.

In the MIPS R2000, we had a 64-element TLB, with a 6-bit index, but
(unlike the size of a direct-mapped cache), there was nothing magic
about 64. It could have been 63 or 60 with minimal bother, and that
flexibility was valuable, because for a while, it didn't look like 64
would fit. Had the design been one that required power-of-2, I would
have been very nervous that we'd have been forced to go to 32.
Terje Mathisen
2005-08-16 10:49:34 UTC
Permalink
Post by Vikas Mishra
Hi Folks,
We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.
But is there any other real reason besides this why "non power of 2
caches" can't be used? I seem to remember having read somewhere that
caches can't be non power of two, but I don't remember where. I have
searched Hennessy and Patterson ("A Quantitative Approach") but
wasn't able to find anything specific which said whether this is
possible or not.
Besides the quite common multi-way caches, where "multi" isn't a power
of two, but the size of each way is, it is definitely possible to have
strange sizes; it is just _very_ hard to make them fast enough!

It should be easier to have non-power-of-two memory banks, since RAM is
a _lot_ slower than cache (presumably :-) anyway, and even before
multi-level caches became common there was a mainframe with something
like 17-way main memory. (Honeywell?)

If you are going to do something like this (i.e. having a prime number
of banks), then it is probably a good idea to select a prime of the form
2^n-1 or 2^n+1, since you can do div/mod operations on these a lot
faster than on a random modulus.

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Nick Maclaren
2005-08-16 11:05:15 UTC
Permalink
In article <ddsgb9$4em$***@osl016lin.hda.hydro.com>,
Terje Mathisen <***@hda.hydro.com> writes:
|>
|> Besides the quite common multi-way caches, where "multi" isn't a power
|> of two, but the size of each way is, it is definitely possible to have
|> strange sizes; it is just _very_ hard to make them fast enough!

Sorry, Terje, but you are suffering from a lack of imagination!
It is hard to make level 1 data caches fast enough, but that is
about all.

|> If you are going to do something like this (i.e. having a prime number
|> of banks), then it is probably a good idea to select a prime of the form
|> 2^n-1 or 2^n+1, since you can do div/mod operations on these a lot
|> faster than on a random modulus.

Nope. That is true only if you use the current, ghastly, designs.
Such a cache structure is undesirable on performance grounds, even
with a power of two, because it is really nasty when your array
sizes match the cache size.

What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.


Regards,
Nick Maclaren.
Torben Ægidius Mogensen
2005-08-16 11:13:20 UTC
Permalink
Post by Nick Maclaren
What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.
Yes, but if you start with a power of two address range and hash down
to a range that isn't a power of two, some results will occur more
often than others. If the range of addresses is much larger than the
range of hash values, this won't be significant, though.

What matters more is that you are probably going to use binary to
represent the hash values, and if the range isn't a power of two, you
won't use these bits optimally. Again, if the range is only slightly
smaller than a power of two (say, 2^n-1), it won't matter much.

Torben
Nick Maclaren
2005-08-16 11:21:30 UTC
Permalink
In article <***@tyr.diku.dk>,
***@tyr.diku.dk (=?iso-8859-1?q?Torben_=C6gidius_Mogensen?=) writes:
|>
|> > What you should do is to hash the address - as this need not be a
|> > cryptographic hash, doesn't have to be perfect, and doesn't have
|> > to be done in software, it can be fast. Such a hash can produce
|> > values in any range.
|>
|> Yes, but if you start with a power of two address range and hash down
|> to a range that isn't a power of two, some results will occur more
|> often than others. If the range of addresses is much larger than the
|> range of hash values, this won't be significant, though.
|>
|> What matters more is that you are probably going to use binary to
|> represent the hash values, and if the range isn't a power of two, you
|> won't use these bits optimally. Again, if the range is only slightly
|> smaller than a power of two (say, 2^n-1), it won't matter much.

Well, yes, but in what sense is taking a number modulo a fixed
range not a hashing function? All of those statements apply to
that as well, as they do when using conventional cache lookup
on a system where physical address ranges are not from 2^K
to 2^(K+L) with K > L.

The point is that you are free to choose a hash function to be
fast as well as adequately uniform, and it isn't hard once you
get beyond the L1 data path.


Regards,
Nick Maclaren.
Dan Koren
2005-08-16 14:02:24 UTC
Permalink
Post by Nick Maclaren
|>
|> Besides the quite common multi-way caches, where "multi" isn't a power
|> of two, but the size of each way is, it is definitely possible to have
|> strange sizes; it is just _very_ hard to make them fast enough!
Sorry, Terje, but you are suffering from a lack of imagination!
It is hard to make level 1 data caches fast enough, but that is
about all.
|> If you are going to do something like this (i.e. having a prime number
|> of banks), then it is probably a good idea to select a prime of the form
|> 2^n-1 or 2^n+1, since you can do div/mod operations on these a lot
|> faster than on a random modulus.
Nope. That is true only if you use the current, ghastly, designs.
Such a cache structure is undesirable on performance grounds, even
with a power of two, because it is really nasty when your array
sizes match the cache size.
What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.
Hashing is becoming a lost art,
just like everything else of
importance in computing ;-)



dk
Chris Colohan
2005-08-16 14:52:04 UTC
Permalink
Post by Dan Koren
Hashing is becoming a lost art,
just like everything else of
importance in computing ;-)
Ever used Perl? Everything is a scalar or a hash... (You just don't
get to choose the hash function used.)

:-)

Chris
--
Chris Colohan Email: ***@colohan.ca PGP: finger ***@cs.cmu.edu
Web: www.colohan.com Phone: (412)268-4751
b***@burtleburtle.net
2005-08-17 17:40:50 UTC
Permalink
Post by Dan Koren
Hashing is becoming a lost art,
just like everything else of
importance in computing ;-)
That's not quite a haiku. Too many syllables.
Rich Walker
2005-08-17 23:00:39 UTC
Permalink
Post by b***@burtleburtle.net
Post by Dan Koren
Hashing is becoming a lost art,
just like everything else of
importance in computing ;-)
That's not quite a haiku. Too many syllables.
All useful papers
Are in closed journals.
Much waste.


cheers, Rich.
--
rich walker | Shadow Robot Company | ***@shadow.org.uk
technical director 251 Liverpool Road |
need a Hand? London N1 1LX | +UK 20 7700 2487
www.shadow.org.uk/products/newhand.shtml
Terje Mathisen
2005-08-16 21:02:54 UTC
Permalink
Post by Nick Maclaren
|> If you are going to do something like this (i.e. having a prime number
|> of banks), then it is probably a good idea to select a prime of the form
|> 2^n-1 or 2^n+1, since you can do div/mod operations on these a lot
|> faster than on a random modulus.
Nope. That is true only if you use the current, ghastly, designs.
Such a cache structure is undesirable on performance grounds, even
with a power of two, because it is really nasty when your array
sizes match the cache size.
I agree. Prime bases are better though.
Post by Nick Maclaren
What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.
But that is what we're currently doing!

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Nick Maclaren
2005-08-17 08:29:57 UTC
Permalink
Post by Terje Mathisen
Post by Nick Maclaren
What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.
But that is what we're currently doing!
All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)
Er, yes. Sorry about being massively unclear. I made a hash of using
the term hash :-(

What I meant was using a slightly scrambled hash - e.g. consider the
following to reduce 64 bits to 16:

( N ^ (N>>1) ^ (N>>4) ^ (N>>11) ^ (N>>26) ^ (N>>57) ) & 0xffff

This is dead easy to calculate fast in hardware, and would remove
many of the "gotchas". It was just picked up off the top of my head,
and there are doubtless better choices.

I don't know how easy it is to take modulo N for a fixed N in
hardware - the fact that doing it for a variable N is a pain proves
little. As I have posted before, x%N for fixed N can be done a LOT
faster for fixed but arbitrary N (i.e. ignoring setup costs) than
for variable N in software, but still not fast enough for this use.


Regards,
Nick Maclaren.
Terje Mathisen
2005-08-17 11:37:53 UTC
Permalink
Post by Nick Maclaren
What I meant was using a slightly scrambled hash - e.g. consider the
( N ^ (N>>1) ^ (N>>4) ^ (N>>11) ^ (N>>26) ^ (N>>57) ) & 0xffff
This is dead easy to calculate fast in hardware, and would remove
many of the "gotchas". It was just picked up off the top of my head,
and there are doubtless better choices.
I don't know how easy it is to take modulo N for a fixed N in
hardware - the fact that doing it for a variable N is a pain proves
little. As I have posted before, x%N for fixed N can be done a LOT
faster for fixed but arbitrary N (i.e. ignoring setup costs) than
for variable N in software, but still not fast enough for this use.
Fixed N, where N is of the form (2^n) +/- 1, can be done more or less (*)
in the same time as your example above; the code is similar.

Terje

(*) A little more, because you introduce a small adder in the path.
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Nick Maclaren
2005-08-17 11:45:50 UTC
Permalink
In article <ddv7hu$is1$***@osl016lin.hda.hydro.com>,
Terje Mathisen <***@hda.hydro.com> writes:
|> >
|> > I don't know how easy it is to take modulo N for a fixed N in
|> > hardware - the fact that doing it for a variable N is a pain proves
|> > little. As I have posted before, x%N for fixed N can be done a LOT
|> > faster for fixed but arbitrary N (i.e. ignoring setup costs) than
|> > for variable N in software, but still not fast enough for this use.
|>
|> Fixed N, where N is of the form (2^n) +/- 1, can be done more or less (*)
|> in the same time as your example above, the code is similar.

Oh, yes, but the problem with numbers like that is that they aren't
rare as array sizes in programs. They may not be AS common as
powers of two, but 2^n+1 in particular is definitely common. It
falls naturally out of certain FFT designs. 2^n-1 is rarer, but
not rare.

The advantage of a non-modulo approach is that it doesn't have
the catastrophic failure mode when a critical array size matches
the cache size. All modulo approaches do, but array sizes tend
to follow certain patterns.


Regards,
Nick Maclaren.
David Kanter
2005-08-17 15:39:38 UTC
Permalink
Post by Nick Maclaren
|> >
|> > I don't know how easy it is to take modulo N for a fixed N in
|> > hardware - the fact that doing it for a variable N is a pain proves
|> > little. As I have posted before, x%N for fixed N can be done a LOT
|> > faster for fixed but arbitrary N (i.e. ignoring setup costs) than
|> > for variable N in software, but still not fast enough for this use.
|>
|> Fixed N, where N is of the form (2^n) +/- 1, can be done more or less (*)
|> in the same time as your example above, the code is similar.
Oh, yes, but the problem with numbers like that is that they aren't
rare as array sizes in programs. They may not be AS common as
powers of two, but 2^n+1 in particular is definitely common. It
falls naturally out of certain FFT designs. 2^n-1 is rarer, but
not rare.
The advantage of a non-modulo approach is that it doesn't have
the catastrophic failure mode when a critical array size matches
the cache size. All modulo approaches do, but array sizes tend
to follow certain patterns.
What exactly is the problem with having your array size match the cache
size? Sorry, ISTR something like this mentioned in H&P:aQa, I just
don't recall the exact problem.

David
Nick Maclaren
2005-08-17 15:48:16 UTC
Permalink
In article <***@g47g2000cwa.googlegroups.com>,
"David Kanter" <***@gmail.com> writes:
|>
|> What exactly is the problem with having your array size match the cache
|> size? Sorry, ISTR something like this mentioned in H&P:aQa, I just
|> don't recall the exact problem.

With a fully-associative cache, there is little problem. With
one with a bounded associativity, there is a major problem when
the size of one dimension of a matrix is a multiple of the
cache size. Try writing the naive matrix multiply on square
matrices and see what happens when that is so.

Essentially, every time round the inner loop (N^3 of them) means
several cache misses, including a write of a dirty line.


Regards,
Nick Maclaren.
Terje Mathisen
2005-08-18 18:31:01 UTC
Permalink
Post by David Kanter
Post by Nick Maclaren
|> >
|> > I don't know how easy it is to take modulo N for a fixed N in
|> > hardware - the fact that doing it for a variable N is a pain proves
|> > little. As I have posted before, x%N for fixed N can be done a LOT
|> > faster for fixed but arbitrary N (i.e. ignoring setup costs) than
|> > for variable N in software, but still not fast enough for this use.
|>
|> Fixed N, where N is of the form (2^n) +/- 1, can be done more or less (*)
|> in the same time as your example above, the code is similar.
Oh, yes, but the problem with numbers like that is that they aren't
rare as array sizes in programs. They may not be AS common as
powers of two, but 2^n+1 in particular is definitely common. It
falls naturally out of certain FFT designs. 2^n-1 is rarer, but
not rare.
The advantage of a non-modulo approach is that it doesn't have
the catastrophic failure mode when a critical array size matches
the cache size. All modulo approaches do, but array sizes tend
to follow certain patterns.
What exactly is the problem with having your array size match the cache
size? Sorry, ISTR something like this mentioned in H&P:aQa, I just
don't recall the exact problem.
Nothing at all, IF you always stride through all (or most) of the entries
in order. It is particularly when you work with matrices and step along
them in the wrong direction that caches blow up.

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Sander Vesik
2005-08-17 17:18:58 UTC
Permalink
Post by Terje Mathisen
All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)
using mod for the hash is probably the fundamentally wrong way to
go about it, though.
Post by Terje Mathisen
Terje
--
Sander

+++ Out of cheese error +++
Anton Ertl
2005-08-17 17:45:01 UTC
Permalink
Post by Sander Vesik
Post by Terje Mathisen
All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)
using mod for the hash is probably the fundamentally wrong way to
go about it, though.
Why?

Apart from being simple to implement, the mod approach has the
following benefits:

- Its performance effects (in particular, the worst-case performance)
are relatively easy to understand, making it easy to avoid them.

- Spatial locality performs well with it. If I have, e.g., a 4KB way
size, then I know that I can access a 4KB array without incurring any
conflict misses. If, OTOH, I had a random-style hash function, even
two cache lines that are adjacent in virtual address space could
conflict with each other. BTW, that's one of the benefits of using
page colouring over arbitrary page mapping: It performs better in the
presence of spatial locality. Of course, with enough associativity in
the cache, it does not matter.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
***@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Nick Maclaren
2005-08-17 19:07:08 UTC
Permalink
Post by Anton Ertl
Post by Sander Vesik
Post by Terje Mathisen
All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)
using mod for the hash is probably fundamentaly wrong way to
go about it though.
Why?
Apart from being simple to implement, the mod approach has the
- Its performance effects (in particular, the worst-case performance)
are relatively easy to understand, making it easy to avoid them.
Oh, really? Do tell.

As with strength reduction and integer multiplication, 90% of the
cases are easy to feasible. The remaining 10% are difficult to
impossible (and, yes, I have seen the latter). Perhaps the most
common example of a damn-near impossible problem with cache sizes
is a 1-D FFT of arbitrary and uncontrollable size, but there are
others.
Post by Anton Ertl
- Spatial locality performs well with it. If I have, e.g., a 4KB way
size, then I know that I can access a 4KB array without incurring any
conflict misses. ...
Provided that your program uses no other data, not even workspace
generated by the compiler. Please come back slightly closer to
reality.



Regards,
Nick Maclaren.
Eugene Miya
2005-08-18 00:18:55 UTC
Permalink
Post by Terje Mathisen
It should be easier to have non-power-of-two memory banks, since RAM is
a _lot_ slower than cache (presumably :-) anyway, and even before
multi-level caches became common there was a mainframe with something
like 17-way main memory. (Honeywell?)
The BSP.
Different firm.

--
Eugene Miya
2005-09-15 17:03:46 UTC
Permalink
Post by Terje Mathisen
multi-level caches became common there was a mainframe with something
like 17-way main memory. (Honeywell?)
Burroughs. I know the guy who did this (lives in Palo Alto).

--
Eugene Miya
2005-09-16 16:34:41 UTC
Permalink
Post by Eugene Miya
Post by Terje Mathisen
multi-level caches became common there was a mainframe with something
like 17-way main memory. (Honeywell?)
Burroughs. I know the guy who did this (lives in Palo Alto).
BTW, Terje, if we ever have another ca-fest, I will see about getting Steve
to come, and you can talk to him about 17-way memories. And depending on
where in the bay area it gets held, I'll get Gordon Bell to join us.
Additionally in 2 weeks you are missing Gordon Moore talking locally.

--
Terje Mathisen
2005-09-17 16:48:29 UTC
Permalink
Post by Eugene Miya
Post by Eugene Miya
Post by Terje Mathisen
multi-level caches became common there was a mainframe with something
like 17-way main memory. (Honeywell?)
Burroughs. I know the guy who did this (lives in Palo Alto).
BTW, Terje, if we ever have another ca-fest, I will see about getting Steve
to come, and you can talk to him about 17-way memories. And depending on
where in the bay area it gets held, I'll get Gordon Bell to join us.
Additionally in 2 weeks you are missing Gordon Moore talking locally.
Don't keep reminding me that most of my real heroes are located 9
timezones away. Except for this, Oslo, Norway is a really great place to
live. :-)

Anyway, I've already marked my calendar for the week after Easter,
monday or tuesday would be perfect.

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Eugene Miya
2005-09-19 22:15:56 UTC
Permalink
Post by Terje Mathisen
Post by Eugene Miya
Post by Eugene Miya
Post by Terje Mathisen
like 17-way main memory. (Honeywell?)
Burroughs.
BTW, Terje, if we ever have another ca-fest, I will see about getting Steve
to come, and you can talk to him about 17-way memories. And depending on
where in the bay area it gets held, I'll get Gordon Bell to join us.
Additionally in 2 weeks you are missing Gordon Moore talking locally.
Don't keep reminding me that most of my real heroes are located 9
timezones away.
Last Tuesday I had dinner with Gio Wiederhold and another ex-DARPA
program manager (Oscar). Oscar remarked how amazing living near
Stanford is. Although Oscar spent the last year having nothing to do
with work technologies (he was on what we call a grand jury), he was
amazed at how people and firms think.

I saw efforts at over half a dozen other universities fail.
This is not to say that every Stanford effort succeeds or gets blended
into Google.
Post by Terje Mathisen
Except for this, Oslo, Norway is a really great place to live. :-)
Oh, I have no doubt. Erik Fair, who wrote a version of NNTP, is from
Norway. We didn't take you to the Norway store in Oakland.
I will get to Norway one of these days, but I also want to go much
further North as well.
Post by Terje Mathisen
Anyway, I've already marked my calendar for the week after Easter,
monday or tuesday would be perfect.
I don't know yet if I have time to attend Asilomar.

No one here has come forward to run the next ca-fest.

--
Terje Mathisen
2005-09-20 07:41:25 UTC
Permalink
Post by Eugene Miya
Oh, I have no doubt. Erik Fair who wrote a version of NNTP is from
Norway. We didn't take you to the Norway store in Oakland.
I will get to Norway one of these days, but I also want to go much
further North as well.
Svalbard/Longyearbyen!!!

If you can ever manage a trip to Svalbard, preferably in the March-May
period, I guarantee you'll be amazed. The light at 78+ degrees north is
so spectacular that I really didn't believe it until I experienced it in
March 2004.

Try to imagine the most glorious sunset you've ever seen, only lasting
more or less all day. :-)

As a scientist you might even wangle an invitation to the Ny Ålesund
research station at N79.

Terje
--
- <***@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Eugene Miya
2005-09-21 05:24:20 UTC
Permalink
In article <dgoeel$opp$***@osl016lin.hda.hydro.com>,
Terje Mathisen <***@hda.hydro.com> wrote:
Only time for one follow-up.
Post by Terje Mathisen
NNTP / Norway.
further North as well.
Svalbard/Longyearbyen!!!
Well German friends have worked on Svalbard.
I am awaiting a Greenland bid, but also Northern Norway and Iceland.
Post by Terje Mathisen
If you can ever manage a trip to Svalbard, preferably in the March-May
period, I guarantee you'll be amazed. The light at 78+ degrees north is
so spectacular that I really didn't believe it until I experienced it in
March 2004.
Earlier. Easily Feb.
Post by Terje Mathisen
Try to imagine the most glorious sunset you've ever seen, only lasting
more or less all day. :-)
As a scientist you might even wangle an invitation to the Ny Ålesund
research station at N79.
Oh, I have an idea. I have time (months) at 83S.

--

John Savard
2005-08-17 01:20:24 UTC
Permalink
Post by Vikas Mishra
But is there any other real reason besides this why "non power of 2
caches" can't be used ?
The obvious reason is that you are partly wasting address decoding
circuitry for the cache.

But that hasn't stopped Intel from having 3 MB caches for some of its
Itanium chips. Maybe it's really a 4 MB cache of which part is not used,
so as to get the yield tolerable for such huge dies...

John Savard
http://www.quadibloc.com/index.html
John Mashey
2005-08-17 04:22:33 UTC
Permalink
Post by John Savard
Post by Vikas Mishra
But is there any other real reason besides this why "non power of 2
caches" can't be used ?
The obvious reason is that you are partly wasting address decoding
circuitry for the cache.
But that hasn't stopped Intel from having 3 MB caches for some of its
Itanium chips. Maybe it's really a 4 MB cache of which part is not used,
so as to get the yield tolerable for such huge dies...
Sigh.

1) *No one* builds a die with 4MB of cache and only uses 3MB. Any
designer who proposed that would be fired. There are different ways to
do this, of which some have been used since the mid-1990s or earlier in
microprocessors, but they involve adding just a modest amount of
extra silicon to the cache to improve yield.

2) I lose track of the various versions, but in Itanium 2-land, L3
caches:
3MB: 12-way
6MB: 24-way
9MB: 18-way

The L1 caches were 16KB (4-way) and L2 was 256KB (8-way).
a) The L3 is a big chunk of the chip, of quite irregular shape, and it
is organized as 140 "tiles", of which:
128 are for data
8 for ECC
4 for yield redundancy

For the gory details see:
Rusu, Muljono, Cherkauer, "Itanium 2 Processor 6M: Higher Frequency
and Larger L3 Cache," IEEE Micro Vol 24, No 2 (March/April 2004), 10-19
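[A sketch of why the 3MB figure above causes no indexing trouble: the total size is not a power of two, but size = sets x ways x line size, and the 12-way associativity absorbs the factor of three, leaving a power-of-2 set count and a trivial shift-and-mask index. Line size and set count below are derived from the cited 3MB/12-way figures assuming a 128-byte line; treat the exact parameters as illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* Non-power-of-2 total size (3 MB) with a power-of-2 set count.
   3 MB / (12 ways * 128 B lines) = 2048 sets = 2^11, so the set
   index is still a cheap shift-and-mask, no modulus needed. */
#define LINE_BYTES 128u
#define NUM_WAYS   12u
#define NUM_SETS   2048u   /* power of 2; the factor of 3 lives in NUM_WAYS */

static uint32_t set_index(uint64_t addr)
{
    /* 2048 is a power of 2, so '%' compiles down to a mask */
    return (uint32_t)((addr / LINE_BYTES) % NUM_SETS);
}
```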
Anne & Lynn Wheeler
2005-08-17 15:08:02 UTC
Permalink
Post by John Mashey
Sigh.
1) *No one* builds a die with 4MB of cache and only uses 3MB. Any
designer who proposed that would be fired. There are different ways to
do this, of which some have been used since the mid-1990s or earlier in
microprocessors, but they involve adding a just a modest amount of
extra silicon to the cache to improve yield.
walk down memory lane ...

big change from 370/168-1 to 370/168-3 was to double the cache size
from 32kbytes to 64kbytes ... however they used the 2k bit of the
address to index the additional entries.

the problem was that this worked when the software was running 4k
virtual pages ... but if the software was running with 2k virtual
pages, it would only use 32kbytes.

also, if you had software that switched between 2k virtual pages and
4k virtual pages ... the cache was flushed at the switch.

the dos/vs and vs1 operating systems ran 2k virtual pages, mvs ran 4k
virtual pages. vm for "real" address virtual machines ran 4k virtual
pages by default ... but if they were running guest operating systems
that used virtual memory ... the shadow pages were whatever the guest
was using.

there was some large customer running vs1 production work under vm on
a 168-1 and installed the 168-3 upgrade (32k->64k cache) expecting
performance to get better ... but things got much worse. first, since
vs1 was running 2k virtual pages ... all of the vm shadow tables for
vs1 were 2k virtual pages ... which meant the cache only worked with
32kbytes. the other was that the vm supervisor reloaded 4k virtual
page mode by default ... so there was lots of cache flush overhead
constantly switching back and forth between 2k and 4k virtual page
modes.

the 370 had purely 24bit/16mbyte virtual address spaces. the 168s used
the 8mbyte bit in TLB entry selection. MVS (& svs) had the 16mbyte space
mapped out so that there was an 8mbyte MVS kernel image in each of its
virtual address spaces. That left only 8mbytes (in each virtual
address space) for running applications (although as system functions
grew ... there were some environments where only 3mbytes in each
virtual address space were left for application execution).

the 168 had a 128 entry TLB with a seven entry "STO" stack (segment
table origin, basically uniquely identifying each virtual address
space).

For basic MVS, half the TLB entries went to the kernel image pages and
half the TLB entries went to application pages.

however, a lot of virtual address space environments running under vm
tended to start virtual address space use at zero and grow upwards; a
large number of applications rarely ever exceeded a couple mbytes ...
so frequently at least half the TLB entries (for virtual addresses
above 8mbytes) went unused.
--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Joe Pfeiffer
2005-08-17 05:29:19 UTC
Permalink
Post by Vikas Mishra
Hi Folks,
We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.
But once you've hit this, the question becomes one of what compelling
reason do you have to *not* have a power of 2?
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr
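[To make the cost Vikas asked about concrete, here is a minimal sketch contrasting the two index calculations for a direct-mapped cache: a power-of-2 line count reduces to a shift and mask, while the hypothetical 40KB (640 x 64B lines) case needs a true modulus, which in hardware means a divider, a multiply-by-reciprocal, or a lookup. The 64B line size and 40KB/64KB sizes are illustrative assumptions, not from any particular design.]

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed line size */

/* 64 KB direct-mapped: 1024 lines, a power of 2 -> shift + mask */
static uint32_t index_pow2(uint64_t addr)
{
    return (uint32_t)((addr >> 6) & (1024u - 1u));
}

/* 40 KB direct-mapped: 640 lines -> a genuine modulus is required */
static uint32_t index_npow2(uint64_t addr)
{
    return (uint32_t)((addr / LINE_BYTES) % 640u);
}
```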
Iain McClatchie
2005-08-19 19:04:05 UTC
Permalink
Joseph> But once you've hit this, the question becomes one of what
Joseph> compelling reason do you have to *not* have a power of 2?

There is more than one way to do non-power-of-two.

- There is an advantage to having fixed instruction sizes -- you
can do without a queue between fetch and decode.
- There is a hit rate advantage to smaller instructions, witness
Tensilica's 24-bit instructions.
- Fixed-size, non-power-of-two-sized instructions require one of
two solutions that I know of:

1) A rotator mux after fetch. This can get pretty wide, with a
lot of wires. It is, of course, important not to get overly
impressed by expensive datapath bits that are easy to analyze,
because they're usually not a big deal compared to the expensive
control bits that surprise you late in the design.

2) Fixed-pattern packing. Itanium does it, to put 3 instructions
into 128 bits. I like the idea of 5 instructions in 128 bits,
myself.

Then finally there is the question of addressing these instructions.
My guess is that it would be fine to avoid any complicated
addressing logic, and simply make the 5th instruction in 128 bits
unaddressable.

Nick is going to jump in here and ask how we take an exception or
interrupt on that last instruction. I have a plan for that:
Remember the atomic basic blocks? They're back. If you take an
exception anywhere within the basic block, the entire block fails
and you restart the whole thing. If you don't like that
granularity, then you can restart on 128b boundaries.
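[The fixed-pattern packing idea above can be sketched as follows: five 25-bit slots in a 128-bit bundle (125 bits used, 3 spare). The slot width, layout, and extraction scheme are invented for illustration; no shipped ISA is being described.]

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit bundle held as two 64-bit halves, packing five 25-bit
   instruction slots at bit offsets 0, 25, 50, 75, 100. */
typedef struct { uint64_t lo, hi; } bundle_t;

static uint32_t get_slot(bundle_t b, unsigned slot)  /* slot 0..4 */
{
    unsigned bit = slot * 25u;
    /* start from whichever half holds the slot's low bit */
    uint64_t word = (bit < 64) ? (b.lo >> bit) : (b.hi >> (bit - 64));
    if (bit < 64 && bit + 25 > 64)      /* slot straddles the halves */
        word |= b.hi << (64 - bit);
    return (uint32_t)(word & ((1u << 25) - 1u));
}
```

Making the fifth slot "unaddressable" then just means branch targets encode only bundle address plus slot 0-3; slot 4 executes only by falling through within the bundle.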
Seongbae Park
2005-08-17 17:08:26 UTC
Permalink
Vikas Mishra <***@gmail.com> wrote:
...
Post by Vikas Mishra
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.
One of the most recent examples: Sun's Niagara with 12-way 4-banked 3MB L2:

Niagara: A 32-Way Multithreaded Sparc Processor
Kongetira, P.; Aingaran, K.; Olukotun, K.;
Micro, IEEE
Volume 25, Issue 2, March-April 2005 Page(s):21 - 29

There are good reasons for this seemingly odd choice.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"