Understanding O(1) vs O(n) Time Complexity Intuitively - arrays

I understand that O(1) is constant-time, which means that the operation does not depend on the input size, and O(n) is linear time, which means that the operation changes linearly with input size.
If I had an algorithm that could simply go directly to an array index rather than going through each index one-by-one to find the required one, that would be considered constant-time rather than linear-time, right? This is what the textbooks say. But, intuitively, I don't understand how a computer could work this way: Wouldn't the computer still need to go through each index one-by-one, from 0 to (potentially) n, in order to find the specific index given to it? But then, is this not the same as what a linear-time algorithm does?
Edit
My response to ElKamina's answer elaborates on how my confusion extends to hardware:
But wouldn't the computer have to check where it is on its journey to
the index? For instance, if it's trying to find index 3, "I'm at index
0 so I need address 0 + 3", "Ok, now I'm at address 1, so I have to
move forward 2", "Ok, now I'm at address 2, so I have to move forward
1", "Ok, now I'm at index 3". Isn't those the same thing as what
linear-time algorithms do? How can the computer not do it
sequentially?

Theory
Imagine you have an array which stores events in the order they happened. If each event takes the same amount of space in a computer's memory, you know where that array begins, and you know what number event you're interested in, then you can precalculate the location of each event.
Imagine you want to store records and key them by telephone numbers. Since there are many numbers, you can calculate a hash of each one. The simplest hash you might apply is to treat the telephone number like a regular number and take it modulo the length of the array you'd like to store the records in. Again, you can assume each record takes the same amount of space, you know the number of records, you know where the array begins, and you know the offset of the record of interest. From these, you can precalculate the location of each record.
If array items have different sizes, then instead fill the array with pointers to the actual items. Your lookup then has two stages: find the appropriate array element and then follow it to the item in question.
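To make that concrete, here is a small C++ sketch of the two ideas above (all sizes, values, and names are invented for illustration): fixed-size records are located by pure arithmetic from the array base, and a telephone number is hashed with a simple modulus to pick a slot.

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Event { std::uint32_t kind; std::uint32_t payload; };  // fixed 8-byte record

int main() {
    Event events[1000] = {};                    // the array base address is known
    std::size_t i = 42;                         // the event number we want
    // The compiler performs exactly this arithmetic for events[i]:
    Event* p = reinterpret_cast<Event*>(
        reinterpret_cast<char*>(events) + i * sizeof(Event));
    std::printf("event %zu lives at %p (same as %p)\n",
                i, static_cast<void*>(p), static_cast<void*>(&events[i]));

    // Simplest possible hash for a telephone number: the number modulo the table length.
    const unsigned long long phone = 5551234567ULL;
    const std::size_t table_len = 1000;
    std::size_t slot = phone % table_len;       // a slot in 0..999, computed directly
    std::printf("phone %llu -> slot %zu\n", phone, slot);
}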
Much like we can use shmancy GPS systems to tell us where an address is, but we still need to do the work of driving there, the problem with accessing memory is not knowing where an item is, it's getting there.
Answer to your question
With this in mind, the answer to your question is that look-up is almost never free, but it also is rarely O(N).
Tape memory: O(N)
Tape memory requires O(N) seeks, for obvious reasons: you have to spool and unspool the tape to position it to the needed location. It's slow. It's also cheap and reliable, so it's still in use today in long-term back-up systems. Special algorithms which account for the physical nature of the tape can speed up operations on it slightly.
Notice that, per the foregoing, the problem with tape is not that we don't know where the thing is we're trying to find. The problem is getting the physical medium to get there. The nature of a good tape algorithm is to try to minimize the total amount of tape spooled and unspooled over a grouping of operations.
Speaking of which, what if, instead of having one long tape, we had two shorter tapes: this would reduce the point-to-point travel time. What if we had four tapes?
Disk memory: O(N), but smaller
Hard drives make a huge reduction in seek time by turning the tape into a series of rings. Now, even though there are N memory spaces on a disk, any one can be accessed in short order by moving the drive head and the disk to the appropriate point. (Figuring out how to express this in big-oh notation is a challenge.)
Again, if you use faster disks or smaller disks, you can optimize performance.
RAM: O(1), but with caveats
Pretty much everyone who answers this question is going to fixate on RAM, since that's what programmers work with most frequently. Look to their answers for fuller explanations.
But, just briefly, RAM is a natural extension of the ideas developed above. The RAM holds N items and we know where the item we want is. However, this time there's nothing that needs to mechanically move in order for us to get to that item. In addition, we saw that by having more short tapes or smaller, faster drives, we could get to the memory we wanted faster. RAM takes this idea to its extreme.
For practical purposes, you can think of RAM as being a collection of little memory stores, all strung together. Your computer doesn't know exactly where in RAM a particular item is, just the collection it belongs to. So it grabs the whole collection, consisting of thousands or millions of bytes. It stashes this in something like an L3 cache.
But where is a particular item in that cache? Again, you can think of the computer as not really knowing: it just grabs a subset which is guaranteed to include the item and passes it to the L2 cache.
And again, for the L1 cache.
And, at this point, we've gone from gigabytes (or terabytes) of RAM to something like 3-30 kilobytes. And, at this level, your computer (finally) knows exactly where the item is and grabs it for processing.
This kind of hierarchical behavior means that accessing adjacent items in RAM is much faster than randomly accessing different points all across RAM. That was also true of tape drives and hard disks.
However, unlike tape drives and hard disks, the worst-case time where all the caches are missed is not dependent on the amount of memory (or, at least, is very weakly dependent: path lengths, speed of light, &c)! For this reason, you can treat it as an O(1) operation in the size of the memory.
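A quick way to see this hierarchy at work is to time the same number of reads done sequentially versus scattered across a large array. This is only a rough sketch, not a rigorous benchmark; the 64 MB size and the odd multiplier used to scramble the access order are arbitrary choices.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1u << 24;              // 16M ints (64 MB), larger than any cache
    std::vector<int> data(n, 1);

    // Sequential pass: each cache line fetched from RAM serves many adjacent ints.
    auto t0 = std::chrono::steady_clock::now();
    long long sum_seq = 0;
    for (std::size_t i = 0; i < n; ++i) sum_seq += data[i];
    auto t1 = std::chrono::steady_clock::now();

    // Scattered pass: the same n reads, but in an order that lands on a different
    // line almost every time (an odd multiplier modulo a power of two is a bijection).
    long long sum_rnd = 0;
    for (std::size_t i = 0; i < n; ++i) sum_rnd += data[(i * 7919u) & (n - 1)];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("sequential: %lld ms, scattered: %lld ms (sums %lld, %lld)\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count(),
                sum_seq, sum_rnd);
}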
Comparing speeds
Knowing this, we can talk about access speed by looking at Latency Numbers Every Programmer Should Know:
Latency Comparison Numbers
--------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
In more human terms, these look like:
Minute:
L1 cache reference 0.5 s One heart beat (0.5 s)
Branch mispredict 5 s Yawn
L2 cache reference 7 s Long yawn
Mutex lock/unlock 25 s Making a coffee
Hour:
Main memory reference 100 s Brushing your teeth
Compress 1K bytes with Zippy 50 min One episode of a TV show (including ad breaks)
Day:
Send 2K bytes over 1 Gbps network 5.5 hr From lunch to end of work day
Week:
SSD random read 1.7 days A normal weekend
Read 1 MB sequentially from memory 2.9 days A long weekend
Round trip within same datacenter 5.8 days A medium vacation
Read 1 MB sequentially from SSD 11.6 days Waiting for almost 2 weeks for a delivery
Year:
Disk seek 16.5 weeks A semester in university
Read 1 MB sequentially from disk 7.8 months Almost producing a new human being
The above 2 together 1 year
Decade:
Send packet CA->Netherlands->CA 4.8 years Average time it takes to complete a bachelor's degree

Underlying any calculation of time complexity is a cost model. Cost models tend to be oversimplified; for example, we generally talk about the time complexity of sort algorithms in terms of how many elements we have to compare to each other.
The assumption underlying concluding that indexing into an array is O(1) is that of random access memory; that we can access location N by encoding N on the address lines of the memory bus, and the contents of that location come back on the data bus. If memory were sequential access (e.g., accessing off of a magnetic tape), we'd assume a different cost model.

Imagine computer memory as buckets; say you have 10 buckets in front of you.
If someone tells you to pick something up from bucket number 8, you will not first stick your hand into buckets 1 to 7. You simply put your hand directly into bucket 8.
Arrays work the same way: in most languages they map to some form of memory layout. So, for example, if you have a byte array of 10 elements, that will be 10 sequential bytes.
Other types can vary in size, depending on whether the content is a value type/struct or a reference type, in which case the array consists of pointers.

We assume that the memory is "Random Access Memory" (also known as RAM), not tape or disk memory. In RAM you can access any address in constant time. See the corresponding wiki article for more information on how it works.
Also, elements of the array are stored sequentially. Say we want to store integers in Java, which take up 4 bytes each. If we wanted to look for the kth element, we would look directly at location start + 4 * k in memory.
You could implement an array in other ways as well. For example, you could implement the array with a linked list, in which case it would take O(n) time to access an element. But this is not how arrays are implemented typically.
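Here is a rough C++ illustration (rather than Java) of the two layouts contrasted above: a contiguous array, where element k sits at start + k * sizeof(int), versus a linked list, where reaching element k means following k pointers.

#include <cstdio>

struct Node { int value; Node* next; };

int array_kth(const int* start, int k) {
    return start[k];                      // one address computation: start + 4 * k
}

int list_kth(const Node* head, int k) {
    while (k-- > 0) head = head->next;    // k hops: O(n) in the worst case
    return head->value;
}

int main() {
    int a[5] = {10, 20, 30, 40, 50};
    Node n4{50, nullptr}, n3{40, &n4}, n2{30, &n3}, n1{20, &n2}, n0{10, &n1};
    std::printf("%d %d\n", array_kth(a, 3), list_kth(&n0, 3));   // prints: 40 40
}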

No one here has explained (IMO) in sufficient detail why you can access it in O(1) time, so I will try to:
As a note before I do, this is probably trivializing how complex the hardware in the computer has become, but hopefully it's something along the right path. You would cover this in a Computer Organization course that goes into the guts of the hardware.
When you have circuits, the voltage passed through the computer propagates very fast, and the results that come back depends on the pulse of the clock. Take this diagram for example:
https://upload.wikimedia.org/wikipedia/commons/3/3d/Square_array_of_mosfet_cells_read.png
The following is missing parts that you would learn properly from a textbook or course (or online), but omitting those details should still leave you with a high-level overview good enough for a rough idea of how this works:
The address you send as bits will go up the left side of the image, and based on that address, the voltage will be routed to the proper memory cell that has the data you want. Upon receiving the voltage, the cell will then emit the value back down to the bottom (which is also basically instant), and now you've read the 'value stored in memory' since the data you want has arrived. Because of how fast voltage changes propagate through circuits, you get the result almost instantly. This means it does not depend on traversing the elements before it, since you can just go straight to it, which is the idea behind RAM. The bottleneck comes from the clock pulse with the latches, which you will see when you take a computer organization course, along with what we do about it and why.
This is why we consider it doable in O(1) time.
Now an Operating Systems and Computer Organization course would show you all about how this is connected under the hood, why it's way more complex than what I've written (and what might not even be that accurate anymore), but hopefully this gives you an intuition as to why we can do it in constant time.
Since complexity notation hides the constants under the hood (and, from the above, we can assume it is constant time to get to any offset in memory), it makes sense that we can jump to any array offset in O(1) time from a high-level point of view -- which is what complexity analysis aims to give us. This is also why we don't need to traverse every element in memory to get where we want, which, as you said, would be O(n).

Assuming the data structure you are talking about is a vector/array, you can easily reach index 'x' by simply offsetting the pointer you use to iterate over it.
Say you have a vector of struct "A" where A occupies 20 bytes, and you want to get to index 28. If you know the vector starts at memory location 'x', then you simply need to go to x + 28 * 20 bytes and that is your element.
With a data structure like a list, the lookup time will be O(n), since it is not contiguously allocated and you have to jump from pointer to pointer.
With a binary tree it's O(log2(n)) ... etc.
So the answer here is that it depends on your structure. I would recommend reading some books about fundamental data structures; they might help you greatly in gaining a more theoretical understanding of the various concepts you are using.
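A tiny sketch of that offset arithmetic, with a hypothetical 20-byte struct A: element 28 of a contiguous vector that starts at address x lives at x + 28 * sizeof(A).

#include <cstdio>

struct A { char data[20]; };              // exactly 20 bytes for the example

int main() {
    A v[100] = {};
    const char* x = reinterpret_cast<const char*>(v);             // start of the vector
    const A* elem = reinterpret_cast<const A*>(x + 28 * sizeof(A));
    std::printf("&v[28] = %p, x + 28*20 = %p\n",
                static_cast<const void*>(&v[28]), static_cast<const void*>(elem));
}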

Related

Fastest disk based solution to cache trillions of unique md5 hashes

Is there a very low latency disk based caching solution that I can use to store only unique values (NOT key+value)?
My script needs to keep track of which files it has processed so it doesn't redo any work. I need to check the cache to search for the md5 hash of the file, if it doesn't exist, I process the file and add the hash to the cache.
Is there a faster disk based caching solution than using a key-value based solution?
Try LevelDB.
It's a key-value store, but it is very compact because keys are stored sorted and prefix-compressed on disk.
Less space usage => less I/O => better performance.
Not sure about "trillions" (a trillion MD5 hashes would be 16,000 TB), but Bitcoin core as well as Ethereum implementations all use LevelDB.
In your case, there is no need for an "Ordered Key-Value Store". That is, you can rely on plain key-value stores (direct dbm successors):
Good candidates are:
Tokyo Cabinet: it has a hash-based format, which might be faster in your case.
gdbm
In the case where the dataset fits into memory, you might want to try LMDB.
I do not recommend LevelDB because it is slow.
Do the math. 1 trillion MD5s, without any tricks, would take 16TB of disk space. This is, I assume, far more than your RAM size.
Since each MD5 lookup is essentially a 'random' probe into the disk, there will necessarily be about 1 disk hit per check.
If, say, an SSD read is 1ms, that is 1e9 seconds to insert (or check) a trillion hashes. That's 30 years.
There are a lot of flaws in my math, but I think this says that it is not practical today to store and check a trillion of anything random.
If you want to crank it down to a billion MD5s, now we are getting in the range of RAM sizes. But you probably want to have the data persisted? So you really need some database-like tool that will do the persisting for you, while making the checks purely in RAM (CPU-speed).
In any case, I would consider writing code that breaks the MD5 into 2 or 3 chunks, then use the chunks like a directory structure. At the bottom level, you have a variable-length bunch of values for the last chunk. Each is perhaps 8 bytes long. That would need a linear or binary search into a bunch of numbers that are half the size of an MD5. The savings here helps compensate for the various overheads in the rest of the structure, plus the need for writing blocks to disk. Hence, I would still expect needing about 16GB of RAM to house a billion MD5s.
Given that approach, virtually any database engine is already geared up to do most of the work reasonably efficiently. The lowest level would be some type of BLOB containing multiple 8-byte chunks.
Another trick to use... Let's look at just the first 5 bytes of an MD5. There are a trillion different values in 5 bytes. If you have only a billion entries in your dataset, then checking the 5 bytes has a 99.9% chance of correctly saying "the md5 is not in the dataset" versus roughly a 0.1% chance of saying "the md5 might be in the dataset". In the former case, you get a quick answer with only 5GB for a billion items. In the latter case, you may have to go to disk and be slower. Still, the average time is better. This helps with the speed of checking. (But does not address the speed of loading.)
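A rough sketch of that 5-byte-prefix idea: keep only the first 40 bits of each MD5 in RAM. A prefix miss means "definitely not in the dataset"; a hit means "maybe", and only then do you go to disk. (The struct and function names here are made up, the on-disk fallback is omitted, and a real in-memory set adds overhead beyond the raw 5 GB of key data.)

#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Pack the first 5 bytes of a 16-byte MD5 digest into a 64-bit integer.
std::uint64_t prefix40(const unsigned char digest[16]) {
    std::uint64_t p = 0;
    for (int i = 0; i < 5; ++i) p = (p << 8) | digest[i];
    return p;
}

struct Md5PrefixFilter {
    std::unordered_set<std::uint64_t> prefixes;

    void add(const unsigned char digest[16]) { prefixes.insert(prefix40(digest)); }

    // false -> certainly absent (the common case, answered from RAM)
    // true  -> possibly present, fall through to the on-disk check
    bool maybe_contains(const unsigned char digest[16]) const {
        return prefixes.count(prefix40(digest)) != 0;
    }
};

int main() {
    Md5PrefixFilter filter;
    unsigned char d1[16] = {0x01, 0x02, 0x03, 0x04, 0x05};   // rest zero
    unsigned char d2[16] = {0xAA, 0xBB, 0xCC, 0xDD, 0xEE};
    filter.add(d1);
    std::printf("%d %d\n", filter.maybe_contains(d1), filter.maybe_contains(d2));  // 1 0
}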

Using a filesystem in place of a Database to perform quick lookups

I was recently posed a simple question to which I responded with an unusual answer. I suspect my answer was particularly bad but am not certain what the performance characteristics truly would be.
Suppose you are given inputs each in the form of a hash code (just a bunch of bits.) Uniquely corresponding to each hash code is an integer value which you would like to return. Your system knows the most likely queries and caches them in memory. For the remaining, less frequent lookups you will have to access a hard drive (disk I/O.) There exists a least recently used policy to replace the cache in memory but that shouldn't be terribly important here.
For the hashes on disk, the conventional way to store them would be in a Database (keyed on the hashes) in a tree shape. This would grant you O(log(n)) lookup time once at the database stage.
My answer seemed odd to the asker and a little odd to me. Suppose instead of a database, you simply kept the values on disk in a file system with a directory structure that exactly mirrored the bits of the hash values. For instance, if we had three-bit hashes (and only had entries for 100 => 42 and 010 => 314159), your file system would look like:
\0
    \00
        \000
        \100
            42.justanumber
    \10
        \010
            314159.justanumber
        \110
\1
    \01
        \001
        \101
    \11
        \011
        \111
The x.justanumber files are empty. The filenames themselves contain the information you're looking for.
Further assume that updates never occur (the entire DB/file system is re-written weekly.) I'd think that a filesystem set up this way would give you O(1) lookup time instead of the O(log(n)) lookup time of a tree-based DB. Am I missing something? Why would this not be preferable?
I believe I've come to an answer.
If the system always uses a folder for every possible bit combination and you have a 32-bit hash, you will have 2^32 folders at the outer layer and roughly the same number again across all the inner layers, so you'll have about 2^33 folders, each of size 4kB due to page size limitations. Additionally, each "empty" hash file will occupy 4kB on disk for the same reason. So assuming, let's say, 5% occupancy you would have:
2^33 * 4kB + 2^32 * 4kB * 0.05 = 35.22 TB of storage needed.
The savings are O(1) lookup time instead of O(log(# hashes)) lookup time which almost certainly doesn't justify the storage space requirements. (Also remember this is for the uncommon case where you have a memory/cache miss.)
Admittedly, if you only created the minimum number of folders needed to support the 5% occupancy you'd end up with less space needed. The exact number of folders needed would depend on the distribution of the hashes. Assuming a good hash function, this should be close to random, which I believe means we'll end up needing the majority of the inner layers and definitely 5% of the outermost layers of directories. This instead gives something like:
2^32 * 4 kB * 0.05 + 2^32 * 4kB * 0.40 + 2^32 * 4kB * 0.05 = 8.59 TB
...which is still way too much space for such a simple thing. (I made up the 40% figure for the inner folders. If anyone can come up with a rigorous figure there, please comment/answer.)

Is array access always constant-time / O(1)?

From Richard Bird, Pearls of Functional Algorithm Design (2010), page 6:
For a pure functional programmer, an update operation takes logarithmic time in the size of the array. To be fair, procedural programmers also appreciate that constant-time indexing and updating are only possible when the arrays are small.
Under what conditions do arrays have non-constant-time access? Is this related to CPU cache?
Most modern machine architectures try to offer small unit time access to memory.
They fail. Instead we see layers of caches with differing speeds.
The problem is simple: the speed of light. If you need an enormous memory [array] (in the extreme, imagine a memory the size of the Andromeda galaxy), it will take enormous space, and light cannot cross enormous space in short periods of time. Information cannot travel faster than the speed of light. You are screwed by physics from the start.
So the best you can do is build part of memory "nearby" where light takes only fractions of nanosecond to traverse (thus registers and L1 cache), and part of the memory far away (the disk drive). Other practical complications ensue, such as capacitance (think inertia) to slow down access to things further away.
Now, if you are willing to take the access time of your farthest memory element as "unit" time, yes, access to everything takes the same amount of time, e.g., O(1). In practical computing, we treat RAM memory this way most of the time, and we leave out other, slower devices to avoid screwing up our simple model.
Then you discover people that aren't satisfied with that, and voila, you have people optimizing for cache line access. So it may be O(1) in theory, and acts like O(1) for small arrays (that fit in the first level of cache), but it often is not in practice.
An extreme practical case is an array that doesn't fit in main memory; now an array access may cause paging from the disk.
Sometimes we don't care even in that case. Google is essentially a giant cache. We tend to think of Google searches as O(1).
Can oranges be red?
Yes, they can be red due to a number of reasons -
You color them red.
You grow a genetically modified variety.
You grow them on Mars, the red planet, where everything is supposed to look red.
The list of reasons, some practical (given today's technology) and some impractical (fiction, or a future reality), goes on...
Point is, I think the question you are asking is really about two orthogonal concepts. Namely -
Big O Notation - "In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions."
vs
Practicalities (hardware and software) a good software engineer should be aware of, while architecting / designing their app and writing code.
In other words, while the concept of Big O Notation can be called academic, it is the most appropriate way of classifying an algorithm's complexity (time / space).. and that's where it ends. There is no need to muddy the waters with orthogonal concerns.
To be clear, I am not saying that one should not be aware of the under-the-hood implementation details and workings of things, which affect the performance of the software you write.. but there is no point in mixing the two together. For example, does it make sense to say -
Arrays do not have constant time access (with indexes) because -
Large arrays do not fit in CPU cache, and hence incur high cost of cache misses.
On a system under memory pressure, the array, big or small, may have been swapped out from physical memory to hard disk, and is impacted not only by a cache miss but also by a hard page fault.
On a system under extreme CPU load, the thread which reads the array can be pre-empted, and may not get a chance to execute for several seconds.
On a hypothetical OS which backs its memory not just with disk but with additional memory on another computer in the other corner of the world, array access would be unimaginably slow.
Like my apple and orange example, as you read through my increasingly absurd examples, I hope the point I am trying to make is clear.
Conclusion - Any day, I'd answer the question "Do Arrays have constant time O(1) access (with indexes)", as yes.. without any doubt or ifs and buts, they do.
EDIT:
To put it another way - if O(1) is not the answer.. then neither is O(log n), or O(n log n), or O(n^2) or O(n ^ 3)..... and certainly not 42.
He is talking about Computation Models, and in particular the word-based RAM machine.
A RAM machine is a formalization of something very similar to an actual computer: we model the computer memory as a big array of memory words of w bits each, and we can read/write any word in O(1) time.
But we still have something important to define: how large should a word be?
We need a word size w ≥ Ω(log n) to be able at least to address the n parts of our input. For this reason, word-based RAMs normally assume a word length of O(log n).
But having the word length of your machine depend on the size of the input appears strange and unrealistic.
What if we keep the word length fixed? Then even following a pointer needs Ω(log n) time just to read the entire pointer: we need Ω(log n) words to store the pointer, and Ω(log n) time to access an input element.
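To summarize the trade-off compactly (my own formulation, not part of the answer above), the cost of reading a single pointer on a word RAM is:

\[
\text{cost of reading one pointer} \;=\; \Theta\!\left(\left\lceil \frac{\log n}{w} \right\rceil\right) \;=\;
\begin{cases}
O(1) & \text{if } w = \Theta(\log n),\\
\Theta(\log n) & \text{if } w = O(1).
\end{cases}
\]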
If a language supports sparse arrays, access to the array would have to go through a directory, and a tree-structured directory would have non-constant access time. Or did you mean realistic conditions? ;-)

How can we allocate memory of order 10^15 in C

I need to allocate memory of order of 10^15 to store integers which can be of long long type.
If I use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can I allocate such a huge amount of memory?
Really large arrays generally aren't a job for memory, more one for disk. 10^15 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8G memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million.
In addition, with upcoming DDR4 being clocked up to about 4GT/s (giga-transfers), even if each transfer was a 64-bit value, it would still take about 250,000 seconds just to initialise that array to zero. Do you really want to be waiting around for roughly three days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
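A hedged sketch of the sparse-array idea in C++: only elements that were ever written consume memory, and everything else reads back as zero. This is an illustration of the concept, not a drop-in replacement for a real 10^15-element array.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

class SparseArray {
    std::unordered_map<std::uint64_t, long long> cells;   // index -> value, only non-zero entries
public:
    void set(std::uint64_t i, long long v) {
        if (v == 0) cells.erase(i); else cells[i] = v;
    }
    long long get(std::uint64_t i) const {
        auto it = cells.find(i);
        return it == cells.end() ? 0 : it->second;
    }
};

int main() {
    SparseArray a;
    a.set(999999999999999ULL, 42);                                     // "index 10^15 - 1"
    std::printf("%lld %lld\n", a.get(999999999999999ULL), a.get(7));   // 42 0
}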
You can't. You don't have all this memory, and you won't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use a library that works with mass-storage data, like stxxl, but it will be a lot slower, and you will always be limited by disk size.
MPI is what you need; that's actually a small size for parallel computing problems. The Blue Gene/Q monster at Lawrence Livermore National Labs holds around 1.5 PB of RAM. You need to use block decomposition to divide up your problem, and voila!
The basic approach is dividing the array into equal blocks or chunks among many processors.
You need to upgrade to a 64-bit system, then get a 64-bit-capable compiler, then put an l at the end of 1000000000000000.
Have you heard of sparse matrix implementations? With a sparse matrix, you use only a very small part of the matrix despite the matrix being huge.
Here are some libraries for you.
Here is some basic info about sparse matrices. You don't actually use all of it, just the few points you need.

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.
At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache are normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input in to progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for (int i = 0; i < whatever; i++) {
    step1(data);
    step2(data);
    step3(data);
}
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for (int i = 0; i < whatever; i++) {
    step1(data);
    step2(data);
}
for (int i = 0; i < whatever; i++) {
    step3(data);
}
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache-unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less -- and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades work out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
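As a small illustration of that dense-format idea (the 16-bit representation and the scale factor are assumptions for the example, not anything from a real codebase): values known to fit in 16 bits are stored packed, quadrupling the number of samples per cache line, and widened only at the point of use.

#include <cstdint>
#include <cstdio>
#include <vector>

// 2 bytes per sample instead of 8, so four times as many samples per cache line.
double sum_packed(const std::vector<std::uint16_t>& samples, double scale) {
    double total = 0.0;
    for (std::uint16_t s : samples)
        total += s * scale;           // the "unpacking" is a single multiply
    return total;
}

int main() {
    std::vector<std::uint16_t> samples(1000, 20000);        // made-up fixed-point data
    std::printf("%f\n", sum_packed(samples, 0.001));        // 20000 * 0.001 * 1000 = 20000
}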
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.
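A minimal sketch of one way to sidestep false sharing: pad each per-thread counter out to a separate (here assumed 64-byte) cache line with alignas, so two cores never write to the same line. With plain adjacent counters, the line would ping-pong between cores exactly as described above.

#include <atomic>
#include <thread>

struct alignas(64) PaddedCounter {        // one cache line per counter (assuming 64-byte lines)
    std::atomic<long> value{0};
};

int main() {
    PaddedCounter counters[4];
    std::thread workers[4];
    for (int t = 0; t < 4; ++t)
        workers[t] = std::thread([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    return counters[0].value.load() == 1000000 ? 0 : 1;
}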
Here's a link to a really good paper on caches/memory optimization by Christer Ericson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.
A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessy covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.
UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click @ Azul
.
.
.
.
.
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit here's a particularly successful example of how I changed some code to make better use of on-chip caches. This was done some time ago on a 486 CPU and later migrated to a 1st-generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a vector of double floats that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values, but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline (thus the long tails, which could use as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2k for the doubles, and 1,250 bytes for the 8-bit subscripts, shrinking 10,000 bytes down to 3,298. Since the program spent 90% or more of its time in this inner loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~ 100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
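A loose reconstruction of that subscript-mapping scheme (the sizes and values below are invented, not the original code): a small table of doubles plus a vector of one-byte subscripts replaces the full-width double vector, and a lookup becomes two small, cache-resident loads.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct MappedCurve {
    double table[256];                     // ~2 KB: the distinct "interesting" values
    std::vector<std::uint8_t> subscript;   // 1,250 one-byte indexes into table

    double value_at(std::size_t i) const {
        return table[subscript[i]];        // both arrays stay hot in the data cache
    }
};

int main() {
    MappedCurve c{};
    for (int i = 0; i < 256; ++i) c.table[i] = i * 0.125;    // made-up curve values
    c.subscript.assign(1250, 0);                             // long tails -> first entry
    c.subscript[600] = 200;                                  // middle of the curve
    std::printf("%f %f\n", c.value_at(0), c.value_at(600));  // 0.000000 25.000000
}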
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache/s. This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. IE: you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. Masking off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc., requires the CPU to use masks and shifts on the host integers the bits live in, which shrinks the data and effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.
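For what it's worth, a small sketch of the bit-field packing described above (field widths borrowed from the month/day-of-year example; the flags field is just filler): the CPU pays extra mask-and-shift instructions on every access, but the record shrinks to a couple of bytes, so far more of them fit in a cache line.

#include <cstdint>
#include <cstdio>

struct PackedDate {
    std::uint16_t month       : 4;   // 1..12
    std::uint16_t day_of_year : 9;   // 1..366
    std::uint16_t flags       : 3;   // whatever else fits
};

int main() {
    PackedDate d{};
    d.month = 12;
    d.day_of_year = 365;
    std::printf("sizeof(PackedDate) = %zu bytes, month %d, day %d\n",
                sizeof(PackedDate), d.month, d.day_of_year);   // typically 2 bytes
}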
Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Haswell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond, adding in the most current slice's resource requests and subtracting out the 1,000th time slice's requests, i.e., resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this and studying the code, a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring (really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1 of the 3 pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage, however, I was able to relax this constraint - satisfying the users request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the L1 cache's 32K data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped were 1 of the 3 separated by any significant "step" in the 8MB L3cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.
Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.
If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.
