Design code to fit in CPU Cache? - c

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.

At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input in to progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
step1(data);
step2(data);
step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
step1(data);
step2(data);
end for
for i = 0 to whatever
step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, that you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades, works out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache at all any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.

Here's a link to a really good paper on caches/memory optimization by Christer Ericsson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.

A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessey covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.

UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click # Azul
.
.
.
.
.
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit here's a particularly successful example of how I changed some code to better use on chip caches. This was done some time ago on a 486 CPU and latter migrated to a 1st Generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline(thus the long tails, which could use as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2k for the doubles, and 1,250 bytes for the 8-bit subscripts. This shrunk 10,000 bytes down to 3,298. Since the program spent 90% or more of it's time in this inner-loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~ 100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache/s. This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. IE: you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is by using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. To mask off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc, etc, requires the CPU use masks to mask off the host integers the bits are using, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.

Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Hazwell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond which added the sum of the most current slice's resource requests and subtracted out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring(really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1:3 of the pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage, however, I was able to relax this constraint - satisfying the users request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the L1 cache's 32K data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped were 1 of the 3 separated by any significant "step" in the 8MB L3cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.

Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.

If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.

Related

Idiomatic way of performance evaluation?

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
doSomething()
drawSomething()
doSomething2()
sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval for every entrance and exit of each procedure,
It would incur huge performance overhead.
I am curious what is an idiomatic way of measuring the performance.
Printing of logging is good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but different throughput if run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures of at least a 64B cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of it's own inspite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.
I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
Decide what your maximum acceptable run time is
Finish your program in C
Compile for all of your target architectures
For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.
Are you sure you get cache-misses?
Even if the comparing value is not in an register, i think your first uint64 array should be on one cache stage (or what ever it is called) and your other data in another.
Your cache surely has some n-way associativity, that prevents your data row from being removed from the cache just by accessing your compare value.
Do not lose your time on Micro Optimizations. Improve your algorithms and data structures.

sprintf vs strcpy - use more memory and fast copy or almost no memory and slow copy?

I'm writing latency-critical application (homemade HFT trading system). I have such code which just convert uint64 to string:
// TODO: cache sprintf, use strcpy? measure?
sprintf(dest, "%" PRIu64, divRes.quot);
Here divRes.quot is integer number which guaranteed to be between 1 and 1 000 000. So I can preallocate (pretty big) array and "cache" every single value. Then I can just execute strcpy(dest, cache[divRes.quot]).
At first glance it must be significantly faster, because strcpy must be significantly faster, than sprintf. However note that I'm using huge array which almost surely can not be fully loaded to CPU cache. So second approach almost surely will go to main memory. While in first approach i'm pretty likely will stay in CPU cache (probably even in fastest L1 cache?!)
So in average what would be faster:
slow function in CPU cache
fast function with access to main memory?
I think it depends on how much faster one function than another and how much faster CPU cache access than main memory access.
I guess it's very hard to write a true test. Because in real application overall system load will be different and so cache/memory usage will be different and this can change things dramatically.
Please note I don't care about readability, maintance etc, I only need speed.
For the table lookup to work out well, you'd have to be doing this often enough (on a CPU with a large cache) for the majority of the table to be in cache most of the time you're doing it. The table occupies around 7 megabytes of memory, so unless the cache was quite large, and you were converting millions of numbers at a time so most of the accesses were to cache, it would almost certainly be a net loss.
By my figuring, it's likely to take around 100 clocks to convert a single number using normal division (~5 divisions + 6 additions). A read from main memory typically takes something like 200 processor clocks, so you'd need a cache hit rate around 50% just to break even.
Personally, I doubt I'd use either of those approaches. Instead, I'd probably do a hybrid. I'd divide the number by 1000, then do two table lookups (one with the dividend, the other with the remainder).
The advantage is that this reduces the table size to around 4 kilobytes, and increases the use of each table entry by a factor of about 1000. Assuming you're converting at least a few hundred (or so) randomly distributed numbers at a time, you can probably count on nearly 100% cache hit rate. With a high cache hit rate, we can plan on one division plus two loads from cache for a total of around 25 clocks, or roughly 4 times as fast as we can expect from a naive conversion.

Is array access always constant-time / O(1)?

From Richard Bird, Pearls of Functional Algorithm Design (2010), page 6:
For a pure functional programmer, an update operation takes logarithmic time in the size of the array. To be fair, procedural programmers also appreciate that constant-time indexing and updating
are only possible when the arrays are small.
Under what conditions do arrays have non-constant-time access? Is this related to CPU cache?
Most modern machine architectures try to offer small unit time access to memory.
They fail. Instead we see layers of caches with differing speeds.
The problem is simple: the speed of light. If you need an enormous memory [array] (in the extreme, imagine a memory the size of the Andromeda galaxy), it will take enormous space, and light cannot cross enormous space in short periods of time. Information cannot travel faster than the speed of light. You are screwed by physics from the start.
So the best you can do is build part of memory "nearby" where light takes only fractions of nanosecond to traverse (thus registers and L1 cache), and part of the memory far away (the disk drive). Other practical complications ensue, such as capacitance (think inertia) to slow down access to things further away.
Now, if you are willing to take the access time of your farthest memory element as "unit" time, yes, access to everything takes the same amount of time, e.g., O(1). In practical computing, we treat RAM memory this way most of the time, and we leave out other, slower devices to avoid screwing up our simple model.
Then you discover people that aren't satisfied with that, and voila, you have people optimizing for cache line access. So it may be O(1) in theory, and acts like O(1) for small arrays (that fit in the first level of cache), but it often is not in practice.
An extreme practical case is an array that doesn't fit in main memory; now an array access may cause paging from the disk.
Sometimes we don't care even in that case. Google is essentially a giant cache. We tend to think of Google searches as O(1).
Can oranges be red?
Yes, they can be red due to a number of reasons -
You color them red.
You grow a genetically modified variety.
You grow them on Mars, the red planet, where every thing is supposed to look red.
The (theoretical) list of some practical (given todays technology) and impractical (fiction / or future reality) goes on...
Point is, that I think the question you are asking, is really about two orthogonal concepts. Namely -
Big O Notation - "In mathematics, big O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions."
vs
Practicalities (hardware and software) a good software engineer should be aware of, while architecting / designing their app and writing code.
In other words, while the concept of Big O Notation can be called academic, but it is most appropriate way of classifying algorithms complexity (Time / Space).. and that's where it ends. There is no need to muddy the waters with orthogonal concerns.
To be clear, I am not saying that one should not be aware of the under the hood implementation details and workings of things, which affect the performance of the software you write.. but there is no point of mixing the two together. For example, does it make sense to say -
Arrays do not have constant time access (with indexes) because -
Large arrays do not fit in CPU cache, and hence incur high cost of cache misses.
On a system under memory pressure, the array, big or small, has been swapped out from Physical Memory to Hard Disk, and not only is impacted by a cache miss, but also a hard page fault.
On a system under extreme CPU load, the thread which read the supposed array can be pre-empted, and may not get a chance to execute for several seconds.
On a hypothetical OS, which backs its memory not just with disk, but with additional memory on another computer on the other corner of the world, will make array access un-imaginably slow.
Like my apple and orange example, as you read through my increasingly absurd examples, hope the point I am trying to make is clear.
Conclusion - Any day, I'd answer the question "Do Arrays have constant time O(1) access (with indexes)", as yes.. without any doubt or ifs and buts, they do.
EDIT:
Put it another way - If O(1) is not the answer.. then neither is O(log n), or O(n log n), or O(n^2) or O(n ^ 3)..... and certainly not 42.
He is talking about Computation Models, and in particular the word-based RAM machine
A RAM machine is a formalization of something of very similar to an actual computer: we model the computer memory as a big array of memory words of w bits each, and we can read/write any words in O(1) time
But we have yet something important to define: how large should a word be?
We need a word size w ≥ Ω(log n) to be able at least to address the n parts of our input.
For this reason, word-based RAMs normally assume a word length of O(log n)
But having the word length of your machine depends on the size of the input appears strange and unrealistic
What if we keep the word length fixed? Then even following a pointer needs Ω(log n) time just to read the entire pointer
We need Ω(log n) words to store pointer and Ω(log n) time to access input element
if a language supports sparse arrays, access to the array would have to go through a directory, and a tree-structured directory would have non-linear access time. Or did you mean realistic conditions? ;-)

Is a pointer indirection more costly than a conditional?

Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime that a cache flush might occur, but I don't have any data to back that.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.
This is very much dependant on the circumstances.
1 How often is the data in cache (L1, L2, L3) or and how often it must be fetched all the way from the RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2 What processor is it?
Older Intel Pentium4 were famous for their long pipeline stages, and would take 25-30 clockcycles (~15ns at 2GHz) to "recover" from a branch that was mispredicted.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would thin that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.
It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
It depends from processor to processor, but depending on the set of data you're working with, a pipeline flush caused by a mispredicted branch (or badly ordered instructions in some cases) can be more damaging to the speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while a L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.
Since the processor would have to predict the conditional answer in order to plan which instruction has more chances of having to be executed, I would say that the actual cost of the instructions is not important.
Conditional instructions are bad efficiency wise because they make the process flow unpredictable.

Resources