general question about solving problems with parallelisation

general question about solving problems with parallelisation - c

i have a general question about programming of parallel algorithms in C. Lets assume that our task is to implement some matrix algorithms with MPI and/or OpenMP. There are some situations, like false sharing in OpenMP or in MPI where the communications arise in dependence of the matrix dimension (columns cyclic distrubuted among processes), which cause some problems . Would it be a good and a common attempt to solve this situations by, for example, transposing the matrix, because this would reduce the necessary communications or even avoiding the false sharing problem? After that you would undo the transposition. Of course, assuming that this would lead to a much better speed up.
I dont think that this would be very cunning and more of a lazy way to do this. But im curious to read some opions about this.

Let's start with the first question first: can it make sense to transpose? The answer is, it depends, and you can estimate whether it will improve things or not.
The transposition/retransposition with impose a one-time memory bandwidth cost of 2*(going through memory the fast way) + 2*(going through memory the slow way) where those memory operations are literally memory operations in the multicore case, or network communications in the distributed memory case. You're going to be reading a matrix in the fast way and putting it into memory the slow way. (You can make this, essentially, 4*(going through memory the fast way) by reading the matrix in one cache-sized block at a time, transposing in cache, and writing out in order).
Whether or not that's a win or not depends on how many times you'll be accessing the array. If you would have been hitting the entire non-transposed array 4 times with memory access in the "wrong" direction, then you will clearly win by doing the two transposes. If you'd only be going through the non-transposed array once in the wrong direction, then you almost certainly won't win by doing the transposition.
As to the larger question, #AlexandreC is absolutely right here -- trying to implement your own linear algebra routines is madness. Take a look at, eg, How To Write Fast Numerical Code, figure 3; there can be factors of 40 in performance between naive and highly-tuned (say) GEMM operations. These things are highly memory-bandwidth limited, and in parallel that means network limited. By far, best is to use existing tools.
For multicore linear algebra, existing libraries include
Atlas
Plasma
Flame
For MPI implementations, there are
BLACS
Scalapack
or complete solver environments like
PETSc
Trilinos

I don't know that you'd throw the transpose away the second that you completed the operation, but yes this is a valid mechanism to increase parallelism.
I am not an expert; I've only read a little bit about this topic, and even that was for SIMD architectures, so please take my opinion lightly... but I think the usual mechanism is to lay your structures out in memory to match the machine (so you'd transpose a large matrix to line up better with your vectors and increase the dependency distance in your loops), and then you also build an indexing structure of pointers around that so that you can quickly access individual elements in the transpose differently. This gets more difficult to do as your input changes more dynamically.

I dont think that this would be very cunning and more of a lazy way to do this.
Lazy solutions are usually better than "cunning" ones, because they tend to be more simple and straightforward. They're therefore easier to implement, document, understand and maintain. Indeed, laziness is arguably one of the greatest virtues a programmer can have. As long as the program produces correct results at acceptable speeds, nobody should care how elegantly you solved the problem (including you).

Related

C- Why is for loop pointer indexing faster? [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.

First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event work out if there is an opportunity to change the order of execution to minimise the time critical areas. For example when trying to improve the response to some external interrupt can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform if the only operation that at you're doing on a 32 bit value is an increment you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. This is wasted cpu cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)

As everybody else has said: profile, profile profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
int ID;
int AccountNumber;
char Name[128];
char Address[256];
};
Customer customers[1000];
Now, lets assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is to split it into two:
struct CustomerAccount
{
int ID;
int AccountNumber;
CustomerData *pData;
};
struct CustomerData
{
char Name[128];
char Address[256];
};
CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes and so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.

My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.

most common techniques I encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them are already sounded:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost

For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).

First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression

Avoid using the heap. Use obstacks or pool-allocator for identical sized objects. Put small things with short lifetime onto the stack. alloca still exists.

Pre-mature optimization is the root of all evil!
;)

As my applications usually don't need much CPU time by design, I focus on the size my binaries on disk and in memory. What I do mostly is looking out for statically sized arrays and replacing them with dynamically allocated memory where it's worth the additional effort of free'ing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and put the initializiation to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This will remove the 1024 zero-bytes from the binaries .DATA section and will instead create the buffer on the stack at runtime and the fill it with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)

If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }

Another thing that was not mentioned:
Know your requirements: don't optimize for situations that will unlikely or never happen, concentrate on the most bang for the buck

basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free

Difficult to summarize ...
Data structures:
Splitting of a data structure depending on case of usage is extremely important. It is common to see a structure that holds data that is accessed based on a flow control. This situation can lower significantly the cache usage.
To take into account cache line size and prefetch rules.
To reorder the members of the structure to obtain a sequential access to them from your code
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
As for the latest processors it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit whole section of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce as much as possible the local working set of your application as the processors are leaning to smaller caches per cores (C2D enjoyed a 3MB per core max where iCore7 will provide a max of 256KB per core + 8MB shared to all cores for a quad core die.).
The most important of all: Measure early, Measure often and never ever makes assumptions, base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.

These days, the most important things in optimzation are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictabilty - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There's a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.

Collecting profiles of code execution get you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or VisualC++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.

Inline functions! Inspired by the profiling fans here I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my applcation, so I made it inline and voila - the program now uses half of the CPU time it did before.

On most of embedded system i worked there was no profiling tools, so it's nice to say use profiler but not very practical.
First rule in speed optimization is - find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in generic way how to optimize this it's depend on what are you doing and what is in your power to do. For example you want usually avoid memcpy on critical path, so ever you need to use DMA or optimize, but what if you hw does not have DMA ? check if memcpy implementation is a best one if not rewrite it.
Do not use dynamic allocation at all in embedded but if you do for some reason don't do it in critical path.
Organize your thread priorities correctly, what is correctly is real question and it's clearly system specific.
We use very simple tools to analyze the bottle-necks, simple macro that store the time-stamp and index. Few (2-3) runs in 90% of cases will find where you spend your time.
And the last one is code review a very important one. In most case we avoid performance problem during code review very effective way :)

Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.

Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of you space, you pack structures e.g. #pragma pack(1) and use bit fields in structures. For more speed you pack to align with the processors preference and avoid bitfields.
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.

If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.

Great lists. I will just add one tip I didn't saw in the above lists that in some case can yield huge optimisation for minimal cost.
bypass linker
if you have some application divided in two files, say main.c and lib.c, in many cases you can just add a \#include "lib.c" in your main.c That will completely bypass linker and allow for much more efficient optimisation for compiler.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.

Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.

I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up, I add items to a hashtable and then use the hashtable to lookup the results.
For example you have a string to lookup and then 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number ( you only have to do this once ). Then you lookup the target string in the hashtable and have one large switch with all 50 cases ( or have functions pointers ).
When looking up things with common sets of input ( like css rules ), I use fast code to keep track of the only possible solitions and then iterate thought those to find a match. Once I have a match I save the results into a hashtable ( as a cache ) and then use the cache results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Writing For loops efficiently

I am constructing the partial derivative of a function in C. The process is mainly consisted of a large number of small loops. Each loop is responsible for filling a column of the matrix. Because the size of the matrix is huge, the code should be written efficiently. I have a number of plans in mind for the implementation which I do not want get into the details.
I know that the smart compilers try to take advantage of the cache automatically. But I would like to know more the details of using cache and writing an efficient code and efficient loops. It is appreciated if provide with some resources or websites so I can know more about writing the efficient codes in terms of reducing memory access time and taking advantage guy.
I know that my request my look sloppy, but I am not a computer guy. I did some research but with no success.
So, any help is appreciated.
Thanks

Well written code tends to be efficient (though not always optimal). Start by writing good clean code, and if you actually have a performance problem that can be isolated and addressed.

It is probably best that you write the code in the most readable and understandable way you can and then profile it to see where the bottlenecks really are. Often times your conception of where you need efficiency doesn't match up with reality.
Modern compilers do a decent job with many aspects of optimization and it seems unlikely that the process of looping will itself be a problem. Perhaps you should consider focusing on simplifying the calculation done by each loop.
Otherwise, you'll be looking at things such as accessing your matrix row by row so that you take advantage of the row-major storage order C uses (see this question).
You'll want to build your for loops without if statements inside because if statements create what is called "branching". The computer essentially guesses which option will be right and pays a sometimes hefty option if it is wrong.
To extend that theme, you want to do as little inside the for loop as possible. You'll also want to define it with static limits, e.g.:
for(int i=1;i<100;i++) //This is better than
for(int i=1;i<N/i;i++) //this
Static limits means that very little effort is expended determining if the for loop should keep going. They also permit you to use OpenMP to divy up the work in the loops, which can sometimes speed things up considerably. This is simple to do:
#pragma omp parallel for
for(int i=0;i<100;i++)
And, walla! the code is parallelized.

C coding practices for performance or code size - beyond what a compiler does

I'm looking to see what can a programmer do in C, that can determine the performance and/or the size of the generated object file.
For e.g,
1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint)
2. For loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value
etc.
It looks like compilers now have advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at a C level then? :)
My specific environment is: GCC 4.3.3 re-targeted for ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.
PS: This approach of mine goes against the usual "code first!, then benchmark, and finally optimize" approach.
Edit: Just like it so happens, I found a similar post after posting the question: Should we still be optimizing "in the small"?

One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.
This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.

"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)

Compilers these days still aren't very good at vectorizing your code so you'll still want to do the SIMD implementation of most algorithms yourself.
Choosing the right datastructures for your exact problem can dramatically increase performance (I've seen cases where moving from a Kd-tree to a BVH would do that, in that specific case).
Compilers might pad some structs/ variables to fit into the cache but other cache optimizations such as the locality of your data are still up to you.
Compilers still don't automatically make your code multithreaded and using openmp, in my experience, doesn't really help much. (You really have to understand openmp anyway to dramatically increase performance). So currently, you're on your own doing multithreading.

To add to what Martin says above about cache-friendliness:
reordering your structures such that fields which are commonly accessed together are in the same cache line can help (for instance by loading just one cache line rather than two.) You are essentially increasing the density of useful data in your data cache by doing this. There is a linux tool which can help you in doing this: dwarves 1. http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf
you can use a similar strategy for increasing density of your code. In gcc you can mark hot and cold branches using likely/unlikely tags. That enables gcc to keep the cold branches separately which helps in increasing the icache density.
And now for something completely different:
for fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that for coherence purposes only one CPU can be allowed to write to the same address (in reality the same cacheline.) This can lead to a condition called cache-line ping pong. This is pretty bad and could be worse if that cache-line contains other unrelated data. Here, padding this contended data to a cache-line length makes sense.
Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.

PreComputation where possible... (sorry but its not always possible... I did extensive precomputation on my chess engine.) Store those results in memory, keeping cache in mind.. the bigger the size of precomputation data in memory the lesser is the chance of doing a cache hit. Since most of recent hardware is multicore you can design your application to target it.
if you are using several big arrays make sure you group them close to each other on where they would be used, boosting cache hits

Many people are not aware of this: Define an inline label (varies by compiler) which means inline, in its intent - many compilers place the keyword in an entirely different context from the original meaning. There are also ways to increase the inline size limits, before the compiler begins popping trivial things out of line. Human directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.

Kernel methods for large scale dataset

Kernel-based classifier usually requires O(n^3) training time because of the inner-product computation between two instances. To speed up the training, inner-product values can be pre-computed and stored in a two-dimensional array. However when the no. of instances is very large, say over 100,000, there will not be sufficient memory to do so.
So any better idea for this?

For modern implementations of support vector machines, the scaling of the training algorithm is dependent on lots of factors, such as the nature of the training data and kernel that you are using. The scaling factor of O(n^3) is an analytical result and isn't particularly useful in predicting how SVM training will scale in real-world situations. For example, empirical estimates of the training algorithm used by SVMLight put the scaling against training set size to be approximately O(n^2).
I would suggest you ask this question in the kernel machines forum. I think you're more likely to get a better answer than on Stack Overflow, which is more of a general-purpose programming site.

The Relevance Vector Machine has a sequential training mode in which you do not need to keep the entire kernel matrix in memory. You can basically calculate a column at a time, determine if it appears relevant, and throw it away otherwise. I have not had much luck with it myself, though, and the RVM has some other issues. There is most likely a better solution in the realm of Gaussian Processes. I haven't really sat down much with those, but I have seen mention of an online algorithm for it.

I am not a numerical analyst, but isn't the QR decomposition which you need to do ordinary least-squares linear regression also O(n^3)?
Anyways, you'll probably want to search the literature (since this is fairly new stuff) for online learning or active learning versions of the algorithm you're using. The general idea is to either discard data far from your decision boundary or to not include them in the first place. The danger is that you might get locked into a bad local maximum and then your online/active algorithm will ignore data that would help you get out.

Did you apply computational complexity theory in real life?

I'm taking a course in computational complexity and have so far had an impression that it won't be of much help to a developer.
I might be wrong but if you have gone down this path before, could you please provide an example of how the complexity theory helped you in your work? Tons of thanks.

O(1): Plain code without loops. Just flows through. Lookups in a lookup table are O(1), too.
O(log(n)): efficiently optimized algorithms. Example: binary tree algorithms and binary search. Usually doesn't hurt. You're lucky if you have such an algorithm at hand.
O(n): a single loop over data. Hurts for very large n.
O(n*log(n)): an algorithm that does some sort of divide and conquer strategy. Hurts for large n. Typical example: merge sort
O(n*n): a nested loop of some sort. Hurts even with small n. Common with naive matrix calculations. You want to avoid this sort of algorithm if you can.
O(n^x for x>2): a wicked construction with multiple nested loops. Hurts for very small n.
O(x^n, n! and worse): freaky (and often recursive) algorithms you don't want to have in production code except in very controlled cases, for very small n and if there really is no better alternative. Computation time may explode with n=n+1.
Moving your algorithm down from a higher complexity class can make your algorithm fly. Think of Fourier transformation which has an O(n*n) algorithm that was unusable with 1960s hardware except in rare cases. Then Cooley and Tukey made some clever complexity reductions by re-using already calculated values. That led to the widespread introduction of FFT into signal processing. And in the end it's also why Steve Jobs made a fortune with the iPod.
Simple example: Naive C programmers write this sort of loop:
for (int cnt=0; cnt < strlen(s) ; cnt++) {
/* some code */
}
That's an O(n*n) algorithm because of the implementation of strlen(). Nesting loops leads to multiplication of complexities inside the big-O. O(n) inside O(n) gives O(n*n). O(n^3) inside O(n) gives O(n^4). In the example, precalculating the string length will immediately turn the loop into O(n). Joel has also written about this.
Yet the complexity class is not everything. You have to keep an eye on the size of n. Reworking an O(n*log(n)) algorithm to O(n) won't help if the number of (now linear) instructions grows massively due to the reworking. And if n is small anyway, optimizing won't give much bang, too.

While it is true that one can get really far in software development without the slightest understanding of algorithmic complexity. I find I use my knowledge of complexity all the time; though, at this point it is often without realizing it. The two things that learning about complexity gives you as a software developer are a way to compare non-similar algorithms that do the same thing (sorting algorithms are the classic example, but most people don't actually write their own sorts). The more useful thing that it gives you is a way to quickly describe an algorithm.
For example, consider SQL. SQL is used every day by a very large number of programmers. If you were to see the following query, your understanding of the query is very different if you've studied complexity.
SELECT User.*, COUNT(Order.*) OrderCount FROM User Join Order ON User.UserId = Order.UserId
If you have studied complexity, then you would understand if someone said it was O(n^2) for a certain DBMS. Without complexity theory, the person would have to explain about table scans and such. If we add an index to the Order table
CREATE INDEX ORDER_USERID ON Order(UserId)
Then the above query might be O(n log n), which would make a huge difference for a large DB, but for a small one, it is nothing at all.
One might argue that complexity theory is not needed to understand how databases work, and they would be correct, but complexity theory gives a language for thinking about and talking about algorithms working on data.

For most types of programming work the theory part and proofs may not be useful in themselves but what they're doing is try to give you the intuition of being able to immediately say "this algorithm is O(n^2) so we can't run it on these one million data points". Even in the most elementary processing of large amounts of data you'll be running into this.
Thinking quickly complexity theory has been important to me in business data processing, GIS, graphics programming and understanding algorithms in general. It's one of the most useful lessons you can get from CS studies compared to what you'd generally self-study otherwise.

Computers are not smart, they will do whatever you instruct them to do. Compilers can optimize code a bit for you, but they can't optimize algorithms. Human brain works differently and that is why you need to understand the Big O. Consider calculating Fibonacci numbers. We all know F(n) = F(n-1) + F(n-2), and starting with 1,1 you can easily calculate following numbers without much effort, in linear time. But if you tell computer to calculate it with that formula (recursively), it wouldn't be linear (at least, in imperative languages). Somehow, our brain optimized algorithm, but compiler can't do this. So, you have to work on the algorithm to make it better.
And then, you need training, to spot brain optimizations which look so obvious, to see when code might be ineffective, to know patterns for bad and good algorithms (in terms of computational complexity) and so on. Basically, those courses serve several things:
understand executional patterns and data structures and what effect they have on the time your program needs to finish;
train your mind to spot potential problems in algorithm, when it could be inefficient on large data sets. Or understand the results of profiling;
learn well-known ways to improve algorithms by reducing their computational complexity;
prepare yourself to pass an interview in the cool company :)

It's extremely important. If you don't understand how to estimate and figure out how long your algorithms will take to run, then you will end up writing some pretty slow code. I think about compuational complexity all the time when writing algorithms. It's something that should always be on your mind when programming.
This is especially true in many cases because while your app may work fine on your desktop computer with a small test data set, it's important to understand how quickly your app will respond once you go live with it, and there are hundreds of thousands of people using it.

Yes, I frequently use Big-O notation, or rather, I use the thought processes behind it, not the notation itself. Largely because so few developers in the organization(s) I frequent understand it. I don't mean to be disrespectful to those people, but in my experience, knowledge of this stuff is one of those things that "sorts the men from the boys".
I wonder if this is one of those questions that can only receive "yes" answers? It strikes me that the set of people that understand computational complexity is roughly equivalent to the set of people that think it's important. So, anyone that might answer no perhaps doesn't understand the question and therefore would skip on to the next question rather than pause to respond. Just a thought ;-)

There are points in time when you will face problems that require thinking about them. There are many real world problems that require manipulation of large set of data...
Examples are:
Maps application... like Google Maps - how would you process the road line data worldwide and draw them? and you need to draw them fast!
Logistics application... think traveling sales man on steroids
Data mining... all big enterprises requires one, how would you mine a database containing 100 tables and 10m+ rows and come up with a useful results before the trends get outdated?
Taking a course in computational complexity will help you in analyzing and choosing/creating algorithms that are efficient for such scenarios.
Believe me, something as simple as reducing a coefficient, say from T(3n) down to T(2n) can make a HUGE differences when the "n" is measured in days if not months.

There's lots of good advice here, and I'm sure most programmers have used their complexity knowledge once in a while.
However I should say understanding computational complexity is of extreme importance in the field of Games! Yes you heard it, that "useless" stuff is the kind of stuff game programming lives on.
I'd bet very few professionals probably care about the Big-O as much as game programmers.

I use complexity calculations regularly, largely because I work in the geospatial domain with very large datasets, e.g. processes involving millions and occasionally billions of cartesian coordinates. Once you start hitting multi-dimensional problems, complexity can be a real issue, as greedy algorithms that would be O(n) in one dimension suddenly hop to O(n^3) in three dimensions and it doesn't take much data to create a serious bottleneck. As I mentioned in a similar post, you also see big O notation becoming cumbersome when you start dealing with groups of complex objects of varying size. The order of complexity can also be very data dependent, with typical cases performing much better than general cases for well designed ad hoc algorithms.
It is also worth testing your algorithms under a profiler to see if what you have designed is what you have achieved. I find most bottlenecks are resolved much better with algorithm tweaking than improved processor speed for all the obvious reasons.
For more reading on general algorithms and their complexities I found Sedgewicks work both informative and accessible. For spatial algorithms, O'Rourkes book on computational geometry is excellent.

In your normal life, not near a computer you should apply concepts of complexity and parallel processing. This will allow you to be more efficient. Cache coherency. That sort of thing.

Yes, my knowledge of sorting algorithms came in handy one day when I had to sort a stack of student exams. I used merge sort (but not quicksort or heapsort). When programming, I just employ whatever sorting routine the library offers. ( haven't had to sort really large amount of data yet.)
I do use complexity theory in programming all the time, mostly in deciding which data structures to use, but also in when deciding whether or when to sort things, and for many other decisions.

'yes' and 'no'
yes) I frequently use big O-notation when developing and implementing algorithms.
E.g. when you should handle 10^3 items and complexity of the first algorithm is O(n log(n)) and of the second one O(n^3), you simply can say that first algorithm is almost real time while the second require considerable calculations.
Sometimes knowledges about NP complexities classes can be useful. It can help you to realize that you can stop thinking about inventing efficient algorithm when some NP-complete problem can be reduced to the problem you are thinking about.
no) What I have described above is a small part of the complexities theory. As a result it is difficult to say that I use it, I use minor-minor part of it.
I should admit that there are many software development project which don't touch algorithm development or usage of them in sophisticated way. In such cases complexity theory is useless. Ordinary users of algorithms frequently operate using words 'fast' and 'slow', 'x seconds' etc.

#Martin: Can you please elaborate on the thought processes behind it?
it might not be so explicit as sitting down and working out the Big-O notation for a solution, but it creates an awareness of the problem - and that steers you towards looking for a more efficient answer and away from problems in approaches you might take. e.g. O(n*n) versus something faster e.g. searching for words stored in a list versus stored in a trie (contrived example)
I find that it makes a difference with what data structures I'll choose to use, and how I'll work on large numbers of records.

A good example could be when your boss tells you to do some program and you can demonstrate by using the computational complexity theory that what your boss is asking you to do is not possible.