Optimal Buffer Sizing - c

I guess this is a performance computing question. I'm writing a program in C which produces a large amount of output, much more than can typically be stored in RAM in its entirety. I intend to simply write the output to stdout; so it may just be going to the screen, or may be being redirected into a file. My problem is how to choose an optimal buffer size for the data that will be stored in RAM?
The output data itself isn't particularly important, so let's just say that it's producing a massive list of random integers.
I intend to have 2 threads: one which produces the data and writes it to a buffer, and the other which writes that buffer to stdout. This way, I can begin producing the next buffer of output whilst the previous buffer is still being written to stdout.
To be clear, my question isn't about how to use functions like malloc() and pthread_create() etc. My question is purely about how to choose a number of bytes (512, 1024, 1048576) for the optimal buffer size, which will give the best performance?
Ideally, I'd like to find a way in which I could choose an optimal buffer size dynamically, so that my program could adjust to whatever hardware it was being run on at the time. I have tried searching for answers to this problem, and although I found a few threads about buffer size, I couldn't find anything particularly relevant to this problem. Therefore, I just wanted to post it as a question in the hope that I could get some different points of view, and come up with something better than I could on my own.

It's a big waste of time to mix design and optimization. This is considered one of the top canonical mistakes. It is likely to damage your design and not actually optimize much.
Get your program working, and if there is an indication of a performance issue, then profile it and consider analyzing the part that's really causing the problem.
I would think this applies particularly to a complex architectural optimization like multithreading your application. Multithreading a single image is something you never really want to do: it's impossible to test, prone to unreproducible bugs, it will fail differently in different execution environments, and there are other problems. But, for some programs, multithreaded parallel execution is required for functionality or is one way to get necessary performance. It's widely supported, and essentially it is, at times, a necessary evil.
It's not something you want in the initial design without solid evidence that programs like yours need it.
Almost any other method of parallelism (message passing?) will be easier to implement and debug, and you are getting lots of that in the I/O system of your OS anyway.

I personally think you are wasting your time.
First, run time ./myprog > /dev/null
Now, use time dd if=/dev/zero of=myfile.data bs=1k count=12M.
dd is about as simple a program as you can get, and it will write the file pretty quickly. But writing several gigabytes still takes a little while. (12G takes about 4 minutes on my machine - which is probably not the fastest disk in the world - the same size file to /dev/null takes about 5 seconds).
You can experiment with some different numbers in bs=x count=y where the combination makes, the same size as your program output for the test-run. But I only found that if you make VERY large blocks, it actually takes longer (1MB per write - probably because the OS needs to copy 1MB before it can write the data, then write it out and then copy the next 1MB, where with smaller blocks (I tested 1k and 4k), it takes a lot less time to copy the data, and there's actually less "disk spinning round not doing anything before we write to it").
Compare both of these times to your program running time. Is the time it takes to write the file the with dd much shorter than your program writing to the file?
If there isn't much difference, then look at the time it takes to write to /dev/null with your program - is that accounting for some or all of the difference?

Short answer: Measure it.
Long answer: From my experience, it depends too much on the factors that are hard to predict in advance. On the other hand, you do not have to commit yourself before the start. Just implement a generic solution and when you are done, make a few performance tests and take the settings with the best results. A profiler may help you to concentrate on the performance critical parts in your program.
From what I've seen, those that produce the fastest code, often try the simplest, straightforward approach first. What the do better than the average programmers is that they have a very good technique at writing good performance tests, which is by far not trivial.
Without experience, it is easy to fall into certain traps, for example, ignoring caching effects, or (maybe in your application?!) underestimating the costs of IO operations. In the worst case, you end up squeezing parts of the program which do not contribute to the overall performance at all.
Back to your original question:
In the scenario that you describe (one CPU-bound producer and one IO-bound consumer), it is likely that one of them will be the bottleneck (unless the rate at which the producer generates data varies a lot). Depending on which one is faster, the whole situation changes radically:
Let us first assume, the IO-bound consumer is your bottleneck (doesn't matter whether it writes to stdout or to a file). What are the likely consequences?
Optimizing the algorithm to produce the data will not improve performance, instead you have to maximize the write performance. I would assume, however, that the write performance will not depend very much on the buffer size (unless the buffer is too small).
In the other case, if the producer is the limiting factor, the situation is reversed. Here you have to profile the generation code and improve the speed of the algorithm and maybe the communication of the data between the reader and the writer thread. The buffer size, however, will still not matter, as the buffer will be empty most of the time, anyway.
Granted, the situation could be more complex than I have described. But unless you are actually sure that you are not in one of the extreme cases, I would not invest in tuning the buffer size yet. Just keep it configurable and you should be fine. I don't think that it should be a problem later to refit it to other hardware environments.

Most modern OSes are good at using the disk as a backing store for RAM. I suggest you leave the heuristics to the OS and just ask for as much memory as you want, till you hit a performance bottleneck.

There's no need to use buffering, the OS will automatically swap pages to the disk for you whenever necessary, you don't have to program that. The simples would be for you to leave in in RAM if you don't need to save the data, else you're probably better of saving it after generating the data, because it's better for the disk i/o.

Related

C- Why is for loop pointer indexing faster? [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.
First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event work out if there is an opportunity to change the order of execution to minimise the time critical areas. For example when trying to improve the response to some external interrupt can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform if the only operation that at you're doing on a 32 bit value is an increment you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. This is wasted cpu cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
As everybody else has said: profile, profile profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
int ID;
int AccountNumber;
char Name[128];
char Address[256];
};
Customer customers[1000];
Now, lets assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is to split it into two:
struct CustomerAccount
{
int ID;
int AccountNumber;
CustomerData *pData;
};
struct CustomerData
{
char Name[128];
char Address[256];
};
CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes and so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.
My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.
most common techniques I encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them are already sounded:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost
For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).
First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression
Avoid using the heap. Use obstacks or pool-allocator for identical sized objects. Put small things with short lifetime onto the stack. alloca still exists.
Pre-mature optimization is the root of all evil!
;)
As my applications usually don't need much CPU time by design, I focus on the size my binaries on disk and in memory. What I do mostly is looking out for statically sized arrays and replacing them with dynamically allocated memory where it's worth the additional effort of free'ing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and put the initializiation to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This will remove the 1024 zero-bytes from the binaries .DATA section and will instead create the buffer on the stack at runtime and the fill it with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)
If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }
Another thing that was not mentioned:
Know your requirements: don't optimize for situations that will unlikely or never happen, concentrate on the most bang for the buck
basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free
Difficult to summarize ...
Data structures:
Splitting of a data structure depending on case of usage is extremely important. It is common to see a structure that holds data that is accessed based on a flow control. This situation can lower significantly the cache usage.
To take into account cache line size and prefetch rules.
To reorder the members of the structure to obtain a sequential access to them from your code
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
As for the latest processors it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit whole section of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce as much as possible the local working set of your application as the processors are leaning to smaller caches per cores (C2D enjoyed a 3MB per core max where iCore7 will provide a max of 256KB per core + 8MB shared to all cores for a quad core die.).
The most important of all: Measure early, Measure often and never ever makes assumptions, base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.
These days, the most important things in optimzation are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictabilty - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There's a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.
Collecting profiles of code execution get you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or VisualC++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.
Inline functions! Inspired by the profiling fans here I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my applcation, so I made it inline and voila - the program now uses half of the CPU time it did before.
On most of embedded system i worked there was no profiling tools, so it's nice to say use profiler but not very practical.
First rule in speed optimization is - find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in generic way how to optimize this it's depend on what are you doing and what is in your power to do. For example you want usually avoid memcpy on critical path, so ever you need to use DMA or optimize, but what if you hw does not have DMA ? check if memcpy implementation is a best one if not rewrite it.
Do not use dynamic allocation at all in embedded but if you do for some reason don't do it in critical path.
Organize your thread priorities correctly, what is correctly is real question and it's clearly system specific.
We use very simple tools to analyze the bottle-necks, simple macro that store the time-stamp and index. Few (2-3) runs in 90% of cases will find where you spend your time.
And the last one is code review a very important one. In most case we avoid performance problem during code review very effective way :)
Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.
Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of you space, you pack structures e.g. #pragma pack(1) and use bit fields in structures. For more speed you pack to align with the processors preference and avoid bitfields.
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.
Great lists. I will just add one tip I didn't saw in the above lists that in some case can yield huge optimisation for minimal cost.
bypass linker
if you have some application divided in two files, say main.c and lib.c, in many cases you can just add a \#include "lib.c" in your main.c That will completely bypass linker and allow for much more efficient optimisation for compiler.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.
Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.
I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up, I add items to a hashtable and then use the hashtable to lookup the results.
For example you have a string to lookup and then 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number ( you only have to do this once ). Then you lookup the target string in the hashtable and have one large switch with all 50 cases ( or have functions pointers ).
When looking up things with common sets of input ( like css rules ), I use fast code to keep track of the only possible solitions and then iterate thought those to find a match. Once I have a match I save the results into a hashtable ( as a cache ) and then use the cache results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Profiling a Single Function Predictably

I need a better way of profiling numerical code. Assume that I'm using GCC in Cygwin on 64 bit x86 and that I'm not going to purchase a commercial tool.
The situation is this. I have a single function running in one thread. There are no code dependencies or I/O beyond memory accesses, with the possible exception of some math libraries linked in. But for the most part, it's all table look-ups, index calculations, and numerical processing. I've cache aligned all arrays on the heap and stack. Due to the complexity of the algorithm(s), loop unrolling, and long macros, the assembly listing can become quite lengthy -- thousands of instructions.
I have been resorting to using either, the tic/toc timer in Matlab, the time utility in the bash shell, or using the time stamp counter (rdtsc) directly around the function. The problem is this: the variance (which might be as much as 20% of the runtime) of the timing is larger than the size of the improvements I'm making, so I have no way of knowing if the code is better or worse after a change. You might think then it's time to give up. But I would disagree. If you are persistent, many incremental improvements can lead to a two or three times performance increase.
One problem I have had multiple times that is particularly maddening is that I make a change and the performance seems to improve consistently by say 20%. The next day, the gain is lost. Now it's possible I made what I thought was an innocuous change to the code and then completely forgot about it. But I'm wondering if it's possible something else is going on. Like maybe GCC doesn't yield a 100% deterministic output as I believe it does. Or maybe it's something simpler, like the OS moved my process to a busier core.
I have considered the following, but I don't know if any of these ideas are feasible or make any sense. If yes, I would like explicit instructions on how to implement a solution. The goal is to minimize the variance of the runtime so I can meaningfully compare different versions of optimized code.
Dedicate a core of my processor to run only my routine.
Direct control over the cache(s) (load it up or clear it out).
Ensuring my dll or executable always loads to the same place in memory. My thinking here is that maybe the set-associativity of the cache interacts with the code/data location in RAM to alter performance on each run.
Some kind of cycle accurate emulator tool (not commercial).
Is it possible to have a degree of control over context switches? Or does it even matter? My thinking is the timing of the context switches is causing variability, maybe by causing the pipeline to be flushed at an inopportune time.
In the past I have had success on RISC architectures by counting instructions in the assembly listing. This only works, of course, if the number of instructions is small. Some compilers (like TI's Code Composer for the C67x) will give you a detailed analysis of how it's keeping the ALU busy.
I haven't found the assembly listings produced by GCC/GAS to be particularly informative. With full optimization on, code is moved all over the place. There can be multiple location directives for a single block of code dispersed about the assembly listing. Further, even if I could understand how the assembly maps back into my original code, I'm not sure there's much correlation between instruction count and performance on a modern x86 machine anyway.
I made a weak attempt at using gcov for line-by-line profiling, but due to an incompatibility between the version of GCC I built and the MinGW compiler, it wouldn't work.
One last thing you can do is average over many, many trial runs, but that takes forever.
EDIT (RE: Call Stack Sampling)
The first question I have is, practically, how do I do this? In one of your power point slides, you showed using Visual Studio to pause the program. What I have is a DLL compiled by GCC with full optimizations in Cygwin. This is then called by a mex DLL compiled by Matlab using the VS2013 compiler.
The reason I use Matlab is because I can easily experiment with different parameters and visualize the results without having to write or compile any low level code. Further, I can compare my optimized DLL to the high level Matlab code to ensure my optimizations have not broken anything.
The reason I use GCC is that I have a lot more experience with it than with Microsoft's compiler. I'm familiar with many flags and extensions. Further, Microsoft has been reluctant, at least in the past, to maintain and update the native C compiler (C99). Finally, I've seen GCC kick the pants off commercial compilers, and I've looked at the assembly listing to see how it's actually done. So I have some intuition of how the compiler actually thinks.
Now, with regards to making guesses about what to fix. This isn't really the issue; it's more like making guesses about how to fix it. In this example, as is often the case in numerical algorithms, there is really no I/O (excluding memory). There are no function calls. There's virtually no abstraction at all. It's like I'm sitting on top of a piece of saran wrap. I can see the computer architecture below, and there's really nothing in-between. If I re-rolled up all the loops, I could probably fit the code on about one page or so, and I could almost count the resultant assembly instructions. Then I could do a rough comparison to the theoretical number of operations a single core is capable of doing to see how close to optimal I am. The trouble then is I lose the auto-vectorization and instruction level parallelization I got from unrolling. Unrolled, the assembly listing is too long to analyze in this way.
The point is that there really isn't much to this code. However, due to the incredible complexity of the compiler and modern computer architecture, there is quite a bit of optimization to be had even at this level. But I don't know how small changes are going to affect the output of the compiled code. Let me give a couple of examples.
This first one is somewhat vague, but I'm sure I've seen it happen a few times. You make a small change and get a 10% improvement. You make another small change and get another 10% improvement. You undo the first change and get another 10% improvement. Huh? Compiler optimizations are neither linear, nor monotonic. It's possible, the second change required an additional register, which broke the first change by forcing the compiler to alter its register allocation algorithm. Maybe, the second optimization somehow occluded the compiler's ability to do optimizations which was fixed by undoing the first optimization. Who knows. Unless the compiler is introspective enough to dump its full analysis at every level of abstraction, you'll never really know how you ended up with the final assembly.
Here is a more specific example which happened to me recently. I was hand coding AVX intrinsics to speed up a filter operation. I thought I could unroll the outer loop to increase instruction level parallelism. So I did, and the result was that the code was twice as slow. What happened was there were not enough 256 bit registers to go around. So the compiler was temporarily saving results on the stack, which killed performance.
As I was alluding to in this post, which you commented on, it's best to tell the compiler what you want, but unfortunately, you often have no choice and are forced to hand tweak optimizations, usually via guess and check.
So I guess my question would be, in these scenarios (the code is effectively small until unrolled, each incremental performance change is small, and you're working at a very low level of abstraction), would it be better to have "precision of timing" or is call stack sampling better at telling me which code is superior?
I've faced a similar problem some time ago but that was on Linux which made it easier to tweak. Basically the noise introduced by OS (called "OS jitter") was as big as 5-10% in SPEC2000 tests (I can imagine it's much higher on Windows due to much bigger amount of bloatware).
I was able to bring deviation to below 1% by combination of the following:
disable dynamic frequency scaling (better do this both in BIOS and in Linux kernel as not all kernel versions do this reliably)
disable memory prefetching and other fancy settings like "Turbo boost", etc. (BIOS, again)
disable hyperthreading
enable high-performance process scheduler in kernel
bind process to core to prevent thread migration (use core 0 - for some reason it was more reliable on my kernel, go figure)
boot to single-user mode (in which no services are running) - this isn't as easy in modern systemd-based distros
disable ASLR
disable network
drop OS pagecache
There may be more to it but 1% noise was good enough for me.
I might put detailed instructions to github later today if you need them.
-- EDIT --
I've published my benchmarking script and instructions here.
Am I right that what you're doing is making an educated guess of what to fix, fixing it, and then trying to measure to see if it made any difference?
I do it a different way, which works especially well as the code gets large.
Rather than guess (which I certainly can) I let the program tell me how the time is spent, by using this method.
If the method tells me that roughly 30% is spent doing such-and-so, I can concentrate on finding a better way to do that.
Then I can run it and just time it.
I don't need a lot of precision.
If it's better, that's great.
If it's worse, I can undo the change.
If it's about the same, I can say "Oh well, maybe it didn't save much, but let's do it all again to find another problem,"
I need not worry.
If there's a way to speed up the program, this will pinpoint it.
And often the problem is not just a simple statement like "line or routine X spends Y% of the time", but "the reason it's doing that is Z in certain cases" and the actual fix may be elsewhere.
After fixing it, the process can be done again, because a different problem, which was small before, is now larger (as a percent, because the total has been reduced by fixing the first problem).
Repetition is the key, because each speedup factor multiplies all the previous, like compound interest.
When the program no longer points out things I can fix, I can be sure it is nearly optimal, or at least nobody else is likely to beat it.
And at no point in this process did I need to measure the time with much precision.
Afterwards, if I want to brag about it in a powerpoint, maybe I'll do multiple timings to get smaller standard error, but even then, what people really care about is the overall speedup factor, not the precision.

Handing out addresses in a custom memory allocator w.r.t. cache conflicts

I've spent my afternoon reading up on processor caches after reading about the effect power of twos can have on cache conflicts. Now I wish to apply this new knowledge to my memory allocator for multi-threaded programs. However, I don't fully understand it yet.
I was under the impression that processors loved powers of two, so my allocator rounds requested sizes to their next power of two and then slices pages into multiples of this size and hands them out. When a page is full, it simply maps a new page and slices it up the same way. This leads to very similar and predictable offsets into pages.
To what extent should I adapt my allocator to avoid this issue? For example, should I try to randomize addresses slightly or am I screwed for using powers of two in the first place?
Thanks!
Until you have uncontrovertible proof that this is performance critical, just leave it be. The extra complication will most probably not be worth it.
Everybody should read (and understand!) Bentley's "Writing efficient programs" (sadly out of print now, his "Programming Pearls" contains a summary, and is well worth a read too).
Before embarking on a code-optimization bout, make sure it is worth it. If the performance is adequate, there are better uses of your time. Yes, you have to measure first.
Measure where the cost is being spent. Programmers are notoriously bad at guessing where the costs are
The most performance gains come from restating the problem (sometimes it is enough to solve a problem that is faster to solve), then overall organization of the system, next better algorithms/data structures; and at the very, very end detail optimizations like the one considered here.
Your friendly compiler, given a bit of prodding in the direction of "generate good code" will today generate much better code than an experienced assembly language programmer when given similar (full function scale) tasks. Most local source code reorganizations "for performance" are either moot (the compiler would have done so on its own) or deleterious (the compiler will recognize and rewrite the usual code sequences, unusual code can confuse it to do nothing or generate bad code).
Programmer time (writing, debugging, maintaining) is much more valuable than a few microseconds of computer time here and there, except for extremely unusual circumstances. Write the simplest code that does the job, rework only if experience shows it is worthwile.

how to allocate more cpu and RAM to a c program in linux

I am running a simple C program which performs a lot calculations(CFD) hence takes a lot of time to run. However i still have a lot of unused CPU and RAM. So how will i allocate some of my processing power to one program.??
I'm guessing that CFD means Computational Fluid Dynamics (but CFD has also a lot of other meanings, so I might guess wrong).
You definitely should first profile your code. At the very least, compile it with gcc -Wall -pg -O and learn how to use gprof. You might also use strace to find out the system calls done by your code.
I'm not an expert of CFD (even if in the previous century I did work with CFD experts). But such code uses a lot of finite elements analysis and other vector computation.
If you are writing the code, you might perhaps consider using OpenMP (so by carefully adding OpenMP pragmas in your source code, you might speed it up), or even consider using GPGPUs by coding OpenCL kernels that run on the GPU.
You could also learn more about pthreads programming and change your code to use threads.
If you are using important numerical libraries like e.g. BLAS they have a lot of tuning, and even specialized variants (e.g. multi-core, OpenMP-ed, or even in OpenCL).
In all cases, parallelizing your code is a lot of work. You'll spend weeks or months on improving it, if it is possible.
Linux doesn't keep programs waiting and CPU free when they need to do calculations.
Either you have a multicore CPU and one single thread running (as suggested by #Pankrates) or you are blocking on some I/O.
You could nice the process with a negative increment, but you need to be superuser for that. See
man nice
This would increase the scheduling priority of the process. If it is competing with other processes for CPU time, it would get more CPU time and therefore "run faster".
As for increasing the amount of RAM used by the program: you'd need to rewrite or reconfigure the program to use more RAM. It is difficult to say more given the information available in the question.
To use multiple CPU's at once, you either need to run multiple copies of your program, or run multiple threads within the program. Neither is terribly hard to get started on.
However, it's much easier to do a parallel version of "I've got 10000 large numbers, I want to find out for each of them if they are primes or not" than it is to do "lots of A = A + B" type calculations in parallel - because you need the new A before you can make the next step. CFD calculations tend to do the latter [as far as I understand it], but with large arrays. You may be able to split large vector calculations into a set of smaller vector caclulations [say we have a matrix of 1000 x 1000, you could split that into 4 sets of 250 x 1000 matrixes, or 4 sets of 500 x 500 matrixes, and perform each of those in it's own thread].
If it's your own code, then you hopefully know what it does and how it works. If it's someone elses code, then you need to talk to whoever owns the code.
There is no magical way to "automatically make use of more CPU's". 30% CPU usage on a quad-core processor probably means that your system is basically using one core, and 5% or so is overhead for other things going on in the system - or maybe there is a second thread somewhere in your application that uses a little bit of CPU doing whatever it does. Or the application is multithreaded, but doesn't use the multiple cores to full extent because there is contention between the threads over some shared resource... It's impossible for us to say which of these three [or several other] alternatives.
Asking for more RAM isn't going to help unless you have something useful to put into that memory. If there is free memory, your application get as much memory as it needs.

Efficent Cache usage in C

I want to write code that best uses the cache memory of my system. For example, I have one large array(2kb size) that is frequently used in operations. For better execution speed, I want it to be loaded in chache memory so that It takes less time for processor to fetch it. How can I ensure this thing in C language? Any help would be appreciated.
First, ask what processor you are running your code on. Then read its tech specs to find out how big the cache is and how the cache is arranged.
CPU caches are pretty big these days so if your array is only 2kB then it's almost certainly going to be held entirely in cache unless you read megabytes of data between accesses to the array.
In short: don't worry about it. Your array is tiny so it's unlikely you can do much to "optimise" cache usage for it. Instead, look at the algorithm you are using to see if there is a more efficient approach that can be used, and run a profiler on your code to see where the bottlenecks are.
For GCC you could use "-fprefetch-loop-arrays" option.
Anyway, if you're using that memory block frequently, it will most probably reside in cache.
If you want to make it faster, read this: http://www.agner.org/optimize/optimizing_cpp.pdf , and try to use it. And don't forget, the best optimizations can be done by changing the algorithm.

Resources