In some cases, one knows at compile time what a particular piece of algorithmic data looks like, and as such might wish to convey this information to the compiler. This question is about how one might best achieve that.
By way of example, consider the following sparse matrix-vector multiplication in which the matrix is constant and known at compile time:
matrix = [ 0, 210, 0, 248, 137]
[ 0, 0, 0, 0, 239]
[ 0, 0, 0, 0, 0]
[116, 112, 0, 0, 7]
[ 0, 0, 0, 0, 165]
In such a case, a fully branchless implementation could be written to implement the matrix vector multiplication for an arbitrary input vector:
#include <stdio.h>
#define ARRAY_SIZE 8
static const int matrix[ARRAY_SIZE] = {210, 248, 137, 239, 116, 112, 7, 165};
static const int input_indices[ARRAY_SIZE] = {1, 3, 4, 4, 0, 1, 4, 4};
static const int output_indices[ARRAY_SIZE] = {0, 0, 0, 1, 3, 3, 3, 4};
static void matrix_multiply(int *input_array, int *output_array)
{
    for (int i = 0; i < ARRAY_SIZE; ++i) {
        output_array[output_indices[i]] += (
            matrix[i] * input_array[input_indices[i]]);
    }
}

int main()
{
    int test_input[5] = {36, 220, 212, 122, 39};
    int output[5] = {0};

    matrix_multiply(test_input, output);

    for (int i = 0; i < 5; ++i) {
        printf("%d\n", output[i]);
    }
}
which prints the correct result for the matrix-vector multiplication (81799, 9321, 0, 29089, 6435).
Further optimisations can be envisaged that build on data specific knowledge about the memory locality of reference.
Now, clearly this is an approach which can be used, but it starts getting unwieldy when the size of the data gets big (say ~100MB in my case) and also in any real world situation would depend on meta-programming to generate the associated data dependent knowledge.
Does the general strategy of baking in data specific knowledge have mileage as regards optimisation? If so, what is the best approach to do this?
In the example given, on one level the whole thing can be reduced to knowledge about ARRAY_SIZE, with the arrays set at runtime. This leads me to think the approach is limited (and is really a data structures problem), but I'm very interested to know if the general approach of data-derived compile-time optimisations is useful in any situation.
I don't think this is a very good answer to this question but I'm going to try offering it anyway. It's also more of a search for the same basic answer.
I work in 3D VFX including raytracing where it's not uncommon to take a fairly modest input with data structures that build in under a second, and then do a monumental amount of processing subsequently to the point where a user might wait hours for a quality production render in a difficult lighting situation.
In theory at least, this could go so much faster if we could make these "data-specific optimizations". Variables could turn into literal constants, significantly less branching could be required, data that is known to always have an upper bound of 45 elements could be allocated on the stack instead of heap or use another form of memory preallocated in advance, locality of reference could be exploited to a greater degree than ever before, vectorization could be applied more easily, achieving both thread-safety and efficiency could be a lot easier, etc.
Where this gets awkward for me is that this requires information about user inputs which can only be provided after the usual notion of "compile-time". So a lot of my interest here relates to code-generation techniques while the application is running.
Now, clearly this is an approach which can be used, but it starts getting unwieldy when the size of the data gets big (say ~100MB in my case) and also in any real world situation would depend on meta-programming to generate the associated data dependent knowledge.
I think beyond that, if the data size gets excessive, then we do often need a good share of branching and variables just to avoid generating so much code that we start becoming bottlenecked by icache misses.
Yet even the ability to turn a dozen frequently accessed variables into compile-time constants and allowing a handful of data structures to exploit greater knowledge of the specified input (with the aid of an aggressive optimizer) may yield great mileage here, especially considering how well optimizers do when they have the necessary information available in advance.
Some of this could be tackled normally with increasingly elaborate and generalized code, metaprogramming techniques, etc, yet there's a peak to how far we can go there: an optimizer can only optimize based on the information it has available in advance. The difficulty here is providing that information in a practical way. And, as you already guessed, this can quickly get unwieldy, difficult to maintain, and productivity starts to become just as great a concern as efficiency (if not greater).
So the most promising techniques to me revolve around code-generation techniques tuned for a specific problem domain, but not for a specific input (optimizing for the specific input will lean more on the optimizer; the code generation is there so that we can provide more of the information needed by the optimizer more easily/appropriately). A modest example that already does something like this is Open Shading Language, where it uses JIT compilation that exploits this idea to a modest level:
OSL uses the LLVM compiler framework to translate shader networks into machine code on the fly (just in time, or "JIT"), and in the process heavily optimizes shaders and networks with full knowledge of the shader parameters and other runtime values that could not have been known when the shaders were compiled from source code. As a result, we are seeing our OSL shading networks execute 25% faster than the equivalent shaders hand-crafted in C! (That's how our old shaders worked in our renderer.)
While a 25% improvement over handwritten code is modest, that's still a big deal in a production renderer, and it seems like we could go far beyond that.
The use of nodes as a visual programming language also offers a more restrictive environment that helps reduce human errors, allows expressing solutions at a higher-level, seeing the results of changes made on the fly (instant turnaround), etc. -- so it adds not only efficiency but that productivity we need to avoid getting lost in such optimizations. Maintaining and building the code generator could be a little complex, but it only needs to have the minimal amount of code required and doesn't scale in complexity with the amount of code generated using it.
So apologies -- this isn't exactly an answer to your question so much as an extended comment, but I think we're searching for a similar thing.
Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.
First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event, work out if there is an opportunity to change the order of execution to minimise the time-critical areas. For example, when trying to improve the response to some external interrupt, can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
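As a hedged illustration of that state-machine idea (state names and the printf bodies are made up for the demo): each handler returns the next handler to run, so the dispatch loop contains no switch at all. Whether this actually beats a switch depends on your compiler and target, so measure it.

#include <stdio.h>

struct state;                               /* forward declaration */
typedef struct state (*state_fn)(void);     /* a handler returns the next state */
struct state { state_fn next; };

static struct state state_idle(void);
static struct state state_run(void);
static struct state state_done(void);

static struct state state_idle(void) { printf("idle\n"); return (struct state){ state_run  }; }
static struct state state_run(void)  { printf("run\n");  return (struct state){ state_done }; }
static struct state state_done(void) { printf("done\n"); return (struct state){ NULL }; }

int main(void)
{
    struct state s = { state_idle };
    while (s.next)
        s = s.next();                       /* jump straight to the stored handler */
    return 0;
}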
minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
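A minimal sketch of that trade-off, using an illustrative CRC-8 with polynomial 0x07 (the polynomial and function names are just for the example): the 256-entry table is built once during initialisation, after which each input byte costs a single table lookup instead of eight shift/XOR steps.

#include <stdint.h>
#include <stddef.h>

static uint8_t crc8_table[256];

static void crc8_init(void)                 /* run once at start-up */
{
    for (int byte = 0; byte < 256; ++byte) {
        uint8_t crc = (uint8_t)byte;
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07) : (uint8_t)(crc << 1);
        crc8_table[byte] = crc;
    }
}

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--)
        crc = crc8_table[crc ^ *data++];    /* one table lookup per byte */
    return crc;
}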
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform if the only operation that you're doing on a 32 bit value is an increment you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
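A hedged sketch of that kind of assumption-driven shortcut (purely illustrative; measure it against what your compiler emits for a plain uint32_t increment before adopting anything like it): a 32-bit counter kept as four bytes, where the higher bytes are only touched when a lower byte wraps.

#include <stdint.h>

#define INC32_BYTES(b) do {            \
        if (++(b)[0] == 0)             \
            if (++(b)[1] == 0)         \
                if (++(b)[2] == 0)     \
                    ++(b)[3];          \
    } while (0)

static uint8_t tick_count[4];          /* least significant byte stored first */

void on_tick(void)
{
    INC32_BYTES(tick_count);
}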
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. These are wasted CPU cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
As everybody else has said: profile, profile, profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
    int ID;
    int AccountNumber;
    char Name[128];
    char Address[256];
};

struct Customer customers[1000];
Now, lets assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is to split it into two:
struct CustomerData;               /* defined below */

struct CustomerAccount
{
    int ID;
    int AccountNumber;
    struct CustomerData *pData;
};

struct CustomerData
{
    char Name[128];
    char Address[256];
};

struct CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes (on a typical 32-bit system) and so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.
My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.
The most common techniques I have encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them have already been mentioned:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost
For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).
First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression
Avoid using the heap. Use obstacks or pool-allocator for identical sized objects. Put small things with short lifetime onto the stack. alloca still exists.
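A minimal sketch of the pool idea (sizes and names are arbitrary, and a real pool would add alignment guarantees and debugging aids): a free list threaded through the unused slots, so allocation and release are a couple of pointer moves with no heap traffic.

#include <stddef.h>

#define POOL_OBJECTS 128

typedef union slot {
    union slot *next;             /* valid while the slot is on the free list */
    unsigned char payload[64];    /* object storage; 64 bytes chosen arbitrarily */
} slot_t;

static slot_t pool[POOL_OBJECTS];
static slot_t *free_list;

static void pool_init(void)
{
    for (size_t i = 0; i + 1 < POOL_OBJECTS; ++i)
        pool[i].next = &pool[i + 1];
    pool[POOL_OBJECTS - 1].next = NULL;
    free_list = &pool[0];
}

static void *pool_alloc(void)
{
    slot_t *s = free_list;
    if (s) free_list = s->next;   /* pop the head of the free list */
    return s;
}

static void pool_free(void *p)
{
    slot_t *s = p;
    s->next = free_list;          /* push back onto the free list */
    free_list = s;
}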
Premature optimization is the root of all evil!
;)
As my applications usually don't need much CPU time by design, I focus on the size of my binaries on disk and in memory. What I do mostly is looking out for statically sized arrays and replacing them with dynamically allocated memory where it's worth the additional effort of free'ing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and move the initialization to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This will remove the 1024 zero-bytes from the binary's .DATA section and will instead create the buffer on the stack at runtime and then fill it with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)
If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }
Another thing that was not mentioned:
Know your requirements: don't optimize for situations that are unlikely to happen or will never happen; concentrate on the most bang for the buck.
basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools (see the sketch after this list)
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free
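A tiny sketch of the bitfield point above (the flag names are made up): several boolean flags packed into a single byte instead of separate ints.

struct status_flags {
    unsigned char motor_on  : 1;
    unsigned char door_open : 1;
    unsigned char fault     : 1;
    unsigned char reserved  : 5;   /* pad to a full byte */
};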
Difficult to summarize ...
Data structures:
Splitting a data structure according to how it is used is extremely important. It is common to see a structure holding data that is accessed along different control-flow paths; this can significantly hurt cache utilisation.
To take into account cache line size and prefetch rules.
To reorder the members of the structure to obtain a sequential access to them from your code
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
As for the latest processors, it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit whole sections of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce the local working set of your application as much as possible, as processors are trending toward smaller caches per core (a Core 2 Duo enjoyed up to 3 MB per core, where an i7 provides a max of 256 KB per core plus 8 MB shared by all cores on a quad-core die).
The most important of all: measure early, measure often, and never ever make assumptions; base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.
These days, the most important things in optimization are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictability - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There are a lot of branchless math tricks that are useful here (a small sketch follows), but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
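One classic trick of that kind, shown purely as an illustration: selecting between two values without a branch. It only pays off when the condition is genuinely unpredictable; otherwise a plain if/else is clearer and usually just as fast.

#include <stdio.h>

static int select_branchless(int cond, int a, int b)
{
    /* cond must be exactly 0 or 1: -1 is all ones, -0 is all zeros */
    return b ^ ((a ^ b) & -cond);
}

int main(void)
{
    printf("%d %d\n", select_branchless(1, 10, 20), select_branchless(0, 10, 20));
    return 0;   /* prints "10 20" */
}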
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.
Collecting profiles of code execution gets you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or VisualC++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.
Inline functions! Inspired by the profiling fans here I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my application, so I made it inline and voila - the program now uses half of the CPU time it did before.
On most of the embedded systems I have worked on there were no profiling tools, so saying "use a profiler" is nice but not very practical.
First rule in speed optimization is - find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in a generic way how to optimize it; it depends on what you are doing and what is in your power to do. For example, you usually want to avoid memcpy on the critical path, so either you need to use DMA or you optimize. But what if your hardware does not have DMA? Check whether the memcpy implementation is the best one and, if not, rewrite it.
Do not use dynamic allocation at all in embedded code, but if you do for some reason, don't do it on the critical path.
Organize your thread priorities correctly; what "correctly" means is the real question, and it's clearly system specific.
We use very simple tools to analyze the bottlenecks: a simple macro that stores a timestamp and an index. A few (2-3) runs will, in 90% of cases, find where you spend your time.
And the last one, a very important one, is code review. In most cases we catch performance problems during code review; it's a very effective way :)
Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.
Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of your space, you pack structures, e.g. #pragma pack(1), and use bit fields in structures. For more speed you pack to align with the processor's preference and avoid bitfields.
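An illustrative comparison (field names are made up, and the sizes in the comments assume a typical ABI with 4-byte int and 2-byte short; check sizeof() on your own target):

#pragma pack(1)
struct record_packed {      /* optimised for space: no padding, typically 7 bytes */
    char  tag;
    int   value;
    short count;
};
#pragma pack()

struct record_aligned {     /* optimised for speed: natural alignment, typically 8 bytes */
    int   value;
    short count;
    char  tag;              /* 1 byte of tail padding usually follows */
};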
Another trick is picking the right resizing algorithm for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
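A small sketch of one common resizing policy (names are illustrative, and the growth factor is a tunable choice, not a rule): grow the capacity geometrically so that repeated appends trigger only a logarithmic number of reallocations.

#include <stdlib.h>

typedef struct {
    int    *data;
    size_t  len;
    size_t  cap;
} int_vec;

static int vec_push(int_vec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 16;        /* double, starting at 16 */
        int *p = realloc(v->data, new_cap * sizeof *p);
        if (!p) return -1;                                /* old buffer still valid on failure */
        v->data = p;
        v->cap  = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}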
If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.
Great lists. I will just add one tip I didn't see in the lists above that in some cases can yield a huge optimisation for minimal cost.
bypass linker
If you have some application divided in two files, say main.c and lib.c, in many cases you can just add a #include "lib.c" in your main.c. That will completely bypass the linker and allow much more efficient optimisation by the compiler.
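A hypothetical two-file sketch of what that looks like (do_work() is an invented name):

/* lib.c */
static int do_work(void) { return 42; }

/* main.c */
#include <stdio.h>
#include "lib.c"               /* textual inclusion: the linker never sees lib.c separately */

int main(void)
{
    printf("%d\n", do_work()); /* the compiler can now inline do_work() freely */
    return 0;
}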
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.
Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.
I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up; I add items to a hashtable and then use the hashtable to look up the results.
For example you have a string to look up and then 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number (you only have to do this once). Then you look up the target string in the hashtable and have one large switch with all 50 cases (or have function pointers).
When looking up things with common sets of input (like CSS rules), I use fast code to keep track of the only possible solutions and then iterate through those to find a match. Once I have a match I save the results into a hashtable (as a cache) and then use the cached results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )
Abstract:
I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
Introduction:
On an ARM Cortex M4 processor I am using 3 ADCs to sample analog values, with DMA and a "double buffer technique". When I get a "half buffer complete" interrupt, the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
Where Sn stands for Sample# and CHn for Channel Number.
As I do 2x oversampling, n equals 16; the channel count is 9 in reality, though in the example above it is 3.
Or written in 2D form:
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
Where the rows represent the n samples and the columns represent the channels ...
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
Issue:
But the code I am using for "reshaping" the 1-D Buffer array to an accumulated value for each channel is pretty poor and slow:
uint32_t i, j, bP = 0;   /* loop counters and running buffer index (declared here for completeness) */

for (i = 0; i < ADC_BUFFER_SZ; i++) {
    for (j = 0; j < MEAS_ADC_CHANNELS; j++) {
        if (i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
        else   *(ADC_acc + j)  = *(ADC_DMABuffer + bP); // initialize on first run
        bP++;
    }
}
After this procedure I get a 1D array with one (accumulated) U32 value per channel, but this code is pretty slow: ~4000 clock cycles for 16 samples per channel / 9 channels, or ~27 clock cycles per sample. In order to achieve higher sample rates, this needs to be many times faster than it is right now.
Question(s):
What I am looking for is: some elegant way, using the CMSIS-DSP functions, to achieve the same result as above, but much faster. My gut says that I am thinking in the wrong direction, that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, as I guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function "arm_dot_prod_q31" together with an array filled with ones for the accumulation task, because I could not find the CMSIS function which would simply sum up a 1D array. But this would not solve the "reshaping" issue; I would still have to copy data around and create new buffers to prepare the vectors for the "arm_dot_prod_q31" call ...
Besides that it feels somehow awkward using a dot-product, where I just want to sum up array elements …
I also thought about transforming the ADC buffer into a 16 x 9 or 9 x 16 matrix, but then I could not find anything that would let me easily (= fast & elegant) access rows or columns, which would leave me with another issue to solve and would eventually require creating new buffers and copying data around, as I am missing a function where I could multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
ARM is a RISC device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data as physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};   /* accumulators must start at zero */

for (int idxRaw = 0, idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
{
    sums[idxSum++] += raw[idxRaw];
    if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough that you should be able to get it working quickly.
Writing pointer math in your code will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};

int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;

while (itRaw != itRawEnd)
{
    *itSums += *itRaw;
    itRaw++;
    itSums++;
    if (itSums == itSumsEnd) itSums = sums;
}
But almost never, when I am working with a mathematician or scientist, which is often the case with measurement/metrological device development. It's easier to explain the array notation to non-C reviewers, than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and make the determination whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a wary eye on the machine code they generated, and often developed habits that avoided those worst case scenarios. That is uncommon today, and almost certainly not the case for your ARM tool chain. But you may have to look into how your compiler's optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++); // reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2]; // Okay. Doesn't mix math notation with pointer notation.
p1[offset1 + A*B/C] += p2...; // Very bad.
// But...
int offset1 = A*B/C; // Especially helpful when stepping in the debugger.
p1[offset1]... ; // Much better.
Hence the iterator form mentioned earlier. It may reduce the lines of code, but does not reduce the complexity and definitely does increase the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely, universal binding rules across languages. Intentions are obvious, even to non-programmers. While the examples above are pretty trivial and most C programmers would have no problems reading any of them, it's when the number of variables involved and the complexity of the math increases that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it altogether.
Long story short, I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C now and at the same time searching for the existing possibilities to use some GPU power to offload redundant operations from the CPU. However I cannot find a good "guideline" on what exact technology/tools I should pick in my situation. I have read a plethora of docs, and it drains my mental powers very fast. I am not sure if it is possible at all, so I'm puzzled.
Here I've made a very rough sketch of my typical application skeleton that I develop, but given that it uses GPU now (note, I have almost zero practical knowledge about GPU programming). Still important is that data types and functionality must be exactly preserved. Here it is:
So F(A,R,P) is some custom function, for example element substitution, repetition, etc. The function is presumably constant over the program's lifetime, and the result rectangles' shapes are generally not equal to A's shape, so it is not an in-place calculation. The results are simply generated with my functions. Examples of F: repeat rows and columns of A; substitute values with values from substitution tables; compose some tiles into a single array; any math function on A's values, etc. As said, all this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became just unusable after adding several visual features, which are based on numpy arrays. Cython helps to make fast custom functions but then the source code is already kind of a salad.
Question:
Does this schema reflect some (standard) technology/dev tools?
Is CUDA what I am looking for? If yes, some links/examples which coincide with my application structure would be great.
I realise this is a big question, so I will give more details if it helps.
Update
Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indexes, and the data include layers with corresponding bit masks. Masks are the same size as the layers and, say, all layers are the same size (1024^2 pixels = 4 MB for 32 bit values). And my palette is, say, 1024 elements (4 kilobytes for 32 bpp format).
Consider I want to do two things now:
Step 1. I want to flatten all layers into one. Say A1 is the default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:
from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3
Since the data is independent, I believe it must give a speedup proportional to the number of parallel blocks.
Step 2. Now I have an array and want to 'colorize' it with some palette, so it will be my lookup table. As I see it now, there is a problem with simultaneous reads of a lookup table element.
But my idea is, probably one can just duplicate the palette for all blocks, so each block can read its own palette? Like this:
When your code is highly parallel (i.e. there are small or no data dependencies between stages of processing) then you can go for CUDA (more finegrained control over synching) or OpenCL (very similar AND portable OpenGL-like API to interface with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.
One alternative that is often overlooked if you're using Microsoft Visual Studio is C++ AMP, a C++ syntax-friendly and intuitive API for those who do not want to dig into the logic twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, but then you do not have as many options to tweak performance. Still, in a lot of cases, this is a fast and efficient way to write proof-of-concept code and re-implement bits and parts in CUDA or OpenCL later.
OpenMP and Thread Building Blocks are only good alternatives when you have synching issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea on how synch-points can be set up between the different processes in such a way that threads do not starve each-other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVida NSight if you're writing GPU code.
Appendix:
A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax that is very similar to Matlab, but with full support of c/c++/c# or java integration, and cross-compiles (LLVM, CLANG) your "kernel" code to any underlying hardware configuration. It generates CUDA ptx files, or runs on openCL, or even on your CPU using TBB's, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer types (you can also explicitly use strict typing), so you can leave the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing, the system is still w.i.p. and the first OpenCL compiled programs are just being tested, but most important benefit is fast prototyping with almost identical performance compared to optimized cuda.
What you want to do is send values really fast to the GPU using the high frequency dispatch and then display the result of a function which is basically texture lookups and some parameters.
I would say this problem will only be worth solving on the GPU if two conditions are met:
The size of A[] is such that the transfer times become irrelevant (see http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).
The lookup table is not too big and/or the lookup values are organized in a way that the cache can be maximally utilized, in general random lookups on the GPU can be slow, ideally you can pre-load the R[] values in a shared memory buffer for each element of the A[] buffer.
If you can answer both of those questions positively then and only then consider having a go at using the GPU for your problem, else those 2 factors will overpower the computational speed-up that the GPU can provide you with.
Another thing you can have a look at is overlapping the transfer and computation times as best you can, to hide as much as possible the slow transfer rates of CPU->GPU data.
Regarding your F(A, R, P) function, you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know what the value of F(A, R, P)[1] is, because if you do then you need to rewrite F(A, R, P) to work around this issue using some parallelization technique. If you have a limited number of F() functions then this can be solved by writing a parallel version of each F() function for the GPU to use, but if F() is user-defined then your problem becomes a bit trickier.
I hope this is enough information to have an informed guess towards whether you should or not use a GPU to solve your problem.
EDIT
Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.
One last warning: integer operations are not the fastest on the GPU, so consider switching to floating point, as a cheap optimization, after you have implemented your algorithm and it is working.
There is not much difference between OpenCL/CUDA so choose which works better for you. Just remember that CUDA will limit you to the NVidia GPUs.
If I understand your problem correctly, the kernel (the function executed on the GPU) should be simple. It should follow this pseudocode:
kernel main(shared A, shared outA, const struct R, const struct P, const int maxOut, const int sizeA)
int index := getIndex() // get offset in input array
if(index >= sizeA) return // GPU often works better when n of threads is 2^n
int outIndex := index*maxOut // to get offset in output array
outA[outIndex] := F(A[index], R, P)
end
Function F should be inlined, and you can use switch or if for the different functions. Since the size of the output of F is not known, you have to use more memory. Each kernel instance must know the positions for correct memory writes and reads, so there has to be some maximum size (if there is none, then this is all useless and you have to use the CPU!). If different sizes are sparse, then I would use something like computing these different sizes after getting the array back to RAM and compute those few with the CPU, while filling outA with some zeros or indication values.
Sizes of arrays are obviously length(A) * maxOut = length(outA).
I forgot to mention that if the execution of F is not the same in most of the cases (even though the source code is the same), then the GPU will serialize it. GPU multiprocessors have a few cores connected to the same instruction cache, so it will have to serialize code that is not the same for all cores! OpenMP or threads are a better choice for this kind of problem!
I am a student and I will have a presentation at school about arrays. I have this code that should assign a whole array into another array.
#define MAX 10
#include <stdio.h>
typedef struct {
int data[MAX];
} INT_ARR;
int main()
{
int i;
INT_ARR arr1 = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
INT_ARR arr2 = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
arr1 = arr2;
return 0;
}
How can I use this technique in real projects?
What I can do with this?
What are the pros and cons?
Pros
Assignment is simple. You write minimum of code, the compiler does all the work.
It’s hard for you as a programmer to make a mistake – both in correctness and performance. Machines don’t make errors and compilers are much better at optimization than humans.
No function call is needed.
Function calls are slow, generally speaking. Sometimes the optimizer can get rid of them, but getting rid of one while preserving maintainability of the code cannot be bad.
When used as a local variable, the array is stored on the stack.
Allocation is fast, deallocation is fast.
Cons
Assignment is simple. You write minimum of code, the compiler does all the work.
It might not be obvious that such an assignment is quite an expensive operation. Certainly more expensive than a simple int assignment. Confusing your future self or any other contributors to your project is bad.
Typing arr1.data[i] is more annoying than typing arr1[i].
When used as a local variable, the array is stored on the stack.
Therefore it must be really tiny. Otherwise you get stack overflow.
Unlike C99 variable-length arrays, these arrays have their length defined at compile time.
You rarely need this.
Tiny arrays of fixed width with the ability of assignment en masse are not very useful in practice. If you come across their real-world application, please ping me in a comment with its description; I will be interested.
I’d like to write more about the last point.
Basically you have d-dimensional vectors for a fixed d, known at compile time. This could be useful if d could change in a future version of the program, but I guess that this is an extremely unlikely scenario and that plain structs would be better in such an application. By plain structs I mean structs used not only as a minimal wrapper of a single array; sure they can still contain an array as a member.
What you need more is the ability to copy parts of an array and to assign them separately. The memcpy() function serves that purpose. You can use it even with the arrays inside structs, like the ones in this question.
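A small sketch of that (plain arrays here, but the same call works on arrays that are struct members):

#include <string.h>

void copy_prefix(void)
{
    int src[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int dst[10] = {0};

    /* copy only the first five elements of src into dst */
    memcpy(dst, src, 5 * sizeof dst[0]);
}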
Another way to look at array assignment is through pointers. Sometimes you don’t need to have two distinct chunks of memory, you need just several names for one. In such a scenario, you should use pointers to the memory and thus avoid copying the array. Pointer assignment is fast and simple.
Miscellaneous remarks to the code
The initializers should have double braces:
INT_ARR arr2 = {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}};
The outer ones are for the struct, the inner ones for the array.
In your code, there is an unused int i. I guess it is a relic from testing your code.
In C, you should use int main(void) instead of int main(). The latter is deprecated syntax and may be removed in future versions of C. In function declarations (not definitions), empty parentheses mean “I don’t say anything about parameters”. Using this instead of explicit void can lead to very unpleasant surprises.
You should choose more descriptive and preferably English identifiers. Even if you’re not a native English speaker. Thus you enable others to collaborate with you. I edited your question to improve the naming convention.
There is no "one true answer". It depends what you want to do. In fact, you probably won't use this in real programming except for some basic school projects.
As Sam said, you should use memcpy(). The only con I can think of is code performance. Find out more here.
I'm trying to find some effective techniques which I can base my integer-overflow detection tool on. I know there are many ready-made detection tools out there, but I'm trying to implement a simple one on my own, both for my personal interest in this area and also for my knowledge.
I know techniques like Pattern Matching and Type Inference, but I read that more complicated code analysis techniques are required to detect the int overflows. There's also the Taint Analysis which can "flag" un-trusted sources of data.
Is there some other technique, which I might not be aware of, which is capable of detecting integer overflows?
It may be worth trying the cppcheck static analysis tool, which claims to detect signed integer overflow as of version 1.67:
New checks:
- Detect shift by too many bits, signed integer overflow and dangerous sign conversion
Notice that it supports both C and C++ languages.
There is no overflow check for unsigned integers, as by the Standard unsigned types never overflow.
Here is some basic example:
#include <stdio.h>
int main(void)
{
int a = 2147483647;
a = a + 1;
printf("%d\n", a);
return 0;
}
With such code it gets:
$ ./cppcheck --platform=unix64 simple.c
Checking simple.c...
[simple.c:6]: (error) Signed integer overflow for expression 'a+1'
However I wouldn't expect too much from it (at least with the current version), as a slightly different program:
int a = 2147483647;
a++;
passes without noticing overflow.
It seems you are looking for some sort of Value Range Analysis, and detect when that range would exceed the set bounds. This is something that on the face of it seems simple, but is actually hard. There will be lots of false positives, and that's even without counting bugs in the implementation.
To ignore the details for a moment, you associate a pair [lower bound, upper bound] with every variable, and do some math to figure out the new bounds for every operator. For example if the code adds two variables, in your analysis you add the upper bounds together to form the new upper bound, and you add the lower bounds together to get the new lower bound.
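A tiny sketch of that bookkeeping (a toy model, not a usable analyzer; a real one must also notice when the bounds themselves overflow): each variable carries a [lo, hi] pair and every operator combines the operand bounds.

typedef struct { long long lo, hi; } interval;

static interval interval_add(interval a, interval b)
{
    interval r = { a.lo + b.lo, a.hi + b.hi };   /* lower bounds add, upper bounds add */
    return r;
}

static interval interval_neg(interval a)
{
    interval r = { -a.hi, -a.lo };               /* negation swaps and flips the bounds */
    return r;
}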
But of course it's not that simple. Firstly, what if there is non-straight-line code? if's are not too bad, you can just evaluate both sides and then take the union of the ranges after it (which can lose information! if two ranges have a gap in between, their union will span the gap). Loops require tricks, a naive implementation may run billions of iterations of analysis on a loop or never even terminate at all. Even if you use an abstract domain that has no infinite ascending chains, you can still get into trouble. The keywords to solve this are "widening operator" and (optionally, but probably a good idea) "narrowing operator".
It's even worse than that, because what's a variable? Your regular local variable of scalar type that never has its address taken isn't too bad. But what about arrays? Now you don't even know for sure which entry is being affected - the index itself may be a range! And then there's aliasing. That's far from a solved problem and causes many real world tools to make really pessimistic assumptions.
Also, function calls. You're going to call functions from some context, hopefully a known one (if not, then it's simple: you know nothing). That makes it hard; not only is there suddenly a lot more state to keep track of at the same time, there may be several places a function could be called from, including itself. The usual response to that is to re-evaluate that function when the range of one of its arguments has been expanded; once again this could take billions of steps if not done carefully. There are also algorithms that analyze a function differently for different contexts, which can give more accurate results, but it's easy to spend a lot of time analyzing contexts that aren't different enough to matter.
Anyway if you've made it this far, you could read Accurate Static Branch Prediction by Value Range Propagation and related papers to get a good idea of how to actually do this.
And that's not all. Considering only the ranges of individual variables without caring about the relationships between them (keyword: non-relational abstract domain) does badly on really simple (for a human reader) things such as subtracting two variables that are always close together in value, for which it will produce a large range, under the assumption that they may be as far apart as their bounds allow. Even for something trivial such as
// assume x in [0 .. 10]
int y = x + 2;
int diff = y - x;
For a human reader, it's pretty obvious that diff = 2. In the analysis described so far, the conclusions would be that y in [2 .. 12] and diff in [-8, 12]. Now suppose the code continues with
int foo = diff + 2;
int bar = foo - diff;
Now we get foo in [-6, 14] and bar in [-18, 22]; even though bar is obviously 2 again, the range has doubled again. Now this was a simple example, and you could make up some ad-hoc hacks to detect it, but it's a more general problem. This effect tends to blow up the ranges of variables quickly and generate lots of unnecessary warnings. A partial solution is assigning ranges to differences between variables; then you get what's called a difference-bound matrix (unsurprisingly this is an example of a relational abstract domain). They can get big and slow for interprocedural analysis, or if you want to throw non-scalar variables at them too, and the algorithms start to get more complicated. And they only get you so far - if you throw a multiplication in the mix (that includes x + x and variants), things still go bad very fast.
So you can throw something else in the mix that can handle multiplication by a constant, see for example Abstract Domains of Affine Relations - this is very different from ranges, and won't by itself tell you much about the ranges of your variables, but you could use it to get more accurate ranges.
The story doesn't end there, but this answer is getting long. I hope this does not discourage you from researching this topic, it's a topic that lends itself well to starting out simple and adding more and more interesting things to your analysis tool.
Checking integer overflows in C:
When you add two 32-bit numbers and get a 33-bit result, the lower 32 bits are written to the destination, with the highest bit signaled out as a carry flag. Many languages including C don't provide a way to access this 'carry', so you can use limits, i.e. <limits.h>, to check before you perform an arithmetic operation. Consider unsigned ints a and b:
if MAX - b < a, we know for sure that a + b would cause an overflow. An example is given in this C FAQ.
Watch out: As chux pointed out, this example is problematic with signed integers, because it won't handle MAX - b or MIN + b if b < 0. The example solution in the second link (below) covers all cases.
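A minimal sketch of that pre-check for unsigned ints only (the demo values assume a 32-bit unsigned int; signed operands need the extra cases mentioned above):

#include <limits.h>
#include <stdio.h>

static int add_would_overflow(unsigned int a, unsigned int b)
{
    return UINT_MAX - b < a;    /* true => a + b would wrap around */
}

int main(void)
{
    unsigned int a = 4000000000u, b = 500000000u;
    if (add_would_overflow(a, b))
        printf("a + b would overflow\n");
    return 0;
}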
Multiplying numbers can cause an overflow, too. A solution is to double the length of the first number, then do the multiplication. Something like:
(typecast)a*b
Watch out: (typecast)(a*b) would be incorrect because it truncates first then typecasts.
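A small sketch of the widening idea for unsigned 32-bit operands (casting before the multiplication, as warned above):

#include <stdint.h>

static int mul_would_overflow_u32(uint32_t a, uint32_t b)
{
    uint64_t wide = (uint64_t)a * b;   /* widen first, then multiply exactly */
    return wide > UINT32_MAX;          /* true => a * b does not fit in 32 bits */
}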
A detailed technique for C can be found HERE. Using macros seems to be an easy and elegant solution.
I'd expect Frama-C to provide such a capability. Frama-C is focused on C source code, but I don't know if it is dialect-sensitive or specific. I believe it uses abstract interpretation to model values. I don't know if it specifically checks for overflows.
Our DMS Software Reengineering Toolkit has a variety of language front ends, including most major dialects of C. It provides control and data flow analysis, and also abstract interpretation for computing ranges, as foundations on which you can build an answer. My Google Tech Talk on DMS at about 0:28:30 specifically talks about how one can use DMS's abstract interpretation on value ranges to detect overflow (of an index on a buffer). A variation on checking the upper bound on array sizes is simply to check for values not exceeding 2^N. However, off the shelf DMS does not provide any specific overflow analysis for C code. There's room for the OP to do interesting work :=}