How Switch Statement Works - c

How does a switch statement immediately drop to the correct location in memory? With nested if-statements, it has to perform comparisons with each one, but with a switch statement it goes directly to the correct case. How is this implemented?

There are many different ways to compile a switch statement into machine code. Here are a few:
The compiler can produce a series of tests, which is not as inefficient as it sounds: arranged as a binary search, only about log2(N) tests are needed to dispatch a value among N possible cases.
The compiler can produce a table of values and jump addresses, which in turn will be used by generic lookup code (linear or dichotomic, similar to bsearch()) and finally jump to the corresponding location.
If the case values are dense enough, the compiler can generate a table of jump addresses, plus code that checks whether the switch value lies within a range encompassing all case values and then jumps directly to the corresponding address. This is probably the implementation closest to your description: "but with a switch statement it goes directly to the correct case."
Depending on the specific abilities of the target CPU, compiler settings, and the number and distribution of case values, the compiler might use one of the above approaches or another, or a combination of them, or even some other methods.
Compiler designers spend a great deal of effort trying to improve heuristics for these choices. Look at the assembly output or use an online tool such as Godbolt's Compiler Explorer to see various code generation possibilities.
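The jump-table approach can be imitated by hand in C, which may make the idea more concrete. The sketch below is only an illustration, not what any particular compiler emits; dispatch_switch, dispatch_table and the handler functions are hypothetical names. It shows a single range check followed by an indexed lookup in a table of addresses.

```c
#include <stddef.h>

/* Hypothetical handlers, one per case value 0..3. */
static int handle_zero(void)    { return 100; }
static int handle_one(void)     { return 101; }
static int handle_two(void)     { return 102; }
static int handle_three(void)   { return 103; }
static int handle_default(void) { return -1; }

/* Ordinary switch: with dense case values a compiler may lower this to a
   range check plus a jump table, but it is free to choose otherwise. */
int dispatch_switch(int v)
{
    switch (v) {
    case 0:  return handle_zero();
    case 1:  return handle_one();
    case 2:  return handle_two();
    case 3:  return handle_three();
    default: return handle_default();
    }
}

/* Hand-written equivalent: one bounds check, then an indexed call. */
int dispatch_table(int v)
{
    static int (*const table[])(void) = {
        handle_zero, handle_one, handle_two, handle_three
    };
    const size_t n = sizeof table / sizeof table[0];

    /* Converting to unsigned folds "v < 0" and "v >= n" into one comparison. */
    if ((unsigned)v >= n)
        return handle_default();
    return table[v]();
}
```

Both functions return the same result for every input; the table version just spells out the range check and the indexed jump that the switch leaves to the compiler.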

Related

Does Intel MKL or some similar library provide a vectorized way to count the number of elements in an array fulfilling some condition in C?

The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of (tens of millions) of coefficients. Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using c++, count_if from the algorithms library with an execution policy of par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in c. Because c doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count()-function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread parallelized) scalar code. OTOH, if you use for example gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use c++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
For example, gcc vectorizes the code on x86 if at least sse4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you can get wider vectors to do more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in the provided code, the conditions should not use short-circuiting or (||), because then the read from the max-vector is semantically forbidden if the comparison with the min-vector was already true for the current element, severely hindering vectorization (though avx512 could potentially vectorize this with somewhat catastrophic slowdown).
I'm pretty sure gcc is not nearly optimal in the code it generates for avx512, since it could do the or in the k-regs (mask registers) with kor[b/w/d/q], but maybe somebody with more experience in avx512 (*cough* Peter Cordes *cough*) could weigh in on that.
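A minimal sketch of the kind of loop discussed above, assuming per-element limit arrays as described in the question. The function name count_out_of_bounds and its parameter names are hypothetical. It uses the non-short-circuiting | so that both limit arrays are read unconditionally.

```c
#include <stddef.h>

/* Counts the elements of x that fall outside [lo[i], hi[i]].
   count_out_of_bounds and its parameter names are hypothetical. */
size_t count_out_of_bounds(const float *restrict x,
                           const float *restrict lo,
                           const float *restrict hi,
                           size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Bitwise | instead of ||: both comparisons are always evaluated,
           so every load is unconditional and the loop body has no branch. */
        count += (size_t)((x[i] < lo[i]) | (x[i] > hi[i]));
    }
    return count;
}
```

Whether this actually vectorizes depends on the compiler and flags. With gcc, something like -O3 -march=native together with -fopt-info-vec should report which loops were vectorized.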
MKL doesn't provide such functions, but you could check another performance library, IPP, which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details - https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html

Compiler Hints and Semantics for Optimizations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I spent the last couple of weeks optimizing a numerical algorithm. Through a combination of precomputation, memory alignment, compiler hints and flags, and trial and error experimentation, I brought the run-time down over an order of magnitude. I have not yet explicitly vectorized using intrinsics or used multi-threading.
Frequently when working on this type of problem, there is an initialization routine, after which many parameters become constant. These might be filter lengths, the expression of a switch statement, a for loop length or an iteration increment. If the parameters were known at compile time, the compiler should be able to do a much more effective job of optimization by knowing exactly how to unroll loops, replace index calculations with instructions that have the offset encoded in the instruction, simplify or eliminate expressions at compile time, possibly eliminate switch statements, etc. The most extreme way of dealing with this problem would be to run the initialization routine (at run-time), then run the compiler on the critical function to be optimized using some kind of plugin that allows iteration over the abstract syntax tree, replace the parameters with constants, and finally dynamically link to the shared object. If the routine is short, it could be dynamically compiled inside the binary using a number of tools.
More practically, I rely very heavily on alignment, gcc __builtin_assume_aligned, restrict, manual loop unrolling, and compiler flags to get the compiler to do what I want given the unknown value of parameters at compile time. I'm wondering what other options are available to me that are at least close to portable. I only use intrinsics as a last resort since it's not portable and a lot of work. Specifically, how can I provide the compiler (gcc) with additional information concerning loop variables using either language semantics, compiler extensions, or external tools so it can do a better job of doing optimizations for me. Similarly is there any way to qualify variables as having a stride so that loads and stores are always aligned, thus more easily enabling the auto-vectorization and loop unrolling process.
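As a rough illustration of the hints mentioned above, the following sketch combines restrict with gcc's __builtin_assume_aligned. The function scaled_add, the macro ASSUME_ALIGNED and the 32-byte alignment are assumptions made for the example, not taken from the original code.

```c
#include <stddef.h>

/* Fall back to a no-op where the GCC/Clang builtin is unavailable. */
#if defined(__GNUC__) || defined(__clang__)
#define ASSUME_ALIGNED(p, n) __builtin_assume_aligned((p), (n))
#else
#define ASSUME_ALIGNED(p, n) (p)
#endif

/* Computes y[i] += a * x[i].
   The caller promises that x and y do not overlap (restrict) and that both
   start on a 32-byte boundary. Breaking either promise is undefined behaviour. */
void scaled_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    float *yp = ASSUME_ALIGNED(y, 32);
    const float *xp = ASSUME_ALIGNED(x, 32);

    for (size_t i = 0; i < n; ++i)
        yp[i] += a * xp[i];
}
```

The promises only remove obstacles. Whether the compiler then emits aligned vector loads still has to be checked in the assembly listing.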
These issues come up frequently, so I am hoping there is some more elegant method of solving them. What follows are examples of the kind of problems I hand optimize but I believe the compiler ought to be able to do for me. These are not intended to be further questions.
Sometimes you have a filter, the length of which is not a multiple of the length of the longest SIMD register, and there may be memory alignment issues as well. In this case I either (A) unroll the loop by a multiple of the vector register and call into the unoptimized code for the epilogue/prologue or (B) pad the start or end of the filter with zeros. I've recently learned gcc and other compilers have the ability to peel loops. From the limited documentation I've been able to find, I believe the finest grain control you have over peeling is over entire functions (rather than individual loops) using compiler directives. Further, there are some parameters you can provide, but it's mostly just an upper or lower bound on the amount of unrolling or number of instructions produced.
In order to really know the best method of unrolling/peeling or zero padding, the compiler needs to know something about the length of the loop and/or the size of the increment. For example, it would be very helpful to know that a loop is likely to have a length greater than a million or less than 100. It would be helpful to know that the loop will always run either 32 or 34 times. In fact, since the compiler knows much more about the computer architecture than I do, it would be much better if it made all the unrolling decisions based on information I provide about the loop variables. I had a situation where I wanted the compiler to unroll a loop. I specifically gave it the #pragma GCC optimize ("unroll-loops") directive. However, what it required to work was also the statement N &= ~7, thus informing the compiler that the loop length was a multiple of 8. This is not a semantic feature of the language, and it does not have the effect of changing the value of N. It was strictly to inform the static analyzer of the compiler that the loop was already a multiple of the length of the AVX register. In this case I was lucky and it worked because gcc is very clever. But in other cases, my hints don't seem to work (or they do, but there is no compiler feedback to let me know the additional information was of no value). In one case I had to explicitly tell the compiler not to unroll the loop because the outer loop was very short and the overhead was not worth it. With the optimizer on the maximum setting, often the only way to know what's going on is to look at the assembly listing, make some changes, and try again.
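The N &= ~7 trick described above might look like the sketch below. The function scale_multiple_of_8 is a hypothetical stand-in for the real loop, and whether a given gcc version actually exploits the hint is not guaranteed.

```c
#include <stddef.h>

/* Scales the first n floats of x by a.
   The caller is expected to pass an n that is already a multiple of 8. */
void scale_multiple_of_8(float *x, float a, size_t n)
{
    /* A no-op when n is a multiple of 8. It only lets the optimizer see
       that n % 8 == 0, so it may drop the remainder loop. If the caller
       breaks the expectation, the last n % 8 elements are left untouched. */
    n &= ~(size_t)7;

    for (size_t i = 0; i < n; ++i)
        x[i] *= a;
}
```

The cost of the hint is that a wrong length is silently truncated rather than diagnosed, so it is only safe where the multiple-of-8 property really is guaranteed.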
In another situation I carefully unrolled a loop so the compiler would use the AVX registers. The manual unrolling was probably necessary as the compiler doesn't have sufficient information about the length of the loop or that the length was of a particular multiple. Unfortunately, the inner loop was accessing an unaligned array of floats of length four per group (16 byte alignment). The compiler was using only the legacy 128 bit XMM registers. After making a weak attempt to vectorize using AVX intrinsics, I discovered the extra overhead of the unaligned access made the performance no better than what gcc was doing. So I thought I could align each group of floats on the start of a cache line and use a stride equal to the cache line length (or half, which is the length of an AVX register) to eliminate the alignment problem. However, this may turn out to be ineffective due to the extra memory bandwidth. It's certainly more work on my part. It makes the code harder to understand. And, at the very least, getting the stride right would depend on compile time constants I would need to supply. I wonder if there is some simpler way to do this relying on the compiler to do all the work? I would be willing to try it if it meant only changing a line or two of code. It's not worth it if I have to do it manually (in this case anyway). (Thinking about it as I write this, I may be able to use a union or struct with 48 bytes of padding and a few extra lines of code. I would have to give that some thought...)

In c: do internal states improve speed?

Let's say I have a function with two parameters that is called repeatedly. Does it increase the memory usage when you have functions with arguments?
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
I believe this is sometimes referred to as 'internal state', but my question is which of the two options will perform faster?
EDIT:
Your answers are all enlightening, allow me to clarify all at once.
It seems logical that
x = x + 10
would be faster than:
x = x + y
And I'm not talking about the time it takes to define and initialize y, I am just talking about the operation itself. Logically, in the second case there must be some extra step in which the CPU must find y before performing the operation. When you amplify this with functions and then multiply it over and over, I would assume this would make a significant difference.
And yes, in my case it applies to physics, and the speed will likely be felt.
PS I am very interested in compiler functionality and debating learning assembler.
Parameters are typically passed on the stack so they don't take up more memory.
Parameters may be "un-noticeably" slower because the values may be copied to the stack (depends on how good the compiler is at optimizing).
The compiler is way smarter than you are, so don't try to outsmart the compiler. Write clear code and let the compiler worry about performance.
re: your edit
"it depends"
Does your processor have a different instruction to add 10 to a variable?
What sort of addressing modes does it support?
Regardless of the answers to the above, does the compiler make use of all the processor's features which might squeeze out every drop of performance?
e.g. - The good old 68000 chips had an "INC" opcode to increment a register by 1. It was much faster than other methods. If you were hand rolling assembly the fastest way to do x = x + 10 might have been to call INC 10 times...
I've worked with time constrained real time embedded apps and never had to worry about this level of optimization. I'd write the code and worry about performance if/when it becomes an issue.
If the repetitive call is made with compile-time parameters, then you can indeed improve performance by "instantiating" a special version of the same function for the given set of compile-time parameters. In such cases the function will not even have a "state": the parameter values will essentially be embedded into the function code. Some compilers can do it implicitly.
The amount of improvement will depend on the nature of the function. In each given version of the function the entire blocks of code might be easily recognized as unreachable and eliminated entirely. One can also say that function inlining by nature involves the same kind of optimization.
Obviously, using such optimizations thoughtlessly might easily lead to a combinatorial explosion of the number of different versions of the same function (i.e. to code bloat), so it should be used with care.
(BTW, this is very similar to what C++ function templates with non-type template parameters do.)
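In C, one way to get such "instantiated" versions by hand is a static inline kernel plus thin wrappers that pass constants. This is only a sketch under assumed names (weighted_sum and the two wrappers are hypothetical), and the compiler is not obliged to inline or fold anything.

```c
#include <stddef.h>

/* Generic kernel: step and scale are ordinary run-time parameters.
   step must be non-zero, otherwise the loop never terminates. */
static inline float weighted_sum(const float *x, size_t n,
                                 size_t step, float scale)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i += step)
        total += scale * x[i];
    return total;
}

/* "Instantiated" versions: the constants are visible at the call site,
   so after inlining the compiler can fold them into the loop. */
float weighted_sum_step1_unit(const float *x, size_t n)
{
    return weighted_sum(x, n, 1, 1.0f);
}

float weighted_sum_step2_half(const float *x, size_t n)
{
    return weighted_sum(x, n, 2, 0.5f);
}
```

Each extra wrapper is another copy of the loop in the binary, which is the code bloat the answer warns about.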
If the repetitive call is made with run-time parameters, then pre-saving them in a run-time state might not achieve any significant improvement. Retrieving parameter values from some "state" is not necessarily more efficient than retrieving them from the "regular" function parameters.
Of course, there are such classic techniques as packing multiple function parameters into a struct object and passing such struct object to the function (instead of passing a large number of independent parameters). If the parameters remain unchanged between multiple calls, then this does improve overall performance by saving time on parameter preparation. But whether to call such struct object a "state" or not is a different question. It is definitely a manual technique, not something done by the compiler and involving any "internal state".
Does it increase the memory usage when you have functions with arguments?
No, function arguments are passed on the stack (or in registers if x64 calling convention).
Would it be faster to generate a function for each repetitive case, and call that function with no parameters?
No, your compiler should optimize it for you; there's no need to make your code less readable.

Compare two matching strings algorithms - Practical approach

I wrote two different algorithms that solve a particular case of string matching (implemented in C). I know that the theoretical complexity (big-O) of these algorithms is the same, but I think that in practice one is better than the other.
My question is: could someone recommend a paper or some reading that shows how to compare algorithms with a practical approach?
I have several test sets, and I'm interested in measuring execution time and memory usage. I need to take these measurements as independently as possible of the operating system and of other programs that could be running concurrently.
Thanks!!!
You could compare your algorithms by generating the assembly code for each and comparing the two.
You could generate the assembly code with the gcc -S mycode.c command
I find that "looking at the code" is a good start. If one uses more variables and is more complicated than the other, it is probably slower.
However, there are of course clever tricks that can make a more complicated function actually run faster, for example code that reads 8 bytes at a time. Such code is more complex, particularly once it finds a difference, but for long strings that are largely similar there is a big win.
So, in the end, there is no substitute for actually running the code, using clock-cycle timing (RDTSC instruction on x86 processors, for example), or running a large loop to execute the code many times to give a reasonable length runtime.
If your code isn't supposed to run on a single embedded target, you probably want to run the code on a set of different hardware to determine if the code that is faster on processor A is also faster on B, C and D type processors. Often this does work, but sometimes you can find that a particular processor model is faster for SOME operations, and another is faster for another (for example based on cache-size, etc).
It would also be very important, in the case of string operations, to try with different size inputs, different points of difference (e.g. a long string, but different "early", vs. long string with difference "late"). Sometimes, the different approaches will show different results for short/long strings or early/late point of difference (and of course "equal" strings that are long or short).
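The "run it many times and time it" approach described above can be sketched as follows. The two candidates and the helper time_match are hypothetical stand-ins for the real algorithms, and clock() measures CPU time at a fairly coarse resolution, so the repetition count has to be large enough to matter.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by the candidate implementations (hypothetical). */
typedef int (*match_fn)(const char *a, const char *b, size_t len);

/* Baseline candidate: thin wrapper around memcmp. */
int match_memcmp(const char *a, const char *b, size_t len)
{
    return memcmp(a, b, len) == 0;
}

/* Second candidate: plain byte-by-byte loop. */
int match_loop(const char *a, const char *b, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}

/* Runs fn reps times and returns the CPU seconds consumed.
   The number of matches is added to *sink so the results are used. */
double time_match(match_fn fn, const char *a, const char *b,
                  size_t len, unsigned reps, unsigned *sink)
{
    /* Calling through a volatile pointer forces a real call on every
       iteration, so the optimizer cannot hoist the work out of the loop. */
    volatile match_fn call = fn;
    unsigned hits = 0;

    clock_t start = clock();
    for (unsigned r = 0; r < reps; ++r)
        hits += (unsigned)call(a, b, len);
    clock_t stop = clock();

    *sink += hits;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

To follow the advice about input shapes, time_match would be called for each candidate with short and long strings, and with the first difference placed early, late, or not at all.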
To round out the comments: I found a book called "A Guide to Experimental Algorithmics" by Catherine C. McGeoch (Amazon), and a professor recommended a practical paper (pdf) to me.

Nested if or switch case statement

From the point of view of optimizing code run time, is there a rule of thumb for when to use "nested if" statements and when to use "switch case" statements?
I doubt you will ever find a real-life application where the difference between a nested if and a switch case is even worth measuring. Disk access, web access, etc. take many many orders of magnitude more time.
Choose what is easiest to read and debug.
Also see What is the difference between IF-ELSE and SWITCH? (possible duplicate) as well as Advantage of switch over if-else statement. Interestingly, a proponent of switch writes
"In the worst case the compiler will generate the same code as a if-else chain, so you don't lose anything. If in doubt put the most common cases first into the switch statement.
In the best case the optimizer may find a better way to generate the code. Common things a compiler does is to build a binary decision tree (saves compares and jumps in the average case) or simply build a jump-table (works without compares at all)."
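The binary decision tree mentioned in the quote can be written out by hand for comparison. This is a sketch with made-up case values and hypothetical function names; a real compiler chooses its own pivots and may generate something quite different.

```c
/* Sparse case values: a jump table covering them would be mostly empty. */
int classify_switch(int v)
{
    switch (v) {
    case 3:      return 1;
    case 17:     return 2;
    case 250:    return 3;
    case 1024:   return 4;
    case 5000:   return 5;
    case 70000:  return 6;
    case 900000: return 7;
    default:     return 0;
    }
}

/* Hand-written binary decision tree over the same values: at most three
   ordering tests and one equality test, instead of up to seven
   sequential equality tests in an if-else chain. */
int classify_tree(int v)
{
    if (v < 1024) {
        if (v < 17)
            return v == 3 ? 1 : 0;
        if (v < 250)
            return v == 17 ? 2 : 0;
        return v == 250 ? 3 : 0;
    }
    if (v < 70000) {
        if (v < 5000)
            return v == 1024 ? 4 : 0;
        return v == 5000 ? 5 : 0;
    }
    if (v < 900000)
        return v == 70000 ? 6 : 0;
    return v == 900000 ? 7 : 0;
}
```

Both functions return the same value for every input. Comparing the assembly of classify_switch with that of classify_tree shows what a given compiler actually did with the switch.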
If you have more than 2-3 comparisons, then use "switch"; otherwise use "if".
Also try to apply a pattern such as Strategy before you go to a switch.
I don't believe it will make any difference for a decision structure that could be implemented using either method. It's highly likely that your compiler would produce the same instructions in the executable.
