literal division at compile time - c

Assume the following code:
static int array[10];
int main ()
{
    for (int i = 0; i < (sizeof(array) / sizeof(array[0])); i++)
    {
        // ...
    }
}
The result of sizeof(array) / sizeof(array[0]) should in theory be known at compile time and set to some value depending on the size of the int. Even though, will the compiler do the manual division in run time each time the for loop iterates?
To avoid that, does the code need to be adjusted as:
static int array[10];
int main ()
{
    static const int size = sizeof(array) / sizeof(array[0]);
    for (int i = 0; i < size; i++)
    {
        // ...
    }
}

You should write the code in whatever way is most readable and maintainable for you. (I'm not making any claims about which one that is: it's up to you.) The two versions of the code you wrote are so similar that a good optimizing compiler should probably produce equally good code for each version.
You can click on this link to see what assembly your two different proposed codes generate in various compilers:
https://godbolt.org/z/v914qYY8E
With GCC 11.2 (targeting x86_64) and with minimal optimizations turned on (-O1), both versions of your main function have the exact same assembly code. With optimizations turned off (-O0), the assembly is slightly different but the size calculation is still done at compile time for both.
Even if you doubt what I am saying, it is still better to use the more readable version as a starting point. Only change it to the less readable version if you find an actual example of a programming environment where doing that would provide a meaningful speed increase for your application. Avoid wasting time with premature optimization.

Even though, will the compiler do the manual division in run time each time the for loop iterates?
No. It's an integer constant expression which will be calculated at compile-time. Which is why you can even do this:
int some_other_array [sizeof(array) / sizeof(array[0])];
To avoid that, does the code need to be adjusted as
No.
See for yourself: https://godbolt.org/z/rqv15vW6a. Both versions produced 100% identical machine code, each one containing a mov ebx, 10 instruction with the pre-calculated value.

Related

What are the various ways code may be 'vectorized'?

I usually hear the term vectorized functions in one of two ways:
In a very high-level language when the data is passed all-at-once (or at least, in bulk chunks) to a lower-level library that does the calculations in faster way. An example of this would be python's use of numpy for array/LA-related stuff.
At the lowest level, when using a specific machine instruction or procedure that makes heavy use of them (such as YMM, ZMM, XMM register instructions).
However, the term seems to be used quite loosely, and I wanted to know if there's a third way (or even more) in which it's used. One candidate would be simply passing multiple values to a function rather than one (usually done via an array), for example:
// non-'vectorized'
#include <stdio.h>
int squared(int num) {
    return num*num;
}
int main(void) {
    int nums[] = {1,2,3,4,5};
    for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
        int n_squared = squared(nums[i]);
        printf("%d^2 = %d\n", nums[i], n_squared);
    }
}
// 'vectorized'
#include <stdio.h>
void squared(int num[], int size) {
    for (int i=0; i<size; i++) {
        *(num +i) = num[i] * num[i];
    }
}
int main(void) {
    int nums[] = {1,2,3,4,5};
    squared(nums, sizeof(nums)/sizeof(*nums));
    for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
        printf("Squared=%d\n", nums[i]);
    }
}
Is the above considered 'vectorized code'? Is there a more formal/better definition of what makes something vectorized or not?
Vectorized code, in the context you seem to be referring to, normally means "an implementation that happens to make use of Single Instruction Multiple Data (SIMD) hardware instructions".
This can sometimes mean that someone manually wrote a version of a function that is equivalent to the canonical one, but happens to make use of SIMD. More often than not, it's something that the compiler does under the hood as part of its optimization passes.
In a very high-level language when the data is passed all-at-once (or at least, in bulk chunks) to a lower-level library that does the calculations in faster way. An example of this would be python's use of numpy for array/LA-related stuff.
That's simply not correct. The process of handing off a big chunk of data to some block of code that goes through it quickly is not vectorization in and of itself.
You could say "Now that my code uses numpy, it's vectorized" and be sort of correct, but only transitively. A better way to put it would be "Now that my code uses numpy, it runs a lot faster because numpy is vectorized under the hood." Importantly though, not all fast libraries to which big chunks of data are passed at once are vectorized.
...Code examples...
Since there is no SIMD instruction in sight in either example, neither is vectorized yet. It might be true that the second version is more likely to lead to a vectorized program. If that's the case, then we'd say that the second program is more vectorizable than the first. However, the program is not vectorized until the compiler makes it so.

Optimizing C code, Horner's polynomial evaluation

I'm trying to learn how to optimize code (I'm also learning C), and in one of my books there's a problem about optimizing Horner's method for evaluating polynomials. I'm a little lost on how to approach the problem. I'm not great at recognizing what needs optimizing.
Any advice on how to make this function run faster would be appreciated.
Thanks
double polyh(double a[], double x, int degree) {
    long int i;
    double result = a[degree];
    for (i = degree-1; i >= 0; i--)
        result = a[i] + x*result;
    return result;
}
You really need to profile your code to test whether proposed optimizations really help. For example, it may be the case that declaring i as long int rather than int slows the function on your machine, but on the other hand it may make no difference on your machine but might make a difference on others, etc. Anyway, there's no reason to declare i a long int when degree is an int, so changing it probably won't hurt. (But still profile!)
Horner's rule is supposedly optimal in terms of the number of multiplies and adds required to evaluate a polynomial, so I don't see much you can do with it. One thing that might help (profile!) is changing the test i>=0 to i!=0. Of course, then the loop doesn't run enough times, so you'll have to add a line below the loop to take care of the final case.
Alternatively you could use a do { ... } while (--i) construct. (Or is it do { ... } while (i--)? You figure it out.)
You might not even need i, but using degree instead will likely not save an observable amount of time and will make the code harder to debug, so it's not worth it.
Another thing that might help (I doubt it, but profile!) is breaking up the arithmetic expression inside the loop and playing around with order, like
for (...) {
    result *= x;
    result += a[i];
}
which may reduce the need for temporary variables/registers. Try it out.
Some suggestions:
You may use int instead of long int for looping index.
Almost certainly the problem is inviting you to conjecture on the values of a. If that vector is mostly zeros, then you'll go faster (by doing fewer double multiplications, which will be the clear bottleneck on most machines) by computing only the values of a[i] * x^i for a[i] != 0. In turn the x^i values can be computed by careful repeated squaring, preserving intermediate terms so that you never compute the same partial power more than once. See the Wikipedia article if you've never implemented repeated squaring.

Macros for 3D loops in C

I'm developing a C (C99) program that loops heavily over 3-D arrays in many places. So naturally, the following access pattern is ubiquitous in the code:
for (int i=0; i<i_size; i++) {
    for (int j=0; j<j_size; j++) {
        for (int k=0; k<k_size; k++) {
            ...
        }
    }
}
Naturally, this fills many lines of code with clutter and requires extensive copypasting. So I was wondering whether it would make sense to use macros to make it more compact, like this:
#define BEGIN_LOOP_3D(i,j,k,i_size,j_size,k_size) \
for (int i=0; i<(i_size); i++) { \
    for (int j=0; j<(j_size); j++) { \
        for (int k=0; k<(k_size); k++) {
and
#define END_LOOP_3D }}}
On one hand, from a DRY principle standpoint, this seems great: it makes the code a lot more compact, and allows you to indent the contents of the loop by just one block instead of three. On the other hand, the practice of introducing new language constructs seems hideously ugly and, even though I can't think of any obvious problems with it right now, seems alarmingly prone to creating bugs that are a nightmare to debug.
So what do you think: do the compactness and reduced repetition justify this despite the ugliness and the potential drawbacks?
Never put open or close {} inside macros. C programmers are not used to this so code gets difficult to read.
In your case this is even completely superfluous, you just don't need them. If you do such a thing do
#define FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
for (size_t I=0; I<(ISIZE); I++) \
    for (size_t J=0; J<(JSIZE); J++) \
        for (size_t K=0; K<(KSIZE); K++)
no need for a terminating macro. The programmer can place the {} directly.
Also, above I have used size_t as the correct type in C for loop indices. 3D matrices easily get large, int arithmetic overflows when you don't think of it.
If these 3D arrays are “small”, you can ignore me. If your 3D arrays are large, but you don't much care about performance, you can ignore me. If you subscribe to the (common but false) doctrine that compilers are quasi-magical tools that can poop out optimal code almost irrespective of the input, you can ignore me.
You are probably aware of the general caveats regarding macros, how they can frustrate debugging, etc., but if your 3D arrays are “large” (whatever that means), and your algorithms are performance-oriented, there may be drawbacks of your strategy that you may not have considered.
First: if you are doing linear algebra, you almost certainly want to use dedicated linear algebra libraries, such as BLAS, LAPACK, etc., rather than “rolling your own”. OpenBLAS (from GotoBLAS) will totally smoke any equivalent you write, probably by at least an order of magnitude. This is doubly true if your matrices are sparse and triply true if your matrices are sparse and structured (such as tridiagonal).
Second: if your 3D arrays represent Cartesian grids for some kind of simulation (like a finite-difference method), and/or are intended to be fed to any numerical library, you absolutely do not want to represent them as C 3D arrays. You will want, instead, to use a 1D C array and use library functions where possible and perform index computations yourself (see this answer for details) where necessary.
Third: if you really do have to write your own triple-nested loops, the nesting order of the loops is a serious performance consideration. It might well be that the data-access pattern for ijk order (rather than ikj or kji) yields poor cache behavior for your algorithm, as is the case for dense matrix-matrix multiplication, for example. Your compiler might be able to do some limited loop exchange (last time I checked, icc would produce reasonably fast code for naive xGEMM, but gcc wouldn't). As you implement more and more triple-nested loops, and your proposed solution becomes more and more attractive, it becomes less and less likely that a “one loop-order fits all” strategy will give reasonable performance in all cases.
Fourth: any “one loop-order fits all” strategy that iterates over the full range of every dimension will not be tiled, and may exhibit poor performance.
Fifth (and with reference to another answer with which I disagree): I believe, in general, that the “best” data type for any object is the set with the smallest size and the least algebraic structure, but if you decide to indulge your inner pedant and use size_t or another unsigned integer type for matrix indices, you will regret it. I wrote my first naive linear algebra library in C++ in 1994. I've written maybe a half dozen in C over the last 8 years and, every time, I've started off trying to use unsigned integers and, every time, I've regretted it. I've finally decided that size_t is for sizes of things and a matrix index is not the size of anything.
Sixth (and with reference to another answer with which I disagree): a cardinal rule of HPC for deeply nested loops is to avoid function calls and branches in the innermost loop. This is particularly important where the op-count in the innermost loop is small. If you're doing a handful of operations, as is the case more often than not, you don't want to add a function call overhead in there. If you're doing hundreds or thousands of operations in there, you probably don't care about a handful of instructions for a function call/return and, therefore, they're OK.
Finally, if none of the above are considerations that jibe with what you're trying to implement, then there's nothing wrong with what you're proposing, but I would carefully consider what Jens said about braces.
The best way is to use a function. Let the compiler worry about performance and optimization, though if you are concerned you can always declare functions as inline.
Here's a simple example:
#include <stdio.h>
#include <stdint.h>
typedef void(*func_t)(int* item_ptr);
void traverse_3D (size_t x,
                  size_t y,
                  size_t z,
                  int array[x][y][z],
                  func_t function)
{
    for(size_t ix=0; ix<x; ix++)
    {
        for(size_t iy=0; iy<y; iy++)
        {
            for(size_t iz=0; iz<z; iz++)
            {
                function(&array[ix][iy][iz]);
            }
        }
    }
}
void fill_up (int* item_ptr) // fill array with increasing counter values
{
    static uint8_t counter = 0;
    *item_ptr = counter;
    counter++;
}
void print (int* item_ptr)
{
    printf("%d ", *item_ptr);
}
int main()
{
    int arr [2][3][4];
    traverse_3D(2, 3, 4, arr, fill_up);
    traverse_3D(2, 3, 4, arr, print);
}
EDIT
To shut up all speculations, here are some benchmarking results from Windows.
Tests were done with a matrix of size [20][30][40]. The fill_up function was called either from traverse_3D or from a 3-level nested loop directly in main(). Benchmarking was done with QueryPerformanceCounter().
Case 1: gcc -std=c99 -pedantic-errors -Wall
With function, time in us: 255.371402
Without function, time in us: 254.465830
Case 2: gcc -std=c99 -pedantic-errors -Wall -O2
With function, time in us: 115.913261
Without function, time in us: 48.599049
Case 3: gcc -std=c99 -pedantic-errors -Wall -O2, traverse_3D function inlined
With function, time in us: 37.732181
Without function, time in us: 37.430324
Why the "without function" case performs somewhat better with the function inlined, I have no idea. I can comment out the call to it and still get the same benchmarking results for the "without function" case.
The conclusion however, is that with proper optimization, performance is most likely a non-issue.

Which is faster, inline asm or plain C, and is the speed difference worth it?

In my case I want the best speed.
Which of the two is faster?
For example:
int n=0;
__asm {
    MOV n, 100
    INC n
    ADD n, 100
}
printf("%d\n", n);
Or
int n=0;
n = 100;
n++;
n += 100;
printf("%d\n", n);
I used the following code to find out which is faster, but it does not give me a result that shows which one is better.
double duration;
clock_t start, end;
start = clock();
// code here
end = clock();
duration = (double)(end - start) / CLOCKS_PER_SEC;
printf("%f\n", duration );
Besides measuring clock cycles, you can also look at the disassembly to see exactly what the compiler chose.
For code this small, the compiler will most likely make the right choices for fast, compact code (especially when optimizing for speed). In that case it is preferable to use C instead of asm, for portability and to avoid bugs when you make small changes.
If the compiler did not choose the same approach as your assembly, I would try to figure out why. It might be very important.
You could measure the performance difference of your two approaches with cpu cycles. Try this, instead of clock:
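#include <stdint.h>
/* GCC/Clang on x86 only: RDTSC returns the counter in EDX:EAX. */
static inline uint64_t getCycles(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}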
I don't think there is going to be much difference between the two versions. The only difference could be that you can write asm that best fits your CPU, whereas the compiler may generate more general code.
If you had memory operations, if statements, or for loops, there could be a difference.
Loops, cache issues, and system calls are the biggest time consumers.
I would say that it is wiser to write C code for such a simple case, for readability.
You can write clean C code and get the best possible performance by other means: using special hardware (e.g. a GPU), a better algorithm, redefining the problem, better use of the compiler, and so on.

Efficiency of boolean comparisons? In C

I'm writing a loop in C, and I am just wondering on how to optimize it a bit. It's not crucial here as I'm just practicing, but for further knowledge, I'd like to know:
In a loop, for example the following snippet:
int i = 0;
while (i < 10) {
    printf("%d\n", i);
    i++;
}
Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
If it checks both, wouldn't:
int i = 0;
while (i != 10) {
    printf("%d\n", i);
    i++;
}
be more efficient?
Thanks!
Both will be translated into a single assembly instruction. Most CPUs have comparison instructions for LESS THAN, for LESS THAN OR EQUAL, for EQUAL and for NOT EQUAL.
One of the interesting things about these optimization questions is that they often show why you should code for clarity/correctness before worrying about the performance impact of these operations (which oh-so often don't have any difference).
Your 2 example loops do not have the same behavior:
int i = 0;
/* this will print 11 lines (0..10) */
while (i <= 10) {
    printf("%d\n", i);
    i++;
}
And,
int i = 0;
/* This will print 10 lines (0..9) */
while (i != 10) {
    printf("%d\n", i);
    i++;
}
To answer your question though, it's nearly certain that the performance of the two constructs would be identical (assuming that you fixed the problem so the loop counts were the same). For example, if your processor could only check for equality and whether one value were less than another in two separate steps (which would be a very unusual processor), then the compiler would likely transform the (i <= 10) to an (i < 11) test - or maybe an (i != 11) test.
This is a clear example of premature optimization. IMHO, that is something that programmers new to their craft are way too prone to worry about. If you must worry about it, learn to benchmark and profile your code so that your worries are based on evidence rather than supposition.
Speaking to your specific questions. First, a <= is not implemented as two operations testing for < and == separately in any C compiler I've met in my career. And that includes some monumentally stupid compilers. Notice that for integers, a <= 5 is the same condition as a < 6 and if the target architecture required that only < be used, that is what the code generator would do.
Your second concern, that while (i != 10) might be more efficient raises an interesting issue of defensive programming. First, no it isn't any more efficient in any reasonable target architecture. However, it raises a potential for a small bug to cause a larger failure. Consider this: if some line of code within the body of the loop modified i, say by making it greater than 10, what might happen? How long would it take for the loop to end, and would there be any other consequences of the error?
Finally, when wondering about this kind of thing, it often is worthwhile to find out what code the compiler you are using actually generates. Most compilers provide a mechanism to do this. For GCC, learn about the -S option which will cause it to produce the assembly code directly instead of producing an object file.
The operators <= and < are a single instruction in assembly, there should be no performance difference.
Note that tests for 0 can be a bit faster on some processors than to test for any other constant, therefore it can be reasonable to make a loop run backward:
int i = 10;
while (i != 0)
{
    printf("%d\n", i);
    i--;
}
Note that micro-optimizations like these usually gain you very little performance; your time is better spent on choosing efficient algorithms. Also note that this loop prints 10 down to 1, not 0 up to 9.
Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
Neither, it will most likely check (i < 11). The <= 10 is just there for you to give better meaning to your code since 11 is a magic number which actually means (10+1).
Depends on the architecture and compiler. On most architectures, there is a single instruction for <= or the opposite, which can be negated, so if it is translated into a loop, the comparison will most likely be only one instruction. (On x86 or x86_64 it is one instruction)
The compiler might unroll the loop into a sequence of ten i++ operations; when only constant expressions are involved, it will even optimize the ++ away and leave only constants.
And Ira is right: the cost of the comparison vanishes when a printf is involved, since its execution time might be millions of clock cycles.
I'm writing a loop in C, and I am just wondering on how to optimize it a bit.
If you compile with optimizations turned on, the biggest optimization will be from unrolling that loop.
It's going to be hard to profile that code with -O2, because for trivial functions the compiler will unroll the loop and you won't be able to benchmark actual differences in compares. You should be careful when profiling test cases that use constants that might make the code trivial when optimized by the compiler.
Disassemble. Depending on the processor, the optimization level, and a number of other things, this simple example code may be unrolled or otherwise transformed in ways that do not reflect your real question. When compiled with gcc -O1, though, both example loops you provided resulted in the same assembly (for ARM).
A less-than in your C code often turns into a branch-if-greater-than-or-equal to the far side of the loop. If your processor doesn't have a greater-than-or-equal branch, it may have a branch-if-greater-than and a branch-if-equal, which is two instructions.
Typically, though, there will be a register holding i and an instruction to increment i, then an instruction to compare i with 10. After that, equal, greater-than-or-equal, and less-than are generally each handled by a single instruction, so you should not normally see a difference.
// Case I
int i = 0;
while (i < 10) {
    printf("%d\n", i);
    i++;
    printf("%d\n", i);
    i++;
}
// Case II
int i = 0;
while (i < 10) {
    printf("%d\n", i);
    i++;
}
Case I takes more code space but can run faster, because it performs half as many loop tests; Case II takes less space but can be slower than Case I.
This is because in programming, space and time usually trade off against each other, which means you often have to compromise on one of them.
So you can optimize for time or for space, but usually not both.
As for your two loops, they are the same.
