I'm developing a C (C99) program that loops heavily over 3-D arrays in many places. So naturally, the following access pattern is ubiquitous in the code:
for (int i = 0; i < i_size; i++) {
    for (int j = 0; j < j_size; j++) {
        for (int k = 0; k < k_size; k++) {
            ...
        }
    }
}
Naturally, this fills many lines of code with clutter and requires extensive copypasting. So I was wondering whether it would make sense to use macros to make it more compact, like this:
#define BEGIN_LOOP_3D(i, j, k, i_size, j_size, k_size) \
    for (int i = 0; i < (i_size); i++) { \
        for (int j = 0; j < (j_size); j++) { \
            for (int k = 0; k < (k_size); k++) {
and
#define END_LOOP_3D }}}
On one hand, from a DRY principle standpoint, this seems great: it makes the code a lot more compact, and allows you to indent the contents of the loop by just one block instead of three. On the other hand, the practice of introducing new language constructs seems hideously ugly and, even though I can't think of any obvious problems with it right now, seems alarmingly prone to creating bugs that are a nightmare to debug.
So what do you think: do the compactness and reduced repetition justify this despite the ugliness and the potential drawbacks?
Never put opening or closing braces {} inside macros. C programmers are not used to this, so the code becomes difficult to read.
In your case they are completely superfluous anyway; you just don't need them. If you do such a thing, do
#define FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
    for (size_t I = 0; I < (ISIZE); I++) \
        for (size_t J = 0; J < (JSIZE); J++) \
            for (size_t K = 0; K < (KSIZE); K++)
no need for a terminating macro. The programmer can place the {} directly.
Also, above I have used size_t as the correct type in C for loop indices. 3D matrices easily get large, and int arithmetic overflows when you least expect it.
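For illustration, here is a minimal usage sketch; the macro is repeated so the snippet stands alone, and the clear function and its parameter names are made up for the example. The braces belong to the caller, not the macro:

#include <stddef.h>

#define FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
    for (size_t I = 0; I < (ISIZE); I++) \
        for (size_t J = 0; J < (JSIZE); J++) \
            for (size_t K = 0; K < (KSIZE); K++)

/* zero out a 3-D array; the caller supplies the braces */
void clear(size_t nx, size_t ny, size_t nz, double arr[nx][ny][nz])
{
    FOR3D(i, j, k, nx, ny, nz) {
        arr[i][j][k] = 0.0;
    }
}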
If these 3D arrays are “small”, you can ignore me. If your 3D arrays are large, but you don't much care about performance, you can ignore me. If you subscribe to the (common but false) doctrine that compilers are quasi-magical tools that can poop out optimal code almost irrespective of the input, you can ignore me.
You are probably aware of the general caveats regarding macros, how they can frustrate debugging, etc., but if your 3D arrays are “large” (whatever that means), and your algorithms are performance-oriented, there may be drawbacks of your strategy that you may not have considered.
First: if you are doing linear algebra, you almost certainly want to use dedicated linear algebra libraries, such as BLAS, LAPACK, etc., rather than “rolling your own”. OpenBLAS (from GotoBLAS) will totally smoke any equivalent you write, probably by at least an order of magnitude. This is doubly true if your matrices are sparse and triply true if your matrices are sparse and structured (such as tridiagonal).
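For a sense of what that looks like in practice, here is a minimal sketch of handing a dense product off to BLAS through the CBLAS interface; the multiply wrapper and its parameters are made up for the example, and you link against an implementation such as OpenBLAS:

#include <cblas.h>

/* C = A*B for n x n row-major matrices, delegated to the BLAS dgemm kernel */
void multiply(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,      /* M, N, K       */
                1.0, A, n,    /* alpha, A, lda */
                B, n,         /* B, ldb        */
                0.0, C, n);   /* beta, C, ldc  */
}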
Second: if your 3D arrays represent Cartesian grids for some kind of simulation (like a finite-difference method), and/or are intended to be fed to any numerical library, you absolutely do not want to represent them as C 3D arrays. You will want, instead, to use a 1D C array and use library functions where possible and perform index computations yourself (see this answer for details) where necessary.
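As a rough sketch of the 1D-array approach (the names grid_alloc and grid_idx are illustrative, and row-major ordering with k varying fastest is assumed):

#include <stdlib.h>

/* flat storage for an nx x ny x nz grid */
double *grid_alloc(size_t nx, size_t ny, size_t nz)
{
    return malloc(nx * ny * nz * sizeof(double));
}

/* element (i, j, k) lives at offset (i*ny + j)*nz + k */
static inline size_t grid_idx(size_t ny, size_t nz,
                              size_t i, size_t j, size_t k)
{
    return (i * ny + j) * nz + k;
}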
Third: if you really do have to write your own triple-nested loops, the nesting order of the loops is a serious performance consideration. It might well be that the data-access pattern for ijk order (rather than ikj or kji) yields poor cache behavior for your algorithm, as is the case for dense matrix-matrix multiplication, for example. Your compiler might be able to do some limited loop exchange (last time I checked, icc would produce reasonably fast code for naive xGEMM, but gcc wouldn't). As you implement more and more triple-nested loops, and your proposed solution becomes more and more attractive, it becomes less and less likely that a “one loop-order fits all” strategy will give reasonable performance in all cases.
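To make the point concrete, here is a sketch of the same dense product C += A*B (n x n, row-major, 1D storage) in two loop orders; the function names are made up for the example. The ikj version streams through rows of B and C, which is usually far kinder to the cache than the naive ijk order:

#include <stddef.h>

void gemm_ijk(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j]; /* strided reads of B */
}

void gemm_ikj(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = A[i*n + k];
            for (size_t j = 0; j < n; j++)
                C[i*n + j] += aik * B[k*n + j];        /* contiguous reads */
        }
}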
Fourth: any “one loop-order fits all” strategy that iterates over the full range of every dimension will not be tiled, and may exhibit poor performance.
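For comparison, a tiled traversal looks something like the following sketch; scale_tiled is a made-up name, and BLOCK is a tuning knob chosen arbitrarily here -- sensible values depend on the cache sizes of the target machine:

#include <stddef.h>

#define BLOCK 32

/* blocked (tiled) sweep over an n x n x n grid stored as a 1-D array */
void scale_tiled(size_t n, double *a, double factor)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t kk = 0; kk < n; kk += BLOCK)
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                        for (size_t k = kk; k < kk + BLOCK && k < n; k++)
                            a[(i*n + j)*n + k] *= factor;
}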
Fifth (and with reference to another answer with which I disagree): I believe, in general, that the “best” data type for any object is the set with the smallest size and the least algebraic structure, but if you decide to indulge your inner pedant and use size_t or another unsigned integer type for matrix indices, you will regret it. I wrote my first naive linear algebra library in C++ in 1994. I've written maybe a half dozen in C over the last 8 years and, every time, I've started off trying to use unsigned integers and, every time, I've regretted it. I've finally decided that size_t is for sizes of things and a matrix index is not the size of anything.
Sixth (and with reference to another answer with which I disagree): a cardinal rule of HPC for deeply nested loops is to avoid function calls and branches in the innermost loop. This is particularly important where the op-count in the innermost loop is small. If you're doing a handful of operations, as is the case more often than not, you don't want to add a function call overhead in there. If you're doing hundreds or thousands of operations in there, you probably don't care about a handful of instructions for a function call/return and, therefore, they're OK.
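As an illustration of that last point (the helper axpy_elem and both loops are made up for the example):

#include <stddef.h>

static void axpy_elem(double *y, double a, double x) { *y += a * x; }

/* one call/return per element, unless the compiler happens to inline it */
void scale_add_call(size_t n, double *restrict y, double a,
                    const double *restrict x)
{
    for (size_t i = 0; i < n; i++)
        axpy_elem(&y[i], a, x[i]);
}

/* the handful of operations done directly in the loop body */
void scale_add_inline(size_t n, double *restrict y, double a,
                      const double *restrict x)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}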
Finally, if none of the above are considerations that jibe with what you're trying to implement, then there's nothing wrong with what you're proposing, but I would carefully consider what Jens said about braces.
The best way is to use a function. Let the compiler worry about performance and optimization, though if you are concerned you can always declare functions as inline.
Here's a simple example:
#include <stdio.h>
#include <stdint.h>
typedef void (*func_t)(int* item_ptr);

/* apply 'function' to every element of the x*y*z array */
void traverse_3D (size_t x,
                  size_t y,
                  size_t z,
                  int array[x][y][z],
                  func_t function)
{
    for (size_t ix = 0; ix < x; ix++)
    {
        for (size_t iy = 0; iy < y; iy++)
        {
            for (size_t iz = 0; iz < z; iz++)
            {
                function(&array[ix][iy][iz]);
            }
        }
    }
}
void fill_up (int* item_ptr) // fill the array with sequential values
{
    static uint8_t counter = 0;
    *item_ptr = counter;
    counter++;
}

void print (int* item_ptr)
{
    printf("%d ", *item_ptr);
}

int main()
{
    int arr [2][3][4];
    traverse_3D(2, 3, 4, arr, fill_up);
    traverse_3D(2, 3, 4, arr, print);
}
EDIT
To shut up all speculation, here are some benchmarking results from Windows.
Tests were done with a matrix of size [20][30][40]. The fill_up function was called either from traverse_3D or from a 3-level nested loop directly in main(). Benchmarking was done with QueryPerformanceCounter().
Case 1: gcc -std=c99 -pedantic-errors -Wall
With function, time in us: 255.371402
Without function, time in us: 254.465830
Case 2: gcc -std=c99 -pedantic-errors -Wall -O2
With function, time in us: 115.913261
Without function, time in us: 48.599049
Case 3: gcc -std=c99 -pedantic-errors -Wall -O2, traverse_3D function inlined
With function, time in us: 37.732181
Without function, time in us: 37.430324
Why the "without function" case performs somewhat better with the function inlined, I have no idea. I can comment out the call to it and still get the same benchmarking results for the "without function" case.
The conclusion however, is that with proper optimization, performance is most likely a non-issue.
Related
Assume the following code:
static int array[10];
int main ()
{
    for (int i = 0; i < (sizeof(array) / sizeof(array[0])); i++)
    {
        // ...
    }
}
The result of sizeof(array) / sizeof(array[0]) should in theory be known at compile time and set to some value depending on the size of an int. Even so, will the compiler do the division at run time each time the for loop iterates?
To avoid that, does the code need to be adjusted as:
static int array[10];
int main ()
{
    static const int size = sizeof(array) / sizeof(array[0]);
    for (int i = 0; i < size; i++)
    {
        // ...
    }
}
You should write the code in whatever way is most readable and maintainable for you. (I'm not making any claims about which one that is: it's up to you.) The two versions of the code you wrote are so similar that a good optimizing compiler should probably produce equally good code for each version.
You can click this link to see what assembly your two proposed versions generate with various compilers:
https://godbolt.org/z/v914qYY8E
With GCC 11.2 (targeting x86_64) and with minimal optimizations turned on (-O1), both versions of your main function produce exactly the same assembly code. With optimizations turned off (-O0), the assembly is slightly different, but the size calculation is still done at compile time for both.
Even if you doubt what I am saying, it is still better to use the more readable version as a starting point. Only change it to the less readable version if you find an actual example of a programming environment where doing so provides a meaningful speed increase for your application. Avoid wasting time on premature optimization.
Even so, will the compiler do the division at run time each time the for loop iterates?
No. It's an integer constant expression which will be calculated at compile-time. Which is why you can even do this:
int some_other_array [sizeof(array) / sizeof(array[0])];
To avoid that, does the code need to be adjusted as
No.
See for yourself: https://godbolt.org/z/rqv15vW6a. Both versions produced 100% identical machine code, each one containing a mov ebx, 10 instruction with the pre-calculated value.
I usually hear the term vectorized functions in one of two ways:
In a very high-level language, when the data is passed all at once (or at least in bulk chunks) to a lower-level library that does the calculations in a faster way. An example of this would be Python's use of numpy for array/LA-related stuff.
At the lowest level, when using a specific machine instruction or procedure that makes heavy use of them (such as YMM, ZMM, XMM register instructions).
However, it seems like the term is passed around quite generally, and I wanted to know whether there's a third way (or even more ways) in which it's used. Would this just be, for example, passing multiple values to a function rather than one (usually done via an array), like this:
// non-'vectorized'
#include <stdio.h>
int squared(int num) {
    return num*num;
}
int main(void) {
    int nums[] = {1,2,3,4,5};
    for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
        int n_squared = squared(nums[i]);
        printf("%d^2 = %d\n", nums[i], n_squared);
    }
}
// 'vectorized'
#include <stdio.h>
void squared(int num[], int size) {
    for (int i=0; i<size; i++) {
        num[i] = num[i] * num[i];
    }
}
int main(void) {
    int nums[] = {1,2,3,4,5};
    squared(nums, sizeof(nums)/sizeof(*nums));
    for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
        printf("Squared=%d\n", nums[i]);
    }
}
Is the above considered 'vectorized code'? Is there a more formal/better definition of what makes something vectorized or not?
Vectorized code, in the context you seem to be referring to, normally means "an implementation that happens to make use of Single Instruction Multiple Data (SIMD) hardware instructions".
This can sometimes mean that someone manually wrote a version of a function that is equivalent to the canonical one, but happens to make use of SIMD. More often than not, it's something that the compiler does under the hood as part of its optimization passes.
In a very high-level language, when the data is passed all at once (or at least in bulk chunks) to a lower-level library that does the calculations in a faster way. An example of this would be Python's use of numpy for array/LA-related stuff.
That's simply not correct. The process of handing off a big chunk of data to some block of code that goes through it quickly is not vectorization in and of itself.
You could say "Now that my code uses numpy, it's vectorized" and be sort of correct, but only transitively. A better way to put it would be "Now that my code uses numpy, it runs a lot faster because numpy is vectorized under the hood.". Importantly though, not all fast libraries to which big chunks of data are passed at once are vectorized.
...Code examples...
Since there is no SIMD instruction in sight in either example, neither is vectorized yet. It might be true that the second version is more likely to lead to a vectorized program; if that's the case, we'd say that it is more vectorizable than the first. However, the program is not vectorized until the compiler makes it so.
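For reference, here is roughly what a manually vectorized squared() could look like, as a sketch using SSE4.1 intrinsics (squared_simd is a made-up name; this is x86-specific, needs something like -msse4.1, and handles sizes that aren't a multiple of four with a scalar tail):

#include <smmintrin.h>  /* SSE4.1, for _mm_mullo_epi32 */

void squared_simd(int num[], int size)
{
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&num[i]); /* load 4 ints     */
        v = _mm_mullo_epi32(v, v);                       /* elementwise v*v */
        _mm_storeu_si128((__m128i *)&num[i], v);         /* store 4 results */
    }
    for (; i < size; i++)   /* scalar tail */
        num[i] *= num[i];
}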
I'm trying to learn how to optimize code (I'm also learning C), and in one of my books there's a problem on optimizing Horner's method for evaluating polynomials. I'm a little lost on how to approach the problem. I'm not great at recognizing what needs optimizing.
Any advice on how to make this function run faster would be appreciated.
Thanks
double polyh(double a[], double x, int degree) {
    long int i;
    double result = a[degree];
    for (i = degree-1; i >= 0; i--)
        result = a[i] + x*result;
    return result;
}
You really need to profile your code to test whether proposed optimizations really help. For example, it may be the case that declaring i as long int rather than int slows the function on your machine, but on the other hand it may make no difference on your machine but might make a difference on others, etc. Anyway, there's no reason to declare i a long int when degree is an int, so changing it probably won't hurt. (But still profile!)
Horner's rule is supposedly optimal in terms of the number of multiplies and adds required to evaluate a polynomial, so I don't see much you can do with it. One thing that might help (profile!) is changing the test i>=0 to i!=0. Of course, then the loop doesn't run enough times, so you'll have to add a line below the loop to take care of the final case.
Alternatively you could use a do { ... } while (--i) construct. (Or is it do { ... } while (i--)? You figure it out.)
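For what it's worth, a sketch of the i != 0 variant described above (polyh2 is a made-up name, and it assumes degree >= 1, since the i == 0 term is peeled out of the loop):

double polyh2(double a[], double x, int degree)
{
    double result = a[degree];
    for (int i = degree - 1; i != 0; i--)
        result = a[i] + x*result;
    return a[0] + x*result;   /* the final (i == 0) step, done once */
}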
You might not even need i, but using degree instead will likely not save an observable amount of time and will make the code harder to debug, so it's not worth it.
Another thing that might help (I doubt it, but profile!) is breaking up the arithmetic expression inside the loop and playing around with order, like
for (...) {
    result *= x;
    result += a[i];
}
which may reduce the need for temporary variables/registers. Try it out.
Some suggestions:
You may use int instead of long int for the loop index.
Almost certainly the problem is inviting you to conjecture on the values of a. If that vector is mostly zeros, then you'll go faster (by doing fewer double multiplications, which will be the clear bottleneck on most machines) by computing only the values of a[i] * x^i for a[i] != 0. In turn the x^i values can be computed by careful repeated squaring, preserving intermediate terms so that you never compute the same partial power more than once. See the Wikipedia article if you've never implemented repeated squaring.
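A simplified sketch of that idea follows; it recomputes each power independently with repeated squaring rather than caching the partial powers as suggested above, and poly_sparse and ipow are made-up names:

/* x^n by binary (repeated-squaring) exponentiation */
static double ipow(double x, unsigned n)
{
    double r = 1.0;
    while (n > 0) {
        if (n & 1u)
            r *= x;
        x *= x;
        n >>= 1;
    }
    return r;
}

/* evaluate only the nonzero terms of the polynomial */
double poly_sparse(const double a[], double x, int degree)
{
    double result = 0.0;
    for (int i = 0; i <= degree; i++)
        if (a[i] != 0.0)
            result += a[i] * ipow(x, (unsigned)i);
    return result;
}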
I'm creating a matrix library with a bunch of functions like this one (it's actually a long one):
void matx_multiply(int x, float mat1[], float mat2[], float result[])
{
    int row, col, k;
    for (row=0; row<x; row++) {
        for (col=0; col<x; col++) {
            result[row + col*x] = 0.0f;
            for (k=0; k<x; k++)
            {
                result[row + col*x] += mat1[row + k*x]*mat2[k + col*x];
            }
        }
    }
}
Firstly, I wonder if it's better to change this to an inline function, given that it is used ~5 times in a program (often in a loop), with x known at compile time.
I think it's better to inline it, so the compiler can decide at compile time whether to expand it in the code or not (depending on the optimization level). In addition, the compiler might optimize the loop fairly well if it knows x (for example, if x=2, it may decide to unroll the loop).
More importantly, I want to add a set of functions:
#define mat2_multiply(m1,m2,res) matx_multiply(2,m1,m2,res)
#define mat3_multiply(m1,m2,res) matx_multiply(3,m1,m2,res)
...
or
static inline void mat2_multiply(float mat1[static 4],
                                 float mat2[static 4],
                                 float result[static 4])
{
    matx_multiply(2, mat1, mat2, result);
}
...
or creating a plain function (but that introduces a function call for nothing).
The first way (the macros) is much more concise and is always expanded (it will still check that mat1 and mat2 are arrays of float).
The second way (the inline functions) is safer: it checks the minimum length of the arrays, but it's less concise and may not be expanded...
What would you do, and what would you want a matrix library to do?
I want my library to be fairly fast (for OpenGL app), fairly small and easy to use.
Use the inline keyword. There are several disadvantages if you use the preprocessor to develop function-like macros:
no type safety checking
no sanity checking
hard to debug
poor readability
expressions passed to macros can be evaluated more than once (illustrated below)
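The last point is the one that tends to bite hardest. A minimal illustration (SQUARE is a made-up macro, not something from the question):

#include <stdio.h>

#define SQUARE(x) ((x) * (x))   /* evaluates its argument twice */

int main(void)
{
    int i = 3;
    /* expands to ((i++) * (i++)): i is modified twice with no intervening
       sequence point, which is undefined behavior */
    int y = SQUARE(i++);
    printf("%d %d\n", i, y);
    return 0;
}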
gprof is not working properly on my system (MinGW) so I'd like to know which one of the following snippets is more efficient, on average.
I'm aware that internally C compilers convert everything into pointer arithmetic, but nevertheless I'd like to know if any of the following snippets has any significant advantage over the others.
The array has been allocated dynamically in contiguous memory as a 1D array and may be re-allocated at run time (it's for a simple board game, in which the player is allowed to re-define the board's size as often as he wants to).
Please note that i & j must be calculated and passed into the function set_cell() in every loop iteration (gridType is a simple struct with a few ints and a pointer to another cell struct).
Thanks in advance!
Allocate memory
grid = calloc( (nrows * ncols), sizeof(gridType) );
Snippet #1 (parse sequentially as 1D)
gridType *gp = grid;
register int i=0, j=0; // we need to pass those in set_cell()
if ( !grid )
    return;
for (gp=grid; gp < grid+(nrows*ncols); gp++)
{
    set_cell( gp, i, j, !G_OPENED, !G_FOUND, value, NULL );
    if (j == ncols-1) { // last col of current row has been reached
        j=0;
        i++;
    }
    else                // last col of current row has NOT been reached
        j++;
}
Snippet #2 (parse as 2D array, using pointers only)
gridType *gp1, *gp2;
if ( !grid )
    return;
for (gp1=grid; gp1 < grid+nrows; gp1+=ncols)
    for (gp2=gp1; gp2 < gp1+ncols; gp2++)
        set_cell( gp2, (gp1-grid), (gp2-gp1), !G_OPENED, !G_FOUND, value, NULL );
Snippet #3 (parse as 2D, using counters only)
register int i,j; // we need to pass those in set_cell()
for (i=0; i<nrows; i++)
    for (j=0; j<ncols; j++)
        set_cell( &grid[i * ncols + j], i, j, !G_OPENED, !G_FOUND, value, NULL);
Free memory
free( grid );
EDIT:
I fixed #2 from gp1++) to gp1+=ncols) in the 1st loop, after Paul's correction (thx!)
For anything like this, the answer is going to depend on the compiler and the machine you're running it on. You could try each of your code snippets, and calculating how long each one takes.
However, this is a prime example of premature optimization. The best thing to do is to pick the snippet which looks the clearest and most maintainable. You'll get much more benefit from doing that in the long run than from any savings you'd make from choosing the one that's fastest on your machine (which might not be fastest on someone else's anyway!)
Well, snippet 2 doesn't exactly work. You need different incrementing behavior; the outer loop should read for (gp1 = grid; gp1 < grid + (nrows * ncols); gp1 += ncols).
Of the other two, any compiler that's paying attention will almost certainly convert snippet 3 into something equivalent to snippet 1. But really, there's no way to know without profiling them.
Also, remember the words of Knuth: "Premature optimization is the ROOT OF ALL EVIL. I have seen more damage done in the name of 'optimization' than for all other causes combined, including sheer, wrongheaded stupidity." People who write compilers are smarter than you (unless you're secretly Knuth or Hofstadter), so let the compiler do its job and you can get on with yours. Trying to write "clever" optimized code will usually just confuse the compiler, preventing it from writing even better, more optimized code.
This is the way I'd write it. IMHO it's shorter, clearer and simpler than any of your ways.
int i, j;
gridType *gp = grid;
for (i = 0; i < nrows; i++)
    for (j = 0; j < ncols; j++)
        set_cell( gp++, i, j, !G_OPENED, !G_FOUND, value, NULL );
gprof not working isn't a real excuse. You can still set up a benchmark and measure execution time.
You might not be able to measure any difference on modern CPUs until nrows*ncols gets very large or the reallocation happens very often, so you might end up optimizing the wrong part of your code.
This certainly is micro-optimization, as most of the runtime will most probably be spent in set_cell, and everything else could be optimized to the same or very similar code by the compiler.
You don't know until you measure it.
Any decent compiler may produce the same code for all of them; even if it doesn't, the effects of caching, pipelining, branch prediction and other clever hardware tricks mean that simply guessing the number of instructions isn't enough.