matrix library: inline vs define - c

I'm creating a matrix library with a bunch of function like this one (it's actually a long one):
void matx_multiply(int x, float mat1[], float mat2[], float result[])
{
int row,col,k;
for (row=0; row<x; row++) {
for(col=0; col<x; col++){
result[row + col*x]=0.0f;
for (k=0; k<x; k++)
{
result[row + col*x]+=mat1[row + k*x]*mat2[k + col*x];
}
}
}
}
Firstly, I wonder if it's better to change this to an inline function if it is use ~5 times in a program (often in a loop), with x being known at compile time.
I think it's better to inline it, so the compiler can decide at compile time whether we want it to be expanded in the code or not (depending on the optimization parameter). In addition, the compiler might fairly well optimize the loop if it knows x (for example, if x=2, it may decide to unroll the loop)
More importantly, I want to add a set of functions:
#define mat2_multiply(m1,m2,res) matx_multiply(2,m1,m2,res)
#define mat3_multiply(m1,m2,res) matx_multiply(3,m1,m2,res)
...
or
static inline void mat2_multiply(float mat1[static 2],
float mat2[static 2],
float result[static 2])
{
matx_multiply(2,mat1,mat2,result);
}
...
or creating a function (but it create a function call for nothing)
One way is much more concise and is always expanded (it will verify if mat1 and mat2 are array of float anyway)
The second way is more secure, it verify the length of the array, but it's less concies and may not be expanded...
What would you do, and what would you want a matrix library to do ?
I want my library to be fairly fast (for OpenGL app), fairly small and easy to use.

Use the inline keyword. There are several disadvantages if you use the preprocessor to develop function-like macros:
no type safety checking
no sanity checking
bad to debug
bad readability
expressions passed to macros can be evaluated more than once

Related

What are the various ways code may be 'vectorized'?

I usually hear the term vectorized functions in one of two ways:
In a very high-level language when the data is passed all-at-once (or at least, in bulk chunks) to a lower-level library that does the calculations in faster way. An example of this would be python's use of numpy for array/LA-related stuff.
At the lowest level, when using a specific machine instruction or procedure that makes heavy use of them (such as YMM, ZMM, XMM register instructions).
However, it seems like the term is passed around quite generally, and I wanted to know if there's a third (or even more) ways in which it's used. And this would just be, for example, passing multiple values to a function rather than one (usually done via an array) for example:
// non-'vectorized'
#include <stdio.h>
int squared(int num) {
return num*num;
}
int main(void) {
int nums[] = {1,2,3,4,5};
for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
int n_squared = squared(nums[i]);
printf("%d^2 = %d\n", nums[i], n_squared);
}
}
// 'vectorized'
#include <stdio.h>
void squared(int num[], int size) {
for (int i=0; i<size; i++) {
*(num +i) = num[i] * num[i];
}
}
int main(void) {
int nums[] = {1,2,3,4,5};
squared(nums, sizeof(nums)/sizeof(*nums));
for (int i=0; i < sizeof(nums)/sizeof(*nums); i++) {
printf("Squared=%d\n", nums[i]);
}
}
Is the above considered 'vectorized code'? Is there a more formal/better definition of what makes something vectorized or not?
Vectorized code, in the context you seem to be referring to, normally means "an implementation that happens to make use of Single Instruction Multiple Data (SIMD) hardware instructions".
This can sometimes mean that someone manually wrote a version of a function that is equivalent to the canonical one, but happens to make use of SIMD. More often than not, it's something that the compiler does under the hood as part of its optimization passes.
In a very high-level language when the data is passed all-at-once (or at least, in bulk chunks) to a lower-level library that does the calculations in faster way. An example of this would be python's use of numpy for array/LA-related stuff.
That's simply not correct. The process of handing off a big chunk of data to some block of code that goes through it quickly is not vectorization in of itself.
You could say "Now that my code uses numpy, it's vectorized" and be sort of correct, but only transitively. A better way to put it would be "Now that my code uses numpy, it runs a lot faster because numpy is vectorized under the hood.". Importantly though, not all fast libraries to which big chunks of data are passed at once are vectorized.
...Code examples...
Since there is no SIMD instruction in sight in either example, then neither are vectorized yet. It might be true that the second version is more likely to lead to a vectorized program. If that's the case, then we'd say that the program is more vectorizable than the first. However, the program is not vectorized until the compiler makes it so.

passing a constant through a function in C

I have some C function which, among other things, does a modulo operation. So it looks something like
const int M = 641;
void func( ...parameters..) {
int x;
... some operations ...
x %= M;
... some more operations ...
}
Now, what is crucial for me is that the number M here is a constant. If I would not tell the compiler that M is a constant, then I would get much slower performance.
Currently, I am very happy with my function func( .. ) , and I want would like to extend it, so it can work on different moduli. But again, it is crucial here that these moduli are fixed. So I would like to be able to do something like
const int arrayM[] = {641, 31, 75, 81, 123};
and then have for each index in the array of constants array_M[i] a version of the function func, say func_i, which is a copy of the function func, but where array_M[i] replaces the role of M.
In practice, my array of constants arrayM[] will consist of around 600 explicit prime numbers, which I will choose in a particular way so that x % array_M[i] compiles to a very fast modulus function (for instance Mersenne primes).
My question is: How do I do this in C without making 600 copies of my function func, and changing the variable M in the code each time ?
Finally, I would like to ask the same question again for CUDA code. So if I would have a cuda-kernel, where at some point in the code a modulus M operation is carried out, and I want to have different copies of the same kernel (one for each index of array_M).
You may use a define like:
#define F(i,n) void func_##i() { printf("%d\n",n); }
#include <stdio.h>
F(1,641)
F(2,31)
...
int main() {
func_1();
func_2();
}
It is possible to obtain the same effect from a list of constant but it is much much more tricky. See recursive macro.
Most compilers will do constant propagation. You need to turn up the optimisation level high. The only way to be sure however is to examine the assembly code, or to explicitly write the code out with the constants folded in, which is ugly and hard to maintain. C++ allows you to specify a scalar as a template.

Is it possible to declare a variable in a function within a #define clause

In order to speed up the performance of my program, I'd like to introduce a function for calculating the leftover of a floating point division (where the quotient is natural, obviously).
Therefore I have following simple function:
double mfmod(double x,double y) {
double a;
return ((a=x/y)-(int)a)*y;
}
As I've heard I could speed up even more by putting this function within a #define clause, but the variable a makes this quite difficult. At this moment I'm here:
#define mfmod(x,y) { \
double a; \
return ((a=x/y)-(int)a)*y; \
}
But trying to launch this gives problems, due to the variable.
The problems are the following: a bit further I'm trying to launch this function:
double test = mfmod(f, div);
And this can't be compiled due to the error message type name is not allowed.
(for your information, f and div are doubles)
Does anybody know how to do this? (if it's even possible) (I'm working with Windows, more exactly Visual Studio, not with GCC)
As I've heard I could speed up even more by putting this function within a #define clause
I think you must have misunderstood. Surely the advice was to implement the behavior as a (#defined) macro, instead of as a function. Your macro is syntactically valid, but the code resulting from expanding it is not a suitable replacement for calling your function.
Defining this as a macro is basically a way to manually inline the function body, but it has some drawbacks. In particular, in standard C, a macro that must expand to an expression (i.e. one that can be used as a value) cannot contain a code block, and therefore cannot contain variable declarations. That, in turn, may make it impossible to avoid multiple evaluation of the macro arguments, which is a big problem.
Here's one way you could write your macro:
#define mfmod(x,y) ( ((x)/(y)) - (int) ((x)/(y)) )
This is not a clear a win over the function, however, as its behavior varies with the argument types, it must evaluate both arguments twice (which can produce unexpected and even undefined results in some cases), and it must also perform the division twice.
If you were willing to change the usage mode so that the macro sets a result variable instead of expanding to an expression, then you could get around many of the problems. #BnBDim provided a first cut at this, but it suffers from some of the same type and multiple-evaluation problems as the above. Here's how you could do it to obtain the same result as your function:
#define mfmod(x, y, res) do { \
double _div = (y); \
double _quot = (double) (x) / _div; \
res = (_quot - (int) _quot) * _div; \
} while (0)
Note that it takes care to reference the arguments once each, and also inside parentheses but for res, which must be an lvalue. You would use it much like a void function instead of like a value-returning function:
double test;
mfmod(f, div, test);
That still affords a minor, but unavoidable risk of breakage in the event that one of the actual arguments to the macro collides with one of the variables declared inside the code block it provides. Using variable names prefixed with underscores is intended to minimize that risk.
Overall, I'd be inclined to go with the function instead, and to let the compiler handle the inlining. If you want to encourage the compiler to do so then you could consider declaring the function inline, but very likely it will not need such a hint, and it is not obligated to honor one.
Or better, just use fmod() until and unless you determine that doing so constitutes a bottleneck.
To answer the stated question: yes, you can declare variables in defined macro 'functions' like the one you are working with, in certain situations. Some of the other answers have shown examples of how to do this.
The problem with your current #define is that you are telling it to return something in the middle of your code, and you are not making a macro that expands in the way you probably want. If I use your macro like this:
...
double y = 1.0;
double x = 1.0;
double z = mfmod(x, y);
int other = (int)z - 1;
...
This is going to expand to:
...
double y = 1.0;
double x = 1.0;
double z = {
double a;
return ((a=x/y)-(int)a)*y;
};
int other = (int)z - 1;
...
The function (if it compiled) would never proceed beyond the initialization of z, because it would return in the middle of the macro. You also are trying to 'assign' a block of code to z.
That being said, this is another example of making assumptions about performance without any (stated) benchmarking. Have you actually measured any performance problem with just using an inline function?
__attribute__((const))
extern inline double mfmod(const double x, const double y) {
const double a = x/y;
return (a - (int)a) * y;
}
Not only is this cleaner, clearer, and easier to debug than the macro, it has the added benefit of being declared with the const attribute, which will suggest to the compiler that subsequent calls to the function with the same arguments should return the same value, which can cause repeated calls to the function to be optimized away entirely, whereas the macro would be evaluated every time (conceptually). To be honest, even using the local double to cache the division result is probably a premature optimization, since the compiler will probably optimize this away. If this were me, and I absolutely HAD to have a macro to do this, I would write it as follows:
#define mfmod(x, y) (((x/y)-((int)(x/y)))*y)
There will almost certainly not be any noticeable performance hit under optimization for doing the division twice. If there were, I would use the inline function above. I will leave it to you to do the benchmarking.
You could use this work-around
#define mfmod(x,y,res) \
do { \
double a=(x)/(y); \
res = (a-(int)a)*(y); \
} while(0)

Macros for 3D loops in C

I'm developing a C (C99) program that loops heavily over 3-D arrays in many places. So naturally, the following access pattern is ubiquitous in the code:
for (int i=0; i<i_size, i++) {
for (int j=0; j<j_size, j++) {
for (int k=0; k<k_size, k++) {
...
}
}
}
Naturally, this fills many lines of code with clutter and requires extensive copypasting. So I was wondering whether it would make sense to use macros to make it more compact, like this:
#define BEGIN_LOOP_3D(i,j,k,i_size,j_size,k_size) \
for (int i=0; i<(i_size), i++) { \
for (int j=0; j<(j_size), j++) { \
for (int k=0; k<(k_size), k++) {
and
#define END_LOOP_3D }}}
On one hand, from a DRY principle standpoint, this seems great: it makes the code a lot more compact, and allows you to indent the contents of the loop by just one block instead of three. On the other hand, the practice of introducing new language constructs seems hideously ugly and, even though I can't think of any obvious problems with it right now, seems alarmingly prone to creating bugs that are a nightmare to debug.
So what do you think: do the compactness and reduced repetition justify this despite the ugliness and the potential drawbacks?
Never put open or close {} inside macros. C programmers are not used to this so code gets difficult to read.
In your case this is even completely superfluous, you just don't need them. If you do such a thing do
FOR3D(I, J, K, ISIZE, JSIZE, KSIZE) \
for (size_t I=0; I<ISIIZE, I++) \
for (size_t J=0; J<JSIZE, J++) \
for (size_t K=0; K<KSIZE, K++)
no need for a terminating macro. The programmer can place the {} directly.
Also, above I have used size_t as the correct type in C for loop indices. 3D matrices easily get large, int arithmetic overflows when you don't think of it.
If these 3D arrays are “small”, you can ignore me. If your 3D arrays are large, but you don't much care about performance, you can ignore me. If you subscribe to the (common but false) doctrine that compilers are quasi-magical tools that can poop out optimal code almost irrespective of the input, you can ignore me.
You are probably aware of the general caveats regarding macros, how they can frustrate debugging, etc., but if your 3D arrays are “large” (whatever that means), and your algorithms are performance-oriented, there may be drawbacks of your strategy that you may not have considered.
First: if you are doing linear algebra, you almost certainly want to use dedicated linear algebra libraries, such as BLAS, LAPACK, etc., rather than “rolling your own”. OpenBLAS (from GotoBLAS) will totally smoke any equivalent you write, probably by at least an order of magnitude. This is doubly true if your matrices are sparse and triply true if your matrices are sparse and structured (such as tridiagonal).
Second: if your 3D arrays represent Cartesian grids for some kind of simulation (like a finite-difference method), and/or are intended to be fed to any numerical library, you absolutely do not want to represent them as C 3D arrays. You will want, instead, to use a 1D C array and use library functions where possible and perform index computations yourself (see this answer for details) where necessary.
Third: if you really do have to write your own triple-nested loops, the nesting order of the loops is a serious performance consideration. It might well be that the data-access pattern for ijk order (rather than ikj or kji) yields poor cache behavior for your algorithm, as is the case for dense matrix-matrix multiplication, for example. Your compiler might be able to do some limited loop exchange (last time I checked, icc would produce reasonably fast code for naive xGEMM, but gcc wouldn't). As you implement more and more triple-nested loops, and your proposed solution becomes more and more attractive, it becomes less and less likely that a “one loop-order fits all” strategy will give reasonable performance in all cases.
Fourth: any “one loop-order fits all” strategy that iterates over the full range of every dimension will not be tiled, and may exhibit poor performance.
Fifth (and with reference to another answer with which I disagree): I believe, in general, that the “best” data type for any object is the set with the smallest size and the least algebraic structure, but if you decide to indulge your inner pedant and use size_t or another unsigned integer type for matrix indices, you will regret it. I wrote my first naive linear algebra library in C++ in 1994. I've written maybe a half dozen in C over the last 8 years and, every time, I've started off trying to use unsigned integers and, every time, I've regretted it. I've finally decided that size_t is for sizes of things and a matrix index is not the size of anything.
Sixth (and with reference to another answer with which I disagree): a cardinal rule of HPC for deeply nested loops is to avoid function calls and branches in the innermost loop. This is particularly important where the op-count in the innermost loop is small. If you're doing a handful of operations, as is the case more often than not, you don't want to add a function call overhead in there. If you're doing hundreds or thousands of operations in there, you probably don't care about a handful of instructions for a function call/return and, therefore, they're OK.
Finally, if none of the above are considerations that jibe with what you're trying to implement, then there's nothing wrong with what you're proposing, but I would carefully consider what Jens said about braces.
The best way is to use a function. Let the compiler worry about performance and optimization, though if you are concerned you can always declare functions as inline.
Here's a simple example:
#include <stdio.h>
#include <stdint.h>
typedef void(*func_t)(int* item_ptr);
void traverse_3D (size_t x,
size_t y,
size_t z,
int array[x][y][z],
func_t function)
{
for(size_t ix=0; ix<x; ix++)
{
for(size_t iy=0; iy<y; iy++)
{
for(size_t iz=0; iz<z; iz++)
{
function(&array[ix][iy][iz]);
}
}
}
}
void fill_up (int* item_ptr) // fill array with some random numbers
{
static uint8_t counter = 0;
*item_ptr = counter;
counter++;
}
void print (int* item_ptr)
{
printf("%d ", *item_ptr);
}
int main()
{
int arr [2][3][4];
traverse_3D(2, 3, 4, arr, fill_up);
traverse_3D(2, 3, 4, arr, print);
}
EDIT
To shut up all speculations, here are some benchmarking results from Windows.
Tests were done with a matrix of size [20][30][40]. The fill_up function was called either from traverse_3D or from a 3-level nested loop directly in main(). Benchmarking was done with QueryPerformanceCounter().
Case 1: gcc -std=c99 -pedantic-errors -Wall
With function, time in us: 255.371402
Without function, time in us: 254.465830
Case 2: gcc -std=c99 -pedantic-errors -Wall -O2
With function, time in us: 115.913261
Without function, time in us: 48.599049
Case 3: gcc -std=c99 -pedantic-errors -Wall -O2, traverse_3D function inlined
With function, time in us: 37.732181
Without function, time in us: 37.430324
Why the "without function" case performs somewhat better with the function inlined, I have no idea. I can comment out the call to it and still get the same benchmarking results for the "without function" case.
The conclusion however, is that with proper optimization, performance is most likely a non-issue.

Does Intel array notation and elementary functions vectorize well with Xeon Phi ISA?

I try to find a proper material that clearly explains the different ways to write C/C++ source code that can be vectorized by the Intel compiler using array notation and elementary functions. All the materials online take trivial examples: saxpy, reduction etc. But there is a lack of explanation on how to vectorize a code that has conditional branching or contains a loop with loop-dependence.
For an example: say there is a sequential code I want to run with different arrays. A matrix is stored in major row format. The columns of the matrix is computed by the compute_seq() function:
#define N 256
#define STRIDE 256
__attribute__((vector))
inline void compute_seq(float *sum, float* a) {
int i;
*sum = 0.0f;
for(i=0; i<N; i++)
*sum += a[i*STRIDE];
}
int main() {
// Initialize
float *A = malloc(N*N*sizeof(float));
float sums[N];
// The following line is not going to be valid, but I would like to do somthing like this:
compute_seq(sums[:],*(A[0:N:1]));
}
Any comments appreciated.
Here is a corrected version of the example.
__attribute__((vector(linear(sum),linear(a))))
inline void compute_seq(float *sum, float* a) {
int i;
*sum = 0.0f;
for(i=0; i<N; i++)
*sum += a[i*STRIDE];
}
int main() {
// Initialize
float *A = malloc(N*N*sizeof(float));
float sums[N];
compute_seq(&sums[:],&A[0:N:N]);
}
The important change is at the call site. The expression &sums[:] creates an array section consisting of &sums[0], &sums[1], &sums[2], ... &sums[N-1]. The expression &A[0:N:N] creates an array section consisting of &A[0*N], &A[1*N], &A[2*N], ...&A[(N-1)*N].
I added two linear clauses to the vector attribute to tell the compiler to generate a clone optimized for the case that the arguments are arithmetic sequences, as they are in this example. For this example, they (and the vector attribute) are redundant since the compiler can see both the callee and call site in the same translation unit and figure out the particulars for itself. But if compute_seq were defined in another translation unit, the attribute might help.
Array notation is a work in progress. icc 14.0 beta compiled my example for Intel(R) Xeon Phi(TM) without complaint. icc 13.0 update 3 reported that it couldn't vectorize the function ("dereference too complex"). Perversely, leaving the vector attribute off shut up the report, probably because the compiler can vectorize it after inlining.
I use the compiler option "-opt-assume-safe-padding" when compiling for Intel(R) Xeon Phi(TM). It may improve vector code quality. It lets the compiler assume that the page beyond any accessed address is safe to touch, thus enabling certain instruction sequences that would otherwise be disallowed.

Resources