Do multiple calls to external functions written in C affect the performance of an OCaml program?
For instance, let's assume that I want to create a function which builds a float list, using the previous value in the list to compute the next one. For some reason, I want this function to come from a C stub.
Does it make any difference in terms of performance whether I write everything in C or mix C external functions with my OCaml code?
I guess this question is related to what actually happens when compiling:
ocamlc -o hello.byte -c hello.cma cstub.o
Said differently, is there any material difference between doing:
external next_iter: float -> float = "next_iter"

let make_list n first_val =
  let rec aux acc current_val n =
    if n = 0 then (* I assume that n will never be < 0 *)
      acc
    else
      let new_val = next_iter current_val in
      aux (new_val :: acc) new_val (n - 1)
  in
  aux [] first_val n
and
external make_list: float -> int -> float list = "make_list"
(* Full implementation in C *)
Side question: if my C stub looks like this:
#include <caml/mlvalues.h>
CAMLprim value add_3(value x)
{
    int i = Int_val(x);     /* unbox the OCaml int */
    return Val_int(i + 3);  /* re-box the result */
}
Is the location of the returned value shared with the OCaml code, or does OCaml allocate a new piece of memory before using the value?
I am asking because I expect the second option to be especially inefficient when using the make_list solution from cstub.c for a large list (if it is implemented this way).
In general, calling a C function from OCaml has a small constant overhead. First of all, C calling conventions usually differ from the OCaml calling convention and are less efficient. When a C function is called, the compiler needs to save registers that might be clobbered by the call and restore them afterwards. Also, if a C function allocates values in the OCaml heap (which is assumed by default), the call is wrapped by code that sets up and clears garbage collector roots. If your function doesn't allocate, then you may mark its external specification with the [@@noalloc] attribute to remove the unnecessary GC setup. Finally, the OCaml compiler can't (obviously) inline your external calls, so some optimization opportunities are missed, such as code specialization and allocation elimination.

To put this in numbers, the call-wrapping code is usually about 10 extra assembly instructions. Thus if your C function is comparable in size, the overhead might be significant, so you may consider either making the call non-allocating or rewriting it in OCaml. But in general, C functions are much bigger, so the overhead is negligible.

As a final note, OCaml is not Python and is very efficient, so there is rarely if ever a need to reimplement an algorithm in C. The external interface is mostly used for calling existing libraries that are not available in OCaml, invoking system calls, calling high-performance mathematical libraries, and so on.
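For illustration, here is a minimal sketch of a non-allocating stub (the add_3 example from the side question qualifies, since it only manipulates an immediate integer):

#include <caml/mlvalues.h>

/* add_3 never allocates on the OCaml heap: it only works with an
   immediate (tagged) integer. Its OCaml declaration can therefore be

     external add_3 : int -> int = "add_3" [@@noalloc]

   which removes the GC setup code around each call. */
CAMLprim value add_3(value x)
{
    return Val_int(Int_val(x) + 3);
}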
Side question
Is the location of the returned value shared with the OCaml code, or does OCaml allocate a new piece of memory before using the value?
In your example, the returned value is an immediate value and is stored in a CPU register, i.e., it is not allocated. Int_val and Val_int are simple macros that translate between the C int representation and the OCaml int representation: Val_int shifts the value one bit to the left and sets the least significant bit, and Int_val does the reverse.
But in general, if a value is allocated with caml_alloc and friends, then it is allocated in the OCaml heap and is not copied (unless the GC moves it for its own purposes).
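To make that concrete, here is a hedged sketch of what a C-side make_list could look like. The iteration formula is a placeholder, but the allocation pattern (one boxed float and one cons cell per element, with GC roots registered) is the standard one, and every value it builds lives directly in the OCaml heap, so nothing is copied when the list is returned:

#include <caml/mlvalues.h>
#include <caml/alloc.h>
#include <caml/memory.h>

/* external make_list : float -> int -> float list = "make_list" */
CAMLprim value make_list(value first_val, value n)
{
    CAMLparam2(first_val, n);
    CAMLlocal3(list, cell, boxed);
    double cur = Double_val(first_val);
    int count = Int_val(n);

    list = Val_emptylist;                 /* [] */
    for (int k = 0; k < count; k++) {
        cur = cur * 0.5 + 1.0;            /* placeholder for next_iter */
        boxed = caml_copy_double(cur);    /* boxed float in the OCaml heap */
        cell = caml_alloc_small(2, 0);    /* cons cell: tag 0, 2 fields */
        Field(cell, 0) = boxed;
        Field(cell, 1) = list;
        list = cell;
    }
    CAMLreturn(list);                     /* same (reversed) order as the OCaml aux */
}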
I have some Fortran 90 code to optimize.
I'd like to grab the memory location of a structure in an outer loop, and then access the deepest structure through it in a nested loop.
Something like this:
sample fortran loop - legacy version
do i = 1, N
ii = some integer
jj = some other integer
do j = 1, M
c = a(ii, jj)%b(i)
enddo
enddo
has to become:
second fortran loop - what I would like to write
do i = 1, N
ii = some integer
jj = some other integer
pointertoa = &a(ii, jj) !I know it's not correct in fortran, that is the question!
do j = 1, M
c = pointertoa%b(i)
enddo
enddo
I have this (sample) C code working as I expect:
Working memory addressing in C
#include <stdio.h>
struct mem{
int a;
struct mm{
int b;
float v;
} mmm;
};
int main(void){
struct mem *m, dum;
dum.a = 12;
dum.mmm.b = 5;
dum.mmm.v = 3.2;
m = &dum; //m is given dum memory address
printf("dum.a = %d\n", dum.a);
printf("dum.mmm.b = %d\n", dum.mmm.b);
printf("dum.mmm.v = %f\n", dum.mmm.v);
printf("m.a = %d\n", m->a);
printf("m.mmm.b = %d\n", m->mmm.b);
printf("m.mmm.v = %f\n", m->mmm.v);
    return 0;
}
A couple of questions:
How would you do in Fortran 90 what I did in C?
Do you think the second Fortran loop would speed up the code?
Fortran will make it very very difficult for you to get the memory address of a variable or anything else for that matter. The tricks and techniques you may have learned in C, messing around with pointers and memory addresses, just aren't supported in Fortran. Nor, generally, are they needed within Fortran's core application domains. Your question rather suggests you are trying to write C in Fortran. Don't.
Now I've got that off my chest, you may be able to achieve what you want using the recently-introduced associate construct. Something like
associate(pointertoa => a(ii, jj))
do j = 1, M
c = pointertoa%b(i)
enddo
end associate
Whether this achieves your efficiency goals I haven't a scooby. But I'll be surprised if it does. Optimising access to array elements is something that Fortran compilers have been working on for 50+ years and they're really quite good at it.
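For reference, the C analogue of what associate buys you is just caching a pointer outside the inner loop, along the lines of this sketch (the types and index formulas are made up for illustration):

/* C analogue of the Fortran `associate` block: the address of a[ii][jj]
   is computed once per outer iteration instead of being recomputed on
   every inner iteration. */
struct cell { double b[64]; };

void walk(struct cell a[32][32], double *c, int N, int M) {
    for (int i = 0; i < N && i < 64; i++) {
        int ii = i % 32;                    /* "some integer" */
        int jj = (i * 7) % 32;              /* "some other integer" */
        const struct cell *p = &a[ii][jj];  /* ~ associate(pointertoa => a(ii, jj)) */
        for (int j = 0; j < M; j++)
            *c = p->b[i];                   /* ~ c = pointertoa%b(i) */
    }
}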
EDIT, in response to OP's first comment ...
If your compiler supports associate you can certainly use it. But if someone is going to look over your shoulder and hit you painfully on the head if you use any feature introduced to Fortran after publication of the 90 standard then it's up to you whether or not you take the hit. The compiler ain't going to care, nor is the compiled code. associate is part of the standard and Fortran has a very good record of maintaining backwards compatibility so the likelihood that a future compiler will get upset is very small.
In writing your C function don't forget to take advantage of loop unrolling, memory prefetching, multiple instruction pipelines, vector operations, common subexpression elimination, all that sort of shit. If you manage to write a C function that outperforms the product of a Fortran compiler with optimisation turned up to 11 come back with the data to prove it and I'll eat my hat.
And, while I'm writing again, I note that the loop
do j = 1, M
c = pointertoa%b(i)
enddo
is almost entirely redundant and a good optimising compiler would just create code to execute c = pointertoa%b(i) once.
This is an extension of the previously asked question: link. In short, I am trying to convert a C program into Matlab and am looking for your suggestions to improve the code, as it is not giving the correct output. Did I convert the xor in the best way possible?
C Code:
void rc4(char *key, char *data){
://Other parts of the program
:
:
i = j = 0;
int k;
for (k=0;k<strlen(data);k++){
:
:
has[k] = data[k]^S[(S[i]+S[j]) %256];
}
int main()
{
char key[] = "Key";
char sdata[] = "Data";
rc4(key,sdata);
}
Matlab code:
function has = rc4(key, data)
://Other parts of the program
:
:
i=0; j=0;
for k=0:length(data)-1
:
:
out(k+1) = S(mod(S(i+1)+S(j+1), 256)+1);
v(k+1)=double(data(k+1))-48;
C = bitxor(v,out);
data_show =dec2hex(C);
has = data_show;
end
It looks like you're doing bitwise XOR on 64-bit doubles. [Edit: or not, seems I forgot bitxor() will do an implicit conversion to integer - still, an implicit conversion may not always do what you expect, so my point remains, plus it's far more efficient to store 8-bit integer data in the appropriate type rather than double]
To replicate the C code, if key, data, out and S are not already the correct type, you can either convert them explicitly (e.g. key = int8(key)) or, if they're being read from a file, better still use the precision argument to fread() to create them as the correct type in the first place. If this is in fact already happening in the not-shown code, then you simply need to remove the conversion to double and let v be int8 as well.
Second, k is being used incorrectly: Matlab arrays are 1-indexed, so either k needs to loop over 1:length(data), or (if the zero-based value of k is needed, as it is for i and j) you need to index data by k+1.
(side note: who is x and where did he come from?)
Third, you appear to be constructing v as an array the same size as data. If this is correct, then you should move the bitxor() and following lines outside the loop: since they work on entire arrays, you're needlessly repeating them every iteration instead of doing them just once at the end when the arrays are full.
As a general aside, since converting C code to Matlab code can sometimes be tricky (and converting C code to efficient Matlab code very much more so), if it's purely a case of wanting to use some existing non-trivial C code from within Matlab then it's often far easier to just wrap it in a MEX function. Of course if it's more of a programming exercise or way to explore the algorithm, then the pain of converting it, trying to vectorise it well, etc. is worthwhile and, dare I say it, (eventually) fun.
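For the MEX route, a minimal gateway might look something like the sketch below. The rc4 prototype is an assumption: the question's version stores its output in a local has array, so here it is assumed to have been adapted to write into a caller-supplied buffer instead:

#include "mex.h"
#include <string.h>

/* Assumed adaptation of the existing C code: writes strlen(data)
   output bytes into `out`. */
void rc4(const char *key, const char *data, unsigned char *out);

/* Gateway: out = rc4_mex(key, data) */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 2)
        mexErrMsgTxt("usage: out = rc4_mex(key, data)");

    char *key  = mxArrayToString(prhs[0]);  /* copies of the Matlab strings */
    char *data = mxArrayToString(prhs[1]);
    size_t n   = strlen(data);

    /* hand the ciphertext back to Matlab as a uint8 row vector */
    plhs[0] = mxCreateNumericMatrix(1, n, mxUINT8_CLASS, mxREAL);
    rc4(key, data, (unsigned char *)mxGetData(plhs[0]));

    mxFree(key);
    mxFree(data);
}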
If A is an n x n matrix and x a vector of dimension n, is it then possible to pass x to GEMV as the argument to both the x and y parameter, with beta=0, to achieve the operation x ← A ⋅ x ?
I'm specifically interested in the Cublas implementation, with C interface.
No. And in Fortran this has nothing to do with the implementation: passing aliased actual arguments to any subprogram breaks the language standard unless those arguments are Intent(In). Thus if the interface has dummy arguments that are Intent(Out), Intent(InOut), or have no Intent, you should always use separate variables for the corresponding actual arguments when invoking the subprogram.
NO.
Each element of the output depends on ALL elements of the input vector x.
For example, if x is the input, y is the output, and A is the matrix, the i-th element of y is generated as
y_i = A_i1*x_1 + A_i2*x_2 + ... + A_in*x_n
So if you overwrite x_i with the result from above, the computation of some later y_r, which depends on x_i, will read the wrong input and produce an incorrect result.
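A tiny C sketch makes the hazard concrete. With a 2x2 matrix of ones and x = [1, 2], the correct product is [3, 3], but the aliased call produces [3, 5] because row 1 reads the already-overwritten x[0]:

#include <stdio.h>

/* Naive y = A*x; calling it with y == x demonstrates the aliasing bug. */
static void gemv(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        y[i] = s;                  /* if y == x, this clobbers x[i] */
    }
}

int main(void) {
    double A[4] = {1, 1, 1, 1};
    double x[2] = {1, 2};
    gemv(2, A, x, x);              /* aliased on purpose */
    printf("%g %g\n", x[0], x[1]); /* prints 3 5, not the correct 3 3 */
    return 0;
}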
EDIT
I was going to make this a comment, but it was getting too big. So here is the explanation why the above reasoning holds good for parallel implementations too.
This line of reasoning holds unless each parallel group / thread makes a local copy of the original data, in which case the original data can safely be destroyed.
However, doing so (making a local copy) is only practical and beneficial when:
(1) Each parallel thread / block would not be able to access the original array without a significant amount of overhead.
(2) There is enough local memory (call it cache, or shared memory, or even regular memory in the case of MPI) to hold a separate copy for each parallel thread / block.
Notes:
(1) may not be true for many multi-threaded applications on a single machine.
(1) may be true for CUDA but (2) is definitely not applicable for CUDA.
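The standard workaround is to compute into a scratch vector and copy the result back, sketched here with the CBLAS interface (the same shape applies to cublasDgemv, with the scratch vector in device memory):

#include <string.h>
#include <cblas.h>

/* x <- A*x done safely: GEMV writes into scratch, then the result is
   copied back over x. A is n x n, column-major. */
void gemv_inplace(int n, const double *A, double *x, double *scratch) {
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 0.0, scratch, 1);
    memcpy(x, scratch, (size_t)n * sizeof *x);
}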
I am working with cryptography and need to use some really large numbers. I am also using the new Intel instruction for carry-less multiplication, which requires the __m128i data type; this is loaded with a function that takes floating-point data as its arguments.
I need to store a 2^1223 integer and then square it and store that value as well.
I know I can use the GMP library, but I think it would be faster to create two data types that store values like 2^1224 and 2^2448; they would have less overhead. I am going to use Karatsuba to multiply the numbers, so the only operation I need to perform on the data type is addition, as I will be breaking the numbers down to fit __m128i.
Can someone point me towards material that can help me create integers of the size I need?
If you need your own datatypes (regardless of whether it's for math, etc), you'll need to fall back to structures and functions. For example:
struct bignum_s {
    char bignum_data[1024];
};
(obviously you want to get the sizing right, this is just an example)
Most people end up typedefing it as well:
typedef struct bignum_s bignum;
And then create functions that take two (or whatever) pointers to the numbers to do what you want:
/* takes two bignums and ORs them together, putting the result back into a */
void
bignum_or(bignum *a, bignum *b) {
int i;
for(i = 0; i < sizeof(a->bignum_data); i++) {
a->bignum_data[i] |= b->bignum_data[i];
}
}
You really want to end up defining nearly every function you might need, and this frequently includes memory allocation functions (bignum_new), memory freeing functions (bignum_free) and init routines (bignum_init). Even if you don't need them now, doing it in advance will set you up for when the code needs to grow and develop later.
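Since addition is the one operation the question needs, here is a hedged sketch of what a schoolbook add with carry propagation might look like on such a struct (64-bit limbs, least-significant limb first; the limb count is sized for the 2^2448-range values mentioned in the question):

#include <stdint.h>

#define BIGNUM_LIMBS 39  /* 39 * 64 = 2496 bits */

typedef struct bignum_s {
    uint64_t limb[BIGNUM_LIMBS];  /* least-significant limb first */
} bignum;

/* a += b; returns the final carry (nonzero means the fixed width overflowed) */
unsigned bignum_add(bignum *a, const bignum *b) {
    unsigned carry = 0;
    for (int i = 0; i < BIGNUM_LIMBS; i++) {
        uint64_t sum = a->limb[i] + b->limb[i] + carry;
        /* wrapped if the sum fell below a's limb, or if only the
           incoming carry pushed it exactly back to a's limb */
        carry = (sum < a->limb[i]) || (carry && sum == a->limb[i]);
        a->limb[i] = sum;
    }
    return carry;
}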
I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component, or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}
1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the positions into P and J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter<151; iter++)
{
    /* assumes P is 150x150 and data->J is 151x151, matching the
       indexing P[iter-1][jter-1] = -gamma*data->J[iter][jter] above */
    double* pP = &P[iter - 1][0];
    double* pJ = &data->J[iter][1];
    for(jter = 1; jter<151; jter++, pP++, pJ++ )
    {
        *pP = - gamma * *pJ;
    }
}
This way you move the array index calculations outside of the inner loop.
3) The best practice is to move as many calculations out of the loop as possible, much like I did in the loop above.
First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
Variables on the stack are not initialized for you, but there are faster ways to initialize them than an element-by-element loop. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).
Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which produces thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.
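For a flavour of what "blocking" means, here is a minimal cache-blocked matrix multiply sketch; the block size is a tunable guess, and a real library would add packing and vectorization on top:

#define BS 32  /* tile size: chosen so a BS x BS tile fits comfortably in cache */

/* C = A*B for n x n row-major matrices, processed in BS x BS tiles so
   that each tile of B is reused while it is still cache-resident. */
void matmul_blocked(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}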
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
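A sketch of that dynamic-allocation route; the pointer-to-array form keeps the J[i][j] indexing syntax intact:

#include <stdlib.h>
#include <string.h>

/* Heap-allocate the 151x151 Jacobian (~178 KB) instead of putting it
   on the stack. Caller must free() the result. */
double (*alloc_jacobian(void))[151] {
    double (*J)[151] = malloc(151 * sizeof *J);
    if (J != NULL)
        memset(J, 0, 151 * sizeof *J);  /* zero all 151*151 doubles */
    return J;
}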
In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization tricks such as counting down towards zero rather than up, or using ++i rather than i++, etc. But the compiler really should be able to handle such things for you.
As for the matrix multiplication, I don't know the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really need the high accuracy, I'd consider using float or int to speed up the algorithm.