I implemented a Monte Carlo simulation in R, and it takes a long time because of the for loops. Then I realized that I can do the for loops in C through the R API. So I generate my vectors and matrices in R, call C functions (which do the for loops), and finally present my results in R. However, I only know the basics of C and cannot figure out how to translate some functions into C. For instance, I start with a call in R like this:
t=sample(1:(P*Q), size=1)
How can I do this in C? Also, I have this expression in R:
A.q=phi[,which(q==1)]
How can I replicate the "which" expression in C?
Before you start writing C code, you would be better off rewriting your R code to make it run faster. sample is vectorised. Can you move the call to it outside the loop? That should speed things up. Even better, can you get rid of the loop entirely?
Also, you don't need to use which when you are indexing. R accepts logical vectors as indices. Compare:
A.q=phi[,which(q==1)]
A.q=phi[,q==1]
Finally, I recommend not calling your variables t or q since there are functions with those names. Try giving your variables descriptive names instead - it will make your code more readable.
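If you do eventually move the loops to C, the two operations might look roughly like this. This is only a sketch with hypothetical helper names; note that in code called through R's C API you would normally use R's own unif_rand() (bracketed by GetRNGstate()/PutRNGstate()) rather than rand(), so the draws respect R's RNG state and seed:

#include <stdlib.h>   /* rand */

/* Draw one integer uniformly from 1..n, like sample(1:n, size=1) in R.
   rand() % n has a slight modulo bias; fine for a sketch. */
int sample_one(int n)
{
    return rand() % n + 1;
}

/* Equivalent of which(q == 1): collect the indices where q[i] == 1.
   Returns the number of matches; matching indices are written to out. */
int which_equal(const int *q, int len, int *out)
{
    int count = 0;
    for (int i = 0; i < len; i++)
        if (q[i] == 1)
            out[count++] = i;   /* 0-based, unlike R's 1-based indices */
    return count;
}

A column subset like phi[, q == 1] then becomes a loop that copies (or simply indexes) the columns whose index appears in out.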
I am moving from Python to C in the hope of a faster implementation, and I am trying to learn whether C has vectorization equivalent to Python's. For example, assume that we have a binary array Input_Binary_Array. If I want to take each non-zero element at index i, weight it by a power of two based on i, and sum the results, in vectorized Python we do the following:
case 1 : Value = (2. ** (np.nonzero(Input_Binary_Array)[0] + 1)).sum()
Or, if we slice and do elementwise addition/subtraction/multiplication, we do the following:
case 2 : Array_opr = Input_Binary_Array[size:] * 2**size - Input_Binary_Array[:-size]
C is a powerful low-level language, so a simple for/while loop is quite fast, but I am not sure that there is no equivalent vectorization like Python's.
So, my question is: is there explicit vectorization syntax in C for:
1. multiplying all elements of an array by a constant number (scalar),
2. elementwise addition, subtraction, division for 2 given arrays of the same size,
3. slicing, summing, cumulative summing,
or is the simple for/while loop the only option to do the above operations as fast as Python vectorization (cases 1 and 2)?
The answer is to either use a library to achieve those things, or write one. The C language by itself is fairly minimalist; that's part of the appeal. Some libraries out there include Intel's MKL, and there's the GSL, which has that along with a huge number of other functions, and more.
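Absent a library, the plain-loop C version of case 1 is short. A minimal sketch (hypothetical names; assumes the array holds 0/1 values):

#include <stddef.h>   /* size_t */

/* Sum 2^(i+1) over every index i where the binary array is non-zero,
   i.e. the loop form of (2. ** (np.nonzero(arr)[0] + 1)).sum(). */
double case1_sum(const unsigned char *arr, size_t n)
{
    double total = 0.0;
    double pow2 = 2.0;              /* 2^(i+1) for i = 0 */
    for (size_t i = 0; i < n; i++) {
        if (arr[i])
            total += pow2;
        pow2 *= 2.0;
    }
    return total;
}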
Now, with that said, I would recommend that if moving from Python to C is your plan, moving from Python to C++ is the better one. The reason I say this is that C++ already has a lot of the tools you would be looking for to build what you want syntactically.
Specifically, you want to look at C++ std::vector, iterators, ranges, and lambda expressions, all within C++20 and working rather well. I was able to make my own iterator on my own sort of weird collection and then have LINQ-style functions tacked onto it, with LINQ semantics...
So I could say
mycollection<int> myvector = { 1, 2, 4, 5 };
Something like that anyway - I sometimes forget the initializer-expression rules.
auto v = myvector
    .where( []( auto& itm ) { return itm > 3; } )
    .sum( []( auto& itm ) { return itm; } );
and get more or less what I expect.
Since you control the iterator down to every single detail you could ever want (and the std framework already thought of many), you can make it go as fast as you need, use multiple cores and so on.
In fact, I think the STL implementations from Microsoft and maybe GCC both ship drop-in parallel versions of the standard algorithms that you can just use.
So C is good, but consider C++ if you are going that "C-like" route, because that's the only way you'll get the performance you want with the syntax you need. Iterators basically let you wrap the concept of a for loop as an object.
So, my question is, is there an explicit vectorization code for:
1. multiplying all elements of an array with a constant number (scalar)
The C language itself does not have a syntax for expressing that with a single simple statement. One would ordinarily write a loop to perform the multiplication element by element, or possibly find or write a library that handles it.
Note also that as far as I am aware, the Python language does not have this either. In particular, the product of a Python list and an integer n is not scalar multiplication of the list elements, but rather a list with n times as many elements. Some of your Python examples look like they may be using NumPy, which can provide for that sort of thing, but that's a third-party package, analogous to a library in C.
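The loop one would ordinarily write is only a few lines. A minimal sketch (hypothetical names):

#include <stddef.h>   /* size_t */

/* Multiply every element of the array by a scalar k, in place. */
void scale(double *a, size_t n, double k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}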
2. elementwise addition, subtraction, division for 2 given arrays of same size.
Same as above. This is not built into Python either, at least not as the effect of any built-in operator on objects of a built-in type.
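Again as a sketch, the elementwise loop in C (subtraction and division follow the same pattern):

#include <stddef.h>   /* size_t */

/* Elementwise addition of two equal-length arrays into out. */
void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}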
3. slicing, summing, cumulative summing
C has limited array slicing, in that you can access contiguous subarrays of an array more or less as arrays in their own right. The term "slice" is not typically used, however.
C does not have built-in sum() functions for arrays.
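They are only a few lines to write, though. A sketch (hypothetical names):

#include <stddef.h>   /* size_t */

/* Sum of an array. Since a "slice" in C is just a pointer offset plus
   a length, sum(a + 2, n - 2) sums the Python-style slice a[2:]. */
double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Cumulative sum: out[i] = a[0] + ... + a[i]. */
void cumsum(const double *a, double *out, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        out[i] = (s += a[i]);
}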
or, the simple for, while loop is the only faster option to do above operations like python vectorization (case 1, 2)?
There are lots and lots of third-party C libraries, including some good ones for linear algebra. The C language and standard library do not have such features built in. One would typically choose among writing explicit loops, writing a custom library, and integrating a third-party library, based on how much your code relies on such operations, whether you need or can benefit from customization to your particular cases, and whether the operations need to be highly optimized.
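To give a taste of the library route, here is scalar multiplication through the CBLAS interface to BLAS (a sketch, assuming a CBLAS implementation such as OpenBLAS is installed and linked, e.g. with -lopenblas):

#include <cblas.h>

int main(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    cblas_dscal(4, 2.5, x, 1);   /* x <- 2.5 * x, case 1's scaling */
    return 0;
}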
In a MATLAB program I make many calls (over 3 million) to a function that converts local coordinates to global coordinates in an image; it is just a simple transformation. My whole code takes 6 minutes to run, and the coordinate conversion function takes 20% of that time.
How can I optimize this code?
function LMP_glb = do(pnt_val,LMP,NP_glb)
    NP_co = ones(1,3)*round(pnt_val+1);
    LMP_glb = [NP_glb(1:3) + LMP(1:3) - NP_co(1:3)]; %basic operations
end
Note: this function is called from several other functions in my code (not in a single endless for loop).
Thank you.
You can make this a little bit simpler:
function LMP_glb = do(pnt_val,LMP,NP_glb)
    LMP_glb = NP_glb(1:3) + LMP(1:3) - round(pnt_val+1);
end
as MATLAB will handle expanding the last part of the expression from a scalar to an array for you.
I can't see much of a way to optimize it beyond that, as it isn't doing much. If you were to inline this (i.e. put the expression directly in your code rather than calling out to it as a function), you would additionally save the time taken in the overhead of a function call - if there are 3 million calls, that could be expensive.
CitizenInsane's suggestion is also a good one - if it's possible to precompute this function for all your coordinates at once and use the results later, it might be possible to vectorize the operation for additional speedup.
Lastly, it would be possible (if you're comfortable with a little C), to implement this function in C, compile it, and call it as a MEX function. This may or may not be faster - you'd need to try it out.
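For reference, such a MEX function might look like the following. This is only a sketch under my reading of the original function (hypothetical file name lmp_glb.c; build with "mex lmp_glb.c" and call as LMP_glb = lmp_glb(pnt_val, LMP, NP_glb)):

#include <math.h>   /* round */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double pnt_val = mxGetScalar(prhs[0]);   /* scalar pnt_val */
    double *LMP    = mxGetPr(prhs[1]);       /* 1x3 local coordinates */
    double *NP_glb = mxGetPr(prhs[2]);       /* 1x3 global coordinates */
    double np_co   = round(pnt_val + 1.0);   /* the scalar behind NP_co */
    double *out;
    int i;

    plhs[0] = mxCreateDoubleMatrix(1, 3, mxREAL);
    out = mxGetPr(plhs[0]);
    for (i = 0; i < 3; i++)
        out[i] = NP_glb[i] + LMP[i] - np_co;
}

Whether this beats the one-line MATLAB version depends on the per-call overhead, so measure before committing to it.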
I guess your do function is called inside some kind of for loop. If your for loop doesn't have data interdependences, you can simply change it to parfor, and MATLAB will start parallel pools to take advantage of a multicore processor, i.e. parallel computing. Even if your data has dependences inside the for loops, you can use some tricks to get rid of them (declare additional matrices or arrays to hold another copy of the data structure, trading memory usage for speed).
As mentioned by Sam, you can rewrite the function in C/C++, compile it with the MATLAB MEX compiler, and call the compiled MEX function directly. In my experience, a C/C++ MEX function is faster than MATLAB built-ins for simple arithmetic.
Parfor guide:
http://www.mathworks.com/help/distcomp/getting-started-with-parfor.html
MEX example:
http://www.mathworks.com/help/matlab/matlab_external/standalone-example.html
Let us assume that p and q are lists in Python of common length n. Each list contains the contents of range(n) in some order (which is important!). We can assume that n is small (i.e. does not exceed 2^16). I now define an operation on these lists using the following code
def mult(p, q):
    return [q[i] for i in p]
Clearly mult(p,q) is again a list containing the contents of range(n) in some order. This python code is an example of the composition of permutations (see http://en.wikipedia.org/wiki/Permutation).
I would like to make this code run as fast as possible in Python. I tried replacing p and q by numpy arrays to see if this would speed things up but the difference was negligible under timeit tests (note that numpy is not designed with the above function in mind). I also wrote a C extension for Python to try and speed things up but this did not seem to help (I was however using functions such as PySequence_Fast_GET_ITEM which are likely the same functions that Python itself uses).
Would it be possible to write a new type for Python in C (as is described here http://docs.python.org/2/extending/newtypes.html) which would have the property that the above mult function would be fast(er)? Or, indeed, write any program in C which would give Python such a type.
I am asking this question to see whether or not I am barking up the wrong tree. In particular, is there essentially some inherent property of Python which means this can never be sped up? I am mostly interested in Python 2.7 but would be interested to know of any solutions for Python 3+.
As Abid Rahman's comment indicates, using NumPy properly is a better bet than implementing your own C data structure.
import numpy as np
p = np.array(range(1000))
q = np.array(range(1000))
%timeit [q[i] for i in p]
# 1000 loops, best of 3: 312 us per loop
%timeit q[p]
# 100000 loops, best of 3: 4.31 us per loop
NumPy basically does what you were hoping to do yourself (push the array access down to the C level). However, if you just do a list comprehension, all the looping will be handled in Python, so it won't be much faster than the original with regular Python lists.
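To see what "pushing the array access down to the C level" amounts to, here is the core loop a C extension (or, conceptually, NumPy itself) would run; a sketch only, with hypothetical names:

#include <stddef.h>   /* size_t */

/* Compose permutations: out[i] = q[p[i]], with no per-element
   Python object creation or method dispatch. */
void compose(const long *p, const long *q, long *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = q[p[i]];
}

The cost of crossing the Python/C boundary is then paid once per call rather than once per element, which is why q[p] wins so decisively.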
My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that in C, if I want to add two vectors, multiply matrices, or do any other matrix computation, I should use a for loop.
I would appreciate it if you could let me know whether or not there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command, and adding those elements to another vector using one command.
Thanks
MATLAB suggests you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are presumably optimized, and they work directly on the underlying data instead of at the MATLAB language level - a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking for is not directly possible: you will need to use loops to work on vectors and matrices. In practice, you would look for a library that lets you do most of the work without writing for loops directly, by using already-defined functions that wrap them.
As was mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the one produced in C. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and uses the cache well, it will be really fast.
But I think what you are looking for are libraries; search Google for LAPACK or BLAS, they might be what you are looking for.
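For instance, elementwise vector addition through the CBLAS interface to BLAS looks like this (a sketch, assuming an implementation such as OpenBLAS or ATLAS is installed and linked):

#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[3] = { 1.0, 2.0, 3.0 };
    double y[3] = { 10.0, 20.0, 30.0 };
    cblas_daxpy(3, 1.0, x, 1, y, 1);          /* y <- 1.0*x + y */
    printf("%g %g %g\n", y[0], y[1], y[2]);   /* prints 11 22 33 */
    return 0;
}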
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of the operations, but in the end you will always be using for loops to process your data.
As for speed, C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside, you will notice that things MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.
I am trying to learn MATLAB, and one of the first problems I encountered was to guess the background from an image sequence with a static camera and moving objects. For a start I just want to compute a mean or median over time for each pixel, so it's a single function I would like to apply along one dimension of a 4-dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote, which contains 4 nested loops. I use preallocation, but it is still extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images of 640x480x3.
function background = calc_background(stack)
    tic;
    si = size(stack,1);
    sy = size(stack,2);
    sx = size(stack,3);
    sc = size(stack,4);
    background = zeros(sy,sx,sc);
    A = zeros(si,1);
    for x = 1:sx
        for y = 1:sy
            for c = 1:sc
                for i = 1:si
                    A(i) = stack(i,y,x,c);
                end
                background(y,x,c) = median(A);
            end
        end
    end
    background = uint8(background);
    disp(toc);
end
Could you tell me how to make this code much faster? I tried experimenting with getting the data directly from the array using only indexing, and it seems MUCH faster: it completes in 3 seconds vs. 20 seconds, so that's about a 7x performance difference, just from writing a smaller function.
function background = calc_background2(stack)
    tic;
    % bad code, confusing
    % background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3))));
    % good code (credits: Laurent)
    background = uint8(squeeze(median(stack,1)));
    disp(toc);
end
So now I don't understand: if MATLAB can be this fast, why is the nested-loop version so slow? I am not doing any dynamic resizing, and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, like it would happen naturally in C++?
Or should I get used to the idea of programming MATLAB in this crazy one-line-statement way to get optimal performance?
Update
Thank you for all the great answers, now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3) didn't make any sense; it is exactly the same as stack. I was just lucky that median defaults to working along the 1st dimension.
I think it's better to ask how to write efficient vectorized functions in another question, so I asked it here:
How to write vectorized functions in MATLAB
If I understand your question, you're asking why MATLAB is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from MATLAB's website, which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.
MATLAB is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process, since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that MATLAB takes far more time parsing/compiling than actually executing your code.
Built-in functions, on the other hand, are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural languages and for loops, but there's almost always a nice and fast way to do the same things in a vectorized way.
* To be complete, and to give honour where honour is due: recent versions of MATLAB actually try to accelerate loops by analyzing repeated operations and compiling chunks of repetitive operations into native executable code. This is called Just-In-Time (JIT) compilation and was pointed out by Jonas in the comments.
Original answer:
If I understood well (and you want the median along the first dimension), you might try:
background = uint8(squeeze(median(stack,1)));
Well, the difference between the two is their method of executing code. To sketch it very roughly: in C you feed your code to a compiler, which will try to optimize it, or at any rate convert it to machine code. This takes some time, but when you actually execute your program, it is already machine code and therefore runs very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter interprets each line of code and transforms it into a sequence of machine instructions on the fly. This is a bit slower if you write for loops, as it has to interpret the code over and over again (at least in principle; there are other overheads which may matter more in the newest versions of MATLAB). Here the hurdle is the fact that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not useful to perform time-consuming optimizations that might greatly increase performance in some cases when they will cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would mean writing nested for loops yourself, as sketched below. You have to worry very little about data types, memory management, ...
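For instance, the matrix multiplication that MATLAB writes as C = A*B would, in plain C, be the classic nested loops (a sketch for square, row-major matrices; real libraries block and vectorize this heavily):

#include <stddef.h>   /* size_t */

/* C = A * B for n-by-n matrices stored in row-major order. */
void matmul(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}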
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++ if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations this is the case as well, but you won't notice the speed of these calculations, as most of your time is spent in management code (passing variables, allocating memory, ...).
To sum it all up: yes you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes a median of stack. Your code requires 11 lines of code to express the same, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB the same way you'd program in C/C++, nor should you do it the other way round. Each language has its stronger and weaker points; learn them and use each language for what it's made for. E.g. you could write a whole compiler or web server in MATLAB, but in general that will be really slow, as MATLAB was not intended for handling or concatenating strings (it can, but it may be very slow).