If I try this in Racket:
(expt 2 1000)
I get a number many times bigger than the number of atoms in the universe:
10715086071862673209484250490600018105614048117055336074437503883703510511249361224931983788156958581275946729175531468251871452856923140435984577574698574803934567774824230985421074605062371141877954182153046474983581941267398767559165543946077062914571196477686542167660429831652624386837205668069376
I can even get crazier with (expt 2 10000) which still only takes a second on my T450 laptop. So as I understand, this is only possible because of tail recursion. Is this correct? If so, is Racket's tail recursion pure functional programming, or are there hidden side-effects going on behind the scenes? Also, when I see Common Lisp's loop, is it based on tail recursion under the hood? In general, I guess I'm wondering how these feats of recursion/looping are possible.
Racket uses a C library to implement large integers (bignums).
The library is called GMP:
https://gmplib.org/manual/Integer-Exponentiation.html
Now, the case of 2^n is pretty easy to implement in a binary representation.
You only need a 1 followed by n zeros. That is, GMP can compute the number very fast.
Tail calling is a wonderful thing, but it's important to understand that it doesn't make it possible to compute things that wouldn't be computable otherwise. In general, any code that's written in (say) a functional language with tail-calling can be written in another language using a loop. The advantage of a language with tail-calling is that programmers don't need to rewrite their recursive calls to loops in order to allow their programs to run.
It looks like you're focusing here on the ability of Racket (and Scheme) to compute with very large numbers. This is because, by default, Racket and Scheme use "bignums" to represent integers. Packages with bignum functionality are available for many languages, including C, but they can make for extra work in languages without garbage collection, because their representations are not of a bounded size.
Also, when I see Common Lisp's loop, is it based on tail recursion under the hood?
This is an implementation detail, but most likely not. First, CL already has TAGBODY, which makes LOOP expressible in terms of existing CL constructs.
For example, if I macroexpand a simple LOOP:
(loop)
I obtain a rather uniform result across implementations.
;; SBCL
(BLOCK NIL (TAGBODY #:G869 (PROGN) (GO #:G869)))
;; CCL
(BLOCK NIL (TAGBODY #:G4 (PROGN) (GO #:G4)))
;; JSCL
(BLOCK NIL (TAGBODY #:G869 (PROGN) (GO #:G869)))
;; ECL
(BLOCK NIL (TAGBODY #:G109 (PROGN) (GO #:G109)))
;; ABCL
(BLOCK NIL (TAGBODY #:G44 (GO #:G44)))
Implementations are typically written in languages that have jumps or loops, or that can emulate them easily. Moreover, a lot of CL implementations are compiled and can target assembly language that has jump primitives. So usually there is no need for an intermediate step that goes through tail-recursive functions.
That being said, implementing TAGBODY with tail-recursion seems doable.
For example JSCL cuts the expressions inside a tagbody into different methods, for each label, and those methods are called when using go: https://github.com/jscl-project/jscl/blob/db07c5ebfa2e254a0154666465d6f7591ce66e37/src/compiler/compiler.lisp#L982
Moreover, if I let the loop run for a while, no stack overflow happens. In that case, however, this is not due to tail-call elimination (which, AFAIK, is not implemented in all browsers). It looks like the code for tagbody always has an implicit while loop, and that go throws exceptions for the tagbody to catch.
I am moving from Python to C in the hope of a faster implementation, and trying to learn vectorization in C equivalent to Python vectorization. For example, assume that we have a binary array Input_Binary_Array; if I want to multiply each element at index i by 2**i and then sum all the non-zero results, in Python vectorization we do the following:
case 1 : Value = (2. ** (np.nonzero(Input_Binary_Array)[0] + 1)).sum()
Or if we do slicing and do elementwise addition/subtraction/multiplication, we do the following:
case 2 : Array_opr = (Input_Binary_Array[size:] * 2**Size - Input_Binary_Array[:-size])
C is a powerful low-level language, so a simple for/while loop is quite fast, but I am not sure whether there are equivalent vectorizations like Python's.
So, my question is, is there an explicit vectorization code for:
1.
multiplying all elements of an array
with a constant number (scalar)
2.
elementwise addition, subtraction, division for 2 given arrays of same size.
3.
slicing, summing, cumulative summing
Or is the simple for/while loop the only fast option to do the above operations, like the Python vectorization (cases 1 and 2)?
The answer is to either use a library to achieve those things, or write one. The C language by itself is fairly minimalist. That's part of the appeal. Some libraries out there include Intel MKL, and there's GSL, which has that along with a huge number of other functions, and more.
Now, with that said, I would recommend that if moving from Python to C is your plan, moving from Python to C++ is the better one. The reason that I say this is because C++ already has a lot of the tools you would be looking for to build what you like syntactically.
Specifically, you want to look at C++ std::vector, iterators, ranges and lambda expressions, all within C++20 and working rather well. I was able to make my own iterator on my own sort of weird collection and then have Linq style functions tacked onto it, with Linq semantics...
So I could say
mycollection<int> myvector = { 1, 2, 4, 5 };
Something like that anyway - the initializer expression rules I forget sometimes.
auto v = myvector
.where( []( auto& itm ) { return itm > 3; } )
.sum( []( auto& itm ) { return itm; } );
and get more or less what I expect.
Since you control the iterator down to every single detail you could ever want (and the std framework already thought of many), you can make it go as fast as you need, use multiple cores and so on.
In fact, I think the STL for MS and maybe GCC both actually let you swap in parallel algorithms where you just use them.
So C is good, but consider C++, if you are going that "C like" route. Because that's the only way you'll get the performance you want with the syntax you need.
Iterators basically let you wrap the concept of a for loop as an object.
So, my question is, is there an explicit vectorization code for:
1.
multiplying all elements of an array
with a constant number (scalar)
The C language itself does not have a syntax for expressing that with a single simple statement. One would ordinarily write a loop to perform the multiplication element by element, or possibly find or write a library that handles it.
Note also that as far as I am aware, the Python language does not have this either. In particular, the product of a Python list and an integer n is not scalar multiplication of the list elements, but rather a list with n times as many elements. Some of your Python examples look like they may be using Numpy, which can provide for that sort of thing, but that's a third-party package, analogous to a library in C.
elementwise addition, subtraction, division for 2 given arrays of same
size.
Same as above. Including this not being in Python, either, at least not as the effect of any built-in Python operator on objects of a built-in type.
slicing, summing, cumulative summing
C has limited array slicing, in that you can access contiguous subarrays of an array more or less as arrays in their own right. The term "slice" is not typically used, however.
C does not have built-in sum() functions for arrays.
Or is the simple for/while loop the only fast option to do the above operations, like the Python vectorization (cases 1 and 2)?
There are lots and lots of third-party C libraries, including some good ones for linear algebra. The C language and standard library does not have such features built in. One would typically choose among writing explicit loops, writing a custom library, and integrating a third party library based on how much your code relies on such operations, whether you need or can benefit from customization to your particular cases, and whether the operations need to be highly optimized.
The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of (tens of millions) of coefficients. Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, count_if from the algorithms library with an execution policy of par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use for example GCC vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
For example, GCC vectorizes such code on x86 if at least SSE4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you get wider vectors that do more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting GCC autodetect the optimal instruction set extensions (-march=native).
Note that in such code, the conditions should not use short-circuiting OR (||), because then the read from the max vector is semantically forbidden whenever the comparison with the min vector is already true for the current element, which severely hinders vectorization (though AVX-512 could potentially vectorize it, with somewhat catastrophic slowdown).
I'm pretty sure gcc is not nearly optimal in the code it generates for avx512, since it could do the k-reg (mask register) or in the mask registers with kor[b/w/d/q], but maybe somebody with more experience in avx512 (*cougth* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you may try another Intel performance library, IPP, which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html
I've recently learned about tail recursion as a way to write a recursive function that doesn't crash when you give it too big a number to work with. I realised that I could easily rewrite a tail-recursive function as a while loop and have it do basically exactly the same thing, which left me wondering: is there any use for recursion when you can do everything with a normal loop?
Yes, recursion code looks smaller and is easier to understand, but it also has a chance of completely crashing, while a simple loop cannot crash doing the same task.
I'll take the Haskell language as an example; it is purely functional:
Every function in Haskell is a function in the mathematical sense (i.e., "pure"). Even side-effecting IO operations are but a description of what to do, produced by pure code. There are no statements or instructions, only expressions which cannot mutate variables (local or global) nor access state like time or random numbers.
So, in Haskell a recursive function is tail recursive if the final result of the recursive call is the final result of the function itself. If the result of the recursive call must be further processed (say, by adding 1 to it, or consing another element onto the beginning of it), it is not tail recursive. (see here)
On the other hand, in many programming languages, calling a function uses stack space, so a recursive function can build up a large stack of calls to itself, which wastes memory. Since in a tail call the containing function is about to return, its environment can actually be discarded and the recursive call can be entered without creating a new stack frame. This trick is called tail call elimination or tail call optimisation and allows tail-recursive functions to recur indefinitely.
It's been a long while since I posted this question and my opinion on the topic has changed. Here's why:
I learned Haskell, and it's a language that fixes everything bad about recursion - recursive definitions and algorithms are turned into normal looping algorithms and most of the time you don't even use recursion directly and instead use map, fold, filter, or a combination of those. And with everything bad removed, the good sides of functional programming start to shine through - everything is closer to its mathematical definition, not obscured by clunky loops and variables.
To someone else who is struggling to understand why recursion is great, go learn Haskell. It has a lot of other very interesting features like being lazy (values are evaluated only when they're requested), immutable (variables can never be modified), pure (functions cannot do anything other than take input and return output, so no printing to the console), strongly typed with a very expressive type system, and filled with mind-blowing abstractions like Functor, Monad, State, and much more. I can almost say it's life-changing.
I have seen a lot of search algorithms to search in a binary sorted tree, but all of them are using the same way: recursion. I know recursion is expensive in comparison to loops, because every time we call the search function, a new stack frame is created for that method, which will eventually use a lot of memory if the binary search tree is too big.
Why can't we search the binary search tree like this:
while (root != NULL)
{
    if (root->data == data)
        return 0;
    else if (data > root->data)
        root = root->right;
    else
        root = root->left;
}
I think that this way is faster and more efficient than the recursive way, correct me if I am wrong!
Probably your way (which is the common way to code that in C) is faster, but you should benchmark, because some C compilers (e.g. recent GCC when invoked with gcc -O2 ...) are able to optimize most tail calls into a jump (passing values in registers). Tail call optimization means that a call stack frame is reused by the callee (so the call stack stays bounded). See this question.
FWIW, in OCaml (or Scheme, or Haskell, or most Common Lisp implementations) you would code a tail-recursive call and you know that the compiler is optimizing it as a jump.
Recursion is not always slower than loops (in particular for tail-calls). It is a matter of optimization by the compiler.
Read about continuations and continuation-passing style. If you know only C, learn some functional language (OCaml, Haskell, or Scheme with SICP ...) where tail calls are very often used. Read about call/cc in Scheme.
Yes, that's the normal way of doing it.
Theoretically, both your solution and the recursive solution have the same big-O complexity: they are both O(log n). If you want performance measured in seconds, you need to go practical: write the code of both methods (iterative and recursive), run them, and measure the run time.
I am new here so apologies if I did the post in a wrong way.
I was wondering if someone could please explain why is C so slow with function calling ?
It's easy to give a shallow answer to the standard question about recursive Fibonacci, but I would appreciate knowing the "deeper" reason, as deep as possible.
Thanks.
Edit1 : Sorry for that mistake. I misunderstood an article in Wiki.
When you make a function call, your program has to put several registers on the stack, maybe push some more stuff, and mess with the stack pointer. That's about all that can be "slow". Which is, actually, pretty fast. About 10 machine instructions on an x86_64 platform.
It's slow if your code is sparse and your functions are very small. This is the case of the Fibonacci function. However, you have to distinguish between "slow calls" and "slow algorithm": calculating the Fibonacci sequence with a recursive implementation is pretty much the slowest straightforward way of doing it. There is almost as much code involved in the function body as in the function prologue and epilogue (where the pushing and popping takes place).
There are cases in which calling functions will actually make your code faster overall. When you deal with large functions and your registers are crowded, the compiler may have a rough time deciding in which register to store data. However, isolating code inside a function call will simplify the compiler's task of deciding which register to use.
So, no, C calls are not slow.
Based on the additional information you posted in the comment, it seems that what is confusing you is this sentence:
"In languages (such as C and Java) that favor iterative looping constructs, there is usually significant time and space cost associated with recursive programs, due to the overhead required to manage the stack and the relative slowness of function calls;"
in the context of a recursive implementation of Fibonacci calculations.
What this is saying is that making recursive function calls is slower than looping but this does not mean that function calls are slow in general or that function calls in C are slower than function calls in other languages.
Fibonacci generation is naturally a recursive algorithm, and so the most obvious and natural implementation involves many function calls, but it can also be expressed as an iteration (a loop) instead.
The Fibonacci number generation algorithm can be written with a special property called tail recursion: the naive two-call version is not tail recursive (the results of the recursive calls must still be added together), but an accumulator-passing version is. A tail-recursive function can be easily and automatically converted into an iteration, even if it is expressed as a recursive function. Some languages, particularly functional languages where recursion is very common and iteration is rare, guarantee that they will recognize this pattern and automatically transform such a recursion into an iteration "under the hood". Some optimizing C compilers will do this as well, but it is not guaranteed. In C, since iteration is both common and idiomatic, and since the tail recursive optimization is not necessarily going to be made for you by the compiler, it is a better idea to write it explicitly as an iteration to achieve the best performance.
So interpreting this quote as a comment on the speed of C function calls, relative to other languages, is comparing apples to oranges. The other languages in question are those that can take certain patterns of function calls (which happen to occur in Fibonacci number generation) and automatically transform them into something that is faster, but is faster because it is actually not a function call at all.
C is not slow with function calls.
The overhead of calling a C function is extremely low.
I challenge you to provide evidence to support your assertion.
There are a couple of reasons C can be slower than some other languages for a job like computing Fibonacci numbers recursively. Neither really has anything to do with slow function calls though.
In quite a few functional languages (and languages where a more or less functional style is common), recursion (often very deep recursion) is quite common. To keep speed reasonable, many implementations of such languages do a fair amount of work optimizing recursive calls to (among other things) turn them into iteration when possible.
Quite a few also "memoize" results from previous calls -- i.e., they keep track of the results from a function for a number of values that have been passed recently. When/if the same value is passed again, they can simply return the appropriate value without re-calculating it.
It should be noted, however, that the optimization here isn't really faster function calls -- it's avoiding (often many) function calls.
Recursive Fibonacci is the reason, not the C language. Recursive Fibonacci is something like:
int f(int i)
{
    return i < 2 ? 1 : f(i-1) + f(i-2);
}
This is the slowest algorithm to calculate a Fibonacci number, and the stack space used to store the chain of pending calls makes it slower still.
I'm not sure what you mean by "a shallow answer to the standard question about Recursive Fibonacci".
The problem with the naive recursive implementation is not that the function calls are slow, but that you make an exponentially large number of calls. By caching the results (memoization) you can reduce the number of calls, allowing the algorithm to run in linear time.
Of all the languages out there, C is probably the fastest (unless you are an assembly language programmer). Most C function calls are 100% pure stack operations. Meaning, when you call a function, what this translates to in your binary code is: the CPU pushes any parameters you pass to your function onto the stack. Afterwards, it calls the function. The function then pops your parameters. After that, it executes whatever code makes up your function. Finally, any return parameters are pushed onto the stack, then the function ends and the parameters are popped off. Stack operations on any CPU are usually faster than anything else.
If you are using a profiler or something that is saying a function call you are making is slow, then it HAS to be the code inside your function. Try posting your code here and we will see what is going on.
I'm not sure what you mean. C is basically one abstraction layer on top of CPU assembly instructions, which is pretty fast.
You should clarify your question really.
In some languages, mostly of the functional paradigm, function calls made at the end of a function body can be optimized so that the same stack frame is re-used. This can potentially save both time and space. The benefit is particularly significant when the function is both short and recursive, so that the stack overhead might otherwise dwarf the actual work being done.
The naive Fibonacci algorithm will therefore run much faster with such optimization available. C does not generally perform this optimization, so its performance could suffer.
BUT, as has been stated already, the naive algorithm for the Fibonacci numbers is horrendously inefficient in the first place. A more efficient algorithm will run much faster, whether in C or another language. Other Fibonacci algorithms probably will not see nearly the same benefit from the optimization in question.
So in a nutshell, there are certain optimizations that C does not generally support that could result in significant performance gains in certain situations, but for the most part, in those situations, you could realize equivalent or greater performance gains by using a slightly different algorithm.
I agree with Mark Byers, since you mentioned the recursive Fibonacci. Try adding a printf, so that a message is printed each time you do an addition. You will see that the recursive Fibonacci is doing a lot more additions than it may appear at first glance.
What the article is talking about is the difference between recursion and iteration.
This is under the topic called algorithm analysis in computer science.
Suppose I write the fibonacci function and it looks something like this:
//finds the nth fibonacci
int rec_fib(int n) {
    if (n == 1)
        return 1;
    else if (n == 2)
        return 1;
    else
        return rec_fib(n - 1) + rec_fib(n - 2);
}
Which, if you write it out on paper (I recommend this), you will see this pyramid-looking shape emerge.
It's taking A Whole Lotta Calls to get the job done.
However, there is another way to write fibonacci (there are several others too)
int fib(int n) //this one taken from scriptol.com, since it takes more thought to write it out.
{
    int first = 0, second = 1;
    int tmp;
    while (n--)
    {
        tmp = first + second;
        first = second;
        second = tmp;
    }
    return first;
}
This one only takes time directly proportional to n, instead of the big pyramid shape you saw earlier that grew out in two dimensions.
With algorithm analysis you can determine exactly the speed of growth in terms of run-time vs. size of n of these two functions.
Also, some recursive algorithms are fast (or can be tricked into being faster). It depends on the algorithm, which is why algorithm analysis is important and useful.
Does that make sense?