Assessing Haskell speed

In short: I need to check Haskell's speed on simple operations. Currently I get poor results, but I'm not sure whether I'm doing compilation/optimization right.
UPD: The problem is answered, see the comments - the trouble was the integer type: without a type signature GHC defaults fib to arbitrary-precision Integer, and an explicit signature such as fib :: Int -> Int makes it dramatically faster.
In detail:
I work on a project where a number of services do bulk processing of data, so at least some of these services simply need to be fast. They do heavy calculations and manipulations on the data, not just extract-and-load. In other words, it is a matter of how many instances and hours will be spent on, say, every 1e15 records of data.
Currently we are considering adding a few more services to the project, and some colleagues are curious to try writing them in a language different from those already used. I'm more or less OK with that, but I insist we check the "core performance" of the proposed languages first. Of course speed testing is hard and controversial, so I propose we use a very simple test, with simple operations and without complex data structures etc. We agreed on the "naive recursive" Fibonacci function:
fib x
  | x < 2     = x
  | otherwise = fib (x - 2) + fib (x - 1)

main = print (fib 43)
I wrote it in several languages for comparison. The C version looks like:
#include <stdio.h>

int fib(int x) {
    return (x < 2) ? x : (fib(x - 2) + fib(x - 1));
}

int main(void) {
    printf("%d\n", fib(43));
}
I compile the former with ghc -O2 test.hs and the latter with gcc -O2 test.c. The GHC version is 8.4.3. Right now I see results differing by a factor of about 12 (2.2 seconds for the C version, 3 seconds for a Java version, and 30 seconds for Haskell on the same machine).
I wonder whether I got the compiler and compiler options right, because I thought that since Haskell compiles to native code (?) it should be comparable to C and similar languages. I need hints on this.
P.S. Please don't tell me that the Fibonacci function could be written without "exponential" recursion - I know - but as I said, we need a test with a lot of simple calculations.
P.P.S. I don't mean that if we can't make Haskell faster we won't use it. But we would probably reserve it for some other service where time is spent mainly on input/output, so it won't matter. For the current batch of "speed-critical" services it's simply a matter of whether the customer pays $10,000 or $120,000 per month for these instances.

Related

What optimizations should be left for compiler?

Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:
v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
    return x * x;
}

double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
    return 0.5 * mass * sum;
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
    {
        double *w = v + i*3;
        sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
    }
    return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
Now my question is: Since many C compilers can do this kind of optimization automatically, where should the developer rely on the compiler, and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware, as well as of how C code translates to assembly? If not, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms, etc.
Once I've found the bottleneck, how exactly did I benchmark it, and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that roughly 9 out of 10 strange performance questions can be explained by incorrect benchmarking. That includes benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves - there are far too many things to cover in an SO answer. There is a huge difference between optimizing code for a low-end microcontroller and for a 64-bit desktop PC (and everything in between).
One thing that looks a bit like premature optimization, but could just be unawareness of the language's capabilities, is that you have all of the information describing the particles flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read, by creating a struct to hold the three data points of each particle. At that point you can create functions which take a single particle, or multiple particles, and do computations on them.
This will be much easier than having to pass three times the number of particles as arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.
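For illustration, here is a minimal sketch of that struct-based layout; the type and function names are made up for this example, not taken from the question:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical struct-per-particle layout suggested above. */
typedef struct {
    double vx, vy, vz;              /* velocity components of one particle */
} Particle;

static double sqr(double x) { return x * x; }

static double kinetic_energy(const Particle *p, size_t n, double mass)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += sqr(p[i].vx) + sqr(p[i].vy) + sqr(p[i].vz);
    return 0.5 * mass * sum;
}

int main(void)
{
    Particle ps[2] = { { 1.0, 2.0, 3.0 }, { 0.5, 0.5, 0.5 } };
    printf("KE = %f\n", kinetic_energy(ps, 2, 1.0));
    return 0;
}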
Looking at how both gcc and clang handle your code, the micro-optimisation you contemplate is futile. The compilers already apply standard common subexpression elimination techniques that remove the overhead you are trying to eliminate.
As a matter of fact, the code generated handles 2 components at a time using XMM registers.
If performance is a must, then here are steps that will save the day:
the real judge is the wall clock: write a benchmark with realistic data and measure performance until you get consistent results (a minimal harness is sketched after this list).
if you have a profiler, use it to determine where the bottlenecks are, if any. Changing the algorithm for the parts that appear to hog performance is an effective approach.
try and get the best from the compiler: study the optimization options and let the compiler use more aggressive techniques if they are appropriate for the target system. For example, -mavx512f -mavx512cd let gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.
This is a non-intrusive technique, as the code does not change, so you don't risk introducing new bugs by hand-optimizing it.
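As a sketch of the first step, a minimal wall-clock harness for the routine from the question might look like the following. The mass is passed explicitly here, POSIX clock_gettime is assumed, and the array size and repetition count are arbitrary; compile with -O2.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double sqr(double x) { return x * x; }

/* Same routine as in the question, with mass passed explicitly. */
static double get_kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
    return 0.5 * mass * sum;
}

int main(void)
{
    enum { N = 1000000, REPS = 100 };
    double *v = malloc(3 * N * sizeof *v);
    if (!v) return 1;
    for (int i = 0; i < 3 * N; i++)
        v[i] = (double)rand() / RAND_MAX;        /* filler "realistic" data */

    struct timespec t0, t1;
    double sink = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < REPS; r++)
        sink += get_kinetic_energy(v, N, 1.0);   /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d calls: %.3f s (checksum %g)\n", REPS, secs, sink);
    free(v);
    return 0;
}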
Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.
Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):
double get_kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;
    for (int i = 0; i < 3 * n; i++)
        sum += v[i] * v[i];
    return 0.5 * mass * sum;
}
Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.
They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.
Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.
For example, given
void test(char *p)
{
    char *e = p + 5;
    do
    {
        p[0] = p[1];
        p++;
    } while (p < e);
}
when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 7);. While it would be theoretically possible that a library implementation of memmove might optimize the n==7 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it would seem far more likely that any plausible implementations would spend some time analyzing what needs to be done, and then after doing that spend just as long executing the loop as would code generated using -O0.
What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.

Where can I find the mathematical routines in GCC's source? How do the math functions work?

Good night. I'm a math bachelor and I'm studying log() and series, and I want to see how GCC computes all that stuff; it would help me a lot. There's nothing inside math.h - I already read it. I'm going crazy trying to find out how GCC computes logarithms and square roots using the fastest method ever. I downloaded the source but I can't find where the math routines are.
https://github.com/gcc-mirror/gcc
I just want to see it, I'm not a good programmer at all, my thing is math.
Mathematical functions are part of the C standard library, and GCC just uses those. If you want to look at the source code, you can either download the source code from the official glibc website (for the GNU C Library version, which is one of the most used), or use an online code browser. Here's the code for log() for example.
Since you are saying you're not that much of a programmer, though, I doubt you'll find the GNU C Standard Library comprehensible. It is the result of decades of optimizations and compatibility adjustments, and the code is very complex. I would suggest taking a look at the musl C Library instead. The source code is much cleaner and better commented. Here's the log() function, and here are all the files regarding mathematical functions.
Finally, neither GCC nor the C library has "the fastest method ever" to compute such functions. The goal of the C library is not to provide the fastest possible implementation of each mathematical function, but to provide a good enough implementation while still being portable enough to be used on multiple architectures, so those are still really fast, but most likely not "the fastest ever". In the best case, some mathematical functions can even be reduced to a single CPU instruction, if the CPU supports fast built-in hardware mathematical operations (for example Intel x86 with fsqrt for the square root).
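As a small illustration of that last point, with optimization enabled a compiler will often lower a plain sqrt call to the hardware instruction (for example sqrtsd on x86-64) instead of a library call, provided errno handling and the floating-point settings allow it - this is a property of the target and the flags, not a guarantee:

#include <math.h>
#include <stdio.h>

/* With optimization (and e.g. -fno-math-errno on gcc/clang), the sqrt call
   below is typically compiled to a single hardware square-root instruction
   rather than a call into libm. */
double hypotenuse(double x, double y)
{
    return sqrt(x * x + y * y);
}

int main(void)
{
    printf("%f\n", hypotenuse(3.0, 4.0));
    return 0;
}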
Take a look at this log implementation.
This is from fdlibm, which has implementations (following IEEE-754) of a lot of math functions in C, written to be readable by humans.
From the implementation:
Method
Argument Reduction: find k and f such that
x = 2^k * (1+f),
where sqrt(2)/2 < 1+f < sqrt(2) .
Approximation of log(1+f).
Let s = f/(2+f) ; based on log(1+f) = log(1+s) - log(1-s)
= 2s + 2/3 s**3 + 2/5 s**5 + .....,
= 2s + s*R
We use a special Reme algorithm on [0,0.1716] to generate polynomial of degree 14 to approximate R The maximum error of this polynomial approximation is bounded by 2**-58.45. In other words,
R(z) ~ Lg1*s**2 + Lg2*s**4 + Lg3*s**6 + Lg4*s**8 + Lg5*s**10 + Lg6*s**12 + Lg7*s**14
(the values of Lg1 to Lg7 are listed in the program)
and
|Lg1*s**2 + ... + Lg7*s**14 - R(z)| <= 2**-58.45
Note that 2s = f - s*f = f - hfsq + s*hfsq, where hfsq = f*f/2. In order to guarantee error in log below 1ulp, we compute log by
log(1+f) = f - s*(f - R) (if f is not too large)
log(1+f) = f - (hfsq - s*(hfsq+R)). (better accuracy)
Finally,
log(x) = k*ln2 + log(1+f).
= k*ln2_hi+(f-(hfsq-(s*(hfsq+R)+k*ln2_lo)))
Here ln2 is split into two floating point numbers:
ln2_hi + ln2_lo,
where n*ln2_hi is always exact for |n| < 2000.
The real implementation, with the special cases described in the explanation, can be checked at this link.
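A rough sketch of the scheme described above follows. The degree-14 minimax polynomial is replaced here by a few terms of the plain series, so accuracy is nowhere near the real fdlibm/glibc code, and the special cases are only crudely handled:

#include <math.h>
#include <stdio.h>

/* Toy version of the fdlibm scheme: write x = 2^k * (1+f) with 1+f near 1,
   then approximate log(1+f) with 2s + 2/3 s^3 + 2/5 s^5 + ... truncated. */
static double toy_log(double x)
{
    const double ln2     = 0.69314718055994530942;
    const double sqrt1_2 = 0.70710678118654752440;

    if (x <= 0.0)                        /* real code handles 0/neg/NaN/Inf */
        return x == 0.0 ? -HUGE_VAL : NAN;

    int k;
    double m = frexp(x, &k);             /* x = m * 2^k, m in [0.5, 1) */
    if (m < sqrt1_2) {                   /* keep 1+f in [sqrt(2)/2, sqrt(2)) */
        m *= 2.0;
        k -= 1;
    }
    double f  = m - 1.0;
    double s  = f / (2.0 + f);
    double s2 = s * s;
    double log1pf = 2.0 * s
                  + s * s2 * (2.0/3.0 + s2 * (2.0/5.0 + s2 * (2.0/7.0)));
    return k * ln2 + log1pf;
}

int main(void)
{
    for (double x = 0.1; x < 1000.0; x *= 4.7)
        printf("x = %10.4f   toy = %.12f   libm = %.12f\n",
               x, toy_log(x), log(x));
    return 0;
}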
Functions like log are part of the math library that's commonly called "libm". Implementations of the standard C library typically come with an implementation of libm, so what you are looking for is most likely in glibc. You can find the implementation of log in glibc here: https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/e_log.c.html
There are some comments in the source code that give you hints about the algorithm used, but not a detailed explanation.
Of course there are different implementations of libm - for example there is openlibm and netlib fdlibm. The documentation of both explain the algorithm used. Here is how log is implemented in openlibm: https://github.com/JuliaMath/openlibm/blob/master/src/e_log.c
(Interesting - looks like log in openlibm and fdlibm have come from the same source)

How to measure the quality of my code?

I have some experience with coding, but one of the biggest questions that keeps nagging me is how to improve my code.
I always check the complexity, readability and correctness of the code, but my question is how I can measure the size and running time of specific constructs.
For example:
when I have the following problem:
A is an integer
B is an integer
C is an integer
if A is bigger than B, assign C = A
else, C = B
for that problem, we have 2 simple solutions:
1. use an if-else statement
2. use the ternary operator
As a dry check I compared the size of the source files before compilation: the second solution's file is about half the size of the first (for 1,000,000 operations the difference is several MB).
My question is how I can measure the time difference between pieces of code that perform the same operation with different constructs, and how much the compiler optimizes constructs that are as close to each other as the two in the example.
The best and most direct way is to check the assembly code generated by your compiler at different optimization levels.
//EDIT
I didn't mention benchmarking, because your question is about checking the difference between two source codes that use different language constructs to do the same job.
Don't get me wrong, benchmarking is the recommended way of assuring general software performance, but in this particular scenario it might be unreliable, because of the extremely small execution time frames basic operations have.
Even when you calculate amortized time from multiple runs, the difference might be too dependent on the OS and environment and thus pollute your results.
To learn more on the subject I recommend this talk from CppCon; it's actually quite interesting.
But most importantly,
A quick peek under the hood by exploring the assembly code can tell you whether two statements have been optimized into exactly the same code. That might not be so clear from benchmarking the code.
In the case you asked about (if vs the ternary operator), it should always lead to the same machine code, because the ternary operator is just syntactic sugar for if and physically it's the same operation.
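A quick way to see this: compile something like the following with gcc -O2 -S (or paste it into a compiler explorer) and compare the assembly of the two functions; they normally come out identical.

/* Two spellings of "max"; at -O1 and above, gcc and clang typically emit
   identical machine code for both - easy to verify with the -S output. */
int max_if(int a, int b)
{
    if (a > b)
        return a;
    return b;
}

int max_ternary(int a, int b)
{
    return (a > b) ? a : b;
}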
Analyse the Time Complexity of the two algorithms. If they seem competitive,
Benchmark.
Provide a sufficiently large input for your problem, so that the timing is not affected by other (OS) overheads.
Develop two programs that solve the same problem, but with a different approach.
I have some methods in Time measurements to time code. Example:
#include <stdio.h>
#include <sys/time.h>
#include <time.h>

typedef struct timeval wallclock_t;

void wallclock_mark(wallclock_t *const tptr)
{
    gettimeofday(tptr, NULL);
}

double wallclock_since(wallclock_t *const tptr)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return difftime(now.tv_sec, tptr->tv_sec)
         + ((double)now.tv_usec - (double)tptr->tv_usec) / 1000000.0;
}

int main(void)
{
    wallclock_t t;
    double s;

    wallclock_mark(&t);
    /*
     * Solve the problem with Algorithm 1
     */
    s = wallclock_since(&t);

    printf("That took %.9f seconds wall clock time.\n", s);
    return 0;
}
You will get a time measurement. Then you solve the problem with "Algorithm 2", for example, and compare these measurements.
PS: Or you could check the assembly code of each approach, for a more low-level view.
One way is to use the time command in the bash shell, repeating the execution a large number of times. This will show which is better. Also make a template that does neither of the two, so you know the baseline overhead.
Take measurements for many cases and compare averages before drawing any conclusions.

Techniques for static code analysis in detecting integer overflows

I'm trying to find some effective techniques which I can base my integer-overflow detection tool on. I know there are many ready-made detection tools out there, but I'm trying to implement a simple one on my own, both for my personal interest in this area and also for my knowledge.
I know techniques like Pattern Matching and Type Inference, but I read that more complicated code analysis techniques are required to detect integer overflows. There's also Taint Analysis, which can "flag" untrusted sources of data.
Is there some other technique, which I might not be aware of, which is capable of detecting integer overflows?
It may be worth trying the cppcheck static analysis tool, which claims to detect signed integer overflow as of version 1.67:
New checks:
- Detect shift by too many bits, signed integer overflow and dangerous sign conversion
Notice that it supports both C and C++ languages.
There is no overflow check for unsigned integers since, per the Standard, unsigned types never overflow.
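For illustration, unsigned arithmetic is defined to wrap around modulo 2^N rather than overflow, so there is genuinely nothing for the tool to diagnose here:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    unsigned int u = UINT_MAX;
    u = u + 1;               /* well defined: wraps around to 0 */
    printf("%u\n", u);       /* prints 0 */
    return 0;
}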
Here is a basic example of the signed case:
#include <stdio.h>

int main(void)
{
    int a = 2147483647;
    a = a + 1;
    printf("%d\n", a);
    return 0;
}
With this code it reports:
$ ./cppcheck --platform=unix64 simple.c
Checking simple.c...
[simple.c:6]: (error) Signed integer overflow for expression 'a+1'
However, I wouldn't expect too much from it (at least with the current version), as a slightly different program:
int a = 2147483647;
a++;
passes without noticing overflow.
It seems you are looking for some sort of Value Range Analysis, and detect when that range would exceed the set bounds. This is something that on the face of it seems simple, but is actually hard. There will be lots of false positives, and that's even without counting bugs in the implementation.
To ignore the details for a moment, you associate a pair [lower bound, upper bound] with every variable, and do some math to figure out the new bounds for every operator. For example if the code adds two variables, in your analysis you add the upper bounds together to form the new upper bound, and you add the lower bounds together to get the new lower bound.
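A minimal sketch of that bookkeeping follows; the types and names are made up for this example, and a real analyzer also has to worry about the bounds themselves overflowing, widening, aliasing and everything discussed below:

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Range of a 32-bit signed variable, tracked in a wider type. */
typedef struct { int64_t lo, hi; } interval;

static interval interval_add(interval a, interval b)
{
    return (interval){ a.lo + b.lo, a.hi + b.hi };
}

static bool may_overflow_int32(interval r)
{
    return r.lo < INT32_MIN || r.hi > INT32_MAX;
}

int main(void)
{
    interval x   = { 0, INT32_MAX };        /* e.g. an untrusted input */
    interval one = { 1, 1 };
    interval sum = interval_add(x, one);    /* models "x + 1" */
    printf("x + 1 may overflow: %s\n", may_overflow_int32(sum) ? "yes" : "no");
    return 0;
}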
But of course it's not that simple. Firstly, what if there is non-straight-line code? if's are not too bad, you can just evaluate both sides and then take the union of the ranges after it (which can lose information! if two ranges have a gap in between, their union will span the gap). Loops require tricks, a naive implementation may run billions of iterations of analysis on a loop or never even terminate at all. Even if you use an abstract domain that has no infinite ascending chains, you can still get into trouble. The keywords to solve this are "widening operator" and (optionally, but probably a good idea) "narrowing operator".
It's even worse than that, because what's a variable? Your regular local variable of scalar type that never has its address taken isn't too bad. But what about arrays? Now you don't even know for sure which entry is being affected - the index itself may be a range! And then there's aliasing. That's far from a solved problem and causes many real world tools to make really pessimistic assumptions.
Also, function calls. You're going to call functions from some context, hopefully a known one (if not, then it's simple: you know nothing). That makes it hard: not only is there suddenly a lot more state to keep track of at the same time, there may be several places a function could be called from, including itself. The usual response to that is to re-evaluate the function when the range of one of its arguments has been expanded; once again this could take billions of steps if not done carefully. There are also algorithms that analyze a function differently for different contexts, which can give more accurate results, but it's easy to spend a lot of time analyzing contexts that aren't different enough to matter.
Anyway if you've made it this far, you could read Accurate Static Branch Prediction by Value Range Propagation and related papers to get a good idea of how to actually do this.
And that's not all. Considering only the ranges of individual variables, without caring about the relationships between them (keyword: non-relational abstract domain), does badly on really simple (for a human reader) things such as subtracting two variables that are always close together in value, for which it will produce a large range, on the assumption that they may be as far apart as their bounds allow. Even for something trivial such as
/* assume x in [0 .. 10] */
int y = x + 2;
int diff = y - x;
For a human reader, it's pretty obvious that diff = 2. In the analysis described so far, the conclusions would be that y in [2 .. 12] and diff in [-8, 12]. Now suppose the code continues with
int foo = diff + 2;
int bar = foo - diff;
Now we get foo in [-6, 14] and bar in [-18, 22], even though bar is obviously 2 again; the range has doubled again. Now this was a simple example, and you could make up some ad-hoc hacks to detect it, but it's a more general problem. This effect tends to blow up the ranges of variables quickly and generate lots of unnecessary warnings. A partial solution is assigning ranges to differences between variables; then you get what's called a difference-bound matrix (unsurprisingly this is an example of a relational abstract domain). They can get big and slow for interprocedural analysis, or if you want to throw non-scalar variables at them too, and the algorithms start to get more complicated. And they only get you so far - if you throw a multiplication into the mix (that includes x + x and variants), things still go bad very fast.
So you can throw something else into the mix that can handle multiplication by a constant, see for example Abstract Domains of Affine Relations - this is very different from ranges, and won't by itself tell you much about the ranges of your variables, but you could use it to get more accurate ranges.
The story doesn't end there, but this answer is getting long. I hope this does not discourage you from researching this topic, it's a topic that lends itself well to starting out simple and adding more and more interesting things to your analysis tool.
Checking integer overflows in C:
When you add two 32-bit numbers and get a 33-bit result, the lower 32 bits are written to the destination and the highest bit is signaled via the carry flag. Many languages, including C, don't provide a way to access this 'carry', so you can use limits, i.e. <limits.h>, to check before you perform the arithmetic operation. Consider unsigned ints a and b:
if MAX - b < a, we know for sure that a + b would cause an overflow. An example is given in this C FAQ.
Watch out: As chux pointed out, this example is problematic with signed integers, because it won't handle MAX - b or MIN + b if b < 0. The example solution in the second link (below) covers all cases.
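In code, the pre-checks described above might look like the following sketch (the function names are just for illustration; for unsigned operands the simple test works, while the signed test branches on the sign of b, as the caveat says):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Unsigned: a + b wraps rather than overflows, but we can still detect it. */
static bool uadd_would_wrap(unsigned a, unsigned b)
{
    return UINT_MAX - b < a;
}

/* Signed: branch on the sign of b, per the caveat above. */
static bool sadd_would_overflow(int a, int b)
{
    return (b > 0) ? (a > INT_MAX - b) : (a < INT_MIN - b);
}

int main(void)
{
    printf("%d %d\n", uadd_would_wrap(UINT_MAX, 1u), sadd_would_overflow(INT_MAX, 1));
    return 0;
}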
Multiplying numbers can cause an overflow, too. A solution is to double the length of the first number, then do the multiplication. Something like:
(typecast)a*b
Watch out: (typecast)(a*b) would be incorrect because it truncates first then typecasts.
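A sketch of that widening idea for 32-bit operands, with the truncation pitfall kept as a comment (the helper name is made up for this example):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Widen one operand before multiplying, then check the 64-bit product. */
static bool smul32_overflows(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * b;          /* correct: widen first         */
    /* int64_t bad = (int64_t)(a * b);         wrong: truncates, then casts */
    return wide < INT32_MIN || wide > INT32_MAX;
}

int main(void)
{
    printf("%d %d\n", smul32_overflows(65536, 65536), smul32_overflows(100, 100));
    return 0;
}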
A detailed technique for C can be found HERE. Using macros seems to be an easy and elegant solution.
I'd expect Frama-C to provide such a capability. Frama-C is focused on C source code, but I don't know if it is dialect-sensitive or specific. I believe it uses abstract interpretation to model values. I don't know if it specifically checks for overflows.
Our DMS Software Reengineering Toolkit has a variety of language front ends, including most major dialects of C. It provides control and data flow analysis, and also abstract interpretation for computing ranges, as foundations on which you can build an answer. My Google Tech Talk on DMS at about 0:28:30 specifically talks about how one can use DMS's abstract interpretation on value ranges to detect overflow (of an index on a buffer). A variation on checking the upper bound on array sizes is simply to check that values do not exceed 2^N. However, off the shelf, DMS does not provide any specific overflow analysis for C code. There's room for the OP to do interesting work :=}

Make a cosine table with the gcc preprocessor

I wish to make a cosine table at compile time. Is there a way to do this without hard coding anything?
Why not hardcode it? I am not aware of any planned changes to the results of the cosine function - well, not for another 100 years or so.
I am not convinced that precalculating a sine table would result in a performance improvement. I suggest:
Benchmark your application calling fcos() to decide whether it's fast enough. If it is, stop here.
If it really is too slow, consider using -ffast-math if it is acceptable for your usage.
Lookup tables, particularly large ones, will increase the size of your program that needs to be held in the CPU cache, which reduces its hit rate. This in turn will slow other parts of your application down.
I am assuming you're doing this in an incredibly tight loop, as that's the only case it could possibly matter in anyway.
If you actually DID discover that using a lookup table was beneficial, why not just precalculate it at runtime? It's going to have hardly any impact on startup time (unless it's a huuuuuge one). It may actually be faster to do it at runtime, because your CPU may be able to do sines faster than your disc can load floats in.
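A minimal sketch of that fill-it-at-startup approach might look like this (one entry per degree here; the granularity and table name are arbitrary choices):

#include <math.h>
#include <stdio.h>

#define TABLE_SIZE 91                 /* one entry per degree, 0..90 */
static double cos_table[TABLE_SIZE];

static void init_cos_table(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        cos_table[i] = cos(i * M_PI / 180.0);
}

int main(void)
{
    init_cos_table();                 /* one-off cost at program start */
    printf("cos(60 degrees) ~= %f\n", cos_table[60]);
    return 0;
}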
With C++, you can use template metaprogramming to generate your lookup table at compile time.
Now, here is a standard C trick that may or may not accomplish what you want.
Write a program (say, cosgen) that generates the cosine table C statement (i.e., the code that you desire).
Run cosgen and dump the output (c code) to a file, say cos_table.c
In your main program, use a #include "cos_table.c" to insert the table where you want.
You could generate it with any scripting language you liked and include the result. Use make to have the scripting language do its thing anytime you change the source. It's hard coded to C but not to you, really.
With the magic of computers, the apparently impossible becomes possible:
#include <stdio.h>
#include <math.h>

#define MAX_ANGLE 90

double kinopiko_krazy_kosines[MAX_ANGLE + 1];   /* one entry per degree, 0..90 */

int main(void)
{
    int i;
    for (i = 0; i <= MAX_ANGLE; i++) {
        double angle = (M_PI * i) / (2.0 * 90.0);
        kinopiko_krazy_kosines[i] = cos(angle);
        printf("#define cos_%d %f\n", i, kinopiko_krazy_kosines[i]);
    }
    return 0;
}
http://codepad.org/G6JTATne
Since you're targeting Cell, you're probably targeting the SPEs? They do have proper FP support, vectorised in fact, but do not have large working memories. For that reason it's in fact a BAD IDEA to use tables - you're sacrificing a very limited resource.
I'd create a hard-coded lookup table - once with a scripting language - but I'm not sure it'll be faster than just using the standard math library.
I guess it depends on the size of the table, but I would suspect getting the FPU to do the calculation might be faster than accessing memory. So once you've got your table solution, I'd benchmark it to see if it's faster than the standard function.
Wave tables are the way to go. You can hard-code them as suggested, or build them during application startup.
