For many years I have used this very simple program to get a rough estimate of programming language performance. I have a dozen versions in Ruby (600 ms), Python (1500 ms), JavaScript (45 ms), C (25 ms with both GCC and Clang on my notebook) and other languages. Do not draw serious conclusions from such a simple benchmark, because it is far from any real-life case. I call it "classic" simply because I have used it for decades. Maybe even saying "a rough estimate" is too much: the test is extremely simple, mostly because writing a better test for a language you do not know is time consuming, and I usually write it when I get my hands on a new language for the first time. Sometimes, though, I rerun the test a few years later when the compiler/interpreter gets an update.
Recently I ported this test to Rust and was really surprised: it outperformed the previous record holder, C, by about three times (7 ms!). My question is for those who know something about Rust compilation: why is it so fast? I know it uses LLVM just as Clang does, so I expected about the same speed. (Nim, for comparison, performs about the same as C because it compiles to C, though not very efficiently; it is still about two times slower than C on this simple benchmark.)
Rust
// rustc --color always -C opt-level=3 -C prefer-dynamic classic.rs -C link-args=-s -o classic.rust
use std::ptr;

#[repr(C)]
struct timeval {
    tv_sec: i64,
    tv_usec: i64
}

extern {
    fn gettimeofday(tv: &mut timeval, tzp: *const ()) -> i32;
}

fn time1000() -> i64 {
    let mut tv = timeval { tv_sec: 0, tv_usec: 0 };
    unsafe {
        gettimeofday(&mut tv, ptr::null());
    }
    tv.tv_sec * 1000 + tv.tv_usec / 1000
}

fn classic() {
    let mut a: i64 = 3000000;
    loop {
        a = a - 1;
        if a == 0 { break; }
        let mut b = (a / 100000) as i64;
        b = b * 100000;
        if a == b { print!("{} ", a); }
    }
}

fn main() {
    let mut t = time1000();
    classic();
    t = time1000() - t;
    println!("{}", t);
}
C
#include <stdio.h>
#include <sys/time.h>

long time1000() {
    struct timeval val;
    gettimeofday(&val, 0);
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

void classic() {
    double a = 3000000, b;
    while (1) {
        a--;
        if (a == 0) break;
        b = a / 100000;
        b = (int) b;
        b *= 100000;
        if (a == b) { printf("%i ", (int)a); }
    }
}

int main() {
    int T = time1000();
    classic();
    T = time1000() - T;
    printf("%i", (int)T);
}
Substitute
int64_t a = 3000000, b;
for
double a = 3000000, b;
to make it equivalent (on a 64-bit architecture) to
let mut a:i64 = 3000000;
//...
let mut b = (a / 100000) as i64;
and C wins (even with stdio).
On my PC, C is about 1.4–1.5 times faster (-O3, measured on a 100-iteration shell for-loop to discount startup overhead).
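For reference, a minimal sketch of classic() with that substitution applied (you will also need #include <stdint.h>; the rest of the program is unchanged):

void classic() {
    int64_t a = 3000000, b;           /* was: double a = 3000000, b; */
    while (1) {
        a--;
        if (a == 0) break;
        b = a / 100000;               /* integer division, matching (a / 100000) as i64 in Rust */
        b *= 100000;                  /* the (int) cast from the double version is no longer needed */
        if (a == b) { printf("%i ", (int)a); }
    }
}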
My SSE code is just as slow as the standard C one, what am I doing wrong?
I'm running on an Intel i3-6100 CPU, using C with MinGW and CLion, and I'm compiling with the -O0 flag.
I'm measuring the performance using the clock() function, and both versions are equally fast to within about 45 ticks out of over 1000 (SSE: 1138 ticks, C: 1093 ticks).
I thought that SSE somehow messes up the clock() time measuring, but even by simply counting seconds there is no difference.
The function (swapping the comments switches between the SSE and scalar versions):
void vTrace(struct Ray * ray, float t, struct Vec3f * r){
    //__m128 * mr = (__m128 *)r;
    //__m128 mt_m = _mm_set1_ps(t);
    //*mr = _mm_add_ps(*(__m128*)&ray->o, _mm_mul_ps(*(__m128*)&ray->d, mt_m));
    r->x = ray->o.x + ray->d.x*t;
    r->y = ray->o.y + ray->d.y*t;
    r->z = ray->o.z + ray->d.z*t;
}
the benchmarking code:
float benchmark_t = 1;
struct Ray benchmark_ray;
vInit3f(&benchmark_ray.o, 0.2, 0.23, 1.4);
vInit3f(&benchmark_ray.d, 0.2, 0.23, 1.4);

ticks = clock();
i = 0;
while(i < 1000000000 ){
    vTrace(&benchmark_ray, benchmark_t, &benchmark_ray.o);
    i ++;
}
printf("TIME : %i ticks\n", (clock()-ticks));
printVec("result", benchmark_ray.o);
the structures :
struct Vec3f{
    float x;
    float y;
    float z;
    float w; //just for SSE
};

struct Ray{
    struct Vec3f o;
    struct Vec3f d;
    struct Vec3f inverse_d;
};
Using SSE, the performance should be about 4 times as fast; why is there no performance gain?
The code was somehow auto-vectorized; I don't know why, but it was.
So there was no great performance difference.
(Next time, step through the assembly code first.)
In the context of my homework task I need to smart brute-force a set of passwords. Every password in the set matches one of four possible masks:
%%##
##%%
#%%#
%##%
(# stands for a numeric character, % for a lowercase alphabetic character).
At this point I am doing something like this to run over only one pattern (the first one) with multithreading:
// Compile: $ gcc test.c -o test -fopenmp -O3 -std=c99
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>

int main() {
    const char alp[26] = "abcdefghijklmnopqrstuvwxyz";
    const char num[10] = "0123456789";
    register int i;
    char pass[4];
    #pragma omp parallel for private(pass)
    for (i = 0; i < 67600; i++) {
        pass[3] = num[i % 10];
        pass[2] = num[i / 10 % 10];
        pass[1] = alp[i / 100 % 26];
        pass[0] = alp[i / 2600 % 26];
        /* Slow password processing here */
    }
    return 0;
}
But, unfortunately, that technique does not extend to searching for passwords that follow the other patterns.
So my question is:
Is there a way to construct an effective set of parallel for instructions in order to run the attack simultaneously on each password pattern?
Help is much appreciated.
The trick here is to note that all four password options are simply rotations/shifts of each other.
That is, for the example password qr34 and the patterns you mention, you are looking at:
qr34 %%## #Original potential password
4qr3 #%%# #Rotate 1 place right
34qr ##%% #Rotate 2 places right
r34q %##% #Rotate 3 places right
Given this, you can use the same generation technique as in your first question.
For each potential password generated, check the potential password as well as the next three shifts of that password.
Note that the following code relies on short-circuit evaluation in C/C++: if the truth value of an expression can be deduced early, the remaining operands are never evaluated. That is, given the statement if (A || B || C), if A is false, then B must be evaluated; however, if B is then true, C is never evaluated.
This means that we can have A = CheckPass(pass), B = CheckPass(RotatePass(pass)), and C = CheckPass(RotatePass(pass)), with the guarantee that the password will only be rotated as many times as necessary.
Note that this scheme requires that each thread have its own, private copy of the potential password.
//Compile with, e.g.: gcc -O3 temp.c -std=c99 -fopenmp
#include <stdio.h>
#include <unistd.h>
#include <string.h>

int PassCheck(char *pass){
  return strncmp(pass, "4qr3", 4)==0;
}

//Rotate string one character to the right
char* RotateString(char *str, int len){
  char lastchr = str[len-1];
  for(int i=len-1;i>0;i--)
    str[i] = str[i-1];
  str[0] = lastchr;
  return str;
}

int main(){
  const char alph[27] = "abcdefghijklmnopqrstuvwxyz";
  const char num[11]  = "0123456789";

  char goodpass[4] = "----"; //Provide a default password to indicate an error state

  #pragma omp parallel for collapse(4)
  for(int i = 0; i < 26; i++)
  for(int j = 0; j < 26; j++)
  for(int m = 0; m < 10; m++)
  for(int n = 0; n < 10; n++){
    char pass[4] = {alph[i],alph[j],num[m],num[n]};
    if(
      PassCheck(pass)                 ||
      PassCheck(RotateString(pass,4)) ||
      PassCheck(RotateString(pass,4)) ||
      PassCheck(RotateString(pass,4))
    ){
      //It is good practice to use `critical` here in case two
      //passwords are somehow both valid. This won't arise in
      //your code, but is worth thinking about.
      #pragma omp critical
      {
        memcpy(goodpass, pass, 4);
        //#pragma omp cancel for //Escape for loops!
      }
    }
  }

  printf("Password was '%.4s'.\n", goodpass);

  return 0;
}
I notice that you are generating your password using
pass[3] = num[i % 10];
pass[2] = num[i / 10 % 10];
pass[1] = alp[i / 100 % 26];
pass[0] = alp[i / 2600 % 26];
This sort of technique is occasionally useful, especially in scientific programming, but usually only for addressing convenience and memory locality.
For instance, an array of arrays where an element is accessed as a[y][x] can be written as a flat-array with elements accessed as a[y*width+x]. This gives a speed gain, but only because the memory is contiguous.
In your case, this indexing does not produce any speed gains, but does make it more difficult to reason about how your program works. I would avoid it for this reason.
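For illustration, a minimal sketch of the flat-array indexing mentioned above (the array names and sizes are made up for the example):

#include <stdio.h>

#define WIDTH  4
#define HEIGHT 3

int main(void) {
    int a2d[HEIGHT][WIDTH];      /* array of arrays: accessed as a2d[y][x] */
    int flat[HEIGHT * WIDTH];    /* same data laid out as one contiguous block */

    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++) {
            a2d[y][x] = y * WIDTH + x;
            flat[y * WIDTH + x] = y * WIDTH + x;   /* manual index arithmetic */
        }

    /* Both expressions address the same logical element. */
    printf("%d %d\n", a2d[2][3], flat[2 * WIDTH + 3]);
    return 0;
}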
It's been said that "premature optimization is the root of all evil". This is especially true of micro-optimizations such as the one you're trying here. The biggest speed gains come from high-level algorithmic decisions, not from fiddly stuff. The -O3 compilation flag does most of everything you'll ever need done in terms of making your code fast at this level.
Micro-optimizations assume that doing something convoluted in your high-level code will somehow enable you to out-smart the compiler. This is not a good assumption, since the compiler is often quite smart and will be even smarter tomorrow. Your time is very valuable: don't spend it on this stuff unless you have a clear justification.
I am currently working on a project where I would like to optimize some numerical computation in Python by calling C.
In short, I need to compute the value of y[i] = f(x[i]) for each element of a huge array x (typically 10^9 entries or more). Here, x[i] is an integer between -10 and 10, and f is a function that takes x[i] and returns a double. My issue is that f takes a very long time to evaluate in a numerically stable way.
To speed things up, I would like to hard-code all 2*10 + 1 possible values of f(x[i]) into a constant array such as:
double table_of_values[] = {f(-10), ...., f(10)};
And then just evaluate f using a "lookup table" approach as follows:
for (i = 0; i < N; i++) {
    y[i] = table_of_values[x[i] + 10]; // instead of y[i] = f(x[i]); the +10 maps -10..10 to indices 0..20
}
Since I am not really well-versed at writing optimized code in C, I am wondering:
Specifically, since x is really large, I'm wondering if it's worth doing second-level optimization when evaluating the loop (e.g. by sorting x beforehand, or by finding a smarter way to deal with the negative indices than just using x[i] + 10)?
Say x[i] were not between -10 and 10, but between -20 and 20. In this case, I could still use the same approach, but would need to hard code the lookup table manually. Is there a way to generate the look-up table dynamically in the code so that I make use of the same approach and allow for x[i] to belong to a variable range?
It's fairly easy to generate such a table with dynamic range values.
Here's a simple, single table method:
#include <stdlib.h> // for malloc
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
double *table_of_values;
int table_bias;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
x = malloc(sizeof(xval_t) * XLEN);
fripper(x);
return 0;
}
Here's a slightly more complex way that allows many similar tables to be generated:
#include <stdlib.h> // for malloc
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 1
typedef short xval_t;
#endif
#if 0
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
struct table {
int tbl_lo; // lowest index
int tbl_hi; // highest index
int tbl_bias; // bias for index
double *tbl_data; // cached data
};
struct table ftable1;
struct table ftable2;
double
fslow(int i)
{
return 1; // whatever
}
double
f2(int i)
{
return 2; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi,struct table *tbl)
{
int len;
tbl->tbl_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free tbl_data when no longer needed
tbl->tbl_data = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
tbl->tbl_data[i + tbl->tbl_bias] = f(i); // use the function pointer passed in, not fslow
}
// fcached -- retrieve cached table data
double
fcached(struct table *tbl,int i)
{
return tbl->tbl_data[i + tbl->tbl_bias];
}
// fripper -- access x and table arrays
void
fripper(xval_t *x,struct table *tbl)
{
double *tptr;
int bias;
double val;
// ensure these go into registers to prevent needless extra memory fetches
tptr = tbl->tbl_data;
bias = tbl->tbl_bias;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
VARIABLE_USED(val);
}
}
int
main(void)
{
x = malloc(sizeof(xval_t) * XLEN);
// NOTE: we could use 'char' for xval_t ...
ftablegen(fslow,-37,62,&ftable1);
fripper(x,&ftable1);
// ... but, this forces us to use a 'short' for xval_t
ftablegen(f2,-99,307,&ftable2);
return 0;
}
Notes:
fcached could/should be an inline function for speed. Notice that once the table has been calculated, fcached(x[i]) is quite fast. The index offset issue you mentioned [solved by the "bias"] is trivially small in calculation time.
While x may be a large array, the cached array for f() values is fairly small (e.g. -10 to 10). Even if it were (e.g.) -100 to 100, this is still about 200 elements. This small cached array will [probably] stay in the hardware memory cache, so access will remain quite fast.
Thus, sorting x to optimize H/W cache performance of the lookup table will have little to no [measurable] effect.
The access pattern to x is independent. You'll get the best performance if you access x in a linear manner (e.g. for (i = 0; i < 999999999; ++i) x[i]). If you access it in a semi-random fashion, it will put a strain on the H/W cache logic and its ability to keep the needed/wanted x values "cache hot".
Even with linear access, because x is so large, by the time you get to the end, the first elements will have been evicted from the H/W cache (e.g. most CPU caches are on the order of a few megabytes).
However, if x only has values in a limited range, changing the type from int x[...] to short x[...] or even char x[...] cuts the size by a factor of 2x [or 4x]. And, that can have a measurable improvement on the performance.
Update: I've added an fripper function to show the fastest way [that I know of] to access the table and x arrays in a loop. I've also added a typedef named xval_t to allow the x array to consume less space (i.e. will have better H/W cache performance).
UPDATE #2:
Per your comments ...
fcached was coded [mostly] to illustrate simple/single access. But, it was not used in the final example.
The exact requirements for inline have varied over the years (e.g. it was extern inline). Best use now: static inline. However, if you are using C++, it may, yet again, be different. There are entire pages devoted to this; the reason is how compilation across different .c files behaves and what happens when optimization is on or off. Also, consider using a gcc extension. So, to force inlining all the time:
__attribute__((__always_inline__)) static inline
fripper is the fastest because it avoids refetching the globals table_of_values and table_bias on each loop iteration. In fripper, the compiler's optimizer will ensure they remain in registers. See my answer Is accessing statically or dynamically allocated memory faster? as to why.
However, I coded an fripper variant that uses fcached and the disassembled code was the same [and optimal]. So, we can disregard that ... Or, can we? Sometimes, disassembling the code is a good cross check and the only way to know for sure. Just an extra item when creating fully optimized C code. There are many options one can give to the compiler regarding code generation, so sometimes it's just trial and error.
Because benchmarking is important, I threw in my routines for timestamping (FYI, [AFAIK] the underlying clock_gettime call is the basis for python's time.clock()).
So, here's the updated version:
#include <stdlib.h> // for malloc
#include <time.h>
typedef long long s64;
#define SUPER_INLINE \
__attribute__((__always_inline__)) static inline
#define VARIABLE_USED(_sym) \
do { \
if (1) \
break; \
if (!! _sym) \
break; \
} while (0)
#define TVSEC 1000000000LL // nanoseconds in a second
#define TVSECF 1e9 // nanoseconds in a second
// tvget -- get high resolution time of day
// RETURNS: absolute nanoseconds
s64
tvget(void)
{
struct timespec ts;
s64 nsec;
clock_gettime(CLOCK_REALTIME,&ts);
nsec = ts.tv_sec;
nsec *= TVSEC;
nsec += ts.tv_nsec;
return nsec;
}
// tvgetf -- get high resolution time of day
// RETURNS: fractional seconds
double
tvgetf(void)
{
struct timespec ts;
double sec;
clock_gettime(CLOCK_REALTIME,&ts);
sec = ts.tv_nsec;
sec /= TVSECF;
sec += ts.tv_sec;
return sec;
}
double *table_of_values;
int table_bias;
double *dummyptr;
// use the smallest of these that can contain the values the x array may have
#if 0
typedef int xval_t;
#endif
#if 0
typedef short xval_t;
#endif
#if 1
typedef char xval_t;
#endif
#define XLEN (1 << 9)
xval_t *x;
// fslow -- your original function
double
fslow(int i)
{
return 1; // whatever
}
// ftablegen -- generate variable table
void
ftablegen(double (*f)(int),int lo,int hi)
{
int len;
table_bias = -lo;
len = hi - lo;
len += 1;
// NOTE: you can do free(table_of_values) when no longer needed
table_of_values = malloc(sizeof(double) * len);
for (int i = lo; i <= hi; ++i)
table_of_values[i + table_bias] = f(i);
}
// fcached -- retrieve cached table data
SUPER_INLINE double
fcached(int i)
{
return table_of_values[i + table_bias];
}
// fripper_fcached -- access x and table arrays
void
fripper_fcached(xval_t *x)
{
double val;
double *dptr;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = fcached(x[i]);
// do stuff with val
dptr[i] = val;
}
}
// fripper -- access x and table arrays
void
fripper(xval_t *x)
{
double *tptr;
int bias;
double val;
double *dptr;
// ensure these go into registers to prevent needless extra memory fetches
tptr = table_of_values;
bias = table_bias;
dptr = dummyptr;
for (int i = 0; i < XLEN; ++i) {
val = tptr[x[i] + bias];
// do stuff with val
dptr[i] = val;
}
}
int
main(void)
{
ftablegen(fslow,-10,10);
x = malloc(sizeof(xval_t) * XLEN);
dummyptr = malloc(sizeof(double) * XLEN);
fripper(x);
fripper_fcached(x);
return 0;
}
You can use negative indices with a pointer into an array, as long as the element you reach is still inside the array (pointer arithmetic within an array is well defined). If you have the following code:
int arr[] = {1, 2 ,3, 4, 5};
int* lookupTable = arr + 3;
printf("%i", lookupTable[-2]);
it will print out 2.
This works because array indexing in C is defined in terms of pointer arithmetic: lookupTable[-2] means *(lookupTable - 2), and since lookupTable points into the middle of arr, that still refers to an element of arr.
Keep in mind, though, that if you malloc() the memory for arr, you must call free() on the original pointer, not on the shifted lookupTable pointer.
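A minimal sketch of that idea with heap allocation (the stored values are placeholders for f(v); the point is to keep the original pointer around for free()):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Keep the original pointer so it can be freed later. */
    double *storage = malloc(21 * sizeof *storage);   /* holds f(-10) .. f(10) */
    if (storage == NULL) return 1;

    double *lookup = storage + 10;                    /* valid indices: lookup[-10] .. lookup[10] */
    for (int v = -10; v <= 10; v++)
        lookup[v] = (double)(v * v);                  /* placeholder for f(v) */

    printf("%f\n", lookup[-3]);

    free(storage);   /* free the original pointer, not the shifted one */
    return 0;
}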
I really think Craig Estey is on the right track for building your table in an automatic way. I just want to add a note about looking up the table.
If you know that you will run the code on a Haswell machine (or newer, with AVX2), you should make sure your code uses the VGATHERDPD instruction, which you can generate with the _mm256_i32gather_pd intrinsic. If you do that, your table lookups will fly! (You can even detect AVX2 on the fly with cpuid(), but that's another story.)
EDIT:
Let me elaborate with some code:
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
/* I'm not sure if you need the alignment */
double table[8] __attribute__((aligned(16)))= { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
int main()
{
    int32_t i[4] = { 0, 2, 4, 6 };
    __m128i index = _mm_load_si128( (__m128i*) i );
    __m256d result = _mm256_i32gather_pd( table, index, 8 );
    double* f = (double*)&result;
    printf("%f %f %f %f\n", f[0], f[1], f[2], f[3]);
    return 0;
}
Compile and run:
$ gcc --std=gnu99 -mavx2 gathertest.c -o gathertest && ./gathertest
0.100000 0.300000 0.500000 0.700000
This is fast!
I wanted to learn a bit about Rust tasks, so I did a Monte Carlo computation of π. Now my puzzle is why the single-threaded C version is 4 times faster than the 4-way threaded Rust version. Clearly I am doing something wrong, or my mental performance model is way off.
Here's the C version:
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

#define PI 3.1415926535897932

double monte_carlo_pi(int nparts)
{
    int i, in = 0;
    double x, y;
    srand(getpid());
    for (i = 0; i < nparts; i++) {
        x = (double)rand() / (double)RAND_MAX;
        y = (double)rand() / (double)RAND_MAX;
        if (x*x + y*y < 1.0) {
            in++;
        }
    }
    return in / (double)nparts * 4.0;
}

int main(int argc, char **argv)
{
    int nparts;
    double mc_pi;
    nparts = atoi(argv[1]);
    mc_pi = monte_carlo_pi(nparts);
    printf("computed: %f error: %f\n", mc_pi, mc_pi - PI);
}
The Rust version was not a line-by-line port:
use std::rand;
use std::rand::distributions::{IndependentSample,Range};

fn monte_carlo_pi(nparts: uint) -> uint {
    let between = Range::new(0f64, 1f64);
    let mut rng = rand::task_rng();
    let mut in_circle = 0u;
    for _ in range(0u, nparts) {
        let a = between.ind_sample(&mut rng);
        let b = between.ind_sample(&mut rng);
        if a*a + b*b <= 1.0 {
            in_circle += 1;
        }
    }
    in_circle
}

fn main() {
    let (tx, rx) = channel();
    let ntasks = 4u;
    let nparts = 100000000u; /* I haven't learned how to parse cmnd line args yet!*/
    for _ in range(0u, ntasks) {
        let child_tx = tx.clone();
        spawn(proc() {
            child_tx.send(monte_carlo_pi(nparts/ntasks));
        });
    }
    let result = rx.recv() + rx.recv() + rx.recv() + rx.recv();
    println!("pi is {}", (result as f64)/(nparts as f64)*4.0);
}
Build and time the C version:
$ clang -O2 mc-pi.c -o mc-pi-c; time ./mc-pi-c 100000000
computed: 3.141700 error: 0.000108
./mc-pi-c 100000000 1.68s user 0.00s system 99% cpu 1.683 total
Build and time the Rust version:
$ rustc -v
rustc 0.12.0-nightly (740905042 2014-09-29 23:52:21 +0000)
$ rustc --opt-level 2 --debuginfo 0 mc-pi.rs -o mc-pi-rust; time ./mc-pi-rust
pi is 3.141327
./mc-pi-rust 2.40s user 24.56s system 352% cpu 7.654 total
The bottleneck, as Dogbert observed, was the random number generator. Here's one that is fast and seeded differently on each thread:
fn monte_carlo_pi(id: u32, nparts: uint ) -> uint {
...
let mut rng: XorShiftRng = SeedableRng::from_seed([id,id,id,id]);
...
}
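For comparison, a minimal C sketch of the same idea: a tiny xorshift generator whose state lives in a local variable, so each thread (or each call) can use its own differently seeded state. The shift constants are the usual xorshift32 ones; the function names and seeding scheme are illustrative, not what Rust's XorShiftRng does internally:

#include <stdint.h>
#include <stdio.h>

/* xorshift32: a very fast PRNG; each thread keeps its own state. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

double monte_carlo_pi_fast(uint32_t seed, int nparts) {
    uint32_t state = seed ? seed : 1;   /* the state must be non-zero */
    int in = 0;
    for (int i = 0; i < nparts; i++) {
        double x = xorshift32(&state) / 4294967296.0;   /* map to (0,1) */
        double y = xorshift32(&state) / 4294967296.0;
        if (x*x + y*y < 1.0) in++;
    }
    return in / (double)nparts * 4.0;
}

int main(void) {
    /* e.g. one call per thread, each with a distinct seed */
    printf("%f\n", monte_carlo_pi_fast(12345u, 10000000));
    return 0;
}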
Meaningful benchmarks are a tricky thing, because you have all kinds of optimization options, etc. Also, the structure of the code can have a huge impact.
Comparing C and Rust is a little like comparing apples and oranges. We typically use compute-intensive algorithms like the one you depict above, but the real world can throw you a curve.
Having said that, in general, Rust can and does approach the performance of C and C++, and most likely can do better on concurrency tasks in general.
Take a look at the benchmarks here:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/rust-clang.html
I chose the Rust vs. C Clang benchmark comparison, because both rely on the underlying LLVM.
On the other hand, a comparison with C gcc yields different results:
And guess what? Rust still comes out ahead!
I entreat you to explore the Benchmarks Game site in more detail. There are some cases where C will edge out Rust.
In general, when you are creating a real-world solution, you want to do performance benchmarks for your specific cases. Always do this, because you will often be surprised by the results. Never assume.
I think that, too many times, benchmarks are used to further the "my language is better than your language" style of wars. But as one who has used over 20 computer languages throughout his longish career, I always say that it is a matter of the best tool for the job.
I initially wrote this (brute force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using "if-then-else" versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare and I got the following (Haskell slower by just over 25%) :
(Note I got the ideas of using rem instead of mod and also the O3 option in the compiler invocation from the following post : On improving Haskell's performance compared to C in fibonacci micro-benchmark)
Haskell : Forum.hs
divisibleRec :: Int -> Int -> Bool
divisibleRec i j
  | j == 1         = False
  | i `rem` j == 0 = True
  | otherwise      = divisibleRec i (j-1)

divisible :: Int -> Bool
divisible i = divisibleRec i (i-1)

r = [ x | x <- [2..200000], divisible x == False ]

main :: IO ()
main = print (length r)
C : main.cpp
#include <stdio.h>

bool divisibleRec(int i, int j){
    if(j==1){ return false; }
    else if(i%j==0){ return true; }
    else{ return divisibleRec(i, j-1); }
}

bool divisible(int i){ return divisibleRec(i, i-1); }

int main(void){
    int i, count = 0;
    for(i=2; i<200000; ++i){
        if(divisible(i)==false){
            count = count+1;
        }
    }
    printf("number of primes = %d\n", count);
    return 0;
}
The results I got were as follows :
Compilation times
time (ghc -O3 -o runProg Forum.hs)
real 0m0.355s
user 0m0.252s
sys 0m0.040s
time (gcc -O3 -o runProg main.cpp)
real 0m0.070s
user 0m0.036s
sys 0m0.008s
and the following running times :
Running times on Ubuntu 32 bit
Haskell
17984
real 0m54.498s
user 0m51.363s
sys 0m0.140s
C++
number of primes = 17984
real 0m41.739s
user 0m39.642s
sys 0m0.080s
I was quite impressed with the running times of Haskell. However, my question is this: can I do anything to speed up the Haskell program without:
Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; but I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the LLVM backend (because I don't have it installed)
[EDIT : Memory usage]
After a comment by Alan I noticed that the C program uses a constant amount of memory where as the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.
[EDIT : Summary of bigger runs]
max number tested : 200,000:
(54.498s/41.739s) = Haskell 30.5% slower
max number tested : 400,000:
3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower
max number tested : 800,000:
14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower
[EDIT : Code for Alan]
This was the code that I had written earlier which does not have recursion and which I had tested on 200,000 :
#include <stdio.h>
bool divisibleRec(int i, int j){
while(j>0){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ j -= 1;}
}
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<8000000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results for the C code with and without recursion are as follows (for 800,000) :
With recursion : 11m6.024s
Without recursion : 11m5.328s
Note that the executable seems to take up 60kb (as seen in System monitor) irrespective of the maximum number, and therefore I suspect that the compiler is detecting this recursion.
This isn't really answering your question, but rather what you asked in a comment regarding growing memory usage when the number 200000 grows.
When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.
You can try with
import Data.List (foldl')

main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s+1) 0 [2..200000]
(foldl' is a stricter foldl from Data.List that avoids thunks being built up).
Well, bang patterns give you a very small win (as does LLVM, but you seem to have expected that):
{-# LANGUAGE BangPatterns #-}
divisibleRec !i !j | j == 1 = False
And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:
import Data.Word (Word32)

divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool
My timings:
$ time ./so -- Int
2262
real 0m2.332s
$ time ./so -- Word32
2262
real 0m1.424s
This is a closer match to your C program, which is only using int. It still doesn't match performance-wise; I suspect we'd have to look at Core to figure out why.
EDIT: and the memory use, as was already noted, is about the named list r. I just inlined r, made it output a 1 for each non-divisible value and took the sum:
main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]
Another way to write down your algorithm is
main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]
Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.
Replacing your code with an explicit loop with bang patterns made no difference whatsoever:
{-# OPTIONS_GHC -XBangPatterns #-}

r4 :: Int -> Int
r4 n = go 0 2 where
  go !c i | i > n = c
          | True  = go (if not (divisible i) then (c+1) else c) (i+1)

divisibleRec :: Int -> Int -> Bool
divisibleRec i !j | j == 1         = False
                  | i `rem` j == 0 = True
                  | otherwise      = divisibleRec i (j-1)
When I started programming in Haskell I was also impressed by its speed. You may be interested in reading point 5, "The speed of Haskell", of this article.