Haskell vs C speed: Sieve of Eratosthenes - c

Haskell one is implemented using optimized Data.IntSet with complexity O(lg n). However, there is a 15x (previously 30x) speed difference for n = 2000000 despite Haskell code is already optimized for even number cases. I would like to know whether/why my implementation in Haskell is imperfect.
Original Haskell
primesUpTo :: Int -> [Int]
primesUpTo n = 2 : put S.empty [3,5..n]
where put :: S.IntSet -> [Int] -> [Int]
put _ [] = []
put comps (x:xs) =
if S.member x comps
then put comps xs
else x : put (S.union comps multiples) xs
where multiples = S.fromList [x*2, x*3 .. n]
Update
fromDistinctAscList gives a 4x speed increase. 2-3-5-7-Wheel speeds up by another 50%.
primesUpTo :: Int -> [Int]
primesUpTo n = 2 : 3 : 5 : 7 : put S.empty (takeWhile (<=n) (spin wheel 11))
where put :: S.IntSet -> [Int] -> [Int]
put _ [] = []
put comps (x:xs) =
if S.member x comps
then put comps xs
else x : put (S.union comps multiples) xs
where multiples = S.fromDistinctAscList [x*x, x*(x+2) .. n]
spin (x:xs) n = n : spin xs (n + x)
wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel
Benchmarking
All time are measured by *nix time command, real space
Haskell original : 2e6: N/A; 2e7: >30s
Haskell optimized: 2e6: 0.396s; 2e7: 6.273s
C++ Set (ordered): 2e6: 4.694s; 2e7: >30s
C++ Bool Array : 2e6: 0.039s; 2e7: 0.421s
Haskell optimized is slower than C++ Bool by 10~15x, and faster than C++ Set by 10x.
Source code
C Compiler options: g++ 5.3.1, g++ -std=c++11
Haskell options: ghc 7.8.4, ghc
C code (Bool array) http://pastebin.com/W0s7cSWi
prime[0] = prime[1] = false;
for (int i=2; i<=limit; i++) { //edited
if (!prime[i]) continue;
for (int j=2*i; j<=n; j+=i)
prime[j] = false;
}
C code (Set) http://pastebin.com/sNpghrU4
nonprime.insert(1);
for (int i=2; i<=limit; i++) { //edited
if (nonprime.count(i) > 0) continue;
for (int j=2*i; j<=n; j+=i)
nonprime.insert(j);
}
Haskell code http://pastebin.com/HuMqwvRW
Code as written above.

I would like to know whether/why my implementation in Haskell is imperfect.
Instead of fromList you better use fromDistinctAscList which performs linearly. You may also add only odd multiples starting with x*x not x*2, because all the smaller odd multiples have already been added. Style-wise, a right fold may fit better than recursion.
Doing so, I get more than 3 times performance improvement for n equal to 2,000,000:
import Data.IntSet (member, union, empty, fromDistinctAscList)
sieve :: Int -> [Int]
sieve n = 2: foldr go (const []) [3,5..n] empty
where
go i run obs
| member i obs = run obs
| otherwise = i: run (union obs inc)
where inc = fromDistinctAscList [i*i, i*(i + 2)..n]
Nevertheless, an array has both O(1) access and cache friendly memory allocation. Using mutable arrays, I see more than 15 times performance improvement over your Haskell code (again n equal to 2,000,000):
{-# LANGUAGE FlexibleContexts #-}
import Data.Array.ST (STUArray)
import Control.Monad (forM_, foldM)
import Control.Monad.ST (ST, runST)
import Data.Array.Base (newArray, unsafeWrite, unsafeRead)
sieve :: Int -> [Int]
sieve n = reverse $ runST $ do
arr <- newArray (0, n) False :: ST s (STUArray s Int Bool)
foldM (go arr) [2] [3,5..n]
where
go arr acc i = do
b <- unsafeRead arr i
if b then return acc else do
forM_ [i*i, i*(i + 2).. n] $ \k -> unsafeWrite arr k True
return $ i: acc

Related

Haskell -> C FFI performance

This is the dual question of Performance considerations of Haskell FFI / C?: I would like to call a C function with as small an overhead as possible.
To set the scene, I have the following C function:
typedef struct
{
uint64_t RESET;
} INPUT;
typedef struct
{
uint64_t VGA_HSYNC;
uint64_t VGA_VSYNC;
uint64_t VGA_DE;
uint8_t VGA_RED;
uint8_t VGA_GREEN;
uint8_t VGA_BLUE;
} OUTPUT;
void Bounce(const INPUT* input, OUTPUT* output);
Let's run it from C and time it, with gcc -O3:
int main (int argc, char **argv)
{
INPUT input;
input.RESET = 0;
OUTPUT output;
int cycles = 0;
for (int j = 0; j < 60; ++j)
{
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_HSYNC == 0 && output.VGA_VSYNC == 0) break;
}
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_DE) break;
}
}
printf("%d cycles\n", cycles);
}
Running it for 25152001 cycles takes ~400 ms:
$ time ./Bounce
25152001 cycles
real 0m0.404s
user 0m0.403s
sys 0m0.001s
Now let's write some Haskell code to set up FFI (note that Bool's Storable instance really does use a full int):
data INPUT = INPUT
{ reset :: Bool
}
data OUTPUT = OUTPUT
{ vgaHSYNC, vgaVSYNC, vgaDE :: Bool
, vgaRED, vgaGREEN, vgaBLUE :: Word64
}
deriving (Show)
foreign import ccall unsafe "Bounce" topEntity :: Ptr INPUT -> Ptr OUTPUT -> IO ()
instance Storable INPUT where ...
instance Storable OUTPUT where ...
And let's do what I believe to be functionally equivalent to our C code from before:
main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
poke inp $ INPUT{ reset = False }
let loop1 n = do
topEntity inp outp
out#OUTPUT{..} <- peek outp
let n' = n + 1
if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'
loop2 n = do
topEntity inp outp
out <- peek outp
let n' = n + 1
if vgaDE out then return n' else loop2 n'
loop3 k n
| k < 60 = do
n <- loop1 n
loop3 (k + 1) n
| otherwise = return n
n <- loop3 (0 :: Int) (0 :: Int)
printf "%d cycles" n
I build it with GHC 8.6.5, using -O3, and I get.. more than 3 seconds!
$ time ./.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/build/sim-ffi/sim-ffi
25152001 cycles
real 0m3.468s
user 0m3.146s
sys 0m0.280s
And it's not a constant overhead at startup, either: if I run for 10 times the cycles, I get roughly 3.5 seconds from C and 34 seconds from Haskell.
What can I do to reduce the Haskell -> C FFI overhead?
I managed to reduce the overhead so that the 25 M calls now finish in 1.2 seconds. The changes were:
Make loop1, loop2 and loop3 strict in the n argument (using BangPatterns)
Add an INLINE pragma to peek in OUTPUT's Storable instance
Point #1 is silly, of course, but that's what I get for not profiling earlier. That change alone gets me to 1.5 seconds....
Point #2, however, makes a ton of sense and is generally applicable. It also addresses the comment from #Thomas M. DuBuisson:
Do you ever need the Haskell structure in haskell? If you can just keep it as a pointer to memory and have a few test functions such as vgaVSYNC :: Ptr OUTPUT -> IO Bool then that will save a log of copying, allocation, GC work on every call.
In the eventual full program, I do need to look at all the fields of OUTPUT. However, with peek inlined, GHC is happy to do the case-of-case transformation, so I can see in Core that now there is no OUTPUT value allocated; the output of peek is consumed directly.

Cython: slow numpy arrays

I am trying to speed up my code using cython. After translating a code into cython from python I am seeing that I have not gained any speed up. I think the origin of the problem is the bad performance I am getting by using numpy arrays into cython.
I have came up with a very simple program to show this:
############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython
def func1(long N):
cdef double sum1,sum2,sum3
cdef long i
sum1 = 0.0
sum2 = 0.0
sum3 = 0.0
for i in range(N):
sum1 += i
sum2 += 2.0*i
sum3 += 3.0*i
return sum1,sum2,sum3
def func2(long N):
cdef np.ndarray[np.float64_t,ndim=1] sum_arr
cdef long i
sum_arr = np.zeros(3,dtype=np.float64)
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
def func3(long N):
cdef double sum_arr[3]
cdef long i
sum_arr[0] = 0.0
sum_arr[1] = 0.0
sum_arr[2] = 0.0
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
##########################################
################## test.py ###############
import time
import test as test
N = 1000000000
for i in xrange(10):
start = time.time()
sum1,sum2,sum3 = test.func1(N)
print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
start = time.time()
sum_arr = test.func2(N)
print 'Time taken = %.3f'%(time.time()-start)
print '\n'
for i in xrange(10):
start = time.time()
sum_arr = test.func3(N)
print 'Time taken = %.3f'%(time.time()-start)
############################################
And from python test.py I get:
Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418
Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595
Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418
My question is: why func2 is almost 2x slower than func1 and func3?
Is there a way to improve this?
Thanks!
######## UPDATE
My real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function will loop through each element and compute several quantities: a quantity that depends on the value of the array in that position (say A=f(P[i,j,k])) and another quantities that only depend on the position of the array itself (B=g(i,j,k)). Schematically things will look like this:
for i in xrange(N):
corr1 = h(i,val)
for j in xrange(N):
corr2 = h(j,val)
for k in xrange(N):
corr3 = h(k,val)
A = f(P[i,j,k])
B = g(i,j,k)
Arr[B] += A*corr1*corr2*corr3
where val is a property of the 3D array represented by a number. This number can be different for different fields.
Since I have to do this operation over many 3D arrays, I though that it would be better if I create a new routine that accepts many different input 3D arrays, leaving the number of arrays unknown a-priori. The idea is that since B will be exactly the same over all arrays, I can avoid computing it for each array and only compute it once. The problem is that the corr1, corr2, corr3 above will become arrays:
If I have a number of 3D arrays equal to num_3D_arrays I am doing something as:
for i in xrange(N):
for p in xrange(num_3D_arrays):
corr1[p] = h(i,val[p])
for j in xrange(N):
for p in xrange(num_3D_arrays):
corr2[p] = h(j,val[p])
for k in xrange(N):
for p in xrange(num_3D_arrays):
corr3[p] = h(k,val[p])
B = g(i,j,k)
for p in xrange(num_3D_arrays):
A[p] = f(P[i,j,k])
Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]
So the val that I am changing the variables corr1,corr2,corr3 and A from scalar to arrays is killing the performance that I would expect to avoid doing the big loop.
#
There are a couple things you can do to speed up array indexing in Cython:
Turn of bounds checking and wraparound.
Use typed memoryviews.
Declare the array is contiguous.
So for your function:
#cython.boundscheck(False)
#cython.wraparound(False)
def func2(long N):
cdef np.float64_t[::1] sum_arr
cdef long i
sum_arr = np.zeros(3,dtype=np.float64)
for i in range(N):
sum_arr[0] += i
sum_arr[1] += 2.0*i
sum_arr[2] += 3.0*i
return sum_arr
For the original code Cython produced the following C code for the line sum_arr[0] += i:
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
{__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
With the improvements above:
__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
why func2 is almost 2x slower than func1?
It's because indexing cause an indirection, so you double the number of elementary operations. Calculate the sum like in func1, then affect with
sum=array([sum1,sum2,sum3])
How to speed python code ?
Numpy is the first good idea , It raise nearly C speed with no effort.
Numba can fill the gap with no effort too, and is very simple.
Cython for critical cases.
Here some illustration of that:
# python way
def func1(N):
sum1 = 0.0
sum2 = 0.0
sum3 = 0.0
for i in range(N):
sum1 += i
sum2 += 2.0*i
sum3 += 3.0*i
return sum1,sum2,sum3
# numpy way
def func2(N):
aran=arange(float(N))
sum1=aran.sum()
sum2=(2.0*aran).sum()
sum3=(3.0*aran).sum()
return sum1,sum2,sum3
#numba way
import numba
func3 =numba.njit(func1)
"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop
In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop
In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
Look at the html produced by cython -a ...pyx.
For func1, the sum1 += i line expands to :
+15: sum1 += i
__pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);
for func3, with a C array
+45: sum_arr[0] += i
__pyx_t_3 = 0;
(__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);
Slightly more complicated, but a straight forward c.
But for func2:
+29: sum_arr[0] += i
__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
__pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
__Pyx_RaiseBufferIndexError(__pyx_t_6);
__PYX_ERR(0, 29, __pyx_L1_error)
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;
Much more complicated with references to numpy functions (e.g. Pyx_BUfPtrStrided1d). Even initializing the array is complicated:
+26: sum_arr = np.zeros(3,dtype=np.float64)
__pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
__Pyx_GOTREF(__pyx_t_1);
....
I expect that moving the sum_arr creation to the calling Python, and passing it as an argument to func2 would save some time.
Have you read this guide for using memoryviews:
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
You'll get the best cython performance if you focus on writing the low level operations so they translate into simple c. In
for k in xrange(N):
corr3 = h(k,val)
A = f(P[i,j,k])
B = g(i,j,k)
Arr[B] += A*corr1*corr2*corr3
It's not the loops on i,j,k that will slow you down. It's evaluating h, f, and g each time, as well as the Arr[B] +=.... Those functions should be tightly coded cython, not general Python functions. Look at the compiled simplicity of the sum3d function in the memoryview guide.

haskell FFI passing array of C structs in and out

I am trying to perform a series of transforms on graphical files using Haskell and Repa/DevIL. The starting example used was provided by the Haskell wiki page https://wiki.haskell.org/Numeric_Haskell:_A_Repa_Tutorial. I am an imperative programmer of 30 years experience with some erlang for good measure, trying to learn Haskell outside a classroom environment.
The problem is manipulating the data after the file load was first transformed into a Repa array:
import Data.Array.Repa.IO.DevIL (runIL,readImage,writeImage,Image(RGB),IL)
import qualified Data.Array.Repa as R
import Data.Vector.Unboxed as DVU
import Control.Monad
main :: IO ()
main = do
[f] <- getArgs
(RGB a) <- runIL $ Data.Array.Repa.IO.DevIL.readImage f
let
c = (computeP (R.traverse a id rgbTransform)) :: IL (Array U DIM3 Float)
which is successfully cast to type "Array F DIM3 Float" as output from the rgbTransform. From that point on it has been a nightmare to use the data. Flicking the array storage type between F(oreign) and U(nboxed) changes all following call's usability, plus the Repa-added monad layer IL forces use of liftM for nearly every equation following the 1st transform:
let -- continued
sh = liftM R.extent c -- IL DIM3
v = liftM R.toUnboxed c -- IL (Vector Float)
lv = liftM DVU.length v -- IL Int
f = liftM indexed v -- vector of tuples: (Int,a) where Int is idx
k = (Z :. 2) :. 2 :. 0 :: DIM3
These are the routines I can call without error. The IO monad's print command produces no output if placed in or after this 'let' list, due to the IL monad layer.
The game plan for the curious:
read the graphic file (done, via Repa)
resize image (not done, no resize in Repa, must be hand-coded)
transform and convert image from Word8 to Float (done)
get a Stablepointer to the transformed Float data (not done)
transform in-place the Float data as an array of C structs
of {Float a,b,c;}, by an external C routine via FFI (not completely
done). This is done hopefully without marshalling a new graphic
array by passing a pointer to the data
perform more passes over the transformed data to extract more info (partly done).
I am looking for help with issues 4 and 5.
4 -> The type system has been difficult to deal with while attempting to get C-usable memory pointers. Going thru the mountains of haskell library calls has not helped.
5 -> The external C routine is of type:
foreign import ccall unsafe "transform.h xform"
c_xform :: Ptr (CFloat,CFloat,CFloat) ->
CInt ->
IO ()
The Ptr is expected to point to an unboxed flat C array of rgb_t structs:
typedef struct
{
float r;
float g;
float b;
} rgb_t;
Available web-based FFI descriptions of how to deal with array pointers in FFI are non-existent if not downright obscure. The fairly straightforward idea of unfreezing and passing in a C array of floating-point RGB structs, modifying them in-place and then freezing the result is what I had in mind. The external transform is pure in the sense that the same input will produce predictable output, does not use threads, does not use global vars nor depend upon obscure libraries.
Foreign.Marshal.Array seems to provide a way to convert haskell data to C data and other way around.
I tested interfacing C code and haskell using the following files (Haskell + FFI for the first time for me)
hsc2hs rgb_ffi.hsc
ghc main.hs rgb_ffi.hs rgb.c
rgb.h
#ifndef RGB_H
#define RGB_H
#include <stdlib.h>
typedef struct {
float r;
float g;
float b;
} rgb_t;
void rgb_test(rgb_t * rgbs, ssize_t n);
#endif
rgb.h
#include <stdlib.h>
#include <stdio.h>
#include "rgb.h"
void rgb_test(rgb_t * rgbs, ssize_t n)
{
int i;
for(i=0; i<n; i++) {
printf("%.3f %.3f %.3f\n", rgbs[i].r, rgbs[i].g, rgbs[i].b);
rgbs[i].r *= 2.0;
rgbs[i].g *= 2.0;
rgbs[i].b *= 2.0;
}
}
rgb_ffi.hsc
{-# LANGUAGE ForeignFunctionInterface #-}
{-# LANGUAGE CPP #-}
module RGB where
import Foreign
import Foreign.C
import Control.Monad (ap)
#include "rgb.h"
data RGB = RGB {
r :: CFloat, g :: CFloat, b :: CFloat
} deriving Show
instance Storable RGB where
sizeOf _ = #{size rgb_t}
alignment _ = alignment (undefined :: CInt)
poke p rgb_t = do
#{poke rgb_t, r} p $ r rgb_t
#{poke rgb_t, g} p $ g rgb_t
#{poke rgb_t, b} p $ b rgb_t
peek p = return RGB
`ap` (#{peek rgb_t, r} p)
`ap` (#{peek rgb_t, g} p)
`ap` (#{peek rgb_t, b} p)
foreign import ccall "rgb.h rgb_test" crgbTest :: Ptr RGB -> CSize -> IO ();
rgbTest :: [RGB] -> IO [RGB]
rgbTest rgbs = withArray rgbs $ \ptr ->
do
crgbTest ptr (fromIntegral (length rgbs))
peekArray (length rgbs) ptr
rgbAlloc :: [RGB] -> IO (Ptr RGB)
rgbAlloc rgbs = newArray rgbs
rgbPeek :: Ptr RGB -> Int -> IO [RGB]
rgbPeek rgbs l = peekArray l rgbs
rgbTest2 :: Ptr RGB -> Int -> IO ()
rgbTest2 ptr l =
do
crgbTest ptr (fromIntegral l)
return ()
main.hs
module Main (main) where
import RGB
main =
do
let a = [RGB {r = 1.0, g = 1.0, b = 1.0},
RGB {r = 2.0, g = 2.0, b = 2.0},
RGB {r = 3.0, g = 3.0, b = 3.0}]
let l = length a
print a
-- b <- rgbTest a
-- print b
c <- rgbAlloc a
rgbTest2 c l
rgbTest2 c l
d <- rgbPeek c l
print d
return ()

Performance of changing values of STUArray in bulk

I am implementing a topological sort in Haskell with the requirement to be as efficient as possible. I have profiled my current solution and found out the the following method is taking 60% of total time (and 0 amount of additional space):
import Control.Monad.ST
import Control.Monad
import Data.Array.ST
import Data.Array.Unboxed
import Data.Word
import Data.Array.Base
zeroElementsAfterDecrement' :: (MArray a e m, Num e, Eq e) => a Int e -> [Int] -> m [Int]
zeroElementsAfterDecrement' arr is = foldr k (return []) is
where k i a = do xs <- a
decremented <- liftM (subtract 1) (unsafeRead arr i)
unsafeWrite arr i decremented
if decremented == 0 then return (i:xs) else return xs
largenum :: Int
largenum = 10000000
test = runST $ do arr <- newArray (1, largenum) 100 :: ST s (STUArray s Int Word32)
zeroElementsAfterDecrement' arr [1..largenum]
main = (putStrLn . show) test
The function takes an array (I use unboxed mutable arrays) and a list of indexes, decrements elements by these indexes and returns indexes of elements that became zero during this operation. Right now this is more than 10 times slower than the optimized C++ code but still pretty good compared to Python (or maybe I don't know Python way to optimize this). I understand there is an overhead from executing a monadic code, but maybe there are still ways to optimize I am not aware of?
Edit:
GHC: -O -fllvm: 0.54s
GHC (with unsafeWrite/unsafeRead and Word32): 0.34s
g++: 0.24s
g++ -O2: 0.05s
python3: 2.66s
Also when I change foldr to foldl' it starts allocating some memory and is 4 times slower as a result, why is that?
Here is a C++ version I compared it to:
#include <iostream>
#include <vector>
using namespace std;
#define LARGENUM 10000000
int main()
{
vector <int> arr;
for (int i = 0; i < LARGENUM; i++) {
arr.push_back(100);
}
for (int i = 0; i < arr.size(); i++) {
arr[i]--;
if (arr[i] == 0)
cout << i << endl;
}
return 0;
}
And a Python version:
arr = [100] * 10000000
for x in range (0, 10000000 - 1):
arr[x] = arr[x] - 1
if arr[x] == 0:
print x

Comparing speed of Haskell and C for the computation of primes

I initially wrote this (brute force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using "if-then-else" versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare and I got the following (Haskell slower by just over 25%) :
(Note I got the ideas of using rem instead of mod and also the O3 option in the compiler invocation from the following post : On improving Haskell's performance compared to C in fibonacci micro-benchmark)
Haskell : Forum.hs
divisibleRec :: Int -> Int -> Bool
divisibleRec i j
| j == 1 = False
| i `rem` j == 0 = True
| otherwise = divisibleRec i (j-1)
divisible::Int -> Bool
divisible i = divisibleRec i (i-1)
r = [ x | x <- [2..200000], divisible x == False]
main :: IO()
main = print(length(r))
C : main.cpp
#include <stdio.h>
bool divisibleRec(int i, int j){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ return divisibleRec(i,j-1); }
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<200000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results I got were as follows :
Compilation times
time (ghc -O3 -o runProg Forum.hs)
real 0m0.355s
user 0m0.252s
sys 0m0.040s
time (gcc -O3 -o runProg main.cpp)
real 0m0.070s
user 0m0.036s
sys 0m0.008s
and the following running times :
Running times on Ubuntu 32 bit
Haskell
17984
real 0m54.498s
user 0m51.363s
sys 0m0.140s
C++
number of primes = 17984
real 0m41.739s
user 0m39.642s
sys 0m0.080s
I was quite impressed with the running times of Haskell. However my question is this : can I do anything to speed up the haskell program without :
Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; but I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the llvm compiler (because I dont have this installed)
[EDIT : Memory usage]
After a comment by Alan I noticed that the C program uses a constant amount of memory where as the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.
[EDIT : Summary of bigger runs]
max number tested : 200,000:
(54.498s/41.739s) = Haskell 30.5% slower
max number tested : 400,000:
3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower
max number tested : 800,000:
14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower
[EDIT : Code for Alan]
This was the code that I had written earlier which does not have recursion and which I had tested on 200,000 :
#include <stdio.h>
bool divisibleRec(int i, int j){
while(j>0){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ j -= 1;}
}
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<8000000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results for the C code with and without recursion are as follows (for 800,000) :
With recursion : 11m6.024s
Without recursion : 11m5.328s
Note that the executable seems to take up 60kb (as seen in System monitor) irrespective of the maximum number, and therefore I suspect that the compiler is detecting this recursion.
This isn't really answering your question, but rather what you asked in a comment regarding growing memory usage when the number 200000 grows.
When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.
You can try with
main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s+1) 0 [2..200000]
(foldl' is a stricter foldl from Data.List that avoids thunks being built up).
Well bang patters give you a very small win (as does llvm, but you seem to have expected that):
{-# LANUGAGE BangPatterns #-}
divisibleRec !i !j | j == 1 = False
And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:
divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool
My timings:
$ time ./so -- Int
2262
real 0m2.332s
$ time ./so -- Word32
2262
real 0m1.424s
This is a closer match to your C program, which is only using int. It still doesn't match performance wise, I suspect we'd have to look at core to figure out why.
EDIT: and the memory use, as was already noted I see, is about the named list r. I just inlined r, made it output a 1 for each non-divisble value and took the sum:
main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]
Another way to write down your algorithm is
main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]
Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.
Replacing your code with explicit loop with bang patterns made no difference whatsoever:
{-# OPTIONS_GHC -XBangPatterns #-}
r4::Int->Int
r4 n = go 0 2 where
go !c i | i>n = c
| True = go (if not(divisible i) then (c+1) else c) (i+1)
divisibleRec::Int->Int->Bool
divisibleRec i !j | j == 1 = False
| i `rem` j == 0 = True
| otherwise = divisibleRec i (j-1)
When I started programming in Haskell I was also impressed about its speed. You may be interested in reading point 5 "The speed of Haskell" of this article.

Resources