I am implementing a topological sort in Haskell with the requirement that it be as efficient as possible. I have profiled my current solution and found that the following method takes 60% of the total time (and allocates no additional space):
import Control.Monad.ST
import Control.Monad
import Data.Array.ST
import Data.Array.Unboxed
import Data.Word
import Data.Array.Base
zeroElementsAfterDecrement' :: (MArray a e m, Num e, Eq e) => a Int e -> [Int] -> m [Int]
zeroElementsAfterDecrement' arr is = foldr k (return []) is
  where
    k i a = do
        xs <- a
        decremented <- liftM (subtract 1) (unsafeRead arr i)
        unsafeWrite arr i decremented
        if decremented == 0 then return (i:xs) else return xs
largenum :: Int
largenum = 10000000
test = runST $ do
    arr <- newArray (1, largenum) 100 :: ST s (STUArray s Int Word32)
    zeroElementsAfterDecrement' arr [1..largenum]
main = (putStrLn . show) test
The function takes an array (I use unboxed mutable arrays) and a list of indexes, decrements the elements at those indexes, and returns the indexes of elements that became zero during this operation. Right now this is more than 10 times slower than the optimized C++ code, but still pretty good compared to Python (or maybe I don't know the Python way to optimize this). I understand there is overhead from executing monadic code, but maybe there are still ways to optimize that I am not aware of?
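For reference, the same decrement-and-collect pass can also be written as an explicit loop with an accumulator (a sketch only, using the same imports as above; it is not benchmarked, and the order of the returned indexes may differ from the foldr version):

-- A sketch (not benchmarked): an explicit left-to-right loop that accumulates
-- the indexes whose elements reached zero and reverses once at the end.
zeroElementsAfterDecrement'' :: (MArray a e m, Num e, Eq e) => a Int e -> [Int] -> m [Int]
zeroElementsAfterDecrement'' arr = go []
  where
    go acc []     = return (reverse acc)
    go acc (i:is) = do
        decremented <- liftM (subtract 1) (unsafeRead arr i)
        unsafeWrite arr i decremented
        go (if decremented == 0 then i : acc else acc) is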
Edit:
GHC: -O -fllvm: 0.54s
GHC (with unsafeWrite/unsafeRead and Word32): 0.34s
g++: 0.24s
g++ -O2: 0.05s
python3: 2.66s
Also, when I change foldr to foldl' it starts allocating some memory and is 4 times slower as a result. Why is that?
Here is a C++ version I compared it to:
#include <iostream>
#include <vector>
using namespace std;
#define LARGENUM 10000000
int main()
{
vector <int> arr;
for (int i = 0; i < LARGENUM; i++) {
arr.push_back(100);
}
for (int i = 0; i < arr.size(); i++) {
arr[i]--;
if (arr[i] == 0)
cout << i << endl;
}
return 0;
}
And a Python version:
arr = [100] * 10000000
for x in range(0, 10000000 - 1):
    arr[x] = arr[x] - 1
    if arr[x] == 0:
        print(x)
This is the dual question of Performance considerations of Haskell FFI / C?: I would like to call a C function with as small an overhead as possible.
To set the scene, I have the following C function:
typedef struct
{
uint64_t RESET;
} INPUT;
typedef struct
{
uint64_t VGA_HSYNC;
uint64_t VGA_VSYNC;
uint64_t VGA_DE;
uint8_t VGA_RED;
uint8_t VGA_GREEN;
uint8_t VGA_BLUE;
} OUTPUT;
void Bounce(const INPUT* input, OUTPUT* output);
Let's run it from C and time it, with gcc -O3:
int main (int argc, char **argv)
{
INPUT input;
input.RESET = 0;
OUTPUT output;
int cycles = 0;
for (int j = 0; j < 60; ++j)
{
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_HSYNC == 0 && output.VGA_VSYNC == 0) break;
}
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_DE) break;
}
}
printf("%d cycles\n", cycles);
}
Running it for 25152001 cycles takes ~400 ms:
$ time ./Bounce
25152001 cycles
real 0m0.404s
user 0m0.403s
sys 0m0.001s
Now let's write some Haskell code to set up FFI (note that Bool's Storable instance really does use a full int):
data INPUT = INPUT
    { reset :: Bool
    }

data OUTPUT = OUTPUT
    { vgaHSYNC, vgaVSYNC, vgaDE :: Bool
    , vgaRED, vgaGREEN, vgaBLUE :: Word64
    }
    deriving (Show)

foreign import ccall unsafe "Bounce" topEntity :: Ptr INPUT -> Ptr OUTPUT -> IO ()

instance Storable INPUT where ...
instance Storable OUTPUT where ...
And let's do what I believe to be functionally equivalent to our C code from before:
main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
    poke inp $ INPUT{ reset = False }
    let loop1 n = do
            topEntity inp outp
            out@OUTPUT{..} <- peek outp
            let n' = n + 1
            if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'
        loop2 n = do
            topEntity inp outp
            out <- peek outp
            let n' = n + 1
            if vgaDE out then return n' else loop2 n'
        loop3 k n
            | k < 60 = do
                n <- loop1 n
                loop3 (k + 1) n
            | otherwise = return n
    n <- loop3 (0 :: Int) (0 :: Int)
    printf "%d cycles" n
I build it with GHC 8.6.5, using -O3, and I get.. more than 3 seconds!
$ time ./.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/build/sim-ffi/sim-ffi
25152001 cycles
real 0m3.468s
user 0m3.146s
sys 0m0.280s
And it's not a constant overhead at startup, either: if I run for 10 times the cycles, I get roughly 3.5 seconds from C and 34 seconds from Haskell.
What can I do to reduce the Haskell -> C FFI overhead?
I managed to reduce the overhead so that the 25 M calls now finish in 1.2 seconds. The changes were:
Make loop1, loop2 and loop3 strict in the n argument (using BangPatterns)
Add an INLINE pragma to peek in OUTPUT's Storable instance
Point #1 is silly, of course, but that's what I get for not profiling earlier. That change alone gets me to 1.5 seconds....
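As a minimal, generic illustration of point #1 (this is not the original program, just a sketch): without bang patterns the counters can be carried around as chains of unevaluated (n + 1) thunks that are only forced at the very end, while strict counters keep memory flat.

{-# LANGUAGE BangPatterns #-}

-- Count up to a limit with strict accumulators; dropping the bangs can make
-- the loop build (n + 1) thunks instead of plain Ints.
count :: Int -> IO Int
count limit = go 0 0
  where
    go !n !i
        | i >= limit = return n
        | otherwise  = go (n + 1) (i + 1)

main :: IO ()
main = count 25152001 >>= print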
Point #2, however, makes a ton of sense and is generally applicable. It also addresses the comment from @Thomas M. DuBuisson:
Do you ever need the Haskell structure in Haskell? If you can just keep it as a pointer to memory and have a few test functions such as vgaVSYNC :: Ptr OUTPUT -> IO Bool then that will save a lot of copying, allocation, GC work on every call.
In the eventual full program, I do need to look at all the fields of OUTPUT. However, with peek inlined, GHC is happy to do the case-of-case transformation, so I can see in Core that now there is no OUTPUT value allocated; the output of peek is consumed directly.
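For illustration, given the OUTPUT record above, a Storable instance along those lines might look like the sketch below. This is not the original instance: the field offsets and the size are assumptions based on the C struct shown earlier (three uint64_t flags followed by three uint8_t channels, padded to the 8-byte alignment).

import Data.Word (Word8, Word64)
import Foreign.Storable (Storable(..))

instance Storable OUTPUT where
    sizeOf _    = 32   -- 3 * 8 bytes + 3 * 1 byte, padded (assumed layout)
    alignment _ = 8

    peek p = do
        hs <- peekByteOff p 0  :: IO Word64
        vs <- peekByteOff p 8  :: IO Word64
        de <- peekByteOff p 16 :: IO Word64
        r  <- peekByteOff p 24 :: IO Word8
        g  <- peekByteOff p 25 :: IO Word8
        b  <- peekByteOff p 26 :: IO Word8
        return OUTPUT { vgaHSYNC = hs /= 0
                      , vgaVSYNC = vs /= 0
                      , vgaDE    = de /= 0
                      , vgaRED   = fromIntegral r
                      , vgaGREEN = fromIntegral g
                      , vgaBLUE  = fromIntegral b }
    {-# INLINE peek #-}

    -- poke is never needed here: only INPUT is written in this benchmark.
    poke _ _ = error "poke OUTPUT: not implemented in this sketch"

With peek this small and inlined, the case-of-case transformation can eliminate the intermediate OUTPUT record at the call sites in the loops.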
The Haskell version is implemented using Data.IntSet, whose operations are O(lg n). However, there is a 15x (previously 30x) speed difference for n = 2000000, even though the Haskell code is already optimized to skip even numbers. I would like to know whether/why my implementation in Haskell is imperfect.
Original Haskell
primesUpTo :: Int -> [Int]
primesUpTo n = 2 : put S.empty [3,5..n]
  where
    put :: S.IntSet -> [Int] -> [Int]
    put _ [] = []
    put comps (x:xs) =
        if S.member x comps
            then put comps xs
            else x : put (S.union comps multiples) xs
      where
        multiples = S.fromList [x*2, x*3 .. n]
Update
fromDistinctAscList gives a 4x speed increase. 2-3-5-7-Wheel speeds up by another 50%.
primesUpTo :: Int -> [Int]
primesUpTo n = 2 : 3 : 5 : 7 : put S.empty (takeWhile (<=n) (spin wheel 11))
  where
    put :: S.IntSet -> [Int] -> [Int]
    put _ [] = []
    put comps (x:xs) =
        if S.member x comps
            then put comps xs
            else x : put (S.union comps multiples) xs
      where
        multiples = S.fromDistinctAscList [x*x, x*(x+2) .. n]
    spin (x:xs) n = n : spin xs (n + x)
    wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel
Benchmarking
All times are measured with the *nix time command (real time).
Haskell original : 2e6: N/A; 2e7: >30s
Haskell optimized: 2e6: 0.396s; 2e7: 6.273s
C++ Set (ordered): 2e6: 4.694s; 2e7: >30s
C++ Bool Array : 2e6: 0.039s; 2e7: 0.421s
Haskell optimized is slower than C++ Bool by 10~15x, and faster than C++ Set by 10x.
Source code
C++ compiler options: g++ 5.3.1, g++ -std=c++11
Haskell compiler options: GHC 7.8.4, plain ghc (no flags)
C code (Bool array) http://pastebin.com/W0s7cSWi
prime[0] = prime[1] = false;
for (int i=2; i<=limit; i++) { //edited
if (!prime[i]) continue;
for (int j=2*i; j<=n; j+=i)
prime[j] = false;
}
C code (Set) http://pastebin.com/sNpghrU4
nonprime.insert(1);
for (int i=2; i<=limit; i++) { //edited
if (nonprime.count(i) > 0) continue;
for (int j=2*i; j<=n; j+=i)
nonprime.insert(j);
}
Haskell code http://pastebin.com/HuMqwvRW
Code as written above.
I would like to know whether/why my implementation in Haskell is imperfect.
Instead of fromList, you should use fromDistinctAscList, which runs in linear time. You can also add only odd multiples starting from x*x rather than x*2, because all the smaller odd multiples have already been added. Style-wise, a right fold may fit better than explicit recursion.
Doing so, I get more than 3 times performance improvement for n equal to 2,000,000:
import Data.IntSet (member, union, empty, fromDistinctAscList)

sieve :: Int -> [Int]
sieve n = 2 : foldr go (const []) [3,5..n] empty
  where
    go i run obs
        | member i obs = run obs
        | otherwise    = i : run (union obs inc)
      where
        inc = fromDistinctAscList [i*i, i*(i + 2) .. n]
Nevertheless, an array has both O(1) access and cache-friendly memory allocation. Using mutable arrays, I see more than 15 times performance improvement over your Haskell code (again for n equal to 2,000,000):
{-# LANGUAGE FlexibleContexts #-}

import Data.Array.ST (STUArray)
import Control.Monad (forM_, foldM)
import Control.Monad.ST (ST, runST)
import Data.Array.Base (newArray, unsafeWrite, unsafeRead)

sieve :: Int -> [Int]
sieve n = reverse $ runST $ do
    arr <- newArray (0, n) False :: ST s (STUArray s Int Bool)
    foldM (go arr) [2] [3,5..n]
  where
    go arr acc i = do
        b <- unsafeRead arr i
        if b then return acc else do
            forM_ [i*i, i*(i + 2) .. n] $ \k -> unsafeWrite arr k True
            return $ i : acc
I work in R using C libraries. I need to pass to a C function an array of numbers between 1 and 10, which could also be "NA". Then in C, depending on the value, I need to set the output.
Here's a simplified version of the code:
dyn.load("ranking.so")
fun <- function(ranking) {
nrak <- length(ranking)
out <- .C("ranking", as.integer(nrak), as.character(ranking), rr = as.integer(vector("integer",nrak)))
out$rr
}
ranking <- sample(c(NA,seq(1,10)),10,replace=TRUE)
rr <- fun(ranking)
The C function could simply be something like:
#include <R.h>
void ranking(int *nrak, char *ranking, int *rr) {
int i ;
for (i=0;i<*nrak;i++) {
if (ranking[i] == 'NA')
rr[i] = 1 ;
else
rr[i] = (int) strtol(&ranking[i],(char **)NULL,10) ;
}
}
Due to the "NA" value I set ranking as character but maybe there's another way to do that, using integer and without replacing "NA" to 0 before calling the function?
(The code like this, gives me always an array of zeros...)
Test whether the value is an NA using R_NaInt, like this:
#include <R.h>
void ranking_c(int *nrak, int *ranking, int *rr) {
for (int i=0; i < *nrak; i++)
rr[i] = R_NaInt == ranking[i] ? -1 : ranking[i];
}
Invoke it from R, explicitly allowing NAs:
> x = c(1:2, NA_integer_)
> .C("ranking_c", length(x), as.integer(x), integer(length(x)), NAOK=TRUE)[[3]]
[1] 1 2 -1
Alternatively, use R's .Call() interface. Each R object is represented as an S-expression. There are C-level functions to manipulate S-expressions, e.g., Rf_length() for length, INTEGER() for data access, and Rf_allocVector() for allocating S-expressions of different types, such as INTSXP for integer vectors.
R memory management uses a garbage collector that can run on any call that allocates memory. It is therefore best practice to PROTECT() any R allocation while in scope.
Your function will accept zero or more S-expressions as input and return a single S-expression; it might be implemented as:
#include <Rinternals.h>
#include <R_ext/Arith.h>
SEXP ranking_call(SEXP ranking)
{
/* allocate space for result, PROTECTing from garbage collection */
SEXP result = PROTECT(Rf_allocVector(INTSXP, Rf_length(ranking)));
/* assign result */
for (int i = 0; i < Rf_length(ranking); ++i)
INTEGER(result)[i] =
R_NaInt == INTEGER(ranking)[i] ? -1 : INTEGER(ranking)[i];
UNPROTECT(1); /* no more need to protect */
return result;
}
And invoked from R with .Call("ranking_call", as.integer(ranking)).
Using .Call is more efficient than .C in terms of speed and memory allocation (.C may copy atomic vectors on the way in), but the primary reason to use it is for the flexibility it offers in terms of working directly with R's data structures. This is especially important when the return values are more complicated than atomic vectors.
You are attempting to address a couple of delicate and non-trivial points, not least how to compile code for use with R, and how to test for non-finite values.
You asked for help with C. I would like to suggest C++, which you do not need to use in a complicated way. Consider this short file, which contains a function to process a vector along the lines you suggest (for simplicity I just test for NA and assign 42 as a marker, or else square the value):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector foo(NumericVector x) {
unsigned int n = x.size();
for (unsigned int i=0; i<n; i++)
if (NumericVector::is_na(x[i]))
x[i] = 42.0;
else
x[i] = pow(x[i], 2);
return x;
}
/*** R
foo( c(1, 3, NA, NaN, 6) )
*/
If I save this on my box as /tmp/foo.cpp, then in order to compile, link, load and even run the embedded R usage example, I only need one line calling sourceCpp():
R> Rcpp::sourceCpp("/tmp/foo.cpp")
R> foo( c(1, 3, NA, NaN, 6))
[1] 1 9 42 42 36
R>
We can do the same with integers:
// [[Rcpp::export]]
IntegerVector bar(IntegerVector x) {
unsigned int n = x.size();
for (unsigned int i=0; i<n; i++)
if (IntegerVector::is_na(x[i]))
x[i] = 42;
else
x[i] = pow(x[i], 2);
return x;
}
I am trying to perform a series of transforms on graphical files using Haskell and Repa/DevIL. The starting example used was provided by the Haskell wiki page https://wiki.haskell.org/Numeric_Haskell:_A_Repa_Tutorial. I am an imperative programmer of 30 years' experience, with some Erlang for good measure, trying to learn Haskell outside a classroom environment.
The problem is manipulating the data after the loaded file has been transformed into a Repa array:
import Data.Array.Repa.IO.DevIL (runIL,readImage,writeImage,Image(RGB),IL)
import qualified Data.Array.Repa as R
import Data.Vector.Unboxed as DVU
import Control.Monad
import System.Environment (getArgs)    -- needed for getArgs below

main :: IO ()
main = do
    [f] <- getArgs
    (RGB a) <- runIL $ Data.Array.Repa.IO.DevIL.readImage f
    let
        c = (computeP (R.traverse a id rgbTransform)) :: IL (Array U DIM3 Float)
which is successfully cast to type "Array F DIM3 Float" as output from the rgbTransform. From that point on it has been a nightmare to use the data. Flicking the array storage type between F(oreign) and U(nboxed) changes the usability of all the following calls, and the Repa-added monad layer IL forces the use of liftM for nearly every equation following the first transform:
    let -- continued
        sh = liftM R.extent c    -- IL DIM3
        v  = liftM R.toUnboxed c -- IL (Vector Float)
        lv = liftM DVU.length v  -- IL Int
        f  = liftM indexed v     -- vector of tuples: (Int,a) where Int is idx
        k  = (Z :. 2) :. 2 :. 0 :: DIM3
These are the routines I can call without error. The IO monad's print command produces no output if placed in or after this 'let' list, due to the IL monad layer.
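One possible direction (a sketch, untested here) is to bind the computed array once in IO rather than keeping everything inside IL: computeP runs in any monad, so after the single runIL call that loads the image, subsequent bindings are plain values and print works normally. The rgbTransform below is only a placeholder standing in for the real transform:

import System.Environment (getArgs)
import Data.Array.Repa.IO.DevIL (runIL, readImage, Image(RGB))
import qualified Data.Array.Repa as R
import qualified Data.Vector.Unboxed as DVU
import Data.Word (Word8)

-- Placeholder transform: scale each Word8 sample to a Float in [0,1].
rgbTransform :: (R.DIM3 -> Word8) -> R.DIM3 -> Float
rgbTransform look ix = fromIntegral (look ix) / 255

main :: IO ()
main = do
    [f] <- getArgs
    (RGB a) <- runIL (readImage f)
    img <- R.computeP (R.traverse a id rgbTransform)
             :: IO (R.Array R.U R.DIM3 Float)
    print (R.extent img)                  -- plain IO here, so print works
    print (DVU.length (R.toUnboxed img))  -- length of the flat Float vector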
The game plan for the curious:
1. read the graphic file (done, via Repa)
2. resize the image (not done; there is no resize in Repa, so it must be hand-coded)
3. transform and convert the image from Word8 to Float (done)
4. get a stable pointer to the transformed Float data (not done)
5. transform the Float data in place as an array of C structs of {float a, b, c;}, by an external C routine via FFI (not completely done), hopefully without marshalling a new graphic array, by passing a pointer to the data
6. perform more passes over the transformed data to extract more info (partly done)
I am looking for help with issues 4 and 5.
4 -> The type system has been difficult to deal with while attempting to get C-usable memory pointers. Digging through the mountains of Haskell library calls has not helped.
5 -> The external C routine has this type:
foreign import ccall unsafe "transform.h xform"
    c_xform :: Ptr (CFloat, CFloat, CFloat) -> CInt -> IO ()
The Ptr is expected to point to an unboxed flat C array of rgb_t structs:
typedef struct
{
float r;
float g;
float b;
} rgb_t;
Available web-based descriptions of how to deal with array pointers in the FFI are non-existent, if not downright obscure. The fairly straightforward idea I had in mind is to unfreeze the data, pass in a C array of floating-point RGB structs, modify them in place, and then freeze the result. The external transform is pure in the sense that the same input produces predictable output; it does not use threads, global variables, or obscure libraries.
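For issues 4 and 5, one sketch (untested, and making assumptions about the array layout) is to compute the transformed image into Repa's Foreign (F) representation and hand its underlying ForeignPtr to the C routine. This assumes the innermost dimension holds the three colour channels, so the memory layout matches a flat array of rgb_t structs:

{-# LANGUAGE ForeignFunctionInterface #-}

import qualified Data.Array.Repa as R
import Data.Array.Repa ((:.)(..), Z(..), DIM3)
import Data.Array.Repa.Repr.ForeignPtr (F, toForeignPtr)
import Foreign (Ptr, castPtr, withForeignPtr)
import Foreign.C.Types (CFloat, CInt)

foreign import ccall unsafe "transform.h xform"
    c_xform :: Ptr (CFloat, CFloat, CFloat) -> CInt -> IO ()

-- Run the in-place C transform over every pixel of an RGB Float image.
-- The Foreign representation stores the Floats contiguously, so no copying
-- or re-marshalling is needed; castPtr reinterprets the Float data as rgb_t.
xformInPlace :: R.Array F DIM3 Float -> IO ()
xformInPlace img =
    withForeignPtr (toForeignPtr img) $ \p ->
        let (Z :. h :. w :. _channels) = R.extent img
        in  c_xform (castPtr p) (fromIntegral (h * w))  -- one rgb_t per pixel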
Foreign.Marshal.Array seems to provide a way to convert Haskell data to C data and the other way around.
I tested interfacing C code and Haskell using the following files (Haskell + FFI for the first time for me), built with:
hsc2hs rgb_ffi.hsc
ghc main.hs rgb_ffi.hs rgb.c
rgb.h
#ifndef RGB_H
#define RGB_H
#include <stdlib.h>
typedef struct {
float r;
float g;
float b;
} rgb_t;
void rgb_test(rgb_t * rgbs, ssize_t n);
#endif
rgb.c
#include <stdlib.h>
#include <stdio.h>
#include "rgb.h"
void rgb_test(rgb_t * rgbs, ssize_t n)
{
int i;
for(i=0; i<n; i++) {
printf("%.3f %.3f %.3f\n", rgbs[i].r, rgbs[i].g, rgbs[i].b);
rgbs[i].r *= 2.0;
rgbs[i].g *= 2.0;
rgbs[i].b *= 2.0;
}
}
rgb_ffi.hsc
{-# LANGUAGE ForeignFunctionInterface #-}
{-# LANGUAGE CPP #-}
module RGB where
import Foreign
import Foreign.C
import Control.Monad (ap)
#include "rgb.h"
data RGB = RGB
    { r :: CFloat, g :: CFloat, b :: CFloat
    } deriving Show

instance Storable RGB where
    sizeOf _ = #{size rgb_t}
    alignment _ = alignment (undefined :: CInt)

    poke p rgb_t = do
        #{poke rgb_t, r} p $ r rgb_t
        #{poke rgb_t, g} p $ g rgb_t
        #{poke rgb_t, b} p $ b rgb_t

    peek p = return RGB
        `ap` (#{peek rgb_t, r} p)
        `ap` (#{peek rgb_t, g} p)
        `ap` (#{peek rgb_t, b} p)
foreign import ccall "rgb.h rgb_test" crgbTest :: Ptr RGB -> CSize -> IO ();
rgbTest :: [RGB] -> IO [RGB]
rgbTest rgbs = withArray rgbs $ \ptr -> do
    crgbTest ptr (fromIntegral (length rgbs))
    peekArray (length rgbs) ptr

rgbAlloc :: [RGB] -> IO (Ptr RGB)
rgbAlloc rgbs = newArray rgbs

rgbPeek :: Ptr RGB -> Int -> IO [RGB]
rgbPeek rgbs l = peekArray l rgbs

rgbTest2 :: Ptr RGB -> Int -> IO ()
rgbTest2 ptr l = do
    crgbTest ptr (fromIntegral l)
    return ()
main.hs
module Main (main) where
import RGB
main = do
    let a = [RGB {r = 1.0, g = 1.0, b = 1.0},
             RGB {r = 2.0, g = 2.0, b = 2.0},
             RGB {r = 3.0, g = 3.0, b = 3.0}]
    let l = length a
    print a
    -- b <- rgbTest a
    -- print b
    c <- rgbAlloc a
    rgbTest2 c l
    rgbTest2 c l
    d <- rgbPeek c l
    print d
    return ()
I initially wrote this (brute-force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using "if-then-else" versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare, and I got the following (Haskell slower by just over 25%):
(Note: I got the ideas of using rem instead of mod and of the -O3 compiler option from the following post: On improving Haskell's performance compared to C in fibonacci micro-benchmark.)
Haskell : Forum.hs
divisibleRec :: Int -> Int -> Bool
divisibleRec i j
  | j == 1         = False
  | i `rem` j == 0 = True
  | otherwise      = divisibleRec i (j-1)

divisible :: Int -> Bool
divisible i = divisibleRec i (i-1)

r = [ x | x <- [2..200000], divisible x == False ]

main :: IO()
main = print (length r)
C : main.cpp
#include <stdio.h>
bool divisibleRec(int i, int j){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ return divisibleRec(i,j-1); }
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<200000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results I got were as follows :
Compilation times
time (ghc -O3 -o runProg Forum.hs)
real 0m0.355s
user 0m0.252s
sys 0m0.040s
time (gcc -O3 -o runProg main.cpp)
real 0m0.070s
user 0m0.036s
sys 0m0.008s
and the following running times :
Running times on Ubuntu 32 bit
Haskell
17984
real 0m54.498s
user 0m51.363s
sys 0m0.140s
C++
number of primes = 17984
real 0m41.739s
user 0m39.642s
sys 0m0.080s
I was quite impressed with the running times of Haskell. However, my question is this: can I do anything to speed up the Haskell program without:
Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; but I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the LLVM compiler (because I don't have it installed)
[EDIT : Memory usage]
After a comment by Alan I noticed that the C program uses a constant amount of memory where as the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.
[EDIT : Summary of bigger runs]
max number tested : 200,000:
(54.498s/41.739s) = Haskell 30.5% slower
max number tested : 400,000:
3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower
max number tested : 800,000:
14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower
[EDIT : Code for Alan]
This was the code that I had written earlier which does not have recursion and which I had tested on 200,000 :
#include <stdio.h>
bool divisibleRec(int i, int j){
while(j>0){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ j -= 1;}
}
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<8000000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results for the C code with and without recursion are as follows (for 800,000) :
With recursion : 11m6.024s
Without recursion : 11m5.328s
Note that the executable seems to take up 60kb (as seen in System monitor) irrespective of the maximum number, and therefore I suspect that the compiler is detecting this recursion.
This isn't really answering your question, but rather addresses what you asked in a comment regarding memory usage growing as the bound 200000 grows.
When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.
You can try with
import Data.List (foldl')

main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s+1) 0 [2..200000]
(foldl' is a stricter foldl from Data.List that avoids thunks being built up).
Well, bang patterns give you a very small win (as does LLVM, but you seem to have expected that):
{-# LANGUAGE BangPatterns #-}
divisibleRec !i !j | j == 1 = False
And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:
divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool
My timings:
$ time ./so -- Int
2262
real 0m2.332s
$ time ./so -- Word32
2262
real 0m1.424s
This is a closer match to your C program, which is only using int. It still doesn't match performance-wise; I suspect we'd have to look at Core to figure out why.
EDIT: and the memory use, as was already noted, is about the named list r. I just inlined r, made it output a 1 for each non-divisible value and took the sum:
main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]
Another way to write down your algorithm is
main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]
Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.
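For reference, the all-based variant mentioned above, spelled out as a full program (same algorithm, just a different way to write the test):

-- Count numbers in [2..200000] having no divisor d in [2..x-1].
main :: IO ()
main = print $ length [ () | x <- [2..200000], all ((> 0) . rem x) [x-1, x-2 .. 2] ]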
Replacing your code with an explicit loop with bang patterns made no difference whatsoever:
{-# OPTIONS_GHC -XBangPatterns #-}

r4 :: Int -> Int
r4 n = go 0 2 where
    go !c i | i > n = c
            | True  = go (if not (divisible i) then (c+1) else c) (i+1)

divisibleRec :: Int -> Int -> Bool
divisibleRec i !j | j == 1         = False
                  | i `rem` j == 0 = True
                  | otherwise      = divisibleRec i (j-1)
When I started programming in Haskell I was also impressed by its speed. You may be interested in reading point 5, "The speed of Haskell", of this article.