Comparing speed of Haskell and C for the computation of primes

I initially wrote this (brute force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using "if-then-else" versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare and I got the following (Haskell slower by just over 25%) :
(Note I got the ideas of using rem instead of mod and also the O3 option in the compiler invocation from the following post : On improving Haskell's performance compared to C in fibonacci micro-benchmark)
Haskell : Forum.hs
divisibleRec :: Int -> Int -> Bool
divisibleRec i j
| j == 1 = False
| i `rem` j == 0 = True
| otherwise = divisibleRec i (j-1)
divisible::Int -> Bool
divisible i = divisibleRec i (i-1)
r = [ x | x <- [2..200000], divisible x == False]
main :: IO()
main = print(length(r))
C : main.cpp
#include <stdio.h>
bool divisibleRec(int i, int j){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ return divisibleRec(i,j-1); }
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<200000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results I got were as follows :
Compilation times
time (ghc -O3 -o runProg Forum.hs)
real 0m0.355s
user 0m0.252s
sys 0m0.040s
time (gcc -O3 -o runProg main.cpp)
real 0m0.070s
user 0m0.036s
sys 0m0.008s
and the following running times :
Running times on Ubuntu 32 bit
Haskell
17984
real 0m54.498s
user 0m51.363s
sys 0m0.140s
C++
number of primes = 17984
real 0m41.739s
user 0m39.642s
sys 0m0.080s
I was quite impressed with the running times of Haskell. However my question is this : can I do anything to speed up the haskell program without :
Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; but I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the LLVM compiler (because I don't have it installed)
[EDIT : Memory usage]
After a comment by Alan I noticed that the C program uses a constant amount of memory whereas the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.
[EDIT : Summary of bigger runs]
max number tested : 200,000:
(54.498s/41.739s) = Haskell 30.5% slower
max number tested : 400,000:
3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower
max number tested : 800,000:
14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower
[EDIT : Code for Alan]
This was the code that I had written earlier which does not have recursion and which I had tested on 200,000 :
#include <stdio.h>
bool divisibleRec(int i, int j){
while(j>0){
if(j==1){ return false; }
else if(i%j==0){ return true; }
else{ j -= 1;}
}
}
bool divisible(int i){ return divisibleRec(i, i-1); }
int main(void){
int i, count =0;
for(i=2; i<8000000; ++i){
if(divisible(i)==false){
count = count+1;
}
}
printf("number of primes = %d\n",count);
return 0;
}
The results for the C code with and without recursion are as follows (for 800,000) :
With recursion : 11m6.024s
Without recursion : 11m5.328s
Note that the executable seems to take up 60kb (as seen in System Monitor) irrespective of the maximum number, so I suspect that the compiler detects the tail recursion and turns it into a loop.

This isn't really answering your question, but rather what you asked in a comment regarding growing memory usage when the number 200000 grows.
When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.
You can try with
main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s+1) 0 [2..200000]
(foldl' is a stricter foldl from Data.List that avoids thunks being built up).

Well, bang patterns give you a very small win (as does llvm, but you seem to have expected that):
{-# LANGUAGE BangPatterns #-}
divisibleRec !i !j | j == 1 = False
And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:
divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool
My timings:
$ time ./so -- Int
2262
real 0m2.332s
$ time ./so -- Word32
2262
real 0m1.424s
This is a closer match to your C program, which is only using int. It still doesn't match performance wise, I suspect we'd have to look at core to figure out why.
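To compare like with like from the C side instead, one could widen the C operands to 64 bits, since GHC's Int is 64 bits wide on x86-64 while C's int is typically 32. The following is only a minimal sketch under that assumption; the names divisible32 and divisible64 are hypothetical and do not appear in the original programs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the question's brute-force test with 32-bit operands. */
bool divisible32(int32_t i)
{
    for (int32_t j = i - 1; j > 1; j--)
        if (i % j == 0)
            return true;   /* found a divisor in [2, i-1] */
    return false;
}

/* Hypothetical helper: identical logic with 64-bit operands,
   which is closer to Haskell's Int on x86-64. */
bool divisible64(int64_t i)
{
    for (int64_t j = i - 1; j > 1; j--)
        if (i % j == 0)
            return true;
    return false;
}
```

Timing a counting loop over each variant with the same compiler flags should indicate how much of the gap on a given machine comes from operand width alone.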
EDIT: and the memory use, as was already noted I see, is about the named list r. I just inlined r, made it output a 1 for each non-divisible value and took the sum:
main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]

Another way to write down your algorithm is
main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]
Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.
Replacing your code with explicit loop with bang patterns made no difference whatsoever:
{-# OPTIONS_GHC -XBangPatterns #-}
r4::Int->Int
r4 n = go 0 2 where
go !c i | i>n = c
| True = go (if not(divisible i) then (c+1) else c) (i+1)
divisibleRec::Int->Int->Bool
divisibleRec i !j | j == 1 = False
| i `rem` j == 0 = True
| otherwise = divisibleRec i (j-1)

When I started programming in Haskell I was also impressed by its speed. You may be interested in reading point 5 "The speed of Haskell" of this article.

Related

Haskell -> C FFI performance

This is the dual question of Performance considerations of Haskell FFI / C?: I would like to call a C function with as small an overhead as possible.
To set the scene, I have the following C function:
typedef struct
{
uint64_t RESET;
} INPUT;
typedef struct
{
uint64_t VGA_HSYNC;
uint64_t VGA_VSYNC;
uint64_t VGA_DE;
uint8_t VGA_RED;
uint8_t VGA_GREEN;
uint8_t VGA_BLUE;
} OUTPUT;
void Bounce(const INPUT* input, OUTPUT* output);
Let's run it from C and time it, with gcc -O3:
int main (int argc, char **argv)
{
INPUT input;
input.RESET = 0;
OUTPUT output;
int cycles = 0;
for (int j = 0; j < 60; ++j)
{
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_HSYNC == 0 && output.VGA_VSYNC == 0) break;
}
for (;; ++cycles)
{
Bounce(&input, &output);
if (output.VGA_DE) break;
}
}
printf("%d cycles\n", cycles);
}
Running it for 25152001 cycles takes ~400 ms:
$ time ./Bounce
25152001 cycles
real 0m0.404s
user 0m0.403s
sys 0m0.001s
Now let's write some Haskell code to set up FFI (note that Bool's Storable instance really does use a full int):
data INPUT = INPUT
{ reset :: Bool
}
data OUTPUT = OUTPUT
{ vgaHSYNC, vgaVSYNC, vgaDE :: Bool
, vgaRED, vgaGREEN, vgaBLUE :: Word64
}
deriving (Show)
foreign import ccall unsafe "Bounce" topEntity :: Ptr INPUT -> Ptr OUTPUT -> IO ()
instance Storable INPUT where ...
instance Storable OUTPUT where ...
And let's do what I believe to be functionally equivalent to our C code from before:
main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
poke inp $ INPUT{ reset = False }
let loop1 n = do
topEntity inp outp
out@OUTPUT{..} <- peek outp
let n' = n + 1
if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'
loop2 n = do
topEntity inp outp
out <- peek outp
let n' = n + 1
if vgaDE out then return n' else loop2 n'
loop3 k n
| k < 60 = do
n <- loop1 n
loop3 (k + 1) n
| otherwise = return n
n <- loop3 (0 :: Int) (0 :: Int)
printf "%d cycles" n
I build it with GHC 8.6.5, using -O3, and I get.. more than 3 seconds!
$ time ./.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/build/sim-ffi/sim-ffi
25152001 cycles
real 0m3.468s
user 0m3.146s
sys 0m0.280s
And it's not a constant overhead at startup, either: if I run for 10 times the cycles, I get roughly 3.5 seconds from C and 34 seconds from Haskell.
What can I do to reduce the Haskell -> C FFI overhead?
I managed to reduce the overhead so that the 25 M calls now finish in 1.2 seconds. The changes were:
1. Make loop1, loop2 and loop3 strict in the n argument (using BangPatterns)
2. Add an INLINE pragma to peek in OUTPUT's Storable instance
Point #1 is silly, of course, but that's what I get for not profiling earlier. That change alone gets me to 1.5 seconds....
Point #2, however, makes a ton of sense and is generally applicable. It also addresses the comment from @Thomas M. DuBuisson:
Do you ever need the Haskell structure in haskell? If you can just keep it as a pointer to memory and have a few test functions such as vgaVSYNC :: Ptr OUTPUT -> IO Bool then that will save a lot of copying, allocation, GC work on every call.
In the eventual full program, I do need to look at all the fields of OUTPUT. However, with peek inlined, GHC is happy to do the case-of-case transformation, so I can see in Core that now there is no OUTPUT value allocated; the output of peek is consumed directly.

Why is iterating through an array backwards faster than forward in C

I'm studying for an exam and am trying to follow this problem:
I have the following C code to do some array initialisation:
int i, n = 61440;
double x[n];
for(i=0; i < n; i++) {
x[i] = 1;
}
But the following runs faster (0.5s difference in 1000 iterations):
int i, n = 61440;
double x[n];
for(i=n-1; i >= 0; i--) {
x[i] = 1;
}
I first thought that it was due to the loop accessing the n variable, thus having to do more reads (as suggested here for example: Why is iterating through an array backwards faster than forwards). But even if I change the n in the first loop to a hard coded value, or vice versa move the 0 in the bottom loop to a variable, the performance remains the same. I also tried to change the loops to only do half the work (go from 0 to < 30720, or from n-1 to >= 30720), to eliminate any special treatment of the 0 value, but the bottom loop is still faster
I assume it is because of some compiler optimisations? But everything I look up for the generated machine code suggests, that < and >= ought to be equal.
Thankful for any hints or advice! Thank you!
Edit: Makefile, for compiler details (this is part of a multi threading exercise, hence the OpenMP, though for this case it's all running on 1 core, without any OpenMP instructions in the code)
#CC = gcc
CC = /opt/rh/devtoolset-2/root/usr/bin/gcc
OMP_FLAG = -fopenmp
CFLAGS = -std=c99 -O2 -c ${OMP_FLAG}
LFLAGS = -lm
.SUFFIXES : .o .c
.c.o:
${CC} ${CFLAGS} -o $@ $*.c
sblas:sblas.o
${CC} ${OMP_FLAG} -o $@ $@.o ${LFLAGS}
Edit2: I redid the experiment with n * 100, getting the same results:
Forward: ~170s
Backward: ~120s
Similar to the previous values of 1.7s and 1.2s, just times 100
Edit3: Minimal example. The changes described above were all localized to the vector update method. This is the default forward version, which takes longer than the backward version using for(i = limit - 1; i >= 0; i--)
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
void vector_update(double a[], double b[], double x[], int limit);
/* SBLAS code */
void *main() {
int n = 1024*60;
int nsteps = 1000;
int k;
double a[n], b[n], x[n];
double vec_update_start;
double vec_update_time = 0;
for(k = 0; k < nsteps; k++) {
// Loop over whole program to get reasonable execution time
// (simulates a time-stepping code)
vec_update_start = omp_get_wtime();
vector_update(a, b, x, n);
vec_update_time = vec_update_time + (omp_get_wtime() - vec_update_start);
}
printf( "vector update time = %f seconds \n \n", vec_update_time);
}
void vector_update(double a[], double b[], double x[] ,int limit) {
int i;
for (i = 0; i < limit; i++ ) {
x[i] = 0.0;
a[i] = 3.142;
b[i] = 3.142;
}
}
Edit4: the CPU is AMD quad-core Opteron 8378. The machine uses 4 of those, but I'm using only one on the main processor (core ID 0 in the AMD architecture)
It's not the backward iteration but the comparison with zero which causes the loop in the second case run faster.
for(i=n-1; i >= 0; i--) {
Comparison with zero can be done with a single assembly instruction whereas comparison with any other number takes multiple instructions.
The main reason is that your compiler isn't very good at optimising. In theory there's no reason that a better compiler couldn't have converted both versions of your code into the exact same machine code instead of letting one be slower.
Everything beyond that depends on what the resulting machine code is and what it's running on. This can include differences in RAM and/or CPU speeds, differences in cache behaviour, differences in hardware prefetching (and number of prefetchers), differences in instruction costs and instruction pipelining, differences in speculation, etc. Note that (in theory) this doesn't exclude the possibility that (on most computers but not on your computer) the machine code your compiler generates for forward loop is faster than the machine code it generates for backward loop (your sample size isn't large enough to be statistically significant, unless you're working on embedded systems or game consoles where all computers that run the code are identical).
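Since the outcome depends on the generated machine code and the hardware, the practical way to settle it is to measure both directions repeatedly on the target machine. Here is a minimal timing sketch; the helper names fill_forward, fill_backward and time_fill are made up for illustration, and it assumes n > 0. It uses clock() from the standard library rather than the OpenMP timer from the question.

```c
#include <time.h>

/* Hypothetical helpers: the two loop directions from the question. */
void fill_forward(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = 1.0;
}

void fill_backward(double *x, int n)
{
    for (int i = n - 1; i >= 0; i--)
        x[i] = 1.0;
}

/* Runs fill(x, n) reps times and returns the CPU time in seconds.
   One element is read back after every pass and accumulated into *sink,
   so that the stores are less likely to be optimised away. Assumes n > 0. */
double time_fill(void (*fill)(double *, int), double *x, int n,
                 int reps, double *sink)
{
    double sum = 0.0;
    clock_t start = clock();
    for (int k = 0; k < reps; k++) {
        fill(x, n);
        sum += x[k % n];
    }
    clock_t end = clock();
    *sink = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_fill several times for each direction, and comparing the spread of the results rather than a single pair of numbers, gives a better idea of whether the difference is real. Looking at the assembly (gcc -S) for both loops is the other half of the answer.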

Calculate possible combinations in C

I've started studying C and I'm trying to practice it developing a small application. Please, could you give any tips about what to do here?
I want to buy shoes from three different brands (brandA=50; brandB=100; brandC=150). I need to spend 2000 dollars on it and buy exactly 20 shoes.
How could I write a program to display all possible combinations?
E.g. brandA (10 shoes), brandB (0 shoe), brandC(10 shoes);
brandA (1 shoe), brandB (18 shoes), brandC (1 shoe), etc.
Please, I don't want the full code now but tips about how to do it.
I really appreciate any help. Tks!
I've updated my post to include a code. Does this code make any sense?
int main(void) {
int brandA=50, brandB=100, brandC=150, ba, bb, bc;
for(ba=0;ba<=20;ba++) {
for(bb=0;bb<=20;bb++) {
for(bc=0;bc<=20;bc++) {
if(ba+bb+bc==20 && (ba*brandA)+(bb*brandB)+(bc*brandC)==2000) {
printf("You can buy %d brandA, %d brandB, %d brandC", ba,bb,bc);
}
}
}
}
return 0;
}
First you need an algorithm:
1. Start with zero shoes of every brand: brandA=0; brandB=0; brandC=0
2. Check the total quantity = 0+0+0 = 0
3. If it is not 20 pcs, skip it
4. If it equals 20 pcs (for example: brandA=5; brandB=5; brandC=10), check the total price.
5. If the total price equals 2000, show it; if not, skip it.
6. Increment the brandA value up to 20
7. Repeat steps 2-6
8. Increment the brandB value up to 20
9. Repeat steps 2-8
10. Increment the brandC value up to 20
11. Repeat steps 2-10
Note: you can use 3 nested 'for' loops :)
If you are a beginner I suggest to start with backtracking and recursion.
Even if backtracking is a costly technique it's great for a beginner to see how recursion can provide a simple yet powerful solution to a problem.
Here are some resources for you to start: http://web.cse.ohio-state.edu/~gurari/course/cis680/cis680Ch19.html
And if you are serious about programming you should also read some books about algorithms and data structures since you will rely heavily on these basic fundamentals:
1. The Algorithm Design Manual
2. The Pragmatic Programmer: From Journeyman to Master
Now that you have it working, I feel no guilt in suggesting a potential refinement. Rather than a pure brute force triple loop, as others have mentioned, you can use the relationship between A, B & C to eliminate the third loop. Remember in C, there are generally many ways to approach any given problem and many ways to handle the output. As long as the logic and syntax are correct, then the only difference will be in the efficiency of the algorithms. Here is an example of eliminating the third loop:
#include <stdio.h>
int main (void) {
int ca = 50;
int cb = 100;
int cc = 150;
int budget = 2000;
int pairs = 20;
int a, b, c, cost;
for (a = 0; a < budget/ca; a++)
for (b = 0; b < budget/cb; b++)
{
c = pairs - (a + b);
if ((cost = a * ca + b * cb + c * cc) != budget)
continue;
printf ("\n a (%2d) * %3d = %4d"
"\n b (%2d) * %3d = %4d"
"\n c (%2d) * %3d = %4d\n",
a, ca, a * ca, b, cb, b * cb, c, cc, c * cc);
printf (" ===================\n (%d) %d\n", pairs, budget);
}
return 0;
}
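The same relationship can be pushed one step further to remove the second loop as well. Substituting c = pairs - a - b into the cost equation leaves one equation in a and b, so b follows directly from a. The sketch below assumes the prices, budget and pair count from the question; the function name shoe_combinations and the enum constants are made up for illustration.

```c
/* Values taken from the question; the names are hypothetical. */
enum { CA = 50, CB = 100, CC = 150, BUDGET = 2000, PAIRS = 20 };

/* Stores up to max (a, b, c) triples in out and returns how many exist.
   Assumes CC != CB so that the equation can be solved for b. */
int shoe_combinations(int out[][3], int max)
{
    int found = 0;
    for (int a = 0; a <= PAIRS; a++) {
        /* From a + b + c = PAIRS and CA*a + CB*b + CC*c = BUDGET:
           (CC - CB) * b = CC*PAIRS - BUDGET - (CC - CA)*a */
        int num = CC * PAIRS - BUDGET - (CC - CA) * a;
        if (num % (CC - CB) != 0)
            continue;                 /* b would not be a whole number */
        int b = num / (CC - CB);
        int c = PAIRS - a - b;
        if (b < 0 || c < 0)
            continue;                 /* negative quantities make no sense */
        if (found < max) {
            out[found][0] = a;
            out[found][1] = b;
            out[found][2] = c;
        }
        found++;
    }
    return found;
}
```

With these numbers the equation reduces to b = 20 - 2a and c = a. Note that this also reports a = 0, b = 20, c = 0, which a loop bounded by b < budget/cb never reaches.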
Compiling
Since you are new to C, when you compile your code, make sure you always compile with warnings enabled. The compiler warnings are there for a reason, and there are very, very few circumstances where you can rely on code that doesn't compile without warnings. At minimum, you will want to compile with -Wall -Wextra enabled (gcc). You can also include -pedantic if you want to check your code against virtually all possible warnings. For example, to compile the code above:
$ gcc -Wall -Wextra -pedantic -o bin/shoes shoes.c
If you want to add optimizations to the fullest extent, you can add:
-Ofast (-O3 with gcc < 4.6)
Output
$ ./bin/shoes
a ( 1) * 50 = 50
b (18) * 100 = 1800
c ( 1) * 150 = 150
===================
(20) 2000
a ( 2) * 50 = 100
b (16) * 100 = 1600
c ( 2) * 150 = 300
===================
(20) 2000
a ( 3) * 50 = 150
b (14) * 100 = 1400
c ( 3) * 150 = 450
===================
(20) 2000
<..snip..>
a ( 9) * 50 = 450
b ( 2) * 100 = 200
c ( 9) * 150 = 1350
===================
(20) 2000
a (10) * 50 = 500
b ( 0) * 100 = 0
c (10) * 150 = 1500
===================
(20) 2000
Good luck learning C. There is no other comparable language that gives you the precise low-level control that you have in C. (assembler excluded) But that precise control doesn't come for free. There is a learning curve involved and there is a bit more to cover in C before you will lose that "fish out of water" feeling and feel comfortable with the language. The benefit of learning C, with the low-level access it provides, is it will greatly improve your understanding of how programming works. That knowledge is applicable to all other programming languages (no matter how hard the other languages work to hide the details from you). C is time well spent.

Why does this code go into infinite loop

This function below checks to see if an integer is prime or not.
I'm running a for loop from 3 to 2147483647 (+ve limit of long int).
But this code hangs, can't find out why?
#include<time.h>
#include<stdio.h>
int isPrime1(long t)
{
long i;
if(t==1) return 0;
if(t%2==0) return 0;
for(i=3;i<t/2;i+=2)
{
if(t%i==0) return 0;
}
return 1;
}
int main()
{
long i=0;
time_t s,e;
s = time(NULL);
for(i=3; i<2147483647; i++)
{
isPrime1(i);
}
e = time(NULL);
printf("\n\t Time : %ld secs", e - s );
return 0;
}
It will eventually terminate, but it will take a while. If you look at your loops with the isPrime1 function inlined, you have something like:
for(i=3; i<2147483647; i++)
for(j=3;j<i/2;j+=2)
which is roughly n*n/4 = O(n^2). Your loop trip count is way too high.
It depends upon the system and the compiler. On Linux, with GCC 4.7.2 and compiling with gcc -O2 vishaid.c -o vishaid the program returns immediately, and the compiler is optimizing all the call to isPrime1 by removing them (I checked the generated assembler code with gcc -O2 -S -fverbose-asm, then main does not even call isPrime1). And GCC is right: since isPrime1 has no side-effect and its result is not used, its call can be removed. Then the for loop has an empty body, so can also be optimized.
The lesson to learn is that when benchmarking optimized binaries, you better have some real side-effect in your code.
Also, arithmetic tells us that some i is prime if it has no divisor greater than 1 and no larger than its square root. So better code:
#include <math.h>
int isPrime1(long t) {
long i;
double r = sqrt((double)t);
long m = (long)r;
if(t==1) return 0;
if(t%2==0) return 0;
for(i=3;i <= m;i +=2)
if(t%i==0) return 0;
return 1;
}
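If you would rather avoid floating point altogether, the same bound can be written with integer arithmetic only. This is just a sketch of that variant; the name is_prime_nosqrt is made up for illustration.

```c
/* Hypothetical helper: returns 1 if t is prime, 0 otherwise.
   Trial division by odd numbers up to the integer square root of t. */
int is_prime_nosqrt(long t)
{
    if (t < 2) return 0;
    if (t < 4) return 1;            /* 2 and 3 are prime */
    if (t % 2 == 0) return 0;
    /* i <= t / i holds exactly when i * i <= t, but cannot overflow. */
    for (long i = 3; i <= t / i; i += 2)
        if (t % i == 0) return 0;
    return 1;
}
```

Unlike a version that rejects every even number up front, this one reports 2 as prime. The extra division in the loop condition is usually cheap, since the compiler can often get the quotient and the remainder from a single divide instruction.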
On my system (x86-64/Debian/Sid with i7 3770K Intel processor, the core running that program is at 3.5GHz) long-s are 64 bits. So I coded
int main ()
{
long i = 0;
long cnt = 0;
time_t s, e;
s = time (NULL);
for (i = 3; i < 2147483647; i++)
{
if (isPrime1 (i) && (++cnt % 4096) == 0) {
printf ("#%ld: %ld\n", cnt, i);
fflush (NULL);
}
}
e = time (NULL);
printf ("\n\t Time : %ld secs\n", e - s);
return 0;
}
and after about 4 minutes it was still printing a lot of lines, including
#6819840: 119566439
#6823936: 119642749
#6828032: 119719177
#6832128: 119795597
I'm guessing it would need several hours to complete. After 30 minutes it is still spitting (slowly)
#25698304: 486778811
#25702400: 486862511
#25706496: 486944147
#25710592: 487026971
Actually, the program needed 4 hours and 16 minutes to complete. Last outputs are
#105086976: 2147139749
#105091072: 2147227463
#105095168: 2147315671
#105099264: 2147402489
Time : 15387 secs
BTW, this program is still really inefficient: The primes program /usr/games/primes from bsdgames package is answering much quicker
% time /usr/games/primes 1 2147483647 | tail
2147483423
2147483477
2147483489
2147483497
2147483543
2147483549
2147483563
2147483579
2147483587
2147483629
/usr/games/primes 1 2147483647
10.96s user 0.26s system 99% cpu 11.257 total
and it has still printed 105097564 lines (most being skipped by tail)
If you are interested in prime number generation, read several math books (it is still a research subject if you are interested in efficiency; you can still get a PhD on that subject). Start with the Sieve of Eratosthenes and primality test pages on Wikipedia.
Most importantly, first compile your program with debugging information and all warnings (i.e. gcc -Wall -g on Linux) and learn to use your debugger (i.e. gdb on Linux). You could then interrupt your debugged program (with Ctrl-C under gdb, then let it continue with the cont command to gdb) after about a minute or two, then observe that the i counter in main is increasing slowly. Perhaps also ask for profiling information (with the -pg option to gcc, then use gprof). And when coding complex arithmetic things it is well worth reading good math books about them (primality testing is a very complex subject, central to most cryptographic algorithms).
This is a very inefficient approach to test for primes, and that's why it seems to hang.
Search the web for more efficient algorithms, such as the Sieve of Eratosthenes
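For reference, a basic Sieve of Eratosthenes fits in a few lines of C. This is a minimal sketch rather than a tuned implementation: it uses one byte per number, so it assumes n is small enough for that much memory, and the function name count_primes_upto is made up for illustration.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the number of primes <= n,
   or -1 if the working array cannot be allocated. */
long count_primes_upto(long n)
{
    if (n < 2)
        return 0;
    unsigned char *composite = calloc((size_t)n + 1, 1);
    if (composite == NULL)
        return -1;

    long count = 0;
    for (long i = 2; i <= n; i++) {
        if (composite[i])
            continue;               /* already crossed off */
        count++;
        /* Only cross off when i*i <= n (written as i <= n / i to avoid
           overflow); smaller multiples were handled by smaller primes. */
        if (i <= n / i)
            for (long j = i * i; j <= n; j += i)
                composite[j] = 1;
    }
    free(composite);
    return count;
}
```

For the full 2147483647 range you would want a segmented or bit-packed sieve instead, since one byte per number needs about 2 GB.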
Here try this, see if it's really an infinite loop
int main()
{
long i=0;
time_t s,e;
s = time(NULL);
for(i=3; i<2147483647; i++)
{
isPrime1(i);
//calculate the time execution for each loop
e = time(NULL);
printf("\n\t Time for loop %ld: %ld secs", i, e - s );
}
return 0;
}

fast way to check if an array of chars is zero [duplicate]

This question already has answers here:
Faster approach to checking for an all-zero buffer in C?
I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?
Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.
In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:
int sum = 0;
for (i = 0; i < ARRAY_SIZE; ++i) {
sum |= array[i];
}
if (sum != 0) {
printf("At least one array element is non-zero\n");
}
However, with today's pipelined super-scalar processor designs complete with branch prediction, all non-SSE approaches are virtually indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could be, in the long run, more efficient than the sum |= array[i] approach (which always traverses the entire array). The exception is when you expect your array to be almost always made up exclusively of zeroes; in that case, making the sum |= array[i] approach truly branchless by using GCC's -funroll-loops could give you the better numbers. See the numbers below for an Athlon processor; results may vary with processor model and manufacturer.
#include <stdio.h>
int a[1024*1024];
/* Methods 1 & 2 are equivalent on x86 */
int main() {
int i, j, n;
# if defined METHOD3
int x;
# endif
for (i = 0; i < 100; ++i) {
# if defined METHOD3
x = 0;
# endif
for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
# if defined METHOD1
if (a[j] != 0) { n = 1; }
# elif defined METHOD2
n |= (a[j] != 0);
# elif defined METHOD3
x |= a[j];
# endif
}
# if defined METHOD3
n = (x != 0);
# endif
printf("%d\n", n);
}
}
$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real 0m0.377s
user 0m0.372s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real 0m0.351s
user 0m0.348s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real 0m0.343s
user 0m0.340s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real 0m0.209s
user 0m0.206s
sys 0m0.003s
Here's a short, quick solution, if you're okay with using inline assembly.
#include <stdio.h>
int main(void) {
int checkzero(char *string, int length);
char str1[] = "wow this is not zero!";
char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
printf("%d\n", checkzero(str1, sizeof(str1)));
printf("%d\n", checkzero(str2, sizeof(str2)));
}
int checkzero(char *string, int length) {
int is_zero;
__asm__ (
"cld\n"
"xorb %%al, %%al\n"
"repz scasb\n"
: "=c" (is_zero)
: "c" (length), "D" (string)
: "eax", "cc"
);
return !is_zero;
}
In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in a register, and ask the processor to scan the string for a zero (we specify this by setting the lower 8 bits of the accumulator, namely %%al, to zero), reducing the value of said register on each iteration, until a non-zero byte is encountered. Now, if the string was all zeroes, the register, too, will be zero, since it was decremented length number of times. However, if a non-zero value was encountered, the "loop" that checked for zeroes terminated prematurely, and hence the register will not be zero. We then obtain the value of that register, and return its boolean negation.
Profiling this yielded the following results:
$ time or.exe
real 0m37.274s
user 0m0.015s
sys 0m0.000s
$ time scasb.exe
real 0m15.951s
user 0m0.000s
sys 0m0.046s
(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)
If you want to do this in 32-bit C, probably just loop over the array as a 32-bit integer array and compare it to 0, then make sure the stuff at the end is also 0.
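A minimal sketch of that idea follows. It reads each 32-bit word with memcpy instead of casting the pointer, which sidesteps alignment and aliasing problems; compilers typically turn such a fixed-size memcpy into a plain load. The function name is_zero_words is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all len bytes at buf are zero. */
int is_zero_words(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    /* Whole 32-bit words first. */
    for (; i + sizeof(uint32_t) <= len; i += sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);
        if (w != 0)
            return 0;
    }
    /* Then the 0 to 3 leftover bytes at the end. */
    for (; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```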
Split the checked memory in half, and compare the first part to the second.
a. If any difference, it can't be all the same.
b. If no difference repeat for the first half.
Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)
bool memcheck(char* p, char chr, size_t size) {
// Check if first char differs from expected.
if (*p != chr)
return false;
int near_half, far_half;
while (size > 1) {
near_half = size/2;
far_half = size-near_half;
if (memcmp(p, p+far_half, near_half))
return false;
size = far_half;
}
return true;
}
If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.
Make sure to use cache prefetching for a decent distance ahead (i.e. 1-2K) with something like __dcbt or prefetchnta (or prefetcht0 if you are going to use the buffer again soon).
You will also want to do something like SIMD or SWAR to OR multiple bytes at a time. Even with 32-bit words, it will be 4x fewer operations than a per-character version. I'd recommend unrolling the ORs and making them feed into a "tree" of ORs. You can see what I mean in my code example: this takes advantage of superscalar capability to do two integer ops (the ORs) in parallel by making use of ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1) but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.
The following pseudo-code example for the inner loop (no prolog/epilog) uses 32-bit ints but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. Also you will possibly need to do unaligned check before if your buffer is not 4-byte aligned and after if your buffer (after alignment) is not a multiple of 32-bytes in length.
const UINT32 *pmem = ***aligned-buffer-pointer***;
UINT32 a0,a1,a2,a3;
while(bytesremain >= 32)
{
// Compare an aligned "line" of 32-bytes
a0 = pmem[0] | pmem[1];
a1 = pmem[2] | pmem[3];
a2 = pmem[4] | pmem[5];
a3 = pmem[6] | pmem[7];
a0 |= a1; a2 |= a3;
pmem += 8;
a0 |= a2;
bytesremain -= 32;
if(a0 != 0) break;
}
if(a0!=0) then ***buffer-is-not-all-zeros***
I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
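For anyone who wants to try the OR tree as compilable C, here is a minimal sketch under a few assumptions: it loads each 32-byte line with memcpy so the buffer does not have to be aligned, it handles the leftover bytes with a plain byte loop, and it leaves out prefetching entirely. The function name is_zero_or_tree is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all len bytes at buf are zero. */
int is_zero_or_tree(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len >= 32) {
        uint32_t w[8];
        memcpy(w, p, sizeof w);          /* one 32-byte "line" */
        /* First level: four independent ORs. */
        uint32_t a0 = w[0] | w[1];
        uint32_t a1 = w[2] | w[3];
        uint32_t a2 = w[4] | w[5];
        uint32_t a3 = w[6] | w[7];
        /* Second and third level of the tree. */
        a0 |= a1;
        a2 |= a3;
        if ((a0 | a2) != 0)
            return 0;
        p += 32;
        len -= 32;
    }

    /* Fewer than 32 bytes remain: OR them together bytewise. */
    unsigned char tail = 0;
    for (size_t i = 0; i < len; i++)
        tail |= p[i];
    return tail == 0;
}
```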
Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:
int is_empty1(unsigned char * buf, int size)
{
int i;
for(i = 0; i < size; i++) {
if(buf[i] != 0) return 0;
}
return 1;
}
int is_empty2(unsigned char * buf, int size)
{
int sum = 0;
for(int i = 0; i < size; i++) {
sum |= buf[i];
}
return sum == 0;
}
Results:
All results, in microseconds:
is_empty1 is_empty2
MEDIAN 0.350 3.554
AVG 1.636 3.768
only false results:
is_empty1 is_empty2
MEDIAN 0.003 3.560
AVG 0.382 3.777
only true results:
is_empty1 is_empty2
MEDIAN 3.649 3.528
AVG 3.857 3.751
Summary: the second algorithm (ORing all bytes) performs better only for datasets where the probability of a false result is very small, because it omits the branch. Otherwise, returning early is clearly the better strategy.
Rusty Russell's memeqzero is very fast. It reuses memcmp to do the heavy lifting:
https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92.
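The general trick of letting memcmp compare a buffer against itself at an offset can be sketched as follows. This is a simplified illustration of the idea, not a copy of the ccan code, and the function name is_zero_memcmp is made up.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if all len bytes at buf are zero.
   If the first byte is zero and every byte equals its successor,
   then every byte is zero. */
int is_zero_memcmp(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    if (len == 0)
        return 1;
    if (p[0] != 0)
        return 0;
    /* Compare bytes [0, len-1) with bytes [1, len). */
    return memcmp(p, p + 1, len - 1) == 0;
}
```

The appeal is that memcmp is usually heavily optimised by the C library, so the hand-written part stays trivial. Whether a one-byte offset is the fastest choice on a given platform is something to measure; a larger, aligned offset may suit memcmp better.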

Resources