Haskell -> C FFI performance

This is the dual of the question "Performance considerations of Haskell FFI / C?": I would like to call a C function from Haskell with as little overhead as possible.
To set the scene, I have the following C function:
typedef struct
{
    uint64_t RESET;
} INPUT;

typedef struct
{
    uint64_t VGA_HSYNC;
    uint64_t VGA_VSYNC;
    uint64_t VGA_DE;
    uint8_t VGA_RED;
    uint8_t VGA_GREEN;
    uint8_t VGA_BLUE;
} OUTPUT;
void Bounce(const INPUT* input, OUTPUT* output);
Let's run it from C and time it, with gcc -O3:
int main (int argc, char **argv)
{
    INPUT input;
    input.RESET = 0;

    OUTPUT output;

    int cycles = 0;
    for (int j = 0; j < 60; ++j)
    {
        for (;; ++cycles)
        {
            Bounce(&input, &output);
            if (output.VGA_HSYNC == 0 && output.VGA_VSYNC == 0) break;
        }
        for (;; ++cycles)
        {
            Bounce(&input, &output);
            if (output.VGA_DE) break;
        }
    }
    printf("%d cycles\n", cycles);
}
Running it for 25152001 cycles takes ~400 ms:
$ time ./Bounce
25152001 cycles
real 0m0.404s
user 0m0.403s
sys 0m0.001s
Now let's write some Haskell code to set up FFI (note that Bool's Storable instance really does use a full int):
data INPUT = INPUT
    { reset :: Bool
    }

data OUTPUT = OUTPUT
    { vgaHSYNC, vgaVSYNC, vgaDE :: Bool
    , vgaRED, vgaGREEN, vgaBLUE :: Word64
    }
    deriving (Show)
foreign import ccall unsafe "Bounce" topEntity :: Ptr INPUT -> Ptr OUTPUT -> IO ()
instance Storable INPUT where ...
instance Storable OUTPUT where ...
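The instance bodies are elided above. For concreteness, here is a plausible reconstruction (mine, not the original code), assuming Data.Word and Foreign.Storable are imported and assuming the offsets implied by the C struct layout: 0/8/16 for the three uint64_t flags, 24/25/26 for the three color bytes:

-- Hypothetical reconstruction of the elided instances.
instance Storable INPUT where
    sizeOf _ = 8
    alignment _ = 8
    peek p = INPUT . (/= (0 :: Word64)) <$> peekByteOff p 0
    poke p (INPUT r) = pokeByteOff p 0 (if r then 1 else 0 :: Word64)

instance Storable OUTPUT where
    sizeOf _ = 32      -- 3 x uint64_t + 3 x uint8_t, padded to 8-byte alignment
    alignment _ = 8
    peek p = OUTPUT
        <$> ((/= (0 :: Word64)) <$> peekByteOff p 0)          -- VGA_HSYNC
        <*> ((/= (0 :: Word64)) <$> peekByteOff p 8)          -- VGA_VSYNC
        <*> ((/= (0 :: Word64)) <$> peekByteOff p 16)         -- VGA_DE
        <*> (fromIntegral <$> (peekByteOff p 24 :: IO Word8)) -- VGA_RED
        <*> (fromIntegral <$> (peekByteOff p 25 :: IO Word8)) -- VGA_GREEN
        <*> (fromIntegral <$> (peekByteOff p 26 :: IO Word8)) -- VGA_BLUE
    poke p (OUTPUT hs vs de r g b) = do
        pokeByteOff p 0  (b2w hs)
        pokeByteOff p 8  (b2w vs)
        pokeByteOff p 16 (b2w de)
        pokeByteOff p 24 (fromIntegral r :: Word8)
        pokeByteOff p 25 (fromIntegral g :: Word8)
        pokeByteOff p 26 (fromIntegral b :: Word8)
      where b2w x = if x then 1 else 0 :: Word64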
And let's do what I believe to be functionally equivalent to our C code from before:
main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
    poke inp $ INPUT{ reset = False }

    let loop1 n = do
            topEntity inp outp
            out@OUTPUT{..} <- peek outp
            let n' = n + 1
            if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'

        loop2 n = do
            topEntity inp outp
            out <- peek outp
            let n' = n + 1
            if vgaDE out then return n' else loop2 n'

        loop3 k n
            | k < 60 = do
                n <- loop1 n
                loop3 (k + 1) n
            | otherwise = return n

    n <- loop3 (0 :: Int) (0 :: Int)
    printf "%d cycles" n
I build it with GHC 8.6.5, using -O3, and I get... more than 3 seconds!
$ time ./.stack-work/dist/x86_64-linux/Cabal-2.4.0.1/build/sim-ffi/sim-ffi
25152001 cycles
real 0m3.468s
user 0m3.146s
sys 0m0.280s
And it's not a constant overhead at startup, either: if I run for 10 times the cycles, I get roughly 3.5 seconds from C and 34 seconds from Haskell.
What can I do to reduce the Haskell -> C FFI overhead?

I managed to reduce the overhead so that the 25M calls now finish in 1.2 seconds. The changes were:
1. Make loop1, loop2 and loop3 strict in the n argument (using BangPatterns)
2. Add an INLINE pragma to peek in OUTPUT's Storable instance
Point #1 is silly, of course, but that's what I get for not profiling earlier. That change alone gets me to 1.5 seconds.
Point #2, however, makes a ton of sense and is generally applicable. It also addresses the comment from @Thomas M. DuBuisson:
Do you ever need the Haskell structure in Haskell? If you can just keep it as a pointer to memory and have a few test functions such as vgaVSYNC :: Ptr OUTPUT -> IO Bool, then that will save a lot of copying, allocation, and GC work on every call.
In the eventual full program, I do need to look at all the fields of OUTPUT. However, with peek inlined, GHC is happy to do the case-of-case transformation, so I can see in Core that now there is no OUTPUT value allocated; the output of peek is consumed directly.
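Put together, the two changes amount to something like the following sketch (my reconstruction, not verbatim from the project):

{-# LANGUAGE BangPatterns, RecordWildCards #-}

-- Change #2 goes into the Storable instance:
--   instance Storable OUTPUT where
--       {-# INLINE peek #-}
--       ...

main :: IO ()
main = alloca $ \inp -> alloca $ \outp -> do
    poke inp $ INPUT{ reset = False }

    -- Change #1: bang patterns keep the counters strict, so no chain
    -- of (n + 1) thunks builds up over 25 million iterations.
    let loop1 !n = do
            topEntity inp outp
            OUTPUT{..} <- peek outp
            let n' = n + 1
            if not vgaHSYNC && not vgaVSYNC then loop2 n' else loop1 n'

        loop2 !n = do
            topEntity inp outp
            out <- peek outp
            let n' = n + 1
            if vgaDE out then return n' else loop2 n'

        loop3 !k !n
            | k < 60 = do
                n' <- loop1 n
                loop3 (k + 1) n'
            | otherwise = return n

    n <- loop3 (0 :: Int) (0 :: Int)
    printf "%d cycles" n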

Related

Sorting speeds in C and Julia

I'm working on developing sorting algorithms and was surprised to find C's qsort taking 1.6x as long as Julia's default sorting algorithm. I imagine I'm making some sort of benchmarking mistake. Here are my benchmarking programs and their results:
Julia:
# time (julia bench.jl)
using Printf

function main()
    len = 100_000_000
    x = rand(Int64, len)
    t = @elapsed sort!(x)
    @printf "%d elements:\nclaim\t%fs" len t
end

main()
C:
// time (gcc -O3 bench.c && ./a.out)
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

int comp (const void * elem1, const void * elem2)
{
    int f = *((int*)elem1);
    int s = *((int*)elem2);
    if (f > s) return 1;
    if (f < s) return -1;
    return 0;
}

long long utime()
{
    struct timeval now_time;
    gettimeofday(&now_time, NULL);
    return now_time.tv_sec * 1000000LL + now_time.tv_usec;
}

int main(int argc, char* argv[])
{
    long length = 100000000;
    long long *x;

    x = (long long *) malloc(length * sizeof(long long));
    if (x == NULL)
    {
        printf("Malloc failed\n");
        return 1;
    }

    for (long cnt = 0 ; cnt < length ; cnt++)
        x[cnt] = rand();

    long long start = utime();
    qsort (x, length, sizeof(*x), comp);
    long long end = utime();

    //for (long cnt = 0 ; cnt < length ; cnt += length/10)
    //    printf("%lld\n", x[cnt]);

    free(x);
    printf ("%ld elements:\nclaim\t%fs", length, (end-start)/1000000.0);
    return 0;
}
Results
bash-3.2$ time (julia bench.jl)
100000000 elements:
claim 12.405531s
real 0m16.560s
user 0m13.883s
sys 0m1.297s
bash-3.2$ time (gcc -O3 bench.c && ./a.out)
100000000 elements:
claim 20.592641s
real 0m24.604s
user 0m21.352s
sys 0m2.479s
Is it true that Julia's algorithm (median-of-3 quicksort with an insertion sort base case for fewer than 20 elements) is substantially faster than C's qsort? Can I sort faster than qsort in C?
It's easy to sort faster than C's qsort: you could, for example, use C++'s std::sort. The C++ library is not faster because it uses a better algorithm; rather, C++'s generics let the compiler inline the comparison, avoiding the overhead of calling the comparison function through a function pointer, as well as the smaller overhead of qsort's swap, which must handle elements of arbitrary size.
In the following, the only difference between sortbench-c and sortbench-cc is the use of std::sort in the latter:
$ diff sortbench-c.c sortbench-cc.cc
1c1
< // time (gcc -O3 sortbench-c.c && ./a.out)
---
> // time (gcc -O3 sortbench-cc.cc && ./a.out)
2a3
> #include <algorithm>
7,14d7
< int comp (const void * elem1, const void * elem2)
< {
< int f = *((int*)elem1);
< int s = *((int*)elem2);
< if (f > s) return 1;
< if (f < s) return -1;
< return 0;
< }
38c31
< qsort (x, length, sizeof(*x), comp);
---
> std::sort(x, x+length);
The difference is dramatic:
$ time (gcc -O3 sortbench-c.c && ./a.out)
100000000 elements:
claim 16.673827s
real 0m17.774s
user 0m17.387s
sys 0m0.379s
$ time (gcc -O3 sortbench-cc.cc && ./a.out)
100000000 elements:
claim 9.948971s
real 0m11.133s
user 0m10.926s
sys 0m0.204s
There is no performance guarantee for qsort:
Despite the name, neither C nor POSIX standards require this function to be implemented using quicksort or make any complexity or stability guarantees.
To do a proper sorting benchmark between Julia and C, you will need another implementation.
The problem is that the rand functions are [probably] different.
Quicksort is data/order dependent. For example, mergesort will always execute in the same amount of time, regardless of what data it is sorting.
However, quicksort's time will vary depending upon the data.
To do a proper benchmark, do not use rand unless you write it yourself or can guarantee that Julia's version and libc's version behave exactly the same.
I'd write an initialization function for both languages. For example, the requisite for (i = 0; i < length; ++i) array[i] = length - i; or some such, so that the initial data is guaranteed to be the same.
You can use a random function if you have one program generate the array and save it to a file. The other program can then read in the [same] data.
Sometimes, I write a separate program that generates the input data, and saves it to a file. Then, I pass that file off to both programs. This decouples the test data generation from the programs under test.
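The idea is language-agnostic; as a sketch (in Haskell, the lead language of this compilation; the file name, element count, and overall shape are mine, and it needs the random package), a standalone generator might write a binary file of random 64-bit integers that both benchmark programs then read:

-- Hypothetical generator: write N random Int64s to a binary file so
-- that both programs under test sort exactly the same data.
import qualified Data.ByteString.Builder as B
import System.IO (withFile, hSetBinaryMode, IOMode(WriteMode))
import System.Random (randomIO)
import Data.Int (Int64)
import Control.Monad (replicateM)

main :: IO ()
main = do
    xs <- replicateM 1000000 (randomIO :: IO Int64)
    withFile "bench-input.bin" WriteMode $ \h -> do
        hSetBinaryMode h True
        B.hPutBuilder h (foldMap B.int64LE xs)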

Passing an R array to a C function with "NA"

I work in R using C libraries. I need to pass a C function an array of numbers between 1 and 10, some of which could also be "NA". Then in C, depending on the value, I need to set the output.
Here's a simplified version of the code:
dyn.load("ranking.so")

fun <- function(ranking) {
    nrak <- length(ranking)
    out <- .C("ranking", as.integer(nrak), as.character(ranking), rr = as.integer(vector("integer", nrak)))
    out$rr
}

ranking <- sample(c(NA, seq(1, 10)), 10, replace = TRUE)
rr <- fun(ranking)
The C function could be as simple as:
#include <R.h>

void ranking(int *nrak, char *ranking, int *rr) {
    int i;
    for (i = 0; i < *nrak; i++) {
        if (ranking[i] == 'NA')
            rr[i] = 1;
        else
            rr[i] = (int) strtol(&ranking[i], (char **)NULL, 10);
    }
}
Due to the "NA" value I declared ranking as character, but maybe there's another way to do this, using integer and without replacing "NA" with 0 before calling the function?
(The code as it stands always gives me an array of zeros...)
Test for whether the value is an NA using R_NaInt, like
#include <R.h>

void ranking_c(int *nrak, int *ranking, int *rr) {
    for (int i = 0; i < *nrak; i++)
        rr[i] = R_NaInt == ranking[i] ? -1 : ranking[i];
}
Invoke from R by explicitly allowing NAs
> x = c(1:2, NA_integer_)
> .C("ranking_c", length(x), as.integer(x), integer(length(x)), NAOK=TRUE)[[3]]
[1] 1 2 -1
Alternatively, use R's .Call() interface. Each R object is represented as an S-expression. There are C-level functions to manipulate S-expressions, e.g., length Rf_length(), data access INTEGER(), and allocation Rf_allocVector() of different types of S-expressions such as INTSXP for integer vectors.
R memory management uses a garbage collector that can run on any call that allocates memory. It is therefore best practice to PROTECT() any R allocation while in scope.
Your function will accept 0 or more S-expressions as input, and return a single S-expression; it might be implemented as
#include <Rinternals.h>
#include <R_ext/Arith.h>

SEXP ranking_call(SEXP ranking)
{
    /* allocate space for result, PROTECTing from garbage collection */
    SEXP result = PROTECT(Rf_allocVector(INTSXP, Rf_length(ranking)));

    /* assign result */
    for (int i = 0; i < Rf_length(ranking); ++i)
        INTEGER(result)[i] =
            R_NaInt == INTEGER(ranking)[i] ? -1 : INTEGER(ranking)[i];

    UNPROTECT(1); /* no more need to protect */
    return result;
}
And invoked from R with .Call("ranking_call", as.integer(ranking)).
Using .Call is more efficient than .C in terms of speed and memory allocation (.C may copy atomic vectors on the way in), but the primary reason to use it is for the flexibility it offers in terms of working directly with R's data structures. This is especially important when the return values are more complicated than atomic vectors.
You are attempting to address a couple of delicate and non-trivial points, not least how to compile code with R and how to test for non-finite values.
You asked for help with C. I would like to suggest C++ -- which you do not need to use in a complicated way. Consider this short file, which contains a function that processes a vector along the lines you suggest: it tests for NA and then either assigns 42 as a marker (for simplicity) or else squares the value:
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector foo(NumericVector x) {
    unsigned int n = x.size();
    for (unsigned int i = 0; i < n; i++)
        if (NumericVector::is_na(x[i]))
            x[i] = 42.0;
        else
            x[i] = pow(x[i], 2);
    return x;
}

/*** R
foo( c(1, 3, NA, NaN, 6) )
*/
If I save this on my box as /tmp/foo.cpp, then in order to compile, link, load, and even run the embedded R usage example, I only need one line to call sourceCpp():
R> Rcpp::sourceCpp("/tmp/foo.cpp")
R> foo( c(1, 3, NA, NaN, 6))
[1] 1 9 42 42 36
R>
We can do the same with integers:
// [[Rcpp::export]]
IntegerVector bar(IntegerVector x) {
    unsigned int n = x.size();
    for (unsigned int i = 0; i < n; i++)
        if (IntegerVector::is_na(x[i]))
            x[i] = 42;
        else
            x[i] = pow(x[i], 2);
    return x;
}

Performance of changing values of STUArray in bulk

I am implementing a topological sort in Haskell with the requirement to be as efficient as possible. I have profiled my current solution and found that the following method is taking 60% of total time (and allocating no additional space):
import Control.Monad.ST
import Control.Monad
import Data.Array.ST
import Data.Array.Unboxed
import Data.Word
import Data.Array.Base

zeroElementsAfterDecrement' :: (MArray a e m, Num e, Eq e) => a Int e -> [Int] -> m [Int]
zeroElementsAfterDecrement' arr is = foldr k (return []) is
  where
    k i a = do
        xs <- a
        decremented <- liftM (subtract 1) (unsafeRead arr i)
        unsafeWrite arr i decremented
        if decremented == 0 then return (i:xs) else return xs

largenum :: Int
largenum = 10000000

test = runST $ do
    arr <- newArray (1, largenum) 100 :: ST s (STUArray s Int Word32)
    zeroElementsAfterDecrement' arr [1..largenum]

main = (putStrLn . show) test
The function takes an array (I use unboxed mutable arrays) and a list of indexes, decrements the elements at these indexes, and returns the indexes of elements that became zero during this operation. Right now this is more than 10 times slower than the optimized C++ code, but still pretty good compared to Python (or maybe I don't know the Python way to optimize this). I understand there is an overhead from executing monadic code, but maybe there are still ways to optimize that I am not aware of?
Edit:
GHC: -O -fllvm: 0.54s
GHC (with unsafeWrite/unsafeRead and Word32): 0.34s
g++: 0.24s
g++ -O2: 0.05s
python3: 2.66s
Also, when I change foldr to foldl' it starts allocating some memory and is 4 times slower as a result; why is that?
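(An aside on that last point: the foldr version builds the whole chain of monadic actions before running it. A foldM-based variant with a strict accumulator is another way to express the same loop; this is my sketch, not code from the question, and untimed:)

{-# LANGUAGE BangPatterns #-}
-- Sketch: accumulate the zero-indexes strictly with foldM (from
-- Control.Monad, already imported above), reversing once at the end
-- to keep the ascending order of the foldr version.
zeroElementsAfterDecrement'' :: (MArray a e m, Num e, Eq e) => a Int e -> [Int] -> m [Int]
zeroElementsAfterDecrement'' arr is = reverse `fmap` foldM k [] is
  where
    k !xs i = do
        decremented <- liftM (subtract 1) (unsafeRead arr i)
        unsafeWrite arr i decremented
        return (if decremented == 0 then i : xs else xs)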
Here is a C++ version I compared it to:
#include <iostream>
#include <vector>

using namespace std;

#define LARGENUM 10000000

int main()
{
    vector <int> arr;
    for (int i = 0; i < LARGENUM; i++) {
        arr.push_back(100);
    }
    for (int i = 0; i < arr.size(); i++) {
        arr[i]--;
        if (arr[i] == 0)
            cout << i << endl;
    }
    return 0;
}
And a Python version:
arr = [100] * 10000000
for x in range(0, 10000000 - 1):
    arr[x] = arr[x] - 1
    if arr[x] == 0:
        print(x)

Comparing speed of Haskell and C for the computation of primes

I initially wrote this (brute force and inefficient) method of calculating primes with the intent of making sure that there was no difference in speed between using if-then-else versus guards in Haskell (and there is no difference!). But then I decided to write a C program to compare, and I got the following (Haskell slower by just over 25%):
(Note: I got the ideas of using rem instead of mod and also the -O3 option in the compiler invocation from the following post: On improving Haskell's performance compared to C in fibonacci micro-benchmark)
Haskell: Forum.hs
divisibleRec :: Int -> Int -> Bool
divisibleRec i j
  | j == 1         = False
  | i `rem` j == 0 = True
  | otherwise      = divisibleRec i (j-1)

divisible :: Int -> Bool
divisible i = divisibleRec i (i-1)

r = [ x | x <- [2..200000], divisible x == False]

main :: IO ()
main = print (length r)
C: main.cpp
#include <stdio.h>

bool divisibleRec(int i, int j){
    if(j==1){ return false; }
    else if(i%j==0){ return true; }
    else{ return divisibleRec(i,j-1); }
}

bool divisible(int i){ return divisibleRec(i, i-1); }

int main(void){
    int i, count = 0;
    for(i=2; i<200000; ++i){
        if(divisible(i)==false){
            count = count+1;
        }
    }
    printf("number of primes = %d\n",count);
    return 0;
}
The results I got were as follows:
Compilation times
time (ghc -O3 -o runProg Forum.hs)
real 0m0.355s
user 0m0.252s
sys 0m0.040s
time (gcc -O3 -o runProg main.cpp)
real 0m0.070s
user 0m0.036s
sys 0m0.008s
and the following running times:
Running times on Ubuntu 32 bit
Haskell
17984
real 0m54.498s
user 0m51.363s
sys 0m0.140s
C++
number of primes = 17984
real 0m41.739s
user 0m39.642s
sys 0m0.080s
I was quite impressed with the running times of Haskell. However, my question is this: can I do anything to speed up the Haskell program without:
Changing the underlying algorithm (it is clear that massive speedups can be gained by changing the algorithm; I just want to understand what I can do on the language/compiler side to improve performance)
Invoking the LLVM compiler (because I don't have it installed)
[EDIT: Memory usage]
After a comment by Alan I noticed that the C program uses a constant amount of memory where as the Haskell program slowly grows in memory size. At first I thought this had something to do with recursion, but gspr explains below why this is happening and provides a solution. Will Ness provides an alternative solution which (like gspr's solution) also ensures that the memory remains static.
[EDIT: Summary of bigger runs]
Max number tested 200,000: 54.498s/41.739s = Haskell 30.5% slower
Max number tested 400,000: 3m31.372s/2m45.076s = 211.37s/165s = Haskell 28.1% slower
Max number tested 800,000: 14m3.266s/11m6.024s = 843.27s/666.02s = Haskell 26.6% slower
[EDIT: Code for Alan]
This is the code that I had written earlier, which does not have recursion and which I had tested on 200,000:
#include <stdio.h>

bool divisibleRec(int i, int j){
    while(j>0){
        if(j==1){ return false; }
        else if(i%j==0){ return true; }
        else{ j -= 1; }
    }
}

bool divisible(int i){ return divisibleRec(i, i-1); }

int main(void){
    int i, count = 0;
    for(i=2; i<8000000; ++i){
        if(divisible(i)==false){
            count = count+1;
        }
    }
    printf("number of primes = %d\n",count);
    return 0;
}
The results for the C code with and without recursion are as follows (for 800,000):
With recursion: 11m6.024s
Without recursion: 11m5.328s
Note that the executable seems to take up 60 kB (as seen in the System Monitor) irrespective of the maximum number, and therefore I suspect that the compiler is detecting this recursion and turning it into a loop.
This isn't really answering your question, but rather what you asked in a comment regarding growing memory usage when the number 200000 grows.
When that number grows, so does the list r. Your code needs all of r at the very end, to compute its length. The C code, on the other hand, just increments a counter. You'll have to do something similar in Haskell too if you want constant memory usage. The code will still be very Haskelly, and in general it's a sensible proposition: you don't really need the list of numbers for which divisible is False, you just need to know how many there are.
You can try with
import Data.List (foldl')

main :: IO ()
main = print $ foldl' (\s x -> if divisible x then s else s + 1) 0 [2..200000]
(foldl' is a stricter foldl from Data.List that avoids thunks being built up).
Well, bang patterns give you a very small win (as does LLVM, but you seem to have expected that):
{-# LANGUAGE BangPatterns #-}
divisibleRec !i !j | j == 1 = False
And on my x86-64 I get a very big win by switching to smaller representations, such as Word32:
divisibleRec :: Word32 -> Word32 -> Bool
...
divisible :: Word32 -> Bool
My timings:
$ time ./so -- Int
2262
real 0m2.332s
$ time ./so -- Word32
2262
real 0m1.424s
This is a closer match to your C program, which is only using int. It still doesn't match performance-wise; I suspect we'd have to look at the Core to figure out why.
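Filling in the parts elided above, the complete Word32 version might read as follows (my reconstruction; the bodies are unchanged from the Int version):

{-# LANGUAGE BangPatterns #-}
import Data.Word (Word32)

divisibleRec :: Word32 -> Word32 -> Bool
divisibleRec !i !j
    | j == 1         = False
    | i `rem` j == 0 = True
    | otherwise      = divisibleRec i (j - 1)

divisible :: Word32 -> Bool
divisible i = divisibleRec i (i - 1)

main :: IO ()
main = print (length [x | x <- [2 .. 200000 :: Word32], not (divisible x)])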
EDIT: and the memory use, as was already noted, comes from the named list r. I just inlined r, made it output a 1 for each non-divisible value, and took the sum:
main = print $ sum $ [ 1 | x <- [2..800000], not (divisible x) ]
Another way to write down your algorithm is
main = print $ length [()|x<-[2..200000], and [rem x d>0|d<-[x-1,x-2..2]]]
Unfortunately, it runs slower. Using all ((>0).rem x) [x-1,x-2..2] as a test, it runs slower still. But maybe you'd test it on your setup nevertheless.
Replacing your code with an explicit loop with bang patterns made no difference whatsoever:
{-# OPTIONS_GHC -XBangPatterns #-}

r4 :: Int -> Int
r4 n = go 0 2
  where
    go !c i | i > n = c
            | True  = go (if not (divisible i) then (c+1) else c) (i+1)

divisibleRec :: Int -> Int -> Bool
divisibleRec i !j | j == 1         = False
                  | i `rem` j == 0 = True
                  | otherwise      = divisibleRec i (j-1)
When I started programming in Haskell I was also impressed about its speed. You may be interested in reading point 5 "The speed of Haskell" of this article.

Fast way to check if an array of chars is zero [duplicate]

This question already has answers here:
Faster approach to checking for an all-zero buffer in C?
(19 answers)
Closed 4 years ago.
I have an array of bytes, in memory. What's the fastest way to see if all the bytes in the array are zero?
Nowadays, short of using SIMD extensions (such as SSE on x86 processors), you might as well iterate over the array and compare each value to 0.
In the distant past, performing a comparison and conditional branch for each element in the array (in addition to the loop branch itself) would have been deemed expensive and, depending on how often (or early) you could expect a non-zero element to appear in the array, you might have elected to completely do without conditionals inside the loop, using solely bitwise-or to detect any set bits and deferring the actual check until after the loop completes:
int sum = 0;
for (i = 0; i < ARRAY_SIZE; ++i) {
    sum |= array[i];
}
if (sum != 0) {
    printf("At least one array element is non-zero\n");
}
However, with today's pipelined, super-scalar processor designs, complete with branch prediction, all non-SSE approaches are virtually indistinguishable within a loop. If anything, comparing each element to zero and breaking out of the loop early (as soon as the first non-zero element is encountered) could, in the long run, be more efficient than the sum |= array[i] approach (which always traverses the entire array), unless, that is, you expect your array to be almost always made up exclusively of zeroes (in which case making the sum |= array[i] approach truly branchless by using GCC's -funroll-loops could give you the better numbers -- see the numbers below for an Athlon processor; results may vary with processor model and manufacturer).
#include <stdio.h>

int a[1024*1024];

/* Methods 1 & 2 are equivalent on x86 */

int main() {
    int i, j, n;
#if defined METHOD3
    int x;
#endif
    for (i = 0; i < 100; ++i) {
#if defined METHOD3
        x = 0;
#endif
        for (j = 0, n = 0; j < sizeof(a)/sizeof(a[0]); ++j) {
#if defined METHOD1
            if (a[j] != 0) { n = 1; }
#elif defined METHOD2
            n |= (a[j] != 0);
#elif defined METHOD3
            x |= a[j];
#endif
        }
#if defined METHOD3
        n = (x != 0);
#endif
        printf("%d\n", n);
    }
}
$ uname -mp
i686 athlon
$ gcc -g -O3 -DMETHOD1 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 test.c
$ time ./a.out
real 0m0.377s
user 0m0.372s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 test.c
$ time ./a.out
real 0m0.376s
user 0m0.373s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD1 -funroll-loops test.c
$ time ./a.out
real 0m0.351s
user 0m0.348s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD2 -funroll-loops test.c
$ time ./a.out
real 0m0.343s
user 0m0.340s
sys 0m0.003s
$ gcc -g -O3 -DMETHOD3 -funroll-loops test.c
$ time ./a.out
real 0m0.209s
user 0m0.206s
sys 0m0.003s
Here's a short, quick solution, if you're okay with using inline assembly.
#include <stdio.h>

int main(void) {
    int checkzero(char *string, int length);
    char str1[] = "wow this is not zero!";
    char str2[] = {0, 0, 0, 0, 0, 0, 0, 0};
    printf("%d\n", checkzero(str1, sizeof(str1)));
    printf("%d\n", checkzero(str2, sizeof(str2)));
}

int checkzero(char *string, int length) {
    int is_zero;
    __asm__ (
        "cld\n"
        "xorb %%al, %%al\n"
        "repz scasb\n"
        : "=c" (is_zero)
        : "c" (length), "D" (string)
        : "eax", "cc"
    );
    return !is_zero;
}
In case you're unfamiliar with assembly, I'll explain what we do here: we store the length of the string in a register, and ask the processor to scan the string for a zero (we specify this by setting the lower 8 bits of the accumulator, namely %%al, to zero), reducing the value of said register on each iteration, until a non-zero byte is encountered. Now, if the string was all zeroes, the register, too, will be zero, since it was decremented length number of times. However, if a non-zero value was encountered, the "loop" that checked for zeroes terminated prematurely, and hence the register will not be zero. We then obtain the value of that register, and return its boolean negation.
Profiling this yielded the following results:
$ time or.exe
real 0m37.274s
user 0m0.015s
sys 0m0.000s
$ time scasb.exe
real 0m15.951s
user 0m0.000s
sys 0m0.046s
(Both test cases ran 100000 times on arrays of size 100000. The or.exe code comes from Vlad's answer. Function calls were eliminated in both cases.)
If you want to do this in 32-bit C, probably just loop over the array as a 32-bit integer array and compare it to 0, then make sure the stuff at the end is also 0.
Split the checked memory in half, and compare the first part to the second.
a. If there is any difference, it can't be all the same.
b. If there is no difference, repeat for the first half.
Worst case 2*N. Memory efficient and memcmp based.
Not sure if it should be used in real life, but I liked the self-compare idea.
It works for odd length. Do you see why? :-)
#include <stdbool.h>
#include <string.h>

bool memcheck(char* p, char chr, size_t size) {
    // Check if the first char differs from the expected value.
    if (*p != chr)
        return false;

    int near_half, far_half;
    while (size > 1) {
        near_half = size / 2;
        far_half = size - near_half;
        if (memcmp(p, p + far_half, near_half))
            return false;
        size = far_half;
    }
    return true;
}
If the array is of any decent size, your limiting factor on a modern CPU is going to be access to the memory.
Make sure to use cache prefetching for a decent distance ahead (i.e. 1-2K) with something like __dcbt or prefetchnta (or prefetcht0 if you are going to use the buffer again soon).
You will also want to do something like SIMD or SWAR to OR multiple bytes at a time. Even with 32-bit words, it will be 4X fewer operations than a per-character version. I'd recommend unrolling the ORs and making them feed into a "tree" of ORs. You can see what I mean in my code example - this takes advantage of superscalar capability to do two integer ops (the ORs) in parallel by making use of ops that do not have as many intermediate data dependencies. I use a tree size of 8 (4x4, then 2x2, then 1x1) but you can expand that to a larger number depending on how many free registers you have in your CPU architecture.
The following example for the inner loop (wrapped as a function; no prolog/epilog, alignment handling, or prefetching) uses 32-bit ints, but you could do 64/128-bit with MMX/SSE or whatever is available to you. This will be fairly fast if you have prefetched the block into the cache. Also, you will possibly need to do an unaligned check before the loop if your buffer is not 4-byte aligned, and after it if your buffer (after alignment) is not a multiple of 32 bytes in length.
#include <stddef.h>

typedef unsigned int UINT32;

/* Returns nonzero if the aligned buffer contains only zero bytes
   (any tail shorter than 32 bytes must be checked separately). */
int all_zeros(const UINT32 *pmem, size_t bytesremain)
{
    UINT32 a0 = 0, a1, a2, a3;
    while (bytesremain >= 32)
    {
        /* Compare an aligned "line" of 32 bytes */
        a0 = pmem[0] | pmem[1];
        a1 = pmem[2] | pmem[3];
        a2 = pmem[4] | pmem[5];
        a3 = pmem[6] | pmem[7];
        a0 |= a1; a2 |= a3;
        pmem += 8;
        a0 |= a2;
        bytesremain -= 32;
        if (a0 != 0) break;
    }
    return a0 == 0;
}
I would actually suggest encapsulating the compare of a "line" of values into a single function and then unrolling that a couple times with the cache prefetching.
Measured two implementations on ARM64, one using a loop with early return on false, one that ORs all bytes:
int is_empty1(unsigned char * buf, int size)
{
    int i;
    for(i = 0; i < size; i++) {
        if(buf[i] != 0) return 0;
    }
    return 1;
}

int is_empty2(unsigned char * buf, int size)
{
    int sum = 0;
    for(int i = 0; i < size; i++) {
        sum |= buf[i];
    }
    return sum == 0;
}
Results:
All results, in microseconds:
          is_empty1   is_empty2
MEDIAN    0.350       3.554
AVG       1.636       3.768

Only false results:
          is_empty1   is_empty2
MEDIAN    0.003       3.560
AVG       0.382       3.777

Only true results:
          is_empty1   is_empty2
MEDIAN    3.649       3.528
AVG       3.857       3.751
Summary: only for datasets where the probability of false results is very small does the second, ORing algorithm perform better, thanks to the omitted branch. Otherwise, returning early is clearly the winning strategy.
Rusty Russell's memeqzero is very fast. It checks the first few bytes by hand, then reuses memcmp to do the heavy lifting, comparing the known-zero head of the buffer against the remainder:
https://github.com/rustyrussell/ccan/blob/master/ccan/mem/mem.c#L92
