I'm trying to optimise an existing Unicode collation library (written in Erlang) by rewriting it as a NIF implementation. The prime reason is that collation is a CPU-intensive operation.
Link to implementation: https://github.com/abhi-bit/merger
Unicode collation of 1M rows via a pure-Erlang priority queue:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms
Unicode collation of 1M rows via a NIF-based binomial heap:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms
This is surprising; I was expecting it to be roughly 10x faster.
I turned on eprof/fprof, but they aren't of much use when it comes to NIF modules. Below is what eprof said about the prominent functions:
FUNCTION          CALLS      %     TIME   [uS / CALLS]
--------          -----    ---     ----   [----------]
merger:new/0          1   0.00        0   [      0.00]
merger:new/2          1   0.00        0   [      0.00]
merger:size/1    100002   0.31    19928   [      0.20]
merger:in/3      100000   3.29   210620   [      2.11]
erlang:put/2    2000000   6.63   424292   [      0.21]
merger:out/1     100000  14.35   918834   [      9.19]
I'm sure the NIF implementation could be made faster, because I have a pure C implementation of Unicode collation based on a binary heap backed by a dynamic array, and that is much, much faster.
$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include -L/usr/local/Cellar/icu4c/55.1/lib min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms
Specific questions I have here:
How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, the slowdown is 30x or more between the pure C and the NIF implementation.
What tools could be useful to debug a NIF-related slowdown like this? I tried using perf top to see the function calls; the top entries (some showing only hex addresses) were coming from "beam.smp".
What are the possible areas I should look at when optimising a NIF? For example, I've heard that one should keep the data transferred between Erlang and C (and vice versa) minimal; are there more such areas to consider?
The overhead of calling a NIF is tiny. When the Erlang runtime loads a module that loads a NIF, it patches the module's beam code with an emulator instruction to call into the NIF. The instruction itself performs just a small amount of setup prior to calling the C function implementing the NIF. This is not the area that's causing your performance issues.
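For reference, the structure of a NIF module is minimal; the emulator instruction mentioned above does little more than set up the arguments before jumping into the C function. A bare-bones example (module and function names here are placeholders, not taken from merger):

#include "erl_nif.h"

/* The C function the emulator jumps to; the per-call setup done by the VM
 * before control reaches this point is only a handful of instructions. */
static ERL_NIF_TERM do_work(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
{
    int x;
    if (!enif_get_int(env, argv[0], &x))
        return enif_make_badarg(env);
    return enif_make_int(env, x + 1);
}

static ErlNifFunc nif_funcs[] = {
    {"do_work", 1, do_work}
};

ERL_NIF_INIT(my_nif, nif_funcs, NULL, NULL, NULL, NULL)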
Profiling a NIF is much the same as profiling any other C/C++ code. Judging from your Makefile it appears you're developing this code on OS X. On that platform, assuming you have XCode installed, you can use the Instruments application with the CPU Samples instrument to see where your code is spending most of its time. On Linux, you can use the callgrind tool of valgrind together with an Erlang emulator built with valgrind support to measure your code.
What you'll find if you use these tools on your code is, for example, that perf_merger:main/1 spends most of its time in merger_nif_heap_get, which in turn spends a noticeable amount of time in CollateJSON. That function seems to call convertUTF8toUChar and createStringFromJSON quite a bit. Your NIF also seems to perform a lot of memory allocation. These are the areas you should focus on to speed up your code.
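One concrete direction, assuming the NIF currently creates an ICU collator (or similar scratch state) on every call — the merger code may already avoid this — is to create the collator once in the NIF load callback and fetch it with enif_priv_data() inside each call, so the per-call work is limited to the comparison itself. A minimal sketch:

#include "erl_nif.h"
#include <unicode/ucol.h>

/* Registered via ERL_NIF_INIT(..., load, NULL, NULL, unload). */
static int load(ErlNifEnv* env, void** priv_data, ERL_NIF_TERM load_info)
{
    UErrorCode status = U_ZERO_ERROR;
    UCollator* coll = ucol_open("", &status);  /* open once, not per call */
    if (U_FAILURE(status))
        return -1;
    *priv_data = coll;
    return 0;
}

static void unload(ErlNifEnv* env, void* priv_data)
{
    ucol_close((UCollator*) priv_data);
}

/* Inside each NIF call:
 *   UCollator* coll = (UCollator*) enif_priv_data(env);
 */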
Related
I've been working on a portable C library that does image processing.
I've invested quite some time in a couple of low-level functions so as to take advantage of GCC auto-vectorization (SSE and/or AVX depending on the target processor) while still preserving somewhat portable C code (extensions used: restrict and __builtin_assume_aligned).
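For illustration, the kind of function I mean looks roughly like this (a simplified placeholder, not my actual code):

#include <stddef.h>

/* GCC can auto-vectorize this loop: restrict promises the buffers don't alias,
 * and __builtin_assume_aligned tells the compiler about their alignment. */
void scale_add(float* restrict dst, const float* restrict src,
               float k, size_t n)
{
    float* d = __builtin_assume_aligned(dst, 32);
    const float* s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; ++i)
        d[i] += k * s[i];
}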
Now it is time to test the code on Windows (MSVC compiler). But before that I'd like to set up some kind of unit testing so as not to shoot myself in the foot and lose all my carefully chosen constructs that preserve GCC auto-vectorization.
I could simply #ifdef/#endif the whole function body, but I am thinking of a longer-term solution that would detect any regression upon compiler update(s).
I am fairly confident with unit testing (there are tons of good frameworks out there), but I am a lot less confident with unit testing of such low-level functionality. How does one integrate performance unit testing into a CI service such as Jenkins?
PS: I'd like to avoid storing hard-coded timing results based on a particular processor, e.g.:
struct timeval t1, t2;

// start timer:
gettimeofday(&t1, NULL);
// call optimized function:
...
// stop timer:
gettimeofday(&t2, NULL);
// hard-code some magic number:
if (t2.tv_sec - t1.tv_sec > 42) return EXIT_FAILURE;
Your problem basically boils down to two parts:
What's the best way to performance benchmark your carefully optimized code?
How to compare the results of those benchmarks so you can detect whether code changes and/or compiler updates have affected the performance of your code
The Google Benchmark framework might provide a reasonable approach to problem #1. It is C++, but that wouldn't stop you from calling your C functions from it.
This library can produce summary reports in various formats, including JSON and good old CSV. You could arrange for these to be stored somewhere per run.
You could then write a simple perl/python/etc script to compare the results of the benchmarks and raise the alarm if they deviate by more than some threshold.
One thing you will have to be careful about is the potential for noise in your results caused by variables such as load on the system performing the test. You didn't say much about the environment you are running the tests in, but if it is (for example) a VM on a host containing other VMs then your test results may be skewed by whatever is going on in the other VMs.
CI frameworks such as Jenkins allow you to script up the actions to be taken when running tests, so it should be relatively easy to integrate this approach into such frameworks.
A way to measure the performance in a simple and repeatable way would be to run a benchmarking unit test through valgrind/callgrind. That will give you a number of metrics: CPU cycles, Instruction and Data read and write transactions (at different cache depths), bus-blocking transactions, etc. You would only have to check those values against a known-good starting value.
Valgrind is repeatable because it emulates the running of the code. It is of course (much) slower than directly running the code, but that makes it independent of system load, etc.
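As a sketch of what such a benchmarking unit test could look like (run it under valgrind --tool=callgrind and compare the reported instruction count against a stored baseline; scale_add here is a hypothetical stand-in for the routine under test):

#include <stdio.h>

/* Stand-in for the optimized routine being benchmarked. */
static void scale_add(float* dst, const float* src, float k, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        dst[i] += k * src[i];
}

int main(void)
{
    static float a[4096], b[4096];
    /* Repeat enough times that the routine dominates the instruction count
     * callgrind reports. */
    for (int i = 0; i < 10000; ++i)
        scale_add(a, b, 1.5f, 4096);
    puts("done");
    return 0;
}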
Where Valgrind is not available, as on Windows (though there are mentions of running Windows programs under valgrind + Wine on Linux), DynamoRIO is an option. It provides tools similar to Valgrind's, like an instruction counter and a memory and cache usage analyzer. (It is also available on Linux, and seemingly half-ported to OS X as of this writing.)
How do I use ftrace() (or anything else) to trace a specific, user-defined function in the Linux kernel? I'm trying to create and run some microbenchmarks, so I'd like to have the time it takes certain functions to run. I've read through (at least as much as I can) the documentation, but a step in the right direction would be awesome.
I'm leaning towards ftrace(), but having issues getting it to work on Ubuntu 14.04.
Here are a couple of options you may have depending on the version of the kernel you are on:
Systemtap - this is the ideal way; check the examples that ship with stap, you may find something that works with minimal modifications.
Oprofile - an option if you are using older versions of the kernel; stap gives better precision compared to oprofile.
debugfs with the stack tracer option - good for stack overrun debugging. To do this you need to turn on the depth-checking functions by mounting debugfs and then echoing 1 > /proc/sys/kernel/stack_tracer_enabled.
strace - useful if you want to identify the system calls made by the user-space program and get some performance numbers; use strace -fc <program name>.
Hope this helps!
Ftrace is a good option and has good documentation.
Use WARN_ON(); it will print a stack trace showing how the function was called.
For time tracing, I think you should use the timestamps shown in the kernel log, or use the jiffies counter.
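As a sketch of the timestamp approach (kernel code, assuming you can modify the code around the function you want to time; my_function is a placeholder):

#include <linux/ktime.h>
#include <linux/printk.h>

extern void my_function(void);  /* placeholder for the function being timed */

static void timed_call(void)
{
    u64 t0 = ktime_get_ns();
    my_function();
    /* The delta appears in dmesg next to the usual kernel-log timestamp. */
    pr_info("my_function took %llu ns\n", ktime_get_ns() - t0);
}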
Systemtap will also be useful in your situation. Systemtap is a tool in which you write code much as in a scripting language. It is very powerful, but if you only want to know the execution time of a particular function, ftrace is better; if you need a more advanced tool to analyze, e.g., performance problems in kernel space, Systemtap may be very helpful.
Please read more; what you want to do is covered in section 5.2, "Timing function execution times".
If the function's execution time is interesting because it makes subsidiary calls to slow/blocking functions, then statement-by-statement tracing could work for you, without too much distortion due to the "probe effect" overheads of the instrumentation itself.
probe kernel.statement("function_name@dir/file.c:*") { println(tid(), " ", gettimeofday_us(), " ", pn()) }
will give you a trace of each separate statement in function_name. Deltas between adjacent statements are easily computed by hand or by a larger script. See also https://sourceware.org/systemtap/examples/#profiling/linetimes.stp
To get the precision that I needed (CPU cycles), I ended up using get_cycles(), which is essentially a wrapper for RDTSC (but portable). ftrace() may still be beneficial in the future, but all I'm doing now is taking the difference between start CPU cycles and end CPU cycles and using that as a benchmark.
Update: To avoid parallelization of instructions, I actually ended up wrapping RDTSCP instead. I couldn't use RDTSC + CPUID because that caused a lot of delays from hypercalls (I'm working in a VM).
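For reference, a cycle-count wrapper along those lines might look like this (a sketch; requires an x86 CPU with RDTSCP support):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>  /* __rdtscp */

/* RDTSCP waits for earlier instructions to complete before reading the TSC,
 * so no separate CPUID serialization (and its hypercall cost in a VM) is needed. */
static inline uint64_t cycles_now(void)
{
    unsigned int aux;
    return __rdtscp(&aux);
}

int main(void)
{
    uint64_t start = cycles_now();
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i)
        x += i * 0.5;
    uint64_t end = cycles_now();
    printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
    return 0;
}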
Use systemtap and try this script:
https://github.com/openresty/stapxx#func-latency-distr
When programming in Haskell we have the interpreter option :set +s. It prints some information about the code you ran. In GHCi, it prints the time spent running the code and the number of bytes used; in Hugs, it prints the number of reductions made by the interpreter and the number of bytes used. How can I do the same thing in C? I know how to print the time spent running my C code and the number of processor clocks it took. But what about the number of bytes and reductions? I want a good way to compare two different pieces of code that do the same thing and decide which is the most efficient for me.
Thanks.
If you want to compare performance, just compare time and memory used. Allow both programs to exploit the same number of processor cores, write equivalent programs in both languages, and run benchmarks. If you are on a Unix, time(1) is your friend.
Everything else is not relevant to performance. If a program performed 10x more function calls than another one but ran in half the time, it is still the one with the better performance.
The Benchmarks Game web site compares different languages using time/space criteria. You may wish to follow the same spirit.
For more careful profiling of portions of the programs, rather than the whole program, you can either use a profiler (in C) or turn on the profiling options (in GHC Haskell). Criterion is also a popular Haskell library to benchmark Haskell programs. Profiling is typically useful to spot the "hot points" in the code: long-running loops, frequently called functions, etc. This is useful because it allows the programmer to know where optimization is needed. For instance, if a function cumulatively runs for 0.05s, obtaining a 10x speed increase on that is far less useful than a 5% optimization on a function cumulatively running for 20 minutes (0.045s vs 60s gain).
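If you want something closer to :set +s from inside the C program itself, a rough equivalent is to report wall-clock time plus peak resident memory (there is no analogue of reductions in compiled C). A minimal sketch:

#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* ... code under measurement ... */
    volatile long sum = 0;
    for (long i = 0; i < 10000000; ++i)
        sum += i;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);  /* ru_maxrss is peak resident set size (KiB on Linux) */

    printf("%.3f s, %ld KiB peak RSS\n", secs, ru.ru_maxrss);
    return 0;
}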
I am running a simple C program which performs a lot of calculations (CFD) and hence takes a lot of time to run. However, I still have a lot of unused CPU and RAM. How can I allocate some of that spare processing power to one program?
I'm guessing that CFD means Computational Fluid Dynamics (but CFD has also a lot of other meanings, so I might guess wrong).
You definitely should first profile your code. At the very least, compile it with gcc -Wall -pg -O and learn how to use gprof. You might also use strace to find out the system calls done by your code.
I'm not an expert in CFD (even though in the previous century I did work with CFD experts), but such code uses a lot of finite element analysis and other vector computation.
If you are writing the code, you might consider using OpenMP (by carefully adding OpenMP pragmas to your source code you might speed it up), or even consider using GPGPUs by coding OpenCL kernels that run on the GPU.
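To illustrate how small the change can be when the loop iterations are independent, here is a minimal sketch (a hypothetical function, not taken from your code; compile with gcc -O2 -fopenmp):

#include <stdio.h>

/* Each iteration is independent, so OpenMP can split the range across cores. */
void axpy(double* y, const double* x, double a, long n)
{
    long i;
    #pragma omp parallel for
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}

int main(void)
{
    static double x[1000000], y[1000000];
    for (long i = 0; i < 1000000; ++i) { x[i] = i; y[i] = 1.0; }
    axpy(y, x, 2.0, 1000000);
    printf("y[42] = %f\n", y[42]);
    return 0;
}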
You could also learn more about pthreads programming and change your code to use threads.
If you are using important numerical libraries such as BLAS, note that they come with a lot of tuning and even specialized variants (e.g. multi-core, OpenMP-enabled, or even OpenCL).
In all cases, parallelizing your code is a lot of work. You'll spend weeks or months on improving it, if it is possible.
Linux doesn't keep programs waiting with the CPU idle when they need to do calculations.
Either you have a multicore CPU and a single running thread (as suggested by Pankrates), or you are blocking on some I/O.
You could nice the process with a negative increment, but you need to be superuser for that. See
man nice
This would increase the scheduling priority of the process. If it is competing with other processes for CPU time, it would get more CPU time and therefore "run faster".
As for increasing the amount of RAM used by the program: you'd need to rewrite or reconfigure the program to use more RAM. It is difficult to say more given the information available in the question.
To use multiple CPUs at once, you either need to run multiple copies of your program or run multiple threads within the program. Neither is terribly hard to get started on.
However, it's much easier to do a parallel version of "I've got 10000 large numbers, I want to find out for each of them if they are primes or not" than it is to do "lots of A = A + B" type calculations in parallel - because you need the new A before you can make the next step. CFD calculations tend to do the latter [as far as I understand it], but with large arrays. You may be able to split large vector calculations into a set of smaller vector calculations [say we have a matrix of 1000 x 1000; you could split that into 4 sets of 250 x 1000 matrices, or 4 sets of 500 x 500 matrices, and perform each of those in its own thread].
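To make the splitting idea concrete, here is a sketch (not your code) that divides a 1000 x 1000 element-wise computation among four POSIX threads, one band of rows per thread (compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NTHREADS 4

static double a[N][N], b[N][N], c[N][N];

struct slice { int row_begin, row_end; };

/* Each thread processes its own band of rows; c = a + b stands in
 * for the real per-element computation. */
static void* work(void* arg)
{
    struct slice* s = arg;
    for (int i = s->row_begin; i < s->row_end; ++i)
        for (int j = 0; j < N; ++j)
            c[i][j] = a[i][j] + b[i][j];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct slice sl[NTHREADS];
    for (int t = 0; t < NTHREADS; ++t) {
        sl[t].row_begin = t * (N / NTHREADS);
        sl[t].row_end   = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, work, &sl[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(tid[t], NULL);
    printf("c[0][0] = %f\n", c[0][0]);
    return 0;
}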
If it's your own code, then you hopefully know what it does and how it works. If it's someone elses code, then you need to talk to whoever owns the code.
There is no magical way to "automatically make use of more CPUs". 30% CPU usage on a quad-core processor probably means that your system is basically using one core, and 5% or so is overhead for other things going on in the system - or maybe there is a second thread somewhere in your application that uses a little bit of CPU doing whatever it does. Or the application is multithreaded, but doesn't use the multiple cores to their full extent because there is contention between the threads over some shared resource... It's impossible for us to say which of these three [or several other] alternatives it is.
Asking for more RAM isn't going to help unless you have something useful to put into that memory. If there is free memory, your application gets as much memory as it needs.
I'm at a loss to explain (and avoid) the differences in speed between a Matlab mex program and the corresponding C program with no Matlab interface. I've been profiling a numerical analysis program:
int main() {
    Well_optimized_code();
}
compiled with gcc 4.4 against the Matlab-Mex equivalent (directed to use gcc44, which is not the version currently supported by Matlab, but it's required for other reasons):
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    Well_optimized_code(); // literally the exact same code
}
I performed the timings as:
$ time ./C_version
vs.
>> tic; mex_version(); toc
The difference in timing is staggering. The version run from the command line takes 5.8 seconds on average. The version in Matlab runs in 21 seconds. For context, the mex file replaces an algorithm in the SimBiology toolbox that takes about 26 seconds to run.
As compared to Matlab's algorithm, both the C and mex versions scale linearly up to 27 threads using calls to OpenMP, but for the purposes of profiling these calls have been disabled and commented out.
The two versions have been compiled in the same way with the exception of the necessary flags to compile as a mex file: -fPIC --shared -lmex -DMATLAB_MEX_FILE being applied in the mex compilation/linking. I've removed all references to the left and right arguments of the mex file. That is to say it takes no inputs and gives no outputs, it is solely for profiling.
The Great and Glorious Google has informed me that the position independent code should not be the source of the slowdown and beyond that I'm at a loss.
Any help will be appreciated,
Andrew
After a month of emailing with my contacts at Mathworks, playing around with my own code, and profiling my code every which way, I have an answer; however, it may be the most dissatisfying answer I have ever had to a technical question:
The short version is "upgrade to Matlab version 2011a (officially released last week), this issue has now been resolved".
The longer version regards an issue of overhead associated with the mex gateway in versions 2010b and earlier. The best explanation that I've been able to extract is that this overhead is not assessed once; rather, we pay a little bit every time a function calls another function that is in a linked library.
While why this occurs baffles me, it is at least consistent with the SHARK profiling that I did. When I profile and compare the differences between the native app and the mex app there is a recurring pattern. The time spent in functions that are in the source code I wrote for the app does not change. The time spent in library functions increases a little when comparing between the native and mex implementations. Functions in another library used to build this library increase the difference a lot. The time difference continues to increase as we proceed ever deeper until we reach my BLAS implementation.
A couple of heavily used BLAS functions were the main culprits. A function that took ~1% of my computation time in the native app was clocking in at 30% in the mex function.
The implementation of the mex gateway appears to have changed between 2010b and 2011a. On my macbook the native app takes about 6 seconds and the mex version takes 6.5 seconds. This is overhead that I can deal with.
As for the underlying cause, I can only speculate. Matlab has its roots in interpretive coding. Since mex functions are dynamic libraries, I'm guessing that each mex library is unaware of what it is linked against until runtime. Since Matlab suggests the user rarely use mex, and then only for small computationally intensive chunks, I assume that large programs (such as an ODE solver) are rarely implemented this way. These programs, like mine, are the ones that suffer the most.
I've profiled a couple of Matlab functions that I know to be implemented in C then compiled using mex (especially sbiosimulate after calling sbioaccelerate on kinetic models, part of the SimBiology toolbox) and there appears to be some significant speed ups. So the 2011a update appears to be more broadly beneficial than the usual semi-yearly upgrade.
Best of luck to other coders with the similar issues. Thanks for all of the helpful advice that got me started in the right direction.
--Andrew
Recall that Matlab stores arrays as column major, and C/C++ as row major. Is it possible that your loop structure/algorithm is iterating in a row major fashion, resulting in poor memory access times in Matlab, but fast access times in C/C++ ?
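To illustrate the point, the two loop orders below touch the same data, but only one walks memory contiguously for a C (row-major) array; for MATLAB's column-major storage the preference is reversed (illustrative sketch, not the asker's code):

#include <stdio.h>

#define N 1000
static double m[N][N];

/* Row-major friendly in C: the rightmost index varies fastest,
 * so memory is walked contiguously. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];
    return s;
}

/* Column-major order: strided access for a C array, cache-unfriendly here,
 * but the natural order for a MATLAB array. */
double sum_col_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];
    return s;
}

int main(void)
{
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}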