Should I also measure clCreateContext() when profiling OpenCL code?

Recently I have been writing OpenCL code that processes images.
After completing the code, I need to benchmark the OpenCL code against native C (or C++) code that does the same job.
My question follows from that. Specifically, which steps should I include in the time measurement?
Most books and questions on Stack Overflow only measure the execution time of clEnqueueNDRangeKernel(), using clGetEventProfilingInfo() and clWaitForEvents().
My senior said I need to include the buffer copies (host C memory to cl_mem), since the native C code has no such step.
Then should I also include the program creation & kernel build step, the argument setting step, the *.cl source file reading step, and (the part I'm most curious about) the clCreateContext() step?
According to [this paper], clCreateContext() consumes the largest amount of time of all these steps, as shown below.
[Image: per-step timing chart from the paper, with clCreateContext() taking the largest share]
The Android OpenCL code example from Sony also measures only the elapsed time of clEnqueueNDRangeKernel(). Check here -> developer.sonymobile.com/downloads/code-example-module/opencl-code-example/
If the above is right, should I then measure only the native C code that does the same job as the OpenCL kernel code?
Or are there different perspectives on profiling and comparing OpenCL and native C code?
PLUS: My program is going to process a continuous stream of images (like video), so there will be frequent memory copies between the GPU and host memory. So I should also measure the memory copy time in both the OpenCL code and the native C code, right?

I mean, that obviously depends on what you need to measure.
Generally, if you care about the total run time of your program, measure the total runtime, including context creation.
In reality, you usually don't use OpenCL for workloads that, over the whole lifetime of a program, take less time than the context creation. If that is the case, I'd check whether using OpenCL makes sense at all. OpenCL is a single instruction, much much much data architecture. Hence, I think you might be constructing test benches with simply too little work to be done to ever get statistically meaningful data.
For example, the timers you use to measure how long something takes to execute have a finite granularity, typically multiples of microseconds. If your workload takes less than, say, 500 µs, then what you're measuring is practically unusable as a benchmark. This is a common problem in performance comparisons of all kinds!
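For illustration, here is a minimal, untested sketch of how both the transfers and the kernel could be timed with events. All identifiers (queue, kernel, buffers, sizes) are assumptions standing in for your own objects, and the queue is assumed to have been created with CL_QUEUE_PROFILING_ENABLE:

    /* Sketch: time the host->device copy, the kernel, and the device->host copy
     * with OpenCL events. Everything passed in is assumed to be set up elsewhere;
     * error checking and clReleaseEvent calls are omitted for brevity. */
    #include <CL/cl.h>
    #include <stdio.h>

    void profile_one_frame(cl_command_queue queue, cl_kernel kernel,
                           cl_mem d_in, cl_mem d_out,
                           const void *h_in, void *h_out,
                           size_t nbytes, size_t global_size)
    {
        cl_event ev_write, ev_kernel, ev_read;
        cl_ulong t0, t1;

        clEnqueueWriteBuffer(queue, d_in, CL_FALSE, 0, nbytes, h_in, 0, NULL, &ev_write);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 1, &ev_write, &ev_kernel);
        clEnqueueReadBuffer(queue, d_out, CL_FALSE, 0, nbytes, h_out, 1, &ev_kernel, &ev_read);
        clWaitForEvents(1, &ev_read);

        /* Kernel time only (what most examples report), in device nanoseconds */
        clGetEventProfilingInfo(ev_kernel, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev_kernel, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        printf("kernel:          %.3f ms\n", (t1 - t0) * 1e-6);

        /* Copy-in + kernel + copy-out, i.e. what the native C loop competes against */
        clGetEventProfilingInfo(ev_write, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev_read, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
        printf("copies + kernel: %.3f ms\n", (t1 - t0) * 1e-6);
    }

Context creation, program build and argument setup happen once, so if they matter for your use case the simplest thing is to bracket the whole run with an ordinary wall-clock timer.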

Related

OpenCL: How to distribute a calculation on different devices without multithreading

Following my former post about comparing the time required to do a simple array addition job (C[i]=A[i]+B[i]) on different devices, I improved the code a little bit to repeat the process for different array lengths and report the time required:
The X axis is the array length on a log2 scale and the Y axis is the time on a log10 scale. As can be seen, somewhere between 2^13 and 2^14 the GPUs become faster than the CPU. I guess it is because the memory allocation becomes negligible compared to the calculation. (GPI1 is a typo; I meant GPU1.)
Now, hoping my C/OpenCL code is correct, I can estimate the time required to do an array addition on each device: f1(n) for the CPU, f2(n) for the first GPU and f3(n) for the second GPU. If I have an array job of length n, I should theoretically be able to divide it into three parts, n1+n2+n3=n, such that f1(n1)=f2(n2)=f3(n3), and distribute them across the three devices in my system to get the fastest possible calculation. I think I could do this with, say, OpenMP or some other multithreading method, using the cores of my CPU to host three different OpenCL tasks. That's not what I'd like to do, because:
It is a waste of resources. Two of the cores would just be hosting while they could be used for calculation.
It makes the code more complicated.
I'm not sure how to do it. I'm currently using the Apple Clang compiler with -framework OpenCL to compile the code, but for OpenMP I would have to use the GNU compiler. I don't know how to use both OpenMP and OpenCL with one of these compilers.
Now I'm wondering whether there is any way to do this distribution without multithreading. For example, one of the CPU cores could assign the subtasks to the three devices in sequence, then collect the results in the same (or a different) order and concatenate them. It would probably take a bit of experimenting to tune the timing of the subtask assignment, but I guess it should be possible.
I'm a total beginner with OpenCL, so I would appreciate it if you could help me figure out whether this is possible and how to do it. Maybe there are already some examples doing this; please let me know. Thanks in advance.
P.S. I have also posted this question here and here on Reddit.
The problem, as stated, implicitly tells you that the solution should be concurrent (asynchronous): you need the three devices to be producing their partial results at the same time. Otherwise, what you will end up doing is running the work first on device A, then on device B and then on device C (in which case it would be better to run the whole thing on the fastest device). If you want to learn to exploit OpenCL efficiently (on multi-core CPUs or GPUs), you should get comfortable with asynchronous (and indeed multithreaded) programming.
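That said, a single host thread can already keep several devices busy just by enqueueing work without blocking. Below is a rough, untested sketch of the idea; the per-device queues, kernels and buffers are assumed to have been created beforehand, and off[i]/len[i] are assumed to hold the split n1+n2+n3=n derived from the measured f1, f2, f3:

    /* Sketch: one host thread drives three devices concurrently.
     * queues[i], kernels[i], d_A[i], d_B[i], d_C[i] are per-device objects
     * created elsewhere; off[i]/len[i] describe each device's slice of the
     * arrays A, B, C. Error checking omitted. */
    #include <CL/cl.h>

    void add_on_three_devices(cl_command_queue queues[3], cl_kernel kernels[3],
                              cl_mem d_A[3], cl_mem d_B[3], cl_mem d_C[3],
                              const float *A, const float *B, float *C,
                              const size_t off[3], const size_t len[3])
    {
        for (int i = 0; i < 3; ++i) {
            size_t bytes = len[i] * sizeof(float);

            /* Non-blocking copies and kernel launch: the calls return immediately,
             * so all three devices end up working at the same time. */
            clEnqueueWriteBuffer(queues[i], d_A[i], CL_FALSE, 0, bytes, A + off[i], 0, NULL, NULL);
            clEnqueueWriteBuffer(queues[i], d_B[i], CL_FALSE, 0, bytes, B + off[i], 0, NULL, NULL);

            clSetKernelArg(kernels[i], 0, sizeof(cl_mem), &d_A[i]);
            clSetKernelArg(kernels[i], 1, sizeof(cl_mem), &d_B[i]);
            clSetKernelArg(kernels[i], 2, sizeof(cl_mem), &d_C[i]);

            clEnqueueNDRangeKernel(queues[i], kernels[i], 1, NULL, &len[i], NULL, 0, NULL, NULL);
            clEnqueueReadBuffer(queues[i], d_C[i], CL_FALSE, 0, bytes, C + off[i], 0, NULL, NULL);
        }

        for (int i = 0; i < 3; ++i)
            clFinish(queues[i]);   /* gather the results from all three devices */
    }

Within a single in-order command queue the writes, the kernel and the read execute in submission order, so no per-queue event plumbing is needed; the only synchronization point is the final clFinish loop.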

How to print data about the execution of my code?

When programming in Haskell we have the interpreter option :set +s. It prints some information about the code you ran: in GHCi, it prints the time spent running the code and the number of bytes used; in Hugs, it prints the number of reductions made by the interpreter and the number of bytes used. How can I do the same thing in C? I know how to print the time spent running my C code and how to print the number of clock ticks the processor spent on it. But what about the number of bytes and reductions? I want a good way to compare two different pieces of code that do the same thing and see which one is more efficient for me.
Thanks.
If you want to compare performance, just compare time and memory used. Allow both programs to exploit the same number of processor cores, write equivalent programs in both languages, and run benchmarks. If you are on a Unix, time(1) is your friend.
Everything else is not relevant to performance. If a program performed 10x more function calls than another one but ran in half the time, it is still the one with the better performance.
The Benchmarks Game web site compares different languages using time/space criteria. You may wish to follow the same spirit.
For more careful profiling of portions of a program, rather than the whole program, you can either use a profiler (in C) or turn on the profiling options (in GHC Haskell). Criterion is also a popular Haskell library for benchmarking Haskell programs. Profiling is typically useful for spotting the "hot spots" in the code: long-running loops, frequently called functions, etc. This is useful because it tells the programmer where optimization is needed. For instance, if a function cumulatively runs for 0.05 s, a 10x speed-up on it is far less useful than a 5% optimization of a function cumulatively running for 20 minutes (a 0.045 s gain versus a 60 s gain).
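If you want the program itself to print a summary roughly in the spirit of :set +s, a minimal POSIX-only sketch could look like this. The peak resident set size stands in for "bytes used", and work() is a placeholder workload:

    #include <stdio.h>
    #include <time.h>
    #include <sys/resource.h>

    /* Sketch: print CPU time and peak memory, roughly like ghci's ":set +s". */
    static void work(void)                        /* placeholder workload */
    {
        volatile double s = 0;
        for (long i = 0; i < 50000000; ++i)
            s += (double)i;
    }

    int main(void)
    {
        clock_t t0 = clock();
        work();
        clock_t t1 = clock();

        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);              /* ru_maxrss is in kilobytes on Linux */

        printf("(%.2f secs, %ld Kb peak)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC, ru.ru_maxrss);
        return 0;
    }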

How to allocate more CPU and RAM to a C program in Linux

I am running a simple C program which performs a lot of calculations (CFD) and hence takes a long time to run. However, I still have a lot of unused CPU and RAM. So how can I allocate some of that spare processing power to one program?
I'm guessing that CFD means Computational Fluid Dynamics (but CFD also has a lot of other meanings, so I might guess wrong).
You definitely should first profile your code. At the very least, compile it with gcc -Wall -pg -O and learn how to use gprof. You might also use strace to find out which system calls your code makes.
I'm not an expert in CFD (even if, in the previous century, I worked with CFD experts). But such code uses a lot of finite element analysis and other vector computation.
If you are writing the code, you might consider using OpenMP (by carefully adding OpenMP pragmas to your source code, you might speed it up), or even consider using GPGPUs by writing OpenCL kernels that run on the GPU.
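To give an idea of how lightweight the OpenMP route can be, here is a hedged sketch; the stencil update is only a placeholder for whatever the CFD code does per cell, and you would compile with something like gcc -O2 -fopenmp:

    /* Sketch: parallelize an independent per-cell update with a single pragma.
     * The iterations of the two loops do not depend on each other, which is
     * what makes the pragma safe here. */
    void relax(double *next, const double *cur, int nx, int ny)
    {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i < nx - 1; ++i)
            for (int j = 1; j < ny - 1; ++j)
                next[i * ny + j] = 0.25 * (cur[(i - 1) * ny + j] + cur[(i + 1) * ny + j]
                                         + cur[i * ny + j - 1] + cur[i * ny + j + 1]);
    }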
You could also learn more about pthreads programming and change your code to use threads.
If you are using important numerical libraries such as BLAS, note that they come with a lot of tuning and even specialized variants (e.g. multi-core, OpenMP-ed, or even OpenCL-based).
In all cases, parallelizing your code is a lot of work. You'll spend weeks or months on improving it, if it is possible at all.
Linux doesn't keep programs waiting with the CPU idle when they need to do calculations.
Either you have a multicore CPU and only a single thread running (as suggested by @Pankrates), or you are blocking on some I/O.
You could nice the process with a negative increment, but you need to be superuser for that. See
man nice
This would increase the scheduling priority of the process. If it is competing with other processes for CPU time, it would get more CPU time and therefore "run faster".
As for increasing the amount of RAM used by the program: you'd need to rewrite or reconfigure the program to use more RAM. It is difficult to say more given the information available in the question.
To use multiple CPUs at once, you either need to run multiple copies of your program, or run multiple threads within the program. Neither is terribly hard to get started on.
However, it's much easier to do a parallel version of "I've got 10000 large numbers and I want to find out, for each of them, whether it is prime or not" than it is to do lots of "A = A + B"-type calculations in parallel, because you need the new A before you can take the next step. CFD calculations tend to be of the latter kind [as far as I understand it], but with large arrays. You may be able to split large vector calculations into a set of smaller vector calculations [say you have a 1000 x 1000 matrix; you could split it into 4 sets of 250 x 1000 matrices, or 4 sets of 500 x 500 matrices, and perform each of those in its own thread], as sketched below.
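A rough pthreads sketch of that row-wise split (the per-element update is only a placeholder, and error handling is omitted):

    #include <pthread.h>

    /* Sketch: split the rows of a 1000 x 1000 matrix across 4 threads,
     * each working on its own block of rows. Compile with -pthread. */
    #define N 1000
    #define NTHREADS 4

    static double a[N][N];

    struct slice { int row_begin, row_end; };

    static void *work(void *arg)
    {
        struct slice *s = arg;
        for (int i = s->row_begin; i < s->row_end; ++i)
            for (int j = 0; j < N; ++j)
                a[i][j] = a[i][j] * 2.0 + 1.0;   /* placeholder computation */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        struct slice s[NTHREADS];

        for (int t = 0; t < NTHREADS; ++t) {
            s[t].row_begin = t * (N / NTHREADS);
            s[t].row_end   = (t + 1) * (N / NTHREADS);
            pthread_create(&tid[t], NULL, work, &s[t]);
        }
        for (int t = 0; t < NTHREADS; ++t)
            pthread_join(tid[t], NULL);
        return 0;
    }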
If it's your own code, then you hopefully know what it does and how it works. If it's someone else's code, then you need to talk to whoever owns it.
There is no magical way to "automatically make use of more CPUs". 30% CPU usage on a quad-core processor probably means that your system is basically using one core, with 5% or so of overhead for other things going on in the system; or maybe there is a second thread somewhere in your application that uses a little bit of CPU doing whatever it does. Or the application is multithreaded but doesn't use the multiple cores to their full extent because of contention between the threads over some shared resource... It's impossible for us to say which of these three [or several other] alternatives it is.
Asking for more RAM isn't going to help unless you have something useful to put into that memory. If there is free memory, your application already gets as much memory as it needs.

Is there a better way to benchmark a C program than timing?

I'm coding a little program that has to sort a large array (up to 4 million text strings). It seems I'm doing quite well at it, since a combination of radix sort and merge sort has already cut the original q(uick)sort execution time to less than half.
Execution time is the main point here, since it is what I'm using to benchmark my piece of code.
My question is:
Is there a better (i.e. more reliable) way of benchmarking a program than just timing its execution? It kind of works, but the same program (with the same background processes running) usually has slightly different execution times when run twice.
This kind of defeats the purpose of detecting small improvements. And several small improvements could add up to a big one...
Thanks in advance for any input!
Results:
I managed to get gprof to work under Windows (using gcc and MinGW). gcc performs poorly (in terms of execution time) compared to my usual compiler (tcc), but it gave me quite some insight.
Try a profiling tool; it will also show you where the program is spending its time. gprof is the classic C profiling tool, at least on Unix.
Look at the time command. It tracks both the CPU time a process uses and the wall-clock time. You can also use something like gprof for profiling your code to find the parts of your program that are actually taking the most time. You could do a lower-tech version of profiling with timers in your code. Boost has a nice timer class, but it's easy to roll your own.
I don't think it's sufficient to just measure how long a piece of code takes to execute. Your environment is a constantly changing thing, so you have to take a statistical approach to measuring execution time.
Essentially you need to take N measurements, discard outliers, and calculate your average, median and standard deviation running time, with an uncertainty measurement.
Here's a good blog explaining why and how to do this (with code): http://blogs.perl.org/users/steffen_mueller/2010/09/your-benchmarks-suck.html
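In that spirit, here is a minimal C sketch of the statistical approach; the workload is a placeholder and clock() stands in for whatever timer you prefer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    /* Sketch: run the code under test N times, then report mean, median and
     * standard deviation of the samples. Link with -lm. */
    #define N 50

    static void code_under_test(void)          /* placeholder workload */
    {
        volatile double s = 0;
        for (long i = 0; i < 1000000; ++i) s += i;
    }

    static int cmp(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    int main(void)
    {
        double t[N], mean = 0, var = 0;

        for (int i = 0; i < N; ++i) {
            clock_t c0 = clock();
            code_under_test();
            t[i] = (double)(clock() - c0) / CLOCKS_PER_SEC;
            mean += t[i];
        }
        mean /= N;
        for (int i = 0; i < N; ++i) var += (t[i] - mean) * (t[i] - mean);
        qsort(t, N, sizeof t[0], cmp);

        printf("mean %.6f s, median %.6f s, stddev %.6f s\n",
               mean, t[N / 2], sqrt(var / (N - 1)));
        return 0;
    }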
What do you use for timing execution so far? There's C89 clock() in time.h for starters. On unixoid systems you might find getitimer() with ITIMER_VIRTUAL for measuring process CPU time. See the respective manual pages for details.
You can also use a POSIX shell's times utility to benchmark the processor time used by a process and its children. The resolution is system dependent, like just about anything related to profiling. Try to wrap your C code in a loop, executing it as many times as necessary to reduce the "jitter" in the times the benchmark reports.
Call your routine from a test harness that executes it N + 1 times. Ignore the timing of the first iteration and take the average of iterations 1..N. The reason for ignoring the first one is that it is often slightly inflated by various effects, e.g. virtual memory, code being paged in, etc. The reason for averaging over N iterations is that you get rid of artefacts caused by other processes, the scheduler, etc.
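A hedged sketch of such a harness, with the function under test passed in and clock() used for simplicity:

    #include <time.h>

    /* Sketch: run fn() n+1 times, throw away the first (cold) run and
     * return the average of the remaining n, as described above. */
    static double average_seconds(void (*fn)(void), int n)
    {
        double total = 0;
        for (int i = 0; i <= n; ++i) {
            clock_t c0 = clock();
            fn();
            double secs = (double)(clock() - c0) / CLOCKS_PER_SEC;
            if (i > 0)                /* iteration 0 is the warm-up */
                total += secs;
        }
        return total / n;
    }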
If you're running on Linux or similar, you might also want to use taskset to pin your code to a specific CPU core (assuming it's single-threaded), ideally not core 0, since that core tends to handle all interrupts.

Performance/profiling measurement in C

I'm doing some prototyping work in C, and I want to compare how long a program takes to complete with various small modifications.
I've been using clock; from K&R:
clock returns the processor time used by the program since the beginning of execution, or -1 if unavailable.
This seems sensible to me, and has been giving results which broadly match my expectations. But is there something better to use to see what modifications improve/worsen the efficiency of my code?
Update: I'm interested in both Windows and Linux here; something that works on both would be ideal.
Update 2: I'm less interested in profiling a complex problem than total run time/clock cycles used for a simple program from start to finish—I already know which parts of my program are slow. clock appears to fit this bill, but I don't know how vulnerable it is to, for example, other processes running in the background and chewing up processor time.
Forget the time() functions, what you need is:
Valgrind!
And KCachegrind is the best GUI for examining Callgrind profiling stats. In the past I have ported applications to Linux just so I could use these tools for profiling.
For a rough measurement of overall running time, there's time ./myprog.
But for performance measurement, you should be using a profiler. For GCC, there is gprof.
Both of these assume a Unix-ish environment. I'm sure there are similar tools for Windows, but I'm not familiar with them.
Edit: For clarification: I do advise against using any gettime() style functions in your code. Profilers have been developed over decades to do the job you are trying to do with five lines of code, and provide a much more powerful, versatile, valuable, and fool-proof way to find out where your code spends its cycles.
I've found that timing programs, and finding things to optimize, are two different problems, and for both of them I personally prefer low-tech.
For timing, the trick is to make the code take long enough by wrapping a loop around it. For example, if you run an operation 1000 times and time the whole loop with a stopwatch, then seconds on the stopwatch correspond to milliseconds per operation once you remove the loop.
For finding things to optimize, there are pieces of code (terminal instructions and function calls) that are responsible for various fractions of the time. During that time, they are exposed on the stack. So you can wrap a loop around the program to make it take long enough, and then take stackshots. The code to optimize will jump out at you.
In POSIX (e.g. on Linux), you can use gettimeofday() to get higher-precision timing values (microseconds).
In Win32, QueryPerformanceCounter() is popular.
Beware of CPU clock-changing effects: if your CPU decides to clock down during the test, the results may be skewed.
If you can use POSIX functions, have a look at clock_gettime. I found an example from a quick google search on how to use it. To measure processor time taken by your program, you need to pass CLOCK_PROCESS_CPUTIME_ID as the first argument to clock_gettime, if your system supports it. Since clock_gettime uses struct timespec, you can probably get useful nanosecond resolution.
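A minimal sketch along those lines (Linux; older glibc may need -lrt when linking, and the workload is a placeholder):

    #include <stdio.h>
    #include <time.h>

    /* Sketch: measure the CPU time this process consumes, with nanosecond
     * resolution where the system supports it. */
    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);

        volatile double s = 0;                      /* placeholder workload */
        for (long i = 0; i < 10000000; ++i) s += i;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("CPU time: %.9f s\n", secs);
        return 0;
    }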
As others have said, for any serious profiling work, you will need to use a dedicated profiler.
