NVIDIA CUDA: using all cores of the machine - C

I was running a CUDA program on a machine whose CPU has four cores. How is it possible to change a CUDA C program to use all four cores and all the GPUs available?
I mean, my program also does things on the host side before computing on the GPUs...
thanks!

CUDA is not intended to do this. The purpose of CUDA is to provide access to the GPU for parallel processing. It will not use your CPU cores.
From the What is CUDA? page:
CUDA is NVIDIA’s parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).
CPU-side parallelism should be handled via more traditional multi-threading techniques.

CUDA code runs only on the GPU.
So if you want parallelism across your CPU cores, you need to use a threading framework such as Pthreads or OpenMP, as in the sketch below.
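For example, here is a minimal sketch (names like myKernel and dev_data are illustrative, not from the question) that uses one OpenMP host thread per GPU, so the CPU cores drive all available GPUs concurrently:

    #include <omp.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);

        /* One OpenMP host thread per GPU; any remaining CPU cores
           stay free for other host-side work. */
        #pragma omp parallel num_threads(num_gpus)
        {
            int tid = omp_get_thread_num();
            cudaSetDevice(tid);        /* bind this CPU thread to one GPU */
            /* launch work on that GPU here, e.g.
               myKernel<<<blocks, threads>>>(dev_data[tid]); */
            cudaDeviceSynchronize();
        }
        return 0;
    }

Compile with something like nvcc -Xcompiler -fopenmp; each thread then submits work to its own device.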

Convert your program to OpenCL :-)

Related

Timing Issue with 4 Threads / 4 Processes in Parallel Programming

I am having issues with OpenMP and MPI execution timings. When I select either 4 threads (OpenMP) or 4 processes (MPI), my execution time is slower than the serial code.
Both scripts give correct timings on other machines, and both use the gettimeofday() function for timing. Below is a screenshot of both scripts being executed with 1-8 threads/processes:
RAM is not exceeding its limit and the disk is not busy during execution. The machine has an Intel i5 2500K (stock, not overclocked) and runs Linux Mint 17 x64.
As mentioned before, both programs produce the correct timings on other machines, so I think the issue has something to do with CPU affinity and the OS.
Has anyone encountered this issue before?
EDIT 1:
When using the 'bind-to-core' argument on the MPI execution, the runtime improves significantly, but it is still much slower than the serial code.
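For reference, a minimal gettimeofday() harness of the kind both scripts use looks like this (illustrative, not the asker's actual code):

    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval start, end;
        gettimeofday(&start, NULL);

        /* ... parallel region under test ... */

        gettimeofday(&end, NULL);
        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) / 1e6;
        printf("elapsed: %f s\n", elapsed);
        return 0;
    }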
The problem was faulty hardware.
I replaced the motherboard with one of the same series/chipset (so no reinstall was required), and now the timings are correct for both scripts.

How much lower is the kernel launch overhead on Tesla compared to GeForce?

Tesla (Fermi or Kepler) in TCC mode, compared to GeForce (same generations) under WDDM?
The program I wrote has some very serious problems with kernel launch overhead, because it has to launch kernels repeatedly. The overhead is so large that I have to merge many kernels together and trade memory space for fewer kernel launches, but that approach only goes so far given the finite size of GPU memory.
I heard TCC mode can have lower overhead, but can it bring the launch overhead down to the level of a CPU function call?
From benchmarks I have read, at least on a GeForce GTX 280 the kernel-launch overhead is thousands of times longer than a CPU function-call overhead, and for methods that require a large number of repeated iterations this makes a huge performance difference.
The WDDM driver will batch kernel launches together to reduce overhead. So if you are able to merge kernels together to reduce launch overhead, the WDDM driver can do much of the same batching (unless you use CUDA calls in between that prevent it). Thus switching to TCC mode will not gain you much in this specific use case.
Are you sure the problem is launch overhead and not something else? How many separate kernels are you launching, and how long does this take? You can measure the launch cost directly, as in the sketch below.
It could well be (particularly in the case of very small kernels, where launch overhead would be noticeable) that merging the kernels together allows the compiler to better optimize them, e.g. to eliminate the writing out and reading back of intermediate results to global memory.
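A minimal sketch for that measurement, timing back-to-back launches of an empty kernel with CUDA events (all names here are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void emptyKernel(void) {}

    int main(void)
    {
        const int N = 1000;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        emptyKernel<<<1, 1>>>();           /* warm-up launch */
        cudaDeviceSynchronize();

        cudaEventRecord(start);
        for (int i = 0; i < N; ++i)
            emptyKernel<<<1, 1>>>();       /* back-to-back launches */
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average launch overhead: %f us\n", ms * 1000.0f / N);
        return 0;
    }

This typically reports a per-launch cost on the order of several microseconds, which gives a baseline for deciding whether launch overhead can really explain the slowdown.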
I was launching 16 kernels and the speed was X; when I merged all the kernels into a single launch, the speed was 10X. Merging the kernels added some overhead of its own, but the results were great.
This is a many-core architecture; if you cannot make use of that (launch the largest job size you can), then you are wasting the overhead you paid to launch the kernel.
I hope this helps you.

Heterogeneous Computing Using CPU, GPU, and ARM CPU

In my OpenCL application I have a controlling part, a graphics part, and some serial parts, as shown below:
All of these parts run in parallel.
So far I have written applications that run simultaneously on the CPU and the GPU. Is there a way I can use an ARM CPU together with the Intel CPU and the ATI GPU in parallel, as shown in the picture above?
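One starting point (a minimal host-side sketch, assuming OpenCL headers and an ICD loader are installed; the buffer sizes are illustrative) is to enumerate every platform and device visible to the application. With a separate context and command queue per device, the Intel CPU, the ATI GPU, and an ARM device that ships an OpenCL driver can each be given work in parallel:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            cl_device_id devices[8];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL,
                           8, devices, &num_devices);
            for (cl_uint d = 0; d < num_devices; ++d) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(name), name, NULL);
                printf("platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }

Each vendor's driver registers itself as a platform with the ICD loader, so the same loop picks up the Intel, AMD/ATI, and ARM implementations when they are installed.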

Same codebase for CPU and GPU

Does anybody have any experience in maintaining a single codebase for both CPU and GPU?
I want to create an application that, when possible, would use the GPU for some long-running calculations, but if a compatible GPU is not present on the target machine it would just use a regular CPU version. It would be really helpful if I could just write a portion of the code using conditional compilation directives that would compile both to a CPU version and a GPU version. Of course there will be some parts that are different for CPU and GPU, but I would like to keep the essence of the algorithm in one place. Is it at all possible?
OpenCL is a C-based language. OpenCL platforms exist that run on GPUs (from NVidia and AMD) and CPUs (from Intel and AMD).
While it is possible to execute the same OpenCL code on both GPUs and CPUs, it really needs to be optimized for the target device. Different code would need to be written for different GPUs and CPUs to gain the best performance. However, a CPU OpenCL platform can function as a low-performance fallback for even GPU optimized code.
If you are happy writing conditional directives that execute depending on the target device (CPU or GPU) then that can help performance of OpenCL code on multiple devices.
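As one concrete illustration of the conditional-directive idea (a CUDA-flavoured sketch rather than OpenCL; all names are illustrative), the core routine can be written once and compiled for both host and device:

    #ifdef __CUDACC__
    #define HD __host__ __device__
    #else
    #define HD
    #endif

    /* The essence of the algorithm, written once. */
    HD float stepFunction(float x) { return x * x + 1.0f; }

    #ifdef __CUDACC__
    /* GPU path: only built when compiling with nvcc. */
    __global__ void gpuKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = stepFunction(in[i]);
    }
    #endif

    /* CPU fallback: the same code path in a plain loop. */
    void cpuVersion(const float *in, float *out, int n)
    {
        for (int i = 0; i < n; ++i)
            out[i] = stepFunction(in[i]);
    }

This keeps the essence of the algorithm in stepFunction, while dispatch code elsewhere chooses the CPU loop or the GPU kernel depending on what hardware is present.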

Combine Multiple Cores into Single-Core Processing on Linux, Possible?

I am thinking about an idea where a legacy application needs to run at full performance on a Core i7 CPU. Is there any Linux software/utility that can combine all the cores for that application, so it can process with higher performance than using only one core?
The application is readpst, and it only uses one core for processing Outlook PST files.
It's OK if I can't use all the cores; it would be fine to use, say, three of them.
Possible? Or am I drunk?
I will rewrite it to use multiple cores if my C knowledge of multi-process forking is good enough.
Intel Nehalem-based CPUs (i7, i5, i3) already do this to an extent.
By using their Turbo Boost mode, when a single core is being used it is automatically over-clocked until the power and temperature limits are reached.
The newer versions of the i7 (the 2K chips) do this even better.
Read this, and this.
"Possible? or am i drunk?"
You're drunk! If this was easy in the general case, Intel would have built it into the processors by now!
What you're looking for is called 'Single System Image' or SSI. There is scant information on the internet about people doing such a thing, as it tends to be reserved for super computing (and perhaps servers).
http://en.wikipedia.org/wiki/Single_system_image
No, the application needs to be multi-threaded to use more than one core. You're of course free to write a multi-threaded version of that application if you wish, but it may not be easy to make sure the different threads don't mess each other up.
If you want it to utilize multiple cores, you could write a multi-threaded version of your program, but only if the work is actually parallelizable. You said you were reading from PST files; take care not to run into I/O bottlenecks.
A great library for working with threads, mutexes, semaphores and so on is POSIX Threads; see the sketch below.
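A minimal Pthreads sketch (the worker body and thread count are illustrative) that fans work out to a fixed pool of threads:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        /* process this thread's share of the work here */
        printf("thread %d running\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        int ids[NUM_THREADS];

        for (int i = 0; i < NUM_THREADS; ++i) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NUM_THREADS; ++i)
            pthread_join(threads[i], NULL);
        return 0;
    }

Compile with gcc -pthread; each worker would take a disjoint share of the PST data, which sidesteps most of the locking concerns.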
There isn't such an application available, but it is possible in principle.
If the OS ran inside a VM, the hypervisor could use a few CPUs to identify which instructions could run in parallel rather than strictly sequentially, and then actually execute them on a few other CPUs at once.
Then, when the executing CPUs become idle (because they finished their work faster than the manager can supply new work), they can start calculating the next stretch of instructions.
The reason this would need to happen at the hypervisor level, and not within the OS, is memory locking: inside the OS it would not be possible.
