OpenCL: Solving two different-sized problems simultaneously on a GPU

For a problem I'm working on I need to solve two sub-problems: Sub1 on an NxM grid, and Sub2 on a Kx1 grid. The problem is, these sub-problems should communicate after every step in the solution process so I need to run them simultaneously.
The end result should look like this:
Sub1 is solved for time t
Sub2 is solved for time t
An interaction term between sub1 and sub2 for time t+1 is calculated
This is then repeated for t+1, using the newly calculated interaction term, and then for t+2, t+3, etc. All the data used is stored in global device memory so there doesn't need to be any copying to and from the device in between the steps.
My problem is, how do I tell OpenCL I want to work on two different sized problems at the same time?

Is it really needed to be "at the same time"?
This is a common misunderstanding of OpenCL and parallel systems. Making everything more and more parallel, with all of it running at the same time, is not always a good choice. In fact, 99% of cases do not need to run in parallel (unless some timing constraint exists), and forcing them to do so slows things down.
Depending on the sizes and amount of work of Sub1 and Sub2:
If they take very little time or operate on a small amount of data:
Merge both into one kernel and scale the work items as needed. Some of them will be idle, but the loss is small and it is compensated by sharing local/private memory between Sub1 and Sub2.
If they are BIG chunks of processing:
Split the work into 2-3 different kernels, with different arguments, etc.
Communicate between the two parts through global memory.
Launch the two kernels with different sizes (so each fits its amount of work exactly).
When both have finished, launch another kernel on the result to generate the data for the next iteration.
You can even enqueue everything at once in a single in-order queue, and the kernels will run one after another without CPU intervention. That is the easier approach.
I would say that in your case you should go for many kernels, as sketched below.
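For the many-kernels route, a minimal host-side sketch might look like the following. The kernel names (sub1_step, sub2_step, interaction), the sizes N, M, K, num_steps and the queue are placeholders: the kernels are assumed to be built already, their arguments set with clSetKernelArg to buffers living in global device memory, and the grid size used for the interaction kernel is just an example.

#include <CL/cl.h>

/* queue, sub1_step, sub2_step, interaction, N, M, K, num_steps: placeholders,
 * assumed to be created and set up elsewhere.                                 */
size_t sub1_size[2] = { N, M };    /* NxM work-items for Sub1 */
size_t sub2_size[1] = { K };       /* Kx1 work-items for Sub2 */

for (int t = 0; t < num_steps; t++) {
    /* one in-order queue: each time step runs Sub1, then Sub2, then the
     * interaction kernel, with no host<->device copies in between       */
    clEnqueueNDRangeKernel(queue, sub1_step,   2, NULL, sub1_size, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, sub2_step,   1, NULL, sub2_size, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, interaction, 1, NULL, sub2_size, NULL, 0, NULL, NULL);
}
clFinish(queue);                   /* block the host once, after all steps are enqueued */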

Related

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a quick overview of how threads are managed by the system. Nowadays processors are multi-core and often run multiple threads per core, but for the sake of simplicity let's first imagine a single-core, single-threaded processor. Such a CPU is physically limited to performing only a single task at a time, yet we are still able to run multitasking programs.
So how is this possible? It is simply an illusion!
The CPU still performs a single task at a time, but switches between one and the other, giving the illusion of multitasking. This process of changing from one task to another is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, data can be saved in registers, cache, RAM, etc. As technology advances, faster solutions for this have been developed. When a task is resumed, its data is fetched back and it continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As said before, switching between one task and another requires a context switch. This involves storing and fetching data, which is pure overhead for your computation and gives you no direct advantage. So having too many tasks requires a large amount of context switching, which means a lot of computational time wasted! In the end, your program might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is worth checking that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will run on multiple hardware cores!
In your particular case of application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
is a good example of concurrent, data-independent computing. This sort of task runs very well on a GPU, since the operations have no data dependencies between them and the concurrency is handled in hardware (modern GPUs have thousands of compute cores!).
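As a rough illustration of the thread-count question (a hypothetical sketch, not the asker's program, with error handling omitted): the rows of the N x M grid are split across a configurable number of pthreads, and the requested count is capped at the number of hardware cores, beyond which extra threads mainly add context-switching overhead. Compile with -pthread.

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

#define N 2000
#define M 2000
#define MAX_THREADS 64

typedef struct { double *grid; int first_row, last_row; } job_t;

static void *smooth_rows(void *arg)            /* stand-in for the real per-cell averaging */
{
    job_t *job = arg;
    for (int r = job->first_row; r < job->last_row; r++)
        for (int c = 0; c < M; c++)
            job->grid[(size_t)r * M + c] = (job->grid[(size_t)r * M + c] + 1.0) / 2.0;
    return NULL;
}

int main(int argc, char **argv)
{
    int requested = (argc > 1) ? atoi(argv[1]) : 4;   /* the optional thread-count parameter */
    long cores    = sysconf(_SC_NPROCESSORS_ONLN);    /* threads beyond this rarely help     */
    int nthreads  = requested;
    if (nthreads > cores)       nthreads = (int)cores;
    if (nthreads < 1)           nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

    double   *grid = calloc((size_t)N * M, sizeof *grid);
    pthread_t tid[MAX_THREADS];
    job_t     job[MAX_THREADS];

    for (int i = 0; i < nthreads; i++) {              /* each thread gets a slice of rows    */
        job[i] = (job_t){ grid, i * N / nthreads, (i + 1) * N / nthreads };
        pthread_create(&tid[i], NULL, smooth_rows, &job[i]);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    free(grid);
    return 0;
}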

About Dijkstra omp

Recently I've download a source code from internet of the OpenMP Dijkstra.
But I found that the parallel runtime is always larger than the single-threaded runtime (whether I use two, four or eight threads).
Since I'm new to OpenMP, I really want to figure out what is happening.
This is due to the overhead of setting up the threads. The execution time of the work itself is theoretically the same, but the system has to set up the threads that manage the work (even if there's only one). For a small amount of work, or for only one thread, this overhead makes your time-to-solution slower than the serial time-to-solution.
Alternatively, if you see the time increasing dramatically as you increase the thread count, you could be using only one core of your computer while telling it to run 2, 4, 8, etc. threads.
Finally, it's possible that the way you're implementing Dijkstra's algorithm is largely serial. But without looking at your code it's hard to say more.
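A minimal way to see this overhead in isolation (a standalone sketch, not the downloaded Dijkstra code): time the same small loop once serially and once inside an OpenMP parallel for. For small workloads the parallel version is typically slower simply because of the cost of creating and synchronizing the thread team. Compile with -fopenmp.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { N = 1000 };                 /* deliberately small amount of work */
    static double a[N];

    double t0 = omp_get_wtime();
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;
    double serial = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;
    double parallel = omp_get_wtime() - t0;

    printf("serial: %g s   parallel: %g s   (a[1] = %g)\n", serial, parallel, a[1]);
    return 0;
}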

Set CPU usage or manipulate other system resource in C

I have a specific application to write in C. Is there any way to programmatically set the CPU usage of a process? I want a specific process (mine) to use e.g. 20% of the CPU for a few seconds and then go back to regular usage. A while(1) loop takes 100% of the CPU, so that's not the best idea for me. Any other ideas for manipulating system resources, and functions that could provide it? I have already done memory allocation manipulations, but I need other ideas about manipulating system resources.
Thanks!
What I know is that you may be able to control your application's priority depending on the operating system.
Also, a function equivalent to Sleep() reduces CPU load as it causes your application to relinquish CPU cycles to other running programs.
Have you ever tried to answer a question that became more and more complicated once you dug into it?
What you do depends on what you are trying to accomplish. Do you want to use "20% by specific (mine) process for few seconds and then back to regular usage"? Or do you want to use 20% of the entire processor? Over what interval do you want to use 20%? Averaged over 5 sec? 500 msec? 10 msec?
20% of your process is pretty easy as long as you don't need to do any real work and want 20% as an average over a reasonably long interval, say 1 sec.
#include <unistd.h>                      /* usleep() */

volatile unsigned long sink = 0;         /* volatile, so the compiler won't optimize the work away */

/* Sketch: INTERVAL_CNT (number of ~1-second intervals), PERCENT (0-100) and
 * WORK_ITERS (busy-loop length, calibrated per machine) are assumed defined. */
for( int i = 0; i < INTERVAL_CNT; i++ )
{
    for( long j = 0; j < WORK_ITERS * PERCENT / 100; j++ )
    {
        sink += j;                       /* some work the compiler won't optimize away */
    }
    usleep( 1000000L * (100 - PERCENT) / 100 );   /* sleep for the rest of the interval */
}
Adjusting this for doing real work is more difficult. Note the comment about the compiler doing optimization. Optimizing compilers are pretty smart and will identify and remove code that does nothing useful. For example, if you use myVar++, declare it local to a certain scope, and never use it, the compiler will remove it to make your app run faster.
If you want a more continuous load (read that as a load of 20% at any sampling point, rather than a square wave with a certain duty cycle), it's going to be more complicated. You might be able to do this with some experimentation by launching multiple CPU-consuming threads; having multiple threads with offset duty cycles should give you a smoother load, as in the sketch below.
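A hypothetical sketch of that offset-duty-cycle idea (all constants are illustrative, and WORK_ITERS would need to be calibrated per machine): each thread burns the CPU for PERCENT% of a one-second interval, but starts its interval at a different phase, so the combined load is smoother than a single square wave. Compile with -pthread.

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

#define NTHREADS       4
#define PERCENT        20
#define INTERVAL_USEC  1000000L        /* one-second duty-cycle interval          */
#define WORK_ITERS     50000000L       /* busy-loop length: calibrate per machine */

static volatile unsigned long sink;    /* volatile keeps the busy loop from being optimized away */

static void *duty_cycle(void *arg)
{
    long id = (long)(intptr_t)arg;
    usleep(id * INTERVAL_USEC / NTHREADS);                  /* phase offset per thread */
    for (;;) {
        for (long j = 0; j < WORK_ITERS * PERCENT / 100; j++)
            sink += j;                                      /* busy phase */
        usleep(INTERVAL_USEC * (100 - PERCENT) / 100);      /* idle phase */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, duty_cycle, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);    /* the workers loop forever; stop with Ctrl-C */
    return 0;
}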
20% of the entire processor is even more complicated, since you need to account for multiple factors such as other processes executing, process priority, and multiple CPUs in the processor. I'm not going to go into detail, but you might be able to do this by simultaneously executing multiple heavyweight processes with offset duty cycles, along with a master thread that samples the processor load and dynamically adjusts the heavyweight processes through a set of shared variables.
Let me know if you want me to confuse the matter even further.

Memory Sharing C - Performance

I'm playing around with process creation/scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function over a shared in-memory buffer. Each thread is created using clone, and I'm playing around with the various flags, stack size, etc., to measure process creation time (hence the use of clone).
My experiments are run on a 2 core i7 with hyperthreading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hash functions doubles when I run 4 processes (i.e. one per logical CPU) compared to when I run 2 processes. My understanding is that hyperthreading helps when a process is waiting on IO, so for a CPU-bound process it has almost no effect. Is this correct?
The second observation is that I see pretty high variance (up to 2 seconds) when computing these hash functions (I compute a hash 1,000,000 times). No other process is running on the system (though there are some background threads). I'm struggling to understand why there is so much variance. Is it strictly due to how the scheduler happens to schedule the processes? I understand that without using sched_affinity there is no guarantee that they will be placed on different CPUs, so can this be explained just by them ending up on the same CPU?
Are there any other ways to guarantee improved reliability without relying on sched_affinity?
The third observation is that, even when I run with just 2 threads (when each should be scheduled on a different CPU), I find that performance goes down (not by much, but a little bit). I'm struggling to understand why that is the case. It's the same read-only buffer, and it fits in the cache. Is there some contention in accessing the page table? Would it then be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it as read-only?
Different threads still run in the context of one process, so they tend to run on the same CPU the process is running on (usually one process runs on one CPU, but that is not guaranteed).
When you run two threads instead of two processes, you have the overhead of switching between threads; the more calculations you do, the more of this switching is needed, so it will be slower than the same calculations done in one thread.
Furthermore, if you run the same calculations in different processes, there is an even bigger overhead of switching between processes, but there is a better chance that you will run on different CPUs, so in the long run this will probably be faster; less so for short calculations.
Even if you don't think you have other processes running, the OS has a lot to do all the time and switches to its own processes that you aren't always aware of.
All of this comes down to the randomness of switching. I hope this helped a bit.
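As an aside on the sched_affinity point raised in the question: if you do decide to pin each worker to its own CPU so that placement is repeatable between runs, a minimal Linux-specific sketch (error handling omitted, function name is just an example) looks like this:

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread (pid 0) to one CPU; call this at the start of each worker. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 on success, -1 on error */
}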

Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of a code initially written in C. The reason I was interested in CUDA is the large size of the arrays I am dealing with: the code is a fluid mechanics simulation and I need to run a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double-precision values. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I use relatively small 2D blocks. The number of threads per block is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximum desired size is 1000 values per dimension; that means I want to deal with a 3D grid of one billion blocks).
Okay, so now: my five to six kernels, which do the job correctly, make good use of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently (called asynchronously), but most of them need to run one after the other. There are several memcpys from device to host, but the ratio of memcpys to kernel calls is significantly low. I am mostly executing operations on my array values.
Here is my question :
If I understood correctly, all of my blocks work on the arrays at the same time. So does that mean that dealing with a 10-block grid, a 100-block grid or a billion-block grid takes the same amount of time? The answer is obviously no, since the computation time is significantly larger when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advices/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks work on the arrays at the same time.
No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all of which depend on the compute capability of your device; the NVS 5200M should be cc2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block, and the warps within it, will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources, for example shared memory and registers. The Programming Guide and the Occupancy Calculator give a good overview of these limits.
The first limitation is that for cc2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory, etc., this number may decrease.
If I remember correctly, a cc2.1 SM consists of 96 CUDA cores, so your NVS 5200M should have one SM. Let's assume that with your kernel setup N (N<=8) thread blocks fit into the SM at the same time. The internal scheduler launches the first N blocks and queues up all the other thread blocks. When one thread block has finished its work, the next one from the queue is launched. So if you launch between 1 and N blocks in total, the kernel runtime will be roughly the same. If you run the kernel with N+1 blocks, the runtime will increase.
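A back-of-the-envelope way to see why runtime grows with grid size, under the assumption above that at most N blocks are resident at once (the numbers here are illustrative, not measured): the grid is processed in ceil(total/N) "waves" of blocks, and kernel time grows roughly with the number of waves.

#include <stdio.h>

int main(void)
{
    long resident = 8;                          /* assumed: at most N = 8 blocks resident at once */
    long totals[] = { 10, 100, 1000000000L };   /* example grid sizes from the question           */

    for (int i = 0; i < 3; i++) {
        long waves = (totals[i] + resident - 1) / resident;   /* ceiling division */
        printf("%10ld blocks -> %9ld waves of work\n", totals[i], waves);
    }
    return 0;
}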
