Concurrency: 4 CUDA applications competing for GPU resources

What would happen if four concurrent CUDA applications compete for resources on a single GPU so they can offload work to the graphics card? The CUDA Programming Guide 3.1 mentions that certain operations are asynchronous:
Kernel launches
Device-to-device memory copies
Host-to-device memory copies of a memory block of 64 KB or less
Memory copies performed by functions that are suffixed with Async
Memory set function calls
It also mentions that devices with compute capability 2.0 are able to execute multiple kernels concurrently as long as the kernels belong to the same context.
Does this type of concurrency apply only to streams within a single CUDA application, or is it also possible when completely different applications are requesting GPU resources?
Does that mean the concurrency support is available only within one application (context?), and that the four applications will run concurrently only in the sense that their calls may be overlapped by context switching on the CPU, while each application has to wait until the GPU is freed by the others? (i.e., a kernel launch from app4 waits until a kernel launch from app1 finishes.)
If that is the case, how can these four applications access GPU resources without suffering long waiting times?

As you said, only one "context" can occupy each of the engines at any given time. This means that one of the copy engines can be serving a memcpy for application A, the other a memcpy for application B, and the compute engine can be executing a kernel for application C (for example).
An application can actually have multiple contexts, but no two applications can share the same context (although threads within an application can share a context).
Any application that schedules work to run on the GPU (i.e. a memcpy or a kernel launch) can schedule the work asynchronously, so the application is free to go ahead and do other work on the CPU while queuing any number of tasks to run on the GPU.
Note that it is also possible to put the GPUs in exclusive mode whereby only one context can operate on the GPU at any time (i.e. all the resources are reserved for the context until the context is destroyed). The default is shared mode.
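To illustrate the asynchronous scheduling described above, here is a minimal sketch that queues host-to-device copies into two streams within a single context and lets the CPU continue in the meantime. It uses only the CUDA runtime's C API; the buffer sizes are arbitrary and error checking is omitted, so treat it as a sketch rather than a complete program.

    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t n = 1 << 20;          /* arbitrary element count */
        float *h_a, *h_b, *d_a, *d_b;
        cudaStream_t s0, s1;

        /* Pinned host memory is required for truly asynchronous copies. */
        cudaMallocHost((void **)&h_a, n * sizeof(float));
        cudaMallocHost((void **)&h_b, n * sizeof(float));
        cudaMalloc((void **)&d_a, n * sizeof(float));
        cudaMalloc((void **)&d_b, n * sizeof(float));

        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        /* Both copies are merely queued; control returns immediately and
           the CPU can do unrelated work here. Kernel launches would be
           issued into the same streams in the same way. */
        cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, s1);

        /* Block only when the results are actually needed. */
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);

        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
        cudaFreeHost(h_a); cudaFreeHost(h_b);
        cudaFree(d_a);     cudaFree(d_b);
        return 0;
    }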

Related

Does the CreateThread API in Windows automatically spread threads across multiple cores?

I have a question regarding the CreateThread API in Windows (C/C++) that the MSDN documentation for the API doesn't explain.
If I use this API to create multiple threads that all execute a common function, will Windows automatically spread these threads across different cores, especially if the function is CPU intensive?
Most operating systems internally treat processes and threads as specializations of more general "tasks". By default, tasks have no fixed association with particular cores. In general the OS will try to reschedule a task on the same core in the short term, to aid cache locality, but will cycle tasks over all associated cores in the long term to distribute the thermal load.
It is however possible to limit the set of cores a given task may run on using the so-called "CPU affinity mask", which can be set either at runtime using a system call or as part of the executable binary at link time.
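As a hedged illustration of the runtime route, the sketch below creates one worker with CreateThread and then pins it to core 0 with SetThreadAffinityMask; the worker function and the mask value are made up for the example.

    #include <windows.h>
    #include <stdio.h>

    /* A made-up CPU-intensive worker. */
    static DWORD WINAPI Worker(LPVOID arg)
    {
        volatile unsigned long long x = 0;
        for (unsigned long long i = 0; i < 100000000ULL; ++i)
            x += i;
        printf("worker done\n");
        return 0;
    }

    int main(void)
    {
        HANDLE h = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

        /* Bit 0 set => the thread may run only on core 0. Without this
           call, the scheduler is free to migrate it across all cores. */
        SetThreadAffinityMask(h, 1);

        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);
        return 0;
    }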

Why does one core execute more than its share of instructions?

I have a C program (a graphics benchmark) that runs on a MIPS processor simulator (I'm looking to graph some performance characteristics). The processor has 8 cores, but core 0 seems to be executing more than its fair share of instructions. The benchmark is multithreaded with the work distributed exactly evenly between the threads. Why could core 0 be running between a quarter and half of all instructions even though the benchmark is multithreaded on an 8-core processor?
What are some possible reasons this could be happening?
Most application workloads involve some number of system calls, which could block (e.g. for I/O). It's likely that your threads spend some amount of time blocked, and the scheduler simply runs them on the first available core. In an extreme case, if you have N threads but each is able to do work only 1/N of the time, a single core is sufficient to service the entire workload.
You could use pthread_setaffinity_np to assign each thread to a specific core, then see what happens.
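For reference, here is a minimal sketch of that suggestion, assuming one thread per core index; the worker body is a placeholder for the benchmark's per-thread work.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 8

    static void *worker(void *arg)
    {
        /* Placeholder for the benchmark's per-thread work. */
        printf("thread %ld running on cpu %d\n", (long)arg, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; ++i) {
            pthread_create(&t[i], NULL, worker, (void *)i);

            /* Pin thread i to core i, so the distribution of work across
               cores is fixed instead of being left to the scheduler. */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET((int)i, &set);
            pthread_setaffinity_np(t[i], sizeof(set), &set);
        }
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);
        return 0;
    }

(Build with gcc -pthread.)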
You did not mention which OS you are using.
However, most of the code in most OSs is still written for a single-core CPU.
Therefore, the OS will not try to evenly distribute the processes over the array of cores.
When there are multiple cores available, most OSs start a process on the first core that is available (and a blocked process leaves the related core available.)
As an example, on my system (a 4-core AMD64) running Ubuntu Linux 14.04, the CPUs are usually less than 1 percent busy, so everything could run on a single core.
It takes a lot of running applications, such as video playback and long-running background jobs with several windows open, to show much real activity on cores other than the first.

How to communicate between processes in realtime Linux?

There are a lot of examples of how to write realtime code for RT-Linux by FSMLabs, but that distribution was abandoned many years ago. Currently the PREEMPT_RT patch for the vanilla kernel is actively developed, but there are only a few code examples on the official wiki. First let me introduce my issue.
I'm writing a project containing 2 programs:
A byte-code virtual machine - it must run as a realtime application - with 64 KB of I/O memory and 64 KB for byte code.
A client program - it will read and write the I/O memory, start/pause the machine, load new programs, set parameters, etc. It doesn't have to be realtime.
How should these processes communicate so that process (1) stays realtime, avoiding page faults and other behavior that can interfere with a realtime application?
Approach 1. Use only threads
There are 2 threads:
virtual machine thread with highest priority
client thread with normal priority that communicates with user and machine
Both threads have access to all global variables by name. I can create an additional buffer for incoming/outgoing data after each machine cycle. However, if the client thread crashes the application, the machine thread terminates too. It's also more difficult to implement remote access.
Approach 2. Shared memory
In old FSMLabs it is recommended to use shared global memory between processes. The modern PREEMPT_RT wiki page recommends using mmap() for sharing data between processes, but the same article discourages mmap() because of page faults.
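A common way to address that page-fault concern is to lock and pre-touch the mapping before the realtime loop starts. The sketch below illustrates this, assuming a made-up shared-memory name "/vm_io" and omitting error checks; it is a sketch of the general technique, not the wiki's own example.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define IO_SIZE (64 * 1024)   /* the 64 KB I/O region */

    int main(void)
    {
        /* Create or open a named shared-memory object; the client process
           opens the same name. "/vm_io" is a made-up name. */
        int fd = shm_open("/vm_io", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, IO_SIZE);

        unsigned char *io = mmap(NULL, IO_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);

        /* Lock all current and future pages into RAM, then touch the whole
           mapping once so the realtime loop never page-faults on it. */
        mlockall(MCL_CURRENT | MCL_FUTURE);
        memset(io, 0, IO_SIZE);

        /* ... realtime loop reads and writes io[] here ... */

        munmap(io, IO_SIZE);
        close(fd);
        return 0;
    }

(On older glibc, link with -lrt for shm_open.)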
Approach 3. Named pipes
It's a more flexible way to communicate between processes. However, I'm new to programming on Linux. We want to share memory between the machine and the client, but it should also provide a way to load a new program (file path or program code), stop/start the machine, etc. Old FSMLabs RT-Linux implemented its own FIFO queues (named pipes); modern PREEMPT_RT doesn't. Can using named pipes break realtime behavior? How do I do it properly? Should I read data with the O_NONBLOCK flag, or create another thread for reading/writing data from/to the pipe?
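For what it's worth, here is a sketch of the O_NONBLOCK option: the realtime loop polls a FIFO once per cycle and treats "no data" as the normal case, so the read never blocks. The path /tmp/vm_ctl and the one-byte command protocol are invented for the illustration.

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* "/tmp/vm_ctl" is a made-up path; the client writes commands to it. */
        mkfifo("/tmp/vm_ctl", 0600);
        int fd = open("/tmp/vm_ctl", O_RDONLY | O_NONBLOCK);

        for (;;) {
            char cmd;
            ssize_t n = read(fd, &cmd, 1);
            if (n == 1) {
                /* Handle a one-byte command: e.g. 's' = stop, 'r' = run. */
            } else if (n == -1 && errno != EAGAIN) {
                break;   /* a real error; EAGAIN just means "no data yet" */
            }
            /* n == 0 means no writer is connected yet; keep cycling. */

            /* ... one virtual-machine cycle goes here; the read above
               never blocks, so the cycle time stays bounded ... */
        }

        close(fd);
        return 0;
    }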
Do you know other ways to communicate between processes where one process must be realtime? Maybe I need only threads. However, consider a scenario in which more clients are connected to the virtual machine process.
For exchanging data between processes executing on the same host operating system you can also use UNIX domain sockets.

Can I take advantage of multi-core in a multi-threaded application that I develop?

If I am writing a multi-threaded C application on Linux (using pthreads), can I take advantage of a multi-core processor?
I mean, what should an application programmer do to take advantage of a multi-core processor? Or does the OS alone do so with its various scheduling algorithms?
You don't need to do anything. Create as many threads as you want and the OS will schedule them, together with the threads from all the other processes, over every available core.
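A minimal sketch of that point, assuming a made-up CPU-bound job: nothing core-specific is requested, yet the kernel is free to run the threads on different cores in parallel.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* A made-up CPU-bound job; the scheduler decides which core runs it. */
    static void *work(void *arg)
    {
        volatile unsigned long long x = 0;
        for (unsigned long long i = 0; i < 100000000ULL; ++i)
            x += i;
        printf("thread %ld finished\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, work, (void *)i);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);
        return 0;
    }

(Build with gcc -pthread and watch a CPU monitor to see the threads spread out.)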
"Take advantage of multi-core" could be understood to mean "utilize multi-core."
Or it could mean "gaining a qualitative advantage from the utilization of multi-core."
Anyone can do the former. They often end up with software that runs slower than if it were single-threaded.
The latter is an entirely different proposition. It requires writing the software so that use of the computing resources shared by all cores (bus locking, RAM, and the L3 cache) is economized, and so that as much computing as possible is done within the individual cores and their L1 caches. The L2 cache is usually shared by two cores, so it falls somewhere between the two categories: it is a shared resource, but one shared by just two cores, and it is much faster than the resources shared by all cores.
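One concrete instance of such an economization - my illustration, not the answerer's - is avoiding false sharing by giving each thread's hot data its own cache line, so one core's writes don't keep invalidating a line another core is using. The 64-byte line size is an assumption typical of x86.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS   4
    #define CACHE_LINE 64          /* assumed cache-line size */

    /* _Alignas (C11) pads each counter out to its own cache line, so
       incrementing one never invalidates the line holding another's. */
    struct padded_counter {
        _Alignas(CACHE_LINE) unsigned long long value;
    };

    static struct padded_counter counters[NTHREADS];

    static void *bump(void *arg)
    {
        long i = (long)arg;
        for (unsigned long long n = 0; n < 100000000ULL; ++n)
            counters[i].value++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, bump, (void *)i);
        for (int i = 0; i < NTHREADS; ++i) {
            pthread_join(t[i], NULL);
            printf("counter %d = %llu\n", i, counters[i].value);
        }
        return 0;
    }

With the alignment removed, all four counters can land on one cache line and the same program typically runs several times slower on multiple cores.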
This is at the implementation level, writing and testing the code.
The decisions made at earlier stages - specifically the system's software architecture phase - are usually much more important to the system's long-term quality and performance.
Some posts: 1 2 3. There are many more.

pthread on-wakeup execution

How can I make my pthreads execute a function each time they are rescheduled by the kernel?
I need to identify which physical CPU/socket (not which logical core) my thread is being scheduled on, and I cannot afford to do this all the time.
Can the wakeup routine be hooked somehow to make the necessary updates to TLS only when the thread is actually being rescheduled?
As to why I need this: I have code which executes atomic memory operations (AMOs) approximately every 70 ns per thread, which is fine if the address is not cached on another socket; deploying the same code on two sockets gives a 15x performance penalty because of frequent cache invalidations. I intend to allocate memory especially for this, shared only among threads running under the same L3 cache. So I need to identify which socket I am running on and address the correct memory block. I could obviously call sched_getcpu and compare the result to the physical CPU ID in /proc/cpuinfo, but this is a rather big overhead. I cannot afford to allocate thread-private memory for each thread though; that is too expensive.
From what I have read in Linux Kernel Development, Third Edition, the kernel provides no service or interface for what you want. Using pthread_setaffinity (as suggested above by @osgx, or, in more recent Linux kernel implementations, pthread_setaffinity_np) or caching a TLS key per CPU socket at the beginning (as suggested above by @caf) are perhaps the best methods to use in that direction.
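Absent a wakeup hook, one workaround in the spirit of the TLS suggestion is to cache the socket ID in thread-local storage and refresh it only every N calls, accepting that a migration may go unnoticed briefly. The refresh interval and the cpu_to_socket() mapping below are assumptions; in practice the mapping would be built once from /proc/cpuinfo or sysfs.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    #define REFRESH_INTERVAL 4096        /* assumed tuning knob */

    /* Hypothetical mapping from logical CPU to physical socket; a real
       program would build this table from /proc/cpuinfo at startup. */
    static int cpu_to_socket(int cpu) { return cpu / 8; }

    static __thread int      cached_socket = -1;
    static __thread unsigned calls_since_refresh;

    /* Cheap on most calls: sched_getcpu() runs only on first use or
       once every REFRESH_INTERVAL calls. */
    static int current_socket(void)
    {
        if (cached_socket < 0 || ++calls_since_refresh >= REFRESH_INTERVAL) {
            cached_socket = cpu_to_socket(sched_getcpu());
            calls_since_refresh = 0;
        }
        return cached_socket;
    }

    int main(void)
    {
        printf("running on socket %d\n", current_socket());
        return 0;
    }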
