OpenGL multithreading slower than using a single thread

I'm using Windows 7 with VC++ 2010; this is a 32-bit application.
I am trying to get my renderer to work multithreaded, but as it turns out I made it slower than without using multiple threads.
I want the main thread to add rendering commands to a list, and a worker thread to do the rendering of those commands.
This all does happen, and it draws to the screen fine, but I get a lower frame rate when doing so...
I used the benchmark tool in Fraps to get this data:
Time is the time it was benchmarked for, in this case 30 seconds.
Min, max, avg are all FPS values.
With multithreading:
Frames, Time (ms), Min, Max, Avg
21483, 30000, 565, 755, 716.100
Without multithreading:
Frames, Time (ms), Min, Max, Avg
28100, 30000, 861, 1025, 936.667
Here is some pseudocode (with the relevant event function calls):
Main Thread:
Add render comands to queue
ResetEvent (renderCompletedEvent);
SetEvent (renderCommandsEvent);
WaitForSingleObject (renderCompletedEvent, INFINITE);
Render Thread:
WaitForSingleObject (renderCommandsEvent, INFINITE);
Process commands
SetEvent (renderCompletedEvent);
ResetEvent (renderCommandsEvent);

Why would you expect this to be faster?
Only one thread is ever doing anything: you create the commands in one thread, then signal the other and wait for it to finish, which takes just as long as doing the work in the first thread, only with more overhead.
To take advantage of multithreading you need to ensure that both threads are doing something at the same time, as in the sketch below.
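For illustration, here is one way to restructure the handoff so the two threads overlap: the main thread records commands for the next frame while the render thread draws the previous one. This is only a minimal sketch with Win32 events and a double-buffered queue; CommandBuffer, BuildCommands and DrawCommands are hypothetical placeholders, not from the question.

#include <windows.h>

typedef struct { int count; /* ... recorded commands ... */ } CommandBuffer; /* placeholder type */

static CommandBuffer buffers[2];
static CommandBuffer *submittedBuffer;
static HANDLE renderCommandsEvent;   /* manual-reset, created unsignaled */
static HANDLE renderCompletedEvent;  /* manual-reset, created signaled   */

static void BuildCommands(CommandBuffer *cb) { cb->count = 0; /* record the frame here */ }
static void DrawCommands(const CommandBuffer *cb) { (void)cb; /* issue the GL calls here */ }

static DWORD WINAPI RenderThread(LPVOID unused)
{
    (void)unused;
    for (;;) {
        WaitForSingleObject(renderCommandsEvent, INFINITE);
        ResetEvent(renderCommandsEvent);
        DrawCommands(submittedBuffer);
        SetEvent(renderCompletedEvent);
    }
}

static void MainLoop(void)
{
    int write = 0;
    renderCommandsEvent  = CreateEvent(NULL, TRUE, FALSE, NULL);
    renderCompletedEvent = CreateEvent(NULL, TRUE, TRUE,  NULL);
    CreateThread(NULL, 0, RenderThread, NULL, 0, NULL);
    for (;;) {
        BuildCommands(&buffers[write]);                      /* overlaps the previous frame's draw */
        WaitForSingleObject(renderCompletedEvent, INFINITE); /* wait only AFTER recording the next frame */
        ResetEvent(renderCompletedEvent);
        submittedBuffer = &buffers[write];                   /* safe: the render thread is idle here */
        SetEvent(renderCommandsEvent);                       /* hand off and immediately keep going */
        write ^= 1;
    }
}

Note that the OpenGL context must be current on the thread that issues the GL calls, i.e. the render thread.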

I am no OpenGL expert, but in general it is important to realize that threads are often not about speeding things up; they are about guaranteeing that some subsystem stays responsive, at a cost to overall speed. For example, one might keep a GUI thread and a networking thread to ensure that the GUI and networking stay responsive, and that comes at a performance cost to the main thread: the CPU divides its time among the competing threads, so the main thread gets less time than it would in a single-threaded design. The upside is that if a lot of data starts arriving over the network, there is always CPU time available to handle it (if there isn't, the networking buffer can fill up and additional data starts being dropped or overwritten).
The possible exception is when multiple threads run on different cores. Even then, be careful: cores can share the same caches, so if two cores keep invalidating each other's caches, performance can drop dramatically rather than improve. If the cores share some resource for moving data to and from the GPU, or some other limiting resource, that too can cause performance losses rather than gains.
In short, threading on a single-core system is about the responsiveness of a subsystem, not performance. There are possible performance gains when different threads run on multiple cores, but there are potential issues when those cores share some resource, e.g. cache space or some GPU-related resource in your context.

Related

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread reads from a file at roughly 1,000,000 lines/second and hands the data off to either two or four "consumer" threads, which do a bit of work on it and then stick it into a database. While they are consuming, it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
While(something in file)
{
read a line
populate 1/2 of data double buffer
wait for consumers to idle
set some key data
set memory fence
swap buffers
}
And the consumer threads likewise
while(not told to die)
{
wait for key data change event
consume data
}
At both sides the "wait" loop is coded:
int spins = 0;
while (waiting)
{
    _mm_pause();        /* Intel say this is a good hint to the processor that this is a spin wait */
    if (++spins > 1000)
        yield_thread(); /* Sleep(0) on Windows, sched_yield() on Linux */
}
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy-wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections, this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their percentage of total CPU time is down at the noise level ... or at least that is what the profiler is saying. They must be doing something, otherwise I wouldn't see any speed-up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process, thread and interrupt safe: the same data structure instance can be used concurrently and simultaneously across cores, processes, threads, and both inside and outside of interrupt handlers. They never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, and cannot fail (there are no error cases to handle). They perform and scale literally orders of magnitude better than locking data structures. liblfds itself (as of release 7.0.0) performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
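As a rough illustration of the idea (a sketch using C11 atomics, not the liblfds API), a single-producer/single-consumer ring buffer can be made lock-free; the capacity must be a power of two:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 1024                             /* must be a power of two */

struct spsc {
    void *slot[QCAP];
    _Atomic size_t head;                      /* advanced by the consumer */
    _Atomic size_t tail;                      /* advanced by the producer */
};

bool spsc_push(struct spsc *q, void *item)    /* producer thread only */
{
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false;                         /* full: caller may yield and retry */
    q->slot[t & (QCAP - 1)] = item;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

bool spsc_pop(struct spsc *q, void **item)    /* consumer thread only */
{
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;                         /* empty */
    *item = q->slot[h & (QCAP - 1)];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

With such a queue the producer only ever has to back off when the ring is full, which is essentially the fix described below: on a full ring it can simply yield and try again.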
Thank you to all who commented above; the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (a 1000-entry rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer as in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well, that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin-waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too, since they have a more even workload.
The net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file can be processed in a threaded manner, that suggests the threaded part of the process is around 15% or more faster.

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (the number of cores) in our computer, and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a first overview of how threads are managed by the system. Nowadays all processors are actually multi-core, and often multi-threaded per core, but for the sake of simplicity let's first imagine a single-core processor with a single thread. This is physically limited to performing only a single task at a time, yet we are still capable of running multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other, giving the illusion of multitasking. This process of changing from one task to the other is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, data can be saved in registers, cache, RAM, etc., and as the technology advances, better-performing solutions have appeared. When the task is resumed, all its data is fetched and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (the number of cores) in our computer, and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, switching between one task and another requires a context switch. That involves storing and fetching data, and those operations are pure overhead for your computation; they give you no direct advantage. Having too many tasks therefore requires a large amount of context switching, which means a lot of computational time wasted, so in the end your job might run slower than with fewer tasks.
Also, since you tagged this question with pthreads, it is also necessary to check that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will actually run on multiple cores!
In your particular application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computation. This sort of task also runs well on a GPU, since the operations have no data correlation and the concurrency is provided in hardware (modern GPUs have thousands of computing cores!).
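For the CPU-only version, a reasonable starting point is one worker per core, with the grid rows split evenly across them. Below is a minimal pthreads sketch; ROWS, COLS and process_cell() are made-up stand-ins for the real grid and calculation.

#include <pthread.h>
#include <unistd.h>

#define ROWS 1024
#define COLS 1024
#define MAX_THREADS 64

static double grid[ROWS][COLS];

static double process_cell(double x) { return x * 0.5 + 1.0; }  /* stand-in calculation */

struct span { int first, last; };             /* half-open row range [first, last) */

static void *worker(void *arg)
{
    const struct span *s = arg;
    for (int r = s->first; r < s->last; r++)
        for (int c = 0; c < COLS; c++)
            grid[r][c] = process_cell(grid[r][c]);
    return NULL;
}

int main(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);   /* one thread per logical core */
    pthread_t tid[MAX_THREADS];
    struct span spans[MAX_THREADS];
    if (n < 1) n = 1;
    if (n > MAX_THREADS) n = MAX_THREADS;
    for (long i = 0; i < n; i++) {
        spans[i].first = (int)(i * ROWS / n);
        spans[i].last  = (int)((i + 1) * ROWS / n);
        pthread_create(&tid[i], NULL, worker, &spans[i]);
    }
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Beyond one thread per core, extra threads only add context-switching overhead for a CPU-bound job like this.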

Memory Sharing C - Performance

I'm playing around with process creation and scheduling in Linux. As part of that, I have a number of concurrent threads computing a basic hash function over a shared in-memory buffer. Each thread is created using clone, and I'm playing around with the various flags and stack sizes to measure process creation time, etc. (hence the use of clone).
My experiments are run on a 2-core i7 with hyperthreading enabled.
In this context, I find that, with all flags enabled (CLONE_VM, CLONE_SIGHAND, CLONE_FILES, CLONE_FS), the time it takes to compute n hashes doubles when I run 4 processes (i.e. one per logical CPU) compared to when I run 2 processes. My understanding is that hyperthreading helps when a process is waiting on I/O, so for a CPU-bound process it has almost no effect. Is this correct?
The second observation is that I see pretty high variance (up to 2 seconds) when computing these hash functions (I compute each hash 1,000,000 times). No other process is running on the system (though there are some background threads). I'm struggling to understand why there is so much variance. Is it strictly due to how the scheduler happens to schedule the processes? I understand that without using sched_setaffinity there is no guarantee that they will be placed on different CPUs, so can the variance be explained simply by them landing on the same CPU?
Are there any other ways to improve reliability without relying on sched_setaffinity?
The third observation is that, even when I run with just 2 threads (when each should be scheduled on a different CPU), performance goes down (not by much, but a little bit). I'm struggling to understand why that is the case. It's the same read-only buffer, and it fits in the cache. Is there some contention in accessing the page table? Would it then be preferable to create two processes with distinct address spaces and explicitly share the segment, marking it read-only?
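For reference, a minimal clone(2) invocation along the lines described above might look like the sketch below; the hash loop and buffer size are made-up stand-ins, and error handling is reduced to the essentials.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

static unsigned char shared[4096];            /* the shared read-only buffer */

static int worker(void *arg)
{
    const unsigned char *buf = arg;
    volatile unsigned h = 0;                  /* stand-in for the real hash */
    for (long i = 0; i < 1000000L; i++)
        h = h * 31 + buf[i % sizeof shared];
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack) return 1;
    /* CLONE_VM shares the address space; CLONE_SIGHAND requires CLONE_VM */
    int flags = CLONE_VM | CLONE_SIGHAND | CLONE_FILES | CLONE_FS | SIGCHLD;
    /* the stack grows down on x86, so pass the TOP of the allocation */
    pid_t pid = clone(worker, stack + STACK_SIZE, flags, shared);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);                    /* reap the child */
    free(stack);
    return 0;
}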
Different threads still run in the context of one process, so they tend to run on the CPU(s) the process runs on (usually one process runs on one CPU, but that is not guaranteed).
When you run two threads instead of processes, you have the overhead of switching between threads; the more calculations you do, the more switching needs to happen, so it will be slower than the same calculations done in one thread.
Furthermore, if you run the same calculations in different processes, there is an even bigger overhead for switching between processes, but there is more chance that you will run on different CPUs, so in the long run this will probably be faster; less so for short calculations.
Even if you don't think you have other processes running, the OS has a lot to do all the time and switches to its own processes that you aren't always aware of.
All of this stems from the randomness of the switching. Hope I helped a bit.

Preventing application from taking resources from other applications

I have an application that does a lot of CPU and I/O heavy work. While this work is being done, it should not interfere with other applications.
For example, if another application is fully utilizing the disk my application is reading from, I want my application to throttle its disk access down to very low speeds so as not to interfere with the other application. The same goes for CPU: if another application is encoding video, for example, I don't want to steal many cycles from it.
I've tried putting my threads in background mode, but I'm finding that these threads won't utilize unused resources. With no other applications running and almost no CPU or disk usage, an operation that takes 1 second on a normal-priority thread takes up to 5 minutes on a background thread.
Does winapi provide anything to help me with this?
Below is a picture of my application's disk usage while a background thread attempts to calculate the SHA-1 hash of an 800 MB file. As you can see, it's barely utilizing my disk; on normal priority it maintains reads of 20 MB/s and more.
EDIT: To clarify, by 'background thread' I mean a thread with its priority set to background mode, not a C# background thread.
SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);
Your code is fine: THREAD_MODE_BACKGROUND_BEGIN is how you signal to the system that this thread is a background thread and that its I/O is to be treated as low priority. You can achieve the same effect process-wide with SetPriorityClass and PROCESS_MODE_BACKGROUND_BEGIN, and you can even control things at file-handle granularity with SetFileInformationByHandle and FileIoPriorityHintInfo.
So you are already doing what you intend to do, but you are finding that your task is not being given any resources. That can only mean that at least one other thread is running at a higher-than-background priority and is using the resources.
For cpu utilization throttling:
SetPriorityClass
SetThreadPriority
Just don't use THREAD_MODE_BACKGROUND_BEGIN; anything below normal priority (a negative priority offset) should be fine. Windows schedules higher-priority threads to run first, so choose THREAD_PRIORITY_IDLE if you want even dynamically boosted normal-priority threads to almost always win over yours.
I/O priority can be lowered separately from CPU priority; a sketch of both follows.
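As a sketch combining the two (assuming hFile is the handle your hashing thread reads from):

#include <windows.h>

void lower_priorities(HANDLE hFile)
{
    /* CPU: below normal is usually enough; THREAD_PRIORITY_IDLE yields to almost everything */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);

    /* I/O: a per-handle low-priority hint (Windows Vista and later) */
    FILE_IO_PRIORITY_HINT_INFO hint;
    hint.PriorityHint = IoPriorityHintLow;
    SetFileInformationByHandle(hFile, FileIoPriorityHintInfo, &hint, sizeof hint);
}

This avoids THREAD_MODE_BACKGROUND_BEGIN (and its very aggressive I/O throttling) while still deferring to foreground work.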

Mutex vs busy wait for TCP I/O

I do not care about being a CPU hog, as I have one thread assigned to each core and the system threads confined to their own set. My understanding is that a mutex is of use when other tasks need to run; that is not important here, so I am considering having a consumer thread loop on an address in memory, waiting for its value to become non-zero, meaning the single producer thread (which loops on recv() with TCP_NONBLOCK set) has just deposited data there.
Is my implementation a smart one given my circumstances, or should I be using a mutex or a custom interrupt, even though no other tasks will run?
In addition to the points by @ugoren and the comments by others:
Even if you have a valid use case for busy-waiting and burning a core, which is admittedly rare, you need to:
Protect the data shared between threads. This is where locks come into play: you need mutual exclusion when accessing any complex shared data structure. People tend to look into lock-free algorithms here, but these are far from obvious and error-prone, and are still considered deep black magic. Don't even try them until you have a solid understanding of concurrency.
Notify threads about changed state. This is where you'd use condition variables or monitors (a minimal sketch follows the links below). There are other methods too, for example eventfd(2) on Linux.
Here are some links to show that it's much harder than you seem to think:
Memory Ordering
Out-of-order execution
ABA problem
Cache coherence
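To make the second point concrete, here is a minimal sketch of the blocking alternative to a spin loop: a pthread mutex plus condition variable guarding a ready flag.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

void producer_publish(void)                   /* called after recv() deposits data */
{
    pthread_mutex_lock(&lock);
    data_ready = 1;                           /* state change happens under the lock */
    pthread_cond_signal(&cond);               /* wake one waiting consumer */
    pthread_mutex_unlock(&lock);
}

void consumer_wait(void)
{
    pthread_mutex_lock(&lock);
    while (!data_ready)                       /* the loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);
    data_ready = 0;                           /* consume the event */
    pthread_mutex_unlock(&lock);
}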
Busy-wait can give you a lower latency and somewhat better performance in some cases.
Letting other threads use the CPU is the obvious reason not to do it, but there are others:
You consume more power. An idle CPU goes into a low-power state, reducing consumption very significantly. Power consumption is a major issue in data centers, and any serious application must not waste power.
If your code runs in a virtual machine (and everything is being virtualized these days), your machine competes for CPU with others. Consuming 100% CPU leaves less for the others, and may cause the hypervisor to give your machine less CPU when it's really needed.
You should always stick to mainstream methods unless there's a good reason not to. In this case, the mainstream is to use select or poll (or epoll). These let you do other things while waiting if you want, and they don't waste CPU time. Is the performance difference really large enough to justify a busy wait?
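For example, a receive path that blocks in poll(2) instead of spinning on a non-blocking recv() can be as small as this sketch (sock is assumed to be a connected TCP socket):

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t wait_and_recv(int sock, void *buf, size_t len)
{
    struct pollfd pfd = { .fd = sock, .events = POLLIN };
    if (poll(&pfd, 1, -1) <= 0)               /* block until readable; no timeout */
        return -1;
    return recv(sock, buf, len, 0);           /* will not spin: data is waiting */
}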
