Preventing application from taking resources from other applications - c

I have an application that does a lot of CPU and I/O heavy work. While this work is being done, it should not interfere with other applications.
For example, if another application is fully utilizing the disk my application is reading from, I want my application to throttle its disk access down to very low speeds so as not to interfere with the other application. The same goes for CPU: if another application is encoding video, for example, I don't want to steal many cycles from it.
I've tried putting my threads in background mode, but I'm finding that these threads won't utilize even unused resources. With no other applications running and almost no CPU or disk usage, an operation that takes 1 second on a normal-priority thread takes up to 5 minutes on a background thread.
Does winapi provide anything to help me with this?
While a background thread calculates the SHA-1 hash of an 800 MB file, my application barely utilizes the disk; at normal priority, it sustains reads of 20 MB/s and above.
EDIT: To clarify, by 'background thread' I mean a thread with its priority set to background mode, not a C# background thread:
SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);

Your code is fine – THREAD_MODE_BACKGROUND_BEGIN is how you signal to the system that this thread is a background thread and that its I/O is to be treated as low-priority. You can achieve the same effect process-wide with SetPriorityClass and PROCESS_MODE_BACKGROUND_BEGIN. You can even control things at the file handle granularity with SetFileInformationByHandle and FileIoPriorityHintInfo.
So you are already doing what you intend to do. But you are finding that your task is not being given any resources. That can only mean that at least one other thread is running at higher-than-background priority and is using those resources.
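For concreteness, here is a minimal sketch of all three levels; the function wrapper and the hFile parameter are illustrative, not from the original post:

#include <windows.h>

static void do_background_work(HANDLE hFile)   /* hFile: e.g. the file being hashed */
{
    /* Thread level: CPU, page and I/O priority all drop for this thread. */
    SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_BEGIN);
    /* ... CPU- and I/O-heavy work ... */
    SetThreadPriority(GetCurrentThread(), THREAD_MODE_BACKGROUND_END);

    /* Process-wide equivalent of the pair above. */
    SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_BEGIN);
    SetPriorityClass(GetCurrentProcess(), PROCESS_MODE_BACKGROUND_END);

    /* Per-handle: mark only this file's I/O as low priority. */
    FILE_IO_PRIORITY_HINT_INFO hint = { IoPriorityHintLow };
    SetFileInformationByHandle(hFile, FileIoPriorityHintInfo, &hint, sizeof(hint));
}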

For CPU utilization throttling:
SetPriorityClass
SetThreadPriority
Just don't use THREAD_MODE_BACKGROUND_BEGIN; anything below normal (with a negative priority delta) should be fine. Windows schedules higher-priority threads to run first. Choose THREAD_PRIORITY_IDLE if you want even dynamic priority boosts to almost never be enough to interfere with normal-priority threads.
For information on I/O priority, see Windows' I/O prioritization documentation.
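A minimal sketch of that approach; the function wrapper is illustrative:

#include <windows.h>

static void throttle_cpu_only(void)
{
    /* Run the whole process below normal priority... */
    SetPriorityClass(GetCurrentProcess(), BELOW_NORMAL_PRIORITY_CLASS);

    /* ...and drop the worker thread further still: it runs only when no
       normal-priority thread wants the CPU, yet it will use idle cycles. */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_IDLE);
}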

Related

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread reads from a file at roughly 1,000,000 lines/second and hands the data it reads off to either two or four "consumer" threads, which do a bit of work on it and then stick it into a database. While they are consuming, it is busy reading the next line.
So both producer and consumers need some means of synchronisation that works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudocode terms:
Producer thread
while(something in file)
{
    read a line
    populate 1/2 of data double buffer
    wait for consumers to idle
    set some key data
    set memory fence
    swap buffers
}
And the consumer threads likewise
while(not told to die)
{
    wait for key data change event
    consume data
}
At both sides the "wait" loop is coded:
while(waiting)
{
    _mm_pause();  /* Intel says this is a good hint to the processor that this is a spin wait */
    if(++iterations > 1000) yield_thread();  /* Sleep(0) on Windows, pthread_yield() on Linux */
}
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy-wait loops, and the ratio of spinning to useful work done is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections, this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their percentage of total CPU is down at the noise level ... or at least that is what the profiler is saying. They must be doing something, otherwise I wouldn't see any speed-up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean: just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution; the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process-, thread- and interrupt-safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads, and both inside and outside of interrupt handlers). They never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, and cannot fail (there are no error cases to handle). They perform and scale literally orders of magnitude better than locking data structures. liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
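For the question's single-producer/single-consumer setup, here is a minimal C11 sketch of such a queue; the names (spsc_queue, QCAP) are mine, and this sketches the idea rather than liblfds itself:

#include <stdatomic.h>
#include <stddef.h>

#define QCAP 1024   /* capacity, must be a power of two */

typedef struct {
    void *slot[QCAP];
    _Atomic size_t head;   /* written only by the producer */
    _Atomic size_t tail;   /* written only by the consumer */
} spsc_queue;

/* Producer side: returns 0 if the queue is full (caller may then yield). */
static int spsc_push(spsc_queue *q, void *item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QCAP)
        return 0;                               /* full */
    q->slot[head & (QCAP - 1)] = item;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns NULL if the queue is empty. */
static void *spsc_pop(spsc_queue *q)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail == head)
        return NULL;                            /* empty */
    void *item = q->slot[tail & (QCAP - 1)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return item;
}

Neither side ever blocks the other: the producer only reads tail, the consumer only reads head, and each index is published with a release store after the slot has been written or consumed.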
Thank you to all who commented above; the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (a 1000-entry circular buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer as in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond; well, that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
Net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file can be processed in a threaded manner, that suggests the threaded part of the process is around 15% or more faster.

Set CPU usage or manipulate other system resource in C

I have a specific application to write in C. Is it possible to programmatically set the CPU usage of a process? I want my process to use e.g. 20% of the CPU for a few seconds and then go back to regular usage. while(1) takes 100% CPU, so that's not the best idea for me. Are there other ideas for manipulating system resources, and functions that can provide this? I have already done memory allocation manipulations, but I need other ideas for manipulating system resources.
Thanks!
Depending on the operating system, you may be able to control your application's priority.
Also, a function equivalent to Sleep() reduces CPU load, as it causes your application to relinquish CPU cycles to other running programs.
Have you ever tried to answer a question that became more and more complicated once you dug into it?
What you do depends upon what you are trying to accomplish. Do you want to utilize "20% by specific (mine) process for few seconds and then back to regular usage"? Or do you want to utilize 20% of the entire processor? Over what interval do you want to use 20%? Averaged over 5 sec? 500 msec? 10 msec?
Using 20% within your own process is pretty easy, as long as you don't need to do any real work and want 20% as an average over a reasonably long interval, say 1 sec.
for( i=0; i<INTERVAL_CNT; i++ )   /* one duty-cycle interval per iteration */
{
    for( j=0; j<INTERVAL_CNT*(PERCENT/100.0); j++ )   /* note 100.0: integer division would truncate PERCENT/100 to 0 */
    {
        /* some work the compiler won't optimize away */
    }
    sleep( INTERVAL_CNT*(1-(PERCENT/100.0)) );   /* idle for the remaining (100-PERCENT)% */
}
Adjusting this for doing real work is more difficult. Note the comment about the compiler doing optimization. Optimizing compilers are pretty smart and will identify and remove code that does nothing useful. For example, if you declare myVar local to a certain scope and do nothing but myVar++, never reading it, the compiler will remove it to make your app run faster.
If you want a more continuous load (read that as a load of 20% at any sampling point, versus a square wave with a certain duty cycle), it's going to be complicated. You might be able to do this with some experimentation by launching multiple CPU-consuming threads. Having multiple threads with offset duty cycles should give you a smoother load.
20% of the entire processor is even more complicated, since you need to account for multiple factors such as other processes executing, process priority, and multiple CPUs in the processor. I'm not going to get into any detail, but you might be able to do this by simultaneously running multiple heavyweight processes with offset duty cycles, along with a master thread that samples the processor load and dynamically adjusts the heavyweight processes through a set of shared variables.
Let me know if you want me to confuse the matter even further.

OpenGL Multithreading slower than using a single thread

I'm using Windows 7 and using VC++ 2010 and this is a 32 bit application
I am trying to get my renderer to work multithreaded but as it turns out I made it slower than without using multiple threads.
I want it to have the main thread adding rendering commands to a list, and a worker thread that does the rendering of these commands.
This all does happen, and it draws to the screen fine, but I get less fps when doing so...
I used the benchmark tool in Fraps to get this data:
Time is the time it was benchmarked for, in this case 30 seconds.
Min, max, avg are all FPS values.
Without multithreading:
Frames, Time (ms), Min, Max, Avg
28100, 30000, 861, 1025, 936.667
With multithreading:
Frames, Time (ms), Min, Max, Avg
21483, 30000, 565, 755, 716.100
Here is some pseudocode (with the relevant event function calls):
Main Thread:
Add render commands to queue
ResetEvent (renderCompletedEvent);
SetEvent (renderCommandsEvent);
WaitForSingleObject (renderCompletedEvent, INFINITE);
Render Thread:
WaitForSingleObject (renderCommandsEvent, INFINITE);
Process commands
SetEvent (renderCompletedEvent);
ResetEvent (renderCommandsEvent);
Why would you expect this to be faster?
Only one thread is ever doing anything: you create the commands in one thread, then signal the other and wait for it to finish, which takes just as long as doing it in the first thread, only with more overhead.
To take advantage of multithreading you need to ensure that both threads are doing something at the same time.
I am no OpenGL expert, but in general it is important to realize that threads are not necessarily used to speed things up; they are often used to guarantee that some subsystem stays responsive, at a cost to overall speed. For example, one might keep a GUI thread and a networking thread to ensure that the GUI and networking are responsive. That actually comes at a performance cost to the main thread: the CPU gives 1/3 of its time to the main thread, 1/3 to the networking thread and 1/3 to the GUI thread, even if there are no GUI events to handle and nothing going in or out of the network. Thus whatever the main thread is doing gets only 1/3 of the CPU time it would get in a non-multithreaded situation. The upside is that if a lot of data starts arriving over the network, there is always CPU time reserved to handle it (which can be bad if there isn't, as the networking buffer can fill up and additional data starts being dropped or overwritten).
The possible exception is when multiple threads run on different cores. Even then, be careful: cores can share the same caches, so if two cores keep invalidating each other's caches, performance could drop dramatically rather than improve. If the cores share some resource for moving data to and from the GPU, or some other limiting resource, this again could cause performance losses, not gains.
In short, threading on a single-CPU system is always about the responsiveness of a subsystem, not performance. There are possible performance gains when different threads run on multiple cores (which Windows doesn't always seem to do by default, but can be forced into doing). However, there are potential issues with doing this when those cores share some resource that could hurt, not help, performance, e.g. shared cache space or some shared GPU-related resource in your context.
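One way to get both threads busy at once, sketched below in the question's Win32-event style: keep two command queues so the main thread records frame N+1 while the render thread draws frame N. The CommandQueue type and the Fill/Draw helpers are hypothetical placeholders, and renderIdleEvent is assumed to be an auto-reset event created initially signaled:

#include <windows.h>

/* Hypothetical command-queue type and helpers; placeholders, not from the post. */
typedef struct { int count; /* ... recorded GL commands ... */ } CommandQueue;
void FillCommandQueue(CommandQueue *q);   /* main thread records commands */
void DrawCommandQueue(CommandQueue *q);   /* render thread executes them */

static CommandQueue queue[2];
static volatile LONG renderIndex;         /* buffer the renderer should draw */
static HANDLE renderCommandsEvent;        /* auto-reset, initially unsignaled */
static HANDLE renderIdleEvent;            /* auto-reset, created SIGNALED */

DWORD WINAPI MainLoop(LPVOID unused)
{
    int build = 0;
    for (;;)
    {
        FillCommandQueue(&queue[build]);                 /* record frame N+1 */
        WaitForSingleObject(renderIdleEvent, INFINITE);  /* renderer finished frame N */
        renderIndex = build;                             /* hand over the filled buffer */
        SetEvent(renderCommandsEvent);                   /* wake the renderer... */
        build ^= 1;                                      /* ...and keep recording */
    }
}

DWORD WINAPI RenderLoop(LPVOID unused)
{
    for (;;)
    {
        WaitForSingleObject(renderCommandsEvent, INFINITE);
        DrawCommandQueue(&queue[renderIndex]);           /* draw frame N */
        SetEvent(renderIdleEvent);                       /* let the main thread swap */
    }
}

With auto-reset events the ResetEvent calls from the question are unnecessary, and recording and drawing now genuinely overlap.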

Mutex vs busy wait for tcp io

I do not care about being a CPU hog, as I have one thread assigned to each core and the system threads restricted to their own set. My understanding is that a mutex is of use when other tasks need to run; that is not important here, so I am considering having a consumer thread loop on an address in memory, waiting for its value to become non-zero, meaning that the single producer thread (which loops on recv() with TCP_NONBLOCK set) has just deposited data there.
Is my implementation a smart one given my circumstances, or should I be using a mutex or a custom interrupt, even though no other tasks will run?
In addition to the points by @ugoren and comments by others:
Even if you have a valid use case for busy-waiting and burning a core, which is admittedly rare, you need to:
Protect the data shared between threads. This is where locks come into play: you need mutual exclusion when accessing any complex shared data structure. People tend to look into lock-free algorithms here, but these are far from obvious, are error-prone, and are still considered deep black magic. Don't even try them until you have a solid understanding of concurrency.
Notify threads about changed state. This is where you'd use condition variables or monitors. There are other methods too, eventfd(2) on Linux, for example.
Here are some links to show that it's much harder than you seem to think:
Memory Ordering
Out-of-order execution
ABA problem
Cache coherence
Busy-wait can give you a lower latency and somewhat better performance in some cases.
Letting other threads use the CPU is the obvious reason not to do it, but there are others:
You consume more power. An idle CPU goes into a low-power state, reducing consumption very significantly. Power consumption is a major issue in data centers, and any serious application must not waste power.
If your code runs in a virtual machine (and everything is being virtualized these days), your machine competes for CPU with others. Consuming 100% CPU leaves less for the others, and may cause the hypervisor to give your machine less CPU when it's really needed.
You should always stick to mainstream methods, unless there's a good reason not to. In this case, the mainstream is to use select or poll (or epoll). This lets you do other stuff while waiting, if you want, and doesn't waste CPU time. Is the performance difference large enough to justify busy wait?
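A minimal sketch of that mainstream approach, assuming sock is the question's non-blocking TCP socket:

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Block until data arrives instead of spinning; no CPU is burned while idle. */
static void consume_socket(int sock)
{
    char buf[4096];
    struct pollfd pfd = { .fd = sock, .events = POLLIN };

    for (;;)
    {
        int ready = poll(&pfd, 1, -1);     /* -1 timeout: sleep until readable */
        if (ready > 0 && (pfd.revents & POLLIN))
        {
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n <= 0)
                break;                     /* peer closed the connection, or error */
            /* ... hand buf[0..n) to the consumer ... */
        }
    }
}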

Overhead of Spin Loop in terms of cache coherence

Say a thread on one core is spinning on a variable that will be updated by a thread running on another core. My question is: what is the overhead at the cache level? Will the waiting thread cache the variable and therefore cause no traffic on the bus until the writing thread writes to that variable?
How can this overhead be reduced? Does the x86 pause instruction help?
I believe all modern x86 CPUs use the MESI protocol. So the spinning "reader" thread will likely have a cached copy of the data in either the "exclusive" or "shared" state, generating no memory bus traffic while you spin.
It is only when the other core writes to the location that it will have to perform cross-core communication.
[update]
A "spinlock" like this is only a good idea if you will not be spinning for very long. If it may be a while before the variable gets updated, use a mutex + condition variable instead, which will put your thread to sleep so that it adds no overhead while it waits.
(Incidentally, I suspect a lot of people -- including me -- are wondering "what are you actually trying to do?")
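A minimal pthreads sketch of the mutex + condition variable alternative; lock, cond and flag are illustrative names:

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int flag = 0;

static void waiter(void)                  /* the would-be spinning thread */
{
    pthread_mutex_lock(&lock);
    while (!flag)                         /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* sleeps; no CPU use or bus traffic */
    pthread_mutex_unlock(&lock);
}

static void updater(void)                 /* the writing thread */
{
    pthread_mutex_lock(&lock);
    flag = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}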
If you spin-lock for short intervals you are usually fine. However, there is a timer interrupt on Linux (and I assume something similar on other OSes), so if you spin-lock for 10 ms or close to it you will see a cache disturbance.
I have heard it's possible to modify the Linux kernel to prevent all interrupts on specific cores, which makes this disturbance go away, but I don't know what is involved in doing so.
In the case of two threads the overhead may be negligible, but either way it is a good idea to run a simple benchmark; for instance, if you implement spinlocks, measure how much time each thread spends spinning.
This effect on the cache is called cache-line bouncing.
I have tested this extensively. The overhead is generally incurred by the bus-locking component of the spinlock, usually an "xchg reg,mem" instruction or some variant of it. Since that particular overhead cannot be avoided, your options are to economize on the frequency with which you invoke the spinlock and to perform the absolute minimum amount of work necessary, once the lock is in place, before releasing it.
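Combining both answers' points, here is a C11 sketch of a test-and-test-and-set spinlock: the inner loop spins on a plain load (the line stays cached in the shared state under MESI, so no bus traffic), and the bus-locking exchange runs only when the lock looks free. The names are illustrative:

#include <immintrin.h>   /* _mm_pause() */
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock;   /* zero-initialized = unlocked */

static void spin_lock(spinlock *l)
{
    for (;;)
    {
        /* Read-only spin: stays in this core's cache, no bus locking. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            _mm_pause();                  /* the x86 pause hint from the question */
        /* Attempt the lock; this exchange is the bus-locking xchg mentioned above. */
        if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
            return;
    }
}

static void spin_unlock(spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}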
