Set CPU usage or manipulate other system resources in C

I have a specific application to write in C. Is there any way to programmatically set the CPU usage of a process? I want to set CPU usage to e.g. 20% for my own process for a few seconds and then go back to regular usage. A while(1) loop takes 100% CPU, so that's not the best idea for me. Any other ideas for manipulating system resources, and functions that can do it? I have already done memory-allocation manipulations, but I need other ideas about manipulating system resources.
Thanks!

What I know is that you may be able to control your application's priority depending on the operating system.
Also, a function equivalent to Sleep() reduces CPU load as it causes your application to relinquish CPU cycles to other running programs.
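As a rough sketch of both ideas on POSIX systems (the nice value and sleep length below are just illustrative, not tuned for any particular load): setpriority() lowers your own scheduling priority so the OS favors other work, and sleeping inside the work loop gives the CPU back voluntarily.

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
    /* lower our own scheduling priority: nice value 10 for the calling process */
    if (setpriority(PRIO_PROCESS, 0, 10) == -1)
        perror("setpriority");

    for (;;) {                 /* endless loop, just for the example */
        /* ... do a slice of work here ... */
        usleep(50 * 1000);     /* sleep 50 ms so other processes get the CPU */
    }
}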

Have you ever tried to answer a question that became more and more complicated once you dug into it?
What you do depends upon what you are trying to accomplish. Do you want to utilize "20% for my own process for a few seconds and then go back to regular usage"? Or do you want to utilize 20% of the entire processor? Over what interval do you want to use 20%? Averaged over 5 sec? 500 msec? 10 msec?
20% of your process is pretty easy as long as you don't need to do any real work and want 20% of the average over a reasonably long interval, say 1 sec.
for( i = 0; i < INTERVAL_CNT; i++ )   //rough sketch: busy for PERCENT of each interval, idle for the rest
{
    for( j = 0; j < INTERVAL_CNT * PERCENT / 100; j++ )   //multiply before dividing so integer math doesn't truncate to 0
    {
        //some work the compiler won't optimize away
    }
    sleep( INTERVAL_CNT * (100 - PERCENT) / 100 );
}
Adjusting this for doing real work is more difficult. Note the comment about the compiler doing optimization. Optimizing compilers are pretty smart and will identify and remove code that does nothing useful. For example, if you use myVar++, declare it local to a certain scope, and never use it, the compiler will remove it to make your app run faster.
If you want a more continuous load (read that as a load of 20% at any sampling point, vs. a square wave with a certain duty cycle), it's going to be complicated. You might be able to do this with some experimentation by launching multiple CPU-consuming threads. Having multiple threads with offset duty cycles should give you a smoother load.
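A minimal sketch of that offset idea (my own untested illustration, not a tuned implementation; compile with -pthread, and the thread count, period, and percentage are arbitrary): each thread runs the same busy/sleep cycle, but starts its busy phase at a different offset within the period so the busy phases are spread out instead of coinciding.

#include <pthread.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS   4
#define PERIOD_US  100000              /* 100 ms duty-cycle period  */
#define PERCENT    20                  /* busy share of each period */

static void *duty_cycle(void *arg)
{
    long id = (long)arg;
    long busy_us = PERIOD_US * PERCENT / 100;
    usleep(id * PERIOD_US / NTHREADS); /* offset: stagger the busy phases */
    for (;;) {
        struct timespec t0, t1;
        volatile long sink = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        do {                           /* crude busy phase of roughly busy_us */
            sink++;
            clock_gettime(CLOCK_MONOTONIC, &t1);
        } while ((t1.tv_sec - t0.tv_sec) * 1000000L +
                 (t1.tv_nsec - t0.tv_nsec) / 1000L < busy_us);
        usleep(PERIOD_US - busy_us);   /* idle phase for the rest of the period */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, duty_cycle, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);    /* threads loop forever; join just blocks here */
    return 0;
}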
20% of the entire processor is even more complicated, since you need to account for multiple factors such as other processes executing, process priority, and multiple CPUs in the processor. I'm not going to get into any detail, but you might be able to do this by running multiple heavyweight processes with offset duty cycles simultaneously, along with a master thread that samples the processor load and dynamically adjusts the heavyweight processes through a set of shared variables.
Let me know if you want me to confuse the matter even further.

Related

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a first overview of how threads are managed by the system. Nowadays all processors are actually multi-core with multiple threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single thread. This is physically limited to performing only a single task at a time, but we are still capable of running multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other, giving the illusion of multitasking. This process of changing from one task to the other is called a context switch.
During a context switch, all the data related to the task that is running is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, data can be saved in registers, cache, RAM, etc. As technology advances, faster-performing solutions have been found. When the task is resumed, the whole data is fetched and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, to switch between one task and another, a context switch is required. To perform this, some storing and fetching of data is required, but these operations are just overhead for your computation and don't directly give you any advantage. So having too many tasks requires a large amount of context switching, which means a lot of computational time wasted! In the end your task might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is also necessary to check that the code is compiled to run on multiple HW cores. Having a multi-core CPU does not guarantee that your multitasking code will run on multiple HW cores!
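As a starting point (a minimal sketch of mine, assuming Linux/POSIX; the worker body is a placeholder), you can ask the OS how many cores are online and spawn one worker per core rather than guessing at 15 or 50 threads:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* ... process one slice of the grid here ... */
    (void)arg;
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* cores currently online */
    if (ncores < 1)
        ncores = 1;

    pthread_t *tid = malloc(ncores * sizeof *tid);
    for (long i = 0; i < ncores; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);

    printf("ran %ld worker threads\n", ncores);
    free(tid);
    return 0;
}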
In your particular case:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computing. This sort of task runs great on a GPU, since the operations have no data dependencies between them and the concurrent computation is performed in hardware (modern GPUs have thousands of computing cores!)

Does a cache write take longer with more caches to invalidate?

Can you please help me find out whether it takes longer for a cache write to finish when there are more cores/caches holding a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on Google, and I have trouble measuring it myself, plus interpreting what I measure, because of the many things that can happen on a modern processor.
(reordering, prefetching, buffering and god knows what)
Details:
My basic process of measuring it is roughly as follows:
write something to the cacheline on processor 0
read it on processors 1 to n.
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate is finished before the final time measurement.
At the moment I am fiddling with an atomic exchange (__sync_fetch_and_add()), but it seems that the number of threads itself matters for the length of this operation (not the number of threads to invalidate), which is probably not what I want to measure.
I also tried a read, then a write, then a memory barrier (__sync_synchronize()). This looks more like what I expect to see, but here I am also not sure whether the write is finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use Linux, gcc and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (going on vacation tomorrow) I'll do some more research and post my code and notes and link it here (In case someone is interested), because the time I can spend on this is limited.
I started writing a very long answer, describing exactly how this works, then realized I probably don't know enough about the exact details. So I'll do a shorter answer...
So, when you write something on one processor, if it's not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system. These will then throw away their copies. If another processor has "dirty" content, it will itself write out the data and ask for an invalidation, in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cacheline may get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this will issue an "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it". Just like in the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".
The best way to optimize the use of processors is to share as little as possible, and in particular, avoid "false sharing". In a benchmark many years ago, there was a structure like this [simplified]:
struct stuff {
    int x[2];
    ... other data ... total data a few cachelines.
} data;

void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}

void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}

int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    // join both threads here before taking the end time
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs just running thread1] ran about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    ... other data ...
} data[2];
and
void thread1()
{
    for( ... big number ...)
        data[0].x++;
}
we got 200% of the performance of the single-thread variant [give or take a few percent]
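A more explicit modern variant of the same fix (my sketch, assuming C11 and a 64-byte cache line, which is common but not universal) is to force each per-thread counter onto its own cache line:

struct per_thread {
    _Alignas(64) int x;       /* C11: each array element gets its own 64-byte line */
};
struct per_thread data[2];    /* data[0] and data[1] no longer share a cache line  */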
Right, so the processor has queues of buffers where write operations are stored when the processor is writing to memory. A memory barrier (mfence, sfence or lfence) instruction is there to ensure that any outstanding read/write, write, or read type operation has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation becomes fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it eventually will end up. So a barrier matters when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding. For example, if we have written a bunch of instructions to video memory and we now want to kick off the run of those instructions, we need to make sure that the 'instruction' writing has actually finished and some other part of the processor isn't still working on finishing it. So use an sfence to make sure that the write has really happened - that may not be a very realistic example, but I think you get the idea.
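Coming back to the question's measurement, here is a minimal sketch of that "write, fence, read the timestamp" idea (mine; it assumes x86 with GCC/clang intrinsics, and the cycle counts will still be noisy, so average over many runs):

#include <stdint.h>
#include <x86intrin.h>          /* __rdtscp(), _mm_mfence() */

volatile uint64_t shared_line;  /* in real code, give this its own cache line */

uint64_t time_one_write(void)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);   /* timestamp, partially serialized            */
    shared_line = 42;               /* the store whose completion we care about   */
    _mm_mfence();                   /* drain the store buffer before re-reading the TSC */
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;                 /* rough cost in TSC cycles */
}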
Cache writes have to get line ownership before dirtying the cache line. Depending on the cache-coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know of are:
Snooping coherence protocol: all caches monitor address lines for cached memory lines, i.e. every memory request has to be broadcast to all CPUs, which does not scale as the number of CPUs increases.
Directory-based coherence protocol: the set of cache lines shared among many CPUs is kept in a directory, so invalidating/gaining ownership is a point-to-point CPU request rather than a broadcast, i.e. more scalable, but latency suffers because the directory is a single point of contention.
Most CPU architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things like cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult the CPU manual to see if this info is available.
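On Linux, for example, these counters are exposed through the perf_event_open() syscall; a minimal sketch (mine, with error handling mostly omitted) that counts hardware cache misses around a region of code:

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... code under test goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}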

Reliable to use cudaEvent_t to measure CPU time?

I have a simple kernel without using multiple events, and I want to create a CPU version of it, which I've done, and measure the difference between them. I don't know if events are strictly meant for CUDA, but I guess my example is simple enough and does not contain anything that would make it a problem to do that. Opinions?
If you are measuring time on the CPU, nothing is better than high-performance counters. For example, in Java you can measure time in the nanosecond range. Events are generally used for the GPU, as the start and stop events are added to the GPU queue, not the CPU one.
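For the CPU side in C on POSIX systems, the usual high-resolution counter is clock_gettime() with CLOCK_MONOTONIC; a minimal sketch (mine; the work being timed is a placeholder, and very old glibc may need -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... run the CPU version of the kernel here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("CPU elapsed: %.3f ms\n", ms);
    return 0;
}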

When benchmarking, what causes a lag between CPU time and "elapsed real time"?

I'm using a built-in benchmarking module for some quick and dirty tests. It gives me:
CPU time
system CPU time (actually I never get any result for this with the code I'm running)
the sum of the user and system CPU times (always the same as the CPU time in my case)
the elapsed real time
I didn't even know I needed all that information.
I just want to compare two pieces of code and see which one takes longer. I know that one piece of code probably does more garbage collection than the other but I'm not sure how much of an impact it's going to have.
Any ideas which metric I should be looking at?
And, most importantly, could someone explain why the "elapsed real time" is always longer than the CPU time - what causes the lag between the two?
There are many things going on in your system other than running your Ruby code. Elapsed time is the total real time taken and should not be used for benchmarking. You want the system and user CPU times since those are the times that your process actually had the CPU.
For example, if your process:
used the CPU for one second running your code; then
used the CPU for one second running OS kernel code; then
was swapped out for seven seconds while another process ran; then
used the CPU for one more second running your code,
you would have seen:
ten seconds elapsed time,
two seconds user time,
one second system time,
three seconds total CPU time.
The three seconds is what you need to worry about, since the ten depends entirely upon the vagaries of the process scheduling.
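If you want to see the same breakdown from C, getrusage() reports the user and system CPU time of your process, while a monotonic clock gives the elapsed real time; a minimal sketch (mine, with the benchmarked code as a placeholder):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

int main(void)
{
    struct timespec w0, w1;
    struct rusage ru;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    /* ... code being benchmarked ... */
    clock_gettime(CLOCK_MONOTONIC, &w1);

    getrusage(RUSAGE_SELF, &ru);
    printf("user    %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system  %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("elapsed %.3f s\n",
           (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9);
    return 0;
}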
The multitasking operating system, stalls while waiting for I/O, and other moments when your code is not actively working.
You don't want to totally discount wall-clock time. Time spent waiting without another thread ready to utilize the CPU cycles may make one piece of code less desirable than another. One set of code may take somewhat more CPU time but employ multi-threading to dominate the other code in the real world. It depends on requirements and specifics. My point is: use all the metrics available to you to make your decision.
Also, as a good practice, if you want to compare two pieces of code you should be running as few extraneous processes as possible.
It may also be the case that the CPU time when your code is executing is not counted.
The extreme example is a real-time system where the timer triggers some activity which is always shorter than a timer tick. Then the CPU time for that activity may never be counted (depending on how the OS does the accounting).

Are high resolution calls to get the system time wrong by the time the function returns?

Given a C process that runs at the highest priority and requests the current time, is the time returned adjusted for the amount of time the code takes to return to user process space? Is it out of date when you get it? As a measurement, taking the execution time of a known number of assembly instructions in a loop and asking for the time before and after it could give you an approximation of the error. I know this must be an issue in scientific applications, though I don't plan to write software involving any super colliders in the near future. I have read a few articles on the subject, but they do not indicate that any correction is made to make the time given to you slightly ahead of the time the system read in. Should I lose sleep over other things?
Yes, they are almost definitely "wrong".
For Windows, the timing functions do not take into account the time it takes to transition back to user mode. Even if this were taken into account, it can't correct for the possibility that your code hits a page fault, gets swapped out, etc. after the function returns but before you capture the return value.
In general, when timing things you should snap a start and an end time around a large number of iterations to weed out these sort of uncertainties.
No, you should not lose sleep over this. No amount of adjustment or other software trickery will yield perfect results on a system with a pipelined processor with multi-layered memory access running a multi-tasking operating system with memory management, devices, interrupt handlers... Not even if your process has the highest priority.
Plus, taking the difference of two such times will cancel out the constant overhead, anyway.
Edit: I mean yes, you should lose sleep over other things :).
Yes, the answer you get will be off by a certain (smallish) amount; I have never heard of a timer function compensating for the average return time, because such a thing is nearly impossible to predict well. Such things are usually implemented by simply reading a register in the hardware and returning the value, or a version of it scaled to the appropriate timescale.
That said, I wouldn't lose sleep over this. The accepted way of keeping this overhead from affecting your measurements in any significant way is not to use these timers for short events. Usually, you will time several hundred, thousand, or million executions of the same thing, and divide by the number of executions to estimate the average time. Such a thing is usually more useful than timing a single instance, as it takes into account average cache behavior, OS effects, and so forth.
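A minimal sketch of that amortizing approach in C (my own illustration; N and the function under test are placeholders, and a very cheap function may still be optimized away, so treat the result with care):

#include <stdio.h>
#include <time.h>

#define N 1000000L

static void work_under_test(void)
{
    /* ... the operation being timed ... */
}

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        work_under_test();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 +
                      (t1.tv_nsec - t0.tv_nsec);
    printf("average per call: %.1f ns\n", total_ns / N);
    return 0;
}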
Most real-world uses of high-resolution timers are for profiling, in which the time is read once at START and once more at FINISH. So in most cases ~almost~ the same amount of delay is involved in both START and FINISH, and hence it works fine.
Now, for nuclear reactors, Windows or for that matter many other operating systems with generic functions may not be suitable. I guess they use REAL-TIME operating systems, which might give more accurate time values than desktop operating systems.
