Best practices to ensure low power consumption [closed] - c

Assume I have two programs P1 and P2 which perform the same functionality, but P1 consumes less power than P2 when they run. What are some best practices in coding that help me write good (in terms of low power consumption) programs like P1? You can assume C or any other popular language.
I am asking from a battery saving point of view (say, for a smartphone).

To start off, let's consider what consumes power on a modern CPU (most to least):
running at higher frequencies
keeping more cores online
doing any kind of work
If a particular thread is taking a while to do something, the kernel might boost the CPU frequency to ensure smooth performance, thereby increasing power consumption. In fact, power consumption increases exponentially (!) with CPU frequency (PDF), so it's a really good idea to reduce how long it takes to get any particular thing done as much as possible.
If multiple tasks are active and doing enough work that they cannot easily or performantly share a single core, the kernel will bring additional cores online (well, technically they're just not sleeping anymore - they were never offline) if available, again in order to ensure smooth performance. Power consumption scales roughly linearly with the number of active cores, especially in mobile ARM processors, according to NVIDIA (PDF).
When the processor doesn't have any work to do, the kernel will put it to sleep if possible, which usually consumes ridiculously small amounts of power, thus vastly increasing how long the device can run on its battery.
So far, we have essentially established that we should do as little work as possible, do whatever we do need to do as fast as possible, and minimize the overhead of any threads we use. The neat thing about these goals is that optimizing for them will also likely increase performance! So without further ado, let's actually start seeing what we can do:
Block / No Event Loops
When we use nonblocking calls, we usually end up doing a lot of polling. This means that we are just burning through CPU cycles like a madman until something happens. Busy-wait event loops are the usual way people end up doing this and are an excellent example of what not to do.
Instead, use blocking calls. Often, with things such as IO, it may take quite a while for a request to complete. In this time, the kernel can allow another thread or process to use the CPU (thus reducing the overall usage of the processor) or can sleep the processor.
In other words, turn something like this:
while (!event) {
    event = getEvent (read);
}
into something like this:
read ();
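As a slightly more concrete sketch (assuming fd is an ordinary blocking file descriptor such as a socket or pipe - the names here are made up for this example), the thread sleeps inside the kernel until data actually arrives, burning no cycles while it waits:
#include <unistd.h>

ssize_t wait_for_data (int fd, char *buf, size_t len)
{
    // blocks in the kernel; the core can sleep or run another thread in the meantime
    return read (fd, buf, len);
}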
Vectorize
Sometimes, you have a lot of data that you need to process. Vector operations allow you to process more data faster (usually - in rare occasions they can be much slower and just exist for compatibility). Therefore, vectorizing your code can often allow for it to complete its task faster, thus utilizing less processing resources.
Today, many compilers can auto-vectorize with the appropriate flags. For instance, on gcc, the flag -ftree-vectorize will enable auto-vectorization (if available) which can accelerate code massively by processing more data at a time when appropriate, often freeing up registers in the process (thus reducing register pressure), which also has the beneficial side effect of reducing loads and stores, which can in turn further increase performance.
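As a small illustration (the function and array names are made up, not from any particular library), a loop with independent iterations like the one below is exactly what -O2 -ftree-vectorize (or -O3) handles well; the restrict qualifiers promise the compiler that the arrays don't overlap, which is often what makes auto-vectorization possible:
// independent iterations, no aliasing - the compiler can process several floats per instruction
void scale (float *restrict dst, const float *restrict src, float k, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}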
On some platforms, vendors may provide libraries for processing certain kinds of data that may help with this. For instance, the Accelerate framework by Apple includes functions for dealing with vector and matrix data.
However, in certain cases - such as when the compiler does not see the opportunity to vectorize, or does not fully exploit it - you may want to vectorize your code yourself. This is often done in assembly, but if you use gcc or clang, you can simply use their vector extensions (a form of intrinsics) to write portable vectorized code (albeit only for platforms that support the specified vector size):
typedef float v4f __attribute__ ((vector_size (16)));
// calculates (r = a * b + c) four floats at a time
void vmuladd (v4f *r, const v4f *a, const v4f *b, const v4f *c, int n) {
    int x;
    for (x = 0; x < n; x++) {
        r[x] = a[x] * b[x];
        r[x] = r[x] + c[x];
    }
}
This may not be useful on older platforms, but this could seriously improve performance on ARM64 and other modern 64-bit platforms (x86_64, etc.).
Parallelization
Remember how I said that keeping more cores online is bad because it consumes power? Well:
Parallelization via multiple threads doesn't necessarily mean using more cores. If you paid attention to what I said about using blocking functions, threads can allow you to get work done while other threads wait on IO. That being said, you should not use those extra threads as "IO worker" threads that simply wait on IO - you'll just end up polling all over again. Instead, divide up the individual, atomic tasks that you need to get done among the threads so that, for the most part, they can work independently.
It's better to use more cores than to have to boost the clock frequency (linear vs exponential). If you have a task that needs to do a shit ton of processing, it might be useful to break that processing up among a few threads so that they can utilize the available cores. If you do this, take care to ensure that only minimal synchronization is required across the threads; we don't want to waste even more cycles just waiting for synchronization.
When possible, try to combine both approaches - parallelize tasks when you have a lot of things to do and parallelize computation when you have a lot of a single thing to do. If you do end up using threads, try to make them block when waiting for work (pthreads - the POSIX threads available on both Android and iOS - provides POSIX semaphores that can help with this) and try to make them long-running.
If you have a situation in which you will often need to create and destroy threads, it might be worthwhile to use a thread pool. How you accomplish this varies based on the task at hand, but a set of queues is a common approach. If you use a pool, ensure that its threads block when there is no work (this can again be accomplished using the above-mentioned POSIX semaphores).
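As a rough sketch of such a pool worker - queue_pop() and run_task() are hypothetical stand-ins for your own work queue, and error handling is omitted - each thread simply blocks on a counting semaphore that the producer posts once per enqueued task:
#include <pthread.h>
#include <semaphore.h>

extern sem_t work_available;        // posted once by the producer for every task it enqueues
extern void *queue_pop (void);      // hypothetical: pops the next task from your queue
extern void run_task (void *task);  // hypothetical: executes one task

static void *pool_worker (void *arg)
{
    (void) arg;
    for (;;) {
        sem_wait (&work_available); // the thread sleeps here instead of polling for work
        run_task (queue_pop ());
    }
    return NULL;
}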
Minimize Work
Try to do as little as you can get by with. When possible, offload work to external servers up in the cloud, where power consumption isn't as critical a concern (for most people - this changes once you are at scale).
In situations where you must poll, reducing the frequency of the polling by calling a sleep function can often help - turn something like this:
while (!event) {
    event = getEvent ();
}
into something like this:
event = getEvent ();
while (!event) {
    usleep (25 * 1000); // 25 ms; note that plain sleep () only takes whole seconds
    event = getEvent ();
}
Also, batch processing can work well if you don't have real time requirements (although this may be a good case to push it to the cloud) or if you get lots of independent data rapidly - change something like this:
while (!exit) {
    event = getEventBlocking ();
    process (event);
}
into something more like this:
while (!exit) {
    int x = 0;
    event_type *events[16];
    // block for the first event, then drain whatever else is already available
    events[x++] = getEventBlocking ();
    while ((x < 16) && availableEvents ()) {
        events[x++] = getEventBlocking ();
    }
    int y;
    for (y = 0; y < x; y++) {
        process (events[y]);
    }
}
This can increase performance through better instruction and data cache locality. If possible, it'd be nice to take this a step further (when such functionality is available on your platform of choice):
while (!exit) {
    int x;
    event_type **events = getEventsAllBlocking (&x);
    int y;
    for (y = 0; y < x; y++) {
        process (events[y]);
    }
}
This will increase performance and waste fewer cycles on waiting and performing function calls. Furthermore, this speedup can become quite noticeable with large amounts of data.
Optimize
This one is pretty easy: crank up the optimization settings on your compiler. Check out the documentation for relevant optimizations that you can enable and benchmark to see if they increase performance and/or reduce power consumption.
On GCC and clang, you can enable recommended safe optimizations by using the flag -O2. Bear in mind that this can make debugging slightly harder, so only use it on production releases.
All in all:
do as little work as possible
don't waste time in event loops
optimize to get shit done in less time
vectorize to get more data processed faster
parallelize to use available resources more efficiently

Related

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a first overview of how threads are managed by the system. Nowadays most processors are multi-core and run multiple threads per core, but for the sake of simplicity let's first imagine a single-core, single-thread processor. It is physically limited to performing only one task at a time, but we are still capable of running multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other, giving the illusion of multitasking. This process of changing from one task to another is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, data can be saved in registers, cache, RAM, etc. As technology has advanced, ever faster ways of doing this have been developed. When the task is resumed, all of its data is fetched and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, switching between one task and another requires a context switch. To perform it, some data has to be stored and fetched, but these operations are pure overhead for your computation and don't directly give you any advantage. So having too many tasks requires a large amount of context switching, meaning a lot of computational time wasted! In the end, your task might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is also necessary to check that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will run on multiple hardware cores!
In your particular application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computing. This sort of task runs great on a GPU, since the operations have no data dependencies between them and the concurrency is handled in hardware (modern GPUs have thousands of computing cores!).
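To make the thread-count point concrete, here is a hedged pthreads sketch - the grid dimensions and process_row() are illustrative stand-ins, not taken from the question - that creates exactly one thread per online core and gives each a contiguous band of rows; for CPU-bound work like this, adding threads beyond the core count mostly just adds context-switching overhead:
#include <pthread.h>
#include <unistd.h>

#define ROWS 2048
#define COLS 2048

static double grid[ROWS][COLS];     // illustrative stand-in for the question's N x M grid

static void process_row (int r)     // illustrative busy work on one row
{
    int c;
    for (c = 1; c < COLS; c++)
        grid[r][c] = (grid[r][c] + grid[r][c - 1]) / 2.0;
}

struct span { int begin, end; };

static void *worker (void *arg)
{
    struct span *s = arg;
    int r;
    for (r = s->begin; r < s->end; r++)
        process_row (r);
    return NULL;
}

int main (void)
{
    long ncores = sysconf (_SC_NPROCESSORS_ONLN);   // one thread per online core
    if (ncores < 1)
        ncores = 1;

    pthread_t tid[ncores];
    struct span spans[ncores];
    long i;
    for (i = 0; i < ncores; i++) {
        spans[i].begin = (int) (ROWS * i / ncores);
        spans[i].end   = (int) (ROWS * (i + 1) / ncores);
        pthread_create (&tid[i], NULL, worker, &spans[i]);
    }
    for (i = 0; i < ncores; i++)
        pthread_join (tid[i], NULL);
    return 0;
}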

Set CPU usage or manipulate other system resource in C

I have a specific application to make in C. Is there any possibility to programmatically set CPU usage for a process? I want to set CPU usage to e.g. 20% by a specific (my own) process for a few seconds and then go back to regular usage. while(1) takes 100% CPU so it's not the best idea for me. Any other ideas to manipulate some system resources, and functions that can provide it? I already did memory allocation manipulations but I need other ideas about manipulating system resources.
Thanks!
What I know is that you may be able to control your application's priority depending on the operating system.
Also, a function equivalent to Sleep() reduces CPU load as it causes your application to relinquish CPU cycles to other running programs.
Have you ever tried to answer a question that became more and more complicated once you dug into it?
What you do depends upon what you are trying to accomplish. Do you want to utilize "20% by specific (mine) process for few seconds and then back to regular usage"? Or do you want to utilize 20% of the capacity of the entire processor? Over what interval do you want to use 20%? Averaged over 5 sec? 500 msec? 10 msec?
20% of your process is pretty easy as long as you don't need to do any real work and want 20% of the average over a reasonably long interval, say 1 sec.
for( i = 0; i < INTERVAL_CNT; i++ )   // untested sketch
{
    for( j = 0; j < INTERVAL_CNT * PERCENT / 100; j++ )
    {
        // some work the compiler won't optimize away (e.g., update a volatile variable)
    }
    sleep( INTERVAL_CNT * (100 - PERCENT) / 100 );   // note: sleep() takes whole seconds
}
Adjusting this for doing real work is more difficult. Note the comment about the compiler doing optimization. Optimizing compilers are pretty smart and will identify and remove code that does nothing useful. For example, if you use myVar++, declare it local to a certain scope, and never use it, the compiler will remove it to make your app run faster.
If you want a more continuous load (read that as a load of 20% at any sampling point vs a square wave with a certain duty cycle), it's going to be complicated. You might be able to do this with some experimentation by launching multiple CPU-consuming threads; having multiple threads with offset duty cycles should give you a smoother load.
20% of the entire processor is even more complicated since you need to account for multiple factors such as other processes executing, process priority, and multiple CPUs in the processor. I'm not going to get into any detail, but you might be able to do this using multiple simultaneously executing heavyweight processes with offset duty cycles along with a master thread sampling the processor load and dynamically adjusting the heavyweight processes through a set of shared variables.
Let me know if you want me to confuse the matter even further.

Does a cache write take longer with more caches to invalidate?

Can you please help me find out whether it takes longer for a cache write to finish when there are more cores/caches holding a copy of that line?
I also want to measure/quantify how much longer it actually takes.
I couldn't find anything useful on Google, and I have trouble measuring it myself, plus interpreting what I measure, because of the many things that can happen on a modern processor.
(reordering, prefetching, buffering and god knows what)
Details:
My basic process of measuring it is roughly as follows:
write something to the cache line on processor 0
read it on processors 1 to n.
rdtsc
write it on processor 0
rdtsc
I am not even sure which instructions to actually use for the read/write on processor 0 in order to make sure the write/invalidate is finished before the final time measurement.
At the moment I fiddle with an atomic exchange (__sync_fetch_and_add()), but it seems that the number of threads is itself important for the length of this operation (not the number of threads to invalidate) - which is probably not what I want to measure.
I also tried a read, then a write, then a memory barrier (__sync_synchronize()). This looks more like what I expect to see, but here I am also not sure whether the write is finished when the final rdtsc takes place.
As you can guess my knowledge of CPU internals is somewhat limited.
Any help is very much appreciated!
ps:
* I use linux, gcc and pthreads for the measurements.
* I want to know this for modeling a parallel algorithm of mine.
Edit:
In a week or so (going on vacation tomorrow) I'll do some more research and post my code and notes and link it here (In case someone is interested), because the time I can spend on this is limited.
I started writing a very long answer, describing exactly how this works, then realized, I probably don't know enough about the exact details. So I'll do a shorter answer....
So, when you write something on one processor, if the line is not already in that processor's cache, it will have to be fetched in, and after the processor has read the data, it will perform the actual write. In doing so, it will send a cache-invalidate message to ALL other processors in the system. These will then throw away their copies. If another processor has "dirty" content, it will itself write out the data and ask for an invalidation - in which case the first processor will have to RELOAD the data before finishing its write (otherwise, some other element in the same cacheline may get destroyed).
Reading it back into the cache will be required on every other processor that is interested in that cache-line.
The __sync_fetch_and_add() will use a "lock" prefix [on x86; other processors may vary, but the general idea on processors that support "per instruction" locks is roughly the same] - this will issue an "I want this cacheline EXCLUSIVELY, everyone else please give it up and invalidate it". Just like the first case, the processor may well have to re-read anything that another processor may have made dirty.
A memory barrier will not ensure that data is updated "safely" - it will just make sure that "whatever happened (to memory) before now is visible to all processors by the time this instruction finishes".
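As a minimal sketch of that "visible before" guarantee, using the __sync_synchronize() builtin mentioned in the question (the variable names are made up, and a real reader would need a matching barrier or acquire load on its side):
int payload;
volatile int ready;

void publish (int value)
{
    payload = value;
    __sync_synchronize ();  // full barrier: the payload store becomes visible before the flag store
    ready = 1;
}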
The best way to optimize the use of processors is to share as little as possible, and in particular, to avoid "false sharing". In a benchmark many years ago, there was a structure like [simplified] this:
struct stuff {
    int x[2];
    ... other data ... total data a few cachelines.
} data;

void thread1()
{
    for( ... big number ...)
        data.x[0]++;
}

void thread2()
{
    for( ... big number ...)
        data.x[1]++;
}

int main()
{
    start = timenow();
    create(thread1);
    create(thread2);
    join(thread1);   // wait for both threads before reading the clock
    join(thread2);
    end = timenow() - start;
}
Since EVERY time thread1 wrote to x[0], thread2's processor had to get rid of its copy of x[1], and vice versa, the result was that the SMP test [vs just running thread1] ran about 15 times slower. By altering the struct like this:
struct stuff {
    int x;
    ... other data ...
} data[2];
and
void thread1()
{
    for( ... big number ...)
        data[0].x++;
}
we got 200% of the single-thread variant's performance [give or take a few percent].
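For anyone who wants to reproduce the effect, here is a hedged, self-contained sketch of that benchmark (64 bytes is an assumption about the cache line size, the iteration count is arbitrary, and absolute timings will vary by machine; build with something like gcc -O2 -pthread):
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 200000000UL

// two counters in the same cache line (false sharing)
static struct { volatile long x[2]; } shared;

// two counters padded and aligned onto separate 64-byte lines
static struct { volatile long x; char pad[64 - sizeof (long)]; }
    separate[2] __attribute__ ((aligned (64)));

static void *bump_shared (void *arg)
{
    long i = (long) arg;
    unsigned long n;
    for (n = 0; n < ITERS; n++)
        shared.x[i]++;
    return NULL;
}

static void *bump_separate (void *arg)
{
    long i = (long) arg;
    unsigned long n;
    for (n = 0; n < ITERS; n++)
        separate[i].x++;
    return NULL;
}

static double timed_run (void *(*fn) (void *))
{
    struct timespec t0, t1;
    pthread_t a, b;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    pthread_create (&a, NULL, fn, (void *) 0);
    pthread_create (&b, NULL, fn, (void *) 1);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    clock_gettime (CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main (void)
{
    printf ("same cache line:      %.2f s\n", timed_run (bump_shared));
    printf ("separate cache lines: %.2f s\n", timed_run (bump_separate));
    return 0;
}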
Right, so the processor has queues of buffers where write operations are stored while the processor is writing to memory. A memory barrier (mfence, sfence or lfence) instruction is there to ensure that any outstanding read/write, write-only or read-only operation has completely finished before the processor proceeds to the next instruction. Normally, the processor would just continue on its jolly way through any following instructions, and eventually the memory operation becomes fulfilled one way or another. Since modern processors have a lot of parallel operations and buffers all over the place, it can take quite some time before something ACTUALLY trickles through to where it eventually ends up. So, when it's CRITICAL to make sure that something has ACTUALLY been done before proceeding, you use a barrier. For example, if we have written a bunch of instructions to video memory and we now want to kick off the run of those instructions, we need to make sure that the writing of those 'instructions' has actually finished and that some other part of the processor isn't still working on it - so we use an sfence to make sure the write has really happened. That may not be a very realistic example, but I think you get the idea.
Cache writes have to get line ownership before dirtying the cache line. Depending on the cache coherence model implemented in the processor architecture, the time taken for this step varies. The most common coherence protocols that I know are:
Snooping coherence protocol: all caches monitor the address lines for cached memory lines, i.e. every memory request has to be broadcast to all CPUs, which does not scale as the number of CPUs increases.
Directory-based coherence protocol: the set of CPUs sharing each cache line is kept in a directory, so invalidating or gaining ownership is a point-to-point CPU request rather than a broadcast. This is more scalable, but latency suffers because the directory is a single point of contention.
Most CPU architectures support something called a PMU (performance monitoring unit). This unit exports counters for many things like cache hits, misses, cache write latency, read latency, TLB hits, etc. Please consult the CPU manual to see if this info is available.

Openmp not speeding up parallel loop

I have the following embarrassingly parallel loop
//#pragma omp parallel for
for(i=0; i<tot; i++)
    pointer[i] = val;
Why does uncommenting the #pragma line cause performance to drop? I'm getting a slight increase in program run time when I use openmp to parallelize this for loop. Since each access is independent, shouldn't it greatly increase the speed of the program?
Is it possible that if this for loop isn't run for large values of tot, the overhead is slowing things down?
Achieving performance with multiple threads in a Shared Memory environment usually depends on:
The task granularity;
Load balance between parallel tasks;
The number of parallel tasks vs. the number of cores used;
The amount of synchronization among parallel tasks;
Whether the algorithm is CPU-bound or memory-bound;
The machine architecture.
I will give a brief overview of each of the aforementioned points.
You need to check whether the granularity of the parallel tasks is enough to overcome the overhead of the parallelization (e.g., thread creation and synchronization). Maybe the number of iterations of your loop and the computation pointer[i] = val; are not enough to justify the overhead of thread creation (see the sketch at the end of this answer). It is worth noting, however, that too coarse a task granularity can also lead to problems, for instance load imbalance.
You have to test the load balance (the amount of work per thread). Ideally, each thread should compute the same amount of work. In your code example this is not problematic;
Are you using hyper-threading?! Are you utilizing more threads than cores?! Because, if you are, threads will start "competing" for resources, and this can lead to a drop in performance;
Usually, one wants to reduce the amount of synchronization among threads. Consequently, sometimes one uses finer-grain synchronization mechanisms and even data redundancy (among other approaches) to achieve that. Your code does not have this issue.
Before attempting to parallelize your code you should analyze whether it is memory-bound, CPU-bound, and so on. If it is memory-bound, you may start by improving the cache usage before tackling the parallelization. For this task, the use of a profiler is highly recommended.
To extract the most out of the underlying architecture, the multi-threaded approach needs to tackle the constraints of that architecture. For example, implementing an efficient multi-threaded approach to execute on an SMP architecture is different from implementing it to execute on a NUMA architecture, since in the latter one has to take memory affinity into account.
EDIT: Suggestion from @Hristo Iliev
Thread affinity: "Binding threads to cores improves performance in general and even more on NUMA systems since it improves data locality."
Btw, I recommend reading this Intel Guide for Developing Multithreaded Applications.
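On the granularity point above, one low-effort mitigation is OpenMP's if clause, which only forks the thread team when the trip count is large enough to pay for the parallelization overhead; the threshold below is an illustrative value you would tune by benchmarking:
#define PAR_THRESHOLD 100000   // illustrative cutoff - measure to find the right value for your machine

#pragma omp parallel for if (tot > PAR_THRESHOLD) schedule (static)
for (i = 0; i < tot; i++)
    pointer[i] = val;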

How can I evaluate performances of a lockless queue?

I have implemented a lockless queue using the hazard pointer methodology explained in http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf using GCC CAS instructions for the implementation and pthread local storage for thread local structures.
I'm now trying to evaluate the performance of the code I have written, in particular I'm trying to do a comparison between this implementation and the one that uses locks (pthread mutexes) to protect the queue.
I'm asking this question here because I tried comparing it with the "locked" queue and I found that the lock-based queue performs better than the lockless implementation. The only test I tried is creating 4 threads on a 4-core x86_64 machine doing 10,000,000 random operations on the queue, and it is significantly faster than the lockless version.
I want to know if you can suggest an approach to follow, i.e. what kind of operations I should test on the queue and what kind of tools I can use to see where my lockless code is wasting its time.
I also want to understand if it is possible that the performance is worse for the lockless queue just because 4 threads are not enough to see a major improvement...
Thanks
First point: lock-free programming doesn't necessarily improve speed. Lock-free programming (when done correctly) guarantees forward progress. When you use locks, it's possible for one thread to fail (e.g., crash or go into an infinite loop) while holding a mutex. When/if that happens, no other thread waiting on that mutex can make any more progress. If that mutex is central to normal operation, you may easily have to restart the entire process before any more work can be done at all. With lock-free programming, no such circumstance can arise. Other threads can make forward progress, regardless of what happens in any one thread.¹
That said, yes, one of the things you hope for is often better performance -- but to see it, you'll probably need more than four threads. Somewhere in the range of dozens to hundreds of threads would give your lock-free code a much better chance of showing improved performance over a lock-based queue. To really do a lot of good, however, you not only need more threads, but more cores as well -- at least based on what I've seen so far, with four cores and well-written code, there's unlikely to be enough contention over a lock for lock-free programming to show much (if any) performance benefit.
Bottom line: More threads (at least a couple dozen) will improve the chances of the lock-free queue showing a performance benefit, but with only four cores, it won't be terribly surprising if the lock-based queue still keeps up. If you add enough threads and cores, it becomes almost inevitable that the lock-free version will win. The exact number of threads and cores necessary is hard to predict, but you should be thinking in terms of dozens at a minimum.
¹ At least with respect to something like a mutex. Something like a fork bomb that just ate all the system resources might be able to deprive the other threads of enough resources to get anything done - but some care with things like quotas can usually prevent that as well.
The question is really what workloads you are optimizing for. If contention is rare, lock-based structures on a modern OS are probably not too bad. They mainly use CAS instructions under the hood as long as they are on the fast path. Since these are quite well optimized, it will be difficult to beat them with your own code.
Your own implementation can only win substantially in the contended case. Just random operations on the queue (you are not too precise in your question) will probably not produce that if the average queue length is much longer than the number of threads that hack on it in parallel. So you must ensure that the queue stays short, perhaps by introducing a bias in which random operation is chosen when the queue is too long or too short. Then I would also charge the system with at least twice as many threads as there are cores. This would ensure that wait times (for memory) don't play in favor of the lock version.
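If it helps, here is a hedged sketch of such a benchmark harness: queue_push() and queue_pop() are hypothetical stand-ins for whichever implementation (mutex-based or lock-free) you link in, the 50/50 operation mix is a placeholder for the bias described above, and you would sweep the thread count well past the number of cores:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define OPS_PER_THREAD 1000000UL

extern void queue_push (long value);    // hypothetical: your enqueue
extern int  queue_pop  (long *value);   // hypothetical: your dequeue; returns 0 when empty

static void *worker (void *arg)
{
    unsigned int seed = (unsigned int) (long) arg + 1;
    long v;
    unsigned long n;
    for (n = 0; n < OPS_PER_THREAD; n++) {
        // 50/50 mix; bias this toward pops if the queue is allowed to grow too long
        if (rand_r (&seed) & 1)
            queue_push ((long) n);
        else
            queue_pop (&v);
    }
    return NULL;
}

int main (int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi (argv[1]) : 8;   // sweep this well past the core count
    pthread_t tid[nthreads];
    struct timespec t0, t1;
    long i;
    clock_gettime (CLOCK_MONOTONIC, &t0);
    for (i = 0; i < nthreads; i++)
        pthread_create (&tid[i], NULL, worker, (void *) i);
    for (i = 0; i < nthreads; i++)
        pthread_join (tid[i], NULL);
    clock_gettime (CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf ("%d threads: %.3f s (%.2f Mops/s)\n",
            nthreads, secs, nthreads * OPS_PER_THREAD / secs / 1e6);
    return 0;
}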
The best way in my opinion is to identify the hotspots in your application with locks by profiling the code. Introduce the lockless mechanism and measure the same again. As mentioned already by other posters, there may not be a significant improvement at lower scale (number of threads, application scale, number of cores), but you might see throughput improvements as you scale up the system. This is because deadlock situations have been eliminated and threads are always making forward progress.
Another way of looking at an advantage of lockless schemes is that to some extent one decouples system state from application performance, because there is no kernel/scheduler involvement and much of the code is userland, except for CAS, which is a hardware instruction.
With locks that are heavily contended, threads block and are scheduled once locks are obtained, which basically means they are placed at the end of the run queue (for a specific priority level). Inadvertently this links the application to system state, and response time for the app now depends on the run queue length.
Just my 2 cents.
