Can we optimize code to reduce power consumption? - c

Are there any techniques to optimize code in order to ensure lower power consumption? The architecture is ARM and the language is C.

From the ARM technical reference site:
The features of the ARM11 MPCore processor that improve energy efficiency include:
- accurate branch and sub-routine return prediction, reducing the number of incorrect instruction fetch and decode operations
- use of physically addressed caches, which reduces the number of cache flushes and refills, saving energy in the system
- the use of MicroTLBs reduces the power consumed in translation and protection lookups each cycle
- the caches use sequential access information to reduce the number of accesses to the tag RAMs and to unwanted data RAMs.
In the ARM11 MPCore processor extensive use is also made of gated clocks and gates to disable inputs to unused functional blocks. Only the logic actively in use to perform a calculation consumes any dynamic power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.

Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, political or the like, fewer CPU cycles = less energy used. What I'm trying to say is that you can probably replace "power consumption" with "execution time", as they would, essentially, be directly proportional - and you may therefore have more success when not "scaring" people off with a power-related question. I may, however, stand corrected :)

Yes. Use a profiler and see which routines are using most of the CPU. On ARM you can use JTAG connectors, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally putting your processor into a low-consumption state (deep sleep) when it is idle. If you cannot reduce the CPU percentage used by much (for example from 80% to 50%), it won't make a big difference. Depending on what operating system you are running, the options may vary.
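Where a deep-sleep state is available, the usual pattern is to sleep in the idle path and let interrupts wake the core. Below is a minimal bare-metal sketch, assuming a GCC toolchain and an ARMv6K/ARMv7 or later core (older cores enter wait-for-interrupt through a CP15 write instead); the work_pending flag is hypothetical and would be set from an interrupt handler.

    #include <stdbool.h>

    static volatile bool work_pending;      /* hypothetical flag, set from an ISR */

    void idle_loop(void)
    {
        for (;;) {
            if (work_pending) {
                work_pending = false;
                /* ... handle the pending work ... */
            } else {
                __asm__ volatile ("wfi");   /* stop the core clock until the next interrupt */
            }
        }
    }

In a real system you would close the race between the flag check and the WFI (for example by disabling interrupts around the check), or simply rely on the OS/RTOS idle hook, which typically does exactly this.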

The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.

Try to stay in on-chip memory (cache) for idle loops, keep I/O to a minimum, and keep bit flipping to a minimum on busses. Non-volatile memory like PROMs and flash consumes more power to store zeros than ones (which is why they erase to ones; it is actually a zero, but the transistor(s) invert the bit before you see it, so zeros are stored as ones and ones as zeros, and this is also why they degrade to ones when they fail). I don't know about volatile memories; DRAM uses half as many transistors as SRAM, but has to be refreshed.
For all of this to matter, though, you need to start with a low-power system, as the above may not otherwise be noticeable. Don't use anything from Intel, for example.

If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.

If the processor is tuned to use less power when it needs fewer cycles, then simply making your code run more efficiently is the solution. Otherwise, there's not much you can do unless the operating system exposes some sort of power management functionality.

Keep IO to a minimum.

On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.

Related

How instructions are fetched in Cortex-M processors

As far as I understand, Cortex-M0/M3 processors have only one memory space that holds instructions and data, and access is only through the memory bus interface. Thus, if I understand correctly, every clock cycle the processor must read a new instruction to enter the pipeline, but that means the bus will always be busy reading instructions, so how can data be read simultaneously (for load word/store word instructions, for example)?
Additionally, what is the latency of reading an instruction from memory? If it is not a single cycle, then the processor must constantly halt itself until the next instruction is fetched, so how is that handled?
Thanks
Yes, this is how it happens; the processor stalls a lot. This goes on with big processors as well as small ones: it is difficult at best to keep a pipelined processor fed (although some of these pipes are shallow on Cortex-Ms, they are pipelined nevertheless).
On many of the parts I have used (and I have touched most of the vendors), the flash runs at half the clock speed of the core, so even at zero wait states you can only get an instruction every other clock (on average, naturally, with overhead rolled in) if fetching a halfword at a time. If fetching a word at a time, which many of the cores offer, that is ideally two instructions per two clocks, or one per clock; with Thumb-2 you of course take the hit. ST definitely has a prefetcher/cache with a fancy marketing name that does a pretty good job. Others may offer that as well, or just rely on what ARM offers, which varies.
The different Cortex-Ms have different mixtures of busses. I hate the von Neumann/Harvard references, as there is little practical use for an actual Harvard architecture, thus the "modified" adjective, which means they can do anything while still attracting folks taught in school that Harvard means performance. The busses can have multiple transactions in flight, and there are a different number of busses, as is somewhat obvious when you go in and release clocks for a peripheral (APB1 clock control, AHB2 clock control, etc.): peripherals, flash, and so on. But we can run code from SRAM, so it's not Harvard. Forget the Harvard and von Neumann terms and just focus on the actual implementation.
The bus documentation is as readily available as the core documentation. If you buy the right FPGA board you can request a free eval of a core, which gives you an up-close-and-personal view of how it really works.
At the end of the day there is some parallelism, but on many chips the flash is half speed, so if you are not fetching two instructions per access or have some other solution, you are barely keeping up and stalling often if you have other accesses on the same bus. Likewise, on many of these chips the peripherals can't run as fast as the core, so that alone incurs a stall; but even if a peripheral runs on the same clock, that doesn't mean it turns around a CSR or data access as fast as SRAM, so you incur a stall there too.
There is no reason to assume you will get one instruction per clock performance out of these parts any more than a full sized arm or x86 or other.
While there are some important details that are not documented and are only seen when you get the core, there is documentation on each core and bus to get a rough idea of how to tune your code to perform better, or to tune your expectations of how it will really perform. I know I have demonstrated this here and elsewhere; it is pretty easy, even with an ST part, to see a performance difference between flash and SRAM and to see that it takes more clocks than instructions to perform a benchmark.
Your question is too broad in a few ways. The Cortex-M0 and M3 are quite different: one was the first one out and dripping with features, the other was tuned for size and simply has less stuff in general, not necessarily meant to compete in this way. Then, how long the latency is, etc., is strictly a matter of the chip company and the family within that company, so that question is extremely broad across all the Cortex-M products out there; there are dozens of different answers to it. ARM makes cores, not chips; the chip vendors make chips and buy IP from various places and make some of their own, and some small part of that chip might be IP they buy from a processor vendor.
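To put numbers on the flash-versus-SRAM difference, a cycle counter helps. Here is a minimal sketch, assuming a Cortex-M3/M4 where the DWT unit is present (it is not on Cortex-M0); the register addresses are the architecturally defined ones, and benchmark_loop() is just a hypothetical stand-in workload.

    #include <stdint.h>

    #define DEMCR       (*(volatile uint32_t *)0xE000EDFC)  /* Debug Exception and Monitor Control */
    #define DWT_CTRL    (*(volatile uint32_t *)0xE0001000)
    #define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004)

    static uint32_t benchmark_loop(uint32_t n)   /* hypothetical workload */
    {
        volatile uint32_t sum = 0;               /* volatile so the loop is not optimized away */
        for (uint32_t i = 0; i < n; i++)
            sum += i;
        return sum;
    }

    uint32_t measure_cycles(void)
    {
        DEMCR      |= (1u << 24);    /* TRCENA: enable the DWT unit */
        DWT_CYCCNT  = 0;             /* reset the cycle counter     */
        DWT_CTRL   |= 1u;            /* CYCCNTENA: start counting   */

        uint32_t start = DWT_CYCCNT;
        (void)benchmark_loop(1000);
        uint32_t end = DWT_CYCCNT;

        return end - start;          /* clocks, including a little measurement overhead */
    }

Running the same routine once from flash and once copied into SRAM, at the same core clock, makes the extra fetch stalls directly visible.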
What you've described is known as the "von Neumann bottleneck", and in machines with a pure von Neumann architecture with shared data and program memory, accesses are usually interleaved. However, you might like to check out the "modified Harvard architecture", because that's basically what is in use here. The backing store is shared like in a von Neumann machine, but the instruction and data fetch paths are separate like in a Harvard machine, and crucially they have separate caches. So if an instruction or data fetch results in a cache hit, a memory fetch doesn't take place and there is no bottleneck.
The second part of your question doesn't make a great deal of sense I'm afraid, because it is meaningless to talk about instruction fetch times in terms of instruction cycles. By definition, if an instruction fetch is delayed for some reason, the execution of that instruction (and subsequent instructions) must be delayed. Typically this is done by inserting NOPs into the pipeline until the next instruction is ready (known as "bubbling" the pipeline).
re: part 2: Instruction fetch can be pipelined to hide some / all of the fetch latency. Cortex-M3 has a prefetch unit with a 3-word FIFO.
https://developer.arm.com/documentation/ddi0337/e/introduction/prefetch-unit (This can hold up to six 16-bit Thumb instructions.)
This buffer can also supply instructions while data load/store is happening, in a config where those compete with each other (not Harvard split bus, and without a data or instruction cache).
This prefetch is of course speculative; discarded on branches. (It's simple and small enough not to be worth doing branch prediction to try to fetch from the right place before decode even knows the upcoming instruction stream contains a branch.)

Hyperthreading effects on gettimeofday and other time measurements

While I was benchmarking a CPU with hyperthreading using BLAS matrix operations in C, I observed a nearly exact doubling of the runtime of the functions when using hyperthreading. What I expected was some kind of speed improvement because of out-of-order execution or other optimizations.
I use gettimeofday to estimate the runtime. In order to evaluate the observation, I want to know if you have thoughts on the stability of gettimeofday in a hyperthreading environment (Debian Linux, 32-bit), or maybe on my expectations (they might be wrong)?
Update: I forgot to mention that I am running the benchmark application twice, setting the affinity to one hyperthreading core each. For example, gemm is run twice in parallel.
I doubt whether your use of gettimeofday() explains the discrepancy, unless, possibly, you are measuring very small time intervals.
More to the point, I would not expect enabling hyperthreading to improve the performance of single-threaded BLAS computations. A single thread uses only one processor (at a time), so the additional logical processors presented by hyperthreading do not help.
A well-tuned BLAS makes good use of the CPU's data cache to reduce memory access time. That doesn't help much if the needed data are evicted from the cache, however, as is likely to happen when a different process is executed by the other logical processor of the same physical CPU. Even on a lightly-loaded system, there is probably enough work to do that the OS will have a process scheduled at all times on every available (logical) processor.
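If you want to rule the timer out, a sketch like the following, using clock_gettime(CLOCK_MONOTONIC) rather than gettimeofday(), removes resolution and clock-adjustment questions from the picture; run_gemm() is a hypothetical stand-in for whatever BLAS call is being benchmarked.

    #include <stdio.h>
    #include <time.h>

    static void run_gemm(void)              /* hypothetical stand-in workload */
    {
        volatile double x = 0.0;
        for (int i = 0; i < 1000000; i++)
            x += i * 1e-6;
    }

    static double time_seconds(void (*fn)(void))
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("gemm took %.6f s\n", time_seconds(run_gemm));
        return 0;
    }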

Is there a way to avoid cache misses _completely_?

I read the very basics on how the cache works here: How and when to align to cache line size? and here: What is "cache-friendly" code? , but none of these posts answered my question: is there a way to execute some code entirely within the cache, i.e., without using any access to RAM (beyond perhaps during the initial process of reading the file from the HDD)? As far as I understand the bottleneck in computation nowadays is mostly memory bandwidth, and "as long as you are within the CPU, you are just fine".
Is there a way to load a program into the cache, and keep it there until it terminates? Let's say I have a 1MB compiled C program, which does some scientific computation with a memory requirement of another 1MB, and runs for 5 days. Is there a way to flag this code so that it does not get evicted from the cache during execution? I am thinking of giving this code higher priority, or something similar, during execution.
In other words, how much cache is used by an idling computer, which loads its OS (say Ubuntu) and then does nothing? Is there excessive cache use during idling? Should I expect my small program always to be in the cache if the OS does not do anything besides executing it? Let's say after 5 minutes the screensaver starts. Does this lead to massive cache misses (and hence a drastic reduction in performance), since it now competes with my program for cache space? My experience is that running several non-demanding programs (like the screensaver, or a simple audio player, PDF reader, etc.) at the same time does not significantly decrease the performance of my scientific program, even though I would expect it to go in and out of the cache all the time. The question is: why is its speed not affected? Would it make sense to use a minimalistic OS (if so, then which one?) to improve (or rather: maintain) the speed of the computation?
Just for clarity, we can assume that the code is something very simple, say it is a bunch of nested for loops where the innermost part sums up all the increment variables modulo 97. The point is that it is small enough to be put and executed in the cache.
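For concreteness, here is a sketch of the kind of kernel meant above (the bounds are arbitrary); the code plus a handful of scalars fits comfortably in L1, so after the first pass there is essentially no memory traffic.

    #include <stdio.h>

    int main(void)
    {
        unsigned long sum = 0;
        for (int i = 0; i < 1000; i++)
            for (int j = 0; j < 1000; j++)
                for (int k = 0; k < 1000; k++)
                    sum = (sum + i + j + k) % 97;   /* sum the counters modulo 97 */
        printf("%lu\n", sum);
        return 0;
    }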
There are different types of CPU cache misses: compulsory, conflict, capacity, coherence.
Compulsory misses can't be avoided, as they happen on the first reference to a location in memory. So no, you definitely can't avoid cache misses completely.
Besides that, typical L1 cache sizes today are 32KB/64KB per core, and L2 cache sizes are 256KB per core. So 1MB of data would also create either capacity or conflict misses, depending on the cache's associativity.
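A rough way to see capacity misses in action is to time passes over working sets of increasing size; once the set outgrows L1, and later L2, the time per element climbs. A minimal sketch follows (sizes and pass counts are arbitrary, and with sequential access the hardware prefetcher hides part of the penalty, so a strided or random walk shows a sharper jump).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        for (size_t kb = 16; kb <= 8192; kb *= 2) {
            size_t n = kb * 1024 / sizeof(int);
            int *a = malloc(n * sizeof(int));
            for (size_t i = 0; i < n; i++)
                a[i] = (int)i;

            volatile long long sum = 0;          /* volatile so the reads are not optimized away */
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int pass = 0; pass < 100; pass++)
                for (size_t i = 0; i < n; i++)
                    sum += a[i];
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("%6zu KB: %.2f ns/element\n", kb, ns / (100.0 * n));
            free(a);
        }
        return 0;
    }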
No, on most standard architectures, CPU cache is not addressable.*
And even if you could, what kind of performance improvement are you anticipating here? What percentage of your program's execution time do you believe is being spent loading from main memory into (L3) cache? You should profile your program to determine where it's actually spending its time, rather than dreaming up solutions to problems that don't exist!
* I think x86 CPUs might have a hardware configuration which allows them to operate without attached RAM, but that's basically irrelevant.
Short answer: NO. The cache is maintained by the OS/CPU, and it is a bad idea to allow programs to force themselves to stay in the cache. Let's say you have two programs running at the same time, and both try to force themselves to stay in the cache; chaos would ensue, wouldn't it?
Newer Intel CPUs have added "Cache Allocation Technology" (CAT) under the general rubric of their Resource Director Technology. This allows software directives to reserve certain cache (and other) resources for particular computational units (application, container, VM, etc). So, if the process in question has enough cache space set aside for it under CAT, it should experience only its initial compulsory misses (to bring its code and data into cache) and self-induced conflict misses, avoiding capacity misses and conflict misses created by other processes.
I am not sure whether it will satisfy your questions.
is there a way to execute some code entirely within the cache, i.e., without using any access to RAM?
Is there a way to load a program into the cache, and keep it there until it terminates?
It is possible to use fully associative, cache-like memories with single-cycle access times (for example, tightly coupled memories, TCMs). This is realistic only in very small embedded systems. It is general practice to use TCMs in embedded systems for time-critical code, as they provide predictability.
In the case of set-associative caches, it is possible to lock down cache lines or ways (for example, using CP15 on ARM) so that the eviction algorithm doesn't consider them as victims for a cache fill.
As a side note, it is also sometimes useful to use cache-as-RAM for bring-up of non-booting boards when the caches are in debug mode.
(http://www.asset-intertech.com/Products/Processor-Controlled-Test/PCT-Software/Cache-as-RAM-for-board-bring-up-of-non-boothing-ci)

How to saturate memory bus

I want to test a program with various memory bus usage levels. For example, I would like to find out if my program works as expected when other processes use 50% of the memory bus.
How would I simulate this kind of disturbance?
My attempt was to run a process with multiple threads, each thread doing random reads from a big block of memory. This didn't appear to have a big impact on my program. My program has a lot of memory operations, so I would expect that a significant disturbance will be noticeable.
I want to saturate the bus but without using too many CPU cycles, so that any performance degradation will be caused only by bus contention.
Notes:
I'm using a Xeon E5645 processor, DDR3 memory
The mental model of "processes use 50% of the memory bus" is not a great one. A thread that has acquired a core and accesses memory that's not in the caches uses the memory bus.
Getting a thread to saturate the bus is simple; just use memcpy(). Copy several times the amount that fits in the last-level cache, and warm it up by running it multiple times so there are no page faults to slow the code down.
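A minimal sketch of that suggestion, using pthreads; the 256 MB buffer size and the thread count are assumptions to be tuned for the machine (the Xeon E5645 has a 12 MB L3).

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUF_SIZE   (256UL * 1024 * 1024)   /* several times the last-level cache */
    #define N_THREADS  2

    static void *hammer_bus(void *arg)
    {
        (void)arg;
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        if (!src || !dst)
            return NULL;
        memset(src, 1, BUF_SIZE);              /* touch the pages up front, no later page faults */
        for (;;)
            memcpy(dst, src, BUF_SIZE);        /* stream through DRAM indefinitely */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[N_THREADS];
        for (int i = 0; i < N_THREADS; i++)
            pthread_create(&tid[i], NULL, hammer_bus, NULL);
        pthread_join(tid[0], NULL);            /* never returns; run this alongside the program under test */
        return 0;
    }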
My first instinct would be to set up a bunch of DMA operations to bounce data around without using the CPU too much. This all depends on what operating system you're running and what hardware. Is this an embedded system? I'd be glad to give more detail in the comments.
I'd use SSE movntps instructions to stream data, to avoid cache conflicts with the other thread on the same core. Maybe unroll that loop 16 times to minimize the number of instructions per memory transfer. While the DMA idea sounds good, the linked manual is old and for 32-bit Linux, and your processor model makes me think you probably have a 64-bit OS, which makes me wonder how much of it is still correct. And in the worst case, a bug in your test code could wreck your hard drive.
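As a C sketch of the movntps idea, the _mm_stream_ps() intrinsic issues non-temporal stores that bypass the caches, so the disturbance thread does not pollute the cache it shares with its sibling hyperthread; the buffer size below is an assumption, and the loop is only unrolled four times rather than sixteen.

    #include <stdlib.h>
    #include <xmmintrin.h>                  /* SSE: _mm_stream_ps, _mm_set1_ps, _mm_sfence */

    #define N_FLOATS (64UL * 1024 * 1024)   /* 256 MB of floats */

    int main(void)
    {
        float *buf = aligned_alloc(16, N_FLOATS * sizeof(float));   /* streaming stores need 16-byte alignment */
        if (!buf)
            return 1;

        const __m128 v = _mm_set1_ps(1.0f);
        for (;;) {
            for (size_t i = 0; i < N_FLOATS; i += 16) {             /* 4 stores = 64 bytes per iteration */
                _mm_stream_ps(buf + i,      v);
                _mm_stream_ps(buf + i + 4,  v);
                _mm_stream_ps(buf + i + 8,  v);
                _mm_stream_ps(buf + i + 12, v);
            }
            _mm_sfence();                   /* make the streaming stores globally visible */
        }
        return 0;
    }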

Measuring CPU clocks consumed by a process

I have written a program in C. It's a program created as a result of research. I want to compute the exact number of CPU cycles the program consumes. The exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
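In C, the same measurement is usually done with the __rdtsc() intrinsic (GCC/Clang on x86); a sketch follows, with work() as a hypothetical block under test and the caveats above (core migration, lack of serialisation) still applying.

    #include <stdio.h>
    #include <x86intrin.h>                      /* __rdtsc() */

    static unsigned long long work(void)        /* hypothetical code under test */
    {
        volatile unsigned long long sum = 0;    /* volatile so the loop is not folded away */
        for (int i = 0; i < 1000000; i++)
            sum += i;
        return sum;
    }

    int main(void)
    {
        unsigned long long start  = __rdtsc();
        unsigned long long result = work();
        unsigned long long end    = __rdtsc();

        printf("result=%llu, ~%llu TSC ticks\n", result, end - start);
        return 0;
    }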
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).