force some data on L1 cache - c

Apologies about this simple question. Still struggling with some of the memory concepts here. Question is: Suppose I have a pre-computed array A that I want to access repeatedly. Is there a way to tell a C program to keep this array as close as possible to the CPU cache for fastest access? Thanks.

There is no way to force an array into the L1/L2 cache on most architectures, and it is usually not necessary: if you access the array frequently, it is unlikely to be evicted from the cache.
On some architectures there is a set of instructions that lets you hint to the processor that a memory location will soon be needed, so that it can start loading it into the L1/L2 cache early. This is called prefetching; see the _mm_prefetch intrinsic for example ( http://msdn.microsoft.com/en-us/library/84szxsww(v=vs.80).aspx ). Still, this is unlikely to be needed if you're accessing a small array.
The general advice is: make your data structures cache-efficient first (put related data together, pack data, etc.), and try prefetching later if the profiler tells you that you're still spending time on cache misses and you can't improve the data layout any further.

Related

Does labelling a block of memory volatile imply the cache is always bypassed? [duplicate]

The cache is controlled by the cache hardware transparently to the processor, so if we use volatile variables in a C program, how is it guaranteed that my program reads data each time from the actual memory address specified and not from the cache?
My understanding is that:
The volatile keyword tells the compiler that references to the variable shouldn't be optimized away and should be performed as written in the code.
The cache is controlled by the cache hardware transparently, so when the processor issues an address, it doesn't know whether the data is coming from the cache or from memory.
So, if I have a requirement to read a memory address every time it is needed, how can I make sure it is read from the actual address and not from the cache?
Somehow, these two concepts don't fit together well. Please clarify how it's done.
(Assume a write-back cache policy, if that is needed to analyze the problem.)
Thank you,
Microkernel :)
Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. This causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or stored in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register versus a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I will suggest that most hardware registers should be defined as volatile.
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
Hope that helps. If I've missed something, let me know and I'll expand my answer.
From your question, there is a misconception on your part.
The volatile keyword is not related to the cache in the way you describe.
When volatile is specified for a variable, it tells the compiler not to perform certain optimizations, because the variable can be changed unexpectedly by other parts of the program.
What is meant is that the compiler should not reuse a value already loaded into a register, but should access memory again, since the value in the register is not guaranteed to match the value stored in memory.
The rest, concerning the cache memory, is not directly the programmer's concern.
I mean that keeping the CPU caches synchronized with RAM is an entirely different subject.
My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done through setting PAGE_NOCACHE when calling VirtualProtect.
For a somewhat different purpose, the SSE 2 instructions have the _mm_stream_xyz instructions to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.
Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize it: starting with the Pentium Pro, Intel (and AMD, which copied the feature) had these MTRR registers, which can set uncached, write-through, write-combining, write-protect, or write-back attributes on ranges of memory.
Starting with the Pentium III (but, as far as I know, only really useful on the 64-bit processors), the MTRRs are still honored, but they can be overridden by the Page Attribute Table (PAT), which lets the CPU set a memory type for each individual page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining. This lets the cache store up the writes and it relaxes all of the memory write ordering rules to allow very high-speed burst writes to a graphics card.
But for your purposes you would want either a MTRR or a PAT setting of either uncached or write-through.
As you say, the cache is transparent to the programmer. The system guarantees that you always see the value that was last written if you access an object through its address. The "only" thing you may incur from a stale value sitting in a cache is a runtime penalty.
volatile makes sure that the data is re-read every time it is needed, but it says nothing about any hardware cache between the CPU and memory. If you need to read the actual data from memory and not cached data, you have two options:
Use a board where said data is not cached. This may already be the case if you address some I/O device.
Use specific CPU instructions that bypass the cache. This is used, for example, when you need to scrub memory to flush out possible SEU (single-event upset) errors.
The details of second option depend on OS and/or CPU.
Using the _Uncached keyword may help in an embedded OS, like MQX:
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr,data) (*((volatile _Uncached unsigned int *)(addr)) = data)

Programmatically find the number of cache levels

I am a newbie in C programming. I have an assignment to find the number of data cache levels in the CPU, and also the hit time of each level. I am looking at C Program to determine Levels & Size of Cache but finding it difficult to interpret the results. How is the number of cache levels revealed?
Any pointers will be helpful.
Assuming you don't have a way to cheat (like some way of getting that information from the operating system or some CPU identification register):
The basic idea is that (by design), your L1 cache is faster than your L2 cache which is faster than your L3 cache... In any normal design, your L1 cache is also smaller than your L2 cache which is smaller than your L3 cache...
So you want to allocate a large-ish block of memory and then access (read and write) it sequentially[1] until you notice that the time taken to perform X accesses has risen sharply. Then keep going until you see the same thing again. You would need to allocate a memory block larger than the largest cache you are hoping to discover.
This requires access to some low-overhead timestamp counter for the actual measurement (as pointed out in the referred-to answer).
[1] or depending on whether you want to try to fool any clever prefetching that may skew the results, randomly within a sequentially progressing N-byte block.

Storing C/C++ variables in processor cache instead of system memory

On the Intel x86 platform running Linux, in C/C++, how can I tell the OS and the hardware to store a value (such as a uint32) in L1/L2 cache, and not in system memory? For example, let's say either for security or performance reasons, I don't want to store a 32-bit key (a 32-bit unsigned int) in DRAM, and instead I would like to store it only in the processor's cache. How can I do this? I'm using Fedora 16 (Linux 3.1 and gcc 4.6.2) on an Intel Xeon processor.
Many thanks in advance for your help!
I don't think you are able to force a variable to be stored in the processor's cache, but you can use the register keyword to suggest to the compiler that a given variable should be allocated into a CPU register (modern compilers treat this as a hint at best), declaring it like:
register int i;
There are no CPU instructions on x86 (or indeed any platform that I'm aware of) that will allow you to force the CPU to keep something in the L1/L2 cache, let alone any mechanism for exposing such an extremely low-level detail to higher-level languages like C/C++.
Saying you need to do this for "performance" is meaningless without more context about what sort of performance you're looking at. Why is your program so tightly dependent on having the data in cache alone? Saying you need this for security just seems like bad security design. In either case, you have to provide a lot more detail about what exactly you're trying to do here.
Short answer: you can't. That is not what those caches are for; they are fed from main memory to speed up access.
There are ways to ensure the caches are used for certain data, but it will still reside in RAM, and in a pre-emptive multitasking operating system you cannot guarantee that your cache contents will not be blown away by a context switch between any two instructions. The exceptions are "stopping the world" or low-level atomic operations, but those are for very short sequences of instructions that simply cannot be interrupted, like increment-and-fetch for spinlocks, not for processing a cryptographic algorithm in one go.
You can't use cache directly but you can use hardware registers for integers, and they are faster.
If you really want performance, a variable is better off in a CPU register.
If you cannot use a register, for example because you need to share the same value across different threads or cores (multicore is getting common now!), you need to store the variable into memory.
As already mentioned, you cannot force some memory into the cache using a call or keyword.
However, caches aren't entirely stupid: if your memory block is used often enough, you shouldn't have a problem keeping it in the cache.
Keep in mind that if you write to this memory location a lot from different cores, you're going to strain the cache-coherency logic in the processor, because it needs to make sure that all the caches and the actual memory below them are kept in sync.
Put simply, this would reduce overall performance of the CPU.
Note that the opposite (do not cache) does exist as a property you can assign to parts of your heap memory.

Code optimization

Suppose I have a big structure (with a lot of member variables) whose pointer is passed to many functions in my code. Some member variables of this structure are used very often, in almost all functions.
If I put those frequently used member variables at the beginning of the structure declaration, will it optimize the code for MCPS (million cycles per second, i.e., the time consumed by the code)? Will frequently accessed members be accessed more efficiently, in less time, if they are placed at the top rather than randomly or at the bottom of the structure declaration? If yes, what is the logic?
If I have a structure member being accessed in some function as follows:
structurepointer1->member_variable
Will it help, in terms of MCPS, if I assign it to a local variable and then access the local variable, as shown below?
local_variable = structurepointer1->member_variable;
If yes, then how does it help?
1) The position of a field in a structure should have no effect on its access time except to the extent that, if your structure is very large and spans multiple pages, it may be a good idea to position members that are often used in quick succession close together in order to increase locality of reference and try to decrease cache misses.
2) Maybe / maybe not. In fact it may make things slower. If the variable is not volatile, your compiler may be smart enough to store the field in a register anyway. Even if not, your processor will cache its value, though this may not help if its uses are somewhat far apart, with lots of other memory accesses in between. If the value would have either been stored in a register or stayed in your processor's cache, then assigning it to a local is only unnecessary extra work.
Standard Optimizations Disclaimer: Always profile before optimizing. Make sure that what you are trying to optimize is worth optimizing. Always profile your attempted optimizations and make sure they actually made things faster (and not slower).
First, the obligatory disclaimer: for all performance questions, you must profile the code to see where improvements can be made.
In general though, anything you can do to keep your data in the processor cache will help. Putting the most commonly accessed items close together will facilitate this.
I know this is not really answering your question, but before you delve into super-optimizing your code, go through this presentation http://dl.fefe.de/optimizer-isec.pdf. I saw it live and it was a good eye opening experience showing compilers are getting far more advanced in optimization than we tend to think and readable code is more important than small optimizations.
On 2, you are most likely better off not declaring a local variable. The compiler is usually smart enough to figure out when and how a variable is used and to utilize registers to keep it around.
Also, I would second Mark Ransom's suggestion, profile the code before making assumptions about bottlenecks.
I think your question is related to data alignment and structure padding. In modern compilers this is handled automatically most of the time, to avoid the alignment faults that could otherwise happen on memory access. You can read about this here. Of course you can change the alignment of your data, but you would need to specify some compiler options to disable auto-alignment and rearrange the fields of the structure to match the architecture you are targeting.
I would say this is a very low level optimization.
The location of a field in the structure is irrelevant in itself, as the offset will be calculated by the compiler. A more promising optimization is to make sure that your most-used fields are aligned to the word size of your processor.
If you are using the variable local to a function, this should have no impact. If you are passing it to other functions (separate from the larger structure) than that might help a bit.
As with all of the other answers, you need to run a profile baseline before optimizing, to make sure changes are effective. If you're worried about execution time, profile your algorithms and optimize them before you worry about the code a compiler creates, more bang for the buck.
Also, if you want to know what is going to happen, you should consider compiling your c code into assembly output. This will give you an idea of what the compiler is going to do and how you may go about further "fine tuning".
Structure access is almost always indexed indirect access. The assembly code will effectively fetch memory using the pointer to the structure as the base plus an offset to reach the right field. This used to be an expensive operation, but on modern CPUs it's probably not that slow.
This depends on the locality of the data being accessed. First and foremost, accessing the structure the first time will be the most expensive. Accessing the data afterwards can be quick if it is already in a processor register; however, this may not be the case depending on the processor used. Storing to a local variable should be less expensive, since the memory-access instructions for such an operation are cheaper. Again, I think nowadays processors are fast enough that this optimization is minimal.
I still think that there are probably better places to optimize your code. It is good though that there is someone out there that thinks about this still, in a world of code bloat ;) Embedded computing, you still need to worry about these things.
This depends on the size of your fields and caching details. Look at using valgrind for profiling this.
If you do this dereferencing a lot, it costs time. A decent optimizing compiler will effectively perform the store-the-pointer-into-a-local-variable optimization you describe. It will do a better job than you will, and it will do it in an architecture-specific way.
What you want to do in this situation, overall, is make sure that you test the correctness and the performance of each optimization you are trying. Otherwise you are poking around in the dark.
Remember that fine optimizations at the C line level will virtually never trump higher-order algorithm/design optimizations.
Yes, it can help. But as people have already stated, it depends and can even be counter productive.
The reason I think it can help has to do with pointer aliasing. If you access your variables via a pointer, and the compiler cannot guarantee that the structure was not changed elsewhere (via your pointer or another one), it will generate code to reload or save the variable even if it could have held the value in a register. Here is an example to show what I mean:
calc = structurepointer1->member_variable * x + c;
/* Do something in function which doesn't involve member_variable; */
function(structurepointer1);
calc2 = structurepointer1->member_variable * y;
The compiler will make a memory access for both references to member_variable, because it can not be sure that the called function has modified that field.
If you're sure the function doesn't change that value, doing this saves one memory access:
int temp = structurepointer1->member_variable;
calc = temp * x + c;
function(structurepointer1);
calc2 = temp * y;
There's also another reason to use a local variable for your member variables: it can make the code much more readable.

How can adding data to a segment in flash memory screw up a program's timing?

I have a real-time embedded app with the major cycle running at 10KHz. It runs on a TI TMS320C configured to boot from flash. I recently added an initialized array to a source file, and all of a sudden the timing is screwed up (in a way too complex to explain well - essentially a serial port write is no longer completing on time.)
The things about this that baffle me:
I'm not even accessing the new data, just declaring an initialized array.
It is size-dependent - the problem only appears if the array is >40 words.
I know I'm not overflowing any data segments in the link map.
There is no data caching, so it's not due to disrupting cache consistency.
Any ideas on how simply increasing the size of the .cinit segment in flash can affect the timing of your code?
Additional Info:
I considered that maybe the code had moved, but it is well separated from the data. I verified through the memory map that all the code segments have the same addresses before and after the bug. I also verified that none of the segments are full - the only addresses that change in the map are a handful in the .cinit section. That section contains data values used to initialize variables in RAM (like my array). It shouldn't ever be accessed after main() is called.
My suspicions would point to a change in alignment between your data/code and the underlying media/memory. Adding to your data would change the locations of memory in your heap (depending on the memory model) and might put your code across a 'page' boundary on the flash device causing latency which was not there before.
Perhaps the new statically allocated array pushes existing data into slower memory regions, causing accesses to that data to be slower?
Does the problem recur if the array is the last thing in its chunk of address space? If not, looking in your map, try moving the array declaration so that, one by one, things placed after it are shuffled to be before it instead. This way you can pinpoint the relevant object and begin to work out why moving it causes the delay.
I'll take a risk and claim that you don't have a performance problem here, but rather some kind of memory corruption whose symptoms look like a performance problem. Adding an array to your executable changes the memory layout. So my guess is that you have a memory corruption that is mostly harmless (i.e., it overwrites an unused part of memory), and shifting your memory by more than 40 bytes turns it into a bigger problem. Which one it is, is the real question.
After more than a day staring at traces and generated assembly, I think I figured it out.
The root-cause problem turned out to be a design issue that caused glitches only if the ISR that kicked off the serial-port write collided with a higher-priority one. The timing just happened to work out so that adding a few extra instructions to one loop was enough to cause the two interrupts to collide.
So the question becomes: How does storing, but not accessing, additional data in flash memory cause additional instructions to be executed?
It appears that the answer is related to, but not quite the same as, the suggestions by Frosty and Frederico. The new array does move some existing variables, but not across page boundaries or into slower regions (on this board, access times should be the same for all regions). But it does change the offsets of some frequently accessed structures, which causes the optimizer to issue slightly different instruction sequences for accessing them. One data alignment may cause a one-cycle pipeline stall where the other does not. And those few instructions shifted the timing enough to expose the underlying problem.
Could the initialization be overwriting another adjacent piece of code?
Are there any structs or variables that use the array, that are now bigger and could cause a stack overflow?
It could be a bank or page conflict as well. Maybe you have two routines that are called quite often (interrupt handlers or so) that used to be on the same page and are now split across two pages.