Does labelling a block of memory volatile imply the cache is always bypassed?

The cache is controlled by cache hardware transparently to the processor, so if we use volatile variables in a C program, how is it guaranteed that my program reads data from the actual memory address each time and not from the cache?
My understanding is that,
The volatile keyword tells the compiler that references to the variable shouldn't be optimized away and should be performed as written in the code.
The cache is controlled by cache hardware transparently, so when the processor issues an address, it doesn't know whether the data is coming from the cache or from memory.
So, if I am required to read a memory address every time it is accessed, how can I make sure the read is served from that address and not from the cache?
Somehow, these two concepts are not fitting together well. Please clarify how it's done.
(Assume a write-back cache policy, if that matters for analyzing the problem.)
Thank you,
Microkernel :)

Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. It causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or stored in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register or a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I suggest that most hardware registers be defined as volatile.
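As a minimal sketch of what volatile does and does not buy you, the helper below polls a status register through a volatile pointer. The names `poll_until_set`, `reg`, `mask`, and `max_polls` are hypothetical and not taken from any vendor header. The qualifier makes the compiler issue a real load on every iteration; it says nothing about whether that load is served from a cache, which still has to be arranged through the memory mapping.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: poll a memory-mapped status register until any bit
 * in `mask` is set, or give up after `max_polls` reads.
 * The volatile qualifier forces one load per loop iteration; without it the
 * compiler may read *reg once and spin on the copy held in a CPU register. */
bool poll_until_set(const volatile uint32_t *reg, uint32_t mask,
                    unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; ++i) {
        if ((*reg & mask) != 0u) {
            return true;   /* bit observed */
        }
    }
    return false;          /* gave up after max_polls reads */
}
```

On a real target `reg` would be the register's address cast to a pointer; on a host, an ordinary `uint32_t` variable can stand in for the register when exercising the logic.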
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
Hope that helps. If I've missed something, let me know and I'll expand my answer.

From your question there is a misconception on your part.
Volatile keyword is not related to the cache as you describe.
When the keyword volatile is specified for a variable, it tells the compiler not to perform certain optimizations, because the variable can change unexpectedly from outside the code the compiler can see.
What is meant here, is that the compiler should not reuse the value already loaded in a register, but access the memory again as the value in register is not guaranteed to be the same as the value stored in memory.
The rest concerning the cache memory is not directly related to the programmer.
I mean the synchronization of any cache memory of CPU with the RAM is an entirely different subject.

My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done through setting PAGE_NOCACHE when calling VirtualProtect.
For a somewhat different purpose, SSE2 provides the _mm_stream_xyz intrinsics to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.

Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize, starting with the Pentium Pro, Intel CPUs (and AMD's, which copied the feature) have had these MTRRs, which can set uncached, write-through, write-combining, write-protect or write-back attributes on ranges of memory.
Starting with the Pentium III (though, as far as I know, only really useful on the 64-bit processors), CPUs still honor the MTRRs, but these can be overridden by the Page Attribute Table, which lets the OS set a memory type for each page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining. This lets the CPU's write-combining buffers accumulate the writes, and it relaxes the memory write ordering rules to allow very high-speed burst writes to a graphics card.
But for your purposes you would want an MTRR or PAT setting of uncached. Write-through only guarantees that writes reach memory; reads can still be served from the cache.
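To make the attribute list concrete, here is a small sketch of the memory-type encodings that the MTRRs and the PAT share, together with a helper that says whether a read from memory of that type may be satisfied from a cache line. The enum and function names are invented for this example; the numeric encodings and the read behavior are as I understand them from the Intel manuals, so check them against the manual for your CPU.

```c
/* Memory-type encodings used by both the MTRRs and the PAT on x86.
 * The enum and function names are made up for this sketch. */
enum x86_mem_type {
    X86_MT_UC       = 0x00, /* uncacheable */
    X86_MT_WC       = 0x01, /* write-combining */
    X86_MT_WT       = 0x04, /* write-through */
    X86_MT_WP       = 0x05, /* write-protected */
    X86_MT_WB       = 0x06, /* write-back */
    X86_MT_UC_MINUS = 0x07  /* PAT only: uncached unless an MTRR says WC */
};

/* Returns 1 if a load from memory of this type may be satisfied from a
 * cache line instead of going out to the bus, 0 otherwise. */
int x86_reads_may_hit_cache(enum x86_mem_type t)
{
    switch (t) {
    case X86_MT_WT:
    case X86_MT_WP:
    case X86_MT_WB:
        return 1;
    case X86_MT_UC:
    case X86_MT_WC:
    case X86_MT_UC_MINUS:
    default:
        return 0;
    }
}
```

For a register whose value changes behind the CPU's back, the types for which this helper returns 0 are the ones that avoid stale reads.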

As you say cache is transparent to the programmer. The system guarantees that you always see the value that was last written to if you access an object through its address. The "only" thing that you may incur if an obsolete value is in your cache is a runtime penalty.

volatile makes sure that data is read every time it is needed, without regard to any cache between the CPU and memory. But if you need to read actual data from memory and not cached data, you have two options:
Make a board where said data is not cached. This may already be the case if you address some I/O device,
Use specific CPU instructions that bypass the cache. This is used when you need to scrub memory for activating possible SEU errors.
The details of second option depend on OS and/or CPU.

Using the _Uncached keyword may help in an embedded OS such as MQX:
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr, data) (*((volatile _Uncached unsigned int *)(addr)) = (data))


Can we prevent shared mutable state if we only allow local variables to be mutable?

In a non-OOP programming language like C, if we only allow local variables to be mutated in every possible way (changing internal fields, re-assigning, ...) but disallow mutation of function arguments, will it help us prevent shared mutable state?
Note that in this case, function main can start 10 threads (functions) and each of those 10 threads will receive an immutable reference to the same variable (defined in main). But the main function can still change the value of that shared variable. So can this cause problems in concurrent/parallel software?
I hope the question is clear, but let me know if it's not.
P.S. Can "software transactional memory (STM)" solve the potential problems? Like what Clojure offers?
Yes and no... this depends on the platform, the CPU, the size of the shared variable and the compiler.
On an NVIDIA forum, in relation to GPU operations, a similar question was very neatly answered:
When multiple threads are writing or reading to/from a naturally aligned location in global memory, and the datatype being read or written is the same by all threads, and the datatype corresponds to one of the supported types for single-instruction thread access ...
(Many GPUs can handle 16-byte (128-bit) words in a single instruction when the size is known in advance, but most CPUs are limited to 32 or 64 bits per instruction.)
I'm leaving aside the chance that threads might read from CPU registers instead of actual memory (ignoring updates to the data); this is mostly solvable using the volatile keyword in C.
However, conflicts and memory corruption can still happen.
Some memory store operations are split up internally (by the CPU) or by your compiler (in the machine code) into several separate stores.
In these cases, mostly on multi-core machines (but not only), there's the risk that the "reader" will receive information that was partially updated and has no meaning whatsoever (e.g., half of a pointer is valid and the other half isn't).
Variables larger than 32 or 64 bits will usually be updated one CPU "word" (not an OS word) at a time.
Byte-sized variables are very safe, which is why they are often used as flags, but they should probably still be handled using the atomic_* load/store operations provided by the OS or the compiler.
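A minimal sketch of the flag pattern using C11 `<stdatomic.h>`, assuming a single writer and a compiler with C11 atomics. The names `payload`, `ready`, `publish`, and `try_consume` are hypothetical. The release store and acquire load are what keep a reader from seeing the flag set before the data it guards has been written.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state: one writer fills in `payload`, then raises
 * `ready`; readers check `ready` before touching `payload`. */
static int payload;
static atomic_bool ready;   /* zero-initialized, so it starts out false */

/* Writer side: store the data, then publish it with a release store so the
 * payload write cannot be reordered after the flag write. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above.
 * Returns true and copies the payload only once the flag has been seen. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

This covers one writer publishing once; multiple writers or repeated updates need a lock or a more elaborate protocol.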

Erasing sensitive information from memory

After reading this question I'm curious how one would do this in C. When receiving the information from another program, we probably have to assume that the memory is writable.
I have found this stating that a regular memset may be optimized out, and this comment stating that memsets are the wrong way to do it.
The example you have provided is not quite valid: the compiler can optimize out a variable setting operation when it can detect that there are no side effects and the value is no longer used.
So, if your code uses some shared buffer, accessible from multiple locations, the memset would work fine. Almost.
Different processors use different caching policies, so you might have to use memory barriers to ensure the data (zeros) has reached the memory chip from the cache.
So, if you are not worried about hardware-level details, making sure the compiler can't optimize out the operation is sufficient. Be careful with a memset on a block just before releasing it: that is exactly the kind of store a compiler may treat as dead and remove.
If you want to ensure the data is removed from all hardware, you need to check how data caching is implemented on your platform and use appropriate code to force a cache flush, which can be non-trivial on a multi-core machine.
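One common way to keep the compiler from dropping the wipe is to write through a volatile pointer. The sketch below uses a hypothetical helper name, `secure_zero`, and addresses only the compiler side of the problem; it does nothing about caches, swap, or copies the data may have left elsewhere.

```c
#include <stddef.h>

/* Hypothetical helper: overwrite `len` bytes at `buf` with zeros through a
 * volatile pointer. Every store is a volatile access, so the compiler is
 * not allowed to discard it as a dead store the way it may discard a plain
 * memset on a buffer that is never read again. */
void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;
    while (len--) {
        *p++ = 0;
    }
}
```

Where the platform offers one, a dedicated routine such as `explicit_bzero` (glibc, BSDs), `memset_s` (C11 Annex K, optional), or `SecureZeroMemory` (Windows) is usually preferable to a hand-rolled loop.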

Storing C/C++ variables in processor cache instead of system memory

On the Intel x86 platform running Linux, in C/C++, how can I tell the OS and the hardware to store a value (such as a uint32) in L1/L2 cache, and not in system memory? For example, let's say either for security or performance reasons, I don't want to store a 32-bit key (a 32-bit unsigned int) in DRAM, and instead I would like to store it only in the processor's cache. How can I do this? I'm using Fedora 16 (Linux 3.1 and gcc 4.6.2) on an Intel Xeon processor.
Many thanks in advance for your help!
I don't think you are able to force a variable to be stored in the processor's cache, but you can use the register keyword to suggest to the compiler that a given variable should be allocated into a CPU register, declaring it like:
register int i;
There are no CPU instructions on x86 (or indeed any platform that I'm aware of) that will allow you to force the CPU to keep something in the L1/L2 cache, let alone expose such an extremely low-level detail to higher-level languages like C/C++.
Saying you need to do this for "performance" is meaningless without more context about what sort of performance you're looking at. Why is your program so tightly dependent on having access to data in cache alone? Saying you need this for security just seems like bad security design. In either case, you have to provide a lot more detail about what exactly you're trying to do here.
Short answer, you can't - that is not what those caches are for - they are fed from main memory, to speed up access, or allow for advanced techniques like branch prediction and pipelining.
There are ways to ensure the caches are used for certain data, but the data will still reside in RAM. In a pre-emptive multitasking operating system, you cannot guarantee that your cache contents will not be blown away by a context switch between any two instructions, except by 'stopping the world' or by using low-level atomic operations. Those are generally meant for very short sequences of instructions that simply cannot be interrupted, like increment-and-fetch for spinlocks, not for processing cryptographic algorithms in one go.
You can't use cache directly but you can use hardware registers for integers, and they are faster.
If you really want performance, a variable is better off in a CPU register.
If you cannot use a register, for example because you need to share the same value across different threads or cores (multicore is getting common now!), you need to store the variable into memory.
As already mentioned, you cannot force some memory into the cache using a call or keyword.
However, caches aren't entirely stupid: if your memory block is used often enough, you shouldn't have a problem keeping it in the cache.
Keep in mind that if you happen to write to this memory place a lot from different cores, you're going to strain the cache coherency blocks in the processor because they need to make sure that all the caches and the actual memory below are kept in sync.
Put simply, this would reduce overall performance of the CPU.
Note that the opposite (do not cache) does exist as a property you can assign to parts of your heap memory.
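The coherency strain described above is worst when several cores write to different variables that happen to share one cache line (false sharing). A minimal sketch of the usual mitigation, giving each core's data its own line, follows. The 64-byte line size and the names `padded_counter`, `per_core_counters`, and `bump_counter` are assumptions made for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache-line size; 64 bytes is common on x86 but not universal. */
#define CACHE_LINE_SIZE 64

/* One counter per core. Aligning the member to the line size also pads the
 * struct out to a full line, so neighbouring array elements never share a
 * cache line. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) unsigned long value;
};

/* Hypothetical per-core array, indexed by core id (4 cores assumed). */
struct padded_counter per_core_counters[4];

/* Each core increments only its own slot. */
void bump_counter(unsigned core)
{
    per_core_counters[core].value++;
}
```

The cost is memory: each counter now occupies a whole line, which is only worth it for data that really is written from several cores at once.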

Concurrent access to struct member

I'm using 32-bit microcontroller (STR91x). I'm concurrently accessing (from ISR and main loop) struct member of type enum. Access is limited to writing to that enum field in the ISR and checking in the main loop. Enum's underlying type is not larger than integer (32-bit).
I would like to make sure that I'm not missing anything and I can safely do it.
Provided that 32 bit reads and writes are atomic, which is almost certainly the case (you might want to make sure that your enum's word-aligned) then that which you've described will be just fine.
As paxdiablo & David Knell said, generally speaking this is fine. Even if your bus is < 32 bits, chances are the instruction's multiple bus cycles won't be interrupted, and you'll always read valid data.
What you stated, and what we all know, but it bears repeating, is that this is fine for a single-writer, N-reader situation. If you had more than one writer, all bets are off unless you have a construct to protect the data.
If you want to make sure, find the compiler switch that generates an assembly listing and examine the assembly for the write in the ISR and the read in the main loop. Even if you are not familiar with ARM assembly, you should quickly be able to discern whether or not the reads and writes are atomic.
ARM supports 32-bit aligned reads that are atomic as far as interrupts are concerned. However, make sure your compiler doesn't try to cache the value in a register! Either mark it as volatile, or use an explicit memory barrier. On GCC this can be done like so:
__sync_synchronize(); /* full barrier: the load below cannot reuse a stale register copy */
int tmp = yourvariable;
Note, however, that current versions of GCC perform a full memory barrier for __sync_synchronize, rather than just for the one variable, so volatile is probably better for your needs.
Further, note that your variable will be aligned automatically unless you are doing something weird (i.e., explicitly specifying the location of the struct in memory, or requesting a packed struct). Unaligned variables on ARM cannot be read atomically, so make sure it's aligned, or disable interrupts while reading.
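The single-writer arrangement discussed here might look like the sketch below. All names (`link_state`, `device`, `shared_dev`, `link_isr`, `link_is_up`) are hypothetical, and the ISR is an ordinary function so the logic can be tried on a host; on the STR91x it would be registered with the interrupt controller.

```c
#include <stdint.h>

/* Hypothetical state shared between an ISR (the only writer) and the main
 * loop (the reader). */
enum link_state { LINK_DOWN = 0, LINK_UP = 1, LINK_ERROR = 2 };

struct device {
    volatile enum link_state state; /* volatile: re-read on every access */
    uint32_t rx_count;              /* touched by the main loop only */
};

struct device shared_dev;           /* zero-initialized: starts LINK_DOWN */

/* Build-time check that the enum fits in one 32-bit word. */
_Static_assert(sizeof(enum link_state) <= sizeof(uint32_t),
               "enum wider than 32 bits");

/* Would be installed as the interrupt handler on real hardware. */
void link_isr(void)
{
    shared_dev.state = LINK_UP;
}

/* Main-loop check: take one snapshot of the shared field and test the
 * local copy, so the decision is based on a single read. */
int link_is_up(void)
{
    enum link_state snapshot = shared_dev.state;
    return snapshot == LINK_UP;
}
```

Whether the single read is atomic still depends on the compiler emitting one aligned 32-bit load, which is what the assembly listing would confirm.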
Well, it depends entirely on your hardware but I'd be surprised if an ISR could be interrupted by the main thread.
So probably the only thing you have to watch out for is if the main thread could be interrupted halfway through a read (so it may get part of the old value and part of the new).
It should be a simple matter of consulting the specs to ensure that interrupts are only processed between instructions (this is likely since the alternative would be very complex) and that your 32-bit load is a single instruction.
An aligned 32 bit access will generally be atomic (unless it were a particularly ludicrous compiler!).
However the rock-solid solution (and one generally applicable to non-32 bit targets too) is to simply disable the interrupt temporarily while accessing the data outside of the interrupt. The most robust way to do this is through an access function to statically scoped data rather than making the data global where you then have no single point of access and therefore no way of enforcing an atomic access mechanism when needed.
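A sketch of that access-function approach follows. The interrupt control calls are stand-ins written for this example (`irq_disable` and `irq_enable` only track nesting so the code runs on a host); on a real target they would be replaced by the toolchain's or vendor's mechanism. The 64-bit counter is chosen deliberately, since it cannot be read in one instruction on a 32-bit core.

```c
#include <stdint.h>

/* Stand-ins for the platform's interrupt control. Here they only count
 * nesting so the sketch can run on a host. */
static int irq_disable_depth;
static void irq_disable(void) { irq_disable_depth++; }
static void irq_enable(void)  { irq_disable_depth--; }

/* The shared data has file scope, so the accessors below are the only way
 * to reach it. */
static volatile uint64_t tick_count;

/* Called from the timer ISR: the only writer. */
void tick_isr(void)
{
    tick_count = tick_count + 1;
}

/* Called from the main loop: interrupts are off while the two-word value
 * is copied, so the ISR cannot update it halfway through the read. */
uint64_t ticks_get(void)
{
    irq_disable();
    uint64_t snapshot = tick_count;
    irq_enable();
    return snapshot;
}
```

Production code would normally save and restore the previous interrupt state instead of enabling unconditionally, so the accessor stays safe when called with interrupts already disabled.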

Is there a way to force a variable to be stored in the cache in C?

I just had a phone interview where I was asked this question. I am aware of ways to store in register or heap or stack, but cache specifically?
Not in C as a language. In GCC as a compiler - look for __builtin_prefetch.
You might be interested in reading What every programmer should know about memory.
Just to clear some confusion - caches are physically separate memories in hardware, but not in software abstraction of the machine. A word in a cache is always associated with address in main memory. This is different from the CPU registers, which are named/addressed separately from the RAM.
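For reference, a minimal sketch of `__builtin_prefetch` in use, assuming GCC or Clang. The function name and the prefetch distance are made up for this example and untuned. The builtin is only a hint: the CPU may ignore it, and nothing pins the line in the cache afterwards.

```c
#include <stddef.h>

/* How many elements ahead to prefetch; an untuned guess for illustration. */
#define PREFETCH_DISTANCE 16

/* Sums an array while hinting that a later element will be needed soon. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        if (i + PREFETCH_DISTANCE < n) {
            /* arguments: address, 0 = prefetch for read,
             * 3 = high temporal locality (keep in all cache levels) */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        total += data[i];
    }
    return total;
}
```

A plain linear scan like this is something hardware prefetchers already handle well, so the hint is more likely to pay off with irregular access patterns such as pointer chasing.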
In C, as defined by the C standard? No.
In C, as in some specific implementation on a specific platform? Maybe.
As cache is a CPU concept and is meaningless to the C language (and C also targets processors that have no cache, unlikely today but quite common in the old days), definitely no.
Trying to optimize such things by hand is also usually a quite bad idea.
What you can do is keep the job easy for the compiler by keeping loops very short and doing only one thing (good for the instruction cache), iterating over memory blocks in the right order (prefer accesses to consecutive cells in memory over sparse accesses), avoiding reuse of the same variables for different purposes (it introduces read-after-write dependencies), etc. If you are attentive to such details, the program is more likely to be optimized efficiently by the compiler and its memory accesses to be cached.
But it will still depend on actual hardware and even compiler may not guarantee it.
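The advice about iterating in the right order can be illustrated with a two-dimensional array. In this sketch (function names and the 64x64 size are arbitrary) both functions compute the same sum, but the first walks consecutive addresses while the second jumps a full row between accesses, which tends to use each fetched cache line poorly.

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Cache-friendly: C stores arrays row by row, so the inner loop touches
 * consecutive addresses. */
long sum_row_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; ++r) {
        for (size_t c = 0; c < COLS; ++c) {
            total += m[r][c];
        }
    }
    return total;
}

/* Same result, but consecutive accesses are COLS * sizeof(int) bytes
 * apart, so each access is likely to land on a different cache line. */
long sum_col_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; ++c) {
        for (size_t r = 0; r < ROWS; ++r) {
            total += m[r][c];
        }
    }
    return total;
}
```

How large the difference is depends on the array size relative to the cache and on the hardware, so it is worth measuring on the actual target.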
It depends on the platform, so if you were speaking to a company targeting current-generation consoles, you would need to know the PowerPC data cache intrinsics/instructions. On various platforms, you would also need to know the false sharing rules. Also, you can't cache from memory marked explicitly as uncached.
Without more context about the actual job or company or question, this would probably be best answered by talking about what not to do to keep memory references in the data cache.
If you are trying to force something to be stored in the CPU cache, I would recommend that you avoid trying to do so unless you have an overwhelmingly good reason. Manually manipulating the CPU cache can have all sorts of unintended consequences, not the least among them being coherency in multi-core or multi-CPU applications. This is something that is done by the CPU at run-time and is generally transparent to the programmer and the compiler for a good reason.
The specific answer will depend on your compiler and platform. If you are targeting a MIPS architecture, there is a CACHE instruction (assembly) which allows you to do CPU cache manipulations.
