How to use pld instruction in ARM - c

I use it like this.
__pld(pin[0], pin[1], pin[2], pin[3], pin[4]);
But I get this error.
undefined reference to `__pld'
What am I missing? Do I need to include a header file or something? I am using an ARM Cortex-A8; does it even support the pld instruction?

As shown in this answer, you can use inline assembler as per Clark. __builtin_prefetch is also a good suggestion. An important fact to know is how the pld instruction acts on the ARM; on some processors it does nothing, while on others it brings the data into the cache. It is only going to be effective for a read operation (or a read/modify/write). The other thing to note is that if it does work on your processor, it fetches an entire cache line, so the example of fetching the pin array doesn't need to specify all members.
You will get more performance by ensuring that pld data is cache aligned. Another issue, judging from the previous code: you will only gain performance with variables you read. In some cases, you are just writing to the pin array, and there is no value in prefetching those items. The ARM has a write buffer, so writes are batched together and will burst to an SDRAM chip automatically.
Grouping all read data together on a cache line will show the most performance improvement; the whole line can be prefetched with a single pld. Also, when you unroll a loop, the compiler will be able to see these reads and will schedule them earlier if possible so that they are filled in the cache; at least for some ARM CPUs.
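For example, a minimal sketch of that grouping, assuming a 64-byte cache line (the struct and function names here are mine, not from the question):
#include <stdint.h>

/* Put the fields the loop reads on one cache line so a single
   prefetch covers them all; 64 is an assumed line size. */
struct pins_read_mostly {
    uint32_t pin[5];
} __attribute__((aligned(64)));

struct pins_read_mostly inputs;

void process(void)
{
    __builtin_prefetch(&inputs, 0, 3);  /* one prefetch pulls the whole line */
    /* ... read inputs.pin[0..4] ... */
}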
Also, you may consider,
__attribute__((optimize("prefetch-loop-arrays")))
in the spirit of the accepted answer to the other question; probably the compiler will have already enabled this at -O3 if it is effective on the CPU you have specified.
Various compiler options can be specified with --param NAME=VALUE to give the compiler hints about the memory sub-system. This can be a very potent combination if you get the parameters correct; a sample invocation follows the list below.
prefetch-latency
simultaneous-prefetches
l1-cache-line-size
l1-cache-size
l2-cache-size
min-insn-to-prefetch-ratio
prefetch-min-insn-to-mem-ratio
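For example, a hypothetical invocation (the values are illustrative, not tuned; a Cortex-A8 has 64-byte L1 lines and a 32 KB L1 data cache):
gcc -O3 -mcpu=cortex-a8 -fprefetch-loop-arrays \
    --param l1-cache-line-size=64 \
    --param l1-cache-size=32 \
    --param simultaneous-prefetches=6 \
    foo.c -o foo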
Make sure you specify an -mcpu to the compiler that supports pld. If all is right, the compiler should do this automatically for you. However, sometimes you may need to do it manually.
For reference, here is gcc-4.7.3's ARM prefetch loop arrays code activation.
/* Enable sw prefetching at -O3 for CPUS that have prefetch, and we have deemed
   it beneficial (signified by setting num_prefetch_slots to 1 or more.)  */
if (flag_prefetch_loop_arrays < 0
    && HAVE_prefetch
    && optimize >= 3
    && current_tune->num_prefetch_slots > 0)
  flag_prefetch_loop_arrays = 1;

Try http://www.ethernut.de/en/documents/arm-inline-asm.html
In GCC it might look like this:
Example from: http://communities.mentor.com/community/cs/archives/arm-gnu/msg01553.html
and a usage of pld:
__asm__ __volatile__(
    "pld\t[%0]"
    :
    : "r" (first));

You may want to look at gcc's __builtin_prefetch. I reproduced it here for your convenience:
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.
The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
for (i = 0; i < n; i++)
{
a[i] = a[i] + b[i];
__builtin_prefetch (&a[i+j], 1, 1);
__builtin_prefetch (&b[i+j], 0, 1);
/* ... */
}
Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next will not fault if p->next is not a valid address, but evaluation will fault if p is not a valid address.
If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning.

undefined reference to `__pld'
To answer the question about the undefined reference, __pld is an ARM compiler intrinsic. See __pld intrinsic in the ARM manual.
Perhaps GCC does not recognize the ARM intrinsic.
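If the same source must build under both armcc and GCC, a hedged sketch of a portability shim (the macro name PREFETCH is mine; __CC_ARM is armcc's predefined macro):
/* Use the armcc intrinsic where available, otherwise GCC's builtin. */
#if defined(__CC_ARM)
    #define PREFETCH(p) __pld(p)
#else
    #define PREFETCH(p) __builtin_prefetch(p)
#endif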

Related

How do I determine when the compiler is caching values and not reading them from the proper registers or memory locations?

I am running into a problem where I am polling flags in an ARM processor for SPI transmission completion and it hangs in the while loop. I am not entirely sure what is going on and I suspect that the compiler is not actually fetching the value stored in the register.
//LPSPI1 is part of the ARM CMSIS ABI
void WriteCmdDataLCD(uint8_t *data_buffer, uint32_t data_length)
{
    volatile uint32_t inc = 0;
    ...
    while (inc < data_length)
    {
        LPSPI1->TDR = *(data_buffer + inc);
        while (((LPSPI1->SR & LPSPI_SR_TDF_MASK) >> LPSPI_SR_TDF_SHIFT) == 0);
        LPSPI1->SR = LPSPI_SR_TDF_MASK;
        inc++;
    }
    ...
The LPSPI1 TX FIFO is 16 words deep; if I write 16 words or less to this FIFO there is no problem. However, when I write more than 16, it gets stuck in that while loop, presumably because TDF (Transmission Done Flag) never evaluates to 1 (for complete). The debugger is too invasive, because if I break program execution and restart it, it gets beyond that point.
ICACHE and DCACHE are disabled. I am at a loss for next steps.
If it matters, the ARM processor is the i.MX RT1060.
Edit:
So this is a phantom problem and is related to hardware behavior. I need to rethink how to use this peripheral.
How do I determine when the compiler is caching values
The compiler is not caching anything (beyond whatever it does internally while it runs). The hardware that runs the compiled program, however, might cache something.
ICACHE and DCACHE are disabled. I am at a loss for next steps.
Then nothing will be cached and your problem simply does not exist. But even if you enable them, the Cortex-M7 guarantees that the core's accesses to memory are coherent. In your case, you poll for the data, and the core reads (and writes) hardware registers and memory.
If you use DMA with the cache ON - you will have to invalidate the cache before accessing the data. But it is not the case here.
There is a lot of confusion about volatile in this question and its answers.
To clarify:
volatile uint32_t inc = 0; - inc does not have to be volatile.
Hardware registers are declared as volatile in the NXP-provided headers. Nothing to worry about; those headers are used by thousands of programmers around the world.
data buffers do not have to be volatile.
Your check can be simplified to:
while(!(LPSPI1->SR & LPSPI_SR_TDF_MASK));
The data register write is incorrect. It should be:
*(volatile uint8_t *)&LPSPI1->TDR = *(data_buffer+inc);
otherwise the compiler will generate a 32-bit write instruction and 4 values will be stored in the FIFO instead of one (your data + 3 zero bytes). It is a very common beginner's trap. Here you can see the difference in the generated code: https://godbolt.org/z/x9Knd5
The order of operations is also incorrect (you should first check whether you can write, then write). It should be:
while(!(LPSPI1->SR & LPSPI_SR_TDF_MASK));
*(volatile uint8_t *)&LPSPI1->TDR = *(data_buffer+inc);
The question of why it stops in the while loop is almost impossible to answer without the full code and access to your hardware. I can only guess. Probably the peripheral is not correctly configured and/or some error conditions are detected. You never check the error flags.
Presumably because TDF (Transmission Done Flag)
It is the Transmit Data Flag, and it does not indicate the end of the transmission.
There is another flag, called TCF, which indicates the end of the transfer. I strongly advise reading the relevant chapter of the Reference Manual carefully and with understanding before starting to write the software.
How do I determine when the compiler is caching values and not reading
them from the proper registers or memory locations?
The compiler might want to "cache" all values which are not declared with volatile.
You might create a test:
unsigned int maybe_cached = REGISTER;
volatile unsigned int never_cached = REGISTER;
/* some code */
if (maybe_cached != never_cached) {
    /* something had been cached */
}
PS: you should use volatile at least for all asynchronous operations.

Is assigning a pointer in C program considered atomic on x86-64

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says: "In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of."
My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with gcc's -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. The reader should get either the previous value of the pointer or the latest value, and no garbage value in between.
If it is not considered atomic, please let me know how can I use the gcc atomic builtins or maybe memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In the old days people used volatile to prevent that reordering, but volatile was never intended for use with threads and doesn't provide a means to specify a less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
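For the one-writer/one-reader pointer in the question, a minimal C11 sketch (the names are mine) could be:
#include <stdatomic.h>

_Atomic(char *) shared_ptr;  /* one writer thread, one reader thread */

void publish(char *newval)
{
    /* release: a reader that sees the new pointer also sees the
       writes made to the pointed-to data before this store */
    atomic_store_explicit(&shared_ptr, newval, memory_order_release);
}

char *consume(void)
{
    return atomic_load_explicit(&shared_ptr, memory_order_acquire);
}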
For almost all architectures, pointer load and store are atomic. A once-notable exception was the 8086/80286, where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load, but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X which the other thread expects to find. Without synchronization, the other thread might see the new pointer value; however, what it points to might not be up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want; you should generally use atomic loads to read them, too.
#include <stdatomic.h> // C11 way
_Atomic char *c11_shared_var; // all access to this is atomic, functions needed only if you want weaker ordering
void foo(){
atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}
char *plain_shared_var; // GNU C
// This is a plain C var. Only specific accesses to it are atomic; be careful!
void foo() {
__atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, mo_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice, x86-64 ABIs have alignof(void*) == 8, so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct, which violates the ABI), so you can use __atomic_store_n on them. It should compile to what you want (a plain store, no overhead) and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variable in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches is sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have register move instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU and a 64-bit integer on a 64-bit CPU are atomic, and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers; that hasn't been the case since the 386).
You should, however, be careful not to hit variable caching problems (i.e. one thread writing a pointer and another trying to read it but getting an old value from the cache); use volatile as needed to prevent this.

Does labelling a block of memory volatile imply the cache is always bypassed? [duplicate]

Cache is controlled by cache hardware transparently to the processor, so if we use volatile variables in a C program, how is it guaranteed that my program reads data each time from the actual memory address specified and not from the cache?
My understanding is that:
The volatile keyword tells the compiler that the variable references shouldn't be optimized away and should be read as programmed in the code.
The cache is controlled by cache hardware transparently, hence when the processor issues an address, it doesn't know whether the data is coming from the cache or from memory.
So, if I have a requirement of having to read a memory address every time it is required, how can I make sure that it is read from the required address and not from the cache?
Somehow, these two concepts do not fit together well. Please clarify how this is done.
(Imagine we have a write-back policy in the cache, if that is required for analyzing the problem.)
Thank you,
Microkernel :)
Firmware developer here. This is a standard problem in embedded programming, and one that trips up many (even very experienced) developers.
My assumption is that you are attempting to access a hardware register, and that register value can change over time (be it interrupt status, timer, GPIO indications, etc.).
The volatile keyword is only part of the solution, and in many cases may not be necessary. This causes the variable to be re-read from memory each time it is used (as opposed to being optimized out by the compiler or stored in a processor register across multiple uses), but whether the "memory" being read is an actual hardware register versus a cached location is unknown to your code and unaffected by the volatile keyword. If your function only reads the register once then you can probably leave off volatile, but as a general rule I will suggest that most hardware registers should be defined as volatile.
The bigger issue is caching and cache coherency. The easiest approach here is to make sure your register is in uncached address space. That means every time you access the register you are guaranteed to read/write the actual hardware register and not cache memory. A more complex but potentially better performing approach is to use cached address space and have your code manually force cache updates for specific situations like this. For both approaches, how this is accomplished is architecture-dependent and beyond the scope of the question. It could involve MTRRs (for x86), MMU, page table modifications, etc.
Hope that helps. If I've missed something, let me know and I'll expand my answer.
From your question, there is a misconception on your part.
The volatile keyword is not related to the cache in the way you describe.
When the keyword volatile is specified for a variable, it gives a hint to the compiler not to do certain optimizations as this variable can change from other parts of the program unexpectedly.
What is meant here, is that the compiler should not reuse the value already loaded in a register, but access the memory again as the value in register is not guaranteed to be the same as the value stored in memory.
The rest concerning the cache memory is not directly related to the programmer.
I mean that the synchronization of a CPU's cache memory with RAM is an entirely different subject.
My suggestion is to mark the page as non-cached by the virtual memory manager.
In Windows, this is done through setting PAGE_NOCACHE when calling VirtualProtect.
For a somewhat different purpose, the SSE 2 instructions have the _mm_stream_xyz instructions to prevent cache pollution, although I don't think they apply to your case here.
In either case, there is no portable way of doing what you want in C; you have to use OS functionality.
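As an illustration, allocating fresh uncached memory on Windows might look like this (a sketch; PAGE_NOCACHE is documented for the VirtualAlloc family of calls):
#include <windows.h>

/* Allocate 'size' bytes of committed, uncached read/write memory. */
void *alloc_uncached(SIZE_T size)
{
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                        PAGE_READWRITE | PAGE_NOCACHE);
}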
Wikipedia has a pretty good article about MTRR (Memory Type Range Registers) which apply to the x86 family of CPUs.
To summarize it: starting with the Pentium Pro, Intel (and AMD, which copied the feature) has had these MTRRs, which can set uncached, write-through, write-combining, write-protect, or write-back attributes on ranges of memory.
Starting with the Pentium III (but, as far as I know, only really useful with the 64-bit processors), the CPUs honor the MTRRs, but these can be overridden by the Page Attribute Tables, which let the CPU set a memory type for each page of memory.
A major use of the MTRRs that I know of is graphics RAM. It is much more efficient to mark it as write-combining. This lets the cache store up the writes and it relaxes all of the memory write ordering rules to allow very high-speed burst writes to a graphics card.
But for your purposes you would want either a MTRR or a PAT setting of either uncached or write-through.
As you say, cache is transparent to the programmer. The system guarantees that you always see the value that was last written if you access an object through its address. The "only" thing that you may incur if an obsolete value is in your cache is a runtime penalty.
volatile makes sure that data is read every time it is needed, without bothering with any cache between CPU and memory. But if you need to read actual data from memory and not cached data, you have two options:
Make a board where said data is not cached. This may already be the case if you address some I/O device,
Use specific CPU instructions that bypass the cache. This is used, for example, when you need to scrub memory to surface possible SEU (single event upset) errors.
The details of second option depend on OS and/or CPU.
Using the _Uncached keyword may help in an embedded OS, like MQX:
#define MEM_READ(addr) (*((volatile _Uncached unsigned int *)(addr)))
#define MEM_WRITE(addr,data) (*((volatile _Uncached unsigned int *)(addr)) = data)
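Usage would then be, for a hypothetical register address:
unsigned int status = MEM_READ(0x40001000);  /* hypothetical address */
MEM_WRITE(0x40001004, 0x1u);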

How to prevent LDM/STM instructions expansion in ARM Compiler 5 armcc inline assembler?

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly in a .c file compiled with ARM Compiler 5 armcc.
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}
But ARM Compiler armcc User Guide, paragraph 7.18 is saying:
"All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."
And that is what really happens in practice: LDM/STM are expanded into a set of LDR/STR in some cases, and the order of these instructions is arbitrary.
This affects performance, since the HW we use is optimized for burst processing. It also breaks functional correctness, because the HW we use takes into consideration the sequence of words and ignores offsets (but the compiler thinks it is safe to change the order of instructions).
To resolve this, it is possible to use the embedded assembler instead of the inline assembler, but this leads to extra function call/return overhead, which affects performance.
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
Target CPU: Cortex M0+ (ARMv6-M).
Edit:
Slave devices are all on-chip devices, and most of them are non-memory devices. For every register of a non-memory slave that supports burst access, a region of address space is reserved (for example [0x10000..0x10100]). I'm not completely sure why; maybe the CPU or the bus doesn't support fixed (non-incrementing) addresses. The HW ignores offsets within this region. A full request can be 16 bytes, for example, and the first word of the full request is the first word written (even if the offset is non-zero).
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
A little bit about compiler optimizations. Register allocation is one of a compiler's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers use static single assignment, or SSA, to rename your 'C' variables into a bunch of pseudo variables (or time-ordered variables).
In order for your STMIA and LDMIA to work, you need the loads and stores to be consistent. I.e., if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8}, then the stored 'r3' must map to the new 'r4' and the stored 'r7' to the restored 'r8'. This is not simple for any compiler to implement generically, as 'C' variables will be assigned according to need. Different versions of the same variable may be in different registers. To make the stm/ldm work, those variables must be assigned so that the register numbers increase in the right order. I.e., for the ldmia above, if the compiler wants the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.
You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.
See: ldm/stm and gcc for issues with GCC stm/ldm.
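For reference, the usual way to coax GCC into keeping the real instruction is to pin operands to explicit registers so the register list ascends (a GNU C sketch, not armcc code; the register choice is arbitrary as long as the list is ascending):
#include <stdint.h>

/* Explicit register variables force an ascending register list,
   so GCC emits the stmia as written instead of splitting it. */
static inline void stmia2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    register uint32_t v0   asm("r0") = w0;
    register uint32_t v1   asm("r1") = w1;
    register uint32_t base asm("r2") = addr;
    asm volatile("stmia %0!, {%1, %2}"
                 : "+r"(base)
                 : "r"(v0), "r"(v1)
                 : "memory");
}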
Taking your example,
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
    __asm {
        STMIA addr!, { w0, w1 }
    }
}
The value of inline is that the whole function body may be put right in the code. The caller might have w0 and w1 in registers R8 and R4. If the function is not inline, then the compiler must place them in R1 and R2, and it may generate extra moves. It is difficult for any compiler to fulfil the requirements of ldm/stm generically.
This affects performance, since the HW we use is optimized for burst processing. It also breaks functional correctness, because the HW we use takes into consideration the sequence of words and ignores offsets (but the compiler thinks it is safe to change the order of instructions).
If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.
However, it sounds like the device might be memory? In this case, you have a problem: memory devices like this normally only get burst accesses through a cache. If your CPU has an MPU (memory protection unit) and can enable both data and code caches, then you might resolve this issue, as cache line fills and evictions are always burst accesses. Care only needs to be taken in the code to set up the MPU and the data cache. The OP's Cortex-M0+ has no cache and the devices are non-memory, so this will not be possible (nor needed).
If your device is memory and you have no data cache, then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit, losing the benefit of random access to the memory device.

g_atomic_int_get guarantees visibility into another thread?

This is related to this question.
rmmh's claim on that question was that on certain architectures, no special magic is needed to implement atomic get and set. Specifically, in this case g_atomic_int_get(&foo) from glib gets expanded simply to (*(&foo)). I understand that this means that foo will not be in an internally inconsistent state. However, am I also guaranteed that foo won't be cached by a given CPU or core?
Specifically, if one thread is setting foo, and another reading it (using the glib g_atomic_* functions), can I assume that the reader will see the updates to the variable made by the writer. Or is it possible for the writer to simply update the value in a register? For reference my target platform is gcc (4.1.2) on a multi-core multi-CPU X86_64 machine.
What most architectures (yours included) ensure is atomicity and coherence of suitably sized and aligned reads and writes (so every processor sees a subsequence of the same master sequence of values for a given memory address (*)). An int is most probably suitably sized, and compilers generally ensure that ints are also correctly aligned.
But compilers rarely ensure that they aren't optimizing out some reads or writes unless the objects are marked in one way or another. I tried to compile:
int f(int *ptr)
{
    int i, r = 0;
    *ptr = 5;
    for (i = 0; i < 100; ++i) {
        r += i * i;
    }
    *ptr = r;
    return *ptr;
}
with gcc 4.1.2, and gcc optimized out the first write to *ptr without any problem; that is something you probably don't want for an atomic write.
(*) Coherence is not to be confused with consistency: the relationship between reads and writes at different addresses is often relaxed with respect to the intuitive, but costly to implement, sequential consistency. That's why memory barriers are needed.
Volatile will only ensure that the compiler doesn't use a register to hold the variable. Volatile will not prevent the compiler from reordering code, although it might act as a hint not to reorder.
Depending on the architecture, certain instructions are atomic. writing to an integer and reading from an integer are often atomic. If gcc uses atomic instructions for reading/writing to/from an integer memory location, there will be no "intermediate garbage" read by one thread if another thread is in the middle of a write.
But, you might run into problems because of compiler reordering and instruction reordering.
With optimizations enabled, gcc reorders code. Gcc usually doesn't reorder code when global variables or function calls are involved, since gcc can't guarantee the same outcome. Volatile might act as a hint to gcc with respect to reordering, but I don't know. If you do run into reordering problems, this will act as a general-purpose compiler barrier for gcc:
__asm__ __volatile__ ("" ::: "memory");
Even if the compiler doesn't reorder code, the CPU constantly reorders instructions during execution. Here is a very good article on the subject. A "memory barrier" is used to prevent the cpu from reordering instructions over a barrier. Here is one possible way to make a memory barrier using gcc:
__sync_synchronize();
You can also execute asm instructions to do different kinds of barriers.
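For the gcc 4.1.2 in the question (which predates the __atomic builtins), a hedged sketch of a writer/reader pair built on __sync_synchronize (the names are mine):
/* Writer publishes data, then the flag, with a full barrier between;
   the reader spins on the flag, then reads data after its own barrier. */
int data;
volatile int ready;  /* volatile forces a re-read on every loop iteration */

void writer(void)
{
    data = 42;
    __sync_synchronize();  /* full memory barrier */
    ready = 1;
}

int reader(void)
{
    while (!ready)
        ;
    __sync_synchronize();
    return data;
}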
That said, we read and write global integers without using atomic operations or mutexes from multiple threads and have no problems. This is most likely because A) we run on Intel and Intel does not reorder instructions aggressively and B) there is enough code executing before we do something "bad" with an early read of a flag. Also in our favor is the fact that a lot of system calls have barriers and the gcc atomic operations are barriers. We use a lot of atomic operations.
Here is a good discussion in stack overflow of a similar question.
