Is function call a memory barrier?

Is function call a memory barrier? - c

Consider this C code:
extern volatile int hardware_reg;
void f(const void *src, size_t len)
{
void *dst = <something>;
hardware_reg = 1;
memcpy(dst, src, len);
hardware_reg = 0;
}
The memcpy() call must occur between the two assignments. In general, since the compiler probably doesn't know what will the called function do, it can't reorder the call to the function to be before or after the assignments. However, in this case the compiler knows what the function will do (and could even insert an inline built-in substitute), and it can deduce that memcpy() could never access hardware_reg. Here it appears to me that the compiler would see no trouble in moving the memcpy() call, if it wanted to do so.
So, the question: is a function call alone enough to issue a memory barrier that would prevent reordering, or is, otherwise, an explicit memory barrier needed in this case before and after the call to memcpy()?
Please correct me if I am misunderstanding things.

The compiler cannot reorder the memcpy() operation before the hardware_reg = 1 or after the hardware_reg = 0 - that's what volatile will ensure - at least as far as the instruction stream the compiler emits. A function call is not necessarily a 'memory barrier', but it is a sequence point.
The C99 standard says this about volatile (5.1.2.3/5 "Program execution"):
At sequence points, volatile objects are stable in the sense that previous accesses are
complete and subsequent accesses have not yet occurred.
So at the sequence point represented by the memcpy(), the volatile access of writing 1 has to occurred, and the volatile access of writing 0 cannot have occurred.
However, there are 2 things I'd like to point out:
Depending on what <something> is, if nothing else is done with the the destination buffer, the compiler might be able to completely remove the memcpy() operation. This is the reason Microsoft came up with the SecureZeroMemory() function. SecureZeroMemory() operates on volatile qualified pointers to prevent optimizing writes away.
volatile doesn't necessarily imply a memory barrier (which is a hardware thing, not just a code ordering thing), so if you're running on a multi-proc machine or certain types of hardware you may need to explicitly invoke a memory barrier (perhaps wmb() on Linux).
Starting with MSVC 8 (VS 2005), Microsoft documents that the volatile keyword implies the appropriate memory barrier, so a separate specific memory barrier call may not be necessary:
http://msdn.microsoft.com/en-us/library/12a04hfd.aspx
Also, when optimizing, the compiler
must maintain ordering among
references to volatile objects as well
as references to other global objects.
In particular,
A write to a volatile object (volatile write) has Release
semantics; a reference to a global or
static object that occurs before a
write to a volatile object in the
instruction sequence will occur before
that volatile write in the compiled
binary.
A read of a volatile object (volatile read) has Acquire semantics;
a reference to a global or static
object that occurs after a read of
volatile memory in the instruction
sequence will occur after that
volatile read in the compiled binary.

As far as I can see your reasoning leading to
the compiler would see no trouble in moving the memcpy call
is correct. Your question is not answered by the language definition, and can only be addressed with reference to specific compilers.
Sorry to not have any more-useful information.

My assumption would be that the compiler never re-orders volatile assignments since it has to assume they must be executed at exactly the position where they occur in the code.

It's probalby going to get optimized, either because the compiler inlines the mecpy call and eliminates the first assignment, or because it gets compiled to RISC code or machine code and gets optimized there.

Here is a slightly modified example, compiled with gcc 7.2.1 on x86-64:
#include <string.h>
static int temp;
extern volatile int hardware_reg;
int foo (int x)
{
hardware_reg = 0;
memcpy(&temp, &x, sizeof(int));
hardware_reg = 1;
return temp;
}
gcc knows that the memcpy() is the same as an assignment, and knows that temp is not accessed anywhere else, so temp and the memcpy() disappear completely from the generated code:
foo:
movl $0, hardware_reg(%rip)
movl %edi, %eax
movl $1, hardware_reg(%rip)
ret

Related

Testing lockless buffer copy in C using memory barriers

I have a few questions regarding memory barriers.
Say I have the following C code (it will be run both from C++ and C code, so atomics are not possible) that writes an array into another one. Multiple threads may call thread_func(), and I want to make sure that my_str is returned only after it was initialized fully. In this case, it is a given that the last byte of the buffer can't be 0. As such, checking for the last byte as not 0, should suffice.
Due to reordering by compiler/CPU, this can be a problem as the last byte might get written before previous bytes, causing my_str to be returned with a partially copied buffer. So to get around this, I want to use a memory barrier. A mutex will work of course, but would be too heavy for my uses.
Keep in mind that all threads will call thread_func() with the same input, so even if multiple threads call init() a couple of times, it's OK as long as in the end, thread_func() returns a valid my_str, and that all subsequent calls after initialization return my_str directly.
Please tell me if all the following different code approaches work, or if there could be issues in some scenarios as aside from getting the solution to the problem, I'd like to get some more information regarding memory barriers.
__sync_bool_compare_and_swap on last byte. If I understand correctly, any memory store/load would not be reordered, not just the one for the particular variable that is sent to the command. Is that correct? if so, I would expect this to work as all previous writes of the previous bytes should be made before the barrier moves on.
#define STR_LEN 100
static uint8_t my_str[STR_LEN] = {0};
static void init(uint8_t input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN - 1; ++i) {
my_str[i] = input_buf[i];
}
__sync_bool_compare_and_swap(my_str, 0, input_buf[STR_LEN - 1]);
}
const char * thread_func(char input_buf[STR_LEN])
{
if (my_str[STR_LEN - 1] == 0) {
init(input_buf);
}
return my_str;
}
__sync_bool_compare_and_swap on each write. I would expect this to work as well, but to be slower than the first one.
static void init(char input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN; ++i) {
__sync_bool_compare_and_swap(my_str + i, 0, input_buf[i]);
}
}
__sync_synchronize before each byte copy. I would expect this to work as well, but is this slower or faster than (2)? __sync_bool_compare_and_swap is supposed to be a full barrier as well, so which would be preferable?
static void init(char input_buf[STR_LEN])
{
for (int i = 0; i < STR_LEN; ++i) {
__sync_synchronize();
my_str[i] = input_buf[i];
}
}
__sync_synchronize by condition. As I understand it, __sync_synchronize is both a HW and SW memory barrier. As such, since the compiler can't tell the value of use_sync it shouldn't reorder. And the HW reordering will be done only if use_sync is true. is that correct?
static void init(char input_buf[STR_LEN], bool use_sync)
{
for (int i = 0; i < STR_LEN; ++i) {
if (use_sync) {
__sync_synchronize();
}
my_str[i] = input_buf[i];
}
}

GNU C legacy __sync builtins are not recommended for new code, as the manual says.
Use the __atomic builtins which can take a memory-order parameter like C11 stdatomic. But they're still builtins and still work on plain types not declared _Atomic, so using them is like C++20 std::atomic_ref. In C++20, use std::atomic_ref<unsigned char>(my_str[STR_LEN - 1]), but C doesn't provide an equivalent so you'd have to use compiler builtins to hand-roll it.
Just do the last store separately with a release store in the writer, not an RMW, and definitely not a full memory barrier (__sync_synchronize()) between every byte!!! That's way slower than necessary, and defeats any optimization to use memcpy. Also, you need the store of the final byte to be at least RELEASE, not a plain store, so readers can synchronize with it. See also Who's afraid of a big bad optimizing compiler? re: how exactly compilers can break your code if you try to hand-roll lockless code with just barriers, not atomic loads or stores. (It's written for Linux kernel code, where a macro would use *(volatile char*) to hand-roll something close to __atomic_store_n with __ATOMIC_RELAXED`)
So something like
__atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1], __ATOMIC_RELEASE);
The if (my_str[STR_LEN - 1] == 0) load in thread_func is of course data-race UB when there are concurrent writers.
For safety it needs to be an acquire load, like __atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0, since you need a thread that loads a non-0 value to also see all other stores by another thread that ran init(). (Which did a release-store to that location, creating acquire/release synchronization and guaranteeing a happens-before relationship between these threads.)
See https://preshing.com/20120913/acquire-and-release-semantics/
Writing the same value non-atomically is also UB in ISO C and ISO C++. See Race Condition with writing same value in C++? and others.
But in practice it should be fine except with clang -fsanitize=thread. In theory a DeathStation9000 could implement non-atomic stores by storing value+1 and then subtracting 1, so temporarily there's be a different value in memory. But AFAIK there aren't real compilers that do that. I'd have a look at the generated asm on any new compiler / ISA combination you're trying, just to make sure.
It would be hard to test; the init stuff can only race once per program invocation. But there's no fully safe way to do it that doesn't totally suck for performance, AFAIK. Perhaps doing the init with a cast to _Atomic unsigned char* or typedef _Atomic unsigned long __attribute__((may_alias)) aliasing_atomic_ulong; as a building block for a manual copy loop?
Bonus question: if(use_sync) __sync_synchronize() inside the loop.
Since the compiler can't tell the value of use_sync it shouldn't reorder.
Optimization is possible to asm that works something like if(use_sync) { slow barrier loop } else { no-barrier loop }. This is called "loop unswitching": making two loops and branching once to decide which to run, instead of every iteration. GCC has been able to do that optimization (in some cases) since 3.4. So that defeats your attempt to take advantage of how the compiler would compile to trick it into doing more ordering than the source actually requires.
And the HW reordering will be done only if use_sync is true.
Yes, that part is correct.
Also, inlining and constant-propagation of use_sync could easily defeat this, unless use_sync was a volatile global or something. At that point you might as well just make a separate _Atomic unsigned char array_init_done flag / guard variable.
And you can use it for mutual exclusion by having threads try to set it to 1 with int old = guard.exchange(1), with the winner of the race being the one to run init while they spin-wait (or C++20 .wait(1)) for the guard variable to become 2 or -1 or something, which the winner of the race will set after finishing init.
Have a look at the asm GCC makes for non-constant-initialized static local vars; they check a guard variable with an acquire load, only doing locking to have one thread do the run_once init stuff and the others wait for that result. IIRC there's a Q&A about doing that yourself with atomics.

May an implementation optimize an atomic access to a non-atomic access, if it happens through a restrict pointer?

Consider the following function.
void incr(_Atomic int *restrict ptr) {
*ptr += 1;
}
I'll consider x86, but my question is about the language, not the semantics of any particular implementation of atomics. GCC and Clang both emit the following:
incr:
lock add DWORD PTR [rdi], 1
ret
Is it possible for a conforming implementation to emit simply
incr:
add DWORD PTR [rdi], 1
ret
(The same thing you get if you remove _Atomic.)
Without restrict, this would be a miscompilation because add is not atomic, so (for instance) calling incr at the same time in two threads would race. However, since the pointer is restrict-qualified, I think it's not possible for a race to occur between incr and any other access to *ptr (without causing undefined behavior anyway).
I would feel confident making this optimization manually, but no compiler I know of does so automatically. Is it correct? Or have I misunderstood restrict?

The optimization cannot be made by the compiler, because the behavior of restrict is defined only with respect to accesses to the same object via other pointers within incr.
From this reference, with my own emphasis:
During each execution of a block in which a restricted pointer P is declared (typically each execution of a function body in which P is a function parameter), if some object that is accessible through P (directly or indirectly) is modified, by any means, then all accesses to that object (both reads and writes) in that block must occur through P (directly or indirectly), otherwise the behavior is undefined...
The restriction on the pointer only applies to accesses made from incr, meaning that a conformant compiler should not be able to conclude anything about accesses made by other threads, or other functions that have their own pointers pointing to the same integer (which is perfectly fine a caller of incr might only have non-restricted pointers to it).
The open standard, section 6.7.3.1, also defines the formal behavior of restrict with respect to the block being executed, making no claim about the lifetime of any pointees. However, I think the standard's definition is too mathematically unwieldy to be worth presenting in full here.

Or have I misunderstood restrict?
Yeah - it makes no guarantees of what goes on outside the function where you used it. So it cannot be used for synchronization or thread-safety purposes.
The TL;DR about restrict is that *restrict ptr guarantees that no other pointer access to the object that ptr points at will be done from the block where ptr was declared. If there are other pointers or object references present ("lvalue accesses"), the compiler may assume that they aren't modifying that pointed-at restricted object.
This is pretty much just a contract between the programmer and the compiler: "Dear compiler, I promise I won't do stupid things behind your back with the object that this pointer points at". But the compiler can't really check if the programmer broke this contract, because in order to do so it may have to go check across several different translation units at once.
Example code:
#include <stdio.h>
int a=0;
int b=1;
void copy_stuff (int* restrict pa, int* restrict pb)
{
*pa = *pb; // assign the value 1 to a
a=2; // undefined behavior, the programmer broke the restrict contract
printf("%d\n", *pa);
}
int main()
{
copy_stuff(&a, &b);
printf("%d\n", a);
}
This prints 1 then 2 with optimization on, on all mainstream x86 compilers. Because in the printf inside the function they are free to assume that *pa has not been modified since *pa = *pb;. Drop restrict and they will print 2 then 2, because the compiler will have to add an extra read instruction in the machine code.
So for restrict to make sense, it needs to be compared with another reference to an object. That's not the case in your example, so restrict fills no apparent purpose.
As for _Atomic, it will always have to guarantee that the machine instructions in themselves are atomic, regardless of what else goes on in the program. It's not some semi-high level "nobody else seems to be using this variable anyway". But for the keyword to make sense for multi-threading and multi-core both, it would also have to act as a low level memory barrier to prevent pipelining, instruction re-ordering and data caching - which if I remember correctly is exactly what x86 lock does.

Provoke stack underflow in C

I would like to provoke a stack underflow in a C function to test security measures in my system. I could do this using inline assembler. But C would be more portable. However I can not think of a way to provoke a stack underflow using C since stack memory is safely handled by the language in that regard.
So, is there a way to provoke a stack underflow using C (without using inline assembler)?
As stated in the comments: Stack underflow means having the stack pointer to point to an address below the beginning of the stack ("below" for architectures where the stack grows from low to high).

There's a good reason why it's hard to provoke a stack underflow in C.The reason is that standards compliant C does not have a stack.
Have a read of the C11 standard, you'll find out that it talks about scopes but it does not talk about stacks. The reason for this is that the standard tries, as far as possible, to avoid forcing any design decisions on implementations. You may be able to find a way to cause stack underflow in pure C for a particular implementation but it will rely on undefined behaviour or implementation specific extensions and won't be portable.

You can't do this in C, simply because C leaves stack handling to the implementation (compiler). Similarly, you cannot write a bug in C where you push something on the stack but forget to pop it, or vice versa.
Therefore, it is impossible to produce a "stack underflow" in pure C. You cannot pop from the stack in C, nor can you set the stack pointer from C. The concept of a stack is something on an even lower level than the C language. In order to directly access and control the stack pointer, you must write assembler.
What you can do in C is to purposely write out of bounds of the stack. Suppose we know that the stack starts at 0x1000 and grows upwards. Then we can do this:
volatile uint8_t* const STACK_BEGIN = (volatile uint8_t*)0x1000;
for(volatile uint8_t* p = STACK_BEGIN; p<STACK_BEGIN+n; p++)
{
*p = garbage; // write outside the stack area, at whatever memory comes next
}
Why you would need to test this in a pure C program that doesn't use assembler, I have no idea.
In case someone incorrectly got the idea that the above code invokes undefined behavior, this is what the C standard actually says, normative text C11 6.5.3.2/4 (emphasis mine):
The unary * operator denotes indirection. If the operand points to a function, the result is
a function designator; if it points to an object, the result is an lvalue designating the
object. If the operand has type ‘‘pointer to type’’, the result has type ‘‘type’’. If an
invalid value has been assigned to the pointer, the behavior of the unary * operator is
undefined 102)
The question is then what's the definition of an "invalid value", as this is no formal term defined by the standard. Foot note 102 (informative, not normative) provides some examples:
Among the invalid values for dereferencing a pointer by the unary * operator are a null pointer, an
address inappropriately aligned for the type of object pointed to, and the address of an object after the
end of its lifetime.
In the above example we are clearly not dealing with a null pointer, nor with an object that has passed the end of its lifetime. The code may indeed cause a misaligned access - whether this is an issue or not is determined by the implementation, not by the C standard.
And the final case of "invalid value" would be an address that is not supported by the specific system. This is obviously not something that the C standard mentions, because memory layouts of specific systems are not coverted by the C standard.

It is not possible to provoke stack underflow in C. In order to provoke underflow the generated code should have more pop instructions than push instructions, and this would mean the compiler/interpreter is not sound.
In the 1980s there were implementations of C that ran C by interpretation, not by compilation. Really some of them used dynamic vectors instead of the stack provided by the architecture.
stack memory is safely handled by by the language
Stack memory is not handled by the language, but by the implementation. It is possible to run C code and not to use stack at all.
Neither the ISO 9899 nor K&R specifies anything about the existence of a stack in the language.
It is possible to make tricks and smash the stack, but it will not work on any implementation, only on some implementations. The return address is kept on the stack and you have write-permissions to modify it, but this is neither underflow nor portable.

Regarding already existing answers: I don't think that talking about undefined behaviour in the context of exploitation mitigation techniques is appropriate.
Clearly, if an implementation provides a mitigation against stack underflows, a stack is provided. In practice, void foo(void) { char crap[100]; ... } will end up having the array on the stack.
A note prompted by comments to this answer: undefined behaviour is a thing and in principle any code exercising it can end up being compiled to absolutely anything, including something not resembling the original code in the slightest. However, the subject of exploit mitigation techniques is closely tied to the target environment and what happens in practice. In practice, the code below should "work" just fine. When dealing with this kind of stuff you always have to verify generated assembly to be sure.
Which brings me to what in practice will give an underflow (volatile added to prevent the compiler from optimising it away):
static void
underflow(void)
{
volatile char crap[8];
int i;
for (i = 0; i != -256; i--)
crap[i] = 'A';
}
int
main(void)
{
underflow();
}
Valgrind nicely reports the problem.

By definition, a stack underflow is a type of undefined behaviour, and thus any code which triggers such a condition must be UB. Therefore, you can't reliably cause a stack underflow.
That said, the following abuse of variable-length arrays (VLAs) will cause a controllable stack underflow in many environments (tested with x86, x86-64, ARM and AArch64 with Clang and GCC), actually setting the stack pointer to point above its initial value:
#include <stdint.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char **argv) {
uintptr_t size = -((argc+1) * 0x10000);
char oops[size];
strcpy(oops, argv[0]);
printf("oops: %s\n", oops);
}
This allocates a VLA with a "negative" (very very large) size, which will wrap the stack pointer around and result in the stack pointer moving upwards. argc and argv are used to prevent optimizations from taking out the array. Assuming that the stack grows down (default on the listed architectures), this will be a stack underflow.
strcpy will either trigger a write to an underflowed address when the call is made, or when the string is written if strcpy is inlined. The final printf should not be reachable.
Of course, this all assumes a compiler which doesn't just make the VLA some kind of temporary heap allocation - which a compiler is completely free to do. You should check the generated assembly to verify that the above code does what you actually expect it to do. For example, on ARM (gcc -O):
8428: e92d4800 push {fp, lr}
842c: e28db004 add fp, sp, #4, 0
8430: e1e00000 mvn r0, r0 ; -argc
8434: e1a0300d mov r3, sp
8438: e0433800 sub r3, r3, r0, lsl #16 ; r3 = sp - (-argc) * 0x10000
843c: e1a0d003 mov sp, r3 ; sp = r3
8440: e1a0000d mov r0, sp
8444: e5911004 ldr r1, [r1]
8448: ebffffc6 bl 8368 <strcpy#plt> ; strcpy(sp, argv[0])

This assumption:
C would be more portable
is not true. C doesn't tell anything about a stack and how it is used by the implementation. On your typical x86 platform, the following (horribly invalid) code would access the stack outside of the valid stack frame (until it is stopped by the OS), but it would not actually "pop" from it:
#include <stdarg.h>
#include <stdio.h>
int underflow(int dummy, ...)
{
va_list ap;
va_start(ap, dummy);
int sum = 0;
for(;;)
{
int x = va_arg(ap, int);
fprintf(stderr, "%d\n", x);
sum += x;
}
return sum;
}
int main(void)
{
return underflow(42);
}
So, depending on what exactly you mean with "stack underflow", this code does what you want on some platform. But as from a C point of view, this just exposes undefined behavior, I wouldn't suggest to use it. It's not "portable" at all.

Is it possible to do it reliably in standard compliant C? No
Is it possible to do it on at least one practical C compiler without resorting to inline assembler? Yes
void * foo(char * a) {
return __builtin_return_address(0);
}
void * bar(void) {
char a[100000];
return foo(a);
}
typedef void (*baz)(void);
int main() {
void * a = bar();
((baz)a)();
}
Build that on gcc with "-O2 -fomit-frame-pointer -fno-inline"
https://godbolt.org/g/GSErDA
Basically the flow in this program goes as follows
main calls bar.
bar allocates a bunch of space on the stack (thanks to the big array),
bar calls foo.
foo takes a copy of the return address (using a gcc extension). This address points into the middle of bar, between the "allocation" and the "cleanup".
foo returns the address to bar.
bar cleans up it's stack allocation.
bar returns the return address captured by foo to main.
main calls the return address, jumping into the middle of bar.
the stack cleanup code from bar runs, but bar doesn't currently have a stack frame (because we jumped into the middle of it). So the stack cleanup code underflows the stack.
We need -fno-inline to stop the optimiser inlining stuff and breaking our carefully laid-down strcture. We also need the compiler to free the space on the stack by calculation rather than by use of a frame pointer, -fomit-frame-pointer is the default on most gcc builds nowadays but it doesn't hurt to specify it explicitly.
I belive this tehcnique should work for gcc on pretty much any CPU architecture.

There is a way to underflow the stack, but it is very complicated. The only way that I can think of is define a pointer to the bottom element then decrement its address value. I.e. *(ptr)--. My parentheses may be off, but you want to decrement the value of the pointer, then dereference the pointer.
Generally the OS is just going to see the error and crash. I am not sure what you are testing. I hope this helps. C allows you to do bad things, but it tries to look after the programmer. Most ways to get around this protection is through manipulation of pointers.

Do you mean stack overflow? Putting more things into the stack than the stack can accomodate? If so, recursion is the easiest way to accomplish that.
void foo();
{foo();};
If you mean attempting to remove things from an empty stack, then please post your question to the stackunderflow web site, and let me know where you've found that! :-)

So there are older library functions in C which are not protected. strcpy is a good example of this. It copies one string to another until it reaches a null terminator. One funny thing to do is pass a program that uses this a string with the null terminator removed. It will run amuck until it reaches a null terminator somewhere. Or have a string copy to itself. So back to what I was saying before is C supports pointers to just about anything. You can make a pointer to an element in the stack at the last element. Then you can use the pointer iterator built into C to decrement the value of the address, change the address value to a location preceding the last element in the stack. Then pass that element to the pop. Now if you are doing this to the Operating system process stack that would get very dependent on the compiler and operating system implementation. In most cases a function pointer to the main and a decrement should work to underflow the stack. I have not tried this in C. I have only done this in Assembly Language, great care has to be taken in working like this. Most operating systems have gotten good at stopping this since it was for a long time an attack vector.

Volatile keyword in C [duplicate]

This question already has answers here:
Why is volatile needed in C?
(18 answers)
Closed 9 years ago.
I am writing program for ARM with Linux environment. its not a low level program, say app level
Can you clarify me what is the difference between,
int iData;
vs
volatile int iData;
Does it have hardware specific impact ?

Basically, volatile tells the compiler "the value here might be changed by something external to this program".
It's useful when you're (for instance) dealing with hardware registers, that often change "on their own", or when passing data to/from interrupts.
The point is that it tells the compiler that each access of the variable in the C code must generate a "real" access to the relevant address, it can't be buffered or held in a register since then you wouldn't "see" changes done by external parties.
For regular application-level code, volatile should never be needed unless (of course) you're interacting with something a lot lower-level.

The volatile keyword specifies that variable can be modified at any moment not by a program.
If we are talking about embedded, then it can be e.g. hardware state register. The value that it contains may be modified by the hardware at any unpredictable moment.
That is why, from the compiler point of view that means that compiler is forbidden to apply optimizations on this variable, as any kind of assumption is wrong and can cause unpredictable result on the program execution.

By making a variable volatile, every time you access the variable, you force the CPU to fetch it from memory rather than from a cache. This is helpful in multithreaded programs where many threads could reuse the value of a variable in a cache. To prevent such reuse ( in multithreaded program) volatile keyword is used. This ensures that any read or write to an volatile variable is stable (not cached)

Generally speaking, the volatile keyword is intended to prevent the compiler from applying any optimizations on the code that assume values of variables cannot change "on their own."
(from Wikipedia)
Now, what does this mean?
If you have a variable that could have its contents changed at any time, usually due to another thread acting on it while you are possibly referencing this variable in a main thread, then you may wish to mark it as volatile. This is because typically a variable's contents can be "depended on" with certainty for the scope and nature of the context in which the variable is used. But if code outside your scope or thread is also affecting the variable, your program needs to know to expect this and query the variable's true contents whenever necessary, more than the normal.
This is a simplification of what is going on, of course, but I doubt you will need to use volatile in most programming tasks.

In the following example, global_data is not explicitly modified. so when the optimization is done, compiler thinks that, it is not going to modified anyway. so it assigns global_data with 0. And uses 0, whereever global_data is used.
But actually global_data updated through some other process/method(say through ptrace ). by using volatile you can force it to read always from memory. so you can get updated result.
#include <stdio.h>
volatile int global_data = 0;
int main()
{
FILE *fp = NULL;
int data = 0;
printf("\n Address of global_data:%x \n", &global_data);
while(1)
{
if(global_data == 0)
{
continue;
}
else if(global_data == 2)
{
;
break;
}
}
return 0;
}

volatile keyword can be used,
when the object is a memory mapped io port.
An 8 bit memory mapped io port at physical address 0x15 can be declared as
char const ptr = (char *) 0x15;
Suppose that we want to change the value at that port at periodic intervals.
*ptr = 0 ;
while(*ptr){
*ptr = 4;//Setting a value
*ptr = 0; // Clearing after setting
}
It may get optimized as
*ptr = 0 ;
while(0){
}
Volatile supress the compiler optimization and compiler assumes that tha value can
be changed at any time even if no explicit code modify it.
Volatile char *const ptr = (volatile char * )0x15;
Used when the object is a modified by ISR.
Sometimes ISR may change tha values used in the mainline codes
static int num;
void interrupt(void){
++num;
}
int main(){
int val;
val = num;
while(val != num)
val = num;
return val;
}
Here the compiler do some optimizations to the while statement.ie the compiler
produce the code it such a way that the value of num will always read form the cpu
registers instead of reading from the memory.The while statement will always be
false.But in actual scenario the valu of num may get changed in the ISR and it will
reflect in the memory.So if the variable is declared as volatile the compiler will know
that the value should always read from the memory

volatile means that variables value could be change any time by any external source. in GCC if we dont use volatile than it optimize the code which is sometimes gives unwanted behavior.
For example if we try to get real time from an external real time clock and if we don't use volatile there then what compiler do is it will always display the value which is stored in cpu register so it will not work the way we want. if we use volatile keyword there then every time it will read from the real time clock so it will serve our purpose....
But as u said you are not dealing with any low level hardware programming then i don't think you need to use volatile anywhere
thanks

How to safely convert/copy volatile variable?

volatile char* sevensegment_char_value;
void ss_load_char(volatile char *digits) {
...
int l=strlen(digits);
...
}
ss_load_char(sevensegment_char_value);
In the above example I've got warning from avr-gcc compiler
Warning 6 passing argument 1 of 'strlen' discards 'volatile' qualifier from pointer target type [enabled by default]
So I have to somehow copy the value from volatile to non-volatile var? What is the safe workaround?

There is no such thing like a "built in" Workaround in C. Volatile tells the compiler, that the contents of a variable (or in your case the memory the variable is pointing at) can change without the compiler noticing it and forces the compiler to read the data direct from the data bus rather than using a possibly existing copy in the registers.
Therefore the volatile keyword is used to avoid odd behaviour induced through compiler optimizations. (I can explain this further if you like)
In your case, you have a character buffer declared as volatile. If your program changes the contents of this buffer in a different context like an ISR for example, you have to implement sort of a synchronisation mechanism (like disabling the particular interrupt or so) to avoid inconsistency of data. After aquiring the "lock" (disabling the interrupt) you can copy the data byte by byte to a local (non-volatile) buffer and work on this buffer for the rest of the routine.
If the buffer will not change "outside" of the context of your read accesses I suggest to omit the volatile keyword as there is no use for it.
To judge the correct solution, a little bit more information about your exact use case would be needed.

Standard library routines aren't designed to work on volatile objects. The simplest solution is to read the volatile memory into normal memory before operating on it:
void ss_load_char(volatile char *digits) {
char buf[BUFSIZE];
int i = 0;
for (i = 0; i < BUFSIZE; ++i) {
buf[i] = digits[i];
}
int l=strlen(buf);
...
}
Here BUFSIZE is the size of the area of volatile memory.
Depending on how the volatile memory is configured, there may be routines you are supposed to call to copy out the contents, rather than just using a loop. Note that memcpy won't work as it is not designed to work with volatile memory.

The compiler warning only means that strlen() will not treat your pointer as volatile, i.e. it will maybe cache the pointer in a register when computing the length of your string. I guess, that's ok with you.
In general, volatile means that the compiler will not cache the variable. Look at this example:
extern int flag;
while (flag) { /* loop*/ }
This would loop forever if flag != 0, since the compiler assumes that flag is not changed "from the outside", like a different thread. If you want to wait on the input of some other thread, you must write this:
extern volatile int flag;
while (flag) { /* loop*/ }
Now, the compiler will really look at flag each time the loop loops. This may be must more what we intended in this example.
In answer to your question: if you know what you're doing, just cast the volatile away with int l=strlen((char*)digits).