C - overcoming the use of atomic operations

I was wondering: if I have an int variable that I want to keep synced across all my threads, couldn't I reserve one bit of it to indicate whether the value is currently being updated?
To avoid the write operation being executed in chunks - which would mean threads could read a half-written value, which is incorrect, or even worse, overwrite part of it, making it totally wrong - I want the threads to first be informed that the variable is being written to. I could simply use an atomic operation to write the new value so that the other threads don't interfere, but my idea does not seem that dumb and I would like to try the basic tools first.
What if I make just one operation, small enough to happen in one chunk - an operation like changing a single bit (which will still rewrite the whole byte(s), but not the whole value, right?) - and let that bit indicate whether the variable is being written to? Would that even work, or would the whole int be written?
I mean, even if the whole int were written, this would still have a chance of working - if the bit indicating that the value is changing were written first.
Any thoughts on this?
EDIT: I feel like I did not specify what I am actually planning to do, and why I thought of this in the first place.
I am trying to implement a timeout function, similar to setTimeout in JavaScript. It is pretty straightforward for a timeout that you never want to cancel - you create a new thread, tell it to sleep for a given amount of time, then give it a function to execute, possibly with some data. Piece of cake. I finished writing it in maybe half an hour, while being totally new to C.
The hard part comes when you want to set a timeout which might be canceled later. You do exactly the same as for a timeout without canceling, but when the thread wakes up and the CPU's scheduler puts it back on, the thread must check whether a value in the memory it was given at start-up says 'you should stop executing'. The value could be modified by another thread, but it would only be done once, at least in the best-case scenario. I will worry about other solutions when it comes down to modifying the value from multiple threads at the same time. The base assumption right now is that only the main thread, or one of the other threads, can modify the value, and it will happen only once. That it happens only once can be controlled with another variable, which might be written multiple times, but always with the same value (that is, the initial value is 0, meaning not-yet-canceled; when the timeout must be canceled, the value changes to 1, so there is no worry about the value being fragmented into multiple write operations with only a chunk of it updated when another thread reads it).
Given this assumption, I think the text I initially wrote at the beginning of this post should be more clear. In a nutshell, no need to worry about the value being written multiple times, only once, but by any thread, and the value must be available to be read by any other thread, or it must be indicated that it cannot be read.
Now that I think of it, since the value itself will only ever be 0 or 1, the trick for knowing whether it has already been canceled should work too, shouldn't it? The 0 or 1 will always be written in one operation, so there is no need to worry about it being fragmented and read incorrectly. Please correct me if I'm wrong.
On the other hand, what if the value is written from the end, not from the beginning? If that is not possible, then there is nothing to worry about and the post is resolved, but I would like to know about every danger that might come with bypassing atomic operations like this, in this specific context. If the value were written from the end, a thread that accesses the variable to decide whether it should continue executing would conclude that it should, while the expected behaviour is to stop. The chance of this happening may be minimal, but it is not zero, which makes it dangerous, and I want the behaviour to be 100% predictable.
Another edit, to explain the steps I imagine the program taking.
The main thread spawns a new thread, aka a 'cancelable timeout'. It passes a function to execute along with data, a time to sleep, and a memory address pointing to a value. After the thread wakes up from the given sleep, it must check that value to see whether it should execute the function it was given: 0 means it should continue, 1 means it should stop and exit. The value (the thread's 'state', canceled or not canceled) can be manipulated either by the main thread or by any other 'timeout' thread whose job is to cancel the first thread.
Sample code:
struct Timeout {
    void (*function)(void* data);
    void* data;
    int milliseconds;
    int** base;
    int cancelID;
};

DWORD WINAPI CTimeout(struct Timeout* data) {
    Sleep(data->milliseconds);
    /* Pointer arithmetic on int* already scales by sizeof(int),
       so index the state array directly rather than multiplying
       the offset by sizeof(int) as the original draft did. */
    if ((*data->base)[data->cancelID] == 0) {
        data->function(data->data);
    }
    free(data); /* parameter is non-const so freeing it is legal */
    return 0;
}
Where CTimeout is the function provided to the newly-spawned thread. Please note that I wrote some of this code on the go and haven't tested it; ignore any remaining errors.
Timeout.base is a pointer to a pointer to an array of ints, since many timeouts can exist at the same time. Timeout.cancelID is the ID of the current thread on the list of timeouts; treated as an index into the base array, it selects that thread's state. If the value is 0, the thread should execute its function; otherwise it should clean up the data it was given and return nicely. The reason base is a pointer to a pointer is that the array of timeout states can be resized at any time. If the array moves, a thread holding the array's original address would be accessing memory that no longer belongs to us, which could cause a segmentation fault (if not, please correct me).
Base can be accessed from the main thread or other threads if necessary, and the state of our thread can be changed to cancel its execution.
If any thread wants to change the state (that is, the state of the timeout we spawned at the beginning and want to cancel), it should change the corresponding value in the 'base' array. I think this is pretty straightforward so far.
There would be a huge problem if the values for continuing and stopping were bigger than just 1 byte. A write to such memory could take multiple operations, and thus accessing the memory too early would cause unexpected results, which is not something I am fond of. But, as I mentioned earlier, what if the value is very small, 0 or 1? Would it matter at all when the value is accessed? Whether the flag occupies 1, 2, 4, or even 8 bytes shouldn't make any difference in this case, should it? In the end, there is no worry about receiving an invalid value, since we don't care about the whole 32-bit value, just 1 bit, no matter how many bytes we read.
Maybe it isn't exactly clear what I mean. Write/read operations do not consist of reading single bits, but byte(s). That is, if our value is no bigger than 255, or 65535, or about 4 billion - whatever the number of bytes we are writing/reading allows - we shouldn't have to worry about reading it in the middle of it being written. What we care about is only one chunk of what is being written, the first or the last byte(s); the rest is completely useless to us, so there is no need for it all to be synced at the moment we access the value. The real problem starts when the value is being written, but the first byte actually stored is the one at the end, which is useless to us. If we read the value at that moment, we receive what we shouldn't: a no-cancel state instead of cancel. If the first byte (given little endian) were written first, we would read a valid value even in the middle of a write.
Perhaps I am mangling and mistaking everything. I am not a pro, you know. Perhaps I have been reading trashy articles, whatever. If I am wrong about anything at all, please correct me.

Except for some specialised embedded environments with dedicated hardware, there is no such thing as "one operation, which is small enough to keep it in one chunk, an operation like changing a single bit". You need to keep in mind that you do not want to simply overwrite the special bit with "1" (or "0"), because even if you could do that, it might just coincide with some other thread doing the same. What you in fact need to do is check whether it is already 1 and, ONLY if it is NOT, write a 1 yourself - and KNOW that you did not overwrite an existing 1 (or that writing your 1 failed because a 1 was already there).
This is the classic critical-section problem. And it can only be solved by the OS, which is the one party that knows about, and can suspend, the other parallel threads. This is the reason for the existence of the OS-supported synchronisation methods.
There is no easy way around this.
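
To make that concrete, here is a minimal sketch of that indivisible check-and-set expressed with C11 atomics (an aside, not the poster's code: the question targets Windows threads, but <stdatomic.h> is the portable way to write it, and all names here are invented for illustration):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int canceled = 0; /* 0 = not canceled, 1 = canceled */

/* Canceling thread: set the flag and learn whether we were first. */
bool cancel_timeout(void) {
    int expected = 0;
    /* Atomically: if canceled is still 0, set it to 1 and return true;
       otherwise leave it alone and return false. This is the
       check-then-write done as one indivisible step. */
    return atomic_compare_exchange_strong(&canceled, &expected, 1);
}

/* Timeout thread, after waking up: */
bool timeout_should_run(void) {
    return atomic_load(&canceled) == 0;
}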

Related

A thread only reads and a thread only modifies. Does this variable also need a mutex with linux c? [duplicate]

There are two threads: one only reads the signal, the other only sets it.
Is it necessary to create a mutex for the signal, and why?
UPDATE
All I care about is whether it'll crash if two threads read/set it at the same time.
You will probably want to use atomic variables for this, though a mutex would work as well.
The problem is that there is no guarantee that data will stay in sync between threads, but using atomic variables ensures that as soon as one thread updates that variable, other threads immediately read its updated value.
A problem could occur if one thread updates the variable in cache, and a second thread reads the variable from memory. That second thread would read an out-of-date value for the variable, if the cache had not yet been flushed to memory. Atomic variables ensure that the value of the variable is consistent across threads.
If you are not concerned with timely variable updates, you may be able to get away with a single volatile variable.
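A minimal sketch of that advice, assuming a C11 compiler with <stdatomic.h> (the flag name and functions are invented for illustration):

#include <stdatomic.h>

_Atomic int signal_flag = 0; /* the shared signal */

/* The thread that sets the signal: */
void set_signal(void) {
    atomic_store(&signal_flag, 1); /* promptly visible to other threads */
}

/* The thread that reads it: */
int read_signal(void) {
    return atomic_load(&signal_flag);
}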
It depends. If writes are atomic then you don't need a mutual exclusion lock. If writes are not atomic, then you do need a lock.
There is also the issue of compilers caching variables in CPU registers, which may cause the copy in main memory to not get updated on every write. Some languages have ways of telling the compiler not to cache a variable like that (the volatile keyword in Java), or to tell the compiler to sync any cached values with main memory (the synchronized keyword in Java). But mutexes in general don't solve this problem.
If all you need is synchronization between threads (one thread must complete something before the other can begin something else) then mutual exclusion should not be necessary.
Mutual exclusion is only necessary when threads are sharing some resource where the resource could be corrupted if they both run through the critical section at roughly the same time. Think of two people sharing a bank account and are at two different ATM's at the same time.
Depending on your language/threading library you may use the same mechanism for synchronization as you do for mutual exclusion - either a semaphore or a monitor. So, if you are using Pthreads, someone here could post an example of synchronization and another for mutual exclusion; if it's Java, there would be another example. Perhaps you can tell us what language/library you're using.
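Since Pthreads came up, here is a hedged sketch of both mechanisms - a mutex for mutual exclusion and a condition variable for synchronization (all names invented for illustration):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
int done = 0; /* protected by lock */

/* Mutual exclusion: only one thread in the critical section at a time. */
void update_shared(void) {
    pthread_mutex_lock(&lock);
    /* ... touch the shared resource ... */
    pthread_mutex_unlock(&lock);
}

/* Synchronization: thread A announces that its work is complete... */
void announce_done(void) {
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_signal(&done_cv);
    pthread_mutex_unlock(&lock);
}

/* ...and thread B waits for that before starting its own work. */
void wait_for_done(void) {
    pthread_mutex_lock(&lock);
    while (!done) /* loop guards against spurious wakeups */
        pthread_cond_wait(&done_cv, &lock);
    pthread_mutex_unlock(&lock);
}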
If, as you've said in your edit, you only want to assure against a crash, then you don't need to do much of anything (at least as a rule). If you get a collision between threads, about the worst that will happen is that the data will be corrupted -- e.g., the reader might get a value that's been partially updated and doesn't correspond to any value the writing thread ever wrote. The classic example would be a multi-byte number you added something to where there was a carry: say the old value was 0x3fffff and it was being incremented. It's possible the reading thread could see 0x3f0000, where the lower 16 bits have wrapped around but the carry into the upper 16 bits hasn't happened (yet).
On a modern machine, an increment on that small of a data item will normally be atomic, but there will be some size (and alignment) where it's not -- typically if part of the variable is in one cache line, and part in another, it'll no longer be atomic. The exact size and alignment for that varies somewhat, but the basic idea remains the same -- it's mostly just a matter of the number having enough digits for it to happen.
Of course, if you're not careful, something like that could cause your code to deadlock or something on that order -- it's impossible to guess what might happen without knowing anything about how you plan to use the data.

pointer shared between two threads without mutex [duplicate]


Can an integer be shared between threads safely?

Is there a problem with multiple threads using the same integer memory location between pthreads in a C program without any synchronization utilities?
To simplify the issue,
Only one thread will write to the integer
Multiple threads will read the integer
This pseudo-C illustrates what I am thinking
#include <pthread.h>

int value = 0;

void *thread_main(void *arg) {
    int *a = arg;
    /* wait for something to finish */
    /* dereference 'a', make decision based on its value */
    return NULL;
}

int main(void) {
    pthread_t tid[10];
    for (int i = 0; i < 10; i++)
        pthread_create(&tid[i], NULL, thread_main, &value);
    /* do something */
    value = 1;
}
I assume it is safe, since an integer occupies one processor word, and reading/writing to a word should be the most atomic of operations, right?
Your pseudo-code is NOT safe.
Although accessing a word-sized integer is indeed atomic, meaning that you'll never see an intermediate value, but either "before write" or "after write", this isn't enough for your outlined algorithm.
You are relying on the relative order of the write to a and making some other change that wakes the thread. This is not an atomic operation and is not guaranteed on modern processors.
You need some sort of memory fence to prevent write reordering. Otherwise it's not guaranteed that other threads EVER see the new value.
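In C11 terms, the fence the answer asks for is usually written as a release/acquire pair; a minimal sketch, reusing the question's value flag (work_result is an invented stand-in for "the work"):

#include <stdatomic.h>

_Atomic int value = 0;   /* the flag from the question */
int work_result;         /* invented stand-in for "the work" */

/* Writer: finish the work, then publish the flag with release ordering. */
void publish(void) {
    work_result = 42;
    atomic_store_explicit(&value, 1, memory_order_release);
}

/* Reader: an acquire load of the flag guarantees the work is visible. */
int consume(void) {
    if (atomic_load_explicit(&value, memory_order_acquire) == 1)
        return work_result;
    return -1; /* not published yet */
}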
Unlike Java, where you explicitly start a thread, POSIX threads start executing immediately.
So there is no guarantee that the value you set to 1 in the main function (assuming that is what you refer to in your pseudocode) will be written before or after the threads try to access it.
So while it is safe to read the integer concurrently, you need to do some synchronization if you need to write to the value in order for it to be used by the threads.
Otherwise there is no guarantee what value they will read (in order to act depending on the value, as you note).
You should not be making assumptions about multithreading, e.g. that there is some processing in each thread before accessing the value, etc.
There are no guarantees.
I wouldn't count on it. The compiler may emit code that assumes it knows what the value of 'value' is at any given time, keeping it in a CPU register without re-loading it from memory.
EDIT:
Ben is correct (and I'm an idiot for saying he wasn't) that there is the possibility that the CPU will re-order the instructions and execute them down multiple pipelines at the same time. This means that value = 1 could possibly get set before the pipeline performing "the work" finished. In my defense (not a full idiot?) I have never seen this happen in real life, and we do have an extensive thread library, we run exhaustive long-term tests, and this pattern is used throughout. I would have seen it if it were happening, but none of our tests ever crash or produce the wrong answer. But... Ben is correct, the possibility exists. It is probably happening all the time in our code, but the re-ordering is not setting flags early enough that the consumers of the data protected by the flags can use the data before it's finished. I will be changing our code to include barriers, because there is no guarantee that this will continue to work in the wild. I believe the correct solution is similar to this:
Threads that read the value:
...
if (value)
{
    __sync_synchronize(); // don't pipeline any of the work until after checking value
    DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
__sync_synchronize(); // don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
That being said, I found this to be a simple explanation of barriers.
COMPILER BARRIER
Memory barriers affect the CPU; compiler barriers affect the compiler. volatile will not keep the compiler from re-ordering code.
I believe you can use this code to keep gcc from rearranging the code during compile time:
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")
So maybe this is what should really be done?
#define GENERAL_BARRIER() do { COMPILER_BARRIER(); __sync_synchronize(); } while(0)
Threads that read the value:
...
if (value)
{
    GENERAL_BARRIER(); // don't pipeline any of the work until after checking value
    DoSomething();
}
...
The thread that sets the value:
...
DoStuff();
GENERAL_BARRIER(); // don't pipeline "setting value" until after finishing stuff
value = 1; // Stuff Done
...
Using GENERAL_BARRIER() keeps gcc from re-ordering the code and also keeps the CPU from re-ordering it. Now, I wonder whether gcc won't already refuse to re-order code over its memory barrier builtin, __sync_synchronize(), which would make the use of COMPILER_BARRIER redundant.
X86
As Ben points out, different architectures have different rules regarding how they rearrange code in the execution pipelines. Intel seems to be fairly conservative. So the barriers might not be required nearly as much on Intel. Not a good reason to avoid the barriers though, since that could change.
ORIGINAL POST:
We do this all the time. It's perfectly safe (not for all situations, but a lot). Our application runs on thousands of servers in a huge farm with 16 instances per server, and we don't have race conditions. You are correct to wonder why people use mutexes to protect already-atomic operations. In many situations the lock is a waste of time. Reading and writing 32-bit integers is atomic on most architectures. Don't try that with 32-bit bit-fields though!
Processor write re-ordering is not going to affect one thread reading a global value set by another thread. In fact, the result using locks is the same as the result without locks. If you win the race and check the value before it's changed... well, that's the same as winning the race to lock the value so no one else can change it while you read it. Functionally the same.
The volatile keyword tells the compiler not to store a value in a register, but to keep referring to the original memory location. This should have no effect unless you are optimizing code. We have found that the compiler is pretty smart about this and have not yet run into a situation where volatile changed anything. The compiler seems to be pretty good at picking candidates for register optimization. I suspect that the const keyword might encourage register optimization on a variable.
The compiler might re-order code in a function if it knows the end result will not be different. I have not seen the compiler do this with global variables, because the compiler has no idea how changing the order of a global variable will affect code outside of the immediate function.
If a function is acting up, you can control the optimization level at the function level using __attribute__.
Now, that said, if you use that flag as a gateway to allow only one thread of a group to perform some work, that won't work. Example: thread A and thread B could both read the flag. Thread A gets scheduled out. Thread B sets the flag to 1 and starts working. Thread A wakes up, also sets the flag to 1, and starts working. Oops! To avoid locks and still do something like that, you need to look into atomic operations, specifically the gcc atomic builtins like __sync_bool_compare_and_swap(&value, old, new). This sets value = new only if value is currently old. So in the previous example, if value = 1, only one thread (A or B) could execute __sync_bool_compare_and_swap(&value, 1, 2) and change value from 1 to 2; the losing thread would fail. __sync_bool_compare_and_swap returns the success of the operation.
Deep down, there is a "lock" when you use the atomic builtins, but it is a hardware instruction and very fast when compared to using mutexes.
That said, use mutexes when you have to change a lot of values at the same time. Atomic operations (as of today) only work when all the data that has to change atomically fits into a contiguous 8, 16, 32, 64, or 128 bits.
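A short sketch of the gateway pattern described above, using the gcc builtin (the 1/2 flag values follow the answer's example; the work itself is a placeholder):

static int value = 1; /* 1 = work available, 2 = claimed, per the example above */

void maybe_do_work(void) {
    /* Only the thread that wins the 1 -> 2 transition does the work. */
    if (__sync_bool_compare_and_swap(&value, 1, 2)) {
        /* ... exclusive work ... */
    }
    /* The losing thread falls through without touching shared state. */
}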
Assume the first thing you do in the thread function is sleep for a second. Then the value after that will definitely be 1.
In any case you should at least declare the shared variable volatile. However, you should in all cases prefer some other form of thread IPC or synchronisation; in this case it looks like a condition variable is what you actually need.
Hm, I guess it is safe, but why don't you just declare a function that returns the value to the other threads, since they will only read it?
Because the simple idea of passing pointers to separate threads is already a security failure, in my humble opinion. What I'm telling you is: why give out a (modifiable, publicly accessible) integer address when you only need the value?

Interruptible in-place sorting algorithm

I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.
Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX
Any hints on accomplishing this, or notes in general?
Thanks!
To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.
Following up (2 years later): Just for posterity, I have installed the SIGINT handler and it worked great. This does not protect me against power failure, but that is a risk I can handle. Code at https://code.google.com/p/pawnsbfs/source/browse/trunk/hsort.c and https://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c
Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:
1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.
2) Copy A to the scratch space, record that you've done so, flush.
3) Copy B over A, then record in the scratch space that you have done so, flush
4) Copy from the scratch space over B.
5) Remove the record.
This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.
In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process, but it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts then you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not. With a UPS or if you think power cuts are so rare you don't mind losing data, that's fine.
On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.
Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.
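A hedged sketch of steps 1-5 for fixed-size records, using a separate journal file as the scratch space (the record size matches the question; "flush" is implemented as fflush plus fsync, and most write-error checks are omitted for brevity):

#include <stdio.h>
#include <unistd.h>

#define REC_SIZE 16 /* 128-bit records, as in the question */

/* "Flush far enough": here, past the process and the OS buffers. */
static void flush_hard(FILE *fp) {
    fflush(fp);
    fsync(fileno(fp));
}

/* Journaled swap of records i and j in f, following steps 1-5 above. */
void swap_records(FILE *f, FILE *journal, long i, long j) {
    char a[REC_SIZE], b[REC_SIZE];
    long none = -1;

    if (fseek(f, i * REC_SIZE, SEEK_SET) || fread(a, REC_SIZE, 1, f) != 1) return;
    if (fseek(f, j * REC_SIZE, SEEK_SET) || fread(b, REC_SIZE, 1, f) != 1) return;

    /* Steps 1-2: record which records are being swapped plus a copy of A. */
    rewind(journal);
    fwrite(&i, sizeof i, 1, journal);
    fwrite(&j, sizeof j, 1, journal);
    fwrite(a, REC_SIZE, 1, journal);
    flush_hard(journal);

    /* Step 3: copy B over A, flush. */
    fseek(f, i * REC_SIZE, SEEK_SET);
    fwrite(b, REC_SIZE, 1, f);
    flush_hard(f);

    /* Step 4: copy the journaled A over B, flush. */
    fseek(f, j * REC_SIZE, SEEK_SET);
    fwrite(a, REC_SIZE, 1, f);
    flush_hard(f);

    /* Step 5: mark the journal record as complete. */
    rewind(journal);
    fwrite(&none, sizeof none, 1, journal);
    flush_hard(journal);
}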
Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.
Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.
Install a handler for SIGINT that just sets a "process should exit soon" flag.
In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.
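A minimal sketch of that handler (the sort loop body is a placeholder):

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t should_exit = 0;

static void on_sigint(int sig) {
    (void)sig;
    should_exit = 1; /* setting a sig_atomic_t flag is async-signal-safe */
}

int main(void) {
    signal(SIGINT, on_sigint);
    for (long swaps = 0; swaps < 1000000 && !should_exit; swaps++) {
        /* swap two records here; the file is consistent between swaps */
    }
    if (should_exit)
        fprintf(stderr, "interrupted; exiting with data intact\n");
    return 0;
}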
The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);.
As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.
Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.
Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.
Back up whatever you plan to change. Then put a flag that marks a successful sort. If everything is OK, keep the result; otherwise restore the backup.
Assuming a 64-bit OS (you said it is a 64-bit machine, but it could still be running a 32-bit OS), you could use mmap to map the file to an array, then use qsort on the array.
Add a handler for SIGINT to call msync and munmap to allow the app to respond to Ctrl-C without losing data.
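A sketch of that suggestion, assuming 128-bit records compared as two 64-bit halves (the record layout and comparator are assumptions, and error handling is trimmed):

#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct { uint64_t hi, lo; } rec128; /* assumed record layout */

static rec128 *map;
static size_t maplen;

static int cmp(const void *a, const void *b) {
    const rec128 *x = a, *y = b;
    if (x->hi != y->hi) return x->hi < y->hi ? -1 : 1;
    if (x->lo != y->lo) return x->lo < y->lo ? -1 : 1;
    return 0;
}

static void on_sigint(int sig) {
    (void)sig;
    /* Follows the answer's suggestion literally; note msync is not on
       POSIX's async-signal-safe list, so a flag-based design is safer. */
    msync(map, maplen, MS_SYNC);
    munmap(map, maplen);
    _exit(1);
}

int main(int argc, char **argv) {
    (void)argc;
    struct stat st;
    int fd = open(argv[1], O_RDWR);
    fstat(fd, &st);
    maplen = (size_t)st.st_size;
    map = mmap(NULL, maplen, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    signal(SIGINT, on_sigint);
    qsort(map, maplen / sizeof(rec128), sizeof(rec128), cmp);
    msync(map, maplen, MS_SYNC);
    munmap(map, maplen);
    close(fd);
    return 0;
}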

What can cause non-deterministic output in a program?

I have a bug in a multi-processes program. The program receives input and instantly produces output, no network involved, and it doesn't have any time references.
What makes the cause of this bug hard to track down is that it only happens sometimes.
If I constantly run it, it produces both correct and incorrect output, with no discernible order or pattern.
What can cause such non-deterministic behavior? Are there tools out there that can help? There is a possibility that there are uninitialized variables in play. How do I find those?
EDIT: Problem solved; thanks to everyone who suggested
Race Condition.
I didn't think of it, mainly because I was sure that my design prevented it. The problem was that I used 'wait' instead of 'waitpid', so sometimes, when some process was lucky enough to finish before the one I was expecting, the correct order of things went wild.
You say it's a "multi-processes" program - could you be more specific? It may very well be a race condition in how you're handling the multiple processes.
If you could tell us more about how the processes interact, we might be able to come up with some possibilities. Note that although Artem's suggestion of using a debugger is fine in and of itself, you need to be aware that introducing a debugger may very well change the situation completely - particularly when it comes to race conditions. Personally I'm a fan of logging a lot, but even that can change the timing subtly.
The scheduler!
Basically, when you have multiple processes, they can run in any bizarre order they want. If those processes are sharing a resource that they are both reading and writing from (whether it be a file or memory or an IO device of some sort), ops are going to get interleaved in all sorts of weird orders. As a simple example, suppose you have two threads (they're threads so they share memory) and they're both trying to increment a global variable, x.
y = x + 1;
x = y;
Now run those processes, but interleave the code in this way
Assume x = 1
P1:
y = x + 1
So now in P1, for variable y which is local and on the stack, y = 2. Then the scheduler comes in and starts P2
P2:
y = x + 1
x = y
x was still 1 coming into this, so 1 has been added to it and now x = 2
Then P1 finishes
P1:
x = y
and x is still 2! We incremented x twice but only got that once. And because we don't know how this is going to happen, it's referred to as non-deterministic behavior.
The good news is, you've stumbled upon one of the hardest problems in Systems programming as well as the primary battle cry of many of the functional language folks.
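A hedged demonstration of that lost update with two pthreads (loop counts are arbitrary; on most machines the final total comes up short of 2000000 at least occasionally):

#include <pthread.h>
#include <stdio.h>

static int x = 0; /* shared and deliberately unsynchronized */

static void *bump(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        int y = x + 1; /* read x */
        x = y;         /* write x; another thread may have written in between */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %d (expected 2000000)\n", x);
    return 0;
}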
You're most likely looking at a race condition, i.e. an unpredictable and therefore hard to reproduce and debug interaction between improperly synchronized threads or processes.
The non-determinism in this case stems from process/thread and memory access scheduling. This is unpredictable because it is influenced by a large number of external factors, including network traffic and user input which constantly cause interrupts and lead to different actual sequences of execution in the program's threads each time it's run.
It could be a lot of things: memory leaks, critical-section access, unclosed resources, unclosed connections, etc. There is only one tool which can help you - the DEBUGGER. Or try to examine your algorithm and find the bug yourself, or, if you manage to pinpoint the problematic part, you can paste a snippet here and we will try to help you.
Start with the basics... make sure that all your variables have a default value and that all dynamic memory is zeroed out before you use it (i.e. use calloc rather than malloc). There should be a compiler option to flag this (unless you're using some obscure compiler).
If this is C++ (I know it's supposed to be a C forum), there are times where object creation and initialization lag behind variable assignment, and that can bite you. For example, if you have a scope that is used concurrently by multiple threads (as in a singleton or a global variable), this can cause issues:
if (!foo)
    foo = new Foo();
If multiple threads access the above, the first thread finds foo == null, starts the object creation and assignment, and then yields. Another thread comes in, finds foo != null, skips the section, and starts to use a foo that may not be fully constructed yet.
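In C with Pthreads, the usual fix for that one-time-initialization race is pthread_once; a minimal sketch (Foo and make_foo are invented placeholders):

#include <pthread.h>
#include <stdlib.h>

typedef struct { int ready; } Foo; /* placeholder type */

static Foo *foo;
static pthread_once_t foo_once = PTHREAD_ONCE_INIT;

static void make_foo(void) {
    foo = malloc(sizeof *foo); /* runs exactly once, however many threads race */
    foo->ready = 1;
}

Foo *get_foo(void) {
    pthread_once(&foo_once, make_foo);
    return foo; /* fully initialized by the time any caller sees it */
}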
We'd need to see specifics of your code to give a more accurate answer, but to be concise: when you have a program that coordinates multiple processes or multiple threads, variability in when the threads execute can add indeterminacy to your application. Essentially, the scheduling that the OS does can cause processes and threads to execute out of order. Depending on your environment and code, this scheduling can produce wildly different results. You can search Google for more information about out-of-order execution with multithreading; it's a large topic.
By "multi-process" do you mean multi-threaded? If we had two threads that do this routine
i = 1;
while(true)
{
printf(i++);
if(i > 4) i = 1;
}
Normally we'd expect the output to be something like
112233441122334411223344
But actually we'd be seeing something like
11232344112233441231423
This is because each thread gets to use the CPU at different rates. (There's a whole lot of complexity behind the scheduler's decisions, and I'm not equipped to explain the technical details.) Suffice it to say that, from the average person's point of view, the scheduling is pretty random.
This is an example of race condition mentioned in other comments.
