Differentiating between queue full and queue empty - c

I am using an array with a write index and a read index to implement a straightforward FIFO queue. I do the usual MOD ArraySize when incrementing the write and read index.
Is there a way to differentiate between the queue-full and queue-empty conditions (wrIndex == rdIndex) without using an additional queue count and also without wasting an array entry, i.e. treating the queue as full when (WrIndex + 1) MOD ArraySize == ReadIndex?

I'd go with 'wasting' an array entry to detect the queue full condition, especially if you're dealing with different threads/tasks being producers and consumers. Having another flag keep track of that situation increases the locking necessary to keep things consistent and increases the likelihood of some sort of bug that introduces a race condition. This is even more true in the case where you can't use a critical section (as you mention in a comment) to ensure that things are in-sync.
You'll need at least a bit somewhere to keep track of that condition, and that probably means at least a byte. Assuming that your queue contains ints you're only saving 3 bytes of RAM and you're going to chew up several more bytes of program image (which might not be as precious, so that might not matter). If you keep a flag bit inside a byte used to store other flag bits, then you have to additionally deal with setting/testing/clearing that flag bit in a thread safe manner to ensure that the other bits don't get corrupted.
If you're queuing bytes, then you probably save nothing - you can consider the sentinel element to be the flag that you'd otherwise have to put somewhere else, and you need no extra code to deal with that flag.
Consider carefully whether you really need that extra queue item, and keep in mind that if you're queuing bytes, the extra queue item probably isn't really extra space.
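For illustration, here is a minimal single-threaded sketch of the waste-one-slot convention (QSIZE, q_put and the other names are invented, not from the question):

#define QSIZE 16                      /* one slot is sacrificed, so capacity is QSIZE - 1 */

static int buf[QSIZE];
static unsigned wrIndex = 0, rdIndex = 0;

static int q_empty(void) { return wrIndex == rdIndex; }
static int q_full(void)  { return (wrIndex + 1) % QSIZE == rdIndex; }

static int q_put(int v)               /* 0 on success, -1 if full */
{
    if (q_full()) return -1;
    buf[wrIndex] = v;
    wrIndex = (wrIndex + 1) % QSIZE;
    return 0;
}

static int q_get(int *v)              /* 0 on success, -1 if empty */
{
    if (q_empty()) return -1;
    *v = buf[rdIndex];
    rdIndex = (rdIndex + 1) % QSIZE;
    return 0;
}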

Instead of a read and a write index, you could use a read index and a queue count. From the queue count, you can easily tell if the queue is empty or full. And the write index can be computed as (read index + queue count) mod array_size.
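A minimal sketch of that read-index-plus-count layout, with invented names and int elements assumed:

#define QSIZE 16

static int buf[QSIZE];
static unsigned rdIndex = 0, count = 0;         /* count alone distinguishes empty (0) from full (QSIZE) */

static int q_put(int v)
{
    if (count == QSIZE) return -1;              /* full */
    buf[(rdIndex + count) % QSIZE] = v;         /* write index derived from read index + count */
    count++;
    return 0;
}

static int q_get(int *v)
{
    if (count == 0) return -1;                  /* empty */
    *v = buf[rdIndex];
    rdIndex = (rdIndex + 1) % QSIZE;
    count--;
    return 0;
}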

What's wrong with a queue count? It sounds like you're going for maximum efficiency and minimal logic, and while I would do the same, I think I'd still use a queue count variable. Otherwise, one other potential solution would be to use a linked list. Low memory usage, and removing first element would be easy, just make sure that you have pointers to the head and tail of the list.

Basically you only need a single additional bit somewhere to signal that the queue is currently empty. You can probably stash that away somewhere, e.g., in the most significant bit of one of your indices (and then mask it off with an AND wherever you need to work only with the actual index into your array).
But honestly, I'd go with a queue count first and only cut that if I really need that space, instead of putting up with bit fiddling.
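If you do go the bit-stashing route, a rough single-threaded sketch might look like this (16-bit indices with the flag in the top bit of wrIndex are my assumptions, and none of this addresses the concurrency caveats above):

#define QSIZE      16
#define EMPTY_FLAG 0x8000u                       /* top bit of wrIndex marks "queue empty" */

static int buf[QSIZE];
static unsigned short wrIndex = EMPTY_FLAG;      /* queue starts out empty */
static unsigned short rdIndex = 0;

static int q_empty(void) { return (wrIndex & EMPTY_FLAG) != 0; }
static int q_full(void)  { return !q_empty() && (wrIndex & ~EMPTY_FLAG) == rdIndex; }

static int q_put(int v)
{
    if (q_full()) return -1;
    unsigned short w = wrIndex & ~EMPTY_FLAG;    /* strip the flag to get the real index */
    buf[w] = v;
    wrIndex = (unsigned short)((w + 1) % QSIZE); /* a write always clears the "empty" flag */
    return 0;
}

static int q_get(int *v)
{
    if (q_empty()) return -1;
    *v = buf[rdIndex];
    rdIndex = (unsigned short)((rdIndex + 1) % QSIZE);
    if (rdIndex == (wrIndex & ~EMPTY_FLAG))      /* just consumed the last element */
        wrIndex |= EMPTY_FLAG;
    return 0;
}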

Related

C - overcoming the use of atomic operations

I was wondering. If I have an int variable which I want to be synced across all my threads - wouldn't I be able to reserve one bit to know whether the value is being updated or not?
To avoid the write being executed in chunks - which would mean threads could read a half-written value, or even worse, overwrite it and corrupt it completely - I want the threads to first be informed that the variable is being written to. I could simply use an atomic operation to write the new value so that the other threads don't interfere, but this idea does not seem that dumb and I would like to try the basic tools first.
What if I just make one operation that is small enough to happen in one chunk, something like changing a single bit (which will still result in the whole byte changing, but not the whole value, right?), and let that bit indicate whether the variable is being written to? Would that even work, or would the whole int be written?
I mean, even if the whole int were to change, this would still have a chance of working - if the bit indicating that the value is changing were written first.
Any thoughts on this?
EDIT: I feel like I did not specify what I am actually planning to do, and why I thought of this in the first place.
I am trying to implement a timeout function, similar to setTimeout in JavaScript. It is pretty straightforward for a timeout that you don't ever want to cancel - you create a new thread, tell it to sleep for a given amount of time, then give it a function to execute, possibly with some data. Piece of cake. I finished writing it in maybe half an hour, while being totally new to C.
The hard part comes when you want to set a timeout which might be canceled in the future. So you do exactly the same as for a timeout without canceling, but when the thread wakes up and the CPU's scheduler puts it back on, the thread must check whether a value in the memory it was given when it started says 'you should stop executing'. The value could potentially be modified by another thread, but it would only be done once, at least in the best case scenario. I will worry about different solutions when it comes down to modifying the value from multiple threads at the same time. The base assumption right now is that only the main thread, or one of the other threads, can modify the value, and it will happen only once. Ensuring it happens only once can be done with another variable, which might be written multiple times, but always with the same value (that is, the initial value is 0 and means not-yet-canceled; when the timeout must be canceled, the value changes to 1, so there is no worry about the value being fragmented into multiple write operations with only a chunk of it updated at the moment another thread reads it).
Given this assumption, I think the text I initially wrote at the beginning of this post should be more clear. In a nutshell, no need to worry about the value being written multiple times, only once, but by any thread, and the value must be available to be read by any other thread, or it must be indicated that it cannot be read.
Now as I am thinking of it, since the value itself will only ever be 0 or 1, the trick with knowing when it's already been canceled should work too, shouldn't it? The 0 or 1 will always be written in one operation, so there is no need to worry about it being fragmented and read incorrectly. Please correct me if I'm wrong.
On the other hand, what if the value is written from the end, not the beginning? If that's not possible then there's no need to worry and the post will be resolved, but I would like to know of every danger that might come with working around atomic operations like this, in this specific context. If it is written from the end, and a thread accesses the variable to know if it should continue executing, it will conclude that it should, while the expected behaviour would be to stop executing. The chance of this should be minimal, but it still exists, which makes it dangerous, and I want the behaviour to be 100% predictable.
Another edit, to explain what steps I imagine the program taking.
Main thread spawns a new thread, aka 'cancelable timeout'. It passes a function to execute along with data, a time to sleep, and a memory address pointing to a value. After the thread wakes up after the given time, it must check the value to see if it should execute the function it has been given: 0 means it should continue, 1 means it should stop and exit. The value (the thread's 'state', canceled or not canceled) can be manipulated by either the main thread or by another 'timeout' thread whose job is to cancel the first thread.
Sample code:
#include <windows.h>
#include <stdlib.h>

struct Timeout {
    void (*function)(void* data);   /* callback to run when the timeout fires */
    void* data;                     /* argument handed to the callback */
    int milliseconds;               /* how long to sleep */
    int** base;                     /* pointer to the (resizable) array of cancel flags */
    int cancelID;                   /* this timeout's index into *base */
};

DWORD WINAPI CTimeout(LPVOID param) {
    struct Timeout* data = param;
    Sleep(data->milliseconds);
    /* (*data->base)[cancelID] stays 0 while the timeout is still wanted;
       note that [] already indexes in ints, so no sizeof(int) scaling is needed */
    if ((*data->base)[data->cancelID] == 0) {
        data->function(data->data);
    }
    free(data);
    return 0;
}
Where CTimeout is a function provided to the newly-spawned thread. Please note that I have written some of this code on the go and haven't tested it. Ignore any potential errors.
Timeout.base is a pointer to a pointer to an array of ints, since many timeouts can exist at the same time. Timeout.cancelID is the ID of the current thread on the list of timeouts; the same ID, used as an index into the base array, locates this timeout's state value. If the value is 0, the thread should execute its function; otherwise it should clean up the data it has been given and nicely return. The reason base is a pointer to a pointer is that the array of timeout states can be resized at any time. If the array moves, a stale pointer to its initial location would be accessing memory which no longer belongs to us, which might potentially cause a segmentation fault (if not, correct me please).
Base can be accessed from the main thread or other threads if necessary, and the state of our thread can be changed to cancel its execution.
If any thread wants to change the state (the state as state of the timeout we spawned at the beginning and want to cancel), it should change the value in the 'base' array. I think this is pretty straightforward so far.
There would be a huge problem if the values for continuing and stopping were something bigger than just 1 byte. A write to the memory could actually take multiple operations, and thus accessing the memory too early would cause unexpected results, which is not what I am fond of. Though, as I mentioned earlier, what if the value is very small, 0 or 1? Would it matter at all at what time the value is accessed? We are interested only in 1 byte; even 2 or 4 bytes, or the whole number, even 8 bytes, wouldn't make any difference in this case, would they? In the end, there is no worry about receiving an invalid value, since we don't care about the whole 32-bit value, but just 1 bit, no matter how many bytes we would be reading.
Maybe it isn't exactly understandable what I mean. Write/read operations do not consist of reading single bits, but byte(s). That is, if our value is not bigger than 255, or 65535, or about 4 billion - whatever the number of bytes we are writing/reading - we shouldn't worry about reading it in the middle of it being written. What we care about is only one chunk of what is being written, the last or the first byte(s). The rest is completely useless to us, so there is no need to worry about it all being synced at the time we access the value. The real problem starts when the value is being written, but the byte written first is the one at the end, which is useless to us. If we read the value at that moment, we will receive what we shouldn't - a 'no cancel' state instead of 'cancel'. If the first (least significant, given little endian) byte were written first, we would receive a valid value even if reading in the middle of the write.
Perhaps I am mangling and mistaking everything. I am not a pro, you know. Perhaps I have been reading trashy articles, whatever. If I am wrong about anything at all, please correct me.
Except for some specialised embedded environments with dedicated hardware, there is no such thing as "one operation, which is small enough to keep it in one chunk, an operation like changing a single bit". You need to keep in mind that you do not want to simply overwrite the special bit with "1" (or "0"), because even if you could do that, it might just coincide with some other thread doing the same. What you in fact need to do is check whether it is already 1, and ONLY if it is NOT, write a 1 yourself, and KNOW that you did not overwrite an existing 1 (or that writing your 1 failed because a 1 was already there).
This is called a critical section, and this problem can only be solved by the OS, which knows about the other parallel threads and can prevent them from interfering. This is the reason for the existence of the OS-supported synchronisation methods.
There is no easy way around this.
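For what it's worth, the portable way to get that "only write the 1 if it is not already there, and know whether you were the one who wrote it" behaviour is an atomic read-modify-write rather than a plain store. A minimal sketch with C11 <stdatomic.h>, with made-up names:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int canceled = 0;        /* 0 = not canceled, 1 = canceled */

/* Returns true only for the one caller whose write actually performed the cancel. */
static bool try_cancel(void)
{
    /* atomic_exchange returns the previous value, so even if several
       threads race to cancel, exactly one of them sees 0 here. */
    return atomic_exchange(&canceled, 1) == 0;
}

/* The timeout thread checks this after waking up. */
static bool was_canceled(void)
{
    return atomic_load(&canceled) != 0;
}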

Lockfree buffer updates with variable-length messages in C

I have 2 buffers of size N. I want to write to the buffers from different threads without using locks.
I maintain a buffer index (0 or 1) and an offset where the next write to the buffer starts. If I can get the current offset and set the offset to offset + len_of_the_msg in an atomic manner, it will guarantee that the different threads will not overwrite each other. I also have to take care of buffer overflow. Once a buffer is full, switch buffers and set the offset to 0.
Tasks to do, in order:
set a = offset
increment offset by msg_len
if offset > N: switch buffer, set a to 0, set offset to msg_len
I am implementing this in C. Compiler is gcc.
How can I do these operations in an atomic manner without using locks? Is it possible to do so?
EDIT:
I don't have to use 2 buffers. What I want to do is "Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached"
re: your edit:
I don't have to use 2 buffers. What I want to do is: Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached
A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.
Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)
Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
I'm just brainstorming ATM, I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.
It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.
I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.
If you are set on switching buffers:
I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)
But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)
The biggest problem is that the buffer-number can change while you're atomically updating the offset.
It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.
Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.
So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)
But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.
A MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make that non-trivial or maybe not viable at all.
The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues, to, though. A writer tries to CAS a write-index, then uses that claimed space.
Obviously all of this would be done with C11 #include <stdatomic.h> for _Atomic size_t offsets[2], or with GNU C builtin __atomic_...
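A bare-bones sketch of the claim-a-byte-range CAS described above, assuming C11 atomics, two fixed-size buffers, and invented names; switching buffers and knowing when a claimed region has actually been written still need the extra synchronization discussed:

#include <stdatomic.h>
#include <stdint.h>

#define NBUF  2
#define BUFSZ 4096

static char buffers[NBUF][BUFSZ];

/* High 32 bits: current buffer number. Low 32 bits: byte offset into it. */
static _Atomic uint64_t state = 0;

/* Try to claim msg_len bytes in the current buffer; returns a pointer to the
   claimed region, or NULL if there is no room (buffer switching is left out). */
static char *claim(uint32_t msg_len)
{
    uint64_t old = atomic_load(&state);
    for (;;) {
        uint32_t buf = (uint32_t)(old >> 32);
        uint32_t off = (uint32_t)old;
        if (off + msg_len > BUFSZ)
            return NULL;
        uint64_t desired = ((uint64_t)buf << 32) | (off + msg_len);
        /* On failure, 'old' is refreshed with the current value and we retry;
           only actually-usable offsets ever get swapped in. */
        if (atomic_compare_exchange_weak(&state, &old, desired))
            return &buffers[buf][off];
    }
}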
I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).
For discussion, let's assume your buffers are organized this way:
#define MAXBUF 100

struct mybuffer {
    char data[MAXBUF];
    int  index;
};

struct mybuffer Buffers[2];

int currentBuffer = 0; // switches between 0 and 1
Though parts can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it is really one big critical section. I cannot imagine any compiler with a unicorn primitive for this.
Looking at the GCC __atomic_add_fetch() primitive, this adds a given value (the message size) to a variable (the current buffer index), returning the new value; this way you could test for overflow.
Looking at some rough code that is not correct:
// THIS IS ALL WRONG!
int oldIndex = Buffers[current].index;

if (__atomic_add_fetch(&Buffers[current].index, mysize, __ATOMIC_SEQ_CST) > MAXBUF)
{
    // overflow, must switch buffers
    // do same thing with new buffer
    // recompute oldIndex
}
// copy your message into Buffers[current] at oldIndex
This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.
What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?
The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.
This is the NASCAR of race conditions; I do not see how you can accomplish this without treating the whole thing as a critical section, making the other threads wait their turn.
void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    // (as this code assumes), or can they be split across buffers?
    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        // do we have to fill it with NUL bytes or something?
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }

    int myindex = buf->index;
    buf->index += n;

    // copy your data into the buffer at myindex;
    // LEAVE CRITICAL SECTION
}
We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.
But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.
An additional note: it's possible to roll your own critical section code with some interlocked exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and avoids tried-and-true thread primitives designed for exactly this purpose.
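For concreteness, here is a sketch of how the ENTER/LEAVE markers above might map onto a POSIX mutex, reusing MAXBUF, Buffers and currentBuffer from the earlier snippet (the lock name is made up):

#include <assert.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t buffers_lock = PTHREAD_MUTEX_INITIALIZER;

void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF);                       // avoid danger

    pthread_mutex_lock(&buffers_lock);         // ENTER CRITICAL SECTION

    struct mybuffer *buf = &Buffers[currentBuffer];
    if ((buf->index + n) > MAXBUF)
    {
        currentBuffer = (currentBuffer + 1) % 2;
        buf = &Buffers[currentBuffer];
    }
    memcpy(buf->data + buf->index, msg, n);
    buf->index += n;

    pthread_mutex_unlock(&buffers_lock);       // LEAVE CRITICAL SECTION
}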

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copying the block shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow.
You could increment by bins of a global array instead of one entry at a time. In other words, you could have a large array, and each thread could start with 10 possible output entries. If a thread overflows, it requests the next available bin from the global array. If you're worried about slowness with the single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one gets full, find another one.
I'm also considering processing the data twice: the 1st time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again.
This is another valid method. The bottleneck is calculating the offset of each thread into the global array once you have the total number of results for each thread. I haven't figured out a reasonable parallel way to do that.
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int (shared-memory atomics help with the slow global ones). If you run out of space, mark that the block didn't finish, and mark where it left off. On your next iteration, complete the work that hasn't been finished.
The downside, of course, of the distributed portions of global memory is, like talonmies said, that you need a gather or compaction step to make the results dense.
Good luck!

Queues implementation benchmark

I'm starting development of a series of image processing algorithms, some of them with intensive use of queues. Do you guys know a good benchmark for those data structures?
To narrow the scope, I'm using C mostly, but I can use C++, stl and any library.
I've got a few hits on data structure libraries, such as GLib and C-Generic-Library, and of course the containers of STL. Also, if any of you developed/know a faster queue than those, please advise :)
Also, the queue will see lots of enqueue and dequeue operations, so it had better have a smart way to manage memory.
For a single threaded application you can often get around having to use any type of queue at all simply by processing the next item as it comes in, but there are many applications where this isn't the case (queuing up data for output, for instance).
Without the need to lock the queue (no other threads to worry about) a simple circular buffer is going to be hard to beat for performance. If for some reason the queue needs to grow after creation this is a little bit more difficult, but you shouldn't have a hard time finding a circular buffer queue implementation (or building your own). If either inserting or extracting are done in a signal handler (or interrupt service routine) then you may actually need to protect the read and/or write position indexes, but if you know your target well you may be able to determine that this is not the case (when in doubt protect, though). Protection would be by either temporarily blocking the signals or interrupts that could put things in your queue. (You would really need to block this if you were to need to resize the queue)
If whatever you are putting in the queue has to be dynamically allocated anyway, then you might want to just tack on a pointer and turn the thing into a list node. A singly linked list where the list master holds a pointer to the head and to the last node is sufficient. Extract from the head and insert at the tail. Here, protecting the inserts and extractions from race conditions is pretty much independent, and you only need to worry about things when the length of the list is very low. If you truly do have a single threaded application then you don't have to worry about it at all.
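A minimal single-threaded sketch of that head/tail singly linked list (invented names, with the payload carried as a pointer):

#include <stddef.h>

struct node {
    void        *payload;     /* whatever you were going to allocate anyway */
    struct node *next;
};

struct list_queue {
    struct node *head;        /* extract here */
    struct node *tail;        /* insert here */
};

static void enqueue(struct list_queue *q, struct node *n)
{
    n->next = NULL;
    if (q->tail)
        q->tail->next = n;
    else
        q->head = n;          /* queue was empty */
    q->tail = n;
}

static struct node *dequeue(struct list_queue *q)
{
    struct node *n = q->head;
    if (n) {
        q->head = n->next;
        if (q->head == NULL)
            q->tail = NULL;   /* queue is now empty */
    }
    return n;
}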
I don't have any actual benchmarks and can't make any suggestions about any library implementations, but both methods are O(1) for both insert and extract. The first is more cache (and memory pager) friendly unless your queue size is much larger than it needs to be. The second method is less cache friendly since each member of the queue can be in a different area of RAM.
Hope this helps you evaluate or create your own queue.

Interruptible in-place sorting algorithm

I need to write a sorting program in C and it would be nice if the file could be sorted in place to save disk space. The data is valuable, so I need to ensure that if the process is interrupted (ctrl-c) the file is not corrupted. I can guarantee the power cord on the machine will not be yanked.
Extra details: file is ~40GB, records are 128-bit, machine is 64-bit, OS is POSIX
Any hints on accomplishing this, or notes in general?
Thanks!
To clarify: I expect the user will want to ctrl-c the process. In this case, I want to exit gracefully and ensure that the data is safe. So this question is about handling interrupts and choosing a sort algorithm that can wrap up quickly if requested.
Following up (2 years later): Just for posterity, I have installed the SIGINT handler and it worked great. This does not protect me against power failure, but that is a risk I can handle. Code at https://code.google.com/p/pawnsbfs/source/browse/trunk/hsort.c and https://code.google.com/p/pawnsbfs/source/browse/trunk/qsort.c
Jerry's right, if it's just Ctrl-C you're worried about, you can ignore SIGINT for periods at a time. If you want to be proof against process death in general, you need some sort of minimal journalling. In order to swap two elements:
1) Add a record to a control structure at the end of the file or in a separate file, indicating which two elements of the file you are going to swap, A and B.
2) Copy A to the scratch space, record that you've done so, flush.
3) Copy B over A, then record in the scratch space that you have done so, flush
4) Copy from the scratch space over B.
5) Remove the record.
This is O(1) extra space for all practical purposes, so still counts as in-place under most definitions. In theory recording an index is O(log n) if n can be arbitrarily large: in reality it's a very small log n, and reasonable hardware / running time bounds it above at 64.
In all cases when I say "flush", I mean commit the changes "far enough". Sometimes your basic flush operation only flushes buffers within the process, but it doesn't actually sync the physical medium, because it doesn't flush buffers all the way through the OS/device driver/hardware levels. That's sufficient when all you're worried about is process death, but if you're worried about abrupt media dismounts then you'd have to flush past the driver. If you were worried about power failure, you'd have to sync the hardware, but you're not. With a UPS or if you think power cuts are so rare you don't mind losing data, that's fine.
On startup, check the scratch space for any "swap-in-progress" records. If you find one, work out how far you got and complete the swap from there to get the data back into a sound state. Then start your sort over again.
Obviously there's a performance issue here, since you're doing twice as much writing of records as before, and flushes/syncs may be astonishingly expensive. In practice your in-place sort might have some compound moving-stuff operations, involving many swaps, but which you can optimise to avoid every element hitting the scratch space. You just have to make sure that before you overwrite any data, you have a copy of it safe somewhere and a record of where that copy should go in order to get your file back to a state where it contains exactly one copy of each element.
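To make the journalling concrete, here is a rough sketch of one journalled swap over 128-bit records, assuming a separate journal file and that fsync is "far enough" for your failure model (all names are invented and error checking is omitted):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct { uint64_t lo, hi; } rec_t;        /* one 128-bit record */

struct journal {                                  /* lives in its own small file */
    uint32_t step;                                /* 0 = idle, otherwise swap in progress */
    uint64_t i, j;                                /* record indices being swapped */
    rec_t    saved_a;                             /* scratch copy of record i */
};

static void jwrite(int jfd, const struct journal *log)
{
    pwrite(jfd, log, sizeof *log, 0);
    fsync(jfd);                                   /* the "flush" in the steps above */
}

/* Swap records i and j of the data file so that a crash at any point is recoverable:
   on startup, a journal with step != 0 tells you which copies still need redoing. */
static void safe_swap(int datafd, int jfd, uint64_t i, uint64_t j)
{
    struct journal log = { 0 };
    rec_t a, b;

    pread(datafd, &a, sizeof a, (off_t)(i * sizeof a));
    pread(datafd, &b, sizeof b, (off_t)(j * sizeof b));

    log.step = 2; log.i = i; log.j = j; log.saved_a = a;   /* steps 1+2: record intent + scratch copy of A */
    jwrite(jfd, &log);

    pwrite(datafd, &b, sizeof b, (off_t)(i * sizeof b));   /* step 3: copy B over A */
    fsync(datafd);
    log.step = 3;
    jwrite(jfd, &log);

    pwrite(datafd, &a, sizeof a, (off_t)(j * sizeof a));   /* step 4: copy the scratch copy over B */
    fsync(datafd);

    log.step = 0;                                          /* step 5: remove the record */
    jwrite(jfd, &log);
}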
Jerry's also right that true in-place sorting is too difficult and slow for most practical purposes. If you can spare some linear fraction of the original file size as scratch space, you'll have a much better time of it with a merge sort.
Based on your clarification, you wouldn't need any flush operations even with an in-place sort. You need scratch space in memory that works the same way, and that your SIGINT handler can access in order to get the data safe before exiting, rather than restoring on startup after an abnormal exit, and you need to access that memory in a signal-safe way (which technically means using a sig_atomic_t to flag which changes have been made). Even so, you're probably better off with a mergesort than a true in-place sort.
Install a handler for SIGINT that just sets a "process should exit soon" flag.
In your sort, check the flag after every swap of two records (or after every N swaps). If the flag is set, bail out.
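A sketch of that flag, using POSIX sigaction and a volatile sig_atomic_t (the one type a signal handler is guaranteed to be able to set safely):

#include <signal.h>

static volatile sig_atomic_t stop_requested = 0;

static void on_sigint(int sig)
{
    (void)sig;
    stop_requested = 1;        /* the handler does nothing else */
}

int install_sigint_handler(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGINT, &sa, NULL);
}

/* In the sort loop, after every swap (or every N swaps):
 *     if (stop_requested) { finish the current swap, save state, exit cleanly; }
 */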
The part for protecting against ctrl-c is pretty easy: signal(SIGINT, SIG_IGN);.
As far as the sorting itself goes, a merge sort generally works well for external sorting. The basic idea is to read as many records into memory as you can, sort them, then write them back out to disk. By far the easiest way to handle this is to write each run to a separate file on disk. Then you merge those back together -- read the first record from each run into memory, and write the smallest of those out to the original file; read another record from the run that supplied that record, and repeat until done. The last phase is the only time you're modifying the original file, so it's the only time you really need to assure against interruptions and such.
Another possibility is to use a selection sort. The bad point is that the sorting itself is quite slow. The good point is that it's pretty easy to write it to survive almost anything, without using much extra space. The general idea is pretty simple: find the smallest record in the file, and swap that into the first spot. Then find the smallest record of what's left, and swap that into the second spot, and so on until done. The good point of this is that journaling is trivial: before you do a swap, you record the values of the two records you're going to swap. Since the sort runs from the first record to the last, the only other thing you need to track is how many records are already sorted at any given time.
Use heap sort, and prevent interruptions (e.g. block signals) during each swap operation.
Back up whatever you plan to change. Then put a flag that marks a successful sort. If everything is OK then keep the result, otherwise restore the backup.
Assuming a 64-bit OS (you said it is a 64-bit machine, but it could still be running a 32-bit OS), you could use mmap to map the file to an array and then use qsort on the array.
Add a handler for SIGINT to call msync and munmap to allow the app to respond to Ctrl-C without losing data.
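A rough sketch of that approach for 128-bit records, assuming they compare as unsigned (hi, lo) pairs; error handling is trimmed, and whether calling msync from the handler itself is acceptable depends on how strict you want to be about async-signal-safety:

#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct { uint64_t lo, hi; } rec_t;       /* one 128-bit record */

static int cmp_rec(const void *pa, const void *pb)
{
    const rec_t *a = pa, *b = pb;                /* assumes hi is the most significant half */
    if (a->hi != b->hi) return a->hi < b->hi ? -1 : 1;
    if (a->lo != b->lo) return a->lo < b->lo ? -1 : 1;
    return 0;
}

int sort_file(const char *path)
{
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);

    size_t bytes = (size_t)st.st_size;
    rec_t *recs = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    qsort(recs, bytes / sizeof(rec_t), sizeof(rec_t), cmp_rec);

    msync(recs, bytes, MS_SYNC);                 /* push the sorted data back to the file */
    munmap(recs, bytes);
    close(fd);
    return 0;
}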
