Context switch in the middle of a system call

Assume we have 2 threads in the process.
now we run the following code:
read(fd, buf, 10);
where fd is some file descriptor which is shared among the threads (say static), and buf is an array which is not shared among the threads (local variable).
now, assume that the file is 1KB and the first 10 chars in the file are "AAAAAAAAAA" and all the rest are 'B's. ("BBBBBB.....").
Now, if we have only one processor, what will the contents of the bufs be if I print them in each thread?
I know the answer is that one of the arrays will always have only A's and the other only B's, but I don't fully understand why, because I think that there could be a context switch in the middle of this system call (read) and then both bufs would have A's in them.
Is it even possible for a context switch to occur in the middle of a system call? And if so, what could the bufs contain at the end of the execution?

Modern disks cannot perform reads and writes at 10-byte granularity and instead perform reads and writes in units of sectors, traditionally 512 bytes for hard disk drives (HDDs).
Copying 10 characters into the thread's buffer happens very fast, typically before a context switch occurs, though that alone is not a guarantee. More importantly, the kernel updates the shared file offset atomically within each read() call, so even if a context switch does happen mid-call, each read still returns a contiguous range of the file.
A simple program to get a feeling for this would be to have 2 threads printing to the same console, one printing + and the other -. Check how many +'s appear before the first -.
Anyway, on to the original question: change the size of the array to 1024, start the file with 1024 A's, and you will most probably see the difference.

Related

C bulk write multiple values at once

I am trying to write a bunch of values to a FIFO pipe - which works fine, but the issue I have is that the program on the other end of the FIFO ends up reading the values before all of them are written (I assume kernel scheduling isn't working in my favor). Below is roughly what my code looks like (this works well - half of the time):
write(out_fd, (void *)struct_1, sizeof(struct part_1));
write(out_fd, (void *)struct_2, sizeof(struct part_2));
write(out_fd, (void *)struct_3, sizeof(struct part_3));
However, what I assume is happening is that kernel scheduling interrupts somewhere in between the sequential writes, while on the other end of the FIFO the other program reads the values as they come in. When what it reads does not match the full set of values expected to be written at once, my program fails to operate correctly, since the values do not match the expected format.
Does anyone have any ideas on how all the writes can be performed in bulk, e.g. preventing the scheduler from switching between the applications before all the writes are done? My initial attempt was to malloc enough space for all of the values, then memcpy each value to its respective offset, and then write that; however, this ended up causing heap memory corruption (after freeing everything), which only became apparent further down the line.
Any suggestions? Thank you.
A FIFO is just a stream of bytes; it provides no framing of any kind. In particular, there is no guarantee that read() will return the same chunks that write() wrote. Fix your receiver to handle this: keep reading until a complete record has arrived. (Writes of at most PIPE_BUF bytes to a FIFO are atomic, so a single combined write can help, but the reader must still cope with short reads.)
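For example, the receiver can loop until it has a whole record. This is a sketch with a hypothetical read_full helper (the name is mine), with error handling kept minimal:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling read() until exactly n bytes have arrived (or EOF/error).
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n)
{
    size_t got = 0;
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: retry */
            return -1;      /* real error */
        }
        if (r == 0)
            break;          /* EOF: writer closed the FIFO early */
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

With this, the sender's three write() calls can be split up by the scheduler however they like; read_full(fd, &s1, sizeof s1) on the other side will not return short just because the writer was switched out mid-sequence.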

Lockfree buffer updates with variable-length messages in C

I have 2 buffer of size N. I want to write to the buffer from different threads without using locks.
I maintain a buffer index (0 and 1) and an offset where the new write operation to buffer starts. If I can get the current offset and set the offset at offset + len_of_the_msg in an atomic manner, it will guarantee that the different threads will not overwrite each other. I also have to take care of buffer overflow. Once a buffer is full, switch buffer and set offset to 0.
Task to do in order:
set a = offset
increment offset by msg_len
if offset > N: switch buffer, set a to 0, set offset to msg_len
I am implementing this in C. Compiler is gcc.
How can I do these operations atomically without using locks? Is it even possible?
EDIT:
I don't have to use 2 buffers. What I want to do is "Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached"
re: your edit:
I don't have to use 2 buffers. What I want to do is: Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached
A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.
Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)
Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
I'm just brainstorming ATM, I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.
It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.
I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.
If you are set on switching buffers:
I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)
But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)
The biggest problem is that the buffer-number can change while you're atomically updating the offset.
It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.
Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.
So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)
But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.
A MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make that non-trivial or maybe not viable at all.
The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues do, too. A writer tries to CAS a write-index, then uses that claimed space.
Obviously all of this would be done with C11 #include <stdatomic.h> for _Atomic size_t offsets[2], or with GNU C builtin __atomic_...
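A sketch of that CAS loop in C11 (the names, like claim_space, are mine; this shows only the claim step, not the "writer finished storing" synchronization discussed above, which you would still need before a reader may touch the data):

```c
#include <stdatomic.h>
#include <stdint.h>

#define BUFSZ 4096

/* Pack buffer number and offset into one 64-bit value so both can be
 * updated by a single compare-exchange. */
struct pos { uint32_t buf; uint32_t off; };

static _Atomic uint64_t current = 0;

static uint64_t pack(struct pos p)   { return ((uint64_t)p.buf << 32) | p.off; }
static struct pos unpack(uint64_t v) { return (struct pos){ (uint32_t)(v >> 32), (uint32_t)v }; }

/* Claim msg_len bytes; returns the buffer number and offset claimed. */
struct pos claim_space(uint32_t msg_len)
{
    uint64_t old = atomic_load(&current);
    for (;;) {
        struct pos p = unpack(old);
        struct pos next = p;
        if (p.off + msg_len <= BUFSZ) {
            next.off = p.off + msg_len;          /* fits: just bump the offset */
        } else {
            p.buf = next.buf = (p.buf + 1) & 1;  /* full: switch buffers */
            p.off = 0;
            next.off = msg_len;
        }
        /* Only actually-usable positions ever get swapped in; on failure
         * `old` is reloaded and we retry with the fresh value. */
        if (atomic_compare_exchange_weak(&current, &old, pack(next)))
            return p;   /* we own [p.off, p.off + msg_len) in buffer p.buf */
    }
}
```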
I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).
For discussion, let's assume your buffers are organized this way:
#define MAXBUF 100

struct mybuffer {
    char data[MAXBUF];
    int index;
};

struct mybuffer Buffers[2];
int currentBuffer = 0; // switches between 0 and 1
Though parts can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it is really one big critical section. I cannot imagine any compiler with a unicorn primitive for this.
Looking at the GCC __atomic_add_fetch() primitive, this adds a given value (the message size) to a variable (the current buffer index), returning the new value; this way you could test for overflow.
Looking at some rough code that is not correct:
// THIS IS ALL WRONG!
int oldIndex = Buffers[current].index;
if (__atomic_add_fetch(&Buffers[current].index, mysize, __ATOMIC_SEQ_CST) > MAXBUF)
{
    // overflow, must switch buffers
    // do same thing with new buffer
    // recompute oldIndex
}
// copy your message into Buffers[current] at oldIndex
This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.
What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?
The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.
This is the NASCAR of race conditions, I do not see how you can accomplish this without treating the whole thing as a critical section, making other processes wait their turn.
void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    // (as this code assumes), or can they be split across buffers?
    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        // do we have to fill it with NUL bytes or something?
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }
    int myindex = buf->index;
    buf->index += n;

    memcpy(&buf->data[myindex], msg, n); // copy the data into the buffer
    // LEAVE CRITICAL SECTION
}
We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.
But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.
An additional note: it's possible to roll your own critical section code with some interlocked exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and avoids tried-and-true thread primitives designed for exactly this purpose.

Why fread does have thread safe requirements which slows down its call

I am writing a function to read binary files that are organized as a succession of (key, value) pairs where keys are small ASCII strings and value are int or double stored in binary format.
If implemented naively, this function makes a lot of calls to fread to read very small amounts of data (usually no more than 10 bytes). Even though fread internally uses a buffer to read the file, I have implemented my own buffer and have observed a speed-up by a factor of 10 on both Linux and Windows. The buffer size used by fread is large enough, and the function-call overhead alone cannot be responsible for such a slowdown. So I went and dug into the GNU implementation of fread and discovered a lock on the file, and many other things, such as verifying that the file is open with read access, and so on. No wonder fread is so slow.
But what is the rationale behind fread being thread-safe? It seems that multiple threads can call fread on the same file, which is mind-boggling to me. These requirements make it slow as hell. What are the advantages?
Imagine you have a file where each 5 bytes can be processed in parallel (let's say, pixel by pixel in an image):
123456789A
One thread needs to pick 5 bytes "12345", the next one the next 5 bytes "6789A".
If it were not thread-safe, different threads could pick up wrong chunks, for example "12367" and "4589A", or even worse (unexpected behaviour, repeated bytes, or worse).
As suggested by nemequ:
Note that if you're on glibc you can use the _unlocked variants (*e.g., fread_unlocked). On Windows you can define _CRT_DISABLE_PERFCRIT_LOCKS
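For instance, you can take the stream lock once around a whole burst of small reads instead of paying for a lock/unlock on every tiny fread (flockfile/funlockfile are POSIX; fread_unlocked is a GNU extension, hence the _GNU_SOURCE):

```c
#define _GNU_SOURCE
#include <stdio.h>

/* Read `count` records of `size` bytes each, holding the stream lock
 * once for the whole burst rather than once per fread. */
size_t read_records(FILE *fp, void *dst, size_t size, size_t count)
{
    flockfile(fp);                                   /* one lock for everything */
    size_t n = fread_unlocked(dst, size, count, fp); /* no per-call locking */
    funlockfile(fp);
    return n;
}
```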
Stream I/O is already as slow as molasses. Programmers consider a read from main memory (1000x longer than a CPU cycle) to take ages. A read from the physical disk or the network may as well be an eternity.
I don't know if that's the #1 reason why the library implementers were ok with adding the lock overhead, but I guarantee it played a significant part.
Yes, it slows it down, but as you discovered, you can manually buffer the read and use your own handling to increase the speed when performance really matters. (That's the key--when you absolutely must read the data as fast as possible. Don't bother manually buffering in the general case.)
That's a rationalization. I'm sure you could think of more!

C: Each child process reads alternate lines

I'm practicing a typical map-reduce architecture (for an O.S. class) and I'm free to decide how the master process will tell its N child processes to parse a log. So, I'm kind of stuck between these two possibilities:
count the number of rows and give X rows for each map OR
each map reads the line matching its ID, and the next line to read = current_one + number_of_maps
E.g.: with 3 maps, each one is going to read these lines:
Map1: 1, 4, 7, 10, 13
Map2: 2, 5, 8, 11, 14
Map3: 3, 6, 9, 12, 15
I have to do this in order to out-perform a single process that parses the entire log file, so the way I split the job between child processes has to be consistent with this objective.
Which one do you think is best? How can I do the scanf or fgets to adapt to 1) or 2)?
I would be happy with some example code for 2), because the fork/pipes are not my problem :P
RE-EDIT:
I'm not encouraged to use select here, only between the map procs and the reduce process that will be monitoring the reads. My restrictions now are:
I want each process to read total_lines/N lines. But it seems like I have to make the map procs open the file and then read their respective lines. So here are my doubts:
1- Is it bad, or even possible, to have every proc open the file simultaneously or almost simultaneously? How will that help with speed?
2- If it isn't possible to do that, I will have the parent open the file (instead of each child doing it) and send a struct with min and max limits; then the map procs will read whatever lines they are responsible for, process them, and give the reduce process a result (this doesn't matter for the problem now).
How can I correctly divide the lines among the N maps and have them read at the same time? I think fseek() may be a good weapon, but I don't know HOW to use it. Help, please!
If I understood correctly, you want to have all processes reading lines from a single file. I don't recommend this, it's kinda messy, and you'll have to a) read the same parts of the file several times or b) use locking/mutex or some other mechanism to avoid that. It'll get complicated and hard to debug.
I'd have a master process read the file, and assign lines to a subprocess pool. You can use shared memory to speed this up, and reduce the need for data-copying IPC; or use threads.
As for examples, I answered a question about forking and IPC and gave a code snippet on an example function that forks and returns a pair of pipes for parent-child communication. Let me look that up (...) here it is =P Can popen() make bidirectional pipes like pipe() + fork()?
edit: I kept thinking about this =P. Here's an idea:
Have a master process spawn subprocesses with something similar to what I showed in the link above.
Each process starts by sending a byte up to the master to signal it's ready, and blocking on read().
Have the master process read a line from the file to a shared memory buffer, and block on select() on its children pipes.
When select() returns, read one of the bytes that signal readiness and send to that subprocess the offset of the line in the shared memory space.
The master process repeats (reads a line, blocks on select, reads a byte to consume the readiness event, etc.)
The children process the line in whatever way you need, then send a byte to the master to signal readiness once again.
(You can avoid the shared memory buffer if you want, and send the lines down the pipes, though it'll involve constant data-copying. If the processing of each line is computationally expensive, it won't really make a difference; but if the lines require little processing, it may slow you down).
I hope this helps!
edit 2 based on Newba's comments:
Okay, so no shared memory. Use the above model, only instead of sending down the pipe the offset of the line read in the shared memory space, send the whole line. This may sound to you like you're wasting time when you could just read it from the file, but trust me, you're not. Pipes are orders of magnitude faster than reads from regular files in a hard disk, and if you wanted subprocesses to read directly from the file, you'll run into the problem I pointed at the start of the answer.
So, master process:
Spawn subprocesses using something like the function I wrote (link above) that creates pipes for bidirectional communication.
Read a line from the file into a buffer (private, local, no shared memory whatsoever).
You now have data ready to be processed. Call select() to block on all the pipes that communicate you with your subprocesses.
Choose any of the pipes that have data available, read one byte from it, and then send the line you have waiting in the buffer down the corresponding pipe (remember, we had 2 per child process: one to go up, one to go down).
Repeat from step 2, i.e. read another line.
Child processes:
When they start, they have a reading pipe and a writing pipe at their disposal. Send a byte down your writing pipe to signal the master process you are ready and waiting for data to process (this is the single byte we read in step 4 above).
Block on read(), waiting for the master process (that knows you are ready because of step 1) to send you data to process. Keep reading until you reach a newline (you said you were reading lines, right?). Note I'm following your model, sending a single line to each process at a time, you could send multiple lines if you wanted.
Process the data.
Return to step 1, i.e. send another byte to signal you are ready for more data.
There you go, simple protocol to assign tasks to as many subprocesses as you want. It may be interesting to run a test with 1 child, n children (where n is the number of cores in your computer) and more than n children, and compare performances.
Whew, that was a long answer. I really hope I helped xD
Since each of the processes is going to have to read the file in its entirety (unless the log lines are all of the same length, which is unusual), there really isn't a benefit to your proposal 2.
If you are going to split up the work into 3, then I would do:
Measure (stat()) the size of the log file - call it N bytes.
Allocate the range of bytes 0..(N/3) to first child.
Allocate the range of bytes (N/3)+1..2(N/3) to the second child.
Allocate the range of bytes 2(N/3)+1..end to the third child.
Define that the second and third children must synchronize by reading forward to the first line break after their start position.
Define that each child is responsible for reading to the first line break on or after the end of their range.
Note that the third child (last child) may have to do more work if the log file is growing.
Then the processes are reading independent segments of the file.
(Of course, with them all sharing the file, then the system buffer pool saves rereading the disk, but the data is still copied to each of the three processes, only to have each process throw away 2/3 of what was copied as someone else's job.)
Another, more radical option:
mmap() the log file into memory.
Assign the children to different segments of the file along the lines described previously.
If you're on a 64-bit machine, this works pretty well. If your log files are not too massive (say 1 GB or less), you can do it on a 32-bit machine too. As the file size grows above 1 GB or so, you may start running into memory mapping and allocation issues, though you might well get away with it until you reach a size somewhat less than 4 GB (on a 32-bit machine). The other issue here is with growing log files. AFAIK, mmap() doesn't map extra memory as extra data is written to the file.
Use a master and slave queue pattern.
The master sets up the slaves which sit waiting on a queue for work items.
The master then reads the file line by line.
Each line then represents a work item that you put on the queue, along with a function pointer describing how to do the work.
One of the waiting slaves then takes the item off the queue.
A slave processes a work item.
When a slave has finished, it rejoins the work queue.

types of buffers

Recently an interviewer asked me about the types of buffers. What types of buffers are there? Actually, this question came up when I said I would write all the system calls to a log file to monitor the system. He said it would be slow to write each and every call to a file, and asked how to prevent that. I said I would use a buffer, and he asked me what type of buffer. Can someone explain the types of buffers to me, please?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it was a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered that to a single fprint-type call when it was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprint-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ^           ^
                                 |           |
                   Buffer next --+           +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case of neither FIFO nor LIFO buffer is the TCP segment reassembly buffers. These may hold segments received out-of-order ("from the future") which are held pending the receipt of intermediate segments not-yet-arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
