Linux C Standard I/O - why double copying

Assuming I understand the flow correctly, suppose one wants to read a few bytes off an opened FILE stream, let's say using fread:
the read syscall copies the data from the kernel into the user space buffer
the user space buffer (either allocated by glibc or provided via setvbuf...) is then copied into the buffer passed to fread
Why is the 2nd step needed? Why can't I get a pointer into the user space buffer and then decide myself whether to store (copy) the data or not?

The purpose of the 2nd buffer is to amortize the system call overhead. If you read/write only a few bytes at a time, this second user space buffer will improve performance enormously. OTOH if you read/write a large chunk, the 2nd buffer can be bypassed, so you don't pay the price for double copying.
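To make the amortization concrete, here is a minimal sketch (the file name data.bin is a placeholder) contrasting byte-by-byte reads through stdio, where a read() syscall happens only per buffer refill, with one read() syscall per byte:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Buffered: fgetc() serves bytes from the stdio buffer; a read()
           syscall is issued only when the buffer runs dry (roughly every
           BUFSIZ bytes). */
        FILE *f = fopen("data.bin", "r");
        if (f) {
            while (fgetc(f) != EOF)
                ;                            /* ~1 syscall per BUFSIZ bytes */
            fclose(f);
        }

        /* Unbuffered: one read() syscall per byte; the same data costs
           orders of magnitude more kernel round trips. */
        int fd = open("data.bin", O_RDONLY);
        if (fd >= 0) {
            char b;
            while (read(fd, &b, 1) == 1)
                ;                            /* 1 syscall per byte */
            close(fd);
        }
        return 0;
    }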

The second step is what it is all about. The kernel must take care of such operations; the API you use is then fed the result. This is the usual kernel-space/user-space behaviour. Read about it. You perhaps might not know it now, but the kernel-space/user-space split is a basic part of OS infrastructure.

Related

User space buffer and Kernel buffer

I have recently been studying the file system in Linux. I learned that when we call fopen(), the library call will call malloc() to allocate space for the FILE structure, and inside this FILE structure there will be a buffer for I/O. But later on I found that the write system call actually writes the data to a kernel buffer. So what's the difference between these two buffers?
You have to understand two things: fwrite() is a standard library routine operating on the FILE structure, but write() is a system call. I bet fwrite() uses write() internally. Nothing keeps fwrite() from providing user space I/O buffering until it is ready to pass your data on to the write() syscall.
The write() syscall, in its turn, goes straight to the kernel and says: "Hey kernel, I've got this user space buffer here. Will you write this one to the storage for me?" And here it's up to the kernel what to do next: it will either go directly to storage to write the data or, most likely, copy the data to a kernel buffer until it decides it's time to modify the storage data.
Turning back to your question: any kind of buffering is done to accumulate data in order to postpone turning to more expensive operations. The standard library may consider invoking a syscall for every len bytes expensive; the kernel considers going to the hard disk on every syscall expensive; and so on.
You might want to read this to see how far buffering goes https://fgiesen.wordpress.com/2015/10/25/reading-and-writing-are-less-symmetric-than-you-probably-think/
The FILE structure holds the meta data about the opened file (mode, stream position, etc). It is part of the C Standard I/O interface.
The buffer allocated as part of FILE holds only a limited amount of data (when the stream is buffered at all). It is deallocated upon fclose(). You may even provide your own user space stdio buffer with setvbuf().
The kernel buffer receives the file contents written by write(), whenever the stream is flushed or the associated file descriptor is closed.
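For example, here is a minimal sketch of supplying your own stdio buffer with setvbuf() (the 64 KiB size is an arbitrary choice, data.bin a placeholder):

    #include <stdio.h>

    int main(void)
    {
        static char buf[64 * 1024];           /* our own stdio buffer */
        FILE *f = fopen("data.bin", "r");
        if (!f)
            return 1;

        /* Must be called before the first I/O operation on the stream;
           _IOFBF requests full buffering into buf. */
        if (setvbuf(f, buf, _IOFBF, sizeof buf) != 0)
            perror("setvbuf");

        /* ... fread()/fgetc() now refill buf in 64 KiB chunks ... */

        fclose(f);                            /* the stream stops using buf here */
        return 0;
    }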
The FILE structure holds the information about the opened file; this is what defines the FILE struct members. At the kernel level, however, a file is accessed through its inode and the buffer cache.
Data is read from and written to disk from user space through the buffer cache, using copy_to_user() and copy_from_user().
There is a very big difference between the two buffers: one is a kernel buffer and the other is a user buffer. So what basically happens when you do I/O is that the buffer from user space is copied into a buffer in kernel space. The function copy_from_user() does this task.
Now the question which arises is: why do we need two buffers when the kernel has access to the user space? The reason is that the kernel does not want to read the user buffer directly, because the kernel and user space have different address spaces, so a valid address in user space might not be a valid address in the kernel.
If an invalid address is accessed in the kernel, the system will panic immediately, so copy_from_user() does the task of mapping between user space and kernel space addresses and checks whether the address is accessible. If not, it simply returns EFAULT (bad address).
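As an illustration of where that check lives, here is a skeletal read() handler for a Linux character driver (a sketch only; my_read, dev_buf and dev_len are hypothetical driver names, and the real thing needs <linux/fs.h> and <linux/uaccess.h>):

    static ssize_t my_read(struct file *filp, char __user *ubuf,
                           size_t count, loff_t *ppos)
    {
        if (*ppos >= dev_len)
            return 0;                         /* end of device data */
        if (count > dev_len - *ppos)
            count = dev_len - *ppos;

        /* copy_to_user() validates the user address range and returns
           the number of bytes it could NOT copy. */
        if (copy_to_user(ubuf, dev_buf + *ppos, count))
            return -EFAULT;                   /* bad address */

        *ppos += count;
        return count;
    }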

C read part of file into cache

I have to do a program (for Linux) where there's an extremely large index file and I have to search and interpret the data from the file. Now the catch is, I'm only allowed to have x bytes of the file cached at any time (determined by an argument), so I have to remove certain data from the cache if it's not what I'm looking for.
If my understanding is correct, fopen() (with mode "r") doesn't put anything in the cache; only when I call getc or fread (specifying a size) does data get cached.
So my question is: let's say I use fread and read 100 bytes, but after checking it, only 20 of the 100 bytes contain the data I need. How would I remove the useless 80 bytes from the cache (or overwrite them) in order to read more from the file?
EDIT: By caching I mean data stored in memory, which makes the problem easier.
fread's first argument is a pointer to a block of memory, so the way to go about this is to point it at the data you want to overwrite. For example, let's say you want to keep bytes 20-40 and overwrite everything else. You could either (a) invoke fread at the start of the buffer with a length of 20 and then invoke it again at buffer[40] with a size of 60, or (b) start by defragmenting (i.e. copying the bytes you want to keep to the start) and then invoke fread with a pointer to the next section, as sketched below.
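A minimal sketch of approach (b), assuming a 100-byte buffer in which bytes 20..39 are the ones worth keeping:

    #include <stdio.h>
    #include <string.h>

    /* Move the 20 bytes we want to keep to the front of the buffer,
       then refill the remaining 80 bytes from the stream. */
    size_t refill(FILE *f, char buf[100])
    {
        memmove(buf, buf + 20, 20);           /* kept bytes now at buf[0..19] */
        return fread(buf + 20, 1, 80, f);     /* fresh bytes follow them */
    }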
Why do you want to micromanage the cache? Secondly, what makes you think you can? No argument specified on the command line of your program can control what the cache manager does internally: it may decide to read an entire file into RAM, it may decide to read none of it, or it may decide to throw a party. Any control you have over it would use low-level APIs/syscalls and would not be very granular.
I think you might be confused about the requirements, or maybe the person who gave them to you is. You seem to be referring to the cache managed by the operating system, which an application never needs to worry about. The operating system will make sure it doesn't grow too large automatically.
The other meaning of "cache" is the one you create yourself, the char* buffer or whatever you create to temporarily hold the data in memory while you process it. This one should be fairly easy to manage yourself simply by not allocating too much memory for that buffer.
To discard the read buffer of a file opened with fopen(), you can use fflush(). Also note that you can control the buffer size with setvbuf().
You should consider using open/read (instead of fopen/fread) if you must have exact control over buffering, though.
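A sketch of that lower-level route, where cache_size stands for the byte limit taken from the command line (names here are hypothetical):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read at most cache_size bytes straight into our own cache;
       no stdio buffer sits in between. */
    char *load_chunk(const char *path, size_t cache_size, ssize_t *got)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;
        char *cache = malloc(cache_size);
        if (cache)
            *got = read(fd, cache, cache_size);
        close(fd);
        return cache;                         /* caller frees; NULL on failure */
    }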

since 'fread' is buffered, is it necessary to fread data into memory and then use it?

I am using fopen/fread/fwrite/fseek on Linux with gcc. Is it necessary to allocate a memory buffer and use fread to read data sequentially into the buffer before using the data?
When you use fread or the other file I/O functions in the C standard library, memory is buffered in several places.
Your application allocates a buffer which gets passed to fread. fread copies data into your buffer, and then you can do what you want with it. You are responsible for allocation/deallocation of this buffer.
The C library will usually create a buffer for every FILE* you have open. Data is read into these buffers in large chunks. This allows fread to satisfy many small requests without having to make a large number of system calls, which are expensive. This is what people mean when they say fread is buffered.
The kernel will also buffer files that are being read in the disk cache. This reduces the time needed for the read system call, since if data is already in memory, your program won't have to wait while the kernel fetches it from the disk. The kernel will hold on to recently read files, and it may read ahead for files which are being accessed sequentially.
The C library buffer is allocated automatically when you open a file and freed when you close the file. You don't have to manage it at all.
The kernel disk cache is stored in physical memory that isn't being used for anything else. Again, you don't have to manage this. The memory will be freed as soon as it's needed for something else.
You must pass a buffer (created by your code, malloc'ed or local) to fread, which it uses to pass the read data back to you. I don't know what exactly you mean by saying "fread is buffered". Most C library calls operate in this fashion: they will not return their internal storage (buffer or otherwise) to you, and if they do, they will provide corresponding free/release functions.
Refer to http://pubs.opengroup.org/onlinepubs/000095399/functions/fread.html - it also has a very basic example.
With fread, yes, you have to allocate memory in your process and the system call will copy the data into your buffer.
In some specialised cases, you can handle data without copying it into userspace. See the sendfile system call, which copies data from one file descriptor to another directly. This can be used to transfer data from a file to a network socket without excessive copying.
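A minimal sketch of that, assuming sock_fd is an already-connected socket descriptor (error handling abbreviated):

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Push a whole file to a socket; the bytes never enter user space. */
    ssize_t send_whole_file(int sock_fd, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        fstat(fd, &st);

        off_t off = 0;                        /* advanced by the kernel */
        ssize_t sent = sendfile(sock_fd, fd, &off, st.st_size);
        close(fd);
        return sent;
    }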

read() system call copies the data instead of passing a reference

The read() system call causes the kernel to copy the data instead of passing the buffer by reference. I was asked the reason for this in an interview. The best answers I could come up with were:
To avoid concurrent writes on the same buffer across multiple processes.
If the user-level process tries to access a buffer mapped to kernel virtual memory area it will result in a segfault.
As it turns out the interviewer was not entirely satisfied with either of these answers. I would greatly appreciate if anybody could elaborate on the above.
A zero copy implementation would mean the user level process would have to be given access to the buffers used internally by the kernel/driver for reading. The user would have to make an explicit call to the kernel to free the buffer after they were done with it.
Depending on the type of device being read from, the buffers could be more than just an area of memory. (For example, some devices could require the buffers to be in a specific area of memory. Or they could only support writing to a fixed area of memory given to them at startup.) In this case, failure of the user program to "free" those buffers (so that the device could write more data to them) could cause the device and/or its driver to stop functioning properly, something a user program should never be able to do.
The buffer is specified by the caller, so the only way to get the data there is to copy it. And the API is defined the way it is for historical reasons.
Note that your two points above are no problem for the alternative, mmap, which does pass the buffer by reference (writing to it then writes to the file, so you can't process the data in place, while many users of read do just that).
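A minimal sketch of the mmap alternative (data.bin is a placeholder; error handling abbreviated):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        if (fd < 0)
            return 1;

        struct stat st;
        fstat(fd, &st);

        /* The mapping aliases the kernel's page cache, so no copy into
           a caller-supplied buffer ever happens. */
        const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            /* ... inspect p[0 .. st.st_size - 1] in place ... */
            munmap((void *)p, st.st_size);
        }
        close(fd);
        return 0;
    }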
I might have been prepared to dispute the interviewer's assertion. The buffer in a read() call is supplied by the user process and therefore comes from the user address space. It's also not guaranteed to be aligned in any particular way with respect to page frames. That makes it tricky to do what is necessary to perform I/O directly into the buffer, i.e. map the buffer into the device driver's address space or wire it for DMA. However, in limited circumstances, this may be possible.
I seem to remember that the BSD subsystem Mac OS X uses to copy data between address spaces had an optimisation in this respect, although I may be completely mistaken.

types of buffers

Recently an interviewer asked me about the types of buffers. What types of buffers are there? Actually, this question came up when I said I would write all the system calls to a log file to monitor the system. He said it would be slow to write each and every call to a file, and asked how to prevent that. I said I would use a buffer, and he asked me what type of buffer. Can someone explain the types of buffers to me, please?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush(fh); fsync(fileno(fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it was a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered them with a single fprintf-type call when the buffer was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprintf-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ^           ^
                                 |           |
                   Buffer next --+           +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes.
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
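For concreteness, here is a minimal user-space C sketch of the same scheme (the buffer size and the output descriptor, stderr here, are arbitrary choices):

    #include <string.h>
    #include <unistd.h>

    #define LOGBUF_SIZE 10240

    static char   logbuf[LOGBUF_SIZE];
    static size_t logpos;                     /* the "buffer next" index */
    static int    logfd = 2;                  /* assumed log target (stderr) */

    static void flush_buffer(void)
    {
        if (logpos > 0) {
            write(logfd, logbuf, logpos);     /* one physical write */
            logpos = 0;
        }
    }

    static void add_to_buffer(const char *msg, size_t size)
    {
        while (size > 0) {
            size_t xfersz = LOGBUF_SIZE - logpos;
            if (xfersz > size)
                xfersz = size;
            memcpy(logbuf + logpos, msg, xfersz);
            msg    += xfersz;
            logpos += xfersz;
            size   -= xfersz;
            if (logpos == LOGBUF_SIZE)        /* full: flush, restart at front */
                flush_buffer();
        }
    }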
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case that is neither a FIFO nor a LIFO buffer is the TCP segment reassembly buffer. This may hold segments received out of order ("from the future"), which are kept pending the receipt of intermediate segments that have not yet arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
