Overhead of times() system call - relative to file operations - c

What is the relative overhead of calling times() versus file operations like reading a line with fread()?
I realize this likely differs from OS to OS and depends on how long the line is, where the file is located, if it's really a pipe that's blocked (it's not), etc.
Most likely the file is not local but is on a mounted NFS drive somewhere on the local network. The common case is a line that is 20 characters long. If it helps, assume Linux kernel 2.6.9. The code will not be run on Windows.
I'm just looking for a rough guide. Is it on the same order of magnitude? Faster? Slower?
Ultimate goal: I'm looking at implementing a progress callback routine, but don't want to call it too frequently (because the callback is likely very expensive). The majority of the work is reading a text file (line by line) and doing something with each line. Unfortunately, some of the lines are very long, so simply calling the callback every N lines isn't effective in the all-too-common pathological cases.
I'm avoiding writing a benchmark because I'm afraid of writing it wrong and am hoping the wisdom of the crowd is greater than my half-baked tests.

fread() is a C library function, not a system call. fread(), fwrite(), fgets() and friends are all buffered I/O by default (see setbuf), which means the library allocates a buffer that reduces how often the read() and write() system calls need to be made.
This means that if you're reading sequentially from the file, the library will only issue a system call every, say, 100 reads (subject to the buffer size and how much data you read at a time).
When the read() and write() system calls are made, however, they will definitely be slower than calling times(), simply due to the volume of data that needs to be exchanged between your program and the kernel. If the data is cached in the OS's buffers (e.g. it was written by another process on the same machine moments ago) then it will still be pretty fast. If the data is not cached, then you will have to wait for I/O (be it to the disk or over the network), which is very slow in comparison.
If the data is coming fresh over NFS, then I'd be pretty confident that calling times() will be faster than fread() on average.
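As a rough, hedged illustration of that buffering (the file name and buffer size here are invented examples, not anything from the question): give stdio a 64 KiB buffer and fgets() will only trigger a read() system call roughly once per 64 KiB consumed, which you can confirm with strace.
#include <stdio.h>

int main(void)
{
    FILE *fh = fopen("input.txt", "r");   /* hypothetical input file */
    if (!fh)
        return 1;

    /* Ask stdio for a 64 KiB buffer before any reads happen; fgets()
     * then hits the kernel only about once per 64 KiB of data. */
    setvbuf(fh, NULL, _IOFBF, 64 * 1024);

    char line[1024];
    while (fgets(line, sizeof line, fh) != NULL) {
        /* process the line */
    }

    fclose(fh);
    return 0;
}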

On Linux, you could write a little program that does lots of calls to times() and fread() and measure the syscall times with strace -c, e.g.:
for (i = 0; i < RUNS; i++) {
    times(&t_buf);
    fread(buf, 1, BUF, fh);
}
This is with BUF = 4096 (fread() will actually call read() every time):
# strace -c ./time_times
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.77    0.001988           0    100000           read
 40.23    0.001338           0     99999           times
and this is with BUF = 16:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.00    0.001387           0     99999           times
  1.00    0.000014           0       392           read

times() simply reads kernel-maintained, process-specific data. The kernel maintains this data anyway in order to supply information for the wait() system call when the process exits, so it is always kept up to date regardless of whether times() ever gets called. The extra overhead of calling times() is really low.
fread(), fwrite(), etc. call the underlying system calls read() and write(), which invoke drivers. The drivers then place data in a kernel buffer. This is far more costly in terms of resources than invoking times().
Is this what you are asking?
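To tie this back to the original goal: since times() is cheap, one pattern (a sketch only; report_progress() and the one-second threshold are invented placeholders) is to poll times() inside the read loop and only fire the expensive callback when enough clock ticks have elapsed, regardless of how long individual lines are.
#include <stdio.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical, expensive progress callback. */
void report_progress(long bytes_done);

void process_file(FILE *fh)
{
    struct tms tms_buf;
    long ticks_per_sec = sysconf(_SC_CLK_TCK);
    clock_t last_report = times(&tms_buf);
    long bytes_done = 0;
    char line[4096];

    while (fgets(line, sizeof line, fh) != NULL) {
        bytes_done += (long)strlen(line);
        /* ... do something with the line ... */

        clock_t now = times(&tms_buf);
        if (now - last_report >= ticks_per_sec) {   /* at most ~1 callback per second */
            report_progress(bytes_done);
            last_report = now;
        }
    }
}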

Related

fwrite consuming all MemFree, fflush not working?

We have a data capture system that is connected to a very fast 10 TB RAID 0 JBOD.
We receive 4 MiB data buffers at approximately 1.25 GB/s, which are written to a sequential file that was opened with fopen; 10 GiB is fallocate'd and then written with fwrite. Every 10 GiB we fflush, then fallocate another 10 GiB. Lastly, the file is closed with fclose after the capture is complete.
The problem is that while the capture is underway, we can see /proc/meminfo MemFree drop and Cached shoot up, i.e. the fflush seems to do nothing. This proceeds until we have about 200 MiB MemFree in the system, and then the data rate becomes extremely spiky, which causes our capture to fail.
We were hoping that the spikes would fall around the 10 GiB when we call fflush, but it just doesn't seem to do anything. The file isn't flushed until we call fclose.
Any reason for this behavior? Using setvbuf(hFile, NULL, _IONBF, 0) doesn't seem to have any effect either.
When you see your free memory drop, that's your OS writing to its buffer cache (essentially, all available memory). In addition, stdio's fwrite() is buffering on its own. Because of this, there's some resource contention going on. When your OS hits the upper limits of available memory, this resource contention causes slower writes and high memory utilization. The bottleneck causes you to miss data captures.
Since you are managing your own buffer, it would be possible to use write() with O_DIRECT to avoid all this buffering.
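A minimal sketch of that approach, with the caveat that the file name, error handling and alignment are simplifying assumptions: O_DIRECT generally requires the buffer, the file offset and the transfer size to be aligned to the device's logical block size.
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CAPTURE_BUF_SIZE (4 * 1024 * 1024)   /* 4 MiB buffers, as in the question */

int main(void)
{
    /* O_DIRECT bypasses the page cache entirely. */
    int fd = open("capture.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    void *buf;
    if (posix_memalign(&buf, 4096, CAPTURE_BUF_SIZE) != 0)   /* aligned buffer */
        return 1;

    /* ... fill buf with captured data, then write it straight through ... */
    if (write(fd, buf, CAPTURE_BUF_SIZE) != CAPTURE_BUF_SIZE)
        return 1;

    free(buf);
    close(fd);
    return 0;
}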

Handling mmap and fread for big files between processes

I have two processes:
Process A is mapping a large file (~170 GB, content constantly changes) into memory for writing with the flags MAP_NONBLOCK and MAP_SHARED:
MyDataType *myDataType = (MyDataType *)mmap(NULL, sizeof(MyDataType), PROT_WRITE, MAP_NONBLOCK | MAP_SHARED, fileDescriptor, 0);
and every second I call msync:
msync((void *)myDataType, sizeof(MyDataType), MS_ASYNC);
This section works fine.
The problem occurs when process B tries to read from the same file that process A has mapped; process A then does not respond for ~20 seconds.
Process B reads from the file something like 1000 times, using fread() and fseek(), in small blocks (~4 bytes each time).
Most of the locations the process reads are close to each other.
What is the cause of this problem? Is it related to page allocation? How can I solve it?
BTW, the same problem occurs when I use mmap() in process B instead of simple fread().
msync() is likely the problem. It forces the system to write to disk, blocking the kernel in a write frenzy.
In general on Linux (it's the same on Solaris BTW), it is a bad idea to use msync() too often. There is no need to call msync() for the synchronization of data between the memory map and the read()/write() I/O operations, this is a misconception that comes from obsolete HOWTOs. In reality, mmap() makes only the file system cache "visible" for a process. This means that the memory blocks the process changes are still under kernel control. Even if your process crashed, the changes would land on the disk eventually. Other processes would also still be serviced by the same buffer.
Here is another answer on the subject: mmap, msync and linux process termination
The interesting part is the link to a discussion on realworldtech where Linus Torvalds himself explains how buffer cache and memory mapping work.
PS: the fseek()/fread() pair is also probably better replaced by pread(). One system call is always better than two. Also, fseek()/fread() always reads 4K and copies it into a buffer, so if you make several small reads without an fseek() in between, it will read from its local buffer and may miss updates made by process A.
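A hedged sketch of that replacement (the 4-byte field and the function name are invented): one positioned read per lookup, and no stdio buffer sitting between process B and the file.
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* One pread() instead of an fseek()+fread() pair. */
ssize_t read_field(int fd, off_t offset, uint32_t *out)
{
    return pread(fd, out, sizeof *out, offset);
}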
This sounds like you are suffering from I/O starvation, which has nothing to do with the method (mmap or fread) you choose. You will have to improve your (pre-)caching strategy and/or try another I/O scheduler (cfq is the default; maybe deadline delivers better overall results for you).
You can change the scheduler by writing to /sys:
echo deadline > /sys/block/<device>/queue/scheduler
Maybe you should try profiling, or even using strace, to figure out for sure where the process is spending its time. 20 s seems like an awfully long time to be explained by I/O in msync().
When you say A doesn't respond, what exactly do you mean?

When does the actual write() take place in C?

What really happens when write() system call is executed?
Let's say I have a program which writes certain data into a file using the write() function call. Now the C library has its own internal buffer, and the OS has its own buffer too.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.
The write() system call (in fact, all system calls) is nothing more than a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have there own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.
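A small sketch of that contract for a regular file (the path and payload are placeholders): write() returns as soon as the data sits in a dirty kernel buffer, and fsync() is what forces it out to the disk.
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int save_record(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* Succeeds once the data is in a dirty kernel buffer, not on disk. */
    if (write(fd, data, strlen(data)) < 0) {
        close(fd);
        return -1;
    }

    /* Forces the dirty buffers (and metadata) for this file to the disk. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}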
As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to the disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something similar to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media, then you need to redesign your code to avoid this requirement, accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).
write() writes data to the operating system, making it visible to all processes (if it is something which can be read by other processes). How the operating system buffers it, or when it gets written permanently to disk, is very library-, OS-, system-configuration- and file-system-specific. However, sync() can be used to force the buffers to be flushed.
What is guaranteed is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.
This is OS-dependent; see man 2 sync and (on Linux) the discussion in man 8 sync.
Years ago, operating systems were supposed to implement an "elevator algorithm" to schedule writes to disk. The idea was to minimize the movement of the disk's write head, which allows good throughput for several processes accessing the disk at the same time.
Since you're asking about UNIX, you must keep in mind that a file might actually be on a mounted FTP server, for example. Likewise, files under /dev and /proc are not files on the HDD at all.
Also, on Linux data is not written to the hard drive directly; instead there is a background flushing process that writes out all pending dirty data every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

Concurrent writes to a file using multiple threads

I have a userlevel program which opens a file using the flags O_WRONLY|O_SYNC. The program creates 256 threads which attempt to write 256 or more bytes of data each to the file. I want to have a total of 1280000 requests, making it a total of about 300 MB of data. The program ends once 1280000 requests have been completed.
I use pthread_spin_trylock() to increment a variable which keeps track of the number of requests that have been completed. To ensure that each thread writes to a unique offset, I use pwrite() and calculate the offset as a function of the number of requests that have been written already. Hence, I don't use any mutex when actually writing to the file (does this approach ensure data integrity?)
When I compare the average time for which a pwrite() call was blocked against the corresponding numbers from blktrace (i.e., the average Q2C times, which measure the complete life cycle of a BIO), I find that there is a significant difference. In fact, the average completion time for a given BIO is much greater than the average latency of a pwrite() call. What is the reason behind this discrepancy? Shouldn't these numbers be similar, since O_SYNC ensures that the data is actually written to the physical medium before returning?
pwrite() is supposed to be atomic, so you should be safe there...
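Roughly what the non-overlapping-offset scheme looks like (names are illustrative, the spinlock is assumed to be initialized elsewhere with pthread_spin_init(), and plain pthread_spin_lock() stands in for the pthread_spin_trylock() loop in the question): because every request gets its own offset, no lock is needed around the pwrite() itself.
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define REQUEST_SIZE 256          /* bytes per request, as in the question */

static pthread_spinlock_t counter_lock;   /* initialized with pthread_spin_init() */
static long requests_done = 0;            /* protected by counter_lock */

/* Each thread claims the next request index, then writes to its own,
 * non-overlapping offset. */
void write_request(int fd, const char *payload)
{
    pthread_spin_lock(&counter_lock);
    long idx = requests_done++;
    pthread_spin_unlock(&counter_lock);

    off_t offset = (off_t)idx * REQUEST_SIZE;
    pwrite(fd, payload, REQUEST_SIZE, offset);
}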
In regards to the difference in latency between your syscall and the actual BIO, according to this information on the man-pages at kernel.org for open(2):
POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to userspace, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns.
So this basically implies that with the O_SYNC flag the entirety of the data you're attempting to write does not need to be flushed to disk before a syscall returns, but rather just enough information to be capable of retrieving it from disk ... depending on what you're writing, that could be quite a bit less than the entire buffer of data you were intending to write to disk, and therefore the actual writing of all the data will take place at a later time, after the syscall has been completed and the process has moved on to something else.
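For reference, a tiny sketch of how those flags are requested at open() time (the path is a placeholder; per the man page excerpt above, older glibc maps O_DSYNC to the same value as O_SYNC anyway):
#include <fcntl.h>

/* O_SYNC: file data plus all metadata must be on disk before write() returns.
 * O_DSYNC: file data plus only the metadata needed to read it back. */
int open_for_synced_writes(const char *path, int want_full_sync)
{
    int flags = O_WRONLY | O_CREAT | (want_full_sync ? O_SYNC : O_DSYNC);
    return open(path, flags, 0644);
}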

Why is the fwrite libc function faster than the write syscall?

I wrote the same program two ways: it reads a randomly generated input file and echoes the string it reads to an output. The only difference is that one version uses the read and write Linux syscalls, while the other uses fread/fwrite.
Timing my application with an input 10 MB in size, echoing it to /dev/null, and making sure the file is not cached, I've found that libc's fwrite is faster by a LARGE margin when using very small buffers (1 byte in this case).
Here is my output from time, using fwrite:
real 0m0.948s
user 0m0.780s
sys 0m0.012s
And using the syscall write:
real 0m8.607s
user 0m0.972s
sys 0m7.624s
The only possibility that I can think of is that internally libc is already buffering my input... Unfortunately I couldn't find that much information around the web, so maybe the gurus here could help me out.
fwrite works on streams, which are buffered. Therefore many small writes will be faster, because it won't run a costly system call until the buffer fills up (or you flush it or close the stream). On the other hand, small buffers being sent to write will run a costly system call for each buffer; that's where you're losing the speed. With a 1024-byte stream buffer, and writing 1-byte buffers, you're looking at 1024 write calls for each kilobyte, rather than 1024 fwrite calls turning into one write. See the difference?
For big buffers the difference will be small, because there will be less buffering, and therefore a more consistent number of system calls between fwrite and write.
In other words, fwrite(3) is just a library routine that collects up output into chunks, and then calls write(2). Now, write(2), is a system call which traps into the kernel. That's where the I/O actually happens. There is some overhead for simply calling into the kernel, and then there is the time it takes to actually write something. If you use large buffers, you will find that write(2) is faster because it eventually has to be called anyway, and if you are writing one or more times per fwrite then the fwrite buffering overhead is just that: more overhead.
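A toy illustration of the two paths (the names and the in-memory buffer are assumptions, not code from the question): for 10 MB of input, the first loop traps into the kernel roughly ten million times, while the second issues a write() only each time stdio's buffer fills up.
#include <stdio.h>
#include <unistd.h>

/* One kernel trap per byte: ~10 million write() calls for 10 MB. */
void copy_bytewise_syscall(int out_fd, const char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        write(out_fd, &data[i], 1);
}

/* stdio collects the bytes and calls write() only when its buffer
 * (typically a few KB) is full. */
void copy_bytewise_stdio(FILE *out, const char *data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        fwrite(&data[i], 1, 1, out);
    fflush(out);
}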
If you want to read more about it, you can have a look at this document, which explains standard I/O streams.
write(2) is the fundamental kernel operation.
fwrite(3) is a library function that adds buffering on top of write(2).
For small (e.g., line-at-a-time) byte counts, fwrite(3) is faster, because of the overhead for just doing a kernel call.
For large (block I/O) byte counts, write(2) is faster, because it doesn't bother with buffering and you have to call the kernel in both cases.
If you look at the source to cp(1), you won't see any buffering.
Finally, there is one last consideration: ISO C vs POSIX. Buffered library functions like fwrite are specified in ISO C, whereas kernel calls like write are POSIX. While many systems claim POSIX compatibility, especially when trying to qualify for government contracts, in practice it's specific to Unix-like systems. So the buffered ops are more portable. As a result, a Linux cp will certainly use write, but a C program that has to work cross-platform may have to use fwrite.
You can also disable buffering with the setbuf() function. When buffering is disabled, fwrite() will be as slow as write(), if not slower.
More information on this subject can be found here: http://www.gnu.org/s/libc/manual/html_node/Controlling-Buffering.html
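For instance (the stream is arbitrary), either of the calls below switches a stream to unbuffered mode, after which each fwrite() ends up issuing its own write():
#include <stdio.h>

void disable_buffering(FILE *out)
{
    /* Must be called before any I/O is done on the stream. */
    setbuf(out, NULL);
    /* or equivalently: setvbuf(out, NULL, _IONBF, 0); */
}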
