Win32 IO Performance Problem

Win32 IO Performance Problem - c

Recently I ran into a "fun" problem with the Microsoft implementation of the CRTL. tmpfile places temp files in the root directory and completely ignores the temp file directory. This has issues with users who do not have privileges to the root directory (say, on our cluster). Moreover, using _tempnam would require the application to remember to delete the temporary files, which it is unable to do without a considerable amount of rework.
Therefore I bit the bullet and wrote Win32 versions of all of the IO routines (create_temp, read, write, seek, flush) which call the appropriate method. One thing I've noticed is the now abysmal performance of the library.
Results from the test suite:
CRTL: 4:30.05 elapsed
Win32: 11:18.06 elapsed
Stats measured in my routines:
Writes: 3129934 ( 44,642,745,008 bytes)
Reads: 935903 ( 8,183,423,744 bytes)
Seeks: 2205757 (2,043,782,657,968 bytes traveled)
Flushes: 92442
Example of a CRTL v. Win32 method:
int io_write(FILE_POINTER fp, size_t words, const void *buffer)
{
#if !defined(USE_WIN32_IO)
{
size_t words_written = 0;
/* read the data */
words_written = fwrite(buffer, sizeof(uint32_t), words, fp);
if (words_written != words)
{
return errno;
}
}
#else /* !defined(USE_WIN32_IO) */
{
DWORD bytesWritten;
if (!WriteFile(fp, buffer, words * sizeof(uint32_t), &bytesWritten, NULL)
|| (bytesWritten != words * sizeof(uint32_t)))
{
return GetLastError();
}
}
#endif /* USE_WIN32_IO */
return E_SUCCESS;
}
As you can see, they are effectively identical, yet the performance (in release mode) is wildly divergent. Time spent in WriteFile and SetFilePointer dwarf the time spent in fwrite and fseeko, which seems counterintuitive.
Ideas?
UPDATE: perfmon notes that fflush is about 10x cheaper than FlushFileBuffers and fwrite is ~1.1x slower than WriteFile. The net result is a huge performance loss with FlushFileBuffers used in the same manner as fflush. There is no change from FILE_ATTRIBUTE_NORMAL to FILE_FLAG_RANDOM_ACCESS either.

I think it's probably due to this issue, described on MSDN's page for FlushFileBuffers:
Due to disk caching interactions
within the system, the
FlushFileBuffers function can be
inefficient when used after every
write to a disk drive device when many
writes are being performed separately.
If an application is performing
multiple writes to disk and also needs
to ensure critical data is written to
persistent media, the application
should use unbuffered I/O instead of
frequently calling FlushFileBuffers.
To open a file for unbuffered I/O,
call the CreateFile function with the
FILE_FLAG_NO_BUFFERING and
FILE_FLAG_WRITE_THROUGH flags. This
prevents the file contents from being
cached and flushes the metadata to
disk with each write. For more
information, see CreateFile.
In general, FlushFileBuffers is an "expensive" operation, since it flushes everything in the write-back cache:
FlushFileBuffers(): This function will flush everything in the write-back cache, as it
does not know what part of the cache belongs to your file. This can take a lot of time,
depending on the cache size and the speed of the media. How necessary is it? There is
a thread which goes through and writes out dirty pages, so it is likely not very
necessary.
I presume that fflush does not flush the entire write-back cache. In that case, it's much more efficient, but that efficiency comes at the risk of potential data loss. The CRT's source code for fflush confirms this, since _commit calls FlushFileBuffers:
/* lowio commit to ensure data is written to disk */
if (str->_flag & _IOCOMMIT) {
return (_commit(_fileno(str)) ? EOF : 0);
}
From the implementation of _commit:
if ( !FlushFileBuffers((HANDLE)_get_osfhandle(filedes)) ) {
retval = GetLastError();
}

Traditionally, the C runtime library functions buffer the data and only trigger the write operation (hence the need for functions like fflush). I don't think that WriteFile buffers the write operation so every time you call WriteFile, an I/O operation gets triggered whereas with fwrite, the I/O gets triggered when the buffer has reached a certain size.
As you can see from your measurements, the buffered I/O tends to be more efficient...

I might be crazy, but wouldn't it be easier to just write a replacement for tmpfile that uses fopen(temporaryname, "wbTD+"), where you generate your own temporaryname?
At least then you don't have to worry about reimplementing <file.h>.

I'm still a little unclear on what the question is. You start out by talking about managing the lifetime of a temporary file and then jump to wrapping an entire file i/o interface. Are you asking about how to manage a temporary file without the performance penalty of wrapping all the file I/O? Or are you interested in how the CRT functions can be faster than the WinAPI functions they are built on top of?
Several of the comparisons being made between the C run-time functions and the WinAPi functions are of the apples and oranges variety.
The C run-time functions buffer the I/O in library memory. There is another layer of buffering (and caching) in the OS.
fflush flushes the data from the library buffers to the OS. It may go directly to disk, or it may go to OS buffers for later writing. FlushFileBuffers gets data from the OS buffers onto the disk, which generally takes longer than moving data from the library buffers to the OS buffers.
Unaligned writes are expensive. The OS buffers make unaligned writes possible, but they don't really speed up the process. The library buffers may accept several writes before pushing data to the OS, effectively reducing the number of unaligned writes to the disk.
It's also possible (though this is just a guess) that the library routines are taking advantage of overlapped (asynchronous) I/O to the disk, where your straight-to-WinAPI implementation is all synchronous.

Related

Understanding low level file routines

I am going through Mark Burgess's "The GNU C Programming Tutorial". I have come across the following information:
Even though low-level fle routines do not use buﬀering, and once you call write, your data can be read from the file immediately, it may take up to a minute before your data is physically written to disk. (Page:142)
Firstly, is "it may take up to a minute(some time) before your data is written to disk" true?
Secondly, when low level file routines are not using buffering why will the delay take place?

There are two places where I/O buffering can occur (at least — it could be more than just two).
One is in the application; the standard I/O functions using FILE * use buffered I/O unless you use setvbuf() to prevent it.
The other is in the kernel. Disk I/O normally goes into the kernel buffer pool, and eventually gets written by the kernel to disk. There are ways around that (O_DIRECT on Linux; raw devices on classic Unix; etc). The key point is that the write() system call normally writes to he kernel buffer pool. The kernel takes responsibility for ensuring that the data is written to disk safely and correctly (journalling, …).
The kernel doesn't write everything to disk immediately because (a) you may add more changes to the data, (b) other people may need to read or write the data, (c) the disk drive may be busy writing something else at the other end of its 1 TiB of storage and it will take time to get the write head in position to take your data, and it would be better for the overall performance of the system if it scheduled other work before writing your changed buffer to disk. It will get written to disk. It is just not defined when, and it could be fractions of a second or multiple seconds or longer, though most often it will not take minutes for the data to be written to disk.
These days, there could also be buffering in the RAID controllers, and maybe in the individual disks inside the RAID setup, and maybe there's network buffering too if it is a remotely-mounted file system. Those add extra levels of buffering.
The read() and write() and related low-level I/O functions do not have any client-side (application) buffering — unlike the standard C I/O functions.

A file is said to be buffered, when its contents are not outputted or inputted directly. Instead, the file's bytes are written to a temporary buffer in memory.
For example, if you are reading from a file, you are reading from the buffer. Once you have read all the characters in the buffer, it is replenished with new bytes from the file. The reason for this indirectness, is that a memory read is much faster than a hard disk read.
The calls read and write are low-level, and do not perform buffering. The stdio.h calls like getc and putc, do use buffering. These higher-level APIs only call the low level ones, when the buffer must be replenished.

Writing to the hard drive is much slower than writing to RAM. When you write to a drive it writes to memory, but doesn't always write to the disk immediately. The data might not be written to disk until that part of memory needs to be overwritten to make room for something else. This is called a Write-Back cache.

When does actual write() takes place in C?

What really happens when write() system call is executed?
Lets say I have a program which writes certain data into a file using write() function call. Now C library has its own internal buffer and OS too has its own buffer.
What interaction takes place between these buffers ?
Is it like when C library buffer gets filled completely, it writes to OS buffer and when OS buffer gets filled completely, then the actual write is done on the file?
I am looking for some detailed answers, useful links would also help. Consider this question for a UNIX system.

The write() system call (in fact all system calls) are nothing more that a contract between the application program and the OS.
for "normal" files, the write() only puts the data on a buffer, and marks that buffer as "dirty"
at some time in the future, these dirty buffers will be collected and actually written to disk. This can be forced by fsync()
this is done by the .write() "method" in the mounted-filesystem-table
and this will invoke the hardware's .write() method. (which could involve another level of buffering, such as DMA)
modern hard disks have there own buffers, which may or may not have actually been written to the physical disk, even if the OS->controller told them to.
Now, some (abnormal) files don't have a write() method to support them. Imagine open()ing "/dev/null", and write()ing a buffer to it. The system could choose not to buffer it, since it will never be written anyway.
Also note that the behaviour of write() does depend on the nature of the file; for network sockets the write(fd,buff,size) can return before size bytes have been sent(write will return the number of characters sent). But it is impossible to find out where they are once they have been sent. They could still be in a network buffer (eg waiting for Nagle ...), or a buffer inside the network interface, or a buffer in a router or switch somewhere on the wire.

As far as I know...
The write() function is a lower level thing where the library doesn't buffer data (unlike fwrite() where the library does/may buffer data).
Despite that, the only guarantee is that the OS transfers the data to disk drive before the next fsync() completes. However, hard disk drives usually have their own internal buffers that are (sometimes) beyond the OS's control, so even if a subsequent fsync() has completed it's possible for a power failure or something to occur before the data is actually written from the disk drive's internal buffer to the disk's physical media.
Essentially, if you really must make sure that your data is actually written to the disk's physical media; then you need to redesign your code to avoid this requirement, or accept a (small) risk of failure, or ensure the hardware is capable of it (e.g. get a UPS).

write() writes data to operating system, making it visible for all processes (if it is something which can be read by other processes). How operating system buffers it, or when it gets written permanently to disk, that is very library, OS, system configuration and file system specific. However, sync() can be used to force buffers to be flushed.
What is quaranteed, is that POSIX requires that, on a POSIX-compliant file system, a read() which can be proved to occur after a write() has returned must return the written data.

OS dependant, see man 2 sync and (on Linux) the discussion in man 8 sync.

Years ago operating systems were supposed to implement an 'elevator algorithm' to schedule writes to disk. The idea would be to minimize the disk writing head movement, which would allow a good throughput for several processes accessing the disk at the same time.

Since you're asking for UNIX, you must keep in mind that a file might actually be on an FTP server, which you have mounted, as an example. For example files /dev and /proc are not files on the HDD, as well.
Also, on Linux data is not written to the hard drive directly, instead there is a polling process, that flushes all pending writes every so often.
But again, those are implementation details, that really don't affect anything from the point of view of your program.

Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?

Overview
I have a program bounded significantly by IO and am trying to speed it up.
Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.
Some demo code
I've squeezed down demos to just the essentials, testing against an 800mb file with about 3.5million lines:
With fgets:
char buf[4096];
FILE * fp = fopen(argv[1], "r");
while(fgets(buf, 4096, fp) != 0) {
// do stuff
}
fclose(fp);
return 0;
Runtime for 800mb file:
[juhani#xtest tests]$ time ./readfile /r/40/13479/14960
real 0m25.614s
user 0m0.192s
sys 0m0.124s
The mmap version:
struct stat finfo;
int fh, len;
char * mem;
char * row, *end;
if(stat(argv[1], &finfo) == -1) return 0;
if((fh = open(argv[1], O_RDONLY)) == -1) return 0;
mem = (char*)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if(mem == (char*)-1) return 0;
madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
while((end = strchr(row, '\n')) != 0) {
// do stuff
row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);
Runtime varies quite a bit, but never faster than fgets:
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m28.891s
user 0m0.252s
sys 0m0.732s
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m42.605s
user 0m0.144s
sys 0m0.472s
Other notes
Watching the process run in top, the memmapped version generated a few thousand page faults along the way.
CPU and memory usage are both very low for the fgets version.
Questions
Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching that mmap with madvise POSIX_MADV_SEQUENTIAL?
Is there an alternative method of possibly making this faster(Other than on-the-fly compression/decompression to shift IO load to the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be the case.

POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.
The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entierly, whereas fgets has the IO entirely done in kernel space and just copies the pages into a buffer that doesn't change.
This also has more potential for overlap, since the IO is done by some kernel thread.
You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.

Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:
MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.
Edit:
First of all the benchmarks you make should be done with the page cache flushed to get meaningful results:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Additionally: The MADV_WILLNEED advice with madvise will pre-fault the required pages in (same as the POSIX_FADV_WILLNEED with fadvise). Currently unfortunately these calls block until the requested pages are faulted in, even if the documentation tells differently. But there are kernel patches underway which queue the pre-fault requests into a kernel work-queue to make these calls asynchronous as one would expect - making a separate read-ahead user space thread obsolete.

What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. with mmap, the OS only lazily loads pages of the mmap'd data into memory (loads them when you access them). So this approach is an optimization. Although you interface with mmap as if the entire thing is in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.
In contrast, when you do a read of a file into a buffer the OS pulls the entire structure into RAM (into your buffer). This can apply alot of memory pressure, crowding out other pages, forcing them to be written back to disk. It can lead to thrashing if you're low on memory.
A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all the mmap's pages into memory; triggering all these page faults. This is an optimization technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to lead the pages in prior to your main threads access - just make sure you don't run out of RAM or get too far ahead of the main thread, you'll actually begin to degrade performance.
What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.
Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging.
With mmap, you're guaranteed demand-paging; with read() you are not.
Also bear in mind that page-walking in a full mmap memory space is always about 60% slower than a flat out read (not counting if you use MADV_SEQUENTIAL or other optimizations).
One note when using mmap w/ MADV_SEQUENTIAL - when you use this, you must be absolutely sure your data IS stored sequentially, otherwise this will actually slow down the paging in of the file by about 10x. Usually your data is not mapped to a continuous section of the disk, it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.
Remember, too much data in RAM will pollute the RAM, making page faults alot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with todays SSDs. Therefor, when program execution speed is a concern, memory footprint and utilization is far more important.
A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.
Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.
So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.
Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. Large, bulk reads are more efficient than calling read() repeatedly a million times (which is what fgets does). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache & other things), & the system also has some internal buffering it uses. But you do want to be reading into a buffer of... lets say 64k in size. You definitely dont want to be calling read line by line.
You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks - so find the size of the cache block, get your threads working on different portions of the buffer distanced by at least the cache block size.
Some more specific suggestions for your particular problem:
You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite alot of characters to loop through... it's probably ~150 million character comparisons that you are doing.
If you can sort the file by line length prior to the program running... you can write an algorithm to much more quickly parse the lines - just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do.
You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.
You can take this a step further - after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines - jump right to the end of each line w/out having to do character comparisons.
If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline. 3.5 million offsets.
Write algorithms to update that list on insertion/deletion of lines from the file
When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.
A few more things:
When you use this streaming approach with read() you must take care to handle the edge cases - where you reach the end of one buffer, and start a new buffer - appropriately. That's called buffer-stitching.
Lastly, on most modern systems when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. You can disable the buffer cache to speed up the IO in certain cases where you're handling big files. Beware, this will disable paging. But if the data is only in memory for a brief time, this doesn't matter.
The buffer cache is important - find a way to reenable it after the IO was finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done.
I don't think that's actually your problem, though, tbh I think the character comparisons - once you fix that it should just be fine.
That's the best I've got, maybe the experts will have other ideas.
Carry onward!

Probing for filesystem block size

I'm going to first admit that this is for a class project, since it will be pretty obvious. We are supposed to do reads to probe for the block size of the filesystem. My problem is that the time taken to do this appears to be linearly increasing, with no steps like I would expect.
I am timing the read like this:
double startTime = getticks();
read = fread(x, 1, toRead, fp);
double endTime = getticks();
where getticks uses rdtsc instructions. I am afraid there is caching/prefetching that is causing the reads to not take time during the fread. I tried creating a random file between each execution my program, but that is not alleviating my problem.
What is the best way to accurately measure the time taken for a read from disk? I am pretty sure my block size is 4096, but how can I get data to support that?

The usual way of determining filesystem block size is to ask the filesystem what its blocksize is.
#include <sys/statvfs.h>
#include <stdio.h>
int main() {
struct statvfs fs_stat;
statvfs(".", &fs_stat);
printf("%lu\n", fs_stat.f_bsize);
}
But if you really want, open(…,…|O_DIRECT) or posix_fadvise(…,…,…,POSIX_FADV_DONTNEED) will try to let you bypass the kernel's buffer cache (not guaranteed).

You may want to use the system calls (open(), read(), write(), ...)
directly to reduce the impact of the buffering done by the FILE* stuff.
Also, you may want to use synchronous I/O somehow.
One ways is opening the file with the O_SYNC flag set
(or O_DIRECT as per ephemient's reply).
Quoting the Linux open(2) manual page:
O_SYNC The file is opened for synchronous I/O. Any write(2)s on the
resulting file descriptor will block the calling process until
the data has been physically written to the underlying hardware.
But see NOTES below.
Another options would be mounting the filesystem with -o sync (see mount(8)) or setting the S attribute on the file using the chattr(1) command.

Why the fwrite libc function is faster than the syscall write function?

After providing the same program which reads a random generated input file and echoes the same string it read to an output. The only difference is that on one side I'm providing the read and write methods from linux syscalls, and on the other side I'm using fread/fwrite.
Timing my application with an input of 10Mb in size and echoing it to /dev/null, and making sure the file is not cached, I've found that libc's fwrite is faster by a LARGE scale when using very small buffers (1 byte in case).
Here is my output from time, using fwrite:
real 0m0.948s
user 0m0.780s
sys 0m0.012s
And using the syscall write:
real 0m8.607s
user 0m0.972s
sys 0m7.624s
The only possibility that I can think of is that internally libc is already buffering my input... Unfortunately I couldn't find that much information around the web, so maybe the gurus here could help me out.

Timing my application with an input of
10Mb in size and echoing it to
/dev/null, and making sure the file in
not cached, I've found that libc's
frwite is faster by a LARGE scale when
using very small buffers (1 byte in
case).
fwrite works on streams, which are buffered. Therefore many small buffers will be faster because it won't run a costly system call until the buffer fills up (or you flush it or close the stream). On the other hand, small buffers being sent to write will run a costly system call for each buffer - that's where you're losing the speed. With a 1024 byte stream buffer, and writing 1 byte buffers, you're looking at 1024 write calls for each kilobyte, rather than 1024 fwrite calls turning into one write - see the difference?
For big buffers the difference will be small, because there will be less buffering, and therefore a more consistent number of system calls between fwrite and write.
In other words, fwrite(3) is just a library routine that collects up output into chunks, and then calls write(2). Now, write(2), is a system call which traps into the kernel. That's where the I/O actually happens. There is some overhead for simply calling into the kernel, and then there is the time it takes to actually write something. If you use large buffers, you will find that write(2) is faster because it eventually has to be called anyway, and if you are writing one or more times per fwrite then the fwrite buffering overhead is just that: more overhead.
If you want to read more about it, you can have a look at this document, which explains standard I/O streams.

write(2) is the fundamental kernel operation.
fwrite(3) is a library function that adds buffering on top of write(2).
For small (e.g., line-at-a-time) byte counts, fwrite(3) is faster, because of the overhead for just doing a kernel call.
For large (block I/O) byte counts, write(2) is faster, because it doesn't bother with buffering and you have to call the kernel in both cases.
If you look at the source to cp(1), you won't see any buffering.
Finally, there is one last consideration: ISO C vs Posix. The buffered library functions like fwrite are specified in ISO C whereas kernel calls like write are Posix. While many systems claim Posix-compatibility, especially when trying to qualify for government contracts, in practice it's specific to Unix-like systems. So, the buffered ops are more portable. As a result, a Linux cp will certainly use write but a C program that has to work cross-platform may have to use fwrite.

You can also disable buffering with setbuf() function. When the buffering is disabled, fwrite() will be as slow as write() if not slower.
More information on this subject can be found there: http://www.gnu.org/s/libc/manual/html_node/Controlling-Buffering.html