strace shows different results for the Linux write() system call on different buffers - c

I am trying to write a data buffer into a file using the write() system call on Linux; here is the user-space code I wrote.
memset (dataBuffer, 'F', FILESIZE);
fp = open(fileName, O_WRONLY | O_CREAT, 0644);
write (fp, dataBuffer, FILESIZE);
I tried two types of dataBuffer: one from malloc(), the other from mmap().
I then used strace to watch what the kernel does with these two kinds of buffer. Most of the trace is identical, but the write() calls look different.
buffer from malloc()
[pid 258] open("/mnt/mtd/mmc/block/DATA10", O_WRONLY|O_CREAT, 0644) = 3
[pid 258] write(3, "FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF"..., 691200) = 691200
buffer from mmap()
[pid 262] open("/mnt/mtd/mmc/block/DATA10", O_WRONLY|O_CREAT, 0644) = 4
[pid 262] write(4, 0x76557000, 691200) = 691200
As you can see above, the buffer argument of write() differs: one is displayed as "FFFFF..." just as I memset it, while the other is shown as a memory address.
The first argument also differs: one is 3, the other is 4.
Also, on my system the write() from the malloc() buffer is faster than from the mmap() one.
Why are they different? What causes this difference?
Update: how did I measure that malloc() is faster?
I traced deep inside write() in the kernel and found that its last step is iov_iter_copy_from_user_atomic(), which I believe is the actual memory-copy operation.
I then used gettimeofday() to measure how long iov_iter_copy_from_user_atomic() takes for the malloc and mmap buffers.

I think strace is checking whether there's a terminating 0 byte at the end of the buffer. If there is, it assumes the data is a string and displays it in quotes; if not, it just shows the address.
I can't think of a reason why the buffer from malloc() would be faster than the one from mmap(). The memset() call should fault both buffers into RAM, so write() shouldn't have to wait for anything to be paged in.

Related

read(fd, buf, N>0) == 0, but fd not at EOF?

The following little C program (let's call it pointless):
/* pointless.c */
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    write(STDOUT_FILENO, "", 0); /* pointless write() of 0 bytes */
    sleep(1);
    write(STDOUT_FILENO, "still there!\n", 13);
}
will print "still there!" after a small delay, as expected. However,
rlwrap ./pointless prints nothing under AIX and exits immediately.
Apparently, rlwrap reads 0 bytes after the first write() and
(incorrectly) decides that pointless has called it quits.
When running pointless without rlwrap, and with rlwrap on all
other systems I could lay my hands on (Linux, OSX, FreeBSD), the "still
there!" gets printed, as expected.
The relevant rlwrap (pseudo-)code is this:
/* master is the file descriptor of the master end of a pty, while the slave is 'pointless's stdout */
/* master was opened with O_NDELAY */
while (pselect(nfds, &readfds, .....)) {
    if (FD_ISSET(master, &readfds)) {            /* master is "ready" for reading   */
        nread = read(master, buf, BUFFSIZE - 1); /* so try to read a buffer's worth */
        if (nread == 0)                          /* 0 bytes read...                 */
            cleanup_and_exit();                  /* ... usually means EOF, doesn't it? */
Apparently, on all systems except AIX, writing 0 bytes on the
slave end of a pty is a no-op, while on AIX it wakes up the
select() on the master end. Writing 0 bytes seems pointless, but one
of my test programs writes random-length chunks of text, which may
actually happen to have length 0.
On Linux, man 2 read states "on success, the number of bytes read is
returned (zero indicates end of file)" (italics are mine). This
question has come up before, but without mention of this scenario.
This raises the question: how can I portably determine whether the
slave end has been closed? (In this case I can probably just wait for
a SIGCHLD and then close shop, but that might open another can of
worms I'd rather avoid.)
Edit: POSIX states:
Writing a zero-length buffer (nbyte is 0) to a STREAMS device sends 0 bytes with 0 returned. However, writing a zero-length buffer to a STREAMS-based pipe or FIFO sends no message and 0 is returned. The process may issue I_SWROPT ioctl() to enable zero-length messages to be sent across the pipe or FIFO.
On AIX, a pty is indeed a STREAMS device and, moreover, not a pipe or FIFO. ioctl(STDOUT_FILENO, I_SWROPT, 0) seems to make it possible to bring the pty in line with the rest of the Unix world. The sad thing is that this has to be called from the slave side, and so is outside rlwrap's sphere of influence (even though we could call the ioctl() between fork() and exec(), that would not guarantee that the executed command won't change it back).
Per POSIX:
When attempting to read from an empty pipe or FIFO:
If no process has the pipe open for writing, read() shall return 0 to indicate end-of-file.
So the "read of zero bytes means EOF" is POSIX-compliant.
On the write() side (bolding mine):
Before any action described below is taken, and if nbyte is zero and the file is a regular file, the write() function may detect and return errors as described below. In the absence of errors, or if error detection is not performed, the write() function shall return zero and have no other results. If nbyte is zero and the file is not a regular file, the results are unspecified.
Unfortunately, that means you can't portably depend on a write() of zero bytes to have no effect because AIX is compliant with the POSIX standard for write() here.
You probably have to rely on SIGCHLD.
From the Linux man page:
If count is zero and fd refers to a regular file, then write()
may return a failure status if one of the errors below is
detected. If no errors are detected, or error detection is not
performed, 0 will be returned without causing any other effect.
If count is zero and fd refers to a file other than a regular
file, the results are not specified.
So, since it is unspecified, it can do whatever it likes in your case.

Thread Safety of Reading a File

So my end goal is to allow multiple threads to read the same file from start to finish. For example, if the file was 200 bytes:
Thread A 0-> 200 bytes
Thread B 0-> 200 bytes
Thread C 0-> 200 bytes
etc.
Basically have each thread read the entire file. The software is only reading that file, no writing.
so I open the file:
fd = open(filename, O_RDWR|O_SYNC, 0);
and then in each thread simply loop over the file. Because I create only one file descriptor, I also create a clone of the file descriptor in each thread using dup().
Here is a minimal example of a thread function:
void ThreadFunction() {
    int file_desc = dup(fd);
    uint32_t nReadBuffer[1000];
    int numBytes = -1;

    while (numBytes != 0) {
        numBytes = read(file_desc, nReadBuffer, sizeof(nReadBuffer));
        // processing on the bytes goes here
    }
}
However, I'm not sure this correctly loops through the entire file in each thread; it seems the threads are instead somehow daisy-chaining through the file.
Is this approach correct? I inherited this software for a project I am working on; the file descriptor also gets used in an mmap() call, so I am not entirely sure whether O_RDWR or O_SYNC matter.
As other folks have mentioned, a duplicated file descriptor won't work here: dup() clones the descriptor, but both copies still share a single open file description, and therefore a single file offset. However, there is a thread-safe alternative, which is to use pread(). pread() reads a file at an explicit offset and doesn't change the implicit offset in the file description.
This does mean that you have to manually manage the offset in each thread, but that shouldn't be too much of a problem with your proposed function.

Disk write does not work with malloc in C

I am writing to a raw disk using C code.
First I tried with malloc() and found that write() did not work (it returned -1):
fd = open("/dev/sdb", O_DIRECT | O_SYNC | O_RDWR);
void *buff = malloc(512);
lseek(fd, 0, SEEK_SET);
write(fd, buff, 512);
Then I changed the second line with this and it worked:
void *buff;
posix_memalign(&buff,512,512);
However, when I changed the lseek offset to 1 (lseek(fd, 1, SEEK_SET);), write() failed again.
First, why didn't malloc() work?
Second, I know that in my case posix_memalign() guarantees that the start address of the buffer is a multiple of 512. But shouldn't memory alignment and write() be separate concerns? Why can't I write at any offset I want?
From the Linux man page for open(2):
The O_DIRECT flag may impose alignment restrictions on the length
and address of user-space buffers and the file offset of I/Os.
And:
Under Linux 2.4, transfer sizes, and the alignment of the user
buffer and the file offset must all be multiples of the logical block
size of the filesystem. Under Linux 2.6, alignment to 512-byte
boundaries suffices.
The meaning of O_DIRECT is to "try to minimize cache effects of the I/O to and from this file", and if I understand it correctly it means that the kernel should copy directly from the user-space buffer, thus perhaps requiring stricter alignment of the data.
It may not be spelled out in the documentation, but it's quite possible that writes and reads to/from a block device are required to be aligned and to cover entire blocks in order to succeed (this would explain why you get failures in your first and last cases but not in the second). On Linux, the documentation of open(2) basically says this:
The O_DIRECT flag may impose alignment restrictions on the length and
address of user-space buffers and the file offset of I/Os. In Linux
alignment restrictions vary by file system and kernel version and
might be absent entirely. However there is currently no file
system-independent interface for an application to discover these
restrictions for a given file or file system. Some file systems
provide their own interfaces for doing so, for example the
XFS_IOC_DIOINFO operation in xfsctl(3).
Your code lacks error handling. Every line contains a function that may fail, and open, lseek and write also report the cause of the error in errno. With some error handling it would be:
fd = open("/dev/sdb", O_DIRECT | O_SYNC | O_RDWR);
if (fd == -1) {
    perror("open failed");
    return;
}
void *buff = malloc(512);
if (!buff) {
    printf("malloc failed");
    return;
}
if (lseek(fd, 0, SEEK_SET) == (off_t)-1) {
    perror("lseek failed");
    free(buff);
    return;
}
if (write(fd, buff, 512) == -1) {
    perror("write failed");
    free(buff);
    return;
}
In that case you would at least get a more detailed explanation of what goes wrong; here I suspect you would get EIO (Input/output error) from the write() call.
Note that the above may still not be complete error handling, as perror and printf can themselves fail (and you might want to do something about that possibility).

Copying files using memory map

I want to implement an efficient file-copying technique in C for my process, which runs on BSD. As of now the functionality is implemented with a read/write loop, and I am trying to optimize it using a memory-mapped copy.
Basically I fork a process which mmaps both the src and dst files and memcpy()s the specified bytes from src to dst; the process exits after memcpy() returns. Is msync() required here? When I actually called msync() with the MS_SYNC flag, it took a long time to return, and I see the same behavior with the MS_ASYNC flag.
i) So, to summarize: is it safe to omit msync()?
ii) Is there any better way of copying files on BSD? BSD does not seem to support sendfile() or splice(); are there any equivalents?
iii) Is there any simple method for implementing our own zero-copy-like technique for this requirement?
My code
/* mmcopy.c
Copy the contents of one file to another file, using memory mappings.
Usage mmcopy source-file dest-file
*/
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
{
    char *src, *dst;
    int fdSrc, fdDst;
    struct stat sb;

    if (argc != 3)
        usageErr("%s source-file dest-file\n", argv[0]);

    fdSrc = open(argv[1], O_RDONLY);
    if (fdSrc == -1)
        errExit("open");

    /* Use fstat() to obtain size of file: we use this to specify the
       size of the two mappings */
    if (fstat(fdSrc, &sb) == -1)
        errExit("fstat");

    /* Handle zero-length file specially, since specifying a size of
       zero to mmap() will fail with the error EINVAL */
    if (sb.st_size == 0)
        exit(EXIT_SUCCESS);

    src = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fdSrc, 0);
    if (src == MAP_FAILED)
        errExit("mmap");

    fdDst = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fdDst == -1)
        errExit("open");

    if (ftruncate(fdDst, sb.st_size) == -1)
        errExit("ftruncate");

    dst = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fdDst, 0);
    if (dst == MAP_FAILED)
        errExit("mmap");

    memcpy(dst, src, sb.st_size); /* Copy bytes between mappings */

    if (msync(dst, sb.st_size, MS_SYNC) == -1)
        errExit("msync");

    exit(EXIT_SUCCESS);
}
Short answer: msync() is not required.
When you do not call msync(), the operating system flushes the memory-mapped pages in the background after the process has terminated. This behavior is reliable on any POSIX-compliant operating system.
To answer the secondary questions:
Typically, a file is copied on a POSIX-compliant operating system (such as BSD) using open() / read() / write() and a buffer of some size (16 KB, 32 KB, or 64 KB, for example): read data into the buffer from src, write the data from the buffer into dest, and repeat until read(src_fd) returns 0 bytes (EOF).
However, depending on your goals, using mmap() to copy a file in this fashion may be a perfectly viable solution, so long as the files being copied are relatively small (relative to the expected memory constraints of your target hardware and your application). The mmap copy requires roughly 2x the file's size in physical memory: to copy an 8 MB file, your application will use 16 MB. If you expect to work with even larger files, that duplication becomes very costly.
So does using mmap() have other advantages? Actually, no.
The OS will often be much slower about flushing mmap'ed pages than writing data directly to a file using write(). This is because the OS intentionally prioritizes other things ahead of page flushes, so as to keep the system 'responsive' for foreground tasks/apps.
While the mmap'ed pages are being flushed to disk in the background, a sudden loss of power will cause data loss. Of course this can happen when using write() as well, but since write() finishes sooner, there is less opportunity for unexpected interruption.
The long delay you observe when calling msync() is roughly the time it takes the OS to flush your copied file to disk. When you don't call msync(), the flush happens in the background instead (and takes even longer for that very reason).

Why does fopen/fgets use both mmap and read system calls to access the data?

I have a small example program which simply fopens a file and uses fgets to read it. Using strace, I notice that the first call to fgets triggers an mmap system call, after which read system calls are used to actually read the contents of the file; on fclose, the region is munmapped. If I instead open and read the file with open/read directly, this obviously does not occur. I'm curious what the purpose of this mmap is and what it accomplishes.
On my Linux 2.6.31 based system, when under heavy virtual memory demand these mmaps will sometimes hang for several seconds, and appear to me to be unnecessary.
The example code:
#include <stdlib.h>
#include <stdio.h>
int main()
{
    FILE *f;
    if (NULL == (f = fopen("foo.txt", "r")))
    {
        printf("Fail to open\n");
    }
    char buf[256];
    fgets(buf, 256, f);
    fclose(f);
}
And here is the relevant strace output when the above code is run:
open("foo.txt", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb8039000
read(3, "foo\nbar\n\n"..., 4096) = 9
close(3) = 0
munmap(0xb8039000, 4096) = 0
It's not the file that is mmap'ed: in this case mmap is used anonymously (not on a file), probably to allocate memory for the buffer that the subsequent reads will use.
malloc can in fact result in such a call to mmap. Similarly, the munmap corresponds to a call to free.
The mmap is not mapping the file; instead it's allocating memory for the stdio FILE buffering. Normally malloc would not use mmap to service such a small allocation, but it seems glibc's stdio implementation is using mmap directly to get the buffer. This is probably to ensure it's page-aligned (though posix_memalign could achieve the same thing) and/or to make sure closing the file returns the buffer memory to the kernel. I question the usefulness of page-aligning the buffer. Presumably it's for performance, but I can't see any way it would help unless the file offset you're reading from is also page-aligned, and even then it seems like a dubious micro-optimization.
From what I have read, memory-mapping functions are useful when handling large files (though I have no precise definition of "large"); for such files they are significantly faster than the "buffered" I/O calls. In the example you posted, however, the file is opened by the open() function, and mmap is used for allocating memory or something else. This can be seen clearly from the signature of mmap:
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
The second-to-last parameter takes the file descriptor, which must be non-negative when mapping a file, while in the strace output it is -1.
The source code of fopen in glibc shows that mmap can actually be used:
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofopen.c;h=965d21cd978f3acb25ca23152993d9cac9f120e3;hb=HEAD#l36
