Does aio_write always write the whole buffer? - c

I know that the POSIX write function can return successfully even though it didn't write the whole buffer (if interrupted by a signal). You have to check for short writes and resume them.
But does aio_write have the same issue? I don't think it does, but it's not mentioned in the documentation, and I can't find anything that states that it doesn't happen.

Short answer
Excluding any case of error: Practical yes, theoratical not necessarily.
Long answer
From my experience the caller does not need to call aio_write() more then once to write the whole buffer using aoi_write().
This however is not a guarantee that the whole buffer passed in really will be written. A final call to aio_error() gives the result of the whole asyncronous I/O operation, which could indicate an error.
Anyhow the documentation does not explicitly excludes the case that the final call to aio_return() returns a value less then the amount of bytes to write out specified in the original call to aio_write(), what indeed needs to be interpreted as that not the whole buffer would have been sent, in which case it would be necessary to call aio_write() again passing in what whould have been indicated as having been left over to write by the previous call.

The list of error codes on this page doesn't include EINTR, which is the value in errno that means "please call again to do some more work". So, no you shouldn't need to call aio_write again for the same piece of data to be written.
This doesn't mean that you can rely on every write being completed. You still could, for example, get an partial write because the disk is full or some such. But you don't need to check for EINTR and "try again".

Related

Buffering expectations using `printf`

Say there exists a C program that executes in some Linux process. Upon start, the C program calls setvbuf to disable buffering on stdout. The program then alternates between two "logical" calls ("logical" in this sense to avoid consideration of the compiler possibly reordering instructions) - the first to printf() and the second incrementing a variable.
int main (int argc, char **argv)
{
setvbuf(stdout, NULL, _IONBF, 0);
unsigned int a = 0;
for (;;) {
printf("hello world!");
a++;
}
}
At some point assume the program receives a signal, e.g. via kill, that causes the program to terminate. Will the contents of stdout always be complete after the signal is received, in the sense that they include the result of all previous invocations to printf(), or is this dependent on other levels of buffering/other behavior not controllable via setvbuf (e.g. kernel buffering)?
The broader context of this question is, if using a synchronous logging mechanism in a C application (e.g. all threads log with printf()), can the log be trusted to be "complete" for all calls that have returned from printf() upon receiving some application-terminating signal?
Edit: I've edited the code snippet and question to remove undefined behavior for clarity.
Any sane interpretation of the expression "unbuffered stream" means that the data has left the stream object when printf returns. In the case of file-descriptor backed streams, that means the data has entered kernel-space, and the kernel should continue sending the data to its final destination (assuming no kernel panic, power loss etc).
But a problem with segfaults is that they may not happen when you think they do. Take for instance the following code:
int *p = NULL;
printf("hello world\n");
*p = 1;
A dumb non-optimizing compiler may create code that segfaults at *p=1;. But that is not the only possibility according to the c-standard. A compiler may for instance, if it can prove that printf doesn't depend on the contents of *p, reorganize the code like this:
int *p = NULL;
*p = 1;
printf("hello world\n");
In that case printf would never be called.
Another possibility is that, since p==NULL, *p=1 is invalid, the compiler may scrap that expression all together.
EDIT: The poster has changed the question from "Segfaulting" to being killed. In that case, it should all depend on if the kernel closes open file descriptors on exit the same way as close does, or not.
Given a construct like:
fprintf(file1, "whatever"); fflush(file1);
file2 = fopen(someExistingFile, "w");
there are some circumstances where it may be essential that fopen doesn't overwrite the existing file unless or until the write to file1 can be guaranteed successful, but there are others where waiting until success of the fflush can be assured before starting the fopen would needlessly degrade performance. In order to allow designers of C implementations to weigh such considerations however they see fit, and also avoid requiring that implementations provide semantic guarantees beyond those offered by the underlying OS (e.g. if an OS reports that the fflush() is complete before data is written to disk, and offers no way of finding out when all pending writes are complete, there would be no way the Standard could usefully require that an implementation which targets that OS must not allow fflush to return at any time when the write could still fail).
So, it appears that there's a basic misunderstanding in your question, and I think it's important to go through the basics of what printf is -> if your stdout buffer size is 0, then the question of "will all data be sent out of the buffer" is always yes, since there isn't a hardware buffer to save data, in theory. That is, somewhere in your computer hardware there's a something like a UART chip, that has a small buffer for transferring data. Most programs I've seen do not use this hardware buffer, so It's not surprising that your program does this.
However, the printf function has an upper layer buffer (in my application ~150 characters), and I'm assuming that this is the buffer you're asking about, note that this is not the same thing as the stdout buffer, its just an allocated piece of memory that stores messages before they're sent to wherever you want them to go. Think about it - if there were no printf-specific buffer you would only be able to send 1 character per function call
Now it really depends on the implementation of printf on your system, if it's nonblocking or blocking. If it's nonblocking, that could mean that data is being transferred by an interrupt or a DMA, probably a combination of both. In which case it depends on if your system stops these transfer mechanisms in the middle of a transfer, or allows them to complete. It's impossible for me to say based on the information you've given
However, in my experience, printf is usually a blocking function; that is it locks up the rest of your code while it's transferring things out of the buffer and moves to the next command only once it's completed, in which case if you have stopped the code from running (again, I'm not certain on the specifics of "kill" in your system) then you have also stopped the transfer.
Your system most likely has blocking PRINTF calls, and considering you say a "kill" signal it sounds like you're not even really sure what you mean by that. I think it's safe to assume that whatever signal you're talking about is not internally stopping your printf function from completing, so your full message will probably be sent before exiting, even if it arrives mid-printf. If your printf is being called it most likely is completing and sending the full message, unless this "kill" signal does something odd. That's the best answer I can give you from a "C" standpoint - if you would like a more absolute answer you would have to give us information that lets us see the implementation of "printf" on your operating system, and/or give us more specifics on how this "kill signal" you mentioned works

How to implement a system call that could check if itself has been successfully executed without going to the kernel log?

I have just stepped into the kernel world and would like to add some system calls. My goal is to add a system call that lets me check if it executed (without looking at the kernel log). However, I have been thinking for a long time, but have not yet figured out how to implement it. Could anyone please give me some advice? Or some pseudocodes? Thanks in advance.
My thinking is that we could implement a new system call, in which it writes something into a buffer. Then, another system call reads the content of the buffer to check if the previous system call has written to the buffer. (Somehow like pthread_create and pthread_join) Hence, my implementation consists of 2 system calls in total.
Here is a sketch of my thinking written in pseudocode:
syscall_2(...){
if (syscall_1 executes)
return 0;
if (syscall_1 NOT executes)
return -1;
}
syscall_1(){
do something;
create a buffer;
write something into buffer;
return syscall_2(buffer); // checks what is in buffer
}
My suggestion is that you have the system call itself accept a pointer to a userspace buffer that it overwrites with a specific piece of information.
You will have to learn how to access userspace memory, and more importantly how to verify that you were given a pointer to memory the process has mapped, and has write access to.
Then, once the system call completes, your program that called it can not only check the return code of the system call, you can also examine the memory to see if the system call wrote the correct thing to it.
Normally, system calls inform the caller if they are executed (how it went) so I guess you are interested in knowing which system calls have been executed, and how many times.
From this perspective, I think the best is to implement a device that can be queried (by means of some ioctl call) and let you know statistics about the individual system calls you can be interested on.
For example.... you can implement the number of system calls of type n you have used in some time.... by checking a counter at start time of interval and at end, and then check how many calls (if you implement a counter) you did in between by just subtracting the counter values at both times. You can also do the same to e.g. calculate the average time, by accumulating the time a system call takes to execute at the end of it. If you do this for example in picosecs, you can be sure this will be a good idea that you can publish. In this schema you can also account for the amount of I/O that each system call does, by counting the amount of bytes transferred to/from usermode.... You could implement this as ioctls to some device and then you don't need to add a system call for it.

How does list I/O writev internally work?

The writev function takes an array of struct iovec as input argument
writev(int fd, const struct iovec *iov, int iovcnt);
The input is a list of memory buffers that need to be written to a file (say). What I want to know is:
Does writev internally do this:
for (each element in iov)
write(element)
such that every element of iov is written to file in a separate I/O call? Or does writev write everything to file in a single I/O call?
Per the standards, the for loop you mentioned is not a valid implementation of writev, for several reasons:
The loop could fail to finish writing one iov before proceeding to the next, in the event of a short write - but this could be worked around by making the loop more elaborate.
The loop could have incorrect behavior with respect to atomicity for pipes: if the total write length is smaller than PIPE_BUF, the pipe write is required to be atomic, but the loop would break the atomicity requirement. This issue cannot be worked around except by moving all the iov entries into a single buffer before writing when the total length is at most PIPE_BUF.
The loop might have cases where it could result in blocking, where the single writev call would be required to perform a partial write without blocking. As far as I know, this issue would be impossible to work around in the general case.
Possibly other reasons I haven't thought of.
I'm not sure about point #3, but it definitely exists in the opposite direction, when reading. Calling read in a loop could block if a terminal has some data (shorter than the total iov length) available followed by an EOF indicator; calling readv should return immediately with a partial read in this case. However, due to a bug in Linux, readv on terminals is actually implemented as a read loop in kernelspace, and it does exhibit this blocking bug. I had to work around this bug in implementing musl's stdio:
http://git.etalabs.net/cgi-bin/gitweb.cgi?p=musl;a=commit;h=2cff36a84f268c09f4c9dc5a1340652c8e298dc0
To answer the last part of your question:
Or does writev write everything to file in a single I/O call?
In all cases, a conformant writev implementation will be a single syscall. Getting down to how it's implemented on Linux: for ordinary files and for most devices, the underlying file driver has methods that implement iov-style io directly, without any sort of internal loop. But the terminal driver on Linux is highly outdated and lacks the modern io methods, causing the kernel to fallback to a write/read loop for writev/readv when operating on a terminal.
The direct way to know how code works is read the source code.
see http://www.oschina.net/code/explore/glibc-2.9/sysdeps/posix/writev.c
It simplely alloca() or malloc() a buffer, copy all vectors into it, and call write() once.
That how it works. Nothing mysterious.
Or does writev write everything to file in a single I/O call?
I'm afarid not everything, though sys_writev try its best to write everything in a single call. it's depends on vfs's implement, if the vfs doesn't give an implement of writev, then kenerl will call vfs' write() in a loop. it's better to check the return value of writev/readv to see how many bytes wrotten as you do in write().
you can find the code of writev in kernel, fs/read_write.c:do_readv_writev.

Is there a way to check if a (file) handle is valid?

Is there any way to check if a handle, in my case returned by CreateFile, is valid?
The problem I face is that a valid file handle returned by CreateFile (it is not INVALID_HANDLE_VALUE) later causes WriteFile to fail, and GetLastError claims that it is because of an invalid handle.
Since it seems that you are not setting the handle value to INVALID_HANDLE_VALUE after closing it, what I would do is set a read watchpoint on the HANDLE variable, which will cause the debugger to break at each line that accesses the value of the HANDLE. You will be able to see the order in which the variable is accessed, including when the variable is read in order to pass it to CloseHandle.
See: Adding a watchpoint (breaking when a variable changes)
Your problem is caused most probably by either of two things:
You may close the file handle, nevertheless you still try to use it
File handle is overwritten due to a memory corruption
Generally it's a good practice to assign INVALID_HANDLE_VALUE to every handle as long as it's not supposed to contain any valid handle value.
In simple words - when your variable is declared - immediately initialize it to this value. And also write this value into your variable immediately after you close the file handle.
This will give you an indication of (1) - attempt to use the file handle which is already closed (or hasn't been opened yet)
The other answers are all important for your particular problem.
However, if you are given a HANDLE and simply want to find out whether it is indeed an open file handle (as opposed to, e.g., a handle to a mutex or a GDI object etc.), there is the Windows API function GetFileInformationByHandle for that.
Depending on the permissions your handle grants you for the file, you can also try to read some data from it using ReadFile or perform a null write operation using WriteFile with nNumberOfBytesToWrite set to 0.
Open-File are kept as a Data Structure in kernel, I don't think that has a official way to detect a file-handle is valid, just use it and check Error code as INVALID_HANDLE. Are you sure no others thread closed that file-handle?
Checking the validity of the handle is a band-aid, at best.
You should debug the process - set a breakpoint at the point the handle is set up (file open) and when you hit that code and after the handle is set up, set a second conditional breakpoint to trigger when the handle value changes.
This should enable you to work out the underlying cause rather than just check the handle is valid on each access, which is unreliable, costly and not necessary given correct logic.
Just to add to what everyone else is saying, make sure that you check the return value when you call CreateFile. IIRC, it will return INVALID_HANDLE_VALUE on failure, at which point you should call GetLastError to find out why.

"short read" from filesystem, when can it happen?

It is obvious that in general the read(2) system call can return less bytes than what was asked to be read. However, quite a few programs assume that when working with a local files, read(2) never returns less than what was asked (unless the file is shorter, of course).
So, my question is: on Linux, in which cases can read(2) return less than what was requested if reading from an open file and EOF is not encountered and the amount being read is a few kilobytes at maximum?
Some guesses:
Can received signals interrupt a read like that, but not make it fail?
Can different filesystems affect this behavior? Is there anything special about jffs2?
POSIX.1-2008 states:
The value returned may be less than
nbyte if the number of bytes left in
the file is less than nbyte, if the
read() request was interrupted by a
signal, or if the file is a pipe or
FIFO or special file and has fewer
than nbyte bytes immediately available
for reading.
Disk-based filesystems generally use uninterruptible reads, which means that the
read operation generally cannot be interrupted by a signal. Network-based
filesystems sometimes use interruptible reads, which can return partial data or no data.
(In the case of NFS this is configurable using the intr mount option.)
They sometimes also implement timeouts.
Keep in mind that even /some/arbitrary/file/path may refer to a FIFO or
special file, so what you thought was a regular file may not be. It is therefore
good practice to handle partial reads even though they may be unlikely.
I have to ask: "why do you care about the reason"? If read can return a number of bytes less than the requested amount (which, as you point out, it certainly can) why would you not want to deal with that situation?
A received signal only makes read() fail if it hasn't yet read a single byte. Otherwise, it will return partial data.
And I guess alternate filesystems may indeed return short reads in other situations. For example, it makes some sense (to me) to have a network-based filesystem behave just like a network socket wrt short reads (= having them often).
If it's really a file you are reading, then you can get short read as the last read before end of file.
Howver, it's generally best to behave as if ANY read could be a short read. If what you are reading is a pipe or an input device (stdin) rather than a file, you can get a short read whenever your buffer is larger than what is currently in the input buffer.
I am not sure but this situation could arise when the OS is running out of pages in the page cache. You could suggest that flush thread will be invoked in that case, but it depends on the heuristic used in the I/O scheduler. This situation could cause a read to return fewer bytes.
What I have always read being called a "short read" is not related to the file access read(2) but to the physical read of a disk sector. It happens when, while reading the data part of the sector, less valid magnetic signals are found than to make the 512 (or 4096 or whatever) bytes of a sector. That makes an invalid sector and a read fault. Regarding "when", or rather why it happens is most probably because the power feeding the drive fell down while that sector was written.
Could it be that a read(2) ends with a physical error code called "short read"?

Resources