Handling a large read() request size in C

I am interposing the read operation with my own implementation of read() that prints some logging and calls the libc read. I am wondering what the right way is to handle a read with a huge nbyte parameter. Since nbyte is a size_t, what is the right way to handle an out-of-range read request? From the read manpage:
If the value of nbyte is greater than {SSIZE_MAX}, the result is implementation-defined
What does this mean, and if I have to handle a large read request, what should I do?

Don't change the behavior of the read() call - just wrap the OS-provided call and allow it to do what it does.
ssize_t read( int fd, void *buf, size_t bytes )
{
    ssize_t result;

    /* ... log the request ... */

    /* real_read is the underlying libc read(),
       e.g. resolved via dlsym(RTLD_NEXT, "read") */
    result = real_read( fd, buf, bytes );

    /* ... log the result ... */

    return( result );
}
What could you possibly do if you're implementing a 64-bit library and a caller passes you a size_t value that's greater than SSIZE_MAX? You can't split that up into anything reasonable anyway.
And if you're implementing a 32-bit library, how would you pass the proper result back if you did split up the read?

You could break up the one large request into several smaller ones.
Besides, SSIZE_MAX is positively huge. Are you really sure you need to read more than 2 GB of data in one go?

You could simply use strace(1) to get some logs of your read syscalls.
In practice the read count is the size of some buffer in memory, so it is very unusual for it to be bigger than a dozen megabytes. It is often a few kilobytes.
So I believe you should not care about the SSIZE_MAX limit in real life.

The last parameter of read is the buffer size. It's not the number of bytes to read.
So:
- If the buffer size you received is less than SSIZE_MAX, call the read syscall with that buffer size.
- If the buffer size you received is greater than SSIZE_MAX, read SSIZE_MAX bytes.
- If the read syscall returns -1, return -1 too.
- If the read syscall returns 0, or fewer bytes than SSIZE_MAX, return the sum of the bytes read.
- If the read syscall returns exactly SSIZE_MAX, subtract SSIZE_MAX from the buffer size you received and loop back to the first step.
Do not forget to advance the buffer pointer and to count the total number of bytes read; see the sketch below.
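A minimal sketch of that loop, assuming a hypothetical real_read() that stands for the underlying libc read (for instance resolved via dlsym(RTLD_NEXT, "read") as in the wrapper above):

#include <limits.h>
#include <sys/types.h>

extern ssize_t real_read(int fd, void *buf, size_t bytes); /* hypothetical */

ssize_t read(int fd, void *buf, size_t bytes)
{
    size_t remaining = bytes;
    size_t total = 0;
    char *p = buf;

    for (;;) {
        size_t chunk = remaining > (size_t)SSIZE_MAX ? (size_t)SSIZE_MAX : remaining;
        ssize_t n = real_read(fd, p, chunk);
        if (n == -1)
            return -1;                  /* per the recipe: pass the error through */
        total += (size_t)n;
        remaining -= (size_t)n;
        p += n;
        if ((size_t)n < chunk || remaining == 0)
            break;                      /* EOF/short read, or request satisfied */
    }
    return (ssize_t)total;              /* caveat: the sum may itself exceed SSIZE_MAX */
}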

Being implementation-defined means that there is no correct answer, and callers should never do this (because they can't be certain how it will be handled). Given that you are interposing the syscall, I suggest you just assert(3) that the value is in range. If you end up failing that assert somewhere, fix the calling code to be compliant.

Related

max size_t value on send() in C

I'm writing a TCP server in C but I'm facing problems with send. I read a local file and send the data back to the client; when the file is small I have no problems, but when it becomes bigger I have this strange situation:
TCP server:
// create socket, bind, listen accept
// read file
fseek(fptr, 0, SEEK_SET);
// malloc for the sending buffer
ssize_t read = fread(sbuf, 1, file_size, fptr);
while (to_send > 0) {
    sent = send(socket, sbuf, buf_size, 0);
    sbuf += sent;
    to_send -= sent;
}
On huge files sent becomes equal to the maximum value of size_t; I think that I have a buffer overflow. How can I prevent this? What is the best practice to read from a file and send it back?
The problem is that you send buf_size bytes every time, even if there aren't that many left.
For example, pretend buf_size is 8 and you are sending 10 bytes (so initially, to_send is also 10). The first send sends 8 bytes, so you need to send 2 more. The second time, you also send 8 bytes (which probably reads out of bounds). Then, to_send will be -6, which is the same as SIZE_MAX - 5.
Simple fix is to send to_send if it is smaller:
sent = send(socket, sbuf, to_send < buf_size ? to_send : buf_size, 0);
Also, send returns -1 if it is unsuccessful. This is the same as SIZE_MAX when it is assigned to a size_t. You would need some error handling to fix this.
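Putting both fixes together, a sketch of the corrected loop (the names follow the question's snippet, and to_send is assumed to start at the number of bytes actually read):

while (to_send > 0) {
    size_t chunk = to_send < buf_size ? to_send : buf_size;
    ssize_t sent = send(socket, sbuf, chunk, 0);
    if (sent == -1) {
        perror("send");     /* handle the error instead of wrapping around */
        break;
    }
    sbuf += sent;
    to_send -= (size_t)sent;
}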
On huge files sent becomes equal to the maximum value of size_t; I think that I have a buffer overflow.
Since sent gets its value as the return value of send(), and send() returns ssize_t, which is a signed type unlikely to be wider than size_t, it is virtually certain that what is actually happening is that send() is indicating an error by returning -1. In that case, it will also be setting errno to a value indicative of the error. It cannot return the maximum value of size_t on any system I've ever had my hands on.
How can I prevent this?
In the first place, before you worry about preventing it, you should be sure to detect it by:
- declaring sent as a ssize_t to match the return type of send(), not a size_t, and
- checking the value returned into sent for such error conditions.
Second, if you are really dealing with files longer than can be represented by a ssize_t (much less a size_t), then it is a poor idea to load the whole thing into memory before sending any of it. Instead, load it in (much) smaller blocks, and send the data one such block at a time. Not only will this tend to have lower perceived latency, but it will also avoid any risk associated with approaching the limits of the data types involved.
Additionally, when you do so, be careful to do it right. You have done well to wrap your send() call in a loop to account for short writes, but as @Artyer describes in his answer, you don't quite get that right, because you do not reduce the number of bytes you try to send on the second and subsequent calls.
What is the best practice to read from a file and send it back?
As above.
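A sketch of that block-at-a-time approach (the helper name send_file() is hypothetical; sock is assumed to be an open, connected socket and fptr an open file):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

int send_file(int sock, FILE *fptr)
{
    char chunk[8192];
    size_t n;

    while ((n = fread(chunk, 1, sizeof chunk, fptr)) > 0) {
        const char *p = chunk;
        size_t left = n;
        while (left > 0) {              /* loop to cover short writes */
            ssize_t sent = send(sock, p, left, 0);
            if (sent == -1)
                return -1;              /* caller inspects errno */
            p += sent;
            left -= (size_t)sent;
        }
    }
    return ferror(fptr) ? -1 : 0;       /* distinguish read error from EOF */
}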

May printf (or fprintf or dprintf) return ("successfully") a nonnegative count less than the number of "all bytes"?

The manual says that
Upon successful return, these functions [printf, dprintf etc.] return the number of characters printed.
The manual does not mention whether this number may be less (but still nonnegative) than the length of the "final" string (with substitutions and formatting done). Nor does it mention how to check whether (or to ensure that) the string was completely written.
The dprintf function operates on a file descriptor, similarly to the write function, for which the manual does mention that
On success, the number of bytes written is returned (zero indicates nothing was written). It is not an error if this number is smaller than the number of bytes requested;
So if I want to write a string completely, then I have to enclose the n = write() call in a while loop. Do I have to do the same in the case of dprintf or printf?
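For reference, the while loop around write() alluded to above looks something like this sketch (the helper name write_all() is hypothetical):

#include <errno.h>
#include <unistd.h>

ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;

    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* interrupted before writing: retry */
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }
    return (ssize_t)len;
}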
My understanding of the documentation is that dprintf would either fail or produce all of the output. But I agree that it is something of a gray area (and I might not understand it well); my guess is that a partial output is a kind of failure (and so returns a negative size).
Here is the implementation of musl-libc:
In stdio/dprintf.c the dprintf function just calls vdprintf.
But in stdio/vdprintf.c you just have:

static size_t wrap_write(FILE *f, const unsigned char *buf, size_t len)
{
    return __stdio_write(f, buf, len);
}

int vdprintf(int fd, const char *restrict fmt, va_list ap)
{
    FILE f = {
        .fd = fd, .lbf = EOF, .write = wrap_write,
        .buf = (void *)fmt, .buf_size = 0,
        .lock = -1
    };
    return vfprintf(&f, fmt, ap);
}
So dprintf returns a size just as vfprintf (and fprintf, ...) do.
However, if you really are concerned, you would be better off using snprintf or asprintf to output into some memory buffer, and explicitly using write(2) on that buffer.
Look into stdio/__stdio_write.c for the implementation of __stdio_write (it uses writev(2) with a vector of two data chunks in a loop).
In other words, I would often not really care; but if you really need to be sure that every byte has been written as you expect (for example if the file descriptor is some HTTP socket), I would suggest buffering explicitly (e.g. by calling snprintf and/or asprintf) yourself, then using an explicit write(2).
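A minimal sketch of that format-to-memory-then-write(2) approach (the helper name, format string, and buffer size are just for illustration):

#include <stdio.h>
#include <unistd.h>

int greet_fd(int fd, const char *name)  /* hypothetical helper */
{
    char buf[256];
    int len = snprintf(buf, sizeof buf, "hello, %s\n", name);
    if (len < 0 || (size_t)len >= sizeof buf)
        return -1;                      /* encoding error, or output truncated */

    for (size_t off = 0; off < (size_t)len; ) {
        ssize_t n = write(fd, buf + off, (size_t)len - off);
        if (n == -1)
            return -1;
        off += (size_t)n;
    }
    return len;
}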
PS. You might check yourself the source code of your particular C standard library providing dprintf; for GNU glibc see notably libio/iovdprintf.c
With stdio, returning the number of partially written bytes doesn't make much sense because stdio functions work with a (more or less) global buffer whose state is unknown to you and gets dragged in from previous calls.
If stdio functions allowed you to work with that, the error return values would need to be more complex: they would not only need to communicate how many characters were or were not output, but also whether the failure was before your last input (somewhere in the buffer) or in the middle of your last input, and if so, how much of the last input got buffered.
The d-functions could theoretically give you the number of partially written characters easily, but POSIX specifies that they should mirror the stdio functions, so they only give you an otherwise unspecified negative value on error.
If you need more control, you can use the lower level functions.
Concerning printf(), it is quite clear.
The printf function returns the number of characters transmitted, or a negative value if an output or encoding error occurred. C11dr §7.21.6.3 3
A negative value is returned if an error occurred. In that case, 0 or more characters may have been printed. The count is unknowable via the standard library.
If the value returned is not negative, that is the number sent to stdout.
Since stdout is often buffered, that may not be the number received at the output device at the conclusion of printf(). Follow printf() with fflush(stdout):
int r1 = printf(....);
int r2 = fflush(stdout);
if (r1 < 0 || r2 != 0) Handle_Failure();
For finest control, "print" to a buffer and use putchar() or various non-standard functions.
My bet is that the answer is no. (After looking into the (obfuscated) source of printf.) So any nonnegative return value means that printf was fully successful (it reached the end of the format string, and everything was passed to kernel buffers).
But some authoritative people should confirm this.

C - does read() add a '\0'?

Does it have to? I've always been fuzzy on this sort of stuff, but if I have something like:
char buf[256];
read(fd, buf, 256);
write(fd2, buf, 256);
Is there potential for error here, other than the cases where those functions return -1?
If it were to read only 40 characters, would it put a \0 after them? (And would write recognize that \0 and stop?)
Also, if it were to read 256 characters, is there a \0 after those 256?
does read() add a '\0'?
No, it doesn't. It just reads.
From read()'s documentation:
The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer pointed to by buf.
Is there potential for error here, other than the cases where those functions return -1?
read() might return 0 indicating end-of-file.
When reading (also from a socket descriptor), read() does not necessarily read as many bytes as it was told to. So in this context, do not just test the outcome of read() against -1, but also compare it against the number of bytes the function was told to read.
A general note:
Functions do what is documented (at least in proper implementations of the C language). Both of your assumptions (that a 0-terminator is autonomously appended, and that write() detects it) are not documented.
No.
Consider reading binary data (eg. a photo from a file): adding extra bytes would corrupt the data.
From the man page:
Synopsis
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
That is void *, not char *, because read() reads bytes, not characters. It reads bytes with the value zero just like bytes with any other value, and since blocks of bytes (as opposed to strings) aren't terminated, read() doesn't terminate them.
Does it have to?
Not unless the data that is successfully read from the file contains a '\0'...
Is there potential for error here, other than the cases where those functions return -1?
Yes. read returns the actual number of bytes read (or a negative value to indicate failure). If you choose to write more than that number of bytes into your other file, then you are writing potential garbage.
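A sketch of the copy done right, writing only what was actually read (the helper name copy_fd() is hypothetical; a complete version would also loop on short writes):

#include <unistd.h>

void copy_fd(int fd, int fd2)
{
    char buf[256];
    ssize_t n;

    while ((n = read(fd, buf, sizeof buf)) > 0)
        write(fd2, buf, (size_t)n);     /* write only the n bytes read */

    /* n == 0 means end of file; n == -1 means error (check errno) */
}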

C using fread to read an unknown amount of data

I have a text file called test.txt
Inside it will be a single number, it may be any of the following:
1
2391
32131231
3123121412
I.e. it could be any size of number, from 1 digit up to x digits.
The file will only have 1 thing in it - this number.
I want a bit of code using fread() which will read that number of bytes from the file and put it into an appropriately sized variable.
This is to run on an embedded device; I am concerned about memory usage.
How to solve this problem?
You can simply use:
char buffer[4096];
size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer), fp);
if (nbytes == 0)
    ...EOF or other error...
else
    ...process nbytes of data...
Or, in other words, provide yourself with a data space big enough for any valid data and then record how much data was actually read into the string. Note that the string will not be null terminated unless either buffer contained all zeroes before the fread() or the file contained a zero byte. You cannot rely on a local variable being zeroed before use.
It is not clear how you want to create the 'appropriately sized variable'. You might end up using dynamic memory allocation (malloc()) to provide the correct amount of space, and then return that allocated pointer from the function. Remember to check for a null return (out of memory) before using it.
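For instance, a sketch that null-terminates what was read and converts it with strtoull (the helper name read_number() is hypothetical, and the read is shrunk by one byte to leave room for the terminator):

#include <stdio.h>
#include <stdlib.h>

unsigned long long read_number(FILE *fp)
{
    char buffer[4096];
    size_t nbytes = fread(buffer, sizeof(char), sizeof(buffer) - 1, fp);

    buffer[nbytes] = '\0';              /* terminate before converting */
    return strtoull(buffer, NULL, 10);  /* returns 0 on EOF or empty input */
}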
If you want to avoid over-reading, fread is not the right function. You probably want fscanf with a conversion specifier along the lines of %100[0123456789]...
One way to achieve this is to use fseek to move your file stream location to the end of the file:
fseek(file, 0, SEEK_END);
and then using ftell to get the position of the cursor in the file. This returns the position in bytes, so you can use this value to allocate a suitably large buffer, seek back to the start, and then read the file into that buffer.
I have seen warnings saying this may not always be 100% accurate, but I've used it in several instances without a problem. I think the issues may depend on specific implementations of the functions on certain platforms.
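A sketch of that measure-then-read approach (the helper name read_whole_file() is hypothetical):

#include <stdio.h>
#include <stdlib.h>

char *read_whole_file(FILE *file, size_t *out_len)
{
    if (fseek(file, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(file);            /* position in bytes == file size */
    if (size < 0)
        return NULL;
    rewind(file);                       /* back to the start before reading */

    char *buf = malloc((size_t)size + 1);
    if (buf == NULL)
        return NULL;                    /* out of memory */
    size_t n = fread(buf, 1, (size_t)size, file);
    buf[n] = '\0';                      /* convenient for text such as this number */
    if (out_len)
        *out_len = n;
    return buf;
}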
Depending on how clever you need to be with the number conversion... If you do not need to be especially clever and fast, you can read it a character at a time with getc(). So:
- start with a variable initialized to 0;
- read a character, multiply the variable by 10, and add the new digit;
- repeat until done.
Get a bigger sized variable as needed along the way, or start with your largest sized variable and then copy it into the smallest size that fits after you finish.
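A sketch of that loop, using unsigned long long as the "largest sized variable" (the helper name read_digits() is hypothetical, and overflow is not detected here):

#include <ctype.h>
#include <stdio.h>

unsigned long long read_digits(FILE *fp)
{
    unsigned long long value = 0;
    int c;

    while ((c = getc(fp)) != EOF && isdigit(c))
        value = value * 10 + (unsigned)(c - '0');  /* shift in the new digit */

    return value;
}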

Why does fwrite have both size and count parameters when just bytes to write would suffice? [duplicate]

We had a discussion here at work about why fread() and fwrite() take a size per member and a count, and return the number of members read/written, rather than just taking a buffer and a size. The only use for it we could come up with is reading/writing an array of structures which aren't evenly divisible by the platform alignment, and which have hence been padded, but that can't be so common as to warrant this design choice.
From fread(3):
The function fread() reads nmemb elements of data, each size bytes long, from the stream pointed to by stream, storing them at the location given by ptr.
The function fwrite() writes nmemb elements of data, each size bytes long, to the stream pointed to by stream, obtaining them from the location given by ptr.
fread() and fwrite() return the number of items successfully read or written (i.e., not the number of characters). If an error occurs, or the end-of-file is reached, the return value is a short item count (or zero).
The difference between fread(buf, 1000, 1, stream) and fread(buf, 1, 1000, stream) is that in the first case you get either one chunk of 1000 bytes or nothing (if the file is smaller), while in the second case you get everything in the file up to 1000 bytes.
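To illustrate, the two call shapes return the following for a hypothetical 600-byte file (each call made on a freshly opened stream):

char buf[1000];
size_t a = fread(buf, 1000, 1, stream);  /* a == 0: no complete 1000-byte item */
size_t b = fread(buf, 1, 1000, stream);  /* b == 600: every available byte read */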
It's based on how fread is implemented.
The Single UNIX Specification says:
For each object, size calls shall be made to the fgetc() function and the results stored, in the order read, in an array of unsigned char exactly overlaying the object.
fgetc also has this note:
Since fgetc() operates on bytes, reading a character consisting of multiple bytes (or "a multi-byte character") may require multiple calls to fgetc().
Of course, this predates fancy variable-byte character encodings like UTF-8.
The SUS notes that this is actually taken from the ISO C documents.
This is pure speculation; however, back in the day (and some are still around) many filesystems were not simple byte streams on a hard drive.
Many filesystems were record-based, so to satisfy such filesystems in an efficient manner, you have to specify the number of items ("records"), allowing fwrite/fread to operate on the storage as records, not just byte streams.
Here, let me fix those functions:
size_t fread_buf(void *ptr, size_t size, FILE *stream)
{
    return fread(ptr, 1, size, stream);
}

size_t fwrite_buf(void const *ptr, size_t size, FILE *stream)
{
    return fwrite(ptr, 1, size, stream);
}
As for a rationale for the parameters to fread()/fwrite(), I lost my copy of K&R long ago, so I can only guess. I think that a likely answer is that Kernighan and Ritchie may have simply thought that performing binary I/O would be most naturally done on arrays of objects. Also, they may have thought that block I/O would be faster or easier to implement on some architectures.
Even though the C standard specifies that fread() and fwrite() be implemented in terms of fgetc() and fputc(), remember that the standard came into existence long after C was defined by K&R, and that things specified in the standard might not have been in the original designers' ideas. It's even possible that things said in K&R's "The C Programming Language" might not be the same as when the language was first being designed.
Finally, here's what P.J. Plauger has to say about fread() in "The Standard C Library":
If the size (second) argument is greater than one, you cannot determine whether the function also read up to size - 1 additional characters beyond what it reports. As a rule, you are better off calling the function as fread(buf, 1, size * n, stream); instead of fread(buf, size, n, stream);
Basically, he's saying that fread()'s interface is broken. For fwrite() he notes that "write errors are generally rare, so this is not a major shortcoming" - a statement I wouldn't agree with.
Likely it goes back to the way that file I/O was implemented back in the day. It might have been faster to read/write files in blocks than to write everything at once.
Having separate arguments for size and count could be advantageous on an implementation that can avoid reading any partial records. If one were to use single-byte reads from something like a pipe, then even with fixed-format data one would have to allow for the possibility of a record getting split across two reads. If one could instead request e.g. a non-blocking read of up to 40 records of 10 bytes each when 293 bytes are available, and have the system return 290 bytes (29 whole records) while leaving 3 bytes ready for the next read, that would be much more convenient.
I don't know to what extent implementations of fread can handle such semantics, but they could certainly be handy on implementations that promise to support them.
I think it is because C lacks function overloading. If it had overloading, size would be redundant. But in C you can't determine the size of an array element; you have to specify it.
Consider this:
int intArray[10];
fwrite(intArray, sizeof(int), 10, fd);
If fwrite accepted number of bytes, you could write the following:
int intArray[10];
fwrite(intArray, sizeof(int)*10, fd);
But it is just inefficient. You will have sizeof(int) times more system calls.
Another point that should be taken into consideration is that you usually don't want part of an array element to be written to a file. You want the whole integer or nothing. fwrite returns the number of elements successfully written. So if you discovered that only the 2 low bytes of an element were written, what would you do?
On some systems (due to alignment) you can't access one byte of an integer without creating a copy and shifting.
