C - does read() add a '\0'? - c

Does it have to? I've always been fuzzy on this sort of stuff, but if I have something like:
char buf[256];
read(fd, buf, 256);
write(fd2, buf, 256);
Is there potential for error here, other than the cases where those functions return -1?
If it were to only read 40 characters, would it put a \0 after it? (And would write recognize that \0 and stop?)
Also, if it were to read 256 characters, is there a \0 after those 256?

does read() add a '\0'?
No, it doesn't. It just reads.
From read()'s documentation:
The read() function shall attempt to read nbyte bytes from the file associated with the open file descriptor, fildes, into the buffer pointed to by buf.
Is there potential for error here, other than the cases where those functions return -1?
read() might return 0 indicating end-of-file.
When reading (also from a socket descriptor), read() does not necessarily read as many bytes as it was told to. So in this context do not just test the outcome of read against -1, but also compare it against the number of bytes the function was told to read.
A general note:
Functions do what is documented (at least for proper implementations of the C language). Both your assumptions (autonomously set a 0-termination, detect the latter) are not documented.
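To make the point about short reads concrete, here is a minimal sketch of a loop that keeps calling read() until the requested count has arrived. The helper name read_fully is hypothetical, and retrying on EINTR is an assumption about what the caller wants:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until `count` bytes have arrived,
 * end-of-file is reached, or an error occurs.
 * Returns the number of bytes stored (possibly fewer than `count` at EOF),
 * or -1 on error with errno set by read(). */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    unsigned char *p = buf;
    size_t total = 0;

    assert(buf != NULL);
    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n == 0)
            break;              /* end-of-file: nothing more will come */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        total += (size_t)n;     /* short read: ask for the remainder */
    }
    return (ssize_t)total;
}
```

The caller still has to use the returned count; no terminator is stored anywhere in the buffer.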

No.
Consider reading binary data (e.g. a photo from a file): adding extra bytes would corrupt the data.

From the man page:
Synopsis
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
That is void *, not char *, because read() reads bytes, not characters. It reads zero bytes as well as any other value, and since blocks of bytes (as opposed to strings) aren't terminated, read() doesn't.
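If the bytes are known to be text and a C string is wanted, the terminator has to be added by the caller. A minimal sketch, assuming a hypothetical helper named read_as_string and a buffer with room for at least one byte:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read at most cap - 1 bytes and terminate the result
 * so it can be handed to string functions. Returns read()'s result. */
ssize_t read_as_string(int fd, char *buf, size_t cap)
{
    assert(buf != NULL && cap > 0);
    ssize_t n = read(fd, buf, cap - 1);   /* leave room for the terminator */
    buf[n > 0 ? n : 0] = '\0';            /* read() never does this itself */
    return n;
}
```

Note that if the data itself contains zero bytes, string functions will stop at the first one, so the return value remains the only reliable length.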

Does it have to?
Not unless the data that is successfully read from the file contains a '\0'...
Is there potential for error here, other than the cases where those functions return -1?
Yes. read returns the actual number of bytes read (or a negative value to indicate failure). If you choose to write more than that number of bytes into your other file, then you are writing potential garbage.
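As an illustration, the snippet from the question could be reworked so that it writes only what was actually read. This is a sketch under assumptions: copy_chunk is a hypothetical name, and looping over partial writes is added because write() may also transfer fewer bytes than requested:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy one chunk from in_fd to out_fd.
 * Returns the number of bytes copied, 0 at end-of-file, -1 on error. */
ssize_t copy_chunk(int in_fd, int out_fd)
{
    char buf[256];

    assert(in_fd >= 0 && out_fd >= 0);
    ssize_t n = read(in_fd, buf, sizeof buf);
    if (n <= 0)
        return n;                       /* 0: end-of-file, -1: error */

    ssize_t done = 0;
    while (done < n) {                  /* write() may also be partial */
        ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
        if (w < 0)
            return -1;
        done += w;
    }
    return n;                           /* only the bytes read were written */
}
```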

Related

Is there any performance and usage-difference between read and fread?

The goal of this program is to read a line from a file and load it into a buffer of a fixed size buf_len. If the line ends or the buffer is not big enough to store the line the program exits.
int read_line(int file, char * buffer, size_t buf_len) {
size_t total = 0;
for (total = 0; total < buf_len; ++total) {
ssize_t bytes_read = read(file, &buffer[total], 1); // why shouldn't one use fread() here?
if (bytes_read == 0) return total != 0;
if (buffer[total] == '\0') return 1;
}
exit(-1); // line too long
}
Is there any difference if one uses fread() instead of read() in this context?
The main difference between read() and fread() is that read() is a system call that reads up to a specified number of bytes from a file descriptor, whereas fread() is a standard library function that reads a specified number of elements from a buffered file stream. Both transfer raw bytes and work equally well for text and binary data.
In the context of the code you provided, each read() of a single byte is a separate system call, so reading a line this way is slow; fread() or fgetc() on a FILE * would serve most of those bytes from the stream's buffer. If you are reading text data, it may be more appropriate to use fgets(), as it stops at a newline and null-terminates the result.
read() is a lower-level call that can be efficient when reading large blocks at a time, while fread() is a higher-level function that is usually more convenient and, thanks to buffering, cheaper for many small reads.
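A possible stdio-based variant of the line reader, sketched with fgets(). The name read_line_stdio and its return convention (1 for a line, 0 for end-of-file or error, -1 for a line that does not fit) are assumptions made for this example, not part of the question:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stdio variant of read_line(): returns 1 if a line was stored
 * (without its trailing '\n'), 0 at end-of-file or on a read error, and -1
 * if the line, including its newline, did not fit in the buffer. */
int read_line_stdio(FILE *fp, char *buffer, size_t buf_len)
{
    assert(fp != NULL && buffer != NULL && buf_len > 1);
    if (fgets(buffer, (int)buf_len, fp) == NULL)
        return 0;

    size_t len = strlen(buffer);
    if (len > 0 && buffer[len - 1] == '\n') {
        buffer[len - 1] = '\0';         /* strip the newline */
        return 1;
    }

    /* No newline stored: peek to tell "last line of the file" from "too long". */
    int c = fgetc(fp);
    if (c == EOF)
        return 1;
    ungetc(c, fp);
    return -1;
}
```

Unlike the original, this version reports an overlong line to the caller instead of calling exit().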
The read() system call is part of the POSIX standard and is available on most Unix-like systems, while fread() is part of the C standard library and is available on most C implementations.

Using fread() to read a text based file - best practices

Consider this code to read a text based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King.
There are other methods of reading text based files, but here I am concerned with fread() only.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
// Declare file stream pointer.
FILE *fp = fopen("Note.txt", "r");
// fopen() call successful.
if(fp != NULL)
{
// Navigate through to end of the file.
fseek(fp, 0, SEEK_END);
// Calculate the total bytes navigated.
long filesize = ftell(fp);
// Navigate to the beginning of the file so
// it can be read.
rewind(fp);
// Declare array of char with appropriate size.
char content[filesize + 1];
// Set last char of array to contain NULL char.
content[filesize] = '\0';
// Read the file content.
fread(content, filesize, 1, fp);
// Close file stream pointer.
fclose(fp);
// Print file content.
printf("%s\n", content);
}
// fopen() call unsuccessful.
else
{
printf("File could not be read.\n");
}
return 0;
}
There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
To circumvent this issue, we may use a buffer size and keep on reading into a char array of that size. If filesize is less than buffer size, then we simply perform fread() once as described in the above code. Otherwise, we divide the total file size by the buffer size and get a result, whose int portion we will use as the total number of times to iterate a loop where we will invoke fread() each time, appending the read buffer array into a larger string. Now, for the final fread(), which we will perform after the loop, we will have to read exactly (filesize % buffersize) bytes of data into an array of that size and finally append this array into the larger string (which we would have malloc-ed with filesize + 1 beforehand). I find that if we perform fread() for the last chunk of data using buffersize as its second parameter, then extra garbage data of size (buffersize - chunksize) will be read in and the data might become corrupted. Are my assumptions here correct? Please explain if/how I have overlooked something.
Also, there is the issue that non-ASCII characters might not have size of 1 byte. In that case I would assume the proper amount is being read, but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
fread() does not care about strings (null-character-terminated arrays). It reads data as if it were in multiples of unsigned char (see *1), with no special concern for the data content if the stream is opened in binary mode, and with perhaps some data processing (e.g. end-of-line, byte-order-mark) in text mode.
Are my assumptions here correct?
Failed assumptions:
Assuming ftell() return value equals the sum of fread() bytes.
The assumption can be false in text mode (as OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.
Assuming not checking the return value of fread() is OK. Use the return value of fread() to know if an error occurred, end-of-file and how many multiples of bytes were read.
Assuming error checking is not required. ftell(), fread(), and fseek() (used instead of rewind()) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.
Assuming no null characters are read. A text file is not certainly made into one string by reading all and appending a null character. Robust code detects and/or copes with embedded null characters.
Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF8 sequences - perhaps after reading the entire file.
Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
fread() does not function on multi-byte boundaries, only multiples of unsigned char. A given fread() may end with a portion of a multi-byte and the next fread() will continue from mid-multi-byte.
Instead of the 2-pass approach, consider a single pass:
// Pseudo code
total_read = 0
Allocate buffer, say 4096
forever
if buffer full
double buffer_size (`realloc()`)
u = unused portion of buffer
fread u bytes into unused portion of buffer
total_read += number_just_read
if (number_just_read < u)
quit loop
Resize buffer total_read (+ 1 if appending a '\0')
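The pseudo code above could be turned into C roughly as follows. This is a sketch, not a hardened implementation: the name read_all and the out_len parameter are assumptions, the initial capacity of 4096 is taken from the pseudo code, and overflow of the doubled capacity is not checked:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the single-pass approach: returns a malloc'ed, null-terminated
 * buffer holding the whole stream, or NULL on read or allocation failure.
 * *out_len receives the number of bytes read (excluding the terminator). */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, total = 0;
    char *buf;

    assert(fp != NULL && out_len != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        if (total == cap) {                     /* buffer full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
        size_t unused = cap - total;
        size_t n = fread(buf + total, 1, unused, fp);
        total += n;
        if (n < unused)                         /* end-of-file or error */
            break;
    }
    if (ferror(fp)) { free(buf); return NULL; }

    char *result = realloc(buf, total + 1);     /* shrink to fit, + 1 for '\0' */
    if (result == NULL) { free(buf); return NULL; }
    result[total] = '\0';
    *out_len = total;
    return result;
}
```

Because the length is returned separately, embedded null characters in the input can still be detected by comparing strlen() of the result with *out_len.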
Alternatively consider the need to read the entire file in before processing the data. I do not know the higher level goal, but often processing data as it arrives makes for less resource impact and faster throughput.
Advanced
Text files may be simple ASCII only, 8-bit code page defined, or one of various UTF encodings (byte-order-mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.
ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
*1 fread() reads in multiple of bytes as per the 3rd argument, which is 1 in OP's case.
// v --- multiple of 1 byte
fread(content, filesize, 1, fp);

May printf (or fprintf or dprintf) return ("successfully") less (but nonnegative) than the number of "all bytes"?

The manual says that
Upon successful return, these functions [printf, dprintf etc.] return the number of characters printed.
The manual does not mention whether this number may be less (but still nonnegative) than the length of the "final" string (after substitutions and formatting). Nor does it mention how to check (or ensure) that the string was completely written.
The dprintf function operates on a file descriptor, similarly to the write function, for which the manual does mention that
On success, the number of bytes written is returned (zero indicates nothing was written). It is not an error if this number is smaller than the number of bytes requested;
So if I want to write a string completely, then I have to enclose the n = write() in a while-loop. Do I have to do the same in the case of dprintf or printf?
My understanding of the documentation is that dprintf would either fail or output all the output. But I agree that it is some gray area (and I might not understand well); I'm guessing that a partial output is some kind of failure (so returns a negative size).
Here is the implementation of musl-libc:
In stdio/dprintf.c the dprintf function just calls vdprintf
But in stdio/vdprintf.c you just have:
static size_t wrap_write(FILE *f, const unsigned char *buf, size_t len)
{
return __stdio_write(f, buf, len);
}
int vdprintf(int fd, const char *restrict fmt, va_list ap)
{
FILE f = {
.fd = fd, .lbf = EOF, .write = wrap_write,
.buf = (void *)fmt, .buf_size = 0,
.lock = -1
};
return vfprintf(&f, fmt, ap);
}
So dprintf is returning a size like vfprintf (and fprintf....) does.
However, if you really are concerned, you'd better use snprintf or asprintf to output into some memory buffer, and explicitly use write(2) on that buffer.
Look into stdio/__stdio_write.c for the implementation of __stdio_write (it uses writev(2) with a vector of two data chunks in a loop).
In other words, I would often not really care; but if you really need to be sure that every byte has been written as you expect it (for example if the file descriptor is some HTTP socket), I would suggest to buffer explicitly (e.g. by calling snprintf and/or asprintf) yourself, then use your explicit write(2).
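The explicit-buffer approach suggested here might look like the following sketch. The helper names write_all and write_pair, the "%s=%d\n" format, and the 256-byte buffer are all assumptions chosen for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call write(2) until every byte has been accepted.
 * Returns 0 on success, -1 on error (errno set by write). */
static int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry */
            return -1;
        }
        p += w;                         /* partial write: send the rest */
        len -= (size_t)w;
    }
    return 0;
}

/* Hypothetical example: format into memory first, then write explicitly. */
int write_pair(int fd, const char *name, int value)
{
    char line[256];

    assert(name != NULL);
    int n = snprintf(line, sizeof line, "%s=%d\n", name, value);
    if (n < 0 || (size_t)n >= sizeof line)
        return -1;                      /* encoding error or truncated output */
    return write_all(fd, line, (size_t)n);
}
```

With this split, the caller knows exactly how many bytes were formatted and can decide how to react if the descriptor stops accepting data partway through.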
PS. You might check yourself the source code of your particular C standard library providing dprintf; for GNU glibc see notably libio/iovdprintf.c
With stdio, returning the number of partially written bytes doesn't make much sense because stdio functions work with a (more or less) global buffer whose state is unknown to you and gets dragged in from previous calls.
If stdio functions allowed you to work with that, the error return values would need to be more complex as they would not only need to communicate how many characters were or were not outputted, but also whether the failure was before your last input somewhere in the buffer, or in the middle of your last input and if so, how much of the last input got buffered.
The d-functions could theoretically give you the number of partially written characters easily, but POSIX specifies that they should mirror the stdio functions, and so they only give you an otherwise unspecified negative value on error.
If you need more control, you can use the lower level functions.
Concerning printf(), it is quite clear.
The printf function returns the number of characters transmitted, or a negative value if an output or encoding error occurred. C11dr §7.21.6.3 3
A negative value is returned if an error occurred. In that case 0 or more characters may have printed. The count is unknowable via the standard library.
If the returned value is not negative, that is the number sent to stdout.
Since stdout is often buffered, that may not be the number received at the output device on the conclusion of printf(). Follow printf() with a fflush(stdout):
int r1 = printf(....);
int r2 = fflush(stdout);
if (r1 < 0 || r2 != 0) Handle_Failure();
For finest control, "print" to a buffer and use putchar() or various non-standard functions.
My bet is no. (After looking into the - obfuscated - source of printf.) So any nonnegative return value means that printf was fully successful (it reached the end of the format string and everything was passed to kernel buffers).
But someone authoritative should confirm it.

Handling large size of Read operation

I am interposing a read operation with my own implementation of read that prints some log and calls the libc read. I am wondering what should be the right way to handle read with a huge nbyte parameter. Since nbyte is size_t, what is the right way to handle out of range read request? From the read manpage:
If the value of nbyte is greater than {SSIZE_MAX}, the result is implementation-defined
What does this mean and if I have to handle a large read request, what should I do?
Don't change the behavior of the read() call - just wrap the OS-provided call and allow it to do what it does.
ssize_t read( int fd, void *buf, size_t bytes )
{
ssize_t result;
.
.
.
result = real_read( fd, buf, bytes );
.
.
.
return( result );
}
What could you possibly do if you're implementing a 64-bit library and a caller passes you a size_t value that's greater than SSIZE_MAX? You can't split that up into anything reasonable anyway.
And if you're implementing a 32-bit library, how would you pass the proper result back if you did split up the read?
You could break up the one large request into several smaller ones.
Besides, SSIZE_MAX is positively huge. Are you really sure you need to read >2GB of data, in one go?
You could simply use strace(1) to get some logs of your read syscalls.
In practice the read count is the size of some buffer (in memory), so it is very unusual to have it being bigger than a dozen of megabytes. It is often some kilobytes.
So I believe you should not care about SSIZE_MAX limit in real life
The last parameter of read is the buffer size. It's not the number of bytes to read.
So:
If the buffer size you received is less than SSIZE_MAX, call the syscall 'read' with the buffer size.
If the buffer size you received is greater than SSIZE_MAX, 'read' SSIZE_MAX bytes.
If the read syscall returns -1, return -1 too.
If the read syscall returns 0 or less than SSIZE_MAX, return the sum of bytes read.
If the read call returns exactly SSIZE_MAX, decrement the buffer size received by SSIZE_MAX
and loop (goto "So").
Do not forget to adjust the buffer pointer and to count the total number of bytes read.
Being implementation defined means that there is no correct answer, and callers should never do this (because they can’t be certain how it will be handled). Given that you are interposing the syscall, I suggest you just assert(3) that the value is in range. If you end up failing that assert somewhere, fix the calling code to be compliant.

What does write() write if null terminator is already reached?

For write(fd[1], string, size) - what would happen if string is shorter than size?
I looked up the man page but it doesn't clearly specify that situation. I know that for read, it would simply stop there and read whatever string is, but it's certainly not the case for write. So what is write doing? The return value is still size, so is it appending a null terminator? Why doesn't it just stop like read?
When you call write(), the system assumes you are writing generic data to some file - it doesn't care that you have a string. A null-terminated string is seen as a bunch of non-zero bytes followed by a zero byte - the system will keep writing out until it's written size bytes.
Thus, specifying a size which is longer than your string could be dangerous. The system will likely read memory beyond the end of the string and write it out to your file, which will then contain garbage data.
write will write size bytes of data starting at string. If you define string to be an array shorter than size, it will have undefined behaviour. But in your previous question the char *line = "apple"; contains 6 characters (i.e. a, p, p, l, e and the null character).
So it is best to call write with size set to the correct value.
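One way to get the correct value is to derive it from the string itself. A minimal sketch, with write_string as a hypothetical helper name:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write a C string without its terminator.
 * Returns write()'s result, which may be less than the string length. */
ssize_t write_string(int fd, const char *s)
{
    assert(s != NULL);
    return write(fd, s, strlen(s));     /* the length comes from the string */
}
```

For "apple" this passes 5 as the size; using strlen(s) + 1 instead would also send the null character.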
write(int fildes, const void *buf, size_t nbyte) does not write null terminated strings. It writes the content of a buffer. If there are any null characters in the buffer they will be written as well.
read(int fildes, void *buf, size_t nbyte) also pays no attention to null characters. It reads a number of bytes into the given buffer, up to a maximum of nbyte. It does not add any null terminating bytes.
These are low level routines, designed for reading and writing arbitrary data.
The write call outputs a buffer of the given size. It does not attempt to interpret the data in the buffer. That is, you give it a pointer to a memory location and a number of bytes to write (the length) then, as long as those memory locations exist in a legal portion of your program's data, it will copy those bytes to the output file descriptor.
Unlike the string manipulation routines, write (and read, for that matter) ignores null bytes, that is, bytes with the value zero. read does pay attention to the EOF character and, on certain devices, will only read the amount of data available at the time, perhaps returning less data than requested, but both operate on raw bytes without interpreting them as "strings".
If you attempt to write more data than the buffer contains, it may or may not work depending on the position of the memory. At best the behavior is undefined. At worst you'll get a segmentation fault and your program will crash.

Resources