read system call in Linux vs Windows - C

Is there any difference between using read() in Linux than in Windows?
Is it possible that in Windows, it will usually read less than I request, and in Linux it usually reads as much as I request?

read isn't a standard C function. Historically it is a POSIX syscall, and as such, Windows (assuming Windows means MSVC) isn't required to implement it at all. Still, they tried. And we can compare the two implementations:
Linux:
http://man7.org/linux/man-pages/man2/read.2.html
On success, the number of bytes read is returned (zero indicates end of file), and the file position is advanced by this number. It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal. See also NOTES.
Windows:
https://msdn.microsoft.com/en-us/library/ms235412.aspx
https://msdn.microsoft.com/en-us/library/wyssk1bs.aspx
_read returns the number of bytes read, which might be less than count if there are fewer than count bytes left in the file or if the file was opened in text mode, in which case each carriage return–line feed (CR-LF) pair is replaced with a single linefeed character. Only the single linefeed character is counted in the return value. The replacement does not affect the file pointer.
So you should expect both implementations to return less than the requested number of bytes. Furthermore, there is a clear difference when reading files in text mode.
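For example, here is a minimal sketch of a loop that copes with short reads on either platform. read_full is a made-up helper name; it assumes a POSIX read(), and on Windows the CRT's _read() from <io.h> could stand in for it on a binary-mode descriptor:

/* Sketch: read exactly `count` bytes (or stop at EOF/error), coping with
 * short reads and signal interruptions. */
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, (char *)buf + total, count - total);
        if (n == 0)                /* end of file */
            break;
        if (n < 0) {
            if (errno == EINTR)    /* interrupted by a signal: retry */
                continue;
            return -1;             /* real error */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;         /* may be less than count only at EOF */
}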

Related

Understanding read syscall

I'm reading the man page for read and discovered that it is possible to read less than the desired number of bytes passed in as a parameter:
It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.
I have the following situation:
Some process moves a file into a directory that I'm watching for IN_MOVED_TO inotify events.
I receive an IN_MOVED_TO event, open the file and read it until EOF is reached.
No other process modifies the file moved in step 1 (after it is moved, it is left unchanged the whole time).
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0? I mean, is a situation like 'reading a gigabyte file as 1 000 000 000 single-byte reads' forbidden by the documentation?
Is it guaranteed that if read returns fewer bytes than I requested, then the next call to read will return 0?
No, not in practice. It should be true if the file system is entirely POSIX compliant, but many of them are not (in corner cases). In particular NFS (see nfs(5)) and FUSE or proc (see proc(5)) are not exactly POSIX compliant.
So in practice I strongly recommend handling the "read returns a smaller number of bytes than wanted" case, even if you are right to believe that it should not happen. Handling that "impossible" case should be easy for you.
Notice also that inotify(7) facilities don't work with bizarre filesystems like NFS, proc, FUSE, etc. Think also of corner cases like a symlink inside an Ext4 file system pointing to an NFS file, or bind mounts, and so on.
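As a minimal sketch of that recommended handling, assuming the only EOF signal we trust is read() returning 0 (drain_fd is a made-up name, and the buffer size is arbitrary):

#include <errno.h>
#include <unistd.h>

/* Returns the total number of bytes drained, or -1 on a real error. */
long drain_fd(int fd)
{
    char buf[65536];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {              /* short read: just keep looping */
            total += n;
            continue;
        }
        if (n == 0)
            return total;         /* only 0 means genuine end of file */
        if (errno == EINTR)
            continue;             /* interrupted by a signal: retry */
        return -1;                /* real error */
    }
}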

Difference between stream and direct I/O in C?

In C, I believe (correct me if I'm wrong) there are two different types of input/output functions, direct and stream, which result in binary and ASCII files respectively.
What is the difference between stream (ASCII) and direct (Binary) I/O in terms of retrieving (read/write) and printing data?
No, yes, sort of, maybe…
In C, … there are two different types of input/output functions, direct and stream, which result in binary and ASCII files respectively.
In Standard C, there are only file streams, FILE *. In POSIX C, there are what might be termed 'direct' file access functions, mainly using file descriptors instead of file streams. AFAIK, Windows also provides alternative I/O functions, mainly using handles instead of file streams. So "No" — Standard C has one type of I/O function; but POSIX (and Windows) provide alternatives.
In Standard C, you can create binary files and text files using:
FILE *bfp = fopen("binary-file.bin", "wb");
FILE *tfp = fopen("regular-file.txt", "w");
On Windows (and maybe other systems for Windows compatibility), you can be explicit about opening a text file:
FILE *tcp = fopen("regular-file.txt", "wt");
So the standard distinguishes between text and binary files, but file streams can be used to access either type of file. Further, on Unix systems, there is no difference between a text file and a binary file; they will be treated the same. On Windows, a text file will have its CRLF (carriage return, line feed) line endings mapped to newline on input, and newlines mapped to CRLF line endings on output. That translation does not occur with binary files.
Note that there is also a concept 'direct I/O' on Linux, activated using the O_DIRECT flag, which is probably not what you're thinking of. It is a refinement of file descriptor I/O.
What is the difference between stream (ASCII) and direct (Binary) I/O in terms of retrieving (read/write) and printing data?
There are multiple issues.
First, the dichotomy between text files and binary files is separate from the dichotomy between stream I/O and direct I/O.
With stream I/O, line endings are mapped from the native form (e.g. CRLF) to newline when processing text files, whereas no such mapping occurs when processing binary files.
With text I/O, it is assumed that there will be no null bytes, '\0' in the data. Such bytes in the middle of a line mess up text processing code that expects to read up to a null. With binary I/O, all 256 byte values are expected; code that breaks because of a null byte is broken.
Complicating this is the distinction between different code sets for encoding text files. If you have a single-byte code set, such as ISO 8859-15, then null bytes don't generally appear. If you have a multi-byte code set such as UTF-8, again, null bytes don't generally appear. However, if you have a wide character code set such as UTF-16 (whether big-endian or little-endian), then you will often get zero bytes in the body of the file — it is not intended to be read or written as a byte stream but rather as a stream of 16-bit units.
The major difference between stream I/O and direct I/O is that the stream library buffers data for both input and output, unless you override it with setvbuf(). That is, if you repeatedly read a single character in the user code (getchar() for example), the stream library first reads a chunk of data from the file and then doles out one character at a time from the chunk, only going back to the file for more data when the previous chunk has been delivered completely. By contrast, direct I/O reading a single byte at a time will make a system call for each byte. Granted, the kernel will buffer the I/O (it does that for the stream I/O too — so there are multiple layers of buffering here, which is part of what O_DIRECT I/O attempts to avoid whenever possible), but the overhead of a system call per byte is rather substantial.
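To make that concrete, here is a sketch of two byte-counting loops (the file path is whatever the caller passes; nothing here comes from the question). With default buffering, the stdio version triggers roughly one read(2) per buffer-full, while the descriptor version makes one system call per byte:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

long count_stdio(const char *path)      /* buffered: few read(2) calls */
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    long n = 0;
    while (getc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}

long count_syscall(const char *path)    /* unbuffered: one read(2) per byte */
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    long n = 0;
    char c;
    while (read(fd, &c, 1) == 1)
        n++;
    close(fd);
    return n;
}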
Generally, you have more fine-grained control over access with file descriptors; there are operations you can do with file descriptors that are simply not feasible with streams because the stream interface functions simply don't cover the possibility. For example, setting FD_CLOEXEC or O_CLOEXEC on a file descriptor means that the file descriptor will be closed automatically by the system when the program executes another one — the stream library simply doesn't cover the concept, let alone provide control over it. The cost of gaining the fine-grained control is that you have to write more code — or, at least, different code that does what is handled for you by the stream library functions.
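As a sketch of that kind of fine-grained control (the path and function name below are purely illustrative), a descriptor can be marked close-on-exec either at open() time or afterwards with fcntl():

#include <fcntl.h>
#include <unistd.h>

int open_private_log(void)
{
    /* Ask for close-on-exec up front... */
    int fd = open("/tmp/private.log", O_WRONLY | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0)
        return -1;

    /* ...or set (and inspect) the flag later on an existing descriptor. */
    int flags = fcntl(fd, F_GETFD);
    if (flags >= 0)
        fcntl(fd, F_SETFD, flags | FD_CLOEXEC);

    return fd;   /* closed automatically by the system across execve() */
}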
Streams are a portable way of reading and writing data. They provide a flexible and efficient means of I/O. A stream is a file or a physical device (like a monitor) which is manipulated through a pointer to the stream.
Stream I/O is BUFFERED: that is to say, a fixed chunk is read from or written to the file via a temporary storage area (the buffer). Data written to the buffer does not appear in the file (or device) until the buffer is flushed or written out (a '\n' does this for line-buffered streams).
In direct or low-level I/O:
This form of I/O is UNBUFFERED: each read/write request results in accessing the disk (or device) directly to fetch/put a specific number of bytes.
There are no formatting facilities; we are dealing with bytes of information.
This means we are now dealing with binary (and not text) files.

Finding out the number of chars read/write reads

I'm fairly new to C so bear with me.
How do I go about finding out the number of chars that read/write actually read or wrote?
Can I be more specific and designate the # of chars read/write transfers via an argument? If so, how?
From man(2) read:
If successful, the number of bytes actually read is returned
From man(2) write:
Upon successful completion the number of bytes which were written is returned
Now concerning:
Can I be more specific and designate the # of chars read/write transfers via an argument? If so, how?
AFAIK no, but there might be some device/kernel-specific ways, using for example ioctl(2).
C and C++ have different IO libraries. I guess you are coding in C.
fprintf(3) returns (when successful) the number of printed characters.
scanf(3) returns the number of successfully read items, but also accepts the %n specifier:
n      Nothing is expected; instead, the number of characters consumed thus far from the input is stored through the next pointer, which must be a pointer to int.
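For example, a minimal sketch using %n with sscanf (the input string and field names are made up for illustration):

#include <stdio.h>

int main(void)
{
    int value = 0, consumed = 0;
    const char *line = "width=42; height=17";

    /* %n stores how many characters have been consumed so far. */
    if (sscanf(line, "width=%d;%n", &value, &consumed) == 1)
        printf("parsed %d, consumed %d characters\n", value, consumed);
    /* consumed is 9 here: the length of "width=42;" */
    return 0;
}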
You could also do IO line by line... (getline, snprintf, sscanf, fputs ....)
For Linux and POSIX:
If you call directly the read(2) or write(2) functions (i.e. syscalls) they return the number of input or output bytes on success.
And you could use the lseek(2) syscall, or the ftell(3) <stdio.h> function, to query the current file offset (which has no meaning so would fail on non-seekable files like pipes, sockets, FIFOs, ...).
See also FIONREAD
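A short sketch of what that looks like, assuming a Linux system where FIONREAD is supported on the descriptor in question (treat the result as a hint, since support varies by file type):

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int pending = 0;
    /* Ask the kernel how many bytes are currently readable on stdin. */
    if (ioctl(STDIN_FILENO, FIONREAD, &pending) == 0)
        printf("%d byte(s) waiting on stdin\n", pending);
    else
        perror("FIONREAD not supported here");
    return 0;
}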

In C, what's the size of stdout buffer?

Today I learned that stdout is line buffered when it's connected to a terminal and fully buffered otherwise. So, in the normal situation, if I use printf() without a terminating '\n', the text will be printed to the screen only when the buffer is full. How do I get the size of this buffer? How big is it?
The actual size is defined by the individual implementation; the standard doesn't mandate a minimum size (based on what I've been able to find, anyway). Don't have a clue on how you'd determine the size of the buffer.
Edit
Chapter and verse:
7.19.3 Files
...
3 When a stream is unbuffered, characters are intended to appear from the source or at the
destination as soon as possible. Otherwise characters may be accumulated and
transmitted to or from the host environment as a block. When a stream is fully buffered,
characters are intended to be transmitted to or from the host environment as a block when
a buffer is filled. When a stream is line buffered, characters are intended to be
transmitted to or from the host environment as a block when a new-line character is
encountered. Furthermore, characters are intended to be transmitted as a block to the host
environment when a buffer is filled, when input is requested on an unbuffered stream, or
when input is requested on a line buffered stream that requires the transmission of
characters from the host environment. Support for these characteristics is
implementation-defined, and may be affected via the setbuf and setvbuf functions.
Emphasis added.
"Implementation-defined" is not a euphemism for "I don't know", it's simply a statement that the language standard explicitly leaves it up to the implementation to define the behavior.
And having said that, there is a non-programmatic way to find out; consult the documentation for your compiler. "Implementation-defined" also means that the implementation must document the behavior:
3.4.1
1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
2 EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit
when a signed integer is shifted right.
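One small programmatic hint, though not a way to query the live stdout buffer: BUFSIZ from <stdio.h> is the size that setbuf() uses, and on glibc it matches the default used for ordinary file streams. Treat the printed value as the implementation's documented default, not a guarantee about this particular stream:

#include <stdio.h>

int main(void)
{
    /* BUFSIZ is the buffer size setbuf() would install; 8192 on glibc. */
    printf("BUFSIZ = %d\n", (int)BUFSIZ);
    return 0;
}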
On Linux, a newly created pipe uses a default size of 64K.
The maximum pipe size lives in /proc/sys/fs/pipe-max-size; 1048576 is the typical default.
One might therefore expect glibc's default file buffer to be a similarly generous 65536 bytes.
However, as ascertained by grep from the glibc source tree:
libio/libio.h:#define _IO_BUFSIZ _G_BUFSIZ
sysdeps/generic/_G_config.h:#define _G_BUFSIZ 8192
sysdeps/unix/sysv/linux/_G_config.h:#define _G_BUFSIZ 8192
That may or may not answer the original question, but for a minute's effort the best guess is 8 kilobytes.
For mere line buffering, 8K is adequate. For anything beyond line-buffered output, however, 8K is not efficient compared with 64K.
Because the default pipe size is 64K, and unless a larger pipe size is expected or explicitly set, 64K is the recommended size for a stdio buffer. If performance is required, meager 8K buffers do not suffice.
A pipe's size can be increased with fcntl(pipefd, F_SETPIPE_SZ, 1048576), and the stdio-provided file buffer can be replaced with setvbuf(stdout, buffer, _IOFBF, 1048576).
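Putting those two calls together, a minimal sketch (assuming Linux/glibc, since F_SETPIPE_SZ is Linux-specific and needs _GNU_SOURCE; the 64K sizes are just the values recommended above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static char big_buf[65536];

int main(void)
{
    /* Install a 64K stdio buffer before any output is produced. */
    setvbuf(stdout, big_buf, _IOFBF, sizeof big_buf);
    /* Grow the pipe to match; fails harmlessly if stdout is not a pipe. */
    fcntl(STDOUT_FILENO, F_SETPIPE_SZ, (int)sizeof big_buf);

    fputs("hello\n", stdout);   /* ... produce output ... */
    return 0;
}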
If a pipe is not used, pipe size is irrelevant. But if data is piped between two processes, increasing the pipe size can be a performance boon; otherwise the smallest buffer or the smallest pipe becomes the bottleneck.
When reading as well, a larger buffer means stdio might need fewer read function invocations. The word "might" hides an important consideration: a single read invocation can return at most as much data as a single write invocation provided, a read invocation can be expected to return fewer bytes than requested, and an additional read invocation may yield additional bytes.
For writing a single line of data, stdio is overkill. However, stdio makes line-buffered output possible, and in some scenarios line-buffered output is essential. If writing to a file provided by the proc or sys virtual file systems, the line feed byte should be included in a single write buffer; issuing it in a second write could have an unexpected outcome.
If read, write and stdio are mixed, caveats exist. Before a write function invocation, an fflush invocation is required (stderr is not buffered, so for stderr the fflush is not required). read might also deliver fewer bytes than expected, because stdio might already have buffered the preceding bytes.
Not mixing unistd and stdio I/O is good advice, but it is often ignored. Mixing buffered input is unreasonable; mixing unbuffered input is possible; mixing buffered output is plausible.
stdio provides the convenience of buffered IO. Buffered IO is possible without stdio, but it takes additional code. When a sufficiently sized buffer is used, a write invocation is not necessarily slower than the stdio-provided output functions.
However, when a pipe is not involved, the mmap function can provide superior IO. On a pipe, mmap does not return an error, but the data does not appear in the address space; lseek on a pipe does return an error.
Lastly, man 3 setvbuf provides a good example. If the buffer is allocated on the stack, an fclose invocation must not be omitted before returning.
The actual question was "In C, what's the size of stdout buffer?" That much might be answered by 8192. But those who encounter this inquiry are probably also curious about buffered input/output efficiency, a goal some inquiries only approach implicitly. A preference for terse replies leaves the significance of pipe size, the significance of buffer size, and mmap unexplicated. This reply explicates.
Here are some pretty interesting answers on a similar question.
On a Linux system you can view buffer sizes from different functions, including ulimit.
Also, the header files limits.h and pipe.h should contain that kind of info.
You could set it to unbuffered, or just flush it.
This seems to have some decent info on when the C runtime typically flushes it for you, plus some examples. Take a look at this.

character reading in C

I am struggling to know the difference between these functions. Which one of them can be used if I want to read one character at a time?
fread()
read()
getc()
Depending on how you want to do it, you can use any of those functions.
The easiest to use would probably be fgetc().
fread(): read a block of data from a stream (documentation)
read(): the POSIX counterpart of fread() (documentation)
getc(): get a character from a stream (documentation). Please consider using fgetc() (doc) instead, since it's somewhat safer.
fread() is a standard C function for reading blocks of binary data from a file.
read() is a POSIX function for doing the same.
getc() is a standard C function (a macro, actually) for reading a single character from a file - i.e., it's what you are looking for.
In addition to the other answers, also note that read is an unbuffered way to read from a file, while fread provides an internal buffer so reading is buffered; the buffer size can be adjusted by you (with setvbuf). Each time you call read, a system call occurs which reads the number of bytes you told it to, whereas fread will read a chunk into its internal buffer and return only the bytes you need. On each call, fread first checks whether it can satisfy you from the buffer; if not, it makes a system call (read), gets another chunk of data, and returns only the portion you wanted.
Also, read directly handles the file descriptor number, whereas fread needs the file to be opened as a FILE pointer.
The answer depends on what you mean by "one character at a time".
If you want to ensure that only one character is consumed from the underlying file descriptor (which may refer to a non-seekable object like a pipe, socket, or terminal device) then the only solution is to use read with a length of 1. If you use strace (or similar) to monitor a shell script using the shell command read, you'll see that it repeatedly calls read with a length of 1. Otherwise it would risk reading too many bytes (past the newline it's looking for) and having subsequent processes fail to see the data on the "next line".
On the other hand, if the only program that should be performing further reads is your program itself, fread or getc will work just fine. Note that getc should be a lot faster than fread if you're just reading a single byte.
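As a sketch of the "consume exactly one line" technique described above, reading one byte per read(2) call so that nothing past the newline is pulled out of the pipe or terminal (read_one_line is a made-up name):

#include <stddef.h>
#include <unistd.h>

/* Reads up to bufsz-1 bytes into buf, stopping after the first newline.
 * Returns the number of bytes stored (0 on immediate EOF), -1 on error. */
ssize_t read_one_line(int fd, char *buf, size_t bufsz)
{
    size_t used = 0;
    while (used + 1 < bufsz) {
        char c;
        ssize_t n = read(fd, &c, 1);   /* one byte per system call */
        if (n < 0)
            return -1;
        if (n == 0)
            break;                     /* EOF */
        buf[used++] = c;
        if (c == '\n')
            break;                     /* stop exactly at end of line */
    }
    buf[used] = '\0';
    return (ssize_t)used;
}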
