carriage return by fgets - c

I am running the following code:
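#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <string.h>
int main(void) {
    FILE *fp;
    char str[50];
    if ((fp = fopen("test.txt", "r")) == NULL) {
        printf("File can't be read\n");
        exit(1);
    }
    /* fgets returns NULL on error or end-of-file */
    if (fgets(str, sizeof str, fp) != NULL)
        printf("%s", str);
    fclose(fp);
    return 0;
}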
test.txt contains: I am a boy\r\n
Since I am on Windows, it takes \r\n as a new line character and so if I read this from a file it should store "I am a boy\n\0" in str, but I get "I am a boy\r\n". I am using mingw compiler.

The behavior depends on the c library implementation and which mode you pass to fopen. See this quote from the MSDN documentation on fopen (fopen on MSDN):
b - Open in binary (untranslated) mode; translations involving carriage-return and linefeed characters are suppressed.
This means that if you use the Microsoft C library and open your file without the 'b', the carriage return characters will be removed from the stream.
Since you're using mingw, your compiler probably links against the GNU c library which follows the POSIX standard. This is what the GNU documentation says about fopen (fopen on gnu.org):
The character ‘b’ in opentype has a standard meaning; it requests a binary stream rather than a text stream. But this makes no difference in POSIX systems (including GNU systems).
Concluding: you're omitting the 'b' mode char, which opens your stream in text mode. You're on Windows but use a GNU c library which makes no difference between text and binary mode. This is why fgets reads both carriage return and new line.
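Whichever library is in play, a program can make itself independent of the translation by trimming the line ending itself after fgets. A minimal sketch, assuming the line fits in the buffer; chomp_line is a hypothetical helper name, not a standard function:

```c
#include <string.h>

/* Hypothetical helper: cut the string at the first '\r' or '\n',
 * so both "\r\n" and "\n" endings left by fgets are removed. */
static char *chomp_line(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
    return s;
}
```

Calling chomp_line(str) right after fgets(str, 50, fp) should leave "I am a boy" in str whether or not the library removed the carriage return.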

Since I am on Windows, it takes \r\n as a new line character...
This assumption is wrong. The C standard treats carriage return and new line as two different things, as evidenced in C99 §5.2.1/3 (Character sets):
[...] In the basic execution character set, there shall be
control characters representing alert, backspace, carriage return, and new-line. [...]
The fgets function description is as follows, in C99 §7.19.7.2/2:
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
Therefore, when encountering the string I am a boy\r\n, a conforming implementation should read up to the \n character. There is no possibly sane reason why the implementation should discard \r based on the platform.
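The quoted rules can be sketched as a loop over fgetc. This is only an illustration of the wording in the standard, applied to whatever characters the stream delivers (after any text-mode translation the library performs); sketch_fgets is a made-up name, it ignores read errors, and real code should simply call fgets:

```c
#include <stdio.h>

/* Illustrative re-implementation of the quoted fgets rules. */
static char *sketch_fgets(char *s, int n, FILE *stream)
{
    int i = 0;
    int c = EOF;

    if (n <= 0)
        return NULL;
    while (i < n - 1 && (c = fgetc(stream)) != EOF) {
        s[i++] = (char)c;      /* every character is stored, '\r' included */
        if (c == '\n')
            break;             /* the new-line is retained, then reading stops */
    }
    if (i == 0 && n > 1)
        return NULL;           /* nothing read: end-of-file */
    s[i] = '\0';               /* null character after the last character read */
    return s;
}
```

Nothing in the loop treats '\r' specially, so if the stream delivers "I am a boy\r\n", that is exactly what ends up in the array.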

The C standard says this, among other things, about text streams:
Characters may have to be added, altered, or deleted on input and output to
conform to differing conventions for representing text in the host
environment. Thus, there need not be a one-to-one correspondence between the
characters in a stream and those in the external representation. Data read in
from a text stream will necessarily compare equal to the data that were
earlier written out to that stream only if: the data consist only of printing
characters and the control characters horizontal tab and new-line; no new-line
character is immediately preceded by space characters; and the last character
is a new-line character.
In other words, if a file is opened in text mode, an implementation is free to add, remove and modify control characters if it wants or needs to when going to and from disk. That is apparently what the Microsoft implementation does with the carriage return, while the GNU implementation does not.

Related

Why is the text file created with this code having a charset == binary?

In the code below I create two files, one in text mode and the other in binary mode. The icons of the two files look the same, and the characteristics of both files are exactly the same, including the size, the charset (binary) and the MIME type (application/octet-stream). Why isn't there a text file? If I create a text file explicitly, the charset is ASCII.
Compiler version - gcc (Ubuntu 8.3.0-6ubuntu1) 8.3.0.
Operating system - Tried on both Ubuntu 18.10 and 19.04.
No messages displayed by compiler.
Command used to examine the files: file --mime
Output of the command for file Text1.txt:
Text1.txt: application/octet-stream; charset=binary
Output of the command for file Binary:
Binary: application/octet-stream; charset=binary
Output of the command od -xa FILENAME is the same for both files:
0000000 0021
!
0000001
#include<stdio.h>
void main(){
FILE *fp;
FILE *fp2;
int a = 10111110;
fp2 = fopen("Text1.txt","w");
fputc('!',fp2);
fp = fopen("Binary","wb");
fputc('!',fp);
}
The expected output is one file with charset ASCII and one with charset binary; the actual output is both files with charset binary.
The file command diagnoses the files as binary and not ASCII because you are writing non-ASCII characters to the files due to incorrect use of fputc.
fputc("!",fp2); is incorrect. The first argument to fputc should be an int with a character value. "!" is a string literal, which is an array, which is automatically converted to a pointer to its first character.
GCC warns you about this, saying “warning: passing argument 1 of 'fputc' makes integer from pointer without a cast [-Wint-conversion]”. You apparently ignored the warning. Do not do that. When the compiler warns you about something, pay attention, diagnose the problem, and fix it.
The result is that the pointer is converted to an int, and this int is passed to fputc. That may result in some non-ASCII character being written to the file, which in turn causes the file command to diagnose the file as binary.
To fix this, change the string "!" to a single character '!', so that you pass a single character to fputc, with fputc('!',fp2);.
Additionally, main should not be declared with void main(). Declare it with int main(void) or int main(int argc, char *argv[]) or another implementation-defined manner.
On Unix systems, the resulting files with the corrected code will be identical. Core Unix does not distinguish between text and binary files, except that some applications may use metadata (such as “extended attributes”) to characterize files in various ways. The files resulting from the incorrect code may or may not be identical, because identical string literals in different places may or may not have the same address, so the resulting pointer may or may not have the same value.
C provides a distinction in principle between binary and text streams. Data traversing a text stream may be subject to implementation-dependent conversions:
Characters may have to be added, altered, or deleted on input and
output to conform to differing conventions for representing text in
the host environment. Thus, there need not be a one-to-one
correspondence between the characters in a stream and those in the
external representation. Data read in from a text stream will
necessarily compare equal to the data that were earlier written out to
that stream only if: the data consist only of printing characters and
the control characters horizontal tab and new-line; no new-line
character is immediately preceded by space characters; and the last
character is a new-line character. Whether space characters that are
written out immediately before a new-line character appear when read
in is implementation-defined.
(C2011, 7.21.2/2)
In practice, however, the only conversion you will see for byte-oriented streams on any system you're likely to meet is line terminator conversions on systems (primarily Windows) that use carriage return / newline pairs for line terminators in text files. C text mode streams will convert between that external representation and C's newline-only internal representation.
On Linux and modern BSD-based macOS, however, there isn't even that -- these operating systems make no distinction in practice between text and binary files, and it is not at all surprising that your two mechanisms for producing a file yield identical files.
It is an entirely separate question how an external program that attempts to guess at file types might interpret any given file, especially a very short one. Your chances are better for a file to be detected as text if it contains genuine text in the form of words and sentences.

why when we write \n in the file it converts into \r\n combination?

I read in a book that when we attempt to write \n to a file using fputs(), fputs() converts the \n to the \r\n combination, and that when we read the same line back using fgets() the reverse conversion happens: \r\n is converted back to \n.
What is the purpose behind this?
It is because Windows (and MS-DOS) text files are supposed to have lines ending in \r\n, and portable C programs are supposed to simply use \n because C was originally defined on Unix.
And it's not just fputs and fgets that do it - any I/O function on a text file, even getc and fread, will do the same conversion.
Succinctly, DOS is the reason for this.
Different systems have different conventions for line endings. Unix reckons one character, '\n', is sufficient to mark the end of a line. DOS decided that it needed two characters, '\r' and '\n', though other systems also used that convention. The versions of Mac OS 1-9 (prior to Mac OS X) used just '\r' instead. Other systems could use a count and the line data instead of a line ending, or could simulate punched cards with blanks up to a fixed length (72 or 80). Unix also doesn't distinguish between binary and text files; DOS does. (DOS also uses Control-Z to mark EOF in a text file. Unix doesn't have an EOF marker; it knows exactly how big the file is and uses that length to determine when it has reached EOF.)
C originated on Unix, but to make it easier to migrate code between the systems, the standard I/O package defined that when it was working on text files, the input side would convert a native line ending to the single '\n' character for uniform input, and the output side would convert a '\n' to the native line ending.
However, the mention of text files also meant that there needed to be binary files, where these mappings do not occur.
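The two mappings can be sketched on an in-memory buffer. This is only an illustration of the idea for a CR LF system, not how any particular C library implements it; crlf_to_lf and lf_to_crlf are hypothetical names:

```c
#include <stddef.h>

/* Input side: collapse each "\r\n" pair to '\n' in place.
 * Returns the new length. A lone '\r' is left untouched. */
static size_t crlf_to_lf(char *buf, size_t len)
{
    size_t r, w = 0;
    for (r = 0; r < len; r++) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue;            /* drop the '\r' of a CR LF pair */
        buf[w++] = buf[r];
    }
    return w;
}

/* Output side: expand each '\n' to "\r\n".
 * out must have room for up to 2 * len bytes. Returns bytes written. */
static size_t lf_to_crlf(const char *in, size_t len, char *out)
{
    size_t r, w = 0;
    for (r = 0; r < len; r++) {
        if (in[r] == '\n')
            out[w++] = '\r';
        out[w++] = in[r];
    }
    return w;
}
```

A binary stream corresponds to skipping both functions and passing the bytes through unchanged.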
You might note that most of the internet protocols (HTTP, for example) mandate CRLF (carriage return, line feed, or '\r', '\n') for the end of line markers.
(Actually, blaming DOS, as in MS-DOS or PC-DOS, is a little unfair. There were other systems that used the CRLF line end convention before DOS existed, and they may have been more influential on the Internet. However, almost all those ancestral systems are substantially defunct, and Windows is the environment that you'll run into these days where the distinction between binary and text files matters, and where you'll encounter CRLF line endings.)
Note that the C standard has this to say about text files:
ISO/IEC 9899:2011 §7.21.2 Files
¶2 A text stream is an ordered sequence of characters composed into lines, each line
consisting of zero or more characters plus a terminating new-line character. Whether the
last line requires a terminating new-line character is implementation-defined. Characters
may have to be added, altered, or deleted on input and output to conform to differing
conventions for representing text in the host environment. Thus, there need not be a one-to-one correspondence between the characters in a stream and those in the external
representation. Data read in from a text stream will necessarily compare equal to the data
that were earlier written out to that stream only if: the data consist only of printing
characters and the control characters horizontal tab and new-line; no new-line character is
immediately preceded by space characters; and the last character is a new-line character.
Whether space characters that are written out immediately before a new-line character
appear when read in is implementation-defined.
That's a lot of things that might or might not happen. Note, in particular, that trailing blanks written to a file might, or might not, appear in the input — according to the standard. That allows the systems that support punched card images or fixed length records to comply with the standard.
Note, too (as pointed out by Giacomo Degli Eposti), that this all means that if you open a file in binary mode that was originally written as a text file, you may very well get a significantly different list of bytes back from the I/O system. You'll see two characters per newline; you might see a Control-Z followed by other characters (possibly null bytes) up to a 'block' boundary that might be a multiple of 256 bytes, etc.

Does wide character input/output in C always read from / write to the correct (system default) encoding?

I'm primarily interested in the Unix-like systems (e.g., portable POSIX) as it seems like Windows does strange things for wide characters.
Do the read and write wide character functions (like getwchar() and putwchar()) always "do the right thing", for example read from utf-8 and write to utf-8 when that is the set locale, or do I have to manually call wcrtomb() and print the string using e.g. fputs()? On my system (openSUSE 12.3) where $LANG is set to en_GB.UTF-8 they do seem to do the right thing (inspecting the output I see what looks like UTF-8 even though strings were stored using wchar_t and written using the wide character functions).
However I am unsure if this is guaranteed. For example cprogramming.com states that:
[wide characters] should not be used for output, since spurious zero
bytes and other low-ASCII characters with common meanings (such as '/'
and '\n') will likely be sprinkled throughout the data.
Which seems to indicate that outputting wide characters (presumably using the wide character output functions) can wreak havoc.
Since the C standard does not seem to mention coding at all I really have no idea who/when/how coding is applied when using wchar_t. So my question is basically if reading, writing and using wide characters exclusively is a proper thing to do when my application has no need to know about the encoding used. I only need string lengths and console widths (wcswidth()), so to me using wchar_t everywhere when dealing with text seems ideal.
The relevant text governing the behavior of the wide character stdio functions and their relationship to locale is from POSIX XSH 2.5.2 Stream Orientation and Encoding Rules:
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_05_02
Basically, the wide character stdio functions always write in the encoding that's in effect (per the LC_CTYPE locale category) at the time the FILE stream becomes wide-oriented; this means the first time a wide stdio function is called on it, or fwide is used to set the orientation to wide. So as long as a proper LC_CTYPE locale is in effect matching the desired "system" encoding (e.g. UTF-8) when you start working with the stream, everything should be fine.
However, one important consideration you should not overlook is that you must not mix byte and wide oriented operations on the same FILE stream. Failure to observe this rule is not a reportable error; it simply results in undefined behavior. As a good deal of library code assumes stderr is byte oriented (and some even makes the same assumption about stdout), I would strongly discourage ever using wide-oriented functions on the standard streams. If you do, you need to be very careful about which library functions you use.
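To see which orientation a stream currently has, fwide can be called with a mode of 0, which queries without changing anything. A small sketch; orientation_name is a hypothetical helper:

```c
#include <stdio.h>
#include <wchar.h>

/* Report a stream's current orientation without changing it:
 * fwide(fp, 0) only queries. */
static const char *orientation_name(FILE *fp)
{
    int o = fwide(fp, 0);
    if (o > 0)
        return "wide";
    if (o < 0)
        return "byte";
    return "unset";
}
```

Checking orientation_name(stdout) before and after the first output call is one way to find out whether some library has already committed the stream to byte orientation.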
Really, I can't think of any reason at all to use wide-oriented functions. fprintf is perfectly capable of sending wide-character strings to byte-oriented FILE streams using the %ls specifier.
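A minimal sketch of the %ls approach, using snprintf so the result can be inspected; the same conversion specifier works with fprintf on a byte-oriented stream. format_wide is a hypothetical name, and the encoding produced depends on the LC_CTYPE locale in effect:

```c
#include <stdio.h>
#include <wchar.h>

/* Convert a wide string to the current locale's multibyte encoding
 * with the byte-oriented snprintf and %ls. Returns snprintf's result:
 * the byte count, or a negative value if a character cannot be encoded. */
static int format_wide(char *out, size_t size, const wchar_t *ws)
{
    return snprintf(out, size, "%ls", ws);
}
```

With a UTF-8 locale selected through setlocale(LC_CTYPE, ""), non-ASCII wide characters should come out as UTF-8 byte sequences; in the default "C" locale only characters that locale can encode are converted.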
So long as the locale is set correctly, there shouldn't be any issues processing UTF-8 files on a system using UTF-8, using the wide character functions. They'll be able to interpret things correctly, i.e. they'll treat a character as 1-4 bytes as necessary (in both input and output). You can test it out by something like this:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
int main()
{
setlocale(LC_CTYPE, "en_GB.UTF-8");
// setlocale(LC_CTYPE, ""); // to use environment variable instead
wchar_t *txt = L"£Δᗩ";
wprintf(L"The string %ls has %zu characters\n", txt, wcslen(txt));
}
$ gcc -o loc loc.c && ./loc
The string £Δᗩ has 3 characters
If you use the standard functions (in particular character functions) on multibyte strings carelessly, things will start to break, e.g. the equivalent:
char *txt = "£Δᗩ";
printf("The string %s has %zu characters\n", txt, strlen(txt));
$ gcc -o nloc nloc.c && ./nloc
The string £Δᗩ has 7 characters
The string still prints correctly here because it's essentially just a stream of bytes, and as the system is expecting UTF-8 sequences, they're translated perfectly. Of course strlen is reporting the number of bytes in the string, 7 (plus the \0), with no understanding that a character and a byte aren't equivalent.
In this respect, because of the compatibility between ASCII and UTF-8, you can often get away with treating UTF-8 files as simply multibyte C strings, as long as you're careful.
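If only the number of characters is needed, the bytes can be counted directly without converting to wchar_t. A sketch that assumes the input is valid UTF-8; utf8_count is a hypothetical helper:

```c
#include <stddef.h>

/* Count code points in a UTF-8 string by skipping continuation
 * bytes (those of the form 10xxxxxx). Assumes valid UTF-8 input. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the seven bytes of "£Δᗩ" this counts 3, where strlen counts 7. It counts code points, not display columns, so it is not a replacement for wcswidth.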
There's a degree of flexibility as well. It's possible to convert a standard C string (as a multibyte string) to a wide character string easily:
char *stdtxt = "ASCII and UTF-8 €£¢";
wchar_t buf[100];
mbstowcs(buf, stdtxt, 20);
wprintf(L"%ls has %zu wide characters\n", buf, wcslen(buf));
Output:
ASCII and UTF-8 €£¢ has 19 wide characters
Once you've used a wide character function on a stream, it's set to wide orientation. If you later want to use standard byte i/o functions, you'll need to re-open the stream first. This is probably why the recommendation is not to use it on stdout. However, if you only use wide character functions on stdin and stdout (including any code that you link to), you will not have any problems.
Don't use fputs with anything other than ASCII.
If you want to write, say, UTF-8, use a function that returns the actual size in bytes of the UTF-8 string, and use fwrite to write that number of bytes, without worrying about a stray '\0' inside the string.
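The difference this answer relies on is that fputs stops at the first '\0', while fwrite is given an explicit length. A minimal sketch of the fwrite variant; write_bytes is a hypothetical name and the caller is assumed to know the byte length:

```c
#include <stdio.h>

/* Write exactly len bytes, embedded '\0' bytes included.
 * Returns 0 on success, -1 if fewer bytes were written. */
static int write_bytes(FILE *fp, const char *data, size_t len)
{
    return fwrite(data, 1, len, fp) == len ? 0 : -1;
}
```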

Is \n multi-character in C?

I read that \n consists of CR & LF. Each has their own ASCII codes.
So is the \n in C represented by a single character or is it multi-character?
Edit: Kindly specify your answer, rather than simply saying "yes, it is" or "no, it isn't"
In a C program, it's a single character, '\n', representing end of line. However, some operating systems (most notably Microsoft Windows) use two characters to represent end of line in text files, and this is likely where the confusion comes from.
It's the responsibility of the C I/O functions to do the conversions between the C representation of '\n' and whatever the OS uses.
In C programs, simply use '\n'. It is guaranteed to be correct. When looking at text files with some sort of editor, you might see two characters. When a text file is transferred from Windows to some Unix-based system, you might get "^M" showing up at the end of each line, which is annoying, but has nothing to do with C.
Generally: '\n' is a single character, which represents a newline. '\r' is a single character, which represents a carriage-return. They are their own independent ASCII characters.
Issues arise because in the actual file representation, UNIX-based systems tend to use '\n' alone to represent what you think of when you hit "enter" or "return" on the keyboard, whereas Windows uses a '\r' followed directly by a '\n'.
In a file:
"This is my UNIX file\nwhich spans two lines"
"This is my Windows file\r\nwhich spans two lines"
Of course, like all binary data, these characters are all about interpretation, and that interpretation depends on the application using the data. Stick to '\n' when you are making C-strings, unless you want a literal carriage-return, because as people have pointed out in the comments, the OS representation doesn't concern you. IO libraries, including C's, are supposed to handle this themselves and abstract it away from you.
For your curiosity, in decimal, '\n' in ASCII is 10, '\r' is 13, but note that this is the ASCII standard, not a C standard.
It depends:
'\n' is a single character (ASCII LF)
"\n" is a '\n' character followed by a 0 terminator
some I/O operations transform a '\n' into '\r\n' on some systems (CR-LF).
When you print the \n to a file using the Windows C stdio libraries, the library interprets that as a logical new-line, not the literal character 0x0A. The output to the file will be the Windows version of a new-line: 0x0D0A (\r\n).
Writing
Sample code:
#include <stdio.h>
int main(void) {
    FILE *f = fopen("foo.txt", "w");   /* text mode: '\n' is translated on Windows */
    if (f == NULL)
        return 1;
    fprintf(f, "foo\nbar");
    fclose(f);
    return 0;
}
A quick cl /EHsc foo.c later and you get
0x666F6F 0x0D0A 0x626172 (separated for convenience)
in foo.txt under a hex editor.
It's important to note that this translation DOES NOT occur if you are writing to a file in 'binary mode'.
Reading
If you are reading the file back in using the same tools, also on windows, the "windows EOL" will be interpreted properly if you try to match up against \n.
When reading it back
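#include <stdio.h>
int main(void) {
    FILE *f = fopen("foo.txt", "r");   /* text mode: "\r\n" is read back as '\n' on Windows */
    char c;
    if (f == NULL)
        return 1;
    while (fscanf(f, "%c", &c) == 1)
        printf("%x-", (unsigned char)c);   /* print each character as hex */
    fclose(f);
    return 0;
}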
You get
66-6f-6f-a-62-61-72-
Therefore, the only times this should be relevant to you are when you are:
Moving files back and forth between Mac/Unix and Windows. Unix needs no real explanation here, since \n directly translates to 0x0A on those platforms. (Pre-OS X, \n was 0x0D on Mac, iirc.)
Putting text in binary files (only do this carefully, please).
Trying to figure out why your binary data is being messed up when you opened the file "w" instead of "wb".
Estimating something important based on the size of the file; on Windows you'll have an extra byte per newline.
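For the last point, the size a string would take on disk after CR LF translation can be estimated by counting newlines. A sketch with a hypothetical helper name, assuming every '\n' becomes exactly two bytes:

```c
#include <stddef.h>

/* Size in bytes that a string would occupy in a file where each
 * '\n' is stored as "\r\n". */
static size_t crlf_size(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        n += (*s == '\n') ? 2 : 1;
    return n;
}
```

For "foo\nbar" this gives 8, which matches the eight bytes seen in the hex editor above, against 7 characters in the C string.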
\n is a new-line -- it's a logical representation of whatever separates one line from another in a text file.
A given platform will have some physical representation of that logical separation between lines. On Unix and most similar systems, the new-line is represented by a line-feed (LF) character (and since Unix was/is so closely associated with C, on Unix the LF is often just called a new-line). On MacOS, it's typically represented by a carriage-return (CR). On a fair number of other systems, most prominently Windows, it's represented by a carriage return/line feed pair -- normally in that order, though once in a while you see something use LF followed by CR (as I recall, Clarion used to do that).
In theory, a new-line doesn't need to correspond to any characters in the stream at all though. For example, a system could have text files that were stored as a length followed by the appropriate number of characters. In such a case, the run-time library would need to carry out a slightly more extensive translation between internal and external representations of text files than is now common, but such is life.
According to the C99 Standard (section 5.2.2),
\n "moves the active position [where the next character from fputc would appear] to the initial position on the next line".
Also
[\n] shall produce a unique implementation-defined value
which can be stored in a single char object. The external representations in a text file
need not be identical to the internal representations and are outside the scope of [the C99 Standard]
Most C implementations choose to define \n as ASCII line feed (0x0A) for historical reasons. However, on many computer operating systems, the sequence for moving the active position to the beginning of the next line requires two characters usually 0x0D, 0x0A. So, when writing to a text file, the C implementation must convert the internal sequence of 0x0A to the external one of 0x0D, 0x0A. How this is done is outside of the scope of the C standard, but usually, the file IO library will perform the conversion on any file opened in text mode.
Your question is about text files.
A text file is a sequence of lines.
A line is a sequence of characters ending in (and including) a line break.
A line break is represented differently by different operating systems.
On Unix/Linux/Mac they are usually represented by a single LINEFEED
On Windows they are usually represented by the pair CARRIAGE RETURN + LINEFEED
On old Macs they were usually represented by a single CARRIAGE RETURN
On other systems (AS/400 ??) there may even not be a specific character that represents a line break ...
Anyway, the library code in C is responsible for translating the system's line break to '\n' when reading text files and for doing the reverse when writing text files.
So, no matter what the representation is on any given system, when you read a text file in C, lines will be ended by a '\n'.
Note: The '\n' is not necessarily 0x0a in all systems.
Yes it is.
\n is a newline. Hex code is 0x0A.
\r is a carriage return. Hex code is 0x0D
It is a single character. It represents Newline (but is not the only representation - Wikipedia).
EDIT: The question was changed while I was typing the answer.

In C how to write whichever end of line character is appropriate to the OS?

Unix has \n, Mac was \r but is now \n and DOS/Win32 is \r\n. When creating a text file with C, how to ensure whichever end of line character(s) is appropriate to the OS gets used?
fprintf(your_file, "\n");
This will be converted to an appropriate EOL by the stdio library on your operating system provided that you opened the file in text mode. In binary mode no conversion takes place.
From Wikipedia:
When writing a file in text mode, '\n'
is transparently translated to the
native newline sequence used by the
system, which may be longer than one
character. (Note that a C
implementation is allowed not to store
newline characters in files. For
example, the lines of a text file
could be stored as rows of a SQL table
or as fixed-length records.) When
reading in text mode, the native
newline sequence is translated back to
'\n'. In binary mode, the second mode
of I/O supported by the C library, no
translation is performed, and the
internal representation of any escape
sequence is output directly.
When you open a file in text mode (pass "w" to fopen instead of "wb") any newline characters written to the file will automatically be converted to the appropriate newline sequence for the system. Newline sequences will be translated back to newline characters when you read the file.
This is why it's important to distinguish between text and binary mode; if you're writing in binary mode, C will not tamper with the bytes you write to a file.
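In practice this means the writing code only ever emits '\n' and leaves the choice of line ending to the mode the stream was opened in. A minimal sketch; write_lines is a hypothetical helper and the stream is assumed to be already open:

```c
#include <stdio.h>

/* Write each string followed by '\n'. On a text-mode stream the
 * library turns that '\n' into the platform's line ending; on a
 * binary-mode stream the byte is written unchanged.
 * Returns 0 on success, -1 on a write error. */
static int write_lines(FILE *fp, const char *const *lines, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (fputs(lines[i], fp) == EOF || fputc('\n', fp) == EOF)
            return -1;
    }
    return 0;
}
```

Passing a stream opened with "w" should give native line endings, while the same call on a stream opened with "wb" writes bare '\n' bytes on every platform.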

Resources