Does CMD do CRLF translation? - c

Since Windows uses CRLF as native line endings, one might expect that code like this
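#include <windows.h>
#include <string.h>
int main(void) {
    /* Named hOut rather than stdout, which <stdio.h> defines as a macro. */
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    const char *msg = "Line1\nLine2\nLine3\nLine4\nLine5\n";
    DWORD written = 0;
    /* Write the raw bytes; only LF is sent, never CR. */
    WriteFile(hOut, msg, (DWORD)strlen(msg), &written, NULL);
    return 0;
}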
would produce the following output:
Line1
     Line2
          Line3
               Line4
                    Line5
But it produces:
Line1
Line2
Line3
Line4
Line5
Are the LFs translated within cmd so that they also move the cursor to the first column? They do not appear to be translated to CRLF when output is redirected to a file (I originally tried WriteConsoleA, only to find that it does not support redirection).
Is it OK, then, to use only LF and not CRLF in cross-platform programs? Is this behaviour guaranteed on all versions of Windows, or do some of them produce the "staircase" pattern described above?

Your expectations are exactly on target. And you are correct - that isn't what you see.
And @IInspectable is correct that you can't open devices via the Windows CreateFile API in text mode.
And yet, you see what you see.
If you check the documentation for console modes, you will find that Windows has some console mode settings that aren't apparent when you are reading the API function descriptions. In particular, there is an ENABLE_PROCESSED_OUTPUT mode that is on by default. When it is on, it enables special processing to handle '\n' line endings, among other things.
Note that this is for console I/O only. For actual file I/O, Windows API calls do no translation. On the other hand, the C library function fopen allows a file to be opened in text or binary mode. Binary mode does no translation, but text mode enables special handling for \n on Windows systems.
I understand that this is also handled similarly for consoles on POSIX systems where "\n" and not "\r\n" is the official standard line ending.

The console standard output buffer has ENABLE_PROCESSED_OUTPUT|ENABLE_WRAP_AT_EOL_OUTPUT by default.
\r alone just returns to the start of the line allowing you to overwrite yourself. \n is treated as \r\n for unknown reasons (Compatibility? Source portability?). NT4 does it, XP does it, 8 does it and 10 does it.
If you turn off ENABLE_PROCESSED_OUTPUT with SetConsoleMode then neither \r nor \n are special and WriteFile will print them as symbols to the console instead.
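To see the effect described above for yourself, the mode can be toggled from code. This is only a minimal sketch: set_processed_output is a hypothetical helper name, and the #ifdef guard is there purely so that the snippet also compiles on non-Windows systems, where it does nothing.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: switches ENABLE_PROCESSED_OUTPUT on (enable != 0)
   or off (enable == 0) for the console attached to standard output.
   Returns 1 on success, 0 if standard output is not a console (for
   example when redirected to a file) or the platform is not Windows. */
int set_processed_output(int enable)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return 0;
    if (!GetConsoleMode(out, &mode))   /* fails when output is redirected */
        return 0;

    if (enable)
        mode |= ENABLE_PROCESSED_OUTPUT;
    else
        mode &= ~(DWORD)ENABLE_PROCESSED_OUTPUT;

    return SetConsoleMode(out, mode) ? 1 : 0;
#else
    (void)enable;                      /* nothing to do outside Windows */
    return 0;
#endif
}
```

Calling set_processed_output(0) before the WriteFile call from the question, and set_processed_output(1) afterwards, should make the difference visible on a real console.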

Related

Escape sequence for a true LineFeed (LF)

In C we have a couple common escape sequences:
\r for a Carriage Return (CR) - which would be the equivalent of doing '\015'
\n is often described as a Line Feed, but I understand that '\n' in a string will get translated as required to CRLF (dependent on the OS), which would be the equivalent of "\015\012", in particular if I'm doing a printf or an fprintf.
Is there an escape code for a true line feed character that won't get translated or am I stuck using '\012' when I don't want it translated?
There is no translation in the C compiler. Within each of the following groups, the string literals are equivalent:
// (1) these are all equivalent to a string of newline of length 1:
"\n"
"\x0a"
"\012"
// (2) these are all equivalent to a string of carriage return of length 1:
"\r"
"\x0d"
"\015"
// (3) these are all equivalent to a string of CRLF of length 2:
"\r\n"
"\x0d\x0a"
"\015\012"
When outputting to a terminal under a POSIX system, the TTY driver will convert case (1) into CRLF in cooked mode. This can be changed via some TTY ioctl calls. IIRC, Windows is similar, but [again IIRC] it needs a Windows-specific call because the translation is done at a very low layer.
When writing to a file under a POSIX system, no translation is done.
However, when writing to a file under Windows, case (1) is translated to CRLF for normal opens [because the default is "text" mode]. The translation is done by the C runtime library, not by the OS itself:
open(file,O_WRONLY);
fopen(file,"w");
To suppress the translation under windows for case (1), open the file in "binary" mode:
open(file,O_WRONLY | O_BINARY);
fopen(file,"wb");
Binary mode also applies when opening for reading. Under POSIX, the "b" in the fopen mode string is accepted and ignored, because POSIX has no "text mode" for files: every open is effectively binary. Note that O_BINARY is not defined by POSIX, so the open call above needs a guard (or a fallback definition of 0) to compile there.
So, for portability between POSIX and Windows, fopen with "b" is the mode to use to suppress translation.
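A quick way to convince yourself is to write a known string in binary mode and count what actually lands in the file. The sketch below assumes a writable path; the function name bytes_written_in_binary_mode is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: writes the two characters 'a' and '\n' to `path`
   in binary mode, then reopens the file in binary mode and counts its
   bytes. Returns the byte count, or -1 on any I/O error. */
long bytes_written_in_binary_mode(const char *path)
{
    FILE *f = fopen(path, "wb");       /* "b" suppresses LF -> CRLF */
    if (f == NULL)
        return -1;
    if (fputs("a\n", f) == EOF) {
        fclose(f);
        return -1;
    }
    if (fclose(f) != 0)
        return -1;

    f = fopen(path, "rb");             /* read back without translation */
    if (f == NULL)
        return -1;

    long count = 0;
    while (fgetc(f) != EOF)
        count++;
    fclose(f);
    return count;
}
```

With "wb" the count should be 2 on both POSIX and Windows. Changing the first mode string to "w" would be expected to give 3 on Windows (a\r\n) and still 2 on POSIX.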
@Barmar is right: \n and \012 are the exact same bits. The difference between a plain LF and a CRLF on Windows machines is how you open whatever device you are writing to. If you are doing printf to a terminal under cygwin, you could stty to change to raw mode, for example. Otherwise, it will depend on the specifics of the C library you are using.
Edit: For Win32 using msvcrt, with fopen(..., "b"), "translations involving carriage-return and linefeed characters are suppressed" (from MSDN). By contrast, in text mode, "linefeed characters are translated to carriage return–linefeed combinations on output" (same source).
So to answer the original question, there is no single escape sequence that will always be \012 on output, on every platform, with every output routine.
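That \n, \x0a and \012 really are the same byte as far as the compiler is concerned can be checked directly. A small sketch, assuming an ASCII-based character set; lf_spellings_match is just an illustrative name:

```c
#include <string.h>

/* Returns 1 if the three spellings of LF produce the same one-byte
   string and CRLF is a two-byte string, 0 otherwise. Assumes an
   ASCII-based execution character set. */
int lf_spellings_match(void)
{
    return sizeof("\n") == 2                 /* one byte plus the NUL */
        && sizeof("\r\n") == 3               /* two bytes plus the NUL */
        && strcmp("\n", "\x0a") == 0
        && strcmp("\n", "\012") == 0
        && strcmp("\r\n", "\x0d\x0a") == 0;
}
```

Any difference you observe on output therefore comes from the stream or device you write to, not from the way the literal is spelled.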
The history:
Old mainframe computers had terminals connected over often slow connections. The terminals were typewriters. After the user had typed a line, they pressed return (as on old typewriters). This was the signal for the mainframe to process the line. Once the mainframe had received and processed the line, it sent a line feed. The paper of the typewriter now went up one line, informing the user the system was ready to receive another line.
Unix, being based on timesharing, copied this behavior.
(But I am still not sure whether the LF is stored under Unix, or the CR - from the above, it should be the CR and the system adds the LF.)
Windows, not being timesharing, just put the CR and LF into the file.

C: strtok and newlines in Windows vs Linux

I'm working on a C school assignment that is intended to be done on Windows, however, I'm programming it on OS X. While the other students working on Windows don't have problems reading a file, I do.
The code provided by the tutors splits the contents of a file on \n using this code:
/* Read ADFGX information */
adfgx = read_from_file("adfgx.txt");
/* Define the alphabet */
alphabet = strtok(adfgx, "\n");
/* Define the code symbols */
symbols = strtok(NULL, "\n");
However, the file adfgx.txt (which is provided for the assignment) has Windows style newlines (\r\n): I checked it with a hex editor. So, compiling this with the Microsoft C compiler from Visual Studio and running it on Windows splits the file correctly on newlines (\r\n). Which I think is weird, because I can not find any documentation on this behavior. The other part: when I compile it on OS X using gcc, and I run it: the \r is still included in the tokenized string, because it obviously splits on \n. If I change the delimiters to the strtok call to "\r\n", it works for me.
Is this normal that this behaves differently on Windows and Unix? How should I handle this in real life situations (assuming I'm trying to write portable code for Windows and Unix in C that should handle file input that uses \r\n)?
If you open the file with fopen("adfgx.txt", "r") on Windows, the file gets opened in "text mode" and the \r char gets implicitly stripped from subsequent fread calls. If you had opened the file on Windows with fopen("adfgx.txt", "rb"), the file gets opened in "binary mode", and the \r char remains. To learn about the "rb" mode, and other mode strings, you can read about the different mode parameters that fopen on Windows takes here. And as you might imagine, fwrite on Windows will automatically insert a \r into the stream in front of the \n char (as long as the file was not opened in binary mode).
Unix and macOS treat \r as any ordinary character. Hence, strtok(NULL, "\n") won't strip off the '\r' char, because you are not splitting on it.
The easy cross-platform fix would be to invoke strtok as follows on all platforms:
/* Define the alphabet */
alphabet = strtok(adfgx, "\r\n");
Passing "\r\n" as the delimiter string should clear up most issues with reading Windows text files on Unix and vice versa. strtok never returns an empty string: it treats any run of delimiter characters as a single separator, so the \r\n pair produces no empty token. The flip side is that blank lines in the file are skipped silently.
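Here is a small sketch of that approach wrapped in a function. split_lines is a hypothetical helper (it is not part of the assignment code); it tokenizes the buffer in place, so the buffer must be writable.

```c
#include <string.h>

/* Hypothetical helper: splits `buf` in place on any mix of '\r' and
   '\n', stores up to `max` token pointers in `out`, and returns how
   many were stored. Runs of delimiters count as one separator, so
   blank lines produce no token. */
size_t split_lines(char *buf, char **out, size_t max)
{
    size_t n = 0;
    char *tok = strtok(buf, "\r\n");

    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = strtok(NULL, "\r\n");
    }
    return n;
}
```

Given the buffer "ADFGX\r\nabcde\r\n", this yields the two tokens ADFGX and abcde with no stray \r, whether the file was read in text mode on Windows or as-is on OS X.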

Auto detect OS in C and handle with their specific line breaks

Is there a way to detect the OS on which the C code is compiled, so that it can handle that OS's specific line break characters in text files? For example, if I compile my code on a Windows machine, it should use \r\n as the line break in text files; on Linux it should just use \n.
I need this for a program that reads text files in binary mode and matches substrings of the file against other strings. This should work on Windows and Linux.
Thanks for your help!
You don't need to know the native storage format. When reading a file, you cannot know if it was created on a Windows, Linux, or other system -- it could have been created on a different system than the one you are working on. When writing, your program will use the native libraries for your OS and output whatever they deem appropriate for \n.
Reading a text file line-ending agnostically comes down to this:
use a binary mode rather than "text mode" (you seem to already do this).
read text until you encounter either an \r or \n.
if you get an \r, skip all next \n;
if you get an \n, skip all next \r.
This will work for line endings of \n (Linux and other Unix-like OSes such as Mac OS X), Windows-like \r\n and older Mac OS files ending with \r only. That covers about 99.99% of all "normal" text files you are likely to encounter. There used to be a very rare one that used \r\n\n (or possibly \n\r\r) but even that will be handled correctly.
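The steps above could be sketched roughly like this. read_line_any is a made-up name, the stream is assumed to have been opened in binary mode, and lines longer than the buffer are silently truncated to keep the example short.

```c
#include <stdio.h>

/* Hypothetical helper: reads one line from a binary-mode stream into
   `buf` (at most cap-1 characters, always NUL-terminated when cap > 0).
   A line ends at '\r' or '\n'; after a '\r' all directly following
   '\n' are skipped, and after a '\n' all directly following '\r'.
   Returns 1 if a line was read, 0 at end of file. */
int read_line_any(FILE *f, char *buf, size_t cap)
{
    size_t n = 0;
    int c = fgetc(f);

    if (c == EOF)
        return 0;

    while (c != EOF && c != '\r' && c != '\n') {
        if (n + 1 < cap)
            buf[n++] = (char)c;
        c = fgetc(f);
    }
    if (cap > 0)
        buf[n] = '\0';

    if (c != EOF) {
        /* Swallow the other half of the line ending, if present. */
        int other = (c == '\r') ? '\n' : '\r';
        int next;
        while ((next = fgetc(f)) == other)
            ;
        if (next != EOF)
            ungetc(next, f);
    }
    return 1;
}
```

Calling it in a loop until it returns 0 hands you the lines of a file regardless of whether they end in \n, \r\n or \r.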
The best way would be to check for a predefined macro and #ifdef on it.
You can print all the predefined MACROs using the command
gcc -dM -E - < /dev/null
and grep -i for "linux" or "win32".
I'd expect to find __linux__ defined on Linux machines and _WIN32 defined on Windows machines.
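If you do go the macro route, it could look something like the following. native_eol is a hypothetical function name, and only _WIN32 is tested here; every other platform is assumed to use a bare \n.

```c
/* Hypothetical helper: returns the line ending conventionally used in
   text files on the platform the code was compiled for. _WIN32 is
   predefined by the common Windows compilers; anything else is assumed
   to be Unix-like. */
const char *native_eol(void)
{
#if defined(_WIN32)
    return "\r\n";
#else
    return "\n";
#endif
}
```

Keep in mind that this only tells you where the program was compiled, not where the input file came from.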

Reading files with DOS line endings using fgets() on linux

I have a file with DOS line endings that I receive at run-time, so I cannot convert the line endings to UNIX-style offline. Also, my app runs on both Windows and Linux. My app does an fgets() on the file and tries to read in line-by-line.
Would the number of bytes read per line on Linux also account for 2 trailing characters (\r \n) or would it contain only (\n) and the \r would be discarded by the underlying system?
EDIT:
Ok, so the line endings are preserved while reading a file on Linux, but I have run into another issue. On Windows, opening the file in "r" or "rb" behaves differently. Does Windows treat these two modes distinctly, unlike Linux?
fgets() keeps line endings.
http://msdn.microsoft.com/en-us/library/c37dh6kf(v=vs.80).aspx
fgets() itself doesn't have any special options for converting line endings, but on Windows, you can choose to open a file in either "binary" mode or "text" mode. In text mode the Windows C runtime converts the CR/LF sequence (C string: "\r\n") into just a newline (C string: "\n"). It's a feature so that you can write the same code for Windows and Linux and it will work (you don't need "\r\n" on Windows and just "\n" on Linux).
http://msdn.microsoft.com/en-US/library/yeby3zcb(v=vs.80)
Note that the Windows call to fopen() takes the same arguments as the call to fopen() in Linux. The "binary" mode needs an extra character ('b', which is part of standard C) in the file mode, but the "text" mode is the default. So I suggest you just use the same code lines for Windows and Linux; the Windows version of fopen() is designed for that.
The Linux version of the C library doesn't have any tricky features. If the text file has CR/LF line endings, then that is what you get when you read it. Linux fopen() will accept a 'b' in the options, but ignores it!
http://linux.die.net/man/3/fopen
http://linux.die.net/man/3/fgets
On Unix, the lines would be read to the newline \n and would include the carriage return \r. You would need to trim both off the end.
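To make the byte count concrete, here is a minimal sketch that reports how many characters fgets hands back for the first line of a stream. fgets_raw_length is an illustrative name, and the stream is assumed to be untranslated (any stream on Linux, or a binary-mode stream on Windows).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line with fgets and returns the number
   of characters stored, including any trailing '\r' and '\n'.
   Returns -1 at end of file or on error. */
long fgets_raw_length(FILE *f)
{
    char buf[256];

    if (fgets(buf, sizeof buf, f) == NULL)
        return -1;
    return (long)strlen(buf);
}
```

For a file whose first line is abc followed by CRLF, this returns 5 rather than 4, which is why both trailing characters need trimming.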
Although the other answers gave satisfying information regarding what kind of line ending is returned for a DOS file read under UNIX, I'd like to mention an alternative way to chop off such line endings.
The significant difference is that the following approach is multi-byte-character safe, as it does not touch individual characters directly:
if (pszLine)
{
    /* strcspn returns the length of the prefix containing neither '\r' nor '\n' */
    size_t size = strcspn(pszLine, "\r\n");
    pszLine[size] = 0;
}
You'll get what's actually in the file, including the \r characters. In unix there aren't text files and binary files, there are just files, and stdio doesn't do conversions. After reading a line into a buffer with fgets, you can do:
char *p = strrchr(buffer, '\r');
if(p && p[1]=='\n' && p[2]=='\0') {
p[0] = '\n';
p[1] = '\0';
}
That will change a terminating \r\n\0 into \n\0. Or you could just do p[0]='\0' if you don't want to keep the \n.
Note the use of strrchr, not strchr. There's nothing that prevents multiple \rs from being present in the middle of a line, and you probably don't want to truncate the line at the first one.
Answer to the EDIT section of the question: yes, the "b" in "rb" is a no-op in unix.

What exactly is \r in C language?

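#include <stdio.h>
#include <conio.h>   /* getche() and getch(): non-standard, DOS/Windows only */
int main()
{
    int countch=0;
    int countwd=1;
    printf("Enter your sentence in lowercase: ");
    char ch='a';
    while(ch!='\r')   /* Enter arrives from getche() as '\r' */
    {
        ch=getche();
        if(ch==' ')
            countwd++;
        else
            countch++;
    }
    printf("\n Words =%d ",countwd);
    printf("Characters = %d",countch-1);   /* -1 discounts the final '\r' */
    getch();
    return 0;
}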
This is the program where I came across \r. What exactly is its role here? I am beginner in C and I appreciate a clear explanation on this.
'\r' is the carriage return character. The main times it would be useful are:
When reading text in binary mode, or which may come from a foreign OS, you'll find (and probably want to discard) it due to CR/LF line-endings from Windows-format text files.
When writing to an interactive terminal on stdout or stderr, '\r' can be used to move the cursor back to the beginning of the line, to overwrite it with new contents. This makes a nice primitive progress indicator.
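As a rough illustration of the progress-indicator idea, something like this would do. format_progress and show_progress are hypothetical names, and the output stream is assumed to be an interactive terminal.

```c
#include <stdio.h>

/* Hypothetical helper: formats a progress line that starts with '\r',
   so that printing it moves the cursor back to column 0 and overwrites
   the previous progress line. Returns what snprintf returns. */
int format_progress(char *buf, size_t cap, int percent)
{
    return snprintf(buf, cap, "\rProgress: %3d%%", percent);
}

/* Writes the progress line and flushes, since no '\n' is sent that
   would trigger line-buffered output. */
void show_progress(FILE *out, int percent)
{
    char line[32];

    format_progress(line, sizeof line, percent);
    fputs(line, out);
    fflush(out);
}
```

Calling show_progress(stderr, i) in a loop for i from 0 to 100 rewrites the same terminal line each time; a final '\n' afterwards moves on to a fresh line.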
The example code in your post is definitely a wrong way to use '\r'. It assumes a carriage return will precede the newline character at the end of a line entered, which is non-portable and only true on Windows. Instead the code should look for '\n' (newline), and discard any carriage return it finds before the newline. Or, it could use text mode and have the C library handle the translation (but text mode is ugly and probably should not be used).
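A portable version of the counting loop along those lines might look like the sketch below. count_line is a made-up name; it uses only standard I/O, stops at '\n' or end of file, and throws away any '\r' it meets. The counting rules (words start at 1, every space adds a word) are taken from the code in the question.

```c
#include <stdio.h>

/* Hypothetical helper: reads one line from `in` and counts words and
   non-space characters the same way the question's loop does.
   The line ends at '\n' or end of file; '\r' is ignored. */
void count_line(FILE *in, int *words, int *chars)
{
    int c;

    *words = 1;
    *chars = 0;
    while ((c = fgetc(in)) != EOF && c != '\n') {
        if (c == '\r')
            continue;          /* discard CR from CRLF input */
        if (c == ' ')
            (*words)++;
        else
            (*chars)++;
    }
}
```

Called as count_line(stdin, &countwd, &countch), it needs neither getche() nor the countch-1 correction.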
It's Carriage Return. Source: http://msdn.microsoft.com/en-us/library/6aw8xdf2(v=vs.80).aspx
The following repeats the loop until the user has pressed the Return key.
while(ch!='\r')
{
ch=getche();
}
Once upon a time, people had terminals like typewriters (with only upper-case letters, but that's another story). Search for 'Teletype', and how do you think tty got used for 'terminal device'?
Those devices had two separate motions. The carriage return moved the print head back to the start of the line without scrolling the paper; the line feed character moved the paper up a line without moving the print head back to the beginning of the line. So, on those devices, you needed two control characters to get the print head back to the start of the next line: a carriage return and a line feed. Because this was mechanical, it took time, so you had to pause for long enough before sending more characters to the terminal after sending the CR and LF characters. One use for CR without LF was to do 'bold' by overstriking the characters on the line. You'd write the line out once, then use CR to start over and print twice over the characters that needed to be bold. You could also, of course, type X's over stuff that you wanted partially hidden, or create very dense ASCII art pictures with judicious overstriking.
On Unix, all the logic for this stuff was hidden in a terminal driver. You could use the stty command and the underlying functions (in those days, ioctl() calls; they were sanitized into the termios interface by POSIX.1 in 1988) to tweak all sorts of ways that the terminal behaved.
Eventually, you got 'glass terminals' where the speeds were greater and there were new idiosyncrasies to deal with - Hazeltine glitches and so on and so forth. These got enshrined in the termcap and later terminfo libraries, and then further encapsulated behind the curses library.
However, some other (non-Unix) systems did not hide things as well, and you had to deal with CRLF in your text files - and no, this is not just Windows and DOS that were in the 'CRLF' camp.
Anyway, on some systems, the C library has to deal with text files that contain CRLF line endings and presents those to you as if there were only a newline at the end of the line. However, if you choose to treat the text file as a binary file, you will see the CR characters as well as the LF.
Systems like the old Mac OS (version 9 or earlier) used just CR (aka \r) for the line ending. Systems like DOS and Windows (and, I believe, many of the DEC systems such as VMS and RSTS) used CRLF for the line ending. Many of the Internet standards (such as mail) mandate CRLF line endings. And Unix has always used just LF (aka NL or newline, hence \n) for its line endings. And most people, most of the time, manage to ignore CR.
Your code is rather funky in looking for \r. On a system compliant with the C standard, you won't see the CR unless the file is opened in binary mode; the CRLF or CR will be mapped to NL by the C runtime library.
There are a few characters which can indicate a new line. The usual ones are these two:
'\n' or 0x0A (10 in decimal) -> This character is called "Line Feed" (LF).
'\r' or 0x0D (13 in decimal) -> This one is called "Carriage Return" (CR).
Different Operating Systems handle newlines in a different way. Here is a short list of the most common ones:
DOS and Windows
They expect a newline to be the combination of two characters, namely '\r\n' (or 13 followed by 10).
Unix (and hence Linux as well)
Unix uses a single '\n' to indicate a new line.
Classic Mac OS
Mac OS 9 and earlier used a single '\r'. Modern macOS is a Unix and uses '\n'.
That is not always true; it only works in Windows.
When interacting with a terminal (PuTTY, a Linux shell, ...), '\r' returns the cursor to the beginning of the line. The two cases look like this:
Without '\r':
The data arrives at the PuTTY terminal with just '\n', so the following text is printed on the next line, but starting at the column where the previous text ended.
With '\r':
The data arrives with '\r', i.e. the string ends with '\r\n', so the cursor in the PuTTY terminal moves not only to the next line but also to the beginning of that line.
It depends upon which platform you're on as to how it will be translated and whether it will be there at all: Wikipedia entry on newline
\r is the escape sequence for the carriage return character. It brings the cursor back to the beginning of the current line, so that whatever is written after the \r (for example \rhello) overwrites what was there:
#include <stdio.h>
int main(void)
{
    printf("Hello \rworld");
    return 0;
}
On a terminal, the output of the program will be world, not Hello world,
because \r has put the cursor at the beginning of the line and Hello has been overwritten with world.

Resources