Auto detect OS in C and handle with their specific line breaks - c

Is there a way to detect the OS on which the C code is compiled, so that it can handle that system's line break characters in text files? For example, if I compile my code on a Windows machine, it should use \r\n as the line break in text files; on Linux it should just use \n.
I need this for a program that reads text files in binary mode and matches substrings of the file against other strings. This should work on Windows and Linux.
Thanks for your help!

You don't need to know the native storage format. When reading a file, you cannot know if it was created on a Windows, Linux, or other system; it could have been created on a different system than the one you are working on. When writing in text mode, your program will use the native libraries for your OS and output whatever they deem appropriate for \n.
Reading a text file line-ending agnostically comes down to this:
use a binary mode rather than "text mode" (you seem to already do this).
read text until you encounter either a \r or a \n.
if you get a \r, skip all immediately following \n;
if you get a \n, skip all immediately following \r.
This will work for line endings of \n (Linux and other Unix-like OSes such as Mac OS X), Windows-like \r\n and older Mac OS files ending with \r only. That covers about 99.99% of all "normal" text files you are likely to encounter. There used to be a very rare one that used \r\n\n (or possibly \n\r\r) but even that will be handled correctly.

The best way would be to check for a predefined macro and #ifdef on it.
You can print all the predefined macros using the command
gcc -dM -E - < /dev/null
and grep (case-insensitively) for "linux" or "win32".
I'd expect to find __linux__ defined on Linux machines and _WIN32 defined on Windows machines.

Related

Does CMD do CRLF translation?

Since Windows uses CRLF as native line endings, one might expect that code like this
#include <windows.h>
#include <string.h>
int main() {
HANDLE stdout = GetStdHandle(STD_OUTPUT_HANDLE);
char* msg = "Line1\nLine2\nLine3\nLine4\nLine5\n";
DWORD written = 0;
WriteFile(stdout, msg, strlen(msg), &written, NULL);
return 0;
}
would produce the following output:
Line1
     Line2
          Line3
               Line4
                    Line5
But it produces:
Line1
Line2
Line3
Line4
Line5
Are the LFs translated within cmd so that they also move the cursor to the start of the line? Because they do not appear to be translated to CRLF when output is redirected to a file (I originally tried WriteConsoleA only to observe that it does not support redirection).
Is it OK to only use LF and not CRLF in cross-platform programs then? Is this behaviour given for all versions of windows or do some of them produce the "staircase" pattern described above?
Your expectations are exactly on target. And you are correct - that isn't what you see.
And @IInspectable is correct that you can't open devices via the Windows CreateFile API in text mode.
And yet, you see what you see.
If you check out this, you will find that Windows contains some console mode settings that aren't apparent when you are reading the API function descriptions. In particular, there is an ENABLE_PROCESSED_OUTPUT mode that is on by default. When this is on it enables special processing to handle '\n' line endings and other things.
Note that this is for console I/O only. For actual file I/O, Windows API calls do no translation. On the other hand, the C library function fopen allows a file to be opened in text or binary mode. Binary mode does no translation, but text mode enables special handling for \n on Windows systems.
I understand that this is also handled similarly for consoles on POSIX systems where "\n" and not "\r\n" is the official standard line ending.
The console standard output buffer has ENABLE_PROCESSED_OUTPUT|ENABLE_WRAP_AT_EOL_OUTPUT by default.
\r alone just returns to the start of the line allowing you to overwrite yourself. \n is treated as \r\n for unknown reasons (Compatibility? Source portability?). NT4 does it, XP does it, 8 does it and 10 does it.
If you turn off ENABLE_PROCESSED_OUTPUT with SetConsoleMode then neither \r nor \n are special and WriteFile will print them as symbols to the console instead.

How to solve the \r (carriage return) problem that prevents Windows text files from being cross-platform with Linux?

Intro
There is a "feature" in Windows that adds a carriage return character before every newline. This is an old thing, but I am having trouble finding specific solutions to the problem.
Problem
If you create a file in Windows with fopen("file.txt", "w"), it will create a text file where every \n in your code is converted to \r\n. This creates a problem in situations where you try to read the file on Linux, because there is now an unaccounted-for character in the mix, and most line reads depend on reading to \n.
Research
I created text ("w") and binary ("wb") files on Windows and Linux, and tried to read and compare them with the same files made on the other OS.
Binary files do not get an added carriage return, and seem to work fine.
Text files, on the other hand, are a mixed bag:
On Windows, comparing the Windows text file (\r\n) to the Linux text file (\n) will result in them being read equally (this is apparently a feature of some C Windows implementations - \r\n will get automatically read as \n)
On Linux, comparing the files will result in them not being equal and the \r will not be handled, creating a reading problem.
Conclusion
I would like to know of ways how to handle this so my text files can be truly cross-platform. I have seen a lot of posts about this, but most have mixed and opposing claims and almost none have specific solutions.
So please, can we solve this problem once and for all after 40 years? Thank you.

Escape sequence for a true LineFeed (LF)

In C we have a couple common escape sequences:
\r for a Carriage Return (CR) - which would be the equivalent of doing '\015'
\n is often described as LineFeed, but I understand that '\n' will get translated in a string as required to CRLF (dependent on the OS) - which would be the equivalent of doing "\015\012". In particular if I'm doing a printf or an fprintf.
Is there an escape code for a true line feed character that won't get translated or am I stuck using '\012' when I don't want it translated?
There is no translation in the C compiler. Within each of the following groups, the string literals are equivalent:
// (1) these are all equivalent to a string of newline of length 1:
"\n"
"\x0a"
"\012"
// (2) these are all equivalent to a string of carriage return of length 1:
"\r"
"\x0d"
"\015"
// (3) these are all equivalent to a string of CRLF of length 2:
"\r\n"
"\x0d\x0a"
"\015\012"
When outputting to a terminal under a POSIX system, the TTY driver will convert case (1) into CRLF in cooked mode. This can be changed via some TTY ioctl calls. IIRC, similar for windows(?). But, [again] IIRC, windows has some windows specific call that must be done because the translation is done at a very low layer.
When writing to a file under a POSIX system, no translation is done.
However, when writing to a file under Windows, case (1) is translated to CRLF by the C runtime library (not by the OS itself) for normal opens [because the default is "text" mode]:
open(file,O_WRONLY);
fopen(file,"w");
To suppress the translation under windows for case (1), open the file in "binary" mode:
open(file,O_WRONLY | O_BINARY);
fopen(file,"wb");
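Binary mode also applies when opening for reading. Under POSIX, the "b" in the fopen mode string is accepted and ignored, so a file is effectively always opened in binary mode there, because POSIX has no "text mode" for files. (O_BINARY itself is not a POSIX flag, so code that passes it to open needs a fallback definition such as 0 on those systems.)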
So, for portability between POSIX/windows, this is the mode to use to suppress translation.
@Barmar is right: \n and \012 are the exact same bits. The difference between a plain LF and a CRLF on Windows machines is how you open whatever device you are writing to. If you are doing printf to a terminal under cygwin, you could stty to change to raw mode, for example. Otherwise, it will depend on the specifics of the C library you are using.
Edit: For Win32 using msvcrt, using fopen(..., "b"), "translations involving carriage-return and linefeed characters are suppressed" (from MSDN). By contrast, in text mode, "linefeed characters are translated to carriage return–linefeed combinations on output" (same source).
So to answer the original question, there is no single escape sequence that will always be \012 on output, on every platform, with every output routine.
The history:
Old mainframe computers had terminals connected over often slow connections. The terminals were typewriters. After the user had typed a line, they pressed return (as on old typewriters). This was the signal for the mainframe to process the line. Once the mainframe had received and processed the line, it sent a line feed. The paper of the typewriter now went up one line, informing the user the system was ready to receive another line.
Unix, being based on timesharing, copied this behavior.
(But I am still not sure whether the LF is stored under Unix, or the CR - from the above, it should be the CR and the system adds the LF.)
Windows, not being timesharing, just put the CR and LF into the file.

How is erasing output in terminal implemented in C?

Some applications running in terminal can erase their outputs. e.g.
when it tells you to wait, it will show a sequence of dots alternating between different lengths.
How is erasing output in terminal implemented in C? Is it done by reverse line feed?
Can a program only erase the previous characters in the current line, not the characters in the previous line in stdout?
Thanks.
It depends on the terminal.
The COMSPEC shell on Windows (often called the DOS prompt or command.com) exposes an API in C to control the cursor. I haven't done any Windows programming so I can't tell you much about it.
Most other terminals (especially on unixen) emulate protocols that resemble the VT100 serial terminal (the VT100 terminal was a physical device, a monitor and keyboard, that you attached to a modem or serial port to communicate with a server).
On VT100 terminals, carriage return and line feed are separate commands, both one byte. The carriage return command sets the cursor to the beginning of the line. The line feed command moves the cursor down a line (but doesn't bring the cursor to the beginning of the line by itself). On most unixen the terminal driver automatically adds a carriage return to each line feed that is output, but almost none adds a line feed to a carriage return.
With this knowledge, the simplest implementation is to simply output a carriage return and reprint the entire line:
printf("\rprogress: %d percent   ", x);
Note the extra spaces at the end of the string. Printing "\r" doesn't erase the line, so reprinting over the old line may leave some of the old text on screen. The extra spaces are there to overwrite the remainder of the old line.
If you search for "VT100 escape sequence", you'll find commands that allow you to do things like erase the current line, change the color of text, go to a specific row/column on screen etc. The most popular use of VT100 sequences is to output coloured text. But you can also do other things with them.
The next simplest implementation is to cleanly delete the line and reprint it:
printf("\033[2K\rprogress: %d percent", x);
The \033[2K is the escape sequence to delete the current line (ESC[2K).
From here you can get more creative if you want. You can use the cursor save/restore command with the erase until end of line command to only erase the part you want to update (instead of the entire line). You can use the goto commands to put the cursor in a specific location on screen to update text there etc.
It should be noted that the more advanced stuff such as VT102 sequences or some of the full ANSI escape sequences are generally not portable across terminals (by terminals I don't mean the shell, I mean the terminals: rxvt, xterm, linux terminal, hyperterminal (on Windows) etc).
If you want portability (and/or sane API) you should use the curses or ncurses libraries.
If you wanted to know how it's done, then that's how it's done. It's just printing specially formatted strings to screen (except for the COMSPEC shell). Kind of like HTML but old-school.

What are reserved filenames for various platforms?

I'm not asking about general syntactic rules for file names. I mean gotchas that jump out of nowhere and bite you. For example, trying to name a file "COM<n>" on Windows?
From: http://www.grouplogic.com/knowledge/index.cfm/fuseaction/view_Info/docID/111.
The following characters are invalid as file or folder names on Windows using NTFS: / ? < > \ : * | " and any character you can type with the Ctrl key.
In addition to the above illegal characters the caret ^ is also not permitted under Windows Operating Systems using the FAT file system.
Under Windows using the FAT file system file and folder names may be up to 255 characters long.
Under Windows using the NTFS file system file and folder names may be up to 255 characters long.
Under Windows the maximum length of a full path under both systems is 260 characters.
In addition to these characters, the following conventions are also illegal:
Placing a space at the end of the name
Placing a period at the end of the name
The following file names are also reserved under Windows:
aux,
com1,
com2,
...
com9,
lpt1,
lpt2,
...
lpt9,
con,
nul,
prn
Full description of legal and illegal filenames on Windows: http://msdn.microsoft.com/en-us/library/aa365247.aspx
A tricky Unix gotcha when you don't know:
Files which start with - or -- are legal but a pain in the butt to work with, as many command line tools think you are providing options to them.
Many of those tools have a special marker "--" to signal the end of the options:
gzip -9vf -- -mydashedfilename
As others have said, device names like COM1 are not possible as filenames under Windows because they are reserved devices.
However, there is an escape method to create and access files with these reserved names, for example, this command will redirect the output of the ver command into a file called COM1:
ver > "\\?\C:\Users\username\COM1"
Now you will have a file called COM1 that 99% of programs won't be able to open, and that will probably make them freeze if they try to access it.
Here's the Microsoft article that explains how this "file namespace" works. Basically it tells Windows not to do any string processing on the text and to pass it straight through to the filesystem. This trick can also be used to work with paths longer than 260 characters.
The boost::filesystem Portability Guide has a lot of good info.
Well, for MSDOS/Windows, NUL, PRN, LPT<n> and CON. They even cause problems if used with an extension: "NUL.TXT"
Unless you're touching special directories, the only illegal names on Linux are '.' and '..'. Any other name is possible as long as it contains neither '/' nor a NUL byte, although accessing some of them from the shell requires using escape sequences.
EDIT: As Vinko Vrsalovic said, files starting with '-' and '--' are a pain from the shell, since those character sequences are interpreted by the application, not the shell.

Resources