How to properly recognize different line endings in C? - c

I guess the title speaks for itself.
I am coding a C program on Windows 7, using g++ and Notepad++, which compares content of files.
Content of the file:
simple
file with lines
File has line endings in windows style CRLF.
When I count the length of file using this code:
fseek(file, 0, SEEK_END);
size = ftell(file);
fseek(file, 0, SEEK_SET);
I get 23.
When I change the line endings to Unix format LF (using Notepad++) I get a length of 22.
This creates a problem when comparing two files. That's why I'm asking whether there is a way to determine if a given file uses LF, CR, or CRLF.
I know that I can distinguish between CR and LF, LF has ascii code 10 and CR has ascii code 13. Or LF is '\n' and CR is '\r'.
But when reading file char after char I always get LF (ascii 10), even if there is CRLF.
I hope I made it clear. Thanks.

That is the difference between reading files in text and binary mode.
In text mode (fopen with the relevant parameters, e.g. fopen(file, "r"), then getc etc.) each line ending is read as a single character. If you read in binary mode, e.g. fopen(file, "rb"), you get the actual bytes and can tell CRLF and LF apart. fseek and ftell work with the actual number of bytes, and so see the difference in line endings.
The only way to tell is to read the file in binary mode and look for CRLF pairs, or to read it both ways and see whether the sizes differ. In practice you can just check whether there is a CR, as I don't think any current major OS uses a lone CR as a line ending.
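A minimal sketch of that check, assuming the stream was opened with "rb" so that no translation happens. The struct eol_counts type and the count_line_endings name are made up for this illustration and are not part of any library:

```c
#include <stdio.h>

/* Hypothetical helper type: tallies of each line-ending style. */
struct eol_counts {
    long crlf;  /* "\r\n" pairs */
    long lf;    /* lone '\n'    */
    long cr;    /* lone '\r'    */
};

/* Count line endings in a stream opened in binary mode ("rb"). */
struct eol_counts count_line_endings(FILE *fp)
{
    struct eol_counts n = {0, 0, 0};
    int c;
    int prev_cr = 0;  /* was the previous byte a '\r'? */

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            if (prev_cr)
                n.crlf++;      /* '\r' immediately followed by '\n' */
            else
                n.lf++;
            prev_cr = 0;
        } else {
            if (prev_cr)
                n.cr++;        /* previous '\r' was not followed by '\n' */
            prev_cr = (c == '\r');
        }
    }
    if (prev_cr)
        n.cr++;                /* stream ended with a lone '\r' */
    return n;
}
```

A file that reports only crlf counts is Windows style, only lf counts is Unix style, and a mix of counts means the file has inconsistent line endings.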

In addition to Mark's answer, if you need to do this for a filehandle that has already been opened (such as stdin or stdout), you can use _setmode():
#include <fcntl.h>
#include <io.h>
...
_setmode(fileno(stdin), _O_BINARY);
This works provided no input or output has already occurred to that filehandle. Incidentally, _setmode() only exists on Windows and DOS; on Unix-like operating systems (including versions of Mac OS since OS X), files are effectively always opened in binary mode, and fopen(file, "...b") there is accepted but has no effect. On these platforms, a line ending is encoded by the single character \n.
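If the same source has to build on both kinds of platform, the call can be wrapped in a preprocessor check. This is only a sketch: set_binary_mode is a hypothetical name, and it assumes the _WIN32 macro is defined by the Windows compiler in use and that _fileno is available there (as it is in the Microsoft C runtime).

```c
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif

/* Hypothetical wrapper: switch an already-open stream to binary mode.
 * Returns 0 on success, -1 on failure. On non-Windows platforms there
 * is nothing to switch, so it simply reports success. */
int set_binary_mode(FILE *stream)
{
#ifdef _WIN32
    return _setmode(_fileno(stream), _O_BINARY) == -1 ? -1 : 0;
#else
    (void)stream;  /* unused: streams are never translated here */
    return 0;
#endif
}
```

It would be called once, before any I/O on the stream, for example set_binary_mode(stdin).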

Related

why is fgetc or fgets ignoring

Today I almost became crazy because the size of the bytes I read didn't match the size of the xml file I was trying to read.
Then, when I checked the content of the file I was reading, I said it must be a nasty non printable char (\r) and I checked that with a simple program : the \r were not present.
My question is why fgetc/fgets are ignoring \r and picking only \n and If I want \r to be read how can I proceed ?
Because they are designed to do so. On Windows the end of a line is a combination of two characters: '\r' (carriage return) followed by '\n' (new line).
When you open a file with the "r" mode, each such pair is converted to a single '\n', so on Windows one character per line will be missing from what you read.
If you open with the "rb" mode, it will no longer convert the two characters to '\n' and you will be able to read the '\r'. This is the primary difference between the "b" and non "b" modes.
Note that this feature allows the file to be opened on different platforms without caring about this at all: you simply open it in text mode, "r" for input or "w" for output, and don't worry about the way the underlying system represents end of lines.
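One way to see the effect is to count how many characters fgetc() delivers under each mode. The helper below is a sketch with a made-up name (count_chars), not something from the original answer:

```c
#include <stdio.h>

/* Count the characters fgetc() returns for a file opened with the
 * given mode ("r" or "rb"). Returns -1 if the file cannot be opened. */
long count_chars(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    long n = 0;

    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}
```

On Windows, count_chars(path, "r") should come out smaller than count_chars(path, "rb") by the number of CRLF pairs in the file; on Unix-like systems the two results are the same.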

Reading files with DOS line endings using fgets() on linux

I have a file with DOS line endings that I receive at run-time, so I cannot convert the line endings to UNIX-style offline. Also, my app runs on both Windows and Linux. My app does an fgets() on the file and tries to read in line-by-line.
Would the number of bytes read per line on Linux also account for 2 trailing characters (\r \n) or would it contain only (\n) and the \r would be discarded by the underlying system?
EDIT:
Ok, so the line endings are preserved while reading a file on Linux, but I have run into another issue. On Windows, opening the file in "r" or "rb" is behaving differently. Does windows treat these two modes distinctly, unlike Linux?
fgets() keeps line endings.
http://msdn.microsoft.com/en-us/library/c37dh6kf(v=vs.80).aspx
fgets() itself doesn't have any special options for converting line endings, but on Windows, you can choose to either open a file in "binary" mode, or in "text" mode. In text mode Windows converts the CR/LF sequence (C string: "\r\n") into just a newline (C string: "\n"). It's a feature so that you can write the same code for Windows and Linux and it will work (you don't need "\r\n" on Windows and just "\n" on Linux).
http://msdn.microsoft.com/en-US/library/yeby3zcb(v=vs.80)
Note that the Windows call to fopen() takes the same arguments as the call to fopen() in Linux. The "binary" mode needs an extra character ('b') in the file mode, which is part of standard C, while the "text" mode is the default. So I suggest you just use the same code lines for Windows and Linux; the Windows version of fopen() is designed for that.
The Linux version of the C library doesn't have any tricky features. If the text file has CR/LF line endings, then that is what you get when you read it. Linux fopen() will accept a 'b' in the options, but ignores it!
http://linux.die.net/man/3/fopen
http://linux.die.net/man/3/fgets
On Unix, the lines would be read to the newline \n and would include the carriage return \r. You would need to trim both off the end.
Although the other answers give satisfying information regarding what kind of line ending is returned for a DOS file read under UNIX, I'd like to mention an alternative way to chop off such line endings.
The significant difference is that the following approach is multi-byte-character safe, as it does not touch any individual characters directly:
if (pszLine)
{
    /* length of the leading part that contains neither '\r' nor '\n' */
    size_t size = strcspn(pszLine, "\r\n");
    pszLine[size] = 0; /* cut the line at the first CR or LF */
}
You'll get what's actually in the file, including the \r characters. In unix there aren't text files and binary files, there are just files, and stdio doesn't do conversions. After reading a line into a buffer with fgets, you can do:
char *p = strrchr(buffer, '\r');
if (p && p[1] == '\n' && p[2] == '\0') {
    p[0] = '\n';
    p[1] = '\0';
}
That will change a terminating \r\n\0 into \n\0. Or you could just do p[0]='\0' if you don't want to keep the \n.
Note the use of strrchr, not strchr. There's nothing that prevents multiple \rs from being present in the middle of a line, and you probably don't want to truncate the line at the first one.
Answer to the EDIT section of the question: yes, the "b" in "rb" is a no-op in unix.
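For code that has to behave the same on Windows and Linux regardless of how the file was opened, a third option is to strip only the trailing run of '\r' and '\n' characters. This is a sketch under the assumption that the buffer is a NUL-terminated line as produced by fgets(); chomp is a hypothetical name borrowed from other languages:

```c
#include <string.h>

/* Remove any trailing '\r' and '\n' characters in place.
 * Characters in the middle of the line are left alone.
 * Returns the new length of the string. */
size_t chomp(char *s)
{
    size_t len = strlen(s);

    while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
        s[--len] = '\0';
    return len;
}
```

Unlike the strcspn() version, this keeps a '\r' that appears in the middle of a line, and unlike the strrchr() version it also handles a bare "\n" or a final line with no newline at all.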

Writing line to a file using C

I'm currently doing this:
FILE *fOut;
fOut = fopen("fileOut.txt", "w");
char line[255];
...
strcat(line, "\n");
fputs(line, fOut);
but find that when I open the file in a text editor I get
line 1
line 2
If I remove the strcat(line, "\n"); then I get.
line 1line2
How do I get fOut to be
line 1
line 2
The puts() function appends a newline to the string it is given to write to stdout; the fputs() function does not do that.
Since you've not shown us all the code, we can only hypothesize about what you've done. But:
strcpy(line, "line1");
fputs(line, fOut);
putc('\n', fOut);
strcpy(line, "line2\n");
fputs(line, fOut);
would produce the result you require, in two slightly different ways that could each be used twice to achieve consistency (and your code should be consistent — leave 'elegant variation' for your literature writing, not for your programming).
In a comment, you say:
I'm actually looping through a file encrypting each line and then writing that line to a new file.
Oh boy! Are you base-64 encoding the encrypted data? If not, then:
You must include b in the fopen() mode (as in fOut = fopen("fileout.bin", "wb");) because encrypted data is binary data, not text data. This (the b) is safe for both Unix and Windows, but is critical on Windows and immaterial on Unix.
You must not use fputs() to write the data; there will be zero bytes ('\0') amongst the encrypted values and fputs() will stop at the first of those that it encounters. You probably need to use fwrite() instead, telling it exactly how many bytes to write each time.
You must not insert newlines anywhere; the encrypted data might contain newlines, but those must be preserved, and no extraneous one can be added.
When you read this file back in, you must open it as a binary file "rb" and read it using fread().
If you are base-64 encoding your encrypted data, then you can go back to treating the output as text; that's the point of base-64 encoding.
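The rules above for raw (not base-64 encoded) data can be sketched as a pair of helpers. The names write_blob and read_blob are invented for this illustration, and the caller is assumed to know the exact byte count of each encrypted chunk:

```c
#include <stdio.h>

/* Write exactly len bytes to path in binary mode.
 * Returns 0 on success, -1 on failure. */
int write_blob(const char *path, const unsigned char *data, size_t len)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL)
        return -1;
    written = fwrite(data, 1, len, fp);   /* zero bytes are written too */
    if (fclose(fp) != 0 || written != len)
        return -1;
    return 0;
}

/* Read up to cap bytes from path in binary mode.
 * Returns the number of bytes read, or -1 if the file cannot be opened. */
long read_blob(const char *path, unsigned char *buf, size_t cap)
{
    FILE *fp = fopen(path, "rb");
    size_t got;

    if (fp == NULL)
        return -1;
    got = fread(buf, 1, cap, fp);
    fclose(fp);
    return (long)got;
}
```

Because the length is passed explicitly, embedded '\0', '\n' and '\r' bytes survive the round trip unchanged, which fputs() and text mode cannot guarantee.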
When files are opened with w (or wt) Windows replaces the \n with \r\n.
To avoid this, open the file with wb (instead of w).
...
fOut = fopen("fileOut.txt", "wb");
...
Unlike many other OSs, Windows makes a distinction between binary and text mode, and -- confusingly -- the Windows C runtime handles both modes differently.
You can try using \r instead of \n. What platform are you running this on, Windows?

How to fix this file related problem

I'm reading from a file line by line, but a garbage character such as a space or \r is being added to what I read. I don't understand why, since there is no such character in the file I'm reading from. I have used both fread and fgets, and I get the same problem with both. Please reply if you have a solution for this problem.
The file was probably edited/created on Windows. Windows uses \r\n as a line delimiter. When you read the file, you must strip the \r manually. Since most editors treat \r\n as a single character (line end), you can't "see" it but it's still in the file. Use a hex editor if you want to see it or a tool like od.
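Stripping the \r manually can be done on a whole buffer at once. The following is a sketch, assuming the data has already been read into memory (for example with fread); crlf_to_lf is a hypothetical name:

```c
#include <stddef.h>

/* Convert every "\r\n" pair in buf[0..len) to "\n", in place.
 * A '\r' that is not followed by '\n' is left untouched.
 * Returns the new length of the data. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t r, w = 0;

    for (r = 0; r < len; r++) {
        if (buf[r] == '\r' && r + 1 < len && buf[r + 1] == '\n')
            continue;          /* drop the '\r'; the '\n' is copied next */
        buf[w++] = buf[r];
    }
    return w;
}
```

The function works on a length rather than a NUL terminator, so the caller has to re-terminate the buffer at the returned length if it is going to be used as a C string.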
Open the file in text mode.
/* ... */
fopen(filename, "r"); /* notice no 'b' in mode */
/* ... */
Supposing you're on Windows ... on reading operations, the library is responsible for translating the literal "\r\n" present on disk to "\n"; and on writing operation, the library translates "\n" to "\r\n".

file size c is different than the size data string's size

I have a file I'm writing to and then changing the size of it to the size of text written to it something like:
FILE * file...
I get all the data from the file and change the file's size to the data's size, but the two differ. The string's size is smaller than the file length, so the file gets cut and loses data.
What might be the problem?
while (fgets(cLine, sizeof(cLine), file))
    str.append((string)cLine);
fputs(str.c_str(), file);
_chsize(fileno(file), (int)str.size());
When I checked it always fileLength(fileno(file)) is larger than str.size()!
Perhaps it's CRLF? Beware of:
fopen(filename, "r") vs fopen(filename, "rb"),
and likewise
fopen(filename, "w") vs fopen(filename, "wb").
The reason is that on Windows "r" or "w" will translate CRLF, while "rb" or "wb" will treat the data as binary. On most other platforms the "b" is ignored. For instance, the fopen man page on OS X says:
The mode string can also include the letter "b" either as a third character or as a character between the characters in any of the two-character strings described above. This is strictly for compatibility with ISO/IEC 9899:1990 ("ISO C90") and has no effect; the "b" is ignored.
The fopen page on MSDN says something different:
b: Open in binary (untranslated) mode; translations involving carriage-return and linefeed characters are suppressed.
If t or b is not given in mode, the default translation mode is defined by the global variable _fmode. If t or b is prefixed to the argument, the function fails and returns NULL.
For more information about using text and binary modes in Unicode and multibyte stream-I/O, see Text and Binary Mode File I/O and Unicode Stream I/O in Text and Binary Modes.
Depending on what you are doing in your code for cr/lf and what OS you are running, there could be some translating happening in the background when you read/write the file if you open it in text mode.
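To compare like with like, the on-disk size can be measured through a stream opened in binary mode. This is a sketch with a made-up helper name (file_size_bytes); it relies on fseek() to SEEK_END working on binary streams, which the C standard does not strictly guarantee but which mainstream platforms support:

```c
#include <stdio.h>

/* Size of the file in bytes as stored on disk, or -1 on error.
 * The file is opened with "rb" so that no CRLF translation can
 * change what is being measured. */
long file_size_bytes(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long size = -1;

    if (fp == NULL)
        return -1;
    if (fseek(fp, 0, SEEK_END) == 0)
        size = ftell(fp);      /* offset of the end equals the byte count */
    fclose(fp);
    return size;
}
```

On Windows, a string assembled from text-mode fgets() calls has one character fewer per CRLF line than this byte count, which would explain a str.size() that is smaller than the file length.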
Jonathan has hit the nail on the head.
Ensure that you are reading the file in binary format or if you are certain that the file only contains text (and that is all that you want) then be prepared for file characters to be in unicode or some other format.
You'll also find that extra control characters will be automatically added not least the EOF character.
My question though is why do you read the data from the file, only to write it back in again?
