Today I almost went crazy because the number of bytes I read didn't match the size of the XML file I was trying to read.
When I checked the content I was reading, I suspected a nasty non-printable character (\r), and I confirmed it with a simple program: the \r characters were not present in what I read.
My question is: why do fgetc/fgets ignore \r and return only \n, and how can I proceed if I want \r to be read?
Because they are designed to do so. On Windows the end of a line is a combination of two characters: '\r' (carriage return) followed by '\n' (new line).
When you open a file with the "r" mode, each such pair is converted to a single '\n', so on Windows you read one character fewer per line than the file holds on disk.
If you open with the "rb" mode, it will no longer convert the two characters to '\n' and you will be able to read it. This is the primary difference between the "b" and non "b" modes.
Note that this feature allows the file to be opened on different platforms without caring about this at all: you simply open it in text mode, "r" for input or "w" for output, and don't worry about the way the underlying system represents end of lines.
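The difference between the two modes can be observed with a minimal sketch like the one below. The helper names count_chars and write_crlf_sample and the sample file name are assumptions made for illustration, not part of the original answer. On Windows the "r" count should come out smaller than the "rb" count; on Linux both counts are the same.

```c
#include <stdio.h>

/* Count how many characters fgetc() delivers from `path` when the
 * file is opened with `mode` ("r" or "rb"). Returns -1 on failure. */
long count_chars(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    long n = 0;

    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}

/* Create a small file with CR/LF line endings, byte for byte.
 * Returns 0 on success, -1 on failure. */
int write_crlf_sample(const char *path)
{
    FILE *fp = fopen(path, "wb");   /* binary: no newline translation */

    if (fp == NULL)
        return -1;
    fputs("one\r\ntwo\r\n", fp);    /* 10 bytes on disk */
    fclose(fp);
    return 0;
}
```

Calling write_crlf_sample("sample.txt") and then comparing count_chars("sample.txt", "r") with count_chars("sample.txt", "rb") shows whether the C library on the current platform translates line endings.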
I have created a .mid file by writing bytes to a file and save it as .midi. I can run it and it works, but there are some special cases where it does not.
If I write a byte containing \n (ASCII 10), it writes 2 bytes \r\n instead, which makes the .mid file unplayable. (This is normal behavior on a Windows machine, but not desirable in my case.) One example of writing \n is when picking the key, which is represented by \n.
Is there a workaround to write \n and not \r\n or another way to make sure that byte written is ASCII 10 on a Windows machine?
Thanks!
On linux/unix, it doesn't matter whether you specify "wb" or "w" to create a file.
But creating a text file using fopen on Windows means that every \n is converted to \r\n, so if you use text mode to create binary files, they will be "corrupt" whenever the data contains a byte with value 10 (line feed).
Simple solution: always use fopen("file.bin","wb") when creating a binary file, on all platforms so your code is portable.
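As a rough sketch of that advice, the helper below writes raw bytes through a stream opened with "wb". The names write_binary and binary_size are hypothetical and exist only for this example; the three bytes in the usage note are arbitrary sample data, not a valid MIDI file.

```c
#include <stdio.h>

/* Write `len` raw bytes to `path`. The "wb" mode keeps a byte with
 * value 10 as a single byte on every platform. Returns 0 on success. */
int write_binary(const char *path, const unsigned char *data, size_t len)
{
    FILE *fp = fopen(path, "wb");
    size_t written;

    if (fp == NULL)
        return -1;
    written = fwrite(data, 1, len, fp);
    if (fclose(fp) != 0 || written != len)
        return -1;
    return 0;
}

/* Size of the file in bytes as seen in binary mode, or -1 on failure. */
long binary_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    long n = 0;

    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        n++;
    fclose(fp);
    return n;
}
```

Writing the bytes 0x4D, 0x0A, 0x54 with write_binary and then checking binary_size should report 3 on both Windows and Linux, because no \r is inserted before the 0x0A.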
I have a file with DOS line endings that I receive at run-time, so I cannot convert the line endings to UNIX-style offline. Also, my app runs on both Windows and Linux. My app does an fgets() on the file and tries to read in line-by-line.
Would the number of bytes read per line on Linux also account for 2 trailing characters (\r \n) or would it contain only (\n) and the \r would be discarded by the underlying system?
EDIT:
OK, so the line endings are preserved when reading a file on Linux, but I have run into another issue. On Windows, opening the file in "r" or "rb" mode behaves differently. Does Windows treat these two modes differently, unlike Linux?
fgets() keeps line endings.
http://msdn.microsoft.com/en-us/library/c37dh6kf(v=vs.80).aspx
fgets() itself doesn't have any special options for converting line endings, but on Windows, you can choose to either open a file in "binary" mode, or in "text" mode. In text mode Windows converts the CR/LF sequence (C string: "\r\n") into just a newline (C string: "\n"). It's a feature so that you can write the same code for Windows and Linux and it will work (you don't need "\r\n" on Windows and just "\n" on Linux).
http://msdn.microsoft.com/en-US/library/yeby3zcb(v=vs.80)
Note that the Windows call to fopen() takes the same arguments as the call to fopen() in Linux. The "binary" mode needs an extra character ('b') in the file mode, which is part of standard C, while the "text" mode is the default. So I suggest you just use the same code lines for Windows and Linux; the Windows version of fopen() is designed for that.
The Linux version of the C library doesn't have any tricky features. If the text file has CR/LF line endings, then that is what you get when you read it. Linux fopen() will accept a 'b' in the options, but ignores it!
http://linux.die.net/man/3/fopen
http://linux.die.net/man/3/fgets
On Unix, the lines would be read to the newline \n and would include the carriage return \r. You would need to trim both off the end.
Although the other answers gave satisfying information regarding the question of what kind of line ending is returned for a DOS file read under UNIX, I'd like to mention an alternative way to chop off such line endings.
The significant difference is that the following approach is multi-byte-character safe, as it does not involve any characters directly:
if (pszLine)
{
    /* Cut the string at the first '\r' or '\n', if there is one.
       strcspn() comes from <string.h>. */
    size_t size = strcspn(pszLine, "\r\n");
    pszLine[size] = 0;
}
You'll get what's actually in the file, including the \r characters. In unix there aren't text files and binary files, there are just files, and stdio doesn't do conversions. After reading a line into a buffer with fgets, you can do:
char *p = strrchr(buffer, '\r');
if(p && p[1]=='\n' && p[2]=='\0') {
    p[0] = '\n';
    p[1] = '\0';
}
That will change a terminating \r\n\0 into \n\0. Or you could just do p[0]='\0' if you don't want to keep the \n.
Note the use of strrchr, not strchr. There's nothing that prevents multiple \rs from being present in the middle of a line, and you probably don't want to truncate the line at the first one.
Answer to the EDIT section of the question: yes, the "b" in "rb" is a no-op in unix.
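The answers above can be combined into one small helper that works on both platforms. This is only a sketch: the name chomp is an assumption borrowed from other languages, not a standard C function. It removes the line terminator entirely rather than keeping the \n.

```c
#include <string.h>

/* Remove a trailing '\n' if present, then a trailing '\r' if present,
 * and return the new length. A '\r' in the middle of the line is
 * left alone. */
size_t chomp(char *line)
{
    size_t len = strlen(line);

    if (len > 0 && line[len - 1] == '\n')
        line[--len] = '\0';
    if (len > 0 && line[len - 1] == '\r')
        line[--len] = '\0';
    return len;
}
```

It would typically be called right after a successful fgets(buffer, sizeof buffer, fp), so the rest of the program sees the same text whether the file had DOS or UNIX line endings.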
#include <stdio.h>
#include <conio.h>  /* getche() and getch(): non-standard, DOS/Windows compilers only */
int main(void)
{
    int countch=0;
    int countwd=1;
    printf("Enter your sentence in lowercase: ");
    char ch='a';
    while(ch!='\r')
    {
        ch=getche();
        if(ch==' ')
            countwd++;
        else
            countch++;
    }
    printf("\n Words =%d ",countwd);
    printf("Characters = %d",countch-1);
    getch();
    return 0;
}
This is the program where I came across \r. What exactly is its role here? I am beginner in C and I appreciate a clear explanation on this.
'\r' is the carriage return character. The main times it would be useful are:
When reading text in binary mode, or text that may come from a foreign OS, you'll find (and probably want to discard) it due to CR/LF line endings from Windows-format text files.
When writing to an interactive terminal on stdout or stderr, '\r' can be used to move the cursor back to the beginning of the line, to overwrite it with new contents. This makes a nice primitive progress indicator.
The example code in your post is definitely a wrong way to use '\r'. It assumes a carriage return will precede the newline character at the end of a line entered, which is non-portable and only true on Windows. Instead the code should look for '\n' (newline), and discard any carriage return it finds before the newline. Or, it could use text mode and have the C library handle the translation (but text mode is ugly and probably should not be used).
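A portable variant along those lines might look like the sketch below. It is an illustration under assumptions, not the original poster's program: the function name count_line is made up here, it reads with standard fgetc() instead of getche(), and it counts a word at each transition from a space to a non-space character instead of starting the word count at 1.

```c
#include <stdio.h>

/* Read one line from `in` and count its words and its non-space
 * characters. The line ends at '\n' or EOF; any '\r' is skipped,
 * so CR/LF and LF input give the same counts. */
void count_line(FILE *in, int *words, int *chars)
{
    int ch;
    int in_word = 0;

    *words = 0;
    *chars = 0;
    while ((ch = fgetc(in)) != EOF && ch != '\n') {
        if (ch == '\r')
            continue;           /* ignore CR from CR/LF input */
        if (ch == ' ') {
            in_word = 0;
        } else {
            if (!in_word)
                (*words)++;     /* first character of a new word */
            in_word = 1;
            (*chars)++;
        }
    }
}
```

Called as count_line(stdin, &words, &chars), it behaves the same whether the terminal or file delivers "\r\n" or just "\n" at the end of the line.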
It's Carriage Return. Source: http://msdn.microsoft.com/en-us/library/6aw8xdf2(v=vs.80).aspx
The following repeats the loop until the user has pressed the Return key.
while(ch!='\r')
{
ch=getche();
}
Once upon a time, people had terminals like typewriters (with only upper-case letters, but that's another story). Search for 'Teletype', and how do you think tty got used for 'terminal device'?
Those devices had two separate motions. The carriage return moved the print head back to the start of the line without scrolling the paper; the line feed character moved the paper up a line without moving the print head back to the beginning of the line. So, on those devices, you needed two control characters to get the print head back to the start of the next line: a carriage return and a line feed. Because this was mechanical, it took time, so you had to pause for long enough before sending more characters to the terminal after sending the CR and LF characters. One use for CR without LF was to do 'bold' by overstriking the characters on the line. You'd write the line out once, then use CR to start over and print twice over the characters that needed to be bold. You could also, of course, type X's over stuff that you wanted partially hidden, or create very dense ASCII art pictures with judicious overstriking.
On Unix, all the logic for this stuff was hidden in a terminal driver. You could use the stty command and the underlying functions (in those days, ioctl() calls; they were sanitized into the termios interface by POSIX.1 in 1988) to tweak all sorts of ways that the terminal behaved.
Eventually, you got 'glass terminals' where the speeds were greater and there were new idiosyncrasies to deal with - Hazeltine glitches and so on and so forth. These got enshrined in the termcap and later terminfo libraries, and then further encapsulated behind the curses library.
However, some other (non-Unix) systems did not hide things as well, and you had to deal with CRLF in your text files - and no, this is not just Windows and DOS that were in the 'CRLF' camp.
Anyway, on some systems, the C library has to deal with text files that contain CRLF line endings and presents those to you as if there were only a newline at the end of the line. However, if you choose to treat the text file as a binary file, you will see the CR characters as well as the LF.
Systems like the old Mac OS (version 9 or earlier) used just CR (aka \r) for the line ending. Systems like DOS and Windows (and, I believe, many of the DEC systems such as VMS and RSTS) used CRLF for the line ending. Many of the Internet standards (such as mail) mandate CRLF line endings. And Unix has always used just LF (aka NL or newline, hence \n) for its line endings. And most people, most of the time, manage to ignore CR.
Your code is rather funky in looking for \r. On a system compliant with the C standard, you won't see the CR unless the file is opened in binary mode; the CRLF or CR will be mapped to NL by the C runtime library.
There are a few characters which can indicate a new line. The usual ones are these two:
'\n' or '0x0A' (10 in decimal) -> This character is called "Line Feed" (LF).
'\r' or '0x0D' (13 in decimal) -> This one is called "Carriage return" (CR).
Different Operating Systems handle newlines in a different way. Here is a short list of the most common ones:
DOS and Windows
They expect a newline to be the combination of two characters, namely '\r\n' (or 13 followed by 10).
Unix (and hence Linux as well)
Unix uses a single '\n' to indicate a new line.
Classic Mac OS (version 9 and earlier)
It used a single '\r'. Modern macOS is Unix-based and uses '\n'.
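The three conventions in the list can be told apart by inspecting the raw bytes of a file read in binary mode. The following is a minimal sketch; the enum and the function name detect_eol are hypothetical, and it only looks at the first line ending it finds, so a file with mixed endings is not detected as such.

```c
#include <stddef.h>

enum eol_style { EOL_NONE, EOL_LF, EOL_CRLF, EOL_CR };

/* Report the style of the first line ending found in buf[0..len).
 * A '\r' that is the very last byte of the buffer is reported as
 * EOL_CR, because the following byte cannot be examined. */
enum eol_style detect_eol(const char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return EOL_LF;
        if (buf[i] == '\r')
            return (i + 1 < len && buf[i + 1] == '\n') ? EOL_CRLF : EOL_CR;
    }
    return EOL_NONE;
}
```

The buffer should come from fread() on a stream opened with "rb"; in text mode on Windows the library would already have hidden the '\r' characters.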
That is not always true; it only works in Windows.
When interacting with a terminal (PuTTY, a Linux shell, ...), '\r' is used to return the cursor to the beginning of the line.
The following two cases show the effect:
Without '\r':
Data arrives at the PuTTY terminal without '\r'; it has just '\n'.
This means the cursor only moves down to the next line, and the data is printed from there.
With '\r':
Data arrives with '\r', i.e. the string ends with '\r\n'. So the cursor in the PuTTY terminal not only goes to the next line but also moves to the beginning of that line.
It depends upon which platform you're on as to how it will be translated and whether it will be there at all: Wikipedia entry on newline
\r is the escape sequence for the carriage return control character. It brings the cursor back to the beginning of the line, so that whatever is written after the \r (like: \rhello) overwrites the existing content:
#include <stdio.h>
int main(void)
{
    printf("Hello \rworld");
    return 0;
}
On a terminal, the output of the program appears as world, not Hello world,
because \r has put the cursor at the beginning of the line and Hello has been overwritten with world.
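The same overwrite effect is what makes a simple progress indicator possible. The sketch below is only an illustration; the function names format_progress and show_progress and the message layout are assumptions, and the result depends on the output being a terminal that honours '\r'.

```c
#include <stdio.h>

/* Format a one-line progress message that starts with '\r', so a
 * terminal draws it over the previous message. Returns the value
 * returned by snprintf(). */
int format_progress(char *buf, size_t size, int percent)
{
    return snprintf(buf, size, "\rProgress: %3d%%", percent);
}

/* Print the message to `out` without a newline and flush at once,
 * since line-buffered streams would otherwise hold it back. */
void show_progress(FILE *out, int percent)
{
    char msg[32];

    format_progress(msg, sizeof msg, percent);
    fputs(msg, out);
    fflush(out);
}
```

Calling show_progress(stderr, p) in a loop with increasing p, and printing a final "\n" when the work is done, keeps the indicator on a single line.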
I have written some data to a file manually i.e. not by my application.
My code is reading the data char by char and storing them in different arrays but my program gets stuck when I insert the condition EOF.
After some investigation I found out that in my file before EOF there are three to four \n characters. I have not inserted them. I don't understand why they are in my file.
Want to remove those pesky extra characters? First, see how many of them there are at the end of your file:
od -c <filename> | tail
Then, remove however many characters you don't like. If it's 3:
truncate -s -3 <filename>
But overall, if it were me, I'd change my program to discard undesired newline characters, unless they're truly invalid according to the input file format specification.
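Discarding the extra newlines inside the program could be as simple as the sketch below, which assumes the file contents have already been read into a buffer. The name trimmed_length is hypothetical.

```c
#include <stddef.h>

/* Return the length of buf[0..len) once trailing '\n' and '\r'
 * characters are ignored. The buffer itself is not modified. */
size_t trimmed_length(const char *buf, size_t len)
{
    while (len > 0 && (buf[len - 1] == '\n' || buf[len - 1] == '\r'))
        len--;
    return len;
}
```

The parsing loop can then stop at the returned length instead of the raw file size, so blank lines added by an editor at the end of the file never reach it.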
It is very easy to add additional newlines to the end of a file in every text editor. You have to push the cursor around to see them. Open your file in your editor and see what happens when you navigate to the end, you'll see the extra newlines.
There is no such thing as an EOF character in general. Windows treats control-Z as EOF in some cases. Perhaps you are talking about the return value from some API that indicates that it has reached the end of file?
I'm reading from a file line by line, but a garbage character that looks like a space (\r) is being added to what I read. I don't understand why, since there is no such character visible in the file I'm reading from. I have tried both fread and fgets and get the same problem with each. Please reply if you have a solution for this problem.
The file was probably edited/created on Windows. Windows uses \r\n as a line delimiter. When you read the file, you must strip the \r manually. Since most editors treat \r\n as a single character (line end), you can't "see" it but it's still in the file. Use a hex editor if you want to see it or a tool like od.
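If no hex editor is at hand, the program itself can make the hidden characters visible. The helper below is a sketch with a made-up name, escape_eol; it copies a buffer and spells out every CR and LF as the two-character sequences \r and \n.

```c
#include <stddef.h>

/* Copy src[0..len) into dst, writing the two characters '\' 'r' or
 * '\' 'n' in place of each CR or LF byte. dst is always
 * NUL-terminated when dst_size > 0; the copy stops early if dst is
 * too small. Returns the number of characters written, not counting
 * the terminating NUL. */
size_t escape_eol(const char *src, size_t len, char *dst, size_t dst_size)
{
    size_t i, o = 0;

    if (dst_size == 0)
        return 0;
    for (i = 0; i < len; i++) {
        char c = src[i];

        if (c == '\r' || c == '\n') {
            if (o + 2 >= dst_size)      /* need 2 chars plus NUL */
                break;
            dst[o++] = '\\';
            dst[o++] = (c == '\r') ? 'r' : 'n';
        } else {
            if (o + 1 >= dst_size)      /* need 1 char plus NUL */
                break;
            dst[o++] = c;
        }
    }
    dst[o] = '\0';
    return o;
}
```

Printing the escaped copy of each line returned by fgets() shows at a glance whether it ends in \r\n or just \n.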
Open the file in text mode.
/* ... */
fopen(filename, "r"); /* notice no 'b' in mode */
/* ... */
Supposing you're on Windows ... on read operations, the library is responsible for translating the literal "\r\n" present on disk to "\n"; and on write operations, the library translates "\n" to "\r\n".