I have created a .mid file by writing bytes to a file and saving it with the .mid extension. I can play it and it works, but there are some special cases where it does not.
If I write a byte with the value \n (ASCII 10), two bytes, \r\n, are written instead, which makes the .mid file unplayable. (This is normal behavior for a Windows machine, but not desirable in my case.) An example of writing \n is when the key I pick happens to be represented by the value 10.
Is there a workaround to write \n and not \r\n, or another way to make sure the byte written is ASCII 10 on a Windows machine?
Thanks!
On Linux/Unix, it doesn't matter whether you pass "wb" or "w" when creating a file.
But creating a text file with fopen on Windows means that every \n is converted to \r\n, so if you use text mode to create binary files, the binary files will be "corrupt" whenever a byte has the value 10 (linefeed).
Simple solution: always use fopen("file.bin", "wb") when creating a binary file, on all platforms, so your code is portable.
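A minimal sketch of that fix, assuming a hypothetical file name and byte value (this is not a valid MIDI stream, just a demonstration of the mode):

#include <stdio.h>

int main(void)
{
    /* "wb" disables the text-mode \n -> \r\n translation on Windows. */
    FILE *f = fopen("song.mid", "wb");
    if (!f)
        return 1;

    unsigned char key = 0x0A; /* ASCII 10; in "w" mode this byte would come out as 0x0D 0x0A */
    fwrite(&key, 1, 1, f);

    fclose(f);
    return 0;
}

Opened afterwards in a hex editor, the file should contain the single byte 0A, with no 0D in front of it.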
Intro
There is a "feature" in Windows that adds a carriage return character before every newline written in text mode. This is old behavior, but I am having trouble finding specific solutions to the problem.
Problem
If you create a file on Windows with fopen("file.txt", "w"), every \n your code writes is converted to \r\n in the file. This creates a problem when you try to read the file on Linux, because there is now an unaccounted-for character in the mix, and most line reads depend on reading up to \n.
Research
I created text ("w") and binary ("wb") files on Windows and Linux, and tried to read and compare them with the same files made on the other OS (a minimal version of the test is sketched after the findings below).
Binary files do not get an added carriage return, and seem to work fine.
Text files, on the other hand, are a mixed bag:
On Windows, comparing the Windows text file (\r\n) to the Linux text file (\n) results in them being read as equal (this is apparently a feature of some Windows C implementations: \r\n is automatically read back as \n).
On Linux, comparing the same files results in them not being equal; the \r is not handled and creates a reading problem.
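Here is a minimal sketch of the test, with hypothetical file names: it writes the same line in text and binary mode, then dumps the raw bytes of each file. On Windows the text-mode dump should show an extra 0D before the 0A; on Linux both dumps should be identical.

#include <stdio.h>

static void dump(const char *path)
{
    /* Reopen in binary mode so every byte is visible, untranslated. */
    FILE *f = fopen(path, "rb");
    int c;
    if (!f)
        return;
    printf("%s:", path);
    while ((c = fgetc(f)) != EOF)
        printf(" %02X", c);
    printf("\n");
    fclose(f);
}

int main(void)
{
    FILE *t = fopen("text.txt", "w");  /* text mode: \n may become \r\n */
    FILE *b = fopen("bin.txt", "wb");  /* binary mode: bytes pass through */
    if (!t || !b)
        return 1;
    fputs("abc\n", t);
    fputs("abc\n", b);
    fclose(t);
    fclose(b);
    dump("text.txt");
    dump("bin.txt");
    return 0;
}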
Conclusion
I would like to know how to handle this so my text files can be truly cross-platform. I have seen a lot of posts about this, but most make mixed and contradictory claims, and almost none offer specific solutions.
So please, can we solve this problem once and for all after 40 years? Thank you.
I need some help.
I'm writing a program that opens 2 source files in UTF-8 encoding without a BOM. The first contains English text and some other information, including an ID. The second contains only the string ID and a translation. The program transforms every string from the first file by replacing the English text with the Russian translation from the second file, and writes these strings to an output file. Everything seems to be OK, but a BOM appears in the destination file. I want to create the file without a BOM, like the sources.
I open the files with the fopen function in text mode with ccs=UTF-8,
read each string with the fgetws function into a wchar_t buffer,
and write it with the fputws function to the output file.
Don't use text mode, don't use the Microsoft ccs= extension to fopen, and don't use fputws. Instead, open the files with fopen in binary mode and write the correct UTF-8 yourself.
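A minimal sketch of that approach, with hypothetical file names: since both inputs are already UTF-8 on disk, the program can work on raw bytes, and binary mode guarantees the runtime never writes a BOM. The actual translation logic is omitted; a plain byte-for-byte copy stands in for it here.

#include <stdio.h>

int main(void)
{
    /* Binary mode: no line-ending translation, no BOM written. */
    FILE *in = fopen("source.txt", "rb");
    FILE *out = fopen("dest.txt", "wb");
    char buf[4096];
    size_t n;

    if (!in || !out)
        return 1;

    /* The UTF-8 bytes pass through untouched; real processing would
       operate on this byte buffer instead of on wchar_t strings. */
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);

    fclose(in);
    fclose(out);
    return 0;
}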
Today I almost went crazy because the number of bytes I read didn't match the size of the XML file I was trying to read.
Then, when I checked the content of the file, I figured it must be a nasty non-printable character (\r), and I checked that with a simple program: the \r characters were not present.
My question is: why do fgetc/fgets ignore \r and pick up only \n, and if I want \r to be read, how can I proceed?
Because they are designed to do so. On the Windows OS the end of a line is a combination of two characters: '\r' (carriage return) followed by '\n' (newline).
When you open a file in "r" mode, each such pair is converted to a single '\n', so on Windows one character will appear to be missing from each line.
If you open the file in "rb" mode, the two characters are no longer converted to '\n' and you will be able to read the \r. This is the primary difference between the "b" and non-"b" modes.
Note that this feature allows the file to be opened on different platforms without caring about any of this: you simply open it in text mode ("r" for input, "w" for output) and don't worry about the way the underlying system represents line endings.
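A minimal sketch of the difference, assuming a hypothetical demo.txt that contains the three bytes "a\r\n": on Windows the "r" dump should show 61 0A, while the "rb" dump shows 61 0D 0A.

#include <stdio.h>

static void show(const char *mode)
{
    FILE *f = fopen("demo.txt", mode);
    int c;
    if (!f)
        return;
    printf("%-2s:", mode);
    while ((c = fgetc(f)) != EOF)
        printf(" %02X", c);
    printf("\n");
    fclose(f);
}

int main(void)
{
    show("r");  /* text mode: each \r\n pair comes back as a single \n */
    show("rb"); /* binary mode: the \r bytes are visible */
    return 0;
}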
I was accessing a file in code with both C# and C++. When the file is opened in Notepad it looks like this (one integer at the left, and the rest of the numbers are doubles):
But the same file, when opened with WordPad, looks like this (one integer next to each double):
Why do they look different?
It has to do with the way newlines are encoded in your file. Windows recognizes a newline as consisting of two characters (\r\n), whereas Unix-based operating systems use only \n (and classic Mac OS used only \r). WordPad is smart enough to recognize both newline types, but Notepad is not.
Because Notepad and WordPad parse files differently; apparently this file is written in a way that the two programs display differently...
Because Notepad and WordPad understand \r\n differently.
Notepad and WordPad treat "new line" differently: one accepts just \n, the other requires \r\n to recognize a new line (and some programs would even be OK with \n\r).
The same goes for many other editors. For example, if you open the file in Visual Studio, it is likely to ask something like "Do you want to convert Unix new lines to Windows new lines?"
If you are writing the file with C#, use WriteLine rather than manually adding \n, or at least use Environment.NewLine to write the line break to the stream.
Similarly, in C++ you can write "\r\n" instead of just "\n" if you must open the file in Notepad or another editor that requires that sequence (most editors/viewers are OK with either).
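A minimal sketch of that from plain C, with a hypothetical file name; writing in binary mode keeps the runtime from inserting a second \r in front of the one written explicitly:

#include <stdio.h>

int main(void)
{
    /* Binary mode, so each explicit \r\n is written exactly once. */
    FILE *f = fopen("output.txt", "wb");
    if (!f)
        return 1;

    fputs("first line\r\n", f);  /* \r\n: a line break Notepad recognizes */
    fputs("second line\r\n", f);

    fclose(f);
    return 0;
}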
I have a file with DOS line endings that I receive at run time, so I cannot convert the line endings to Unix-style offline. Also, my app runs on both Windows and Linux. My app does an fgets() on the file and tries to read it in line by line.
Would the number of bytes read per line on Linux also account for the 2 trailing characters (\r\n), or would it contain only \n, with the \r discarded by the underlying system?
EDIT:
OK, so the line endings are preserved while reading a file on Linux, but I have run into another issue: on Windows, opening the file in "r" or "rb" behaves differently. Does Windows treat these two modes distinctly, unlike Linux?
fgets() keeps line endings.
http://msdn.microsoft.com/en-us/library/c37dh6kf(v=vs.80).aspx
fgets() itself doesn't have any special options for converting line endings, but on Windows you can choose to open a file either in "binary" mode or in "text" mode. In text mode, Windows converts the CR/LF sequence (C string: "\r\n") into just a newline (C string: "\n"). It's a feature so that you can write the same code for Windows and Linux and it will work (you don't need to write "\r\n" on Windows and "\n" on Linux).
http://msdn.microsoft.com/en-US/library/yeby3zcb(v=vs.80)
Note that the Windows fopen() takes the same arguments as fopen() on Linux. "Binary" mode needs an extra character ('b') in the mode string, while "text" mode is the default. So I suggest you just use the same lines of code for Windows and Linux; the Windows version of fopen() is designed for that.
The Linux version of the C library doesn't do any tricky conversions. If the text file has CR/LF line endings, then that is what you get when you read it. Linux fopen() will accept a 'b' in the mode string, but ignores it!
http://linux.die.net/man/3/fopen
http://linux.die.net/man/3/fgets
On Unix, the lines would be read up to the newline \n and would include the carriage return \r. You would need to trim both off the end.
Although the other answers gave satisfying information regarding what kind of line ending is returned for a DOS file read under Unix, I'd like to mention an alternative way to chop off such line endings.
The significant difference is that the following approach is multi-byte-character safe, as it does not handle any characters directly:
if (pszLine)
{
    /* Truncate at the first '\r' or '\n', whichever comes first;
       strcspn is safe on any length, so no explicit length check
       is needed. */
    size_t size = strcspn(pszLine, "\r\n");
    pszLine[size] = 0;
}
You'll get what's actually in the file, including the \r characters. On Unix there aren't text files and binary files; there are just files, and stdio doesn't do conversions. After reading a line into a buffer with fgets, you can do:
/* Replace a trailing "\r\n" with just "\n". */
char *p = strrchr(buffer, '\r');
if (p && p[1] == '\n' && p[2] == '\0') {
    p[0] = '\n';
    p[1] = '\0';
}
That will change a terminating \r\n\0 into \n\0. Or you could just do p[0]='\0' if you don't want to keep the \n.
Note the use of strrchr, not strchr. There's nothing that prevents multiple \rs from being present in the middle of a line, and you probably don't want to truncate the line at the first one.
Answer to the EDIT section of the question: yes, the 'b' in "rb" is a no-op on Unix.