I'm trying to write a file containing wide characters using MinGW C on Windows, but the wide characters seem to be omitted. My code:
const wchar_t* str = L"příšerně žluťoučký kůň úpěl ďábelské ódy";
FILE* fd = fopen("file.txt","w");
// FILE* fd = _wfopen(L"demo.txgs",L"w"); // attempt to open wide file doesn't help
fwide(fd,1); // attempt to force wide mode, doesn't help
fwprintf(fd,L"%ls",str);
// fputws(p,fd); // stops output after writing "p" (1B file size)
fclose(fd);
File contents
píern luouký k úpl ábelské ódy
The file size is 30B, so the wide chars are really missing. How to convince the compiler to write them?
As @chqrlie suggests in the comments: the result of
fwrite(str, 1, sizeof(L"příšerně žluťoučký kůň úpěl ďábelské ódy"), fd);
is 82 (I guess 2*30 + 2*10 (omitted chars) + 2 (wide trailing zero)).
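The arithmetic behind that 82 can be checked directly. This is a minimal sketch using a hypothetical helper, wide_literal_bytes, which is not part of the original code. On MinGW wchar_t is 2 bytes, so 40 characters plus the terminator give 41 * 2 = 82; on typical Linux systems wchar_t is 4 bytes and the same string would occupy 164 bytes.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: the number of bytes a wide string occupies in memory,
   counting the terminating L'\0', i.e. what sizeof reports for the literal. */
size_t wide_literal_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}
```

Note that this is the in-memory size, not the size the text takes in a file: the stream converts each wide character to a multibyte sequence on output.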
It also might be useful to quote from here
The external representation of wide characters in files are multibyte
characters: These are obtained as if wcrtomb was called to convert
each wide character (using the stream's internal mbstate_t object).
Which explains why the ISO-8859-1 chars are single byte in the file, but I don't know how to use this information to solve my problem. Doing the opposite task (reading multibyte UTF-8 into wide chars) I failed to use mbtowc and ended up using winAPI's MultiByteToWideChar.
I am not a Windows user, but you might try this:
const wchar_t *str = L"příšerně žluťoučký kůň úpěl ďábelské ódy";
FILE *fd = fopen("file.txt", "w,ccs=UTF-8");
fwprintf(fd, L"%ls", str);
fclose(fd);
I got this idea from this question: How do I write a UTF-8 encoded string to a file in windows, in C++
I figured this out. The internal use of wcrtomb (mentioned in the details of my question) needs a setlocale call, but that call fails with UTF-8 on Windows, so I used the WinAPI instead:
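char output[100]; // not wchar_t: the file receives UTF-8 bytes
// with a NULL buffer the call only returns the required size (trailing zero included)
int len = WideCharToMultiByte(CP_UTF8,0,str,-1,NULL,0,NULL,NULL);
if(len>0 && len<=(int)sizeof(output)) { // a too-small buffer would make the conversion fail
WideCharToMultiByte(CP_UTF8,0,str,-1,output,len,NULL,NULL);
fputs(output,fd);
}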
And voila! The file is 56B long with expected UTF-8 contents:
příšerně žluťoučký kůň úpěl ďábelské ódy
I hope this saves other Windows coders some nerves.
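For readers without the WinAPI at hand, the conversion itself can be sketched in portable C. This is not the method used above (that one relies on WideCharToMultiByte); it is a hand-rolled illustration with a hypothetical helper, wide_to_utf8, and it assumes every character lies in the Basic Multilingual Plane, which holds for the Czech sentence in the question.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of any library: encodes a wide string as
   UTF-8 into out and terminates it. Returns the number of bytes written
   (terminator excluded), or (size_t)-1 if the buffer is too small or a
   character lies outside the Basic Multilingual Plane. */
size_t wide_to_utf8(const wchar_t *ws, char *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0)
        return (size_t)-1;
    for (; *ws != L'\0'; ++ws) {
        unsigned long c = (unsigned long)*ws;
        size_t need;

        if (c > 0xFFFFUL || (c >= 0xD800UL && c <= 0xDFFFUL))
            return (size_t)-1;           /* surrogates and non-BMP: not handled here */
        need = c < 0x80UL ? 1 : c < 0x800UL ? 2 : 3;
        if (n + need >= cap)
            return (size_t)-1;           /* keep one byte for the terminator */
        if (need == 1) {
            out[n++] = (char)c;
        } else if (need == 2) {
            out[n++] = (char)(0xC0 | (c >> 6));
            out[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            out[n++] = (char)(0xE0 | (c >> 12));
            out[n++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

The resulting buffer can be written with fputs or fwrite exactly like the output array above. Each of the 16 accented letters in the sentence takes two bytes, which is where the 56-byte file size comes from (40 characters + 16 extra bytes).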
long int getFileSize(const wchar_t* path)
{
long int res = -1;
char file_name[MAX_PATH];
wcstombs(file_name, path, MAX_PATH);
FILE* fp = fopen(file_name, "r");
if (fp != NULL) {
fseek(fp, 0L, SEEK_END);
res = ftell(fp);
fclose(fp);
}
return res;
}
This is how I get the file size, but if the file path contains a Turkish character such as ğ, İ, or ı, the result is -1.
From my research, I learned that calling setlocale would fix my problem, but it causes other errors in my project.
FILE* fp = _wfopen(path, L"w,ccs=UTF-8"); // I tried this, but it does not work
The char APIs on Windows all use the "current code page" for string encoding. On US computers, this is usually code page 1252. These encodings do not contain most Unicode characters, so most Unicode strings cannot be converted to the current encoding. For instance, Windows-1252 has no encoding for the Turkish characters ğ, İ, and ı. When you pass those characters to wcstombs, it stops at the first character it cannot convert and returns (size_t)-1 to signal failure, but your code does not check whether the conversion succeeded. To guarantee success, all char APIs must be limited to characters that the current code page can encode. Since each computer might use a different code page, you are effectively limited to characters that every code page can encode, which is basically ASCII.
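The missing check can be sketched as follows. The helper name narrow_path is hypothetical; the only library call is the standard wcstombs, whose failure value is (size_t)-1 and which leaves the output unterminated when the result fills the buffer exactly.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: converts a wide path to the current locale's
   multibyte encoding. Returns 0 on success, -1 if a character cannot be
   converted or the result (with its terminator) does not fit in out. */
int narrow_path(const wchar_t *wide, char *out, size_t cap)
{
    size_t n;

    if (cap == 0)
        return -1;
    n = wcstombs(out, wide, cap);
    if (n == (size_t)-1)
        return -1;      /* unconvertible character, e.g. Turkish letters in code page 1252 */
    if (n == cap)
        return -1;      /* buffer filled completely, no room left for the terminator */
    return 0;
}
```

With a check like this, getFileSize could report a conversion failure separately from a failed fopen instead of silently opening a garbage file name.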
Since you clearly want to interact with APIs using Turkish characters, your options are either to change the current code page to one that contains those characters (code page 857 (Turkish) or code page 65001 (UTF-8)) and make sure every single char string in your app is encoded with that code page, or to use wchar_t APIs, which always use the UTF-16 encoding on Windows. Standard C has no file functions that take wide-character paths, so you'll have to use Microsoft's CRT extensions such as _wfopen, C++ APIs, or other native Windows APIs.
You probably want to use FindFirstFileExW and GetFileInformationByHandleEx.
Thanks everyone, I solved it like this:
long int getFileSize(const wchar_t* path)
{
long int res = -1;
FILE *fp = _wfopen(path, L"rb"); /* wide path used directly, no copy needed; binary mode so ftell is a byte count */
if (fp != NULL) {
fseek(fp, 0L, SEEK_END);
res = ftell(fp);
fclose(fp);
}
return res;
}
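The fseek/ftell part of that function can also be written with its error checks spelled out. This is a sketch with a hypothetical helper name, stream_size; it assumes the caller opened the stream in binary mode (for example with _wfopen and L"rb" on Windows), because ftell on a text stream is not guaranteed to be a byte count.

```c
#include <stdio.h>

/* Hypothetical helper: size in bytes of a stream opened in binary mode.
   Returns -1 on any failure and restores the original position on success. */
long stream_size(FILE *fp)
{
    long start, size;

    if (fp == NULL)
        return -1;
    start = ftell(fp);
    if (start < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(fp);
    if (size < 0 || fseek(fp, start, SEEK_SET) != 0)
        return -1;
    return size;
}
```

Because long is 32 bits on Windows, this approach cannot report sizes of 2 GiB or more there.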
What translation occurs when writing to a file that was opened in text mode that does not occur in binary mode? Specifically in MS Visual C.
unsigned char buffer[256];
for (int i = 0; i < 256; i++) buffer[i]=i;
int size = 1;
int count = 256;
Binary mode:
FILE *fp_binary = fopen(filename, "wb");
fwrite(buffer, size, count, fp_binary);
Versus text mode:
FILE *fp_text = fopen(filename, "wt");
fwrite(buffer, size, count, fp_text);
I believe that most platforms will ignore the "t" option or the "text-mode" option when dealing with streams. On Windows, however, this is not the case. If you take a look at the description of the fopen() function on MSDN, you will see that specifying the "t" option has the following effects:
line feeds ('\n') will be translated to '\r\n' sequences on output
carriage return/line feed sequences will be translated to line feeds on input.
If the file is opened in append mode, the end of the file will be examined for a Ctrl-Z character (character 26) and that character removed, if possible. It will also interpret the presence of that character as being the end of file. This is an unfortunate holdover from the days of CP/M (something about the sins of the parents being visited upon their children up to the 3rd or 4th generation). Contrary to previously stated opinion, the Ctrl-Z character will not be appended.
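To make the output translation concrete, here is a small in-memory emulation. It is only a sketch of the rule described above, not the CRT's implementation, and the name expand_newlines is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: copies len bytes from in to out, expanding each '\n'
   to "\r\n" the way a Windows text-mode stream does on output.
   Returns the number of bytes produced, or (size_t)-1 if out is too small. */
size_t expand_newlines(const char *in, size_t len, char *out, size_t cap)
{
    size_t i, n = 0;

    for (i = 0; i < len; ++i) {
        if (in[i] == '\n') {
            if (n + 2 > cap)
                return (size_t)-1;
            out[n++] = '\r';
            out[n++] = '\n';
        } else {
            if (n + 1 > cap)
                return (size_t)-1;
            out[n++] = in[i];
        }
    }
    return n;
}
```

Applied to the 256-byte buffer from the question, only byte value 10 is affected, so under this rule the text-mode file would be 257 bytes long and the binary-mode file 256.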
In text mode, a newline "\n" may be converted to a carriage return + newline "\r\n"
Usually you'll want to open in binary mode. Reading binary data in text mode won't work: the data will be corrupted. You can read text fine in binary mode, though; you just won't get the automatic translation between "\n" and "\r\n".
See fopen
Additionally, when you fopen a file with "rt", the input is terminated at a Ctrl-Z character.
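The input side can be emulated the same way. This is a simplified sketch of the two rules mentioned in this thread (CR LF collapses to LF, Ctrl-Z ends the input), with a hypothetical helper name, text_mode_view; the real CRT works on the stream buffer rather than on a complete array.

```c
#include <stddef.h>

/* Hypothetical helper: emulates what a Windows text-mode read does to raw
   file bytes: "\r\n" collapses to '\n' and Ctrl-Z (0x1A) is treated as end
   of file. out must have room for len bytes. Returns the number of bytes kept. */
size_t text_mode_view(const char *raw, size_t len, char *out)
{
    size_t i, n = 0;

    for (i = 0; i < len; ++i) {
        if (raw[i] == 0x1A)
            break;                                  /* Ctrl-Z ends the input */
        if (raw[i] == '\r' && i + 1 < len && raw[i + 1] == '\n')
            continue;                               /* drop the CR of a CR LF pair */
        out[n++] = raw[i];
    }
    return n;
}
```

This is why binary data read through a text-mode stream is corrupted: any 0x1A byte truncates it and any 0x0D 0x0A pair loses a byte.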
Another difference is when using fseek
If the stream is open in binary mode, the new position is exactly offset bytes measured from the beginning of the file if origin is SEEK_SET, from the current file position if origin is SEEK_CUR, and from the end of the file if origin is SEEK_END. Some binary streams may not support the SEEK_END.
If the stream is open in text mode, the only supported values for offset are zero (which works with any origin) and a value returned by an earlier call to ftell on a stream associated with the same file (which only works with an origin of SEEK_SET).
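The one portable way to move around a text stream, then, is to go back to a position that ftell reported earlier. A minimal sketch, using a hypothetical helper named reread_line:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads the next line twice by remembering the position
   with ftell and returning to it with fseek(..., SEEK_SET), the only kind of
   non-zero offset that is portable on a text stream.
   Returns 1 if both reads produced the same line, 0 otherwise. */
int reread_line(FILE *fp, char *first, char *second, int cap)
{
    long mark = ftell(fp);

    if (mark < 0 || fgets(first, cap, fp) == NULL)
        return 0;
    if (fseek(fp, mark, SEEK_SET) != 0 || fgets(second, cap, fp) == NULL)
        return 0;
    return strcmp(first, second) == 0;
}
```

Computing an offset by hand (for example, adding a character count to a position) is what breaks in text mode, because characters and bytes no longer correspond one to one.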
Even though this question was already answered and clearly explained, I think it would be interesting to show the main issue (translation between \n and \r\n) with a simple code example. Note that I'm not addressing the issue of the Ctrl-Z character at the end of the file.
#include <stdio.h>
#include <string.h>
int main() {
FILE *f;
char string[] = "A\nB";
int len;
len = strlen(string);
printf("As you'd expect string has %d characters... ", len); /* prints 3*/
f = fopen("test.txt", "w"); /* Text mode */
fwrite(string, 1, len, f); /* On Windows "A\r\nB" is written */
printf ("but %ld bytes were written to file", ftell(f)); /* prints 4 on Windows, 3 on Linux*/
fclose(f);
return 0;
}
If you execute the program on Windows, you will see the following message printed:
As you'd expect string has 3 characters... but 4 bytes were written to file
You can also open the file in a text editor such as Notepad++ and see the characters for yourself.
The inverse conversion is performed on Windows when reading the file in text mode.
We had an interesting problem with opening files in text mode where the files had a mixture of line ending characters:
1\n\r
2\n\r
3\n
4\n\r
5\n\r
Our requirement is that we can store our current position in the file (we used fgetpos), close the file, and later reopen it and seek to that position (we used fsetpos).
However, when a file has a mixture of line endings, this process failed to seek to the same position. In our case (our tool parses C++), we were re-reading parts of the file we'd already seen.
Go with binary - then you can control exactly what is read and written from the file.
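One way to take that control is to read lines yourself from a binary stream and normalise the endings in your own code. The sketch below uses a hypothetical helper, read_line_any_eol, and makes a simplifying assumption: every '\r' is discarded, so a lone '\r' is not treated as a line break.

```c
#include <stdio.h>

/* Hypothetical helper for a stream opened in binary mode ("rb"): reads one
   line into buf, ending at '\n' and discarding every '\r', so "\n", "\r\n"
   and "\n\r" endings all look the same to the caller. cap must be at least 1.
   Returns the line length, or -1 at end of file when nothing was read. */
int read_line_any_eol(FILE *fp, char *buf, int cap)
{
    int c, n = 0, got = 0;

    while ((c = fgetc(fp)) != EOF) {
        got = 1;
        if (c == '\n')
            break;
        if (c == '\r')
            continue;
        if (n < cap - 1)
            buf[n++] = (char)c;     /* overlong lines are truncated */
    }
    if (!got)
        return -1;
    buf[n] = '\0';
    return n;
}
```

Since the stream is binary, positions saved with fgetpos or ftell are plain byte offsets and remain valid after closing and reopening the file, whatever mixture of line endings it contains.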
In 'w' mode, the file is opened for writing in text mode, so the newline translation described above applies.
In 'wb' mode, the file is opened for writing in binary mode, and the bytes are written exactly as given. Neither mode selects an encoding such as UTF-8 or UTF-16LE by itself; with the Microsoft runtime that is done with the ccs= flag.
I have some code that works perfectly well on Linux, but on Windows it only works as expected when compiled with Cygwin, which emulates a Linux environment on Windows but is bad for portability (Cygwin must be installed for the compiled binary to work). The program does the following:
Opens a document in read mode with ccs=UTF-8 and reads it char by char.
Writes the Braille Unicode pattern (U+2800..U+28FF) corresponding to each letter, number, or punctuation mark to a 'dest' document (opened in write mode with ccs=UTF-8).
Significant code:
const char *brai[26] = {
"⠁","⠃","⠉","⠙","⠑","⠋","⠛","⠓","⠊","⠚",
"⠅","⠇","⠍","⠝","⠕","⠏","⠟","⠗","⠎","⠞",
"⠥","⠧","⠭","⠽","⠵","⠺"
};
int main(void) {
setlocale(LC_ALL, "es_MX.UTF-8");
FILE *source = fopen(origen, "r, ccs=UTF-8");
FILE *dest = fopen(destino, "w, ccs=UTF-8");
unsigned int letra;
while ((letra = fgetc(source)) != EOF) {
// This next line is the problem, I guess:
fwprintf(dest, L"%s", "⠷"); // Prints directly the braille sign as a char[]
// OR prints it from an array that contains the exact same sign.
fwprintf(dest, L"%s", brai[7]);
}
}
Code works as expected on Linux every time, but not for Windows. I tried everything and nothing seems to get the output right. On the 'dest' document I get random chars like:
甥╩極肠─猀甥iꃢ¨.
The only way to print braille patterns to the doc so far on Windows was:
fwprintf(dest, L"⠷");
Which is not very useful (would need to make an 'else if' for every case instead).
If you wish to see the full code, it's on Github:
https://github.com/oliver-almaraz/Texto_a_Braille
What I tried so far:
Changing files open options to UTF-16LE and UNICODE.
Changing fwprintf() arguments in every way I could imagine.
Changing the array properties to unsigned int for the arrays containing the braille patterns.
Different compilers.
Here's a tested (with MSVC and mingw on Windows), semi-working example.
#include <stdio.h>
#include <ctype.h>
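/* Alphabetical order a..z, so that brai[letra - 'a'] picks the right cell
   (w is ⠺ and must come before x, y, z). */
const char *brai[26] = {
"⠁","⠃","⠉","⠙","⠑","⠋","⠛","⠓","⠊","⠚",
"⠅","⠇","⠍","⠝","⠕","⠏","⠟","⠗","⠎","⠞",
"⠥","⠧","⠺","⠭","⠽","⠵"
};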
int main(void) {
char* origen = "a.txt";
char* destino = "b.txt";
FILE *source = fopen(origen, "r");
FILE *dest = fopen(destino, "w");
int letra;
while ((letra = fgetc(source)) != EOF) {
if (isupper(letra))
fprintf(dest, "%s", brai[letra - 'A']);
else if (islower(letra))
fprintf(dest, "%s", brai[letra - 'a']);
else
fprintf (dest, "%c", letra);
}
}
Note these things.
No locale or wide character or anything like that in sight. None of this is needed.
This code only translates English letters. No punctuation or numbers (I don't know nearly enough about Braille to add that, but this should be straightforward).
Since the code only translates English letters and leaves everything else as is, it is OK to feed it a UTF-8 encoded file. It will just leave unrecognised characters untranslated. If you ever need to translate accented letters, you will need to learn a whole lot more about Unicode. Here is a good place to start.
Error handling omitted for brevity.
The source file must be saved in the correct charset. For MSVC, use either UTF-8 with BOM or UTF-16; alternatively, use UTF-8 without BOM and the /utf-8 compiler switch if your MSVC version recognises it. For mingw, just use UTF-8.
This method will not work for standard console output on Windows. It is not a big problem since Windows console by default won't output Braille characters anyway. It will however work for msys console and many others.
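If the dependence on the source file's charset is a concern, the UTF-8 bytes can also be computed instead of being typed as literals. The sketch below is an alternative to the table above, not part of the tested example: braille_dots and braille_utf8 are hypothetical names, the dot masks follow the Unicode Braille Patterns block (U+2800 plus one bit per dot), and ASCII input is assumed.

```c
#include <ctype.h>

/* Dot masks for a..z in alphabetical order (dot 1 = 0x01 ... dot 6 = 0x20). */
static const unsigned char braille_dots[26] = {
    0x01, 0x03, 0x09, 0x19, 0x11, 0x0B, 0x1B, 0x13, 0x0A, 0x1A,  /* a-j */
    0x05, 0x07, 0x0D, 0x1D, 0x15, 0x0F, 0x1F, 0x17, 0x0E, 0x1E,  /* k-t */
    0x25, 0x27, 0x3A, 0x2D, 0x3D, 0x35                           /* u-z */
};

/* Hypothetical helper: writes the UTF-8 bytes of the Braille cell for an
   English letter into out (3 bytes plus terminator). Returns 1 on success,
   0 if c is not a letter a-z or A-Z. */
int braille_utf8(int c, char out[4])
{
    unsigned cp;

    if (c < 0 || c > 127 || !isalpha(c))
        return 0;
    cp = 0x2800u + braille_dots[tolower(c) - 'a'];
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    out[3] = '\0';
    return 1;
}
```

The result can be passed to fprintf(dest, "%s", out) just like an entry of brai. Because the source then contains only ASCII, no BOM or compiler switch is needed.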
Option 1: Use wchar_t and fwprintf. Make sure to save the source as UTF-8 w/ BOM, or save it as plain UTF-8 and pass the /utf-8 switch so that the Microsoft compiler assumes UTF-8; otherwise, MSVC assumes an ANSI encoding for the source file and you get mojibake.
#include <stdio.h>
const wchar_t brai[] = L"⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";
int main(void) {
FILE *dest = fopen("out.txt", "w, ccs=UTF-8");
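fwprintf(dest, L"%ls", brai); /* %ls is the standard conversion for a wide string; plain %s means wide only in Microsoft's wprintf family */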
}
out.txt (encoded as UTF-8 w/ BOM):
⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺
Option 2: Use char and fprintf, save the source as UTF-8 or UTF-8 w/ BOM, and use the /utf-8 Microsoft compile switch. The char string will be in the source encoding, so it must be UTF-8 to get UTF-8 in the output file.
#include <stdio.h>
const char brai[] = "⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";
int main(void) {
FILE *dest = fopen("out.csv","w");
fprintf(dest, "%s", brai);
}
Recent compilers also support the u8"" literal syntax (introduced in C11). The advantage here is that you can use a different source encoding and the char string will still be UTF-8, as long as you use the appropriate compiler switch to indicate the source encoding.
const char brai[] = u8"⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠭⠽⠵⠺";
For reference, these are the Microsoft compiler options:
/source-charset:<iana-name>|.nnnn      set source character set
/execution-charset:<iana-name>|.nnnn   set execution character set
/utf-8                                 set source and execution character set to UTF-8
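Whether a given build really stores narrow literals as UTF-8 can be checked from within the program. A minimal sketch with a hypothetical function name, literals_are_utf8; it uses a universal character name so that the check itself does not depend on the source file's encoding.

```c
/* Hypothetical check: returns 1 if narrow string literals are stored as
   UTF-8, judged from the bytes the compiler produced for U+2801. */
int literals_are_utf8(void)
{
    static const char cell[] = "\u2801";   /* BRAILLE PATTERN DOTS-1 */

    return sizeof cell == 4
        && (unsigned char)cell[0] == 0xE2
        && (unsigned char)cell[1] == 0xA0
        && (unsigned char)cell[2] == 0x81;
}
```

GCC and Clang use UTF-8 as the execution character set by default; with the Microsoft compiler the result depends on the switches listed above.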
Is it possible to read a text file that has non-English text?
Example of text in file:
E 37
SVAR:
Fettembolisyndrom. (1 poäng)
Example of what is present in buffer which stores "fread" output using "puts" :
E 37 SVAR:
Fettembolisyndrom.
(1 poäng)
Under Linux my program works fine, but on Windows I am seeing this problem with non-English letters. Any advice on how this can be fixed?
Program:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
int debug = 0;
int main(int argc, char* argv[])
{
if (argc < 2)
{
puts("ERROR! Please enter a filename\n");
exit(1);
}
else if (argc > 2)
{
debug = atoi(argv[2]);
puts("Debugging mode ENABLED!\n");
}
FILE *fp = fopen(argv[1], "rb");
fseek(fp, 0, SEEK_END);
long fileSz = ftell(fp);
fseek(fp, 0, SEEK_SET);
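char* buffer;
buffer = (char*) malloc (sizeof(char)*fileSz + 1); /* one extra byte for the terminator */
size_t readSz = fread(buffer, 1, fileSz, fp);
buffer[readSz] = '\0'; /* the string search below needs a null-terminated buffer */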
rewind(fp);
if (readSz == fileSz)
{
char tmpBuff[100];
fgets(tmpBuff, 100, fp);
if (!ferror(fp))
{
printf("100 characters from text file: %s\n", tmpBuff);
}
else
{
printf("Error encounter");
}
}
if (strstr(buffer,"FRÅGA") == NULL) /* haystack first, then the string to look for */
{
printf("String not found!");
}
return 0;
}
Summary: If you read text from a file encoded in UTF-8 and display it on the console you must either set the console to UTF-8 or transcode the text from UTF-8 to the encoding used by the console (in English-speaking countries, usually MS-DOS code page 437 or 850).
Longer explanation
Bytes are not characters and characters are not bytes. The char data type in C holds a byte, not a character. In particular, the character Å (Unicode <U+00C5>) mentioned in the comments can be represented in many ways, called encodings:
In UTF-8 it is two bytes, '\xC3' '\x85';
In UTF-16 it is two bytes, either '\xC5' '\x00' (little-endian UTF-16), or '\x00' '\xC5' (big-endian UTF-16);
In Latin-1 and Windows-1252, it is one byte, '\xC5';
In MS-DOS code page 437 and code page 850, it is one byte, '\x8F'.
It is the responsibility of the programmer to translate between the internal encoding used by the program (usually but not always Unicode), the encoding used in input or output files, and the encoding expected by the display device.
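As an illustration of such a translation step, here is a sketch that transcodes UTF-8 to Latin-1. Latin-1 is chosen only because its code points coincide with U+0000..U+00FF, so the mapping is pure arithmetic; targeting a console code page such as 437 or 850 would need a lookup table instead. The helper name utf8_to_latin1 is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: transcodes UTF-8 to Latin-1 (ISO-8859-1).
   Code points above U+00FF and malformed input become '?'.
   out must have room for len + 1 bytes. Returns the output length. */
size_t utf8_to_latin1(const char *in, size_t len, char *out)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = (unsigned char)in[i];

        if (b < 0x80) {
            out[n++] = (char)b;                 /* ASCII passes through */
            i += 1;
        } else if ((b == 0xC2 || b == 0xC3) && i + 1 < len
                   && ((unsigned char)in[i + 1] & 0xC0) == 0x80) {
            /* two-byte sequence for U+0080..U+00FF */
            out[n++] = (char)(((b & 0x03) << 6) | ((unsigned char)in[i + 1] & 0x3F));
            i += 2;
        } else {
            out[n++] = '?';
            i += 1;
            while (i < len && ((unsigned char)in[i] & 0xC0) == 0x80)
                i += 1;                         /* skip continuation bytes */
        }
    }
    out[n] = '\0';
    return n;
}
```

For the character Å discussed above, the two UTF-8 bytes '\xC3' '\x85' come out as the single Latin-1 byte '\xC5'.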
Note: Sometimes, if the program does not do much with the characters it reads and outputs, one can get by just by making sure that the input files, the output files, and the display device all use the same encoding. In Linux, this encoding is almost always UTF-8. Unfortunately, on Windows the existence of multiple encodings is a fact of life. System calls expect either UTF-16 or Windows-1252. By default, the console displays Code Page 437 or 850. Text files are quite often in UTF-8. Windows is old and complicated.