I want to find out why, with the new setlocale(LC_ALL, ".utf8") feature, the standard function fgetwc() can't read '\u2013' (EN DASH) from a UTF-8 text file and returns WEOF instead. Ideally I would also like to find a workaround.
I disabled "Just My Code" and enabled symbol downloading for C:\WINDOWS\SysWOW64\ucrtbased.dll, which contains fgetwc.
However, when I try to step into that function, the debugger cannot find fgetwc.cpp.
Neither of these two locations contains that file, and I can't find it anywhere else:
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\crt\src\
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\crt\src\
This is my test program:
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>
int main()
{
wint_t wc; // = L'\u2013';
FILE* file;
printf("%s\n", setlocale(LC_ALL, ".utf8"));
file = fopen("test.txt", "r");
wc = fgetwc(file);
// ffff '?' 0 0
fprintf(stdout, "%04x '%lc' %d %d\n", wc, wc, ferror(file), feof(file));
return 0;
}
It prints ffff instead of 2013. ferror() and feof() return false.
test.txt:
–
It's encoded as E2 80 93
For reading the UTF-8 file, optionally drop the setlocale call, and replace the fopen line with:
file = fopen("test.txt", "r, ccs=utf-8");
The fopen documentation states:
ccs=encoding -- Specifies the encoded character set to use (one of UTF-8, UTF-16LE, or UNICODE) for this file. Leave unspecified if you want ANSI encoding.
This appears to imply that the ccs=UTF-8 encoding must be specified explicitly in order to read a file as UTF-8 text.
Though, on the other hand, "ANSI" used to mean either the active codepage, or the system default locale. With the recent support in Windows 10 1903 and later for UTF-8 as an active codepage, it would be expected that "ANSI encoding" be the same as "UTF-8 encoding" when the current locale is UTF-8. However, that does not seem to be the case with the current implementation of the UCRT.
For writing the wide char, #include <io.h> and <fcntl.h>, and replace the fprintf line with:
_setmode(_fileno(stdout), _O_U16TEXT);
wprintf(L"%04x '%wc' %d %d\n", wc, wc, ferror(file), feof(file));
The printf documentation states:
wprintf is a wide-character version of printf; format is a wide-character string. wprintf and printf behave identically if the stream is opened in ANSI mode. printf does not currently support output into a UNICODE stream.
Certainly, my problem is not new, so I apologize if my error is simply too stupid.
I just wanted to become familiar with putwchar and wrote the following little piece of code:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>
int main(void)
{
char *locale = setlocale(LC_ALL, "");
printf ("Locale: %s\n", locale);
//setlocale(LC_CTYPE, "de_DE.utf8");
wchar_t hello[]=L"Registered Trademark: ®®\nEuro sign: €€\nBritisch Pound: ££\nYen: ¥¥\nGerman Umlauts: äöüßÄÖÜ\n";
int index = 0;
while (hello[index]!=L'\0'){
//printf("put liefert: %d\n", putwchar(hello[index++]));
putwchar(hello[index++]);
};
}
Now, the output is simply:
Locale: de_DE.UTF-8
Registered Trademark: ��
Euro sign: ��
Britisch Pound: ��
Yen: ��
German Umlauts: �������
[1]+ Fertig gedit versuch.c
None of the non-ASCII chars appeared on the screen.
As you can see in the commented-out line, putwchar returned the proper Unicode code point for each character I wanted to print. (I am aware that I must not mix putwchar and printf in the same program, which is why that line is commented out.) So the call is supposed to work, at least to my understanding.
The C source is encoded in UTF-8:
$ file versuch.c
versuch.c: C source, UTF-8 Unicode text
my system is Ubuntu Linux 20.04.05
compiler: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
I would greatly appreciate any advice on this one.
As stated above: I simply expected the trademark sign, yen, € and the umlauts äöüßÄÖÜ to appear.
You shouldn't mix normal and wide output on the same stream.
I get the expected output if I change this early print:
printf ("Locale: %s\n", locale);
into a wide print:
wprintf(L"Locale: %s\n", locale);
Then the subsequent putwchar() calls write the expected characters.
You cannot mix narrow and wide I/O on the same stream (C11 7.21.2). If you want to use putwchar, you cannot use printf on stdout. Start with wprintf instead (with a wide format string):
wprintf (L"Locale: %s\n", locale);
You can simply print those wide characters as shown below:
wprintf(L"Registered Trade Mark: %ls\n", L"®®");
wprintf(L"Euro Sign: %ls\n", L"€€");
wprintf(L"British Pound: %ls\n", L"££");
wprintf(L"Yen: %ls\n", L"¥¥");
wprintf(L"German Umlauts: %ls\n", L"äöüßÄÖÜ");
Please refer to:
https://stackoverflow.com/a/37587933/2805824
https://stackoverflow.com/a/7696033/2805824
I have to display all the files and sub-directories in a directory in Linux (more specifically, Ubuntu 19.10) using popen() in C. The relevant code is given below. The problem, when I debug this code, is that the "list" variable contains only the output up to the first "\n" character, which is ".:\n". How can I get the complete output of popen(), including the newline characters?
#include <stdio.h>
int main()
{
FILE *read_file;
char list[1000];
read_file = popen("ls -R","r");
fgets(list, 1000, read_file);
pclose(read_file);
printf("%s", list);
return(0);
}
In a C program for Linux, with ncursesw and form, I need to read the string stored in a field, with support for UTF-8 characters. When ASCII only is used, it is pretty simple, because the string is stored as an array of char:
char *dest;
...
dest = field_buffer(field[0], 0);
If I try to type a UTF-8 and non-ASCII character in the field with this code the character does not appear and it is not handled. In this answer for UTF-8 it is suggested to use ncursesw. But with the following code (written following this guide)
#define _XOPEN_SOURCE_EXTENDED
#include <ncursesw/form.h>
#include <locale.h>
int main()
{
...
setlocale(LC_ALL, "");
...
initscr();
wchar_t *dest;
...
dest = field_buffer(field[0], 0);
}
the compiler produces a warning:
warning: assignment from incompatible pointer type [enabled by default]
dest = field_buffer(field[0], 0);
^
How to obtain from the field an array of wchar_t?
ncursesw uses get_wch instead of getch, so which function does it use instead of field_buffer()? I couldn't find it by googling.
The program is compiled in a system with the following locale:
$ locale
LANG=it_IT.UTF-8
LANGUAGE=
LC_CTYPE="it_IT.UTF-8"
LC_NUMERIC="it_IT.UTF-8"
LC_TIME="it_IT.UTF-8"
LC_COLLATE="it_IT.UTF-8"
LC_MONETARY="it_IT.UTF-8"
LC_MESSAGES="it_IT.UTF-8"
LC_PAPER="it_IT.UTF-8"
LC_NAME="it_IT.UTF-8"
LC_ADDRESS="it_IT.UTF-8"
LC_TELEPHONE="it_IT.UTF-8"
LC_MEASUREMENT="it_IT.UTF-8"
LC_IDENTIFICATION="it_IT.UTF-8"
LC_ALL=
It supports and uses UTF-8 as a default. With a locale like this, when the ncursesw environment is used, the C program should be able to save UTF-8 characters into a char array.
In order to correctly set up ncursesw it is very important to follow all the steps of the mentioned guide. In particular, the program should have the header
#define _XOPEN_SOURCE_EXTENDED
#include <ncursesw/form.h>
#include <stdio.h>
#include <locale.h>
The program should be compiled as
gcc -o executable_file source_file.c -lncursesw -lformw
and the program should contain
setlocale(LC_ALL, "");
before initscr(). With all these conditions satisfied, the string can be saved into a normal char array, just as with ncurses and ASCII instead of ncursesw and UTF-8. As John Bollinger pointed out in the comments, field_buffer() can only return a char *, so there is no point in using any other data type such as wchar_t.
I'm writing a program in C and I want to have Greek characters in the menu when I run it in cmd.exe. Someone said that in order to include Greek characters you have to use a printf that goes something like this:
printf(charset:IS0-1089:uffe);
but they weren't sure.
Does anyone know how to do that?
Assuming Windows, you can:
set your console font to a Unicode TrueType font:
emit the data using an "ANSI" mechanism
This code prints γειά σου:
#include <stdio.h>
#include <windows.h>
int main(void) {
SetConsoleOutputCP(1253); // "ANSI" Greek (windows-1253)
printf("\xE3\xE5\xE9\xDC \xF3\xEF\xF5"); // the greeting as windows-1253 bytes
return 0;
}
The hex codes represent γειά σου when encoded as windows-1253. If you use an editor that saves data as windows-1253, you can use literals instead. An alternative would be to use either OEM 737 (that really is a DOS encoding) or use Unicode.
I used SetConsoleOutputCP to set the console code page, but you could type the command chcp 1253 prior to running the program instead.
You can print a Unicode character using printf like this:
printf("\u0220\n");
This will print Ƞ.
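I think this might only work if your console supports Greek. Probably what you want to do is map characters to the Greek letters using their Unicode code points (these are not ASCII codes). The article below is for C#, but the same idea applies in C.
913 to 937 = upper case Greek letters (930 is unassigned)
945 to 969 = lower case Greek letters (962 is the final sigma)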
Read more at Suite101: "Working with the Greek Alphabet and C#: How to Display ASCII Codes Correctly when Creating a C# Application".
One way to do this is to print a wide string. Unfortunately, Windows needs a bit of non-standard setup to make this work. This code does that setup inside #if blocks.
#include <locale.h>
#include <stdlib.h>
#include <stdio.h>
#include <wchar.h>
/* This has been reported not to autodetect correctly on tdm-gcc. */
#ifndef MS_STDLIB_BUGS // Allow overriding the autodetection.
# if ( _WIN32 || _WIN64 )
# define MS_STDLIB_BUGS 1
# else
# define MS_STDLIB_BUGS 0
# endif
#endif
#if MS_STDLIB_BUGS
# include <io.h>
# include <fcntl.h>
#endif
void init_locale(void)
// Does magic so that wprintf() can work.
{
// Constant for fwide().
static const int wide_oriented = 1;
#if MS_STDLIB_BUGS
// Windows needs a little non-standard magic.
static const char locale_name[] = ".1200";
_setmode( _fileno(stdout), _O_WTEXT );
#else
// The correct locale name may vary by OS, e.g., "en_US.utf8".
static const char locale_name[] = "";
#endif
setlocale( LC_ALL, locale_name );
fwide( stdout, wide_oriented );
}
int main(void)
{
init_locale();
wprintf(L"μουσάων Ἑλικωνιάδων ἀρχώμεθ᾽\n");
return EXIT_SUCCESS;
}
This has to be saved as UTF-8 with a BOM in order for older versions of Visual Studio to read it properly. Your console also has to be set to a monospaced Unicode font, such as Lucida Console, to display it properly. To mix wide strings in with ASCII strings, the standard defines the %ls and %lc format specifiers to printf(), although I’ve found these don’t work everywhere.
An alternative is to set the console to UTF-8 mode (on Windows, do this with chcp 65001) and then print the UTF-8 string with printf(u8"μουσάων Ἑλικωνιάδων ἀρχώμεθ᾽\n");. UTF-8 is a second-class citizen on Windows, but that usually works. If you try to run that without setting the code page first, though, you will get garbage.
FILE *out=fopen64("text.txt","w+");
unsigned int write;
char *outbuf=new char[write];
//fill outbuf
printf("%i\n",ftello64(out));
fwrite(outbuf,sizeof(char),write,out);
printf("%i\n",write);
printf("%i\n",ftello64(out));
output:
0
25755
25868
what is going on?
write is set to 25755, and I tell fwrite to write that many bytes to a file that is positioned at the beginning, and then I'm at a position other than 25755?
If you are on a DOSish system (say, Windows) and the file is not opened in binary mode, line-endings will be converted automatically and each "line" will add one byte.
So, specify "wb" as the mode rather than just "w", as @caf points out. It will have no effect on Unix-like platforms and will do the right thing on others.
For example:
#include <stdio.h>
#define LF 0x0a
int main(void) {
char x[] = { LF, LF };
FILE *out = fopen("test", "w");
printf("%ld", ftell(out)); /* ftell returns long */
fwrite(x, 1, sizeof(x), out);
printf("%ld", ftell(out));
fclose(out);
return 0;
}
With VC++:
C:\Temp> cl y.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
y.c
Microsoft (R) Incremental Linker Version 9.00.21022.08
Copyright (C) Microsoft Corporation. All rights reserved.
/out:y.exe
C:\Temp> y.exe
04
With Cygwin gcc:
/cygdrive/c/Temp $ gcc y.c -o y.exe
/cygdrive/c/Temp $ ./y.exe
02
It may depend on the mode in which you opened the file. If you open it as a text file, then \n may be written as \r\n in DOS/Windows systems. However, ftello64() probably only gives the binary file pointer, which would count in the extra \r characters written. Try clearing the outbuf[] of any \n data or try opening the out file as binary ("wb" instead of "w").
The variable write is uninitialized and so the size of the array and the amount written will be essentially random.
Interesting. Works fine on Windows VC++, albeit ftello64 replaced with ftell.