How to output foreign characters in the console? - C

How can I print foreign characters on the screen using C?
Here's my code, which doesn't work:
#include <stdio.h>
#include <stdlib.h>  /* system() */
#include <locale.h>

int main(void) {
    setlocale(LC_ALL, "Turkish");
    printf("İ ş ğ ü ö ı");
    system("pause");
    return 0;
}

On my Windows machine there are no such characters in the 'Terminal' font, so I think you can't print them with it.
But I suggest you check this font yourself. Maybe you have a different version of it.

If you're using a narrow charset, you need to make sure that the terminal/console uses the same charset and that the source file is saved in the corresponding encoding; otherwise the system will of course misinterpret the character codes.
To set the charset in the console, run chcp. For example, to use code page Windows-1254, run chcp 1254. You can also set the code page programmatically with SetConsoleOutputCP, like SetConsoleOutputCP(1254).
However, you should avoid the legacy ANSI code pages and use Unicode instead. The currently preferred way on Windows is to output Unicode characters as wide chars with wprintf. You may need to set the stream to wide mode first with
int result = _setmode(_fileno(stdout), _O_U16TEXT);
then
wprintf(L"İ ş ğ ü ö ı");
See also the wprintf manual on Windows, Linux, or Mac. Note, however, that on POSIX systems UTF-8 is preferred.
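Putting the pieces together, here is a minimal sketch of the wide-char route (Windows-specific; assumes the compiler reads the source file's encoding correctly):

#include <stdio.h>
#include <wchar.h>
#include <io.h>      /* _setmode, _fileno */
#include <fcntl.h>   /* _O_U16TEXT */

int main(void) {
    /* Put stdout into UTF-16 mode before any wide output */
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"İ ş ğ ü ö ı\n");
    return 0;
}

Note that once the stream is in UTF-16 mode, mixing in narrow output like printf is not allowed.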
On older versions of Windows, UTF-8 support in the console is not very good, but it has been steadily improving, and Windows 10 even supports UTF-8 as a locale. So you can just call SetConsoleOutputCP(CP_UTF8); or SetConsoleOutputCP(65001); (or run chcp 65001 in the console) and it'll work immediately, provided that you saved the source code as UTF-8. Remember to also set the console font to one that supports those characters, like Lucida Console or Consolas. The default raster font contains a very limited number of characters and renders with a lot of aliasing; it also doesn't work well on modern high-DPI displays.
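If you take the UTF-8 route instead, a minimal sketch (assumes Windows 10 or later and a source file saved as UTF-8):

#include <stdio.h>
#include <windows.h>

int main(void) {
    /* Tell the console to interpret the output bytes as UTF-8 */
    SetConsoleOutputCP(CP_UTF8);
    /* These bytes are UTF-8 because the source file is saved as UTF-8 */
    printf("İ ş ğ ü ö ı\n");
    return 0;
}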
There are already lots of questions about outputting Unicode on this site, like Output unicode strings in Windows console app or UTF-8 character in .NET Console Application. Please have a look and see which one fits your case.
Edit
When you use
wchar_t c = L'ğ';
fputwc(c, ptr);
you're printing to a file and not the console. In that case just the stream of bytes is saved into the file. When you open the file again, it's the editor's job to interpret the bytes in the correct charset and display them correctly. For example, the character "ğ" is stored as c4 9f in UTF-8, and when you open the file as UTF-8, the editor knows those bytes represent the character "ğ" and displays it.
Unfortunately there's no character-encoding information embedded in a text file, so the editor must choose an encoding. Remember: There Ain't No Such Thing As Plain Text (a must-read). A simple editor may just open the file as ANSI in the current Windows code page, and if the original encoding is not that one, the characters won't be displayed correctly and you'll just see garbage.
Some more advanced editors like Notepad++ or MS Word will try to guess the encoding of the file. But as with any guessing it can be wrong, and the result is again garbage on screen.
The simplest solution is to add a BOM to the beginning of the file so the editor can recognize the encoding easily. If your file doesn't contain a BOM and the editor guesses wrong, you need to tell it to read the file in the correct encoding (for wchar_t output on Windows like the above, that's UTF-16LE); in Notepad++, for example, that's done from the Encoding menu.
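For reference, a sketch of producing such a file with a BOM, using the MSVC-specific ccs flag (the Microsoft CRT writes a UTF-16LE BOM to a new file opened this way; out.txt is just a placeholder name):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    /* MSVC-specific: ccs=UTF-16LE opens the stream in Unicode mode
       and writes a UTF-16LE BOM at the start of a new file */
    FILE *ptr = fopen("out.txt", "w, ccs=UTF-16LE");
    if (ptr == NULL)
        return 1;
    wchar_t c = L'ğ';
    fputwc(c, ptr);
    fclose(ptr);
    return 0;
}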
Unfortunately the OP didn't edit the question to show what was tried, so there's nothing more I can explain.

Your code works: http://ideone.com/K9hrv5
setlocale(LC_ALL,"Turkish");
printf("İ ş ğ ü ö ı");
The only issue is that you have to set your terminal's locale as well before executing your C program's output binary.
Setting your terminal locale works; on Linux, for example, run export LC_ALL=tr_TR.UTF-8 in the shell before executing the binary.

setlocale only affects the runtime locale. It doesn't make your compiler support extra source file characters.
You may need to specify the non-ASCII characters in your source file using character constants (e.g. \xF1 for the character with code 241), as in the sketch below.
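For example, a minimal sketch using escaped character constants; the byte values assume the console has been switched to code page Windows-1254 (e.g. with chcp 1254):

#include <stdio.h>

int main(void) {
    /* Windows-1254 byte values:
       0xDD='İ' 0xFE='ş' 0xF0='ğ' 0xFC='ü' 0xF6='ö' 0xFD='ı' */
    printf("\xDD \xFE \xF0 \xFC \xF6 \xFD\n");
    return 0;
}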

Related

How to output special characters in cmd window?

When I write a C program and try to output special characters (like ä ö ü ß) with printf() in the cmd window on Windows 10, it only shows something like ▒▒▒▒▒▒▒▒▒▒▒▒.
But if I just type them in the cmd window without a C program being executed, it displays these characters properly.
When I change the console type to standard output in NetBeans, the output is correct as well.
I tried to change the code page of cmd, but it didn't fix the problem.
I use the gcc C compiler.
The reason is the usage of different code pages for character encoding.
When you write program code in a GUI text editor and store it in a file where each character is encoded with just a single byte, the code page Windows-1252 is used in Western European and North American countries.
In a console window opened on running a console application, an OEM code page is used instead: OEM 850 in Western European countries and OEM 437 in North American countries.
So for ÄÖÜäöüß you need different byte values written in code to get those characters displayed as expected in the console window, at least on execution in Western European and North American countries.
Character  Windows-1252  OEM 850
Ä          \xC4          \x8E
Ö          \xD6          \x99
Ü          \xDC          \x9A
ä          \xE4          \x84
ö          \xF6          \x94
ü          \xFC          \x81
ß          \xDF          \xE1
The code page used by default in a console window can be seen by opening a command prompt window and running either chcp (change code page) or mode, both of which display the active code page.
The default code page for GUI applications and console applications on a computer for a user account depends on the Windows region and language settings for this user account.
Some web pages you should read to better understand character encoding:
Character encoding (English Wikipedia article)
On the Goodness of Unicode by Tim Bray
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
What's the best default new file format? (UltraEdit forum topic)
Programmers should not write non-ASCII characters in string literals output by a compiled executable, because how those characters end up encoded as bytes in the executable depends on the code page used by the compiler. It is better to use hexadecimal notation when the code page active on execution is known, or is defined by the application before the string is output, as in the sketch below.
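A sketch of that approach, with the application defining the code page itself before output (byte values here are the Windows-1252 column from the table above):

#include <stdio.h>
#include <windows.h>

int main(void) {
    /* Define the active console code page before any output */
    SetConsoleOutputCP(1252);
    /* Windows-1252 byte values for ÄÖÜäöüß, per the table above */
    printf("\xC4\xD6\xDC\xE4\xF6\xFC\xDF\n");
    return 0;
}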
It is also possible to store the strings in the executable in Unicode, determine the encoding of the output handle before outputting any string, and convert each Unicode string to the encoding of the output handle before it is written.
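And a sketch of that Unicode variant, converting to the output handle's code page at run time (assumes the compiler stores the wide literal correctly; the buffer size is arbitrary):

#include <stdio.h>
#include <windows.h>

int main(void) {
    /* Store the string as UTF-16 in the executable... */
    const wchar_t *wide = L"ÄÖÜäöüß\n";
    char buffer[64];
    /* ...and convert it to the console's active code page for output */
    int len = WideCharToMultiByte(GetConsoleOutputCP(), 0, wide, -1,
                                  buffer, sizeof(buffer), NULL, NULL);
    if (len > 0)
        fputs(buffer, stdout);
    return 0;
}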
And of course it also depends on the output font how the bytes in the executable's strings are finally displayed on screen.

Complications in reading extra characters

While using the default console font (Raster Fonts 8x12), I am unable to read extra characters using ReadConsoleOutputCharacter(); these characters are printed out as ?.
If I change the console font to Consolas or Lucida Console, the extra characters read by ReadConsoleOutputCharacter() are printed out without a problem.
Is there anything I can do about that?
Anyway, I fixed it by changing the console code pages and the locale before the console I/O:
#include <windows.h>  /* SetConsoleOutputCP, SetConsoleCP, GetACP */
#include <locale.h>   /* setlocale */

SetConsoleOutputCP(GetACP());
SetConsoleCP(GetACP());
setlocale(LC_ALL, "");
@David Heffernan, I suggest you read this.
According to the docs (https://msdn.microsoft.com/en-us/library/windows/desktop/ms684969%28v=vs.85%29.aspx):
This function uses either Unicode characters or 8-bit characters from the console's current code page. The console's code page defaults initially to the system's OEM code page. To change the console's code page, use the SetConsoleCP or SetConsoleOutputCP functions, or use the chcp or mode con cp select= commands.
I believe you're getting back a Unicode string that needs to be encoded to a charset before being displayed.
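A sketch of that idea: read with the wide (W) variant so the result is Unicode regardless of the console code page, then encode it for display (the coordinates and buffer sizes here are arbitrary placeholders):

#include <stdio.h>
#include <windows.h>

int main(void) {
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    COORD origin = {0, 0};   /* read from the top-left corner */
    wchar_t cells[80];
    DWORD read = 0;
    /* The W variant returns Unicode regardless of the console code page */
    if (ReadConsoleOutputCharacterW(out, cells, 80, origin, &read)) {
        char buffer[256];
        /* Encode to the console's code page before displaying */
        int len = WideCharToMultiByte(GetConsoleOutputCP(), 0, cells, (int)read,
                                      buffer, sizeof(buffer) - 1, NULL, NULL);
        if (len > 0) {
            buffer[len] = '\0';
            puts(buffer);
        }
    }
    return 0;
}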

Does gnome-terminal support DOS code pages?

In my C program I've had to swap my Unicode box-drawing characters for escaped characters from DOS code page 437 to get it to work in the Windows command prompt. Is it possible to change the code page of gnome-terminal so it displays these characters correctly when natively compiling the program for Linux?
Thanks.
From https://nethackwiki.com/wiki/IBMgraphics:
The current gnome-terminal does not have a setting for code page 437, but it does support other code pages that are equivalent for NetHack's purposes, such as 862 (Hebrew).
To set code page 862 on gnome-terminal:
Select Terminal->Set Character Encoding->Add or Remove.
In the pane on the left, select the line with description Hebrew and encoding IBM862.
Click the right-pointing arrow between the two panes.
Click Close.
The above steps only need to be done once for the lifetime of the Gnome installation. Once done, it is sufficient to select Terminal, Set Character Encoding, and then Hebrew (IBM862).
It should be noted that the current default gnome-terminal font in Ubuntu Jaunty fully supports DECgraphics as long as eight_bit_tty is set to false.
If you need these characters, you should use their correct Unicode codepoint values and output them as UTF-8. Or, if you prefer, you can output them as wide characters and let the standard library's locale system take care of converting them to UTF-8 or another "native" encoding the user has selected (which might even be CP437, although I've never seen a system set up that poorly...).
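A minimal sketch of the UTF-8 route (assumes a C99 compiler whose execution character set is UTF-8, as with gcc on Linux, and a UTF-8 terminal):

#include <stdio.h>

int main(void) {
    /* Unicode box-drawing codepoints, output as UTF-8 */
    printf("\u250C\u2500\u2510\n");   /* top:    ┌─┐ */
    printf("\u2514\u2500\u2518\n");   /* bottom: └─┘ */
    return 0;
}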

How to bake UTF-8 files with CakePHP's console?

For some reason, every file that I bake with CakePHP's console is treated as ISO-8859-1 encoded by my IDE, Dreamweaver. This works fine up to the point where I type a special character, which is then displayed wrongly by the browser, since its encoding (chosen by the editor) differs from the page's overall encoding.
How can I force the console to produce UTF-8 files, with a BOM if necessary?
I've already tried converting the template files that are used to bake the standard scaffolding pages, but with no luck.
I have the same problem - baked files are NOT UTF-8 but ASCII. (I use the Notepad++ editor, which makes it easy to convert and save files in another format.)
Once bake generates the files, I have to convert them to UTF-8 one by one to be able to work with Polish local characters.
I tried changing the template files to UTF-8, but somehow this does not help. This may have something to do with the fact that the default files do not contain any non-ASCII characters, therefore even when saved as UTF-8 they stay plain ASCII.
The simplest way I found to overcome this is to modify a template file, e.g.
cake\console\templates\default\classes\model.ctp
to include a UTF-8 character somewhere, e.g.:
//'message' => 'Your custom message here ł',
(notice the non-ASCII character at the end of the line).
Converting and saving it as UTF-8 then makes sure the template file really is UTF-8.
Now the model files are generated as UTF-8.
The baked files are UTF-8 - or rather, they only contain basic ASCII characters, which are identical to the basic UTF-8 range, so they can be regarded as either. It's Dreamweaver's problem, not a problem with bake. Check the Dreamweaver settings (or code in a decent editor ;-P).
You do not want to include a BOM, it'll screw you over later.
Use the Bake_UTF8 plugin =]
http://www.github.com/pedroelsner/bake_utf8
I hope this is helpful.
Pedro Elsner
Another way to achieve this is to open the PHP files that are producing UTF-8 content (without BOM) and then save them in UTF-8-with-BOM format using Notepad++ (Encoding->Encode in UTF-8).
In my case I had an Excel CSV file:
/patients/exportFirstReport/atskaite1-25-10-2013.csv
Then I had to convert the encoding of the PHP files down the stack:
\index.php
\app\Controller\PatientsController.php
\app\View\Patients\csv\export_first_report.ctp
\app\View\Layouts\csv\default.ctp
After converting the encoding of these files, it produces readable UTF-8 Excel files.

Linux & C-Programming: How can I write utf-8 encoded text to a file?

I am interested in writing UTF-8 encoded strings to a file.
I did this with the low-level functions open() and write().
First I set the locale to a UTF-8 aware character set with
setlocale(LC_ALL, "de_DE.utf8").
But the resulting file does not contain UTF-8 characters, only ISO-8859 encoded umlauts. What am I doing wrong?
Addendum: I don't know if my strings are really UTF-8 encoded in the first place. I just keep them in the source file in this form: char *msg = "Rote Grütze";
Screenshot of the text file's content: http://img19.imageshack.us/img19/9791/picture1jh9.png
Changing the locale won't change the actual data written to the file using write(). You have to actually produce UTF-8 bytes to write them to a file. For that purpose you can use libraries such as ICU.
Edit after your edit of the question: UTF-8 characters only differ from ISO-8859 in the "special" symbols (ümlauts, áccénts, etc.). So for all text that doesn't contain any of these symbols, the two encodings are equivalent. However, if you include strings with those symbols in your program, you have to make sure your text editor treats the data as UTF-8. Sometimes you just have to tell it to.
To sum up, the text you produce will be in UTF-8 if the strings within the source code are in UTF-8.
Another edit: Just to be sure, you can convert your source code to UTF-8 using iconv:
iconv -f latin1 -t utf8 file.c
This will convert all your Latin-1 strings to UTF-8, and when you print them they will definitely be in UTF-8. If iconv encounters a strange character, or you see the output strings with strange characters, then your strings were in UTF-8 already.
Yes, you can do it with glibc. It calls this "multibyte" instead of UTF-8, because it can handle more than one encoding type. Check out this part of the manual.
Look for functions that start with the prefix mb, and also functions with the wc prefix for converting from multibyte to wide char. You'll have to set the locale first with setlocale() to a UTF-8 locale so it chooses this implementation of multibyte support.
If you are coming from a Unicode (wide-character) string, I believe the function you are looking for is wcstombs().
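A minimal sketch of that route, assuming the de_DE.utf8 locale is installed, the source file is saved as UTF-8, and out.txt is just a placeholder path:

#include <stdlib.h>
#include <locale.h>
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    /* Select a UTF-8 locale so wcstombs() produces UTF-8 bytes */
    setlocale(LC_ALL, "de_DE.utf8");
    const wchar_t *msg = L"Rote Grütze\n";
    char buffer[64];
    size_t len = wcstombs(buffer, msg, sizeof(buffer));
    if (len != (size_t)-1) {
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buffer, len);  /* raw UTF-8 bytes hit the file */
            close(fd);
        }
    }
    return 0;
}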
Can you open the file in a hex editor and verify, with a simple input example, whether the written bytes really differ from the values you passed to write()? Sometimes the bytes are correct and the text editor has simply assumed an ISO-8859-1 character set, since there is no way for it to determine the character set on its own.
Once you have done this, could you edit your original post to add the pertinent information?
