%c and other symbols (e.g. " ° ") not displayed after compiling - c

I just started to learn C and then want to proceed to C++. I am currently working from a textbook and writing out the examples to get a bit more familiar with the language and the process.
Since the example given in the book didn't work, I tried to find other, similar code. The problem is that after compiling, the program does not show any of the symbols printed with %c. I get symbols for the numbers 33-126, but everything else is either nothing at all or just a white block...
Also, in a previous example I wanted to write °C for a temperature, and it couldn't display the ° symbol.
The example I found on the web that does not display the %c symbols is:
#include <stdio.h>
#include <ctype.h>

int main()
{
    int i;

    i = 0;
    do
    {
        printf("%i %c \n", i, i);
        i++;
    }
    while (i <= 255);
}
Is anyone familiar with this? Why can't I get output for %c, or for characters like °?

ASCII is a 7-bit character set, which means it consists only of code points in the range [0, 127]. In 8-bit code pages there are another 128 available code points, with values from 128 to 255 (i.e. the high bit set). These are sometimes called extended ASCII (although they're not related to ASCII at all), and the characters they map to depend on the character set. An 8-bit charset is sometimes also called ANSI, although that's actually a misnomer.
US English Windows uses the Windows-1252 code page by default, with the character ° at code point 0xB0. Other OSes/languages may use different character sets, which have a different code point for °, or possibly no ° symbol at all.
You have many solutions to this:
If your PC uses an 8-bit charset
Lookup the value of ° in the charset your computer is using and print it normally. For example if you're using CP437 then printf("\xF8") will work because ° is at the code point 0xF8. printf("°") also works if you save the source file in the same code page (CP437)
Or just change charset to Windows-1252/ISO 8859-1 and print '°' or '\xB0'. This can be done programmatically (using SetConsoleOutputCP on Windows and similar APIs on other OSes) or manually (by some console settings, or by running chcp 1252 in Windows cmd). The source code file still needs to be saved in the same code page
Print Unicode. This is the recommended approach.
Linux/Unix and most other modern OSes use UTF-8, so just output the correct UTF-8 string and you don't need to care about anything. However because ° is a multibyte sequence in UTF-8, you must print it as a string. That means you need to use %s instead of %c. A single char can't represent ° in UTF-8. Newer Windows 10 also supports UTF-8 as a locale so you can print the UTF-8 string directly
On older Windows you need to print the string out as UTF-16. It's a little bit tricky but not impossible
If you use "\u00B0" and it prints out successfully then it means your terminal is already using UTF-8. \u is the escape sequence for arbitrary Unicode code points
See also
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Anything outside the range 33-126 isn't a visible ASCII character. 0-32 is stuff like backspace (8), "device control 2" (18), and space (32). 127 is DEL, and anything past that isn't even ASCII; who knows how your terminal will handle that.

Related

How to take I/O in Greek, in C program on Windows console

For a school project I decided to make an app. I am writing it in C, and running it on the Windows console. I live in Greece and the program needs to read and write text in Greek too. So I tried simply
printf("Καλησπέρα");
But it prints some random characters. How can I output Greek letters? And, similarly, how can I take input in Greek?
Welcome to Stack Overflow, and thank you for asking such an interesting question! I wish what you are trying to do was simple. But your programming language (C), and your execution environment (the Windows console) were both designed a long time ago, without Greek in mind. As a result, it is not easy to use them for your simple school project.
When your C program outputs bytes to stdout via printf, the Windows Console interprets those bytes as characters. It has a default interpretation, or encoding, which does not include Greek. In order for your Greek letters to appear, you need to tell Windows Console to use the correct encoding. You do this using the _setmode call, using the _O_U16TEXT parameter. This is described in the Windows _setmode documentation, as Semih Artan pointed out in the comments.
The _O_U16TEXT mode means your program must print text out in UTF-16 form. Each character is 16 bits long. That means you must represent your text as wide characters, using C syntax like L"\x039a". The L before the double quotes marks the string as having "wide characters", where each character has 16 bits instead of 8 bits. The \x in the string indicates that the next four characters are hex digits, representing the 16 bits of a wide character.
Your C program is itself a text file. The C compiler must interpret the bytes of this text file in terms of characters. When used in a simple way, the compiler will expect only ASCII-compatible byte values in the file. That includes Latin letters and digits, and simple punctuation. It does not include Greek letters. Thus you must write your Greek text by representing its bytes with ASCII substitutes.
The Greek characters Καλησπέρα are, I believe, represented in C wide-character syntax as L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1".
Finally, Windows Console must have access to a Greek font in order for it to display the Greek characters. I expect this is not a problem for you, because you are probably already running your computer in Greek. In any case Windows worldwide includes fonts with Greek coverage.
Plugging this Greek text into the sample program in Microsoft's _setmode documentation gives this. (Note: I have not tested this program myself.)
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x039a\x03b1\x03bb\x03b7\x03c3\x03c0\x03ad\x03c1\x03b1\n");
    return 0;
}
Input is another matter. I won't attempt to go through it here. You probably have to set the mode of stdin to _O_U16TEXT. Then characters will appear as UTF-16. You may need to convert them before they are useful to your program.
Overall, to write a simple app for a school project, which reads and writes Greek, I suggest that you consider using a tool like Visual Studio to write a GUI program. These tools have more modern design, and give you access to text with Greek letters more easily.

Clarification on Winapi Paths and Filename (W functions and A functions)

I have tried to check the importance of, and the reason to use, the W WinAPI functions vs. the A ones (W meaning wide char, A meaning ASCII, right?).
I made a simple example; I retrieve the temp path for the current user like this:
CHAR pszUserTempPathA[MAX_PATH] = { 0 };
WCHAR pwszUserTempPathW[MAX_PATH] = { 0 };
GetTempPathA(MAX_PATH - 1, pszUserTempPathA);
GetTempPathW(MAX_PATH - 1, pwszUserTempPathW);
printf("pathA=%s\r\npathW=%ws\r\n",pszUserTempPathA,pwszUserTempPathW);
My current user has a Russian name, so it's written in Cyrillic. printf outputs this:
pathA=C:\users\Пыщь\Local\Temp
pathW=C:\users\Пыщь\Local\Temp
So both paths are all right. I thought I would receive an error, or a mess of symbols, from GetTempPathA, since the current user name is Unicode, but I figured out that Cyrillic characters are actually included in the extended ASCII character set. So I have a question: if someone were to use my software and it extracts data into the temp folder of a Chinese user (assuming they have Chinese symbols in their user name), will I get a mess or an error using the GetTempPathA version? Should I always use the W-prefixed functions for production software that works with WinAPI directly?
First, the -A suffix stands for ANSI, not ASCII. ASCII is a 7-bit character set. ANSI, as Microsoft uses the term, is for an encoding using 8-bit code units (chars) and code pages.
Some people use the terms "extended ASCII" or "high ASCII," but that's not actually a standard and, in some cases, isn't quite the same as ANSI. Extended ASCII is the ASCII character set plus (at most) 128 additional characters. For many ANSI code pages this is identical to extended ASCII, but some code pages accommodate variable length characters (which Microsoft calls multi-byte). Some people consider "extended ASCII" to just mean ISO-Latin-1 (which is nearly identical to Windows-1252).
Anyway, with an ANSI function, your string can include any characters from your current code page. If you need characters that aren't part of your current code page, you're out-of-luck. You'll have to use the wide -W versions.
In modern versions of Windows, you can generally think of the -A functions as wrappers around the -W functions that use MultiByteToWideChar and/or WideCharToMultiByte to convert any strings passing through the API. But the latter conversion can be lossy, since wide character strings might include characters that your multibyte strings cannot represent.
Portable, cross-platform code often stores all text in UTF-8, which uses 8-bit code units (chars) but can represent any Unicode code point, and anytime text needs to go through a Windows API, you'd explicitly convert to/from wide chars and then call the -W version of the API.
UTF-8 is nearly similar to what Microsoft calls a multibyte ANSI code page, except that Windows does not completely support a UTF-8 code page. There is CP_UTF8, but it works only with certain APIs (like WideCharToMultiByte and MultiByteToWideChar). You cannot set your code page to CP_UTF8 and expect the general -A APIs to do the right thing.
As you try to test things, be aware that it's difficult (and sometimes impossible) to get the CMD console window to display characters outside the current code page. If you want to display multi-script strings, you probably should write a GUI application and/or use the debugger to inspect the actual content of the strings.
Of course you need the wide version. The ANSI version technically can't even handle more than 256 distinct characters. Cyrillic is included in the extended ASCII set (if that's your localization), while Chinese isn't and can't be, due to the much larger set of characters needed to represent it. Moreover, you can get a mess with Cyrillic as well: it will only work properly if the executing machine has a matching localization. On a machine with a non-Cyrillic localization the text will be displayed according to whatever is defined by the localization settings.

What does printf("\033c" ) mean?

I was looking for a way to "reset" my Unix terminal window after closing my program, and stumbled upon printf("\033c" ); which works perfectly, but I just can't understand it. I went to man console_codes and since I'm somewhat inexperienced with Unix c programming, it wasn't very helpful.
Could someone explain printf("\033c" );?
In C, numbers starting with a leading zero are octal numbers: numbers in base 8.
What it does is print the character represented by the octal number 33, followed by a 'c'.
In ASCII encoding the octal number 33 is the ESC (escape) character, which is a common prefix for terminal control sequences.
With that knowledge searching for terminal control sequences we can find e.g. this VT100 control sequence reference (VT100 was an old "dumb" terminal, and is emulated by most modern terminal programs). Using the VT100 reference we find <ESC>c in the terminal setup section, where it's documented as
Reset Device <ESC>c
Reset all terminal settings to default.
The ESC character could also be printed using "\x1b" (still assuming ASCII encoding). There is no way to use decimal numbers in constant string literals, only octal and hexadecimal.
However (as noted by the comment by chux) the sequence "\x1bc" will not do the same as "\033c". That's because 0x1bc is a valid hexadecimal number, and the compiler is greedy when it parses such sequences. It will print the character represented by the value 0x1bc instead, and I have no idea what it might be (depends on locale and terminal settings I suppose, might be printed as a Unicode character).
That's an escape sequence used to reset a DEC VT100 (or compatible) terminal. Some terminals (such as Linux console) accept VT100-style escape sequences, even when they are not actually VT100s.
The \033 is the ASCII escape character, which begins these sequences. Most are followed by another special character (this is a rare exception). XTerm Control Sequences lists that, along with others that are not followed by a special character.
In ECMA-48 terms, the usual case has ESC followed by another character, e.g., [ as the control sequence introducer.
Resetting a real VT100 (in contrast to a terminal emulator) does more than clear the screen, as noted in Debian Bug report #60377, "reset" broken for dumb terminals. Users of terminal emulators tend to assume it is just a short way to clear the screen; the standard way to do that would be something like this:
printf("\033[H\033[J");
The ncurses FAQ Why does reset log me out? addresses that issue.
Incidentally, users of terminal emulators also get other issues with the terminal confused. The ncurses FAQ How do I get color with VT100? addresses one of those.
It clears the screen on Linux-type operating systems (Ubuntu, Fedora, etc.).
You can check on asciitable.com: under octal 33 (decimal 27) you have the ESC character.

validate the entry of an ASCII character

I have a homework problem: I have to validate the entry of uppercase characters.
I just put while (c < 65 || c > 90) and it works fine for A to Z. But in my country we use Ñ too, and that is my problem. I tried using the code 165 to validate the entry, but it didn't work.
The char range is from -128 to 127, so for the extended ASCII table I need an unsigned char, right?
I tried this:

#include <stdio.h>

int main()
{
    unsigned char n;
    scanf("%c", &n);
    printf("%c", n);
    return 0;
}
It prints 165 if it scans a 'Ñ'.
The next one:

unsigned char n;
n = 'Ñ';
printf("%d", n);

prints 209.
So I try to validate with both 165 and 209, and neither works.
Why does this happen? What can I do to validate the entry of this character?
It works when I use unsigned char and validate with 165. But when I used cmd to try it by reading a .txt file, it didn't work...
It prints 165 if I scan a 'Ñ'.
This means that on your system the character 'Ñ' has the code 165, as in the OEM code page 437 (an extension of ASCII).
printf("%d",'Ñ');
prints 209.
This reveals a different encoding for the characters you enter manually in your IDE.
Mark Tolonen has suggested that it corresponds to OEM cp437.
(I originally associated it with UTF-8, but I'm a little confused now...)
In C you have to take into account the existence of two character sets, which can differ:
The source character set.
The execution character set.
The source character set refers to the encoding used by your editing environment, that is, wherever you normally type your .c files. Your system and/or editor and/or IDE works with a specific encoding scheme.
Thus, if you write 'Ñ' in your editor, the character Ñ gets the encoding of your editor, not the encoding of the target system. In this case you have Ñ encoded as 209, which makes 'Ñ' == 209 true; 209 is where Ñ sits in Windows-1252 (and in Latin 1, ISO-8859-1).
The execution character set refers to the encoding used by the operating system and/or the console in which you run your executable (that is, compiled) programs. Here it appears to be the OEM code page 437, where Ñ sits at 165.
In particular, when you type Ñ in the console of your system, it's encoded as 165, which gives you the value 165 when you print it.
Since this dichotomy can always happen, you must be wary of it and make some adjustments to avoid potential problems.
It works when I use unsigned char and validate with 165. But when I used cmd to try it by reading a .txt file, it didn't work...
This means that your .txt file has been written with a text editor (perhaps your own IDE, I guess) that is using an encoding different from the one your console uses.
Let me guess: you are writing your C code and your text files with the same IDE, but you are executing programs from the Windows CMD.
There are two possible solutions here.
The complicated solution is to investigate encoding schemes, locale issues, and wide characters. There are no quick solutions here, because it requires care with several delicate things.
The easy solution is to make adjustments in all the tools you are using.
Go to the options of your IDE and try to find out which encoding it uses to save text files (it could be Latin 1 (ISO-8859-1), Windows-1252, UTF-8, UTF-16, and a long list of others):
Execute the command CHCP in your CMD to obtain the code page number your system is using. This code page is a number whose meaning is explained by Microsoft here:
a. OEM codepages
b. Windows codepages
c. ISO codepages
d. LIST OF ALL WINDOWS CODEPAGES
I guess you have code page 850, or else 28591 (corresponding to Latin 1).
Change one of these configurations to match the other.
a. In the configuration of your IDE, in the "editor options" part, you could change the encoding to something like Latin 1, or ISO-8859-1.
b. Or better, change the code page in your CMD by means of the CHCP command, to match the OEM 437 encoding:
CHCP 437
The solution involving a change of code page in CMD will probably not always work as expected.
Solution (a.), modifying the configuration of your editor, is safer.
However, it would be preferable to keep UTF-8 in your editor (if that is your editor's choice), because nowadays all modern software is moving to the UTF encodings (Unicode).
New info: the UTF-8 encoding sometimes uses more than one byte to represent one character. The following table shows the UTF-8 encoding for the first 256 code points:
UTF-8 for U+0000 to U+00FF
Note: after a little discussion in the comments, I realized that I had some wrong beliefs about UTF-8 encoding. At least this illustrates my point: encoding is not a trivial matter.
So I repeat my advice to the OP: go down the simplest path, and try to reach an agreement with your teacher about how to handle the encoding of special characters.
165 is not an ASCII code. ASCII goes from 0 to 127. 165 is a code in some other character set. In any case, char must be used for scanf and you can convert the value to unsigned char after that. Alternatively, use getchar() which returns a value in the range of unsigned char already.
You should use the standard function isalpha from ctype.h:
int n = getchar();
if ( isalpha(n) )
{
// do something...
}
You probably also will have to set a locale in which this character is a letter, e.g. setlocale( LC_CTYPE, "es_ES");

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好";     // line-1
wchar_t *wchar_ = L"中日友好"; // line-2

int main()
{
    printf("ascii_: %s\n", ascii_);   // line-3
    wprintf(L"wchar_: %s\n", wchar_); // line-4
    return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to a char* pointer in line-1, but I just did, and the output of line-3 is correct. Why? How can printf() in line-3 give me the non-ASCII characters? Does it somehow know the encoding?
I assume the code in line-2 and line-4 is correct, but why don't I get any output from line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is ASCII-compatible.
Now, as for why wprintf() isn't working: this has to do with stream orientation. Each stream can only be set to either byte or wide orientation. Once set, it cannot be changed. It is set the first time the stream is used (byte-oriented here, because of the printf). After that, wprintf() will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question of why the wprintf line doesn't work even by itself: it's probably because the code is compiled so that the UTF-8 form of 中日友好 is stored in wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
Line 1 is not ASCII; it's whatever multibyte encoding your compiler uses at compile time. On modern systems that's probably UTF-8. printf does not know the encoding; it's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
You are omitting one step and therefore thinking the wrong way.
You have a C file on disk, containing bytes. You have an "ASCII" string and a wide string.
The "ASCII" string takes the bytes exactly as they are in line 1 and outputs them.
This works as long as the encoding on the user's side is the same as the one on the programmer's side.
The wide string is first decoded from the given bytes into Unicode code points stored in the program (maybe this goes wrong on your side). On output they are encoded again according to the encoding on the user's side. This ensures that the characters are emitted as they were intended, not as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.

Resources