ASCII characters in C

I'm trying to save a character from the Cyrillic alphabet in a char.
When I read a string from the console it is stored in the char array successfully, but just initializing a char with such a character doesn't seem to work: I get "programName.exe has stopped working" when trying to run it.
#include <stdio.h>
#include <conio.h>
#include <string.h>
#include <Windows.h>
#include <stdlib.h>

void test(){
    char test = 'Я';
    printf("%s",test);
}

void main(){
    SetConsoleOutputCP(1251);
    SetConsoleCP(1251);
    test();
}
fgets ( books[booksCount].bookTitle, 80, stdin ); // this seems to be working ok with ascii.
I tried using wchar_t but I get the same results.

If you're using Russian Windows, which uses the Windows-1251 code page by default, you can print the character encoded as a single byte using the old printf, but you need to make sure that the source code uses the same CP1251 charset. Don't save it as Unicode.
But the preferred way is to use wprintf with a wide-char string:
void test() {
    wchar_t test_char = L'Я';
    wchar_t *test_string = L"АБВГ"; // or LPCWSTR test_string
    wprintf(L"%lc\n%ls\n", test_char, test_string); // %lc/%ls take wide arguments both with MSVC and in standard C
}
This time you need to save the file as Unicode (UTF-8 or UTF-16).
UTF-8 may be better, but it's trickier on Windows. Moreover, if you use UTF-8 you cannot store Я in a single char, because it needs more than one byte; you must use a char* instead.
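If you do go the UTF-8 route on Windows, a minimal sketch might look like the following; it assumes the source file itself is saved as UTF-8 and that the console font can display Cyrillic.
#include <stdio.h>
#include <Windows.h>

int main(void)
{
    // Switch the console to UTF-8 output (code page 65001)
    SetConsoleOutputCP(CP_UTF8);

    // In UTF-8 the letter Я occupies two bytes, so it has to live in a
    // string (char*), not in a single char
    const char *ya = "Я";
    printf("%s\n", ya);
    return 0;
}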
Note that main must return int, not void, and the fgets call above must be made from inside some function.

This could be solved by doing:
void test()
{
    char test = 'Я';
    putchar(test);
}
But there is a catch: since 'Я' is not an ASCII character, you might need to set the appropriate locale first.
Moreover, only ASCII characters 32 to 126 are guaranteed to be printable, and to produce the same symbol, on all systems.
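A sketch of that, assuming the source file is saved in a single-byte Cyrillic code page such as CP1251, is just the snippet above with the locale call added:
#include <stdio.h>
#include <locale.h>

int main(void)
{
    // Pick up the user's locale (and therefore its code page)
    setlocale(LC_ALL, "");

    char test = 'Я'; // works only if the source file and the console use
                     // the same single-byte code page, e.g. CP1251
    putchar(test);
    putchar('\n');
    return 0;
}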

Related

Expected encoding of wcwidth() argument

I'm trying to find out what the expected encoding of the wcwidth() argument is. The man page says absolutely nothing about this, and I wasted hours trying to find out what it is. Here's an example, in C:
#include <stdio.h>
#include <wchar.h>

void main()
{
    wchar_t c = L'h';
    printf("%d\n", wcwidth(c));
}
I want to know how I should encode this character literal so that this program prints 2 instead of -1.
Here's a Rust example:
extern "C" {
fn wcwidth(c: libc::wchar_t) -> libc::c_int;
}
fn main() {
let c = 'h';
println!("{}", unsafe { wcwidth(c as libc::wchar_t) });
}
Similarly, I want to convert this character constant to wchar_t (i32) so that this program prints 2.
Thanks.
UPDATE: Sorry for my wording, I made this sound specific to C's wide char literals. I want to encode character literals in any language as a 32-bit int so that when I pass it to wcwidth I get the right answer. So my question is not specific to C or C's wide char literals.
UPDATE 2: I'd also be happy with another function like wcwidth that is better specified (and maybe even platform independent), e.g. one that takes a UTF-8 encoded character and returns the number of columns needed to render it in a monospace terminal.
You need to define _XOPEN_SOURCE so that wcwidth() is declared, and you also need to set the locale.
Try this:
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    wchar_t c = L'h';
    printf("%d\n", wcwidth(c));

    return 0;
}
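If you start from UTF-8 bytes rather than a wide character literal (the situation in the second update), one option is to decode them with mbrtowc() first and then call wcwidth() on the result. A minimal sketch, assuming a UTF-8 locale:
#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    // UTF-8 bytes for U+FF48, fullwidth Latin small letter h ('ｈ')
    const char *utf8 = "\xEF\xBD\x88";
    wchar_t wc;
    mbstate_t state = {0};

    // Decode one multibyte character into a wchar_t
    size_t n = mbrtowc(&wc, utf8, 4, &state);
    if (n == (size_t)-1 || n == (size_t)-2) {
        fprintf(stderr, "could not decode the byte sequence\n");
        return 1;
    }
    printf("%d\n", wcwidth(wc)); // prints 2 in a UTF-8 locale
    return 0;
}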

Store accented letters in C variable

My main language is Portuguese, so we have some accented words (with á é í ó ú, etc.). I'm trying to read and store those characters in a variable, but it just doesn't work. If I just set it in the code it works, but if I ask the user for input it doesn't. Example code:
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char *argv[]) {
    setlocale(LC_ALL, "Portuguese");
    char test, test2; //The same still happens using unsigned char

    test = 'í';
    printf("Character: %c\n", test);

    scanf(" %c", &test2); //The same still happens using fgets in case of a string
    printf("Character: %c\n", test2);

    system("pause");
    return 0;
}
When compiled and executed the code shows:
Character: í
(wait for input, example:) í
Character: ¡
If the input is 'á' it prints ' ' (space), 'é' prints '', 'ó' prints '¢' and 'ú' prints '£'.
I'm new to programming and Stack Overflow, so sorry for any mistakes I made; any help is appreciated, thank you.
Oh, also I'm using Dev-C++ to compile, if that makes any difference.
You need to recognize that a char in C is a numeric type with a size of 1 byte. It is not really intended to hold the representation of a single language character (sometimes called a code point).
You have two options to deal with this situation:
1. Use a character encoding that is single-byte (e.g. the appropriate member of the ISO 8859 family, ISO 8859-1 in your case). This will ensure that every character fits into a single byte.
2. Deal with your input using the proper mechanisms for multibyte characters. You might look at the char16_t or char32_t types, or turn to wchar_t and the related library routines; a sketch follows below.
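A rough sketch of the second option with wchar_t, assuming the console and the locale agree on the encoding (which is not guaranteed on every Windows setup), could look like:
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
    // Use the user's locale so wide-character I/O knows the console encoding
    setlocale(LC_ALL, "");

    wchar_t test = L'í';
    wchar_t test2;

    wprintf(L"Character: %lc\n", test);
    if (wscanf(L" %lc", &test2) == 1) // read a single wide character
        wprintf(L"Character: %lc\n", test2);
    return 0;
}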

How to find the built-in function to deal with char16_t in C?

Please tell me what the char16_t versions of the string manipulation functions are, such as the ones listed here:
http://www.tutorialspoint.com/ansi_c/c_function_references.htm
I found many reference sites, but none of them mention this.
The printing functions are especially important, because they help me verify whether the manipulation functions work.
#include <stdio.h>
#include <uchar.h>

char16_t *u = u"α";

int main(int argc, char *argv[])
{
    printf("%x\n", u[0]);   // output 3b1, it is UTF16
    wprintf("%s\n", u);     // no output
    _cwprintf("%s\n", u);   // incorrect output
    return 0;
}
To print/read/open/write etc., you need to convert to 32-bit chars using the mbsrtowcs function.
For all intents and purposes, char16_t is a multi-byte representation, therefore you need to use the mbr* functions to work with this integral type.
A few answers used the L"..." prefix, which is incorrect here; 16-bit string literals require the u"..." prefix.
The following code gets you everything you need to work with 8, 16, and 32-bit string representations.
#include <string.h>
#include <wchar.h>
#include <uchar.h>
You can Google the procedures found in <wchar.h> if you don't have manual pages (UNIX).
Gnome.org's GLib has some great code for you to drop-in if overhead isn't an issue.
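For char16_t specifically, C11 also provides c16rtomb() in <uchar.h>, which converts one UTF-16 code unit at a time into the locale's multibyte encoding, so it can be used to print a char16_t string. A minimal sketch, assuming a UTF-8 locale:
#include <stdio.h>
#include <limits.h>
#include <locale.h>
#include <uchar.h>

int main(void)
{
    setlocale(LC_ALL, ""); // use the environment's locale (e.g. UTF-8)

    const char16_t *u = u"αβ";
    char buf[MB_LEN_MAX];
    mbstate_t state = {0};

    // Convert each UTF-16 code unit to the locale's multibyte encoding
    // and write the resulting bytes to stdout
    for (size_t i = 0; u[i] != 0; i++) {
        size_t n = c16rtomb(buf, u[i], &state);
        if (n != (size_t)-1)
            fwrite(buf, 1, n, stdout);
    }
    putchar('\n');
    return 0;
}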
char16_t and char32_t are ISO C11 (iso9899:2011) extensions.
wprintf and its wchar colleagues need the format string to be in wchar too:
wprintf( L"%s\n", u);
For wchar, L is used as the prefix on string literals.
Edit:
Here's a code snippet (tested on Windows):
#include <stdio.h>
#include <io.h>
#include <fcntl.h>
#include <wchar.h>

int main(void)
{
    wchar_t* a = L"α";

    fflush(stdout); // must be done before _setmode
    _setmode(_fileno(stdout), _O_U16TEXT); // set console mode to unicode

    wprintf(L"alpha is:\n\t%s\n", a); // works for me :)
    return 0;
}
By default the console doesn't work in Unicode and prints a "?" for non-ASCII chars, hence the _setmode call. On Linux you need to remove the underscore prefix from setmode and fileno.
Note: for Windows GUI output there is already proper support, so you can use wsprintf to format Unicode strings.

iterating through a char array with non standard chars

Edit:
I can only use stdio.h and stdlib.h
I would like to iterate through a char array filled with chars.
However, chars like ä and ö take up twice the space and use two elements.
This is where my problem lies: I don't know how to access those special chars.
In my example the char "ä" would use hmm[0] and hmm[1].
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    char* hmm = "äö";
    printf("%c\n", hmm[0]); //i want to print "ä"
    printf("%i\n", strlen(hmm));
    return 0;
}
Thanks, I tried to run my attached code in Eclipse, and there it works. I assume that's because it uses 64 bits and the "ä" has enough space to fit. strlen confirms that each "ä" is only counted as one element.
So I guess I could somehow tell it to allocate more space for each char (so "ä" can fit)?
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char* hmm = "äüö";
    printf("%c\n", hmm[0]);
    printf("%c\n", hmm[1]);
    printf("%c\n", hmm[2]);
    return 0;
}
A char always uses one byte.
In your case you think that "ä" is one char: wrong.
Open your .c source code with a hexadecimal viewer and you will see that ä uses 2 chars, because the file is encoded in UTF-8.
Now the question is: do you want to use wide characters?
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main()
{
    const wchar_t hmm[] = L"äö";

    setlocale(LC_ALL, "");

    wprintf(L"%ls\n", hmm);
    wprintf(L"%lc\n", hmm[0]);
    wprintf(L"%zu\n", wcslen(hmm)); // wcslen returns size_t
    return 0;
}
Your data is in a multi-byte encoding. Therefore, you need to use multibyte character handling techniques to divvy up the string. For example:
#include <stdio.h>
#include <stdlib.h> // mblen()
#include <string.h>
#include <locale.h>

int main(void)
{
    char* hmm = "äö";
    int off = 0;
    int len;
    int max = strlen(hmm);

    setlocale(LC_ALL, "");
    printf("<<%s>>\n", hmm);
    printf("%zu\n", strlen(hmm));

    while (hmm[off] != '\0' && (len = mblen(&hmm[off], max - off)) > 0)
    {
        printf("<<%.*s>>\n", len, &hmm[off]);
        off += len;
    }
    return 0;
}
On my Mac, it produced:
<<äö>>
4
<<ä>>
<<ö>>
The call to setlocale() was crucial; without that, the program runs in the "C" locale instead of my en_US.UTF-8 locale, and mblen() mishandled things:
<<äö>>
4
<<?>>
<<?>>
<<?>>
<<?>>
The question marks appear because the bytes being printed are invalid single bytes as far as the UTF-8 terminal is concerned.
You can also use wide characters and wide-character printing, as shown in benjarobin's answer.
Sorry to drag this on, but I think it's important to highlight some issues. As I understand it, OS X can use UTF-8 as its default OS code page, so this answer is mostly about Windows, which uses UTF-16 under the hood and whose default ANSI code page (ACP) depends on the configured OS region.
First, you can open Character Map and find that ä and ö both reside in code page 1252 (Western), so this is not an MBCS issue. The only way it could be an MBCS issue is if you saved the file using an MBCS encoding (Shift-JIS, Big5, Korean, GBK).
The answer of using
setlocale( LC_ALL, "" )
does not give insight into the reason why äö was rendered incorrectly in the Command Prompt window.
Command Prompt uses its own code pages, namely OEM code pages; there are references listing the available OEM code pages together with their character maps. Opening Command Prompt and typing the command chcp will reveal the current OEM code page that the Command Prompt is using.
The Microsoft documentation for
setlocale( LC_ALL, "" );
describes the following behavior: it sets the locale to the default, which is the user-default ANSI code page obtained from the operating system.
You can do this manually by using chcp and passing your required code page; then run your application and it should output the text perfectly fine.
If it were a multi-byte character set problem, then there would be a whole list of other issues:
Under MBCS, characters are encoded in either one or two bytes. In two-byte characters, the first, or "lead-byte," signals that both it and the following byte are to be interpreted as one character. The first byte comes from a range of codes reserved for use as lead bytes. Which ranges of bytes can be lead bytes depends on the code page in use. For example, Japanese code page 932 uses the range 0x81 through 0x9F as lead bytes, but Korean code page 949 uses a different range.
Looking at the situation, and at the fact that the length was 4 instead of 2, I would say the file has been saved as UTF-8. (It could in fact have been saved as UTF-16, though you would have run into problems with the compiler sooner rather than later.) You're using characters outside the ASCII range of 0 to 127, so UTF-8 encodes each of those Unicode code points as two bytes. Your compiler opens the file assuming your default OS code page (ANSI), and when parsing your string it interprets it as an ANSI string where 1 byte = 1 character.
To solve the issue under Windows, convert the UTF-8 string to UTF-16 and print it with wprintf. Currently there is no native UTF-8 support in the ANSI/MBCS stdio functions.
For Mac OS X, which has UTF-8 as the default OS code page, I would recommend following Jonathan Leffler's solution because it is more elegant. If you port it to Windows later, though, you will find you need to convert the string from UTF-8 to UTF-16 using the example below.
With either solution you will still need to change the Command Prompt code page to your operating system code page to print the characters above ASCII correctly.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <Windows.h>
#include <locale.h>

// File saved as UTF-8, with characters outside the ASCII range
int main()
{
    // Use the user-default locale (the ANSI code page) for the C runtime
    setlocale(LC_ALL, "");

    // ä and ö are outside the ASCII range (Western Latin-1 code points),
    // so each takes two bytes when the file is saved as UTF-8
    const char* hmm = "äö";
    printf("UTF-8 file string using Windows 1252 code page read as:%s\n", hmm);
    printf("Length:%zu\n", strlen(hmm));

    // Convert the UTF-8 string to a wide-character (UTF-16) string
    int nLen = MultiByteToWideChar(CP_UTF8, 0, hmm, -1, NULL, 0);
    LPWSTR lpszW = malloc(nLen * sizeof(WCHAR));
    MultiByteToWideChar(CP_UTF8, 0, hmm, -1, lpszW, nLen);

    // Print it
    wprintf(L"wprintf wide character of UTF-8 string: %ls\n", lpszW);

    // Free the memory
    free(lpszW);

    getchar();
    return 0;
}
UTF-8 file string using Windows 1252 code page read as:äö
Length:4
wprintf wide character of UTF-8 string: äö
I would check your Command Prompt font/code page to make sure that it can display your OS single-byte encoding. Note that Command Prompt has its own code page, which differs from your text editor's.

How to read the value stored in a Key from the Registry using C

Hi, I'm trying to read the value stored in a registry key using C code. I've tried the following code. It doesn't generate any compilation errors, but I get only the first letter of the string as output. Here is my code sample:
#include "stdafx.h"
#include <windows.h>
#include <malloc.h>
#include <stdio.h>
#define TOTALBYTES 8192
#define BYTEINCREMENT 4096
#define BUFFER 8192
int _tmain(int argc, _TCHAR* argv[])
{
char value[255];
DWORD BufferSize = BUFFER;
RegGetValue(HKEY_LOCAL_MACHINE, TEXT("SOFTWARE\\Test\\subkey"), TEXT("blockedurlslist"), RRF_RT_ANY, NULL, (PVOID)&value, &BufferSize);
printf("%s",value);
system("pause");
}
Please help me if anyone has an idea.
For a start you are not checking whether or not the call to RegGetValue succeeded. Always check the return values of Win32 API calls.
But what is happening is pretty clear. Whenever a function returns a string with only the first character, the most likely cause is that the function is returning UTF-16 data which you interpret as ANSI. Your string will contain an English character which is encoded in UTF-16 with a 0 as the second byte. And when interpreted as ANSI that 0 is treated as the string terminator.
You need to declare your buffer as containing a wide character payload. Since you are using TCHAR, you would do it like this:
TCHAR value[255];
And the buffer size is
DWORD BufferSize = sizeof(value);
You will need to change your printf to be able to print out a wide string.
If I were you I would not be using TCHAR. I would recommend that you decide to use Unicode throughout your code. It makes it much simpler for you to understand and I doubt that you need to support Windows 98 these days.
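Putting those pieces together, a minimal sketch of the wide-character version (reusing the key and value names from the question, assuming the value really is a REG_SZ, and adding the error check) could look like this:
#include <windows.h>
#include <stdio.h>

// Link with Advapi32.lib
int main(void)
{
    wchar_t value[255];
    DWORD size = sizeof(value); // RegGetValue expects the size in bytes

    LSTATUS rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                              L"SOFTWARE\\Test\\subkey",
                              L"blockedurlslist",
                              RRF_RT_REG_SZ, // only accept a string value
                              NULL,
                              value,
                              &size);
    if (rc != ERROR_SUCCESS)
    {
        fprintf(stderr, "RegGetValue failed: %ld\n", (long)rc);
        return 1;
    }
    wprintf(L"%ls\n", value);
    return 0;
}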
What is the call returning? If it's not returning ERROR_SUCCESS, it failed. You can inspect the value of BufferSize after the call to find out how much data the API claims to have returned.
As pointed out in a comment, you should say:
char value[BUFFER];
DWORD BufferSize = sizeof value;
since right now you're lying to the API about your buffer size, which invites buffer overwrites.
Perhaps it's returning data in 16-bit characters, and you're printing with an 8-bit call, thus interpreting the interspersed 0-bytes as string terminators.
