The following code works:
char *text = "中文";
printf("%s", text);
Then I try to print this text via its Unicode code points, which are 0x4e2d for "中" and 0x6587 for "文":
And sure enough, nothing prints out.
I'm trying to understand what happens when I store a multi-byte string in a char * and how to print a multi-byte string from its Unicode code points, and furthermore, what does the warning "Format specifier '%ls' requires 'wchar_t *' argument instead of 'wchar_t *'" mean?
Thanks for any help.
Edit:
I'm on macOS (High Sierra 10.13.6), with CLion
$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
wchar_t *arr = malloc(2 * sizeof(wchar_t));
arr[0] = 0x4e2d;
arr[1] = 0x6587;
First, the above string is not null-terminated. The printf function knows the beginning of the array, but it has no idea where the array ends or what size it has. You have to add a zero at the end to make a null-terminated C string.
To print this null-terminated wide string, use printf("%ls", arr); on Unix-based machines (including Mac). On Windows, use wprintf("%s", arr); (that's a completely different thing: it actually treats the string as UTF-16).
Make sure to add setlocale(LC_ALL, "C.UTF-8"); or setlocale(LC_ALL, ""); on Unix-based machines.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "C.UTF-8");

    // print a single character:
    printf("%lc\n", 0x00004e2d);
    printf("%lc\n", 0x00006587);
    printf("%lc\n", 0x0001F310);

    wchar_t *arr = malloc((2 + 1) * sizeof(wchar_t));
    arr[0] = 0x00004e2d;
    arr[1] = 0x00006587;
    arr[2] = 0;
    printf("%ls\n", arr);
    return 0;
}
Aside,
In UTF-32, code points always need 4 bytes (for example, 0x00004e2d). This can be represented with a 4-byte data type such as char32_t (or wchar_t on POSIX systems).
In UTF-8, code points need 1, 2, 3, or 4 bytes. The UTF-8 encoding of an ASCII character needs one byte, while 中 needs 3 bytes (i.e. 3 char values). You can confirm this by running this code:
printf("A:%zu 中:%zu 🙂:%zu\n", strlen("A"), strlen("中"), strlen("🙂"));
So we can't store such a character in a single char when using UTF-8. We can use strings instead:
const char* x = u8"中";
We can use the normal C string functions, like strcpy, on such strings. But some standard C functions don't work as you might expect: strchr, for example, can't find 中, because it searches for a single byte. This is usually not a problem, because characters such as printf format specifiers are all ASCII and fit in one byte.
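As a small illustration of that last point (a sketch that assumes the source file and the execution character set are both UTF-8): strstr can find the multi-byte sequence where strchr cannot.
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "abc中文def";   /* UTF-8: 中 and 文 take 3 bytes each */

    /* strchr takes a single character, so it cannot search for 中,
     * which occupies three bytes in UTF-8. strstr searches for a
     * byte sequence, so it works. */
    const char *p = strstr(s, "中");
    if (p)
        printf("found at byte offset %td\n", p - s);   /* prints 3 */
    return 0;
}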
Program's Purpose: Rune Cipher
Note - I am linking to my own GitHub page below
(it is only there to show the purpose of the project, i.e. what I needed help with, and got help with; thanks once again to all of you!)
Final Edit:
I have now (thanks to the extremely useful answers provided by the extremely amazing people below) completed the project I've been working on, and for future readers I am also providing the full code.
Again, this wouldn't have been possible without all the help I got from the people below; thanks to them once again!
Original code on GitHub
Code
(Shortened down a bit)
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define UNICODE_BLOCK_START 0x16A0
#define UUICODE_BLOCK_END   0x16F1

int main() {
    setlocale(LC_ALL, "");

    wchar_t SUBALPHA[] = L"ᛠᚣᚫᛞᛟᛝᛚᛗᛖᛒᛏᛋᛉᛈᛇᛂᛁᚾᚻᚹᚷᚳᚱᚩᚦᚢ";
    wchar_t DATA[]     = L"hello";

    int lenofData = 0;
    int i = 0;
    while (DATA[i] != '\0') {
        lenofData++;
        i++;
    }

    for (int i = 0; i < lenofData; i++) {
        printf("DATA[%d]=%lc", i, DATA[i]);
        DATA[i] = SUBALPHA[i];
        printf(" is now Replaced by %lc\n", DATA[i]);
    }
    printf("%ls", DATA);
    return 0;
}
Output:
DATA[0]=h is now Replaced by ᛠ
...
DATA[4]=o is now Replaced by ᛟ
ᛠᚣᚫᛞᛟ
Question continues below
(Note that it's solved, see Accepted answer!)
In Python3 it is easy to print runes:
for i in range(5794, 5855):
    print(chr(i))
outputs
ᚢ
ᚣ
(..)
ᛝ
ᛞ
How do I do that in C?
Using variables (char, char arrays, int, ...)?
Is there a way to e.g. print ᛘᛙᛚᛛᛜᛝᛞ as individual characters?
When I try it, I just get warnings about a multi-character character constant 'ᛟ'.
I have tried having them as an array of char, i.e. a "string" (e.g. char s1[] = "ᛟᛒᛓ";),
and then printing out the first char (ᛟ) of s1 with printf("%c", s1[0]);. Now, this might seem very wrong to others.
One example of how I thought of going about this:
Print a rune as "an individual character":
To print e.g. 'A':
printf("%c", 65); // 'A'
How do I do that (if possible), but with a rune?
I have also tried printing its numeric value as a char, which results in question marks and other "undefined" results.
As I do not really remember exactly all the things I've tried so far, I will try my best to formulate this post.
If someone spots a very easy (maybe even plain obvious) solution or trick/workaround,
I would be super happy if you could point it out! Thanks!
This has bugged me for quite some time.
It works in Python, though, and it works (as far as I know) in C if you just "print" it directly rather than through a variable: printf("ᛟ"); works. But, as I said, I want to do the same thing through variables, like char runes[] = "ᛋᛟ"; and then printf("%c", runes[0]); // to get 'ᛋ' as the output
(Or similar; it does not need to be %c, and it does not need to be a char array or char variable.) I am just trying to understand how to do the above (hopefully this is not too unreadable).
I am on Linux, and using GCC.
External Links
Python3 Cyphers - At GitHub
Runes - At Unix&Linux SE
Junicode - At Sourceforge.io
To hold a character outside of the 8-bit range, you need a wchar_t (which isn't necessarily Unicode). Although wchar_t is a fundamental C type, you need to #include <wchar.h> to use it, and to use the wide character versions of string and I/O functions (such as putwc shown below).
You also need to ensure that you have activated a locale which supports wide characters, which should be the same locale as is being used by your terminal emulator (if you are writing to a terminal). Normally, that will be the default locale, selected with the string "".
Here's a simple equivalent to your Python code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main(void) {
    setlocale(LC_ALL, "");
    /* As indicated in a comment, I should have checked the
     * return value from `putwc`; if it returns WEOF and errno
     * is set to EILSEQ, then the current locale can't handle
     * runic characters.
     */
    for (wchar_t wc = 5794; wc < 5855; ++wc)
        putwc(wc, stdout);
    putwc(L'\n', stdout);
    return 0;
}
(Live on ideone.)
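For completeness, the return-value check mentioned in that comment might look something like this (a sketch; the error message wording is mine):
#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void) {
    setlocale(LC_ALL, "");
    for (wchar_t wc = 5794; wc < 5855; ++wc) {
        if (putwc(wc, stdout) == WEOF && errno == EILSEQ) {
            /* The current locale cannot represent this character. */
            fprintf(stderr, "cannot output U+%04X in this locale\n", (unsigned)wc);
        }
    }
    putwc(L'\n', stdout);
    return 0;
}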
Stored on the stack as a string of (wide) characters
If you want to add your runes (wchar_t) to a string, you can proceed in the following way,
filling a wide-character buffer one element at a time (wcsncpy would be overkill for single wchar_t values, thanks chqrlie for noticing):
#include <stdio.h>
#include <locale.h>
#include <wchar.h>

#define UNICODE_BLOCK_START 0x16A0 // see the Wikipedia link for the start
#define UUICODE_BLOCK_END   0x16F0 // true ending of the Runic wide chars

int main(void) {
    setlocale(LC_ALL, "");

    wchar_t buffer[UUICODE_BLOCK_END - UNICODE_BLOCK_START + sizeof(wchar_t) * 2];
    int i = 0;

    for (wchar_t wc = UNICODE_BLOCK_START; wc <= UUICODE_BLOCK_END; wc++)
        buffer[i++] = wc;
    buffer[i] = L'\0';

    printf("%ls\n", buffer);
    return 0;
}
About Wide Chars (and Unicode)
To understand a bit better what a wide char is, you have to think of it as a value that can exceed the original range used for a character, which was 2^8 = 256 (or, written as a left shift, 1 << 8).
That range is enough when you just need to print what is on your keyboard, but when you need to print Asian characters or other Unicode characters, it is no longer enough, and that is the reason the Unicode standard was created. You can find more about the very different and exotic characters that exist, along with their ranges (named Unicode blocks), on Wikipedia; in your case, Runic.
Range U+16A0..U+16FF - Runic (86 characters), Common (3 characters)
NB: Your Runic wide chars end at 0x16F1, which is slightly before 0x16FF, but 0x16F1 to 0x16FF are not defined (hence the 0x16F0 used above).
You can use the following function to print your wide char as bits:
void print_binary(unsigned int number)
{
    char buffer[36]; // 32 bits, 3 spaces and one \0
    unsigned int mask = 1u << 31; // start at the most significant bit
    int pos = 0;

    for (int bit = 0; bit < 32; ++bit) {
        if (bit && !(bit % 8))
            buffer[pos++] = ' '; // separate each byte
        buffer[pos++] = '0' + !!(number & (mask >> bit));
    }
    buffer[pos] = '\0';
    printf("%s\n", buffer);
}
That you call in your loop with:
print_binary((unsigned int)wc);
It will give you a better understanding of how your wide char is represented at the machine level:
ᛞ
00000000 00000000 00010110 11011110
NB: You will need to pay attention to detail: Do not forget the final L'\0' and you need to use %ls to get the output with printf.
I am in a situation where I cannot use printf() to print to the console in C.
It's a university assignment and we're reimplementing malloc, calloc, free and realloc. I am using Ubuntu, and when I call printf() it segfaults, as printf() uses malloc in its implementation (according to my lecturer), so I can't use it and have to use write() instead.
So if I have a simple program, how can I print an integer and a pointer address to the console using write()?
I have tried:
#include <unistd.h>

int main(void) {
    int a = 12345;
    int* ptr = &a;

    // none of the following seem to work
    write(2, &a, sizeof(a));
    write(2, "\n", 1);
    write(2, ptr, sizeof(ptr));
    write(2, "\n", 1);
    write(2, &ptr, sizeof(ptr));
    write(2, "\n", 1);
    return 0;
}
The output was
90
90
xZ???
Thanks, Juan
The int value 12345 is equal to 0x00003039 (assuming a 4-byte int).
On a little-endian system (like a standard x86 or x86-64 PC) it's stored as the byte sequence 0x39 0x30 0x00 0x00.
In ASCII encoding 0x39 is the character '9' and 0x30 is '0'.
So printing the value 12345 will print the two characters 90, and then some unprintable zeros.
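A quick way to see that byte layout for yourself (just an illustration of the point above, not part of the fix):
#include <stdio.h>

int main(void) {
    int a = 12345;  /* 0x00003039 */
    const unsigned char *p = (const unsigned char *)&a;

    /* Dump the raw bytes of a; prints "39 30 00 00" on little-endian. */
    for (size_t i = 0; i < sizeof a; ++i)
        printf("%02x ", (unsigned)p[i]);
    printf("\n");
    return 0;
}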
You need to convert the values to text to be able to print them. Perhaps like this:
char value[128];
snprintf(value, sizeof value, "%d", a);
write(STDOUT_FILENO, value, strlen(value));
If you're not allowed to use even snprintf (or sprintf) then you need to come up with some other way to convert the number to text.
1. Decide on the format you want to output the number in. As a hex number? A decimal number? In octal? In base64?
2. Write a conversion function from uintptr_t to a string. uintptr_t is the type that may be used to convert a pointer to a number.
3. Convert the number to a string.
4. Write the string.
The following program:
#include <unistd.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

char *uintptr_to_string(char *dest, size_t n, void *v0) {
    uintptr_t v = (uintptr_t)v0;
    for (size_t i = 0; i < sizeof(v) * 2; ++i) {
        if (n-- == 0) return NULL;
        // take the most significant nibble and turn it into a hex digit
        const short idx = (v >> (sizeof(v) * CHAR_BIT - 4));
        const char hex[] = "0123456789abcdef";
        *dest++ = hex[idx];
        v <<= 4;
    }
    return dest;
}

int main() {
    char string[200];
    int a;
    char *end = uintptr_to_string(string, sizeof(string), &a);
    printf("%018p\n", (void *)&a);  // reference output using printf
    write(STDOUT_FILENO, "0x", 2);
    write(STDOUT_FILENO, string, end - string);
    write(STDOUT_FILENO, "\n", 1);
}
may output on godbolt:
0x00007ffffd3d0a9c
0x00007ffffd3d0a9c
I am in a situation where I cannot use printf() to print to the console in C. It's a university assignment and we're reimplementing malloc, calloc, free and realloc. I am using ubuntu
You are lucky: Ubuntu is mostly open-source software,
so the code of printf, malloc, etc. is inside GNU glibc (or musl libc), whose source code (above the syscalls(2) provided by the Linux kernel) you can read and study.
Read of course Advanced Linux Programming and Modern C (and later the C standard n1570).
Use gdb(1) and strace(1) to understand the behavior of your (or others) program.
If you compile with a recent GCC, be sure to use gcc -Wall -Wextra -g to get warnings and DWARF debug information for the GDB debugger. Once your code is correct, add -O2 to get compiler optimizations. If you want some formal proof of your code, consider using Frama-C.
Mention in your assignment the source code that you actually read. Your teacher would appreciate it.
Of course, you need to convert your int a; into a string buffer such as char buf[80]; avoid buffer overflows and other kinds of undefined behavior. You might not be allowed to use snprintf, but you surely can mimic its code and behavior. Converting an int to some ASCII string is an easy exercise (assuming you understand what a decimal number is).
The intuition is that for a small int x between 0 and 9 included, its decimal digit in ASCII is (char)('0'+x) or "0123456789"[x]. To convert a bigger number, use the % modulus and / integral division operators cleverly in a simple loop.
For IBM Z mainframes using EBCDIC, it should be similar.
The full code is left as an exercise to the reader.
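For what it's worth, a minimal sketch of that digit-by-digit conversion (the helper name write_int is mine, not part of the assignment or of any standard API):
#include <unistd.h>

/* Convert n to decimal text and write it to fd with write().
 * A sketch only: no locale handling, no width or padding. */
static void write_int(int fd, int n)
{
    char buf[12];               /* enough for -2147483648 */
    char *p = buf + sizeof buf; /* fill the buffer from the end */
    unsigned int u = (n < 0) ? 0u - (unsigned int)n : (unsigned int)n;

    do {
        *--p = (char)('0' + u % 10);
        u /= 10;
    } while (u != 0);

    if (n < 0)
        *--p = '-';

    write(fd, p, (size_t)(buf + sizeof buf - p));
}

int main(void)
{
    write_int(STDOUT_FILENO, 12345);
    write(STDOUT_FILENO, "\n", 1);
    return 0;
}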
When running the following:
char acute_accent[7] = "éclair";
int i;

for (i = 0; i < 7; ++i)
{
    printf("acute_accent[%d]: %c\n", i, acute_accent[i]);
}
I get:
acute_accent[0]:
acute_accent[1]: �
acute_accent[2]: c
acute_accent[3]: l
acute_accent[4]: a
acute_accent[5]: i
acute_accent[6]: r
which makes me think that the multibyte character é is 2-byte wide.
However, when running this (after ignoring the compiler warning about a multi-character character constant):
printf("size: %lu",sizeof('é'));
I get size: 4.
What's the reason for the different sizes?
EDIT: This question differs from this one because it is more about multibyte character encoding, the different UTFs and their sizes, than about merely understanding the size of a char.
The reason you're seeing a discrepancy is that in your first example, the character é was encoded by the compiler as the two-byte UTF-8 sequence 0xC3 0xA9.
See here:
http://www.fileformat.info/info/unicode/char/e9/index.htm
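If you want to see those two bytes yourself, a quick check (my own aside, assuming the source file is saved as UTF-8):
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "é";   /* two bytes in UTF-8 */

    /* Print each byte of the UTF-8 encoded literal; expected: 0xC3 0xA9 */
    for (size_t i = 0; i < strlen(s); ++i)
        printf("0x%02X ", (unsigned char)s[i]);
    printf("\n");
    return 0;
}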
And as described by dbush, the character 'é' was encoded as a UTF-32 codepoint and stored in an integer; therefore it was represented as four bytes.
Part of your confusion stems from using an implementation-defined feature by storing Unicode in an undefined manner.
To prevent undefined behavior you should always clearly identify the encoding type for string literals.
For example:
char acute_accent[7] = u8"éclair";
This is very bad form because, unless you count it out yourself, you can't know the exact length of the string. And indeed, my compiler (g++) is yelling at me because, while the string is 7 bytes, it's 8 bytes total with the null character at the end, so you have actually overrun the buffer.
It's much safer to use this instead:
const char* acute_accent = u8"éclair";
Notice how your string is actually 8 bytes:
#include <stdio.h>
#include <string.h> // strlen

int main() {
    const char* a = u8"éclair";
    printf("String length : %zu\n", strlen(a));
    // Add +1 for the null byte
    printf("String size   : %zu\n", strlen(a) + 1);
    return 0;
}
The output is:
String length : 7
String size : 8
Also note that the size of a character constant is different between C and C++!
#include <stdio.h>

int main() {
    printf("%zu\n", sizeof('a'));
    printf("%zu\n", sizeof('é'));
    return 0;
}
In C the output is:
4
4
While in C++ the output is:
1
4
From the C99 standard, section 6.4.4.4:
2 An integer character constant is a sequence of one or more multibyte
characters enclosed in single-quotes, as in 'x'.
...
10 An integer character constant has type int.
sizeof(int) on your machine is probably 4, which is why you're getting that result.
So 'é', 'c', and 'l' are all integer character constants, and thus all have type int, whose size is 4. The fact that some are multibyte and some are not doesn't matter in this regard.
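To see this concretely (my own experiment, not from the answer above; the value printed for the multi-character constant is implementation-defined, and most compilers will warn about it):
#include <stdio.h>

int main(void)
{
    /* With a UTF-8 source file, 'é' is a multi-character integer character
     * constant (two bytes); its value is implementation-defined, but its
     * type is int either way. */
    printf("value: %#x, size: %zu\n", (unsigned)'é', sizeof 'é');
    return 0;
}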
With MSVC 2010 I try to compile this in C or C++ mode (it needs to be compilable in both) and
it does not work. Why? I thought, and found in the documentation, that '\x' takes the next two characters as hex characters and no more (4 characters when using "\X").
I also learned that there is no portable way to use character codes outside ASCII in C source code anyway, so how can I specify some German ISO-8859-1 characters?
int main() {
    char* x = "\xBCd"; // Why is this not char(188) + 'd'?
}
// returns test.c(2) : error C2022: '3021' : too big for character
// and a warning with GCC
Unfortunately you've stumbled upon the fact that \x will consume every subsequent character that looks like a hex digit; instead you'll need to break this up:
const char *x = "\xBC" "d"; /* const added to satisfy literal assignment probs */
Consider the output from this program:
/* wide.c */
#include <stdio.h>

int main(int argc, char **argv)
{
    const char *x = "\x000000000000021";
    return printf("%s\n", x);
}
Compiled and executed:
C:\temp>cl /nologo wide.c
wide.c
C:\temp>wide
!
Tested on Microsoft's C++ compiler shipped with VS 2k12, 2k10, 2k8, and 2k5
Tested on gcc 4.3.4.
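One more option worth noting (my own addition, not from the answer above): unlike \x, an octal escape consumes at most three digits, so it cannot swallow the 'd' that follows:
#include <stdio.h>

int main(void) {
    /* '\274' is octal for 188 (0xBC); octal escapes stop after at most
     * three digits, so the 'd' that follows is a separate character. */
    const char *x = "\274d";
    printf("%u %c\n", (unsigned char)x[0], x[1]);   /* prints: 188 d */
    return 0;
}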
I have been given this school project. I have to alphabetically sort a list of items by Czech rules. Before I dig deeper, I have decided to test it on a 16 by 16 matrix, so I did this:
typedef struct {
    wint_t **field;
} LIST;
...
setlocale(LC_CTYPE, NULL);
....
list->field = (wint_t **)malloc(16 * sizeof(wint_t *));
for (int i = 0; i < 16; i++)
    list->field[i] = (wint_t *)malloc(16 * sizeof(wint_t));
In another function I am trying to assign a char. Like this:
sorted->field[15][15] = L'C';
wprintf(L"%c\n",sorted->field[15][15]);
Everything is fine; the char is printed. But when I try to change it to
sorted->field[15][15] = L'Č';
it says "Extraneous characters in wide character constant ignored" (Xcode), and the printing part is skipped. The main.c file is in UTF-8. If I try to print this:
printf("ěščřžýááíé\n");
It prints it out as written. I am not sure if I should allocate memory using wint_t or wchar_t, or if I am doing it right. I tested it with both, but neither of them works.
clang seems to support entering arbitrary byte sequences into wide strings with the \x notation:
wchar_t c = L'\x2126';
This compiles without notice.
Edit: Adapting what I find on wikipedia about wide characters, the following works for me:
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_ALL, "");
    wchar_t myChar1 = L'\x2126';
    wchar_t myChar2 = 0x2126; // hexadecimal encoding of char Ω using UTF-16
    wprintf(L"This is char: %lc \n", myChar1);
    wprintf(L"This is char: %lc \n", myChar2);
}
and prints nice Ω characters in my terminal. Make sure that your terminal is able to interpret UTF-8 characters.
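Applying the same idea to the Č from the question (code point U+010C; my own tie-back, not part of the original answer), something like this should work, assuming a UTF-8 capable terminal:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");   // pick up the terminal's locale
    wchar_t c = 0x010C;      // U+010C LATIN CAPITAL LETTER C WITH CARON (Č)
    wprintf(L"%lc\n", c);
    return 0;
}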