Character encoding and types - C/C++ - c

Maybe I have the wrong idea, but it was my understanding that wide types (wchar_t and the like) were for UTF-16 Unicode text. If this is correct, then I can't understand the flood of responses to similar issues, all involving some form of wchar_t or other "wide" conversion with UTF-8.
I'm doing a CLI/C++ project with MSVC, with a Unicode build, that uses an implementation of Luac, to compile lua code to bytecode. Now everything works nicely in that regard, but the trouble is that no special handling is done for UTF-8 files - except for "discarding" the BOM. So all the data within is treated as ANSI. Obviously, when it comes to special characters, it becomes a problem to display them correctly.
So, I need a way to convert between the two, preferably at the source (fopen); but as I've rerouted the output, I could also do so there. Unfortunately, the only promising solution I've found, using FILE* fh=fopen(fn,"r,css=UTF-8");, just ends up kicking back an exception for invalid file mode. Which is puzzling, considering it's a Visual C++ project.
Unless of course, I need to change my include order/add an additional include?
/lauxlib.c
#include <ctype.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "lua.h"
#include "lauxlib.h"
/lauxlib.h
#include <stddef.h>
#include <stdio.h>
#include "lua.h"
Edit:
After taking a look at the file in a hex editor, I'm beginning to understand. UTF-8 isn't just 1-byte, it's just 1-byte when it can be. The initial problem still remains, but at least I understand it a bit more.
Edit2/Update:
First off, I'm not sure if this part should be an answer, or if I should close the question - so please, feel free to educate me on that.
The application was originally written to be a console application - so when it needed to output, it just used either putchar or printf. However, this wasn't going to help for a WinForms application. So I basically just rerouted it, by making managed-friendly equivalents.
Luac is essentially a parser/compiler for Lua scripts. It has the option to output information based on the result of that parsing. Listing things like functions, opcodes, constant, and local variables. When it prints out the constants for each function, it prints the actual values of said constants. And that's where the encoding problem comes in.
If the constant value is a string type, the function written to handle printing strings, does the following:
casts its parameter (a pointer to a union type) to a const char*
loops through the const char* by index, assigning the value of each char to an int
checks for any escape characters in the text via a switch/case (tab, newline, etc.) and escapes them
if that falls through, the default case checks whether it's a printable char, using isprint
If it is, it uses putchar
If not, it uses printf, casting it to an unsigned char and using \\%03u for the format.
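The steps above can be sketched roughly as follows. This is not the actual Luac source: escape_string is a hypothetical helper that writes into a buffer instead of calling putchar/printf, and it only handles a few escapes. It does show why UTF-8 text comes out as decimal escapes: in the default "C" locale, isprint rejects every byte of a multi-byte sequence.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from Luac): writes an escaped copy of
   s[0..len) into out and returns the number of characters written,
   not counting the terminating '\0'. Stops early if out is full. */
size_t escape_string(const char *s, size_t len, char *out, size_t outsize)
{
    size_t used = 0;
    size_t i, n;
    char piece[8];

    if (outsize == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < len; i++) {
        /* Go through unsigned char so bytes >= 0x80 are not negative. */
        int c = (unsigned char)s[i];

        switch (c) {
        case '\t': strcpy(piece, "\\t");  break;
        case '\n': strcpy(piece, "\\n");  break;
        case '\\': strcpy(piece, "\\\\"); break;
        default:
            if (isprint(c))
                snprintf(piece, sizeof piece, "%c", c);
            else  /* each byte of a UTF-8 sequence ends up here */
                snprintf(piece, sizeof piece, "\\%03u", (unsigned)c);
            break;
        }

        n = strlen(piece);
        if (used + n >= outsize)
            break;
        memcpy(out + used, piece, n + 1);
        used += n;
    }
    return used;
}
```

With the two bytes of "ę" in UTF-8 (0xC4 0x99) as input, the result is the text \196\153, which is why the bytes have to be decoded as UTF-8 before they reach a form control.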
Now obviously, if the intention was to display it in a form control, and the format is UTF-8, printing out the unsigned value of the individual chars isn't going to help. So I finally decided to just keep Googling for some usage clarification on MultiByteToWideChar, and that worked - except for high value characters (i.e. Asian language characters). Since I found that said Windows function makes some mistakes, I eventually found another that did so "manually". Unfortunately, it still didn't handle those characters correctly.
So I took another look at the actual const char* that was being looped over, and discovered the reason it wasn't being converted - was because something else changed those chars to a value of 63 - the question mark. And this is about the time where tracking that particular "something else" is far beyond my abilities, and asking for help has a real good chance of ending up far too specific for this site's guidelines.
The parameter that this function takes is a pointer to a union typedef, which contains a typedef for string alignment and a struct that contains absolutely zero char arrays/pointers. Yet the function casts it to a const char*, which is how that parameter gets turned into a const char* in the function. Since specifically changing certain char values to 63 doesn't seem very beneficial, I'm thinking that it's either the result of a C function, or maybe an ill-advised (at least in this case) cast. Maybe if someone knows of a situation where that would be the result, and lets me know, I could probably find the offending code. But otherwise, it's way too specific for me to expect someone to be able to help in this case.

Use the Win32 API function MultiByteToWideChar to convert the stuff you read to "wide", which is UTF-16. I think the stream classes and/or FILE streams have a conversion mode, which would be exactly what you need.
wchar_t is a 16-bit UTF-16 code unit on Windows. Other platforms often make wchar_t 32 bits and have different conventions.
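For reference, the conversion mode that MSVC documents for fopen is spelled ccs (for example "r, ccs=UTF-8"), and MultiByteToWideChar takes CP_UTF8 as its code page for this conversion. To make clear what such a conversion does, here is a minimal portable sketch of UTF-8 to UTF-16. The function name utf8_to_utf16 is made up for illustration, and the sketch does not reject overlong forms or encoded surrogates the way a production decoder should.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: decodes UTF-8 bytes into UTF-16 code units.
   Returns the number of code units written, or (size_t)-1 if the input
   is malformed or the output buffer is too small. */
size_t utf8_to_utf16(const unsigned char *in, size_t inlen,
                     uint16_t *out, size_t outcap)
{
    size_t i = 0, n = 0;

    while (i < inlen) {
        uint32_t cp;
        size_t extra, k;
        unsigned char b = in[i++];

        /* The lead byte says how many continuation bytes follow. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;

        if (extra > inlen - i)
            return (size_t)-1;          /* truncated sequence */

        for (k = 0; k < extra; k++) {
            unsigned char c = in[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;      /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < 0x10000) {
            if (n + 1 > outcap)
                return (size_t)-1;
            out[n++] = (uint16_t)cp;
        } else {
            /* Outside the BMP: split into a surrogate pair. */
            if (cp > 0x10FFFF || n + 2 > outcap)
                return (size_t)-1;
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

A character such as 中 (bytes E4 B8 AD) becomes the single unit 0x4E2D, while anything above U+FFFF takes two 16-bit units, which is the reason a wchar_t on Windows is a code unit rather than a whole character.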

Related

Effect of Wide Characters/ Strings on a C Program

Below is an excerpt from an old edition of the book Programming Windows by Charles Petzold
There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide-character run-time library are larger than the usual functions.
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
Is there perhaps some condition that if a program is to be able to work with Long values, then the entire program mode it'll operate on is altered?
Usually if we declare a long int, we never fuss over or mention the fact that all ints will be occupying double the memory now. Are strings somehow a special case?
Why would every string in my program occupy twice the bytes, should not only the character arrays we've declared as storing wchar_t type do so?
As I understand it, what is meant is that if you have a program that uses char *, and you rewrite that program to use wchar_t *, then its strings will use twice the bytes (or more, on platforms where wchar_t is wider than 16 bits).
If a string could potentially contain a character outside of the ASCII range, you'll have to declare it as a wide string. So most strings in the program will be bigger. Personally, I wouldn't worry about it; if you need Unicode, you need Unicode, and a few more bytes aren't going to kill you.
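A small sketch makes the size difference concrete. The array names and accessor functions are only for illustration; the point is that only the wide literal grows, and by a factor of sizeof(wchar_t), which is 2 on Windows and commonly 4 elsewhere.

```c
#include <stddef.h>
#include <wchar.h>

/* The same five letters stored once as char and once as wchar_t. */
static const char    narrow_text[] = "hello";
static const wchar_t wide_text[]   = L"hello";

/* Size in bytes of each array, including the terminator. */
size_t narrow_bytes(void) { return sizeof narrow_text; }
size_t wide_bytes(void)   { return sizeof wide_text; }
```

narrow_bytes() is 6, while wide_bytes() is 6 * sizeof(wchar_t). Strings left as char arrays are unaffected, so "every string" only holds for a program converted wholesale to wide strings.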
That seems to be what you're saying, and I agree. But the question is skating the fine line between opinionated and objective.
Unicode has several encoding forms: UTF-8, UTF-16, and UTF-32. https://en.wikipedia.org/wiki/Unicode.
You can compare their advantages and disadvantages to decide which one fits your situation.
reference: UTF-8, UTF-16, and UTF-32

C - How to concatenate any combination char or int in array to binary string of type int or char

Brand new to the board. I really try to solve my programming conundrums without bogging down a public forum, and stackoverflow has helped with many a problem, but I'm at a loss. I hope you can help.
I am attempting to concatenate/combine single-digit characters in an array into a variable that takes the form of a binary string of format 10101010. There is no arithmetic involved and all characters are single-digit 1s and 0s. I have scoured the Web for a solution and I have tried everything from strcat(), atoi(), sprintf(), etc. I have even tried using a String array[] to construct my variable and then convert type using strtoul(). The code below reflects my latest attempt, which compiles successfully, but appears to produce blank characters. Understanding that I may be way off track with this solution, I am open to any solution that works.
The code:
#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdio.h>
#include <stdlib.h>
char digit[48] = "";
char buf[9];
snprintf(buf, sizeof buf, "%s%c", digit[16], digit[17], digit[18], digit[19], digit[20],
         digit[21], digit[22], digit[23]);
As you can see, the idea is to be able to take any digits from char digit[48] and combine them into the desired format. Truthfully, I'm not interested in adding the characters into a new array, but I got to the point of trying anything. My preference is to have the result accessible via a single variable to then use for bitwise operations and direct input into integrated components. Thank you for any assistance you can provide.
Take a look at Converting an int into a base 2 cstring/string
and note that itoa is not part of standard C, so it may be missing on Linux; MSVC spells it _itoa.
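If the goal is a single value for bitwise operations rather than another string, one option is to skip snprintf entirely and shift the bits in one at a time. This is a minimal sketch; pack_bits is a made-up name, and it assumes the array holds only '0' and '1' characters with the most significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Packs the eight characters digits[start] .. digits[start + 7]
   ('0' or '1', most significant bit first) into one byte. */
uint8_t pack_bits(const char *digits, size_t start)
{
    uint8_t value = 0;
    size_t i;

    for (i = 0; i < 8; i++) {
        value = (uint8_t)(value << 1);   /* make room for the next bit */
        if (digits[start + i] == '1')
            value |= 1u;
    }
    return value;
}
```

For the question's layout, pack_bits(digit, 16) would combine digit[16] through digit[23]; the characters 10101010 give 0xAA, which can be used directly in bitwise operations or written to a register.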

Understanding and writing wchar_t in C

I'm currently rewriting (a part of) the printf() function for a school project.
Overall, we were required to reproduce the behaviour of the function with several flags, conversions, length modifiers ...
The only thing I have left to do, and that gets me stuck, is the conversions %C / %S (or %lc / %ls).
So far, I've gathered that wchar_t is a type that can store characters on more than one byte, in order to accept more characters or symbols and therefore be compatible with pretty much every language, regardless of their alphabet and special characters.
However, I wasn't able to find any concrete information on what a wchar_t looks like for the machine, its actual length (which apparently varies based on several factors including the compiler, the OS ...) or how to actually write them.
Thank you in advance
Note that we are limited in the functions we are allowed to use. The only allowed functions are write(), malloc(), free(), and exit().
We must be able to code any other required function ourselves.
To sum this up, what I'm asking for here is some information on how to interpret and write "manually" any wchar_t character, with as little code as possible so that I can try to understand the whole process and code it myself.
A wchar_t is similar to a char in the sense that it is a number, but when displaying a char or wchar_t we don't want to see the number, but the drawn character corresponding to the number. The mapping from the number to the character isn't defined by either char or wchar_t; it depends on the system. So there is no difference in the end usage between char and wchar_t except for their sizes.
Given the above, the most trivial implementation of printf("%ls") is one where you know what the system encodings for char and wchar_t are. For example, on my system, char has 8 bits and uses UTF-8, while wchar_t has 32 bits and uses UTF-32. So the printf implementation just converts from UTF-32 to UTF-8 and outputs the result.
A more general implementation must support different and configurable encodings and may need to inspect what's the current encoding. In this case functions like wcsnrtombs() or iconv() must be used.
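For the trivial case described above, the conversion of one character can be sketched as below. It assumes that wchar_t values are UTF-32 code points and that the output should be UTF-8; utf8_encode is a hypothetical name, and nothing beyond plain C is used so it fits the restricted function list.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is not a
   valid code point (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A %lc handler could call this on its argument and pass the resulting bytes to write(); %ls would do the same for each element of the wchar_t string until it reaches the terminating zero.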

Diacritic characters in C char arrays or strings

Background
I'm working on an embedded project and I'm trying to handle non-standard characters and a custom font.
I have a raw bitmap font in a 600+ element array. Every 5 elements of this array contain one character. I have character 32 (space) in the first 5 elements, character 33 (!) in elements 6-10, etc.
I have to handle national diacritic characters ("ę" for example). I located them after character 122. Now I'm trying to remap characters, to get the proper character printed when I type print("Test ę"); in C source.
Problem
So I want to type like this in source:
print("Test diactric ę");
// warning: (228) illegal character (0xC4)
When I try this (I tried to see what code C will put for "ę"):
int a = 'ę';
// error: (226) char const too long
How can I work around this?
I'm using the XC8 compiler (gcc based?).
I found in compiler manual, that it uses 7-bit character encoding, but maybe there is some way? My source file is encoded in UTF-8.
EDIT
Looks like wchar.h suggested by Emilien could work for me, but unfortunately there is no wchar.h for my compiler.
Maybe some preprocessor trick? I really want to avoid hardcore text preparation like this:
print("abcde");
print_diactric(123); // 123 code used for ę
print("fgh");
// to get the "word" "abcdeęfgh"
You need to think about the difference between the source encoding (what it sounds like, the character encoding used by your C source files on the system where the compiler runs) and the target encoding, which is the encoding that the compiler assumes for the system where the code will be running.
If your compiler's target encoding is "7-bit", then there's no standard way to express a character like ę, it's simply not part of the target charset. You're going to have to work around that, perhaps by implementing the encoding by yourself from some other format.
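One way to "implement the encoding yourself" is to keep the source file pure ASCII, spell the UTF-8 bytes of ę with hex escapes, and have the print routine translate that byte pair into the font code. The sketch below is only an illustration: next_font_code and glyph_offset are made-up names, code 123 for ę and the 5-bytes-per-glyph layout starting at character 32 come from the question, and it assumes the compiler accepts hex escapes above 0x7F in string literals.

```c
#include <stddef.h>

#define FONT_FIRST_CODE  32   /* first glyph in the font array is space */
#define GLYPH_WIDTH       5   /* array elements per glyph */
#define CODE_E_OGONEK   123   /* font slot chosen for 'ę' in the question */

/* Reads one character from *text, advances the pointer past it, and
   returns the font code to draw. The UTF-8 pair C4 99 maps to the slot
   for 'ę'; any other non-ASCII byte is replaced by '?'.
   Must not be called when **text is the terminating '\0'. */
unsigned char next_font_code(const char **text)
{
    const unsigned char *p = (const unsigned char *)*text;
    unsigned char code;

    if (p[0] == 0xC4 && p[1] == 0x99) {
        code = CODE_E_OGONEK;
        p += 2;
    } else if (p[0] < 0x80) {
        code = p[0];
        p += 1;
    } else {
        code = '?';
        p += 1;
    }
    *text = (const char *)p;
    return code;
}

/* Index of the first element of a glyph inside the font array. */
size_t glyph_offset(unsigned char code)
{
    return (size_t)(code - FONT_FIRST_CODE) * GLYPH_WIDTH;
}
```

A call would then look like print("Test \xC4\x99"). When a letter that is also a hex digit follows the escape, the literal has to be split, as in "\xC4\x99" "f", because otherwise the compiler reads \x99f as one escape.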
As unwind explained, you'll need more than 7 bits in order to encode these characters; maybe you can use the wide character type?
#include <wchar.h>
#include <stdio.h>
int main(){
printf("%s\n", "漢語");
printf("%s\n", "ę");
}
output:
~$ gcc wcharexample.c -o wcharexample && ./wcharexample
漢語
ę

char vs wchar_t

I'm trying to print out a wchar_t* string.
Code goes below:
#include <stdio.h>
#include <string.h>
#include <wchar.h>
char *ascii_ = "中日友好"; //line-1
wchar_t *wchar_ = L"中日友好"; //line-2
int main()
{
printf("ascii_: %s\n", ascii_); //line-3
wprintf(L"wchar_: %s\n", wchar_); //line-4
return 0;
}
//Output
ascii_: 中日友好
Question:
Apparently I should not assign CJK characters to char* pointer in line-1, but I just did it, and the output of line-3 is correct, So why? How could printf() in line-3 give me the non-ascii characters? Does it know the encoding somehow?
I assume the code in line-2 and line-4 are correct, but why I didn't get any output of line-4?
First of all, it's usually not a good idea to use non-ASCII characters in source code. What's probably happening is that the Chinese characters are being encoded as UTF-8, which is ASCII-compatible.
Now, as for why the wprintf() isn't working: this has to do with stream orientation. Each stream is either byte-oriented or wide-oriented. The orientation is set by the first I/O operation on the stream (byte-oriented here, because of the printf) and cannot simply be switched afterwards. After that, the wprintf will not work due to the incorrect orientation.
In other words, once you use printf() you need to keep on using printf(). Similarly, if you start with wprintf(), you need to keep using wprintf().
You cannot intermix printf() and wprintf(). (except on Windows)
EDIT:
To answer the question about why the wprintf line doesn't work even by itself: it's probably because the code is being compiled so that the UTF-8 form of 中日友好 is stored into wchar_. However, wchar_t needs a 4-byte Unicode encoding (2 bytes on Windows).
So there's two options that I can think of:
Don't bother with wchar_t, and just stick with multi-byte chars. This is the easy way, but may break if the user's system is not set to the Chinese locale.
Use wchar_t, but you will need to encode the Chinese characters using unicode escape sequences. This will obviously make it unreadable in the source code, but it will work on any machine that can print Chinese character fonts regardless of the locale.
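The second option can be sketched like this, using universal character names (\uXXXX, available since C99) so the source file contains only ASCII. The identifiers friendship and friendship_text are made up for the example.

```c
#include <wchar.h>

/* The four characters from the question, written as universal
   character names: U+4E2D, U+65E5, U+53CB, U+597D. */
static const wchar_t friendship[] = L"\u4e2d\u65e5\u53cb\u597d";

/* Returns the wide string; each character occupies one wchar_t. */
const wchar_t *friendship_text(void)
{
    return friendship;
}
```

To actually print such a string with standard wprintf, the conversion for a wchar_t* argument is %ls rather than %s, and the program usually needs to call setlocale(LC_ALL, "") first, since the default "C" locale generally cannot convert these characters for output.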
Line 1 is not ascii, it's whatever multibyte encoding is used by your compiler at compile-time. On modern systems that's probably UTF-8. printf does not know the encoding. It's just sending bytes to stdout, and as long as the encodings match, everything is fine.
One problem you should be aware of is that lines 3 and 4 together invoke undefined behavior. You cannot mix character-based and wide-character io on the same FILE (stdout). After the first operation, the FILE has an "orientation" (either byte or wide), and after that any attempt to perform operations of the opposite orientation results in UB.
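The orientation can be observed with the standard fwide function, which reports it without changing it when called with mode 0. The sketch below uses a temporary file rather than stdout so it does not disturb the program's own output; orientation_after_first_write is a hypothetical helper name.

```c
#include <stdio.h>
#include <wchar.h>

/* Performs one write on a fresh temporary stream and returns what
   fwide(f, 0) reports afterwards: a negative value for byte-oriented,
   a positive value for wide-oriented. Returns 0 if the temporary file
   cannot be created. */
int orientation_after_first_write(int use_wide)
{
    FILE *f = tmpfile();
    int orientation;

    if (f == NULL)
        return 0;

    if (use_wide)
        fwprintf(f, L"wide\n");    /* first operation is wide */
    else
        fprintf(f, "narrow\n");    /* first operation is byte-based */

    orientation = fwide(f, 0);     /* mode 0: query only */
    fclose(f);
    return orientation;
}
```

The same query on stdout after the printf in line-3 would report a byte-oriented stream, which is the state the later wprintf runs into.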
You are omitting one step and therefore think the wrong way.
You have a C file on disk, containing bytes. You have a "ASCII" string and a wide string.
The ASCII string takes the bytes exactly like they are in line 1 and outputs them.
This works as long as the encoding of the user's side is the same as the one on the programmer's side.
The wide string is first decoded from the given bytes into Unicode code points, which are stored in the program; maybe this goes wrong on your side. On output, they are encoded again according to the encoding on the user's side. This ensures that these characters are emitted as they are intended, not as they were entered.
Either your compiler assumes the wrong encoding, or your output terminal is set up the wrong way.

Resources