wchar_t to octets - in C? - c

I'm trying to store a wchar_t string as octets, but I'm fairly sure I'm doing it wrong. Would anybody mind validating my attempt? What happens when one character takes 4 bytes?
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for (i=0;i< wcslen(wchar1);i++) {
printf("(%d)", (wchar1[i]) & 255);
printf("(%d)", (wchar1[i] >> 8) & 255);
}

Unicode text is always encoded. Popular encodings are UTF-8, UTF-16 and UTF-32. Only the last one uses a fixed size per code point. UTF-16 uses surrogate pairs for code points in the upper planes, so on a platform with a 16-bit wchar_t (such as Windows) such a code point takes 2 wchar_t. UTF-8 is byte oriented and uses between 1 and 4 bytes to encode a code point.
UTF-8 is an excellent choice if you need to transcode the text to a byte-oriented stream; it is a very common choice for text files and for HTML on the Internet. If you use Windows, you can use WideCharToMultiByte() with CodePage = CP_UTF8. A good alternative is the ICU library.
Be careful to avoid conversions that translate text to a code page, such as wcstombs(). They are lossy: characters that have no corresponding code in the code page are replaced by ? or make the conversion fail.
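To make the surrogate mechanism concrete, here is a minimal sketch that combines a UTF-16 surrogate pair into a single code point. The name combine_surrogates is hypothetical (it is not part of any library), and the sketch assumes its inputs are 16-bit UTF-16 code units such as the wchar_t values you get on Windows.

```c
#include <stdint.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into a code point.
 * Returns 0xFFFFFFFF if the inputs are not a valid high/low pair. */
uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0xFFFFFFFFu;
    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

A loop over a 16-bit wchar_t string would call this whenever it meets a value in the range 0xD800 to 0xDBFF and then skip the following unit.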

You can use the wcstombs() (wide-character string to multibyte string) function provided in stdlib.h.
The prototype is as follows:
#include <stdlib.h>
size_t wcstombs(char *dest, const wchar_t *src, size_t n);
It converts the wchar_t string given by src into a char (a.k.a. octets) string in the encoding of the current locale and writes at most n bytes to dest.
wchar_t wide_string[] = L"Hellöw, Wörld! :)"; /* must be a wide literal, not a char array */
char mb_string[512]; /* Might want to calculate a better, more realistic size! */
size_t i, length;
setlocale(LC_ALL, ""); /* needs <locale.h>; the output encoding follows the locale */
memset(mb_string, 0, 512); /* needs <string.h> */
length = wcstombs(mb_string, wide_string, 511);
/* mb_string will be zero terminated if it wasn't cancelled by reaching the limit
 * before being finished with converting. If the limit WAS reached, the string
 * will not be zero terminated and you must do it yourself - not happening here */
if (length != (size_t)-1) /* (size_t)-1 means a character could not be converted */
    for (i = 0; i < length; i++)
        printf("Octet #%zu: '%02x'\n", i, (unsigned char)mb_string[i]);
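Regarding the "more realistic size" remark: one way to avoid the fixed 512-byte buffer is to ask the conversion function for the required length first. The sketch below uses wcsrtombs(), the restartable sibling of wcstombs(), because standard C defines its behaviour when the destination is a null pointer. The name narrow is hypothetical.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wide string to a freshly allocated
 * multibyte string in the current locale's encoding.
 * Returns NULL on allocation failure or if a character cannot be
 * converted. The caller frees the result. */
char *narrow(const wchar_t *src)
{
    mbstate_t state;
    const wchar_t *p = src;
    size_t need;
    char *dst;

    memset(&state, 0, sizeof state);
    need = wcsrtombs(NULL, &p, 0, &state);  /* bytes needed, excluding '\0' */
    if (need == (size_t)-1)
        return NULL;

    dst = malloc(need + 1);
    if (dst == NULL)
        return NULL;

    memset(&state, 0, sizeof state);
    p = src;
    if (wcsrtombs(dst, &p, need + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;  /* terminated, because room for '\0' was included */
}
```

As with wcstombs(), call setlocale(LC_ALL, "") (or select a specific locale) beforehand, otherwise only the characters of the default "C" locale are guaranteed to convert.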

If you're trying to see the content of the memory buffer holding the string, you can do this:
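size_t i, len = wcslen(str) * sizeof(wchar_t);
const unsigned char *ptr = (const unsigned char *)str; /* unsigned avoids sign extension */
for (i = 0; i < len; i++) {
    printf("(%u)", (unsigned)ptr[i]); /* byte order follows the host's endianness */
}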
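Note that a raw memory dump depends on both sizeof(wchar_t) and the byte order of the machine. If the octets are meant to be stored or sent somewhere, a fixed layout is safer. The following is only a sketch under the assumption that 4 octets per character, most significant first, is an acceptable format; wchar_to_octets_be is a hypothetical name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: write each wide character as 4 octets, most
 * significant first, independent of the host byte order and of
 * sizeof(wchar_t). Returns the number of octets written, or 0 if
 * `out` (capacity `cap`) is too small or the string is empty. */
size_t wchar_to_octets_be(const wchar_t *s, unsigned char *out, size_t cap)
{
    size_t n = wcslen(s), i;

    if (cap / 4 < n)
        return 0;
    for (i = 0; i < n; i++) {
        unsigned long v = (unsigned long)s[i];
        out[4 * i + 0] = (unsigned char)((v >> 24) & 0xFF);
        out[4 * i + 1] = (unsigned char)((v >> 16) & 0xFF);
        out[4 * i + 2] = (unsigned char)((v >> 8) & 0xFF);
        out[4 * i + 3] = (unsigned char)(v & 0xFF);
    }
    return 4 * n;
}
```

This also answers the "4 bytes" part of the question: shifting by 0 and 8 only, as in the original loop, silently drops the upper two octets.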

I don't know why printf and wprintf do not work together, but the following code works.
unsigned int i;
const wchar_t *wchar1 = L"abc";
wprintf(L"%ls\r\n", wchar1);
for(i=0; i<wcslen(wchar1); i++)
{
wprintf(L"(%d)", (wchar1[i]) & 255);
wprintf(L"(%d)", (wchar1[i] >> 8) & 255);
}
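A likely explanation is stream orientation: in standard C a stream becomes byte-oriented or wide-oriented on its first I/O operation, and byte functions (printf) and wide functions (wprintf) should not be mixed on the same stream afterwards. In the question, wprintf runs first, so stdout becomes wide-oriented and the later printf calls are not expected to work. The standard function fwide() lets you inspect this; stream_orientation below is a hypothetical wrapper around it.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: report the orientation of a stream without
 * changing it. Returns 1 for wide-oriented, -1 for byte-oriented,
 * 0 if the orientation has not been set yet. */
int stream_orientation(FILE *fp)
{
    int r = fwide(fp, 0);  /* mode 0 only queries, it never sets */
    return (r > 0) - (r < 0);
}
```

Calling stream_orientation(stdout) before and after the first wprintf shows the switch from 0 to 1.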

Related

In C, how to print UTF-8 char if given its bytes in char variables?

If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?
Similarly for the 3 and 4 byte UTF-8 characters?
I've been trying all kinds of approaches with mbstowcs() but I just can't get it to work.
I managed to write a working example.
When c1 is '\xce' and c2 is '\xb8', the result is θ.
It turns out that I have to call setlocale before using mbstowcs.
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>
int main(void)
{
    /* mbstowcs depends on LC_CTYPE, so select a UTF-8 locale first */
    char *localeInfo = setlocale(LC_ALL, "en_US.utf8");
    printf("Locale information set to %s\n", localeInfo ? localeInfo : "(failed)");
    const char c1 = '\xce';
    const char c2 = '\xb8';
    int byteCount = 2;
    char *mbS = malloc(byteCount + 1);
    mbS[0] = c1;
    mbS[1] = c2;
    mbS[byteCount] = 0; /* null terminator */
    printf("Directly using printf: %s\n", mbS);
    size_t requiredSize = mbstowcs(NULL, mbS, 0); /* (size_t)-1 on invalid input */
    if (requiredSize == (size_t)-1) {
        printf("Failed conversion!\n");
        free(mbS);
        return 1;
    }
    printf("Output size including null terminator is %zu\n\n", requiredSize + 1);
    wchar_t *wideOutput = malloc((requiredSize + 1) * sizeof(wchar_t));
    size_t len = mbstowcs(wideOutput, mbS, requiredSize + 1);
    printf("Converted %zu character(s). Result: %ls\n", len, wideOutput);
    free(wideOutput);
    free(mbS);
    return 0;
}
Output:
Locale information set to en_US.utf8
Directly using printf: θ
Output size including null terminator is 2
Converted 1 character(s). Result: θ
For 3 or 4 byte utf8 characters, one can use a similar approach.
If I have c1, c2 as char variables (such that c1c2 would be the byte sequences for the UTF-8 character), how do I create and print the UTF-8 character?
They already are a UTF-8 character. You just print them.
putchar(c1);
putchar(c2);
It's up to your terminal or whatever device you are using to display the output to properly understand and render the UTF-8 encoding. This is unrelated to encoding used by your program and unrelated to wide characters.
Similarly for the 3 and 4 byte UTF-8 characters?
You would output them.
If your terminal or the device you are sending the bytes to does not understand UTF-8 encoding, then you have to convert the bytes to something the device understands. Typically, you would use an external library for that, like iconv. Alternatively, you could call setlocale(LC_ALL, "C.utf-8"), convert your bytes to wchar_t, then call setlocale(LC_ALL, "C.your_target_encoding") and either convert the wide characters to that encoding or output them with %ls. All %ls does (on common systems) is convert the wide string back to multibyte and then output it. A wide stream writing to a terminal does the same: it first converts, then outputs.
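For the 3 and 4 byte case, the only thing your program needs to know is how many bytes belong together, and the first byte tells you that. A minimal sketch, with utf8_seq_len as a hypothetical name:

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
 * with `lead`, or 0 if `lead` cannot start a sequence. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                           /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                           /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF)
        return 3;                           /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                           /* 11110xxx */
    return 0;  /* continuation byte, overlong lead (0xC0, 0xC1) or > 0xF4 */
}
```

With the bytes collected in an unsigned char array, fwrite(bytes, 1, utf8_seq_len(bytes[0]), stdout) writes one complete character. This checks only the lead byte; it does not validate the continuation bytes.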

Adding leading 'zeros' to a number without library in C

I wanted to store 'two zeros' in a value, e.g. answer (which is a BYTE), and send it over RS-485. How can I do that without any library?
Secondly, I tried adding the char '0' to zero (0), but that converted it to the numeric equivalent of '0', which is 48, and sent that over RS-485.
Thanks,
snprintf(buffer, sizeof buffer, "%03d", the_byte_number);
will do what you want. You can then send the buffer contents over RS-485 using something like:
write(rs_485_file_descriptor, buffer, strlen(buffer));
If you refuse to use even the C standard library, you'll have to do the conversion yourself:
char *zero_pad(uint8_t numb, char *buffer) /* uint8_t needs <stdint.h> */
{
    static const char tab[] = "0123456789";
    /* you need to have at least three characters for the three digits in
     * buffer, no check is made here for efficiency reasons.
     * we advance the buffer 3 positions first, and then go backwards,
     * filling with characters, from the least significant digit to the most.
     * in case you want the string to be null terminated, you must start
     * with the '\0' char (and the buffer needs a fourth character).
     */
#if YOU_WANT_IT_NULL_TERMINATED
    buffer += 4;
    *--buffer = '\0';
#else
    buffer += 3;
#endif
    for (int i = 0; i < 3; i++) {
        int dig = numb % 10;
        *--buffer = tab[dig];
        numb /= 10; /* drop the digit just written */
    }
    return buffer;
} /* zero_pad */
What you describe as 'two zeros' are two nibbles in the hexadecimal representation of the BYTE 0. Of course, if you add the ASCII character '0' to zero, you get 48 (decimal). If you just want to send the BYTE answer, no conversion is needed.
wr = write(fd, &answer, sizeof answer);
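If, on the other hand, the receiver really expects the two nibbles as two ASCII characters (so '0' '0', i.e. the bytes 48 48, for the value 0), the conversion can be done without any library call. A minimal sketch; byte_to_hex2 is a hypothetical name and the uppercase digits are an arbitrary choice:

```c
#include <stdint.h>

/* Hypothetical helper: write the two hexadecimal digits of `value` into
 * out[0] and out[1] as ASCII characters. No terminator is added. */
void byte_to_hex2(uint8_t value, char out[2])
{
    static const char digits[] = "0123456789ABCDEF";

    out[0] = digits[value >> 4];    /* high nibble */
    out[1] = digits[value & 0x0F];  /* low nibble */
}
```

The two resulting characters can then be sent with the same write() call as above, using a length of 2.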

Convert char to wchar_t using standard library?

I have a function that expects a wchar_t array as a parameter. I don't know of a standard library function that converts from char to wchar_t, so I wrote a quick and dirty function, but I want a reliable solution free from bugs and undefined behavior. Does the standard library have a function that makes this conversion?
My code:
wchar_t *ctow(const char *buf, wchar_t *output)
{
const char ANSI_arr[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
const wchar_t WIDE_arr[] = L"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!##$%^&*()-_=+[]{}\\|;:'\",<.>/? \t\n\r\f";
size_t n = 0, len = strlen(ANSI_arr);
while (*buf) {
for (size_t x = 0; x < len; x++) {
if (*buf == ANSI_arr[x]) {
output[n++] = WIDE_arr[x];
break;
}
}
buf++;
}
output[n] = L'\0';
return output;
}
Well, conversion functions are declared in stdlib.h (*). But you must know that for any character in the latin1 (aka ISO-8859-1) charset, the conversion to a wide character is a mere assignment, because the Unicode code points below 256 are the latin1 characters.
So if your initial charset is ISO-8859-1, the conversion is simply:
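wchar_t *ctow(const char *buf, wchar_t *output) {
    wchar_t *cr = output; /* remember the start of the output */
    while (*buf) {
        /* the cast avoids sign extension for bytes above 0x7F */
        *output++ = (unsigned char)*buf++;
    }
    *output = 0;
    return cr;
}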
provided the caller passed a pointer to an array big enough to store all the converted characters.
If you are using any other charset, you will have to use a well-known library like ICU, or build a converter by hand, which is simple for single-byte charsets (the ISO-8859-x series) and trickier for multibyte ones like UTF-8.
But without knowing the charsets you want to be able to process, I cannot say more...
BTW, plain ascii is a subset of ISO-8859-1 charset.
(*) From cplusplus.com
int mbtowc (wchar_t* pwc, const char* pmb, size_t max);
Convert multibyte sequence to wide character
The multibyte character pointed by pmb is converted to a value of type wchar_t and stored at the location pointed by pwc. The function returns the length in bytes of the multibyte character.
mbtowc has its own internal shift state, which is altered as necessary only by calls to this function. A call to the function with a null pointer as pmb resets the state (and returns whether multibyte characters are state-dependent).
The behavior of this function depends on the LC_CTYPE category of the selected C locale.
It does, in the header wchar.h. It is called btowc:
The btowc function returns WEOF if c has the value EOF or if (unsigned char)c
does not constitute a valid single-byte character in the initial shift state. Otherwise, it
returns the wide character representation of that character.
That isn't a conversion from wchar_t to char. It's a function for destroying data outside of ISO-646. No method in the C library will make that conversion for you. You can look at the ICU4C library. If you are only on Windows, you can look at the relevant functions in the Win32 API (WideCharToMultiByte, etc).
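For completeness, here is what a drop-in replacement for the question's ctow() could look like if you accept the locale-dependent mbstowcs() from stdlib.h. This is a sketch only: ctow_std and its cap parameter are hypothetical additions, and the input is interpreted in the encoding of the current locale, so it is not a general charset converter.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical replacement for the question's ctow(): converts `buf`
 * (in the current locale's encoding) into `output`, which has room for
 * `cap` wide characters including the terminator.
 * Returns `output`, or NULL if cap is 0, the input is invalid, or the
 * result does not fit. */
wchar_t *ctow_std(const char *buf, wchar_t *output, size_t cap)
{
    size_t n;

    if (cap == 0)
        return NULL;
    n = mbstowcs(output, buf, cap);
    if (n == (size_t)-1 || n == cap)  /* invalid sequence, or no room for L'\0' */
        return NULL;
    return output;
}
```

In the default "C" locale this handles plain ASCII; for UTF-8 or other input, select the matching locale with setlocale() first.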

Converting Unicode codepoints to UTF-8 in C using iconv

I want to convert a 32-bit value, which represents a Unicode codepoint, into a sequence of chars which is the utf-8 encoded string containing only the character corresponding to the codepoint.
For example, I want to turn the value 955 into the utf-8 encoded string "λ".
I tried to do this using iconv, but I could not get the desired result. Here is the code that I wrote:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
int main(void)
{
uint32_t codepoint = U'λ';
char *input = (char *) &codepoint;
size_t in_size = 2; // lower-case lambda is a 16-bit character (0x3BB = 955)
char output_buffer[10];
char *output = output_buffer;
size_t out_size = 10;
iconv_t cd = iconv_open("UTF-8", "UTF-32");
iconv(cd, &input, &in_size, &output, &out_size);
puts(output_buffer);
return 0;
}
When I run it, only a newline is printed (puts automatically prints a newline; the first byte of output_buffer is '\0').
What is wrong with my understanding or my implementation?
As said by minitech, you must use size = 4 for UTF-32 in a uint32_t, and you must preset the buffer to zeros to have the terminating null after conversion.
This code works on Ubuntu:
#include <stdio.h>
#include <iconv.h>
#include <stdint.h>
#include <string.h> /* memset */
int main(void)
{
uint32_t codepoint = 955;
char *input = (char *) &codepoint;
size_t in_size = 4; // one UTF-32 code unit is 4 bytes (0x000003BB = 955)
char output_buffer[10];
memset(output_buffer, 0, sizeof(output_buffer));
char *output = output_buffer;
size_t out_size = 10;
iconv_t cd = iconv_open("UTF-8", "UTF-32");
iconv(cd, &input, &in_size, &output, &out_size);
puts(output_buffer);
return 0;
}
Two problems:
Since you’re using UTF-32, you need to specify 4 bytes. The “lower-case lambda is a 16-bit character (0x3BB = 955)” comment isn’t true for a 4-byte fixed-width encoding; it’s 0x000003bb. Set size_t in_size = 4;.
iconv doesn’t add null terminators for you; it adjusts the pointers it’s given. You’ll want to add your own before calling puts.
*output = '\0';
puts(output_buffer);
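For a single code point the conversion can also be done by hand, without iconv. The following is a minimal sketch of the standard UTF-8 bit layout; utf8_encode is a hypothetical name, not an iconv or libc function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into `out`
 * (room for 4 bytes). Returns the number of bytes written, or 0 for
 * surrogates and values above U+10FFFF. No terminator is added. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;  /* surrogates are not valid code points in UTF-8 */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 955 (U+03BB) this yields the two bytes 0xCE 0xBB. As with iconv, you add the '\0' yourself at the returned offset before passing the buffer to puts.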

Make a long string of encrypted substrings in c

I'm trying to create a long string that is built out of encrypted substrings. For the encryption I'm using AES128 and libmcrypt. The code is working, but I get a shorter output than I should, plus a beeping sound. I guess it's because I'm using strlen, but I have no idea how to avoid that. I would be very grateful for some suggestions. Here is my code:
char *Encrypt( char *key, char *message){
static char *Res;
MCRYPT mfd;
char *IV;
int i, blocks, key_size = 16, block_size = 16;
blocks = (int) (strlen(message) / block_size) + 1;
Res = calloc(1, (blocks * block_size));
mfd = mcrypt_module_open(MCRYPT_RIJNDAEL_128, NULL, "ecb", NULL);
mcrypt_generic_init(mfd, key, key_size, IV);
strncpy(Res, message, strlen(message));
mcrypt_generic(mfd, Res, block_size);
//printf("the encrypted %s\n", Res);
mcrypt_generic_deinit(mfd);
mcrypt_module_close(mfd);
return (Res);
}
char *mkline ( int cols) {
int j;
char seed[] = "thesecretmessage", key1[]="dontusethisinput", key2[]="abadinputforthis";
char *encrypted, *encrypted2, *in = malloc(cols * 16);
encrypted = Encrypt(key1, seed);
sprintf(in, "%s", encrypted);
encrypted2= Encrypt(key2, encrypted);
printf("encrypted2 before for-loop %s\n", encrypted2);
printf("encrypted2 before for loop len %d\n", strlen(encrypted2));
for (j=1; j<cols; j++) {
strcat(in, encrypted2);
memmove(encrypted2, Encrypt(key2, encrypted2),strlen(seed));
printf("encrypted2 %s on position %d\n" , encrypted2,j);
printf("encrypted2 len %d\n", strlen(encrypted2));
}
free(encrypted);
free(encrypted2);
return in;
}
int main(int argc, char *argv[]) {
char *line = mkline(15);
printf("line %s\n", line);
printf("line lenght %d\n", strlen(line));
return 0;
}
You get the beep sound because you are printing control characters.
Also, strlen returns the length up to the first '\0' character (because strings are zero terminated). That's why you get a length smaller than you expect: the encrypted message may contain zeroes.
You can do something like this to return the result length:
char *Encrypt(const char *key, const char *message, int *result_len)
{
/* ... same body as before ... */
*result_len = blocks * block_size; /* report the byte count to the caller */
/* ... */
}
Also
memmove(encrypted2, Encrypt(key2, encrypted2),strlen(seed));
This line leaks memory, since every call to Encrypt calls calloc (allocating new memory), which you need to free once you are done.
You should probably use memcpy; memmove is primarily for cases where destination and source may overlap.
The encrypted string you are trying to print contains a stream of bytes where the value of the individual byte ranges from 0 to 255. Because you are using a cryptographically secure algorithm, the distribution of values is very close to even.
Since you are trying to print the encrypted string through a console, the console interprets some of the bytes as control characters (see Bell character) that are unprintable but have other effects instead, such as playing beeps.
Furthermore, strlen isn't doing what you think it should be doing, because the encrypted data is not a null-terminated string: it can contain zero bytes anywhere, and they have no special meaning there, unlike in null-terminated strings. You need to store the length of the data elsewhere.
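One way to "store the length elsewhere" is to keep the data and its length together and to append with memcpy instead of strcat. This is only a sketch; struct byte_buf and byte_buf_append are hypothetical names, and overflow of the capacity calculation is not handled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer for binary data: the length is stored
 * explicitly, so embedded zero bytes are preserved. */
struct byte_buf {
    unsigned char *data;
    size_t len;
    size_t cap;
};

/* Append n bytes from src. Returns 0 on success, -1 on allocation
 * failure (the buffer is left unchanged in that case). */
int byte_buf_append(struct byte_buf *b, const void *src, size_t n)
{
    if (n == 0)
        return 0;
    if (n > b->cap - b->len) {
        size_t new_cap = b->cap ? b->cap : 16;
        unsigned char *p;

        while (new_cap - b->len < n)
            new_cap *= 2;
        p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

Starting from a zero-initialized struct byte_buf, each 16-byte block returned by Encrypt would be appended with byte_buf_append(&buf, block, 16), and buf.len replaces every strlen call.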
Simple: you are treating binary output (any byte value) directly as printable text. Any character with a code point below 32 (hex 20) isn't printable. E.g. the ASCII value for BELL (7) could be meaningful to you. Print the resulting bytes in hexadecimal and you should be OK.
I should like to add that, in general, it is good practice to clear any memory that held the plaintext/unencrypted message after you encrypt it, if you can. This is less a matter of coding style than of good cryptographic practice.
This can be done by:
memset(buffer, 0, length_of_buffer);
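Be aware that an optimizing compiler may remove this memset as a dead store if the buffer is not read again afterwards (for example right before free). Where available, use a function meant for wiping, such as memset_s (C11 Annex K), explicit_bzero (glibc, BSD) or SecureZeroMemory (Windows).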
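If you would rather not rely on the compiler leaving the memset in place, a commonly used workaround is to write the zeros through a volatile pointer. A minimal sketch, with secure_wipe as a hypothetical name; it is not a substitute for a platform-provided wiping function where one exists.

```c
#include <stddef.h>

/* Hypothetical helper: overwrite a buffer with zeros through a volatile
 * pointer, so the stores count as observable and are not dropped as
 * dead stores. */
void secure_wipe(void *buf, size_t len)
{
    volatile unsigned char *p = buf;

    while (len--)
        *p++ = 0;
}
```

In the code above you would call secure_wipe(buffer, length_of_buffer) on the plaintext buffer once the ciphertext has been produced.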

Resources