C String Literal "too big for character" - c

With MSVC 2010 I am trying to compile this in C or C++ mode (it needs to compile in both), and it does not work. Why? I thought, and found in the documentation, that '\x' takes the next two characters as hex digits and no more (four characters when using '\X').
I also learned that there is no portable way to use character codes outside ASCII in C source code anyway, so how can I specify some German ISO-8859-1 characters?
int main() {
char* x = "\xBCd"; // Why is this not char(188) + 'd'
}
// returns test.c(2) : error C2022: '3021' : too big for character
// and a warning with GCC

Unfortunately you've stumbled upon the fact that \x will read every last character that appears to be a hex digit. Instead, you'll need to break the literal up:
const char *x = "\xBC" "d"; /* const added to satisfy literal assignment probs */
Consider the output from this program:
/* wide.c */
#include <stdio.h>
int main(int argc, char **argv)
{
const char *x = "\x000000000000021";
return printf("%s\n", x);
}
Compiled and executed:
C:\temp>cl /nologo wide.c
wide.c
C:\temp>wide
!
Tested on Microsoft's C++ compiler shipped with VS 2k12, 2k10, 2k8, and 2k5
Tested on gcc 4.3.4.

Related

Is there a way to print Runes as individual characters?

Program's Purpose: Rune Cipher
Note - I am linking to my Own GitHub page below
(It is only there to show the purpose of the program, which is what I needed help with. I got that help, so thanks once again to all of you!)
Final Edit:
I have now (thanks to the Extremely Useful answers provided by the Extremely Amazing People) Completed the project I've been working on; and - for future readers I am also providing the full code.
Again, This wouldn't have been possible without all the help I got from the guys below, thanks to them - once again!
Original code on GitHub
Code
(Shortened down a bit)
#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#define UNICODE_BLOCK_START 0x16A0
#define UUICODE_BLOCK_END 0x16F1
int main(){
setlocale(LC_ALL, "");
wchar_t SUBALPHA[]=L"ᛠᚣᚫᛞᛟᛝᛚᛗᛖᛒᛏᛋᛉᛈᛇᛂᛁᚾᚻᚹᚷᚳᚱᚩᚦᚢ";
wchar_t DATA[]=L"hello";
int lenofData=0;
int i=0;
while(DATA[i]!='\0'){
lenofData++; i++;
}
for(int i=0; i<lenofData; i++) {
printf("DATA[%d]=%lc",i,DATA[i]);
DATA[i]=SUBALPHA[i];
printf(" is now Replaced by %lc\n",DATA[i]);
} printf("%ls",DATA);
return 0;
}
Output:
DATA[0]=h is now Replaced by ᛠ
...
DATA[4]=o is now Replaced by ᛟ
ᛠᚣᚫᛞᛟ
Question continues below
(Note that it's solved, see Accepted answer!)
In Python3 it is easy to print runes:
for i in range(5794,5855):
    print(chr(i))
outputs
ᚢ
ᚣ
(..)
ᛝ
ᛞ
How to do that in C ?
using variables (char, char arrays[], int, ...)
Is there a way to e.g print ᛘᛙᛚᛛᛜᛝᛞ as individual characters?
When I try it, it just prints out warnings about a multi-character character constant 'ᛟ'.
I have tried having them as an array of char, a "string" (e.g char s1 = "ᛟᛒᛓ";)
And then print out the first (ᛟ) char of s1: printf("%c", s1[0]); Now, this might seem very wrong to others.
One Example of how I thought of going with this:
Print a rune as "a individual character":
To print e.g 'A'
printf("%c", 65); // 'A'
How do I do that, (if possible) but with a Rune ?
I have also tried printing its numeric value as a char, which results in question marks and other "undefined" results.
As I do not really remember exactly all the things I've tried so far, I will try my best to formulate this post.
If someone spots a very easy (maybe, to him/her, even plain obvious) solution (or trick/workaround),
I would be super happy if you could point it out! Thanks!
This has bugged me for quite some time.
It works in Python though, and it works (as far as I know) in C if you just "print" it directly (not through any variable), e.g. printf("ᛟ");. That works, but as I said I want to do the same thing through variables (like char runes[]="ᛋᛟ";) and then: printf("%c", runes[0]); // to get 'ᛋ' as the output
(Or similar, it does not need to be %c, as well as it does not need to be a char array/char variable) I am just trying to understand how to - do the above, (hopefully not too unreadable)
I am on Linux, and using GCC.
External Links
Python3 Cyphers - At GitHub
Runes - At Unix&Linux SE
Junicode - At Sourceforge.io
To hold a character outside of the 8-bit range, you need a wchar_t (which isn't necessarily Unicode). Although wchar_t is a fundamental C type, you need to #include <wchar.h> to use it, and to use the wide character versions of string and I/O functions (such as putwc shown below).
You also need to ensure that you have activated a locale which supports wide characters, which should be the same locale as is being used by your terminal emulator (if you are writing to a terminal). Normally, that will be the default locale, selected with the string "".
Here's a simple equivalent to your Python code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
int main(void) {
setlocale(LC_ALL, "");
/* As indicated in a comment, I should have checked the
* return value from `putwc`; if it returns EOF and errno
* is set to EILSEQ, then the current locale can't handle
* runic characters.
*/
for (wchar_t wc = 5794; wc < 5855; ++wc)
putwc(wc, stdout);
putwc(L'\n', stdout);
return 0;
}
(Live on ideone.)
Stored on the stack as a string of (wide) characters
If you want to add your runes (wchar_t) to a string then you can proceed the following way:
using wcsncpy: (overkill for char, thanks chqrlie for noticing)
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#define UNICODE_BLOCK_START 0x16A0 // see wikipedia link for the start
#define UUICODE_BLOCK_END 0x16F0 // true ending of Runic wide chars
int main(void) {
    setlocale(LC_ALL, "");
    // one slot per code point in the inclusive range, plus the terminator
    wchar_t buffer[UUICODE_BLOCK_END - UNICODE_BLOCK_START + 2];
    int i = 0;
    for (wchar_t wc = UNICODE_BLOCK_START; wc <= UUICODE_BLOCK_END; wc++)
        buffer[i++] = wc;
    buffer[i] = L'\0';
    printf("%ls\n", buffer);
    return 0;
}
About Wide Chars (and Unicode)
To understand a bit better what a wide char is, think of it as a value that needs more bits than the original range used for characters, which was 2^8 = 256 (or, with left shifting, 1 << 8).
It is enough when you just need to print what is on your keyboard, but when you need to print Asian characters or other unicode characters, it was not enough anymore and that is the reason why the Unicode standard was created. You can find more about the very different and exotic characters that exist, along with their range (named unicode blocks), on wikipedia, in your case runic.
Range U+16A0..U+16FF - Runic (86 characters), Common (3 characters)
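NB: Your macro ends the range at 0x16F1, which is slightly before the end of the block at 0x16FF. In current Unicode versions the assigned code points run up to U+16F8, and U+16F9 to U+16FF are not defined.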
You can use the following function to print your wide char as bits:
void print_binary(unsigned int number)
{
    char buffer[36]; // 32 bits, 3 spaces and one \0
    unsigned int mask = 1u << 31; // start at the most significant bit (assumes a 32-bit unsigned int)
    int pos = 0;
    for (int bit = 0; bit < 32; bit++) {
        if (bit && !(bit % 8))
            buffer[pos++] = ' '; // separate the bytes
        buffer[pos++] = (number & (mask >> bit)) ? '1' : '0';
    }
    buffer[pos] = '\0';
    printf("%s\n", buffer);
}
That you call in your loop with:
print_binary((unsigned int)wc);
It will give you a better understanding of how your wide char is represented at the machine level:
ᛞ
00000000 00000000 00010110 11011110
NB: You will need to pay attention to detail: Do not forget the final L'\0' and you need to use %ls to get the output with printf.

Store multi-byte into char array

Following code works:
char *text = "中文";
printf("%s", text);
Then I'm trying to print this text via its Unicode code points, which are 0x4e2d for "中" and 0x6587 for "文":
And sure, nothing prints out.
I'm trying to understand what happens when I store a multi-byte string in a char*, how to print a multi-byte string from its Unicode code points, and furthermore, what is meant by "Format specifier '%ls' requires 'wchar_t *' argument instead of 'wchar_t *'"?
Thanks for any help.
Edit:
I'm on Mac osx (high sierra 10.13.6), with clion
$ gcc --version
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.1.0 (clang-902.0.39.2)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
wchar_t *arr = malloc(2 * sizeof(wchar_t));
arr[0] = 0x4e2d;
arr[1] = 0x6587;
First, the above string is not null-terminated. The printf function knows the beginning of the array, but it has no idea where the array ends, or what size it has. You have to add a zero at the end to make a null-terminated C string.
To print this null-terminated wide string, use printf("%ls", arr); on Unix-based machines (including Mac), and wprintf(L"%s", arr); on Windows (that's a completely different thing: it actually treats the string as UTF-16).
Make sure to add setlocale(LC_ALL, "C.UTF-8"); or setlocale(LC_ALL, ""); for Unix based machines.
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
int main()
{
    setlocale(LC_ALL, "C.UTF-8");
    //print single character:
    printf("%lc\n", (wint_t)0x00004e2d);
    printf("%lc\n", (wint_t)0x00006587);
    printf("%lc\n", (wint_t)0x0001F310);
    wchar_t *arr = malloc((2 + 1)* sizeof(wchar_t));
    if (arr == NULL)
        return 1;
    arr[0] = 0x00004e2d;
    arr[1] = 0x00006587;
    arr[2] = 0; // terminating null
    printf("%ls\n", arr);
    free(arr);
    return 0;
}
Aside,
In UTF-32, code points always need 4 bytes (example 0x00004e2d). This can be represented with a 4-byte data type, char32_t (or wchar_t in POSIX).
In UTF-8, code points need 1, 2, 3, or 4 bytes. The UTF-8 encoding of an ASCII character needs one byte, while 中 needs 3 bytes (or 3 char values). You can confirm this by running this code:
printf("A:%zu 中:%zu 🙂:%zu\n", strlen("A"), strlen("中"), strlen("🙂"));
So we can't store such a character in a single char in UTF-8. We can use strings instead:
const char* x = u8"中";
We can use normal string functions in C, like strcpy etc. But some standard C functions don't work. For example strchr just doesn't work for finding 中. This is usually not a problem because characters such as "print format specifiers" are all ASCII and are one byte.

Can't assign wide char into wide char field.

I have been given this school project. I have to alphabetically sort list of items by Czech rules. Before I dig deeper, I have decided to test it on a 16 by 16 matrix so I did this:
typedef struct {
wint_t **field;
}LIST;
...
setlocale(LC_CTYPE,NULL);
....
list->field=(wint_t **)malloc(16*sizeof(wint_t *));
for(int i=0;i<16;i++)
list->field[i]=(wint_t *)malloc(16*sizeof(wint_t));
In another function I am trying to assign a char. Like this:
sorted->field[15][15] = L'C';
wprintf(L"%c\n",sorted->field[15][15]);
Everything is fine. Char is printed. But when I try to change it to
sorted->field[15][15] = L'Č';
It says: Extraneous characters in wide character constant ignored. (Xcode) And the printing part is skipped. The main.c file is in UTF-8. If I try to print this:
printf("ěščřžýááíé\n");
It prints it out as written. I am not sure if I should allocate mem using wint_t or wchar_t or if I am doing it right. I tested it with both but none of them works.
clang seems to support entering arbitrary code values into wide strings with the \x notation:
wchar_t c = L'\x2126';
This compiles without notice.
Edit: Adapting what I find on wikipedia about wide characters, the following works for me:
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
int main(void)
{
setlocale(LC_ALL,"");
wchar_t myChar1 = L'\x2126';
wchar_t myChar2 = 0x2126; // hexadecimal encoding of char Ω using UTF-16
wprintf(L"This is char: %lc \n",myChar1);
wprintf(L"This is char: %lc \n",myChar2);
}
and prints nice Ω characters in my terminal. Make sure that your terminal is able to interpret UTF-8 characters.

What to replace in this C Puzzle?

Now this is a silly puzzle I got from some exam paper; sadly I have been unable to figure it out for the last 15 minutes.
#include <stdio.h>
int main(void){
/* <something> */
putchar(*(wer[1]+1));
return 0;
}
What should we put in place of <something> in order to get the output e? Now, we know putchar takes an int as its argument, but this code seems to give it a pointer. Is this question even valid?
const char *wer[2] = { "failed", "test" };
Since a[i] is the same as *(a + i) by definition, you can transform the putchar() argument into wer[1][1]. So, something like char *wer[2] would be a satisfactory definition, and any values such that wer[1][1] == 'e' will work.
char * wer[] = { "foobar", "he"};
First, in the code you have mentioned the argument to putchar() is
*(wer[1]+1)
which is not a pointer. It seems that wer[1] is some pointer and that address pointed by wer[1] + 1 is dereferenced. So if wer is an array of pointers to int, then putchar argument should be int which is fine.
Now the code in place of something can be
You have not mentioned clearly what e means: is e a char, or is e 2.71... (the base of the natural logarithm)? In either case it should be easy to get that output with this code.
-AD
An easy answer is:
char **wer;
putchar('e');
return 0;
For example, the complete code would look like:
#include <stdio.h>
int main(int argc, char **argv)
{
/* Something starts here */
char **wer;
putchar('e');
return 0;
/* Something ends here */
putchar(*(wer[1] + 1));
return 0;
}
The output is:
susam#swift:~$ gcc really-silly-puzzle.c && ./a.out && echo
e
susam#swift:~$
A more interesting question would have been: What is the shortest code that can replace /* */ to get the output 'e'?
From the very pedantic point of view, there's no correct answer to the question. The question is invalid. C language itself makes no guarantees about the output of a program if the output does not end in a newline character. See 7.19.2/2
A text stream is an ordered sequence
of characters composed into lines,
each line consisting of zero or more
characters plus a terminating new-line
character. Whether the last line
requires a terminating new-line
character is implementation-defined.
This program writes to the standard output, which is a text stream. The output of this program is implementation-dependent, regardless of what you put in place of /* <something> */, meaning that the question might make sense for some specific platform, but it makes no sense as an abstract C language question.
I highly doubt though that your examiners are expecting this kind of pedantry from you :)))

Is it possible to use a Unicode "argv"?

I'm writing a little wrapper for an application that uses files as arguments.
The wrapper needs to be in Unicode, so I'm using wchar_t for the characters and strings I have. Now I have a problem: I need to have the arguments of the program in an array of wchar_t's and in a wchar_t string.
Is it possible? I'm defining the main function as
int main(int argc, char *argv[])
Should I use wchar_t's for argv?
Thank you very much, I seem not to find useful info on how to use Unicode properly in C.
In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.
Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
On a Mac (10.5.8, Leopard), I got:
Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:
That's all UTF-8 encoded. (odx is a hex dump program).
See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment
Portable code doesn't support it. Windows (for example) supports using wmain instead of main, in which case argv is passed as wide characters.
On Windows, you can use GetCommandLineW() and CommandLineToArgvW() to produce an argv-style wchar_t[] array, even if the app is not compiled for Unicode.
On Windows anyway, you can have a wmain() for UNICODE builds. Not portable though. I dunno if GCC or Unix/Linux platforms provide anything similar.
Assuming that your Linux environment uses UTF-8 encoding then the following code will prepare your program for easy Unicode treatment in C++:
int main(int argc, char * argv[]) {
std::setlocale(LC_CTYPE, "");
// ...
}
Next, wchar_t type is 32-bit in Linux, which means it can hold individual Unicode code points and you can safely use wstring type for classical string processing in C++ (character by character). With setlocale call above, inserting into wcout will automatically translate your output into UTF-8 and extracting from wcin will automatically translate UTF-8 input into UTF-32 (1 character = 1 code point). The only problem that remains is that argv[i] strings are still UTF-8 encoded.
You can use the following function to decode UTF-8 into UTF-32. If the input string is corrupted it will return properly converted characters until the place where the UTF-8 rules were broken. You could improve it if you need more error reporting. But for argv data one can safely assume that it is correct UTF-8:
#include <string>
using std::wstring;
#define ARR_LEN(x) (sizeof(x)/sizeof(x[0]))
wstring Convert(const char * s) {
typedef unsigned char byte;
struct Level {
byte Head, Data, Null;
Level(byte h, byte d) {
Head = h; // the head shifted to the right
Data = d; // number of data bits
Null = h << d; // encoded byte with zero data bits
}
bool encoded(byte b) { return b>>Data == Head; }
}; // struct Level
Level lev[] = {
Level(2, 6),
Level(6, 5),
Level(14, 4),
Level(30, 3),
Level(62, 2),
Level(126, 1)
};
wchar_t wc = 0;
const char * p = s;
wstring result;
while (*p != 0) {
byte b = *p++;
if (b>>7 == 0) { // deal with ASCII
wc = b;
result.push_back(wc);
continue;
} // ASCII
bool found = false;
for (int i = 1; i < ARR_LEN(lev); ++i) {
if (lev[i].encoded(b)) {
wc = b ^ lev[i].Null; // remove the head
wc <<= lev[0].Data * i;
for (int j = i; j > 0; --j) { // trailing bytes
if (*p == 0) return result; // unexpected
b = *p++;
if (!lev[0].encoded(b)) // encoding corrupted
return result;
wchar_t tmp = b ^ lev[0].Null;
wc |= tmp << lev[0].Data*(j-1);
} // trailing bytes
result.push_back(wc);
found = true;
break;
} // lev[i]
} // for lev
if (!found) return result; // encoding incorrect
} // while
return result;
} // wstring Convert
On Windows, you can use tchar.h and _tmain, which will be turned into wmain if the _UNICODE symbol is defined at compile time, or main otherwise. TCHAR *argv[] will similarly be expanded to WCHAR * argv[] if unicode is defined, and char * argv[] if not.
If you want to have your main method work cross platform, you can define your own macros to the same effect.
TCHAR.h contains a number of convenience macros for conversion between wchar and char.

Resources