sizeof character and strlen string mismatch - c

As per my code, I assume each Greek character is stored in 2 bytes.
sizeof returns the size of each character as 4 (i.e. sizeof(int)).
How does strlen return 16? That makes me think each character occupies 2 bytes. (Shouldn't it be 4*8 = 32, since it counts the number of bytes?)
Also, how does printf("%c",bigString[i]); print each character properly? Shouldn't it read 1 byte (a char) and display that because of %c? Why is the Greek character not split in this case?
strcpy(bigString,"ειδικούς");//greek
sLen = strlen(bigString);
printf("Size is %d\n ",sizeof('ε')); //printing for each character similarly
printf("%s is of length %d\n",bigString,sLen);
int k1 = 0 ,k2 = sLen - 2;
for(i=0;i<sLen;i++)
printf("%c",bigString[i]);
Output:
Size is 4
ειδικούς is of length 16
ειδικούς

Character literals in C have type int, so sizeof('ε') is the same as sizeof(int). You're playing with fire a bit in this statement: 'ε' will be a multicharacter literal, whose value is implementation-defined, and it might come back to bite you. Be careful about relying on behaviour like this. Clang, for example, won't accept this program with that literal in it. GCC gives a warning, but will still compile it.
strlen returns 16, since that's the number of bytes in your string before the null terminator. Your Greek characters are all 16 bits long in UTF-8, so your string looks something like:
c0c0 c1c1 c2c2 c3c3 c4c4 c5c5 c6c6 c7c7 0
in memory, where c0c0, for example, is the two bytes of the first character. There is a single null-termination byte in your string.
The printf appears to work because your terminal is UTF-8 aware. You are printing each byte separately, but the terminal is interpreting the first two prints as a single character, and so on. If you change that printf call to:
printf("%d: %02x\n", i, (unsigned char)bigString[i]);
You'll see the byte-by-byte behaviour you're expecting.
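To make the byte layout concrete, here is a minimal sketch that dumps a UTF-8 string byte by byte. The helper name dump_bytes and the array two_greek_letters are made up for illustration, and the string is spelled with hex escapes so the result does not depend on the source file's encoding.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints every byte of s in hex and returns the
   byte count, which is exactly what strlen reports. */
size_t dump_bytes(const char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        printf("%zu: %02x\n", i, (unsigned char)s[i]);
    return n;
}

/* The first two letters of the word, written as explicit UTF-8 bytes:
   U+03B5 is CE B5 and U+03B9 is CE B9. */
const char two_greek_letters[] = "\xce\xb5\xce\xb9";
```

Calling dump_bytes(two_greek_letters) prints four lines, one per byte, and returns 4: two letters at two bytes each. The full eight-letter word gives 16 in the same way.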

Related

Printf outputs characters beyond the specified length of the array

I tried this chunk of code:
char string_one[8], string_two[8];
printf("&string_one == %p\n", &string_one);
printf("&string_two == %p\n", &string_two);
strcpy(string_one, "Hello!");
strcpy(string_two, "Long string");
printf("string_one == %s\n", string_one);
printf("string_two == %s\n", string_two);
And got this output:
&string_one == 0x7fff3f871524
&string_two == 0x7fff3f87151c
string_one == ing
string_two == Long string
Since the second string is longer than the specified size of its array, the characters whose subscripts exceed the array size are stored in the following bytes, which belong to the first array, as the addresses show. Obviously the first string is overwritten.
There is no way the second array can hold the whole string; it is too big. Nevertheless, the output prints the whole string.
I speculated for a while and came to the conclusion that the printf() function keeps outputting characters from the following bytes until it comes across a string terminator '\0'. I did not find any confirmation for my reasoning, so the question is: are these speculations correct?
From the C Standard (5.2.1 Character sets)
2 In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of the
source character set or by escape sequences consisting of the
backslash \ followed by one or more characters. A byte with all bits
set to 0, called the null character, shall exist in the basic
execution character set; it is used to terminate a character string.
And (7.21.6.1 The fprintf function)
8 The conversion specifiers and their meanings are:
s If no l length modifier is present, the argument shall be a pointer
to the initial element of an array of character type. Characters
from the array are written up to (but not including) the terminating
null character.
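The effect described in the question can be reproduced without any out-of-bounds write. The sketch below is an illustration only: a single 16-byte buffer stands in for the two adjacent 8-byte arrays, and the function name overlap_demo and its output parameters are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical demo: one 16-byte buffer plays the role of two adjacent
   8-byte arrays, so every write stays inside a real object. */
void overlap_demo(char out_first[16], char out_second[16])
{
    char memory[16] = {0};
    char *second = memory;      /* stands in for string_two */
    char *first  = memory + 8;  /* stands in for string_one */

    strcpy(first, "Hello!");
    strcpy(second, "Long string"); /* 12 bytes: spills into "first" */

    /* %s knows nothing about array sizes; it copies bytes until '\0'. */
    snprintf(out_first, 16, "%s", first);
    snprintf(out_second, 16, "%s", second);
}
```

After the call, out_first holds "ing" and out_second holds "Long string", matching the output in the question: the conversion simply runs from the given pointer to the first null character.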
My compiler(GCC) said:
warning: ‘__builtin_memcpy’ writing 12 bytes into a region of size 8 overflows the destination [-Wstringop-overflow=]
strcpy(string_two, "Long string");
And just to show how optimizations will take everything that you think you know and turn it on its head, here's what happens if you compile this on a 64-bit PowerPC Power-9 (i.e. not x86) with gcc -O3 -flto:
$ ./char-array-overlap
&string_one == 0x7fffc502bef0
&string_two == 0x7fffc502bef8
string_one == Hello!
string_two == Long string
Because if you look at the machine code it never executes strcpy at all.

character array overflowing by sprintf

I am using char array[6];
I am converting a float variable to string using sprintf as follows..
sprintf(array,"%f\0",floatvar);
and I am writing the char array to an LCD.
The problem is that my array size is only 6 bytes, but it is printing "00000.00000", which is 11 bytes of data. The array size is restricted to 6 bytes, so how is the array overflowing in this case?
The sprintf function expects that you provide a big enough buffer to hold all of its output. Otherwise your code causes undefined behaviour.
Your code would not produce 00000.00000 either; if the value is between 0 and 1 then the output will start with "0.". Perhaps you used a different format string in your real code.
With %f it is not possible to constrain the output solely via format string modifiers. To be safe, you can use snprintf:
snprintf(array, 6, "%f", floatvar);
If your system does not have snprintf available then I would suggest downloading a freeware implementation of vsnprintf.
As a last resort you could use sprintf with a lot of checking:
if ( floatvar < 0.f || floatvar >= 1.f )
exit.....;
sprintf(array, "%.3f", floatvar);
The .3 means that exactly 3 digits will show after the decimal point; and since we did a range check the integer part is a single digit, for a total of 5 output characters plus null terminator.
To be on the safe side I'd suggest temporarily outputting to a large buffer, using strlen to check what was written, and then copying to your 6-byte buffer if it did write correctly.
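A minimal sketch of the snprintf approach, assuming a C99 snprintf is available. The helper name format_fits is invented for illustration. Instead of a second buffer and strlen, it relies on snprintf returning the length the full output would have had, so truncation can be detected from the return value.

```c
#include <stdio.h>

/* Hypothetical helper: formats value with 3 decimals into out.
   Returns 1 if the whole text fit (terminator included), 0 if snprintf
   had to truncate it or reported an error. */
int format_fits(char *out, size_t out_size, double value)
{
    int needed = snprintf(out, out_size, "%.3f", value);
    return needed >= 0 && (size_t)needed < out_size;
}
```

With char array[6], format_fits(array, sizeof array, 0.5) stores "0.500" and reports success, while a value such as 123.456 is cut to "123.4" and reported as not fitting. Either way nothing is written past the end of the array.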
NB. "%f\0" is strange; string literals are strings and so they end in '\0' already. "%f\0" ends in two null terminators.

Why does this program give the following output?

When I ran this program it gave an output of
1, 4, 4
Why does sizeof('A') give 4 bytes? Is 'A' treated as an integer? If so, then why?
#include<stdio.h>
int main()
{
char ch = 'A';
printf("%d, %d, %d", sizeof(ch), sizeof('A'), sizeof(3.14f));
return 0;
}
Moreover, when I replace
printf("%d, %d, %d", sizeof(ch), sizeof('A'), sizeof(3.14f));
with,
printf("%d, %d, %d", sizeof(ch), sizeof("A"), sizeof(3.14f));
It gives the output
1, 2, 4
which is even more confounding.
P.S.: I used compileonline.com to test this code.
In C, the type of 'A' is int, which explains why sizeof('A') is 4 (since evidently your platform has 32-bit int). For more information, see Size of character ('a') in C/C++
When compiled as C++, the first program prints 1, 1, 4.
"A" is a string literal consisting of the letter A followed by the NUL character. Since it's two characters long, sizeof("A") is 2.
1. The sizeof operator gives the size of its operand.
2. The size of a variable is machine (compiler) dependent. In your case it is 32-bit.
3. sizeof(ch) = 1 because you declared it as char.
4. sizeof('A') = 4 because the compiler treats the character constant as an int.
5. sizeof("A") = 2 because it is a string of 2 bytes. For a string literal, even if you write a single character, the compiler inserts a null character at the end, so its size is 2 bytes.
6. sizeof(3.14f) = 4 because the size of float is 4 bytes.
I generally suggest using sizeof on types or on variables. Using sizeof on literal constants seems confusing (except perhaps on literal strings, to compute 1 + their string length at compile time).
The literal 'A' is in C an int whose size is 4 on your machine.
The literal string "A" is exactly like
const char literal_A_string[] = {'A', (char)0};
whose size is obviously 2 bytes (because each literal string has a terminal null byte appended).
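The three cases discussed in these answers can be checked side by side. This is a small sketch, with the function name print_sizes invented for the example; it prints the size_t results with %zu rather than %d, which is the matching conversion for what sizeof yields.

```c
#include <stdio.h>

/* Hypothetical demo: prints the sizes discussed above. */
void print_sizes(void)
{
    char ch = 'A';
    printf("sizeof(ch)  = %zu\n", sizeof ch);    /* always 1 */
    printf("sizeof('A') = %zu\n", sizeof 'A');   /* sizeof(int) in C */
    printf("sizeof(\"A\") = %zu\n", sizeof "A"); /* 'A' plus '\0' */
}
```

On a platform with 32-bit int, the middle line reports 4. The first and last lines report 1 and 2 on any C implementation.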

Find non-ascii characters from a UTF-8 string

I need to find the non-ASCII characters from a UTF-8 string.
my understanding:
UTF-8 is a superset of ASCII in which the values 0-127 are the ASCII characters.
So if a character's value in a UTF-8 string is not between 0 and 127, then it is not an ASCII character, right? Please correct me if I'm wrong here.
On the above understanding i have written following code in C :
Note:
I'm using the Ubuntu gcc compiler to run C code
utf-string is x√ab c
long i;
char arr[] = "x√ab c";
printf("length : %lu \n", sizeof(arr));
for(i=0; i<sizeof(arr); i++){
char ch = arr[i];
if (isascii(ch))
printf("Ascii character %c\n", ch);
else
printf("Not ascii character %c\n", ch);
}
Which prints the output like:
length : 9
Ascii character x
Not ascii character
Not ascii character �
Not ascii character �
Ascii character a
Ascii character b
Ascii character
Ascii character c
Ascii character
To the naked eye the length of x√ab c seems to be 6, but in the code it comes out as 9?
The correct answer for x√ab c is 1, i.e. it has only 1 non-ASCII character, but in the output above it comes out as 3 (three "Not ascii character" lines).
How can I find the non-ASCII characters in a UTF-8 string correctly?
Please guide me on the subject.
What C calls a char is actually a byte. A UTF-8 character can be made up of several bytes.
In fact only the ASCII characters are represented by a single byte in UTF-8 (which is why all valid ASCII-encoded text is also effectively UTF-8 encoded).
So to count the number of UTF-8 characters you have to do a partial decoding: count the bytes that begin a code point.
See the Wikipedia article on UTF-8 to find out how they are encoded.
Basically there are 3 categories:
single-byte codes 0b0xxxxxxx
start bytes: 0b110xxxxx, 0b1110xxxx, 0b11110xxx
continuation bytes: 0b10xxxxxx
To count the number of Unicode code points, simply count all bytes that are not continuation bytes.
However, Unicode code points don't always have a 1-to-1 correspondence to "characters" (depending on your exact definition of character).
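The three byte categories translate directly into two small counting loops. The sketch below assumes the input is valid UTF-8 and does no validation; the function names count_code_points and count_non_ascii are made up for this example.

```c
#include <stddef.h>

/* Counts code points: every byte except continuation bytes (10xxxxxx). */
size_t count_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Counts non-ASCII code points: in valid UTF-8 each one begins with a
   start byte, and start bytes are the bytes with value 0xC0 or above. */
size_t count_non_ascii(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if ((unsigned char)*s >= 0xC0)
            n++;
    return n;
}
```

For the string from the question, written as "x\xe2\x88\x9a" "ab c" to keep the bytes explicit, count_code_points returns 6 and count_non_ascii returns 1. The casts to unsigned char matter because plain char may be signed, in which case bytes above 127 would compare as negative.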
A UTF-8 character stored in a char array is laid out so that its first byte tells you how many bytes the character takes. The number of consecutive 1s starting from the MSB of the first byte is the total number of bytes taken by the non-ASCII character. In the case of '√' the binary form is 11100010, 10001000, 10011010. Counting the leading 1s in the first byte gives the number of bytes occupied as 3. Something like the code below would work for this:
int get_count(char non_ascii_char){
    /*
    The function returns the number of bytes occupied by the UTF-8 character.
    It takes the first byte of the non-ASCII character as the input and
    returns the length to the calling function.
    */
    unsigned char first_byte=(unsigned char)non_ascii_char;// avoids shifting a negative value
    int bit_counter=7,count=0;
    /*
    bit_counter - is the counter initialized to traverse through each bit of the
    first byte, starting from the MSB
    count - stores the number of bytes occupied by the character
    */
    for(;bit_counter>=0;bit_counter--){
        if((first_byte>>bit_counter)&1){
            count++;// increments on the number of consecutive 1s in the byte
        }
        else{
            break;// breaks on encountering the first 0
        }
    }
    return count;// returns the count to the calling function
}

memcpy() to copy integer value to char buffer

I am trying to copy the memory value of int into the char buffer. The code looks like below,
#define CPYINT(a, b) memcpy(a, &b, 4)
............
char str1[4];
int i = 1;
CPYINT(str1, i);
printf("%s",str1);
...........
When I print str1 it’s blank. Please clarify.
You are copying the byte representation of an integer into a char array. You then ask printf to interpret this array as a null-terminated string: if str1[0] is zero (as on a big-endian machine), you are essentially printing an empty string. On a little-endian machine str1[0] is 1, an unprintable character, and the next byte is zero, so the output still looks blank.
What did you expect? Obviously, if you wanted to print a textual representation of the integer i, you should use printf("%d", i).
try
printf("%02X %02X %02X %02X\n", (unsigned char)str1[0], (unsigned char)str1[1], (unsigned char)str1[2], (unsigned char)str1[3]);
instead.
The binary representation of the integer 1 probably contains leading NULs, so your current printf statement terminates earlier than you want.
What is your intention here? Right now you are putting arbitrary byte values into the char array and then interpreting them as a string. As it happens, the first byte is probably a zero (null), hence you print nothing, but in all probability many of the bytes will be unprintable, so printf with %s is the wrong tool to check whether the copy worked.
So, either loop through the array and print the numeric value of each byte (%02x might be useful for that), or, if your intention is actually to create a string representation of the int, use a larger buffer with space for a null terminator.
Maybe you need to convert the integer to a char*; in that case you can use the itoa function (note that itoa is not part of standard C).
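Both suggestions from these answers can be sketched in a few lines: inspecting the copied bytes numerically, and producing a real string with snprintf as a standard alternative to itoa. The helper names copy_int_bytes and int_to_text are hypothetical, chosen only for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies the raw bytes of value into raw (which must
   hold at least sizeof(int) bytes) and prints each byte in hex. */
void copy_int_bytes(unsigned char *raw, int value)
{
    memcpy(raw, &value, sizeof value);
    for (size_t i = 0; i < sizeof value; i++)
        printf("%02X ", raw[i]);
    printf("\n");
}

/* Hypothetical helper: writes the decimal text of value into text.
   Returns 1 if it fit, terminator included, otherwise 0. */
int int_to_text(char *text, size_t text_size, int value)
{
    int needed = snprintf(text, text_size, "%d", value);
    return needed >= 0 && (size_t)needed < text_size;
}
```

For the value 1 with a 4-byte int, the hex dump shows a single 01 among zero bytes, first on a little-endian machine and last on a big-endian one, which is why printing the buffer with %s appears blank. Using sizeof value instead of a hard-coded 4 keeps the copy correct if int has a different size.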
