Unicode stored in C char

I'm learning the C language on Linux now and I've come across a slightly weird situation.
As far as I know, the standard C char data type is one byte (8 bits) and ASCII-based. That should mean it can only hold ASCII characters.
In my program I use char input[], which is filled by the getchar function like in this pseudocode:
char input[20];
int z, i;
for(i = 0; i < 20; i++)
{
    z = getchar();
    input[i] = z;
}
The weird thing is that it works not only for ASCII characters, but for any character I can imagine, such as #&#{čřžŧ¶'`[łĐŧđж←^€~[←^ø{&}čž on the input.
My question is: how is this possible? It seems to be one of C's many beautiful exceptions, but I would really appreciate an explanation. Is it a matter of the OS, the compiler, or some hidden additional super-feature of the language?
Thanks.

There is no magic here - the C language gives you access to the raw bytes as they are stored in the computer's memory.
If your terminal is using UTF-8 (which is likely), non-ASCII characters take more than one byte in memory. When you display them again, it is the terminal code that converts these sequences back into a single displayed character.
Just change your code to print the strlen of the strings, and you will see what I mean.
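For example, a tiny sketch of what I mean (the string literal is arbitrary, and this assumes both the source file and the terminal are UTF-8):

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *s = "čř";          /* 2 characters on the screen             */
    printf("%zu\n", strlen(s));    /* prints 4: each character takes 2 bytes */
    return 0;
}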
To properly handle UTF-8 non-ASCII characters in C you have to use some library to handle them for you, like glib, Qt, or many others.

ASCII is a 7 bit character set, in C normally represented by an 8 bit char. If the highest bit in an 8 bit byte is set, it is not an ASCII character.
Also notice that you are not guaranteed ASCII as the base character set, though many ignore other scenarios. If you want to check whether a "primitive" byte is an alphabetic character, you can, in other words, not say (if you take heed of all systems):
is_alpha = (c > 0x40 && c < 0x5b) || (c > 0x60 && c < 0x7b);
Instead you'll have to use ctype.h and say:
isalpha(c);
The only exception, AFAIK, is the digits 0-9, which are guaranteed to have contiguous values.
Thus this works:
char ninec = '9';
char eightc = '8';
int nine = ninec - '0';
int eight = eightc - '0';
printf("%d\n", nine);
printf("%d\n", eight);
But this is not guaranteed to be 'a':
alpha_a = 0x61;
There are also systems not based on ASCII, e.g. those using EBCDIC; C on such a platform still runs fine, but here they (mostly) use 8 bits instead of 7, and e.g. A can be coded as decimal 193 instead of 65 as it is in ASCII.
For ASCII, however, bytes with decimal values 128 - 255 (all 8 bits in use) are extended, and not part of the ASCII set. E.g. ISO-8859 uses this range.
What is often done is to combine two or more bytes into one character. So if you print two bytes after each other that are defined as, say, UTF-8 0xc3 0x98 == Ø, then you'll get this character.
This again depends on which environment you are in. On many systems/environments printing ASCII values gives the same result across character sets, systems etc. But printing bytes > 127 or double-byte characters gives a different result depending on the local configuration.
I.e.:
Mr. A running the program gets
Jasŋ€
While Mr. B gets
Jasπß
This is perhaps especially relevant to the ISO-8859 series and Windows-1252, which give single-byte representations of extended characters, etc.
ASCII_printable_characters, notice they are 7 not 8 bits.
ISO_8859-1 and ISO_8859-15, widely used sets, with ASCII as core.
Windows-1252, legacy of Windows.
UTF-8#Codepage_layout, in UTF-8 you have ASCII, then you have special sequences of bytes.
Each sequence starts with a byte > 127 (127 being the last ASCII value),
followed by a given number of bytes which all start with the bits 10.
In other words, you will never find an ASCII byte in a multi-byte UTF-8 representation.
That is, the first byte in UTF-8, if not ASCII, tells how many bytes this character has. You could also say ASCII characters say no more bytes follow - because the highest bit is 0.
I.e. if the file is interpreted as UTF-8:
c = fgetc(fp);
if c < 128, 0x80, then ASCII
if c == 194, 0xC2, then one more byte follows; interpret to symbol
if c == 226, 0xE2, then two more bytes follow; interpret to symbol
...
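A small sketch of that lead-byte test in actual C (the ranges follow the UTF-8 layout described above; the function name is mine and validation is minimal):

#include <stdio.h>

/* How many continuation bytes follow a given UTF-8 lead byte?
   Returns 0 for plain ASCII and -1 for a byte that cannot start a sequence. */
static int utf8_trailing_bytes(unsigned char c)
{
    if (c < 0x80) return 0;   /* 0xxxxxxx : ASCII, nothing follows        */
    if (c < 0xC0) return -1;  /* 10xxxxxx : continuation byte, not a lead */
    if (c < 0xE0) return 1;   /* 110xxxxx : one more byte                 */
    if (c < 0xF0) return 2;   /* 1110xxxx : two more bytes                */
    if (c < 0xF8) return 3;   /* 11110xxx : three more bytes              */
    return -1;                /* 0xF8..0xFF never appear in valid UTF-8   */
}

int main(void)
{
    printf("%d %d %d\n",
           utf8_trailing_bytes('A'),    /* 0 */
           utf8_trailing_bytes(0xC4),   /* 1, e.g. the lead byte of "č" */
           utf8_trailing_bytes(0xE2));  /* 2 */
    return 0;
}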
As an example, let's look at one of the characters you mention. In a UTF-8 terminal:
$ echo -n "č" | xxd
Should yield:
0000000: c48d ..
In other words "č" is represented by the two bytes 0xc4 and 0x8d. Add -b to the xxd command and we get the binary representation of the bytes. We dissect them as follows:
 ___ byte 1 ___     ___ byte 2 ___
|              |   |              |
0xc4 : 1100 0100   0x8d : 1000 1101
       |                  |
       |                  +-- all "follow" bytes start with 10, rest: 00 1101
       |
       +-- 11 -> 2 bits set = two byte symbol, the "bits set" sequence
           ends with 0 (here 3 bits are used: 110) : rest 0 0100

Rest bits combined: xxx0 0100 xx00 1101 => 00100001101
                       \____/   \_____/
                          |        |
                          |        +--- From last byte
                          +------------ From first byte
This gives us: binary 00100001101 = 269 decimal = 0x10D => Unicode codepoint U+010D == "č".
This number can also be used in HTML as &#269; == č
Common for this and lots of other code systems is that an 8 bit byte is the base.
Often it is also a question of context. As an example take GSM SMS, with ETSI GSM 03.38/03.40 (3GPP TS 23.038, 3GPP 23038). There we also find a 7-bit character table, the GSM 7-bit default alphabet, but instead of storing the characters as 8 bits they are stored as 7 bits¹. This way you can pack more characters into a given number of bytes: a standard 160-character SMS becomes 1280 bits, or 160 bytes, as ASCII, and 1120 bits, or 140 bytes, as SMS.
¹ Not without exception (there is more to the story).
I.e. a simple example of bytes saved as septets (7bit) C8329BFD06 in SMS UDP format to ASCII:
                              _________
7 bit UDP represented        |        +--- Alphas have same bits as ASCII
as 8 bit hex                 '0.......'
C8329BFDBEBEE56C32 1100100 d * Prev last 6 bits + pp 1
| | | | | | | | +- 00 110010 -> 1101100 l * Prev last 7 bits
| | | | | | | +--- 0 1101100 -> 1110010 r * Prev 7 + 0 bits
| | | | | | +----- 1110010 1 -> 1101111 o * Last 1 + prev 6
| | | | | +------- 101111 10 -> 1010111 W * Last 2 + prev 5
| | | | +--------- 10111 110 -> 1101111 o * Last 3 + prev 4
| | | +----------- 1111 1101 -> 1101100 l * Last 4 + prev 3
| | +------------- 100 11011 -> 1101100 l * Last 5 + prev 2
| +--------------- 00 110010 -> 1100101 e * Last 6 + prev 1
+----------------- 1 1001000 -> 1001000 H * Last 7 bits
                               '------'
                                   |
                                   +----- GSM Table as binary
And 9 bytes "unpacked" becomes 10 characters.
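As a hedged illustration of that unpacking (the helper name and buffer sizes are mine, and real SMS handling has more corner cases):

#include <stdio.h>

/* Unpack 'count' GSM 7-bit septets (packed LSB-first, as above) into out[]. */
static void unpack_septets(const unsigned char *packed, int count, char *out)
{
    for (int i = 0; i < count; i++) {
        int bit   = i * 7;     /* absolute bit offset of septet i            */
        int byte  = bit / 8;   /* byte holding its least significant bits    */
        int shift = bit % 8;
        int v = packed[byte] >> shift;
        if (shift > 1)         /* septet spills over into the next byte      */
            v |= packed[byte + 1] << (8 - shift);
        out[i] = (char)(v & 0x7F);
    }
    out[count] = '\0';
}

int main(void)
{
    const unsigned char sms[] = {0xC8, 0x32, 0x9B, 0xFD, 0xBE, 0xBE, 0xE5, 0x6C, 0x32};
    char text[11];
    unpack_septets(sms, 10, text);
    printf("%s\n", text);   /* prints HelloWorld */
    return 0;
}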

ASCII is 7 bits, not 8 bits. A char[] holds bytes, which can be in any encoding - ISO 8859-1, UTF-8, whatever you want. C doesn't care.

This is the magic of UTF-8: you don't even have to worry about how it works. The only problem is that the C data type is named char (for character), while what it actually means is byte. There is no 1:1 correspondence between characters and the bytes that encode them.
What happens in your code is that, from the program's point of view, you input a sequence of bytes, it stores the bytes in memory, and if you print the text it prints bytes. This code doesn't care how these bytes encode the characters; it's only the terminal which needs to worry about encoding them on input and correctly interpreting them on output.

There are of course many libraries that do the job, but to quickly decode any UTF-8 sequence, this little function is handy:
typedef unsigned char utf8_t;

#define isunicode(c) (((c)&0xc0)==0xc0)

int utf8_decode(const char *str,int *i) {
    const utf8_t *s = (const utf8_t *)str; // Use unsigned chars
    int u = *s,l = 1;
    if(isunicode(u)) {
        int a = (u&0x20)? ((u&0x10)? ((u&0x08)? ((u&0x04)? 6 : 5) : 4) : 3) : 2;
        if(a<6 || !(u&0x02)) {
            int b,p = 0;
            u = ((u<<(a+1))&0xff)>>(a+1);
            for(b=1; b<a; ++b)
                u = (u<<6)|(s[l++]&0x3f);
        }
    }
    if(i) *i += l;
    return u;
}
Considering your code, you can iterate over the string and read the Unicode values:
int l;
for(i=0; i<20 && input[i]!='\0'; ) {
    if(!isunicode(input[i])) i++;
    else {
        l = 0;
        z = utf8_decode(&input[i],&l);
        printf("Unicode value at %d is U+%04X and it's %d bytes.\n",i,z,l);
        i += l;
    }
}

There is a data type wint_t (#include <wchar.h>) for non-ASCII characters. You can use the function getwchar() to read them.
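For example, a minimal sketch (my own, not from the library documentation); the setlocale call is needed so the C library knows the terminal's encoding:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");   /* pick up the environment's locale, e.g. a UTF-8 one */

    wint_t wc;
    while ((wc = getwchar()) != WEOF)
        wprintf(L"U+%04X\n", (unsigned int)wc);   /* print each character's code point */
    return 0;
}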


explanation on bitwise operators

I found this piece of code online and it works as part of my project, but I'm not sure why. I don't want to just use it without understanding what it does.
type = (packet_data[12] << 8) | packet_data[13];
If I use it I get the proper type (0x0800 for IPv4) and can use it in a comparison to print out whether it's IPv4 or IPv6. If I don't use it and try something like:
if(packet_data[12] == 08 && packet_data[13] == 00)
    print out IPv4
it doesn't work (compile errors).
Also, if I just print out the values like
printf("%02X", packet_data[12]);
printf("%02X", packet_data[13]);
it prints out the proper value in the form 0800, but I need to print out that it's an IPv4 type, which is why I need the comparison in the first place. Any advice or explanation of what this does would be much appreciated. Thanks
if(packet_data[12] == 08 && packet_data[13] == 00)
Here the right-hand literal operands are seen as octal literals by the compiler.
Fortunately for you, 8 is not a valid octal digit, so you get a compilation error.
You mean hexadecimal literals:
if (packet_data[12] == 0x8 && packet_data[13] == 0x0)
this line:
(packet_data[12] << 8) | packet_data[13]
reconstructs the big-endian value (network byte order) of the data located at offsets 12 and 13. Both approaches (comparing the two bytes separately, or comparing the combined 16-bit value) are equivalent in your case, although the latter is more convenient for comparing the value as a whole.
packet_data[12] << 8 takes the first Ethertype octet and shifts it 8 bits to the left to the upper 8 bits of a 16-bit word.
| packet_data[13] takes the second Ethertype octet and bitwise-ORs it to the previous 16-bit word.
You can then compare it to 0x0800 for IPv4 or 0x86DD for IPv6; see a more complete list on https://en.wikipedia.org/wiki/EtherType#Examples
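Putting that together, a minimal sketch (the array contents here are made up just to exercise the comparison):

#include <stdio.h>

int main(void)
{
    unsigned char packet_data[14] = {0};
    packet_data[12] = 0x08;   /* pretend we captured an IPv4 frame */
    packet_data[13] = 0x00;

    unsigned int type = (packet_data[12] << 8) | packet_data[13];

    if (type == 0x0800)
        printf("IPv4\n");
    else if (type == 0x86DD)
        printf("IPv6\n");
    else
        printf("other EtherType 0x%04X\n", type);
    return 0;
}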
As has already been pointed out, 08 doesn't work since numerals starting with 0 are octal numbers, and the digit 8 doesn't exist in octal.
type = (packet_data[12] << 8) | packet_data[13];
The << is the bitwise left shift operator. It takes the binary representation of the variable and shifts its bits to the left, by 8 places in this case.
0x0800 looks like 100000000000 in binary. So in order for 0x0800 to be the type, it has to end up looking like that after | packet_data[13]. This last part is a bitwise OR. It writes a 1 if either the left side or the right side has a 1 in that place, and a 0 otherwise.
So after shifting the value in packet_data[12] (0x08 becomes 100000000000), the only way for type to be 0x0800 is if packet_data[13], which fills the low 8 bits, is 0x00:
type = (0x800) <==> ( 100000000000 | 000000000000 )
Also, to get the 0x out of printf() you need to add the # flag. But to get 0x0800 you need to specify a precision of .04, which means at least 4 digits including leading zeros. However this won't output the 0x if the value is 0; for that you'd need to hardcode the literal 0x into printf().
printf("%#02x\n", data);
printf("%#.04x\n", data);
printf("0x%.04x\n", data=0);
Output
0x800
0x0800
0x0000

Pointers in C with typecasting

#include <stdio.h>

int main()
{
    int a;
    char *x;
    x = (char *) &a;
    a = 512;
    x[0] = 1;
    x[1] = 2;
    printf("%d\n",a);
    return 0;
}
I'm not able to grasp how the output is 513, or even machine dependent. I can sense that the typecast is playing a major role, but what is happening behind the scenes? Can someone help me visualise this problem?
The int a is stored in memory as 4 bytes. The number 512 is represented on your machine as:
0 2 0 0
When you assign to x[0] and x[1], it changes this to:
1 2 0 0
which is the number 513.
This is machine-dependent, because the order of bytes in a multi-byte number is not specified by the C language.
To simplify, assume the following:
size of int is 4 (in bytes)
size of any pointer type is 8
size of char is 1 byte
In line 3 (x = (char *) &a), x references a as a char; this means that x thinks it is pointing to a char (it has no idea that a is actually an int).
Line 4 (a = 512) is meant to confuse you. Don't let it.
Line 5 - since x thinks it is pointing to a char, x[0] = 1 changes just the first byte of a.
Line 6 - once again, x changes just the second byte of a.
Note that the values written in lines 5 and 6 override the corresponding bytes of the value assigned in line 4.
The value of a is now 0...0000 0010 0000 0001 (513).
Now when we print a as an int, all 4 bytes are considered, as expected.
Let me try to break this down for you in addition to the previous answers:
#include <stdio.h>

int main()
{
    int a;             //declares an integer called a
    char *x;           //declares a pointer to a character called x
    x = (char *) &a;   //points x to the first byte of a
    a = 512;           //writes 512 to the int variable
    x[0] = 1;          //writes 1 to the first byte
    x[1] = 2;          //writes 2 to the second byte
    printf("%d\n",a);  //prints the integer
    return 0;
}
Note that I wrote first byte and second byte. Depending on the byte order of your platform and the size of an integer you might not get the same results.
Let's look at the memory for 32-bit (4-byte) integers:
Little endian systems
first byte | second byte | third byte | fourth byte
   0x00    |    0x02     |    0x00    |    0x00
Now assigning 1 to the first byte and 2 to the second one leaves us with this:
first byte | second byte | third byte | fourth byte
   0x01    |    0x02     |    0x00    |    0x00
Notice that the first byte gets changed to 0x01 while the second was already 0x02.
This new number in memory is equivalent to 513 on little endian systems.
Big endian systems
Let's look at what would happen if you tried this on a big endian platform:
first byte | second byte | third byte | fourth byte
   0x00    |    0x00     |    0x02    |    0x00
This time assigning 1 to the first byte and 2 to the second one leaves us with this:
first byte | second byte | third byte | fourth byte
   0x01    |    0x02     |    0x02    |    0x00
Which is equivalent to 16,908,800 as an integer.
I'm not able to grasp the fact that how the output is 513 or even Machine dependent
The output is implementation-defined. It depends on the order of bytes in the CPU's representation of integers, commonly known as endianness.
I can sense that typecasting is playing a major role
The code reinterprets the value of a, which is an int, as an array of bytes. It uses two initial bytes, which is guaranteed to work, because an int is at least two bytes in size.
Can someone help me visualise this problem?
An int consists of multiple bytes. They can be addressed as one unit that represents an integer, but they can also be addressed as a collection of bytes. The value of an int depends on the number of bytes that you set, and on the order of these bytes in the CPU's representation of integers.
It looks like your system stores the least significant byte at the lowest address, so the result of storing 1 and 2 at offsets zero and one produces this layout:
Byte 0   Byte 1   Byte 2   Byte 3
------   ------   ------   ------
   1        2        0        0
Integer value can be computed as follows:
1 + 2*256 + 0*65536 + 0*16777216
By taking x, which is a char *, and pointing it to the address of a, which is an int, you can use x to modify the individual bytes that represent a.
The output you're seeing suggests that an int is stored in little-endian format, meaning the least significant byte comes first. This can change, however, if you run this code on a different system (e.g. a Sun SPARC machine, which is big-endian).
You first set a to 512. In hex, that's 0x200. So the memory for a, assuming a 32 bit int in little endian format, is laid out as follows:
-----------------------------
| 0x00 | 0x02 | 0x00 | 0x00 |
-----------------------------
Next you set x[0] to 1, which updates the first byte in the representation of a:
-----------------------------
| 0x01 | 0x02 | 0x00 | 0x00 |
-----------------------------
Then you set x[1] to 2, which updates the second byte in the representation of a (in this case leaving it unchanged, since that byte was already 0x02):
-----------------------------
| 0x01 | 0x02 | 0x00 | 0x00 |
-----------------------------
Now a has a value of 0x201, which in decimal is 513.
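If you want to watch this happen on your own machine, here is a small sketch in the same spirit (not taken from the answers above; the comments assume a 4-byte int):

#include <stdio.h>

int main(void)
{
    int a = 512;
    unsigned char *p = (unsigned char *)&a;

    /* Dump the bytes of a in memory order; on a typical little-endian machine
       this prints 00 02 00 00, on a big-endian one 00 00 02 00. */
    for (size_t i = 0; i < sizeof a; i++)
        printf("%02X ", p[i]);
    printf("\n");

    p[0] = 1;   /* same effect as x[0] = 1 in the question */
    p[1] = 2;
    printf("%d\n", a);   /* 513 on a little-endian machine */
    return 0;
}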

Compress a struct into a binary file? [C]

This is part of my homework that I'm having difficulties solving.
I have a simple structure:
typedef struct Client {
    char* lname;
    unsigned int id;
    unsigned int car_id;
} Client;
And the exercise is:
Create a text file named with the company name followed by the branch number, with a .txt extension.
The file contains all the clients' details.
The file you created in exercise 1 will be compressed. As a result, a binary file with a .cmpr extension will be created.
I don't really have an idea how to implement 2.
I remember from the lectures that the professor said we have to use "all" of the variable, with bitwise operators (<< , >> , | , &, ~), but I don't know how to use them.
I'm using Ubuntu, under GCC and Eclipse. I'm using C.
I'd be glad to get some help. Thanks!
Let's say the file from step 1 looks like:
user1798362
2324
462345
where the three fields were simply printed on three lines. Note that the above is the text/readable (i.e. ASCII) representation of that file.
Looking at the contents of this file in hex(adecimal) representation we get (with the ASCII character printed below each byte value):
75 73 65 72 31 37 39 38 33 36 32 0a 32 33 32 34 0a 34 36 32 33 34 35 0a
 u  s  e  r  1  7  9  8  3  6  2 nl  2  3  2  4 nl  4  6  2  3  4  5 nl
here nl is of course the newline character. You can count that there are 24 bytes.
In step 2 you have to invent another format that saves as many bits as possible. The simplest way to do this is to compress each of the three fields individually.
Similar to how the text format uses a nl to mark the end of a field, you also need a way to define where a binary field begins and ends. A common way is to put a length in front of the binary field data. As a first step we could replace the nl's with a length and get:
58 75 73 65 72 31 37 39 38 33 36 32 20 32 33 32 34 30 34 36 32 33 34 35
--  u  s  e  r  1  7  9  8  3  6  2 --  2  3  2  4 --  4  6  2  3  4  5
For now we simply take a whole byte for the length in bits. Note that 58 is the hex representation of 88 (i.e. 11 characters * 8 bits), the bit length of lname; 20 hex equals 4 * 8 = 32, and 30 is 6 * 8 = 48. This does not compress anything, as it's still 24 bytes in total. But we already have a binary format, because 58, 20 and 30 have a special meaning.
The next step is to compress each field. This is where it gets tricky. The lname field consists of ASCII characters. In ASCII only 7 of the 8 bits are needed/used; here's a nice table. For example, the letter u in binary is 01110101. We can safely chop off the leftmost bit, which is always 0. This yields 1110101. The same can be done for all the characters, so you'll end up with 11 7-bit values -> 77 bits.
These 77 bits now must be fit in 8-bit bytes. Here are the first 4 bytes user in binary representation, before chopping the leftmost bit off:
01110101 01110011 01100101 01110010
Chopping off a bit in C is done by shifting the byte (i.e. unsigned char) to the left with:
unsigned char byte = lname[0];
byte = byte << 1;
When you do this for all characters you get:
1110101- 1110011- 1100101- 1110010-
Here I use - to indicate the bits in these bytes that are now available to be filled; they became available by shifting all bits one place to the left. You now use one or more bits from the right side of the next byte to fill up these - gaps. When doing this for these four bytes you'll get:
11101011 11001111 00101111 0010----
So now there's a gap of 4 bits that should be filled with the bits of the next character, 1, etc.
Filling up these gaps is done by using the binary operators in C which you mention. We already use the shift left <<. To combine 1110101- and 1110011- for example we do:
unsigned char *name;  // name MUST be unsigned to avoid problems with binary operators.
<allocate memory for name and read it from the text file>
unsigned char bytes[10];  // 10 is just a random size that gives us enough space.

name[0] = name[0] << 1;  // We shift to the left in-place here, so `name` is overwritten.
name[1] = name[1] << 1;  // idem.

bytes[0] = name[0] | (name[1] >> 7);
bytes[1] = name[1] << 1;
With name[1] >> 7 we have 1110011- >> 7, which gives 00000001: the rightmost bit. With the bitwise OR operator | we then 'add' this bit to 1110101-, resulting in 11101011.
You have to do things like this in a loop to get all the bits in the correct bytes.
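As a hedged illustration only (the BitWriter type and put_bits helper are names I made up, not part of the exercise), such a loop could look like this; it packs the bit length and then the 7-bit characters, as described above:

#include <stdio.h>

typedef struct {
    unsigned char buf[256];   /* output buffer, zero-initialised */
    int bitpos;               /* number of bits written so far   */
} BitWriter;

/* Append the 'nbits' low bits of 'value' to the buffer, most significant bit first. */
static void put_bits(BitWriter *w, unsigned int value, int nbits)
{
    for (int b = nbits - 1; b >= 0; b--) {
        if ((value >> b) & 1)
            w->buf[w->bitpos / 8] |= (unsigned char)(0x80 >> (w->bitpos % 8));
        w->bitpos++;
    }
}

int main(void)
{
    BitWriter w = {{0}, 0};
    const char *lname = "user1798362";

    put_bits(&w, 77, 8);                     /* field length in bits: 11 chars * 7 */
    for (const char *p = lname; *p; p++)
        put_bits(&w, (unsigned char)*p, 7);  /* only the 7 significant ASCII bits  */

    for (int i = 0; i < (w.bitpos + 7) / 8; i++)
        printf("%02X ", w.buf[i]);
    printf("(%d bits)\n", w.bitpos);
    return 0;
}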
The new length of this name field is 11 * 7 = 77, so we've lost a massive 11 bits :-) Note that with a byte length, we assume that the lname field will never be more than 255 / 7 = 36 characters long.
As with the bytes above, you can then coalesce the second length against the final bits of the lname field.
To compress the numbers you first read them in with fscanf(file, "%u", ...) into an unsigned int. There will be many 0s at the left side of this 4-byte unsigned int. The first number (2324), for example, is (shown in chunks of 4 bits only for readability):
0000 0000 0000 0000 0000 1001 0001 0100
which has 20 unused bits at the left.
You need to get rid of these. Take 32 minus the number of zeros on the left, and you get the bit length of this number. Add this length to the bytes array by coalescing its bits against those of the previous field. Then add only the significant bits of the number to the bytes. Those would be:
1001 0001 0100
In C, when working with the bits of an int (but also a short, a long, ... any variable/number larger than 1 byte), you must take byte order, or endianness, into account.
When you have done the above step twice, for both numbers, you're done. You then have a bytes array you can write to a file. Of course you must have kept track of where you were writing in bytes during the steps above, so you know the number of bytes. Note that in most cases there will be a few bits in the last byte that are not filled with data, but that doesn't hurt; it is simply unavoidable waste due to the fact that files are stored in chunks of 8 bits = 1 byte minimally.
When reading the binary file back, you go through the reverse process. You'll read it into an unsigned char bytes array. You then know that the first byte (i.e. bytes[0]) contains the bit length of the name field. You then fill in the bytes of lname byte by byte, by shifting and masking, etc.
Good luck!

converting little endian hex to big endian decimal in C

I am trying to understand and implement a simple file system based on FAT12. I am currently looking at the following snippet of code and it's driving me crazy:
int getTotalSize(char * mmap)
{
    int *tmp1 = malloc(sizeof(int));
    int *tmp2 = malloc(sizeof(int));
    int retVal;

    *tmp1 = mmap[19];
    *tmp2 = mmap[20];
    printf("%d and %d read\n", *tmp1, *tmp2);
    retVal = *tmp1 + ((*tmp2) << 8);
    free(tmp1);
    free(tmp2);
    return retVal;
};
From what I've read so far, the FAT12 format stores integers in little endian format,
and the code above is getting the size of the file system, which is stored in the 19th and 20th bytes of the boot sector.
However, I don't understand why retVal = *tmp1+((*tmp2)<<8); works. Is the bitwise << 8 converting the second byte to decimal? Or to big endian format?
Why is it only doing it to the second byte and not the first one?
the bytes in question are [in little endian format] :
40 0B
and I tried converting them manually by switching the order first to
0B 40
and then converting from hex to decimal, and I get the right output. I just don't understand how adding the first byte to the bit-shifted second byte does the same thing?
Thanks
The use of malloc() here is seriously facepalm-inducing. Utterly unnecessary, and a serious "code smell" (makes me doubt the overall quality of the code). Also, mmap clearly should be unsigned char (or, even better, uint8_t).
That said, the code you're asking about is pretty straight-forward.
Given two byte-sized values a and b, there are two ways of combining them into a 16-bit value (which is what the code is doing): you can either consider a to be the least-significant byte, or b.
Using boxes, the 16-bit value can look either like this:
+---+---+
| a | b |
+---+---+
or like this, if you instead consider b to be the most significant byte:
+---+---+
| b | a |
+---+---+
The way to combine the lsb and the msb into 16-bit value is simply:
result = (msb * 256) + lsb;
UPDATE: The 256 comes from the fact that that's the "worth" of each successively more significant byte in a multibyte number. Compare it to the role of 10 in a decimal number (to combine two single-digit decimal numbers c and d you would use result = 10 * c + d).
Consider msb = 0x01 and lsb = 0x00, then the above would be:
result = 0x1 * 256 + 0 = 256 = 0x0100
You can see that the msb byte ended up in the upper part of the 16-bit value, just as expected.
Your code is using << 8 to do bitwise shifting to the left, which is the same as multiplying by 2^8, i.e. 256.
Note that result above is a value, i.e. not a byte buffer in memory, so its endianness doesn't matter.
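For what it's worth, a hedged sketch of the same read without the malloc noise (the function name and buffer are my own):

#include <stdint.h>
#include <stdio.h>

/* Read a 16-bit little-endian value starting at buf[offset]. */
static uint16_t read_le16(const uint8_t *buf, unsigned offset)
{
    /* low byte as-is, high byte shifted into the upper 8 bits */
    return (uint16_t)(buf[offset] | (buf[offset + 1] << 8));
}

int main(void)
{
    uint8_t boot_sector[32] = {0};
    boot_sector[19] = 0x40;   /* the two bytes from the question */
    boot_sector[20] = 0x0B;

    printf("%u\n", (unsigned)read_le16(boot_sector, 19));   /* prints 2880 */
    return 0;
}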
I see no problem combining individual digits or bytes into larger integers.
Let's do decimal with 2 digits: 1 (least significant) and 2 (most significant):
1 + 2 * 10 = 21 (10 is the system base)
Let's now do base-256 with 2 digits: 0x40 (least significant) and 0x0B (most significant):
0x40 + 0x0B * 0x100 = 0x0B40 (0x100=256 is the system base)
The problem, however, is likely lying somewhere else, in how 12-bit integers are stored in FAT12.
A 12-bit integer occupies 1.5 8-bit bytes. And in 3 bytes you have 2 12-bit integers.
Suppose, you have 0x12, 0x34, 0x56 as those 3 bytes.
In order to extract the first integer you only need to take the first byte (0x12) and the 4 least significant bits of the second (0x04) and combine them like this:
0x12 + ((0x34 & 0x0F) << 8) == 0x412
In order to extract the second integer you need to take the 4 most significant bits of the second byte (0x03) and the third byte (0x56) and combine them like this:
(0x56 << 4) + (0x34 >> 4) == 0x563
If you read the official Microsoft document on FAT (look up fatgen103 online), you'll find all the FAT-relevant formulas/pseudo code.
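A tiny runnable sketch of those two formulas, using the example bytes above:

#include <stdio.h>

int main(void)
{
    unsigned char fat[3] = {0x12, 0x34, 0x56};   /* three bytes = two 12-bit entries */

    unsigned first  = fat[0] | ((fat[1] & 0x0F) << 8);   /* 0x412 */
    unsigned second = (fat[1] >> 4) | (fat[2] << 4);     /* 0x563 */

    printf("0x%03X 0x%03X\n", first, second);
    return 0;
}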
The << operator is the left shift operator. It takes the value to the left of the operator and shifts it left by the number of bits given on the right side of the operator.
So in your case, it shifts the value of *tmp2 eight bits to the left, and combines it with the value of *tmp1 to generate a 16 bit value from two eight bit values.
For example, let's say you have the integer 1. This is, in 16-bit binary, 0000000000000001. If you shift it left by eight bits, you end up with the binary value 0000000100000000, i.e. 256 in decimal.
The presentation (i.e. binary, decimal or hexadecimal) has nothing to do with it. All integers are stored the same way on the computer.

Printing the hex representation of an array of values

I was reading a program in C (an implementation of a server/client communication) and I saw this:
for (i = 0; i < len; i++)
sprintf(nickmsg+i*2, "%02X", buf[i] & 0xFF);
What does this line do? I don't understand this especially: nickmsg+i*2.
nickmsg is a char array and i is an integer. If it were just nickmsg I would understand it, but what is the aim of the nickmsg+i*2 part?
Thanks.
Start at the address pointed to by nickmsg and then go an additional i * 2 characters forward in memory. From there, write the two-digit hex representation of buf[i] & 0xFF, which occupies 2 characters. Repeat for each i.
Assuming buf looks like
buf[0] = 20
buf[1] = 12
Then the memory pointed to by nickmsg will look like:
nickmsg
|
|
|
+ + + + +
0 2 4 6 8
140C\
Where the \ is my nomenclature for the null-terminator that sprintf writes at the end.
It's converting the values in the buf array to their hexadecimal representation and storing them in the nickmsg array.
As it steps through each value in buf, it extracts the rightmost 8 bits by performing a bitwise AND with 0xFF, which is binary 1111 1111.
Then it uses the format string "%02X" to print each value as 2 hex digits.
It stores each pair of hex digits in the nickmsg array, then advances past them by using the index i*2.
nickmsg+i*2 is plain pointer arithmetic: it treats nickmsg as a pointer into the output buffer and advances the write position by 2 characters on every loop iteration.
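For completeness, a small runnable version of that loop (the buffer contents and sizes are my own choices, not from the original program):

#include <stdio.h>

int main(void)
{
    unsigned char buf[] = {20, 12, 255};
    size_t len = sizeof buf;
    char nickmsg[2 * sizeof buf + 1];   /* two hex digits per byte + '\0' */

    size_t i;
    for (i = 0; i < len; i++)
        sprintf(nickmsg + i * 2, "%02X", (unsigned)(buf[i] & 0xFF));

    printf("%s\n", nickmsg);   /* prints 140CFF */
    return 0;
}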
