Printing the hex representation of an array of values - c

I was reading a program in C (an implementation of a server/client communication) and I saw this:
for (i = 0; i < len; i++)
    sprintf(nickmsg+i*2, "%02X", buf[i] & 0xFF);
What does this line do? The part I don't understand is nickmsg+i*2.
nickmsg is a char array and i is an integer. If it were just nickmsg I would understand, but what is the point of the +i*2 here?
Thanks.

Start at the address pointed to by nickmsg and then go an additional i * 2 chars forward in memory. From there, write the hex representation of buf[i] & 0xFF, which occupies 2 chars plus a terminating null that the next iteration overwrites. Repeat for each i.
Assuming buf looks like
buf[0] = 20
buf[1] = 12
Then the memory pointed to by nickmsg will look like:
nickmsg
   |
   v
   offset:   0   1   2   3   4
   content:  1   4   0   C   \
Where the \ is my nomenclature for the null-terminator that sprintf writes at the end.
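
To see the whole mechanism in one place, here is a minimal, self-contained sketch of the same loop (the buffer contents and length are invented for illustration; note that the destination must have room for 2*len characters plus the terminating null):

#include <stdio.h>

int main(void) {
    unsigned char buf[] = { 20, 12, 255 };          /* example input bytes */
    size_t len = sizeof buf;
    char nickmsg[2 * sizeof buf + 1];               /* 2 hex chars per byte + '\0' */

    for (size_t i = 0; i < len; i++)
        sprintf(nickmsg + i * 2, "%02X", buf[i] & 0xFF);  /* each call writes 2 chars + '\0' */

    printf("%s\n", nickmsg);                        /* prints 140CFF */
    return 0;
}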

It's converting the values in the buf array to their hexadecimal representation and storing them in the nickmsg array.
As it steps through each value in buf, it extracts the rightmost 8 bits by performing a bitwise AND with 0xFF, which is binary 1111 1111.
Then it uses the format string "%02X" to print each value as 2 hex digits.
It stores each pair of hex digits in the nickmsg array, then advances past them by using the index i*2.

nickmsg+i*2 treats the nickmsg variable as a pointer to a char array (a C string) and steps 2 characters further into it on every pass of the loop.

Related

Reading 2 bytes from a file and converting to an int gives the wrong output

Basically I have a text file that contains a number. I changed the number to 0 to start and then I read 2 bytes from the file (because an int is 2 bytes) and I converted it to an int. I then print the results, however it's printing out weird results.
So when I have 0 it prints out 2608 for some reason.
I'm going off a document that says I need to read through a file where the offset of bytes 0 to 1 represents a number. So this is why I'm reading bytes instead of characters...
I imagine the issue is due to reading bytes instead of reading by characters, so if this is the case can you please explain why it would make a difference?
Here is my code:
void readFile(FILE *file) {
    char buf[2];
    int numRecords;
    fread(buf, 1, 2, file);
    numRecords = buf[0] | buf[1] << 8;
    printf("numRecords = %d\n", numRecords);
}
I'm not really sure what the buf[0] | buf[1] << 8 does, but I got it from another question... So I suppose that could be the issue as well.
The character 0 in your text file is actually stored as the single byte 0x30, and that is what gets loaded into buf[0]. (In the ASCII table, the digit 0 is represented by 0x30.)
buf[1] holds whatever byte follows in the file; in this case the value is 0x0a. (0x0a is \n, the newline, in the ASCII table.)
Combining those two with buf[0] | buf[1] << 8 results in 0x0a30, which is 2608 in decimal. Note that << is the bit-wise left shift operator.
(Also, the int type is 4 bytes on many systems. You should check that out.)
You can read directly into the integer:
fread(&numRecords, sizeof(numRecords), 1, file);
You need to check sizeof(int) on your system; if it's four bytes you need to declare numRecords as a short int rather than an int.
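If the file really did contain a raw little-endian 16-bit value (rather than text), a fixed-width type from <stdint.h> sidesteps the sizeof(int) question entirely; a minimal sketch:

#include <stdio.h>
#include <stdint.h>

void readFile(FILE *file) {
    unsigned char buf[2];
    uint16_t numRecords;

    if (fread(buf, 1, 2, file) != 2)
        return;                                   /* short read: nothing to report */

    /* combine low byte and high byte of a little-endian 16-bit value */
    numRecords = (uint16_t)(buf[0] | (buf[1] << 8));
    printf("numRecords = %u\n", (unsigned)numRecords);
}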

BCD to Ascii and Ascii to BCD

I'm trying to convert a BCD to ascii and vice versa and saw a solution similar to this while browsing, but don't fully understand it. Could someone explain it?
void BCD_Ascii(unsigned char src, char *dest) {
    const char *outputs = "0123456789";
    *dest++ = outputs[src>>4];
    *dest++ = outputs[src&0xf];
    *dest = '\0';
}
Regarding your first question: Explaining the method:
You are right about src>>4: it shifts the value 4 bits to the right, which yields the value of the higher hexadecimal digit.
E.g. if src is 0x30 then src>>4 evaluates to 3.
src&0xf gets the lower hexadecimal digit by ANDing the src value with 0xF, which is the binary value 1111 (not 11111111). E.g. if src is 0x46 then src&0xf evaluates to 6.
There are two important notes here while trying to understand the method:
First: the method cannot handle input where either of the two digits in src is above 9. E.g. if src were equal to 0x3F, the method would read past the end of the outputs string.
Second: beware that this method adds the two digit characters at a certain location in a string and then terminates the string. The caller logic should be responsible for where the location is, incrementing the pointer, and making sure the output buffer allows three characters (at least) after the input pointer location.
Regarding your second question:
A reverse method could be as following:
unsigned char Ascii_BCD(const char* src) {
return (unsigned char)((src[0] - 0x30) * 0x10 + src[1] - 0x30);
}
[Edit: adding explanation to the reverse method]
The two ASCII digits at locations 0 and 1 each have 0x30 (i.e. '0') subtracted from them to convert from ASCII to a binary value. E.g. the digit '4' is represented by the ASCII code 0x34, so subtracting 0x30 evaluates to 4.
Then the first digit which is the higher is multiplied by 0x10 to shift the value by 4 bits to the left.
The two values are added to compose the BCD value.
The opposite function can be:
BYTE ASC_BCD( char * asc )
{
    return (BYTE)( ( asc[0] & 15 ) << 4 ) | ( asc[1] & 15 );
}
Char codes '0'..'9' can be converted to their numeric values with & 15 or & 0x0F. Then shift and OR (|) to combine them.
The function converts a character in binary-coded decimal into a string.
First the upper 4 bits of src are obtained:
src>>4
The function then assumes the values those bits represent are in the range 0-9. Then that value is used to get an index in the string literal outputs:
outputs[src>>4];
The value is written into the address pointed to by dest. This pointer is then incremented.
*dest++ = outputs[src>>4];
Then the lower 4 bits of src are used:
src&0xf
Again it is assumed that those bits represent a value in the range 0-9. The rest is the same as before:
*dest++ = outputs[src&0xf];
Finally a 0 is written into dest, to terminate it.
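Tying the two directions together, here is a short usage sketch (it repeats the two functions discussed above, with the outputs pointer declared explicitly, so it compiles on its own):

#include <stdio.h>

/* the two conversions discussed above */
void BCD_Ascii(unsigned char src, char *dest) {
    const char *outputs = "0123456789";
    *dest++ = outputs[src >> 4];
    *dest++ = outputs[src & 0xf];
    *dest = '\0';
}

unsigned char Ascii_BCD(const char *src) {
    return (unsigned char)((src[0] - 0x30) * 0x10 + src[1] - 0x30);
}

int main(void) {
    char text[3];
    unsigned char bcd = 0x47;                       /* BCD for the digits 4 and 7 */

    BCD_Ascii(bcd, text);                           /* text becomes "47" */
    printf("BCD 0x%02X -> \"%s\"\n", bcd, text);
    printf("\"%s\" -> BCD 0x%02X\n", text, (unsigned)Ascii_BCD(text));
    return 0;
}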

Unsigned Char pointing to unsigned integer

I don't understand why the following code prints out 7 2 3 0. I expected it to print out 1 9 7 1. Can anyone explain why it is printing 7 2 3 0?
unsigned int e = 197127;
unsigned char *f = (char *) &e;
printf("%ld\n", sizeof(e));
printf("%d ", *f);
f++;
printf("%d ", *f);
f++;
printf("%d ", *f);
f++;
printf("%d\n", *f);
Computers work with binary, not decimal, so 197127 is stored as a binary number, not as a series of separate decimal digits.
197127 (decimal) = 0x00030207 (hexadecimal) = 0011 0000 0010 0000 0111 (binary)
Assuming your system uses little-endian byte order, 0x00030207 is stored in memory as 0x07 0x02 0x03 0x00, which is printed out as 7 2 3 0, as expected, when you print each byte.
Because with your method you print out the internal representation of the unsigned int and not its decimal representation.
Integers, like any other data, are represented as bytes internally. unsigned char is just another term for "byte" in this context. If you had represented your integer as decimal digits inside a string
char E[] = "197127";
and then done an analogous walk through the bytes, you would have seen the representation of the characters as numbers.
The binary representation of the integer 197127 is 0011 0000 0010 0000 0111.
Its bytes look like 00000111 (7 in decimal), 00000010 (2) and 00000011 (3); the remaining byte is 0.
Why did you expect 1 9 7 1? The hex representation of 197127 is 0x00030207, so on a little-endian architecture, the first byte will be 0x07, the second 0x02, the third 0x03, and the fourth 0x00, which is exactly what you're getting.
The value 197127 in e is not a string representation. It is stored as a 16/32-bit integer (depending on platform). So, in memory, e is allocated, say, 4 bytes on the stack, and that memory location holds the value 0x30207 (hex). In binary it looks like 110000001000000111. Note that the byte order in memory is actually backwards; see this link about endianness. So, when you point f to &e, you are referencing the first byte of the numeric value. If you want to represent a number as a string, you should have
char *e = "197127";
This has to do with the way the integer is stored, more specifically byte ordering. Your system happens to have little-endian byte ordering, i.e. the first byte of a multi byte integer is least significant, while the last byte is most significant.
You can try this:
printf("%d\n", 7 + (2 << 8) + (3 << 16) + (0 << 24));
This will print 197127.
Read more about byte order endianness here.
The byte layout for the unsigned integer 197127 is [0x07, 0x02, 0x03, 0x00], and your code prints the four bytes.
If you want the decimal digits, then you need to break the number down into digits:
int digits[100];
int c = 0;
while(e > 0) { digits[c++] = e % 10; e /= 10; }
while(c > 0) { printf("%u\n", digits[--c]); }
As you know, the int type usually occupies four bytes. That means 197127 is represented as 00000000 00000011 00000010 00000111 in memory. From your result, your machine is little-endian, which means the low byte 00000111 is stored at the lowest address, then 00000010 and 00000011, and finally 00000000. So when you first print *f you obtain 7. After f++, f points to 00000010, so the output is 2. The rest follows by analogy.
The underlying representation of the number e is binary, and if we convert the value to hex we can see that the value would be (assuming a 32-bit unsigned int):
0x00030207
So when you iterate over the contents you are reading it byte by byte through the unsigned char *. Each byte contains two 4-bit hex digits, and the byte order (endianness) of the number is little-endian, since the least significant byte (0x07) comes first, so in memory the contents look like this:
0x07 0x02 0x03 0x00
 |    |    |    |
 |    |    |    +-- fourth byte
 |    |    +------- third byte
 |    +------------ second byte
 +----------------- first byte
Note that sizeof returns size_t and the correct format specifier is %zu, otherwise you have undefined behavior.
You also need to fix this line:
unsigned char *f = (char *) &e;
to:
unsigned char *f = (unsigned char *) &e;
^^^^^^^^
Because e is an integer value (probably 4 bytes) and not a string (1 byte per character).
To get the result you expect, you should change the declaration and assignment of e to:
unsigned char *e = "197127";
unsigned char *f = e;
Or, convert the integer value to a string (using sprintf()) and have f point to that instead:
char s[1000];
sprintf(s,"%d",e);
unsigned char *f = s;
Or, use mathematical operations to get single digits from your integer and print those out.
Or, ...
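
Putting the two fixes together (the unsigned char cast and the %zu format specifier), a corrected sketch of the original program that still prints the raw bytes might look like this; on a little-endian machine with a 4-byte int it prints 7 2 3 0:

#include <stdio.h>

int main(void) {
    unsigned int e = 197127;
    unsigned char *f = (unsigned char *) &e;   /* correct cast */

    printf("%zu\n", sizeof(e));                /* size_t wants %zu */
    for (size_t i = 0; i < sizeof(e); i++)     /* one byte per iteration */
        printf("%d ", f[i]);
    printf("\n");
    return 0;
}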

converting little endian hex to big endian decimal in C

I am trying to understand and implement a simple file system based on FAT12. I am currently looking at the following snippet of code and it's driving me crazy:
int getTotalSize(char * mmap)
{
    int *tmp1 = malloc(sizeof(int));
    int *tmp2 = malloc(sizeof(int));
    int retVal;
    *tmp1 = mmap[19];
    *tmp2 = mmap[20];
    printf("%d and %d read\n", *tmp1, *tmp2);
    retVal = *tmp1 + ((*tmp2) << 8);
    free(tmp1);
    free(tmp2);
    return retVal;
};
From what I've read so far, the FAT12 format stores integers in little-endian format,
and the code above is getting the size of the file system, which is stored in the 19th and 20th bytes of the boot sector.
However, I don't understand why retVal = *tmp1+((*tmp2)<<8); works. Is the bitwise <<8 converting the second byte to decimal, or to big-endian format?
Why is it only doing it to the second byte and not the first one?
The bytes in question are [in little-endian format]:
40 0B
and I tried converting them manually by switching the order first to
0B 40
and then converting from hex to decimal, and I get the right output. I just don't understand how adding the first byte to the bit-shifted second byte does the same thing.
Thanks
The use of malloc() here is seriously facepalm-inducing. Utterly unnecessary, and a serious "code smell" (makes me doubt the overall quality of the code). Also, mmap clearly should be unsigned char (or, even better, uint8_t).
That said, the code you're asking about is pretty straight-forward.
Given two byte-sized values a and b, there are two ways of combining them into a 16-bit value (which is what the code is doing): you can either consider a to be the least-significant byte, or b.
Using boxes, the 16-bit value can look either like this:
+---+---+
| a | b |
+---+---+
or like this, if you instead consider b to be the most significant byte:
+---+---+
| b | a |
+---+---+
The way to combine the lsb and the msb into a 16-bit value is simply:
result = (msb * 256) + lsb;
UPDATE: The 256 comes from the fact that that's the "worth" of each successively more significant byte in a multibyte number. Compare it to the role of 10 in a decimal number (to combine two single-digit decimal numbers c and d you would use result = 10 * c + d).
Consider msb = 0x01 and lsb = 0x00, then the above would be:
result = 0x1 * 256 + 0 = 256 = 0x0100
You can see that the msb byte ended up in the upper part of the 16-bit value, just as expected.
Your code is using << 8 to do a bitwise shift to the left, which is the same as multiplying by 2^8, i.e. 256.
Note that result above is a value, i.e. not a byte buffer in memory, so its endianness doesn't matter.
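For comparison, here is a hedged rewrite of getTotalSize along the lines suggested above, with the heap allocations removed (the casts to unsigned char guard against sign extension of the raw bytes; the signature is kept essentially as in the question):

int getTotalSize(const char *mmap)
{
    unsigned int lsb = (unsigned char) mmap[19];   /* least significant byte */
    unsigned int msb = (unsigned char) mmap[20];   /* most significant byte  */
    return (int)(lsb + (msb << 8));                /* little-endian 16-bit value */
}

With the temporaries gone there is nothing to free, and the function no longer needs malloc() at all.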
I see no problem combining individual digits or bytes into larger integers.
Let's do decimal with 2 digits: 1 (least significant) and 2 (most significant):
1 + 2 * 10 = 21 (10 is the system base)
Let's now do base-256 with 2 digits: 0x40 (least significant) and 0x0B (most significant):
0x40 + 0x0B * 0x100 = 0x0B40 (0x100=256 is the system base)
The problem, however, is likely lying somewhere else, in how 12-bit integers are stored in FAT12.
A 12-bit integer occupies 1.5 8-bit bytes. And in 3 bytes you have 2 12-bit integers.
Suppose, you have 0x12, 0x34, 0x56 as those 3 bytes.
In order to extract the first integer you only need to take the first byte (0x12) and the 4 least significant bits of the second (0x04) and combine them like this:
0x12 + ((0x34 & 0x0F) << 8) == 0x412
In order to extract the second integer you need to take the 4 most significant bits of the second byte (0x03) and the third byte (0x56) and combine them like this:
(0x56 << 4) + (0x34 >> 4) == 0x563
If you read the official Microsoft's document on FAT (look up fatgen103 online), you'll find all the FAT relevant formulas/pseudo code.
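As a small sketch of the 12-bit extraction just described (the function name and interface are invented for illustration, not taken from any particular FAT12 code):

#include <stdint.h>

/* Extract the two 12-bit FAT12 entries packed into bytes b0, b1, b2. */
void fat12_pair(uint8_t b0, uint8_t b1, uint8_t b2,
                uint16_t *first, uint16_t *second)
{
    *first  = (uint16_t)(b0 | ((b1 & 0x0F) << 8));   /* e.g. 0x12, 0x34 -> 0x412 */
    *second = (uint16_t)((b1 >> 4) | (b2 << 4));     /* e.g. 0x34, 0x56 -> 0x563 */
}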
The << operator is the left shift operator. It takes the value to the left of the operator and shifts it by the number given on the right side of the operator.
So in your case, it shifts the value of *tmp2 eight bits to the left and combines it with the value of *tmp1, generating a 16-bit value from two eight-bit values.
For example, let's say you have the integer 1. This is, in 16-bit binary, 0000000000000001. If you shift it left by eight bits, you end up with the binary value 0000000100000000, i.e. 256 in decimal.
The presentation (i.e. binary, decimal or hexadecimal) has nothing to do with it. All integers are stored the same way on the computer.

Unicode stored in C char

I'm learning the C language on Linux now and I've come across a slightly weird situation.
As far as I know, the standard C char data type is ASCII, 1 byte (8 bits). That should mean it can hold only ASCII characters.
In my program I use char input[], which is filled by getchar function like this pseudocode:
char input[20];
int z, i;
for(i = 0; i < 20; i++)
{
    z = getchar();
    input[i] = z;
}
The weird thing is that it works not only for ASCII characters, but for any character I imagine, such as #&#{čřžŧ¶'`[łĐŧđж←^€~[←^ø{&}čž on the input.
My question is - how is it possible? It seems to be one of many beautiful exceptions in C, but I would really appreciate explanation. Is it a matter of OS, compiler, hidden language's additional super-feature?
Thanks.
There is no magic here - the C language gives you access to the raw bytes as they are stored in the computer memory.
If your terminal is using UTF-8 (which is likely), non-ASCII chars take more than one byte in memory. When you display them again, it is your terminal's code which converts these sequences back into a single displayed character.
Just change your code to print the strlen of the strings, and you will see what I mean.
To properly handle UTF-8 non-ASCII chars in C you have to use some library to handle them for you, like glib, Qt, or many others.
ASCII is a 7-bit character set, in C normally represented by an 8-bit char. If the highest bit in an 8-bit byte is set, it is not an ASCII character.
Also notice that you are not guaranteed ASCII as the base character set, though many ignore other scenarios. So if you want to check whether a "primitive" byte is an alphabetic character, you cannot, taking heed of all systems, say:
is_alpha = (c > 0x40 && c < 0x5b) || (c > 0x60 && c < 0x7b);
Instead you'll have to use ctype.h and say:
isalpha(c);
The only exception, AFAIK, is the decimal digits: on most tables at least, they have contiguous values.
Thus this works;
char ninec = '9';
char eightc = '8';
int nine = ninec - '0';
int eight = eightc - '0';
printf("%d\n", nine);
printf("%d\n", eight);
But this is not guaranteed to be 'a':
alpha_a = 0x61;
Systems not based on ASCII, e.g. those using EBCDIC: C on such a platform still runs fine, but here they (mostly) use 8 bits instead of 7, and e.g. A can be coded as decimal 193 instead of 65 as it is in ASCII.
For ASCII, however, bytes with decimal values 128 - 255 (8 bits in use) are extended and not part of the ASCII set. E.g. ISO-8859 uses this range.
What is also often done is to combine two or more bytes into one character. So if you print two bytes after each other that are defined as, say, UTF-8 0xc3 0x98 == Ø, then you'll get that character.
This again depends on which environment you are in. On many systems/environments printing ASCII values gives the same result across character sets, systems, etc. But printing bytes > 127 or double-byte characters gives a different result depending on the local configuration.
I.e.:
Mr. A running the program gets
Jasŋ€
While Mr. B gets
Jasπß
This is perhaps especially relevant to the ISO-8859 series and Windows-1252, which use single-byte representations of extended characters, etc.
ASCII_printable_characters, notice they are 7, not 8 bits.
ISO_8859-1 and ISO_8859-15, widely used sets, with ASCII as core.
Windows-1252, legacy of Windows.
UTF-8#Codepage_layout. In UTF-8 you have ASCII, and then you have special sequences of bytes.
Each sequence starts with a byte > 127 (i.e. beyond the last ASCII value),
followed by a given number of bytes which all start with the bits 10.
In other words, you will never find an ASCII byte in a multi-byte UTF-8 representation.
That is, the first byte of a UTF-8 sequence, if not ASCII, tells how many bytes the character has. You could also say that ASCII characters say no more bytes follow, because their highest bit is 0.
I.e. if the file is interpreted as UTF-8:
c = fgetc(fp);
if c < 128, 0x80, then ASCII
if c == 194, 0xC2, then one more byte follows; interpret to symbol
if c == 226, 0xE2, then two more bytes follow; interpret to symbol
...
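
That classification can be written directly in C; a small sketch that only inspects the lead byte (it trusts the input to be well-formed UTF-8):

/* Number of bytes in a UTF-8 sequence, judged from its first byte. */
int utf8_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* plain ASCII, high bit 0       */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: one more byte       */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: two more bytes      */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: three more bytes    */
    return -1;                            /* 10xxxxxx or invalid lead byte */
}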
As an example, let's look at one of the characters you mention. In a UTF-8 terminal:
$ echo -n "č" | xxd
Should yield:
0000000: c48d ..
In other words "č" is represented by the two bytes 0xc4 and 0x8d. Add -b to the xxd command and we get the binary representation of the bytes. We dissect them as follows:
byte 1: 0xc4 = 1100 0100
byte 2: 0x8d = 1000 1101

byte 1: 11.. .... -> two bits set = two-byte symbol; the "bits set" sequence
                     ends with 0 (here 3 bits are used: 110), rest: 0 0100
byte 2: 10.. .... -> all "follow" bytes start with 10, rest: 00 1101
Rest bits combined: xxx0 0100 (from the first byte) + xx00 1101 (from the last byte) => 00100001101
This gives us 00100001101 (binary) = 269 (decimal) = 0x10D => the Unicode code point U+010D == "č".
This number can also be used in HTML as &#269; == č
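
The same dissection can be done with two bit operations; a quick sketch for the two-byte case only (the masks 0x1F and 0x3F keep the payload bits of the lead byte and the continuation byte):

#include <stdio.h>

int main(void) {
    unsigned char b1 = 0xC4, b2 = 0x8D;              /* the two bytes of "č" */
    unsigned int cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    printf("U+%04X\n", cp);                          /* prints U+010D */
    return 0;
}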
Common to this and lots of other encoding systems is that an 8-bit byte is the base.
Often it is also a question of context. As an example take GSM SMS, with ETSI GSM 03.38/03.40 (3GPP TS 23.038, 3GPP 23038). There we also find a 7-bit character table, the 7-bit GSM default alphabet, but instead of storing the characters as 8 bits each they are stored as 7 bits1. This way you can pack more characters into a given number of bytes. I.e. a standard SMS of 160 characters becomes 1280 bits or 160 bytes as ASCII, and 1120 bits or 140 bytes as SMS.
1 Not without exception, (it is more to the story).
I.e. a simple example of bytes saved as septets (7bit) C8329BFD06 in SMS UDP format to ASCII:
7-bit UDP represented as 8-bit hex octets: C8 32 9B FD BE BE E5 6C 32
(for the letters, the GSM alphabet has the same bit values as ASCII)

Septet  Built from                                    Bits (GSM table)  Char
  1     last 7 bits of C8                             1001000            H
  2     last 6 bits of 32 + 1 bit left over from C8   1100101            e
  3     last 5 bits of 9B + 2 bits left over from 32  1101100            l
  4     last 4 bits of FD + 3 bits left over from 9B  1101100            l
  5     last 3 bits of BE + 4 bits left over from FD  1101111            o
  6     last 2 bits of BE + 5 bits left over from BE  1010111            W
  7     last 1 bit  of E5 + 6 bits left over from BE  1101111            o
  8     the 7 bits left over from E5                  1110010            r
  9     last 7 bits of 6C                             1101100            l
 10     last 6 bits of 32 + 1 bit left over from 6C   1100100            d
And 9 bytes "unpacked" becomes 10 characters.
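
A sketch of that unpacking in C (the function name and interface are invented; for plain letters the GSM 7-bit default alphabet happens to use the same values as ASCII, which is why the result prints directly):

#include <stdio.h>

/* Unpack GSM 7-bit packed octets into nseptets characters. */
static void gsm7_unpack(const unsigned char *octets, int nseptets, char *out)
{
    int carry = 0, carry_bits = 0, in = 0;

    for (int i = 0; i < nseptets; i++) {
        if (carry_bits == 7) {                 /* a whole septet was left over */
            out[i] = (char)carry;
            carry = 0;
            carry_bits = 0;
        } else {
            int octet = octets[in++];
            out[i] = (char)(((octet << carry_bits) | carry) & 0x7F);
            carry = octet >> (7 - carry_bits); /* bits saved for the next septet */
            carry_bits++;
        }
    }
    out[nseptets] = '\0';
}

int main(void) {
    const unsigned char packed[] = { 0xC8, 0x32, 0x9B, 0xFD, 0xBE,
                                     0xBE, 0xE5, 0x6C, 0x32 };
    char text[11];

    gsm7_unpack(packed, 10, text);
    printf("%s\n", text);                      /* prints HelloWorld */
    return 0;
}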
ASCII is 7 bits, not 8 bits. A char[] holds bytes, which can be in any encoding - ISO 8859-1, UTF-8, whatever you want. C doesn't care.
This is the magic of UTF-8: you don't even have to worry about how it works. The only problem is that the C data type is named char (for character), while what it actually means is byte. There is no 1:1 correspondence between characters and the bytes that encode them.
What happens in your code is that, from the program's point of view, you input a sequence of bytes, it stores the bytes in memory and if you print the text it prints bytes. This code doesn't care how these bytes encode the characters, it's only the terminal which needs to worry about encoding them on input and correctly interpreting them on output.
There are of course many libraries that do the job, but to quickly decode any UTF-8 to Unicode, this little function is handy:
typedef unsigned char utf8_t;
#define isunicode(c) (((c)&0xc0)==0xc0)
/* Decode one UTF-8 sequence starting at str; add its length to *i. */
int utf8_decode(const char *str, int *i) {
    const utf8_t *s = (const utf8_t *)str; /* use unsigned chars */
    int u = *s, l = 1;
    if(isunicode(u)) {
        /* count the leading 1-bits of the lead byte to get the sequence length */
        int a = (u&0x20) ? ((u&0x10) ? ((u&0x08) ? ((u&0x04) ? 6 : 5) : 4) : 3) : 2;
        if(a < 6 || !(u&0x02)) {
            int b;
            u = ((u<<(a+1))&0xff)>>(a+1);   /* keep the payload bits of the lead byte */
            for(b = 1; b < a; ++b)
                u = (u<<6)|(s[l++]&0x3f);   /* append 6 payload bits per continuation byte */
        }
    }
    if(i) *i += l;
    return u;
}
Considering your code, you can iterate over the string and read the Unicode values:
int l;
for(i = 0; i < 20 && input[i] != '\0'; ) {
    if(!isunicode(input[i])) i++;
    else {
        l = 0;
        z = utf8_decode(&input[i], &l);
        printf("Unicode value at %d is U+%04X and it's %d bytes.\n", i, z, l);
        i += l;
    }
}
There is a data type wint_t (#include <wchar.h>) for non-ASCII characters. You can use the function getwchar() to read them.
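
A minimal sketch of that approach (it assumes the environment provides a UTF-8 locale; whether wchar_t values are Unicode code points is implementation-defined, though it is the case on glibc):

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void) {
    wint_t wc;

    setlocale(LC_ALL, "");                     /* use the terminal's locale, e.g. UTF-8 */
    while ((wc = getwchar()) != WEOF)
        wprintf(L"U+%04X\n", (unsigned)wc);    /* one code point per line */
    return 0;
}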
