I'm having a little trouble understanding Perl's unpack in some code I'm reading, specifically with the S* template.
$data = "FF";
print "$data - ", unpack("S*", $data), "\n";
# > FF - 17990
What is the equivalent of this in C?
Why?
Thanks very much for your help
Your code in C would look (roughly) like this:
const char *data = "FA";
unsigned short s;
memcpy( &s, data, strlen(data) );
printf("%s = %d\n", data, s);
This only handles your case with two characters, while unpack('S*',...) will return a list of shorts corresponding to its input.
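For the general S* case, a rough sketch in C (the helper name and the even-length assumption are mine, not from the original code) could loop over the buffer two bytes at a time:

#include <stdio.h>
#include <string.h>

/* Hypothetical helper: prints one unsigned short per 2-byte chunk,
   read in native byte order, roughly what unpack("S*", $data) returns. */
void unpack_shorts(const char *data, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        unsigned short s;
        memcpy(&s, data + i, sizeof s);
        printf("%hu\n", s);
    }
}

A trailing odd byte is simply ignored in this sketch.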
Why? One of the primary motivations for pack and unpack was to make it easier to interchange binary data with C structures.
perlpacktut is a good place to start.
unpack 'S' casts two bytes into a uint16_t.
#include <stdint.h>
const char *data = "\x46\x41";
uint16_t n;
memcpy(&n, data, sizeof(n)); // n = 0x4146 or 0x4641
Don't forget to check the number of bytes in data before doing this!
Notice that it can give two different results depending on the system's byte order.
On a little-endian system (e.g. x86, x64), unpack 'S' is also equivalent to
uint16_t n = (data[1] << 8) | data[0]; // 0x4146
On a big-endian system, unpack 'S' is also equivalent to
uint16_t n = (data[0] << 8) | data[1]; // 0x4641
By the way, you might be tempted to do the following, but it's not portable due to memory alignment (and strict aliasing) issues:
uint16_t n = *((const uint16_t *)data);
I’m answering my own question, so I might have some things wrong, but I'll leave this here for anyone coming in the future.
First, let's change my example to
$data = "FA";
print "$data - ", unpack("S*", $data), "\n";
# > FA - 16710
since having “FF” wasn’t particularly helpful.
The question is: how did we get from “FA” to 16710?
First, the character ‘F’ is converted to its ASCII value—70. In binary, this is 0100 0110 (note that I padded a leading zero so it’s clear that it’s a whole byte).
Then, we need the ASCII value of ‘A’—65. In binary, 0100 0001.
So we have F corresponding to 0100 0110 and A corresponding to 0100 0001.
Then we just glue these two binary values together, except we put the A first:
0100 0001 0100 0110
And converting 0100 0001 0100 0110 to decimal gives 16,710.
Note: I think the order in which the bytes are glued together might be different on different computers, so while the principle here should apply everywhere, the numbers might be different.
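To make the arithmetic concrete, here is a small C sketch of the same computation, assuming the little-endian case where 'A' becomes the high byte:

#include <stdio.h>

int main(void)
{
    unsigned char f = 'F';                 /* ASCII 70, 0100 0110 */
    unsigned char a = 'A';                 /* ASCII 65, 0100 0001 */
    unsigned int glued = (a << 8) | f;     /* 0100 0001 0100 0110 */
    printf("%u\n", glued);                 /* prints 16710 */
    return 0;
}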
Here's an example:
I have a list of uint8_t elements (it's not real code ^^):
List[0] = 0010 1001
List[1] = 0100 0111
And I have one unsigned short element (twice the size of uint8_t).
I want my short element to be this afterwards: 0010 1001 0100 0111
Can I move it like this:
ShortElement = (unsigned short) UnsignedInt8List[0];
Or do I have to use binary tweaks to move it?
Thank you :)
No, you can't simply cast the value. Since unsigned short can always represent the value in uint8_t, the value will remain unchanged. If you do ShortElement = (unsigned short) UnsignedInt8List[0];, ShortElement will be assigned the value 0000 0000 0010 1001.
You should use bitwise operations to combine the values:
ShortElement = ((unsigned short) UnsignedInt8List[0] << 8) |
((unsigned short) UnsignedInt8List[1] << 0);
This is a little more verbose than it needs to be, but I've included everything for clarity. It could be equivalently written as:
ShortElement = ((unsigned short) UnsignedInt8List[0] << 8) | UnsignedInt8List[1];
One note is to be careful with your types. You know the size of uint8_t, but you don't portably know that unsigned short is exactly 16 bits long; you only know that it must be at least 16 bits long to be able to hold the range [0, 65535]. This isn't a problem in anything you've shown because of that 16-bit minimum, and it's even often recommended to use non-fixed-width types where safe and convenient, but it's something to keep in mind.
One thing that might be tempting, but that you should not do, is to use pointers to get the value:
// DO NOT DO THIS, IT IS THE WRONG WAY
ShortElement = *(unsigned short *) &UnsignedInt8List[0];
Don't do that, you'll get the wrong result on many systems, and it might outright cause a crash on some. It's undefined behaviour.
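Putting it together, a minimal self-contained sketch (the byte values are the ones from the question) might look like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t UnsignedInt8List[2] = { 0x29, 0x47 };   /* 0010 1001, 0100 0111 */
    unsigned short ShortElement =
        ((unsigned short) UnsignedInt8List[0] << 8) | UnsignedInt8List[1];
    printf("0x%04X\n", (unsigned)ShortElement);     /* prints 0x2947, i.e. 0010 1001 0100 0111 */
    return 0;
}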
I am trying to convert an array of characters into an array of uint32_t in order to use it in a CRC calculation. I was curious whether this is the correct way to do this or whether it is dangerous? I have a habit of doing dangerous conversions and I am trying to learn better, less dangerous ways to convert things :). I know that each char in the array is 8 bits. Should I sum 4 of the characters up and toss the result into an index of the unsigned int array, or is it ok just to place each character into its own index? Would summing four 8-bit characters change their values once in the array? I have read something about shifting characters; however, I am not sure exactly how to shift four characters into one index of the unsigned int array.
text[i] is my array of characters.
uint32_t inputText[512];
for( i = 0; i < 504; i++)
{
inputText[i] = (uint32_t)text[i];
}
The cast seems fine, although I'm not sure why you say i < 504 when your array of uint32_ts has 512 elements. (If you do want to convert only 504 values into a 512-length array, you might want to use array[512] = {0} so the memory is zeroed out instead of the last 8 values being left as whatever was previously in memory.) Nonetheless, it is perfectly safe to say SomeArrayOfLargerType[i] = (largerType_t)SomeArrayOfSmallerType[i], but bear in mind that, as it is now, your binary will end up looking something like:
0100 0001 -> 0000 0000 0000 0000 0000 0000 0100 0001
So, those 24 leading 0s might be an undesired result.
As for summing up four characters, that will almost definitely not work out how you want, unless you literally want the arithmetic sum, like 0000 0001 (one) + 0000 0010 (two) = 0000 0011 (three). If you would instead want the previous example to produce 0000 0010 0000 0001, then yes, you would need to apply shifts.
UPDATE - Some information about shifting via example:
The following would be an example of shifting:
uint32_t valueArray[FINAL_LENGTH] = {0};
int i;
for (i = 0; i < TEXT_LENGTH; i++) {        // TEXT_LENGTH is the initial message/text length (512 bytes or so)
    uint32_t byte = (uint8_t)text[i];      // treat the char as an unsigned byte before shifting
    int mode = i % 4;                      // 4-to-1 storage ratio (4 uint8s stored in 1 uint32)
    int writeLocation = i / 4;             // integer division truncates, so 3/4 = 0 (which is desired)
    switch (mode) {
    case 0:
        valueArray[writeLocation] = byte;          // fill the bottom 8 bits of this index
        break;
    case 1:
        valueArray[writeLocation] |= byte << 8;    // shift left by 8 bits to fill the second byte
        break;
    case 2:
        valueArray[writeLocation] |= byte << 16;   // shift left by 16 bits to fill the third byte
        break;
    case 3:
        valueArray[writeLocation] |= byte << 24;   // shift left by 24 bits to fill the fourth byte
        break;
    default:
        printf("Some error occurred here... If the source has been modified, please check that the number of case handlers matches the possible values of mode.\n");
    }
}
You can see an example of that running here: https://ideone.com/OcDMoM (Note: there is some runtime error when executing it on IDEOne. I haven't looked into that closely, though, as the output still seems to be accurate and the code is just meant to serve as an example.)
Essentially, because each byte is 8-bits, and you want to store the bytes in 4-byte chunks (32-bits each), you need four different cases for how far you shift. In the first case, the first 8-bits are filled in by a byte from the message. In the second case, the second 8-bits are filled in by the following byte in the message (which is left shifted by 8-bits because that is the offset for the binary position). And that continues for the remaining 2 bytes, and then it repeats starting at the next index of the initial message array.
When combining the bytes, |= is used because that will take what is already in uint32 and it will perform a bitwise OR on it, so the final values will combine into one single value.
So, to break down a simple example like the one in my initial post, let's say I have 0000 0001 (one) and 0000 0010 (two), with an initial 16-bit integer 0000 0000 0000 0000 to hold them. The first byte is assigned to the 16-bit integer, making it 0000 0000 0000 0001. Then the second byte is left shifted by 8, making it 0000 0010 0000 0000. Finally, the two are combined via a bitwise OR, so the 16-bit integer becomes 0000 0010 0000 0001.
In the case of a 32-bit integer to hold 4 bytes, that process will repeat 2 more times with 8 additional shifts, and then it will proceed to the next uint32 to repeat the process.
Hopefully that all makes sense. If not, I can try to clarify further.
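The 16-bit walkthrough above, written out as a small C snippet (values chosen to match the example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t one = 0x01;                    /* 0000 0001 */
    uint8_t two = 0x02;                    /* 0000 0010 */
    uint16_t combined = 0;                 /* 0000 0000 0000 0000 */
    combined = one;                        /* 0000 0000 0000 0001 */
    combined |= (uint16_t)(two << 8);      /* OR in 0000 0010 0000 0000 */
    printf("0x%04X\n", (unsigned)combined);  /* prints 0x0201 */
    return 0;
}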
I have an unsigned char array whose size is 6. The content of the byte array is an integer (4096*number of seconds since Unix Time). I know that the byte array is big-endian.
Is there a library function in C that I can use to convert this byte array into int_64 or do I have to do it manually?
Thanks!
PS: just in case you need more information, yes, I am trying to parse a Unix timestamp. Here is the format specification of the timestamp that I am dealing with.
A C99 implementation may offer uint64_t (it doesn't have to provide it if there is no native fixed-width integer that is exactly 64 bits), in which case, you could use:
#include <stdint.h>
unsigned char data[6] = { /* bytes from somewhere */ };
uint64_t result = ((uint64_t)data[0] << 40) |
((uint64_t)data[1] << 32) |
((uint64_t)data[2] << 24) |
((uint64_t)data[3] << 16) |
((uint64_t)data[4] << 8) |
((uint64_t)data[5] << 0);
If your C99 implementation doesn't provide uint64_t you can still use unsigned long long or (I think) uint_least64_t. This will work regardless of the native endianness of the host.
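A quick usage sketch with made-up bytes, just to show that the shift version is independent of the host's byte order:

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    unsigned char data[6] = { 0x00, 0x00, 0x01, 0x02, 0x03, 0x04 };   /* made-up big-endian bytes */
    uint64_t result = ((uint64_t)data[0] << 40) |
                      ((uint64_t)data[1] << 32) |
                      ((uint64_t)data[2] << 24) |
                      ((uint64_t)data[3] << 16) |
                      ((uint64_t)data[4] <<  8) |
                      ((uint64_t)data[5] <<  0);
    printf("%" PRIu64 "\n", result);   /* prints 16909060 (0x01020304) on any host */
    return 0;
}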
Have you tried this:
unsigned char a [] = {0xaa,0xbb,0xcc,0xdd,0xee,0xff};
unsigned long long b = 0;
memcpy(&b,a,sizeof(a)*sizeof(char));
cout << hex << b << endl;
Or you can do it by hand, which will avoid some architecture-specific issues.
I would recommend using normal integer operations (sums and shifts) rather than trying to emulate the memory block ordering, which is no better than the solution above in terms of compatibility.
I think the best way to do it is using a union.
union time_u {
    uint8_t data[6];
    uint64_t timestamp;
};
Then you can use that memory space as a byte array or uint64_t, by referencing
union time_u var_name;
var_name.data[i]
var_name.timestamp
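As a rough, self-contained version of that idea (the bytes are made up, and note that the result still depends on the host's byte order, so a big-endian timestamp may additionally need a byte swap):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

union time_u {
    uint8_t data[6];
    uint64_t timestamp;
};

int main(void)
{
    unsigned char raw[6] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 };   /* made-up input bytes */
    union time_u var_name;
    var_name.timestamp = 0;                 /* zero the two bytes data[] doesn't cover */
    memcpy(var_name.data, raw, sizeof raw);
    printf("0x%016" PRIx64 "\n", var_name.timestamp);   /* value depends on endianness */
    return 0;
}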
Here is a method to convert it to 64 bits:
uint64_t
convert_48_to_64(uint8_t *val_ptr){
uint64_t ret = 0;
uint8_t *ret_ptr = (uint8_t *)&ret;
for (int i = 0; i < 6; i++) {
ret_ptr[5-i] = val_ptr[i];
}
return ret;
}
convert_48_to_64((uint8_t *)&temp); // temp holds the 48-bit value
E.g. num_in_48_bit = 77340723707904. This number in 48-bit binary is:
0100 0110 0101 0111 0100 1010 0101 1101 0000 0000 0000 0000
After conversion, the 64-bit binary is:
0000 0000 0000 0000 0000 0000 0000 0000 0101 1101 0100 1010 0101 0111 0100 0110
Let's say val_ptr stores the base address of num_in_48_bit. Since the pointer is typecast to uint8_t *, incrementing val_ptr gives you the next byte. Loop over it and copy the value byte by byte. Note that I am taking care of the network-to-host byte order as well.
You can use the pack option
#pragma pack(1)
or
__attribute__((packed))
depending on the compiler
typedef struct __attribute__((packed))
{
uint64_t u48: 48;
} uint48_t;
uint48_t data;
memcpy(&data, six_byte_array, 6);
uint64_t result = data.u48;
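A hedged, self-contained sketch of that approach (GCC/Clang syntax; six_byte_array and its contents are made up here, and like the union approach this copies in native byte order, ignoring the big-endian layout of the timestamp):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>

typedef struct __attribute__((packed))
{
    uint64_t u48: 48;
} uint48_t;

int main(void)
{
    unsigned char six_byte_array[6] = { 0x01, 0x02, 0x03, 0x04, 0x05, 0x06 };  /* made-up bytes */
    uint48_t data;
    memcpy(&data, six_byte_array, 6);       /* raw copy of the 6 bytes into the 48-bit field */
    uint64_t result = data.u48;
    printf("0x%012" PRIx64 "\n", result);   /* value depends on host byte order */
    return 0;
}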
See
_int64 bit field
How can I create a 48-bit uint for bit mask
If a 32-bit integer overflows, can we use a 40-bit structure instead of a 64-bit long one?
Which C datatype can represent a 40-bit binary number?
int i = 259; /* 03010000 in Little Endian ; 00000103 in Big Endian */
char c = (char)i; /* returns 03 in both Little and Big Endian?? */
On my computer it assigns 03 to char c, and I have Little Endian, but I don't know if the char cast reads the least significant byte or the byte pointed to by the i variable.
Endianness doesn't actually change anything here. The cast operates on the value, not on any particular stored byte (MSB, LSB, etc.).
If char is unsigned, the value wraps around. Assuming an 8-bit char, 259 % 256 = 3.
If char is signed, the result is implementation-defined. Thank you pmg: 6.3.1.3/3 in the C99 Standard.
Since you're casting from a larger integer type to a smaller one, it takes the least significant part regardless of endianness. If you were casting pointers instead, though, it would take the byte at the address, which would depend on endianness.
So c = (char)i assigns the least-significant byte to c, but c = *((char *)(&i)) would assign the first byte at the address of i to c, which would be the same thing on little-endian systems only.
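A small sketch showing the difference (the pointer version is the endian-dependent one):

#include <stdio.h>

int main(void)
{
    int i = 259;               /* 0x00000103 */
    char c1 = (char)i;         /* always the least significant byte: 0x03 */
    char c2 = *((char *)&i);   /* first byte in memory: 0x03 little-endian, 0x00 big-endian */
    printf("%d %d\n", (int)c1, (int)c2);
    return 0;
}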
If you want to test for little/big endian, you can use a union:
int isBigEndian (void)
{
union foo {
size_t i;
char cp[sizeof(size_t)];
} u;
u.i = 1;
return *u.cp != 1;
}
It works because in little endian it would look like 01 00 ... 00, but in big endian it would be 00 ... 00 01 (the ... is made up of zeros). So if the first byte is 0, the test returns true; otherwise it returns false. Beware, however, that there also exist mixed-endian machines that store data differently (some can switch endianness; others just store the data differently). The PDP-11, for example, stored a 32-bit int as two little-endian 16-bit words with the high-order word first, so the bytes of 0x01234567 appeared in memory as 23 01 67 45.
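The same test can also be folded into a tiny self-contained program, if that's easier to experiment with:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    union { size_t i; char cp[sizeof(size_t)]; } u;
    u.i = 1;
    printf("%s-endian\n", (*u.cp != 1) ? "big" : "little");
    return 0;
}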
When casting from int (4 bytes) to char (1 byte), it preserves the least significant byte.
Eg:
int x = 0x3F1; // 0x3F1 = 0000 0011 1111 0001
char y = (char)x; // 1111 0001 --> -15 in decimal (with Two's complement)
unsigned char z = (unsigned char)x; // 1111 0001 --> 241 in decimal
I discovered something odd that I can't explain. If someone here can see what is happening, or why, I'd like to know. What I'm doing is taking an unsigned short containing 12 bits aligned high, like this:
1111 1111 1111 0000
I then want to shift the bits so that each byte in the short holds 7 bits, with the MSB as a pad. The result for what's presented above should look like this:
0111 1111 0111 1100
What I have done is this:
unsigned short buf = 0xfff;
//align high
buf <<= 4;
buf >>= 1;
*((char*)&buf) >>= 1;
This gives me something that looks like it's almost correct, but the last shift leaves the pad bit set, like this:
0111 1111 1111 1100
Very odd. If I use an unsigned char as a temporary storage and shift that then it works, like this:
unsigned short buf = 0xfff;
buf <<= 4;
buf >>= 1;
tmp = *((char*)&buf);
*((char*)&buf) = tmp >> 1;
The result of this is:
0111 1111 0111 1100
Any ideas what is going on here?
Yes, it looks like char is signed on your platform. If you did *((unsigned char*)&buf) >>= 1, it would work.
Let's break this down. I'll assume that your compiler treats short as 16 bits of memory.
unsigned short buf = 0xfff;
//align high
buf <<= 4;
is equivalent to:
unsigned short buf = 0xfff0;
... and
buf >>= 1;
should result in buf having the value 0x7ff8 (i.e. the bits shifted to the right by one). Now for your fancy line:
*((char*)&buf) >>= 1;
Lots going on here... First, the left-hand side needs to be resolved. What you're saying is: take the address of buf and treat it as a pointer to 8 bits of memory (as opposed to its natural 16 bits). Which of the two bytes of buf that refers to depends on your memory endianness (if it's big-endian it points at 0x7f; if it's little-endian it points at 0xf8). I'll assume you're on an Intel box, which means it's little-endian, so the pointer refers to 0xf8. Your statement then says: assign to that byte the value at that byte shifted to the right by one (and sign-extended, since char is signed on your platform), i.e. 0xfc. The other byte remains unchanged. If you want no sign extension, cast &buf to (unsigned char *).
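A small demo of that unsigned-char fix, showing the per-byte shift without sign extension (the printed value assumes a little-endian host):

#include <stdio.h>

int main(void)
{
    unsigned short buf = 0xfff;
    buf <<= 4;                          /* 0xfff0 */
    buf >>= 1;                          /* 0x7ff8 */
    *((unsigned char *)&buf) >>= 1;     /* low byte 0xf8 -> 0x7c on a little-endian host */
    printf("0x%04x\n", (unsigned)buf);  /* prints 0x7f7c */
    return 0;
}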