I read in some hex numbers and I then would like to convert them to base 2^64. Unfortunately as this number cannot be stored in an int, it seems there is no function in GMP that can help me solve this problem.
Is there another way to do this that I am missing completely?
(The program is in C)
10 in base 2^1 is 1010 which in binary is 1 0 1 0
10 in base 2^2 is 22 which in binary is 10 10
10 in base 2^3 is 12 which in binary is 001 010
10 in base 2^4 is A which in binary is 1010
The pattern I'm trying to show you (and that others have noted) is that they all have the same binary representation. In other words, if you convert your number to base 256 (chars) and write it to a file or memory, you can read it in base 2^16 (reading 2 bytes at a time), or base 2^32 (4 bytes at a time), or in fact 2^anything. It will be the same binary representation (assuming you get your endianness correct). So be careful with big vs. little endian and read as uint64_t.
To be clear, this only applies to bases which are 2^n. 10 in base 5 is 20 which in binary is 010 000; obviously different. But if you use trinary, the same principle applies to 3^n, and in pentary (?) it would apply to 5^n.
Update: How you could use this:
With some function
void convert( char *myBase16String, uint8_t *outputBase256 );
which we suppose takes a string encoded in base 16 and produces an array of unsigned chars, where each char is a unit in base 256, we do this:
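uint8_t base2_8[8];
convert( "0123456789ABCDEF", base2_8 );
// Four bytes make one base-2^32 digit; the casts keep each shift within 32 bits.
uint32_t base2_32[2];
base2_32[0] = ((uint32_t)base2_8[0] << 24) | ((uint32_t)base2_8[1] << 16) | ((uint32_t)base2_8[2] << 8) | base2_8[3];
base2_32[1] = ((uint32_t)base2_8[4] << 24) | ((uint32_t)base2_8[5] << 16) | ((uint32_t)base2_8[6] << 8) | base2_8[7];
// etc. You can do this in a loop, but make sure you know how long it is.
// For base 2^64, combine eight bytes per uint64_t the same way, with shifts from 56 down to 0.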
Suppose your input wasn't a nice multiple of 4 bytes:
uint8_t base2_8[6];
convert( "0123456789AB", base2_8 );
uint32_t base2_32[2];
base2_32[0] = ((uint32_t)base2_8[0] << 8) | base2_8[1];
base2_32[1] = ((uint32_t)base2_8[2] << 24) | ((uint32_t)base2_8[3] << 16) | ((uint32_t)base2_8[4] << 8) | base2_8[5];
Slightly more complex, but still pretty easy to automate.
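Since every 16 hex digits line up exactly with one 64-bit unit, the loop can also go straight from the hex string to base 2^64 without the base-256 step. The following is only a minimal sketch, not GMP code: hex_to_limbs and hex_digit are hypothetical helper names, the input is assumed to contain nothing but hex digits (no 0x prefix), and the least significant 64-bit digit is stored first.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: splits a hex string into base-2^64 digits,
 * least significant digit first. Returns the number of digits written,
 * or 0 on bad input or if max_limbs is too small. */
size_t hex_to_limbs(const char *hex, uint64_t *limbs, size_t max_limbs)
{
    size_t len = strlen(hex);
    size_t count = (len + 15) / 16;   /* 16 hex digits per 64-bit digit */

    if (len == 0 || count > max_limbs)
        return 0;

    for (size_t i = 0; i < count; i++)
        limbs[i] = 0;

    for (size_t i = 0; i < len; i++) {
        int d = hex_digit(hex[i]);
        if (d < 0)
            return 0;
        size_t pos = len - 1 - i;     /* digit position counted from the right */
        limbs[pos / 16] |= (uint64_t)d << (4 * (pos % 16));
    }
    return count;
}
```

An input whose length is not a multiple of 16 simply leaves the top digit partly filled, so no special case is needed.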
GMP comes with an extension of stdio.h that works on large numbers, see the manual on Formatted Input Functions.
There are the usual flavors that work on either standard input (gmp_scanf), files (gmp_fscanf) or strings you have already read into memory (gmp_sscanf).
I need to create a large integer (1431655765), which is 01010101 repeated in each of its 4 bytes. However, there is the restriction that I'm only allowed to declare a number between 0 and 255. I've thought about declaring 01010101 and then pushing it to the left using <<, and then trying to add more of the 01s along the way. However, this would only get the first byte to be the way I want it, and not the remaining 3 bytes, and I'm not sure how to change the values of the 0's in the other bytes. I also thought about using two's complement, somehow doing some kind of negative, using ~x+1 or something similar. I wasn't sure how to get there from just a one-byte integer though. I'm pretty stuck and some help would be appreciated! For context, this is for bitwise operations; I can use only ! ~ & ^ | + << >>
If you have numbers a, b, c, d of 8 bits each, and they should form 32-bit number n = abcd let's say, you can do:
n = a;
n = (n << 8) | b;
n = (n << 8) | c;
n = (n << 8) | d;
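Applied to the question above, the same shifting idea builds 1431655765 (0x55555555) from the single byte constant 0x55. This is a minimal sketch, assuming a 32-bit unsigned type is allowed for the working value; the function name is made up for illustration.

```c
#include <stdint.h>

/* Builds 0x55555555 using only a constant in 0..255, shifts and OR. */
uint32_t make_alternating_bits(void)
{
    uint32_t n = 0x55;     /* 01010101 in the lowest byte */
    n = n | (n << 8);      /* 0x00005555: two bytes filled */
    n = n | (n << 16);     /* 0x55555555: all four bytes filled */
    return n;
}
```

Doubling the filled width at each step needs two shifts instead of three.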
This question is more generic without a particular language. I am more interested in solving this generally across languages. Every answer I find references a built-in method of something like getInt32 to extract an integer from a byte array.
I have a byte array which contains the big-endian representation of a signed integer.
1 -> [0, 0, 0, 1]
-1 -> [255, 255, 255, 255]
-65535 -> [255, 255, 0, 1]
Getting the values for the positive cases is easy:
arr[3] | arr[2] << 8 | arr[1] << 16 | arr[0] << 24
What I would like to figure out is the more general case. I have been reading about 2s complement, which led me to the Python function from Wikipedia:
def twos_complement(input_value, num_bits):
    '''Calculates a two's complement integer from the given input value's bits'''
    mask = 2**(num_bits - 1) - 1
    return -(input_value & mask) + (input_value & ~mask)
which in turn led me to produce this function:
# Note that the mask from the wiki function has an additional - 1
mask = 2**(32 - 1)
def arr_to_int(arr):
    uint_val = arr[3] | arr[2] << 8 | arr[1] << 16 | arr[0] << 24
    if (determine_if_negative(uint_val)):
        return -(uint_val & mask) + (uint_val & ~mask)
    else:
        return uint_val
In order for my function to work I need to fill in determine_if_negative (I should mask the sign bit and check if it is 1). But is there a standard formula to handle this? One thing I found is that in some languages, like Go, the bit shift might overflow the int value.
This is pretty hard to search because I get a thousand results explaining the difference between big-endian and little-endian or results explaining twos complement, and many more giving examples of using the standard library but I haven't seen a complete formula for bitwise functions.
Is there a canonical example in C or similar language of converting a char array using only array access and bitwise functions (ie, no memcpy or pointer casting or tricky stuff)
The bitwise method only works properly for unsigned values so you will need to build the unsigned integer and then convert to signed. The code could be:
int32_t val( uint8_t *s )
{
    uint32_t x = ((uint32_t)s[0] << 24) + ((uint32_t)s[1] << 16) + ((uint32_t)s[2] << 8) + s[3];
    return x;
}
Note, this assumes you are on a 2's complement system which also defines unsigned->signed conversion as no change in representation. If you want to support other systems too, it would be more complicated.
The casts are necessary so that the shift is performed over the right width.
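If relying on the implementation-defined unsigned-to-signed conversion is a concern, one possible way around it is to do the sign step arithmetically. This is only a sketch, assuming the input is a 4-byte big-endian two's complement value; be32_to_int32 is a name chosen here for illustration.

```c
#include <stdint.h>

/* Decodes 4 big-endian bytes holding a two's complement number. */
int32_t be32_to_int32(const uint8_t *s)
{
    uint32_t x = ((uint32_t)s[0] << 24) | ((uint32_t)s[1] << 16)
               | ((uint32_t)s[2] << 8)  |  (uint32_t)s[3];

    if (x <= 0x7FFFFFFFu)
        return (int32_t)x;            /* sign bit clear: the value fits as-is */

    /* Sign bit set: the value is x - 2^32.
     * UINT32_MAX - x is at most 0x7FFFFFFF, so it fits in int32_t,
     * and -(UINT32_MAX - x) - 1 equals x - 2^32 without overflow. */
    return -(int32_t)(UINT32_MAX - x) - 1;
}
```

With the byte arrays from the question, [0, 0, 0, 1] gives 1, [255, 255, 255, 255] gives -1, and [255, 255, 0, 1] gives -65535.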
Even C might be too high level for this. After all, the exact representation of int is machine dependent. On top of that, not all integer types on all systems are 2s complement.
When you mention a byte array and converting it to integer you must specify what format that byte array implies.
If you assume 2s complement and little endian (like Intel/AMD), then the last byte contains the sign.
For simplicity's sake, let's start with a 3-bit 2s complement integer, then a byte, then 2-byte integers, and then 4.
BIN   SIGNED_DEC   UNSIGNED_DEC
000    0           0
001    1           1
010    2           2
011    3           3
100   -4           4
101   -3           5
110   -2           6
111   -1           7
---
123   (bit positions b1, b2, b3)
Let the bits be b1, b2, b3, where b1 is the most significant bit (and the sign bit).
Then the formula would be:
b3*2^0 + b2*2^1 - b1*2^2
For 4 bits, b1 to b4, the formula would look like this:
b4*2^0 + b3*2^1 + b2*2^2 - b1*2^3
A byte (8 bits) follows the same pattern, with the sign bit weighted -2^7. For 2 bytes it is the same, but we have to multiply the most significant byte by 256, and a set sign bit means subtracting 256^2 = 2^16 from the unsigned value.
public class TwosComplement {
    /**
     * Returns the calculated value of a 2s complement bit string.
     * Expects a string of bits, 0 or 1. If a character is not 1 it is considered 0.
     */
    public static long twosComplementFromBitArray(String input) {
        if (input.length() < 2) throw new RuntimeException("input too short");
        int sign = input.charAt(0) == '1' ? 1 : 0;
        long unsignedComplementSum = 1;   // ends up as 2^(n-1), the weight of the sign bit
        long unsignedSum = 0;             // value of the bits after the sign bit
        for (int i = 1; i < input.length(); ++i) {
            char c = input.charAt(i);
            int val = (c == '1') ? 1 : 0;
            unsignedSum = unsignedSum * 2 + val;
            unsignedComplementSum *= 2;
        }
        return unsignedSum - sign * unsignedComplementSum;
    }

    public static void main(String[] args) {
        System.out.println(twosComplementFromBitArray("000"));
        System.out.println(twosComplementFromBitArray("001"));
        System.out.println(twosComplementFromBitArray("010"));
        System.out.println(twosComplementFromBitArray("011"));
        System.out.println(twosComplementFromBitArray("100"));
        System.out.println(twosComplementFromBitArray("101"));
        System.out.println(twosComplementFromBitArray("110"));
        System.out.println(twosComplementFromBitArray("111"));
    }
}
outputs:
0
1
2
3
-4
-3
-2
-1
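Since the question asked for C, the same formula (value of the lower bits minus the weight of the sign bit) might look like this there. It is a sketch only: from_twos_complement is a hypothetical name, and the width is assumed to be between 1 and 63 bits so that every intermediate value fits in int64_t.

```c
#include <stdint.h>

/* Interprets the low `bits` bits of x as a two's complement number.
 * Assumes 1 <= bits <= 63. */
int64_t from_twos_complement(uint64_t x, unsigned bits)
{
    uint64_t sign_weight = (uint64_t)1 << (bits - 1);   /* 2^(bits-1) */
    uint64_t magnitude = x & (sign_weight - 1);         /* bits below the sign bit */

    if (x & sign_weight)
        return (int64_t)magnitude - (int64_t)sign_weight;
    return (int64_t)magnitude;
}
```

For a 3-bit input this reproduces the table above, for example 100 gives -4 and 110 gives -2.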
It has come to my attention that there is no builtin structure for a single bit in C. There is (unsigned) char and int, which are 8 bits (one byte), and long which is 64+ bits, and so on (uint64_t, bool...)
I came across this while coding up a huffman tree, and the encodings for certain characters were not necessarily exactly 8 bits long (like 00101), so there was no efficient way to store the encodings. I had to find makeshift solutions such as strings or boolean arrays, but this takes far more memory.
But anyways, my question is more general: is there a good way to store an array of bits, or some sort of user-defined struct? I scoured the web for one but the smallest structure seems to be 8 bits (one byte). I tried things such as int a : 1 but it didn't work. I read about bit fields but they do not simply achieve exactly what I want to do. I know questions have already been asked about this in C++ and if there is a struct for a single bit, but mostly I want to know specifically what would be the most memory-efficient way to store an encoding such as 00101 in C.
If you're mainly interested in accessing a single bit at a time, you can take an array of unsigned char and treat it as a bit array. For example:
unsigned char array[125];
Assuming 8 bits per byte, this can be treated as an array of 1000 bits. The first 16 logically look like this:
--------------------------------------------------------------------------------------
byte |                   0                   |                   1                   |
--------------------------------------------------------------------------------------
bit  |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 | 15 |
--------------------------------------------------------------------------------------
Let's say you want to work with bit b. You can then do the following:
Read bit b:
value = (array[b/8] & (1 << (b%8))) != 0;
Set bit b:
array[b/8] |= (1 << (b%8));
Clear bit b:
array[b/8] &= ~(1 << (b%8));
Dividing the bit number by 8 gets you the relevant byte. Similarly, mod'ing the bit number by 8 gives you the relevant bit inside of that byte. You then left shift the value 1 by that bit index to give you the necessary bit mask.
While there is integer division and modulus at work here, the divisor is a power of 2 so any decent compiler should replace them with bit shifting/masking.
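Wrapped into small functions, the three operations could look like the following. This is a sketch assuming 8-bit bytes; the names bit_get, bit_set and bit_clear are invented here, and none of them checks that b is within the array.

```c
#include <stddef.h>

/* Returns 1 if bit b is set, 0 otherwise. */
int bit_get(const unsigned char *array, size_t b)
{
    return (array[b / 8] >> (b % 8)) & 1;
}

/* Sets bit b to 1. */
void bit_set(unsigned char *array, size_t b)
{
    array[b / 8] |= (unsigned char)(1u << (b % 8));
}

/* Clears bit b to 0. */
void bit_clear(unsigned char *array, size_t b)
{
    array[b / 8] &= (unsigned char)~(1u << (b % 8));
}
```

Using functions rather than bare expressions avoids repeating the index arithmetic at every call site.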
It has come to my attention that there is no builtin structure for a single bit in C.
That is true, and it makes sense because substantially no machines have bit-addressable memory.
But anyways, my question is more general: is there a good way to store an array of bits, or some sort of user-defined struct?
One generally uses an unsigned char or another unsigned integer type, or an array of such. Along with that you need some masking and shifting to set or read the values of individual bits.
I scoured the web for one but the smallest structure seems to be 8 bits (one byte).
Technically, the smallest addressable storage unit ([[un]signed] char) could be larger than 8 bits, though you're unlikely ever to see that.
I tried things such as int a : 1 but it didn't work. I read about bit fields but they do not simply achieve exactly what I want to do.
Bit fields can appear only as structure members. A structure object containing such a bitfield will still have a size that is a multiple of the size of a char, so that doesn't map very well onto a bit array or any part of one.
I know questions have already been asked about this in C++ and if there is a struct for a single bit, but mostly I want to know specifically what would be the most memory-efficient way to store an encoding such as 00101 in C.
If you need a bit pattern and a separate bit count -- such as if some of the bits available in the bit-storage object are not actually significant -- then you need a separate datum for the significant-bit count. If you want a data structure for a small but variable number of bits, then you might go with something along these lines:
struct bit_array_small {
    unsigned char bits;
    unsigned char num_bits;
};
Of course, you can make that larger by choosing a different data type for the bits member and, maybe, the num_bits member. I'm sure you can see how you might extend the concept to handling arbitrary-length bit arrays if you should happen to need that.
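One way the concept might extend to arbitrary-length bit arrays is a byte buffer plus a running bit count, appending each Huffman code as it is produced. This is a rough sketch: the struct and function names are hypothetical, the capacity is fixed at 64 bytes only to keep it short, and storing the most significant bit first within each byte is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

#define BITBUF_CAPACITY 64            /* bytes; arbitrary limit for this sketch */

struct bit_buffer {
    uint8_t bytes[BITBUF_CAPACITY];   /* must start out zeroed */
    size_t  num_bits;                 /* number of bits written so far */
};

/* Appends the low `length` bits of `code`, most significant of those first.
 * Returns 0 on success, -1 if length is over 32 or the buffer would overflow. */
int bit_buffer_append(struct bit_buffer *buf, uint32_t code, unsigned length)
{
    if (length > 32 || buf->num_bits + length > BITBUF_CAPACITY * 8u)
        return -1;

    for (unsigned i = length; i-- > 0; ) {
        if ((code >> i) & 1u)
            buf->bytes[buf->num_bits / 8] |= (uint8_t)(0x80u >> (buf->num_bits % 8));
        buf->num_bits++;
    }
    return 0;
}
```

Appending the 5-bit code 00101 followed by a 3-bit code 111 fills exactly one byte, 00101111, so nothing is wasted between codes.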
If you really want the most memory efficiency, you can encode the Huffman tree itself as a stream of bits. See, for example:
https://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html
Then just encode those bits as an array of bytes, with a possible waste of 7 bits.
But that would be a horrible idea. For the structure in memory to be useful, it must be easy to access. You can still do that very efficiently. Let's say you want to encode up to 12-bit codes. Use a 16-bit integer and bitfields:
struct huffcode {
    uint16_t length: 4,
             value: 12;
};
A typical compiler will pack this into a single 16-bit value (the exact layout of bit fields is implementation-defined) and allow you to access the length and value fields separately. The complete Huffman node would also contain the input code value, and tree pointers (which, if you want further compactness, can be integer indices into an array).
You can make your own bit array in no time.
#include <string.h>
#define ba_set(ptr, bit) { (ptr)[(bit) >> 3] |= (char)(1 << ((bit) & 7)); }
#define ba_clear(ptr, bit) { (ptr)[(bit) >> 3] &= (char)(~(1 << ((bit) & 7))); }
#define ba_get(ptr, bit) ( ((ptr)[(bit) >> 3] & (char)(1 << ((bit) & 7))) ? 1 : 0 )
#define ba_setbit(ptr, bit, value) { if (value) { ba_set((ptr), (bit)) } else { ba_clear((ptr), (bit)); } }
#define BITARRAY_BITS (120)
int main()
{
    char mybits[(BITARRAY_BITS + 7) / 8];
    memset(mybits, 0, sizeof(mybits));
    ba_setbit(mybits, 33, 1);
    if (!ba_get(mybits, 33))
        return 1;
    return 0;
}
I'm trying to read binary data from a file. At bytes 10-13 is a little-endian binary-encoded number and I'm trying to parse it using only the information that the offset is 10 and the "size" is 4.
I've figured out I will have to do some binary shifting operations, but I'm not sure which byte goes where and how "far" and where it should be shifted.
If you know for certain the data is little endian, you can do something like this (assuming data is an array of unsigned char):
uint32_t value = (uint32_t)data[10] | ((uint32_t)data[11] << 8) | ((uint32_t)data[12] << 16) | ((uint32_t)data[13] << 24);
This gives you a portable solution that works regardless of the endianness of the machine your code runs on.
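Because the question only knows an offset and a size, the same idea can be written as a loop. A minimal sketch, assuming the data is an array of unsigned char and the size is between 1 and 8 bytes; read_le is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Reads `size` bytes (1..8) starting at `offset` as a little-endian
 * unsigned integer: byte i contributes at bit position 8 * i. */
uint64_t read_le(const unsigned char *data, size_t offset, size_t size)
{
    uint64_t value = 0;

    for (size_t i = 0; i < size; i++)
        value |= (uint64_t)data[offset + i] << (8 * i);
    return value;
}
```

For the case in the question the call would be read_le(data, 10, 4).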
I am trying to understand and implement a simple file system based on FAT12. I am currently looking at the following snippet of code and it's driving me crazy:
int getTotalSize(char * mmap)
{
    int *tmp1 = malloc(sizeof(int));
    int *tmp2 = malloc(sizeof(int));
    int retVal;
    * tmp1 = mmap[19];
    * tmp2 = mmap[20];
    printf("%d and %d read\n",*tmp1,*tmp2);
    retVal = *tmp1+((*tmp2)<<8);
    free(tmp1);
    free(tmp2);
    return retVal;
}
From what I've read so far, the FAT12 format stores integers in little-endian format,
and the code above is getting the size of the file system, which is stored in the 19th and 20th bytes of the boot sector.
However, I don't understand why retVal = *tmp1+((*tmp2)<<8); works. Is the bitwise <<8 converting the second byte to decimal? Or to big-endian format?
Why is it only doing it to the second byte and not the first one?
the bytes in question are [in little endian format] :
40 0B
I tried converting them manually by first switching the order to
0B 40
and then converting from hex to decimal, and I get the right output. I just don't understand how adding the first byte to the bitwise shift of the second byte does the same thing.
Thanks
The use of malloc() here is seriously facepalm-inducing. Utterly unnecessary, and a serious "code smell" (makes me doubt the overall quality of the code). Also, mmap clearly should be unsigned char (or, even better, uint8_t).
That said, the code you're asking about is pretty straight-forward.
Given two byte-sized values a and b, there are two ways of combining them into a 16-bit value (which is what the code is doing): you can either consider a to be the least-significant byte, or b.
Using boxes, the 16-bit value can look either like this:
+---+---+
| a | b |
+---+---+
or like this, if you instead consider b to be the most significant byte:
+---+---+
| b | a |
+---+---+
The way to combine the lsb and the msb into 16-bit value is simply:
result = (msb * 256) + lsb;
UPDATE: The 256 comes from the fact that that's the "worth" of each successively more significant byte in a multibyte number. Compare it to the role of 10 in a decimal number (to combine two single-digit decimal numbers c and d you would use result = 10 * c + d).
Consider msb = 0x01 and lsb = 0x00, then the above would be:
result = 0x1 * 256 + 0 = 256 = 0x0100
You can see that the msb byte ended up in the upper part of the 16-bit value, just as expected.
Your code is using << 8 to do bitwise shifting to the left, which is the same as multiplying by 2^8, i.e. 256.
Note that result above is a value, i.e. not a byte buffer in memory, so its endianness doesn't matter.
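For comparison, the function from the question could be written without malloc() along these lines. This is only a sketch: the parameter is assumed to point at the boot sector as unsigned char, the offsets 19 and 20 are taken from the question, and get_total_size is simply a renamed version of the original.

```c
#include <stdint.h>

/* Combines byte 19 (least significant) and byte 20 (most significant)
 * of the boot sector into one 16-bit value. */
uint16_t get_total_size(const unsigned char *boot_sector)
{
    return (uint16_t)(boot_sector[19] | (boot_sector[20] << 8));
}
```

With the bytes 40 0B from the question this yields 0x0B40, which is 2880 in decimal.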
I see no problem combining individual digits or bytes into larger integers.
Let's do decimal with 2 digits: 1 (least significant) and 2 (most significant):
1 + 2 * 10 = 21 (10 is the system base)
Let's now do base-256 with 2 digits: 0x40 (least significant) and 0x0B (most significant):
0x40 + 0x0B * 0x100 = 0x0B40 (0x100=256 is the system base)
The problem, however, is likely lying somewhere else, in how 12-bit integers are stored in FAT12.
A 12-bit integer occupies 1.5 8-bit bytes. And in 3 bytes you have 2 12-bit integers.
Suppose, you have 0x12, 0x34, 0x56 as those 3 bytes.
In order to extract the first integer you only need take the first byte (0x12) and the 4 least significant bits of the second (0x04) and combine them like this:
0x12 + ((0x34 & 0x0F) << 8) == 0x412
In order to extract the second integer you need to take the 4 most significant bits of the second byte (0x03) and the third byte (0x56) and combine them like this:
(0x56 << 4) + (0x34 >> 4) == 0x563
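The two cases can be folded into one function that picks the n-th 12-bit entry out of a byte buffer. A sketch under the assumption that the buffer holds the packed entries as described above; fat12_entry is a hypothetical name, and no bounds checking is done.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the n-th 12-bit entry from a packed FAT12 table. */
uint16_t fat12_entry(const unsigned char *fat, size_t n)
{
    size_t offset = n + n / 2;    /* n * 1.5, rounded down */
    /* The entry lies inside the little-endian 16-bit word at that offset. */
    uint16_t word = (uint16_t)(fat[offset] | (fat[offset + 1] << 8));

    return (n % 2 == 0) ? (uint16_t)(word & 0x0FFF)   /* even entry: low 12 bits */
                        : (uint16_t)(word >> 4);      /* odd entry: high 12 bits */
}
```

For the bytes 0x12, 0x34, 0x56, entry 0 comes out as 0x412 and entry 1 as 0x563, matching the manual calculation.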
If you read the official Microsoft's document on FAT (look up fatgen103 online), you'll find all the FAT relevant formulas/pseudo code.
The << operator is the left shift operator. It takes the value to the left of the operator and shifts it by the number of bits given on the right side of the operator.
So in your case, it shifts the value of *tmp2 eight bits to the left, and combines it with the value of *tmp1 to generate a 16 bit value from two eight bit values.
For example, let's say you have the integer 1. This is, in 16-bit binary, 0000000000000001. If you shift it left by eight bits, you end up with the binary value 0000000100000000, i.e. 256 in decimal.
The presentation (i.e. binary, decimal or hexadecimal) has nothing to do with it. All integers are stored the same way on the computer.