Storing individual bits in memory - C

So I want to store random bits of length 1 to 8 (a byte) in memory. I know that computers can't address individual bits and that we must store at least a byte of data on most modern machines. I have been doing some research on this but haven't come across any useful material. I need to find a way to store these bits so that, for example, when reading the bits back from memory, 0 must NOT be evaluated as 00, 0000 or 00000000. To further explain: for example, 010 must NOT be read back, or evaluated for that matter, as 00000010. Numbers should be unique based on their value as well as their bit count.
Some more examples;
1 ≠ 00000001
10 ≠ 00000010
0010 ≠ 00000010
10001 ≠ 00010001
And so on...
Also, one thing I want to point out again is that the bit size is always between 1 and 8 (inclusive) and is NOT a fixed number. I'm using C for this problem.

So you want to store bits in memory and read them back without knowing how long they are. This is not possible. (It's not possible with bytes either)
Imagine if you could do this. Then we could compress a file by, for example, saying that "0" compresses to "0" and "1" compresses to "00". After this "compression" (which would actually make the file bigger) we have a file with only 0's in it. Then, we compress the file with only 0's in it by writing down how many 0's there are. Amazing! Any 2GB file compresses to only 4 bytes. But we know it's impossible to compress every 2GB file into 4 bytes. So something is wrong with this idea.
You can read several bits from memory but you need to know how many you are reading. You can also do it if you don't know how many bits you are reading, but the combinations don't "overlap". So if "01" is a valid combination, then you can't have "010" because that would overlap "01". But you could have "001". This is called a prefix code and it is used in Huffman coding, a type of compression.
Of course, you could also save the length before each number. So you could save "0" as "0010" where the "001" means how many bits long the number is. With 3-digit lengths, you could only have up to 7-bit numbers. Or 8-bit numbers if you subtract 1 from the length, in which case you can't have zero-bit numbers. (so "0" becomes "0000", "101" becomes "010101", etc)
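As a rough sketch of that length-prefix scheme (the function name, the MSB-first packing order, and the assumption that buf starts out zeroed and is large enough are all choices made for this example):

#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Append the low `len` bits of `value`, preceded by a 3-bit field
 * holding len - 1, to a bit buffer. `*bitpos` counts bits used so far. */
static void append_bits(uint8_t *buf, size_t *bitpos, unsigned value, unsigned len)
{
    assert(len >= 1 && len <= 8);
    unsigned packed = ((len - 1) << len) | (value & ((1u << len) - 1));
    for (unsigned i = 3 + len; i-- > 0; ) {   /* emit MSB first */
        if (packed & (1u << i))
            buf[*bitpos / 8] |= (uint8_t)(1u << (7 - *bitpos % 8));
        (*bitpos)++;
    }
}

With this, the value 0 of length 1 is stored as the four bits 0000, and 101 (value 5, length 3) as the six bits 010101, matching the examples above.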

You can manipulate bits using the bit shift operators or bit fields.
Make sure you understand the endianness concept; it is machine dependent. Also keep in mind that bit fields need a struct, and the compiler rounds a struct up to a whole number of bytes (plus any alignment padding), so you never get below one byte of storage.
And bit fields can be very tricky.
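For example, a minimal sketch of both approaches (the names and field widths here are illustrative, not prescriptive):

#include <stdint.h>

struct coded {                 /* bit-field approach */
    unsigned int value : 8;    /* the bits themselves        */
    unsigned int len   : 4;    /* how many of them are used  */
};

/* shift/mask approach: store a 3-bit value at bit offset `off` in a byte */
static uint8_t put3(uint8_t byte, unsigned off, unsigned v)
{
    byte &= (uint8_t)~(0x7u << off);        /* clear the 3-bit slot */
    byte |= (uint8_t)((v & 0x7u) << off);   /* drop in the new bits */
    return byte;
}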
Good luck!

If you just need to make sure a given binary number is evaluated properly, then you have two choices I can think of. You could store the bit count of each number alongside the number itself, which wouldn't be very space efficient.
But you could also store all the binary numbers as 8-bit values, then, when processing each individual number, scan through its digits to find its length. That way you only store the length of a single number at a time.
Here is some quick code, hopefully it's clear:
#include <stdio.h>
#include <stdint.h>

uint8_t rightNumber = 2; // which is 10 in binary, or 00000010
int rightLength = 2;     // since it is 2 bits long
uint8_t bn = mySuperbBinaryValueIWantToTest;
int i;
for (i = 7; i > 0; i--)
{
    if ((bn & (1 << i)) != 0) break;   // stop at the highest set bit
}
int length = i + 1;
if (bn == rightNumber && length == rightLength) printf("Correct number");
else printf("Incorrect number");
Keep in mind you can also use the same technique to compute the bit count of the right value instead of precomputing it. The same works if you are comparing against arbitrary values.
Hope this helped; if not, feel free to criticize or re-explain your problem.

Related

Declaring the array size in C

It's quite embarrassing but I really want to know... So I needed to make a conversion program that converts decimal (base 10) to binary and hex. I used arrays to store values and everything worked out fine, but I declared the array as int arr[1000]; because I thought 1000 was just an OK number, not too big, not too small... Someone in class said "Why would you declare an array of 1000? Integers are 32 bits." I was too embarrassed to ask what that meant, so I didn't say anything. But does this mean that I can just declare the array as int arr[32]; instead? I'm using C, btw.
No. The int type typically has a 32-bit size, but when you declare
int arr[1000];
you are reserving space for 1000 integers, i.e. 32,000 bits, while with
int arr[32];
you can store up to 32 integers.
You are practically asking yourself a question like this: if an apple weighs 32 grams, do I want my bag to contain 1000 apples or 32 apples?
Don't be embarrassed. Fear is your enemy, and in the end you will be perceived based on contexts that you have no hope of significantly influencing. Anyway, to answer your question: your approach is incorrect. You should declare the array with a size determined by the number of positions actually used.
Concretely, if you access the array at 87 distinct positions (from 0 to 86), then you need a size of 87.
0 to 4,294,967,295 is the maximum possible range of numbers you can store in 32 bits. If your number is outside this range, you cannot store it in 32 bits. Since each bit will occupy one index of your array, an array size of 32 will do fine whenever your number falls in that range. For example, the number 9 would be stored in the array as a[] = {1,0,0,1}.
In general, the range of numbers you can represent is 0 to (2^n - 1), where n is the number of bits. That means with an array size of 4, i.e. 4 bits, you can only store numbers from 0 to 15.
In C, the integer data type can typically store up to 2,147,483,647, or 4,294,967,295 if you are using an unsigned integer. Since the maximum value an int can store in C fits within 32 bits, it is safe to say that an array size of 32 is enough: you will never require more than 32 bits to express any number held in an int.
I would use
#include <limits.h>

int a = 42;
char bin[sizeof a * CHAR_BIT + 1];      /* one char per bit, plus the terminator    */
char hex[sizeof a * CHAR_BIT / 4 + 1];  /* one char per nibble, plus the terminator */
I think this covers all possibilities.
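A minimal sketch of filling those buffers (assuming a is non-negative; the conversion loop is mine, not part of the original answer):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int a = 42;
    char bin[sizeof a * CHAR_BIT + 1];
    char hex[sizeof a * CHAR_BIT / 4 + 1];
    unsigned bits = sizeof a * CHAR_BIT;
    for (unsigned i = 0; i < bits; i++)   /* most significant bit first */
        bin[i] = (((unsigned)a >> (bits - 1 - i)) & 1) ? '1' : '0';
    bin[bits] = '\0';
    snprintf(hex, sizeof hex, "%x", (unsigned)a);
    printf("%s\n%s\n", bin, hex);
    return 0;
}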
Consider also that the 'int' type is not fixed in size. Generally it depends on the machine you're working on, and at minimum its range is -32767 to +32767:
https://en.wikipedia.org/wiki/C_data_types
Can I suggest using the stdint.h types?
int32_t/uint32_t
What you did is okay. If that is precisely what you want to do. C is a language that lets you do whatever you want. Whenever you want. The reason you were berated on the declaration is because of 'hogging' memory. The thought being, how DARE YOU take up space that is possibly never used... it is inefficient.
And it is. But who cares, if you just want to run a program that has a simple purpose? A block of 1000 16- or 32-bit values is weeeeeensy teeeeny tiny compared to computers from the way-back times, when it was necessary to watch over how much RAM you were taking up. So - go ahead.
But what they should have said next is how to avoid that. More on that at the end - but first a thing about built in data types in C.
An int can be 16 or 32 bits depending on your platform and compiler settings; the standard only guarantees at least 16.
A long int is at least 32.
Consider:
short int x = 10;    // declares an integer that is at least 16 bits
signed int x = 10;   // typically a 32-bit integer, with a negative and positive range
unsigned int x = 10; // same size, but only 0 to positive values
To guarantee at least 32 bits, you declare it 'long':
long int x = 10;          // at least 32 bits
unsigned long int x = 10; // at least 32 bits, 0 to positive values
Typical nomenclature is to call a 16 bit value a WORD and a 32 bit value a DWORD - (double word). But why would you want to type in:
long int x = 10;
instead of:
int x = 10;
?? For a few reasons. Some compilers may handle int as a 16-bit WORD if keeping up with older standards. But the main real reason is to maintain a convention of strongly typed code: make it read directly what you intend it to do. This also helps readability. You will KNOW when you see it what size it is for sure, and be reminded whilst coding. Many, many code mishaps happen for lack of attention to code practices and naming things well. Save yourself hours of headache later on by learning good habits now. Create YOUR OWN style of coding. Take a look at other styles just to get an idea of what the industry may expect. But in the end you will find your own way in it.
On to the array issue ---> So, I expect you know that the array takes up memory right when the program runs. Right then, wham - the RAM for that array is set aside just for your program. It is locked out from use by any other resource, service, etc the operating system is handling.
But wouldn't it be neat if you could just use the memory you needed when you wanted, and then let it go when done? Inside the program - as it runs. So when your program first started, the array (so to speak) would be zero. And when you needed a 'slot' in the array, you could just add one.... use it, and then let it go - or add another - or ten more... etc.
That is called dynamic memory allocation. And it requires the use of a data type that you may not have encountered yet. Look up "Pointers in C" to get an intro.
If you are coding in regular C there are a few functions that assist in performing dynamic allocation of memory:
malloc and free ~ declared in <stdlib.h>
in C++ they are implemented differently. Look for:
new and delete
A common construct for handling dynamic 'arrays' is called a "linked-list." Look that up too...
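For a flavor of what malloc/realloc/free look like in plain C (a minimal sketch; the sizes and the growth step are arbitrary here):

#include <stdlib.h>

int main(void)
{
    size_t n = 32;                      /* grow this as needed */
    int *arr = malloc(n * sizeof *arr); /* reserve n ints at runtime */
    if (arr == NULL) return 1;          /* allocation can fail */

    arr[0] = 42;

    int *tmp = realloc(arr, 2 * n * sizeof *arr); /* need more? grow it */
    if (tmp == NULL) { free(arr); return 1; }
    arr = tmp;

    free(arr);                          /* give the memory back */
    return 0;
}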
Don't let someone get you flustered with code concepts. Next time just say your program is designed to handle exactly what you have intended. That usually stops the discussion.
Atomkey

Write 9 bits binary data in C

I am trying to write binary data that does not fit in 8 bits to a file. From what I understand, you can write binary data of any length if you can group it into predefined lengths of 8, 16, 32, or 64 bits.
Is there a way to write just 9 bits to a file? Or two values of 9 bits?
I have one value in the range -+32768 and 3 values in the range +-256. What would be the way to save most space?
Thank you
No, I don't think there's any way using C's file I/O APIs to express storing less than 1 char of data, which will typically be 8 bits.
If you're on a 9-bit system, where CHAR_BIT really is 9, then it will be trivial.
If what you're really asking is "how can I store a number that has a limited range using the precise number of bits needed", inside a possibly larger file, then that's of course very possible.
This is often called bitstreaming and is a good way to optimize the space used for some information. Encoding/decoding bitstream formats requires you to keep track of how many bits you have "consumed" of the current input/output byte in the actual file. It's a bit complicated but not very hard.
Basically, you'll need:
A byte stream s, i.e. something you can put bytes into, such as a FILE *.
A bit index i, i.e. an unsigned value that keeps track of how many bits you've emitted.
A current byte x, into which bits can be put, each time incrementing i. When i reaches CHAR_BIT, write it to s and reset i to zero.
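A minimal sketch of such a bit writer, following the s/i/x description above (the struct and function names are invented for this example, and error handling is omitted):

#include <stdio.h>
#include <limits.h>

struct bitwriter {
    FILE *s;         /* the byte stream                 */
    unsigned i;      /* bits used in the current byte   */
    unsigned char x; /* the current byte being filled   */
};

/* Append the low `n` bits of `v`, most significant first. */
static void put_bits(struct bitwriter *w, unsigned v, unsigned n)
{
    while (n--) {
        w->x = (unsigned char)((w->x << 1) | ((v >> n) & 1));
        if (++w->i == CHAR_BIT) {   /* byte full: flush it to the stream */
            fputc(w->x, w->s);
            w->i = 0;
            w->x = 0;
        }
    }
}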
You cannot store values in the range –256 to +256 in nine bits either. That is 513 values, and nine bits can only distinguish 512 values.
If your actual ranges are –32768 to +32767 and –256 to +255, then you can use bit-fields to pack them into a single structure:
struct MyStruct
{
    int a : 16;
    int b : 9;
    int c : 9;
    int d : 9;
};
Objects such as this will still be rounded up to a whole number of bytes, so the above will occupy six bytes on typical systems, since it uses 43 bits in total, and the next whole number of eight-bit bytes is 48 bits.
You can either accept this padding of 43 bits to 48 or use more complicated code to concatenate bits further before writing to a file. This requires additional code to assemble bits into sequences of bytes. It is rarely worth the effort, since storage space is currently cheap.
You can apply the principle of base64 (just enlarging your base, not making it smaller).
Each value is written across two bytes and combined with the previous/next value's bits by shift and OR operations.
I hope this very abstract description helps you.
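As an illustration of that shifting and OR-ing (a sketch I added; it assumes a and b are already masked down to 9 bits):

#include <stdint.h>

/* Pack two 9-bit values into out[0..2]: a fills the first byte and
 * one bit of the second; b fills the rest. */
static void pack2x9(uint16_t a, uint16_t b, uint8_t out[3])
{
    out[0] = (uint8_t)(a >> 1);                    /* a's high 8 bits          */
    out[1] = (uint8_t)(((a & 1) << 7) | (b >> 2)); /* a's low bit + b's high 7 */
    out[2] = (uint8_t)((b & 3) << 6);              /* b's low 2 bits           */
}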

Most efficient way to store an unsigned 16-bit Integer to a file

I'm making a dictionary compressor in C with dictionary max size 64000. Because of this, I'm storing my entries as 16-bit integers.
What I'm currently doing:
To encode 'a', I get its ASCII value, 97, and then convert this number into a string representation of the 16-bit integer of 97. So I end up encoding '0000000001100001' for 'a', which obviously isn't saving much space in the short run.
I'm aware that more efficient versions of this algorithm would start with smaller integer sizes (less bits of storage until we need more), but I'm wondering if there's a better way to either
Convert my integer '97' into an ASCII string of fixed length that can store 16 bits of data (97 would be x digits, 46347 would also be x digits)
Writing to a file that can ONLY store 1s and 0s. Because as it is, it seems like I'm writing 16 ASCII characters to a text file, each of which is 8 bits... so that's not really helping the cause much, is it?
Please let me know if I can be more clear in any way. I'm pretty new to this site. Thank you!
EDIT: How I store my dictionary is entirely up to me as far as I know. I just know that I need to be able to easily read the encoded file back and get the integers from it.
Also, I can only include stdio.h, stdlib.h, string.h, and header files I wrote for the program.
Please, do ignore these people who are suggesting that you "write directly to the file". There are a number of issues with that, which ultimately fall into the category of "integer representation". While there appear to be some compelling reasons to write integers straight to external storage using fwrite or what-not, there are some solid facts in play here.
The bottleneck is the external storage controller. Either that, or the network, if you're writing a network application. Thus, writing two bytes as a single fwrite, or as two distinct fputcs, should be roughly the same speed, providing your memory profile is adequate for your platform. You can adjust the amount of buffer that your FILE *s use to a degree using setvbuf, so we can always fine-tune per platform based on what our profilers tell us, though this information should probably float gracefully upstream to the standard library through gentle suggestions to be useful for other projects, too.
Underlying integer representations are inconsistent between today's computers. Suppose you write unsigned ints directly to a file using system X, which uses 32-bit ints and big endian representation; you'll end up with issues reading that file on system Y, which uses 16-bit ints and little endian representation, or system Z, which uses 64-bit ints with mixed endian representation and 32 padding bits. Nowadays we have a mix of everything from computers of 15 years ago that people torture themselves with, to ARM big.LITTLE SoCs, smartphones and smart TVs, gaming consoles and PCs, all of which have their own quirks that fall outside the realm of standard C, especially with regard to integer representation, padding and so on.
C was developed with abstractions in mind that allow you to express your algorithm portably, so that you don't have to write different code for each OS! Here's an example of reading and converting four hex digits to an unsigned int value, portably:
unsigned int value;
int value_is_valid = fscanf(fd, "%04x", &value) == 1;
assert(value_is_valid); // #include <assert.h>
/* NOTE: Actual error correction should occur in place of that
 * assertion
 */
I should point out the reason why I chose %04X and not %08X or something more contemporary... judging by questions asked even today, there are unfortunately students using textbooks and compilers that are over 20 years old... Their int is 16-bit and, technically, their compilers are compliant in that aspect (though they really ought to push gcc and llvm throughout academia). With portability in mind, here's how I'd write that value:
value &= 0xFFFF;
fprintf(fd, "%04x", value);
// side-note: We often don't check the return value of `fprintf`, but it can also become
// very important, particularly when dealing with streams and large files...
Supposing your unsigned int values occupy two bytes, here's how I'd read those two bytes, portably, using big endian representation:
int hi = fgetc(fd);
int lo = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0); // again, proper error detection & handling logic should be here
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's how I'd write those two bytes, in their big endian order:
fputc((value >> 8) & 0xFF, fd);
fputc(value & 0xFF, fd);
// and you might also want to check this return value (perhaps in a finely tuned end product)
Perhaps you're more interested in little endian. The neat thing is, the code really isn't that different. Here's input:
int lo = fgetc(fd);
int hi = fgetc(fd);
unsigned int value = 0;
assert(hi >= 0 && lo >= 0);
value += hi & 0xFF; value <<= 8;
value += lo & 0xFF;
... and here's output:
fputc(value & 0xFF, fd);
fputc((value >> 8) & 0xFF, fd);
For anything larger than two bytes (i.e. a long unsigned or long signed), you might want to fwrite((char unsigned[]){ value >> 24, value >> 16, value >> 8, value }, 1, 4, fd); or something for example, to reduce boilerplate. With that in mind, it doesn't seem abusive to form a preprocessor macro:
#define write(fd, ...) fwrite((char unsigned[]){ __VA_ARGS__ }, 1, sizeof ((char unsigned[]){ __VA_ARGS__ }), fd)
I suppose one might look at this like choosing the better of two evils: preprocessor abuse or the magic number 4 in the code above, because now we can write(fd, value >> 24, value >> 16, value >> 8, value); without the 4 being hard-coded... but a word for the uninitiated: side-effects might cause headaches, so don't go causing modifications, writes or global state changes of any kind in arguments of write.
Well, that's my update to this post for the day... Socially delayed geek person signing out for now.
What you are contemplating is to use ASCII characters to save your numbers; this is completely unnecessary and most inefficient.
The most space-efficient way to do this (without resorting to complex algorithms) would be to just dump the bytes of the numbers into the file (the number of bits would have to depend on the largest number you intend to save), or to have multiple files for 8-bit, 16-bit, etc.
Then when you read the file, you know that your numbers are located every x bits, so you just read them out one by one, or in a big chunk(s), and then turn the chunk(s) into an array of a type that matches x bits.
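A sketch of that byte-dump approach for the 16-bit case, with an explicit low-byte-first order so the file reads back the same on any machine (the function names are mine):

#include <stdio.h>
#include <stdint.h>

/* Write one 16-bit entry as two bytes, low byte first. */
static int write_u16(FILE *f, uint16_t v)
{
    uint8_t bytes[2] = { (uint8_t)(v & 0xFF), (uint8_t)(v >> 8) };
    return fwrite(bytes, 1, 2, f) == 2 ? 0 : -1;
}

/* Read it back the same way. */
static int read_u16(FILE *f, uint16_t *v)
{
    uint8_t bytes[2];
    if (fread(bytes, 1, 2, f) != 2) return -1;
    *v = (uint16_t)(bytes[0] | (bytes[1] << 8));
    return 0;
}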

loop over 2^n states of n bits in C with n > 32

I'd like to have a loop in C over all possible 2^n states of n bits. For example if n=4 I'd like to loop over 0000, 0001, 0010, 0011, ..., 1110, 1111. The bits can be represented in any way, for example an integer array of length n with values 0 or 1, or a character array of length n with values "0" or "1", etc, it doesn't really matter.
For smallish n what I do is calculate x=2^n using integer arithmetic (both n and x are integers), then
for (i = 0; i < x; i++) {
    bits = convert_integer_to_bits(i);
    work_on_bits(bits);
}
Here 'bits' is in the given representation of bits, what was useful so far is an integer array of length n with values 0 or 1 (but can be anything else).
If n>32 this approach obviously doesn't work even with longs.
How would I work with n>32?
Specifically, do I really need to evaluate 2^n, or is there a tricky way of writing the loop which does not refer to the actual value of 2^n but nevertheless iterates 2^n times?
For n > 32, use unsigned long long. This will work for n up to 64. Still, for values even close to 50 you will have to wait a long time until the cycle finishes.
It's not clear why you say that if n>32, it obviously won't work. Is your concern the width of bits, or is your concern the run time?
If you're concerned about number width, investigate a big math library such as http://gmplib.org/.
If you're concerned about run time... you won't live long enough for your loop to complete if the width is large enough, so get a different hobby ;) Seriously... figure out the rough run time of one iteration through your loop, multiply that by 4 billion, divide by 20 years, and you'll have an estimate of the number of generations of your descendants that will need to wait for the answer.
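To the specific sub-question (can the loop avoid referring to the value of 2^n?): yes. Represent the state as an array of n bits and increment it with a manual carry, stopping when the carry runs off the top bit; this also sidesteps the overflow at n = 64. A sketch I added, using the integer-array representation from the question (work_on_bits is the poster's own function):

#define MAXN 128                     /* any upper bound on n you like */

void work_on_bits(int bits[]);       /* the poster's function */

void loop_over_states(int n)
{
    int bits[MAXN] = {0};            /* bits[0] is the least significant */
    for (;;) {
        work_on_bits(bits);          /* visit the current state */
        int i = 0;
        while (i < n && bits[i] == 1)
            bits[i++] = 0;           /* ripple the carry upward */
        if (i == n) break;           /* carry out of the top bit: all 2^n states done */
        bits[i] = 1;
    }
}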

How to read 5 bytes to a meaningful uint64_t in C?

I need to allocate an array of uint64_t[1e9] to count something, and I know the items are between (0, 2^39).
So I want to calloc 5*1e9 bytes for the array.
Then I found that, if I want to make the uint64_t meaningful, it is difficult to bypass the byte order.
There should be 2 ways.
One is to check the endianness first, so that we can memcpy the 5 bytes to either the first or the last of the whole 8 bytes.
The other is to use 5 bit-shifts and then bit-or them together.
I think the former should be faster.
So, under GCC or libc or a GNU system, is there any header file to indicate whether the current system is little endian or big endian? I know x86_64 is little endian, but I don't want to write unportable code.
Of course any other ideas are welcome.
Added:
I need to use the array to count many strings using d-left hashing. I plan to use 21 bits for the key and 18 bits for counting.
When you say "faster"... how often is this code executed? 5 times <<8 plus an | probably costs less than 100 ns, so if that code is executed 10 million times, it adds up to about 1 (one) second.
If the code is executed fewer times than that, and you need more than 1 second to implement an endian-clean solution, you're wasting everyone's time.
That said, the solution to figure out the endianess is simple:
int a = 1;
char * ptr = (char*)&a;
bool littleEndian = *ptr == 1;
Now all you need is a big endian machine and a couple of test cases to make sure your memcpy solution works. Note that you need to call memcpy five times in one of the two cases to reorder the bytes.
Or you could simply shift and or five times...
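For reference, a sketch of that endian-agnostic shift-and-or approach (p points at the 5 stored bytes; storing the most significant byte first is my arbitrary choice here):

#include <stdint.h>

/* Read a 40-bit value stored as 5 bytes, most significant byte first. */
static uint64_t load40(const uint8_t *p)
{
    return ((uint64_t)p[0] << 32) | ((uint64_t)p[1] << 24) |
           ((uint64_t)p[2] << 16) | ((uint64_t)p[3] << 8)  |
            (uint64_t)p[4];
}

/* Write it back the same way. */
static void store40(uint8_t *p, uint64_t v)
{
    p[0] = (uint8_t)(v >> 32); p[1] = (uint8_t)(v >> 24);
    p[2] = (uint8_t)(v >> 16); p[3] = (uint8_t)(v >> 8);
    p[4] = (uint8_t)v;
}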
EDIT I guess I misunderstood your question a bit. You're saying that you want to use the lowest 5 bytes (=40 bits) of the uint64_t as a counter, yes?
So the operation will be executed many, many times. Again, memcpy is utterly useless. Let's take the number 0x12345678 (32bit). In memory, that looks like so:
0x12 0x34 0x56 0x78 big endian
0x78 0x56 0x34 0x12 little endian
As you can see, the bytes are swapped. So to convert between the two, you must either use bit-shifting or byte swapping. memcpy doesn't work.
But that doesn't actually matter since the CPU will do the decoding for you. All you have to do is to shift the bits in the right place.
key = item & 0x1FFFFF;
count = item >> 21;
to read and
item = (count << 21) | key;
to write. Now you just need to build the item from the five bytes and you're done:
item = ((uint64_t)hash[0] << 32) | ((uint64_t)hash[1] << 24) | ...
EDIT 2
It seems you have an array of 40-bit ints and you want to read/write that array.
I have two solutions: Using memcpy should work as long as the data isn't copied between CPUs of different endianness (read: when you save/load the data to/from disk). But the function call might be too slow for such a huge array.
The other solution is to use two arrays:
uint32_t lower[];
uint8_t upper[];
that is: save bits 33-40 in a second array. To read/write the values, one shift plus one or is necessary.
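A sketch of that split-array access (the function names are mine; i is the element index):

#include <stdint.h>
#include <stddef.h>

/* bits 1-32 live in lower[], bits 33-40 in upper[] */
uint64_t get40(const uint32_t *lower, const uint8_t *upper, size_t i)
{
    return ((uint64_t)upper[i] << 32) | lower[i];
}

void set40(uint32_t *lower, uint8_t *upper, size_t i, uint64_t v)
{
    lower[i] = (uint32_t)v;          /* low 32 bits */
    upper[i] = (uint8_t)(v >> 32);   /* high 8 bits */
}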
If you treat numbers as numbers, and not as arrays of bytes, your code will be endianness-agnostic. Hence, I would go for the shift-and-or solution.
Having said that, I didn't really catch what you are trying to do. Do you really need one billion entries, each five bytes long? If the data you are sampling is sparse, you might get away with allocating far less memory.
Well, I just found that the kernel headers come with <asm/byteorder.h>.
Inlining the memcpy as a hand-rolled byte-copy loop (e.g. while (n--) *dst++ = *src++;) may still be slower, since cache operations are slower than registers.
Another way to do the memcpy is through a union:
union dat {
uint64_t a;
char b[8];
} d;
