Perl: Efficiently store/get 2D array of constrained integers in file - arrays

This is an attempt to improve my earlier question,
Perl: seek to and read bits, not bytes, by explaining more
thoroughly what I was trying to do.
I have x, a 9136 x 42 array of integers that I want to store
super-efficiently in a file. The integers have the following
constraints:
All of the 9136 integers in x[0..9135][0] are between
-137438953472 and 137438953471, and can therefore be stored
using 38 bits.
All of the 9136 integers in x[0..9135][1] are between -16777216 and
16777215, and can therefore be stored using 25 bits.
And so on... (the integer bit constraints are known in
advance; Perl doesn't have to compute them)
Question: Using Perl, how do I efficiently store this array in a file?
Notes:
If an integer can be stored in 25 bits, it can also be stored
in 4 bytes (32 bits), if you're willing to waste 7 bits. In my
situation, however, every bit counts.
I want to use file seek() to find data quickly, not read
sequentially through the file.
The array will normally be accessed as x[i]. In other words,
I'll want the 42 integers corresponding to a given x[i], so
these 42 integers should be stored close to each other
(ideally, they should be stored adjacent to each other in the
file)
My initial approach was to just lay down a bitstream, and
then find a way to read it back and change it back into an
integer. My original question focused on that, but perhaps
there's a better solution to the bigger problem that I'm not
seeing.
Far too much detail on what I'm doing:
https://github.com/barrycarter/bcapps/blob/master/ASTRO/bc-read-cheb.m
https://github.com/barrycarter/bcapps/blob/master/ASTRO/bc-read-cheb.pl
https://github.com/barrycarter/bcapps/blob/master/ASTRO/playground.pl

I'm not sure I should be encouraging you, but it looks like Data::BitStream will do what you ask.
The program below writes a 38-bit value and a 25-bit value to a file, and then opens and retrieves the values intact.
#!/usr/bin/perl

use strict;
use warnings;

use Data::BitStream;

{
    my $bs_out = Data::BitStream->new(
        mode => 'w',
        file => 'bits.dat',
    );

    printf "Maximum %d bits per word\n", $bs_out->maxbits;

    $bs_out->write(38, 137438953471);
    $bs_out->write(25, 16777215);

    printf "Total %d bits written\n\n", $bs_out->len;
}

{
    my $bs_in = Data::BitStream->new(
        mode => 'ro',
        file => 'bits.dat',
    );

    printf "Total %d bits read\n\n", $bs_in->len;

    print "Data:\n";
    print $bs_in->read(38), "\n";
    print $bs_in->read(25), "\n";
}
output
Maximum 64 bits per word
Total 63 bits written
File size 11 bytes
Total 63 bits read
Data:
137438953471
16777215
38 plus 25 is 63 bits of data written, which the module confirms. But there is clearly some additional housekeeping data involved, as the resulting file is eleven bytes rather than the eight that would be the bare minimum. Note that, when reopened, the stream remembers that it is 63 bits long. It is still shorter than the sixteen bytes a file would need to hold two plain 64-bit integers.
What you do with this information is up to you, but remember that data packed in this way will be extremely difficult to debug with a hex editor. You may be shooting yourself in the foot if you adopt something like this.

Related

Storing individual bits in memory

So I want to store random bits of length 1 to 8 (a byte) in memory. I know that computers can't address individual bits and that we must store at least a byte at a time on most modern machines. I have been doing some research on this but haven't come across any useful material. I need to find a way to store these bits so that, when reading them back from memory, 0 must NOT be evaluated as 00, 0000, or 00000000. To further explain: 010, for example, must NOT be read back or evaluated as 00000010. Numbers should be unique based on their value as well as their bit length.
Some more examples;
1 ≠ 00000001
10 ≠ 00000010
0010 ≠ 00000010
10001 ≠ 00010001
And so on...
Also, one thing I want to point out again: the bit size is always between 1 and 8 (inclusive) and is NOT a fixed number. I'm using C for this problem.
So you want to store bits in memory and read them back without knowing how long they are. This is not possible. (It's not possible with bytes either)
Imagine if you could do this. Then we could compress a file by, for example, saying that "0" compresses to "0" and "1" compresses to "00". After this "compression" (which would actually make the file bigger) we have a file with only 0's in it. Then, we compress the file with only 0's in it by writing down how many 0's there are. Amazing! Any 2GB file compresses to only 4 bytes. But we know it's impossible to compress every 2GB file into 4 bytes. So something is wrong with this idea.
You can read several bits from memory but you need to know how many you are reading. You can also do it if you don't know how many bits you are reading, but the combinations don't "overlap". So if "01" is a valid combination, then you can't have "010" because that would overlap "01". But you could have "001". This is called a prefix code and it is used in Huffman coding, a type of compression.
Of course, you could also save the length before each number. So you could save "0" as "0010" where the "001" means how many bits long the number is. With 3-digit lengths, you could only have up to 7-bit numbers. Or 8-bit numbers if you subtract 1 from the length, in which case you can't have zero-bit numbers. (so "0" becomes "0000", "101" becomes "010101", etc)
You can manipulate bits using the bit-shift operators or bit fields.
Make sure you understand the concept of endianness, which is machine-dependent. Also keep in mind that bit fields need a struct, and the struct is padded up to the alignment of its members, so it may occupy more bytes than the bits require.
And bit fields can be very tricky.
Good luck!
If you just need to make sure a given binary number is evaluated properly, then you have two choices I can think of. You could store the bit length of each number alongside the number itself, which wouldn't be very efficient.
But you could also store all the binary numbers as 8-bit values, then, when processing each individual number, scan through its digits to find its length. That way you only store the length of one number at a time.
Here is some quick code, hopefully it's clear:
#include <stdint.h>
#include <stdio.h>

uint8_t rightNumber = 2; // which is 10 in binary, or 00000010
int rightLength = 2;     // since it is 2 bits long

uint8_t bn = mySuperbBinaryValueIWantToTest;
int i;

// Find the highest set bit; its position gives the bit length.
for (i = 7; i > 0; i--)
{
    if ((bn & (1 << i)) != 0) break;
}
int length = i + 1;

if (bn == rightNumber && length == rightLength) printf("Correct number");
else printf("Incorrect number");
Keep in mind that you can also use this same technique to calculate the number of bits in the right value instead of precomputing it. And if you are comparing against arbitrary values, the same approach works too.
Hope this helped, if not, feel free to criticize/re-explain your problem

System Verilog - Reading a line from testbench and splitting the data

I'm a beginner in SystemVerilog programming. I have a file called "input.in" which contains around 32 bits of data, all on a single line.
The data, once read by the testbench, must be split into an array or into 4 variables, each containing 8 bits of the input. Please, somebody help me :(
I think, you want to split the 32 bits of data into 4 bytes of data.
Please try the following:
{>>{a,b,c,d}} = var_32_bit; // a, b, c, d are 8-bit variables.
// var_32_bit is either a 32-bit variable (bit [31:0]) or an unpacked bit array (bit a[]) of size 32.
Is this the one you need?

Declaring the array size in C

It's quite embarrassing, but I really want to know... So I needed to make a conversion program that converts decimal (base 10) to binary and hex. I used arrays to store values and everything worked out fine, but I declared the array as int arr[1000]; because I thought 1000 was just an OK number, not too big, not too small... Someone in class said "Why would you declare an array of 1000? Integers are 32 bits." I was too embarrassed to ask what that meant, so I didn't say anything. But does this mean that I can just declare the array as int arr[32]; instead? I'm using C, btw.
No, the int type typically has a 32-bit size, but when you declare
int arr[1000];
you are reserving space for 1000 integers, i.e. 32,000 bits, while with
int arr[32];
you can store up to 32 integers.
You are practically asking yourself a question like this: if an apple weighs 32 grams, do I want my bag to contain 1000 apples or 32 apples?
Don't be embarrassed. Fear is your enemy, and in the end you will be judged on things you have no hope of significantly influencing anyway. To answer your question: your approach is incorrect. You should declare the array with a size determined by the number of positions you actually use.
Concretely, if you access the array at 87 distinct positions (indices 0 to 86), then you need a size of 87.
0 to 4,294,967,295 is the maximum possible range of numbers you can store in 32 bits. If your number is outside this range, you cannot store it in 32 bits. Since each bit occupies one index of your array, an array size of 32 will do fine as long as your number falls in that range. For example, the number 9 would be stored in the array as a[] = {1,0,0,1}.
To know the range of numbers in general, the formula is 0 to (2^n - 1), where n is the number of bits. That means with an array size of 4, i.e. 4 bits, you can only store numbers in the range 0 to 15.
In C, the integer data type can typically store up to 2,147,483,647, or 4,294,967,295 if you use unsigned integers. Since the maximum value an int can store in C fits in 32 bits, it is safe to say that an array of size 32 is enough: you will never require more than 32 bits to express any number held in an int.
I would use
#include <limits.h>

int a = 42;
char bin[sizeof a * CHAR_BIT + 1];
char hex[sizeof a * CHAR_BIT / 4 + 1];
I think this covers all possibilities; CHAR_BIT comes from <limits.h>, and the + 1 leaves room for the terminating null.
Consider also that the int type itself is ambiguous. Its size generally depends on the machine you're working on, and at minimum its range is -32767 to +32767:
https://en.wikipedia.org/wiki/C_data_types
Can I suggest using the stdint types instead?
int32_t/uint32_t
What you did is okay, if that is precisely what you want to do. C is a language that lets you do whatever you want, whenever you want. The reason you were berated over the declaration is 'hogging' memory. The thought being: how DARE YOU take up space that is possibly never used... it is inefficient.
And it is. But who cares, if you just want to run a program with a simple purpose? A block of 1000 16- or 32-bit values is weeeeeensy teeeeny tiny compared to computers from the way-back times, when it was necessary to watch over how much RAM you were taking up. So - go ahead.
But what they should have said next is how to avoid that. More on that at the end - but first a thing about built in data types in C.
An int can be 16 or 32 bits depending on the platform and your compiler's settings; the standard only guarantees at least 16.
A long int is at least 32.
consider:
short int x = 10; // declares an integer that is at least 16 bits
signed int x = 10; // typically a 32-bit integer with a negative and positive range
unsigned int x = 10; // the same size - but only 0 to positive values
To code an integer of at least 32 bits, declare it 'long':
long int x = 10; // at least 32 bits
unsigned long int x = 10; // at least 32 bits, 0 to positive values
Typical nomenclature is to call a 16-bit value a WORD and a 32-bit value a DWORD (double word). But why would you want to type in:
long int x = 10;
instead of:
int x = 10;
?? For a few reasons. Some compilers may handle int as a 16-bit WORD to keep up with older standards. But the main reason is to maintain a convention of strongly typed code: make the code read directly as what you intend it to do. This also helps readability: you will KNOW, when you see it, what size it is for sure, and be reminded while coding. Many code mishaps happen for lack of attention to coding practices and naming things well. Save yourself hours of headache later by learning good habits now. Create YOUR OWN style of coding, and take a look at other styles just to get an idea of what the industry may expect. But in the end you will find your own way in it.
On to the array issue ---> So, I expect you know that the array takes up memory right when the program runs. Right then, wham - the RAM for that array is set aside just for your program. It is locked out from use by any other resource, service, etc the operating system is handling.
But wouldn't it be neat if you could just use the memory you needed when you wanted, and then let it go when done? Inside the program - as it runs. So when your program first started, the array (so to speak) would be zero. And when you needed a 'slot' in the array, you could just add one.... use it, and then let it go - or add another - or ten more... etc.
That is called dynamic memory allocation. And it requires the use of a data type that you may not have encountered yet. Look up "Pointers in C" to get an intro.
If you are coding in regular C, there are a couple of functions that assist in performing dynamic allocation of memory:
malloc and free ~ declared in <stdlib.h>
In C++ they are implemented differently. Look for:
new and delete
A common construct for handling dynamic 'arrays' is called a "linked-list." Look that up too...
Don't let someone get you flustered with code concepts. Next time, just say your program is designed to handle exactly what you intended. That usually stops the discussion.
Atomkey

Write 9 bits binary data in C

I am trying to write binary data to a file when that data does not fit in 8 bits. From what I understand, you can write binary data of any length if you can group it into a predefined length of 8, 16, 32, or 64.
Is there a way to write just 9 bits to a file? Or two values of 9 bits?
I have one value in the range -32768 to +32768, and 3 values in the range -256 to +256. What would be the way to save the most space?
Thank you
No, I don't think there's any way, using C's file I/O APIs, to express storing less than one char of data, which will typically be 8 bits.
If you're on a 9-bit system, where CHAR_BIT really is 9, then it will be trivial.
If what you're really asking is "how can I store a number that has a limited range using the precise number of bits needed", inside a possibly larger file, then that's of course very possible.
This is often called bitstreaming and is a good way to optimize the space used for some information. Encoding/decoding bitstream formats requires you to keep track of how many bits you have "consumed" of the current input/output byte in the actual file. It's a bit complicated but not very hard.
Basically, you'll need:
A byte stream s, i.e. something you can put bytes into, such as a FILE *.
A bit index i, i.e. an unsigned value that keeps track of how many bits you've emitted.
A current byte x, into which bits can be put, each time incrementing i. When i reaches CHAR_BIT, write it to s and reset i to zero.
You cannot store values in the range –256 to +256 in nine bits either. That is 513 values, and nine bits can only distinguish 512 values.
If your actual ranges are –32768 to +32767 and –256 to +255, then you can use bit-fields to pack them into a single structure:
struct MyStruct
{
int a : 16;
int b : 9;
int c : 9;
int d : 9;
};
Objects such as this are still rounded up to a whole number of bytes. The struct above uses 43 bits in total, and the next whole number of eight-bit bytes is 48 bits, so it needs at least six bytes; depending on the compiler's alignment rules for bit-fields, it may be padded further, e.g. to eight bytes.
You can either accept this padding of 43 bits to 48 or use more complicated code to concatenate bits further before writing to a file. This requires additional code to assemble bits into sequences of bytes. It is rarely worth the effort, since storage space is currently cheap.
You can apply the principle of base64 (just enlarging your base, not making it smaller).
Every value would be written across two bytes and combined with the last/next byte by shift and OR operations.
I hope this very abstract description helps you.

Processing time between strings and integers in C

As we know, an integer is typically stored in 4 bytes and a character in 1 byte. Here comes my problem: I have a huge amount of data and I need to write it into a file.
For eg. my data is like
Integers - 123456789 (of length 9) (9! = 362,880 records in total)
Characters - abcdefghi (of length 9) (9! = 362,880 records in total)
Which one will take less processing time? Any thoughts...
It's insignificant compared to the file's access time.
If your integers are stored in individual 32-bit ints and you save them in binary, you have 4 bytes per integer and no conversion overhead.
If your character strings are stored in arrays of 9 chars and you save them as-is, you have 9 bytes per string and no conversion overhead.
In this case strings will take more I/O time than integers.
If you convert your integers into readable 9-char strings and save them the same way as the other strings, the I/O time will be the same, but there will be extra processing time for the integers required for their conversion into text.
Using integers will save you some space and maybe some time as well, but the difference will be so small that you won't notice. An integer takes 4 bytes and 9 characters take 9 bytes, so for each value you would be using 5 extra bytes. As the length of your data set is 9! = 362,880, you would be wasting 362,880 * 5 bytes, which is 1.73 MB. This is a very small chunk, and writing it to disk will not be noticeable. So choose integer or character based on what suits you more, not based on which will be faster, as you won't notice a difference for a data set of this size.
It seems that the processing time would be the same only if you end up writing the same bytes either way, e.g. integers formatted as 9-character text.
Also remember that a file is stored on the hard drive, not in RAM.
