packing two single precision values into one - arrays

I'm working in LabVIEW with very constrained RAM. I have two arrays that require single precision since I need decimal points. However, single precision takes up too much space for what I have; the decimal values I work with are within 0.00-1000.00.
Is there an intuitive way to pack these two arrays together so I can save some space? Or is there a different approach I can take?

If you need to represent 0.00 - 1000.00, you've got 100,001 values. That cannot be represented in fewer than 17 (whole) bits. That means that to fit two numbers in, you'd need 34 bits. 34 bits is obviously more than you can fit in a 32-bit space. I suggest you try to limit your space of values. You could dedicate 11 bits to the integer value (0 - 2047) and 5 bits to the decimal value (0 to 0.96875 in steps of 1/32, or 0.03125). Then you'll be able to fit two decimal values into one 32-bit space.
Just remember the extra bit manipulation you have to do for this is likely to have a small performance impact on your application.
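As a rough illustration of the arithmetic (in C rather than LabVIEW, and assuming the 11-bit integer / 5-bit fraction split suggested above; the function names are just placeholders):

#include <stdint.h>

/* Pack a value into 16 bits: an 11-bit integer part plus a 5-bit fraction
   in steps of 1/32, i.e. multiply by 32 and round. The assumed range is
   0 to 2047.96875, which covers the 0.00 - 1000.00 requirement. */
static uint16_t pack16(float x)
{
    return (uint16_t)(x * 32.0f + 0.5f);
}

static float unpack16(uint16_t v)
{
    return v / 32.0f;      /* undo the scaling */
}

/* Two 16-bit fixed-point values share one 32-bit word. */
static uint32_t pack_pair(float a, float b)
{
    return ((uint32_t)pack16(a) << 16) | pack16(b);
}

static void unpack_pair(uint32_t w, float *a, float *b)
{
    *a = unpack16((uint16_t)(w >> 16));
    *b = unpack16((uint16_t)(w & 0xFFFF));
}

In LabVIEW the equivalent would be scaling, rounding and then joining/splitting the two 16-bit halves with the integer and bit-manipulation primitives.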

First of all, it would be good general advice to double-check that you've correctly understood how LabVIEW stores data in memory and whether any of your VIs are using more memory than they need to.
If you still need to squeeze this data into the minimum space, you could do something like:
Instead of a 1D array of n values, use a 2D array of ceiling(n/16) x 17 U16s. Each U16 is going to hold one bit from each of 16 of your data values.
To read value m from the array, get the 17 U16s from row m/16 of the array and get bit (m MOD 16) from each U16, then combine them to create the value you need.
To write to the array, get the relevant 17 U16s, replace the relevant bit of each with the bits representing the new value, and replace the changed U16s in the array.
I guess this won't be fast but maybe you can optimise it for the particular operations you need to do on this data.
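Here is a rough C sketch of that bit-transposed layout (the 17-bit width and the 16-values-per-row grouping are just carried over from the description above; the names are made up for illustration):

#include <stdint.h>

#define BITS_PER_VALUE 17   /* each stored value is 17 bits wide */
#define VALUES_PER_ROW 16   /* each U16 holds one bit from 16 values */

/* rows[r][k] holds bit k of each of the 16 values in row r */
typedef uint16_t Row[BITS_PER_VALUE];

/* Read packed value m from the table. */
uint32_t read_value(Row *rows, int m)
{
    int row = m / VALUES_PER_ROW;
    int col = m % VALUES_PER_ROW;
    uint32_t v = 0;
    for (int k = 0; k < BITS_PER_VALUE; k++)
        v |= (uint32_t)((rows[row][k] >> col) & 1u) << k;
    return v;
}

/* Write packed value v into slot m of the table. */
void write_value(Row *rows, int m, uint32_t v)
{
    int row = m / VALUES_PER_ROW;
    int col = m % VALUES_PER_ROW;
    for (int k = 0; k < BITS_PER_VALUE; k++) {
        if ((v >> k) & 1u)
            rows[row][k] |= (uint16_t)(1u << col);
        else
            rows[row][k] &= (uint16_t)~(1u << col);
    }
}

In LabVIEW the same read/write would be done with array index/replace and the logical shift/AND/OR primitives, operating on a whole row at a time.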
Alternatively, could you perhaps use some sort of data compression? I imagine that would work best if you can organise the data into 'pages' containing some set number of values. For example, you could take a 1D array of SGL, flatten it to a string, then apply the compression to the string, and store the compressed string in a string array. I believe OpenG includes zip tools, for example.

Related

Look up table with fixed sized bit arrays as keys

I have a series of fixed-size arrays of binary values (individuals from a genetic algorithm) that I would like to associate with a floating point value (fitness value). Such a lookup table would have a fairly large size, constrained by available memory. Due to the nature of the keys, is there a hash function that would guarantee no collisions? I tried a few things but they result in collisions. What other data structure could I use to build this lookup system?
To answer your questions:
There is no hash function that guarantees no collisions unless you make a hash function that completely encodes the bit array, meaning that given the hash you can reconstruct the bit array. This type of function would be a compression function. If your arrays have a lot of redundant information (for example most of the values are zeros), compressing them could be useful to reduce the total size of the lookup table.
A question on compressing bit array in C is answered here: Compressing a sparse bit array
Since you have most of the bits set to zero, the easiest solution would be to just write a function that converts your bit array in an integer array that keeps track of the positions of the bits that are set to '1'. Then write a function that does the opposite if you need the bit array again. You can save in the hashmap only the encoded array.
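A minimal sketch of that encoding in C (assuming the bit array lives in a plain byte buffer; the function names are just for illustration):

#include <stddef.h>
#include <stdint.h>

/* Collect the positions of the '1' bits. Returns how many were found;
   positions[] must have room for the worst case (nbits entries). */
size_t encode_sparse(const uint8_t *bits, size_t nbits, uint32_t *positions)
{
    size_t count = 0;
    for (size_t i = 0; i < nbits; i++)
        if (bits[i / 8] & (1u << (i % 8)))
            positions[count++] = (uint32_t)i;
    return count;
}

/* Rebuild the bit array from the stored positions. */
void decode_sparse(const uint32_t *positions, size_t count,
                   uint8_t *bits, size_t nbits)
{
    for (size_t i = 0; i < (nbits + 7) / 8; i++)
        bits[i] = 0;                        /* clear everything first */
    for (size_t i = 0; i < count; i++)
        bits[positions[i] / 8] |= (uint8_t)(1u << (positions[i] % 8));
}

The hashmap then stores only the (much shorter) position list, as long as the arrays really are sparse.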
Another option to reduce the total size of the lookup table is to erase the old values. Since you are using a genetic algorithm, the population should change over time and old values should become useless, you could periodically remove the older values from the lookup table.

C - Difference between bitset vector and bloom filter

So I understand that bitset vectors can essentially store true/false sets for you in each bit; however, I'm confused as to the difference between that and a bloom filter. I understand bloom filters make use of hashing functions and can return false positives, but what's the actual difference in the type of data they can store and the functions they can perform?
Bitset vectors are simply a large field of an arbitrary number of bits that can be set individually, using their index.
A bloom filter is a kind of set (not containing the data itself) that allows you to decide quickly whether an element is contained in the set or not. It is built on top of some kind of bitset vector, setting several of the latter's bits to 1 when inserting elements, or reading them to check whether an element is contained (without giving you direct access to its underlying bitset vector).
A bloom filter can be implemented using a bitset, but a bitset can not be implemented using a bloom filter.
(I know this is an old post but posting anyway for future readers.)
Think of a bitset as a lossless compression of a boolean array.
Say you have to store 32 boolean values in a collection.
Your immediate intuition is to use an array or a list of booleans.
Say every item in that array needs a byte of space, so the array adds up to 32 bytes.
But you see, every item is a byte so it is 8 bits.
Although just 1 bit is enough to say true or false, you are using 8 bits for that item.
Since there are 32 bytes, we end up using 32 * 8 = 256 bits.
Then the question arises, "Why don't we use a single 32-bit number to store 1 or 0 in every bit corresponding to those 32 items?"
That is nothing but a bitset - you will use just 32 bits as opposed to 256 bits to store the same information. Here, you can tell whether an item is present or not by checking its corresponding bit position. This kind of storage helps everywhere, particularly in memory-intensive code in embedded systems, high-end games with a lot of graphics/math operations per second, and specific applications in machine learning too.
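For example, a minimal bitset over a single 32-bit word might look like this in C (the 32-item limit is just for illustration; larger bitsets use an array of words):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t flags = 0;                 /* 32 booleans in 4 bytes */

    flags |= 1u << 5;                   /* mark item 5 as present */
    flags |= 1u << 20;                  /* mark item 20 as present */
    flags &= ~(1u << 5);                /* clear item 5 again */

    int present = (flags >> 20) & 1u;   /* test item 20 */
    printf("item 20 present: %d\n", present);
    return 0;
}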
Think of a Bloom filter as a lossy compression of a boolean array, or a lossy compression of a bitset, depending on which underlying data structure you want to implement it with.
Here, it is lossy because you can never know for certain whether a given item is present, BUT you can definitely say when it is absent!
It uses a hash function to set certain bits of the bitset to 1 for a given input. For another input, some other bits will be set. The catch is that there can be common bits which are set for two different inputs. Because of this, you cannot say an item is present, because the common bits might have been set by multiple items. However, if even a single one of the bits for a given input is not set, you can definitely say it is absent. That is why I call this lossy.
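A toy C sketch of that idea (the two hash functions, the 256-bit size and all the names here are arbitrary choices for illustration, not anything prescribed by a real Bloom filter library):

#include <stdint.h>

#define FILTER_BITS 256
static uint8_t filter[FILTER_BITS / 8];

/* Two cheap, arbitrary hash functions, for illustration only. */
static unsigned hash1(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % FILTER_BITS;
}

static unsigned hash2(const char *s)
{
    unsigned h = 0;
    while (*s) h = h * 131 + (unsigned char)*s++;
    return h % FILTER_BITS;
}

static void set_bit(unsigned i) { filter[i / 8] |= (uint8_t)(1u << (i % 8)); }
static int  get_bit(unsigned i) { return (filter[i / 8] >> (i % 8)) & 1; }

void bloom_add(const char *s)
{
    set_bit(hash1(s));
    set_bit(hash2(s));
}

/* 0 means "definitely absent"; 1 only means "possibly present". */
int bloom_maybe_contains(const char *s)
{
    return get_bit(hash1(s)) && get_bit(hash2(s));
}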
Hope this is what you are looking for.

In which format does a variant array store values internally?

I have an amount column which has a Number format.
I declare a 2-dimensional array of type Variant; in the first dimension I store the currency (e.g. GBP, USD) and in the other dimension I store the amount (e.g. 1234.22 or -1567.69):
myArray(1, 0) = "GBP"
myArray(1, 1) = -1234.12
myArray(2, 0) = "GBP"
myArray(2, 1) = 1234.12
I am summing myArray(1, 1) and myArray(2, 1). While summing, it seems to treat the format as General/Text instead of Number (which is my column format), and the sum is non-zero whereas ideally it should be 0.
Please suggest, how do I handle this scenario?
To understand that, you will need to understand exactly what a VARIANT is in VBA and exactly what an ARRAY is.
Arrays:
Starting with arrays, a VBA array is actually not really an array of memory locations but a data structure called SAFEARRAY which includes details as shown in the listing below (source):
typedef struct tagSAFEARRAY {
USHORT cDims;
USHORT fFeatures;
ULONG cbElements;
ULONG cLocks;
PVOID pvData;
SAFEARRAYBOUND rgsabound[1];
} SAFEARRAY, *LPSAFEARRAY;
So you can see that this structure has a pointer to where the data actually is, the number of elements, the number of dimensions, and so on. Because of that, VBA is able to ensure that, by using its arrays, you will not accidentally mess up some memory location that is not to be disturbed.
Variants:
With that out of the way, you need to understand what exactly a VARIANT is. VARIANT is also not really a primitive data type but a data structure, which makes it possible to handle multiple data types easily.
Details of the structure can be found with a simple search, but the layout is simple:
Total data structure size: 16 bytes
2 bytes: Information about the data type
6 bytes: Reserved bytes (set to 0)
8 bytes: Contain the actual data
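In C terms, a simplified view of that layout might look like the sketch below (the real definition in oaidl.h has many more union members; the name here is just for illustration):

/* Simplified sketch of the OLE Automation VARIANT layout (16 bytes). */
typedef struct tagVARIANT_SKETCH {
    unsigned short vt;          /* 2 bytes: which data type is stored */
    unsigned short wReserved1;  /* 6 bytes of reserved fields */
    unsigned short wReserved2;
    unsigned short wReserved3;
    union {                     /* 8 bytes: the value itself (or a pointer) */
        long long llVal;
        double    dblVal;
        void     *byref;
    } u;
} VARIANT_SKETCH;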
Hence when you do a VarType the first two bytes are obtained and that is how the interpreter knows what data type is being used. See here for more details.
So you can understand now what a SAFEARRAY of VARIANT data is.
Finally, the problem in the question:
That has nothing to do with the Variant and everything to do with floating point math. Floating point numbers are not stored exactly as you think they are.
E.g. 2.323 will not be stored as 2.323 but rather as something like 2.322999999999999999999
This rounding error will eventually cause trouble (leading to the entire study of stable and unstable methods, etc.) unless you are very careful about the way you handle this quantization of sorts.
Some algorithms will be such that the errors cancel out, and in some they add up.
So, if you are looking for exact calculations, you need to use a fixed-point data type which might be better suited to your problem domain (e.g. Currency might help in some precision-sensitive financial calculations).
The Solution:
The Currency data type is a 64-bit data type; internally it's like a very long integer scaled by 10,000. So up to 4 decimal places and 15 digits before the decimal point can be accurately represented.
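The same effect is easy to reproduce outside VBA. Here is a small C illustration of the difference (the scaling by 10,000 mirrors what Currency does, per the description above; everything else is just example code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Binary floating point: 0.1 has no exact representation,
       so ten of them do not add up to exactly 1.0. */
    double f = 0.0;
    for (int i = 0; i < 10; i++)
        f += 0.1;
    printf("float sum:  %.17f\n", f);     /* something like 0.99999999999999989 */

    /* Scaled 64-bit integer: store values as counts of 1/10000. */
    int64_t c = 0;
    for (int i = 0; i < 10; i++)
        c += 1000;                        /* 0.1 is exactly 1000 * (1/10000) */
    printf("scaled sum: %lld.%04lld\n",
           (long long)(c / 10000), (long long)(c % 10000));   /* exactly 1.0000 */
    return 0;
}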

Use a hashtable or array as bitmap when checking for characters

Say I want to have some kind of a bitmap to know the number of times a particular char appears in a string.
So, for example, if I read the string "abracadabra", I would have a data structure that would look something like this:
a -> 5
b -> 2
r -> 2
c -> 1
d -> 1
I have read a book (Programming Interviews Exposed) that says the following:
Hashtables have a higher lookup overhead than arrays.
An array would need an element for every possible character.
A hashtable would need to store just the characters that actually appear in the string.
Therefore:
Arrays are a better choice for long strings with a limited set of possible characters and hash tables are more efficient for shorter strings or when there are many possible character values.
I don't understand why:
-> Hashtables have a higher lookup overhead than arrays? Why is that?
An array is an extremely simple data structure. In memory, it is a simple contiguous block. Say each item in the array is four bytes, and the array has room for 100 elements. Then the array is simply 400 contiguous bytes in memory, and your variable assigned to the array is a pointer to the first element. Say this is at location 10000 in memory.
When you access element #3 of the array, like this:
myarray[3] = 17;
...what happens is very simple: 3 multiplied by the element size (4 bytes) is added to the base pointer. In this example it's 10000 + 3 * 4 = 10012. Then you simply write to the 4 bytes located at address 10012. Trivially simple math.
A hashtable is not an elementary data structure. It could be implemented in various ways, but a simple one might be an array of 256 lists. Then when you access the hashtable, first you have to calculate the hash of your key, then look up the right list in the array, and finally walk along the list to find the right element. This is a much more complicated process.
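A stripped-down sketch of that kind of chained hashtable in C (the 256-bucket count matches the description above; the hash function, fixed key size and names are arbitrary):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256

struct entry {
    char key[16];
    int value;
    struct entry *next;     /* chain of entries that hashed to this bucket */
};

static struct entry *buckets[NBUCKETS];

static unsigned hash(const char *key)
{
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return h % NBUCKETS;
}

/* Lookup: hash the key, pick the bucket, then walk the chain. */
int *lookup(const char *key)
{
    for (struct entry *e = buckets[hash(key)]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return &e->value;
    return NULL;
}

/* Insert: prepend a new entry to the bucket's chain (error handling omitted). */
void insert(const char *key, int value)
{
    int *existing = lookup(key);
    if (existing) { *existing = value; return; }
    struct entry *e = malloc(sizeof *e);
    strncpy(e->key, key, sizeof e->key - 1);
    e->key[sizeof e->key - 1] = '\0';
    e->value = value;
    unsigned b = hash(key);
    e->next = buckets[b];
    buckets[b] = e;
}

Compare the work per access here with the single multiply-and-add of the array case above.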
A simple array is always going to be faster than a hashtable. What the text you cite is getting at is that if the data is very sparse... you might need a very large array to do this simple calculation. In that case you could use a lot less memory space to hold the hash table.
Consider if your characters were Unicode -- two bytes each. That's 65536 possible characters. And say you're only talking about strings with 256 or fewer characters. To count those characters with an array, you would need to make an array with 64K elements, one byte each... taking 64K of memory. The hashtable on the other hand, implemented like I mentioned above, might take only 4*256 = 1024 bytes for the array of list pointers, and then 5-8 bytes per list element. So if you were processing a 256-character string with, say, 64 unique Unicode characters used, it would take up a total of at most around 1.5K. Under these conditions, the hashtable would be using much less memory. But it's always going to be slower.
Finally, in the simple case you show, you're probably just talking about the Latin alphabet, so if you force lowercase, you could have an array with just 26 elements, and make each counter as large as you want so you can count as many characters as you'll need. Even if a count could reach 4 billion, you would need just a 26 * 4 = 104 byte array. So that's definitely the way to go here.
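For the lowercase Latin case, that array-of-26 approach is only a few lines of C (the "abracadabra" example is taken from the question):

#include <stdio.h>
#include <ctype.h>
#include <stdint.h>

int main(void)
{
    const char *s = "abracadabra";
    uint32_t counts[26] = {0};              /* one 4-byte counter per letter */

    for (const char *p = s; *p; p++)
        if (islower((unsigned char)*p))
            counts[*p - 'a']++;

    for (int i = 0; i < 26; i++)
        if (counts[i] > 0)
            printf("%c -> %u\n", 'a' + i, counts[i]);
    return 0;
}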
Hashtables have a higher lookup overhead than arrays? Why is that?
When using an array for character counting, the access is direct:
counter[c]++;
A hashtable, on the other hand, is a (more complex) data structure, where first a hash function must be calculated, and then a second function must reduce the hash code to a hash table position.
If the table position is already in use, additional action has to be taken.
I personally think that as long as your characters are in the single-byte range (0-255), the array approach is always faster and better suited. If it comes to Unicode characters (which in Java are the default in Strings), then the hashtable is more appropriate.
Hashtables have a higher lookup overhead than arrays? Why is that?
Because they have to calculate the hash from the key before they can look anything up.
In contrast, arrays have O(1) lookup time. To access a value in an array, calculating the offset and returning the element at that offset is typically enough; this works in constant time.

SQL Server data types: Store 8-digit unsigned number as INT or as CHAR(8)?

I think the title says everything.
Is it better (faster, more space-saving in terms of memory and disk) to store 8-digit unsigned numbers as an INT or as a CHAR(8) type?
Would I get into trouble if the number grows to 9 digits in the future and I use a fixed char length?
Background info: I want to store TACs.
Thanks
Given that TACs can have leading zeroes, that they're effectively an opaque identifier, and are never calculated with, use a char column.
Don't start optimizing for space before you're sure you've modelled your data types correctly.
Edit
But to avoid getting junk in there, make sure you apply a CHECK constraint also. E.g. if it's meant to be 8 digits, add:
CONSTRAINT CK_Only8Digits CHECK (not TAC like '%[^0-9]%' and LEN(RTRIM(TAC)) = 8)
If it is a number, store it as a number.
Integers are stored using 4 bytes, giving them the range:
-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)
So, suitable for your needs.
char(8) will be stored as 8 bytes, so double the storage, and of course it suffers from the need to expand in the future (converting almost 10M records from 8 to 9 chars will take time and will probably require taking the database offline for the duration).
So, from storage, speed, memory and disk usage (all related to the number of bytes used for the datatype), readability, semantics and future proofing, int wins hands down on all.
Update
Now that you have clarified that you are not storing numbers, I would say that you will have to use char in order to preserve the leading zeroes.
As for the issue with future expansion - since char is a fixed-length field, changing from char(8) to char(9) would not lose information. However, I am not sure whether the additional character will be added on the right or the left (though this is possibly undetermined). You will have to test, and once the field has been expanded you will need to ensure that the original data has been preserved.
A better way may be to create a new char(9) field, migrate all the char(8) data to it (to keep things reliable and consistent), then remove the char(8) field and rename the new field to the original name. Of course this would ruin all statistics on the table (but so would expanding the field directly).
An int will use less memory space and give faster indexing than a char.
If you need to take these numbers apart -- search for everything where digits 3-4 are "02" or some such -- char would be simpler and probably faster.
I gather you're not doing arithmetic on them. You'd not be adding two TACs together or finding the average TAC for a set of records or anything like that. If you were, that would be a slam-dunk argument for using int.
If they have leading zeros, it's probably easier to use char so you don't always have to pad the number with zeros to the correct length.
If none of the above applies, it doesn't matter much. I'd probably use char. I can't think of a compelling reason to go either way.
Stick to INT for this one, DEFINITELY INT (OR BIGINT).
Have a look at int, bigint, smallint, and tinyint (Transact-SQL)
int: integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647). Storage size is 4 bytes.
bigint: integer (whole number) data from -2^63 (-9,223,372,036,854,775,808) through 2^63 - 1 (9,223,372,036,854,775,807). Storage size is 8 bytes.
compared to
char and varchar: fixed-length non-Unicode character data with a length of n bytes. n must be a value from 1 through 8,000. Storage size is n bytes.
Also, once you query against this, you will get degraded performance if you compare int values against a char column, as SQL Server will have to do a cast for you...
