Compact data structure for sorted array - arrays

I have a table with sorted numbers like:
1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
:
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312
Given the record number I need the value in O(log n) time. The record number is 64-bit long and there are no missing record numbers. The values are 64-bit long, they are sorted and value(n) < value(n+1).
The obvious solution is simply doing an array and use the records number as index. This will cost 64-bit per value.
But I would like a more space efficient way of doing that. Since we know the values are always increasing that should be doable, but I do not remember a data structure that lets me do that.
A solution would be to use deflate on the array, but that will not give me O(log n) for accessing an element - thus unacceptable.
Do you know of a data structure that will give me:
O(log n) for access
space requirement < 64-bit/value
= Edit =
Since we know all numbers in advance we could find the difference between each number. By taking the 99th percentile of these differences we will get a relatively modest number. Taking the log2 will give us the number of bits needed to represent modest number - let us call that modest-bits.
Then create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
Then a delta table for all records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
modest-bits difference to record k*1024 will always be 0, so we can use that for signaling. If it is non-zero, then the following 64-bit will be a pointer to a simple array for the next 1024 records as 64-bit values.
As the modest value is chosen as the 99th percentile number, that will at most happen 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)
lookup: O(1 + 1024)
(99% and 1024 may have to be adjusted)
= Edit2 =
Based on the idea above, but wasting less space. Create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
And for all value that cannot be represented by modest-bits create big-value table as a tree:
64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value
Then a delta table for all records, that is reset for every 1024 records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
but also reset for every value that is in the big-value table.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit).
Lookup requires searching big-value table, then looking up the 1024'th value and finally summing up the modest-bits values.
lookup: O(log(big-value table) + 1 + 1024) = O(log n)
Can you improve this? Or do better in a different way?

OP proposes splitting numbers into blocks (only once). But this process may be continued. Split every block once more. And again... Finally we might get a binary trie.
Root node contains value of the number with least index. Its right descendant stores difference between the middle number in the table and the number with least index: d = A[N/2] - A[0] - N/2. This is continued for other right descendants (red nodes on diagram). Leaf nodes contain deltas from preceding numbers: d = A[i+1] - A[i] - 1.
So most of the values, stored in trie, are delta values. Each of them occupies less than 64 bits. And for compactness they may be stored as variable-bit-length numbers in a bit stream. To get length of each number and to navigate in this structure in O(log N) time, bit stream should also contain lengths of (some) numbers and (some) subtrees:
Each node contains length (in bits) of its left sub-tree (if it has one).
Each right descendant (red nodes on diagram), except leaf nodes, contains length (in bits) of its value. Leaf node's length may be calculated from other lengths on the path from root to this node.
Each right descendant (red nodes on diagram) contains difference of correspondent value and the value of nearest "red" node up the path.
All nodes are packed in bit stream, starting from root node, in-order: left descendant always follows its ancestor; right descendant follows sub-tree, rooted by left descendant.
To access element given its index, use index's binary representation to follow path in the trie. While traversing this path, add together all values of "red" nodes. Stop when no more non-zero bits are left in the index.
There are several options to store N/2 value lengths:
Allocate as many bits for each length as needed to represent all values from the largest length to somewhere below mean length (excluding some very short outliers).
Also exclude some long outliers (keep them in a separate map).
Since lengths may be not evenly distributed, it's reasonable to use Huffman encoding for value lengths.
Either fixed length or Huffman encodings should be different for each trie depth.
N/4 subtree lengths are, in fact, value lengths, because N/4 smallest subtrees contain a single value.
Other N/4 subtree lengths may be stored in words of fixed (predefined) length, so that for large subtrees we know only approximate (rounded up) lengths.
For 230 full-range 64-bit numbers we have to pack approximately 34-bit values, for 3/4 nodes, approx. 4-bit value lengths, and for every fourth node, 10-bit subtree lengths. Which saves 34% space.
Example values:
0 320102
1 5200100
2 92010023
3 112010202
4 332020201
5 332020411
6 3833240522044511
7 3833240522089999
8 4000000000213312
Trie for these values:
root d=320102 vl=19 tl=84+8+105+4+5=206
+-l tl=75+4+5=84
| +-l tl=23
| | +-l
| | | +-r d=4879997 (vl=23)
| | +-r d=91689919 vl=27
| | +-r d=20000178 (vl=25)
| +-r d=331700095 vl=29 tl=8
| +-l
| | +-r d=209 (vl=8)
| +-r d=3833240190024308 vl=52
| +-r d=45487 (vl=16)
+-r d=3999999999893202 vl=52
Value length encoding:
bits start end
Root 0 19 19
depth 1 0 52 52
depth 2 0 29 29
depth 3 5 27 52
depth 4 4 8 23
Sub-tree lengths need 8 bits each.
Here is encoded stream (binary values still shown in decimal for readability):
bits value comment
19 320102 root value
8 206 left subtree length of the root
8 84 left subtree length
4 15 smallest left subtree length (with base value 8)
23 4879997 value for index 1
5 0 value length for index 2 (with base value 27)
27 91689919 value for index 2
25 20000178 value for index 3
29 331700095 value for index 4
4 0 smallest left subtree length (with base value 8)
8 209 value for index 5
5 25 value length for index 6 (with base value 27)
52 3833240190024308 value for index 6
16 45487 value for index 7
52 3999999999893202 value for index 8
Altogether 285 bits or 5 64-bit words. We also need to store bits/start values from value length encoding table (350 bits). To store 635 bits we need 10 64-bit words, which means such a small number table cannot be compressed. For larger number tables, size of value length encoding table is negligible.
To search a value for index 7, read root value (320102), skip 206 bits, add value for index 4 (331700095), skip 8 bits, add value for index 6 (3833240190024308), add value for index 7 (45487), and add index (7). The result is 3 833 240 522 089 999, as expected.

I would do it in blocks, as you outline in your question. Pick a block size k, where you can accept having to decode on average k/2 values before getting to the one you're after. For the n total values, you will have n/k blocks. A table with n/k entries would point into the data stream to find the starting point of each block. Finding where to go in that table would be O(log(n/k)) for a binary search, or if the table is small enough and if it matters, you could make it about O(1) with an auxiliary hash table.
Each block would start with a starting 64-bit value. All values after that would be stored as deltas from the preceding value. My suggestion is to store those deltas as a Huffman code that says how many bits are in the next value, followed by that many bits. The Huffman code would be optimized for each block, and a description of that code would be stored at the beginning of the block.
You could simplify that by just preceding each value with six bits having the number of bits following, in the range of 1..64, effectively a flat Huffman code. Depending on the histogram of the bit lengths, an optimized Huffman code could knock off a good number of bits compared to the flat code.
Once you have this set up, you can experiment with k and see how small you can make it and still have limited impact on the compression.

I do not know of a data structure that does that.
The obvious solution to gain space and not loose too much speed would be to create your own structure with different array size based on the different int size you store.
Pseudo-code
class memoryAwareArray {
array16 = Int16[] //2 bytes
array32 = Int32[] //4 bytes
array64 = Int64[] //8 bytes
max16Index = 0;
max32Index = 0;
addObjectAtIndex(index, value) {
if (value < 65535) {
array16[max16Index] = value;
max16Index++;
return;
}
if (value < 2147483647) {
array32[max32Index] = value;
max32Index++;
return;
}
array64[max64Index] = value;
max64Index++;
}
getObject(index) {
if (index < max16Index) return(array16[index]);
if (index < max32Index) return(array32[index-max16Index]);
return(array64[index-max16Index-max32Index]);
}
}
Something along those lines shouldn't alter to much the speed and you'd save around 7 gigas if you filled up the entire structure. You won't save as much since you have gaps beetween your values of course.

Related

Change the minimum number of entries in an array so that the sum of any k consecutive items is even

We are given an array of integers. We have to change the minimum number of those integers however we'd like so that, for some fixed parameter k, the sum of any k consecutive items in the array is even.
Example:
N = 8; K = 3;
A = {1,2,3,4,5,6,7,8}
We can change 3 elements (4th,5th,6th)
so the array can be {1,2,3,5,6,7,7,8}
then
1+2+3=6 is even
2+3+5=10 is even
3+5+6=14 is even
5+6+7=18 is even
6+7+7=20 is even
7+7+8=22 is even
There's a very nice O(n)-time solution to this problem that, at a high level, works like this:
Recognize that determining which items to flip boils down to determining a pattern that repeats across the array of which items to flip.
Use dynamic programming to determine what that pattern is.
Here's how to arrive at this solution.
First, some observations. Since all we care about here is whether the sums are even or odd, we actually don't care about the numbers' exact values. We just care about whether they're even or odd. So let's begin by replacing each number with either 0 (if the number is even) or 1 (if it's odd). Now, our task is to make each window of k elements have an even number of 1s.
Second, the pattern of 0s and 1s that results after you've transformed the array has a surprising shape: it's simply a repeated copy of the first k elements of the array. For example, suppose k = 5 and we decide that the array should start off as 1 0 1 1 1. What must the sixth array element be? Well, in moving from the first window to the second, we dropped a 1 off the front of the window, changing the parity to odd. We therefore have to have the next array element be a 1, which means that the sixth array element must be a 1, equal to the first array element. The seventh array element then has to be a 0, since in moving from the second window to the third we drop off a zero. This process means that whatever we decide on for the first k elements turns out to determine the entire final sequence of values.
This means that we can reframe the problem in the following way: break the original input array of n items into n/k blocks of size k. We're now asked to pick a sequence of 0s and 1s such that
this sequence differs in as few places as possible from the n/k blocks of k items each, and
the sequence has an even number of 1s.
For example, given the input sequence
0 1 1 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
and k = 3, we would form the blocks
0 1 1, 0 1 1, 1 0 0, 1 0 1, 1 1 0, 1 1 1
and then try to find a pattern of length three with an even number of 1s in it such that replacing each block with that pattern requires the fewest number of edits.
Let's see how to take that problem on. Let's work one bit at a time. For example, we can ask: what's the cost of making the first bit a 0? What's the cost of making the first bit a 1? The cost of making the first bit a 0 is equal to the number of blocks that have a 1 at the front, and the cost of making the first bit a 1 is equal to the number of blocks that have a 0 at the front. We can work out the cost of setting each bit, individually, to either to zero or to one. That gives us a matrix like this one:
Bit #0 Bit #1 Bit #2 Bit #3 ... Bit #k-1
---------------------+--------+--------+--------+--------+--------+----------
Cost of setting to 0 | | | | | | |
Cost of setting to 1 | | | | | | |
We now need to choose a value for each column with the goal of minimizing the total cost picked, subject to the constraint that we pick an even number of bits to be equal to 1. And this is a nice dynamic programming exercise. We consider subproblems of the form
What is the lowest cost you can make out of the first m columns from the table, provided your choice has parity p of items chosen from the bottom row?
We can store this in an (k + 1) × 2 table T[m][p], where, for example, T[3][even] is the lowest cost you can achieve using the first three columns with an even number of items set to 1, and T[6][odd] is the lowest cost you can achieve using the first six columns with an odd number of items set to 1. This gives the following recurrence:
T[0][even] = 0 (using zero columns costs nothing)
T[0][odd] = ∞ (you cannot have an odd number of bits set to 1 if you use no colums)
T[m+1][p] = min(T[m][p] + cost of setting this bit to 0, T[m][!p] + cost of setting this bit to 1) (either use a zero and keep the same parity, or use a 1 and flip the parity).
This can be evaluated in time O(k), and the resulting minimum cost is given by T[n][even]. You can use a standard DP table walk to reconstruct the optimal solution from this point.
Overall, here's the final algorithm:
create a table costs[k+1][2], all initially zero.
/* Populate the costs table. costs[m][0] is the cost of setting bit m
* to 0; costs[m][1] is the cost of setting bit m to 1. We work this
* out by breaking the input into blocks of size k, then seeing, for
* each item within each block, what its parity is. The cost of setting
* that bit to the other parity then increases by one.
*/
for i = 0 to n - 1:
parity = array[i] % 2
costs[i % k][!parity]++ // Cost of changing this entry
/* Do the DP algorithm to find the minimum cost. */
create array T[k + 1][2]
T[0][0] = 0
T[0][1] = infinity
for m from 1 to k:
for p from 0 to 1:
T[m][p] = min(T[m - 1][p] + costs[m - 1][0],
T[m - 1][!p] + costs[m - 1][1])
return T[m][0]
Overall, we do O(n) work with our initial pass to work out the costs of setting each bit, independently, to 0. We then do O(k) work with the DP step at the end. The overall work is therefore O(n + k), and assuming k ≤ n (otherwise the problem is trivial) the cost is O(n).

Why is there a difference in precision range widths for decimal?

As is evident by the MSDN description of decimal certain precision ranges have the same amount of storage bytes assigned to them.
What I don't understand is that there are differences in the sizes of the range. How the range from 1 to 9 of 5 storage bytes has a width of 9, while the range from 10 to 19 of 9 storage bytes has a width of 10. Then the next range of 13 storage bytes has a width of 9 again, while the next has a width of 10 again.
Since the storage bytes increase by 4 every time, I would have expected all of the ranges to be the same width. Or maybe the first one to be smaller to reserve space for the sign or something but from then on equal in width. But it goes from 9 to 10 to 9 to 10 again.
What's going on here? And if it would exist, would 21 storage bytes have a precision range of 39-47 i.e. is the pattern 9-10-9-10-9-10...?
would 21 storage bytes have a precision range of 39-47
No. 2 ^ 160 = 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 - which has 49 decimal digits. So this hypothetical scenario would cater for a precision range of 39-48 (as a 20 byte integer would not be big enough to hold any 49 digit numbers larger than that)
The first byte is reserved for the sign.
01 is used for positive numbers; 00 for negative.
The remainder stores the value as an integer. i.e. 1.234 would be stored as the integer 1234 (or some multiple of 10 of this dependant on the declared scale)
The length of the integer is either 4, 8, 12 or 16bytes depending on the declared precision. Some 10 digit integers can be stored in 4 bytes however to get the whole range in would overflow this so it needs to go to the next step up.
And so on.
2^32 = 4,294,967,295 (10 digits)
2^64 = 18,446,744,073,709,551,616 (20 digits)
2^96 = 79,228,162,514,264,337,593,543,950,336 (29 digits)
2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 (39 digits)
You need to use DBCC PAGE to see this, casting the column as binary does not give you the storage representation. Or use a utility like SQL Server internals viewer.
CREATE TABLE T(
A DECIMAL( 9,0),
B DECIMAL(19,0),
C DECIMAL(28,0) ,
D DECIMAL(38,0)
);
INSERT INTO T VALUES
(999999999, 9999999999999999999, 9999999999999999999999999999, 99999999999999999999999999999999999999),
(-999999999, -9999999999999999999, -9999999999999999999999999999, -99999999999999999999999999999999999999);
Shows the first row stored as
And the second as
Note that the values after the sign bit are byte reversed. 0x3B9AC9FF = 999999999

Integer compression method

How can I compress a row of integers into something shorter ?
Like:
Input: '1 2 4 5 3 5 2 3 1 2 3 4' -> Algorithm -> Output: 'X Y Z'
and can get it back the other way around? ('X Y Z' -> '1 2 4 5 3 5 2 3 1 2 3 4')
Note:Input will only contain numbers between 1-5 and the total string of number will be 10-16
Is there any way I can compress it to 3-5 numbers?
Here is one way. First, subtract one from each of your little numbers. For your example input that results in
0 1 3 4 2 4 1 2 0 1 2 3
Now treat that as the base-5 representation of an integer. (You can choose either most significant digit first or last.) Calculate the number in binary that means the same thing. Now you have a single integer that "compressed" your string of little numbers. Since you have shown no code of your own, I'll just stop here. You should be able to implement this easily.
Since you will have at most 16 little numbers, the maximum resulting value from that algorithm will be 5^16 which is 152,587,890,625. This fits into 38 bits. If you need to store smaller numbers than that, convert your resulting value into another, larger number base, such as 2^16 or 2^32. The former would result in 3 numbers, the latter in 2.
#SergGr points out in a comment that this method does not show the number of integers encoded. If that is not stored separately, that can be a problem, since the method does not distinguish between leading zeros and coded zeros. There are several ways to handle that, if you need the number of integers included in the compression. You could require the most significant digit to be 1 (first or last depends on where the most significant number is.) This increases the number of bits by one, so you now may need 39 bits.
Here is a toy example of variable length encoding. Assume we want to encode two strings: 1 2 3 and 1 2 3 0 0. How the results will be different? Let's consider two base-5 numbers 321 and 00321. They represent the same value but still let's convert them into base-2 preserving the padding.
1 + 2*5 + 3*5^2 = 86 dec = 1010110 bin
1 + 2*5 + 3*5^2 + 0*5^3 + 0*5^4 = 000001010110 bin
Those additional 0 in the second line mean that the biggest 5-digit base-5 number 44444 has a base-2 representation of 110000110100 so the binary representation of the number is padded to the same size.
Note that there is no need to pad the first line because the biggest 3-digit base-5 number 444 has a base-2 representation of 1111100 i.e. of the same length. For an initial string 3 2 1 some padding will be required in this case as well, so padding might be required even if the top digits are not 0.
Now lets add the most significant 1 to the binary representations and that will be our encoded values
1 2 3 => 11010110 binary = 214 dec
1 2 3 0 0 => 1000001010110 binary = 4182 dec
There are many ways to decode those values back. One of the simplest (but not the most efficient) is to first calculate the number of base-5 digits by calculating floor(log5(encoded)) and then remove the top bit and fill the digits one by one using mod 5 and divide by 5 operations.
Obviously such encoding of variable-length always adds exactly 1 bit of overhead.
Its call : polidatacompressor.js but license will be cost you, you have to ask author about prices LOL
https://github.com/polidatacompressor/polidatacompressor
Ncomp(65535) will output: 255, 255 and when you store this in database as bytes you got 2 char
another way is to use "Hexadecimal aka base16" in javascript (1231).toString(16) give you '4cf' in 60% situation it compress char by -1
Or use base10 to base64 https://github.com/base62/base62.js/
4131 --> 14D
413131 --> 1Jtp

How to find subarray between min and max

I have a Sorted array .Lets assume
{4,7,9,12,23,34,56,78} Given min and max I want to find elements in array between min and max in efficient way.
Cases:min=23 and max is 78 op:{23,34,56,78}
min =10 max is 65 op:{12,23,34,56}
min 0 and max is 100 op:{4,7,9,12,23,34,56,78}
Min 30 max= 300:{34,56,78}
Min =100 max=300 :{} //empty
I want to find efficient way to do this?I am not asking code any algorithm which i can use here like DP exponential search?
Since it's sorted, you can easily find the lowest element greater than or equal to the minimum desired, by using a binary search over the entire array.
A binary search basically reduces the serch space by half with each iteration. Given your first example of 10, you start as follows with the midpoint on the 12:
0 1 2 3 4 5 6 7 <- index
4 7 9 12 23 34 56 78
^^
Since the element you're looking at is higher than 10 and the next lowest is lesser, you've found it.
Then, you can use a similar binary search but only over that section from the element you just found to the end. This time you're looking for the highest element less than or equal to the maximum desired.
On the same example as previously mentioned, you start with:
3 4 5 6 7 <- index
12 23 34 56 78
^^
Since that's less than 65 and the following one is also, you need to increase the pointer to the halfway point of 34..78:
3 4 5 6 7 <- index
12 23 34 56 78
^^
And there you have it, because that number is less and the following number is more (than 65)
Then you have the start at stop indexes (3 and 6) for extracting the values.
0 1 2 3 4 5 6 7 <- index
4 7 9 ((12 23 34 56)) 78
-----------
The time complexity of the algorithm is O(log N). Though keep in mind that this really only becomes important when dealing with larger data sets. If your data sets do consist of only about eight elements, you may as well use a linear search since (1) it'll be easier to write; and (2) the time differential will be irrelevant.
I tend not to worry about time complexity unless the operations are really expensive, the data set size gets into the thousands, or I'm having to do it thousands of times a second.
Since it is sorted, this should do:
List<Integer> subarray = new ArrayList<Integer>();
for (int n : numbers) {
if (n >= MIN && n <= MAX) subarray.add(n);
}
It's O(n) as you only look at every number once.

B+ tree index size given size of data

If I create a B+-tree index for the key table(a,b,c), in a database with 2KB pages and using 64 bit pointers, where a,b and c are all of size 4 bytes and the total size of each record is 88 bytes.
What is the range of possible values for the depth of the index if the table has 36,279 rows?
For minimum capacity:
2 * ceiling[n/2]^(d-2) * ceiling[(n-1)/2] = 36279
solved for d gives you 3.5, so depth is 4.
For max capacity:
n^(d-1) * (n-1) = 36279
solved for d gives you 2.3 so depth is 3.
Therefore the answer is 3-4.
Oh, and n is 102.

Resources