B+ tree index size given size of data - database

If I create a B+-tree index on the key (a, b, c) in a database with 2KB pages and 64-bit pointers, where a, b, and c are each 4 bytes and the total size of each record is 88 bytes, what is the range of possible values for the depth of the index if the table has 36,279 rows?

For minimum capacity (root with at least 2 children, internal nodes at least half full, leaves at least half full):
2 * ceiling[n/2]^(d-2) * ceiling[(n-1)/2] = 36279
Solving for d gives 3.5, so the maximum depth is 4.
For maximum capacity (every node completely full):
n^(d-1) * (n-1) = 36279
Solving for d gives 2.3, so the minimum depth is 3.
Therefore the answer is a depth of 3 to 4.
Here n = 102 is the number of pointers per node: each key-pointer pair is a 12-byte key (a, b, and c at 4 bytes each) plus an 8-byte pointer, and floor(2048 / 20) = 102.
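As a sanity check, here is a small C sketch of the same arithmetic (the 2048-byte page, 12-byte composite key, and 8-byte pointer come from the question; the half-full/full node conventions are the usual textbook ones):

#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 102.0, rows = 36279.0;
    /* Minimum capacity: 2 * ceil(n/2)^(d-2) * ceil((n-1)/2) >= rows;
       solving for d gives the maximum possible depth. */
    double d_max = 2.0 + log(rows / (2.0 * ceil((n - 1.0) / 2.0)))
                       / log(ceil(n / 2.0));
    /* Maximum capacity: n^(d-1) * (n-1) >= rows;
       solving for d gives the minimum possible depth. */
    double d_min = 1.0 + log(rows / (n - 1.0)) / log(n);
    printf("d between %.1f and %.1f, so depth 3 to 4\n", d_min, d_max);
    return 0;
}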


Binomial coefficient for very high numbers in C

So the task I have to solve is to calculate the binomial coefficient C(n, k) for 100 >= n > k >= 1 and then say how many of the (n, k) solutions lie over a lower barrier of 123456789.
My formula for calculating the binomial coefficient works fine, but for large n and k approaching 100, C's data types get too small to hold the results.
Do you have any suggestions for how I can get around the data types overflowing?
I thought about dividing by the lower barrier straight away so the numbers don't get too big in the first place, leaving me to just check whether the result is >= 1, but I couldn't make it work.
Say your task is to determine how many binomial coefficients C(n, k) for 1 ≤ k < n ≤ 8 exceed a limit of m = 18. You can do this by using the recurrence C(n, k) = C(n − 1, k) + C(n − 1, k − 1), which can be visualized in Pascal's triangle:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 (20) 15 6 1
1 7 (21 35 35 21) 7 1
1 8 (28 56 70 56 28) 8 1
Start at the top and work your way down. Up to n = 5, everything is below the limit of 18. On the next line, the 20 exceeds the limit. From then on, more and more coefficients are beyond 18.
The triangle is symmetric and strictly increasing in the first half of each row, so you only need to find the first element that exceeds the limit on each line in order to know how many items to count.
You don't have to store the whole triangle; it is enough to keep the last and current lines. Alternatively, you can use the algorithm detailed [in this article][ot] to work your way from left to right on each row. Since you just want to count the coefficients that exceed a limit and don't care about their exact values, the regular integer types are sufficient if you clamp every entry at the limit so the sums cannot overflow.
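For illustration, here is a sketch of that counting approach in C for the original problem size (n up to 100, limit 123456789); entries are clamped at limit + 1, so a 64-bit type never overflows:

#include <stdio.h>

int main(void) {
    const unsigned long long LIMIT = 123456789ULL;
    enum { NMAX = 100 };
    unsigned long long row[NMAX + 1] = { 1 };  /* row 0 of Pascal's triangle */
    unsigned long long count = 0;

    for (int n = 1; n <= NMAX; n++) {
        /* Build row n in place, right to left: C(n,k) = C(n-1,k) + C(n-1,k-1). */
        for (int k = n; k >= 1; k--) {
            row[k] += row[k - 1];
            if (row[k] > LIMIT) row[k] = LIMIT + 1;  /* clamp: only ">LIMIT" matters */
            if (k < n && row[k] > LIMIT) count++;    /* count pairs with 1 <= k < n */
        }
    }
    printf("%llu coefficients exceed %llu\n", count, LIMIT);
    return 0;
}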
First, you'll need a type that can handle the result. The largest number you need to handle is C(100, 50) = 100,891,344,545,564,193,334,812,497,256. This number requires 97 bits of precision, so your normal data types won't do. A quad-precision IEEE float would work if your environment provides one. Otherwise, you'll need some form of high-precision or arbitrary-precision library.
Then, to keep the numbers within this size, you'll want to cancel common terms in the numerator and the denominator, and calculate the result as (a / c) * (b / d) * ... instead of (a * b * ...) / (c * d * ...).
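As a sketch of that multiply-and-divide interleaving: if your compiler is GCC or Clang, unsigned __int128 (an assumption; it is a compiler extension, not standard C) comfortably holds the 97 bits needed, and the division below is exact at every step:

#include <stdio.h>

/* Exact C(n,k) via the multiplicative formula; after step i the running
   product equals C(n-k+i, i), so the division by i never truncates. */
static unsigned __int128 binom(int n, int k) {
    if (k > n - k) k = n - k;               /* symmetry: C(n,k) = C(n,n-k) */
    unsigned __int128 result = 1;
    for (int i = 1; i <= k; i++)
        result = result * (unsigned)(n - k + i) / (unsigned)i;
    return result;
}

int main(void) {
    unsigned __int128 v = binom(100, 50);
    char buf[64];                           /* print by peeling decimal digits */
    int pos = 63;
    buf[pos] = '\0';
    do { buf[--pos] = (char)('0' + (int)(v % 10)); v /= 10; } while (v);
    printf("C(100,50) = %s\n", &buf[pos]);
    return 0;
}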

Determine the largest element at particular index of all array for the first Q indexes

Given several arrays that contain natural numbers less than M. The sizes of the arrays may vary, but their sizes sum to N. Determine the largest element at a particular index across all arrays, for the first Q indexes starting from 0. If an index is not smaller than the size of an array, the index is taken modulo that array's size.
Constraint:
N <= 50000
M <= 32
Q <= 1000000
Example:
N = 6
M = 4
array1 = [0,1,3]
array2 = [2,0]
array3 = [1]
Q = 6
Results = [2,1,3,1,2,3]
Explanation: At index 3, the largest element among array1[3%3], array2[3%2], and array3[3%1] is 1.
I know the O(√N * Q) solution: since the sizes sum to N, there are at most O(√N) distinct sizes, so arrays of equal size can be merged into one array of per-index maxima, and each of the Q indexes then costs one lookup per distinct size. But I need a faster solution. Maybe the small constraint on M (M <= 32) could be exploited to get a faster solution, but I still can't figure it out.
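For reference, here is a minimal C sketch of that baseline on the example data (merging equal-sized arrays is what caps the per-index work at one lookup per distinct size; the tiny dense table is just for the demo):

#include <stdio.h>

int main(void) {
    int a0[] = { 0, 1, 3 }, a1[] = { 2, 0 }, a2[] = { 1 };
    int *arrays[] = { a0, a1, a2 };
    int sizes[]   = { 3, 2, 1 };
    int n_arrays  = 3, Q = 6;

    /* merged[s][i] = largest element at index i among arrays of size s */
    enum { SMAX = 4 };
    int merged[SMAX][SMAX], used[SMAX] = { 0 };
    for (int s = 0; s < SMAX; s++)
        for (int i = 0; i < SMAX; i++) merged[s][i] = -1;
    for (int a = 0; a < n_arrays; a++) {
        int s = sizes[a];
        used[s] = 1;
        for (int i = 0; i < s; i++)
            if (arrays[a][i] > merged[s][i]) merged[s][i] = arrays[a][i];
    }

    /* Each of the Q indexes costs one lookup per distinct size. */
    for (int q = 0; q < Q; q++) {
        int best = -1;
        for (int s = 1; s < SMAX; s++)
            if (used[s] && merged[s][q % s] > best) best = merged[s][q % s];
        printf("%d ", best);                 /* prints: 2 1 3 1 2 3 */
    }
    printf("\n");
    return 0;
}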

How to find subarray between min and max

I have a sorted array, let's say {4,7,9,12,23,34,56,78}. Given min and max, I want to find the elements in the array between min and max in an efficient way.
Cases:
min=23, max=78: {23,34,56,78}
min=10, max=65: {12,23,34,56}
min=0, max=100: {4,7,9,12,23,34,56,78}
min=30, max=300: {34,56,78}
min=100, max=300: {} // empty
What is an efficient way to do this? I am not asking for code, just for an algorithm I can use here, like DP or exponential search.
Since it's sorted, you can easily find the lowest element greater than or equal to the desired minimum by using a binary search over the entire array.
A binary search basically halves the search space with each iteration. Given the example with a minimum of 10, you start with the midpoint on the 12:
  0  1  2  3  4  5  6  7  <- index
  4  7  9 12 23 34 56 78
          ^^
Since the element you're looking at (12) is at least 10 and the next lower one (9) is less than 10, you've found the start.
Then you can use a similar binary search, but only over the section from the element you just found to the end. This time you're looking for the highest element less than or equal to the desired maximum (65 in the example).
You start with:
  3  4  5  6  7  <- index
 12 23 34 56 78
       ^^
Since that element (34) is less than 65 and so is the one that follows (56), you move the pointer to the halfway point of the remaining 34..78 section:
  3  4  5  6  7  <- index
 12 23 34 56 78
          ^^
And there you have it: that number (56) is less than 65 and the following number (78) is greater.
That gives you the start and stop indexes (3 and 6) for extracting the values:
  0  1  2    3  4  5  6    7  <- index
  4  7  9 ((12 23 34 56)) 78
The time complexity of the algorithm is O(log N). Keep in mind, though, that this only really matters for larger data sets. If your data sets consist of only about eight elements, you may as well use a linear search, since (1) it'll be easier to write and (2) the time difference will be irrelevant.
I tend not to worry about time complexity unless the operations are really expensive, the data set size gets into the thousands, or I have to do it thousands of times a second.
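Here is a sketch of both searches in C, using the common half-open lower-bound formulation (the array and the min=10, max=65 case are from the question):

#include <stdio.h>

/* First index in [0, n) whose element is >= key (n if none). */
static int lower_bound(const int *a, int n, int key) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

int main(void) {
    int a[] = { 4, 7, 9, 12, 23, 34, 56, 78 };
    int n = 8, min = 10, max = 65;
    int lo = lower_bound(a, n, min);         /* first element >= min */
    int hi = lower_bound(a, n, max + 1);     /* first element >  max */
    for (int i = lo; i < hi; i++) printf("%d ", a[i]);  /* 12 23 34 56 */
    printf("\n");
    return 0;
}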
Since it is sorted, this should do:
List<Integer> subarray = new ArrayList<Integer>();
for (int n : numbers) {
    if (n >= MIN && n <= MAX) subarray.add(n);
}
It's O(n) as you only look at every number once.

Compact data structure for sorted array

I have a table with sorted numbers like:
1 320102
2 5200100
3 92010023
4 112010202
5 332020201
6 332020411
:
5000000000 3833240522044511
5000000001 3833240522089999
5000000002 4000000000213312
Given a record number, I need the value in O(log n) time. The record numbers are 64 bits long and there are no missing record numbers. The values are 64 bits long; they are sorted, and value(n) < value(n+1).
The obvious solution is simply to use an array and index it with the record number. That costs 64 bits per value.
I would like a more space-efficient way of doing that. Since we know the values are always increasing, that should be doable, but I do not remember a data structure that lets me do it.
A solution would be to run deflate on the array, but that will not give me O(log n) access to an element, and is thus unacceptable.
Do you know of a data structure that will give me:
O(log n) for access
space requirement < 64-bit/value
= Edit =
Since we know all the numbers in advance, we can compute the difference between each pair of consecutive numbers. Taking the 99th percentile of these differences gives a relatively modest number, and taking the log2 of that gives the number of bits needed to represent such a difference; let us call that modest-bits.
Then create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
Then a delta table for all records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
The modest-bits difference to record k*1024 will always be 0, so we can use that for signaling: if it is non-zero, then the following 64 bits are a pointer to a simple array holding the next 1024 records as 64-bit values.
As the modest value is chosen as the 99th-percentile difference, that will happen at most 1% of the time, thus wasting at most 1% * n * modest-bits + 1% * n * 64-bit * 1024.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * modest-bits + 1% * n * 64-bit * 1024)
lookup: O(1 + 1024)
(99% and 1024 may have to be adjusted)
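To make the lookup concrete, here is a much-simplified C sketch of the anchor-plus-delta layout: a plain uint32_t array stands in for the bit-packed modest-bits field, the escape mechanism is omitted, and the block size is shrunk from 1024 to 4 so the sample values fill it:

#include <stdint.h>
#include <stdio.h>

#define BLOCK 4   /* 1024 in the scheme above; 4 keeps the demo small */

static uint64_t values[] = { 320102, 5200100, 92010023,
                             112010202, 332020201, 332020411 };
static uint64_t anchors[2];   /* full value of records 0, BLOCK, 2*BLOCK... */
static uint32_t deltas[6];    /* deltas[i] = values[i] - values[i-1] */

static uint64_t get(uint64_t i) {
    uint64_t v = anchors[i / BLOCK];
    for (uint64_t j = (i / BLOCK) * BLOCK + 1; j <= i; j++)
        v += deltas[j];       /* at most BLOCK-1 additions per lookup */
    return v;
}

int main(void) {
    for (int i = 0; i < 6; i++) {
        if (i % BLOCK == 0) anchors[i / BLOCK] = values[i];
        else deltas[i] = (uint32_t)(values[i] - values[i - 1]);
    }
    printf("%llu\n", (unsigned long long)get(5));  /* prints 332020411 */
    return 0;
}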
= Edit2 =
Based on the idea above, but wasting less space. Create this:
64-bit value of record 0
64-bit value of record 1024
64-bit value of record 2048
64-bit value of record 3072
64-bit value of record 4096
And for every value that cannot be represented in modest-bits, create a big-value table, kept as a tree:
64-bit position, 64-bit value
64-bit position, 64-bit value
64-bit position, 64-bit value
Then a delta table for all records, which is reset every 1024 records:
modest-bits difference to record 0
modest-bits difference to previous record
1022 * modest-bits difference to previous record
modest-bits difference to record 1024
but which is also reset after every value that is in the big-value table.
space: O(modest-bits * n + 64-bit * n / 1024 + 1% * n * 2 * 64-bit)
Lookup requires searching the big-value table, then looking up the nearest preceding 1024-record anchor, and finally summing up the modest-bits values in between.
lookup: O(log(big-value table) + 1 + 1024) = O(log n)
Can you improve this? Or do better in a different way?
The OP proposes splitting the numbers into blocks (only once). But this process may be continued: split every block once more, and again... Finally we get a binary trie.
The root node contains the value of the number with the least index. Its right descendant stores the difference between the middle number in the table and the number with the least index: d = A[N/2] - A[0] - N/2. This continues for the other right descendants (the nodes marked r in the trie below; in the original diagram these were the red nodes). Leaf nodes contain deltas from the preceding numbers: d = A[i+1] - A[i] - 1.
So most of the values stored in the trie are delta values. Each of them occupies fewer than 64 bits, and for compactness they may be stored as variable-bit-length numbers in a bit stream. To get the length of each number and to navigate this structure in O(log N) time, the bit stream should also contain the lengths of (some) numbers and (some) subtrees:
Each node contains the length (in bits) of its left sub-tree (if it has one).
Each right descendant, except leaf nodes, contains the length (in bits) of its value. A leaf node's length may be calculated from the other lengths on the path from the root to that node.
Each right descendant contains the difference between the corresponding value and the value of the nearest right-descendant node up the path.
All nodes are packed into the bit stream, starting from the root node, in-order: a left descendant always follows its ancestor; a right descendant follows the sub-tree rooted at the left descendant.
To access an element given its index, use the index's binary representation to follow the path in the trie. While traversing this path, add together all the values of the right-descendant nodes. Stop when no more non-zero bits are left in the index.
There are several options for storing the N/2 value lengths:
Allocate as many bits for each length as are needed to represent all values from the largest length down to somewhere below the mean length (excluding some very short outliers).
Also exclude some long outliers (keep them in a separate map).
Since the lengths may not be evenly distributed, it's reasonable to use Huffman encoding for the value lengths.
Either the fixed-length or the Huffman encoding should be different for each trie depth.
N/4 of the subtree lengths are, in fact, value lengths, because the N/4 smallest subtrees contain a single value.
The other N/4 subtree lengths may be stored in words of fixed (predefined) length, so that for large subtrees we know only approximate (rounded-up) lengths.
For 2^30 full-range 64-bit numbers we have to pack approximately 34-bit values, approx. 4-bit value lengths for 3/4 of the nodes, and 10-bit subtree lengths for every fourth node, which saves 34% of the space.
Example values:
0 320102
1 5200100
2 92010023
3 112010202
4 332020201
5 332020411
6 3833240522044511
7 3833240522089999
8 4000000000213312
Trie for these values:
root d=320102 vl=19 tl=84+8+105+4+5=206
+-l tl=75+4+5=84
| +-l tl=23
| | +-l
| | | +-r d=4879997 (vl=23)
| | +-r d=91689919 vl=27
| | +-r d=20000178 (vl=25)
| +-r d=331700095 vl=29 tl=8
| +-l
| | +-r d=209 (vl=8)
| +-r d=3833240190024308 vl=52
| +-r d=45487 (vl=16)
+-r d=3999999999893202 vl=52
Value length encoding:
         bits  start  end
root        0     19   19
depth 1     0     52   52
depth 2     0     29   29
depth 3     5     27   52
depth 4     4      8   23
Sub-tree lengths need 8 bits each.
Here is the encoded stream (binary values still shown in decimal for readability):
bits  value             comment
  19  320102            root value
   8  206               left subtree length of the root
   8  84                left subtree length
   4  15                smallest left subtree length (with base value 8)
  23  4879997           value for index 1
   5  0                 value length for index 2 (with base value 27)
  27  91689919          value for index 2
  25  20000178          value for index 3
  29  331700095         value for index 4
   4  0                 smallest left subtree length (with base value 8)
   8  209               value for index 5
   5  25                value length for index 6 (with base value 27)
  52  3833240190024308  value for index 6
  16  45487             value for index 7
  52  3999999999893202  value for index 8
Altogether 285 bits, or 5 64-bit words. We also need to store the bits/start values from the value length encoding table (350 bits). To store all 635 bits we need 10 64-bit words, which means such a small number table cannot be compressed. For larger number tables, the size of the value length encoding table is negligible.
To search for the value at index 7, read the root value (320102), skip 206 bits, add the value for index 4 (331700095), skip 8 bits, add the value for index 6 (3833240190024308), add the value for index 7 (45487), and add the index itself (7). The result is 3,833,240,522,089,999, as expected.
I would do it in blocks, as you outline in your question. Pick a block size k such that you can accept having to decode on average k/2 values before getting to the one you're after. For n total values, you will have n/k blocks. A table with n/k entries would point into the data stream at the starting point of each block. Finding where to go in that table would be O(log(n/k)) with a binary search, or, if the table is small enough and if it matters, you could make it about O(1) with an auxiliary hash table.
Each block would start with a full 64-bit value. All values after that would be stored as deltas from the preceding value. My suggestion is to store those deltas as a Huffman code that says how many bits are in the next value, followed by that many bits. The Huffman code would be optimized for each block, and a description of that code would be stored at the beginning of the block.
You could simplify that by just preceding each value with six bits giving the number of bits that follow, in the range 1..64, effectively a flat Huffman code. Depending on the histogram of the bit lengths, an optimized Huffman code could knock off a good number of bits compared to the flat code.
Once you have this set up, you can experiment with k and see how small you can make it while still having limited impact on the compression.
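Here is a C sketch of that flat six-bit code, round-tripping the deltas of the sample values from the question (the prefix stores length minus 1 so the full 1..64 range fits in six bits; the per-block Huffman table and block index are left out):

#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t buf[128]; uint64_t bits; } bitstream;

static void put_bits(bitstream *s, uint64_t v, int n) {   /* LSB-first */
    for (int i = 0; i < n; i++, s->bits++)
        if ((v >> i) & 1)
            s->buf[s->bits >> 3] |= (uint8_t)(1u << (s->bits & 7));
}

static uint64_t get_bits(const bitstream *s, uint64_t *pos, int n) {
    uint64_t v = 0;
    for (int i = 0; i < n; i++, (*pos)++)
        v |= (uint64_t)((s->buf[*pos >> 3] >> (*pos & 7)) & 1u) << i;
    return v;
}

static int bit_length(uint64_t v) {          /* bits needed; at least 1 */
    int n = 1;
    while (v >>= 1) n++;
    return n;
}

int main(void) {
    uint64_t deltas[] = { 4879998, 86809923, 20000179, 220009999, 210 };
    bitstream s = { { 0 }, 0 };
    for (int i = 0; i < 5; i++) {
        int len = bit_length(deltas[i]);
        put_bits(&s, (uint64_t)(len - 1), 6);   /* six-bit length prefix */
        put_bits(&s, deltas[i], len);           /* the delta itself */
    }
    uint64_t pos = 0, value = 320102;           /* block's starting value */
    for (int i = 0; i < 5; i++) {
        int len = (int)get_bits(&s, &pos, 6) + 1;
        value += get_bits(&s, &pos, len);
        printf("%llu\n", (unsigned long long)value);
    }
    return 0;
}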
I do not know of a data structure that does that.
The obvious way to gain space without losing too much speed would be to create your own structure with different array sizes based on the different integer sizes you store.
Pseudo-code
class memoryAwareArray {
    array16 = Int16[]   // 2 bytes per value
    array32 = Int32[]   // 4 bytes per value
    array64 = Int64[]   // 8 bytes per value
    max16Index = 0;
    max32Index = 0;
    max64Index = 0;

    // Values arrive in ascending order, so everything that fits in 16 bits
    // comes first, then everything that fits in 32 bits, then the rest.
    addValue(value) {
        if (value < 65536) {
            array16[max16Index] = value;
            max16Index++;
            return;
        }
        if (value < 4294967296) {
            array32[max32Index] = value;
            max32Index++;
            return;
        }
        array64[max64Index] = value;
        max64Index++;
    }

    getObject(index) {
        if (index < max16Index) return array16[index];
        if (index < max16Index + max32Index) return array32[index - max16Index];
        return array64[index - max16Index - max32Index];
    }
}
Something along those lines shouldn't alter the speed too much, and you'd save around 7 GB if you filled up the entire structure. You won't save quite as much, of course, since you have gaps between your values.

Finding FORTRAN array location, 4-dimensional array

Hey guys, I have a question.
If I'm given a four-dimensional array in FORTRAN and told to find the location of a certain element (with a starting location of 200 and 4 bytes per integer), is there a formula to find that location if the array is stored in row-major and in column-major order?
Basically, given an array A(x:X, y:Y, z:Z, q:Q) and asked for the location of A(a,b,c,d), what is the formula for finding that location?
This comes up all the time when using C libraries with Fortran -- eg, calling MPI routines trying to send particular subsets of Fortran arrays.
Fortran is column-major or, more usefully, the first index moves fastest. That is, the item after A(1,2,3,4) in linear order in memory is A(2,2,3,4). So in your example above, an increase of a by one is a jump of 1 position in the array; a jump in b by one corresponds to a jump of (X-x+1) positions; a jump in c by one corresponds to a jump of (X-x+1)*(Y-y+1); and a jump in d by one is a jump of (X-x+1)*(Y-y+1)*(Z-z+1). In C-based languages it would be just the opposite: a jump of 1 in the d index would move you 1 position in memory, a jump in c would be a jump of (Q-q+1), and so on.
If you have m indices, and n_i is the (zero-based) offset in the i-th index from the left, and that index has a range of N_i, then the (zero-based) position from the starting location is:
position = sum_{i=1..m} ( n_i * prod_{j=1..i-1} N_j )
where the product is 1 if the upper index is less than the lower index. To find the number of bytes from the start of the array, you'd multiply that by the size of the object, e.g. 4 bytes for 32-bit integers.
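A sketch of that formula in C for the four-dimensional case (bounds named as in the question; the check value A(2,1,1,3) in A(1:2,1:3,1:4,1:5) matches the worked example in the next answer):

#include <stdio.h>

/* Zero-based column-major position of A(a,b,c,d) for a Fortran array
   declared A(x:X, y:Y, z:Z, q:Q); the upper bound Q never enters. */
static long position(long a, long b, long c, long d,
                     long x, long X, long y, long Y,
                     long z, long Z, long q) {
    long nx = X - x + 1, ny = Y - y + 1, nz = Z - z + 1;
    return (a - x) + (b - y) * nx + (c - z) * nx * ny
                   + (d - q) * nx * ny * nz;
}

int main(void) {
    /* A(2,1,1,3) in A(1:2, 1:3, 1:4, 1:5) is cell 49 (zero-based); with a
       starting location of 200 and 4-byte integers, that is byte 396. */
    long p = position(2, 1, 1, 3, 1, 2, 1, 3, 1, 4, 1);
    printf("cell %ld, byte address %ld\n", p, 200 + 4 * p);
    return 0;
}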
Been over 25 years since I did any FORTRAN.
I believe FORTRAN, unlike many other languages, lays arrays out in column-major order. That means the leftmost index is the one that changes most frequently when processing a multi-dimensional array in linear order. Once the maximum of the leftmost index is reached, set it back to 1 (assuming 1-based indexing), increment the next index by 1, and start the process over again.
To calculate the index configuration for any given address offset, you need to know each of the 4 array dimensions. Without them you can't do it.
Example:
Suppose your array has dimensions 2 by 3 by 4 by 5. This implies a total of 2 * 3 * 4 * 5 = 120 cells in the matrix. You want the index corresponding to the 200th byte. That would be the (200 / 4) - 1 = 49th cell (this assumes 4 bytes per cell and that offset zero is the first cell).
First observe how specific indices translate into cell numbers.
What cell number does the element X(1,1,1,1) occur at? Simple answer: 1.
What cell number does element X(1,2,1,1) occur at? Since we cycled through the leftmost dimension, it must be that dimension plus 1; in other words, 2 + 1 = 3. How about X(1,1,2,1)? We cycled through the first two dimensions, which is 2 * 3 = 6, plus 1 to give us 7. Finally, X(1,1,1,2) must be 2 * 3 * 4 = 24 plus 1, giving the 25th cell.
Notice that the next rightmost index does not increment until the cell number exceeds the product of the dimensions to its left. Using this observation you can calculate the indices for any given cell number by working from the rightmost index to the leftmost as follows:
The rightmost index increments every 2 * 3 * 4 = 24 cells. 24 goes into 49 (the cell number we want the indexing for) twice, leaving 1 over; adding 1 for 1-based indexing gives a rightmost index value of 2 + 1 = 3. The next index (moving left) changes every 2 * 3 = 12 cells. 12 goes into 1 zero times, which gives index 0 + 1 = 1. The next index changes every 2 cells. 2 goes into 1 zero times, giving an index value of 1. For the last (leftmost) index, just add 1 to whatever is left over: 1 + 1 = 2. This gives us the reference X(2, 1, 1, 3).
Double check by working it back to an offset:
(2 - 1) + ((1 - 1) * 2) + ((1 - 1) * 2 * 3) + ((3 - 1) * 2 * 3 * 4) = 49
Just change the numbers and use the same process for any number of dimensions and/or offsets.
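The same conversion can be written compactly, since in column-major order each index is just a remainder: the leftmost index is the cell number modulo the first dimension, and so on up the dimensions (a C sketch using this example's dimensions and cell number):

#include <stdio.h>

int main(void) {
    int dims[4] = { 2, 3, 4, 5 };
    int cell = 49;                       /* zero-based cell number */
    int idx[4];
    for (int i = 0; i < 4; i++) {        /* leftmost index varies fastest */
        idx[i] = cell % dims[i] + 1;     /* +1 for 1-based indexing */
        cell /= dims[i];
    }
    printf("X(%d, %d, %d, %d)\n", idx[0], idx[1], idx[2], idx[3]);  /* X(2, 1, 1, 3) */
    return 0;
}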
Fortran has column-major order for arrays. This is described at http://en.wikipedia.org/wiki/Row-major_order#Column-major_order. Further down in that article there is the equation for the memory offset of a higher dimensional array.
