What is wrong with this extendible hashing solution?

I'm trying to grasp the concept of extendible hashing, but I'm getting confused about the distribution of values to the buckets.
For example:
Say I want to insert 6 values from scratch: 17, 32, 14, 50, 35, 21
What would be wrong with this as a solution:
Global depth = 2
Bucket size = 2
00[] --> [][]
01[] --> [][]
10[] --> [][]
11[] --> [][]
Does this mean only one value for each hash value gets pointed to the bucket, so you then increment the global depth? Or would this work?
I understand the beginning of the process, I am just confused at this point.

There is nothing wrong with the solution you've provided, except that the global depth does not need to be increased. The solution is perfectly compatible with the given global depth.
Assume that we choose the directory entry, and therefore the bucket, using the 2 leftmost bits. The numbers in binary look like this:
17 - 010001
32 - 100000
14 - 001110
50 - 110010
35 - 100011
21 - 010101
directory ------------- buckets
00-----------------------> 14 |
01-----------------------> 17 | 21
10-----------------------> 32 | 35
11-----------------------> 50 |
Hope this helps.
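For reference, here is a minimal Python sketch of just that bucket assignment (assuming 6-bit keys, as in the binary list above, and the 2 leftmost bits as the directory index; it does not implement bucket splitting or directory doubling):

def directory_index(key, global_depth=2, key_bits=6):
    # The directory entry for a key is its global_depth leftmost bits.
    return (key >> (key_bits - global_depth)) & ((1 << global_depth) - 1)

buckets = {i: [] for i in range(2 ** 2)}   # global depth 2 -> 4 directory entries
for key in [17, 32, 14, 50, 35, 21]:
    buckets[directory_index(key)].append(key)

for idx, contents in sorted(buckets.items()):
    print(f"{idx:02b} -> {contents}")
# 00 -> [14]
# 01 -> [17, 21]
# 10 -> [32, 35]
# 11 -> [50]

Each bucket ends up with at most 2 entries, so the bucket size of 2 is respected and no split (and hence no increase of the global depth) is needed.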

You shouldn't increment global depth.
The whole idea of a hash is to pick a function that distributes items across the buckets more or less evenly.
How well that works depends on the hash function.
You can use something as complex as MD5 as the hash, and then you will usually end up with about one element per bucket, but you are not really guaranteed that there will be only one.
So a general implementation should use binary search over the buckets and some other search inside a bucket. You can't, and shouldn't, change the hash function on the fly.

COBOL programmers - How to use arrays

I am programming in COBOL and trying to put this client file in an array. I'm having trouble understanding this problem. I know that the array would probably be based on the bookingtype because there are 4 different options. Any help would be appreciated.
This is how I have the array defined so far:
01 Booking-Table.
05 BookingType OCCURS 4 TIMES PIC 9.
Here is the client file.
I guess the solution is about storing the costs in an array. To calculate the average, the array needs to hold both the total cost and the number of customers per type, with the booking type being the index used.
The "tricky" part may be sizing the total: the maximum amount per type (9999.99) multiplied by the maximum number of customers with that type (all of them; since the client number implies 3 numeric positions, that is up to 1000 customers [including zero], and they could all have the same type).
Something like
REPLACE ==MaxBookingType== BY ==4==.
01 Totals-Table.
   05 Type-Total OCCURS MaxBookingType TIMES.
      10 type-amount    PIC 9(8)V99 COMP.
      10 type-customers PIC 9(4)    COMP.
Now loop through the file from start to end, do check that BookingType >= 1 AND <= MaxBookingType (I'm always skeptical of "data never changes and is always correct") and then
ADD 1 TO type-customers(BookingType)
ADD trip-cost TO type-amount (BookingType)
and after end of file calculate the average for all 4 entries using a PERFORM VARYING.
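In case it helps to see the control flow outside COBOL, here is a rough Python sketch of that tally-and-average logic (the sample records and field names are made up for illustration; a real program would read them from the client file):

# Hypothetical sample records: (booking_type, trip_cost).
records = [(1, 1200.00), (2, 850.50), (1, 999.99), (4, 300.00)]

MAX_BOOKING_TYPE = 4                         # same role as MaxBookingType above

totals = [0.0] * (MAX_BOOKING_TYPE + 1)      # index 0 unused; types are 1..4
customers = [0] * (MAX_BOOKING_TYPE + 1)

for booking_type, trip_cost in records:
    if not 1 <= booking_type <= MAX_BOOKING_TYPE:
        raise ValueError(f"unexpected booking type: {booking_type}")
    customers[booking_type] += 1
    totals[booking_type] += trip_cost

for t in range(1, MAX_BOOKING_TYPE + 1):
    average = totals[t] / customers[t] if customers[t] else 0.0
    print(f"type {t}: {customers[t]} customers, average {average:.2f}")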
The main benefit of using an "array" here is that you can update the program to handle 20 booking types just by changing the value of MaxBookingType - and since you've added a check that tells you which "bad" number shows up in the data, you can adjust it quite fast.
I'm not sure if/how your compiler allows self-defined numeric constants; if there's a way, use that instead of forcing the compiler to check for all occurrences of the text "MaxBookingType".
I believe the diagram is trying to say you need an enumeration. In COBOL, you'd implement this with
01 client-file-record.
   *> ...
   03 booking-type PIC 9.
      88 cruise          VALUE 1.
      88 air-independent VALUE 2.
      88 air-tour        VALUE 3.
      88 other           VALUE 4.
   *> ...
An array approach is only necessary if the booking types (and/or their behaviour) vary at runtime.

How LogLog algorithm with single hash function works

I have found tens of explanations of the basic idea of LogLog algorithms, but they all lack details about how the splitting of the hash function result works. I mean, using a single hash function is not precise, while using many functions is too expensive. How do they overcome the problem with a single hash function?
This answer is the best explanation I've found, but it still makes no sense to me:
They used one hash but divided it into two parts. One is called a bucket (total number of buckets is 2^x) and another - is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function which gives values from 0 to 2^10 produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1
0110 000011 bucket 6 will store 4
Could you explain the example above, please? You have 16 buckets because the header is 4 bits long, right? So how can you have 16 buckets with only two hashes? We estimate only buckets, right? So the first bucket is of size 1, and the second of size 4, right? How do you merge the results?
Hash function splitting: our goal is to use many HyperLogLog structures (as an example, let's say 16 HyperLogLog structures, each of them using a 64-bit hash function) instead of one, in order to reduce the estimation error. An intuitive approach might be to process each of the inputs in each of these HyperLogLog structures. However, in that case we would need to make sure that the HyperLogLogs are independent of each other, meaning we would need a set of 16 hash functions which are independent of each other - and that's hard to find!
So we use an alternative approach. Instead of using a family of 64-bit hash functions, we use 16 separate HyperLogLog structures, each using just a 60-bit hash function. How do we do that? Easy: we take our 64-bit hash function and just ignore the first 4 bits, producing a 60-bit hash function. What do we do with the first 4 bits? We use them to choose one of 16 "buckets" (each "bucket" is just a HyperLogLog structure; note that 2^4 = 16 buckets). Now each of the inputs is assigned to exactly one of the 16 buckets, where a 60-bit hash function is used to calculate the HyperLogLog value. So we have 16 HyperLogLog structures, each using a 60-bit hash function. Assuming that we chose a decent hash function (meaning that the first 4 bits are uniformly distributed, and that they aren't correlated with the remaining 60 bits), we now have 16 independent HyperLogLog structures. We take a harmonic average of their 16 estimates to get a much less error-prone estimate of the cardinality.
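To make the splitting concrete, here is a small Python sketch (not a full HyperLogLog - just the hash split and the per-bucket register update; the choice of SHA-1 truncated to 64 bits and the leading-zero rank as the stored value are illustrative assumptions that follow the description above):

import hashlib

BUCKET_BITS = 4                        # 2^4 = 16 buckets
HASH_BITS = 64
VALUE_BITS = HASH_BITS - BUCKET_BITS   # 60-bit "inner" hash per bucket

def hash64(item):
    # A 64-bit hash derived from SHA-1; any decent 64-bit hash would do.
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

def rank(value, bits):
    # Position of the leftmost 1-bit (number of leading zeros + 1).
    return bits - value.bit_length() + 1

registers = [0] * (1 << BUCKET_BITS)

def add(item):
    h = hash64(item)
    bucket = h >> VALUE_BITS                 # first 4 bits pick the bucket
    remainder = h & ((1 << VALUE_BITS) - 1)  # remaining 60 bits
    registers[bucket] = max(registers[bucket], rank(remainder, VALUE_BITS))

for word in ["apple", "banana", "cherry", "apple"]:
    add(word)
print(registers)

Combining the 16 registers into a single cardinality estimate is then done with the harmonic average described above.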
Hope that clears it up!
The original HyperLogLog paper mentioned by OronNavon is quite theoretical. If you are looking for an explanation of the cardinality estimator without the need for complex analysis, you could have a look at the paper I am currently working on: http://oertl.github.io/hyperloglog-sketch-estimation-paper. It also presents a generalization of the original estimator that does not require any special handling for small or large cardinalities.

Algorithm to index 2^n combinations so that I can recover the states from any index value from 1 to 2^n without using an array

I am trying to do something, but it is outside my field. To explain, let's set n=3 to simplify things, where n is the total number of parameters in this example: A, B, C. These parameters can have a state of ON or OFF (aka 0 or 1).
The total number of combinations of these parameters is 2^n = 8 in this case which can be visualized as:
ABC
1: 000
2: 111
3: 100
4: 010
5: 001
6: 110
7: 011
8: 101
Of course, the above list can be ordered in (2^n)! = 40320 different ways.
I want an algorithm so that I can calculate the state of any of my parameters (0 or 1) given a number from 1 to 2^n. For example, if I have the number 3, then using the table above I know the state of A is 1 and B and C are 0. Of course you can have a table/array to look it up for a specific ordering, but even for relatively small values of n you would need a huge table.
I'm not familiar with this area or with indexing methods; that's why I need help.
Kind regards
Just realised you can actually look at it another way. What you want is a function encrypting N bits to another set of N bits. In practice this is the same as format preserving encryption. The question is, do you care whether:
all 2^n cases are covered, or just a large enough number close to 2^n (you have to choose the right encryption/hash method)
you want to do this one way or both ways (that is, do you ever want to ask - I have this number corresponding to that number, which permutation am I using)
If the answer is no to both, you can just find an FPE algorithm that doesn't require you to generate the whole table (some do).
I have seen a related problem of finding all subsets of a given set using bitmasks. You can use the same concept in your case. This link contains a good tutorial.
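A minimal Python sketch of that bitmask idea (assuming index 1 simply maps to the bit pattern index-1, i.e. the natural binary ordering rather than the particular ordering shown in the question's table):

def states(index, n):
    # Map an index in 1..2^n to the ON/OFF states of n parameters.
    # Bit i of (index - 1) gives the state of the i-th parameter.
    value = index - 1
    return [(value >> (n - 1 - i)) & 1 for i in range(n)]

n = 3
for index in range(1, 2 ** n + 1):
    a, b, c = states(index, n)
    print(index, a, b, c)   # e.g. index 4 -> 0 1 1, i.e. A=0, B=1, C=1

No table is stored: the states are computed directly from the index. If you need an ordering other than the natural one, that is where the format-preserving-encryption idea above comes in: apply a fixed bijection to the n-bit value before reading off its bits.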

Given a sorted array find number of all unique pairs which sum is greater than X

Is there any solution with complexity better than O(n^2)? Linear complexity would be best, but O(n log n) is also great.
I've tried to use binary search for each element, which would be O(n log n), but I'm always missing something.
For example, given the sorted array: 2 8 8 9 10 20 21
and the number X=18
[2,20], [2,21], [8,20], [8,21], [8,20], [8,21], [9,10], [9,20], [9,21], [10,20], [10,21], [20,21]
The function should return 12.
For your problem, for any given set of numbers, there may be some which, paired with any other number, are guaranteed to give a solution, because on their own they are already greater than the target sum.
Identify those. Any combination involving one of them is a solution. With your example data, 21 and 20 satisfy this, so all combinations with 21 (six of them) and all remaining combinations with 20 (five of them) satisfy your requirement: 11 results so far.
Now treat your sorted array as a subset which excludes the above set (if it is non-empty).
From your new subset, find all numbers which can be paired with another number in that subset to satisfy your requirement. Start from the highest number in your subset (for "complexity" it may not matter, but for ease of coding, knowing early that you have none often helps), remembering that there need to be at least two such numbers for any more results to be added to those already identified.
Take your data in successive adjacent pairs. Once you reach a sum which does not meet your requirement, you know you have found the bottom limit for your table.
With this sequence:
1 2 12 13 14 15 16 17 18 19
19 already meets the criteria. The subset is of nine numbers. Take 18 and 17: meets the criteria. Take 17 and 16: meets. 16 and 15: meets... 13 and 12: meets. 12 and 2: does not meet. So 12 is the lowest value in the range, and the number of items in the range is seven.
This process yields 10 and 9 with your original data. That is one more combination, giving the 12 combinations that are sought.
Determining the number of combinations should be abstracted out, as it is simple to calculate and, depending on the actual situation, perhaps faster to pre-calculate.
A good rule-of-thumb is that if you, yourself, can get the answer by looking at the data, it is pretty easy for a computer. Doing it the same way is a good starting point, though not always the end.
Find the subset of the highest elements which would cause the sum to be greater than the desired value when paired with any other member of the array.
Taking the subset which excludes those, use the highest element and work backwards to find the lowest element which gives you success. Have that as your third subset.
Your combinations are: all pairs that include a member of the first subset, plus the pairs formed solely within the third subset.
You are the student, you'll have to work out how complex that is.
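For comparison, here is a short Python sketch of the standard two-pointer technique for this kind of problem (it is not exactly the procedure described above, but it is the usual O(n) approach once the array is sorted):

def count_pairs_greater(arr, x):
    # Count index pairs (i < j) in a sorted array with arr[i] + arr[j] > x.
    lo, hi = 0, len(arr) - 1
    count = 0
    while lo < hi:
        if arr[lo] + arr[hi] > x:
            count += hi - lo   # arr[hi] pairs with every element from lo to hi-1
            hi -= 1
        else:
            lo += 1
    return count

print(count_pairs_greater([2, 8, 8, 9, 10, 20, 21], 18))   # 12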

Fast 3D Lut lookups

I'm trying to write a fast 3D LUT lookup function and noticed that most LUTs are either 33x33x33 or 17x17x17.
Why 33 or 17? Wouldn't the math be quicker with 32 or 16 instead? So you could do some shifts instead of divides? Or maybe I'm not understanding why.
Anyone?
This paper will provide a synopsis: https://www.hpl.hp.com/techreports/98/HPL-98-95.pdf
Basically what you need is to divide the color space into a certain number of pieces and do linear interpolation between those pieces. It's a way of building the lookup table such that you can find the color positions without much error, but with a sparser table than you would otherwise need.
And here's the reason for the odd sizes - it's a fencepost situation: if you cut a line in 2 places, you get 3 pieces.
The reason you have 17 or 33 rather than 16 or 32 is that you need the segment you are in, not just the nearest grid position. If you bit-shift an 8-bit (2^8) value down by 4 bits, you get one of 16 possible segment indices. But since you need to linearly interpolate the position within that segment, you also need the segment's upper endpoint, so you need 17 grid values.
In short, the reason you have 17 and not 16 is that with 17 grid points you can divide the input evenly by 16, which is fast (a shift), then take the entry that follows your floored integer division and interpolate between the two to estimate where you should be between those values. And that takes N+1 values in the lookup table.
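A small Python sketch of that indexing for a single axis (assuming 8-bit input values and a 17-entry grid; a 3D LUT does the same on each of the three axes and then interpolates the surrounding grid points):

GRID = 17                                      # 16 segments -> 17 sample points
lut = [i / (GRID - 1) for i in range(GRID)]    # identity curve as a stand-in

def lookup_1d(value):
    # Look up an 8-bit value (0..255) using a shift instead of a divide.
    idx = value >> 4                  # which of the 16 segments we are in (0..15)
    frac = (value & 15) / 16.0        # fractional position inside that segment
    # idx + 1 can be 16, which is why the table needs 17 entries rather than 16.
    return lut[idx] + frac * (lut[idx + 1] - lut[idx])

print(lookup_1d(0), lookup_1d(128), lookup_1d(255))

(Real LUT code usually also rescales so that 255 lands exactly on the last entry; this sketch only illustrates the shift-based segment indexing.)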
