Split Entire Hash Range Into n Equal Ranges - md5

I am looking to take a hash range (md5 or sha1) and split it into n equal ranges.
For example, if n (the number of nodes) = 5, the entire hash range would be split into 5 so that there would be a uniform distribution of key ranges. I would like node 1 to cover the beginning of the hash range up to 1/5, node 2 to cover 1/5 to 2/5, and so on all the way to the end.
Basically, I need to have key ranges mapped to each n such that when I hash a value, it knows which n is going to take care of that range.
I am new to hashing and a little bit unsure of where I could start on solving this for a project. Any help you could give would be great.

If you are looking to place a hash value into a number of "buckets" evenly, then some simple math will do the trick. Watch out for rounding edge cases: because 2**BITS is not a multiple of 5, BUCKETSIZE is rounded down, and the very largest hash values can land in bucket number BUCKETS. You would be better off using a power of 2 for the BUCKETS value.
This is Python code, by the way, which supports large integers. Note the integer division operator //; in Python 3 a plain / would give a float and lose precision.
BUCKETS = 5
BITS = 160
BUCKETSIZE = 2**BITS // BUCKETS
int('ad01c5b3de58a02a42367e33f5bdb182d5e7e164', 16) // BUCKETSIZE == 3
int('553ae7da92f5505a92bbb8c9d47be76ab9f65bc2', 16) // BUCKETSIZE == 1
int('001c7c8c5ff152f1cc8ed30421e02a898cfcfb23', 16) // BUCKETSIZE == 0
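One way to sidestep the rounding edge case is to multiply first and divide afterwards, so the result is always in 0..buckets-1. The following is only a sketch of that idea; the function names bucket_from_hex and bucket_for are made up for illustration, and SHA-1 is chosen as the default simply because the example digests above are 160 bits long.

```python
import hashlib

def bucket_from_hex(hex_digest, buckets):
    """Map a hex digest to a bucket index in 0..buckets-1 (hypothetical helper)."""
    bits = 4 * len(hex_digest)              # each hex character carries 4 bits
    value = int(hex_digest, 16)
    # floor(value * buckets / 2**bits); the maximum value still maps to buckets-1
    return (value * buckets) >> bits

def bucket_for(key, buckets, algo="sha1"):
    """Hash the bytes in `key` and return the bucket owning that hash value."""
    hex_digest = hashlib.new(algo, key).hexdigest()
    return bucket_from_hex(hex_digest, buckets)

print(bucket_from_hex("ad01c5b3de58a02a42367e33f5bdb182d5e7e164", 5))
print(bucket_for(b"some key", 5))
```

Because the division happens last, no separate BUCKETSIZE constant is needed and the largest possible hash value cannot overflow into a non-existent bucket.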

If you can tolerate a tiny bias that is very hard to remove (no power of two is evenly divisible by 5, so some bias is unavoidable), then modulo (% in C and many other languages with C-like syntax) is the way to divide the full range into 5 almost identically sized partitions. Note that these partitions are interleaved residue classes rather than contiguous ranges.
Any message m with md5(m)%5==0 is in the first partition, etc.
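To get a feel for how small that bias is, one can count how many of the 2**bits possible hash values fall into each residue class. This is a small illustrative sketch; residue_counts is a made-up helper name.

```python
def residue_counts(bits, parts):
    """Count how many values in 0..2**bits-1 fall into each residue class mod parts."""
    quotient, remainder = divmod(2**bits, parts)
    # The first `remainder` residue classes each receive one extra value.
    return [quotient + (1 if r < remainder else 0) for r in range(parts)]

counts = residue_counts(128, 5)          # md5 produces 128-bit values
print(counts[0] - counts[1])             # size difference between partitions
```

For a 128-bit hash and 5 partitions, one partition holds exactly one more value than each of the others, which is negligible next to roughly 6.8 * 10**37 values per partition.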

Related

Resizing a hash table: Using prime number vs Power of 2 (for array size)

I've looked into a lot of questions on this site about whether one should use a prime number or a power of 2 for mod. Thankfully, I understood the idea.
However, my situation is different. I couldn't find any question that dealt with the right array size when resizing hash tables.
I am implementing a hash table that may need to grow and shrink as the number of stored keys varies. I have a hashcode function that uniformly hashes keys to positive 32-bit integers. The table itself will use a smaller array of approximate size M. So which of the following is best choice for the hash function that takes the hashcode to produce a value between 0 and P-1 where P is close to M?
a) Modulo P where P is a prime closest to M
b) Modulo P where P is the power of 2 closest to M
c) Either
d) Neither
I've been trying to figure this out for hours, but with no luck.

Find high & low peak points in cell array MATLAB

I want to find "significant" changes in a cell array in MATLAB for when I have a movement.
E.g. I have YT which represents movements in a yaw presentation for a face interaction. YT can change based on an interaction from anywhere upwards of 80x1 to 400x1. The first few lines might be
YT = {-7 -8 -8 -8 -8 -9 -9 -9 -6 ...}
I would like to record the following
Over the entire cell array;
1) Count the number of high and low peaks
I can do this with findpeaks, but not for low peaks?
2) Measure the difference between each peak -
For this example, peaks -9 and -6 so difference of +3 between those. So report 1 peak change of +3. At the moment I am only interested in changes of +/- 3, but this might change, so I will need a threshold?
and then over X number of cells (repeating for the cell array)
3) count number of changes - for this example, 3 changes
4) count number of significant changes - for this example, 1 change of -/+3
5) describe the change - 1 change of -1, 1 change of -1, 1 change of +3
Any help would be appreciated, bit of a MATLAB noob.
Thanks!
1) Finding negative peaks is the same as finding positive ones - all you need to do is multiply the sequence by -1 and then findpeaks again
2) If you simply want the differences, then you could subtract the vectors of the positive and negative peaks (possibly offset by one if you want differences in both directions). Something like pospeaks-negpeaks would do one side. You'd need to identify whether the positive or negative peak was first (use the loc return from findpeaks to determine this), and then do pospeaks(1:end-1)-negpeaks(2:end) or vice versa as appropriate.
[edit]As pointed out in your comment, the above assumes that pospeaks and negpeaks are the same length. I shouldn't have been so lazy! The code might be better written as:
if (length(pospeaks)>length(negpeaks))
% Starts and ends with a positive peak
neg_diffs=pospeaks(1:end-1)-negpeaks;
pos_diffs=negpeaks-pospeaks(2:end);
elseif (length(pospeaks)<length(negpeaks))
% Starts and ends with a negative peak
pos_diffs=negpeaks(1:end-1)-pospeaks;
neg_diffs=pospeaks-negpeaks(2:end);
elseif posloc<negloc
% Starts with a positive peak, and ends with a negative one
neg_diffs=pospeaks-negpeaks;
pos_diffs=pospeaks(2:end)-negpeaks(1:end-1);
else
% Starts with a negative peak, and ends with a positive one
pos_diffs=negpeaks-pospeaks;
neg_diffs=negpeaks(2:end)-pospeaks(1:end-1);
end
I'm sure that could be coded more effectively, but I can't think just now how to write it more compactly. posloc and negloc are the location returns from findpeaks.[/edit]
For (3) to (5) it is easier to record the differences between samples: changes=[YT{2:end}]-[YT{1:end-1}];
3) To count changes, count the number of non-zeros in the difference between adjacent elements: sum(changes~=0)
4) You don't define what you mean by "significant changes", but the test is almost identical to 3) sum(abs(changes)>=3)
5) It is simply changes(changes~=0)
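For readers who want to check the arithmetic outside MATLAB, here is a rough Python transliteration of steps (3) to (5). It is only a sketch: summarize_changes is a made-up name, and the threshold of 3 is taken from the question.

```python
def summarize_changes(samples, threshold=3):
    """Return (number of changes, number of significant changes, list of changes)."""
    # Differences between adjacent samples, like changes = YT(2:end) - YT(1:end-1)
    changes = [b - a for a, b in zip(samples, samples[1:])]
    nonzero = [c for c in changes if c != 0]
    significant = [c for c in nonzero if abs(c) >= threshold]
    return len(nonzero), len(significant), nonzero

yt = [-7, -8, -8, -8, -8, -9, -9, -9, -6]   # the sample values from the question
print(summarize_changes(yt))                 # (3, 1, [-1, -1, 3])
```

The result matches the counts the question expects for its sample data: 3 changes, 1 significant change, described as -1, -1, +3.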
I would suggest diff is the command which can provide the basis of a solution to all your problems (after first converting the cell to an array with cell2mat). It outputs the difference between adjacent values along an array:
1) You'd have to define what a 'peak' is but at a guess:
YT = cell2mat(YT); % convert cell to array
change = diff(YT); % get diffs
highp = sum(change >= 3); % high peak threshold
lowp = sum(change <= -3); % low peak threshold
2) diff(cell2mat(YT)) provides this.
3)
YT = cell2mat(YT); % convert cell to array
change = diff(YT); % get diffs
count = sum(change~=0);
4) Seems to be answered in the other points?

Find all possible distances from two arrays

Given two sorted arrays A and B of length N, where each element is a natural number less than M, determine all possible distances over all combinations of elements of A and B. If A[i] - B[j] < 0, then the distance is M + (A[i] - B[j]).
Example :
A = {0,2,3}
B = {1,2}
M = 5
Distances = {0,1,2,3,4}
Note: I know O(N^2) solution, but I need faster solution than O(N^2) and O(N x M).
Edit: Array A, B, and Distances contain distinct elements.
You can get a O(MlogM) complexity solution in the following way.
1. Prepare an array Ax of length M with Ax[i] = 1 if i belongs to A (and 0 otherwise)
2. Prepare an array Bx of length M with Bx[M-1-i] = 1 if i belongs to B (and 0 otherwise)
3. Use the Fast Fourier Transform to convolve these 2 sequences together
4. Inspect the output array; non-zero values correspond to possible distances. Output index k corresponds to the difference k-(M-1), taken modulo M.
Note that the FFT is normally done with floating point numbers, so in step 4 you probably want to test if the output is greater than 0.5 to avoid potential rounding noise issues.
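The FFT approach can be sketched as follows. To stay self-contained this uses a simple recursive radix-2 FFT written in pure Python, which is an assumption made for illustration; in practice an optimized FFT library would be the natural choice. The names fft and cyclic_distances are hypothetical.

```python
import cmath

def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two. Inverse is unscaled."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def cyclic_distances(a, b, m):
    """Return the sorted set of (x - y) mod m for x in a, y in b."""
    size = 1
    while size < 2 * m - 1:          # pad so the linear convolution does not wrap
        size *= 2
    ax = [0.0] * size
    bx = [0.0] * size
    for x in a:
        ax[x] = 1.0
    for y in b:
        bx[m - 1 - y] = 1.0          # reversed indicator, as in step 2
    product = [p * q for p, q in zip(fft(ax), fft(bx))]
    conv = fft(product, invert=True)
    found = set()
    for k in range(2 * m - 1):
        if conv[k].real / size > 0.5:        # threshold against rounding noise
            found.add((k - (m - 1)) % m)     # index k encodes x - y = k - (m-1)
    return sorted(found)

print(cyclic_distances([0, 2, 3], [1, 2], 5))   # [0, 1, 2, 3, 4]
```

The padding to at least 2M-1 entries keeps the two halves of the difference range (negative and non-negative) apart until they are folded together with the final modulo.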
It may be possible to pass with an optimized N*N approach.
Convert A into a 0/1 array with a 1 at each position that is present in A (positions in the range [0..M-1]).
Then pack this array into 64-bit bitmasks, which shrinks its size by a factor of 64.
This allows results to be inserted in blocks of 64 at a time.
The complexity is still N*N, but the running time drops greatly. The author mentioned a limit of 50000 for the sizes of A and B and for M.
The expected operation count is N*N/64 ~= 4*10^7, which should run within 1 second.
You can use bitvectors to accomplish this. Operations on large bitvectors are linear in the size of the bitvector, but they are fast and easy to implement, and may work well given your 50k size limit.
Initialize two bitvectors of length M. Call these vectA and vectAnswer. Set the bits of vectA that correspond to the elements in A. Leave vectAnswer with all zeroes.
Define a method to rotate a bitvector by k elements (rotate down). I'll call this rotate(vect,k).
Then, for every element b of B, vectAnswer = vectAnswer | rotate(vectA,b).
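A minimal sketch of the bitvector idea, using Python's arbitrary-precision integers as the bitvectors. The function names rotate_down and distances_bitvector are made up here; a lower-level language would use arrays of 64-bit words instead.

```python
def rotate_down(vect, k, m):
    """Rotate an m-bit vector so that bit i moves to bit (i - k) mod m."""
    mask = (1 << m) - 1
    k %= m
    return ((vect >> k) | (vect << (m - k))) & mask

def distances_bitvector(a, b, m):
    """Return the sorted list of (x - y) mod m for x in a, y in b."""
    vect_a = 0
    for x in a:
        vect_a |= 1 << x                      # set the bit for each element of A
    vect_answer = 0
    for y in b:
        vect_answer |= rotate_down(vect_a, y, m)
    return [d for d in range(m) if (vect_answer >> d) & 1]

print(distances_bitvector([0, 2, 3], [1, 2], 5))   # [0, 1, 2, 3, 4]
```

Each rotation and OR touches M bits, so the total work is proportional to N*M bit operations, executed a machine word at a time.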

How do I check to see if two (or more) elements of an array/vector are the same?

For one of my homework problems, we had to write a function that creates an array containing n random numbers between 1 and 365. (Done). Then, check if any of these n birthdays are identical. Is there a shorter way to do this than doing several loops or several logical expressions?
Thank you!
CODE SO FAR, NOT DONE YET!!
function = [prob] bdayprob(N,n)
N = input('Please enter the number of experiments performed: N = ');
n = input('Please enter the sample size: n = ');
count = 0;
for(i=1:n)
x(i) = randi(365);
if(x(i)== x)
count = count + 1
end
return
If I'm interpreting your question properly, you want to check to see if generating n integers or days results in n unique numbers. Given your current knowledge in MATLAB, it's as simple as doing:
n = 30; %// Define sample size
N = 10; %// Define number of trials
%// Define logical array where each location is true if all
%// birthdays generated for that trial were unique
check = false(1, N);
%// For each trial...
for idx = 1 : N
%// Generate sample size random numbers
days = randi(365, n, 1);
%// Check to see if the total number of unique birthdays
%// are equal to the sample size
check(idx) = numel(unique(days)) == n;
end
Woah! Let's go through the code slowly, shall we? We first define the sample size and the number of trials. We then specify a logical array where each location records the outcome of one trial: true if all of the birthdays generated were unique, false if any were repeated. Now, we start a loop where for each trial, we generate n random numbers from 1 to 365, where n is the sample size. We then use unique to figure out all of the unique integers that were generated. If all of the birthdays are unique, then the total number of unique birthdays generated should equal the sample size. If it doesn't, then we have repeats. For example, if we generated a sample of [1 1 1 2 2], the output of unique would be [1 2], and the total number of unique elements is 2. Since this doesn't equal 5, the sample size, we know that the birthdays generated weren't unique. However, if we had [1 3 4 6 7], unique would give the same output, and since the output length is the same as the sample size, we know that all of the days are unique.
So, we check to see if this number is equal to the sample size for each iteration. If it is, then we output true. If not, we output false. When I run this code on my end, this is what I get for check. I set the sample size to 30 and the number of trials to be 10.
check =
0 0 1 1 0 0 0 0 1 0
Take note that if you increase the sample size, there is a higher probability that you will get duplicates, because randi can be considered as sampling with replacement. Therefore, the larger the sample size, the higher the chance of getting duplicate values. I made the sample size small on purpose so that we can see that it's possible to get unique days. However, if you set it to something like 100, or 200, you will most likely get check to be all false as there will most likely be duplicates per trial.
Here are some more approaches that avoid loops. Let
n = 20; %// define sample size
x = randi(365,n,1); %// generate n values between 1 and 365
Any of the following code snippets returns true (or 1) if there are two identical values in x, and false (or 0) otherwise:
Sort and then check if any two consecutive elements are the same:
result = any(diff(sort(x))==0);
Do all pairwise comparisons manually; remove self-pairs and duplicate pairs; and check if any of the remaining comparisons is true:
result = nnz(tril(bsxfun(@eq, x, x.'),-1))>0;
Compute the distance between distinct values, considering each pair just once, and then check if any distance is 0:
result = any(pdist(x(:))==0);
Find the number of occurrences of the most common value (mode):
[~, occurs] = mode(x);
result = occurs>1;
I don't know if I'm supposed to solve the problem for you, but perhaps a few hints may lead you in the right direction (besides I'm not a matlab expert so it will be in general terms):
Maybe not, but you have to ask yourself what they expect of you. The solution you propose requires you to loop through the array in two nested loops which will mean n*(n-1)/2 times through the loop (ie quadratic time complexity).
There are a number of ways you can improve the time complexity of the problem. The most straightforward would be to have a 365-element table where you keep track of whether a particular number has been seen yet, which would require only a single loop (i.e. linear time complexity), but perhaps that's not what they're looking for either. But maybe that solution is a little bit ad hoc? What we're basically looking for is a fast lookup of whether a particular number has been seen before. There exist more memory-efficient structures that allow lookup in O(1) time and O(log n) time (if you know these, you have an arsenal of tools to use).
Then of course you could use the pigeonhole principle to provide the answer much faster in some special cases (remember that you were only asked to determine whether two or more numbers are equal or not).
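In general terms, the single-loop lookup plus the pigeonhole shortcut might look like the sketch below. It is written in Python rather than MATLAB, the function name has_duplicate is made up, and a hash set stands in for the 365-element table.

```python
import random

def has_duplicate(days, num_days=365):
    """Return True if any value in `days` occurs more than once."""
    if len(days) > num_days:
        return True            # pigeonhole: more samples than possible values
    seen = set()
    for day in days:
        if day in seen:        # average O(1) membership test
            return True
        seen.add(day)
    return False

sample = [random.randint(1, 365) for _ in range(30)]
print(has_duplicate(sample))
```

The loop stops at the first repeat, so it never does more than one pass over the data.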

Efficient histogram implementation using a hash function

Is there a more efficient approach to computing a histogram than a binary search for a non-linear bin distribution?
I'm actually only interested in the bit of the algorithm that matches the key (value) to the bin (the transfer function?) , i.e. for a bunch of floating point values I just want to know the appropriate bin index for each value.
I know that for a linear bin distribution you can get O(1) by dividing the value by the bin width, and that for non linear bins a binary search gets you O(logN). My current implementation uses a binary search on unequal bin widths.
In the spirit of improving efficiency I was curious as to whether you could use a hash function to map a value to its appropriate bin and achieve O(1) time complexity when you have bins of unequal widths?
In some simple cases you can get O(1).
Suppose, your values are 8-bit, from 0 to 255.
If you split them into 8 bins of sizes 2, 2, 4, 8, 16, 32, 64, 128, then the bin value ranges will be: 0-1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255.
In binary these ranges look like:
0000000x (bin 0)
0000001x
000001xx
00001xxx
0001xxxx
001xxxxx
01xxxxxx
1xxxxxxx (bin 7)
So, if you can quickly (in O(1)) count how many most significant zero bits there are in the value, you can get the bin number from it.
In this particular case you may precalculate a look-up table of 256 elements, containing the bin number and finding the appropriate bin for a value is just one table look-up.
Actually, with 8-bit values you can use bins of arbitrary sizes since the look-up table is small.
If you were to go with bins of sizes of powers of 2, you could reuse this look-up table for 16-bit values as well. And you'd need two look-ups. You can extend it to even longer values.
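The power-of-two case above can be sketched in a few lines. This is only an illustration: the names are made up, and Python's int.bit_length() plays the role of counting leading zero bits.

```python
def bin_by_bit_length(value):
    """Bin index for an 8-bit value with bins 0-1, 2-3, 4-7, ..., 128-255."""
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit value")
    # Position of the highest set bit; 0 and 1 share bin 0.
    return max(0, value.bit_length() - 1)

# Precomputed 256-entry table: one look-up per 8-bit value.
LOOKUP = [bin_by_bit_length(v) for v in range(256)]

def bin_16bit(value):
    """Bin index for a 16-bit value, reusing the 8-bit table with two look-ups."""
    high = value >> 8
    if high:
        return 8 + LOOKUP[high]
    return LOOKUP[value & 0xFF]

print(LOOKUP[5], LOOKUP[200], bin_16bit(300))   # 2 7 8
```

A table like LOOKUP could equally hold arbitrary bin numbers for 8-bit inputs; the two-look-up extension to 16 bits relies on the bins being powers of two.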
Ordinary hash functions are intended to scatter different values quite randomly across some range. A single-bit difference in arguments may lead to dozens of bits different in results. For that reason, ordinary hash functions are not suitable for the situation described in the question.
An alternative is to build an array P with entries that index into the table B of bin limits. Given some value x, we find the bin j it belongs to (or sometimes a nearby bin) via j = P[⌊x·r⌋] where r is a ratio that depends on the size of P and the maximum value in B. The effectiveness of this approach depends on the values in B and the size of P.
The behavior of functions like P[⌊x·r⌋] can be seen via the python code shown below. (The method is about the same in any programming language. However, tips for Python-to-C are given below.) Suppose the code is stored in file histobins.py and loaded into the ipython interpreter with the command import histobins as hb. Then a command like hb.betterparts(27, 99, 9, 80,155) produces output like
At 80 parts, steps = 20 = 7+13
At 81 parts, steps = 16 = 7+9
At 86 parts, steps = 14 = 6+8
At 97 parts, steps = 13 = 12+1
At 108 parts, steps = 12 = 3+9
At 109 parts, steps = 12 = 8+4
At 118 parts, steps = 12 = 6+6
At 119 parts, steps = 10 = 7+3
At 122 parts, steps = 10 = 3+7
At 141 parts, steps = 10 = 5+5
At 142 parts, steps = 10 = 4+6
At 143 parts, steps = 9 = 7+2
These parameters to betterparts set nbins=27, topsize=99, seed=9, plo=80, phi=155 which creates a test set of 27 bins for values from 0 to 99, with random seed 9, and size of P from 80 to 155-1. The number of “steps” is the number of times the two while loops in testparts() operated during a test with 10*nbins values from 0 to topsize. Eg, “At 143 parts, steps = 9 = 7+2” means that when the size of P is 143, out of 270 trials, 261 times P[⌊x·r⌋] produced the correct index at once; 7 times the index had to be decreased, and twice it had to be increased.
The general idea of the method is to trade off space for time. Another tradeoff is preparation time versus operation time. If you are going to be doing billions of lookups, it is worthwhile to do a few thousand trials to find a good value of |P|, the size of P. If you are going to be doing only a few millions of lookups, it might be better to just pick some large value of |P| and run with it, or perhaps just run betterparts over a narrow range. Instead of doing 75 tests as above, if we start with larger |P| fewer tests may give a good enough result. For example, 10 tests via “hb.betterparts(27, 99, 9, 190,200)” produces
At 190 parts, steps = 11 = 5+6
At 191 parts, steps = 5 = 3+2
At 196 parts, steps = 5 = 4+1
As long as P fits into some level of cache (along with other relevant data) making |P| larger will speed up access. So, making |P| as large as practical is a good idea. As |P| gets larger, the difference in performance between one value of |P| and the next gets smaller and smaller. The limiting factors on speed then include time to multiply and time to set up while loops. One approach for faster multiplies may be to choose a power of 2 as a multiplier; compute |P| to match; then use shifts or adds to exponents instead of multiplies. One approach to spending less time setting up while loops is to move the statement if bins[bin] <= x < bins[bin+1]: (or its C equivalent, see below) to before the while statements and do the while's only if the if statement fails.
Python code is shown below. Note, in translating from Python to C,
• # begins a comment
• def begins a function
• a statement like ntest, right, wrong, x = 10*nbins, 0, 0, 0 assigns values to respective identifiers
• a statement like return (ntest, right, wrong, stepdown, stepup) returns a tuple of 5 values that the caller can assign to a tuple or to respective identifiers
• the scope of a def, while, or if ends with a line not indented farther than the def, while, or if
• bins = [0] initializes a list (an extendible indexable array) with value 0 as its initial entry
• bins.append(t) appends value t at the end of list bins
• for i,j in enumerate(p): runs a loop over the elements of iterable p (in this case, p is a list), making the index i and corresponding entry j == p[i] available inside the loop
• range(nparts) stands for a list of the values 0, 1, ... nparts-1
• range(plo, phi) stands for a list of the values plo, plo+1, ... phi-1
• if bins[bin] <= x < bins[bin+1] means if ((bins[bin] <= x) && (x < bins[bin+1]))
• int(round(x*float(nparts)/topsize)) actually rounds x·r, instead of computing ⌊x·r⌋ as advertised above
import random

def makebins(nbins, topsize):
    # Build nbins random-width bins covering 0..topsize, plus a sentinel limit
    bins, t = [0], 0
    for i in range(nbins):
        t += random.random()
        bins.append(t)
    for i in range(nbins+1):
        bins[i] *= topsize/t
    bins.append(topsize+1)
    return bins
#________________________________________________________________
def showbins(bins):
    print(''.join('{:6.2f} '.format(x) for x in bins))
def showparts(nbins, bins, topsize, nparts, p):
    ratio = float(topsize)/nparts
    for i,j in enumerate(p):
        print('{:3d}. {:3d} {:6.2f} {:7.2f} '.format(i, j, bins[j], i*ratio))
    print('nbins: {} topsize: {} nparts: {} ratio: {}'.format(nbins, topsize, nparts, ratio))
    print('p = ', p)
    print('bins = ', end='')
    showbins(bins)
#________________________________________________________________
def testparts(nbins, topsize, nparts, seed):
    # Make bins and make lookup table p
    if seed > 0: random.seed(seed)
    bins = makebins(nbins,topsize)
    ratio, j, p = float(topsize)/nparts, 0, list(range(nparts))
    for i in range(nparts):
        while j<nbins and i*ratio >= bins[j+1]:
            j += 1
        p[i] = j
    p.append(j)
    #showparts(nbins, bins, topsize, nparts, p)
    # Count # of hits and steps with avg. of 10 items per bin
    ntest, right, wrong, x = 10*nbins, 0, 0, 0
    delta, stepdown, stepup = topsize/float(ntest), 0, 0
    for i in range(ntest):
        bin = p[min(nparts, max(0, int(round(x*float(nparts)/topsize))))]
        while bin < nbins and x >= bins[bin+1]:
            bin += 1; stepup += 1
        while bin > 0 and x < bins[bin]:
            bin -= 1; stepdown += 1
        if bins[bin] <= x < bins[bin+1]: # Test if bin is correct
            right += 1
        else:
            wrong += 1
            print('Wrong bin {} {:7.3f} at x={:7.3f} Too {}'.format(bin, bins[bin], x, 'high' if bins[bin] > x else 'low'))
        x += delta
    return (ntest, right, wrong, stepdown, stepup)
#________________________________________________________________
def betterparts(nbins, topsize, seed, plo, phi):
    beststep = 1e9
    for parts in range(plo, phi):
        ntest, right, wrong, stepdown, stepup = testparts(nbins, topsize, parts, seed)
        if wrong: print('Error with ', parts, ' parts')
        steps = stepdown + stepup
        if steps <= beststep:
            beststep = steps
            print('At {:3d} parts, steps = {:d} = {:d}+{:d}'.format(parts, steps, stepdown, stepup))
#________________________________________________________________
Interpolation search is your friend. It's kind of an optimistic, predictive binary search where it guesses where the bin should be based on a linear assumption about the distribution of inputs, rather than just splitting the search space in half at each step. It will be O(1) if the linear assumption is true, but still works (though more slowly) when the assumption is not. To the degree that its predictions are accurate, the search is fast.
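A minimal sketch of interpolation search over bin edges, assuming the edges are strictly increasing and that x lies inside the covered range. The name find_bin is made up, and this is one possible formulation rather than a reference implementation.

```python
import bisect

def find_bin(edges, x):
    """Return i such that edges[i] <= x < edges[i+1], using interpolation search."""
    lo, hi = 0, len(edges) - 1
    if not (edges[lo] <= x < edges[hi]):
        raise ValueError("x is outside the binned range")
    while hi - lo > 1:
        # Guess the position assuming values are spread linearly between the edges.
        fraction = (x - edges[lo]) / (edges[hi] - edges[lo])
        mid = lo + int(fraction * (hi - lo))
        mid = min(max(mid, lo + 1), hi - 1)   # stay strictly inside so the range shrinks
        if x < edges[mid]:
            hi = mid
        else:
            lo = mid
    return lo

edges = [0, 2, 4, 8, 16, 32, 64, 128, 256]
print(find_bin(edges, 100))                      # 6, i.e. the bin 64..128
print(bisect.bisect_right(edges, 100) - 1)       # same answer via binary search
```

With evenly spaced edges the first guess is already right; with strongly skewed edges like the ones above, each step may shrink the range by only one bin, so binary search via bisect can be the safer default.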
It depends on the implementation of the hashing and the type of data you're working with. For smaller data sets a simpler algorithm like binary search might outperform constant-time lookup if the overhead of hashing is larger on average.
The usual implementation of hashing consists of an array of linked lists and a hashing function that maps a string to an index in the array of linked lists. There's a thing called the load factor, which is the number of elements in the hash map divided by the length of the linked-list array. Thus for load factors < 1 you'll achieve constant lookup in the best case, because then no linked list will contain more than one element.
There's only one way to find out which is better - implement a hash map and see for yourself. You should be able to get something near constant lookup :)
