Order insensitive hash function for an array

Order insensitive hash function for an array - arrays

I'm looking for a hash-function which will produce the same result for unordered sequences containing same elements.
For example:
Array_1: [a, b, c]
Array_2: [b, a, c]
Array_3: [c, b, a]
The hash-function should return the same result for each of these arrays.
How to achieve this?
The most popular answer is to sort elements by some rule, then concatenate, then take hash.
Is there any other method?

if a,b,c are numbers, you could sum up and then build a hash on the sum.
You may multiply, too.
But take care about zeros!
XOR-ing numbers is also an approach.
for very small numbers you may consider to set the bit indexed by the number. This means building a long (64bit) as input for the hash allows only element numbers in range 0-63.
The more elements you have the more collisions you will get.
In the end you map n elements with m bits (resulting to 2^(m*n) range) to a hash value with k bits.
Usually m and k is a constant but n varies.
Please aware any access as by a hash requires a test whether to get the correct element. In general a hash is NOT unique.
otherwise sort the element and then do the hash as proposed
Regarding the comment from CodesInChaos:
in order to be able to omit a test, the numbers of bits of the hash should be much greater than the sum of elements bits. Say at least 64 bits more. In general this situation is not given.
One common case of secure hash/unique id is a guid. This means effectively 128 bits.
A random sequence of text char reaches this number of bits within 20-25 characters.
Longer texts are very likely to produce collisions. It depends on the use case whether this is still acceptable.

XOR | Sum | Sum of squares | ...
where | denotes concat.
or
XOR of hash of elements

Related

How to select value from different ranges with equal probability

Provided different ranges, select each value with equal probability.
Like say var 'a' can have values among { [10,20],40,[70,100]...} (given) . Each selected value by provided constraints should have same probability. How to get a random value in C?

Giving each Range equal probabilistic chance:
Let N be the number of ranges you've defined in your problem-set. Ranges { R0, R1, R2 ... RN-1 }, Indexes start at 0.
Generate a random number, RandValue mod N to pick a range. In C, modulo operator is %, gives you integral remainder.
Is picked range just a number? (like 40 in your example)
3.1 Yes, then your random value is that number
3.2 No, it's a range. Find a random value within selected range.
Giving each value in all ranges equal probabilistic chance:
Let N be the number of values across all ranges.
Map each value to an index, Values { V0, V1, V2 ... VN-1 }, Indexes start at 0.
Use hash-tables for quick lookups. Also, you can handle overlapping ranges.
Generate a random number, RandValue mod N to pick a value-index.
Look up in hash-table for mapped value against the index.
Also, note that hash-table could become huge if the ranges are too large. In that case you may have to merge overlapping/consecutive (if any) ranges and maintain sorted(by value-index) list(array of structs) of ranges and assign index-ranges. Use binary search to find the range against random-index. Range offsets (start/end values & indexes) should give the final value for a given random-index.
PS: This is for trivial implementations of randomness in C projects. I believe all randomness is deterministic.
Edit: I agree, there is modulo-bias & to reject values beyond (RAND_MAX - RAND_MAX % N).

Simple solution:
do
r=rand();
until (is_in_range(r));
It's not at all efficient, and especially it's not bounded in running time. But it should work.
And sometimes simple and stupid solutions are good enough.
(Once you start doing things like r=rand()%limit;, then you start introducing skewed probabilities. Imagine doing r=rand()%((RAND_MAX/2)+1);. It's twice as likely to return anything below RAND_MAX/2 as RAND_MAX/2.
See this answer for more detail. )
To improve performance, one could do something like what #Jakob Stark hinted at:
for(limit=1;limit<top_of_range;limit<<=1)
; // Find the smallest power-of-two larger than the top_of_range
do
r=rand()%limit;
while(!(is_in_range(r));
It's still not guaranteed to run in finite time, though...

Find way to separate array so each subarrays sum is less or equal to a number

I have a mathematical/algorithmic problem here.
Given an array of numbers, find a way to separate it to 5 subarrays, so that sum of each subarrays is less than or equal to a given number. All numbers from the initial array, must go to one of the subarrays, and be part of one sum.
So the input to the algorithm would be:
d - representing the number that each subarrays sum has to be less or equal
A - representing the array of numbers that will be separated to different subarrays, and will be part of one sum
Algorithm complexity must be polynomial.
Thank you.

If by "subarray" you mean "subset" as opposed to "contiguous slice", it is impossible to find a polynomial time algorithm for this problem (unless P = NP). The Partition Problem is to partition a list of numbers into to sets such that the sum of both sets are equal. It is known to be NP-complete. The partition problem can be reduced to your problem as follows:
Suppose that x1, ..., x_n are positive numbers that you want to partition into 2 sets such that their sums are equal. Let d be this common sum (which would be the sum of the xi divided by 2). extend x_i to an array, A, of size n+3 by adding three copies of d. Clearly the only way to partition A into 5 subarrays so that the sum of each is less than or equal to d is if the sum of each actually equals d. This would in turn require 3 of the subarrays to have length 1, each consisting of the number d. The remaining 2 subarrays would be exactly a partition of the original n numbers.
On the other hand, if there are additional constraints on what the numbers are and/or the subarrays need to be, there might be a polynomial solution. But, if so, you should clearly spell out what there constraints are.

Set up of the problem:
d : the upper bound for the subarray
A : the initial array
Assuming A is not sorted.
(Heuristic)
Algorithm:
1.Sort A in ascending order using standard sorting algorithm->O(nlogn)
2.Check if the largest element of A is greater than d ->(constant)
if yes, no solution
if no, continue
3.Sum up all the element in A, denote S. Check if S/5 > d ->O(n)
if yes, no solution
if no, continue
4.Using greedy approach, create a new subarray Asi, add next biggest element aj in the sorted A to Asi so that the sum of Asi does not exceed d. Remove aj from sorted A ->O(n)
repeat step4 until either of the condition satisfied:
I.At creating subarray Asi, there are only 5-i element left
In this case, split the remaining element to individual subarray, done
II. i = 5. There are 5 subarray created.
The algorithm described above is bounded by O(nlogn) therefore in polynomial time.

A unique key for a two dimensional array of letters

I have a two dimensional array of letters. Any letter can vary according to a certain alphabet.
I want to make a unique key for this array according to the letters and its position.
For example, if the array is 3 * 3 and the alphabet is {0, a, b, c, *}, the array can be in the form like:
0 b c
b * a
a a 0
I have tried Key = sum(code(letter)*(r*3+c)) for all r and c, where r and c are the row and the column, but it still gives me the same key for different array forms.
What do I miss?
P.S. code(letter) is a mapping function to convert the letter into a value.

You need to take into account the size of alphabet. If code and indices are all zero based it would be:
key = Sum(code(letter)*pow(L, r*C+c))
where L is the number of letters and C is the number of columns. However watch out for numeric overflow. For larger alphabets or matrices you need to use one of the following:
Lessen the requirement of keys being unique and use a hash (hash combiner).
Larger number type for the key or even unlimited arithmetic type such as in GMP lib.
Compression such as arithmetic coding if the distribution of letters is not even. However you still run into the risk of not being able to fit / compress specific matrix into the key.

Find all possible distances from two arrays

Given two sorted array A and B length N. Each elements may contain natural number less than M. Determine all possible distances for all combinations elements A and B. In this case, if A[i] - B[j] < 0, then the distance is M + (A[i] - B[j]).
Example :
A = {0,2,3}
B = {1,2}
M = 5
Distances = {0,1,2,3,4}
Note: I know O(N^2) solution, but I need faster solution than O(N^2) and O(N x M).
Edit: Array A, B, and Distances contain distinct elements.

You can get a O(MlogM) complexity solution in the following way.
Prepare an array Ax of length M with Ax[i] = 1 if i belongs to A (and 0 otherwise)
Prepare an array Bx of length M with Bx[M-1-i] = 1 if i belongs to B (and 0 otherwise)
Use the Fast Fourier Transform to convolve these 2 sequences together
Inspect the output array, non-zero values correspond to possible distances
Note that the FFT is normally done with floating point numbers, so in step 4 you probably want to test if the output is greater than 0.5 to avoid potential rounding noise issues.

I possible done with optimized N*N.
If convert A to 0 and 1 array where 1 on positions which present in A (in range [0..M].
After convert this array into bitmasks, size of A array will be decreased into 64 times.
This will allow insert results by blocks of size 64.
Complexity still will be N*N but working time will be greatly decreased. As limitation mentioned by author 50000 for A and B sizes and M.
Expected operations count will be N*N/64 ~= 4*10^7. It will passed in 1 sec.

You can use bitvectors to accomplish this. Bitvector operations on large bitvectors is linear in the size of the bitvector, but is fast, easy to implement, and may work well given your 50k size limit.
Initialize two bitvectors of length M. Call these vectA and vectAnswer. Set the bits of vectA that correspond to the elements in A. Leave vectAnswer with all zeroes.
Define a method to rotate a bitvector by k elements (rotate down). I'll call this rotate(vect,k).
Then, for every element b of B, vectAnswer = vectAnswer | rotate(vectA,b).

efficient methods to do summation

Is there any efficient techniques to do the following summation ?
Given a finite set A containing n integers A={X1,X2,…,Xn}, where Xi is an integer. Now there are n subsets of A, denoted by A1, A2, ... , An. We want to calculate the summation for each subset. Are there some efficient techniques ?
(Note that n is typically larger than the average size of all the subsets of A.)
For example, if A={1,2,3,4,5,6,7,9}, A1={1,3,4,5} , A2={2,3,4} , A3= ... . A naive way of computing the summation for A1 and A2 needs 5 Flops for additions:
Sum(A1)=1+3+4+5=13
Sum(A2)=2+3+4=9
...
Now, if computing 3+4 first, and then recording its result 7, we only need 3 Flops for addtions:
Sum(A1)=1+7+5=13
Sum(A2)=2+7=9
...
What about the generalized case ? Is there any efficient methods to speed up the calculation? Thanks!

For some choices of subsets there are ways to speed up the computation, if you don't mind doing some (potentially expensive) precomputation, but not for all. For instance, suppose your subsets are {1,2}, {2,3}, {3,4}, {4,5}, ..., {n-1,n}, {n,1}; then the naive approach uses one arithmetic operation per subset, and you obviously can't do better than that. On the other hand, if your subsets are {1}, {1,2}, {1,2,3}, {1,2,3,4}, ..., {1,2,...,n} then you can get by with n-1 arithmetic ops, whereas the naive approach is much worse.
Here's one way to do the precomputation. It will not always find optimal results. For each pair of subsets, define the transition cost to be min(size of symmetric difference, size of Y - 1). (The symmetric difference of X and Y is the set of things that are in X or Y but not both.) So the transition cost is the number of arithmetic operations you need to do to compute the sum of Y's elements, given the sum of X's. Add the empty set to your list of subsets, and compute a minimum-cost directed spanning tree using Edmonds' algorithm (http://en.wikipedia.org/wiki/Edmonds%27_algorithm) or one of the faster but more complicated variations on that theme. Now make sure that when your spanning tree has an edge X -> Y you compute X before Y. (This is a "topological sort" and can be done efficiently.)
This will give distinctly suboptimal results when, e.g., you have {1,2}, {3,4}, {1,2,3,4}, {5,6}, {7,8}, {5,6,7,8}. After deciding your order of operations using the procedure above you could then do an optimization pass where you find cheaper ways to evaluate each set's sum given the sums already computed, and this will probably give fairly decent results in practice.
I suspect, but have made no attempt to prove, that finding an optimal procedure for a given set of subsets is NP-hard or worse. (It is certainly computable; the set of possible computations you might do is finite. But, on the face of it, it may be awfully expensive; potentially you might be keeping track of about 2^n partial sums, be adding any one of them to any other at each step, and have up to about n^2 steps, for a super-naive cost of (2^2n)^(n^2) = 2^(2n^3) operations to try every possibility.)

Assuming that 'addition' isn't simply an ADD operation but instead some very intensive function involving two integer operands, then an obvious approach would be to cache the results.
You could achieve that via a suitable data structure, for example a key-value dictionary containing keys formed by the two operands and the answers as the value.
But as you specified C in the question, then the simplest approach would be an n by n array of integers, where the solution to x + y is stored at array[x][y].
You can then repeatedly iterate over the subsets, and for each pair of operands you check the appropriate position in the array. If no value is present then it must be calculated and placed in the array. The value then replaces the two operands in the subset and you iterate.
If the operation is commutative then the operands should be sorted prior to looking up the array (i.e. so that the first index is always the smallest of the two operands) as this will maximise "cache" hits.

A common optimization technique is to pre-compute intermediate results. In your case, you might pre-compute all sums with 2 summands from A and store them in a lookup table. This will result in |A|*|A+1|/2 table entries, where |A| is the cardinality of A.
In order to compute the element sum of Ai, you:
look up the sum of the first two elements of Ai and save them in tmp
while there is an element x left in Ai:
look up the sum of tmp and x
In order to compute the element sum of A1 = {1,3,4,5} from your example, you do the following:
lookup(1,3) = 4
lookup(4,4) = 8
lookup(8,5) = 13
Note that computing the sum of any given Ai doesn't require summation, since all the work has already been conducted while pre-computing the lookup table.
If you store the lookup table in a hash table, then lookup() is in O(1).
Possible optimizations to this approach:
construct the lookup table while computing the summation results; hence, you only compute those summations that you actually need. Your lookup table is now a cache.
if your addition operation is commutative, you can save half of your cache size by storing only those summations where the smaller summand comes first. Then modify lookup() such that lookup(a,b) = lookup(b,a) if a > b.

If assuming summation is time consuming action you can find LCS of every pair of subsets (by assuming they are sorted as mentioned in comments, or if they are not sorted sort them), after that calculate sum of LCS of maximum length (over all LCS in pairs), then replace it's value in related arrays with related numbers, update their LCS and continue this way till there is no LCS with more than one number. Sure this is not optimum, but it's better than naive algorithm (smaller number of summation). However you can do backtracking to find best solution.
e.g For your sample input:
A1={1,3,4,5} , A2={2,3,4}
LCS (A_1,A_2) = {3,4} ==>7 ==>replace it:
A1={1,5,7}, A2={2,7} ==> LCS = {7}, maximum LCS length is `1`, so calculate sums.
Still you can improve it by calculation sum of two random numbers, then again taking LCS, ...

NO. There is no efficient techique.
Because it is NP complete problem. and there are no efficient solutions for such problem
why is it NP-complete?
We could use algorithm for this problem to solve set cover problem, just by putting extra set in set, conatining all elements.
Example:
We have sets of elements
A1={1,2}, A2={2,3}, A3 = {3,4}
We want to solve set cover problem.
we add to this set, set of numbers containing all elements
A4 = {1,2,3,4}
We use algorhitm that John Smith is aking for and we check solution A4 is represented whit.
We solved NP-Complete problem.