Pairwise comparisons of a large amount of sorted arrays

Pairwise comparisons of a large amount of sorted arrays - arrays

Suppose I have n sorted integer arrays (a_1, ..., a_n, there may be duplicated elements in a single array), and T is a threshold value between 0 and 1. I would like to find all pairs of arrays the similarity of which is larger than T. The similarity of array a_j w.r.t. array a_i is defined as follows:
sim(i, j) = intersection(i, j) / length(i)
where intersection(i, j) returns the number of elements shared in a_i and a_j, and length(i) returns the length of array a_i.
I can enumerate all pairs of arrays and compute the similarity value, but this takes too much time for a large n (say n=10^5). Is there any data structure, pruning strategy, or other techniques that can reduce the time cost of this procedure? I'm using Java so the technique should be easily applicable in Java.

There are (n^2 - n)/2 pairs of arrays. If n=10^5, then you have to compute the similarity of 5 billion pairs of arrays. That's going to take some time.
One potential optimization is to shortcut your evaluation of two arrays if it becomes clear that you won't reach T. For example, if T is 0.5, you've examined more than half of the array and haven't found any intersections, then it's clear that that pair of arrays won't meet the threshold. I don't expect this optimization to gain you much.
It might be possible to make some inferences based on prior results. That is, if sim(1,2) = X and sim(1,3) < T, there's probably a value of X (likely would have to be very high) at which you can say definitively that sim(2,3) < T.

Related

Maximal subset sum smaller than a given value

You are given an array of integers and a number k. The question is to find a subset such that the sum is maximal and smaller than given number k.
I feel like there is a dynamic programming approach to solve this but I am not sure how to solve this problem efficiently.

The simple dynamic programs for this class of problems all use the same basic trick: for each successive prefix of the array, compute the set of its subset sums, depending on the fact that, when the input elements are bounded, this set has many fewer than 2^n elements (for the problem in question, less than 10,000,000 instead of 2^1000). This particular problem, in Python:
def maxsubsetsumlt(array, k):
sums = {0}
for elem in array:
sums.update({sum_ + elem for sum_ in sums if sum_ + elem < k})
return max(sums)

This could be done using the Subset Sum algorithm with a small adaptation. Instead of returning the boolean value from the last (bottom right) cell you search for the first cell with a value of true which indicates that there is a combination of elements that sum up to this particular k_i < k. This k_i is your maximal sum smaller than k. This algorithm has a worst case time and space complexity of O(nk).

Matlab: Looping a function on the elements of a 3-array vs Passing a 3-array to the same function

In Matlab I have an array v of length m, a matrix of order n and a function F that takes as an input a single matrix and outputs a number. Starting from v I would like to apply the function to the whole array of matrices whose i-th element consists of a matrix M_i whose entries are obtained by multiplicating all the entries of M by v_i. The output would be itself an array of length n.
As far as I can see there are two ways of achieving this:
Looping on all i=1:n, computing F on all the M_is and store all the corresponding values in an array
Defining a 3-array structure that contains all the matrices M_i and correspondingly extending the function F as to act on 3-arrays instead of matrices. However this entails overloading some matrix operators and functions (transpose, exponential, logarithm, square root, inverse etc...) as to formally handle a 3-array.
I have done the simpler option 1. It takes a long time to execute. Number 2 promises to be faster- However, I am not sure if this is the case, and I am not familiar with overloading operators on Matlab. In particular: how to extend a matrix operator to a 3-array in such a way that it performs the related function on all of its entries.

A for loop is probably no slower than vectorising this, especially for larger problems where memory starts to limit speed. Nevertheless, here are two ways of doing it:
M=rand(3,3,5) % I'm using a 3x3x5 matrix
v=1:5
F=#sum % This is the function
M2=bsxfun(#times,M,permute(v.',[2 3 1])) % Multiply the M(:,:,i) matrix by v(i)
R=arrayfun(#(t) F(M2(:,:,t)),(1:size(M,3)).','UniformOutput',false) % applies the function F to the resulting matrices
cell2mat(R) % Convert from cell array to matrix, since my F function returns row vectors
R2=zeros(size(M,3),size(M,1)); % Initialise R2
for t=1:size(M,3)
R2(t,:)=F(M(:,:,t)*v(t)); % Apply F to M(:,:,i)*v(i)
end
R2
You should do some testing to see which will be more efficient for your actual problem. The vectorised version should be faster for small problems, but use more memory, whereas the for loop will be slower for small problems but use less memory, and so could be faster on larger problems.

find if two arrays contain the same set of integers without extra space and faster than NlogN

I came across this post, which reports the following interview question:
Given two arrays of numbers, find if each of the two arrays have the
same set of integers ? Suggest an algo which can run faster than NlogN
without extra space?
The best that I can think of is the following:
(a) sort each array, and then (b) have two pointers moving along the two arrays and check if you find different values ... but step (a) has already NlogN complexity :(
(a) scan shortest array and put values into a map, and then (b) scan second array and check if you find a value that is not in the map ... here we have linear complexity, but we I use extra space
... so, I can't think of a solution for this question.
Ideas?
Thank you for all the answers. I feel many of them are right, but I decided to choose ruslik's one, because it gives an interesting option that I did not think about.

You can try a probabilistic approach by choosing a commutative function for accumulation (eg, addition or XOR) and a parametrized hash function.
unsigned addition(unsigned a, unsigned b);
unsigned hash(int n, int h_type);
unsigned hash_set(int* a, int num, int h_type){
unsigned rez = 0;
for (int i = 0; i < num; i++)
rez = addition(rez, hash(a[i], h_type));
return rez;
};
In this way the number of tries before you decide that the probability of false positive will be below a certain treshold will not depend on the number of elements, so it will be linear.
EDIT: In general case the probability of sets being the same is very small, so this O(n) check with several hash functions can be used for prefiltering: to decide as fast as possible if they are surely different or if there is a probability of them being equivalent, and if a slow deterministic method should be used. The final average complexity will be O(n), but worst case scenario will have the complexity of the determenistic method.

You said "without extra space" in the question but I assume that you actually mean "with O(1) extra space".
Suppose that all the integers in the arrays are less than k. Then you can use in-place radix sort to sort each array in time O(n log k) with O(log k) extra space (for the stack, as pointed out by yi_H in comments), and compare the sorted arrays in time O(n log k). If k does not vary with n, then you're done.

I'll assume that the integers in question are of fixed size (eg. 32 bit).
Then, radix-quicksorting both arrays in place (aka "binary quicksort") is constant space and O(n).
In case of unbounded integers, I believe (but cannot proof, even if it is probably doable) that you cannot break the O(n k) barrier, where k is the number of digits of the greatest integer in either array.
Whether this is better than O(n log n) depends on how k is assumed to scale with n, and therefore depends on what the interviewer expects of you.

A special, not harder case is when one array holds 1,2,..,n. This was discussed many times:
How to tell if an array is a permutation in O(n)?
Algorithm to determine if array contains n...n+m?
mathoverflow
and despite many tries no deterministic solutions using O(1) space and O(n) time were shown. Either you can cheat the requirements in some way (reuse input space, assume integers are bounded) or use probabilistic test.
Probably this is an open problem.

Here is a co-rp algorithm:
In linear time, iterate over the first array (A), building the polynomial
Pa = A[0] - x)(A[1] -x)...(A[n-1] - x). Do the same for array B, naming this polynomial Pb.
We now want to answer the question "is Pa = Pb?" We can check this probabilistically as follows. Select a number r uniformly at random from the range [0...4n] and compute d = Pa(r) - Pb(r) in linear time. If d = 0, return true; otherwise return false.
Why is this valid? First of all, observe that if the two arrays contain the same elements, then Pa = Pb, so Pa(r) = Pb(r) for all r. With this in mind, we can easily see that this algorithm will never erroneously reject two identical arrays.
Now we must consider the case where the arrays are not identical. By the Schwart-Zippel Lemma, P(Pa(r) - Pb(r) = 0 | Pa != Pb) < (n/4n). So the probability that we accept the two arrays as equivalent when they are not is < (1/4).

The usual assumption for these kinds of problems is Theta(log n)-bit words, because that's the minimum needed to index the input.
sshannin's polynomial-evaluation answer works fine over finite fields, which sidesteps the difficulties with limited-precision registers. All we need are a prime of the appropriate (easy to find under the same assumptions that support a lot of public-key crypto) or an irreducible polynomial in (Z/2)[x] of the appropriate degree (difficulty here is multiplying polynomials quickly, but I think the algorithm would be o(n log n)).
If we can modify the input with the restriction that it must maintain the same set, then it's not too hard to find space for radix sort. Select the (n/log n)th element from each array and partition both arrays. Sort the size-(n/log n) pieces and compare them. Now use radix sort on the size-(n - n/log n) pieces. From the previously processed elements, we can obtain n/log n bits, where bit i is on if a[2*i] > a[2*i + 1] and off if a[2*i] < a[2*i + 1]. This is sufficient to support a radix sort with n/(log n)^2 buckets.

In the algebraic decision tree model, there are known Omega(NlogN) lower bounds for computing set intersection (irrespective of the space limits).
For instance, see here: http://compgeom.cs.uiuc.edu/~jeffe/teaching/497/06-algebraic-tree.pdf
So unless you do clever bit manipulations/hashing type approaches, you cannot do better than NlogN.
For instance, if you used only comparisons, you cannot do better than NlogN.

You can break the O(n*log(n)) barrier if you have some restrictions on the range of numbers. But it's not possible to do this if you cannot use any extra memory (you need really silly restrictions to be able to do that).
I would also like to note that even O(nlog(n)) with sorting is not trivial if you have O(1) space limit as merge sort uses O(n) space and quicksort (which is not even strict o(nlog(n)) needs O(log(n)) space for the stack. You have to use heapsort or smoothsort.
Some companies like to ask questions which cannot be solved and I think it is a good practice, as a programmer you have to know both what's possible and how to code it and also know what are the limits so you don't waste your time on something that's not doable.
Check this question for a couple of good techniques to use:
Algorithm to tell if two arrays have identical members

For each integer i check that the number of occurrences of i in the two arrays are either both zero or both nonzero, by iterating over the arrays.
Since the number of integers is constant the total runtime is O(n).
No, I wouldn't do this in practice.

Was just thinking if there was a way you could hash the cumulative of both arrays and compare them, assuming the hashing function doesn't produce collisions from two differing patterns.

why not i find the sum , product , xor of all the elements one array and compare them with the corresponding value of the elements of the other array ??
the xor of elements of both arrays may give zero if the it is like
2,2,3,3
1,1,2,2
but what if you compare the xor of the elements of two array to be equal ???
consider this
10,3
12,5
here xor of both arrays will be same !!! (10^3)=(12^5)=9
but their sum and product are different . I think two different set of elements cannot have same sum ,product and xor !
This can be analysed by simple bitvalue examination.
Is there anything wrong in this approach ??

I'm not sure that correctly understood the problem, but if you are interested in integers that are in both array:
If N >>>>> 2^SizeOf(int) (count of bit for integer (16, 32, 64)) there is one solution:
a = Array(N); //length(a) = N;
b = Array(M); //length(b) = M;
//x86-64. Integer consist of 64 bits.
for i := 0 to 2^64 / 64 - 1 do //very big, but CONST
for k := 0 to M - 1 do
if a[i] = b[l] then doSomething; //detected
for i := 2^64 / 64 to N - 1 do
if not isSetBit(a[i div 64], i mod 64) then
setBit(a[i div 64], i mod 64);
for i := 0 to M - 1 do
if isSetBit(a[b[i] div 64], b[i] mod 64) then doSomething; //detected
O(N), with out aditional structures

All I know is that comparison based sorting cannot possibly be faster than O(NlogN), so we can eliminate most of the "common" comparison based sorts. I was thinking of doing a bucket sort. Perhaps if this qn was asked in an interview, the best response would first be to clarify what sort of data those integers represent. For e.g., if they represent a persons age, then we know that the range of values of int is limited, and can use bucket sort at O(n). However, this will not be in place....

If the arrays have the same size, and there are guaranteed to be no duplicates, sum each of the arrays. If the sum of the values is different, then they contain different integers.
Edit: You can then sum the log of the entries in the arrays. If that is also the same, then you have the same entries in the array.

Algorithm to pick values from array that sum closest to a target value?

I have an array of nearly sorted values 28 elements long. I need to find the set of values that sums to a target value provided to the algorithm (or if exact sum cannot be found, the closest sum Below the target value).
I currently have a simple algorithm that does the job but it doesn't always find the best match. It works under ideal circumstances with a specific set of values, but I need a more robust and accurate solution that can handle a wider variety of data sets.
The algorithm must be written in C, not C++, and is meant for an embedded system so keep that in mind.
Here's my current algorithm for reference. It iterates starting at the highest value available. If the current value is less than the target sum, it adds the value to the output and subtracts it from the target sum. This repeats until the sum has been reached or it runs out of values. It asumes a nearly ascending sorted list.
//valuesOut will hold a bitmask of the values to be used (LSB representing array index 0, next bit index 1, etc)
void pickValues(long setTo, long* valuesOut)
{
signed char i = 27;//last index in array
long mask = 0x00000001;
(*valuesOut) = 0x00000000;
mask = mask<< i;//shift to ith bit
while(i>=0 && setTo > 0)//while more values needed and available
{
if(VALUES_ARRAY[i] <= setTo)
{
(*valuesOut)|= mask;//set ith bit
setTo = setTo - VALUES_ARRAY[i]._dword; //remove from remaining }
//decrement and iterate
mask = mask >> 1;
i--;
}
}
A few more paramters:
The array of values is likely to be Nearly Sorted ascending, but that cannot be enforced so assume that there is not sorting. In fact, there may also be duplicate values.
It is quite possible that the array will hold a set of values that cannot create every sum within its range. If the exact sum cannot be found, the algorithm should return values that create the next Lowest Sum.

This problem is known as the subset sum problem, which is a special case of the Knapsack problem. Wikipedia is a good starting point for some algorithms.

As others have noted, this is same as the optimization version of subset sum problem, which is NP-Complete.
Since you mentioned that you are short in memory and can probably work with approximate solutions (based on your current solution), there are polynomial time approximation algorithms for solving the optimization version of subset sum.
For instance, given an e > 0, there is a polynomial time algorithm which uses O((n*logt)/e) space, (t is the target sum, n is the size of the array) which gives you a subset such that the sum z is no less than 1/(1+e) times the optimal.
i.e If the largest subset sum was y, then the algorithm finds a subset sum z such that
z <= y <= (1+e)z
and uses space O((n*logt)/e).
Such an algorithm can be found here: http://www.cs.dartmouth.edu/~ac/Teach/CS105-Winter05/Notes/nanda-scribe-3.pdf
Hope this helps.

If values are reasonably small, it's a simple dynamic programming (DP). Time complexity would be O(n * target) and memory requirements O(target). If that satisfies you, there're lots of DP tutorials on the web. For example, here the first discussed problem (with couns) is very much like yours (except they allow to use each number more than once):
http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=dynProg
update
Yep, as other person noted, it's a simple case of knapsack problem.

Algorithm to find a number and its square in an array

I have an array of integers, and I need an O(n) algorithm to find if the array contains a number and its square; one pair is sufficient.
I tried to do it myself, but I have only managed to find a solution in O(n2).
I thought about using counting sort, but the memory usage is too big.

create a new array twice the length of the input array. O(2N)
copy all of the numbers in O(N)
copy the squares of the numbers in O(N)
radix sort (we can since they are all ints) O(N)
iterate over once to see if there are two numbers the same one after the other O(N)
profit! O(1)

There are basically two ways to do this.
Sort the array and then perform a binary search for the square of each number. Overall complexity would be O(nlogn), but it would need sorting which would destroy the original ordering (which might be important for your case).
Insert all items of the array into a hashtable (or any fast set data structure). Then iterate over the elements of the array again, checking to see if its square exists in the hashtable. Using a hashtable gives an overall complexity of O(n), but you will need O(n) extra space. You could also use a tree-based set (e.g. std::set in C++ or TreeSet in Java), which would give you a complexity of O(nlogn).

If we're allowed to take that the input can be sorted in O(N) by radix sort, I'd improve a bit on Chris's solution:
radix sort the input.
For the first element of the result, linear search forward until we find either its square (in which case stop with true), or else the end (in which case stop with false) or else a value larger than the square (in which case continue searching for the square of the second and subsequent elements of the sorted array).
Each of the two "pointers" is moving strictly forward, so the overall complexity is O(N), assuming that the radix sort is O(N) and that squaring and comparison are O(1). Presumably whoever set the question intended these assumptions to be made.
In response to a comment by the questioner on another answer: if the integers in the input are not bounded, then I don't think it can be done. Just calculating the square of an integer requires greater than linear time (at least: no linear algorithm for multiplication is known), so consider an input of size n bits, consisting of two integers of size n / 3 bits and 2 * n / 3 bits. Testing whether one is the square of the other cannot be done in O(n). I think. I could be wrong.

While I can't add to the suggestions above, you can reduce the average run time by first finding the min and max values in your data set (both O(n)) and confining your search to that range. For instance if the maximum value is 620, I know that no integer 25 or over has a square in the list.

You may be able to do it with a couple of hashsets helping you out.
While iterating,
If the value is in the squares hashset, you've got a pair (value is the square of a previously found value)
If the square is in the values hashset, you've got a pair (square of this value was already passed)
else store the value in one and the square in the other.

Personally I think that Anon's answer (the little algorithm with 'squares') is more useful than it appears to be: remove the 'remove all less than e from squares' line from it, and the algorithm can handle an unsorted input array.
If we assume the typical Homework machine with Sufficient Space, the 'squares' datastructure could be modelled as an array of boolean flags, yielding true O(1) lookup time.

Without sorting, works with duplicates:
Iterate the array to find the smallest and largest integers. O(n)
Create an array of bits the the size of the difference. O(1) time, O(k) space
(Now each possible integer between the smallest and largest values has a corresponding bit in the array)
Iterate the old-array, setting the bit corresponding to each integer found to 1. O(n)
Iterate the old-array again, checking if the integer's square has its corresponding bit set. O(n)
(Though I didn't sort, this algorithm can be very easily modified to create a sorting algorithm which sorts in O(n+k) time and O(k) space)

If we're using C/C++ 32 bit unsigned ints the maximum value that can be stored is: 4294967295 =(2<<32)-1. The largest number whose square we can store is (1<<16)-1=65535. Now if create an array of bits and store in the array whether we've seen the number and/or its square (2 bits per "slot") we can get the total storage down to 65535/4 = 16384 bytes.
IMO This is not excessive memory consumption so we should be able to do this without radix sorting. An O(N) algorithm could look like this:
uint32_t index(uint32_t i ) { return i/4; }
unsigned char bit1( uint32_t i ) { return 1<<( (i%4)*2 ); }
unsigned char bit2( uint32_t i ) { return 1<<( (i%4)*2 +1 ); }
bool hasValueAndSquare( std::vector<uint32_t> & v )
{
const uint32_t max_square=65535;
unsigned char found[(max_square+1)/4]={0};
for(unsigned int i=0; i<v.size(); ++i)
{
if (v[i]<=max_square)
{
found[ index(v[i]) ] |= bit1(v[i]);
if ((found[ index(v[i])] & bit2(v[i])) == bit2(v[i])) return true;
}
uint32_t w = (uint32_t)round(sqrt(v[i]));
if( w*w == v[i] )
{
found[ index(w) ] |= bit2(w);
if ((found[index(w)] & bit1(w)) == bit1(w)) return true;
}
}
return false;
}
This is not tested, not very optimized, and a proper integer square-root would be better.
however the compiler should inline all the bit-accessing functions - so they'll be OK.
Note that if we're using 64 bit ints the memory consumption becomes much larger, instead of an array of 16Kb we need an array of 1Gb - possible less practical.

Optimization notes
Both the hashset and radix sort algorithms can be optimized by noting three facts:
Odd and even values can be handled separately
Calculating an integer square root is a very fast operation (typically consists of 3-5 divides and a few adds)
Cache locality is important for both of these algorithms
The optimized algorithms below will typically perform 5x faster and use less than half the RAM of the unoptimized case. In some cases where the data size is similar to the L2/L3 cache size they may perform 100x faster or more.
Optimized algorithm based on radix sort
Data structure is five lists of integers: IN, Aodd, Bodd, Aeven, Beven
The A and B lists use half the integer size of IN. (eg if IN = 64bits, A & B = 32bits)
Scan list IN to find the largest odd and even numbers MAXodd and MAXeven
Let LIMITodd = floor(sqrt(MAXodd))
Let LIMITeven = floor(sqrt(MAXeven))
For each number in list IN: a. Compute the square root if positive. If exact, add the square root to list Aodd/Aeven. b. If the number is >=0 and <= LIMITodd/LIMITeven, add it to list Bodd/Beven
Radix sort list Aodd and Bodd using just log2(LIMITodd) bits
Linear scan Aodd and Bodd for a match
Radix sort list Aeven and Beven using just log2(LIMITeven) bits
Linear scan Aeven and Beven for a match
If either linear scan finds a match, return that match immediately.
The reason this is much faster than the straightforward radix sort algorithm is that:
The arrays being sorted typicaly have less than 1/4 the numbers of values and need only half the number of bits per integer, so a total of less than 1/8 the RAM in use in a given sort which is good for the cache.
The radix sort is done on much fewer bits leading to fewer passes, so even if it does exceed your L1 or L2 cache you read RAM fewer times, and you read much less RAM
The linear scan is typically much faster because the A list contains only exact square roots and the B list only contains small values
Optimized algorithm based on hashset
Data structure is list of integers IN, plus two hashsets A and B
The A and B sets use half the integer size of IN
Scan list IN to find the largest odd and even numbers MAXodd and MAXeven
Let LIMITodd = floor(sqrt(MAXodd))
Let LIMITeven = floor(sqrt(MAXeven))
For each odd number in list IN: a. Compute the square root if positive. If exact, check if square root exists in B & return if true otherwise add it to A. b. If the number is >=0 and <= LIMITodd/LIMITeven, check if it exists in A & return if true otherwise add it to B.
Clear A and B and repeat step 4 for even numbers
The reason this is faster than the straightforward hashset algorithm is that:
The hashset is typically 1/8 the amount of RAM leading to much better cache performance
Only exact squares and small numbers have hashset entries, so much less time is spent hashing and adding/removing values
There is an additional small optimization available here: A and B can be a single hashset, which stores bit flag with each entry to say whether the integer is in A or B (it can't be in both because then the algorithm would have terminated).

If I correctly understand the problem, you have to check if a specified number is in the array. And not finding all the numbers in the array that have their square in the array too.
Simply maintain two boolean (one to check if the number has been found, another for the square), iterate over the elements in the array and test each element. Return the AND of the two boolean.
In pseudo code :
bool ArrayContainsNumberAndSquare(int number, int[] array):
boolean numberFound, squareFound;
int square = number * number;
foreach int i in array
(
numberFound = numberFound || i == number;
squareFound = squareFound || i == square;
)
return numberFound && squareFound;

1) With the hashmap you get O(n).
2) If you use std::set on 2 sets: the evens, and the odds, you can get
2*O((n/2)log(n/2))=O(nlog(n/2))
assuming there is roughly as many evens than odds

If the array is not sorted, you won't be able to do O(n).
If it is sorted, you can make use of that property to do it in one pass, like so:
foreach e in array
if squares contains e
return true
remove all less than e from squares
add e * e to squares
return false
Where squares is, say, a HashSet.
If it's not sorted, you can sort it in O(n log n) and then use this method to check for squares, which will still be faster than the naive solution on a large enough data set.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Pairwise comparisons of a large amount of sorted arrays - arrays

Related

Maximal subset sum smaller than a given value

Matlab: Looping a function on the elements of a 3-array vs Passing a 3-array to the same function

find if two arrays contain the same set of integers without extra space and faster than NlogN

Algorithm to pick values from array that sum closest to a target value?

Algorithm to find a number and its square in an array

Categories

Resources