Generate unique integer key for set of lists - arrays

I have a number of integer vectors all the same length. They could contain any signed int16.
I need to create a single unique number for each vector, but vectors which have the same content must be given the same number.
E.g the following vectors:
[1, 2, 3, 4]
[1, 2, 3, 4]
[6, 2, 4, 1]
might be assigned the numbers 2, 2 and 4.
Also order counts. So the vectors
[1, 2, 3, 4]
[2, 1, 4, 3]
should get different values.
Is there any reliable way to calculate a single number for such a set of vectors?
To sum up the value must:
Be the same for vectors which are exactly the same (order counts!)
Be guaranteed to be unique for different values
The value must be calculated for one vector at a time...i.e you are given a vector, you get the value, then you get the next vector and so on.
The whole purpose of this is that I'm interested in an alternative way of indexing distinct vectors, as opposed to e.g. adding them all to an ordered set or similar.

To guarantee perfect uniqueness, you will have to compose one large number out of all the individual elements. Since you specified signed int16 values, a vector of four elements gives the following 64-bit key:
[n1, n2, n3, n4] => n1 + n2*2^16 + n3*2^32 + n4*2^48
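The formula above can be sketched in Python as follows. This is a minimal sketch, not the answerer's code: the function name `vector_key` is hypothetical, and shifting each element by 32768 so that every 16-bit slot is non-negative is an added assumption (it keeps the key non-negative and makes the packing easy to reason about). Python integers have arbitrary precision, so the same loop also works for vectors longer than four elements; in a language with fixed 64-bit integers, only four int16 values fit.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def vector_key(vec):
    """Pack a sequence of signed int16 values into one integer key.

    Each element occupies its own 16-bit slot, so equal vectors get equal
    keys and vectors that differ in any position get different keys.
    """
    key = 0
    for position, value in enumerate(vec):
        if not INT16_MIN <= value <= INT16_MAX:
            raise ValueError(f"{value} is not a signed int16")
        # Assumption: offset to the range 0..65535 before shifting into place.
        key |= (value - INT16_MIN) << (16 * position)
    return key


keys = [vector_key(v) for v in ([1, 2, 3, 4], [1, 2, 3, 4], [6, 2, 4, 1])]
```

The first two keys are equal and the third differs; `[1, 2, 3, 4]` and `[2, 1, 4, 3]` also get different keys because each position has its own slot.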

Related

Searching 2D numpy array of ids for partial id

I have the following 2 arrays:
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5])
Each row in the array arr describes a 4-digit id, there are no redundant ids - neither in their values nor their combination. So if [1, 2, 3, 4] exists, no other combination of these 4 digits can exist. This will be important in a sec.
The array ids contains a 2-digit id, but the order of its digits is random. Now I need to go through each row of arr and check whether this 2-digit partial id is part of any 4-digit id. In this example ids matches the 2nd and 3rd rows from the top of arr.
My current solution with np.isin only works if the array ids has also a 4-digit row.
arr[np.isin(arr, ids).all(1)]
Changing all(1) to any(1) doesn't do the trick either, because then it would be enough for just one digit of ids to be in a row of arr, whereas I need both values.
Does anyone have a compact solution?
You just need the boolean index to accept only rows where the number of matches is 2. In non-boolean operations such as sum, the True and False values of a boolean array are interpreted as 1 and 0:
arr[np.isin(arr, ids).sum(1) == 2]
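For reference, here is a self-contained version of that one-liner, as a sketch. It compares the per-row count against `len(ids)` instead of the literal 2, which is a small generalisation not in the original answer, and it assumes (as the example data suggests) that no digit is repeated within a row of `arr`.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [7, 5, 6, 3],
                [2, 4, 8, 9]])
ids = np.array([6, 5])

# np.isin marks every cell of arr whose value appears in ids;
# summing along axis 1 counts the marked cells per row.
hits_per_row = np.isin(arr, ids).sum(axis=1)

# Keep only the rows that contain every digit of the partial id.
mask = hits_per_row == len(ids)
matches = arr[mask]
```

With the data above, `matches` holds the 2nd and 3rd rows of `arr`.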

Generate 2D array with adjacent elements not being x+1

I need to develop an algorithm which would accept two numbers m and n - dimensions of 2D array - as input and generate 2D array filled with numbers [1..m*n] with the following condition:
All (4) elements adjacent to a given element cannot be equal to currentElement + 1
Adjacent elements are located to the two/three/four sides (depending on position) of a given element
0 1 0
1 2 1
0 1 0
(E.g four 1s are adjacent to 2)
Example:
Input: m = 3, n = 3 (it does not necessarily have to be a square matrix)
(Sample) output:
[
[7, 2, 5],
[1, 6, 9],
[3, 8, 4]
]
Note that there apparently may exist more than one possible output. In that case, numbers in the array have to be generated randomly (though still meeting the conditions), not following any preset sequence (e.g not [ [1, 3, 5], [4, 6, 2], [7, 9, 8] ] because it clearly uses a non-randomly generated sequence of numbers, odds first, then evens, etc)
Basically, for the same input, on two different occasions, two different arrays should be generated.
P.S: that was a coding interview question and I wonder how I could solve it, so, any help is highly appreciated.
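One naive way to approach this, sketched under assumptions: shuffle the numbers 1..m*n, lay them out row by row, and retry until the condition holds. The function names `is_valid` and `random_grid` and the `max_attempts` parameter are hypothetical. Because "no neighbour equals current + 1" has to hold for every cell, it is equivalent to "no two adjacent cells differ by exactly 1", which is what the check tests. Rejection sampling like this is only practical for small grids, and some sizes (for example 2 x 2) have no valid arrangement at all, hence the attempt limit.

```python
import random


def is_valid(grid):
    """Return True if no two side-adjacent cells differ by exactly 1."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            # Checking only the right and lower neighbours visits each
            # adjacent pair exactly once.
            if c + 1 < cols and abs(grid[r][c] - grid[r][c + 1]) == 1:
                return False
            if r + 1 < rows and abs(grid[r][c] - grid[r + 1][c]) == 1:
                return False
    return True


def random_grid(m, n, max_attempts=100_000):
    """Shuffle 1..m*n until a valid m x n grid appears; None if none is found."""
    values = list(range(1, m * n + 1))
    for _ in range(max_attempts):
        random.shuffle(values)
        grid = [values[r * n:(r + 1) * n] for r in range(m)]
        if is_valid(grid):
            return grid
    return None


grid = random_grid(3, 3)
```

Since every valid arrangement is equally likely to be hit, repeated calls generally return different grids for the same input.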

Average between arrays of different length

I'm trying to develop a sort of very simple machine learning example to recognize similarity between arrays.
For this reason I'm trying to calculate the average between 2 arrays with different length.
For example if I have:
array_1 = [0, 4, 5];
array_2 = [4, 2, 7];
The average is:
average_array = [2, 3, 6];
But how can I manage to calculate the average if I have the following situation:
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
As you can see the arrays have a different length.
Is there an algorithm that I can apply to solve this problem?
Does anyone have an idea or some suggestion?
Of course I can consider the missing values of the second array as 0, and evaluate the average as, for example:
average_array = [2, 3, 6, 5, 3.5];
or consider the values as "null" and have:
average_array = [2, 3, 6, 10, 7];
But are these two approaches good?
Or is there something smarter?
Thanks for your help!!
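For concreteness, the two options described in the question can be sketched as follows. The function names are hypothetical; `itertools.zip_longest` from the standard library does the padding.

```python
from itertools import zip_longest


def average_zero_padded(a, b):
    """Treat missing elements of the shorter array as 0."""
    return [(x + y) / 2 for x, y in zip_longest(a, b, fillvalue=0)]


def average_ignore_missing(a, b):
    """Treat missing elements as null: average only the values present."""
    result = []
    for x, y in zip_longest(a, b, fillvalue=None):
        present = [v for v in (x, y) if v is not None]
        result.append(sum(present) / len(present))
    return result


array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]
zero_padded = average_zero_padded(array_1, array_2)
ignoring_missing = average_ignore_missing(array_1, array_2)
```

These reproduce the two `average_array` results given above; which one is appropriate depends on what the arrays represent.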
To answer your question, we really need more information on what you are trying to achieve.
I'm trying to develop a sort of very simple machine learning example
to recognize similarity between arrays. For this reason I'm trying to
calculate the average between 2 arrays with different length.
Depending on your use case, similarity might be defined completely differently.
For instance:
if the array encodes sound-information you might want to measure similarity as "does this sound clip occur in this one" or "are the main frequencies (which would correspond to chords) the same"
if the array encodes image information (properly DFT-ed and zig-zag-encoded) you might not care about the low frequencies (end of the array) and only measure the difference between the first few values of the array
if the array encodes some kind of composition of elements (e.g. this essay contains keyword "matrix" 40 times, and keyword "SVM" 27 times) the difference in values might be very important.
General advice:
Think about what you're measuring
Decide what's important
But in general, have a look at smoothing algorithms, for instance Kneser-Ney or Good-Turing smoothing. They explicitly deal with comparing vectors of probabilities that may differ in length (in other words, that have explicit zero entries).
https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation
If, after taking the average of the arrays, you intend to take the mod of the difference between each array and the average array, then you are probably heading in the right direction, provided you measure dissimilarity by the magnitude of that difference.
But for arrays of different lengths I propose that you also take the index of extra elements in consideration.
For
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
average should be average_array = [2, 3, 6, 6.5, 5.5];
6.5 = (10 + 3(index) + 0(element) ) / 2
and
5.5 = (7 + 4(index) + 0(element))/2
The reason for taking the index into consideration is that this approach also accounts for the difference in length. However, this is just my 2 cents. Maybe there are better algorithms out there.
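A minimal sketch of the index-aware averaging proposed in this answer, assuming 0-based indices as in the worked numbers above. The function name is hypothetical.

```python
def average_index_aware(a, b):
    """Average two arrays; extra elements are averaged with their own index."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    result = []
    for index, value in enumerate(longer):
        if index < len(shorter):
            # Both arrays have an element here: plain average.
            result.append((value + shorter[index]) / 2)
        else:
            # Only the longer array has an element: the missing element
            # counts as 0 and the index is added, as in the answer.
            result.append((value + index + 0) / 2)
    return result


average_array = average_index_aware([0, 4, 5, 10, 7], [4, 2, 7])
```

The result matches the values worked out by hand above.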
You should also take a look at this post

Answering Queries on List of Subarrays

Given an array A of size N, we construct a list containing all possible subarrays of A in descending order.
Two subarrays B and C are compared by padding them with zeroes until both are of size N. Then we compare the two subarrays element by element and return as soon as a point of difference is observed.
We are given multiple queries where given x we have to find the maximum element in the xth subarray sorted according to the order given above.
For example, if the array A is [3, 1, 2, 4]; then the sorted subarrays will be:
[4]
[3, 1, 2, 4]
[3, 1, 2]
[3, 1]
[3]
[2, 4]
[2]
[1, 2, 4]
[1, 2]
[1]
A query where x = 3 corresponds to finding the maximum element in the subarray [3, 1, 2]; so here the answer would be 3.
Since the number of queries is large (of the order of 10^5) and the number of elements in the array can also be large (of the order of 10^5), we would need to do some preprocessing to answer each query in O(1) or O(log N) or O(sqrt N) time. I can't seem to figure out how to do this. I have solved it for the case where the array contains unique elements, but how could we do this when the array contains repetitions? Is there any data structure which could help in storing the required information?
Build a suffix array in descending order for this array (treat it like a string).
For every entry, store its length and its cumulative count (the sum of the lengths of all earlier entries in the suffix array).
For a query, find the right entry by binary search over the cumulative counts, then take the needed prefix of that suffix.
For your example, the suffixes with their cumulative counts are
4 (0)
3124 (1)
24 (5)
124 (7)
Query 3 finds entry 3124 (1 < 3 <= 5) and takes its 3-1 = 2nd longest prefix, 312.
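The steps above can be sketched in Python as follows. This is only an illustration under assumptions: it assumes all elements are distinct and positive (the case the example covers; with repeated elements the prefixes of different suffixes can interleave and this simple scheme no longer gives the right order), it sorts the suffixes naively rather than with a real suffix-array construction, and it calls `max` on the slice where a range-maximum structure would be needed for large inputs. The function names are hypothetical.

```python
from bisect import bisect_right


def build_index(a):
    """Return suffix start positions in descending suffix order, plus the
    number of subarrays that precede each suffix's block of prefixes."""
    n = len(a)
    starts = sorted(range(n), key=lambda i: a[i:], reverse=True)
    cumulative = []
    total = 0
    for start in starts:
        cumulative.append(total)
        total += n - start  # a suffix of length L contributes L prefixes
    return starts, cumulative


def query(a, starts, cumulative, x):
    """Return (x-th subarray, its maximum) for a 1-based rank x."""
    entry = bisect_right(cumulative, x - 1) - 1
    start = starts[entry]
    rank_in_block = x - cumulative[entry]  # 1 means the full suffix
    length = (len(a) - start) - rank_in_block + 1
    subarray = a[start:start + length]
    return subarray, max(subarray)


A = [3, 1, 2, 4]
starts, cumulative = build_index(A)
third = query(A, starts, cumulative, 3)
```

For the example array, the cumulative counts come out as 0, 1, 5, 7 and the query for x = 3 yields the subarray [3, 1, 2] with maximum 3.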

Efficient way of finding sequential numbers across multiple arrays?

I'm not looking for any code or having anything being done for me. I need some help to get started in the right direction but do not know how to go about it. If someone could provide some resources on how to go about solving these problems I would very much appreciate it. I've sat with my notebook and am having trouble designing an algorithm that can do what I'm trying to do.
I can probably do:
foreach element in array1
foreach element in array2
check if array1[i] == array2[j]+x
I believe this would work for both forward and backward sequences, and for the multiples just check array1[i] % array2[j] == 0. I have a list which contains int arrays and am getting list[index] (for array1) and list[index+1] for array2, but this solution can get complex and lengthy fast, especially with large arrays and a large list of those arrays. Thus, I'm searching for a better solution.
I'm trying to come up with an algorithm for finding sequential numbers in different arrays.
For example:
[1, 5, 7] and [9, 2, 11] would find that 1 and 2 are sequential.
This should also work for multiple sequences in multiple arrays. So if there is a third array of [24, 3, 15], it will also include 3 in that sequence, and continue on to the next array until there isn't a number that matches the last sequential element + 1.
It also should be able to find more than one sequence between arrays.
For example:
[1, 5, 7] and [6, 3, 8] would find that 5 and 6 are sequential and also 7 and 8 are sequential.
I'm also interested in finding reverse sequences.
For example:
[1, 5, 7] and [9, 4, 11] would return 5 and 4 are reverse sequential.
Example with all:
[1, 5, 8, 11] and [2, 6, 7, 10] would return 1 and 2 are sequential, 5 and 6 are sequential, 8 and 7 are reverse sequential, 11 and 10 are reverse sequential.
It can also overlap:
[1, 5, 7, 9] and [2, 6, 11, 13] would return 1 and 2 sequential, 5 and 6 sequential and also 7 and 6 reverse sequential.
I also want to expand this to check numbers with a difference of x (above examples check with a difference of 1).
In addition to all of that (although this might be a different question), I also want to check for multiples,
Example:
[5, 7, 9] and [10, 27, 8] would return 5 and 10 as multiples, 9 and 27 as multiples.
and numbers with the same ones place.
Example:
[3, 5, 7] and [13, 23, 25] would return that 3, 13 and 23 have the same ones digit.
Use a dictionary (set or hashmap)
dictionary1 = {}
Go through each item in the first array and add it to the dictionary.
[1, 5, 7]
Now dictionary1 = {1:true, 5:true, 7:true}
dictionary2 = {}
Now go through each item in [6, 3, 8] and lookup if it's part of a sequence.
6 is part of a sequence because dictionary1[6+1] == true
so dictionary2[6] = true
We get dictionary2 = {6:true, 8:true}
Now set dictionary1 = dictionary2 and dictionary2 = {}, and go to the third array.. and so on.
We only keep track of sequences.
Since each lookup is O(1), and we do 2 lookups per number, (e.g. 6-1 and 6+1), the total is n*O(1) which is O(N) (N is the number of numbers across all the arrays).
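A rough sketch of this set-based lookup, using Python sets in place of the true-valued dictionaries. The names `find_links` and `chain_ends` and the `diff` parameter are hypothetical; `diff` stands in for the difference x mentioned in the question.

```python
def find_links(previous, current, diff=1):
    """Return (forward, backward) pairs between two arrays.

    forward  pairs (p, c) satisfy c == p + diff (sequential);
    backward pairs (p, c) satisfy c == p - diff (reverse sequential).
    """
    seen = set(previous)  # O(1) membership tests
    forward = [(c - diff, c) for c in current if c - diff in seen]
    backward = [(c + diff, c) for c in current if c + diff in seen]
    return forward, backward


def chain_ends(arrays, diff=1):
    """Follow forward sequences through consecutive arrays and return the
    values in the last array that end a sequence spanning every array."""
    live = set(arrays[0])
    for current in arrays[1:]:
        # Keep only the numbers that extend a sequence still alive.
        live = {c for c in current if c - diff in live}
    return live


forward, backward = find_links([1, 5, 8, 11], [2, 6, 7, 10])
ends = chain_ends([[1, 5, 7], [9, 2, 11], [24, 3, 15]])
```

For the "example with all" arrays from the question, this reports 1 and 2, 5 and 6 as sequential and 8 and 7, 11 and 10 as reverse sequential; the three-array example ends its only full sequence at 3.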
The brute force approach outlined in your pseudocode will be O(c^n) (exponential), where c is the average number of elements per array and n is the number of total arrays.
If the input space is sparse (meaning that, on average, more numbers are missing than present), then one way to speed up this process is to first create a single sorted set of all the unique numbers from all your different arrays. This "master" set will then allow you to exit early (i.e. break statements in your loops) on any sequences which are not viable.
For example, if we have input arrays [1, 5, 7] and [6, 3, 8] and [9, 11, 2], the master ordered set would be {1, 2, 3, 5, 6, 7, 8, 9, 11}. If we are looking for n+1 type sequences, we could skip checking any sequence that contains a 3 or 9 or 11 (because the n+1 value is not present at the next index in the sorted set). While the speedups are not drastic in this particular example, if you have hundreds of input arrays and a very large range of values for n (sparsity), then the speedups should be exponential because you will be able to exit early on many permutations. If the input space is not sparse (such as in this example, where we didn't have many holes), the speedups will be less than exponential.
A further improvement would be to store a "master" set of key-value pairs, where the key is the n value as shown in the example above, and the value portion of the pair is a list of the indices of any arrays that contain that value. The master set of the previous example would then be: {[1, 0], [2, 2], [3, 1], [5, 0], [6, 1], [7, 0], [8, 1], [9, 2], [11, 2]}. With this architecture, scan time could potentially be as low as O(c*n), because you could just traverse this single sorted master set looking for valid sequences instead of looping over all the sub-arrays. By also requiring the array indexes to increment, you can clearly see that the 1->2 sequence can be skipped because the arrays are not in the correct order, and the same with the 2->3 sequence, etc. Note this toy example is somewhat oversimplified because in practice you would need a list of indices for the value portions of the key-value pairs. This would be necessary if the same value of n ever appeared in multiple arrays (duplicate values).
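The key-value "master" structure might look like the sketch below. It already stores a list of array indices per value, as the note above says a real implementation would need. The function names are hypothetical, and "valid step" is taken to mean that n appears in array i and n+1 appears in array i+1, which is one reading of the requirement that array indices increment.

```python
from collections import defaultdict


def build_master(arrays):
    """Map each value to the indices of the arrays containing it, sorted by value."""
    master = defaultdict(list)
    for index, array in enumerate(arrays):
        for value in array:
            master[value].append(index)
    return dict(sorted(master.items()))


def sequential_steps(master):
    """Return (n, n + 1, i) for every n in array i with n + 1 in array i + 1."""
    steps = []
    for value, indices in master.items():
        next_indices = master.get(value + 1, [])
        for index in indices:
            if index + 1 in next_indices:
                steps.append((value, value + 1, index))
    return steps


master = build_master([[1, 5, 7], [6, 3, 8], [9, 11, 2]])
steps = sequential_steps(master)
```

For the example arrays, the 1->2 and 2->3 steps are rejected because the array indices do not increase by one, which leaves 5->6, 7->8 and 8->9.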
