How to get the indices of a subarray in a large array

I have two arrays as follows:
a = [1,2,3,4,5,6,7,8,9,10]
b = [3,5,8,10,11]
For each number in b that is present in a, I want to find its index in the main array. The expected output is:
res = [2,4,7,9]
I have done it as follows:
res_array = []
[3,5,8,10,11].each do |element|
  index = [1,2,3,4,5,6,7,8,9,10].find_index(element)
  res_array << index unless index.nil?
end
res_array
But I think there is a better approach to do this.

If performance matters (i.e. if your arrays are huge), you can build a hash of all number-index pairs in a, using each_with_index and to_h:
a.each_with_index.to_h
#=> {1=>0, 2=>1, 3=>2, 4=>3, 5=>4, 6=>5, 7=>6, 8=>7, 9=>8, 10=>9}
A hash allows fetching the values (i.e. indices) for the numbers in b much faster (as opposed to traversing an array each time), e.g. via values_at:
a.each_with_index.to_h.values_at(*b)
#=> [2, 4, 7, 9, nil]
Use compact to eliminate nil values:
a.each_with_index.to_h.values_at(*b).compact
#=> [2, 4, 7, 9]
or alternatively slice and values:
a.each_with_index.to_h.slice(*b).values
#=> [2, 4, 7, 9]

b.map { |e| a.index(e) }.compact
#⇒ [2, 4, 7, 9]
or, more concisely:
b.map(&a.method(:index)).compact

Here is another simple solution:
indxs = a.each_with_index.to_h
(a & b).map { |e| indxs[e] }
(Note that a & b preserves the order of a, so the indices come out ordered by their position in a rather than by the order of b; for the example above, the two coincide.)

All the answers so far traverse all of a once (@Stefan's) or traverse all or part of a b.size times. My answer traverses part or all of a once. It is relatively efficient when a is large, b is small relative to a, and all elements in b appear in a.
My solution is particularly efficient when a is ordered in such a way that the elements of b typically appear towards the beginning of a. For example, a might be a list of last names sorted by decreasing frequency of occurrence in the general population (e.g., ['smith', 'jones',...]) and b is a list of names to look up in a.
a and b may contain duplicates¹ and not all elements of b are guaranteed to be in a. I assume b is not empty.
Code
require 'set'

def lookup_index(a, b)
  b_set = b.to_set
  b_hash = {}
  a.each_with_index do |n, i|
    next unless b_set.include?(n)
    b_hash[n] = i
    b_set.delete(n)
    break if b_set.empty?
  end
  b_hash.values_at(*b)
end
I converted b to a set to make lookups comparable in speed to hash lookups (which should not be surprising considering that sets are implemented with an underlying hash). Hash lookups are very fast, of course.
Examples
a = [1,2,3,4,5,6,7,8,9,10,8]
b = [3,5,8,10,11,5]
Note that in this example both a and b contain duplicates and 11 in b is not present in a.
lookup_index(a, b)
#=> [2, 4, 7, 9, nil, 4]
Observe the array returned contains the index 4 twice, once for each 5 in b. Also, the array contains nil at index 4 to show that it is b[4] #=> 11 that does not appear in a. Without the nil placeholder there would be no means to map the elements of b to indices in a. If, however, the nil placeholder is not desired, one may replace b_hash.values_at(*b) with b_hash.values_at(*b).compact, or, if duplicates are unwanted, with b_hash.values_at(*b).compact.uniq.
As a second example, suppose we are given the following.
a = [*1..10_000]
b = 10.times.map { rand(100) }.shuffle
#=> [30, 62, 36, 24, 41, 27, 83, 61, 15, 55]
lookup_index(a, b)
#=> [29, 61, 35, 23, 40, 26, 82, 60, 14, 54]
Here the solution was found after the first 83 elements of a were enumerated.
¹ My solution would be no more efficient if duplicates were not permitted in a and/or b.

Related

Find missing random amount of numbers in array with duplicates

I should have a complete array of numeric identifiers like this one:
a = [3, 4, 5, 6, 7, 8, 9, 10]
But instead, I have a messed-up array in random order, with duplicates and missing numbers, like this one:
b = [4, 9, 7, 7, 3, 3]
Is there a more optimal way to find out which numbers are missing, apart from subtracting the array without duplicates?
a - b.uniq
(a - b).empty?
works, but, depending on the data, it may not be the fastest way of determining if a contains an element not in b. For example, if the probability were high that every element of a was not in b, it might be faster, on average, to check whether a[0] is in b, then (if it is) whether a[1] is in b, and so on, stopping if and when an element is found that is not in b. But again, that depends on the data, in particular the likelihood that (a - b).empty? is true. If that likelihood is great, Array#-, which is written in C, would be relatively fast and probably the best choice.
On the other hand, if it's all but certain that a will contain many elements that are not in b, it may be faster to do something like the following:
require 'set'
b_set = b.to_set
#=> #<Set: {4, 9, 7, 3}>
a.all? { |n| b_set.include?(n) }
#=> false (stops at 5, the first element of a not in b_set)
In any event, you might first perform a cheap test:
b.size < a.size
If that is true there certainly will be at least one element of a that is not in b (assuming that a contains no duplicates).
Ruby 2.6 introduced Array#difference which seems perfect here:
a = [3, 4, 5, 6, 7, 8, 9, 10]
b = [4, 9, 7, 7, 3, 3]
a.difference(b)
# => [5, 6, 8, 10]
Seems handy for this, with the added benefit of being very readable.

Sort the array with reference to another array

I have two different arrays. Let's say:
a = [1, 2, 13, 4, 10, 11, 43]
b = [44, 23, 1, 4, 10, 2, 55, 13]
Now I have to sort the array b by referring to the array a. I tried the following solution:
lookup = {}
a.each_with_index do |item, index|
  lookup[item] = index
end

b.sort_by do |item|
  lookup.fetch(item)
end
But I'm getting the KeyError: key not found: 44 error. Can anyone help me find a solution?
Expected output is [1, 2, 13, 4, 10, 23, 44, 55].
Comparing arrays compares the first values; if they are equal, it moves on to the second values, and so on. Hence this will sort by the order of occurrence in a, and then by the actual value for the elements not in a:
b.sort_by { |e| [a.index(e) || a.size, e] }
To keep it O(n log n), you could precompute the indices:
ai = a.each_with_index.to_h
b.sort_by { |e| [ai[e] || a.size, e] }

How can I create a function that combines list/array rows/columns/elements in arbitrary sized array/list?

Afternoon. I'm currently trying to create a function (or functions) that, when given an array or list and a specified selection of columns/rows/elements, removes the specified columns/rows/etc. and concatenates them into an array/list, much in this fashion (but for arbitrarily sized objects that may or may not be pretty big):
a = [1 2 3      b = ['a', 'b', 'c'
     4 5 6           'd', 'e', 'f'
     7 8 9]          'g', 'h', 'i']
Now, let's say I want the first and third columns. Then this would look like:
a' = [1 3       b' = ['a', 'c'
      4 6             'd', 'f'
      7 9]            'g', 'i']
I'm familiar with slicing indices and extracting them using numpy, so I guess where I'm really hung up is creating some object (a list or array of arrays/lists?) that contains the chosen columns/whatever (above I chose the first and third columns, as you can see), and then iterating over that object to create a concatenated/combined list of what I've specified (i.e., if I'm given an array with 127 variables and I want to extract an arbitrary number of arbitrary columns at a given time).
Thanks for taking a look. Let me know how to update the OP if anything is unclear.
How is this different from advanced indexing?
In [324]: A = np.arange(12).reshape(2,6)
In [325]: A
Out[325]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
In [326]: A[:,[1,2,4]]
Out[326]:
array([[ 1,  2,  4],
       [ 7,  8, 10]])
To select both rows and columns you have to pay attention to index broadcasting:
In [327]: A = np.arange(24).reshape(4,6)
In [328]: A[[[1],[3]], [1,2,4]]   # row indices as a column vector, column indices as a row
Out[328]:
array([[ 7,  8, 10],
       [19, 20, 22]])
In [329]: A[np.ix_([1,3], [1,2,4])] # easier with ix_()
Out[329]:
array([[ 7,  8, 10],
       [19, 20, 22]])
https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#purely-integer-array-indexing
The index arrays/lists can be assigned to variables; the input to the A indexing can be a single tuple.
In [330]: idx = [[1,3],[1,2,4]]
In [331]: idx1 = np.ix_(*idx)
In [332]: idx1
Out[332]:
(array([[1],
        [3]]), array([[1, 2, 4]]))
In [333]: A[idx1]
Out[333]:
array([[ 7,  8, 10],
       [19, 20, 22]])
And to expand a set of slices and indices into a single array, np.r_ is handy (though not magical):
In [335]: np.r_[slice(0,5),7,6, 3:6]
Out[335]: array([0, 1, 2, 3, 4, 7, 6, 3, 4, 5])
There are other indexing tools: utilities in indexing_tricks, and functions like np.delete and np.take, illustrated briefly below.
Try np.source(np.delete) to see how it handles general-purpose deletion.
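For instance, a minimal sketch of those two functions (the array A and the column choices here are mine, not from the question):
import numpy as np

A = np.arange(12).reshape(3, 4)

# np.take gathers the listed columns; np.delete drops its listed columns.
print(np.take(A, [0, 2], axis=1))    # keep columns 0 and 2
print(np.delete(A, [1, 3], axis=1))  # drop columns 1 and 3 (same result here)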
You could use a double list comprehension
>>> def select(arr, rows, cols):
... return [[el for j, el in enumerate(row) if j in cols] for i, row in enumerate(arr) if i in rows]
...
>>> select([[1,2,3,4],[5,6,7,8],[9,10,11,12]],(0,2),(1,3))
[[2, 4], [10, 12]]
>>>
Please note that, independent of the order of indices in rows and cols, select doesn't reorder the rows and columns of the input. Note also that using the same index repeatedly in either rows or cols does not give you duplicated rows or columns. Finally, note that select works only for lists of lists.
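For example (a hypothetical call of my own illustrating those notes; scrambled and duplicated indices give the same result as before):
>>> select([[1,2,3,4],[5,6,7,8],[9,10,11,12]],(2,0),(3,1,1))
[[2, 4], [10, 12]]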
That said, I'd advise you to use numpy instead; it's hugely more flexible and far more efficient.
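For comparison, here is a minimal numpy version of select (the name select_np is mine; it uses np.ix_ as shown in the previous answer, and, unlike select, it does honor the order and duplicates of rows and cols):
import numpy as np

def select_np(arr, rows, cols):
    # np.ix_ builds an open mesh, so rows and cols pick out a submatrix
    return np.asarray(arr)[np.ix_(rows, cols)]

print(select_np([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], (0, 2), (1, 3)))
#=> [[ 2  4]
#    [10 12]]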

Efficient way of finding sequential numbers across multiple arrays?

I'm not looking for any code or having anything being done for me. I need some help to get started in the right direction but do not know how to go about it. If someone could provide some resources on how to go about solving these problems I would very much appreciate it. I've sat with my notebook and am having trouble designing an algorithm that can do what I'm trying to do.
I can probably do:
foreach element in array1
  foreach element in array2
    check if array1[i] == array2[j] + x
I believe this would work for both forward and backward sequences; for the multiples, just check array1[i] % array2[j] == 0. I have a list which contains int arrays and am getting list[index] (for array1) and list[index+1] (for array2), but this solution can get complex and lengthy fast, especially with large arrays and a large list of those arrays. Thus, I'm searching for a better solution.
I'm trying to come up with an algorithm for finding sequential numbers in different arrays.
For example:
[1, 5, 7] and [9, 2, 11] would find that 1 and 2 are sequential.
This should also work for multiple sequences in multiple arrays. So if there is a third array of [24, 3, 15], it will also include 3 in that sequence, and continue on to the next array until there isn't a number that matches the last sequential element + 1.
It also should be able to find more than one sequence between arrays.
For example:
[1, 5, 7] and [6, 3, 8] would find that 5 and 6 are sequential and also 7 and 8 are sequential.
I'm also interested in finding reverse sequences.
For example:
[1, 5, 7] and [9, 4, 11] would return that 5 and 4 are reverse sequential.
Example with all:
[1, 5, 8, 11] and [2, 6, 7, 10] would return 1 and 2 are sequential, 5 and 6 are sequential, 8 and 7 are reverse sequential, 11 and 10 are reverse sequential.
It can also overlap:
[1, 5, 7, 9] and [2, 6, 11, 13] would return 1 and 2 sequential, 5 and 6 sequential and also 7 and 6 reverse sequential.
I also want to expand this to check numbers with a difference of x (above examples check with a difference of 1).
In addition to all of that (although this might be a different question), I also want to check for multiples,
Example:
[5, 7, 9] and [10, 27, 8] would return 5 and 10 as multiples, 9 and 27 as multiples.
and numbers with the same ones place.
Example:
[3, 5, 7] and [13, 23, 25] would return that 3, 13, and 23 have the same ones digit.
Use a dictionary (set or hashmap)
dictionary1 = {}
Go through each item in the first array and add it to the dictionary.
[1, 5, 7]
Now dictionary1 = {1:true, 5:true, 7:true}
dictionary2 = {}
Now go through each item in [6, 3, 8] and lookup if it's part of a sequence.
6 is part of a sequence because dictionary1[6+1] == true
so dictionary2[6] = true
We get dictionary2 = {6:true, 8:true}
Now set dictionary1 = dictionary2 and dictionary2 = {}, and go to the third array, and so on.
We only keep track of sequences.
Since each lookup is O(1) and we do 2 lookups per number (e.g. 6-1 and 6+1), the total is N * O(1), which is O(N) (where N is the total count of numbers across all the arrays).
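Here is a minimal Python sketch of that scan (the helper name find_sequences and the step parameter are my additions; the answer itself is language-agnostic):
def find_sequences(arrays, step=1):
    pairs = []                    # (value in previous array, value in current array)
    prev = set(arrays[0])         # plays the role of dictionary1
    for arr in arrays[1:]:
        curr = set()              # plays the role of dictionary2
        for n in arr:
            if n - step in prev:  # forward sequence: previous array holds n - step
                pairs.append((n - step, n))
                curr.add(n)
            if n + step in prev:  # reverse sequence: previous array holds n + step
                pairs.append((n + step, n))
                curr.add(n)
        prev = curr               # we only keep track of sequences
    return pairs

print(find_sequences([[1, 5, 7], [6, 3, 8]]))
#=> [(5, 6), (7, 6), (7, 8)]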
The brute force approach outlined in your pseudocode will be O(c^n) (exponential), where c is the average number of elements per array and n is the total number of arrays.
If the input space is sparse (meaning there will be more missing numbers on average than present numbers), then one way to speed up this process is to first create a single sorted set of all the unique numbers from all your different arrays. This "master" set will then allow you to early-exit (i.e. break statements in your loops) on any sequences which are not viable.
For example, if we have input arrays [1, 5, 7] and [6, 3, 8] and [9, 11, 2], the master ordered set would be {1, 2, 3, 5, 6, 7, 8, 9, 11}. If we are looking for n+1 type sequences, we could skip checking any sequence that contains a 3, 9, or 11 (because the n+1 value is not present at the next position in the sorted set). While the speedups are not drastic in this particular example, if you have hundreds of input arrays and a very large range of values for n (sparsity), the speedups should be exponential, because you will be able to early-exit on many permutations. If the input space is not sparse (such as in this example, where we didn't have many holes), the speedups will be less than exponential.
A further improvement would be to store a "master" set of key-value pairs, where the key is the n value as shown in the example above, and the value portion of the pair is a list of the indices of the arrays that contain that value. The master set of the previous example would then be: {[1, 0], [2, 2], [3, 1], [5, 0], [6, 1], [7, 0], [8, 1], [9, 2], [11, 2]}. With this architecture, scan time could potentially be as low as O(c*n), because you could just traverse this single sorted master set looking for valid sequences instead of looping over all the sub-arrays. By also requiring the array indices to increment, you can clearly see that the 1->2 sequence can be skipped because the arrays are not in the correct order, and the same with the 2->3 sequence, etc. Note this toy example is somewhat oversimplified, because in practice you would need a list of indices for the value portion of each pair; this is necessary whenever the same value of n appears in multiple arrays (duplicate values). A rough sketch follows.
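Here is a rough Python sketch of that master map, under my reading of the answer (the variable names and the viability check are mine):
from collections import defaultdict

arrays = [[1, 5, 7], [6, 3, 8], [9, 11, 2]]

# Map each value to the indices of the arrays that contain it.
master = defaultdict(list)
for i, arr in enumerate(arrays):
    for n in arr:
        master[n].append(i)

print(sorted(master.items()))
#=> [(1, [0]), (2, [2]), (3, [1]), (5, [0]), (6, [1]), (7, [0]), (8, [1]), (9, [2]), (11, [2])]

# An n -> n+1 sequence is viable only if n+1 appears in the array
# immediately after one that holds n.
def viable(n):
    return n + 1 in master and any(j == i + 1 for i in master[n] for j in master[n + 1])

print([n for n in sorted(master) if viable(n)])
#=> [5, 7, 8]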

How to extract lines in an array, which contain a certain value? (numpy, scipy)

I have a numpy 2D array and I want it to return the values in column c for the rows where (r, c-1) (row r, column c-1) equals a certain value (int n).
I don't want to iterate over the rows writing something like
for r in range(len(array)):
    if array[r, c-1] == 1:
        store array[r, c]
because there are 4000 of them and this 2D array is just one of 20 I have to look through.
I found "filter" but don't know how to use it (Found no doc).
Is there a function that provides such a search?
I hope I understood your question correctly. Let's say you have an array a
import numpy as np

a = np.array(list(range(7)) * 3).reshape(7, 3)
a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 0, 1],
       [2, 3, 4],
       [5, 6, 0],
       [1, 2, 3],
       [4, 5, 6]])
and you want to extract all lines where the first entry is 2. This can be done like this:
a[a[:, 0] == 2]
array([[2, 3, 4]])
a[:,0] denotes the first column of the array, == 2 returns a Boolean array marking the entries that match, and then we use advanced indexing to extract the respective rows.
Of course, NumPy needs to iterate over all entries, but this will be much faster than doing it in Python.
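Applied to the question itself, you can take the values in column c for the rows where column c-1 equals n (c and n are the question's names; the concrete values here are mine):
c, n = 2, 1
a[a[:, c-1] == n, c]
array([2])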
Numpy arrays are not indexed. If you need to perform this specific operation more efficiently than linear in the array size, then you need to use something other than numpy.
