Build array from other array and table of values (Python)

I have a table of values stored into a list of lists like:
A = [[a[1], b[1], c[1]],
     [a[2], b[2], c[2]],
     ...
     [a[m], b[m], c[m]]]
with
a[i] < b[i]
b[i] < a[i+1]
0 < c[i] < 1
and a numpy array such as:
X = [x[1], x[2], ..., x[n]]
I need to create an array
Y = [y[1], y[2], ..., y[n]]
where each value of Y will correspond to
for i in [1, 2, ..., n]:
    for k in [1, 2, ..., m]:
        if a[k] < x[i] < b[k]:
            y[i] = c[k]
        else:
            y[i] = 1
Please note that X and Y have the same length, but A is totally different. Y can take any value from the third column of A (c[k] for k = 1, 2, ..., m), as long as a[k] < x[i] < b[k] is met (for k = 1, 2, ..., m and i = 1, 2, ..., n).
In the actual cases I am working on, n = 6789 and m = 6172.
I could do the verification using nested for loops, but it is really slow. What is the fastest way to accomplish this? What if X and Y were 2D numpy arrays?
SAMPLE DATA:
a = [10, 20, 30, 40, 50, 60, 70, 80, 90]
b = [11, 21, 31, 41, 51, 61, 71, 81, 91]
c = [ 0.917, 0.572, 0.993 , 0.131, 0.44, 0.252 , 0.005, 0.375, 0.341]
A = [[d, e, f] for d, e, f in zip(a, b, c)]
X = [1, 4, 10.2, 20.5, 25, 32, 41.3, 50.5, 73]
EXPECTED RESULTS:
Y = [1, 1, 0.917, 0.572, 1, 1, 1, 0.44, 1]
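For reference, a direct Python translation of the loop above (the slow nested-loop baseline the question wants to replace; build_y_naive is a hypothetical name, handy for validating the faster approaches below against the sample data):
import numpy as np

# slow nested-loop baseline, O(n*m), for validating faster approaches
def build_y_naive(A, X):
    Y = np.ones(len(X))
    for i, x in enumerate(X):
        for a_k, b_k, c_k in A:
            if a_k < x < b_k:
                Y[i] = c_k
                break  # intervals are disjoint, so at most one match
    return Y

a = [10, 20, 30, 40, 50, 60, 70, 80, 90]
b = [11, 21, 31, 41, 51, 61, 71, 81, 91]
c = [0.917, 0.572, 0.993, 0.131, 0.44, 0.252, 0.005, 0.375, 0.341]
A = [[d, e, f] for d, e, f in zip(a, b, c)]
X = [1, 4, 10.2, 20.5, 25, 32, 41.3, 50.5, 73]
print(build_y_naive(A, X))
# [1.    1.    0.917 0.572 1.    1.    1.    0.44  1.   ]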

Approach #1: Using brute-force comparison with broadcasting -
import numpy as np
# Convert to numpy arrays
A_arr = np.array(A)
X_arr = np.array(X)
# Mask that represents "if a[k] < x[i] < b[k]:" for all i,k
mask = (A_arr[:,None,0]<X_arr) & (X_arr<A_arr[:,None,1])
# Get row (R) and column (C) indices where the mask is True, i.e. the
# conditionals were satisfied; R indexes rows of A, C indexes elements of X
R,C = np.where(mask)
# Setup output numpy array and set values in it from the third column of A
# for the positions whose conditionals were satisfied
Y = np.ones(len(X))
Y[C] = A_arr[R,2]
Approach #2: Based on binning with np.searchsorted -
import numpy as np
# Convert A to 2D numpy array
A_arr = np.asarray(A)
# Setup intervals for binning later on
intv = A_arr[:,:2].ravel()
# Perform binning & get interval & grouped indices for each X
intv_idx = np.searchsorted(intv, X, side='right')
grp_intv_idx = np.floor(intv_idx/2).astype(int)
# Get mask of valid indices, i.e. X elements are within grouped intervals
mask = np.fmod(intv_idx,2)==1
# Setup output array
Y = np.ones(len(X))
# Extract col-3 elements with grouped indices and valid ones from mask
Y[mask] = A_arr[:,2][grp_intv_idx[mask]]
# Remove (set to 1's) elements that fall exactly on bin boundaries
Y[np.in1d(X,intv)] = 1
Please note that if you need the output as a list, you can convert the numpy array to a list with a call like this - Y.tolist().
Sample run -
In [480]: A
Out[480]:
[[139.0, 355.0, 0.5047342078960846],
 [419.0, 476.0, 0.3593886192040009],
 [580.0, 733.0, 0.3137694021600973]]

In [481]: X
Out[481]: [555, 689, 387, 617, 151, 149, 452]

In [482]: Y
Out[482]:
array([ 1.        ,  0.3137694 ,  1.        ,  0.3137694 ,  0.50473421,
        0.50473421,  0.35938862])

With 1-d arrays, it's not too bad:
a,b,c = np.array(A).T
mask = (a<x) & (x<b)
y = np.ones_like(x)
y[mask] = c[mask]
If x and y are higher-dimensional, then your A matrix will also need to be bigger. The basic concept works the same, though.
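The question also asks about 2D X and Y. A minimal sketch, assuming it is acceptable to flatten X, apply the broadcasting comparison from Approach #1, and reshape back (map_intervals_2d is a hypothetical helper, not from either answer):
import numpy as np

# hypothetical helper: apply the interval table A to X of any shape
def map_intervals_2d(A, X2d):
    A_arr = np.asarray(A, dtype=float)
    x = np.asarray(X2d, dtype=float).ravel()   # flatten to 1D
    mask = (A_arr[:, None, 0] < x) & (x < A_arr[:, None, 1])
    R, C = np.where(mask)                      # R: interval rows, C: x positions
    y = np.ones_like(x)
    y[C] = A_arr[R, 2]
    return y.reshape(np.shape(X2d))            # restore the original shape

A = [[139, 355, 0.504], [419, 476, 0.359], [580, 733, 0.313]]
X2 = [[555, 689, 387], [617, 151, 149]]
print(map_intervals_2d(A, X2))
# [[1.    0.313 1.   ]
#  [0.313 0.504 0.504]]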

Related

How can I use numpy to more efficiently modify the recorded sizes of nested sub-arrays, where the modification is condition-dependent?

I have a working declustering algorithm that I would like to speed up using numpy. Given an array a, the consecutive differences diffa are obtained. Each consecutive difference is then checked against some threshold value t_c, which produces an array of 0's and 1's (False and True). Taking into account that diffa is one index smaller than a, the counting scheme is slightly modified. First, the size of each sub-array (cluster) of 0's and 1's is calculated as the array cl_size. If a sub-array contains 0's, then the cluster size is its original size plus one; if it contains 1's, then the cluster size is its original size minus one. Below is an example that I would like to adapt for a much larger dataset.
import numpy as np

thresh = 21
a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
diffa = np.diff(a)
print(diffa)
# >> [ 1  3  5 10 20 30  1  1  2 26 30 30 11 29  1]

def get_cluster_statistics(array, t_c, func_kw='all'):
    """ This function separates clusters of datapoints such that the number
    of clusters and the number of events in each cluster can be known. """
    # GET CONSECUTIVE DIFFERENCES
    ts_dif = np.diff(array)
    # GET BOOLEAN ARRAY MASK OF 0's AND 1's FOR TIMES ABOVE THRESHOLD T_C
    bool_mask = np.array(ts_dif > t_c) * 1
    # COPY BOOLEAN ARRAY MASK (DO NOT MODIFY ORIGINAL ARRAY)
    bm_arr = bool_mask[:]
    # SPLIT CLUSTERS INTO SUB-ARRAYS
    res = np.split(bm_arr, np.where(abs(np.diff(bm_arr)) != 0)[0] + 1)
    print(res)
    # >> [array([0, 0, 0, 0, 0]), array([1]), array([0, 0, 0]), array([1, 1, 1]), array([0]), array([1]), array([0])]
    # GET SIZE OF EACH SUB-ARRAY CLUSTER
    cl_size = np.array([res[idx].size for idx in range(len(res))])
    print(cl_size)
    # >> [5 1 3 3 1 1 1]
    # CHOOSE BETWEEN CHECKING ANY OR ALL VALUES OF SUB-ARRAYS (check timeit)
    func = dict(zip(['all', 'any'], [np.all, np.any]))[func_kw]
    # INITIALIZE EMPTY OUTPUT LIST
    ans = []
    # CHECK EACH SPLIT SUB-ARRAY IN RES
    for idx in range(len(res)):
        # print("res[%d] = %s" % (idx, res[idx]))
        if func(res[idx] == 1):
            term = [1 for jdx in range(cl_size[idx] - 1)]
            # cl_size[idx] = cl_size[idx] - 1
            ans.append(term)
        elif func(res[idx] == 0):
            # cl_size[idx] = cl_size[idx] + 1
            term = [cl_size[idx] + 1]
            ans.append(term)
    print(ans)
    # >> [[6], [], [4], [1, 1], [2], [], [2]]
    out = np.sum(ans)
    print(out)
    # >> [6, 4, 1, 1, 2, 2]

get_cluster_statistics(a, thresh, 'any')
After this, I apply collections.Counter to count the frequency of clusters of various sizes.
I am not sure how, but I think there is a numpy solution that is more efficient, specifically in the section of code under # CHECK EACH SPLIT SUB-ARRAY IN RES. Any help would be appreciated.
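One possible vectorized replacement, sketched below under the assumption that only the cluster-size frequencies matter for the Counter step (cluster_sizes_vectorized is a hypothetical name; it uses run-length encoding via np.flatnonzero instead of np.split, so the output order differs from ans but the frequencies are the same):
import numpy as np
from collections import Counter

def cluster_sizes_vectorized(array, t_c):
    # 0/1 flags for consecutive differences above the threshold
    flags = (np.diff(array) > t_c).astype(int)
    # positions where a run of equal flags ends and a new one starts
    change = np.flatnonzero(np.diff(flags) != 0) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [flags.size]))
    run_values = flags[starts]        # 0 or 1 for each run
    run_lengths = ends - starts
    # a run of 0's -> one cluster of size (length + 1);
    # a run of 1's -> (length - 1) clusters of size 1
    zero_clusters = run_lengths[run_values == 0] + 1
    n_ones = int((run_lengths[run_values == 1] - 1).sum())
    return np.concatenate((zero_clusters, np.ones(n_ones, dtype=int)))

a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
sizes = cluster_sizes_vectorized(a, 21)
print(sorted(sizes.tolist()))   # [1, 1, 2, 2, 4, 6] -- same multiset as the loop version
print(Counter(sizes.tolist()))  # frequencies match the loop version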

Maximize the minimum element

We have an array of N positive elements. We can perform M operations on this array. In each operation we have to select a (contiguous) subarray of length W and increase each of its elements by 1. Each element of the array can be increased at most K times.
We have to perform these operations such that the minimum element in the array is maximized.
1 <= N, W <= 10^5
1 <= M, K <= 10^5
Time limit: 1 sec
I can think of an O(n^2) solution, but it exceeds the time limit. Can somebody provide an O(nlogn) or better solution?
P.S. - This is an interview question.
It was asked in a Google interview and I solved it by using a sliding window, a heap, and increment-in-a-range logic. I will solve the problem in 3 parts:
Finding the minimum of every subarray of size W. This can be done in O(n) using a sliding window with a monotonic deque. The minimum of every window is inserted into a min-heap as 3 variables: [array_value, left_index, right_index].
Now, make an auxiliary array of size N initialised to 0. Perform a pop operation on the heap M times, and in each pop operation perform 3 tasks:
value, left_index, right_index = heap.pop() # theoretical function to pop minimum
Increment the value by 1,
increment the auxiliary array by 1 at left_index and decrement it by 1 at right_index + 1,
insert this entry into the heap again [with the incremented value and the same indexes].
After performing M operations, traverse the given array alongside the auxiliary array and add the cumulative sum up to index 'i' to the element at index 'i' in the array.
Return the minimum of the array.
Time Complexity
O(N) <- for the minimum element in every window + building the heap.
O(M*logN) <- extracting from and inserting into the heap.
O(N) <- for traversing to add the cumulative sum.
So overall it is O(N + M*logN + N), which is O(M*logN).
Space Complexity
O(N) <- extra array + heap.
A few things above can easily be optimised; for instance, when inserting values into the heap, only left_index needs to be stored, since right_index = left_index + k - 1.
My Code
from heapq import heappop, heappush
from collections import deque

def find_maximised_minimum(arr, n, m, k):
    """
    arr -> array, n -> size of array,
    m -> number of increment operations that can be performed,
    k -> window size
    """
    heap = []
    q = deque()
    # sliding window + heap building
    for i in range(k):
        while q and arr[q[-1]] > arr[i]:
            q.pop()
        q.append(i)
    for i in range(k, n):
        heappush(heap, [arr[q[0]], i - k, i - 1])
        while q and q[0] <= i - k:
            q.popleft()
        while q and arr[q[-1]] > arr[i]:
            q.pop()
        q.append(i)
    heappush(heap, [arr[q[0]], n - k, n - 1])
    # auxiliary array
    temp = [0 for i in range(n)]
    # performing M increment operations
    while m:
        top = heappop(heap)
        temp[top[1]] += 1
        try:
            temp[top[2] + 1] -= 1
        except IndexError:
            # when right_index is the last index, just ignore
            pass
        top[0] += 1
        heappush(heap, top)
        m -= 1
    # finding cumulative sum
    sumi = 0
    for i in range(n):
        sumi += temp[i]
        arr[i] += sumi
    print(min(arr))

if __name__ == '__main__':
    # find_maximised_minimum([1, 2, 3, 4, 5, 6], 6, 5, 2)
    # find_maximised_minimum([73, 77, 60, 100, 94, 24, 31], 7, 9, 1)
    # find_maximised_minimum([24, 41, 100, 70, 97, 89, 38, 68, 41, 93], 10, 6, 5)
    # find_maximised_minimum([88, 36, 72, 72, 37, 76, 83, 18, 76, 54], 10, 4, 3)
    find_maximised_minimum([98, 97, 23, 13, 27, 100, 75, 42], 8, 5, 1)
What if we kept a copy of the array sorted ascending, with each element pointing to its original index? Think about the order of priority when incrementing the elements. Also, does the final order of operations matter?
Once the lowest element reaches the next lowest element, what must then be incremented? And if we apply k operations to any one element, does it matter in which windows those increments were applied?
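A minimal illustration of that sorted-copy hint in Python, using the example array from the code above (just the bookkeeping, deliberately not a full solution):
arr = [98, 97, 23, 13, 27, 100, 75, 42]

# indices of arr sorted ascending by value: arr[order[0]] is the current
# minimum, and order tells us which element becomes the minimum next
order = sorted(range(len(arr)), key=arr.__getitem__)
print(order)                    # [3, 2, 4, 7, 6, 1, 0, 5]
print([arr[i] for i in order])  # [13, 23, 27, 42, 75, 97, 98, 100]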

Swap elements in array if the next is bigger than current

I want to change the order in arr if the next element is bigger than the current one.
How to modify the code so that it will work?
arr = [5, 22, 29, 39, 19, 51, 78, 96, 84]
i = 0
while (i < arr.size - 1)
  if arr[i].to_i < arr[i+1].to_i
    arr[i]
  elsif arr[i].to_i > arr[i + 1].to_i
    arr[i+1], arr[i] = arr[i], arr[i+1]
  end
  puts arr[i]
  i += 1
end
It returns: [5, 22, 29, 39, 19, 51, 78, 96, 84]
instead of the expected: [5, 19, 22, 29, 39, 51, 78, 84, 96]
You can use any of the sorting algorithms below, depending on the size of the array (n):
For Bubble Sort, time complexity is O(n^2)
For Merge Sort, time complexity is O(nlogn)
For Counting Sort, time complexity is O(n), but the numbers in the array must lie in a limited range (e.g. 0 up to 10^6)
Bubble Sort: It compares elements pairwise in each iteration and puts the maximum element last; in the second iteration it puts the second maximum element in the second-to-last position, and so on until the array is sorted.
Iterate (n-1) times [to find the (n-1) maximum numbers]
In each pass, iterate (n-idx-1) times, swapping a pair of numbers if the first number is greater than the next
If no swap happened in the inner loop, the array is already sorted, so break the outer loop
Ruby Code:
def bubble_sort(arr)
  n = arr.size
  (n-1).times do |idx|
    swapped = false
    (n-idx-1).times do |i|
      if arr[i] > arr[i+1]
        arr[i], arr[i+1] = arr[i+1], arr[i]
        swapped = true
      end
    end
    break unless swapped
  end
  arr
end

p bubble_sort([5, 22, 29, 39, 19, 51, 78, 96, 84])
Merge Sort: It uses a divide-and-conquer strategy, i.e. if you know that the two halves of an array are sorted, you can merge them into a sorted whole with a two-pointer strategy in O(n).
For instance,
# first half  : [4, 5, 7, 9]
# second half : [1, 2, 10, 15]
1. Take two pointers l and r assigned to the starting index of both halves, i.e. 0
2. Iterate over l and r up to their lengths to consume both arrays:
   if first_half[l] < second_half[r]
     put first_half[l] in result_array and increment the l pointer
   else
     put second_half[r] in result_array and increment the r pointer
This merge operation takes O(n) to combine the two halves into a sorted whole.
Now, if we divide the whole array into two halves recursively, we get a binary tree of height log(n), and each level takes O(n) to merge its subproblems (halves), resulting in O(nlogn) time complexity.
The base case: a single-element array is always sorted.
Ruby Code:
def merge(left_sorted, right_sorted)
  res = []
  left_size, right_size = left_sorted.size, right_sorted.size
  l = r = 0
  loop do
    break if r == right_size and l == left_size # break if both halves processed
    if r == right_size or (l < left_size and left_sorted[l] < right_sorted[r])
      res << left_sorted[l]; l += 1
    else
      res << right_sorted[r]; r += 1
    end
  end
  res
end

def merge_sort(arr)
  size = arr.size
  return arr if size <= 1 # base case
  mid = arr.size/2 - 1
  left_half, right_half = arr[0..mid], arr[mid+1..-1]
  left_sorted = merge_sort(left_half)
  right_sorted = merge_sort(right_half)
  return merge(left_sorted, right_sorted)
end

p merge_sort([5, 22, 29, 39, 19, 51, 78, 96, 84])
Counting Sort: It works in O(n) by counting each number's appearances, provided the numbers in the array lie in a limited range (e.g. 0..10^6).
Keep a count of each number of the array in count_array.
Iterate from the min_element to the max_element of the array, and append each element to the sorted_array as many times as it appeared (i.e. while its count > 0).
Ruby Code:
def counting_sort(arr)
  min, max = arr.min, arr.max
  count_arr = [0] * (max - min + 1) # initialize count_array with all 0s
  arr.each do |num|
    count_arr[num - min] += 1
  end
  res = []
  size = count_arr.size
  size.times do |i|
    count_arr[i].times do
      res << i + min
    end
  end
  res
end

p counting_sort([5, 22, 29, 39, 19, 51, 78, 96, 84])
Notice that as you sort you are rearranging the array. Don't modify it; use it as a reference and place the sorted items in a new array.
If you want to study algorithms, use C or C++.
def bubble_sort(array)
  sorted = array.dup
  i = 0
  l = sorted.length
  while i < (l - 1)
    j = 0
    while j < l - i - 1
      if sorted[j] > sorted[j + 1]
        tmp = sorted[j]
        sorted[j] = sorted[j + 1]
        sorted[j + 1] = tmp
      end
      j += 1
    end
    i += 1
  end
  sorted
end

puts bubble_sort([5, 22, 29, 39, 19, 51, 78, 96, 84])

Array sorting using presorted ranking

I'm building a decision tree algorithm. Sorting is very expensive in this algorithm because for every split I need to sort each column. So at the beginning, even before tree construction, I presort the variables: I create a matrix where for each column I save its ranking. Then, when I want to sort a variable at some split, I don't actually sort it but use the presorted ranking array. The problem is that I don't know how to do this in a space-efficient manner.
A naive solution is below. This is only for one variable (v) and one split (split_ind).
import numpy as np
v = np.array([60,70,50,10,20,0,90,80,30,40])
sortperm = v.argsort() #1 sortperm = array([5, 3, 4, 8, 9, 2, 0, 1, 7, 6])
rankperm = sortperm.argsort() #2 rankperm = array([6, 7, 5, 1, 2, 0, 9, 8, 3, 4])
split_ind = np.array([3,6,4,8,9]) # this is my split (random)
# split v and sortperm
v_split = v[split_ind] # v_split = array([10, 90, 20, 30, 40])
rankperm_split = rankperm[split_ind] # rankperm_split = array([1, 9, 2, 3, 4])
vsorted_dummy = np.ones(10)*-1 #3 allocate "empty" array[N]
vsorted_dummy[rankperm_split] = v_split
vsorted = vsorted_dummy[vsorted_dummy!=-1] # vsorted = array([ 10., 20., 30., 40., 90.])
Basically, I have 2 questions:
Is double sorting necessary to create the ranking array? (#1 and #2)
In line #3 I'm allocating array[N]. This is very inefficient in terms of space, because even if the split size n << N, I have to allocate the whole array. The problem here is how to calculate rankperm_split: in the example it comes out as [1,9,2,3,4], while it should really be [1,5,2,3,4]. This problem can be reformulated: I want to create a "dense" integer array that has a maximum gap of 1 and keeps the ranking of the array intact.
UPDATE
I think that second point is the key here. This problem can be redefined as
A[N] - array of size N
B[N] - array of size N
I want to transform array A to array B so that:
Ranking of the elements stays the same (for each pair i,j: if A[i] < A[j] then B[i] < B[j])
Array B has only elements from 1 to N, where each element is unique.
A few examples of this transformation:
[3,4,5] => [1,2,3]
[30,40,50] => [1,2,3]
[30,50,40] => [1,3,2]
[3,4,50] => [1,2,3]
A naive implementation (with sorting) can be defined like this (in Python):
def remap(a):
    a_ = sorted(a)
    b = [a_.index(e) + 1 for e in a]
    return b
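A sketch of a faster remap, assuming numpy is available (remap_fast is a hypothetical name): a single argsort plus a scatter assignment gives the dense ranks in O(n log n), avoiding both the double argsort of #1/#2 and the O(n^2) list.index() calls; applied to rankperm_split it yields the desired [1, 5, 2, 3, 4]:
import numpy as np

# hypothetical helper: dense ranks 1..N via one argsort + scatter assignment
def remap_fast(a):
    a = np.asarray(a)
    b = np.empty(len(a), dtype=int)
    # the element at sorted position r receives rank r+1
    b[np.argsort(a)] = np.arange(1, len(a) + 1)
    return b

print(remap_fast([30, 50, 40]))     # [1 3 2]
print(remap_fast([1, 9, 2, 3, 4]))  # [1 5 2 3 4] -> the dense rankperm_split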

matlab: efficient search a value within the array

I have an array of already ordered values (e.g. vec = [20, 54, 87, 233]). The array contains ~300 elements. I have a value which I need to search for in this array. A successful search is not only the exact value but anything within +/- 5 of it; for example, in this case values like 17 or 55 should also be considered as found. What is the most efficient way to do this? I use the check below, but I guess it does not take into account that my array is already ordered. In addition, in the non-empty case I have to check manually how distant the value was, because the find result alone does not give me that. This is not a big problem, since my "finds" are only 15% of cases.
bRes = find(vec >= Value-5 & vec <= Value+5);
if ~isempty(bRes)
    distGap = GetGapDetails(Value, vec);
    return;
end
Thanks!
Vadim
The best way to search for a value in a list that is already sorted is a binary search, which takes only O(log(n)) time. This is better than comparing the value with every item in the list, which costs O(n). As far as I know, Matlab does not have a function to do exactly this. As already mentioned by Natan, you can (a)buse the built-in function histc for this, which is written in C and presumably does a binary search.
function good = is_within_range(value, vector, threshold)
% check that vector is sorted, comment this out for speed
assert(all(diff(vector) > 0))
assert(threshold > 0)
% pad vector with +- inf for histc
vector = [-inf, vector, inf];
% find index of value in vector, so that vector(ind) <= value < vector(ind+1)
% abuse histc, ignore bincounts
[~, ind] = histc(value, vector);
% check if we are within +- threshold from a value in vector,
% either below or above
good = (value <= vector(ind) + threshold) | value >= (vector(ind+1) - threshold);
Some quick tests:
>> is_within_range(0, [10, 30, 80], 5)
ans = 0
>> is_within_range(4, [10, 30, 80], 5)
ans = 0
>> is_within_range(5, [10, 30, 80], 5)
ans = 1
>> is_within_range(10, [10, 30, 80], 5)
ans = 1
>> is_within_range(15, [10, 30, 80], 5)
ans = 1
>> is_within_range(16, [10, 30, 80], 5)
ans = 0
>> is_within_range(31, [10, 30, 80], 5)
ans = 1
>> is_within_range(36, [10, 30, 80], 5)
ans = 0
And as a bonus, this function is vectorized, so you can test more than one value at the same time:
>> is_within_range([0, 4, 5, 10, 15, 16, 31, 36], [10, 30, 80], 5)
ans =
0 0 1 1 1 0 1 0
This will be somewhat more efficient:
bRes = vec >= Value-5 & vec <= Value+5;
if any(bRes) ...
You are right that MATLAB will likely not take advantage of the fact that 'vec' is already sorted. You could write a binary search to zero in on the range of interest (that is, work in O(log(N)) time rather than O(N) time), but with only 300 elements in the array, I suspect your current implementation will hold up well.
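For illustration, a sketch of that binary-search idea using Python's bisect module (the same logic ports directly to a hand-written MATLAB binary search; is_within_tol is a hypothetical name):
from bisect import bisect_left

# hypothetical helper: does sorted vec contain an element within +/- tol
# of value? One binary search, O(log n).
def is_within_tol(vec, value, tol):
    i = bisect_left(vec, value - tol)  # first index with vec[i] >= value - tol
    return i < len(vec) and vec[i] <= value + tol

print(is_within_tol([20, 54, 87, 233], 17, 5))  # True  (20 is within 5 of 17)
print(is_within_tol([20, 54, 87, 233], 40, 5))  # False (nothing in [35, 45])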
Let's say your array is stored in the variable A and your value is v:
A(A > v+5 | A < v-5) = [];
