tensorflow creating mask of varied lengths - arrays

I have a tensor of lengths in tensorflow, let's say it looks like this:
[4, 3, 5, 2]
I wish to create a mask of 1s and 0s in which the number of 1s in each row corresponds to an entry of this tensor, padded with 0s to a total length of 8. I.e. I want to create this tensor:
[[1,1,1,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[1,1,1,1,1,0,0,0],
[1,1,0,0,0,0,0,0]
]
How might I do this?

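This can now be achieved with tf.sequence_mask. For the example above, tf.sequence_mask([4, 3, 5, 2], maxlen=8, dtype=tf.int32) returns the desired mask; see the TensorFlow documentation for details.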

This can be achieved using a variety of TensorFlow transformations:
# Make a 4 x 1 column vector of lengths so that it broadcasts against each row.
lengths = [4, 3, 5, 2]
lengths_transposed = tf.expand_dims(lengths, 1)
# Make a 1 x 8 row vector containing [0, 1, ..., 7].
# (Named range_ to avoid shadowing Python's built-in range.)
range_ = tf.range(0, 8, 1)
range_row = tf.expand_dims(range_, 0)
# Use the logical operations to create a 4 x 8 boolean mask via broadcasting.
mask = tf.less(range_row, lengths_transposed)
# Use tf.where (called tf.select before TensorFlow 1.0) to select between 1 or 0 for each value.
result = tf.where(mask, tf.ones([4, 8]), tf.zeros([4, 8]))
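The broadcasting comparison in this answer can be checked without TensorFlow. The following is a minimal NumPy sketch of the same idea, not the answerer's code; the function name length_mask is hypothetical and NumPy is swapped in purely for illustration.

```python
import numpy as np

def length_mask(lengths, max_len):
    """Return an int mask with lengths[i] leading ones in row i."""
    lengths = np.asarray(lengths)
    # Row vector of positions [0, 1, ..., max_len - 1].
    positions = np.arange(max_len)
    # Comparing a (1, max_len) row with a (n, 1) column broadcasts to (n, max_len).
    return (positions[None, :] < lengths[:, None]).astype(np.int32)

mask = length_mask([4, 3, 5, 2], 8)
print(mask)
```

Each row is 1 wherever the column index is smaller than that row's length, which is the same comparison tf.less performs above.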

I have a slightly shorter version than the previous answer. I'm not sure whether it is more efficient:
def mask(self, seq_length, max_seq_length):
    return tf.map_fn(
        lambda x: tf.pad(tf.ones([x], dtype=tf.int32), [[0, max_seq_length - x]]),
        seq_length)

Related

Determine indices of N number of non-zero minimum values in array

I have an array of x size and need to determine the indices of n of the smallest values. I found this link (I have need the N minimum (index) values in a numpy array) discussing how to get multiple minimum values but it doesn't work as well when my array has zeros in it.
For example:
x = [10, 12, 11, 9, 0, 1, 15, 4, 10]
n = 3
I need to find the indices of the 3 lowest non-zero values so the result would be
non_zero_min_ind = [5, 7, 3]
They don't need to be in any particular order. I am trying to do this in Python 3. Any help would be greatly appreciated.
Using numpy:
import numpy as np
y = np.argsort(x)
y[np.array(x)[y]!=0][:n]
array([5, 7, 3])
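As an alternative to a full sort, the zeros can be filtered out first and the n smallest remaining values picked with a partial sort. This is only a sketch under the assumption that "non-zero" means exactly "not equal to 0"; the function name n_smallest_nonzero_indices is hypothetical.

```python
import numpy as np

def n_smallest_nonzero_indices(x, n):
    """Indices (in no particular order) of the n smallest non-zero values of x."""
    x = np.asarray(x)
    # Positions of all non-zero entries in the original array.
    nonzero_idx = np.flatnonzero(x)
    if n >= nonzero_idx.size:
        return nonzero_idx
    # argpartition puts the n smallest candidates in the first n slots, unsorted.
    smallest = np.argpartition(x[nonzero_idx], n - 1)[:n]
    # Map positions within the filtered array back to original indices.
    return nonzero_idx[smallest]

x = [10, 12, 11, 9, 0, 1, 15, 4, 10]
idx = n_smallest_nonzero_indices(x, 3)
print(sorted(idx.tolist()))
```

The returned indices are not ordered by value, which matches the question's statement that order does not matter.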

How would you split a numpy array where the elements give the partition size?

For a numpy 1-D array such as:
In [1]: A = np.array([2,5,1,3,9,0,7,4,1,2,0,11])
In [2]: A
Out[2]: array([2,5,1,3,9,0,7,4,1,2,0,11])
I need to split the array by using the values as a sub-array length.
For the example array:
The first index has a value of 2, so I need the first split to occur at index 0 + 2, so it would result in ([2,5,1]).
Skip to index 3 (since indices 0-2 were gobbled up in step 1).
The value at index 3 = 3, so the second split would occur at index 3 + 3, and result in ([3,9,0,7]).
Skip to index 7
The value at index 7 = 4, so the third and final split would occur at index 7 + 4, and result in ([4,1,2,0,11])
I'm using this simple array as an example, because I think it will help in my actual use case, which is reading data from binary files (either as bytes or unsigned shorts). I'm guessing that numpy will be the fastest way to do it, but I could also use struct/bytearray/lists or whatever would be best.
I hope this makes sense. I had a hard time trying to figure out how best to word the question.
Here is an approach using standard Python lists and a while loop:
def custom_partition(arr):
    partitions = []
    i = 0
    while i < len(arr):
        # The value at position i says how many more elements belong to this partition.
        partition_size = arr[i]
        next_i = i + partition_size + 1
        partitions.append(arr[i:next_i])
        i = next_i
    return partitions
a = [2, 5, 1, 3, 9, 0, 7, 4, 1, 2, 0, 11]
b = custom_partition(a)
print(b)
Output:
[[2, 5, 1], [3, 9, 0, 7], [4, 1, 2, 0, 11]]

How can I use numpy to more efficiently modify the recorded sizes of nested sub-arrays, where the modification is condition-dependent?

I have a working declustering algorithm that I would like to speed up using numpy. Given an array a, the consecutive differences diffa are obtained. Each consecutive difference is then compared against a threshold value t_c, which produces an array of 0's and 1's (False and True) marking whether the difference exceeds t_c. Because diffa is one element shorter than a, the counting scheme is slightly modified. First, the size of each cluster (run) of 0's and 1's is calculated as the array cl_size. If the cluster contains 0's, then the size of the cluster is its original size plus one; if the cluster contains 1's, then the size of the cluster is its original size minus one. Below is an example that I would like to adapt for a much larger dataset.
import numpy as np
thresh = 21
a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
diffa = np.diff(a)
print(diffa)
>> [ 1 3 5 10 20 30 1 1 2 26 30 30 11 29 1]
def get_cluster_statistics(array, t_c, func_kw='all'):
    """ This function separates clusters of datapoints such that the number
    of clusters and the number of events in each cluster can be known. """
    # GET CONSECUTIVE DIFFERENCES
    ts_dif = np.diff(array)
    # GET BOOLEAN ARRAY MASK OF 0's AND 1's FOR TIMES ABOVE THRESHOLD T_C
    bool_mask = np.array(ts_dif > t_c) * 1
    # COPY BOOLEAN ARRAY MASK (DO NOT MODIFY ORIGINAL ARRAY)
    bm_arr = bool_mask[:]
    # SPLIT CLUSTERS INTO SUB-ARRAYS
    res = np.split(bm_arr, np.where(abs(np.diff(bm_arr)) != 0)[0] + 1)
    print(res)
    # >> [array([0, 0, 0, 0, 0]), array([1]), array([0, 0, 0]), array([1, 1, 1]), array([0]), array([1]), array([0])]
    # GET SIZE OF EACH SUB-ARRAY CLUSTER
    cl_size = np.array([res[idx].size for idx in range(len(res))])
    print(cl_size)
    # >> [5 1 3 3 1 1 1]
    # CHOOSE BETWEEN CHECKING ANY OR ALL VALUES OF SUB-ARRAYS (check timeit)
    func = dict(zip(['all', 'any'], [np.all, np.any]))[func_kw]
    # INITIALIZE EMPTY OUTPUT LIST
    ans = []
    # CHECK EACH SPLIT SUB-ARRAY IN RES
    for idx in range(len(res)):
        # print("res[%d] = %s" %(idx, res[idx]))
        if func(res[idx] == 1):
            term = [1 for jdx in range(cl_size[idx]-1)]
            # cl_size[idx] = cl_size[idx]-1
            ans.append(term)
        elif func(res[idx] == 0):
            # cl_size[idx] = cl_size[idx]+1
            term = [cl_size[idx]+1]
            ans.append(term)
    print(ans)
    # >> [[6], [], [4], [1, 1], [2], [], [2]]
    out = np.sum(ans)
    print(out)
    # >> [6, 4, 1, 1, 2, 2]
get_cluster_statistics(a, thresh, 'any')
After this, I apply Counter from the collections module to count the frequency of clusters of various sizes.
I am not sure how but I think there is a numpy solution that is more efficient, specifically in the section of code under # CHECK EACH SPLIT SUB-ARRAY IN RES. Any help would be appreciated.
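One possible vectorized replacement for the loop is sketched below. It is not from the original thread: it assumes the input has at least two elements, and the function name cluster_sizes is hypothetical. Run lengths are computed from the positions where the mask changes, and np.repeat then emits one entry of size + 1 for each run of 0's and size - 1 ones for each run of 1's, in the original order.

```python
import numpy as np

def cluster_sizes(a, t_c):
    """Vectorized sketch of the per-run loop; assumes len(a) >= 2."""
    mask = np.diff(a) > t_c
    # Index at which each run of identical mask values starts.
    change = np.flatnonzero(mask[1:] != mask[:-1]) + 1
    starts = np.concatenate(([0], change))
    # Length of each run (the last run ends at mask.size).
    lengths = np.diff(np.concatenate((starts, [mask.size])))
    above = mask[starts]  # True for runs of 1's (differences above t_c)
    # Runs of 0's contribute a single value (length + 1);
    # runs of 1's contribute (length - 1) ones.
    values = np.where(above, 1, lengths + 1)
    counts = np.where(above, lengths - 1, 1)
    return np.repeat(values, counts)

a = np.array([1, 2, 5, 10, 20, 40, 70, 71, 72, 74, 100, 130, 160, 171, 200, 201])
out = cluster_sizes(a, 21)
print(out)
```

The result can then be passed to collections.Counter, for example as Counter(out.tolist()).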

How can I pad values into array 1 with values from array 2 at indices from array 3?

I am trying to pad values to a numpy array. The array is initially filled with ones, and my goal is to overwrite the values of ones at specified indices with values from another array.
import numpy as np
# get initial array of ones
mask = np.ones(10)
# get values to overwrite ones at indices
values = [10, 30, 50.5]
# get indices for which values will replace ones
idx_pad = [1, 6, 7]
print(mask)
>> [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
What I want to get is:
>> [ 1 10 1 1 1 1 30 50.5 1 1 ]
I think there's a way to do this using an OrderedDict, though I'm still trying to figure it out. I'm also hopeful that there is a fast approach via numpy. I hope to apply this example to my actual dataset, for which len(idx_pad) = 10322 and len(mask) = 69268. Any help would be appreciated.
This is the solution via @Divakar.
import numpy as np
# get initial array of ones
mask = np.ones(10)
# get values to overwrite ones at indices
values = [10, 30, 50.5]
# get indices for which values will replace ones
idx_pad = [1, 6, 7]
print(mask)
>> [ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
# replace values at indices in idx_pad
mask[idx_pad] = values
print(mask)
>> [ 1. 10. 1. 1. 1. 1. 30. 50.5 1. 1. ]

Partition an array of numbers into sets by proximity

Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into that many parts?
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting the element with right and left numbers but that doesn't ensure the correct number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's Ruby code implementing Vikram's algorithm:
def partition(arr,clusters)
  # Return same array if clusters are less than zero or more than array size
  return arr if (clusters >= arr.size) || (clusters < 0)
  edges = {}
  # Get weights of edges
  arr.each_with_index do |a,i|
    break if i == (arr.length-1)
    edges[i] = a - arr[i+1]
  end
  # Sort edge weights in ascending order
  sorted_edges = edges.sort_by{|k,v| v}.collect{|k| k.first}
  # Maintain counter for joins happening.
  prev_edge = arr.size+1
  joins = 0
  sorted_edges.each do |edge|
    # If join is on right of previous, subtract the number of previous joins that happened on left
    if (edge > prev_edge)
      edge -= joins
    end
    joins += 1
    # Join the elements on the sides of edge.
    arr[edge] = arr[edge,2].flatten
    arr.delete_at(edge+1)
    prev_edge = edge
    # Get out when right clusters are done
    break if arr.size == clusters
  end
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these differences (this is easiest if you attach the index in the original array to each element of the difference array).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
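The steps above can be sketched in Python as follows. This is an illustrative reading of the answer, not the answerer's code: the function name partition_by_gaps is hypothetical, the input is assumed to be sorted, and ties between equal gaps are broken in favour of the earlier gap.

```python
def partition_by_gaps(values, k):
    """Split a sorted list into k parts by cutting at the k - 1 largest gaps."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    # Pair each adjacent difference with the index of its left element.
    gaps = [(abs(values[i] - values[i + 1]), i) for i in range(len(values) - 1)]
    # Take the k - 1 largest gaps (stable sort keeps earlier gaps first on ties),
    # then put the cut positions back into left-to-right order.
    largest = sorted(gaps, key=lambda gap: -gap[0])[:k - 1]
    cut_after = sorted(i for _, i in largest)
    parts, start = [], 0
    for i in cut_after:
        parts.append(values[start:i + 1])
        start = i + 1
    parts.append(values[start:])
    return parts

data = [37, 20, 16, 8, 5, 5, 3, 0]
print(partition_by_gaps(data, 3))
```

For k = 2 and k = 3 this reproduces the partitions requested in the question, and for k = 4 it reproduces the split shown in this answer.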
I think your problem can be solved with k-clustering using Kruskal's algorithm. Kruskal's algorithm can be used to find clusters such that the spacing between them is maximal.
Algorithm:
Construct a path graph from your data set as follows:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
The weight of each edge is then the absolute difference between the values at its endpoints:
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Run Kruskal's algorithm on this graph until only k clusters remain:
First sort the edges by weight in ascending order:
(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)
Run Kruskal's algorithm on them to find exactly k = 3 clusters:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
Stop, since clusters = 3.
The reconstructed solution, [(37), (20, 16), (8, 5, 5, 3, 0)], is what you wanted.
While @anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP-hard, there is a simple greedy solution.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough. Just pick from a set that has been "uniquified" (i.e. with duplicates removed).
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
The Wikipedia page on this is pretty good.
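The iteration described above might look like the following in one dimension. This is a rough sketch, not a reference implementation: the function name kmeans_1d is hypothetical, and instead of the random start the answer suggests, it assumes a deterministic start (k distinct values spread evenly over the sorted unique inputs) so that the result is reproducible.

```python
def kmeans_1d(values, k, max_iter=100):
    """Greedy 1-D k-means: assign to the nearest mean, recompute, repeat."""
    distinct = sorted(set(values))
    if not 1 <= k <= len(distinct):
        raise ValueError("k must be between 1 and the number of distinct values")
    # Assumption: deterministic initial means taken from the de-duplicated values.
    if k == 1:
        means = [distinct[0]]
    else:
        means = [distinct[j * (len(distinct) - 1) // (k - 1)] for j in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every element to its closest mean (first one wins on ties).
        new_assignment = [
            min(range(k), key=lambda j: abs(v - means[j])) for v in values
        ]
        if new_assignment == assignment:
            break  # no element moved, so the iteration has converged
        assignment = new_assignment
        # Recompute each mean from its current members.
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:  # keep the old mean if a set happens to become empty
                means[j] = sum(members) / len(members)
    return [[v for v, a in zip(values, assignment) if a == j] for j in range(k)]

data = [37, 20, 16, 8, 5, 5, 3, 0]
clusters = kmeans_1d(data, 2)
print(clusters)
```

Because k-means minimizes squared distances to the means rather than cutting at the largest gaps, its partitions can differ from the ones listed in the question; with this starting point, k = 2 groups 37 with 20.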
