Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into them.
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting each element from its neighbours, but that doesn't guarantee the requested number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's the Ruby code implementing Vikram's algorithm:
def partition(arr, clusters)
  # Return the array unchanged if the cluster count is out of range
  return arr if (clusters >= arr.size) || (clusters < 1)
  # Weight each edge by the gap between adjacent elements
  edges = {}
  arr.each_with_index do |a, i|
    break if i == (arr.length - 1)
    edges[i] = a - arr[i + 1]
  end
  # Sort edge indices by weight, smallest gap first
  sorted_edges = edges.sort_by { |_k, v| v }.map(&:first)
  # Remember the original positions already joined, so a later edge can be
  # shifted left by the number of joins that happened on its left
  joined = []
  sorted_edges.each do |edge|
    shifted = edge - joined.count { |j| j < edge }
    joined << edge
    # Join the elements on either side of the edge
    arr[shifted] = arr[shifted, 2].flatten
    arr.delete_at(shifted + 1)
    # Get out when the right number of clusters is reached
    break if arr.size == clusters
  end
  arr
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these given differences (attaching the index in the original array to each element of the difference array makes this step easy).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
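Here is a rough Python sketch of this approach (the function name partition_by_gaps is mine, just for illustration):
def partition_by_gaps(arr, k):
    # Differences between adjacent elements, tagged with their index
    gaps = [(arr[i] - arr[i + 1], i) for i in range(len(arr) - 1)]
    # Indices of the k-1 largest gaps mark the partition boundaries
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    bounds = [0] + [c + 1 for c in cuts] + [len(arr)]
    return [arr[a:b] for a, b in zip(bounds, bounds[1:])]

print(partition_by_gaps([37, 20, 16, 8, 5, 5, 3, 0], 4))
# [[37], [20], [16], [8, 5, 5, 3, 0]]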
I think your problem can be solved with k-clustering using Kruskal's algorithm. Kruskal's algorithm finds the clusters such that there is maximum spacing between them.
Algorithm:
Construct a path graph from your data set, like the following:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: - 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
The weight of each edge is the absolute difference between the values it connects:
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Run Kruskal's algorithm on this graph until only k clusters remain.
First, sort the edges by weight in ascending order:
(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)
Then join clusters until exactly k = 3 remain:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
stop as clusters = 3
Reconstructed solution: [(37), (20, 16), (8, 5, 5, 3, 0)], which is what you desired.
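This is not the answerer's code, but a rough Python sketch of the same idea using a small union-find structure (names are illustrative):
def kruskal_1d(arr, k):
    # parent[i] is the union-find parent of element i
    parent = list(range(len(arr)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges of the path graph, sorted by the gap between neighbours
    edges = sorted(range(len(arr) - 1), key=lambda i: abs(arr[i] - arr[i + 1]))
    clusters = len(arr)
    for i in edges:
        if clusters == k:
            break
        a, b = find(i), find(i + 1)
        if a != b:
            parent[b] = a
            clusters -= 1

    # Collect the elements of each cluster, in order
    groups = {}
    for i, v in enumerate(arr):
        groups.setdefault(find(i), []).append(v)
    return list(groups.values())

print(kruskal_1d([37, 20, 16, 8, 5, 5, 3, 0], 3))
# [[37], [20, 16], [8, 5, 5, 3, 0]]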
While anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP-hard, there is a simple greedy approximation.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough: just draw from the distinct values of the input.
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
The Wikipedia page on this is pretty good.
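For illustration only, here is a rough 1-d sketch of this iteration in Python (the function name and the random initialization are my own choices, not a canonical implementation):
import random

def kmeans_1d(xs, k):
    # Initial means: k distinct values picked at random
    means = random.sample(sorted(set(xs)), k)
    assign = None
    while True:
        # Assign each element to its closest mean
        new_assign = [min(range(k), key=lambda j: abs(x - means[j])) for x in xs]
        if new_assign == assign:
            return [[x for x, a in zip(xs, new_assign) if a == j] for j in range(k)]
        assign = new_assign
        # Recompute each mean from its current members (keep the old mean if empty)
        for j in range(k):
            members = [x for x, a in zip(xs, assign) if a == j]
            if members:
                means[j] = sum(members) / len(members)

print(kmeans_1d([37, 20, 16, 8, 5, 5, 3, 0], 3))
# e.g. [[37], [20, 16], [8, 5, 5, 3, 0]] (depends on the initialization)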
Related
I have a database of 10,000 vectors of integers ranging from 1 to 1,000. The length of each vector can be up to 1,000. For example, it can look like this:
vec1: 1 2 56 78
vec2: 23 34 35 36 37 38
vec3: 1 2 3 4 5 7
vec4: 2 3 4 6 100
...
vec10000: 13 234
Now, I want to store this database in a way that is fast in response to a particular type of request. Each request will come in the form of an integer vector, up to 10,000 long:
query: 1 2 3 4 5 7 56 78 100
The response should be the indices of the vectors that are subsets of this query string. For example, in the above list, only vec1 and vec3 are subsets of the query, so the response in this case should be
response: 1 3
This database is not going to change so you can preprocess it in any possible way. You may specify that queries come in any protocol as well, as long as the information is the same. For example, it can come as a sorted list or a boolean table.
What is the best strategy to encode the database and the query to achieve the highest response rate possible?
Since you are using Python, this method seems easy. (It is implementable in any other language too, but may need big-integer or modular arithmetic.)
So, for each number from 1-1000, assign a prime number to it:
1 => 2
2 => 3
3 => 5
4 => 7
...
...
25 => 97
...
...
1000 => 7919
For every vector, compute a hash: the product of the primes assigned to all values in the vector.
E.g. if your vector is vec-x = {1, 2, 5, 25}, then hash(vec-x) = 2 * 3 * 11 * 97.
Similarly, compute the hash of your query vector. Let its value be Q.
If Q % hash(vec-i) == 0, then vec-i is a subset of the query; otherwise it is not.
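A sketch of this scheme in Python, assuming sympy is available for generating primes (sympy.prime(n) returns the n-th prime) and that the vectors contain no repeated values:
from sympy import prime

# Map each possible value 1..1000 to a prime: 1 -> 2, 2 -> 3, ...
PRIMES = {v: prime(v) for v in range(1, 1001)}

def vec_hash(vec):
    h = 1
    for v in vec:
        h *= PRIMES[v]
    return h

vecs = [[1, 2, 56, 78], [23, 34, 35, 36, 37, 38], [1, 2, 3, 4, 5, 7],
        [2, 3, 4, 6, 100], [13, 234]]
hashes = [vec_hash(v) for v in vecs]

def query(q):
    qh = vec_hash(q)
    # vec-i is a subset of the query iff its product divides the query's product
    return [i + 1 for i, h in enumerate(hashes) if qh % h == 0]

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))  # [1, 3]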
What about just preprocessing your vector list into an indicator matrix and using matrix multiplication, something like:
import numpy as np

# generate 10000 random vectors with lengths in [0, 1000)
# and elements in [0, 1000)
vectors = [np.random.randint(1000, size=n)
           for n in np.random.randint(1000, size=10000)]

# generate the indicator matrix
database = np.zeros((10000, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

def query(ints):
    tmp = np.zeros(1000, dtype='int8')
    tmp[ints] = 1
    return np.where(database.dot(tmp) == lengths)[0]
The dot product of a database row and the transformed query will be equal to the number of elements of the row that are in the query. If this number is equal to total number of elements in the row, then we've found a subset. Note that this uses 0-based indexing.
Here's this revised for your example data
vectors = [[1, 2, 56, 78],
           [23, 34, 35, 36, 37, 38],
           [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100],
           [13, 234]]

database = np.zeros((5, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))
# [0, 2] (0-based indexing)
I have a tensor of lengths in tensorflow, let's say it looks like this:
[4, 3, 5, 2]
I wish to create a mask of 1s and 0s whose number of 1s corresponds to the entries of this tensor, padded by 0s to a total length of 8. I.e. I want to create this tensor:
[[1,1,1,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[1,1,1,1,1,0,0,0],
[1,1,0,0,0,0,0,0]
]
How might I do this?
This can now be achieved with tf.sequence_mask; see the TensorFlow documentation for more details.
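For example (the dtype argument is optional; the default is tf.bool):
import tensorflow as tf

mask = tf.sequence_mask([4, 3, 5, 2], maxlen=8, dtype=tf.int32)
# [[1 1 1 1 0 0 0 0]
#  [1 1 1 0 0 0 0 0]
#  [1 1 1 1 1 0 0 0]
#  [1 1 0 0 0 0 0 0]]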
This can be achieved using a variety of TensorFlow transformations:
lengths = [4, 3, 5, 2]

# Make a 4 x 1 column containing the lengths; broadcasting
# will expand it against the 1 x 8 row below.
lengths_transposed = tf.expand_dims(lengths, 1)

# Make a 1 x 8 row containing [0, 1, ..., 7].
range_row = tf.expand_dims(tf.range(0, 8, 1), 0)

# mask[i][j] is True wherever j < lengths[i].
mask = tf.less(range_row, lengths_transposed)

# Select 1 where the mask is True and 0 elsewhere.
# (tf.select in older TensorFlow versions; tf.where in current ones.)
result = tf.select(mask, tf.ones([4, 8]), tf.zeros([4, 8]))
Here's a slightly shorter version than the previous answer. Not sure if it is more efficient, though:
def mask(self, seq_length, max_seq_length):
    # For each length x, build x ones and pad with zeros up to max_seq_length
    return tf.map_fn(
        lambda x: tf.pad(tf.ones([x], dtype=tf.int32), [[0, max_seq_length - x]]),
        seq_length)
I have an array in which I want to replace values at a known set of indices with the value immediately preceding it. As an example, my array might be
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0];
and the indices of values to be replaced by previous values might be
y = [2, 3, 8];
I want this replacement to occur from left to right, or else start to finish. That is, the value at index 2 should be replaced by the value at index 1, before the value at index 3 is replaced by the value at index 2. The result using the arrays above should be
[1, 1, 1, 4, 5, 6, 7, 7, 9, 0]
However, if I use the obvious method to achieve this in Matlab, my result is
>> x(y) = x(y-1)
x =
1 1 2 4 5 6 7 7 9 0
Hopefully you can see that the right-hand side was evaluated before any assignment took place, so the value at index 3 was replaced by the original value at index 2, not the already-updated one.
My question is this: Is there some way of achieving my desired result in a simple way, without brute force looping over the arrays or doing something time consuming like reversing the arrays around?
Well, practically this is a loop, but the number of iterations equals the length of the longest run of consecutive indices in y:
while ~isequal(x(y), x(y-1))
    x(y) = x(y-1);
end
Using nancumsum you can build a fully vectorized version. Nevertheless, for most cases the solution karakfa provided is probably the one to prefer; only in extreme cases with long consecutive runs in y is this code faster.
c1=[0,diff(y)==1];
c1(c1==0)=nan;
shift=nancumsum(c1,2,4);
y(~isnan(shift))=y(~isnan(shift))-shift(~isnan(shift));
x(y)=x(y-1)
Given an array, output the consecutive elements whose total sum is 0.
Eg:
For input [2, 3, -3, 4, -4, 5, 6, -6, -5, 10],
Output is [3, -3, 4, -4, 5, 6, -6, -5]
I just can't find an optimal solution.
Clarification 1: For any element in the output subarray, there should be a subset within the subarray which sums with that element to zero.
Eg: For -5, at least one of the subsets {[2, 3], [1, 4], [5], ...} should be present in the output subarray.
Clarification 2: Output subarray should be all consecutive elements.
Here is a python solution that runs in O(n³):
def conSumZero(input):
    take = [False] * len(input)
    for i in range(len(input)):
        for j in range(i + 1, len(input) + 1):
            if sum(input[i:j]) == 0:
                for k in range(i, j):
                    take[k] = True
    return [x for x, t in zip(input, take) if t]
EDIT: Now more efficient! (Not sure if it's quite O(n²); will update once I finish calculating the complexity.)
import numpy

def conSumZero(input):
    take = [False] * len(input)
    # cs[j] - cs[i] equals sum(input[i:j])
    cs = numpy.concatenate(([0], numpy.cumsum(input)))
    for i in range(len(input)):
        for j in range(i + 1, len(input) + 1):
            if cs[j] - cs[i] == 0:
                for k in range(i, j):
                    take[k] = True
    return [x for x, t in zip(input, take) if t]
The difference here is that I precompute the partial sums of the sequence, and use them to calculate subsequence sums - since sum(a[i:j]) = sum(a[0:j]) - sum(a[0:i]) - rather than iterating each time.
Why not just hash the incremental sum totals and update their indexes as you traverse the array? The winner is the sum value spanning the largest index range. That is O(n) time complexity (assuming average hash table performance).
array:         2   3  -3   4  -4   5   6  -6  -5  10
prefix sum: 0  2   5   2   6   2   7  13   7   2  12
The winner is prefix sum 2, spanning indices 1 to 8!
To also guarantee an exact counterpart contiguous-subarray for each number in the output array, I don't yet see a way around checking/hashing all the sum subsequences in the candidate subarrays, which would raise the time complexity to O(n^2).
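Here is a rough Python sketch of the prefix-sum hashing idea (illustrative only; it returns the longest zero-sum contiguous window):
def longest_zero_sum(arr):
    first_seen = {0: -1}   # prefix sum -> earliest index it occurred at
    best = (0, 0)          # (start, end) of the best zero-sum window, end exclusive
    total = 0
    for i, v in enumerate(arr):
        total += v
        if total in first_seen:
            start = first_seen[total] + 1
            if i + 1 - start > best[1] - best[0]:
                best = (start, i + 1)
        else:
            first_seen[total] = i
    return arr[best[0]:best[1]]

print(longest_zero_sum([2, 3, -3, 4, -4, 5, 6, -6, -5, 10]))
# [3, -3, 4, -4, 5, 6, -6, -5]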
Based on the example, I assumed that you wanted to find only the pairs where 2 values together add up to 0. If you want to include groups that add up to 0 when more values are combined (like 5 + -2 + -3), then you would need to clarify your parameters a bit more.
The implementation is different based on language, but here is a javascript example that shows the algorithm, which you can implement in any language:
var inputArray = [2, 3, -3, 4, -4, 5, 6, -6, -5, 10];
var outputArray = [];
for (var i = 0; i < inputArray.length; i++) {
    var num1 = inputArray[i];
    // Start at i + 1 so each pair is only considered once
    for (var x = i + 1; x < inputArray.length; x++) {
        var num2 = inputArray[x];
        var sumVal = num1 + num2;
        if (sumVal == 0) {
            outputArray.push(num1);
            outputArray.push(num2);
        }
    }
}
Is this the problem you are trying to solve?
Given a sequence $a_1, \dots, a_n$, find a contiguous subsequence $a_i, \dots, a_{j-1}$ maximizing $j - i$ such that some subset of it sums to zero.
If so, here is an algorithm for solving it:
let $U \gets \emptyset$
for each contiguous range $S = [i, j) \subseteq [1, n]$
    for each $T \in \wp([i, j))$
        if $\sum_{n \in T} a_n = 0$ and $|U| < |S|$
            $U \gets S$
return $U$
I'm dealing with long daily time series in Matlab, running over periods of 30-100+ years. I've been meaning to start looking at it by seasons, roughly approximating that by taking 91-day segments of each year over the time period (with some tbd method of correcting for odd number of days in the year)
Basically, what I want is an array indexing method that allows me to make a new array that takes 91 elements every 365 elements, starting at element 1. I've been looking for some normal array methods (some (:) or other), but I haven't been able to find one. I guess an alternative would be to kind of iterate over 365-day segments 91 times, but that seems needlessly complicated.
Is there a simpler way that I've missed?
Thanks in advance for the help!
So if I understand correctly, you want to extract elements 1-91, 366-456, 731-821, and so on? I'm not sure that there is a way to do this with basic matrix indexing, but you can do the following:
days = 1:365; %Create array ranging from 1 - 365
difference = length(data) - 365; %how much bigger is time series data?
padded = padarray(days, [0, difference], 'circular', 'post'); %extend to fit time series (pad at the end only)
extracted = data(padded <= 91); %get every element in the range 1-91
Basically what I am doing is creating an array that is the same size as your time series data that repeats 1-365 over and over. I then perform logical indexing on data, such that the padded array is less than or equal to 91.
As a more approachable example, consider:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
days = 1:5;
difference = length(x) - 5;
padded = padarray(days, [0, difference], 'circular', 'post');
extracted = x(padded <= 2);
padded is then equal to [1, 2, 3, 4, 5, 1, 2, 3, 4, 5], and extracted will be [1, 2, 6, 7].