I have a strange feeling this is a very easy problem to solve but I'm not finding a good way of doing this without using brute force or dynamic programming. Here it goes:
Given N arrays of ordered and monotonic values, find the set of positions for each array i1, i2 ... in that minimises pair-wise difference of values at those indexes between all arrays. In other words, find the positions for all arrays whose values are closest to each other. Multiple solutions may exist and arrays may or may not be equally sized.
If A denotes the list of all arrays, the pair-wise difference is the sum of absolute differences between the values at the given indexes, taken over every pair of different arrays, like so:
$$d(i_1, \dots, i_N) = \sum_{1 \le j < k \le N} \left| A_j[i_j] - A_k[i_k] \right|$$
An example, 3 arrays a, b and c:
a = [20 29 30 32 33]
b = [28 29 30 32 33]
c = [10 12 28 31 32 33]
The best alignment for these arrays would be a[3] b[3] c[4] or a[4] b[4] c[5], because (32,32,32) and (33,33,33) are all equal values and therefore have the minimum pairwise difference between each other (assuming array indexes start at 0).
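As a sanity check on the example, a brute-force sketch can enumerate every index combination and keep the ones with the smallest pairwise cost. The names `pairwise_cost` and `brute_force_alignments` are made up for illustration, and the search is exponential in the number of arrays, so this is only meant for verifying small cases.

```python
from itertools import combinations, product


def pairwise_cost(arrays, indexes):
    """Sum of |x - y| over every pair of selected values (each pair counted once)."""
    values = [arr[i] for arr, i in zip(arrays, indexes)]
    return sum(abs(x - y) for x, y in combinations(values, 2))


def brute_force_alignments(arrays):
    """Return (best_cost, index tuples achieving it) by exhaustive search."""
    best_cost, best = None, []
    for indexes in product(*(range(len(arr)) for arr in arrays)):
        cost = pairwise_cost(arrays, indexes)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [indexes]
        elif cost == best_cost:
            best.append(indexes)
    return best_cost, best


a = [20, 29, 30, 32, 33]
b = [28, 29, 30, 32, 33]
c = [10, 12, 28, 31, 32, 33]
print(brute_force_alignments([a, b, c]))  # (0, [(3, 3, 4), (4, 4, 5)])
```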
This is a common problem in bioinformatics that's usually solved with Dynamic Programming, but because the sequences are ordered, I think there's a way of exploiting that order. I first thought about doing this pairwise, but that does not guarantee the global optimum because the best local answer might not be the best global answer.
This is meant to be language agnostic, but I don't really mind an answer for a specific language, as long as there is no loss of generality. I know Dynamic Programming is an option here, but I have a feeling there's an easier way to do this?
The tricky thing is walking through the arrays so that at some point you're guaranteed to be considering the set of indices that realizes the pairwise minimum. Using a min heap on the values doesn't work. Counterexample with 4 arrays: [0,5], [1,2], [2], [2]. We start with d(0,1,2,2) = 7 and the optimum is d(0,2,2,2) = 6, but the min heap moves us from 7 to d(5,1,2,2) = 12, then to d(5,2,2,2) = 9, so the optimum is never visited.
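The counterexample can be replayed with a short sketch. It assumes the usual min-heap strategy (always advance the array whose current value is smallest, and stop when that array is exhausted); the function names are illustrative only.

```python
import heapq
from itertools import combinations, product


def cost_of(values):
    """Pairwise distance: sum of |x - y| over every pair of values."""
    return sum(abs(x - y) for x, y in combinations(values, 2))


def min_heap_walk_costs(arrays):
    """Costs visited when always advancing the array with the smallest current value."""
    idx = [0] * len(arrays)
    heap = [(arr[0], i) for i, arr in enumerate(arrays)]
    heapq.heapify(heap)
    costs = [cost_of([arr[0] for arr in arrays])]
    while True:
        _, i = heapq.heappop(heap)
        if idx[i] + 1 >= len(arrays[i]):
            break  # the smallest value cannot advance, so the walk stops
        idx[i] += 1
        heapq.heappush(heap, (arrays[i][idx[i]], i))
        costs.append(cost_of([arr[j] for arr, j in zip(arrays, idx)]))
    return costs


arrays = [[0, 5], [1, 2], [2], [2]]
visited = min_heap_walk_costs(arrays)
optimum = min(cost_of(combo) for combo in product(*arrays))
print(visited, optimum)  # [7, 12, 9] 6
```

The walk never reaches a cost of 6, which matches the numbers quoted above.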
I believe (but haven't proved) that if we always increment the index that improves pairwise distance the most (or degrades it the least), we're guaranteed to visit every local min and the global min.
Assuming n total elements across k arrays:
Simple approach: we repeatedly compute the pairwise distance deltas (the change in distance caused by incrementing each index) and increment the best one. Any time doing so switches us from improvement to degradation (i.e. we have just passed a local minimum), we calculate the pairwise distance. All this is O(k^2) per increment, for a total running time of O((n-k) * k^2).
With O(k^2) storage, we could keep a table where entry (i,j) stores the pairwise distance delta achieved by incrementing the index of array i with respect to array j. We also store the column sums. Then on incrementing an index we can update the appropriate row, column and column sums in O(k). This gives us a running time of O((n-k) * k).
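One possible reading of that bookkeeping, as a sketch rather than the answer's own code: `delta[i][j]` holds the change in |a_i - a_j| if array i advances, and `total[i]` is the per-array sum (stored per row here, which is just the transposed layout of the column sums described above). All names are assumptions, and the running cost is updated by adding deltas, so with floating-point inputs the equality test for ties may need a tolerance.

```python
import math


def greedy_with_delta_table(arrays):
    """Greedy walk with a k-by-k delta table; returns (best_cost, best index tables)."""
    k = len(arrays)
    idx = [0] * k

    def can_advance(i):
        return idx[i] + 1 < len(arrays[i])

    def pair_delta(i, j):
        # Change in |a_i - a_j| if array i moves to its next value.
        other = arrays[j][idx[j]]
        return abs(arrays[i][idx[i] + 1] - other) - abs(arrays[i][idx[i]] - other)

    def fresh_row(i):
        return [pair_delta(i, j) if j != i else 0 for j in range(k)]

    delta = [fresh_row(i) if can_advance(i) else [0] * k for i in range(k)]
    # Total effect of advancing array i; infinity marks an exhausted array.
    total = [sum(delta[i]) if can_advance(i) else math.inf for i in range(k)]

    cost = sum(abs(arrays[i][0] - arrays[j][0])
               for i in range(k) for j in range(i + 1, k))
    best_cost, best = cost, [idx.copy()]

    while min(total, default=math.inf) < math.inf:
        p = total.index(min(total))
        cost += total[p]
        idx[p] += 1
        # Row p: array p now starts from a new value (O(k)).
        if can_advance(p):
            delta[p] = fresh_row(p)
            total[p] = sum(delta[p])
        else:
            total[p] = math.inf
        # Column p: every other array is now compared with p's new value (O(k)).
        for i in range(k):
            if i != p and total[i] < math.inf:
                new = pair_delta(i, p)
                total[i] += new - delta[i][p]
                delta[i][p] = new
        if cost < best_cost:
            best_cost, best = cost, [idx.copy()]
        elif cost == best_cost:
            best.append(idx.copy())
    return best_cost, best


print(greedy_with_delta_table([[0, 5], [1, 2], [2], [2]]))  # (6, [[0, 1, 0, 0]])
```

Each loop iteration touches one row and one column, so the work per increment is O(k) instead of the O(k^2) needed to recompute every pair.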
To just complete Dave's answer, here is the pseudocode of the delta algorithm:
initialise index_table to 0's, where entry i holds the current index into the i-th array
initialise delta_table with the cost of incrementing the index of the i-th array while keeping the other indexes at their current values
cur_cost <- cost of current index table
best_cost <- cur_cost
best_solutions <- list with the current index table
while (can_at_least_one_index_increase)
    i <- index whose delta is lowest
    increment i-th entry of the index_table
    cur_cost <- cost(index_table)
    if cur_cost < best_cost
        best_cost <- cur_cost
        best_solutions <- {index_table}
    else if cur_cost = best_cost
        best_solutions <- best_solutions U {index_table}
    update delta_table
Important Note: During an iteration, some index_table entries might have already reached the maximum value for that array. Whenever updating the delta_table, it is necessary to never pick those entries, otherwise this will result in an array-out-of-bounds error, a segmentation fault or undefined behaviour. A neat trick is to check which indexes are already at their maximum and give them a sufficiently large delta, so they are never picked. If no index can increase anymore, the loop ends.
Here's an implementation in Python:
import math


def align_ordered_sequences(arrays: list):

    def get_cost(index_table):
        # Sum of absolute differences over every pair of arrays
        # (0 when there is only one array)
        total = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                v1 = arrays[i][index_table[i]]
                v2 = arrays[j][index_table[j]]
                total += abs(v1 - v2)
        return total

    def compute_delta_table(index_table):
        # For each array, store the cost obtained after incrementing its index
        # by 1. We increment in place, call the cost function and then revert
        # the change, which avoids creating copies of the index table.
        delta_table = []
        for i in range(n):
            if index_table[i] + 1 >= len(arrays[i]):
                # Implementation detail: this index cannot advance any
                # further, so use infinity to make sure it is never picked
                delta_table.append(math.inf)
            else:
                index_table[i] += 1
                delta_table.append(get_cost(index_table))
                index_table[i] -= 1
        return delta_table

    def can_at_least_one_index_increase(index_table):
        return any(index_table[i] < len(arrays[i]) - 1 for i in range(n))

    n = len(arrays)
    index_table = [0] * n
    delta_table = compute_delta_table(index_table)
    best_solutions = [index_table.copy()]
    best_cost = get_cost(index_table)

    while can_at_least_one_index_increase(index_table):
        i = delta_table.index(min(delta_table))
        index_table[i] += 1
        new_cost = get_cost(index_table)

        # A new best solution was found
        if new_cost < best_cost:
            best_cost = new_cost
            best_solutions = [index_table.copy()]
        # A new solution with the same cost was found
        elif new_cost == best_cost:
            best_solutions.append(index_table.copy())

        # Update the delta table
        delta_table = compute_delta_table(index_table)

    return best_solutions
And here are some examples:
>>> print(align_ordered_sequences([[0,5], [1,2], [2], [2]]))
[[0, 1, 0, 0]]
>>> print(align_ordered_sequences([[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]))
[[3, 4], [5, 5]]
Note 2: this outputs the indexes into each array, not the actual values.
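If the values are wanted instead, each index table can be mapped back onto the arrays. A minimal sketch, where `values_at` is a hypothetical helper and `solutions` is the result from the second example:

```python
def values_at(arrays, index_table):
    """Look up the value selected in each array by one index table."""
    return [arr[i] for arr, i in zip(arrays, index_table)]


arrays = [[3, 5, 8, 29, 40, 50], [1, 4, 14, 17, 29, 50]]
solutions = [[3, 4], [5, 5]]  # index tables returned for these arrays
print([values_at(arrays, s) for s in solutions])  # [[29, 29], [50, 50]]
```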
The Goal
(Forgive me for length of this, it's mostly background and detail.)
I'm contributing to a TOML encoder/decoder for MATLAB and I'm working with numerical arrays right now. I want to input (and then be able to write out) the numerical array in the same format. This format is the nested square-bracket format that is used by numpy.array. For example, to make multi-dimensional arrays in numpy:
The following is in python, just to be clear. It is a useful example though my work is in MATLAB.
1D and 2D arrays
>> x = np.array([1,2])
>> x
array([1, 2])
>> x = np.array([[1],[2]])
>> x
array([[1],
       [2]])
3D array
>> x = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
>> x
array([[[1, 2],
        [3, 4]],
       [[5, 6],
        [7, 8]]])
4D array
>> x = np.array([[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]])
>> x
array([[[[ 1,  2],
         [ 3,  4]],
        [[ 5,  6],
         [ 7,  8]]],
       [[[ 9, 10],
         [11, 12]],
        [[13, 14],
         [15, 16]]]])
The input is a logical construction of the dimensions by nested brackets. Turns out this works pretty well with the TOML array structure. I can already successfully parse and decode any size/any dimension numeric array with this format from TOML to MATLAB numerical array data type.
Now, I want to encode that MATLAB numerical array back into this char/string structure to write back out to TOML (or whatever string).
So I have the following 4D array in MATLAB (same 4D array as with numpy):
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4])
x(:,:,1,1) =
     1     2
     3     4
x(:,:,2,1) =
     5     6
     7     8
x(:,:,1,2) =
     9    10
    11    12
x(:,:,2,2) =
    13    14
    15    16
And I want to turn that into a string that has the same format as the 4D numpy input (with some function named bracketarray or something):
>> str = bracketarray(x)
str =
'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'
I can then write out the string to a file.
EDIT: I should add, that the function numpy.array2string() basically does exactly what I want, though it adds some other whitespace characters. But I can't use that as part of the solution, though it is basically the functionality I'm looking for.
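For reference, the target format is the same text a compact JSON dump of a nested list produces. This is only a sketch of the desired output using Python's standard `json` module (not `numpy.array2string`, and not something that can be called from the MATLAB encoder); the variable names are illustrative.

```python
import json

nested = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
          [[[9, 10], [11, 12]], [[13, 14], [15, 16]]]]

# separators=(",", ":") drops the spaces json.dumps would otherwise insert
text = json.dumps(nested, separators=(",", ":"))
print(text)
# [[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]
```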
The Problem
Here's my problem. I have successfully solved this problem for up to 3 dimensions using the following function, but I cannot for the life of me figure out how to extend it to N-dimensions. I feel like it's an issue of the right kind of counting for each dimension, making sure to not skip any and to nest the brackets correctly.
Current bracketarray.m that works up to 3D
function out = bracketarray(in, internal)
    in_size = size(in);
    in_dims = ndims(in);

    % if array has only 2 dimensions, create the string
    if in_dims == 2
        storage = cell(in_size(1), 1);
        for jj = 1:in_size(1)
            storage{jj} = strcat('[', strjoin(split(num2str(in(jj, :)))', ','), ']');
        end
        if exist('internal', 'var') || in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
            out = {strcat('[', strjoin(storage, ','), ']')};
        else
            out = storage;
        end
        return
    % if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
    else
        out = cell(in_size(end), 1);
        for ii = 1:in_size(end) %<--- this doesn't track dimensions or counts of them
            out(ii) = bracketarray(in(:,:,ii), 'internal'); %<--- this is limited to 3 dimensions atm. and out(indexing) need help
        end
    end

    % bracket the final bit together
    if in_size(1) > 1 || (in_size(1) == 1 && in_dims >= 3)
        out = {strcat('[', strjoin(out, ','), ']')};
    end
end
Help me Obi-wan Kenobis, y'all are my only hope!
EDIT 2: Added test suite below and modified current code a bit.
Test Suite
Here is a test suite to check that the output is what it should be. Just copy and paste it into the MATLAB command window. For my currently posted code, all tests return true except the ones with more than 3 dimensions. My current code outputs a cell. If your solution outputs something different (like a string), you'll have to remove the curly brackets from the test suite.
isequal(bracketarray(ones(1,1)), {'[1]'})
isequal(bracketarray(ones(2,1)), {'[[1],[1]]'})
isequal(bracketarray(ones(1,2)), {'[1,1]'})
isequal(bracketarray(ones(2,2)), {'[[1,1],[1,1]]'})
isequal(bracketarray(ones(3,2)), {'[[1,1],[1,1],[1,1]]'})
isequal(bracketarray(ones(2,3)), {'[[1,1,1],[1,1,1]]'})
isequal(bracketarray(ones(1,1,2)), {'[[[1]],[[1]]]'})
isequal(bracketarray(ones(2,1,2)), {'[[[1],[1]],[[1],[1]]]'})
isequal(bracketarray(ones(1,2,2)), {'[[[1,1]],[[1,1]]]'})
isequal(bracketarray(ones(2,2,2)), {'[[[1,1],[1,1]],[[1,1],[1,1]]]'})
isequal(bracketarray(ones(1,1,1,2)), {'[[[[1]]],[[[1]]]]'})
isequal(bracketarray(ones(2,1,1,2)), {'[[[[1],[1]]],[[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,1,2)), {'[[[[1,1]]],[[[1,1]]]]'})
isequal(bracketarray(ones(1,1,2,2)), {'[[[[1]],[[1]]],[[[1]],[[1]]]]'})
isequal(bracketarray(ones(2,1,2,2)), {'[[[[1],[1]],[[1],[1]]],[[[1],[1]],[[1],[1]]]]'})
isequal(bracketarray(ones(1,2,2,2)), {'[[[[1,1]],[[1,1]]],[[[1,1]],[[1,1]]]]'})
isequal(bracketarray(ones(2,2,2,2)), {'[[[[1,1],[1,1]],[[1,1],[1,1]]],[[[1,1],[1,1]],[[1,1],[1,1]]]]'})
isequal(bracketarray(permute(reshape([1:16],2,2,2,2),[2,1,3,4])), {'[[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]'})
isequal(bracketarray(ones(1,1,1,1,2)), {'[[[[[1]]]],[[[[1]]]]]'})
I think it would be easier to just loop and use join. Your test cases pass.
function out = bracketarray_matlabbit(in)
    out = permute(in, [2 1 3:ndims(in)]);
    out = string(out);
    dimsToCat = ndims(out);
    if iscolumn(out)
        dimsToCat = dimsToCat-1;
    end
    for i = 1:dimsToCat
        out = "[" + join(out, ",", i) + "]";
    end
end
This also seems to be faster than the route you were pursuing:
>> x = permute(reshape([1:16],2,2,2,2),[2,1,3,4]);
>> tic; for i = 1:1e4; bracketarray_matlabbit(x); end; toc
Elapsed time is 0.187955 seconds.
>> tic; for i = 1:1e4; bracketarray_cris_luengo(x); end; toc
Elapsed time is 5.859952 seconds.
The recursive function is almost complete. What is missing is a way to index the last dimension. There are several ways to do this; the neatest, I find, is as follows:
n = ndims(x);
index = cell(n-1, 1);
index(:) = {':'};
y = x(index{:}, ii);
It's a little tricky at first, but this is what happens: index is a set of n-1 strings ':'. index{:} is a comma-separated list of these strings. When we index x(index{:},ii) we actually do x(:,:,:,ii) (if n is 4).
The completed recursive function is:
function out = bracketarray(in)
n = ndims(in);
if n == 2
    % Fill in your n==2 code here
else
    % if array has more than 2 dimensions, recursively send planes of 2 dimensions for encoding
    index = cell(n-1, 1);
    index(:) = {':'};
    storage = cell(size(in, n), 1);
    for ii = 1:size(in, n)
        storage(ii) = bracketarray(in(index{:}, ii)); % last dimension automatically removed
    end
end
out = { strcat('[', strjoin(storage, ','), ']') };
Note that I have preallocated the storage cell array, to prevent it from being resized in every loop iteration. You should do the same in your 2D case code. Preallocating is important in MATLAB for performance reasons, and the MATLAB Editor should warn you about this too.
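To make the recursion easier to follow outside MATLAB, here is a rough Python sketch of the same idea: peel off the last dimension one slice at a time until a 2D plane is left. It assumes the data is given as a flat list in column-major order (MATLAB's memory layout) together with its shape; `bracket_array` and its parameters are hypothetical names, and the single-row special case mirrors what the test suite expects for `ones(1,2)`.

```python
from math import prod


def bracket_array(flat, shape):
    """Bracket string for column-major `flat` data with the given `shape`."""
    shape = list(shape)
    shape += [1] * (2 - len(shape))  # treat a vector as n-by-1
    strides = [prod(shape[:d]) for d in range(len(shape))]

    def encode(dim, offset):
        if dim >= 2:
            # Recurse over slices along trailing dimension `dim`.
            parts = [encode(dim - 1, offset + i * strides[dim])
                     for i in range(shape[dim])]
        else:
            # 2D base case: one bracket per row, columns inside it.
            parts = [
                "[" + ",".join(str(flat[offset + r * strides[0] + c * strides[1]])
                               for c in range(shape[1])) + "]"
                for r in range(shape[0])
            ]
        return "[" + ",".join(parts) + "]"

    text = encode(len(shape) - 1, 0)
    if len(shape) == 2 and shape[0] == 1:
        text = text[1:-1]  # a single row keeps only one pair of brackets
    return text


# Column-major contents of permute(reshape(1:16,2,2,2,2),[2,1,3,4])
flat = [1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14, 16]
print(bracket_array(flat, (2, 2, 2, 2)))
# [[[[1,2],[3,4]],[[5,6],[7,8]]],[[[9,10],[11,12]],[[13,14],[15,16]]]]
```

Unlike MATLAB, the shape is passed explicitly here, so trailing singleton dimensions are not dropped automatically.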
How can corresponding array indices be found within two differently shaped arrays of arrays that are the same size?
For example, an array x of size 36 is split into 11 arrays. Another array y of size 36 is split into 4 arrays. Then some modifications happen on the 4 arrays making up y.
N = 6 #some size param
x = np.zeros(N*N,dtype=int) #make array of zeros (np.int was removed in NumPy 1.24)
s1 = np.array_split(x,11) #split array into arbitrary parts
y = np.random.randint(5, size=(N, N)) #make another same size array (and modify it)
s2 = np.array_split(y,4) #split array into different number of parts
Then iterating through the 4 arrays of y, I need to find the start index in the first array (array_num) of s1, to the end index of the last array of s1 that the values in s2 correspond to.
for sub_s2 in s2:
    array_num = ?
    s_idx = ?
    e_idx = ?
    s2_idx = ?
    e2_idx = ?
    #put the array into the correct ordered indexes of the other array
    s1[array_num][s_idx,e_idx] = sub_s2[s2_idx,e2_idx]
res = np.concatenate(s1)
I made this image to try and illustrate the issue. In this case, 'data' means the size of x and y to start. Then s1 and s2 are broken into different chunks, and the problem is finding the index within each chunk that the arrays in s2 correspond to.
Here is how to find the correct indices:
import numpy as np

# create example; use the same data for both splits for easy validation
a = np.arange(36)
s1 = np.array_split(a, 11)
s2 = np.array_split(a, 4)
# recover absolute offsets of bit boundaries
l1 = np.cumsum([0, *map(len,s1)])
l2 = np.cumsum([0, *map(len,s2)])
# find bits in s1 into which the first ...
start_n = l1[1:].searchsorted(l2[:-1], 'right')
# ... and last elements of bits of s2 fall
end_n = l1[1:].searchsorted(l2[1:]-1, 'right')
# find the corresponding indices into bits of s1
start_idx = l2[:-1] - l1[start_n]
end_idx = l2[1:]-1 - l1[end_n]
# check
[s[0] for s in s2]
# [0, 9, 18, 27]
[s1[n][i] for n, i in zip(start_n, start_idx)]
# [0, 9, 18, 27]
[s[-1] for s in s2]
# [8, 17, 26, 35]
[s1[n][i] for n, i in zip(end_n, end_idx)]
# [8, 17, 26, 35]
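The same lookup can be sketched with only the standard library, which may make the offset arithmetic easier to follow. This assumes chunk sizes follow the even-split rule (the first `n % parts` chunks get one extra element); `split_sizes`, `chunk_starts` and `locate` are hypothetical helper names.

```python
from bisect import bisect_right
from itertools import accumulate


def split_sizes(n, parts):
    """Chunk lengths of an even split: the first n % parts chunks get one extra."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def chunk_starts(sizes):
    """Absolute offset at which each chunk begins."""
    return [0, *accumulate(sizes)][:-1]


def locate(starts, position):
    """Map an absolute position to (chunk number, index inside that chunk)."""
    chunk = bisect_right(starts, position) - 1
    return chunk, position - starts[chunk]


sizes1, sizes2 = split_sizes(36, 11), split_sizes(36, 4)
starts1, starts2 = chunk_starts(sizes1), chunk_starts(sizes2)

# Where the first and last element of every s2 chunk live inside s1
first = [locate(starts1, s) for s in starts2]
last = [locate(starts1, s + n - 1) for s, n in zip(starts2, sizes2)]
print(first)  # [(0, 0), (2, 1), (5, 0), (8, 0)]
print(last)   # [(2, 0), (4, 2), (7, 2), (10, 2)]
```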
If I have the following array:
x = double([1, 1, 1, 10, 1, 1, 50, 1, 1, 1 ])
I want to do the following:
Group the array into groups of 5 which will each be evaluated separately.
Identify the MAX value of each of the groups of the array
Remove that MAX value and put it into another array.
Finally, I want to print the updated array x without the MAX values, and the new array containing the MAX values.
How can I do this? I am new to IDL and have had no formal training in coding.
I understand that I can write the code to group and find the max values this way:
FOR i = 1, (n_elements(x)-4) do begin
  print, "MAX of array", MAX(x[i-1:i+3])
ENDFOR
However, how do I implement all of what I specified above? I know I have to create an empty array that will append the values found by the for loop, but I don't know how to do that.
Thanks
I changed your x to have unique elements to make sure I wasn't fooling myself. With this approach, the number of elements of x must be divisible by group_size:
x = double([1, 2, 3, 10, 4, 5, 50, 6, 7, 8])
group_size = 5
maxes = max(reform(x, group_size, n_elements(x) / group_size), ind, dimension=1)
all = bytarr(n_elements(x))
all[ind] = 1
x_without_maxes = x[where(all eq 0)]
print, maxes
print, x_without_maxes
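For readers who don't know IDL, the same group-wise idea can be sketched in Python. This is an illustration rather than a translation of the IDL calls: `split_out_group_maxes` is a made-up name, and when a group contains its maximum more than once only the first occurrence is removed.

```python
def split_out_group_maxes(values, group_size=5):
    """Return (maxes, remaining) after removing one maximum per group."""
    if len(values) % group_size:
        raise ValueError("number of elements must be divisible by group_size")
    maxes, remaining = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        k = group.index(max(group))  # position of the first maximum in the group
        maxes.append(group[k])
        remaining.extend(group[:k] + group[k + 1:])
    return maxes, remaining


x = [1, 2, 3, 10, 4, 5, 50, 6, 7, 8]
print(split_out_group_maxes(x))  # ([10, 50], [1, 2, 3, 4, 5, 6, 7, 8])
```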
Lists are good for this, because they allow you to pop out values at specific indices, rather than rewriting the whole array again. You might try something like the following. I've used a while loop here, rather than a for loop, because it makes it a little easier in this case.
x = List(1, 1, 1, 10, 1, 1, 50, 1, 1, 1)
maxValues = List()
pos = 4
while (pos le x.length) do begin
  maxValues.add, max(x[pos-4:pos].toArray(), iMax)
  x.Remove, iMax+pos-4
  pos += 5-1
endwhile
print, "Max Values : ", maxValues.toArray()
print, "Remaining Values : ", x.toArray()
This allows you to do what you want I think. At the end, you have a List object (which can easily be converted to an array) with the max values for each group of 5, and another containing the remaining values.
Also, please tag this as idl-programming-language rather than idl. They are two different tags.