NOTE: Skip to the EDIT at the bottom for a condensed explanation.
FULL QUESTION:
How can I find the index of the max values of a reshape-able array of observed values in a rolling window of variable size given that the array of observed values corresponds to an array of observation times, and given that both arrays must be padded at identical indices?
SHORT PREFACE:
(I am having trouble fixing code that is long and has many moving parts. I have included only the details I feel are necessary to address my question and left out others that would make this post even longer than it is, though I can post a workable version of the code if requested.)
SETUP:
I have a text file containing times of observations and the values observed at those times. I read the contents of the text file into the appropriate lists with the goal of performing a 'maximum self-similarity test', which entails finding the maximum value in a rolling window over an entire list of values. The first rolling window is 2 elements wide (check indices 0 and 1, then 2 and 3, ..., then len(data)-2 and len(data)-1), then 4 elements wide, then 8 elements wide, etc. Assuming an array of 100 elements (I actually have around 11,000 data points), the last rolling window of width 8 will be disregarded because it is incomplete. To do this (with some help from SO), I first defined a function to reshape an array such that it can be called by a parent function.
def shapeshifter(ncol, my_array=data):
    my_array = np.array(my_array)
    desired_size_factor = np.prod([n for n in ncol if n != -1])
    if -1 in ncol:  # implicit array size
        desired_size = my_array.size // desired_size_factor * desired_size_factor
    else:
        desired_size = desired_size_factor
    return my_array.flat[:desired_size].reshape(ncol)
The parent function that calls this will loop over each row to find the maximums.
def looper(ncol, my_array=data):
    my_array = shapeshifter((-1, ncol))
    rows = [my_array[index] for index in range(len(my_array))]
    res = []
    for index in range(len(rows)):
        res.append(max(rows[index]))
    return res
And looper is called by a grandparent function that will change the size of the window for which the maximum values are obtained.
def metalooper(window_size, my_array=data):
outer = [looper(win) for win in window_size]
return outer
The next line calls the grandparent function, which in turn calls the sub-functions. In the line below, window_size is a predefined list of window sizes (ex: [2,4,8,16,...]).
ans = metalooper(window_size)
PURPOSE (can remove if unnecessary):
The function metalooper should return a list of sublists, for which each sublist contains the maximum elements of the rolling window. I then "normalize" (for lack of a better word) each value in the sublists by taking the logarithmic value of each maximum, only then to sum the elements of each sublist (such that the number of sums equals the number of sublists). Each sum is then divided by its respective weight, which gives the y-values that will be plotted against the window sizes. This plot should be piecewise (linear or power-law).
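That post-processing step might be sketched as follows (a minimal sketch; the actual weights are not specified above, so the sublist length is assumed here as a stand-in):

```python
import numpy as np

# hypothetical output of metalooper: one sublist of window maxima per window size
ans = [[2.0, 4.0, 6.0, 8.0], [4.0, 8.0]]

# take the log of each maximum, sum per sublist, then divide by a weight
# (assumed here to be the sublist length -- substitute your actual weights)
yvals = [np.sum(np.log(sub)) / len(sub) for sub in ans]
```

These yvals would then be plotted against the window sizes.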
PROBLEM:
My array of data points contains only the observed values, not the times (all of which I have converted into hours) that correspond to the observations. The times are not consecutive, so there may be an observation at 4 hrs, another at 7 hrs, another at 7.3 hrs, etc.
My first mistake was not padding zeroes for non-consecutive times (ex: observation_1 at 4 hrs, observation_2 at 6 hrs ==> observed_value = 0 at 5 hrs), since I should have moved the rolling window over the hours of observation (ex: a window size of 2 means [0,2) hours, [2,4) hours, etc.) instead of over the observed values at those times. My problem is compounded by the fact that there are also duplicate hours that fit within a window (ex: multiple observations made at 1 and 1.1 hours within the window [0,2)). Regardless, I should find the maximum observed value in each rolling window, which entails knowing which observed values correspond to which times of observation without disregarding the padded zeroes.
So: how can I efficiently pad zeroes at identical indices in both lists? I am aware that I can floor the hours of observation to check which window an observed value should fall into, but I am unsure how to proceed beyond that point. If I can pad both lists and find the index of the maximum observed value for each window, I can then use that index to get the desired observed value and the corresponding time of observation; I do not know how to do this or where to begin, as my approach of for-looping over lists is extremely slow. I would appreciate any help or advice on how to fix this. (Apologies for the length of this post; I'm not sure how to condense it further.) I would prefer to adapt my existing approach, but am open to alternatives if my method is too ridiculous.
EDIT:
To see how these functions work, let's use an example list data.
>> data = np.array(np.linspace(1,20,20))
# data corresponds to the observed values and not the observation times,
# below is a proof of concept using the values in data
>> print(shapeshifter((-1,2))) # 2 columns, -1 is always there
[[ 1. 2.]
[ 3. 4.]
[ 5. 6.]
[ 7. 8.]
[ 9. 10.]
[ 11. 12.]
[ 13. 14.]
[ 15. 16.]
[ 17. 18.]
[ 19. 20.]]
>> print(looper(2)) # get maximum in window_size (AKA length of each row) of 2 for each row of reshaped array via shapeshifter
[2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
def window_params(dataset=data):  # concerned with window_size
    numdata = len(dataset)  ## N = 11764
    lim = np.floor(np.log2(numdata))  ## last term of j = 13
    time_sc_index = np.linspace(1, lim, num=int(lim))  ## j = [1,2,3,...,floor(log_2(N))=13]
    window_size = [2**j for j in time_sc_index]  ## scale = [2,4,8,...,8192]
    block_size = np.floor([numdata/sc for sc in window_size])  ## b_j (sc ~ scale)
    return numdata, time_sc_index, window_size, block_size
numdata, time_sc_index, window_size, block_size = window_params()
>> print(window_size)
[2.0, 4.0, 8.0, 16.0]
>> print(metalooper(window_size)) # call looper for each window_size
[[2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0], [4.0, 8.0, 12.0, 16.0, 20.0], [8.0, 16.0], [16.0]]
My issue is that these observations each correspond to different times. The list of times can be something like
times = [0, 4, 6, 6, 9, ...] # times are floored, duplicate times correspond to multiple observations at floored times
I need to have a list of consecutive times [0, 1, 2, 3, ...], each of which corresponds to an observed value from the list data (as each data point is observed at a specific time). My goal is to find the maximum observed value in each window of times. Using the times above, the observed value at time=0 is data[0], and the observed value at time=1 is 0 since there is no observation at that time. Similarly, I would use the maximum observed value at duplicate times; in other words, I have 2 observations at time=6, so I want the maximum observed value at that time. While my windows currently roll over only the observed values, I actually need the windows to roll over all hours (including time=1 in this example) to find the maximum observed values at those times. In such a case, a window over a time range that contains duplicate times should count only one of the duplicates - specifically the one that corresponds to the maximum observed value at that time. My thinking is to pad zeroes into both lists (times and data) such that each index of times corresponds to the same index of data. I'm trying to find an efficient way to proceed, though I'm having trouble figuring out a way to proceed at all.
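One possible way to build that padding without explicit for-loops (a sketch; the variable names are illustrative) is to floor the times, allocate one zero slot per consecutive hour, and let np.maximum.at keep only the largest observation at each hour:

```python
import numpy as np

# illustrative data: floored observation hours (with duplicates) and values
times = np.array([0, 4, 6, 6, 9])
values = np.array([5.0, 2.0, 7.0, 3.0, 1.0])

# one zero slot per consecutive hour; hours without observations stay 0
# (this assumes nonnegative observed values, since 0 marks "no observation")
padded = np.zeros(times.max() + 1)

# for duplicate hours, keep only the maximum observation at that hour
np.maximum.at(padded, times, values)

# padded[6] is now 7.0 (the larger of the two observations at hour 6),
# and padded[1] is 0.0 since nothing was observed at hour 1
```

The index into padded is then the hour itself, so the rolling-window functions above can be applied directly to padded, and np.argmax over a window slice recovers both the maximum value and its observation hour.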
I have a 2D array (typical size about 400x100) as shown (it looks like a trapezium because elements in the lower right are nan). For each element in the array, I want to perform a numerical integral along the column over several elements (on the order of ~10 elements). In physics language, think of the colour as the magnitude of the force; I want to find the work done by calculating the integral of F dz. I can use a double for-loop with trapz to do the job, but are there other more efficient ways (probably making use of arrays and vectorization) to do it in Matlab or Python? My ultimate goal is to find the point where the evaluated integral is largest. So from the image, in which yellow represents a large value, we expect the integral to be largest somewhere on the right side above the dotted line.
Also, it is relatively easy if the number of points I want to integrate is an integer, but what if I want to integrate, say, 7.5 points? I am thinking of using fit to interpolate the points, but I'm not sure if that's over-complicating the task.
You can use cumsum to speed up trapz. Calculate the cumulative sum (the 1-dimensional integral images proposed by @Benjamin):
>>> import numpy as np
>>> csdata = np.cumsum(data, axis=0)  # axis=0 so the sum runs down each column
Integrate with a discrete length
>>> npoints = 6
>>> result = np.zeros_like(data)
>>> result[:-npoints, :] = csdata[npoints:, :] - csdata[:-npoints, :]
The result is a vectorization of csdata[i+npoints, j] - csdata[i, j] for every i, j in the image. The last npoints rows are filled with zeros. You can reflect the boundary with np.pad if you want to prevent this.
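As a quick sanity check (a self-contained sketch with random data), the cumulative-sum difference matches the explicit double loop, with the cumulative sum taken along axis 0 so the integral runs down each column as the question describes:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((20, 5))
npoints = 6

# vectorized: difference of cumulative sums taken down each column (axis=0)
csdata = np.cumsum(data, axis=0)
result = np.zeros_like(data)
result[:-npoints, :] = csdata[npoints:, :] - csdata[:-npoints, :]

# explicit double loop for comparison: csdata[i+npoints, j] - csdata[i, j]
# equals the sum of data[i+1 .. i+npoints] in column j
expected = np.zeros_like(data)
for i in range(data.shape[0] - npoints):
    for j in range(data.shape[1]):
        expected[i, j] = data[i + 1:i + npoints + 1, j].sum()

assert np.allclose(result, expected)
```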
For non-discrete intervals, you can work with interpolations:
>>> from scipy.interpolate import interp2d
>>> C = 0.5 # to interpolate every npoints+C pixels
>>> y, x = np.mgrid[:data.shape[0], :data.shape[1]]
>>> ynew, xnew = np.mgrid[C:data.shape[0]+C, :data.shape[1]]
>>> f = interp2d(x, y, csdata)
>>> csnew = f(xnew, ynew)
The above shifts a regular grid C pixels in the y direction and interpolates the cumulative data csdata at those points (in practice, it vectorizes the interpolation for every pixel).
Then the integral of npoints+C length can be obtained as
>>> npoints = 6
>>> result = np.zeros_like(data)
>>> result[:-npoints, :] = csnew[npoints:, :] - csdata[:-npoints, :]
Note that the upper bound is now csnew (a shift of 6 actually gets the 6.5th element), so it integrates over 6.5 points in practice.
You can then find the maximum point as
>>> idx = np.argmax(result.ravel()) # ravel to get the 1D maximum point
>>> maxy, maxx = np.unravel_index(idx, data.shape) # get 2D coordinates of idx
I am using Matlab for one of my projects, and I have been stuck at a point for some time now. I tried searching on Google, but without much success.
I have an array of 0s and 1s. Something like:
A = [0,0,0,1,1,1,1,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0];
I want to extract an array of indices: [x_1, x_2, x_3, x_4, x_5, ...]
Such that x_1 is the index of start of first range of zeros. x_2 is the index of end of first range of zeros.
x_3 is the index of start of second range of zeros. x_4 is the index of end of second range of zeros.
For the above example:
x_1 = 1, x_2 = 3
x_3 = 9, x_4 = 10
and so on.
Of course, I can do it by writing a simple loop. I am wondering if there is a more elegant (vectorized) way to solve this problem. I was thinking about something like prefix sums, but no luck as of now.
Thanks,
Anil.
The diff function is great for this sort of stuff and pretty quick.
temp = diff(A);
Starts = find([A(1) == 0, temp==-1]);
Ends = find([temp == 1,A(end)==0])
Edit: Fixed the error in the Ends calculation caught by gnovice.
Zeros not preceded by other zeros: A==0 & [true A(1:(end-1))~=0]
Zeros not followed by other zeros: A==0 & [A(2:end)~=0 true]
Use each of these plus find to get starts and ends of runs of zeros. Then, if you really want them in a single vector as you described, interleave them.
If you want to get your results in a single vector like you described above (i.e. x = [x_1 x_2 x_3 x_4 x_5 ...]), then you can perform a second-order difference using the function DIFF and find the points greater than 0:
x = find(diff([1 A 1],2) > 0);
EDIT:
The above will work for the case when there are at least 2 zeroes in every string of zeroes. If you will have single zeroes appearing in A, the above can be modified to handle them like so:
diffA = diff([1 A 1],2);
[~,x] = find([diffA > 0; diffA == 2]);
In this case, a single zero value will create repeated indices in x (i.e. if A starts with a single zero, then x(1) and x(2) will both be 1).
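For Python users, the same run-of-zeros extraction can be sketched with NumPy's diff (illustrative, using 0-based indices rather than MATLAB's 1-based ones):

```python
import numpy as np

A = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
              0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

# pad with a 1 on each side so runs touching either end are still detected;
# in the diff, -1 marks a 1 -> 0 transition (run start) and +1 marks 0 -> 1
d = np.diff(np.concatenate(([1], A, [1])))
starts = np.where(d == -1)[0]   # start index of each run of zeros
ends = np.where(d == 1)[0] - 1  # end index of each run of zeros
```

For the example A above this gives starts [0, 8, 16, 24] and ends [2, 9, 20, 27], matching the MATLAB answers once shifted to 1-based indexing.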
I have for example 5 arrays with some inserted elements (numbers):
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
I need to find the most common elements in those arrays, and every element should go all the way till the end (see example below). In this example that would be the combination 4, 4, 4, 2 (or the same one but with "30" on the end; it's the "same") because it contains the smallest number of different elements (only two: 4 and 2/30).
This combination (see below) isn't good because if I have, for example, "4", it must "go" until it ends (the next array mustn't contain "4" at all). So a combination must go all the way till the end.
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
EDIT2: OR
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
OR anything else is NOT good.
Is there some algorithm to speed this thing up (if I have thousands of arrays with hundreds of elements in each one)?
To make it clear - the solution must contain the lowest number of different elements, and the groups (of the same numbers) must be ordered from the first - larger ones - to the last - smallest ones. So in the upper example, 4,4,4,2 is better than 4,2,2,2 because in the first example the group of 4's is larger than the group of 2's.
EDIT: To be more specific. The solution must contain the smallest number of different elements, and those elements must be grouped from first to last. So if I have three arrays like
1,2,3
1,4,5
4,5,6
The solution is 1,1,4 or 1,1,5 or 1,1,6, NOT 2,5,5, because the 1's form a larger leading group (two of them) than the single 2.
Thanks.
EDIT3: I can't be more specific :(
EDIT4: @spintheblack: 1,1,1,2,4 is the correct solution because a number used the first time (let's say at position 1) can't be used later (unless it's in the SAME group of 1's). I would say that grouping has the "priority". Also, I didn't mention it (sorry about that) but the numbers in the arrays are NOT sorted in any way; I typed them that way in this post because it was easier for me to follow.
Here is the approach you want to take, if arrays is an array that contains each individual array:

1. Starting at i = 0
2. current = arrays[i]
3. Loop i from i+1 to len(arrays)-1
4. new = current & arrays[i] (set intersection, finds common elements)
5. If there are any elements in new, do step 6, otherwise skip to step 7
6. current = new, return to step 3 (continue loop)
7. print or yield an element from current, current = arrays[i], return to step 3 (continue loop)
Here is a Python implementation:
def mce(arrays):
    count = 1
    current = set(arrays[0])
    for i in range(1, len(arrays)):
        new = current & set(arrays[i])
        if new:
            count += 1
            current = new
        else:
            print(" ".join([str(current.pop())] * count), end=" ")
            count = 1
            current = set(arrays[i])
    print(" ".join([str(current.pop())] * count))
>>> mce([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]])
4 4 4 2
If all are number lists, and all are sorted, then:

1. Convert to an array of bitmaps.
2. Keep 'AND'ing the bitmaps till you hit zero. The position of a 1 in the previous value indicates the first element.
3. Restart step 2 from the next element.
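That bitmap idea can be sketched in Python using plain integers as bitmasks (illustrative; bit k set means the number k is present in the array):

```python
def common_runs(arrays):
    """Greedy left-to-right run finding with integers as bitmasks."""
    to_mask = lambda arr: sum(1 << x for x in set(arr))  # bit k <=> number k
    result = []
    current = to_mask(arrays[0])
    count = 1
    for arr in arrays[1:]:
        anded = current & to_mask(arr)
        if anded:  # intersection still nonzero: the run continues
            current, count = anded, count + 1
        else:      # hit zero: emit an element of the previous intersection
            result.extend([current.bit_length() - 1] * count)
            current, count = to_mask(arr), 1
    result.extend([current.bit_length() - 1] * count)
    return result
```

For the example arrays, common_runs([[1,4,8,10], [1,2,3,4,11,15], [2,4,20,21], [2,30]]) returns [4, 4, 4, 30]; which member of each final intersection is reported (here the highest set bit) is arbitrary, just as mce pops an arbitrary set element.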
This has now turned into a graph problem with a twist.
The problem is a directed acyclic graph of connections between stops, and the goal is to minimize the number of line switches when riding on a train/tram.
ie. this list of sets:
1,4,8,10 <-- stop A
1,2,3,4,11,15 <-- stop B
2,4,20,21 <-- stop C
2,30 <-- stop D, destination
He needs to pick lines that are available at his exit stop, and his arrival stop, so for instance, he can't pick 10 from stop A, because 10 does not go to stop B.
So, this is the set of available lines and the stops they stop on:
A B C D
line 1 -----X-----X-----------------
line 2 -----------X-----X-----X-----
line 3 -----------X-----------------
line 4 -----X-----X-----X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
If we consider that a line under consideration must go between at least 2 consecutive stops, let me highlight the possible choices of lines with equal signs:
A B C D
line 1 -----X=====X-----------------
line 2 -----------X=====X=====X-----
line 3 -----------X-----------------
line 4 -----X=====X=====X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
He then needs to pick a way that transports him from A to D, with the minimal number of line switches.
Since he explained that he wants the longest rides first, the following sequence seems the best solution:
take line 4 from stop A to stop C, then switch to line 2 from C to D
Code example:
stops = [
    [1, 4, 8, 10],
    [1, 2, 3, 4, 11, 15],
    [2, 4, 20, 21],
    [2, 30],
]
def calculate_possible_exit_lines(stops):
    """
    only return lines that are available at both exit
    and arrival stops, discard the rest.
    """
    result = []
    for index in range(0, len(stops) - 1):
        lines = []
        for value in stops[index]:
            if value in stops[index + 1]:
                lines.append(value)
        result.append(lines)
    return result
def all_combinations(lines):
    """
    produce all combinations which travel from one end
    of the journey to the other, across available lines.
    """
    if not lines:
        yield []
    else:
        for line in lines[0]:
            for rest_combination in all_combinations(lines[1:]):
                yield [line] + rest_combination
def reduce(combination):
    """
    reduce a combination by returning the number of
    times each value appears consecutively, ie.
    [1,1,4,4,3] would return (2,2,1) since
    the 1's appear twice, the 4's appear twice, and
    the 3 appears only once.
    """
    result = []
    while combination:
        count = 1
        value = combination[0]
        combination = combination[1:]
        while combination and combination[0] == value:
            combination = combination[1:]
            count += 1
        result.append(count)
    return tuple(result)
def calculate_best_choice(lines):
    """
    find the best choice by reducing each available
    combination down to the number of stops you can
    sit on a single line before having to switch,
    and then picking the one that has the most stops
    first, and then so on.
    """
    available = []
    for combination in all_combinations(lines):
        count_stops = reduce(combination)
        available.append((count_stops, combination))
    available = [k for k in reversed(sorted(available))]
    return available[0][1]
possible_lines = calculate_possible_exit_lines(stops)
print("possible lines: %s" % (str(possible_lines), ))
best_choice = calculate_best_choice(possible_lines)
print("best choice: %s" % (str(best_choice), ))
This code prints:
possible lines: [[1, 4], [2, 4], [2]]
best choice: [4, 4, 2]
Since, as I said, I list lines between stops, the above solution can be read either as the lines you exit each stop on or as the lines you arrive at the next stop on.
So the route is:
Hop onto line 4 at stop A and ride on that to stop B, then to stop C
Hop onto line 2 at stop C and ride on that to stop D
There are probably edge-cases here that the above code doesn't work for.
However, I'm not bothering more with this question. The OP has demonstrated a complete incapability in communicating his question in a clear and concise manner, and I fear that any corrections to the above text and/or code to accommodate the latest comments will only provoke more comments, which leads to yet another version of the question, and so on ad infinitum. The OP has gone to extraordinary lengths to avoid answering direct questions or to explain the problem.
I am assuming that "distinct elements" do not have to actually be distinct; they can repeat in the final solution. That is, if presented with [1], [2], [1], the obvious answer [1, 2, 1] is allowed, but we'd count it as having 3 distinct elements.
If so, then here is a Python solution:
def find_best_run(first_array, *argv):
    # initialize data structures.
    this_array_best_run = {}
    for x in first_array:
        this_array_best_run[x] = (1, (1,), (x,))

    for this_array in argv:
        # find the best runs ending at each value in this_array
        last_array_best_run = this_array_best_run
        this_array_best_run = {}
        for x in this_array:
            for (y, pattern) in last_array_best_run.items():
                (distinct_count, lengths, elements) = pattern
                if x == y:
                    lengths = tuple(lengths[:-1] + (lengths[-1] + 1,))
                else:
                    distinct_count += 1
                    lengths = tuple(lengths + (1,))
                    elements = tuple(elements + (x,))
                if x not in this_array_best_run:
                    this_array_best_run[x] = (distinct_count, lengths, elements)
                else:
                    (prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
                    if distinct_count < prev_count or \
                       (distinct_count == prev_count and prev_lengths < lengths):
                        this_array_best_run[x] = (distinct_count, lengths, elements)

    # find the best overall run
    best_count = len(argv) + 10  # needs to be bigger than any possible answer.
    for (distinct_count, lengths, elements) in this_array_best_run.values():
        if distinct_count < best_count:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
        elif distinct_count == best_count and best_lengths < lengths:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements

    # convert it into a more normal representation.
    answer = []
    for (length, element) in zip(best_lengths, best_elements):
        answer.extend([element] * length)
    return answer

# example
print(find_best_run(
    [1, 4, 8, 10],
    [1, 2, 3, 4, 11, 15],
    [2, 4, 20, 21],
    [2, 30]))  # prints [4, 4, 4, 2] (the "same" answer as [4, 4, 4, 30] per the OP)
Here is an explanation. The ...best_run dictionaries have keys which are elements in the current array, and values which are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this prefers the element with the largest value in the first spot), and we track elements for the end. At each step I construct all possible runs which combine a run up to the previous array with this element next in sequence, and keep the ones that are best up to the current array. At the end I pick the best possible overall run, turn it into a conventional representation, and return it.
If you have N arrays of length M, this should take O(N*M*M) time to run.
I'm going to take a crack here based on the comments, please feel free to comment further to clarify.
We have N arrays and we are trying to find the 'most common' value over all arrays when one value is picked from each array. There are several constraints: 1) we want the smallest number of distinct values; 2) the most common is the maximal grouping of similar letters (changing from above for clarity). Thus, 4 t's and 1 p beats 3 x's and 2 y's.
I don't think either problem can be solved greedily - here's a counterexample: [[1,4],[1,2],[1,2],[2],[3,4]]. A greedy algorithm would pick [1,1,1,2,4] (3 distinct numbers) instead of the better [4,2,2,2,4] (2 distinct numbers).
This looks like a bipartite matching problem, but I'm still coming up with the formulation..
EDIT : ignore; This is a different problem, but if anyone can figure it out, I'd be really interested
EDIT 2 : For anyone that's interested, the problem that I misinterpreted can be formulated as an instance of the Hitting Set problem, see http://en.wikipedia.org/wiki/Vertex_cover#Hitting_set_and_set_cover. Basically, the left-hand side of the bipartite graph would be the arrays and the right-hand side would be the numbers, with an edge drawn from each array to every number it contains. Unfortunately, this is NP-complete, but the greedy solutions described above are essentially the best approximation.
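For the curious, the standard greedy approximation for that hitting-set formulation can be sketched as follows (illustrative only, and note this solves the misinterpreted problem, not the OP's original one):

```python
def greedy_hitting_set(arrays):
    """Repeatedly pick the number present in the most not-yet-hit arrays."""
    remaining = [set(a) for a in arrays]
    chosen = []
    while remaining:
        # count how many remaining arrays each candidate number appears in
        counts = {}
        for s in remaining:
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        best = max(counts, key=counts.get)  # number hitting the most arrays
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen

# the counterexample arrays from the greedy discussion above
hits = greedy_hitting_set([[1, 4], [1, 2], [1, 2], [2], [3, 4]])
```

This achieves the well-known logarithmic approximation factor for set cover / hitting set; it ignores the OP's grouping and ordering constraints entirely.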