Plotting moving average with numpy and csv - arrays

I need help plotting a moving average on top of the data I am already able to plot (see below)
I am trying to make m (my moving average) equal to the length of y (my data) and then within my 'for' loop, I seem to have the right math for my moving average.
What works: plotting x and y
What doesn't work: plotting m on top of x & y and gives me this error
RuntimeWarning: invalid value encountered in double_scalars
My theory: I am setting m to np.arrays = y.shape and then creating my for loop to make m equal to the math set within the loop thus replacing all the 0's to the moving average
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import csv
import math
def graph():
date, value = np.loadtxt("CL1.csv", delimiter=',', unpack=True,
converters = {0: mdates.strpdate2num('%d/%m/%Y')})
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1, axisbg = 'white')
plt.plot_date(x=date, y=value, fmt = '-')
y = value
m = np.zeros(y.shape)
for i in range(10, y.shape[0]):
m[i-10] = y[i-10:1].mean()
plt.plot_date(x=date, y=value, fmt = '-', color='g')
plt.plot_date(x=date, y=m, fmt = '-', color='b')
plt.title('NG1 Chart')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
graph ()

I think that lmjohns3 answer is correct, but you have a couple of problems with your moving average function. First of all, there is the indexing problem the lmjohns3 pointed out. Take the following data for example:
In [1]: import numpy as np
In [2]: a = np.arange(10)
In [3]: a
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Your function gives the following moving average values:
In [4]: for i in range(3, a.shape[0]):
...: print a[i-3:i].mean(),
1.0 2.0 3.0 4.0 5.0 6.0 7.0
The size of this array (7) is too small by one number. The last value in the moving average should be (7+8+9)/3=8. To fix that you could change your function as follows:
In [5]: for i in range(3, a.shape[0] + 1):
...: print a[i-3:i].sum()/3,
1 2 3 4 5 6 7 8
The second problem is that in order to plot two sets of data, the total number of data points needs to be the same. Your function returns a new set of data that is smaller than the original data set. (You maybe didn't notice because you preassigned a zeros array of the same size. Your for loop will always produce an array with a bunch of zeros at the end.)
The convolution function gives you the correct data, but it has two extra values (some at each end) because of the same argument, which ensures that the new data array has the same size as the original.
In [6]: np.convolve(a, [1./3]*3, 'same')
Out[6]:
array([ 0.33333333, 1. , 2. , 3. , 4. ,
5. , 6. , 7. , 8. , 5.66666667])
As an alternate method, you could vectorize your code by using Numpy's cumsum function.
In [7]: (cs[3-1:] - np.append(0,cs[:-3]))/3.
Out[7]: array([ 1., 2., 3., 4., 5., 6., 7., 8.])
(This last one is a modification of the answer in a previous post.)
The trick might be that you should drop the first values of your date array. For example use the following plotting call, where n is the number of points in your average:
plt.plot_date(x=date[n-1:], y=m, fmt = '-', color='b')

The problem here lives in your computation of the moving average -- you just have a couple of off-by-one problems in the indexing !
y = value
m = np.zeros(y.shape)
for i in range(10, y.shape[0]):
m[i-10] = y[i-10:1].mean()
Here you've got everything right except for the :1]. This tells the interpreter to take a slice starting at whatever i-10 happens to be, and ending just before 1. But if i-10 is larger than 1, this results in the empty list ! To fix it, just replace 1 with i.
Additionally, your range needs to be extended by one at the end. Replace y.shape[0] with y.shape[0]+1.
Alternative
I just thought I'd mention that you can compute the moving average more automatically by using np.convolve (docs) :
m = np.convolve(y, [1. / 10] * 10, 'same')
In this case, m will have the same length as y, but the moving average values might look strange at the beginning and end. This is because 'same' effectively causes y to be padded with zeros at both ends so that there are enough y values to use when computing the convolution.
If you'd prefer to get only moving average values that are computed using values from y (and not from additional zero-padding), you can replace 'same' with 'valid'. In this case, as Ryan points out, m will be shorter than y (more precisely, len(m) == len(y) - len(filter) + 1), which you can address in your plot by removing the first or last elements of your date array.

Okay, either I'm going nuts or it actually worked - I compared my chart vs. another chart and it seemed to have worked.
Does this make sense?
m = np.zeros(y.shape)
for i in range(10, y.shape[0]):
m[i-10] = y[i-10:i].mean()
plt.plot_date(x=date, y=m, fmt = '-', color='r')

Related

Having trouble randomly generating numbers in multidimensional arrays

I'm trying to generate coordinates in a mulidimensional array.
the range for each digit in the coords is -1 to 1. <=> seems like the way to go comparing two random numbers. I'm having trouble because randomizing it takes forever, coords duplicate and sometimes don't fill all the way through. I've tried uniq! which only causes the initialization to run forever while it tries to come up with the different iterations.
the coords look something like this. (-1, 0, 1, 0, 0)
5 digits give position. I could write them out but I'd like to generate the coords each time the program is initiated. The coords would then be assigned to a hash tied to a key. 1 - 242.
I could really use some advice.
edited to add code. It does start to iterate but it doesn't fill out properly. Short of just writing out an array with all possible combos and randomizing before merging it with the key. I can't figure out how.
room_range = (1..241)
room_num = [*room_range]
p room_num
$rand_loc_cords = []
def Randy(x)
srand(x)
y = (rand(100) + 1) * 1500
z = (rand(200) + 1) * 1000
return z <=> y
end
def rand_loc
until $rand_loc_cords.length == 243 do
x = Time.new.to_i
$rand_loc_cords.push([Randy(x), Randy(x), Randy(x), Randy(x), Randy(x)])
$rand_loc_cords.uniq!
p $rand_loc_cords
end
#p $rand_loc_cords
end
rand_loc
You are trying to get all possible permutations of -1, 0 and 1 with a length of 5 by sheer luck, which can take forever. There are 243 of them (3**5) indeed:
coords = [-1,0,1].repeated_permutation(5).to_a
Shuffle the array if the order should be randomized.

How to find the max values of reshape-able array1 in a rolling window of variable size given array1 and array2 are equal length before padding?

IGNORE EVERYTHING AND SKIP TO EDIT AT THE BOTTOM FOR CONDENSED EXPLANATION
FULL QUESTION:
How can I find the index of the max values of a reshape-able array of observed values in a rolling window of variable size given that the array of observed values corresponds to an array of observation times, and given that both arrays must be padded at identical indices?
SHORT PREFACE:
(I am having trouble fixing a code that is long and has a lot of moving parts. As such, I have only provided information I feel necessary to address my question and have left out other details that would make this post even longer than it is, though I can post a workable version of the code if requested.)
SETUP:
I have a text file containing times of observations and observed values at those times. I read the contents of the text file into the appropriate lists with the goal of performing a 'maximum self-similarity test', which entails finding the maximum value in a rolling window over an entire list of values; however, the first rolling window is 2 elements wide (check indices 0 and 1, then 2 and 3, ..., len(data)-1 and len(data)), then 4 elements wide, then 8 elements wide, etc. Assuming an array of 100 elements (I actually have around 11,000 data points), the last rolling window that is 8 elements wide will be disregarded because it is incomplete. To do this (with some help from SO), I first defined a function to reshape an array such that it can be called by a parent function.
def shapeshifter(ncol, my_array=data):
my_array = np.array(my_array)
desired_size_factor = np.prod([n for n in ncol if n != -1])
if -1 in ncol: # implicit array size
desired_size = my_array.size // desired_size_factor * desired_size_factor
else:
desired_size = desired_size_factor
return my_array.flat[:desired_size].reshape(ncol)
The parent function that calls this will loop over each row to find the maximums.
def looper(ncol, my_array=data):
my_array = shapeshifter((-1, ncol))
rows = [my_array[index] for index in range(len(my_array))]
res = []
for index in range(len(rows)):
res.append( max(rows[index]) )
return res
And looper is called by a grandparent function that will change the size of the window for which the maximum values are obtained.
def metalooper(window_size, my_array=data):
outer = [looper(win) for win in window_size]
return outer
The next line calls the grandparent function, which in turns calls the sub-functions. In the line below, window_size is a predefined list of window sizes (ex: [2,4,8,16,...]).
ans = metalooper(window_size)
PURPOSE (can remove if unnecessary):
The function metalooper should return a list of sublists, for which each sublist contains the maximum elements of the rolling window. I then "normalize" (for lack of a better word) each value in the sublists by taking the logarithmic value of each maximum, only then to sum the elements of each sublist (such that the number of sums equals the number of sublists). Each sum is then divided by its respective weight, which gives the y-values that will be plotted against the window sizes. This plot should be piecewise (linear or power-law).
PROBLEM:
My array of data points only contains the observed values and not the times (all of which I have converted into hours) that correspond to the observations. The times are not consecutive, so there may be an observation at 4 hrs, another at 7 hrs, another at 7.3 hrs, etc. My first mistake is not padding zeroes for non-consecutive times (ex: observation_1 at 4 hrs, observation_2 at 6 hours ==> observed_value = 0 at 5 hrs) as I should have moved the rolling window over the hours of observation (ex: window size of 2 means between [0,2) hours, [2,4) hours, etc) instead of the observed values at those times. However, my problem is compounded by the fact that there are also duplicate hours that fit within a window (ex: if multiple observations are made at 1 and 1.1 hours within a window of [0,2); regardless, I should find the maximum observed value in each rolling window, which entails knowing which observed values correspond to which times of observation without disregarding padded zeroes. However, how can I efficiently pad zeroes at identical indices in both lists? I am aware that I can floor the hours of observation to check which window an observed value should fall into. However, I am unsure how to proceed after this point as well - if I can pad both lists and if could find the index of the maximum observed value for each window, I can then us that index to get the desired observed value and the corresponding time of observation; I do not know how to do this or where to begin with this as my approach with for-looping lists is extremely slow. I would appreciate any help or advice on how to fix this. (Apologies for the length of this post, not sure how to condense beyond this). I would prefer to adapt my existing approach, but am open to alternatives if my method is too ridiculous.
EDIT:
To see how these functions work, let's use an example list data.
>> data = np.array(np.linspace(1,20,20))
# data corresponds to the observed values and not the observation times,
# below is a proof of concept using the values in data
>> print(shapeshifter((-1,2))) # 2 columns, -1 is always there
[[ 1. 2.]
[ 3. 4.]
[ 5. 6.]
[ 7. 8.]
[ 9. 10.]
[ 11. 12.]
[ 13. 14.]
[ 15. 16.]
[ 17. 18.]
[ 19. 20.]]
>> print(looper(2)) # get maximum in window_size (AKA length of each row) of 2 for each row of reshaped array via shapeshifter
[2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
def window_params(dataset=data): # concerned with window_size
numdata = len(dataset) ## N = 11764
lim = np.floor(np.log2(numdata)) ## last term of j = 13
time_sc_index = np.linspace(1, lim, num=lim) ## j = [1,2,3,...,floor(log_2(N))=13]
window_size = [2**j for j in time_sc_index] ## scale = [2,4,8,...,8192]
block_size = np.floor([numdata/sc for sc in window_size]) ## b_j (sc ~ scale)
return numdata, time_sc_index, window_size, block_size
numdata, time_sc_index, window_size, block_size = window_params()
>> print(window_size)
[2.0, 4.0, 8.0, 16.0]
>> print(metalooper(window_size)) # call looper for each window_size
[[2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0], [4.0, 8.0, 12.0, 16.0, 20.0], [8.0, 16.0], [16.0]]
My issue is that these observations each correspond to different times. The list of times can be something like
times = [0, 4, 6, 6, 9, ...] # times are floored, duplicate times correspond to multiple observations at floored times
I need have a list of consecutive times [0, 1, 2, 3, ...], each of which correspond to an observed value from the list data (as each data point is observed at a specific time). My goal is to find the maximum observed value in each window of times. Using the times above, the observed value at time=0 is data[0] and the observed value at time=1 is 0 since there is no observation at that time. Similarly, I would use the maximum observed value at duplicate times; in other words, I have 2 observations at time=6, so I would want the maximum observed value at that time. While my windows roll over only observed values, I actually need the windows to roll over all hours (including time=1 per this example) to find the maximum observed values at those times. In such a case, rolling a window over a time range that contains duplicate times should only count one of the duplicate times - specifically the one that corresponds to the maximum observed value at that time. My thinking is to pad zeroes into both lists (times and data) such that the index of times corresponds to the index of data. I'm trying to find an efficient way to proceed, though I'm having trouble figuring out a way to proceed at all.

efficient way of performing integral on an image

I have a 2D array (typical size about 400x100) as shown (it looks like a trapezium because elements in the lower right are nan). For each element in the array, I want to perform a numerical integral along the column for several elements (of the order of ~10 elements). In physics language think of the colour as the magnitude of the force, and I want to find the work done by calculating th integral of Fdz. I can use a double for-loop and use trap to do the job, but are there other more efficient ways (probably mkaing use of arrays and vectorization) to do it in Matlab or python? My ultimate goal is to find the point where the evaluated integral is the largest. So from the image in which yellow represents large value, we expect the integral to be the largest somewhere on the right side above the dotted line.
Also, it is relatively easy if the number of points I want to integrate is an integer, but what if I want to integrate, say, 7.5 points? I am thinking of using fit to interpolate the points, but I'm not sure if that's over-complicating the task.
You can use cumsum to speedup trap. Calculating the cummulative sum (1-dimensional integral images proposed by #Benjamin)
>>> import numpy as np
>>> csdata = np.cumsum(data, axis=1)
Integrate with a discrete length
>>> npoints = 6
>>> result = np.zeros_like(data)
>>> result[:-npoints, :] = csdata[npoints:, :] - csdata[:-npoints, :]
The result is a vectorization of cumdata[i+npoints, j] - cumdata[i, j] for every i, j in the image. It will fill with zeros last npoints rows. You can reflect the boundary with np.pad if you want to prevent this.
For non-discrete intervals, you can work with interpolations:
>>> from scipy.interpolate import interp2d
>>> C = 0.5 # to interpolate every npoints+C pixels
>>> y, x = np.mgrid[:data.shape[0], :data.shape[1]]
>>> ynew, xnew = np.mgrid[C:data.shape[0]+C, :data.shape[1]]
>>> f = interp2d(x, y, csdata)
>>> csnew = f(xnew, ynew)
The above shifts a regular grid C pixels in y direction, and interpolates the cummulative data csdata at those points (in practice, it vectorices interpolation for every pixel).
Then the integral of npoints+C length can be obtained as
>>> npoints = 6
>>> result = np.zeros_like(data)
>>> result[:-npoints, :] = csnew[npoints:, :] - csdata[:-npoints, :]
Note that the upper bound is now csnew (a shift of 6 actually gets the 6.5 element), making it integrate every 6.5 points in practice.
You can then find the maximum point as
>>> idx = np.argmax(result.ravel()) # ravel to get the 1D maximum point
>>> maxy, maxx = np.unravel_index(idx, data.shape) # get 2D coordinates of idx

Indicies of zero ranges in a zero-one matrix

I am using Matlab for one of my projects. I am actually stuck at a point since some time now. Tried searching on google, but, not much success.
I have an array of 0s and 1s. Something like:
A = [0,0,0,1,1,1,1,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,0,0];
I want to extract an array of indicies: [x_1, x_2, x_3, x_4, x_5, ..]
Such that x_1 is the index of start of first range of zeros. x_2 is the index of end of first range of zeros.
x_3 is the index of start of second range of zeros. x_4 is the index of end of second range of zeros.
For the above example:
x_1 = 1, x_2 = 3
x_3 = 9, x_4 = 10
and so on.
Of course, I can do it by writing a simple loop. I am wondering if there is a more elegant (vectorized) way to solve this problem. I was thinking about something like prefix some, but, no luck as of now.
Thanks,
Anil.
The diff function is great for this sort of stuff and pretty quick.
temp = diff(A);
Starts = find([A(1) == 0, temp==-1]);
Ends = find([temp == 1,A(end)==0])
Edit: Fixed the error in the Ends calculation caught by gnovice.
Zeros not preceded by other zeros: A==0 & [true A(1:(end-1))~=0]
Zeros not followed by other zeros: A==0 & [A(2:end)~=0 true]
Use each of these plus find to get starts and ends of runs of zeros. Then, if you really want them in a single vector as you described, interleave them.
If you want to get your results in a single vector like you described above (i.e. x = [x_1 x_2 x_3 x_4 x_5 ...]), then you can perform a second-order difference using the function DIFF and find the points greater than 0:
x = find(diff([1 A 1],2) > 0);
EDIT:
The above will work for the case when there are at least 2 zeroes in every string of zeroes. If you will have single zeroes appearing in A, the above can be modified to handle them like so:
diffA = diff([1 A 1],2);
[~,x] = find([diffA > 0; diffA == 2]);
In this case, a single zero value will create repeated indices in x (i.e. if A starts with a single zero, then x(1) and x(2) will both be 1).

Algorithm to find "most common elements" in different arrays

I have for example 5 arrays with some inserted elements (numbers):
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
I need to find most common elements in those arrays and every element should go all the way till the end (see example below). In this example that would be the bold combination (or the same one but with "30" on the end, it's the "same") because it contains the smallest number of different elements (only two, 4 and 2/30).
This combination (see below) isn't good because if I have for ex. "4" it must "go" till it ends (next array mustn't contain "4" at all). So combination must go all the way till the end.
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
EDIT2: OR
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
OR anything else is NOT good.
Is there some algorithm to speed this thing up (if I have thousands of arrays with hundreds of elements in each one)?
To make it clear - solution must contain lowest number of different elements and the groups (of the same numbers) must be grouped from first - larger ones to the last - smallest ones. So in upper example 4,4,4,2 is better then 4,2,2,2 because in first example group of 4's is larger than group of 2's.
EDIT: To be more specific. Solution must contain the smallest number of different elements and those elements must be grouped from first to last. So if I have three arrrays like
1,2,3
1,4,5
4,5,6
Solution is 1,1,4 or 1,1,5 or 1,1,6 NOT 2,5,5 because 1's have larger group (two of them) than 2's (only one).
Thanks.
EDIT3: I can't be more specific :(
EDIT4: #spintheblack 1,1,1,2,4 is the correct solution because number used first time (let's say at position 1) can't be used later (except it's in the SAME group of 1's). I would say that grouping has the "priority"? Also, I didn't mention it (sorry about that) but the numbers in arrays are NOT sorted in any way, I typed it that way in this post because it was easier for me to follow.
Here is the approach you want to take, if arrays is an array that contains each individual array.
Starting at i = 0
current = arrays[i]
Loop i from i+1 to len(arrays)-1
new = current & arrays[i] (set intersection, finds common elements)
If there are any elements in new, do step 6, otherwise skip to 7
current = new, return to step 3 (continue loop)
print or yield an element from current, current = arrays[i], return to step 3 (continue loop)
Here is a Python implementation:
def mce(arrays):
count = 1
current = set(arrays[0])
for i in range(1, len(arrays)):
new = current & set(arrays[i])
if new:
count += 1
current = new
else:
print " ".join([str(current.pop())] * count),
count = 1
current = set(arrays[i])
print " ".join([str(current.pop())] * count)
>>> mce([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]])
4 4 4 2
If all are number lists, and are all sorted, then,
Convert to array of bitmaps.
Keep 'AND'ing the bitmaps till you hit zero. The position of the 1 in the previous value indicates the first element.
Restart step 2 from the next element
This has now turned into a graphing problem with a twist.
The problem is a directed acyclic graph of connections between stops, and the goal is to minimize the number of lines switches when riding on a train/tram.
ie. this list of sets:
1,4,8,10 <-- stop A
1,2,3,4,11,15 <-- stop B
2,4,20,21 <-- stop C
2,30 <-- stop D, destination
He needs to pick lines that are available at his exit stop, and his arrival stop, so for instance, he can't pick 10 from stop A, because 10 does not go to stop B.
So, this is the set of available lines and the stops they stop on:
A B C D
line 1 -----X-----X-----------------
line 2 -----------X-----X-----X-----
line 3 -----------X-----------------
line 4 -----X-----X-----X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
If we consider that a line under consideration must go between at least 2 consecutive stops, let me highlight the possible choices of lines with equal signs:
A B C D
line 1 -----X=====X-----------------
line 2 -----------X=====X=====X-----
line 3 -----------X-----------------
line 4 -----X=====X=====X-----------
line 8 -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
He then needs to pick a way that transports him from A to D, with the minimal number of line switches.
Since he explained that he wants the longest rides first, the following sequence seems the best solution:
take line 4 from stop A to stop C, then switch to line 2 from C to D
Code example:
stops = [
[1, 4, 8, 10],
[1,2,3,4,11,15],
[2,4,20,21],
[2,30],
]
def calculate_possible_exit_lines(stops):
"""
only return lines that are available at both exit
and arrival stops, discard the rest.
"""
result = []
for index in range(0, len(stops) - 1):
lines = []
for value in stops[index]:
if value in stops[index + 1]:
lines.append(value)
result.append(lines)
return result
def all_combinations(lines):
"""
produce all combinations which travel from one end
of the journey to the other, across available lines.
"""
if not lines:
yield []
else:
for line in lines[0]:
for rest_combination in all_combinations(lines[1:]):
yield [line] + rest_combination
def reduce(combination):
"""
reduce a combination by returning the number of
times each value appear consecutively, ie.
[1,1,4,4,3] would return [2,2,1] since
the 1's appear twice, the 4's appear twice, and
the 3 only appear once.
"""
result = []
while combination:
count = 1
value = combination[0]
combination = combination[1:]
while combination and combination[0] == value:
combination = combination[1:]
count += 1
result.append(count)
return tuple(result)
def calculate_best_choice(lines):
"""
find the best choice by reducing each available
combination down to the number of stops you can
sit on a single line before having to switch,
and then picking the one that has the most stops
first, and then so on.
"""
available = []
for combination in all_combinations(lines):
count_stops = reduce(combination)
available.append((count_stops, combination))
available = [k for k in reversed(sorted(available))]
return available[0][1]
possible_lines = calculate_possible_exit_lines(stops)
print("possible lines: %s" % (str(possible_lines), ))
best_choice = calculate_best_choice(possible_lines)
print("best choice: %s" % (str(best_choice), ))
This code prints:
possible lines: [[1, 4], [2, 4], [2]]
best choice: [4, 4, 2]
Since, as I said, I list lines between stops, and the above solution can either count as lines you have to exit from each stop or lines you have to arrive on into the next stop.
So the route is:
Hop onto line 4 at stop A and ride on that to stop B, then to stop C
Hop onto line 2 at stop C and ride on that to stop D
There are probably edge-cases here that the above code doesn't work for.
However, I'm not bothering more with this question. The OP has demonstrated a complete incapability in communicating his question in a clear and concise manner, and I fear that any corrections to the above text and/or code to accommodate the latest comments will only provoke more comments, which leads to yet another version of the question, and so on ad infinitum. The OP has gone to extraordinary lengths to avoid answering direct questions or to explain the problem.
I am assuming that "distinct elements" do not have to actually be distinct, they can repeat in the final solution. That is if presented with [1], [2], [1] that the obvious answer [1, 2, 1] is allowed. But we'd count this as having 3 distinct elements.
If so, then here is a Python solution:
def find_best_run (first_array, *argv):
# initialize data structures.
this_array_best_run = {}
for x in first_array:
this_array_best_run[x] = (1, (1,), (x,))
for this_array in argv:
# find the best runs ending at each value in this_array
last_array_best_run = this_array_best_run
this_array_best_run = {}
for x in this_array:
for (y, pattern) in last_array_best_run.iteritems():
(distinct_count, lengths, elements) = pattern
if x == y:
lengths = tuple(lengths[:-1] + (lengths[-1] + 1,))
else :
distinct_count += 1
lengths = tuple(lengths + (1,))
elements = tuple(elements + (x,))
if x not in this_array_best_run:
this_array_best_run[x] = (distinct_count, lengths, elements)
else:
(prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
if distinct_count < prev_count or prev_lengths < lengths:
this_array_best_run[x] = (distinct_count, lengths, elements)
# find the best overall run
best_count = len(argv) + 10 # Needs to be bigger than any possible answer.
for (distinct_count, lengths, elements) in this_array_best_run.itervalues():
if distinct_count < best_count:
best_count = distinct_count
best_lengths = lengths
best_elements = elements
elif distinct_count == best_count and best_lengths < lengths:
best_count = distinct_count
best_lengths = lengths
best_elements = elements
# convert it into a more normal representation.
answer = []
for (length, element) in zip(best_lengths, elements):
answer.extend([element] * length)
return answer
# example
print find_best_run(
[1,4,8,10],
[1,2,3,4,11,15],
[2,4,20,21],
[2,30]) # prints [4, 4, 4, 30]
Here is an explanation. The ...this_run dictionaries have keys which are elements in the current array, and they have values which are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this will prefer the element with the largest value in the first spot) and are tracking elements for the end. At each step I construct all possible runs which are a combination of a run up to the previous array with this element next in sequence, and find which ones are best to the current. When I get to the end I pick the best possible overall run, then turn it into a conventional representation and return it.
If you have N arrays of length M, this should take O(N*M*M) time to run.
I'm going to take a crack here based on the comments, please feel free to comment further to clarify.
We have N arrays and we are trying to find the 'most common' value over all arrays when one value is picked from each array. There are several constraints 1) We want the smallest number of distinct values 2) The most common is the maximal grouping of similar letters (changing from above for clarity). Thus, 4 t's and 1 p beats 3 x's 2 y's
I don't think either problem can be solved greedily - here's a counterexample [[1,4],[1,2],[1,2],[2],[3,4]] - a greedy algorithm would pick [1,1,1,2,4] (3 distinct numbers) [4,2,2,2,4] (two distinct numbers)
This looks like a bipartite matching problem, but I'm still coming up with the formulation..
EDIT : ignore; This is a different problem, but if anyone can figure it out, I'd be really interested
EDIT 2 : For anyone that's interested, the problem that I misinterpreted can be formulated as an instance of the Hitting Set problem, see http://en.wikipedia.org/wiki/Vertex_cover#Hitting_set_and_set_cover. Basically the left hand side of the bipartite graph would be the arrays and the right hand side would be the numbers, edges would be drawn between arrays that contain each number. Unfortunately, this is NP complete, but the greedy solutions described above are essentially the best approximation.

Resources