Given an array of values, how can I update a range with a sequence within that array, efficiently?
Updates are performed multiple times. After all updates are performed, we can query any index of the array for its final value.
An update with value v at index i increases every element at index j by max { v - | i - j | , 0 }.
For example.
array = {1,1,1,1,1,1}
Now, if I do an update at index 4 with a value of 3, the resulting array will look like this:
array = {1,1,2,3,4,3}
I want to perform both operations efficiently.
You can't update a range of elements "efficiently". Questions like these are always about figuring out how to avoid updating a range of elements altogether.
To figure out this one, consider two operations:
INTEGRATE(A) takes an array and replaces every element A[i] with sum(A[0]...A[i]).
DIFF(A) takes an array and replaces every element with its difference from the previous element (the first element is left unaltered).
These operations have some important properties:
They are inverses: INTEGRATE(DIFF(A)) = DIFF(INTEGRATE(A)) = A for all arrays A; and
They are linear: If A = B+C, then INTEGRATE(A) = INTEGRATE(B) + INTEGRATE(C), and similarly for DIFF.
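As a minimal sketch of the two operations, assuming plain Python lists (the function names `integrate` and `diff` are illustrative, not taken from any library), the two properties can be checked directly:

```python
from itertools import accumulate

def integrate(a):
    """Replace every element a[i] with sum(a[0..i])."""
    return list(accumulate(a))

def diff(a):
    """Replace every element with its difference from the previous one.

    The first element is left unaltered.
    """
    return a[:1] + [a[k] - a[k - 1] for k in range(1, len(a))]

A = [3, 1, 4, 1, 5]
B = [2, 7, 1, 8, 2]
total = [x + y for x, y in zip(A, B)]

# Inverses: applying one after the other gives back the original array.
print(integrate(diff(A)) == A and diff(integrate(A)) == A)  # True

# Linear: DIFF of a sum equals the sum of the DIFFs.
print(diff(total) == [x + y for x, y in zip(diff(A), diff(B))])  # True
```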
Your final array is the sum of the original array, plus a whole bunch of those "triangle" arrays. Let's say it's A + T1 + T2 + T3... etc.
Each one of those triangles has a whole bunch of non-zero elements, but watch what happens when you apply DIFF twice:
[0,0,1,2,3,2,1,0,0] -> [0,0,1,1,1,-1,-1,-1,0] -> [0,0,1,0,0,-2,0,0,1]
The result has only 3 non-zero elements. That gives us a way to calculate your final array quickly.
Let D(X) = DIFF(DIFF(X)) and let I(X) = INTEGRATE(INTEGRATE(X)). Then instead of calculating A + T1 + T2 + T3..., you calculate I( D(A) + D(T1) + D(T2) + D(T3)... )
Since all those D(Tx) have at most 3 non-zero elements, it's quick and easy to add them into the result.
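The whole idea could be sketched roughly as follows. This is one possible arrangement, not the only one: the function name `apply_triangle_updates` is hypothetical, and padding the work array on the left by the largest update value is an assumption made here so that triangles starting before index 0 need no special casing.

```python
from itertools import accumulate

def apply_triangle_updates(values, updates):
    """Return values plus all triangle updates.

    values  : list of numbers
    updates : list of (index, v) pairs, with 0 <= index < len(values), v >= 1
    """
    n = len(values)
    # Left padding (an assumption of this sketch) keeps every position >= 0.
    pad = max((v for _, v in updates), default=0)
    d2 = [0] * (n + pad)
    for i, v in updates:
        # D(T) has only three non-zero entries: +1, -2, +1.
        for pos, delta in ((i - v + 1, 1), (i + 1, -2), (i + v + 1, 1)):
            pos += pad
            if pos < len(d2):  # entries past the end never affect the array
                d2[pos] += delta
    # I(X) = INTEGRATE(INTEGRATE(X)) rebuilds the sum of all triangles.
    triangles = list(accumulate(accumulate(d2)))
    return [a + t for a, t in zip(values, triangles[pad:pad + n])]

print(apply_triangle_updates([1, 1, 1, 1, 1, 1], [(4, 3)]))  # [1, 1, 2, 3, 4, 3]
```

Each update costs O(1), and the two integration passes at the end cost O(n + max v) in total.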
I'm deliberately explaining how to solve it, without giving you full code. This also handles the harder case of interleaved updates and lookups, and is therefore more complex than what Matt Timmermans came up with.
You obviously can't use an array as your representation. It makes lookups fast, but an update with value k will be an O(k) operation.
Our second try is to just keep a list of the updates. Now updates are O(1), but after m updates a lookup is O(m).
What we need is to have a way to store updates such that both adding an update and doing a lookup are fast.
The first step is to change an update from "update at a value" to "update a range by a linear rule". That is currently you say:
update at 4 by 3
Instead we'd say:
from 2 to 3:
    update by x - 1
from 4 to 5:
    update by 7 - x
This isn't yet a win. But it becomes one when you rewrite the ranges in terms of a standard set of intervals. First the original array
from 0 to 5 1 + 0x
Now the array after update:
from 0 to 5, 1 + 0x +
from 2 to 3, -1 + x
from 4 to 5, 7 - x
This can be represented compactly in 2 arrays:
m = [0, 0, 1, 0, -1, 0]
b = [1, 0, -1, 0, 7, 0]
And as complicated as it feels, now both updates and lookups wind up with O(log(n)) work.
For example for a lookup:
def covering_indices(n):
    # Yield n, then n with its lowest set bit cleared, and so on down to 0.
    # For the arrays above, a lookup at 5 visits slots 5, 4 and 0.
    yield n
    while n:
        n &= n - 1
        yield n
...
answer = 0
for idx in covering_indices(k):
    answer += m[idx] * k + b[idx]
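For a complete, runnable picture of "store linear rules on a standard set of intervals", here is a minimal sketch that uses an ordinary segment tree in place of the compact two-array layout shown above. That substitution is an assumption of this example, and the class and method names are hypothetical. Both operations still do O(log(n)) work.

```python
class LinearRangeTree:
    """Each tree node holds a rule m*x + b that applies to its whole interval."""

    def __init__(self, n):
        self.n = n
        self.m = [0] * (4 * n)
        self.b = [0] * (4 * n)

    def _add(self, node, lo, hi, left, right, m, b):
        if right < lo or hi < left:
            return                      # no overlap with this node
        if left <= lo and hi <= right:
            self.m[node] += m           # node interval fully covered
            self.b[node] += b
            return
        mid = (lo + hi) // 2
        self._add(2 * node, lo, mid, left, right, m, b)
        self._add(2 * node + 1, mid + 1, hi, left, right, m, b)

    def add_linear(self, left, right, m, b):
        """Add m*x + b to every index x in [left, right], clipped to the array."""
        left, right = max(left, 0), min(right, self.n - 1)
        if left <= right:
            self._add(1, 0, self.n - 1, left, right, m, b)

    def update(self, i, v):
        """Triangle update: index j gains max(v - |i - j|, 0)."""
        self.add_linear(i - v + 1, i, 1, v - i)       # rising side: x + (v - i)
        self.add_linear(i + 1, i + v - 1, -1, i + v)  # falling side: (i + v) - x

    def lookup(self, k):
        """Total increment at index k: sum the rules on the root-to-leaf path."""
        node, lo, hi, total = 1, 0, self.n - 1, 0
        while True:
            total += self.m[node] * k + self.b[node]
            if lo == hi:
                return total
            mid = (lo + hi) // 2
            if k <= mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid + 1

base = [1, 1, 1, 1, 1, 1]
tree = LinearRangeTree(len(base))
tree.update(4, 3)
print([base[k] + tree.lookup(k) for k in range(len(base))])  # [1, 1, 2, 3, 4, 3]
```

The original array is kept separately here and added at lookup time, so updates and lookups can be interleaved freely.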
My current code has two major bottlenecks, one I can improve for sure, but this one has me stuck. It eats up roughly 50% of my run time, and only gets worse.
What should it do?
It should take an array (a walk) from Walks and break it into two new arrays, A and B. The rules look a bit odd, but I'm sure they're straightforward enough.
Each walk is a list of N non-negative integers, where N is even, and a pair is simply a list of 2 lists of integers, each also of length N.
L is N/2.
#example pair: [[1,2,5,6,-4,-1],[8,12,-3,7,4,9]]
#example walks:[[1,0,2,5,3,1]] just 1 walk in this example. Could be k many.
#L = 3
newpairs=[]
for walk in walks:
    Anew = [0 for j in range(2*L)]
    Bnew = [0 for j in range(2*L)]
    for r in range(L):
        Anew[r] = int((pair[0][r]+walk[r])/2)
        Anew[r+L] = int((pair[0][r]-walk[r])/2)
        Bnew[r] = int((pair[1][r]+walk[r+L])/2)
        Bnew[r+L] = int((pair[1][r]-walk[r+L])/2)
    newpair = [Anew,Bnew]
    newpairs.append(newpair)
#output:[[[1, 1, 3, 0, 1, 1], [6, 7, -1, 1, 4, -2]]]
I realize this may be a shot in the dark, but I'm happy to answer any questions to further clarify aspects of the code. My project cannot go much further without optimizing this piece. It's blowing up run times by over 50% and will only get worse as I push bigger sets through.
Your algorithm seems simple enough and doesn't have any glaring performance mistakes. You probably won't be reducing the run time by an order of magnitude or anything like it. There are some smaller optimizations you can do, though.
1) Use list multiplication notation for initializing your Anew and Bnew lists. Replace this:
Anew = [0 for j in range(2*L)]
Bnew = [0 for j in range(2*L)]
with this:
Anew = [0]*2*L
Bnew = [0]*2*L
Benchmarking:
>>> timeit.timeit('[0 for x in range(300)]')
7.822149500000023
>>> timeit.timeit('[0]*300')
0.8999562000000196
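2) Use floor division. Note that // rounds toward negative infinity while int(x/2) truncates toward zero, so the two give different results when the sum is negative and odd (int(-3/2) is -1, but -3//2 is -2); make sure that difference is acceptable for your data. Replace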
Anew[r] = int((pair[0][r]+walk[r])/2)
and similar lines, with this:
Anew[r] = (pair[0][r]+walk[r])//2
Benchmark:
>>> timeit.timeit('[int((x+y)/2) for x in range(-5,5) for y in range(-5,5)]')
23.69675469999993
>>> timeit.timeit('[(x+y)//2 for x in range(-5,5) for y in range(-5,5)]')
11.680407500000001
Beyond that, you might want to look into using numpy as it's almost always faster than the standard library for working with lists/arrays.
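A vectorized version might look roughly like the sketch below. The function name `split_walks` and the choice to return a single array of shape (k, 2, N) are assumptions made for illustration; `astype(int)` is used because it truncates toward zero, which matches the `int(.../2)` behaviour of the original loop.

```python
import numpy as np

def split_walks(pair, walks):
    """Build [Anew, Bnew] for every walk at once.

    pair  : two lists of N integers (N = 2*L)
    walks : k lists of N integers
    """
    pair = np.asarray(pair)
    walks = np.asarray(walks)
    L = walks.shape[1] // 2
    a, b = pair[0, :L], pair[1, :L]        # only the first L entries are used
    w_lo, w_hi = walks[:, :L], walks[:, L:]
    # Float division followed by astype(int) truncates toward zero, like int().
    A = np.concatenate([(a + w_lo) / 2, (a - w_lo) / 2], axis=1).astype(int)
    B = np.concatenate([(b + w_hi) / 2, (b - w_hi) / 2], axis=1).astype(int)
    return np.stack([A, B], axis=1)

pair = [[1, 2, 5, 6, -4, -1], [8, 12, -3, 7, 4, 9]]
walks = [[1, 0, 2, 5, 3, 1]]
print(split_walks(pair, walks).tolist())
# [[[1, 1, 3, 0, 1, 1], [6, 7, -1, 1, 4, -2]]]
```

Whether this is faster in practice depends on how many walks there are; the gain comes from processing all k walks in one call rather than one at a time.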
I need help plotting a moving average on top of the data I am already able to plot (see below)
I am trying to make m (my moving average) equal to the length of y (my data) and then within my 'for' loop, I seem to have the right math for my moving average.
What works: plotting x and y
What doesn't work: plotting m on top of x & y and gives me this error
RuntimeWarning: invalid value encountered in double_scalars
My theory: I am setting m to np.arrays = y.shape and then creating my for loop to make m equal to the math set within the loop thus replacing all the 0's to the moving average
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import csv
import math
def graph():
    date, value = np.loadtxt("CL1.csv", delimiter=',', unpack=True,
                             converters = {0: mdates.strpdate2num('%d/%m/%Y')})
    fig = plt.figure()
    ax1 = fig.add_subplot(1,1,1, axisbg = 'white')
    plt.plot_date(x=date, y=value, fmt = '-')
    y = value
    m = np.zeros(y.shape)
    for i in range(10, y.shape[0]):
        m[i-10] = y[i-10:1].mean()
    plt.plot_date(x=date, y=value, fmt = '-', color='g')
    plt.plot_date(x=date, y=m, fmt = '-', color='b')
    plt.title('NG1 Chart')
    plt.xlabel('Date')
    plt.ylabel('Price')
    plt.show()
graph()
I think that lmjohns3's answer is correct, but you have a couple of problems with your moving average function. First of all, there is the indexing problem that lmjohns3 pointed out. Take the following data for example:
In [1]: import numpy as np
In [2]: a = np.arange(10)
In [3]: a
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Your function gives the following moving average values:
In [4]: for i in range(3, a.shape[0]):
...: print a[i-3:i].mean(),
1.0 2.0 3.0 4.0 5.0 6.0 7.0
The size of this array (7) is too small by one number. The last value in the moving average should be (7+8+9)/3=8. To fix that you could change your function as follows:
In [5]: for i in range(3, a.shape[0] + 1):
...: print a[i-3:i].sum()/3,
1 2 3 4 5 6 7 8
The second problem is that in order to plot two sets of data, the total number of data points needs to be the same. Your function returns a new set of data that is smaller than the original data set. (You maybe didn't notice because you preassigned a zeros array of the same size. Your for loop will always produce an array with a bunch of zeros at the end.)
The convolution function gives you the correct data, but it has two extra values (one at each end) because of the 'same' argument, which ensures that the new data array has the same size as the original.
In [6]: np.convolve(a, [1./3]*3, 'same')
Out[6]:
array([ 0.33333333, 1. , 2. , 3. , 4. ,
5. , 6. , 7. , 8. , 5.66666667])
As an alternate method, you could vectorize your code by using Numpy's cumsum function.
In [7]: cs = np.cumsum(a)
In [8]: (cs[3-1:] - np.append(0,cs[:-3]))/3.
Out[8]: array([ 1., 2., 3., 4., 5., 6., 7., 8.])
(This last one is a modification of the answer in a previous post.)
The trick might be that you should drop the first values of your date array. For example use the following plotting call, where n is the number of points in your average:
plt.plot_date(x=date[n-1:], y=m, fmt = '-', color='b')
The problem here lies in your computation of the moving average: you just have a couple of off-by-one problems in the indexing.
y = value
m = np.zeros(y.shape)
for i in range(10, y.shape[0]):
    m[i-10] = y[i-10:1].mean()
Here you've got everything right except for the :1]. This tells the interpreter to take a slice starting at whatever i-10 happens to be, and ending just before 1. But if i-10 is 1 or larger, this results in an empty slice, and the mean of an empty slice is nan, which is where the RuntimeWarning comes from. To fix it, just replace 1 with i.
Additionally, your range needs to be extended by one at the end. Replace y.shape[0] with y.shape[0]+1.
Alternative
I just thought I'd mention that you can compute the moving average more automatically by using np.convolve:
m = np.convolve(y, [1. / 10] * 10, 'same')
In this case, m will have the same length as y, but the moving average values might look strange at the beginning and end. This is because 'same' effectively causes y to be padded with zeros at both ends so that there are enough y values to use when computing the convolution.
If you'd prefer to get only moving average values that are computed using values from y (and not from additional zero-padding), you can replace 'same' with 'valid'. In this case, as Ryan points out, m will be shorter than y (more precisely, len(m) == len(y) - len(filter) + 1), which you can address in your plot by removing the first or last elements of your date array.
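To make the length bookkeeping concrete, here is a small sketch of a moving average that only uses full windows, written two ways. The function names are hypothetical; both return len(y) - n + 1 values for a window of n points.

```python
import numpy as np

def moving_average(y, n):
    """Mean of each full window of n consecutive values, via np.convolve."""
    y = np.asarray(y, dtype=float)
    return np.convolve(y, np.ones(n) / n, mode="valid")

def moving_average_cumsum(y, n):
    """Same result computed from a running sum instead of a convolution."""
    cs = np.cumsum(np.asarray(y, dtype=float))
    # Window ending at index j sums to cs[j] - cs[j - n] (or cs[j] when j == n - 1).
    return (cs[n - 1:] - np.concatenate(([0.0], cs[:-n]))) / n

a = np.arange(10)
print(moving_average(a, 3))         # [1. 2. 3. 4. 5. 6. 7. 8.]
print(moving_average_cumsum(a, 3))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```

Because the result is n - 1 points shorter than the data, it should be plotted against a matching slice of the dates, for example date[n-1:] so that each average sits at the last date of its window.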
Okay, either I'm going nuts or it actually worked - I compared my chart vs. another chart and it seemed to have worked.
Does this make sense?
m = np.zeros(y.shape)
for i in range(10, y.shape[0]):
    m[i-10] = y[i-10:i].mean()
plt.plot_date(x=date, y=m, fmt = '-', color='r')
I have, for example, 4 arrays with some inserted elements (numbers):
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
I need to find the most common elements in those arrays, and every element should go all the way till the end (see example below). In this example that would be the combination 4,4,4,2 (or the same one but with "30" on the end, it's the "same") because it contains the smallest number of different elements (only two, 4 and 2/30).
This combination (see below) isn't good because if I have for ex. "4" it must "go" till it ends (next array mustn't contain "4" at all). So combination must go all the way till the end.
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
EDIT2: OR
1,4,8,10
1,2,3,4,11,15
2,4,20,21
2,30
OR anything else is NOT good.
Is there some algorithm to speed this thing up (if I have thousands of arrays with hundreds of elements in each one)?
To make it clear - solution must contain lowest number of different elements and the groups (of the same numbers) must be grouped from first - larger ones to the last - smallest ones. So in upper example 4,4,4,2 is better than 4,2,2,2 because in first example group of 4's is larger than group of 2's.
EDIT: To be more specific. Solution must contain the smallest number of different elements and those elements must be grouped from first to last. So if I have three arrays like
1,2,3
1,4,5
4,5,6
Solution is 1,1,4 or 1,1,5 or 1,1,6 NOT 2,5,5 because 1's have larger group (two of them) than 2's (only one).
Thanks.
EDIT3: I can't be more specific :(
EDIT4: #spintheblack 1,1,1,2,4 is the correct solution because number used first time (let's say at position 1) can't be used later (except it's in the SAME group of 1's). I would say that grouping has the "priority"? Also, I didn't mention it (sorry about that) but the numbers in arrays are NOT sorted in any way, I typed it that way in this post because it was easier for me to follow.
Here is the approach you want to take, if arrays is an array that contains each individual array.
1. Starting at i = 0
2. current = arrays[i]
3. Loop i from i+1 to len(arrays)-1
4. new = current & arrays[i] (set intersection, finds common elements)
5. If there are any elements in new, do step 6, otherwise skip to 7
6. current = new, return to step 3 (continue loop)
7. print or yield an element from current, current = arrays[i], return to step 3 (continue loop)
Here is a Python implementation:
def mce(arrays):
    count = 1
    current = set(arrays[0])
    for i in range(1, len(arrays)):
        new = current & set(arrays[i])
        if new:
            count += 1
            current = new
        else:
            # no common element left: emit the finished run and start a new one
            print(" ".join([str(current.pop())] * count), end=" ")
            count = 1
            current = set(arrays[i])
    print(" ".join([str(current.pop())] * count))
>>> mce([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]])
4 4 4 2
If all are number lists, and are all sorted, then,
1. Convert to array of bitmaps.
2. Keep 'AND'ing the bitmaps till you hit zero. The position of the 1 in the previous value indicates the first element.
3. Restart step 2 from the next element
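The bitmap steps above could be sketched like this, using Python's arbitrary-size integers as bitmaps. The function name `greedy_runs_bitmap` is hypothetical, and the sketch assumes every array is non-empty and holds small non-negative integers; when several values survive a run, it arbitrarily reports the lowest one.

```python
def greedy_runs_bitmap(arrays):
    """Extend each run while the AND of the bitmaps stays non-zero."""

    def to_bitmap(values):
        bits = 0
        for v in values:
            bits |= 1 << v              # bit v is set when v is present
        return bits

    def lowest_set_bit(bits):
        return (bits & -bits).bit_length() - 1

    result = []
    current, count = 0, 0
    for values in arrays:
        bits = to_bitmap(values)
        common = current & bits
        if count and common:
            current, count = common, count + 1   # run continues
        else:
            if count:                            # run ended: emit it
                result.extend([lowest_set_bit(current)] * count)
            current, count = bits, 1             # restart from this array
    if count:
        result.extend([lowest_set_bit(current)] * count)
    return result

print(greedy_runs_bitmap([[1, 4, 8, 10], [1, 2, 3, 4, 11, 15], [2, 4, 20, 21], [2, 30]]))
# [4, 4, 4, 2]
```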
This has now turned into a graph problem with a twist.
The problem is a directed acyclic graph of connections between stops, and the goal is to minimize the number of line switches when riding on a train/tram.
ie. this list of sets:
1,4,8,10 <-- stop A
1,2,3,4,11,15 <-- stop B
2,4,20,21 <-- stop C
2,30 <-- stop D, destination
He needs to pick lines that are available at his exit stop, and his arrival stop, so for instance, he can't pick 10 from stop A, because 10 does not go to stop B.
So, this is the set of available lines and the stops they stop on:
             A     B     C     D
line 1  -----X-----X-----------------
line 2  -----------X-----X-----X-----
line 3  -----------X-----------------
line 4  -----X-----X-----X-----------
line 8  -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
If we consider that a line under consideration must go between at least 2 consecutive stops, let me highlight the possible choices of lines with equal signs:
             A     B     C     D
line 1  -----X=====X-----------------
line 2  -----------X=====X=====X-----
line 3  -----------X-----------------
line 4  -----X=====X=====X-----------
line 8  -----X-----------------------
line 10 -----X-----------------------
line 11 -----------X-----------------
line 15 -----------X-----------------
line 20 -----------------X-----------
line 21 -----------------X-----------
line 30 -----------------------X-----
He then needs to pick a way that transports him from A to D, with the minimal number of line switches.
Since he explained that he wants the longest rides first, the following sequence seems the best solution:
take line 4 from stop A to stop C, then switch to line 2 from C to D
Code example:
stops = [
    [1, 4, 8, 10],
    [1,2,3,4,11,15],
    [2,4,20,21],
    [2,30],
]
def calculate_possible_exit_lines(stops):
    """
    only return lines that are available at both exit
    and arrival stops, discard the rest.
    """
    result = []
    for index in range(0, len(stops) - 1):
        lines = []
        for value in stops[index]:
            if value in stops[index + 1]:
                lines.append(value)
        result.append(lines)
    return result
def all_combinations(lines):
    """
    produce all combinations which travel from one end
    of the journey to the other, across available lines.
    """
    if not lines:
        yield []
    else:
        for line in lines[0]:
            for rest_combination in all_combinations(lines[1:]):
                yield [line] + rest_combination
def reduce(combination):
    """
    reduce a combination by returning the number of
    times each value appear consecutively, ie.
    [1,1,4,4,3] would return [2,2,1] since
    the 1's appear twice, the 4's appear twice, and
    the 3 only appear once.
    """
    result = []
    while combination:
        count = 1
        value = combination[0]
        combination = combination[1:]
        while combination and combination[0] == value:
            combination = combination[1:]
            count += 1
        result.append(count)
    return tuple(result)
def calculate_best_choice(lines):
    """
    find the best choice by reducing each available
    combination down to the number of stops you can
    sit on a single line before having to switch,
    and then picking the one that has the most stops
    first, and then so on.
    """
    available = []
    for combination in all_combinations(lines):
        count_stops = reduce(combination)
        available.append((count_stops, combination))
    available = [k for k in reversed(sorted(available))]
    return available[0][1]
possible_lines = calculate_possible_exit_lines(stops)
print("possible lines: %s" % (str(possible_lines), ))
best_choice = calculate_best_choice(possible_lines)
print("best choice: %s" % (str(best_choice), ))
This code prints:
possible lines: [[1, 4], [2, 4], [2]]
best choice: [4, 4, 2]
Since, as I said, I list lines between stops, the above solution can be read either as the lines you leave each stop on or as the lines you arrive at the next stop on.
So the route is:
Hop onto line 4 at stop A and ride on that to stop B, then to stop C
Hop onto line 2 at stop C and ride on that to stop D
There are probably edge-cases here that the above code doesn't work for.
However, I'm not bothering more with this question. The OP has demonstrated a complete incapability in communicating his question in a clear and concise manner, and I fear that any corrections to the above text and/or code to accommodate the latest comments will only provoke more comments, which leads to yet another version of the question, and so on ad infinitum. The OP has gone to extraordinary lengths to avoid answering direct questions or to explain the problem.
I am assuming that "distinct elements" do not have to actually be distinct, they can repeat in the final solution. That is if presented with [1], [2], [1] that the obvious answer [1, 2, 1] is allowed. But we'd count this as having 3 distinct elements.
If so, then here is a Python solution:
def find_best_run(first_array, *argv):
    # initialize data structures.
    this_array_best_run = {}
    for x in first_array:
        this_array_best_run[x] = (1, (1,), (x,))
    for this_array in argv:
        # find the best runs ending at each value in this_array
        last_array_best_run = this_array_best_run
        this_array_best_run = {}
        for x in this_array:
            for (y, pattern) in last_array_best_run.items():
                (distinct_count, lengths, elements) = pattern
                if x == y:
                    lengths = tuple(lengths[:-1] + (lengths[-1] + 1,))
                else:
                    distinct_count += 1
                    lengths = tuple(lengths + (1,))
                    elements = tuple(elements + (x,))
                if x not in this_array_best_run:
                    this_array_best_run[x] = (distinct_count, lengths, elements)
                else:
                    (prev_count, prev_lengths, prev_elements) = this_array_best_run[x]
                    # fewer distinct elements wins; longer early runs only break ties
                    if distinct_count < prev_count or (
                            distinct_count == prev_count and prev_lengths < lengths):
                        this_array_best_run[x] = (distinct_count, lengths, elements)
    # find the best overall run
    best_count = len(argv) + 10 # Needs to be bigger than any possible answer.
    for (distinct_count, lengths, elements) in this_array_best_run.values():
        if distinct_count < best_count:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
        elif distinct_count == best_count and best_lengths < lengths:
            best_count = distinct_count
            best_lengths = lengths
            best_elements = elements
    # convert it into a more normal representation.
    answer = []
    for (length, element) in zip(best_lengths, best_elements):
        answer.extend([element] * length)
    return answer
# example
print(find_best_run(
    [1,4,8,10],
    [1,2,3,4,11,15],
    [2,4,20,21],
    [2,30]))  # prints [4, 4, 4, 2]; [4, 4, 4, 30] is an equally good answer
Here is an explanation. The ..._best_run dictionaries have keys which are elements in the current array, and they have values which are tuples (distinct_count, lengths, elements). We are trying to minimize distinct_count, then maximize lengths (lengths is a tuple, so this will prefer the element with the largest value in the first spot) and are tracking elements for the end. At each step I construct all possible runs which are a combination of a run up to the previous array with this element next in sequence, and keep the best one ending at each element of the current array. When I get to the end I pick the best possible overall run, then turn it into a conventional representation and return it.
If you have N arrays of length M, this should take O(N*M*M) time to run.
I'm going to take a crack here based on the comments, please feel free to comment further to clarify.
We have N arrays and we are trying to find the 'most common' value over all arrays when one value is picked from each array. There are several constraints 1) We want the smallest number of distinct values 2) The most common is the maximal grouping of similar letters (changing from above for clarity). Thus, 4 t's and 1 p beats 3 x's 2 y's
I don't think either problem can be solved greedily. Here's a counterexample: [[1,4],[1,2],[1,2],[2],[3,4]]. A greedy algorithm would pick [1,1,1,2,4] (3 distinct numbers), whereas [4,2,2,2,4] uses only two distinct numbers.
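The counterexample can be checked with a short sketch, under the interpretation used in this answer (minimise the number of distinct values only, ignoring the grouping rule). Both function names are hypothetical, and the brute-force search is exponential, so it is only meant for tiny inputs like this one.

```python
from itertools import product

def greedy_runs(arrays):
    """Extend the current run while some value is shared, then restart."""
    result, current, count = [], set(arrays[0]), 1
    for values in arrays[1:]:
        shared = current & set(values)
        if shared:
            current, count = shared, count + 1
        else:
            result += [min(current)] * count
            current, count = set(values), 1
    return result + [min(current)] * count

def fewest_distinct(arrays):
    """Brute force: try every way of picking one value per array."""
    return min(len(set(choice)) for choice in product(*arrays))

arrays = [[1, 4], [1, 2], [1, 2], [2], [3, 4]]
picked = greedy_runs(arrays)
print(picked, len(set(picked)))  # [1, 1, 1, 2, 3] 3
print(fewest_distinct(arrays))   # 2
```

The greedy pick ends in 3 rather than 4 here only because this sketch takes the smallest remaining value; either way it uses three distinct numbers against an optimum of two.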
This looks like a bipartite matching problem, but I'm still coming up with the formulation..
EDIT : ignore; This is a different problem, but if anyone can figure it out, I'd be really interested
EDIT 2 : For anyone that's interested, the problem that I misinterpreted can be formulated as an instance of the Hitting Set problem, see http://en.wikipedia.org/wiki/Vertex_cover#Hitting_set_and_set_cover. Basically the left hand side of the bipartite graph would be the arrays and the right hand side would be the numbers, edges would be drawn between arrays that contain each number. Unfortunately, this is NP complete, but the greedy solutions described above are essentially the best approximation.