So I have an array a_0 of size, let's say, 10^5, and now I have to make some changes to this array. The i-th change can be computed from the previous state using a function f(a_{i-1}) that gives a_i in O(1) time, where a_j denotes the array a after the j-th change has been made to it. In other words, a_i can be calculated in constant time if we know a_{i-1}. I know beforehand that I have to make 10^5 changes.
Now the problem asks me to answer a large number of queries of the form a_i[p] - a_j[q], where a_x[y] represents the y-th element of the array after the x-th change has been made to a_0.
Now if I had space on the order of 10^10, I could easily solve this problem in O(1) by storing all 10^5 arrays beforehand, but I don't (generally) have that kind of space. I could also answer the queries by generating a_i and a_j from scratch each time, but I can't afford that kind of time complexity either. So I was wondering if I could model this problem using some data structure.
EDIT: Example:
We define an array B = {1, 3, 1, 4, 2, 6}, and we define a_j as the array storing the frequency of each value after the j-th element has been added to B. That is, a_0 = {0,0,0,0,0,0}, then a_1 = {1,0,0,0,0,0}, a_2 = {1,0,1,0,0,0}, a_3 = {2,0,1,0,0,0}, a_4 = {2,0,1,1,0,0}, a_5 = {2,1,1,1,0,0} and a_6 = {2,1,1,1,0,1}.
f just appends the next element to B and updates a_{j-1} accordingly to produce a_j.
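For concreteness, a minimal Python sketch of this setup (function and variable names are mine, not from the problem): each change bumps exactly one frequency slot, which is why f runs in O(1).

def f(a_prev, new_element):
    a_next = list(a_prev)           # copy kept only so each version survives;
    a_next[new_element - 1] += 1    # the change itself touches a single slot
    return a_next

B = [1, 3, 1, 4, 2, 6]
versions = [[0] * 6]                # a_0
for x in B:
    versions.append(f(versions[-1], x))
print(versions[3])                  # a_3 -> [2, 0, 1, 0, 0, 0]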
Assume the number of changed elements per iteration is much smaller than the total number of elements. Store an array of lists, one per array index, where the list elements are (state, new_value) pairs. For example, if the full view is like this:
a0 = [3, 5, 1, 9]
a1 = [3, 5, 1, 8]
a2 = [1, 5, 1, 0]
We will store this:
c0 = [(0, 3), (2, 1)]
c1 = [(0, 5)]
c2 = [(0, 1)]
c3 = [(0, 9), (1, 8), (2, 0)]
Then for the query a2[0] - a1[3], we need only consult c0 and c3 (the two columns in the query). We can use binary search to locate the necessary entries, with keys 2 and 1 respectively (the keys for the binary search being the first elements of the tuples).
The query time is then O(log N) for the two binary searches, where N is the maximum number of changes to a single value in the array. The space is O(L + M), where L is the length of the original array and M is the total number of changes made.
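A minimal Python sketch of this scheme, using the example columns above; bisect performs the binary search, and the infinity sentinel makes the tuple comparison land on the last change at or before the queried state:

import bisect

# Per-index change lists from the example: (state, new_value) pairs.
changes = {0: [(0, 3), (2, 1)],
           1: [(0, 5)],
           2: [(0, 1)],
           3: [(0, 9), (1, 8), (2, 0)]}

def value_at(state, index):
    lst = changes[index]
    # Last change to this index at or before `state`.
    k = bisect.bisect_right(lst, (state, float('inf'))) - 1
    return lst[k][1]

print(value_at(2, 0) - value_at(1, 3))   # a2[0] - a1[3] = 1 - 8 = -7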
If there is some maximum number of states N, then checkpoints are a good way to go. For instance, if N = 100,000, you might have:
c0 = [3, 5, 7, 1, ...]
c100 = [1, 4, 9, 8, ...]
c200 = [9, 7, 1, 2, ...]
...
c99900 = [1, 1, 4, 6, ...]
Now you have 1000 checkpoints. You can find the nearest preceding checkpoint to an arbitrary state x in O(1) time and reconstruct x in at most 99 operations.
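A sketch in Python, assuming f computes the next state purely from the previous one (if the i-th change also depends on i, pass the step number along too):

STEP = 100

def build_checkpoints(a0, f, n_states):
    checkpoints = {0: a0}
    a = a0
    for s in range(1, n_states + 1):
        a = f(a)
        if s % STEP == 0:
            checkpoints[s] = a       # assumes f returns a fresh array
    return checkpoints

def reconstruct(x, checkpoints, f):
    base = (x // STEP) * STEP        # nearest preceding checkpoint
    a = checkpoints[base]
    for _ in range(x - base):        # at most STEP - 1 = 99 applications of f
        a = f(a)
    return a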
Riffing off of my comment on your question and John Zwinck's answer, if your mutating function f(*) is expensive and its effects are limited to only a few elements, then you could store the incremental changes. Doing so won't decrease the time complexity of the algorithm, but may reduce the run-time.
If you had unlimited space, you would just store all of the checkpoints. Since you do not, you'll have to balance the number of checkpoints against the incrementals appropriately. That will require some experimentation, probably centered around determining how expensive f(*) is and the extent of its effects.
Another option is to look at query behavior. If users tend to query the same or nearby locations repeatedly, you may be able to leverage an LRU (least-recently used) cache.
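For instance, in Python you could put functools.lru_cache in front of the checkpoint reconstruction; reconstruct, checkpoints and f here are the hypothetical names from the sketch in the previous answer:

from functools import lru_cache

@lru_cache(maxsize=128)              # keeps the 128 most recently used states
def cached_state(x):
    return tuple(reconstruct(x, checkpoints, f))   # tuple: hashable, immutable

def query(i, p, j, q):               # answers a_i[p] - a_j[q]
    return cached_state(i)[p] - cached_state(j)[q]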
You have an array of integers. You have to find the number of subarrays whose mean (the sum of the elements divided by their count) rounds to zero.
I have solved this in O(n^2) time, but that is not efficient enough. Is there a faster way to do it?
example:
[-1, 1, 5, -4]
subarrays whose mean rounds to zero are:
[-1, 1] = 0, and [-1, 1, 5, -4] = 1/4, which rounds to zero
Define a new array composed of pairs (prefix sum, count), where the first element is the prefix sum and the second element is the number of elements. For example, for
int[] arr = [-1, 1, 5, -4]:
int[] narr = [(0, 0), (-1, 1), (0, 2), (5, 3), (1, 4)]
The question is converted to counting the pairs (i, j) in narr where i < j and Math.abs(narr[j][0] - narr[i][0]) < narr[j][1] - narr[i][1] = j - i, which is further boiled down to:
narr[j][0] - j < narr[i][0] - i < narr[i][0] + i < narr[j][0] + j
so the problem is further converted to the following question:
for a set of intervals [[1, 2], [-1, 0], ...] (initially empty): given an interval [x, y], count how many stored intervals lie totally within the range [x, y], then add this interval, and repeat this procedure N times in total. (How to manage the data structure of intervals becomes the key problem.)
If we just brute-force iterate over every interval and do the validation, the query time complexity is O(N) and the insertion time complexity is O(1), for O(N^2) total (see the sketch after this list).
If we use square decomposition, the query time complexity is O(sqrt(N)) and the insertion time complexity is O(1), for O(N sqrt(N)) total.
If we use a treap (using one coordinate as the priority and the other as the key), the average total time complexity we can achieve is O(N lg N).
If you don't know the techniques of square decomposition or treaps, I suggest reading a couple of articles on them first.
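To make the reduction concrete, here is a small Python brute force of the O(N^2) baseline mentioned above; it assumes this answer's condition abs(sum) < count as the meaning of "rounds to zero":

def count_rounding_to_zero(arr):
    prefix = [0]
    for v in arr:
        prefix.append(prefix[-1] + v)

    inserted = []                    # intervals [S_i - i, S_i + i] seen so far
    total = 0
    for j, s in enumerate(prefix):
        lo, hi = s - j, s + j
        # count previously inserted intervals strictly inside [lo, hi]
        total += sum(1 for (l, h) in inserted if lo < l and h < hi)
        inserted.append((lo, hi))
    return total

print(count_rounding_to_zero([-1, 1, 5, -4]))
# -> 4 under this condition (it also admits [1, 5, -4] and [5, -4])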
Update:
After 30 minutes of careful thought, I find that a treap cannot achieve O(N lg N) average time complexity.
Instead we can use a 2D segment tree to achieve O(N lg N lg N):
Please read this article instead:
2d segment tree
This was an interview question I had from a tech company. I got it wrong, which I think doomed my chances, but honestly I still cannot figure out the answer... here's the question. Assume that all elements of the sequence are unique.
We have two finite sequences: X = {X_i}, Y = {Y_i}, where Y is a subsequence of X.
Let's write them as separate arrays: [X1, X2, ..., Xn], [Y1, Y2, ..., Yk] where n is the length of X, k is the length of Y, and obviously, since Y is a sub-sequence of X, we have n>=k.
For instance
X=[1, 10, 5, 7, 11, -4, 9, 5]
Y=[10, 7, -4, 9]
Then for each element in Y, we want to find the number of elements in X which 1) appear after that element and 2) are greater than that element.
Using the example above
X=[1, 10, 5, 7, 11, -4, 9, 5]
Y=[10, 7, -4, 9]
ans=[1, 2, 2, 0]
explanation:
The first element of ans is 1 because only 11 appears after 10 in X and is greater than 10, so there's only 1 such element. The second element of ans is 2 since 11 and 9 both appear after 7 in X and are greater than 7. The third element of ans is also 2 since 9 and 5 appear after -4 in X and are both greater than -4. The fourth element is 0 since no element in X appears after 9 and is greater than 9.
The interviewer wanted me to solve it in O(N) time complexity, where N is the length of X. I could not figure out how.
Does anybody have an idea?
If you have an algorithm that can solve this problem, then by setting Y = X you can make it provide enough information to sort X without any further comparisons among elements of X. Therefore, you can't do this in linear time under the usual assumptions, i.e., arbitrary integers in X that you can do constant-time operations on, but with no constant bound on their size.
You can do it in O(N log N) time pretty easily by walking backwards through X and maintaining an order statistic tree of the elements seen so far. See https://en.wikipedia.org/wiki/Order_statistic_tree
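If no order statistic tree is at hand, the same backwards walk works with a Fenwick (binary indexed) tree over value ranks; a Python sketch (the question assumes unique elements, though the rank handling below also treats equal values as not greater):

import bisect

def greater_to_the_right(X, Y):
    order = sorted(X)
    n = len(X)
    tree = [0] * (n + 1)                 # Fenwick tree over 1-based value ranks

    def add(i):
        while i <= n:
            tree[i] += 1
            i += i & -i

    def count_le(i):                     # how many inserted ranks are <= i
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    answer = {}                          # value -> count of greater elements to its right
    inserted = 0
    for x in reversed(X):                # walk X right to left
        rank = bisect.bisect_left(order, x) + 1
        answer[x] = inserted - count_le(rank)   # equal ranks are not "greater"
        add(rank)
        inserted += 1
    return [answer[y] for y in Y]

print(greater_to_the_right([1, 10, 5, 7, 11, -4, 9, 5], [10, 7, -4, 9]))
# -> [1, 2, 2, 0]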
I think it's impossible, for the same reason sorting is impossible in linear time. Here is the reasoning.
To solve this in a single pass, we would have to save the state of the previous calculations in a limited number of variables, storing for example sums, differences, or products.
If there is a big number in X that never matters for any element of Y, it is clearly not useful at all, and since the only possible approach is to save the previous state in a limited number of variables, we cannot keep numbers that relate only to individual items of X.
So to solve this we would have to figure out a state-saving scheme: some stored numbers that summarize all of the previous numbers for the current element of Y. But no such summary helps, because we don't know the next item in Y (for example, the current number is -1000 and the next number is 3000, while the other numbers in X are 1, 2, 3). So we can't store any number that relates only to the current element of Y, and we also can't store any number that is unrelated to Y (as it isn't useful at all).
I'm trying to develop a sort of very simple machine learning example to recognize similarity between arrays.
For this reason I'm trying to calculate the average of 2 arrays of different lengths.
For example if I have:
array_1 = [0, 4, 5];
array_2 = [4, 2, 7];
The average is:
average_array = [2, 3, 6];
But how can I manage to calculate the average if I have the following situation:
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
As you can see the arrays have a different length.
Is there an algorithm that I can apply to solve this problem?
Does anyone have an idea or some suggestion?
Of course I can consider the missing values of the second array as 0, and evaluate the average as, for example:
average_array = [2, 3, 6, 5, 3.5];
or consider the values as "null" and have:
average_array = [2, 3, 6, 10, 7];
But are these two approaches good?
Or is there something smarter?
Thanks for your help!!
To answer your question, we really need more information on what you are trying to achieve.
I'm trying to develop a sort of very simple machine learning example
to recognize similarity between arrays. For this reason I'm trying to
calculate the average of 2 arrays of different lengths.
Depending on your use case, similarity might be defined completely differently.
For instance:
if the array encodes sound information, you might want to measure similarity as "does this sound clip occur in this one" or "are the main frequencies (which would correspond to chords) the same"
if the array encodes image information (properly DFT-ed and zig-zag-encoded), you might not care about the high frequencies (the end of the array) and only measure the difference between the first few values of the array
if the array encodes some kind of composition of elements (e.g. this essay contains the keyword "matrix" 40 times and the keyword "SVM" 27 times), the difference in values might be very important.
General advice:
Think about what you're measuring
Decide what's important
But in general, have a look at smoothing algorithms, for instance Kneser-Ney or Good-Turing smoothing. They explicitly deal with comparing vectors of probabilities that may differ in length (in other words, have explicit zero entries).
https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation
If, after taking the average of the arrays, you intend to take the magnitude of the difference between an array and the average array, then you are probably in the right direction, provided you measure dissimilarity by the magnitude of that difference.
But for arrays of different lengths, I propose that you also take the index of the extra elements into consideration.
For
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
average should be average_array = [2, 3, 6, 6.5, 5.5];
6.5 = (10 + 3 (index) + 0 (element)) / 2
and
5.5 = (7 + 4 (index) + 0 (element)) / 2
The reason for taking the index into consideration is that the length factor is also dealt with by this approach. However, this is just my 2 cents; there may be better algorithms out there.
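A small Python sketch of this index-weighted averaging, exactly the scheme described above rather than any standard algorithm:

def index_weighted_average(a, b):
    if len(a) < len(b):
        a, b = b, a                      # make `a` the longer array
    avg = [(x + y) / 2 for x, y in zip(a, b)]
    for i in range(len(b), len(a)):
        avg.append((a[i] + i + 0) / 2)   # missing element counted as 0, index as penalty
    return avg

print(index_weighted_average([0, 4, 5, 10, 7], [4, 2, 7]))
# -> [2.0, 3.0, 6.0, 6.5, 5.5]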
You should also take a look at this post
Given an array A of size N, we construct a list containing all possible subarrays of A in descending order.
Two subarrays B and C are compared by padding zeroes until both are of size N. Then, we compare the two subarrays element by element and return as soon as a point of difference is observed.
We are given multiple queries where given x we have to find the maximum element in the xth subarray sorted according to the order given above.
For example, if the array A is [3, 1, 2, 4]; then the sorted subarrays will be:
[4]
[3, 1, 2, 4]
[3, 1, 2]
[3, 1]
[3]
[2, 4]
[2]
[1, 2, 4]
[1, 2]
[1]
A query where x = 3 corresponds to finding the maximum element in the subarray [3, 1, 2]; so here the answer would be 3.
Since the number of queries is large (on the order of 10^5) and the number of elements in the array can also be large (on the order of 10^5), we need to do some preprocessing to answer each query in O(1), O(log N), or O(sqrt N) time. I can't seem to figure out how to do this. I have solved it for the case where the array contains unique elements; how could we do it when the array contains repetitions? Is there any data structure which could help in storing the required information?
Build a suffix array in descending order for this array (treat it like a string).
For every entry, store its length and cumulative count (the sum of the lengths from the beginning of the suffix array).
For a query, find the needed entry by binary search over the cumulative counts, and take the needed prefix of the found suffix.
For your example, the suffixes with cumulative counts are:
4 (0)
3124 (1)
24 (5)
124 (7)
Query 3 finds the entry 3124 (since 1 <= 3 < 5) and takes its 3 - 1 = 2nd (by descending length) prefix = 312.
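A Python sketch of just the query step, assuming the descending-sorted suffixes and their cumulative counts above have already been built:

import bisect

# (cumulative count, suffix) pairs from the example above.
entries = [(0, [4]), (1, [3, 1, 2, 4]), (5, [2, 4]), (7, [1, 2, 4])]
cums = [c for c, _ in entries]

def subarray_for_query(x):               # x is 1-based, as in the question
    k = bisect.bisect_right(cums, x - 1) - 1   # last entry with cum <= x - 1
    cum, suffix = entries[k]
    drop = (x - 1) - cum                 # how many of the longest prefixes to skip
    return suffix[:len(suffix) - drop]

sub = subarray_for_query(3)
print(sub, max(sub))                     # -> [3, 1, 2] 3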
I'm dealing with long daily time series in Matlab, running over periods of 30-100+ years. I've been meaning to start looking at it by seasons, roughly approximating that by taking 91-day segments of each year over the time period (with some tbd method of correcting for odd number of days in the year)
Basically, what I want is an array indexing method that lets me build a new array that takes 91 elements out of every 365, starting at element 1. I've been looking for some standard array methods (some use of (:) or other), but I haven't been able to find one. I guess an alternative would be to iterate over the 365-day segments and take 91 days from each, but that seems needlessly complicated.
Is there a simpler way that I've missed?
Thanks in advance for the help!
So if I understand correctly, you want to extract elements 1-91, 366-456, 731-821, and so on? I'm not sure that there is a way to do this with basic matrix indexing, but you can do the following:
days = 1:365; %Create array ranging from 1 - 365
difference = length(data) - 365; %how much bigger is the time series data?
padded = padarray(days, [0, difference], 'circular', 'post'); %extend to fit time series
extracted = data(padded <= 91); %get every element in the range 1-91
Basically what I am doing is creating an array that is the same size as your time series data and repeats 1-365 over and over (the 'post' option makes padarray pad only at the end, so the result matches the length of data). I then perform logical indexing on data, selecting the positions where the padded array is less than or equal to 91.
As a more approachable example, consider:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
days = 1:5;
difference = length(x) - 5;
padded = padarray(days, [0, difference], 'circular', 'post');
extracted = x(padded <= 2);
padded is then equal to [1, 2, 3, 4, 5, 1, 2, 3, 4, 5] and extracted is going to be [1, 2, 6, 7]