A data structure “oracle” able to answer queries in O(1) - arrays

Let V be a vector of n elements, where each cell can contain one of k possible colors, that is
V[i] ∈ {c1, ..., ck}
Design an algorithm that, given V, constructs an "oracle" (a data structure) able to answer in O(1) queries of the following type:
Given an index i and a color c, what is the index of the cell closest to i that contains the color c?
The oracle construction algorithm must run in O(kn) time, and the query algorithm in O(1).
EDIT
O(kn) refers to the time complexity, so there is no limit on the additional memory.
My reasoning
Given i and c, the query should return an index j with
V[j] = c
which minimizes |i - j|. If there's no cell that contains the color c, it must return -1. So I guess the two function prototypes should be as follows:
ORACLE(array V, int k)
QUERY(array O, int i, int c)
the array O is created by the oracle function in order to "save" the preprocessed values that will later be looked up in O(1) by the query function. I'm stuck at this step, because I can't figure out how to arrange the values so that the query returns the right result. Any hints?

As you stated, your oracle should probably be an n×k array that stores, for every index and every color, the index of the closest cell having that color. Initialize your oracle array to all -1, then go through your array V twice: first forward, then backward.

On the forward pass, keep track of the last index in V where you have seen each color (with -1 if you haven't seen it yet); when you are at index i, the oracle answer for color c is the last index where you saw c. On the backward pass, again keep track of the last index where you saw each color; at each position, compare the candidate recorded on the forward pass with the one found going backwards, and overwrite the oracle cell if the backward index is closer. After both passes the oracle is fully constructed and ready to query in O(1) time.
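A minimal C sketch of this two-pass construction, under the assumption that colors are encoded as integers 0 ... k-1 and the oracle is stored as a flattened n×k array (the names build_oracle and query are illustrative, not from the original post):

#include <stdlib.h>

/* oracle[i*k + c] = index of the cell closest to i with color c, or -1 if none. */
int *build_oracle(const int *V, int n, int k) {
    int *oracle = malloc(sizeof(int) * n * k);
    int *last = malloc(sizeof(int) * k);  /* last index where each color was seen */

    /* Forward pass: nearest occurrence of each color at or before i. */
    for (int c = 0; c < k; c++) last[c] = -1;
    for (int i = 0; i < n; i++) {
        last[V[i]] = i;
        for (int c = 0; c < k; c++)
            oracle[i * k + c] = last[c];
    }

    /* Backward pass: overwrite when an occurrence at or after i is closer. */
    for (int c = 0; c < k; c++) last[c] = -1;
    for (int i = n - 1; i >= 0; i--) {
        last[V[i]] = i;
        for (int c = 0; c < k; c++) {
            int fwd = oracle[i * k + c];
            if (last[c] != -1 && (fwd == -1 || last[c] - i < i - fwd))
                oracle[i * k + c] = last[c];
        }
    }
    free(last);
    return oracle;
}

int query(const int *oracle, int k, int i, int c) {
    return oracle[i * k + c];  /* O(1) lookup */
}

Each pass touches every one of the n*k oracle cells a constant number of times, so construction is O(kn), and a query is a single array read.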

Related

Julia: How to efficiently sort subarrays of 2 large arrays in parallel?

I have large 1D arrays a and b, and an array of pointers I that separates them into subarrays. My a and b barely fit into RAM and have different element types (one contains UInt32s, the other Rational{Int64}s), so I don't want to join them into a 2D array and thereby change the element types.
For each i in I[2:end], I wish to sort the subarray a[I[i-1],I[i]-1] and apply the same permutation to the corresponding subarray b[I[i-1],I[i]-1]. My attempt at this is:
function sort!(a, b)
    p = sortperm(a)
    a[:], b[:] = a[p], b[p]
end
Threads.@threads for i in I[2:end]
    sort!(a[I[i-1], I[i]-1], b[I[i-1], I[i]-1])
end
However, already on a small example, I see that sort! applied to a subarray does not alter the original arrays:
a, b = rand(1:10,10), rand(-1000:1000,10) .//1
sort!(a,b); println(a,"\n",b) # works like it should
a, b = rand(1:10,10), rand(-1000:1000,10) .//1
sort!(a[1:5],b[1:5]); println(a,"\n",b) # does nothing!!!
Any help on how to create such a function sort! (as efficient as possible) is welcome.
Background: I am dealing with data coming from sparse arrays:
using SparseArrays, Random
n=10^6; x=sprand(n,n,1000/n); #random matrix with 1000 entries per column on average
x = SparseMatrixCSC(n,n,x.colptr,x.rowval,rand(-99:99,nnz(x)).//1); #changing entries to rationals
U = randperm(n) #permutation of rows of matrix x
a, b, I = U[x.rowval], x.nzval, x.colptr;
Thus these a, b, I serve as a good example of my posted problem. What I am trying to do is sort the row indices (and the corresponding matrix values) of the entries in each column.
Note: I already asked this question on Julia discourse here, but received no replies or comments. If I can improve the quality of the question, don't hesitate to tell me.
The problem is that a[1:5] is not a view, it's just a copy. Instead, make a view, like this:
function sort!(a, b)
    p = sortperm(a)
    a[:], b[:] = a[p], b[p]
end
Threads.@threads for i in 2:length(I)  # iterate over positions in I, not its values
    sort!(view(a, I[i-1]:I[i]-1), view(b, I[i-1]:I[i]-1))
end
That is what you are looking for.
P.S. The @view a[2:3] and @view(a[2:3]) forms, or the @views macro, can help make things more readable.
First of all, you shouldn't define your own sort! like this: it will shadow Base.sort! and you'll get errors if you call sort!(a).
Also, a[I[i-1], I[i]-1] and b[I[i-1], I[i]-1] are not slices, they are just single elements, so nothing happens when you sort them, with or without views. And sorting arrays in a moving-window way like this is not correct.
What you want to do here, since your vectors are huge, is call p = partialsortperm(a[i:end], i:i+block_size-1) repeatedly in a loop, choosing a block_size that fits into memory, and modify both a and b according to p; then continue with the remaining part of a, find the next p, and repeat until nothing remains in a to be sorted. I'll leave the implementation as an exercise, but you can come back if you get stuck on something.

Changing the values of an array by the distance of the indexes (C)

I'm having a hard time with this one:
I need to write a function in C that receives a binary array and its size, and the function should calculate and replace the current values with the distance (in indexes) from each 1 to the closest 0.
For example: if the function receives the array {1,1,0,1,1,1,0,1}, then the new values of the array should be {2,1,0,1,2,1,0,1}. It is known that the input has at least one 0.
The first step I thought about was to locate a pair of zeros (or just one if there is only one) and mark them with two indexes (z1, z2). Then I would use another index i
that checks each time which zero is closest to it (by absolute value), and the difference between i and z1 or z2 would be the new value.
I have the plan, but things are not going exactly as planned. Basically I deleted the code (it wasn't good anyway), so I would appreciate any help. Thanks!
This problem is based on two things:
Keep an array left[i] holding the distance from index i to the nearest 0 on its left, filled in left to right.
Keep an array right[i] holding the distance from index i to the nearest 0 on its right, filled in right to left.
Each can be computed in a single loop pass, in O(n).
Then for each position take the minimum of left[i] and right[i]; that is the answer for a 1 at position i (see the sketch below).
Overall the time complexity is O(n).
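A minimal C sketch of this two-pass idea; to stay short it folds left[] and right[] into the output array itself, overwriting the input in place as the question asks (the function name is illustrative):

#include <limits.h>

/* Replace each cell of a binary array with its distance to the nearest 0. */
void nearest_zero_distance(int *arr, int n) {
    int last = -1;  /* index of the last 0 seen */

    /* Pass 1 (left to right): distance to the nearest 0 on the left.
       A 0 stays 0; a 1 with no 0 to its left becomes INT_MAX for now. */
    for (int i = 0; i < n; i++) {
        if (arr[i] == 0)
            last = i;
        else
            arr[i] = (last == -1) ? INT_MAX : i - last;
    }

    /* Pass 2 (right to left): take the minimum with the distance to the
       nearest 0 on the right; zeros are still recognizable as arr[i] == 0. */
    last = -1;
    for (int i = n - 1; i >= 0; i--) {
        if (arr[i] == 0)
            last = i;
        else if (last != -1 && last - i < arr[i])
            arr[i] = last - i;
    }
}

On {1,1,0,1,1,1,0,1} the first pass produces {INT_MAX,INT_MAX,0,1,2,3,0,1}, and the second pass corrects it to {2,1,0,1,2,1,0,1}, matching the example in the question.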

Split array into smaller unequal-sized arrays dependent on array-column values

I'm quite new to MATLAB and this problem really drives me insane:
I have a huge array of 2 columns and about 31,000 rows. One of the two columns holds a spatial coordinate on a grid, the other one a dependent parameter. What I want to do is the following:
I. I need to split the array into smaller parts defined by the spatial column; let's say the spatial coordinates range from 0 to 500. I now want arrays that give me the two column values for spatial coordinates 0-10, then 10-20, and so on. This would result in 50 arrays of unequal size that cover the spatial range from 0 to 500.
II. Secondly, I need to calculate the average values of the columns of every single resulting array, so that I obtain one 2-dimensional point per array.
III. Thirdly, I could plot these points and I would be super happy.
Sadly, I'm super confused since I miserably fail at step I. Maybe there is even an easier way than splitting the giant array into so many small arrays - who knows...
I would be really really happy for any suggestion.
Thank you,
Arne
First of all, since you want a data structure of arrays of different sizes, you will need to place them in a cell array, so you could try something like this:
res = arrayfun(@(x)arr(arr(:,1)==x,:), unique(arr(:,1)), 'UniformOutput', 0);
The previous code returns a cell array with arr split according to its first column. With @(x)arr(arr(:,1)==x,:) you define a function of x, and arrayfun(function, ..., 'UniformOutput', 0) applies that function to each element of the following arguments (taking one value of each argument per evaluation). Note that arr must be numeric; if it is not, you should map your values to numeric values or use another way to select them.
In the same way you could do:
uo = 'UniformOutput';
res = arrayfun(@(x){arr(arr(:,1)==x,:), mean(arr(arr(:,1)==x,2))}, unique(arr(:,1)), uo, 0);
You will probably want to flatten the returned value; check the function cat. You could do:
res = cat(1,res{:})
How to plot your data depends on its format, so I can't help without knowing what the data look like, but you could try plotting inside a loop over your res variable or something similar.
Step I indeed comes with some difficulties. Once these are overcome, I guess steps II and III can easily be handled. Let me make some suggestions for step I:
You first define the maximum value (maxValue = 500;) and the step size (stepSize = 10;). Now it is possible to iterate through all steps and create your new vectors.
for k=1:maxValue/stepSize
...
end
As every resulting array will have different dimensions, I suggest you save the vectors in a cell array:
Y = cell(maxValue/stepSize,1);
Use the find function to find the rows of the entries for each matrix. At each step k, the range of values of interest will be (k-1)*stepSize to k*stepSize.
row = find( (k-1)*stepSize <= X(:,1) & X(:,1) < k*stepSize );
You can now create the matrix for a step k by
Y{k,1} = X(row,:);
Putting everything together you should be able to create the cell array Y containing your matrices and continue with the other tasks. You could also save the average of each value range in a second column of the cell array Y:
Y{k,2} = mean( Y{k,1}(:,2) );
I hope this helps you with your task. Note that these are only suggestions and there may be different (maybe more appropriate) ways to handle this.

R fill matrix or array with conditional lagged calculation in for loop

I've dug through the list archive, and either I don't know the right words to ask this question or this hasn't come up before--
I have a simulation function where I track a list of points over time, and want to introduce an extra lagged calculation based on an assignment. I've created a very simple bit of code to understand how R fills in a matrix:
t<-21 #time step
N<-10 #points to track
#creating a matrix where it's easy for me to see how the calculation is done
NEE<-rep(NA, (t+1)*N);dim(NEE)<-c(N,(t+1))
for(i in 1:t){
NEE[,1]<-1
NEE[,i+1]<-NEE[,i]+5
}
#the thing to calculate
gt<-rep(0, (t+1)*N);dim(gt)<-c(N,(t+1))
#assigned states
veg<-c(rep(0,5), rep(1,5))
veg.com<-rep(veg, t);dim(veg.com)<-c(N,t)
for (i in 1:t){
gt[,i+1]<-ifelse(veg.com[,i]==0, NEE[,i]/5, NEE[,i-3]/5)
}
#to have a view of what happens
veg1<-gt[1,]*5 #assignment for veg.com==0
veg2<-gt[10,]*5 #assignment for veg.com==1
what<-cbind(NEE[1,], veg1,veg2)
what
Of course it works, except for how it fills in the first bit (shown here as the first 4 values of veg2 in what) before the lag takes effect when veg.com==1. I'm sure there are workarounds, but first I simply want to understand: what is R doing in those initial few loop iterations?
The first two times through that second for-loop you are using negative indexing, via the expression
NEE[ , i-3]
With i = 1 this is NEE[ , -2], which returns NEE with its 2nd column removed; the next iteration (i = 2) gives NEE[ , -1], which removes the first column instead. Negative indices remove portions of a matrix or dataframe in R.

How to locate, in a huge list of numbers, two numbers where xi = xj?

I have the following question, and it screams for a solution with hashing:
Problem :
Given a huge list of numbers x1, ..., xn, where xi ≤ T, we'd like to know whether there exist two indices i, j (with i ≠ j) such that xi = xj.
Find an algorithm for the problem with O(n) expected running time.
My solution at the moment: we use hashing, with a hash function h(x) and chaining.
First we build a new array, let's call it A, where each cell is a linked list; this is the destination array.
Now we run over all n numbers and map each element of x1, ..., xn to its place A[h(xi)] using the hash function. This takes O(n) time.
After that we run over A and look for collisions. If we find a cell with length(A[k]) > 1, we compare the elements stored there and return a pair xi, xj with equal values (distinct numbers may hash to the same cell, so the values must actually be compared). This scan is O(n) in the worst case, e.g. when the matching pair (if it exists) is mapped to the last cell of A.
The same approach can be made roughly twice as fast (on average), still O(n) on average, but with better constants.
There is no need to map all the elements into the hash table and then go over it; a faster solution could be:
for each element e:
    if e is in the table:
        return e
    else:
        insert e into the table
Also note that if T < n, there must be a dupe within the first T+1 elements, by the pigeonhole principle.
Also, for small T you can use a simple array of size T, so no hash is needed (hash(x) = x). The array can be allocated already zero-initialized, e.g. with calloc (see the sketch below).
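A minimal C sketch of that small-T variant, assuming the values are integers in 1 ... T (the function name is illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Return the first value that appears twice in x[0..n-1], or -1 if none.
 * seen[v] != 0 iff value v has already appeared; calloc zero-initializes. */
int find_duplicate(const int *x, int n, int T) {
    char *seen = calloc((size_t)T + 1, 1);
    int dupe = -1;
    for (int i = 0; i < n; i++) {
        if (seen[x[i]]) { dupe = x[i]; break; }  /* x[i] equals an earlier x[j] */
        seen[x[i]] = 1;
    }
    free(seen);
    return dupe;
}

int main(void) {
    int x[] = {3, 1, 4, 1, 5};
    printf("%d\n", find_duplicate(x, 5, 5));  /* prints 1 */
    return 0;
}

By the pigeonhole observation above, when T < n the loop is guaranteed to stop within the first T+1 iterations.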
