Reduction or atomic operator on unknown global array indices in CUDA

I have the following algorithm:
__global__ void Update(int N, double* x, double* y, int* z, double* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        x[i] += y[i];
        if (y[i] >= 0.)
            out[z[i]] += x[i];  // race: several threads may write the same z[i]
        else
            out[z[i]] -= x[i];  // race: several threads may write the same z[i]
    }
}
It is important to note that out is smaller than x. Say x, y and z are always the same size (say 1000) and out is always smaller (say 100); z holds, for each element of x and y, the index in out it corresponds to.
This is all fine except for the updates to out. There may be clashes across threads, because z contains duplicate values rather than only unique ones. Therefore I currently implement the add and subtract as atomic operations (built on compare-and-swap for double precision). This is obviously expensive and means my kernel takes 5-10x longer to run.
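For reference, the compare-and-swap based atomicAdd for double mentioned here is presumably the canonical one from the CUDA C Programming Guide (needed on devices older than compute capability 6.0, which lack a native double atomicAdd; subtraction is just adding the negated value):

// Canonical double-precision atomicAdd built on atomicCAS, as given in the
// CUDA C Programming Guide (renamed here to avoid clashing with the builtin).
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry while another thread changed the value in between; the
        // comparison uses the integer bit pattern.
    } while (assumed != old);
    return __longlong_as_double(old);
}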
I would like to reduce this cost; however, the only way I can think of is for each thread to have its own copy of out, which can be large (10000+ elements, times 10000+ threads). That would mean setting up 10000 double[10000] arrays (perhaps in shared memory?), calling my kernel, and then summing across these arrays, perhaps in another kernel. Surely there must be a more elegant way to do this?
It might be worth noting that x, y, z and out reside in global memory. As my kernel is very simple (I have others like this), I have decided not to copy parts across to shared memory. nvvp on the kernel shows computation and memory roughly balanced, so I suspect there is not much performance to be gained once you add the overhead of moving data from global to shared and back again; any thoughts?

Method 1:
Build a set of "transactions". Since you only have one update per thread, you can easily build a fixed-size "transaction" record, one entry per thread. Suppose I have 7 threads (for simplicity of presentation) and some arbitrary number of entries in my out table. Let's suppose my 7 threads wanted to do 7 transactions like this:
thread ID (i):  0      1      2      3      4      5      6
z[i]:           2      3      4      4      3      2      3
x[i]:           1.5    0.5    1.0    0.5    0.1   -0.2   -0.1
"transaction":  2,1.5  3,0.5  4,1.0  4,0.5  3,0.1  2,-0.2 3,-0.1
Now do a sort_by_key on the transactions, to arrange them in order of z[i]:
sorted: 2,1.5 2,-0.2 3,0.5 3,-0.1 3,0.1 4,1.0 4,0.5
Now do a reduce_by_key operation on the transactions:
keys: 2 3 4
values: 1.3 0.5 1.5
Now update out[i] according to the keys:
out[2] += 1.3
out[3] += 0.5
out[4] += 1.5
Thrust and/or CUB might be pre-built options for the sort and reduce operations.
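A minimal sketch of this pipeline with Thrust might look as follows. The helper name and the host-side scatter loop are illustrative assumptions, and building the inputs (a device copy of z as keys, and the signed x values as vals) is presumed done beforehand:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// keys[i] holds a copy of z[i]; vals[i] holds the signed transaction value
// (x[i] if y[i] >= 0, else -x[i]) computed after the x[i] += y[i] update.
void apply_transactions(thrust::device_vector<int>&    keys,
                        thrust::device_vector<double>& vals,
                        thrust::device_vector<double>& out)
{
    // 1. Sort the transactions by destination index z[i].
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // 2. Sum all transactions that share the same destination.
    thrust::device_vector<int>    ukeys(keys.size());
    thrust::device_vector<double> usums(vals.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      ukeys.begin(), usums.begin());
    int n_unique = ends.first - ukeys.begin();

    // 3. One update per unique key, so no atomics are needed. A host loop
    //    is shown for clarity; in practice a small scatter kernel would do.
    for (int j = 0; j < n_unique; ++j)
        out[ukeys[j]] += usums[j];
}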
Method 2:
As you say, you have arrays x, y, z, and out in global memory. If you are going to use z, which is a "mapping", repeatedly, you might want to rearrange (group) or sort your arrays in order of z:
index (i):   0     1     2     3     4     5     6     7
z[i]:        2     8     4     8     3     1     4     4
x[i]:        0.2   0.4   0.3   0.1  -0.1  -0.4   0.0   1.0
group by z[i]:
index (i):   0     1     2     3     4     5     6     7
z[i]:        1     2     3     4     4     4     8     8
x[i]:       -0.4   0.2  -0.1   0.3   0.0   1.0   0.4   0.1
This, or some variant of it, would let you avoid repeating the sorting step of Method 1 on every launch (again, assuming you use the same "mapping" vector repeatedly). A sketch of this one-time grouping follows.
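One way to do the grouping with Thrust (a sketch under assumed names; sort an index permutation by z once, then reorder the data arrays to match):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/gather.h>

// Sort z in place and regroup x accordingly; do the same gather for y.
// The permutation can be kept and reused for every later kernel launch.
void group_by_z(thrust::device_vector<int>&    z,
                thrust::device_vector<double>& x)
{
    const size_t n = z.size();
    thrust::device_vector<int> perm(n);
    thrust::sequence(perm.begin(), perm.end());          // perm = 0, 1, 2, ...
    thrust::sort_by_key(z.begin(), z.end(), perm.begin());
    thrust::device_vector<double> x_sorted(n);
    thrust::gather(perm.begin(), perm.end(),             // x_sorted[i] = x[perm[i]]
                   x.begin(), x_sorted.begin());
    x.swap(x_sorted);
}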

Related

Binomial coefficient for very high numbers in C

The task I have to solve is to calculate the binomial coefficient for 100 >= n > k >= 1, and then say how many of the (n, k) solutions are over and under a barrier of 123456789.
I have no problem with my formula for calculating the binomial coefficient, but for high numbers (n, k -> 100) the data types of C become too small to calculate it.
Do you have any suggestions for how I can get around the data types overflowing?
I thought about dividing by the barrier straight away so the numbers don't get too big in the first place, and then just checking whether the result is >= 1, but I couldn't make it work.
Say your task is to determine how many binomial coefficients C(n, k) for 1 ≤ k < n ≤ 8 exceed a limit of m = 18. You can do this by using the recurrence C(n, k) = C(n − 1, k) + C(n − 1, k − 1), which can be visualized in Pascal's triangle:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 (20) 15 6 1
1 7 (21 35 35 21) 7 1
1 8 (28 56 70 56 28) 8 1
Start at the top and work your way down. Up to n = 5, everything is below the limit of 18. On the next line, the 20 exceeds the limit. From now on, more and more coefficients are beyond 18.
The triangle is symmetric and strictly increasing in the first half of each row. You only need to find the first element that exceeds the limit on each line in order to know how many items to count.
You don't have to store the whole triangle; it is enough to keep the previous and current lines. Alternatively, you can work your way from left to right along each row using the recurrence C(n, k + 1) = C(n, k) · (n − k) / (k + 1). Since you just want to count the coefficients that exceed a limit and don't care about their exact values, the regular integer types are sufficient.
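A direct sketch of this row-by-row counting (my illustration, not code from the answer): it keeps a single row of the triangle and clamps every entry at m + 1, so plain 64-bit integers never overflow; the symmetry shortcut is ignored for brevity.

#include <stdio.h>

// Count the pairs 1 <= k < n <= 100 whose binomial coefficient exceeds the
// barrier m, keeping only one row of Pascal's triangle. Entries are clamped
// at m + 1, which preserves the "greater than m" property because the
// recurrence is monotonic, and keeps every value within 64 bits.
int main(void)
{
    const long long m = 123456789LL;  // the barrier from the question
    const int N = 100;
    long long row[N + 1];
    for (int k = 0; k <= N; ++k) row[k] = 0;
    row[0] = 1;

    long long count = 0;
    for (int n = 1; n <= N; ++n) {
        for (int k = n; k >= 1; --k) {          // update the row in place
            row[k] += row[k - 1];
            if (row[k] > m) row[k] = m + 1;     // clamp to avoid overflow
            if (k < n && row[k] > m) ++count;   // only count pairs with k < n
        }
    }
    printf("%lld coefficients exceed %lld\n", count, m);
    return 0;
}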
First, you'll need a type that can hold the result. The largest number you need to handle is C(100, 50) = 100,891,344,545,564,193,334,812,497,256. This number requires 97 bits of precision, so the normal data types won't do the trick. A quad-precision IEEE float would work if your environment provides it; otherwise, you'll need some form of high/arbitrary-precision library.
Then, to keep the numbers within this size, you'll want to cancel common factors in the numerator and the denominator, and you'll want to calculate the result as (a / c) * (b / d) * ... instead of (a * b * ...) / (c * d * ...).
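The standard way to interleave the multiplications and divisions looks like this (a sketch; with a 64-bit type it only works while the intermediate product fits, roughly up to n ≈ 60 in the middle of a row, so for C(100, 50) the same loop would run on a big-integer type):

// Compute C(n, k) with interleaved multiplies and divides so intermediate
// values stay close to the final result. After step i the running value r
// equals C(n - k + i, i), so the division by i is always exact.
unsigned long long binom(unsigned n, unsigned k)
{
    if (k > n - k) k = n - k;            // symmetry: C(n, k) = C(n, n - k)
    unsigned long long r = 1;
    for (unsigned i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;         // exact: i divides C(n-k+i-1, i-1)*(n-k+i)
    return r;
}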

Check subset sum for special array equation

I was trying to solve the following problem.
We are given N and A[0], with:
N <= 5000
A[0] <= 10^6 and even
if i is odd, then A[i] >= 3 * A[i-1]
if i is even, then A[i] = 2 * A[i-1] + 3 * A[i-2]
an element at an odd index must be odd, and at an even index it must be even.
We need to minimize the sum of the array.
We are also given Q query numbers X, with:
Q <= 1000
X <= 10^18
For each X we need to determine whether it is possible to get a subset sum equal to X from our array.
What I have tried:
Creating the minimum-sum array is easy; just follow the equations and constraints.
The approach I know for subset sum is dynamic programming, which has time complexity O(sum × array size), but since the sum can be as large as 10^18, that approach won't work.
Is there some relation in the equations that I am missing?
We can make it work with a bit of math (sorry for the LaTeX; I am not sure it renders on Stack).
Let $X_n$ be the sequence (the same as the one defined by your $A$). I assume $X_0$ is positive. The sequence is then strictly increasing, and the minimum is attained when $X_{2n+1} = 3X_{2n}$.
We can compute the general terms of $X_{2n}$ and $X_{2n+1}$. Let
$$v_0 = \begin{pmatrix} X_0 \\ X_1 \end{pmatrix}, \qquad v_1 = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}.$$
The relation between $v_0$ and $v_1$ is $v_1 = M_a v_0$ with
$$M_a = \begin{pmatrix} 0 & 1 \\ 3 & 2 \end{pmatrix}$$
(second row: $X_2 = 3X_0 + 2X_1$), and the relation between $v_1$ and $v_2$ is $v_2 = M_b v_1$ with
$$M_b = \begin{pmatrix} 0 & 1 \\ 0 & 3 \end{pmatrix}$$
(second row: $X_3 = 3X_2$, the minimizing choice). Hence the relation between $v_0$ and $v_2$ is
$$M = M_b M_a = \begin{pmatrix} 3 & 2 \\ 9 & 6 \end{pmatrix}.$$
We deduce
$$v_{2n} = \begin{pmatrix} X_{2n} \\ X_{2n+1} \end{pmatrix} = M^n v_0.$$
Following the classical diagonalization (here $\det M = 0$ and $\operatorname{tr} M = 9$, so the eigenvalues are $0$ and $9$, giving $M^n = 9^{n-1} M$ for $n \ge 1$), we get, unless mistaken,
$$X_{2n} = \frac{9^n}{3} X_0 + 2 \cdot 9^{n-1} X_1, \qquad X_{2n+1} = 9^n X_0 + \frac{2 \cdot 9^n}{3} X_1.$$
Recall that $X_1 = 3X_0$, thus
$$X_{2n} = 9^n X_0, \qquad X_{2n+1} = 3 \cdot 9^n X_0.$$
Now write the sum we want to check in base 9, after dividing out $X_0$ (every element is a multiple of $X_0$, so a sum that is not divisible by $X_0$ is not representable). A digit 1 in the $9^n$ place means we take the $2n$-th element of $A$ (which contributes $9^n X_0$), and a digit 3 in that place means we take the $(2n+1)$-th element (which contributes $3 \cdot 9^n X_0$).
So we just have to decompose the number in base 9 and check whether all of its digits are 0, 1 or 3 (and also that its leading digit does not point past the end of our array).
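Putting the check together, a sketch in C++ (names and the bounds handling are mine, assuming the array is A[0..n_elems-1]):

#include <cstdint>

// Returns whether x is a subset sum of the minimal array, per the base-9
// digit test above. x0 is A[0]; n_elems is the array length N.
bool representable(uint64_t x, uint64_t x0, int n_elems)
{
    if (x % x0 != 0) return false;        // every element is a multiple of A[0]
    x /= x0;
    for (int pos = 0; x > 0; ++pos, x /= 9) {
        uint64_t d = x % 9;               // digit in the 9^pos place
        if (d == 0) continue;
        if (d == 1 && 2 * pos < n_elems) continue;      // took A[2*pos]
        if (d == 3 && 2 * pos + 1 < n_elems) continue;  // took A[2*pos+1]
        return false;                     // digit not 0/1/3, or index out of range
    }
    return true;
}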

Number of points in a rectangle

I have N points denoted by (xi,yi).
1<=i<=N
I have Q queries of the following form:
Given a rectangle (aligned with the x and y axes) defined by the points x1, y1, x2, y2, where (x1, y1) is the lower-left corner and (x2, y2) is the upper-right corner, find the number of points inside the rectangle. Points on the rectangle's boundary are considered outside.
Constraints:
1 ≤ N ≤ 100 000
1 ≤ Q ≤ 100 000
0 ≤ xi, yi ≤ 100 000
0 ≤ x1 < x2 ≤ 100 000
0 ≤ y1 < y2 ≤ 100 000
I have thought of the following approaches:
Build a 2D segment tree on an N*N matrix. A query is then solved in O(log N) time. But the segment tree built would have size >= 10^7; hence memory is insufficient.
Keep two arrays (say X and Y), each containing all N points. X is sorted by x coordinate and Y by y coordinate. Given x1, y1, x2, y2, I can find all points >= x1 && <= x2 from the X array in O(log N) time, and similarly all points >= y1 && <= y2 from Y in O(log N) time. But I cannot work out how to get from there to the number of points inside the rectangle!
The complexity should be O(N log N) or O(Q log N).
This problem is called orthogonal range searching:
Given a set of n points in R^d, preprocess them such that reporting or counting the k points inside a d-dimensional axis-parallel box will be most efficient.
Your queries are range counting queries (not range reporting queries).
A two-dimensional range tree can be used to answer a range counting query in O(log n) time using O(n log n) storage (see for example Ch. 36 of the Handbook of Discrete and Computational Geometry, 2nd ed., 2004).
If your x's and y's are on a grid, and the grid is narrow, see Orthogonal range searching in linear and almost-linear space [Nekrich, 2009], where an O((log n / log log n)^2)-time data structure is presented.
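For intuition, here is a sketch of a simplified variant, sometimes called a "merge sort tree": each segment-tree node over the x-sorted points stores its points' y values in sorted order. It answers a counting query in O(log^2 n) time with O(n log n) storage; the O(log n) bound quoted above needs the fractional-cascading refinement of the full range tree. Bounds are inclusive for brevity, and all names are mine:

#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

struct RangeTree {
    int n;
    std::vector<int> xs;                    // x values, sorted
    std::vector<std::vector<int>> node_ys;  // sorted y's per segment-tree node

    explicit RangeTree(std::vector<std::pair<int,int>> pts) {
        std::sort(pts.begin(), pts.end());  // sort points by (x, y)
        n = (int)pts.size();
        xs.resize(n);
        for (int i = 0; i < n; ++i) xs[i] = pts[i].first;
        node_ys.assign(2 * n, {});
        for (int i = 0; i < n; ++i) node_ys[n + i] = {pts[i].second};
        for (int v = n - 1; v >= 1; --v)    // merge children bottom-up
            std::merge(node_ys[2*v].begin(),   node_ys[2*v].end(),
                       node_ys[2*v+1].begin(), node_ys[2*v+1].end(),
                       std::back_inserter(node_ys[v]));
    }

    // Count points with x in [x1, x2] and y in [y1, y2], inclusive.
    long long count(int x1, int y1, int x2, int y2) const {
        int lo = int(std::lower_bound(xs.begin(), xs.end(), x1) - xs.begin());
        int hi = int(std::upper_bound(xs.begin(), xs.end(), x2) - xs.begin());
        long long total = 0;
        for (lo += n, hi += n; lo < hi; lo >>= 1, hi >>= 1) {
            if (lo & 1) total += ys_in(node_ys[lo++], y1, y2);
            if (hi & 1) total += ys_in(node_ys[--hi], y1, y2);
        }
        return total;
    }

    static long long ys_in(const std::vector<int>& ys, int y1, int y2) {
        return std::upper_bound(ys.begin(), ys.end(), y2)
             - std::lower_bound(ys.begin(), ys.end(), y1);
    }
};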
Create an array of (x, y) pairs, populate it with all coordinates, then sort it with a comparator (by key, then by value):
return a.first != b.first ? a.first < b.first : a.second < b.second;
so that for a particular x key the y values are sorted as well.
Now traverse the keys in the range x1 <= xi <= x2 incrementally; for each xi, binary search for y_min and y_max within that key's run, passing iterator_start_for_xi and iterator_end_for_xi as the start and end of the search range.
The count at xi is iterator_for_xi_y_max - iterator_for_xi_y_min.
The answer is the sum of the counts at every xi with x1 <= xi <= x2. A sketch follows the example below.
For example, a map upon sorting would look like this:
2 6
3 4
3 5
3 10
3 12
3 25
5 1
5 5
5 15
6 6
6 20
8 0
Let's say x1 = 3, x2 = 7, y1 = 3, y2 = 12:
2 6
3 4 < y_min for x = 3
3 5 <
3 10 <
3 12 < y_max for x = 3
3 25
5 1
5 5 < y_min and y_max for x = 5
5 15
6 6 < y_min and y_max for x = 6
6 20
8 0
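A sketch of this approach with a sorted vector of pairs (names are mine; bounds are treated as inclusive here for brevity, so tighten the comparisons if boundary points must count as outside):

#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;  // (x, y)

// pts must be sorted by (x, y), as in the example above.
long long count_in_rect(const std::vector<Point>& pts,
                        int x1, int y1, int x2, int y2)
{
    // First point with x >= x1, first point with x > x2.
    auto lo = std::lower_bound(pts.begin(), pts.end(), Point{x1, INT_MIN});
    auto hi = std::upper_bound(pts.begin(), pts.end(), Point{x2, INT_MAX});

    long long total = 0;
    for (auto it = lo; it != hi; ) {
        int x = it->first;
        // The run of points sharing this x, then the y range inside the run.
        auto run_end = std::upper_bound(it, hi, Point{x, INT_MAX});
        auto y_lo    = std::lower_bound(it, run_end, Point{x, y1});
        auto y_hi    = std::upper_bound(it, run_end, Point{x, y2});
        total += y_hi - y_lo;
        it = run_end;
    }
    return total;
}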

Matlab: create matrix whose rows are identical vector. Use repmat() or multiply by ones()

I want to create a matrix from a vector by concatenating the vector onto itself n times. So if my vector is mx1, then my matrix will be mxn and each column of the matrix will be equal to the vector.
Which of the following is the best/correct way, or maybe there is a better way I do not know?
matrix = repmat(vector, 1, n);
matrix = vector * ones(1, n);
Thanks
Here is some benchmarking using timeit with different vector sizes and repetition factors. The results to be shown are for Matlab R2015b on Windows.
First define a function for each of the considered approaches:
%// repmat approach
function matrix = f_repmat(vector, n)
matrix = repmat(vector, 1, n);
%// multiply approach
function matrix = f_multiply(vector, n)
matrix = vector * ones(1, n);
%// indexing approach
function matrix = f_indexing(vector,n)
matrix = vector(:,ones(1,n));
Then generate vectors of different size, and use different repetition factors:
M = round(logspace(2,4,15)); %// vector sizes
N = round(logspace(2,3,15)); %// repetition factors
time_repmat = NaN(numel(M), numel(N)); %// preallocate results
time_multiply = NaN(numel(M), numel(N));
time_indexing = NaN(numel(M), numel(N));
for ind_m = 1:numel(M)
    for ind_n = 1:numel(N)
        vector = (1:M(ind_m)).';
        n = N(ind_n);
        time_repmat(ind_m, ind_n) = timeit(@() f_repmat(vector, n)); %// measure time
        time_multiply(ind_m, ind_n) = timeit(@() f_multiply(vector, n));
        time_indexing(ind_m, ind_n) = timeit(@() f_indexing(vector, n));
    end
end
The results are plotted in the following two figures, using repmat as reference:
figure
imagesc(time_multiply./time_repmat)
set(gca, 'xtick',1:2:numel(N), 'xticklabels',N(1:2:end))
set(gca, 'ytick',1:2:numel(M), 'yticklabels',M(1:2:end))
title('Time of multiply / time of repmat')
axis image
colorbar
figure
imagesc(time_indexing./time_repmat)
set(gca, 'xtick',1:2:numel(N), 'xticklabels',N(1:2:end))
set(gca, 'ytick',1:2:numel(M), 'yticklabels',M(1:2:end))
title('Time of indexing / time of repmat')
axis image
colorbar
Perhaps a better comparison is to indicate, for each tested vector size and repetition factor, which of the three approaches is the fastest:
figure
times = cat(3, time_repmat, time_multiply, time_indexing);
[~, fastest] = min(times, [], 3);
imagesc(fastest)
set(gca, 'xtick',1:2:numel(N), 'xticklabels',N(1:2:end))
set(gca, 'ytick',1:2:numel(M), 'yticklabels',M(1:2:end))
title('1: repmat is fastest; 2: multiply is; 3: indexing is')
axis image
colorbar
Some conclusions can be drawn from the figures:
The multiply-based approach is always slower than repmat
The indexing-based approach is similar to repmat. It tends to be faster for large values of vector size or repetition factor, and slower for small values.
Either method is correct and will provide you with the desired output.
However, depending on how you declare your vector, you may get incorrect results with repmat that would be caught if you used ones. For instance, take this example (with n = 5):
>> v = 1:10;
>> m = v * ones(1, n)
Error using *
Inner matrix dimensions must agree.
>> m = repmat(v, 1, n)
m =
Columns 1 through 22
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2
Columns 23 through 44
3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4
Columns 45 through 50
5 6 7 8 9 10
ones raises an error to let you know you aren't doing the right thing, but repmat doesn't. The following example, by contrast, works correctly with both repmat and ones:
>> v = (1:10).';
>> m = v * ones(1, n)
m =
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
>> m = repmat(v, 1, n)
m =
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
9 9 9 9 9
10 10 10 10 10
You can also do this -
vector(:,ones(1,n))
But if I have to choose, repmat would be the go-to approach for me, as it is made exactly for this purpose. Also, depending on how you are going to use this replicated array, you might avoid creating it altogether with bsxfun, which does on-the-fly replication of its input arrays while applying an operation to them. Here's a comparison on that - Comparing BSXFUN and REPMAT - which shows bsxfun to be better than repmat in most cases.
Benchmarking
For the sake of performance, let's test these out. Here's some benchmarking code to do so -
%// Inputs
vector = rand(1000,1);
n = 1000;
%// Warm up tic/toc.
for iter = 1:50000
tic(); elapsed = toc();
end
disp(' ------- With REPMAT -------')
tic,
for iter = 1:200
A = repmat(vector, 1, n);
end
toc, clear A
disp(' ------- With vector(:,ones(1,n)) -------')
tic,
for iter = 1:200
A = vector(:,ones(1,n));
end
toc, clear A
disp(' ------- With vector * ones(1, n) -------')
tic,
for iter = 1:200
A = vector * ones(1, n);
end
toc
Runtime results -
------- With REPMAT -------
Elapsed time is 1.241546 seconds.
------- With vector(:,ones(1,n)) -------
Elapsed time is 1.212566 seconds.
------- With vector * ones(1, n) -------
Elapsed time is 3.023552 seconds.
Both are correct, but repmat is a more general solution for multi-dimensional matrix copying and is thus bound to be slower than a more specialized solution. The specific 'homemade' solution of multiplying two vectors is possibly faster. It is probably even faster to select instead of multiplying, i.e. vector(:,ones(n,1)) instead of vector*ones(1,n).
EDIT:
Type open repmat in your Command Window: as you can see, it is not a built-in function. You can see that it also uses ones (selection) to copy matrices. However, since it is a more general solution (handling scalars, multi-dimensional matrices, and copies along multiple dimensions), you will find if statements and other code that is unnecessary for this case, effectively slowing things down.
EDIT:
Multiplying vectors with ones becomes slower for very large vectors. The unequivocal winner is ones with selection, i.e. vector(:,ones(n,1)), which should always be faster than repmat, since it uses the same strategy without repmat's overhead.

Correct way to get weighted average of concrete array values along a continuous interval

I've been searching the web for a while, but possibly or probably I am missing the right terminology.
I have arbitrary sized arrays of scalars ...
array = [n_0, n_1, n_2, ..., n_m]
I also have a function f: x -> y, with 0 <= x <= 1, where y is a value interpolated from the array. Examples:
array = [1,2,9]
f(0) = 1
f(0.5) = 2
f(1) = 9
f(0.75) = 5.5
My problem is that I want to compute the average value over some interval r = [a..b], where a ∈ [0..1] and b ∈ [0..1]; i.e., I want to generalize my interpolation function f to compute the average along r.
My mind boggles slightly over finding the right weighting. Imagine I want to compute f([0.2, 0.8]):
array --> 1 | 2 | 9
[0..1] --> 0.00 0.25 0.50 0.75 1.00
[0.2,0.8] --> ^___________________^
The latter being the range of values I want to compute the average of.
Would it be mathematically correct to compute the average like this?

          1 * (1 - 0.8)   <- 0.2 'translated' to [0..0.25]
        + 2 * 1
        + 9 * 0.2         <- 0.8 'translated' to [0.75..1]
avg = ----------------------------------------------------
          1.4             <-- the sum of weights
This looks correct.
In your example, the interval's length is 0.6. Within it, your number 2 takes up (0.75 − 0.25)/0.6 = 0.5/0.6 = 10/12 of the space. Your number 1 takes up (0.25 − 0.2)/0.6 = 0.05/0.6 = 1/12 of the space, and likewise your number 9.
This sums up to 10/12 + 1/12 + 1/12 = 1.
For better intuition, think about it like this: the problem is to determine how much space each array element covers along the interval. The rest is just feeding the machinery described in http://en.wikipedia.org/wiki/Weighted_average#Mathematical_definition .
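A small sketch of this weighting (my own illustration, assuming as in the diagram above that element i of an m-element array covers a cell centered at i/(m−1), clipped to [0, 1], so the two edge cells are half as wide as the interior ones):

#include <algorithm>
#include <vector>

// Weighted average of a[] over the interval [lo, hi] in [0, 1]: each
// element's weight is the length of the overlap between its cell and [lo, hi].
double interval_average(const std::vector<double>& a, double lo, double hi)
{
    const int m = static_cast<int>(a.size());
    const double step = 1.0 / (m - 1);
    double weighted_sum = 0.0, total_weight = 0.0;
    for (int i = 0; i < m; ++i) {
        const double cell_lo = std::max(0.0, (i - 0.5) * step);
        const double cell_hi = std::min(1.0, (i + 0.5) * step);
        const double w =
            std::max(0.0, std::min(hi, cell_hi) - std::max(lo, cell_lo));
        weighted_sum += a[i] * w;
        total_weight += w;
    }
    return weighted_sum / total_weight;
}

For array = [1, 2, 9] and the interval [0.2, 0.8], the weights come out as 0.05, 0.5 and 0.05, so the result is (1·0.05 + 2·0.5 + 9·0.05) / 0.6 = 2.5, matching the 1/12, 10/12, 1/12 weights above.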
