Complex Dot Product in CMSIS DSP LIBRARY - c

Recently I've been checking out the CMSIS DSP complex math functions library and I've seen something I cannot fully comprehend, thus my first post on SO.
What I'm unable to grasp is how the he11 can the complex dot product function yeild a proper result? The function may be found here: Complex Dot Product
As far as I'm concerned the part
for(n=0; n<numSamples; n++) {
realResult += pSrcA[(2*n)+0]*pSrcB[(2*n)+0] - pSrcA[(2*n)+1]*pSrcB[(2*n)+1];
imagResult += pSrcA[(2*n)+0]*pSrcB[(2*n)+1] + pSrcA[(2*n)+1]*pSrcB[(2*n)+0];
}
is A-okay, but how's that:
/* CReal = A[0]* B[0] + A[2]* B[2] + A[4]* B[4] + .....+ A[numSamples-2]* B[numSamples-2] */
real_sum += (*pSrcA++) * (*pSrcB++);
/* CImag = A[1]* B[1] + A[3]* B[3] + A[5]* B[5] + .....+ A[numSamples-1]* B[numSamples-1] */
imag_sum += (*pSrcA++) * (*pSrcB++);
supposed to work, since it misses the product of real*imag parts of the samples?
It might - and most probably is - a really dumb question, but somehow I simply cannot see it working.

That looks simply wrong, and the implementation doesn't match the description.
Suppose we have z = x + i*y and w = u + i*v with x, y, u, v real. Then
z*w = (x + i*y)*(u + i*v) = (x*u - y*v) + i*(x*v + y*u)
and
z*conjugate(w) = (x + i*y)*(u - i*v) = (x*u + y*v) + i*(y*u - x*v)
So with the loop
while(blkCnt > 0u)
{
/* CReal = A[0]* B[0] + A[2]* B[2] + A[4]* B[4] + .....+ A[numSamples-2]* B[numSamples-2] */
real_sum += (*pSrcA++) * (*pSrcB++);
/* CImag = A[1]* B[1] + A[3]* B[3] + A[5]* B[5] + .....+ A[numSamples-1]* B[numSamples-1] */
imag_sum += (*pSrcA++) * (*pSrcB++);
/* Decrement the loop counter */
blkCnt--;
}
you will get real_sum + imag_sum = Real part of hermitian inner product finally.
Neither real_sum nor imag_sum is in any simple way related to the real/imaginary part of the inner product nor the bilinear product.

Related

Matrix Subset Operations in Array

I have a matrix multiplication problem. We have an image matrix which can be have variable size. It is required to calculate C = A*B for every possible nxn. C will be added to output image as seen in figure. The center point of A Matrix is located in the lower triangle. Also, B is placed diagonally symmetric to A. A can be overlap, so, B can be overlap too. Figures can be seen in below for more detailed understand:
Blue X points represent all possible mid points of A. Algorithm should just do multiply A and diagonally mirrored version of A or called B. I done it with lots of for loop. I need to reduce number of for that I used. Could you help me please?
What kind of algorithm can be used for this problem? I have some confusing points.
Could you please help me with your genius algorithm talents? Or could you direct me to an expert?
Original Questions is below:
Thanks.
Update:
#define SIZE_ARRAY 20
#define SIZE_WINDOW 5
#define WINDOW_OFFSET 2
#define INDEX_OFFSET 1
#define START_OFFSET_COLUMN 2
#define START_OFFSET_ROW 3
#define END_OFFSET_COLUMN 3
#define END_OFFSET_ROW 2
#define GET_LOWER_DIAGONAL_INDEX_MIN_ROW (START_OFFSET_ROW);
#define GET_LOWER_DIAGONAL_INDEX_MAX_ROW (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MIN_COL (START_OFFSET_COLUMN);
#define GET_LOWER_DIAGONAL_INDEX_MAX_COL (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_COLUMN)
uint32_t lowerDiagonalIndexMinRow = GET_LOWER_DIAGONAL_INDEX_MIN_ROW;
uint32_t lowerDiagonalIndexMaxRow = GET_LOWER_DIAGONAL_INDEX_MAX_ROW;
uint32_t lowerDiagonalIndexMinCol = GET_LOWER_DIAGONAL_INDEX_MIN_COL;
uint32_t lowerDiagonalIndexMaxCol = GET_LOWER_DIAGONAL_INDEX_MAX_COL;
void parallelMultiplication_Stable_Master()
{
startTimeStamp = omp_get_wtime();
#pragma omp parallel for num_threads(8) private(outerIterRow, outerIterCol,rA,cA,rB,cB) shared(inputImage, outputImage)
for(outerIterRow = lowerDiagonalIndexMinRow; outerIterRow < lowerDiagonalIndexMaxRow; outerIterRow++)
{
for(outerIterCol = lowerDiagonalIndexMinCol; outerIterCol < lowerDiagonalIndexMaxCol; outerIterCol++)
{
if(outerIterCol + 1 < outerIterRow)
{
rA = outerIterRow - WINDOW_OFFSET;
cA = outerIterCol - WINDOW_OFFSET;
rB = outerIterCol - WINDOW_OFFSET;
cB = outerIterRow - WINDOW_OFFSET;
for(i= outerIterRow - WINDOW_OFFSET; i <= outerIterRow + WINDOW_OFFSET; i++)
{
for(j= outerIterCol - WINDOW_OFFSET; j <= outerIterCol + WINDOW_OFFSET; j++)
{
for(k=0; k < SIZE_WINDOW; k++)
{
#pragma omp critical
outputImage[i][j] += inputImage[rA][cA+k] * inputImage[rB+k][cB];
}
cB++;
rA++;
}
rB++;
cA++;
printf("Thread Number - %d",omp_get_thread_num());
}
}
}
}
stopTimeStamp = omp_get_wtime();
printArray(outputImage,"Output Image");
printConsoleNotification(100, startTimeStamp, stopTimeStamp);
}
I am getting segmentation fault error if I set up thread count more than "1". What is the trick ?
I'm not providing a solution, but some thoughts that may help the OP exploring a possible approach.
You can evaluate each element of the resulting C matrix directly, from the values of the original matrix in a way similar to a convolution operation.
Consider the following image (sorry if it's confusing):
Instead of computing each matrix product for every A submatrix, you can evaluate the value of each Ci, j from the values in the shaded areas.
Note that Ci, j depends only on a small subset of row i and that the elements of the upper right triangular submatrix (where the B submatrices are picked) could be copied and maybe transposed in a more chache-friendly accomodation.
Alternatively, it may be worth exploring an approach where for every possible Bi, j, all the corresponding elements of C are evaluated.
Edit
Note that you can actually save a lot of calculations (and maybe cache misses) by grouping the terms, see e.g. the first two elements of row i in A:
More formally
Ci,j = Ai,j-4 · (Bj-4,i + Bj-4,i+1 + Bj-4,i+2 + Bj-4,i+3 + Bj-4,i+4)
Ci,j += Ai,j-3 · (Bj-3,i-1 + Bj-3,i+4 + 2·(Bj-3,i + Bj-3,i+1 + Bj-3,i+2 + Bj-3,i+3))
Ci,j += Ai,j-2 · (Bj-2,i-2 + Bj-2,i+4 + 2·(Bj-2,i-1 + Bj-2,i+3) + 3·(Bj-2,i + Bj-2,i+1 + Bj-2,i+2))
Ci,j += Ai,j-1 · (Bj-1,i-3 + Bj-1,i+4 + 2·(Bj-1,i-2 + Bj-1,i+3) + 3·(Bj-1,i-1 + Bj-1,i+2) + 4·(Bj-1,i + Bj-1,i+1))
Ci,j += Ai,j · (Bj,i-4 + Bj,i+4 + 2·(Bj,i-3 + Bj,i+3) + 3·(Bj,i-2 + Bj,i+2) + 4·(Bj,i-1 + Bj,i+1) + 5·Bj,i)
Ci,j += Ai,j+1 · (Bj+1,i-4 + Bj+1,i+3 + 2·(Bj+1,i-3 + Bj+1,i+2) + 3·(Bj+1,i-2 + Bj+1,i+1) + 4·(Bj+1,i-1 + Bj+1,i))
Ci,j += Ai,j+2 · (Bj+2,i-4 + Bj+2,i+2 + 2·(Bj+2,i-3 + Bj+2,i+1) + 3·(Bj+2,i-2 + Bj+2,i-1 + Bj+2,i))
Ci,j += Ai,j+3 · (Bj+3,i-4 + Bj+3,i+1 + 2·(Bj+3,i-3 + Bj+3,i-2 + Bj+3,i-1 + Bj+3,i))
Ci,j += Ai,j+4 · (Bj+4,i-4 + Bj+4,i-3 + Bj+4,i-2 + Bj+4,i-1 + Bj+4,i)
If I correctly estimated, this requires something like 60 additions and 25 (possibly fused) multiplications, compared to 125 operations like Ci,j += Ai,k · Bk,i spread all over the places.
I think that cache-locality may have a bigger impact on performance than the mere reduction of operations.
We could also precompute all the values
Si,j = Bj,i + Bj,i+1 + Bj,i+2 + Bj,i+3 + Bj,i+4
Then the previous formulas become
Ci,j = Ai,j-4 · Sj-4,i
Ci,j += Ai,j-3 · (Sj-3,i-1 + Sj-3,i)
Ci,j += Ai,j-2 · (Sj-2,i-2 + Sj-2,i-1 + Sj-2,i)
Ci,j += Ai,j-1 · (Sj-1,i-3 + Sj-1,i-2 + Sj-1,i-1 + Sj-1,i)
Ci,j += Ai,j · (Sj,i-4 + Sj,i-3 + Sj,i-2 + Sj,i-1 + Sj,i)
Ci,j += Ai,j+1 · (Sj+1,i-4 + Sj+1,i-3 + Sj+1,i-2 + Sj+1,i-1)
Ci,j += Ai,j+2 · (Sj+2,i-4 + Sj+2,i-3 + Sj+2,i-2)
Ci,j += Ai,j+3 · (Sj+3,i-4 + Sj+3,i-3)
Ci,j += Ai,j+4 · Sj+4,i-4
Here is my take. I wrote this before OP showed any code, so I'm not following any of their code patterns.
I start with a suitable image struct, just for my own sanity.
struct Image
{
float* values;
int rows, cols;
};
struct Image image_allocate(int rows, int cols)
{
struct Image rtrn;
rtrn.rows = rows;
rtrn.cols = cols;
rtrn.values = malloc(sizeof(float) * rows * cols);
return rtrn;
}
void image_fill(struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row)
for(col = 0; col < img->cols; ++col)
img->values[row * img->cols + col] = rand() * (1.f / RAND_MAX);
}
void image_print(const struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row) {
for(col = 0; col < img->cols; ++col)
printf("%.3f ", img->values[row * img->cols + col]);
putchar('\n');
}
putchar('\n');
}
A 5x5 matrix multiplication is too small to reasonably dispatch to BLAS. So I write a simple version myself that can be loop-unrolled and / or inlined. This routine could use a couple of micro-optimizations but let's keep it simple for now.
/** out += left * right for 5x5 sub-matrices */
static void mat_mul_5x5(
float* restrict out, const float* left, const float* right, int cols)
{
ptrdiff_t row, col, inner;
float sum;
for(row = 0; row < 5; ++row) {
for(col = 0; col < 5; ++col) {
sum = out[row * cols + col];
for(inner = 0; inner < 5; ++inner)
sum += left[row * cols + inner] * right[inner * cols + col];
out[row * cols + col] = sum;
}
}
}
Now for the single-threaded implementation of the main algorithm. Again, nothing fancy. We just iterate over the lower triangular matrix, excluding the diagonal. I keep track of the top-left corner instead of the center point. Makes index computation a bit simpler.
void compute_ltr(struct Image* restrict out, const struct Image* in)
{
ptrdiff_t top, left, end;
/* if image is not quadratic, find quadratic subset */
end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
for(top = 1; top <= end - 5; ++top)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
The parallelization is a bit tricky because we have to make sure the threads don't overlap in their output matrices. A critical section, atomics or similar stuff would cost too much performance.
A simpler solution is a strided approach: If we always keep the threads 5 rows apart, they cannot interfere. So we simply compute every fifth row, synchronize all threads, then compute the next set of rows, five apart, and so on.
void compute_ltr_parallel(struct Image* restrict out, const struct Image* in)
{
/* if image is not quadratic, find quadratic subset */
const ptrdiff_t end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
/*
* Keep the parallel section open for multiple loops to reduce
* overhead
*/
# pragma omp parallel
{
ptrdiff_t top, left, offset;
for(offset = 0; offset < 5; ++offset) {
/* Use dynamic scheduling because the work per row varies */
# pragma omp for schedule(dynamic)
for(top = 1 + offset; top <= end - 5; top += 5)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
}
}
My benchmark with 1000 iterations of a 1000x1000 image show 7 seconds for the serial version and 1.2 seconds for the parallelized version on my 8 core / 16 thread CPU.
EDIT for completeness: Here are the includes and the main for benchmarking.
#include <assert.h>
#include <stddef.h>
/* using ptrdiff_t */
#include <stdlib.h>
/* using malloc */
#include <stdio.h>
/* using printf */
#include <string.h>
/* using memset */
/* Insert code from above here */
int main()
{
int rows = 1000, cols = 1000, rep = 1000;
struct Image in, out;
in = image_allocate(rows, cols);
out = image_allocate(rows, cols);
image_fill(&in);
# if 1
do
compute_ltr_parallel(&out, &in);
while(--rep);
# else
do
compute_ltr(&out, &in);
while(--rep);
# endif
}
Compile with gcc -O3 -fopenmp.
Regarding the comment, and also your way of using OpenMP: Don't overcomplicate things with unnecessary directives. OpenMP can figure out how many threads are available itself. And private variables can easily be declared within the parallel section (usually).
If you want a specific number of threads, just call with the appropriate environment variable, e.g. on Linux call OMP_NUM_THREADS=8 ./executable

Loop Splitting makes code slower

So I'm optimizing a loop (as homework) that adds 10,000 elements 600,000 times. The time without optimizations is 23.34s~ and my goal is to reach less than 7 seconds for a B and less than 5 for an A.
So I started my optimizations by first unrolling the loop like this.
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
This reduces the runtime to about 6.4~ seconds (I can hit about 6 if I unroll further).
So I figured I would try adding sub-sums and making a final sum at the end to save time on read-write dependencies and I came up with code that looks like this.
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum0 += array[j] + array[j+1];
sum1 += array[j+2] + array[j+3];
sum2 += array[j+4] + array[j+5];
sum3 += array[j+6] + array[j+7];
However this increases the runtime to about 6.8 seconds
I tried a similar technique using pointers and the best I could do was about 15 seconds.
I only know that the machine I'm running this on (as it is a service purchased by the school) is a 32 bit, remote, Intel based, Linux virtual server that I believe is running Red Hat.
I've tried every technique I can think of to speed up the code, but they all seem to have the opposite effect. Could someone elaborate on what I'm doing wrong? Or another technique I could use to lower the runtime? The best the teacher could do was about 4.8 seconds.
As an additional condition I cannot have more than 50 lines of code in the finished project, so doing something complex is likely not possible.
Here is a full copy of both sources
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
// double sum0 = 0;
// double sum1 = 0;
// double sum2 = 0;
// double sum3 = 0;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - ACTUAL NAME\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// sum = sum0 + sum1 + sum2 + sum3;
// ... and this one.
return 0;
}
Broken up code
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
double sum0 = 0;
double sum1 = 0;
double sum2 = 0;
double sum3 = 0;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - ACTUAL NAME\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum0 += array[j] + array[j+1];
sum1 += array[j+2] + array[j+3];
sum2 += array[j+4] + array[j+5];
sum3 += array[j+6] + array[j+7];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
sum = sum0 + sum1 + sum2 + sum3;
// ... and this one.
return 0;
}
ANSWER
The 'time' application we use to judge the grade is a little bit off. The best I could do was 4.9~ by unrolling the loop 50 times and grouping it like I did below using TomKarzes's basic format.
int j;
for (j = 0; j < ARRAY_SIZE; j += 50) {
sum +=(((((((array[j] + array[j+1]) + (array[j+2] + array[j+3])) +
((array[j+4] + array[j+5]) + (array[j+6] + array[j+7]))) +
(((array[j+8] + array[j+9]) + (array[j+10] + array[j+11])) +
((array[j+12] + array[j+13]) + (array[j+14] + array[j+15])))) +
((((array[j+16] + array[j+17]) + (array[j+18] + array[j+19]))))) +
(((((array[j+20] + array[j+21]) + (array[j+22] + array[j+23])) +
((array[j+24] + array[j+25]) + (array[j+26] + array[j+27]))) +
(((array[j+28] + array[j+29]) + (array[j+30] + array[j+31])) +
((array[j+32] + array[j+33]) + (array[j+34] + array[j+35])))) +
((((array[j+36] + array[j+37]) + (array[j+38] + array[j+39])))))) +
((((array[j+40] + array[j+41]) + (array[j+42] + array[j+43])) +
((array[j+44] + array[j+45]) + (array[j+46] + array[j+47]))) +
(array[j+48] + array[j+49])));
}
I experimented with the grouping a bit. On my machine, with my gcc, I found that the following worked best:
for (j = 0; j < ARRAY_SIZE; j += 16) {
sum = sum +
(array[j ] + array[j+ 1]) +
(array[j+ 2] + array[j+ 3]) +
(array[j+ 4] + array[j+ 5]) +
(array[j+ 6] + array[j+ 7]) +
(array[j+ 8] + array[j+ 9]) +
(array[j+10] + array[j+11]) +
(array[j+12] + array[j+13]) +
(array[j+14] + array[j+15]);
}
In other words, it's unrolled 16 times, it groups the sums into pairs, and then it adds the pairs linearly. I also removed the += operator, which affects when sum is first used in the additions.
I found that the measured times varied significantly from one run to the next, even without changing anything, so I suggest timing each version several times before making any conclusions about whether the time has improved or gotten worse.
I'd be interested to know what numbers you get on your machine with this version of the inner loop.
Update: Here's my current fastest version (on my machine, with my compiler):
int j1, j2;
j1 = 0;
do {
j2 = j1 + 20;
sum = sum +
(array[j1 ] + array[j1+ 1]) +
(array[j1+ 2] + array[j1+ 3]) +
(array[j1+ 4] + array[j1+ 5]) +
(array[j1+ 6] + array[j1+ 7]) +
(array[j1+ 8] + array[j1+ 9]) +
(array[j1+10] + array[j1+11]) +
(array[j1+12] + array[j1+13]) +
(array[j1+14] + array[j1+15]) +
(array[j1+16] + array[j1+17]) +
(array[j1+18] + array[j1+19]);
j1 = j2 + 20;
sum = sum +
(array[j2 ] + array[j2+ 1]) +
(array[j2+ 2] + array[j2+ 3]) +
(array[j2+ 4] + array[j2+ 5]) +
(array[j2+ 6] + array[j2+ 7]) +
(array[j2+ 8] + array[j2+ 9]) +
(array[j2+10] + array[j2+11]) +
(array[j2+12] + array[j2+13]) +
(array[j2+14] + array[j2+15]) +
(array[j2+16] + array[j2+17]) +
(array[j2+18] + array[j2+19]);
}
while (j1 < ARRAY_SIZE);
This uses a total unroll amount of 40, split into two groups of 20, with alternating induction variables that are pre-incremenented to break dependencies, and a post-tested loop. Again, you can experiment with the parentheses groupings to fine-tune it for your compiler and platform.
I tried your code with the following approaches:
No optimization, for loop with integer indexes by 1, simple sum +=. This took 16.4 seconds on my 64 bit 2011 MacBook Pro.
gcc -O2, same code, got down to 5.46 seconds.
gcc -O3, same code, got down to 5.45 seconds.
I tried using your code with 8-way addition into the sum variable. This took it down to 2.03 seconds.
I doubled that to 16-way additon into the sum variable, this took it down to 1.91 seconds.
I doubled that to 32-way addition into the sum variable. The time WENT UP to 2.08 seconds.
I switched to a pointer approach, as suggested by #kcraigie. With -O3, the time was 6.01 seconds. (Very surprising to me!)
register double * p;
for (p = array; p < array + ARRAY_SIZE; ++p) {
sum += *p;
}
I changed the for loop to a while loop, with sum += *p++ and got the time down to 5.64 seconds.
I changed the while loop to count down instead of up, the time went up to 5.88 seconds.
I changed back to a for loop with incrementing-by-8 integer index, added 8 register double sum[0-7] variables, and added _array[j+N] to sumN for N in [0,7]. With _array declared to be a register double *const initialized to array, on the chance that it matters. This got the time down to 1.86 seconds.
I changed to a macro that expanded to 10,000 copies of +_array[n], with N a constant. Then I did sum = tnKX(addsum) and the compiler crashed with a segmentation fault. So a pure-inlining approach isn't going to work.
I switched to a macro that expanded to 10,000 copies of sum += _array[n] with N a constant. That ran in 6.63 seconds!! Apparently the overhead of loading all that code reduces the effectiveness of the inlining.
I tried declaring a static double _array[ARRAY_SIZE]; and then using __builtin_memcpy to copy it before the first loop. With the 8-way parallel addition, this resulted in a time of 2.96 seconds. I don't think static array is the way to go. (Sad - I was hoping the constant address would be a winner.)
From all this, it seems like 16-way inlining or 8-way parallel variables should be the way to go. You'll have to try this on your own platform to make sure - I don't know what the wider architecture will do to the numbers.
Edit:
Following a suggestion from #pvg, I added this code:
int ntimes = 0;
// ... and this one.
...
// You can change anything between this comment ...
if (ntimes++ == 0) {
Which reduced the run time to < 0.01 seconds. ;-) It's a winner, if you don't get hit with the F-stick.

Finding sum of x^k from k= 0 to n in O(logn) time

So my question is how to do this in C more specifically. I'm aware that O(logn) usually means that we'll be using recursion by somehow splitting one of the parameters.
What I'm trying to achieve is the sum of k = 0 to n of xn.
for example exponent_sum(x, n) would be the parameters in this case.
Then,
exponent_sum(4, 4) would be 40 + 41 + 42 + 43 + 44 = 341.
I'm not sure where to start. Some hints would be really appreciated.
One way to look at the sum is as a number in base x consisting of all 1s.
For e.g, 44 + 43 + 42 + 41 + 40 is 11111 in base 4.
In any base, a string of 1s is going to be equal to 1 followed by a string of the same number of 0s, minus 1, divided by the base minus 1.
For e.g:
in base 10: (1000 - 1) / 9 = 999 / 9 = 111
in base 16: (0x10000 - 1) / 0xF = 0xFFFF / 0xF = 0x1111
in base 8: (0100 - 1) / 7 = 077 / 7 = 011
etc
So put these together and we can generalize that
exponent_sum(x, n) = (x (n + 1) - 1) / (x - 1)
For example, exponent_sum(4, 4) = (45 - 1) / 3 = 1023 / 3 = 341
So the big O complexity for it will be the same as for computing xn
Let me add another proof for the sake of completeness:
s = 1 + x1 + x2 + ... + xn
Then
xs = x(1 + x1 + x2 + ... + xn) = x1 + x2 + ... + xn + xn+1 = s - 1 + xn+1
Solving for s
(x - 1)s = xn+1 - 1,
s = (xn+1 - 1)/(x - 1)
Another way to see the solution is like this: suppose the sum is S written as
S = 1 + x + x^2 + ... + x^k
Then if we multiply both sides of it by x we get
S*x = x * (1 + x + x^2 + ... + x^k)
= x + x^2 + ... + x^k + x^(k+1)
then add 1 to both sides
S*x + 1 = 1 + x + x^2 + ... + x^k + x^(k+1)
= (1 + x + x^2 + ... + x^k) + x^(k+1)
= S + x^(k+1)
which means
S*x - S = x^(k+1) - 1
S*(x - 1) = x^(k+1) - 1
so
S = (x^(k+1) - 1) / (x - 1)
Use the theory Of geometric progression. where
sum = (first-term(pow(common-ratio,number-of-terms)-1))/(common-ratio-1);
here first-term is obviously 1;
Common-ratio= number itself;
number-of-terms=number+1;
But common-ratio should be greater than 1;
For
Common-ratio=1;
Sum=number*number-of-terms.
You can evaluate the sum directly, without using the geometric progression formula. This has the advantage that no division is required (which is necessary if, for example, you want to adapt the code to return the result modulo some large number).
Letting S(k) to be the sum x^0 + ... + x^{k-1}, it satisfies these recurrence relations:
S(1) = 1
S(2n) = S(n) * (1 + x^n)
S(2n+1) = S(n) * (1 + x^n) + x^{2n}
Using these, the only difficulty is arranging to keep a running value of xp to use as x^n. Otherwise the algorithm is very similar to a bottom-up implementation of exponentiation by squaring.
#include <inttypes.h>
#include <stdio.h>
#include <stdint.h>
int64_t exponent_sum(int64_t x, int64_t k) {
int64_t r = 0, xp = 1;
for (int i = 63; i >= 0; i--) {
r *= 1 + xp;
xp *= xp;
if (((k + 1) >> i) & 1) {
r += xp;
xp *= x;
}
}
return r;
}
int main(int argc, char *argv[]) {
for (int k = 0; k < 10; k++) {
printf("4^0 + 4^1 + ... + 4^%d = %" PRId64 "\n", k, exponent_sum(4, k));
}
return 0;
}

Cheap way of calculating cubic bezier length

An analytical solution for cubic bezier length
seems not to exist, but it does not mean that
coding a cheap solution does not exist. By cheap I mean something like in the range of 50-100 ns (or less).
Does someone know anything like that? Maybe in two categories:
1) less error like 1% but more slow code.
2) more error like 20% but faster?
I scanned through google a bit but it doesn't
find anything which looks like a nice solution. Only something like divide on N line segments
and sum the N sqrt - too slow for more precision,
and probably too inaccurate for 2 or 3 segments.
Is there anything better?
Another option is to estimate the arc length as the average between the chord and the control net. In practice:
Bezier bezier = Bezier (p0, p1, p2, p3);
chord = (p3-p0).Length;
cont_net = (p0 - p1).Length + (p2 - p1).Length + (p3 - p2).Length;
app_arc_length = (cont_net + chord) / 2;
You can then recursively split your spline segment into two segments and calculate the arc length up to convergence. I tested myself and it actually converges pretty fast. I got the idea from this forum.
Simplest algorithm: flatten the curve and tally euclidean distance. As long as you want an approximate arc length, this solution is fast and cheap. Given your curve's coordinate LUT—you're talking about speed, so I'm assuming you use those, and don't constantly recompute the coordinates—it's a simple for loop with a tally. In generic code, with a dist function that computes the euclidean distance between two points:
var arclength = 0,
last=LUT.length-1,
i;
for (i=0; i<last; i++) {
arclength += dist(LUT[i], LUT[i+1]);
}
Done. arclength is now the approximate arc length based on the maximum number of segments you can form in the curve based on your LUT. Need things faster with a larger potential error? Control the segment count.
var arclength = 0,
segCount = ...,
last=LUT.length-2,
step = last/segCount,
s, i;
for (s=0; s<=segCount; s++) {
i = (s*step/last)|0;
arclength += dist(LUT[i], LUT[i+1]);
}
This is pretty much the simplest possible algorithm that still generates values that come even close to the true arc length. For anything better, you're going to have to use more expensive numerical approaches (like the Legendre-Gauss quadrature technique).
If you want to know why, hit up the arc length section of "A Primer on Bézier Curves".
in my case a fast and valid approach is this. (Rewritten in c# for Unity3d)
public static float BezierSingleLength(Vector3[] points){
var p0 = points[0] - points[1];
var p1 = points[2] - points[1];
var p2 = new Vector3();
var p3 = points[3]-points[2];
var l0 = p0.magnitude;
var l1 = p1.magnitude;
var l3 = p3.magnitude;
if(l0 > 0) p0 /= l0;
if(l1 > 0) p1 /= l1;
if(l3 > 0) p3 /= l3;
p2 = -p1;
var a = Mathf.Abs(Vector3.Dot(p0,p1)) + Mathf.Abs(Vector3.Dot(p2,p3));
if(a > 1.98f || l0 + l1 + l3 < (4 - a)*8) return l0+l1+l3;
var bl = new Vector3[4];
var br = new Vector3[4];
bl[0] = points[0];
bl[1] = (points[0]+points[1]) * 0.5f;
var mid = (points[1]+points[2]) * 0.5f;
bl[2] = (bl[1]+mid) * 0.5f;
br[3] = points[3];
br[2] = (points[2]+points[3]) * 0.5f;
br[1] = (br[2]+mid) * 0.5f;
br[0] = (br[1]+bl[2]) * 0.5f;
bl[3] = br[0];
return BezierSingleLength(bl) + BezierSingleLength(br);
}
I worked out the closed form expression of length for a 3 point Bezier (below). I've not attempted to work out a closed form for 4+ points. This would most likely be difficult or complicated to represent and handle. However, a numerical approximation technique such as a Runge-Kutta integration algorithm (see my Q&A here for details) would work quite well by integrating using the arc length formula.
Here is some Java code for the arc length of a 3 point Bezier, with points a,b, and c.
v.x = 2*(b.x - a.x);
v.y = 2*(b.y - a.y);
w.x = c.x - 2*b.x + a.x;
w.y = c.y - 2*b.y + a.y;
uu = 4*(w.x*w.x + w.y*w.y);
if(uu < 0.00001)
{
return (float) Math.sqrt((c.x - a.x)*(c.x - a.x) + (c.y - a.y)*(c.y - a.y));
}
vv = 4*(v.x*w.x + v.y*w.y);
ww = v.x*v.x + v.y*v.y;
t1 = (float) (2*Math.sqrt(uu*(uu + vv + ww)));
t2 = 2*uu+vv;
t3 = vv*vv - 4*uu*ww;
t4 = (float) (2*Math.sqrt(uu*ww));
return (float) ((t1*t2 - t3*Math.log(t2+t1) -(vv*t4 - t3*Math.log(vv+t4))) / (8*Math.pow(uu, 1.5)));
public float FastArcLength()
{
float arcLength = 0.0f;
ArcLengthUtil(cp0.position, cp1.position, cp2.position, cp3.position, 5, ref arcLength);
return arcLength;
}
private void ArcLengthUtil(Vector3 A, Vector3 B, Vector3 C, Vector3 D, uint subdiv, ref float L)
{
if (subdiv > 0)
{
Vector3 a = A + (B - A) * 0.5f;
Vector3 b = B + (C - B) * 0.5f;
Vector3 c = C + (D - C) * 0.5f;
Vector3 d = a + (b - a) * 0.5f;
Vector3 e = b + (c - b) * 0.5f;
Vector3 f = d + (e - d) * 0.5f;
// left branch
ArcLengthUtil(A, a, d, f, subdiv - 1, ref L);
// right branch
ArcLengthUtil(f, e, c, D, subdiv - 1, ref L);
}
else
{
float controlNetLength = (B-A).magnitude + (C - B).magnitude + (D - C).magnitude;
float chordLength = (D - A).magnitude;
L += (chordLength + controlNetLength) / 2.0f;
}
}
first of first you should Understand the algorithm use in Bezier,
When i was coding a program by c# Which was full of graphic material I used beziers and many time I had to find a point cordinate in bezier , whic it seem imposisble in the first look. so the thing i do was to write Cubic bezier function in my costume math class which was in my project. so I will share the code with you first.
//--------------- My Costum Power Method ------------------\\
public static float FloatPowerX(float number, int power)
{
float temp = number;
for (int i = 0; i < power - 1; i++)
{
temp *= number;
}
return temp;
}
//--------------- Bezier Drawer Code Bellow ------------------\\
public static void CubicBezierDrawer(Graphics graphics, Pen pen, float[] startPointPixel, float[] firstControlPointPixel
, float[] secondControlPointPixel, float[] endPointPixel)
{
float[] px = new float[1111], py = new float[1111];
float[] x = new float[4] { startPointPixel[0], firstControlPointPixel[0], secondControlPointPixel[0], endPointPixel[0] };
float[] y = new float[4] { startPointPixel[1], firstControlPointPixel[1], secondControlPointPixel[1], endPointPixel[1] };
int i = 0;
for (float t = 0; t <= 1F; t += 0.001F)
{
px[i] = FloatPowerX((1F - t), 3) * x[0] + 3 * t * FloatPowerX((1F - t), 2) * x[1] + 3 * FloatPowerX(t, 2) * (1F - t) * x[2] + FloatPowerX(t, 3) * x[3];
py[i] = FloatPowerX((1F - t), 3) * y[0] + 3 * t * FloatPowerX((1F - t), 2) * y[1] + 3 * FloatPowerX(t, 2) * (1F - t) * y[2] + FloatPowerX(t, 3) * y[3];
graphics.DrawLine(pen, px[i - 1], py[i - 1], px[i], py[i]);
i++;
}
}
as you see above, this is the way a bezier Function work and it draw the same Bezier as Microsoft Bezier Function do( I've test it). you can make it even more accurate by incrementing array size and counter size or draw elipse instead of line& ... . All of them depend on you need and level of accuracy you need and ... .
Returning to main goal ,the Question is how to calc the lenght???
well The answer is we Have tons of point and each of them has an x coorinat and y coordinate which remember us a triangle shape & especially A RightTriabgle Shape. so if we have point p1 & p2 , we can calculate the distance of them as a RightTriangle Chord. as we remeber from our math class in school, in ABC Triangle of type RightTriangle, chord Lenght is -> Sqrt(Angle's FrontCostalLenght ^ 2 + Angle's SideCostalLeghth ^ 2);
and there is this relation betwen all points we calc the lenght betwen current point and the last point before current point(exmp p[i - 1] & p[i]) and store sum of them all in a variable. lets show it in code bellow
//--------------- My Costum Power Method ------------------\\
public static float FloatPower2(float number)
{
return number * number;
}
//--------------- My Bezier Lenght Calculator Method ------------------\\
public static float CubicBezierLenghtCalculator(float[] startPointPixel
, float[] firstControlPointPixel, float[] secondControlPointPixel, float[] endPointPixel)
{
float[] tmp = new float[2];
float lenght = 0;
float[] px = new float[1111], py = new float[1111];
float[] x = new float[4] { startPointPixel[0], firstControlPointPixel[0]
, secondControlPointPixel[0], endPointPixel[0] };
float[] y = new float[4] { startPointPixel[1], firstControlPointPixel[1]
, secondControlPointPixel[1], endPointPixel[1] };
int i = 0;
for (float t = 0; t <= 1.0; t += 0.001F)
{
px[i] = FloatPowerX((1.0F - t), 3) * x[0] + 3 * t * FloatPowerX((1.0F - t), 2) * x[1] + 3F * FloatPowerX(t, 2) * (1.0F - t) * x[2] + FloatPowerX(t, 3) * x[3];
py[i] = FloatPowerX((1.0F - t), 3) * y[0] + 3 * t * FloatPowerX((1.0F - t), 2) * y[1] + 3F * FloatPowerX(t, 2) * (1.0F - t) * y[2] + FloatPowerX(t, 3) * y[3];
if (i > 0)
{
tmp[0] = Math.Abs(px[i - 1] - px[i]);// calculating costal lenght
tmp[1] = Math.Abs(py[i - 1] - py[i]);// calculating costal lenght
lenght += (float)Math.Sqrt(FloatPower2(tmp[0]) + FloatPower2(tmp[1]));// calculating the lenght of current RightTriangle Chord & add it each time to variable
}
i++;
}
return lenght;
}
if you wish to have faster calculation just need to reduce px & py array lenght and loob count.
We also can decrease memory need by reducing px and py to array lenght to 1 or make a simple double variable but becuase of Conditional situation Happend which Increase Our Big O I didn't do that.
Hope it helped you so much. if have another question just ask.
With Best regards, Heydar - Islamic Republic of Iran.

Smallest solution to system of linear equations

I need to find the smallest number of steps it takes to get between two points in a grid. If you are positioned at the center, and you can only move in 8 directions to the integer points surrounding you, then what's the least number of steps to take to get to a destination point?
I have a solution for this, but it is massively ugly and I'm honestly a bit ashamed of it:
/**
* #details Each point in the graph is a linear combination of these vectors:
*
* [x] = a[0] + b[1] + c[1] + d[ 1]
* [y] [1] [1] [0] [-1]
*
* EQ1: c + b + d == x
* EQ2: a + b - d == y
*
* Any path can be simplified to involve at most two of these variables, so we
* can solve the linear equations above with the knowledge that at least two of
* a, b, c, and d are 0. The sum of the absolute value of the coefficients is
* the number of steps taken in the grid.
*/
unsigned min_distance(point_t start, point_t goal)
{
int a, b, c, d;
int swap, steps;
int x, y;
x = goal.x - start.x;
y = goal.y - start.y;
/* Possible simple shortcuts */
if (x == 0 || y == 0) {
steps = abs(x) + abs(y);
} else if (abs(x) == abs(y)) {
steps = abs(y);
} else {
b = x, a = y - b;
steps = abs(a) + abs(b);
c = x, a = y;
swap = abs(a) + abs(c);
if (steps > swap)
steps = swap;
d = x, a = y + d;
swap = abs(a) + abs(d);
if (steps > swap)
steps = swap;
b = y, c = x - b;
swap = abs(b) + abs(c);
if (steps > swap)
steps = swap;
b = (x + y) / 2, d = b - y;
swap = abs(b) + abs(d);
if ((x + y) % 2 == 0 && steps > swap)
steps = swap;
d = -y, c = x - d;
swap = abs(c) + abs(d);
if (steps > swap)
steps = swap;
}
return steps;
}
The comment at the top explains the actual algorithm: represent each valid step as a column vector in a matrix, then find the smallest solution to the resulting system of linear equations.
In this case I saw the best answer would use at most two variables, and solved the equation six times by setting different combinations of the variables to 0. That's too specific! I want to be able to change the rules about which steps are valid and still be able to find the min distance.
EDIT: I realize this is a very poor simple example of what I'm trying to do, because of how easy it is to simplify the problem in this case. The goal is to calculate the number of steps given arbitrary stepping rules. If it the allowed steps instead looked like this then I'd start with a different matrix ([0 2 3 4 x; 1 2 0 -4 y]) and find the least solution to a different system of equations (2b + 3c + 4d = x, a + 2b - 4d = y). I'm actually trying to write a procedure that can work with any set of vectors to find the minimum number of steps.
...Any advice or criticism?

Resources