Is it possible to implement an FIR filter without padding the input and the coefficients?
For example, if the input and the filter coefficients each have 4 samples, the output has 7 samples. When implementing this, we generally append 3 zeros to both the input and the coefficients so that they match the output size.
But if the input and the coefficients each have 1024 samples, the output has 2047 samples, so we need to append 1023 zeros to each. This is inefficient, right?
So, is there another way to implement FIR filtering without padding?
The code below shows the idea I am talking about.
int x[7],h[7],y[7];
int i,j;
for(i=0;i<7;i++)
{
if(i<4)
{
x[i] = i+1;
h[i] = i+1;
}
if(i>=4)
{
x[i] = 0;
h[i] = 0;
}
}
for(i=0;i<7;i++)
{
y[i] = 0;
for(j=0;j<=i;j++)
{
y[i] = y[i] + h[j] * x [i-j];
}
}
To see what your code is doing, change the calculations to printfs, like this:
for(int i = 0; i < 7; i++)
{
printf("y[%d] = 0\n", i);
for(int j = 0; j <= i; j++)
{
printf("y[%d] += h[%d] * x[%d]\n", i, j, i-j);
}
printf("\n");
}
The output from that code (comments added) is:
y[0] = 0
y[0] += h[0] * x[0]
y[1] = 0
y[1] += h[0] * x[1]
y[1] += h[1] * x[0]
y[2] = 0
y[2] += h[0] * x[2]
y[2] += h[1] * x[1]
y[2] += h[2] * x[0]
y[3] = 0
y[3] += h[0] * x[3]
y[3] += h[1] * x[2]
y[3] += h[2] * x[1]
y[3] += h[3] * x[0]
y[4] = 0
y[4] += h[0] * x[4] // zero x
y[4] += h[1] * x[3]
y[4] += h[2] * x[2]
y[4] += h[3] * x[1]
y[4] += h[4] * x[0] // zero h
y[5] = 0
y[5] += h[0] * x[5] // zero x
y[5] += h[1] * x[4] // zero x
y[5] += h[2] * x[3]
y[5] += h[3] * x[2]
y[5] += h[4] * x[1] // zero h
y[5] += h[5] * x[0] // zero h
y[6] = 0
y[6] += h[0] * x[6] // zero x
y[6] += h[1] * x[5] // zero x
y[6] += h[2] * x[4] // zero x
y[6] += h[3] * x[3]
y[6] += h[4] * x[2] // zero h
y[6] += h[5] * x[1] // zero h
y[6] += h[6] * x[0] // zero h
The commented calculations are just a waste of time, since either the h value or the x value will be zero. To avoid the wasted calculations, the code needs to adjust the starting and ending values of j.
When i<=3 the starting value for j is 0, otherwise the starting value is i-3.
When i<=3 the ending value for j is i, otherwise the ending value is 3.
Therefore, the loops should look like this:
for(int i = 0; i < 7; i++)
{
printf("y[%d] = 0\n", i);
int start = (i <= 3) ? 0 : i-3;
int end = (i <= 3) ? i : 3;
for(int j = start; j <= end; j++)
{
printf("y[%d] += h[%d] * x[%d]\n", i, j, i-j);
}
printf("\n");
}
The output is:
y[0] = 0
y[0] += h[0] * x[0]
y[1] = 0
y[1] += h[0] * x[1]
y[1] += h[1] * x[0]
y[2] = 0
y[2] += h[0] * x[2]
y[2] += h[1] * x[1]
y[2] += h[2] * x[0]
y[3] = 0
y[3] += h[0] * x[3]
y[3] += h[1] * x[2]
y[3] += h[2] * x[1]
y[3] += h[3] * x[0]
y[4] = 0
y[4] += h[1] * x[3]
y[4] += h[2] * x[2]
y[4] += h[3] * x[1]
y[5] = 0
y[5] += h[2] * x[3]
y[5] += h[3] * x[2]
y[6] = 0
y[6] += h[3] * x[3]
This avoids the wasted calculations, and eliminates the need to pad the h and x arrays.
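The start and end rules above generalize to arbitrary lengths. The following is a minimal sketch, assuming integer samples; the function name fir_full and the parameters nx and nh are hypothetical and not part of the original code. The caller is assumed to provide room for nx + nh - 1 output samples in y.

```c
#include <stddef.h>

/* Full convolution of x[0..nx-1] with h[0..nh-1] without zero padding.
   y must have room for nx + nh - 1 samples. */
void fir_full(const int *x, int nx, const int *h, int nh, int *y)
{
    for (int i = 0; i < nx + nh - 1; i++) {
        /* Lowest j for which x[i - j] is still inside x. */
        int start = (i < nx) ? 0 : i - (nx - 1);
        /* Highest j that is both inside h and keeps i - j >= 0. */
        int end = (i < nh) ? i : nh - 1;
        int acc = 0;
        for (int j = start; j <= end; j++)
            acc += h[j] * x[i - j];
        y[i] = acc;
    }
}
```

With nx = nh = 4 this reduces to the start and end expressions shown above, and for 1024 samples each it needs no extra zeros in either array.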
You can modify your function to only compute a value when the indexes are inside the valid ranges:
int x[4], h[4], y[7];
int i, j;
for (i = 0; i < 4; i++) {
x[i] = i + 1;
h[i] = i + 1;
}
for (i = 0; i < 7; i++) {
y[i] = 0;
for (j = 0; j < 4; j++) {
/* skip terms whose x index falls outside 0..3 */
if (i - j >= 0 && i - j < 4) {
y[i] = y[i] + h[j] * x[i - j];
}
}
}
This simply skips every term whose index falls outside the range of the data or the filter coefficients.
This is a very strange problem. I cannot see any difference between code1 and code2. However, there must be one, because they produce different results (notice f0 and f0A, which acts as a buffer):
code1 :
for (k = 0; k < 6; k++) {
r1 = i0 + 6 * k;
f0 = 0.0F;
for (r2 = 0; r2 < 6; r2++) {
f0 += (float)b_a[i0 + 6 * r2] * p_est[r2 + 6 * k];
}
a[r1] = f0;
}
code2:
float f0A[6] = {0};
for (k = 0; k < 6; k++) {
r1 = i0 + 6 * k;
for (r2 = 0; r2 < 6; r2++) {
f0A[r2] += (float)b_a[i0 + 6 * r2] * p_est[r2 + 6 * k];
}
}
for (r2 = 0; r2 < 6; r2++) {
r1 = i0 + 6 * r2;
a[r1] = f0A[r2];
}
In code1, f0 is reset for each k and accumulates the sum over r2, so a[r1] receives the complete sum for that k.
In code2, you use += but store each product in a different f0A index, so each element of f0A accumulates over k instead of over r2. Thus a[r1] is not given the correct value.
That is the difference.
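One way to make a buffered version agree with code1 is to index the buffer by k rather than by r2, so that each buffer element collects the sum over r2. The sketch below is an illustration only: the function names row_direct and row_buffered are hypothetical, and b_a and p_est are assumed to be float arrays holding 6x6 matrices in column-major order, which the original snippets do not state.

```c
#include <stddef.h>

/* Same computation as code1: one running sum per k. */
void row_direct(const float *b_a, const float *p_est, int i0, float *a)
{
    for (int k = 0; k < 6; k++) {
        float f0 = 0.0f;
        for (int r2 = 0; r2 < 6; r2++)
            f0 += b_a[i0 + 6 * r2] * p_est[r2 + 6 * k];
        a[i0 + 6 * k] = f0;
    }
}

/* Buffered variant: f0A[k] (not f0A[r2]) accumulates the sum over r2. */
void row_buffered(const float *b_a, const float *p_est, int i0, float *a)
{
    float f0A[6] = {0};
    for (int k = 0; k < 6; k++)
        for (int r2 = 0; r2 < 6; r2++)
            f0A[k] += b_a[i0 + 6 * r2] * p_est[r2 + 6 * k];
    for (int k = 0; k < 6; k++)
        a[i0 + 6 * k] = f0A[k];
}
```

Both functions add the same products in the same order for each k, so they should write identical values into a.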
I'm trying to implement a Prefix Sum Algorithm in C using OpenMP, and I'm stuck.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int p = 5;
int X[5] = { 1, 5, 4, 2, 3 };
int* Y = (int*)malloc(p * sizeof(int));
for (int i = 0; i < p; i++)
printf("%d ", X[i]);
printf("\n");
Y[0] = X[0];
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
for (int i = 0; i < p; i++)
printf("%d ", Y[i]);
printf("\n");
system("pause");
return 0;
}
What should this code do?
Input numbers are in X,
output numbers are (prefixes) in Y
and the number count is p.
X = 1, 5, 4, 2, 3
Stage I.
Y[0] = X[0];
Y[0] = 1
Stage II.
int i;
#pragma omp parallel for num_threads(4)
for (i = 1; i < p; i++)
Y[i] = X[i - 1] + X[i];
Example:
Y[1] = X[0] + X[1] = 6
Y[2] = X[1] + X[2] = 9
Y[3] = X[2] + X[3] = 6
Y[4] = X[3] + X[4] = 5
Stage III. (where I am stuck)
int k = 2;
while (k < p)
{
int i;
#pragma omp parallel for
for (i = k; i < p; i++)
Y[i] = Y[i - k] + Y[i];
k += k;
}
Example:
k = 2
Y[2] = Y[0] + Y[2] = 1 + 9 = 10
Y[3] = Y[1] + Y[3] = 6 + 6 = 12
Y[4] = Y[2] + Y[4] = 10 + 5 = 15
Above, the 10 + 5 = 15 should be 9 + 5 = 14, but Y[2] was overwritten by another thread. I want to use the value that Y[2] had before the for loop started.
Example:
k = 4
Y[4] = Y[0] + Y[4] = 1 + 15 = 16
Result: 1, 6, 10, 12, 16. Expected good result: 1, 6, 10, 12, 15.
With OpenMP, you always have to consider whether your code is correct for the serial case, with a single thread, because
1. it might in fact run that way, and
2. if it's incorrect serially, then it's virtually certain to be incorrect as a parallel program, too.
Your code is not correct serially. It appears you could fix that by running the problem loop backward, from i = p - 1 to k, but in fact that's not sufficient for parallel operation.
Your best bet appears to be to accumulate your partial results into a different array than holds the results of the previous cycle. For example, you might flip between X and Y as data source and result, with a little pointer wrangling to grease the iterative wheels. Or you might do it a little more easily by using a 2D array instead of separate X and Y.
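The two-buffer idea could be sketched as follows. This is a minimal illustration under stated assumptions, not the asker's final code: the function name prefix_sum and the scratch buffer tmp are hypothetical, x and y are assumed not to overlap partially, and n is assumed small enough that doubling k does not overflow int. Each round reads only from src and writes only to dst, so no iteration can see a value written in the same round.

```c
#include <stdlib.h>
#include <string.h>

/* Inclusive prefix sum of x[0..n-1] into y[0..n-1].
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int prefix_sum(const int *x, int *y, int n)
{
    if (n <= 0)
        return 0;
    int *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    int *src = y;   /* results of the previous round */
    int *dst = tmp; /* results of the current round */
    memmove(src, x, (size_t)n * sizeof *x);

    for (int k = 1; k < n; k *= 2) {
        int i;
        /* Every iteration reads src and writes a distinct dst[i]. */
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            dst[i] = (i >= k) ? src[i - k] + src[i] : src[i];
        /* Swap roles for the next round. */
        int *t = src;
        src = dst;
        dst = t;
    }

    if (src != y)
        memcpy(y, src, (size_t)n * sizeof *y);
    free(tmp);
    return 0;
}
```

Starting at k = 1 folds Stage II into the same loop. Compiled without OpenMP support the pragma is ignored and the function still computes the same result serially, which makes the serial case easy to check first.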
UPDATE for Stage III.
int num_threads = 8;
int k = 2;
while (k < p)
{
#pragma omp parallel for ordered num_threads(k < num_threads ? 1 : num_threads)
for (i = p - 1; i >= k; i--)
{
Y[i] = Y[i - k] + Y[i];
}
k += k;
}
The code above solved my problem. It now runs in parallel, except for the first few rounds.
I am trying to solve a linear optimization problem with 4 variables and 600000 constraints.
I need to generate a large input, so I need A[600000][4] for the constraint coefficients and b[600000] for the right-hand side. Here is the code that generates the 600000 constraints.
int i, j;
int numberOfInequalities = 600000;
double c[4];
double result[4];
double A[numberOfInequalities][4], b[numberOfInequalities];
printf("\nPreparing test: 4 variables, 600000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[1] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[2] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[3] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{
A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{
A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 13.0;
for( i=200001; i< 300000; i++ )
{
A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{
A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
for( i=400000; i< 500000; i++ )
{
A[i][0] = 1;
A[i][1] = (17*i)%999983;
A[i][2] = (1967*i)%444443;
A[i][3] = 2;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + (1000000.0/(double)i);
}
for( i=500000; i< 600000; i++ )
{
A[i][0] = (3*i)%111121;
A[i][1] = (2*i)%999199;
A[i][2] = (2*i)%444443;
A[i][3] = i;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1.3;
}
The problem is that it can't create such a large array. The program just terminates at run time, but it works fine if I create no more than 200000 constraints.
I've tried to increase the stack size to unlimited, but it didn't help.
I've tried to use pointers like **A, but then I get incorrect results in the output.
P.S.
I use Ubuntu.
Any ideas?
If numberOfInequalities is a compile-time constant, you could make it a #define and define A and b as global variables or static local variables:
#define numberOfInequalities 600000
static double A[numberOfInequalities][4], b[numberOfInequalities];
This will move these arrays from the 'stack' to the 'bss' segment.
A better solution is to allocate these arrays with malloc:
double (*A)[4] = malloc(numberOfInequalities * 4 * sizeof(double));
double *b = malloc(numberOfInequalities * sizeof(double));
This will cause these arrays to be allocated from the 'heap' memory.
Don't forget to free them before returning to the caller.
See http://www.geeksforgeeks.org/memory-layout-of-c-program/ for a brief explanation of how memory is arranged in a typical C program.
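As a hedged sketch of the malloc approach with error checking and cleanup: the struct lp_data and the functions lp_data_alloc and lp_data_free are hypothetical names introduced here for illustration. The pointer-to-array type keeps the A[i][j] indexing of the original code.

```c
#include <stdlib.h>

/* Hypothetical container for the constraint data. */
typedef struct {
    double (*A)[4]; /* n rows of 4 coefficients, indexed A[i][j] */
    double *b;      /* n right-hand sides */
    size_t n;
} lp_data;

/* Allocates both arrays on the heap. Returns 0 on success, -1 on failure. */
int lp_data_alloc(lp_data *d, size_t n)
{
    d->A = malloc(n * sizeof *d->A);
    d->b = malloc(n * sizeof *d->b);
    d->n = n;
    if (d->A == NULL || d->b == NULL) {
        free(d->A); /* free(NULL) is a no-op */
        free(d->b);
        d->A = NULL;
        d->b = NULL;
        d->n = 0;
        return -1;
    }
    return 0;
}

/* Releases the arrays; safe to call after a failed allocation. */
void lp_data_free(lp_data *d)
{
    free(d->A);
    free(d->b);
    d->A = NULL;
    d->b = NULL;
    d->n = 0;
}
```

Using sizeof *d->A lets the compiler work out the row size (4 doubles), so the element count and type stay in one place.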
I've tried to implement the dot product of these two arrays using AVX, following https://stackoverflow.com/a/10459028. But my code is very slow.
A and xb are arrays of doubles, and n is an even number. Can you help me?
const int mask = 0x31;
int sum =0;
for (int i = 0; i < n; i++)
{
int ind = i;
if (i + 8 > n) // padding
{
sum += A[ind] * xb[i].x;
i++;
ind = n * j + i;
sum += A[ind] * xb[i].x;
continue;
}
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d x = _mm256_loadu_pd(&A[ind]);
__m256d y = _mm256_load_pd(ar);
i+=4; ind = n * j + i;
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d z = _mm256_loadu_pd(&A[ind]);
__m256d w = _mm256_load_pd(arr);
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
There are two big inefficiencies in your loop that are immediately apparent:
(1) these two chunks of scalar code:
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d y = _mm256_load_pd(ar);
and
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d w = _mm256_load_pd(arr);
should be implemented using SIMD loads and shuffles (or at the very least use _mm256_set_pd and give the compiler a chance to do a half-reasonable job of generating code for a gathered load).
(2) the horizontal summation at the end of the loop:
for (int i = 0; i < n; i++)
{
...
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
should be moved out of the loop:
__m256d xy = _mm256_setzero_pd();
__m256d zw = _mm256_setzero_pd();
...
for (int i = 0; i < n; i++)
{
...
xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
i += 3;
}
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
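Putting the second point together, a dot product with vector accumulators and a single horizontal sum after the loop might look like the sketch below. It makes assumptions the question does not: both inputs are plain contiguous double arrays (unlike the xb[i].x struct members above), the function name dot is hypothetical, and the AVX path is guarded by the __AVX__ macro so the code falls back to a scalar loop when AVX is not enabled at compile time.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Dot product of a[0..n-1] and b[0..n-1]. */
double dot(const double *a, const double *b, size_t n)
{
    size_t i = 0;
    double sum = 0.0;
#if defined(__AVX__)
    /* Two independent accumulators, 4 doubles each. */
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    for (; i + 8 <= n; i += 8) {
        acc0 = _mm256_add_pd(acc0,
            _mm256_mul_pd(_mm256_loadu_pd(a + i), _mm256_loadu_pd(b + i)));
        acc1 = _mm256_add_pd(acc1,
            _mm256_mul_pd(_mm256_loadu_pd(a + i + 4), _mm256_loadu_pd(b + i + 4)));
    }
    /* Horizontal reduction happens once, after the loop. */
    __m256d acc = _mm256_add_pd(acc0, acc1);
    __m128d lo = _mm256_castpd256_pd128(acc);
    __m128d hi = _mm256_extractf128_pd(acc, 1);
    __m128d pair = _mm_add_pd(lo, hi);
    double parts[2];
    _mm_storeu_pd(parts, pair);
    sum = parts[0] + parts[1];
#endif
    /* Scalar tail (or the whole array when AVX is not enabled). */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Note that sum is a double here. Because the vector path adds the products in a different order than a purely scalar loop, results for general floating-point data can differ slightly in the last bits.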
My professor sent out test code to run on our program. However, the test code itself fails with a segmentation fault when run. The error happens on the first printf. However, if that line is commented out, it just occurs on the next line. It sounds like the code works fine for him, so I'm trying to figure out why it's failing for me. I know he's using C while I'm using C++, but even when I compile the test code with gcc instead of g++ it still fails. Does anyone know why I might be having problems? Thanks! The code is below.
#include <stdio.h>
main()
{ double A[400000][4], b[400000], c[4] ;
double result[4];
int i, j; double s, t;
printf("Preparing test: 4 variables, 400000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[1] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[2] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[3] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{ A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{ A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 11.0;
for( i=200001; i< 300000; i++ )
{ A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{ A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
printf("Running test: 400000 inequalities, 4 variables\n");
//j = rand_lp(40, &(A[0][0]), &(b[0]), &(c[0]), &(result[0]));
printf("Test: extremal point (%f, %f, %f, %f) after %d recomputation steps\n",
result[0], result[1], result[2], result[3], j);
printf("Answer should be (1,2,3,4)\n End Test\n");
}
Try to change:
double A[400000][4], b[400000], c[4] ;
to
static double A[400000][4], b[400000], c[4] ;
Your A array has automatic storage duration, which on your system probably means it is stored on the stack. Assuming 8-byte doubles, A and b together need about 16 MB (400000 × 5 doubles). The total stack for your process is likely to be smaller than that, so you encountered a stack overflow.
On Linux, you can run the ulimit command:
$ ulimit -s
8192
$
to see the stack size in kB allocated for a process. For example, 8192 kB on my machine.
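The comparison can be made explicit with a small sketch. The helper names test_arrays_bytes and exceeds_stack_kb are hypothetical, and the figure of 8192 kB is just the example value reported above, not a universal default.

```c
#include <stddef.h>

/* Bytes needed by double A[rows][4], b[rows] and c[4]. */
size_t test_arrays_bytes(size_t rows)
{
    return rows * 4 * sizeof(double) /* A */
         + rows * sizeof(double)     /* b */
         + 4 * sizeof(double);       /* c */
}

/* Nonzero if those arrays alone exceed a stack limit given in kB,
   the unit in which ulimit -s reports it. */
int exceeds_stack_kb(size_t rows, size_t stack_kb)
{
    return test_arrays_bytes(rows) > stack_kb * 1024;
}
```

With 8-byte doubles, test_arrays_bytes(400000) is 16000032 bytes, while an 8192 kB stack holds 8388608 bytes, so exceeds_stack_kb(400000, 8192) is nonzero before any other stack use is counted.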
You have overflowed the stack. Your prof declares roughly 15 MB of data in main's stack frame. That's just too big.
Since the lifetime of an object declared at the top of main is essentially the entire program, just declare the objects as static. That way they'll be in the (relatively limitless) data segment, and have nearly the same lifetime.
Try changing this line:
double A[400000][4], b[400000], c[4] ;
to this:
static double A[400000][4], b[400000], c[4] ;