I am trying to solve a linear optimization problem with 4 variables and 600000 constraints.
I need to generate a large input, so I need A[600000][4] for the constraint coefficients and b[600000] for the right-hand side. Here is the code that generates the 600000 constraints:
int i, j;
int numberOfInequalities = 600000;
double c[4];
double result[4];
double A[numberOfInequalities][4], b[numberOfInequalities];
printf("\nPreparing test: 4 variables, 600000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[1] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[2] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[3] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{
A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{
A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 13.0;
for( i=200001; i< 300000; i++ )
{
A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{
A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
for( i=400000; i< 500000; i++ )
{
A[i][0] = 1;
A[i][1] = (17*i)%999983;
A[i][2] = (1967*i)%444443;
A[i][3] = 2;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + (1000000.0/(double)i);
}
for( i=500000; i< 600000; i++ )
{
A[i][0] = (3*i)%111121;
A[i][1] = (2*i)%999199;
A[i][2] = (2*i)%444443;
A[i][3] = i;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1.3;
}
The problem: the program cannot create such a large array and simply terminates at run time. It works fine if I create no more than 200000 constraints.
I tried raising the stack size limit to unlimited, but that didn't help.
I tried using a pointer to pointer (**A) instead, but then the output was incorrect.
P.S. I use Ubuntu.
Any ideas?
If numberOfInequalities is a compile-time constant, you could make it a #define and define A and b as global variables or static local variables:
#define numberOfInequalities 600000
static double A[numberOfInequalities][4], b[numberOfInequalities];
This will move these arrays from the 'stack' to the 'bss' segment.
A better solution is to allocate these arrays with malloc:
double (*A)[4] = malloc(numberOfInequalities * 4 * sizeof(double));
double *b = malloc(numberOfInequalities * sizeof(double));
This will cause these arrays to be allocated from the 'heap' memory.
Don't forget to free them before returning to the caller.
See http://www.geeksforgeeks.org/memory-layout-of-c-program/ for a brief explanation of how memory is arranged in a typical C program.
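The heap approach above can be sketched as a small helper that also checks for allocation failure. This is only an illustration: the name alloc_constraints and its return convention are assumptions made for this example, not part of the original code.

```c
#include <stdlib.h>

/* Hypothetical helper (name and signature are assumptions for this sketch).
   Allocates n rows of 4 coefficients plus n right-hand sides on the heap.
   Returns 0 on success, -1 if either allocation fails. */
int alloc_constraints(size_t n, double (**A)[4], double **b)
{
    *A = malloc(n * sizeof **A);   /* sizeof **A is the size of one row of 4 doubles */
    *b = malloc(n * sizeof **b);   /* sizeof **b is the size of one double */
    if (*A == NULL || *b == NULL) {
        free(*A);                  /* free(NULL) is a harmless no-op */
        free(*b);
        *A = NULL;
        *b = NULL;
        return -1;
    }
    return 0;
}
```

A caller would declare double (*A)[4]; and double *b;, call alloc_constraints(600000, &A, &b), keep indexing as A[i][j] and b[i] exactly as before, and call free(A); free(b); when done.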
Is it possible to implement FIR filtering without padding the input and the coefficients?
For example, if the input and the filter coefficients both have 4 samples, the output has 7 samples. So when implementing this, we generally append 3 zeros to both the input and the filter coefficients, making them the same size as the output.
But if the input and the filter coefficients both have 1024 samples, the output has 2047 samples, so we need to append 1023 zeros to each. This is inefficient, right?
So is there another way to implement FIR filtering without padding?
The code below shows the idea I am talking about.
int x[7],h[7],y[7];
int i,j;
for(i=0;i<7;i++)
{
if(i<4)
{
x[i] = i+1;
h[i] = i+1;
}
if(i>=4)
{
x[i] = 0;
h[i] = 0;
}
}
for(i=0;i<7;i++)
{
y[i] = 0;
for(j=0;j<=i;j++)
{
y[i] = y[i] + h[j] * x [i-j];
}
}
To see what your code is doing, change the calculations to printfs, like this:
for(int i = 0; i < 7; i++)
{
printf("y[%d] = 0\n", i);
for(int j = 0; j <= i; j++)
{
printf("y[%d] += h[%d] * x[%d]\n", i, j, i-j);
}
printf("\n");
}
The output from that code (comments added) is:
y[0] = 0
y[0] += h[0] * x[0]
y[1] = 0
y[1] += h[0] * x[1]
y[1] += h[1] * x[0]
y[2] = 0
y[2] += h[0] * x[2]
y[2] += h[1] * x[1]
y[2] += h[2] * x[0]
y[3] = 0
y[3] += h[0] * x[3]
y[3] += h[1] * x[2]
y[3] += h[2] * x[1]
y[3] += h[3] * x[0]
y[4] = 0
y[4] += h[0] * x[4] // zero x
y[4] += h[1] * x[3]
y[4] += h[2] * x[2]
y[4] += h[3] * x[1]
y[4] += h[4] * x[0] // zero h
y[5] = 0
y[5] += h[0] * x[5] // zero x
y[5] += h[1] * x[4] // zero x
y[5] += h[2] * x[3]
y[5] += h[3] * x[2]
y[5] += h[4] * x[1] // zero h
y[5] += h[5] * x[0] // zero h
y[6] = 0
y[6] += h[0] * x[6] // zero x
y[6] += h[1] * x[5] // zero x
y[6] += h[2] * x[4] // zero x
y[6] += h[3] * x[3]
y[6] += h[4] * x[2] // zero h
y[6] += h[5] * x[1] // zero h
y[6] += h[6] * x[0] // zero h
The commented calculations are just a waste of time, since either the h value or the x value will be zero. To avoid the wasted calculations, the code needs to adjust the starting and ending values of j.
When i<=3 the starting value for j is 0, otherwise the starting value is i-3.
When i<=3 the ending value for j is i, otherwise the ending value is 3.
Therefore, the loops should look like this:
for(int i = 0; i < 7; i++)
{
printf("y[%d] = 0\n", i);
int start = (i <= 3) ? 0 : i-3;
int end = (i <= 3) ? i : 3;
for(int j = start; j <= end; j++)
{
printf("y[%d] += h[%d] * x[%d]\n", i, j, i-j);
}
printf("\n");
}
The output is:
y[0] = 0
y[0] += h[0] * x[0]
y[1] = 0
y[1] += h[0] * x[1]
y[1] += h[1] * x[0]
y[2] = 0
y[2] += h[0] * x[2]
y[2] += h[1] * x[1]
y[2] += h[2] * x[0]
y[3] = 0
y[3] += h[0] * x[3]
y[3] += h[1] * x[2]
y[3] += h[2] * x[1]
y[3] += h[3] * x[0]
y[4] = 0
y[4] += h[1] * x[3]
y[4] += h[2] * x[2]
y[4] += h[3] * x[1]
y[5] = 0
y[5] += h[2] * x[3]
y[5] += h[3] * x[2]
y[6] = 0
y[6] += h[3] * x[3]
This avoids the wasted calculations, and eliminates the need to pad the h and x arrays.
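The same start/end rule can be written for arbitrary lengths. The following is a minimal sketch, assuming an input of n samples and a filter of m coefficients (both at least 1); the function name fir_full is made up for this example.

```c
#include <stddef.h>

/* Full linear convolution of x (length n) with h (length m), n, m >= 1.
   y must have room for n + m - 1 samples. No zero padding is needed. */
void fir_full(const int *x, size_t n, const int *h, size_t m, int *y)
{
    for (size_t i = 0; i < n + m - 1; i++) {
        /* j indexes h and i - j indexes x; clamp j so both stay in range. */
        size_t start = (i < n) ? 0 : i - (n - 1);
        size_t end   = (i < m) ? i : m - 1;
        int acc = 0;
        for (size_t j = start; j <= end; j++)
            acc += h[j] * x[i - j];
        y[i] = acc;
    }
}
```

With n = m = 4 this reduces to the bounds used above: start is 0 for i <= 3 and i-3 otherwise, and end is i for i <= 3 and 3 otherwise.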
You can modify your function to only compute a term when the indexes are inside the valid ranges:
int x[4], h[4], y[7];
int i, j;
for (i = 0; i < 4; i++) {
    x[i] = i + 1;
    h[i] = i + 1;
}
for (i = 0; i < 7; i++) {
    y[i] = 0;
    for (j = 0; j < 4; j++) {
        if (i - j >= 0 && i - j < 4) {
            y[i] = y[i] + h[j] * x[i - j];
        }
    }
}
This simply skips every term whose index falls outside the range of the data or the filter coefficients.
I have a grid where each element holds 9 values. Every time-step grabs some values from neighbouring elements, does some trivial calculations, and then writes back new values to the same addresses.
On my machine this program runs in about 3 minutes. However, an alternative program that simply reads values from neighbouring elements, and then writes back the values (i.e. the below program just without intermediate calculations) runs in only 50 seconds.
Assuming the intermediate calculations don't take very long to compute (I may be wrong about that), how can I speed up the former program to achieve performance similar to the latter? The issue most likely has to do with caching, but all the changes I've attempted have either had no effect on performance or made it worse.
What I've attempted so far
Swapping the grid array from a structure of arrays (grid[9][256*256]) to an array of structures (grid[256*256][9]) seemed to have a negligible impact on performance.
Hoisting the a[9] array outside the loop also didn't seem to affect performance.
I've run the code through a profiler, which tells me that CPU performance is poor whenever I access elements of the grid.
Simplified program:
#include <stdio.h>
#include <stdlib.h>
int main() {
double **grid = (double**)malloc(9*sizeof(double*));
for(int i = 0; i < 9; i++)
grid[i] = (double*)malloc(256*256*sizeof(double));
// double **grid = (double**)malloc(256*256*sizeof(double*));
// for(int i = 0; i < 256*256; i++)
// grid[i] = (double*)malloc(9*sizeof(double));
double res = 0.0;
for (int tt = 0; tt < 80000; tt++) {
for (int ii = 0; ii < 256; ii++) {
for (int jj = 0; jj < 256; jj++) {
int up = (ii + 1) % 256;
int rt = (jj + 1) % 256;
int dn = (ii == 0) ? 255 : (ii - 1);
int lf = (jj == 0) ? 255 : (jj - 1);
double sum = grid[0][ii*256 + jj] + grid[1][ii*256 + lf]
+ grid[2][dn*256 + jj] + grid[3][ii*256 + rt]
+ grid[4][up*256 + jj] + grid[5][dn*256 + lf]
+ grid[6][dn*256 + rt] + grid[7][up*256 + rt]
+ grid[8][up*256 + lf];
double odd = ( grid[1][ii*256 + jj] + grid[3][up*256 + lf]
+ grid[5][dn*256 + rt] + grid[7][up*256 + rt]
) / sum;
double even = ( grid[0][ii*256 + jj] + grid[2][up*256 + lf]
+ grid[4][dn*256 + rt] + grid[6][dn*256 + lf]
+ grid[8][ii*256 + lf]
) / sum;
double hypot = odd*odd + even*even;
double a[9];
a[1] = ( odd ) * hypot;
a[2] = ( even ) * hypot;
a[3] = ( - odd ) * hypot;
a[4] = ( - even ) * hypot;
a[5] = ( odd + even ) * hypot;
a[6] = ( - odd + even ) * hypot;
a[7] = ( - odd - even ) * hypot;
a[8] = ( odd - even ) * hypot;
sum = 0.0;
sum += ( grid[0][ii*256 + jj] = hypot * grid[0][ii*256 + jj] );
sum += ( grid[1][ii*256 + lf] = a[3] * grid[3][ii*256 + rt] );
sum += ( grid[2][dn*256 + jj] = a[4] * grid[4][up*256 + jj] );
sum += ( grid[3][ii*256 + rt] = a[1] * grid[1][ii*256 + lf] );
sum += ( grid[4][up*256 + jj] = a[2] * grid[2][dn*256 + jj] );
sum += ( grid[5][dn*256 + lf] = a[7] * grid[7][up*256 + rt] );
sum += ( grid[6][dn*256 + rt] = a[8] * grid[8][up*256 + lf] );
sum += ( grid[7][up*256 + rt] = a[5] * grid[5][dn*256 + lf] );
sum += ( grid[8][up*256 + lf] = a[6] * grid[6][dn*256 + rt] );
res += sum;
}
}
}
printf("%f", res);
return 0;
}
Vim command to swap the index ordering of all 2D arrays in the program:
:%s/\[\([^\]]\+\)\]\[\([^\]]\+\)]/\[\2\]\[\1\]/g
I have video data in YUV420p (I420) format, and I'm trying to convert it to RGBA in C. I don't know which part of my code is wrong, but I can't get the right RGBA data.
const void *data = [frameData bytes];
NSInteger numOfPixel = width * height;
NSInteger positionOfV = numOfPixel;
NSInteger positionOfU = numOfPixel * 5 / 4;
uint8_t *rgba = malloc(numOfPixel * 4);
for (uint32_t i = 0; i < height; i ++) {
NSInteger startY = i * width;
NSInteger step = (i/2)*(width/2);
NSInteger startU = positionOfU + step;
NSInteger startV = positionOfV + step;
for (uint32_t j = 0; j < width; j ++) {
NSInteger indexY = startY + j;
NSInteger indexU = startU + j/2;
NSInteger indexV = startV + j/2;
NSInteger index = indexY * 4;
uint8_t Y = *(uint8_t *)(&data[indexY]);
uint8_t U = *(uint8_t *)(&data[indexU]);
uint8_t V = *(uint8_t *)(&data[indexV]);
int r = (int)((Y&0xff) + 1.4075 * ((V&0xff) - 128));
int g = (int)((Y&0xff) - 0.3455 * ((U&0xff)-128) - 0.7169*((V&0xff)-128));
int b = (int)((Y&0xff) + 1.779 * ((U&0xff)-128));
int a = 0xff;
if (r<0) r=0; if (r>255) r=255;
if (g<0) g=0; if (g>255) g=255;
if (b<0) b=0; if (b>255) b=255;
rgba[index] = (uint8_t)r;
rgba[index + 1] = (uint8_t)g;
rgba[index + 2] = (uint8_t)b;
rgba[index + 3] = (uint8_t)a;
}
}
I've tried to implement the dot product of two arrays using AVX, following https://stackoverflow.com/a/10459028, but my code is very slow.
A and xb are arrays of doubles, and n is an even number. Can you help me?
const int mask = 0x31;
int sum =0;
for (int i = 0; i < n; i++)
{
int ind = i;
if (i + 8 > n) // padding
{
sum += A[ind] * xb[i].x;
i++;
ind = n * j + i;
sum += A[ind] * xb[i].x;
continue;
}
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d x = _mm256_loadu_pd(&A[ind]);
__m256d y = _mm256_load_pd(ar);
i+=4; ind = n * j + i;
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d z = _mm256_loadu_pd(&A[ind]);
__m256d w = _mm256_load_pd(arr);
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
There are two big inefficiencies in your loop that are immediately apparent:
(1) these two chunks of scalar code:
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d y = _mm256_load_pd(ar);
and
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d w = _mm256_load_pd(arr);
should be implemented using SIMD loads and shuffles (or at the very least use _mm256_set_pd and give the compiler a chance to do a half-reasonable job of generating code for a gathered load).
(2) the horizontal summation at the end of the loop:
for (int i = 0; i < n; i++)
{
...
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
should be moved out of the loop:
__m256d xy = _mm256_setzero_pd();
__m256d zw = _mm256_setzero_pd();
...
for (int i = 0; i < n; i++)
{
...
xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
i += 3;
}
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
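To make the accumulate-then-reduce pattern in (2) easier to follow, here is a portable scalar analogue rather than AVX code: four independent partial sums stand in for the four lanes of one __m256d accumulator, and the horizontal sum happens once after the loop. It assumes two plain contiguous double arrays (unlike the xb[i].x layout in the question), and the name dot_lanes is made up for this sketch.

```c
#include <stddef.h>

/* Scalar analogue of the restructured loop (not AVX intrinsics):
   lane[0..3] play the role of the four lanes of a __m256d accumulator. */
double dot_lanes(const double *a, const double *b, size_t n)
{
    double lane[4] = { 0.0, 0.0, 0.0, 0.0 };
    size_t i = 0;

    /* Main loop: only "vertical" multiply-accumulate, no horizontal sum. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += a[i]     * b[i];
        lane[1] += a[i + 1] * b[i + 1];
        lane[2] += a[i + 2] * b[i + 2];
        lane[3] += a[i + 3] * b[i + 3];
    }

    /* Single horizontal reduction, done once after the loop. */
    double sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Tail: the remaining 0 to 3 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Because the partial sums are added in a different order than in a simple left-to-right loop, the floating-point result can differ in the last bits from a purely sequential sum.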
My professor sent out test code to run on our program. However, the test code itself fails with a segmentation fault. The error happens on the first printf; if that line is commented out, it just occurs on the next line. It sounds like the code works fine for him, so I'm trying to figure out why it's failing for me. I know he's using C while I'm using C++, but even when I compile the test code with gcc instead of g++, it still fails. Does anyone know why I might be having problems? Thanks! The code is below.
#include <stdio.h>
main()
{ double A[400000][4], b[400000], c[4] ;
double result[4];
int i, j; double s, t;
printf("Preparing test: 4 variables, 400000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[0] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[0] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[0] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{ A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{ A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 11.0;
for( i=200001; i< 300000; i++ )
{ A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{ A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
printf("Running test: 400000 inequalities, 4 variables\n");
//j = rand_lp(40, &(A[0][0]), &(b[0]), &(c[0]), &(result[0]));
printf("Test: extremal point (%f, %f, %f, %f) after %d recomputation steps\n",
result[0], result[1], result[2], result[3], j);
printf("Answer should be (1,2,3,4)\n End Test\n");
}
Try to change:
double A[400000][4], b[400000], c[4] ;
to
static double A[400000][4], b[400000], c[4] ;
Your declaration of the A array has automatic storage duration, which on your system probably means it is stored on the stack. With 8-byte doubles, A and b together need about 16,000,000 bytes (roughly 15 MB), which is likely more than the total stack available to your process, so you encountered a stack overflow.
On Linux, you can run the ulimit command:
$ ulimit -s
8192
$
to see the stack size limit for a process in kB; for example, it is 8192 kB on my machine.
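The same limit can also be read from inside a program with the POSIX getrlimit call. The sketch below compares it with the space that A and b need; the helper names stack_limit_bytes and arrays_fit_on_stack are assumptions made for this example, and the comparison ignores everything else that lives on the stack.

```c
#include <sys/resource.h>

/* Returns the soft stack limit in bytes, or 0 if it is unlimited
   or cannot be queried. */
unsigned long long stack_limit_bytes(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (unsigned long long)rl.rlim_cur;
}

/* Returns 1 if A[400000][4] and b[400000] would fit below the given
   limit (0 is treated as "no limit"), otherwise 0. */
int arrays_fit_on_stack(unsigned long long limit)
{
    unsigned long long need =
        sizeof(double[400000][4]) + sizeof(double[400000]);
    return limit == 0 || need < limit;
}
```

Calling arrays_fit_on_stack(stack_limit_bytes()) before entering the function that declares the arrays gives a rough early warning; with an 8192 kB limit it returns 0.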
You have overflowed the limits of the stack. Your prof declares 15MB of data in main's stack frame. That's just too big.
Since the lifetime of an object declared at the top of main is essentially the entire program, just declare the objects as static. That way they'll be in the (relatively limitless) data segment and have nearly the same lifetime.
Try changing this line:
double A[400000][4], b[400000], c[4] ;
to this:
static double A[400000][4], b[400000], c[4] ;