I have been trying to find divisors of potential factorial primes (numbers of the form n! ± 1), and because I recently bought a Skylake-X workstation I thought I could get some speed-up from AVX512 instructions.
The algorithm is simple: the main step is to take a modulo repeatedly with respect to the same divisor while looping over a large range of n values. Here is the naive approach written in C (P is a table of primes):
uint64_t factorial_naive(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i, residue;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2;
for (n=3; n <= nmax; n++){
residue *= n;
residue %= P[i];
// Let's check whether we found a factor
if (nmin <= n){
if( residue == 1){
report_factor(n, -1, P[i]);
}
if(residue == P[i]- 1){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
The idea is to check a large range of n, e.g. 1,000,000 to 10,000,000, against the same set of divisors, so we take a modulo with respect to the same divisor several million times. Using DIV is very slow, so there are several possible approaches depending on the range of the calculation. In my case n is most likely less than 10^7 and the potential divisor p is less than 10,000 G (< 10^13). Both numbers therefore fit in 64 bits, and even in the 53 bits of a double's mantissa, but the product of the maximum residue (p-1) and n does not fit in 64 bits. So I thought that the simplest version of the Montgomery method would not work, because we are taking the modulo of a number that is larger than 64 bits.
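A quick check of the sizes involved, as a sketch: with p just under 10^13 and n just under 10^7, the product (p-1)*n is close to 10^20, while 2^64 is about 1.8 * 10^19. The helper names below are illustrative only, and `unsigned __int128` is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper: returns 1 if a * b does not fit in 64 bits. */
static int product_overflows_u64(uint64_t a, uint64_t b)
{
    return b != 0 && a > UINT64_MAX / b;
}

/* Reference (a * b) mod p using a 128-bit intermediate.
   unsigned __int128 is a GCC/Clang extension, not standard C. */
static uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t p)
{
    assert(p != 0);
    return (uint64_t)(((unsigned __int128)a * b) % p);
}
```

A reference like `mulmod_u128` is slow (it still divides), but it is handy for validating any faster method on the same inputs.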
I found some old code for PowerPC in which FMA was used to get an accurate product of up to 106 bits (I guess) when using doubles. So I converted this approach to AVX512 using Intel intrinsics. Here is a simple version of the FMA method. It is based on the work of Dekker (1971); "Dekker product" and "FMA version of TwoProduct" are useful search terms when looking for the rationale behind it. This approach has also been discussed in this forum (e.g. here).
int64_t factorial_FMA(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i;
double prime_double, prime_double_reciprocal, quotient, residue;
double nr, n_double, prime_times_quotient_high, prime_times_quotient_low;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2.0;
prime_double = (double)P[i];
prime_double_reciprocal = 1.0 / prime_double;
n_double = 3.0;
for (n=3; n <= nmax; n++){
nr = n_double * residue;
quotient = fma(nr, prime_double_reciprocal, rounding_constant);
quotient -= rounding_constant;
prime_times_quotient_high= prime_double * quotient;
prime_times_quotient_low = fma(prime_double, quotient, -prime_times_quotient_high);
residue = fma(residue, n, -prime_times_quotient_high) - prime_times_quotient_low;
if (residue < 0.0) residue += prime_double;
n_double += 1.0;
// Let's check whether we found a factor
if (nmin <= n){
if( residue == 1.0){
report_factor(n, -1, P[i]);
}
if(residue == prime_double - 1.0){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
Here I have used the magic constant
static const double rounding_constant = 6755399441055744.0;
which is 2^52 + 2^51, the usual rounding constant for doubles: adding it and then subtracting it again rounds a value to the nearest integer.
I converted this to AVX512 (32 potential divisors per loop iteration) and analyzed the result using IACA. It reported "Throughput Bottleneck: Backend" and that backend allocation was stalled due to unavailable allocation resources.
I am not very experienced with assembler, so my question is: is there anything I can do to speed this up and resolve the backend bottleneck?
The AVX512 code is below and can also be found on GitHub:
uint64_t factorial_AVX512_unrolled_four(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
// we are trying to find a factor of factorial numbers: n! +-1
// nmin is the minimum n we want to report and nmax is the maximum. P is the table of primes
// we process 32 primes in one loop.
// the naive version of the algorithm is in the function factorial_naive
// and a simple version of the FMA-based approach is in the function factorial_simpleFMA
const double one_table[8] __attribute__ ((aligned(64))) ={1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
uint64_t n;
__m512d zero, rounding_const, one, n_double;
__m512i prime1, prime2, prime3, prime4;
__m512d residue1, residue2, residue3, residue4;
__m512d prime_double_reciprocal1, prime_double_reciprocal2, prime_double_reciprocal3, prime_double_reciprocal4;
__m512d quotient1, quotient2, quotient3, quotient4;
__m512d prime_times_quotient_high1, prime_times_quotient_high2, prime_times_quotient_high3, prime_times_quotient_high4;
__m512d prime_times_quotient_low1, prime_times_quotient_low2, prime_times_quotient_low3, prime_times_quotient_low4;
__m512d nr1, nr2, nr3, nr4;
__m512d prime_double1, prime_double2, prime_double3, prime_double4;
__m512d prime_minus_one1, prime_minus_one2, prime_minus_one3, prime_minus_one4;
__mmask8 negative_reminder_mask1, negative_reminder_mask2, negative_reminder_mask3, negative_reminder_mask4;
__mmask8 found_factor_mask11, found_factor_mask12, found_factor_mask13, found_factor_mask14;
__mmask8 found_factor_mask21, found_factor_mask22, found_factor_mask23, found_factor_mask24;
// load data and initialize variables for loop
rounding_const = _mm512_set1_pd(rounding_constant);
one = _mm512_load_pd(one_table);
zero = _mm512_setzero_pd ();
// load primes used to sieve
prime1 = _mm512_load_epi64((__m512i *) &P[0]);
prime2 = _mm512_load_epi64((__m512i *) &P[8]);
prime3 = _mm512_load_epi64((__m512i *) &P[16]);
prime4 = _mm512_load_epi64((__m512i *) &P[24]);
// convert primes to double
prime_double1 = _mm512_cvtepi64_pd (prime1); // vcvtqq2pd
prime_double2 = _mm512_cvtepi64_pd (prime2); // vcvtqq2pd
prime_double3 = _mm512_cvtepi64_pd (prime3); // vcvtqq2pd
prime_double4 = _mm512_cvtepi64_pd (prime4); // vcvtqq2pd
// calculates 1.0/ prime
prime_double_reciprocal1 = _mm512_div_pd(one, prime_double1);
prime_double_reciprocal2 = _mm512_div_pd(one, prime_double2);
prime_double_reciprocal3 = _mm512_div_pd(one, prime_double3);
prime_double_reciprocal4 = _mm512_div_pd(one, prime_double4);
// for comparison if we have found factors for n!+1
prime_minus_one1 = _mm512_sub_pd(prime_double1, one);
prime_minus_one2 = _mm512_sub_pd(prime_double2, one);
prime_minus_one3 = _mm512_sub_pd(prime_double3, one);
prime_minus_one4 = _mm512_sub_pd(prime_double4, one);
// residue init
residue1 = _mm512_set1_pd(2.0);
residue2 = _mm512_set1_pd(2.0);
residue3 = _mm512_set1_pd(2.0);
residue4 = _mm512_set1_pd(2.0);
// double counter init
n_double = _mm512_set1_pd(3.0);
// main loop starts here. typical value for nmax can be 5,000,000 -> 10,000,000
for (n=3; n<=nmax; n++) // main loop
{
// timings for instructions:
// _mm512_load_epi64 = vmovdqa64 : L 1, T 0.5
// _mm512_load_pd = vmovapd : L 1, T 0.5
// _mm512_set1_pd
// _mm512_div_pd = vdivpd : L 23, T 16
// _mm512_cvtepi64_pd = vcvtqq2pd : L 4, T 0.5
// _mm512_mul_pd = vmulpd : L 4, T 0.5
// _mm512_fmadd_pd = vfmadd132pd, vfmadd213pd, vfmadd231pd : L 4, T 0.5
// _mm512_fmsub_pd = vfmsub132pd, vfmsub213pd, vfmsub231pd : L 4, T 0.5
// _mm512_sub_pd = vsubpd : L 4, T 0.5
// _mm512_cmplt_pd_mask = vcmppd : L ?, T 1
// _mm512_mask_add_pd = vaddpd : L 4, T 0.5
// _mm512_cmpeq_pd_mask = vcmppd : L ?, T 1
// _mm512_kor = korw L 1, T 1
// nr = residue * n
nr1 = _mm512_mul_pd (residue1, n_double);
nr2 = _mm512_mul_pd (residue2, n_double);
nr3 = _mm512_mul_pd (residue3, n_double);
nr4 = _mm512_mul_pd (residue4, n_double);
// quotient = nr * 1.0/ prime_double + rounding_constant
quotient1 = _mm512_fmadd_pd(nr1, prime_double_reciprocal1, rounding_const);
quotient2 = _mm512_fmadd_pd(nr2, prime_double_reciprocal2, rounding_const);
quotient3 = _mm512_fmadd_pd(nr3, prime_double_reciprocal3, rounding_const);
quotient4 = _mm512_fmadd_pd(nr4, prime_double_reciprocal4, rounding_const);
// quotient -= rounding_constant, now quotient is rounded to integer
// quotient should be at most nmax (10,000,000)
quotient1 = _mm512_sub_pd(quotient1, rounding_const);
quotient2 = _mm512_sub_pd(quotient2, rounding_const);
quotient3 = _mm512_sub_pd(quotient3, rounding_const);
quotient4 = _mm512_sub_pd(quotient4, rounding_const);
// now we calculate the high and low parts of prime * quotient using the Dekker product (FMA).
// the quotient itself is an approximation, but the product is exact for the given quotient
prime_times_quotient_high1 = _mm512_mul_pd(quotient1, prime_double1);
prime_times_quotient_high2 = _mm512_mul_pd(quotient2, prime_double2);
prime_times_quotient_high3 = _mm512_mul_pd(quotient3, prime_double3);
prime_times_quotient_high4 = _mm512_mul_pd(quotient4, prime_double4);
prime_times_quotient_low1 = _mm512_fmsub_pd(quotient1, prime_double1, prime_times_quotient_high1);
prime_times_quotient_low2 = _mm512_fmsub_pd(quotient2, prime_double2, prime_times_quotient_high2);
prime_times_quotient_low3 = _mm512_fmsub_pd(quotient3, prime_double3, prime_times_quotient_high3);
prime_times_quotient_low4 = _mm512_fmsub_pd(quotient4, prime_double4, prime_times_quotient_high4);
// now we calculate the new remainder using the Dekker product and the original values
// we subtract the prime * quotient calculated above (quotient is an approximation)
residue1 = _mm512_fmsub_pd(residue1, n_double, prime_times_quotient_high1);
residue2 = _mm512_fmsub_pd(residue2, n_double, prime_times_quotient_high2);
residue3 = _mm512_fmsub_pd(residue3, n_double, prime_times_quotient_high3);
residue4 = _mm512_fmsub_pd(residue4, n_double, prime_times_quotient_high4);
residue1 = _mm512_sub_pd(residue1, prime_times_quotient_low1);
residue2 = _mm512_sub_pd(residue2, prime_times_quotient_low2);
residue3 = _mm512_sub_pd(residue3, prime_times_quotient_low3);
residue4 = _mm512_sub_pd(residue4, prime_times_quotient_low4);
// let's check if remainder < 0
negative_reminder_mask1 = _mm512_cmplt_pd_mask(residue1,zero);
negative_reminder_mask2 = _mm512_cmplt_pd_mask(residue2,zero);
negative_reminder_mask3 = _mm512_cmplt_pd_mask(residue3,zero);
negative_reminder_mask4 = _mm512_cmplt_pd_mask(residue4,zero);
// we add prime back to the remainder using the mask if it was < 0
residue1 = _mm512_mask_add_pd(residue1, negative_reminder_mask1, residue1, prime_double1);
residue2 = _mm512_mask_add_pd(residue2, negative_reminder_mask2, residue2, prime_double2);
residue3 = _mm512_mask_add_pd(residue3, negative_reminder_mask3, residue3, prime_double3);
residue4 = _mm512_mask_add_pd(residue4, negative_reminder_mask4, residue4, prime_double4);
n_double = _mm512_add_pd(n_double,one);
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
{ // we find factor very rarely
double *residual_list1 = (double *) &residue1;
double *residual_list2 = (double *) &residue2;
double *residual_list3 = (double *) &residue3;
double *residual_list4 = (double *) &residue4;
double *prime_list1 = (double *) &prime_double1;
double *prime_list2 = (double *) &prime_double2;
double *prime_list3 = (double *) &prime_double3;
double *prime_list4 = (double *) &prime_double4;
for (int i=0; i <8; i++){
if( residual_list1[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list1[i]);
}
if( residual_list2[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list2[i]);
}
if( residual_list3[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list3[i]);
}
if( residual_list4[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list4[i]);
}
if(residual_list1[i] == (prime_list1[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list1[i]);
}
if(residual_list2[i] == (prime_list2[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list2[i]);
}
if(residual_list3[i] == (prime_list3[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list3[i]);
}
if(residual_list4[i] == (prime_list4[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list4[i]);
}
}
}
}
return EXIT_SUCCESS;
}
As a few commenters have suggested: a "backend" bottleneck is what you'd expect for this code. That suggests you're keeping things pretty well fed, which is what you want.
Looking at the report, there should be an opportunity in this section:
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
From the IACA analysis:
| 1 | 1.0 | | | | | | | | kmovw r11d, k0
| 1 | 1.0 | | | | | | | | kmovw eax, k1
| 1 | 1.0 | | | | | | | | kmovw ecx, k2
| 1 | 1.0 | | | | | | | | kmovw esi, k3
| 1 | 1.0 | | | | | | | | kmovw edi, k4
| 1 | 1.0 | | | | | | | | kmovw r8d, k5
| 1 | 1.0 | | | | | | | | kmovw r9d, k6
| 1 | 1.0 | | | | | | | | kmovw r10d, k7
| 1 | | 1.0 | | | | | | | or r11d, eax
| 1 | | | | | | | 1.0 | | or r11d, ecx
| 1 | | 1.0 | | | | | | | or r11d, esi
| 1 | | | | | | | 1.0 | | or r11d, edi
| 1 | | 1.0 | | | | | | | or r11d, r8d
| 1 | | | | | | | 1.0 | | or r11d, r9d
| 1* | | | | | | | | | or r11d, r10d
The processor is moving the resulting comparison masks (k0-k7) over to regular registers for the "or" operation. You should be able to eliminate those moves and do the "or" rollup in 6 ops instead of 8.
NOTE: the found_factor_mask variables are declared as __mmask8, which is the right width here (a 512-bit vector holds 8 doubles, not 16), so the mask type itself is not the problem. If the compiler still moves the masks to general-purpose registers before combining them, drop to assembly as a commenter noted.
And related: what fraction of iterations fire this or-mask clause? As another commenter observed, you should be able to unroll this with an accumulating "or" operation. Check the accumulated "or" value at the end of each unrolled iteration (or after N iterations), and if it's "true", go back and redo the values to figure out which n value triggered it.
(You can also binary search within the "roll" to find the matching n value, which might give some gain.)
Next, you should be able to get rid of this mid-loop check:
// if we are below nmin then we continue next iteration, we
if (n < nmin) continue;
Which shows up here:
| 1* | | | | | | | | | cmp r14, 0x3e8
| 0*F | | | | | | | | | jb 0x229
It may not be a huge gain since the predictor will (probably) get this one (mostly) right, but you should get some gains by having two distinct loops for two "phases":
n=3 to n=nmin-1
n=nmin and beyond
Even if you gain a cycle, that's 3%. And since that's generally related to the big 'or' operation, above, there may be more cleverness in there to be found.
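The two suggestions above (defer the factor check to the end of a block, and split the loop at nmin) can be sketched in plain scalar C to show the control flow. This is only an illustration under assumptions: `hit_t` and `scan_blocked` are hypothetical names, and the sketch requires p < 2^32 and nmax < 2^32 so that ordinary 64-bit arithmetic is enough, which is far smaller than the primes in the question. In the AVX512 version the per-iteration flag would be an OR of the comparison masks instead of a scalar.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* sign is -1 when p divides n!-1 and +1 when p divides n!+1 */
typedef struct { uint64_t n; int sign; } hit_t;

static size_t scan_blocked(uint64_t p, uint64_t nmin, uint64_t nmax,
                           uint64_t block, hit_t *hits, size_t max_hits)
{
    assert(p > 1 && p < (1ULL << 32) && nmax < (1ULL << 32));
    assert(nmin >= 3 && block >= 1);
    uint64_t r = 2 % p;
    uint64_t n = 3;
    size_t count = 0;

    /* Phase 1: below nmin nothing is reported, so this loop has no checks. */
    for (; n < nmin && n <= nmax; n++)
        r = (r * n) % p;

    /* Phase 2: accumulate an "any hit" flag and inspect it once per block. */
    while (n <= nmax) {
        uint64_t start_n = n, start_r = r;
        uint64_t end = (nmax - n < block) ? nmax : n + block - 1;
        unsigned any = 0;
        for (; n <= end; n++) {
            r = (r * n) % p;
            any |= (unsigned)(r == 1) | (unsigned)(r == p - 1);
        }
        if (any) {
            /* Rare path: replay the block to find which n fired. */
            uint64_t rr = start_r;
            for (uint64_t m = start_n; m <= end; m++) {
                rr = (rr * m) % p;
                if (rr == 1 && count < max_hits)
                    hits[count++] = (hit_t){ m, -1 };
                if (rr == p - 1 && count < max_hits)
                    hits[count++] = (hit_t){ m, +1 };
            }
        }
    }
    return count;
}
```

The replay costs one extra pass over a block, so the block size trades the cost of the rare replay against the number of flag checks saved.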
I am trying to write a simple ray tracer. The final image should look like this: [reference image]. I have read up on the technique, and below is what I am doing:
create an empty image (to fill each pixel, via ray tracing)
for each pixel [for each row, each column]
    create the equation of the ray emanating from our pixel
    trace() ray:
        if ray intersects SPHERE
            compute local shading (including shadow determination)
            return color;
The scene data looks like this. It sets a gray sphere of radius 1 at (0,0,-3) and a white light source at the origin.
2
amb: 0.3 0.3 0.3
sphere
pos: 0.0 0.0 -3.0
rad: 1
dif: 0.3 0.3 0.3
spe: 0.5 0.5 0.5
shi: 1
light
pos: 0 0 0
col: 1 1 1
Mine looks very weird. Here is the sphere intersection code:
//check ray intersection with the sphere
boolean intersectsWithSphere(struct point rayPosition, struct point rayDirection, Sphere sp,float* t){
//float a = (rayDirection.x * rayDirection.x) + (rayDirection.y * rayDirection.y) +(rayDirection.z * rayDirection.z);
// value for a is 1 since rayDirection vector is normalized
double radius = sp.radius;
double xc = sp.position[0];
double yc =sp.position[1];
double zc =sp.position[2];
double xo = rayPosition.x;
double yo = rayPosition.y;
double zo = rayPosition.z;
double xd = rayDirection.x;
double yd = rayDirection.y;
double zd = rayDirection.z;
double b = 2 * ((xd*(xo-xc))+(yd*(yo-yc))+(zd*(zo-zc)));
double c = (xo-xc)*(xo-xc) + (yo-yc)*(yo-yc) + (zo-zc)*(zo-zc) - (radius * radius);
float D = b*b + (-4.0f)*c;
//ray does not intersect the sphere
if(D < 0 ){
return false;
}
D = sqrt(D);
float t0 = (-b - D)/2 ;
float t1 = (-b + D)/2;
//printf("D=%f",D);
//printf(" t0=%f",t0);
//printf(" t1=%f\n",t1);
if((t0 > 0) && (t1 > 0)){
*t = min(t0,t1);
return true;
}
else {
*t = 0;
return false;
}
}
Below is the trace() function:
unsigned char* trace(struct point rayPosition, struct point rayDirection, Sphere * totalspheres) {
struct point tempRayPosition = rayPosition;
struct point tempRayDirection = rayDirection;
float f=0;
float tnear = INFINITY;
boolean sphereIntersectionFound = false;
int sphereIndex = -1;
for(int i=0; i < num_spheres ; i++){
float t = INFINITY;
if(intersectsWithSphere(tempRayPosition,tempRayDirection,totalspheres[i],&t)){
if(t < tnear){
tnear = t;
sphereIntersectionFound = true;
sphereIndex = i;
}
}
}
if(sphereIndex < 0){
//printf("No interesection found\n");
mycolor[0] = 1;
mycolor[1] = 1;
mycolor[2] = 1;
return mycolor;
}
else {
Sphere sp = totalspheres[sphereIndex];
//intersection point
hitPoint[0].x = tempRayPosition.x + tempRayDirection.x * tnear;
hitPoint[0].y = tempRayPosition.y + tempRayDirection.y * tnear;
hitPoint[0].z = tempRayPosition.z + tempRayDirection.z * tnear;
//normal at the intersection point
normalAtHitPoint[0].x = (hitPoint[0].x - totalspheres[sphereIndex].position[0])/ totalspheres[sphereIndex].radius;
normalAtHitPoint[0].y = (hitPoint[0].y - totalspheres[sphereIndex].position[1])/ totalspheres[sphereIndex].radius;
normalAtHitPoint[0].z = (hitPoint[0].z - totalspheres[sphereIndex].position[2])/ totalspheres[sphereIndex].radius;
normalizedNormalAtHitPoint[0] = normalize(normalAtHitPoint[0]);
for(int j=0; j < num_lights ; j++) {
for(int k=0; k < num_spheres ; k++){
shadowRay[0].x = lights[j].position[0] - hitPoint[0].x;
shadowRay[0].y = lights[j].position[1] - hitPoint[0].y;
shadowRay[0].z = lights[j].position[2] - hitPoint[0].z;
normalizedShadowRay[0] = normalize(shadowRay[0]);
//R = 2 * ( N dot L) * N - L
reflectionRay[0].x = - 2 * dot(normalizedShadowRay[0],normalizedNormalAtHitPoint[0]) * normalizedNormalAtHitPoint[0].x +normalizedShadowRay[0].x;
reflectionRay[0].y = - 2 * dot(normalizedShadowRay[0],normalizedNormalAtHitPoint[0]) * normalizedNormalAtHitPoint[0].y +normalizedShadowRay[0].y;
reflectionRay[0].z = - 2 * dot(normalizedShadowRay[0],normalizedNormalAtHitPoint[0]) * normalizedNormalAtHitPoint[0].z +normalizedShadowRay[0].z;
normalizeReflectionRay[0] = normalize(reflectionRay[0]);
struct point temp;
temp.x = hitPoint[0].x + (shadowRay[0].x * 0.0001 );
temp.y = hitPoint[0].y + (shadowRay[0].y * 0.0001);
temp.z = hitPoint[0].z + (shadowRay[0].z * 0.0001);
struct point ntemp = normalize(temp);
float f=0;
struct point tempHitPoint;
tempHitPoint.x = hitPoint[0].x + 0.001;
tempHitPoint.y = hitPoint[0].y + 0.001;
tempHitPoint.z = hitPoint[0].z + 0.001;
if(intersectsWithSphere(hitPoint[0],ntemp,totalspheres[k],&f)){
// if(intersectsWithSphere(tempHitPoint,ntemp,totalspheres[k],&f)){
printf("In shadow\n");
float r = lights[j].color[0];
float g = lights[j].color[1];
float b = lights[j].color[2];
mycolor[0] = ambient_light[0] + r;
mycolor[1] = ambient_light[1] + g;
mycolor[2] = ambient_light[2] + b;
return mycolor;
} else {
// point is not is shadow , use Phong shading to determine the color of the point.
//I = lightColor * (kd * (L dot N) + ks * (R dot V) ^ sh)
//(for each color channel separately; note that if L dot N < 0, you should clamp L dot N to zero; same for R dot V)
float x = dot(normalizedShadowRay[0],normalizedNormalAtHitPoint[0]);
if(x < 0)
x = 0;
V[0].x = - rayDirection.x;
V[0].x = - rayDirection.y;
V[0].x = - rayDirection.z;
normalizedV[0] = normalize(V[0]);
float y = dot(normalizeReflectionRay[0],normalizedV[0]);
if(y < 0)
y = 0;
float ar = totalspheres[sphereIndex].color_diffuse[0] * x;
float br = totalspheres[sphereIndex].color_specular[0] * pow(y,totalspheres[sphereIndex].shininess);
float r = lights[j].color[0] * (ar+br);
//----------------------------------------------------------------------------------
float bg = totalspheres[sphereIndex].color_specular[1] * pow(y,totalspheres[sphereIndex].shininess);
float ag = totalspheres[sphereIndex].color_diffuse[1] * x;
float g = lights[j].color[1] * (ag+bg);
//----------------------------------------------------------------------------------
float bb = totalspheres[sphereIndex].color_specular[2] * pow(y,totalspheres[sphereIndex].shininess);
float ab = totalspheres[sphereIndex].color_diffuse[2] * x;
float b = lights[j].color[2] * (ab+bb);
mycolor[0] = r + ambient_light[0];
mycolor[1] = g + ambient_light[1];
mycolor[2] = b+ ambient_light[2];
return mycolor;
}
}
}
}
}
The code calling trace() looks like :
void draw_scene()
{
//Aspect Ratio
double a = WIDTH / HEIGHT;
double angel = tan(M_PI * 0.5 * fov/ 180);
ray[0].x = 0.0;
ray[0].y = 0.0;
ray[0].z = 0.0;
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
unsigned int x,y;
float sx, sy;
for(x=0;x < WIDTH;x++)
{
glPointSize(2.0);
glBegin(GL_POINTS);
for(y=0;y < HEIGHT;y++)
{
sx = (((x + 0.5) / WIDTH) * 2.0 ) - 1;
sy = (((y + 0.5) / HEIGHT) * 2.0 ) - 1;;
sx = sx * angel * a;
sy = sy * angel;
//set ray direction
ray[1].x = sx;
ray[1].y = sy;
ray[1].z = -1;
normalizedRayDirection[0] = normalize(ray[1]);
unsigned char* color = trace(ray[0],normalizedRayDirection[0],spheres);
unsigned char x1 = color[0] * 255;
unsigned char y1 = color[1] * 255;
unsigned char z1 = color[2] * 255;
plot_pixel(x,y,x1 %256,y1%256,z1%256);
}
glEnd();
glFlush();
}
}
There could be many, many problems with the code/understanding.
I haven't taken the time to understand all your code, and I'm definitely not a graphics expert, but I believe the problem you have is called "surface acne". In this case it's probably happening because your shadow rays are intersecting with the object itself. What I did in my code to fix this is add epsilon * hitPoint.normal to the shadow ray origin. This effectively moves the ray away from your object a bit, so they don't intersect.
The value I'm using for epsilon is the square root of 1.19209290 * 10^-7, as that is the square root of a constant called EPSILON that is defined in the particular language I'm using.
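As a rough sketch of that offset in C: the constant 1.19209290 * 10^-7 is the value of `FLT_EPSILON` from `<float.h>`, so its square root is about 3.45 * 10^-4. The helper name `offset_shadow_origin` is hypothetical, and the `struct point` layout with double fields is an assumption about the question's code.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Assumed layout; the question's struct point may use a different type. */
struct point { double x, y, z; };

/* Hypothetical helper: moves the shadow ray origin a small distance off
   the surface along the unit normal, so the ray does not hit its own sphere. */
static struct point offset_shadow_origin(struct point hit, struct point normal)
{
    double len2 = normal.x * normal.x + normal.y * normal.y + normal.z * normal.z;
    assert(fabs(len2 - 1.0) < 1e-6);                   /* normal must be unit length */
    const double epsilon = sqrt((double)FLT_EPSILON);  /* about 3.45e-4 */
    struct point origin = {
        hit.x + epsilon * normal.x,
        hit.y + epsilon * normal.y,
        hit.z + epsilon * normal.z
    };
    return origin;
}
```

The shadow ray would then start at the returned point and head toward the light, instead of starting exactly at the hit point.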
What possible reason do you have for doing this (in the non-shadow branch of trace (...)):
V[0].x = - rayDirection.x;
V[0].x = - rayDirection.y;
V[0].x = - rayDirection.z;
You might as well comment out the first two computations since you write the results of each to the same component. I think you probably meant to do this instead:
V[0].x = - rayDirection.x;
V[0].y = - rayDirection.y;
V[0].z = - rayDirection.z;
That said, you should also avoid using GL_POINTS primitives to cover a 2x2 pixel quad. Point primitives are not guaranteed to be square, and OpenGL implementations are not required to support any size other than 1.0. In practice, most support 1.0 to roughly 64.0, but glDrawPixels (...) is a much better way of writing 2x2 pixels, since it skips primitive assembly and the limitations mentioned above. You are using immediate mode in this example anyway, so glRasterPos (...) and glDrawPixels (...) are still a valid approach.
It seems you are implementing the formula here, but you deviate at the end from the direction the article takes.
First, the article warns that D and b can be very close in value, so that -b + D suffers from cancellation and keeps very few significant digits. They suggest an alternative.
Also, you are testing that both t0 & t1 > 0. This doesn't have to be true for you to hit the sphere, you could be inside of it (though you obviously should not be in your test scene).
Finally, I would add a test at the beginning to confirm that the direction vector is normalized. I've messed that up more than once in my renderers.
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
I'm looking to implement an FFT algorithm on microcontrollers, so I want to simulate the code before actually using it.
I found two examples, which I converted to MATLAB code, but the result just isn't what I expected.
Here is the code:
function [ H ] = fft_2( g )
%FFT2 Summary of this function goes here
% Detailed explanation goes here
NUMDATA = length(g);
NUMPOINTS = NUMDATA/2;
N = NUMPOINTS;
% for(k=0; k<N; k++)
% {
% IA[k].imag = -(short)(16383.0*(-cos(2*pi/(double)(2*N)*(double)k)));
% IA[k].real = (short)(16383.0*(1.0 - sin(2*pi/(double)(2*N)*(double)k)));
% IB[k].imag = -(short)(16383.0*(cos(2*pi/(double)(2*N)*(double)k)));
% IB[k].real = (short)(16383.0*(1.0 + sin(2*pi/(double)(2*N)*(double)k)));
% }
for k=0:(N-1)
IA(k+1,2) = -floor(16383.0*(-cos(2*pi/(2*N)*k)));
IA(k+1,1) = floor(16383.0*(1.0 - sin(2*pi/(2*N)*k)));
IB(k+1,2) = -floor(16383.0*(cos(2*pi/(2*N)*k)));
IB(k+1,1) = floor(16383.0*(1.0 + sin(2*pi/(2*N)*k)));
end
% Note, IA(k) is the complex conjugate of A(k) and IB(k) is the complex conjugate of
% B(k).
% *********************************************************************************/
% #include <math.h>
% #include "params1.h"
% #include "params.h"
% extern short g[];
% void dft(int, COMPLEX *);
% void split(int, COMPLEX *, COMPLEX *, COMPLEX *, COMPLEX *);
% main()
% {
% int n, k;
% COMPLEX x[NUMPOINTS+1]; /* array of complex DFT data */
% COMPLEX A[NUMPOINTS]; /* array of complex A coefficients */
% COMPLEX B[NUMPOINTS]; /* array of complex B coefficients */
% COMPLEX IA[NUMPOINTS]; /* array of complex A* coefficients */
% COMPLEX IB[NUMPOINTS]; /* array of complex B* coefficients */
% COMPLEX G[2*NUMPOINTS]; /* array of complex DFT result */
% for(k=0; k<NUMPOINTS; k++)
for k=0:(NUMPOINTS-1)
% {
% A[k].imag = (short)(16383.0*(-cos(2*pi/(double)(2*NUMPOINTS)*(double)k)));
% A[k].real = (short)(16383.0*(1.0 - sin(2*pi/(double)(2*NUMPOINTS)*(double)k)));
% B[k].imag = (short)(16383.0*(cos(2*pi/(double)(2*NUMPOINTS)*(double)k)));
% B[k].real = (short)(16383.0*(1.0 + sin(2*pi/(double)(2*NUMPOINTS)*(double)k)));
% IA[k].imag = -A[k].imag;
% IA[k].real = A[k].real;
% IB[k].imag = -B[k].imag;
% IB[k].real = B[k].real;
% }
A(k+1, 2) = floor(16383.0*(-cos(2*pi/(2*NUMPOINTS)*k)));
A(k+1, 1) = floor(16383.0*(1.0 - sin(2*pi/(2*NUMPOINTS)*k)));
B(k+1, 2) = floor(16383.0*(cos(2*pi/(2*NUMPOINTS)*k)));
B(k+1, 1) = floor(16383.0*(1.0 + sin(2*pi/(2*NUMPOINTS)*k)));
IA(k+1, 2) = -A(k+1, 2);
IA(k+1, 1) = A(k+1, 1);
IB(k+1, 2) = -B(k+1, 2);
IB(k+1, 1) = B(k+1, 1);
end
% /* Forward DFT */
% /* From the 2N point real sequence, g(n), for the N-point complex sequence, x(n) */
% for (n=0; n<NUMPOINTS; n++)
% {
for n=0:(NUMPOINTS-1)
% x[n].imag = g[2*n + 1]; /* x2(n) = g(2n + 1) */
% x[n].real = g[2*n]; /* x1(n) = g(2n) */
% }
x(n+1,2)=g(2*n + 1+1);
x(n+1,1)=g(2*n +1);
end
% /* Compute the DFT of x(n) to get X(k) -> X(k) = DFT{x(n)} */
% dft(NUMPOINTS, x);
% void dft(int N, COMPLEX *X)
% {
% int n, k;
% double arg;
% int Xr[1024];
% int Xi[1024];
% short Wr, Wi;
% for(k=0; k<N; k++)
% {
N=NUMPOINTS;
for k=0:(N-1)
% Xr[k] = 0;
% Xi[k] = 0;
Xr(k+1)=0;
Xi(k+1)=0;
% for(n=0; n<N; n++)
% {
for n=0:(N-1)
% arg =(2*PI*k*n)/N;
% Wr = (short)((double)32767.0 * cos(arg));
% Wi = (short)((double)32767.0 * sin(arg));
% Xr[k] = Xr[k] + X[n].real * Wr + X[n].imag * Wi;
% Xi[k] = Xi[k] + X[n].imag * Wr - X[n].real * Wi;
arg = (2*pi*k*n)/N;
Wr = floor(32767*cos(arg));
Wi = floor(32767*sin(arg));
Xr(k+1) = Xr(k+1)+x(n+1,1)*Wr+x(n+1,2)*Wi;
Xi(k+1) = Xr(k+1)+x(n+1,2)*Wr-x(n+1,1)*Wi;
% }
% }
end
end
% for (k=0;k<N;k++)
% {
for k=0:(N-1)
% X[k].real = (short)(Xr[k]>>15);
% X[k].imag = (short)(Xi[k]>>15);
x(k+1,1)=floor(Xr(k+1)/pow2(15));
x(k+1,2)=floor(Xi(k+1)/pow2(15));
% }
% }
end
% /* Because of the periodicity property of the DFT, we know that X(N+k)=X(k). */
% x[NUMPOINTS].real = x[0].real;
% x[NUMPOINTS].imag = x[0].imag;
x(NUMPOINTS+1,1)=x(1,1);
x(NUMPOINTS+1,2)=x(1,2);
% /* The split function performs the additional computations required to get
% G(k) from X(k). */
% split(NUMPOINTS, x, A, B, G);
% void split(int N, COMPLEX *X, COMPLEX *A, COMPLEX *B, COMPLEX *G)
% {
% int k;
% int Tr, Ti;
% for (k=0; k<N; k++)
% {
for k=0:(NUMPOINTS-1)
% Tr = (int)X[k].real * (int)A[k].real - (int)X[k].imag * (int)A[k].imag +
% (int)X[N-k].real * (int)B[k].real + (int)X[N-k].imag * (int)B[k].imag;
Tr = x(k+1,1)*A(k+1,1)-x(k+1,2)*A(k+1,2)+x(NUMPOINTS-k+1,1)*B(k+1,1)+x(NUMPOINTS-k+1,2)*B(k+1,2);
% G[k].real = (short)(Tr>>15);
G(k+1,1)=floor(Tr/pow2(15));
% Ti = (int)X[k].imag * (int)A[k].real + (int)X[k].real * (int)A[k].imag +
% (int)X[N-k].real * (int)B[k].imag - (int)X[N-k].imag * (int)B[k].real;
Ti = x(k+1,2)*A(k+1,1)+x(k+1,1)*A(k+1,2)+x(NUMPOINTS-k+1,1)*B(k+1,2)-x(NUMPOINTS-k+1,2)*B(k+1,1);
% G[k].imag = (short)(Ti>>15);
G(k+1,2)=floor(Ti/pow2(15));
% }
end
% }
% /* Use complex conjugate symmetry properties to get the rest of G(k) */
% G[NUMPOINTS].real = x[0].real - x[0].imag;
% G[NUMPOINTS].imag = 0;
% for (k=1; k<NUMPOINTS; k++)
% {
% G[2*NUMPOINTS-k].real = G[k].real;
% G[2*NUMPOINTS-k].imag = -G[k].imag;
% }
G(NUMPOINTS+1,1) = x(1,1) - x(1,2);
G(NUMPOINTS+1,2) = 0;
for k=1:(NUMPOINTS-1)
G(2*NUMPOINTS-k+1,1) = G(k+1,1);
G(2*NUMPOINTS-k+1,2) = -G(k+1,2);
end
for k=1:(NUMDATA)
H(k)=sqrt(G(k,1)*G(k,1)+G(k,2)*G(k,2));
end
end
Another one:
function [ fr, fi ] = fix_fft( fr, fi )
%UNTITLED Summary of this function goes here
% Detailed explanation goes here
N_WAVE = 1024; % full length of Sinewave[]
LOG2_N_WAVE = 10; % log2(N_WAVE)
m = nextpow2(length(fr));
% void fix_fft(short fr[], short fi[], short m)
% {
% long int mr = 0, nn, i, j, l, k, istep, n, shift;
mr=0;
% short qr, qi, tr, ti, wr, wi;
%
% n = 1 << m;
n = pow2(m);
% nn = n - 1;
nn = n-1;
%
% /* max FFT size = N_WAVE */
% //if (n > N_WAVE) return -1;
%
% /* decimation in time - re-order data */
% for (m=1; m<=nn; ++m)
for m=1:nn
% {
% l = n;
l=n;
% do
% {
% l >>= 1;
% } while (mr+l > nn);
not_done = true;
while(mr+l>nn || not_done)
l=floor(l/2);
not_done=false;
end
%
% mr = (mr & (l-1)) + l;
mr = (mr & (l-1)) + l;
% if (mr <= m) continue;
if (mr <= m)
continue
end
%
% tr = fr[m];
% fr[m] = fr[mr];
% fr[mr] = tr;
% ti = fi[m];
% fi[m] = fi[mr];
% fi[mr] = ti;
tr = fr(m+1);
fr(m+1) = fr(mr+1);
fr(mr+1) = tr;
ti = fi(m+1);
fi(m+1) = fi(mr+1);
fi(mr+1) = ti;
% }
end
%
% l = 1;
% k = LOG2_N_WAVE-1;
l=1;
k = LOG2_N_WAVE-1;
%
% while (l < n)
% {
while (l < n)
% /*
% fixed scaling, for proper normalization --
% there will be log2(n) passes, so this results
% in an overall factor of 1/n, distributed to
% maximize arithmetic accuracy.
%
% It may not be obvious, but the shift will be
% performed on each data point exactly once,
% during this pass.
% */
%
% // Variables for multiplication code
% long int c;
% short b;
%
% istep = l << 1;
istep = l*2;
% for (m=0; m<l; ++m)
% {
for m=0:(l-1)
% j = m << k;
% /* 0 <= j < N_WAVE/2 */
% wr = Sinewave[j+N_WAVE/4];
% wi = -Sinewave[j];
j = m*(pow2( k));
wr = sin((j+N_WAVE/4)*2*pi/N_WAVE)*32768;
wi = -sin(j*2*pi/1024)*32768;
%
% wr >>= 1;
% wi >>= 1;
wr = floor(wr/2);
wi = floor(wi/2);
%
% for (i=m; i<n; i+=istep)
% {
i=m;
while(i<n)
% j = i + l;
j = i+l;
%
% // Here I unrolled the multiplications to prevent overhead
% // for procedural calls (we don't need to be clever about
% // the actual multiplications since the pic has an onboard
% // 8x8 multiplier in the ALU):
%
% // tr = FIX_MPY(wr,fr[j]) - FIX_MPY(wi,fi[j]);
% c = ((long int)wr * (long int)fr[j]);
% c = c >> 14;
% b = c & 0x01;
% tr = (c >> 1) + b;
c = wr * fr(j+1);
c = floor(c / pow2(14));
b = c & 1;
tr = floor(c /2) + b;
%
% c = ((long int)wi * (long int)fi[j]);
% c = c >> 14;
% b = c & 0x01;
% tr = tr - ((c >> 1) + b);
c = wi * fi(j+1);
c = floor(c / pow2(14));
b = c & 1;
tr = tr - (floor((c/2)) + b);
%
% // ti = FIX_MPY(wr,fi[j]) + FIX_MPY(wi,fr[j]);
% c = ((long int)wr * (long int)fi[j]);
% c = c >> 14;
% b = c & 0x01;
% ti = (c >> 1) + b;
c = wr*fi(j+1);
c = floor(c / pow2(14));
b = c & 1;
ti = floor((c /2)) + b;
%
% c = ((long int)wi * (long int)fr[j]);
% c = c >> 14;
% b = c & 0x01;
% ti = ti + ((c >> 1) + b);
c = wi * fr(j+1);
c = floor(c / pow2(14));
b = c & 1;
ti = ti + (floor((c/2)) + b);
%
% qr = fr[i];
% qi = fi[i];
% qr >>= 1;
% qi >>= 1;
%
% fr[j] = qr - tr;
% fi[j] = qi - ti;
% fr[i] = qr + tr;
% fi[i] = qi + ti;
qr = fr(i+1);
qi = fi(i+1);
qr = floor(qr/2);
qi = floor(qi/2);
fr(j+1) = qr - tr;
fi(j+1) = qi - ti;
fr(i+1) = qr + tr;
fi(i+1) = qi + ti;
% }
i = i+istep;
end
% }
end
%
% --k;
% l = istep;
% }
k=k-1;
l=istep;
end
% }
end
The commented lines are the original C code; the uncommented lines are my MATLAB translation.
Then I tested both functions with this:
function [ r ] = mfft( f )
%MFFT Summary of this function goes here
% Detailed explanation goes here
Fs = 2048;
T = 1/Fs;
L = 2048;
t = (0:L-1)*T;
NFFT = 2^nextpow2(L);
l = length(f);
y = 0;
for k=1:l
y = y + sin(2*pi*f(k)*t);
end
%sound(y, Fs);
Y = fft(y,NFFT)/L;
YY = fft_2(y)/L;
[Y1 Y2] = fix_fft(y, zeros(1, L));
YYY = Y1+j()*Y2;
f = Fs/2*linspace(0,1,NFFT/2+1);
plot(f, 2*abs(Y(1:NFFT/2+1)), ':+b');
hold on
plot(f, 2*abs(YY(1:NFFT/2+1)), '-or');
plot(f, abs(YYY(1:NFFT/2+1)), '--*g');
hold off
r=0;
end
Basically, it creates a simple sine wave with a specific frequency (say 400 Hz).
The expected FFT output is a single spike at 400 Hz. The built-in FFT function agrees, but the other two implementations do not.
Here's the output graph:
The blue line is from the built-in function and is what I expected.
The red line is from fft_2. It looks pretty good, except that there is a spike elsewhere with a higher amplitude.
The green line, from fix_fft, is an absolute mess.
I checked the program several times, but to no avail.
Did I port them wrong, or is there some way I can fix them?
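One detail that may be worth checking in the translation above: in C, `&` is a bitwise AND, whereas MATLAB's `&` is a logical AND that only ever yields 0 or 1 (MATLAB's bitwise function is bitand). The C sketch below reproduces the bit-reversal counter of the C original and shows how the index sequence changes when the bitwise AND is replaced by a logical one, as a stand-in for what the MATLAB line mr = (mr & (l-1)) + l would compute. The function names next_reversed and reversed_sequence are made up for this illustration.

```c
/* Advance the bit-reversed counter the way the C original does.
 * n is the FFT length (a power of two), mr the current reversed index.
 * If logical_and is nonzero, the bitwise AND is replaced by a logical AND
 * to mimic what a logical `&` computes on numeric operands. */
long next_reversed(long mr, long n, int logical_and)
{
    long nn = n - 1;
    long l = n;
    do {
        l >>= 1;
    } while (mr + l > nn);
    if (logical_and)
        return (long)(mr && (l - 1)) + l;  /* 0 or 1, plus l */
    return (mr & (l - 1)) + l;             /* keep the low bits of mr */
}

/* Fill out[0..n-2] with the reversed indices produced for m = 1..n-1. */
void reversed_sequence(long n, int logical_and, long *out)
{
    long mr = 0;
    for (long m = 1; m < n; ++m) {
        mr = next_reversed(mr, n, logical_and);
        out[m - 1] = mr;
    }
}
```

For n = 8 the bitwise version yields 4, 2, 6, 1, 5, 3, 7, which is the bit-reversed order, while the logical version yields 4, 3, 5, 3, 5, 3, 5. The same concern would apply to the lines b = c & 1, where mod(c, 2) is one way to extract the low bit.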
There are several ways to approach this problem, and it is worth trying each of them. First, an FFT is an optimization of the discrete Fourier transform, so as a first step, code the Fourier transform directly. That is, don't write an FFT yet; just compute the FT from its definition.
These days a direct FT is not as slow as you might think, unless the project needs to transform something like 10,000 data points in a few milliseconds. A direct FT is also simple and easy to debug compared to an FFT.
Doing this gives you a baseline, that is, the correct answer. This matters because when you work on the FFT, you otherwise cannot tell whether the problem is in the FFT code or whether the code is right and the data is simply giving you an unexpected but correct answer.
Next, use a pre-written package to do the FFT. Search the web; there are packages written in C that do FFTs.
Third, if you really have to write your own FFT, then do so, but only if options (1) and (2) don't meet your requirements. It will be difficult to outdo a pre-written FFT package.
You will learn much along this path.
I have an odd problem. Following (re: copying) from here, I've been trying to implement the Cooley–Tukey FFT algorithm for arrays with a power-of-2 size, but the answers returned from this implementation are the conjugate of the true answers.
int fft_pow2(int dir,int m,float complex *a)
{
long nn,i,i1,j,k,i2,l,l1,l2;
float c1,c2,tx,ty,t1,t2,u1,u2,z;
float complex t;
/* Calculate the number of points */
nn = 1;
for (i=0;i<m;i++)
nn *= 2;
/* Do the bit reversal */
i2 = nn >> 1;
j = 0;
for (i=0;i<nn-1;i++) {
if (i < j) {
t = a[i];
a[i] = a[j];
a[j] = t;
}
k = i2;
while (k <= j) {
j -= k;
k >>= 1;
}
j += k;
}
/* Compute the FFT */
c1 = -1.0;
c2 = 0.0;
l2 = 1;
for (l=0;l<m;l++) {
l1 = l2;
l2 <<= 1;
u1 = 1.0;
u2 = 0.0;
for (j=0;j<l1;j++) {
for (i=j;i<nn;i+=l2) {
i1 = i + l1;
t = u1 * crealf(a[i1]) - u2 * cimagf(a[i1])
+ I * (u1 * cimagf(a[i1]) + u2 * crealf(a[i1]));
a[i1] = a[i] - t;
a[i] += t;
}
z = u1 * c1 - u2 * c2;
u2 = u1 * c2 + u2 * c1;
u1 = z;
}
c2 = sqrt((1.0 - c1) / 2.0);
if (dir == 1)
c2 = -c2;
c1 = sqrt((1.0 + c1) / 2.0);
}
/* Scaling for forward transform */
if (dir == 1) {
for (i=0;i<nn;i++) {
a[i] /= (float)nn;
}
}
return 1;
}
int main(int argc, char **argv) {
const int n = 4; /* number of points in arr */
float complex arr[4] = { 1.0, 2.0, 3.0, 4.0 };
fft_pow2(0, (int)log2(n), arr);
for (int i = 0; i < n; i++) {
printf("%f %f\n", crealf(arr[i]), cimagf(arr[i]));
}
return 0;
}
The results:
10.000000 0.000000
-2.000000 -2.000000
-2.000000 0.000000
-2.000000 2.000000
whereas the true answer is the conjugate.
Any ideas?
The FFT is often defined with $H_k = \sum_{j=0}^{N-1} e^{-2\pi i jk/N}\, h_j$. Note the minus sign in the exponent. The FFT can be defined with a plus sign instead of the minus sign. In large part, the definitions are equivalent, because +i and –i are completely symmetric.
The code you show is written for the definition with the negative sign, and it is also written so that the first parameter, dir, is 1 for a forward transform and something else for a reverse transform. We can determine the intended direction because of the comment about scaling for the forward transform: It scales if dir is 1.
So, where your code in main calls fft_pow2 with 0 for dir, it is requesting a reverse transform. Your code has performed a reverse transform using the FFT definition with a negative sign. The reverse of the transform with a negative sign is a transform with a positive sign. For [1, 2, 3, 4], the result is:
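$1^0 \cdot 1 + 1^1 \cdot 2 + 1^2 \cdot 3 + 1^3 \cdot 4 = 1 + 2 + 3 + 4 = 10$.
$i^0 \cdot 1 + i^1 \cdot 2 + i^2 \cdot 3 + i^3 \cdot 4 = 1 + 2i - 3 - 4i = -2 - 2i$.
$(-1)^0 \cdot 1 + (-1)^1 \cdot 2 + (-1)^2 \cdot 3 + (-1)^3 \cdot 4 = 1 - 2 + 3 - 4 = -2$.
$(-i)^0 \cdot 1 + (-i)^1 \cdot 2 + (-i)^2 \cdot 3 + (-i)^3 \cdot 4 = 1 - 2i - 3 + 4i = -2 + 2i$.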
And that is the result you obtained.
I've been working on a program for my Algorithm Analysis class where I have to solve the Knapsack problem with Brute Force, greedy, dynamic, and branch and bound strategies. Everything works perfectly when I run it in Visual Studio 2012, but if I compile with gcc and run it on the command line, I get a different result:
Visual Studio:
+-------------------------------------------------------------------------------+
| Number of | Processing time in seconds / Maximum benefit value |
| +---------------+---------------+---------------+---------------+
| items | Brute force | Greedy | D.P. | B. & B. |
+---------------+---------------+---------------+---------------+---------------+
| 10 + 0 / 1290 + 0 / 1328 + 0 / 1290 + 0 / 1290 |
+---------------+---------------+---------------+---------------+---------------+
| 20 + 0 / 3286 + 0 / 3295 + 0 / 3200 + 0 / 3286 |
+---------------+---------------+---------------+---------------+---------------+
cmd:
+-------------------------------------------------------------------------------+
| Number of | Processing time in seconds / Maximum benefit value |
| +---------------+---------------+---------------+---------------+
| items | Brute force | Greedy | D.P. | B. & B. |
+---------------+---------------+---------------+---------------+---------------+
| 10 + 0 / 1290 + 0 / 1328 + 0 / 1599229779+ 0 / 1290 |
+---------------+---------------+---------------+---------------+---------------+
| 20 + 0 / 3286 + 0 / 3295 + 0 / 3200 + 0 / 3286 |
+---------------+---------------+---------------+---------------+---------------+
The same number always shows up, "1599229779." Notice that the output is only messed up the first time the Dynamic algorithm is run.
Here is my code:
typedef struct{
short value; //This is the value of the item
short weight; //This is the weight of the item
float ratio; //This is the ratio of value/weight
} itemType;
typedef struct{
time_t startingTime;
time_t endingTime;
int maxValue;
} result;
result solveWithDynamic(itemType items[], int itemsLength, int maxCapacity){
result answer;
int rowSize = 2;
int colSize = maxCapacity + 1;
int i, j; //used in loops
int otherColumn, thisColumn;
answer.startingTime = time(NULL);
int **table = (int**)malloc((sizeof *table) * rowSize);//[2][(MAX_ITEMS*WEIGHT_MULTIPLIER)];
for(i = 0; i < rowSize; i ++)
table[i] = (int*)malloc((sizeof *table[i]) * colSize);
table[0][0] = 0;
table[1][0] = 0;
for(i = 1; i < maxCapacity; i++) table[1][i] = 0;
for(i = 0; i < itemsLength; i++){
thisColumn = i%2;
otherColumn = (i+1)%2; //this is always the other column
for(j = 1; j < maxCapacity + 1; j++){
if(items[i].weight <= j){
if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
else
table[thisColumn][j] = table[otherColumn][j];
} else {
table[thisColumn][j] = table[thisColumn][j-1];
}//end if/else
}//end for
}//end for
answer.maxValue = table[thisColumn][maxCapacity];
answer.endingTime = time(NULL);
for(i = 0; i < rowSize; i ++)
free(table[i]);
free(table);
return answer;
}//end solveWithDynamic
Just a bit of explanation. I was having trouble with the memory consumption of this algorithm because I have to run it for a set of 10,000 items. I realized that I didn't need to store the whole table, because I only ever looked at the previous column. I actually figured out that you only need to store the current row and x+1 additional values, where x is the weight of the current itemType. It brought the memory required from (itemsLength+1) * (maxCapacity+1) elements to 2*(maxCapacity+1) and possibly (maxCapacity+1) + (x+1) (although I don't need to optimize it that much).
Also, I used printf("%d", answer.maxValue); in this function, and it still came out as "1599229779." Can anyone help me figure out what is going on? Thanks.
I can't be sure this is what causes it, but with
for(i = 1; i < maxCapacity; i++) table[1][i] = 0;
you leave table[1][maxCapacity] uninitialised, and then potentially use it:
for(j = 1; j < maxCapacity + 1; j++){
if(items[i].weight <= j){
if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
else
table[thisColumn][j] = table[otherColumn][j];
} else {
table[thisColumn][j] = table[thisColumn][j-1];
}//end if/else
}//end for
If that is always zero with Visual Studio, but nonzero with gcc, that could explain the difference.
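One way to rule this out is to zero-fill both rows at allocation time. The following is a minimal sketch of the two-row table with that change, assuming non-negative weights and capacity; the function name knapsack_two_rows and its parameter layout (separate value and weight arrays) are made up for this example and are not the code from the question.

```c
#include <stdlib.h>

/* 0/1 knapsack using two rolling rows of capacity+1 entries each.
 * Returns the best total value, or -1 if the allocation fails.
 * Assumes capacity >= 0 and every weight >= 0. */
int knapsack_two_rows(const short *value, const short *weight,
                      int count, int capacity)
{
    size_t cols = (size_t)capacity + 1;
    /* calloc zero-fills both rows, including index `capacity` */
    int *rows = calloc(2 * cols, sizeof *rows);
    if (rows == NULL)
        return -1;
    int *prev = rows;        /* best values without the current item */
    int *cur = rows + cols;  /* best values being computed */

    for (int i = 0; i < count; ++i) {
        for (int j = 0; j <= capacity; ++j) {
            int best = prev[j];                 /* leave item i out */
            if (weight[i] <= j) {
                int take = value[i] + prev[j - weight[i]];
                if (take > best)
                    best = take;                /* put item i in */
            }
            cur[j] = best;
        }
        int *tmp = prev;  /* the row just filled becomes the previous row */
        prev = cur;
        cur = tmp;
    }
    int result = prev[capacity];
    free(rows);
    return result;
}
```

With values {60, 100, 120}, weights {10, 20, 30} and capacity 50, this returns 220. Note that when an item does not fit, the sketch copies the entry from the previous row, whereas the else branch in the question copies table[thisColumn][j-1]; that difference may be worth checking against the 3200 versus 3286 mismatch in the 20-item row.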