Fast AVX512 modulo when same divisor - c
I have been looking for divisors of potential factorial primes (numbers of the form n! +- 1), and since I recently bought a Skylake-X workstation I thought I could get some speed-up from AVX512 instructions.
The algorithm is simple: the main step is to take a modulo repeatedly with respect to the same divisor, looping over a large range of n values. Here is the naive approach written in C (P is a table of primes):
uint64_t factorial_naive(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
    uint64_t n, i, residue;
    for (i = 0; i < APP_BUFLEN; i++){
        residue = 2;
        for (n=3; n <= nmax; n++){
            residue *= n;
            residue %= P[i];
            // Let's check if we found a factor
            if (nmin <= n){
                if( residue == 1){
                    report_factor(n, -1, P[i]);
                }
                if(residue == P[i]- 1){
                    report_factor(n, 1, P[i]);
                }
            }
        }
    }
}
The idea is to check a large range of n, e.g. 1,000,000 -> 10,000,000, against the same set of divisors, so we take the modulo with respect to the same divisor several million times. Using DIV is very slow, so there are several possible approaches depending on the range of the calculations. In my case n is most likely less than 10^7 and the potential divisor p is less than 10,000 G (< 10^13), so the numbers fit in 64 bits and even in 53 bits, but the product of the maximum residue (p-1) and n is larger than 64 bits. So I thought that the simplest version of the Montgomery method would not work, because we are taking the modulo of a number that is larger than 64 bits.
I found some old code for PowerPC where FMA was used to get an accurate product of up to 106 bits (I guess) when using doubles, so I converted this approach to AVX512 (Intel intrinsics). Here is a simple version of the FMA method. It is based on the work of Dekker (1971); "Dekker product" and "FMA version of TwoProduct" are useful search terms when looking for the rationale behind it. This approach has also been discussed in this forum (e.g. here).
int64_t factorial_FMA(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
    uint64_t n, i;
    double prime_double, prime_double_reciprocal, quotient, residue;
    double nr, n_double, prime_times_quotient_high, prime_times_quotient_low;
    for (i = 0; i < APP_BUFLEN; i++){
        residue = 2.0;
        prime_double = (double)P[i];
        prime_double_reciprocal = 1.0 / prime_double;
        n_double = 3.0;
        for (n=3; n <= nmax; n++){
            nr = n_double * residue;
            // quotient = round(nr / prime), using the rounding constant
            quotient = fma(nr, prime_double_reciprocal, rounding_constant);
            quotient -= rounding_constant;
            // exact prime * quotient as high + low (FMA TwoProduct)
            prime_times_quotient_high= prime_double * quotient;
            prime_times_quotient_low = fma(prime_double, quotient, -prime_times_quotient_high);
            residue = fma(residue, n, -prime_times_quotient_high) - prime_times_quotient_low;
            if (residue < 0.0) residue += prime_double;
            n_double += 1.0;
            // Let's check if we found a factor
            if (nmin <= n){
                if( residue == 1.0){
                    report_factor(n, -1, P[i]);
                }
                if(residue == prime_double - 1.0){
                    report_factor(n, 1, P[i]);
                }
            }
        }
    }
}
Here I have used the magic constant
static const double rounding_constant = 6755399441055744.0;
which is 2^51 + 2^52, the magic rounding number for doubles: adding it and then subtracting it again rounds a double to the nearest integer.
I converted this to AVX512 (32 potential divisors per loop iteration) and analyzed the result using IACA. It reported "Throughput Bottleneck: Backend" and that backend allocation was stalled due to unavailable allocation resources.
I am not very experienced with assembler, so my question is: is there anything I can do to speed this up and resolve this backend bottleneck?
The AVX512 code is below and can also be found on GitHub.
uint64_t factorial_AVX512_unrolled_four(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
// we are trying to find a factor for factorial numbers: n! +-1
// nmin is the minimum n we want to report and nmax is the maximum. P is a table of primes
// we process 32 primes in one loop.
// the naive version of the algorithm is in the function factorial_naive
// and a simple version of the FMA-based approach is in the function factorial_simpleFMA
const double one_table[8] __attribute__ ((aligned(64))) ={1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
uint64_t n;
__m512d zero, rounding_const, one, n_double;
__m512i prime1, prime2, prime3, prime4;
__m512d residue1, residue2, residue3, residue4;
__m512d prime_double_reciprocal1, prime_double_reciprocal2, prime_double_reciprocal3, prime_double_reciprocal4;
__m512d quotient1, quotient2, quotient3, quotient4;
__m512d prime_times_quotient_high1, prime_times_quotient_high2, prime_times_quotient_high3, prime_times_quotient_high4;
__m512d prime_times_quotient_low1, prime_times_quotient_low2, prime_times_quotient_low3, prime_times_quotient_low4;
__m512d nr1, nr2, nr3, nr4;
__m512d prime_double1, prime_double2, prime_double3, prime_double4;
__m512d prime_minus_one1, prime_minus_one2, prime_minus_one3, prime_minus_one4;
__mmask8 negative_reminder_mask1, negative_reminder_mask2, negative_reminder_mask3, negative_reminder_mask4;
__mmask8 found_factor_mask11, found_factor_mask12, found_factor_mask13, found_factor_mask14;
__mmask8 found_factor_mask21, found_factor_mask22, found_factor_mask23, found_factor_mask24;
// load data and initialize variables for loop
rounding_const = _mm512_set1_pd(rounding_constant);
one = _mm512_load_pd(one_table);
zero = _mm512_setzero_pd ();
// load primes used to sieve
prime1 = _mm512_load_epi64((__m512i *) &P[0]);
prime2 = _mm512_load_epi64((__m512i *) &P[8]);
prime3 = _mm512_load_epi64((__m512i *) &P[16]);
prime4 = _mm512_load_epi64((__m512i *) &P[24]);
// convert primes to double
prime_double1 = _mm512_cvtepi64_pd (prime1); // vcvtqq2pd
prime_double2 = _mm512_cvtepi64_pd (prime2); // vcvtqq2pd
prime_double3 = _mm512_cvtepi64_pd (prime3); // vcvtqq2pd
prime_double4 = _mm512_cvtepi64_pd (prime4); // vcvtqq2pd
// calculates 1.0/ prime
prime_double_reciprocal1 = _mm512_div_pd(one, prime_double1);
prime_double_reciprocal2 = _mm512_div_pd(one, prime_double2);
prime_double_reciprocal3 = _mm512_div_pd(one, prime_double3);
prime_double_reciprocal4 = _mm512_div_pd(one, prime_double4);
// for comparison if we have found factors for n!+1
prime_minus_one1 = _mm512_sub_pd(prime_double1, one);
prime_minus_one2 = _mm512_sub_pd(prime_double2, one);
prime_minus_one3 = _mm512_sub_pd(prime_double3, one);
prime_minus_one4 = _mm512_sub_pd(prime_double4, one);
// residue init
residue1 = _mm512_set1_pd(2.0);
residue2 = _mm512_set1_pd(2.0);
residue3 = _mm512_set1_pd(2.0);
residue4 = _mm512_set1_pd(2.0);
// double counter init
n_double = _mm512_set1_pd(3.0);
// main loop starts here. typical value for nmax can be 5,000,000 -> 10,000,000
for (n=3; n<=nmax; n++) // main loop
{
// timings for instructions:
// _mm512_load_epi64 = vmovdqa64 : L 1, T 0.5
// _mm512_load_pd = vmovapd : L 1, T 0.5
// _mm512_set1_pd
// _mm512_div_pd = vdivpd : L 23, T 16
// _mm512_cvtepi64_pd = vcvtqq2pd : L 4, T 0.5
// _mm512_mul_pd = vmulpd : L 4, T 0.5
// _mm512_fmadd_pd = vfmadd132pd, vfmadd213pd, vfmadd231pd : L 4, T 0.5
// _mm512_fmsub_pd = vfmsub132pd, vfmsub213pd, vfmsub231pd : L 4, T 0.5
// _mm512_sub_pd = vsubpd : L 4, T 0.5
// _mm512_cmplt_pd_mask = vcmppd : L ?, T 1
// _mm512_mask_add_pd = vaddpd : L 4, T 0.5
// _mm512_cmpeq_pd_mask = vcmppd : L ?, T 1
// _mm512_kor = korw : L 1, T 1
// nr = residue * n
nr1 = _mm512_mul_pd (residue1, n_double);
nr2 = _mm512_mul_pd (residue2, n_double);
nr3 = _mm512_mul_pd (residue3, n_double);
nr4 = _mm512_mul_pd (residue4, n_double);
// quotient = nr * 1.0/ prime_double + rounding_constant
quotient1 = _mm512_fmadd_pd(nr1, prime_double_reciprocal1, rounding_const);
quotient2 = _mm512_fmadd_pd(nr2, prime_double_reciprocal2, rounding_const);
quotient3 = _mm512_fmadd_pd(nr3, prime_double_reciprocal3, rounding_const);
quotient4 = _mm512_fmadd_pd(nr4, prime_double_reciprocal4, rounding_const);
// quotient -= rounding_constant, now quotient is rounded to integer
// quotient should be at maximum nmax (10,000,000)
quotient1 = _mm512_sub_pd(quotient1, rounding_const);
quotient2 = _mm512_sub_pd(quotient2, rounding_const);
quotient3 = _mm512_sub_pd(quotient3, rounding_const);
quotient4 = _mm512_sub_pd(quotient4, rounding_const);
// now we calculate high and low for prime * quotient using the Dekker product (FMA).
// quotient is calculated using an approximation, but this product is exact for the given quotient
prime_times_quotient_high1 = _mm512_mul_pd(quotient1, prime_double1);
prime_times_quotient_high2 = _mm512_mul_pd(quotient2, prime_double2);
prime_times_quotient_high3 = _mm512_mul_pd(quotient3, prime_double3);
prime_times_quotient_high4 = _mm512_mul_pd(quotient4, prime_double4);
prime_times_quotient_low1 = _mm512_fmsub_pd(quotient1, prime_double1, prime_times_quotient_high1);
prime_times_quotient_low2 = _mm512_fmsub_pd(quotient2, prime_double2, prime_times_quotient_high2);
prime_times_quotient_low3 = _mm512_fmsub_pd(quotient3, prime_double3, prime_times_quotient_high3);
prime_times_quotient_low4 = _mm512_fmsub_pd(quotient4, prime_double4, prime_times_quotient_high4);
// now we calculate the new remainder using the Dekker product and the original values
// we subtract the prime * quotient calculated above (quotient is an approximation)
residue1 = _mm512_fmsub_pd(residue1, n_double, prime_times_quotient_high1);
residue2 = _mm512_fmsub_pd(residue2, n_double, prime_times_quotient_high2);
residue3 = _mm512_fmsub_pd(residue3, n_double, prime_times_quotient_high3);
residue4 = _mm512_fmsub_pd(residue4, n_double, prime_times_quotient_high4);
residue1 = _mm512_sub_pd(residue1, prime_times_quotient_low1);
residue2 = _mm512_sub_pd(residue2, prime_times_quotient_low2);
residue3 = _mm512_sub_pd(residue3, prime_times_quotient_low3);
residue4 = _mm512_sub_pd(residue4, prime_times_quotient_low4);
// let's check if remainder < 0
negative_reminder_mask1 = _mm512_cmplt_pd_mask(residue1,zero);
negative_reminder_mask2 = _mm512_cmplt_pd_mask(residue2,zero);
negative_reminder_mask3 = _mm512_cmplt_pd_mask(residue3,zero);
negative_reminder_mask4 = _mm512_cmplt_pd_mask(residue4,zero);
// we add the prime back to the remainder using the mask if it was < 0
residue1 = _mm512_mask_add_pd(residue1, negative_reminder_mask1, residue1, prime_double1);
residue2 = _mm512_mask_add_pd(residue2, negative_reminder_mask2, residue2, prime_double2);
residue3 = _mm512_mask_add_pd(residue3, negative_reminder_mask3, residue3, prime_double3);
residue4 = _mm512_mask_add_pd(residue4, negative_reminder_mask4, residue4, prime_double4);
n_double = _mm512_add_pd(n_double,one);
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
{ // we find factor very rarely
double *residual_list1 = (double *) &residue1;
double *residual_list2 = (double *) &residue2;
double *residual_list3 = (double *) &residue3;
double *residual_list4 = (double *) &residue4;
double *prime_list1 = (double *) &prime_double1;
double *prime_list2 = (double *) &prime_double2;
double *prime_list3 = (double *) &prime_double3;
double *prime_list4 = (double *) &prime_double4;
for (int i=0; i <8; i++){
if( residual_list1[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list1[i]);
if( residual_list2[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list2[i]);
if( residual_list3[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list3[i]);
if( residual_list4[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list4[i]);
if(residual_list1[i] == (prime_list1[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list1[i]);
if(residual_list2[i] == (prime_list2[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list2[i]);
if(residual_list3[i] == (prime_list3[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list3[i]);
if(residual_list4[i] == (prime_list4[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list4[i]);
} // end of per-lane check
} // end of "factor found" branch
} // end of main loop
}
As a few commenters have suggested: a "backend" bottleneck is what you'd expect for this code. That suggests you're keeping things pretty well fed, which is what you want.
Looking at the report, there should be an opportunity in this section:
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
From the IACA analysis:
| 1 | 1.0 | | | | | | | | kmovw r11d, k0
| 1 | 1.0 | | | | | | | | kmovw eax, k1
| 1 | 1.0 | | | | | | | | kmovw ecx, k2
| 1 | 1.0 | | | | | | | | kmovw esi, k3
| 1 | 1.0 | | | | | | | | kmovw edi, k4
| 1 | 1.0 | | | | | | | | kmovw r8d, k5
| 1 | 1.0 | | | | | | | | kmovw r9d, k6
| 1 | 1.0 | | | | | | | | kmovw r10d, k7
| 1 | | 1.0 | | | | | | | or r11d, eax
| 1 | | | | | | | 1.0 | | or r11d, ecx
| 1 | | 1.0 | | | | | | | or r11d, esi
| 1 | | | | | | | 1.0 | | or r11d, edi
| 1 | | 1.0 | | | | | | | or r11d, r8d
| 1 | | | | | | | 1.0 | | or r11d, r9d
| 1* | | | | | | | | | or r11d, r10d
The processor is moving the resulting comparison masks (k0-k7) over to regular registers for the "or" operation. You should be able to eliminate those moves and do the "or" roll-up in 6 ops instead of 8.
NOTE: the found_factor_mask variables are declared as __mmask8, which matches the 8 doubles in a 512-bit vector, but _mm512_kor operates on __mmask16. Declaring them as __mmask16 and combining them with _mm512_kor might let the compiler keep the "or" in mask registers. If not, drop to assembly as a commenter noted.
And related: what fraction of iterations fire this or-mask clause? As another commenter observed, you should be able to unroll this with an accumulating "or" operation. Check the accumulated "or" value at the end of each unrolled iteration (or after N iterations), and if it's "true", go back and redo the values to figure out which n value triggered it.
(You can also binary search within the "roll" to find the matching n value, which might give some gain.)
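The accumulate-then-replay idea can be sketched in scalar C for a single prime. This is only an illustration under stated assumptions: BLOCK, mulmod_u64 and first_hit_blocked are hypothetical names, the modular product uses the unsigned __int128 extension of GCC/Clang instead of the FMA method, and the replay is a plain linear rescan rather than a binary search.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 64 /* hypothetical number of iterations between checks */

/* (a * b) mod p through a 128-bit intermediate (GCC/Clang extension). */
static uint64_t mulmod_u64(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* Returns the first n in 3..nmax with n! == +-1 (mod p), or 0 if none.
   The hit flag is only accumulated inside a block and tested once per
   block; on a hit the block is replayed from the saved residue. */
static uint64_t first_hit_blocked(uint64_t nmax, uint64_t p)
{
    assert(p > 2);
    uint64_t residue = 2; /* 2! mod p */
    uint64_t n = 3;
    while (n <= nmax) {
        uint64_t saved_residue = residue;
        uint64_t block_start = n;
        uint64_t block_end = (nmax - n >= BLOCK - 1) ? n + BLOCK - 1 : nmax;
        unsigned hit = 0;
        for (; n <= block_end; n++) {
            residue = mulmod_u64(residue, n, p);
            hit |= (unsigned)((residue == 1) | (residue == p - 1));
        }
        if (hit) { /* rare path: find which n inside the block matched */
            uint64_t r = saved_residue;
            for (uint64_t m = block_start; m <= block_end; m++) {
                r = mulmod_u64(r, m, p);
                if (r == 1 || r == p - 1)
                    return m;
            }
        }
    }
    return 0;
}
```

In the vector version the same structure would apply per lane: OR the comparison masks into one accumulator for BLOCK iterations and only leave the hot loop when the accumulator is non-zero.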
Next, you should be able to get rid of this mid-loop check:
// if we are below nmin then we continue next iteration, we
if (n < nmin) continue;
Which shows up here:
| 1* | | | | | | | | | cmp r14, 0x3e8
| 0*F | | | | | | | | | jb 0x229
It may not be a huge gain since the predictor will (probably) get this one (mostly) right, but you should get some gains by having two distinct loops for two "phases":
n=3 to n=nmin-1
n=nmin and beyond
Even if you gain a cycle, that's 3%. And since that's generally related to the big 'or' operation, above, there may be more cleverness in there to be found.
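A scalar sketch of the two-phase split for a single prime, assuming the unsigned __int128 extension of GCC/Clang for the modular product. The names mulmod128 and count_hits_two_phase are hypothetical, and counting hits stands in for the calls to report_factor.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod p through a 128-bit intermediate (GCC/Clang extension). */
static uint64_t mulmod128(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* Counts the n in [nmin, nmax] with n! == +-1 (mod p). */
static uint64_t count_hits_two_phase(uint64_t nmin, uint64_t nmax, uint64_t p)
{
    assert(p > 2 && nmin >= 3);
    uint64_t residue = 2; /* 2! mod p */
    uint64_t n = 3;
    uint64_t hits = 0;

    /* Phase 1: n = 3 .. nmin-1, only advance the residue, no checks. */
    for (; n < nmin && n <= nmax; n++)
        residue = mulmod128(residue, n, p);

    /* Phase 2: n = nmin .. nmax, advance and check on every step. */
    for (; n <= nmax; n++) {
        residue = mulmod128(residue, n, p);
        if (residue == 1 || residue == p - 1)
            hits++;
    }
    return hits;
}
```

The first loop carries no comparison or branch on n < nmin, which is the point of the split.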
Exception: variable does not exist in rstan:failed to create the sampler
I've tried finding a similar error to mine on the site but I cannot find anything. I am not sure what am I doing wrong because I keep getting the below error message. I would greatly appreciate it if somebody could explain why do I keep getting an error. TIA
Error message:
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Exception: variable does not exist; processing stage=data initialization; variable name=j; base type=int (in 'model32487fbf6c2c_production' at line 2)
failed to create the sampler; sampling not done
R code:
library(rstan)
dat1 <- list(a1=0, b1= 1, a2 =0.1, b2 =0.5, j <-6, n <-c(3,7,4,8,5,9), x <-c(10,33,3,39,5,50))
fit1 <- stan(file = "production.stan", data = dat1, chains = 3, iter = 1000,)
print(fit1)
Stan code:
data {
  int < lower = 0 > j;
  int x[j];
  int n[j];
  real a1;
  real b1;
  real a2;
  real b2;
}
parameters {
  real < lower = 0 > lambda[j];
  real < lower = 0 > sigma;
  real mu;
}
transformed parameters {
  real lambdanew1 = lambda[1]. / n[1];
  real lambdanew2 = lambda[2]. / n[2];
  real lambdanew3 = lambda[3]. / n[3];
  real lambdanew4 = lambda[4]. / n[4];
  real lambdanew5 = lambda[5]. / n[5];
  real lambdanew6 = lambda[6]. / n[6];
}
model {
  target += poisson_lpmf(x[1] | lambda[1]);
  target += poisson_lpmf(x[2] | lambda[2]);
  target += poisson_lpmf(x[3] | lambda[3]);
  target += poisson_lpmf(x[4] | lambda[4]);
  target += poisson_lpmf(x[5] | lambda[5]);
  target += poisson_lpmf(x[6] | lambda[6]);
  target += lognormal_lpdf(lambdanew1 | mu, sigma);
  target += lognormal_lpdf(lambdanew2 | mu, sigma);
  target += lognormal_lpdf(lambdanew3 | mu, sigma);
  target += lognormal_lpdf(lambdanew4 | mu, sigma);
  target += lognormal_lpdf(lambdanew5 | mu, sigma);
  target += lognormal_lpdf(lambdanew6 | mu, sigma);
  target += uniform_lpdf(mu | a1, b1);
  target += uniform_lpdf(sigma | a2, b2);
}
Marking Stack Variables for Garbage Collection
I'm trying to learn how to implement a simple mark-and-sweep garbage collection algorithm. I'm learning by looking at the tgc library. I'm figuring out how to iterate through the stack to mark reachable heap-allocated variables. This is done in the following lines in tgc.c:
static void tgc_mark_stack(tgc_t *gc) {
  void *stk, *bot, *top, *p;
  bot = gc->bottom; top = &stk;
  if (bot == top) { return; }
  if (bot < top) {
    for (p = top; p >= bot; p = ((char*)p) - sizeof(void*)) {
      tgc_mark_ptr(gc, *((void**)p));
    }
  }
  if (bot > top) {
    for (p = top; p <= bot; p = ((char*)p) + sizeof(void*)) {
      tgc_mark_ptr(gc, *((void**)p));
    }
  }
}
How do p = ((char*)p) - sizeof(void*) and p = ((char*)p) + sizeof(void*) not cause pointer p to overshoot or undershoot the correct address of the stack variable? As an example, my intuition is that we could have something like this, which wouldn't yield the correct heap address of char *:
void *bottom ->             -------------  High address/bottom of stack
                            |           |
                            |  char a   |  1 byte
                            |           |
                            -------------
                            |           |
(char*)p + sizeof(void*) -> | char *str |  8 bytes
                            |           |
                            -------------
                            |           |
                            |   int x   |  4 bytes
                            |           |
void *p = void *top ->      -------------
                            |           |
                            | void *stk |  8 bytes
                            |           |
                            -------------  Low address/top of stack
But I've tested this out, and the loop seems to work regardless of the stack layout. I figured that we'd do something like p = ((char*)p) - sizeof(char), which also seems to work but is a bit slower. Why does the method used in the above code work?
Unexpected Value Output When Not Using Visual Studio
I've been working on a program for my Algorithm Analysis class where I have to solve the Knapsack problem with Brute Force, greedy, dynamic, and branch and bound strategies. Everything works perfectly when I run it in Visual Studio 2012, but if I compile with gcc and run it on the command line, I get a different result:
Visual Studio:
+-------------------------------------------------------------------------------+
|   Number of   |      Processing time in seconds / Maximum benefit value       |
|               +---------------+---------------+---------------+---------------+
|     items     |  Brute force  |    Greedy     |     D.P.      |    B. & B.    |
+---------------+---------------+---------------+---------------+---------------+
|      10       +   0 / 1290    +   0 / 1328    +   0 / 1290    +   0 / 1290    |
+---------------+---------------+---------------+---------------+---------------+
|      20       +   0 / 3286    +   0 / 3295    +   0 / 3200    +   0 / 3286    |
+---------------+---------------+---------------+---------------+---------------+
cmd:
+-------------------------------------------------------------------------------+
|   Number of   |      Processing time in seconds / Maximum benefit value       |
|               +---------------+---------------+---------------+---------------+
|     items     |  Brute force  |    Greedy     |     D.P.      |    B. & B.    |
+---------------+---------------+---------------+---------------+---------------+
|      10       +   0 / 1290    +   0 / 1328    +   0 / 1599229779+   0 / 1290    |
+---------------+---------------+---------------+---------------+---------------+
|      20       +   0 / 3286    +   0 / 3295    +   0 / 3200    +   0 / 3286    |
+---------------+---------------+---------------+---------------+---------------+
The same number always shows up, "1599229779." Notice that the output is only messed up the first time the Dynamic algorithm is run.
Here is my code:
typedef struct{
    short value;  //This is the value of the item
    short weight; //This is the weight of the item
    float ratio;  //This is the ratio of value/weight
} itemType;
typedef struct{
    time_t startingTime;
    time_t endingTime;
    int maxValue;
} result;
result solveWithDynamic(itemType items[], int itemsLength, int maxCapacity){
    result answer;
    int rowSize = 2;
    int colSize = maxCapacity + 1;
    int i, j; //used in loops
    int otherColumn, thisColumn;
    answer.startingTime = time(NULL);
    int **table = (int**)malloc((sizeof *table) * rowSize);//[2][(MAX_ITEMS*WEIGHT_MULTIPLIER)];
    for(i = 0; i < rowSize; i ++)
        table[i] = (int*)malloc((sizeof *table[i]) * colSize);
    table[0][0] = 0;
    table[1][0] = 0;
    for(i = 1; i < maxCapacity; i++)
        table[1][i] = 0;
    for(i = 0; i < itemsLength; i++){
        thisColumn = i%2;
        otherColumn = (i+1)%2; //this is always the other column
        for(j = 1; j < maxCapacity + 1; j++){
            if(items[i].weight <= j){
                if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
                    table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
                else
                    table[thisColumn][j] = table[otherColumn][j];
            } else {
                table[thisColumn][j] = table[thisColumn][j-1];
            }//end if/else
        }//end for
    }//end for
    answer.maxValue = table[thisColumn][maxCapacity];
    answer.endingTime = time(NULL);
    for(i = 0; i < rowSize; i ++)
        free(table[i]);
    free(table);
    return answer;
}//end solveWithDynamic
Just a bit of explanation. I was having trouble with the memory consumption of this algorithm because I have to run it for a set of 10,000 items. I realized that I didn't need to store the whole table, because I only ever looked at the previous column. I actually figured out that you only need to store the current row and x+1 additional values, where x is the weight of the current itemType. It brought the memory required from (itemsLength+1) * (maxCapacity+1) elements to 2*(maxCapacity+1) and possibly (maxCapacity+1) + (x+1) (although I don't need to optimize it that much).
Also, I used printf("%d", answer.maxValue); in this function, and it still came out as "1599229779." Can anyone help me figure out what is going on? Thanks.
Can't be sure that that is what causes it, but
for(i = 1; i < maxCapacity; i++)
    table[1][i] = 0;
you leave table[1][maxCapacity] uninitialised, but then potentially use it:
for(j = 1; j < maxCapacity + 1; j++){
    if(items[i].weight <= j){
        if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
            table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
        else
            table[thisColumn][j] = table[otherColumn][j];
    } else {
        table[thisColumn][j] = table[thisColumn][j-1];
    }//end if/else
}//end for
If that is always zero with Visual Studio, but nonzero with gcc, that could explain the difference.
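One way to rule out any uninitialised entry is to allocate each row with calloc, which zero-fills the memory. The sketch below assumes the same int** layout as in the question; the helper names alloc_zeroed_table and free_table are hypothetical and not part of the original program.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates a rows x cols table of int with every entry set to 0.
   Returns NULL if any allocation fails. */
static int **alloc_zeroed_table(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    int **table = malloc(rows * sizeof *table);
    if (table == NULL)
        return NULL;
    for (size_t i = 0; i < rows; i++) {
        table[i] = calloc(cols, sizeof *table[i]); /* zero-initialised row */
        if (table[i] == NULL) {
            while (i > 0)
                free(table[--i]); /* release the rows allocated so far */
            free(table);
            return NULL;
        }
    }
    return table;
}

/* Frees a table returned by alloc_zeroed_table. */
static void free_table(int **table, size_t rows)
{
    for (size_t i = 0; i < rows; i++)
        free(table[i]);
    free(table);
}
```

With alloc_zeroed_table(2, maxCapacity + 1) the explicit zeroing loop is no longer needed, so an off-by-one in its bound cannot leave table[1][maxCapacity] unset.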
Clock_gettime nanoseconds calculation
ref: linux clock_gettime
I found a formula which works well to get the processing time, but there's something I don't understand. See the result below. The first 2 rows just show the formula for their respective columns. I'm only showing 3 results from a quick run. The interesting part is in the last row: why is 5551 - 999896062 nanoseconds = 18446744072709661105? And why is 1 + 18446744072709661105/1E9 = 0.000109? I think there's some data conversion going on that affects the results?
xx:    | t1.tv_sec | | t1.tv_nsec | | t2.tv_sec | | t2.tv_nsec
xx:    t2-t1(sec)   t2-t1(nsec)   (t2-t1(sec))+(t2-t1(nsec))/1E9
52291: | 30437 | | 999649886 | | 30437 | | 999759331
52291: 0   109445   0.000109
52292: | 30437 | | 999772970 | | 30437 | | 999882416
52292: 0   109446   0.000109
52293: | 30437 | | 999896062 | | 30438 | | 5551
52293: 1   18446744072709661105   0.000109
Source code:
int main() {
    struct timespec t1, t2;
    int i = 0;
    while(1) {
        clock_gettime(CLOCK_MONOTONIC, &t1);
        for(int j=0;j<25000;j++) { };
        clock_gettime(CLOCK_MONOTONIC, &t2);
        printf("%d: \t | %llu | \t | %lu | \t\t | %llu | \t | %lu \n", i, (unsigned long long) t1.tv_sec, t1.tv_nsec, (unsigned long long) t2.tv_sec, t2.tv_nsec);
        printf("%d: \t %llu \t %lu \t\t %lf\n", i, (unsigned long long) t2.tv_sec - t1.tv_sec, t2.tv_nsec - t1.tv_nsec, (t2.tv_sec - t1.tv_sec)+(t2.tv_nsec - t1.tv_nsec)/1E9);
        if ((t2.tv_sec - t1.tv_sec) == 1) break;
        i++;
    }
    return 0;
}
Because 5551 - 999896062 is a negative value, stored in a temporary of type long, but interpreted by printf as unsigned long due to the %lu conversion specifier. Note that the tv_nsec field in struct timespec is of type long, not unsigned long. Similarly, on Linux and other Unix systems time_t is a typedef for a signed integer type. So get rid of all the unsigned stuff in your code.
Btw, a way to subtract two timespec instances is
struct timespec diff(struct timespec start, struct timespec end)
{
    struct timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        /* borrow one second when the nanosecond difference is negative */
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return temp;
}
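The same signedness explains why the floating-point column in the question still prints 0.000109: there the negative long is converted to double with its sign intact, so 1 + (-999890511)/1E9 is a small positive number. A minimal sketch of that computation, where the helper name timespec_elapsed_seconds is hypothetical and the sample values are the ones from the last row of the question's output:

```c
#include <assert.h>
#include <time.h>

/* Elapsed time from start to end in seconds, computed in signed
   arithmetic so that a negative nanosecond difference is handled. */
static double timespec_elapsed_seconds(struct timespec start, struct timespec end)
{
    assert(start.tv_nsec >= 0 && start.tv_nsec < 1000000000L);
    assert(end.tv_nsec >= 0 && end.tv_nsec < 1000000000L);
    long long sec = (long long)end.tv_sec - (long long)start.tv_sec;
    long long nsec = (long long)end.tv_nsec - (long long)start.tv_nsec;
    return (double)sec + (double)nsec / 1e9;
}
```

For t1 = 30437 s + 999896062 ns and t2 = 30438 s + 5551 ns, the differences are 1 s and -999890511 ns, i.e. 109489 ns in total.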
Alternatives for creating graphs in C
I recently created a program that calculates flow rate through a pipe and generates, line by line, a scatter graph of the output. My knowledge of C is rudimentary (started with python) and I get the feeling that I may have made the code overly complicated. As such, I am asking if anyone has any alternatives to the code below. Critiques of code structure etc. are also welcome!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#define PI 3.1415926
double flow_rate(double diameter, double k, double slope){
    double area, w_perimeter, hyd_rad, fr;
    area = (PI*pow(diameter,2.0))/8.0;
    w_perimeter = (PI*diameter)/2.0;
    hyd_rad = area/w_perimeter;
    fr = (1.0/k)*area*pow(hyd_rad,(2.0/3.0))*pow(slope,(1.0/2.0));
    return fr;
}
int main(int argc, char **argv) {
    double avg_k=0.0312, min_slope=0.0008;
    float s3_diameter;
    int i=0, num=0, flow_array[6] ,rows, align=29;
    char graph[] = " ";
    char graph_temp[]= " ";
    printf("\nFlow Rate (x 10^-3) m^3/s\n");
    for (s3_diameter=0.50;s3_diameter>0.24;s3_diameter-=0.05){
        flow_array[i] = (1000*(flow_rate(s3_diameter, avg_k, min_slope))+0.5);
        i += 1;
    }
    for (rows=30;rows>0;rows--){
        strcpy(graph_temp,graph);
        for (num=0;num<6;num++){
            if (rows==flow_array[num] && rows%5==0){
                graph_temp[align] = '*';
                printf("%d%s\n",rows,graph_temp);
                align -= 5;
                break;
            }
            else if (rows==flow_array[num]){
                graph_temp[align] = '*';
                printf("|%s\n",graph_temp);
                align -= 5;
                break;
            }
            else {
                if (rows%5==0 && num==5){
                    printf("%d%s\n",rows,graph_temp);
                }
                else if (rows%5!=0 && num==5){
                    printf("|%s\n",graph_temp);
                }
            }
        }
    }
    printf("|----2----3----3----4----4----5----\n");
    printf(" 5 0 5 0 5 0\n");
    printf(" Diameter (x 10^-2) m\n");
    return 0;
}
Output as below.
Flow Rate (x 10^-3) m^3/s
30
|
|
|
|
25
|
|
| *
|
20
|
|
| *
|
15
|
|
| *
|
10
| *
|
|
| *
5
| *
|
|
|
|----2----3----3----4----4----5----
 5 0 5 0 5 0
 Diameter (x 10^-2) m
GNUPlot is by far the simplest way to draw a graph from C. It can draw anything from simple plots to complex 3D graphs, and even provides ASCII art output (if ASCII output is really required). You can find more information on how to use GNUPlot in a C program here:
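Independently of that reference, a common pattern is to write gnuplot commands and inline data to a stream that is connected to a gnuplot process. The sketch below only covers the writing part and is an assumption-laden illustration: the function name write_gnuplot_scatter and the axis labels are hypothetical, "set terminal dumb" selects gnuplot's ASCII output, and the special file name '-' tells gnuplot to read the data inline until a line containing "e".

```c
#include <assert.h>
#include <stdio.h>

/* Writes gnuplot commands plus inline data for an ASCII scatter plot
   to `out`. Returns the number of data points written. */
static size_t write_gnuplot_scatter(FILE *out, const double *x,
                                    const double *y, size_t n)
{
    assert(out != NULL && x != NULL && y != NULL);
    fprintf(out, "set terminal dumb\n");              /* ASCII art output */
    fprintf(out, "set xlabel 'Diameter (m)'\n");
    fprintf(out, "set ylabel 'Flow rate (m^3/s)'\n");
    fprintf(out, "plot '-' using 1:2 with points notitle\n");
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%g %g\n", x[i], y[i]);          /* one point per line */
    fprintf(out, "e\n");                              /* end of inline data */
    fflush(out);
    return n;
}
```

On a POSIX system with gnuplot on the PATH, the stream would typically come from popen("gnuplot", "w") and be closed with pclose; writing to an ordinary file instead produces a script that can be run with gnuplot later.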