I am trying to build a Monte Carlo based simulation in C, since it is very fast, and I was wondering why my code is producing an even distribution.
To set the scene: before I started coding, I imagined a few balls falling down, and at every dot they can move left or right at random. The picture is shown here: https://www.youtube.com/watch?v=PM7z_03o_kk.
But what I got is kind of strange. When I set the number of scatter points to 10 (it is set to 100 in the code example):
while(j < 100) // Number of arbitrary changes of direction
I get a picture like a Gaussian distribution, but only the even bins contribute to it. When the number is large enough, as in the code shown, every bin gets about the same number of particles. This is the 2D case, which will be expanded to the 3D case once it works as expected.
There are still some variables in it that are not really necessary; they are there just to rule out any mistake I could imagine. I worked with gdb to find the error. When I ran just distr() under gdb, I found that it produces only even numbers if the value above is set to 10. When I go to 11, bin[0] starts to contribute with a very small amount compared to the rest. I also ran rando() enough times to see whether it is really pseudo-random, and it looks like it works.
I was still not able to figure out the mistake, so I really hope there are enough people here smarter than me oO.
Full code:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#define SEED time(NULL)
int rando(void) // Monte Carlo method for choosing whether left or right
{
    double temp=0;
    temp = (double) rand()/RAND_MAX; // pseudo random number between 0-1
    if (temp >= 0.5)
        return 1; // Particle to the right
    return 0; // Particle to the left
}
int distr(void) // Binning particle
{
    int i=10; // Center of bin
    int j=0;
    int k=0;
    while(j < 100) // Number of arbitrary changes of direction
    {
        k = rando();
        if(k == 1)
            if ( i < 13) // Choose upper bound of bins
                i++;
        if(k == 0)
            if (i > 7) // Choose lower bound of bins
                i--;
        j++;
    }
    return i;
}
int main(void)
srand ( SEED );
int* bin;
int binning;
int k=0;
int l=0;
int iter;
fprintf(stdout, "\nIterations: ");
fscanf(stdin, "%d", &iter);
bin = malloc(21*sizeof(int));
while (k < 20)
bin[k] = 0;
k = 0;
while(k < iter) // Number of particles to distribute
binning = distr(); // Choosing the bin
bin[binning]+=1; // Counting binned particle per bin
while(l < 20)
fprintf(stdout, "\n %d", bin[l]);
I look forward to your answers, and thanks in advance.
Among several other issues, there are a couple of logical errors in the implementation of the distr function and its caller.
In the linked video, the "obstacles" are shaped in lines forming a triangle:
. Falling balls.
. *
. * . *
. * * . * Obstacles.
. * * . * *
. * * . * * *
. .
. | | o | | | Allowed result.
. |___|___|___|___|
o Rejected.
Note that the obstacles are arranged in staggered lines, that every bin can be "fed" by two obstacles, and that some balls will end up outside the allowed bins.
The posted code seems to implement another model:
-1 . +1
.[*]. If the choice is between +1 and -1...
. .
. .
. .
.[*]. [X] [*]. One out of two obstacles can't be reached.
. . .
. . .
. . .
[*]. [x] .[*] [X] [*].
. . .
. .
. .
. .
| | X | o | X | | X | . | The same happens to the bins.
| | | . | | | | o |
| | | o | | | | |
There are no rejections, only unreachable bins, the odd ones or the even ones depending on the number of lines of obstacles.
If there are enough lines of obstacles, more than the number of bins, the "balls" can't escape outside and they fall into the corresponding obstacle in the next line (effectively a 0 movement). Now they start spreading towards the center, filling all the bins, in what is definitely not a normal distribution.
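The parity effect described above can be checked with a small sketch. The names `walk_pm1` and `count_parity_violations` are illustrative and not taken from the question, and the bin bounds are left out on purpose so that only the parity argument is exercised.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: start at `start` and take `steps` moves of +1 or -1.
 * Every move flips the parity of the position, so the final position
 * always has the parity of (start + steps). */
int walk_pm1(int start, int steps)
{
    assert(steps >= 0);
    int pos = start;
    while (steps-- > 0)
        pos += (rand() & 1) ? 1 : -1;
    return pos;
}

/* Counts how many of `trials` walks end on a position whose parity
 * differs from (start + steps). By the argument above this stays 0. */
int count_parity_violations(int start, int steps, int trials)
{
    int violations = 0;
    for (int t = 0; t < trials; t++) {
        int end = walk_pm1(start, steps);
        if (((end - start - steps) % 2) != 0)
            violations++;
    }
    return violations;
}
```

With a start of 10 and 10 steps, every walk ends on an even position, which matches the "only even bins" observation in the question.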
I'd suggest considering a different random walk, one that either advances by one or stays put:
[1] .\.
[0] . | .
[0] . | .
[1] . \ .
[0] . | .
[1] . \ .
[1] . \ .
[0] . | .
| | | | | | | | | |
0 1 2 3 4 5 6 7 8
^ Result
Which could be implemented as easily as this unoptimized function
int random_walk(int steps)
{
    int count = 0;
    while ( steps-- )
        count += rand() > RAND_MAX / 2;
    return count;
}
Note, though, that it uses only one bit of the value returned by rand and that the final result is the total number of non-zero bits. I leave all the possible optimizations to the reader.
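A sketch of how the function above could feed a histogram. `fill_histogram` and its parameters are illustrative names, not part of the original code; the only assumption is that the caller provides `steps + 1` counters, since the walk can return any value from 0 to `steps`.

```c
#include <assert.h>
#include <stdlib.h>

/* Same walk as above: each step either advances by one or stays put. */
int random_walk(int steps)
{
    int count = 0;
    while (steps-- > 0)
        count += rand() > RAND_MAX / 2;
    return count;
}

/* Hypothetical driver: `bins` must hold steps + 1 counters, because
 * random_walk(steps) returns a value in 0..steps. */
void fill_histogram(int *bins, int steps, int iterations)
{
    assert(bins != NULL && steps >= 0);
    for (int b = 0; b <= steps; b++)
        bins[b] = 0;
    for (int k = 0; k < iterations; k++)
        bins[random_walk(steps)] += 1;
}
```

Because every result lands in a valid bin, no particle is lost and all bins, odd and even, can be reached.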
Apparently, with the same even number of +1 or -1 moves you will always land on a final bin with the same parity as the start. So if you start, as you did, from an odd bin, the final bin will be odd; start from an even bin, as I did, and you will get an even final bin. So what I did was randomly start with j = 1 or 0, so that about half of the runs have an odd number of movements and half an even number. I reduced the number of iterations of the loop in distr() to 50 and increased the number of bins so that most of the results (99%) are captured. You now get a nice normal distribution.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
#define SEED time(NULL)
int rando(void) // Monte Carlo method for choosing whether left or right
return rand() & 1;
int distr(void) // Binning particle
int i=20; // Center of bin
int j=rand() & 1; // makes the number of movements either odd or even
int k=0;
while(j < 50) // Number of 50 or 49 changes of direction
if(k == 1)
printf(" i = %d ", i);
if(k == 0)
printf(" i = %d ", i);
return i;
int main(void)
srand ( SEED );
int* bin;
int binning;
int k=0;
int iter;
printf("\nIterations: ");
scanf("%d", &iter);
bin = malloc(40*sizeof(int)); // 40 bins, matching the loops below
while (k < 40) // zeroes bin[0] to bin[39]
bin[k] = 0;
k = 0;
while(k < iter) // Number of particles to distribute
binning = distr(); // Choosing the bin
printf("binning = %d ", binning);
bin[binning]+=1; // Counting binned particle per bin
int l = 0;
while(l < 40)
printf("\n %d", bin[l]);
/* total the number of iterations of distr() */
int total = 0;
l = 0;
while(l < 40)
total += bin[l];
printf("\n total number is %d\n\n", total);
return 0;
I am trying to print each char of an array of matrices for a brick breaker game (the full message would be YOU LOSE). I am new to C and I don't feel too confident about using pointers; I feel that that may be the source of my problem. To try to solve the problem, I've read plenty of online guides on how to deal with strings in C; but the fact that I'm dealing with an array of arrays of arrays of chars makes this task quite a bit harder. If you know how to print matrices of strings (in yet another array) in C, or you have a better solution, please let me know!
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#define LETTER_WIDTH 13
"___ __\n",
"\\ \\__ / /\n",
"\\ \\ / /\n",
"| | |\n",
"| | |\n",
" _______ \n",
" / __ \\\n",
"| | | |\n",
"| |__| |\n",
" \\_______/\n"};
void printLetter(char letter[LETTER_HEIGHT][LETTER_WIDTH]) {
for (int i = 0; i < LETTER_HEIGHT; i++) {
for (int j = 0; j < LETTER_WIDTH; j++) {
printf("%c", letter[i][j]);
void printSentence() {
for (int i = 0; i < 2; i++) {
strcpy(*letter, **SENTENCE[i]);
int main() {
return 0;
Firstly, this would be better:
char* Y[LETTER_HEIGHT] = {
"___ __\n",
"\\ \\__ / /\n",
"\\ \\ / /\n",
"| | |\n",
"| | |\n",
char* O[LETTER_HEIGHT] = {
" _______ \n",
" / __ \\\n",
"| | | |\n",
"| |__| |\n",
" \\_______/\n"};
Now these are arrays of size 6 (you must add one line to O, because it currently has a height of 5) containing pointers to arrays of chars. Next:
char** SENTENCE[2] = {Y, O};
You did some really weird things with this line before. This version defines SENTENCE as a 2-element array of pointers to arrays of pointers to char arrays (which are Y and O).
void printLetter(char** letter) {
for (int i = 0; i < LETTER_HEIGHT; i++) {
printf("%s", letter[i]);
This function takes a pointer to an array of pointers to char arrays. It then loops 6 times and prints each array as a string. Next:
void printSentence() {
for (int i = 0; i < 2; i++) {
Here you can use a simple for loop to pass each pointer to an array of pointers to char arrays (these are the letters) from SENTENCE to printLetter.
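Put together, the pieces above might look like the following sketch. The glyphs are small placeholders rather than the letters from the question, and `LETTER_COUNT`, `GLYPH_A`, `GLYPH_B`, `print_letter` and `print_sentence` are illustrative names; the functions return the number of characters written only so that the result is easy to check.

```c
#include <assert.h>
#include <stdio.h>

#define LETTER_HEIGHT 3   /* assumption: every letter has the same height */
#define LETTER_COUNT  2

/* Placeholder glyphs, much smaller than the ones in the question. */
static const char *GLYPH_A[LETTER_HEIGHT] = {
    " _ \n",
    "|_|\n",
    "| |\n"
};
static const char *GLYPH_B[LETTER_HEIGHT] = {
    " _ \n",
    "|_)\n",
    "|_)\n"
};

/* Array of pointers to the per-letter arrays of row strings. */
static const char **SENTENCE[LETTER_COUNT] = { GLYPH_A, GLYPH_B };

/* Prints one letter row by row; returns characters written, or -1 on error. */
int print_letter(FILE *out, const char **letter)
{
    assert(out != NULL && letter != NULL);
    int written = 0;
    for (int i = 0; i < LETTER_HEIGHT; i++) {
        int n = fprintf(out, "%s", letter[i]);
        if (n < 0)
            return -1;
        written += n;
    }
    return written;
}

/* Prints every letter in SENTENCE, one below the other. */
int print_sentence(FILE *out)
{
    int written = 0;
    for (int i = 0; i < LETTER_COUNT; i++) {
        int n = print_letter(out, SENTENCE[i]);
        if (n < 0)
            return -1;
        written += n;
    }
    return written;
}
```

Calling `print_sentence(stdout)` prints the letters stacked vertically, since each row string ends in a newline.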
or you have a better solution, please let me know!
Yes, there is a much simpler and, I would argue, better solution: place the SENTENCE in a single 2D array and print it in one go. Even if you are going to use ncurses, this makes your job easier.
Note that with ncurses you can reposition the cursor, so you can print each letter separately on the same line; you wouldn't need to join them together as you try to do in SENTENCE.
#define LETTER_WIDTH 100
"__ __ ______ \n",
"\\ \\ / / / __ \\\n",
" \\ \\/ / | | | |\n",
" | | | | | |\n",
" | | | |__| |\n",
" |__| \\______/\n"};
void printSentence()
for (int i = 0; i < 6; i++)
printf("%s", SENTENCE[i]);
__ __ ______
\ \ / / / __ \
\ \/ / | | | |
| | | | | |
| | | |__| |
|__| \______/
Guys, I'm pretty stuck here. I'm trying to learn C and write some very basic code that asks the user to enter a number. This number then goes into the formula 2x+1, and I want the program to print a hollow square pattern of that size with different symbols for the rows and columns, a + in the corners, diagonals, and an "X" in the middle.
I'm stuck at the very beginning of the code. I don't even know where to start; I can't even work out how to use different symbols for the rows and columns.
I've been trying to learn and study this for 3 hours already, watched 20 different YouTube videos and read 20 different coding guides.
It's so frustrating..
I'm attaching a picture of my code & my output, and the desired output on the right.
the code itself:
int size;
printf("Please enter a number that will define the size of the square: \n");
scanf("%d", &size);
size = 2 * size + 1;
for (int i = 1; i <= size-2; i++) {
for (int j = 1; j <= size-2; j++) {
if (j == 1 || j == size - 1) {
else {
printf(" ");
if (i==1 || i==size-2){
else {
printf(" ");
#include <stdio.h>
int main(void) {
int size;
printf("Please enter a number that will define the size of the square: \n");
scanf("%d", &size);
size = 2 * size + 1;
const char *spaces="                                         ";
const char *dashes="-----------------------------------------";
printf("+%.*s+\n", size, dashes);
for(int i=1; i<size/2+1; ++i)
printf("|%.*s\\%.*s/%.*s|\n", i-1, spaces, size-2*i, spaces,i-1, spaces);
printf("|%.*sX%.*s|\n", size/2, spaces, size/2, spaces);
for(int i=size/2+1; i<size; ++i)
printf("|%.*s/%.*s\\%.*s|\n", size-i-1, spaces, 2*(i-size/2)-1, spaces, size-i-1, spaces);
printf("+%.*s+\n", size, dashes);
return 0;
Example Run:
Please enter a number that will define the size of the square: 8
+-----------------+
|\               /|
| \             / |
|  \           /  |
|   \         /   |
|    \       /    |
|     \     /     |
|      \   /      |
|       \ /       |
|        X        |
|       / \       |
|      /   \      |
|     /     \     |
|    /       \    |
|   /         \   |
|  /           \  |
| /             \ |
|/               \|
+-----------------+
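The same figure can also be produced cell by cell, which may be closer to the nested-loop attempt in the question. This is a sketch under the layout used in the answer above (a border around a size x size interior); `cell_char` and `print_square` are illustrative names.

```c
#include <assert.h>
#include <stdio.h>

/* Character at (row, col) for an interior of `size` cells per side.
 * Rows and columns run from 0 to size + 1; 0 and size + 1 are the border. */
char cell_char(int row, int col, int size)
{
    int last = size + 1;
    int on_row_edge = (row == 0 || row == last);
    int on_col_edge = (col == 0 || col == last);
    if (on_row_edge && on_col_edge) return '+';   /* corners */
    if (on_row_edge) return '-';                  /* top and bottom */
    if (on_col_edge) return '|';                  /* left and right */

    int main_diag = (row == col);
    int anti_diag = (row + col == last);
    if (main_diag && anti_diag) return 'X';       /* center */
    if (main_diag) return '\\';
    if (anti_diag) return '/';
    return ' ';
}

/* Prints the whole figure; size is expected to be odd (2x + 1). */
void print_square(FILE *out, int size)
{
    assert(out != NULL && size >= 1 && size % 2 == 1);
    for (int row = 0; row <= size + 1; row++) {
        for (int col = 0; col <= size + 1; col++)
            fputc(cell_char(row, col, size), out);
        fputc('\n', out);
    }
}
```

Deciding each character from its coordinates keeps the border, diagonal and center rules in one place, at the cost of one function call per cell.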
I have been trying to find divisors of potential factorial primes (numbers of the form n!+-1), and because I recently bought a Skylake-X workstation I thought I could get some speedup using AVX512 instructions.
The algorithm is simple: the main step is to take a modulo repeatedly with respect to the same divisor, looping over a large range of n values. Here is the naive approach written in C (P is a table of primes):
uint64_t factorial_naive(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
uint64_t n, i, residue;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2;
for (n=3; n <= nmax; n++){
residue *= n;
residue %= P[i];
// Lets check if we found factor
if (nmin <= n){
if( residue == 1){
report_factor(n, -1, P[i]);
if(residue == P[i]- 1){
report_factor(n, 1, P[i]);
Here the idea is to check a large range of n, e.g. 1,000,000 -> 10,000,000, against the same set of divisors, so we take a modulo with respect to the same divisor several million times. Using DIV is very slow, so there are several possible approaches depending on the range of the calculations. In my case n is most likely less than 10^7 and a potential divisor p is less than 10,000 G (< 10^13), so the numbers fit in 64 bits and even in 53 bits, but the product of the maximum residue (p-1) and n is larger than 64 bits. So I thought that the simplest version of the Montgomery method doesn't work, because we would be taking a modulo of a number that is larger than 64 bits.
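For reference, the overflow mentioned above can be sidestepped in scalar code with a 128-bit intermediate product. This sketch assumes GCC or Clang on a 64-bit target, where `unsigned __int128` is available as a compiler extension; `mulmod_u128` and `factorial_mod` are illustrative names. It is meant as a slow correctness baseline for checking faster variants, not as a replacement for them.

```c
#include <assert.h>
#include <stdint.h>

/* (a * b) mod p using a 128-bit product; compiler extension, not ISO C. */
static uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t p)
{
    assert(p != 0);
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* n! mod p, built up one multiplication at a time as in factorial_naive. */
uint64_t factorial_mod(uint64_t n, uint64_t p)
{
    uint64_t residue = 1 % p;
    for (uint64_t k = 2; k <= n; k++)
        residue = mulmod_u128(residue, k, p);
    return residue;
}
```

A residue of 1 means p divides n!-1, and a residue of p-1 means p divides n!+1, which is the same test the loops in this question perform.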
I found some old code for PowerPC where FMA was used to get an accurate product of up to 106 bits (I guess) when using doubles. So I converted this approach to AVX512 using Intel intrinsics. Here is a simple version of the FMA method. It is based on the work of Dekker (1971); "Dekker product" and "FMA version of TwoProduct" are useful search terms when looking for the rationale behind this. This approach has also been discussed in this forum (e.g. here).
int64_t factorial_FMA(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
uint64_t n, i;
double prime_double, prime_double_reciprocal, quotient, residue;
double nr, n_double, prime_times_quotient_high, prime_times_quotient_low;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2.0;
prime_double = (double)P[i];
prime_double_reciprocal = 1.0 / prime_double;
n_double = 3.0;
for (n=3; n <= nmax; n++){
nr = n_double * residue;
quotient = fma(nr, prime_double_reciprocal, rounding_constant);
quotient -= rounding_constant;
prime_times_quotient_high= prime_double * quotient;
prime_times_quotient_low = fma(prime_double, quotient, -prime_times_quotient_high);
residue = fma(residue, n, -prime_times_quotient_high) - prime_times_quotient_low;
if (residue < 0.0) residue += prime_double;
n_double += 1.0;
// Lets check if we found factor
if (nmin <= n){
if( residue == 1.0){
report_factor(n, -1, P[i]);
if(residue == prime_double - 1.0){
report_factor(n, 1, P[i]);
Here I have used the magic constant
static const double rounding_constant = 6755399441055744.0;
which is 2^51 + 2^52, the magic number for rounding doubles to integers.
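A small sketch of why adding and subtracting this constant rounds a double to an integer. It assumes the default round-to-nearest-even mode and inputs with magnitude below 2^51; `round_with_magic` is an illustrative name, and the `volatile` is only there to stop a compiler from simplifying the two operations away.

```c
#include <assert.h>

/* 2^52 + 2^51. For |x| < 2^51 the sum x + constant lies in [2^52, 2^53),
 * where consecutive doubles are exactly 1 apart, so the addition itself
 * rounds x to an integer. Subtracting the constant again is exact. */
static const double ROUNDING_CONSTANT = 6755399441055744.0;

double round_with_magic(double x)
{
    assert(x > -2251799813685248.0 && x < 2251799813685248.0); /* |x| < 2^51 */
    volatile double shifted = x + ROUNDING_CONSTANT;
    return shifted - ROUNDING_CONSTANT;
}
```

Ties go to the even neighbour, so 2.5 becomes 2.0 while 3.5 becomes 4.0; this differs from `round()`, which rounds halves away from zero.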
I converted this to AVX512 (32 potential divisors per loop) and analyzed the result using IACA. It reported "Throughput Bottleneck: Backend" and that backend allocation was stalled due to unavailable allocation resources.
I am not very experienced with assembler, so my question is: is there anything I can do to speed this up and solve this backend bottleneck?
The AVX512 code is here and can also be found on GitHub:
uint64_t factorial_AVX512_unrolled_four(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
// we are trying to find a factor for a factorial numbers : n! +-1
//nmin is minimum n we want to report and nmax is maximum. P is table of primes
// we process 32 primes in one loop.
// naive version of the algorithm is in the function factorial_naive
// and simple version of the FMA based approach in the function factorial_FMA
const double one_table[8] __attribute__ ((aligned(64))) ={1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
uint64_t n;
__m512d zero, rounding_const, one, n_double;
__m512i prime1, prime2, prime3, prime4;
__m512d residue1, residue2, residue3, residue4;
__m512d prime_double_reciprocal1, prime_double_reciprocal2, prime_double_reciprocal3, prime_double_reciprocal4;
__m512d quotient1, quotient2, quotient3, quotient4;
__m512d prime_times_quotient_high1, prime_times_quotient_high2, prime_times_quotient_high3, prime_times_quotient_high4;
__m512d prime_times_quotient_low1, prime_times_quotient_low2, prime_times_quotient_low3, prime_times_quotient_low4;
__m512d nr1, nr2, nr3, nr4;
__m512d prime_double1, prime_double2, prime_double3, prime_double4;
__m512d prime_minus_one1, prime_minus_one2, prime_minus_one3, prime_minus_one4;
__mmask8 negative_reminder_mask1, negative_reminder_mask2, negative_reminder_mask3, negative_reminder_mask4;
__mmask8 found_factor_mask11, found_factor_mask12, found_factor_mask13, found_factor_mask14;
__mmask8 found_factor_mask21, found_factor_mask22, found_factor_mask23, found_factor_mask24;
// load data and initialize variables for loop
rounding_const = _mm512_set1_pd(rounding_constant);
one = _mm512_load_pd(one_table);
zero = _mm512_setzero_pd ();
// load primes used to sieve
prime1 = _mm512_load_epi64((__m512i *) &P[0]);
prime2 = _mm512_load_epi64((__m512i *) &P[8]);
prime3 = _mm512_load_epi64((__m512i *) &P[16]);
prime4 = _mm512_load_epi64((__m512i *) &P[24]);
// convert primes to double
prime_double1 = _mm512_cvtepi64_pd (prime1); // vcvtqq2pd
prime_double2 = _mm512_cvtepi64_pd (prime2); // vcvtqq2pd
prime_double3 = _mm512_cvtepi64_pd (prime3); // vcvtqq2pd
prime_double4 = _mm512_cvtepi64_pd (prime4); // vcvtqq2pd
// calculates 1.0/ prime
prime_double_reciprocal1 = _mm512_div_pd(one, prime_double1);
prime_double_reciprocal2 = _mm512_div_pd(one, prime_double2);
prime_double_reciprocal3 = _mm512_div_pd(one, prime_double3);
prime_double_reciprocal4 = _mm512_div_pd(one, prime_double4);
// for comparison if we have found factors for n!+1
prime_minus_one1 = _mm512_sub_pd(prime_double1, one);
prime_minus_one2 = _mm512_sub_pd(prime_double2, one);
prime_minus_one3 = _mm512_sub_pd(prime_double3, one);
prime_minus_one4 = _mm512_sub_pd(prime_double4, one);
// residue init
residue1 = _mm512_set1_pd(2.0);
residue2 = _mm512_set1_pd(2.0);
residue3 = _mm512_set1_pd(2.0);
residue4 = _mm512_set1_pd(2.0);
// double counter init
n_double = _mm512_set1_pd(3.0);
// main loop starts here. typical value for nmax can be 5,000,000 -> 10,000,000
for (n=3; n<=nmax; n++) // main loop
// timings for instructions:
// _mm512_load_epi64 = vmovdqa64 : L 1, T 0.5
// _mm512_load_pd = vmovapd : L 1, T 0.5
// _mm512_set1_pd
// _mm512_div_pd = vdivpd : L 23, T 16
// _mm512_cvtepi64_pd = vcvtqq2pd : L 4, T 0,5
// _mm512_mul_pd = vmulpd : L 4, T 0.5
// _mm512_fmadd_pd = vfmadd132pd, vfmadd213pd, vfmadd231pd : L 4, T 0.5
// _mm512_fmsub_pd = vfmsub132pd, vfmsub213pd, vfmsub231pd : L 4, T 0.5
// _mm512_sub_pd = vsubpd : L 4, T 0.5
// _mm512_cmplt_pd_mask = vcmppd : L ?, Y 1
// _mm512_mask_add_pd = vaddpd : L 4, T 0.5
// _mm512_cmpeq_pd_mask = vcmppd L ?, Y 1
// _mm512_kor = korw L 1, T 1
// nr = residue * n
nr1 = _mm512_mul_pd (residue1, n_double);
nr2 = _mm512_mul_pd (residue2, n_double);
nr3 = _mm512_mul_pd (residue3, n_double);
nr4 = _mm512_mul_pd (residue4, n_double);
// quotient = nr * 1.0/ prime_double + rounding_constant
quotient1 = _mm512_fmadd_pd(nr1, prime_double_reciprocal1, rounding_const);
quotient2 = _mm512_fmadd_pd(nr2, prime_double_reciprocal2, rounding_const);
quotient3 = _mm512_fmadd_pd(nr3, prime_double_reciprocal3, rounding_const);
quotient4 = _mm512_fmadd_pd(nr4, prime_double_reciprocal4, rounding_const);
// quotient -= rounding_constant, now quotient is rounded to integer
// quotient should be at maximum nmax (10,000,000)
quotient1 = _mm512_sub_pd(quotient1, rounding_const);
quotient2 = _mm512_sub_pd(quotient2, rounding_const);
quotient3 = _mm512_sub_pd(quotient3, rounding_const);
quotient4 = _mm512_sub_pd(quotient4, rounding_const);
// now we calculate high and low for prime * quotient using Dekker product (FMA).
// quotient is calculated using approximation but this is accurate for given quotient
prime_times_quotient_high1 = _mm512_mul_pd(quotient1, prime_double1);
prime_times_quotient_high2 = _mm512_mul_pd(quotient2, prime_double2);
prime_times_quotient_high3 = _mm512_mul_pd(quotient3, prime_double3);
prime_times_quotient_high4 = _mm512_mul_pd(quotient4, prime_double4);
prime_times_quotient_low1 = _mm512_fmsub_pd(quotient1, prime_double1, prime_times_quotient_high1);
prime_times_quotient_low2 = _mm512_fmsub_pd(quotient2, prime_double2, prime_times_quotient_high2);
prime_times_quotient_low3 = _mm512_fmsub_pd(quotient3, prime_double3, prime_times_quotient_high3);
prime_times_quotient_low4 = _mm512_fmsub_pd(quotient4, prime_double4, prime_times_quotient_high4);
// now we calculate new remainder using Dekker product and using original values
// we subtract above calculated prime * quotient (quotient is approximation)
residue1 = _mm512_fmsub_pd(residue1, n_double, prime_times_quotient_high1);
residue2 = _mm512_fmsub_pd(residue2, n_double, prime_times_quotient_high2);
residue3 = _mm512_fmsub_pd(residue3, n_double, prime_times_quotient_high3);
residue4 = _mm512_fmsub_pd(residue4, n_double, prime_times_quotient_high4);
residue1 = _mm512_sub_pd(residue1, prime_times_quotient_low1);
residue2 = _mm512_sub_pd(residue2, prime_times_quotient_low2);
residue3 = _mm512_sub_pd(residue3, prime_times_quotient_low3);
residue4 = _mm512_sub_pd(residue4, prime_times_quotient_low4);
// lets check if reminder < 0
negative_reminder_mask1 = _mm512_cmplt_pd_mask(residue1,zero);
negative_reminder_mask2 = _mm512_cmplt_pd_mask(residue2,zero);
negative_reminder_mask3 = _mm512_cmplt_pd_mask(residue3,zero);
negative_reminder_mask4 = _mm512_cmplt_pd_mask(residue4,zero);
// we add prime back to remainder using mask if it was < 0
residue1 = _mm512_mask_add_pd(residue1, negative_reminder_mask1, residue1, prime_double1);
residue2 = _mm512_mask_add_pd(residue2, negative_reminder_mask2, residue2, prime_double2);
residue3 = _mm512_mask_add_pd(residue3, negative_reminder_mask3, residue3, prime_double3);
residue4 = _mm512_mask_add_pd(residue4, negative_reminder_mask4, residue4, prime_double4);
n_double = _mm512_add_pd(n_double,one);
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
{ // we find factor very rarely
double *residual_list1 = (double *) &residue1;
double *residual_list2 = (double *) &residue2;
double *residual_list3 = (double *) &residue3;
double *residual_list4 = (double *) &residue4;
double *prime_list1 = (double *) &prime_double1;
double *prime_list2 = (double *) &prime_double2;
double *prime_list3 = (double *) &prime_double3;
double *prime_list4 = (double *) &prime_double4;
for (int i=0; i <8; i++){
if( residual_list1[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list1[i]);
if( residual_list2[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list2[i]);
if( residual_list3[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list3[i]);
if( residual_list4[i] == 1.0)
report_factor((uint64_t) n, -1, (uint64_t) prime_list4[i]);
if(residual_list1[i] == (prime_list1[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list1[i]);
if(residual_list2[i] == (prime_list2[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list2[i]);
if(residual_list3[i] == (prime_list3[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list3[i]);
if(residual_list4[i] == (prime_list4[i] - 1.0))
report_factor((uint64_t) n, 1, (uint64_t) prime_list4[i]);
As a few commenters have suggested: a "backend" bottleneck is what you'd expect for this code. That suggests you're keeping things pretty well fed, which is what you want.
Looking at the report, there should be an opportunity in this section:
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
From the IACA analysis:
| 1 | 1.0 | | | | | | | | kmovw r11d, k0
| 1 | 1.0 | | | | | | | | kmovw eax, k1
| 1 | 1.0 | | | | | | | | kmovw ecx, k2
| 1 | 1.0 | | | | | | | | kmovw esi, k3
| 1 | 1.0 | | | | | | | | kmovw edi, k4
| 1 | 1.0 | | | | | | | | kmovw r8d, k5
| 1 | 1.0 | | | | | | | | kmovw r9d, k6
| 1 | 1.0 | | | | | | | | kmovw r10d, k7
| 1 | | 1.0 | | | | | | | or r11d, eax
| 1 | | | | | | | 1.0 | | or r11d, ecx
| 1 | | 1.0 | | | | | | | or r11d, esi
| 1 | | | | | | | 1.0 | | or r11d, edi
| 1 | | 1.0 | | | | | | | or r11d, r8d
| 1 | | | | | | | 1.0 | | or r11d, r9d
| 1* | | | | | | | | | or r11d, r10d
The processor is moving the resulting comparison masks (k0-k7) over to regular registers for the "or" operation. You should be able to eliminate those moves and do the "or" rollup in 6 ops vs 8.
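NOTE: the found_factor_mask variables are declared as __mmask8, which is the width that matches 8 doubles in a 512-bit vector (a __mmask16 would match 16 single-precision floats), so the type itself is fine. Combining the masks with mask intrinsics rather than the C | operator might let the compiler keep the "or" in mask registers. If not, drop to assembly as a commenter noted.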
And related: what fraction of iterations fire this or-mask clause? As another commenter observed, you should be able to unroll this with an accumulating "or" operation. Check the accumulated "or" value at the end of each unrolled iteration (or after N iterations), and if it's "true", go back and redo the values to figure out which n value triggered it.
(And, you can binary search within the "roll" to find the matching n value -- that might get some gain).
Next, you should be able to get rid of this mid-loop check:
// if we are below nmin then we continue next iteration, we
if (n < nmin) continue;
Which shows up here:
| 1* | | | | | | | | | cmp r14, 0x3e8
| 0*F | | | | | | | | | jb 0x229
It may not be a huge gain since the predictor will (probably) get this one (mostly) right, but you should get some gains by having two distinct loops for two "phases":
n=3 to n=nmin-1
n=nmin and beyond
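The two phases can be sketched in scalar form as follows. The helper `mulmod` uses a 128-bit product (a GCC/Clang extension) in place of the FMA sequence, and `count_hits_two_phase` is an illustrative name that only counts hits instead of reporting them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for the vector multiply-and-reduce step. */
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
{
    return (uint64_t)(((unsigned __int128)a * b) % p);
}

/* Counts n in [nmin, nmax] with n! == +-1 (mod p), using two loops so the
 * hot loop for n < nmin carries no comparison or branch on nmin. */
int count_hits_two_phase(uint64_t p, uint64_t nmin, uint64_t nmax)
{
    assert(p > 2 && nmin >= 3);
    uint64_t residue = 2 % p;
    uint64_t n = 3;

    /* Phase 1: n = 3 .. nmin-1, residue update only. */
    for (; n < nmin && n <= nmax; n++)
        residue = mulmod(residue, n, p);

    /* Phase 2: n = nmin .. nmax, residue update plus factor checks. */
    int hits = 0;
    for (; n <= nmax; n++) {
        residue = mulmod(residue, n, p);
        hits += (residue == 1) + (residue == p - 1);
    }
    return hits;
}
```

The first loop corresponds to the warm-up range that never reports anything, so the `n < nmin` test disappears from the loop that does the checking.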
Even if you gain a cycle, that's 3%. And since that's generally related to the big 'or' operation, above, there may be more cleverness in there to be found.
We are being taught programming using C in this semester and in our first assignment we were asked to print the list of values of sin(x), cos(x) and tan(x) using manual and library implementations. So, I wrote the following code:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define START 0
#define STOP 360
#define STEP 10
#define PI 3.14159265358979323846 /*For conversion: degrees <-> radians */
double rad(double x); /* Converts angle in degrees to radians */
double next_term(double angle, int term_index);
/* Function prototypes for manual implementation of sin(x), cos(x) and tan(x) */
double sin_series(double x);
double cos_series(double x);
double tan_series(double x);
int main() {
/* Creating and printing the table title:
** ==================================================================== */
char table_title[] = "   n | ";
strcat(table_title, "sin series | sin library | ");
strcat(table_title, "cos series | cos library | ");
strcat(table_title, "tan series | tan library | ");
printf("%s \n", table_title);
/* ==================================================================== */
/* Creating and printing the line between the title and the table:
** ==================================================================== */
char* second_line = (char*)malloc((strlen(table_title) + 1) * sizeof(char));
for (int i = 0; i < strlen(table_title) - 1; i++) {
second_line[i] = '=';
second_line[strlen(table_title)] = '\0';
printf("%s \n", second_line);
/* ==================================================================== */
/* Creating each line of the table and printing it:
** ==================================================================== */
for (int angle = 0; angle < 360; angle += 10) {
printf("%4i | ", angle);
printf("%10.5f | %11.5f | ",sin_series(angle), sin(rad(angle)) );
printf("%10.5f | %11.5f | ",cos_series(angle), cos(rad(angle)) );
printf("%10.5f | %11.5f | ",tan_series(angle), tan(rad(angle)) );
return 0;
double rad(double x) {
return ((PI * x) / 180);
double next_series_term(double angle, int term_index) {
double result = 1.0;
for (int i = 0; i < term_index; i++) {
result *= angle;
result /= (i + 1);
return result;
unsigned long long factorial(int x) {
unsigned long long result = 1;
for (int i = 0; i < x; i++) {
result *= (i + 1);
return result;
double sin_series(double x) {
double result = 0;
if (x == 0 || x == 180 || x == 360) {
result = 0;
else {
for (int i = 0; i < 100; i++) {
/* Calculating the next term to add to result to increase precision.*/
double next_term = next_series_term(rad(x), 2*i + 1);
next_term *= pow(-1,i);
result += next_term;
return result;
double cos_series(double x) {
double result = 0;
if (x == 90 || x == 270) {
result = 0;
else {
for (int i = 0; i < 100; i++) {
/* Calculating the next term to add to result to increase precision.*/
double next_term = next_series_term(rad(x), 2*i);
next_term *= pow(-1,i);
result += next_term;
return result;
double tan_series(double x) {
return sin_series(x)/cos_series(x); //non-portable! searching for
// better solution
But this code results in a segmentation fault after returning 0, as I found using gdb. Being a novice in C and programming, this has left me completely baffled. Please help.
This declaration
char table_title[] = "   n | ";
declares table_title to be an array of eight characters. When you append other strings to the end of the array, you write out of bounds and have undefined behavior.
Either specify a size big enough to hold all the data you need, or initialize it properly with the complete string.
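One way to follow the first option is to build the header in a buffer whose size is passed along, so nothing can be written past its end. This is a sketch; `build_title` and `build_rule` are illustrative names, and the header text is copied from the question.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the table header into buf. snprintf never writes more than
 * `size` bytes and always terminates the string when size > 0.
 * Returns the length the full header needs, excluding the terminator. */
int build_title(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%s%s%s%s",
                    "   n | ",
                    "sin series | sin library | ",
                    "cos series | cos library | ",
                    "tan series | tan library | ");
}

/* Fills `line` with `length` '=' characters and terminates it.
 * The caller must provide room for length + 1 bytes. */
void build_rule(char *line, size_t length)
{
    assert(line != NULL);
    memset(line, '=', length);
    line[length] = '\0';
}
```

A caller could declare `char title[128];`, call `build_title(title, sizeof title)`, and compare the return value with `sizeof title` to detect truncation before printing.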
You initialize table_title with a short string, so sizeof(table_title) is only big enough for that string and cannot hold the whole text you append afterwards. Accessing an array out of bounds is undefined behavior and might cause a crash.
Why did you use table_title? There is no need. Just use printf.
printf ("sin series | sin library | cos series | cos library | tan series | tan library | ");
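You declared table_title statically, so it needs no free(). free() is used to deallocate memory that you allocated dynamically, such as second_line, so call it for that. Without the strcat calls on table_title there won't be a segmentation fault.
OK, then you can use printf like this: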
printf (
"sin series | sin library | " \
"cos series | cos library | " \
"tan series | tan library | ");
There are no line wraps this way. Try this. :)