I am trying to print each char of an array of matrices for a brick breaker game (the full message would be YOU LOSE). I am new to C and I don't feel too confident about using pointers; I feel that that may be the source of my problem. To try to solve the problem, I've read plenty of online guides on how to deal with strings in C; but the fact that I'm dealing with an array of arrays of arrays of chars makes this task quite a bit harder. If you know how to print matrices of strings (in yet another array) in C, or you have a better solution, please let me know!
#include <stdio.h>
#include <unistd.h>
#include <string.h>
#define LETTER_WIDTH 13
#define LETTER_HEIGHT 6
char Y[LETTER_HEIGHT][LETTER_WIDTH] = {
"___ __\n",
"\\ \\__ / /\n",
"\\ \\ / /\n",
"| | |\n",
"| | |\n",
"|__|__|\n"};
char O[LETTER_HEIGHT][LETTER_WIDTH] = {
" _______ \n",
" / __ \\\n",
"| | | |\n",
"| |__| |\n",
" \\_______/\n"};
char *SENTENCE[2][LETTER_HEIGHT][LETTER_WIDTH] = {*Y, *O};
void printLetter(char letter[LETTER_HEIGHT][LETTER_WIDTH]) {
for (int i = 0; i < LETTER_HEIGHT; i++) {
for (int j = 0; j < LETTER_WIDTH; j++) {
printf("%c", letter[i][j]);
}
}
}
void printSentence() {
for (int i = 0; i < 2; i++) {
char letter[LETTER_HEIGHT][LETTER_WIDTH];
strcpy(*letter, **SENTENCE[i]);
printLetter(letter);
sleep(1);
}
}
int main() {
printSentence();
return 0;
}
First, it is simpler to store each letter as an array of pointers to string literals:
char* Y[LETTER_HEIGHT] = {
"___ __\n",
"\\ \\__ / /\n",
"\\ \\ / /\n",
"| | |\n",
"| | |\n",
"|__|__|\n"};
char* O[LETTER_HEIGHT] = {
" _______ \n",
" / __ \\\n",
"| | | |\n",
"| |__| |\n",
" \\_______/\n"};
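Now Y and O are arrays of 6 pointers, each pointing to a string (an array of chars). Note that O only has 5 rows here, so its sixth pointer is NULL; you must add one more row to O, because passing a null pointer to printf's %s is undefined behavior. Next: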
char** SENTENCE[2] = {Y, O};
Your original declaration, char *SENTENCE[2][LETTER_HEIGHT][LETTER_WIDTH], declared a 3D array of char pointers, which is not what you want. The line above instead defines SENTENCE as a 2-element array whose elements point to the arrays of char pointers (Y and O).
Next
void printLetter(char** letter) {
for (int i = 0; i < LETTER_HEIGHT; i++) {
printf("%s", letter[i]);
}
}
This function takes a pointer to the first element of an array of char pointers. It then loops LETTER_HEIGHT (6) times and prints each row as a string. Next:
void printSentence() {
for (int i = 0; i < 2; i++) {
printLetter(SENTENCE[i]);
sleep(1);
}
}
Here a simple for loop passes each element of SENTENCE (a pointer to one letter's array of rows) to printLetter.
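The pieces above can be put together roughly as follows. This is only a sketch under a few assumptions: the letter shapes are simplified placeholders, the names LETTER_Y, LETTER_O, print_letter and print_sentence are hypothetical, and each letter ends with a NULL sentinel (not part of the answer above) so that letters of different heights do not need padding rows.

```c
#include <stdio.h>

/* Hypothetical, simplified letters. Each is a NULL-terminated array of
 * row strings, so letters may have different heights. */
static const char *const LETTER_Y[] = {
    "\\   /\n",
    " \\ / \n",
    "  |  \n",
    "  |  \n",
    NULL
};

static const char *const LETTER_O[] = {
    " __ \n",
    "|  |\n",
    "|__|\n",
    NULL
};

/* The sentence is an array of pointers to the letters' row arrays. */
static const char *const *const SENTENCE[] = { LETTER_Y, LETTER_O };
#define SENTENCE_LEN (sizeof SENTENCE / sizeof SENTENCE[0])

/* Print rows until the NULL sentinel; return the number of rows printed. */
int print_letter(FILE *out, const char *const *rows)
{
    int n = 0;
    for (; rows[n] != NULL; n++)
        fputs(rows[n], out);
    return n;
}

/* Print every letter in SENTENCE; return the total number of rows printed. */
int print_sentence(FILE *out)
{
    int total = 0;
    for (size_t i = 0; i < SENTENCE_LEN; i++)
        total += print_letter(out, SENTENCE[i]);
    return total;
}
```

Calling print_sentence(stdout) prints the letters one below the other. The const qualifiers reflect that string literals must not be modified.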
"or you have a better solution, please let me know!"
Yes, there is a much simpler and, I would argue, better solution: place the whole SENTENCE in a single 2D array and print it in one go. Even if you later switch to ncurses, this makes your job easier.
Note that with ncurses you can reposition the cursor, so you could print each letter separately on the same lines; you would not need to join the letters together the way you try to do in SENTENCE.
#define LETTER_WIDTH 100
#define LETTER_HEIGHT 6
char SENTENCE[LETTER_HEIGHT][LETTER_WIDTH] = {
"__ __ ______ \n",
"\\ \\ / / / __ \\\n",
" \\ \\/ / | | | |\n",
" | | | | | |\n",
" | | | |__| |\n",
" |__| \\______/\n"};
void printSentence()
{
for (int i = 0; i < 6; i++)
{
printf("%s", SENTENCE[i]);
}
}
Output:
__ __ ______
\ \ / / / __ \
\ \/ / | | | |
| | | | | |
| | | |__| |
|__| \______/
Guys, I'm pretty stuck here. I'm trying to learn C by writing some very basic code that asks the user for a number x and computes 2x+1 from it. I then want it to print a hollow square pattern of that size with different symbols for the rows and columns, a + in the corners, diagonals, and an "X" in the middle.
I'm stuck in the very very beginning of the code. I don't know where should I even start. I mean I can't even learn how to make different symbols for the rows and columns.
I'm trying to learn and study it for 3 hours already, watched 20 different YouTube videos and read 20 different coding guides.
It's so frustrating..
Thanks.
I'm attaching a picture of my code & my output, and the desired output on the right.
the code itself:
int size;
printf("Please enter a number that will define the size of the square: \n");
scanf("%d", &size);
size = 2 * size + 1;
for (int i = 1; i <= size-2; i++) {
for (int j = 1; j <= size-2; j++) {
if (j == 1 || j == size - 1) {
printf("|");
}
else {
printf(" ");
}
if (i==1 || i==size-2){
printf("-");
}
else {
printf(" ");
}
}
printf("\n");
}
#include <stdio.h>
int main(void) {
int size;
printf("Please enter a number that will define the size of the square: \n");
scanf("%d", &size);
size = 2 * size + 1;
const char *spaces="                                         "; /* same length as dashes */
const char *dashes="-----------------------------------------";
printf("+%.*s+\n", size, dashes);
for(int i=1; i<size/2+1; ++i)
{
printf("|%.*s\\%.*s/%.*s|\n", i-1, spaces, size-2*i, spaces,i-1, spaces);
}
printf("|%.*sX%.*s|\n", size/2, spaces, size/2, spaces);
for(int i=size/2+1; i<size; ++i)
{
printf("|%.*s/%.*s\\%.*s|\n", size-i-1, spaces, 2*(i-size/2)-1, spaces, size-i-1, spaces);
}
printf("+%.*s+\n", size, dashes);
return 0;
}
Example Run:
Please enter a number that will define the size of the square: 8
+-----------------+
|\               /|
| \             / |
|  \           /  |
|   \         /   |
|    \       /    |
|     \     /     |
|      \   /      |
|       \ /       |
|        X        |
|       / \       |
|      /   \      |
|     /     \     |
|    /       \    |
|   /         \   |
|  /           \  |
| /             \ |
|/               \|
+-----------------+
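For the "different symbol for rows and columns" part of the question, another way to think about it is to decide each character from its row and column alone. The following is a minimal sketch of that idea, not the answer's method: the names cell and print_square are hypothetical, and unlike the program above the border is part of the n-by-n grid (n odd, n >= 3) rather than drawn around it.

```c
#include <stdio.h>

/* Decide which character belongs at row i, column j of an n-by-n grid.
 * Checks run from the most specific case to the least specific one. */
char cell(int i, int j, int n)
{
    int last = n - 1;
    int on_row_edge = (i == 0 || i == last);   /* top or bottom row */
    int on_col_edge = (j == 0 || j == last);   /* left or right column */

    if (on_row_edge && on_col_edge) return '+';    /* corners */
    if (i == n / 2 && j == n / 2)   return 'X';    /* centre */
    if (on_row_edge)                return '-';    /* horizontal border */
    if (on_col_edge)                return '|';    /* vertical border */
    if (i == j)                     return '\\';   /* main diagonal */
    if (i + j == last)              return '/';    /* anti-diagonal */
    return ' ';
}

/* Print the whole square, one row per line. */
void print_square(int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            putchar(cell(i, j, n));
        putchar('\n');
    }
}
```

With this structure, changing a symbol means touching one line in cell, and the two nested loops never change.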
I am writing a small program for amortization using pointers
#include <stdio.h>
#include <string.h>
double power(double a, double b);
int main(void)
{
int loanAmount, number_of_payments, i = 0;
double interestRate, monthlyInterestRate, monthlyPayment;
printf("Enter amount of loan : $ ");
scanf(" %i",&loanAmount);
printf("Enter Interest rate per year : ");
scanf(" %lf",&interestRate);
printf("Enter number of payments : ");
scanf(" %i",&number_of_payments);
monthlyInterestRate = ((interestRate / 100) / 12); //AKA 'r' or rate.
monthlyPayment = (monthlyInterestRate) * (loanAmount/(1 - 1/(power((1 +
monthlyInterestRate), number_of_payments))));
double interest[7] = {0}; //Arbitrarily set to 7 - assuming less payments.
double principal[7] = {0};
double balance[7] = {0};
balance[0] = loanAmount;
double *ipoint,*ppoint,*bpoint,*bpointprev;
ipoint = &interest[0];
ppoint = &principal[0];
bpoint = &balance[0];
bpointprev = bpoint;
printf("Monthly payment should be $ %lf\n",monthlyPayment);
printf("# \t Payment \t Principal \t Interest \t Balance \n");
for (i = 1; i <= number_of_payments; i++) {
ipoint += i;
bpoint += i;
ppoint += i;
*ipoint = *bpointprev * monthlyInterestRate;
*ppoint = monthlyPayment - *ipoint;
*bpoint = *bpointprev - *ppoint;
printf("%i \t %.2f \t %.2f \t\t %.2f \t\t %.2f\n",i,monthlyPayment,*ppoint,*ipoint,*bpoint);
bpointprev += i; //Iterates after logic for next calculation.
}
return 0;
}
double power(double a, double b)
{
double i, sum = 1;
for (i = 0; i < b; i++) {
sum = sum * a;
}
return sum;
}
and came across an issue where the IDE I am writing in runs the program fine:
on Cloud9 IDE:
but on a Unix terminal, the increment variable in my for loop jumps after what seems to be an arbitrary count:
on Unix Terminal:
I am fairly certain it has something to do with all of the pointer references I have whizzing around, but I don't know why it would affect the int variable i in the for-loop or why the IDE would handle the error so cleanly. Please educate!
Your code has a number of issues that are just waiting to cause problems. As you have discovered, you failed to protect your array bounds: by incrementing your pointers with ptr += i, you invoke Undefined Behavior by accessing and writing to memory outside of the storage for your arrays.
Take for example interest and ipoint:
double interest[7] = {0};
ipoint = &interest[0];
So your indexes within interest array are as follows and ipoint is initialized to point to the first element in interest:
          +---+---+---+---+---+---+---+
interest  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
          +---+---+---+---+---+---+---+
            ^
            |
          ipoint
Within your loop you are advancing ipoint += i. On the first iteration of your loop, you advance ipoint by one:
          +---+---+---+---+---+---+---+
interest  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
          +---+---+---+---+---+---+---+
                ^
                |
              ipoint
Your second iteration, you advance by two:
          +---+---+---+---+---+---+---+
interest  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
          +---+---+---+---+---+---+---+
                        ^
                        |
                      ipoint
Your third iteration, you advance by three:
          +---+---+---+---+---+---+---+
interest  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
          +---+---+---+---+---+---+---+
                                    ^
                                    |
                                  ipoint
When i = 4 you advance ipoint beyond the end of your array, and you invoke Undefined Behavior when you assign through ipoint and attempt to store the value in memory you do not own:
          +---+---+---+---+---+---+---+---+---+---+---+
interest  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | out of bounds |
          +---+---+---+---+---+---+---+---+---+---+---+
                                                    ^
                                                    |
                                                  ipoint
Note: when Undefined Behavior is invoked, your code can appear to work normally, or it can SEGFAULT (or anything in between), the operation of your code is simply Undefined and cannot be relied upon from that point forward.
What you need to do is advance each of your pointers by 1, not by i. That ensures you do not write beyond your array bounds, as long as number_of_payments is smaller than the array size. You can fix the problem by simply changing i to 1 for each, e.g.
ipoint += 1;
bpoint += 1;
ppoint += 1;
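To see how quickly ptr += i runs away compared with ptr += 1, the index can be computed without touching any array. This is just an illustrative sketch; the function name index_after is hypothetical.

```c
/* Index a pointer would reach after `iterations` loop passes, starting
 * from element 0. A nonzero step_by_i mimics `ptr += i` with
 * i = 1, 2, 3, ...; zero mimics `ptr += 1`. No memory is accessed. */
int index_after(int iterations, int step_by_i)
{
    int index = 0;
    for (int i = 1; i <= iterations; i++)
        index += step_by_i ? i : 1;
    return index;
}
```

With ptr += i the index follows the triangular numbers 1, 3, 6, 10, so a 7-element array is overrun on the fourth pass, while ptr += 1 reaches only index 4 at that point.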
There are a number of other places where you risk invoking Undefined Behavior. You fail to check the return of scanf. If you input (by accident, or because a cat steps on the keyboard) anything other than a numeric value, a matching failure occurs: no characters are read from stdin, the same offending input makes every later conversion fail as well, and you then proceed to process indeterminate values, which invokes Undefined Behavior.
Further, if you enter more than 6 for number_of_payments (your arrays hold 7 elements and element 0 stores the starting balance), you invoke Undefined Behavior. A value of zero or less produces no payments at all, which is rather uninteresting. When taking input, not only do you have to validate that the conversion succeeds, you must also validate that the resulting value is within a usable range to avoid Undefined Behavior, e.g.
printf ("Enter number of payments : ");
if (scanf (" %i", &number_of_payments) != 1) {
fprintf (stderr, "error: invalid no. of payments.\n");
return 1;
}
/* validate no. pmts in range */
if (number_of_payments < 1 || number_of_payments > MAXPMTS) {
fprintf (stderr, "error: no. of payments exceed MAXPMTS (%d)\n",
MAXPMTS);
return 1;
}
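The same two checks (did the conversion succeed, and is the value in range) can be wrapped in one helper. The sketch below swaps scanf for strtol on a string, for example a line read with fgets, which is a different technique from the one shown above; the name parse_int_in_range is hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a base-10 integer from s and accept it only if it lies in
 * [lo, hi]. Returns 1 and stores the value in *out on success,
 * 0 on any failure (no digits, overflow, trailing junk, out of range). */
int parse_int_in_range(const char *s, int lo, int hi, int *out)
{
    char *end = NULL;

    errno = 0;
    long v = strtol(s, &end, 10);

    if (end == s)                 /* no digits were converted */
        return 0;
    if (errno == ERANGE)          /* value does not fit in a long */
        return 0;
    while (*end == ' ' || *end == '\t' || *end == '\n')
        end++;                    /* allow trailing whitespace only */
    if (*end != '\0')             /* anything else is junk, e.g. "6x" */
        return 0;
    if (v < lo || v > hi)         /* validate the usable range */
        return 0;

    *out = (int)v;
    return 1;
}
```

A caller would read a line into a buffer and then do something like: if (!parse_int_in_range(buf, 1, MAXPMTS, &number_of_payments)) report the error and stop.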
Lastly, while you are free to initialize ipoint = &interest[0];, it is not necessary. When used in an expression, an array is converted to a pointer to its first element (with a few exceptions such as sizeof and unary &), so ipoint = interest; is all that is required. There are other issues addressed in the comments below, but putting it all together, you could do something like the following to ensure behavior is defined throughout your code:
#include <stdio.h>
#include <string.h>
#define MAXPMTS 32 /* if you need a constant, define one (or more) */
double power (double a, double b);
int main (void)
{
int loanAmount = 0, /* initialize all variables */
number_of_payments = 0,
i = 0;
double interestRate = 0.0,
monthlyInterestRate = 0.0,
monthlyPayment = 0.0;
/* numeric conversions consume leading whitespace (as does %s)
* the ' ' in the conversion doesn't hurt, but isn't required.
*/
printf ("Enter amount of loan : $ ");
if (scanf (" %i", &loanAmount) != 1) { /* validate conversion */
fprintf (stderr, "error: invalid loan amount.\n");
return 1;
}
printf ("Enter Interest rate per year : ");
if (scanf (" %lf", &interestRate) != 1) {
fprintf (stderr, "error: invalid interest rate.\n");
return 1;
}
printf ("Enter number of payments : ");
if (scanf (" %i", &number_of_payments) != 1) {
fprintf (stderr, "error: invalid no. of payments.\n");
return 1;
}
/* validate no. pmts in range */
if (number_of_payments < 1 || number_of_payments > MAXPMTS) {
fprintf (stderr, "error: no. of payments exceed MAXPMTS (%d)\n",
MAXPMTS);
return 1;
}
monthlyInterestRate = ((interestRate / 100) / 12); //AKA 'r' or rate.
monthlyPayment = (monthlyInterestRate) * (loanAmount/(1 - 1/(power((1 +
monthlyInterestRate), number_of_payments))));
/* element 0 holds the starting balance, payments use 1..MAXPMTS */
double interest[MAXPMTS + 1] = {0};
double principal[MAXPMTS + 1] = {0};
double balance[MAXPMTS + 1] = {0};
balance[0] = loanAmount;
double *ipoint = NULL,
*ppoint = NULL,
*bpoint = NULL,
*bpointprev = NULL;
ipoint = interest;
ppoint = principal;
bpoint = balance;
bpointprev = bpoint;
printf ("Monthly payment should be $ %lf\n", monthlyPayment);
printf ("# \t Payment \t Principal \t Interest \t Balance \n");
/* standard loop is from 0 to i < number_of_payments */
for (i = 0; i < number_of_payments; i++) {
ipoint += 1;
bpoint += 1;
ppoint += 1;
*ipoint = *bpointprev * monthlyInterestRate;
*ppoint = monthlyPayment - *ipoint;
*bpoint = *bpointprev - *ppoint;
/* adjust 'i + 1' for payment no. output */
printf ("%i \t %.2f \t %.2f \t %.2f \t\t %.2f\n",
i + 1, monthlyPayment, *ppoint, *ipoint, *bpoint);
bpointprev += 1; //Iterates after logic for next calculation.
}
return 0;
}
double power(double a, double b)
{
double i, sum = 1;
for (i = 0; i < b; i++) {
sum = sum * a;
}
return sum;
}
(note: whether you loop from i=1 to i <= number_of_payments or from i=0 to i < number_of_payments is largely up to you, but it is standard for the loop variable to track the valid array indexes to protect the bounds of the array. As above, the output for the payment number is simply adjusted by i + 1 to produce the desired 1,2,3...)
(also note: in practice you want to avoid using floating-point math for currency. People get really upset when you lose money due to rounding errors. Integer math eliminates that problem)
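As a rough illustration of the integer approach, one month's interest can be computed entirely in cents. This is a sketch under stated assumptions, not a full amortization routine: the function name monthly_interest_cents is hypothetical, the annual rate is given in basis points (750 means 7.5%), results are rounded half up, and inputs are assumed non-negative and small enough not to overflow int64_t.

```c
#include <stdint.h>

/* Interest for one month, in cents, on a balance given in cents.
 * annual_rate_bp is the yearly rate in basis points (1 bp = 0.01%).
 * The result is rounded half up to the nearest cent. */
int64_t monthly_interest_cents(int64_t balance_cents, int64_t annual_rate_bp)
{
    const int64_t denom = 10000 * 12;  /* basis points per unit * months */
    return (balance_cents * annual_rate_bp + denom / 2) / denom;
}
```

For a $2000.00 balance at 7.5% this gives 1250 cents, i.e. $12.50, with no floating-point rounding involved.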
Look things over and let me know if you have further questions.
This is caused by the += i statements in your for loop; change them to ++. With += i, each pointer advances by i elements per iteration instead of by 1, so it quickly runs past the end of its array.
Modified for loop:
for (i = 1; i <= number_of_payments; i++)
{
ipoint ++; // not +=i
bpoint ++; // not +=i
ppoint ++; // not +=i
*ipoint = *bpointprev * monthlyInterestRate;
*ppoint = monthlyPayment - *ipoint;
*bpoint = *bpointprev - *ppoint;
printf("%i \t %.2f \t %.2f \t\t %.2f \t\t %.2f\n", i, monthlyPayment, *ppoint, *ipoint, *bpoint);
bpointprev ++; //Iterates after logic for next calculation. not +=i
}
because += i moves the pointers past the end of the arrays, so the writes land in other stack memory (possibly including the loop variable i) and cause a stack smashing error.
Output :-
Enter amount of loan : $ 2000
Enter Interest rate per year : 7.5
Enter number of payments : 6
Monthly payment should be $ 340.662858
# Payment Principal Interest Balance
1 340.66 328.16 12.50 1671.84
2 340.66 330.21 10.45 1341.62
3 340.66 332.28 8.39 1009.35
4 340.66 334.35 6.31 674.99
5 340.66 336.44 4.22 338.55
6 340.66 338.55 2.12 0.00
I have been trying to find divisors of potential factorial primes (numbers of the form n! ± 1), and because I recently bought a Skylake-X workstation I thought I could get some speedup using AVX512 instructions.
The algorithm is simple: the main step is to repeatedly take a modulo with respect to the same divisor, looping over a large range of n values. Here is the naïve approach written in C (P is a table of primes):
uint64_t factorial_naive(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i, residue;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2;
for (n=3; n <= nmax; n++){
residue *= n;
residue %= P[i];
// Lets check if we found factor
if (nmin <= n){
if( residue == 1){
report_factor(n, -1, P[i]);
}
if(residue == P[i]- 1){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
Here the idea is to check a large range of n, e.g. 1,000,000 -> 10,000,000, against the same set of divisors, so we take the modulo with respect to the same divisor several million times. Using DIV is very slow, so there are several possible approaches depending on the range of the calculations. In my case n is most likely less than 10^7 and the potential divisor p is less than 10,000 G (< 10^13), so both numbers fit in 64 bits and even in the 53-bit mantissa of a double, but the product of the maximum residue (p-1) and n can be about 10^20, which exceeds 64 bits (2^64 is about 1.8 * 10^19). So I thought that the simplest version of the Montgomery method would not work, because we are taking the modulo of a number that is larger than 64 bits.
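For checking faster variants against a slow but overflow-free baseline, the product can be formed in 128 bits. The sketch below is not from the question: it assumes a compiler that provides unsigned __int128 (a GCC and Clang extension), and the names mulmod_u128 and factorial_mod are hypothetical.

```c
#include <stdint.h>

/* (a * b) mod m computed without overflow by widening to 128 bits.
 * unsigned __int128 is a GCC/Clang extension, not standard C. */
uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);
}

/* n! mod p by repeated modular multiplication (reference version). */
uint64_t factorial_mod(uint64_t n, uint64_t p)
{
    uint64_t residue = 1 % p;
    for (uint64_t k = 2; k <= n; k++)
        residue = mulmod_u128(residue, k, p);
    return residue;
}
```

For example, factorial_mod(6, 7) is 6, which is p - 1, so 7 divides 6! + 1 = 721. The 128-bit modulo is slow, so this is only meant as a correctness reference.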
I found some old code for PowerPC where FMA was used to get an accurate product of up to 106 bits (I guess) when using doubles, so I converted this approach to AVX512 using Intel intrinsics. Here is a simple version of the FMA method. It is based on the work of Dekker (1971); "Dekker product" and "FMA version of TwoProduct" are useful search terms when looking for the rationale behind it. This approach has also been discussed in this forum (e.g. here).
int64_t factorial_FMA(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i;
double prime_double, prime_double_reciprocal, quotient, residue;
double nr, n_double, prime_times_quotient_high, prime_times_quotient_low;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2.0;
prime_double = (double)P[i];
prime_double_reciprocal = 1.0 / prime_double;
n_double = 3.0;
for (n=3; n <= nmax; n++){
nr = n_double * residue;
quotient = fma(nr, prime_double_reciprocal, rounding_constant);
quotient -= rounding_constant;
prime_times_quotient_high= prime_double * quotient;
prime_times_quotient_low = fma(prime_double, quotient, -prime_times_quotient_high);
residue = fma(residue, n, -prime_times_quotient_high) - prime_times_quotient_low;
if (residue < 0.0) residue += prime_double;
n_double += 1.0;
// Lets check if we found factor
if (nmin <= n){
if( residue == 1.0){
report_factor(n, -1, P[i]);
}
if(residue == prime_double - 1.0){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
Here I have used magic constant
static const double rounding_constant = 6755399441055744.0;
which is 2^51 + 2^52, a magic number for rounding doubles to integers.
I converted this to AVX512 (32 potential divisors per loop) and analyzed the result using IACA. It reported "Throughput Bottleneck: Backend" and that backend allocation was stalled due to unavailable allocation resources.
I am not very experienced with assembler, so my question is: is there anything I can do to speed this up and resolve this backend bottleneck?
The AVX512 code is below and can also be found on GitHub:
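The effect of the constant can be seen in isolation. In this sketch (the function name round_magic is hypothetical) adding 2^52 + 2^51 pushes the value into a range where doubles have a spacing of exactly 1, so the addition itself rounds to an integer, and subtracting the constant again recovers it. It assumes the default round-to-nearest mode and |x| well below 2^51.

```c
/* Round x to the nearest integer (ties to even) by adding and then
 * subtracting 2^52 + 2^51. In [2^52, 2^53) consecutive doubles are 1
 * apart, so the sum cannot keep a fractional part. */
double round_magic(double x)
{
    static const double magic = 6755399441055744.0;  /* 2^52 + 2^51 */
    /* volatile stops the compiler from simplifying (x + magic) - magic */
    volatile double t = x + magic;
    return t - magic;
}
```

Note the ties-to-even behavior: 2.5 rounds to 2.0, not 3.0.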
uint64_t factorial_AVX512_unrolled_four(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
// we are trying to find a factor for factorial numbers: n! +-1
// nmin is the minimum n we want to report and nmax is the maximum. P is a table of primes
// we process 32 primes in one loop.
// the naive version of the algorithm is in the function factorial_naive
// and a simple version of the FMA based approach is in the function factorial_simpleFMA
const double one_table[8] __attribute__ ((aligned(64))) ={1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
uint64_t n;
__m512d zero, rounding_const, one, n_double;
__m512i prime1, prime2, prime3, prime4;
__m512d residue1, residue2, residue3, residue4;
__m512d prime_double_reciprocal1, prime_double_reciprocal2, prime_double_reciprocal3, prime_double_reciprocal4;
__m512d quotient1, quotient2, quotient3, quotient4;
__m512d prime_times_quotient_high1, prime_times_quotient_high2, prime_times_quotient_high3, prime_times_quotient_high4;
__m512d prime_times_quotient_low1, prime_times_quotient_low2, prime_times_quotient_low3, prime_times_quotient_low4;
__m512d nr1, nr2, nr3, nr4;
__m512d prime_double1, prime_double2, prime_double3, prime_double4;
__m512d prime_minus_one1, prime_minus_one2, prime_minus_one3, prime_minus_one4;
__mmask8 negative_reminder_mask1, negative_reminder_mask2, negative_reminder_mask3, negative_reminder_mask4;
__mmask8 found_factor_mask11, found_factor_mask12, found_factor_mask13, found_factor_mask14;
__mmask8 found_factor_mask21, found_factor_mask22, found_factor_mask23, found_factor_mask24;
// load data and initialize variables for loop
rounding_const = _mm512_set1_pd(rounding_constant);
one = _mm512_load_pd(one_table);
zero = _mm512_setzero_pd ();
// load primes used to sieve
prime1 = _mm512_load_epi64((__m512i *) &P[0]);
prime2 = _mm512_load_epi64((__m512i *) &P[8]);
prime3 = _mm512_load_epi64((__m512i *) &P[16]);
prime4 = _mm512_load_epi64((__m512i *) &P[24]);
// convert primes to double
prime_double1 = _mm512_cvtepi64_pd (prime1); // vcvtqq2pd
prime_double2 = _mm512_cvtepi64_pd (prime2); // vcvtqq2pd
prime_double3 = _mm512_cvtepi64_pd (prime3); // vcvtqq2pd
prime_double4 = _mm512_cvtepi64_pd (prime4); // vcvtqq2pd
// calculates 1.0/ prime
prime_double_reciprocal1 = _mm512_div_pd(one, prime_double1);
prime_double_reciprocal2 = _mm512_div_pd(one, prime_double2);
prime_double_reciprocal3 = _mm512_div_pd(one, prime_double3);
prime_double_reciprocal4 = _mm512_div_pd(one, prime_double4);
// for comparison if we have found factors for n!+1
prime_minus_one1 = _mm512_sub_pd(prime_double1, one);
prime_minus_one2 = _mm512_sub_pd(prime_double2, one);
prime_minus_one3 = _mm512_sub_pd(prime_double3, one);
prime_minus_one4 = _mm512_sub_pd(prime_double4, one);
// residue init
residue1 = _mm512_set1_pd(2.0);
residue2 = _mm512_set1_pd(2.0);
residue3 = _mm512_set1_pd(2.0);
residue4 = _mm512_set1_pd(2.0);
// double counter init
n_double = _mm512_set1_pd(3.0);
// main loop starts here. typical value for nmax can be 5,000,000 -> 10,000,000
for (n=3; n<=nmax; n++) // main loop
{
// timings for instructions:
// _mm512_load_epi64 = vmovdqa64 : L 1, T 0.5
// _mm512_load_pd = vmovapd : L 1, T 0.5
// _mm512_set1_pd
// _mm512_div_pd = vdivpd : L 23, T 16
// _mm512_cvtepi64_pd = vcvtqq2pd : L 4, T 0.5
// _mm512_mul_pd = vmulpd : L 4, T 0.5
// _mm512_fmadd_pd = vfmadd132pd, vfmadd213pd, vfmadd231pd : L 4, T 0.5
// _mm512_fmsub_pd = vfmsub132pd, vfmsub213pd, vfmsub231pd : L 4, T 0.5
// _mm512_sub_pd = vsubpd : L 4, T 0.5
// _mm512_cmplt_pd_mask = vcmppd : L ?, T 1
// _mm512_mask_add_pd = vaddpd : L 4, T 0.5
// _mm512_cmpeq_pd_mask = vcmppd : L ?, T 1
// _mm512_kor = korw L 1, T 1
// nr = residue * n
nr1 = _mm512_mul_pd (residue1, n_double);
nr2 = _mm512_mul_pd (residue2, n_double);
nr3 = _mm512_mul_pd (residue3, n_double);
nr4 = _mm512_mul_pd (residue4, n_double);
// quotient = nr * 1.0/ prime_double + rounding_constant
quotient1 = _mm512_fmadd_pd(nr1, prime_double_reciprocal1, rounding_const);
quotient2 = _mm512_fmadd_pd(nr2, prime_double_reciprocal2, rounding_const);
quotient3 = _mm512_fmadd_pd(nr3, prime_double_reciprocal3, rounding_const);
quotient4 = _mm512_fmadd_pd(nr4, prime_double_reciprocal4, rounding_const);
// quotient -= rounding_constant, now quotient is rounded to integer
// quotient should be at maximum nmax (10,000,000)
quotient1 = _mm512_sub_pd(quotient1, rounding_const);
quotient2 = _mm512_sub_pd(quotient2, rounding_const);
quotient3 = _mm512_sub_pd(quotient3, rounding_const);
quotient4 = _mm512_sub_pd(quotient4, rounding_const);
// now we calculate high and low for prime * quotient using the Dekker product (FMA).
// quotient is calculated using an approximation but this is accurate for the given quotient
prime_times_quotient_high1 = _mm512_mul_pd(quotient1, prime_double1);
prime_times_quotient_high2 = _mm512_mul_pd(quotient2, prime_double2);
prime_times_quotient_high3 = _mm512_mul_pd(quotient3, prime_double3);
prime_times_quotient_high4 = _mm512_mul_pd(quotient4, prime_double4);
prime_times_quotient_low1 = _mm512_fmsub_pd(quotient1, prime_double1, prime_times_quotient_high1);
prime_times_quotient_low2 = _mm512_fmsub_pd(quotient2, prime_double2, prime_times_quotient_high2);
prime_times_quotient_low3 = _mm512_fmsub_pd(quotient3, prime_double3, prime_times_quotient_high3);
prime_times_quotient_low4 = _mm512_fmsub_pd(quotient4, prime_double4, prime_times_quotient_high4);
// now we calculate the new remainder using the Dekker product and the original values
// we subtract the above calculated prime * quotient (quotient is an approximation)
residue1 = _mm512_fmsub_pd(residue1, n_double, prime_times_quotient_high1);
residue2 = _mm512_fmsub_pd(residue2, n_double, prime_times_quotient_high2);
residue3 = _mm512_fmsub_pd(residue3, n_double, prime_times_quotient_high3);
residue4 = _mm512_fmsub_pd(residue4, n_double, prime_times_quotient_high4);
residue1 = _mm512_sub_pd(residue1, prime_times_quotient_low1);
residue2 = _mm512_sub_pd(residue2, prime_times_quotient_low2);
residue3 = _mm512_sub_pd(residue3, prime_times_quotient_low3);
residue4 = _mm512_sub_pd(residue4, prime_times_quotient_low4);
// let's check if remainder < 0
negative_reminder_mask1 = _mm512_cmplt_pd_mask(residue1,zero);
negative_reminder_mask2 = _mm512_cmplt_pd_mask(residue2,zero);
negative_reminder_mask3 = _mm512_cmplt_pd_mask(residue3,zero);
negative_reminder_mask4 = _mm512_cmplt_pd_mask(residue4,zero);
// we add prime back to the remainder using the mask if it was < 0
residue1 = _mm512_mask_add_pd(residue1, negative_reminder_mask1, residue1, prime_double1);
residue2 = _mm512_mask_add_pd(residue2, negative_reminder_mask2, residue2, prime_double2);
residue3 = _mm512_mask_add_pd(residue3, negative_reminder_mask3, residue3, prime_double3);
residue4 = _mm512_mask_add_pd(residue4, negative_reminder_mask4, residue4, prime_double4);
n_double = _mm512_add_pd(n_double,one);
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
{ // we find factor very rarely
double *residual_list1 = (double *) &residue1;
double *residual_list2 = (double *) &residue2;
double *residual_list3 = (double *) &residue3;
double *residual_list4 = (double *) &residue4;
double *prime_list1 = (double *) &prime_double1;
double *prime_list2 = (double *) &prime_double2;
double *prime_list3 = (double *) &prime_double3;
double *prime_list4 = (double *) &prime_double4;
for (int i=0; i <8; i++){
if( residual_list1[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list1[i]);
}
if( residual_list2[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list2[i]);
}
if( residual_list3[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list3[i]);
}
if( residual_list4[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list4[i]);
}
if(residual_list1[i] == (prime_list1[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list1[i]);
}
if(residual_list2[i] == (prime_list2[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list2[i]);
}
if(residual_list3[i] == (prime_list3[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list3[i]);
}
if(residual_list4[i] == (prime_list4[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list4[i]);
}
}
}
}
return EXIT_SUCCESS;
}
As a few commenters have suggested: a "backend" bottleneck is what you'd expect for this code. That suggests you're keeping things pretty well fed, which is what you want.
Looking at the report, there should be an opportunity in this section:
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
From the IACA analysis:
| 1 | 1.0 | | | | | | | | kmovw r11d, k0
| 1 | 1.0 | | | | | | | | kmovw eax, k1
| 1 | 1.0 | | | | | | | | kmovw ecx, k2
| 1 | 1.0 | | | | | | | | kmovw esi, k3
| 1 | 1.0 | | | | | | | | kmovw edi, k4
| 1 | 1.0 | | | | | | | | kmovw r8d, k5
| 1 | 1.0 | | | | | | | | kmovw r9d, k6
| 1 | 1.0 | | | | | | | | kmovw r10d, k7
| 1 | | 1.0 | | | | | | | or r11d, eax
| 1 | | | | | | | 1.0 | | or r11d, ecx
| 1 | | 1.0 | | | | | | | or r11d, esi
| 1 | | | | | | | 1.0 | | or r11d, edi
| 1 | | 1.0 | | | | | | | or r11d, r8d
| 1 | | | | | | | 1.0 | | or r11d, r9d
| 1* | | | | | | | | | or r11d, r10d
The processor is moving the resulting comparison masks (k0-k7) over to regular registers for the "or" operation. You should be able to eliminate those moves AND do the "or" rollup in 6 ops vs 8.
NOTE: the found_factor_mask variables are declared as __mmask8, which is the matching width for a 512-bit vector of doubles (8 lanes; __mmask16 would correspond to 16 single-precision floats), yet the compiler emits the 16-bit kmovw and does the ORs in general-purpose registers. Keeping the "or" in mask registers might let the compiler get at some optimizations. If not, drop to assembly as a commenter noted.
And related: what fraction of iterations fire this or-mask clause? As another commenter observed, you should be able to unroll this with an accumulating "or" operation. Check the accumulated "or" value at the end of each unrolled iteration (or after N iterations), and if it's "true", go back and re-do the values to figure out which n value triggered it.
(And you can binary search within the "roll" to find the matching n value; that might get some gain.)
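The accumulate-then-replay idea can be sketched in scalar form for a single prime. This is only an illustration under assumptions that are not in the original code: the names scan_blocked and BLOCK are hypothetical, the block length of 4 is arbitrary, and p and nmax are taken to be below 2^32 so that residue * n fits in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4  /* hypothetical number of iterations per accumulated check */

/* Find every n in [3, nmax] with n! = +1 or -1 (mod p).
 * The hot loop only ORs a flag; a block is replayed from its saved
 * starting residue only when the flag is set. Returns the hit count. */
size_t scan_blocked(uint64_t p, uint64_t nmax, uint64_t *hits, size_t max_hits)
{
    size_t found = 0;
    uint64_t residue = 2 % p;   /* 2! mod p */
    uint64_t n = 3;

    while (n <= nmax) {
        uint64_t saved = residue, first = n;
        uint64_t last = (nmax - n < BLOCK - 1) ? nmax : n + BLOCK - 1;
        unsigned acc = 0;

        for (; n <= last; n++) {            /* hot loop: no per-hit branch */
            residue = residue * n % p;
            acc |= (unsigned)((residue == 1) | (residue == p - 1));
        }
        if (!acc)
            continue;                       /* common case: nothing found */

        uint64_t r = saved;                 /* rare case: replay the block */
        for (uint64_t k = first; k <= last; k++) {
            r = r * k % p;
            if ((r == 1 || r == p - 1) && found < max_hits)
                hits[found++] = k;
        }
    }
    return found;
}
```

The replay costs one extra pass over a block, which is cheap as long as hits are rare. In the vector version the saved state would be the four residue vectors and n_double.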
Next, you should be able to get rid of this mid-loop check:
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
Which shows up here:
| 1* | | | | | | | | | cmp r14, 0x3e8
| 0*F | | | | | | | | | jb 0x229
It may not be a huge gain since the predictor will (probably) get this one (mostly) right, but you should get some gains by having two distinct loops for two "phases":
n=3 to n=nmin-1
n=nmin and beyond
Even if you gain a cycle, that's 3%. And since that's generally related to the big 'or' operation, above, there may be more cleverness in there to be found.
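In scalar form, the two-phase split might look like the sketch below. It is an illustration only: the name count_hits_two_phase is hypothetical, it handles a single odd prime p, and it assumes p and nmax are below 2^32 so that residue * n fits in 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the n in [nmin, nmax] for which p divides n! - 1 or n! + 1.
 * Phase 1 only advances the residue; phase 2 also tests for factors,
 * so neither loop needs an `if (n < nmin) continue;` check. */
size_t count_hits_two_phase(uint64_t p, uint64_t nmin, uint64_t nmax)
{
    uint64_t residue = 2 % p;   /* 2! mod p */
    uint64_t n = 3;
    size_t hits = 0;

    /* Phase 1: n = 3 .. nmin-1, warm-up without any reporting */
    for (; n < nmin && n <= nmax; n++)
        residue = residue * n % p;

    /* Phase 2: n = nmin .. nmax, same update plus the factor tests */
    for (; n <= nmax; n++) {
        residue = residue * n % p;
        hits += (size_t)(residue == 1) + (size_t)(residue == p - 1);
    }
    return hits;
}
```

In the vector code the same split would apply to the main loop, with the mask comparisons appearing only in the second loop.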
I've been working on a program for my Algorithm Analysis class where I have to solve the Knapsack problem with Brute Force, greedy, dynamic, and branch and bound strategies. Everything works perfectly when I run it in Visual Studio 2012, but if I compile with gcc and run it on the command line, I get a different result:
Visual Studio:
+-------------------------------------------------------------------------------+
| Number of | Processing time in seconds / Maximum benefit value |
| +---------------+---------------+---------------+---------------+
| items | Brute force | Greedy | D.P. | B. & B. |
+---------------+---------------+---------------+---------------+---------------+
| 10 + 0 / 1290 + 0 / 1328 + 0 / 1290 + 0 / 1290 |
+---------------+---------------+---------------+---------------+---------------+
| 20 + 0 / 3286 + 0 / 3295 + 0 / 3200 + 0 / 3286 |
+---------------+---------------+---------------+---------------+---------------+
cmd:
+-------------------------------------------------------------------------------+
| Number of | Processing time in seconds / Maximum benefit value |
| +---------------+---------------+---------------+---------------+
| items | Brute force | Greedy | D.P. | B. & B. |
+---------------+---------------+---------------+---------------+---------------+
| 10 + 0 / 1290 + 0 / 1328 + 0 / 1599229779+ 0 / 1290 |
+---------------+---------------+---------------+---------------+---------------+
| 20 + 0 / 3286 + 0 / 3295 + 0 / 3200 + 0 / 3286 |
+---------------+---------------+---------------+---------------+---------------+
The same number always shows up, "1599229779." Notice that the output is only messed up the first time the Dynamic algorithm is run.
Here is my code:
typedef struct{
short value; //This is the value of the item
short weight; //This is the weight of the item
float ratio; //This is the ratio of value/weight
} itemType;
typedef struct{
time_t startingTime;
time_t endingTime;
int maxValue;
} result;
result solveWithDynamic(itemType items[], int itemsLength, int maxCapacity){
result answer;
int rowSize = 2;
int colSize = maxCapacity + 1;
int i, j; //used in loops
int otherColumn, thisColumn;
answer.startingTime = time(NULL);
int **table = (int**)malloc((sizeof *table) * rowSize);//[2][(MAX_ITEMS*WEIGHT_MULTIPLIER)];
for(i = 0; i < rowSize; i ++)
table[i] = (int*)malloc((sizeof *table[i]) * colSize);
table[0][0] = 0;
table[1][0] = 0;
for(i = 1; i < maxCapacity; i++) table[1][i] = 0;
for(i = 0; i < itemsLength; i++){
thisColumn = i%2;
otherColumn = (i+1)%2; //this is always the other column
for(j = 1; j < maxCapacity + 1; j++){
if(items[i].weight <= j){
if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
else
table[thisColumn][j] = table[otherColumn][j];
} else {
table[thisColumn][j] = table[thisColumn][j-1];
}//end if/else
}//end for
}//end for
answer.maxValue = table[thisColumn][maxCapacity];
answer.endingTime = time(NULL);
for(i = 0; i < rowSize; i ++)
free(table[i]);
free(table);
return answer;
}//end solveWithDynamic
Just a bit of explanation. I was having trouble with the memory consumption of this algorithm because I have to run it for a set of 10,000 items. I realized that I didn't need to store the whole table, because I only ever looked at the previous column. I actually figured out that you only need to store the current row and x+1 additional values, where x is the weight of the current itemType. It brought the memory required from (itemsLength+1) * (maxCapacity+1) elements to 2*(maxCapacity+1) and possibly (maxCapacity+1) + (x+1) (although I don't need to optimize it that much).
Also, I used printf("%d", answer.maxValue); in this function, and it still came out as "1599229779." Can anyone help me figure out what is going on? Thanks.
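I can't be sure that this is what causes it, but with
for(i = 1; i < maxCapacity; i++) table[1][i] = 0;
you leave table[1][maxCapacity] uninitialised (the loop stops at maxCapacity - 1), but then potentially read it in the first pass of the main loop, where otherColumn is 1: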
for(j = 1; j < maxCapacity + 1; j++){
if(items[i].weight <= j){
if(items[i].value + table[otherColumn][j-items[i].weight] > table[otherColumn][j])
table[thisColumn][j] = items[i].value + table[otherColumn][j-items[i].weight];
else
table[thisColumn][j] = table[otherColumn][j];
} else {
table[thisColumn][j] = table[thisColumn][j-1];
}//end if/else
}//end for
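If that memory always happens to be zero with Visual Studio, but nonzero with gcc, that could explain the difference. Changing the loop condition to i <= maxCapacity (or allocating the rows with calloc) initialises the whole row.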
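One way to rule out uninitialised cells altogether is to let calloc zero the table. The sketch below is not the asker's two-row version: it swaps in the common single-row formulation, which walks the capacities downward so each item is used at most once. The function name and the separate values/weights arrays are hypothetical.

```c
#include <stdlib.h>

/* 0/1 knapsack: best total value for the given capacity.
 * Uses one zero-initialised row of capacity + 1 ints (calloc), so no
 * cell is ever read before it has a defined value.
 * Returns -1 on invalid capacity or allocation failure. */
int knapsack_max_value(const short *values, const short *weights,
                       int n, int capacity)
{
    if (capacity < 0)
        return -1;

    int *best = calloc((size_t)capacity + 1, sizeof *best);
    if (best == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        if (weights[i] < 0)
            continue;                       /* skip nonsensical items */
        /* go downward so best[j - w] still excludes item i */
        for (int j = capacity; j >= weights[i]; j--) {
            int with_item = best[j - weights[i]] + values[i];
            if (with_item > best[j])
                best[j] = with_item;
        }
    }

    int result = best[capacity];
    free(best);
    return result;
}
```

This also brings the memory down to a single row of maxCapacity + 1 elements.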