Marking Stack Variables for Garbage Collection

I'm trying to learn how to implement a simple mark-and-sweep garbage collection algorithm. I'm learning by looking at the tgc library.
I'm figuring out how to iterate through the stack to mark reachable heap allocated variables. This is done in the following lines in tgc.c:
static void tgc_mark_stack(tgc_t *gc) {
void *stk, *bot, *top, *p;
bot = gc->bottom; top = &stk;
if (bot == top) { return; }
if (bot < top) {
for (p = top; p >= bot; p = ((char*)p) - sizeof(void*)) {
tgc_mark_ptr(gc, *((void**)p));
}
}
if (bot > top) {
for (p = top; p <= bot; p = ((char*)p) + sizeof(void*)) {
tgc_mark_ptr(gc, *((void**)p));
}
}
}
How does p = ((char*)p) - sizeof(void*) and p = ((char*)p) + sizeof(void*) not cause pointer p to overshoot or undershoot pointing at the correct address for the stack variable? As an example, my intuition is that we can possibly have something like this which wouldn't yield the correct heap address of char *:
void *bottom -> ----------- High address/bottom of stack
| |
| char a | 1 byte
| |
-----------
| |
(char*)p + sizeof(void*) -> | char *str | 8 bytes
| |
-----------
| |
| int x | 4 bytes
| |
void *p = void *top -> -----------
| |
| void *stk | 8 bytes
| |
----------- Low address/top of stack
But I've tested this out, and the loop seems to work regardless of the stack layout. I figured that we'd do something like p = ((char*)p) - sizeof(char) which also seems to work but is a bit slower. Why does the method used in the above code work?
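One partial way to look at this, offered as a sketch and not as a statement about tgc's internals: the compiler stores every pointer object at an address that is a multiple of the pointer's alignment requirement, inserting padding after smaller variables where needed, so a layout like the one in the diagram does not normally arise. The example below assumes that `alignof(void *)` equals `sizeof(void *)`, which holds on common x86-64 and AArch64 ABIs but is not guaranteed by the C standard. The helper names `is_word_aligned` and `pointer_locals_are_aligned` are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of the alignment of void*. */
static int is_word_aligned(const void *addr)
{
    assert(addr != NULL);
    return (uintptr_t)addr % alignof(void *) == 0;
}

/* Mirrors the locals from the diagram and reports whether the pointer-typed
   ones sit on word boundaries. Returns 1 when both do. */
static int pointer_locals_are_aligned(void)
{
    char a = 'x';
    char *str = &a;
    int x = 42;
    void *stk = &x;
    /* Taking the addresses forces these variables to live in memory. */
    return is_word_aligned(&str) && is_word_aligned(&stk);
}
```

If the scan starts at `&stk`, which is itself a pointer object and therefore aligned, and moves in steps of `sizeof(void*)`, it visits every address where a pointer could be stored. Words that hold non-pointer data are also visited; a conservative collector simply treats them as possible pointers.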

Related

why I keep getting Aborted (core dumped) in Kali VM?

I'm working on a project in C whose goal is to create multiple threads and work with them ...
The program works fine on macOS, but when I run the same code on a Kali VM or on WSL (Windows Subsystem for Linux), I get the following error.
The error on the Kali VM:
└─$ ./a.out 3 800 200 200 1200 134 ⨯
malloc(): corrupted top size
zsh: abort ./a.out 3 800 200 200 1200
The error on WSL
└─$ ./a.out 2 60 60 20
malloc(): corrupted top size
Aborted (core dumped)
you can check the full code here in this repo.
this is the main file of the code:
#include "philosophers.h"
int ft_error_put(char *messsage, int ret)
{
printf("%s\n", messsage);
return (ret);
}
int ft_parsing(char **av, t_simulation *simulation)
{
int num;
int i;
int j;
i = 1;
j = 0;
while (av[i])
{
j = 0;
num = 0;
while (av[i][j])
{
if (av[i][j] >= '0' && av[i][j] <= '9')
num = num * 10 + (av[i][j] - '0');
else
return (ft_error_put("Error: Number Only", 1));
j++;
}
if (i == 1)
{
simulation->philo_numbers = num;
simulation->forks = num;
simulation->threads = (pthread_t *)malloc(sizeof(pthread_t) * num);
}
else if (i == 2)
simulation->time_to_die = num;
else if (i == 3)
simulation->time_to_eat = num;
else if (i == 4)
simulation->time_to_sleep = num;
else if (i == 5)
simulation->eat_counter = num;
i++;
}
if (i == 5)
simulation->eat_counter = -1;
return (0);
}
void ft_for_each_philo(t_simulation *simulation, t_philo *philo, int i)
{
philo[i].index = i + 1;
philo[i].left_hand = i;
philo[i].right_hand = (i + 1) % simulation->philo_numbers;
philo[i].is_dead = NO;
if (simulation->eat_counter == -1)
philo[i].eat_counter = -1;
else
philo[i].eat_counter = simulation->eat_counter;
}
t_philo *ft_philo_init(t_simulation *simulation)
{
t_philo *philo;
int i;
i = -1;
philo = (t_philo *)malloc(sizeof(t_philo));
while (++i < simulation->philo_numbers)
ft_for_each_philo(simulation, philo, i);
return (philo);
}
void *ft_routine(void *arg)
{
t_philo *philo;
philo = (t_philo *)arg;
printf("thread number %d has started\n", philo->index);
sleep(1);
printf("thread number %d has ended\n", philo->index);
return (NULL);
}
int main(int ac, char **av)
{
int i;
t_simulation simulation;
t_philo *philo;
i = 0;
if (ac == 5 || ac == 6)
{
if (ft_parsing(av, &simulation))
return (1);
philo = ft_philo_init(&simulation);
while (i < simulation.philo_numbers)
{
simulation.philo_index = i;
pthread_create(simulation.threads + i, NULL,
ft_routine, philo + i);
i++;
}
i = 0;
while (i < simulation.philo_numbers)
{
pthread_join(simulation.threads[i], NULL);
i++;
}
}
return (0);
}

Your program was aborting on a pthread_create call.
But, the issue was a too short malloc call before that in ft_philo_init.
You were only allocating enough space for one t_philo struct instead of philo_numbers
Side note: Don't cast the return value of malloc. See: Do I cast the result of malloc?
Here is the corrected function:
t_philo *
ft_philo_init(t_simulation *simulation)
{
t_philo *philo;
int i;
i = -1;
// NOTE/BUG: not enough elements allocated
#if 0
philo = (t_philo *) malloc(sizeof(t_philo));
#else
philo = malloc(sizeof(*philo) * simulation->philo_numbers);
#endif
while (++i < simulation->philo_numbers)
ft_for_each_philo(simulation, philo, i);
return philo;
}
UPDATE:
thank you it's worked now, but can you explain why it work on macOS but it doesn't in kali? – DarkSide77
Well, it did not "work" on macOS either ...
When we index through an array and go beyond the bounds of the array, it is UB ("undefined behavior"). UB means just that: undefined behavior.
See:
Undefined, unspecified and implementation-defined behavior
Is accessing a global array outside its bound undefined behavior?
Anything could happen. That's because the philo array occupies a certain amount of memory. What is placed after that allocation? Let's assume philo is 8 bytes [or elements if you wish--it doesn't matter]:
| philo[8] | whatever |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | | | | | | | | | | | | | | | |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
As long as we stay within bounds, things are fine (e.g.):
for (i = 0; i < 8; ++i)
philo[i] = 23;
If we go beyond the end, we have UB (e.g.):
for (i = 0; i < 9; ++i)
philo[i] = 23;
Here we went one beyond and modified the first cell of whatever.
Depending upon what variable was placed there by the linker, several behaviors are possible:
The program seems to run normally.
A value at whatever is corrupted and the program runs but produces incorrect results.
whatever is aligned to a page that is write protected. The program will segfault on a protection exception immediately.
The value corrupted at whatever has no immediate effect, but later the program detects the corruption.
The corruption eventually causes a segfault because a pointer value was corrupted.
On both systems, your program was doing the same thing. For an area we get from malloc, the whatever is an internal struct used by the heap manager to keep track of the allocations. The program was corrupting this.
On macOS, the heap manager did not detect this. On linux (glibc), the heap manager did better cross checking and detected the corruption.
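As a minimal sketch of the allocation pattern recommended above, the following allocates one element per philosopher, checks the multiplication for overflow, and checks the result of malloc. The type `philo_t` and the function `philo_array_new` are hypothetical stand-ins, because the real `t_philo` definition lives in a header that is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the t_philo struct from the question. */
typedef struct {
    int index;
    int left_hand;
    int right_hand;
} philo_t;

/* Allocates and initializes count elements. Returns NULL on bad input or
   allocation failure. The caller frees the result. */
static philo_t *philo_array_new(int count)
{
    if (count <= 0 || (size_t)count > SIZE_MAX / sizeof(philo_t))
        return NULL;
    /* sizeof *philos follows the element type, so the size stays correct
       even if the type of philos changes later. */
    philo_t *philos = malloc(sizeof *philos * (size_t)count);
    if (philos == NULL)
        return NULL;
    for (int i = 0; i < count; i++) {
        philos[i].index = i + 1;
        philos[i].left_hand = i;
        philos[i].right_hand = (i + 1) % count;
    }
    /* Sanity check: the last philosopher's right fork wraps around to 0. */
    assert(philos[count - 1].right_hand == 0);
    return philos;
}
```

Building with a memory checker such as AddressSanitizer or running under Valgrind tends to report this kind of heap overrun on any platform, including ones where the program appears to work.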

Segmentation fault when assigning array to array in structure

I get segmentation fault from this code. It's a function for vertical flipping of a BMP image. I am trying to assign an array to another array in a structure. I would appreciate your help.
struct bmp_image *flip_horizontally(const struct bmp_image *image) {
if (image == NULL)
{
return NULL;
}
struct bmp_image *transformed = NULL;
transformed = (struct bmp_image *)calloc(1, sizeof(image->data) + sizeof(image->header));
transformed->header = image->header;
struct pixel data_array[image->header->height][image->header->width];
int index = 0;
for (int i = 0; i < image->header->height; i++)
{
for (int j = (int)image->header->width - 1; j >= 0; j--)
{
data_array[i][j].blue = image->data[index].blue;
data_array[i][j].green = image->data[index].green;
data_array[i][j].red = image->data[index].red;
index++;
}
}
struct pixel *transformed_data = calloc((image->header->width * image->header->height), sizeof(struct pixel));
index = 0;
// I think that the code after this line causes the segmentation fault
for (int i = 0; i < image->header->height; i++)
{
for (int j = 0; j < image->header->width; j++)
{
transformed_data[index].blue = data_array[i][j].blue;
transformed_data[index].green = data_array[i][j].green;
transformed_data[index].red = data_array[i][j].red;
index++;
}
}
transformed->data = transformed_data;
return transformed;
}

In my opinion, I would create one array per row of pixels in the BMP image, plus an array of pointers, each pointing to one of the successive rows of the image. Flipping the rows is then just a matter of walking in from both ends of the pointer array with two indices and swapping the pointers until the indices meet in the middle of the array. Then encode the array again, and you will have done the flip without having to copy all the pixels up and down.
+-------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----+
| P[0] =======>| Pxl | Pxl | ... | | | | | | |
+-------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----+
| P[1] =======>| | | | | | | | | |
+-------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----+
...
+-------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----+
| P[N] =======>| | | | | | | | | |
+-------+ +-----+-----+-----+-----+-----+-----+-----+-----+-----+
As you have not posted complete code, it's impossible for me to tweak your code to make it simpler and show a solution by modifying it. But something like (beware, this code is not tested):
void bitimage_swap(struct pixel **to_swap, size_t nrows)
{
struct pixel **start = to_swap, **end = to_swap + nrows;
while (start < --end) {
struct pixel *temp = *start;
*start++ = *end;
*end = temp;
}
}
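To show how the row table might be built and used, here is a small sketch. The `struct pixel` layout is an assumption, since the question does not show its definition, and `make_row_table` is a hypothetical helper. The swap routine is repeated so the block compiles on its own, with one added guard: for fewer than two rows it returns early, which avoids forming a pointer before the start of the array when `nrows` is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed pixel layout; the real definition is not shown in the question. */
struct pixel {
    uint8_t blue, green, red;
};

/* Reverses the order of the row pointers in place. */
static void bitimage_swap(struct pixel **to_swap, size_t nrows)
{
    if (nrows < 2)
        return; /* nothing to swap */
    struct pixel **start = to_swap, **end = to_swap + nrows;
    while (start < --end) {
        struct pixel *temp = *start;
        *start++ = *end;
        *end = temp;
    }
}

/* Hypothetical helper: builds one pointer per row into a flat, row-major
   pixel buffer. Returns NULL on allocation failure; the caller frees it. */
static struct pixel **make_row_table(struct pixel *data, size_t width, size_t height)
{
    assert(data != NULL);
    struct pixel **rows = malloc(height * sizeof *rows);
    if (rows == NULL)
        return NULL;
    for (size_t r = 0; r < height; r++)
        rows[r] = data + r * width;
    return rows;
}
```

After `bitimage_swap(rows, height)`, reading the image through `rows[0]`, `rows[1]`, and so on yields the rows in reversed order, while the pixel data itself has not moved.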

Varying string variable in an if condition

I used this program to take input mm as the month of the year and print out the name of the month:
#include <stdio.h>
#include <string.h>
int main(){
int mm;
printf("input month ");
scanf("%d", &mm);
char mname[9];
if (mm == 1) {mname = "January";}
if (mm == 2) {mname = "February";}
if (mm == 3) {mname = "March";}
if (mm == 4) {mname = "April";}
if (mm == 5) {mname = "May";}
if (mm == 6) {mname = "June";}
if (mm == 7) {mname = "July";}
if (mm == 8) {mname = "August";}
if (mm == 9) {mname = "September";}
if (mm == 10) {mname = "October";}
if (mm == 11) {mname = "November";}
if (mm == 12) {mname = "December";}
printf("%d is month %s", mm, mname);
return 0;
}
It gave the error "assignment to expression with array type". Please help.

Taking Michael Walz's two great comments and adding them as an answer:
#include <stdio.h>
#include <string.h>
int main(void)
{
int mm = 0;
printf("Please enter a month number [1-12]:\n");
scanf("%d", &mm);
static const char* months[] = { "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" };
if (mm >= 1 && mm <= 12)
{
printf("%d is month %s", mm, months[mm - 1]);
}
else
{
printf("You have entered an invalid month number %d\n", mm);
}
}
Validity check was done (mentioned in above comments).
Hope it helps.
Cheers,
Guy.

Basically there are two different ways to think about / talk about strings:
An array of characters, terminated by a '\0' character. (This is the formal definition of a string in C.)
As a pointer to character, or char *, pointing at the first of a sequence (an array) of characters, terminated by a '\0' character.
So you can declare an array, and copy a string into it:
char arraystring[10];
strcpy(arraystring, "Hello");
Or you can declare an array, and give it an initial value when you declare it:
char arraystring2[] = "world!";
Or you can declare a pointer, and make it point to a string:
char *pointerstring;
pointerstring = "Goodbye";
Or you can declare a pointer, and give it an initial value:
char *pointerstring2 = "for now";
It's worth knowing how these "look" in memory:
+---+---+---+---+---+---+---+---+---+---+
arraystring: | H | e | l | l | o |\0 |\0 |\0 |\0 |\0 |
+---+---+---+---+---+---+---+---+---+---+
+---+---+---+---+---+---+---+
arraystring2: | w | o | r | l | d | ! |\0 |
+---+---+---+---+---+---+---+
+---------------+
pointerstring: | * |
+-------|-------+
| +---+---+---+---+---+---+---+---+
+---------> | G | o | o | d | b | y | e |\0 |
+---+---+---+---+---+---+---+---+
+---------------+
pointerstring2: | * |
+-------|-------+
| +---+---+---+---+---+---+---+---+
+---------> | f | o | r | | n | o | w |\0 |
+---+---+---+---+---+---+---+---+
Now, the thing is, you can't assign arrays in C. You can assign pointers. You can also make use of the special rule (the "equivalence between arrays and pointers") by which when you use an array in an expression, what you get is a pointer to the array's first element.
So if you want to assign one string-as-pointer to another string-as-pointer, that works:
pointerstring = pointerstring2;
If you try to assign one string-as-array to another string-as-array, that doesn't work
arraystring = arraystring2; /* WRONG -- compiler complains, attempt to assign array */
If you want to copy one string-as-array to another, you have to call strcpy (and of course you have to worry about overflow):
strcpy(arraystring, arraystring2);
You can also assign a string-as-array to a string-as-pointer:
pointerstring = arraystring;
This works because the compiler treats it exactly as if you'd written
pointerstring = &arraystring[0];
Finally, if you attempt to assign a string-as-pointer to a string-as-array, this doesn't work, again because you can't assign to an array:
arraystring = pointerstring; /* WRONG */
Again, you could call strcpy instead, as long as you're sure the string will fit:
strcpy(arraystring, pointerstring); /* but worry about overflow */
In your original code, mname was a string-as-array, so you can't assign to it. You have two choices:
Use strcpy to copy strings into it:
if (mm == 1) { strcpy(mname, "January"); }
Declare mname as a pointer instead:
char *mname;
...
if (mm == 1) { mname = "January"; }
Addendum: For completeness, I should mention one more set of points.
When you initialize a pointer to point to a string, in either of these ways:
char *pointerstring = "Goodbye";
char * pointerstring2;
pointerstring2 = "for now";
those strings "Goodbye" and "for now" are read-only. You can't modify them. So if you try to do something like
strcpy(pointerstring, pointerstring2); /* WRONG: overwriting constant string */
it won't work, because you're trying to copy the second string into the memory where the first string is stored, but that memory isn't writable.
So when you're using arrays, you can't use assignment, you must use strcpy; but when you're using pointers, you can use assignment, and you probably can't call strcpy.
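Since the overflow concern comes up several times above, here is a minimal sketch of a bounded copy. The helper name `copy_bounded` is hypothetical; it relies only on the standard `snprintf`. It matters for the original program: "September" has nine letters and needs ten bytes including the terminator, so it does not fit in `char mname[9]`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copies src into dst, whose capacity is dst_size bytes
   including the terminator. Returns 1 if the whole string fit, 0 if it was
   truncated or dst_size is 0. */
static int copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return 0;
    /* snprintf never writes more than dst_size bytes and always terminates;
       its return value is the length the full string would have had. */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With a nine-byte buffer, `copy_bounded(mname, sizeof mname, "September")` returns 0 and leaves a truncated string in the buffer, whereas a plain `strcpy` would write past the end of the array. Declaring the array with at least ten bytes avoids the truncation.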

Basically, an array is not an assignable object: in most expressions its name behaves like a constant pointer to its first element, so when you try to assign a new value to mname the compiler reports an error.
You could use function strcpy as in the following example to solve the problem:
if (mm == 1) {
strcpy(mname, "January");
}

Fast AVX512 modulo when same divisor

I have been trying to find divisors of potential factorial primes (numbers of the form n!+-1), and because I recently bought a Skylake-X workstation I thought I could get some speedup using AVX512 instructions.
The algorithm is simple, and the main step is to take a modulo repeatedly with respect to the same divisor. The main thing is to loop over a large range of n values. Here is the naïve approach written in C (P is a table of primes):
uint64_t factorial_naive(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i, residue;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2;
for (n=3; n <= nmax; n++){
residue *= n;
residue %= P[i];
// Lets check if we found factor
if (nmin <= n){
if( residue == 1){
report_factor(n, -1, P[i]);
}
if(residue == P[i]- 1){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
Here the idea is to check a large range of n, e.g. 1,000,000 -> 10,000,000, against the same set of divisors, so we take the modulo with respect to the same divisor several million times. Using DIV is very slow, so there are several possible approaches depending on the range of the calculations. In my case n is most likely less than 10^7 and a potential divisor p is less than 10,000 G (< 10^13), so the numbers fit in 64 bits and even in 53 bits, but the product of the maximum residue (p-1) and n is larger than 64 bits. So I thought that the simplest version of the Montgomery method doesn't work, because we are taking the modulo of a number that is larger than 64 bits.
I found some old code for PowerPC where FMA was used to get an accurate product of up to 106 bits (I guess) when using doubles. So I converted this approach to AVX512 (Intel intrinsics). Here is a simple version of the FMA method. It is based on the work of Dekker (1971); "Dekker product" and "FMA version of TwoProduct" are useful search terms when looking for the rationale behind it. This approach has also been discussed in this forum (e.g. here).
int64_t factorial_FMA(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
uint64_t n, i;
double prime_double, prime_double_reciprocal, quotient, residue;
double nr, n_double, prime_times_quotient_high, prime_times_quotient_low;
for (i = 0; i < APP_BUFLEN; i++){
residue = 2.0;
prime_double = (double)P[i];
prime_double_reciprocal = 1.0 / prime_double;
n_double = 3.0;
for (n=3; n <= nmax; n++){
nr = n_double * residue;
quotient = fma(nr, prime_double_reciprocal, rounding_constant);
quotient -= rounding_constant;
prime_times_quotient_high= prime_double * quotient;
prime_times_quotient_low = fma(prime_double, quotient, -prime_times_quotient_high);
residue = fma(residue, n, -prime_times_quotient_high) - prime_times_quotient_low;
if (residue < 0.0) residue += prime_double;
n_double += 1.0;
// Lets check if we found factor
if (nmin <= n){
if( residue == 1.0){
report_factor(n, -1, P[i]);
}
if(residue == prime_double - 1.0){
report_factor(n, 1, P[i]);
}
}
}
}
return EXIT_SUCCESS;
}
Here I have used magic constant
static const double rounding_constant = 6755399441055744.0;
that is, 2^51 + 2^52, the magic rounding number for doubles.
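For readers unfamiliar with this trick, here is a small scalar sketch of what the constant does. It assumes IEEE 754 doubles, the default round-to-nearest-even mode, and an input smaller than 2^51 in magnitude; the names `ROUNDING_MAGIC` and `round_with_magic` are hypothetical. Adding the constant moves the value into the range where doubles have a spacing of exactly 1, so the fraction is rounded away by the addition itself, and subtracting the constant brings the rounded value back.

```c
#include <assert.h>

/* 2^52 + 2^51. Doubles between 2^52 and 2^53 are spaced exactly 1 apart. */
static const double ROUNDING_MAGIC = 6755399441055744.0;

/* Rounds x to the nearest integer (ties go to the even integer) using only
   an add and a subtract. Valid for |x| < 2^51. */
static double round_with_magic(double x)
{
    assert(x > -2251799813685248.0 && x < 2251799813685248.0);
    /* volatile keeps the compiler from simplifying (x + C) - C back to x. */
    volatile double shifted = x + ROUNDING_MAGIC;
    return shifted - ROUNDING_MAGIC;
}
```

In the question's code the same two steps appear as the `fma(..., rounding_constant)` followed by `quotient -= rounding_constant`, which rounds the approximate quotient to an integer without a separate rounding instruction.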
I converted this to AVX512 (32 potential divisors per loop) and analyzed the result using IACA. It reported "Throughput Bottleneck: Backend" and that backend allocation was stalled due to unavailable allocation resources.
I am not very experienced with assembler, so my question is: is there anything I can do to speed this up and resolve this backend bottleneck?
The AVX512 code is here and can also be found on GitHub:
uint64_t factorial_AVX512_unrolled_four(uint64_t const nmin, uint64_t const nmax, const uint64_t *restrict P)
{
// we are trying to find a factor for a factorial numbers : n! +-1
//nmin is minimum n we want to report and nmax is maximum. P is table of primes
// we process 32 primes in one loop.
// naive version of the algorithm is in the function factorial_naive
// and simple version of the FMA based approach in the function factorial_simpleFMA
const double one_table[8] __attribute__ ((aligned(64))) ={1.0, 1.0, 1.0,1.0,1.0,1.0,1.0,1.0};
uint64_t n;
__m512d zero, rounding_const, one, n_double;
__m512i prime1, prime2, prime3, prime4;
__m512d residue1, residue2, residue3, residue4;
__m512d prime_double_reciprocal1, prime_double_reciprocal2, prime_double_reciprocal3, prime_double_reciprocal4;
__m512d quotient1, quotient2, quotient3, quotient4;
__m512d prime_times_quotient_high1, prime_times_quotient_high2, prime_times_quotient_high3, prime_times_quotient_high4;
__m512d prime_times_quotient_low1, prime_times_quotient_low2, prime_times_quotient_low3, prime_times_quotient_low4;
__m512d nr1, nr2, nr3, nr4;
__m512d prime_double1, prime_double2, prime_double3, prime_double4;
__m512d prime_minus_one1, prime_minus_one2, prime_minus_one3, prime_minus_one4;
__mmask8 negative_reminder_mask1, negative_reminder_mask2, negative_reminder_mask3, negative_reminder_mask4;
__mmask8 found_factor_mask11, found_factor_mask12, found_factor_mask13, found_factor_mask14;
__mmask8 found_factor_mask21, found_factor_mask22, found_factor_mask23, found_factor_mask24;
// load data and initialize variables for loop
rounding_const = _mm512_set1_pd(rounding_constant);
one = _mm512_load_pd(one_table);
zero = _mm512_setzero_pd ();
// load primes used to sieve
prime1 = _mm512_load_epi64((__m512i *) &P[0]);
prime2 = _mm512_load_epi64((__m512i *) &P[8]);
prime3 = _mm512_load_epi64((__m512i *) &P[16]);
prime4 = _mm512_load_epi64((__m512i *) &P[24]);
// convert primes to double
prime_double1 = _mm512_cvtepi64_pd (prime1); // vcvtqq2pd
prime_double2 = _mm512_cvtepi64_pd (prime2); // vcvtqq2pd
prime_double3 = _mm512_cvtepi64_pd (prime3); // vcvtqq2pd
prime_double4 = _mm512_cvtepi64_pd (prime4); // vcvtqq2pd
// calculates 1.0/ prime
prime_double_reciprocal1 = _mm512_div_pd(one, prime_double1);
prime_double_reciprocal2 = _mm512_div_pd(one, prime_double2);
prime_double_reciprocal3 = _mm512_div_pd(one, prime_double3);
prime_double_reciprocal4 = _mm512_div_pd(one, prime_double4);
// for comparison if we have found factors for n!+1
prime_minus_one1 = _mm512_sub_pd(prime_double1, one);
prime_minus_one2 = _mm512_sub_pd(prime_double2, one);
prime_minus_one3 = _mm512_sub_pd(prime_double3, one);
prime_minus_one4 = _mm512_sub_pd(prime_double4, one);
// residue init
residue1 = _mm512_set1_pd(2.0);
residue2 = _mm512_set1_pd(2.0);
residue3 = _mm512_set1_pd(2.0);
residue4 = _mm512_set1_pd(2.0);
// double counter init
n_double = _mm512_set1_pd(3.0);
// main loop starts here. typical value for nmax can be 5,000,000 -> 10,000,000
for (n=3; n<=nmax; n++) // main loop
{
// timings for instructions:
// _mm512_load_epi64 = vmovdqa64 : L 1, T 0.5
// _mm512_load_pd = vmovapd : L 1, T 0.5
// _mm512_set1_pd
// _mm512_div_pd = vdivpd : L 23, T 16
// _mm512_cvtepi64_pd = vcvtqq2pd : L 4, T 0.5
// _mm512_mul_pd = vmulpd : L 4, T 0.5
// _mm512_fmadd_pd = vfmadd132pd, vfmadd213pd, vfmadd231pd : L 4, T 0.5
// _mm512_fmsub_pd = vfmsub132pd, vfmsub213pd, vfmsub231pd : L 4, T 0.5
// _mm512_sub_pd = vsubpd : L 4, T 0.5
// _mm512_cmplt_pd_mask = vcmppd : L ?, T 1
// _mm512_mask_add_pd = vaddpd : L 4, T 0.5
// _mm512_cmpeq_pd_mask = vcmppd : L ?, T 1
// _mm512_kor = korw L 1, T 1
// nr = residue * n
nr1 = _mm512_mul_pd (residue1, n_double);
nr2 = _mm512_mul_pd (residue2, n_double);
nr3 = _mm512_mul_pd (residue3, n_double);
nr4 = _mm512_mul_pd (residue4, n_double);
// quotient = nr * 1.0/ prime_double + rounding_constant
quotient1 = _mm512_fmadd_pd(nr1, prime_double_reciprocal1, rounding_const);
quotient2 = _mm512_fmadd_pd(nr2, prime_double_reciprocal2, rounding_const);
quotient3 = _mm512_fmadd_pd(nr3, prime_double_reciprocal3, rounding_const);
quotient4 = _mm512_fmadd_pd(nr4, prime_double_reciprocal4, rounding_const);
// quotient -= rounding_constant, now quotient is rounded to integer
// quotient should be at maximum nmax (10,000,000)
quotient1 = _mm512_sub_pd(quotient1, rounding_const);
quotient2 = _mm512_sub_pd(quotient2, rounding_const);
quotient3 = _mm512_sub_pd(quotient3, rounding_const);
quotient4 = _mm512_sub_pd(quotient4, rounding_const);
// now we calculate high and low for prime * quotient using Dekker product (FMA).
// quotient is calculated using approximation but this is accurate for given quotient
prime_times_quotient_high1 = _mm512_mul_pd(quotient1, prime_double1);
prime_times_quotient_high2 = _mm512_mul_pd(quotient2, prime_double2);
prime_times_quotient_high3 = _mm512_mul_pd(quotient3, prime_double3);
prime_times_quotient_high4 = _mm512_mul_pd(quotient4, prime_double4);
prime_times_quotient_low1 = _mm512_fmsub_pd(quotient1, prime_double1, prime_times_quotient_high1);
prime_times_quotient_low2 = _mm512_fmsub_pd(quotient2, prime_double2, prime_times_quotient_high2);
prime_times_quotient_low3 = _mm512_fmsub_pd(quotient3, prime_double3, prime_times_quotient_high3);
prime_times_quotient_low4 = _mm512_fmsub_pd(quotient4, prime_double4, prime_times_quotient_high4);
// now we calculate new remainder using Dekker product and using original values
// we subtract above calculated prime * quotient (quotient is approximation)
residue1 = _mm512_fmsub_pd(residue1, n_double, prime_times_quotient_high1);
residue2 = _mm512_fmsub_pd(residue2, n_double, prime_times_quotient_high2);
residue3 = _mm512_fmsub_pd(residue3, n_double, prime_times_quotient_high3);
residue4 = _mm512_fmsub_pd(residue4, n_double, prime_times_quotient_high4);
residue1 = _mm512_sub_pd(residue1, prime_times_quotient_low1);
residue2 = _mm512_sub_pd(residue2, prime_times_quotient_low2);
residue3 = _mm512_sub_pd(residue3, prime_times_quotient_low3);
residue4 = _mm512_sub_pd(residue4, prime_times_quotient_low4);
// let's check if remainder < 0
negative_reminder_mask1 = _mm512_cmplt_pd_mask(residue1,zero);
negative_reminder_mask2 = _mm512_cmplt_pd_mask(residue2,zero);
negative_reminder_mask3 = _mm512_cmplt_pd_mask(residue3,zero);
negative_reminder_mask4 = _mm512_cmplt_pd_mask(residue4,zero);
// we add prime back to remainder using mask if it was < 0
residue1 = _mm512_mask_add_pd(residue1, negative_reminder_mask1, residue1, prime_double1);
residue2 = _mm512_mask_add_pd(residue2, negative_reminder_mask2, residue2, prime_double2);
residue3 = _mm512_mask_add_pd(residue3, negative_reminder_mask3, residue3, prime_double3);
residue4 = _mm512_mask_add_pd(residue4, negative_reminder_mask4, residue4, prime_double4);
n_double = _mm512_add_pd(n_double,one);
// if we are below nmin then we continue next iteration
if (n < nmin) continue;
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
{ // we find factor very rarely
double *residual_list1 = (double *) &residue1;
double *residual_list2 = (double *) &residue2;
double *residual_list3 = (double *) &residue3;
double *residual_list4 = (double *) &residue4;
double *prime_list1 = (double *) &prime_double1;
double *prime_list2 = (double *) &prime_double2;
double *prime_list3 = (double *) &prime_double3;
double *prime_list4 = (double *) &prime_double4;
for (int i=0; i <8; i++){
if( residual_list1[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list1[i]);
}
if( residual_list2[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list2[i]);
}
if( residual_list3[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list3[i]);
}
if( residual_list4[i] == 1.0)
{
report_factor((uint64_t) n, -1, (uint64_t) prime_list4[i]);
}
if(residual_list1[i] == (prime_list1[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list1[i]);
}
if(residual_list2[i] == (prime_list2[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list2[i]);
}
if(residual_list3[i] == (prime_list3[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list3[i]);
}
if(residual_list4[i] == (prime_list4[i] - 1.0))
{
report_factor((uint64_t) n, 1, (uint64_t) prime_list4[i]);
}
}
}
}
return EXIT_SUCCESS;
}

As a few commenters have suggested: a "backend" bottleneck is what you'd expect for this code. That suggests you're keeping things pretty well fed, which is what you want.
Looking at the report, there should be an opportunity in this section:
// Lets check if we found any factors, residue 1 == n!-1
found_factor_mask11 = _mm512_cmpeq_pd_mask(one, residue1);
found_factor_mask12 = _mm512_cmpeq_pd_mask(one, residue2);
found_factor_mask13 = _mm512_cmpeq_pd_mask(one, residue3);
found_factor_mask14 = _mm512_cmpeq_pd_mask(one, residue4);
// residue prime -1 == n!+1
found_factor_mask21 = _mm512_cmpeq_pd_mask(prime_minus_one1, residue1);
found_factor_mask22 = _mm512_cmpeq_pd_mask(prime_minus_one2, residue2);
found_factor_mask23 = _mm512_cmpeq_pd_mask(prime_minus_one3, residue3);
found_factor_mask24 = _mm512_cmpeq_pd_mask(prime_minus_one4, residue4);
if (found_factor_mask12 | found_factor_mask11 | found_factor_mask13 | found_factor_mask14 |
found_factor_mask21 | found_factor_mask22 | found_factor_mask23|found_factor_mask24)
From the IACA analysis:
| 1 | 1.0 | | | | | | | | kmovw r11d, k0
| 1 | 1.0 | | | | | | | | kmovw eax, k1
| 1 | 1.0 | | | | | | | | kmovw ecx, k2
| 1 | 1.0 | | | | | | | | kmovw esi, k3
| 1 | 1.0 | | | | | | | | kmovw edi, k4
| 1 | 1.0 | | | | | | | | kmovw r8d, k5
| 1 | 1.0 | | | | | | | | kmovw r9d, k6
| 1 | 1.0 | | | | | | | | kmovw r10d, k7
| 1 | | 1.0 | | | | | | | or r11d, eax
| 1 | | | | | | | 1.0 | | or r11d, ecx
| 1 | | 1.0 | | | | | | | or r11d, esi
| 1 | | | | | | | 1.0 | | or r11d, edi
| 1 | | 1.0 | | | | | | | or r11d, r8d
| 1 | | | | | | | 1.0 | | or r11d, r9d
| 1* | | | | | | | | | or r11d, r10d
The processor is moving the resulting comparison masks (k0-k7) over to regular registers for the "or" operation. You should be able to eliminate those moves, AND, do the "or" rollup in 6ops vs 8.
NOTE: the found_factor_mask variables are declared as __mmask8, which is the right type here: a 512-bit vector holds 8 doubles, and _mm512_cmpeq_pd_mask returns an __mmask8. If the compiler still moves the masks to general-purpose registers for the "or", drop to assembly as a commenter noted.
And related: what fraction of iterations fire this or-mask clause? As another commenter observed, you should be able to unroll this with an accumulating "or" operation. Check the accumulated "or" value at the end of each unrolled iteration (or after N iterations), and if it's "true", go back and re-do the values to figure out which n value triggered it.
(And, you can binary search within the "roll" to find the matching n value -- that might get some gain).
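The accumulate-then-replay idea can be sketched in scalar C as follows. This is an illustration of the control flow only, not the AVX512 version: it uses the plain `%` operator from the naive function, counts hits instead of calling `report_factor`, and assumes p > 2 and that (p - 1) * nmax fits in 64 bits. The names `BLOCK` and `count_hits_blocked` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK 4 /* hypothetical block length; tune for the real loop */

/* Counts n in [nmin, nmax] with n! == +1 or -1 (mod p), testing the
   accumulated hit flag only once per block of iterations. */
static unsigned count_hits_blocked(uint64_t nmin, uint64_t nmax, uint64_t p)
{
    assert(p > 2);
    unsigned hits = 0;
    uint64_t residue = 2 % p;
    uint64_t n = 3;
    while (n <= nmax) {
        uint64_t start_residue = residue;
        uint64_t start_n = n;
        uint64_t block_end = (nmax - n >= BLOCK - 1) ? n + BLOCK - 1 : nmax;
        int any_hit = 0;

        /* Hot loop: no branch on the result, just OR into the accumulator. */
        for (; n <= block_end; n++) {
            residue = (residue * n) % p;
            any_hit |= (n >= nmin) & ((residue == 1) | (residue == p - 1));
        }

        if (any_hit) {
            /* Rare path: replay the block to find which n fired. */
            uint64_t r = start_residue;
            for (uint64_t m = start_n; m <= block_end; m++) {
                r = (r * m) % p;
                if (m >= nmin && (r == 1 || r == p - 1))
                    hits++;
            }
        }
    }
    return hits;
}
```

For p = 7 the residues of 3!, 4!, 5! and 6! are 6, 3, 1 and 6, so three of the four values are hits. Real factors are far rarer, which is what makes the replay path cheap on average.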
Next, you should be able to get rid of this mid-loop check:
// if we are below nmin then we continue next iteration, we
if (n < nmin) continue;
Which shows up here:
| 1* | | | | | | | | | cmp r14, 0x3e8
| 0*F | | | | | | | | | jb 0x229
It may not be a huge gain since the predictor will (probably) get this one (mostly) right, but you should get some gains by having two distinct loops for two "phases":
n=3 to n=nmin-1
n=nmin and beyond
Even if you gain a cycle, that's 3%. And since that's generally related to the big 'or' operation, above, there may be more cleverness in there to be found.
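A scalar sketch of the two-phase split is shown below, again using the naive `%` arithmetic in place of the FMA sequence and counting hits instead of calling `report_factor`. The function name `count_hits_two_phase` is hypothetical, and the code assumes p > 2 and that (p - 1) * nmax fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Counts n in [nmin, nmax] with n! == +1 or -1 (mod p), with the range
   test hoisted out of the checking loop. */
static unsigned count_hits_two_phase(uint64_t nmin, uint64_t nmax, uint64_t p)
{
    assert(p > 2);
    unsigned hits = 0;
    uint64_t residue = 2 % p;
    uint64_t n = 3;

    /* Phase 1: n = 3 .. nmin-1. Only advance the residue. */
    for (; n < nmin && n <= nmax; n++)
        residue = (residue * n) % p;

    /* Phase 2: n = nmin .. nmax. Every n is reportable, so no range test. */
    for (; n <= nmax; n++) {
        residue = (residue * n) % p;
        if (residue == 1 || residue == p - 1)
            hits++;
    }
    return hits;
}
```

In the vectorized code the first loop would hold only the multiply and reduction steps, and the second loop would add the compare and mask logic.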

Alternatives for creating graphs in C

I recently created a program that calculates flow rate through a pipe and generates, line by line, a scatter graph of the output. My knowledge of C is rudimentary (started with python) and I get the feeling that I may have made the code overly complicated. As such, I am asking if anyone has any alternatives to the code below. Critiques of code structure etc. are also welcome!
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#define PI 3.1415926
double
flow_rate(double diameter, double k, double slope){
double area, w_perimeter, hyd_rad, fr;
area = (PI*pow(diameter,2.0))/8.0;
w_perimeter = (PI*diameter)/2.0;
hyd_rad = area/w_perimeter;
fr = (1.0/k)*area*pow(hyd_rad,(2.0/3.0))*pow(slope,(1.0/2.0));
return fr;
}
int
main(int argc, char **argv) {
double avg_k=0.0312, min_slope=0.0008;
float s3_diameter;
int i=0, num=0, flow_array[6] ,rows, align=29;
char graph[] = "                                  ";
char graph_temp[]= "                                  ";
printf("\nFlow Rate (x 10^-3) m^3/s\n");
for (s3_diameter=0.50;s3_diameter>0.24;s3_diameter-=0.05){
flow_array[i] = (1000*(flow_rate(s3_diameter, avg_k, min_slope))+0.5);
i += 1;
}
for (rows=30;rows>0;rows--){
strcpy(graph_temp,graph);
for (num=0;num<6;num++){
if (rows==flow_array[num] && rows%5==0){
graph_temp[align] = '*';
printf("%d%s\n",rows,graph_temp);
align -= 5;
break;
}
else if (rows==flow_array[num]){
graph_temp[align] = '*';
printf("|%s\n",graph_temp);
align -= 5;
break;
}
else {
if (rows%5==0 && num==5){
printf("%d%s\n",rows,graph_temp);
}
else if (rows%5!=0 && num==5){
printf("|%s\n",graph_temp);
}
}
}
}
printf("|----2----3----3----4----4----5----\n");
printf("     5    0    5    0    5    0\n");
printf(" Diameter (x 10^-2) m\n");
return 0;
}
Output as below.
Flow Rate (x 10^-3) m^3/s
30
|
|
|
|
25
|
|
|                             *
|
20
|
|
|                        *
|
15
|
|
|                   *
|
10
|              *
|
|
|         *
5
|    *
|
|
|
|----2----3----3----4----4----5----
     5    0    5    0    5    0
Diameter (x 10^-2) m

GNUPlot is by far the simplest way to draw graph in C.
It can draw from simple plotting to complex 3d graph, and even provides an ASCII Art output (if ASCII output is really required)
You can find more information on how to use GNUPlot in a C program here: http://ndevilla.free.fr/gnuplot/
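As a rough sketch of what driving gnuplot from C can look like, the function below writes a small gnuplot script with inline data to any `FILE *`. The function name `write_gnuplot_script` is hypothetical and the axis labels are taken from the question; the commands `set terminal dumb`, `set xlabel`, `set ylabel` and `plot '-'` are standard gnuplot commands, and `dumb` is the ASCII-art terminal.

```c
#include <assert.h>
#include <stdio.h>

/* Writes a gnuplot script with inline data points to out.
   Returns 0 on success, -1 if any write fails. */
static int write_gnuplot_script(FILE *out, const double *x, const double *y, size_t n)
{
    assert(out != NULL);
    if (fprintf(out, "set terminal dumb\n"
                     "set xlabel \"Diameter (x 10^-2) m\"\n"
                     "set ylabel \"Flow rate (x 10^-3) m^3/s\"\n"
                     "plot '-' with points notitle\n") < 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (fprintf(out, "%g %g\n", x[i], y[i]) < 0)
            return -1;
    /* A line containing only "e" ends the inline data. */
    return fprintf(out, "e\n") < 0 ? -1 : 0;
}
```

On a POSIX system with gnuplot installed, one option is to open a pipe with `popen("gnuplot", "w")`, pass the returned stream to this function, and close it with `pclose`. Writing the script to a file and running `gnuplot file` works as well.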
