MPI radix sort implementation: low speed-up (C)

I tried to implement radix sort using MPI to parallelize its execution. My idea was to have each process compute the count vector locally and then aggregate the results on the root process, which performs the actual sorting. With this solution I see a maximum speed-up of just 1.5 compared to the sequential version.
I was wondering whether a speed-up like this is normal, or whether a different MPI decomposition could improve the result. I have seen suggestions that one could sort subarrays on the individual processes, but I don't quite understand how to put those sorted blocks together to get the final sorted array efficiently.
Sequential:
/**
 * @brief Finds the maximum value in an array.
 * @param array array.
 * @param n array size.
 */
int getMax(int* array, int n) {
    int max = array[0];
    for (int i = 1; i < n; i++)
        if (array[i] > max)
            max = array[i];
    return max;
}

/**
 * @brief Counting-sort pass on 'array' for the digit selected by 'digit'.
 * @param array array.
 * @param size array size.
 * @param digit current digit place (1, 10, 100, ...).
 */
void countingSort(int* array, int size, int digit) {
    int* output = (int*) malloc(sizeof(int) * (size + 1));
    int count[10] = {0};
    // Histogram of the current digit
    for (int i = 0; i < size; i++)
        count[(array[i] / digit) % 10]++;
    // Prefix sums: count[d] now holds the end position of bucket d
    for (int i = 1; i < 10; i++)
        count[i] += count[i - 1];
    // Stable placement, scanning backwards
    for (int i = size - 1; i >= 0; i--) {
        output[count[(array[i] / digit) % 10] - 1] = array[i];
        count[(array[i] / digit) % 10]--;
    }
    for (int i = 0; i < size; i++)
        array[i] = output[i];
    free(output);
}

/**
 * @brief Finds the maximum and runs one counting-sort pass per digit.
 * @param array array.
 * @param size array size.
 */
void radixsort(int* array, int size) {
    int max = getMax(array, size);
    for (int digit = 1; max / digit > 0; digit *= 10)
        countingSort(array, size, digit);
}
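For example, a single countingSort pass with digit = 1 on {170, 45, 75, 90, 802} keys the elements by their last digit (0, 5, 5, 0, 2) and, being stable, produces {170, 90, 802, 45, 75}; the next pass with digit = 10 then orders them by the tens digit, and so on until the most significant digit of the maximum has been processed.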
MPI version:
/**
 * @brief Finds the maximum value in an array.
 * @param arr array.
 * @param n array size.
 */
int getMax(int* arr, int n) {
    int max = arr[0];
    for (int i = 1; i < n; i++)
        if (arr[i] > max)
            max = arr[i];
    return max;
}

/**
 * @brief Counting-sort pass: the counts are computed in parallel, the permutation is done on the root.
 * @param array full array (significant on the root only).
 * @param rec_buf sub-array of each process.
 * @param n array size.
 * @param digit current digit place (1, 10, 100, ...).
 * @param num_process number of processes.
 * @param rank rank of the current process.
 * @param dim size of rec_buf.
 */
void countingSort(int* array, int* rec_buf, int n, int digit, int num_process, int rank, int dim) {
    // Compute the local count on each process
    int i, local_count[10] = {0};
    for (i = 0; i < dim; i++) {
        local_count[(rec_buf[i] / digit) % 10]++;
    }
    // Reduce all the local counts onto the root process
    if (rank == 0) {
        int count[10] = {0};
        MPI_Reduce(local_count, count, 10, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        for (i = 1; i < 10; i++) {
            count[i] += count[i - 1];
        }
        // Only the root permutes the full array
        int* temp_array = (int*) malloc(sizeof(int) * n);
        for (i = n - 1; i >= 0; i--) {
            temp_array[count[(array[i] / digit) % 10] - 1] = array[i];
            count[(array[i] / digit) % 10]--;
        }
        memcpy(array, temp_array, sizeof(int) * n);
        free(temp_array);
    } else {
        MPI_Reduce(local_count, NULL, 10, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }
}

/**
 * @brief Splits the array into sub-arrays and starts the sorting process.
 * @param array array.
 * @param n array size.
 * @param num_process number of processes.
 * @param rank rank of the current process.
 */
void radix_sort(int* array, int n, int num_process, int rank) {
    int rem = n % num_process;   // elements remaining after division among processes
    int dim, displacement;
    if (rank < rem) {
        dim = n / num_process + 1;
        displacement = rank * dim;
    } else {
        dim = n / num_process;
        displacement = rank * dim + rem;
    }
    int* rec_buf = (int*) malloc(sizeof(int) * dim);
    int* sendcounts = NULL;
    int* displs = NULL;
    if (rank == 0) {
        sendcounts = malloc(sizeof(int) * num_process);
        displs = malloc(sizeof(int) * num_process);
    }
    MPI_Gather(&dim, 1, MPI_INT, sendcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Gather(&displacement, 1, MPI_INT, displs, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatterv(array, sendcounts, displs, MPI_INT, rec_buf, dim, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        free(sendcounts);
        free(displs);
    }
    int local_max = getMax(rec_buf, dim);
    int global_max;
    MPI_Allreduce(&local_max, &global_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    for (int digit = 1; global_max / digit > 0; digit *= 10) {
        countingSort(array, rec_buf, n, digit, num_process, rank, dim);
    }
    free(rec_buf);
}
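I suspect the limited speed-up comes from the fact that only the local counting loop is actually parallel here: the prefix sums and the element permutation are still executed serially on the root for every digit. If the counting loop is, say, roughly half of the total work, Amdahl's law caps the speed-up at 1 / (0.5 + 0.5/P), which stays below 2 no matter how many processes P are used.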
I also noticed that using the same kind of approach of computing the count vectors locally, but with OpenMP, the speed-up reaches a maximum of about 4.5. Could this difference be due to the different architectures used in the two cases (shared memory for OpenMP versus distributed memory for MPI)?
For completeness, I also report the OpenMP solution:
/**
 * @brief Finds the maximum value in an array.
 * @param n array size.
 * @param arr[n] array.
 */
unsigned getMax(int n, unsigned arr[n]) {
    unsigned mx = arr[0];
    #pragma omp parallel for reduction(max:mx)
    for (int i = 1; i < n; i++)
        if (arr[i] > mx)
            mx = arr[i];
    return mx;
}

/* Source: https://gist.github.com/wanghc78/2c2b403299cab172e74c62f4397a6997
   Copyright (c) 2014, Haichuan Wang. All rights reserved. */
/**
 * @brief Sorts arr[] of size n using radix sort.
 * @param n array size.
 * @param arr[n] array.
 * @param threads number of threads.
 */
unsigned* radixsort(int n, unsigned arr[n], int threads) {
    if (threads == 0)
        threads += 1;
    unsigned m = getMax(n, arr);
    unsigned exp;
    unsigned* output = malloc(n * sizeof(unsigned));
    for (exp = 1; m / exp > 0; exp *= 10) {
        int count[10] = {0}, local_count[10] = {0};
        #pragma omp parallel firstprivate(local_count) num_threads(threads)
        {
            // Per-thread histogram of the current digit
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < n; i++)
                local_count[(arr[i] / exp) % 10]++;
            // Accumulate the per-thread histograms into the shared one
            #pragma omp critical
            for (int i = 0; i < 10; i++)
                count[i] += local_count[i];
            #pragma omp barrier
            // Prefix sums on one thread
            #pragma omp single
            for (int i = 1; i < 10; i++)
                count[i] += count[i - 1];
            // Give each thread its own output offsets, highest thread id first
            int tid = omp_get_thread_num();
            for (int cur_t = threads - 1; cur_t >= 0; cur_t--) {
                if (cur_t == tid) {
                    for (int i = 0; i < 10; i++) {
                        count[i] -= local_count[i];
                        local_count[i] = count[i];
                    }
                } else {
                    #pragma omp barrier
                }
            }
            // Scatter the elements into their positions for this digit
            #pragma omp for schedule(static)
            for (int i = 0; i < n; i++)
                output[local_count[(arr[i] / exp) % 10]++] = arr[i];
        }
        // Swap the buffers for the next pass
        unsigned* tmp = arr;
        arr = output;
        output = tmp;
    }
    free(output);
    return arr;
}

/**
 * @brief Initializes all the data structures needed in the program.
 * @param N array size.
 * @param threads number of threads.
 */
unsigned* init_structure(int N, int threads) {
    unsigned* data_vector = malloc(N * sizeof(unsigned));
    #pragma omp parallel for shared(data_vector) num_threads(threads)
    for (int i = 0; i < N; i++)
        data_vector[i] = N - i;
    return data_vector;
}
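(Note that this radixsort returns a pointer that may be a different buffer from the one passed in, since the two buffers are swapped after every pass and the one that is not returned is freed, so the caller has to use the return value, e.g. data_vector = radixsort(N, data_vector, threads);.)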
Thanks everyone for the answers.

Related

What causes MPI_Gatherv() to segfault?

I perform a computation on a subset of parallel processes, but when I join the results in the master process with the command MPI_Gatherv(a_per_process, mylen_per_process, MPI_LONG_DOUBLE, a, recvcounts, displs, MPI_LONG_DOUBLE, 0, MPI_COMM_WORLD);, I get a segmentation fault.
/*****************************************************************************
* DESCRIPTION:
* This program increments every element of the array by two.
* It extracts the average execution time for different numbers of threads,
* in order to compare the performance of each routine.
* Compile:
* $mpicc mpic.c -o mpic -fopenmp -lm -Ofast
* Run:
* $mpirun -np <maxthreads> ./mpic
******************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h> /* for omp_get_wtime() */
int main(int argc, char *argv[])
{
int j = 0;
long long int len_per_process = 0;
long long int remainder = 0;
long long int mylen_per_process = 0;
int size = 0;
int rank = 0;
int *recvcounts, *displs;
long double *a, *a_per_process;
double start_comp = 0;
double start_comm = 0;
double end_comp = 0;
double end_comm = 0;
double maxtime_comp = 0;
double maxtime_comm = 0;
int i = 0;
long nSamples = 10;
long long int length = 1.0;
int maxthreads = 0;
int testnumber = 0;
long long int minlength = 1;
long long int maxlength = 1;
int cycles = 0;
long longlength = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/*Whole array allocation in master process*/
if (rank == 0)
{
a = (long double *)malloc(length * sizeof(long double));
}
for (length = minlength; length <= maxlength; length = length * 10)
{
for (i = 1; i <= size; i = i * 2)
{
/*Data distribution to processes*/
len_per_process = length / i;
remainder = length % i;
mylen_per_process = (rank < remainder) ? (len_per_process + 1) : (len_per_process);
recvcounts = (int *)malloc(size * sizeof(int));
displs = (int *)malloc(size * sizeof(int));
MPI_Allgather(&mylen_per_process, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);
displs[0] = 0;
for (j = 1; j < size; j++)
{
displs[j] = displs[j - 1] + recvcounts[j - 1];
}
/*Sub-Arrays Allocation and Initialisation at each process*/
a_per_process = (long double *)malloc(mylen_per_process * sizeof(long double));
for (j = 0; j < mylen_per_process; j++)
{
a_per_process[j] = 0.0;
}
if (rank <= i)
{
/*Increment elements by 2*/
start_comp = omp_get_wtime();
for (j = 0; j < nSamples; j++)
{
for (int k = 0; k < mylen_per_process; k++)
{
a_per_process[k] = a_per_process[k] + 2.0;
}
}
end_comp = omp_get_wtime() - start_comp;
start_comm = omp_get_wtime();
end_comm = omp_get_wtime() - start_comm;
}
// The following line causes a segfault:
MPI_Gatherv(a_per_process, mylen_per_process, MPI_LONG_DOUBLE, a, recvcounts, displs, MPI_LONG_DOUBLE, 0, MPI_COMM_WORLD);
// Get the maximum computation and communication time
MPI_Reduce(&end_comp, &maxtime_comp, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Reduce(&end_comm, &maxtime_comm, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
free(a_per_process);
free(recvcounts);
free(displs);
}
}
if (rank == 0)
{
free(a);
}
MPI_Finalize();
return 0;
}
I tried both double and long double types for my variables a and a_per_process, i.e. MPI_DOUBLE and MPI_LONG_DOUBLE in the MPI_Gatherv call. The code runs when I comment this line out, meaning it doesn't abort or segfault.

I am trying to improve the performance of my cross-correlation algorithm. What can I do to make my C code run faster?

I created a cross-correlation algorithm, and I am trying to maximize its performance by reducing the time it takes for it to run. First of all, I reduced the number of function calls within the "crossCorrelationV2" function. Second, I created several macros at the top of the program for constants. Third, I reduced the number of loops that are inside the "crossCorrelationV2" function. The code that you see is the most recent code that I have.
Are there any other methods I can use to try and reduce the processing time of my code?
Let's assume that I am only focused on the functions "crossCorrelationV2" and "createAnalyzingWave".
I would be glad for any advice, whether in general about programming or pertaining to those two specific functions; I am a beginner programmer. Thanks.
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 4096
#define PULSESNUMBER 16
#define DATAFREQ 1300
// Print the contents of the array onto the console.
void printArray(double array[], int size){
int k;
for (k = 0; k < size; k++){
printf("%lf ", array[k]);
}
printf("\n");
}
// Creates analyzing square wave. This square wave has unity (1) magnitude.
// The number of high values in each period is determined by high values = (analyzingT/2) / time increment
void createAnalyzingWave(double analyzingFreq, double wave[]){
int highValues = (1 / analyzingFreq) * 0.5 / ((PULSESNUMBER * (1.0 / DATAFREQ) / ARRAYSIZE)); /* 1.0 avoids integer division truncating to zero */
int counter = 0;
int p;
for(p = 1; p <= ARRAYSIZE; p++){
if ((counter % 2) == 0){
wave[p - 1] = 1;
} else{
wave[p - 1] = 0;
}
if (p % highValues == 0){
counter++;
}
}
}
// Creates data square wave (for testing purposes, for the real implementation actual ADC data will be used). This
// square wave has unity magnitude.
// The number of high values in each period is determined by high values = array size / (2 * number of pulses)
void createDataWave(double wave[]){
int highValues = ARRAYSIZE / (2 * PULSESNUMBER);
int counter = 0;
int p;
for(p = 0; p < ARRAYSIZE; p++){
if ((counter % 2) == 0){
wave[p] = 1;
} else{
wave[p] = 0;
}
if ((p + 1) % highValues == 0){
counter++;
}
}
}
// Finds the average of all the values inside an array
double arrayAverage(double array[], int size){
int i;
double sum = 0;
// Same thing as for(i = 0; i < arraySize; i++)
for(i = size; i--; ){
sum = array[i] + sum;
}
return sum / size;
}
// Cross-Correlation algorithm
double crossCorrelationV2(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
double bigAnalyzingArray[bigArraySize];
int i;
int b;
// Set first few elements of the array equal to analyzingWave
// Set remainder of big analyzing array to 0
for(i = 0; i < ARRAYSIZE; i++){
bigAnalyzingArray[i] = analyzingWave[i];
bigAnalyzingArray[i + ARRAYSIZE] = 0;
}
double maxCorrelationValue = 0;
double currentCorrelationValue;
// "Beginning" of correlation algorithm proper
for(i = 0; i < bigArraySize; i++){
currentCorrelationValue = 0;
for(b = lastBigArrayIndex; b > 0; b--){
if (b >= lastArrayIndex){
currentCorrelationValue = dataWave[b - lastBigArrayIndex / 2] * bigAnalyzingArray[b] + currentCorrelationValue;
}
bigAnalyzingArray[b] = bigAnalyzingArray[b - 1];
}
bigAnalyzingArray[0] = 0;
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
return maxCorrelationValue;
}
int main(){
int samplesNumber = 25;
double analyzingFreq = 1300;
double analyzingWave[ARRAYSIZE];
double dataWave[ARRAYSIZE];
createAnalyzingWave(analyzingFreq, analyzingWave);
//createDataWave(arraySize, pulsesNumber, dataWave);
double maximumCorrelationArray[samplesNumber];
int i;
for(i = 0; i < samplesNumber; i++){
createDataWave(dataWave);
maximumCorrelationArray[i] = crossCorrelationV2(dataWave, analyzingWave);
}
printf("Average of the array values: %lf\n", arrayAverage(maximumCorrelationArray, samplesNumber));
return 0;
}
The first point is that you are explicitly shifting the analyzing array; this requires twice as much memory, and moving the items accounts for about 50% of your time. In a test here, crossCorrelationV2 takes 4.1 seconds, while the crossCorrelationV3 implementation runs in ~2.0 seconds.
The next thing is that you are spending time multiplying by zero in the padded part of the array. Removing that, removing the padding altogether, and simplifying the indices, we end up with crossCorrelationV4, which makes the program run in ~1.0 second.
// Cross-Correlation algorithm
double crossCorrelationV3(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
double bigAnalyzingArray[bigArraySize];
int i;
int b;
// Set first few elements of the array equal to analyzingWave
// Set remainder of big analyzing array to 0
for(i = 0; i < ARRAYSIZE; i++){
bigAnalyzingArray[i] = analyzingWave[i];
bigAnalyzingArray[i + ARRAYSIZE] = 0;
}
double maxCorrelationValue = 0;
double currentCorrelationValue;
// "Beginning" of correlation algorithm proper
for(i = 0; i < bigArraySize; i++){
currentCorrelationValue = 0;
// Instead of checking if b >= lastArrayIndex inside the loop I use it as
// a stopping condition.
for(b = lastBigArrayIndex; b >= lastArrayIndex; b--){
// instead of shifting bigAnalyzingArray[b] = bigAnalyzingArray[b-1] every iteration
// I simply use bigAnalyzingArray[b - i]
currentCorrelationValue = dataWave[b - lastBigArrayIndex / 2] * bigAnalyzingArray[b - i] + currentCorrelationValue;
}
bigAnalyzingArray[0] = 0;
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
return maxCorrelationValue;
}
// Cross-Correlation algorithm
double crossCorrelationV4(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
// I will not allocate the bigAnalizingArray here
// double bigAnalyzingArray[bigArraySize];
int i;
int b;
// I will not copy the analizingWave to bigAnalyzingArray
// for(i = 0; i < ARRAYSIZE; i++){
// bigAnalyzingArray[i] = analyzingWave[i];
// bigAnalyzingArray[i + ARRAYSIZE] = 0;
// }
double maxCorrelationValue = 0;
double currentCorrelationValue;
// Compute the correlation by symmetric pairs
// the idea here is to simplify the indices of the inner loops since
// they are computed more times.
for(i = 0; i < lastArrayIndex; i++){
currentCorrelationValue = 0;
for(b = lastArrayIndex - i; b >= 0; b--){
// index analyzingWave directly with the shift i
// instead of keeping a shifted copy of it
currentCorrelationValue += dataWave[b] * analyzingWave[b + i];
}
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
if(i != 0){
currentCorrelationValue = 0;
// Correlate shifting to the other side
for(b = lastArrayIndex - i; b >= 0; b--){
// same idea for the shift in the other direction:
// index dataWave with the offset i instead of shifting a copy
currentCorrelationValue += dataWave[b + i] * analyzingWave[b];
}
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
}
return maxCorrelationValue;
}
If you want more optimization, you can unroll some iterations of the loop and enable compiler optimizations such as vector extensions.
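To give an idea of what manual unrolling could look like, here is a sketch of the first inner loop of crossCorrelationV4 unrolled by four (untested; note that using several accumulators slightly changes the floating-point summation order):
double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
int limit = lastArrayIndex - i;   // same inclusive bound as in V4
int b;
for (b = 0; b + 3 <= limit; b += 4) {
    s0 += dataWave[b]     * analyzingWave[b + i];
    s1 += dataWave[b + 1] * analyzingWave[b + i + 1];
    s2 += dataWave[b + 2] * analyzingWave[b + i + 2];
    s3 += dataWave[b + 3] * analyzingWave[b + i + 3];
}
for (; b <= limit; b++)           // leftover iterations
    s0 += dataWave[b] * analyzingWave[b + i];
currentCorrelationValue = s0 + s1 + s2 + s3;
Compiling with -O3 (and -march=native if available) often lets the compiler vectorize this kind of loop on its own.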

Optimizing Matrix multiplication in C with Bit Packing

I'm currently attempting to write an algorithm for optimizing matrix multiplication over GF(2) using bit-packing. Both matrices A and B are provided in column major order so I start by copying A into row-major order and then packing the values into 8-bit integers and using parity checking to speed up operations. I need to be able to test square matrices of up to 2048x2048, however, my current implementation provides the correct answer up to 24x24 and then fails to compute the correct result. Any help would be appreciated.
//Method which packs an array of integers into 8 bits
uint8_t pack(int *toPack) {
int i;
uint8_t A;
A = 0;
for (i = 0; i < 8; i++) {
A = (A << 1) | (uint8_t)toPack[i];
}
return A;
}
//Method for doing matrix multiplication over GF(2)
void matmul_optimized(int n, int *A, int *B, int *C) {
int i, j, k;
//Copying values of A into a row major order matrix.
int *A_COPY = malloc(n * n * sizeof(int));
int copy_index = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
A_COPY[copy_index] = A[i + j * n];
copy_index++;
}
}
//Size of the data type integers will be packed into
const int portion_size = 8;
int portions = n / portion_size;
//Pointer space reserved to store packed integers in row major order
uint8_t *compressedA = malloc(n * portions * sizeof(uint8_t));
uint8_t *compressedB = malloc(n * portions * sizeof(uint8_t));
int a[portion_size];
int b[portion_size];
for (i = 0; i < n; i++) {
for (j = 0; j < portions; j++) {
for (k = 0; k < portion_size; k++) {
a[k] = A_COPY[i * n + j * portion_size + k];
b[k] = B[i * n + j * portion_size + k];
}
compressedA[i * n + j] = pack(a);
compressedB[i * n + j] = pack(b);
}
}
//Calculating final matrix using parity checking and XOR on A and B
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
int cIndex = i + j * n;
cij = C[cIndex];
for (k = 0; k < portions; ++k) {
uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
temp ^= temp >> 4;
temp ^= temp >> 2;
temp ^= temp >> 1;
uint8_t parity = temp & (uint8_t)1;
cij = cij ^ parity;
}
C[cIndex] = cij;
}
}
free(compressedA);
free(compressedB);
free(A_COPY);
}
I have two remarks:
you should probably initialize cij to 0 instead of cij = C[cIndex];. It seems incorrect to update the destination matrix instead of storing the result of A * B. Your code might work for small matrices by coincidence because the destination matrix C happens to be all zeroes for this size.
it is risky to compute the allocation size as malloc(n * n * sizeof(int)); because n * n might overflow with int n if int is smaller than size_t. Given the sizes you work with, it is probably not a problem here, but it is a good idea to always put the sizeof first so that the remaining operands are converted to size_t:
int *A_COPY = malloc(sizeof(*A_COPY) * n * n);
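Applied to the accumulation loop, the first remark would look like this (everything else unchanged from the code in the question):
cij = 0;   /* accumulate A*B from zero instead of reading C[cIndex] */
for (k = 0; k < portions; ++k) {
    uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
    temp ^= temp >> 4;
    temp ^= temp >> 2;
    temp ^= temp >> 1;
    uint8_t parity = temp & (uint8_t)1;
    cij = cij ^ parity;
}
C[cIndex] = cij;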

Hybrid approach with OpenMP and MPI does not use same number of threads in cluster with different number of hosts

I'm testing a hybrid approach by parallelizing the friendly-numbers (CAPBenchmark) program with MPI and OpenMP.
My cluster has 8 machines and each machine has a 4 core processor.
The code:
/*
* Copyright(C) 2014 Pedro H. Penna <pedrohenriquepenna@gmail.com>
*
* friendly-numbers.c - Friendly numbers kernel.
*/
#include <global.h>
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <util.h>
#include "fn.h"
/*
* Computes the Greatest Common Divisor of two numbers.
*/
static int gcd(int a, int b)
{
int c;
/* Compute greatest common divisor. */
while (a != 0)
{
c = a;
a = b%a;
b = c;
}
return (b);
}
/*
* Sum of divisors.
*/
static int sumdiv(int n)
{
int sum; /* Sum of divisors. */
int factor; /* Working factor. */
sum = 1 + n;
/* Compute sum of divisors. */
for (factor = 2; factor < n; factor++)
{
/* Divisor found. */
if ((n%factor) == 0)
sum += factor;
}
return (sum);
}
/*
* Computes friendly numbers.
*/
int friendly_numbers(int start, int end)
{
int n; /* Divisor. */
int *num; /* Numerator. */
int *den; /* Denominator. */
int *totalnum;
int *totalden;
int rcv_friends;
int range; /* Range of numbers. */
int i, j; /* Loop indexes. */
int nfriends; /* Number of friendly numbers. */
int slice;
range = end - start + 1;
slice = range / nthreads;
if (rank == 0) {
num = smalloc(sizeof(int)*range);
den = smalloc(sizeof(int)*range);
totalnum = smalloc(sizeof(int)*range);
totalden = smalloc(sizeof(int)*range);
} else {
num = smalloc(sizeof(int) * slice);
den = smalloc(sizeof(int) * slice);
totalnum = smalloc(sizeof(int)*range);
totalden = smalloc(sizeof(int)*range);
}
j = 0;
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j, n) default(shared)
for (i = start + rank * slice; i < start + (rank + 1) * slice; i++) {
j = i - (start + rank * slice);
num[j] = sumdiv(i);
den[j] = i;
n = gcd(num[j], den[j]);
num[j] /= n;
den[j] /= n;
}
if (rank != 0) {
MPI_Send(num, slice, MPI_INT, 0, 0, MPI_COMM_WORLD);
MPI_Send(den, slice, MPI_INT, 0, 1, MPI_COMM_WORLD);
} else {
for (i = 1; i < nthreads; i++) {
MPI_Recv(num + (i * (slice)), slice, MPI_INT, i, 0, MPI_COMM_WORLD, 0);
MPI_Recv(den + (i * (slice)), slice, MPI_INT, i, 1, MPI_COMM_WORLD, 0);
}
}
if (rank == 0) {
for (i = 1; i < nthreads; i++) {
MPI_Send(num, range, MPI_INT, i, 2, MPI_COMM_WORLD);
MPI_Send(den, range, MPI_INT, i, 3, MPI_COMM_WORLD);
}
} else {
MPI_Recv(totalnum, range, MPI_INT, 0, 2, MPI_COMM_WORLD,0);
MPI_Recv(totalden, range, MPI_INT, 0, 3, MPI_COMM_WORLD,0);
}
/* Check friendly numbers. */
nfriends = 0;
if (rank == 0) {
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j) default(shared) reduction(+:nfriends)
for (i = rank; i < range; i += nthreads) {
for (j = 0; j < i; j++) {
/* Friends. */
if ((num[i] == num[j]) && (den[i] == den[j]))
nfriends++;
}
}
} else {
omp_set_dynamic(0);
omp_set_num_threads(4);
#pragma omp parallel for private(i, j) default(shared) reduction(+:nfriends)
for (i = rank; i < range; i += nthreads) {
for (j = 0; j < i; j++) {
/* Friends. */
if ((totalnum[i] == totalnum[j]) && (totalden[i] == totalden[j]))
nfriends++;
}
}
}
if (rank == 0) {
for (i = 1; i < nthreads; i++) {
MPI_Recv(&rcv_friends, 1, MPI_INT, i, 4, MPI_COMM_WORLD, 0);
nfriends += rcv_friends;
}
} else {
MPI_Send(&nfriends, 1, MPI_INT, 0, 4, MPI_COMM_WORLD);
}
free(num);
free(den);
return (nfriends);
}
During the executions I observed the following behavior:
When I run mpirun with 4 or 8 hosts, each host uses 4 threads for processing, as expected.
However, when running with only 2 hosts, only 1 thread is used on each machine.
What could cause this behavior? Is there any alternative to "force" the use of 4 threads in the 2-host case?
I assume you are using Open MPI.
The default binding policy is to bind to socket or NUMA domain (depending on your version). I assume your nodes are single socket, which means one MPI task is bound to 4 cores, and then the OpenMP runtime will likely start 4 OpenMP threads.
A special case is when you start only 2 MPI tasks. In this case, the binding policy is to bind to core, which means one MPI task is bound to only one core, and hence the OpenMP runtime starts only one OpenMP thread.
In order to achieve the desired behavior, you can
mpirun --bind-to numa -np 2 ...
If it fails, you can fallback to
mpirun --bind-to socket -np 2 ...
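You can also check what actually happened with Open MPI's --report-bindings option, for example (the executable name here is just a placeholder):
mpirun --bind-to numa --report-bindings -np 2 ./friendly-numbers
which prints, for each rank, the cores it has been bound to.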

Improve performance of a construction of p-values matrix for a permutation test

I use an R code that implements a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix which contains the p-values of all the possible contiguous combinations.
The last row of the matrix of p-values contains all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row are all equal, and their common value is the p-value of the global test: p_val_c(1,...,p) = p_val_c(2,...,p,1) = ... = p_val_c(p,1,...,p-1).
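For example, with p = 4 the matrix would be (from bottom to top): the last row holds p_val_c(1), p_val_c(2), p_val_c(3), p_val_c(4); the row above holds p_val_c(1,2), p_val_c(2,3), p_val_c(3,4), p_val_c(4,1); the next one holds p_val_c(1,2,3), p_val_c(2,3,4), p_val_c(3,4,1), p_val_c(4,1,2); and the first row holds p_val_c(1,2,3,4) repeated in every column.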
For computational reasons, I have decided to implement this component in C and call it from R via .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix);
// Function used for the sorting of vector T_temp with qsort
int cmp(const void *x, const void *y);
int main() {
int B = 1000; // number Conditional Monte Carlo (CMC) runs
int p = 100; // number univariate tests
// Generate fictitiously data univariate p-values pval and matrix L.
// The j-th column of L is the empirical survival
// function of the statistics test associated to the j-th coefficient
// of the basis expansion. The dimension of L is B * p.
// Generate pval
double pval[p];
memset(pval, 0, sizeof(pval)); // initialize all elements to 0
for (int i = 0; i < p; i++) {
pval[i] = (double)rand() / (double)RAND_MAX;
}
// Construct L
double L[B * p];
// Inizialize to 0 the elements of L
memset(L, 0, sizeof(L));
// Array used to construct the columns of L
double temp_array[B];
memset(temp_array, 0, sizeof(temp_array));
for(int i = 0; i < B; i++) {
temp_array[i] = (double) (i + 1) / (double) B;
}
for (int iter_coeff=0; iter_coeff < p; iter_coeff++) {
// Shuffle temp_array
if (B > 1) {
for (int k = 0; k < B - 1; k++)
{
int j = rand() % B;
double t = temp_array[j];
temp_array[j] = temp_array[k];
temp_array[k] = t;
}
}
for (int i=0; i<B; i++) {
L[iter_coeff + p * i] = temp_array[i];
}
}
double pval_asymm_matrix[p * p];
memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));
// Construct the asymmetric matrix of p-values
clock_t start, end;
double cpu_time_used;
start = clock();
Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("TOTAL CPU time used: %f\n", cpu_time_used);
return 0;
}
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix) {
int nbasis = *p, iter_CMC = *B;
// Scalar output fisher combining function applied on univariate
// p-values
double T0_temp = 0;
// Vector output fisher combining function applied on a set of
//columns of L
double T_temp[iter_CMC];
memset(T_temp, 0, sizeof(T_temp));
// Counter for elements of T_temp greater than or equal to T0_temp
int count = 0;
// Indexes for columns of L
int inf = 0, sup = 0;
// The last row of pval_asymm_matrix contains the univariate p-values
for(int i = 0; i < nbasis; i++) {
pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
}
// Construct the rows from bottom to up
for (int row = nbasis - 2; row >= 0; row--) {
for (int col = 0; col <= row; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = (nbasis - row) + col - 1;
// Combining function Fisher applied on
// p-values pval[inf:sup]
for (int k = inf; k <= sup; k++) {
T0_temp += log(pval[k]);
}
T0_temp *= -2;
// Combining function Fisher applied
// on columns inf:sup of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = inf; l <= sup; l++) {
T_temp[k] += log(L[l + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
}
// auxiliary variable for columns of L inf:nbasis-1 and 1:sup
int aux_first = 0, aux_second = 0;
int num_col_needed = 0;
for (int col = row + 1; col < nbasis; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = ((nbasis - row) + col) % nbasis - 1;
// Useful indexes
num_col_needed = nbasis - inf + sup + 1;
int index_needed[num_col_needed];
memset(index_needed, -1, num_col_needed * sizeof(int));
aux_first = inf;
for (int i = 0; i < nbasis - inf; i++) {
index_needed[i] = aux_first;
aux_first++;
}
aux_second = 0;
for (int j = 0; j < sup + 1; j++) {
index_needed[j + nbasis - inf] = aux_second;
aux_second++;
}
// Combining function Fisher applied on p-values
// pval[inf:p-1] and pval[0:sup-1]
for (int k = 0; k < num_col_needed; k++) {
T0_temp += log(pval[index_needed[k]]);
}
T0_temp *= -2;
// Combining function Fisher applied on columns inf:p-1 and 0:sup-1
// of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = 0; l < num_col_needed; l++) {
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
} // end for over col from row + 1 to nbasis - 1
} // end for over rows of asymm p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
double xx = *(double*)x, yy = *(double*)y;
if (xx < yy) return -1;
if (xx > yy) return 1;
return 0;
}
Here are the execution times in seconds, measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measure was obtained using an equivalent R function with duplication of the vector pval and matrix L.
What I would like are some suggestions to decrease the execution time of the C function, since it will be used for simulations. The last time I used C was five years ago, so there is certainly room for improvement. For instance, I sort the vector T_temp with qsort so that I can then count, with a linear while loop, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done more efficiently. Thanks in advance!
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine); keeping p as is and reducing B to 100 has a similar effect. Profiling showed that ~7.5 of the ~8 seconds used to compute this were spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
If you cannot have precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log if you can and store them into a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
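If the stack is a concern, a heap-allocated version of the same cache would look like this:
double *log_cache = malloc(sizeof(double) * B * p);
for (int j = 0, num = B * p; j < num; ++j)
    log_cache[j] = log(L[j]);
/* ... use it exactly like the array above, then free(log_cache) when done */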
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead which isn't that much better (in fact, it rarely is but calling log for double-precision floats is apparently quite expensive, enough to make this exchange worthwhile). The next iteration, if you want more speed, and it is very possible, involves looking into cache efficiency.
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.
