rsa algorithm in openmp - c

I am trying to parallelize the RSA algorithm in OpenMP using the repeated square-and-multiply method.
The code is as follows:
long long unsigned int mod_exp(int base,int exp,int n)
{
long long unsigned int i,pow1=1,pow2=1,pow3=1,pow4=1,pow=1,pow5=1;
int exp1=exp/4;
int id;
for(i=0;i<exp1;i++)
pow1=(pow1*base)%n;
for(i=0;i<exp1;i++)
pow2=(pow2*base)%n;
for(i=0;i<exp1;i++)
pow3=(pow3*base)%n;
for(i=0;i<exp1;i++)
pow4=(pow4*base)%n;
for(i=0;i<1;i++)
pow5=(pow5*base)%n;
pow=pow1*pow2*pow3*pow4*pow5;
pow=pow%n;
return pow;
}
With just #pragma omp for I am unable to get the correct output.
Kindly help.

I guess you could go for something like this:
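long long unsigned int pow = 1, exp1 = exp / 4;
int k;
#pragma omp parallel for reduction( * : pow )
for ( k = 0; k < 5; k++ ) {
    /* i is declared here so that every thread gets its own loop counter */
    for ( long long unsigned int i = 0; i < exp1; i++ ) {
        pow = ( pow * base ) % n;
    }
}
/* the reduction multiplies the per-thread results without taking the modulus */
pow %= n;
That should work as long as the product of the per-thread results fits in 64 bits, but I doubt it would do you much good: the amount of work is so limited that the parallelisation overhead is very likely to slow down the code instead of speeding it up.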
EDIT: hum, actually, I can't make sense of the initial code... Why are we doing 5 times the same computation? Did I miss something?
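For reference, the repeated square-and-multiply method mentioned in the question can be sketched as below. This is a sequential sketch, not the asker's code: the name mod_exp is taken from the question, while the uint64_t types and the restriction 0 < n <= 2^32 (so that intermediate products fit in 64 bits) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Computes (base^exp) mod n by repeated squaring and multiplying.
   Assumption for this sketch: 0 < n <= 2^32, so the product of two
   values below n always fits in 64 bits. */
uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t n)
{
    assert(n > 0 && n <= (UINT64_C(1) << 32));
    uint64_t result = 1 % n;              /* also correct for n == 1 */
    base %= n;
    while (exp > 0) {
        if (exp & 1)                      /* current exponent bit set: multiply */
            result = (result * base) % n;
        base = (base * base) % n;         /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```

The loop runs once per bit of exp, so for typical exponents there is probably very little left to parallelise, which fits the doubt expressed above about the OpenMP overhead.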

Related

How to optimize the computation of a for loop using SIMD?

I am trying to accelerate a stereo matching algorithm on the ODROID XU4 ARM platform using Neon SIMD. For this purpose I am using OpenMP's pragmas.
void StereoMatch:: sadCol(uint8_t* leftRank,uint8_t* rightRank,const int SAD_WIDTH,const int SAD_WIDTH_STEP, const int imgWidth,int j, int d , uint16_t* cost)
{
uint16_t sum = 0;
int n = 0;
int m =0;
for ( n = 0; n < SAD_WIDTH+1; n++)
{
#pragma omp simd
for( m = 0; m< SAD_WIDTH_STEP; m = m + imgWidth )
{
sum += abs(leftRank[j+m+n]-rightRank[j+m+n-d]);
};
cost[n] = sum;
sum = 0;
};
I am fairly new to SIMD and OpenMP. I understood that using the SIMD pragma in the code would direct the compiler to vectorize the subtraction, but when I executed the code I noticed no difference. What should I add to my code in order to vectorize it?
As said in the comments, ARM NEON has an instruction that directly does what you want, i.e., it computes the absolute difference of unsigned bytes and accumulates it into unsigned 16-bit integers.
Assuming SAD_WIDTH+1==8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):
#include <arm_neon.h>
#include <stdint.h>
void sadCol(uint8_t* leftRank,
            uint8_t* rightRank,
            int j,
            int d,
            uint16_t* cost) {
    const int SAD_WIDTH = 7;
    const int imgWidth = 320;
    const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;
    uint16x8_t cost_8 = {0}; // eight 16-bit accumulators, one per column n
    for (int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth) {
        // |left - right| for 8 adjacent bytes, widened and added to cost_8
        cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
    }
    vst1q_u16(cost, cost_8);
}
vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute difference and accumulates it to the first register. Finally, vst1q_u16 stores the register to memory.
You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that.
I have no ARM platform at hand to test it, but "it compiles": https://godbolt.org/z/vPqiYI (and the assembly looks fine to my eyes). If you optimize with -O3, gcc will unroll the loop.
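To check an intrinsics version like this on a machine without NEON, a plain C reference that produces the same cost values can be handy. The sketch below is an assumption-laden illustration, not part of the original answer: sad_col_ref is a made-up name, and the loops are interchanged relative to the question so that the inner loop walks over adjacent bytes (the same 8-wide access pattern that vld1_u8 uses), which may also be easier for a compiler to auto-vectorize.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for the column SAD: cost[n] is the sum over all rows m
   of |leftRank[j+m+n] - rightRank[j+m+n-d]| for n = 0 .. sadWidth.
   sad_col_ref is a hypothetical name used only for this sketch. */
void sad_col_ref(const uint8_t *leftRank, const uint8_t *rightRank,
                 int sadWidth, int sadWidthStep, int imgWidth,
                 int j, int d, uint16_t *cost)
{
    assert(imgWidth > 0);                 /* otherwise the row loop never ends */
    for (int n = 0; n <= sadWidth; n++)
        cost[n] = 0;
    for (int m = 0; m < sadWidthStep; m += imgWidth) {   /* one image row per step */
        for (int n = 0; n <= sadWidth; n++) {            /* adjacent bytes */
            int diff = leftRank[j + m + n] - rightRank[j + m + n - d];
            cost[n] = (uint16_t)(cost[n] + abs(diff));
        }
    }
}
```

Calling it with the same leftRank, rightRank, j and d as the NEON function, with sadWidth = 7, imgWidth = 320 and sadWidthStep = 7 * 320, should fill cost with the same eight values.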

SPOJ COINS DP and Recursive Approach

I have recently started solving DP problems and I came across COINS. I tried to solve it using DP with memoization and it works fine if I use an int array (I guess).
Here is my approach (a few modifications left):
#include <stdio.h>
#include <stdlib.h>
int dp[100000];
long long max(long x, long y)
{
if (x > y)
return x;
else
return y;
}
int main()
{
int n,i;
scanf("%d",&n);
dp[0]=0;
for(i=1;i<=n;i++)
{
dp[i]=max(i,dp[i/2] + dp[i/3] + dp[i/4]);
}
printf("%d\n",dp[n]);
return 0;
}
But I don't understand why I get SIGSEGV as soon as I use a long long array.
I searched and there seems to be a recursive solution that I do not understand.
Can someone help me out here?
The limits say n <= 10^9. An array of that size will not fit in memory, and with your dp[100000] any n >= 100000 already indexes past the end of the array, hence the SIGSEGV. The element type of the dp array does not matter.
There are other errors in your code as well. Firstly, there are multiple test cases, which you have to read until EOF. Secondly, since the limit is 10^9, looping n times will surely give TLE (time limit exceeded).
Now, for the recursive solution, using memoization:
Firstly, precompute the answers for all n below 10^6 and save them in an array; this saves time. It can be done as:
long long dp[1000000] = {0};
for(int i = 1; i < 1000000; i++){
dp[i] = max(i, dp[i/2] + dp[i/3] + dp[i/4]);
}
Now, for any input n, find the solution as,
ans = coins(n);
Implement coins function as:
long long coins(long long n){
if (n < 1000000)
return dp[n];
return coins(n/2) + coins(n/3) + coins(n/4);
}
Why this recursive solution works:
For every n >= 12 the answer is ans[n/2] + ans[n/3] + ans[n/4], because exchanging the coin is then never worse than keeping it (for example, 12 gives 6 + 4 + 3 = 13). So for n >= 10^6 that sum is returned directly.
The base condition of the recursion is there just to save time. You could also stop the recursion at 0, but then you would have to take care of the corner cases where keeping the coin is better (n < 12).
Exact code:
#include<stdio.h>
long long dp[1000000] = {0};
long long max(long long a, long long b){
return a>b?a:b;
}
long long coins(long long n){
if (n < 1000000)
return dp[n];
return coins(n/2) + coins(n/3) + coins(n/4);
}
int main(){
for(long long i = 1; i < 1000000; i++){
dp[i] = max(i, dp[i/2] + dp[i/3] + dp[i/4]);
}
long long n;
while(scanf("%lld",&n) != EOF){
printf("%lld\n", coins(n));
}
return 0;
}

C code for generating next bit to flip in a gray code

I need a function that returns a number essentially telling me which bit would be the one to flip when moving to the nth element of a Gray code. It doesn't matter if it's the standard (reflecting) Gray code or some other minimal bit-toggling approach. I can do it, but it seems unnecessarily unwieldy. Currently I have this:
#include <stdio.h>
int main()
{
int i;
for (i=1; i<32; i++)
printf("%d\n",grayBitToFlip(i));
}
int grayBitToFlip(int n)
{
int j, d, n1, n2;
n1 = (n-1)^((n-1)>>1);
n2 = n^(n>>1);
d = n1^n2;
j = 0;
while (d >>= 1)
j++;
return j;
}
The loop in main() is only there to demonstrate the output of the function.
Is there a better way?
EDIT: just looking at the output, it's obvious one can do this more simply. I've added a 2nd function, gray2, that does the same thing much more simply. Would this be the way to do it? This is not production code by the way but hobbyist.
#include <stdio.h>
int main()
{
int i;
for (i=1; i<32; i++)
printf("%d %d\n",grayBitToFlip(i), gray2(i));
}
int grayBitToFlip(int n)
{
int j, d, n1, n2;
n1 = (n-1)^((n-1)>>1);
n2 = n^(n>>1);
d = n1^n2;
j = 0;
while (d >>= 1)
j++;
return j;
}
int gray2(int n)
{
int j;
j=0;
while (n)
{
if (n & 1)
return j;
n >>= 1;
j++;
}
return j;
}
The easiest Gray code to use is a Johnson Gray Code (JGC).
BitNumberToFlip = (BitNumberToFlip + 1) % NumberOfBitsInCode; // "x = ++x % N" would be undefined behaviour in C
JGC = JGC ^ (1 << BitNumberToFlip); // start JGC = 0;
A Johnson code is linear in the number of bits required for representation.
A Binary Reflected Gray Code (BRGC) has a much better bit density since only
a logarithmic number of bits are required to represent the range of BRGC codes.
int powerOf2(int n){ return // does 16 bit codes
( n & 0xFF00 ? 8:0 ) + // 88888888........
( n & 0xF0F0 ? 4:0 ) + // 4444....4444....
( n & 0xCCCC ? 2:0 ) + // 22..22..22..22..
( n & 0xAAAA ? 1:0 ) ; } // 1.1.1.1.1.1.1.1.
// much faster algorithms exist see ref.
int BRGC(int gc){ return (gc ^ gc>>1);}
int bitToFlip(int n){ return powerOf2( BRGC( n ) ^ BRGC( n+1 ) ); }
for details see ref:
How do I find next bit to change in a Gray code in constant time?
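The observation behind gray2 in the question is that, for the binary reflected Gray code, the bit that changes when moving from element n-1 to element n is the lowest set bit of n. A minimal sketch of that idea follows; it assumes GCC or Clang, since __builtin_ctz is a compiler builtin and not standard C, and the function names are made up for this example. The second function repeats the XOR approach from the question so the two can be compared.

```c
#include <assert.h>

/* Index of the bit that differs between Gray code n-1 and Gray code n,
   for n >= 1. Uses the GCC/Clang builtin that counts trailing zero bits. */
unsigned gray_bit_to_flip(unsigned n)
{
    assert(n != 0);                       /* __builtin_ctz(0) is undefined */
    return (unsigned)__builtin_ctz(n);
}

/* Reference version: XOR two consecutive reflected Gray codes and find
   the position of the single bit that remains set. */
unsigned gray_bit_to_flip_ref(unsigned n)
{
    unsigned d = ((n - 1) ^ ((n - 1) >> 1)) ^ (n ^ (n >> 1));
    unsigned j = 0;
    while (d >>= 1)
        j++;
    return j;
}
```

On compilers without the builtin, the loop from gray2 does the same job at the cost of a few iterations.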

High limit in Sieve of Eratosthenes Algorithm for finding prime numbers makes the program stop working

I have used the Sieve of Eratosthenes algorithm to find the sum of the prime numbers under a certain limit. It worked correctly up to a limit of 2 million, but when I tried 3 million the program stopped working during execution. Here is the code:
int main(){
bool x[3000000];
unsigned long long sum = 0;
for(unsigned long long i=0; i< 3000000; i++)
x[i] = true;
x[0] = x[1] = false;
for(unsigned long long i = 2; i < 3000000; i++){
if(x[i]){
for (unsigned long long j = 2; i * j < 3000000; j++) {
x[j*i] = false;
}
sum += i;
}
}
printf("%ld", sum);
return 0;
}
Most likely bool x[3000000]; will result in a stack overflow, as it requires more memory than is typically available on the stack. As a quick fix, change this to:
static bool x[3000000];
or consider using dynamic memory allocation:
bool *x = malloc(3000000 * sizeof(*x));
// do your stuff
free(x);
Also note that your printf format specifier is wrong, since sum is declared as unsigned long long - change:
printf("%ld", sum);
to:
printf("%llu", sum);
If you're using a decent compiler (e.g. gcc) with warnings enabled (e.g. gcc -Wall ...), the compiler should already have warned you about this error.
One further tip: don't use hard-coded constants like 3000000. Use a symbolic constant so that you only have to define the value in one place; this is known as the "Single Point Of Truth" (SPOT) principle, or "Don't Repeat Yourself" (DRY):
const size_t n = 3000000;
and then use n wherever you are currently using 3000000. Note that in C a const variable is not a constant expression, so for the static array version the size has to be a #define or an enum constant instead.
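Putting these suggestions together, a heap-allocated version of the sieve might look like the sketch below. The function name sum_primes_below and the choice to return 0 on allocation failure are assumptions made for this example, not part of the original code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns the sum of all primes below limit (limit >= 2), or 0 if the
   allocation fails. sum_primes_below is a name made up for this sketch. */
unsigned long long sum_primes_below(size_t limit)
{
    assert(limit >= 2);
    /* calloc zero-fills, so every entry starts out as "not composite" */
    bool *composite = calloc(limit, sizeof *composite);
    if (composite == NULL)
        return 0;

    unsigned long long sum = 0;
    for (size_t i = 2; i < limit; i++) {
        if (!composite[i]) {
            sum += i;
            /* mark 2*i, 3*i, ... as composite, stepping by i */
            for (size_t j = 2 * i; j < limit; j += i)
                composite[j] = true;
        }
    }
    free(composite);
    return sum;
}
```

The limit is now passed in once instead of being repeated as a literal, and the result would be printed with the %llu specifier, for example printf("%llu\n", sum_primes_below(3000000));.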

Measuring execution time using rdtsc within kernel

I am writing code to measure the time consumed by a sequence of code in the kernel by loading it into the kernel as a module. I use the common rdtsc routine to calculate the time. The interesting thing is that a similar routine running in user mode gives normal values, whereas the result is always 0 when running in kernel mode, no matter how many lines of code I add to the time_count function. The calculation I use here is a common matrix product function, and the cycle count should increase rapidly as the matrix dimension increases. Can anyone point out the mistake in my code that prevents me from measuring the cycle count in the kernel?
#include <linux/init.h>
#include <linux/module.h>
int matrix_product(){
int array1[500][500], array2[500][500], array3[500][500];
int i, j, k, sum;
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
array1[i][j] = 5*i + j;
array2[i][j] = 5*i + j;
}
}
for(i = 0; i < 50000; i++){
for(j = 0; j < 50000; j++){
for(k = 0; k < 50000; k++)
sum += array1[i][k]*array2[k][j];
array3[i][j] = sum;
sum = 0;
}
}
return 0;
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long hi, lo;
__asm__ __volatile__ ("xorl %%eax,%%eax\ncpuid" ::: "%rax", "%rbx", "%rcx", "%rdx");
__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
return ((unsigned long long)lo) | (((unsigned long long)hi)<<32) ;
}
static int my_init(void)
{
unsigned long str, end, curr, best, tsc, best_curr;
long i, t;
#define time_count(codes) for(i=0; i<120000; i++){str=rdtsc(); codes; end=rdtsc(); curr=end-str; if(curr<best)best=curr;}
best = ~0;
time_count();
tsc = best;
best = ~0;
time_count(matrix_product());
best_curr = best;
printk("<0>matrix product: %lu ticks\n", best_curr-tsc);
return 0;
}
static void my_exit(void){
return;
}
module_init(my_init);
module_exit(my_exit);
Any help is appreciated! Thanks.
rdtsc is not guaranteed to be available on every CPU, to run at a constant rate, or to be consistent between different cores.
You should use a reliable and portable function like getrawmonotonic (ktime_get_raw_ts64 in newer kernels) unless you have special requirements for the timestamps.
If you really want to use cycles directly, the kernel already defines get_cycles and cpuid functions for this.

Resources