Concatenate private arrays after OpenMP

I am using OpenMP to parallelize the code below.
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
for (int i = 0; i < n; i++) {
    if (M[n*element + i] == 1 && element != i) {
        neighbors[pos] = i;
        pos++;
    }
}
Below is a version that works correctly. However, for large n it is slower than the serial code because of the shared variables neighbors and pos (every update has to go through the critical section).
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
#pragma omp parallel for num_threads(8)
for (int i = 0; i < n; i++) {
    if (M[n*element + i] == 1 && element != i) {
        #pragma omp critical
        {
            neighbors[pos] = i;
            pos++;
        }
    }
}
As a result, I decided to use private (thread-local) variables for neighbors and pos and to concatenate them into the global ones after the calculation. However, I am having trouble with the concatenation.
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;
#pragma omp parallel num_threads(8)
{
    int *temp_neighbors = malloc(num_of_neigh * sizeof(int));
    int temp_pos = 0;
    #pragma omp for
    for (int i = 0; i < n; i++) {
        if (M[n*element + i] == 1 && element != i) {
            temp_neighbors[temp_pos] = i;
            temp_pos++;
        }
    }
    // I want to concatenate the local temp_neighbors into the global neighbors here.
}
I tried the code below to achieve the concatenation, but neighbors only gets the last value of temp_neighbors, while the rest of the elements are 0.
#pragma omp critical
{
    memcpy(neighbors + pos, temp_neighbors, num_of_neigh * sizeof(neighbors));
    pos++;
}
So, the question is:
How can I concatenate private variables (and especially arrays) into a global one?
I searched a lot, but didn't find any proper answer. Thanks in advance and sorry for the big question.

The concatenation (the critical section at the end of the parallel region) should look like the following.
#pragma omp critical
{
    memcpy(neighbors + pos, temp_neighbors, temp_pos * sizeof(int));
    pos += temp_pos;
}
memcpy copies temp_pos elements (temp_pos * sizeof(int) bytes) to the first unused position in neighbors (neighbors + pos). Then pos is increased by temp_pos so that it again indexes the next unused position.
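Putting the pieces together, here is a minimal sketch of the whole pattern (it assumes the declarations from the question - M, n, element, num_of_neigh - and that num_of_neigh is an upper bound on the total number of neighbors; string.h is needed for memcpy):
int *neighbors = malloc(num_of_neigh * sizeof(int));
int pos = 0;

#pragma omp parallel num_threads(8)
{
    // Per-thread buffer; num_of_neigh is assumed to be a safe upper bound.
    int *temp_neighbors = malloc(num_of_neigh * sizeof(int));
    int temp_pos = 0;

    #pragma omp for nowait
    for (int i = 0; i < n; i++) {
        if (M[n*element + i] == 1 && element != i) {
            temp_neighbors[temp_pos++] = i;
        }
    }

    // Merge this thread's chunk into the shared array.
    #pragma omp critical
    {
        memcpy(neighbors + pos, temp_neighbors, temp_pos * sizeof(int));
        pos += temp_pos;
    }

    free(temp_neighbors);
}
Note that neighbors is no longer sorted by index, because the threads reach the critical section in an unpredictable order; if the original ordering matters, merge the per-thread chunks in thread order (e.g. with an ordered loop over the thread ids, as in one of the answers further down) or compute a prefix sum of the per-thread counts first.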

Related

How can I dynamically update the bound of a for loop with OpenMP in C?

I am building a parallel version of a program that I initially wrote serially. This function reads a file to create the input vector needed for the following operations. Since OpenMP particularly likes for loops, I converted what was initially a while loop into this "dynamic" for loop, where the upper bound (limit?) of the loop is continuously updated until the file has been read completely.
As serial code, this works with no issues, even on very big files (1M+ lines). However, when I add the OpenMP pragmas, it no longer works. From my troubleshooting it appears that with the omp pragmas the upper bound (vec_size) cannot be dynamically updated. Is there a work-around for this issue? Thank you in advance.
int readfile(double complex** vec, char* filepath)
{
FILE* filePointer;
char line[50] = {0};
filePointer = fopen(filepath,"r");
double complex* v;
int n=0, i=0, vec_size=VEC_SIZE; //number of elements in the vector
v = malloc(VEC_SIZE*sizeof(complex));
//read file and store elements in vec
float tmp;
#pragma omp parallel
{
#pragma omp for ordered
for(i=0; i<n; i++){
#pragma omp ordered
fgets(line, 50, filePointer);
printf("for %d\n", omp_get_thread_num());
if(n==(vec_size-1) && !feof(filePointer)){ //check if size is exceeded
printf("if %d\n", omp_get_thread_num());
vec_size=2*vec_size;
v = realloc(v, vec_size*sizeof(complex)); //double initial allocated space
printf("vec_size if %d\n", vec_size);
}
printf("vec_size for %d\n", vec_size);
sscanf(line,"%f",&tmp);
v[n] = tmp + 0*I;
n++;
}
}
fclose(filePointer);
*vec = v;
return n;
}
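For context, the usual way around this (a sketch only, with illustrative names): an omp for loop needs a trip count that is fixed when the loop is entered, and fgets has to consume the file sequentially anyway, so read serially into a growing buffer and parallelize only the per-element processing afterwards.
#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: read serially (fgets must consume the file in order anyway),
 * then post-process in parallel once the element count n is fixed. */
int readfile_sketch(double complex **vec, const char *filepath)
{
    FILE *fp = fopen(filepath, "r");
    if (fp == NULL) return -1;

    char line[50];
    int n = 0, vec_size = 1024;
    double complex *v = malloc(vec_size * sizeof *v);

    while (fgets(line, sizeof line, fp)) {
        if (n == vec_size) {                      /* grow the buffer as needed */
            vec_size *= 2;
            v = realloc(v, vec_size * sizeof *v);
        }
        float tmp = 0.0f;
        sscanf(line, "%f", &tmp);
        v[n++] = tmp + 0.0 * I;
    }
    fclose(fp);

    /* n is now a fixed trip count, so a plain omp for is legal. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        v[i] = 2.0 * v[i];        /* placeholder for the real per-element work */
    }

    *vec = v;
    return n;
}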

How to declare and malloc pointer inside OpenMP parallel region? (ERROR: Segment violation ('core' generated))

I am doing it like:
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
int * clusterMemberCount = (int *) calloc (numClusters,sizeof(int));
#pragma omp parallel
{
int ** localClusterMemberCount;
int * activeCluster;
#pragma omp single
{
localClusterMemberCount = (int **) malloc (omp_get_num_threads() * sizeof(int *));
//localClusterMemberCount[0] = (int *) calloc (omp_get_num_threads()*numClusters,sizeof(int));
for (int i = 0; i < omp_get_num_threads(); ++i) {
localClusterMemberCount[i] = calloc (numClusters,sizeof(int));
//localClusterMemberCount[i] = localClusterMemberCount[i-1] + numClusters;
}
activeCluster = (int *) calloc (omp_get_num_threads(),sizeof(int));
}
// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
// which cluster is it in?
activeCluster[omp_get_thread_num()] = clusterAssignmentCurrent[i];
// update count of members in that cluster
++localClusterMemberCount[omp_get_thread_num()][activeCluster[omp_get_thread_num()]];
// sum point coordinates for finding centroid
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[activeCluster[omp_get_thread_num()]*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
// now divide each coordinate sum by number of members to find mean/centroid
// for each cluster
for (int i = 0; i < numClusters; ++i) {
if (localClusterMemberCount[omp_get_thread_num()][i] != 0)
// for each numCoordsension
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[i*numCoords + j] /= localClusterMemberCount[omp_get_thread_num()][i]; /// XXXX will divide by zero here for any empty clusters!
}
// free memory
#pragma omp single
{
free (localClusterMemberCount[0]);
free (localClusterMemberCount);
free (activeCluster);
}
}
free(clusterMemberCount);
}
But I get the error: Segment violation ('core' generated), so I am doing something wrong, and I think the error is in the mallocing of the pointers, because the sequential code works fine. I have also tried the parallel code without mallocs (using global variables with atomic) and that works fine too. The error only appears when I try to create private pointers and malloc them.
Any idea how I can solve it?
Two reasons for the segfault:
localClusterMemberCount should be a shared variable declared outside of the parallel region and initialized within the parallel region by a single thread. Otherwise, each thread has its own copy of the variable, and for all but the thread that has gone through the single section, it points to a random memory location.
An implicit or explicit barrier is needed before the section of code where the pointers are freed. All threads need to be done for sure before the memory can be deallocated; otherwise one thread may free pointers still being used by other threads.
There are a few other issues with the code. See below with my own comments flagged with ***:
void calculateClusterCentroIDs(int numCoords, int numObjs, int numClusters, float * dataSetMatrix, int * clusterAssignmentCurrent, float *clustersCentroID) {
int * clusterMemberCount = (int *) calloc (numClusters,sizeof(int));
/* ***
* This has to be a shared variable that each thread can access
* If declared inside the parallel region, it will be a thread-local variable
* which is left un-initialized for all but one thread. Further attempts to access
* that variable will lead to segfaults
*/
int ** localClusterMemberCount;
#pragma omp parallel shared(localClusterMemberCount,clusterMemberCount)
{
// *** Make activeCluster a thread-local variable rather than a shared array (shared array will result in false sharing)
int activeCluster;
#pragma omp single
{
localClusterMemberCount = (int **) malloc (omp_get_num_threads() * sizeof(int *));
//localClusterMemberCount[0] = (int *) calloc (omp_get_num_threads()*numClusters,sizeof(int));
for (int i = 0; i < omp_get_num_threads(); ++i) {
localClusterMemberCount[i] = calloc (numClusters,sizeof(int));
//localClusterMemberCount[i] = localClusterMemberCount[i-1] + numClusters;
}
}
// sum all points
// for every point
for (int i = 0; i < numObjs; ++i) {
// which cluster is it in?
activeCluster = clusterAssignmentCurrent[i];
// update count of members in that cluster
++localClusterMemberCount[omp_get_thread_num()][activeCluster];
// sum point coordinates for finding centroid
// *** This may be slower in parallel because of the atomic operation
for (int j = 0; j < numCoords; ++j)
#pragma omp atomic
clustersCentroID[activeCluster*numCoords + j] += dataSetMatrix[i*numCoords + j];
}
/* ***
* Missing: one reduction step
* The global cluster member count needs to be updated
* one option is below :
*/
#pragma omp critical
for (int i=0; i < numClusters; ++i) clusterMemberCount[i] += localClusterMemberCount[omp_get_thread_num()][i];
#pragma omp barrier // wait here before moving on
// *** The code below was wrong; to compute the average, coordinates should be divided by the global count
// *** Successive divisions by the local counts will fail. Like, 1/(4+6) is not the same as (1/4)/6
// now divide each coordinate sum by number of members to find mean/centroid
// for each cluster
#pragma omp for
for (int i = 0; i < numClusters; ++i) {
if (clusterMemberCount[i] != 0)
// for each numCoordsension
#pragma omp simd //not sure this will help, the compiler may already vectorize that
for (int j = 0; j < numCoords; ++j)
clustersCentroID[i*numCoords + j] /= clusterMemberCount[i]; /// XXXX will divide by zero here for any empty clusters!
// *** ^^ atomic is not needed
// *** only one thread will access each value of clusterCentroID
}
#pragma omp barrier
/* ***
* A barrier is needed otherwise the first thread arriving there will start to free the memory
* Other threads may still be in the previous loop attempting to access localClusterMemberCount
* If the pointer has been freed already, this will result in a segfault
*
* With the corrected code, the implicit barrier at the end of the distributed
* for loop would be sufficient. With your initial code, an explicit barrier
* would have been needed.
*/
// free memory
#pragma omp single
{
// *** Need to free all pointers and not only the first one
for (int i = 0; i < omp_get_num_threads(); ++i) free (localClusterMemberCount[i]);
free (localClusterMemberCount);
}
}
free(clusterMemberCount);
}
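On a compiler that supports OpenMP 4.5 array-section reductions, the per-thread count arrays and the manual merge above can also be replaced by reduction clauses; a rough sketch of the accumulation step only (the averaging loop stays the same, and this assumes numClusters*numCoords is small enough for per-thread copies):
/* Sketch (OpenMP 4.5+): array-section reductions replace the per-thread
 * localClusterMemberCount arrays and the critical-section merge. */
#pragma omp parallel for reduction(+:clusterMemberCount[:numClusters]) \
                         reduction(+:clustersCentroID[:numClusters*numCoords])
for (int i = 0; i < numObjs; ++i) {
    int c = clusterAssignmentCurrent[i];
    ++clusterMemberCount[c];
    for (int j = 0; j < numCoords; ++j)
        clustersCentroID[c*numCoords + j] += dataSetMatrix[i*numCoords + j];
}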

How to synchronize 3 nested loops in OpenMP?

I am writing a program that matches up one block (a group of 4 double numbers which are within a certain absolute value) with another.
Essentially, I call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of a block happens: create_Block(rrr[k], i);).
It is OK to ignore all the function details, as they work well in the serial version. The only focus here is the OpenMP directives.
int main(void) {
readKey("keys.txt");
double** jz = readMatrix("data.txt");
int j = 0;
int i = 0;
int k = 0;
#pragma omp parallel for firstprivate(i) shared(Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (i = 0; i < 50; i++) {
printf("THIS IS COLUMN %d\n", i);
double*c = readCol(jz, i, 4400);
#pragma omp parallel for firstprivate(j) shared(i,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (j=0; j < 4400; j++) {
// printf("This is fixed row %d from column %d !!!!!!!!!!\n",j,i);
int* one_collection = collection(c, j, 4400);
// MODIFY THE DYMANIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
//GET THE 2D-ARRAY OF COMBINATION
int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
#pragma omp parallel for firstprivate(k) shared(i,j,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
create_Block(rrr[k], i); //ACTUAL CREATION OF BLOCK !!!!!!!
printf("This is block %d \n", NUM_OF_BLOCK);
add_To_Block_Collection();
}
free(rrr);
}
free(one_collection);
}
//OpenMP for j
free(c);
}
// OpenMP for i
collision();
}
Here is the parallel version's result: non-deterministic.
Whereas the serial result is constant: 400 blocks.
Big_Block, NUM_OF_BLOCK and SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused such a problem?
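A sketch of one common way to remove the race (only a sketch: it assumes, as the question implies, that create_Block and add_To_Block_Collection mutate the shared globals, and it parallelizes a single loop level, since the nested omp parallel for directives create no extra parallelism unless nested parallelism is enabled):
/* Sketch: parallelize only the outer loop; every update of the shared
 * globals must be serialized. */
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < 50; i++) {
    double *c = readCol(jz, i, 4400);
    for (int j = 0; j < 4400; j++) {
        int *one_collection = collection(c, j, 4400);
        if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
            int **rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
            for (int k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
                #pragma omp critical
                {
                    create_Block(rrr[k], i);      /* touches Big_Block, NUM_OF_BLOCK */
                    add_To_Block_Collection();
                }
            }
            free(rrr);
        }
        free(one_collection);
    }
    free(c);
}
If collection() or combNonRec() also write to shared globals such as SIZE_OF_COLLECTION, those calls need the same protection (or per-thread copies), otherwise the result will still be non-deterministic.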

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread variables to get around that.
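As a minimal illustration of one possible fix (an alternative to per-thread buffers: make the index update itself atomic with OpenMP 3.1's atomic capture, so each thread claims a unique slot inside the if block of the original loop):
int slot;
#pragma omp atomic capture
slot = j++;          // atomically read the old index and increment it
NUMS[slot] = i;      // each thread now writes to its own slot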
In your code j is a shared inductive variable. You can't rely on using shared inductive variables efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution not using inductive variables (for example using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15) or you could find a general solution using private inductive variables like this
#pragma omp parallel
{
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    #pragma omp for schedule(static) ordered
    for(i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        {
            memcpy(&NUMS[j], NUMS_local, sizeof *NUMS * k);
            j += k;
        }
    }
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could implement in C using realloc, but I'm not going to do that for you.
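For illustration only, a rough sketch of what such a realloc-based growable array could look like in C (the type and function names are made up):
#include <stdlib.h>

/* Sketch of a tiny growable int array, in the spirit of std::vector. */
typedef struct {
    int *data;
    size_t size, capacity;
} IntVec;

static void intvec_push(IntVec *v, int value)
{
    if (v->size == v->capacity) {                 /* grow geometrically */
        v->capacity = v->capacity ? 2 * v->capacity : 16;
        v->data = realloc(v->data, v->capacity * sizeof *v->data);
    }
    v->data[v->size++] = value;
}
Each thread would push into its own IntVec (initialized to {NULL, 0, 0}), and the per-thread vectors would then be merged into NUMS exactly as in the memcpy loop above.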
Edit:
Here is a special solution which does not use shared inductive variables using wheel factorization
int wheel[] = {0,3,5,6,9,10,12};   // each block of 15 integers contains exactly these 7 multiples of 3 or 5
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
    for(int k=0; k<7; k++) {
        NUMS[7*i + k] = 15*i + wheel[k];
        total += NUMS[7*i + k];
    }
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
    if(i%5==0 || i%3==0) {
        NUMS[j++] = i;
        total += i;
    }
}
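A quick serial reference to compare either version against (for MAX = 1000 the expected total is 233168):
/* Serial reference: the sum of all multiples of 3 or 5 below 1000 is 233168. */
unsigned int check = 0;
for (int i = 1; i < MAX; i++)
    if (i % 3 == 0 || i % 5 == 0)
        check += i;
printf("reference total: %u\n", check);    /* prints 233168 when MAX == 1000 */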
Edit: It's possible to do this without the serialization that comes from the ordered clause. This does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total = 0, j;
#pragma omp parallel
{
    int i;
    int nthreads = omp_get_num_threads();
    int ithread  = omp_get_thread_num();
    #pragma omp single
    {
        prefix = malloc(sizeof *prefix * (nthreads+1));
        prefix[0] = 0;
    }
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    prefix[ithread+1] = k;
    #pragma omp barrier
    #pragma omp single
    {
        for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
        NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
        j = prefix[nthreads];
    }
    memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS * k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to ensure the atomicity of the desired operation (incrementing the value of variable j in your case). It would be a mutex, semaphore or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow logic should be like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (i.e. InterlockedIncrement or CreateCriticalSection, etc.). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all those synchronization mechanisms stem from the concept of semaphores, which was invented by Edsger W. Dijkstra in the beginning of the 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the value in order, and you end up overwriting the wrong data.
There is a way to tell the threads in a loop to perform a summation when they finish, using the OpenMP options (a reduction). You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0; k<num; k++)
{
    sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin, NULL);
All you have to do is write the result into "sum"; this is from some old code of mine that does a summation.
The other option you have is the dirty one: somehow make the threads wait and get in order using a call to the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for(i = ID + 1; i < MAX; i += NUM_THREADS)
{
    printf("asdasdasdasdasdasdasdas");
    if( i % 5 == 0 || i % 3 == 0)
    {
        NUMS[j++] = i; //Store Multiples of 3 and 5 in an array to sum up later
    }
}
But I recommend you read through the OpenMP options fully.

What to heed, when reading an array from multiple threads?

I'd like to get to know OpenMP a bit, because I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values - functions serial, parallel -,
the second takes values from an input array and writes the processed values to an output array - functions serial_a, parallel_a.
My questions are:
Why do the results of the first test differ, i.e. the results of serial and parallel?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) which may possibly be read by multiple threads; in the example this should be emulated by sin(a[i]) / cos(a[n-i]).
Are there some easy rules for determining which variables to declare as private, shared and reduction?
In both cases int i is declared outside the pragma, however the second test appears to yield correct results. So is that okay, or does i have to be moved into the pragma omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared when you use parallel. Private and shared variables should also be declared when you use parallel. But when you do a reduction you should not declare the variable that is being reduced as shared. The reduction will take care of this.
To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not being written to then normally you want it to be shared. In your case the variable x does not change, so you should declare it shared. The variable i, however, does change, so normally you should declare it private. So to fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
    #pragma omp for
    for(i = 0; i<x; i++){
        sum += sin(x*i) / cos(x*i+1.0);
    }
}
However, OpenMP automatically makes the iterator of a parallel for region private and variables declared outside of parallel regions are shared by default so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
    sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
    y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        //
    }
}
If you use #pragma omp parallel for with that, the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
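For example, a minimal sketch of the corrected double loop:
#pragma omp parallel for private(j)    // or simply declare j inside the inner loop
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        // j is now private to each thread
    }
}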
