Can I assume that OpenMP shared variables are read and written atomically?

For instance, assume I have a variable that cannot be accessed by the underlying processor in one instruction (e.g. 64 bit integer on a 32 bit architecture).
// let x, y, z be of the same integral type, wider than the machine word
#pragma omp parallel shared(x), private(y,z)
{
    y = ...;
    z = ...;
    if (x == y)
        x = z;
}
While there could be races between the if statement and the actual assignment, could half of x be read before a context switch, and the other half afterwards? Or is it guaranteed that read and write access to shared variables always happens atomically? I cannot find any statements regarding this in the standard.

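No and no. OpenMP does not guarantee that an access to a shared variable is atomic: a single access may be implemented with several load or store instructions, so a torn read or write of x is possible. As written, this code contains a data race.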
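One way to remove the race is to make the comparison and the assignment a single critical section. The following is only a sketch: set_if_equal and demo are hypothetical names, and int64_t stands in for "a type wider than the machine word". For an isolated read or write, #pragma omp atomic read / write would be a lighter alternative, but it cannot cover the compare-then-assign pair.

```c
#include <stdint.h>

/* Hypothetical helper: compare and assign as one indivisible step.
   The critical section serializes all threads that reach it. */
void set_if_equal(int64_t *x, int64_t y, int64_t z)
{
    #pragma omp critical
    {
        if (*x == y)
            *x = z;
    }
}

/* Hypothetical driver: every thread tries to replace 0 with 42. */
int64_t demo(void)
{
    int64_t x = 0;
    #pragma omp parallel shared(x)
    {
        set_if_equal(&x, 0, 42);
    }
    return x;
}
```

Compiled without OpenMP support the pragmas are ignored and the code simply runs on one thread, so the result is the same either way.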


How to implement input independent logical shift in software?

I'm trying to implement AES/DES/.. encryption/decryption in software without using any input dependent operations (specifically only using constant time not, and, or, xor operations and input independent array indexing/loops).
Is there any way to implement input independent logical shift (someconst << key[3] & 5 etc.)?
Array indexing with an input dependent variable, hardware shifts with an input dependent n, and input dependent conditional jumps must all be avoided. I don't care about code size or speed.
Depending on your requirements and which operations you can assume to be constant time, this code needs some additional modifications.
However, it might point you in the right direction (as the SELECT primitive is quite powerful for side-channel free code):
#include <stdint.h>

#define MAX_SHIFT 32 // maximum amount to be shifted
// this may not be constant time.
// However, you can find different (more ugly) ways to achieve the same thing.
// 1 -> 0
// 0 -> 0xff...
#define MASK(cond) ((uint32_t)(cond) - 1u)
// again, make sure everything here is constant time according to your threat model
// (0, x, y) -> y
// (i, x, y) -> x (i != 0)
#define SELECT(cond, A, B) ((MASK(!(cond)) & (A)) | (MASK(!!(cond)) & (B)))
// unsigned types avoid undefined behaviour when bits are shifted out
uint32_t shift(uint32_t value, uint32_t amount) {
    uint32_t result = value;
    for (uint32_t i = 0; i <= MAX_SHIFT; i++) {
        // keep result unless i == amount, in which case take the current value
        result = SELECT(i ^ amount, result, value);
        // this may not be constant time. If it is not, implement it yourself ;)
        value <<= 1;
    }
    return result;
}
Note, however, that you have to make sure the compiler does not optimize this into input dependent code.
CPUs may also employ operand-dependent performance optimizations that lead to timing differences.
In addition, transient execution attacks like Spectre may be a threat.
In conclusion: it is almost impossible to write side-channel free code.
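As a variation on the same masking idea, the shift amount can be decomposed into its bits so that only shifts by fixed powers of two are ever executed (a barrel shifter in software). This is a sketch, not a vetted implementation: ct_select and ct_shl are hypothetical names, the amount is assumed to be below 64, and shifts by a constant are assumed to be constant time on the target.

```c
#include <stdint.h>

/* Hypothetical helper: returns a if bit == 1, b if bit == 0, without branching. */
static uint32_t ct_select(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = 0u - bit;          /* 1 -> 0xffffffff, 0 -> 0 */
    return (a & mask) | (b & ~mask);
}

/* Logical left shift by a secret amount (assumed < 64). */
uint32_t ct_shl(uint32_t value, uint32_t amount)
{
    uint32_t result = value;
    /* Bits 0..4 of amount select shifts by 1, 2, 4, 8 and 16. */
    for (unsigned k = 0; k < 5; k++) {
        uint32_t bit = (amount >> k) & 1u;
        result = ct_select(bit, result << (1u << k), result);
    }
    /* Bit 5 set means a shift of 32 or more: the result is zero. */
    return ct_select((amount >> 5) & 1u, 0u, result);
}
```

The loop counter k is public, so the only secret-dependent operations are and, or, not and the subtraction that builds the mask. The caveats about compilers and CPUs apply here as well.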

At a sequence point all previous accesses to volatile objects have stabilized

From GNU document about volatile:
The minimum requirement is that at a sequence point all previous
accesses to volatile objects have stabilized and no subsequent
accesses have occurred
Ok, so we know what sequence points are, and we now know how volatile behaves with respect to them in gcc.
So, naively I would look at the following program:
volatile int x = 0;
int y = 0;
x = 1; /* sequence point at the end of the assignment */
y = 1; /* sequence point at the end of the assignment */
x = 2; /* sequence point at the end of the assignment */
And I would apply the GNU requirement in the following way:
At a sequence point (the end of y = 1) the access to the volatile x = 1 has stabilized and the subsequent access x = 2 has not yet occurred.
But that is just wrong, because the non-volatile y = 1 can be reordered across sequence points. For example, y = 1 can actually be performed before x = 1 and x = 2, and furthermore it can be optimised away (without violating the as-if rule).
So I am very eager to know how I can apply the GNU requirement properly. Is there a problem with my understanding? Is the requirement worded incorrectly?
Maybe should the requirement be written as something like:
The minimum requirement is that at a sequence point WHICH HAS A SIDE EFFECT
all previous accesses to volatile objects have stabilized
and no subsequent accesses have occurred
Or as pmg elegantly suggested in the comment:
The minimum requirement is that at a sequence point all UNSEQUENCED previous accesses to volatile objects have
stabilized and no subsequent accesses have occurred
so that we could only apply it at the sequence points at the end of x = 1; and at the end of x = 2;, where it is definitely true that previous accesses to volatile objects have stabilized and no subsequent accesses have occurred?

ARM Neon in C: How to combine different 128bit data types while using intrinsics?

TL;DR
For arm intrinsics, how do you feed a 128bit variable of type uint8x16_t into a function expecting uint16x8_t?
EXTENDED VERSION
Context: I have a greyscale image, 1 byte per pixel. I want to downscale it by a factor of 2. For each 2x2 input box, I want to take the minimum pixel. In plain C, the code will look like this:
for (int y = 0; y < rows; y += 2) {
    uint8_t* p_out = outBuffer + (y / 2) * outStride;
    uint8_t* p_in = inBuffer + y * inStride;
    for (int x = 0; x < cols; x += 2) {
        *p_out = min(min(p_in[0], p_in[1]), min(p_in[inStride], p_in[inStride + 1]));
        p_out++;
        p_in += 2;
    }
}
Where both rows and cols are multiples of 2. I call "stride" the step in bytes that it takes to go from one pixel to the pixel immediately below it in the image.
Now I want to vectorize this. The idea is:
1. take 2 consecutive rows of pixels
2. load 16 bytes into a from the top row, and load the 16 bytes immediately below into b
3. compute the byte-by-byte minimum of a and b. Store it in a.
4. create a copy of a shifted right by 1 byte (8 bits). Store it in b.
5. compute the byte-by-byte minimum of a and b. Store it in a.
6. store every second byte of a in the output image (discarding half of the bytes)
I want to write this using Neon intrinsics. The good news is that for each step there exists an intrinsic that matches it.
For example, at point 3 one can use:
uint8x16_t vminq_u8(uint8x16_t a, uint8x16_t b);
And at point 4 one can use one of the following, with a shift of 8 bits:
uint16x8_t vrshrq_n_u16(uint16x8_t a, __constrange(1,16) int b);
uint32x4_t vrshrq_n_u32(uint32x4_t a, __constrange(1,32) int b);
uint64x2_t vrshrq_n_u64(uint64x2_t a, __constrange(1,64) int b);
That's because I do not care what happens to bytes 1, 3, 5, 7, 9, 11, 13 and 15: they will be discarded from the final result anyway. (The correctness of this has been verified and it's not the point of the question.)
HOWEVER, the output of vminq_u8 is of type uint8x16_t, which is NOT compatible with the shift intrinsics that I would like to use. In C++ I addressed the problem with this templated data structure, but I have been told that the problem cannot be reliably addressed using a union (Edit: although that answer refers to C++; in C, type punning through a union IS allowed), nor by casting through pointers, because that breaks the strict aliasing rule.
What is the way to combine different data types while using ARM Neon intrinsics?
For this kind of problem, arm_neon.h provides the vreinterpret{q}_dsttype_srctype casting operator.
In some situations, you might want to treat a vector as having a
different type, without changing its value. A set of intrinsics is
provided to perform this type of conversion.
So, assuming a and b are declared as:
uint8x16_t a, b;
Your point 4 can be written as (*):
b = vreinterpretq_u8_u16(vrshrq_n_u16(vreinterpretq_u16_u8(a), 8));
However, note that unfortunately this does not cover data types that are arrays of vector types; see ARM Neon: How to convert from uint8x16_t to uint8x8x2_t?
(*) It should be said that this is much more cumbersome than the equivalent (in this specific context) SSE code, as SSE has only one 128 bit integer data type (namely __m128i):
__m128i b = _mm_srli_si128(a,1);
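Putting the steps from the question together for one 16-pixel block could look like the sketch below. min2x2_block16 is a hypothetical name, and a little-endian target is assumed, so that the low byte of each 16-bit lane is the even-indexed pixel. The sketch also swaps in the plain shift vshrq_n_u16 for vrshrq_n_u16: as far as I know the vrshrq_n_* forms are the rounding variants, which add a rounding constant before shifting. A scalar fallback with the same result is included so the code also builds where arm_neon.h is not available.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: reduces two 16-pixel rows to 8 output pixels,
   each one the minimum of a 2x2 box. Assumes a little-endian target. */
void min2x2_block16(const uint8_t *top, const uint8_t *bottom, uint8_t *out)
{
#if defined(__ARM_NEON)
    /* Load both rows and take the vertical minimum. */
    uint8x16_t a = vminq_u8(vld1q_u8(top), vld1q_u8(bottom));
    /* View as 8 x u16, shift each lane right by 8 bits, view as bytes again. */
    uint8x16_t b = vreinterpretq_u8_u16(vshrq_n_u16(vreinterpretq_u16_u8(a), 8));
    /* Horizontal minimum: even bytes now hold the 2x2 minimum. */
    a = vminq_u8(a, b);
    /* Keep the low byte of every 16-bit lane and store 8 pixels. */
    vst1_u8(out, vmovn_u16(vreinterpretq_u16_u8(a)));
#else
    /* Portable fallback computing the same 8 values. */
    for (size_t i = 0; i < 8; i++) {
        uint8_t m = top[2 * i];
        if (top[2 * i + 1] < m)    m = top[2 * i + 1];
        if (bottom[2 * i] < m)     m = bottom[2 * i];
        if (bottom[2 * i + 1] < m) m = bottom[2 * i + 1];
        out[i] = m;
    }
#endif
}
```

Using vmovn_u16 for the last step is a design choice of this sketch: it narrows each 16-bit lane to its low byte, which is exactly "every second byte" on a little-endian machine.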

What would the minimum and maximum value of int x be in this case?

I'm revising for my mid-term exam in concurrent programming in C and I'm stuck on this question.
Say you have the following loop:
int x = 20;
for (int i = -3; i <= 7; i++) {
    x -= 2;
}
On a single-processor machine, what are the minimum and maximum possible values of the variable int x after the loop has been executed simultaneously by 5 threads?
EDIT : x is of course a global variable shared by all the threads.
The behaviour of the program is undefined due to the potential for simultaneous read and write to x.
Access to x needs to be controlled by mutual exclusion, or steps need to be taken to ensure that x -= 2 is atomic. Only then can we talk about the possible values that x can take.
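As an illustration of the second option, here is a minimal sketch that makes x -= 2 atomic with C11 <stdatomic.h>. The worker function and its pthread-style signature are assumptions for the example; thread creation is left out. Once every update is atomic, each of the 5 threads contributes exactly 11 decrements of 2, so any interleaving ends at 20 - 5 * 11 * 2 = -90.

```c
#include <stdatomic.h>
#include <stddef.h>

atomic_int x = 20;   /* shared by all threads */

/* Hypothetical thread body; the signature matches what pthread_create expects. */
void *worker(void *arg)
{
    (void)arg;
    for (int i = -3; i <= 7; i++)      /* 11 iterations */
        atomic_fetch_sub(&x, 2);       /* indivisible read-modify-write */
    return NULL;
}
```

A mutex around x -= 2 would give the same final value; the atomic version just avoids the explicit lock.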

Stack concern : Local variables vs Arithmetics

I have a question regarding whether it is good to save arithmetic computations to limit the stack usage.
Let's say I have a recursive function like this one :
void foo (unsigned char x, unsigned char z) {
    if (!x || !z)
        return;
    // Do something
    for (unsigned char i = 0; i < 100; ++i) {
        foo(x - 1, z);
        foo(x, z - 1);
    }
}
The main thing to see here are the x - 1 and z - 1 evaluated each time in the loop.
To increase performance, I would do something like this :
const unsigned char minus_x = x - 1;
const unsigned char minus_z = z - 1;
for (unsigned char i = 0; i < 100; ++i) {
    foo(minus_x, z);
    foo(x, minus_z);
}
But doing this means that on each call, minus_x and minus_z are saved on the stack. The recursive function might be called thousands of times, which means thousands of bytes used on the stack. Also, the maths involved isn't as simple as a -1.
Is this a good idea ?
Edit : It is actually useless, since it is a pretty standard optimization for compilers : Loop-invariant code motion (see HansPassant's comment)
Would it be a better idea to use a static array containing the precomputed values, like:
static const unsigned char minuses[256] = {/* 0 for x = 0; x - 1 for x = 1 to 255 */};
and then do :
foo(minuses[x], z);
foo(x, minuses[z]);
This approach greatly reduces the actual maths needed, but on each call it has to fetch the cell from the array instead of reading the value from a register.
I am trying to benchmark as much as I can to find the best solution, but if there is a best practice or something I am missing here, please let me know.
FWIW, I tried this with gcc, for two functions foo_1() (no extra variables) and foo_2() (extra variables).
With -O3 gcc unrolled the for loop (!), the two functions were exactly the same size, but not quite the same code. I regret I don't have time to work out how and why they differed.
With -O2 gcc generated exactly the same code for foo_1 and foo_2. As one might expect it allocated a register to x, z, x-1, z-1 and i, and pushed/popped those to preserve the parent's values -- using 6 x 8 (64-bit machine) bytes of stack for each call (including the return address).
You report 24 bytes of stack used... is that a 32-bit machine?
With -O0, the picture was different, foo_1 did the x-1 and z-1 each time round the loop, and in both cases the variables were held in memory. foo_1 was slightly shorter and I suspect that the subtraction makes no difference on a modern processor ! In this case, foo_1 and foo_2 used the same amount of stack. This is because all the variables in foo are unsigned char, and the extra minus_x and minus_z pack together with the i, using space which is otherwise padding. If you change minus_x and minus_z to unsigned long long, you get a difference. Curiously, foo_1 used 6 x 8 bytes of stack as well. There were 16 unused bytes in the stack frame, so even taking into account aligning the RSP and the RBP to 16 byte boundaries, it appears to be using more than it needs to... I have no idea why.
I had a quick look at a static array of x - 1. For -O0 it made no difference to the stack use (for the same reason as before). For -O2, it took one look at foo(x, minuses[z]); and hoisted the minuses[z] out of the loop ! Which one ought to have expected... and the stack use stayed the same (at 6 x 8).
More generally, as noted elsewhere, any effective amount of optimisation is going to hoist calculations out of loops where it can. The other thing which is going on is heavy use of registers to hold variables -- both real variables (those you have named) and "pseudo" variables (to hold the pre-calculated result of hoisted stuff). Those registers need to be saved across calls of subroutines -- either by the caller or the callee. The x86 push/pop operate on the entire register, so an unsigned char held in a register is going to need a full 8 or 4 (64-bit or 32-bit mode) bytes of stack. But, hey, that's what you pay for the optimisation !
It's not entirely clear to me whether it's the run-time or the stack use which you are most concerned about. Either way, the message is to leave it up to the compiler and worry if and only if the thing is too slow, and then only worry about the bits which profiling shows are a problem !
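For anyone who wants to repeat the comparison, the two variants might look like the sketch below. This is a reconstruction, not the exact code I compiled: the call counters calls_1 and calls_2 are hypothetical stand-ins for "Do something". Compiling with gcc -O2 -S (or -O0, -O3) and comparing the generated assembly for the two functions shows the stack use directly.

```c
unsigned long calls_1, calls_2;   /* hypothetical stand-ins for real work */

/* Variant without extra locals: x - 1 and z - 1 written inside the loop. */
void foo_1(unsigned char x, unsigned char z)
{
    calls_1++;
    if (!x || !z)
        return;
    for (unsigned char i = 0; i < 100; ++i) {
        foo_1(x - 1, z);
        foo_1(x, z - 1);
    }
}

/* Variant with the subtractions hoisted into named locals by hand. */
void foo_2(unsigned char x, unsigned char z)
{
    calls_2++;
    if (!x || !z)
        return;
    const unsigned char minus_x = x - 1;
    const unsigned char minus_z = z - 1;
    for (unsigned char i = 0; i < 100; ++i) {
        foo_2(minus_x, z);
        foo_2(x, minus_z);
    }
}
```

Both variants make the same calls in the same order, so any difference in stack use comes from the compiler, not from the algorithm.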
