How does lockless incrementing work in LDD3's "shortprint"?


I am having a hard time understanding how shortp_incr_bp() works. How is it able to atomically increment without the need for a spinlock or semaphore? (I don't really understand the provided comments.) What could happen if barrier() wasn't there? How does the optimization cause incorrect values? What is one way this optimization can go wrong?
/*
* Input is managed through a simple circular buffer which, among other things,
* is allowed to overrun if the reader isn't fast enough. That makes life simple
* on the "read" interrupt side, where we don't want to block.
*/
static unsigned long shortp_in_buffer = 0;
static unsigned long volatile shortp_in_head;
static volatile unsigned long shortp_in_tail;
/*
* Atomically increment an index into "shortp_in_buffer"
*
* This function has been carefully written to wrap a pointer into the circular
* buffer without ever exposing an incorrect value. The "barrier" call is there
* to block compiler optimizations across the other two lines of the function.
* Without the barrier, the compiler might decide to optimize out the "new"
* variable and assign directly to "*index". That optimization could expose an
* incorrect value of the index for a brief period in the case where it wraps.
* By taking care to prevent an inconsistent value from ever being visible to
* other threads, we can manipulate the circular buffer pointers safely without
* locks.
*/
static inline void shortp_incr_bp(volatile unsigned long *index, int delta)
{
unsigned long new = *index + delta;
barrier(); /* Don't optimize these two together */
*index = (new >= (shortp_in_buffer + PAGE_SIZE)) ? shortp_in_buffer : new;
}
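One way to read the comment is to compare the function with the merged form it warns about. The sketch below is a user-space illustration, not the driver code: `BUF_START`, `BUF_SIZE`, `incr_merged`, `incr_single_store` and `compiler_barrier` are hypothetical names, the barrier uses GCC/Clang inline-assembly syntax, and it assumes each index has exactly one writer, so the only requirement is that other contexts never observe an out-of-range value.

```c
#include <assert.h>

/* Hypothetical stand-ins for the driver's buffer: valid index values
 * lie in the range [BUF_START, BUF_START + BUF_SIZE). */
#define BUF_START 1000UL
#define BUF_SIZE  16UL

/* Compiler-only barrier in GCC/Clang syntax (an assumption; the kernel
 * supplies its own barrier() macro). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* The merged form the comment warns about: the index itself is bumped
 * first, so between the two stores it can hold an out-of-range value. */
static void incr_merged(volatile unsigned long *index, int delta)
{
    *index += delta;                        /* may briefly be >= end */
    if (*index >= BUF_START + BUF_SIZE)
        *index = BUF_START;                 /* corrected only afterwards */
}

/* The shortprint pattern: compute in a local, publish with one store. */
static void incr_single_store(volatile unsigned long *index, int delta)
{
    unsigned long next = *index + delta;    /* private, not yet visible */
    compiler_barrier();
    *index = (next >= BUF_START + BUF_SIZE) ? BUF_START : next;
}
```

Both functions leave the same final value. The difference is only in what an interrupt handler or another CPU could see in between: the single-store form writes the index once, and only with a value that is already in range.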

Related

Read optimizations on shared memory

Suppose you have a function that makes several read accesses to a shared variable whose access is atomic. Everything runs in the same process: think of threads of one process, or of software running on a bare-metal platform with no MMU.
As a requirement, the value read must stay consistent for the whole length of the function, so the code must not re-read the memory location; it has to keep the value in a local variable or a register. How can we ensure that this behaviour is respected?
As an example...
shared is the only shared variable
extern uint32_t a, b, shared;
void useless_function()
{
__ASM volatile ("":::"memory");
uint32_t value = shared;
a = value *2;
b = value << 3;
}
Can value be optimized away and replaced by direct reads of the shared variable in some contexts? If so, how can I be sure this cannot happen?

As a requirement, the value read must stay consistent for the whole length of the function, so the code must not re-read the memory location; it has to keep the value in a local variable or a register. How can we ensure that this behaviour is respected?
You can do that with READ_ONCE macro from Linux kernel:
/*
* Prevent the compiler from merging or refetching reads or writes. The
* compiler is also forbidden from reordering successive instances of
* READ_ONCE and WRITE_ONCE, but only when the compiler is aware of some
* particular ordering. One way to make the compiler aware of ordering is to
* put the two invocations of READ_ONCE or WRITE_ONCE in different C
* statements.
*
* These two macros will also work on aggregate data types like structs or
* unions. If the size of the accessed data type exceeds the word size of
* the machine (e.g., 32 bits or 64 bits) READ_ONCE() and WRITE_ONCE() will
* fall back to memcpy(). There's at least two memcpy()s: one for the
* __builtin_memcpy() and then one for the macro doing the copy of variable
* - '__u' allocated on the stack.
*
* Their two major use cases are: (1) Mediating communication between
* process-level code and irq/NMI handlers, all running on the same CPU,
* and (2) Ensuring that the compiler does not fold, spindle, or otherwise
* mutilate accesses that either do not require ordering or that interact
* with an explicit memory barrier or atomic instruction that provides the
* required ordering.
*/
E.g.:
uint32_t value = READ_ONCE(shared);
The READ_ONCE macro essentially casts the object you read to volatile, and the compiler cannot emit extra reads or writes for volatile objects.
The above is equivalent to:
uint32_t value = *(uint32_t volatile*)&shared;
Alternatively:
uint32_t value;
memcpy(&value, &shared, sizeof value);
memcpy breaks the dependency between shared and value, so that the compiler cannot re-load shared instead of loading value.
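Outside the kernel, the volatile-cast form can be wrapped in a small macro. This is a minimal sketch: `READ_ONCE_U32` and `use_snapshot` are hypothetical names, the macro is restricted to `uint32_t` for simplicity, and `a`, `b` and `shared` are defined here only so that the example is complete.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's READ_ONCE(),
 * limited to uint32_t: the load goes through a volatile lvalue, so the
 * compiler must perform it exactly once where it is written. */
#define READ_ONCE_U32(x) (*(const volatile uint32_t *)&(x))

uint32_t a, b, shared;

void use_snapshot(void)
{
    uint32_t value = READ_ONCE_U32(shared);  /* the only read of shared */
    a = value * 2;   /* both results derive from the same snapshot */
    b = value << 3;
}
```

Because `value` is an ordinary local, the compiler remains free to keep it in a register; only the access to `shared` is constrained.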

In the example given you are not using the variable value in the function at all. So it will definitely be optimised.
Also, as mentioned in comments, in a multitasking system, the value of shared can be changed within the function.
What I need is that shared is read only once and its local value is kept for the whole function, not re-evaluated
I would suggest something like this below.
extern uint32_t a, b, shared;
void useless_function()
{
__ASM volatile ("":::"memory");
uint32_t value = shared;
a = value*2;
b = value << 3;
}
Here shared is read only once in the function. It will be read again on next call of the function.

Clear variable on the stack

Code Snippet:
int secret_foo(void)
{
int key = get_secret();
/* use the key to do highly privileged stuff */
....
/* Need to clear the value of key on the stack before exit */
key = 0;
/* Any half decent compiler would probably optimize out the statement above */
/* How can I convince it not to do that? */
return result;
}
I need to clear the value of a variable key from the stack before returning (as shown in the code).
In case you are curious, this was an actual customer requirement (embedded domain).

You can use volatile (emphasis mine):
Every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated by a sequence point from the volatile access.
volatile int key = get_secret();

volatile might be overkill sometimes as it would also affect all the other uses of a variable.
Use memset_s (since C11): http://en.cppreference.com/w/c/string/byte/memset
memset may be optimized away (under the as-if rules) if the object modified by this function is not accessed again for the rest of its lifetime. For that reason, this function cannot be used to scrub memory (e.g. to fill an array that stored a password with zeroes). This optimization is prohibited for memset_s: it is guaranteed to perform the memory write.
int secret_foo(void)
{
int key = get_secret();
/* use the key to do highly privileged stuff */
....
memset_s(&key, sizeof(int), 0, sizeof(int));
return result;
}
You can find other solutions for various platforms/C standards here: https://www.securecoding.cert.org/confluence/display/c/MSC06-C.+Beware+of+compiler+optimizations
Addendum: have a look at this article Zeroing buffer is insufficient which points out other problems (besides zeroing the actual buffer):
With a bit of care and a cooperative compiler, we can zero a buffer — but that's not what we need. What we need to do is zero every location where sensitive data might be stored. Remember, the whole reason we had sensitive information in memory in the first place was so that we could use it; and that usage almost certainly resulted in sensitive data being copied onto the stack and into registers.
Your key value might have been copied into another location (like a register or temporary stack/memory location) by the compiler and you don't have any control to clear that location.
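memset_s belongs to the optional Annex K of C11, and some C libraries do not provide it. A commonly used fallback is to write the zeros through a pointer to volatile. The sketch below assumes nothing beyond standard C; `secure_zero` is a hypothetical helper name, and it only wipes the buffer it is given, not any copies the compiler may have left in registers or temporaries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zero n bytes starting at p. Every store is made
 * through a volatile lvalue, so the compiler has to treat each one as
 * an observable side effect instead of removing it as a dead store. */
static void secure_zero(void *p, size_t n)
{
    volatile unsigned char *vp = (volatile unsigned char *)p;

    while (n--)
        *vp++ = 0;
}
```

In the question's function this would be called as `secure_zero(&key, sizeof key);` just before the return.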

If you go with dynamic allocation you can control wiping that memory and not be bound by what the system does with the stack.
int secret_foo(void)
{
int *key = malloc(sizeof(int));
*key = get_secret();
// other magical things...
memset(key, 0, sizeof(int)); // wipe only once the key is no longer needed
free(key);
return result;
}

One solution is to disable compiler optimizations for the section of code that you don't want optimized:
int secret_foo(void) {
int key = get_secret();
#pragma GCC push_options
#pragma GCC optimize ("O0")
key = 0;
#pragma GCC pop_options
return result;
}

How to get a pointer on the return value of a function?

The function
I'm using the function uint32_t htonl(uint32_t hostlong) to convert a uint32_t to network byte order.
What I want to do
I need to do calculations with the variable after converting it to network byte order:
//Usually I do calculate with much more variables and copy them into a much
// larger buff - to keep it understandable and simple I broke it down
// to one calculation
uint32_t var = 1;
void* buff;
buff = malloc(sizeof(uint32_t));
while(var < 5) {
var = htonl(var);
memcpy(buff, &var, sizeof(uint32_t));
doSomethingWithBuff(buff);
var++; // FAIL
}
What I could do but ...
Actually, I have already found a solution for this problem:
uint32_t var = 1, nbo;
void* buff;
buff = malloc(sizeof(uint32_t));
while(var < 5) {
nbo = htonl(var);
memcpy(buff, &nbo, sizeof(uint32_t));
doSomethingWithBuff(buff);
var++;
}
The problem is that I waste memory with this solution because nbo is just used as a buffer.
What I would prefer to do
It would be perfect if I could use the htonl() function within the memcpy() function.
memcpy() needs its second argument to be a pointer (const void*). My question is: how can I get the address of the return value of htonl()?
uint32_t var = 1;
void* buff;
buff = malloc(sizeof(uint32_t));
while(var < 5) {
memcpy(buff, (GET ADDRESS)htonl(var), sizeof(uint32_t));
doSomethingWithBuff(buff);
var++;
}
And if it is not possible because there "is no address of this variable": How does a function work that is returning a variable rather than a pointer to a variable?

Discussion about only one variable buffer
I think you are doing the wrong micro-optimizations.
As Uchia Itachi points out, getting the address of the return value would be a bug.
Actually, if you are concerned about efficiency, the bottleneck is dynamic storage. malloc() has memory overhead: in addition to the data that you store on the heap, metadata is written. For example, here (scroll down to Implementation Details) it is explained how a clever minimal algorithm has an overhead of only one size_t per allocation. And this is not even considering fragmentation.
memcpy() is a fast function, but also an overkill for a single number.
Therefore, I recommend using only the stack. Make buff a global integer variable. Then pass buff's address to the functions that require a buffer. They won't notice the difference.
Discussion about the modified question - large buffer with lots of writes and reads in the loop
When a function returns (something), it pushes the value (or pointer to an object) in a register or on the stack. On the other hand, when a variable is declared, initialized and used, it resides either in a register or on the stack.
Do you notice the similarity? Optimizing compilers remove unneeded variables, they also create unnamed variables for internal use. For example, the storage of a variable is reused after detecting that the variable is no longer referenced in this scope.
Therefore, one should strive to write simple and readable code, and leave the details to the compiler. Meaning that your second example is perfectly fine.

You cannot be sure where the return value of a function will be put. Sometimes, the value is returned on the stack, other times, it may be in a register. There is no real difference between returning something versus returning a pointer in C. But it is good not to mess with local variables that are defined and accessible only in a different function. Doing so increases the chances of seg faults. Also, using &nbo is not a waste of memory.
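For completeness, C99 does offer a way to avoid the named temporary: a compound literal is an lvalue, so its address can be taken. The sketch below assumes a POSIX system for `htonl()` and a C99 compiler (compound literals are not valid C++); `put_u32_nbo` is a hypothetical helper name. The compiler still needs storage for the value, so this saves typing rather than memory.

```c
#include <assert.h>
#include <arpa/inet.h>  /* htonl(); POSIX, an assumption about the platform */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy var into buff in network byte order.
 * (uint32_t){ ... } creates an unnamed uint32_t object whose address
 * can be passed straight to memcpy(). */
static void put_u32_nbo(void *buff, uint32_t var)
{
    memcpy(buff, &(uint32_t){ htonl(var) }, sizeof(uint32_t));
}
```

Inside the loop from the question this becomes `put_u32_nbo(buff, var);` followed by `var++`, with var staying in host byte order throughout.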

You can wrap the issue "away" using a macro:
#define APPEND_TO_BUFFER( \
buffer, \
value, \
type) \
do { \
type tmpvar = (value); /* named temporary, so its address can be taken */ \
memcpy((buffer), &tmpvar, sizeof(type)); \
} while (0)
...
char * buff = <some valid memory address>;
uint32_t var = <some value>;
int i = <some other value>;
...
APPEND_TO_BUFFER(buff, htonl(var), uint32_t); /* Append converted var to buffer as per OP. */
APPEND_TO_BUFFER(buff+sizeof(uint32_t), i, int); /* Append another i right after var. */
doSomethingWithBuff(buff);

CUDA shared memory speed

This is a performance-related question. I've written the following simple CUDA kernel based on the "CUDA By Example" sample code:
#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128
__device__ const unsigned char *m = "Goodbye, cruel world!";
__global__ void kernel_sha1(unsigned char *hval) {
sha1_ctx ctx[1];
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
while(tid < N) {
sha1_begin(ctx);
sha1_hash(m, 21UL, ctx);
sha1_end(hval+tid*SHA1_DIGEST_SIZE, ctx);
tid += blockDim.x * gridDim.x;
}
}
The code seems to me to be correct and indeed spits out 37,426 copies of the same hash (as expected). Based on my reading of Chapter 5, section 5.3, I assumed that each thread writing to the global memory passed in as "hval" would be extremely inefficient.
I then implemented what I assumed would be a performance-boosting cache using shared memory. The code was modified as follows:
#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128
__device__ const unsigned char *m = "Goodbye, cruel world!";
__global__ void kernel_sha1(unsigned char *hval) {
sha1_ctx ctx[1];
unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
__shared__ unsigned char cache[THREAD_COUNT*SHA1_DIGEST_SIZE];
while(tid < N) {
sha1_begin(ctx);
sha1_hash(m, 21UL, ctx);
sha1_end(cache+threadIdx.x*SHA1_DIGEST_SIZE, ctx);
__syncthreads();
if( threadIdx.x == 0) {
memcpy(hval+tid*SHA1_DIGEST_SIZE, cache, sizeof(cache));
}
__syncthreads();
tid += blockDim.x * gridDim.x;
}
}
The second version also appears to run correctly, but is several times slower than the initial version. The latter code completes in about 8.95 milliseconds while the former runs in about 1.64 milliseconds. My question to the Stack Overflow community is simple: why?

I looked through CUDA by Example and couldn't find anything resembling this. Yes there is some discussion of GPU hash tables in the appendix, but it looks nothing like this. So I really have no idea what your functions do, especially sha1_end. If this code is similar to something in that book, please point it out, I missed it.
However, if sha1_end writes to global memory once (per thread) and does so in a coalesced way, there's no reason that it can't be quite efficient. Presumably each thread is writing to a different location, so if they are adjacent more-or-less, there are definitely opportunities for coalescing. Without going into the details of coalescing, suffice it to say that it allows multiple threads to write data to memory in a single transaction. And if you are going to write your data to global memory, you're going to have to pay this penalty at least once, somewhere.
For your modification, you've completely killed this concept. You have now performed all the data copying from a single thread, and the memcpy means that subsequent data writes (ints, or chars, whatever) are occurring in separate transactions. Yes, there is a cache which may help with this, but it's completely the wrong way to do it on GPUs. Let each thread update global memory, and take advantage of opportunities to do it in parallel. But when you force all the updates on a single thread, then that thread must copy the data sequentially. This is probably the biggest single cost factor in the timing difference.
The use of __syncthreads() also imposes additional cost.
Section 12.2.7 of the CUDA by Example book refers to the Visual Profiler (and mentions that it can gather information about coalesced accesses). The Visual Profiler is a good tool to help answer questions like this.
If you want to learn more about efficient memory techniques and coalescing, I would recommend the NVIDIA GPU computing webinar entitled "GPU Computing using CUDA C – Advanced 1 (2010)". The direct link to it is here with slides.

How to safely convert/copy volatile variable?

volatile char* sevensegment_char_value;
void ss_load_char(volatile char *digits) {
...
int l=strlen(digits);
...
}
ss_load_char(sevensegment_char_value);
In the above example I get a warning from the avr-gcc compiler:
Warning 6 passing argument 1 of 'strlen' discards 'volatile' qualifier from pointer target type [enabled by default]
So I have to somehow copy the value from volatile to non-volatile var? What is the safe workaround?

There is no "built-in" workaround in C. volatile tells the compiler that the contents of a variable (or, in your case, the memory the variable points to) can change without the compiler noticing, and forces the compiler to read the data directly from memory rather than using a possibly stale copy held in a register.
Therefore the volatile keyword is used to avoid odd behaviour induced through compiler optimizations. (I can explain this further if you like)
In your case, you have a character buffer declared as volatile. If your program changes the contents of this buffer in a different context, for example in an ISR, you have to implement some synchronisation mechanism (such as disabling that particular interrupt) to avoid inconsistent data. After acquiring the "lock" (disabling the interrupt) you can copy the data byte by byte to a local (non-volatile) buffer and work on this buffer for the rest of the routine.
If the buffer will not change "outside" of the context of your read accesses I suggest to omit the volatile keyword as there is no use for it.
To judge the correct solution, a little bit more information about your exact use case would be needed.

Standard library routines aren't designed to work on volatile objects. The simplest solution is to read the volatile memory into normal memory before operating on it:
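void ss_load_char(volatile char *digits) {
char buf[BUFSIZE + 1]; /* one extra byte for the terminator */
int i = 0;
for (i = 0; i < BUFSIZE; ++i) {
buf[i] = digits[i];
}
buf[BUFSIZE] = '\0'; /* make sure strlen() stops inside buf */
int l=strlen(buf);
...
}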
Here BUFSIZE is the size of the area of volatile memory.
Depending on how the volatile memory is configured, there may be routines you are supposed to call to copy out the contents, rather than just using a loop. Note that memcpy won't work as it is not designed to work with volatile memory.
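If only the length is needed, an alternative to copying is a small strlen variant that accepts a pointer to volatile. This is a sketch under the assumption that the buffer is NUL-terminated and is not being rewritten while it is scanned; `vstrlen` is a hypothetical name, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of a NUL-terminated string stored in
 * volatile memory. Each character is read through the volatile-qualified
 * pointer, so no qualifier is discarded and no cast is needed. */
static size_t vstrlen(const volatile char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        ++n;
    return n;
}
```

Inside ss_load_char this would replace the strlen call with `int l = (int)vstrlen(digits);`, which compiles without the qualifier warning.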

The compiler warning only means that strlen() will not treat your pointer as volatile, i.e. it will maybe cache the pointer in a register when computing the length of your string. I guess, that's ok with you.
In general, volatile means that the compiler will not cache the variable. Look at this example:
extern int flag;
while (flag) { /* loop*/ }
This may loop forever if flag != 0 on entry, since the compiler is allowed to assume that flag is not changed "from the outside", for example by a different thread. If you want to wait on the input of some other thread, you must write this:
extern volatile int flag;
while (flag) { /* loop*/ }
Now the compiler will really look at flag on every iteration of the loop. This is probably much closer to what was intended in this example.
In answer to your question: if you know what you're doing, just cast the volatile away with int l=strlen((char*)digits).
