How can I convert a non-atomic operation to an atomic operation in C?

I am trying to understand atomic and non-atomic operations, both with respect to operating systems and with respect to C.
As per the Wikipedia page here:
Consider a simple counter which different processes can increment.
Non-atomic
The naive, non-atomic implementation:
reads the value in the memory location;
adds one to the value;
writes the new value back into the memory location.
Now, imagine two processes are running, each incrementing a single, shared memory location:
the first process reads the value in the memory location;
the first process adds one to the value;
but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:
the second process reads the value in the memory location, the same value that the first process read;
the second process adds one to the value;
the second process writes the new value into the memory location.
How can the above operation be made an atomic operation?
My understanding of an atomic operation is that anything which executes without interruption is atomic.
So, for example:
int b=1000;
b+=1000;
This should be an atomic operation as per my understanding, because both instructions execute without interruption. However, I learned from someone that in C there is no such thing as an atomic operation, so both of the above statements are non-atomic.
So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?

C99 doesn't have any way to make variables atomic with respect to other threads. C99 has no concept of multiple threads of execution. Thus, you need to use compiler-specific extensions, and/or CPU-level instructions to achieve atomicity.
The next C standard, then known as C1x and since published as C11, includes atomic operations.
Even then, mere atomicity just guarantees that an operation is atomic; it doesn't guarantee when that operation becomes visible to other CPUs. To achieve visibility guarantees, in C99 you would need to study your CPU's memory model, and possibly use a special kind of CPU instruction known as fences or memory barriers. You also need to tell the compiler about it, using some compiler-specific compiler barrier. C11 defines several memory orderings, and when you use an atomic operation you can decide which memory ordering to use.
Some examples:
/* NOT atomic */
b += 1000;
/* GCC-extension, only in newish GCCs
* requirements on b's loads are CPU-specific
*/
__sync_add_and_fetch(&b, 1000);
/* GCC-extension + x86-assembly,
* b should be aligned to its size (natural alignment),
* or loads will not be atomic
*/
__asm__ __volatile__("lock addl $1000, %0" : "+m"(b)); /* lock needs a memory operand, so "+m", not "+r" */
/* C1x */
#include <stdatomic.h>
atomic_int b = ATOMIC_VAR_INIT(1000); /* or simply = 1000 since C17 */
int r = atomic_fetch_add(&b, 1000) + 1000;
All of this is as complex as it seems, so you should normally stick to mutexes, which make things easier.
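For illustration, here is a minimal pthreads sketch of the mutex approach recommended above (the function and mutex names are made up for this example):
#include <pthread.h>

int b = 1000;                                  /* shared counter from the question */
pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;

void add_1000(void)
{
    pthread_mutex_lock(&b_lock);               /* no other thread can touch b now */
    b += 1000;                                 /* still a read-modify-write, but it */
    pthread_mutex_unlock(&b_lock);             /* appears atomic to other threads   */
}
Every thread that reads or writes b must take the same mutex, otherwise the protection is lost.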

int b = 1000;
b+=1000;
gets turned into multiple statements at the instruction level. At the very least, preparing a register or memory, assigning 1000, then getting the contents of that register/memory, adding 1000 to the contents, and re-assigning the new value (2000) to that register. Without locking, the OS can suspend the process/thread at any point in that operation. In addition, on multiproc systems, a different processor could access that memory (wouldn't be a register in this case) while your operation is in progress.
When you take a lock out (which is how you would make this atomic), you are, in part, informing the OS that it is not ok to suspend this process/thread, and that this memory should not be accessed by other processes.
Now the above code would probably be optimized by the compiler to a simple assignment of 2000 to the memory location for b, but I'm ignoring that for the purposes of this answer.

b+=1000 is compiled, on all systems that I know, to multiple instructions. Thus it is not atomic.
Even b=1000 can be non-atomic, although you have to work hard to construct a situation where it is not atomic.
In fact C has no concept of threads and so there is nothing that is atomic in C. You need to rely on implementation specific details of your compiler and tools.

The above statements are non-atomic because they become a move instruction to load b into a register (if it isn't already in one), then an add of 1000 to it, and then a store back into memory. Many instruction sets allow for atomicity through an atomic increment, the easiest being x86 with a lock-prefixed addl; some other instruction sets use cmpxchg (compare-and-exchange) to achieve the same result.
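To show what the cmpxchg approach looks like, here is a hedged sketch of building an atomic add out of a compare-and-swap loop, using GCC's legacy __sync builtin (the function name is made up for illustration):
int atomic_add(int *p, int amount)
{
    int oldval, newval;
    do {
        oldval = *p;                  /* snapshot the current value            */
        newval = oldval + amount;     /* compute the value we want to publish  */
        /* store newval only if *p is still oldval; otherwise another thread
         * got in first and we retry with the fresh value                      */
    } while (__sync_val_compare_and_swap(p, oldval, newval) != oldval);
    return newval;
}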

So what I want to understand is: how is atomicity different when it comes to programming languages versus operating systems?
I'm a bit confused by this question. What do you mean exactly? The atomicity concept is the same both in programming languages and in operating systems.
Regarding atomicity and language, here is, for example, a link about atomicity in Java that might give you a different perspective: What operations in Java are considered atomic?

Related

Is assigning a pointer in a C program considered atomic on x86-64?

https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html#Atomic-Types says - In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
My question is whether pointer assignment can be considered atomic on the x86_64 architecture for a C program compiled with the gcc -m64 flag. The OS is 64-bit Linux and the CPU is an Intel(R) Xeon(R) CPU D-1548. One thread will be setting a pointer and another thread accessing the pointer. There is only one writer thread and one reader thread. The reader should either get the previous value of the pointer or the latest value, and no garbage value in between.
If it is not considered atomic, please let me know how I can use the gcc atomic builtins or maybe a memory barrier like __sync_synchronize to achieve the same without using locks. Interested only in a C solution and not C++. Thanks!
Bear in mind that atomicity alone is not enough for communicating between threads. Nothing prevents the compiler and CPU from reordering previous/subsequent load and store instructions with that "atomic" store. In the old days people used volatile to prevent that reordering, but that was never intended for use with threads and doesn't provide a means to specify less or more restrictive memory order (see "Relationship with volatile" in there).
You should use C11 atomics because they guarantee both atomicity and memory order.
For almost all architectures, pointer load and store are atomic. A once notable exception was 8086/80286 where pointers could be seg:offset; there was an l[des]s instruction which could make an atomic load; but no corresponding atomic store.
The integrity of the pointer is only a small concern; your bigger issue revolves around synchronization: the pointer was at value Y, you set it to X; how will you know when nobody is using the (old) Y value?
A somewhat related problem is that you may have stored things at X which the other thread expects to find. Without synchronization, the other thread might see the new pointer value, but what it points to might not be up to date yet.
A plain global char *ptr should not be considered atomic. It might work sometimes, especially with optimization disabled, but you can get the compiler to make safe and efficient optimized asm by using modern language features to tell it you want atomicity.
Use C11 stdatomic.h or GNU C __atomic builtins. And see Why is integer assignment on a naturally aligned variable atomic on x86? - yes the underlying asm operations are atomic "for free", but you need to control the compiler's code-gen to get sane behaviour for multithreading.
See also LWN: Who's afraid of a big bad optimizing compiler? - weird effects of using plain vars include several really bad well-known things, but also more obscure stuff like invented loads, reading a variable more than once if the compiler decides to optimize away a local tmp and load the shared var twice, instead of loading it into a register. Using asm("" ::: "memory") compiler barriers may not be sufficient to defeat that depending on where you put them.
So use proper atomic stores and loads that tell the compiler what you want. You should generally use atomic loads to read shared variables, too.
#include <stdatomic.h>          // C11 way
_Atomic(char *) c11_shared_var; // note _Atomic(char *), not _Atomic char *, so the
                                // pointer itself is atomic; all access to it is atomic,
                                // the _explicit functions are needed only for weaker ordering
void c11_store(char *newval) {
    atomic_store_explicit(&c11_shared_var, newval, memory_order_relaxed);
}

char *plain_shared_var;         // GNU C
// This is a plain C var. Only specific accesses to it are atomic; be careful!
void gnu_store(char *newval) {
    __atomic_store_n(&plain_shared_var, newval, __ATOMIC_RELAXED);
}
Using __atomic_store_n on a plain var is the functionality that C++20 atomic_ref exposes. If multiple threads access a variable for the entire time that it needs to exist, you might as well just use C11 stdatomic because every access needs to be atomic (not optimized into a register or whatever). When you want to let the compiler load once and reuse that value, do char *tmp = c11_shared_var; (or atomic_load_explicit if you only want acquire instead of seq_cst; cheaper on a few non-x86 ISAs).
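And the matching reader side, as a minimal sketch assuming the c11_shared_var declared above (the function name is made up):
void reader(void)
{
    char *tmp  = c11_shared_var;   /* plain access to an _Atomic object: a seq_cst load */
    char *tmp2 = atomic_load_explicit(&c11_shared_var, memory_order_acquire);
    (void)tmp; (void)tmp2;         /* silence unused-variable warnings in this sketch */
}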
Besides lack of tearing (atomicity of asm load or store), the other key parts of _Atomic foo * are:
The compiler will assume that other threads may have changed memory contents (like volatile effectively implies), otherwise the assumption of no data-race UB will let the compiler hoist loads out of loops. Without this, dead-store elimination might only do one store at the end of a loop, not updating the value multiple times.
The read side of the problem is usually what bites people in practice, see Multithreading program stuck in optimized mode but runs normally in -O0 - e.g. while(!flag){} becomes if(!flag) infinite_loop; with optimization enabled.
Ordering wrt. other code. e.g. you can use memory_order_release to make sure that other threads that see the pointer update also see all changes to the pointed-to data. (On x86 that's as simple as compile-time ordering, no extra barriers needed for acquire/release, only for seq_cst. Avoid seq_cst if you can; mfence or locked operations are slow.)
Guarantee that the store will compile to a single asm instruction. You'd be depending on this. It does happen in practice with sane compilers, although it's conceivable that a compiler might decide to use rep movsb to copy a few contiguous pointers, and that some machine somewhere might have a microcoded implementation that does some stores narrower than 8 bytes.
(This failure mode is highly unlikely; the Linux kernel relies on volatile load/store compiling to a single instruction with GCC / clang for its hand-rolled intrinsics. But if you just used asm("" ::: "memory") to make sure a store happened on a non-volatile variable, there's a chance.)
Also, something like ptr++ will compile to an atomic RMW operation like lock add qword [mem], 4, rather than separate load and store like volatile would. (See Can num++ be atomic for 'int num'? for more about atomic RMWs). Avoid that if you don't need it, it's slower. e.g. atomic_store_explicit(&ptr, ptr + 1, mo_release); - seq_cst loads are cheap on x86-64 but seq_cst stores aren't.
Also note that memory barriers can't create atomicity (lack of tearing), they can only create ordering wrt other ops.
In practice x86-64 ABIs do have alignof(void*) = 8, so all pointer objects should be naturally aligned (except in a __attribute__((packed)) struct, which violates the ABI), so you can use __atomic_store_n on them. It should compile to what you want (plain store, no overhead), and meet the asm requirements to be atomic.
See also When to use volatile with multi threading? - you can roll your own atomics with volatile and asm memory barriers, but don't. The Linux kernel does that, but it's a lot of effort for basically no gain, especially for a user-space program.
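Tying the ordering point together, here is a hedged sketch of the classic publish/consume pattern with release and acquire; the struct and function names are invented for the example:
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct payload { int a, b; };                 /* hypothetical pointed-to data */
static _Atomic(struct payload *) shared = NULL;

void publisher(void)
{
    struct payload *p = malloc(sizeof *p);
    if (!p) return;
    p->a = 1;
    p->b = 2;
    /* release: the writes to *p above become visible to any thread that
     * acquire-loads the pointer and sees this value */
    atomic_store_explicit(&shared, p, memory_order_release);
}

void consumer(void)
{
    struct payload *p = atomic_load_explicit(&shared, memory_order_acquire);
    if (p)
        printf("%d %d\n", p->a, p->b);        /* guaranteed to print "1 2" */
}
On x86-64 the release store and acquire load cost nothing beyond ordinary mov instructions; the ordering constraint only restricts the compiler.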
Side note: an often repeated misconception is that volatile or _Atomic are needed to avoid reading stale values from cache. This is not the case.
All machines that run C11 threads across multiple cores have coherent caches, not needing explicit flush instructions in the reader or writer. Just ordinary load or store instructions, like x86 mov. The key is to not let the compiler keep values of shared variable in CPU registers (which are thread-private). It normally can do this optimization because of the assumption of no data-race Undefined Behaviour. Registers are very much not the same thing as L1d CPU cache; managing what's in registers vs. memory is done by the compiler, while hardware keeps cache in sync. See When to use volatile with multi threading? for more details about why coherent caches is sufficient to make volatile work like memory_order_relaxed.
See Multithreading program stuck in optimized mode but runs normally in -O0 for an example.
"Atomic" is treated as this quantum state where something can be both atomic and not atomic at the same time because "it's possible" that "some machines" "somewhere" "might not" write "a certain value" atomically. Maybe.
That is not the case. Atomicity has a very specific meaning, and it solves a very specific problem: threads being pre-empted by the OS to schedule another thread in its place on that core. And you cannot stop a thread from executing mid-assembly instruction.
What that means is that any single assembly instruction is "atomic" by definition. And since you have register-moving instructions, any register-sized copy is atomic by definition. That means a 32-bit integer on a 32-bit CPU, and a 64-bit integer on a 64-bit CPU, are atomic -- and of course that includes pointers (ignore all the people who will tell you "some architectures" have pointers of "different size" than registers; that hasn't been the case since the 386).
You should however be careful not to hit variable caching problems (i.e. one thread writing a pointer, and another trying to read it but getting an old value from the cache); use volatile as needed to prevent this.

embedded C - using "volatile" to assert consistency

Consider the following code:
// In the interrupt handler file:
volatile uint32_t gSampleIndex = 0; // declared 'extern'
void HandleSomeIrq()
{
    gSampleIndex++;
}

// In some other file
void Process()
{
    uint32_t localSampleIndex = gSampleIndex; // will this be optimized away?

    PrevSample    = RawSamples[(localSampleIndex + 0) % NUM_RAW_SAMPLE_BUFFERS];
    CurrentSample = RawSamples[(localSampleIndex + 1) % NUM_RAW_SAMPLE_BUFFERS];
    NextSample    = RawSamples[(localSampleIndex + 2) % NUM_RAW_SAMPLE_BUFFERS];
}
My intention is that PrevSample, CurrentSample and NextSample are consistent, even if gSampleIndex is updated during the call to Process().
Will the assignment to the localSampleIndex do the trick, or is there any chance it will be optimized away even though gSampleIndex is volatile?
In principle, volatile is not enough to guarantee that Process only sees consistent values of gSampleIndex. In practice, however, you should not run into any issues if uint32_t is directly supported by the hardware. The proper solution would be to use atomic accesses.
The problem
Suppose that you are running on a 16-bit architecture, so that the assignment
localSampleIndex = gSampleIndex;
gets compiled into two instructions (loading the upper half, loading the lower half). Then the interrupt might be called between the two instructions, and you'll get half of the old value combined with half of the new value.
The solution
The solution is to access gSampleIndex using atomic operations only. I know of three ways of doing that.
C11 atomics
In C11 (supported since GCC 4.9), you declare your variable as atomic:
#include <stdatomic.h>
atomic_uint gSampleIndex;
You then take care to only ever access the variable using the documented atomic interfaces. In the IRQ handler:
atomic_fetch_add(&gSampleIndex, 1);
and in the Process function:
localSampleIndex = atomic_load(&gSampleIndex);
Do not bother with the _explicit variants of the atomic functions unless you're trying to get your program to scale across large numbers of cores.
GCC atomics
Even if your compiler does not support C11 yet, it probably has some support for atomic operations. For example, in GCC you can say:
volatile int gSampleIndex;
...
__atomic_add_fetch(&gSampleIndex, 1, __ATOMIC_SEQ_CST);
...
__atomic_load(&gSampleIndex, &localSampleIndex, __ATOMIC_SEQ_CST);
As above, do not bother with weak consistency unless you're trying to achieve good scaling behaviour.
Implementing atomic operations yourself
Since you're not trying to protect against concurrent access from multiple cores, just race conditions with an interrupt handler, it is possible to implement a consistency protocol using standard C primitives only. Dekker's algorithm is the oldest known such protocol.
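For completeness, here is a hedged sketch of the flag-and-turn structure of Dekker's algorithm for two participants; on real hardware it additionally needs memory barriers (or interrupt masking), and the volatile qualifiers only stop the compiler from caching the flags:
static volatile int wants[2] = { 0, 0 };
static volatile int turn = 0;

void dekker_lock(int self)
{
    int other = 1 - self;
    wants[self] = 1;
    while (wants[other]) {
        if (turn != self) {
            wants[self] = 0;        /* back off and let the other side go */
            while (turn != self)
                ;                   /* spin until it is our turn */
            wants[self] = 1;
        }
    }
}

void dekker_unlock(int self)
{
    turn = 1 - self;
    wants[self] = 0;
}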
In your function you access the volatile variable just once (and it's the only volatile one in that function), so you don't need to worry about code reordering that the compiler may do (and that volatile prevents). What the standard says about these optimizations, at §5.1.2.3, is:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Note last sentence: "...no needed side effects are produced (...accessing a volatile object)".
volatile by itself will prevent the optimizations the compiler may do around that code. To mention a few: no instruction reordering with respect to other volatile accesses, no expression removal, no caching, no value propagation across functions.
BTW, I doubt any compiler would break your code (with or without volatile). Maybe the local stack variable will be elided and the value kept in a register (it certainly won't repeatedly access the memory location). What you need volatile for is value visibility.
EDIT
I think some clarification is needed.
Let me safely assume you know what you're doing (you're working with interrupt handlers, so this shouldn't be your first C program): the CPU word size matches your variable type and the memory is properly aligned.
Let me also assume your interrupt is not reentrant (thanks to some cli/sti magic or whatever your CPU uses for this), unless you're planning some hard debugging and tuning.
If these assumptions are satisfied then you don't need atomic operations. Why? Because localSampleIndex = gSampleIndex is atomic (it's properly aligned, the word size matches and it's volatile), and with ++gSampleIndex there isn't any race condition (HandleSomeIrq won't be called again while it's still executing). Atomic operations here would be more than useless: they'd be wrong.
One may think: "OK, I may not need atomics, but why can't I use them anyway? Even if those assumptions are satisfied, they're an *extra* and will achieve the same goal." No, they won't. Atomics do not have the same semantics as volatile variables (and volatile is seldom, and should seldom be, used outside memory-mapped I/O and signal handling). volatile is (usually) useless together with atomics (unless a specific architecture says otherwise), but there is one great difference: visibility. When you update gSampleIndex in HandleSomeIrq, the standard guarantees the value will be immediately visible to all threads (and devices); with atomic_uint the standard only guarantees it'll be visible in a reasonable amount of time.
To make it short and clear: volatile and atomic are not the same thing. Atomic operations are useful for concurrency, volatile is useful for lower-level stuff (interrupts, devices). If you're still thinking "hey, they do *exactly* what I need", please read a few useful links picked from the comments: cache coherency and a nice read about atomics.
To summarize:
In your case you may use an atomic variable with a lock (to have both atomic access and value visibility), but no one on this earth would put a lock inside an interrupt handler (unless absolutely, definitely, unquestionably needed, and from the code you posted that's not your case).

Are assignment = and subtraction assignment -= atomic operations in C?

int b = 1000;
b -= 20;
Is any of the above an atomic operation? What is an atomic operation in C?
It depends on the implementation. By the standard, nothing is atomic in C. If you need atomic ops you can look at your compiler's builtins.
It is architecture/implementation dependent.
If you want atomic operations, the sig_atomic_t type is standardized (it goes back to C89; see the quote from the standard below).
From the GNU LibC docs:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these are true on all of the machines that the GNU C library supports, and on all POSIX systems we know of.
This link seems to me to be on the right track in telling us what an atomic operation is in C:
http://odetocode.com/blogs/scott/archive/2006/05/17/atomic-operations.aspx
And it says, "...computer science adopted the term 'atomic operation' to describe an instruction that is indivisible and uninterruptible by other threads of execution."
And by that definition, the first line of code in the original question
int b=1000;
b-=20;
ought to be an atomic operation. The second could be an atomic operation if the CPU's instruction set includes an instruction to subtract directly from memory. The reason I think so is that the first code line would most likely require only one assembly (machine) instruction. And instructions either execute or not. I don't think any machine instruction can be interrupted in the middle.
At that same link, says, "If thread A is writing a 32-bit value to memory as an atomic operation, thread B will never be able to read the memory location and see only the first 16 of 32 bits written out." Seems that any single machine instruction cannot be interrupted in the middle, therefore would automatically be atomic between threads.
Incrementing and decrementing a number is not an atomic operation in C. Certain architectures support atomic increment and decrement instructions, but there is no guarantee that the compiler would use them. You can look, as an example, at Qt reference counting. It uses atomic reference counting: on certain platforms it is implemented with platform-specific assembly code, and on the rest it uses a mutex to lock the counter.
If you're not incrementing or decrementing in a performance-critical part of your code, simply use a mutex while doing it. If you are in a performance-critical part, you might want to rewrite your code so that it doesn't use shared memory accessed from multiple places for this operation, or use mutexes with finer granularity so that they don't affect performance, or use assembly to ensure that the operation is atomic.
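With C11 this becomes straightforward; here is a hedged sketch of making the subtraction from the question atomic (the function name is made up):
#include <stdatomic.h>

atomic_int b = 1000;        /* counter from the question, now an atomic type */

void take_twenty(void)
{
    /* a single atomic read-modify-write; other threads can never observe
     * a half-finished update of b */
    atomic_fetch_sub(&b, 20);
}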
Quoting from ISO C89, 7.7 Signal handling <signal.h>
The type defined is sig_atomic_t which is the integral type of an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts.

Are memory barriers necessary for atomic reference counting shared immutable data?

I have some immutable data structures that I would like to manage using reference counts, sharing them across threads on an SMP system.
Here's what the release code looks like:
void avocado_release(struct avocado *p)
{
    if (atomic_dec(p->refcount) == 0) {
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}
Does atomic_dec need a memory barrier in it? If so, what kind of memory barrier?
Additional notes: The application must run on PowerPC and x86, so any processor-specific information is welcomed. I already know about the GCC atomic builtins. As for immutability, the refcount is the only field that changes over the duration of the object.
On x86, it will turn into a lock prefixed assembly instruction, like LOCK XADD.
Being a single instruction, it is non-interruptible. As an added "feature", the lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64, because mfence is slower on many CPUs even when it's guaranteed to be available, like in 64-bit mode. (Does lock xchg have the same behavior as mfence?)
So you have a full fence on x86 as an added bonus, whether you like it or not. :-)
On PPC, it is different. An LL/SC pair - lwarx & stwcx - with a subtraction inside can be used to load the memory operand into a register, subtract one, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted (meaning it will fail and retry).
It also does not mean an automatic full fence.
This does not however compromise the atomicity of the counter in any way.
It just means that in the x86 case, you happen to get a fence as well, "for free".
On PPC, one can insert a (partial or) full fence by emitting a (lw)sync instruction.
All in all, explicit memory barriers are not necessary for the atomic counter to work properly.
It is important to distinguish between atomic accesses (which guarantee that the read/modify/write of the value executes as one atomic unit) vs. memory reordering.
Memory barriers prevent reordering of reads and writes. Reordering is completely orthogonal to atomicity. For instance, on PowerPC if you implement the most efficient atomic increment possible then it will not prevent reordering. If you want to prevent reordering then you need an lwsync or sync instruction, or some equivalent high-level (C++ 11?) memory barrier.
Claims that there is "no possibility of the compiler reordering things in a problematic way" seem naive as general statements because compiler optimizations can be quite surprising and because CPUs (PowerPC/ARM/Alpha/MIPS in particular) aggressively reorder memory operations.
A coherent cache doesn't save you either. See https://preshing.com/archives/ to see how memory reordering really works.
In this case, however, I believe the answer is that no barriers are required. That is because for this specific case (reference counting) there is no need for a relationship between the reference count and the other values in the object. The one exception is when the reference count hits zero. At that point it is important to ensure that all updates from other threads are visible to the current thread so a read-acquire barrier may be necessary.
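Put together, a hedged C11 sketch of that reasoning looks like this (the struct layout is assumed from the question; atomic_fetch_sub_explicit returns the value before the decrement):
#include <stdatomic.h>
#include <stdlib.h>

struct avocado {
    atomic_int refcount;
    void *pit;
    void *juicy_innards;
};

void avocado_release(struct avocado *p)
{
    /* release: our own earlier writes are published before the count can drop */
    if (atomic_fetch_sub_explicit(&p->refcount, 1, memory_order_release) == 1) {
        /* acquire: pull in every other thread's writes before tearing the object down */
        atomic_thread_fence(memory_order_acquire);
        free(p->pit);
        free(p->juicy_innards);
        free(p);
    }
}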
Are you intending to implement your own atomic_dec or are you just wondering whether a system-supplied function will behave as you want?
As a general rule, system-supplied atomic increment/decrement facilities will apply whatever memory barriers are required to just do the right thing. You generally don't have to worry about memory barriers unless you are doing something wacky like implementing your own lock-free data structures or an STM library.

Do I need a lock when only a single thread writes to a shared variable?

I have 2 threads and a shared float global. One thread only writes to the variable while the other only reads from it. Do I need to lock access to this variable? In other words:
volatile float x;

void reader_thread() {
    while (1) {
        // Grab mutex here?
        float local_x = x;
        // Release mutex?
        do_stuff_with_value(local_x);
    }
}

void writer_thread() {
    while (1) {
        float local_x = get_new_value_from_somewhere();
        // Grab mutex here?
        x = local_x;
        // Release mutex?
    }
}
My main concern is a load or store of a float not being atomic, such that local_x in reader_thread ends up with a bogus, partially updated value.
Is this a valid concern?
Is there another way to guarantee atomicity without a mutex?
Would using sig_atomic_t as the shared variable work, assuming it has enough bits for my purposes?
The language in question is C using pthreads.
Different architectures have different rules, but in general, memory loads and stores of aligned, int-sized objects are atomic. Smaller and larger may be problematic. So if sizeof(float) == sizeof(int) you might be safe, but I still wouldn't depend on it in a portable program.
Also, the behavior of volatile isn't particularly well-defined... The specification uses it as a way to prevent optimizing away accesses to memory-mapped device I/O, but says nothing about its behavior on any other memory accesses.
In short, even if loads and stores are atomic on float x, I would use explicit memory barriers (though how varies by platform and compiler) instead of depending on volatile. Without the guarantee of loads and stores being atomic, you would have to use locks, which do imply memory barriers.
According to section 24.4.7.2 of the GNU C library documentation:
In practice, you can assume that int and other integer types no longer than int are atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C library supports and on all POSIX systems we know of.
float technically doesn't count under these rules, although if a float is the same size as an int on your architecture, what you could do is make your global variable an int, and then convert it to a float with a union every time you read or write it.
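A hedged sketch of that union idea, assuming sizeof(float) == sizeof(int) and that aligned int loads and stores are atomic on the target (all names here are invented):
union f2i { float f; int i; };

volatile int shared_bits;          /* the object actually shared between threads */

void write_float(float value)
{
    union f2i u;
    u.f = value;
    shared_bits = u.i;             /* one int-sized store */
}

float read_float(void)
{
    union f2i u;
    u.i = shared_bits;             /* one int-sized load */
    return u.f;
}
Type-punning through a union is well-defined in C, so this only relies on the atomicity assumptions stated above.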
The safest course of action is to use some form of mutex to protect accesses to the shared variable. Since the critical sections are extremely small (reading/writing a single variable), you're almost certainly going to get better performance out of a light-weight mutex such as a spin lock, as opposed to a heavy-weight mutex that makes system calls to do its job.
I would lock it down. I'm not sure how large float is in your environment, but it might not be read/written in a single instruction so your reader could potentially read a half-written value. Remember that volatile doesn't say anything about atomicity of operations, it simply states that the read will come from memory instead of being cached in a register or something like that.
The assignment is not atomic, at least for some compilers, in the sense that it takes more than a single instruction to perform. The following code was generated by Visual C++ 6.0 - f1 and f2 are of type float.
4: f2 = f1;
00401036 mov eax,dword ptr [ebp-4]
00401039 mov dword ptr [ebp-8],eax
In the memory model introduced by C11 and later, the clear answer is yes: you do need a lock or other means of synchronization, or else to declare the variable x as _Atomic float using <stdatomic.h>.
If a non-atomic variable is written by one thread, and either read or written by another, without appropriate synchronization to ensure that one access happens before the other in the precise sense defined in the standard, then a data race exists and the behavior of the program becomes undefined. (In particular, the bad effects need not be limited to just getting a bogus value when you read the variable; the program is allowed to crash, corrupt unrelated data, etc.)
Note that the presence of volatile is irrelevant. Declaring a variable volatile does not save you from UB when a data race otherwise exists, and if a data race is avoided by use of atomic_float or otherwise, then volatile is not needed.
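As a minimal sketch, here is the question's reader and writer rewritten for C11 (the two helper functions are taken from the question's code):
#include <stdatomic.h>

float get_new_value_from_somewhere(void);   /* from the question */
void  do_stuff_with_value(float);           /* from the question */

_Atomic float x;                   /* plain loads and stores of x are now atomic */

void writer_thread(void)
{
    while (1) {
        float local_x = get_new_value_from_somewhere();
        x = local_x;               /* seq_cst atomic store */
    }
}

void reader_thread(void)
{
    while (1) {
        float local_x = x;         /* seq_cst atomic load */
        do_stuff_with_value(local_x);
    }
}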
Since it's a single word in memory you're changing, you should be fine with just the volatile declaration.
I don't think you're guaranteed to have the latest value when you read it, though, unless you use a lock.
In all probability, no. Since you have no chance for write collision the only concern is whether you could read it while it's half-written. It's hugely unlikely that your code is going to be run on a platform where writing a float doesn't happen in a single operation if you're writing something with threads.
However it's possible because the definition of a float in C does not mandate that the underlying hardware storage be limited to the processor's word size. You could be compiling to machine code where, say, sign and mantissa are written in two different operations.
The real question, I think, is two questions: "what's the downside to having a mutex here?" and "What's the repercussions if I get a garbage read?"
Perhaps rather than a mutex you should write an assert that determines whether the storage size of a float is smaller than or equal to the word size of the underlying CPU.
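One hedged way to encode that check at compile time (C11), approximating "word size" by sizeof(void *), which is itself an assumption about the target:
_Static_assert(sizeof(float) <= sizeof(void *),
               "float is wider than a machine word; use a lock");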
