How are read/write locks implemented in pthreads?

How are read/write locks implemented, especially in the case of pthreads? What pthread synchronization APIs do they use under the hood? A little bit of pseudocode would be appreciated.

I haven't done any pthreads programming for a while, but when I did, I never used POSIX read/write locks. The problem is that most of the time a mutex will suffice: i.e. your critical section is small, and the region isn't so performance-critical that the double barrier is worth worrying about.
In those cases where performance is an issue, atomic operations (generally available as a compiler extension) are normally a better option (i.e. the extra barrier is the problem, not the size of the critical section).
By the time you eliminate all these cases, you are left with cases where you have specific performance/fairness/rw-bias requirements that require a true rw-lock; and that is when you discover that all the relevant performance/fairness parameters of a POSIX rw-lock are undefined and implementation-specific. At this point you are generally better off implementing your own, so you can ensure the appropriate fairness/rw-bias requirements are met.
The basic algorithm is to keep a count of how many of each are in the critical section, and if a thread isn't allowed access yet, to shunt it off to an appropriate queue to wait. Most of your effort will be in implementing the appropriate fairness/bias between servicing the two queues.
The following C-like pthreads-like pseudo-code illustrates what I'm trying to say.
struct rwlock {
    mutex admin;         // used to serialize access to other admin fields, NOT the critical section.
    int count;           // threads in critical section: +ve for readers, -ve for writers.
    fifoDequeue dequeue; // acts like a cond_var with FIFO behaviour and both append and prepend operations.
    void *data;          // represents the data covered by the critical section.
};
void read(struct rwlock *rw, void (*readAction)(void *)) {
    lock(rw->admin);
    if (rw->count < 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count < 0) {
        prepend(rw->dequeue, rw->admin); // Used to avoid starvation.
    }
    rw->count++;
    // Wake the new head of the dequeue, which may be a reader.
    // If it is a writer it will put itself back on the head of the queue and wait for us to exit.
    signal(rw->dequeue);
    unlock(rw->admin);

    readAction(rw->data);

    lock(rw->admin);
    rw->count--;
    signal(rw->dequeue); // Wake the new head of the dequeue, which is probably a writer.
    unlock(rw->admin);
}
void write(struct rwlock *rw, void *(*writeAction)(void *)) {
    lock(rw->admin);
    if (rw->count != 0) {
        append(rw->dequeue, rw->admin);
    }
    while (rw->count != 0) {
        prepend(rw->dequeue, rw->admin);
    }
    rw->count--;
    // As we only allow one writer in at a time, we don't bother signaling here.
    unlock(rw->admin);

    // NOTE: This is the critical section, but it is not covered by the mutex!
    // The critical section is, rather, covered by the rw-lock itself.
    rw->data = writeAction(rw->data);

    lock(rw->admin);
    rw->count++;
    signal(rw->dequeue);
    unlock(rw->admin);
}
Something like the above code is a starting point for any rwlock implementation. Give some thought to the nature of your problem and replace the dequeue with the appropriate logic that determines which class of thread should be woken up next. It is common to allow a limited number/period of readers to leapfrog writers, or vice versa, depending on the application.
Of course my general preference is to avoid rw-locks altogether; generally by using some combination of atomic operations, mutexes, STM, message-passing, and persistent data-structures. However there are times when what you really need is a rw-lock, and when you do it is useful to know how they work, so I hope this helped.
EDIT - In response to the (very reasonable) question, where do I wait in the pseudo-code above:
I have assumed that the dequeue implementation contains the wait, so that somewhere within append(dequeue, mutex) or prepend(dequeue, mutex) there is a block of code along the lines of:
while (!readyToLeaveQueue()) {
    wait(dequeue->cond_var, mutex);
}
which was why I passed in the relevant mutex to the queue operations.

Each implementation can be different, but normally they have to favor readers by default, because POSIX requires that a thread be able to obtain the read lock on an rwlock multiple times. If they favored writers, then whenever a writer was waiting, a reader would deadlock on its second read-lock attempt, unless the implementation could determine that the reader already holds a read lock; but the only way to determine that is to store a list of all threads that hold read locks, which is very inefficient in time and space.
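To see the deadlock concretely, here is a minimal sketch; the interleaving in the comments is one possible schedule, not the only one:

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

void *reader(void *arg)
{
    pthread_rwlock_rdlock(&rw);  // (1) first read lock: succeeds
    /* (2) another thread calls pthread_rwlock_wrlock(&rw) here and blocks */
    pthread_rwlock_rdlock(&rw);  // (3) POSIX says this must succeed, but a
                                 //     strictly writer-preferring implementation
                                 //     would queue us behind the writer, which
                                 //     in turn waits on our first lock: deadlock
    pthread_rwlock_unlock(&rw);
    pthread_rwlock_unlock(&rw);
    return NULL;
}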

Related

Memory order for a ticket-taking spin-lock mutex

Suppose I have the following ticket-taking spinlock mutex implementation (in C using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field to indicate which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?
Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.
typedef struct {
    u64 ticket;
    u64 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    (void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}
UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.
typedef struct {
    u32 ticket;
    u32 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
#ifdef __x86_64__
        __asm __volatile ("pause");
#endif
    }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    (void) __atomic_store_n(&m->serving, my_ticket+1, __ATOMIC_RELEASE);
}
m->ticket increment only needs to be RELAXED. You only need each thread to get a different ticket number; it can happen as early or late as you want wrt. other operations in the same thread.
load(&m->serving, acquire) is the operation that orders the critical section, preventing it from starting until we've synchronized-with a RELEASE operation in the unlock function of the previous holder of the lock. So the m->serving load needs to be at least acquire.
Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and good, since it probably means it's ready to commit sooner, reducing the time with the lock held.
Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and would slow down the thread taking the lock.
One cache line or two
For performance, I think with high contention, there are advantages to keeping the members in separate cache lines.
Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
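A sketch of that two-line layout using C11 alignment, assuming 64-byte cache lines (true for amd64 and most arm64 cores) and the question's u32 typedef:

#include <stdalign.h>

typedef struct {
    alignas(64) u32 ticket;   // line 1: contended by threads taking tickets
    alignas(64) u32 serving;  // line 2: read by spinners, written by unlock
} ticket_mutex;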
With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, not creating any extra traffic. But that wastes a lot of power unless you use something like x86 _mm_pause(), as well as wasting execution resources that could be shared with another logical core on the same physical core. x86 pause also avoids a branch mispredict when leaving the spin loop. Related:
What is the purpose of the "PAUSE" instruction in x86?
How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
Locks around memory manipulation via inline assembly
Exponential backoff up to some number of pauses between checks is a common recommendation, but here we can do better: A number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often when your ticket is coming up.
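A sketch of that proportional backoff, keeping the question's GCC builtins; the ×16 scale factor is an arbitrary tunable, not something from the answer:

void ticket_mutex_lock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    for (;;) {
        u32 serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
        if (my_ticket == serving)
            return;
        u32 distance = my_ticket - serving;  // wrap-safe unsigned distance
        for (u32 i = 0; i < distance * 16; i++) {
#ifdef __x86_64__
            __asm __volatile ("pause");      // spin longer when far from the head
#endif
        }
    }
}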
In really high contention cases, fallback to OS-assisted sleep/wake is appropriate if you'll be waiting for long, like Linux futex. Or since we can see how close to the head of the queue we are, yield, nanosleep, or futex if your wait interval will be more than 3 or 8 ticket numbers or whatever. (Tunable depending on how long it takes to serve a ticket.)
(Using futex, you might introduce a read of m->ticket into the unlock to figure out if there might be any threads sleeping, waiting for a notify, like C++20 atomic<>.wait() and atomic.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check if they're the lucky winner.)
With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot for the same core to take/release the lock.
If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.
There won't be too many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number delay the unlock, and its visibility to the thread waiting for it.
This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops the later load from even starting to request the other line, in which case you could have two cache-miss latencies in taking the lock.
Avoiding an atomic RMW in the unlock?
The thread holding the lock knows it has exclusive ownership, no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.
void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
{
    (void) __atomic_store_n(&m->serving, ticket_num+1, __ATOMIC_RELEASE);
}
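For reference, a sketch of the matching lock that hands the caller its ticket, under the same assumptions as the question's code:

u32 ticket_mutex_lock(ticket_mutex *m)
{
    u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
    while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE))
        ;                      // spin (add pause/backoff as discussed above)
    return my_ticket;          // pass this to ticket_mutex_unlock
}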
Or without that API change (to return an integer from u32 ticket_mutex_lock())
void ticket_mutex_unlock(ticket_mutex *m)
{
    // We already own the lock, and no other thread can be writing
    // concurrently, so a non-atomic increment of the loaded value is safe.
    uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
    (void) __atomic_store_n(&m->serving, ticket+1, __ATOMIC_RELEASE);
}
This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. And on x86 where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.
BTW, the lock fields would be fine as uint32_t. You're not going to have 2^32 threads waiting for a lock. So I used uint32_t instead of u64. Wrap-around is well-defined. Even subtraction like ticket - serving Just Works, even across that wrapping boundary: in 32-bit unsigned arithmetic, 1 - 0xffffffffU gives 2, so you can still calculate how close you are to being served, for sleep decisions.
Not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64. But will help significantly on some 32-bit ISAs.

Self-written Mutex for 2+ Threads

I have written the following code, and so far in all my tests it seems as if I have written a working Mutex for my 4 Threads, but I would like to get someone else's opinion on the validity of my solution.
typedef struct Mutex {
    int turn;
    int *waiting;
    int num_processes;
} Mutex;
void enterLock(Mutex *lock, int id) {
    int i;
    for (i = 0; i < lock->num_processes; i++) {
        lock->waiting[id] = 1;
        if (i != id && lock->waiting[i])
            i = -1;
        lock->waiting[id] = 0;
    }
    printf("ID %d Entered\n", id);
}
void leaveLock(Mutex *lock, int id) {
    printf("ID %d Left\n", id);
    lock->waiting[id] = 0;
}
void foo(Mutex *lock, int id) {
    enterLock(lock, id);
    // do stuff now that i have access
    leaveLock(lock, id);
}
I feel compelled to write an answer here because the question is a good one, considering it could help others to understand the general problem with mutual exclusion. In your case, you came a long way toward hiding this problem, but you can't avoid it. It boils down to this:
01 /* pseudo-code */
02 if (! mutex.isLocked())
03 mutex.lock();
You always have to expect a thread switch between lines 02 and 03. So there is a possible situation where two threads find mutex unlocked and be interrupted after that ... only to resume later and lock this mutex individually. You will have two threads entering the critical section at the same time.
What you definitely need to implement reliable mutual exclusion is therefore an atomic operation that tests a condition and at the same time sets a value without any chance to be interrupted meanwhile.
01 /* pseudo-code */
02 while (! test_and_lock(mutex));
As soon as this test_and_lock function cannot be interrupted, your implementation is safe. Until C11, C didn't provide anything like this, so implementations of pthreads needed to use e.g. assembly or special compiler intrinsics. With C11, there is finally a "standard" way to write atomic operations like this, but I can't give an example here, because I don't have experience doing that. For general use, the pthreads library will give you what you need.
edit: of course, this is still simplified -- in a multi-processor scenario, you also need to ensure that memory accesses are correctly ordered and visible across processors.
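A minimal sketch of such an uninterruptible test_and_lock using C11's <stdatomic.h> (the names here are illustrative):

#include <stdatomic.h>

static atomic_flag mutex = ATOMIC_FLAG_INIT;

void lock(void)
{
    /* test-and-set in one atomic step: returns the previous value,
     * so we own the lock exactly when we observe "was clear" */
    while (atomic_flag_test_and_set(&mutex))
        ;  /* busy wait */
}

void unlock(void)
{
    atomic_flag_clear(&mutex);
}

atomic_flag_test_and_set defaults to sequentially consistent ordering, which also covers the multi-processor memory concern mentioned in the edit.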
The problem I see in your code:
The idea behind a mutex is to provide mutual exclusion, meaning that when thread_a is in the critical section, thread_b must wait for thread_a (in case it also wants to enter).
This waiting part should be implemented in the enterLock function. But what you have is a for loop which might end way before thread_a is done with the critical section, and thus thread_b could also enter; hence you don't have mutual exclusion.
Way to fix it:
Take a look, for example, at Peterson's algorithm or Dekker's (more complicated); what they do there is what's called busy waiting, which is basically a while loop that says:
while(i can't enter) { do nothing and wait...}
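For reference, a sketch of Peterson's busy-wait for two threads (ids 0 and 1). Note it is only correct under sequential consistency, which is exactly the caveat the next answer raises, so real code would need _Atomic variables with seq_cst ordering:

volatile int flag[2] = {0, 0};  // flag[i]: thread i wants to enter
volatile int turn;              // whose turn it is to wait

void enter(int id)              // id is 0 or 1
{
    int other = 1 - id;
    flag[id] = 1;               // announce intent
    turn = other;               // politely let the other go first
    while (flag[other] && turn == other)
        ;                       // busy wait: "i can't enter yet"
}

void leave(int id)
{
    flag[id] = 0;
}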
You are totally ignoring the topic of memory models. Unless you are on a machine with a sequentially consistent memory model (which none of today's PC CPUs are), your code is incorrect, as any store executed by one thread is not necessarily immediately visible to other CPUs. However, your code seems to assume exactly that.
Bottom line: Use the existing synchronization primitives provided by the OS or a runtime library such as the POSIX or Win32 API, and don't try to be smart and implement this yourself. Unless you have years of experience in parallel programming as well as in-depth knowledge of CPU architecture, chances are quite good that you end up with an incorrect implementation. And debugging parallel programs can be hell...
After enterLock() returns, the state of the Mutex object is the same as before the function was called. Hence it will not prevent a second thread from entering the same Mutex object even before the first one has released it by calling leaveLock(). There is no mutual exclusion.

Locking Strategy for Millions of Linked Lists + Multithreading (C) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I have a C program with 16-million-odd linked lists, and 4 worker threads.
No two threads should work on the same linked list at the same time, otherwise they might be modifying it simultaneously, which would be bad.
My initial naive solution was something like this:
int linked_lists_locks[NUM_LINKED_LISTS];
for (i = 0; i < NUM_LINKED_LISTS; i++)
    linked_lists_locks[i] = 0;
then later, in a section executed by each thread as it works:
while (linked_lists_locks[some_list] == 1) {
    /* busy wait */
}
linked_lists_locks[some_list] = 1;  // mark it locked
/* work with the list */
linked_lists_locks[some_list] = 0;
However, with 4 threads and ~250,000,000 operations I quickly got into cases where two threads did the same "is it locked?" check simultaneously and problems ensued. Smart people here would have seen that coming :-)
I've looked at some locking algorithms like Dekker's and Peterson's, but they seem to be more "lock this section of code" whereas what I'm looking for is "lock this variable". I suspect that if I lock the "work with the list" section of code, everything slows to a crawl because then only one thread can work (though I haven't tried it). Essentially, each worker's job is limited to doing some math and populating these lists. Cases where each thread wants to work on the same list simultaneously are rare, btw - only a few thousand times out of 250M operations, but they do happen.
Is there an algorithm or approach for implementing locks on many variables as opposed to sections of code? This is C (on Linux if that matters) so synchronized array lists, etc. from Java/C#/et al are not available.
It would be useful to know more about how your application is organized, but here are a few ideas about how to approach the problem.
A common solution for "synchronized" objects is to assign a mutex to each object. Before working on an object, the thread needs to acquire the object's mutex; when it is done, it releases the mutex. That's simple and effective, but if you really have 16 million lockable objects, it's a lot of overhead. More seriously, if two tasks really try to work on the same object at the same time, one of them will end up sleeping until the other one releases the lock. If there was something else the tasks might have been doing, the opportunity has been lost.
A simple solution to the first problem -- the overhead of 16 million mutexes -- is to use a small vector of mutexes and a hash function which maps each object to one mutex. If you only have four tasks, and you used a vector of, say, 1024 mutexes, you will occasionally end up with a thread needlessly waiting for another thread, but it won't be very common.
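A sketch of that hashed-mutex-vector (lock striping) approach; the 1024 stripe count comes from the paragraph above, the function names are illustrative:

#include <pthread.h>

#define NUM_STRIPES 1024

static pthread_mutex_t stripes[NUM_STRIPES];  // init each with pthread_mutex_init at startup

static pthread_mutex_t *lock_for_list(size_t list_index)
{
    return &stripes[list_index % NUM_STRIPES];  // trivial hash: modulo
}

/* usage in a worker:
 *     pthread_mutex_lock(lock_for_list(some_list));
 *     ... work with linked list some_list ...
 *     pthread_mutex_unlock(lock_for_list(some_list));
 */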
If lock contention really turns out to be a problem and it is possible to vary the order in which work is done, a reasonable model is a workqueue, as sketched below. Here, when a thread wants to do something, it takes a task off the workqueue, attempts to lock the task's object (using trylock instead of lock), and if that works, does the task. If the lock fails, it just puts the task back on the workqueue and grabs another one. To avoid workqueue lock contention, it's common for threads to grab a handful of tasks instead of one; each thread then manages its own subqueue. Tuning the various parameters in this solution requires knowing at least a bit about the characteristics of the tasks. (There is a kind of race condition in this solution, but it doesn't matter; it just means that occasionally tasks will be deferred unnecessarily. But they should always get executed eventually.)
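A sketch of that trylock retry loop; the task and workqueue API here is hypothetical:

for (;;) {
    task_t *t = workqueue_pop(&queue);            // hypothetical API
    if (t == NULL)
        break;                                    // no work left
    if (pthread_mutex_trylock(&t->object->lock) == 0) {
        run_task(t);                              // we own the object
        pthread_mutex_unlock(&t->object->lock);
        free_task(t);
    } else {
        workqueue_push(&queue, t);                // object busy: defer it
    }
}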
You should use an atomic test and set operation. Unfortunately, you may need to use an assembly routine if your compiler doesn't have a built-in for that. See this article:
http://en.wikipedia.org/wiki/Test-and-set
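With GCC's __sync builtins (one such family of built-ins, available since GCC 4.1), the question's busy-wait could become something like this sketch:

/* atomically set the flag to 1 and get its old value; we own the
 * list only when the old value was 0 */
while (__sync_lock_test_and_set(&linked_lists_locks[some_list], 1))
    ;   /* busy wait */

/* work with the list */

__sync_lock_release(&linked_lists_locks[some_list]);  /* store 0, with release semantics */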
If you are absolutely forced to use this many lists, and you have very few threads, you might not want to lock the lists, but instead let each worker thread claim a single list at a time. In this case you need a structure that stores the number of the list currently held by each thread, and each list must be identified by its own unique number.
Since you didn't seem to use any library I'll add some pseudo-code to clarify my idea:
/*
 * list_number:  the number of the list you want to lock
 * my_id:        the id of the thread trying to lock this list
 * mutex:        the mutex used to control locking the lists
 * active_lists: array containing the lists currently held by the threads
 * num_threads:  size of the array and also number of threads
 */
void lock_list(int list_number, int my_id, some_mutex *mutex,
               atomic_int *active_lists, size_t num_threads) {
    int ok = 0;
    int i;
    while (true) { // busy wait to claim the lock
        // First check if anyone seems to hold the list we want.
        // Do this in a non-locking way to avoid lock contention.
        while (!ok) {
            ok = 1;
            for (i = 0; i < num_threads; ++i) {
                if (active_lists[i].load() == list_number && i != my_id) {
                    ok = 0;
                    /*
                     * We have to restart - potential to optimize:
                     * at this point, you could defer the work on this list
                     * and do some other work instead.
                     */
                    break;
                }
            }
        }
        while (try_to_lock(mutex));
        // Rerun the check to see if anyone has taken the list in the meantime
        // (ok == 1 at this point).
        for (i = 0; i < num_threads; ++i) {
            if (active_lists[i].load() == list_number && i != my_id) {
                ok = 0;
                break;
            }
        }
        // This must not be set from anywhere else!
        if (ok) active_lists[my_id].store(list_number);
        unlock(mutex);
        // If we successfully claimed the list, we are done; otherwise start over.
        if (ok) break;
    }
}
There are a few constraints on the pseudo-types. some_mutex obviously has to be a mutex. What I call atomic_int here must somehow support fetching its latest value from main memory, to prevent you from seeing stale cached values. The same goes for the store: it must not be cached core-locally before being written. Using a regular int together with lfence, sfence and/or mfence may work as well.
There are obviously some trade-offs here, where the main one is probably memory vs speed. This example will create contention at the single mutex used to store which list you have locked, so it will scale poorly with a large number of threads, but well with a large number of lists. If lists are claimed infrequently this would work well even at a larger number of threads. The advantage is that the storage requirement depends mainly on the number of threads. You have to pick a storage type which can hold a number equivalent to the maximum number of lists though.
I am not sure what exactly your scenario is, but recently lock-free lists have also gained some momentum. With the introduction of advanced support for lock-free code in C11 and C++11, there have been a few working (as in not shown to be broken) examples around. Herb Sutter gave a talk on how to do this in C++11. It is C++, but he discusses the relevant points of writing a lock free singly linked list, which are also true for plain old C. You can also try to find an existing implementation, but you should inspect it carefully because this is kind of bleeding edge stuff. However using lock-free lists would erase the need to lock at all.

How Compare and Swap works

I have read quite some posts that say compare and swap guarantees atomicity, However I am still not able to get how does it. Here is general pseudo code for compare and swap:
int CAS(int *ptr, int oldvalue, int newvalue)
{
    int temp = *ptr;
    if (*ptr == oldvalue)
        *ptr = newvalue;
    return temp;
}
How does this guarantee atomicity? For example, if I am using this to implement a mutex,
void lock(int *mutex)
{
    while (!CAS(mutex, 0, 1));
}
how does this prevent 2 threads from acquiring the mutex at the same time? Any pointers would be really appreciated.
"general pseudo code" is not an actual code of CAS (compare and swap) implementation. Special hardware instructions are used to activate special atomic hardware in the CPU. For example, in x86 the LOCK CMPXCHG can be used (http://en.wikipedia.org/wiki/Compare-and-swap).
In gcc, for example, there is __sync_val_compare_and_swap() builtin - which implements hardware-specific atomic CAS. There is description of this operation from fresh wonderful book from Paul E. McKenney (Is Parallel Programming Hard, And, If So, What Can You Do About It?, 2014), section 4.3 "Atomic operations", pages 31-32.
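A sketch of the question's lock() on top of that builtin (spin-only, with no futex fallback):

void lock(int *mutex)
{
    /* returns the old value: 0 means we changed 0 -> 1 and own the lock */
    while (__sync_val_compare_and_swap(mutex, 0, 1) != 0)
        ;  /* spin: someone else holds it */
}

void unlock(int *mutex)
{
    __sync_lock_release(mutex);  /* atomic store of 0 with release semantics */
}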
If you want to know more about building higher-level synchronization on top of atomic operations, and about saving your system from spinlocks and from burning CPU cycles on active spinning, you can read about the futex mechanism in Linux. A good paper on futexes is "Futexes Are Tricky" by Ulrich Drepper (2011 revision); there is also the LWN article http://lwn.net/Articles/360699/ (and the historic "Fuss, Futexes and Furwocks: Fast Userland Locking in Linux", 2002).
The mutex locks Drepper describes use atomic operations only for the "fast path" (when the mutex is not locked and our thread is the only one that wants to lock it). If the mutex was already locked, the thread goes to sleep using futex(FUTEX_WAIT, ...), after marking the mutex variable with an atomic operation to tell the unlocking thread "there is somebody sleeping, waiting on this mutex", so the unlocker knows it must wake them using futex(FUTEX_WAKE, ...).
How does it prevent two threads from acquiring the lock? Well, once any one thread succeeds, *mutex will be 1, so any other thread's CAS will fail (because it's called with expected value 0). The lock is released by storing 0 in *mutex.
Note that this is an odd use of CAS, since it's essentially requiring an ABA-violation. Typically you'd just use a plain atomic exchange:
while (exchange(mutex, 1) == 1) { /* spin */ }
// critical section
*mutex = 0; // atomically
Or if you want to be slightly more sophisticated and store information about which thread has the lock, you can do tricks with atomic-fetch-and-add (see for example the Linux kernel spinlock code).
You cannot implement CAS in portable C (prior to C11's <stdatomic.h>). It's done at the hardware level, in assembly or via compiler intrinsics.

Multithreading and mutexes

I'm currently beginning development on an indie game in C using the Allegro cross-platform library. I figured that I would separate things like input, sound, game engine, and graphics into their own separate threads to increase the program's robustness. Having no experience in multithreading whatsoever, my question is:
If I have a section of data in memory (say, a pointer to a data structure), is it okay for one thread to write to it at will and another to read from it at will, or would each thread have to use a mutex to lock the memory, then read or write, then unlock?
In particular, I was thinking about the interaction between the game engine and the video renderer. (This is in 2D.) My plan was for the engine to process user input, then spit out the appropriate audio and video to be fed to the speakers and monitor. I was thinking that I'd have a global pointer to the next bitmap to be drawn on the screen, and the code for the game engine and the renderer would be something like this:
ALLEGRO_BITMAP *nextBitmap;
boolean using;

void GameEngine ()
{
    ALLEGRO_BITMAP *oldBitmap;
    while (ContinueGameEngine())
    {
        ALLEGRO_BITMAP *bitmap = al_create_bitmap (width, height);
        MakeTheBitmap (bitmap);
        while (using) ; // The other thread is using the bitmap. Don't mess with it!
        al_destroy_bitmap (nextBitmap);
        nextBitmap = bitmap;
    }
}

void Renderer ()
{
    while (ContinueRenderer())
    {
        ALLEGRO_BITMAP *bitmap = al_clone_bitmap (nextBitmap);
        DrawBitmapOnScreen (bitmap);
    }
}
This seems unstable... maybe something would happen in the call to al_clone_bitmap but I am not quite certain how to handle something like this. I would use a mutex on the bitmap, but mutexes seem like they take time to lock and unlock and I'd like both of these threads (especially the game engine thread) to run as fast as possible. I also read up on something called a condition, but I have absolutely no idea how a condition would be applicable or useful, although I'm sure they are. Could someone point me to a tutorial on mutexes and conditions (preferably POSIX, not Windows), so I can try to figure all this out?
If I have a section of data in memory (say, a pointer to a data structure), is it okay for one thread to write to it at will and another to read from it at will
The answer is "it depends" which usually means "no".
Depending on what you're writing/reading, and depending on the logic of your program, you could wind up with wild results or corruption if you try writing and reading with no synchronization and you're not absolutely sure that writes and reads are atomic.
So you should just use a mutex unless:
You're absolutely sure that writes and reads are atomic, and you're absolutely sure that one thread is only reading (ideally you'd use some kind of specific support for atomic operations such as the Interlocked family of functions from WinAPI).
You absolutely need the tiny performance gain from not locking.
Also worth noting that your while (using); construct would be a lot more reliable, correct, and would probably even perform better if you used a spin lock (again if you're absolutely sure you need a spin lock, rather than a mutex).
The tool that you need is called atomic operations, which would ensure that the reader thread only reads whole data as written by the other thread. If you don't use such operations, the data may be read only partially, and thus what it reads may make no sense at all in terms of your application.
The new standard C11 has these operations, but it is not yet widely implemented. Many compilers have extensions that implement them, though; e.g. gcc has a series of builtin functions that start with a __sync prefix.
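A sketch of the question's bitmap handoff using GCC's newer __atomic builtins (the __sync family works similarly); one writer and one reader exchange ownership of the pointer, so neither ever touches a bitmap the other holds, and the `using` flag becomes unnecessary. The function names are illustrative; the al_* calls come from the question:

ALLEGRO_BITMAP *nextBitmap = NULL;   /* shared between the two threads */

/* game engine thread: publish the finished frame, reclaim the old one */
void publish_bitmap(ALLEGRO_BITMAP *bitmap)
{
    ALLEGRO_BITMAP *old =
        __atomic_exchange_n(&nextBitmap, bitmap, __ATOMIC_ACQ_REL);
    if (old != NULL)
        al_destroy_bitmap(old);      /* no other thread holds it now */
}

/* renderer thread: take ownership of the latest frame, if any */
ALLEGRO_BITMAP *take_bitmap(void)
{
    return __atomic_exchange_n(&nextBitmap, NULL, __ATOMIC_ACQ_REL);
}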
There are a lot of man pages to be found via Google; searching for a few minutes turned up http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html
Besides that, begin with a small example and increase the difficulty: first thread creation and termination, then thread return values, then thread synchronization. Continue with POSIX mutexes and conditions, and make sure you understand all these terms.
One important documentation source is the Linux man and info pages.
Good luck
If I have a section of data in memory (say, a pointer to a data structure), is it okay for one thread to write to it at will and another to read from it at will, or would each thread have to use a mutex to lock the memory, then read or write, then unlock?
If you have a section of data in memory where two different threads are reading and writing, that section is called a critical section, and the situation is a common instance of the producer-consumer problem.
There are many resources that speak to this issue:
https://docs.oracle.com/cd/E19455-01/806-5257/sync-31/index.html
https://stackoverflow.com/questions/tagged/producer-consumer
But yes, if you are going to use two different threads to read and write, you will have to use mutexes or another form of locking and unlocking.
