Which OS support an efficient file system write barrier? - filesystems

Do file system or disk drivers support a concept of file system modification fence, like a memory fence in the CPU or shared memory system?
A fence is an instruction that separate memory operations such that globally visible memory accesses after the fence event are not detectable until all those that come before it.
Is such feature available for file content modification (and repertory modification), in an efficient way? Of course a simplistic solution would be wait until all writes are written to stable storage; that however is blocking the application and could be inefficient if many synchronization points are needed. Also, it could cause many small individual writes when one big write (including many writes separated by fences) satisfies the same constrains on fully journalled system, or when the biggest write is guaranteed to be atomic by the disk driver.
Can file system drivers be forced to order writes with file system access fences? Has the concept been explored?
PRECISION
The context of the question is not multiple processes accessing the same files in a racy way but one process saving data in a database such that an interruption of the process (even a computer crash) should leave only one sequence of modifications (between two fences) partially written.

Related

Why do we even need cache coherence?

In languages like C, unsynchronized reads and writes to the same memory location from different threads is undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.
Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?
The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write-back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.
But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
C11 and C++11 added stdatomic and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:
http://eel.is/c++draft/intro.races#19
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note
(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)
Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by other cores to not step on each other.
ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another writes arr[1]. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. C++ memory model and race conditions on char arrays
ISO C effectively requires char to be as large as smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.
The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.
Ah, a very deep topic indeed!
Cache coherency between cores is used to synthesise (as closely as possible) and Symetric Multi Processing (SMP) environment. This harks back to the days when multiple single core CPUs were simply tagged on to the same single memory bus, circa mid 1990s, caches weren't really a thing, etc. With multiple CPUs with multiple cores each with multiple caches and multiple memory interfaces per CPU, the synthesis of an SMP-like environment is a lot more complicated, and cache-coherency is a big part of that.
So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".
The answer is software. An awful lot of software, including all major OSes, has been written around the assumption that they're running on an SMP environment. Take away the SMP, and we'd have to re-write literally everything.
There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and that we should start worrying about how to get out of that dead end. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to avoid SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available security alone will not be sufficient reason to ditch SMP.
"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.
CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go, Rust, Erlang implement it. The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!
Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.
Rust is very interesting, because with it's syntax strongly expressing data ownership I suspect that it doesn't have to copy data to implement CSP, it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP, but without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a CPU single core. I've not yet explored Rust deeply enough to know that that is what it's doing, but I have hopes.
On of the nice things about CSP is that with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal - they're asynchronous and so more akin to Actor Model (as is ZeroMQ). Actor Model is fairly OK - and I've used it - but it's not as guarateed devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.
It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc to scale up across multiple machines on a network. In fact, it can't, and has to be re-written.
More recently than Transputers, the Cell processor (Playstation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core, and 8 SPE maths cores each with 255k on-chip core-speed static RAM. To use the SPEs you had to write software to ships code and data in and out of that 256k (there was a monster-fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.
It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding in a Fused Multply-Add instruction into SSE was what finally got them there, an instruction they'd been keeping in Itanium's repetoire in the vain hope of boosting its appeal. Cell (the SPEs were based in the PowerPC equivalent of SSE - Altivec) had had an FMA instruction from the get-go.
Cache coherency is not needed if a developer takes care of
issuing lock(+ memory barriers) / (mem. barrier)unlock irrespective of it.
Cache coherency is of little value, or even has a negative value
in terms of cost, power, performance, validation etc.
Today, software is more and more distributed. Any way, coherency
can't help two processes running on two different machines.
Even for multi-threaded SW, we end up using IPC.
only small part of data is shared in multi threaded sw.
A large part of data is not shared, if shared, memory barriers should
solve cache syncing.
practically, SW developers depend on explicit locks to access shared data. These locks can efficiently flush the caches with h/w assistance (efficient means, only the caches lines that are modified AND also cached else where). And already, this is exactly done when we lock/unlock. Since every does above said lock/unlock, then Cache coherency is redundant and wastage of silicon space/power, hw engineers sleep.
all compilers(at least C/C++, python VM ) generate code for single threaded, non shared data. If I need to share the data, I just tell it is shared, but not how and why (volatile?). Developers need to take care of managing (again lock/unlock across hw cores/sw threads/hw threads). Most of the time, we write in HLL with non-atomic data. Cache-coherency does not add any value to developers, so, he/she fall back to managing it through locks, which instruct the cache system to efficiently flush. All caches systems have logs of cache lines to flush efficiently w/ or w/o coherency support. (think of cached but non coherent memory. this memory still has logs which can be used for efficient flushing)
Cache coherency very complex on silicon, consuming space and power.
In any case, SW developers takes care of issuing memory barriers (via locks).
So, I think, it is good to get rid of coherency and tell developers to own it.
But I see, trend is opposite.
Look at CXL memory etc... It is coherent.
I am looking for a system call where I can just turn off the cache coherency
for my threads and see experiment

Why do system calls exist

I have been reading about system calls and how they work in Linux. I still have more reading to do but one thing that nothing I have read has answered is, WHY do we need system calls?
I understand that system calls are requests from user space program for the kernel to do something, but my question is basically: Why can't the user space program do the thing itself? Why doesn't Glibc do the actual operation instead of just being a wrapper for a system call?
For example, if I call fopen() in my program, why does glibc call the open system call? Why doesn't glibc just do the operation itself?
I understand that it would mean that glibc developers would have a lot more work and that they would have to have an intimate knowledge of Linux, but isn't glibc already very closely related to Linux kernel?
Also, I understand the system call functions are run in ring 0 in the CPU...but what's really the point of that? If I execute a program, I am giving it express permission to run, so what security is added by separating what code can be run in different contexts since you are giving it all permission anyway?
Why doesn't glibc just do the operation itself?
Well that is more less the ways things went in good old MS/DOS systems: no separation between kernel code and user code, and user code could happily directly access the hardware.
This just has 2 major problems:
It works (rather) fine on a single user and not multi tasking system, but as soon as multiple programs can simultaneously run in a system, you have to synchronize hardware accesses and memory usage => those are the parts dedicated to the kernel
There is no protection of the system from a poorly coded program. In a modern OS, an erroneous program can crash, but the system itself should survive. In MS/DOS a program crash usually ended in a system reboot.
For those reasons, all modern OS (except maybe some lightweight embedded ones) use isolation between different user processes and the kernel. And that just mean that you need a way to allow a user mode process to require a privileged action (reading or writing a physical disk is) from the kernel: that is exactly what system calls are made for.
Why doesn't glibc just do the operation itself?
Short answer: Because it can't.
Long answer:
A program running in Linux can run in two modes : UserLand or KernelLand.
The Kernel Land has every rights and can do everything, including talking with hardware, or providing userspace callbacks. For instance, when you call fopen(), the kernel does all the dirty talking with your filesystem (ext4 for instance), the caching, everything down to talking with the SATA Controller to access data on the hard-drive.
GLibc could do that using the device exposed by the kernel in /dev, but that would mean recoding from scratch all the filesystems layers, the sockets, the firewalling...
The kernel just provides easy usable API for programmers to have elevated rights and communicate with the devices. That's how Linux (and most modern OS) is made.
What security is added by separating what code can be run in different contexts since you are giving it all permission anyway?
The permissions are managed by the kernel. If you don't have syscall, you don't have permissions. Or should the program you run check their own permission? Once again, it would be reinventing the wheel every time.
If the code generated by a C implementation were the only thing that were going to be running on the target system (as it would be for many freestanding implementations, and for a very small number of hosted implementations) and if implementation knew precisely what hardware it would be running upon (true of some freestanding implementations, but seldom true for hosted ones), its runtime library might be able to perform operations like "fopen" by directly communicating with the storage hardware. It is rare, however, for either condition to apply, much less both of them.
If multiple programs will be using storage device, it will generally be necessary that they either coordinate their actions somehow or else that sequences of operations performed by different programs do not overlap, and that every program "forget" anything it thinks it knows about the state of storage any time another program might have written to it.
Otherwise, suppose a disk contains a single file and program #1 uses "fopen" to open it for reading. Each directory sector holds 8 entries, so the program would read the first directory sector and observe that slot #0 identifies the file of interest while #1-#7 are blank.
Now suppose program #2 uses "fopen" to create a file for writing. It would read the directory sector, observe that slots #1-#7 are blank, and rewrite the directory sector with information about the new file in slot #1.
Finally, suppose program #1 wants to write a file. If it doesn't know about program #2, it might reasonably believe it knows what the directory contains (it had read it earlier, and has no reason to believe it's changed), place information about the new file in slot #1, and replace the directory sector on disk with its new version, obliterating the entry written by program #2.
Having both programs route their operations through an operating system ensures that when program #2 wants to create its file, it can exploit the fact that it had just read the directory for program #1 (and thus doesn't need to reread it). More importantly, when program #1 goes to write a file, the operating system will know that the directory contains the file written by program #2, and will thus ensure that the new file gets placed in slot #2.
Contrary to what other answers say, even microcomputer C implementations running on platforms like MS-DOS essentially always relied upon the OS for file I/O. Some would include their own console I/O routines because the ones in MS-DOS were about four times as slow as they should have been, but the need for coordination when using file I/O meant that very few programs would try to do it themselves.
1- You don't wanna deal with low-level hardware communications. At least most people don't. Each of them has hundreds of commands.
2- Make a simple mistake and your CPU/RAM or I/O device might be useless forever.
3- When you are part of a network, you can share resources. The system calls and kernel keeps your co-worker from damaging your hard disk.
Another consideration is that the OS kernel needs to provide an abstraction for the myriad different types of hardware via a uniform API - without which you'd invariably be making device specific calls in your program.
While the previously-idle disk spins up for two seconds, or the networked disk gets connected for thirty seconds, what is the library going to do?
The full answer to your question is very broad but let me take a simple example based upon your question about fopen.
Let us say that we have a large system that has hundred or thousands of users. One of those users is, say the HR department with files containing confidential information about employees.
If that disk could be accessed at will in user mode, then any person on the system could open any file on the system, including those with confidential information.
In other words operating systems managed SHARED resources. These include disk, CPU, and memory. If these could be controlled in user mode, there would be no way to ensure that these were shared equitably.

When to use mmap vs when to use read and write with cache layer?

When looking into DB storage engines, it seems most use mmap to persist. However, is there a situation where writing to a cache layer and writing binary to disk using read and write makes sense?
What I'm trying to understand is what is the difference between mmap and unmmap vs read and write? And when to use the one or the other?
If you can feasibly use mmap(), it's usually the better way. When you use read()/write(), you have to perform a system call for every operation (although libraries like stdio minimize this with user-mode buffering), and these context switches are expensive. Even if the file block is in the buffer cache, you have to first switch into the kernel to check for it. Additionally, the kernel needs to copy the data from the kernel buffer to the caller's memory.
On the other hand, when you use mmap(), you only have to perform a system call when you first open and map the file. From then on, the virtual memory subsystem keeps the application memory synchronized with the file contents. Context switches are only necessary when you try to access a file block that hasn't yet been paged in from disk, not for each part of the file you try to read or write. When you modify the mapped memory, it gets written back to the file lazily.
For most practical applications, you should use whichever method fits the logic of the application best. The performance difference between the two methods will only be significant in highly time-critical applications. When implementing a library, you can't tell the needs of client applications, so of course you try to wring every bit of performance out of it. But for many other applications, premature optimization is the root of all evil.

Multiprocessors vs Multithreading in the context of PThreads

I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs1 handle RAM locally and use a separate channel2 to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to leave the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program; the best architecture depends heavily in the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important on which cell each thread happens to be.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third; performance would suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing within the group, then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
1 by modern, I mean any AMD-64bit chip, and Nehalem or better for Intel.
2 AMD calls this channel HyperTransport, and Intel name is QuickPath Interconnect
EDIT:
You mention that you initialize "a big chunk of read-only memory". And then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialize it on the thread, after spawning it. That would allow the threads to spread to several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate away the threads as soon as they're spawned, but I don't know the details.
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
it's less code!
Non needed data won't be read, won't use up RAM.
You can specifically mark as read-only. Any bug that tries to write will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - your wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There's a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called False Sharing. I suggest the following read to understand if that may be an issue you are facing
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.

How shared memory would be accessed in manycore systems

In multicore systems, such as 2, 4, 8 cores, we typically use mutexes and semaphores to access shared memory. However, I can foresee that these methods would induce a high overhead for future systems with many cores. Are there any alternative methods that would be better for future many core systems for accessing shared memories.
Transactional memory is one such method.
I'm not sure how far in the future you want to go. But in the long-long run, shared memory as we know it right now (single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make the lives of programmers harder as it did when we went to multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared-memory is not scalable in the long run is simply due to physics. (similar to how single-core/high-frequency hit a barrier)
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the memory part your are accessing, and not the entire table ! This is done with the help of a big hash table. The bigger the table, the finer the lock mechanism is.
2) If you can, only lock on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid case).
Access to shared memory at the lowest level in any multi-processor/core/threaded application synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states as it also encompasses locking those I/O buses that have bus-mastering devices including DMA. Theoretically it is possible to envision a medium-level lock that can be invoked in situations when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory which is fast, at least in comparison to latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock adds worrying implications to its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
As for any forseeable future the classic approach of locking as seldom as possible and for as short a time as possible by implementing efficient algorithms (and associated data structures) that are used when the locks are in place still holds true.

Resources