For example, in a dual socket system with 2 quad core processors, does the thread scheduler tries to keep the threads from the same processes in the same processor? Because interleaving threads of different processes in different processors would slow down performance in the case where threads in a process have a lot of shared memory accesses.
It depends.
On current Intel platforms the BIOS default seems to be that memory is interleaved between the sockets in the system, page by page. Allocate 1Mbyte and half will be on one socket, half on the other. That means that wherever your threads are they have equal access to the data.
This makes it very simple for OSes - anywhere will do.
This can work against you. The SMP hardware environment presented to the OS is synthesised by the CPUs cooperating over QPI. If there's a lot of threads all accessing the same data then those links can get real busy. If they're too busy then that limits the performance, and you're I/O bound. That's where I am; Z80 cores with Intel's memory subsystem design would be just as quick as the nahelem cores I've actually got (ok I might be exagerating...).
At the end of the day the real problem is that memory just isn't quick enough. Intel and AMD have both done some impressive things with memory recently, but we're still hampered by its slowness. Ideally memory would be quick enough so that all cores had clock rate access times to it. The Cell processor sort of did this - each SPE has a bit of SRAM instead of a cache, and once you get your head round them you can make them really sing.
===EDIT===
There is more to it. As Basile Starynkevitch hints the alternate approach is to embrace NUMA.
NUMA is what modern CPUs actually embody, the memory access being non-uniform because the memory on the other CPU sockets is not accessible directly by addressing a bus. The CPUs instead make a request for data over the QPI link (or Hypertransport in AMD's case) to ask the other CPU to fetch data out of its memory and send it back. Because the CPU is doing all this for you in hardware it ends up looking like a conventional SMP environment. And QPI / Hypertransport are very fast, so most of the time it's plenty quick enough.
If you write your code so as to mirror the architecture of the hardware you can in theory make improvements. So this might involve (for example) having two copies of your data in the system, one on each socket. There's memory affinity routines in Linux to specifically allocate memory that way instead of interleaved across all sockets. There's also CPU affinity routines that allow you to control which CPU core a thread is running on, the idea being you run it on a core that is close to the data buffer it will be processing.
Ok, so that might mean a lot of investment in the source code to make that work for you (especially if that data duplication doesn't fit well with the program's flow), but if the QPI has become a problematic bottle neck it's the only thing you can do.
I've fiddled with this to some extent. In a way it's a right faff. The whole mindset of Intel and AMD (and thus the OSes and libraries too) is to give you an SMP environment which, most of the time, is pretty good. However they let you play with NUMA by having a load of library functions you have to call to get the deployment of threads and memory that you want.
However for the edge cases where you want that little bit extra speed it'd be easier if the architecture and OS was rigidly NUMA, no SMP at all. Just like the Cell processor in fact. Easier, not because it'd be simple to write (in fact it would be harder), but if you got it running at all you'd then know for sure that it was as quick as the hardware could ever possibly achieve. With the faked SMP that we have right now you experiment with NUMA but you're mostly left wondering if it's as fast as it possibly could be. It's not like the libraries tell you that you're accessing memory that actually resident on another socket, they just let you do it with no hint that there's room for improvement.
Related
In languages like C, unsynchronized reads and writes to the same memory location from different threads is undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.
Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?
The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write-back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.
But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
C11 and C++11 added stdatomic and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:
http://eel.is/c++draft/intro.races#19
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note
(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)
Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by other cores to not step on each other.
ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another writes arr[1]. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. C++ memory model and race conditions on char arrays
ISO C effectively requires char to be as large as smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.
The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.
Ah, a very deep topic indeed!
Cache coherency between cores is used to synthesise (as closely as possible) and Symetric Multi Processing (SMP) environment. This harks back to the days when multiple single core CPUs were simply tagged on to the same single memory bus, circa mid 1990s, caches weren't really a thing, etc. With multiple CPUs with multiple cores each with multiple caches and multiple memory interfaces per CPU, the synthesis of an SMP-like environment is a lot more complicated, and cache-coherency is a big part of that.
So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".
The answer is software. An awful lot of software, including all major OSes, has been written around the assumption that they're running on an SMP environment. Take away the SMP, and we'd have to re-write literally everything.
There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and that we should start worrying about how to get out of that dead end. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to avoid SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available security alone will not be sufficient reason to ditch SMP.
"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.
CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go, Rust, Erlang implement it. The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!
Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.
Rust is very interesting, because with it's syntax strongly expressing data ownership I suspect that it doesn't have to copy data to implement CSP, it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP, but without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a CPU single core. I've not yet explored Rust deeply enough to know that that is what it's doing, but I have hopes.
On of the nice things about CSP is that with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal - they're asynchronous and so more akin to Actor Model (as is ZeroMQ). Actor Model is fairly OK - and I've used it - but it's not as guarateed devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.
It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc to scale up across multiple machines on a network. In fact, it can't, and has to be re-written.
More recently than Transputers, the Cell processor (Playstation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core, and 8 SPE maths cores each with 255k on-chip core-speed static RAM. To use the SPEs you had to write software to ships code and data in and out of that 256k (there was a monster-fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.
It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding in a Fused Multply-Add instruction into SSE was what finally got them there, an instruction they'd been keeping in Itanium's repetoire in the vain hope of boosting its appeal. Cell (the SPEs were based in the PowerPC equivalent of SSE - Altivec) had had an FMA instruction from the get-go.
Cache coherency is not needed if a developer takes care of
issuing lock(+ memory barriers) / (mem. barrier)unlock irrespective of it.
Cache coherency is of little value, or even has a negative value
in terms of cost, power, performance, validation etc.
Today, software is more and more distributed. Any way, coherency
can't help two processes running on two different machines.
Even for multi-threaded SW, we end up using IPC.
only small part of data is shared in multi threaded sw.
A large part of data is not shared, if shared, memory barriers should
solve cache syncing.
practically, SW developers depend on explicit locks to access shared data. These locks can efficiently flush the caches with h/w assistance (efficient means, only the caches lines that are modified AND also cached else where). And already, this is exactly done when we lock/unlock. Since every does above said lock/unlock, then Cache coherency is redundant and wastage of silicon space/power, hw engineers sleep.
all compilers(at least C/C++, python VM ) generate code for single threaded, non shared data. If I need to share the data, I just tell it is shared, but not how and why (volatile?). Developers need to take care of managing (again lock/unlock across hw cores/sw threads/hw threads). Most of the time, we write in HLL with non-atomic data. Cache-coherency does not add any value to developers, so, he/she fall back to managing it through locks, which instruct the cache system to efficiently flush. All caches systems have logs of cache lines to flush efficiently w/ or w/o coherency support. (think of cached but non coherent memory. this memory still has logs which can be used for efficient flushing)
Cache coherency very complex on silicon, consuming space and power.
In any case, SW developers takes care of issuing memory barriers (via locks).
So, I think, it is good to get rid of coherency and tell developers to own it.
But I see, trend is opposite.
Look at CXL memory etc... It is coherent.
I am looking for a system call where I can just turn off the cache coherency
for my threads and see experiment
Context :
Debian 64 bits.
Making a linux-only userspace networking stack that I may release open source.
Everything is ready but one last thing.
The problem :
I know about poll/select/epoll and use them heavily already but they are too complicated for my need and tend to add latency (few nanoseconds -> too much).
The need :
A simple mean to notify from the kernel to an application that packets are to be processed and the reverse with a shared mmap file operating as a multi-ring buffer. It would obviously not incur a context-switch.
I wrote a custom driver for my NIC (and plan to create others for the big league -> 1-10Gb).
I would like two shared arrays of int and two shared arrays of char. I have the multiprocess and non blocking design already working.
A peer (int and char) for the kernel -> app direction; another for app -> kernel.
But how to notify at the very moment mmap has changed. I read a msync would do it but it is slow too. That is my problem.
Mutexes lead to dead slow code. Spinlocks tend to waste cpu cycles on overload.
Not talking about a busy while(1) loop always reading -> cpu cycles waste.
What do you recommend ?
It is my last step.
Thanks
Update:
I think i will have to pay the latency of setting the interrupt mask anyway. So it should ideally be amortized by the number of incoming packets during that required latency. The first few packets after a burst will always be slower i guess since i obviously don't infinite loop.
The worst case would be that packets are sparse to come (hence why seeking saturating link performance at the first place). That worts case will be met at times. But who cares, it is still faster than the stock kernel anyway. Trade-offs trade-offs :)
It seems like you are taking approaches which are common to networking in embedded systems based on RTOS.
In Linux you are not supposed to write your own network stack - Linux kernel already have a good network stack. You are just expected to implement a NIC device driver (in kernel) which hands over all the packets for processing by the Linux network stack.
Any Linux network related components are always in the kernel - and the problems you describe provide some explanation why that is essential for reasonable performance.
The only exception is userspace network filters (e.g., for firewalls) that may be hooked to the iptables mechanism - and those incur higher latencies on packets that routed through them.
In Modern multicore processors, we normally have a local L1 cache but a shared L2 cache. Is it possible to bypass the L1 cache for some portion of the memory while still using L2 cache for it? I want to do this to improve timing predictability, at the cost of performance it may be.
As far as I know, there is no way to bypass the L1 cache on mainstream CPUs.
However, to achieve your goal (i.e. avoid cache misses that may cause variation in timings mesurements), you may try to ask your compiler to prefetch the data into the cache.
If you use GCC or LLVM, see __builtin_prefetch.
However, your question is quite vague, and I am unsure that your proposal will suit your needs.
Caches
I strongly suspect that you have misunderstood what a cache does and what it is for.
Caches are transparent from the point of view of memory contents. If one core writes to a memory location then every other core whose caches (L1, L2, L3 etc), shared or not, happen to be caching that location will get updated also.
Note that that does not mean that the cores aren't able to race for the value. You can still have a race condition whereby one core reading a location fractionally before another writes it 'gets the wrong value'. Furthermore that will happen whether or not your CPU has caches of any sort. To solve that 'ordering' problem you have to use semaphores or other IPC primitives in your source code.
Some cache systems do allow you to 'drop hints' to them. Matthieu Rouget gave an example of that with __builtin_prefetch. These sorts of things allow the programmer to tell the cache system that it might well be worth getting some data in advance. Some systems (e.g. PowerPC 7450) sort of allowed the programmer to use part of the cache as memory instead of cache, kind of the ultimate in programmer cache control.
However, none of these things make any difference to the view of memory that all the caches have. If one cache's contents get updated, the rest are also updated.
Caches and Performance Programming
The very best programmers are able to extract peak performance from a CPU by coding around the behaviour of the cache. In that realm one generally finds oneself wishing that the cache wasn't there at all. The ultimate embodiment of this is the Cell processor in the PS3. The maths cores on that have no cache at all. Instead you have to in effect do all your own data fetching and write back yourself in your source code, rather than leave it up to some cache to second guess what data your program is going to ask for. Get it right and the performance is still blisteringly good.
Bus Snooping
Some CPUs don't have cache bus snooping, which can be a particular problem when writing device drivers. Bus snooping is a mechanism whereby the CPU caches spot the content of memory being updated by something other than the CPU cores (e.g. by a DMA controller reading data from a device). And the same the other way round - DMAs from memory get values currently stuck in cache. AFAIK almost all CPUs these days do bus snooping, so that is not likely to be a problem.
On systems with IO as well as memory address spaces (e.g. Intel) I don't think that I/O address space is cached anyway. For systems with memory mapped devices their memory is generally not cached either, and the OS sets up the CPU that way (see this).
Timing Predictability
To return to the reason for your question - timing predictability. You may be using the wrong technology. If your system has timing constraints whereby the problem is variations in main memory write times, then frankly using a multicore CPU sounds like the wrong thing in the first place. #Griwes is quite right on that point (and indeed the entire comment). You'll more likely need to resort to a pure hardware design, something along the lines of an FPGA (no comments about whether firmware is really software please!).
If, as I suspect, you're actually trying to avoid using semaphores and other IPC primitives to synchronise two threads in your system then you're not going to succeed, shared caches or not. You need to use semaphores and such to make your code work properly.
I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs1 handle RAM locally and use a separate channel2 to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to leave the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program; the best architecture depends heavily in the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important on which cell each thread happens to be.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third; performance would suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing within the group, then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
1 by modern, I mean any AMD-64bit chip, and Nehalem or better for Intel.
2 AMD calls this channel HyperTransport, and Intel name is QuickPath Interconnect
EDIT:
You mention that you initialize "a big chunk of read-only memory". And then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialize it on the thread, after spawning it. That would allow the threads to spread to several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate away the threads as soon as they're spawned, but I don't know the details.
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
it's less code!
Non needed data won't be read, won't use up RAM.
You can specifically mark as read-only. Any bug that tries to write will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - your wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There's a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called False Sharing. I suggest the following read to understand if that may be an issue you are facing
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.
In multicore systems, such as 2, 4, 8 cores, we typically use mutexes and semaphores to access shared memory. However, I can foresee that these methods would induce a high overhead for future systems with many cores. Are there any alternative methods that would be better for future many core systems for accessing shared memories.
Transactional memory is one such method.
I'm not sure how far in the future you want to go. But in the long-long run, shared memory as we know it right now (single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make the lives of programmers harder as it did when we went to multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared-memory is not scalable in the long run is simply due to physics. (similar to how single-core/high-frequency hit a barrier)
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the memory part your are accessing, and not the entire table ! This is done with the help of a big hash table. The bigger the table, the finer the lock mechanism is.
2) If you can, only lock on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid case).
Access to shared memory at the lowest level in any multi-processor/core/threaded application synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states as it also encompasses locking those I/O buses that have bus-mastering devices including DMA. Theoretically it is possible to envision a medium-level lock that can be invoked in situations when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory which is fast, at least in comparison to latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock adds worrying implications to its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
As for any forseeable future the classic approach of locking as seldom as possible and for as short a time as possible by implementing efficient algorithms (and associated data structures) that are used when the locks are in place still holds true.