What is the most efficient way to share data between multiple cores? Sure, you can use shared memory, but that also comes at a cost. Say one core is continuously writing to a variable and another core has to continuously read from it. With the MESI cache coherence protocol, the writing core will cause the reading core to invalidate its cached copy every now and then. So in this scenario, what is the most efficient way of sharing data?
On a typical shared memory machine, the scenario that you describe is probably already the most efficient method that is possible:
Core A writes to the memory location, which invalidates Core B's copy.
Core B grabs the data from memory or from Core A's cache.
Either way, the data must be sent from Core A to Core B. Cache coherency in modern processors will sometimes support direct cache-to-cache transfer without going all the way to memory.
What you want to avoid (whenever possible) is excessive locking of the shared resource. That will increase cache coherency traffic (and latency) in both directions.
One common and general approach is to have per-core data structures when possible.
For example, in a producer-consumer scenario, each of the consumer processors can have a part of the queue and operate on it. They can contact the producer processor only when they run out of work-items.
Of course, this is not always possible, but if the work items can be partitioned this way, it reduces the inter-dependence between the cores and lets the application scale up with the number of cores.
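As a rough illustration (the names and the batch size of 64 are mine, not from any particular library), a consumer that keeps a private queue and only takes the shared lock when it runs dry might look like this:

    // Minimal sketch: each consumer drains its own private queue and only
    // takes the shared lock to refill when it runs out of local work.
    #include <cstddef>
    #include <deque>
    #include <mutex>

    struct WorkItem { int payload; };

    struct Consumer {
        std::deque<WorkItem> local;          // private to this consumer, no locking
        std::mutex* shared_lock;             // guards the producer's queue
        std::deque<WorkItem>* shared_queue;  // the producer's queue

        bool refill(std::size_t batch) {
            std::lock_guard<std::mutex> g(*shared_lock);   // touched only when empty
            while (batch-- > 0 && !shared_queue->empty()) {
                local.push_back(shared_queue->front());
                shared_queue->pop_front();
            }
            return !local.empty();
        }

        void process(const WorkItem&) { /* ... real work ... */ }

        void run() {
            for (;;) {
                if (local.empty() && !refill(64)) break;   // contact the producer rarely
                WorkItem item = local.front();
                local.pop_front();
                process(item);
            }
        }
    };

Grabbing a batch per refill is arbitrary; the point is that the shared lock (and the cache-line ping-pong that goes with it) is touched once per batch rather than once per item.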
This technique has been widely used in Solaris OS. For more info see Multicore Application Programming: for Windows, Linux, and Oracle Solaris.
This depends on how much staleness you can tolerate (please tell us).
If you require updates to propagate out as fast as possible, this is already the most efficient solution. If you can tolerate milliseconds or seconds of out-of-date data, you can use a distinct memory location for each core and synchronize using a timer.
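For instance (a minimal sketch; the slot count and the 64-byte line size are assumptions), each writer can own a padded slot and a reader can poll all the slots on a timer, so the writers never invalidate each other's lines and the reader only pulls them every few milliseconds:

    // Sketch: one padded slot per writer thread; a reader polls them on a timer.
    #include <atomic>
    #include <chrono>
    #include <thread>
    #include <vector>

    struct alignas(64) Slot {                 // 64 = assumed cache-line size
        std::atomic<long> value{0};
    };

    int main() {
        Slot slots[4];                        // one per writer core (arbitrary count)
        std::atomic<bool> stop{false};

        std::vector<std::thread> writers;
        for (int w = 0; w < 4; ++w)
            writers.emplace_back([&slots, &stop, w] {
                while (!stop.load(std::memory_order_relaxed))
                    slots[w].value.fetch_add(1, std::memory_order_relaxed);
            });

        // The reader tolerates staleness: it looks at the slots only every 10 ms.
        for (int i = 0; i < 100; ++i) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            long total = 0;
            for (auto& s : slots) total += s.value.load(std::memory_order_relaxed);
            (void)total;                      // use the (possibly stale) aggregate
        }

        stop.store(true);
        for (auto& t : writers) t.join();
    }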
In languages like C, unsynchronized reads and writes to the same memory location from different threads are undefined behavior. But in the CPU, cache coherence says that if one core writes to a memory location and later another core reads it, the other core has to read the written value.
Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away? Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?
The acquire and release semantics required for C++11 std::mutex (and equivalents in other languages, and earlier stuff like pthread_mutex) would be very expensive to implement if you didn't have coherent cache. You'd have to write back every dirty line every time you released a lock, and evict every clean line every time you acquired a lock, if you couldn't count on the hardware to make your stores visible, and to make your loads not take stale data from a private cache.
But with cache coherency, acquire and release are just a matter of ordering this core's accesses to its own private cache which is part of the same coherency domain as the L1d caches of other cores. So they're local operations and pretty cheap, not even needing to drain the store buffer. The cost of a mutex is just in the atomic RMW operation it needs to do, and of course in cache misses if the last core to own the mutex wasn't this one.
C11 and C++11 added <stdatomic.h> and std::atomic respectively, which make it well-defined to access shared _Atomic int variables, so it's not true that higher-level languages don't expose this. It would hypothetically be possible to implement on a machine that required explicit flushes/invalidates to make stores visible to other cores, but that would be very slow. The language model assumes coherent caches, not providing explicit flushes of ranges but instead having release operations that make every older store visible to other threads that do an acquire load that syncs-with the release store in this thread. (See When to use volatile with multi threading? for some discussion, although that answer is mainly debunking the misconception that caches could have stale data, from people mixed up by the fact that the compiler can "cache" non-atomic non-volatile values in registers.)
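As a small illustration of that release/acquire hand-off (a minimal sketch, not tied to any particular codebase):

    // The consumer that sees ready == true is guaranteed to also see the
    // payload written before the release store.
    #include <atomic>
    #include <cassert>
    #include <thread>

    int payload = 0;                     // plain, non-atomic data
    std::atomic<bool> ready{false};

    void producer() {
        payload = 42;                                     // 1: ordinary store
        ready.store(true, std::memory_order_release);     // 2: release store
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {} // 3: acquire load syncs-with 2
        assert(payload == 42);                            // 4: guaranteed to see 1
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

On a machine with coherent caches this costs nothing beyond ordering this core's own accesses; on a hypothetical non-coherent machine the compiler would have to emit explicit write-backs and invalidations around steps 2 and 3.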
In fact, some of the guarantees on C++ atomic are actually described by the standard as exposing HW coherence guarantees to software, like "write-read coherence" and so on, ending with the note:
http://eel.is/c++draft/intro.races#19
[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]
(Long before C11 and C++11, SMP kernels and some user-space multithreaded programs were hand-rolling atomic operations, using the same hardware support that C11 and C++11 finally exposed in a portable way.)
Also, as pointed out in comments, coherent cache is essential for writes to different parts of the same line by other cores to not step on each other.
ISO C11 guarantees that a char arr[16] can have arr[0] written by one thread while another writes arr[1]. If those are both in the same cache line, and two conflicting dirty copies of the line exist, only one can "win" and be written back. C++ memory model and race conditions on char arrays
ISO C effectively requires char to be as large as the smallest unit you can write without disturbing surrounding bytes. On almost all machines (not early Alpha and not some DSPs), that's a single byte, even if a byte store might take an extra cycle to commit to L1d cache vs. an aligned word on some non-x86 ISAs.
The language didn't officially require this until C11, but that just standardized what "everyone knew" the only sane choice had to be, i.e. how compilers and hardware already worked.
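A minimal sketch of that guarantee (a hypothetical example, not from the linked question):

    // Two threads write adjacent bytes of the same array - almost certainly
    // the same cache line - with no synchronization. C11/C++11 guarantee that
    // neither store can clobber the other thread's byte.
    #include <cassert>
    #include <thread>

    int main() {
        char arr[16] = {};
        std::thread a([&arr] { arr[0] = 'A'; });
        std::thread b([&arr] { arr[1] = 'B'; });
        a.join();
        b.join();
        assert(arr[0] == 'A' && arr[1] == 'B');   // both writes survive
    }

On a coherent machine this just works; without coherency, two conflicting dirty copies of the line could exist and one of the writes could be lost on write-back.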
Ah, a very deep topic indeed!
Cache coherency between cores is used to synthesise (as closely as possible) a Symmetric Multi-Processing (SMP) environment. This harks back to the days when multiple single-core CPUs were simply tacked onto the same single memory bus, circa mid 1990s, when caches weren't really a thing, etc. With multiple CPUs with multiple cores each with multiple caches and multiple memory interfaces per CPU, the synthesis of an SMP-like environment is a lot more complicated, and cache coherency is a big part of that.
So, when one asks, "Why does the processor need to bother exposing a coherent abstraction of the memory hierarchy if the next layer up is just going to throw it away?", one is really asking "Do we still need an SMP environment?".
The answer is software. An awful lot of software, including all major OSes, has been written around the assumption that they're running on an SMP environment. Take away the SMP, and we'd have to re-write literally everything.
There are now various sage commentators beginning to wonder in articles whether SMP is in fact a dead end, and that we should start worrying about how to get out of that dead end. I think that it won't happen for a good long while yet; the CPU manufacturers have likely got a few more tricks to play to get ever increasing performance, and whilst that keeps being delivered no one will want to suffer the pain of software incompatibility. Security is another reason to avoid SMP - Meltdown and Spectre exploit weaknesses in the way SMP has been synthesised - but I'd guess that whilst other mitigations (however distasteful) are available security alone will not be sufficient reason to ditch SMP.
"Why not just let the caches get incoherent, and require the software to issue a special instruction when it wants to share something?" Why not, indeed? We have been there before. Transputers (1980s, early 1990s) implemented Communicating Sequential Processes (CSP), where if the application needed a different CPU to process some data, the application would have to purposefully transfer data to that CPU. The transfers are (in CSP speak) through "Channels", which are more like network sockets or IPC pipes and not at all like shared memory spaces.
CSP is having something of a resurgence - as a multiprocessing paradigm it has some very beneficial features - and languages such as Go, Rust, Erlang implement it. The thing about those languages' implementations of CSP is that they're having to synthesise CSP on top of an SMP environment, which in turn is synthesised on top of an electronic architecture much more reminiscent of Transputers!
Having had a lot of experience with CSP, my view is that every multi-process piece of software should use CSP; it's a lot more reliable. The "performance hit" of "copying" data (which is what you have to do to do CSP properly on top of SMP) isn't so bad; it's about the same amount of traffic over the cache-coherency connections to copy data from one CPU to another as it is to access the data in an SMP-like way.
Rust is very interesting, because with its syntax strongly expressing data ownership I suspect that it doesn't have to copy data to implement CSP; it can transfer ownership between threads (processes). Thus it may be getting the benefits of CSP, but without having to copy the data. Therefore it could be very efficient CSP, even if every thread is running on a single CPU core. I've not yet explored Rust deeply enough to know that that is what it's doing, but I have hopes.
One of the nice things about CSP is that, with Channels being like network sockets or IPC pipes, one can readily implement CSP across actual network sockets. Raw sockets are not in themselves ideal - they're asynchronous and so more akin to the Actor Model (as is ZeroMQ). The Actor Model is fairly OK - and I've used it - but it's not as guaranteed to be devoid of runtime problems as CSP is. So one has to implement the CSP bit oneself or find a library. However, with that in place CSP becomes a software architecture that can more easily span arbitrary networks of computers without having to change the software architecture; a local channel and a network channel are "the same", except the network one is a bit slower.
It's a lot harder to take a multithreaded piece of software that assumes SMP, uses semaphores, etc to scale up across multiple machines on a network. In fact, it can't, and has to be re-written.
More recently than Transputers, the Cell processor (of PlayStation 3 fame) was a multi-core device that did exactly as you suggest. It had a single CPU core and 8 SPE maths cores, each with 256k of on-chip, core-speed static RAM. To use the SPEs you had to write software to ship code and data in and out of that 256k (there was a monster-fast internal ring bus for doing this, and a very fast external memory interface). The result was that, with the right developer, very good results could be attained.
It took Intel about a further 10 years to usefully get x64 up to about the same performance; adding a Fused Multiply-Add instruction to SSE was what finally got them there, an instruction they'd been keeping in Itanium's repertoire in the vain hope of boosting its appeal. Cell (the SPEs were based on the PowerPC equivalent of SSE - AltiVec) had had an FMA instruction from the get-go.
Cache coherency is not needed if the developer takes care of issuing lock (plus memory barrier) / (memory barrier plus) unlock around shared data anyway. Seen that way, cache coherency is of little value, or even negative value, in terms of cost, power, performance, validation and so on.
Today, software is more and more distributed. In any case, coherency can't help two processes running on two different machines, and even for multi-threaded software we often end up using IPC. Only a small part of the data in a multi-threaded program is shared; the large part that isn't shared doesn't need coherency at all, and for the part that is, memory barriers should take care of cache syncing.
In practice, software developers depend on explicit locks to access shared data. Those locks can flush the caches efficiently with hardware assistance (efficiently meaning only the cache lines that are both modified and cached elsewhere), and that is already exactly what happens when we lock/unlock. Since everyone performs that lock/unlock anyway, cache coherency is redundant, a waste of silicon area and power, and the hardware engineers could sleep easier without it.
Compilers (at least C/C++ and the Python VM) generate code as if data were single-threaded and unshared. If I need to share data, I can only tell the compiler that it is shared, not how or why (volatile?). Developers have to manage the sharing themselves anyway (again, lock/unlock across hardware cores, software threads and hardware threads). Most of the time we write in a high-level language with non-atomic data, so cache coherency adds no value to the developer; he or she falls back to managing it through locks, which instruct the cache system to flush efficiently. Cache systems keep track of which lines are dirty whether or not they are coherent (think of cached but non-coherent memory: it still has that bookkeeping, which can be used for efficient flushing).
Cache coherency is very complex in silicon, consuming area and power, and in any case software developers still take care of issuing memory barriers (via locks). So I think it would be good to get rid of coherency and tell developers to own it.
But I see the trend is the opposite. Look at CXL memory and the like: it is coherent.
I am looking for a system call that would let me just turn off cache coherency for my threads, so I can experiment with it.
We know that there are levels in the memory hierarchy: cache, primary storage and secondary storage. Can we use a C program to selectively store a variable in a specific level of the memory hierarchy?
Reading your comments in the other answers, I would like to add a few things.
Inside an Operating System, you cannot restrict which level of the memory hierarchy your variables will be stored in, since the Operating System is the one that controls the metal, and it forces you to play by its rules.
Despite this, you can do something that MAY get you close to measuring access time in cache (mostly L1, depending on your test algorithm) and in RAM.
To test cache access: warm up by accessing a variable a few times, then access the same variable a huge number of times (cache is super fast, so you need a lot of accesses to measure the access time). A minimal timing sketch follows after this list.
To test the main memory (aka RAM): disable the cache memory in the BIOS and run your code.
To test secondary memory (aka disk): disable disk cache for a given file (you can ask your Operating System for this, just Google about it), and start reading some data from the disk, always from the same position. This might or might not work depending on how much your OS will allow you to disable disk cache (Google about it).
To test other levels of memory, you must implement your own "Test Operating System", and even with that it may not be possible to disable some caching mechanisms due to hardware limitations (well, not actually limitations...).
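Here's the kind of loop I mean for the cache-access test (a rough sketch; the iteration counts are arbitrary, and the volatile qualifier is only there to stop the compiler optimizing the loads away):

    #include <chrono>
    #include <cstdio>

    int main() {
        volatile int x = 1;
        long sum = 0;
        for (int i = 0; i < 1000; ++i) sum += x;       // warm-up: x is now in L1

        const long N = 100000000;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < N; ++i) sum += x;         // timed, cache-resident accesses
        auto t1 = std::chrono::steady_clock::now();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / N;
        std::printf("~%.2f ns per access (checksum %ld)\n", ns, sum);
    }

The loop overhead is included in the number, so treat it as an upper bound on the per-access time.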
Hope I helped.
Not really. Cache is designed to work transparently. Almost anything you do will end up in cache, because it's being operated on at the moment.
As for secondary storage, I assume you mean HDD, file, cloud, and so on. Nothing really ever gets stored there unless you do so explicitly, or set up a memory mapped region, or something gets paged to disk.
No. A normal computer program only has access to main memory. Even secondary storage (disk) is usually only available via operating system services, typically used via the stdio part of the library. There's basically no direct way to inspect or control the hierarchical caches closer to the CPU.
That said, there are cache profilers (like Valgrind's cachegrind tool) which give you an idea about how well your program uses a given type of cache architecture (and its branch predictor), and which can be very useful to help you spot code paths that have poor cache performance. They do this essentially by emulating the hardware.
There may also be architecture-specific instructions that give you some control over caching (such as "nontemporal hints" on x86, or "prefetch" instructions), but those are rare and peculiar, and not commonly exposed to C program code.
It depends on the specific architecture and compiler that you're using.
For example, on x86/x64, most compilers have a variety of levels of prefetch instructions which hint to the CPU
that a cache-line should be moved to a specific level in the cache from DRAM (or from higher-order caches - e.g. from L3 to L2).
On some CPUs, non-temporal prefetch instructions are available that when combined with non-temporal read instructions, allow you to bypass the caches and read directly into a register. (Some CPUs have non-temporals implemented by forcing the data to a specific way of the cache though; so you have to read the docs).
In general though, between L1 and L2 (or L3, or L4, or DRAM) it's a bit of a black box. You can't specifically store a value in one of these - some caches are inclusive of each other (so if a value is in L1, it's also in L2 and L3), some are not. And the caches are designed to periodically drain - so if a write goes to L1, eventually it works its way out to L2, then L3, then DRAM - especially in multi-core architectures with strong memory models.
You can bypass them entirely on write (use streaming-store or mark the memory as write-combining).
You can measure the different access times by:
Using memory-mapped files as a backing store for your data (this will measure the time for the first access to reach the CPU - just wrap it in a timer call, such as QueryPerformanceCounter or __rdtscp). Be sure to unmap and close the file between each test, and turn off any caching. It'll take a while.
Flushing the cache between accesses to get the time-to-access from DRAM (a sketch of this follows below).
It's harder to measure the difference between different cache levels, but if your architecture supports it, you can prefetch into a cache level, spin in a loop for some period of time, and then attempt the access.
It's going to be hard to do this on a system using a commercial OS, because they tend to do a LOT of work all the time, which will disturb your measurements.
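If you're on x86 with GCC or Clang, a rough sketch of the flush-and-time idea looks like this (the fences and rdtscp are there to keep the timed load from being reordered around the timer; the exact cycle counts will still be noisy):

    // x86-only sketch: compare a cached load against a load whose line was
    // just flushed, using the TSC as the timer.
    #include <cstdio>
    #include <x86intrin.h>   // _mm_clflush, _mm_mfence, _mm_lfence, __rdtsc, __rdtscp

    static unsigned long long timed_load(volatile int* p) {
        unsigned int aux;
        _mm_mfence();
        unsigned long long t0 = __rdtsc();
        int v = *p;                               // the load being measured
        unsigned long long t1 = __rdtscp(&aux);   // rdtscp waits for the load
        _mm_lfence();
        (void)v;
        return t1 - t0;
    }

    int main() {
        alignas(64) static volatile int data = 42;

        (void)timed_load(&data);                            // warm the line
        unsigned long long hot = timed_load(&data);         // should hit L1

        _mm_clflush((const void*)&data);                    // evict the line
        _mm_mfence();
        unsigned long long cold = timed_load(&data);        // should go to DRAM

        std::printf("cached: ~%llu cycles, flushed: ~%llu cycles\n", hot, cold);
    }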
In modern multicore processors, we normally have a local L1 cache but a shared L2 cache. Is it possible to bypass the L1 cache for some portion of the memory while still using the L2 cache for it? I want to do this to improve timing predictability, even if it comes at a cost in performance.
As far as I know, there is no way to bypass the L1 cache on mainstream CPUs.
However, to achieve your goal (i.e. avoid cache misses that may cause variation in timing measurements), you may try to ask your compiler to prefetch the data into the cache.
If you use GCC or LLVM, see __builtin_prefetch.
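For example (a sketch; the prefetch distance of 16 elements is a made-up tuning knob you would have to measure for your own workload):

    #include <cstddef>

    // Prefetch a few elements ahead of the one being processed, so the data
    // is (hopefully) already in cache when the loop reaches it.
    long sum_with_prefetch(const long* data, std::size_t n) {
        const std::size_t ahead = 16;
        long sum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i + ahead < n)
                __builtin_prefetch(&data[i + ahead], /*rw=*/0, /*locality=*/3);
            sum += data[i];
        }
        return sum;
    }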
However, your question is quite vague, and I am unsure that your proposal will suit your needs.
Caches
I strongly suspect that you have misunderstood what a cache does and what it is for.
Caches are transparent from the point of view of memory contents. If one core writes to a memory location, then every other core whose caches (L1, L2, L3, etc.), shared or not, happen to be caching that location will see the new value as well (the coherency protocol updates or invalidates their stale copies).
Note that that does not mean that the cores aren't able to race for the value. You can still have a race condition whereby one core reading a location fractionally before another writes it 'gets the wrong value'. Furthermore that will happen whether or not your CPU has caches of any sort. To solve that 'ordering' problem you have to use semaphores or other IPC primitives in your source code.
Some cache systems do allow you to 'drop hints' to them. Matthieu Rouget gave an example of that with __builtin_prefetch. These sorts of things allow the programmer to tell the cache system that it might well be worth getting some data in advance. Some systems (e.g. PowerPC 7450) sort of allowed the programmer to use part of the cache as memory instead of cache, kind of the ultimate in programmer cache control.
However, none of these things make any difference to the view of memory that all the caches have. If one cache's contents get updated, the rest are also updated.
Caches and Performance Programming
The very best programmers are able to extract peak performance from a CPU by coding around the behaviour of the cache. In that realm one generally finds oneself wishing that the cache wasn't there at all. The ultimate embodiment of this is the Cell processor in the PS3. The maths cores on that have no cache at all. Instead you have to in effect do all your own data fetching and write back yourself in your source code, rather than leave it up to some cache to second guess what data your program is going to ask for. Get it right and the performance is still blisteringly good.
Bus Snooping
Some CPUs don't have cache bus snooping, which can be a particular problem when writing device drivers. Bus snooping is a mechanism whereby the CPU caches spot the content of memory being updated by something other than the CPU cores (e.g. by a DMA controller reading data from a device). And the same the other way round - DMAs from memory get values currently stuck in cache. AFAIK almost all CPUs these days do bus snooping, so that is not likely to be a problem.
On systems with IO as well as memory address spaces (e.g. Intel) I don't think that I/O address space is cached anyway. For systems with memory mapped devices their memory is generally not cached either, and the OS sets up the CPU that way (see this).
Timing Predictability
To return to the reason for your question - timing predictability. You may be using the wrong technology. If your system has timing constraints whereby the problem is variations in main memory write times, then frankly using a multicore CPU sounds like the wrong thing in the first place. @Griwes is quite right on that point (and indeed the entire comment). You'll more likely need to resort to a pure hardware design, something along the lines of an FPGA (no comments about whether firmware is really software please!).
If, as I suspect, you're actually trying to avoid using semaphores and other IPC primitives to synchronise two threads in your system then you're not going to succeed, shared caches or not. You need to use semaphores and such to make your code work properly.
I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB, soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies), so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory for each CPU. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPUs share one large bank of memory, but if I understand correctly, this may not be possible on the new dual-CPU Intel motherboards (like the HP ProLiant ML350e)?
Modern CPUs [1] handle RAM locally and use a separate channel [2] to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to let the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program: the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator will prioritize local RAM, making it less important which cell each thread happens to be on.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance will suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing to within the group; then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
[1] By modern, I mean any 64-bit AMD chip, and Nehalem or better for Intel.
[2] AMD calls this channel HyperTransport, and Intel's name for it is QuickPath Interconnect.
EDIT:
You mention that you initialize "a big chunk of read-only memory" and then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better to initialize that part on the thread itself, after spawning it. That would allow the threads to spread over several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate the threads away as soon as they're spawned, but I don't know the details.
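A sketch of that idea, assuming the usual Linux first-touch page placement (the buffer size and thread count are arbitrary): the buffer is allocated uninitialized, and each worker writes its own slice first, so those pages land on the worker's own NUMA node.

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <thread>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(64) * 1024 * 1024;   // 512 MiB of doubles
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

        // new double[n] leaves the elements uninitialized, so no page is touched
        // yet; the first write below decides where each page is placed.
        std::unique_ptr<double[]> buf(new double[n]);

        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w)
            pool.emplace_back([&buf, n, workers, w] {
                const std::size_t chunk = n / workers;
                const std::size_t begin = w * chunk;
                const std::size_t end   = (w == workers - 1) ? n : begin + chunk;
                for (std::size_t i = begin; i < end; ++i)
                    buf[i] = 0.0;                              // first touch = local placement
            });
        for (auto& t : pool) t.join();

        // ... keep each worker operating on its own [begin, end) slice afterwards ...
    }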
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
It's less code!
Data that isn't needed won't be read and won't use up RAM.
You can specifically mark as read-only. Any bug that tries to write will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
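A bare-bones POSIX sketch of the mmap approach (the file name is hypothetical; error handling kept to a minimum):

    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const char* path = "big_input.dat";              // hypothetical input file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { std::perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { std::perror("fstat"); return 1; }

        // PROT_READ: any accidental write faults immediately (the coredump
        // mentioned above). MAP_PRIVATE is fine because nothing is written.
        void* mem = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                         PROT_READ, MAP_PRIVATE, fd, 0);
        if (mem == MAP_FAILED) { std::perror("mmap"); return 1; }
        close(fd);                                       // the mapping stays valid

        const char* data = static_cast<const char*>(mem);
        // ... spawn threads; each reads only its own slice of data[0 .. st.st_size) ...
        (void)data;

        munmap(mem, static_cast<std::size_t>(st.st_size));
    }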
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - you wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There are a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called False Sharing. I suggest the following read to understand if that may be an issue you are facing
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
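As a quick illustration of the usual fix (a sketch; 64 bytes is an assumed cache-line size), giving each thread's counter its own cache line stops the cores from fighting over one line:

    #include <atomic>
    #include <thread>
    #include <vector>

    struct alignas(64) PaddedCounter {     // one cache line per counter (assumed 64 bytes)
        std::atomic<long> value{0};
    };

    int main() {
        PaddedCounter counters[4];         // without alignas(64), these could share lines

        std::vector<std::thread> pool;
        for (int t = 0; t < 4; ++t)
            pool.emplace_back([&counters, t] {
                for (long k = 0; k < 10000000; ++k)
                    counters[t].value.fetch_add(1, std::memory_order_relaxed);
            });
        for (auto& th : pool) th.join();
    }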
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.
In multicore systems, such as 2, 4 or 8 cores, we typically use mutexes and semaphores to access shared memory. However, I can foresee that these methods would induce a high overhead for future systems with many cores. Are there any alternative methods that would be better for future many-core systems for accessing shared memory?
Transactional memory is one such method.
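For a taste of what that looks like in practice: GCC ships an experimental implementation of transactional memory (compiled with -fgnu-tm). I haven't benchmarked it, and the snippet below is a rough sketch of that extension rather than production guidance:

    // Compile with: g++ -fgnu-tm ...
    // The block executes atomically with respect to other transactions,
    // with no explicit lock in the source.
    static long shared_balance = 0;

    void deposit(long amount) {
        __transaction_atomic {
            shared_balance += amount;
        }
    }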
I'm not sure how far in the future you want to go. But in the long-long run, shared memory as we know it right now (single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make the lives of programmers harder as it did when we went to multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared-memory is not scalable in the long run is simply due to physics. (similar to how single-core/high-frequency hit a barrier)
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the part of memory you are accessing, not the entire table! This is done with the help of a big hash table of locks: the address is hashed to pick the lock. The bigger the table, the finer-grained the locking is (a sketch follows after this list).
2) If you can, only lock on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid case).
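A sketch combining both points (the names are mine; the stripe count is a tuning knob): hash the key to pick one lock out of a fixed array, and use a reader-writer lock so reads don't block each other.

    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <shared_mutex>
    #include <unordered_map>

    class StripedMap {
        static constexpr std::size_t kStripes = 64;      // bigger table = finer-grained locking
        std::shared_mutex locks_[kStripes];
        std::unordered_map<long, long> buckets_[kStripes];

        std::size_t stripe(long key) const { return std::hash<long>{}(key) % kStripes; }

    public:
        void put(long key, long value) {                 // writers lock only their stripe
            const std::size_t s = stripe(key);
            std::unique_lock<std::shared_mutex> g(locks_[s]);
            buckets_[s][key] = value;
        }

        bool get(long key, long& out) {                  // readers never block other readers
            const std::size_t s = stripe(key);
            std::shared_lock<std::shared_mutex> g(locks_[s]);
            auto it = buckets_[s].find(key);
            if (it == buckets_[s].end()) return false;
            out = it->second;
            return true;
        }
    };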
Access to shared memory at the lowest level in any multi-processor/core/threaded application synchronization depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states, as it also encompasses locking those I/O buses that have bus-mastering devices, including DMA. Theoretically, it is possible to envision a medium-level lock that could be invoked in situations where the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory, which is fast, at least in comparison to the latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock raises worrying questions about its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
For any foreseeable future, the classic approach still holds true: lock as seldom as possible and for as short a time as possible, by implementing efficient algorithms (and associated data structures) for the work done while the locks are held.