OpenMP and Shared Memory definition - c

According to the OpenMP web site, OpenMP is "the de-facto standard for parallel programming on shared memory systems". According to Wikipedia, "Using memory for communication inside a single program, for example among its multiple threads, is generally not referred to as shared memory."
What is wrong here? Is it the term "generally"?
OpenMP is really just creating threads that "share memory" through a single virtual address space, isn't it?
Moreover, I guess OpenMP is able to run on NUMA architectures, where all the memory can be addressed by all the processors but memory access time increases when threads sharing data are assigned to cores attached to different memories. Is this true?

I'm writing a full-fledged answer here to address the further questions asked as comments to lucas1024's answer.
On the meaning of "shared memory"
On the one hand, you have the software-oriented (i.e. OS-oriented) meaning of "shared memory": a way to enable different processes to access the same chunk of memory (i.e. to relax the usual OS constraint that a given process should not be able to tamper with other processes' memory). As stated in the Wikipedia page, the POSIX shared memory API is one implementation of such a facility. In this sense, it does not make much sense to speak of threads (an OS might well provide shared memory without even providing threads).
On the other hand, you have the hardware-oriented meaning of "shared memory": a hardware configuration where all CPUs have access to the same piece of RAM.
On the meaning of "thread"
Now we have to disambiguate another term: "thread". An OS might provide a way to have multiple concurrent execution flows within a process. POSIX threads are an implementation of such a feature.
However, the OpenMP specification has its own definitions:
thread: An execution entity with a stack and associated static memory, called threadprivate memory.
OpenMP thread: A thread that is managed by the OpenMP runtime system.
Such definitions fit nicely with the definition of e.g. POSIX threads, and most OpenMP implementations indeed use POSIX threads to create OpenMP threads. But you might imagine OpenMP implementations on top of OSes which do not provide POSIX threads or equivalent features. Such OpenMP implementations would have to manage execution flows internally, which is difficult enough but entirely doable. Alternatively, they might map OpenMP threads to OS processes and use some kind of "shared memory" feature (in the OS sense) to let them share memory (though I don't know of any OpenMP implementation doing this).
In the end, the only constraint you have for an OpenMP implementation is that all CPUs should have a way to share access to the same central memory. That is to say OpenMP programs should run on "shared memory" systems in the hardware sense. However, OpenMP threads do not necessarily have to be POSIX threads of the same OS process.

A "shared memory system" is simply a system where multiple cores or CPUs are accessing a single pool of memory through a local bus. So the OpenMP site is correct.
Communicating between threads in a program is not done using "shared memory" - instead the term typically refers to communication between processes on the same machine through memory. So the Wikipedia entry is not in contradiction and it, in fact, points out the difference in terminology between hardware and software.

Related

How does N<->1 threading model work?

As a continuation of an earlier question, this is an additional query about the N-1 threading model.
It is taught that the threading model needs to be chosen before designing an application.
In the N-1 threading model, a single kernel thread is available to work on behalf of each user process. The OS scheduler gives a single CPU time slice to this kernel thread.
In user space, the programmer would use either POSIX pthreads or Windows CreateThread() to spawn multiple threads within a user process. Because the programmer used POSIX pthreads or Windows CreateThread(), the kernel is aware of the user-land threads, and each thread is considered for processor time assignment by the scheduler. So that means every user thread will get a kernel thread.
My question:
So how can the N-1 threading model exist? It would be a 1-1 threading model. Please clarify.
"In user space, the programmer would use either POSIX pthreads or Windows CreateThread() to spawn multiple threads within a user process. Because the programmer used POSIX pthreads or Windows CreateThread(), the kernel is aware of the user-land threads, and each thread is considered for processor time assignment by the scheduler. So that means every user thread will get a kernel thread."
That's how 1-to-1 threading works.
This doesn't have to be the case. A platform can implement pthread_create, CreateThread, or whatever other "create a thread" function it offers in whatever way it wants.
"My question: So how can the N-1 threading model exist? It would be a 1-1 threading model. Please clarify."
Precisely as you explained in the beginning of your question -- when the programmer creates a thread, instead of creating a thread the kernel is aware of, it creates a thread that the userland scheduler is aware of, still using a single kernel thread for the entire process.
Short answer: there is more than Windows and Linux.
Slightly longer answer (EDITED):
Many programming languages and frameworks expose multithreading to the programmer. At the same time, they aim to be portable, i.e., it is not known whether a given target platform supports threads at all. Here, the best way is to implement N:1 threading, either in general or at least for the backends without threading support.
The classic example is Java: the language supports multithreading, while JVMs exist even for very simple embedded platforms that do not support threads. However, there are JVMs (actually, most of them) that use kernel threads (e.g., AFAIK, the JVM by Sun/Oracle).
Another reason that a language or platform may not want to hand threading control over completely to the operating system is special implementation features such as reactor models or global language locks. Here, the objective is to use information about special execution patterns in the user runtime system (which does the local scheduling) that the OS scheduler has no access to.
"Does [1:1 threading] add more space occupancy on User process virtual address space because of these kernel threads?"
Well, in theory, execution flow (processes, threads, etc.) and address space are independent concepts. One can find all kinds of mapping between processes (here used as a general term) and memory spaces: 1:1, n:1, 1:n, n:n. However, the classic approach of threading is that several threads of a process share the memory space of the task (that is the owner of the memory space). And thus, there is usually no difference between user threads and kernel threads regarding the memory space. (One exception is, e.g., the Erlang-VM: here, there exist user threads with isolated memory spaces).

OpenMP, multithreading or multiprocessing (C)?

I'm having some trouble understanding how OpenMP works. I know that it executes tasks in parallel and that it's a multi-processing tool, but what does it mean?
It uses 'threads' but at the same time it's a multi-processing tool? Aren't the two mutually exclusive, you use one method but not the other? Can you help explain which one it is?
To clarify, I've only worked with multi-threading with POSIX pthreads. And that's totally different from multiprocessing with fork and exec and shared memory.
Thank you.
OpenMP was developed as an abstraction layer for parallel architectures utilizing multi-threading and shared memory, so you don't have to write often-used parallel code from scratch. Note that, in general, threads still have access to shared memory (the memory allocated by the master thread). It takes advantage of multiple processors, but uses threads.
MPI is its counterpart for distributed systems. This might be more of the traditional "multi-processing" version you are thinking of, since all the "ranks" operate independently of each other without shared memory, and must communicate through operations such as scatter, gather, and reduce.
OpenMP is used for multithreading. I go pretty in depth on how to use OpenMP and the pitfalls:
http://austingwalters.com/the-cache-and-multithreading/
It works very similarly to POSIX pthreads, except with no fuss. It was developed to be incorporated into code that was already written, which is then recompiled with an appropriate compiler (g++ works; clang/llvm did not at the time of writing, though later Clang releases added OpenMP support). If you clicked on my link above you'll note that a thread enables multiprocessing, since it can be executed on any of the available processors.
Meaning that even with a single core, threads could still execute faster, since your processor shares time amongst all the programs. If you have multiple processors and multiple threads, the threads can run on different processors simultaneously and therefore execute even faster.
Furthermore, OpenMP allows shared (and unshared) memory, depending on the implementation, and I believe you can use OpenMP with POSIX threading as well, though you will not gain any advantage if the pthreads were used correctly.
Below is a link to an excellent guide to OpenMP:
http://bisqwit.iki.fi/story/howto/openmp/

Shared memory access control mechanism for processes created by MPI

I have a shared memory region used by multiple processes; these processes are created using MPI.
Now I need a mechanism to control access to this shared memory.
I know that named semaphores and flock can be used to do this, but I wanted to know whether MPI provides any special locking mechanism for shared memory usage.
I am working in C under Linux.
MPI actually does provide support for shared memory now (as of version 3.0). You might try looking at the One-sided communication chapter (http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf) starting with MPI_WIN_ALLOCATE_SHARED (11.2.3). To use this, you'll have to make sure you have an implementation that supports it. I know that the most recent versions of both MPICH and Open MPI work.
Traditionally (before version 3.0), MPI did not provide any support for shared memory. In fact, MPI was not designed around shared memory. The reason is that a program written with MPI is supposed to scale to a large number of processors, and a large number of processors never have shared memory.
However, it may happen, and often does, that small groups of processors (within that large set of processors) do have shared memory. To utilize that shared memory, however, OpenMP is used.
OpenMP is very simple. I strongly suggest you learn it.

What's the difference between threads (and processes) in kernel mode and ones in user mode?

My questions:
1) The book Modern Operating Systems says that threads and processes can be in kernel mode or user mode, but it does not say clearly what the difference between them is.
2) Why does a switch between kernel-mode threads and processes cost more than a switch between user-mode threads and processes?
3) I am now learning Linux, and I want to know how to create threads and processes in kernel mode and in user mode, respectively, on a Linux system.
4) The book Modern Operating Systems says that a process can be in user mode while the threads created in that user-mode process are in kernel mode. How is this possible?
There are some terminology problems due more to historical accident than anything else here.
"Thread" usually refers to thread-of-control within a process, and may (does in this case) mean "a task with its own stack, but which shares access to everything not on that stack with other threads in the same protection domain".
"Process" tends to refer to a self-contained "protection domain" which may (and does in this case) have the ability to have multiple threads within it. Given two processes P1 and P2, the only way for P1 to affect P2 (or vice versa) is through some particular defined "communications channel" such as a file, pipe, or socket; via "inter-process" signals like Unix/Linux signals; and so on.
Since threads don't have this kind of barrier between each other, one thread can easily interfere with (corrupt the data used by) another thread.
All of this is independent of user vs kernel, with one exception: in "the kernel"—note that there is an implicit assumption here that there is just one kernel—you have access to the entire machine state at all times, and full privileges to do anything. Hence you can deliberately (or in some cases accidentally) disregard or turn off hardware protection and mess with data "belonging to" someone else.
That mostly covers several possibly-confused items in Q1. As for Q2, the answer to the question as asked is "it doesn't". In general, because threads do not involve (as much) protection, it's cheaper to switch from one thread to another: you do not have to tell the hardware (in whatever fashion) that it should no longer allow various kinds of access, since threads T1 and T2 have "the same" access. Switching between processes, however, as with P1 and P2, you "cross a protection barrier", which has some penalty (the actual penalty varies widely with hardware, and to some extent the skills of the OS writers).
It's also worth noting that crossing from user to kernel mode, and vice versa, is also crossing a protection domain, which again has some kind of cost.
In Linux, there are a number of ways for user processes to create what amount to threads, including both "POSIX threads" (pthreads) and the clone call (details for clone, which is extremely flexible, are beyond the scope of this answer). If you want to write portable code, you should probably stick with pthreads.
Within the Linux kernel, threads are done completely differently, and you will need Linux kernel documentation.
I can't properly answer Q4 since I don't have the book and am not sure what they are referring to here. My guess is that they mean that whenever any user process-or-thread makes a "system call" (requests some service from the OS), this crosses that user/kernel protection barrier, and it is then up to the kernel to verify that the user code has appropriate privileges for that operation, and then to do that operation. The part of the kernel that does this is running with kernel-level protections and thus needs to be more careful.
Some hardware (mostly obsolete these days) has (or had) more than just two levels of hardware-provided protection. On these systems, "user processes" had the least direct privilege, but above those you would find "executive mode", "system mode", and (most privileged) "kernel" or "nucleus" mode. These were intended to lower the cost of crossing the various protection barriers. Code running in "executive" did not have full access to everything in the machine, so it could, for instance, just assume that a user-provided address was valid, and try to use it. If that address was in fact invalid, the exception would rise to the next higher level. With only two levels—"user", unprivileged; and "kernel", completely-privileged—kernel code must be written very carefully. However, it's possible to provide "virtual machines" at low cost these days, which pretty much obsoletes the need for multiple hardware levels of protection. One simply writes a true kernel, then lets it run other things in what they "think" is "kernel mode". This is what VMware and other "hypervisor" systems do.
User-mode threads are scheduled in user mode by something in the process, and the process itself is the only thing handled by the kernel scheduler.
That means your process gets a certain amount of grunt from the CPU and you have to share it amongst all your user mode threads.
Simple case, you have two processes, one with a single thread and one with a hundred threads.
With a simplistic kernel scheduling policy, the thread in the single-thread process gets 50% of the CPU and each thread in the hundred-thread process gets 0.5% each.
With kernel mode threads, the kernel itself manages your threads and schedules them independently. Using the same simplistic scheduler, each thread would get just a touch under 1% of the CPU grunt (101 threads to share the 100% of CPU).
In terms of why kernel mode switching is more expensive, it probably has to do with the fact that you need to switch to kernel mode to do it. User mode threads do all their stuff in user mode (obviously) so there's no involving the kernel in a thread switch operation.
Under Linux, you create threads (and processes) with the clone call, similar to fork but with much finer control over things.
Your final point is a little obtuse. I can't be certain but it's probably talking about user and kernel mode in the sense that one could be executing user code and another could be doing some system call in the kernel (which requires switching to kernel or supervisor mode).
That's not the same as the distinction when talking about the threading support (user or kernel mode support for threading). Without having a copy of the book to hand, I couldn't say definitively, but that'd be my best guess.

How shared memory would be accessed in manycore systems

In multicore systems with, say, 2, 4, or 8 cores, we typically use mutexes and semaphores to control access to shared memory. However, I foresee that these methods will induce a high overhead on future systems with many cores. Are there any alternative methods that would be better for accessing shared memory on future many-core systems?
Transactional memory is one such method.
I'm not sure how far in the future you want to go. But in the long-long run, shared memory as we know it right now (single address space accessible by any core) is not scalable. So the programming model will have to change at some point and make the lives of programmers harder as it did when we went to multi-core.
But for now (perhaps for another 10 years) you can get away with transactional memory and other hardware/software tricks.
The reason I say shared-memory is not scalable in the long run is simply due to physics. (similar to how single-core/high-frequency hit a barrier)
In short, transistors can't shrink to less than the size of an atom (barring new technology), and signals can't propagate faster than the speed of light. Therefore, memory will get slower and slower (with respect to the processor) and at some point, it becomes infeasible to share memory.
We can already see this effect right now with NUMA on the multi-socket systems. Large-scale supercomputers are neither shared-memory nor cache-coherent.
1) Lock only the part of memory you are accessing, not the entire table! This is done with the help of a big hash table. The bigger the table, the finer-grained the locking is.
2) If you can, lock only on writing, not on reading (this requires that there is no problem in reading the "previous value" while it is being updated, which is very often a valid case).
Access to shared memory at the lowest level of synchronization in any multi-processor, multi-core, or multi-threaded application depends on the bus lock. Such a lock may incur hundreds of (CPU) wait states, as it also encompasses locking those I/O buses that have bus-mastering devices, including DMA. Theoretically, it is possible to envision a medium-level lock that can be invoked in situations when the programmer is certain that the memory area being locked won't be affected by any I/O bus. Such a lock would be much faster because it only needs to synchronize the CPU caches with main memory, which is fast, at least in comparison to the latency of the slowest I/O buses. Whether programmers in general would be competent to determine when to use which bus lock casts doubt on its mainstream feasibility. Such a lock could also require its own dedicated external pins for synchronization with other processors.
In multi-processor Opteron systems each processor has its own memory which becomes part of the entire memory that all installed processors can "see". A processor trying to access memory which turns out to be attached to another processor will transparently complete the access - albeit more slowly - through a high-speed interconnect bus (called HyperTransport) to the processor in charge of that memory (the NUMA concept). As long as a processor and its cores are working with the memory physically connected to it processing will be fast. In addition, many processors are equipped with several external memory buses to multiply their overall memory bandwidth.
A theoretical medium-level lock could, on Opteron systems, be implemented using the HyperTransport interconnections.
As for the foreseeable future, the classic approach still holds true: lock as seldom as possible and for as short a time as possible, by implementing efficient algorithms (and associated data structures) for the code that runs while the locks are held.
