How can I make my pthreads execute a function each time they are rescheduled by the kernel?
I need to identify which physical CPU/socket (not logical core) my thread is being scheduled on, and I cannot afford to do this all the time.
Can the wakeup routine be hooked somehow to make the necessary updates to TLS only when the thread is actually being rescheduled?
As to why I need this: I have code that executes AMOs (atomic memory operations) approximately every 70 ns per thread. That is fine as long as the address is not cached on another socket, but deploying the same code across two sockets makes it about 15 times slower because of frequent cache invalidations. I intend to allocate memory specifically for this, shared only among threads running on the same L3 cache, so I need to identify which socket I am running on and address the correct memory block. I could obviously call sched_getcpu and compare the result with the physical CPU ID in /proc/cpuinfo, but that is a rather big overhead. I cannot afford to allocate thread-private memory for each thread either; that is too expensive.
From what I have read in Linux Kernel Development, Third Edition, the kernel provides no service or interface for what you want. Pinning threads with pthread_setaffinity_np (as suggested above by #osgx) or caching a TLS key per CPU socket at startup (as suggested above by #caf) are perhaps the best methods to use in that direction.
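A minimal sketch of the "cache it up front" idea, assuming Linux with sysfs mounted: build a CPU-to-socket table once at startup from /sys/devices/system/cpu/cpuN/topology/physical_package_id, then map sched_getcpu() through it. The names build_socket_table, current_socket and MAX_CPUS are made up for illustration. This does not hook rescheduling: the thread can migrate right after the lookup, so the result is only a hint unless the thread is pinned.

```c
#define _GNU_SOURCE            /* needed for sched_getcpu() */
#include <sched.h>
#include <stdio.h>

#define MAX_CPUS 1024          /* assumption: upper bound on logical CPUs */

static int cpu_to_socket[MAX_CPUS];

/* Fill the table once at startup.
 * Returns the number of CPUs whose socket id could be read from sysfs. */
int build_socket_table(void)
{
    int found = 0;
    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        char path[128];
        cpu_to_socket[cpu] = -1;   /* -1 means "unknown" */
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/physical_package_id",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fscanf(f, "%d", &cpu_to_socket[cpu]) == 1)
            found++;
        else
            cpu_to_socket[cpu] = -1;
        fclose(f);
    }
    return found;
}

/* Socket of the CPU the caller is running on right now, or -1 if unknown.
 * Costs one sched_getcpu() call plus an array lookup. */
int current_socket(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= MAX_CPUS)
        return -1;
    return cpu_to_socket[cpu];
}
```

Call build_socket_table() once before spawning threads; each thread can then use current_socket() to pick its per-socket memory block.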
I am trying to understand why IOCP is used. I can think of two reasons:
Since WSARecv() will not block, then I can handle 1000s of clients without having to create a new thread for each client (also, there is a limit on how many threads you can create, and so the number of clients you can handle will be limited).
Since WSASend() will not block, then when I want to send a large file, I don't have to create a new thread to send it (if I did not create a new thread then the UI thread will block of course).
What other reasons are there to use IOCP?
IOCP has the benefits that you mention but that is not exclusive to IOCP. I'm not that familiar with the native socket APIs but some Win32 APIs have "overlapped IO" which is asynchronous but does not require IOCP.
Another benefit is that with IOCP the number of request serving threads is (kind of) optimized by the kernel. The kernel is aware of all blocking that request serving threads do and it will see to it that there are enough, and not more, threads unblocked at all times so that the CPU is well-utilized. Ideally, you would never block and there would be as many threads as there are cores (assuming 100% load). That would be very efficient.
IOCP also helps to reduce context switching: instead of switching to another thread to process the results of an IO, an existing thread that is already busy simply calls GetQueuedCompletionStatus again.
GetQueuedCompletionStatusEx can be used to reduce the number of transitions to the kernel because you can dequeue multiple IOs in one call.
Also, it cuts down on avoidable bulk data copying and protection-ring transitions. Instead of the kernel having to copy data from the network stack buffers into a user-space buffer when requested by a recv() call, user-space buffers are supplied up front by WSARecv() and the stack can then fill them directly from kernel space.
Continuing my endeavors in OS development research, I have constructed an almost complete picture in my head. One thing still eludes me.
Here is the basic boot process, from my understanding:
1) BIOS/Bootloader perform necessary checks, initialize everything.
2) The kernel is loaded into RAM.
3) Kernel performs its initializations and starts scheduling tasks.
4) When a task is loaded, it is given a virtual address space in which it resides, including the .text, .data and .bss sections, the heap and the stack. This task "maintains" its own stack pointer, pointing to its own "virtual" stack.
5) Context switches merely push the register file (all CPU registers), the stack pointer and program counter into some kernel data structure and load another set belonging to another process.
In this abstraction, the kernel is a "mother" process inside of which all other processes are hosted. I tried to convey my best understanding in the following diagram:
Question is, first is this simple model correct?
Second, how is the executable program made aware of its virtual stack? Is it the OS job to calculate the virtual stack pointer and place it in the relevant CPU register? Is the rest of the stack bookkeeping done by CPU pop and push commands?
Does the kernel itself have its own main stack and heap?
Thanks.
Question is, first is this simple model correct?
Your model is extremely simplified but essentially correct - note that the last two parts of your model aren't really considered to be part of the boot process, and the kernel isn't a process. It can be useful to visualize it as one, but it doesn't fit the definition of a process and it doesn't behave like one.
Second, how is the executable program made aware of its virtual stack?
Is it the OS job to calculate the virtual stack pointer and place it
in the relevant CPU register? Is the rest of the stack bookkeeping
done by CPU pop and push commands?
An executable C program doesn't have to be "aware of its virtual stack." When a C program is compiled into an executable, local variables are usually referenced relative to the stack pointer or frame pointer - for example, [ebp - 4].
When Linux loads a new program for execution, it uses the start_thread macro (which is called from load_elf_binary) to initialize the CPU's registers. The macro contains the following line:
regs->esp = new_esp;
which will initialize the CPU's stack pointer register to the virtual address that the OS has assigned to the thread's stack.
As you said, once the stack pointer is loaded, assembly commands such as pop and push will change its value. The operating system is responsible for making sure that there are physical pages that correspond to the virtual stack addresses - in programs that use a lot of stack memory, the number of physical pages will grow as the program continues its execution. There is a limit for each process that you can find by using the ulimit -a command (on my machine the maximum stack size is 8MB, or 2048 pages of 4KB each).
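The same limit can be queried from inside a program. This is a small sketch, assuming a POSIX system; the function names are made up for illustration, and getrlimit(RLIMIT_STACK) reports the same soft limit that ulimit -s shows in the shell.

```c
#include <assert.h>
#include <sys/resource.h>
#include <unistd.h>

/* Soft stack limit in bytes; 0 means "unlimited", -1 means the query failed. */
long long stack_limit_bytes(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (long long)rl.rlim_cur;
}

/* Page size reported by the system (commonly 4096 bytes on x86). */
long system_page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Number of pages needed to back `bytes` bytes, rounding up. */
long long bytes_to_pages(long long bytes, long page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

For an 8 MB limit and 4 KB pages, bytes_to_pages(8LL * 1024 * 1024, 4096) gives 2048.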
Does the kernel itself have its own main stack and heap?
This is where visualizing the kernel as a process can become confusing. First of all, threads in Linux have a user stack and a kernel stack. They're essentially the same, differing only in protections and location (kernel stack is used when executing in Kernel Mode, and user stack when executing in User Mode).
The kernel itself does not have its own stack. Kernel code is always executed in the context of some thread, and each thread has its own fixed-size (usually 8KB) kernel stack. When a thread moves from User Mode to Kernel Mode, the CPU's stack pointer is updated accordingly. So when kernel code uses local variables, they are stored on the kernel stack of the thread in which they are executing.
During system startup, the start_kernel function initializes the kernel init thread, which will then create other kernel threads and begin initializing user programs. So after system startup the CPU's stack pointer will be initialized to point to init's kernel stack.
As far as the heap goes, you can dynamically allocate memory in the kernel using kmalloc. It serves requests from slab caches, which in turn obtain free pages from the kernel's page allocator (the same allocator behind helpers such as get_zeroed_page).
You forgot one important point: Virtual memory is enforced by hardware, typically known as the MMU (Memory Management Unit). It is the MMU that converts virtual addresses to physical addresses.
The kernel typically loads the address of the base of the page table for a specific process into a register in the MMU. This is what task-switches the virtual memory space from one process to another. On x86, this register is CR3.
Virtual memory protects processes' memory from each other. RAM for process A is simply not mapped into process B. (Except for e.g. shared libraries, where the same code memory is mapped into multiple processes, to save memory).
Virtual memory also protects kernel memory space from user-mode processes. Attributes on the pages covering kernel address space are set so that, when the processor is running in user mode, it is not allowed to access them.
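The isolation between processes can be observed from user space. In this sketch (POSIX assumed, names made up for illustration) a forked child writes to a global variable; the parent, looking at the very same virtual address, still sees its own copy, because after fork the two address spaces are separate (the page is copied on the child's first write).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int counter = 1;

/* Fork a child that overwrites `counter`, wait for it, and return the
 * value the parent sees afterwards. Returns -1 if fork or wait fails. */
int value_seen_by_parent_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        counter = 2;          /* modifies only the child's private copy */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return counter;           /* the parent's copy is untouched */
}
```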
Note that, while the kernel may have threads of its own, which run entirely in kernel space, the kernel shouldn't really be thought of as "a mother process" that runs independently of your user-mode programs. The kernel basically is "the other half" of your user-mode program! Whenever you issue a system call, the CPU automatically transitions into kernel mode, and starts executing at a pre-defined location, dictated by the kernel. The kernel system call handler then executes on your behalf, in the kernel-mode context of your process. Time spent in the kernel handling your request is accounted for, and "charged to" your process.
Helpful ways of thinking about the kernel in relation to processes and threads
The model you provided is very simplified but correct in general.
At the same time, thinking of the kernel as a "mother process" isn't the best approach, though it still makes some sense.
I would like to propose two better models.
Try to think of the kernel as a special kind of shared library.
Like a shared library, the kernel is shared between different processes.
A system call is performed in a way that is conceptually similar to calling a routine from a shared library.
In both cases, after the call, you execute "foreign" code but in the context of your own process.
And in both cases your code continues to perform computations using the stack.
Note also that in both cases calls to "foreign" code block the execution of your "native" code.
After returning from the call, execution continues from the same point in the code and with the same stack state from which the call was made.
But why do we consider the kernel a "special" kind of shared library? Because:
a. The kernel is a "library" that is shared by every process in the system.
b. The kernel is a "library" that shares not only its code section but also its data section.
c. The kernel is a specially protected "library". Your process can't access kernel code and data directly. Instead, it is forced to call the kernel in a controlled manner via special "call gates".
d. During a system call your application executes on a virtually continuous stack. In reality, though, this stack consists of two separate parts: one part is used in user mode, and the second part is logically attached to the top of your user-mode stack on entering the kernel and detached on exit.
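To make the "library call" analogy concrete, here is a small sketch assuming Linux and glibc (the two wrapper names are made up for illustration). The same kernel service is reached once through the ordinary libc wrapper and once through the generic syscall() entry point; from the caller's side both look like plain function calls that block until the kernel returns.

```c
#define _GNU_SOURCE            /* makes syscall() visible under strict -std modes */
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for our PID through the usual libc wrapper. */
long pid_via_libc(void)
{
    return (long)getpid();
}

/* Ask for the same thing by issuing the system call by number. */
long pid_via_raw_syscall(void)
{
    return syscall(SYS_getpid);
}
```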
Another useful way of thinking about how computation is organized in your computer is to consider it a network of "virtual" computers that have no support for virtual memory.
You can consider a process as a virtual multiprocessor computer that executes only one program, which has access to all of its memory.
In this model each "virtual" processor is represented by a thread of execution.
Just as you can have a computer with multiple processors (or with a multicore processor), you can have multiple concurrently running threads in your process.
Just as all processors in your computer share access to the pool of physical memory, all threads of your process share access to the same virtual address space.
And just as separate computers are physically isolated from each other, your processes are also isolated from each other, but logically.
In this model the kernel is represented by a server with direct connections to each computer in a network with a star topology.
Like a network server, the kernel has two main purposes:
a. A server assembles all computers into a single network.
Similarly, the kernel provides a means of inter-process communication and synchronization. The kernel works as a man in the middle that mediates the entire communication process (transfers data, routes messages and requests, etc.).
b. Just as a server provides a set of services to each connected computer, the kernel provides a set of services to the processes. For example, just as a network file server allows computers to read and write files located on shared storage, your kernel allows processes to do the same things using local storage.
Note that, following the client-server communication paradigm, clients (processes) are the only active actors in the network. They issue requests to the server and to each other. The server, in turn, is the reactive part of the system and never initiates communication. Instead, it only replies to incoming requests.
These models reflect the resource sharing/isolation relationships between the parts of the system and the client-server nature of communication between the kernel and processes.
How stack management is performed, and what role the kernel plays in that process
When a new process starts, the kernel, using hints from the executable image, decides where and how much virtual address space will be reserved for the user-mode stack of the process's initial thread.
Having made this decision, the kernel sets the initial values of the processor registers that will be used by the main thread of the process right after execution starts.
This setup includes setting the initial value of the stack pointer.
Once the process actually starts executing, the process itself becomes responsible for the stack pointer.
A more interesting fact is that the process is responsible for initializing the stack pointer of each new thread it creates.
But note that the kernel is responsible for allocating and managing the kernel-mode stack for each and every thread in the system.
Note also that the kernel is responsible for allocating physical memory for the stack, and it usually does this job lazily, on demand, using page faults as hints.
The stack pointer of a running thread is managed by the thread itself. In most cases stack pointer management is arranged by the compiler when it builds the executable image. The compiler usually tracks the stack pointer value and maintains its consistency by emitting and tracking all instructions that relate to the stack.
Such instructions are not limited to "push" and "pop". Many CPU instructions affect the stack, for example "call" and "ret", "sub ESP" and "add ESP", etc.
So, as you can see, the actual policy of stack pointer management is mostly static and known before the process executes.
Sometimes programs have a special part of their logic that performs special stack management,
for example implementations of coroutines or long jumps in C.
In fact, you are allowed to do whatever you want with the stack pointer in your program.
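As an illustration of a program managing a stack itself, here is a minimal coroutine-style sketch using the ucontext API from glibc (getcontext, makecontext, swapcontext). The names run_on_private_stack and CORO_STACK_SIZE are made up for illustration. The program allocates a block with malloc, tells makecontext to use it as a stack, and then checks that a local variable of the function running there really lives inside that block.

```c
#include <stdint.h>
#include <stdlib.h>
#include <ucontext.h>

#define CORO_STACK_SIZE (64 * 1024)   /* assumption: plenty for this demo */

static ucontext_t main_ctx, coro_ctx;
static char *coro_stack;
static int local_was_on_coro_stack;

/* Runs on the malloc'ed stack; records where its local variable lives. */
static void coro_body(void)
{
    volatile int local = 0;
    uintptr_t addr = (uintptr_t)&local;
    uintptr_t lo = (uintptr_t)coro_stack;
    local_was_on_coro_stack = (addr >= lo && addr < lo + CORO_STACK_SIZE);
    /* Returning resumes the context named in uc_link (main_ctx). */
}

/* Returns 1 if coro_body executed on the private stack, 0 if not,
 * and -1 if setting up the context failed. */
int run_on_private_stack(void)
{
    coro_stack = malloc(CORO_STACK_SIZE);
    if (!coro_stack)
        return -1;
    if (getcontext(&coro_ctx) != 0) {
        free(coro_stack);
        return -1;
    }
    coro_ctx.uc_stack.ss_sp = coro_stack;      /* user-chosen stack memory */
    coro_ctx.uc_stack.ss_size = CORO_STACK_SIZE;
    coro_ctx.uc_link = &main_ctx;              /* where to go on return */
    makecontext(&coro_ctx, coro_body, 0);
    if (swapcontext(&main_ctx, &coro_ctx) != 0) {
        free(coro_stack);
        return -1;
    }
    free(coro_stack);
    return local_was_on_coro_stack;
}
```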
Kernel stack architectures
I'm aware of three approaches to this issue:
1) A separate kernel stack per thread in the system. This is the approach adopted by most well-known OSes based on a monolithic kernel, including Windows, Linux, Unix and MacOS.
While this approach leads to significant memory overhead and worsens cache utilization, it improves preemption of the kernel, which is critical for monolithic kernels with long-running system calls, especially in a multiprocessor environment.
Actually, a long time ago Linux had only one shared kernel stack, and the entire kernel was covered by the Big Kernel Lock, which limited the number of threads that could concurrently perform a system call to just one.
But the Linux kernel developers quickly recognized that blocking one process that wants to know, for instance, its PID because another process has already started sending a big packet through a very slow network is completely inefficient.
2) One shared kernel stack.
The tradeoff is very different for microkernels.
A small kernel with short system calls allows microkernel designers to stick to a design with a single kernel stack.
Given proof that all system calls are extremely short, they can benefit from improved cache utilization and smaller memory overhead while still keeping system responsiveness at a good level.
3) A kernel stack for each processor in the system.
One shared kernel stack, even in microkernel OSes, seriously affects the scalability of the entire operating system in a multiprocessor environment.
Because of this, designers frequently follow an approach that looks like a compromise between the two approaches described above, and keep one kernel stack per processor (processor core) in the system.
In that case they benefit from good cache utilization and small memory overhead, which are much better than in the stack-per-thread approach and slightly worse than in the single shared stack approach.
At the same time they benefit from good scalability and responsiveness of the system.
Thanks.
What happens when we set a different processor affinity for a process and for one of its threads in Linux?
I am trying to start a process pinned to one core (say 1) that has two threads, one of which needs to run on another core (say 0).
When I tried to set the thread's affinity to something different from the process's, the program ran fine, but I want to know the hidden impacts of this approach.
Threads and processes are largely the same thing. Whether you call pthread_setaffinity... or use the sched_setaffinity syscall, they both affect the current thread's affinity mask. This may be your "process" thread, or a thread you created.
However, note that a new thread created by pthread_create inherits a copy of its creator's CPU affinity mask [1].
That means that setting the affinity and creating a thread is not the same as creating a thread and setting the affinity. In the first case, both threads will compete over the same processor (which is most probably not what you want) and in the second case they will be bound to different processors.
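A sketch of the first case (set the affinity, then create the thread), assuming Linux with glibc and linking with -pthread; the function names are made up for illustration. The caller pins itself to a single CPU, creates a thread, and the new thread reports the mask it started with. The caller's original mask is restored afterwards.

```c
#define _GNU_SOURCE            /* CPU_* macros, sched_setaffinity, *_np calls */
#include <pthread.h>
#include <sched.h>

/* Thread body: store this thread's own affinity mask in *arg. */
static void *report_affinity(void *arg)
{
    cpu_set_t *out = arg;
    CPU_ZERO(out);
    pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), out);
    return NULL;
}

/* Returns 1 if a newly created thread starts with its creator's mask,
 * 0 if it does not, and -1 if the experiment could not be run. */
int child_inherits_affinity(void)
{
    cpu_set_t original, pinned, seen;
    if (sched_getaffinity(0, sizeof original, &original) != 0)
        return -1;

    /* Pick the first CPU we are allowed to run on. */
    int cpu = -1;
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &original)) {
            cpu = i;
            break;
        }
    }
    if (cpu < 0)
        return -1;

    CPU_ZERO(&pinned);
    CPU_SET(cpu, &pinned);
    if (sched_setaffinity(0, sizeof pinned, &pinned) != 0)
        return -1;

    pthread_t t;
    int rc = pthread_create(&t, NULL, report_affinity, &seen);
    if (rc == 0)
        pthread_join(t, NULL);

    sched_setaffinity(0, sizeof original, &original);  /* undo the pinning */
    if (rc != 0)
        return -1;
    return CPU_EQUAL(&seen, &pinned) ? 1 : 0;
}
```

To get the second case, create the thread first and then call pthread_setaffinity_np on each thread with a different mask.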
Also note that while binding threads to a dedicated processor (core) may have advantages in some situations, it may just as well be a very stupid thing to do. Playing with the affinity mask means you limit the scheduler in what it can do to make your program run. If the core you bound your thread to isn't available, your thread will not run, end of story.
This is a very similar reasoning/strategy as disabling swap to make the system "faster" (some users still do that!). By doing so they usually gain nothing, all they do is limit what the memory manager can do by removing one option of providing a free page once it runs out of unused physical RAM. Usually this means something more or less valuable from the buffer cache is purged when instead some private page that wasn't used in hours could have been swapped out.
Usually people use affinity because they have this idea that the scheduler will make threads bounce between processor cores all the time and this is bad. Processor migration indeed is not cheap, but the scheduler has a mechanism which makes sure it does not happen before a certain minimum amount of time (there is a /proc thingie for that too). After a longer amount of time, all advantages of staying at the old core (TLB, cache) are usually gone anyway, so running on a different core which is readily available is actually better than waiting for a particular core to maybe, eventually become available.
NUMA architectures may be a different topic, but I'd assume (though I don't know for sure) that the scheduler is smart enough not to silently migrate a thread to a different node. In general, however, I'd recommend not to play with affinity at all.
Affinity is a common first-line approach to limiting jitter in HPC. Typically Linux processes and threads and such are constrained to a small but sufficient set of CPUs, and the application is constrained to the remainder of the CPUs.
Affinity is very useful with device drivers. Consider for example an Infiniband adapter being used by an application. The adapter will perform best if the driver thread(s) are constrained to CPUs on the same (or closest if none) NUMA node as the adapter. Linux doesn't know the application thread so can't even consider any affinity for performance.
(This is for a low latency system)
Assuming I have some code which transfers received UDP packets to a region of shared memory, how can I then notify the application (in user mode) that it is now time to read the shared memory? I do not want the application continuously polling eating up cpu cycles.
Is it possible to insert some code in the network stack which can call my application code immediately after it has written to the shared memory?
EDIT I added a C tag, but the application would be in C++
One way to signal an event from one Unix process to another is with POSIX semaphores. You would use sem_open to initialize and open a named semaphore that you can use cross-process.
See How can I get multiple calls to sem_open working in C?.
The lowest latency method to signal an event between processes on the same host is to spin-wait looking for a (shared) memory location to change... this avoids a system call. You expressly said you do not want the application polling, however in a multi-threaded application running on a multi-core system it may not be a bad tradeoff if you really care about latency.
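A minimal sketch of the spin-wait approach, assuming Linux and C11 atomics that are lock-free for int; struct mailbox and spin_wait_for_payload are made-up names. A forked child stands in for the packet writer: it stores the payload in shared memory and then sets a flag with release ordering; the parent busy-waits on the flag with acquire ordering, so no system call sits on the notification path.

```c
#define _GNU_SOURCE            /* MAP_ANONYMOUS under strict -std modes */
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct mailbox {
    atomic_int ready;          /* 0 = empty, 1 = payload is valid */
    int payload;
};

/* Returns the payload the reader observed, or -1 on setup failure. */
int spin_wait_for_payload(int payload)
{
    struct mailbox *box = mmap(NULL, sizeof *box, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (box == MAP_FAILED)
        return -1;
    atomic_init(&box->ready, 0);
    box->payload = 0;

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        /* Writer: publish the data, then raise the flag. */
        box->payload = payload;
        atomic_store_explicit(&box->ready, 1, memory_order_release);
        _exit(0);
    }
    if (pid > 0) {
        /* Reader: burn CPU until the flag flips. */
        while (atomic_load_explicit(&box->ready, memory_order_acquire) == 0) {
            /* busy-wait */
        }
        result = box->payload;
        waitpid(pid, NULL, 0);
    }
    munmap(box, sizeof *box);
    return result;
}
```

The reader occupies a core the whole time it waits, which is the tradeoff described above.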
Unless you are planning to use a real-time OS, there is no "immediate" protocol. CPU time is handed out in quanta of a few milliseconds, and it usually takes some time for your user thread to notice that it can continue.
Considering all of the above, any form of IPC would do: local sockets, signals, pipes, event descriptors etc. The practical difference in performance would be negligible.
Furthermore, using shared memory can lead to unnecessary complications in maintenance and debugging, but that's the designer's choice.
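As one example of the options listed, here is a sketch using an event descriptor (Linux eventfd) together with shared memory; notify_via_eventfd is a made-up name and the forked child stands in for the code that receives UDP packets. The reader sleeps in read() instead of polling and is woken when the writer adds to the eventfd counter.

```c
#define _GNU_SOURCE            /* MAP_ANONYMOUS under strict -std modes */
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the payload the reader found in shared memory after being
 * woken, or -1 on failure. */
int notify_via_eventfd(int payload)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    int efd = eventfd(0, 0);           /* counter starts at 0, blocking mode */
    if (efd < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }

    int result = -1;
    pid_t pid = fork();
    if (pid == 0) {
        /* Writer: fill shared memory, then signal by adding 1. */
        *shared = payload;
        uint64_t one = 1;
        ssize_t n = write(efd, &one, sizeof one);
        _exit(n == (ssize_t)sizeof one ? 0 : 1);
    }
    if (pid > 0) {
        /* Reader: blocks without consuming CPU until the counter is non-zero. */
        uint64_t count;
        if (read(efd, &count, sizeof count) == (ssize_t)sizeof count)
            result = *shared;
        waitpid(pid, NULL, 0);
    }
    close(efd);
    munmap(shared, sizeof *shared);
    return result;
}
```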
I have an application level (PThreads) question regarding choice of hardware and its impact on software development.
I have working multi-threaded code tested well on a multi-core single CPU box.
I am trying to decide what to purchase for my next machine:
A 6-core single CPU box
A 4-core dual CPU box
My question is, if I go for the dual CPU box, will that impact the porting of my code in a serious way? Or can I just allocate more threads and let the OS handle the rest?
In other words, is multiprocessor programming any different from (single CPU) multithreading in the context of a PThreads application?
I thought it would make no difference at this level, but when configuring a new box, I noticed that one has to buy separate memory for each CPU. That's when I hit some cognitive dissonance.
More Detail Regarding the Code (for those who are interested): I read a ton of data from disk into a huge chunk of memory (~24GB soon to be more), then I spawn my threads. That initial chunk of memory is "read-only" (enforced by my own code policies) so I don't do any locking for that chunk. I got confused as I was looking at 4-core dual CPU boxes - they seem to require separate memory. In the context of my code, I have no idea what will happen "under the hood" if I allocate a bunch of extra threads. Will the OS copy my chunk of memory from one CPU's memory bank to another? This would impact how much memory I would have to buy (raising the cost for this configuration). The ideal situation (cost-wise and ease-of-programming-wise) is to have the dual CPU share one large bank of memory, but if I understand correctly, this may not be possible on the new Intel dual core MOBOs (like the HP ProLiant ML350e)?
Modern CPUs [1] handle RAM locally and use a separate channel [2] to communicate between them. This is a consumer-level version of the NUMA architecture, created for supercomputers more than a decade ago.
The idea is to avoid a shared bus (the old FSB) that can cause heavy contention because it's used by every core to access memory. As you add more NUMA cells, you get higher bandwidth. The downside is that memory becomes non-uniform from the point of view of the CPU: some RAM is faster than others.
Of course, modern OS schedulers are NUMA-aware, so they try to reduce the migration of a task from one cell to another. Sometimes it's okay to move from one core to another in the same socket; sometimes there's a whole hierarchy specifying which resources (1-,2-,3-level cache, RAM channel, IO, etc) are shared and which aren't, and that determines if there would be a penalty or not by moving the task. Sometimes it can determine that waiting for the right core would be pointless and it's better to shovel the whole thing to another socket....
In the vast majority of cases, it's best to let the scheduler do what it knows best. If not, you can play around with numactl.
As for the specific case of a given program, the best architecture depends heavily on the level of resource sharing between threads. If each thread has its own playground and mostly works alone within it, a smart enough allocator would prioritize local RAM, making it less important on which cell each thread happens to be.
If, on the other hand, objects are allocated by one thread, processed by another and consumed by a third, performance would suffer if they're not on the same cell. You could try to create small thread groups and limit heavy sharing within the group, then each group could go on a different cell without problem.
The worst case is when all threads participate in a great orgy of data sharing. Even if you have all your locks and processes well debugged, there won't be any way to optimize it to use more cores than what are available on a cell. It might even be best to limit the whole process to just use the cores in a single cell, effectively wasting the rest.
[1] By modern, I mean any AMD 64-bit chip, and Nehalem or better for Intel.
[2] AMD calls this channel HyperTransport, and Intel's name for it is QuickPath Interconnect.
EDIT:
You mention that you initialize "a big chunk of read-only memory". And then spawn a lot of threads to work on it. If each thread works on its own part of that chunk, then it would be a lot better if you initialize it on the thread, after spawning it. That would allow the threads to spread to several cores, and the allocator would choose local RAM for each, a much more effective layout. Maybe there's some way to hint the scheduler to migrate away the threads as soon as they're spawned, but I don't know the details.
EDIT 2:
If your data is read verbatim from disk, without any processing, it might be advantageous to use mmap instead of allocating a big chunk and read()ing. There are some common advantages:
No need to preallocate RAM.
The mmap operation is almost instantaneous and you can start using it. The data will be read lazily as needed.
The OS can be way smarter than you when choosing between application, mmaped RAM, buffers and cache.
It's less code!
Data that isn't needed won't be read and won't use up RAM.
You can specifically mark the mapping as read-only. Any bug that tries to write will cause a coredump.
Since the OS knows it's read-only, it can't be 'dirty', so if the RAM is needed, it will simply discard it, and reread when needed.
but in this case, you also get:
Since data is read lazily, each RAM page would be chosen after the threads have spread on all available cores; this would allow the OS to choose pages close to the process.
So, I think that if two conditions hold:
the data isn't processed in any way between disk and RAM
each part of the data is read (mostly) by one single thread, not touched by all of them.
then, just by using mmap, you should be able to take advantage of machines of any size.
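A minimal sketch of the mmap approach, assuming a POSIX system; map_file_readonly is a made-up helper name. The file is mapped with PROT_READ, so pages are loaded lazily on first access and any write through the pointer faults instead of silently corrupting the data.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the whole file read-only. On success returns the mapping and
 * stores its length in *len_out; returns NULL on failure or for an
 * empty file. Release the mapping with munmap() when done. */
const void *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

Worker threads can then read from the returned pointer directly; which physical pages back each part is decided when a thread first touches it.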
If each part of the data is read by more than one single thread, maybe you could identify which threads will (mostly) share the same pages, and try to hint the scheduler to keep these in the same NUMA cell.
For the x86 boxes you're looking at, the fact that memory is physically wired to different CPU sockets is an implementation detail. Logically, the total memory of the machine appears as one large pool - you wouldn't need to change your application code for it to run correctly across both CPUs.
Performance, however, is another matter. There is a speed penalty for cross-socket memory access, so the unmodified program may not run to its full potential.
Unfortunately, it's hard to say ahead of time whether your code will run faster on the 6-core, one-node box or the 8-core, two-node box. Even if we could see your code, it would ultimately be an educated guess. A few things to consider:
The cross-socket memory access penalty only kicks in on a cache miss, so if your program has good cache behaviour then NUMA won't hurt you much;
If your threads are all writing to private memory regions and you're limited by write bandwidth to memory, then the dual-socket machine will end up helping;
If you're compute-bound rather than memory-bandwidth-bound then 8 cores is likely better than 6;
If your performance is bounded by cache read misses then the 6 core single-socket box starts to look better;
If you have a lot of lock contention or writes to shared data then again this tends to advise towards the single-socket box.
There's a lot of variables, so the best thing to do is to ask your HP reseller for loaner machines matching the configurations you're considering. You can then test your application out, see where it performs best and order your hardware accordingly.
Without more details, it's hard to give a detailed answer. However, hopefully the following will help you frame the problem.
If your thread code is proper (e.g. you properly lock shared resources), you should not experience any bugs introduced by the change of hardware architecture. Improper threading code can sometimes be masked by the specifics of how a specific platform handles things like CPU cache access/sharing.
You may experience a change in application performance per equivalent core due to differing approaches to memory and cache management in the single chip, multi core vs. multi chip alternatives.
Specifically if you are looking at hardware that has separate memory per CPU, I would assume that each thread is going to be locked to the CPU it starts on (otherwise, the system would have to incur significant overhead to move a thread's memory to memory dedicated to a different core). That may reduce overall system efficiency depending on your specific situation. However, separate memory per core also means that the different CPUs do not compete with each other for a given cache line (the 4 cores on each of the dual CPUs will still potentially compete for cache lines, but that is less contention than if 6 cores are competing for the same cache lines).
This type of cache line contention is called False Sharing. I suggest the following read to understand if that may be an issue you are facing
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206?pgno=3
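For illustration, false sharing happens when threads write to different variables that happen to sit on the same cache line. A common remedy is to align each per-thread variable to its own line. The sketch below assumes a 64-byte cache line (typical for x86, but an assumption; check your hardware) and uses made-up struct names to compare the two layouts.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64          /* assumption: cache line size in bytes */

/* Two counters packed next to each other: they share a cache line, so
 * writes from two threads would keep invalidating each other's copy. */
struct packed_counters {
    long a;
    long b;
};

/* Each counter aligned to its own cache line: writes to `a` no longer
 * disturb the line holding `b`. */
struct padded_counters {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Byte distance between the two counters in each layout. */
size_t distance_packed(void)
{
    return offsetof(struct packed_counters, b)
         - offsetof(struct packed_counters, a);
}

size_t distance_padded(void)
{
    return offsetof(struct padded_counters, b)
         - offsetof(struct padded_counters, a);
}
```

The padded layout trades memory for isolation: the struct grows to two full cache lines.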
Bottom line is, application behavior should be stable (other than things that naturally depend on the details of thread scheduling) if you followed proper thread development practices, but performance could go either way depending on exactly what you are doing.