How memcpy is handled by DMA in Linux - C

I am using memcpy() in my program. As I increase the number of variables, the CPU usage unfortunately increases as well; it is as if memcpy were run as a plain for-loop. Is there a faster memcpy function on Linux? Should I apply a patch and recompile the kernel?

There are architectures where the bus between the CPU and memory is rather weak; some of those architectures add a DMA engine to allow big blocks of memory to be copied without having a loop running on the CPU.
In Linux, you would be able to access the DMA engine with the dmaengine subsystem, but it is very hardware-dependent whether such an engine is actually available.
x86 CPUs have a good memory subsystem, and also have special hardware support for copying large blocks (rep movs), so using a DMA engine would be very unlikely to actually help.
(Intel added a DMA engine called I/OAT to some server boards, but the overall results were not much better than plain CPU copies.)
DMA forces the data out of the CPU caches, so doing DMA copies for your program's variables would be utterly pointless because the first CPU access afterwards would have to read them back into the cache.
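For the curious, here is a minimal sketch (my own, not from the answer) of what that block-copy hardware support looks like: the REP MOVSB instruction, which glibc's memcpy already selects on CPUs with "enhanced REP MOVSB" (ERMSB), so a hand-rolled loop is unlikely to beat the library version:

    /* copy_rep_movsb: illustrative x86-64 block copy using REP MOVSB.
     * GNU C inline asm; dst, src and n are placed in RDI, RSI and RCX.
     * glibc's memcpy already picks a routine like this at run time. */
    #include <stddef.h>

    static void copy_rep_movsb(void *dst, const void *src, size_t n)
    {
        asm volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
    }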

Related

Store read/write data in assembly similar to filesystem

I am creating a custom operating system. Is there any way to store data (almost like a filesystem) in assembly, so that if the computer shuts off and turns back on, the data will still be there?
You can write device drivers for SATA hard drives, USB mass storage, floppy disks, NVMe flash, or whatever else in asm. You might also be able to use BIOS functions to access them (especially if you're on x86). But then you have to manage writes in chunks of 512B or 4096B, because those kinds of storage are block-based.
A more literal interpretation of the question has an interesting answer: can a store instruction like mov [mem], eax put data into persistent storage where a load instruction can get it later (after a power cycle)?
Yes, if your hardware has some memory-mapped non-volatile RAM (physically memory-mapped NVRAM like an NVDIMM, not mmap() logically mapping a file into a process's virtual address space). See this answer on Superuser about Intel Optane DC Persistent Memory.
x86, for example, has recently gained more instructions to support NVRAM, like clwb to write back a cache line (all the way to memory) without necessarily evicting it. Early implementations may just run clwb like clflushopt, though: @Ana reports that Skylake-X does evict.
Also, clflushopt is a more efficient way than plain clflush to force multiple cache lines to memory. Use a memory barrier like sfence after the weakly-ordered clflushopt flushes to make sure the data is in non-volatile RAM before later stores become visible.
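As a rough sketch (my addition, assuming a pointer that really maps NVDIMM memory and a CPU with CLWB support), the flush-then-fence pattern looks like this with the compiler intrinsics:

    /* Flush a buffer to persistent memory: CLWB each cache line, then
     * SFENCE to order the weakly-ordered flushes before later stores.
     * Assumes 'pmem' maps NVDIMM memory; build with e.g. gcc -mclwb. */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_LINE 64

    static void flush_to_pmem(const void *pmem, size_t len)
    {
        uintptr_t p   = (uintptr_t)pmem & ~(uintptr_t)(CACHE_LINE - 1);
        uintptr_t end = (uintptr_t)pmem + len;

        for (; p < end; p += CACHE_LINE)
            _mm_clwb((void *)p);   /* write the line back without forcing eviction */

        _mm_sfence();              /* make the flushes globally visible before later writes */
    }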
For a while Intel was going to require pcommit as part of making sure data had hit non-volatile storage, but decided against it. With that in mind, see Why Intel added the CLWB and PCOMMIT instructions for more details about using persistent RAM.
IDK what the situation is on architectures other than x86, but presumably NV RAM is / will be usable with ARM and other CPUs, too.

How does the kernel restrict processes to their own memory pool?

This is a purely academic question, not related to any particular OS.
We have an x86 CPU and main memory. This memory resembles a memory pool consisting of addressable memory units that can be read or written using their address with the CPU's MOV instruction (we can move data from / to this memory pool).
Given that our program is the kernel, we have full access to this whole memory pool. However, if our program is not running directly on the hardware, the kernel creates some "virtual" memory pool which lies somewhere inside the physical memory pool; our process treats it just like the physical memory pool and can write to it, read from it, or change its size, usually by calling something like sbrk or brk (on Linux).
My question is: how is this virtual pool implemented? I know I could read the whole Linux source code and maybe find it in a year, but I can also ask here :)
I suppose that one of these 3 potential solutions is being used:
Interpret the instructions of the program (very inefficient and unlikely): the kernel would just read the program's byte code and interpret each instruction individually; e.g. if it saw a request to access memory the process isn't allowed to touch, it wouldn't let it through.
Create some OS-level API that would have to be used in order to read / write memory, and disallow access to raw memory, which is probably just as inefficient.
Hardware feature (probably best, but I have no idea how it works): the kernel would say "dear CPU, I am now going to send you instructions from some unprivileged process; please restrict its memory accesses to the area 0x00ABC023 - 0xDEADBEEF", and the CPU wouldn't let the user process do anything with memory outside the range approved by the kernel.
The reason why I am asking is to understand whether there is any overhead in running a program unprivileged under a kernel (let's not consider the overhead caused by the multithreading implemented by the kernel itself) compared to running the program natively on the CPU (with no OS), as well as the overhead in memory access caused by machine virtualization, which probably uses a similar technique.
You're on the right track when you mention a hardware feature. This is known as protected mode, and it was introduced to x86 by Intel with the 80286. It has evolved and changed over time, and currently x86 has 4 operating modes.
Processors start running in real mode, and privileged software (ring 0 - your kernel, for example) can later switch between these modes.
The virtual addressing is implemented and enforced using the paging mechanism (How does x86 paging work?) supported by the processor.
On a normal system, memory protection is enforced at the MMU, or memory management unit, which is a hardware block that configurably maps virtual to physical addresses. Only the kernel is allowed to directly configure it, and operations which are illegal or go to unmapped pages raise exceptions to the kernel, which can then discipline the offending process or fetch the missing page from disk as appropriate.
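To make the hardware option concrete, here is a small illustrative sketch (mine, not from the answers) of how a 48-bit x86-64 virtual address is carved into the page-table indices that the MMU walks; the kernel fills in those tables, and each entry also carries a user/supervisor bit that the CPU checks on every access:

    /* Decompose an x86-64 virtual address into 4-level page-table indices
     * (standard 4 KiB pages: 9 bits per level + 12-bit page offset).
     * Purely illustrative; the real walk happens inside the MMU. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567abcULL;   /* an arbitrary user-space address */

        unsigned pml4 = (vaddr >> 39) & 0x1ff;    /* level 4 index */
        unsigned pdpt = (vaddr >> 30) & 0x1ff;    /* level 3 index */
        unsigned pd   = (vaddr >> 21) & 0x1ff;    /* level 2 index */
        unsigned pt   = (vaddr >> 12) & 0x1ff;    /* level 1 index */
        unsigned off  =  vaddr        & 0xfff;    /* offset inside the 4 KiB page */

        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
               pml4, pdpt, pd, pt, off);
        return 0;
    }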
A virtual machine typically uses CPU hardware features to trap and emulate privileged operations, or those which would interact too directly with hardware state, while letting ordinary instructions run natively, so the overall speed penalty is moderate. If those features are unavailable, the whole thing must be emulated, which is indeed slow.

How does the Linux OS schedule threads when there are multiple sockets?

For example, in a dual-socket system with 2 quad-core processors, does the thread scheduler try to keep the threads of the same process on the same processor? Interleaving the threads of different processes across processors would slow down performance in the case where the threads of a process have a lot of shared memory accesses.
It depends.
On current Intel platforms the BIOS default seems to be that memory is interleaved between the sockets in the system, page by page. Allocate 1 MByte and half of it will be on one socket, half on the other. That means that wherever your threads are, they have equal access to the data.
This makes it very simple for OSes - anywhere will do.
This can work against you. The SMP hardware environment presented to the OS is synthesised by the CPUs cooperating over QPI. If there are a lot of threads all accessing the same data, those links can get really busy. If they're too busy, that limits the performance and you're I/O bound. That's where I am; Z80 cores with Intel's memory subsystem design would be just as quick as the Nehalem cores I've actually got (ok, I might be exaggerating...).
At the end of the day, the real problem is that memory just isn't quick enough. Intel and AMD have both done some impressive things with memory recently, but we're still hampered by its slowness. Ideally, memory would be quick enough that all cores had clock-rate access times to it. The Cell processor sort of did this - each SPE has a bit of SRAM instead of a cache, and once you get your head round them you can make them really sing.
===EDIT===
There is more to it. As Basile Starynkevitch hints, the alternative approach is to embrace NUMA.
NUMA is what modern CPUs actually embody: memory access is non-uniform because the memory on the other CPU socket cannot be reached directly over a bus. Instead, the CPU sends a request over the QPI link (or HyperTransport in AMD's case) asking the other CPU to fetch the data out of its memory and send it back. Because the CPU does all of this for you in hardware, it ends up looking like a conventional SMP environment. And QPI / HyperTransport are very fast, so most of the time it's plenty quick enough.
If you write your code to mirror the architecture of the hardware, you can in theory make improvements. This might involve (for example) keeping two copies of your data in the system, one on each socket. There are memory-affinity routines in Linux to specifically allocate memory that way instead of interleaving it across all sockets, and there are also CPU-affinity routines that let you control which CPU core a thread runs on, the idea being that you run it on a core close to the data buffer it will be processing.
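As a rough sketch of those routines (my own example, not from the answer; it assumes libnuma is installed and that CPU 0 sits on NUMA node 0), allocating on a specific node and pinning the thread next to it looks like this:

    /* Allocate a buffer on NUMA node 0 with libnuma and pin the calling
     * thread to CPU 0, so the thread works on local memory rather than
     * reaching across QPI / HyperTransport.  Build with: gcc demo.c -lnuma */
    #define _GNU_SOURCE
    #include <numa.h>       /* numa_available, numa_alloc_onnode, numa_free */
    #include <sched.h>      /* sched_setaffinity, CPU_ZERO, CPU_SET */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        size_t len = 1 << 20;
        char *buf = numa_alloc_onnode(len, 0);    /* pages physically on node 0 */
        if (!buf)
            return 1;

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                         /* assumption: CPU 0 is on node 0 */
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        memset(buf, 0, len);                      /* touched locally, no cross-socket hop */
        numa_free(buf, len);
        return 0;
    }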
Ok, so that might mean a lot of investment in the source code to make this work for you (especially if the data duplication doesn't fit well with the program's flow), but if QPI has become a problematic bottleneck it's the only thing you can do.
I've fiddled with this to some extent. In a way it's a right faff. The whole mindset of Intel and AMD (and thus the OSes and libraries too) is to give you an SMP environment which, most of the time, is pretty good. However, they do let you play with NUMA through a load of library functions that you have to call to get the deployment of threads and memory that you want.
However, for the edge cases where you want that little bit of extra speed, it'd be easier if the architecture and OS were rigidly NUMA, with no SMP at all - just like the Cell processor, in fact. Easier, not because it'd be simple to write (in fact it would be harder), but because if you got it running at all, you'd know for sure that it was as quick as the hardware could ever possibly achieve. With the faked SMP we have right now, you experiment with NUMA but are mostly left wondering whether it's as fast as it could possibly be. The libraries don't tell you when you're accessing memory that is actually resident on another socket; they just let you do it, with no hint that there's room for improvement.

Cache coherence issues in a DMA context

Suppose the CPU modifies the value at location x+50 and does not flush it back to main memory (write-back cache).
Meanwhile, a device launches a DMA read request covering x to x+100.
In that case, how is the CPU informed that it must flush back the dirty cache line?
The DMA circuitry often works directly with main memory without involving the CPU (and that's the main idea: to free the CPU from doing I/O that can be done elsewhere in the hardware and thus save CPU cycles). So you may indeed run into cache-coherency problems. Microsoft recommends flushing I/O buffers when using DMA.
But some systems do support cache-coherency protocols between CPUs and DMA circuits, much like between CPUs in multiprocessor systems. The ultimate answer depends on the actual hardware.
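For what it's worth, in the Linux kernel the streaming DMA-mapping API is how a driver usually deals with exactly this: on hardware without coherent DMA, mapping the buffer performs the necessary cache write-back / invalidation. A rough driver-side sketch (mine, not from either answer; 'dev' and 'buf' are assumed to come from the surrounding driver code):

    /* Before the device reads the buffer, dma_map_single() writes any dirty
     * cache lines covering it back to main memory (on non-coherent hardware)
     * and returns a bus address the DMA engine can be programmed with. */
    #include <linux/dma-mapping.h>
    #include <linux/types.h>
    #include <linux/errno.h>

    static int start_dma_read_from_memory(struct device *dev, void *buf, size_t len)
    {
        dma_addr_t handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, handle))
            return -ENOMEM;

        /* ... program the device with 'handle' and start the transfer ... */

        /* After the device is done, release the mapping. */
        dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
        return 0;
    }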
There are three approaches I can think of:
The memory is marked as un-cacheable,
the DMA controller co-ordinates with the cache controller,
the OS guarantees this will never happen, e.g. by ensuring the CPU-part of the process isn't running.
It depends on the hardware, and the capabilities of the OS.
Ensuring the process is not running isn't too weird on a multi-tasking OS, as DMA on memory owned by a process is likely triggered by the process doing a system call, e.g. a write. The process can be de-scheduled, and other processes run, until the DMA completes.
It may be too much of a constraint to wait for an I/O device to complete, so the DMA controller might instead be copying from the process's address space to a secondary buffer.
So if you have a case where this has happened, please outline the example, and the tests you've run.
