C application for parallel communication with direct memory access - c

I'm having a problem with a parallel connection I've got to establish using DMA (Direct Memory Access).
I've got to write some characters to a parallel port with a given address, through a C application. I know that for a PIO access, there are the _inp/_outp functions, but I don't know how to manage a direct memory access parallel communication.
Does anyone know how I should do this, or have any good links? (I couldn't find any even after long research on the Web.)

This is not something that can be answered generically.
DMA access is performed either by a DMA controller (in old PCs) or using "bus mastering" (PCI onwards). Either of these solutions requires access to the relevant hardware manuals for the device that you are working with (and the DMA controller, if applicable).
In general, the principle works like this:
Reserve a piece of memory (DMA buffer) for the device to store data in.
Configure the device to store the data in said region (remember that in nearly all cases, DMA happens on physical addresses, as opposed to the virtual addresses that Windows or Linux uses).
When the device has stored the requested data, it fires an interrupt. The software responsible for the device handles the interrupt, signals some higher-level software that the data is ready, and (perhaps) reprograms the device to start storing data again, either after copying the DMA buffer somewhere else or after assigning a new DMA buffer.
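The three steps above can be sketched as a small user-space simulation. This is only an illustration of the flow, not real driver code: `fake_device`, `driver_state`, `driver_start`, `driver_irq` and `device_transfer` are all hypothetical names, the "device" is an ordinary C function, and the buffer address is a normal pointer rather than the physical address real hardware would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device model; on real hardware these would be registers. */
struct fake_device {
    uint8_t *dma_addr;          /* where the device was told to store data */
    size_t   dma_len;           /* size of the DMA buffer in bytes */
    void   (*irq)(void *ctx);   /* stand-in for the interrupt line */
    void    *irq_ctx;           /* argument passed to the handler */
};

struct driver_state {
    uint8_t buffer[64];         /* step 1: memory reserved as the DMA buffer */
    uint8_t copy[64];           /* where completed data is copied to */
    int     completions;        /* number of "interrupts" handled so far */
};

/* Step 3: the interrupt handler copies the data out and records completion.
 * A real handler would also signal higher-level software here. */
void driver_irq(void *ctx)
{
    struct driver_state *drv = ctx;
    memcpy(drv->copy, drv->buffer, sizeof drv->copy);
    drv->completions++;
}

/* Step 2: tell the device where to store data and how to report completion.
 * Assumption: a real driver would write a physical (bus) address here. */
void driver_start(struct driver_state *drv, struct fake_device *dev)
{
    dev->dma_addr = drv->buffer;
    dev->dma_len  = sizeof drv->buffer;
    dev->irq      = driver_irq;
    dev->irq_ctx  = drv;
}

/* Simulated hardware: fill the buffer, then raise the "interrupt". */
void device_transfer(struct fake_device *dev, uint8_t value)
{
    assert(dev->dma_addr != NULL && dev->irq != NULL);
    memset(dev->dma_addr, value, dev->dma_len);
    dev->irq(dev->irq_ctx);
}
```

Calling `driver_start()` once and then `device_transfer()` repeatedly mimics the "reprogram and store again" loop; each call ends with the handler copying the buffer out.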


Kernel-Mode Driver Development in Windows

I'm developing a new Kernel-Mode driver, that should run on Windows 10 (64-bit).
This driver should allocate 48GB of contiguous physical memory and map it (its base address) to a virtual address in the user space of the Windows application that will use it. The system actually has 64GB of RAM installed, so it might be necessary to dedicate a segment of memory to this use, perhaps by changing boot entry information.
In addition, the driver should somehow reveal its base address to an FPGA-based device located in a PCIe slot. The purpose is to use these 48GB of contiguous physical memory as a DMA (Direct Memory Access) buffer. Namely, the FPGA-based device will generate data and write it to the appropriate location in the DMA buffer. The host software will read from that location in a cyclic fashion. That is, the FPGA will overwrite the data in the buffer, and the host will try to keep pace and read the data before it is overwritten.
Please note that this question only deals with the host side (the driver), and not with the FPGA side.
So, basically my questions are:
How do I make such an allocation (as described above)?
How do I map the base address from the virtual address in the Kernel-Space, to the appropriate virtual address in the User-Space?
How do I reveal that base address to the FPGA (located on the PCIe slot), so it will know where to perform its write operations?
What other Callback-Functions should this driver implement, in terms of events handling?
Thanks a lot!

Coherently understand the software-hardware interaction with regard to DMA and buses

I've gathered some knowledge about several components (both software and hardware) that are involved in general DMA transactions on ARM based boards, but I don't understand how it all fits together, and I haven't found a full, coherent description of this.
I'll write down a high-level summary of what I already know, and I hope that someone can correct me where I'm wrong and fill in the missing parts so that the whole picture becomes clear. My description starts with the userspace software and drills down to the hardware components. The misunderstood parts are in italic-bold format.
The user-mode application requests to read/write from some device, i.e. performs an I/O operation.
The operating system receives the request and hands it to the appropriate driver (every OS has its own mechanism to do this; I don't need a further drill-down here, but if you want to share insights you are welcome).
The driver in charge of handling the I/O request has to know the address to which the device is mapped (since I'm interested in ARM based boards, afaik there is only memory-mapped I/O and no port I/O). In most cases (if we consider smartphone-like boards) there is a Linux kernel that parses the device addresses from the device tree, which is passed by the bootloader at boot time (the modern approach), or the kernel is precompiled for the specific model family and board with the device addresses hardcoded in its source code (the older, obsolete? approach). In some cases (this happens a lot in smartphones) some of the drivers are precompiled and just packaged into the kernel, i.e. their source is closed, so the addresses corresponding to the devices are unknown. Is it correct?
Given that the driver knows the address of the relevant registers of the device it wants to communicate with, it allocates a buffer (usually in kernel space) to which the device will write its data (with the help of DMA). The driver needs to inform the device about the location of that buffer, but the addresses that devices work with (to manipulate memory) are different from the addresses that the drivers (CPU) work with; hence, the driver needs to inform the device about the 'bus address' of the buffer it has just allocated. How does the driver inform the device about that address? How common is it to use an IOMMU? When using an IOMMU, is there one hardware component that manages addressing, or one per device?
Then the driver commands the device to do its job (by manipulating its registers) and the device transfers output data directly to the allocated buffer in memory. Here I'm a bit confused about the relation device-driver:bus:bus-controller:actual-device. Take for example some imaginary device that communicates using the I2C protocol; the SoC specifies an I2C bus interface - what is this actually? Does the I2C bus have some kind of bus controller? Does the CPU communicate with the I2C bus interface or directly with the device (i.e. the I2C bus interface is transparent)? I guess that someone with some experience with device drivers could answer this easily.
The device populates a DMA channel. Since the device is not connected directly to the memory but rather through some bus to the DMA controller (which masters the bus), it interacts with the DMA controller to transfer the required data to the allocated buffer in memory. When the board vendor uses ARM IP cores and bus specifications, this step involves transactions over a bus from the AMBA spec (i.e. AHB/multi-AHB/AXI), and some protocol between the device and a DMAC on top of it. I would like to know more about this step: what actually happens? There are many DMA controller specifications from ARM; which one is popular? Which are obsolete?
When the device is done, it sends an interrupt, which travels to the OS through the interrupt controller, and the OS's interrupt handler directs it to the appropriate driver, which now knows that the DMA transfer is complete.
You've slightly conflated two things here - there are some devices (e.g. UARTs, MMC controllers, audio controllers, typically lower-bandwidth devices) which rely on an external DMA controller ("DMA engine" in Linux terminology), but many devices are simply bus masters in their own right and perform their own DMA directly (e.g. GPUs, USB host controllers, and of course the DMA controllers themselves). The former involves a bunch of extra complexity with the CPU programming the DMA controller, so I'm going to ignore it and just consider straightforward bus-master DMA.
In a typical ARM SoC, the CPU clusters and other master peripherals, and the memory controller and other slave peripherals, are all connected together with various AMBA interconnects, forming a single "bus" (generally all mapped to the "platform bus" in Linux), over which masters address slaves according to the address maps of the interconnect. You can safely assume that the device drivers know (whether by device tree or hardcoded) where devices appear in the CPU's physical address map, because otherwise they'd be useless.
On simpler systems, there is a single address map, so the physical addresses used by the CPU to address RAM and peripherals can be freely shared with other masters as DMA addresses. Other systems are more complex - one of the more well-known is the Raspberry Pi's BCM2835, in which the CPU and GPU have different address maps; e.g. the interconnect is hard-wired such that where the GPU sees peripherals at "bus address" 0x7e000000, the CPU sees them at "physical address" 0x20000000. Furthermore, in LPAE systems with 40-bit physical addresses, the interconnect might need to provide different views to different masters - e.g. in the TI Keystone 2 SoCs, all the DRAM is above the 32-bit boundary from the CPUs' point of view, so the 32-bit DMA masters would be useless if the interconnect didn't show them a different address map. For Linux, check out the dma-ranges device tree property for how such CPU→bus translations are described. The CPU must take these translations into account when telling a master to access a particular RAM or peripheral address; Linux drivers should be using the DMA mapping API which provides appropriately-translated DMA addresses.
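To make the fixed-offset translation concrete, here is a minimal sketch of a CPU-physical to bus address conversion for a single window. The struct and function names are hypothetical (this is not the Linux DMA mapping API), and the 16 MiB window size is an assumption; only the two base addresses come from the BCM2835 example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One fixed translation window between the CPU's view and a bus view. */
struct addr_window {
    uint64_t cpu_base;   /* start of the window as the CPU sees it */
    uint64_t bus_base;   /* start of the same window as the other master sees it */
    uint64_t size;       /* length of the window in bytes */
};

/* BCM2835 peripheral window; bases are from the text, the size is assumed. */
const struct addr_window bcm2835_periph = {
    0x20000000u, 0x7e000000u, 0x01000000u
};

/* Translate a CPU physical address into a bus address.
 * Returns false if the address lies outside the window. */
bool cpu_to_bus(const struct addr_window *w, uint64_t cpu, uint64_t *bus)
{
    assert(w != NULL && bus != NULL);
    if (cpu < w->cpu_base || cpu - w->cpu_base >= w->size)
        return false;
    *bus = w->bus_base + (cpu - w->cpu_base);
    return true;
}
```

With these numbers, CPU physical address 0x20200000 translates to bus address 0x7e200000. A real Linux driver would not do this arithmetic by hand; the DMA mapping API applies the translation described by the platform.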
IOMMUs provide more flexibility than fixed interconnect offsets - typically, addresses can be remapped dynamically, and for system integrity masters can be prevented from accessing any addresses other than those mapped for DMA at any given time. Furthermore, in an LPAE or AArch64 system with more than 4GB of RAM, an IOMMU becomes necessary if a 32-bit peripheral needs to be able to access buffers anywhere in RAM. You'll see IOMMUs on a lot of the current 64-bit systems for the purpose of integrating legacy 32-bit devices, but they are also increasingly popular for the purpose of device virtualisation.
IOMMU topology depends on the system and the IOMMUs in use - the system I'm currently working with has 7 separate ARM MMU-401/400 devices in front of individual bus-master peripherals; the ARM MMU-500 on the other hand can be implemented as a single system-wide device with a separate TLB for each master; other vendors have their own designs. Either way, from a Linux perspective, most device drivers should be using the aforementioned DMA mapping API to allocate and prepare physical buffers for DMA, which will also automatically set up the appropriate IOMMU mappings if the device is attached to one. That way, individual device drivers need not care about the presence of an IOMMU or not. Other drivers (typically GPU drivers) however, depend on an IOMMU and want complete control, so manage the mappings directly via the IOMMU API. Essentially, the IOMMU's page tables are set up to map certain ranges of physical addresses* to ranges of I/O virtual addresses, those IOVAs are given to the device as DMA (i.e. bus) addresses, and the IOMMU translates the IOVAs back to physical addresses as the device accesses them. Once the DMA operation is finished, the driver typically removes the IOMMU mapping, both to free up IOVA space and so that the device no longer has access to RAM.
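The map, translate, unmap cycle described above can be illustrated with a toy single-level page table. This is a sketch under simplifying assumptions (4 KiB pages, a 16-page IOVA space, no permissions, no TLB); `toy_iommu` and its functions are hypothetical and are unrelated to the real Linux IOMMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16u               /* size of the toy IOVA space, in pages */

struct toy_iommu {
    uint64_t phys_page[NUM_PAGES];   /* physical page base for each IOVA page */
    bool     valid[NUM_PAGES];       /* whether the IOVA page is mapped */
};

/* Map one page-aligned IOVA to one page-aligned physical address. */
bool iommu_map(struct toy_iommu *mmu, uint64_t iova, uint64_t phys)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    assert(mmu != NULL);
    if (idx >= NUM_PAGES || (iova % PAGE_SIZE) != 0 || (phys % PAGE_SIZE) != 0)
        return false;
    mmu->phys_page[idx] = phys;
    mmu->valid[idx] = true;
    return true;
}

/* Remove a mapping so the device can no longer reach that page. */
void iommu_unmap(struct toy_iommu *mmu, uint64_t iova)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    assert(mmu != NULL);
    if (idx < NUM_PAGES)
        mmu->valid[idx] = false;
}

/* What the hardware does on each device access: IOVA in, physical out.
 * Returns false (a "fault") when the page is not mapped. */
bool iommu_translate(const struct toy_iommu *mmu, uint64_t iova, uint64_t *phys)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    assert(mmu != NULL && phys != NULL);
    if (idx >= NUM_PAGES || !mmu->valid[idx])
        return false;
    *phys = mmu->phys_page[idx] | (iova & (PAGE_SIZE - 1u));
    return true;
}
```

The device is only ever given the IOVA; once `iommu_unmap()` has run, the same IOVA faults instead of reaching RAM, which is the isolation property mentioned above.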
Note that in some cases the DMA transfer is cyclic and never "finishes". With something like a display controller, the CPU might just map a buffer for DMA, pass that address to the controller and trigger it to start, and it will then continuously perform DMA reads to scan out whatever the CPU writes to that buffer until it is told to stop.
Other peripheral buses beyond the SoC interconnect, like I2C/SPI/USB/etc. work as you suspect - there is a bus controller (which is itself a device on the AMBA bus, so any of the above might apply to it) with its own device driver. In a crude generalisation, the CPU doesn't communicate directly with devices on the external bus - where a driver for an AMBA device says "write X to register Y", that just happens by the CPU performing a store to a memory-mapped address; where an I2C device driver says "write X to register Y", the OS usually has some bus abstraction layer which the bus controller driver implements, whereby the CPU programs the controller with a command saying "write X to register Y on device Z", the bus controller hardware will go off and do that, then notify the OS of the peripheral device's response via an interrupt or some other means.
* technically, the IOMMU itself, being more or less "just another device", could have a different address map in the interconnect as previously described, but I would doubt the sanity of anyone actually building a system like that.

how is tcp(kernel) bypass implemented?

Assuming I would like to avoid the overhead of the linux kernel in handling incoming packets and instead would like to grab the packet directly from user space. I have googled around a bit and it seems that all that needs to happen is one would use raw sockets with some socket options. Is this the case? Or is it more involved than this? And if so, what can I google for or reference in order to implement something like this?
There are many techniques for networking with kernel bypass.
First, if you are sending messages to another process on the same machine, you can do so through a shared memory region with no jumps into the kernel.
Passing packets over a network without involving the kernel gets more interesting, and involves specialized hardware that gets direct access to user memory. This idea is called RDMA.
Here's one way it can work (this is what InfiniBand hardware does). The application registers a memory buffer with the RDMA hardware. This buffer is pinned in physical memory, since swapping it out would obviously be bad (since the hardware will keep writing to the physical memory region). A control region is also mapped into userspace memory. When an application is ready to use the buffer to send or receive a message, it writes a command to the control region. The hardware takes the data from a registered buffer on one end, and places it into another registered buffer at the other end.
Clearly, this is too low level, so there are abstractions that make programming RDMA hardware easier. OFED verbs are one such abstraction.
The InfiniBand software stack has one extra interesting bit: the Sockets Direct Protocol (SDP) that is used for compatibility with existing applications. It works by inserting an LD_PRELOAD shim that translates standard socket API calls into IB verbs.
InfiniBand is just what I'm most familiar with. RoCE/iWARP hardware is very similar from the programmer's perspective, but uses a different transport than InfiniBand (TCP using an offload engine in iWARP, Ethernet in RoCE). There are/were also other approaches to RDMA (Quadrics, for example).

What does the machine code for networking look like?

At the end of the day every piece of code we write eventually gets turned into assembler and then machine language.
If you were writing assembler and wanting to perform a simple connection between two computers, how would you know which memory addresses to use (let alone offsets) within the assembler? Would you need to know specific addresses relating to the operating system?
I'm just wondering how somebody would write a really "clean" and "efficient" message passing library/compiler- the thing which is getting me is what on earth would network communications/IPC look like in assembler?
I think part of this answer could lie with querying known addresses relating to the OS? For example, 0x4545456 to 0x60000000 contains the Linux kernel data for communications X, etc.
The addresses are not specific to your OS. They are specific to your hardware/system. Accessing those has nothing to do with assembler vs. another programming language (e.g. C), in fact most device driver code (the code that actually interacts with the networking hardware) is typically written in C.
Here's just one random sample of a network (ethernet) controller:
Intel® 82580EB/82580DB GbE Controller: Datasheet
There are a bunch of registers that your software, either in assembler, or in another language, has to program to get this thing to actually communicate over ethernet. It's probably easier to start with a simpler example, something like a serial port. Let's build a hypothetical, fixed baud rate, serial port controller, mapped to memory:
Address  Meaning
0        RX status (reads 0 when no data to read, 1 when a byte is available)
1        RX buffer
2        TX status (reads 0 when ready to send, 1 when busy)
3        TX buffer
Now your software, either in assembler or any other language, can transmit data to another computer by polling address 2 until the port is ready and then writing the next byte to address 3. It can also receive data from another computer by polling address 0 to see when data is ready and reading the byte from address 1 when the data is there.
In a modern operating system those are all physical addresses, which need to be mapped into virtual addresses somehow.
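As a sketch of what that polling looks like in C for the hypothetical controller above: the register offsets follow the table, while the function names and the idea of passing the register base as a pointer are assumptions made for illustration. On real hardware `regs` would point at the mapped device registers; here any 4-byte array can stand in for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets of the hypothetical serial controller. */
enum {
    REG_RX_STATUS = 0,   /* 0: nothing to read, 1: a byte is available */
    REG_RX_BUFFER = 1,
    REG_TX_STATUS = 2,   /* 0: ready to send, 1: busy */
    REG_TX_BUFFER = 3
};

/* Busy-wait until the transmitter is ready, then hand it one byte.
 * 'volatile' forces a real load on every loop iteration. */
void serial_send(volatile uint8_t *regs, uint8_t byte)
{
    assert(regs != NULL);
    while (regs[REG_TX_STATUS] != 0) {
        /* poll */
    }
    regs[REG_TX_BUFFER] = byte;
}

/* Busy-wait until a byte has arrived, then read it. */
uint8_t serial_receive(volatile uint8_t *regs)
{
    assert(regs != NULL);
    while (regs[REG_RX_STATUS] == 0) {
        /* poll */
    }
    return regs[REG_RX_BUFFER];
}
```

The same loops written in assembler would be a load, a compare and a branch followed by a store or load, against the same four addresses.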
Real world hardware, such as the one I linked to, will typically use interrupts, so you don't need to poll. It will usually have DMA, so the hardware can access your data directly rather than you feeding it byte by byte. It will handle various protocols and will have registers for checking and setting various aspects of this protocol.
In a modern OS the actual interaction with the hardware is implemented in a device driver and user software can exchange data with the device driver through some API. Again, this user code may be written in assembler or any other language. The API will vary depending on the OS. Communication/networking is generally built as a "stack" with higher level protocols implemented over the lower level ones. Which part of this stack is in a user library or part of the OS will vary between different operating systems.
For the hypothetical device I described above the API may consist of two single byte blocking calls, read() and write(). You then use some sort of system call mechanism from either assembler or a higher level language to call these and pass parameters/retrieve the output. In some operating systems device I/O may look like file I/O so you would use the generic file read/write to perform operations on the device and the OS will dispatch those to the right device driver. Furthermore, in a typical OS the actual system call will be available through some sort of library, which again you may call from various programming languages.
There are two pieces of code for doing networking in assembly - the kernel code used by the operating system to actually do the networking, and client code that wants to tell the OS what data to send over the network.
Typically, the hardware in a machine has certain memory addresses dedicated to communicating with the network hardware. The machine code for the OS can then write the appropriate values into this memory to control the hardware that ends up sending and receiving bytes. These memory addresses would be hardcoded into the machine code.
In the case of user code that does networking (say, Mozilla Firefox), the process is different. There is typically a machine instruction or set of instructions that user code uses to tell the operating system to perform some task (in MIPS, for example, this is syscall; 32-bit x86 has traditionally used the int instruction, e.g. int 0x80 on Linux, and x86-64 has its own syscall instruction). Client code would work by setting up some buffers with the appropriate data to send to the network, then would use one of the assembly instructions above to tell the OS that it should send the data. The hardware then invokes the OS, which reads the user data and then uses its own machine code (described above) to actually control the network device appropriately. In this way, the OS can guard direct access to the network device by blocking access to the physical addresses controlling the device and moderating access through system calls. It also means that you don't need to know any memory addresses when writing user code to do networking. The OS handles these details, and all you need to know about is what instruction to execute to trigger the system call.
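A small Linux-specific sketch of the user-code side: the C library's `syscall()` wrapper loads the system call number and arguments and executes the trap instruction for you. The helper name `raw_write` is hypothetical, and a plain `write` to a file descriptor is used here as a stand-in for sending on a socket, which goes through the same mechanism.

```c
#define _GNU_SOURCE          /* for the syscall() declaration in glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel to write 'len' bytes from 'buf' to 'fd'.
 * No device addresses appear anywhere: user code only names the
 * system call and passes a buffer. Returns bytes written, or -1. */
long raw_write(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return syscall(SYS_write, fd, buf, len);
}
```

For example, `raw_write(1, "hi\n", 3)` asks the kernel to write three bytes to standard output; which driver and which hardware registers end up being touched is entirely the kernel's business.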
Hope this helps!

Virtual Memory allocation without Physical Memory allocation

I'm working on a Linux kernel project and I need to find a way to allocate virtual memory without allocating physical memory. For example, if I use this:
char* buffer = my_virtual_mem_malloc(sizeof(char) * 512);
my_virtual_mem_malloc is a new syscall implemented by my kernel module. All data written to this buffer is stored in a file or on another server via a socket (not in physical memory). To do this, I need to request virtual memory and get access to the vm_area_struct structure to redefine its vm_ops struct.
Do you have any ideas about this?
This is not architecturally possible. You can create vm areas that have a writeback routine that copies data somewhere, but at some level, you must allocate physical pages to be written to.
If you're okay with that, you can simply write a FUSE driver, mount it somewhere, and mmap a file from it. If you're not, then you'll have to just write(), because redirecting writes without allocating a physical page at all is not supported by the x86, at the very least.
There are a few approaches to this problem, but most of them require you to first write to an intermediate memory.
Network File System (NFS)
The easiest approach is simply to have the server export a shared file system such as NFS and to use mmap() to map a remote file to a memory address. Writing to that address then actually writes to the OS's page cache, which is eventually written back to the remote file when the page cache is full or after a predefined system timeout.
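A minimal sketch of the mmap() part, assuming a POSIX system. The helper name `write_through_mapping` is hypothetical; `path` would be a file on the NFS mount in the scenario above, but any writable file behaves the same way from the program's point of view.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Store 'len' bytes of 'msg' at the start of the file at 'path' by writing
 * to a shared memory mapping instead of calling write().
 * Returns 0 on success, -1 on error. */
int write_through_mapping(const char *path, const char *msg, size_t len)
{
    assert(path != NULL && msg != NULL && len > 0);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {   /* the file must cover the mapping */
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, msg, len);                  /* lands in the page cache first */
    int rc = msync(map, len, MS_SYNC);      /* force writeback now, not later */

    munmap(map, len);
    close(fd);
    return rc;
}
```

Without the `msync()` call the data would still reach the file, but only when the kernel decides to write the dirty pages back. Note that physical pages are still allocated for the mapping, as the first answer points out.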
Distributed Shared Memory (DSM)
An alternative approach is using DSM with a very small cache size.
In computer science, distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as one logically shared address space.
[...] Software DSM systems can be implemented in an operating system, or as a programming library and can be thought of as extensions of the underlying virtual memory architecture. When implemented in the operating system, such systems are transparent to the developer; which means that the underlying distributed memory is completely hidden from the users.
This means that each virtual address is logically mapped to a virtual address on a remote machine, and writing to it does the following: (a) receives the page from the remote machine and gains exclusive access; (b) updates the page data; (c) releases the page and sends it back to the remote machine when that machine reads it again.
In a typical DSM implementation, (c) only happens when the remote machine reads the data again, but you could start from an existing DSM implementation and change the behavior so that the data is sent once the local machine's page cache is full.
[...] the IOMMU maps device-visible virtual addresses (also called device addresses or I/O addresses in this context) to physical addresses.
This basically means writing directly to the network device's buffer, which amounts to implementing an alternative driver for that device.
This approach seems the most complicated, and I don't see any benefit from it.
It does avoid intermediate memory, but it is definitely not recommended unless the system has demanding realtime requirements.
