What are the hardware Rx/Tx queues in an Ethernet controller?

I have a very basic question regarding the Rx/Tx hardware queues in an Ethernet controller: what are they used for?
While looking at the following driver in the Linux kernel, it seems like they are used to carry DMA descriptors?
https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/broadcom/genet/bcmgenet.c#L2276

You are correct: the rx/tx queues contain DMA descriptors for incoming and outgoing packets.
If you are curious how network drivers work, I recommend looking at the ixy userspace network driver: https://github.com/emmericp/ixy
The code is relatively simple and very well commented, and there is a paper which explains how it works: https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/ixy-writing-user-space-network-drivers.pdf
See Section 4.1 (NIC Ring API) in the paper for an explanation of the receive (rx) and transmit (tx) queues:
NICs expose multiple circular buffers called queues or rings to transfer packets. The simplest setup uses only one receive and one transmit queue. Multiple transmit queues are merged on the NIC, incoming traffic is split according to filters or a hashing algorithm if multiple receive queues are configured. Both receive and transmit rings work in a similar way: the driver programs a physical base address and the size of the ring. It then fills the memory area with DMA descriptors, i.e., pointers to physical addresses where the packet data is stored with some metadata. Sending and receiving packets is done by passing ownership of the DMA descriptors between driver and hardware via a head and a tail pointer. The driver controls the tail, the hardware the head. Both pointers are stored in device registers accessible via MMIO.
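To make that concrete, here is a minimal sketch in C of what a ring looks like from the driver's point of view. The field names and layout below are illustrative assumptions; real descriptor formats are device-specific and defined in each NIC's datasheet:

```c
#include <stdint.h>

/* One DMA descriptor: a pointer to a packet buffer, plus metadata.
 * The exact bit layout here is made up for illustration. */
struct dma_descriptor {
    uint64_t buf_addr; /* physical address of the packet buffer */
    uint16_t length;   /* tx: bytes to send; rx: bytes received */
    uint16_t status;   /* e.g. a "done" bit the hardware sets */
};

/* A queue/ring is an array of descriptors plus head/tail indices.
 * The driver programs the ring's physical base address and size into
 * device registers; head and tail are themselves device registers
 * read/written via MMIO, as described in the quoted paper. */
struct nic_ring {
    struct dma_descriptor *descs; /* physically contiguous, DMA-able */
    uint32_t num_descs;
    uint32_t head; /* advanced by the hardware */
    uint32_t tail; /* advanced by the driver */
};
```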

Related

Linux Networking Driver With DMA Bus Mastering

I am currently working on writing my first Linux Networking driver and it seems to be going fairly smoothly right now. My network device is going to create an Ethernet interface but forward the Ethernet frames over PCIe to a PCIe endpoint. My question has to do with reception of forwarded Ethernet frames from the PCIe endpoint destined for my interface.
What I would normally do would be to allocate a large DMA buffer, tell the endpoint with bus-mastering capabilities where the buffer is, and allow it to DMA to that buffer. It would then send an interrupt to signal reception, and I could copy the data into an sk_buff.
My question is this:
In LDD3, it says that I should be able to DMA directly to the sk_buff, because all sk_buffs are in DMA-capable memory. When and where do I allocate this buffer and tell the bus master where it is located? Do I do it on initialization first, and then, once the bus master has written its first sk_buff and interrupted me to signal reception, allocate a new buffer and write the new location? Can this only be done with polled reception (I think it's called NAPI)?
Thanks in advance for your help.
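To make the question concrete, the receive path I am imagining looks roughly like this; struct mydev, give_rx_buffer_to_device, and RX_BUF_LEN are placeholders for my driver's own state, register writes, and sizing:

```c
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>
#include <linux/netdevice.h>

#define RX_BUF_LEN 2048 /* placeholder buffer size */

/* Called once at init, and again after every received frame, so the
 * bus master always has a fresh sk_buff to DMA into. */
static int mydev_refill_rx(struct mydev *priv)
{
    struct sk_buff *skb = netdev_alloc_skb(priv->netdev, RX_BUF_LEN);
    dma_addr_t mapping;

    if (!skb)
        return -ENOMEM;

    /* Map the sk_buff's data area for device-to-memory DMA. */
    mapping = dma_map_single(priv->dev, skb->data, RX_BUF_LEN,
                             DMA_FROM_DEVICE);
    if (dma_mapping_error(priv->dev, mapping)) {
        dev_kfree_skb(skb);
        return -ENOMEM;
    }

    priv->rx_skb = skb;
    priv->rx_mapping = mapping;
    /* Tell the bus master where to write (device-specific register write). */
    give_rx_buffer_to_device(priv, mapping, RX_BUF_LEN);
    return 0;
}
```

The interrupt (or NAPI poll) handler would then dma_unmap_single() the filled buffer, hand the sk_buff up the stack, and call this again to post a replacement buffer. Is that the right shape?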

How to implement an Ethernet modem

Okay, what I want to do, as a training exercise, is to implement something like this:
client --ethernet--> Modem1 --GPIO--> Modem2 --ethernet--> My Home Router
where the client connects to Modem1 using an Ethernet cable.
Modem1 is a Raspberry Pi, converting the signal and relaying it via the GPIO.
Modem2 is a Raspberry Pi; it receives the data from the GPIO and sends it via an Ethernet cable to my home router.
I want to implement the Modems, but have little idea where to start.
I have read up a little on Ethernet programming, but still can't find answers to the "simple stuff", like:
How do I implement Modem1 so that when it's connected to the client, the client discovers it as an internet connection?
On the Modem2 end, how do I make "My Home Router" send packets meant for the "client" to Modem2, so that Modem2 may forward them?
...and possibly things I haven't thought of.
So, how, concretely, can I implement this? Preferably in C.
I'd venture to say you might be able to write some sort of custom GPIO intermediate layer.
Read Ethernet->Encapsulate->Write GPIO->|->Read GPIO->Decapsulate->Write Ethernet
(and vice versa)
The problem then becomes: How can both modems act as "Ethernet proxies"?
Modem1 acts as a proxy for the router. Modem2 acts as a proxy for the client. If your Raspberry Pi can spoof MAC addresses, you might be able to fool the Ethernet peers into communicating with your modems' Ethernet ports. The reason you need to spoof MAC addresses is that in TCP/IP networking there is the ARP table, which maps remote IP addresses to the MAC addresses that can route IP packets to/from them. This is what allows your client to communicate with your router over TCP/IP.
Another potential pitfall is that your modem communication introduces delays that interfere with the Ethernet layer's handling of the protocol. For example, the Ethernet protocol may have real-time constraints that could be violated if you introduce delays...
But let's assume anything is possible in a perfect world...
You'll need to write code for reading/writing Ethernet frames (there is open source code for reading/writing Ethernet packets over raw sockets in Linux; see the raw-socket sketch near the end of this answer).
You'll need to write a custom driver for your GPIO comms.
This means implementing a carefully thought-out protocol to manage pin state, start-of-message, end-of-message, data payload, checksum, whatever... (a framing sketch follows the process list below).
Finally, you'll need to write a top-level communications layer that implements:
Ethernet-to-GPIO process:
a) read from the Ethernet port, encapsulating the Ethernet packet into a custom message (or message fragments)
b) communicate this custom message, using your custom GPIO protocol driver, to the external GPIO peer
GPIO-to-Ethernet process:
a) read from GPIO, using your custom driver code
b) decapsulate the Ethernet packet
c) write the Ethernet packet to the Ethernet port
These two processes run forever...
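To illustrate the kind of framing such a GPIO protocol needs, here is a minimal sketch in C; the start-of-frame marker, 16-bit length field, and XOR checksum are arbitrary choices for illustration, not a recommendation:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAME_SOF 0x7E /* arbitrary start-of-frame marker */

/* Encapsulate an Ethernet frame into a custom GPIO message:
 * [SOF][len hi][len lo][payload ...][checksum].
 * Returns the encoded length, or 0 if the output buffer is too small. */
static size_t gpio_encapsulate(const uint8_t *eth_frame, size_t len,
                               uint8_t *out, size_t out_cap)
{
    uint8_t csum = 0;

    if (len > 0xFFFF || out_cap < len + 4)
        return 0;

    out[0] = FRAME_SOF;
    out[1] = (uint8_t)(len >> 8);
    out[2] = (uint8_t)(len & 0xFF);
    for (size_t i = 0; i < len; i++) {
        out[3 + i] = eth_frame[i];
        csum ^= eth_frame[i]; /* XOR checksum: illustration only */
    }
    out[3 + len] = csum;
    return len + 4;
}
```

The decapsulation side does the reverse: wait for SOF, read the length, collect the payload, and verify the checksum before writing the frame back out the Ethernet port.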
Again, it all hinges on whether or not your modems can insert themselves into a peer-to-peer connection without disturbing the natural flow of the Ethernet protocol...
As for the 'C' part...
If you use open source libraries (or code snippets) for reading/writing raw Ethernet via raw sockets, those are most likely written in C.
Your GPIO code will read/write the GPIO pins in one of two ways: from a memory-mapped hardware address, or using I/O port calls on that hardware address.
Receive raw Ethernet frames in Linux
Send a raw Ethernet frame in Linux
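For the raw-socket side, here is a minimal receive sketch on Linux (requires root; "eth0" is a placeholder interface name):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
    /* AF_PACKET + ETH_P_ALL delivers whole frames, headers included. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    /* Bind to one interface so we only see the client-facing port. */
    struct sockaddr_ll sll = {0};
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex("eth0"); /* placeholder name */
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        perror("bind"); return 1;
    }

    unsigned char frame[ETH_FRAME_LEN];
    for (;;) {
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n < 0) { perror("recv"); break; }
        printf("got %zd-byte frame\n", n);
        /* here: encapsulate and ship the frame over your GPIO link */
    }
    close(fd);
    return 0;
}
```

Writing a frame back out is the mirror image: send() on a socket bound the same way.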
Good luck

Force UDP broadcast via the network (disable loopback)

I want to send a UDP broadcast datagram to multiple devices on the network, including the sender device itself. The goal is to have all devices receive the data at the EXACT same time (well, +/- 5ms is OK).
The problem is that the network interface on the sending device loops the data back, so it is received immediately (in contrast to the other devices, where network latency comes into play - quite a bit for Wi-Fi, for instance).
Any idea how I can stop my network interface from looping the data back directly?
Another idea I had: Is it possible to create a virtual network interface to send the broadcast packet and listen on another interface which only receives it via the network?
I am trying to do that in C on a Linux machine. Any help would be greatly appreciated!
UDP datagrams are sent as IP payload, and the routing of IP packets is the domain of the IP stack: it decides how a packet is transferred to its destination. When your IP stack detects that the destination is the local host, it enqueues the packet directly in the receive queue, so the packet is available immediately. If your adapter's send queues are filled, you will additionally see a delay on the network side. So you can't achieve synchronization with this concept.
If you need tight synchronization, you should use NTP or SNTP to synchronize the clocks, and define a common start time for your desired common operation.
Edit:
The (S)NTP protocol is designed to synchronize at the millisecond level. It will give you a precision that you can't achieve with any transmission of UDP packets, for the reasons described above.
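A sketch of the suggested approach: with clocks already (S)NTP-synchronized, every device sleeps until an agreed absolute wall-clock instant and then acts. How start_time_ns is distributed is up to you (it could, for example, ride in the broadcast payload itself); it is a placeholder here:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Sleep until an absolute CLOCK_REALTIME timestamp, then act.
 * Assumes all participating devices' clocks are NTP-synchronized. */
static void run_at(int64_t start_time_ns)
{
    struct timespec when = {
        .tv_sec  = start_time_ns / 1000000000LL,
        .tv_nsec = start_time_ns % 1000000000LL,
    };

    /* TIMER_ABSTIME: wake at the given wall-clock instant; if a signal
     * interrupts the sleep, retry with the same absolute time. */
    while (clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &when, NULL) == EINTR)
        ;

    puts("common operation starts now");
}
```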

Low latency packet processing with shared memory on Linux?

If I were to receive UDP packets on Linux (and I didn't mind changing some of the source code), what would be the fastest way for my application to read the packets?
Would I want to modify the network stack so that once a UDP packet is received it is written to shared memory and have the application access that memory?
Would there be any way for the stack to notify the application to react, rather than have the application continuously poll the shared memory?
Any advice/further resources are welcome - I have only seen:
http://www.kegel.com/c10k.html
If latency is a problem and the default UDP network stack does not perform as you wish, then try different existing (installable) network stacks.
For example, try UDP-Lite. Compared to the standard UDP stack, this particular stack can skip checksumming the datagram payload, reducing latency at the cost of delivering possibly corrupted datagrams to the application layer.
Side note: you do not need a "polling" mechanism. Read the manual of select (and its derivatives like pselect or ppoll); with such an API, the kernel will "wake up" your application as soon as it has something to read or write in the pipeline.
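For illustration, a minimal sketch of that wake-up pattern with select() on a UDP socket (port 5000 is an arbitrary choice):

```c
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5000); /* arbitrary port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    char buf[2048];
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        /* Blocks here: the kernel wakes us only when data is ready,
         * so there is no busy polling. */
        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0) {
            perror("select"); break;
        }
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0)
            printf("received %zd bytes\n", n);
    }
    return 0;
}
```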

Any benefit in sockets from multiple cores? (Linux)

I can't find the answer to the following question:
Is there any benefit/boost to sockets in general on a multi-core machine? I mean, is there perhaps some kind of shared access to the queue of packets coming into the kernel from the Ethernet card driver, or something similar?
I understand that, at the level of API calls, multiple threads can work with one socket instance, but it is up to the programmer to synchronize and use the read/write/close/select etc. calls correctly. So at that level I see a benefit only in working with dispatched packets, post-processing, and so on... Or is there no speed boost until the packet is copied during the system call and transferred to user space?
The benefit of multiple cores depends on how much concurrency your algorithm can achieve. Take Ethernet receiving as an example; there are three tasks involved:
On receiving a packet, the NIC hardware triggers an interrupt and a CPU handles the interrupt in interrupt context.
The network stack handles the RX packet via the softirq mechanism. Softirq requests can run concurrently on multiple CPUs. In its RX function, the network stack passes the network buffer to the socket and wakes up user threads pending on that socket.
The user thread wakes up and continues with application code to receive or process the received network packets.
Task 1) can only run on one CPU at a time, 2) can run on multiple CPUs, and 3) can also run on multiple CPUs for multi-process or multi-threaded applications.
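As one concrete way to exploit 3) (my addition, not something the kernel does automatically for a single socket): with SO_REUSEPORT, available since Linux 3.9, each thread opens its own socket bound to the same port, and the kernel load-balances incoming packets across them. A minimal UDP sketch:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Each worker owns a socket on the same port; with SO_REUSEPORT the
 * kernel spreads incoming datagrams across them, so receive processing
 * can scale over multiple cores. */
static void *worker(void *arg)
{
    (void)arg;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5000); /* arbitrary port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return NULL;
    }

    char buf[2048];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) { /* process this worker's share of the datagrams */ }
    }
}

int main(void)
{
    pthread_t t[4]; /* e.g. one worker per core */
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```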
Following on from Greg Inozemtsev's comment: NICs with multiple receive queues can filter incoming traffic into different queues. Each queue can be assigned its own interrupt, dispatched to a unique CPU core, to signal incoming packets on that queue.
Linux supports various techniques in the Kernel such as:
RSS: Receive Side Scaling
RPS: Receive Packet Steering
RFS: Receive Flow Steering
aRFS: Accelerated Receive Flow Steering
XPS: Transmit Packet Steering
Let's say all the packets are destined for your machine's IP on port 80, and you have used socket() and listen() to create a socket on port 80. Traffic is coming into your NIC from a variety of source IPs, so it's being hashed into multiple receive queues (meaning the hardware interrupts are spread across multiple CPU cores, thanks to RSS). You can use the native kernel socket option PACKET_FANOUT to spread the load across multiple worker threads within your application, as sketched below.
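A minimal sketch of PACKET_FANOUT usage: each worker thread opens its own AF_PACKET socket and joins the same fanout group, and the kernel hashes incoming packets across the group's members (the group id 42 and the thread count are arbitrary):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

#define FANOUT_GROUP_ID 42 /* arbitrary id, shared by all workers */

static void *fanout_worker(void *arg)
{
    (void)arg;
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return NULL; }

    /* Join the fanout group; PACKET_FANOUT_HASH picks per-flow hashing,
     * so packets of one flow always reach the same worker. */
    int fanout_arg = FANOUT_GROUP_ID | (PACKET_FANOUT_HASH << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                   &fanout_arg, sizeof(fanout_arg)) < 0) {
        perror("setsockopt(PACKET_FANOUT)");
        return NULL;
    }

    char frame[ETH_FRAME_LEN];
    for (;;) {
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n > 0) { /* process this worker's share of the traffic */ }
    }
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, fanout_worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```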
If you look at 3rd-party libraries such as netmap, DPDK and VPP, just as examples, these can all be used to scale up even further by taking you down to zero-copy RX/TX, with the caveat that you need to write some of the network protocol code yourself, depending on which library you use.
There are many, many things to consider here, far too much to cover in one SO question. In answer to your original question: yes. I have tried to provide a bit of extra information as well.
For reading on RSS/RPS/RFS etc:
https://blog.cloudflare.com/how-to-receive-a-million-packets/
For further reading related to native PACKET_FANOUT and PACKET_MMAP:
http://kukuruku.co/hub/nix/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-libraries
http://yusufonlinux.blogspot.co.uk/2010/11/data-link-access-and-zero-copy.html?m=1
https://www.kernel.org/doc/Documentation/networking/packet_mmap.txt
Further reading related to netmap, as an example of a 3rd-party library:
https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/
Further reading on NUMA and affinity:
https://null.53bits.co.uk/index.php?page=numa-and-queue-affinity
https://blog.cloudflare.com/how-to-achieve-low-latency/
