Scheduling NAPI poll to execute at regular time intervals - c

I have gone through multiple posts (on and outside Stack Overflow) regarding this topic. Currently, I am working on modifying the i40e-2.0.30 driver for the Intel X710 NIC.
Thanks to this illustrated blog post (https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/), understanding the driver code became a lot easier.
My post is particularly concerned with the NAPI poll mechanism. I understand that the NAPI poll function is triggered when a packet arrives, and that if the amount of work done while receiving packets exceeds the allocated budget, NAPI polling continues; otherwise polling stops.
Based on this, I modified my driver to keep polling for 10,000 poll cycles if a particular signature of data arrives on a particular queue (using the flow director), e.g. UDP packets on port XXX. But I am trying to eliminate the possibility of interrupts as much as possible.
So here is my main question: will I be able to schedule the NAPI poll to execute at a certain point in time? For example, I want the NAPI poll to run every 500 ms and perhaps last for 20 ms.
For instance, if I expect my packet at time T ms, I might start polling at time (T - 10) ms and stop polling at (T + 10) ms. This way, I might be able to reduce the use of interrupts. Right now, I have been resetting the interrupts every 10,000 poll cycles.
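To make the idea concrete, here is a rough sketch of the kind of mechanism I have in mind: a kernel hrtimer whose callback kicks NAPI by hand every 500 ms instead of waiting for an RX interrupt. The context structure and all names below are purely illustrative (not taken from the i40e driver), and I am not sure this is the right or safe way to do it.

#include <linux/kernel.h>
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/netdevice.h>

/* Hypothetical per-queue context; only the napi pointer matters here. */
struct rx_poll_ctx {
    struct hrtimer      poll_timer;
    struct napi_struct *napi;
};

static enum hrtimer_restart rx_poll_window_open(struct hrtimer *t)
{
    struct rx_poll_ctx *ctx = container_of(t, struct rx_poll_ctx, poll_timer);

    /* Schedule the NAPI poll by hand; the poll routine then keeps itself
     * scheduled (by not completing NAPI) for as long as the poll window
     * should stay open, e.g. ~20 ms. */
    napi_schedule(ctx->napi);

    /* Re-arm so the next poll window opens 500 ms from now. */
    hrtimer_forward_now(t, ms_to_ktime(500));
    return HRTIMER_RESTART;
}

static void rx_poll_window_setup(struct rx_poll_ctx *ctx)
{
    hrtimer_init(&ctx->poll_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    ctx->poll_timer.function = rx_poll_window_open;
    hrtimer_start(&ctx->poll_timer, ms_to_ktime(500), HRTIMER_MODE_REL);
}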
Any explanation or reference on this would be really helpful.
Thanks,
Kushal.

Related

lwip to send data bigger than 64kb

I'm trying to send RT data (4 bytes per sample, 10 channels, sampled at 100 kHz) over lwIP.
I've understood that lwIP has a timer which loops every 250 ms and that it cannot be changed.
So I'm saving the RT data to RAM at 100 kHz and sending the sampled data over TCP every 250 ms.
The problem is that I cannot go over 65535 bytes every 250 ms, because I get MEM_ERR.
I already increased the buffer up to 65535, but when I try to increase it further I get several errors during compilation.
So my question is: can lwIP manage buffers bigger than a 16-bit size allows?
Thanks,
Marco
Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or which OS you're using (e.g. FreeRTOS, linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second.
From your lwip description, you have 65536 byte packets * 4 packets/sec --> 256,000 bytes / second.
This is too slow. And it's much slower than what a typical TCP / Ethernet link can handle, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250ms interval for lwip can't be changed. I believe that it can.
From: https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf we have the section: Creating an lwIP Application Using the RAW API:
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250ms may just be a timer interrupt that retires packet descriptors of packets completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, then only 4 packets / second could be sent and that would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second, but it will process any packets it sees that have completed. So, the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
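To make the "queue more per tick" point concrete: with lwIP's raw TCP API you are not limited to one tcp_write() per timer period; you can keep writing while the send buffer has room and then flush. A rough sketch, assuming the raw API (tcp_write, tcp_sndbuf, tcp_output) and a caller-maintained buffer of samples; treat it as a shape rather than a recipe:

#include "lwip/tcp.h"

/* Queue as much pending sample data as the TCP send buffer will accept.
 * 'pcb' is the connected tcp_pcb; 'data'/'len' are the samples collected
 * since the last call. Returns the number of bytes actually queued. */
static u16_t send_samples(struct tcp_pcb *pcb, const u8_t *data, u16_t len)
{
    u16_t queued = 0;

    while (queued < len) {
        u16_t room    = tcp_sndbuf(pcb);        /* free space in the send buffer */
        u16_t pending = (u16_t)(len - queued);
        u16_t chunk   = (room < pending) ? room : pending;

        if (chunk == 0)
            break;                              /* buffer full; retry next tick */

        if (tcp_write(pcb, data + queued, chunk, TCP_WRITE_FLAG_COPY) != ERR_OK)
            break;

        queued += chunk;
    }

    tcp_output(pcb);                            /* ask lwIP to transmit now */
    return queued;
}

Leftover bytes simply stay in the application buffer and are offered again on the next call.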
UPDATE:
Thanks for the reply; indeed it is 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second, but I agree that we are quite far from the 100 Mbit/s.
Caveat: I don't know lwip, so most of what I'm saying is based on my experience with other network stacks (e.g. linux), but it appears that lwip should be similar.
Of course, lwip will provide a way to achieve full line speed.
Regarding changing the 250 ms timer period: to achieve what I want it would have to be lowered by more than a factor of 10, which seems too much and could compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In that case I can send the remaining data in the ACK callback, if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer-based callback is there: to accumulate finished request blocks and [quickly] loop over the queue/chain, completing/reclaiming them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
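As a rough illustration of that request-block idea (not tied to lwIP or any particular NIC; the nic_*() and release() calls are placeholders for the real hardware hooks):

#include <stddef.h>

/* Illustrative transmit request block and queue; not a real driver API. */
struct tx_req {
    const void    *buf;
    size_t         len;
    struct tx_req *next;
};

static struct tx_req *txq_head, *txq_tail;

/* Called by the application: append a request and return immediately. */
static void tx_submit(struct tx_req *req)
{
    req->next = NULL;
    if (txq_tail)
        txq_tail->next = req;
    else
        txq_head = req;                 /* queue was empty */
    txq_tail = req;

    if (!nic_busy())
        nic_start(txq_head);            /* kick the hardware if it was idle */
}

/* Called from the (timer or NIC) interrupt: retire everything finished. */
static void tx_reclaim(void)
{
    while (txq_head && nic_done(txq_head)) {
        struct tx_req *done = txq_head;
        txq_head = done->next;
        if (!txq_head)
            txq_tail = NULL;
        release(done);                  /* hand the buffer back to the caller */
    }
}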
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.

Need suggestion for handling large number of timers/timeouts

I am working on redesigning existing L2TP (Layer 2 Tunneling Protocol) code.
For L2TP, the number of tunnels we support is 96K. The L2TP protocol has a keep-alive mechanism where it needs to send HELLO messages.
Say we have 96,000 tunnels for which L2TPd needs to send a HELLO message after a configured timeout value; what is the best way to implement it?
Right now, we have a timer thread which iterates every 1 second and sends the HELLO messages. This is an old design which is not scaling any more.
Please suggest me a design to handle large number of timers.
There are a couple of ways to implement timers:
1) select: this system call allows you to wait on a file descriptor and then wake up. You can wait on a file descriptor that never becomes ready and simply use the timeout as your timer.
2) POSIX condition variables: similar to select, they have a timeout mechanism built in.
3) If you are using UNIX, you can set up a UNIX signal to wake you up.
Those are the basic ideas. You can see how well they scale to multiple timers; I would guess you'd need multiple condvars/selects, one for each handful of timers.
Depending on the behaviour you want, you would probably want a thread for every 100 timers or so, and use one of the mechanisms above to wake up when the earliest of them expires. Each thread would sit in a loop, keeping track of its 100 timeouts and waking up accordingly. Once you exceed 100 timers, you would simply create a new thread and have it manage the next 100 timers, and so on. I don't know if 100 is the right granularity, but it's something you'd play with.
Hopefully that's what you are looking for.
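As a sketch of one such worker thread, assuming each thread owns a small set of absolute expiry times and a condition variable it is signalled on when a timer is added (earliest_expiry() and fire_expired_timers() are placeholder helpers, not a real API):

#include <errno.h>
#include <pthread.h>
#include <time.h>

#define TIMERS_PER_THREAD 100

struct timer_group {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    struct timespec expiry[TIMERS_PER_THREAD];  /* absolute CLOCK_REALTIME times */
    int             count;
};

/* Each worker waits until its earliest timer expires or a new one is added. */
static void *timer_worker(void *arg)
{
    struct timer_group *g = arg;

    pthread_mutex_lock(&g->lock);
    while (g->count > 0) {
        /* earliest expiry among the timers this thread owns */
        struct timespec next = earliest_expiry(g);

        /* sleeps until 'next', or returns early if the condvar is signalled */
        int rc = pthread_cond_timedwait(&g->cond, &g->lock, &next);
        if (rc == ETIMEDOUT)
            fire_expired_timers(g);
        /* otherwise a timer was added or removed; loop and recompute 'next' */
    }
    pthread_mutex_unlock(&g->lock);
    return NULL;
}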
Typically, such requirements are met with a delta-queue. When a timeout is required, get the system tick count and add the timeout interval to it. This gives the Timeout Expiry Tick Count (TETC). Insert the socket object into a queue that is sorted by increasing TETC and have the thread wait for the TETC of the item at the head of the queue.
Typically, with socket timeouts, queue insertion is cheap because there are many timeouts with the same interval, so new timeout insertion will normally take place at the queue tail.
Management of the queue, (actually, since insertion into the sorted queue could take place anywhere, it's more like a list than a queue, but whatever:), is best kept to one timeout thread that is normally performing a timed wait on a condvar or semaphore for the lowest TETC. New timeout-objects can then be queued to the thread on a thread-safe concurrent queue and signaled to the timeout-handler thread by the sema/condvar.
When the timeout thread becomes ready on TETC timeout, it could call some 'OnTimeout' method of the object itself, or it might put the timed-out object onto a threadpool input queue.
Such a delta-queue is much more efficient for handling large numbers of timeouts than any polling scheme, especially for requirements with longish intervals. No polling is required, no CPU/memory bandwidth wasted on continual iterations and the typical latency is going to be a system clock-tick or two.
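A minimal sketch of such a queue, kept deliberately simple (a singly linked list sorted by absolute expiry tick; the thread-safe hand-off from other threads to the timeout thread is omitted, and the names are illustrative):

/* One pending timeout: expiry expressed as an absolute tick count (TETC). */
struct timeout_item {
    unsigned long         tetc;      /* timeout expiry tick count */
    void                (*on_timeout)(struct timeout_item *);
    struct timeout_item  *next;
};

static struct timeout_item *timeout_list;    /* sorted, earliest TETC first */

/* Insert sorted by TETC. With many timeouts sharing the same interval, new
 * items usually land at (or near) the tail, so insertion is cheap in practice. */
static void timeout_insert(struct timeout_item *item)
{
    struct timeout_item **pp = &timeout_list;
    while (*pp && (*pp)->tetc <= item->tetc)
        pp = &(*pp)->next;
    item->next = *pp;
    *pp = item;
}

/* Run by the timeout thread when it wakes up: fire everything whose TETC has
 * passed, then go back to waiting until the new head's TETC. */
static void timeout_expire(unsigned long now_tick)
{
    while (timeout_list && timeout_list->tetc <= now_tick) {
        struct timeout_item *it = timeout_list;
        timeout_list = it->next;
        it->on_timeout(it);              /* or push onto a threadpool input queue */
    }
}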
It depends on the processor/OS, kernel version and architecture.
On Linux, one option is to use the kernel's timer functionality for multiple timers. A timer is added using add_timer. You define it with a timer_list and initialize the timer's internal values using init_timer.
Then register it with add_timer after filling in the timer_list appropriately for the respective timer: the timeout (expires), the function to execute after the timeout (function), and the parameter to that function (data). When jiffies becomes greater than or equal to the timeout (expires), the respective timer handler (function) is triggered.
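A minimal sketch using that older init_timer/add_timer interface, re-arming a one-second HELLO timer (the actual HELLO transmission and per-tunnel bookkeeping are left out, and newer kernels replace this API with timer_setup()):

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list hello_timer;

/* Runs (in softirq context) once 'expires' has been reached. */
static void hello_timeout(unsigned long data)
{
    /* ... send the HELLO for the tunnel identified by 'data' ... */

    mod_timer(&hello_timer, jiffies + HZ);   /* re-arm for ~1 s from now */
}

static void hello_timer_setup(unsigned long tunnel_id)
{
    init_timer(&hello_timer);
    hello_timer.function = hello_timeout;
    hello_timer.data     = tunnel_id;
    hello_timer.expires  = jiffies + HZ;     /* ~1 s from now */
    add_timer(&hello_timer);
}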
Some processors have provisioning for timer wheels (consisting of a number of queues placed in equally spaced time slots), which can be configured for a wide range of timers/timeouts as the requirement dictates.

Lost in libpcap - how to use setnonblock() / should I use pcap_dispatch() or pcap_next_ex() for realtime?

I'm building a network sniffer that will run on a PFSense for monitoring an IPsec VPN. I'm compiling under FreeBSD 8.4.
I've chosen to use libpcap in C for the packet capture engine and Redis as the storage system.
There will be hundreds of packets per second to handle, the system will run all day long.
The goal is to have a webpage showing graphs of network activity, updated every minute, or every couple of seconds if that's possible.
When my sniffer captures a packet, it determines its size, who sent it (a geographical site, in our VPN context), to whom, and when. That information then needs to be stored in the database.
I've done a lot of research, but I'm a little lost with libpcap, specifically with the way I should capture packets.
1) What function should I use to retrieve packets? pcap_loop? pcap_dispatch? Or pcap_next_ex? According to what I read online, loop and dispatch block execution, so pcap_next_ex seems to be the solution.
2) When are we supposed to use pcap_setnonblock? I mean, with which capture function? pcap_loop? So if I use pcap_loop, the execution won't be blocked?
3) Is multi-threading the way to achieve this? One thread running all the time, capturing packets, analyzing them and storing some data in an array, and a second thread firing every minute to empty this array?
The more I think about it, the more lost I get, so please excuse me if I'm unclear, and don't hesitate to ask me for clarification.
Any help is welcome.
Edit :
I'm currently trying to implement a worker pool, with the callback function only putting a new job in the job queue. Any help still welcome. I will post more details later.
1) What function should I use to retrieve packets? pcap_loop? pcap_dispatch? Or pcap_next_ex? According to what I read online, loop and dispatch block execution, so pcap_next_ex seems to be the solution.
All of those functions block waiting for packets if there's no packet ready and you haven't put the pcap_t into non-blocking mode.
pcap_loop() contains a loop that runs indefinitely (or until pcap_breakloop() is called, an error occurs, or, if a count was specified, the count runs out). Thus, it may wait more than one time.
pcap_dispatch() blocks waiting for a batch of packets to arrive, if no packets are available, loops processing the batch, and then returns, so it only waits at most one time.
pcap_next() and pcap_next_ex() wait for a packet to be available and then return it.
2) When are we supposed to use pcap_setnonblock? I mean, with which capture function? pcap_loop? So if I use pcap_loop, the execution won't be blocked?
No, it won't be, but that means that a call to pcap_loop() might return without processing any packets; you'll have to call it again to process any packets that arrive later. Non-blocking and pcap_loop() isn't really useful; you might as well use pcap_dispatch() or pcap_next_ex().
3) Is multi-threading the way to achieve this? One thread running all the time, capturing packets, analyzing them and storing some data in an array, and a second thread firing every minute to empty this array?
(The array would presumably be a queue, with the first thread appending packet data to the end of the queue and the second thread removing packet data from the head of the queue and putting it into the database.)
That would be one possibility, although I'm not sure whether it should be timer-based. Given that many packet capture mechanisms - including BPF, as used by *BSD and OS X - deliver packets in batches, perhaps you should have one loop that does something like
for (;;) {
    pcap_dispatch(p, -1, callback, NULL);
    /* wake up dumper thread, e.g. pthread_cond_signal() on the queue's condvar */
}
(that's simplified - you might want to check for errors in the return value from pcap_dispatch()).
The callback would add the packet handed to it to the end of the queue.
The dumper thread would pull packets from the head of the queue and add them to the database.
This would not require that the pcap_t be non-blocking. On a single-CPU-core machine, it would rely on the threads being time-sliced by the scheduler.
I'm currently trying to implement a worker pool, with the callback function only putting a new job in the job queue.
Be aware that, once the callback function returns, neither the const struct pcap_pkthdr structure pointed to by its second argument nor the raw packet data pointed to by its third argument will be valid, so if you want to process them in the job, you will have to make copies of them - including a copy of all the packet data you will be processing in that job. You might want to consider doing some processing in the callback routine, even if it's only figuring out what packet data needs to be saved (e.g., IP source and destination address) and copying it, as that might be cheaper than copying the entire packet.
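A rough sketch of that copy-and-enqueue idea (the pkt_job struct and enqueue_job() are made up for the example; the point is copying the pcap_pkthdr and the captured bytes before the callback returns):

#include <pcap.h>
#include <stdlib.h>
#include <string.h>

/* One queued job: a private copy of the header and of the bytes we need. */
struct pkt_job {
    struct pcap_pkthdr hdr;
    unsigned char     *data;            /* copy of the captured packet bytes */
};

static void callback(u_char *user, const struct pcap_pkthdr *h,
                     const u_char *bytes)
{
    struct pkt_job *job = malloc(sizeof(*job));

    (void)user;
    if (job == NULL)
        return;                         /* drop the packet on allocation failure */

    job->hdr  = *h;                     /* copy the header */
    job->data = malloc(h->caplen);      /* copy only what was captured */
    if (job->data == NULL) {
        free(job);
        return;
    }
    memcpy(job->data, bytes, h->caplen);

    enqueue_job(job);                   /* placeholder: hand off to a worker */
}

(In practice, copying just the fields you need - addresses, ports, length, timestamp - is cheaper still, as noted above.)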

Scheduling routines in C and timing requirements

I'm working on a C program that transmits samples over USB3 for a set period of time (1-10 us), and then receives samples for 100-1000 us. I have a rudimentary pthread implementation where the TX and RX routines are each handled as a thread. The reason for this is that in order to test the actual TX routine, the RX needs to run and sample before the transmitter is activated.
Note that I have very little C experience outside of embedded applications and this is my first time dabbling with pthread.
My question is, since I know exactly how many samples I need to transmit and receive, how can I e.g. start the RX thread once the TX thread is done executing and vice versa? How can I ensure that the timing stays consistent? Sampling at 10 MHz causes some harsh timing requirements.
Thanks!
EDIT:
To provide a little more detail, my device is a bladeRF x40 SDR, and communication to the device is handled by a FX3 microcontroller, which occurs over a USB3 connection. I'm running Xubuntu 14.04. Processing, scheduling and configuration however is handled by a C program which runs on the PC.
You don't say anything about your platform, except that it supports pthreads.
So, assuming Linux, you're going to have to realize that in general Linux is not a real-time operating system, and what you're doing sure sounds as if it has real-time timing requirements.
There are real-time variants of Linux, I'm not sure how they'd suit your needs. You might also be able to achieve better performance by doing the work in a kernel driver, but then you won't have access to pthreads so you're going to have to be a bit more low-level.
Thought I'd post my solution.
While the next build of the bladeRF firmware and FPGA image will include the option to add metadata (timestamps) to the synchronous interface, until then there's no real way in which I can know at which time instants certain events occurred.
What I do know is my sampling rate, and exactly how many samples I need to transmit and receive at which times relative to each other. Therefore, by using condition variables (with pthread), I can signal my receiver to start receiving samples at the desired instant. Since TX and RX operations happen in a very specific sequence, I can calculate delays by counting samples and dividing by the sampling rate, which has proven to be 95-98% accurate.
This obviously means that since my TX and RX threads are running simultaneously, there are chunks of data within the received set of samples that will be useless, and I have another routine in place to discard those samples.
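For reference, the signalling itself looks roughly like this (a minimal sketch; the flag, mutex and condvar are named for illustration only, and real code also needs error handling and a shutdown path):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rx_go    = PTHREAD_COND_INITIALIZER;
static bool            rx_start = false;

/* TX thread: once its samples are queued, tell the receiver to start. */
static void signal_rx_start(void)
{
    pthread_mutex_lock(&lock);
    rx_start = true;
    pthread_cond_signal(&rx_go);
    pthread_mutex_unlock(&lock);
}

/* RX thread: block until the TX side says the desired instant has come. */
static void wait_for_rx_start(void)
{
    pthread_mutex_lock(&lock);
    while (!rx_start)                   /* guard against spurious wakeups */
        pthread_cond_wait(&rx_go, &lock);
    rx_start = false;
    pthread_mutex_unlock(&lock);
}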

Why are nanosleep() and usleep() too slow?

I have a program that generates packets to send to a receiver. I need an efficient method of introducing a small delay between the sending of each packet so as not to overrun the receiver. I've tried usleep() and nanosleep() but they seem to be too slow. I've implemented a busy wait loop and had more success, but it's not the most efficient method, I know. I'm interested in anyone's experiences in trying to do what I'm doing. Do others find usleep() and nanosleep() to function well for this type of application?
Thanks,
Danny Llewallyn
The behaviour of the sleep functions for very small intervals is heavily dependent on the kernel version and configuration.
If you have a "tickless" kernel (CONFIG_NO_HZ) and high resolution timers, then you can expect the sleeps to be quite close to what you ask for.
Otherwise, you'll generally end up sleeping at the granularity of the timer interrupt. The timer interrupt interval is configurable (CONFIG_HZ) - 10ms, 4ms, 3.3ms and 1ms are the common choices.
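If you do have high-resolution timers, one way to pace packets without accumulating drift is to sleep until an absolute deadline rather than for a relative interval. A sketch using POSIX clock_nanosleep() (the 100 µs gap and send_one_packet() are placeholders):

#include <time.h>

#define GAP_NS 100000L                   /* example: 100 us between packets */

static void paced_send_loop(int npackets)
{
    struct timespec next;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &next);

    for (i = 0; i < npackets; i++) {
        send_one_packet();               /* placeholder for the real send */

        /* advance the absolute deadline, then sleep until it */
        next.tv_nsec += GAP_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec  += 1;
        }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}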
Assuming that the higher level approaches other commenters have mentioned are not available to you, then a common approach in embedded/microcontroller land is to create a NOP-loop of the required length.
A NOP operation takes one CPU cycle, and in an embedded environment you typically know exactly what clock speed your processor is running at, so you can just use a simple for-loop containing _NOP(), or, if only a very short delay is required, don't bother with a loop and just add the required number of NOPs.
regTX = 0xFF; // Transmit FF on special register
// Wait three clock cycles
_NOP();
_NOP();
_NOP();
regTX = 0x00; // Transmit 00
This seems like a bad design. Ideally the receiver would queue any extra data it receives and then do its message processing in a separate thread. That way, it can handle bursts of data without relying on the sender to throttle its requests.
But perhaps such an approach is not practical if (for example) you do not have control of the receiver's code, or if this is an embedded application.
I can speak for Solaris here: it uses an OS timer to wake up sleep calls. By default the minimum wait time will be 10 ms, regardless of what you specify in your usleep. However, you can use the parameters hires_tick = 1 (1 ms wakeups) and hires_hz = in the /etc/system configuration file to increase the frequency of timer wake-up calls.
Instead of doing things at the packet level, where you need to worry about such things as overrunning the receiver, why not use a TCP stream to transmit the data? Let TCP handle things like flow-rate control and packet retransmission.
If you've already got a lot invested in the packetized approach, you can always use a layer on top of TCP to extract the original packets of data from the TCP stream and feed them into your existing functions.
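One common way to build such a layer is simple length-prefix framing: the sender writes a 4-byte length before each original packet, and the receiver reads the length and then exactly that many bytes. A sketch of the sending side (send_all() is a placeholder for a loop around send() that handles short writes):

#include <arpa/inet.h>                   /* htonl */
#include <stdint.h>

/* Frame one application packet onto the TCP stream: 4-byte length, then data. */
static int send_framed(int sock, const void *pkt, uint32_t len)
{
    uint32_t hdr = htonl(len);           /* length prefix in network byte order */

    if (send_all(sock, &hdr, sizeof(hdr)) < 0)
        return -1;
    return send_all(sock, pkt, len);
}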
