Calculating end-to-end delay for a multihop simulation in UnetStack

I created a 3-node network in which Node-A sends a data packet to Node-C via Node-B every 5 seconds. The simulation time is 1 minute, so in 60 seconds Node-A sends 11 packets to Node-C. Following this reply, I analyzed the trace.json file for each transmission from the source (Node-A) to the destination (Node-C).
I have some doubts:
For some packets, at Node-B (the intermediate node) there is more than one TxFrameReq for a single DatagramReq (for example, the 2nd and 4th packet transmissions). Why is this happening?
In that scenario, what is the end-to-end delay for the transmission?
For some transmissions the end-to-end delay is more than 15 seconds. Is that correct, or am I making a mistake in the end-to-end delay calculation? I am calculating the end-to-end delay by subtracting the TX time at Node-A from the RxFrameNtf time at Node-C (a sketch of this calculation is below).
Here I am sharing the simulation code, the trace.json file, and a PDF containing the timings of the relevant events at each node.
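For reference, the calculation boils down to a single subtraction per delivered datagram. A minimal sketch follows; the timestamp values are made up, and I am assuming the trace.json timestamps are in milliseconds.

    /* Minimal sketch of the end-to-end delay calculation described above:
     * subtract the time of Node-A's TxFrameReq from the time of Node-C's
     * RxFrameNtf for the same datagram.  The values below are made up,
     * and the trace timestamps are assumed to be in milliseconds. */
    #include <stdio.h>

    static long end_to_end_delay_ms(long tx_ms, long rx_ms)
    {
        return rx_ms - tx_ms;
    }

    int main(void)
    {
        long tx_ms = 1000;   /* Node-A TxFrameReq time from trace.json */
        long rx_ms = 16200;  /* Node-C RxFrameNtf time from trace.json */
        printf("end-to-end delay = %ld ms\n", end_to_end_delay_ms(tx_ms, rx_ms));
        return 0;
    }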

Related

Using lwIP to send data bigger than 64 KB

I'm trying to send real-time (RT) data over lwIP: 4 bytes per sample, sampled at 100 kHz, for 10 channels.
I've understood that lwIP has a timer which loops every 250 ms and cannot be changed.
So I'm buffering the RT data in RAM at 100 kHz and, every 250 ms, sending the sampled data over TCP.
The problem is that I cannot send more than 65535 bytes every 250 ms, because beyond that I get ERR_MEM.
I already increased the buffer up to 65535, but when I try to increase it further I get several errors while compiling.
So my question is: can lwIP manage buffers bigger than a 16-bit size allows?
Thanks,
Marco
Better to focus on throughput.
You neglected to show any code, describe which Xilinx board/system you're using, or which OS you're using (e.g. FreeRTOS, Linux, etc.).
Your RT data is: 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second.
From your lwIP description, you have 65535-byte packets * 4 packets/sec --> roughly 262,000 bytes / second.
This is too slow. And it is much slower than what a typical TCP / Ethernet link can handle, so I think your understanding of the maximum rate is off a bit.
You probably can't increase the packet size.
But, you probably can increase the number of packets sent.
You say that the 250 ms interval for lwIP can't be changed. I believe that it can.
From https://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf, in the section "Creating an lwIP Application Using the RAW API":
Set up a timer to interrupt at a constant interval. Usually, the interval is around 250 ms. In the timer interrupt, update necessary flags to invoke the lwIP TCP APIs tcp_fasttmr and tcp_slowtmr from the main application loop explained previously.
The "usually" seems to me to imply that it's a default and not a maximum.
But, you may not need to increase the timer rate as I don't think it dictates the packet rate, just the servicing/completion rate [in software].
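For what it's worth, the pattern that application note describes looks roughly like the sketch below. This is only a sketch of a Xilinx standalone/RAW-API setup: the timer hookup is assumed, and the header that declares tcp_fasttmr()/tcp_slowtmr() differs between lwIP versions.

    /* Sketch of the xapp1026 RAW-API timer pattern: the timer ISR only sets
     * flags, and the main loop calls the lwIP TCP timer functions.  The
     * interrupt hookup and the receive call are assumptions for a Xilinx
     * standalone setup; adapt to your BSP. */
    #include "lwip/tcp.h"   /* some lwIP versions declare the timer fns in tcp_impl.h / priv/tcp_priv.h */

    static volatile int tcp_fast_flag;   /* set every 250 ms by the timer ISR */
    static volatile int tcp_slow_flag;   /* set every 500 ms by the timer ISR */

    void timer_isr(void *unused)         /* hooked to a 250 ms hardware timer */
    {
        static int odd = 0;
        tcp_fast_flag = 1;
        if (odd ^= 1)
            tcp_slow_flag = 1;           /* every second tick -> 500 ms */
    }

    void main_loop(void)
    {
        for (;;) {
            /* service received frames here (e.g. xemacif_input() on Xilinx) */
            if (tcp_fast_flag) { tcp_fast_flag = 0; tcp_fasttmr(); }
            if (tcp_slow_flag) { tcp_slow_flag = 0; tcp_slowtmr(); }
        }
    }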
A few guesses ...
Once a packet is queued to the NIC, normally, other packets may be queued asynchronously. Modern NIC hardware often has a hardware queue. That is, the NIC H/W supports multiple pending requests. It can service those at line speed without CPU intervention.
The 250 ms timer may just be an interrupt that retires the descriptors of packets already completed by the NIC hardware.
That is, more than one packet can be processed/completed per interrupt. If that were not the case, only 4 packets / second could be sent, which would be ridiculously low.
Generating an interrupt from the NIC for each packet incurs an overhead. My guess is that interrupts from the NIC are disabled. And, the NIC is run in a "polled" mode. The polling occurs in the timer ISR.
The timer interrupt will occur 4 times per second. But, will process any packets it sees that are completed. So, the ISR overhead is only 4 interrupts / second.
This increases throughput because the ISR overhead is reduced.
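To make that guess concrete, the reclaim side of such a polled scheme usually looks something like the sketch below. This is generic illustration, not lwIP or Xilinx code; the descriptor layout, the DONE bit, and the function names are all made up.

    /* Generic sketch of "retire completed packets from a timer tick".
     * One timer tick can reclaim many completed transmit descriptors at
     * once, so the per-packet interrupt cost disappears. */
    #include <stdint.h>

    #define TX_RING_SIZE 256
    #define DESC_DONE    0x1u        /* set by NIC hardware when the frame has been sent */

    struct tx_desc {
        volatile uint32_t status;    /* written by the NIC */
        void *buf;                   /* software cookie: buffer to free */
    };

    static struct tx_desc tx_ring[TX_RING_SIZE];
    static unsigned tx_clean_idx;    /* oldest descriptor not yet reclaimed */
    static unsigned tx_next_idx;     /* advanced by the send path (not shown) */

    extern void free_tx_buffer(void *buf);   /* however buffers are returned */

    /* Called from the periodic timer ISR (e.g. every 250 ms): walk forward
     * from the clean index and reclaim everything the NIC has already
     * finished -- possibly hundreds of packets per tick. */
    void tx_reclaim_completed(void)
    {
        while (tx_clean_idx != tx_next_idx &&
               (tx_ring[tx_clean_idx].status & DESC_DONE)) {
            free_tx_buffer(tx_ring[tx_clean_idx].buf);
            tx_ring[tx_clean_idx].status = 0;
            tx_clean_idx = (tx_clean_idx + 1) % TX_RING_SIZE;
        }
    }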
UPDATE:
Thanks for the reply; indeed it is 4 bytes * 10 channels * 100 kHz --> 4,000,000 bytes / second, but I agree that we are quite far from the 100 Mbit/s line rate.
Caveat: I don't know lwIP, so most of what I'm saying is based on my experience with other network stacks (e.g. Linux), but it appears that lwIP should be similar.
Of course, lwIP will provide a way to achieve full line speed.
Regarding changing the 250 ms timer period: to achieve what I want, it would have to be lowered by more than 10 times, which seems too much and could compromise the stability of the protocol.
When you say that, did you actually try that?
And, again, you didn't post your code or describe your target system, so it's difficult to provide help.
Issues could be because of the capabilities [or lack thereof] of your target system and its NIC.
Or, because of the way your code is structured, you're not making use of the features that can make it fast.
So basically your suggestion is to enable the interrupt on each message? In that case I can send the remaining data in the ACK callback, if I understood correctly. – Marco Martines
No.
The mode for interrupt on every packet is useful for low data rates where the use of the link is sparse/sporadic.
If you have an interrupt for every packet, the overhead of entering/exiting the ISR (the ISR prolog/epilog) code will become significant [and possibly untenable] at higher data rates.
That's why the timer-based callback is there: to accumulate finished request blocks and [quickly] loop over the queue/chain, completing/reclaiming them periodically. If you wish to understand the concept, look at NAPI: https://en.wikipedia.org/wiki/New_API
Loosely, on most systems, when you do a send, a request block is created with all info related to the given buffer. This block is then queued to the transmit queue. If the transmitter is idle, it is started. For subsequent blocks, the block is appended to the queue. The driver [or NIC card] will, after completing a previous request, immediately start a new/fresh one from the front of the queue.
This allows you to queue multiple/many blocks quickly [and return immediately]. They will be sent in order at line speed.
What actually happens depends on system H/W, NIC controller, and OS and what lwip modes you use.
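The enqueue side of that request-block scheme, again as a generic sketch with made-up names rather than any particular driver's API:

    /* Generic sketch of "queue a request block and kick the transmitter if
     * idle".  Everything here is illustrative; real drivers hang these
     * blocks off DMA descriptors and protect the queue against the
     * completion path (locking omitted for brevity). */
    #include <stddef.h>

    struct tx_req {
        const void *data;
        size_t      len;
        struct tx_req *next;
    };

    static struct tx_req *txq_head, *txq_tail;
    static int transmitter_busy;

    extern void hw_start_transmit(struct tx_req *req);  /* hand one block to the NIC */

    /* Called by the sender for each buffer: returns immediately, so many
     * blocks can be queued back-to-back and go out at line speed. */
    void tx_submit(struct tx_req *req)
    {
        req->next = NULL;
        if (txq_tail)
            txq_tail->next = req;
        else
            txq_head = req;
        txq_tail = req;

        if (!transmitter_busy) {          /* idle: start it now */
            transmitter_busy = 1;
            hw_start_transmit(txq_head);
        }
        /* otherwise the completion path picks up the next block itself */
    }

The point of the shape is that tx_submit() never waits: the hardware drains the queue at line speed while the caller keeps producing.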

Scheduling NAPI poll to execute at regular time intervals

I have gone through multiple posts (on and outside Stack Overflow) regarding this topic. Currently, I am working on modifying the i40e-2.0.30 driver for the Intel X710 NIC.
Thanks to this illustrated blog post (https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/), understanding the driver code became a lot easier.
My post is particularly concerned with the NAPI poll mechanism. I understand that the NAPI poll function is triggered when a packet arrives, and that polling continues as long as the work done while receiving packets uses up the allocated budget; otherwise polling stops.
Based on this, I modified my driver to keep polling when a particular signature of data arrives on a particular queue (using Flow Director), e.g. UDP packets on port XXX, for 10,000 poll cycles. But I am trying to eliminate interrupts as much as possible.
So here is my main question: will I be able to schedule the NAPI poll to execute at a certain point in time? For example, I want the NAPI poll to run every 500 ms and perhaps last for 20 ms.
For instance, if I expect my packet at time T ms, I might start polling at (T - 10) ms and stop polling at (T + 10) ms. This way, I might be able to reduce the use of interrupts. Right now, I have been resetting the interrupts every 10,000 poll cycles.
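Roughly, the kind of timer-driven scheduling I have in mind would look like the sketch below (untested; my_napi, the 500 ms period, and the init hook are placeholders -- in i40e the napi_struct is embedded in the driver's q_vector structure):

    /* Untested sketch of hrtimer-driven NAPI scheduling: an hrtimer fires
     * periodically and kicks NAPI, so the poll function runs at fixed
     * intervals rather than only on interrupts. */
    #include <linux/hrtimer.h>
    #include <linux/ktime.h>
    #include <linux/netdevice.h>

    static struct hrtimer poll_timer;
    static ktime_t poll_period;                 /* e.g. 500 ms */
    extern struct napi_struct *my_napi;         /* placeholder for the queue's napi_struct */

    static enum hrtimer_restart poll_timer_cb(struct hrtimer *t)
    {
        /* Kick NAPI: the driver's poll function then runs from softirq
         * context until it returns less than its budget. */
        napi_schedule(my_napi);
        hrtimer_forward_now(t, poll_period);    /* re-arm for the next window */
        return HRTIMER_RESTART;
    }

    static void start_periodic_poll(void)
    {
        poll_period = ms_to_ktime(500);
        hrtimer_init(&poll_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
        poll_timer.function = poll_timer_cb;
        hrtimer_start(&poll_timer, poll_period, HRTIMER_MODE_REL);
    }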
Any explanation or reference on this would be really helpful.
Thanks,
Kushal.

How to handle multiple retransmission timers for UDP protocol?

I have to manage multiple timers for a UDP file transfer application.
After a timeout the server has to resend packets to the client, but more than one in-flight packet could time out.
So I have to manage a timer for each packet. How can I do this?
I can't use alarm() because it cancels the previous timer and also works only with whole seconds.
You need to keep an array of structs containing timeouts for each packet you want to keep track of.
Each array element should contain the starting time and expected ending time for each timeout. When it's time to set the timer, check all entries in the array to see which one is expected to time out first. Then subtract the current time from that expiry time to get your timeout value for select().
When the socket read times out, go through the list again and for each packet whose timeout time is prior to the current time, handle the timeout for that packet.
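A minimal sketch of that bookkeeping (the struct layout and names are made up, error handling is trimmed, and the re-arm policy is left to the protocol):

    /* Sketch of per-packet retransmission timers multiplexed onto one
     * select() timeout. */
    #include <sys/time.h>

    #define MAX_PENDING 1024

    struct pending_pkt {
        int            in_use;
        unsigned       seq;        /* which packet to resend */
        struct timeval expires;    /* absolute time this packet times out */
    };

    static struct pending_pkt pending[MAX_PENDING];

    extern void resend_packet(unsigned seq);   /* application's resend routine */

    static int tv_before(const struct timeval *a, const struct timeval *b)
    {
        return (a->tv_sec < b->tv_sec) ||
               (a->tv_sec == b->tv_sec && a->tv_usec < b->tv_usec);
    }

    /* Compute the select() timeout: time until the earliest expiry.
     * Returns 0 if nothing is pending (caller passes NULL to select). */
    static int next_timeout(struct timeval *out)
    {
        struct timeval now, earliest;
        int i, found = 0;
        gettimeofday(&now, NULL);

        for (i = 0; i < MAX_PENDING; i++) {
            if (pending[i].in_use &&
                (!found || tv_before(&pending[i].expires, &earliest))) {
                earliest = pending[i].expires;
                found = 1;
            }
        }
        if (!found)
            return 0;
        if (tv_before(&earliest, &now)) {      /* already overdue */
            out->tv_sec = out->tv_usec = 0;
            return 1;
        }
        out->tv_sec  = earliest.tv_sec  - now.tv_sec;
        out->tv_usec = earliest.tv_usec - now.tv_usec;
        if (out->tv_usec < 0) { out->tv_sec--; out->tv_usec += 1000000; }
        return 1;
    }

    /* After select() returns (data or timeout), fire everything that is due. */
    static void handle_expired(void)
    {
        struct timeval now;
        int i;
        gettimeofday(&now, NULL);
        for (i = 0; i < MAX_PENDING; i++) {
            if (pending[i].in_use && !tv_before(&now, &pending[i].expires)) {
                resend_packet(pending[i].seq);
                pending[i].in_use = 0;   /* or re-arm, depending on the protocol */
            }
        }
    }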
Take a look at the source of a multicast file transfer application I wrote called UFTP for an example of how this can be implemented. Specifically, look at the getrecenttimeout function in client_loop.c.

UDP sendto performance over loopback

Background
I have a very high throughput / low latency network app (the goal is << 5 usec per packet) and I wanted to add some monitoring/metrics to it. I have heard about the statsd craze, and it seems a simple way to collect metrics and feed them into our time-series database. Sending a metric is done via a small UDP packet written to a daemon (typically running on the same server).
I wanted to characterize the effect of sending ~5-10 UDP packets in my data path to understand how much latency it would add, and I was surprised at how bad it is. I know this is a very obscure micro-benchmark, but I just wanted to get a rough idea of where it lands.
The question I have
I am trying to understand why it takes so long (relatively speaking) to send a UDP packet to localhost versus a remote host. Are there any tweaks I can make to reduce the latency of sending a UDP packet? I am thinking the solution for me is to push metric collection onto an auxiliary core, or to actually run the statsd daemon on a separate host.
My setup/benchmarks
CentOS 6.5 with some beefy server hardware.
The client test program I have been using is available here: https://gist.github.com/rishid/9178261
Compiled with gcc 4.7.3: gcc -O3 -std=gnu99 -mtune=native udp_send_bm.c -lrt -o udp_send_bm
The receiver side is running nc -ulk 127.0.0.1 12000 > /dev/null (the IP changes per interface).
I have run this micro-benchmark with the following devices.
Some benchmark results:
loopback: Packet Size 500 // Time per sendto() 2159 nanosec // Total time 2.159518
integrated 1 Gb mobo controller: Packet Size 500 // Time per sendto() 397 nanosec // Total time 0.397234
intel ixgbe 10 Gb: Packet Size 500 // Time per sendto() 449 nanosec // Total time 0.449355
solarflare 10 Gb with userspace stack (onload): Packet Size 500 // Time per sendto() 317 nanosec // Total time 0.317229
Writing to loopback is not an efficient way to communicate inter-process for profiling. Generally the buffer will be copied multiple times before it's processed, and you run the risk of dropping packets since you're using UDP. You're also making additional calls into the operating system, so you add the risk of a context switch (~2 usec).
goal is << 5 usec per packet
Is this a hard real-time requirement, or a soft requirement? Generally, when you're handling things in microseconds, profiling should be zero overhead. You're using Solarflare, so I think you're serious. The best way I know to do this is tapping into the physical line and sniffing traffic for metrics. A number of products do this.
I/O to disk or the network is very slow if you incorporate it in a very tight (real-time) processing loop. A solution might be to offload the I/O to a separate, lower-priority task. Let the real-time loop pass the messages to the I/O task through a (preferably lock-free) queue.
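A rough sketch of that pattern: the hot loop only writes a fixed-size record into a single-producer/single-consumer ring, and a lower-priority thread drains it and does the slow send() to statsd. The sizes, names, and metric format here are arbitrary.

    /* Sketch: the real-time path only copies a record into a lock-free
     * SPSC ring; a lower-priority thread drains the ring and does the
     * (slow) UDP send to the statsd daemon. */
    #include <stdatomic.h>
    #include <string.h>
    #include <pthread.h>
    #include <unistd.h>
    #include <sys/socket.h>

    #define RING_SIZE 4096                  /* power of two */
    #define METRIC_LEN 64

    struct metric { char text[METRIC_LEN]; };

    static struct metric ring[RING_SIZE];
    static atomic_uint head;                /* written by producer (RT loop) */
    static atomic_uint tail;                /* written by consumer (metrics thread) */

    /* Hot path: copy the metric in and bump head.  No syscalls, no locks. */
    static int metric_push(const char *text)
    {
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
        if (h - t == RING_SIZE)
            return -1;                      /* ring full: drop the metric */
        strncpy(ring[h % RING_SIZE].text, text, METRIC_LEN - 1);
        ring[h % RING_SIZE].text[METRIC_LEN - 1] = '\0';
        atomic_store_explicit(&head, h + 1, memory_order_release);
        return 0;
    }

    /* Low-priority thread: drain the ring and do the slow I/O. */
    static void *metrics_thread(void *arg)
    {
        int sock = *(int *)arg;             /* a UDP socket connected to statsd */
        for (;;) {
            unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
            unsigned h = atomic_load_explicit(&head, memory_order_acquire);
            while (t != h) {
                struct metric *m = &ring[t % RING_SIZE];
                send(sock, m->text, strlen(m->text), 0);
                t++;
            }
            atomic_store_explicit(&tail, t, memory_order_release);
            usleep(1000);                   /* or block on an eventfd, etc. */
        }
        return NULL;
    }

The metrics thread would be started with pthread_create() and left at normal priority (or pinned to a housekeeping core), while the real-time loop keeps its tight budget and never blocks on the queue.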

Unix domain socket: make latency constant

Issue summary: stable sending over AF_UNIX, bursty receiving.
I have an application B that receives data over a Unix domain datagram socket. A peer application A sends data to it. Both A and B run continuously (and are SCHED_FIFO). Application B also prints the time of reception.
Application A can send data at varying times (varying only in terms of milliseconds). Ideally (what I expect), the delay between send and reception should be constant. For example:
A sends in time : 5ms 10ms 15ms 21ms 30ms 36ms
B should receive in time : 5+x ms 10+x ms 15+x ms 21+x ms ...
Where x is a constant delay.
But when I experimented, what I observed at B is:
A sends in time : 5ms 10ms 15ms 21ms 30ms 36ms
B receives in time : 5+w ms 10+x ms 15+y ms 21+z ms ...
(where w, x, y, z are different delays), so I cannot predict the reception time from the sending time.
Is it because some buffering is involved in the Unix domain socket? Please suggest a workaround so that the reception time is predictable from the send time. I need 1 millisecond accuracy.
(I am using a vanilla Linux 3.0 kernel.)
As you are using blocking recv(), when no datagram is available your program will be unscheduled. This is bad for your use case--you want your program to stay hot. So make your recv() non-blocking, and handle EAGAIN by simply busy waiting. This will consume 100% of one core, but I think you'll find it helps you achieve your goal.
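A minimal sketch of that busy-wait receive loop (assuming sock is the already-bound AF_UNIX datagram socket; error handling and the actual processing are trimmed):

    /* Sketch: spin on a non-blocking recv() so the receiver never sleeps.
     * This pins one core at 100%, which is the point -- the process stays
     * hot and is never unscheduled while waiting for a datagram. */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static void busy_receive(int sock)
    {
        char buf[2048];

        /* switch the socket to non-blocking mode */
        int flags = fcntl(sock, F_GETFL, 0);
        fcntl(sock, F_SETFL, flags | O_NONBLOCK);

        for (;;) {
            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n >= 0) {
                /* timestamp and process the datagram here */
            } else if (errno != EAGAIN && errno != EWOULDBLOCK) {
                break;                      /* real error */
            }
            /* EAGAIN: nothing there yet -- just loop again (busy wait) */
        }
    }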
