Environment
As I understand it, the network layer is responsible for reassembling fragmented datagrams and then passing the reassembled data up to the transport layer.
I have collected packet traces using libpcap, and I want to reassemble the fragmented packets at layer 3 on my own.
This link says that I need the fragment flag, fragment offset, identification number and a buffer for reassembling the fragments.
Question
When the first fragment arrives, how do I know what size of buffer to allocate for reassembling the complete datagram?
Thanks.
The IP header only gives you the size of the fragment. So you need to reserve a buffer the size of the largest possible IP packet, i.e. 65535 bytes. Only once you get the last fragment can you determine the length of the complete packet.
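A minimal sketch of that approach (my own illustration, not from the answer above; overlap, duplicate and timeout handling omitted): reserve a 65535-byte buffer per (source, destination, protocol, identification) tuple, copy each fragment at offset * 8, and compute the total length only once the fragment with MF == 0 arrives.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/ip.h>

/* One reassembly context per (src, dst, protocol, identification) tuple. */
struct reasm_ctx {
    uint8_t  buf[65535];   /* worst case: maximum IPv4 datagram size         */
    uint32_t total_len;    /* known only once the last fragment (MF=0) comes */
    uint32_t bytes_copied; /* progress toward total_len                      */
};

/* Returns 1 when the datagram is complete, 0 otherwise. */
static int add_fragment(struct reasm_ctx *ctx, const struct ip *iph)
{
    uint16_t frag_field = ntohs(iph->ip_off);
    uint32_t offset     = (uint32_t)(frag_field & IP_OFFMASK) * 8;
    uint32_t payload    = ntohs(iph->ip_len) - iph->ip_hl * 4;
    const uint8_t *data = (const uint8_t *)iph + iph->ip_hl * 4;

    if (offset + payload > sizeof(ctx->buf))
        return 0;                          /* malformed fragment, ignore it */

    memcpy(ctx->buf + offset, data, payload);
    ctx->bytes_copied += payload;

    if (!(frag_field & IP_MF))             /* last fragment: MF bit clear   */
        ctx->total_len = offset + payload; /* total length is now known     */

    return ctx->total_len != 0 && ctx->bytes_copied >= ctx->total_len;
}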
Related
When using recvfrom(2) to receive packets from the network, I get one packet per call.
What is the maximum length of a TCP/UDP packet that I can get with this function?
There's no fixed limit in TCP, because it's a stream protocol, not a datagram protocol.
In UDP over IPv4, the limit is 65,507 bytes. From Wikipedia:
Length
This field specifies the length in bytes of the UDP header and UDP data. The minimum length is 8 bytes, the length of the header. The field size sets a theoretical limit of 65,535 bytes (8 byte header + 65,527 bytes of data) for a UDP datagram. However the actual limit for the data length, which is imposed by the underlying IPv4 protocol, is 65,507 bytes (65,535 − 8 byte UDP header − 20 byte IP header).
Using IPv6 jumbograms it is possible to have UDP datagrams of size greater than 65,535 bytes. RFC 2675 specifies that the length field is set to zero if the length of the UDP header plus UDP data is greater than 65,535.
Note that using extremely large UDP datagrams can be problematic. Few network links have such large MTUs, so the datagram will likely be fragmented. If any fragment is lost, the entire datagram will have to be resent by the application layer (if the application requires and implements reliability). TCP normally uses Path MTU Discovery to send the stream in segments that fit in the minimum MTU of all the links in the path; if a segment is lost, TCP can just retransmit the segments after that (or just the lost segment if Selective Acknowledgement is implemented, which most TCP implementations now offer).
recvfrom will always return exactly one packet for UDP. UDP packets can be up to 64KB in size, give or take a few header bytes. In practice, most UDP protocols don't ever send that much data in a single packet, so the buffer size you pass to recvfrom can be much smaller, depending on what your protocol dictates.
For TCP, you typically use recv, not recvfrom, to read incoming data from a connected socket. As many will point out, TCP is a stream protocol, not a message/packet protocol like UDP. As such, recv will give you back a non-deterministic number of bytes, between 1 and the size of the buffer passed to the recv call itself. Always check the return value of a recv call - it's not guaranteed to give you any particular byte count.
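To illustrate the difference (a sketch using standard BSD sockets; the buffer sizes and the 65,507-byte IPv4 limit come from the discussion above), a UDP receiver can simply size its buffer at the maximum, while a TCP reader must loop on recv() and treat whatever count comes back as part of a byte stream:

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

#define UDP_MAX_PAYLOAD 65507   /* 65535 - 20 (IPv4 header) - 8 (UDP header) */

void read_one_udp_datagram(int udp_sock)
{
    char buf[UDP_MAX_PAYLOAD];
    struct sockaddr_storage src;
    socklen_t srclen = sizeof(src);

    /* Exactly one datagram per call; excess bytes would be discarded. */
    ssize_t n = recvfrom(udp_sock, buf, sizeof(buf), 0,
                         (struct sockaddr *)&src, &srclen);
    if (n >= 0)
        printf("got a %zd-byte datagram\n", n);
}

void drain_tcp_stream(int tcp_sock)
{
    char buf[4096];   /* arbitrary size: TCP has no message boundaries */
    ssize_t n;

    /* recv() returns 1..sizeof(buf) bytes, 0 on orderly close, -1 on error. */
    while ((n = recv(tcp_sock, buf, sizeof(buf), 0)) > 0)
        printf("got %zd bytes of the stream\n", n);
}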
I'm having trouble with packet segmentation. I've already read from many sources about GSO, which is a generalized way of segmenting packets larger than the Ethernet MTU (1500 B). However, I have not found answers to the doubts I have in mind.
If we add a new set of bytes (e.g. a new header by the name 'NH') between the L2 and L3 layers, the kernel must be able to pass through NH and adjust the sk_buff pointer to the beginning of L3 in order to offload the packet according to the 'policy' of the L3 protocol type (e.g. IPv4 fragmentation). My thought was to modify the skb_network_protocol() function. This function, if I'm not wrong, enables skb_mac_gso_segment() to properly call the GSO function for different L3 protocol types. However, I'm not able to segment my packets properly.
I have a kernel module that forwards packets through the network (OVS, Open vSwitch). In the tests I've been running (h1 --ping-- h2), the host generates large ICMP packets and then sends packets which are less than or equal to the MTU size. Those packets are received by the first switch, which attaches the new header NH, so if a packet had 1500 B, it becomes 1500 B + NH length. Here is the problem: the switch has already received a fragmented packet from the host, and then the switch adds more bytes to the packet (much as VLAN does).
Therefore, at first, I tried to ping with large packets, but it didn't work. In OVS, before calling dev_queue_xmit(), a packet can be segmented by calling skb_gso_segment(). However, the packet first needs to pass a condition checked by netif_needs_gso(), and I'm not sure whether I have to use skb_gso_segment() to properly segment the packet.
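Roughly, the pattern I'm considering looks like this (just an untested sketch; the exact signatures of netif_needs_gso() and skb_gso_segment() differ between kernel versions, and error handling is reduced to a minimum):

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/err.h>

/* Sketch only: software-segment an oversized skb before transmission,
 * roughly the pattern the core kernel applies around dev_queue_xmit(). */
static void xmit_maybe_segmented(struct sk_buff *skb)
{
    netdev_features_t features = netif_skb_features(skb);

    if (netif_needs_gso(skb, features)) {
        struct sk_buff *segs = skb_gso_segment(skb, features);

        if (IS_ERR(segs)) {
            kfree_skb(skb);
            return;
        }
        if (segs) {
            consume_skb(skb);   /* original skb replaced by the segment list */
            skb = segs;
        }
    }

    while (skb) {
        struct sk_buff *next = skb->next;

        skb->next = NULL;
        dev_queue_xmit(skb);
        skb = next;
    }
}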
I also noticed that, for netif_needs_gso() to return true, skb_shinfo(skb)->gso_size has to be non-zero. However, gso_size is always zero for all the received packets. So I made a test by assigning a random value to gso_size (e.g. 1448 B). Now, in my tests, I was able to ping from h1 to h2, but the first 2 packets were lost. In another test, TCP had extremely poor performance. And since then, I've been getting a kernel warning: "[ 5212.694418] [c1642e50] ? skb_warn_bad_offload+0xd0/0xd8"
For small packets (< MTU) I have no trouble and ping works fine. TCP also works, but only with a small window size.
Does anyone have any idea what's happening? Should I always use GSO when I get large packets? Is it possible to fragment an already-fragmented IPv4 packet?
Since the new header lies between L2 and L3, I guess the enlargement of an IPv4 packet due to the additional header is similar to what happens with VLAN. How does VLAN handle the segmentation problem?
Thanks in advance,
The libpcap packet header structure has 2 length fields:
typedef struct pcaprec_hdr_s {
    guint32 ts_sec;   /* timestamp seconds */
    guint32 ts_usec;  /* timestamp microseconds */
    guint32 incl_len; /* number of octets of packet saved in file */
    guint32 orig_len; /* actual length of packet */
} pcaprec_hdr_t;
incl_len: the number of bytes of packet data actually captured and saved in the file. This value should never become larger than orig_len or the snaplen value of the global header.
orig_len: the length of the packet as it appeared on the network when it was captured. If incl_len and orig_len differ, the actually saved packet size was limited by snaplen.
Can anyone tell me what the difference between the 2 length fields is? We are saving the packet in its entirety, so how can the 2 differ?
Reading through the documentation at the Wireshark wiki (http://wiki.wireshark.org/Development/LibpcapFileFormat) and studying an example pcap file, it looks like incl_len and orig_len are usually the same quantity. The only time they will differ is if the length of the packet exceeded the size of snaplen, which is specified in the global header for the file.
I'm just guessing here, but I imagine that snaplen specifies the size of the static buffer used for capturing. In the event that a packet was too large for the capture buffer, this is the format's method for signaling that fact. snaplen is documented to "usually" be 65535, which is large enough for most packets. But the documentation stipulates that the size might be limited by the user.
Can anyone tell me what the difference between the 2 length fields is? We are saving the packet in its entirety, so how can the 2 differ?
If you're saving the entire packet, the 2 shouldn't differ.
However, if, for example, you run tcpdump, TShark, dumpcap, or a capture-from-the-command-line Wireshark and specify a small value with the "-s n" flag, or specify a small value in the "Limit each packet to [n] bytes" option in the Wireshark GUI, then libpcap/WinPcap will be passed that value and will only supply the first n bytes of each packet to the program, and the entire packet won't be saved.
A limited "snapshot length" means you don't see all the packet data, so some analysis might not be possible. On the other hand, it means that less memory is needed in the OS to buffer packets (so fewer packets might be dropped), less CPU bandwidth is needed to copy packet data to the application, less disk bandwidth is needed to save packets to disk if the application is saving them (which might also reduce the number of packets dropped), and less disk space is needed for the saved packets.
When using pcap_open_live to sniff from an interface, I have seen a lot of examples using various numbers as SNAPLEN value, ranging from BUFSIZ (<stdio.h>) to "magic numbers".
Wouldn't it make more sense to set the SNAPLEN to the MTU of the interface we are capturing from?
That way, we could fit more packets at once in the PCAP buffer. Is it safe to assume that the MRU is equal to the MTU?
Otherwise, is there a non-exotic way to set the SNAPLEN value?
Thanks
The MTU is the largest payload size that could be handed to the link layer; it does not include any link-layer headers, so, for example, on Ethernet it would be 1500, not 1514 or 1518, and wouldn't be large enough to capture a full-sized Ethernet packet.
In addition, it doesn't include any metadata headers such as the radiotap header for 802.11 radio information.
And if the adapter is doing any form of fragmentation/segmentation/reassembly offloading, the packets handed to the adapter or received from the adapter might not yet be fragmented or segmented, or might have been reassembled, and, as such, might be much larger than the MTU.
As for fitting more packets in the PCAP buffer, that only applies to the memory-mapped TPACKET_V1 and TPACKET_V2 capture mechanisms in Linux, which have fixed-size packet slots; other capture mechanisms do not reserve a maximum-sized slot for every packet, so a shorter snapshot length won't matter. For TPACKET_V1 and TPACKET_V2, a smaller snapshot length could make a difference, although, at least for Ethernet, libpcap 1.2.1 attempts, as best it can, to choose an appropriate buffer slot size for Ethernet. (TPACKET_V3 doesn't appear to have the fixed-size per-packet slots, in which case it wouldn't have this problem, but it only appeared in officially-released kernels recently, and no support for it exists yet in libpcap.)
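For illustration, the conventional "capture everything" call simply passes 65535 as the snaplen (a sketch; the device name "eth0" and the 1000 ms read timeout are placeholder values):

#include <pcap/pcap.h>
#include <stdio.h>

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];

    /* 65535 is large enough for a full packet on most link types;
     * "eth0" and the 1000 ms read timeout are placeholders. */
    pcap_t *p = pcap_open_live("eth0", 65535, 1 /* promiscuous */,
                               1000, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* ... capture with pcap_loop() or pcap_next_ex() ... */
    pcap_close(p);
    return 0;
}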
I need to send some data over a subnet with a fixed non-standard MTU (for example, 1560) using TCP.
All Ethernet frames transferred through this subnet should be manually padded with 0's if the frame's length is less than the MTU.
So the data size should be (1560 - sizeof(IP header) - sizeof(TCP header)).
This is the way I am going to do it:
I set the TCP_CORK option to reduce the fragmentation of the data. It is not reliable, because there is a 200 millisecond ceiling, but it works.
I know the size of the IP header (20 bytes), so the data length should be equal to (1540 - sizeof(TCP header)).
That's the problem. I don't know the TCP header size; the size of its "Options" field is variable.
So the question is: how do I get the size of the TCP header? Or maybe there is some way to send TCP frames with headers of fixed length?
Trying to control the size of frames when using TCP from the user application is wrong. You are working at the wrong abstraction level. It's also impossible.
What you should be doing is either consider replacing TCP with something else (UDP?) or, less likely but possible, rewrite your Ethernet driver to set the non-standard MTU and do the padding you need.
This isn't possible using the TCP stack of the host simply because a TCP stack that follows RFC 793 isn't supposed to offer this kind of access to an application.
That is, there isn't (and there shouldn't be) a way to influence what the lower layers do with your data. Of course, there are ways to influence what TCP does (Nagle for example) but that is against the spirit of the protocol. TCP should be used for what it's best at: transferring a continuous, ordered stream of bytes. Nothing more, nothing less. No messages, packets, frames.
If after all you do need to control such details, you need to look at lower-level APIs. You could use SOCK_RAW and PF_PACKET.
Packet sockets are used to receive or send raw packets at the device driver (OSI Layer 2) level.
gby mentioned UDP and that is (partially) a good idea: with UDP you control the size of each datagram. But keep in mind that you will have to deal with IP fragmentation (or use IP_DONTFRAG).
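A minimal sketch of that lower-level route (Linux-specific, my own illustration, not from the answer above): open a PF_PACKET/SOCK_RAW socket, build the Ethernet/IP/TCP frame yourself, and zero-pad it to the fixed size before sending. The interface name and the 1560-byte frame size are assumptions taken from the question.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

#define FIXED_FRAME_LEN 1560   /* the subnet's non-standard fixed frame size */

/* Send one frame, zero-padded to FIXED_FRAME_LEN octets.
 * 'frame'/'len' already contain Ethernet + IP + TCP headers and payload. */
static int send_padded_frame(const void *frame, size_t len, const char *ifname)
{
    unsigned char padded[FIXED_FRAME_LEN] = {0};   /* trailing zero padding */
    struct sockaddr_ll addr = {0};
    int fd, rc;

    if (len > sizeof(padded))
        return -1;
    memcpy(padded, frame, len);

    /* Requires CAP_NET_RAW; bypasses the kernel's TCP/IP stack entirely. */
    fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex(ifname);
    addr.sll_halen   = ETH_ALEN;
    /* addr.sll_addr would normally carry the destination MAC address */

    rc = sendto(fd, padded, sizeof(padded), 0,
                (struct sockaddr *)&addr, sizeof(addr));
    close(fd);
    return rc < 0 ? -1 : 0;
}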
In addition to my comments below the OP's question, this quote from the original RFC outlining how to send TCP/IP over Ethernet is relevant:
RFC 894 (emphasis mine):
If necessary, the data field should be padded (with octets of zero) to meet the Ethernet minimum frame size.
If they had wanted all Ethernet frames to be at maximum size, they would have said so. They did not.
Maybe what was meant by padding is that the TCP header padding, which aligns it on 32 bits, should be all zeros: http://freesoft.org/CIE/Course/Section4/8.htm