I want to do my own very simple implementation of a VPN in C on Linux. For that purpose I'm going to capture IP packets, modify them and forward them. The modification consists of encryption, authentication and other stuff, as in IPsec. My question is: should I process the size of the packets somehow, or will this be handled automatically? I know the maximum size is 65535 - 20 (for the header), but according to the MTU it is smaller. I think that's because the encrypted payload "encapsulated into UDP" for NAT-T is much bigger than the "normal" payload of the IP packet.
Well, I found that there are actually 2 ways to handle that problem:
1) We can send big packets by clearing the DF (Don't Fragment) flag, telling the network that it may fragment our packets. But in this case a packet can be lost, because not all devices/routers support packet fragmentation.
2) We can calculate the maximum MTU of the path between the hosts ourselves, split the packets accordingly and send the pieces. On the other side we put all these pieces together and restore the original packet. This can be done by implementing our own "system" for this purpose.
You can read more about IP packet fragmentation and reassembly here.
Related
When will a TCP packet be fragmented at the application layer? When a TCP packet is sent from an application, will the recipient at the application layer ever receive the packet in two or more packets? If so, what conditions cause the packet to be divided? It seems like a packet won't be fragmented until it reaches the Ethernet (link layer) limit of 1500 bytes. But that fragmentation will be transparent to the recipient at the application layer, since the network layer will reassemble the fragments before sending the packet up to the next layer, right?
It will be split when it hits a network device with a lower MTU than the packet's size. Most Ethernet devices use an MTU of 1500, but it can often be smaller: 1492 if that Ethernet link is going over PPPoE (DSL) because of the extra encapsulation overhead, and even lower if a second layer is added, like with Windows Internet Connection Sharing. And dialup is normally 576!
In general, though, you should remember that TCP is not a packet protocol. It uses packets at the lowest level to transmit over IP, but as far as the interface of any TCP stack is concerned, it is a stream protocol and has no requirement to provide you with a 1:1 relationship to the physical packets sent or received (for example, most stacks will hold data until a certain period of time has expired, or until there is enough data to maximize the size of the IP packet for the given MTU).
As an example, if you sent two "packets" (i.e. called your send function twice), the receiving program might only receive 1 "packet" (the receiving TCP stack might combine them). If you are implementing a message-type protocol over TCP, you should include a header at the beginning of each message (or some other header/footer mechanism) so that the receiving side can split the TCP stream back into individual messages, whether a message is received in two parts or several messages are received as a single chunk.
Fragmentation should be transparent to a TCP application. Keep in mind that TCP is a stream protocol: you get a stream of data, not packets! If you are building your application based on the idea of complete data packets then you will have problems unless you add an abstraction layer to assemble whole packets from the stream and then pass the packets up to the application.
The question makes an assumption that is not true -- TCP does not deliver packets to its endpoints, rather, it sends a stream of bytes (octets). If an application writes two strings into TCP, it may be delivered as one string on the other end; likewise, one string may be delivered as two (or more) strings on the other end.
RFC 793, Section 1.5:
"The TCP is able to transfer a continuous stream of octets in each direction between its users by packaging some number of octets into segments for transmission through the internet system."
The key words being continuous stream of octets (bytes).
RFC 793, Section 2.8:
"There is no necessary relationship between push functions and segment boundaries. The data in any particular segment may be the result of a single SEND call, in whole or part, or of multiple SEND calls."
The entirety of section 2.8 is relevant.
At the application layer there are any number of reasons why the whole 1500 bytes may not show up in one read. Various factors in the operating system and TCP stack may cause the application to get some bytes in one read call and the rest in the next. Yes, the TCP stack has to re-assemble the packet before sending it up, but that doesn't mean your app is going to get it all in one shot (it is LIKELY to get it in one read, but it's not GUARANTEED).
TCP tries to guarantee in-order delivery of bytes, with error checking, automatic re-sends, etc happening behind your back. Think of it as a pipe at the app layer and don't get too bogged down in how the stack actually sends it over the network.
This page is a good source of information about some of the issues that others have brought up, namely the need for data encapsulation on a per-application-protocol basis. It is not quite authoritative in the sense you describe, but it has examples and is sourced to some pretty big names in network programming.
If a packet exceeds the maximum MTU of a network device it will be broken up into multiple packets. (Note most equipment is set to 1500 bytes, but this is not a necessity.)
The reconstruction of the packet should be entirely transparent to the applications.
Different network segments can have different MTU values. In that case fragmentation can occur. For more information see TCP Maximum segment size
This (de)fragmentation happens in the TCP layer. In the application layer there are no more packets. TCP presents a contiguous data stream to the application.
At the "application layer" a TCP packet (well, segment really; TCP at its own layer doesn't know about packets) is never fragmented, since it doesn't exist there. The application layer is where you see the data as a stream of bytes, delivered reliably and in order.
If you're thinking about it otherwise, you're probably approaching something in the wrong way. However, this is not to say that there might not be a layer above this, say, a sequence of messages delivered over this reliable, in-order bytestream.
Correct - the most informative way to see this is using Wireshark, an invaluable tool. Take the time to figure it out - it has saved me several times, and it gives a good reality check.
If a 3000-byte packet enters an Ethernet network with the default MTU of 1500 (for Ethernet), it will be fragmented into multiple packets, each at most 1500 bytes in length (since every fragment carries its own IP header, a 3000-byte packet actually yields three fragments). That is the only time I can think of.
Wireshark is your best bet for checking this. I have been using it for a while and am totally impressed
I'm running into trouble with packet segmentation. I've already read from many sources about GSO, which is a generalized way of segmenting a packet whose size exceeds the Ethernet MTU (1500 B). However, I have not found an answer to the doubts I have in mind.
If we add a new set of bytes (e.g. a new header, call it 'NH') between the L2 and L3 layers, the kernel must be able to step over NH and adjust the sk_buff pointer to the beginning of L3 to offload the packet according to the 'policy' of the L3 protocol type (e.g. IPv4 fragmentation). My thought was to modify the skb_network_protocol() function. This function, if I'm not wrong, enables skb_mac_gso_segment() to properly call the GSO function for different types of L3 protocols. However, I'm not able to segment my packets properly.
I have a kernel module that forwards packets through the network (OVS, Open vSwitch). In the tests I've been running (h1 --ping-- h2), the host generates large ICMP packets and then sends packets that are less than or equal to the MTU size. Those packets are received by the first switch, which attaches the new header NH, so if a packet had 1500 B, it becomes 1500 B + NH length. Here is the problem: the switch has already received a fragmented packet from the host, and the switch adds more bytes to the packet (much as VLAN does).
Therefore, at first, I tried to ping with large packets, but it didn't work. In OVS, before calling dev_queue_xmit(), a packet can be segmented by calling skb_gso_segment(). However, the packet first needs to pass a condition checked by netif_needs_gso(). I'm not sure whether I have to use skb_gso_segment() to properly segment the packet.
I also noticed that, for netif_needs_gso() to return true, skb_shinfo(skb)->gso_size has to be non-zero. However, gso_size always has the value zero for all received packets. So I made a test by assigning an arbitrary value to gso_size (e.g. 1448 B). Now, in my tests, I was able to ping from h1 to h2, but the first 2 packets were lost. In another test, TCP had extremely poor performance. And since then, I've been getting a kernel warning: "[ 5212.694418] [c1642e50] ? skb_warn_bad_offload+0xd0/0xd8"
For small packets (< MTU) I have no trouble and ping works fine. TCP works fine too, but only with a small window size.
Does anyone have any idea what's happening? Should I always use GSO when I get large packets? Is it possible to fragment an already-fragmented IPv4 packet?
As the new header lies between L2 and L3, I guess the enlargement of an IPv4 packet due to the additional header is similar to what happens with VLAN. How does VLAN handle the segmentation problem?
Thanks in advance,
I am implementing a simple network stack using UDP sockets and I wish to send about 1 MB of string data from the client to the server. However, I am not aware of any limit on the length parameter of the UDP sendto() API in C. If there is a limit and sendto() won't handle packetization beyond it, then I would have to manually split the string into smaller blocks and send them one by one.
Is there a limit on the length of the buffer? Or does the sendto() API handle packetization by itself?
Any insight is appreciated.
There's no API limit on sendto -- it can handle any size the underlying protocol can.
There IS a packet limit for UDP -- 64K; if you exceed this, the sendto call will fail with an EMSGSIZE error code. There are also packet size limits for IP, which differ between IPv4 and IPv6. Finally, the low-level transport has an MTU size which may or may not be an issue. IP packets can be fragmented into multiple lower-level packets and automatically reassembled, unless you've disabled fragmentation with a setsockopt call (IP_MTU_DISCOVER with IP_PMTUDISC_DO on Linux).
The easiest way to deal with all this complexity is to make your code flexible -- detect EMSGSIZE errors from sendto and switch to using smaller messages when you get one. This also works well if you want to do path MTU discovery, which will generally accept larger messages at first, but will cut down the maximum message size when you send one that ends up exceeding the path MTU.
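A sketch of that fallback; the halving policy and the helper name are mine, chosen for simplicity rather than taken from any library:

```c
/* Sketch: shrink the chunk size whenever sendto() reports EMSGSIZE.
   Halving is an arbitrary policy; real path-MTU discovery is smarter. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Send `len` bytes in chunks, halving the chunk size on EMSGSIZE.
   Returns 0 on success, -1 on any other error. */
int send_all(int sock, const struct sockaddr *dst, socklen_t dstlen,
             const char *data, size_t len)
{
    size_t chunk = 65507;            /* max UDP payload over IPv4 */
    size_t off = 0;

    while (off < len) {
        size_t n = len - off < chunk ? len - off : chunk;
        ssize_t sent = sendto(sock, data + off, n, 0, dst, dstlen);
        if (sent >= 0) {
            off += (size_t)sent;
        } else if (errno == EMSGSIZE && chunk > 512) {
            chunk /= 2;              /* too big for the path; retry smaller */
        } else {
            return -1;
        }
    }
    return 0;
}
```

Note that splitting like this turns one logical message into several datagrams, so the receiver needs its own reassembly logic (UDP won't glue them back together).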
If you just want to avoid worrying about it, a send of 1452 bytes or less is likely to always be fine (that's the 1500 byte normal ethernet payload max minus 40 for a normal IPv6 header and 8 for a UDP header), unless you're using a VPN (in which case you need to worry about encapsulation overhead).
I'm building an embedded system for a camera controller in Linux (not real-time). I'm having a problem getting the networking to do what I want it to do. The system has 3 NICs, 1 100base-T and 2 gigabit ports. I hook the slower one up to the camera (that's all it supports) and the faster ones are point-to-point connections to other machines. What I am attempting to do is get an image from the camera, do a little processing, then broadcast it using UDP to each of the other NICs.
Here is my network configuration:
eth0: addr: 192.168.1.200 Bcast 192.168.1.255 Mask: 255.255.255.0 (this is the 100base-t)
eth1: addr: 192.168.2.100 Bcast 192.168.2.255 Mask: 255.255.255.0
eth2: addr: 192.168.3.100 Bcast 192.168.3.255 Mask: 255.255.255.0
The image is coming in off eth0 in a proprietary protocol, so it's a raw socket. I can broadcast it to eth1 or eth2 just fine. But when I try to broadcast it to both, one after the other, I get lots of network hiccups and errors on eth0.
I initialize the UDP sockets like this:
int broadcast = 1;
sock2 = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);  // Or sock3
sa.sin_family = AF_INET;
sa.sin_port = htons(8000);
inet_aton("192.168.2.255", &sa.sin_addr);  // Or 192.168.3.255
setsockopt(sock2, SOL_SOCKET, SO_BROADCAST, &broadcast, sizeof(broadcast));
bind(sock2, (struct sockaddr *)&sa, sizeof(sa));
sendto(sock2, &data, sizeof(data), 0, (struct sockaddr *)&sa, sizeof(sa));  // sizeof(data) < 1100 bytes
I do this for each socket separately, and call sendto separately. When I do one or the other, it's fine. When I try to send on both, eth0 starts getting bad packets.
Any ideas on why this is happening? Is it a configuration error, is there a better way to do this?
EDIT:
Thanks for all the help, I've been trying some things and looking into this more. The issue does not appear to be broadcasting, strictly speaking. I replaced the broadcast code with a unicast command and it has the same behavior. I think I understand the behavior better, but not how to fix it.
Here is what is happening. On eth0 I am supposed to get an image every 50ms. When I send out an image on eth1 (or 2) it takes about 1.5ms to send the image. When I try to send on both eth1 and eth2 at the same time it takes about 45ms, occasionally jumping to 90ms. When this goes beyond the 50ms window, eth0's buffer starts to build. I lose packets when the buffer gets full, of course.
So my revised question. Why would it go from 1.5ms to 45ms just by going from one ethernet port to two?
Here is my initialization code:
sock[i] = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
sa[i].sin_family = AF_INET;
sa[i].sin_port = htons(8000);
inet_aton(ip, &sa[i].sin_addr);
// If broadcasting
char buffer[] = "eth1";  // or "eth2"
setsockopt(sock[i], SOL_SOCKET, SO_BINDTODEVICE, buffer, sizeof(buffer));
int b = 1;
setsockopt(sock[i], SOL_SOCKET, SO_BROADCAST, &b, sizeof(b));
Here is my sending code:
for (i = 0; i < 65; i++) {
    sendto(sock[0], &data[i], sizeof(data[i]), 0, (struct sockaddr *)&sa[0], sizeof(sa[0]));
    sendto(sock[1], &data[i], sizeof(data[i]), 0, (struct sockaddr *)&sa[1], sizeof(sa[1]));
}
It's pretty basic.
Any ideas? Thanks for all your great help!
Paul
Maybe your UDP stack is running out of memory?
(1) Check /proc/sys/net/ipv4/udp_mem (see man 7 udp for details). Make sure that the first number is at least 8 times the image size. This sets the memory available to all UDP sockets in the system.
(2) Make sure the per-socket send buffer is big enough. Use setsockopt(sock2, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) with sndbuf = image_size * 2 to set the send buffer on both sockets. You might need to increase the maximum allowed value in /proc/sys/net/core/wmem_max. See man 7 socket for details.
(3) You might as well increase the RX buffer of the receiving socket. Write a big number to /proc/sys/net/core/rmem_max, then use SO_RCVBUF to increase the receive buffer size.
A workaround until this issue is actually solved may be to create a bridge for eth1+eth2 and send the packet to that bridge.
That way the image is only copied into kernel memory once instead of twice.
It's been a long time, but I found the answer to my question, so I thought I would put it here in case anyone else ever finds it.
The two Gigabit Ethernet ports were actually on a PCI bridge off the PCI-express bus. The PCI-express bus was internal to the motherboard, but it was a PCI bus going to the cards. The bridge and the bus did not have enough bandwidth to actually send out the images that fast. With only one NIC enabled the data was sent to the buffer and it looked very quick to me, but it took much longer to actually get through the bus, out the card, and on to the wire. The second NIC was slower because the buffer was full. Although changing the buffer size masked the problem, it did not actually send the data out any faster and I was still getting dropped packets on the third NIC.
In the end, the 100Base-T card was actually built onto the motherboard and therefore had a faster bus to it, resulting in overall faster bandwidth than the gigabit ports. By switching the camera to a gigabit line and one of the gigabit lines to the 100Base-T line, I was able to meet the requirements.
Strange.
I need to send some data over the subnet with fixed non-standard MTU (for example, 1560) using TCP.
All the Ethernet frames transfered through this subnet should be manually padded with 0's, if the frame's length is less than MTU.
So, the data size should be
(1560 - sizeof( IP header ) - sizeof( TCP header ) ).
This is the way I am going to do it:
I set the TCP_CORK option to reduce fragmenting of the data. It is not fully reliable, because there is a 200-millisecond ceiling, but it works.
I know the size of the IP header (20 bytes), so the data length should be equal to (1540 - sizeof( TCP header )).
That's the problem: I don't know the TCP header size, because the size of its "Options" field varies.
So, the question is: how to get the size of TCP header? Or maybe there is some way to send TCP frames with headers of fixed length?
Trying to control the size of frames when using TCP from the user application is wrong. You are working at the wrong abstraction level. It's also impossible.
What you should be doing is either consider replacing TCP with something else (UDP?) or, less likely but possible, rewrite your Ethernet driver to set the non-standard MTU and do the padding you need.
This isn't possible using the TCP stack of the host simply because a TCP stack that follows RFC 793 isn't supposed to offer this kind of access to an application.
That is, there isn't (and there shouldn't be) a way to influence what the lower layers do with your data. Of course, there are ways to influence what TCP does (Nagle for example) but that is against the spirit of the protocol. TCP should be used for what it's best at: transferring a continuous, ordered stream of bytes. Nothing more, nothing less. No messages, packets, frames.
If after all you do need to control such details, you need to look at lower-level APIs. You could use SOCK_RAW and PF_PACKET.
Packet sockets are used to receive or send raw packets at the device driver (OSI Layer 2) level.
#gby mentioned UDP and that is (partially) a good idea: the UDP header has a fixed size. But keep in mind that you will have to deal with IP fragmentation (or prevent it with the don't-fragment socket option: IP_DONTFRAG on BSD, IP_MTU_DISCOVER on Linux).
In addition to my comments below the OP's question, this quote from the original RFC outlining how to send TCP/IP over ethernet is relevant:
RFC 894 (emphasis mine):
If necessary, the data field should be padded (with octets of zero) to meet the Ethernet minimum frame size.
If they wanted all ethernet frames to be at maximum size, they would have said so. They did not.
Maybe what was meant by padding is that the TCP header padding, used to align the header on a 32-bit boundary, should be all zeros: http://freesoft.org/CIE/Course/Section4/8.htm