According to Wikipedia, a traceroute program works as follows:
Traceroute, by default, sends a sequence of User Datagram Protocol
(UDP) packets addressed to a destination host[...] The time-to-live
(TTL) value, also known as hop limit, is used in determining the
intermediate routers being traversed towards the destination. Routers
decrement packets' TTL value by 1 when routing and discard packets
whose TTL value has reached zero, returning the ICMP error message
ICMP Time Exceeded.[..]
I started writing a program (using an example UDP program as a guide) to adhere to this specification,
#include <sys/socket.h>
#include <assert.h>
#include <netinet/udp.h> //Provides declarations for udp header
#include <netinet/ip.h> //Provides declarations for ip header
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <unistd.h>
#include <stdlib.h> //Provides malloc, free and atoi
#define DATAGRAM_LEN (sizeof(struct iphdr) + sizeof(struct udphdr))
unsigned short csum(unsigned short *ptr,int nbytes) {
register long sum;
unsigned short oddbyte;
register short answer;
sum=0;
while(nbytes>1) {
sum+=*ptr++;
nbytes-=2;
}
if(nbytes==1) {
oddbyte=0;
*((u_char*)&oddbyte)=*(u_char*)ptr;
sum+=oddbyte;
}
sum = (sum>>16)+(sum & 0xffff);
sum = sum + (sum>>16);
answer=(short)~sum;
return(answer);
}
char *new_packet(int ttl, struct sockaddr_in sin) {
static int id = 0;
char *datagram = calloc(1, DATAGRAM_LEN); //Zeroed, so both checksum fields start at 0
struct iphdr *iph = (struct iphdr*) datagram;
struct udphdr *udph = (struct udphdr*)(datagram + sizeof (struct iphdr));
iph->ihl = 5;
iph->version = 4;
iph->tos = 0;
iph->tot_len = htons(DATAGRAM_LEN); //Multi-byte header fields are in network byte order
iph->id = htons(++id); //Id of this packet (16-bit field, so htons rather than htonl)
iph->frag_off = 0;
iph->ttl = ttl;
iph->protocol = IPPROTO_UDP;
iph->saddr = inet_addr("127.0.0.1");//Spoof the source ip address
iph->daddr = sin.sin_addr.s_addr;
iph->check = csum((unsigned short*)datagram, sizeof(struct iphdr)); //IP checksum covers the header only
udph->source = htons(6666);
udph->dest = htons(8622);
udph->len = htons(8); //udp header size
udph->check = 0; //0 means "no checksum" for UDP over IPv4
return datagram;
}
int main(int argc, char **argv) {
int s, ttl, repeat;
struct sockaddr_in sin;
char *data;
printf("\n");
if (argc != 3) {
printf("usage: %s <host> <port>", argv[0]);
return __LINE__;
}
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = inet_addr(argv[1]);
sin.sin_port = htons(atoi(argv[2]));
if ((s = socket(AF_PACKET, SOCK_RAW, 0)) < 0) {
printf("Failed to create socket.\n");
return __LINE__;
}
ttl = 1, repeat = 0;
while (ttl < 2) {
data = new_packet(ttl, sin);
if (write(s, data, DATAGRAM_LEN) != DATAGRAM_LEN) {
printf("Socket failed to send packet.\n");
return __LINE__;
}
read(s, data, DATAGRAM_LEN);
free(data);
if (++repeat > 2) {
repeat = 0;
ttl++;
}
}
return 0;
}
... however at this point I have a few questions.
Is read(s, data, ... reading whole packets at a time, or do I need to parse the data read from the socket; seeking markers particular to IP packets?
What is the best way to uniquely mark my packets as they return to my box as expired?
Should I set up a second socket with the IPPROTO_ICMP flag, or is it easier to write a filter; accepting everything?
Do any other common mistakes exist; or are any common obstacles foreseeable?
Here are some suggestions (assuming it's a Linux machine).
Reading packets:
You might want to read into a 1500-byte buffer (large enough for an entire Ethernet payload). Don't worry: smaller packets are still read completely, and read returns the number of bytes actually read.
The best way to add a marker is to carry a small UDP payload (a simple unsigned int should be good enough) and increase it on every packet sent. I just ran tcpdump on traceroute, and the ICMP error does return the entire IP packet, so you can look at the returned packet, parse the UDP payload and so on. (Note that your DATAGRAM_LEN would change accordingly.) Of course you can use the ID field, but be careful: ID is mainly used for fragmentation. You should be okay here, because packets this small won't approach the fragmentation limit on any intermediate router. In general, though, it's not a good idea to 'steal' protocol fields that are meant for something else.
A cleaner way could be to use IPPROTO_ICMP on raw sockets (if manual pages are installed on your machine, see man 7 raw and man 7 icmp). You would not want to receive a copy of every packet on your device only to ignore those that are not ICMP.
If you are using type SOCK_RAW on AF_PACKET, you will have to attach a link-layer header manually, or you can use SOCK_DGRAM and check. Also see man 7 packet for a lot of subtleties.
Hope that helps, or are you looking for some actual code?
A common pitfall is that programming at this level needs very careful use of the proper include files. For instance, your program as-is won't compile on NetBSD, which is typically quite strict in following relevant standards.
Even when I add some includes, there is no struct iphdr but there is a struct udpiphdr instead.
So for now the rest of my answer is not based on trying your program in practice.
read(2) can be used to read single packets at a time. For packet-oriented protocols, such as UDP, you'll never get more data from it than a single packet.
However you can also use recvfrom(2), recv(2) or recvmsg(2) to receive the packets.
If fildes refers to a socket, read() shall be equivalent to recv()
with no flags set.
To identify the packets, I believe using the id field is typically done, as you have already. I am not sure what you mean with "mark my packets as they return to my box as expired", since your packets don't return to you. What you may get back are ICMP Time Exceeded messages. These usually arrive within a few seconds, if they arrive at all. Sometimes they are not sent, sometimes they may be blocked by misconfigured routers between you and their sender.
Note that this assumes that the IP ID you set up in your packet is respected by the network stack you're using. It is possible that it isn't, and that the stack replaces your chosen ID with a different one. Van Jacobson, the original author of the traceroute command as found in NetBSD, therefore used a different method:
* The udp port usage may appear bizarre (well, ok, it is bizarre).
* The problem is that an icmp message only contains 8 bytes of
* data from the original datagram. 8 bytes is the size of a udp
* header so, if we want to associate replies with the original
* datagram, the necessary information must be encoded into the
* udp header (the ip id could be used but there's no way to
* interlock with the kernel's assignment of ip id's and, anyway,
* it would have taken a lot more kernel hacking to allow this
* code to set the ip id). So, to allow two or more users to
* use traceroute simultaneously, we use this task's pid as the
* source port (the high bit is set to move the port number out
* of the "likely" range). To keep track of which probe is being
* replied to (so times and/or hop counts don't get confused by a
* reply that was delayed in transit), we increment the destination
* port number before each probe.
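The scheme described in that comment can be sketched roughly as follows. This is a minimal illustration, not the actual traceroute source: the function names are hypothetical, and the base port 33434 is an assumption (it is the conventional traceroute default).

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define PROBE_BASE_PORT 33434  /* assumed base port, not taken from the question */

/* Source port: low bits of the pid with the high bit forced on. */
uint16_t probe_source_port(uint32_t pid)
{
    return (uint16_t)((pid & 0x7fff) | 0x8000);
}

/* Destination port for the seq-th probe (seq starts at 1). */
uint16_t probe_dest_port(uint16_t seq)
{
    return (uint16_t)(PROBE_BASE_PORT + seq);
}

/*
 * Given the 8 bytes of UDP header quoted inside an ICMP error, return the
 * probe sequence number, or -1 if the header does not belong to this pid.
 */
int probe_seq_from_quoted_udp(const unsigned char quoted[8], uint32_t pid)
{
    uint16_t sport, dport;

    memcpy(&sport, quoted, 2);      /* UDP source port, network order */
    memcpy(&dport, quoted + 2, 2);  /* UDP destination port, network order */
    sport = ntohs(sport);
    dport = ntohs(dport);

    if (sport != probe_source_port(pid))
        return -1;                  /* some other process's probe */
    if (dport <= PROBE_BASE_PORT)
        return -1;                  /* outside the range used for probes */
    return dport - PROBE_BASE_PORT;
}
```

A sender would call probe_dest_port(++seq) before each probe, and the receiver would feed the quoted UDP header of each ICMP error to probe_seq_from_quoted_udp to find out which probe the reply belongs to.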
Using an IPPROTO_ICMP socket for receiving the replies is likely to be more efficient than trying to receive all packets. It would also require fewer privileges. Of course, sending raw packets normally already requires root, but it could make a difference if a more fine-grained permission system is in use.
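As a rough sketch of what handling such a reply could look like, the function below parses one buffer as it would be read from a raw IPPROTO_ICMP socket on Linux, where the outer IP header is included in received data. The struct and function names are hypothetical, and the code assumes the probes were UDP.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define ICMP_TYPE_TIME_EXCEEDED 11

struct quoted_probe {
    uint32_t router_addr;  /* sender of the ICMP message, network byte order */
    uint16_t udp_sport;    /* quoted UDP source port, host byte order */
    uint16_t udp_dport;    /* quoted UDP destination port, host byte order */
};

/* Returns 0 on success, -1 if buf is not a Time Exceeded quoting a UDP probe. */
int parse_time_exceeded(const unsigned char *buf, size_t len,
                        struct quoted_probe *out)
{
    size_t outer_hl, inner_hl, off;
    uint16_t port;

    if (len < 20 || (buf[0] >> 4) != 4)
        return -1;
    outer_hl = (size_t)(buf[0] & 0x0f) * 4;  /* IHL is in 32-bit words */
    if (outer_hl < 20 || len < outer_hl + 8)
        return -1;
    if (buf[9] != 1)                          /* outer protocol must be ICMP */
        return -1;
    if (buf[outer_hl] != ICMP_TYPE_TIME_EXCEEDED)
        return -1;

    off = outer_hl + 8;                       /* skip the 8-byte ICMP header */
    if (len < off + 20)
        return -1;
    inner_hl = (size_t)(buf[off] & 0x0f) * 4; /* quoted IP header length */
    if (inner_hl < 20 || len < off + inner_hl + 8)
        return -1;
    if (buf[off + 9] != 17)                   /* quoted protocol must be UDP */
        return -1;

    memcpy(&out->router_addr, buf + 12, 4);   /* outer source address */
    memcpy(&port, buf + off + inner_hl, 2);
    out->udp_sport = ntohs(port);
    memcpy(&port, buf + off + inner_hl + 2, 2);
    out->udp_dport = ntohs(port);
    return 0;
}
```

Both header lengths are taken from the IHL fields rather than assumed to be 20 bytes, since either header may carry options.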
I've written a simple source file that can read pcap files using the libpcap library in C. I can parse the packets one by one and analyze them up to a point. I want to be able to deduce whether a TCP packet I parsed is a TCP retransmission or not. After searching the web extensively, I've concluded that in order to do so, I need to track the traffic behaviour, which also means analyzing previously received packets.
What I actually want to achieve is, to do on a basic level, what the tcp.analysis.retransmission filter does in wireshark.
This is an MRE that reads a pcap file and analyzes the TCP packets sent over IPv4. The function find_retransmissions is where the packet is analyzed.
#include <pcap.h>
#include <stdio.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <net/ethernet.h>
#include <string.h>
void process_packet(u_char *,const struct pcap_pkthdr * , const u_char *);
void find_retransmissions(const u_char * , int );
int main()
{
pcap_t *handle;
char errbuff[PCAP_ERRBUF_SIZE];
handle = pcap_open_offline("smallFlows.pcap", errbuff);
pcap_loop(handle, -1, process_packet, NULL);
}
void process_packet(u_char *args,
const struct pcap_pkthdr * header,
const u_char *buffer)
{
int size = header->len;
struct ethhdr *eth = (struct ethhdr *)buffer;
if(ntohs(eth->h_proto) == ETH_P_IP) //Check if IPv4 (h_proto is in network byte order)
{
struct iphdr *iph = (struct iphdr*)(buffer +sizeof(struct ethhdr));
if(iph->protocol == 6) //Check if TCP
{
find_retransmissions(buffer,size);
}
}
}
void find_retransmissions(const u_char * Buffer, int Size)
{
static struct iphdr previous_packets[20000];
static struct tcphdr previous_tcp[20000];
static int index = 0;
static int retransmissions = 0;
int retransmission = 0;
struct sockaddr_in source,dest;
unsigned short iphdrlen;
// IP header
struct iphdr *iph = (struct iphdr *)(Buffer + sizeof(struct ethhdr));
previous_packets[index] = *iph;
iphdrlen =iph->ihl*4;
memset(&source, 0, sizeof(source));
source.sin_addr.s_addr = iph->saddr;
memset(&dest, 0, sizeof(dest));
dest.sin_addr.s_addr = iph->daddr;
// TCP header
struct tcphdr *tcph=(struct tcphdr*)(Buffer
+ iphdrlen
+ sizeof(struct ethhdr));
previous_tcp[index]=*tcph;
index++;
int header_size = sizeof(struct ethhdr) + iphdrlen + tcph->doff*4;
unsigned int segmentlength;
segmentlength = Size - header_size;
/* First check if a same TCP packet has been received */
for(int i=0;i<index-1;i++)
{
// Check if packet has been resent
unsigned short temphdrlen;
temphdrlen = previous_packets[i].ihl*4;
// First check IP header
if ((previous_packets[i].saddr == iph->saddr) // Same source IP address
&& (previous_packets[i].daddr == iph->daddr) // Same destination Ip address
&& (previous_packets[i].protocol == iph->protocol) //Same protocol
&& (temphdrlen == iphdrlen)) // Same header length
{
// Then check TCP header
if((previous_tcp[i].source == tcph->source) // Same source port
&& (previous_tcp[i].dest == tcph->dest) // Same destination port
&& (previous_tcp[i].th_seq == tcph->th_seq) // Same sequence number
&& (previous_tcp[i].th_ack==tcph->th_ack) // Same acknowledge number
&& (previous_tcp[i].th_win == tcph->th_win) // Same window
&& (previous_tcp[i].th_flags == tcph->th_flags) // Same flags
&& (tcph->syn==1 || tcph->fin==1 ||segmentlength>0)) // Check if SYN or FIN are
{ // set or if segment length > 0
// At this point the packets are almost identical
// Now Check previous communication to check for retransmission
for(int z=index-1;z>=0;z--)
{
// Find packets going to the reverse direction
if ((previous_packets[z].daddr == iph->saddr) // Swapped IP source addresses
&& (previous_packets[z].saddr ==iph->daddr) // Same for IP dest addreses
&& (previous_packets[z].protocol == iph->protocol)) // Same protocol
{
if((previous_tcp[z].dest==tcph->source) // Swapped ports
&& (previous_tcp[z].source==tcph->dest)
&& (previous_tcp[z].th_seq-1 != tcph->th_ack) // Not Keepalive
&& (tcph->syn==1 // Either SYN is set
|| tcph->fin==1 // Either FIN is set
|| (segmentlength>0)) // Either segmentlength >0
&& (previous_tcp[z].th_seq>tcph->th_seq) // Next sequence number is
// bigger than the expected
&& (previous_tcp[z].ack != 1)) // Last seen ACK is set
{
retransmission = 1;
retransmissions++;
break;
}
}
}
}
}
}
if (retransmission == 1)
{
printf("Retransmission: True\n");
printf("\n\n******************IPv4 TCP Packet*************************\n");
printf(" |-IP Version : %d\n",(unsigned int)iph->version);
printf(" |-Source IP : %s\n" , inet_ntoa(source.sin_addr) );
printf(" |-Destination IP : %s\n" , inet_ntoa(dest.sin_addr) );
printf(" |-Source Port : %u\n", ntohs(tcph->source));
printf(" |-Destination Port : %u\n", ntohs(tcph->dest));
printf(" |-Protocol : %d\n",(unsigned int)iph->protocol);
printf(" |-IP Header Length : %d DWORDS or %d Bytes\n",
(unsigned int)iph->ihl,((unsigned int)(iph->ihl))*4);
printf(" |-Payload Length : %d Bytes\n",Size - header_size);
}
printf("Total Retransmissions: %d\n",retransmissions);
}
This approach is based on the wireshark wiki paragraph about Retransmission. I literally have clicked every page google has to offer on how to approach this analysis but this was the only thing I was able to find.
The results I get are only partly correct: some retransmissions go unnoticed, I get a lot of DUP-ACK packets, and some normal traffic gets through as well (checked with wireshark). I use the smallFlows.pcap file found here, and I believe the results should match the tcp.analysis.retransmission && not tcp.analysis.spurious_retransmission filter in wireshark, which amounts to 88 retransmissions for this pcap.
Running this code yields 45 and I can't understand why.
Sorry for the messy if statements, I tried my best to clean them up.
To detect a retransmission you have to keep track of the next expected sequence number. If a segment's sequence number is lower than the next expected one, the packet is a retransmission (see the TCP Analysis chapter of the wireshark docs,
https://www.wireshark.org/docs/wsug_html_chunked/ChAdvTCPAnalysis.html )
TCP Retransmission
Set when all of the following are true:
This is not a keepalive packet.
In the forward direction, the segment length is greater than zero or the SYN or FIN flag is set.
The next expected sequence number is greater than the current sequence number
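A minimal sketch of the last two conditions, tracking the next expected sequence number for one direction of one connection. The names are hypothetical, sequence numbers are assumed to have been converted with ntohl first, and the keepalive test is left out, so this is not a reproduction of Wireshark's logic.

```c
#include <stdint.h>

struct tcp_dir_state {
    int      seen;      /* nonzero once a segment has been observed */
    uint32_t next_seq;  /* highest (seq + segment length) seen so far */
};

/* Wraparound-safe "a comes before b" for 32-bit sequence numbers. */
int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Feed one segment of a single direction, in capture order.
 * Returns 1 if it looks like a retransmission, 0 otherwise.
 */
int observe_segment(struct tcp_dir_state *st, uint32_t seq,
                    uint32_t payload_len, int syn, int fin)
{
    /* SYN and FIN each consume one sequence number. */
    uint32_t seg_len = payload_len + (syn ? 1u : 0u) + (fin ? 1u : 0u);
    uint32_t end = seq + seg_len;
    int retrans = 0;

    if (seg_len == 0)
        return 0;                    /* pure ACKs are never counted */
    if (st->seen && seq_before(seq, st->next_seq))
        retrans = 1;                 /* next expected seq > current seq */
    if (!st->seen || seq_before(st->next_seq, end))
        st->next_seq = end;          /* advance the expectation */
    st->seen = 1;
    return retrans;
}
```

One tcp_dir_state would be kept per (source address, source port, destination address, destination port) tuple, so that each direction of each connection is tracked separately.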
Besides TCP Retransmission there are also TCP Spurious Retransmission and TCP Fast Retransmission.
Basically, a retransmission is only necessary if a packet is lost.
[Figure: analyzing a lost-segment inconsistency. Source of graphic: http://www.opentextbooks.org.hk/ditatopic/3578]
To detect this type of fault in wireshark, the filter tcp.analysis.ack_lost_segment is used. Maybe try to implement this.
(https://serverfault.com/questions/626273/how-can-i-write-a-filter-to-get-tcp-sequence-number-inconsisten)
In wireshark, several filters can be applied to capture all types of inconsistencies in sequence numbers, i.e. tcp.analysis.retransmission, tcp.analysis.spurious_retransmission and tcp.analysis.fast_retransmission. For the general case of packet loss, check for tcp.analysis.ack_lost_segment.
https://superuser.com/questions/828294/how-can-i-get-the-actual-tcp-sequence-number-in-wireshark
By default Wireshark and TShark will keep track of all TCP sessions
and implement its own crude version of Sliding_Windows. This requires
some extra state information and memory to be kept by the dissector
but allows much better detection of interesting TCP events such as
retransmissions. This allows much better and more accurate
measurements of packet-loss and retransmissions than is available in
any other protocol analyzer. (But it is still not perfect)
This feature should not impact too much on the run-time memory
requirements of Wireshark but can be disabled if required.
When this feature is enabled the sliding window monitoring inside
Wireshark will detect and trigger display of interesting events for
TCP such as :
TCP Retransmission - Occurs when the sender retransmits a packet after the expiration of the acknowledgement.
TCP Fast Retransmission - Occurs when the sender retransmits a packet before the expiration of the acknowledgement timer. Senders
receive some packets which sequence number are bigger than the
acknowledged packets. Senders should Fast Retransmit upon receipt of 3
duplicate ACKs.
...
source : https://gitlab.com/wireshark/wireshark/-/wikis/TCP_Analyze_Sequence_Numbers
The concept of re-transmission is simple: data that was sent, was sent again.
In TCP, every transmitted byte has an identifier. If a TCP segment has 5 bytes in it (just a hypothetical example; in reality segments are bigger, of course), then the identifier of the first byte is the sequence number in the TCP header, +1 for the 2nd byte, ..., +4 for the 5th.
When the receiver wants to acknowledge a byte, it sends an ACK with that byte's sequence number + 1. If the receiver wants to acknowledge all 5 bytes in our example, it ACKs the 5th byte, which is seq_num + 4 + 1. In your case, you do this same calculation to get the next expected sequence number, seq_num + 4 + 1.
Then, to detect a re-transmission, you check whether the same source has sent a TCP segment with a sequence number lower than the expected seq_num + 4 + 1.
Say, instead of getting seq_num + 4 + 1 in the next transmitted TCP segment, you got seq_num. This means that this segment is a re-transmission of the previous one.
But does it mean that this TCP segment, with the re-transmission, only contains re-transmissions? No. It can contain re-transmissions from previous segment, plus extra bytes for the next segment. This is why you need to count the total bytes in the segments to tell how many of the bytes are part of the re-transmissions, and how many are part of new transmission. As you see, TCP re-transmission is not binary per segment, but can overlap across segments. Because we are really re-transmitting bytes. We just store bytes in segments for reducing TCP header's overhead.
Now, what if you got seq_num + 2? This is a bit odd because it indicates that the previous segment was only partially re-transmitted: the new segment starts at the 3rd byte. If the segment has only 3 bytes, it re-transmits the 3rd, 4th and 5th bytes (i.e. only the previous segment's bytes). But if it has, say, 10 bytes, then it covers the 3rd through 12th bytes, so the 6th through 12th bytes are new (not re-transmitted).
In my opinion, you can say that a TCP packet is a re-transmission only when it's carrying bytes with identifiers that were sent before. But as said earlier, this might not be the whole picture, as a segment could contain some bytes sent earlier plus more never sent, making it a mixture of re-transmission and new transmission.
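The byte-level view can be made concrete with a small helper that splits a segment into already-sent and new bytes. This is only a sketch under stated assumptions: the names are hypothetical, next_expected is the sequence number following the highest byte sent so far, and all values are in host byte order.

```c
#include <stdint.h>

struct seg_split {
    uint32_t retransmitted;  /* bytes whose identifiers were sent before */
    uint32_t fresh;          /* bytes that were never sent before */
};

/* Split a segment of `len` bytes starting at `seq` against next_expected. */
struct seg_split split_segment(uint32_t next_expected, uint32_t seq,
                               uint32_t len)
{
    struct seg_split s = {0, len};
    /* Positive when the segment starts before the expected position. */
    int32_t behind = (int32_t)(next_expected - seq);

    if (behind > 0) {
        s.retransmitted = (uint32_t)behind < len ? (uint32_t)behind : len;
        s.fresh = len - s.retransmitted;
    }
    return s;
}
```

For example, if a 5-byte segment with sequence number 100 was sent first, next_expected is 105. A following 10-byte segment starting at 102 then carries 3 re-transmitted bytes and 7 new ones, while a segment starting at 105 carries only new bytes.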
I need to write a program in C, using raw sockets, that runs on a proxy server between two hosts.
I've written some code for it (and set some iptables rules to change the destination address of packets to the proxy's interfaces). It receives a packet, prints the data in the packet and then sends the packet on to the receiver.
It works with my simple client/server programs on raw sockets, but when I try to establish a connection through the proxy, it doesn't work.
Do you have any ideas on how I can write this program without using the kernel?
#include <unistd.h>
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#define PCKT_LEN 8192
int main(void){
int s;
char buffer[PCKT_LEN];
struct sockaddr saddr;
struct sockaddr_in daddr;
memset(buffer, 0, PCKT_LEN);
s = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
if(s < 0){
printf("socket() error");
return -1;
}
int saddr_size = sizeof(saddr);
int header_size = sizeof(struct iphdr) + sizeof(struct tcphdr);
unsigned int count;
daddr.sin_family = AF_INET;
daddr.sin_port = htons(1234);
daddr.sin_addr.s_addr = inet_addr ("2.2.2.1");
while(1){
if(recvfrom(s, buffer, PCKT_LEN , 0, &saddr, &saddr_size) < 0){
printf("recvfrom() error");
return -1;
}
else{
int i = header_size;
for(; i < PCKT_LEN; i++)
printf("%c", buffer[i]);
if (sendto (s, buffer, PCKT_LEN, 0, &daddr, &saddr_size) < 0){
printf("sendto() error");
return -1;
}
}
}
close(s);
return 0;
}
(Your code has serious bugs. For example, the last argument to sendto(2) should not be a pointer. I'll assume it's not the real code and that the real code compiles without warnings.)
With the nagging out of the way, I think one problem is that you're accidentally including an extra IP header in the packets you send. raw(7) has the following:
The IPv4 layer generates an IP header when sending a packet unless the IP_HDRINCL socket option is enabled on the socket. When it is enabled, the packet must contain an IP header. For receiving the IP header is always included in the packet.
IP_HDRINCL is not enabled by default unless protocol is IPPROTO_RAW (see a bit further down in raw(7)), meaning it's disabled in your case. (I also checked with getsockopt(2).)
You will have to either enable IP_HDRINCL using setsockopt(2) to tell the kernel that you're supplying the header yourself, or not include the header in sendto().
It's better to look at the IHL field in the IP header than assume it has fixed size by the way. The IP header could include options.
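As an illustration of that point, a payload offset can be derived from the IHL and TCP data-offset fields instead of from sizeof. The function name is hypothetical, and the buffer is assumed to start at the IPv4 header, as it does for data received on a raw socket.

```c
#include <stddef.h>

/*
 * Returns the offset of the TCP payload within an IPv4 packet,
 * or -1 if the buffer is too short or the headers are malformed.
 */
long tcp_payload_offset(const unsigned char *pkt, size_t len)
{
    size_t ip_hl, tcp_hl;

    if (len < 20 || (pkt[0] >> 4) != 4)
        return -1;                              /* not IPv4 */
    ip_hl = (size_t)(pkt[0] & 0x0f) * 4;        /* IHL, in 32-bit words */
    if (ip_hl < 20 || len < ip_hl + 20)
        return -1;
    tcp_hl = (size_t)(pkt[ip_hl + 12] >> 4) * 4; /* TCP data offset */
    if (tcp_hl < 20 || len < ip_hl + tcp_hl)
        return -1;
    return (long)(ip_hl + tcp_hl);
}
```

With no options in either header this returns 40, the same value as sizeof(struct iphdr) + sizeof(struct tcphdr), but it stays correct when IP or TCP options are present.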
There could be other issues as well depending on what you're trying to do, and details might vary for IPv6.
Whatever you are doing I don't think using raw sockets is the way. Those are used for network debugging only.
First of all, observe that you are basically copying content from an existing, established connection rather than tunneling it. You are not doing what is proposed.
If you want to capture connections to a given server:port, for instance, 2.2.2.1:1234, into your application so that you can tunnel it through a proxy, you can use iptables.
iptables -t nat -A OUTPUT -p tcp -d 2.2.2.1 --dport 1234 -j REDIRECT
Create an application bound to ip 0.0.0.0 listening to TCP port 1234 and every connection attempt to 2.2.2.1:1234 will connect to your application instead, and you can do whatever you please with it.
I am working with an embedded box that must be able to communicate with traditional computers using UDP. When the box sends large UDP messages (that need to be fragmented), a UDP header is included for each fragment. Thus if I want to a send a large datagram, it will be fragmented like this:
[eth hdr][ip hdr][udp hdr][ data 1 ] /* first fragment */
[eth hdr][ip hdr][udp hdr][ data 2 ] /* second fragment */
[eth hdr][ip hdr][udp hdr][ data 3 ] /* last fragment */
I understand that this is not customary, as usually the UDP header would be included in only the first IP packet of the fragmented message. However, this works perfectly for communicating with the other machines I need to talk to (e.g. using recvfrom), so I have no reason to dig in and try to change it.
My issue, however, is in reading messages. The box seems to expect fragmented udp datagrams to be sent to it in the same manner. By this I mean that it expects every ipv4 fragment to have a udp header. Before trying to change this (it's a rather specialized and complicated platform) I would like to know if there is any way to configure sendto() or any other such function for sending udp messages in this format. I see when monitoring the traffic that those udp headers aren't present.
Thank you very much for the help.
No. Sockets don't work this way. Just write your own sendto wrapper that manually splits the data across multiple UDP packets on whatever buffer-size boundary you choose. This will achieve the effect you want.
Sample code as follows:
ssize_t fragmented_sendto(int sockfd, const void *buf, size_t len, int flags,
const struct sockaddr *dest_addr, socklen_t addrlen, size_t MAX_PACKET_SIZE)
{
unsigned char* ptr = (unsigned char*) buf;
size_t total = 0;
while (total < len) // '<' rather than '<=': stop once everything is sent
{
size_t newsize = len - total;
if (newsize > MAX_PACKET_SIZE)
{
newsize = MAX_PACKET_SIZE;
}
ssize_t result = sendto(sockfd, ptr, newsize, flags, dest_addr, addrlen);
if (result < 0)
{
// handle error
return -1;
}
else
{
total += result;
ptr += result;
}
}
return (ssize_t)total;
}
I am using domain sockets (AF_UNIX) to communicate between two threads for inter process communication. This is chosen to work well with libev: I use it on the recv end of the domain socket. This works very well except that the data I am sending is constant 4864 bytes. I cannot afford to get this data fragmented. I always thought domain sockets won't fragment data, but as it turns out it does. When the communication is at its peak between the threads, I observe the following
Thread 1:
SEND = 4864 actual size = 4864
Thread 2:
READ = 3328 actual size = 4864
Thread 1:
SEND = 4864 actual size = 4864
Thread 2:
READ = 1536 actual size = 4864
As you can see, thread 2 received the data in fragments (3328 + 1536). This is really bad for my application. Is there any way to keep the data from being fragmented? I understand that IP_DONTFRAG can only be set for the AF_INET family? Can someone suggest an alternative?
Update: sendto code
ssize_t
socket_domain_writer_dgram_send(int *domain_sd, domain_packet_t *pkt) {
struct sockaddr_un remote;
unsigned long len = 0;
ssize_t ret = 0;
memset(&remote, '\0', sizeof(struct sockaddr_un));
remote.sun_family = AF_UNIX;
strncpy(remote.sun_path, DOMAIN_SOCK_PATH, strlen(DOMAIN_SOCK_PATH));
len = strlen(remote.sun_path) + sizeof(remote.sun_family) + 1;
ret = sendto(*domain_sd, pkt, sizeof(*pkt), 0, (struct sockaddr *)&remote, sizeof(struct sockaddr_un));
if (ret == -1) {
bps_log(BPS_LOGGER_RD, ASL_LEVEL_ERR, "Domain writer could not connect send packets", errno);
}
return ret;
}
SOCK_STREAM by definition doesn't preserve message boundaries. Try again with SOCK_DGRAM or SOCK_SEQPACKET:
http://man7.org/linux/man-pages/man7/unix.7.html
On the other hand, consider that you may be passing messages larger than your architecture page size. For example, for amd64, a memory page is 4K. If that's a problem for any reason it might make sense to split the packets in 2.
Note, however, that it's not a real issue for the packets to arrive fragmented. It's common to have a packet assembler on the receiving end of the socket. What's wrong with implementing it?
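Such an assembler can be quite small. The sketch below assumes a blocking descriptor and a fixed packet size; read_full is a hypothetical helper name. With an event loop such as libev on a non-blocking socket, the same accumulation would instead be spread across read callbacks.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Keep reading until exactly len bytes have arrived.
 * Returns len on success, a smaller count if EOF came first, -1 on error.
 */
ssize_t read_full(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t got = 0;

    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);
        if (n == 0)
            break;                 /* peer closed the stream */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }
        got += (size_t)n;          /* partial read: keep accumulating */
    }
    return (ssize_t)got;
}
```

Calling read_full(fd, &pkt, 4864) on the receiving side would then hand back whole packets even when the kernel delivers them as 3328 + 1536 bytes.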
4864 + 3328 = 8192. My guess is that you're transmitting two 4864-byte packets back to back in some cases, and it's filling an 8 KB kernel buffer somewhere. IP_DONTFRAG isn't applicable because IP is not involved here — the "fragmentation" you're seeing is happening via a completely different mechanism.
If all the data you're transmitting consists of packets, you would do well to use a datagram socket (SOCK_DGRAM) instead of a stream. This should make the send() block when the kernel buffer doesn't have sufficient space to store an entire packet, rather than allowing a partial write through, and will make each recv() return exactly one packet, so you don't need to deal with framing.
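A small self-contained check of that behaviour, using socketpair so that no filesystem path is needed. The function name and the PKT_SIZE constant are illustrative only. Datagram size limits for AF_UNIX are platform-dependent, so if send fails with EMSGSIZE the send buffer may need to be raised with SO_SNDBUF.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define PKT_SIZE 4864

/*
 * Sends `count` fixed-size packets over an AF_UNIX SOCK_DGRAM pair and
 * records the size returned by each recv() in sizes[]. Keep count small:
 * everything is queued before anything is read.
 * Returns 0 on success, -1 on failure.
 */
int dgram_boundary_demo(int count, ssize_t *sizes)
{
    int sv[2], i, rc = 0;
    unsigned char pkt[PKT_SIZE], in[2 * PKT_SIZE];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    memset(pkt, 0x5a, sizeof pkt);

    for (i = 0; i < count && rc == 0; i++)
        if (send(sv[0], pkt, sizeof pkt, 0) != (ssize_t)sizeof pkt)
            rc = -1;

    /* The receive buffer is larger than one packet on purpose: each
       recv() still returns exactly one datagram. */
    for (i = 0; i < count && rc == 0; i++) {
        sizes[i] = recv(sv[1], in, sizeof in, 0);
        if (sizes[i] < 0)
            rc = -1;
    }

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

With two packets sent back to back, each recv is expected to report 4864 bytes rather than a 3328/1536 split.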
I have a small function that tries to print the fragment offset of an IP header.
void ParseIpHeader(unsigned char *packet, int len)
{
struct ethhdr *ethernet_header;
struct iphdr *ip_header;
/* First Check if the packet contains an IP header using
the Ethernet header */
ethernet_header = (struct ethhdr *)packet;
if(ntohs(ethernet_header->h_proto) == ETH_P_IP)
{
/* The IP header is after the Ethernet header */
if(len >= (sizeof(struct ethhdr) + sizeof(struct iphdr)))
{
ip_header = (struct iphdr*)(packet + sizeof(struct ethhdr));
/* print the Source and Destination IP address */
//printf("Dest IP address: %s\n", inet_ntoa(ip_header->daddr));
//printf("Source IP address: %s\n", inet_ntoa(ip_header->saddr));
printf("protocol %d\n", ip_header->protocol);
printf("Fragment off is %d\n", ntohs(ip_header->frag_off));
}
}
}
My packets are TCP (ip_header->protocol is always 6). The problem is that frag_off
is always 16384. I am sending a lot of data, so why is frag_off always constant?
Thanks.
Fragment offset is shared with flags. You have the "DF" (don't fragment) bit set.
Which gives you 16384 for the entire 16-bit field, given the fragment offset of 0.
Take a look at the http://www.ietf.org/rfc/rfc791.txt, starting from page 10.
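To make the bit layout concrete, here is a small sketch that splits the 16-bit field into its flags and offset. The struct and function names are made up for illustration, and the input is assumed to have been converted with ntohs already.

```c
#include <stdint.h>

#define IPV4_FLAG_DF  0x4000  /* don't fragment */
#define IPV4_FLAG_MF  0x2000  /* more fragments */
#define IPV4_OFF_MASK 0x1fff  /* 13-bit offset, in units of 8 bytes */

struct frag_info {
    int      dont_fragment;
    int      more_fragments;
    unsigned offset_bytes;
};

/* Decode the flags/fragment-offset field (host byte order). */
struct frag_info decode_frag_field(uint16_t field)
{
    struct frag_info fi;

    fi.dont_fragment  = (field & IPV4_FLAG_DF) != 0;
    fi.more_fragments = (field & IPV4_FLAG_MF) != 0;
    fi.offset_bytes   = (unsigned)(field & IPV4_OFF_MASK) * 8;
    return fi;
}
```

Decoding 16384 (0x4000) this way gives DF set, MF clear and an offset of 0, which is exactly the constant value being printed.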
EDIT:
The DF bit in the TCP segments that you are receiving is set by the remote side, to perform the Path MTU discovery - in a nutshell, to try to avoid the fragmentation.
In this case the sending side learns the biggest MTU that the overall path can handle, and chops the TCP segments such that they do not exceed it after encapsulation into IP.
EDIT2:
regarding the use of recvfrom() and TCP: TCP is a connection-oriented protocol, and all of the segmentation/fragmentation details are already handled by it (fragmentation is obviously handled by the lower layer, IP) - so you do not need to deal with it. Anything you write() on the sending side will be eventually read() on the other side - possibly not in the same chunks though - i.e. two 4K writes may result in a single 8K read sometimes, and sometimes in two 4K reads - depending on the behaviour of the media inbetween concerning reordering/losses.
IP Fragmentation and reassembly is handled transparently by the operating system, so you do not need to worry about it, same as about packets out of order, etc. (you will just see the decreased performance as the effect on the application).
One good read I could recommend is this one: UNIX Network Programming. Given Stevens's involvement with TCP, it's a good book no matter which OS you use.
EDIT3:
And if you are doing something to be a "man in the middle" (assuming you have good and legitimate reasons for doing so :-) - then you can assess the upcoming work by looking at the prior art: chaosreader (one-script approach that works on pcap files, but adaptable to something else), or LibNIDS - that does emulate the IP defragmentation and the TCP stream reassembly; and maybe just reuse them for your purposes.