I'm writing a simple program that creates an Ethernet II frame and sends it through an interface to a specified MAC address.
As I have read, the process for connecting to a socket in UNIX goes a bit like:
int sockfd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
struct sockaddr_ll sll;
/* populate sll with the target and interface info */
connect(sockfd, (struct sockaddr*)&sll, sizeof(sll));
write(sockfd, stuff, sizeof(stuff));
close(sockfd);
The thing is, for me, stuff is a valid eth frame already containing everything needed to send a packet to its destination. Isn't the connect step redundant then? What am I missing?
Have a nice day.
Not only is the connect "redundant", it is an error -- according to the Linux man page:
The connect(2) operation is not supported on packet sockets.
So the connect is probably failing without doing anything. Since you ignore the return value of connect, you don't notice the failure.
As stated above, the connection step was wrong.
I will give the details of how I solved it in this post in case anyone in need sees this (this is as I understood it, feel free to correct me):
For truly raw communication in userspace you have to understand three concepts:
Sockets are analogous to file descriptors.
Binding a socket is like opening a file.
You can not read or write to a socket, just kindly ask the kernel to do it for you.
The process I followed is as follows:
int sockfd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
struct sockaddr_ll sll;
memset(&sll, 0, sizeof(sll)); //Clear the fields we do not set explicitly
sll.sll_family = AF_PACKET;
sll.sll_ifindex = index; //This is the index of your network card
//Can be obtained through ioctl with SIOCGIFINDEX
sll.sll_protocol = htons(ETH_P_ALL);
bind(sockfd, (struct sockaddr*)&sll, sizeof(sll));
ssize_t send_len = write(sockfd, data, size); //write() returns -1 on error
As you can see, we don't really use connect, as it was, indeed, a mistake.
p.s. for a full example: https://github.com/TretornESP/RAWRP
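The address setup in the steps above can be isolated into a small helper. This is only a sketch under a few assumptions: it is Linux-specific, the helper name packet_addr_for is made up for illustration, and it looks the interface index up with if_nametoindex() instead of the SIOCGIFINDEX ioctl mentioned above.

```c
#include <arpa/inet.h>        /* htons */
#include <linux/if_packet.h>  /* struct sockaddr_ll */
#include <net/ethernet.h>     /* ETH_P_ALL */
#include <net/if.h>           /* if_nametoindex */
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill *sll so that bind() attaches a packet socket
 * to the interface called ifname.
 * Returns 0 on success, -1 if no such interface exists. */
int packet_addr_for(const char *ifname, struct sockaddr_ll *sll)
{
    unsigned int index = if_nametoindex(ifname);  /* 0 means "not found" */
    if (index == 0)
        return -1;

    memset(sll, 0, sizeof *sll);          /* clear the fields we do not set */
    sll->sll_family   = AF_PACKET;
    sll->sll_ifindex  = (int)index;
    sll->sll_protocol = htons(ETH_P_ALL);
    return 0;
}
```

The result would then be passed to bind(sockfd, (struct sockaddr *)&sll, sizeof sll) before calling write(). Opening the packet socket itself normally needs root or CAP_NET_RAW, so that part is left out here.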
According to Wikipedia's description of the traceroute program:
Traceroute, by default, sends a sequence of User Datagram Protocol
(UDP) packets addressed to a destination host[...] The time-to-live
(TTL) value, also known as hop limit, is used in determining the
intermediate routers being traversed towards the destination. Routers
decrement packets' TTL value by 1 when routing and discard packets
whose TTL value has reached zero, returning the ICMP error message
ICMP Time Exceeded.[..]
I started writing a program (using an example UDP program as a guide) to adhere to this specification,
#include <sys/socket.h>
#include <assert.h>
#include <netinet/udp.h> //Provides declarations for udp header
#include <netinet/ip.h> //Provides declarations for ip header
#include <stdio.h>
#include <stdlib.h> //malloc, free, atoi
#include <string.h>
#include <arpa/inet.h>
#include <unistd.h>
#define DATAGRAM_LEN (sizeof(struct iphdr) + sizeof(struct udphdr))
unsigned short csum(unsigned short *ptr,int nbytes) {
register long sum;
unsigned short oddbyte;
register short answer;
sum=0;
while(nbytes>1) {
sum+=*ptr++;
nbytes-=2;
}
if(nbytes==1) {
oddbyte=0;
*((u_char*)&oddbyte)=*(u_char*)ptr;
sum+=oddbyte;
}
sum = (sum>>16)+(sum & 0xffff);
sum = sum + (sum>>16);
answer=(short)~sum;
return(answer);
}
char *new_packet(int ttl, struct sockaddr_in sin) {
static int id = 0;
char *datagram = calloc(1, DATAGRAM_LEN); //zeroed, so unset fields are 0
if (datagram == NULL) return NULL;
struct iphdr *iph = (struct iphdr*) datagram;
struct udphdr *udph = (struct udphdr*)(datagram + sizeof (struct iphdr));
iph->ihl = 5;
iph->version = 4;
iph->tos = 0;
iph->tot_len = htons(DATAGRAM_LEN); //network byte order
iph->id = htons(++id); //Id of this packet (16-bit field)
iph->frag_off = 0;
iph->ttl = ttl;
iph->protocol = IPPROTO_UDP;
iph->saddr = inet_addr("127.0.0.1");//Spoof the source ip address
iph->daddr = sin.sin_addr.s_addr;
iph->check = 0; //must be zero while the checksum is computed
iph->check = csum((unsigned short*)datagram, sizeof(struct iphdr)); //IP header only
udph->source = htons(6666);
udph->dest = htons(8622);
udph->len = htons(8); //udp header size
udph->check = 0; //0 means "no checksum" for UDP over IPv4
return datagram;
}
int main(int argc, char **argv) {
int s, ttl, repeat;
struct sockaddr_in sin;
char *data;
printf("\n");
if (argc != 3) {
printf("usage: %s <host> <port>", argv[0]);
return __LINE__;
}
sin.sin_family = AF_INET;
sin.sin_addr.s_addr = inet_addr(argv[1]);
sin.sin_port = htons(atoi(argv[2]));
if ((s = socket(AF_PACKET, SOCK_RAW, 0)) < 0) {
printf("Failed to create socket.\n");
return __LINE__;
}
ttl = 1, repeat = 0;
while (ttl < 2) {
data = new_packet(ttl, sin);
if (write(s, data, DATAGRAM_LEN) != DATAGRAM_LEN) {
printf("Socket failed to send packet.\n");
return __LINE__;
}
read(s, data, DATAGRAM_LEN);
free(data);
if (++repeat > 2) {
repeat = 0;
ttl++;
}
}
return 0;
}
... however at this point I have a few questions.
Is read(s, data, ... reading whole packets at a time, or do I need to parse the data read from the socket; seeking markers particular to IP packets?
What is the best way to uniquely mark my packets as they return to my box as expired?
Should I set up a second socket with the IPPROTO_ICMP flag, or is it easier to write a filter; accepting everything?
Do any other common mistakes exist; or are any common obstacles foreseeable?
Here are some of my suggestions (based on the assumption that it's a Linux machine).
Reading packets
You might want to read whole 1500-byte packets (an entire Ethernet frame). Don't worry: smaller frames are still read completely, with read returning the length of the data read.
The best way to add a marker is some UDP payload; a simple unsigned int, increased on every packet sent, should be good enough. (I just did a tcpdump on traceroute: the ICMP error does return an entire IP frame, so you can look at the returned IP frame, parse the UDP payload and so on. Note that your DATAGRAM_LEN would change accordingly.) Of course you can use the ID field, but be careful: ID is mainly used by fragmentation. You should be okay with that, because you will not approach the fragmentation limit of any intermediate router with these packet sizes. Generally, though, it is not a good idea to 'steal' protocol fields that are meant for something else for a custom purpose.
A cleaner way could be to actually use IPPROTO_ICMP on raw sockets (if manuals are installed on your machine, see man 7 raw and man 7 icmp). You would not want to receive a copy of every packet on your device and ignore those that are not ICMP.
If you are using type SOCK_RAW on AF_PACKET, you will have to attach a link-layer header manually, or you can use SOCK_DGRAM and check. Also see man 7 packet for a lot of subtleties.
Hope that helps, or are you looking for some actual code?
A common pitfall is that programming at this level needs very careful use of the proper include files. For instance, your program as-is won't compile on NetBSD, which is typically quite strict in following relevant standards.
Even when I add some includes, there is no struct iphdr but there is a struct udpiphdr instead.
So for now the rest of my answer is not based on trying your program in practice.
read(2) can be used to read single packets at a time. For packet-oriented protocols, such as UDP, you'll never get more data from it than a single packet.
However you can also use recvfrom(2), recv(2) or recvmsg(2) to receive the packets.
If fildes refers to a socket, read() shall be equivalent to recv()
with no flags set.
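The datagram-at-a-time behaviour of read() can be checked with plain UDP sockets on the loopback interface, which needs no special privileges. A minimal sketch; the function name read_two_datagrams and the payloads are invented for illustration, and the 2-second receive timeout is only there so the demo cannot block forever.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Sends a 3-byte and a 5-byte datagram over loopback and stores what each
 * read() call returns in sizes[]. Returns 0 on success, -1 on setup error. */
int read_two_datagrams(ssize_t sizes[2])
{
    int rc = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    struct timeval tv = { 2, 0 };   /* receive timeout: 2 seconds */
    char buf[64];                   /* larger than both datagrams together */

    if (rx < 0 || tx < 0)
        goto done;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0 ||
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) < 0)
        goto done;
    if (sendto(tx, "abc", 3, 0, (struct sockaddr *)&addr, sizeof addr) != 3 ||
        sendto(tx, "defgh", 5, 0, (struct sockaddr *)&addr, sizeof addr) != 5)
        goto done;

    /* Each read() returns exactly one datagram, never both merged. */
    sizes[0] = read(rx, buf, sizeof buf);
    sizes[1] = read(rx, buf, sizeof buf);
    rc = 0;
done:
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return rc;
}
```

Although the buffer has room for all 8 bytes, the two reads report 3 and 5 bytes, one datagram each.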
To identify the packets, I believe using the id field is typically done, as you have already. I am not sure what you mean by "mark my packets as they return to my box as expired", since your packets don't return to you. What you may get back are ICMP Time Exceeded messages. These usually arrive within a few seconds, if they arrive at all. Sometimes they are not sent, sometimes they may be blocked by misconfigured routers between you and their sender.
Note that this assumes that the IP ID you set up in your packet is respected by the network stack you're using. It is possible that it isn't, and that the stack replaces your chosen ID with a different one. Van Jacobson, the original author of the traceroute command as found in NetBSD, therefore used a different method:
* The udp port usage may appear bizarre (well, ok, it is bizarre).
* The problem is that an icmp message only contains 8 bytes of
* data from the original datagram. 8 bytes is the size of a udp
* header so, if we want to associate replies with the original
* datagram, the necessary information must be encoded into the
* udp header (the ip id could be used but there's no way to
* interlock with the kernel's assignment of ip id's and, anyway,
* it would have taken a lot more kernel hacking to allow this
* code to set the ip id). So, to allow two or more users to
* use traceroute simultaneously, we use this task's pid as the
* source port (the high bit is set to move the port number out
* of the "likely" range). To keep track of which probe is being
* replied to (so times and/or hop counts don't get confused by a
* reply that was delayed in transit), we increment the destination
* port number before each probe.
Using an IPPROTO_ICMP socket for receiving the replies is more likely to be efficient than trying to receive all packets. It would also require fewer privileges to do so. Of course sending raw packets normally already requires root, but it could make a difference if a more fine-grained permission system is in use.
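The port-based matching described in the quoted comment could look roughly like the sketch below. It assumes the buffer starts at the ICMP header; on Linux, a raw IPPROTO_ICMP socket delivers the outer IP header first, so the caller would have to skip that before calling it. The function name and constants are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define ICMP_TYPE_TIME_EXCEEDED 11
#define ICMP_HEADER_LEN          8
#define MIN_IP_HEADER_LEN       20
#define UDP_HEADER_LEN           8
#define PROTO_UDP               17

/* Given an ICMP Time Exceeded message, return the destination port of the
 * UDP probe quoted inside it (host byte order), or -1 if the message is
 * not a reply to a UDP probe or is too short. */
int probe_port_from_icmp(const uint8_t *icmp, size_t len)
{
    if (len < ICMP_HEADER_LEN + MIN_IP_HEADER_LEN + UDP_HEADER_LEN)
        return -1;
    if (icmp[0] != ICMP_TYPE_TIME_EXCEEDED)
        return -1;

    /* The quoted original datagram starts right after the ICMP header. */
    const uint8_t *ip = icmp + ICMP_HEADER_LEN;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;   /* header length in bytes */
    if ((ip[0] >> 4) != 4 || ihl < MIN_IP_HEADER_LEN)
        return -1;
    if (ip[9] != PROTO_UDP)                    /* protocol field */
        return -1;
    if (len < ICMP_HEADER_LEN + ihl + UDP_HEADER_LEN)
        return -1;

    /* UDP header layout: source port (2 bytes), destination port (2 bytes). */
    const uint8_t *udp = ip + ihl;
    return (udp[2] << 8) | udp[3];
}
```

Because the destination port is incremented before each probe, the returned value identifies which probe a reply belongs to, using only the 8 bytes of the original datagram that every ICMP error is guaranteed to quote.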
I need to get the local port used by a (client) socket.
It was my understanding that Windows Sockets performs an implicit bind function call, therefore getsockname() after sendto() should provide the assigned port. However, it always sets 0 as the port number. Am I missing something?
ex:
if (sendto(sockfd, ...) != SOCKET_ERROR)
printf("Sent\n");
if (getsockname(sockfd, (struct sockaddr*)&sin, &sinlen) != SOCKET_ERROR)
printf("port = %u\n", ntohs(sin.sin_port));
else
printf("Error");
//result: Sent, port = 0
Problem solved with a restart of the computer. Still unknown as to the actual cause, but at this point I'm just happy it's working.
If anyone has an idea for fixing the issue without a restart (for future readers), feel free to post.
The only ambiguity I can see in your example code is the size you assigned to sinlen before the call (you do not show it). With Winsock it should be defined and initialised as int sinlen = sizeof(sin);
I used this code on my system, and it returns a non-zero value for the port I am connecting through:
struct sockaddr_in sin;
int len = sizeof(sin);
if (getsockname(sock, (struct sockaddr *)&sin, &len) == -1) {
//handle error
} else {
printf("port number %d\n", ntohs(sin.sin_port));
}
By the way, the ntohs function returns the value in host byte order. If [ sin.sin_port ] is already in host byte order, then this function will reverse it. It is up to [your] application to determine if the byte order must be reversed. [text in brackets is my emphasis]
In answer to comment question ( getsockname() ):
The function prototype for getsockname():
int getsockname(
_In_ SOCKET s,
_Out_ struct sockaddr *name,
_Inout_ int *namelen //int, not socklen_t
);
For more discussion on socklen_t
Edit (address possible approach to re-setting sockets without rebooting PC.)
If winsock API calls cease to work predictably, you can re-start sockets without rebooting the PC by using WSAStartup and WSACleanup (see code example at bottom of link for WSAStartup)
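The implicit bind the question relies on can be checked in isolation. The sketch below uses POSIX sockets rather than Winsock (so socklen_t instead of int, and no WSAStartup), and the function name and the destination 127.0.0.1:9 are arbitrary choices made for the illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Sends one UDP datagram from an unbound socket and returns the local port
 * the stack assigned to it (host byte order), or -1 on error. */
int implicit_bind_port(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9);                     /* arbitrary destination */
    peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    struct sockaddr_in local;
    socklen_t len = sizeof local;   /* must hold the buffer size before the call */
    int port = -1;

    /* sendto() on an unbound socket makes the stack pick a local port;
     * getsockname() on the same descriptor then reports it. */
    if (sendto(fd, "x", 1, 0, (struct sockaddr *)&peer, sizeof peer) == 1 &&
        getsockname(fd, (struct sockaddr *)&local, &len) == 0)
        port = ntohs(local.sin_port);

    close(fd);
    return port;
}
```

If the length argument were left uninitialised, getsockname() could fail or truncate the address, which is one way to end up with a meaningless port value.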
You say you want to know the LOCAL port, but your line
sendto(sockfd, ...)
implies sockfd is the REMOTE descriptor. Your later code may therefore give you info about the REMOTE port, not the LOCAL one. 'sockets' are not both ends, meaning one connection. A socket is one end, meaning the IP and port number of one end of the connection. The first parameter of your getsockname() is not a reference or a pointer, it is therefore not an output from the function, but an input. You're telling the function to use the same socket descriptor that you just sent to, ie. the remote one.
Formatting error. ntohs() returns unsigned short so the format should be %hu, not %u or %d. If you grab too many bytes they are not the port.
Answer. After using sendto() try using gethostname() then getaddrinfo() on the name that comes back. Note: the addrinfo structures you get back will give you struct sockaddr pointers which you will need to re-cast to struct sockaddr_in pointers to access the local port number.
To find the local port number the kernel dreamed up when you issued a sendto() function perhaps you could write a routine to parse the output from the (gnu linux) commands 'ss' or 'netstat'. (Not sure if these are POSIX compatible.) Or maybe you could access /proc/net if you have the privilege.
I am using blocking TCP sockets for my client and server. Whenever I read, I first check whether data is available on the stream using select. I always read and write 40 bytes at a time. While most reads take a few milliseconds or less, some take more than half a second, even though I already know that data is available on the socket.
I am also using TCP_NODELAY
What could be causing it?
EDIT 2
I analyzed the timestamp for each packet sent and received and saw that this delay happens only when client tries to read the object before the next object is written by the server. For instance, the server wrote object number x and after that the client tried to read object x, before the server was able to begin writing object number x+1. This makes me suspect that some kind of coalescing is taking place on the server side.
EDIT
The server is listening on 3 different ports. The client connects one by one to each of these ports.
There are three connections : One that sends some data frequently from the server to the client. A second one that only sends data from the client to the server. And a third one that is used very rarely to send single byte of data. I am facing the problem with the first connection. I am checking using select() that data is available on that connection and then when I timestamp the 40 byte read, I find that about half a second was taken for that read.
Any pointers as to how to profile this would be very helpful
using gcc on linux.
rdrr_server_start(void)
{
int rr_sd;
int input_sd;
int ack_sd;
int fp_sd;
startTcpServer(&rr_sd, remote_rr_port);
startTcpServer(&input_sd, remote_input_port);
startTcpServer(&ack_sd, remote_ack_port);
startTcpServer(&fp_sd, remote_fp_port);
connFD_rr = getTcpConnection(rr_sd);
connFD_input = getTcpConnection(input_sd);
connFD_ack= getTcpConnection(ack_sd);
connFD_fp=getTcpConnection(fp_sd);
}
static int getTcpConnection(int sd)
{
socklen_t len;
struct sockaddr_in clientAddress;
len = sizeof(clientAddress);
int connFD = accept(sd, (struct sockaddr*) &clientAddress, &len);
nodelay(connFD);
fflush(stdout);
return connFD;
}
static void
startTcpServer(int *sd, const int port)
{
*sd= socket(AF_INET, SOCK_STREAM, 0);
ASSERT(*sd>0);
// Set socket option so that port can be reused
int enable = 1;
setsockopt(*sd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int));
struct sockaddr_in a;
memset(&a,0,sizeof(a));
a.sin_family = AF_INET;
a.sin_port = port;
a.sin_addr.s_addr = INADDR_ANY;
int bindResult = bind(*sd, (struct sockaddr *) &a, sizeof(a));
ASSERT(bindResult ==0);
listen(*sd,2);
}
static void nodelay(int fd) {
int flag=1;
ASSERT(setsockopt(fd, SOL_TCP, TCP_NODELAY, &flag, sizeof flag)==0);
}
startTcpClient() {
connFD_rr = socket(AF_INET, SOCK_STREAM, 0);
connFD_input = socket(AF_INET, SOCK_STREAM, 0);
connFD_ack = socket(AF_INET, SOCK_STREAM, 0);
connFD_fp= socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in a;
memset(&a,0,sizeof(a));
a.sin_family = AF_INET;
a.sin_port = remote_rr_port;
a.sin_addr.s_addr = inet_addr(remote_server_ip);
int CONNECT_TO_SERVER= connect(connFD_rr, &a, sizeof(a));
ASSERT(CONNECT_TO_SERVER==0) ;
a.sin_port = remote_input_port;
CONNECT_TO_SERVER= connect(connFD_input, &a, sizeof(a));
ASSERT(CONNECT_TO_SERVER==0) ;
a.sin_port = remote_ack_port;
CONNECT_TO_SERVER= connect(connFD_ack, &a, sizeof(a));
ASSERT(CONNECT_TO_SERVER==0) ;
a.sin_port = remote_fp_port;
CONNECT_TO_SERVER= connect(connFD_fp, &a, sizeof(a));
ASSERT(CONNECT_TO_SERVER==0) ;
nodelay(connFD_rr);
nodelay(connFD_input);
nodelay(connFD_ack);
nodelay(connFD_fp);
}
I would be suspicious of this line of code:
ASSERT(setsockopt(fd, SOL_TCP, TCP_NODELAY, &flag, sizeof flag)==0);
If you are running a release build, then ASSERT is most likely defined to nothing, so the call would not actually be made. The setsockopt call should not be inside the ASSERT statement. Instead, store the return value in a variable and verify that variable in the assert. Asserts with side effects are generally a bad thing, so even if this is not the problem, it should probably be changed.
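A side-effect-free version of the helper might look like this. It is a sketch that uses the standard assert() in place of the project's ASSERT macro (whose definition is not shown), and nodelay_is_set is an extra, made-up helper for checking the result.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on fd. The setsockopt() call happens
 * unconditionally; only the check disappears when NDEBUG is defined. */
void set_nodelay(int fd)
{
    int flag = 1;
    int rc = setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof flag);
    assert(rc == 0);
    (void)rc;   /* avoids an "unused variable" warning in release builds */
}

/* Returns 1 if TCP_NODELAY is enabled on fd, 0 if not, -1 on error. */
int nodelay_is_set(int fd)
{
    int flag = 0;
    socklen_t len = sizeof flag;
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, &len) < 0)
        return -1;
    return flag != 0;
}
```

Calling nodelay_is_set() after set_nodelay() in both debug and release builds is a quick way to confirm whether the option was really applied.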
One client and multiple connections?
Some of the socket functions might be blocking your execution (i.e. waiting for the result of a call). I would suggest opening a new thread (on the server side) for each connection so they won't interfere with each other...
But I'm shooting in the dark; you'll need to send some additional info...
Your statement is still confusing i.e. "multiple tcp connections with only one client". Obviously you have a single server listening on one port. Now if you have multiple connections this means there is more than one client connecting to the server each connected on a different tcp client port. Now server runs select and responds to whichever client has data (meaning client sent some data on his socket). Now if two clients send data simultaneously, server can only process them sequentially. So second client won't get processed until server is done processing with first.
Select only allows the server to monitor more than one descriptor (socket) and to process whichever has data available. It does not do the processing in parallel; you need multiple threads or processes for that.
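What select() does and does not do can be seen in a small sketch: it reports which descriptor is ready, and the caller then handles that descriptor on its own, one at a time. The helper name first_readable is invented here, and it works on any descriptors (pipes, sockets), not just TCP connections.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Returns the index of the first descriptor in fds[0..n-1] that has data
 * to read, or -1 if none becomes readable within timeout_ms (or on error). */
int first_readable(const int *fds, int n, int timeout_ms)
{
    fd_set set;
    int maxfd = -1;

    FD_ZERO(&set);
    for (int i = 0; i < n; i++) {
        FD_SET(fds[i], &set);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() only tells us which descriptors are ready; it reads nothing. */
    if (select(maxfd + 1, &set, NULL, NULL, &tv) <= 0)
        return -1;

    for (int i = 0; i < n; i++)
        if (FD_ISSET(fds[i], &set))
            return i;
    return -1;
}
```

Any work done on the returned descriptor runs sequentially in the calling thread, which is the point the answer above makes.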
Maybe it is something related to the timeout argument.
What do you set for timeout argument of select call?
Try changing the timeout argument to a bigger value and observe the latency. Sometimes a timeout that is too small, combined with very frequent system calls, can actually kill throughput. Maybe you can achieve better results if you accept a slightly bigger latency that is realizable.
I suspect timeout or some code bug.
You may try using TCP_CORK (CORK'ed mode) with kernel extensions GRO, GSO and TSO disabled by ethtool:
sending inside TCP_CORK flagged session will ensure that the data will not be sent in partial segment
disabling generic-segmentation-offload, generic-receive-offload and tcp-segmentation-offload will ensure that kernel will not introduce artificial delays to collect additional tcp segments before moving data to/from userspace
I have an application that is receiving data from multiple multicast sources on the same port. I am able to receive the data. However, I am trying to account for statistics of each group (i.e. msgs received, bytes received) and all the data is getting mixed up. Does anyone know how to solve this problem? If I try to look at the sender's address, it is not the multicast address, but rather the IP of the sending machine.
I am using the following socket options:
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr("224.1.2.3");
mreq.imr_interface.s_addr = INADDR_ANY;
setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
and also:
setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &reuse, sizeof(reuse));
After some years facing this strange Linux behaviour, and using the bind workaround described in previous answers, I realized that the ip(7) manpage describes a possible solution:
IP_MULTICAST_ALL (since Linux 2.6.31)
This option can be used to modify the delivery policy of
multicast messages to sockets bound to the wildcard INADDR_ANY
address. The argument is a boolean integer (defaults to 1).
If set to 1, the socket will receive messages from all the
groups that have been joined globally on the whole system.
Otherwise, it will deliver messages only from the groups that
have been explicitly joined (for example via the
IP_ADD_MEMBERSHIP option) on this particular socket.
Then you can activate the filter, so that the socket receives only the messages of the groups it has joined, using:
int mc_all = 0;
if ((setsockopt(sock, IPPROTO_IP, IP_MULTICAST_ALL, (void*) &mc_all, sizeof(mc_all))) < 0) {
perror("setsockopt() failed");
}
This problem and the way to solve it by disabling IP_MULTICAST_ALL are discussed in Redhat Bug 231899; the discussion contains test programs to reproduce the problem and to solve it.
[Edited to clarify that bind() may in fact include a multicast address.]
So the application is joining several multicast groups, and receiving messages sent to any of them, to the same port. SO_REUSEPORT allows you to bind several sockets to the same port. Besides the port, bind() needs an IP address. INADDR_ANY is a catch-all address, but an IP address may also be used, including a multicast one. In that case, only packets sent to that IP will be delivered to the socket. I.e. you can create several sockets, one for each multicast group. bind() each socket to the (group_addr, port), AND join group_addr. Then data addressed to different groups will show up on different sockets, and you'll be able to distinguish it that way.
I tested that the following works on FreeBSD:
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/param.h>
#include <unistd.h>
#include <errno.h>
int main(int argc, const char *argv[])
{
const char *group = argv[1];
int s = socket(AF_INET, SOCK_DGRAM, 0);
int reuse = 1;
if (setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &reuse, sizeof(reuse)) == -1) {
fprintf(stderr, "setsockopt: %d\n", errno);
return 1;
}
/* construct a multicast address structure */
struct sockaddr_in mc_addr;
memset(&mc_addr, 0, sizeof(mc_addr));
mc_addr.sin_family = AF_INET;
mc_addr.sin_addr.s_addr = inet_addr(group);
mc_addr.sin_port = htons(19283);
if (bind(s, (struct sockaddr*) &mc_addr, sizeof(mc_addr)) == -1) {
fprintf(stderr, "bind: %d\n", errno);
return 1;
}
struct ip_mreq mreq;
mreq.imr_multiaddr.s_addr = inet_addr(group);
mreq.imr_interface.s_addr = INADDR_ANY;
setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
char buf[1024];
int n = 0;
while ((n = read(s, buf, 1024)) > 0) {
printf("group %s fd %d len %d: %.*s\n", group, s, n, n, buf);
}
}
If you run several such processes, for different multicast addresses, and send a message to one of the addresses, only the relevant process will receive it. Of course, in your case, you probably will want to have all the sockets in one process, and you'll have to use select or poll or equivalent to read them all.
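For the single-process variant, the per-group statistics from the question could be kept in an array that parallels the poll() set, one entry per socket. The sketch below is only an illustration: struct group_stats and poll_groups_once are made-up names, and it assumes each descriptor has already been bound to its own (group, port) and joined to that group as in the program above.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Counters for one multicast group (one socket). */
struct group_stats {
    unsigned long msgs;
    unsigned long bytes;
};

/* Runs one poll() round over n sockets, reads at most one datagram from
 * each ready socket and updates stats[i] for socket i.
 * Returns the number of datagrams handled, or -1 if poll() fails. */
int poll_groups_once(struct pollfd *pfds, struct group_stats *stats,
                     int n, int timeout_ms)
{
    int handled = 0;

    if (poll(pfds, (nfds_t)n, timeout_ms) < 0)
        return -1;

    for (int i = 0; i < n; i++) {
        if (!(pfds[i].revents & POLLIN))
            continue;

        char buf[2048];
        ssize_t len = recv(pfds[i].fd, buf, sizeof buf, MSG_DONTWAIT);
        if (len < 0)
            continue;               /* nothing to read after all */

        stats[i].msgs++;
        stats[i].bytes += (unsigned long)len;
        handled++;
    }
    return handled;
}
```

Since every socket only sees its own group, the array index is enough to tell the streams apart; no address inspection is needed.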
Use setsockopt() and IP_PKTINFO or IP_RECVDSTADDR depending on your platform, assuming IPv4. This combined with recvmsg() or WSARecvMsg() allows you to find the source and destination address of every packet.
Unix/Linux; note that FreeBSD uses IP_RECVDSTADDR, whilst both support IPV6_PKTINFO for IPv6.
http://www.kernel.org/doc/man-pages/online/pages/man7/ip.7.html
Windows, also has IP_ORIGINAL_ARRIVAL_IF
http://msdn.microsoft.com/en-us/library/ms741645(v=VS.85).aspx
Replace
mc_addr.sin_addr.s_addr = htonl(INADDR_ANY);
with
mc_addr.sin_addr.s_addr = inet_addr (mc_addr_str);
This helped me (Linux): each application receives a separate mcast stream from a separate mcast group on one port.
You can also look into the VLC player source; it shows many mcast IPTV channels from different mcast groups on one port, but I don't know how it separates the channels.
I have had to use multiple sockets each looking at different multicast group addresses, and then count statistics on each socket individually.
If there is a way to see the "receiver's address" as mentioned in the answer above, I can't figure it out.
One important point that also took me a while: when I bound each of my individual sockets to a blank address, like most Python examples do:
sock[i].bind(('', MC_PORT[i]))
I got all the multicast packets (from all multicast groups) on each socket, which didn't help. To fix this, I bound each socket to its own multicast group:
sock[i].bind((MC_GROUP[i], MC_PORT[i]))
And it then worked.
IIRC recvfrom() gives you a different read address/port for each sender.
You can also put a header in each packet identifying the source sender.
The Multicast address will be the receiver's address not sender's address in the packet. Look at the receiver's IP address.
You can separate the multicast streams by looking at the destination IP addresses of the received packets (which will always be the multicast addresses). It is somewhat involved to do this:
Bind to INADDR_ANY and set the IP_PKTINFO socket option. You then have to use recvmsg() to receive your multicast UDP packets and to scan for the IP_PKTINFO control message. This gives you some side band information of the received UDP packet:
struct in_pktinfo {
unsigned int ipi_ifindex; /* Interface index */
struct in_addr ipi_spec_dst; /* Local address */
struct in_addr ipi_addr; /* Header Destination address */
};
Look at ipi_addr: This will be the multicast address of the UDP packet you just received. You can now handle the received packets specific for each multicast stream (multicast address) you are receiving.
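The recvmsg() steps above can be sketched as follows. This is a Linux-flavoured illustration (IP_PKTINFO and struct in_pktinfo are not available everywhere, as noted in another answer), and the helper names enable_pktinfo and recv_with_dst are invented for it.

```c
#define _GNU_SOURCE           /* for struct in_pktinfo with glibc */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Ask the kernel to attach an IP_PKTINFO control message to every datagram
 * received on fd. Returns 0 on success, -1 on error. */
int enable_pktinfo(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof on);
}

/* Receives one datagram into buf and stores the destination address from
 * its IP header in *dst (for multicast traffic: the group address).
 * Returns the payload length, or -1 on error or missing control message. */
ssize_t recv_with_dst(int fd, void *buf, size_t len, struct in_addr *dst)
{
    /* The union keeps the control buffer aligned for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof ctrl.buf;

    ssize_t n = recvmsg(fd, &msg, 0);
    if (n < 0)
        return -1;

    /* Scan the ancillary data for the IP_PKTINFO message. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo info;
            memcpy(&info, CMSG_DATA(c), sizeof info);
            *dst = info.ipi_addr;   /* header destination address */
            return n;
        }
    }
    return -1;
}
```

With a single socket bound to INADDR_ANY and joined to several groups, comparing *dst against each joined group address would let the application update per-group counters without opening one socket per group.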