Get HTTP Host web address from packet - C

I have a basic packet sniffer like the one at http://www.binarytides.com/packet-sniffer-code-c-linux/
I have extended it to process only packets on port 80 (HTTP). I am not sure how to get the host web address from the data. Can you help me here?
What I am trying to do is parse a subset of the HTTP header in order to identify the host web address.
I found something similar to what I need: https://github.com/joyent/http-parser/blob/master/http_parser.h#L194
but that code is too complex for what I need.
Or where can I find a bytewise breakdown of the HTTP header, like the one for TCP: http://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_segment_structure

You need to grab the TCP payload and look at the request line and headers. A typical HTTP request looks like:
GET /index.html HTTP/1.1
Host: www.foo.com
The web host name follows the Host: header (a request sent through a proxy may instead carry the full URL on the GET line). So you can extract the web host address from there.
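A minimal sketch of that extraction, assuming the TCP payload has already been copied into a NUL-terminated buffer (the function name and buffer handling here are illustrative, not part of the sniffer code linked above):

#include <string.h>
#include <stddef.h>

/* Copy the value of the "Host:" header from an HTTP request payload into 'out'.
   Returns 1 on success, 0 if no Host header was found. */
static int extract_host(const char *payload, char *out, size_t out_len)
{
    const char *h = strstr(payload, "Host:");   /* simple, case-sensitive match */
    if (h == NULL)
        return 0;
    h += 5;                                     /* skip "Host:" */
    while (*h == ' ' || *h == '\t')
        h++;                                    /* skip leading whitespace */
    size_t i = 0;
    while (h[i] != '\r' && h[i] != '\n' && h[i] != '\0' && i < out_len - 1) {
        out[i] = h[i];
        i++;
    }
    out[i] = '\0';
    return i > 0;
}

To avoid matching "Host:" inside arbitrary binary data, only call this on payloads whose first bytes look like an HTTP method (GET, POST, ...).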

Related

Packet analysing in Wireshark - How to distinguish HTTP protocol from TCP

I am working on a school project in which I have to analyze .pcap files in C using the libpcap library. I am new to networking; however, I do know that TCP is on layer 4 and HTTP is on layer 7 of the OSI model. I want to sort out the HTTP packets and print their source/destination ports, but I'm a little confused about how to distinguish HTTP packets from plain TCP packets.
Here is an example, which I don't understand:
EDIT: Here is another example, where the source port is 80 and the length is 100. The 54th byte is 0x48 ('H'), the same as for an HTTP 1.1 response packet, yet Wireshark still lists it as TCP.
https://i.stack.imgur.com/RQs6v.png
The destination port here is 80, which is HTTP. However, Wireshark does not list this packet as HTTP; it is just TCP.
https://i.stack.imgur.com/TsVuO.png
My question is: how do I determine, based on the bytes, whether a packet is HTTP or just plain TCP?
You cannot determine whether a packet is HTTP just by looking at its headers. HTTP is application level: if you want to identify an HTTP stream, you have to check the innermost payload of the packet. In other words, HTTP packets are distinguishable only by looking at what comes after the TCP header. Wireshark already does this for you and marks packets that look like HTTP as such. You can filter packets identified as HTTP by Wireshark by simply typing http in the filter bar at the top.
In your case, the packet you show has Length = 0, so there really isn't anything to analyze other than the various headers of the different layers. The packet is not HTTP.
Determining HTTP traffic "based on bytes" can be done by looking at the payload: HTTP requests and responses have known formats. For example HTTP 1.1 requests start with <METHOD> <URI> HTTP/1.1\r\n, and responses with HTTP/1.1 <CODE> <MSG>\r\n.
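As a rough illustration of that check (the method list below is a non-exhaustive assumption), assuming data points at the first byte after the TCP header:

#include <string.h>
#include <stddef.h>

/* Heuristic check: does this TCP payload look like the start of an
   HTTP/1.x request or response? */
static int looks_like_http(const char *data, size_t len)
{
    static const char *methods[] = { "GET ", "POST ", "HEAD ", "PUT ",
                                     "DELETE ", "OPTIONS ", NULL };

    /* Responses start with "HTTP/1.0" or "HTTP/1.1". */
    if (len >= 7 && memcmp(data, "HTTP/1.", 7) == 0)
        return 1;

    /* Requests start with "<METHOD> <URI> HTTP/1.x". */
    for (int i = 0; methods[i] != NULL; i++) {
        size_t mlen = strlen(methods[i]);
        if (len >= mlen && memcmp(data, methods[i], mlen) == 0)
            return 1;
    }
    return 0;
}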

Connecting to Remote Host in C (Socket Programming)

My issue is simple: I want to make an HTTP GET request from a server. My client program takes a URL and sends it to the server program. I want to know how I can take this URL and make an HTTP GET request for the specific page that was entered from the client.
If I resolve the URL to get an IP address, I could open a socket to that IP address, but then how would I make the request for the actual page? Would I just send an HTTP GET request directly to that IP address with the directory in the request?
For instance, if I ran a Google search for the word "Test" I'd be presented with the following URL:
https://www.google.com/?gws_rd=ssl#q=Test
My understanding is that a GET request could look like this:
GET /?gws_rd=ssl#q=Test HTTP/1.1
Host: www.google.com
So, if I'm understanding this correctly, would I resolve the IP, open a socket, and then just send this GET request directly to the socket?
Lastly, if I try throwing a URL such as the one above into my server code it's unable to resolve an IP address. This means that if I'm making a more complex request than something like www.google.com I'd have to parse the string and match only the host. Is there a simple way to handle this other than by the use of regular expressions? I'm familiar with regular expressions from Python and C#, but if I can cut down on the complexity of the program by approaching this differently, I'd like to know.
Update: I'm attempting to match the URL with a POSIX regular expression and extract the domain name from it. I'm not having much luck so far, as this implementation is oppressively confusing.
Yes, once a socket has been opened you can send requests as in your example; the format is described in RFC 2616.
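For instance, a minimal sketch (error handling mostly omitted) that resolves the host, connects and sends the request from your example; note that the #q=Test fragment is a client-side artifact and is not normally sent to the server:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    /* Resolve the host name to an address (IPv4 or IPv6). */
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("www.google.com", "80", &hints, &res) != 0)
        return 1;

    /* Open a TCP socket and connect to the resolved address. */
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;
    freeaddrinfo(res);

    /* Send the request: path in the request line, host in the Host header. */
    const char *req = "GET /?gws_rd=ssl HTTP/1.1\r\n"
                      "Host: www.google.com\r\n"
                      "Connection: close\r\n"
                      "\r\n";
    write(fd, req, strlen(req));

    /* Print whatever the server sends back. */
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}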
If you don't want to use regular expressions or strchr to split your URL, you could also send the entire URL:
GET http://www.google.com/?gws_rd=ssl#q=Test HTTP/1.1
However, you will still need to find the hostname in the URL to make a call to something like gethostbyname.
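One way to do that split without regular expressions, as a sketch that only handles the common scheme://host/path form (no user-info or port handling):

#include <string.h>
#include <stddef.h>

/* Copy the host part of a URL such as "https://www.google.com/?gws_rd=ssl#q=Test"
   into 'host'. Simplified: no user-info, no port number, no validation. */
static void url_to_host(const char *url, char *host, size_t host_len)
{
    const char *p = strstr(url, "://");
    p = (p != NULL) ? p + 3 : url;   /* skip the scheme, if any */

    size_t i = 0;
    while (p[i] != '\0' && p[i] != '/' && p[i] != ':' &&
           p[i] != '?' && p[i] != '#' && i < host_len - 1) {
        host[i] = p[i];
        i++;
    }
    host[i] = '\0';
}

Calling url_to_host("https://www.google.com/?gws_rd=ssl#q=Test", host, sizeof(host)) leaves "www.google.com" in host, which you can then pass to gethostbyname or, preferably, getaddrinfo.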

C - Load balancer: modify HTTP header when splicing?

UPDATE:
I edited my question to focus it more on the problem.
Context
I am coding this to understand how load balancing works.
Debian, 64-bit.
loadbalancer on 127.0.0.1:36001
backend on 127.0.0.1:36000
curl calling on loadbalancer (36001).
Problem
I took this socket code as a starting point for creating my sockets.
I created a very naive load balancer and a backend server, and I send a curl request to the load balancer.
My problem is that I don't understand how to pass the client's port/IP to the backend so that it can answer curl directly when splicing.
The code
curl request
curl localhost:36001 -d "salute"
loadbalancer
static int pool_of_conn[2],
pool_of_pipes[2][2];
int sentinel , efd;
struct epoll_event event;
struct epoll_event *events;
...
off64_t offset = 0;
pipe(pool_of_pipes[0]);
/* This splice works, but the traffic reaches the backend with the load balancer's IP and port. How do I put the client's here? How do I alter events[i].data.fd? */
bytes = splice(events[i].data.fd, NULL, pool_of_pipes[0][1], NULL, 4096, SPLICE_F_MOVE);
if (bytes == 0)
break;
splice(pool_of_pipes[0][0], NULL, pool_of_conn[0], NULL, bytes, SPLICE_F_MOVE);
Could you help me, please?
The fundamental issue is that HTTP goes over TCP, which is a connection-oriented protocol. So even if you could make the connection to the backend appear to come from the client's IP address, the TCP connection is established between the load balancer and the backend, and the return TCP packets use that connection as well. To work around this you would need HTTP to use different TCP connections for the request and the reply, and you would also have to manipulate local IP routing so that the backend's reply still reaches the actual client.
If your intent is to build an HTTP load balancer, the easiest solution probably lies somewhere at the application-protocol (i.e. HTTP) level. You could use HTTP redirects or something more complex, like wrapping the client request in some extra headers before forwarding it. Existing HTTP load balancers offer plenty of examples and ideas.
That said, if you want to solve this at the TCP connection level, transparently to the application protocols, it is doable, but it requires quite a bit of additional code, and the backend's reply still goes through the load balancer. My suggestion would be to look into the IP socket option IP_TRANSPARENT. It allows your load balancer to use the client's IP address as the source when connecting to the backend. (Note that the return traffic also goes through the load balancer, so your splice() still comes in handy.)
Here is an old SO question about IP_TRANSPARENT: Transparent proxying - how to pass socket to local server without modification?
and here is the Linux kernel documentation for it https://www.kernel.org/doc/Documentation/networking/tproxy.txt
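To give a flavour of what that looks like, a sketch only (the function and the way the client address is obtained are assumptions; in practice you would take it from accept() on the listening socket, and you need CAP_NET_ADMIN plus the routing setup described in the tproxy documentation):

#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>   /* IP_TRANSPARENT (Linux-specific) */

/* Connect to the backend using the client's address as the source. */
static int connect_as_client(const struct sockaddr_in *client_addr,
                             const struct sockaddr_in *backend_addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int one = 1;
    /* Allow binding to a non-local address (the client's). */
    if (setsockopt(fd, SOL_IP, IP_TRANSPARENT, &one, sizeof(one)) < 0 ||
        bind(fd, (const struct sockaddr *)client_addr, sizeof(*client_addr)) < 0 ||
        connect(fd, (const struct sockaddr *)backend_addr, sizeof(*backend_addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}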

multiple DNS queries in one web page request

I am working on a web proxy. The logic is: the client sends a request to the proxy, the proxy sends the same request to the server and sends the answer back to the client.
For example, I want to visit www.baidu.com. I get "Host: www.baidu.com" in the GET request, which I use to send a DNS request; then I get the IP of www.baidu.com and establish the socket between the proxy and the server.
The question is: when I use Wireshark to capture normal packets (without the proxy), I find that there are more DNS queries than just the one for www.baidu.com. It also queries for nsclick.baidu.com and suggestion.baidu.com on different sockets. But there is no signal that tells me to initiate these DNS queries, unlike the query for www.baidu.com, which I can initiate when I detect "Host:". Can someone help me? Thank you.
It probably should not work that way in the first place.
Imagine I hit www.baidu.com in my browser, which sends its traffic via your proxy. For your proxy, www.baidu.com is currently the only thing to look up.
When my browser receives the HTML for that request, the HTML/JS code then issues requests for images that come from nsclick.baidu.com. Similarly, requests for other resources (CSS, JS, images) may be made. They all go through your proxy again, each with its own Host header, and there you will do your usual DNS query.

How to write an HTTP/1.0 proxy server in C on Linux?

I must develop a proxy server that works only with HTTP 1.0, on Linux, in C.
I need some hints to start developing.
I assume you are confident using Linux and the C language (no hints for that; otherwise don't start by developing a proxy).
Read and understand RFC 1945, HTTP/1.0 (pay attention to the sections that specifically mention proxies).
Determine what kind of proxy you want (web/caching/content-filter/anonymizer/transparent/non-transparent/reverse/gateway/tunnel/...)
Start developing the server
Basic steps
Open port
Listen on port
Get all requests sent from the client to that port (maybe make the whole thing multithreaded to be able to handle more than one request at a time)
Determine if it is a valid HTTP 1.0 request (see the sketch after this list)
Extract the request components
Rebuild the request according to what type of proxy you are
Send the new request
Get the response
Send response to client
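For the "determine if it is a valid HTTP 1.0 request" step above, a rough sketch, assuming the first request line has already been read into a NUL-terminated string (buffer sizes and the strictness of the check are arbitrary assumptions):

#include <stdio.h>
#include <string.h>

/* Split a request line such as "GET http://host/path HTTP/1.0" into its
   three parts. The caller must pass method[16], uri[2048] and version[16]
   (sizes are arbitrary choices). Returns 1 if it parses as HTTP/1.0. */
static int parse_request_line(const char *line,
                              char *method, char *uri, char *version)
{
    if (sscanf(line, "%15s %2047s %15s", method, uri, version) != 3)
        return 0;
    return strcmp(version, "HTTP/1.0") == 0;
}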
How to create a proxy server:
Open a port to listen on
Catch all incoming requests on that port (the listen/accept skeleton is sketched below)
Determine the web address requested
Open a connection to the host and forward the request
Receive response
Send the response back to the requesting client
Additionally: Use threads to allow for multiple requests to the server.
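A bare-bones sketch of the open/listen/accept part of those steps (port number and backlog are arbitrary; request handling is only indicated by the comment):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    /* Open a TCP socket and listen on an arbitrary proxy port. */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (listen_fd < 0)
        return 1;

    int one = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);        /* proxy port, arbitrary */

    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listen_fd, 16) < 0)
        return 1;

    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0)
            continue;
        /* Here: read the request, validate it, forward it to the origin
           server, relay the response back, then close. A thread or fork()
           per connection goes here for concurrency. */
        close(client_fd);
    }
}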
