Trying to write a proxy server: Content-Length management problem (C)

I am trying to write a proxy server in C language under Linux. It was working fine (I had the perception that it was working fine) until I tried it for streaming media.
Let me first describe the problem, and then I'll get to streaming media.
To read the incoming data from the website and forward it to the actual client, I do this:
count = read(websitefd,buffer,BUFSIZ);
write(clientfd,buffer,count);
in a continuous while loop until I read up all the data on that socket.
Now the problem: if the actual website sends an HTTP packet whose Content-Length field says 1025 bytes, and the rest of the data arrives in other packets, I still wait for BUFSIZ (8192 bytes) and then send all 8192 bytes to the client machine together. For a normal octet-stream this works fine, even though I know it's not the right method, because I should forward the packets exactly as the actual server sent them. So if the actual server sends me 2 packets of 1024 bytes each, I send the client a single packet of 2048 bytes. The first packet carries an HTTP header saying the content length is 900 bytes (assuming the rest of that packet is the HTTP header), but I actually forward 2048 bytes to the client. For Content-Type: application/octet-stream the client just downloads the whole thing and either displays it as an image or HTML text or asks me to save it.
When the client requests streaming media, the client is not able to play the video, for the reason above. So what should I do now? Thanks for reading my question. Please help me out. :)

First, I strongly recommend using an existing proxy server as the base of any proxy system. The HTTP standard is quite complex, much more than you realize. If you are going to implement a proxy server, read RFC 2616 at least three times first.
Second, your proxy server must parse HTTP headers to figure out how much it must send. The main cases for working out how much data to send are as follows:
If a Content-Length header is present and no Transfer-Encoding header is present: The Content-Length header specifies how much data to relay in bytes. Just go into a loop copying.
If a Transfer-Encoding: chunked header is present: You must parse the chunked transfer encoding chunk headers. This encoding is frequently used for streaming data, where the total size is not known ahead of time. It's also often used for dynamic data generated by scripts.
If some other Transfer-Encoding header is present: Close the connection and report a 500 error, unless you know what that encoding is.
If a Content-Length header is not present, and no Transfer-Encoding header is present: Check for Connection: close (must be present, in HTTP/1.1) and Connection: keep-alive (must NOT be present in HTTP/1.0). If these conditions are violated, trigger a 500 error. Otherwise just keep passing data until the server closes the connection.
I'm deliberately making this a bit vague - you MUST read the standard if you're implementing a proxy server from scratch, or you will certainly introduce browser incompatibilities and/or security holes! So please, don't do so. Use lighttpd or varnish or something as the core proxy server, and just write a plugin for whatever functionality you need.
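As a rough illustration of the cases listed above, here is a minimal sketch in C that looks at a response's header block and decides how the body is framed. The names `body_framing`, `find_header` and `classify_body` are hypothetical, and the sketch assumes the full header block is already in memory as a NUL-terminated string with CRLF line endings. It ignores the Connection checks, folded headers and combined encodings such as "gzip, chunked" (treated here as unsupported), so it is not a substitute for reading the standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

enum body_framing {
    BODY_CONTENT_LENGTH,   /* relay exactly *length bytes */
    BODY_CHUNKED,          /* parse chunk headers while relaying */
    BODY_UNTIL_CLOSE,      /* relay until the server closes the connection */
    BODY_UNSUPPORTED       /* unknown Transfer-Encoding or bad Content-Length */
};

/* Find header `name` (case-insensitive) in a CRLF-delimited header block.
 * Returns a pointer to the start of its value, or NULL if it is absent. */
static const char *find_header(const char *headers, const char *name)
{
    size_t n = strlen(name);
    const char *line = headers;
    while (line && *line) {
        if (strncasecmp(line, name, n) == 0 && line[n] == ':') {
            const char *v = line + n + 1;
            while (*v == ' ' || *v == '\t') v++;   /* skip leading whitespace */
            return v;
        }
        line = strstr(line, "\r\n");               /* advance to the next line */
        if (line) line += 2;
    }
    return NULL;
}

enum body_framing classify_body(const char *headers, unsigned long long *length)
{
    assert(headers != NULL && length != NULL);     /* caller bug, not bad input */
    const char *te = find_header(headers, "Transfer-Encoding");
    const char *cl = find_header(headers, "Content-Length");

    if (te) {
        /* Transfer-Encoding takes precedence over Content-Length. */
        if (strncasecmp(te, "chunked", 7) == 0) return BODY_CHUNKED;
        return BODY_UNSUPPORTED;
    }
    if (cl) {
        if (*cl < '0' || *cl > '9') return BODY_UNSUPPORTED;  /* not a number */
        *length = strtoull(cl, NULL, 10);
        return BODY_CONTENT_LENGTH;
    }
    return BODY_UNTIL_CLOSE;
}
```

A proxy loop would call `classify_body` once per response and then pick the matching relay strategy, instead of blindly forwarding BUFSIZ-sized reads.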

I suppose media is transferred in chunks, i.e. no Content-Length is present and data is sent until finished.
As bdonlan said, please read up on how chunked data works.
And I agree, HTTP is pretty nasty (due to the many changes and interpretations over time).

Related

Sending requests using Sockets through a SOCKS5 Proxy Server in C

I have a simple networking program for sending and responding to HTTP requests/responses. However, if I wanted to send a HTTP or another request via a SOCKS5 proxy, how would I go about this? I'm using C Unix sockets.
I could solve this by creating a proxy server in Linux. However, my intention is to send the request to a proxy server I do not own, so that it can be forwarded to the destination server. I couldn't find a library. I found and read the RFC for it, https://www.rfc-editor.org/rfc/rfc1928.txt , but I'm still not 100% sure how to format my request.
Am I supposed to send the segments for the handshakes as hex? If so, would I send a string of hex 0F3504 or \x0F \x35 \x04 or 0x0F3504? Another question: do I need to denote header = value in the message, or does the SOCKS5 server know which header I am referring to by the position of the byte in the message I've sent?
Any clarification would be very much appreciated.
Some time ago I wrote an open source C library that may help you: https://github.com/brechtsanders/proxysocket
Maybe you can use the library. It's quite easy: just replace the connect() call with the corresponding call from the library, and keep the rest of your code that uses the returned socket.
Or you can take a peek in the code to see how it's done there.
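On the formatting question itself: RFC 1928 messages are raw bytes, not hex text, and each field is identified purely by its position in the message, so there are no header names. The sketch below builds the two client messages for the simplest case (no authentication, CONNECT to a domain name). It is not taken from the library mentioned above; the function names `socks5_build_greeting` and `socks5_build_connect` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Greeting: VER=0x05, NMETHODS=1, METHODS={0x00 = no authentication}.
 * Returns the number of bytes written, or 0 if `out` is too small. */
size_t socks5_build_greeting(uint8_t *out, size_t cap)
{
    assert(out != NULL);
    if (cap < 3) return 0;
    out[0] = 0x05;
    out[1] = 0x01;
    out[2] = 0x00;
    return 3;
}

/* CONNECT request addressed by domain name (ATYP=0x03).
 * Returns the number of bytes written, or 0 on invalid arguments. */
size_t socks5_build_connect(uint8_t *out, size_t cap,
                            const char *host, uint16_t port)
{
    assert(out != NULL && host != NULL);
    size_t hlen = strlen(host);
    if (hlen == 0 || hlen > 255 || cap < 7 + hlen) return 0;

    out[0] = 0x05;                 /* VER */
    out[1] = 0x01;                 /* CMD = CONNECT */
    out[2] = 0x00;                 /* RSV */
    out[3] = 0x03;                 /* ATYP = DOMAINNAME */
    out[4] = (uint8_t)hlen;        /* length prefix of the name */
    memcpy(out + 5, host, hlen);   /* name, without a trailing NUL */
    out[5 + hlen] = (uint8_t)(port >> 8);    /* port, network byte order */
    out[6 + hlen] = (uint8_t)(port & 0xff);
    return 7 + hlen;
}
```

In use, you would send() the greeting, read the 2-byte method selection reply, send() the CONNECT request, read the server's reply, and only then start speaking HTTP over the same socket.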

How to distinguish between different type of packets in the same HTTPS traffic?

There's something that bothers me: I'd like to distinguish between a packet coming from Youtube and a packet coming from Wikipedia: they both travel on HTTPS and they both come from the port 443.
Since they travel on HTTPS, their payload is not understandable and I can't do a full Deep Packet Inspection: I can only look at Ethernet, IP and TCP struct headers. I may look at the IP address source of both packets and see where they actually come from, but to know if they are from Youtube or Wikipedia I should already know the IP addresses of these two sites.
What I'm trying to figure out is a way to tell streaming over HTTP (like Youtube does) apart from a simple HTML transport (Wikipedia) without investigating the payload.
Edit 1: in a Wireshark session started while a video was playing I got tons of packets. Maybe I should start looking at the timing between packets coming from the same address.
If you are just interested in following the data stream in Wireshark you can use the TCP stream index, filter would be something like tcp.stream == 12
The stream index starts at zero with the first stream that wireshark encounters and increments for each new stream (persistent connection).
So two different streams between the same IPs would have two different numbers. For example a video stream might be 12 and an audio stream, between the same IP addresses, might be 13.
If you started the capture before the stream was initiated you'll be able to see the original traffic setting up the SSL connection (much of this is in clear text)
You may consider looking at the server certificate. It will tell you whether it's youtube (google) or facebook.
That would give you an idea whether SSL connection is to youtube, which one is to facebook.
You can try looking at the TCP header options, but generally the traffic is encrypted for a reason... so that it wouldn't be seen by man-in-the-middle. If it were possible, it would be, by definition, a poor encryption standard. Since you have the capture and all the information known to the user agent, you are not "in-the-middle". But you will need to use the user agent info to do the decryption before you can really see inside the stream.
This link, "Reverse ip, find domain names on ip address", indicates several methods.
Suggest running nslookup on the IP from within a C program.
And remembering that address/ip values can be nested within the data of the packet, it may (probably will) take some investigation of the packet data to get to the originator of the packet
Well, you have encountered a dilemma: how to get the information users are exchanging with their servers when they have explicitly encrypted it to get anonymity. The quick response is that you can't. Only if you can get inside the SSL connection will you get more information.
Even the SSL certificate exchanged between server and client will be of no help, as it only identifies the server (and not the virtual host you are trying to reach behind this connection), and with the feature known as HTTP virtual hosts, several servers can be listening for connections on the same port of the same address.
SSL parameters are negotiated just after connection, and the virtual server is normally selected with the Host HTTP header field of the request (see RFC 2616), but this occurs after the SSL negotiation has finished, so you don't have access to it.
The only thing you can do for sure is to try to identify connections to youtube by the data volumes and connection patterns this kind of traffic exhibits.

Detect end of CRL file when downloading across established tcp connection

For various reasons, I am trying to download a CRL file using crude tools in C. I'm opening a tcp connection using good old socket(), sending a hardcoded plaintext http request via send(), reading the results into a buffer via recv(), and then writing that buffer into a file (which I will later use to verify various certs).
The recv() and write-to-file portions are inside a while loop so that I can get it all.
My problem is that I'm having a heck of a time coming up with a reliable means of determining when I'm done receiving the file (and therefore can break out of the while loop). Everything I've come up with so far has either had false positives or false negatives (getting back 0 bytes happens too frequently, and either the EOF marker wasn't there or I was looking in the wrong byte for it). Preferably, it would be a technique that wouldn't introduce a lot of additional complexity.
Really, I have a host, port, and a path (all as char*). On the far end, there's a friendly http server (though not one that I control). I'd be happy with anything that could get me the file without a large quantity of additional code complexity. If I had access to a command line, I'd go for something like wget, but I haven't found any direct equivalents over on the C API side, and system() is a poor choice for the situation.
'Getting back zero bytes', by which I assume you mean recv() returning zero, only happens when the peer has finished sending data and has closed the connection. Unless the peer is sending you multiple files per connection, this is an infallible sign of the end of this file. 'Too frequently' is nonsense: it can only happen once per connection.
But if the peer is an HTTP server it should be sending you a Content-Length header. See RFC 2616.
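A minimal sketch of the "read until recv() returns zero" loop described above might look like this. The function name `recv_until_close` is hypothetical. It assumes a blocking socket and a request that asks the server to close when done (for example by sending "Connection: close"), because on a keep-alive connection recv() would simply block after the last byte. Note that it writes everything it receives, including the HTTP response headers, so those still need to be stripped before the CRL is usable.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Copy everything from a connected socket to `out` until the peer closes.
 * Returns the total number of bytes copied, or -1 on error. */
long recv_until_close(int fd, FILE *out)
{
    assert(out != NULL);
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);
        if (n == 0) break;                   /* orderly shutdown: end of data */
        if (n < 0) {
            if (errno == EINTR) continue;    /* interrupted by a signal, retry */
            return -1;                       /* real error */
        }
        if (fwrite(buf, 1, (size_t)n, out) != (size_t)n) return -1;
        total += n;
    }
    return total;
}
```

If the server does send a Content-Length header, comparing it against the number of body bytes received is a cheap extra check that the transfer was not cut short.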

HTTPS protocol file integrity

I understand that when you send a file from a client to a server using the HTTP/HTTPS protocols, you have the guarantee that all data sent successfully arrived at the destination. However, if you are sending a huge file and the internet connection suddenly goes down, not all packets are sent and, therefore, you lose the logical integrity of the file.
Is there any point I am missing in my statement?
I would like to know if there is a way for the destination node to check file logical integrity without using a "custom code/api".
HTTPS is just HTTP over a TLS layer, so all of this applies to HTTPS, too:
HTTP is typically transported over TCP/IP. Now, TCP has retransmission (i.e. lost packets will be resent) and checksums (i.e. the probability that data got altered without the receiver noticing and re-requesting the packet is small). So if you're really just transferring data, you're basically set (as long as your HTTP server is configured to send the length of your file in bytes, which, at least for static files, it usually is).
If your transfer is stopped before the whole file size that was advertised in the HTTP GET reply that your server sends to the client is reached, your client will know! Many HTTP libraries/clients can re-start HTTP transmissions (if the server supports it).
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.15
even specifies an MD5 checksum header field. You can configure web servers to use that field, and clients might use it to verify the overall file integrity.
EDIT: Content-MD5 as specified by rfc2616 seems to be deprecated. You can now use a content digest, which is much more flexible.
Also, you mention that you want to check the file that a client sends to a server. That problem might be quite a bit harder -- whilst you're usually in total control of your web server, you can't force an arbitrary client (e.g. a browser) to hash its file before uploading.
If, on the other hand, you are in fact in control of the client's HTTP implementation, you could most probably also use something more file-transfer oriented than plain HTTP -- think WebDAV, AtomPub etc., which are protocols on top of HTTP, or even more file-exchange oriented protocols like rsync (which I'd heartily recommend if you're actually syncing stuff -- it reduces network usage to a minimum if both sides' versions only differ partially). If for some reason you're in the position that your users share most of their data within a well-defined circle (for example, you're building something where photographers share their albums), you might even just use bittorrent, which has per-chunk hashing, extensive load balancing options, and allows for "plain old HTTP seeds".
There are several issues here:
As Marcus stated in his answer, TCP protects your bytes from being accidentally corrupted, but it doesn't help if the download was interrupted.
HTTPS additionally ensures that those bytes weren't tampered with between the server and the client (you).
If you want to verify the integrity of a file (whose transfer was or was not interrupted), you should use a checksum designed to protect against accidental file corruption (e.g. CRC32; there could be better ones, you should check).
If in addition you use HTTPS, then you're safe from intentional attacks too, because you know your checksum is OK and that the file parts you got weren't tampered with.
If you use a checksum but don't use HTTPS (but you really should), then you should be safe against accidental data corruption but not against malicious attacks. It could be mitigated, but that's outside the scope of this question.
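For reference, a CRC32 of the kind mentioned above can be computed in a few lines. This is a plain bitwise sketch of the common reflected CRC-32 (polynomial 0xEDB88320, as used by zip and gzip), not a tuned implementation. The function name `crc32_update` is hypothetical. Pass 0 as the initial value and feed the file in as many pieces as you like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Update a running CRC-32 with `len` bytes from `data`.
 * Start with crc = 0; the result of one call can seed the next. */
uint32_t crc32_update(uint32_t crc, const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint8_t *p = data;

    crc = ~crc;                               /* undo the final inversion */
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            /* shift right; XOR in the polynomial if the low bit was set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;                              /* final inversion */
}
```

The receiver computes the same value over the bytes it stored and compares it with the value published by the sender. As the answer notes, this only guards against accidental corruption, not deliberate tampering.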
In HTTP/1.1, the recipient can always detect whether it received a complete message (either by comparing the Content-Length, or by properly handling transfer-encoding: chunked).
(Adding content hashes can help if you suspect bit errors on the transport layer.)

How do I modify a HTTP response packet with winpcap?

There are two problems here:
What if the content is encoded with gzip (Content-Encoding: gzip)?
Do I also need to change the header part to make the HTTP packet valid (checksums, if any)?
UPDATE
Can someone with actual experience elaborate the steps involved?
I'm using winpcap with the BPF filter tcp and src port 80 to filter the traffic, so my job lies in this callback function:
void packet_handler(u_char *param, const struct pcap_pkthdr *header, const u_char *pkt_data)
WinPcap doesn't allow you to change a packet that was already sent.
If the packet was sent, WinPcap won't prevent it from reaching its destination.
If you want to send another response - in addition to the response that was sent - I'm not sure what you're trying to achieve.
Decompress it with a GZIP decompressor.
Remove the Content-Encoding header and add a Content-Length header representing the new length in bytes.
That said, for a better answer you'll need to supply more context in the question. Wanting to do this is a smell. What is it you're trying to achieve, and why do you think that modifying the HTTP response is the right solution?
libpcap is used for capturing. If you want to do modification and injection of network packets you need another library, such as libnet.
winpcap is an odd way to try modifying a TCP stream - you don't explain why you are trying to do this, but you should probably be able to achieve this by writing your own HTTP proxy instead. That way, you get presented with a straight datastream you can intercept, log and modify to your heart's content. Once you do that, strip out Accept-Encoding from the request headers, then you'll never need to deal with gzipped responses in the first place.
There are no HTTP checksums, but the lower layers do have checksums; by operating on the application level as a proxy server, you let the network stack deal with all this for you.
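The "strip out Accept-Encoding" step suggested above could be sketched as follows, assuming the proxy already holds the request head (request line plus headers, nothing from the body) as a NUL-terminated, writable buffer with CRLF line endings. The function name `strip_header` is hypothetical, and folded (multi-line) headers are not handled.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* Remove every header line named `name` from the request head, editing the
 * buffer in place. Returns the new length of the head. */
size_t strip_header(char *head, const char *name)
{
    assert(head != NULL && name != NULL);
    size_t n = strlen(name);
    char *line = head;

    while (*line) {
        char *eol = strstr(line, "\r\n");
        char *next = eol ? eol + 2 : line + strlen(line);
        if (strncasecmp(line, name, n) == 0 && line[n] == ':') {
            /* slide the rest of the head (including the NUL) over this line */
            memmove(line, next, strlen(next) + 1);
        } else {
            line = next;
        }
    }
    return strlen(head);
}
```

With that header gone, a standards-following server should reply with an uncompressed body, so the proxy can inspect or modify it without a gzip step.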

Resources