HTTPS protocol file integrity

I understand that when you send a file from a client to a server using HTTP/HTTPS, you have a guarantee that all data that was sent arrived at the destination. However, if you are sending a huge file and the internet connection suddenly goes down, not all packets are delivered, and therefore you lose the logical integrity of the file.
Is there any point I am missing in my statement?
I would like to know if there is a way for the destination node to check the logical integrity of the file without using custom code or an API.

HTTPS is just HTTP over a TLS layer, so all of this applies to HTTPS, too:
HTTP is typically transported over TCP/IP. TCP retransmits lost packets and checksums each segment (i.e. the probability that data gets altered without the receiver noticing and re-requesting the packet is small). So if you're really just transferring data, you're basically set, as long as your HTTP server is configured to send the length of your file in bytes, which, at least for static files, it usually is.
If the transfer stops before the full length advertised in the response to your HTTP GET has been received, your client will know! Many HTTP libraries/clients can resume interrupted transfers (if the server supports it, e.g. via Range requests).
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.15
even specifies an MD5 checksum header field (Content-MD5). You can configure web servers to use that field, and clients can use it to verify the overall file integrity.
EDIT: Content-MD5 as specified by RFC 2616 has since been deprecated. You can now use a content digest instead (the Digest Fields headers such as Content-Digest), which is much more flexible.
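
For illustration, a minimal sketch of the hashing side using OpenSSL's EVP API (the library choice and the helper name md5_of_buffer are assumptions, not from the answers above). Content-MD5 carries the digest base64-encoded, so a real client would also base64-encode this result before comparing it with the header value:

/* Sketch: MD5 of a downloaded buffer via OpenSSL's EVP API. */
#include <stddef.h>
#include <openssl/evp.h>

int md5_of_buffer(const unsigned char *buf, size_t len,
                  unsigned char digest[EVP_MAX_MD_SIZE], unsigned int *dlen)
{
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    if (!ctx)
        return -1;
    int ok = EVP_DigestInit_ex(ctx, EVP_md5(), NULL) &&
             EVP_DigestUpdate(ctx, buf, len) &&
             EVP_DigestFinal_ex(ctx, digest, dlen);
    EVP_MD_CTX_free(ctx);
    return ok ? 0 : -1;
}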
Also, you mention that you want to check the file that a client sends to a server. That problem might be quite a bit harder -- whilst you're usually in total control of your web server, you can't force an arbitrary client (e.g. a browser) to hash its file before uploading.
If you are, on the other hand, in fact in control of the client's HTTP implementation, you could most probably also use something more file-transfer oriented than plain HTTP -- think WebDAV, AtomPub etc., which are protocols on top of HTTP, or even more file-exchange oriented protocols like rsync (which I'd heartily recommend if you're actually syncing stuff; it reduces network usage to a minimum if both sides' versions only differ partially). If for some reason your users share most of their data within a well-defined circle (for example, you're building something where photographers share their albums), you might even use BitTorrent, which has per-chunk hashing, extensive load-balancing options, and allows for "plain old HTTP seeds".

There are several issues here:
- As Marcus stated in his answer, TCP protects your bytes from being accidentally corrupted, but it doesn't help if the download was interrupted.
- HTTPS additionally ensures that those bytes weren't tampered with between the server and the client (you).
- If you want to verify the integrity of a file (whether or not its transfer was interrupted), you should use a checksum designed to protect against accidental corruption, e.g. CRC32 (there are stronger ones; see the sketch after this list).
- If in addition you use HTTPS, then you're safe from intentional attacks too, because you know your checksum is OK and the file parts you got weren't tampered with.
- If you use a checksum but don't use HTTPS (you really should), then you're safe against accidental data corruption but not against malicious attacks. That can be mitigated, but it's outside the scope of this question.
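
A minimal sketch of the CRC32 idea from the list above, assuming zlib is available (its crc32() can be fed the download chunk by chunk):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const unsigned char part1[] = "first chunk of the file ";
    const unsigned char part2[] = "second chunk of the file";

    uLong crc = crc32(0L, Z_NULL, 0);            /* required initial value */
    crc = crc32(crc, part1, sizeof part1 - 1);   /* feed data as it arrives */
    crc = crc32(crc, part2, sizeof part2 - 1);

    printf("CRC32: %08lx\n", crc);   /* compare with the sender's checksum */
    return 0;
}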

In HTTP/1.1, the recipient can always detect whether it received a complete message (either by comparing the Content-Length, or by properly handling Transfer-Encoding: chunked).
(Adding content hashes can help if you suspect bit errors on the transport layer.)

Related

Do I need to check data integrity after sending file over ftp?

I need to transfer some files from a remote computer (on the local network) and I plan to do it via FTP.
Apparently, FTP is based on the TCP protocol, and if I remember my lessons well, the difference between TCP and UDP is that TCP checks that network packets are correctly sent and received.
After asking myself whether I need to add checksum verification, my conclusion was that I don't. Am I correct?
I'm aware of the differences between binary transfer and text transfer and plan to do only binary transfers (working only on Windows).
Do I really need to checksum big files transferred by binary FTP?
To be clear, I need data integrity to verify that some bits were not altered during the exchange. Man-in-the-middle is not (much of) an issue because the operation will be done on a private network.
Yes, you do.
A man in the middle can alter any TCP packets on the way from the FTP server to your site, or he can even act as a malicious FTP site and suppress the original traffic completely.
Therefore you need to verify somehow that the file you received is really the file you wanted to receive. Checksums are suitable for this task.
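
As one possible sketch of such a checksum (the library choice and the helper name sha256_file are assumptions; any collision-resistant digest would do): hash the file on both machines and compare the output.

#include <stdio.h>
#include <openssl/evp.h>

/* Hash a file on disk with OpenSSL's EVP API. */
int sha256_file(const char *path, unsigned char out[EVP_MAX_MD_SIZE],
                unsigned int *outlen)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
    unsigned char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        EVP_DigestUpdate(ctx, buf, n);           /* stream the whole file */
    EVP_DigestFinal_ex(ctx, out, outlen);
    EVP_MD_CTX_free(ctx);
    fclose(f);
    return 0;
}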

How to distinguish between different types of packets in the same HTTPS traffic?

There's something that bothers me: I'd like to distinguish between a packet coming from YouTube and a packet coming from Wikipedia: they both travel over HTTPS and they both come from port 443.
Since they travel over HTTPS, their payload is not readable and I can't do full Deep Packet Inspection: I can only look at the Ethernet, IP and TCP headers. I could look at the source IP address of both packets and see where they actually come from, but to know whether they are from YouTube or Wikipedia I would already have to know the IP addresses of these two sites.
What I'm trying to figure out is a way to tell streaming over HTTPS (like YouTube) apart from simple HTML transport (Wikipedia) without inspecting the payload.
Edit 1: in a Wireshark session started while a video was playing I got tons of packets. Maybe I should start by looking at the timing between packets coming from the same address.
If you are just interested in following the data stream in Wireshark, you can use the TCP stream index; the filter would be something like tcp.stream == 12
The stream index starts at zero with the first stream that Wireshark encounters and increments for each new stream (persistent connection).
So two different streams between the same IPs would have two different numbers. For example a video stream might be 12 and an audio stream, between the same IP addresses, might be 13.
If you started the capture before the stream was initiated, you'll be able to see the original traffic setting up the SSL connection (much of this is in clear text).
You may consider looking at the server certificate: it will tell you whether it's YouTube (Google) or Facebook. (With TLS versions up to 1.2, the certificate is exchanged in clear text during the handshake.)
That would give you an idea of which SSL connection is to YouTube and which one is to Facebook.
You can try looking at the TCP header options, but generally the traffic is encrypted for a reason: so that it can't be seen by a man-in-the-middle. If that were possible, it would by definition be a poor encryption standard. Since you have the capture and all the information known to the user agent, you are not "in the middle", but you will need the keys known to the user agent to do the decryption before you can really see inside the stream.
This link: Reverse ip, find domain names on ip address
indicates several methods.
One suggestion is to run the equivalent of nslookup on the IP from within a C program, as sketched below.
And remember that address/IP values can be nested within the data of a packet, so it may (and probably will) take some investigation of the packet data to get to the originator of the packet.
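
A rough sketch of that nslookup-from-C suggestion, using POSIX getnameinfo() for the reverse (PTR) lookup. The address below is a documentation placeholder, and PTR records are not guaranteed to exist or be accurate:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in sa;
    char host[1025];                       /* big enough for any hostname */

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    inet_pton(AF_INET, "192.0.2.1", &sa.sin_addr);   /* placeholder IP */

    if (getnameinfo((struct sockaddr *)&sa, sizeof sa,
                    host, sizeof host, NULL, 0, NI_NAMEREQD) == 0)
        printf("resolves to: %s\n", host);
    else
        printf("no PTR record found\n");
    return 0;
}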
Well, you have encountered a dilemma: how to get at the information users are exchanging with their servers when they have explicitly encrypted it for privacy. The quick response is: you can't. Only if you can break into the SSL connection will you get more information.
Even the SSL certificate exchanged between server and client will be of no help, as it only identifies the server, not the virtual host you're talking to behind this connection; with the feature known as HTTP virtual hosts, several servers can be listening for connections on the same port of the same address.
SSL parameters are negotiated just after the connection is made, and the virtual server is normally selected with the Host HTTP header field of the request (see RFC 2616), but that occurs after the SSL negotiation has finished, so you don't have access to it.
The only thing you can do for sure is to try to identify connections to YouTube by the volume and connection patterns this kind of traffic exhibits.

Detect end of CRL file when downloading across established tcp connection

For various reasons, I am trying to download a CRL file using crude tools in C. I'm opening a TCP connection using good old socket(), sending a hardcoded plaintext HTTP request via send(), reading the result into a buffer via recv(), and then writing that buffer to a file (which I will later use to verify various certs).
The recv() and write-to-file portions are inside a while loop so that I can get it all.
My problem is that I'm having a heck of a time coming up with a reliable means of determining when I'm done receiving the file (and therefore can break out of the while loop). Everything I've come up with so far has had either false positives or false negatives (getting back 0 bytes happens too frequently, and either the EOF marker wasn't there or I was looking at the wrong byte for it). Preferably, it would be a technique that doesn't introduce a lot of additional complexity.
Really, I have a host, port, and path (all as char*). On the far end there's a friendly HTTP server (though not one that I control). I'd be happy with anything that could get me the file without a large amount of additional code complexity. If I had access to a command line, I'd go for something like wget, but I haven't found any direct equivalents on the C API side, and system() is a poor choice for the situation.
'Getting back zero bytes', by which I assume you mean recv() returning zero, only happens when the peer has finished sending data and has closed the connection. Unless the peer is sending you multiple files per connection, this is an infallible sign of the end of this file. 'Too frequently' is nonsense: it can only happen once per connection.
But if the peer is an HTTP server, it should be sending you a Content-Length header. See RFC 2616.
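
Putting both points together, a hedged sketch (host, path and file names are placeholders; error handling is trimmed): parse Content-Length out of the response headers, then stop when that many body bytes have arrived, with recv() returning 0 as the fallback end-of-connection signal. HTTP/1.0 is requested to discourage chunked encoding:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    const char *host = "crl.example.com";        /* placeholder host */
    const char *request =
        "GET /some.crl HTTP/1.0\r\n"             /* placeholder path */
        "Host: crl.example.com\r\n"
        "Connection: close\r\n\r\n";

    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof hints);
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, "80", &hints, &res) != 0)
        return 1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0)
        return 1;
    freeaddrinfo(res);
    send(fd, request, strlen(request), 0);

    FILE *out = fopen("downloaded.crl", "wb");
    char hdr[16384], buf[8192];
    size_t hdr_len = 0;
    long content_length = -1, body_seen = 0;
    int in_body = 0;
    ssize_t n;

    while ((n = recv(fd, buf, sizeof buf, 0)) > 0) {
        if (!in_body) {
            if (hdr_len + (size_t)n >= sizeof hdr)
                return 1;                    /* header too large for sketch */
            memcpy(hdr + hdr_len, buf, (size_t)n);
            hdr_len += (size_t)n;
            hdr[hdr_len] = '\0';
            char *end = strstr(hdr, "\r\n\r\n");
            if (!end)
                continue;                    /* headers not complete yet */
            /* real code must match header names case-insensitively */
            char *cl = strstr(hdr, "Content-Length:");
            if (cl && cl < end)
                content_length = strtol(cl + 15, NULL, 10);
            in_body = 1;
            /* whatever followed the blank line is already body data */
            char *body = end + 4;
            size_t extra = hdr_len - (size_t)(body - hdr);
            fwrite(body, 1, extra, out);
            body_seen = (long)extra;
        } else {
            fwrite(buf, 1, (size_t)n, out);
            body_seen += n;
        }
        if (content_length >= 0 && body_seen >= content_length)
            break;                           /* got everything advertised */
    }
    /* n == 0 here means the server closed the connection: also a valid
     * end-of-file signal, as the answer above explains */
    fclose(out);
    close(fd);
    return (content_length < 0 || body_seen >= content_length) ? 0 : 1;
}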

Assign a new socket to client after receiving request from 8080 in server code

C language TCP server/client. I want to assign a new socket for a particular client which contacted my server on port 8080; let's say the new socket is 8081, to receive further requests, and I want to free the previous socket (8080) so that other clients can reach my server on 8080. Is there any way of doing this in C? (OS: Ubuntu) Thanks
Your problem statement is incorrect. You can't do this even if you wanted to. The way TCP sockets work is that accept() gives you a new socket for the incoming client connection, on the same port you are listening to. That's all you need, and it's all you can get. You can't 'allocate a new socket' to the client on a new port without engaging in another TCP handshake with him, which would be nothing but a complete waste of time when you already have a connection to him. The existing connection does not preclude another one being accepted while it is open. You need to read a TCP sockets networking tutorial.
Mat and EJP have said the pertinent things above, but I thought it might help others to describe the situation more verbosely.
A TCP/IP connection is identified by a four-tuple: target IP address, target TCP port number, source IP address, and source TCP port number. The kernel will keep track of established connections based on these four things. A single server port (and IP address) can be connected to thousands of clients at the same time, limited in practice only by the resources available.
When you have a listening TCP socket, it is bound to some IP address (or wildcard address) and TCP port. Such a socket does not receive data, only new connections. When accept() is called, the server notes the new four-tuple of the connection, and hands off the file descriptor that represents that connection (as the accept() return value). The original socket is free to accept new connections. Heck, you can even have more than one thread accepting new connections if you want to, although establishing new connections in Linux is so fast you shouldn't bother; it's just too insignificant to worry about.
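To make that concrete, a minimal sketch (port 8080 as in the question; error handling trimmed): every accept() returns a fresh descriptor, yet all clients connect to, and stay on, the single listening port.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);

    bind(lfd, (struct sockaddr *)&addr, sizeof addr);
    listen(lfd, 16);

    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof peer;
        /* new fd per client; lfd keeps accepting on port 8080 */
        int cfd = accept(lfd, (struct sockaddr *)&peer, &plen);
        if (cfd < 0)
            continue;
        printf("client %s:%d on fd %d (server port still 8080)\n",
               inet_ntoa(peer.sin_addr), ntohs(peer.sin_port), cfd);
        close(cfd);   /* a real server would hand cfd to a worker */
    }
}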
If establishing the connection at the application level is resource-intensive -- this is true, for example, for encrypted connections, where agreeing on an encryption scheme and preparing the needed data structures typically takes several orders of magnitude more CPU resources than a simple TCP connection -- then it is natural to want to avoid that overhead. Let's assume this is the point of the OP's question: to avoid unnecessary application-level connection establishment when a recent client needs another connection.
The preferred solution is connection multiplexing. Simply put, the application-level protocol is designed to allow multiple data streams via a single TCP connection.
The OP noted that it would be necessary/preferable to keep the existing application protocol intact, i.e. that the optimization should be completely on the server side, transparent to the clients.
This turns the recommended solution in a completely new direction. We should not talk about application protocols, but about how to efficiently implement the existing one.
Before we get to that, let's take a small detour.
Technically, it is possible to use the kernel packet filtering facilities to modify incoming packets to use a different port based on the source IP address, redirecting requests from specific IP addresses to separate ports, and making those separate ports otherwise inaccessible. Technically possible, but quite complex to implement, and with very questionable benefits.
So, let's ignore the direction OP assumed would bring the desired benefits, and look at the alternatives. Or, actually, the common approach used.
Structurally, your application has
- A piece of code accepting new connections
- A piece of code establishing the application-level resources needed for that connection
- A piece of code doing the communication with the client (serving the response to the client, per the client's request)
There is no reason for these three pieces to be consecutive, or even part of the same code flow. Use data structures to your advantage.
Instead of treating new incoming connections (accept()ed) as equal, they can be simply thrown into separate pools based on their source IP addresses. (Or, if you are up to it, have a data structure which clusters source IP addresses together, but otherwise keeps them in the order they were received.)
Whenever a worker completes a request by a client, it checks if that same client has new incoming connections. If yes, it can avoid most if not all of the application-level connection establishment by checking that the new connection matches the application-level parameters of the old one. (You see, it is possible that even if the source IP address is the same, it could be a completely different client, for example if the clients are under the same VPN or NATted subnet.)
There are quite a few warts to take care of, for example how to keep the priorities, and avoid starving new IP addresses if known clients try to hog the service.
For protocols like HTTP, where the client sends the request information as soon as the server accepts the connection, there is an even better pattern to apply: instead of connection pools, have request pools. A single thread or a thread pool can receive the requests (which may span multiple packets in most protocols) without acting on them, only detecting when each request is complete. (A careful server will limit the number of pending requests and the number of incomplete requests, to avoid vulnerability to DoS.)
When the requests are complete, they are grouped, so that the same "worker" who serves one request, can serve another similar request with minimal overhead. Again, some careful thought is needed to avoid the situation where a prolific client hogs the server resources by sending a lot of requests, but it's nothing some careful thought and testing won't resolve.
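As a tiny sketch of that "detect when the request itself is complete" step for an HTTP-style protocol (the function is illustrative; requests with bodies would also need Content-Length handling, which is omitted here):

#include <string.h>

/* Returns the length of a complete header block (including the blank
 * line that terminates it), or 0 if more data is still needed.
 * buf must be NUL-terminated. */
size_t request_complete(const char *buf)
{
    const char *end = strstr(buf, "\r\n\r\n");
    return end ? (size_t)(end - buf) + 4 : 0;
}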
One question remains:
Do you need to do this?
I'd wager you do not. Apache, which is one of the best HTTP servers, does not do any of the above. The performance benefits are not considered worth the extra code complexity. Could you write a new HTTP server (or a server for whatever protocol you're working with), and use a scheme similar to above, to make sure you can use your hardware as efficiently as possible? Sure. You don't even need to be a wizard, just do some research and careful planning, and avoid getting caught in minute details, keeping the big picture in mind at all times.
I firmly believe that code maintainability and security is more important than efficiency, especially when writing an initial implementation. The information gained from the first implementation has thus far always changed how I perceive the actual "problem"; similar to opening new eyes. It has always been worth it to create a robust, easy to develop and maintain, but not necessarily terribly efficient implementation, for the first generation. If there is someone willing to support the development of the next generation, you not only have the first generation implementation to compare (and verify and debug) against, but also all the practical knowledge gained.
That is also the reason old hands warn so often against premature optimization. In short, you end up optimizing resource waste and personal pain, not the implementation you're developing.
If I may, I'd recommend the OP back up a few steps, and actually describe what they intend to implement, what the observed problem with the implementation is, and suggestions on how to fix and avoid the problem. The current question is like asking how to better freeze a banana, as it keeps shattering when you hammer nails with it.

Trying to write a proxy server: Content-Length management problem

I am trying to write a proxy server in C under Linux. It was working fine (or I had the perception that it was working fine) until I tried it with streaming media.
Lemme first tell you the problem, and then I'll jump to streaming media.
To read the incoming data from the website and forward it to the actual client, I do this:
count = read(websitefd, buffer, BUFSIZ);
write(clientfd, buffer, count);
in a continuous while loop until I have read all the data on that socket.
Now the problem: if the actual website sends an HTTP packet with the Content-Length field as 1025 bytes and the rest of the data in further packets, I still always wait for BUFSIZ (8192 bytes) and then send all 8192 bytes to the client machine together. For a normal octet-stream it works fine, even though I know it's not the right method, because I should forward the packets just as the actual server sent them. So if the actual server sends me 2 packets of 1024 bytes each, I send the client one packet of 2048 bytes, where the first packet's HTTP header says that the content length is 900 bytes (assume the rest is the HTTP header), but I actually forward a packet of 2048 bytes to the client. For Content-Type: application/octet-stream it just downloads the whole thing and displays it as an image or HTML text, or asks me to save it.
When the client requests streaming media, the client is not able to play the video because of the above. So what should I do now? Thanks for reading my question. Please help me out. :)
First, I strongly recommend using an existing proxy server as the base of any proxy system. The HTTP standard is quite complex, much more than you realize. If you are going to implement a proxy server, read RFC 2616 at least three times first.
Second, your proxy server must parse the HTTP headers to figure out how much data it must send. The three main ways to know how much data to relay are as follows (a sketch of the first case follows the list):
- If a Content-Length header is present and no Transfer-Encoding header is present: the Content-Length header specifies how much data to relay, in bytes. Just go into a copying loop.
- If a Transfer-Encoding: chunked header is present: you must parse the chunked transfer encoding's chunk headers. This encoding is frequently used for streaming data, where the total size is not known ahead of time. It's also often used for dynamic data generated by scripts.
- If some other Transfer-Encoding header is present: close the connection and report a 500 error, unless you know what that encoding is.
- If no Content-Length header is present and no Transfer-Encoding header is present: check for Connection: close (must be present in HTTP/1.1) and Connection: keep-alive (must NOT be present in HTTP/1.0). If these conditions are violated, trigger a 500 error. Otherwise just keep passing data through until the server closes the connection.
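
A hedged sketch of the first case (the helper name relay_body is made up): relay exactly Content-Length bytes, forwarding each read immediately instead of waiting to fill a full BUFSIZ buffer, which is what broke streaming in the question.

#include <unistd.h>

int relay_body(int websitefd, int clientfd, long content_length)
{
    char buf[8192];
    long remaining = content_length;

    while (remaining > 0) {
        size_t want = remaining < (long)sizeof buf ? (size_t)remaining
                                                   : sizeof buf;
        ssize_t n = read(websitefd, buf, want);
        if (n <= 0)
            return -1;               /* error or premature close */
        /* write() may be partial; loop until the chunk is forwarded */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(clientfd, buf + off, (size_t)(n - off));
            if (w <= 0)
                return -1;
            off += w;
        }
        remaining -= n;
    }
    return 0;
}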
I'm deliberately keeping this a bit vague: you MUST read the standard if you're implementing a proxy server from scratch, or you will certainly introduce browser incompatibilities and/or security holes! So please, don't do so. Use lighttpd or varnish or something similar as the core proxy server, and just write a plugin for whatever functionality you need.
I suppose the media is transferred in chunks, i.e. no Content-Length is present and data is sent until finished.
As bdonlan said, please read up on how chunked data works.
And I agree, HTTP is pretty nasty (due to many changes and reinterpretations over time).
