Detect end of CRL file when downloading across established TCP connection - C

For various reasons, I am trying to download a CRL file using crude tools in C. I'm opening a TCP connection using good old socket(), sending a hardcoded plaintext HTTP request via send(), reading the results into a buffer via recv(), and then writing that buffer into a file (which I will later use to verify various certs).
The recv() and write-to-file portions are inside a while loop so that I can get it all.
My problem is that I'm having a heck of a time coming up with a reliable means of determining when I'm done receiving the file (and therefore can break out of the while loop). Everything I've come up with so far has either had false positives or false negatives (getting back 0 bytes happens too frequently, and either the EOF marker wasn't there or I was looking in the wrong byte for it). Preferably, it would be a technique that wouldn't introduce a lot of additional complexity.
Really, I have a host, port, and a path (all as char*). On the far end, there's a friendly http server (though not one that I control). I'd be happy with anything that could get me the file without a large quantity of additional code complexity. If I had access to a command line, I'd go for something like wget, but I haven't found any direct equivalents over on the C API side, and system() is a poor choice for the situation.

'Getting back zero bytes', by which I assume you mean recv() returning zero, only happens when the peer has finished sending data and has closed the connection. Unless the peer is sending you multiple files per connection, this is an infallible sign of the end of this file. 'Too frequently' is nonsense: it can only happen once per connection.
But if the peer is an HTTP server, it should be sending you a Content-Length header. See RFC 2616.
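If you stay with raw sockets, one way to make recv() returning 0 a trustworthy end-of-file marker is to force the server to close the connection after the body, e.g. by sending an HTTP/1.0 request or a Connection: close header. A minimal sketch along those lines, assuming an already-connected socket and a hypothetical fetch_crl() helper (the header skipping is deliberately crude; parsing Content-Length would be more robust):

#define _GNU_SOURCE          /* for memmem(), a GNU extension */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Minimal sketch, not production code: assumes `sock` is already connected
 * to the HTTP server and `out` is the FILE* for the CRL.  Using HTTP/1.0
 * (or "Connection: close" with 1.1) makes the server close the connection
 * after the body, so recv() returning 0 reliably marks end of file. */
static int fetch_crl(int sock, const char *host, const char *path, FILE *out)
{
    char req[512];
    snprintf(req, sizeof req,
             "GET %s HTTP/1.0\r\nHost: %s\r\nConnection: close\r\n\r\n",
             path, host);
    if (send(sock, req, strlen(req), 0) < 0)
        return -1;

    char buf[4096];
    ssize_t n;
    int in_body = 0;
    while ((n = recv(sock, buf, sizeof buf, 0)) > 0) {
        char *p = buf;
        size_t len = (size_t)n;
        if (!in_body) {
            /* Crude header skip: the body starts after the blank line.  A
             * real client would buffer across recv() boundaries and check
             * the status line and Content-Length. */
            char *body = memmem(buf, len, "\r\n\r\n", 4);
            if (!body)
                continue;
            p = body + 4;
            len -= (size_t)(p - buf);
            in_body = 1;
        }
        fwrite(p, 1, len, out);
    }
    return n == 0 ? 0 : -1;   /* 0 bytes: peer closed, file is complete */
}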

Related

How to notify server that client is closing connection

I have a server that is running a select() loop that sometimes continues blocking when the client closes the connection from its side. The select() loop handles all other read/write operations correctly and sets the correct file descriptor in the fd_set, leading me to believe that it is not an issue with the file descriptor setup on the server-side.
The way I planned on handling the client closing the connection was to have the select() break due to activity on the socket (closing it from the client-side), see that the fd was set for that socket, and then try to read from it - and if the read returned 0, then close the connection. However, because the select() doesn't always return when the client side closes the connection, there is no attempt to check the fd_set and subsequently try to read from the socket.
As a workaround, I implemented a "stop code" that the client writes to the server just before closing the connection, and this write causes the select() to break and the server reads the "stop code" and knows to close the socket. The only problem with this solution is the "stop code" is an arbitrary string of bytes that could potentially appear in regular traffic, as the normal data being written can contain random strings that could potentially contain the "stop code". Is there a better way to handle the client closing the connection from its end? Or is the method I described the general "best practice"?
I think my issue has something to do with OpenSSL, as the connection in question is an OpenSSL tunnel, and it is the only file descriptor in the set giving me issues.
The way I planned on handling the client closing the connection was to have the select() break due to activity on the socket (closing it from the client-side), see that the fd was set for that socket, and then try to read from it - and if the read returned 0, then close the connection. However, because the select() doesn't always return when the client side closes the connection, there is no attempt to check the fd_set and subsequently try to read from the socket.
Regardless of whether you are using SSL or not, select() can tell you when the socket is readable (has data available to read), and a graceful closure is a readable condition (a subsequent read operation reports 0 bytes read). It is only abnormal disconnects that select() can't report (unless you use the exceptfds parameter, but even that is not always guaranteed). The best way to handle abnormal disconnects is to simply use timeouts in your own code. If you don't receive data from the client for a while, just close the connection. The client will have to send data periodically, such as a small heartbeat command, if it wants to stay connected.
Also, when using OpenSSL, if you are using the SSL_... API functions (SSL_new(), SSL_set_fd(), SSL_read(), SSL_write(), etc.), make sure you are NOT just blindly calling select() whenever you want; call it ONLY when OpenSSL tells you to (when an SSL read/write operation reports an SSL_ERROR_WANT_(READ|WRITE) error). This is an area where a lot of OpenSSL newbies make the same mistake. They try to use OpenSSL on top of pre-existing socket logic that waits for a readable notification before then reading data. This is the wrong way to use the SSL_... API. You are expected to ask OpenSSL to perform a read/write operation unconditionally, and then, if it needs to wait for new data to arrive or pending data to send out, it will tell you and you can call select() accordingly before retrying the SSL read/write operation.
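A rough sketch of that pattern (the function name and the assumption of a non-blocking socket already attached with SSL_set_fd() are mine):

#include <openssl/ssl.h>
#include <sys/select.h>

/* Attempt the SSL operation first, and only select() when OpenSSL reports
 * WANT_READ/WANT_WRITE.  Assumes `fd` is the non-blocking socket wrapped by
 * `ssl`. */
static int ssl_read_retry(SSL *ssl, int fd, void *buf, int len)
{
    for (;;) {
        int n = SSL_read(ssl, buf, len);
        if (n > 0)
            return n;                      /* got application data */

        int err = SSL_get_error(ssl, n);
        fd_set rfds, wfds;
        FD_ZERO(&rfds);
        FD_ZERO(&wfds);

        if (err == SSL_ERROR_WANT_READ)
            FD_SET(fd, &rfds);             /* wait until the socket is readable */
        else if (err == SSL_ERROR_WANT_WRITE)
            FD_SET(fd, &wfds);             /* e.g. renegotiation needs to send */
        else
            return n;                      /* 0 = clean shutdown, <0 = error */

        if (select(fd + 1, &rfds, &wfds, NULL, NULL) < 0)
            return -1;
        /* loop and retry SSL_read() */
    }
}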
On the other hand, if you are using the BIO_... API functions (BIO_new(), BIO_read(), BIO_write(), etc.), you can take control of the underlying socket I/O and not let OpenSSL manage it for you, and thus you can do whatever you want with select() (or any other socket API you want).
As a workaround, I implemented a "stop code" that the client writes to the server just before closing the connection, and this write causes the select() to break and the server reads the "stop code" and knows to close the socket.
That is a very common approach in many Internet protocols, regardless of whether SSL is used or not. It is a very distinct and explicit way for the client to say "I'm done" and both parties can then close their respective sockets.
The only problem with this solution is the "stop code" is an arbitrary string of bytes that could potentially appear in regular traffic, as the normal data being written can contain random strings that could potentially contain the "stop code".
Then either your communication protocol is not designed properly, or your code is not processing the protocol correctly. In a properly-designed and correctly-processed protocol, there will not be any such ambiguity. There needs to be a clear distinction between the various commands that your protocol defines. Your "stop code" would be one such command amongst other commands. Random data in one command should not be mistakenly treated as a different command. If you are experiencing that problem, you need to fix it.
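For example, a simple length-prefixed framing scheme removes the ambiguity entirely; the message layout below is invented for illustration:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Hypothetical framing: [1-byte type][4-byte big-endian payload length][payload].
 * Because the receiver always knows how many payload bytes follow, random data
 * inside a DATA message can never be mistaken for the QUIT command. */
enum { MSG_DATA = 1, MSG_QUIT = 2 };

static int send_msg(int sock, uint8_t type, const void *payload, uint32_t len)
{
    uint8_t hdr[5];
    uint32_t nlen = htonl(len);
    hdr[0] = type;
    memcpy(hdr + 1, &nlen, 4);
    if (send(sock, hdr, sizeof hdr, 0) != (ssize_t)sizeof hdr)
        return -1;
    return (len == 0 || send(sock, payload, len, 0) == (ssize_t)len) ? 0 : -1;
}

/* Client side, just before disconnecting: */
/* send_msg(sock, MSG_QUIT, NULL, 0); */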

For TCP, does returning from "write()" mean that the peer app has "read()" the data?

I'm writing a client/server program and both the client and server may send data to the peer (without an explicit ack) at arbitrary times. I'm wondering if it could possibly deadlock if the client and server coincidentally write to the peer at the same time.
So does returning from write() mean that the peer application has already read() the data? Or it only means the peer's kernel has got the data and would deliver to the app on next read()?
(EJP's answer fixed my totally wrong understanding about write()/send()/.... To add some authoritative info I found this in the POSIX standard about send:
Successful completion of a call to send() does not guarantee delivery of the message. A return value of -1 indicates only locally-detected errors.
Linux's man page about send() is not very clear:
No indication of failure to deliver is implicit in a send(). Locally detected errors are indicated by a return value of -1.
Or maybe it's just that, as a non-native English speaker, I cannot fully understand the first sentence.)
I'm wondering if it could possibly deadlock if the client and server coincidentally write to the peer at the same time.
It can't, unless one or both of the peers is very slow reading and has closed its receive window. TCP is full-duplex.
So does returning from write() mean that the peer application has already read() the data?
No.
Or it only means the peer's kernel has got the data and would deliver to the app on next read()?
No.
It means the data has reached your kernel, and is queued for transmission.
So returning from write() means the TCP ACK has already been received.
No it doesn't.
You mean returning from write() only means the data has reached the sender's kernel?
That is not only what I meant, it is what I said.
I'd have thought the sender has already received the TCP ACK, so the data has reached the peer's kernel.
No.
No. If you think of it as a data pipe, returning from write means that your data has entered the pipe, not that it has exited the pipe at the other end.
In fact, since the pipe is one where the data may take any of hundreds of different pathways, you're not even guaranteed that it will reach the other end :-) If that happens, you'll be notified about it at some later date, probably by a subsequent write failing.
It may be blocked:
trying to exit your machine due to a broken cable,
at a bottleneck in the path somewhere,
by the networking stack at the destination,
by a networking stack at some device in the networking path, more intelligent than a simple hub,
because the application at the other end is otherwise tied up,
and so on.
A successful return from write means that your local network stack has accepted your data and will process it in due course.
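A related consequence is that send()/write() may queue only part of your buffer when the local send buffer is nearly full, so the usual pattern is a loop like the sketch below; note that even a successful return says nothing about delivery to the peer application:

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Keep calling send() until the whole buffer has been handed to the local
 * kernel.  A return of 0 from this function still says nothing about whether
 * the peer application has read (or will ever read) the data. */
static int send_all(int sock, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = send(sock, buf, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry */
            return -1;             /* real error (e.g. ECONNRESET, later) */
        }
        buf += n;                  /* n bytes queued locally, advance */
        len -= (size_t)n;
    }
    return 0;
}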

Interrupt download (recv) of a file through socket

In an application I'm currently working on, I need to stop downloading some file if I realize it's not what I'm looking for. The protocol doesn't provide any way to know it before I start receiving the file (like headers or so).
As an example, in some cases I might be looking for a file of exactly X bytes in size, but after I have downloaded X bytes and I keep getting more bytes, this is not the file I'm looking for as its size is greater than X. In this case, I want to stop downloading to free network bandwidth resources. The protocol doesn't provide any way to notify the server about this.
I read somewhere that close(fd) or shutdown(fd, SHUT_RD) won't actually stop the download, as the server will continue to send() the file and this will continue to consume network bandwidth. I am also not sure what happens if I just stop calling recv() while packets keep arriving: will they fill the buffer and then start to be discarded? If it matters, the protocol used is based on TCP (but I would like a solution that can also be used for UDP-based protocols).
I became even more doubtful that stopping calls to recv() would solve it after I looked into programmatic bandwidth control (sleep(), token bucket, ...) as an alternative (reduce the download speed to about zero once I realize it's not the file I'm looking for). How can I control network bandwidth usage by reducing recv() calls if the server will still be send()ing? I didn't get it.
The main idea is to stop the download entirely.
What would you suggest?
I read somewhere that close(fd) or shutdown(fd, SHUT_RD) won't actually stop the download, as the server will continue to send() the file and this will continue to consume network bandwidth.
If you shutdown(fd, SHUT_RD), your own recv() will unblock with a return code of zero, which will cause your code to close the socket, which will cause the local host to issue an RST if any more data comes from the peer, which will cause an ECONNRESET at the sender (after a few more send() calls, not necessarily immediately).
Where did you read this nonsense?
I became even more doubtful that stopping calls to recv() would solve it
It won't solve it, but it will eventually stop the sender from sending, because of TCP flow control. It isn't a solution to this problem.
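So the practical answer is simply to close the connection and let TCP do the rest. A sketch of an abortive close; setting SO_LINGER to zero makes close() send an RST immediately rather than a normal FIN, which stops the sender fastest (this applies to TCP only, not UDP):

#include <sys/socket.h>
#include <unistd.h>

/* Abort an unwanted download.  With l_onoff=1 and l_linger=0, close() sends
 * an RST instead of a normal FIN, so the peer's next send() fails with
 * ECONNRESET and it stops wasting bandwidth on us. */
static void abort_download(int sock)
{
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof lg);
    close(sock);
}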

HTTPS protocol file integrity

I understand that when you send a file from a client to a server using the HTTP/HTTPS protocols, you have the guarantee that all data sent successfully arrived at the destination. However, if you are sending a huge file and the internet connection suddenly goes down, not all packets are sent and, therefore, you lose the logical integrity of the file.
Is there any point I am missing in my statement?
I would like to know if there is a way for the destination node to check file logical integrity without using a "custom code/api".
HTTPS is just HTTP over a TLS layer, so all of this applies to HTTPS, too:
HTTP is typically transported over TCP/IP. Now, TCP retransmits lost packets and checksums each segment (i.e. the probability that data gets altered in transit without the receiver noticing is minor). So if you're really just transferring data, you're basically set (as long as your HTTP server is configured to send the length of your file in bytes, which, at least for static files, it usually is).
If the transfer stops before the client has received the full file size advertised in the HTTP reply the server sends, your client will know! Many HTTP libraries/clients can resume interrupted transfers (if the server supports it).
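As an aside on what such a resume looks like on the wire: the client re-requests only the missing tail with a Range header (a standard HTTP feature). A rough C sketch that just builds the request string; the helper name is illustrative:

#include <stdio.h>

/* Resume a download from `have` bytes onward.  If the server supports ranges
 * it answers "206 Partial Content"; a "200 OK" means it ignored the Range
 * header and is sending the whole file again. */
static int build_resume_request(char *req, size_t reqsz,
                                const char *host, const char *path, long have)
{
    return snprintf(req, reqsz,
                    "GET %s HTTP/1.1\r\n"
                    "Host: %s\r\n"
                    "Range: bytes=%ld-\r\n"
                    "Connection: close\r\n\r\n",
                    path, host, have);
}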
RFC 2616 (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.15) even specifies an MD5 checksum header field (Content-MD5). You can configure web servers to use that field, and clients might use it to verify the overall file integrity.
EDIT: Content-MD5 as specified by RFC 2616 seems to be deprecated. You can now use a content digest, which is much more flexible.
Also, you mention that you want to check the file that a client sends to a server. That problem might be quite a bit harder -- whilst you're usually in total control of your web server, you can't force an arbitrary client (e.g. a browser) to hash its file before uploading.
If, on the other hand, you're in fact in control of the client's HTTP implementation, you could most probably also use something more file-transfer oriented than plain HTTP: think WebDAV, AtomPub, etc., which are protocols on top of HTTP, or even more file-exchange-oriented protocols like rsync (which I'd heartily recommend if you're actually syncing stuff, since it reduces network usage to a minimum if both sides' versions only differ partially). If for some reason your users share most of their data within a well-defined circle (for example, you're building something where photographers share their albums), you might even just use BitTorrent, which has per-chunk hashing, extensive load-balancing options, and allows for "plain old HTTP seeds".
There are several issues here:
As Marcus stated in his answer, TCP protects your bytes from being accidentally corrupted, but it doesn't help if the download was interrupted
HTTPS additionally ensures that those bytes weren't tampered with between server and client (you)
If you want to verify the integrity of a file (whether or not its transfer was interrupted), you should use a checksum designed to protect against accidental corruption (e.g. CRC32; there may be better ones, so check). A minimal example follows below.
If in addition you use HTTPS, then you're safe from intentional attacks too, because you know your checksum is OK and that the file parts you got weren't tampered with.
If you use a checksum but don't use HTTPS (though you really should), you're safe against accidental data corruption but not against malicious attacks. That can be mitigated, but it's outside the scope of this question.
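A sketch of that accidental-corruption check using zlib's crc32() (link with -lz); where you get the expected checksum from (ideally over HTTPS) is up to you:

#include <stdio.h>
#include <zlib.h>

/* Checksum a downloaded file with zlib's crc32().  This only detects
 * accidental corruption or truncation; compare the result against a checksum
 * obtained over a trusted channel (e.g. published next to the download). */
static unsigned long crc32_of_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;

    unsigned long crc = crc32(0L, Z_NULL, 0);   /* initial value */
    unsigned char buf[8192];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        crc = crc32(crc, buf, (uInt)n);
    fclose(f);
    return crc;
}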
In HTTP/1.1, the recipient can always detect whether it received a complete message (either by comparing the Content-Length, or by properly handling Transfer-Encoding: chunked).
(Adding content hashes can help if you suspect bit errors on the transport layer.)
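A rough sketch of the Content-Length comparison; the helper names are mine, chunked responses would need different handling, and strcasestr() is a GNU/BSD extension (hence _GNU_SOURCE):

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>

/* Pull Content-Length out of a NUL-terminated header block, or -1 if absent. */
static long parse_content_length(const char *headers)
{
    const char *p = strcasestr(headers, "content-length:");
    if (!p)
        return -1;                                  /* header absent */
    return strtol(p + strlen("content-length:"), NULL, 10);
}

/* The body is complete only if the received byte count matches. */
static int body_is_complete(const char *headers, long bytes_received)
{
    long expect = parse_content_length(headers);
    return expect >= 0 && bytes_received == expect;
}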

Programmatically detect if local web server has hung

I realise that I'll get at least one answer along the lines of "(re)write the code so it doesn't hang" but let's assume we don't live in that shiny happy utopia just yet...
In our embedded system we have a big SDK including a web-server (Boa) which is the primary method of user interaction.
It's possible, during certain phases of the moon, that something can cause the web server to hang or become otherwise stuck in such a way that the process appears running normally (not crashed/dead/using 100% CPU) but does not serve any web pages.
So, the question is, how do we test/detect this situation?
To test whether the server is hung, create a TCP socket and connect to port 80 on IP address 127.0.0.1 (loopback address). Then send the following text over the socket
GET / HTTP/1.1\r\n\r\n
Most servers will interpret that as a request for index.html. Alternatively, you could implement an undocumented URL for testing (which allows for a shorter, predetermined response), e.g.
GET /test/fdoaoqfaf12491r2h1rfda HTTP/1.1\r\n\r\n
You then need to read the response from the server. This involves using select with a reasonable timeout to determine whether any data came back from the server, and if so, use recv to read the data. The response from the server will consist of a header followed by content. The header consists of lines of text, with a blank line at the end of the header. Lines end with \r\n, so the end of the header is \r\n\r\n.
Getting the content involves calling select and recv until recv returns 0. This assumes that the server will send the response and then close the socket. Some sophisticated servers will leave a socket open to allow multiple requests over the same socket. A simple embedded server should not be doing that. (If your server is trying to use the same socket for multiple requests, then you need to figure out how to turn that feature off.)
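Putting those steps together, a probe might look something like this sketch; the timeout, the helper name, and the choice to request "/" are my own, and a Host header is included because strict HTTP/1.1 servers require one:

#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Probe the local web server: connect, send a request, and expect *some*
 * bytes back within `timeout_sec`.  Returns 1 if the server answered,
 * 0 if it appears hung, -1 on local errors. */
static int server_is_alive(int timeout_sec)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);    /* 127.0.0.1 */

    int alive = 0;
    if (connect(sock, (struct sockaddr *)&addr, sizeof addr) == 0) {
        const char req[] = "GET / HTTP/1.1\r\nHost: 127.0.0.1\r\n\r\n";
        if (send(sock, req, sizeof req - 1, 0) == (ssize_t)(sizeof req - 1)) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);
            struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

            char buf[512];
            if (select(sock + 1, &rfds, NULL, NULL, &tv) > 0 &&
                recv(sock, buf, sizeof buf, 0) > 0)
                alive = 1;                            /* got a response */
        }
    }
    close(sock);
    return alive;
}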
That's all very well and good, but you really need to rewrite your code so it doesn't hang.
The most likely cause of the problem is that the server has a bunch of dangling sockets, i.e. connections from clients that were never properly cleaned up. Dangling sockets will eventually prevent the server from accepting more connections, either because the server has a limit on the number of open connections, or because the process that's running the server uses up all of its file descriptors.
The first thing to check is the TCP timeout value. One project that I worked on had a default timeout of 5 hours, which meant that dangling sockets stayed open for 5 hours. A reasonable timeout is 1 minute.
Then you need to create a client that deliberately misbehaves. Clients can misbehave by
leaving a socket open without reading the server's response
abruptly closing the socket while reading the response
gracefully closing the socket while reading the response
The first situation should be handled by the TCP timeout. The other two need to be properly handled by the server code. Graceful and abrupt socket closure is controlled via the SO_LINGER socket option (set with setsockopt()) and the shutdown() function. After the client misbehaves, check the number of open file descriptors in the server process to verify that the server has handled the situation correctly.
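For that last check, on a Linux target (which embedded Boa systems usually are) you can count the server's open descriptors through /proc; the pid argument and helper name here are illustrative:

#include <stdio.h>
#include <dirent.h>
#include <sys/types.h>

/* Count the server process's open file descriptors by listing /proc/<pid>/fd.
 * Run the misbehaving client, wait for the server's timeout, then check that
 * this number has gone back down. */
static int count_open_fds(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%d/fd", (int)pid);

    DIR *dir = opendir(path);
    if (!dir)
        return -1;

    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL)
        if (ent->d_name[0] != '.')      /* skip "." and ".." */
            count++;
    closedir(dir);
    return count;
}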
