detecting end of http header with \r\n\r\n - c

Using recv, I want to get the HTTP header so I can parse it for a Content-Length. However, I'm having trouble detecting the line break. Or do I even have to detect the line break? Will the first read into the buffer always contain the complete header (assuming I have a long enough buffer)?
This is written in C.
edit: looking at some of the related questions one of the things I am worried about is
"...the "\r\n" of the header break might be pulled into your buffer by two different calls to recv() which would prevent your code from recognizing the header break."

You should call recv() repeatedly. Each time it returns x bytes, advance the buffer pointer you pass to it by x bytes (and reduce the number of bytes it is allowed to write by x as well). Do this until your buffer either contains \r\n\r\n or is completely full, in which case you just close the socket and ignore the malicious client from then on. A buffer size of about 3000 bytes should be enough.
But: this ignores the general problem that your server seems to be a polling-server. If you have some experience you should try to make an epoll-server instead.
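The accumulate-and-rescan loop described in this answer could be sketched roughly as follows. The name recv_header and its return convention are assumptions made for illustration, not part of the answer. The scan restarts three bytes before the newly received data, so a terminator split across two recv() calls is still found.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (name and return convention are assumptions).
 * Reads from fd until buf holds "\r\n\r\n" or cap bytes have been stored.
 * Returns the header length including the terminator, or -1 on error,
 * EOF, or a full buffer. *used receives the total number of bytes stored,
 * which may already include the start of the message body. */
static ssize_t recv_header(int fd, char *buf, size_t cap, size_t *used)
{
    size_t len = 0;

    assert(buf != NULL && used != NULL);
    while (len < cap) {
        ssize_t n = recv(fd, buf + len, cap - len, 0);
        if (n <= 0)
            break;                      /* error or connection closed */

        /* Rescan from 3 bytes before the new data, so a terminator that
         * was split across two recv() calls is still recognized. */
        size_t i = len >= 3 ? len - 3 : 0;
        len += (size_t)n;
        for (; i + 4 <= len; i++) {
            if (memcmp(buf + i, "\r\n\r\n", 4) == 0) {
                *used = len;
                return (ssize_t)(i + 4);
            }
        }
    }
    *used = len;
    return -1;
}
```

If the call succeeds, the bytes between the returned header length and *used already belong to the body and should not be read again.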

In addition to the problem of identifying "\r\n\r\n" across packet boundaries, you have the problem of identifying "Content-Length: xxxx\r\n" across packet boundaries. I suggest receiving and parsing one byte at a time. When you get a recv() of '\r' followed by a recv() of '\n', followed by a recv() of '\r', followed by a recv() of '\n', you can be sure the header has ended. Once you've grasped this, adapt your solution to receive and parse n bytes at a time, where n is a preprocessor definition set to 1 initially, and then increase n.
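One way to make the byte-at-a-time idea independent of chunk size is a small matcher that remembers how much of "\r\n\r\n" it has seen so far. This is only a sketch; the names hdr_scanner and hdr_scan are hypothetical. Because the progress counter lives outside the call, it does not matter how recv() happens to split the data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: how many bytes of "\r\n\r\n" have matched so far. */
typedef struct {
    int matched;   /* always in the range 0..3 between calls */
} hdr_scanner;

/* Feeds one chunk of any size (1 byte or more) to the scanner.
 * Returns the offset just past the terminator within this chunk,
 * or -1 if the terminator has not been completed yet. */
static long hdr_scan(hdr_scanner *s, const char *data, size_t len)
{
    static const char pat[4] = { '\r', '\n', '\r', '\n' };

    assert(s != NULL && s->matched >= 0 && s->matched < 4);
    for (size_t i = 0; i < len; i++) {
        if (data[i] == pat[s->matched])
            s->matched++;
        else
            s->matched = (data[i] == '\r') ? 1 : 0; /* a '\r' may start a new match */

        if (s->matched == 4) {
            s->matched = 0;             /* ready for the next message */
            return (long)(i + 1);
        }
    }
    return -1;
}
```

Starting with a chunk size of 1 and raising it later, as the answer suggests, needs no change to this function, only to the size passed to recv().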

In the end I did something like this:
while ( recv(...) > 0 ) {
    if "\r\n\r\n" is inside the buffer (checked using strstr)
        look for content length, output error if content length doesn't exist
    else
        keep on reading into the buffer
}
and then once the header is found I keep on reading for the message body.
anyway thanks guys, ended up doing a really inefficient way to get my answer but what must be done is done.
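The "look for content length" step might be sketched like this, assuming the header block has already been collected and NUL-terminated. The function name parse_content_length is hypothetical. The field name is matched at the start of a line and without regard to case, because HTTP header field names are case-insensitive; strncasecmp is a POSIX function from <strings.h>.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: scans a NUL-terminated header block line by line.
 * Returns the Content-Length value, or -1 if the field is missing or
 * its value is not a plain non-negative decimal number. */
static long parse_content_length(const char *headers)
{
    static const char name[] = "Content-Length:";
    const char *line = headers;

    assert(headers != NULL);
    while (line != NULL && *line != '\0') {
        if (strncasecmp(line, name, sizeof name - 1) == 0) {
            const char *p = line + sizeof name - 1;
            char *end;

            while (*p == ' ' || *p == '\t')
                p++;                         /* skip optional whitespace */
            if (!isdigit((unsigned char)*p))
                return -1;
            errno = 0;
            long v = strtol(p, &end, 10);
            if (errno == ERANGE)
                return -1;                   /* value does not fit in a long */
            if (*end != '\r' && *end != '\0' && *end != ' ' && *end != '\t')
                return -1;                   /* trailing garbage after the digits */
            return v;
        }
        line = strstr(line, "\r\n");         /* advance to the next line */
        if (line != NULL)
            line += 2;
    }
    return -1;
}
```

Once the value is known, the body is complete when that many bytes have been received after the "\r\n\r\n" terminator.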

Related

How many bytes should I read/write to a socket?

I'm having some doubts about the number of bytes I should write/read through a socket in C on Unix. I'm used to sending 1024 bytes, but this is really too much sometimes when I send short strings.
I read a string from a file, and I don't know how many bytes this string is, it can vary every time, it can be 10, 20 or 1000. I only know for sure that it's < 1024. So, when I write the code, I don't know the size of bytes to read on the client side, (on the server I can use strlen()). So, is the only solution to always read a maximum number of bytes (1024 in this case), regardless of the length of the string I read from the file?
For instance, with this code:
read(socket,stringBuff,SIZE);
wouldn't it be better if SIZE is 10 instead of 1024 if I want to read a 10 byte string?
In the code in your question, if there are only 10 bytes to be read, then it makes no difference whether SIZE is 10 bytes, 1,024 bytes, or 1,000,024 bytes - it'll still just read 10 bytes. The only difference is how much memory you set aside for it, and if it's possible for you to receive a string up to 1,024 bytes, then you're going to have to set aside that much memory anyway.
However, regardless of how many bytes you are trying to read in, you always have to be prepared for the possibility that read() will actually read a different number of them. Particularly on a network, when you can get delays in transmission, even if your server is sending a 1,024 byte string, less than that number of bytes may have arrived by the time your client calls read(), in which case you'll read less than 1,024.
So, you always have to be prepared for the need to get your input in more than one read() call. This means you need to be able to tell when you're done reading input - you can't rely alone on the fact that read() has returned to tell you that you're done. If your server might send more than one message before you've read the first one, then you obviously can't hope to rely on this.
You have three main options:
1. Always send messages which are the same size, perhaps padding smaller strings with zeros if necessary. This is usually suboptimal for a TCP stream. Just read until you've received exactly this number of bytes.
2. Have some kind of sentinel mechanism for telling you when a message is over. This might be a newline character, a CRLF, a blank line, or a single dot on a line followed by a blank line, or whatever works for your protocol. Keep reading until you have received this sentinel. To avoid making inefficient system calls of one character at a time, you need to implement some kind of buffering mechanism to make this work well. If you can be sure that your server is sending you lines terminated with a single '\n' character, then using fdopen() and the standard C I/O library may be an option.
3. Have your server tell you how big the message is (either in an initial fixed length field, or using the same kind of sentinel mechanism from point 2), and then keep reading until you've got that number of bytes.
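As an illustration of the third option, here is a minimal sketch of a fixed-length size field. The 4-byte big-endian layout and the names frame_encode and frame_peek are assumptions chosen for the example; any agreed-upon layout works as long as both sides use the same one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing: 4-byte big-endian length, then the payload.
 * Returns the number of bytes written to out, or 0 if out is too small. */
static size_t frame_encode(const void *payload, uint32_t n,
                           unsigned char *out, size_t cap)
{
    assert(out != NULL && (payload != NULL || n == 0));
    if (cap < 4 || cap - 4 < n)
        return 0;
    out[0] = (unsigned char)(n >> 24);
    out[1] = (unsigned char)(n >> 16);
    out[2] = (unsigned char)(n >> 8);
    out[3] = (unsigned char)n;
    if (n > 0)
        memcpy(out + 4, payload, n);
    return 4 + (size_t)n;
}

/* Checks whether the len bytes accumulated in buf start with a complete
 * frame. Returns 1 and sets *payload and *n if so, or 0 if more data
 * has to be read first. */
static int frame_peek(const unsigned char *buf, size_t len,
                      const unsigned char **payload, uint32_t *n)
{
    assert(buf != NULL && payload != NULL && n != NULL);
    if (len < 4)
        return 0;                       /* length field itself is incomplete */
    uint32_t size = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                    ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (len - 4 < size)
        return 0;                       /* payload is incomplete */
    *payload = buf + 4;
    *n = size;
    return 1;
}
```

The receiver keeps appending whatever read() returns to its buffer and calls frame_peek after each read until it reports a complete frame.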
The read() system call blocks until it can read one or more bytes, or until an error occurs.
It DOESN'T guarantee that it will read the number of bytes you request! With TCP sockets, it's very common that read() returns less than you request, because it can't return bytes that are still propagating through the network.
So, you'll have to check the return value of read() and call it again to get more data if you didn't get everything you wanted, and again, and again, until you have everything.
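The "call it again, and again" loop is short enough to write out. This is a sketch with a hypothetical name, read_full; it retries after EINTR and stops early only on end of stream or a real error.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling read() until want bytes have
 * arrived. Returns the number of bytes read, which is less than want
 * only if the stream ended first, or -1 on error. */
static ssize_t read_full(int fd, void *buf, size_t want)
{
    size_t got = 0;

    assert(buf != NULL);
    while (got < want) {
        ssize_t n = read(fd, (char *)buf + got, want - got);
        if (n == 0)
            break;                      /* end of stream */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

A caller that knows the message is exactly N bytes long can treat any return value other than N as a failed or truncated message.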

How to find out how much I should read from a socket?

In .NET there is the DataAvailable property in the network stream and the Available property in the tcp client.
However, Silverlight lacks those.
Should I send a header with the length of the message? I'd rather not waste network resources.
Is there any other way?
You are micro-optimizing. Why do you think that another 4 bytes would affect the performance?
In other words: Use a length header.
Update
I saw your comment on the other answer. You are using BeginRead in the wrong way. It will never block or wait until the entire buffer has been filled.
You should declare a buffer which can receive your entire message. The return value from EndRead will report the number of bytes received.
You should also know that TCP is stream based. There are no guarantees that your entire JSON message will be received at once (or that only your first message is received). Therefore you must have some way to know when a message is complete.
And I say it again: A length header will hardly affect the performance.
What do you mean by 'waste network resources'? Every network read API I am aware of returns the actual number of bytes read, somehow. What's the actual problem here?

C winsock "rolling parsing"

i'm trying to receive data from a server and parse it.
http://pastebin.com/1kjXnXwq
http://pastebin.com/XpGSgRBh
everything works as is, but i want to parse the data instead of just grabbing blocks of it and printing it out. so is there any way to grab data from the winsock until \n, then stop and pass it off to another function to be parsed, and once that function returns, continue reading from the last point until another \n shows up, and repeat the process until there is nothing left to receive?
the function that is supposed to be doing this is called msgLoop() and is located in the second pastebin line.
To read an \n-terminated string from a socket, you have to either:
read from the socket 1 byte at a time until you encounter a \n byte. Any unread bytes are left in the socket until you read them later. This is not very efficient, but it works.
create a data cache. When you need a new string, first check the cache to see if there is already a \n byte present in it. If not, then keep reading from the socket in larger blocks and store them into the cache until you encounter a \n byte. Then process the contents of the cache up to the first \n byte, remove the bytes you processed, and move any remaining bytes to the front of the cache for later reads.
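The data cache from the second option could be sketched as below. The type line_cache, the two function names, and the 4096-byte capacity are all assumptions for illustration. The code is independent of the socket API, so the same logic applies to Winsock recv(): the caller appends each received block and then pops lines until none are left.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_CAP 4096              /* assumed capacity, adjust as needed */

/* Hypothetical cache of bytes received but not yet handed to the parser. */
typedef struct {
    char   data[CACHE_CAP];
    size_t len;
} line_cache;

/* Appends a received block. Returns 0 on success, -1 if it does not fit. */
static int cache_append(line_cache *c, const char *src, size_t n)
{
    assert(c != NULL && src != NULL);
    if (n > CACHE_CAP - c->len)
        return -1;
    memcpy(c->data + c->len, src, n);
    c->len += n;
    return 0;
}

/* Copies the first complete line (without its '\n') into out as a
 * NUL-terminated string and moves the remaining bytes to the front.
 * Returns 1 if a line was produced, 0 if no complete line is cached
 * yet, or -1 if out is too small for the line. */
static int cache_pop_line(line_cache *c, char *out, size_t outcap)
{
    assert(c != NULL && out != NULL);
    char *nl = memchr(c->data, '\n', c->len);
    if (nl == NULL)
        return 0;
    size_t linelen = (size_t)(nl - c->data);
    if (linelen >= outcap)
        return -1;
    memcpy(out, c->data, linelen);
    out[linelen] = '\0';
    c->len -= linelen + 1;
    memmove(c->data, nl + 1, c->len);   /* keep the unprocessed remainder */
    return 1;
}
```

A receive loop would call cache_append with each block and then call cache_pop_line repeatedly, passing every returned line to the parsing function, before receiving again.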
There's no built-in "readLine" method for sockets. So, you'll need to implement it yourself, but it's not too tricky. I found this example by Googling, you may be able to improve on it:
http://johnnie.jerrata.com/winsocktutorial/

getline over a socket

Is there a libc function that would do the same thing as getline, but would work with a connected socket instead of a FILE * stream ?
A workaround would be to call fdopen on a socket. What should be taken care of when doing so? What are the reasons to do it or not to do it?
One obvious reason to do it is to call getline and co, but maybe it is a better idea to rewrite some custom getline ?
When you call read on a socket, it can return fewer bytes than you asked for.
e.g.
read(fd, buf, bufsize)
can return a value less than bufsize if fewer than bufsize bytes have arrived in the kernel buffer for the TCP socket so far.
In such a case you may need to call read again, and keep doing so until it returns zero (end of stream) or a negative result (error).
Thus it is best to avoid stdio functions. You need to create wrappers for the read function that call it iteratively in order to get bufsize bytes reliably. Such a wrapper should return a short count only when no more bytes can be read from the socket, as if the file were being read from the local disk.
you can find wrappers in the book Computer Systems: A Programmer's Perspective by Randal Bryant.
The source code is available at this site. look for functions beginning with rio_.
If the socket is connected to untrusted input, be prepared for arbitrary input within arbitrary time frame
\0 character before \r\n
wait eternally for any of \r or \n
any other potentially ugly thing
One way to address the arbitrary timing and arbitrary data would be to provide timeouts on the reads e.g. via select(2) and feed the data you actually receive to some well-written state machine byte by byte.
The problem would be if you don't receive the new line (\n or \r\n, depends on your implementation) the program would hang. I'd write your own version that also makes calls to select() to check if the socket is still read/writable and doesn't have any errors. Really there would be no way to tell if another "\n" or "\r\n" is coming so make sure you know that the data from the client/server will be consistent.
Imagine you coded a webserver that reads the headers using getline(). If an attacker simply sent
GET / HTTP/1.1\r\n
This line isn't terminated: bla
The call to getline would never return and the program would hang. This would probably cost you resources, and eventually a DoS would be possible.
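The select()-based timeout suggested in the answers above might look like the following sketch. The name read_with_timeout and the -2 return code for a timeout are assumptions made for this example. With it, an unterminated line costs the server at most one timeout period instead of blocking forever.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for fd to become readable,
 * then performs a single read(). Returns the number of bytes read,
 * 0 on end of stream, -1 on error, or -2 if the timeout expired. */
static ssize_t read_with_timeout(int fd, void *buf, size_t cap, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;

    assert(buf != NULL && fd >= 0 && fd < FD_SETSIZE && timeout_ms >= 0);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return -1;                      /* select() itself failed */
    if (ready == 0)
        return -2;                      /* nothing arrived in time */
    return read(fd, buf, cap);
}
```

A line reader built on this would also need an overall deadline and a maximum line length, since a client could otherwise keep the connection alive by sending one byte just before each timeout.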

Inspecting C pipelines passing through a program -- border cases

I'm receiving from socket A and writing that to socket B on the fly (like a proxy server might). I would like to inspect and possibly modify data passing through. My question is how to handle border cases, ie where the regular expression I'm searching for would match between two successive socket A read and socket B write iterations.
char buffer[4096];
int socket_A, socket_B;
/* Setting up the connection goes here */
for(;;) {
    ssize_t n = recv(socket_A, buffer, sizeof buffer, 0);
    if (n <= 0) break; /* connection closed or error */
    /* Inspect, and possibly modify buffer */
    send(socket_B, buffer, (size_t)n, 0); /* forward only the bytes actually received */
    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */
}
My suggestion: have two buffers, and rotate between them:
1. Recv buffer 1
2. Recv buffer 2
3. Process.
4. Send buffer 1
5. Recv buffer 1
6. Process, but with buffer 2 before buffer 1.
7. Send buffer 2
8. Goto 2.
Or something like that?
Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary value - or just use the whole buffer), you could handle it by not passing on the full buffer but keep M-1 bytes back. In the next iteration put the new received data at the end of the M-1 bytes and apply the regular expression.
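The hold-back idea can be sketched for the simplest case, a fixed pattern of known length M. Everything named here (PAT, CHUNK, tail_filter, filter_feed) is hypothetical, and the sketch only counts matches rather than modifying the data. Because at most M-1 bytes are held back, no complete match fits inside the held tail, so a match is never counted twice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAT   "dog"                 /* hypothetical fixed pattern */
#define M     (sizeof PAT - 1)      /* maximum match length */
#define CHUNK 4096                  /* largest chunk passed to filter_feed */

typedef struct {
    char   buf[M - 1 + CHUNK];      /* held-back tail followed by new data */
    size_t held;                    /* bytes kept from the previous chunk */
    size_t matches;                 /* matches seen so far */
} tail_filter;

/* Appends a chunk (n <= CHUNK) after the held-back tail, searches the
 * combined data, and copies everything except the last M-1 bytes to
 * out (which must hold M-1+CHUNK bytes). Returns the byte count to send. */
static size_t filter_feed(tail_filter *f, const char *chunk, size_t n, char *out)
{
    assert(f != NULL && chunk != NULL && out != NULL && n <= CHUNK);
    memcpy(f->buf + f->held, chunk, n);
    size_t total = f->held + n;

    for (size_t i = 0; i + M <= total; i++)
        if (memcmp(f->buf + i, PAT, M) == 0)
            f->matches++;

    size_t keep = total < M - 1 ? total : M - 1;   /* tail to hold back */
    size_t fwd  = total - keep;
    memcpy(out, f->buf, fwd);
    memmove(f->buf, f->buf + fwd, keep);
    f->held = keep;
    return fwd;
}
```

When the connection ends, the caller still has to send the f->held bytes that remain in f->buf.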
If you know the format of the data transmitted (e.g. HTTP), you should be able to parse the contents to know when you have reached the end of the communication and should send out the trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout in the recv so that you do not hold on to the end of the communication for too long. What is too long is something that you will have to decide on your own.
You need to know and/or say something about your regular expression.
Depending on the regular expression, you might need to buffer a lot more than you are buffering now.
A worst case scenario might be something like a regular expression which says, "find everything, starting from the beginning up until the first occurrence of the word 'dog', and replace that with something else": if you have a regular expression like that, then you need to buffer (without forwarding) everything from the beginning until the first occurrence of the word 'dog', which might never happen, i.e. there might be an infinite amount to buffer.
In the sense you're talking about (and in every sense for, say, TCP), sockets are streams. It follows from your question that you have some structure in the data. So you must do something similar to the following:
Buffer (hold) incoming data until a boundary is reached. The boundary might be end-of-line, end-of-record, or any other way that you know that your regex will match.
When a "record" is ready, process it and place the results in an output buffer.
Write anything accumulated in the output buffer.
That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.
EDIT:
If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.
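For reference, here is a sketch of the Horspool simplification of Boyer-Moore, which keeps only the bad-character skip table. It is a swapped-in variant chosen for brevity, not the full Boyer-Moore algorithm the answer names, and the function name horspool is hypothetical. It shows where the skipping comes from: after a failed comparison the window jumps ahead by a distance determined by the last byte under it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: Boyer-Moore-Horspool search for pat (m bytes)
 * in hay (n bytes). Returns a pointer to the first match, or NULL. */
static const char *horspool(const char *hay, size_t n, const char *pat, size_t m)
{
    size_t skip[256];

    assert(hay != NULL && pat != NULL);
    if (m == 0)
        return hay;
    if (m > n)
        return NULL;

    /* Bytes that do not occur in the pattern allow a jump of m. */
    for (size_t i = 0; i < 256; i++)
        skip[i] = m;
    /* For every pattern byte except the last, the jump is its distance
     * from the end of the pattern. */
    for (size_t i = 0; i + 1 < m; i++)
        skip[(unsigned char)pat[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n;
         pos += skip[(unsigned char)hay[pos + m - 1]]) {
        if (memcmp(hay + pos, pat, m) == 0)
            return hay + pos;
    }
    return NULL;
}
```

In a streaming setting, every byte the window has already moved past can be forwarded, apart from the tail that could still begin a match.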
Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.
EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.
One alternative is to use a poll(2)-like strategy with non-blocking sockets. On a read event, grab a buffer from the socket, push it onto the incoming queue, and call the lexer/parser/matcher that assembles the buffers into a stream and then pushes chunks onto the output queue. On a write event, take a chunk from the output queue, if any, and write it to the socket. This sounds kind of complicated, but it's not really once you get used to the inverted control model.
