C winsock "rolling parsing" - c

I'm trying to receive data from a server and parse it.
http://pastebin.com/1kjXnXwq
http://pastebin.com/XpGSgRBh
Everything works as is, but I want to parse the data instead of just grabbing blocks of it and printing them out. Is there any way to read data from the socket until a \n, stop, and pass the line off to another function to be parsed, then, once that function returns, continue reading from the last point until another \n shows up, and repeat the process until there is nothing left to receive?
The function that is supposed to do this is called msgLoop() and is in the second Pastebin link.

To read an \n-terminated string from a socket, you have to either:
read from the socket 1 byte at a time until you encounter a \n byte. Any unread bytes are left in the socket until you read them later. This is not very efficient, but it works.
create a data cache. When you need a new string, first check the cache to see if there is already a \n byte present in it. If not, then keep reading from the socket in larger blocks and store them into the cache until you encounter a \n byte. Then process the contents of the cache up to the first \n byte, remove the bytes you processed, and move any remaining bytes to the front of the cache for later reads.
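The cache approach can be sketched roughly as follows. This is only a minimal illustration, not code from the linked pastes: the names line_cache, cache_append and cache_pop_line and the 4096-byte capacity are made up for the example. The socket call itself is left out so the same logic applies to Winsock recv() or POSIX read().

```c
#include <stddef.h>
#include <string.h>

#define CACHE_CAP 4096  /* assumed upper bound on bytes waiting to be parsed */

/* Hypothetical cache for bytes that were received but not yet parsed. */
struct line_cache {
    char   data[CACHE_CAP];
    size_t len;             /* number of valid bytes in data */
};

/* Append n freshly received bytes. Returns 0 on success, -1 if they do not fit. */
static int cache_append(struct line_cache *c, const char *src, size_t n)
{
    if (n > CACHE_CAP - c->len)
        return -1;
    memcpy(c->data + c->len, src, n);
    c->len += n;
    return 0;
}

/* Copy the first complete line (without its '\n') into out as a C string and
 * remove it from the cache, moving the remaining bytes to the front.
 * Returns the line length, -1 if no complete line is cached yet,
 * or -2 if out is too small to hold the line. */
static long cache_pop_line(struct line_cache *c, char *out, size_t out_cap)
{
    char *nl = memchr(c->data, '\n', c->len);
    if (nl == NULL)
        return -1;

    size_t line_len = (size_t)(nl - c->data);
    if (line_len + 1 > out_cap)
        return -2;

    memcpy(out, c->data, line_len);
    out[line_len] = '\0';

    size_t used = line_len + 1;  /* the line plus its '\n' */
    memmove(c->data, c->data + used, c->len - used);
    c->len -= used;
    return (long)line_len;
}
```

In msgLoop(), each successful recv() into a temporary buffer would be followed by cache_append(), and then by a loop that calls cache_pop_line() and hands each line to the parser until it returns -1.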

There's no built-in "readLine" function for sockets, so you'll need to implement it yourself, but it's not too tricky. I found this example by Googling; you may be able to improve on it:
http://johnnie.jerrata.com/winsocktutorial/

Related

Sync read and write

I'm using read and write functions to communicate between client and server.
If the server calls write twice, I can see in Wireshark that two packets were sent, but my read function concatenates the two packets into one buffer.
Question:
Is it possible for my read function to read only one payload at a time?
I don't want to reduce the buffer size.
Example:
Situation now:
Send(8 bytes) Send(8 bytes)
Read, reads 16 bytes
What I want:
Send(8 bytes) Send(8 bytes)
Read, reads 8 bytes (first packet)
Read, reads 8 bytes (second packet)
TCP/IP gives you an ordered byte stream. Reads and writes are not guaranteed to have the same boundaries, as you have seen.
To see where messages begin and end, you need to add extra information to your protocol to provide this information. A workable simple approach is to have a byte count at the start of each message. Read the byte count, then you know how many more bytes to read to get the complete message and none of the next message.
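A byte-count prefix could look something like the sketch below. The 4-byte big-endian header and the names frame_put_header and frame_parse are assumptions chosen for illustration; any fixed-size length field agreed on by both sides would do.

```c
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 4  /* assumed: 4-byte big-endian length prefix */

/* Write the header announcing a payload of n bytes. */
static void frame_put_header(unsigned char hdr[HDR_LEN], uint32_t n)
{
    hdr[0] = (unsigned char)(n >> 24);
    hdr[1] = (unsigned char)(n >> 16);
    hdr[2] = (unsigned char)(n >> 8);
    hdr[3] = (unsigned char)n;
}

/* Look for one complete message at the start of buf[0..len).
 * On success, stores the payload pointer and length and returns the total
 * number of bytes consumed (header plus payload).
 * Returns 0 if more bytes must be read before a message is complete. */
static size_t frame_parse(const unsigned char *buf, size_t len,
                          const unsigned char **payload, uint32_t *payload_len)
{
    if (len < HDR_LEN)
        return 0;

    uint32_t n = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                 ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (len - HDR_LEN < n)
        return 0;

    *payload = buf + HDR_LEN;
    *payload_len = n;
    return HDR_LEN + (size_t)n;
}
```

With this framing, a single read that returns both 8-byte messages at once is no longer a problem: the receiver calls frame_parse() repeatedly on the accumulated bytes and gets the two payloads back separately. A real receiver would also want to reject unreasonably large announced lengths.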
If you want to synchronize the server and client, you can have the client acknowledge each message, so that the server does not send the next one until the previous one has been read (the network equivalent of a semaphore). Or, if you know the exact length of each message, you can split the bytes you have read into separate messages. With TCP, reading into a buffer that is exactly one message long leaves any remaining bytes queued in the socket for the next read, so either approach works: make the server wait until the reader has read the previous message, or use a larger buffer and separate multiple messages yourself.

How to detect a delimiter while reading from a socket file descriptor in C?

In C, while reading into a buffer from a socket file descriptor, how do I make the read stop if a delimiter is detected? Let's assume the delimiter is a '>' character.
read(socket_filedes, buffer, MAXSZ);
/* stop if delimiter '>' is detected */
You have two options here:
Read a single byte at a time until you encounter a delimiter. This is likely to be very inefficient.
Read in a full buffer's worth of data at a time, then look for the delimiter there. When you find it, save off the remaining data in another buffer and process the data you want. When you're ready to read again, put the saved data back in the buffer and call read with the address of the next available byte in the buffer.
The read() function does not examine the data it transfers from source to buffer. You cannot force it to stop transferring data at a specific character or characters if it would not otherwise have stopped there.
It is also important to recognize that read() does not necessarily read the full number of bytes specified. On one hand, that means you need to be prepared to call read() in a loop to collect all the data you expect. On the other hand, it means that once read() has transferred at least one byte, you can usually expect it to return when no more data are immediately available to transfer. Thus, if the sender stops sending data after the delimiter, then read() will probably stop reading after that delimiter.
If you cannot rely on the sender to break up the transmission as you require, then you have the option of reading one byte at a time. That can be awfully inefficient, however. The more usual solution is to do the job in two stages: (1) perform fast, block-wise read()s from the kernel into a userspace buffer, and then (2) parse the buffer contents via your userspace code. This is basically what readable C streams do when using a buffer.
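The two-stage approach might be sketched like this for the '>' delimiter. The struct reader and read_until names are hypothetical, and MAXSZ is borrowed from the question with an assumed value of 1024.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAXSZ 1024  /* name taken from the question; the value is assumed */

/* Hypothetical reader state: bytes read from fd but not yet consumed. */
struct reader {
    int    fd;
    char   buf[MAXSZ];
    size_t len;
};

/* Copy bytes up to and including delim into out (NUL-terminated).
 * Returns the number of bytes copied, 0 if the stream ended before a
 * delimiter was seen, or -1 on error or if the token does not fit. */
static ssize_t read_until(struct reader *r, char delim, char *out, size_t out_cap)
{
    for (;;) {
        /* Stage 2: parse what is already in the userspace buffer. */
        char *hit = memchr(r->buf, delim, r->len);
        if (hit != NULL) {
            size_t n = (size_t)(hit - r->buf) + 1;
            if (n + 1 > out_cap)
                return -1;
            memcpy(out, r->buf, n);
            out[n] = '\0';
            memmove(r->buf, r->buf + n, r->len - n);  /* keep the leftover */
            r->len -= n;
            return (ssize_t)n;
        }
        if (r->len == sizeof r->buf)
            return -1;  /* buffer full and still no delimiter */

        /* Stage 1: block-wise read, appended after any leftover bytes. */
        ssize_t got = read(r->fd, r->buf + r->len, sizeof r->buf - r->len);
        if (got < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (got == 0)
            return 0;
        r->len += (size_t)got;
    }
}
```

Bytes that arrive after the delimiter are not lost: they stay in r->buf and are examined first on the next call.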

How many bytes should I read/write to a socket?

I'm having some doubts about the number of bytes I should write/read through a socket in C on Unix. I'm used to sending 1024 bytes, but this is really too much sometimes when I send short strings.
I read a string from a file, and I don't know how many bytes this string is, it can vary every time, it can be 10, 20 or 1000. I only know for sure that it's < 1024. So, when I write the code, I don't know the size of bytes to read on the client side, (on the server I can use strlen()). So, is the only solution to always read a maximum number of bytes (1024 in this case), regardless of the length of the string I read from the file?
For instance, with this code:
read(socket,stringBuff,SIZE);
wouldn't it be better if SIZE is 10 instead of 1024 if I want to read a 10 byte string?
In the code in your question, if there are only 10 bytes to be read, then it makes no difference whether SIZE is 10 bytes, 1,024 bytes, or 1,000,024 bytes - it'll still just read 10 bytes. The only difference is how much memory you set aside for it, and if it's possible for you to receive a string up to 1,024 bytes, then you're going to have to set aside that much memory anyway.
However, regardless of how many bytes you are trying to read in, you always have to be prepared for the possibility that read() will actually read a different number of them. Particularly on a network, where you can get delays in transmission, even if your server is sending a 1,024 byte string, fewer than that number of bytes may have arrived by the time your client calls read(), in which case you'll read fewer than 1,024.
So, you always have to be prepared for the need to get your input in more than one read() call. This means you need to be able to tell when you're done reading input: the fact that read() has returned is not enough on its own to tell you that you're done. If your server might send more than one message before you've read the first one, then you obviously can't hope to rely on this.
You have three main options:
Always send messages which are the same size, perhaps padding smaller strings with zeros if necessary. This is usually suboptimal for a TCP stream. Just read until you've received exactly this number of bytes.
Have some kind of sentinel mechanism for telling you when a message is over. This might be a newline character, a CRLF, a blank line, or a single dot on a line followed by a blank line, or whatever works for your protocol. Keep reading until you have received this sentinel. To avoid making inefficient system calls of one character at a time, you need to implement some kind of buffering mechanism to make this work well. If you can be sure that your server is sending you lines terminated with a single '\n' character, then using fdopen() and the standard C I/O library may be an option.
Have your server tell you how big the message is (either in an initial fixed length field, or using the same kind of sentinel mechanism from point 2), and then keep reading until you've got that number of bytes.
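For the sentinel option with '\n'-terminated lines, the fdopen() route might look roughly like this. It is a sketch under the assumption that the descriptor is a connected stream socket (or pipe) used only for reading; line_stream and read_line are made-up helper names.

```c
#define _POSIX_C_SOURCE 200809L  /* for fdopen() under strict ISO C modes */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Wrap an already-connected descriptor in a buffered stdio stream.
 * Returns NULL on failure (the descriptor is closed in that case). */
static FILE *line_stream(int fd)
{
    FILE *in = fdopen(fd, "r");
    if (in == NULL)
        close(fd);  /* fdopen failed: do not leak the descriptor */
    return in;
}

/* Read one line into line[0..cap), stripping the trailing "\n" or "\r\n".
 * Returns 1 if a line was read, 0 at end of stream or on error. */
static int read_line(FILE *in, char *line, size_t cap)
{
    if (fgets(line, (int)cap, in) == NULL)
        return 0;
    line[strcspn(line, "\r\n")] = '\0';
    return 1;
}
```

stdio does the block-wise reads and buffering internally, so the caller never needs to know how many bytes each underlying read() returned. Calling fclose() on the stream also closes the descriptor.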
The read() system call blocks until it can read one or more bytes, or until an error occurs.
It DOESN'T guarantee that it will read the number of bytes you request! With TCP sockets, it's very common that read() returns less than you request, because it can't return bytes that are still propagating through the network.
So, you'll have to check the return value of read() and call it again to get more data if you didn't get everything you wanted, and again, and again, until you have everything.
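That "again, and again" loop is usually wrapped in a small helper. A minimal sketch follows; read_full is a name invented here, not a standard function.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling read() until exactly n bytes have arrived.
 * Returns n on success, a smaller count if the peer closed the
 * connection first, or -1 on error. */
static ssize_t read_full(int fd, void *buf, size_t n)
{
    char  *p = buf;
    size_t done = 0;

    while (done < n) {
        ssize_t got = read(fd, p + done, n - done);
        if (got < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal: just retry */
            return -1;
        }
        if (got == 0)
            break;          /* end of stream */
        done += (size_t)got;
    }
    return (ssize_t)done;
}
```

This only helps when you already know how many bytes to expect, for example after reading a fixed-length size field.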

detecting end of http header with \r\n\r\n

Using recv I want to get the HTTP header so I can parse it for a content length. However, I'm having trouble detecting the line break. Or do I even have to detect the line break, or will the first read into the buffer always contain the complete header (assuming I have a long enough buffer)?
This is written in C.
edit: looking at some of the related questions one of the things I am worried about is
"...the "\r\n" of the header break might be pulled into your buffer by two different calls to recv() which would prevent your code from recognizing the header break."
You should call recv() repeatedly, and each time it gives you x bytes, advance the buffer pointer you pass to it by x bytes (and reduce the number of bytes it is allowed to write by x as well). Do this until your buffer either contains a \r\n\r\n or is completely full, in which case you just close the socket and ignore the malicious client from then on. A buffer size of about 3000 bytes should be enough.
But: this ignores the general problem that your server seems to be a polling-server. If you have some experience you should try to make an epoll-server instead.
In addition to the problem of identifying "\r\n\r\n" across packet boundaries, you have the problem of identifying "Content-Length: xxxx\r\n" across packet boundaries. I suggest receiving and parsing one byte at a time. When you get a recv() of '\r' followed by a recv() of '\n', followed by a recv() of '\r', followed by a recv() of '\n', you can be sure the header has ended. Once you've grasped this, adapt your solution to receive and parse n bytes at a time, where n is a preprocessor definition initially defined as 1, and then increase n.
In the end I did something like this:
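/* sock, buf (a char array), len and n are assumed to be declared and initialised earlier */
while ((n = recv(sock, buf + len, sizeof buf - len - 1, 0)) > 0) {
    len += n;
    buf[len] = '\0';  /* strstr() needs a terminated string */
    if (strstr(buf, "\r\n\r\n") != NULL) {
        /* header complete: look for content length, output error if it doesn't exist */
        break;
    }
    /* otherwise keep on reading into the buffer */
}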
and then once the header is found I keep on reading for the message body.
Anyway, thanks guys. I ended up with a really inefficient way to get my answer, but what must be done is done.
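For what it's worth, the check on the accumulated buffer could be written without relying on NUL termination, roughly as below. header_length and content_length are hypothetical helpers, and the exact spelling "Content-Length:" is an assumption: real header names are case-insensitive, so a production parser would need a case-insensitive comparison.

```c
#include <stddef.h>
#include <string.h>

/* Return the size of the header including the terminating "\r\n\r\n",
 * or 0 if the terminator has not arrived yet. Because buf holds everything
 * received so far, a terminator split across recv() calls is still found. */
static size_t header_length(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++)
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    return 0;
}

/* Find the Content-Length value within the first hdr_len bytes.
 * Returns the value, or -1 if the field is absent or has no digits. */
static long content_length(const char *buf, size_t hdr_len)
{
    static const char key[] = "\r\nContent-Length:";  /* assumed exact spelling */
    size_t klen = sizeof key - 1;

    for (size_t i = 0; i + klen <= hdr_len; i++) {
        if (memcmp(buf + i, key, klen) != 0)
            continue;
        size_t j = i + klen;
        while (j < hdr_len && (buf[j] == ' ' || buf[j] == '\t'))
            j++;                      /* skip optional whitespace */
        long value = 0;
        int digits = 0;
        while (j < hdr_len && buf[j] >= '0' && buf[j] <= '9') {
            value = value * 10 + (buf[j] - '0');
            j++;
            digits++;
        }
        return digits > 0 ? value : -1;
    }
    return -1;
}
```

After each recv() appends to the buffer, header_length() is called on the whole buffer; once it returns non-zero, any bytes past that offset already belong to the message body.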

Inspecting C pipelines passing through a program -- border cases

I'm receiving from socket A and writing that to socket B on the fly (like a proxy server might). I would like to inspect and possibly modify data passing through. My question is how to handle border cases, i.e., where the regular expression I'm searching for would match between two successive socket A read and socket B write iterations.
char buffer[4096];
int socket_A, socket_B;
ssize_t n;
/* Setting up the connection goes here */
for (;;) {
    n = recv(socket_A, buffer, sizeof buffer, 0);
    if (n <= 0)
        break;  /* peer closed the connection or an error occurred */
    /* Inspect, and possibly modify buffer */
    send(socket_B, buffer, (size_t)n, 0);  /* forward only the bytes actually received */
    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */
}
My suggestion: have two buffers, and rotate between them:
1. Recv buffer 1
2. Recv buffer 2
3. Process.
4. Send buffer 1
5. Recv buffer 1
6. Process, but with buffer 2 before buffer 1.
7. Send buffer 2
8. Goto 2.
Or something like that?
Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary value - or just use the whole buffer), you could handle it by not passing on the full buffer but keep M-1 bytes back. In the next iteration put the new received data at the end of the M-1 bytes and apply the regular expression.
If you know the format of the data transmitted (e.g. HTTP), you should be able to parse the contents to know when you have reached the end of the communication and should send out the trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout in the recv so that you do not hold on to the end of the communication for too long. What counts as too long is something that you will have to decide on your own.
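Holding back M-1 bytes could be sketched like this for a fixed pattern rather than a real regular expression. The pattern "secret", the masking with '*', and the names carry_filter, filter_chunk and filter_flush are all invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define PAT     "secret"           /* hypothetical fixed pattern of length M */
#define PAT_LEN (sizeof PAT - 1)
#define CHUNK   4096               /* largest chunk passed to filter_chunk() */

struct carry_filter {
    char   buf[CHUNK + PAT_LEN];   /* held-back tail followed by the new chunk */
    size_t held;                   /* bytes kept from the previous round (< M) */
};

/* Process n newly received bytes (n <= CHUNK). Every occurrence of PAT is
 * overwritten with '*'. Bytes that are safe to forward are copied to out
 * (capacity at least CHUNK + PAT_LEN) and their count is returned; up to
 * M-1 trailing bytes stay behind in f for the next call. */
static size_t filter_chunk(struct carry_filter *f, const char *in, size_t n, char *out)
{
    memcpy(f->buf + f->held, in, n);
    size_t total = f->held + n;

    for (size_t i = 0; i + PAT_LEN <= total; ) {
        if (memcmp(f->buf + i, PAT, PAT_LEN) == 0) {
            memset(f->buf + i, '*', PAT_LEN);
            i += PAT_LEN;
        } else {
            i++;
        }
    }

    size_t keep = total < PAT_LEN - 1 ? total : PAT_LEN - 1;
    size_t send = total - keep;
    memcpy(out, f->buf, send);
    memmove(f->buf, f->buf + send, keep);
    f->held = keep;
    return send;
}

/* At end of stream (or on timeout), release whatever is still held back. */
static size_t filter_flush(struct carry_filter *f, char *out)
{
    size_t n = f->held;
    memcpy(out, f->buf, n);
    f->held = 0;
    return n;
}
```

In the proxy loop, the return value of filter_chunk() is the number of bytes to pass to send(), and filter_flush() is called once the connection ends or the timeout expires.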
You need to know and/or say something about your regular expression.
Depending on the regular expression, you might need to buffer a lot more than you are buffering now.
A worst case scenario might be something like a regular expression which says, "find everything, starting from the beginning up until the first occurrence of the word 'dog', and replace that with something else": if you have a regular expression like that, then you need to buffer (without forwarding) everything from the beginning until the first occurrence of the word 'dog': which might never happen, i.e. might be an infinite amount to buffer.
In the sense you're talking about (and in every sense for, say, TCP), sockets are streams. It follows from your question that you have some structure in the data. So you must do something similar to the following:
Buffer (hold) incoming data until a boundary is reached. The boundary might be end-of-line, end-of-record, or any other way that you know that your regex will match.
When a "record" is ready, process it and place the results in an output buffer.
Write anything accumulated in the output buffer.
That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.
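As a toy illustration of carrying match state across chunk boundaries, here is a hand-built DFA for the fixed word "dog" (the example word used earlier in this thread). A real regex would need a generated automaton; dog_matcher and dog_feed are names made up for this sketch.

```c
#include <stddef.h>

/* state = number of characters of "dog" matched so far.
 * It persists between chunks, so a match may span several recv() calls. */
struct dog_matcher {
    int state;
};

/* Feed one chunk. Returns the index just past the first match completed in
 * this chunk, or -1 if no match completed. The state is kept for the next
 * chunk either way. */
static long dog_feed(struct dog_matcher *m, const char *chunk, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char c = chunk[i];
        switch (m->state) {
        case 0:  /* nothing matched yet */
            m->state = (c == 'd') ? 1 : 0;
            break;
        case 1:  /* seen "d" */
            m->state = (c == 'o') ? 2 : (c == 'd') ? 1 : 0;
            break;
        case 2:  /* seen "do" */
            if (c == 'g') {
                m->state = 0;
                return (long)(i + 1);
            }
            m->state = (c == 'd') ? 1 : 0;
            break;
        }
    }
    return -1;
}
```

While the state is 0, nothing received so far can be part of a match, so everything can be forwarded; while it is 1 or 2, the last one or two bytes have to be held back.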
EDIT:
If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.
Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.
EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.
One alternative is to use a poll(2)-like strategy with non-blocking sockets. On a read event, grab a buffer from the socket, push it onto the incoming queue, and call the lexer/parser/matcher that assembles the buffers into a stream and then pushes chunks onto the output queue. On a write event, take a chunk from the output queue, if any, and write it to the socket. This sounds kind of complicated, but it's not really once you get used to the inverted control model.

Resources