Inspecting data pipelines passing through a program -- border cases (C)

I'm receiving from socket A and writing to socket B on the fly (as a proxy server might). I would like to inspect, and possibly modify, the data passing through. My question is how to handle border cases, i.e. where the regular expression I'm searching for would match across two successive socket-A-read / socket-B-write iterations.
char buffer[4096];
int socket_A, socket_B;
/* Setting up the connections goes here */
for (;;) {
    ssize_t n = recv(socket_A, buffer, sizeof buffer, 0);
    if (n <= 0)
        break;
    /* Inspect, and possibly modify, buffer */
    send(socket_B, buffer, n, 0);
    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */
}

My suggestion: have two buffers, and rotate between them:
1. Recv buffer 1.
2. Recv buffer 2.
3. Process.
4. Send buffer 1.
5. Recv buffer 1.
6. Process, but with buffer 2 before buffer 1.
7. Send buffer 2.
8. Goto 2.
Or something like that?

Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary cap, or just use the whole buffer), you can handle this by not forwarding the full buffer: hold back the last M-1 bytes. On the next iteration, place the newly received data after those M-1 bytes and apply the regular expression to the combined data.
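A minimal sketch of this hold-back scheme, assuming a fixed maximum match length M (the value here is a placeholder) and leaving the actual inspection as a stub:
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define M     16        /* assumed maximum match length */
#define BUFSZ 4096

static void proxy_loop(int socket_A, int socket_B)
{
    char buffer[M - 1 + BUFSZ];
    size_t held = 0;                 /* bytes carried over from the last round */

    for (;;) {
        ssize_t n = recv(socket_A, buffer + held, BUFSZ, 0);
        if (n <= 0)
            break;
        size_t total = held + (size_t)n;

        /* Scan buffer[0 .. total) for matches here. */

        /* Forward everything except the last M-1 bytes, which might be
         * the start of a match that completes on the next read. */
        size_t keep = total < M - 1 ? total : M - 1;
        send(socket_B, buffer, total - keep, 0);
        memmove(buffer, buffer + (total - keep), keep);
        held = keep;
    }
    if (held > 0)                    /* connection closed: flush the hold-back */
        send(socket_B, buffer, held, 0);
}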
If you know the format of the data transmitted (e.g. HTTP), you should be able to parse the contents to know when you have reached the end of the communication, and then send out any trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout on the recv so that you do not hold on to the end of the communication for too long. What is too long is something you will have to decide on your own.

You need to know and/or say something about your regular expression.
Depending on the regular expression, you might need to buffer a lot more than you are buffering now.
A worst case scenario might be a regular expression that says, "find everything from the beginning up until the first occurrence of the word 'dog', and replace that with something else". With a regular expression like that, you need to buffer (without forwarding) everything from the beginning until the first occurrence of the word 'dog', which might never arrive, i.e. you might need to buffer an unbounded amount.

In the sense you're talking about (and in every sense for, say, TCP), sockets are streams. It follows from your question that you have some structure in the data. So you must do something similar to the following:
1. Buffer (hold) incoming data until a boundary is reached. The boundary might be end-of-line, end-of-record, or any other way that you know that your regex will match.
2. When a "record" is ready, process it and place the results in an output buffer.
3. Write anything accumulated in the output buffer.
That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.
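A minimal sketch of the buffer-until-boundary approach, assuming newline-delimited records and a hypothetical process_record() callback supplied by the caller:
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define ACC_SIZE 8192

static char   acc[ACC_SIZE];     /* accumulated, not-yet-processed stream data */
static size_t acc_len;

void on_readable(int fd, void (*process_record)(const char *, size_t))
{
    ssize_t n = recv(fd, acc + acc_len, sizeof acc - acc_len, 0);
    if (n <= 0)
        return;
    acc_len += (size_t)n;

    /* Hand off every complete record; keep the trailing partial one. */
    char *start = acc;
    char *nl;
    while ((nl = memchr(start, '\n', acc_len - (size_t)(start - acc))) != NULL) {
        process_record(start, (size_t)(nl - start) + 1);
        start = nl + 1;
    }
    acc_len -= (size_t)(start - acc);
    memmove(acc, start, acc_len);
}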
EDIT:
If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.

Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.
EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.

One alternative is to use poll(2)-like strategy with non-blocking sockets. On read event grab a buffer from the socket, push it onto incoming queue, call the lexer/parser/matcher that assembles the buffers into a stream, then pushes chunks onto the output queue. On write event, take a chunk from the output queue, if any, and write it into the socket. This sounds kind of complicated, but it's not really once you get used to the inverted control model.
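A skeletal sketch of that event-driven shape, assuming a single A-to-B direction; enqueue_input(), run_matcher() and dequeue_output() are hypothetical queue/matcher routines, and a real loop would only ask for POLLOUT while output is actually queued:
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

void        enqueue_input(const char *data, size_t len);  /* hypothetical */
void        run_matcher(void);                            /* hypothetical */
const char *dequeue_output(size_t *len);                  /* hypothetical */

void event_loop(int socket_A, int socket_B)
{
    struct pollfd fds[2] = {
        { .fd = socket_A, .events = POLLIN  },
        { .fd = socket_B, .events = POLLOUT },
    };
    char buf[4096];

    for (;;) {
        if (poll(fds, 2, -1) < 0)
            break;

        if (fds[0].revents & POLLIN) {
            ssize_t n = recv(socket_A, buf, sizeof buf, 0);
            if (n <= 0)
                break;
            enqueue_input(buf, (size_t)n);  /* push onto the incoming queue */
            run_matcher();                  /* assemble, match, emit chunks */
        }
        if (fds[1].revents & POLLOUT) {
            size_t len;
            const char *chunk = dequeue_output(&len);
            if (chunk != NULL)
                send(socket_B, chunk, len, 0);
        }
    }
}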


Why should I use circular buffers when reading and writing to sockets in C?

I'm doing an assignment where the goal is to create a basic FTP server in C capable of handling multiple clients at once.
The subject tells us to "wisely use circular buffers", but I don't really understand why or how.
I'm already using select to know when I can read or write to my socket without blocking, as I'm not allowed to use recv, send, or O_NONBLOCK.
Each connection has a structure where I store everything related to this client, like the communication file descriptor, the network information, and the buffers.
Why can't I just use read on my socket into a fixed-size buffer and then pass this buffer to the parsing function?
Same goes for writing: why can't I just dprintf my response into the socket?
From my point of view, using a circular buffer adds a useless layer of complexity just to be translated back into a string to parse the command or to send back the response.
Did I misunderstand the subject? Instead of storing individual characters, should I store commands and responses as circular buffers of strings?
Why should I use circular buffers when reading and writing to sockets in C?
The socket interface does not itself provide a reason for using circular buffers (a.k.a. ring buffers). You should be looking instead at the protocol requirements of the application using the socket -- the FTP protocol in this case. This will be colored by the characteristics of the underlying network protocol (TCP for FTP) and their effect on the behavior of the socket layer.
Why can't I just use read on my socket into a fixed-size buffer and then pass this buffer to the parsing function?
You surely could do without circular buffers, but that wouldn't be as simple as you seem to suppose. And that's not the question you should be asking anyway: it's not whether circular buffers are required, but what benefit they can provide that you might not otherwise get. More on that later.
Also, you surely can have fixed size circular buffers -- "circular" and "fixed size" are orthogonal characteristics. However, it is usually among the objectives of using a circular buffer to minimize or eliminate any need for dynamically adjusting the buffer size.
Same goes for writing: why can't I just dprintf my response into the socket?
Again, you probably could do as you describe. The question is what do you stand to gain from interposing a circular buffer? Again, more later.
From my point of view, using a circular buffer adds a useless layer of complexity just to be translated back into a string to parse the command or to send back the response.
Did I misunderstand the subject?
That you are talking about translating to and from strings makes me think that you did indeed misunderstand the subject.
Instead of storing individual characters, should I store commands and responses as circular buffers of strings?
Again, where do you think "of strings" comes into it? Why are you supposing that the elements of the buffer(s) would represent (whole) messages?
A circular buffer is more a manner of use of an ordinary, flat, usually fixed-size buffer than it is a separate data structure of its own. There is a little bit of extra bookkeeping data involved, however, so I won't quibble with anyone who wants to call it a data structure in its own right.
Circular buffers for input
Among the main contexts for circular buffers' usefulness is data arriving with stream semantics (such as TCP provides) rather than with message semantics (such as UDP provides). With respect to your assignment, consider this: when the server reads command input, how does it know where the command ends? I suspect you're supposing that you will get one complete command per read(), but that is in no way a safe assumption, regardless of the implementation of the client. You may get partial commands, multiple commands, or both on each read(), and you need to be prepared to deal with that.
So suppose, for example, that you receive one and a half control messages in one read(). You can parse and respond to the first, but you need to read more data before you can act on the second. Where do you put that data? Ok, you read it into the end of the buffer. And what if on the next read() you get not only the rest of a message, but also part of another message?
You cannot keep on indefinitely adding data at the end of the buffer, not even if you dynamically allocate more space as needed. You could at some point move the unprocessed data from the tail of the buffer to the beginning, thus opening up space at the end, but that is costly, and at this point we are well past the simplicity you had in mind. (That simplicity was always imaginary.) Alternatively, you can perform your reads into a circular buffer, so that consuming data from the (logical) beginning of the buffer automatically makes space available at the (logical) end.
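A minimal sketch of a fixed-size circular input buffer, assuming a power-of-two capacity so the index arithmetic can use masking:
#include <stddef.h>
#include <unistd.h>

#define RB_SIZE 4096                 /* power of two, so indices can be masked */

struct ring {
    char   buf[RB_SIZE];
    size_t head;                     /* next byte to consume */
    size_t tail;                     /* next free slot */
};

/* Read from fd into the free space of the ring, wrapping if needed.
 * Consuming from the logical head automatically frees space at the tail. */
ssize_t ring_fill(struct ring *rb, int fd)
{
    size_t used = rb->tail - rb->head;   /* indices grow monotonically */
    size_t room = RB_SIZE - used;
    if (room == 0)
        return 0;                        /* buffer full */

    size_t off   = rb->tail & (RB_SIZE - 1);
    size_t chunk = RB_SIZE - off;        /* contiguous space up to the end */
    if (chunk > room)
        chunk = room;

    ssize_t n = read(fd, rb->buf + off, chunk);
    if (n > 0)
        rb->tail += (size_t)n;
    return n;
}

/* Consume one byte; the caller checks there is data (tail != head). */
char ring_get(struct ring *rb)
{
    return rb->buf[rb->head++ & (RB_SIZE - 1)];
}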
Circular buffers for output
Similar applies on the writing side with a stream-oriented network protocol. Consider that you cannot write() an arbitrary amount of data at a time, and it is very hard to know in advance exactly how much you can write. That's more likely to bite you on the data connection than on the control connection, but in principle, it applies to both. If you have only one client to feed at a time then you can keep write()ing in a loop until you've successfully transferred all the data, and this is what dprintf() would do. But that's potentially a blocking operation, so it undercuts your responsiveness when you are serving multiple clients at the same time, and maybe even with just one if (as with FTP) there are multiple connections per client.
You need to buffer data on the server, especially for the data connection, and now you have pretty much the same problem that you did on the reading side: when you've written only part of the data you want to send, and the socket is not ready for you to send more, what do you do? You could just track where you are in the buffer, and send more pieces as you can until the buffer is empty. But then you are wasting opportunities to read more data from the source file, or to buffer more control responses, until you work through the buffer. Once again, a circular buffer can mitigate that, by giving you a place to buffer more data without requiring it to start at the beginning of the buffer or being limited by the available space before the physical end of the buffer.
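And the write side, draining whatever the socket will currently accept, reusing the struct ring from the sketch above (error handling omitted):
/* Write as much buffered output as fd will take right now. Short writes
 * are fine: the unwritten data stays in the ring for the next call. */
ssize_t ring_drain(struct ring *rb, int fd)
{
    size_t used = rb->tail - rb->head;
    if (used == 0)
        return 0;                        /* nothing to send */

    size_t off   = rb->head & (RB_SIZE - 1);
    size_t chunk = RB_SIZE - off;        /* contiguous data up to the end */
    if (chunk > used)
        chunk = used;

    ssize_t n = write(fd, rb->buf + off, chunk);
    if (n > 0)
        rb->head += (size_t)n;
    return n;
}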

Is there a portable way to discard a number of readable bytes from a socket-like file descriptor?

Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:
A stream of records is arriving on a file descriptor (which can be a TCP socket, a SOCK_STREAM UNIX domain socket, or potentially a pipe). Each record is preceded by a fixed-size header specifying its type and length, followed by data of variable length. I want to read the header first, and if it's not of a type I'm interested in, discard the following data segment without transferring it into userspace into a dummy buffer.
A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to the asynchronous nature of the stream, the records may still be incomplete when the fd becomes readable, or they may be complete but a piece of the next record may already be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between records so I don't need to manage partially loaded records accidentally read from the fd. So I use recv() with the MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing the data from the socket) up to the exact length. This copies the data twice; I want to avoid that by simply discarding the exact amount of data buffered in the socket.
On Linux, I gather it is possible to achieve that by using splice() and redirecting the data to /dev/null without copying them to userspace. However, splice() is Linux-only, and the similar sendfile() that is supported on more platforms can't use a socket as input. My questions are:
Is there a portable way to achieve this? Something that would work on other UNIXes (primarily Solaris) as well that do not have splice()?
Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?
Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count readable bytes from the file descriptor fd in the kernel (i.e. without copying anything to userspace), blocks on a blocking fd until the requested number of bytes has been discarded, or returns the number of successfully discarded bytes or EAGAIN on a non-blocking fd, just like read() would. And advances the seek position on a regular file, of course :)
The short answer is No, there is no portable way to do that.
The sendfile() approach is Linux-specific, because on most other OSes implementing it, the source must be a file or a shared memory object. (I haven't even checked if/in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)
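For portability, the usual fallback is exactly the one the question hopes to avoid: read the unwanted bytes into a scratch buffer and drop them. A minimal sketch of such a hypothetical discard():
#include <errno.h>
#include <unistd.h>

/* Consume exactly count bytes from fd by reading them into a throwaway
 * buffer. Returns the number of bytes discarded, or -1 on error. */
ssize_t discard(int fd, size_t count)
{
    char scratch[8192];
    size_t done = 0;

    while (done < count) {
        size_t want = count - done;
        if (want > sizeof scratch)
            want = sizeof scratch;
        ssize_t n = read(fd, scratch, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return done ? (ssize_t)done : -1;  /* e.g. EAGAIN on non-blocking fd */
        }
        if (n == 0)
            break;                             /* EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}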
Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.
However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.
You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, they would point out that the performance increase this hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things, is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)
In your first case, a process receiving structured data framed by a type and length identifier, wishing to skip unneeded frames, is better fixed by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., basic filtering approach). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require others to work extra too.)
In your second case, you already read your data. Don't do that. Instead, use a userspace buffer large enough to hold two full-size frames. Whenever you need more data, but the start of the current frame is already past the midway point of the buffer, first memmove() the frame to the beginning of the buffer.
When you have a partially read frame, and N unread bytes of it left that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.
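A sketch of that double-size frame buffer, assuming frames never exceed MAX_FRAME and a hypothetical frame_complete() that reports whether the buffered data starts with a full frame and, if so, how long it is:
#include <string.h>
#include <unistd.h>

#define MAX_FRAME 65536
#define BUF_SIZE  (2 * MAX_FRAME)   /* room for two full-size frames */

static char   buf[BUF_SIZE];
static size_t start, len;           /* unconsumed data is buf[start .. start+len) */

/* Hypothetical parser: returns the frame length if p[0 .. n) begins with
 * a complete frame, or 0 if more data is needed. */
size_t frame_complete(const char *p, size_t n);

int next_frame(int fd, const char **frame, size_t *frame_len)
{
    size_t n;
    while ((n = frame_complete(buf + start, len)) == 0) {
        /* Slide the partial frame down once it crosses the midpoint,
         * so there is always room to read more. */
        if (start > BUF_SIZE / 2) {
            memmove(buf, buf + start, len);
            start = 0;
        }
        ssize_t r = read(fd, buf + start + len, BUF_SIZE - start - len);
        if (r <= 0)
            return -1;              /* EOF or error */
        len += (size_t)r;
    }
    *frame     = buf + start;
    *frame_len = n;
    start += n;                     /* consume the frame */
    len   -= n;
    return 0;
}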
If the frames are small, say 65536 bytes maximum, you should use a tunable for the maximum buffer size. On most desktop and server machines, with high-bandwidth stream sockets, something like 2 MiB (2097152 bytes or more) is much more reasonable. It's not too much memory wasted, but you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full cachelines are copied, aligned, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers for converting decimal integers or floating-point values are used for better performance), and this approach does work well in practice. Simply put, skipping data already in your buffer is not something you need to optimize; it is insignificant overhead compared to simply avoiding doing the things you do not need.
There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall clock used in the overall task. They are completely different things.
For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset to memory, construct an array of pointers to each line, sort the pointers, and finally write each line (using either internal buffering and/or POSIX writev() so that you do not need to do a write() syscall for each separate line).
However, if you wish to minimize the wall clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert-in-order each line completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting them using any robust sorting algorithm! The sorting algorithms that work inline (as data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that is otherwise wasted waiting for the entire dataset to load into memory, it is completed in less wall clock time!
If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme and the sort scheme) on their own. (I do consider using resources like Linux man pages online, and Wikipedia articles and pseudocode on, for example, binary heaps, to be doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)

Flushing pipe without closing in C

I have found a lot of threads in here asking about how it is possible to flush a pipe after writing to it without closing it.
In every thread I could see different suggestions, but I could not find a definitive solution.
Here is a quick summary:
The easiest way to avoid read blocking on the pipe would be to write the exact number of bytes that is being read.
It could also be done by using ptmx instead of a pipe, but people said it could be too much.
Note: It's not possible to use fsync with pipes
Are there any other more efficient solutions?
Edit:
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m>n). The client will block waiting for another m-n characters. If the sender wants to communicate with the client again, closing the pipe is not an option, and always sending the exact number of bytes could be a good source of bugs.
The receiver operates like this and it cannot be modified:
while ((n = read(0, buf, 100)) > 0) {
    process(buf);
}
so the sender wants "file1" and "file2" to get processed, for which it will have to:
write(pipe[1], "file1\0*95", 100);   /* "file1" padded to 100 bytes with NULs */
write(pipe[1], "file2\0*95", 100);
What I am looking for is a way to do something like this (without needing to use \n as the delimiter):
write(pipe[1], "file1\nfile2", 11); //it would have worked if it was ptmx
(Using read and write)
Flushing in the sense of fflush() is irrelevant to pipes, because they are not represented as C streams. There is therefore no userland buffer to flush. Similarly, fsync() is also irrelevant to pipes, because there is no back-end storage for the data. Data successfully written to a pipe exist in the kernel and only in the kernel until they are successfully read, so there is no work for fsync() to do. Overall, flushing / syncing is applicable only where there is intermediate storage involved, which is not the case with pipes.
With the clarification, your question seems to be about establishing message boundaries for communication via a pipe. You are correct that closing the write end of the pipe will signal a boundary -- not just of one message, but of the whole stream of communication -- but of course that's final. You are also correct that there are no inherent message boundaries. Nevertheless, you seem to be working from at least somewhat of a misconception here:
The easiest way to avoid read blocking on the pipe would be to write the exact number of bytes that is being read.
[...]
The flush would be convenient when the sender wants to write n characters but the client reads m characters (where m>n). The client will block waiting for another m-n characters.
Whether the reader will block is entirely dependent on how the reader is implemented. In particular, the read(2) system call in no way guarantees to transfer the requested number of bytes before returning. It can and will perform a short read under some circumstances. Although the details are unspecified, you can ordinarily rely on a short read when at least one character can be transferred without blocking, but not the whole number requested. Similar applies to write(2). Thus, the easiest way to avoid read() blocking is to ensure that you write at least one byte to the pipe for that read() call to transfer.
In fact, people usually come at this issue from the opposite direction: needing to be certain to receive a specific number of bytes, and therefore having to deal with the potential for short reads as a complication (to be dealt with by performing the read() in a loop). You'll need to consider that, too, but you have the benefit that your client is unlikely to block under the circumstances you describe; it just isn't the problem you think it is.
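For completeness, here is the usual loop for dealing with short reads when an exact count is needed (a sketch):
#include <errno.h>
#include <unistd.h>

/* Read exactly count bytes, looping over short reads.
 * Returns count, fewer on EOF, or -1 on error. */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;                  /* EOF before count bytes arrived */
        done += (size_t)n;
    }
    return (ssize_t)done;
}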
There is an inherent message-boundary problem in any kind of stream communication, however, and you'll need to deal with it. There are several approaches; among the most commonly used are
Fixed-length messages. The receiver can then read until it successfully transfers the required number of bytes; any blocking involved is appropriate and needful. With this approach, the scenario you postulate simply does not arise, but the writer might need to pad its messages.
Delimited messages. The receiver then reads until it finds that it has received a message delimiter (a newline or a null byte, for example). In this case, the receiver will need to be prepared for the possibility of message boundaries not being aligned with the byte sequences transferred by read() calls. Marking the end of a message by closing the channel can be considered a special case of this alternative.
Embedded message-length metadata. This can take many forms, but one of the simplest is to structure messages as a fixed-length integer message length field, followed by that number of bytes of message data. The reader then knows at every point how many bytes it needs to read, so it will not block needlessly.
These can be used individually or in combination to implement an application-layer protocol for communicating between your processes. Naturally, the parties to the communication must agree on the protocol details for communication to be successful.
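A minimal sketch of the third option, framing each message with a fixed-size length prefix; the 4-byte network-order header is an assumption, short writes are glossed over for brevity, and read_fully() is the loop sketched above:
#include <arpa/inet.h>              /* htonl, ntohl */
#include <stdint.h>
#include <unistd.h>

ssize_t read_fully(int fd, void *buf, size_t count);  /* as sketched above */

/* Send one message: 4-byte big-endian length, then the payload. */
int send_msg(int fd, const void *data, uint32_t len)
{
    uint32_t hdr = htonl(len);
    if (write(fd, &hdr, sizeof hdr) != sizeof hdr)
        return -1;
    return write(fd, data, len) == (ssize_t)len ? 0 : -1;
}

/* Receive one message into buf (capacity cap); returns its length or -1. */
ssize_t recv_msg(int fd, void *buf, size_t cap)
{
    uint32_t hdr;
    if (read_fully(fd, &hdr, sizeof hdr) != (ssize_t)sizeof hdr)
        return -1;
    uint32_t len = ntohl(hdr);
    if (len > cap)
        return -1;                  /* message too large for caller's buffer */
    return read_fully(fd, buf, len);
}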

Strategy for estimating / calculating buffer space needed by writer function on embedded system

This isn't a show-stopping programming problem as such, but perhaps more of a design pattern issue. I'd have thought it'd be a common design issue on embedded resource-limited systems, but none of the questions I found so far on SO seem relevant (but please point out anything relevant that I could have missed).
Essentially, I'm trying to work out the best strategy of estimating the largest buffer size required by some writer function, when that writer function's output isn't fixed, particularly because some of the data are text strings of variable length.
This is a C application that runs on a small ARM micro. The application needs to send various message types via TCP socket. When I want to send a TCP packet, the TCP stack (Keil RL) provides me with a buffer (which the library allocates from its own pool) into which I may write the packet data payload. That buffer size depends of course on the MSS; so let's assume it's 1460 at most, but it could be smaller.
Once I have this buffer, I pass this buffer and its length to a writer function, which in turn may call various nested writer functions in order to build the complete message. The reason for this structure is because I'm actually generating a small XML document, where each writer function typically generates a specific XML element. Each writer function wants to write a number of bytes to my allocated TCP packet buffer. I only know exactly how many bytes a given writer function writes at run-time, because some of the encapsulated content depends on user-defined text strings of variable length.
Some messages need to be around (say) 2K in size, meaning they're likely to be split across at least two TCP packet send operations. Those messages will be constructed by calling a series of writer functions that produce, say, a hundred bytes at a time.
Prior to making a call to each writer function, or perhaps within the writer function itself, I initially need to compare the buffer space available with how much that writer function requires; and if there isn't enough space available, then transmit that packet and continue writing into a fresh packet later.
Possible solutions I am considering are:
Use another, much larger buffer to write everything into initially. This isn't preferred because of resource constraints. Furthermore, I would still want a means of algorithmically working out how much space my message writer functions need.
At compile time, produce a 'worst case size' constant for each writer function. Each writer function typically generates an XML element such as <START_TAG>[string]</START_TAG>, so I could have something like: #define SPACE_NEEDED ( START_TAG_LENGTH + END_TAG_LENGTH + MAX_STRING_LENGTH + SOME_MARGIN ). All of my content writer functions are picked out of a table of function pointers anyway, so the worst-case size estimate constants could exist as a new column in that table (see the first sketch after this list). At run-time, I check the buffer room against that estimate constant. This is probably my favourite solution at the moment. The only downside is that it relies on correct maintenance to keep working.
My writer functions provide a special 'dummy run' mode, where they run through and calculate how many bytes they want to write without writing anything. This could be achieved by simply passing NULL in place of the buffer pointer, in which case the function's return value (which usually states the amount written to the buffer) states how much it wants to write (see the second sketch after this list). The only thing I don't like about this is that, between the 'dummy' and 'real' calls, the underlying data could -- at least in theory -- change. A possible solution for that would be to statically capture the underlying data.
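A sketch of the table-driven worst-case approach (all names hypothetical):
#include <stddef.h>

/* Each writer fills buf (at most len bytes) and returns the bytes written. */
typedef size_t (*writer_fn)(char *buf, size_t len);

size_t write_header(char *buf, size_t len);   /* hypothetical writers */
size_t write_status(char *buf, size_t len);

#define MAX_STRING_LENGTH 64
#define TAG_OVERHEAD      32    /* start tag + end tag + margin, assumed */

struct writer_entry {
    writer_fn write;
    size_t    worst_case;       /* maintained alongside the writer itself */
};

static const struct writer_entry writers[] = {
    { write_header, TAG_OVERHEAD + MAX_STRING_LENGTH },
    { write_status, TAG_OVERHEAD + 16 },
};

/* Fill a packet buffer, stopping before any writer that might not fit;
 * the caller transmits the packet and resumes from the writer that broke. */
size_t fill_packet(char *buf, size_t len)
{
    size_t used = 0;
    for (size_t i = 0; i < sizeof writers / sizeof writers[0]; i++) {
        if (len - used < writers[i].worst_case)
            break;
        used += writers[i].write(buf + used, len - used);
    }
    return used;
}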
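And a sketch of the dummy-run convention, borrowing snprintf()'s trick of returning the required size when given no buffer:
#include <stdio.h>

/* Writes <NAME>value</NAME> into buf. If buf is NULL, performs a dummy
 * run and returns the number of bytes it *would* write. */
size_t write_name_element(char *buf, size_t len, const char *value)
{
    /* snprintf(NULL, 0, ...) safely computes the required length. */
    int needed = snprintf(NULL, 0, "<NAME>%s</NAME>", value);
    if (needed < 0)
        return 0;
    if (buf == NULL)
        return (size_t)needed;      /* dummy run: size query only */
    if ((size_t)needed >= len)
        return 0;                   /* caller must flush and retry */
    return (size_t)snprintf(buf, len, "<NAME>%s</NAME>", value);
}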
Thanks in advance for any thoughts and comments.
Solution
Something I had actually already started doing since posting the question was to make each content writer function accept a state, or 'iteration' parameter, which allows the writer to be called many times over by the TCP send function. The writer is called until it flags that it has no more to write. If the TCP send function decides after a certain iteration that the buffer is now nearing full, it sends the packet and then the process continues later with a new packet buffer. This technique is very similar I think to Max's answer, which I've therefore accepted.
A key thing is that on each iteration, a content writer must be designed so that it won't write more than LENGTH bytes to the buffer; and after each call to the writer, the TCP send function will check that it has LENGTH room left in the packet buffer before calling the writer again. If not, it continues in a new packet.
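A sketch of that iteration contract (names hypothetical; LENGTH is the per-call write cap):
#include <stddef.h>

#define LENGTH 128                  /* max bytes a writer may emit per call */

/* A content writer resumes from *state on each call; it returns the bytes
 * written, or 0 when it has nothing more to write. */
typedef size_t (*content_writer)(char *buf, unsigned *state);

void transmit_packet(const char *buf, size_t len);   /* hypothetical TCP send */

void send_message(char *pkt, size_t pkt_len, content_writer writer)
{
    unsigned state = 0;
    size_t used = 0, n;

    while ((n = writer(pkt + used, &state)) > 0) {
        used += n;
        if (pkt_len - used < LENGTH) {   /* no room for the next iteration */
            transmit_packet(pkt, used);
            used = 0;                    /* continue into a fresh buffer */
        }
    }
    if (used > 0)
        transmit_packet(pkt, used);
}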
Another step I took was to think seriously about how I structure my message headers. It became apparent that, as with almost all protocols that use TCP, it is essential to build into the application protocol some means of indicating the total message length, because TCP is a stream-based protocol, not a packet-based one. This is where it got to be a bit of a headache, because I needed some upfront means of knowing the total message length to insert into the start header. The simple solution was to insert a message header at the start of every sent TCP packet, rather than only at the start of the application protocol message (which may of course span several TCP packets), and basically implement fragmentation. So, in the header, I implemented two flags: a fragment flag and a last-fragment flag. Therefore the length field in each header only needs to state the size of the payload in that particular packet. At the receiving end, individual header+payload chunks are read out of the stream and then reassembled into a complete protocol message.
This is no doubt, very simplistically, how HTTP and so many other protocols work over TCP. It's just quite interesting that only once I attempted to write a robust protocol over TCP did I start to realise the importance of really thinking through your message structure in terms of headers, framing, and so forth, so that it works over a stream protocol.
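A sketch of such a per-packet header (the layout is hypothetical, and the packed attribute is a GCC/Clang extension):
#include <stdint.h>

#define FLAG_FRAGMENT      0x01     /* payload is part of a larger message */
#define FLAG_LAST_FRAGMENT 0x02     /* final piece of the message */

/* One header per TCP send; length covers only this packet's payload. */
struct packet_header {
    uint8_t  flags;
    uint16_t payload_len;           /* bytes following this header */
} __attribute__((packed));          /* avoid padding on the wire */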
I had a related problem in a much smaller embedded system, running on a PIC 16 micro-controller (and written in assembly language, rather than C). My 'buffer size' was always going to be the two byte UART transmit queue, and I had only one 'writer' function, which was walking a DOM and emitting its XML serialisation.
The solution I came up with was to turn the problem 'inside out'. The writer function becomes a task: each time it is called it writes as many bytes as it can (which may be >2 depending on the serial data transmission rate) until the transmit buffer is full, then it returns. However, it remembers, in a state variable, how far it had got through the DOM. The next time it is called, it carries on from the point previously reached. If there is no free buffer space, it returns immediately without changing its state. The writer task is called repeatedly from an infinite loop, which acts as a round-robin scheduler for this task and the others in the system. Each time round the loop, there is a delay that waits for the TMR0 timer to overflow, so each task gets called exactly once per fixed time slice.
In my implementation, the data is transmitted by a TxEmpty interrupt routine, but it could also be sent by another task.
I guess the 'pattern' here is that one role of the program counter is to hold the current state of the flow of control, and that this role can be abstracted away from the PC to another data structure.
Obviously, this isn't immediately applicable to your larger, higher-level system. But it is a different way of looking at the problem, which may spark your own particular insight.
Good luck!

Buffering data from sockets?

I am trying to make a simple HTTP server that would be able to parse client requests and send responses back.
Now I have a problem. I have to read and handle one line at a time in the request, and I don't know if I should:
read one byte at a time, or
read chunks of N bytes at a time, put them in a buffer, and then handle the bytes one by one, before reading a new chunk of bytes.
What would be the best option, and why?
Also, are there some alternative solutions to this? Like a function that would read a line at a time from the socket or something?
Single byte at a time is going to kill performance. Consider a circular buffer of decent size.
Read chunks of whatever size is free in the buffer. Most of the time you will get short reads. Check for the end of the HTTP command in the bytes you read. Process complete commands; the next byte then becomes the head of the buffer. If the buffer becomes full, copy it off to a backup buffer, use a second circular buffer, report an error, or whatever is appropriate.
The short answer to your question is that I would go with reading a single byte at a time. Unfortunately it's one of those cases where there are pros and cons for both approaches.
In favour of a buffer is the fact that the implementation can be more efficient from the perspective of network IO. Against a buffer, I think the code will be inherently more complex than the single-byte version. So it's an efficiency vs. complexity trade-off. The good news is that you can implement the simple solution first, profile the result, and "upgrade" to a buffered approach if testing shows it to be worthwhile.
Also, just to note, as a thought experiment I wrote some pseudo code for a loop that does buffer-based reads of HTTP packets, included below. The complexity of implementing a buffered read doesn't seem too bad. Note, however, that I haven't given much consideration to error handling, or tested whether this will work at all. However, it should avoid excessive "double handling" of data, which is important since that would reduce the efficiency gains that were the purpose of this approach.
#define CHUNK_SIZE 1024

size_t nextHttpBytesRead = 0;
char *nextHttp = NULL;
while (1)
{
    size_t httpBytesRead = nextHttpBytesRead;
    size_t httpBufSize = httpBytesRead;
    size_t thisHttpSize;
    char *http = nextHttp;
    char *temp;
    char *httpTerminator;
    do
    {
        /* Grow the buffer by one chunk (+1 for the terminating NUL). */
        temp = realloc(http, httpBufSize + CHUNK_SIZE + 1);
        if (NULL == temp)
            ... /* handle allocation failure */
        http = temp;
        httpBufSize += CHUNK_SIZE;
        ssize_t n = read(httpSocket, http + httpBytesRead, CHUNK_SIZE);
        if (n <= 0)
            ... /* handle EOF / error */
        httpBytesRead += (size_t)n;
        http[httpBytesRead] = '\0';          /* strstr needs a terminated string */
        httpTerminator = strstr(http, "\r\n\r\n");
    } while (NULL == httpTerminator);
    thisHttpSize = (size_t)(httpTerminator - http) + 4;   /* include terminator */
    nextHttpBytesRead = httpBytesRead - thisHttpSize;
    /* Adding CHUNK_SIZE here means the first realloc won't have to do any work. */
    nextHttp = malloc(nextHttpBytesRead + CHUNK_SIZE + 1);
    memcpy(nextHttp, http + thisHttpSize, nextHttpBytesRead);
    http[thisHttpSize] = '\0';
    processHttp(http);
    free(http);
}
The TCP data stream arrives one IP packet at a time, which can be up to 1,500 or so bytes depending on the IP layer configuration. On Linux, this data waits in one SKB until the application layer reads it off the queue. If you read one byte at a time you suffer the overhead of context switches between the application and kernel simply to copy one byte from one structure to another. The optimum solution is to use non-blocking IO to read the content of one SKB at a time, minimising the switches.
If you are after optimum bandwidth, you could read off larger numbers of bytes to further reduce the context switches, at the expense of latency, as more time will be spent in the application copying memory. But this only applies to extremes, and such code changes should be implemented only when required.
If you examine the multitude of existing HTTP technologies you can find alternative approaches such as using multiple threads and blocking sockets, pushing more work back into the kernel to reduce the overhead of switching into the application and back.
I have implemented a HTTP server library very similar to torak's pseudo code here, http://code.google.com/p/openpgm/source/browse/trunk/openpgm/pgm/http.c The biggest speed improvements for this implementation came from making everything asynchronous so nothing ever blocks.
Indy, for example, takes the buffered approach. When the code asks Indy to read a line, it first checks its current buffer to see if a line break is present. If not, the network is read in chunks and appended to the buffer until the line break appears. Once it does, just the data up to the line break is removed from the buffer and returned to the app, leaving any remaining data in the buffer for the next reading operation. This makes for a good balance between a simple application-level API (ReadLine, ReadStream, etc), while providing for efficient network I/O (read everything that is currently available in the socket, buffer it, and ack it so the sender is not waiting too long - fewer network-level reads are needed this way).
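In C, a buffered ReadLine along those lines might look like this sketch (a single static buffer for brevity, so it serves one socket only):
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

static char   linebuf[8192];
static size_t buffered;

/* Return one LF-terminated line (copied into out); read from the socket
 * only when the buffer does not yet hold a complete line. */
ssize_t read_line(int fd, char *out, size_t cap)
{
    for (;;) {
        char *nl = memchr(linebuf, '\n', buffered);
        if (nl != NULL) {
            size_t len = (size_t)(nl - linebuf) + 1;
            if (len > cap)
                return -1;                   /* caller's buffer too small */
            memcpy(out, linebuf, len);
            buffered -= len;
            memmove(linebuf, linebuf + len, buffered);  /* keep the rest */
            return (ssize_t)len;
        }
        if (buffered == sizeof linebuf)
            return -1;                       /* line longer than the buffer */
        ssize_t n = recv(fd, linebuf + buffered, sizeof linebuf - buffered, 0);
        if (n <= 0)
            return n;                        /* EOF or error */
        buffered += (size_t)n;
    }
}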
Start by reading one byte at a time (though noting that lines end with cr/lf in HTTP) because it's simple. If that's not sufficient, do more complex things.
Read a byte array buffer at a time. Reading single characters will be dog slow because of the multiple context switches between user and kernel mode (depending on the libc actually).
If you read buffers, you need to be prepared for the buffer to be either not filled completely (watch the returned length), to not contain enough bytes to reach the line end, or to contain more than one line.
It is a common pattern in network applications to have to map your line or fixed-size block requests onto that variable stream of buffers (and it is often implemented wrong; for example, a 0-byte-length answer is possible). Higher-level languages hide this complexity from you.
