Pipelining a set of C buffers

I am creating Ethernet packets in an embedded system. I have my Data / IP and UDP packet headers defined in pre-allocated buffers and I have a large buffer that is used to grab data from the FPGA's fabric using DMA.
I also have some user-data headers and footers whose contents come from the fabric in other ways, mostly SPI transfers of temperature, PCB address, etc., or even grabs of some of the configuration registers (a single transaction, on boot).
At the moment I concatenate these using memcpy into a new, larger buffer (also pre-allocated), and then copy the result into the transmit buffer of the on-FPGA MAC.
My issues:
1) All these buffers live on the FPGA and therefore consume memory. I could copy them one at a time into the MAC Tx buffer, but that would prevent my second idea.
2) Because everything is a buffer, there is the possibility of forming a pipeline, where new data (DN+1) can be put into the first buffers while subsequent buffers are still storing and concatenating the data of (DN+0).
Given nicely modularised code, how do I create a pipeline from buffer to buffer? In hardware I'd use flags, only passing data from buffer A to B when buffer B has finished passing its data to C. In C, memcpy and memmove only return the destination pointer rather than a status, so I'd need to make my own boolean flag that is set after the memcpy finishes, and I'd need to make these flags global so that I can easily pass their status into other functions.
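A minimal sketch of that flag-based hand-off might look like the following (all names and sizes are illustrative, not from the original code; the volatile flag assumes one side of each stage may run from an interrupt or a different context):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>
#include <string.h>

#define STAGE_SIZE 1500u                 /* assumption: one Ethernet frame payload */

typedef struct {
    uint8_t       buf[STAGE_SIZE];
    size_t        len;
    volatile bool full;                  /* set when data is ready, cleared when drained */
} stage_t;

/* Producer side: copy into the stage only if it is empty. */
static bool stage_put(stage_t *s, const uint8_t *src, size_t len)
{
    if (s->full || len > STAGE_SIZE)
        return false;                    /* downstream not ready, try again later */
    memcpy(s->buf, src, len);
    s->len  = len;
    s->full = true;                      /* the flag flip plays the role of a hardware "valid" strobe */
    return true;
}

/* Consumer side: drain one stage into the next (or into the MAC Tx buffer). */
static bool stage_move(stage_t *from, stage_t *to)
{
    if (!from->full || to->full)
        return false;                    /* only move when the downstream buffer is free */
    memcpy(to->buf, from->buf, from->len);
    to->len    = from->len;
    to->full   = true;
    from->full = false;                  /* stage is now free for the next datum (DN+1) */
    return true;
}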
Finally, as this is embedded, I don't have access to the full C libraries and both time and memory are at a premium.
Thanks
Ed

Related

get_user_pages_fast() for DMA?

I have a Linux driver that does DMA transfers to/from a device. For sending data to the device (to prevent copy operations) the driver maps the userspace buffer and uses it for DMA directly via get_user_pages_fast(). The user pages are then added to a scatter-gather list and used for DMA.
This works rather well, but the one issue is that it forces the userspace buffer to meet alignment requirements derived from the CPU cache line. My system returns 128 from dma_get_cache_alignment(), which means that in userspace I have to ensure the start address is aligned to that value. I also have to check that the buffer size is a multiple of 128.
I see two options for handling this:
Deal with it. That is, in userspace ensure that the buffer is properly aligned. This sounds reasonable, but I have run into some issues since my device has to be integrated into a larger project, and I don't have control over the buffers that get passed to me. As a result, I have to allocate a properly aligned buffer in userspace to sit between the driver and the application and use that buffer in the event the caller's buffer is not aligned. This adds a copy operation and isn't the end of the world, but the resulting code is rather messy.
Rework the driver to use a kernel space buffer. That is, change the code such that the driver uses copy_from_user() to move the data into a properly aligned kernel space buffer. I'm not too concerned about the performance here, so this is an option, but would require a good amount of rework.
Is there anything that I'm missing? I'm hoping that there might be some magic flag or something that I overlooked to remove the alignment requirement altogether.
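For illustration, if option 1 is chosen, the intermediate aligned buffer can be allocated in userspace along these lines (a sketch; the helper name is made up, and 128 is the value dma_get_cache_alignment() reports on the system described above):

#include <stdlib.h>
#include <stddef.h>

#define DMA_ALIGN 128u   /* value reported by dma_get_cache_alignment() */

static void *alloc_dma_bounce(size_t len, size_t *padded_len)
{
    /* Round the length up to a multiple of the alignment. */
    size_t padded = (len + DMA_ALIGN - 1) & ~(size_t)(DMA_ALIGN - 1);
    void  *buf    = NULL;

    if (posix_memalign(&buf, DMA_ALIGN, padded) != 0)
        return NULL;                     /* allocation failed */
    if (padded_len)
        *padded_len = padded;
    return buf;                          /* caller copies the unaligned payload in, then free()s */
}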

how to design a server for variable size messages

I want some feedback or suggestions on how to design a server that handles variable-size messages.
To simplify the answers, let's assume:
single-threaded, epoll()-based
the protocol is: data-size + data
data is stored in a ring buffer
The read code, simplified for clarity, looks like this:
if (client->readable) {
    if (client->remaining > 0) {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
        rd = read(client->sock, client->buffer, client->remaining);
        client->buffer += rd;
        client->remaining -= rd;
    } else {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
        read(client->sock, &(client->remaining), 4);
        client->buffer = acquire_ringbuf_slot(client->remaining);
    }
}
Please do not focus on the 4-byte size field; just assume we have the data size at the beginning. Whether it is compressed or not makes no difference for this discussion.
Now, the question is: what is the best way to do the above?
Assume both small data (a few bytes) and large data (several MB).
How can we reduce the number of read() calls? E.g., if there are 4 messages of 16 bytes each on the stream, it seems wasteful to make 8 calls to read().
Are there better alternatives to this design?
PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code I assume you understand TCP is a stream-oriented protocol
(So when a client sends a piece of data, that data is stored in the socket send buffer and TCP may use one or more TCP segments to convey it to the other end (server)).
So the code looks very good so far (assuming you have error checks and other things in the real code).
Now for your questions, here are my responses, based on what I think works best in my experience (though there may be better solutions):
1-This problem has challenges similar to those an OS faces when managing memory: dealing with fragmentation.
For handling different message sizes, you have to understand that there are always trade-offs depending on your performance goals.
One solution that improves memory utilization and parallelization is to keep a list of free buffer chunks of a certain size, say 4 KB.
You retrieve as many as you need to store the received message; the last one will contain unused space, so you are trading internal fragmentation for simplicity.
The drawback appears when you need to apply some processing (maybe a visitor pattern) to the message, such as parsing/routing/transformation/etc. It will be more complex and less efficient than with a single huge buffer of contiguous memory. On the other hand, the drawbacks of a huge buffer are much poorer memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in between (think of chunks that are also contiguous whenever possible), always depending on your goals. Something useful is to implement an abstraction over the fragmented memory so that every function (or visitor) applied to it works as if it were dealing with contiguous memory.
If you use these chunks, once the message has been processed and dropped/forwarded/consumed/whatever, you return the unused chunks to the free list (a sketch of such a pool appears at the end of this answer).
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at the TCP layer is to increase the server's receive buffer, so that it can accept more data even when the server cannot read it fast enough.
3-The ring buffer is OK; if you use chunks, the ring buffer should provide the abstraction over them. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
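Going back to point 1, a minimal sketch of such a free-chunk pool might look like this (the names and the 4 KB size are only illustrative; the pool could equally be carved from a static array at start-up instead of using malloc):

#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096u

typedef struct chunk {
    struct chunk *next;                  /* next chunk of this message, or next in the free list */
    size_t        used;                  /* bytes of payload actually stored in this chunk       */
    char          data[CHUNK_SIZE];
} chunk_t;

static chunk_t *free_list;               /* singly-linked list of idle chunks */

static chunk_t *chunk_get(void)
{
    chunk_t *c = free_list;
    if (c != NULL)
        free_list = c->next;             /* reuse an idle chunk */
    else
        c = malloc(sizeof *c);           /* or grow the pool (could be pre-allocated at boot) */
    if (c != NULL) {
        c->next = NULL;
        c->used = 0;
    }
    return c;
}

static void chunk_put(chunk_t *c)        /* return a whole chain to the pool */
{
    while (c != NULL) {
        chunk_t *n = c->next;
        c->next   = free_list;
        free_list = c;
        c = n;
    }
}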
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should only read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read data into the user-space buffer, it should invoke a callback, passing iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass them on to another, higher-level callback. The parser function should return the number of bytes it consumed, so that those bytes can be discarded from the user-space socket buffer.
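A hedged sketch of that read-then-parse split (function and type names are hypothetical; error handling is trimmed):

#include <unistd.h>
#include <string.h>
#include <stddef.h>
#include <sys/types.h>

/* Parser callback: extracts whole messages and returns how many bytes it consumed. */
typedef size_t (*parse_cb)(const char *data, size_t len, void *ctx);

typedef struct {
    char   buf[65536];                   /* ideally sized to match the kernel socket buffer */
    size_t used;
} conn_buf;

static int on_readable(int sock, conn_buf *cb, parse_cb parse, void *ctx)
{
    if (cb->used == sizeof cb->buf)
        return -1;                       /* a single message is larger than the buffer */

    ssize_t rd = read(sock, cb->buf + cb->used, sizeof cb->buf - cb->used);
    if (rd <= 0)
        return (int)rd;                  /* 0 = peer closed, -1 = error / EAGAIN */
    cb->used += (size_t)rd;

    size_t consumed = parse(cb->buf, cb->used, ctx);
    if (consumed > 0) {
        /* Keep only the trailing partial message at the front of the buffer. */
        memmove(cb->buf, cb->buf + consumed, cb->used - consumed);
        cb->used -= consumed;
    }
    return 1;
}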

Zero-copy with and without Scatter/Gather operations

I just read an article that explains the zero-copy mechanism.
It talks about the difference between zero-copy with and without Scatter/Gather supports.
(The article includes diagrams of the data-copy steps for a NIC without SG support and for a NIC with SG support; the diagrams are not reproduced here.)
In a word, zero-copy with SG support can eliminate one CPU copy.
My question is: why could the data in the kernel buffer be scattered?
Because the Linux kernel's mapping / memory allocation facilities by default will create virtually-contiguous but possibly physically-disjoint memory regions.
That means the read from the filesystem which sendfile() does internally goes to a buffer in kernel virtual memory, which the DMA code has to "transmogrify" (for lack of a better word) into something that the network card's DMA engine can grok.
Since DMA (often, but not always) uses physical addresses, that means you either duplicate the data buffer (into a specially allocated, physically contiguous region of memory, your socket buffer above), or else transfer it one physical page at a time.
If your DMA engine, on the other hand, is capable of aggregating multiple physically disjoint memory regions into a single data transfer (that's called "scatter-gather"), then instead of copying the buffer you can simply pass a list of physical addresses (pointing to physically contiguous sub-segments of the kernel buffer, your aggregate descriptors above), and you no longer need to start a separate DMA transfer for each physical page. This is usually faster, but whether it can be done depends on the capabilities of the DMA engine.
Re: why could the data in the kernel buffer be scattered?
Because it already is scattered. The data queue in front of a TCP socket is not divided into the datagrams that will go out onto the network interface. Scatter allows you to keep the data where it is and not have to copy it to make a flat buffer that is acceptable to the hardware.
With the gather feature, you can give the network card a datagram which is broken into pieces at different addresses in memory, which can be references to the original socket buffers. The card will read it from those locations and send it as a single unit.
Without gather (hardware requires simple, linear buffers) a datagram has to be prepared as a contiguously allocated byte string, and all the data which belongs to it has to be memcpy-d into place from the buffers that are queued for transmission on the socket.
Because when you write to a socket, the packet headers are assembled in a different place from your user data, so for them to be coalesced into a network packet the device needs "gather" capability, at least to pick up the headers and the data.
Also to avoid the CPU having to read the data (and thus, fill its cache up with useless stuff it's never going to need again), the network card also needs to generate its own IP and TCP checksums (I'm assuming TCP here, because 99% of your bulk data transfers are going to be TCP). This is OK, because nowadays they all can.
What I'm not sure is, how this all interacts with TCP_CORK.
Most protocols tend to have their own headers, so a hypothetical protocol looks like:
Client: Send request
Server: Send some metadata; send the file data
So we tend to have a server application assembling some headers in memory, issuing a write(), followed by a sendfile()-like operation. I suppose the headers still get copied into a kernel buffer in this case.
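For what it's worth, on Linux the usual way to stop that header write() from going out as its own small segment is TCP_CORK; a sketch of the header-then-sendfile pattern follows (Linux-specific, error handling abbreviated, names illustrative):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <stddef.h>

static int send_with_header(int sock, int filefd, off_t filelen,
                            const char *hdr, size_t hdrlen)
{
    int   on = 1, off = 0;
    off_t offset = 0;

    /* Cork the socket so the kernel holds partial frames and can merge
     * the header bytes with the start of the file payload. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);

    if (write(sock, hdr, hdrlen) != (ssize_t)hdrlen)
        return -1;

    while (offset < filelen) {
        ssize_t sent = sendfile(sock, filefd, &offset, (size_t)(filelen - offset));
        if (sent <= 0)
            return -1;
    }

    /* Uncork to flush whatever partial segment remains. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off);
    return 0;
}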

Strategy for estimating / calculating buffer space needed by writer function on embedded system

This isn't a show-stopping programming problem as such, but perhaps more of a design pattern issue. I'd have thought it'd be a common design issue on embedded resource-limited systems, but none of the questions I found so far on SO seem relevant (but please point out anything relevant that I could have missed).
Essentially, I'm trying to work out the best strategy of estimating the largest buffer size required by some writer function, when that writer function's output isn't fixed, particularly because some of the data are text strings of variable length.
This is a C application that runs on a small ARM micro. The application needs to send various message types via TCP socket. When I want to send a TCP packet, the TCP stack (Keil RL) provides me with a buffer (which the library allocates from its own pool) into which I may write the packet data payload. That buffer size depends of course on the MSS; so let's assume it's 1460 at most, but it could be smaller.
Once I have this buffer, I pass this buffer and its length to a writer function, which in turn may call various nested writer functions in order to build the complete message. The reason for this structure is because I'm actually generating a small XML document, where each writer function typically generates a specific XML element. Each writer function wants to write a number of bytes to my allocated TCP packet buffer. I only know exactly how many bytes a given writer function writes at run-time, because some of the encapsulated content depends on user-defined text strings of variable length.
Some messages need to be around (say) 2K in size, meaning they're likely to be split across at least two TCP packet send operations. Those messages will be constructed by calling a series of writer functions that produce, say, a hundred bytes at a time.
Prior to making a call to each writer function, or perhaps within the writer function itself, I initially need to compare the buffer space available with how much that writer function requires; and if there isn't enough space available, then transmit that packet and continue writing into a fresh packet later.
Possible solutions I am considering are:
Use another much larger buffer to write everything into initially. This isn't preferred because of resource constraints. Furthermore, I would still wish for a means to algorithmically work out how much space I need by my message writer functions.
At compile time, produce a 'worst case size' constant for each writer function. Each writer function typically generates an XML element such as <START_TAG>[string]</START_TAG>, so I could have something like: #define SPACE_NEEDED ( START_TAG_LENGTH + START_TAG_LENGTH + MAX_STRING_LENGTH + SOME_MARGIN ). All of my content writer functions are picked out of a table of function pointers anyway, so I could keep the worst-case size estimate for each writer function as a new column in that table. At run-time, I check the available buffer room against that estimate. This is probably my favourite solution at the moment; the only downside is that it relies on correct maintenance to keep working.
My writer functions provide a special 'dummy run' mode in which they run through and calculate how many bytes they want to write, but don't write anything. This could be achieved by simply passing NULL in place of the buffer pointer, in which case the function's return value (which usually states the amount written to the buffer) just states how much it wants to write (see the sizing sketch below). The only thing I don't like about this is that, between the 'dummy' and 'real' calls, the underlying data could, at least in theory, change. A possible solution to that would be to statically capture the underlying data.
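As an illustration of that 'dummy run' idea: snprintf already supports exactly this kind of sizing pass (C99 behaviour: with a NULL buffer and zero size it only reports the length it would have written). The element name and the helper below are hypothetical, and this assumes snprintf is available in the target's C library:

#include <stdio.h>
#include <stddef.h>

static size_t write_name_element(char *buf, size_t room, const char *name)
{
    /* With buf == NULL we pass a size of 0, so snprintf only computes the length. */
    int need = snprintf(buf, buf ? room : 0, "<NAME>%s</NAME>", name);
    return (need < 0) ? 0 : (size_t)need;    /* bytes wanted (or written) */
}

/* Dummy run:  size_t need = write_name_element(NULL, 0, user_string);
 * Real run:   if (need + 1 <= space_left)                       -- +1 for snprintf's NUL
 *                 write_name_element(dst, space_left, user_string);
 */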
Thanks in advance for any thoughts and comments.
Solution
Something I had actually already started doing since posting the question was to make each content writer function accept a state, or 'iteration' parameter, which allows the writer to be called many times over by the TCP send function. The writer is called until it flags that it has no more to write. If the TCP send function decides after a certain iteration that the buffer is now nearing full, it sends the packet and then the process continues later with a new packet buffer. This technique is very similar I think to Max's answer, which I've therefore accepted.
A key thing is that on each iteration, a content writer must be designed so that it won't write more than LENGTH bytes to the buffer; and after each call to the writer, the TCP send function will check that it has LENGTH room left in the packet buffer before calling the writer again. If not, it continues in a new packet.
Another step I took was to have a serious think about how I structure my message headers. It became apparent that, as with almost all protocols that use TCP, it is essential to build into the application protocol some means of indicating the total message length. The reason is that TCP is a stream-based protocol, not a packet-based protocol. This is again where it became a bit of a headache, because I needed some up-front means of knowing the total message length for insertion into the start header. The simple solution was to insert a message header into the start of every sent TCP packet, rather than only at the start of the application-protocol message (which may of course span several TCP packets), and basically implement fragmentation. So, in the header, I implemented two flags: a fragment flag and a last-fragment flag. The length field in each header therefore only needs to state the size of the payload in that particular packet. At the receiving end, individual header+payload chunks are read out of the stream and then reassembled into a complete protocol message.
This is no doubt, very simplistically, how HTTP and so many other protocols work over TCP. It's just quite interesting that only once I had attempted to write a robust protocol that works over TCP did I start to realise the importance of really thinking through your message structure in terms of headers, framing and so forth, so that it works over a stream protocol.
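A minimal sketch of such an iterated writer (hypothetical names, reduced to copying from a flat source so that the saved state is just a pointer and a count; a real XML writer would store which element it got to instead):

#include <stddef.h>
#include <stdbool.h>
#include <string.h>

typedef struct {
    const char *src;     /* data still to be written */
    size_t      left;    /* bytes remaining          */
} writer_state;

/* Writes at most 'room' bytes per call and remembers where it got to.
 * Returns the number of bytes written this iteration; *done is set
 * once nothing remains, so the TCP send loop knows when to stop. */
static size_t writer_step(writer_state *st, char *buf, size_t room, bool *done)
{
    size_t n = (st->left < room) ? st->left : room;
    memcpy(buf, st->src, n);
    st->src  += n;
    st->left -= n;
    *done = (st->left == 0);
    return n;
}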
I had a related problem in a much smaller embedded system, running on a PIC 16 micro-controller (and written in assembly language, rather than C). My 'buffer size' was always going to be the two byte UART transmit queue, and I had only one 'writer' function, which was walking a DOM and emitting its XML serialisation.
The solution I came up with was to turn the problem 'inside out'. The writer function becomes a task: each time it is called it writes as many bytes as it can (which may be more than 2, depending on the serial data transmission rate) until the transmit buffer is full, then it returns. However, it remembers, in a state variable, how far it had got through the DOM. The next time it is called, it carries on from the point previously reached. If there is no free buffer space, it returns immediately without changing its state. The writer task is called repeatedly from an infinite loop, which acts as a round-robin scheduler for this task and the others in the system. Each time round the loop there is a delay that waits for the TMR0 timer to overflow, so each task gets called exactly once per fixed time slice.
In my implementation, the data is transmitted by a TxEmpty interrupt routine, but it could also be sent by another task.
I guess the 'pattern' here is that one role of the program counter is to hold the current state of the flow of control, and that this role can be abstracted away from the PC to another data structure.
Obviously, this isn't immediately applicable to your larger, higher-level system. But it is a different way of looking at the problem, which may spark your own particular insight.
Good luck!

types of buffers

Recently an interviewer asked me about the types of buffers. What types of buffers are there? This question came up when I said I would write all the system calls to a log file to monitor the system. He said it would be slow to write each and every call to a file and asked how to prevent that. I said I would use a buffer, and he asked what type of buffer. Can someone explain the types of buffers, please?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it were a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered them to a single fprintf-type call when it was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the stdio buffering already gives you.
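For reference, one cheap way to get such an extra in-memory layer with plain stdio is simply to give the log stream a larger buffer; a sketch (sizes and names are illustrative, and setvbuf must be called before any other I/O on the stream):

#include <stdio.h>

static char log_buf[64 * 1024];          /* many messages batched per underlying write */

FILE *open_log(const char *path)
{
    FILE *fh = fopen(path, "a");
    if (fh != NULL)
        setvbuf(fh, log_buf, _IOFBF, sizeof log_buf);   /* fully buffered */
    return fh;
}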
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 ^           ^
                                 |           |
                   Buffer next --+           +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case of neither FIFO nor LIFO buffer is the TCP segment reassembly buffers. These may hold segments received out-of-order ("from the future") which are held pending the receipt of intermediate segments not-yet-arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
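As a concrete illustration of the fixed-size circular FIFO mentioned above for keyboard or serial drivers, here is a minimal single-producer/single-consumer version (names are illustrative; the size must be a power of two so the free-running indices can simply be masked):

#include <stdint.h>
#include <stdbool.h>

#define FIFO_SIZE 64u                        /* must be a power of two */

typedef struct {
    uint8_t  buf[FIFO_SIZE];
    unsigned head;                           /* next write position (free-running) */
    unsigned tail;                           /* next read position  (free-running) */
} fifo_t;

static bool fifo_put(fifo_t *f, uint8_t b)
{
    if (f->head - f->tail == FIFO_SIZE)      /* full */
        return false;
    f->buf[f->head++ & (FIFO_SIZE - 1)] = b;
    return true;
}

static bool fifo_get(fifo_t *f, uint8_t *b)
{
    if (f->head == f->tail)                  /* empty */
        return false;
    *b = f->buf[f->tail++ & (FIFO_SIZE - 1)];
    return true;
}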
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
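A sketch of that mmap'd-log idea (Linux/POSIX calls; the fixed mapping size and names are illustrative, and a real log would need to handle growing past the mapped region):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <stddef.h>

#define LOG_SIZE (1u << 20)                  /* 1 MiB mapping, illustrative */

static char  *log_map;
static size_t log_off;

int log_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, LOG_SIZE) < 0) {       /* make sure the file backs the mapping */
        close(fd);
        return -1;
    }
    log_map = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping stays valid after close */
    return (log_map == MAP_FAILED) ? -1 : 0;
}

void log_append(const char *msg, size_t len)
{
    if (log_map != NULL && log_off + len <= LOG_SIZE) {
        memcpy(log_map + log_off, msg, len); /* no write(2) syscall per message */
        log_off += len;
    }
}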
