Is there an efficient way of storing the most recent part of a continuous stream in an array?

I have a never-ending stream of data coming into a program I'm writing. I would like a fixed-size buffer array that stores only the T most recent observations of that stream. However, it's not obvious to me how to implement that efficiently.
What I have done so far is to first allocate the buffer of length T and place incoming observations in consecutive order from the top as they arrive: data_0 -> index 0, data_1 -> index 1, ..., data_{T-1} -> index T-1.
That works fine until the buffer is full. But when observation data_T arrives, index 0 needs to be removed from the buffer and all the remaining rows need to be moved up one step in the array/matrix so the newest data point can be placed at the last index.
That seems to be a very inefficient approach when the buffer is large and hundreds of thousands of elements need to be shifted one row up all the time.
How is this normally solved?

This data structure is called a FIFO queue; in your fixed-size case, a circular (ring) buffer.
Look at Java's Queue API; it has several code examples.
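In C, a minimal ring-buffer sketch might look like this (T and the double element type are illustrative, not from the question): keep a single running write counter, take it modulo T, and overwrite the oldest slot instead of shifting anything.

#include <stddef.h>

#define T 8                         /* buffer capacity; illustrative */

typedef struct {
    double data[T];                 /* assuming double-valued observations */
    size_t n;                       /* total observations pushed so far */
} ring;

/* Store the newest observation, overwriting the oldest once full. O(1). */
static void ring_push(ring *r, double obs)
{
    r->data[r->n % T] = obs;
    r->n++;
}

/* i-th most recent observation: i = 0 is the newest.
 * Valid only for i < T and i < r->n. */
static double ring_recent(const ring *r, size_t i)
{
    return r->data[(r->n - 1 - i) % T];
}

Each insertion is O(1), and any of the T retained observations can be found by index arithmetic rather than by moving data.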

Related

Lockfree buffer updates with variable-length messages in C

I have 2 buffers of size N. I want to write to the buffers from different threads without using locks.
I maintain a buffer index (0 or 1) and an offset where the next write to the buffer starts. If I can get the current offset and set the offset to offset + len_of_the_msg in an atomic manner, it will guarantee that the different threads will not overwrite each other. I also have to take care of buffer overflow: once a buffer is full, switch buffers and set the offset to 0.
Task to do in order:
set a = offset
increment offset by msg_len
if offset > N: switch buffer, set a to 0, set offset to msg_len
I am implementing this in C. Compiler is gcc.
How can I do these operations atomically without using locks? Is it possible to do so?
EDIT:
I don't have to use 2 buffers. What I want to do is "Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached"
re: your edit:
I don't have to use 2 buffers. What I want to do is: Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached
A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.
Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)
Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
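To make the release-store idea concrete, here is a fragment of what such a completion record might look like with C11 atomics (all names and the layout are assumptions on my part, not a tested design):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-message completion record (secondary array entry). */
typedef struct {
    uint32_t         start, end;    /* claimed byte range in the ring */
    _Atomic uint64_t seq;           /* sequence number, published last */
} done_rec;

/* Writer: after copying the message into [start, end), publish it. */
static void mark_done(done_rec *r, uint64_t my_seq)
{
    atomic_store_explicit(&r->seq, my_seq, memory_order_release);
}

/* Reader: true only if the bytes were written this lap of the ring. */
static bool is_done(const done_rec *r, uint64_t expected_seq)
{
    return atomic_load_explicit(&r->seq, memory_order_acquire)
           == expected_seq;
}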
I'm just brainstorming ATM, I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.
It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.
I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.
If you are set on switching buffers:
I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)
But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)
The biggest problem is that the buffer-number can change while you're atomically updating the offset.
It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.
Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.
So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)
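As a sketch, the packed-word CAS loop might look like this in C11 (BUFSZ, the 32/32 bit split, and the names are my assumptions; it also deliberately ignores the completion-tracking problem discussed next):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUFSZ ((uint32_t)1 << 20)   /* per-buffer capacity; illustrative */

/* Buffer number in the high half, byte offset in the low half,
 * so both can be updated with one 64-bit CAS. */
static _Atomic uint64_t state;      /* buf:32 | offset:32 */

/* Claim len bytes (len <= BUFSZ assumed); reports which buffer and
 * offset the caller now owns. */
static void claim(uint32_t len, uint32_t *buf, uint32_t *off)
{
    uint64_t old = atomic_load_explicit(&state, memory_order_relaxed);
    for (;;) {
        uint32_t b = (uint32_t)(old >> 32);
        uint32_t o = (uint32_t)old;
        if (o + len > BUFSZ) {      /* no room: switch buffers */
            b ^= 1u;
            o = 0;
        }
        uint64_t desired = ((uint64_t)b << 32) | (o + len);
        if (atomic_compare_exchange_weak_explicit(&state, &old, desired,
                memory_order_acq_rel, memory_order_relaxed)) {
            *buf = b;               /* we own [o, o+len) in buffer b */
            *off = o;
            return;
        }
        /* The failed CAS reloaded `old`; recompute and retry. */
    }
}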
But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.
A MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make that non-trivial or maybe not viable at all.
The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues, to, though. A writer tries to CAS a write-index, then uses that claimed space.
Obviously all of this would be done with C11 (#include <stdatomic.h>, e.g. _Atomic size_t offsets[2]), or with the GNU C __atomic_... builtins.
I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).
For discussion, let's assume your buffers are organized this way:
#define MAXBUF 100
struct mybuffer {
    char data[MAXBUF];
    int index;
};
struct mybuffer Buffers[2];
int currentBuffer = 0; // switches between 0 and 1
Though parts can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it is really one big critical section. I cannot imagine any compiler having a unicorn primitive for this.
Looking at the GCC __atomic_add_fetch() primitive, this adds a given value (the message size) to a variable (the current buffer index), returning the new value; this way you could test for overflow.
Looking at some rough code that is not correct:
// THIS IS ALL WRONG!
int oldIndex = Buffers[currentBuffer].index;
if (__atomic_add_fetch(&Buffers[currentBuffer].index, mysize, __ATOMIC_SEQ_CST) > MAXBUF)
{
    // overflow, must switch buffers
    // do same thing with new buffer
    // recompute oldIndex
}
// copy your message into Buffers[currentBuffer] at oldIndex
This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.
What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?
The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.
This is the NASCAR of race conditions; I do not see how you can accomplish this without treating the whole thing as a critical section, making other processes wait their turn.
void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    // (as this code assumes), or can they be split across buffers?
    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        // do we have to fill it with NUL bytes or something?
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }
    int myindex = buf->index;
    buf->index += n;
    // copy your data into the buffer at myindex;
    // LEAVE CRITICAL SECTION
}
We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.
But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.
An additional note: it's possible to roll your own critical section code with some interlocked exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and avoids tried-and-true thread primitives designed for exactly this purpose.
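For illustration, here is the same function with the critical section realized by a pthread mutex, reusing the declarations above (a minimal sketch: error checking is omitted, and the reset of index assumes the consumer has already drained the buffer being switched to, which the original code leaves as an open question):

#include <assert.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t buflock = PTHREAD_MUTEX_INITIALIZER;

void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    pthread_mutex_lock(&buflock);                // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];
    if ((buf->index + n) > MAXBUF)
    {
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
        buf->index = 0;                          // assumes consumer drained it
    }
    memcpy(buf->data + buf->index, msg, n);
    buf->index += n;
    pthread_mutex_unlock(&buflock);              // LEAVE CRITICAL SECTION
}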

Is there a portable way to discard a number of readable bytes from a socket-like file descriptor?

Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:
A stream of records is arriving on a file descriptor (which can be a TCP socket, a SOCK_STREAM UNIX domain socket, or potentially a pipe). Each record is preceded by a fixed-size header specifying its type and length, followed by data of variable length. I want to read the header first, and if it's not of a type I'm interested in, I want to discard the following data segment without transferring it into a dummy buffer in userspace.
A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to the asynchronous nature, the records may still be incomplete when the fd becomes readable, or they may be complete but a piece of the next record may already be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between records so I don't need to manage partially loaded records accidentally read from the fd. So I use recv() with the MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing the data from the socket) up to the exact length. This copies the data twice - I want to avoid that by simply discarding an exact amount of the data buffered in the socket.
On Linux, I gather it is possible to achieve that by using splice() and redirecting the data to /dev/null without copying them to userspace. However, splice() is Linux-only, and the similar sendfile() that is supported on more platforms can't use a socket as input. My questions are:
Is there a portable way to achieve this? Something that would work on other UNIXes (primarily Solaris) as well that do not have splice()?
Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?
Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count readable bytes from a file descriptor fd in the kernel (i.e., without copying anything to userspace), blocks on a blocking fd until the requested number of bytes is discarded, or returns the number of successfully discarded bytes or EAGAIN on a non-blocking fd, just like read() would do. And advances the seek position on a regular file, of course :)
The short answer is No, there is no portable way to do that.
The sendfile() approach is Linux-specific, because on most other OSes implementing it, the source must be a file or a shared memory object. (I haven't even checked whether, or in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)
Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.
However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.
You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, they would point out that the performance increase this hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things, is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)
In your first case, a process receiving structured data framed by a type and length identifier, wishing to skip unneeded frames, is better fixed by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., basic filtering approach). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require others to work extra too.)
In your second case, you already read your data. Don't do that. Instead, use a userspace buffer large enough to hold two full-size frames. Whenever you need more data, but the start of the frame is already past the midway point of the buffer, first memmove() the frame so that it starts at the beginning of the buffer.
When you have a partially read frame and N unread bytes of it left that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.
If the frames are small, say 65536 bytes maximum, you should use a tunable maximum buffer size. On most desktop and server machines with high-bandwidth stream sockets, something like 2 MiB (2097152 bytes) or more is much more reasonable. It's not too much memory wasted, but you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full cachelines are copied, aligned, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
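A minimal sketch of that buffering scheme (MAXFRAME and the names are assumptions on my part; parsing the frame itself is left out):

#include <string.h>
#include <unistd.h>

#define MAXFRAME 65536                  /* assumed maximum frame size */
#define BUFSIZE  (2 * MAXFRAME)         /* room for two full frames */

static char   buf[BUFSIZE];
static size_t head = 0;                 /* start of the current frame */
static size_t tail = 0;                 /* one past the last byte read */

/* Pull more bytes for the current frame, sliding the frame to the
 * front of the buffer first if it starts past the midpoint. */
static ssize_t fill(int fd)
{
    if (head > BUFSIZE / 2) {
        memmove(buf, buf + head, tail - head);
        tail -= head;
        head = 0;
    }
    ssize_t n = read(fd, buf + tail, BUFSIZE - tail);
    if (n > 0)
        tail += (size_t)n;
    return n;                           /* 0 on EOF, -1 on error */
}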
I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers for converting decimal integers or floating-point values are used for better performance), and this approach does work well in practice. Simply put, skipping data already in your buffer is not something you need to optimize; it is insignificant overhead compared to simply avoiding doing the things you do not need.
There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall clock time taken by the overall task. They are completely different things.
For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset to memory, construct an array of pointers to each line, sort the pointers, and finally write each line (using either internal buffering and/or POSIX writev() so that you do not need to do a write() syscall for each separate line).
However, if you wish to minimize the wall clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert-in-order each line completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting them using any robust sorting algorithm! The sorting algorithms that work inline (as data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that is otherwise wasted waiting for the entire dataset to load into memory, it is completed in less wall clock time!
If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme and the sort scheme) on their own. (I do consider using resources like Linux man pages online and Wikipedia articles and pseudocode on, for example, binary heaps to be doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)
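For what it's worth, a minimal sketch of the offline approach (read everything, sort an array of line pointers): error handling is omitted, and a strdup per line stands in for pointers into one big in-memory buffer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    char   line[4096];
    char **lines = NULL;
    size_t n = 0, cap = 0;

    while (fgets(line, sizeof line, stdin)) {  /* read the whole dataset */
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            lines = realloc(lines, cap * sizeof *lines);
        }
        lines[n++] = strdup(line);
    }
    qsort(lines, n, sizeof *lines, cmp);       /* sort pointers, not lines */
    for (size_t i = 0; i < n; i++)
        fputs(lines[i], stdout);
    return 0;
}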

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to initialize a hash table as an array of pointers to some sort of per-bin storage. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block-shared array, then grabbing more space as needed (i.e., reimplementing the STL vector's reserve operation), then having the last thread in each block copy the block-shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
"Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow."
You could increment bins of a global array instead of one entry at a time. In other words, you could have a large array, and each thread could start with 10 possible output entries. If a thread overflows, it requests the next available bin from the global array. If you're worried about contention on a single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one gets full, find another one.
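In plain C11 atomics, the chunk-grabbing idea looks roughly like this (on the GPU the counter would be a device-global value incremented with atomicAdd; CHUNK and all names here are illustrative):

#include <stdatomic.h>
#include <stddef.h>

#define CHUNK 10                        /* entries reserved per grab */

typedef struct { unsigned bin, key; } pair;

static pair           out[1u << 20];    /* large global output array */
static _Atomic size_t next_free;        /* single allocation counter */

/* Each thread reserves CHUNK output slots at once, so the shared
 * counter is touched once per CHUNK pairs instead of once per pair. */
static size_t grab_chunk(void)
{
    return atomic_fetch_add(&next_free, (size_t)CHUNK);
}

A thread writes its pairs into out[base .. base+CHUNK) and calls grab_chunk() again only when its current chunk is exhausted.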
"I'm also considering processing the data twice: the 1st time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again."
This is another valid method. The bottleneck is calculating each thread's offset into the global array once you have the total number of results per thread. I haven't figured out a reasonable parallel way to do that.
The last option I can think of would be to allocate a large array, distribute it based on blocks, and use a shared atomic int (which would help with slow global atomics). If you run out of space, mark that the block didn't finish, and record where it left off. On your next iteration, complete the work that hasn't been finished.
The downside of the distributed portions of global memory is, as talonmies said, that you need a gather or compaction pass to make the results dense.
Good luck!

Differentiating between queue full and queue empty

I am using an array with a write index and a read index to implement a straightforward FIFO queue. I do the usual MOD ArraySize when incrementing the write and read indexes.
Is there a way to differentiate between the queue-full and queue-empty conditions (wrIndex == rdIndex) without using an additional queue count, and also without wasting an array entry (i.e., queue is full if (wrIndex + 1) MOD ArraySize == rdIndex)?
I'd go with 'wasting' an array entry to detect the queue full condition, especially if you're dealing with different threads/tasks being producers and consumers. Having another flag keep track of that situation increases the locking necessary to keep things consistent and increases the likelihood of some sort of bug that introduces a race condition. This is even more true in the case where you can't use a critical section (as you mention in a comment) to ensure that things are in-sync.
You'll need at least a bit somewhere to keep track of that condition, and that probably means at least a byte. Assuming that your queue contains ints you're only saving 3 bytes of RAM and you're going to chew up several more bytes of program image (which might not be as precious, so that might not matter). If you keep a flag bit inside a byte used to store other flag bits, then you have to additionally deal with setting/testing/clearing that flag bit in a thread safe manner to ensure that the other bits don't get corrupted.
If you're queuing bytes, then you probably save nothing - you can consider the sentinel element to be the flag that you'd otherwise have to put somewhere else, and now you need no extra code to deal with the flag.
Consider carefully whether you really need that extra queue item, and keep in mind that if you're queuing bytes, the extra queue item probably isn't really extra space.
Instead of a read and a write index, you could use a read index and a queue count. From the queue count, you can easily tell if the queue is empty or full. And the write index can be computed as (read index + queue count) mod array_size.
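A sketch of that variant in C (sizes and names are illustrative):

#include <stdbool.h>
#include <stddef.h>

#define QSIZE 64

typedef struct {
    int    items[QSIZE];
    size_t rd;      /* read index */
    size_t count;   /* number of queued items */
} queue;

static bool q_empty(const queue *q) { return q->count == 0; }
static bool q_full (const queue *q) { return q->count == QSIZE; }

static bool q_put(queue *q, int v)
{
    if (q_full(q)) return false;
    q->items[(q->rd + q->count) % QSIZE] = v;  /* derived write index */
    q->count++;
    return true;
}

static bool q_get(queue *q, int *v)
{
    if (q_empty(q)) return false;
    *v = q->items[q->rd];
    q->rd = (q->rd + 1) % QSIZE;
    q->count--;
    return true;
}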
What's wrong with a queue count? It sounds like you're going for maximum efficiency and minimal logic, and while I would do the same, I think I'd still use a queue count variable. Otherwise, one other potential solution would be to use a linked list. Low memory usage, and removing first element would be easy, just make sure that you have pointers to the head and tail of the list.
Basically you only need a single additional bit somewhere to signal that the queue is currently empty. You can probably stash that away somewhere, e.g., in the most significant bit of one of your indices (and then AND off the lower bits creatively in places where you need to work only on the actual index into your array).
But honestly, I'd go with a queue count first and only cut that if I really need that space, instead of putting up with bit fiddling.

Inspecting C pipelines passing through a program -- border cases

I'm receiving from socket A and writing that to socket B on the fly (like a proxy server might). I would like to inspect and possibly modify the data passing through. My question is how to handle border cases, i.e., where the regular expression I'm searching for would match across two successive socket-A-read and socket-B-write iterations.
char buffer[4096];
int socket_A, socket_B;
/* Setting up the connection goes here */
for (;;) {
    ssize_t n = recv(socket_A, buffer, sizeof buffer, 0);
    /* Inspect, and possibly modify buffer */
    send(socket_B, buffer, n, 0);
    /* Oops, the matches I was looking for were at the end of buffer,
     * and will be at the beginning of buffer next iteration :( */
}
My suggestion: have two buffers, and rotate between them:
1. Recv buffer 1
2. Recv buffer 2
3. Process.
4. Send buffer 1
5. Recv buffer 1
6. Process, but with buffer 2 before buffer 1.
7. Send buffer 2
8. Goto 2.
Or something like that?
Assuming you know the maximum length M of the possible regular expression matches (or can live with an arbitrary value, or just use the whole buffer), you could handle it by not passing on the full buffer but keeping the last M-1 bytes back. In the next iteration, put the newly received data after those M-1 bytes and apply the regular expression to the combination.
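A sketch of that hold-back scheme (M, CHUNKSZ, and the forward callback are assumptions; applying the regular expression itself is elided):

#include <string.h>

#define M        64        /* assumed upper bound on a match length */
#define CHUNKSZ  4096      /* max bytes per call, as in the recv loop above */

static char   held[M - 1]; /* tail bytes retained from the last call */
static size_t nheld = 0;

/* Prepend the held-back tail to each new chunk (n <= CHUNKSZ), scan and
 * possibly modify, forward everything except the last M-1 bytes, and
 * hold those back for the next call. */
static void inspect_and_forward(const char *in, size_t n,
                                void (*forward)(const char *, size_t))
{
    static char work[M - 1 + CHUNKSZ];
    memcpy(work, held, nheld);
    memcpy(work + nheld, in, n);
    size_t total = nheld + n;

    /* ... apply the regular expression to work[0 .. total) here ... */

    size_t keep = total < M - 1 ? total : M - 1;
    forward(work, total - keep);               /* pass on all but the tail */
    memcpy(held, work + (total - keep), keep);
    nheld = keep;
}

Note that the held-back tail still has to be flushed at the end of the stream, which is the point the next paragraph addresses.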
If you know the format of the data transmitted (e.g., HTTP), you should be able to parse the contents to know when you have reached the end of the communication and should send out the trailing bytes you may have cached. If you do not know the format, then you'd need to implement a timeout in the recv so that you do not hold on to the end of the communication for too long. What counts as too long is something that you will have to decide on your own.
You need to know and/or say something about your regular expression.
Depending on the regular expression, you might need to buffer a lot more than you are buffering now.
A worst case scenario might be something like a regular expression which says, "find everything, starting from the beginning up until the first occurrence of the word 'dog', and replace that with something else": if you have a regular expression like that, then you need to buffer (without forwarding) everything from the beginning until the first occurrence of the word 'dog', which might never happen, i.e. might be an infinite amount to buffer.
In the sense you're talking about (and in all senses for, say, TCP), sockets are streams. It follows from your question that you have some structure in the data. So you must do something similar to the following:
Buffer (hold) incoming data until a boundary is reached. The boundary might be end-of-line, end-of-record, or any other way that you know that your regex will match.
When a "record" is ready, process it and place the results in an output buffer.
Write anything accumulated in the output buffer.
That handles most cases. If you have one of the rare cases where there's really no "record" then you have to build some sort of state machine (DFA). By this I mean you must be able to accumulate data until either a) it can't possibly match your regex, or b) it's a completed match.
EDIT:
If you're matching fixed strings instead of a true regex then you should be able to use the Boyer-Moore algorithm, which can actually run in sub-linear time (by skipping characters). If you do it right, as you move over the input you can throw previously seen data to the output buffer as you go, decreasing latency and increasing throughput significantly.
Basically, the problem with your code is that the recv/send loop is operating on a lower network layer than your modifications. How you solve this problem depends on what modifications you're making, but it probably involves buffering data until all local modifications can be made.
EDIT: I don't know of any regex library that can filter a stream like that. How hard this is going to be will depend on your regex and the protocol it's filtering.
One alternative is to use a poll(2)-like strategy with non-blocking sockets. On a read event, grab a buffer from the socket, push it onto the incoming queue, and call the lexer/parser/matcher that assembles the buffers into a stream, then pushes chunks onto the output queue. On a write event, take a chunk from the output queue, if any, and write it into the socket. This sounds kind of complicated, but it's not really, once you get used to the inverted control model.
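A skeleton of that event loop (enqueue_incoming and next_outgoing are hypothetical stand-ins for the lexer/parser/matcher and the output queue; error paths are abbreviated):

#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical buffering/matching layer from the description above. */
void        enqueue_incoming(const char *p, size_t n);
const char *next_outgoing(size_t *n);

void proxy_loop(int in_fd, int out_fd)
{
    struct pollfd fds[2] = {
        { .fd = in_fd,  .events = POLLIN  },
        { .fd = out_fd, .events = POLLOUT },
    };
    char buf[4096];

    for (;;) {
        /* Real code would only poll POLLOUT while output is pending. */
        if (poll(fds, 2, -1) < 0)
            break;
        if (fds[0].revents & POLLIN) {
            ssize_t n = read(in_fd, buf, sizeof buf);
            if (n <= 0)
                break;
            enqueue_incoming(buf, (size_t)n);   /* feeds the matcher */
        }
        if (fds[1].revents & POLLOUT) {
            size_t n;
            const char *chunk = next_outgoing(&n);
            if (chunk)
                write(out_fd, chunk, n);        /* short writes ignored here */
        }
    }
}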
