I have found pseudo code on how to implement a circular buffer.
// Producer.
while (true) {
    /* Produce item v. */
    while ((in + 1) % n == out)
        /* Wait. */;
    b[in] = v;
    in = (in + 1) % n;
}
// Consumer.
while (true) {
    while (in == out)
        /* Wait. */;
    w = b[out];
    out = (out + 1) % n;
    /* Consume item w. */
}
What I don't understand is the "Consume item w." comment, because I think that with w = b[out]; we are consuming w, aren't we?

w = b[out];
This only grabs a copy of the item to be consumed. With
out = (out + 1) % n;
you advance the index of the item to be consumed, thereby preventing it from being referenced again.
Put another way, w = b[out]; on its own does not consume the buffer's slot, however many times it runs; it only reads it. It is out = (out + 1) % n; that prevents further access to that item. Preventing further access to the buffer item is the strongest definition of "consume the item" that I can think of.

These two lines are both part of the consuming process:
w = b[out];
out = (out + 1) % n;
The first extracts the value and the second increments the out index.
The comment refers to the two preceding lines.

Yes, because at that point the item is out of the buffer: the following line marks its slot as free.
Then we can process w.


Producer-consumer algorithm to use full buffer

I was reading Galvin's OS book about the producer-consumer problem and came across this piece of code.
Global definitions
#define BUFFER_SIZE 10
typedef struct {
. . .
} item;
int in = 0;
int out = 0;
while (((in + 1) % BUFFER_SIZE) == out)
; /* do nothing */
buffer[in] = next_produced;
in = (in + 1) % BUFFER_SIZE ;
while (in == out)
; /* do nothing */
next_consumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
Now this is what Galvin book says:
This scheme allows at most BUFFER_SIZE − 1 items in the buffer at the
same time. We leave it as an exercise for you to provide a solution in which
BUFFER_SIZE items can be in the buffer at the same time.
This is what I came up with. Is this correct?
buffer[in] = next_produced; //JUST MOVED THIS LINE!
while (((in + 1) % BUFFER_SIZE ) == out)
; /* do nothing */
in = (in + 1) % BUFFER_SIZE;
while (in == out)
; /* do nothing */
next_consumed = buffer[out];
out = (out + 1) % BUFFER_SIZE;
I think this solves it, but is it correct? Is a better solution possible?
In the original code, in == out could mean the buffer is empty OR full. To avoid that ambiguity, the original code never lets the buffer fill completely; it always leaves at least one slot empty.
I am not sure your change solves the problem: you will be able to put BUFFER_SIZE items into the buffer, but you will not be able to consume them. So you have literally solved the exercise, but the code will not function properly.
Basically, to solve this problem you need an extra piece of information so you can distinguish an empty buffer from a full one. There are a variety of solutions; the most obvious is to add an extra flag.
The most elegant, IMO, is to use the in and out counters as they are and wrap them only when accessing the buffer, so:
when in == out -- the buffer is empty
when abs(in - out) == BUFFER_SIZE -- the buffer is full (with unsigned counters, in - out == BUFFER_SIZE)
to access the buffer we should use buffer[in % BUFFER_SIZE] or buffer[out % BUFFER_SIZE]
We leave it as an exercise for you to provide a complete solution ;)
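For readers who want to check their own answer, here is one possible sketch of the free-running-counter idea, not the book's solution. The names ring_t, ring_put and ring_get are made up for this example, the element type is int rather than item, the functions return false instead of busy-waiting, and BUFFER_SIZE is assumed to be a power of two so that unsigned wraparound of the counters stays consistent with the modulo. It is single-threaded; a real producer/consumer pair would also need proper synchronization or atomics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: a power of two, so that uint32_t overflow of the counters
   does not disturb the value of (counter % BUFFER_SIZE). */
#define BUFFER_SIZE 8u

typedef struct {
    int      data[BUFFER_SIZE];
    uint32_t in;   /* total number of items ever produced, never wrapped by hand */
    uint32_t out;  /* total number of items ever consumed */
} ring_t;

static bool ring_is_empty(const ring_t *r) { return r->in == r->out; }

/* Unsigned subtraction gives the fill level even after the counters overflow. */
static bool ring_is_full(const ring_t *r) { return r->in - r->out == BUFFER_SIZE; }

/* Store v; returns false (instead of spinning) when all slots are taken. */
static bool ring_put(ring_t *r, int v) {
    assert(r->in - r->out <= BUFFER_SIZE);  /* invariant of the scheme */
    if (ring_is_full(r)) return false;
    r->data[r->in % BUFFER_SIZE] = v;       /* wrap only when indexing */
    r->in++;
    return true;
}

/* Fetch the oldest item into *v; returns false when the buffer is empty. */
static bool ring_get(ring_t *r, int *v) {
    if (ring_is_empty(r)) return false;
    *v = r->data[r->out % BUFFER_SIZE];
    r->out++;
    return true;
}
```

With this layout all BUFFER_SIZE slots can hold items at once, because "full" and "empty" are told apart by the difference of the counters rather than by their wrapped positions.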

Efficiently find a sequence within a buffer

So I have a buffer that I am filling with a frame that has a maximum of 1200 bytes and is variably sized. I know the frame is complete when I get a tail sequence that is always the same and doesn't occur otherwise. So I am trying to find how to most efficiently detect that tail sequence. This is embedded so ideally the fewer function calls and data structures I use the better.
Here is what I have thus far:
//I am reading off of a circular buffer so this is checking that I still have unread bytes
while (cbuf_last_written_index != cbuf_last_read_index) {
buffer[frame_size] = circular_buffer[cbuf_last_read_index];
//this function does exactly what it says and just maintains circular buffer correctness
//TODO need to make this more efficient
int i;
uint8_t sync_test_array[TAIL_SYNC_LENGTH] = {0};
//this just makes sure I have enough in the frame to even bother checking the tail seq
if (frame_size > TAIL_SYNC_LENGTH) {
for (i = 0; i < TAIL_SYNC_LENGTH; i++) {
//sets the test array equal to the last TAIL_SYNC_LENGTH elements the buffer
sync_test_array[i] = buffer[(frame_size - TAIL_SYNC_LENGTH) + i];
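//note: == on two arrays compares addresses, not contents, so compare the bytes with memcmp (needs <string.h>)
if (memcmp(sync_test_array, tail_sequence_array, TAIL_SYNC_LENGTH) == 0) {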
//I will toggle a pin here to notify that the frame is complete
//get out of the while loop because the following bytes are part of the next frame
//end efficiency needed area
So basically for each new byte that is added to the frame I am checking the last x bytes (will probably actually be ~8) to see if they are the tail sequence. Can you think of a better way to do this?
Implement it as a state machine.
If your tail sequence is 1,2,5, the pseudocode would be:
switch (current_state) {
IDLE: next_state = ONE_SEEN if new_byte == 1 else next_state = IDLE
ONE_SEEN: next_state = TWO_SEEN if new_byte == 2 else next_state = IDLE
TWO_SEEN: next_state = TERMINATE if new_byte == 5 else next_state = IDLE
}
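In C, the same state machine can be collapsed into a counter of how many tail bytes have matched so far. The sketch below is only an illustration: tail_detector_t and tail_feed are invented names, and the tail 1,2,5 is the example from the pseudocode, not the real sequence. It differs from the pseudocode in one detail: on a mismatch it checks whether the new byte could itself start a tail, so that 1,1,2,5 is still detected. That simple fallback is only enough when the first tail byte does not occur again later in the tail; a tail with internal repetition would need a KMP-style fallback table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Example tail from the pseudocode; the real sequence is an assumption. */
static const uint8_t TAIL[] = {1, 2, 5};
#define TAIL_LEN (sizeof TAIL / sizeof TAIL[0])

typedef struct {
    size_t matched;  /* how many tail bytes have matched so far (the state) */
} tail_detector_t;

/* Feed one received byte; returns true when it completes the tail. */
static bool tail_feed(tail_detector_t *d, uint8_t byte) {
    assert(d->matched < TAIL_LEN);
    if (byte == TAIL[d->matched]) {
        d->matched++;                             /* advance to the next state */
    } else {
        d->matched = (byte == TAIL[0]) ? 1 : 0;   /* restart, possibly on this byte */
    }
    if (d->matched == TAIL_LEN) {
        d->matched = 0;                           /* ready for the next frame */
        return true;
    }
    return false;
}
```

Each incoming byte costs one comparison and no copying, so tail_feed can be called right where the byte is taken off the circular buffer.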

How is this tcp socket code handling the rx buffer?

I came across this tcp server example, provided with the Altera Nios II processor, and I'm not getting the section on handling the rx_buffer.
typedef struct SSS_SOCKET {
enum {
} state;
int fd;
int close;
INT8U rx_buffer[SSS_RX_BUF_SIZE];
INT8U *rx_rd_pos; /* position we've read up to */
INT8U *rx_wr_pos; /* position we've written up to */
} SSSConn;
int data_used = 0, rx_code = 0;
INT8U *lf_addr;
conn->rx_rd_pos = conn->rx_buffer;
conn->rx_wr_pos = conn->rx_buffer;
printf("[sss_handle_receive] processing RX data\n");
while (conn->state != CLOSE) {
/* Find the Carriage return which marks the end of the header */
lf_addr = strchr(conn->rx_buffer, '\n');
if (lf_addr) {
/* go off and do whatever the user wanted us to do */
/* No newline received? Then ask the socket for data */
else {
rx_code = recv(conn->fd, conn->rx_wr_pos,
SSS_RX_BUF_SIZE - (conn->rx_wr_pos - conn->rx_buffer) -1, 0);
if (rx_code > 0) {
conn->rx_wr_pos += rx_code;
/* Zero terminate so we can use string functions */
*(conn->rx_wr_pos + 1) = 0;
* When the quit command is received, update our connection state so that
* we can exit the while() loop and close the connection
conn->state = conn->close ? CLOSE : READY;
/* Manage buffer */
data_used = conn->rx_rd_pos - conn->rx_buffer;
memmove(conn->rx_buffer, conn->rx_rd_pos,
conn->rx_wr_pos - conn->rx_rd_pos);
conn->rx_rd_pos = conn->rx_buffer;
conn->rx_wr_pos -= data_used;
memset(conn->rx_wr_pos, 0, data_used);
Specifically, I don't see the purpose of the data_used variable. rx_rd_pos is pointing to rx_buffer and there doesn't appear to be an operation on either, so how will they be different? In fact, the only thing that seems to happen under Manage buffer is the copying of data into rx_buffer. I'm sure I'm missing something simple, but I can't seem to see it.
Thanks for any help in advance.
Edit: Here's the sss_exec_command() function.
void sss_exec_command(SSSConn* conn) {
int bytes_to_process = conn->rx_wr_pos - conn->rx_rd_pos;
INT8U *tx_wr_pos = tx_buf;
INT8U error_code;
* "SSSCommand" is declared static so that the data will reside
* in the BSS segment. This is done because a pointer to the data in
* SSSCommand
* will be passed via SSSLedCommandQ to the LEDManagementTask.
* Therefore SSSCommand cannot be placed on the stack of the
* SSSSimpleSocketServerTask, since the LEDManagementTask does not
* have access to the stack of the SSSSimpleSocketServerTask.
static INT32U SSSCommand;
while (bytes_to_process--) {
SSSCommand = toupper(*(conn->rx_rd_pos++));
if (SSSCommand >= ' ' && SSSCommand <= '~') {
tx_wr_pos += sprintf(tx_wr_pos,
"--> Simple Socket Server Command %c.\n",
(char) SSSCommand);
if (SSSCommand == CMD_QUIT) {
tx_wr_pos += sprintf(tx_wr_pos,
"Terminating connection.\n\n\r");
conn->close = 1;
} else {
error_code = OSQPost(SSSLEDCommandQ, (void *) SSSCommand);
alt_SSSErrorHandler(error_code, 0);
send(conn->fd, tx_buf, tx_wr_pos - tx_buf, 0);
Answers below are correct. I missed the pointer arithmetic on rx_rd in the command function :P
That section removes data from the buffer once it has been processed. The code you posted never uses the data stored in the buffer, but the sss_exec_command function will, after a newline is received. That function is passed the connection, so it can advance the read position by however much it consumes.
After data is used, the buffer management section reclaims the space. The amount of data left in the buffer is the difference between the write and read positions. This much data is moved from the read position to the start of the buffer, then the read and write pointers are updated to their new positions. The read position is set to the start of the buffer, and the write position is decremented by data_used, which is the original difference between the start of the buffer and the read pointer, i.e. the amount of data used.
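To make the pointer bookkeeping concrete, here is a stripped-down sketch of just the compaction step. It is not the Altera code: rx_t, rx_compact and RX_BUF_SIZE are stand-in names chosen for this example, and the socket handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RX_BUF_SIZE 32  /* hypothetical size, only for this sketch */

typedef struct {
    unsigned char  buf[RX_BUF_SIZE];
    unsigned char *rd;  /* next unprocessed byte */
    unsigned char *wr;  /* one past the last received byte */
} rx_t;

/* Slide the unprocessed bytes to the front and zero the reclaimed space. */
static void rx_compact(rx_t *rx) {
    assert(rx->buf <= rx->rd && rx->rd <= rx->wr);
    assert(rx->wr <= rx->buf + RX_BUF_SIZE);

    size_t data_used = (size_t)(rx->rd - rx->buf);        /* bytes already consumed */
    memmove(rx->buf, rx->rd, (size_t)(rx->wr - rx->rd));  /* keep what is left */
    rx->rd  = rx->buf;                                    /* leftover now starts at 0 */
    rx->wr -= data_used;                                  /* same length, shifted down */
    memset(rx->wr, 0, data_used);                         /* clear the freed bytes */
}
```

If the buffer holds "AB\nCD" and the command handler has advanced rd past the newline, data_used is 3; after the call the buffer starts with "CD", rd points at the start, and wr points two bytes in.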
Assuming the code actually works, data_used = conn->rx_rd_pos - conn->rx_buffer implies that rx_rd_pos is being changed somewhere, presumably when the code consumes the data written into the buffer (it is written at rx_wr_pos and consumed from rx_rd_pos). That would mean sss_exec_command(conn) is adjusting conn. Is that the case?

sending c struct via MPI fails partially

I am sending a (particle) struct using the MPI_Type_create_struct() as done e.g. here, or explained in detail here.
I'm collecting all particles which are going to a specific proc, memcpy() them into the send buffer and MPI_Isend() them.
So far, so good. MPI_Iprobe()'ing for the message gives me the right count of particles sent.
So I MPI_Recv() the buffer and extract the data (now even by copying the struct one by one). No matter how many particles I send, only the first particle's data is correct.
There are three possible mistakes:
The MPI_Type_create_struct() doesn't create a proper map of my struct, due to my usage of offsetof() like in the first link. Maybe my struct contains invisible padding, as explained in the second link.
I'm making some simple mistake while copying particles into the send buffer and back out of the receive buffer (I do print the send buffer, and it works, but maybe I'm overlooking something).
Something totally different.
(Sorry for the really ugly presentation of the code, I could not manage to present it in a decent way. You'll find the code here - the line is already marked - on Github, too!)
Here is the construction of the MPI datatype,
typedef struct {
int ID;
double x[DIM];
} pchase_particle_t;
const int items = 2;
int block_lengths[2] = {1, DIM};
MPI_Datatype mpi_types[2] = {MPI_INT, MPI_DOUBLE};
MPI_Aint offsets[2];
offsets[0] = offsetof(pchase_particle_t, ID);
offsets[1] = offsetof(pchase_particle_t, x);
MPI_Type_create_struct(items, block_lengths, offsets, mpi_types, &W->MPI_Particle);
the sending
/* handle all mpi send/recv status data */
MPI_Request *send_request = P4EST_ALLOC(MPI_Request, W->p4est->mpisize);
MPI_Status *recv_status = P4EST_ALLOC(MPI_Status, W->p4est->mpisize);
/* setup send/recv buffers */
pchase_particle_t **recv_buf = P4EST_ALLOC(pchase_particle_t *, num_senders);
pchase_particle_t **send_buf = P4EST_ALLOC(pchase_particle_t *, num_receivers);
int recv_count = 0, recv_length, flag, j;
/* send all particles to their belonging procs */
for (i = 0; i < num_receivers; i++) {
/* resolve particle list for proc i */
sc_list_t *tmpList = *((sc_list_t **) sc_array_index(W->particles_to, receivers[i]));
pchase_particle_t * tmpParticle;
int send_count = 0;
/* get space for the particles to be sent */
send_buf[i] = P4EST_ALLOC(pchase_particle_t, tmpList->elem_count);
/* copy all particles into the send buffer and remove them from this proc */
while(tmpList->first != NULL){
tmpParticle = sc_list_pop(tmpList);
memcpy(send_buf[i] + send_count * sizeof(pchase_particle_t), tmpParticle, sizeof(pchase_particle_t));
/* free particle */
/* update particle counter */
/* print send buffer */
for (j = 0; j < send_count; j++) {
pchase_particle_t *tmpParticle = send_buf[i] + j * sizeof(pchase_particle_t);
printf("[pchase %i sending] particle[%i](%lf,%lf)\n", W->p4est->mpirank, tmpParticle->ID, tmpParticle->x[0], tmpParticle->x[1]);
printf("[pchase %i sending] particle count: %i\n", W->p4est->mpirank, send_count);
/* send particles to right owner */
mpiret = MPI_Isend(send_buf[i], send_count, W->MPI_Particle, receivers[i], 13, W->p4est->mpicomm, &send_request[i]);
and the receiving.
recv_count = 0;
/* check for messages until all arrived */
while (recv_count < num_senders) {
/* probe if any of the sender has already sent his message */
for (i = 0; i < num_senders; i++) {
MPI_Iprobe(senders[i], MPI_ANY_TAG, W->p4est->mpicomm,
&flag, &recv_status[i]);
if (flag) {
/* resolve number of particles receiving */
MPI_Get_count(&recv_status[i], W->MPI_Particle, &recv_length);
printf("[pchase %i receiving message] %i particles arrived from sender %i with tag %i\n",
W->p4est->mpirank, recv_length, recv_status[i].MPI_SOURCE, recv_status[i].MPI_TAG);
/* get space for the particles to be sent */
recv_buf[recv_count] = P4EST_ALLOC(pchase_particle_t, recv_length);
/* receive a list with recv_length particles */
mpiret = MPI_Recv(recv_buf[recv_count], recv_length, W->MPI_Particle, recv_status[i].MPI_SOURCE,
recv_status[i].MPI_TAG, W->p4est->mpicomm, &recv_status[i]);
* insert all received particles into the
* push list
pchase_particle_t *tmpParticle;
for (j = 0; j < recv_length; j++) {
* retrieve all particle details from
* recv_buf
tmpParticle = recv_buf[recv_count] + j * sizeof(pchase_particle_t);
pchase_particle_t *addParticle = P4EST_ALLOC(pchase_particle_t,1);
addParticle->x[0] = tmpParticle->x[0];
addParticle->x[1] = tmpParticle->x[1];
printf("[pchase %i receiving] particle[%i](%lf,%lf)\n",
W->p4est->mpirank, addParticle->ID, addParticle->x[0], addParticle->x[1]);
/* push received particle to push list and update world counter */
sc_list_append(W->particle_push_list, addParticle);
/* we received another particle list */
edit: reindented..
edit: "Only the first particle's data is correct" means that all its properties (ID and coordinates) are identical to those of the sent particle. The others, however, are initialized with zeros, i.e. ID=0, x[0]=0.0, x[1]=0.0. Maybe that's a hint for the solution.
There is an error in your pointer arithmetic. send_buf[i] is already of type pchase_particle_t * and therefore send_buf[i] + j * sizeof(pchase_particle_t) does not point to the j-th element of the i-th buffer but rather to the j * sizeof(pchase_particle_t)-th element. Thus your particles are not stored contiguously in memory but rather separated by sizeof(pchase_particle_t) - 1 empty array elements. These get sent instead of the correct particles because the MPI_Send call accesses buffer memory contiguously. The same applies to the code of the receiver.
You do not see the error in the sender code because your debug print uses the same incorrect pointer arithmetic and hence accesses memory using the same stride. I guess your send counts are small and you get memory allocated on the data segment heap, otherwise you should have received SIGSEGV for out-of-bound array access very early in the data packing process (e.g. in the memcpy part).
Resolution: do not multiply the array index by sizeof(pchase_particle_t).
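A small standalone sketch of the corrected indexing, without any MPI calls. The struct is copied from the question; DIM = 2 is an assumption (the question prints x[0] and x[1]), and pack_particle and byte_offset are helper names invented here to show that the compiler already scales the index by the element size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DIM 2  /* assumed; the question only shows x[0] and x[1] */

typedef struct {
    int    ID;
    double x[DIM];
} pchase_particle_t;

/* Copy one particle into slot `index` of a contiguous buffer. */
static void pack_particle(pchase_particle_t *buf, size_t index,
                          const pchase_particle_t *p) {
    assert(buf != NULL && p != NULL);
    /* buf + index already advances by index * sizeof(pchase_particle_t) bytes */
    memcpy(buf + index, p, sizeof *p);
}

/* Distance in bytes between buf and buf + j, to make the scaling visible. */
static size_t byte_offset(const pchase_particle_t *buf, size_t j) {
    return (size_t)((const char *)(buf + j) - (const char *)buf);
}
```

In the question's loops this corresponds to writing send_buf[i] + send_count (or &send_buf[i][send_count]) on the sending side and recv_buf[recv_count] + j on the receiving side.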

OpenCL transpose kernel how is get_local_id being used

Code taken from a sample. I created a project with it and it works, but I don't understand some parts.
For the sake of the example, say I have a 32x32 matrix, there are 36 work items and so get_global_id(0) goes from 0 -> 35 I presume, and size = MATRIX_DIM/4 = 8.
__kernel void transpose(__global float4 *g_mat,
__local float4 *l_mat, uint size) {
__global float4 *src, *dst;
/* Determine row and column location */
int col = get_global_id(0);
int row = 0;
while(col >= size) {
col -= size--;
col += row;
size += row;
/* Read source block into local memory */
src = g_mat + row * size * 4 + col;
l_mat += get_local_id(0)*8;
In the clEnqueueNDRangeKernel call, the arg local_work_size was set to NULL which according to the manual means let the compiler or something figure it out:
local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.
But I don't understand the multiply by 8, which gives an address offset into local memory for the work group I suppose. Can someone please explain this?
l_mat[0] = src[0];
l_mat[1] = src[size];
l_mat[2] = src[2*size];
l_mat[3] = src[3*size];
/* Process block on diagonal */
if(row == col) {
src[0] =
(float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
src[size] =
(float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
src[2*size] =
(float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
src[3*size] =
(float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);
/* Process block off diagonal */
else {
/* Read destination block into local memory */
dst = g_mat + col * size * 4 + row;
l_mat[4] = dst[0];
l_mat[5] = dst[size];
l_mat[6] = dst[2*size];
l_mat[7] = dst[3*size];
/* Set elements of destination block */
dst[0] =
(float4)(l_mat[0].x, l_mat[1].x, l_mat[2].x, l_mat[3].x);
dst[size] =
(float4)(l_mat[0].y, l_mat[1].y, l_mat[2].y, l_mat[3].y);
dst[2*size] =
(float4)(l_mat[0].z, l_mat[1].z, l_mat[2].z, l_mat[3].z);
dst[3*size] =
(float4)(l_mat[0].w, l_mat[1].w, l_mat[2].w, l_mat[3].w);
/* Set elements of source block */
src[0] =
(float4)(l_mat[4].x, l_mat[5].x, l_mat[6].x, l_mat[7].x);
src[size] =
(float4)(l_mat[4].y, l_mat[5].y, l_mat[6].y, l_mat[7].y);
src[2*size] =
(float4)(l_mat[4].z, l_mat[5].z, l_mat[6].z, l_mat[7].z);
src[3*size] =
(float4)(l_mat[4].w, l_mat[5].w, l_mat[6].w, l_mat[7].w);
l_mat is being used as a local store for the threads in a work-group. Specifically, it is used because accesses to local memory are typically much faster than accesses to global memory.
Each thread needs 8 float4s. Doing the following pointer arithmetic
l_mat += get_local_id(0)*8;
moves the l_mat pointer for each thread so that it doesn't overlap with other threads' data.
This could cause an error: because local_work_size wasn't specified, we cannot guarantee that l_mat is large enough to hold the values for every thread in the work-group.
l_mat is used as a temporary buffer that holds, for every work-item, the two matrix blocks to be transposed and swapped.
So for each work-item it needs to store 2 * 4 float4s, hence : offset = get_local_id(0)*2*4 = get_local_id(0)*8.
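The index arithmetic can be tried on the host in plain C. The sketch below is not the kernel itself: block_for_id and local_offset are names invented here, and it assumes the while loop increments row on every pass, which the excerpt in the question does not show. Under that assumption, 36 work-items with size = 8 map one-to-one onto the blocks on and above the diagonal of an 8x8 grid of 4x4 blocks.

```c
#include <assert.h>

typedef struct { int row, col; } block_t;

/* Map a linear work-item id to a block (row, col) with row <= col,
   using the same arithmetic as the kernel's while loop.
   Assumption: row is incremented once per loop pass. */
static block_t block_for_id(int global_id, int size) {
    assert(global_id >= 0 && global_id < size * (size + 1) / 2);
    int col = global_id;
    int row = 0;
    while (col >= size) {
        col -= size--;   /* skip one row of the triangle; each row is one block shorter */
        row++;
    }
    col += row;          /* row r starts at the diagonal block (r, r) */
    block_t b = { row, col };
    return b;
}

/* Offset, in float4 elements, of one work-item's scratch area in l_mat:
   4 float4 rows for the source block plus 4 for the destination block. */
static int local_offset(int local_id) {
    return local_id * 8;
}
```

For example, ids 0 to 7 give blocks (0,0) to (0,7), id 8 gives the diagonal block (1,1), and id 35 gives (7,7). Since each work-item uses 8 float4 slots, the host has to size the local buffer as 8 * sizeof(cl_float4) times the work-group size, which is exactly what cannot be guaranteed when local_work_size is left as NULL.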
