I am trying to set up simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set to true by a CPU/host thread during ongoing computation. The GPU/kernel threads should then exit.
This is what I tried. The code is simplified: I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).
On the host side, global definition (.cpp):
// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;
On the host side in the compute thread (.cpp):
initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);
On the host side in a main thread (.cpp):
cancel(cancelRequested); // Called after init is finished
Host routines (.cu file):
void initialize(volatile bool** pCancelRequested)
{
    cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
    const bool aFalse = false;
    cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested)
{
    ....
    computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
    cudaDeviceSynchronize(); // Non-busy wait
    ....
}

void finalize(volatile bool** pCancelRequested)
{
    cudaFree(*const_cast<bool**>(pCancelRequested));
    *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
    const bool aTrue = true;
    cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}
Device routines (.cu file):
__global__ void computeKernel(volatile bool* pCancelRequested)
{
    while (someCondition)
    {
        // Computation step here
        if (*pCancelRequested)
        {
            printf("-> Cancel requested!\n");
            return;
        }
    }
}
The code runs fine, but it never enters the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb, i.e. writing to the global flag works fine, at least from the host's point of view. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.
Any idea why this doesn't work?
The fundamental problem I see with your approach is that CUDA streams will prevent it from working.
CUDA streams have two basic principles:
Items issued into the same stream will not overlap; they will serialize.
Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.
Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.
I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series.
Because of CUDA streams, this:
computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
and this:
cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
could not possibly overlap (they are being issued into the same "default" CUDA stream, so rule 1 above says they cannot overlap). But overlap is essential if you want to "signal" the running kernel: we must allow the cudaMemcpy operation to take place at the same time that the kernel is running.
We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:
$ cat t2184.cu
#include <iostream>
#include <unistd.h>
__global__ void k(volatile int *flag){
  while (*flag != 0);
}

int main(){
  int *flag, *h_flag = new int;
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
  k<<<32, 256, 0, s[0]>>>(flag);
  sleep(5);
  *h_flag = 0;
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}
$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$
NOTES:
Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out the line *h_flag = 0; then the program will hang, indicating that the flag signalling method is working correctly.
Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access; the compiler is not allowed to make optimizations that would prevent a memory read or write from occurring at the expected location.
This kind of host->device signal behavior can also be realized without explicit use of streams, by using host pinned memory as the signalling location, since it is "visible" to both host and device code "simultaneously". Here is an example:
#include <iostream>
#include <unistd.h>
__global__ void k(volatile int *flag){
  while (*flag != 0);
}

int main(){
  int *flag;
  cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k<<<32, 256>>>(flag);
  sleep(5);
  *flag = 0;
  cudaDeviceSynchronize();
}
For other examples of signalling, such as from device to host, other readers may be interested in this.
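Applied to the structure in the question, the same fix would look roughly like the sketch below: the kernel and the cancel copy each go into their own created stream so that they can overlap. The computeStream/cancelStream names (and where they are created and destroyed) are assumptions on my part, not part of the original code.

// Sketch only: the streams would be created once (e.g. in initialize())
// and destroyed in finalize(); names are illustrative.
cudaStream_t computeStream, cancelStream;

void compute(volatile bool* pCancelRequested)
{
    computeKernel<<<pBlocksPerGPU, aThreadsPerBlock, 0, computeStream>>>(pCancelRequested);
    cudaDeviceSynchronize(); // Non-busy wait
}

void cancel(volatile bool* pCancelRequested)
{
    const bool aTrue = true;
    // Issued into a different created stream than the kernel, so it is not
    // queued behind the kernel and can land while the kernel is still running.
    cudaMemcpyAsync(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool),
                    cudaMemcpyHostToDevice, cancelStream);
    cudaStreamSynchronize(cancelStream); // keep &aTrue valid until the copy is done
}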
I have found discussions on the internet of using callbacks with AIO (asynchronous I/O). However, what I have found has left me confused. An example is listed below, from a site on Linux AIO; in this code, AIO is being used to read in the contents of a file.
My problem is that it seems to me that code which actually processes the contents of that file must have some point where execution is blocked until the read is completed. This code has no such block at all. I was expecting to see some kind of call analogous to pthread_mutex_lock in pthread programming. I suppose I could put in a dummy loop after the aio_read() call that would block execution until the read is completed. But that puts me right back to the simplest way of blocking the execution, and then I don't see what is gained by all the coding overhead that goes into establishing a callback. I am obviously missing something. Could someone tell me what it is?
Here is the code. (BTW, the original is in C++; I have adapted it to C.)
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <aio.h>
//#include <bits/stdc++.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
const int BUFSIZE = 1024;
void aio_completion_handler(sigval_t sigval)
{
    struct aiocb *req;
    req = (struct aiocb *)sigval.sival_ptr; // Pay attention here.

    /* Check again whether the asynchronous request has completed. */
    if (aio_error(req) == 0)
    {
        int ret = aio_return(req);
        printf("ret == %d\n", ret);
        printf("%s\n", (char *)req->aio_buf);
    }
    close(req->aio_fildes);
    free((void *)req->aio_buf);

    while (1)
    {
        printf("The callback function is being executed...\n");
        sleep(1);
    }
}
int main(void)
{
    struct aiocb my_aiocb;

    int fd = open("file.txt", O_RDONLY);
    if (fd < 0)
        perror("open");

    bzero((char *)&my_aiocb, sizeof(my_aiocb));
    my_aiocb.aio_buf = malloc(BUFSIZE);
    if (!my_aiocb.aio_buf)
        perror("my_aiocb.aio_buf");

    my_aiocb.aio_fildes = fd;
    my_aiocb.aio_nbytes = BUFSIZE;
    my_aiocb.aio_offset = 0;

    // Fill in callback information
    /*
       Using SIGEV_THREAD to request a thread callback function as a notification method
    */
    my_aiocb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    my_aiocb.aio_sigevent.sigev_notify_function = aio_completion_handler;
    my_aiocb.aio_sigevent.sigev_notify_attributes = NULL;
    /*
       The context to be transmitted is loaded into the handler (in this case,
       a reference to the aiocb request itself). In this handler, we simply
       refer to the arrived sigval pointer and use the AIO function to verify
       that the request has been completed.
    */
    my_aiocb.aio_sigevent.sigev_value.sival_ptr = &my_aiocb;

    int ret = aio_read(&my_aiocb);
    if (ret < 0)
        perror("aio_read");

    /* <---- Real code would process the data read from the file.
     * So execution needs to be blocked until it is clear that the
     * read is complete. Right here I could put in:
     * while (aio_error(&my_aiocb) == EINPROGRESS) {}
     * But is there some other way involving a callback?
     * If not, what has creating a callback done for me?
     */

    // The calling process continues to execute
    while (1)
    {
        printf("The main thread continues to execute...\n");
        sleep(1);
    }
    return 0;
}
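For what it's worth, the usual way to reconcile a callback with a blocking main thread is to wait on something the callback signals, rather than spinning. The sketch below uses a POSIX semaphore for that; the sem_* plumbing, the handler body and the names are illustrative assumptions, not part of the code above. Alternatively, aio_suspend(3) blocks on the outstanding request directly, with no callback at all.

#include <aio.h>
#include <semaphore.h>
#include <signal.h>

/* Sketch only: block in main without busy-waiting by waiting on a semaphore
 * that the AIO completion handler posts. */
static sem_t read_done;

static void completion_handler(sigval_t sigval)
{
    struct aiocb *req = (struct aiocb *)sigval.sival_ptr;
    if (aio_error(req) == 0) {
        /* ...process req->aio_buf here, or leave it for the main thread... */
    }
    sem_post(&read_done);      /* wake up the thread blocked in sem_wait() */
}

/* In main(): call sem_init(&read_done, 0, 0) before aio_read(), and after
 * issuing the read:
 *     sem_wait(&read_done);   // sleeps (no spinning) until the callback ran
 * from that point on the buffer can be used safely. */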
In Contiki, I need to have two files, a sender and a receiver; the sender sends packets to the receiver. My problem is that the receiver does not output that the packets have been received.
I tried a while loop inside the receiving code, and I even tried to create a function, but nothing has worked.
My sender.c file
#include "contiki.h"
#include "net/rime.h"
#include "random.h"
#include "dev/button-sensor.h"
#include "dev/leds.h"
#include <stdio.h>
PROCESS(sendReceive, "Hello There");
AUTOSTART_PROCESSES(&sendReceive);
PROCESS_THREAD(sendReceive, ev, data)
{
  PROCESS_BEGIN();

  static struct abc_conn abc;
  static struct etimer et;
  static const struct abc_callbacks abc_call;

  PROCESS_EXITHANDLER(abc_close(&abc);)

  abc_open(&abc, 128, &abc_call);

  while(1)
  {
    /* Delay 2-4 seconds */
    etimer_set(&et, CLOCK_SECOND * 2 + random_rand() % (CLOCK_SECOND * 2));
    PROCESS_WAIT_EVENT_UNTIL(etimer_expired(&et));

    packetbuf_copyfrom("Hello", 6);
    abc_send(&abc);
    printf("Message sent\n");
  }

  PROCESS_END();
}
My receiver.c file
#include "contiki.h"
#include "net/rime.h"
#include "random.h"
#include "dev/button-sensor.h"
#include "dev/leds.h"
#include <stdio.h>
PROCESS(sendReceive, "Receiving Message");
AUTOSTART_PROCESSES(&sendReceive);
PROCESS_THREAD(sendReceive, ev, data)
{
  PROCESS_BEGIN();
  {
    printf("Message received '%s'\n", (char *)packetbuf_dataptr());
  }
  PROCESS_END();
}
The sender.c file is working; it is sending the packets correctly. The problem is that the receiver does not seem to output that anything has been received.
While sending is simple - you just need to call a function - receiving data in an embedded system is in general more complicated. There needs to be a way for the operating system to let your code know that new data has arrived from outside. In Contiki that is done internally with events, and from the user's perspective with callbacks.
So, implement a callback function:
static void
recv_from_abc(struct abc_conn *bc)
{
  printf("Message received '%s'\n", (char *)packetbuf_dataptr());
}
In your receiver process, create and open a connection, passing a pointer to the callback structure as a parameter:
static struct abc_conn c;
static const struct abc_callbacks callbacks =
{recv_from_abc, NULL};
uint16_t channel = 128; /* matching the sender code */
abc_open(&c, channel, &callbacks);
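Putting those pieces together in one file, a complete receiver could look roughly like the sketch below. It follows the layout of the standard abc example that ships with Contiki; the process name and the idle wait loop are my assumptions, and channel 128 matches the sender.

#include "contiki.h"
#include "net/rime.h"
#include <stdio.h>

static struct abc_conn abc;

/* Called by the rime stack whenever a packet arrives on this connection. */
static void
recv_from_abc(struct abc_conn *c)
{
  printf("Message received '%s'\n", (char *)packetbuf_dataptr());
}

static const struct abc_callbacks abc_call = {recv_from_abc};

PROCESS(receiverProcess, "Receiving Message");
AUTOSTART_PROCESSES(&receiverProcess);

PROCESS_THREAD(receiverProcess, ev, data)
{
  PROCESS_EXITHANDLER(abc_close(&abc);)
  PROCESS_BEGIN();

  abc_open(&abc, 128, &abc_call);   /* channel 128, matching the sender */

  /* Keep the process alive; all reception happens in the callback. */
  while(1) {
    PROCESS_WAIT_EVENT();
  }

  PROCESS_END();
}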
I have a project (an STM32 microcontroller programmed in C) where I need to receive messages from a serial port (for example, strings) and put them in a queue, from which I will read them later.
Can someone tell me where I can find an example of how to create a message queue (like a FIFO) of strings (or byte arrays) using standard C, and how to manage such a queue? Thanks for any kind of support.
"example on how to create e message queue (like FIFO) of strings (or byte array) using standard C and how to manage the queue"
"in a micro controller with a standard C you should manage the buffers, create the queue, enqueue and dequeue the elements"
The example given below should meet the requirements.
If necessary, the library functions used can easily be replaced with platform-specific versions or standard C array operations.
The memory allocation for the queue can also be done as a static variable instead of as a stack variable. If desired, even malloc could be used.
The message type can easily be extended. The queue and data sizes are defined as constants.
@leonardo gave a good hint on how to structure the processing, i.e. enqueuing messages in an interrupt routine and dequeuing them in main. I guess that some kind of semaphore needs to be used so that the execution of the functions which manipulate the queue does not get mixed up (a rough sketch of this follows the example below). Some thoughts on this are discussed in semaphore like synchronization in ISR (Interrupt service routine)
/*
Portable array-based cyclic FIFO queue.
*/
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#define MESSAGE_SIZE 64
#define QUEUE_SIZE 3
typedef struct {
    char data[MESSAGE_SIZE];
} MESSAGE;

typedef struct {
    MESSAGE messages[QUEUE_SIZE];
    int begin;
    int end;
    int current_load;
} QUEUE;
void init_queue(QUEUE *queue) {
    queue->begin = 0;
    queue->end = 0;
    queue->current_load = 0;
    memset(&queue->messages[0], 0, QUEUE_SIZE * sizeof(MESSAGE));
}
bool enque(QUEUE *queue, MESSAGE *message) {
    if (queue->current_load < QUEUE_SIZE) {
        if (queue->end == QUEUE_SIZE) {
            queue->end = 0;
        }
        queue->messages[queue->end] = *message;
        queue->end++;
        queue->current_load++;
        return true;
    } else {
        return false;
    }
}

bool deque(QUEUE *queue, MESSAGE *message) {
    if (queue->current_load > 0) {
        *message = queue->messages[queue->begin];
        memset(&queue->messages[queue->begin], 0, sizeof(MESSAGE));
        queue->begin = (queue->begin + 1) % QUEUE_SIZE;
        queue->current_load--;
        return true;
    } else {
        return false;
    }
}
int main(int argc, char** argv) {
    QUEUE queue;
    init_queue(&queue);

    MESSAGE message1 = {"This is"};
    MESSAGE message2 = {"a simple"};
    MESSAGE message3 = {"queue!"};

    enque(&queue, &message1);
    enque(&queue, &message2);
    enque(&queue, &message3);

    MESSAGE rec;
    while (deque(&queue, &rec)) {
        printf("%s\n", &rec.data[0]);
    }
}
Compiling and running:
$ gcc -Wall queue.c
$ ./a.out
This is
a simple
queue!
$
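To pick up the point about the interrupt/main split: on a single-core MCU the simplest form of that "semaphore" is a short critical section around the queue accesses in the main loop. Below is a rough sketch; it reuses the QUEUE, MESSAGE, enque and deque from the example above, the UART hook name is made up, and __disable_irq()/__enable_irq() are CMSIS intrinsics (an assumption about the toolchain).

/* Sketch: the UART ISR produces messages, the main loop consumes them. */
static QUEUE uart_queue;                        /* shared between ISR and main loop */

void on_uart_line_received(const char *str)     /* hypothetical, called from the ISR */
{
    MESSAGE m;
    strncpy(m.data, str, MESSAGE_SIZE - 1);
    m.data[MESSAGE_SIZE - 1] = '\0';
    enque(&uart_queue, &m);                     /* the ISR itself is not interrupted by main */
}

void main_loop(void)
{
    MESSAGE m;
    for (;;) {
        __disable_irq();                        /* keep the ISR out while the queue  */
        bool got = deque(&uart_queue, &m);      /* indices are being updated         */
        __enable_irq();

        if (got) {
            /* process m.data ... */
        }
    }
}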
The C language does not have a queue built in (it's a batteries-excluded language), so you need to build your own. If you just need a FIFO to push things in your interrupt routine and then pop them out in your main loop (which is good design, BTW), check A Simple Message Queue for C to see if it works for you.
Why is the following code slow? And by slow I mean 100x-1000x slow. It just repeatedly performs read/write directly on a TCP socket. The curious part is that it remains slow only if I use two function calls for both read AND write as shown below. If I change either the server or the client code to use a single function call (as in the comments), it becomes super fast.
Code snippet:
int main(...) {
    int sock = ...; // open TCP socket
    int i;
    char buf[100000];
    for(i=0;i<2000;++i)
    {   if(amServer)
        {   write(sock,buf,10);
            // read(sock,buf,20);
            read(sock,buf,10);
            read(sock,buf,10);
        }else
        {   read(sock,buf,10);
            // write(sock,buf,20);
            write(sock,buf,10);
            write(sock,buf,10);
        }
    }
    close(sock);
}
We stumbled on this in a larger program that was actually using stdio buffering. It mysteriously became sluggish the moment the payload size exceeded the buffer size by a small margin. Then I did some digging around with strace, and finally boiled the problem down to this. I can solve it by fooling around with the buffering strategy, but I'd really like to know what on earth is going on here. On my machine, the time goes from 0.030 s to over a minute (tested both locally and over remote machines) between the single-call and the two-call versions.
These tests were done on various Linux distros, and various kernel versions. Same result.
Fully runnable code with networking boilerplate:
#include <netdb.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/ip.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
static int getsockaddr(const char* name,const char* port, struct sockaddr* res)
{
    struct addrinfo* list;
    if(getaddrinfo(name,port,NULL,&list) < 0) return -1;
    for(;list!=NULL && list->ai_family!=AF_INET;list=list->ai_next);
    if(!list) return -1;
    memcpy(res,list->ai_addr,list->ai_addrlen);
    freeaddrinfo(list);
    return 0;
}

// used as sock=tcpConnect(...); ...; close(sock);
static int tcpConnect(struct sockaddr_in* sa)
{
    int outsock;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    if(connect(outsock,(struct sockaddr*)sa,sizeof(*sa))<0) return -1;
    return outsock;
}

int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;
    return sock;
}

int tcpListenAny(const char* portn)
{
    in_port_t port;
    int outsock;
    if(sscanf(portn,"%hu",&port)<1) return -1;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    int reuse = 1;
    if(setsockopt(outsock,SOL_SOCKET,SO_REUSEADDR,
                  (const char*)&reuse,sizeof(reuse))<0) return fprintf(stderr,"setsockopt() failed\n"),-1;
    struct sockaddr_in sa = { .sin_family=AF_INET, .sin_port=htons(port)
                            , .sin_addr={INADDR_ANY} };
    if(bind(outsock,(struct sockaddr*)&sa,sizeof(sa))<0) return fprintf(stderr,"Bind failed\n"),-1;
    if(listen(outsock,SOMAXCONN)<0) return fprintf(stderr,"Listen failed\n"),-1;
    return outsock;
}

int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);
    return sock;
}

void writeLoop(int fd,const char* buf,size_t n)
{
    // Don't even bother incrementing buffer pointer
    while(n) n-=write(fd,buf,n);
}

void readLoop(int fd,char* buf,size_t n)
{
    while(n) n-=read(fd,buf,n);
}
int main(int argc,char* argv[])
{
    if(argc<3)
    {   fprintf(stderr,"Usage: round {server_addr|--} port\n");
        return -1;
    }
    bool amServer = (strcmp("--",argv[1])==0);
    int sock;
    if(amServer) sock=tcpAccept(argv[2]);
    else sock=tcpConnectTo(argv[1],argv[2]);
    if(sock<0) { fprintf(stderr,"Connection failed\n"); return -1; }

    int i;
    char buf[100000] = { 0 };
    for(i=0;i<4000;++i)
    {
        if(amServer)
        {   writeLoop(sock,buf,10);
            readLoop(sock,buf,20);
            //readLoop(sock,buf,10);
            //readLoop(sock,buf,10);
        }else
        {   readLoop(sock,buf,10);
            writeLoop(sock,buf,20);
            //writeLoop(sock,buf,10);
            //writeLoop(sock,buf,10);
        }
    }
    close(sock);
    return 0;
}
EDIT: This version is slightly different from the other snippet in that it reads/writes in a loop. So in this version, two separate writes automatically cause two separate read() calls, even if readLoop is called only once. But otherwise the problem still remains.
Interesting. You are a victim of Nagle's algorithm together with TCP delayed acknowledgements.
Nagle's algorithm is a mechanism used in TCP to defer transmission of small segments until enough data has accumulated to make it worth building and sending a segment over the network. From the Wikipedia article:
Nagle's algorithm works by combining a number of small outgoing
messages, and sending them all at once. Specifically, as long as there
is a sent packet for which the sender has received no acknowledgment,
the sender should keep buffering its output until it has a full
packet's worth of output, so that output can be sent all at once.
However, TCP typically employs something known as TCP delayed acknowledgements, a technique that consists of accumulating a batch of ACK replies (because TCP uses cumulative ACKs) to reduce network traffic.
That Wikipedia article further mentions this:
With both algorithms enabled, applications that do two successive
writes to a TCP connection, followed by a read that will not be
fulfilled until after the data from the second write has reached the
destination, experience a constant delay of up to 500 milliseconds,
the "ACK delay".
(Emphasis mine)
In your specific case, since the server doesn't send more data before reading the reply, the client is causing the delay: if the client writes twice, the second write will be delayed.
If Nagle's algorithm is being used by the sending party, data will be
queued by the sender until an ACK is received. If the sender does not
send enough data to fill the maximum segment size (for example, if it
performs two small writes followed by a blocking read) then the
transfer will pause up to the ACK delay timeout.
So, when the client makes 2 write calls, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
Client issues the second write. The previous write has not been acknowledged, so Nagle's algorithm defers transmission until more data arrives (until enough data has been collected to make a segment) or the previous write is ACKed.
Server is tired of waiting and after 500 ms acknowledges the segment.
Client finally completes the 2nd write.
With 1 write, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
The server writes to the socket. An ACK is part of the TCP header, so if you're writing, you might as well acknowledge the previous segment at no extra cost. Do it.
Meanwhile, the client wrote once, so it was already waiting on the next read - there was no 2nd write waiting for the server's ACK.
If you want to keep writing twice on the client side, you need to disable Nagle's algorithm. This is the solution proposed by the algorithm's author himself:
The user-level solution is to avoid write-write-read sequences on
sockets. write-read-write-read is fine. write-write-write is fine. But
write-write-read is a killer. So, if you can, buffer up your little
writes to TCP and send them all at once. Using the standard UNIX I/O
package and flushing write before each read usually works.
(See the citation on Wikipedia)
As mentioned by David Schwartz in the comments, this may not be the greatest idea for various reasons, but it illustrates the point and shows that this is indeed causing the delay.
To disable it, you need to set the TCP_NODELAY option on the sockets with setsockopt(2).
This can be done in tcpConnectTo() for the client:
int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;

    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");

    return sock;
}
And in tcpAccept() for the server:
int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);

    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");

    return sock;
}
It's interesting to see the huge difference this makes.
If you'd rather not mess with the socket options, it's enough to ensure that the client writes once - and only once - before the next read. You can still have the server read twice:
for(i=0;i<4000;++i)
{
    if(amServer)
    {   writeLoop(sock,buf,10);
        //readLoop(sock,buf,20);
        readLoop(sock,buf,10);
        readLoop(sock,buf,10);
    }else
    {   readLoop(sock,buf,10);
        writeLoop(sock,buf,20);
        //writeLoop(sock,buf,10);
        //writeLoop(sock,buf,10);
    }
}
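Following the quoted advice, the other user-level option is to keep the data in the application until there is a full request's worth and hand it to the kernel in one call. If the two small pieces happen to live in separate buffers, writev(2) can submit them as a single write; below is a sketch (the writeBoth helper and the two-buffer split are hypothetical, the question's code uses a single buffer).

#include <stddef.h>
#include <sys/uio.h>

/* Sketch: send two small, separate buffers with one system call, so the
 * kernel never sees a write-write-read pattern from this side. */
void writeBoth(int fd, const char* a, size_t na, const char* b, size_t nb)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = na },
        { .iov_base = (void *)b, .iov_len = nb },
    };
    /* For brevity this ignores short writes; a real version would loop
       the way writeLoop() does. */
    writev(fd, iov, 2);
}

In the client, that would take the place of the two back-to-back writeLoop(sock,buf,10) calls from the slow variant.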