Ridiculously simple MPI_Send/Recv problem I don't understand - c

I have two functions with different algorithms. In the first function I implemented non-blocking communications (MPI_Irecv, MPI_Isend) and the program runs without any errors. Even when I change the non-blocking to blocking communication, everything is fine. No deadlock.
But if I implement the second function with basic blocking communication like this (reduced the algorithm to the problem):
if( my_rank == 0)
{
    a = 3 ;
    MPI_Send(&a,1,MPI_DOUBLE,1,0,MPI_COMM_WORLD) ;
}
else if( my_rank == 1 )
{
    MPI_Recv(&a,1,MPI_DOUBLE,0,0,MPI_COMM_WORLD, &status ) ;
}
So, process 1 should receive the value a from process 0. But I'm getting this error:
Fatal error in MPI_Recv: Message truncated, error stack:
MPI_Recv(187).......................: MPI_Recv(buf=0xbfbef2a8, count=1, MPI_DOUBLE, src=0, tag=0, MPI_COMM_WORLD, status=0xbfbef294) failed
MPIDI_CH3U_Request_unpack_uebuf(600): Message truncated; 32 bytes received but buffer size is 8
rank 2 in job 39 Blabla caused collective abort of all ranks
exit status of rank 2: killed by signal 9
If I run the program with only one of the two functions, each works as it is supposed to. But both together result in the error message above. I do understand the error message, but I don't know what I can do to prevent it. Can someone explain to me where I have to look for the error? Since I'm not getting a deadlock in the first function, I'm assuming that there can't be an unreceived send from the first function which leads to the error in the second.

So, here is the first function:
MPI_Type_vector(m,1,m,MPI_DOUBLE, &column_mpi_t ) ;
MPI_Type_commit(&column_mpi_t) ;
T = (double**)malloc(m*sizeof(double*)) ;
T_data = (double*)malloc(m*m*sizeof(double)) ;
for(i=0;i<m;i++)
{
    T[i] = &(T_data[i*m]) ;
}
if(my_rank==0)
{
    s = &(T[0][0]) ;
    for(i=1;i<p;i++)
    {
        MPI_Send(s,1,column_mpi_t,i,0,MPI_COMM_WORLD) ;
    }
}
for(k=0;k<m-1;k++)
{
    if(k%p != my_rank)
    {
        rbuffer = &(T[0][k]) ;
        MPI_Recv(rbuffer,1,column_mpi_t,k%p,0,MPI_COMM_WORLD,&status) ;
    }
    for(j=k+1;j<n;j++)
    {
        if(j%p==my_rank)
        {
            if(j==k+1 && j!=n-1)
            {
                sbuffer = &(T[0][k+1]) ;
                for(i=0;i<p;i++)
                {
                    if(i!= (k+1)%p )
                        MPI_Send(sbuffer,1,column_mpi_t,i,0,MPI_COMM_WORLD) ;
                }
            }
        }
    }
}
I came to the conclusion that the derived datatype is the origin of my problems. Does anybody see why?
OK, I'm wrong. If I change the MPI datatype in MPI_Irecv/MPI_Isend to MPI_DOUBLE, it matches the datatype of the recv/send in the second function, so there is no truncation error. So, still no solution...

Related

UDP error on beginPacket

I get an error on my Arduino when I send a UDP packet to any IP.
There are two problems when I try to send a packet to _targetIp:
1. When called from loop(), udpSender.endPacket() freezes forever (on the second call; the first call was fine).
2. When called from setup(), udpSender.beginPacket(...) returns 0.
Code:
IPAddress _targetIp(192, 168, 59, 250);
int _sendPort = 4321;
EthernetUDP _udpSender;
(...)
void sendUpd(int pinIndex, int value)
{
    // if I wrote something like this:
    // udpSender.beginPacket(_udpSender.remoteIP(), _sendPort)
    // then is all fine
    if (_udpSender.beginPacket(_targetIp, _sendPort) != 1)
    {
        Serial.println("socket error!");
        return;
    }
    _udpSender.write(pinIndex);
    _udpSender.write("=");
    _udpSender.write(value);
    int sendState = _udpSender.endPacket(); // hang forever when called from "loop()"
    if ( sendState != 1 )
    {
        // enters with 0 as "sendState" when called from "setup()"
        Serial.print("send error: ");
        Serial.println( sendState );
    }
}
Can anybody explain this to me?
Solutions found:
1. It is not the call to _udpSender.endPacket() that freezes. The code that reads a specific digital input (digitalRead(52)) causes the strange behavior.
2. I don't know how, but the Arduino checks whether the host is alive. If the host is not reachable, _udpSender.endPacket() returns 0.
Explanation:
Resolving the IP to a MAC address fails. So the error does not come from UDP (OSI layer 4). It happens in the data link layer (layer 2), which explains point 2.

Pthread sharing variables with pointers in C

I'm working on a C project using Pthreads that needs to share some variables. A fair amount of code is already written, and I just realized that using shared global variables doesn't work quite well because of the cache system.
I found on Stack Overflow that a solution is to pass the address of the variable (in my case there is more than one) to the thread function. What does that change?
Since my thread functions call other functions that modify the globals, it's a bit painful to add a parameter to the whole chain of called functions when only one function modifies the globals.
So I was wondering: would it work to declare a global pointer for each global and use it to access the global instead of the real global?
I think it's a superficial indirection, but why wouldn't it work after all?
My program is a UDP network protocol where networks look like rings, or circular singly linked lists. Instances of the program on the network are called entities.
An entity can insert itself into a ring or ask an entity to create another ring (duplication), so that the other entity is on two rings.
The interface is a sort of shell where commands can lead to sending messages on the ring. Messages circulate around the rings and are stopped once they have already been seen.
The shell is in the main thread; there is a thread for message treatment, another to manage insertion, and also a thread to detect broken rings. The problem is located in the ring tester thread. The thread initializes a global array (volatile short ring_check[NRING]) whose size is the maximum number of rings for an entity: it sets the first elements to 0, according to the actual number of rings, and the rest to -1. After that it sends a test message on each ring and sleeps for a timeout of 30 s. When the timeout has elapsed, it checks the values in the array.
The values are changed by the message treatment thread: when a test message comes back, the thread detects it by its content and writes 1 to the appropriate ring_check element.
The problem is that after a duplication, the two rings are tested but the check for the second one fails (ring_check[1] == 0), and I really don't know why. The test message is received immediately after it is sent; after the message treatment thread modifies ring_check[1], I print it to see if the change is really made, and it prints 1. But about 20 to 30 s later, the ring tester wakes up from its sleep and reads 0 in ring_check[1].
short volatile ring_check[NRING+1];
// The function in the ring tester thread
static void test_ring() {
    // initialize ring_check array
    debug("ring_tester", GREEN "setting ring_check to -1...");
    char port_diff[5];
    // send test messages in each rings
    int fixed_nring = getnring();
    for (int i = fixed_nring+1; i < NRING; ++i) {
        ring_check[i] = -1;
    }
    for (int i = 0; i < fixed_nring + 1; i++) {
        debug("ring_tester", GREEN "setting ring_check %d to 0...", i);
        ring_check[i] = 0;
        itoa4(port_diff, ent.mdiff_port[i]);
        debug("ring_tester", GREEN "sending test to ring %d...", i);
        sendmessage(i, "TEST", "%s %s", ent.mdiff_ip[i], port_diff);
    }
    debug("test_ring", GREEN "timeout beginning...");
    sleep(timeout);
    debug("test_ring", GREEN "end of timeout.");
    for (int i = 0; i < fixed_nring + 1 && ring_check[i] != -1; i++) {
        debug("test_ring", GREEN "ring_check[%d]:%d", i, ring_check[i]);
        if (ring_check[i]) {
            debug("test_ring", GREEN "ring %d: checked.", i);
            continue;
        }
        else {
            debug("test_ring", GREEN "ring %d: checking failed. Ring broken...", i);
            continue;
        }
    }
}
// The function called by the message treatment thread
static int action_test(char *message, char *content, int lookup_flag) {
    debug("action_test", RED "entering function...");
    if (content[15] != ' ' || content[20] != 0) {
        debug("action_test", RED "content not following the protocol."\
              "content: \"%s\"", content);
        return 1;
    }
    if (lookup_flag) {
        char mdiff_port[5];
        int fixed_nring = getnring();
        for (int i = 0; i < fixed_nring + 1 && ring_check[i] != -1; ++i) {
            itoa4(mdiff_port, ent.mdiff_port[i]);
            // find ring associated with message and actualize the checking
            if (strncmp(content, ent.mdiff_ip[i], 15) == 0 &&
                strncmp(&content[16], mdiff_port, 4) == 0 &&
                ring_check[i] != -1) {
                ring_check[i] = 1;
                debug("action_test",
                      RED "correspondance found, ring_check[%d]:%d", i, ring_check[i]);
                return 0;
            }
        }
    }
    else {
        sendpacket_all(message);
    }
    return 0;
}
You could define a global structure such as thread_inputparam, put the addresses of all the global variables in it, and pass the address of this structure variable to all threads.
int global1;
struct thread_input {
    int *g1;
    // add other globals' addresses
} thread_inputparam;
// this assignment has to run inside a function, e.g. at the start of main()
thread_inputparam.g1 = &global1;
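As a rough sketch of how such a structure could be handed to the threads: the names worker, run_workers, NUM_THREADS, INCREMENTS and the mutex member are assumptions added for illustration, not part of the answer above. Passing pointers does not by itself make concurrent updates safe, so the sketch also guards the shared value with a mutex.

```c
#include <pthread.h>

#define NUM_THREADS 4      /* hypothetical thread count */
#define INCREMENTS  1000   /* hypothetical amount of work per thread */

static int global1;                                         /* the shared global */
static pthread_mutex_t global1_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bundle of addresses handed to every thread. */
struct thread_input {
    int *g1;                 /* address of global1 */
    pthread_mutex_t *lock;   /* protects *g1 (an addition for this sketch) */
};

static void *worker(void *arg)
{
    struct thread_input *in = arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(in->lock);
        (*in->g1)++;                      /* update through the pointer */
        pthread_mutex_unlock(in->lock);
    }
    return NULL;
}

/* Starts the workers, waits for all of them, returns the final value. */
int run_workers(void)
{
    struct thread_input in = { &global1, &global1_lock };
    pthread_t tid[NUM_THREADS];
    int started = 0;

    global1 = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&tid[started], NULL, worker, &in) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);       /* in must outlive the threads */
    return global1;
}
```

Because every thread receives the same struct thread_input, functions further down the call chain only need that one pointer instead of one parameter per global.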

Synchronizing the result of threads with incremented shared variable and condition

The title might not appear particularly clear, but the code explains itself:
int shared_variable;
int get_shared_variable() {
    int result;
    pthread_mutex_lock(&shared_variable_mutex);
    result = shared_variable;
    pthread_mutex_unlock(&shared_variable_mutex);
    return result;
}
void* thread_routine(void *arg) {
    while (get_shared_variable() < 5000) {
        printf();
        printf();
        sleep(2);
        int i = 0;
        while (pthread_mutex_trylock(&foo_mutexes[i]) != 0) {
            i++;
            pthread_mutex_lock(&foo_count_mutex);
            if (i == foo_count) {
                pthread_mutex_unlock(&foo_count_mutex);
                sleep(1); // wait one second and retry
                i = 0;
            }
            pthread_mutex_unlock(&foo_count_mutex);
        }
        pthread_mutex_lock(&shared_variable_mutex);
        shared_variable += 10;
        pthread_mutex_unlock(&shared_variable_mutex);
    }
    return NULL;
}
I'm passing thread_routine to pthread_create (pretty standard), but I'm having a problem with the synchronization of the result. Basically, the problem is that the first thread checks the while condition and passes, and then another thread checks it and passes too. However, when the first thread finishes and shared_variable reaches 5000, the second thread has not yet finished, so it adds another 10 and the end result becomes 5010 (or 5000 plus (NUM_OF_THREADS - 1) * 10 if I run more than two threads), while the whole process should end at 5000.
Another issue is that in // do some work I output something on the screen, so the whole thing inside the loop should pretty much work as a transaction in database terms. I can't seem to figure out how to solve this problem, but I suppose there's something simple that I'm missing. Thanks in advance.
This answer may or may not be what you are after because, as explained in the comments, your description of the expected behaviour of the program is incomplete. Without the exact expected behaviour it is difficult to give a full answer. But since you ask, here is a possible structure for the program based on the code shown. The main principle it illustrates is that the critical section for shared_variable needs to be both minimal and complete.
int shared_variable;
void* thread_routine(void *arg)
{
    while (1) {
        pthread_mutex_lock(&shared_variable_mutex);
        if (shared_variable >= 5000) {
            pthread_mutex_unlock(&shared_variable_mutex);
            break;
        }
        shared_variable += 10;
        pthread_mutex_unlock(&shared_variable_mutex);
        /* Other code that doesn't use shared_variable goes here */
    }
    return NULL;
}
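To see the effect of doing the check and the increment under a single lock acquisition, here is a small self-contained sketch along the same lines. The helper names claim_step and run_until_limit, and the constants LIMIT, STEP and NUM_THREADS, are assumptions made for this example.

```c
#include <pthread.h>

#define LIMIT       5000
#define STEP        10
#define NUM_THREADS 8    /* hypothetical thread count */

static int shared_variable;
static pthread_mutex_t shared_variable_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Checks the limit and increments while holding the lock once.
   Returns 1 if this thread claimed a step, 0 if the limit was reached. */
static int claim_step(void)
{
    int claimed = 0;
    pthread_mutex_lock(&shared_variable_mutex);
    if (shared_variable < LIMIT) {
        shared_variable += STEP;
        claimed = 1;
    }
    pthread_mutex_unlock(&shared_variable_mutex);
    return claimed;
}

static void *thread_routine(void *arg)
{
    (void)arg;
    while (claim_step()) {
        /* work that does not touch shared_variable goes here */
    }
    return NULL;
}

/* Runs the workers to completion and returns the final counter value. */
int run_until_limit(void)
{
    pthread_t tid[NUM_THREADS];
    int started = 0;

    shared_variable = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&tid[started], NULL, thread_routine, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return shared_variable;
}
```

Since no thread can pass the check without also performing its increment before another thread looks at the value, the counter stops at LIMIT regardless of how many threads run.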

linux kernel + conditional statements

I basically am running into a very odd situation in a system call that I am writing. I want to check some values if they are the same return -2 which indicates a certain type of error has occurred. I am using printk() to print the values of the variables right before my "else if" and it says that they are equal to one another but yet the conditional is not being executed (i.e. we don't enter the else if) I am fairly new to working in the kernel but this seems very off to me and am wondering if there is some nuance of working in the kernel I am not aware of so if anyone could venture a guess as to why if I know the values of my variables the conditional would not execute I would really appreciate your help
//---------------------------------------//
/* sys_receiveMsg421()
   Description:
   - Copies the first message in the mailbox into <msg>
*/
asmlinkage long sys_receiveMsg421(unsigned long mbxID, char *msg, unsigned long N)
{
    int result = 0;
    int mboxIndex = checkBoxId(mbxID);
    int msgIndex = 0;
    //acquire the lock
    down_interruptible(&sem);
    //check to make sure the mailbox with <mbxID> exists
    if(!mboxIndex)
    {
        //free our lock
        up(&sem);
        return -1;
    }
    else
        mboxIndex--;
    printk("<1>mboxIndex = %d\nNumber of messages = %dCurrent Msg = %d\n",mboxIndex, groupBox.boxes[mboxIndex].numMessages, groupBox.boxes[mboxIndex].currentMsg );
    //check to make sure we have a message to recieve
    /* ----------- CODE NOT EXECUTING HERE ----------- */
    if(groupBox.boxes[mboxIndex].numMessages == groupBox.boxes[mboxIndex].currentMsg)
    {
        //free our lock
        up(&sem);
        return -2;
    }
    //retrieve the message
    else
    {
        //check to make sure the msg is a valid pointer before continuing
        if(!access_ok(VERIFY_READ, msg, N * sizeof(char)))
        {
            printk("<1>Access has been denied for %lu\n", mbxID);
            //free our lock
            up(&sem);
            return -1;
        }
        else
        {
            //calculate the index of the message to be retrieved
            msgIndex = groupBox.boxes[mboxIndex].currentMsg;
            //copy from kernel to user variable
            result = copy_to_user(msg, groupBox.boxes[mboxIndex].messages[msgIndex], N);
            //increment message position
            groupBox.boxes[mboxIndex].currentMsg++;
            //free our lock
            up(&sem);
            //return number of bytes copied
            return (N - result);
        }
    }
}
UPDATE: I solved my problem by just changing the return value to something else, and it works fine. Very weird, though.
Please remember to use punctuation; I don't like running out of breath while reading questions.
Are you sure the if block isn't being entered? A printk there (and another in the corresponding else block) would take you one step further, no?
As for the question: No, there isn't anything specific to kernel code that would make this not work.
And you seem to have synchronization covered, too. Though: I see that you're acquiring mboxIndex outside the critical section. Could that cause a problem? It's hard to tell from this snippet, which doesn't even have groupBox declared.
Perhaps numMessages and/or currentMsg are defined as long?
If so, your printk, which uses %d, would print just some of the bits, so you may think they're equal while they are not.
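A user-space sketch of that hypothesis, assuming a platform where long is 64 bits and int is 32 bits (as on typical 64-bit Linux). The helper names are made up for illustration, and snprintf stands in for printk: two long values can agree in the low 32 bits that an int-sized conversion would show and still compare unequal, whereas %ld prints the full values.

```c
#include <stdio.h>

/* Returns 1 if a and b agree in their low int-sized bits but differ as longs.
   The unsigned conversion keeps only the low bits (well defined in C). */
int same_low_bits_only(long a, long b)
{
    return (unsigned int)a == (unsigned int)b && a != b;
}

/* Formats both counters with %ld, the conversion that matches long. */
int format_counters(char *buf, size_t len, long numMessages, long currentMsg)
{
    return snprintf(buf, len, "numMessages = %ld, currentMsg = %ld",
                    numMessages, currentMsg);
}
```

If the fields really are long, switching the format string to %ld (or %lu for unsigned long) would reveal whether the two values differ in bits that %d was hiding.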

select() times out immediately after long runtime (C++)

Most of the time this code works just fine. But sometimes when the executable has been running for a while, select() appears to time out immediately, then get into a weird state where it keeps getting called, timing out immediately, over and over. Then it has to be killed from the outside.
My guess would be that the way standard input changes over time is at fault, since that is what select() is blocking on.
Looking around on StackOverflow, most of people's select() troubles seem to be solved by making sure to reset with the macros (FD_ZERO & FD_SET) every time and using the right initial parameter to select. I don't think those are the issues here.
int rc = 0;
fd_set fdset;
struct timeval timeout;
// -- clear out the response -- //
readValue = "";
// -- set the timeout -- //
timeout.tv_sec = passedInTimeout; // 5 seconds
timeout.tv_usec = 0;
// -- indicate which file descriptors to select from -- //
FD_ZERO(&fdset);
FD_SET(passedInFileDescriptor, &fdset); //passedInFileDescriptor = 0
// -- perform the selection operation, with timeout -- //
rc = select(1, &fdset, NULL, NULL, &timeout);
if (rc == -1) // -- select failed -- //
{
    result = TR_ERROR;
}
else if (rc == 0) // -- select timed out -- //
{
    result = TR_TIMEDOUT;
}
else
{
    if (FD_ISSET(mFileDescriptor, &fdset))
    {
        if(rc = readData(readValue) <= 0)
        {
            result = TR_ERROR;
        }
    } else {
        result = TR_SUCCESS;
    }
}
Beware that some implementations of select() apply the specification strictly:
"nfds is the highest-numbered file descriptor in any of the three sets, plus 1".
So you had better replace "1" with "passedInFileDescriptor+1" as the first parameter.
I don't know if this can solve your problem, but at least your code becomes more... uhm... "traditional" ;)
Bye
On some OSes, timeout is modified when calling select to reflect the amount of time not slept. It doesn't look like you're reusing timeout in your example, but make sure that you are indeed reinitializing it to 5 seconds every time before calling select.
I'm having the same problem: it works fine on Windows but not on Linux, and I have maxfd set to the last socket + 1. It occurs periodically after long runs. I pick up the connection on accept, and then the first call to select periodically times out.
Look at this code:
if (FD_ISSET(mFileDescriptor, &fdset))
{
    if(rc = readData(readValue) <= 0)
    {
        result = TR_ERROR;
    }
} else {
    result = TR_SUCCESS;
}
There are two things bothering me here:
1. If your FD has no data in it (say, because an error occurred), FD_ISSET() will return false and your function returns TR_SUCCESS!?
2. You FD_SET(passedInFileDescriptor, &fdset), but check another value: FD_ISSET(mFileDescriptor, &fdset). If mFileDescriptor != passedInFileDescriptor at some point, you'll fall into my first case.
It should look like this:
if (FD_ISSET(passedInFileDescriptor, &fdset))
{
    // parenthesized so that rc holds readData()'s return value, not the comparison result
    if((rc = readData(readValue)) <= 0)
    {
        result = TR_ERROR;
    }
    else
    {
        result = TR_SUCCESS;
    }
}
else
{
    result = TR_ERROR;
}
No?
(Edit: also, another answer points out the problem with your use of select() with a bad high_fd value)
Another edit: well, looks like the guys never came back... frustrating.
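One more detail about the condition rc = readData(readValue) <= 0 from the question's code: because <= binds tighter than =, rc receives the result of the comparison (0 or 1) rather than the return value of readData(). A minimal sketch, with a hypothetical read_data_stub() standing in for readData():

```c
/* Hypothetical stand-in for readData(): pretends that 7 bytes were read. */
static int read_data_stub(void)
{
    return 7;
}

/* Parsed as rc = (read_data_stub() <= 0), so rc is the comparison result. */
int rc_without_parens(void)
{
    int rc;
    rc = read_data_stub() <= 0;
    return rc;
}

/* Extra parentheses store the return value first, then compare it. */
int rc_with_parens(void)
{
    int rc;
    int failed = (rc = read_data_stub()) <= 0;
    (void)failed;   /* only rc matters for this illustration */
    return rc;
}
```

The branch taken is the same in both forms; the difference only matters if rc is used afterwards, for instance as a byte count.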
