I'm working on a college assignment where we are to implement parallelized A* search for the 15-puzzle. For this part, we are to use only one priority queue (I suppose the point is to see how contention among multiple threads limits speedup). The problem I am facing is properly synchronizing popping the next "candidate" from the priority queue.
I tried the following:
while(1) {
// The board I'm trying to pop.
Board current_board;
pthread_mutex_lock(&priority_queue_lock);
// If the heap is empty, wait till another thread adds new candidates.
if (pq->heap_size == 0)
{
printf("Waiting...\n");
pthread_mutex_unlock(&priority_queue_lock);
continue;
}
current_board = top(pq);
pthread_mutex_unlock(&priority_queue_lock);
// Generate the new boards from the current one and add to the heap...
}
I've tried different variants of the same idea, but for some reason there are occasions where the threads get stuck on "Waiting". The code works fine serially (or with two threads), so that leads me to believe this is the offending part of the code. I can post the entire thing if necessary. I feel like it's an issue with my understanding of the mutex lock though. Thanks in advance for help.
Edit:
I've added the full code for the parallel thread below:
// h and p are global pointers initialized in main()
void* parallelThread(void* arg)
{
int thread_id = (int)(long long)(arg);
while(1)
{
Board current_board;
pthread_mutex_lock(&priority_queue_lock);
current_board = top(p);
pthread_mutex_unlock(&priority_queue_lock);
// Move blank up.
if (current_board.blank_x > 0)
{
int newpos = current_board.blank_x - 1;
Board new_board = current_board;
new_board.board[current_board.blank_x][current_board.blank_y] = new_board.board[newpos][current_board.blank_y];
new_board.board[newpos][current_board.blank_y] = BLANK;
new_board.blank_x = newpos;
new_board.goodness = get_goodness(new_board.board);
new_board.turncount++;
if (check_solved(new_board))
{
printf("Solved in %d turns",new_board.turncount);
exit(0);
}
if (!exists(h,new_board))
{
insert(h,new_board);
push(p,new_board);
}
}
// Move blank down.
if (current_board.blank_x < 3)
{
int newpos = current_board.blank_x + 1;
Board new_board = current_board;
new_board.board[current_board.blank_x][current_board.blank_y] = new_board.board[newpos][current_board.blank_y];
new_board.board[newpos][current_board.blank_y] = BLANK;
new_board.blank_x = newpos;
new_board.goodness = get_goodness(new_board.board);
new_board.turncount++;
if (check_solved(new_board))
{
printf("Solved in %d turns",new_board.turncount);
exit(0);
}
if (!exists(h,new_board))
{
insert(h,new_board);
push(p,new_board);
}
}
// Move blank right.
if (current_board.blank_y < 3)
{
int newpos = current_board.blank_y + 1;
Board new_board = current_board;
new_board.board[current_board.blank_x][current_board.blank_y] = new_board.board[current_board.blank_x][newpos];
new_board.board[current_board.blank_x][newpos] = BLANK;
new_board.blank_y = newpos;
new_board.goodness = get_goodness(new_board.board);
new_board.turncount++;
if (check_solved(new_board))
{
printf("Solved in %d turns",new_board.turncount);
exit(0);
}
if (!exists(h,new_board))
{
insert(h,new_board);
push(p,new_board);
}
}
// Move blank left.
if (current_board.blank_y > 0)
{
int newpos = current_board.blank_y - 1;
Board new_board = current_board;
new_board.board[current_board.blank_x][current_board.blank_y] = new_board.board[current_board.blank_x][newpos];
new_board.board[current_board.blank_x][newpos] = BLANK;
new_board.blank_y = newpos;
new_board.goodness = get_goodness(new_board.board);
new_board.turncount++;
if (check_solved(new_board))
{
printf("Solved in %d turns",new_board.turncount);
exit(0);
}
if (!exists(h,new_board))
{
insert(h,new_board);
push(p,new_board);
}
}
}
return NULL;
}
> I tried the following:
I don't see anything wrong with the code that follows, assuming that top also removes the board from the queue. It's wasteful (if the queue is empty, it will spin locking and unlocking the mutex), but not wrong.
> I've added the full code
This is useless without the code for exists, insert and push.
One general observation:
pthread_mutex_lock(&priority_queue_lock);
current_board = top(p);
pthread_mutex_unlock(&priority_queue_lock);
In the code above, your locking is "outside" of the top function. But here:
if (!exists(h,new_board))
{
insert(h,new_board);
push(p,new_board);
}
you either do no locking at all (in which case that's a bug), or you do locking "inside" exists, insert and push.
You should not mix "inside" and "outside" locking. Pick one or the other and stick with it.
If you in fact do not lock the queue inside exists, insert, etc. then you have a data race and are thinking of mutexes incorrectly: they protect invariants, and you can't check whether the queue is empty in parallel with another thread executing "remove top element" -- these operations require serialization, and thus must both be done under a lock.
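The spinning on an empty queue mentioned above can be avoided with a condition variable, and keeping every check and update of the queue under the same mutex addresses the serialization point. What follows is only a minimal sketch under assumptions: `SharedPQ`, `spq_push`, `spq_pop` and `spq_finish` are hypothetical names, the heap holds plain `int` priorities instead of `Board` values, and the capacity is fixed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define PQ_CAP 1024 /* hypothetical fixed capacity */

typedef struct {
    int items[PQ_CAP];       /* binary min-heap of priorities */
    int size;
    bool done;               /* set when the search is over */
    pthread_mutex_t lock;    /* protects every field above */
    pthread_cond_t nonempty; /* signalled when an item is pushed */
} SharedPQ;

void spq_init(SharedPQ *q) {
    q->size = 0;
    q->done = false;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Push under the lock and wake one waiting thread. Returns false if full. */
bool spq_push(SharedPQ *q, int value) {
    pthread_mutex_lock(&q->lock);
    if (q->size == PQ_CAP) {
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    int i = q->size++;
    while (i > 0 && q->items[(i - 1) / 2] > value) { /* sift up */
        q->items[i] = q->items[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    q->items[i] = value;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return true;
}

/* Sleep (instead of spinning) while the heap is empty. The emptiness check
   and the removal happen under the same lock. Returns false only when the
   heap is empty and spq_finish() has been called. */
bool spq_pop(SharedPQ *q, int *out) {
    assert(q != NULL && out != NULL);
    pthread_mutex_lock(&q->lock);
    while (q->size == 0 && !q->done)
        pthread_cond_wait(&q->nonempty, &q->lock);
    if (q->size == 0) {
        pthread_mutex_unlock(&q->lock);
        return false;
    }
    *out = q->items[0];
    int last = q->items[--q->size];
    int i = 0;
    for (;;) { /* sift down */
        int child = 2 * i + 1;
        if (child >= q->size) break;
        if (child + 1 < q->size && q->items[child + 1] < q->items[child])
            child++;
        if (last <= q->items[child]) break;
        q->items[i] = q->items[child];
        i = child;
    }
    q->items[i] = last;
    pthread_mutex_unlock(&q->lock);
    return true;
}

/* Mark the search as finished and wake every waiting thread. */
void spq_finish(SharedPQ *q) {
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}
```

In this sketch a worker would loop on `spq_pop` and stop when it returns false. Something still has to call `spq_finish` (for example the thread that finds the solution), otherwise threads waiting on an empty queue would sleep forever.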
To simplify the problem as much as possible, I have two functions, a parent that calls the child. Everything executes okay till it gets to the return of the child function. After that I get a Bus Error.
int main () {
game();
// this doesn't get executed and program fails with bus error
printf("Execute 2");
return 1;
}
int game () {
game_t GameInfo = {.level = 1, .score = 0, .playerCh = 0, .playerX = 1, .playerY = 1};
gameLevel(&GameInfo);
mvprintw(1,1, "Executed");
// code works up to here and gets executed properly
return 1;
}
void gameLevel (game_t *GameInfo) {
// determine the size of the game field
int cellCols = COLS / 3;
int cellRows = (LINES / 3) - 2;
GameInfo -> playerX = 1;
GameInfo -> playerY = 1;
generateMaze(0);
int solved = 0;
int level = GameInfo -> level;
// default player position
getPlayerDefault(GameInfo);
pthread_t enemies_th;
pthread_create(&enemies_th, NULL, enemies, (void *)GameInfo);
// enemies(&level);
while (solved == 0 && GameInfo -> collision != 1) {
printGameInfo(GameInfo);
noecho();
char move = getch();
echo();
if (GameInfo -> collision != 1) {
if (checkMoveValidity(move, GameInfo) == 1) {
solved = movePlayer(move, GameInfo);
if (solved == 1) {
break;
}
}
} else {
break;
}
}
if (solved == 1) {
pthread_cancel(enemies_th);
GameInfo->level++;
gameLevel(GameInfo);
} else {
// game over
pthread_cancel(enemies_th);
return;
}
}
Now, the code is much more complicated than here, but I think that shouldn't have any influence on this (?) as it executes properly, until the return statement. There is also ncurses and multithreading, quite complex custom structures, but it all works, up until that point. Any ideas ?
Tried putting print statements after each segment of code, everything worked up until this.
pthread_cancel() doesn't terminate the requested thread immediately. The only way to know that a cancelled thread has terminated is to call pthread_join(). If the thread is left running, it will interfere with use of the GameInfo variable in the next level of the game if the current level is solved, or may use the GameInfo variable beyond its lifetime if the current level was not solved and the main thread returns back to the main() function.
To make sure the old enemies thread has terminated, add calls to pthread_join() to the gameLevel() function as shown below:
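if (solved == 1) {
pthread_cancel(enemies_th);
pthread_join(enemies_th, NULL); // wait until the thread has really terminated
GameInfo->level++;
gameLevel(GameInfo);
} else {
// game over
pthread_cancel(enemies_th);
pthread_join(enemies_th, NULL); // wait until the thread has really terminated
return;
}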
The use of tail recursion in gameLevel() seems unnecessary. I recommend returning the solved value and letting the game() function start the next level:
In game():
while (gameLevel(&GameInfo)) {
GameInfo.level++;
}
In gameLevel():
int gameLevel(game_t *GameInfo) {
/* ... */
pthread_cancel(enemies_th);
pthread_join(enemies_th, NULL);
return solved;
}
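The cancel-then-join pattern can be tried in isolation. This is a minimal sketch, not the game's code: `enemy_loop`, `nap_ms` and `run_and_cancel` are hypothetical names, and the loop body merely counts ticks. It relies on `nanosleep()` being a cancellation point, so the thread notices the request under the default deferred cancellation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Sleep for ms milliseconds (ms < 1000). nanosleep() is a cancellation point. */
static void nap_ms(long ms) {
    struct timespec ts = { 0, ms * 1000000L };
    nanosleep(&ts, NULL);
}

/* Hypothetical stand-in for the enemies() thread: runs until cancelled. */
static void *enemy_loop(void *arg) {
    int *ticks = arg;
    for (;;) {
        (*ticks)++;
        nap_ms(1);
    }
    return NULL;
}

/* Starts the thread, cancels it, and waits for it to terminate.
   Returns 1 if the thread ended because of the cancellation request. */
int run_and_cancel(void) {
    pthread_t th;
    int ticks = 0;         /* lives on this stack frame, like GameInfo in game() */
    void *result = NULL;

    if (pthread_create(&th, NULL, enemy_loop, &ticks) != 0)
        return 0;
    nap_ms(10);            /* let the thread run for a moment */

    pthread_cancel(th);    /* only a request: the thread may still be running */
    int rc = pthread_join(th, &result);
    assert(rc == 0);
    /* Only after the join is it safe for ticks to go out of scope. */
    return result == PTHREAD_CANCELED;
}
```

Without the `pthread_join` call, `run_and_cancel` could return while the thread is still writing to `ticks`, which is the same kind of lifetime problem described above for `GameInfo`.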
I currently am trying to implement FIFO for the producer consumer problem. However when I run the code it seems that the first item is not being removed from the buffer as the output shows that every consumer is consuming the first item and never the others (Screenshot of output attached).
I implemented a LIFO queue and got the expected result which is what leads me to believe that the issue is with my FIFO implementation.
Simple error in dequeue. Imagine you want to get the first entry in the queue (buffer_rd == 0). But you increment buffer_rd first and then read that entry, so you return the wrong slot:
buffer_t dequeuebuffer() {
if (buffer_rd == buffer_wr) {
printf("Buffer underflow\n");
} else {
buffer_rd = (buffer_rd + 1) % SIZE; // BUG: the read index is advanced first...
return buffer[buffer_rd]; // ...so this returns the next slot, not the oldest entry
}
return 0;
}
You need to reverse those two steps (as in insert):
buffer_t dequeuebuffer() {
if (buffer_rd == buffer_wr) {
printf("Buffer underflow\n");
} else {
buffer_t ret = buffer[buffer_rd]; // read the oldest entry before advancing
buffer_rd = (buffer_rd + 1) % SIZE;
return ret;
}
return 0;
}
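For reference, here is a small self-contained sketch of a ring-buffer FIFO with the read and the increment in the right order. The names `fifo_put` and `fifo_get`, the `int` element type and the tiny `SIZE` are assumptions for illustration. Unlike the code above, it reports underflow through the return value, so a stored 0 is not confused with an empty buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIZE 4 /* hypothetical; one slot stays unused to tell full from empty */
typedef int buffer_t;

static buffer_t buffer[SIZE];
static int buffer_rd = 0; /* index of the oldest entry */
static int buffer_wr = 0; /* index of the next free slot */

/* Store item at the write index, then advance it. Returns false if full. */
bool fifo_put(buffer_t item) {
    int next = (buffer_wr + 1) % SIZE;
    if (next == buffer_rd)
        return false;
    buffer[buffer_wr] = item;
    buffer_wr = next;
    return true;
}

/* Read the oldest entry first, then advance the read index.
   Returns false on underflow. */
bool fifo_get(buffer_t *out) {
    assert(out != NULL);
    if (buffer_rd == buffer_wr)
        return false;
    *out = buffer[buffer_rd];
    buffer_rd = (buffer_rd + 1) % SIZE;
    return true;
}
```

The sketch is not thread-safe on its own. In a producer-consumer setup both functions would be called with the buffer's mutex held.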
I'm trying to use local memory inside a device-side enqueued kernel.
My assumption is that any locally-declared array is visible across all work items in the work group.
This is proven to be true when I use local memory on kernels that are called from the host-side, but I'm running into problems when I use a similar setup on device-side enqueued kernels.
Is there something wrong with my assumption?
Edit:
My kernel is below:
My goal is to sort the FIFO pipe into 3 buffers. The problem is that my work items have a limited view scope, and I'm trying to write the buffers into another pipe.
int pivot;
int in_pipe[BIN_SIZE];
int lt_bin[BIN_SIZE];
int gt_bin[BIN_SIZE];
int e_bin[BIN_SIZE];
reserve_id_t down_id = work_group_reserve_read_pipe(down_pipe, local_size);
//while ( is_valid_reserve_id(down_id) == false){
// down_id = work_group_reserve_read_pipe(down_pipe, local_size);
//}
//in_bin[tid] = -5;
if( is_valid_reserve_id(down_id) == true){
int status = read_pipe(down_pipe, down_id, lid, &pipe_out);
work_group_commit_read_pipe(down_pipe, down_id);
pivot = pipe_out;
pivot = work_group_broadcast(pivot, 0);
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
work_group_barrier(CLK_LOCAL_MEM_FENCE);
in_pipe[tid] = pipe_out;
//in_bin[lid] = in_pipe[tid];
int e_count = 0;
int gt_count = 0;
int lt_count = 0;
if(in_pipe[tid] == pivot){
e_count = 1;
}
else if(in_pipe[tid] < pivot){
lt_count = 1;
}
else if(in_pipe[tid] > pivot){
gt_count = 1;
}
int e_tot = work_group_reduce_add(e_count);
e_tot = work_group_broadcast(e_tot, 0);
int e_val = work_group_scan_exclusive_add(e_count);
int gt_tot = work_group_reduce_add(gt_count);
gt_tot = work_group_broadcast(gt_tot, 0);
int gt_val = work_group_scan_exclusive_add(gt_count);
int lt_tot = work_group_reduce_add(lt_count);
lt_tot = work_group_broadcast(lt_tot, 0);
int lt_val = work_group_scan_exclusive_add(lt_count);
//in_bin[tid] = lt_val;
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
work_group_barrier(CLK_LOCAL_MEM_FENCE);
if(in_pipe[tid] == pivot){
e_temp[e_val] = in_pipe[tid];
//in_bin[e_val] = e_bin[e_val];
//e_bin[e_Val] = work_group_broadcast(e_bin[e_val], lid);
}
if(in_pipe[tid] < pivot){
lte_temp[lt_val] = in_pipe[tid];
//in_bin[lt_val] = lt_bin[lt_val];
}
if(in_pipe[tid] > pivot){
gt_bin[gt_val] = in_pipe[tid];
//in_bin[gt_val] = gt_bin[gt_val];
}
No, not wrong. Local variables are declared and used across whole work-groups device-side too. They won't be shared with the parent kernels, though.
What exactly are you doing?
The working solution to my question is:
Pipes cannot be created on the device side. What I tried to accomplish was to make a dynamic tree structure, involving branches. OpenCL pipes simply cannot do that, as pipes are still memory objects, created on the host side. There is currently no way in the specification to create memory objects on the device side.
Pipes, however, can be used in a dynamically-recursive method, albeit the recursion cannot deviate, and must occur in a linear fashion. Please consult the sample code found in the AMD APP SDK sample code packs for more details. Specifically, please look at the Device Enqueue BFS example.
I am developing a userspace preemptive thread library (fibre) that uses context switching as the base approach. For this I wrote a scheduler. However, it's not performing as expected. Can I have any suggestions for this?
The structure of the thread_t used is :
typedef struct thread_t {
int thr_id;
int thr_usrpri;
int thr_cpupri;
int thr_totalcpu;
ucontext_t thr_context;
void * thr_stack;
int thr_stacksize;
struct thread_t *thr_next;
struct thread_t *thr_prev;
} thread_t;
The scheduling function is as follows:
void schedule(void)
{
thread_t *t1, *t2;
thread_t * newthr = NULL;
int newpri = 127;
struct itimerval tm;
ucontext_t dummy;
sigset_t sigt;
t1 = ready_q;
// Select the thread with highest priority
while (t1 != NULL)
{
if (newpri > t1->thr_usrpri + t1->thr_cpupri)
{
newpri = t1->thr_usrpri + t1->thr_cpupri;
newthr = t1;
}
t1 = t1->thr_next;
}
if (newthr == NULL)
{
if (current_thread == NULL)
{
// No more threads? (stop itimer)
tm.it_interval.tv_usec = 0;
tm.it_interval.tv_sec = 0;
tm.it_value.tv_usec = 0; // ZERO Disable
tm.it_value.tv_sec = 0;
setitimer(ITIMER_PROF, &tm, NULL);
}
return;
}
else
{
// TO DO :: Reenabling of signals must be done.
// Switch to new thread
if (current_thread != NULL)
{
t2 = current_thread;
current_thread = newthr;
timeq = 0;
sigemptyset(&sigt);
sigaddset(&sigt, SIGPROF);
sigprocmask(SIG_UNBLOCK, &sigt, NULL);
swapcontext(&(t2->thr_context), &(current_thread->thr_context));
}
else
{
// No current thread? might be terminated
current_thread = newthr;
timeq = 0;
sigemptyset(&sigt);
sigaddset(&sigt, SIGPROF);
sigprocmask(SIG_UNBLOCK, &sigt, NULL);
swapcontext(&(dummy), &(current_thread->thr_context));
}
}
}
It seems that the "ready_q" (head of the list of ready threads?) never changes, so the search for the highest-priority thread always finds the first suitable element. If two threads have the same priority, only the first one has a chance to gain the CPU. There are many algorithms you can use: some are based on dynamically changing the priority, others use some form of rotation inside the ready queue. In your example you could remove the selected thread from its place in the ready queue and put it in the last place (it's a doubly linked list, so the operation is trivial and quite inexpensive).
Also, I'd suggest you consider the performance issues due to the linear search in ready_q, since it may be a problem when the number of threads is large. In that case a more sophisticated structure may help, with separate lists of threads for different priority levels.
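The rotation idea could look roughly like the sketch below. It uses a stripped-down hypothetical `node_t` instead of the full `thread_t`, and the helper names `rq_append`, `rq_move_to_tail` and `rq_pick` are assumptions. As in the scheduler above, a lower number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of thread_t: only what the queue needs. */
typedef struct node {
    int id;
    int pri; /* lower value = higher priority */
    struct node *next, *prev;
} node_t;

/* Append n at the tail of the list starting at *head. */
void rq_append(node_t **head, node_t *n) {
    assert(head != NULL && n != NULL);
    n->next = NULL;
    if (*head == NULL) {
        n->prev = NULL;
        *head = n;
        return;
    }
    node_t *t = *head;
    while (t->next != NULL)
        t = t->next;
    t->next = n;
    n->prev = t;
}

/* Unlink n from its current position and put it in the last place. */
void rq_move_to_tail(node_t **head, node_t *n) {
    if (n->next == NULL)
        return; /* already last */
    if (n->prev != NULL)
        n->prev->next = n->next;
    else
        *head = n->next;
    n->next->prev = n->prev;
    rq_append(head, n);
}

/* Pick the best-priority node (first one wins a tie), then rotate it to
   the tail so that an equal-priority node wins the next tie. */
node_t *rq_pick(node_t **head) {
    node_t *best = NULL;
    for (node_t *t = *head; t != NULL; t = t->next)
        if (best == NULL || t->pri < best->pri)
            best = t;
    if (best != NULL)
        rq_move_to_tail(head, best);
    return best;
}
```

With two nodes of equal priority, successive calls to `rq_pick` alternate between them. The walk to the tail in `rq_append` is linear here; keeping a separate tail pointer would make the move constant-time.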
Bye!
I'm having trouble getting my worker threads and facilitator thread to synchronize properly. The problem I'm trying to solve is to find the largest prime number in 10 files using up to 10 threads. With 1 thread the program is single-threaded; anything greater than that is multi-threaded.
The problem lies where the worker signals the facilitator that it has found a new prime. The facilitator can ignore it if the number is insignificant, or signal to update all threads my_latest_lgprime if it is important. I keep getting stuck in my brain and in code.
The task must be completed using a facilitator and synchronization.
Here is what I have so far:
Worker:
void* worker(void* args){
w_pack* package = (w_pack*) args;
int i, num;
char text_num[30];
*(package->fac_prime) = 0;
for(i = 0; i<package->file_count; i++){
int count = 1000000; //integers per file
FILE* f = package->assigned_files[i];
while(count != 0){
fscanf(f, "%s", text_num);
num = atoi(text_num);
pthread_mutex_lock(&lock2);
while(update_ready != 0){
pthread_cond_wait(&waiter, &lock2);
package->my_latest_lgprime = largest_prime;//largest_prime is global
update_ready = 0;
}
pthread_mutex_unlock(&lock2);
if(num > (package->my_latest_lgprime+100)){
if(isPrime(num)==1){
*(package->fac_prime) = num;
package->my_latest_lgprime = num;
pthread_mutex_lock(&lock);
update_check = 1;
pthread_mutex_unlock(&lock);
pthread_cond_signal(&updater);
}
}
count--;
}
}
done++;
return (void*)package;
}
Facilitator:
void* facilitator(void* args){
int i, temp_large;
f_pack* package = (f_pack*) args;
while(done != package->threads){
pthread_mutex_lock(&lock);
while(update_check == 0)
pthread_cond_wait(&updater, &lock);
temp_large = isLargest(package->threads_largest, package->threads);
if(temp_large > largest_prime){
pthread_mutex_lock(&lock2);
update_ready = 1;
largest_prime = temp_large;
pthread_mutex_unlock(&lock2);
pthread_cond_broadcast(&waiter);
printf("New large prime: %d\n", largest_prime);
}
update_check = 0;
pthread_mutex_unlock(&lock);
}
}
Here is the worker package
typedef struct worker_package{
int my_latest_lgprime;
int file_count;
int* fac_prime;
FILE* assigned_files[5];
} w_pack;
Is there an easier way to do this using semaphores?
I can't really spot a problem with certainty, but just by briefly reading the code it seems the done variable is shared across threads yet it is accessed and modified without synchronization.
In any case, I can suggest a couple of ideas to improve on your solution.
You assign the list of files to each thread at start up. This isn't the most efficient way, since processing each file may take more or less time. It seems to me a better approach would be to have a single list of files, and then each thread picks up the next file in the list.
Do you really need a facilitator task for this? It seems to me each thread can keep track of its own largest prime, and every time it finds a new maximum it can go check a global maximum and update it if necessary. You could also keep a single maximum (w/o a per-thread maximum) but that will require you to lock every time you need to compare.
Here is pseudo-code of how I would write the worker threads:
while (true) {
lock(file_list_mutex)
if file list is empty {
unlock(file_list_mutex) // release the lock before leaving the loop
break // we are done!
}
file = get_next_file_in_list
unlock(file_list_mutex)
max = 0
foreach number in file {
if number is prime and number > max {
lock(max_number_mutex)
if (number > max_global_number) {
max_global_number = number
}
max = max_global_number
unlock(max_number_mutex)
}
}
}
Before you start the worker threads you need to initialize max_global_number = 0.
The above solution has the benefit that it doesn't abuse locks like in your case, so thread contention is minimized.
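A C rendering of the pseudo-code might look like the sketch below. To stay self-contained it makes several assumptions: the "files" are small in-memory arrays (`chunks`), the names `worker`, `is_prime` and `find_largest_prime` are hypothetical, and the primality test is plain trial division.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHUNKS 4
#define CHUNK_LEN 5
#define MAX_THREADS 16

/* Hypothetical stand-in for the input files. */
static const int chunks[NUM_CHUNKS][CHUNK_LEN] = {
    { 4, 7, 10, 13, 15 },
    { 97, 100, 21, 3, 8 },
    { 89, 91, 51, 2, 11 },
    { 6, 9, 25, 49, 77 },
};

static int next_chunk = 0;        /* protected by file_list_mutex */
static int max_global_number = 0; /* protected by max_number_mutex */
static pthread_mutex_t file_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t max_number_mutex = PTHREAD_MUTEX_INITIALIZER;

static bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Claim the next unprocessed chunk, or stop if none is left. */
        pthread_mutex_lock(&file_list_mutex);
        int mine = (next_chunk < NUM_CHUNKS) ? next_chunk++ : -1;
        pthread_mutex_unlock(&file_list_mutex);
        if (mine < 0) break;

        int max = 0; /* local view of the largest prime seen so far */
        for (int i = 0; i < CHUNK_LEN; i++) {
            int n = chunks[mine][i];
            /* Take the lock only for candidates that beat the local maximum. */
            if (n > max && is_prime(n)) {
                pthread_mutex_lock(&max_number_mutex);
                if (n > max_global_number)
                    max_global_number = n;
                max = max_global_number;
                pthread_mutex_unlock(&max_number_mutex);
            }
        }
    }
    return NULL;
}

/* Runs num_threads workers over all chunks and returns the largest prime. */
int find_largest_prime(int num_threads) {
    pthread_t th[MAX_THREADS];
    assert(num_threads >= 1 && num_threads <= MAX_THREADS);
    next_chunk = 0;
    max_global_number = 0;
    for (int i = 0; i < num_threads; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < num_threads; i++)
        pthread_join(th[i], NULL);
    return max_global_number; /* safe to read: all workers have been joined */
}
```

Checking `n > max` before the primality test also skips the expensive test for numbers that could not improve the result, which goes slightly beyond the pseudo-code.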