Multithreading - Raytracer - C

I've finished my raytracer, but now I'm trying to learn threads to optimize the render time. To represent the pixels of my window I'm using an int8_t * (4 int8_t per pixel, for R/G/B/A). Here is a sample of what I'm trying to do:
Number of threads set: 4
[0]-------[1][2]-------[3][4]-------[5][6]-------[7]
[Thread 1][Thread 2][Thread 3][Thread 4]
For an array of 8, each thread takes 2 cells, but I want them to work simultaneously on the array. Is this possible if each thread works on a specific part?
In the diagram above (ptr is the int8_t *), you can see that each thread has an area of effect on the array (a start position and an end position for its part of the array).
Thanks for your replies.

Yes, it's possible to have a multithreaded ray tracer with all the threads writing to the same output buffer without extra synchronization, if:
The output array isn't moving around.
The parts of the array being written by the threads don't overlap.
On some platforms, you might also have to make sure that threads never attempt an unaligned write, which suggests that it's not a great idea to have one thread writing the red channel while another writes the green channel of the same pixel.
For best performance, you probably don't want two threads trying to write to the same cache line at the same time. Rather than having the threads play leapfrog through the array, consider carving up the image into larger contiguous chunks.
I usually set each thread going on its own row in the image. When one finishes a row, I have it work on the next unassigned row. Doling out the portions this way does require some synchronization, but that's very minor and generally won't be under heavy contention.
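A minimal sketch of that row-doling scheme, assuming C11 atomics; render_row() is a hypothetical stand-in for the actual per-row tracing:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Shared between all render threads. */
typedef struct {
    uint8_t   *pixels;     /* 4 bytes per pixel: R/G/B/A       */
    int        width, height;
    atomic_int next_row;   /* next unassigned row, starts at 0 */
} render_job_t;

extern void render_row(uint8_t *row, int width, int y);  /* hypothetical */

static void *render_worker(void *arg)
{
    render_job_t *job = arg;
    for (;;) {
        /* Atomically claim the next unassigned row; this is the only
           synchronization the threads need. */
        int y = atomic_fetch_add(&job->next_row, 1);
        if (y >= job->height)
            break;
        /* Rows are disjoint slices of the buffer, so the writes
           themselves need no locking. */
        render_row(job->pixels + (size_t)y * job->width * 4, job->width, y);
    }
    return NULL;
}

Each worker is then started with pthread_create() on the same render_job_t; since a whole row is width*4 contiguous bytes, two threads rarely touch the same cache line.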

Related

Lock-free read-only shared memory: don't care about memory order, only ensure visibility?

So I'm emulating a small microprocessor in C that has an internal flash storage represented as an array of chars. The emulation is entirely single-threaded and operates on this flash storage as well as some register variables.
What I want to do is have a second "read" thread that periodically (about every 10 ms, roughly the monitor refresh rate) polls the data from that array and displays it in some sort of window. (The flash storage is only 32 KiB in size, so it can be displayed as a 512x512 black-and-white image.)
The thing is that the main emulation thread should have absolutely minimal performance overhead to do this (or, optimally, not even care about the second thread at all). An rw mutex is out of the question, since that would absolutely tank the performance of my emulator. The best-case scenario, as I said, is to let the emulator be completely oblivious to the existence of the read thread.
I don’t care about memory order either since it doesn’t matter if some changes are visible earlier than others to the read thread as long as they are visible at some point at all.
Is this possible at all in C/C++, or at least through some sort of memory_barrier() function that I could call in my emulation thread about every 1000th clock cycle, which would then assure visibility to my read thread?
Is it enough to just use volatile on the flash memory? And would this affect performance in any significant way?
I don’t want to stall the complete emulation thread just to copy over the flash array to some different place.
Pseudocode:
int main() {
    char flash[32 * 1024];
    flash_binary(flash, "program.bin");

    // periodically displays the content of flash
    pthread_t *read_thread = create_read_thread(flash);

    // does something with flash, highly performance critical
    emulate_cpu(flash);

    kill_read_thread(read_thread);
}
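For what it's worth, here is a minimal sketch of the fence-based idea the question proposes, assuming C11 <stdatomic.h>. Note that the plain char accesses to flash[] still formally constitute a data race under the C11 memory model, so this trades strict correctness for the near-zero overhead the question asks for:

#include <stdatomic.h>
#include <string.h>

#define FLASH_SIZE (32 * 1024)

/* Called from the emulation loop every ~1000 cycles, as the question
   suggests.  A release fence is cheap (a compiler barrier on x86) and
   orders all prior flash[] writes before it. */
static inline void publish_flash(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Called from the read thread before each snapshot; pairs with the
   release fence above so earlier writes become visible. */
static void snapshot_flash(const char *flash, char *out)
{
    atomic_thread_fence(memory_order_acquire);
    memcpy(out, flash, FLASH_SIZE);
}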

OpenCL: large global size, or for loops per work item?

I'm learning OpenCL in order to implement a relatively complex image processing algorithm which includes several subroutines that should be implemented as kernels.
The implementation is intended to run on a Mali T-6xx GPU.
I read the "OpenCL Programming by Example" book and the "Optimizing OpenCL kernels on the Mali-T600 GPUs" document.
In the book's examples, they use some global size of work items, and each work item processes several pixels in for loops.
In the document, the kernels are written without loops: there is a single execution per work item in the kernel.
Since the maximum global work size that can be spawned on the Mali T-600 GPUs is 256 work items (and that's for simple kernels), and there are clearly more pixels than that to process in most images, my understanding is that the loop-free kernel will keep spawning work-item threads as soon as possible until the entire global size has finished executing the kernel, so the global size might simply be the number of pixels in the image. Is that right? Such that it is a kind of a thread-spawning loop in itself?
In the book, on the other hand, the global work size is smaller than the number of pixels to process, but the kernel has loops that make each work item process several pixels while executing the kernel code.
So I want to know which is the proper way to write image processing kernels (or any OpenCL kernels for that matter), and in what situations one way might be better than the other, assuming I understood both ways correctly...
Is that right? Such that it is a kind of a thread-spawning loop in itself?
Yes.
So I want to know which is the proper way to write image processing kernels (or any OpenCL kernels for that matter), and in what situations one way might be better than the other...
I suspect there isn't a "right" answer in general - there are multiple hardware vendors and multiple drivers - so the "best" approach will likely vary from vendor to vendor.
For Mali in particular, the thread spawning is all handled by hardware, so it will in general be faster than explicit loops in the shader code, which take instructions to process.
There is normally some advantage to at least some vectorization - e.g. processing vec4 or vec8 vectors of pixels per work item rather than just one - as the Mali-T600/700/800 GPU cores use a vector arithmetic architecture.
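As an illustration of that last point, a hypothetical kernel that brightens a greyscale image four pixels per work item using OpenCL's vector types (the global size is then width*height/4, with no loop in the kernel):

/* Each work-item saturating-adds `gain` to four 8-bit pixels at once,
   which maps well onto Mali's vector arithmetic units. */
__kernel void brighten4(__global uchar4 *pixels, uchar gain)
{
    size_t i = get_global_id(0);
    pixels[i] = add_sat(pixels[i], (uchar4)(gain));  /* saturates, no wrap */
}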

Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on massively parallelizing some code initially written in C. The reason I got interested in CUDA is the large size of the arrays I deal with: the code is a fluid-mechanics simulation, and I need to run a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double variables. I went through various tutorials and documentation, and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I use relatively small 2D blocks. The number of threads per block is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid describing my physical geometry (the maximum desired size is 1000 values per dimension; that means I want to deal with a 3D grid of a billion blocks).
Okay, so now my five to six kernels, which do the job correctly, make good use of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually chosen according to the amount of shared memory needed).
Some of my kernels are launched concurrently with asynchronous calls, but most of them need to be successive. There are several memcpy transfers from device to host, but the ratio of memcpys to kernel calls is quite low. I am mostly executing operations on my array values.
Here is my question :
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. Does that mean that dealing with a 10-block grid, a 100-block grid or a billion-block grid takes the same amount of time? The answer is obviously no, since the computation time is significantly higher when dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't run at the same time! How many thread blocks can run concurrently depends on several things, all of which depend on the compute capability of your device - the NVS 5200M should be cc 2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block (and its warps) will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers, for example. A good overview of these limitations is given in the Programming Guide or the Occupancy Calculator.
The first limitation is that for cc 2.1, an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory and so on, the number will possibly decrease.
If I remember correctly, an SM of cc 2.1 consists of 96 CUDA cores, and therefore your NVS 5200M should have one SM. Let's assume that with your kernel setup, N (N ≤ 8) thread blocks fit into the SM at the same time. The internal scheduler will launch the first N blocks and queue up all the other thread blocks. If one thread block has finished its work, the next one from the queue will be launched. So if you launch between 1 and N blocks in total, the runtime of the kernel will be roughly equal. If you run the kernel with N+1 blocks, the runtime will increase.
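A quick way to check these limits for your own kernel is the occupancy API, a sketch assuming CUDA 6.5 or later and a stand-in kernel:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(double *data)   /* stand-in kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d, compute capability %d.%d\n",
           prop.multiProcessorCount, prop.major, prop.minor);

    /* How many blocks of 64 threads fit on one SM at once, given this
       kernel's register and shared-memory usage? */
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, 64 /* block size */, 0 /* dyn. smem */);
    printf("resident blocks per SM: %d\n", blocks_per_sm);
    return 0;
}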

Can't get any speedup from parallelizing Quicksort using Pthreads

I'm using Pthreads to create a new thread for each partition after the list is split into the right and left halves (less than and greater than the pivot). I do this recursively until I reach the maximum number of allowed threads.
When I use printfs to follow what goes on in the program, I clearly see that each thread is doing its delegated work in parallel. However, using a single process is always the fastest. As soon as I try to use more threads, the time it takes to finish almost doubles, and it keeps increasing with the number of threads.
I am allowed to use up to 16 processors on the server I am running it on.
The algorithm goes like this:
Split array into right and left by comparing the elements to the pivot.
Start a new thread for the right and left, and wait until the threads join back.
If there are more available threads, they can create more recursively.
Each thread waits for its children to join.
Everything makes sense to me, and sorting works perfectly well, but more threads makes it slow down immensely.
I tried setting a minimum number of elements per partition for a thread to be started (e.g. 50000).
I tried an approach where, when a thread is done, it allows another thread to be started; that led to hundreds of threads starting and finishing throughout, and I think the overhead was way too much. So I got rid of that: if a thread was done executing, no new thread was created. That gave a little more speedup, but it was still a lot slower than a single process.
The code I used is below.
http://pastebin.com/UaGsjcq2
Does anybody have any clue as to what I could be doing wrong?
Starting a thread has a fair amount of overhead. You'd probably be better off creating a threadpool with some fixed number of threads, along with a thread-safe queue to queue up jobs for the threads to do. The threads wait for an item in the queue, process that item, then wait for another item. If you want to do things really correctly, this should be a priority queue, with the ordering based on the size of the partition (so you always sort the smallest partitions first, to help keep the queue size from getting excessive).
This at least reduces the overhead of starting the threads quite a bit -- but that still doesn't guarantee you'll get better performance than a single-threaded version. In particular, a quick-sort involves little enough work on the CPU itself that it's probably almost completely bound by the bandwidth to memory. Processing more than one partition at a time may hurt cache locality to the point that you lose speed in any case.
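A rough sketch of that pool-plus-queue idea in pthreads; a plain LIFO stack of ranges stands in for the suggested priority queue, and NTHREADS, MAXQ, and CUTOFF are arbitrary sketch values:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define MAXQ     4096   /* assumed large enough for this sketch */
#define CUTOFF   1000   /* ranges this small are sorted inline  */

typedef struct { int lo, hi; } range_t;   /* half-open [lo, hi) */

static int      *data;
static range_t   queue[MAXQ];
static int       qlen;                    /* pending ranges         */
static int       busy;                    /* workers mid-partition  */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void push(int lo, int hi)
{
    pthread_mutex_lock(&m);
    queue[qlen++] = (range_t){ lo, hi };  /* no overflow check in sketch */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
}

static int cmp_int(const void *a, const void *b)
{
    return (*(const int *)a > *(const int *)b) -
           (*(const int *)a < *(const int *)b);
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    for (;;) {
        while (qlen == 0 && busy > 0)     /* idle, but work may appear */
            pthread_cond_wait(&cv, &m);
        if (qlen == 0)                    /* no work and none coming   */
            break;
        range_t r = queue[--qlen];
        busy++;
        pthread_mutex_unlock(&m);

        while (r.hi - r.lo > CUTOFF) {
            /* Hoare-style partition around the middle element. */
            int pivot = data[r.lo + (r.hi - r.lo) / 2];
            int i = r.lo, j = r.hi - 1;
            while (i <= j) {
                while (data[i] < pivot) i++;
                while (data[j] > pivot) j--;
                if (i <= j) {
                    int t = data[i]; data[i] = data[j]; data[j] = t;
                    i++; j--;
                }
            }
            push(r.lo, j + 1);            /* hand one half to the pool */
            r.lo = i;                     /* keep working on the other */
        }
        qsort(data + r.lo, r.hi - r.lo, sizeof *data, cmp_int);

        pthread_mutex_lock(&m);
        if (--busy == 0 && qlen == 0)
            pthread_cond_broadcast(&cv);  /* wake idlers so they exit  */
    }
    pthread_cond_broadcast(&cv);          /* propagate shutdown        */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    enum { N = 1 << 20 };
    data = malloc(N * sizeof *data);
    for (int i = 0; i < N; i++)
        data[i] = rand();

    push(0, N);                           /* seed the queue */
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (int i = 1; i < N; i++)
        if (data[i - 1] > data[i]) { puts("not sorted!"); break; }
    free(data);
    return 0;
}

This keeps exactly NTHREADS threads alive for the whole sort; whether it actually beats a single-threaded qsort() still depends on the memory-bandwidth point above.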
My first guess would be that creating, destroying, and especially synchronizing your threads is going to eat up any possible gain you might receive, depending on just how many elements you are sorting. I'd actually guess that it would take quite a long while to make up the overhead, and that it probably won't ever be made up.
Because of the way you have your sort, you have one thread waiting for another, which waits for another... you aren't really getting all that much parallelism to begin with. You'd be better off using a more linear sort, perhaps something like a radix sort, that splits the data up further between the threads. That still has one thread waiting for others a lot, but at least the threads get to do more work in the meantime. Even so, I don't think threads are going to help too much here.
I just had a quick look at your code, and I have one remark: why are you using a lock? If I understand correctly, what you are doing is something like:
quickSort(array)
{
    left, right = partition(array);
    newThread(quickSort(left));
    newThread(quickSort(right));
}
You shouldn't need a lock: normally, each call to quicksort does not access the other part of the array, so no sharing is involved.
Unless each thread is running on a separate processor or core, they will not truly run concurrently, and the context-switch time will be significant. The number of threads should be restricted to the number of available execution units, and even then you have to trust that the OS will distribute them to separate processors/cores, which it may not do if they are also being used for other processes.
Also, you should use a static thread pool rather than creating and destroying threads dynamically. Creating/destroying a thread involves allocating/releasing a stack from the heap, which is non-deterministic and potentially time-consuming.
Finally, are the 16 processors on the server real or VMs? And are they exclusively allocated to your process?
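As a footnote to the thread-count point: most POSIX systems (though _SC_NPROCESSORS_ONLN is not strictly standard) let you query the number of online processors to size the pool:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Processors currently online - a sensible upper bound for the
       size of a worker pool on an otherwise idle machine. */
    long nproc = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online processors: %ld\n", nproc);
    return 0;
}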

Using threads, how should I deal with something which ideally should happen in sequential order?

I have an image generator which would benefit from running in threads. I am intending to use POSIX threads, and have written some mock up code based on https://computing.llnl.gov/tutorials/pthreads/#ConVarSignal to test things out.
In the intended program, when the GUI is in use, I want the generated lines to appear from the top to the bottom one by one (the image generation can be very slow).
It should also be noted that the data generated in the threads is not the actual image data. The thread data is read and transformed into RGB data and placed into the actual image buffer. And within the GUI, the way the thread-generated data is translated to RGB data can be changed during image generation, without stopping it.
However, there is no guarantee from the thread scheduler that the threads will run in the order I want, which unfortunately makes the transformation of the thread-generated data trickier, implying the undesirable solution of keeping an array of bool values to indicate which lines are done.
How should I deal with this?
Currently I have a watcher thread to report when the image is complete (which really should drive a progress bar, but I've not got that far yet; instead it uses pthread_cond_wait), and several render threads doing while(next_line());
next_line() takes a mutex lock and gets the value of img_next_line before incrementing it and unlocking the mutex. It then renders the line, takes a second mutex lock (different from the first) to update lines_done and check it against height, signals if complete, unlocks, and returns 0 if complete or 1 if not.
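A sketch of that next_line() scheme, assuming pthreads; render_line() and the image globals are hypothetical stand-ins:

#include <pthread.h>

static int img_next_line = 0;     /* next line to claim        */
static int lines_done    = 0;     /* completed line count      */
static int height        = 512;   /* hypothetical image height */
static pthread_mutex_t line_mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t done_mx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv = PTHREAD_COND_INITIALIZER;

extern void render_line(int y);   /* stand-in for the real work */

/* Returns 1 while there is more work, 0 once the image is done. */
static int next_line(void)
{
    pthread_mutex_lock(&line_mx);
    int y = img_next_line++;      /* claim a line */
    pthread_mutex_unlock(&line_mx);
    if (y >= height)
        return 0;

    render_line(y);

    pthread_mutex_lock(&done_mx);
    int done = (++lines_done == height);
    if (done)
        pthread_cond_signal(&done_cv);  /* wake the watcher thread */
    pthread_mutex_unlock(&done_mx);
    return !done;
}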
Given that the threads may well be executing in parallel on different cores, it's pretty much inevitable that the results will arrive out of order. I think your approach of tracking what's complete with a set of flags is quite reasonable.
It's possible that the overall effect might be nicer if you used threads at a different granularity: say, give each thread 20 lines to work on rather than one. Then on completion you'd have bigger blocks available to draw, and maybe drawing stripes would look OK?
Just accept that the rows will be done in a non-deterministic order; it sounds like that is happening because they take different lengths of time to render, in which case forcing a completion order will waste CPU time.
This may sound silly, but as a user I don't want to see the image rendered slowly line by line from top to bottom. It makes a slow process seem even slower, because the user has already completely predicted what will happen next. Better to just render each piece when it's ready, even if the pieces are scattered all over the place (either as single lines or, better yet, as blocks, as some have suggested). It makes the process look more random, and therefore more captivating and less boring, to a user like me.
