Hyper-threading and PER-CPU arrays - c

I have been trying to find an answer to this question, but I have not been able to:
Should per-CPU arrays be sized to the number of physical cores, or to the total number of logical cores (physical cores plus hyper-threaded siblings) in a system?
Siblings cannot execute at the same time as each other, but I guess they can preempt each other.
Say you are counting some statistic, and you have 100 cores plus 100 siblings. Should the per-CPU array for this statistic have 200 elements, or can you save some space by making it 100 elements? Is there a way to do atomic-local operations so you can get away with 100 elements without locking cache lines? Perhaps something akin to restartable sequences for the sibling-local case?
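By way of illustration, here is a minimal userspace sketch of such a per-CPU statistics array (the names percpu_counter, percpu_inc and so on are invented for this example). It takes the conservative route the question asks about: one cache-line-padded slot per logical CPU, i.e. physical cores plus siblings:

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_getcpu() */
    #include <stdlib.h>     /* calloc() */
    #include <unistd.h>     /* sysconf() */

    /* One slot per logical CPU, padded to a cache line so that no two
       CPUs (or siblings) ever share a line for their counters. */
    struct percpu_counter {
        unsigned long count;
        char pad[64 - sizeof(unsigned long)];
    };

    static struct percpu_counter *counters;
    static long ncpu;

    static void percpu_init(void)
    {
        ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* cores + siblings */
        counters = calloc(ncpu, sizeof *counters);
    }

    /* Racy in theory: the thread can migrate between sched_getcpu() and
       the increment. Closing that window without a lock is exactly what
       restartable sequences (or per-CPU atomics) would be for. */
    static void percpu_inc(void)
    {
        counters[sched_getcpu()].count++;
    }

    static unsigned long percpu_sum(void)
    {
        unsigned long total = 0;
        for (long i = 0; i < ncpu; i++)
            total += counters[i].count;
        return total;
    }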

Related

Performance implications of a large number of mutexes

Suppose I have an array of 1,000,000 elements, and a number of worker threads each manipulating data in this array. The worker threads might be updating already populated elements with new data, but each operation is limited to a single array element, and is independent of the values of any other element.
Using a single mutex to protect the entire array would clearly result in high contention. On the other extreme, I could create an array of mutexes that is the same length as the original array, and for each element array[i] I would lock mutex[i] while operating on it. Assuming an even distribution of data, this would mostly eliminate lock contention, at the cost of a lot of memory.
I think a more reasonable solution would be to have an array of n mutexes (where 1 < n < 1000000). Then for each element array[i] I would lock mutex[i % n] while operating on it. If n is sufficiently large, I can still minimize contention.
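A minimal pthreads sketch of the striping idea just described (N_LOCKS and the double element type are placeholders; the real update goes where the assignment is):

    #include <pthread.h>
    #include <stddef.h>

    #define N_ELEMENTS 1000000
    #define N_LOCKS    1024        /* 1 < n < 1000000, tuned for contention */

    static double          array[N_ELEMENTS];
    static pthread_mutex_t locks[N_LOCKS];

    static void init_locks(void)
    {
        for (size_t i = 0; i < N_LOCKS; i++)
            pthread_mutex_init(&locks[i], NULL);
    }

    /* Each operation touches exactly one element, so only the stripe
       that the element maps to needs to be held. */
    static void update_element(size_t i, double value)
    {
        pthread_mutex_lock(&locks[i % N_LOCKS]);
        array[i] = value;          /* stand-in for the real update */
        pthread_mutex_unlock(&locks[i % N_LOCKS]);
    }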
So my question is, is there a performance penalty to using a large (e.g. >= 1000000) number of mutexes in this manner, beyond increased memory usage? If so, how many mutexes can you reasonably use before you start to see degradation?
I'm sure the answer to this is somewhat platform specific; I'm using pthreads on Linux. I'm also working on setting up my own benchmarks, but the scale of data that I'm working on makes that time consuming, so some initial guidance would be appreciated.
That was the initial question. For those asking for more detailed information regarding the problem, I have 4 multi-gigabyte binary data files describing somewhere in the neighborhood of half a billion events that are being analyzed. The array in question is actually the array of pointers backing a very large chained hash table. We read the four data files into the hash table, possibly aggregating them together if they share certain characteristics. The existing implementation has 4 threads, each reading one file and inserting records from that file into the hash table. The hash table has 997 locks and 997*9973 = ~10,000,000 pointers. When inserting an element with hash h, I first lock mutex[h % 997] before inserting or modifying the element in bucket[h % 9943081]. This works all right, and as far as I can tell, we haven't had too many issues with contention, but there is a performance bottleneck in that we're only using 4 cores of a 16-core machine. (And even fewer as we go along, since the files generally aren't all the same size.) Once all of the data has been read into memory, we analyze it, which uses new threads and a new locking strategy tuned to the different workload.
I'm attempting to improve the performance of the data load stage by switching to a thread pool. In the new model, I still have one thread for each file which simply reads the file in ~1MB chunks and passes each chunk to a worker thread in the pool to parse and insert. The performance gain so far has been minimal, and the profiling that I did seemed to indicate that the time spent locking and unlocking the array was the likely culprit. The locking is built into the hash table implementation we are using, but it does allow specifying the number of locks to use independently of the size of the table. I'm hoping to speed things up without changing the hash table implementation itself.
(A very partial & possibly indirect answer to your question.)
I once took a huge performance hit trying this (on CentOS) when raising the number of locks from a prime of ~1K to a prime of ~1M. While I never fully understood the reason, I eventually figured out (or just convinced myself) that it's the wrong question.
Suppose you have an array of length M, with n workers. Furthermore, you use a hash function to protect the M elements with m < M locks (e.g., by some random grouping). Then, using the Square Approximation to the Birthday Paradox, the chance of a collision between two workers - p - is given by:
p ≈ n^2 / (2m)
It follows that the number of mutexes you need, m, does not depend on M at all - it is a function of p and n only. For example, with n = 16 worker threads and a target collision probability of p = 1%, you would want roughly m = n^2 / (2p) = 12,800 mutexes, whether M is one million or one billion.
Under Linux there is no cost other than the memory associated with more mutexes.
However, remember that the memory used by your mutexes must be included in your working set - and if your working set size exceeds the relevant cache size, you'll see a significant performance drop. This means that you don't want an excessively sized mutex array.
As Ami Tavory points out, the contention depends on the number of mutexes and number of threads, not the number of data elements protected - so there's no reason to link the number of mutexes to the number of data elements (with the obvious proviso that it never makes sense to have more mutexes than elements).
In the general scenario, I would advise
Simply locking the whole array (simple, very often "good enough" if your application is mostly doing "other stuff" besides accessing the array)
... or ...
Implementing a read/write lock on the entire array (assuming reads equal or exceed writes)
Apparently your scenario doesn't match either case.
Q: Have you considered implementing some kind of a "write queue"?
Worst case, you'd only need one mutex. Best case, you might even be able to use a lock-less mechanism to manage your queue. Look here for some ideas that might be applicable: https://msdn.microsoft.com/en-us/library/windows/desktop/ee418650%28v=vs.85%29.aspx
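As a rough illustration of the write-queue idea (all names here are invented; a lock-free MPSC queue could replace the single mutex): producers enqueue updates and one consumer applies them, so the big array itself needs no locks at all.

    #include <pthread.h>
    #include <stdlib.h>

    struct write_op {
        size_t index;              /* which array element to update */
        double value;              /* what to write there           */
        struct write_op *next;
    };

    static struct write_op *queue_head;
    static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Producer side: push the update instead of touching the array.
       (This pushes onto a LIFO for brevity; keep a tail pointer if the
       order of writes to the same element matters.) */
    static void enqueue_write(size_t index, double value)
    {
        struct write_op *op = malloc(sizeof *op);
        op->index = index;
        op->value = value;
        pthread_mutex_lock(&queue_lock);
        op->next = queue_head;
        queue_head = op;
        pthread_mutex_unlock(&queue_lock);
    }

    /* Single consumer: grab the whole list, then apply the writes with
       no lock held on the array. */
    static void drain_writes(double *array)
    {
        pthread_mutex_lock(&queue_lock);
        struct write_op *op = queue_head;
        queue_head = NULL;
        pthread_mutex_unlock(&queue_lock);

        while (op) {
            struct write_op *next = op->next;
            array[op->index] = op->value;
            free(op);
            op = next;
        }
    }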

CUDA concurrent execution

I hope answering my question would not require a lot of time, because it is about my understanding of this topic.
So, the question is about block and grid sizes for concurrent kernels execution.
First, let me tell you about my card: it is a GeForce GTX TITAN, and here are some of its characteristics which I think are important for this question.
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 6144 MBytes (6442123264 bytes)
(14) Multiprocessors, (192) CUDA Cores/MP: 2688 CUDA Cores
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Now, the main problem: I have a kernel (it performs sparse matrix multiplication, but that is not so important) and I want to launch it simultaneously(!) in several streams on one GPU, computing different matrix multiplications.
Please notice again the simultaneity requirement - I want all the kernels to start at one moment and finish at another (all of them!), so a solution in which these kernels only partly overlap doesn't satisfy me.
It is also very important that I want to maximize the number of parallel kernels, even if we lose some performance because of it.
OK, let's assume we already have the kernel and we want to specify its grid and block sizes in the best way.
Looking at the card's characteristics, we see it has 14 SMs and compute capability 3.5, which allows 32 concurrent kernels to run.
So the conclusion I draw here is that launching 28 concurrent kernels (two per each of the 14 SMs) would be the best decision. The first question - am I right here?
Now, again, we want to optimize each kernel's block and grid sizes. OK, let's look at this characteristic:
Maximum number of threads per multiprocessor: 2048
I understand it this way: if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. If we launch a kernel with 1024 threads and 4 blocks, then two pairs of blocks will be computed one after another.
So the next conclusion I draw is that launching 28 kernels, each with 1024 threads, would also be the best solution - because this is the only way they can be executed simultaneously on each SM. The second question - am I right here? Or is there a better way to get simultaneous execution?
It would be very nice if you could just say whether I am right or not, and I would be very grateful if you explained where I am mistaken or proposed a better solution.
Thank you for reading this!
There are a number of questions on concurrent kernels already. You might search and review some of them. You must consider register usage, blocks, threads, and shared memory usage, amongst other things. Your question is not precisely answerable when you don't provide information about register usage or shared memory usage. Maximizing concurrent kernels is partly an occupancy question, so you should study that as well.
Nevertheless, you want to observe maximum concurrent kernels. As you've already pointed out, that is 32.
You have 14 SMs, each of which can have a maximum of 2048 threads. 14x2048/32 = 896 threads per kernel (ie. blocks * threads per block)
With a threadblock size of 128, that would be 7 blocks per kernel. 7 blocks * 32 kernels = 224 blocks total. When we divide this by 14 SMs we get 16 blocks per SM, which just happens to exactly match the spec limit.
So the above analysis, 32 kernels, 7 blocks per kernel, 128 threads per block, would be the extent of the analysis that could be done taking into account only the data you have provided.
If that does not work for you, I would first make sure I have addressed the requirements for concurrent execution, and then focus on registers per thread or shared memory to see if those are limiting "occupancy" in this case.
Honestly, I don't hold out much hope of your witnessing the perfect scenario you describe, but have at it. I'd enjoy being surprised. FYI, if I were trying to do something like this, I would certainly try it on Linux rather than Windows, especially considering your card is a GeForce card subject to WDDM limitations under Windows.
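For what it's worth, here is a bare-bones sketch of the launch configuration worked out above - 32 kernels of 7 blocks x 128 threads, one kernel per stream; dummy_kernel is only a placeholder for the real sparse-matrix kernel:

    #include <cuda_runtime.h>

    #define NKERNELS 32
    #define NBLOCKS   7
    #define NTHREADS 128

    __global__ void dummy_kernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;       /* placeholder for the real work */
    }

    int main(void)
    {
        cudaStream_t streams[NKERNELS];
        float *buffers[NKERNELS];
        const int n = NBLOCKS * NTHREADS;

        for (int k = 0; k < NKERNELS; k++) {
            cudaStreamCreate(&streams[k]);
            cudaMalloc((void **)&buffers[k], n * sizeof(float));
        }

        /* Launching into distinct non-default streams is what makes
           concurrency *possible*; whether the kernels actually overlap
           still depends on their resource usage and the scheduler. */
        for (int k = 0; k < NKERNELS; k++)
            dummy_kernel<<<NBLOCKS, NTHREADS, 0, streams[k]>>>(buffers[k], n);

        cudaDeviceSynchronize();

        for (int k = 0; k < NKERNELS; k++) {
            cudaFree(buffers[k]);
            cudaStreamDestroy(streams[k]);
        }
        return 0;
    }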
Your understanding seems flawed. Statements like this:
if we launch a kernel with 1024 threads and 2 blocks, these two blocks will be computed simultaneously. If we launch a kernel with 1024 threads and 4 blocks, then two pairs of blocks will be computed one after another
don't make sense to me. Blocks will be computed in whatever order the scheduler deems appropriate, but there is no rule that says two blocks will be computed simultaneously, but four blocks will be computed two by two.

Questions about parallelism on GPU (CUDA)

I need to give some details about what I am doing before asking my question. I hope my English and my explanations are clear and concise enough.
I am currently working on a massive parallelization of code initially written in C. The reason I was interested in CUDA is the large size of the arrays I was dealing with: the code is a simulation of fluid mechanics, and I needed to run a "time loop" with five to six successive operations on arrays as big as 3×10^9 or 19×10^9 double variables. I went through various tutorials and documentation and I finally managed to write a not-so-bad CUDA code.
Without going into the details of the code, I used relatively small 2D blocks. The number of threads per block is 18 or 57 (which is awkward, since my warps are not fully occupied).
The kernels are launched on a "big" 3D grid, which describes my physical geometry (the maximal desired size is 1000 values per dimension, meaning I want to deal with a 3D grid of about a billion blocks).
Okay, so now my five to six kernels, which do the job correctly, make good use of shared memory, since global memory is read once and written once for each kernel (the size of my blocks was actually determined according to the amount of shared memory needed).
Some of my kernels are launched concurrently (called asynchronously), but most of them need to be successive. There are several memcpys from device to host, but the ratio of memcpys to kernel calls is significantly low. I am mostly executing operations on my array values.
Here is my question :
If I understood correctly, all of my blocks are doing the job on the arrays at the same time. Does that mean that dealing with a 10-block grid, a 100-block grid or a billion-block grid would take the same amount of time? The answer is obviously no, since the computation time is significantly longer when I am dealing with large grids. Why is that?
I am using a relatively modest NVIDIA device (NVS 5200M). I was trying to get used to CUDA before getting bigger/more efficient devices.
Since I went through all the optimization and CUDA programming advice/guides by myself, I may have completely misunderstood some points. I hope my question is not too naive...
Thanks!
If I understood correctly, all of my blocks are doing the job on the arrays at the same time.
No, they don't all run at the same time! How many thread blocks can run concurrently depends on several things, all tied to the compute capability of your device - the NVS 5200M should be cc2.1.
A CUDA-enabled GPU has an internal scheduler that manages where and when each thread block, and the warps of those blocks, will run. "Where" means on which streaming multiprocessor (SM) the block will be launched.
Every SM has a limited amount of resources - shared memory and registers, for example. The Programming Guide or the Occupancy Calculator give a good overview of these limitations.
The first limitation is that for cc2.1 an SM can run up to 8 thread blocks at the same time. Depending on your usage of registers, shared memory and so on, the number will possibly decrease.
If I remember right, a cc2.1 SM consists of 48 CUDA cores, so your NVS 5200M (96 cores) should have two SMs. Let's assume that with your kernel setup, N thread blocks (at most 8 per SM) fit on the device at the same time. The internal scheduler will launch the first N blocks and queue up all the other thread blocks. When one thread block has finished its work, the next one from the queue is launched. So if you launch between 1 and N blocks in total, the time used for the kernel will be roughly equal. If you run the kernel with N+1 blocks, the time used will increase.
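If you want to see that step for yourself, here is a small sketch (busy_kernel is just an artificial workload) that times the same kernel with a growing number of blocks using CUDA events; the time should stay roughly flat while the blocks still fit concurrently, and jump once they no longer do:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void busy_kernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = data[i];
        for (int k = 0; k < 10000; k++)      /* artificial workload */
            x = x * 1.0000001f + 0.5f;
        data[i] = x;
    }

    int main(void)
    {
        const int threads = 256, max_blocks = 16;
        float *d;
        cudaMalloc((void **)&d, max_blocks * threads * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        /* Time the kernel with 1, 2, ... max_blocks blocks. */
        for (int blocks = 1; blocks <= max_blocks; blocks++) {
            cudaEventRecord(start);
            busy_kernel<<<blocks, threads>>>(d);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%2d block(s): %.3f ms\n", blocks, ms);
        }

        cudaFree(d);
        return 0;
    }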

Do we need to take cache thrashing into account with CUDA?

I'm not familiar with the workings of GPU memory caching, so would like to know if the assumptions of temporal and spatial proximity of memory access associated with CPUs also applies with GPUs. That is, programming in CUDA C, do I need to take into account C's row-major array storage format to prevent cache thrashing?
Many thanks.
Yes, very much.
Say you are fetching one 4-byte integer for each thread.
Scenario one
Each thread fetches one integer, indexed by its thread id. That means thread zero fetches a[0], thread 1 fetches a[1], etc. The GPU fetches memory in cache lines of 128 bytes. By coincidence, a warp is 32 threads, ergo 32*4 = 128 bytes. This means that for one warp, only one fetch request goes to memory.
Scenario two
If the threads fetch in totally random order, with distances between the indexes greater than 128 bytes, the warp has to make 32 memory requests of 128 bytes each. This means that for each warp you fill the cache with 32 times more memory, and if your problem is big, your cache will be invalidated up to 32 times more often than in scenario one.
It also means that memory which would normally still reside in the cache in scenario one is, in scenario two, highly likely to have to be fetched again from global memory.
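The two scenarios as toy CUDA kernels (indices[] stands in for the "totally random order" access pattern):

    __global__ void coalesced_read(const int *a, int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = a[tid];          /* scenario one: the whole warp hits
                                           one contiguous 128-byte line */
    }

    __global__ void scattered_read(const int *a, const int *indices,
                                   int *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
            out[tid] = a[indices[tid]]; /* scenario two: if the indices are far
                                           apart, each thread can hit its own
                                           line - up to 32 transactions per warp */
    }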
No and yes. No, because the GPU does not provide the same kind of "cache" that a CPU does.
But you have many other constraints that make the underlying C array layout, and how it is accessed by concurrent threads, very important for performance.
You may have a look at this page for basics about CUDA memory types or here for more in depth details about cache on Fermi GPU.

Thread-safety of read-only memory access

I've implemented the Barnes-Hut gravity algorithm in C as follows:
Build a tree of clustered stars.
For each star, traverse the tree and apply the gravitational forces from each applicable node.
Update the star velocities and positions.
Stage 2 is the most expensive stage, and so is implemented in parallel by dividing the set of stars. E.g. with 1000 stars and 2 threads, I have one thread processing the first 500 stars and the second thread processing the second 500.
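A minimal pthreads sketch of that partitioning (the star/tree types and apply_forces are placeholders for the poster's own code); every worker only reads the shared tree:

    #include <pthread.h>
    #include <stddef.h>

    struct star;       /* position, velocity, accumulated force, ... */
    struct tree_node;  /* Barnes-Hut tree built in stage 1           */

    /* Placeholder for the real stage-2 work on one star. */
    void apply_forces(struct star *s, const struct tree_node *root);

    struct worker_args {
        struct star *stars;
        size_t begin, end;               /* half-open range of stars */
        const struct tree_node *root;    /* shared, read-only tree   */
    };

    static void *worker(void *p)
    {
        struct worker_args *a = p;
        for (size_t i = a->begin; i < a->end; i++)
            apply_forces(&a->stars[i], a->root);   /* only reads the tree */
        return NULL;
    }

    /* e.g. 1000 stars and 2 threads: ranges [0,500) and [500,1000) */
    void run_stage2(struct star *stars, size_t nstars,
                    const struct tree_node *root, int nthreads)
    {
        pthread_t tid[nthreads];
        struct worker_args args[nthreads];

        for (int t = 0; t < nthreads; t++) {
            args[t].stars = stars;
            args[t].begin = nstars * (size_t)t / nthreads;
            args[t].end   = nstars * (size_t)(t + 1) / nthreads;
            args[t].root  = root;
            pthread_create(&tid[t], NULL, worker, &args[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }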
In practice this works: it speeds the computation by about 30% with two threads on a two-core machine, compared to the non-threaded version. Additionally, it yields the same numerical results as the original non-threaded version.
My concern is that the two threads are accessing the same resource (namely, the tree) simultaneously. I have not added any synchronisation to the thread workers, so it's likely they will attempt to read from the same location at some point. Although access to the tree is strictly read-only I am not 100% sure it's safe. It has worked when I've tested it but I know this is no guarantee of correctness!
Questions
Do I need to make a private copy of the tree for each thread?
Even if it is safe, are there performance problems of accessing the same memory from multiple threads?
Update: Benchmark results for the curious:
Machine: Intel Atom CPU N270 @ 1.60GHz, cpu MHz 800, cache size 512 KB
Threads real user sys
0 69.056 67.324 1.720
1 76.821 66.268 5.296
2 50.272 63.608 10.585
3 55.510 55.907 13.169
4 49.789 43.291 29.838
5 54.245 41.423 31.094
0 means no threading at all; 1 and above means spawn that many worker threads and for the main thread to wait for them. I would not expect much of an improvement for anything beyond 2 threads, since it's entirely CPU bound and that's how many cores there are. It's interesting that an odd number of threads is slightly worse than an even number.
Looking at sys it's apparent that there's a cost with making threads. Currently it's making the threads for each frame (so N*1000 thread creations). This was easy to program (during my 15 minutes on the train this morning). I'll need to think a bit about how to reuse threads...
Update #2: I've made it use a pool of threads, synchronised with two barriers. This has no noticeable performance advantage over recreating the threads each frame.
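For reference, a rough sketch of what such a two-barrier pool can look like (shutdown handling omitted; the per-frame star work is elided):

    #include <pthread.h>

    static pthread_barrier_t frame_start, frame_done;

    static void *pool_worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_barrier_wait(&frame_start);  /* wait for the next frame */
            /* ... process this worker's share of the stars ... */
            pthread_barrier_wait(&frame_done);   /* report the frame done   */
        }
        return NULL;
    }

    static void pool_init(int nworkers)
    {
        /* +1 so the main thread participates in both barriers. */
        pthread_barrier_init(&frame_start, NULL, nworkers + 1);
        pthread_barrier_init(&frame_done,  NULL, nworkers + 1);
        for (int t = 0; t < nworkers; t++) {
            pthread_t tid;
            pthread_create(&tid, NULL, pool_worker, NULL);
        }
    }

    /* Main thread, once per frame: */
    static void run_frame(void)
    {
        pthread_barrier_wait(&frame_start);  /* release the workers        */
        pthread_barrier_wait(&frame_done);   /* wait until they all finish */
    }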
You don't specify how your data is structured, but in general reading memory from multiple threads simultaneously is safe and does not introduce any performance issues. You only get problems if someone is writing.
It is interesting that you say you're only getting 30% speedup out of two threads. If you have an otherwise idle machine, two or more CPUs and only readonly shared data (i.e. no synchronization) I would expect to see much closer to 50% speed improvement. This suggests that your operation is actually completing so quickly that the overhead of creating the thread is becoming significant in your numbers. Are you running on a hyperthreaded CPU?
If your data is read-only, then no, you do not need to make a private copy of the tree for each thread. This is the biggest advantage that a shared memory threading model offers!
I'm not aware of any performance problems with such a model. If anything, it should be faster, depending on whether your CPUs can share some of their cache.
