I've written a program that executes some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array, appends its own partial result to it, and does some extra work to eliminate duplicates.
I tested it and found that the cost of merging increases with the number of threads, and at an unexpected rate:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why does this happen as the number of threads increases, and how can I fix it?
--------------------------------------------------------------------------------
Actually, the code is very simple; here is the pseudo-code:
typedef struct my_object {
    long no;        /* index into the global result array */
    int count;
    double value;
    /* ...other fields... */
} my_object_t;

static my_object_t **global_result_array;   /* about ten thousand entries */
static pthread_mutex_t global_lock;

void *thread_function(void *arg)
{
    my_object_t **local_result;
    int local_result_number;
    int i;
    my_object_t *ptr;

    for (;;) {
        if (exit_condition) { return NULL; }
        if (merge_condition) {
            /* start time point to log */
            pthread_mutex_lock(&global_lock);
            for (i = local_result_number - 1; i >= 0; i--) {
                ptr = local_result[i];
                if (NULL == global_result_array[ptr->no]) {
                    global_result_array[ptr->no] = ptr;                  /* step 4 */
                } else {
                    global_result_array[ptr->no]->count += ptr->count;   /* step 5 */
                    global_result_array[ptr->no]->value += ptr->value;   /* step 6 */
                }
            }
            pthread_mutex_unlock(&global_lock);  /* end time point to log */
        } else {
            /* do some calculation and produce the partial, thread-local result,
               namely local_result and local_result_number */
        }
    }
}
As shown above, the only difference between the two-thread and six-thread runs is in steps 5 and 6; I counted on the order of hundreds of millions of executions of steps 5 and 6. Everything else is the same.
So, from my point of view, the merge operation is very light; whether I use 2 threads or 6 threads, each thread has to take the lock and do the merge exclusively.
Another astonishing thing: with six threads, the cost of step 4 exploded! That turned out to be the root cause of the explosion in the total cost.
By the way, the test server has two CPUs, each with four cores.
There are various reasons for the behaviour shown:
More threads mean more lock contention and more time spent blocked waiting on other threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up from threads is better if the data sets they touch are largely exclusive.
Unless your system has at least as many processors/cores as threads, they cannot all run concurrently. You can set the desired concurrency level using pthread_setconcurrency.
Context switching is overhead, hence the difference. If your computer had 6 cores it would be faster; otherwise you need more context switches for the threads.
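To make the locked data sets "largely exclusive" in a setup like the question's, one option is to stripe the lock: protect disjoint slots of the global array with separate mutexes, so two threads only block each other when they touch the same stripe. A minimal sketch, assuming the my_object_t and global_result_array from the question's pseudo-code (the stripe count of 64 is an arbitrary choice):

#include <pthread.h>

#define NUM_STRIPES 64   /* arbitrary: more stripes -> less contention, more memory */

/* Each element must be initialized with pthread_mutex_init() at startup. */
static pthread_mutex_t stripe_lock[NUM_STRIPES];

/* Merge one local entry under the lock that guards its slot's stripe only. */
static void merge_one(my_object_t **global_result_array, my_object_t *ptr)
{
    pthread_mutex_t *lock = &stripe_lock[(unsigned long)ptr->no % NUM_STRIPES];

    pthread_mutex_lock(lock);
    if (global_result_array[ptr->no] == NULL) {
        global_result_array[ptr->no] = ptr;
    } else {
        global_result_array[ptr->no]->count += ptr->count;
        global_result_array[ptr->no]->value += ptr->value;
    }
    pthread_mutex_unlock(lock);
}

Two merging threads now contend only when their entries hash to the same stripe, instead of serializing the whole merge behind one global mutex.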
That is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to produce such a huge discrepancy. You seem to have succeeded :((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication (locks etc.) is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections (e.g. with a merge sort), the time wasted on inter-thread comms and cache-thrashing starts to dominate. This is why multi-threaded merge-sorts frequently switch to an in-place sort once the data has been divided into chunks smaller than the L1 cache.
'each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global once effectively serializes the threads; continually locking/releasing it generates far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own.
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery.
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(
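To illustrate the "lock only long enough to push a pointer onto a queue" idea from the answer above, here is a rough sketch in which each worker hands its whole local_result batch to a single merger thread. The batch_t struct, submit_batch() and merger_thread() are made-up names, shutdown handling is omitted, and it assumes the same my_object_t and global_result_array as the question:

#include <pthread.h>
#include <stdlib.h>

/* One batch of thread-local results, handed off to the merger thread. */
typedef struct batch {
    my_object_t **items;
    int           count;
    struct batch *next;
} batch_t;

static batch_t        *batch_head = NULL;
static pthread_mutex_t batch_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  batch_cond = PTHREAD_COND_INITIALIZER;

/* Workers: hold the lock only long enough to push one pointer. */
void submit_batch(my_object_t **items, int count)
{
    batch_t *b = malloc(sizeof *b);
    b->items = items;
    b->count = count;

    pthread_mutex_lock(&batch_lock);
    b->next = batch_head;
    batch_head = b;
    pthread_cond_signal(&batch_cond);
    pthread_mutex_unlock(&batch_lock);
}

/* Single merger thread: drains the queue and merges with no contention on
   global_result_array, because it is the only thread that ever writes it. */
void *merger_thread(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&batch_lock);
        while (batch_head == NULL)
            pthread_cond_wait(&batch_cond, &batch_lock);
        batch_t *b = batch_head;
        batch_head = NULL;               /* take the whole list at once */
        pthread_mutex_unlock(&batch_lock);

        while (b != NULL) {
            for (int i = 0; i < b->count; i++) {
                my_object_t *ptr = b->items[i];
                if (global_result_array[ptr->no] == NULL) {
                    global_result_array[ptr->no] = ptr;
                } else {
                    global_result_array[ptr->no]->count += ptr->count;
                    global_result_array[ptr->no]->value += ptr->value;
                }
            }
            batch_t *next = b->next;
            free(b);
            b = next;
        }
    }
    return NULL;
}

The merge itself is now serialized in one thread, but the workers never wait on it, and global_result_array no longer bounces between the caches of six different cores.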
Related
This concerns the Microsoft/Visual Studio and Intel/AMD-specific implementation only.
Say I declare a global variable:
volatile __declspec(align(16)) ULONG vFlags = 0;
And, say, I have multiple contending threads:
//Thread 1
ULONG prevFlags;
prevFlags = InterlockedExchange(&vFlags, 0);
if(prevFlags != 0)
{
//Do work
}
and then from other threads, I do:
//Thread N
vFlags = SomeNonZeroValue;
So say, on a multi-CPU system, at the moment thread 1 is executing the locked InterlockedExchange instruction, some other threads reach their vFlags = 2 and vFlags = 4 instructions.
What would happen in that case? Would vFlags = 2 and vFlags = 4 be stalled until InterlockedExchange completes, or will they disregard that lock?
Or do I need to use this instead?
//Thread N
InterlockedOr(&vFlags, SomeNonZeroValue);
Instructions that don't use locks to update a variable do not interact with instructions that do. Locking is a cooperative process that all participants must observe in order for it to work. So, yes, updating the flag with a simple assignment on one thread will not be blocked by another thread calling InterlockedExchange.
On the other hand, assigning different values to variables that are read by other threads raises the issue of visibility across cores since other threads may not immediately, or indeed ever, see the updates. InterlockedExchange solves this issue as well by providing implicit memory fences.
In conclusion, I would use InterlockedExchange in all threads updating the flag.
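A minimal sketch of that conclusion, with the flag declared as LONG to match the signatures of the Interlocked functions (signal_work and poll_work are made-up names):

#include <windows.h>

static volatile LONG vFlags = 0;

/* Thread N: set a flag bit atomically, with full memory ordering. */
void signal_work(LONG someNonZeroValue)
{
    InterlockedOr(&vFlags, someNonZeroValue);
}

/* Thread 1: atomically grab and clear all pending flag bits. */
void poll_work(void)
{
    LONG prevFlags = InterlockedExchange(&vFlags, 0);
    if (prevFlags != 0)
    {
        /* Do work based on which bits were set in prevFlags. */
    }
}

Using InterlockedOr in the writers also means a bit set by one thread cannot be lost when another writer stores its own value at the same moment, which a plain assignment would not guarantee.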
I have an application that acts as a network server; it uses pthread_create to spawn multiple threads, and each thread listens on a particular TCP port and accepts multiple TCP socket connections.
Now, after some time (for example, 60 seconds), all network clients (TCP socket clients) have closed, but my application still has those threads running and listening on those ports. How do I collect data (such as total bytes received) from the threads my application created?
One solution I currently use: on each accepted socket, when new data arrives, the corresponding thread updates a static variable under a pthread_mutex_t. But I suspect this is inefficient and wastes time on the mutex.
Is there any lock-free way to do this?
Is there any solution like the "per-CPU" counters used in network drivers, with no lock/mutex?
Or: I don't update Receiver_Total_Bytes when receiving n bytes from the socket (read()). Instead, I keep a running total within the thread. But then the question is, how do I get the byte count out of a thread that hasn't finished yet?
=== pseudo-code ===
static long Receiver_Total_Bytes = 0;
static pthread_mutex_t Summarizer_Mutex = PTHREAD_MUTEX_INITIALIZER;

void add_server_transfer_bytes(long bytes)
{
    pthread_mutex_lock( &Summarizer_Mutex );
    Receiver_Total_Bytes += bytes;
    pthread_mutex_unlock( &Summarizer_Mutex );
}

void reset_server_transfer_bytes( )
{
    pthread_mutex_lock( &Summarizer_Mutex );
    Receiver_Total_Bytes = 0;
    pthread_mutex_unlock( &Summarizer_Mutex );
}
Then in socket read:
if((n = read(i, buffer, bytes_to_be_read)) > 0) {
............
add_server_transfer_bytes(n);
Another option is to allocate a structure for each thread, and have that structure include the desired counters, say connections and total_bytes, at least.
The thread itself should increment these using atomic built-ins:
__sync_fetch_and_add(&(threadstruct->connections), 1);
__sync_fetch_and_add(&(threadstruct->total_bytes), bytes);
or
__atomic_fetch_add(&(threadstruct->connections), 1, __ATOMIC_SEQ_CST);
__atomic_fetch_add(&(threadstruct->total_bytes), bytes, __ATOMIC_SEQ_CST);
These are slightly slower than non-atomic operations, but the overhead is very small, if there is no contention. (In my experience, cacheline ping-pong -- when different CPUs try to access the variable at the same time -- is a significant risk and a real-world cause for slowdown, but here the risk is minimal. At worst, only the current thread and the main thread may access the variables at the same time. Of course, the main thread should not calculate the summaries too often; say, once or twice a second should be enough.)
Because the structure is allocated in the main thread, the main thread can also access the counters. To collect the totals, it'll use a loop, and inside the loop,
overall_connections += __sync_fetch_and_add(&(threadstruct[thread]->connections), 0);
overall_total_bytes += __sync_fetch_and_add(&(threadstruct[thread]->total_bytes), 0);
or
overall_connections += __atomic_load_n(&(threadstruct[thread]->connections), __ATOMIC_SEQ_CST);
overall_total_bytes += __atomic_load_n(&(threadstruct[thread]->total_bytes), __ATOMIC_SEQ_CST);
See the GCC manual for further information on the __atomic and __sync built-in functions. Other C compilers like Intel CC also provide these -- or at least used to; the last time I verified this was a few years ago. The __sync ones are older (and more widely supported in older compiler versions), but the __atomic ones reflect the memory models specified in C11, so are more likely to be supported by future C compilers.
Yes, your concerns are warranted. The worst thing you can do here is to use a mutex as suggested in another answer. Mutexes preempt threads, so they really are a multithreader's worst enemy. The other thing that might come to mind is to use atomic operations for incrementing (also mentioned in the same answer). A terrible idea indeed! Atomic operations perform very poorly under contention (an atomic increment is actually a loop, which keeps trying to increment until it succeeds). Since in the case described I imagine the contention will be high, atomics will behave badly.
The other problem with atomics and mutexes alike is that they enforce memory ordering and impose barriers. Not a good thing for performance!
The real solution, of course, is to have each thread use its own private counter. It is not per-CPU, it is per-thread. Once the threads are done, those counters can be accumulated.
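A minimal sketch of that per-thread-counter idea (the struct layout, the MAX_THREADS limit, and the padding to a cache line are my own assumptions, not from the question; the padding keeps each thread's counters on their own cache line so they never ping-pong between cores):

#define MAX_THREADS 16
#define CACHELINE   64

/* One counter slot per thread, padded so slots never share a cache line. */
struct per_thread_stats {
    long total_bytes;
    long connections;
    char pad[CACHELINE - 2 * sizeof(long)];
};

static struct per_thread_stats stats[MAX_THREADS];

/* Inside thread 'id': plain, uncontended updates; no lock, no atomics. */
static void account_bytes(int id, long bytes)
{
    stats[id].total_bytes += bytes;
}

/* In the collector, after pthread_join() on all workers: a simple sum. */
static long collect_total_bytes(int nthreads)
{
    long total = 0;
    for (int i = 0; i < nthreads; i++)
        total += stats[i].total_bytes;
    return total;
}

If the totals must be read while the workers are still running, you are back to needing the atomic loads from the previous answer (or some other synchronization); the purely private counters only become safe to read once the threads have finished.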
I'm currently learning OpenCL and came across this code snippet:
int gti = get_global_id(0);
int ti = get_local_id(0);
int n = get_global_size(0);
int nt = get_local_size(0);
int nb = n/nt;
for(int jb=0; jb < nb; jb++) {                  /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];             /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE);               /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) {                   /* For ALL cached particle positions ... */
        float4 p2 = pblock[j];                  /* Read a cached particle position */
        float4 d = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w*invr*invr*invr;
        a += f*d;                               /* Accumulate acceleration */
    }
    barrier(CLK_LOCAL_MEM_FENCE);               /* Wait for others in work-group */
}
Background info about the code: This is part of an OpenCL kernel in a NBody simulation program. The entirety of the code and tutorial can be found here.
Here are my questions (mainly to do with the for loops):
How exactly are for-loops executed in OpenCL? I know that all work-items run the same code and that work-items within a work-group try to execute in parallel. So if I run a for loop in OpenCL, does that mean all work-items run the same loop, or is the loop somehow divided up to run across multiple work-items, with each work-item executing a part of the loop (i.e. work-item 1 processes indices 0 ~ 9, item 2 processes indices 10 ~ 19, etc.)?
In this code snippet, how does the outer and inner loops execute? Does OpenCL know that the outer loop is dividing the work among all the work groups and that the inner loop is trying to divide the work among work-items within each work group?
If the inner loop is divided among the work-items (meaning that the code within the for loop is executed in parallel, or at least attempted to), how does the addition at the end work? It is essentially doing a = a + f*d, and from my understanding of pipelined processors, this has to be executed sequentially.
I hope my questions are clear enough and I appreciate any input.
1) How exactly are for-loops executed in OpenCL? I know that all
work-items run the same code and that work-items within a work group
tries to execute in parallel. So if I run a for loop in OpenCL, does
that mean all work-items run the same loop or is the loop somehow
divided up to run across multiple work items, with each work item
executing a part of the loop (ie. work item 1 processes indices 0 ~ 9,
item 2 processes indices 10 ~ 19, etc).
You are right. All work items run the same code, but please note that they may not run it at the same pace. Only logically do they run the same code. In hardware, the work items inside the same wave (AMD term) or warp (NVIDIA term) follow exactly the same footprint at the instruction level.
In terms of a loop, it is nothing more than a few branch operations at the assembly level. Threads from the same wave execute the branch instruction in parallel. If all work items meet the same condition, they still follow the same path and run in parallel. However, if they do not agree on the condition, there will typically be divergent execution. For example, in the code below:
if(condition is true)
do_a();
else
do_b();
Logically, the work items that meet the condition will execute the do_a() function, while the other work items will execute the do_b() function. However, in reality, the work items in a wave execute in exactly the same lockstep in the hardware, so it is impossible for them to run different code in parallel. Instead, some work items are masked out while the wave executes do_a(); when that is finished, the wave moves on to do_b(), and the remaining work items are masked out in turn. For either function, only part of the work items are active.
Going back to the loop question: since a loop is a branch operation, if the loop condition is true only for some work items, the situation above occurs, in which some work items execute the code in the loop while the others are masked out. However, in your code:
for(int jb=0; jb < nb; jb++) {          /* Foreach block ... */
    pblock[ti] = pos_old[jb*nt+ti];     /* Cache ONE particle position */
    barrier(CLK_LOCAL_MEM_FENCE);       /* Wait for others in the work-group */
    for(int j=0; j<nt; j++) {           /* For ALL cached particle positions ... */
The loop condition does not depend on the work item IDs, which means that all the work items will have exactly the same loop condition, so they will follow the same execution path and be running in parallel all the time.
2) In this code snippet, how does the outer and inner loops execute?
Does OpenCL know that the outer loop is dividing the work among all
the work groups and that the inner loop is trying to divide the work
among work-items within each work group?
As described in the answer to (1), since the loop conditions of the outer and inner loops are the same for all work items, they always run in parallel.
As for workload distribution in OpenCL, it relies entirely on the developer to specify how to distribute the work. OpenCL does not know anything about how to divide the workload among work groups and work items. You partition the workload by assigning different data and operations to different work items using the global work id or the local work id. For example,
unsigned int gid = get_global_id(0);
buf[gid] = input1[gid] + input2[gid];
this code has each work item fetch two values from consecutive memory locations and store the computed result back to consecutive memory.
3) If the inner loop is divided among the work-items (meaning that the
code within the for loop is executed in parallel, or at least
attempted to), how does the addition at the end work? It is
essentially doing a = a + f*d, and from my understanding of pipelined
processors, this has to be executed sequentially.
float4 d = p2 - p;
float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
float f = p2.w*invr*invr*invr;
a += f*d; /* Accumulate acceleration */
Here, a, f and d are defined in the kernel code without any address-space specifier, which means they are private to the work item itself. On a GPU, these variables are first assigned to registers; however, registers are typically a very limited resource on a GPU, so when the registers are used up, these variables are spilled into private memory, which is called register spilling (depending on the hardware this may be implemented in different ways; e.g., on some platforms private memory is backed by global memory, so any register spilling causes a large performance degradation).
Since these variables are private, all the work items still run in parallel, and each work item maintains and updates its own a, f and d without interfering with the others.
Heterogeneous programming works on a work-distribution model, meaning each thread gets its own portion of the work and starts on it.
1.1) As you know, threads are organized into work-groups (or thread blocks), and in your case each thread in the work-group (or thread block) brings data from global memory into local memory.
for(int jb=0; jb < nb; jb++) { /* Foreach block ... */
pblock[ti] = pos_old[jb*nt+ti];
//I assume pblock is local memory
1.2) Now all threads in the thread block have the data they need in their local storage (so there is no need to go to global memory anymore).
1.3) Now comes the processing. If you look carefully at the for loop where the processing takes place:
for(int j=0; j<nt; j++) {
which runs once per thread in the block (nt iterations). So this loop design makes sure that all threads process separate data elements.
1) A for loop is just another C statement to OpenCL, and every thread will execute it as written; it is up to you how you divide it. OpenCL will not do anything internally with your loop (such as point 1.1).
2) OpenCL doesn't know anything about your code; it is up to you how you divide the loops.
3) Same as point 1: the inner loop is not divided among the threads. All threads execute it as written; the only difference is that each thread points to the data it wants to process.
I guess the confusion arises because you jumped into the code before learning much about thread blocks and local memory. I suggest you look at the initial version of this code, which does not use local memory at all.
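For reference, a naive no-local-memory version might look like the sketch below (my own illustration, not the tutorial's actual code; the kernel name nbody_naive and the accel_out argument are invented). Every work item walks all n particles straight from global memory, with no pblock and no barriers:

/* Hypothetical "no local memory" variant of the kernel body. */
__kernel void nbody_naive(__global const float4 *pos_old,
                          __global float4 *accel_out,
                          const float eps)
{
    int gti = get_global_id(0);
    int n   = get_global_size(0);

    float4 p = pos_old[gti];
    float4 a = (float4)(0.0f, 0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n; j++) {
        float4 p2 = pos_old[j];                 /* global-memory read every iteration */
        float4 d  = p2 - p;
        float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
        float f = p2.w * invr * invr * invr;    /* w holds the mass, as in the snippet */
        a += f * d;                             /* accumulate acceleration */
    }
    accel_out[gti] = a;
}

The cached version in the question exists precisely to replace those n global-memory reads per work item with nb block-loads into local memory that the whole work-group shares.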
How exactly are for-loops executed in OpenCL?
They can be unrolled automatically into pages of code, which can make them slower or faster to complete. The SALU is used for the loop counter, so when you nest loops you add SALU pressure, which becomes a bottleneck when more than 9-10 loops are nested (maybe an intelligent algorithm that shares one counter across loops would do the trick). So doing not only SALU work in the loop body, but adding some VALU instructions, is a plus.
They run in parallel in SIMD, so all threads' loops are locked to each other unless there is branching or a memory operation. If one loop is adding something, all the other threads' loops are adding too, and if some finish sooner they wait for the last thread to finish computing. When they all finish, they continue to the next instruction (unless there is branching or a memory operation). If there is no local/global memory operation, you don't need synchronization. This is SIMD, not MIMD, so it is not efficient when the loops are not doing the same thing in all threads.
In this code snippet, how does the outer and inner loops execute?
nb and nt are constants, and they are the same for all threads, so all threads do the same amount of work.
If the inner loop is divided among the work-items
That needs OpenCL 2.0, which has the ability to do fine-grained work distribution (and to spawn kernels from within a kernel).
http://developer.amd.com/community/blog/2014/11/17/opencl-2-0-device-enqueue/
Look for "subgroup-level functions" and "region growing" titles.
All subgroup threads would have their own accumulators, which are then added together at the end using a "reduction" operation for speed.
#include <windows.h>
#include <stdio.h>
#include <stdint.h>

// assuming we return times with microsecond resolution
#define STOPWATCH_TICKS_PER_US 1

uint64_t GetStopWatch()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    // convert ticks to microseconds
    return (uint64_t) (t.QuadPart / (double) freq.QuadPart * 1000000);
}

void task()
{
    printf("hi\n");
}

int main()
{
    uint64_t start = GetStopWatch();
    task();
    uint64_t stop = GetStopWatch();
    printf("Elapsed time (microseconds): %llu\n", (unsigned long long)(stop - start));
    return 0;
}
The code above uses QueryPerformanceCounter, which retrieves the current value of the high-resolution performance counter, and QueryPerformanceFrequency, which retrieves the frequency of that counter. If I call the task() function multiple times, the difference between the start and stop times varies, but I expected to get the same time difference each time I call it. Could anyone help me identify the mistake in the above code?
The thing is, Windows is a pre-emptive multi-tasking operating system. What the hell does that mean, you ask?
'Simple' - Windows allocates time-slices to each of the running processes in the system. This gives the illusion of dozens or hundreds of processes running in parallel. In reality, you are limited to 2, 4, 8 or perhaps 16 parallel processes on a typical desktop/laptop. An Intel i3 has 2 physical cores, each of which can give the impression of doing two things at once (but in reality there are hardware tricks going on that switch execution between the two threads each core can handle at once). This is in addition to the software context switching that Windows/Linux/MacOSX do.
These time-slices are not guaranteed to be of the same duration each time. You may find the pc does a sync with windows.time to update your clock, you may find that the virus-scanner decides to begin working, or any one of a number of other things. All of these events may occur after your task() function has begun, yet before it ends.
In the DOS days, you'd get very nearly the same result each and every time you timed a single iteration of task(). Though, thanks to TSR programs, you could still find an interrupt was fired and some machine-time stolen during execution.
It is for just these reasons that a more accurate determination of the time a task takes to execute may be calculated by running the task N times, dividing the elapsed time by N to get the time per iteration.
For some functions in the past, I have used values for N as large as 100 million.
EDIT: A short snippet.
LARGE_INTEGER tStart, tEnd;
LARGE_INTEGER tFreq;
double tSecsElapsed;

QueryPerformanceFrequency(&tFreq);
QueryPerformanceCounter(&tStart);

int i, n = 100;
for (i = 0; i < n; i++)
{
    // Do Something
}

QueryPerformanceCounter(&tEnd);
tSecsElapsed = (tEnd.QuadPart - tStart.QuadPart) / (double)tFreq.QuadPart;
double tMsElapsed = tSecsElapsed * 1000;
double tMsPerIteration = tMsElapsed / (double)n;
Code execution time on modern operating systems and processors is very unpredictable. There is no scenario where you can be sure that the elapsed time actually measures the time taken by your code; your program may well have lost the processor to another process while it was executing. The caches used by the processor play a big role: code is always a lot slower the first time it is executed, when the caches do not yet contain the code and data used by the program. The memory bus is very slow compared to the processor.
It gets especially meaningless when you measure a printf() statement. The console window is owned by another process so there's a significant chunk of process interop overhead whose execution time critically depends on the state of that process. You'll suddenly see a huge difference when the console window needs to be scrolled for example. And most of all, there isn't actually anything you can do about making it faster so measuring it is only interesting for curiosity.
Profile only code that you can improve. Take many samples so you can get rid of the outliers. Never pick the lowest measurement; that just creates unrealistic expectations. Don't pick the average either; that is affected too much by the long delays that other processes can incur on your test. The median value is a good choice.
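A small sketch of that sampling approach, reusing the GetStopWatch() and task() helpers from the question (SAMPLES and profile_task are my own placeholder names):

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define SAMPLES 101   /* odd count makes the median a single element */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Time task() SAMPLES times and report the median, not the minimum or the mean. */
void profile_task(void)
{
    uint64_t elapsed[SAMPLES];

    for (int i = 0; i < SAMPLES; i++) {
        uint64_t start = GetStopWatch();
        task();
        elapsed[i] = GetStopWatch() - start;
    }

    qsort(elapsed, SAMPLES, sizeof elapsed[0], cmp_u64);
    printf("Median elapsed time (us): %llu\n",
           (unsigned long long)elapsed[SAMPLES / 2]);
}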
I'm having some trouble writing pseudocode for a homework assignment in my operating systems class, in which we are programming in C.
You will be implementing a Producer-Consumer program with a bounded buffer queue of N elements, P producer threads and C consumer threads
(N, P and C should be command line arguments to your program, along with three additional parameters, X, Ptime and Ctime, that are described below). Each
Producer thread should Enqueue X different numbers onto the queue (spin-waiting for Ptime*100,000 cycles in between each call to Enqueue). Each Consumer thread
should Dequeue P*X/C items from the queue (spin-waiting for Ctime*100,000 cycles
in between each call to Dequeue). The main program should create/initialize the
Bounded Buffer Queue, print a timestamp, spawn off C consumer threads & P
producer threads, wait for all of the threads to finish and then print off another
timestamp & the duration of execution.
My main difficulty is understanding what my professor means by spin-waiting for the variables times 100,000. I have bolded the section that is confusing me.
I understand a time stamp will be used to print the difference between each thread. We are using semaphores and implementing synchronization at the moment. Any suggestions on the above queries would be much appreciated.
I'm guessing it means busy-waiting; repeatedly checking the loop condition and consuming unnecessary CPU power in a tight loop:
while (current_time() <= wake_up_time);
One would ideally use something that suspends your thread until it's woken up externally, by the scheduler (so resources such as the CPU can be diverted elsewhere):
sleep(2 * 60 * 1000 ms);
or at least give up some CPU (i.e. not be so tight):
while (current_time() <= wake_up_time)
sleep(100 ms);
But I guess they don't want you to manually invoke the scheduler, i.e. hint to the OS (or your threading library) that it's a good time to make a context switch.
I'm not sure what cycles are; in assembly they might be CPU cycles but given that your question is tagged C, I'll bet that they're simply loop iterations:
for (int i=0; i<Ptime*100000; ++i); //spin-wait for Ptime*100,000 cycles
Though it's always safest to ask whoever issued the homework.
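One caveat about the empty loop above: an optimizing compiler is free to delete a loop with no observable effects, so in practice the loop counter usually needs to be volatile. A small sketch (spin_wait is my own placeholder name):

/* Spin-wait for roughly 'cycles' loop iterations. The volatile qualifier
   stops the compiler from optimizing the empty loop away. */
void spin_wait(long cycles)
{
    volatile long i;
    for (i = 0; i < cycles; ++i)
        ;   /* burn CPU on purpose */
}

/* e.g. in a producer, between calls to Enqueue():
       spin_wait((long)Ptime * 100000);            */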
Busy-waiting or spinning is a technique in which a process repeatedly checks whether a condition is true, such as whether keyboard input is available or whether a lock is available.
So the assignment says each producer should spin-wait for Ptime*100,000 cycles between producing elements, enqueueing X different numbers in total.
Similarly, each consumer thread should dequeue P*X/C items from the queue, spin-waiting for Ctime*100,000 cycles after every item it consumes.
I suspect that your professor is being a complete putz - by actually ASKING for the worst "busy waiting" technique in existence:
int n = pTime * 100000;
for ( int i=0; i<n; ++i) ; // waste some cycles.
I also suspect that he still uses a pterosaur thigh-bone as a walking stick, has a very nice (dry) cave, and a partner with a large bald patch.... O/S guys tend to be that way. It goes with the cool beards.
No wonder his thoroughly modern students misunderstand him. He needs to (re)learn how to grunt IN TUNE.
Cheers. Keith.