Coming from CUDA, I'm interested in how shared memory is read from a thread and whether it has alignment requirements comparable to CUDA's read-coalescing rules. I'll use the following code as an example:
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define THREADS 2

void *threadFun(void *args);

typedef struct {
    float *dataPtr;
    int tIdx,
        dSize;
} t_data;

int main(int argc, char *argv[])
{
    int i,
        sizeData = 5;
    void *status;
    float *data;
    t_data *d;
    pthread_t *threads;
    pthread_attr_t attr;

    data = (float *) malloc(sizeof(float) * sizeData);
    threads = (pthread_t *) malloc(sizeof(pthread_t) * THREADS);
    d = (t_data *) malloc(sizeof(t_data) * THREADS);

    data[0] = 0.0;
    data[1] = 0.1;
    data[2] = 0.2;
    data[3] = 0.3;
    data[4] = 0.4;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < THREADS; i++)
    {
        d[i].tIdx = i;
        d[i].dataPtr = data;
        d[i].dSize = sizeData;
        pthread_create(&threads[i], &attr, threadFun, (void *)(d + i));
    }

    for (i = 0; i < THREADS; i++)
    {
        pthread_join(threads[i], &status);
        if (status != NULL) {
            /* handle error */
        }
    }
    return 0;
}

void *threadFun(void *args)
{
    int i;
    t_data *d = (t_data *) args;
    float sumVal = 0.0;

    for (i = 0; i < d->dSize; i++)
        sumVal += d->dataPtr[i] * (d->tIdx + 1);

    printf("Thread %d calculated the value as %-11.11f\n", d->tIdx, sumVal);
    return NULL;
}
In threadFun, the entire structure that d points to lives in shared memory space (I believe). From what I've encountered in documentation, reading from multiple threads is OK. In CUDA, reads need to be coalesced; are there similar alignment restrictions in pthreads? That is, if I have two threads reading from the same shared address, I'm assuming that somewhere along the line a scheduler has to put one thread ahead of the other. In CUDA this could be a costly operation and should be avoided. Is there a penalty for 'simultaneous' reads from shared memory, and if so, is it so small that it is negligible? For example, both threads may need to read d->dataPtr[0] at the same time; I'm assuming that those memory reads cannot occur simultaneously. Is this assumption wrong?
Also, I read an article from Intel that said to use a structure of arrays when multithreading; this is consistent with CUDA. If I do this, though, it is almost inevitable that I will need the thread ID, which I believe will require me to mutex-lock the thread ID until it is read into the thread's scope. Is this true, or would there be some other way to identify threads?
An article on memory management for multithreaded programs would be appreciated as well.
While your thread data pointer d points into a shared memory space, you're basically dealing with localized thread data unless you increment that pointer to read from or write to an adjoining thread-data element in the array. The value of args is also local to each thread. So in both cases, as long as you never increment the data pointer itself (i.e., you never call something like d++ so that you end up pointing at another thread's memory), no mutex is needed to guard the memory "belonging" to your thread.
As for your thread ID: since you only write that value from the spawning thread and then read it in the spawned thread, there is no need for a mutex or any other synchronization mechanism; there is only a single producer and a single consumer for that data. Mutexes and other synchronization mechanisms are only needed when multiple threads will read and write the same data location.
CPUs have caches. Reads come from caches, so each CPU/core can read from its own cache as long as the corresponding cache line is in the SHARED state. Writes force cache lines into the EXCLUSIVE state, invalidating the corresponding cache lines on other CPUs.
If you have an array with one member per thread, and there are both reads and writes to that array, you may want to align every member to a cache line to avoid false sharing.
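As an illustration, here is a minimal sketch of such per-thread padding. The 64-byte line size is an assumption (the real value is platform-specific), and the type name is mine, not from the question:

#include <stdalign.h>   /* C11 alignas */

#define CACHE_LINE 64   /* assumed line size; check your platform's actual value */
#define THREADS 2       /* as in the question's example */

/* Pad each thread's slot to a full cache line so that a write by one
   thread never invalidates the line holding a neighbour's slot. */
typedef struct {
    alignas(CACHE_LINE) float partial_sum;
} padded_slot;

padded_slot results[THREADS];   /* results[i] is written only by thread i */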
Memory reads of the same area from different threads aren't a problem on shared-memory systems (writes are another matter; the pertinent unit is the cache line, 64-256 bytes depending on the system).
I don't see any reason why getting the thread ID should be a synchronized operation. (And you can feed your thread any ID that is meaningful to you; that can be simpler than deriving a meaningful value from an abstract ID.)
Coming from CUDA probably leads you to overcomplicate things. POSIX threads are much simpler. Basically, what you are doing should work, as long as you are only reading from the shared array.
Also, don't forget that CUDA is an offshoot of C++ and not of C, so some things might look different from that angle, too. E.g., in your code, the habit of casting the return value of malloc is generally frowned upon by real C programmers, since there it can be the source of subtle errors.
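For instance, the allocations in the question would read as follows without the casts (a sketch of the convention only; the behaviour is unchanged):

/* idiomatic C: no cast on malloc, and the size taken from the object itself */
data    = malloc(sizeof *data * sizeData);
threads = malloc(sizeof *threads * THREADS);
d       = malloc(sizeof *d * THREADS);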
I have a general question that occurred to me while trying to implement a thread-synchronization problem with semaphores. I do not want to get into much (unrelated) detail, so I am only giving the code that I think is important to clarify my question.
#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

sem_t *mysema;
volatile int counter;

struct my_info {
    pthread_t t;
    int id;
};

void *barrier(void *arg)
{
    struct my_info *a = arg;
    int thrid = a->id;

    while (counter > 0) {
        do_work(&mysema[thrid]);
        sem_wait(&mysema[thrid]);
        display_my_work(arg);
        counter--;
        sem_post(&mysema[thrid + 1]);
    }
    return NULL;
}

int main(int argc, char *argv[])
{
    int i, M = atoi(argv[1]);
    struct my_info *tinfo = malloc(M * sizeof(*tinfo));

    mysema = malloc(M * sizeof(*mysema));
    counter = 50;

    /* semaphore initialisations */
    for (i = 0; i < M; i++) {
        sem_init(&mysema[i], 0, 0);
    }
    for (i = 0; i < M; i++) {
        tinfo[i].id = i;
    }
    for (i = 0; i < M; i++) {
        pthread_create(&tinfo[i].t, NULL, barrier, &tinfo[i]);
    }

    /* init wakes up the first semaphore */
    sem_post(&mysema[0]);
    .
    .
    .
We have an array of M semaphores, all initialised to 0, where M is given on the command line by the user.
I know I am done when all M threads together have done the necessary computations 50 times in total.
Each thread blocks itself until the previous thread "sem_post"s it. The very first thread will be woken up by init.
My question is whether the threads will stop when counter = 0. Do they all see the same variable counter? (It is a global one, initialised in main.)
If thread zero is the very first to make counter = 49, do all the other threads (threads 1, 2, ... M-1) see that?
These are different questions:
Do [the threads] all see the same variable counter? (It is a global one, initialised in main.)
If thread zero is the very first to make counter = 49, do all the other threads (threads 1, 2, ... M-1) see that?
The first is fairly simple: yes. An object declared at file scope and without storage class specifier _Thread_local is a single object whose storage duration is the entire run of the program. Wherever that object's identifier is in-scope and visible, it identifies the same object regardless of which thread is accessing it.
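A minimal sketch of that distinction in C11 syntax:

int shared_counter;                  /* file scope: a single object; every
                                        thread that names it touches the same one */
_Thread_local int per_thread_count;  /* by contrast, each thread gets its own */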
The answer to the second question is more complicated. In a multi-threaded program there is the potential for data races, and the behavior of a program containing a data race is undefined. The volatile qualifier does not protect against these; instead, you need proper synchronization for all accesses to each shared variable, both reads and writes. This can be provided by a semaphore or more often a mutex, among other possibilities.
Your code's decrement of counter may be adequately protected, but I suspect not, on account of the threads using different semaphores. If this allows for multiple different threads to execute the ...
display_my_work(arg);
counter--;
... lines at the same time, then you have a data race. Even if your protection is adequate there, however, the read of counter in the while condition clearly is not properly synchronized, and you definitely have a data race there.
One of the common manifestations of the undefined behavior brought on by data races is that threads do not see each other's updates. So not only does your program's undefined behavior mean that threads 1 ... M-1 may not see thread 0's update of counter, it specifically makes such a failure comparatively probable.
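As a minimal sketch of the kind of synchronization meant here, a mutex can guard every access to counter. The names counter_lock and keep_going are mine, not from the question:

#include <pthread.h>

pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 50;               /* volatile is unnecessary once access is synchronized */

/* synchronized read for the loop condition */
int keep_going(void) {
    pthread_mutex_lock(&counter_lock);
    int c = counter;
    pthread_mutex_unlock(&counter_lock);
    return c > 0;
}

/* inside barrier(), the decrement is likewise guarded:
       pthread_mutex_lock(&counter_lock);
       display_my_work(arg);
       counter--;
       pthread_mutex_unlock(&counter_lock);
*/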
I need to pass a void pointer to another application. To replicate the scenario, I have created one small program using shared memory: it tries to pass the void pointer to another application, which should print the value it points to. I can get the void pointer's address in the second application, but when I try to dereference the pointer, the second application crashes.
Here is the sample application, wire.c:
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int main() {
    key_t key = 1235;
    int shm_id;
    void *shm;
    int value = 23456;        /* some data living in this process */
    void *vPtr = &value;

    shm_id = shmget(key, 64, IPC_CREAT | 0666);
    shm = shmat(shm_id, NULL, 0);
    sprintf(shm, "%p", vPtr); /* store the pointer, as text, in the segment */
    printf("Address is %p, Value is %d \n", vPtr, *(int *)vPtr);
    return 0;
}
Here is read.c
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    key_t key = 1235;
    int shm_id;
    void *shm;
    void *p = (void *)malloc(sizeof(void));

    shm_id = shmget(key, 64, 0666);
    shm = shmat(shm_id, NULL, 0);
    if (shm == (void *)-1)
    {
        printf("error");
    }
    sscanf(shm, "%p", &p);
    printf("Address is %p, Value is %d\n", p, *(int *)p); /* crashes here */
    return 0;
}
When I try to dereference p, it crashes. I need to pass the void pointer's address and read its value in the second application.
I don't want to share the value itself between the applications; I know that works using shared memory.
By default, void *vPtr will hold some address (for example, address = 0xbfff7f, value = 23456). Can you please tell me how I can pass that void pointer's address to another application so that, using the address, the second application can print the value found in the first application (i.e., 23456)?
Apart from shared memory, is there any other alternative available?
Thanks.
This is probably because the pointer is still virtual: it points at memory that is shared, but there is no guarantee that two processes sharing the same memory map it to the same virtual address. The physical addresses are of course the same (it's the same actual memory, after all), but processes never deal with physical addresses.
You can try to request a specific address by using mmap() (on Linux) to do the sharing. You can also verify the virtual-address theory by simply printing the address of the shared memory block in both processes.
Also, don't cast the return value of malloc() in C, and don't allocate memory when you're going to be using the shared memory; that part just makes no sense.
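A sketch of the mmap() route, under stated assumptions: the file path and the address hint are arbitrary placeholders, and without MAP_FIXED the kernel is free to ignore the hint, so both processes must compare the returned addresses:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/shared_region", O_RDWR | O_CREAT, 0666);
    ftruncate(fd, 4096);

    /* hint the same virtual address in both processes; pointers stored in
       the region are only usable across them if both mappings land there */
    void *base = mmap((void *)0x700000000000, 4096,
                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    printf("mapped at %p\n", base);   /* compare this output across processes */

    munmap(base, 4096);
    close(fd);
    return 0;
}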
If you want to call a function within one process from another process, that is NOT IPC.
IPC is the sharing of data between multiple threads/processes.
Consider adding the shared function to a DLL/shared object to share the code across processes. If not, you could add RPC support to your executables, as shown here.
Why does passing function pointers between two processes NOT work?
A function pointer is a virtual address referring to the physical memory location where the function's code is currently loaded. Whenever a function pointer (virtual address) is used within a process, the kernel is responsible for performing the mapping to a physical address. This succeeds because the mapping is present in the page tables of the current process.
However, when a context switch occurs and another process runs, the page tables containing that process's mappings are loaded and become active. These will NOT contain the mapping of the function pointer from the previous process. Hence attempting to use the function pointer from another process will fail.
Why don't the page tables contain the mappings of functions in other processes?
If they did, there would be no advantage to having multiple processes: all the code that could ever run would have to be loaded into physical memory simultaneously, and the entire system would then effectively be a single process.
Practically speaking, whenever a context switch happens and a different process executes, the code/data segments of the earlier process can even be swapped out of physical memory. Hence even keeping a function pointer around and passing it to the new process is useless, as there is no guarantee that the function's code will still be held in memory once the newer process has been loaded and starts executing.
This is illegal:
void *p = (void *) malloc(sizeof(void));
void is an incomplete type, sizeof(void) is invalid. You can't do this for the same reason that you can't declare a void variable:
void i; /* WRONG */
What happens when you dereference p is undefined. If you just want the pointer value, you don't need to call malloc in read.c; that's the whole concept of shared memory. If it's shared, why would you allocate space in the reader program?
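A sketch of the reader this answer is pointing toward, assuming the writer stored a plain int at the start of the segment rather than a pointer:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int main(void) {
    int shm_id = shmget(1235, 64, 0666);      /* same key as wire.c */
    int *value = (int *)shmat(shm_id, NULL, 0);
    printf("Value is %d\n", *value);          /* no malloc, no pointer passing */
    return 0;
}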
In my application, I have a nested pair of loops that follow similarly nested linked lists in order to parse the data. I made a stupid blunder and cast one struct as the child struct, e.g.:
if (((ENTITY *) OuterEntityLoop->data)->visible == true) {
instead of:
if (((ENTITY_RECORD *) OuterEntityLoop->data)->entity->visible == true) {
This caused a problem where about 70% of runs would result in the application halting completely: not crashing, just sitting and spinning. Diagnostic printfs in the program flow would fire in an odd order or not at all, and though it spontaneously recovered a couple of times, for the most part it broke the app.
So here's the thing. Even after paring down the logic inside to be absolutely sure it wasn't infinite-looping because of a logic bug, to the point where the loop contained only my printf, it was still broken.
Thing two: even with the struct identified incorrectly, it still complained if I tried to access a member that the cast-to type doesn't declare, even though the data at that address didn't actually hold that type's members.
My questions are:
Why did this corrupt memory? Can simply reading garbage memory trash the program's control structures? If not, does this mean I still have a leak somewhere, even though Electric Fence no longer complains?
I assume the reason it complained about a nonexistent member is that the compiler goes by the type definition given, not by what's actually in memory. This is less questionable in my mind now that I've typed it out, but I'd like confirmation that I'm not off base here.
There's really no telling what will happen when a program accesses invalid memory, even for reading. On some systems, any memory read operation will either be valid or cause an immediate program crash, but on other systems it's possible that an erroneous read could be misinterpreted as a signal to do something. You didn't specify whether you're using a PC or an embedded system, but on embedded systems there are often many addresses by design which trigger various actions when they are read [e.g. dequeueing received data from a serial port, or acknowledging an interrupt]; an erroneous read of such an address might cause serial data to be lost, or might cause the interrupt controller to think an interrupt had been processed when it actually hadn't.
Further, in some embedded systems, an attempt to read an invalid address may have other, even worse, effects that aren't really by design but rather by happenstance. On one system I designed, for example, I had to interface a memory device which was a little slow to get off the bus following a read cycle. Provided the next memory read was performed from a memory area which had at least one wait state or was on a different bus, there would be no problem. If code running in the fast external memory partition tried to read that area, however, the failure of the memory device to get off the bus quickly would corrupt some bits of the next fetched instruction. The net effect of all this was that accessing the slow device from code located in some places was no problem, but accessing it, intentionally or not, from code located in the fast partition would cause weird and non-reproducible failures.
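To make the first hazard concrete, here is a hypothetical embedded-C sketch; the address and register name are invented for illustration:

/* Hypothetical example: reading this address pops a byte from a UART
   receive FIFO, so even a stray, erroneous read silently loses data. */
#define UART_RX (*(volatile unsigned char *)0x4000C000)

unsigned char next_byte(void) {
    return UART_RX;   /* the read itself has a hardware side effect */
}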
Welcome to C, where the power of casting allows you to make any piece of memory look like any object you want, but at your own risk. If the thing you cast is not really an object of that type, and that type contains a pointer to something else, you run the risk of crashing: even attempting to read random memory that has not actually been mapped into a process's virtual address space can cause a core dump, and reading from areas of memory that do not have read permission (like the NULL page) will also cause one.
example:
#include <stdio.h>
#include <stdlib.h>
struct foo
{
int x;
int y;
int z;
};
struct bar
{
int x;
int y;
struct foo *p;
};
void evil_cast(void *p)
{
/* hmm... maybe this is a bar pointer */
struct bar *q = (struct bar *)p;
if (q != NULL) /* q is some valid pointer */
{
/* as long as q points to readable memory, q->x will return some value; */
/* this has a fairly high probability of success */
printf("random pointer to a bar, x value x(%d)\n", q->x);
/* hmm... lets use the foo pointer from my random bar */
if (q->p != NULL)
{
/* very high probability of coring, since the likelihood that a */
/* random piece of memory contains a valid address is much lower */
printf("random value of x from a random foo pointer, from a random bar pointer x value x(%d)\n", q->p->x);
}
}
}
int main(int argc, char *argv[])
{
int *random_heap_data = (int *)malloc(1024); /* just random heap memory */
/* setup the first 5 locations to be some integers */
random_heap_data[0] = 1;
random_heap_data[1] = 2;
random_heap_data[2] = 3;
random_heap_data[3] = 4;
random_heap_data[4] = 5;
evil_cast(random_heap_data);
return 0;
}
I have a question about pthreads. When I create a variable inside a thread with malloc and then pass its pointer to a shared structure, e.g. a fifo, can the pointer passed by thread-1 be accessed by thread-2?
Please note that I have no code for the question above; I'm just trying to understand threading better, and the below is just what I'm thinking about. The environment is pthreads, C, and Linux.
As far as I know, threads share the memory of their parent process. If that's the case, the below should be correct.
void *thread1(void *pointer)
{
    int *intp = malloc(sizeof(int));
    send_to_fifo(intp);
    return NULL;
}

void *thread2(void *pointer)
{
    int *iptr;

    iptr = read_from_fifo();
    do_something(iptr);
    free(iptr);
    return NULL;
}
can the pointer passed by thread-1 be accessed by thread-2?
Yes: since all threads operate in a common memory space, this is allowed.
malloc, free, and the other memory-management functions are thread-safe by default, unless your C library was built without thread support.
Of course you can do this. However, you must be careful not to write to a variable while it's being used by another thread; you need synchronization.
In your case, you have a race condition if the threads run simultaneously (thread-2 not waiting for thread-1 to finish): thread-2 either executes all of its code before thread-1 puts anything into the fifo, or only after it has.
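For completeness, a minimal sketch of the synchronization this calls for: a one-slot fifo guarded by a mutex and a condition variable, so thread-2 blocks until thread-1 has actually published the pointer. All names here are illustrative, not from the question:

#include <pthread.h>

static pthread_mutex_t fifo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  fifo_cond = PTHREAD_COND_INITIALIZER;
static int *fifo_slot = NULL;          /* a one-deep "fifo" */

void send_to_fifo(int *p) {
    pthread_mutex_lock(&fifo_lock);
    fifo_slot = p;
    pthread_cond_signal(&fifo_cond);   /* wake a waiting consumer */
    pthread_mutex_unlock(&fifo_lock);
}

int *read_from_fifo(void) {
    pthread_mutex_lock(&fifo_lock);
    while (fifo_slot == NULL)          /* loop guards against spurious wakeups */
        pthread_cond_wait(&fifo_cond, &fifo_lock);
    int *p = fifo_slot;
    fifo_slot = NULL;
    pthread_mutex_unlock(&fifo_lock);
    return p;
}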
This is a performance-related question. I've written the following simple CUDA kernel based on the "CUDA By Example" sample code:
#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = (const unsigned char *)"Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
    sha1_ctx ctx[1];
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;

    while (tid < N) {
        sha1_begin(ctx);
        sha1_hash(m, 21UL, ctx);
        sha1_end(hval + tid * SHA1_DIGEST_SIZE, ctx);
        tid += blockDim.x * gridDim.x;
    }
}
The code seems to me to be correct and indeed spits out 37,426 copies of the same hash (as expected). Based on my reading of Chapter 5, Section 5.3, I assumed that having each thread write to the global memory passed in as hval would be extremely inefficient.
I then implemented what I assumed would be a performance-boosting cache using shared memory. The code was modified as follows:
#define N 37426 /* the (arbitrary) number of hashes we want to calculate */
#define THREAD_COUNT 128

__device__ const unsigned char *m = (const unsigned char *)"Goodbye, cruel world!";

__global__ void kernel_sha1(unsigned char *hval) {
    sha1_ctx ctx[1];
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ unsigned char cache[THREAD_COUNT * SHA1_DIGEST_SIZE];

    while (tid < N) {
        sha1_begin(ctx);
        sha1_hash(m, 21UL, ctx);
        sha1_end(cache + threadIdx.x * SHA1_DIGEST_SIZE, ctx);
        __syncthreads();
        if (threadIdx.x == 0) {
            memcpy(hval + tid * SHA1_DIGEST_SIZE, cache, sizeof(cache));
        }
        __syncthreads();
        tid += blockDim.x * gridDim.x;
    }
}
The second version also appears to run correctly, but it is several times slower than the initial version: the shared-memory code completes in about 8.95 milliseconds, while the original runs in about 1.64 milliseconds. My question to the Stack Overflow community is simple: why?
I looked through CUDA by Example and couldn't find anything resembling this. Yes there is some discussion of GPU hash tables in the appendix, but it looks nothing like this. So I really have no idea what your functions do, especially sha1_end. If this code is similar to something in that book, please point it out, I missed it.
However, if sha1_end writes to global memory once (per thread) and does so in a coalesced way, there's no reason that it can't be quite efficient. Presumably each thread is writing to a different location, so if they are adjacent more-or-less, there are definitely opportunities for coalescing. Without going into the details of coalescing, suffice it to say that it allows multiple threads to write data to memory in a single transaction. And if you are going to write your data to global memory, you're going to have to pay this penalty at least once, somewhere.
For your modification, you've completely killed this concept. You have now performed all the data copying from a single thread, and the memcpy means that subsequent data writes (ints, or chars, whatever) are occurring in separate transactions. Yes, there is a cache which may help with this, but it's completely the wrong way to do it on GPUs. Let each thread update global memory, and take advantage of opportunities to do it in parallel. But when you force all the updates on a single thread, then that thread must copy the data sequentially. This is probably the biggest single cost factor in the timing difference.
The use of __syncthreads() also imposes additional cost.
Section 12.2.7 of the CUDA by Example book refers to the visual profiler (and mentions that it can gather information about coalesced accesses). The visual profiler is a good tool to help answer questions like this.
If you want to learn more about efficient memory techniques and coalescing, I would recommend the NVIDIA GPU computing webinar entitled "GPU Computing using CUDA C – Advanced 1 (2010)". The direct link to it is here with slides.