Coming from CUDA, I'm interested in how shared memory is read by a thread and whether it is subject to alignment requirements similar to CUDA's coalesced reads. I'll use the following code as an example:
#include <unistd.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define THREADS 2

void * threadFun(void * args);

typedef struct {
    float * dataPtr;
    int tIdx,
        dSize;
} t_data;

int main(int argc, char * argv[])
{
    int i,
        sizeData=5;
    void * status;
    float * data;
    t_data * d;
    pthread_t * threads;
    pthread_attr_t attr;

    data=(float *) malloc(sizeof(float)*sizeData);
    threads=(pthread_t *) malloc(sizeof(pthread_t)*THREADS);
    d=(t_data *) malloc(sizeof(t_data)*THREADS);

    data[0]=0.0;
    data[1]=0.1;
    data[2]=0.2;
    data[3]=0.3;
    data[4]=0.4;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    /* hand each thread its own t_data element */
    for (i=0; i<THREADS; i++)
    {
        d[i].tIdx=i;
        d[i].dataPtr=data;
        d[i].dSize=sizeData;
        pthread_create(&threads[i], &attr, threadFun, (void *)(d+i));
    }

    for (i=0; i<THREADS; i++)
    {
        pthread_join(threads[i], &status);
        if (status != NULL)
        {
            /* handle thread error */
        }
    }

    pthread_attr_destroy(&attr);
    free(threads);
    free(d);
    free(data);
    return 0;
}

void * threadFun(void * args)
{
    int i;
    t_data * d = (t_data *) args;
    float sumVal=0.0;

    /* every thread reads the same shared array through dataPtr */
    for (i=0; i<d->dSize; i++)
        sumVal += d->dataPtr[i]*(d->tIdx+1);

    printf("Thread %d calculated the value as %-11.11f\n", d->tIdx, sumVal);
    return NULL;
}
In threadFun, the entire pointer d points into shared memory space (I believe). From what I've encountered in documentation, reading from multiple threads is fine. In CUDA, reads need to be coalesced; are there similar alignment restrictions in pthreads? That is, if I have two threads reading from the same shared address, I'm assuming that somewhere along the line a scheduler has to put one thread ahead of the other. In CUDA this could be a costly operation and should be avoided. Is there a penalty for 'simultaneous' reads from shared memory, and if so, is it small enough to be negligible? E.g., both threads may need to read d->dataPtr[0] at the same time; I'm assuming those memory reads cannot occur simultaneously. Is this assumption wrong?
Also, I read an article from Intel that said to use a structure of arrays when multithreading; this is consistent with CUDA. If I do this, though, it is almost inevitable that I will need the thread ID, which I believe will require me to mutex-lock the thread ID until it has been read into the thread's scope. Is this true, or is there some other way to identify threads?
An article on memory management for multithreaded programs would be appreciated as well.
While your thread data pointer d is pointing into a shared memory space, unless you increment that pointer to read from or write to an adjoining thread data element in the shared array, you're basically dealing with localized thread data. The value of args is also local to each thread, so in both cases, as long as you are not incrementing the data pointer itself (i.e., you never call something like d++ so that you end up pointing at another thread's memory), no mutex is needed to guard the memory "belonging" to your thread.
As for your thread ID: since you only write that value from the spawning thread and then read it in the spawned thread, there is no need for a mutex or other synchronization mechanism; you have a single producer and a single consumer for that datum. Mutexes and other synchronization mechanisms are needed only when multiple threads will read and write the same data location.
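For contrast, here is a minimal sketch (hypothetical names, plain POSIX threads) of the one case that does need a lock: several threads updating a single shared location:

#include <pthread.h>
#include <stdio.h>

#define THREADS 2

static pthread_mutex_t sumLock = PTHREAD_MUTEX_INITIALIZER;
static float sharedSum = 0.0f; /* one location written by every thread */

void * accumulate(void * args)
{
    float contribution = *(float *) args;
    /* Multiple writers to sharedSum, so the update must be guarded. */
    pthread_mutex_lock(&sumLock);
    sharedSum += contribution;
    pthread_mutex_unlock(&sumLock);
    return NULL;
}

int main(void)
{
    pthread_t threads[THREADS];
    float vals[THREADS] = {1.0f, 2.0f};
    int i;
    for (i = 0; i < THREADS; i++)
        pthread_create(&threads[i], NULL, accumulate, &vals[i]);
    for (i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("sum = %f\n", sharedSum); /* always 3.0 thanks to the lock */
    return 0;
}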
CPUs have caches. Reads are served from cache, so each CPU/core can read from its own copy as long as the corresponding cache line is in the SHARED state. A write forces the cache line into the EXCLUSIVE state, invalidating the corresponding cache lines on other CPUs.
If you have an array with one element per thread, and that array sees both reads and writes, you may want to align each element to a cache line boundary to avoid false sharing, as in the sketch below.
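A small sketch of that idea, assuming a 64-byte cache line (the real size varies by CPU; on Linux, sysconf can report it where available):

#include <stdio.h>

#define THREADS 2
#define CACHELINE 64 /* assumed line size; common on x86 parts */

/* Pad each per-thread slot to a full cache line so a write by one
   thread never invalidates the line holding another thread's slot. */
typedef struct {
    float value;                         /* the per-thread member      */
    char pad[CACHELINE - sizeof(float)]; /* fills the rest of the line */
} padded_slot;

int main(void)
{
    padded_slot slots[THREADS];
    slots[0].value = 1.0f;
    slots[1].value = 2.0f;
    printf("each slot occupies %zu bytes\n", sizeof(padded_slot));
    return 0;
}

On C11 compilers, alignas(CACHELINE) on the struct additionally guarantees each slot starts on a line boundary, which padding alone does not.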
Reads of the same memory area from different threads are not a problem on shared-memory systems (writes are another matter; the unit that matters is the cache line, 64-256 bytes depending on the system).
I don't see any reason why getting the thread ID should be a synchronized operation. You can feed your thread any ID that is meaningful to you; that can be simpler than deriving a meaningful value from an abstract ID.
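For instance, the index you pass in yourself is already a usable identifier, and pthread_self() returns the library's own handle without any locking. A minimal sketch, reusing a trimmed-down t_data from the question:

#include <pthread.h>
#include <stdio.h>

typedef struct { int tIdx; } t_data; /* trimmed to just the ID field */

void * threadFun(void * args)
{
    t_data * d = (t_data *) args;
    /* d->tIdx was written once, before pthread_create, so reading it
       here needs no lock. pthread_t is opaque; the cast below is for
       display only and assumes an integer representation. */
    printf("user id %d, pthread handle %lu\n",
           d->tIdx, (unsigned long) pthread_self());
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    t_data d[2] = { {0}, {1} };
    int i;
    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, threadFun, &d[i]);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}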
Coming from CUDA probably makes you think in overly complicated terms. POSIX threads are much simpler. Basically, what you are doing should work, as long as you are only reading from the shared array.
Also, don't forget that CUDA is derived from C++ and not from C, so some things may look different from that angle, too. E.g., in your code, the habit of casting the return value of malloc is generally frowned upon by C programmers, since it can be the source of subtle errors.
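In idiomatic C, the allocation from the question would instead be written without the cast; a sketch reusing sizeData from the code above:

/* No cast needed in C: malloc returns void *, which converts implicitly.
   sizeof *data also stays correct if data's type ever changes. */
float * data = malloc(sizeof *data * sizeData);
if (data == NULL) {
    /* handle allocation failure */
}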