I just learned about Nvidia's Thrust library. Just to try it, I wrote a small example which is supposed to normalize a bunch of vectors.
#include <cstdio>
#include <thrust/transform.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor : public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
    }
};
int main()
{
    thrust::host_vector<double4> v(2);
    v[0].x = 1; v[0].y = 2; v[0].z = 3;
    v[1].x = 4; v[1].y = 5; v[1].z = 6;

    thrust::device_vector<double4> v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());

    // This doesn't seem to copy back
    v = v_d;

    // Neither does this..
    thrust::host_vector<double4> result = v_d;

    for(int i = 0; i < v.size(); i++)
        printf("[ %f %f %f ]\n", result[i].x, result[i].y, result[i].z);

    return 0;
}
The example above seems to work; however, I'm unable to copy the data back. I thought a simple assignment would invoke a cudaMemcpy. Copying the data from the host to the device works, but not the other way around?
Secondly, I'm not sure if I'm doing this the right way. The documentation of for_each says:
for_each applies the function object f to each element in the range [first, last); f's return value, if any, is ignored.
But the unary_function struct template expects two template arguments (one of them being the return type) and forces operator() to return a value, which results in a warning when compiling. I don't see how I'm supposed to write a unary functor with no return value.
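My best guess is that void can be given as the second template argument, along the lines of the sketch below, but I'm not sure whether this is the intended way (this is my own attempt, not taken from the documentation):

struct normalize_functor : public thrust::unary_function<double4&, void>
{
    // Guess: declare the result type as void and take the argument by
    // reference, since for_each ignores the return value anyway.
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};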
Next is the data arrangement. I just chose double4 since this will result in two fetch instructions, ld.v2.f64 and ld.f64, IIRC. However, I'm wondering how Thrust fetches the data under the hood (and how many CUDA threads/blocks are created). If I chose a struct of 4 vectors instead, would it be able to fetch the data in a coalesced way?
Finally, Thrust provides tuples. What about an array of tuples? How would the data be arranged in that case?
I looked through the examples, but I haven't found one which explains which data structure to choose for a bunch of vectors (the dot_products_with_zip.cu example says something about "structure of arrays" instead of "arrays of structures", but I see no structures used in that example).
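If I understand the "structure of arrays" idea correctly, it would look roughly like the sketch below for my case (this is my own guess based on the zip_iterator documentation; the names normalize_soa and normalize_all are made up):

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cmath>

// "Structure of arrays": keep x, y and z in separate arrays so each
// component is stored contiguously, and zip them together for the algorithm.
struct normalize_soa
{
    template <typename Tuple>
    __device__ __host__ void operator()(Tuple t)
    {
        double x = thrust::get<0>(t);
        double y = thrust::get<1>(t);
        double z = thrust::get<2>(t);
        double len = sqrt(x*x + y*y + z*z);
        thrust::get<0>(t) = x / len;
        thrust::get<1>(t) = y / len;
        thrust::get<2>(t) = z / len;
    }
};

void normalize_all(thrust::device_vector<double>& x,
                   thrust::device_vector<double>& y,
                   thrust::device_vector<double>& z)
{
    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(x.begin(), y.begin(), z.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(x.end(),   y.end(),   z.end())),
        normalize_soa());
}

Is this the layout the dot_products_with_zip.cu example is referring to, and would it coalesce better than the double4 version?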
Update
I fixed the code above and tried to run a larger example, this time normalizing 10k vectors.
#include <cstdio>
#include <thrust/transform.h>
#include <thrust/for_each.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
    }
};
int main()
{
    int n = 10000;
    thrust::host_vector<double4> v(n);
    for(int i = 0; i < n; i++) {
        v[i].x = rand();
        v[i].y = rand();
        v[i].z = rand();
    }

    thrust::device_vector<double4> v_d = v;
    thrust::for_each(v_d.begin(), v_d.end(), normalize_functor());
    v = v_d;

    return 0;
}
Profiling with computeprof shows low occupancy and non-coalesced memory access:
Kernel Occupancy Analysis
Kernel details : Grid size: 23 x 1 x 1, Block size: 448 x 1 x 1
Register Ratio = 0.984375 ( 32256 / 32768 ) [24 registers per thread]
Shared Memory Ratio = 0 ( 0 / 49152 ) [0 bytes per Block]
Active Blocks per SM = 3 / 8
Active threads per SM = 1344 / 1536
Potential Occupancy = 0.875 ( 42 / 48 )
Max achieved occupancy = 0.583333 (on 9 SMs)
Min achieved occupancy = 0.291667 (on 5 SMs)
Occupancy limiting factor = Block-Size
Memory Throughput Analysis for kernel launch_closure_by_value on device GeForce GTX 470
Kernel requested global memory read throughput(GB/s): 29.21
Kernel requested global memory write throughput(GB/s): 17.52
Kernel requested global memory throughput(GB/s): 46.73
L1 cache read throughput(GB/s): 100.40
L1 cache global hit ratio (%): 48.15
Texture cache memory throughput(GB/s): 0.00
Texture cache hit rate(%): 0.00
L2 cache texture memory read throughput(GB/s): 0.00
L2 cache global memory read throughput(GB/s): 42.44
L2 cache global memory write throughput(GB/s): 46.73
L2 cache global memory throughput(GB/s): 89.17
L2 cache read hit ratio(%): 88.86
L2 cache write hit ratio(%): 3.09
Local memory bus traffic(%): 0.00
Global memory excess load(%): 31.18
Global memory excess store(%): 62.50
Achieved global memory read throughput(GB/s): 4.73
Achieved global memory write throughput(GB/s): 45.29
Achieved global memory throughput(GB/s): 50.01
Peak global memory throughput(GB/s): 133.92
I wonder how I can optimize this?
If you want to modify a sequence in-place with for_each, then you'll need to take the argument by reference in the functor:
struct normalize_functor
{
    __device__ __host__ void operator()(double4& ref)
    {
        double4 v = ref;
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        printf("%f %f %f\n", v.x, v.y, v.z);
        ref = v;
    }
};
Alternatively, you could use your definition of normalize_functor with the transform algorithm, specifying v_d as both the source and destination range:
thrust::transform(v_d.begin(), v_d.end(), v_d.begin(), normalize_functor());
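Note that for transform the functor does need to actually return the normalized value; roughly your original definition plus the missing return statement, something like:

struct normalize_functor : public thrust::unary_function<double4, double4>
{
    __device__ __host__ double4 operator()(double4 v)
    {
        double len = sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
        v.x /= len;
        v.y /= len;
        v.z /= len;
        return v;   // transform writes this to the destination range
    }
};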
My personal preference is to use transform in this situation, but the performance ought to be the same in either case.
On the question of optimization, there isn't much that can be done with Thrust - that isn't really the library's intention. Without wanting to speak for Nathan Bell, who is one of the authors of Thrust and who has already posted in this thread, the aim is to make a range of data-parallel algorithms for the GPU available in a simple, intuitive way without needing to write much, if any, CUDA code. And it succeeds spectacularly at that, in my opinion. The kernel performance of many of the Thrust kernels is close to the state of the art, but there are always optimizations which could be done in specific cases that aren't easy to do in generic template code. It is part of the price you pay for the ease of use and flexibility Thrust affords.
Having said that, I suspect there are a couple of tweaks to try in your operator function which might improve things. I would typically write something like this:
struct normalize_functor
{
    __device__ __host__ void operator()(double4& v)
    {
        double4 nv = v;
        double len = sqrt(nv.x*nv.x + nv.y*nv.y + nv.z*nv.z);
        nv.x /= len;
        nv.y /= len;
        nv.z /= len;
        (void)nv.w;   // touch the unused fourth component so the whole double4 is kept
        v = nv;
    }
};
Now while it isn't as pretty as the original, it should ensure that the compiler emits vectorized load and store instructions. I have seen cases in the past where the compiler will optimize away loads and stores of unused members of vector types, which results in the PTX generator emitting scalar loads and stores and breaks coalescing as a result. By doing a clear double4 load and store, and making sure each element of the structure is used, it can get around this unwanted "optimization" that was present in at least the 2.x and 3.x nvcc releases. I am not sure whether this is still the case with the 4.0 compiler.