When I increase the unrolling in my kernel from 8 to 9 loop iterations, it breaks with an out of resources error.

I read in How do I diagnose a CUDA launch failure due to being out of resources? that a mismatch of parameters and an overuse of registers could be a problem, but that does not seem to be the case here.

My kernel calculates the distance between n points and m centroids and selects the closest centroid for each point. It works for 8 dimensions but not for 9. When I set dimensions=9 and uncomment the two lines for the distance calculation, I get a pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources.

What could cause this behavior? What other issues can cause an out of resources error?

I use a Quadro FX 580. Here is a minimal(ish) example. For the unrolling in the real code I use templates.
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit

## preference
np.random.seed(20)
points = 512
dimensions = 8
nclusters = 1

## init data
data = np.random.randn(points, dimensions).astype(np.float32)
clusters = data[:nclusters]

## init cuda
kernel_code = """
// the kernel definition
__device__ __constant__ float centroids[16384];

__global__ void kmeans_kernel(float *idata, float *g_centroids,
                              int *cluster, float *min_dist,
                              int numClusters, int numDim) {
    int valindex = blockIdx.x * blockDim.x + threadIdx.x;
    float increased_distance, distance, minDistance;
    minDistance = 10000000;
    int nearestCentroid = 0;
    for (int k = 0; k < numClusters; k++) {
        distance = 0.0;
        increased_distance = idata[valindex*numDim]   - centroids[k*numDim];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+1] - centroids[k*numDim+1];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+2] - centroids[k*numDim+2];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+3] - centroids[k*numDim+3];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+4] - centroids[k*numDim+4];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+5] - centroids[k*numDim+5];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+6] - centroids[k*numDim+6];
        distance = distance + (increased_distance * increased_distance);
        increased_distance = idata[valindex*numDim+7] - centroids[k*numDim+7];
        distance = distance + (increased_distance * increased_distance);
        //increased_distance = idata[valindex*numDim+8] - centroids[k*numDim+8];
        //distance = distance + (increased_distance * increased_distance);
        if (distance < minDistance) {
            minDistance = distance;
            nearestCentroid = k;
        }
    }
    cluster[valindex] = nearestCentroid;
    min_dist[valindex] = sqrt(minDistance);
}
"""

mod = compiler.SourceModule(kernel_code)
centroids_adrs = mod.get_global('centroids')[0]
kmeans_kernel = mod.get_function("kmeans_kernel")
clusters_gpu = gpuarray.to_gpu(clusters)
cluster = gpuarray.zeros(points, dtype=np.int32)
min_dist = gpuarray.zeros(points, dtype=np.float32)
driver.memcpy_htod(centroids_adrs, clusters)
distortion = gpuarray.zeros(points, dtype=np.float32)
block_size = 512

## start kernel
kmeans_kernel(
    driver.In(data), driver.In(clusters), cluster, min_dist,
    np.int32(nclusters), np.int32(dimensions),
    grid=(points / block_size, 1),
    block=(block_size, 1, 1),
)
print cluster
print min_dist
You're running out of registers because your block_size (512) is too large. ptxas reports that your kernel uses 16 registers with the two extra lines commented out:
$ nvcc test.cu -Xptxas --verbose
ptxas info : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info : Used 16 registers, 24+16 bytes smem, 65536 bytes cmem[0]
Uncommenting the lines increases register use to 17 and produces an error at runtime:
$ nvcc test.cu -run -Xptxas --verbose
ptxas info : Compiling entry function '_Z13kmeans_kernelPfS_PiS_ii' for 'sm_10'
ptxas info : Used 17 registers, 24+16 bytes smem, 65536 bytes cmem[0]
error: too many resources requested for launch
The number of physical registers used by each thread of a kernel limits the size of the blocks you can launch at runtime. An SM 1.0 device has 8K registers available to a block of threads. Compare that to your kernel's register demands: 17 * 512 = 8704 > 8K. At 16 registers, your original commented kernel just squeaks by: 16 * 512 = 8192 == 8K.
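You can verify this arithmetic at runtime instead of reading ptxas output: PyCUDA exposes the kernel's register count and the device's per-block register limit. A minimal sketch (assuming mod was built with compiler.SourceModule as in the question):

dev = pycuda.autoinit.device
kmeans_kernel = mod.get_function("kmeans_kernel")
## registers available to one block, and registers burned per thread
regs_per_block = dev.get_attribute(driver.device_attribute.MAX_REGISTERS_PER_BLOCK)
regs_per_thread = kmeans_kernel.num_regs
## largest launchable block, rounded down to a whole warp (32 threads)
max_block = (regs_per_block // regs_per_thread) // 32 * 32
print regs_per_thread, regs_per_block, max_block

With 17 registers per thread and an 8K register file this gives a maximum block size of 480, which is why a 512-thread launch fails.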
When no architecture is specified, nvcc compiles kernels for an SM 1.0 device by default. PyCUDA may work the same way.
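You can experiment with this from PyCUDA directly: compiler.SourceModule accepts arch and options arguments that are forwarded to nvcc. A sketch (the arch string is an example; pick one your toolkit and GPU actually support):

## compile for an explicit architecture and ask ptxas for resource usage
mod = compiler.SourceModule(kernel_code,
                            arch="sm_20",
                            options=["--ptxas-options=-v"])

The --ptxas-options=-v flag makes the register counts shown above visible from a PyCUDA build as well.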
To fix your problem, you could either decrease block_size (to, say, 256) or find a way to configure PyCUDA to compile your kernel for an SM 2.0 device. SM 2.0 devices such as your Quadro FX 580 provide 32K registers, more than enough for your original block_size of 512.
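The block_size fix is a two-line change to the launch in the question; a sketch (assuming points stays a multiple of the new block size):

block_size = 256   ## 17 * 256 = 4352 registers per block, well under 8K
kmeans_kernel(
    driver.In(data), driver.In(clusters), cluster, min_dist,
    np.int32(nclusters), np.int32(dimensions),
    grid=(points / block_size, 1),   ## now 2 blocks of 256 threads
    block=(block_size, 1, 1),
)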