Unable to get the same values as NumPy element-wise matrix multiplication using Numba

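Question:

I have been playing around with numba and trying to implement a simple element-wise matrix multiplication. When using 'vectorize' I get the same result as the numpy multiplication, but when I use 'cuda.jit' the results are not the same. Many of them are zeros. I am providing a minimal working example for this purpose. Any help with the problem will be appreciated. I am using numba 0.35.0 and Python 2.7.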

from __future__ import division 
from __future__ import print_function 

import numpy as np 

from numba import vectorize, cuda, jit 

M = 80 
N = 40 
P = 40 

# Set the number of threads in a block 
threadsperblock = 32 

# Calculate the number of thread blocks in the grid 
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock 

@vectorize(['float32(float32,float32)'], target='cuda') 
def VectorMult3d(a, b): 
    return a*b 

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])') 
def mult_gpu_3d(a, b, c): 
    [x, y, z] = cuda.grid(3) 
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]: 
        c[x, y, z] = a[x, y, z] * b[x, y, z]

if __name__ == '__main__': 
    A = np.random.normal(size=(M, N, P)).astype(np.float32) 
    B = np.random.normal(size=(M, N, P)).astype(np.float32) 

    numpy_C = A*B 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu) 

    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu) 

    cudajit_C = C_gpu.copy_to_host() 

    print('------- using cuda.jit -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P)) 
    print('------- using cuda.jit -------\n') 


    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu) 
    vectorize_C = vectorize_C_gpu.copy_to_host() 

    print('------- using vectorize -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P)) 
    print('------- using vectorize -------\n') 

    import numba; print("numba version: "+numba.__version__)

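Answer

Here is how you can debug this.

Consider a smaller and simplified example with:

reduced array sizes, e.g. (2, 3, 1), so that you can actually print the values and be able to read them
simple and deterministic contents, e.g. "all ones", so that you can compare across runs
additional kernel arguments for debugging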

from __future__ import (division, print_function) 

import numpy as np 
from numba import cuda 

M = 2 
N = 3 
P = 1 

threadsperblock = 1 
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock 


@cuda.jit 
def mult_gpu_3d(a, b, c, grid_ran, grid_multed): 
    grid = cuda.grid(3) 
    x, y, z = grid 

    grid_ran[x] = 1 

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]):
        grid_multed[x] = 1
        c[grid] = a[grid] * b[grid]


if __name__ == '__main__': 
    A = np.ones((M, N, P), np.int32) 
    B = np.ones((M, N, P), np.int32) 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.to_device(np.zeros_like(A)) 

    # Tells whether the thread at index i has run
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32))

    # Tells whether the thread at index i has performed the multiplication
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32))

    mult_gpu_3d[blockspergrid, threadsperblock](
        A_gpu, B_gpu, C_gpu, grid_ran, grid_multed)

    print("grid_ran.shape : ", grid_ran.shape) 
    print("grid_multed.shape : ", grid_multed.shape) 
    print("C_gpu.shape  : ", C_gpu.shape) 

    print("grid_ran   : ", grid_ran.copy_to_host()) 
    print("grid_multed  : ", grid_multed.copy_to_host()) 

    C = C_gpu.copy_to_host() 
    print("C transpose flat : ", C.T.flatten()) 
    print("C     : \n", C)

Output:

grid_ran.shape : (6,) 
grid_multed.shape : (6,) 
C_gpu.shape  : (2, 3, 1) 
grid_ran   : [1 1 1 1 1 1] 
grid_multed  : [1 1 0 0 0 0] 
C transpose flat : [1 1 0 0 0 0] 
C     : 
[[[1]
  [0]
  [0]]

 [[1]
  [0]
  [0]]]

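You can see that the device grid shape does not correspond to the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, 6 values in total in my example), while the first dimension of the arrays only covers 0..M-1 (0..1, 2 values in total in my example). This mistake would typically lead to out-of-bounds access, but you have protected your kernel with a conditional that cuts off the offending threads: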

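if (x < c.shape[0])

This check does not allow threads with indices above M-1 (1 in my example) to run (well, sort of [1]). In addition, with a 1D launch the y and z components returned by cuda.grid(3) are always 0, so only the elements c[x, 0, 0] are ever written. That is why most values are never written and you get many zeros in the resulting array.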
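To make the arithmetic concrete, here is a small CPU-only sketch that emulates which threads pass the bounds check under a 1D launch. The helper name `surviving_threads` is made up for illustration, and the emulation assumes (as the debug output above suggests) that the unused y and z grid indices are always 0.

```python
def surviving_threads(shape, total_threads):
    """Emulate a 1D launch where cuda.grid(3) yields (x, 0, 0) per thread.

    Hypothetical helper, not part of numba: it only replays the index
    arithmetic and the bounds check on the CPU.
    """
    m, n, p = shape
    written = []
    for x in range(total_threads):
        y, z = 0, 0  # assumption: unused grid dimensions contribute index 0
        if x < m and y < n and z < p:
            written.append((x, y, z))
    return written


# Debug example from the answer: 6 blocks of 1 thread, arrays of shape (2, 3, 1)
small = surviving_threads((2, 3, 1), 6)
print(small)

# Original question: 4000 blocks of 32 threads, arrays of shape (80, 40, 40)
large = surviving_threads((80, 40, 40), 4000 * 32)
print("{} of {} elements written".format(len(large), 80 * 40 * 40))
```

In the small case only (0, 0, 0) and (1, 0, 0) survive, matching the two ones in the printed result. In the original configuration only 80 of the 128000 elements would be written.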

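Possible solutions:

In general, you can use a multi-dimensional kernel grid configuration, i.e. a 3D vector instead of a scalar for blockspergrid [2].
In particular, since element-wise multiplication is a map operation and does not depend on the array shapes, you can flatten all 3 arrays into 1D arrays, run your kernel on a 1D grid, and then reshape the result back [3], [4].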
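The second option can be sketched on the host with NumPy alone. The function `mult_1d` below is a hypothetical CPU stand-in for a 1D kernel (one loop iteration per thread), not numba code; it only illustrates that flattening, multiplying in 1D, and reshaping gives the same result as the 3D element-wise product.

```python
import numpy as np


def mult_1d(a, b, c):
    """CPU stand-in for a 1D kernel: iteration i plays the role of thread i."""
    for i in range(c.shape[0]):
        c[i] = a[i] * b[i]


A = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
B = np.full((2, 3, 1), 2.0, dtype=np.float32)

# Flatten the inputs; ravel returns a view for contiguous arrays
a_flat = A.ravel()
b_flat = B.ravel()
c_flat = np.zeros_like(a_flat)

mult_1d(a_flat, b_flat, c_flat)

# Restore the original 3D shape
C = c_flat.reshape(A.shape)
print(np.allclose(C, A * B))
```

With this layout the flat launch size M*N*P from the question matches the length of the 1D arrays, so a single index per thread is enough.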

References:

[1] How to understand “All threads in a warp execute the same instruction at the same time.” in GPU?
[2] Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation)
[3] numpy.ndarray.flatten
[4] numpy.ravel

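Thanks. Your explanation is very clear. I took your suggestion of using the multi-dimensional kernel grid configuration. Something like the following: 'threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M/threadsperblock[0]))'. Set blockspergrid_y and blockspergrid_z similarly, and then 'blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)'. Finally, call 'mult_gpu_3d' with 'blockspergrid' and 'threadsperblock'. The references you provided are great too! Thanks again.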
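The configuration described in this comment can be sketched as follows. The helper `blocks_per_grid` is a made-up name; it uses integer ceiling division instead of np.int(np.ceil(...)), since np.int is no longer available in recent NumPy releases. The kernel launch itself is left as a comment because it needs numba and a CUDA device.

```python
def blocks_per_grid(shape, threads_per_block):
    """Ceiling division per axis so the grid covers every array element."""
    return tuple((dim + tpb - 1) // tpb
                 for dim, tpb in zip(shape, threads_per_block))


M, N, P = 80, 40, 40
threadsperblock = (4, 4, 4)
blockspergrid = blocks_per_grid((M, N, P), threadsperblock)
print(blockspergrid)

# With numba and a CUDA device, the launch from the question would become:
# mult_gpu_3d[blockspergrid, threadsperblock](A_gpu, B_gpu, C_gpu)
```

For the sizes in the question this gives (20, 10, 10) blocks. The bounds check inside the kernel is still needed whenever a dimension is not an exact multiple of the block size.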
