Unable to get the same values as NumPy element-wise matrix multiplication using Numba

Question:

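I have been playing around with numba and trying to implement a simple element-wise matrix multiplication. When I use 'vectorize' I get the same result as the NumPy multiplication, but when I use 'cuda.jit' the results differ: many of the values are zero. I am providing a minimal working example below. Any help with the problem would be appreciated. I am using numba 0.35.0 and Python 2.7.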

from __future__ import division 
from __future__ import print_function 

import numpy as np 

from numba import vectorize, cuda, jit 

M = 80 
N = 40 
P = 40 

# Set the number of threads in a block 
threadsperblock = 32 

# Calculate the number of thread blocks in the grid 
blockspergrid = (M*N*P + (threadsperblock - 1)) // threadsperblock 

@vectorize(['float32(float32,float32)'], target='cuda') 
def VectorMult3d(a, b): 
    return a*b 

@cuda.jit('void(float32[:, :, :], float32[:, :, :], float32[:, :, :])') 
def mult_gpu_3d(a, b, c): 
    [x, y, z] = cuda.grid(3) 
    if x < c.shape[0] and y < c.shape[1] and z < c.shape[2]: 
        c[x, y, z] = a[x, y, z] * b[x, y, z] 

if __name__ == '__main__': 
    A = np.random.normal(size=(M, N, P)).astype(np.float32) 
    B = np.random.normal(size=(M, N, P)).astype(np.float32) 

    numpy_C = A*B 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.device_array((M,N,P), dtype=np.float32) # cuda.device_array_like(A_gpu) 

    mult_gpu_3d[blockspergrid,threadsperblock](A_gpu,B_gpu,C_gpu) 

    cudajit_C = C_gpu.copy_to_host() 

    print('------- using cuda.jit -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,cudajit_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,cudajit_C)), M*N*P)) 
    print('------- using cuda.jit -------\n') 


    vectorize_C_gpu = VectorMult3d(A_gpu, B_gpu) 
    vectorize_C = vectorize_C_gpu.copy_to_host() 

    print('------- using vectorize -------') 
    print('Is close?: {}'.format(np.allclose(numpy_C,vectorize_C))) 
    print('{} of {} elements are close'.format(np.sum(np.isclose(numpy_C,vectorize_C)), M*N*P)) 
    print('------- using vectorize -------\n') 

    import numba; print("numba version: "+numba.__version__) 

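Here is how you can debug this.

Consider a smaller, simplified example:

  • Reduce the array size, e.g. to (2, 3, 1), so that you can actually print the values and read them
  • Use simple, deterministic contents, e.g. all ones, so that runs can be compared
  • Add extra kernel arguments for debugging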

from __future__ import (division, print_function) 

import numpy as np 
from numba import cuda 

M = 2 
N = 3 
P = 1 

threadsperblock = 1 
blockspergrid = (M * N * P + (threadsperblock - 1)) // threadsperblock 


@cuda.jit 
def mult_gpu_3d(a, b, c, grid_ran, grid_multed): 
    grid = cuda.grid(3) 
    x, y, z = grid 

    grid_ran[x] = 1 

    if (x < c.shape[0]) and (y < c.shape[1]) and (z < c.shape[2]): 
        grid_multed[x] = 1 
        c[grid] = a[grid] * b[grid] 


if __name__ == '__main__': 
    A = np.ones((M, N, P), np.int32) 
    B = np.ones((M, N, P), np.int32) 

    A_gpu = cuda.to_device(A) 
    B_gpu = cuda.to_device(B) 
    C_gpu = cuda.to_device(np.zeros_like(A)) 

    # Records whether the thread at index i has run 
    grid_ran = cuda.to_device(np.zeros([blockspergrid], np.int32)) 

    # Records whether the thread at index i has performed the multiplication 
    grid_multed = cuda.to_device(np.zeros(blockspergrid, np.int32)) 

    mult_gpu_3d[blockspergrid, threadsperblock](
        A_gpu, B_gpu, C_gpu, grid_ran, grid_multed) 

    print("grid_ran.shape : ", grid_ran.shape) 
    print("grid_multed.shape : ", grid_multed.shape) 
    print("C_gpu.shape  : ", C_gpu.shape) 

    print("grid_ran   : ", grid_ran.copy_to_host()) 
    print("grid_multed  : ", grid_multed.copy_to_host()) 

    C = C_gpu.copy_to_host() 
    print("C transpose flat : ", C.T.flatten()) 
    print("C     : \n", C) 

Output:

grid_ran.shape : (6,) 
grid_multed.shape : (6,) 
C_gpu.shape  : (2, 3, 1) 
grid_ran   : [1 1 1 1 1 1] 
grid_multed  : [1 1 0 0 0 0] 
C transpose flat : [1 1 0 0 0 0] 
C     : 
[[[1] 
    [0] 
    [0]] 

[[1] 
    [0] 
    [0]]] 

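You can see that the shape of the device grid does not match the shape of the arrays: the grid is flat (M*N*P), while the arrays are all 3-dimensional (M, N, P). That is, the first dimension of the grid has indices in the range 0..M*N*P-1 (0..5, 6 values in total in my example), while the first dimension of the arrays only has indices in 0..M-1 (0..1, 2 values in total in my example). This mistake would typically lead to out-of-bounds access, but you have protected your kernel with a conditional that cuts off the offending threads: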

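if (x < c.shape[0]) 

This line prevents threads with indices above M-1 (1 in my example) from running (well, sort of [1]), which is why no values are written for them and you get many zeros in the resulting array.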
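To make the effect concrete for the sizes in the question, here is a small CPU-only sketch. It does not call Numba; the helper `grid3_for_1d_launch` is a hypothetical stand-in that mimics what `cuda.grid(3)` yields when the kernel is launched with a scalar grid and block size, where only x varies and y and z stay at 0 (which matches the debug output above).

```python
from __future__ import division, print_function

M, N, P = 80, 40, 40
threadsperblock = 32
blockspergrid = (M * N * P + threadsperblock - 1) // threadsperblock


def grid3_for_1d_launch(block_idx, thread_idx, block_dim):
    """Hypothetical helper: emulate cuda.grid(3) under a 1D launch.

    Only the x index varies; y and z are always 0.
    """
    return block_idx * block_dim + thread_idx, 0, 0


written = set()
for block in range(blockspergrid):
    for thread in range(threadsperblock):
        x, y, z = grid3_for_1d_launch(block, thread, threadsperblock)
        # Same bounds check as in the kernel from the question.
        if x < M and y < N and z < P:
            written.add((x, y, z))

print("threads launched:", blockspergrid * threadsperblock)
print("elements written:", len(written), "of", M * N * P)
```

Under these assumptions only the elements c[x, 0, 0] are ever written, so the rest of the output array keeps whatever the device buffer happened to contain.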

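Possible solutions:

  • In general, you can use a multidimensional kernel grid configuration, i.e. a 3D vector instead of a scalar for blockspergrid [2].
  • In particular, since element-wise multiplication is a map operation and does not depend on the array shapes, you can flatten all 3 arrays into 1D arrays, run your kernel on a 1D grid, and then reshape the result back [3], [4].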
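The second option can be sketched on the CPU as follows. This is only an emulation under stated assumptions: plain Python lists stand in for the flattened arrays, `mult_1d` is a hypothetical name for the 1D kernel body, and the loop plays the role of the kernel launch. On the GPU the index would come from `cuda.grid(1)`, and the flattening and reshaping would be done with NumPy's `ravel()` and `reshape()`.

```python
from __future__ import division, print_function

M, N, P = 2, 3, 1
size = M * N * P

# Flattened inputs; with NumPy this would be A.ravel() and B.ravel().
a = [float(i + 1) for i in range(size)]
b = [2.0] * size
c = [0.0] * size

threadsperblock = 4
blockspergrid = (size + threadsperblock - 1) // threadsperblock


def mult_1d(i, a, b, c):
    """Body of a 1D kernel; on the GPU, i would come from cuda.grid(1)."""
    if i < len(c):  # guard the padding threads in the last block
        c[i] = a[i] * b[i]


# Emulate the launch: one call per thread in the 1D grid.
for i in range(blockspergrid * threadsperblock):
    mult_1d(i, a, b, c)

# Restore the 3D shape (row-major); with NumPy this would be c.reshape(M, N, P).
c3d = [[[c[(x * N + y) * P + z] for z in range(P)]
        for y in range(N)]
       for x in range(M)]
print(c3d)
```

Because the grid and the data are both 1D here, the bounds check only has to trim the few extra threads in the last block instead of discarding almost all of them.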

References:


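Thanks. Your explanation is very clear. I took your suggestion of using a multidimensional kernel grid configuration, something like the following: 'threadsperblock = (4, 4, 4); blockspergrid_x = np.int(np.ceil(M / threadsperblock[0]))'. I set blockspergrid_y and blockspergrid_z in the same way, and then 'blockspergrid = (blockspergrid_x, blockspergrid_y, blockspergrid_z)'. Finally, I call 'mult_gpu_3d' with 'blockspergrid' and 'threadsperblock'. The references you provided are great too! Thanks again.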
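For reference, the launch configuration described in this comment can be written as a minimal sketch like the one below. The helper name `blocks_per_grid` is made up for illustration, and the (4, 4, 4) block shape is just the value from the comment. It uses integer ceiling division rather than `np.int(np.ceil(...))`, partly because recent NumPy releases no longer provide `np.int`.

```python
from __future__ import division, print_function


def blocks_per_grid(shape, threadsperblock):
    """Hypothetical helper: ceiling-divide each array dimension
    by the matching block dimension."""
    return tuple((n + t - 1) // t for n, t in zip(shape, threadsperblock))


M, N, P = 80, 40, 40
threadsperblock = (4, 4, 4)  # 64 threads per block, as in the comment
blockspergrid = blocks_per_grid((M, N, P), threadsperblock)
print(blockspergrid)

# The kernel from the question can then be launched unchanged:
#     mult_gpu_3d[blockspergrid, threadsperblock](A_gpu, B_gpu, C_gpu)
```

With a 3D grid, cuda.grid(3) yields a distinct (x, y, z) for every element, and the existing bounds check only trims threads in partially filled blocks.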