Coursera

[GPU programming] Multidimensional Data and Computation on the GPU Quiz

star.candy 2022. 7. 18. 23:29
Question 1

Which statement is not true about the syntax below:

kernel<<<a,b>>>(args)

The value(s) in args can only be the name(s) of kernel(s).

This is false, as the value(s) that args represents are arguments passed to the kernel, not kernel names.

Question 2

If you would like a 2-dimensional data/execution mapping of threads in your blocks for your kernel, which of the following would calculate the index of the thread and data? Note: N is defined as 512.

int x = blockIdx.x * blockDim.x + threadIdx.x;

int y = blockIdx.y * blockDim.y + threadIdx.y;

int index = x + y * N;

Question 3

If you would like a 3-dimensional data/execution mapping of threads in your blocks for your kernel, which of the following would calculate the index of the thread and data? Note: N is defined as 512.

int index = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;

The index variable combines the x, y, and z offsets from the threadIdx variables with the width, height, and depth of the 3D thread-block space: z selects a plane, y selects a row within it, and x selects the column.

 

Question 4

Given the following code, how many threads will be executed in a single block (the variable threadsPerBlockPerDimension)? Note: math.pow(x,2) returns x squared.

#include <math.h>
#define N 16384
int blocksPerGridPerDimension = 2;
dim3 gridLayout(blocksPerGridPerDimension, blocksPerGridPerDimension);
int threadsPerBlockPerDimension = N / math.pow(blocksPerGridPerDimension, 2);

4096

16384/(2*2)=4096.

 

Question 5

Given the following code, how would you change the initialization of the dim3 block variable to create blocks of 256 threads? You are able to change the dimensions of the dim3 for the block and the associated code to retrieve the thread index.

dim3 grid(32,32);
dim3 block(1,1);
kernel<<<grid,block>>>(args);

dim3 block(8,4,8)

This is correct, as 8 * 4 * 8 = 256.

 

Question 6

G​iven the following kernel code (executable from host code), choose a bolded area to correct and the correct code snippet from the options below:

__device__ matrixAdd(int *matrix_a, int *matrix_b, int *matrix_c){
    int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
    int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z)
                 + (threadIdx.z * (blockDim.x * blockDim.y))
                 + (threadIdx.y * blockDim.z)
                 + threadIdx.x;
    c[threadId] = a[blockId] + b[threadId];
}

__global__ matrixAdd(int *matrix_a, int *matrix_b, int *matrix_c){

This is a correct answer: the __device__ qualifier marks functions callable only from GPU code, so a kernel launched from the host must be declared __global__.

+ (threadIdx.y * blockDim.x) + threadIdx.x;

This is a correct answer because the original statement computed the y thread offset using the z dimension of the block, which yields a wrong offset whenever the block's y and z dimensions differ in size.