Homework for DCS316 Multi-core Programming.

Problem Description

Compute the entropy for each point of a 2D matrix using a 5x5 window. Matrix elements are integers in the range [0, 16). The image below shows the computation with a 3x3 window.

[Figure: entropy computed over a 3x3 window centered on each matrix element]

Platform

The Tianhe-2K supercomputer; each GPU node has two Intel Xeon Gold 6132 CPUs and four NVLink-connected Tesla V100 accelerators. This homework uses only one GPU.

Baseline

The baseline simply maps each CUDA thread to one point of the matrix and computes that point's entropy. It first counts the frequency of every value that appears in the window, then evaluates the entropy with the formula below.
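
Concretely, with $p_k = count[k]/eleCount$ the relative frequency of value $k$ among the valid elements of the window:

\[H = -\sum_{k=0}^{15} p_k\log_2(p_k)\]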

__global__ 
void entropy_kernel1(float* d_entropy, char* d_val, 
                     int m, int n)
{
    const int x = threadIdx.x + blockIdx.x * blockDim.x,
              y = threadIdx.y + blockIdx.y * blockDim.y;

    int count[16], eleCount = 0;
    memset(count, 0, sizeof(int)*16);
    for(int dx = -2;dx <= 2;dx++) {
        for(int dy = -2;dy <= 2;dy++) {
            int xx = x + dx,
                yy = y + dy;
            
            if(xx >= 0 && yy >= 0 && yy < m && xx < n) {
                count[d_val[yy*n+xx]]++;
                eleCount++;
            }
        }
    }
    eleCount = max(1, eleCount);

    float entropy = 0;
    for(int k = 0;k <= 15;k++) {
        if(count[k] > 0) {                      // skip zero counts: 0*log2(0) would be NaN
            float p = (float)count[k]/eleCount;
            entropy -= p*log2(p);
        }
    }
    if(y < m && x < n) {
        d_entropy[y*n+x] = entropy;
    }
}

Each test is repeated 10 times and the best result is kept. The baseline takes 171ms for a random 8192x8192 matrix.
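
A minimal sketch of the launch and timing harness I assume here (the 32x32 block shape and the cudaEvent-based timing are my assumptions; the report does not show its benchmark code):

float* d_entropy; char* d_val;
cudaMalloc(&d_entropy, sizeof(float) * 8192 * 8192);
cudaMalloc(&d_val, sizeof(char) * 8192 * 8192);
// ... fill d_val with random values in [0, 16) ...

dim3 block(32, 32);
dim3 grid((8192 + block.x - 1) / block.x, (8192 + block.y - 1) / block.y);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float best = 1e30f;
for (int run = 0; run < 10; run++) {            // repeat 10 times, keep the best time
    cudaEventRecord(start);
    entropy_kernel1<<<grid, block>>>(d_entropy, d_val, 8192, 8192);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    best = fminf(best, ms);
}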

Store log2(x) In A Lookup Table

Each value can appear at most 25 times in a 5x5 window. Therefore we can replace these log2(x) function calls with a table lookup:

\[p\log_2(p) = p\cdot(\mathrm{logTable}[cnt[i]]-\mathrm{logTable}[eleCount])\]

where cnt[i] is the number of times value i appears, and eleCount is the number of valid elements (those that fall inside the matrix) in the window.

for(int k = 0;k <= 15;k++) {
    float p = (float)count[k]/eleCount;
    entropy -= p*(d_logTable[count[k]]-d_logTable[eleCount]);
}
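
For completeness, a minimal sketch of how the table might be set up; the device-global declaration, the size of 26, and the zero entry for index 0 are my assumptions (the report does not show this part):

// Hypothetical host-side construction of the log2 lookup table.
// d_logTable is assumed to be a device-side global, matching kernels that
// use it without taking it as a parameter.
__device__ float d_logTable[26];               // counts range from 0 to 25

void initLogTable() {
    float logTable[26];
    logTable[0] = 0.0f;                        // a zero count contributes nothing
    for (int i = 1; i < 26; i++)
        logTable[i] = log2f((float)i);
    cudaMemcpyToSymbol(d_logTable, logTable, sizeof(logTable));
}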

This does not work: the computation now takes 184ms. However, we can place the table elsewhere, or more precisely, access it through a different path. Since the matrix is random, it is unlikely that any value appears more than 10, let alone 20, times in a 5x5 window. That means $locality$: table entries close to 0 (roughly 1-5) are accessed far more often.

So I read the table through the Texture Cache. Simply wrapping the access in __ldg() loads the data through the Texture Cache instead of the L1 Cache:

\[\text{Global Memory} \to \text{L2 Cache} \to \text{Tex Cache}\]

for(int k = 0;k <= 15;k++) {
    float p = (float)count[k]/eleCount;
    entropy -= p*(__ldg(&d_logTable[count[k]])-__ldg(&d_logTable[eleCount]));
}

The result is great: performance doubles, and the kernel now needs only 97ms.
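
A side note, and purely my assumption rather than something the report tried: when read-only data is passed as a kernel parameter, qualifying the pointer with const and __restrict__ usually lets the compiler route the loads through the same read-only (texture) cache without writing __ldg() by hand:

// Hypothetical signature, not from the report: const + __restrict__ tells the
// compiler the data is read-only and not aliased, so loads may use the
// read-only cache automatically.
__global__
void entropy_kernel_ro(float* d_entropy, const char* __restrict__ d_val,
                       int m, int n);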

Lower Precision

Since a value appears no more than 25 times in a window, there is no need to use int for count[]; char (int8) is sufficient as the data type:

    char count[16], eleCount = 0;
    memset(count, 0, sizeof(char)*16);

It works because the GPU can issue about 4x as many arithmetic operations on int8 data as on int. Even better, the smaller data takes less space in the cache, improving the hit rate. This modification yields roughly a 10x speedup over the plain lookup-table kernel above: computation time is now only 19ms, compared with its 184ms.

Put the Input Matrix into Shared Memory

This attempt is an obvious one, since most elements are read 25 times in a coalesced pattern. Note that each block now has a halo region: the threads in it only load matrix elements into shared memory and do not contribute to the result.

// bsize is the output tile size per block (28 in the measured kernels);
// the 32x32 thread block adds a 2-element halo on each side.
__shared__ int sd_val[32][32];
const int x = threadIdx.x-2 + blockIdx.x * bsize,
          y = threadIdx.y-2 + blockIdx.y * bsize;
if(x >= 0 && y >= 0 && y < m && x < n) {
    sd_val[threadIdx.y][threadIdx.x] = d_val[y*n+x];
} else {
    sd_val[threadIdx.y][threadIdx.x] = 16;
}
__syncthreads();

char count[17], eleCount = 0;
memset(count, 0, sizeof(char)*17);

if(threadIdx.x >= 2 && threadIdx.x <= 29 && threadIdx.y >= 2 && threadIdx.y <= 29) {
    for(int dx = -2;dx <= 2;dx++) {
        for(int dy = -2;dy <= 2;dy++) {
            char nowVal = sd_val[threadIdx.y+dy][threadIdx.x+dx];
            count[nowVal]++;
            eleCount += min(1, 16-nowVal);
        }
    }
}

This attempt does not lead to a speedup: the kernel now takes 25.4ms. Possible reasons: threads in the halo region consume the same resources as the others but contribute nothing to the result, and they add divergence to the execution; moreover, the global-memory reads were already coalesced and cache-friendly, so shared memory has little to offer here.

Put count[] Array into Registers

To be more aggressive, we can map the whole array to registers. Since registers are not addressable, this is done with a long switch construct:

if(threadIdx.x >= 2 && threadIdx.x <= 29 && threadIdx.y >= 2 && threadIdx.y <= 29) {
    for(int dx = -2;dx <= 2;dx++) {
        for(int dy = -2;dy <= 2;dy++) {
            char nowVal = sd_val[threadIdx.y+dy][threadIdx.x+dx];
            switch(nowVal) {
                case 0:  count[0]++;  break;
                case 1:  count[1]++;  break;
                case 2:  count[2]++;  break;
                case 3:  count[3]++;  break;
                case 4:  count[4]++;  break;
                case 5:  count[5]++;  break;
                case 6:  count[6]++;  break;
                case 7:  count[7]++;  break;
                case 8:  count[8]++;  break;
                case 9:  count[9]++;  break;
                case 10: count[10]++; break;
                case 11: count[11]++; break;
                case 12: count[12]++; break;
                case 13: count[13]++; break;
                case 14: count[14]++; break;
                case 15: count[15]++; break;
                case 16: count[16]++; break;
            }
            eleCount += min(1, 16-nowVal);
        }
    }
}

Now the count[] array lives entirely in the register file. Compiling with -Xptxas -v shows that the stack frame is now 0 bytes, compared with 72 bytes for kernel3; that stack frame is presumably where kernel3's count[] ends up.
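
Something like the following command produces the report below (the source file name is mine; sm_70 matches the V100):

nvcc -O3 -arch=sm_70 -Xptxas -v entropy.cu -o entropy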

ptxas info    : Compiling entry function '_Z15entropy_kernel4ILi28EEvPfPcii' for 'sm_70'
ptxas info    : Function properties for _Z15entropy_kernel4ILi28EEvPfPcii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 33 registers, 4096 bytes smem, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z15entropy_kernel3ILi28EEvPfPcii' for 'sm_70'
ptxas info    : Function properties for _Z15entropy_kernel3ILi28EEvPfPcii
72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 4096 bytes smem, 376 bytes cmem[0]

It is only a tradeoff: we are balancing faster memory access against a highly divergent execution path. This kernel needs 31.6ms to finish, roughly a 20% performance loss. However, the strategy could pay off if the value range were much more restricted, or if a larger data type (int/float) were required. A simple test shows that, since count[] now resides in the register file, nothing is lost by making it an int array, whereas kernel3 loses about 90% of its performance when switched to int.

Put count[] Array into Shared Memory

After a few CUDA projects, I have found that moving data from global memory to shared memory rarely helps when the access is already coalesced and cache-friendly. For a speedup, you need to create a more regular access pattern, taking full control of the per-block cache.

The writes to count[] are random, but we can place the array in shared memory without bank conflicts:

// idx is assumed to be the thread's linear index within the block,
// e.g. threadIdx.y*blockDim.x + threadIdx.x.
__shared__ int sd_count[16][bsize_y*bsize_x];
for(int i = 0;i <= 15;i++) sd_count[i][idx] = 0;
for(int dx = -2;dx <= 2;dx++) {
    for(int dy = -2;dy <= 2;dy++) {
        int xx = x + dx,
            yy = y + dy;
        
        if(xx >= 0 && yy >= 0 && yy < m && xx < n) { 
            sd_count[d_val[yy*n+xx]][idx]++;
            eleCount++;
        }
    }
}

Now each thread only accesses one column of the shared array, and that column maps to a single bank. There is no bank conflict among the 32 threads of a warp, even when they write to different rows. The kernel now completes in 8.6ms, another roughly 2x gain.
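
To spell out the bank argument (assuming 4-byte banks, idx as the linear thread index, and a block whose thread count $bsize\_x\cdot bsize\_y$ is a multiple of 32):

\[\mathrm{bank}\big(sd\_count[r][idx]\big) = (r\cdot bsize\_y\cdot bsize\_x + idx)\bmod 32 = idx \bmod 32\]

so the 32 consecutive idx values of a warp always land in 32 distinct banks, regardless of which row r each thread increments.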

Exchange x And y Loops

It is common to put the loop over y outside the loop over x for better cache reuse. It turns out to help on the GPU as well; note that the compiler will fully unroll the following loops (a constant 25 iterations).

for(int dy = -2;dy <= 2;dy++) {
    for(int dx = -2;dx <= 2;dx++) {
        int xx = x + dx,
            yy = y + dy;
        
        if(xx >= 0 && yy >= 0 && yy < m && xx < n) {
            sd_count[d_val[yy*n+xx]][idx]++;
            eleCount++;
        }
    }
}

A 10% speedup.

Convert Division to Multiplication

It is generally faster to multiply than to divide (by roughly 10x), on both CPUs and GPUs. So let's give the compiler a hint:

float p = (float)sd_count[k][idx] / eleCount;

change it to:

float p = (float)sd_count[k][idx] * (1.0f / eleCount);

The compiler now computes 1.0f/eleCount once, keeps it in a register, and reuses it for all the multiplications; by default (-prec-div=true) it will not perform this strength reduction on the division form itself, because replacing a division with a reciprocal multiply changes the rounding. This modification leads to a surprising 80% performance boost: the kernel now finishes in 4.2ms.

Put More Things into The Table

Remember that we need many $p\log(p)$ values? We can avoid the multiplication between p and log(p) with this equivalent formula:

\[H(X) = -\sum_i p(X=x_i)\log\big(p(X=x_i)\big) = -\sum_i\frac{n_i}{\sum_i n_i}\log\Big(\frac{n_i}{\sum_i n_i}\Big)\]

then

\[H(X) = \frac{1}{\sum_i n_i}\Big(\sum_i n_i\log\big(\textstyle\sum_i n_i\big) - \sum_i n_i\log(n_i)\Big) = \log\Big(\sum_i n_i\Big)-\frac{1}{\sum_i n_i}\sum_i n_i\log(n_i)\]

and the code becomes:

float entropy = 0;
for(int k = 0;k <= 15;k++) {
    entropy -= __ldg(&d_logTable[sd_count[k][idx]]);
}
entropy = entropy / eleCount + log2(eleCount);

Now we store $n\log(n)$ in the table and access it through the Texture Cache; a single division and a single log2() call do the rest. The optimized kernel completes in 2.9ms, a total speedup of nearly 60x over the naive GPU implementation!
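
A minimal sketch of the updated table, reusing the assumed d_logTable declaration from the earlier sketch, now holding $n\log_2(n)$ with the 0 entry taken as 0:

void initNLogNTable() {
    float table[26];
    table[0] = 0.0f;                           // 0*log2(0) is taken as 0
    for (int i = 1; i < 26; i++)
        table[i] = i * log2f((float)i);
    cudaMemcpyToSymbol(d_logTable, table, sizeof(table));
}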

Kernel    Execution Time
OpenMP         706.40 ms
Kernel1        168.92 ms
Kernel2         19.03 ms
Kernel3         25.47 ms
Kernel4         31.69 ms
Kernel5          2.90 ms