pid_m = (pid % grid_n)
pid_n = pid // grid_m
return pid_m, pid_n
The column-major ordering seems to give wrong results. In fact, many blocks don't do any computation: they return early because their tile indices are incorrect. So the kernel looks much faster, but its output is wrong. Is that right?
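One way to check this without a GPU is to enumerate the program IDs in plain Python and count which output tiles actually get covered. The sketch below is only an illustration: the helper names (`column_major_as_posted`, `column_major_fixed`, `covered_tiles`) are made up here, the grid sizes are arbitrary, and the early-return bounds check is an assumption about what the kernel does. The "fixed" variant takes the modulo by `grid_m`, which is the usual column-major mapping.

```python
def column_major_as_posted(pid, grid_m, grid_n):
    # Mapping exactly as in the snippet above: modulo by grid_n.
    pid_m = pid % grid_n
    pid_n = pid // grid_m
    return pid_m, pid_n


def column_major_fixed(pid, grid_m, grid_n):
    # Usual column-major mapping: consecutive pids walk down a column,
    # so both the modulo and the division use grid_m.
    pid_m = pid % grid_m
    pid_n = pid // grid_m
    return pid_m, pid_n


def covered_tiles(mapping, grid_m, grid_n):
    # Collect the distinct in-range tiles that some program would compute.
    # Assumption: a program whose tile index is out of range returns early.
    tiles = set()
    for pid in range(grid_m * grid_n):
        pid_m, pid_n = mapping(pid, grid_m, grid_n)
        if pid_m < grid_m and pid_n < grid_n:
            tiles.add((pid_m, pid_n))
    return tiles


grid_m, grid_n = 2, 4  # arbitrary non-square grid for illustration
total = grid_m * grid_n
posted = covered_tiles(column_major_as_posted, grid_m, grid_n)
fixed = covered_tiles(column_major_fixed, grid_m, grid_n)
print(f"as posted: {len(posted)} of {total} tiles covered")
print(f"fixed:     {len(fixed)} of {total} tiles covered")
```

With a non-square grid like this one, the posted mapping leaves some tiles uncomputed, which would be consistent with a kernel that finishes quickly but produces wrong output. On a square grid (`grid_m == grid_n`) the two mappings coincide, so the problem would not show up there.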