NVIDIA provides the NVIDIA Collective Communications Library (NCCL), which implements optimised communication primitives for NVIDIA GPUs. It can improve the performance of collective routines such as all-gather, all-reduce, and broadcast, as well as point-to-point communication.
For the past couple of days, I have been stuck while working on my dissertation. I was trying to replace MPI functions with NCCL in FluTAS, a code written in Fortran. My first approach was to create a toy example that sends and receives data allocated on GPUs. Here is a simplified version of the code in Fortran:
ncclGroupStart()
ncclSend(sendbuf, ..., stream)
ncclRecv(recvbuf, ..., stream)
ncclGroupEnd()
cudaMemcpy(hostbuf, recvbuf, ...)
print *, hostbuf
This code ran successfully and produced the correct result. I then moved on to the actual code I needed to change. When I ran it, the program exited with a message saying that the simulation divergence was incorrect. I know from experience that this means something is wrong with my changes.
Luckily, while scouring the NCCL documentation, I came across this passage: “The NCCL call returns when the operation has been effectively enqueued to the given stream or returns an error. The collective operation is then executed asynchronously on the CUDA device.” (from https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/streams.html#cuda-stream-semantics).
My understanding of ncclGroup was very wrong! I thought ncclGroupEnd acted like a barrier that waits until the NCCL primitives are complete. It turns out that the calls are only enqueued on the stream at that point, so we still have to synchronise on the stream before using the results. Calling cudaStreamSynchronize(stream) after ncclGroupEnd solved the bug.
Also, the unfortunate thing about the toy example was that calling cudaMemcpy after ncclGroupEnd implicitly synchronised the stream. That is why the toy example did not show the same issue.
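To make the fix concrete, here is a minimal sketch of the corrected pattern. It is written in CUDA C++ against the NCCL C API rather than in Fortran, and it is not the FluTAS code. It assumes a single process that can see two GPUs; the buffer length, the device list, and the file name in the build comment are hypothetical. The streams are created as non-blocking on purpose, so that (as far as I understand the CUDA stream rules) a later cudaMemcpy does not synchronise with them implicitly and the explicit cudaStreamSynchronize is what guarantees the data has arrived.

```cuda
// Build (assumed setup): nvcc nccl_sendrecv.cu -lnccl
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));  \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

#define NCCL_CHECK(call)                                              \
  do {                                                                \
    ncclResult_t res_ = (call);                                       \
    if (res_ != ncclSuccess) {                                        \
      fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(res_));  \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

int main() {
  constexpr int nDev = 2;      // assumption: two GPUs visible to this process
  constexpr size_t count = 4;  // hypothetical message length
  int devs[nDev] = {0, 1};
  ncclComm_t comms[nDev];
  cudaStream_t streams[nDev];
  float* sendbuf[nDev];
  float* recvbuf[nDev];

  // One communicator per device, all owned by this single process.
  NCCL_CHECK(ncclCommInitAll(comms, nDev, devs));

  for (int i = 0; i < nDev; ++i) {
    CUDA_CHECK(cudaSetDevice(devs[i]));
    // Non-blocking stream: no implicit synchronisation with the default stream.
    CUDA_CHECK(cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking));
    CUDA_CHECK(cudaMalloc((void**)&sendbuf[i], count * sizeof(float)));
    CUDA_CHECK(cudaMalloc((void**)&recvbuf[i], count * sizeof(float)));
    float host[count];
    for (size_t k = 0; k < count; ++k) host[k] = float(i + 1);
    CUDA_CHECK(cudaMemcpy(sendbuf[i], host, sizeof(host), cudaMemcpyHostToDevice));
    // Make sure the send buffer is fully populated before NCCL reads it.
    CUDA_CHECK(cudaDeviceSynchronize());
  }

  // Each device sends to, and receives from, the other one.
  // ncclGroupEnd only guarantees that the operations have been enqueued.
  NCCL_CHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i) {
    int peer = 1 - i;
    NCCL_CHECK(ncclSend(sendbuf[i], count, ncclFloat, peer, comms[i], streams[i]));
    NCCL_CHECK(ncclRecv(recvbuf[i], count, ncclFloat, peer, comms[i], streams[i]));
  }
  NCCL_CHECK(ncclGroupEnd());

  // The actual fix: wait for the enqueued transfers to finish.
  for (int i = 0; i < nDev; ++i) {
    CUDA_CHECK(cudaSetDevice(devs[i]));
    CUDA_CHECK(cudaStreamSynchronize(streams[i]));
  }

  // Only now is it safe to read recvbuf.
  for (int i = 0; i < nDev; ++i) {
    float host[count];
    CUDA_CHECK(cudaSetDevice(devs[i]));
    CUDA_CHECK(cudaMemcpy(host, recvbuf[i], sizeof(host), cudaMemcpyDeviceToHost));
    printf("device %d received %.1f\n", devs[i], host[0]);
    CUDA_CHECK(cudaFree(sendbuf[i]));
    CUDA_CHECK(cudaFree(recvbuf[i]));
    CUDA_CHECK(cudaStreamDestroy(streams[i]));
    NCCL_CHECK(ncclCommDestroy(comms[i]));
  }
  return 0;
}
```

Removing the cudaStreamSynchronize loop from this sketch should reproduce the kind of race I hit in the real code, although whether it actually shows up as wrong output depends on timing.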