RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
As disadvantage, the target node is not notified of the completion of the request, i.e. 1-sided communication. The common way to notify it is to change a memory byte when the data has been delivered, but it requires the target to poll on this byte. Not only does this polling consume CPU cycles, but the memory footprint and the latency increases linearly with the number of possible other nodes, which limits use of RDMA in High-Performance Computing (HPC). Moreover, RDMA reduces network protocol overhead, leading to improvements in communication latency. Reductions in protocol overhead can increase a network’s ability to move data quickly, allowing applications to get the data they need faster, in turn leading to more scalable clusters. However, one must be aware of the tradeoff between this reduction in network protocol overhead and additional overhead that may be incurred on each node due to the need for pinning virtual memory pages. In particular, zero-copy RDMA protocols require that the memory pages involved in a transaction be pinned, at least for the duration of the transfer. If this is not done, RDMA pages might be paged out to disk and replaced with other data by the operating system, causing the DMA engine (which knows nothing of the virtual memory system maintained by the operating system) to send the wrong data.
As advantage, the Send/Recv model used by other zero-copy HPC interconnects, such as Myrinet or Quadrics, does not have the 1-sided communication problem or the memory paging problem described above, yet provides comparable reductions in latency when used in conjunction with HPC communication frameworks that expose the Send/Recv model to the programmer.
See the following talk by Joel Scherpelz from Nvidia who presents: “RDMA for Heterogeneous Parallel Computing”