One-sided vs Two-sided RDMA: Implementation & Benchmark API
We implement a benchmark set that allows us to measure throughput (Mops) and bandwidth (GiB/s) for:
- One-sided RDMA READ
- One-sided RDMA WRITE
- Two-sided SEND/RECV
The framework can tune message size, iteration count, and in-flight window/recv depth to see how performance scales.
Build
$ cd docs/code_examples/code/one_side_vs_two_side
$ gcc bench_server_broadcom.c -o bench_server_broadcom -lrdmacm -libverbs
$ gcc bench_client_broadcom.c -o bench_client_broadcom -lrdmacm -libverbs
Server API
- --mode: read exposes a buffer for client RDMA READ; write exposes a buffer for client RDMA WRITE; send preposts receives to accept SENDs.
- --msg: message size (bytes).
- --iters: total operations to expect.
- --recv-depth: number of receives preposted in SEND mode (must cover client window).
Client API
- --mode: read issues one-sided RDMA READs; write issues one-sided RDMA WRITEs; send does two-sided SENDs.
- --msg: message size (bytes); must not exceed server-advertised buffer.
- --iters: total operations to issue.
- --window: outstanding WRs allowed in flight (match server recv-depth in SEND mode).
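A representative pairing of the two binaries (SEND mode shown; any connection/addressing arguments your build takes are omitted here, so treat this as a shape sketch rather than a copy-paste command):

```shell
# Server: prepost enough receives to cover the client's window.
./bench_server_broadcom --mode send --msg 4096 --iters 200000 --recv-depth 512

# Client: window must not exceed the server's recv-depth in SEND mode.
./bench_client_broadcom --mode send --msg 4096 --iters 200000 --window 512
```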
Test results (CPU RAM)
We explore the impact of message size on Mops and bandwidth for one-sided and two-sided RDMA.
We compare WRITE and SEND between two machines connected over RoCE.
We sweep the message size from 32 bytes up to 1 MiB.
The queue depth (window) also has a large impact on the results, since it bounds the number of operations in flight. We test both window=64 and window=512 to see how the system performs with a shallow and a deep queue.
Here are our results:
| experiment | mode | msg | window | iters | mops | gib |
|---|---|---|---|---|---|---|
| msg_sweep | write | 32 | 64 | 200000 | 3.9 | 0.12 |
| msg_sweep | send | 32 | 64 | 200000 | 3.92 | 0.12 |
| msg_sweep | write | 64 | 64 | 200000 | 3.9 | 0.23 |
| msg_sweep | send | 64 | 64 | 200000 | 3.92 | 0.23 |
| msg_sweep | write | 128 | 64 | 200000 | 3.94 | 0.47 |
| msg_sweep | send | 128 | 64 | 200000 | 3.85 | 0.46 |
| msg_sweep | write | 256 | 64 | 200000 | 3.87 | 0.92 |
| msg_sweep | send | 256 | 64 | 200000 | 3.9 | 0.93 |
| msg_sweep | write | 512 | 64 | 200000 | 3.88 | 1.85 |
| msg_sweep | send | 512 | 64 | 200000 | 3.84 | 1.83 |
| msg_sweep | write | 1024 | 64 | 200000 | 3.93 | 3.75 |
| msg_sweep | send | 1024 | 64 | 200000 | 3.87 | 3.69 |
| msg_sweep | write | 2048 | 64 | 200000 | 3.83 | 7.31 |
| msg_sweep | send | 2048 | 64 | 200000 | 3.8 | 7.25 |
| msg_sweep | write | 4096 | 64 | 200000 | 3.97 | 15.16 |
| msg_sweep | send | 4096 | 64 | 200000 | 3.76 | 14.36 |
| msg_sweep | write | 8192 | 64 | 200000 | 3.64 | 27.77 |
| msg_sweep | send | 8192 | 64 | 200000 | 3.64 | 27.76 |
| msg_sweep | write | 16384 | 64 | 200000 | 1.8 | 27.54 |
| msg_sweep | send | 16384 | 64 | 200000 | 1.8 | 27.41 |
| msg_sweep | write | 32768 | 64 | 200000 | 0.93 | 28.53 |
| msg_sweep | send | 32768 | 64 | 200000 | 0.93 | 28.52 |
| msg_sweep | write | 65536 | 64 | 200000 | 0.48 | 29.11 |
| msg_sweep | send | 65536 | 64 | 200000 | 0.48 | 29.1 |
| msg_sweep | write | 131072 | 64 | 200000 | 0.24 | 29.39 |
| msg_sweep | send | 131072 | 64 | 200000 | 0.24 | 29.39 |
| msg_sweep | write | 262144 | 64 | 200000 | 0.12 | 29.52 |
| msg_sweep | send | 262144 | 64 | 200000 | 0.12 | 29.48 |
| msg_sweep | write | 524288 | 64 | 200000 | nan | nan |
| msg_sweep | send | 524288 | 64 | 200000 | 0.06 | 29.64 |
| msg_sweep | write | 1048576 | 64 | 200000 | 0.03 | 29.65 |
| msg_sweep | send | 1048576 | 64 | 200000 | nan | nan |
| msg_sweep | write | 256 | 512 | 200000 | 3.92 | 0.93 |
| msg_sweep | send | 256 | 512 | 200000 | 3.83 | 0.91 |
| msg_sweep | write | 512 | 512 | 200000 | 3.86 | 1.84 |
| msg_sweep | send | 512 | 512 | 200000 | 3.86 | 1.84 |
| msg_sweep | write | 1024 | 512 | 200000 | 3.88 | 3.7 |
| msg_sweep | send | 1024 | 512 | 200000 | 3.88 | 3.7 |
| msg_sweep | write | 2048 | 512 | 200000 | 3.92 | 7.47 |
| msg_sweep | send | 2048 | 512 | 200000 | 3.91 | 7.46 |
| msg_sweep | write | 4096 | 512 | 200000 | 3.87 | 14.75 |
| msg_sweep | send | 4096 | 512 | 200000 | 3.88 | 14.8 |
| msg_sweep | write | 8192 | 512 | 200000 | 3.68 | 28.1 |
| msg_sweep | send | 8192 | 512 | 200000 | 3.65 | 27.86 |
| msg_sweep | write | 16384 | 512 | 200000 | 1.89 | 28.87 |
| msg_sweep | send | 16384 | 512 | 200000 | 1.89 | 28.86 |
| msg_sweep | write | 32768 | 512 | 200000 | 0.96 | 29.25 |
| msg_sweep | send | 32768 | 512 | 200000 | 0.96 | 29.25 |
| msg_sweep | write | 65536 | 512 | 200000 | nan | nan |
| msg_sweep | send | 65536 | 512 | 200000 | 0.48 | 29.46 |
| msg_sweep | write | 131072 | 512 | 200000 | 0.24 | 29.56 |
| msg_sweep | send | 131072 | 512 | 200000 | 0.24 | 29.56 |




Result analysis (CPU)
Regardless of window size, both settings show that increasing the message size initially has little impact on Mops while bandwidth grows proportionally. However, once the message size reaches roughly $2^{13}$ bytes (8 KiB), further increases no longer improve bandwidth and instead significantly decrease Mops.
The experiment also shows that while a very small window can be a limitation, once it is large enough, increasing it further does not affect performance.
Test results (GPU RAM)
This tutorial is mainly about GPU RDMA, so we also test the performance when using GPU memory.
| experiment | mode | msg | window | iters | mops | gib |
|---|---|---|---|---|---|---|
| msg_sweep_gpu | write | 256 | 64 | 200000 | 3.84 | 0.92 |
| msg_sweep_gpu | send | 256 | 64 | 200000 | 3.73 | 0.89 |
| msg_sweep_gpu | write | 512 | 64 | 200000 | 3.79 | 1.81 |
| msg_sweep_gpu | send | 512 | 64 | 200000 | 3.74 | 1.78 |
| msg_sweep_gpu | write | 1024 | 64 | 200000 | 3.81 | 3.63 |
| msg_sweep_gpu | send | 1024 | 64 | 200000 | 3.81 | 3.63 |
| msg_sweep_gpu | write | 2048 | 64 | 200000 | 3.84 | 7.32 |
| msg_sweep_gpu | send | 2048 | 64 | 200000 | 3.84 | 7.32 |
| msg_sweep_gpu | write | 4096 | 64 | 200000 | 3.83 | 14.6 |
| msg_sweep_gpu | send | 4096 | 64 | 200000 | 3.83 | 14.62 |
| msg_sweep_gpu | write | 8192 | 64 | 200000 | 3.55 | 27.09 |
| msg_sweep_gpu | send | 8192 | 64 | 200000 | 3.62 | 27.6 |
| msg_sweep_gpu | write | 16384 | 64 | 200000 | 1.89 | 28.8 |
| msg_sweep_gpu | send | 16384 | 64 | 200000 | 1.89 | 28.79 |
| msg_sweep_gpu | write | 32768 | 64 | 200000 | 0.96 | 29.23 |
| msg_sweep_gpu | send | 32768 | 64 | 200000 | 0.96 | 29.23 |
| msg_sweep_gpu | write | 65536 | 64 | 200000 | 0.48 | 29.45 |
| msg_sweep_gpu | send | 65536 | 64 | 200000 | 0.48 | 29.44 |
| msg_sweep_gpu | write | 256 | 512 | 200000 | 3.82 | 0.91 |
| msg_sweep_gpu | send | 256 | 512 | 200000 | 3.81 | 0.91 |
| msg_sweep_gpu | write | 512 | 512 | 200000 | 3.84 | 1.83 |
| msg_sweep_gpu | send | 512 | 512 | 200000 | 3.84 | 1.83 |
| msg_sweep_gpu | write | 1024 | 512 | 200000 | 3.82 | 3.64 |
| msg_sweep_gpu | send | 1024 | 512 | 200000 | 3.89 | 3.71 |
| msg_sweep_gpu | write | 2048 | 512 | 200000 | 3.82 | 7.28 |
| msg_sweep_gpu | send | 2048 | 512 | 200000 | 3.78 | 7.2 |
| msg_sweep_gpu | write | 4096 | 512 | 200000 | 3.83 | 14.61 |
| msg_sweep_gpu | send | 4096 | 512 | 200000 | 3.86 | 14.71 |
| msg_sweep_gpu | write | 8192 | 512 | 200000 | 3.63 | 27.67 |
| msg_sweep_gpu | send | 8192 | 512 | 200000 | 3.65 | 27.86 |
| msg_sweep_gpu | write | 16384 | 512 | 200000 | 1.89 | 28.82 |
| msg_sweep_gpu | send | 16384 | 512 | 200000 | 1.89 | 28.81 |
| msg_sweep_gpu | write | 32768 | 512 | 200000 | 0.96 | 29.22 |
| msg_sweep_gpu | send | 32768 | 512 | 200000 | 0.96 | 29.23 |
| msg_sweep_gpu | write | 65536 | 512 | 200000 | 0.48 | 29.44 |
| msg_sweep_gpu | send | 65536 | 512 | 200000 | 0.48 | 29.44 |




Result analysis (GPU)
Similar behavior appears in our GPU tests. Bandwidth increases with message size but the effect converges at roughly $2^{13}$ bytes; Mops stays flat until that point and then drops as the message size grows further.
Conclusion
Our experiments reveal an interesting finding: with a window size of 64 or 512, the SEND and WRITE operations show no statistically significant performance difference on our Broadcom RoCE NIC running at 400 Gb/s (theoretical peak ≈ 46.6 GiB/s). Our best GPU-direct RDMA benchmark reaches about 30 GiB/s, i.e., ~64% of the line rate. To push performance further, we would need to optimize the QPs and polling.
Thus, we arrive at a somewhat counterintuitive conclusion: under our experimental setup, one-sided and two-sided operations exhibit no observable performance difference.