
Disaggregated LLM inference separates prompt processing (prefill) from token generation (decode) to leverage the distinct hardware strengths of different architectures. NVIDIA Blackwell GPUs excel at compute-intensive prefill tasks, while Apple Silicon's high memory bandwidth provides superior performance for token decoding. By connecting a DGX Spark and a Mac Studio via a 50 Gb/s network link, this experimental setup achieves faster time-to-first-token and improved overall throughput compared to running models on either machine alone. Performance gains are constrained by network latency and the overhead of transferring the KV cache between systems. This heterogeneous approach proves most effective for larger models, where the prefill advantage of the GPU and the decode efficiency of Apple Silicon combine to outperform single-machine configurations, provided the network infrastructure can handle the significant volume of KV-cache data that must move between the two machines.
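The split described above can be sketched as a two-stage pipeline: the prefill node processes the whole prompt and produces a KV cache, that cache is serialized and shipped over the network, and the decode node then generates tokens one at a time while extending the cache locally. The toy sketch below is a plain-Python illustration only (no real model or network; the `prefill`, `transfer_kv_cache`, and `decode` functions and the per-layer cache layout are hypothetical stand-ins), showing the data flow and where the transfer overhead enters.

```python
import pickle

def prefill(prompt_tokens, n_layers=4):
    # Stand-in for compute-bound prompt processing on the GPU node:
    # one (key, value) pair per prompt token per layer.
    return {layer: [(t, t * 2) for t in prompt_tokens]
            for layer in range(n_layers)}

def transfer_kv_cache(kv_cache, link_bits_per_s=50e9):
    # Serialize the cache and estimate the time to move it over the link.
    # On a real deployment this hop is the overhead the article describes.
    payload = pickle.dumps(kv_cache)
    est_seconds = (len(payload) * 8) / link_bits_per_s
    return pickle.loads(payload), est_seconds

def decode(kv_cache, n_new_tokens):
    # Stand-in for memory-bandwidth-bound generation on the decode node:
    # produce one token per step, appending it to every layer's cache.
    generated = []
    for _ in range(n_new_tokens):
        next_token = sum(len(pairs) for pairs in kv_cache.values()) % 100
        generated.append(next_token)
        for pairs in kv_cache.values():
            pairs.append((next_token, next_token * 2))
    return generated

prompt = list(range(8))
cache = prefill(prompt)            # runs on the prefill (GPU) node
cache, xfer_s = transfer_kv_cache(cache)  # network hop between machines
tokens = decode(cache, n_new_tokens=4)    # runs on the decode node
print(len(tokens))  # 4
```

The key design point the sketch makes concrete is that the cache grows with prompt length, so time-to-first-token on the decode node is gated by how fast the serialized cache crosses the link, not by either machine's compute.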