
Disaggregated LLM inference separates prompt processing (prefill) from token generation (decode) to leverage the distinct hardware strengths of different architectures. NVIDIA Blackwell GPUs excel at compute-intensive prefill tasks, while Apple Silicon's high memory bandwidth provides superior performance for token decoding. By connecting a DGX Spark and a Mac Studio via a 50 Gb/s network link, this experimental setup achieves faster time-to-first-token and improved overall throughput compared to running models on either machine alone. Performance gains are constrained by network latency and the overhead of transferring the KV cache between systems. This heterogeneous approach proves most effective for larger models, where the prefill advantage of the GPU and the decode efficiency of Apple Silicon combine to outperform single-machine configurations, provided the network infrastructure can handle the significant volume of KV-cache data that must move between the two machines.
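The split described above can be sketched as a two-stage pipeline: the prefill node processes the whole prompt and produces a KV cache, that cache is serialized and shipped over the network, and the decode node then generates tokens one at a time while extending the cache locally. The toy sketch below is a plain-Python illustration only (no real model or network; the `prefill`, `transfer_kv_cache`, and `decode` functions and the per-layer cache layout are hypothetical stand-ins), showing the data flow and where the transfer overhead enters.

```python
import pickle

def prefill(prompt_tokens, n_layers=4):
    # Stand-in for compute-bound prompt processing on the GPU node:
    # one (key, value) pair per prompt token per layer.
    return {layer: [(t, t * 2) for t in prompt_tokens]
            for layer in range(n_layers)}

def transfer_kv_cache(kv_cache, link_bits_per_s=50e9):
    # Serialize the cache and estimate the time to move it over the link.
    # On a real deployment this hop is the overhead the article describes.
    payload = pickle.dumps(kv_cache)
    est_seconds = (len(payload) * 8) / link_bits_per_s
    return pickle.loads(payload), est_seconds

def decode(kv_cache, n_new_tokens):
    # Stand-in for memory-bandwidth-bound generation on the decode node:
    # produce one token per step, appending it to every layer's cache.
    generated = []
    for _ in range(n_new_tokens):
        next_token = sum(len(pairs) for pairs in kv_cache.values()) % 100
        generated.append(next_token)
        for pairs in kv_cache.values():
            pairs.append((next_token, next_token * 2))
    return generated

prompt = list(range(8))
cache = prefill(prompt)            # runs on the prefill (GPU) node
cache, xfer_s = transfer_kv_cache(cache)  # network hop between machines
tokens = decode(cache, n_new_tokens=4)    # runs on the decode node
print(len(tokens))  # 4
```

The key design point the sketch makes concrete is that the cache grows with prompt length, so time-to-first-token on the decode node is gated by how fast the serialized cache crosses the link, not by either machine's compute.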