I asked GPT for a rough estimate of prompt-prefill time on an 8,192-token input (arithmetic sketched in code after the list).
• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
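For transparency, here's the arithmetic behind those ranges as a small Python sketch. The throughput figures are GPT's guesses from the bullets above, not measured benchmarks, and the setup names are just labels:

```python
# Back-of-envelope prefill-time estimate.
# Throughput ranges (tokens/sec) are GPT-provided guesses, not measurements.
PROMPT_TOKENS = 8_192

setups = {
    "16x H100": (20_000, 80_000),             # assumed prefill tokens/sec
    "2x Mac Studio (M3 Max)": (150, 700),     # assumed prefill tokens/sec
}

for name, (low_tps, high_tps) in setups.items():
    fast = PROMPT_TOKENS / high_tps   # best case: highest throughput
    slow = PROMPT_TOKENS / low_tps    # worst case: lowest throughput
    print(f"{name}: {fast:.2f}s to {slow:.2f}s")

# Rough speedup comparing the midpoints of each throughput range
h100_mid = PROMPT_TOKENS / ((20_000 + 80_000) / 2)
mac_mid = PROMPT_TOKENS / ((150 + 700) / 2)
print(f"midpoint speedup: ~{mac_mid / h100_mid:.0f}x")
```

Running this reproduces the ranges above and prints a midpoint speedup of roughly 118×, which is where the ~100× figure below comes from.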
These are order-of-magnitude numbers, but the takeaway is that multi-H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.