
Pay attention to IO bandwidth if you’re building a machine with multiple GPUs like this!

In this setup the model is sharded across the cards, so data must shuttle over a PCIe 3.0 x16 link, which tops out around 16 GB/s. For reference, that's more than an order of magnitude below the ~346 GB/s memory bandwidth of the Tesla P40s being used.
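Back-of-envelope, as a quick Python sketch (16 GB/s is the PCIe 3.0 x16 theoretical maximum; 346 GB/s is the P40's spec-sheet memory bandwidth):

    # Rough ratio of on-card memory bandwidth to the inter-card link
    pcie3_x16_gb_s = 16.0    # PCIe 3.0 x16, theoretical max
    p40_mem_gb_s   = 346.0   # Tesla P40 spec-sheet memory bandwidth

    print(f"memory / PCIe ~= {p40_mem_gb_s / pcie3_x16_gb_s:.0f}x")  # ~22x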

The author didn't mention NVLink, so I'm presuming it wasn't used, but I believe these cards would support it.

Building on a budget is really hard. In my experience, 5-15 tok/s is a bit too slow for use cases like coding; I admit that once you've had a taste of 150 tok/s it's hard to go back (I've been spoiled by an RTX 4090 running vLLM).



Unless you run the GPUs in tensor parallel, which you have to go out of your way to configure, the inter-GPU bandwidth doesn't matter much. With the default layer-wise split, each card holds separate layers of the model; they're not working on the same layer together, so only a small activation tensor, a handful of kilobytes per token, crosses the link.
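For scale, here's a rough estimate of that traffic under a layer-wise split (a sketch; the hidden size, fp16 width, and decode rate are illustrative assumptions, not measurements from this build):

    # Activations handed between pipeline stages during decoding
    hidden_size    = 8192    # illustrative, e.g. a 70B-class model
    bytes_per_elem = 2       # fp16
    tokens_per_sec = 10      # a plausible budget-build decode rate

    per_token = hidden_size * bytes_per_elem  # 16 KiB per token
    print(f"{per_token / 1024:.0f} KiB/token, "
          f"{per_token * tokens_per_sec / 1024:.0f} KiB/s")

Either way, it's vanishingly small next to the ~16 GB/s the link can move.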


Which models do you enjoy most on your 4090? And why vLLM instead of Ollama?


> The author didn't mention NVLink, so I'm presuming it wasn't used, but I believe these cards would support it.

How would you set up NVLink, if the cards support it?
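Naively I'd expect: install the physical bridge, then check whether the driver exposes peer-to-peer between the cards. A sketch of that check, assuming PyTorch is available (device indices 0 and 1 are placeholders):

    import torch

    # True if device 0 can DMA directly into device 1's memory
    # (over NVLink if bridged, otherwise over PCIe peer-to-peer)
    print(torch.cuda.can_device_access_peer(0, 1))

I'd guess nvidia-smi topo -m would also show how the cards are linked, but I've never done this myself.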


I think you're mixing up the two bandwidth numbers.



