
It's particularly useful in memory-bound workloads like batch-size-1 LLM inference, where you're bottlenecked by how quickly you can move weights from GPU memory to the compute units. This is why, at least in torchao, we strongly recommend people try out int4 quantization.
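
A minimal sketch of what that looks like with torchao's quantize_ API (assuming a recent torchao release; the toy model and sizes here are just illustrative):

    import torch
    from torchao.quantization import quantize_, int4_weight_only

    # Any torch.nn.Module with Linear layers works; a toy model for illustration.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).to(torch.bfloat16).cuda()

    # Swap Linear weights to int4 (weight-only), so each forward pass reads
    # roughly 4x fewer bytes of weights from GPU memory than bf16 would.
    quantize_(model, int4_weight_only())

    x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
    with torch.no_grad():
        y = model(x)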

At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
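
Rough arithmetic behind that crossover (a back-of-the-envelope sketch, not a benchmark; the hardware numbers are illustrative placeholders):

    # For one Linear layer with an N x N weight matrix and a batch of B tokens:
    #   compute: ~2 * B * N^2 FLOPs
    #   traffic: ~N^2 * bytes_per_weight read from memory (weights dominate at small B)
    # The layer stays memory bound while its FLOPs-per-byte sit below the
    # hardware's compute/bandwidth ratio.

    def arithmetic_intensity(batch: int, bytes_per_weight: float) -> float:
        # FLOPs per byte of weight traffic; independent of N for a square Linear.
        return 2 * batch / bytes_per_weight

    # Illustrative accelerator: ~1000 TFLOP/s of bf16 compute, ~1 TB/s of bandwidth.
    hw_flops_per_byte = 1000e12 / 1e12  # = 1000 FLOPs per byte

    for b in (1, 64, 1024):
        for label, bpw in (("bf16", 2.0), ("int4", 0.5)):
            bound = "memory" if arithmetic_intensity(b, bpw) < hw_flops_per_byte else "compute"
            print(f"batch={b:5d} {label}: {bound} bound")

At batch 1 the intensity is a handful of FLOPs per byte, far below what the compute units can sustain, so shrinking the weights pays off directly; by the time the batch is in the hundreds or thousands the layer crosses into compute-bound territory, where only hardware-accelerated low-precision math (e.g. fp8) keeps helping.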


