Assuming more VRAM gives an appreciable improvement in model speed, it should default to using all but 10% of the VRAM of the largest GPU, or all but 1GB, whichever is less.
If I've got 8GB of VRAM, the software should figure out the right number of layers to offload and a sensible context size so that it doesn't exceed 7GB of VRAM.
(Although I realise the authors are just doing what llama.cpp does, so they didn't design it the way it is)
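As a rough illustration of that heuristic, here's a minimal sketch in Python. It isn't llama.cpp's or Ollama's actual logic, and the function names and per-layer sizes are made up for the example:

    GIB = 1024 ** 3

    def vram_budget(gpu_vram_bytes: int) -> int:
        # Use all but 10% of the card, or all but 1 GiB, whichever is less.
        return min(gpu_vram_bytes * 9 // 10, gpu_vram_bytes - GIB)

    def plan_offload(gpu_vram_bytes: int, n_layers: int,
                     bytes_per_layer: int, kv_cache_bytes: int) -> int:
        # Reserve room for the KV cache (context), then offload as many
        # whole layers as fit in what's left of the budget.
        budget = vram_budget(gpu_vram_bytes) - kv_cache_bytes
        return max(0, min(n_layers, budget // bytes_per_layer))

    # Example: 8 GiB card, 32-layer model at ~180 MiB per layer, 1 GiB of context.
    print(vram_budget(8 * GIB) / GIB)                       # -> 7.0
    print(plan_offload(8 * GIB, 32, 180 * 1024 ** 2, GIB))  # -> 32

In practice the context size and layer count interact (a bigger context means a bigger KV cache and fewer layers on the GPU), but the point is that the software has all the numbers it needs to pick sensible defaults itself.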