Can it easily run as a server process in the background? To me, not having to load the LLM into memory for every single interaction is a big win of Ollama.
I wouldn't consider that a given at all, but apparently there is indeed `llama-server`, which looks promising!
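For what it's worth, once the server is up, the interaction model is exactly the thing you're after: one long-lived process holds the model, and each client call is just an HTTP round trip. Here's a minimal sketch of a client call, assuming the OpenAI-compatible chat endpoint that `llama-server` exposes and its default port of 8080 (adjust if you start it with different flags):

```python
import json
import urllib.request

# Assumed base URL: llama-server binds to port 8080 by default,
# but change this to match however you actually launch it.
BASE_URL = "http://127.0.0.1:8080"

def ask(prompt: str) -> str:
    # The server keeps the model resident, so each call here is just an
    # HTTP round trip rather than a fresh model load.
    payload = {"messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Say hello in one short sentence."))
```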
Then the only thing that's missing seems to be a canonical way for clients to instantiate that, ideally in some OS-native way (systemd, launchd, etc.), and a canonical port that they can connect to.
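Rolling your own isn't hard, at least on the systemd side. Here's a rough sketch of what a unit could look like; the binary path, model path, and the dedicated `llama` user are placeholders you'd have to adapt to your install (`-m`, `--host`, and `--port` are the basic llama-server flags):

```ini
# /etc/systemd/system/llama-server.service -- sketch only; paths, port,
# and the service user are placeholders.
[Unit]
Description=llama.cpp server (llama-server)
After=network.target

[Service]
# One long-lived process that keeps the model loaded in memory.
ExecStart=/usr/local/bin/llama-server \
    -m /var/lib/llama/models/model.gguf \
    --host 127.0.0.1 \
    --port 8080
Restart=on-failure
# Run as an unprivileged user rather than root.
User=llama
Group=llama

[Install]
WantedBy=multi-user.target
```

After a `systemctl daemon-reload` and `systemctl enable --now llama-server`, every client has a fixed 127.0.0.1:8080 to talk to. The "canonical" part is really just everyone agreeing on the same port, which is the piece that's genuinely missing.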
The original llama.cpp does that too. And you won't have to deal with mislabeled models and insane defaults out of the box.