Nice idea; I do the same with Ollama and local models, except my client code is in Common Lisp, Clojure, and Racket. I have three books with Ollama examples for these languages, all of which can be read free online: https://leanpub.com/u/markwatson
I have been paid to do so-called “AI work” since 1982, lots of early work with neural networks and symbolic AI, then more recently deep learning. I have never been as excited about any technology in my life as I am about LLMs.
Commercial APIs from Anthropic, Mistral, and OpenAI are great tools, but I get off more on running smaller models locally myself.
I like mistral:7b-instruct, yi:34b, and wizard-vicuna-uncensored:30b. I think the so-called "uncensored" models tend to work better for general-purpose use, but Mistral and Yi aren't available uncensored.
I have an M2 Pro with 32 GB of memory, so I need to use 3-bit quantization to run Mixtral: dolphin-mixtral:8x7b-v2.5-q3_K_S. In general I don't like to go below 4-bit quantization.
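A rough back-of-envelope on why 32 GB pushes you to 3-bit for Mixtral (the bits-per-weight figures are approximate, since k-quants carry some overhead beyond their nominal bit count):

    46.7 \times 10^{9}\ \text{params} \times \tfrac{4.5\ \text{bits}}{8} \approx 26\ \text{GB} \quad (\text{4-bit quant})
    46.7 \times 10^{9}\ \text{params} \times \tfrac{3.5\ \text{bits}}{8} \approx 20\ \text{GB} \quad (\text{3-bit quant})

macOS also caps how much of the 32 GB of unified memory the GPU can address (roughly two thirds to three quarters), so only the ~20 GB quant fits comfortably.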
Wow. This actually "just worked" for me, as in I followed the instructions and got a result. Meanwhile, I've come to associate the words "jupyter notebook" with Python dependency hell.
To be fair, I work as a PM and rarely get more than about 60 minutes to play around with anything involving code, which has kept me from getting my hands dirty with anything AI-related.
As someone who just went through this, the process of getting Mixtral running in Python did "just work" (pip install the interface, download the model, run the sample).
The process for getting it running on the GPU wasn't there yet.
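For reference, a minimal sketch of that flow, assuming the interface in question was llama-cpp-python and a local GGUF build of Mixtral (the file name is illustrative):

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Illustrative path; point this at whatever GGUF quant you downloaded.
    llm = Llama(
        model_path="./mixtral-8x7b-instruct.Q3_K_S.gguf",
        n_ctx=4096,       # context window
        n_gpu_layers=0,   # CPU only; raising this is the GPU-offload step that was fiddly
    )

    out = llm("[INST] Summarize what quantization does. [/INST]", max_tokens=128)
    print(out["choices"][0]["text"])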
I tried using LangChain's Ollama provider, but for my use case it was strictly worse than using Ollama directly. Ollama automatically builds a conversation context that LangChain provides no handle for, and the context LangChain encourages you to build is less useful because it forces Ollama to re-process the full context on every call, whereas the native Ollama context represents the current state of inference.
The other kinds of non-conversational context I needed were trivial to put together myself, so for my use case LangChain just got in the way. Ollama's API was already trivial to wrap myself.
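For what it's worth, here is a minimal sketch of that kind of thin wrapper over Ollama's /api/generate endpoint, which returns a context token array you can feed back in to continue from the current inference state (the model name is just an example):

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"

    def generate(prompt, context=None, model="mistral:7b-instruct"):
        """Single non-streaming call; pass the returned context back in to continue."""
        payload = {"model": model, "prompt": prompt, "stream": False}
        if context:
            payload["context"] = context
        resp = requests.post(OLLAMA_URL, json=payload, timeout=300)
        resp.raise_for_status()
        data = resp.json()
        return data["response"], data.get("context")

    reply, ctx = generate("Name three Ruby web frameworks.")
    followup, ctx = generate("Which of those is the oldest?", context=ctx)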
There are also Nx and Bumblebee in Elixir land; they really change how one approaches running models in production. The fact that one can put together a service (or local process) running any model published to Hugging Face in a couple of lines of code is amazing.
How is the deployment story, though? Assuming a standard Phoenix-on-Fly.io process, I was under the impression that the Bumblebee models are downloaded at runtime. Or are they "built" as part of the CI pipeline and then shipped inside the Docker container as blobs?
That’s all configurable. You can choose to download and build at startup, bundle into the Docker image, or "prebuild" the cache in advance, separately from the main app. I think it’s quite alright for cloud, Docker, and VPS-y deployments alike.
Calling an HTTP API from a Ruby program doesn't really constitute running an "AI Model Locally with Ruby" for me. But if you want to get a little closer to that being true, you could also use the llama.cpp bindings for Ruby: https://github.com/yoshoku/llama_cpp.rb
This seems highly misleading to me. In no world are LLMs the kind of neural net you describe. You grossly misrepresent how they work by pretending they are built entirely of fully connected layers.