# How to use local models

Cellm can use models running on your own computer with Ollama, Llamafile, or Docker. Pick a model size that fits your hardware:
|                   | Small (0.5-3B) | Medium (4B-9B) | Large (10B-32B)   |
|-------------------|----------------|----------------|-------------------|
| Speed             | Fast           | Medium         | Slow              |
| Intelligence      | Low            | Medium         | High              |
| World Knowledge   | Low            | Medium         | High              |
| Recommended model | Gemma 2 2B     | Qwen 2.5 7B    | Mistral Small 3.1 |
## Ollama

To run a model with Ollama:

1. Download and install Ollama from https://ollama.com.
2. Open `Windows Terminal` (press `Win + R`, type `wt`, and click OK), type `ollama pull gemma2:2b`, and wait for the download to finish.
3. Select `ollama/gemma2:2b` from the model dropdown menu, and type out the formula `=PROMPT("Which model are you and who made you?")`. The model will tell you that it is called "Gemma" and made by Google DeepMind.

## Llamafile

A Llamafile bundles a model and an inference server in a single executable file. Download a Llamafile model, e.g. `gemma-2-2b-it.Q6_K.llamafile`, from https://github.com/Mozilla-Ocho/llamafile. On Windows, append `.exe` to the filename. For example, `gemma-2-2b-it.Q6_K.llamafile` should be renamed to `gemma-2-2b-it.Q6_K.llamafile.exe`.
Run the Llamafile from `Windows Terminal` (press `Win + R`, type `wt`, and click OK):
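The exact invocation is just the renamed file; the path below is only an example of where the download might live:

```powershell
# Change to the folder that holds the Llamafile (example path).
cd $HOME\Downloads

# Running the executable starts a local inference server; it may also open a
# chat page in your browser, which you can simply close.
.\gemma-2-2b-it.Q6_K.llamafile.exe
```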
Select the `openaicompatible` provider from the model drop-down on Cellm's ribbon menu. It doesn't matter what model name you choose, as Llamafiles ignore the model name because a particular Llamafile serves one model only. A name is required though, because the OpenAI API expects it.
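If you want to check the server outside of Excel, you can query its OpenAI-compatible endpoint. The address `http://localhost:8080` is an assumption based on Llamafile's usual default port; the Llamafile prints the actual address when it starts:

```powershell
# List the models the Llamafile server exposes (it will report exactly one).
curl.exe http://localhost:8080/v1/models
```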
## Docker

You can also run Ollama and vLLM in Docker using the files in the `docker/` folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.
To get started, we recommend using Ollama with the Gemma 2 2B model. Run the following commands from the `docker/` directory:
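The exact commands depend on the compose files shipped in the repository; the file name `docker-compose.Ollama.yml` and the service name `ollama` below are assumptions, so check the `docker/` folder for the actual names:

```powershell
# Start the Ollama service in the background.
docker compose -f docker-compose.Ollama.yml up --detach

# Pull the Gemma 2 2B model inside the running container.
docker compose -f docker-compose.Ollama.yml exec ollama ollama pull gemma2:2b

# Stop the service when you are done.
docker compose -f docker-compose.Ollama.yml down
```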
Select the `openaicompatible` provider from the model drop-down on Cellm's ribbon menu. Replace the model name with the name of the model you want to use. For Gemma 2 2B, the text box should read `openaicompatible/gemma2:2b`.
The Ollama server is available at `http://localhost:11434`.
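To verify that the server is reachable, you can list the models it has pulled using Ollama's standard REST API:

```powershell
# Lists the locally available models as JSON.
curl.exe http://localhost:11434/api/tags
```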
To use another model, for example the larger Mistral Small 3.1, run `ollama run mistral-small3.1:24b` in the container.
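With the compose setup sketched above (the `ollama` service name is still an assumption), that would look like:

```powershell
# Download (if needed) and start Mistral Small 3.1 inside the running Ollama container.
docker compose -f docker-compose.Ollama.yml exec ollama ollama run mistral-small3.1:24b
```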
If you want to speed up inference, you can use your GPU as well:
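How this looks depends on the compose files in the repository; the sketch below assumes a GPU-enabled variant of the compose file exists and that your machine has the NVIDIA Container Toolkit set up:

```powershell
# Start Ollama with GPU access (the compose file name is an assumption;
# check the docker/ folder for the actual GPU-enabled file).
docker compose -f docker-compose.Ollama.GPU.yml up --detach
```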
Open WebUI is included in the docker compose files so you can also chat with the local models outside of Cellm. It is available at `http://localhost:3000`.