Local Models
How to use local models
Cellm supports local models that run on your computer via Llamafiles, Ollama, or vLLM. This ensures none of your data ever leaves your machine. And it’s free.
On this page you will learn what to consider when choosing a local model and how to run it.
Choose a model
We can split local models into three tiers based on their size and capabilities, balancing speed, intelligence, and world knowledge:
| | Small (0.5B-3B) | Medium (4B-9B) | Large (10B-32B) |
|---|---|---|---|
| Speed | Fast | Moderate | Slow |
| Intelligence | Basic | Good | Best |
| World Knowledge | Limited | Moderate | Broad |
| Recommended model | Gemma 2 2B | Qwen 2.5 7B | Mistral Small 3.1 |
You need a GPU for any of the medium or large models to be useful in practice. If you don't have a GPU and small models are not sufficient for your task, use Hosted Models instead.
In general, smaller models are faster but less intelligent, while larger models are slower but more intelligent. When using local models, it's important to find the right balance for your task, because speed affects your productivity and intelligence affects your results. Try out different models and choose the smallest one that gives you good results.
Small models are sufficient for many common tasks such as categorizing text or extracting person names from news articles. Medium models are appropriate for more complex tasks such as document review, survey analysis, or tasks involving function calling. Large models are useful for creative writing, tasks requiring nuanced language understanding such as spam detection, or tasks requiring world knowledge.
Models larger than 32B require significant hardware investment to run locally, and you are better off using Hosted Models if you need this kind of intelligence and don’t have the hardware already.
Run models locally
You need to run a program on your computer that serves models to Cellm. We call these programs "providers". Cellm supports Ollama, Llamafiles, and vLLM, as well as any OpenAI-compatible provider. If you don't recognize any of these names, just use Ollama.
Ollama
To get started with Ollama, we recommend you try out the Gemma 2 2B model, which is Cellm’s default local model.
- Download and install Ollama. Ollama starts after the install and will automatically run whenever you start up your computer.
- Download the Gemma 2 2B model: open Windows Terminal (open the Start menu, type `Windows Terminal`, and click OK), type `ollama pull gemma2:2b`, and wait for the download to finish.
- In Excel, select `ollama/gemma2:2b` from the model dropdown menu and type out the formula `=PROMPT("Which model are you and who made you?")`. The model will tell you that it is called "Gemma" and was made by Google DeepMind.
You can use any model that Ollama supports. See https://ollama.com/search for a complete list.
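For example, to switch to the medium-tier model recommended above, pull it the same way and confirm it is available locally (the Ollama tag below is an assumption; verify it on https://ollama.com/search):

```
# Pull another model from the Ollama registry (tag assumed)
ollama pull qwen2.5:7b

# List locally available models and confirm the new one appears
ollama list
```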
Llamafile
Llamafile is a project by Mozilla that combines llama.cpp with Cosmopolitan Libc, enabling you to download and run a single-file executable (called a "llamafile") that runs locally on most computers with no installation. To get started:
- Download a llamafile from https://github.com/Mozilla-Ocho/llamafile (e.g. Gemma 2 2B).
- Append `.exe` to the filename. For example, `gemma-2-2b-it.Q6_K.llamafile` should be renamed to `gemma-2-2b-it.Q6_K.llamafile.exe`.
- Run the llamafile from your Windows terminal (open the Start menu, type `Windows Terminal`, and click OK). To offload inference to your NVIDIA or AMD GPU, add the GPU flag shown in the sketch after this list.
- Start Excel and select the `openaicompatible` provider from the model drop-down on Cellm's ribbon menu. It doesn't matter what model name you choose: a particular llamafile serves only one model and ignores the name, but the OpenAI API requires one, so any name will do.
- Set the Base Address textbox to http://localhost:8080.
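The exact command depends on the llamafile you downloaded; as a sketch, assuming the Gemma 2 2B llamafile from the steps above, and flag names from llamafile's llama.cpp-based server (these may vary between llamafile versions):

```
# Start the llamafile's built-in server on http://localhost:8080
.\gemma-2-2b-it.Q6_K.llamafile.exe --server --nobrowser

# Offload as many layers as possible to an NVIDIA or AMD GPU (-ngl is llama.cpp's n-gpu-layers option)
.\gemma-2-2b-it.Q6_K.llamafile.exe --server --nobrowser -ngl 999
```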
Llamafiles are especially useful if you don’t have the necessary permissions to install programs on your computer.
Dockerized Ollama and vLLM
If you prefer to run models via Docker, both Ollama and vLLM are packaged with docker compose files in the `docker/` folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.
To get started, we recommend using Ollama with the Gemma 2 2B model:
- Clone the source code (the commands for this and the next step are sketched after this list).
- Run the Ollama docker compose file from the `docker/` directory.
- Start Excel and select the `openaicompatible` provider from the model drop-down on Cellm's ribbon menu. Replace the model name with the name of the model you want to use. For Gemma 2 2B, the textbox should read "openaicompatible/gemma2:2b".
- Set the Base Address textbox to `http://localhost:11434`.
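A minimal sketch of the first two steps. The repository URL and the compose filename are assumptions; use the repository you normally install Cellm from and check the `docker/` folder for the actual filenames:

```
# Clone the source code (repository URL assumed)
git clone https://github.com/getcellm/cellm.git
cd cellm/docker

# Start the Ollama service defined in the compose file (filename assumed)
docker compose -f docker-compose.Ollama.yml up --detach
```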
To use other Ollama models, pull another of the supported models by running e.g. `ollama run mistral-small3.1:24b` in the container.
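Running a command "in the container" means executing the ollama CLI inside the running Ollama container, for example with docker exec (the container name below is an assumption; `docker ps` shows the actual name):

```
# Download another model inside the running Ollama container (container name assumed);
# ollama pull downloads the model without starting an interactive session
docker exec -it ollama ollama pull mistral-small3.1:24b
```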
If you want to speed up inference, you can use your GPU as well:
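How the GPU variant is wired up depends on the compose files that ship in the `docker/` folder; the sketch below assumes a GPU override file exists next to the base one (the filename is an assumption, and the NVIDIA Container Toolkit must be installed on the host):

```
# Layer an assumed GPU override file on top of the base Ollama compose file
docker compose -f docker-compose.Ollama.yml -f docker-compose.Ollama.GPU.yml up --detach
```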
If you want to speed up running many requests in parallel, you can use vLLM instead of Ollama. You must supply the docker compose file with a Hugging Face API key, either via an environment variable or by editing the docker compose file directly. See the vLLM docker compose file for details. If you don't know what a Hugging Face API key is, just use Ollama.
To start vLLM:
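As above, the compose filename is an assumption, and the environment variable name the compose file expects may differ (vLLM itself reads HUGGING_FACE_HUB_TOKEN); check the vLLM docker compose file in the `docker/` folder. A sketch in PowerShell:

```
# Set the Hugging Face API key for the current session (variable name assumed)
$env:HUGGING_FACE_HUB_TOKEN = "hf_your_token_here"

# Start the vLLM service (compose filename assumed)
docker compose -f docker-compose.vLLM.yml up --detach
```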
To use other vLLM models, change the `--model` argument in the docker compose file to another Hugging Face model.
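The argument is passed to vLLM's OpenAI-compatible server; a compose fragment like the one below (service name and layout assumed) swaps in another Hugging Face model id:

```
services:
  vllm:
    # Pass --model to vLLM's OpenAI-compatible server; any Hugging Face model id works here
    command: ["--model", "Qwen/Qwen2.5-7B-Instruct"]
```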
Open WebUI is included in both the Ollama and vLLM docker compose files so you can test the local model outside of Cellm. Open WebUI is available at http://localhost:3000.