Cellm supports local models that run on your computer via Llamafiles, Ollama, or vLLM. This ensures none of your data ever leaves your machine. And it’s free.

On this page you will learn what to consider when choosing a local model and how to run it.

Choose a model

We can split local models into three tiers based on their size and capabilities, balancing speed, intelligence, and world knowledge:

|                   | Small (0.5-3B) | Medium (4B-9B) | Large (10B-32B)   |
| ----------------- | -------------- | -------------- | ----------------- |
| Speed             | Fast           | Moderate       | Slow              |
| Intelligence      | Lower          | Moderate       | Higher            |
| World Knowledge   | Limited        | Moderate       | Broad             |
| Recommended model | Gemma 2 2B     | Qwen 2.5 7B    | Mistral Small 3.1 |

You need a GPU for any of the medium or large models to be useful in practice. If you don't have a GPU and small models are not sufficient for your task, use Hosted Models instead.

In general, smaller models are faster but less intelligent, while larger models are slower but more intelligent. When using local models, it's important to find the right balance for your task, because speed affects your productivity and intelligence affects your results. Try out different models and choose the smallest one that gives you good results.

Small models are sufficient for many common tasks such as categorizing text or extracting person names from news articles. Medium models are appropriate for more complex tasks such as document review, survey analysis, or tasks involving function calling. Large models are useful for creative writing, tasks requiring nuanced language understanding such as spam detection, or tasks requiring world knowledge.
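
For example, with product reviews in column A, a small model can label each row with a formula like the sketch below. The cell reference and wording are illustrative; it uses the same single-argument form of PROMPT shown in the Ollama steps further down:

=PROMPT("Classify the sentiment of the following review as positive, negative, or neutral. Reply with one word only: " & A1)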

Models larger than 32B require significant hardware investment to run locally, and you are better off using Hosted Models if you need this kind of intelligence and don’t have the hardware already.

Run models locally

You need to run a program on your computer that serves models to Cellm. We call these programs “providers”. Cellm supports Ollama, Llamafiles, and vLLM, as well as any OpenAI-compatible provider. If you don’t know any of these names, just use Ollama.

Ollama

To get started with Ollama, we recommend you try out the Gemma 2 2B model, which is Cellm’s default local model.

  1. Download and install Ollama. Ollama will start after the install and automatically run whenever you start up your computer.
  2. Download the Gemma 2 2B model: Open Windows Terminal (open the start menu, type Windows Terminal, and click OK), type ollama pull gemma2:2b, and wait for the download to finish.
  3. In Excel, select ollama/gemma2:2b from the model drop-down menu and type out the formula =PROMPT("Which model are you and who made you?"). The model will tell you that it is called “Gemma” and was made by Google DeepMind.

You can use any model that Ollama supports. See https://ollama.com/search for a complete list.
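
For example, to try the medium and large models recommended above, pull them the same way. Model tags change over time, so check the Ollama library for the current names:

ollama pull qwen2.5:7b
ollama pull mistral-small3.1:24b

You can then use them from Cellm as ollama/qwen2.5:7b or ollama/mistral-small3.1:24b.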

Llamafile

Llamafile is a project by Mozilla that combines llama.cpp with Cosmopolitan Libc, enabling you to download a single-file executable (called a “llamafile”) that runs locally on most computers with no installation. To get started:

  1. Download a llamafile from https://github.com/Mozilla-Ocho/llamafile (e.g. Gemma 2 2B).

  2. Append .exe to the filename. For example, gemma-2-2b-it.Q6_K.llamafile should be renamed to gemma-2-2b-it.Q6_K.llamafile.exe.

  3. Run the following command in your Windows terminal (open start menu, type Windows Terminal, and click OK):

    .\gemma-2-2b-it.Q6_K.llamafile.exe --server --v2
    

    To offload inference to your NVIDIA or AMD GPU, run:

    .\gemma-2-2b-it.Q6_K.llamafile.exe --server --v2 -ngl 999
    
  4. Start Excel and select the openaicompatible provider from the model drop-down on Cellm’s ribbon menu. The model name doesn’t matter, because a particular Llamafile serves only one model and ignores the name; a name is still required because the OpenAI API expects one.

  5. Set the Base Address textbox to http://localhost:8080.

Llamafiles are especially useful if you don’t have the necessary permissions to install programs on your computer.
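
To check that the Llamafile server is running before you point Cellm at it, you can send a request to its OpenAI-compatible endpoint from PowerShell. This is a minimal sketch; it assumes the default port 8080, and the model name is just a placeholder because the Llamafile ignores it:

Invoke-RestMethod -Uri http://localhost:8080/v1/chat/completions -Method Post -ContentType "application/json" -Body '{"model": "llamafile", "messages": [{"role": "user", "content": "Which model are you?"}]}'

If the command returns a chat completion rather than a connection error, the server is ready.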

Dockerized Ollama and vLLM

If you prefer to run models via Docker, both Ollama and vLLM are packaged up with docker compose files in the docker/ folder. vLLM is designed to run many requests in parallel and is particularly useful if you need to process a lot of data with Cellm.

To get started, we recommend using Ollama with the Gemma 2 2B model:

  1. Clone the source code:

    git clone https://github.com/getcellm/cellm
    
  2. Run the following command in the docker/ directory:

    docker compose -f docker-compose.Ollama.yml up --detach
    docker compose -f docker-compose.Ollama.yml down  # When you want to shut it down
    
  3. Start Excel and select the openaicompatible provider from the model drop-down on Cellm’s ribbon menu. Replace the model name with the name of the model you want to use. For Gemma 2 2B, the textbox should read “openaicompatible/gemma2:2b”.

  4. Set the Base Address textbox to http://localhost:11434.

To use other Ollama models, pull another supported model by running e.g. ollama pull mistral-small3.1:24b inside the container.
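
One way to do that is with docker compose exec, as in the sketch below. The service name ollama is an assumption here; check docker-compose.Ollama.yml for the actual name:

docker compose -f docker-compose.Ollama.yml exec ollama ollama pull mistral-small3.1:24b

Once the pull finishes, set the model name in Cellm to openaicompatible/mistral-small3.1:24b.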

If you want to speed up inference, you can use your GPU as well:

docker compose -f docker-compose.Ollama.yml -f docker-compose.Ollama.GPU.yml up --detach

If you want to speed up running many requests in parallel, you can use vLLM instead of Ollama. You must supply the docker compose file with a Hugging Face API key, either via an environment variable or by editing the docker compose file directly. Look at the vLLM docker compose file for details. If you don’t know what a Hugging Face API key is, just use Ollama.

To start vLLM:

docker compose -f docker-compose.vLLM.GPU.yml up --detach
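
For example, if the compose file reads the key from an environment variable, you can set it in PowerShell before starting the stack. The variable name HF_TOKEN below is an assumption; use whichever name the compose file actually references:

$env:HF_TOKEN = "your-hugging-face-api-key"   # assumed variable name; see docker-compose.vLLM.GPU.yml
docker compose -f docker-compose.vLLM.GPU.yml up --detach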

To use other vLLM models, change the --model argument in the docker compose file to another Hugging Face model.
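
The relevant part of the compose file might look roughly like the sketch below. The service name, image tag, and model id are illustrative only; the actual docker-compose.vLLM.GPU.yml will differ:

services:
  vllm:
    image: vllm/vllm-openai:latest
    # Swap the Hugging Face model id passed to vLLM's OpenAI-compatible server
    command: --model Qwen/Qwen2.5-7B-Instruct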

Open WebUI is included in both Ollama and vLLM docker compose files so you can test the local model outside of Cellm. Open WebUI is available at http://localhost:3000.