
Releasing "Ollama Colab Free Server" — Run Ollama on Google Colab as a Free LLM Server

hiroaki (Individual Developer) · 3 min read
If this site helped you, please support us with a star! 🌟
Star on GitHub

Today I'm releasing Ollama Colab Free Server, an open-source notebook that runs Ollama on Google Colab's free GPU and makes it instantly available as a backend for Claude Code and Continue. Execute the cells top to bottom, and a publicly accessible LLM server is up within minutes.

Background

Coding assistants like Claude Code and Continue are powerful, but API costs can add up quickly, and there is always the concern of sending your code to an external service. Meanwhile, running Ollama locally on a machine with a weak GPU rarely reaches practical inference speeds.

This notebook bridges that gap. By running Ollama on Google Colab's free T4 GPU and exposing it via ngrok, you get a fully free LLM server — no local setup, no data sent to external APIs — accessible entirely from your browser.

Features

The notebook is designed to work without writing any code. In the first cell (Model Registry), enter model names as a comma-separated list; you can edit it freely. Running the cell displays a radio-button UI for picking the model you want.

In the second cell (Server), paste your ngrok token and run. The following steps execute automatically: install Ollama and dependencies, start the Ollama server, establish the ngrok tunnel, and pull the selected model (5–15 minutes on first run). Once complete, the endpoint URL along with ready-to-use config snippets for Continue and Claude Code are printed directly to the terminal.
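The automation in that cell boils down to a few shell commands plus a tunnel. A minimal sketch, assuming the standard Ollama install script, the default port 11434, and the pyngrok client — the function and variable names here are illustrative, not the notebook's actual code:

```python
# Hypothetical sketch of the Server cell's steps; not the notebook's actual code.
OLLAMA_PORT = 11434  # Ollama's default listen port


def server_steps(model: str) -> list[list[str]]:
    """Ordered shell commands the cell would run (tunnel handled separately)."""
    return [
        # 1. Install Ollama via its official install script
        ["sh", "-c", "curl -fsSL https://ollama.com/install.sh | sh"],
        # 2. Start the Ollama server in the background
        ["sh", "-c", "nohup ollama serve &"],
        # 3. Pull the model selected in the Model Registry cell (5-15 min first run)
        ["ollama", "pull", model],
    ]


# The ngrok tunnel would then expose the local port, e.g. with pyngrok:
#   from pyngrok import ngrok
#   ngrok.set_auth_token(token)
#   public_url = ngrok.connect(OLLAMA_PORT, "http").public_url
```

Once the tunnel is up, that public URL is the endpoint the notebook prints alongside the config snippets.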

OpenAI-compatible clients such as Codex CLI are also supported — just append /v1 to the base URL.

Supported Tools

  • Continue (VS Code / JetBrains extension): set the endpoint URL as apiBase
  • Claude Code: set ANTHROPIC_BASE_URL to the endpoint (Ollama v0.14.0+ natively supports the Anthropic Messages API)
  • OpenAI-compatible clients: use the endpoint URL + /v1
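To make the mapping above concrete, here is a small helper (hypothetical, not part of the notebook) that derives each tool's setting from the printed ngrok endpoint URL:

```python
def tool_endpoints(endpoint: str) -> dict[str, str]:
    """Map a public endpoint URL to per-tool settings (illustrative helper)."""
    base = endpoint.rstrip("/")
    return {
        "continue_apiBase": base,         # Continue: apiBase in its config
        "ANTHROPIC_BASE_URL": base,       # Claude Code: environment variable
        "openai_base_url": base + "/v1",  # OpenAI-compatible clients (e.g. Codex CLI)
    }


# Example (hypothetical URL):
# cfg = tool_endpoints("https://example.ngrok-free.app")
# cfg["openai_base_url"]  -> "https://example.ngrok-free.app/v1"
```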

Model Size Guide for T4

The practical range on Google Colab's T4 GPU is 8B to 14B models. Models of 20B or more see significant slowdowns, so choosing a size that matches your use case matters. If you want to benchmark candidates first, Ollama Multi-Model Benchmarker lets you compare multiple models at once.
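A rough way to see why 8B–14B is the sweet spot: the T4 has 16 GB of VRAM, and Ollama's default 4-bit quantization needs about half a byte per parameter, plus headroom for the KV cache and activations. A back-of-envelope estimate (the 0.5 bytes/param and 25% overhead figures are rough assumptions, not measured values):

```python
T4_VRAM_GB = 16.0  # NVIDIA T4 memory


def q4_footprint_gb(params_billions: float, overhead: float = 0.25) -> float:
    """Rough VRAM estimate for a 4-bit-quantized model (assumed figures)."""
    weights_gb = params_billions * 0.5  # ~0.5 bytes per parameter at 4-bit
    return weights_gb * (1 + overhead)  # headroom for KV cache / activations


for size in (8, 14, 20):
    print(f"{size}B: ~{q4_footprint_gb(size):.1f} GB of {T4_VRAM_GB:.0f} GB")
```

By this estimate, 20B+ models already crowd the T4's 16 GB once a longer context grows the KV cache, which is consistent with the slowdowns noted above.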

Get Started

No local environment needed. Just have a free ngrok account and your auth token ready.

Feedback and Pull Requests are welcome.

Technical Details

For architecture, implementation notes, and internals (health checks, shell-injection mitigation, ngrok tunnel management, etc.), see the full documentation.
