Run Ollama on Google Colab's GPU and spin up a free LLM server.
When using coding assistants like Claude Code or Continue, you often run into situations where API costs add up quickly, or you'd rather not send your code to an external service. There's also the frustration of wanting to try out lightweight local LLMs but finding that your local GPU is too slow to be practical.
This notebook solves all of that: run Ollama on Google Colab's GPU, expose the endpoint via ngrok, and you have a completely free, privacy-preserving local LLM server — all from the browser.
## What I built
No setup required. Run it in your browser from the links below.
⚡️ Run on Google Colab
Ollama Colab Free Server (English)
Just select a model and run the cells from top to bottom.

🐙 View the code on GitHub
hiroaki-com/colab-ollama-server
Browse the source code, star, or fork the repository here.
## Why I built this
The starting point was a benchmarking tool I built earlier for comparing local LLMs on Google Colab. While using it, I thought — what if I could use a model I liked directly as the backend for Claude Code or Continue?
Three motivations drove the project: reducing API costs (especially for token-heavy tasks like refactoring), keeping code out of external services for privacy, and the fact that my local Mac's GPU isn't powerful enough for practical Ollama inference. Google Colab's T4 GPU, available for free, seemed like the right answer to all three.
The setup — spin up a server on Colab, tunnel it through ngrok — is simple in principle, but typing commands every time is tedious. So I wrapped the whole flow, from model selection to server startup and connection config output, into a single notebook.
## How to use it
Just run the cells from top to bottom.
### Prerequisites
You'll need a free ngrok account. Create one at the ngrok dashboard and grab your auth token.
### 1. Set the runtime
Open the notebook in Google Colab and go to Runtime > Change runtime type > select T4 GPU. CPU mode works too, but inference will be significantly slower, so GPU is strongly recommended.
### 2. Select a model in Model Registry
Running the first cell (Model Registry) displays a radio button UI with the available models. The list comes with a few defaults, but you can edit the text field directly to add any model you want.
```python
model_list = "qwen3:8b, qwen3:14b, qwen2.5-coder:7b, deepseek-r1:8b"
```
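Under the hood, a cell like this only has to split that string into clean option names; a minimal sketch (the parsing and the commented widget call are assumptions, not the notebook's exact code):

```python
model_list = "qwen3:8b, qwen3:14b, qwen2.5-coder:7b, deepseek-r1:8b"

# Normalize the comma-separated input into clean option names
options = [name.strip() for name in model_list.split(",") if name.strip()]
print(options)
# → ['qwen3:8b', 'qwen3:14b', 'qwen2.5-coder:7b', 'deepseek-r1:8b']

# In Colab these options would feed a radio-button widget, e.g.:
# import ipywidgets as widgets
# selector = widgets.RadioButtons(options=options, description="Model:")
```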
Official model names can be found at https://ollama.com/search. Here's a rough guide to model size performance on a T4 GPU:
| Size | Speed | Notes |
|---|---|---|
| 8B | Fast | Recommended |
| 14B | Moderate | Practical range |
| 20B+ | Slow | Not recommended |
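These bands line up with a back-of-the-envelope VRAM estimate. The numbers below assume roughly 0.5 bytes per parameter for a 4-bit quantized model (the common default for Ollama builds) plus about 20% overhead for KV cache and activations; both figures are rough assumptions, not measurements:

```python
# Rough VRAM footprint of a Q4-quantized model on a 16 GB T4
# (assumptions: ~0.5 bytes/parameter, ~20% overhead for KV cache/activations)
def vram_gb(params_billion: float) -> float:
    return params_billion * 0.5 * 1.2

for size in (8, 14, 20):
    print(f"{size}B -> ~{vram_gb(size):.1f} GB")
# 8B  -> ~4.8 GB   (comfortable headroom)
# 14B -> ~8.4 GB   (workable)
# 20B -> ~12.0 GB  (little room left for context)
```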
### 3. Launch the Server cell
Paste your ngrok token into the Server cell and run it. The following steps are automated:
- Install `zstd`, Ollama, and `pyngrok`
- Start the Ollama server and run a health check
- Establish the ngrok tunnel
- Pull the selected model (first run: roughly 5–15 minutes)
Once complete, the connection config is printed automatically in the terminal:
```
ENDPOINT : https://xxxx.ngrok-free.app
```
### 4. Configure your client tool
Just paste the endpoint URL into your tool of choice.
**Continue Extension** (`~/.continue/config.yaml`)

```yaml
models:
  - title: qwen3:8b
    provider: ollama
    model: qwen3:8b
    apiBase: https://xxxx.ngrok-free.app
    contextLength: 16384
```
**Claude Code** (shell env)

```shell
export ANTHROPIC_BASE_URL=https://xxxx.ngrok-free.app
export ANTHROPIC_API_KEY=dummy
claude --model qwen3:8b
```
Since Ollama v0.14.0, the Anthropic Messages API (`/v1/messages`) is officially supported. There's no need to route through an OpenAI-compatible layer: setting `ANTHROPIC_BASE_URL` to this server's endpoint connects Claude Code directly to Ollama.
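For a quick connectivity check without Claude Code, you can hit the Messages endpoint directly. A minimal sketch using only the standard library (the URL is a placeholder, and the response-parsing path assumes the standard Anthropic response shape):

```python
import json
from urllib import request

ENDPOINT = "https://xxxx.ngrok-free.app"  # placeholder: your ngrok URL

def build_messages_payload(model: str, prompt: str, max_tokens: int = 512) -> dict:
    # Request body in the Anthropic Messages API shape (POST /v1/messages)
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_messages_payload("qwen3:8b", "Say hello in one word.")
req = request.Request(
    f"{ENDPOINT}/v1/messages",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["content"][0]["text"])
```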
**OpenAI-Compatible Clients** (e.g. Codex CLI)

Append `/v1` to the base URL:

```
https://xxxx.ngrok-free.app/v1
```
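The same kind of smoke test works for the OpenAI-compatible path; a standard-library sketch (placeholder URL; the `Bearer dummy` header is just to satisfy clients that insist on a token):

```python
import json
from urllib import request

BASE_URL = "https://xxxx.ngrok-free.app/v1"  # placeholder: ngrok URL + /v1

# Request body in the OpenAI chat-completions shape (POST /v1/chat/completions)
payload = {
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
}
req = request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer dummy"},
)
# Uncomment once the server is running:
# with request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```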
Connectivity with each tool listed here (Continue, Claude Code, OpenAI-compatible clients) has been verified. However, future updates or specification changes on the tool side may break compatibility. If you run into issues, please also consult each tool's latest documentation.
## Implementation highlights
A few things worth noting about the implementation.
- **Simplified model selection UI**

  The Model Registry cell lets you freely edit the model list as a comma-separated string. Running it renders a radio button selector, so you can add or switch models without touching the code. The two-step design (edit input → select from UI) keeps things clean.

- **Automatic connection config output**

  After the server starts, the YAML config for Continue and the environment variable commands for Claude Code are printed directly to the terminal. Just copy and paste; there's no need to manually assemble the endpoint URL.

- **Shell injection protection**

  Passing model names directly to `subprocess.Popen` or `subprocess.run` can introduce shell injection risk depending on the input. To guard against this, model names are validated upfront against a regex, and anything outside the allowed character set raises an exception.

  ```python
  if not re.fullmatch(r'[a-zA-Z0-9._:/-]+', selected_model):
      raise ValueError(f"Invalid model name: {selected_model}")
  ```

  All official Ollama model names fall within this pattern, so there's no practical restriction.

- **Health check on startup**

  Ollama starts asynchronously, so hitting the API immediately after `subprocess.Popen` will result in a connection refusal. The notebook polls `/api/tags` every second for up to 30 attempts, proceeding once a 200 response is received. If the server never responds, a `RuntimeError` is raised.

  ```python
  for _ in range(MAX_HEALTH_RETRIES):
      try:
          if requests.get("http://0.0.0.0:11434/api/tags", timeout=HEALTH_CHECK_TIMEOUT).status_code == 200:
              break
      except requests.exceptions.RequestException:
          pass
      time.sleep(1)
  else:
      raise RuntimeError("⚠️ Ollama server failed to start.")
  ```

- **Releasing the previous ngrok tunnel**

  `ngrok.kill()` is called before `ngrok.connect()`. If you re-run the Server cell without restarting the Colab session, the previous tunnel can linger and hit the free plan's single-tunnel limit.

- **Keep Alive configuration**

  `OLLAMA_KEEP_ALIVE=24h` is set as an environment variable. By default, Ollama unloads the model from memory after idle time, causing a reload delay on the next request. Keeping it loaded for the duration of the Colab session avoids that latency.

- **Real-time uptime display**

  After startup, the elapsed time is overwritten in place using `\r`. Polling every 30 seconds also has a useful side effect: it keeps the Colab session from timing out due to inactivity.

  ```python
  while True:
      elapsed_min = int((time.time() - start_time) / 60)
      print(f"\r ● Running Uptime: {elapsed_min}min | {public_url}", end="")
      time.sleep(STATUS_POLL_INTERVAL)
  ```
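The validation, keep-alive, and subprocess points above can be sketched together. The function name and the commented `ollama` invocations below are illustrative, not the notebook's exact code:

```python
import os
import re
import subprocess

def prepare_ollama_env(selected_model: str) -> dict:
    # Allow-list validation before the name reaches any subprocess call
    if not re.fullmatch(r'[a-zA-Z0-9._:/-]+', selected_model):
        raise ValueError(f"Invalid model name: {selected_model}")

    env = os.environ.copy()
    env["OLLAMA_KEEP_ALIVE"] = "24h"  # keep the model resident between requests

    # argv lists (not shell strings) sidestep shell interpretation entirely:
    # subprocess.Popen(["ollama", "serve"], env=env)
    # subprocess.run(["ollama", "pull", selected_model], check=True)
    return env

env = prepare_ollama_env("qwen3:8b")
print(env["OLLAMA_KEEP_ALIVE"])  # → 24h
```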
## Things I learned along the way
### Picking a lightweight model
On a T4 GPU, models around 8B parameters tend to strike the best balance of response speed and output quality. If you want to benchmark several candidates before committing to one, the multi-model comparison tool can help you decide.
### Ollama has native Anthropic protocol support
Since v0.14.0, Ollama officially supports the Anthropic Messages API (`/v1/messages`). The old workaround of converting through an OpenAI-compatible layer is no longer necessary: just point `ANTHROPIC_BASE_URL` at the Ollama endpoint and Claude Code connects directly.
## Wrapping up
This notebook came out of three goals: cut API costs, keep code off external services, and make it easy to try out local LLMs without a capable local GPU.
By packaging everything as a Google Colab notebook, anyone can spin up a free LLM server from the browser — no local GPU required. From selecting a model to getting connection-ready takes just a few minutes.
If you've been thinking "I'd love to try local LLMs but my GPU isn't up to it" or "I want to use a coding assistant without worrying about API costs," hopefully this is useful.
If you'd like to compare models before picking one, check out the multi-model benchmarking tool as well.