Comparing Multiple Local LLM Models on Free Google Colab
Recently, locally-running LLMs (Large Language Models) have become quite sophisticated. With models like Llama, Qwen, and Mistral publicly available, you might find yourself wondering: "Which model best fits my use case?"
Before setting up a local environment, I thought it would be convenient to test models easily in a free cloud environment. So I created a tool that "tests multiple Ollama models in a single run on Google Colab and automatically compares their performance." You can run them with your own prompts and verify their capabilities firsthand.
What I Built
No complicated setup required. You can run it directly in your browser via the link below.
➡️ Run on Google Colab
Ollama Multi-Model Benchmarker (English)
Execute the script here. Just click and press the play button to get started.
View Code on GitHub
hiroaki-com/ollama-llm-benchmark
Check the source code, or Star / Fork the repository.
Why I Created This
When I wanted to use local LLMs, the first challenge was: "Which model should I choose?"
Benchmark results are available online, and there are local measurement tools like aidatatools/ollama-benchmark, but those evaluations are based on general tasks. I felt I wouldn't really know how a model performs until I tried it with my actual prompts.
However, setting up Ollama locally, downloading multiple large models, and testing them sequentially requires significant time and storage. I wanted to "test lightly in the cloud first, then install only the promising models locally," so I created this benchmark tool using Google Colab's free T4 GPU.
How to Use
No need to write Python code. Just fill in the Colab form and press the button.
1. Configure Model List in Model Registry
First, enter the models you want to test, comma-separated, in the Model Registry cell.
model_list = "qwen3:8b, qwen3:14b, qwen2.5-coder:7b, ministral-3:8b"
Search the official Ollama website to confirm the exact model names.
Model Selection Guidelines for T4 GPU
| Model Size | Execution Speed | Recommendation |
|---|---|---|
| 8B | Fast | ⭐⭐⭐ Recommended |
| 14B | Medium | ⭐⭐ Usable |
| 20B+ | Slow | ⭐ Not Recommended |
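If you want to confirm which GPU your Colab session was actually assigned before picking model sizes (the free tier usually provides a T4 with roughly 15 GB of VRAM), a quick query like the one below works. This is an optional check, not part of the benchmarker itself.

```python
# Optional: confirm the assigned GPU and its total memory in the Colab session.
import subprocess

info = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
    capture_output=True,
    text=True,
)
print(info.stdout)  # e.g. "Tesla T4, 15360 MiB" on the free tier
```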
2. Select Models with Checkboxes
When you run the cell, checkboxes for the entered models will appear.
- ✅ Select All Models: Batch-select all models
- Individual checkboxes: Select specific models only
Once you've set up the model list, you can easily adjust test targets with checkboxes on each run.
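For a sense of how this works under the hood, here is a minimal sketch (not the notebook's actual code) of turning a comma-separated model list into per-model checkboxes with a Select All toggle, using ipywidgets, which Colab supports:

```python
# Minimal sketch: parse the comma-separated list and build checkboxes with a
# "Select All" toggle (the notebook's actual UI code may differ).
import ipywidgets as widgets
from IPython.display import display

model_list = "qwen3:8b, qwen3:14b, qwen2.5-coder:7b"  # from the Model Registry cell
models = [m.strip() for m in model_list.split(",") if m.strip()]

select_all = widgets.Checkbox(value=True, description="Select All Models", indent=False)
boxes = [widgets.Checkbox(value=True, description=m, indent=False) for m in models]

def _toggle_all(change):
    # Propagate the "Select All" state to every per-model checkbox.
    for box in boxes:
        box.value = change["new"]

select_all.observe(_toggle_all, names="value")
display(widgets.VBox([select_all, *boxes]))

# The benchmark cell then only needs the checked models:
selected_models = [b.description for b in boxes if b.value]
```

The benchmark cell only consumes `selected_models`, which is what keeps the Model Registry cell the single place where the list is defined.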
3. Run the Benchmarker
Configure the following parameters in the Ollama Multi-Model Benchmarker cell.
save_to_drive = True # Save results to Google Drive
timeout_seconds = 1000 # Maximum processing time per model (seconds)
custom_test_prompt = "" # Custom prompt (uses default if empty)
Custom Prompt Examples
# Coding task
custom_test_prompt = "Write a Python function to recursively calculate the Fibonacci sequence"
# Summarization task
custom_test_prompt = "Summarize the following text in 3 sentences..."
Press the play button (▶) to test the selected models sequentially.
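To make the measurements concrete, here is a minimal sketch of one way to benchmark a single model against a local Ollama server. The notebook's actual code differs; the endpoint and field names below are Ollama's standard /api/generate response fields.

```python
# Minimal sketch: benchmark one model via Ollama's /api/generate endpoint.
# The response is streamed so the time to first token (TTFT) can be observed;
# token counts come from the final chunk of the stream.
import json
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def benchmark_model(model: str, prompt: str, timeout_seconds: int = 1000) -> dict:
    start = time.perf_counter()
    ttft = None
    final = {}

    with requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=timeout_seconds,  # caps how long we wait on the server
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if ttft is None and chunk.get("response"):
                ttft = time.perf_counter() - start  # first generated text arrived
            if chunk.get("done"):
                final = chunk  # contains eval_count, eval_duration (ns), load_duration, ...

    total = time.perf_counter() - start
    speed = final.get("eval_count", 0) / max(final.get("eval_duration", 1), 1) * 1e9
    return {
        "model": model,
        "ttft_s": ttft,
        "total_s": total,
        "tokens": final.get("eval_count", 0),
        "tokens_per_s": speed,
        "load_s": final.get("load_duration", 0) / 1e9,
    }
```

Calling `benchmark_model(...)` for each checked model, one after another, is essentially what the sequential run does.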
4. Review Results
After benchmark completion, results are displayed in the following format.
Top Performers by Category
| Category | Model | Score |
|---|---|---|
| ⚡ Fastest Generation | qwen3:8b | 45.23 t/s |
| ⏱️ Most Responsive | ministral-3:8b | 0.12 s |
| 📥 Quickest Pull | qwen2.5-coder:7b | 23.4 s |
Detailed Metrics
| Model | Speed | TTFT | Total | Tok | Pull | Load | Size |
|---|---|---|---|---|---|---|---|
| qwen3:8b | 45.23 t/s | 0.15s | 12.3s | 500 | 45.2s | 2.1s | 4.7GB |
Additionally, 6 types of graphs provide visual comparison:
- Generation Speed (Tokens/Sec)
- Time To First Token (Response Speed)
- Total Processing Time
- Model Load Time
- Download Time
- Model Size
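As an illustration of the kind of chart produced (the notebook's plotting code is more elaborate), a generation-speed comparison boils down to a simple matplotlib bar chart. The numbers below are placeholders, not measured results.

```python
# Placeholder numbers, only to show the shape of the chart; real values come
# from the benchmark run.
import matplotlib.pyplot as plt

speeds = {"qwen3:8b": 45.2, "qwen3:14b": 28.0, "qwen2.5-coder:7b": 41.5}  # tokens/sec

plt.figure(figsize=(8, 4))
plt.bar(list(speeds.keys()), list(speeds.values()))
plt.ylabel("Tokens/Sec")
plt.title("Generation Speed")
plt.tight_layout()
plt.show()
```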
Key Features and Technical Highlights
I've incorporated several enhancements focused on practicality.
- **Flexible Model Selection UI**
  By combining comma-separated input with checkboxes, you can easily narrow down test targets from a large model list. For example, you can uncheck "Select All" and re-test only specific models.
- **Single Source of Truth Design**
  The model list is managed in only one location: the Model Registry cell. There is no need to write the same list in multiple places in the code, which makes later edits easier.
- **Comprehensive Performance Metrics**
  Beyond generation speed (tokens/sec), it measures all the metrics you'd care about in actual use: Time To First Token (TTFT), total processing time, model load time, download time, and model size.
- **Automatic Result Saving**
  Set save_to_drive = True to automatically save measurement results to the MyDrive/OllamaBenchmarks folder in Google Drive. Three types of files are saved: an integrated JSON file, session-specific archives, and a model size cache, so comparing past results is simple.
  Google Drive/MyDrive/OllamaBenchmarks/
  ├── benchmark_results.json            # Integrated results
  ├── session_logs/
  │   └── YYYYMMDD_HHMMSS_session.json  # Session-specific
  └── model_size_cache.json             # Size cache
- **Visualization Reports**
  Results are displayed with 6 types of matplotlib graphs and Markdown-formatted tables. You can also preview each model's actual response text to verify generation quality.
- **Disk Space Check**
  Disk space is checked automatically before each model download, and the download is skipped if space is insufficient. This saves you from wasting time on downloads that cannot finish (a simplified sketch of this check follows the list).
- **Model Size Cache**
  Once a model's size is measured, it is cached, speeding up disk space pre-checks on subsequent runs.
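As mentioned in the disk space check item above, the pre-check and size cache can be implemented with just the standard library. This is a simplified sketch; the cache file name, location, and safety margin are illustrative assumptions, not the notebook's exact values.

```python
# Simplified sketch of the pre-download disk check backed by a JSON size cache.
# The cache file name and the 2 GB safety margin are illustrative assumptions.
import json
import shutil
from pathlib import Path

CACHE_FILE = Path("model_size_cache.json")

def load_size_cache() -> dict:
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def save_size_cache(cache: dict) -> None:
    CACHE_FILE.write_text(json.dumps(cache, indent=2))

def has_room_for(model: str, cache: dict, margin_gb: float = 2.0) -> bool:
    """Return False (skip the download) if the cached size would not fit on disk."""
    free_gb = shutil.disk_usage("/").free / 1024**3
    needed_gb = cache.get(model)
    if needed_gb is None:
        return True  # size unknown yet: attempt the download and cache it afterwards
    return free_gb >= needed_gb + margin_gb
```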
Understanding the Metrics
Here's an explanation of the main measured metrics and their meanings.
| Metric | Description | Unit |
|---|---|---|
| Speed | Token generation speed | tokens/sec |
| TTFT | Time To First Token (time to first response) | seconds |
| Total | Total processing time (from prompt submission to completion) | seconds |
| Tok | Number of generated tokens | - |
| Pull | Model download time | seconds |
| Load | VRAM loading time | seconds |
| Size | Model disk/VRAM size | GB |
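Of these, Speed, TTFT, Total, and Load come straight out of a timed generation call (see the sketch in step 3), while Pull and Size can be measured around the model download. Here is a hedged sketch of that part; it assumes the name you pass matches the tag Ollama lists, and the notebook's actual implementation may differ.

```python
# Sketch: measure Pull (download time) and Size for one model via the Ollama CLI/API.
import subprocess
import time

import requests

def pull_and_measure(model: str) -> dict:
    start = time.perf_counter()
    subprocess.run(["ollama", "pull", model], check=True)  # blocks until downloaded
    pull_s = time.perf_counter() - start

    # /api/tags lists installed models along with their size in bytes.
    tags = requests.get("http://localhost:11434/api/tags", timeout=30).json()
    size_bytes = next(m["size"] for m in tags["models"] if m["name"] == model)
    return {"pull_s": pull_s, "size_gb": size_bytes / 1024**3}
```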
The metrics you focus on will vary by use case:
- For chat applications, TTFT (response speed) matters. You don't want to keep users waiting.
- For code generation, Speed (generation speed) matters: a faster model produces long code more quickly.
- If VRAM or disk capacity is limited, check Size (model size) to choose smaller models.
- In serverless environments where frequent startups occur, it's useful to consider Pull + Load time as well.
Default Prompt
If no custom prompt is specified, the following prompt is used:
Write a recursive Python function with type hints and a docstring to compute
the factorial of a number, test it with n = 5, and show only the code and the
expected result.
This prompt is useful for measuring code generation capability and logical thinking.
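Behaviorally, the fallback is simply that an empty custom_test_prompt means this default is used, conceptually:

```python
# Conceptual fallback (not the notebook's literal code): empty custom prompt -> default prompt.
DEFAULT_PROMPT = (
    "Write a recursive Python function with type hints and a docstring to compute "
    "the factorial of a number, test it with n = 5, and show only the code and the "
    "expected result."
)

custom_test_prompt = ""  # value from the Colab form
prompt = custom_test_prompt.strip() or DEFAULT_PROMPT
```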
Conclusion
When choosing a local LLM, it's hard to get a real sense of usability just by looking at spec sheets. Personally, I feel more confident with the workflow of running a model against my own prompts, confirming that it seems usable, and only then installing it in my local environment.
I hope this tool helps those who, like me, want to "try it out first."