
Comparing Multiple Local LLM Models on Free Google Colab


Recently, locally running LLMs (Large Language Models) have become quite sophisticated. With models like Llama, Qwen, and Mistral publicly available, you might find yourself wondering: "Which model best fits my use case?"

Before setting up a local environment, I thought it would be convenient to test models easily in a free cloud environment. So I created a tool that "tests multiple Ollama models in a single run on Google Colab and automatically compares their performance." You can run the models with your own prompts and verify their capabilities firsthand.

What I Built​

🚀 Try It Now

No complicated setup required. You can run it directly in your browser via the link below.

Why I Created This​

When I wanted to use local LLMs, the first challenge was: "Which model should I choose?"

Benchmark results are available online, and there are local measurement tools like aidatatools/ollama-benchmark, but those evaluations use general tasks. I felt that I wouldn't really know how a model performs until I tried it with my actual prompts.

However, setting up Ollama locally, downloading multiple large models, and testing them sequentially requires significant time and storage. I wanted to "test lightly in the cloud first, then install only the promising models locally," so I created this benchmark tool using Google Colab's free T4 GPU.

How to Use​

No need to write Python code. Just fill in the Colab form and press the button.

1. Configure Model List in Model Registry​

First, enter the models you want to test, in comma-separated format, in the Model Registry cell.

model_list = "qwen3:8b, qwen3:14b, qwen2.5-coder:7b, ministral-3:8b"
How to Verify Model Names

Search the official Ollama website to confirm each model's exact name.

Model Selection Guidelines for T4 GPU

| Model Size | Execution Speed | Recommendation |
| ---------- | --------------- | -------------- |
| 8B | Fast | ⭐⭐⭐ Recommended |
| 14B | Medium | ⭐⭐ Usable |
| 20B+ | Slow | ⭐ Not Recommended |
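
For reference, the comma-separated string entered above can be turned into a clean Python list with a single line; here is a minimal sketch (the models variable name is my own, not necessarily what the notebook uses):

# Split the Model Registry string on commas and drop stray whitespace/empty entries.
models = [name.strip() for name in model_list.split(",") if name.strip()]
print(models)  # ['qwen3:8b', 'qwen3:14b', 'qwen2.5-coder:7b', 'ministral-3:8b']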

2. Select Models with Checkboxes​

When you run the cell, checkboxes for the entered models will appear.

Model Selector UI

  • ✅ Select All Models: Batch select all models
  • Individual checkboxes: Select specific models only

Once you've set up the model list, you can easily adjust test targets with checkboxes on each run.
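
Colab supports ipywidgets out of the box, so a selector like this can be built with standard widgets. The following is a minimal sketch of the idea, not the notebook's exact implementation; models is assumed to be the parsed list from the Model Registry step:

import ipywidgets as widgets

# One checkbox per model, plus a "Select All" toggle.
select_all = widgets.Checkbox(value=True, description="Select All Models")
model_boxes = [widgets.Checkbox(value=True, description=m) for m in models]

def on_select_all(change):
    # Mirror the "Select All" state onto every model checkbox.
    for box in model_boxes:
        box.value = change["new"]

select_all.observe(on_select_all, names="value")
display(widgets.VBox([select_all, *model_boxes]))

# The benchmark cell later reads the current selection:
selected_models = [box.description for box in model_boxes if box.value]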

3. Run the Benchmarker​

Configure the following parameters in the Ollama Multi-Model Benchmarker cell.

save_to_drive = True       # Save results to Google Drive
timeout_seconds = 1000     # Maximum processing time per model (seconds)
custom_test_prompt = ""    # Custom prompt (uses default if empty)

Custom Prompt Examples

# Coding task
custom_test_prompt = "Write a Python function to recursively calculate the Fibonacci sequence"

# Summarization task
custom_test_prompt = "Summarize the following text in 3 sentences..."

Press the play button (▶) to test the selected models sequentially.
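
To give a sense of what happens under the hood, here is a simplified sketch of how a single model can be timed against Ollama's streaming HTTP API. It assumes an Ollama server is already running in the Colab runtime on the default port 11434 and that the model has been pulled; the notebook's actual implementation may differ:

import json
import time
import requests

def benchmark_model(model: str, prompt: str, timeout: int = 1000) -> dict:
    """Run one prompt against a local Ollama server and collect timing metrics."""
    start = time.time()
    first_token_time = None
    final = None

    # Stream the response so the time to first token (TTFT) can be observed.
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=timeout,
    ) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first_token_time is None and chunk.get("response"):
                first_token_time = time.time()
            if chunk.get("done"):
                final = chunk

    total = time.time() - start
    # Ollama's final chunk reports eval_count (generated tokens) and
    # eval_duration (nanoseconds spent generating them).
    speed = final["eval_count"] / (final["eval_duration"] / 1e9)
    return {
        "model": model,
        "ttft_s": round(first_token_time - start, 2),
        "total_s": round(total, 1),
        "tokens": final["eval_count"],
        "tokens_per_s": round(speed, 2),
    }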

4. Review Results​

After benchmark completion, results are displayed in the following format.

Benchmark Results Table

Top Performers by Category

| Category | Model | Score |
| -------- | ----- | ----- |
| ⚡ Fastest Generation | qwen3:8b | 45.23 t/s |
| ⏱️ Most Responsive | ministral-3:8b | 0.12 s |
| 📥 Quickest Pull | qwen2.5-coder:7b | 23.4 s |

Detailed Metrics

| Model | Speed | TTFT | Total | Tok | Pull | Load | Size |
| ----- | ----- | ---- | ----- | --- | ---- | ---- | ---- |
| qwen3:8b | 45.23 t/s | 0.15 s | 12.3 s | 500 | 45.2 s | 2.1 s | 4.7 GB |

Performance Graphs

Additionally, 6 types of graphs provide visual comparison:

  • Generation Speed (Tokens/Sec)
  • Time To First Token (Response Speed)
  • Total Processing Time
  • Model Load Time
  • Download Time
  • Model Size
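
As an illustration, the generation-speed chart could be produced with plain matplotlib along these lines (a sketch only; results is assumed to be a list of per-model dicts like the ones returned by the benchmark function sketched earlier):

import matplotlib.pyplot as plt

# results: one dict per model, e.g. {"model": "qwen3:8b", "tokens_per_s": 45.23, ...}
names = [r["model"] for r in results]
speeds = [r["tokens_per_s"] for r in results]

plt.figure(figsize=(8, 4))
plt.bar(names, speeds)
plt.ylabel("Tokens/Sec")
plt.title("Generation Speed")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()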

Key Features and Technical Highlights​

I've incorporated several enhancements focused on practicality.

  • Flexible Model Selection UI

    By combining comma-separated input with checkboxes, you can easily narrow down test targets from a large model list. You can also uncheck "Select All" and re-run the test with only specific models.

  • Single Source of Truth Design

    The model list is managed in only one location: the Model Registry cell. No need to write the same list in multiple places in the code, making later edits easier.

  • Comprehensive Performance Metrics

    Beyond generation speed (tokens/sec), it measures all the metrics you'd care about in actual use: Time To First Token (TTFT), total processing time, model load time, download time, and model size.

  • Automatic Result Saving

    Set save_to_drive = True to automatically save measurement results to the MyDrive/OllamaBenchmarks folder in Google Drive. Three kinds of files are written: an integrated JSON, session-specific archives, and a model size cache, which makes it simple to compare past results (see the sketch after this list).

    Google Drive/MyDrive/OllamaBenchmarks/
    ├── benchmark_results.json             # Integrated results
    ├── session_logs/
    │   └── YYYYMMDD_HHMMSS_session.json   # Session-specific
    └── model_size_cache.json              # Size cache
  • Visualization Reports

    Results are displayed with 6 types of matplotlib graphs and Markdown-formatted tables. You can also preview each model's actual response text to verify generation quality.

  • Disk Space Check

    Free disk space is checked automatically before each model download, and models that won't fit are skipped. This saves you from wasting time waiting on downloads that can't complete.

  • Model Size Cache

    Once a model's size is measured, it's cached, speeding up disk space pre-checks on subsequent runs.
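
To illustrate the saving step referenced above, a minimal version of the Google Drive export could look like the following. It assumes the standard Colab drive mount and reuses the file layout shown in the tree; the notebook's real code may organize this differently, and results stands in for the list of per-model result dicts:

import json
import os
from datetime import datetime
from google.colab import drive

drive.mount("/content/drive")
out_dir = "/content/drive/MyDrive/OllamaBenchmarks"
os.makedirs(os.path.join(out_dir, "session_logs"), exist_ok=True)

# Append this session's results to the integrated file.
integrated_path = os.path.join(out_dir, "benchmark_results.json")
history = []
if os.path.exists(integrated_path):
    with open(integrated_path) as f:
        history = json.load(f)
history.extend(results)
with open(integrated_path, "w") as f:
    json.dump(history, f, indent=2)

# Also write a session-specific archive.
stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
with open(os.path.join(out_dir, "session_logs", f"{stamp}_session.json"), "w") as f:
    json.dump(results, f, indent=2)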

Understanding the Metrics​

Here's an explanation of the main measured metrics and their meanings.

| Metric | Description | Unit |
| ------ | ----------- | ---- |
| Speed | Token generation speed | tokens/sec |
| TTFT | Time To First Token (time to first response) | seconds |
| Total | Total processing time (from prompt submission to completion) | seconds |
| Tok | Number of generated tokens | - |
| Pull | Model download time | seconds |
| Load | VRAM loading time | seconds |
| Size | Model disk/VRAM size | GB |
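
Most of these values map directly onto fields that Ollama reports in the final chunk of each /api/generate response, with durations given in nanoseconds. Roughly, the conversion looks like this (a sketch, where final is the last streamed chunk from the benchmark function sketched earlier):

NS = 1e9  # Ollama reports durations in nanoseconds

tokens = final["eval_count"]                    # Tok
speed = tokens / (final["eval_duration"] / NS)  # Speed (tokens/sec)
load_s = final["load_duration"] / NS            # Load (time to load into VRAM)
total_s = final["total_duration"] / NS          # Total (whole request)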

The metrics you focus on will vary by use case:

  • For chat applications, TTFT (response speed) matters. You don't want to keep users waiting.
  • For code generation, Speed (generation speed) matters most: higher throughput means long code is produced with less waiting.
  • If VRAM or disk capacity is limited, check Size (model size) to choose smaller models.
  • In serverless environments where frequent startups occur, it's useful to consider Pull + Load time as well.

Default Prompt​

If no custom prompt is specified, the following prompt is used:

Write a recursive Python function with type hints and a docstring to compute 
the factorial of a number, test it with n = 5, and show only the code and the
expected result.

This prompt is useful for measuring code generation capability and logical thinking.
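
In code, that fallback is just a default value, something like the following (a sketch; DEFAULT_PROMPT is my placeholder name for the notebook's built-in prompt):

DEFAULT_PROMPT = (
    "Write a recursive Python function with type hints and a docstring to compute "
    "the factorial of a number, test it with n = 5, and show only the code and the "
    "expected result."
)

# An empty custom_test_prompt falls back to the default.
prompt = custom_test_prompt.strip() or DEFAULT_PROMPT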

Conclusion​

When choosing a local LLM, it's hard to get a real sense of usability just by looking at spec sheets. Personally, I feel much more confident with the workflow of running a model against my own prompts, confirming that it seems usable, and only then installing it in my local environment.

I hope this tool helps those who, like me, want to "try it out first."

If this site helped you, please support us with a star! 🌟
Star on GitHub

References​