Choosing the Right LLM: Systematic Model Evaluation with MLflow

As more and more LLMs become available, engineers face the question: which model performs best for a given use case? This piece walks through a production-oriented setup for evaluating multiple open-source models using MLflow, focusing on practices that translate directly to real-world deployments.

Why MLflow?

MLflow provides a standardized framework for:

  • tracking experiments
  • comparing models
  • maintaining evaluation history

MLflow offers you:

  • Centralized experiment tracking with a web UI
  • Structured comparison across multiple model runs
  • Persistent storage of evaluation metrics and artifacts

Setup

All the code below is available at gh.com/lotharschulz/mlflow_eval.

1. Environment Variables Over Hard-Coded Values

Let’s separate concerns with environment variables:

export MLFLOW_TRACKING_URI="http://localhost:5000"
echo $MLFLOW_TRACKING_URI
export OLLAMA_BASE_URI="http://localhost:11434"
echo $OLLAMA_BASE_URI

This approach means your evaluation code works unchanged across stages, be it development, staging, production or others — simply adjust the environment variables.
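Inside the evaluation code, the same values can then be read with os.getenv, falling back to local defaults when a variable is unset; a minimal sketch:

```python
import os

# Read service endpoints from the environment, with local defaults as fallback
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
OLLAMA_BASE_URI = os.getenv("OLLAMA_BASE_URI", "http://localhost:11434")

print(f"MLflow: {MLFLOW_TRACKING_URI}")
print(f"Ollama: {OLLAMA_BASE_URI}")
```

Exporting a different value before running the script is all it takes to point the evaluation at another stage.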

2. Python Virtual Environment for Dependency Isolation

To avoid conflicts with other Python setups, let’s use a virtual environment:

python -m venv env
source env/bin/activate

This ensures your evaluation setup doesn’t interfere with other Python projects and keeps its dependency set reproducible.

3. Docker Compose for Infrastructure

MLflow’s tracking server runs via Docker Compose, making it simple to start up and tear down:

# start
docker compose up -d
# tear down
docker compose down -v  # the -v flag removes volumes storing data

The -v flag is crucial: it ensures volumes containing experiment data are deleted, giving you a clean slate for the next evaluation run.
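The compose file itself can stay small; the following is a sketch, where the image tag, server command, and volume name are assumptions rather than the repository’s exact file:

```yaml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000
    ports:
      - "5000:5000"
    volumes:
      - mlflow-data:/mlflow   # named volume, removed by `docker compose down -v`

volumes:
  mlflow-data:
```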

The Evaluation Environment

First, let’s use a small test script, tracking_uri_test.sh, to check that the environment variables are set correctly.
The set -euo pipefail line, inspired by the unofficial Bash strict mode, ensures the script exits on any error, on use of an unset variable, and on failures anywhere in a pipeline.

Preparing Ollama Models

I chose to test Ollama models because Ollama is available on many systems: prepare_ollama_evaluation.sh
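A preparation script typically just pulls each candidate model ahead of time, so the first evaluation run is not skewed by download latency. A sketch, with a hypothetical model list:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical candidates; replace with the models you want to compare
MODELS="llama3.2 mistral qwen2.5"

if command -v ollama >/dev/null 2>&1; then
  for model in ${MODELS}; do
    echo "Pulling ${model}..."
    ollama pull "${model}"
  done
else
  echo "ollama binary not found; install it first" >&2
fi
```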

Running the Evaluation

Execute the evaluation with the bash script evaluate_ollama.sh that calls the evaluate_ollama.py script.

Understanding the Results

The evaluation produces a winner model based on average similarity. This is a straightforward metric comparing model outputs to the provided ground truth answers:

eval_data = pd.DataFrame({
    "question": [
        "What is MLflow?",
        "What is the capital of Spain?",
        "Explain machine learning in simple terms.",
        "What is 2+2+2?",
        "Who wrote Romeo and Juliet?",
        "How many vowels are in Alabama?",
        "Which city is meant in the song \"We built this city\" by the group Starship?",
    ],
    "ground_truth": [
        "MLflow is an open-source platform for managing the machine learning lifecycle.",
        "The capital city of Spain is Madrid",
        "Machine learning is a way for computers to learn patterns from data.",
        "6",
        "William Shakespeare",
        "4",
        "Two cities are referenced: San Francisco and Los Angeles.",
    ]
})


Although similarity scoring has limitations (it favors lexical overlap over semantic correctness), it provides a reproducible baseline for comparing models.
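The exact scoring lives in evaluate_ollama.py; as an illustration of the idea, a purely lexical similarity can be computed with Python’s standard library. This is a stand-in for the concept, not necessarily the metric the repository uses:

```python
from difflib import SequenceMatcher

def lexical_similarity(answer: str, ground_truth: str) -> float:
    """Ratio of matching character runs, from 0.0 (disjoint) to 1.0 (identical).

    Rewards surface overlap, not semantic equivalence: a correct paraphrase
    can still score low.
    """
    return SequenceMatcher(None, answer.lower(), ground_truth.lower()).ratio()

# A short but correct answer scores well below 1.0 against a longer reference
print(lexical_similarity("Madrid", "The capital city of Spain is Madrid"))
print(lexical_similarity("William Shakespeare", "william shakespeare"))  # → 1.0
```

This illustrates the limitation mentioned above: the scorer cannot tell a wrong answer from a correct one phrased differently.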

The Python script’s output contains:

  • Top 3 ranking: Quick comparison of leading models
  • Average similarity: Overall model performance across all questions
  • Min/Max similarity: Performance range, revealing consistency
  • Winner declaration: Clear identification of the best-performing model
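These summary statistics are simple to derive once per-question scores exist. A sketch with hypothetical data (the model names and numbers below are invented for illustration):

```python
from statistics import mean

# Hypothetical per-model similarity scores, one entry per question
scores = {
    "llama3.2": [0.91, 0.78, 0.85],
    "mistral": [0.88, 0.80, 0.79],
    "qwen2.5": [0.95, 0.70, 0.90],
}

summary = {
    name: {"avg": mean(s), "min": min(s), "max": max(s)}
    for name, s in scores.items()
}
# Rank models by average similarity, best first
ranking = sorted(summary, key=lambda name: summary[name]["avg"], reverse=True)

for name in ranking[:3]:
    s = summary[name]
    print(f"{name}: avg={s['avg']:.3f} min={s['min']:.2f} max={s['max']:.2f}")
print(f"Winner: {ranking[0]}")
```

Note how min/max reveals consistency: a model can win on average while having the weakest single answer.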

All results are persisted and available in MLflow’s web UI at http://localhost:5000. You can drill into details, compare runs side-by-side, and export data for further analysis.

Conclusion

This evaluation setup emphasizes production-ready practices:

  1. Environment variables separate configuration from code
  2. Virtual environments isolate dependencies
  3. Docker Compose provides reproducible infrastructure
  4. Volume cleanup (docker compose down -v) ensures clean teardowns
  5. Similarity-based evaluation offers a simple, reproducible comparison metric, easy to replace with a different metric
  6. MLflow tracking provides structured, persistent evaluation history

The approach scales from local experimentation to production-like monitoring. Swap in different models, adjust evaluation datasets, or integrate custom metrics; the foundation remains solid.

For engineers starting with LLM evaluation or new to MLflow, this setup and code provide a practical entry point that respects production engineering principles from day one.
