Working with Virtual Machines, SSH and Remote MLflow Infrastructure#
This chapter introduces the practical foundations needed to work with the remote computing environments commonly used in Machine Learning and MLOps workflows. You will learn what a Virtual Machine (VM) is, how to connect to it using SSH, how to securely reach remote services through SSH port forwarding, and finally how to connect your local ML code to a remote MLflow server for experiment tracking.
Table of Contents#
What is a Virtual Machine?
Why Data Scientists Use VMs
How to Build a Virtual Machine
SSH Basics
Port Forwarding for MLflow and MinIO
Connecting Code to a Remote MLflow Server
Building Production ML Services (Docker, AKS/Kubernetes)
Troubleshooting
1. What is a Virtual Machine?#
A Virtual Machine (VM) is a fully isolated computer environment running inside a physical machine. It behaves like a real computer—with its own CPU, RAM, storage, operating system, and network interface.
You can think of a VM as:#
1️⃣ A remote computer you log into#
A Virtual Machine behaves just like a physical computer, but it exists somewhere else—in a datacenter or in the cloud. It has its own:
Operating system (often Ubuntu/Linux)
CPU cores
RAM
Disk storage
Network interface
You connect to it using SSH (Secure Shell), which means:
You can work on the VM from anywhere
Your laptop becomes only a “window” into the VM
Training jobs continue running even if you close your laptop
A VM is essentially your remote workstation for ML.
2️⃣ A controlled and reproducible computing environment#
A laptop changes constantly (OS updates, Python conflicts, limited resources). A VM is different: it stays stable and consistent. A VM provides:
A fixed Python environment
Stable versions of CUDA, drivers, and libraries
Shared datasets accessible to multiple users
No unexpected system updates
An isolated environment that does not depend on your laptop
This makes it ideal for reproducible experiments, a key MLOps requirement. If your code runs today on the VM, it will run the same way tomorrow, because nothing changes unless you change it.
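To make that reproducibility concrete, it helps to record the exact environment alongside each experiment. The sketch below is a hypothetical helper (not part of any library) that collects interpreter and package versions using only the standard library:

```python
# snapshot_env.py — record the interpreter and key library versions
# alongside an experiment, so a run on the VM can be reproduced later.
import json
import platform
import sys
from importlib import metadata

def environment_snapshot(packages: list[str]) -> dict:
    """Collect version info for the given installed packages (best effort)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }

# Example: snapshot a couple of package names (one intentionally missing)
print(json.dumps(environment_snapshot(["pip", "definitely-not-installed"]), indent=2))
```

Storing such a snapshot next to each run (or logging it as an MLflow artifact) makes "it worked yesterday" debuggable.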
3️⃣ A place to run ML workloads without relying on your laptop#
Laptops have limitations:
Limited RAM
No or limited GPU
Limited CPU power
Overheating
Battery draining
Sleep mode interrupts long training runs
A VM solves these problems:
Can have large amounts of RAM
Can include powerful CPUs or multiple GPUs
Runs around the clock without interruption
Can handle big datasets and heavy training workloads
Supports multiple notebooks/scripts running at once
Can host ML services such as MLflow, MinIO, FastAPI, or databases
A VM becomes your primary compute engine, while your laptop becomes simply a client.
⭐ Summary
A Virtual Machine is:
A powerful, stable, always-on remote computer used to run machine learning tasks, experiments, and services without depending on your laptop’s hardware.
VMs are fundamental to modern ML engineering and MLOps.
Why Data Scientists Use VMs#
Machine Learning workloads often require:
More compute power than your laptop can provide
Long-running jobs that must survive reboot or disconnection
Shared experiment tracking, such as MLflow
Persistent storage for datasets and model artifacts
Team-wide access to the same environment
A VM provides all of this in a controlled way.
Note
Even if you develop code locally, the infrastructure for experiment tracking and deployment often lives on remote VMs.
2. How to Build a Virtual Machine (VM)#
Creating a Virtual Machine is one of the first steps toward building a stable and reproducible ML infrastructure. While the exact steps differ slightly between cloud providers (Azure, AWS, GCP), the process is conceptually the same everywhere. This section explains how to build a VM in a general, cloud-agnostic way, and also highlights the key parameters you must choose for ML workloads.
1️⃣ Choose a Cloud Platform or Host Environment#
You can create a VM in:
Azure (Azure Portal → Virtual Machines)
AWS EC2
Google Cloud Compute Engine
Local virtualization tools (VirtualBox, VMware)
University/enterprise clusters (custom portals)
Bare-metal servers managed via SSH
Regardless of the platform, the VM creation workflow follows the same steps.
2️⃣ Select the Operating System (OS)#
For Machine Learning, the recommended OS is:
Ubuntu 20.04 LTS
Ubuntu 22.04 LTS (very common)
(Optional) Ubuntu + NVIDIA GPU drivers (if using GPUs)
Why Ubuntu?
Compatibility with ML frameworks
Easy package management (apt, pip, conda)
Good community support
Most MLOps tools assume Linux environments
3️⃣ Choose VM Size (CPU, RAM, GPU)#
Depending on your ML workload:
For lightweight models (e.g., scikit-learn):#
2–4 vCPUs
8–16 GB RAM
No GPU needed
For deep learning (PyTorch/TensorFlow):#
8+ vCPUs
32+ GB RAM
One or more GPUs (NVIDIA Tesla series)
Dedicated disk for datasets
For MLOps infrastructure (MLflow, MinIO, FastAPI services):#
2–4 vCPUs
8 GB RAM
Standard SSD storage
Note
You do not need a GPU for MLflow, databases, or MinIO. GPUs are needed only for model training or inference workloads.
4️⃣ Configure Networking & Access#
Every VM needs:
A public IP address (for SSH access)
A security rule (firewall) allowing SSH on port 22
(Optional) private networking if multiple VMs talk to each other
For security:
Disable password login
Use SSH keys only
Do not expose MLflow or MinIO to the internet
Use port forwarding instead
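On the VM itself, password logins can be switched off in the SSH daemon configuration. A typical snippet (standard OpenSSH option names; adapt the path and restart command to your distribution) in /etc/ssh/sshd_config:

```
# /etc/ssh/sshd_config — allow key-based logins only
PasswordAuthentication no
ChallengeResponseAuthentication no
PubkeyAuthentication yes
```

After editing, restart the daemon (on Ubuntu, typically sudo systemctl restart ssh). Keep one working SSH session open while testing the change, so a mistake cannot lock you out.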
5️⃣ Add Your SSH Key#
During VM creation, the portal will ask for an SSH key. Paste the content of your local public key:
cat ~/.ssh/id_rsa.pub
For a deeper introduction to SSH fundamentals, see the SSH Basics chapter.
6️⃣ Create the VM and Connect to It#
Once the VM is created, connect from your local machine:
ssh <username>@<VM_PUBLIC_IP>
For example:
ssh student@20.73.41.122
If you have an SSH config entry:
ssh mlops-vm
7️⃣ Install Essential ML Infrastructure Components#
After connecting to the VM, you typically install:
System updates:
sudo apt update && sudo apt upgrade -y
Python tools:
sudo apt install python3-pip python3-venv -y
Git:
sudo apt install git -y
MLflow + libraries (inside a venv):
python3 -m venv mlflow_env
source mlflow_env/bin/activate
pip install mlflow boto3 psycopg2-binary
Optional — GPUs:
Install NVIDIA drivers + CUDA toolkit depending on cloud platform.
8️⃣ Enable the VM for MLOps Work#
A fully configured ML VM typically includes:
MLflow tracking server
Object storage (MinIO or cloud buckets)
PostgreSQL / MySQL (MLflow metadata)
SSH port forwarding for UI access
Docker (for deployment workloads)
Kubernetes tooling (kubectl + Azure CLI if using AKS)
This transforms your VM from a plain Linux box into a full ML platform.
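As a sketch of what the MLflow piece looks like once installed, a tracking server with a database backend and an object-store artifact root is typically started like this (the connection string, credentials, and bucket name are placeholders for your own setup):

```
# start MLflow with a Postgres metadata store and an S3-compatible artifact store
mlflow server \
  --backend-store-uri postgresql://mlflow:<password>@localhost:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts/ \
  --host 127.0.0.1 \
  --port 5000
```

Binding to 127.0.0.1 keeps the UI private to the VM; you then reach it from your laptop through an SSH tunnel rather than over the public internet.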
⭐ Summary
A Virtual Machine (VM) provides a reproducible, stable, and always-on environment for Machine Learning and MLOps.
To build a practical ML-ready VM:
Choose a cloud or host platform (Azure, AWS, GCP, on-premise)
Select Ubuntu as the operating system (20.04 or 22.04 LTS recommended)
Allocate compute resources
CPUs: 2–8
RAM: 8–32 GB
GPU: optional (NVIDIA)
Enable secure SSH access using key-based authentication
Create the VM and connect via ssh
Install essential tools
Python, pip/conda
MLflow
Git
Storage clients
Configure networking and SSH port forwarding for MLflow/MinIO
Add Docker & Kubernetes tooling for production deployments
These steps give you a powerful, reusable infrastructure to run ML experiments, track them, store artifacts, and deploy production ML services.
3. SSH Basics#
SSH (Secure Shell) is a protocol used to securely log in to a remote machine (like a VM) and run commands.
3.1 Generate SSH key pair (locally)#
On your laptop:
ssh-keygen -t rsa -b 4096
This creates:
A private key (e.g., ~/.ssh/id_rsa) → keep this secret
A public key (e.g., ~/.ssh/id_rsa.pub) → you share this with servers you trust
3.2 Why are there two keys?#
SSH uses asymmetric cryptography, which requires a matched pair of keys:
🔐 Private Key (your secret identity)#
Stored only on your laptop
Must never be shared
Used to prove your identity when connecting
If someone obtains this file, they can log in as you
Warning
Think of it like your passport: it uniquely identifies you.
🔓 Public Key (your shareable identity)#
Stored on the servers you want to access (in ~/.ssh/authorized_keys)
Safe to share: it can verify you, but cannot be used to impersonate you
Used by the server to check responses signed with your private key
3.3 How SSH authentication works (simple explanation)#
You run:
ssh <user>@<vm-ip>
The VM checks whether your public key is in:
~/.ssh/authorized_keys
If found, the server sends a cryptographic challenge. Your laptop answers it using your private key. If the response verifies against your public key, access is granted. No passwords are sent over the network, and your private key never leaves your machine.
3.4 Why SSH keys are more secure than passwords#
Practically impossible to guess or brute-force (a 4096-bit RSA key is extremely strong)
No secrets are transmitted
No password typing → avoids keyloggers
Works automatically once set up
Required by most cloud providers (Azure, AWS, GCP)
⭐ Summary
| File | Location | Purpose | Share? |
|---|---|---|---|
| Private key (`~/.ssh/id_rsa`) | Laptop only | Proves your identity | ❌ Never |
| Public key (`~/.ssh/id_rsa.pub`) | Given to VM | Verifies your identity | ✔ Safe |
SSH keys are the foundation of secure access to remote ML environments and production infrastructure.
3.5 Add your public key to the VM#
On the VM (or via your cloud provider’s UI), add the content of your id_rsa.pub into:
~/.ssh/authorized_keys
Make sure permissions are correct:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
3.6 Using an SSH config#
Instead of typing long commands, create/edit:
~/.ssh/config
Example:
Host mlops-vm
HostName <VM_PUBLIC_IP_OR_HOSTNAME>
User <your-username>
IdentityFile ~/.ssh/id_rsa
Now you can simply connect with:
ssh mlops-vm
Important
If you get a “permissions are too open” warning, run:
chmod 600 ~/.ssh/id_rsa
4. Port Forwarding for MLflow and MinIO#
Many ML tools run as web applications on the VM, for example:
| Service | Typical Port | Description |
|---|---|---|
| MLflow | 5000 | UI and tracking API |
| MinIO | 9000 | Web UI for artifact storage |
| MinIO API | 9001 | S3-compatible API endpoint (optional) |
These ports are usually not reachable directly from your laptop (for security reasons). Instead, we use SSH port forwarding (also called tunneling).
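To build intuition for what a tunnel actually does, the toy relay below listens on a local port and pipes bytes to a target address. That is the core mechanic of SSH local forwarding, minus the encryption and authentication that SSH adds. It is an illustrative sketch only, not a replacement for SSH:

```python
# toy_forward.py — a minimal TCP relay illustrating the idea behind an
# SSH tunnel: accept connections locally and copy bytes to a remote target.
import socket
import threading

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src closes."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.close()
        except OSError:
            pass

def forward(local_port: int, target_host: str, target_port: int) -> socket.socket:
    """Listen on 127.0.0.1:local_port and relay each connection to the target."""
    listener = socket.socket()
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", local_port))
    listener.listen()

    def accept_loop():
        while True:
            try:
                client, _ = listener.accept()
            except OSError:
                return  # listener was closed
            upstream = socket.create_connection((target_host, target_port))
            # one thread per direction, so traffic flows both ways
            threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener
```

With this mental model, `LocalForward 5000 127.0.0.1:5000` reads naturally: "listen on my laptop's port 5000 and relay everything, over the encrypted SSH connection, to port 5000 on the VM."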
4.1 Configure port forwarding in SSH config#
Extend your ~/.ssh/config entry:
Host mlops-vm
HostName <VM_PUBLIC_IP_OR_HOSTNAME>
User <your-username>
IdentityFile ~/.ssh/id_rsa
LocalForward 5000 127.0.0.1:5000
LocalForward 9000 127.0.0.1:9000
LocalForward 9001 127.0.0.1:9001
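The three LocalForward lines above are equivalent to passing -L flags on the command line, which is handy for one-off sessions (the user and host are placeholders, as before):

```
ssh -L 5000:127.0.0.1:5000 \
    -L 9000:127.0.0.1:9000 \
    -L 9001:127.0.0.1:9001 \
    <your-username>@<VM_PUBLIC_IP_OR_HOSTNAME>
```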
4.2 Connect with forwarding#
Now connect:
ssh mlops-vm
As long as this SSH session is open, you can open on your laptop:
MLflow → http://127.0.0.1:5000
MinIO → http://127.0.0.1:9000
Even though these services are actually running on the VM.
Note
This approach is much safer than exposing MLflow and MinIO directly to the internet. The ports stay private to your tunnel.
5. Connecting Code to a Remote MLflow Server#
Once port forwarding is set up, you can connect your ML code (running on your laptop or in the VM) to the MLflow tracking server.
5.1 Basic configuration in Python#
import mlflow
# Tracking server appears local because of SSH port forwarding
mlflow.set_tracking_uri("http://127.0.0.1:5000")
# Choose a meaningful experiment name
mlflow.set_experiment("wine_quality_experiment")
5.2 Logging runs, params, metrics, and models#
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
import numpy as np
# X_train, X_test, y_train, y_test assumed to be prepared already
with mlflow.start_run(run_name="elasticnet_default"):
    alpha = 0.5
    l1_ratio = 0.5
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", float(rmse))
    mlflow.sklearn.log_model(model, artifact_path="model")
print("Run finished, check MLflow UI at http://127.0.0.1:5000")
After running this:
Navigate to http://127.0.0.1:5000 in your browser
Open your experiment
Inspect runs, parameters, metrics, and artifacts
Note
When MLflow is configured with object storage (e.g., MinIO or Azure Blob Storage), your logged models and artifacts are stored there, while run metadata goes into the MLflow backend database.
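For the MinIO case specifically, the client side also needs to know where the S3-compatible endpoint lives. MLflow reads this from environment variables; a typical setup before running your training script looks like the following (the credentials are placeholders, and the endpoint port follows this chapter's port layout — adjust both to your MinIO deployment):

```
# point MLflow's S3 client at the MinIO endpoint (values are placeholders)
export MLFLOW_S3_ENDPOINT_URL=http://127.0.0.1:9001
export AWS_ACCESS_KEY_ID=<minio-access-key>
export AWS_SECRET_ACCESS_KEY=<minio-secret-key>
```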
6. Building Production ML Services (Docker, AKS/Kubernetes)#
So far, we used a VM and MLflow to track experiments. In real projects, you often want to expose a trained model as a service (e.g., REST API) that other applications can call.
A common production stack:
MLflow for model tracking and registry
Docker to package your model and API
Kubernetes (e.g., Azure Kubernetes Service – AKS) to run and scale containers
This section gives a high-level overview and a minimal end-to-end example.
6.1 High-level architecture#
Conceptually:
MLflow (on VM or Azure ML)
        ▲
        │ (load model by URI or from registry)
        │
Docker container (FastAPI / Flask serving code)
        ▲
        │ (container image)
        │
Kubernetes / AKS (runs N replicas, exposes a Service)
The steps:
Train and log model to MLflow
Write a small serving app (e.g., FastAPI) that loads the model
Create a Docker image for that app
Push the image to a container registry
Deploy the image on Kubernetes/AKS with a Deployment + Service
6.2 Example: Simple FastAPI inference service#
Assume you have a model registered in MLflow, e.g.:
Registered name: wine-quality-model
Stage: Production
We will write an API that receives JSON and returns predictions.
# src/app.py
import mlflow.pyfunc
from fastapi import FastAPI
import pandas as pd
# MLflow model URI (from registry or direct run artifact)
MODEL_URI = "models:/wine-quality-model/Production"
app = FastAPI(title="Wine Quality Prediction API")
# Load the model once at startup
model = mlflow.pyfunc.load_model(MODEL_URI)
@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(features: dict):
    """
    Example request body:
    {
        "fixed acidity": 7.4,
        "volatile acidity": 0.70,
        "citric acid": 0.00,
        "residual sugar": 1.9,
        "chlorides": 0.076,
        "free sulfur dioxide": 11.0,
        "total sulfur dioxide": 34.0,
        "density": 0.9978,
        "pH": 3.51,
        "sulphates": 0.56,
        "alcohol": 9.4
    }
    """
    df = pd.DataFrame([features])
    preds = model.predict(df)
    return {"prediction": float(preds[0])}
You can test this locally (before Dockerizing) using uvicorn:
uvicorn src.app:app --host 0.0.0.0 --port 8000
Then open: http://127.0.0.1:8000/docs to test the API.
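You can also exercise the endpoint from the command line. This assumes the service is running locally on port 8000 as started above, with the example payload taken from the docstring:

```
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"fixed acidity": 7.4, "volatile acidity": 0.70, "citric acid": 0.00,
       "residual sugar": 1.9, "chlorides": 0.076, "free sulfur dioxide": 11.0,
       "total sulfur dioxide": 34.0, "density": 0.9978, "pH": 3.51,
       "sulphates": 0.56, "alcohol": 9.4}'
```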
6.3 Dockerizing the service#
Create a requirements.txt:
fastapi
uvicorn[standard]
mlflow
pandas
Create a Dockerfile:
# Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
# Environment variables for MLflow tracking / artifacts (adapt for your setup)
ENV MLFLOW_TRACKING_URI=http://mlflow:5000
EXPOSE 8000
CMD ["uvicorn", "src.app:app", "--host", "0.0.0.0", "--port", "8000"]
Build the Docker image:
docker build -t wine-quality-api:latest .
Run locally:
docker run -p 8000:8000 wine-quality-api:latest
Check http://127.0.0.1:8000/docs again.
Warning
In a real setup with MinIO, Azure Blob, or S3, you need to configure environment variables or credentials inside the container so that mlflow.pyfunc.load_model can access the model artifacts.
6.4 Deploying the Docker image to Kubernetes / AKS#
On a Kubernetes cluster (e.g., AKS), you:
Push the image to a registry (e.g., Azure Container Registry):
docker tag wine-quality-api <your-registry>/wine-quality-api:latest
docker push <your-registry>/wine-quality-api:latest
Create a Deployment and Service manifest.
Example deployment-wine-api.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wine-quality-api
  labels:
    app: wine-quality-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: wine-quality-api
  template:
    metadata:
      labels:
        app: wine-quality-api
    spec:
      containers:
        - name: wine-quality-api
          image: <your-registry>/wine-quality-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: MLFLOW_TRACKING_URI
              value: "http://mlflow:5000"
            # add any storage credentials here if needed
---
apiVersion: v1
kind: Service
metadata:
  name: wine-quality-api-service
spec:
  type: LoadBalancer
  selector:
    app: wine-quality-api
  ports:
    - port: 80
      targetPort: 8000
Apply it:
kubectl apply -f deployment-wine-api.yaml
On AKS or another managed Kubernetes:
The Service of type LoadBalancer will be given an external IP
You can then call:
http://<EXTERNAL-IP>/health
http://<EXTERNAL-IP>/predict
from any client that can access the cluster.
Note
In real production setups, you will usually:
use HTTPS (TLS),
add authentication/authorization,
configure autoscaling and monitoring,
integrate with secrets managers (for credentials).
7. Troubleshooting#
Q1: I cannot access MLflow at http://127.0.0.1:5000.
Check that your SSH session is active and includes LocalForward 5000 127.0.0.1:5000.
Ensure MLflow is actually running on the VM (check with ps aux | grep mlflow or netstat -tulpn).
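A quick way to tell "tunnel is down" apart from "MLflow is down" from the laptop side is to test whether anything accepts connections on the forwarded port. A small stdlib-only check (a hypothetical helper, not an MLflow utility):

```python
# check_tunnel.py — is anything listening on the forwarded port?
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("MLflow port reachable:", port_open("127.0.0.1", 5000))
```

If this prints False while MLflow is running on the VM, the problem is on the tunnel side; if it prints True but the UI fails, look at the MLflow process itself.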
Q2: My Docker container cannot load the MLflow model.
Confirm that MLFLOW_TRACKING_URI inside the container points to a reachable MLflow server.
If using remote artifact storage (MinIO, S3, Azure Blob), ensure the correct environment variables and credentials are configured inside the container.
Q3: Kubernetes Service has no external IP.
On some local clusters (e.g., kind, minikube), LoadBalancer is emulated or not supported by default.
On AKS, wait a bit for the LoadBalancer to be provisioned and check with:
kubectl get svc wine-quality-api-service
Q4: My model works locally but not in Kubernetes.
Check logs:
kubectl logs deployment/wine-quality-api
Common issues: wrong model URI, missing environment variables, missing Python dependencies.