ML Model Deployment: Kubernetes and MLOps Best Practices

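Background

I recently had to deploy a scikit-learn model to production. Until now I had always used a simple Flask + VM setup; this time I decided to try Kubernetes to prepare for future scaling.

These are my notes on the process and the pitfalls I hit along the way.

Step 1: Containerization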

Dockerfile

FROM python:3.11-slim

WORKDIR /app

# Copy requirements first to take advantage of the Docker layer cache
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model and application code
COPY model/ ./model/
COPY app.py .

# Run as a non-root user (security)
RUN useradd -m appuser
USER appuser

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

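Pitfall: the image was too large

The first version of the image was over 2GB, because it used the full Python image plus PyTorch.

Fixes:

Use python:3.11-slim instead of python:3.11
Remove dependencies that are not needed
Use a multi-stage build (if anything needs compiling)

That brought it down to ~500MB.

Step 2: Kubernetes Deployment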

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model
        image: your-registry/ml-model:v1.0.0
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5

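Pitfall: resource configuration

At first I set the memory request too low (256Mi), and the Pods kept getting OOM-killed.

ML models use a lot of memory while loading, and that has to be budgeted for. A container is OOM-killed when it exceeds its memory limit (or when the node itself runs out of memory), so the limit has to cover the load-time peak as well. My rule of thumb:

Request: the steady-state usage after the model has loaded
Limit: 1.5-2x the request, to leave room for bursts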
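
The rule of thumb above can be sketched as a small helper. This is only an illustration: the function name, its parameters, and the sample numbers are hypothetical, and the two inputs are assumed to come from your own measurements of the running container.

```python
import math


def suggest_memory(steady_mib: float, peak_mib: float, burst_factor: float = 1.5) -> dict:
    """Suggest Kubernetes memory request/limit values from measured usage.

    steady_mib:   MiB used once the model is loaded and serving (your measurement)
    peak_mib:     highest MiB seen while the model was loading (your measurement)
    burst_factor: limit as a multiple of the request (the 1.5-2x rule of thumb)
    """
    if steady_mib <= 0 or peak_mib < steady_mib:
        raise ValueError("expected 0 < steady_mib <= peak_mib")
    request = math.ceil(steady_mib)
    # The limit has to cover both the burst headroom and the load-time peak,
    # otherwise the container can be OOM-killed while the model is loading.
    limit = math.ceil(max(request * burst_factor, peak_mib))
    return {"request": f"{request}Mi", "limit": f"{limit}Mi"}


print(suggest_memory(steady_mib=1024, peak_mib=1400))  # {'request': '1024Mi', 'limit': '1536Mi'}
print(suggest_memory(steady_mib=1024, peak_mib=1800))  # {'request': '1024Mi', 'limit': '1800Mi'}
```

In the second call the load-time peak is higher than 1.5x the request, so the peak decides the limit.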

Service

apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP

Step 3: Health Checks

Adding health endpoints

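import time

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
model = None
model_loaded_at = None

@app.on_event("startup")
async def load_model():
    global model, model_loaded_at
    # load_your_model() is a placeholder for your own loading code
    model = load_your_model()
    model_loaded_at = time.time()

@app.get("/health")
async def health():
    """Liveness probe - is the process still alive?"""
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    """Readiness probe - can it accept traffic yet?"""
    if model is None:
        # FastAPI does not treat a (body, status) tuple as a status code,
        # so set the 503 explicitly on the response.
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {"status": "ready", "model_loaded_at": model_loaded_at}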

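Pitfall: initialDelaySeconds too short

Loading the model takes 20 seconds, but the liveness probe's initialDelaySeconds was only 10 seconds. The Pod was judged unhealthy before it was ready, got killed, and restarted in an endless loop.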
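
To sanity-check probe settings against a measured load time, a rough calculation like the one below can help. It is a sketch under stated assumptions: the helper names and the 2x safety margin are hypothetical, the default of 3 is the Kubernetes default for failureThreshold, and the formula ignores timeoutSeconds and kubelet scheduling jitter, so treat the result as an approximation.

```python
def liveness_budget_seconds(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate seconds after container start before failing liveness probes cause a restart.

    The first probe fires after initial_delay, then one every period; the
    container is restarted once failure_threshold probes in a row have failed.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    return initial_delay + (failure_threshold - 1) * period


def is_safe(load_seconds: float, initial_delay: int, period: int,
            failure_threshold: int = 3, margin: float = 2.0) -> bool:
    """True if the restart budget is at least margin times the measured load time."""
    budget = liveness_budget_seconds(initial_delay, period, failure_threshold)
    return budget >= load_seconds * margin


# Values from the Deployment manifest: initialDelaySeconds=30, periodSeconds=10
print(liveness_budget_seconds(30, 10))  # 50
print(is_safe(20, 30, 10))              # True
print(is_safe(20, 10, 10))              # False
```

Kubernetes also provides a startupProbe for slow-starting containers, which holds back the liveness probe until the application has started.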

Step 4: Model Versioning

Tracking model versions with MLflow:

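import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://mlflow-server:5000")

with mlflow.start_run():
    # Train (train_model, X_train and y_train are defined elsewhere)
    model = train_model(X_train, y_train)

    # Log parameters and metrics
    # (accuracy and f1 are assumed to be computed on a held-out set)
    mlflow.log_params({
        "n_estimators": 100,
        "max_depth": 10
    })
    mlflow.log_metrics({
        "accuracy": accuracy,
        "f1_score": f1
    })

    # Save the model and register it so it can be loaded by name later
    mlflow.sklearn.log_model(model, "model", registered_model_name="my-model")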

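At deploy time, pull a specific version from MLflow:

import mlflow
import mlflow.sklearn

# Loads whichever registered version of "my-model" is in the Production stage
model = mlflow.sklearn.load_model("models:/my-model/Production")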
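
The URI above selects a model by stage. MLflow registry URIs can also select by version number (models:/name/version) or by alias (models:/name@alias), which is useful for pinning a deployment to an exact version. The helper below is a hypothetical convenience for building such URIs, not part of MLflow itself.

```python
from typing import Optional


def model_uri(name: str, version: Optional[int] = None,
              stage: Optional[str] = None, alias: Optional[str] = None) -> str:
    """Build an MLflow Model Registry URI from exactly one selector."""
    selectors = [s for s in (version, stage, alias) if s is not None]
    if len(selectors) != 1:
        raise ValueError("pass exactly one of version, stage or alias")
    if version is not None:
        return f"models:/{name}/{version}"       # pinned to one exact version
    if stage is not None:
        return f"models:/{name}/{stage}"         # whatever is in that stage now
    return f"models:/{name}@{alias}"             # whatever the alias points to


print(model_uri("my-model", stage="Production"))  # models:/my-model/Production
print(model_uri("my-model", version=3))           # models:/my-model/3
```

The returned string can be passed to mlflow.sklearn.load_model(). Pinning by version keeps a rollback reproducible, while a stage or alias follows whatever the registry currently points to.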

Step 5: Monitoring

Prometheus metrics

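from typing import List

from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from pydantic import BaseModel

REQUEST_COUNT = Counter(
    'prediction_requests_total',
    'Total prediction requests'
)

PREDICTION_LATENCY = Histogram(
    'prediction_latency_seconds',
    'Prediction latency in seconds'
)

class PredictRequest(BaseModel):
    # Assumed request schema: one sample as a flat list of feature values
    features: List[float]

@app.post("/predict")
async def predict(data: PredictRequest):
    REQUEST_COUNT.inc()

    with PREDICTION_LATENCY.time():
        # scikit-learn expects a 2D array, so wrap the single sample in a list
        result = model.predict([data.features])

    # Convert the NumPy array to a plain list so it can be serialized as JSON
    return {"prediction": result.tolist()}

@app.get("/metrics")
async def metrics():
    # Expose metrics in the Prometheus text exposition format
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )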

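What to monitor?

Latency: P50, P95, P99
Throughput: requests per second
Error rate: share of 4xx and 5xx responses
Resource usage: CPU, memory
Model metrics: prediction distribution, anomaly detection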
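
To make the latency and error-rate items concrete, here is a minimal offline sketch that computes them from raw samples with the standard library. The function names and sample data are hypothetical; in a running setup these figures would normally be derived by Prometheus from the histogram and counter exposed on /metrics rather than computed in the application.

```python
import statistics


def latency_percentiles(samples_seconds):
    """Return P50/P95/P99 for a list of latency samples in seconds."""
    if len(samples_seconds) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points P1..P99
    cuts = statistics.quantiles(samples_seconds, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(status_codes):
    """Share of responses with a 4xx or 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 400)
    return errors / len(status_codes)


samples = [i / 1000 for i in range(1, 101)]  # 1 ms .. 100 ms
print(latency_percentiles(samples))
print(error_rate([200, 200, 404, 500]))      # 0.5
```

The tail percentiles matter because a mean can look healthy while a small share of requests is very slow.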

CI/CD Pipeline

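# .github/workflows/deploy.yml
# Assumes REGISTRY is set in the environment and that the runner is already
# logged in to the registry and has kubectl credentials for the cluster.
name: Deploy ML Model

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push image
        run: |
          docker build -t $REGISTRY/ml-model:${{ github.sha }} .
          docker push $REGISTRY/ml-model:${{ github.sha }}

      - name: Deploy to staging
        run: |
          kubectl set image deployment/ml-model \
            ml-model=$REGISTRY/ml-model:${{ github.sha }} \
            --namespace=staging
          # Wait for the rollout to finish so the smoke tests hit the new Pods
          kubectl rollout status deployment/ml-model --namespace=staging

      - name: Run smoke tests
        run: |
          ./scripts/smoke-test.sh staging

      - name: Deploy to production
        if: success()
        run: |
          kubectl set image deployment/ml-model \
            ml-model=$REGISTRY/ml-model:${{ github.sha }} \
            --namespace=production
          kubectl rollout status deployment/ml-model --namespace=production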
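
The contents of scripts/smoke-test.sh are not shown here. As one possible shape for such a check, the Python sketch below polls the /ready endpoint until it answers 200 or the attempts run out. The function names, retry counts, and the example URL are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if GET url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts and non-2xx answers such as 503
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10, delay: float = 3.0) -> bool:
    """Call probe up to attempts times, sleeping delay seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller might use it as wait_until_ready(lambda: http_ready("http://ml-model-service.staging/ready")) and exit non-zero when it returns False, so that the production step is skipped. The hostname in that call is a placeholder for however the staging service is reachable from the runner.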

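Summary

The main lessons from deploying an ML model to K8s:

Keep containers slim: use a slim image and remove unnecessary dependencies
Size resources accurately: especially memory, since loading the model uses a lot
Health checks matter: but make the delay long enough
Monitoring is a must: latency, error rate, resource usage
Version management: track models with MLflow or a similar tool

K8s has a steep learning curve, but once you are familiar with it, scaling and rollback are both convenient.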
