Catie MCP

Monitoring and Observability

This guide covers how to monitor MCP Catie and how to set up the surrounding observability tooling: health checks, Prometheus metrics, structured logs, dashboards, and alerts.

Overview

MCP Catie provides built-in monitoring capabilities through:

  1. Health Check Endpoint: Basic service health information
  2. Prometheus Metrics: Detailed performance and operational metrics
  3. Structured Logging: Comprehensive logging for debugging and auditing
  4. Monitoring UI: Simple web interface for quick status checks

Health Check Endpoint

The health check endpoint provides a simple way to verify that the service is running.

Endpoint Details

  • URL: /health
  • Method: GET
  • Response: 200 OK when healthy
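
A probe against this endpoint can be sketched in a few lines of Python. This is a minimal illustration, not part of MCP Catie: the function name `is_healthy`, the default timeout, and the injectable `opener` parameter (included so the check can be exercised without a live server) are all assumptions. Only the `/health` path and the 200 status come from the endpoint details above.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, DNS failures, and 4xx/5xx
        # responses (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

Calling `is_healthy("http://mcp-catie:80")` from a cron job or a container health probe would then report whether the router is up; the host and port here are the ones used in the scrape configuration of this guide and may differ in your deployment.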

Prometheus Metrics

MCP Catie exposes Prometheus metrics that cover request volume, errors, response times, active sessions, and process resource usage.

Metrics Endpoint

  • URL: /metrics
  • Method: GET
  • Authentication: Protected by the same credentials as the monitoring UI
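
For a quick manual check outside Prometheus, the endpoint can be fetched with HTTP Basic authentication. The sketch below uses only the Python standard library; the helper names `build_metrics_request` and `fetch_metrics` are hypothetical, and it assumes the credentials are sent as standard Basic auth, as the monitoring UI section and the `basic_auth` scrape setting in this guide suggest.

```python
import base64
import urllib.request


def build_metrics_request(base_url, username, password):
    """Build a GET request for <base_url>/metrics with a Basic auth header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_metrics(base_url, username, password, timeout=5.0):
    """Return the raw Prometheus text exposition served at /metrics."""
    request = build_metrics_request(base_url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Splitting request construction from the network call keeps the authentication logic easy to inspect on its own.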

Available Metrics

Metric Name                       Type        Description
mcp_router_requests_total         Counter     Total number of requests processed
mcp_router_errors_total           Counter     Total number of request errors
mcp_router_requests_by_method     Counter     Number of requests broken down by method
mcp_router_requests_by_endpoint   Counter     Number of requests broken down by target endpoint
mcp_router_response_time_ms       Histogram   Response time in milliseconds by method
mcp_router_uptime_seconds         Gauge       Time since the router started in seconds
mcp_router_active_sessions        Gauge       Number of active sessions
mcp_router_memory_usage_bytes     Gauge       Memory usage in bytes
mcp_router_goroutines             Gauge       Number of active goroutines
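
To show how these metrics can be read programmatically, here is a small parser for the Prometheus text exposition format. It is a simplified sketch: it assumes samples carry no trailing timestamps, and the sample payload is made up for illustration (the real output may include labels and additional series).

```python
def parse_metrics(text):
    """Map each sample line ('name{labels} value') to its float value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. The key keeps
    the label set, so labelled series stay distinct.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        # Assumes no trailing timestamp field.
        key, _, value = line.rpartition(" ")
        samples[key.strip()] = float(value)
    return samples


# Illustrative payload with invented values, not real MCP Catie output.
SAMPLE = """\
# HELP mcp_router_requests_total Total number of requests processed
# TYPE mcp_router_requests_total counter
mcp_router_requests_total 1200
mcp_router_errors_total 30
mcp_router_active_sessions 4
"""

metrics = parse_metrics(SAMPLE)
error_ratio = metrics["mcp_router_errors_total"] / metrics["mcp_router_requests_total"]
```

With the invented numbers above, `error_ratio` is 30 / 1200, or 2.5% of requests.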

Prometheus Configuration

Add the following to your Prometheus configuration to scrape metrics from MCP Catie:

scrape_configs:
  - job_name: 'mcp-catie'
    scrape_interval: 15s
    basic_auth:
      username: 'admin'
      password: 'your_secure_password'
    static_configs:
      - targets: ['mcp-catie:80']

Grafana Dashboard

Dashboard Panels

The Grafana dashboard includes the following panels:

  • Request Overview: Total requests, errors, and success rate
  • Response Times: P50, P90, and P99 response times by method
  • Endpoint Usage: Requests by target endpoint
  • Resource Utilization: Memory usage and goroutine count
  • Session Management: Active sessions over time
  • Error Rates: Errors by type and endpoint

Structured Logging

Log Levels

  • debug: Detailed information for debugging purposes
  • info: General operational information
  • warn: Warning events that might require attention
  • error: Error events that might still allow the application to continue
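
The levels form an ordering from least to most severe, which makes it easy to filter logs by a minimum severity. The sketch below assumes each log line is a JSON object with `level` and `message` fields, matching the fields extracted in the Loki pipeline later in this guide; the exact log schema is not specified here, and the sample lines in the usage note are invented.

```python
import json

# Severity order taken from the level list above, least to most severe.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_by_level(lines, min_level="warn"):
    """Yield parsed JSON log records whose level is at or above min_level."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        level = str(record.get("level", "")).lower()
        # Unknown or missing levels rank below debug and are dropped.
        if LEVELS.get(level, -1) >= threshold:
            yield record
```

For example, passing the lines `{"level": "info", "message": "started"}` and `{"level": "error", "message": "upstream unreachable"}` with the default threshold yields only the second record.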

Integration with Log Management Systems

ELK Stack

Configure Filebeat to collect and forward logs to Elasticsearch:

filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json.keys_under_root: true
  json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

Loki

Configure Promtail to collect and forward logs to Loki:

scrape_configs:
  - job_name: mcp-catie
    static_configs:
    - targets:
        - localhost
      labels:
        job: mcp-catie
        __path__: /var/log/mcp-catie/*.log
    pipeline_stages:
    - json:
        expressions:
          level: level
          message: message
          method: method

Monitoring UI

MCP Catie includes a simple web UI for monitoring basic statistics and router configuration.

UI Access

  • URL: /stats
  • Authentication: Basic authentication using credentials from configuration

UI Features

  • Current request statistics
  • Active sessions
  • Routing configuration
  • Recent errors
  • System resource usage

Alerting

Configure alerts to be notified of potential issues before they affect users.

Prometheus Alerting Rules

groups:
- name: mcp-catie-alerts
  rules:
  - alert: HighErrorRate
    expr: rate(mcp_router_errors_total[5m]) / rate(mcp_router_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate has been above 5% for the last 5 minutes"

  - alert: SlowResponseTime
    expr: histogram_quantile(0.95, sum(rate(mcp_router_response_time_ms_bucket[5m])) by (le)) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow response times detected"
      description: "95th percentile response time has been above 500ms for the last 5 minutes"

  - alert: HighMemoryUsage
    expr: mcp_router_memory_usage_bytes / 1024 / 1024 > 1024
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage has been above 1GB for the last 15 minutes"
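
The SlowResponseTime rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below reproduces that estimate so the threshold is easier to reason about. The bucket boundaries and counts are hypothetical; the actual bucket layout of `mcp_router_response_time_ms` is not documented here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Follows the linear interpolation Prometheus documents for
    histogram_quantile(); the buckets must include the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts per "le" bound, in milliseconds.
BUCKETS = [(100, 600), (250, 850), (500, 940), (1000, 990), (math.inf, 1000)]
p95 = histogram_quantile(0.95, BUCKETS)
```

With these invented counts the target rank is 950, which falls in the 500 to 1000 ms bucket, one fifth of the way through it, so the estimate is 600 ms and the rule's `> 500` condition would hold. The estimate is only as precise as the bucket boundaries allow.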

Integration with Alert Managers

Configure AlertManager to send notifications to your preferred channels:

receivers:
- name: 'team-slack'
  slack_configs:
  # A Slack webhook URL is also required: set api_url here
  # or slack_api_url in the global section.
  - channel: '#mcp-alerts'
    send_resolved: true
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-slack'
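
To verify that routing and the Slack receiver work before a real incident, a test alert can be posted to Alertmanager's v2 HTTP API. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the Alertmanager address in the usage note is the conventional default rather than something taken from this guide. The alert sets the `alertname` label and `description` annotation because the Slack template above reads exactly those fields.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname, description, severity="warning"):
    """Build the JSON body for POST /api/v2/alerts: a list with one alert."""
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"description": description},
        # RFC 3339 timestamp in UTC marking when the alert started firing.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_test_alert(alertmanager_url, alerts, timeout=5.0):
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Assuming Alertmanager listens on `http://localhost:9093`, calling `send_test_alert("http://localhost:9093", build_test_alert("RoutingTest", "Test alert to verify Slack routing"))` should produce a message in the configured channel after the `group_wait` interval.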