Monitoring and Observability

This guide covers how to monitor MCP Catie and set up observability tooling around it: health checks, Prometheus metrics, structured logs, the built-in monitoring UI, and alerting.

Overview

MCP Catie provides built-in monitoring capabilities through:

Health Check Endpoint: Basic service health information
Prometheus Metrics: Detailed performance and operational metrics
Structured Logging: Comprehensive logging for debugging and auditing
Monitoring UI: Simple web interface for quick status checks

Health Check Endpoint

The health check endpoint lets you verify whether the service is up and responding.

Endpoint Details

URL: /health
Method: GET
Response: 200 OK when healthy
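
A minimal probe sketch in Python, useful for scripts or container health checks. It assumes only what is stated above (a GET to /health returns 200 OK when healthy). The function name `check_health` and the base URL you pass in are illustrative, not part of MCP Catie.

```python
import urllib.error
import urllib.request


def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET <base_url>/health answers with 200 OK."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False
```

For example, `check_health("http://mcp-catie:80")` would probe the host and port used in the Prometheus example in this guide. Adjust the URL to match your deployment.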

Prometheus Metrics

MCP Catie exposes Prometheus metrics covering request volume, errors, response times, active sessions, and resource usage.

Metrics Endpoint

URL: /metrics
Method: GET
Authentication: Protected by the same credentials as the monitoring UI
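
To inspect the metrics by hand, the endpoint can be fetched with HTTP Basic authentication. The sketch below assumes the credentials are sent as a standard Basic `Authorization` header, which matches the `basic_auth` block in the Prometheus configuration in this guide. The helper names `basic_auth_header` and `fetch_metrics` are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def fetch_metrics(base_url: str, username: str, password: str,
                  timeout: float = 5.0) -> str:
    """Return the raw text served at /metrics; raises on HTTP or network errors."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/metrics",
        headers={"Authorization": basic_auth_header(username, password)},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_metrics("http://mcp-catie:80", "admin", "your_secure_password")` would return the exposition text that Prometheus scrapes, assuming those placeholder credentials are replaced with your own.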

Available Metrics

Metric Name	Type	Description
`mcp_router_requests_total`	Counter	Total number of requests processed
`mcp_router_errors_total`	Counter	Total number of request errors
`mcp_router_requests_by_method`	Counter	Number of requests broken down by method
`mcp_router_requests_by_endpoint`	Counter	Number of requests broken down by target endpoint
`mcp_router_response_time_ms`	Histogram	Response time in milliseconds by method
`mcp_router_uptime_seconds`	Gauge	Time since the router started in seconds
`mcp_router_active_sessions`	Gauge	Number of active sessions
`mcp_router_memory_usage_bytes`	Gauge	Memory usage in bytes
`mcp_router_goroutines`	Gauge	Number of active goroutines
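
The metrics above are served in the Prometheus text exposition format. As a rough illustration of how to read them outside Prometheus, the sketch below parses that text and derives a lifetime error ratio from the two counters. The sample values are made up, the parser assumes one sample per line with no trailing timestamp, and `parse_samples` and `error_ratio` are hypothetical helper names.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Map each series (metric name plus any {labels}) to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def error_ratio(samples: dict[str, float]) -> float:
    """Errors divided by requests since the router started (0.0 if no requests)."""
    requests = samples.get("mcp_router_requests_total", 0.0)
    errors = samples.get("mcp_router_errors_total", 0.0)
    return errors / requests if requests else 0.0


# Illustrative scrape output; real values will differ.
SAMPLE = """\
# TYPE mcp_router_requests_total counter
mcp_router_requests_total 120
# TYPE mcp_router_errors_total counter
mcp_router_errors_total 6
# TYPE mcp_router_active_sessions gauge
mcp_router_active_sessions 3
"""
```

Because counters only ever grow, this ratio covers the whole uptime of the process. For a ratio over a recent window, compare two scrapes or use `rate()` in Prometheus.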

Prometheus Configuration

Add the following to your Prometheus configuration to scrape metrics from MCP Catie:

scrape_configs:
  - job_name: 'mcp-catie'
    scrape_interval: 15s
    basic_auth:
      username: 'admin'
      password: 'your_secure_password'
    static_configs:
      - targets: ['mcp-catie:80']

Grafana Dashboard

Dashboard Panels

The Grafana dashboard includes the following panels:

Request Overview: Total requests, errors, and success rate
Response Times: P50, P90, and P99 response times by method
Endpoint Usage: Requests by target endpoint
Resource Utilization: Memory usage and goroutine count
Session Management: Active sessions over time
Error Rates: Errors by type and endpoint
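
The response-time percentiles on the dashboard (P50, P90, P99) are normally estimated from the cumulative buckets of the `mcp_router_response_time_ms` histogram. The sketch below shows the usual estimation approach (linear interpolation inside the bucket that contains the target rank), similar in spirit to PromQL's `histogram_quantile`. The bucket boundaries and counts are invented for illustration and are not MCP Catie's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


# Illustrative cumulative counts for response times in milliseconds.
EXAMPLE_BUCKETS = [(50, 10), (100, 60), (250, 90), (500, 98), (math.inf, 100)]
```

With these example buckets, the median falls inside the 50 to 100 ms bucket and interpolates to 90 ms, while the 99th percentile lands in the overflow bucket and is reported as 500 ms. The second case shows why the top finite bucket boundary should sit above the latencies you care about.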

Structured Logging

Log Levels

debug: Detailed information for debugging purposes
info: General operational information
warn: Warning events that might require attention
error: Error events that might still allow the application to continue
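
When triaging, it is often enough to keep only entries at or above a given level. The sketch below assumes each log line is a JSON object with a `level` field, consistent with the fields (`level`, `message`, `method`) that the Loki pipeline in this guide extracts. The sample lines and the `filter_logs` helper are hypothetical.

```python
import json

# Severity order of the log levels listed above.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level="warn"):
    """Yield parsed log entries whose level is at or above min_level."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        level = str(entry.get("level", "")).lower()
        # Unknown or missing levels rank below debug and are dropped.
        if LEVELS.get(level, -1) >= threshold:
            yield entry


# Illustrative log lines; real entries may carry more fields.
EXAMPLE_LINES = [
    '{"level": "info", "message": "router started"}',
    '{"level": "error", "message": "upstream timeout", "method": "tools/call"}',
    "plain text line",
    '{"level": "warn", "message": "slow response"}',
]
```

Running `list(filter_logs(EXAMPLE_LINES))` keeps the `error` and `warn` entries and drops the `info` entry and the non-JSON line.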

Integration with Log Management Systems

ELK Stack

Configure Filebeat to collect and forward logs to Elasticsearch:

filebeat.inputs:
- type: container
  paths:
    - '/var/lib/docker/containers/*/*.log'
  json.keys_under_root: true
  json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

Loki

Configure Promtail to collect and forward logs to Loki:

scrape_configs:
  - job_name: mcp-catie
    static_configs:
    - targets:
        - localhost
      labels:
        job: mcp-catie
        __path__: /var/log/mcp-catie/*.log
    pipeline_stages:
    - json:
        expressions:
          level: level
          message: message
          method: method

Monitoring UI

MCP Catie includes a simple web UI for monitoring basic statistics and router configuration.

UI Access

URL: /stats
Authentication: Basic authentication using credentials from configuration

UI Features

Current request statistics
Active sessions
Routing configuration
Recent errors
System resource usage

Alerting

Configure alerts to be notified of potential issues before they affect users.

Prometheus Alerting Rules

groups:
- name: mcp-catie-alerts
  rules:
  - alert: HighErrorRate
    expr: rate(mcp_router_errors_total[5m]) / rate(mcp_router_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High error rate detected"
      description: "Error rate is above 5% for the last 5 minutes"

  - alert: SlowResponseTime
    expr: histogram_quantile(0.95, sum(rate(mcp_router_response_time_ms_bucket[5m])) by (le)) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Slow response times detected"
      description: "95th percentile response time is above 500ms for the last 5 minutes"

  - alert: HighMemoryUsage
    expr: mcp_router_memory_usage_bytes / 1024 / 1024 > 1024
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
      description: "Memory usage is above 1 GiB for the last 15 minutes"
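
To make the thresholds in these rules concrete, the sketch below reproduces the arithmetic behind `HighErrorRate` and `HighMemoryUsage` on made-up numbers. It is a simplification: PromQL's `rate()` also extrapolates to the window edges, which is ignored here, and the per-second division cancels out when two rates over the same window are divided. All function names are hypothetical.

```python
def increase(earlier: float, later: float) -> float:
    """Counter increase between two scrapes, treating a drop as a restart."""
    return later if later < earlier else later - earlier


def windowed_error_ratio(req_start: float, req_end: float,
                         err_start: float, err_end: float) -> float:
    """Share of requests that failed between two scrapes (0.0 if no requests)."""
    requests = increase(req_start, req_end)
    errors = increase(err_start, err_end)
    return errors / requests if requests else 0.0


def memory_mib(usage_bytes: float) -> float:
    """Convert bytes to MiB, as the HighMemoryUsage expression does."""
    return usage_bytes / 1024 / 1024


# Illustrative counter values at the start and end of a 5-minute window.
ratio = windowed_error_ratio(1000, 1400, 20, 48)   # 28 errors in 400 requests
error_rule_matches = ratio > 0.05

# Illustrative gauge reading of 1.5 GiB.
memory_rule_matches = memory_mib(1610612736) > 1024
```

Both example conditions evaluate to true. In Prometheus the alert only fires once the condition has held for the rule's `for` duration.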

Integration with Alert Managers

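Configure Alertmanager to route notifications to your preferred channels. The Slack receiver also needs a webhook URL, set either as `api_url` on the receiver or as the global `slack_api_url`: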

receivers:
- name: 'team-slack'
  slack_configs:
  - channel: '#mcp-alerts'
    send_resolved: true
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ .CommonAnnotations.description }}'

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-slack'
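
One way to check the routing end to end is to push a synthetic alert directly to Alertmanager and confirm it arrives in the channel. The sketch below uses Alertmanager's v2 HTTP API (`POST /api/v2/alerts`). The Alertmanager address you pass in (for example `http://alertmanager:9093`), the annotation text, and the helper names are assumptions for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname: str = "HighErrorRate", severity: str = "warning") -> list:
    """Build a one-element alert list in the shape the v2 API accepts."""
    return [{
        "labels": {"alertname": alertname, "severity": severity},
        "annotations": {"description": "Test alert for verifying notification routing"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_test_alert(alertmanager_url: str, alerts: list, timeout: float = 5.0) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The `alertname` label and `description` annotation are the fields the Slack `title` and `text` templates above read, so the test message should render the same way a real alert would.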
