Monitoring and Observability
This guide explains how to monitor MCP Catie and how to connect it to observability tools so that performance and reliability problems are caught early.
Overview
MCP Catie provides built-in monitoring capabilities through:
- Health Check Endpoint: Basic service health information
- Prometheus Metrics: Detailed performance and operational metrics
- Structured Logging: Comprehensive logging for debugging and auditing
- Monitoring UI: Simple web interface for quick status checks
Health Check Endpoint
The health check endpoint provides a simple way to verify if the service is running properly.
Endpoint Details
- URL: /health
- Method: GET
- Response: 200 OK when healthy
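The check described above can be scripted as a liveness probe. The following is a minimal sketch using only the Python standard library; the function name `is_healthy`, the `opener` parameter, and the host `mcp-catie:80` in the usage note are illustrative assumptions, not part of MCP Catie itself.

```python
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0, opener=urllib.request.urlopen) -> bool:
    """Return True if GET <base_url>/health answers with 200 OK.

    `opener` defaults to urllib's urlopen and exists only so the probe
    can be exercised without a running service (hypothetical helper).
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError (non-2xx), timeouts and refused connections
        # are all OSError subclasses; treat every one of them as unhealthy.
        return False
```

For example, `is_healthy("http://mcp-catie:80")` could be called from a deployment script or a cron job, assuming the service is reachable under that host name.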
Prometheus Metrics
MCP Catie exposes Prometheus metrics covering request volume, errors, response times, sessions, and resource usage.
Metrics Endpoint
- URL: /metrics
- Method: GET
- Authentication: Protected by the same credentials as the monitoring UI
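To inspect the metrics outside of Prometheus, the endpoint can be fetched directly. The sketch below assumes the credentials are sent as HTTP Basic authentication, as they are for the monitoring UI; the helper names `build_metrics_request` and `fetch_metrics` are hypothetical.

```python
import base64
import urllib.request


def build_metrics_request(base_url: str, username: str, password: str) -> urllib.request.Request:
    """Build a GET request for <base_url>/metrics with a Basic auth header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(base_url.rstrip("/") + "/metrics", method="GET")
    req.add_header("Authorization", f"Basic {token}")
    return req


def fetch_metrics(base_url: str, username: str, password: str, timeout: float = 5.0) -> str:
    """Return the raw Prometheus text exposition as a string."""
    req = build_metrics_request(base_url, username, password)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Keeping request construction separate from the network call makes the authentication logic easy to check without a live service.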
Available Metrics
Metric Name | Type | Description |
---|---|---|
mcp_router_requests_total | Counter | Total number of requests processed |
mcp_router_errors_total | Counter | Total number of request errors |
mcp_router_requests_by_method | Counter | Number of requests broken down by method |
mcp_router_requests_by_endpoint | Counter | Number of requests broken down by target endpoint |
mcp_router_response_time_ms | Histogram | Response time in milliseconds by method |
mcp_router_uptime_seconds | Gauge | Time since the router started in seconds |
mcp_router_active_sessions | Gauge | Number of active sessions |
mcp_router_memory_usage_bytes | Gauge | Memory usage in bytes |
mcp_router_goroutines | Gauge | Number of active goroutines |
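As an illustration of how these metrics can be consumed, the sketch below parses the Prometheus text exposition format and derives an overall error ratio from the two counters listed above. It is a simplified parser rather than a full implementation of the format, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$")


def sum_by_metric(exposition: str) -> dict:
    """Sum sample values per metric name, ignoring labels.

    Summing across labels is only meaningful for counters and gauges.
    """
    totals = defaultdict(float)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, _labels, value = match.groups()
        totals[name] += float(value)
    return dict(totals)


def lifetime_error_ratio(totals: dict) -> float:
    """Errors divided by requests since the router started."""
    requests = totals.get("mcp_router_requests_total", 0.0)
    if requests == 0:
        return 0.0
    return totals.get("mcp_router_errors_total", 0.0) / requests
```

Because both counters only ever increase, this ratio describes the whole lifetime of the process; for recent behavior, compare two scrapes or use `rate()` in Prometheus.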
Prometheus Configuration
Add the following to your Prometheus configuration to scrape metrics from MCP Catie:
scrape_configs:
  - job_name: 'mcp-catie'
    scrape_interval: 15s
    basic_auth:
      username: 'admin'
      password: 'your_secure_password'
    static_configs:
      - targets: ['mcp-catie:80']
Dashboard Panels
The dashboard includes the following panels:
- Request Overview: Total requests, errors, and success rate
- Response Times: P50, P90, and P99 response times by method
- Endpoint Usage: Requests by target endpoint
- Resource Utilization: Memory usage and goroutine count
- Session Management: Active sessions over time
- Error Rates: Errors by type and endpoint
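The percentile panels are derived from the cumulative buckets of the `mcp_router_response_time_ms` histogram. The sketch below shows the linear interpolation that Prometheus's `histogram_quantile` performs within a bucket; the bucket boundaries and counts in the usage note are made up for illustration and are not MCP Catie's actual bucket layout.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have upper bound +Inf. The lower bound of the
    first bucket is assumed to be 0, which suits non-negative latencies.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies in the overflow bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0) if i == 0 else buckets[i - 1]
    in_bucket = count - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket
```

With the hypothetical buckets `[(50, 10), (100, 60), (250, 90), (500, 99), (math.inf, 100)]`, the estimated P50 is 90 ms, P90 is 250 ms, and P99 is 500 ms. The estimate can never be more precise than the bucket boundaries allow.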
Log Levels
- debug: Detailed information for debugging purposes
- info: General operational information
- warn: Warning events that might require attention
- error: Error events that might still allow the application to continue
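Because the logs are structured, they can be filtered by level with a few lines of code. The sketch below assumes one JSON object per line with `level`, `message`, and `method` fields (the fields extracted in the Loki pipeline later in this guide); the sample values in the usage note are invented.

```python
import json

# Severity order of the four levels listed above.
LEVELS = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def filter_logs(lines, min_level: str = "warn"):
    """Yield parsed log entries whose level is at least `min_level`."""
    threshold = LEVELS[min_level]
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        level = str(entry.get("level", "")).lower()
        # Unknown or missing levels rank below debug and are dropped.
        if LEVELS.get(level, -1) >= threshold:
            yield entry
```

For instance, passing an open log file to `filter_logs(f, "error")` would yield only the error entries, each as a dictionary.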
Integration with Log Management Systems
ELK Stack
Configure Filebeat to collect and forward logs to Elasticsearch:
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    json.keys_under_root: true
    json.add_error_key: true
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
Loki
Configure Promtail to collect and forward logs to Loki:
scrape_configs:
  - job_name: mcp-catie
    static_configs:
      - targets:
          - localhost
        labels:
          job: mcp-catie
          __path__: /var/log/mcp-catie/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            message: message
            method: method
Monitoring UI
MCP Catie includes a simple web UI for monitoring basic statistics and router configuration.
UI Access
- URL: /stats
- Authentication: Basic authentication using credentials from configuration
UI Features
- Current request statistics
- Active sessions
- Routing configuration
- Recent errors
- System resource usage
Alerting
Configure alerts to be notified of potential issues before they affect users.
Prometheus Alerting Rules
groups:
  - name: mcp-catie-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(mcp_router_errors_total[5m]) / rate(mcp_router_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% for the last 5 minutes"
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum(rate(mcp_router_response_time_ms_bucket[5m])) by (le)) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times detected"
          description: "95th percentile response time is above 500ms for the last 5 minutes"
      - alert: HighMemoryUsage
        expr: mcp_router_memory_usage_bytes / 1024 / 1024 > 1024
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 1GB for the last 15 minutes"
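To make the thresholds in the HighErrorRate and HighMemoryUsage rules concrete, the sketch below reproduces their arithmetic on plain numbers. It is a simplification: it ignores counter resets and the extrapolation that `rate()` applies, and it does not model the `for:` clause, which requires the condition to hold for the whole period before the alert fires. The function names are hypothetical.

```python
def windowed_error_ratio(errors_start: float, errors_end: float,
                         requests_start: float, requests_end: float) -> float:
    """Approximate rate(errors[w]) / rate(requests[w]) from counter values
    sampled at the start and end of the same window."""
    delta_requests = requests_end - requests_start
    if delta_requests <= 0:
        return float("nan")  # no traffic in the window: the ratio is undefined
    return (errors_end - errors_start) / delta_requests


def high_error_rate(ratio: float, threshold: float = 0.05) -> bool:
    """HighErrorRate condition; a NaN ratio compares as False, so no alert."""
    return ratio > threshold


def high_memory_usage(memory_bytes: float, limit_mib: float = 1024) -> bool:
    """HighMemoryUsage condition: bytes converted to MiB, compared with 1024 MiB."""
    return memory_bytes / 1024 / 1024 > limit_mib
```

For example, 30 new errors across 400 new requests in a window give a ratio of 0.075, which exceeds the 5% threshold, while a memory reading of exactly 1 GiB (1073741824 bytes) does not exceed the memory limit.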
Integration with Alert Managers
Configure Alertmanager to send notifications to your preferred channels:
receivers:
  - name: 'team-slack'
    slack_configs:
      - channel: '#mcp-alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-slack'