MCP Sampling Explained: Best Practices and Examples

Matthew Lenhard

Matthew Lenhard is the creator of MCP Catie. Before Catie, he founded ContainIQ, an observability platform for Kubernetes, where he used eBPF to intercept network traffic.

Introduction: What is MCP Sampling?

MCP Sampling is undoubtedly one of the more complex topics in MCP. At its core, it allows MCP servers to request completions from the client. This enables you to offload LLM calls from the server to the client.

Say, for example, you have a HackerNews MCP Server and in one of your tool calls, you want to summarize the top comments. You could have the MCP client perform this operation, so that your server doesn't need to integrate with a third-party LLM, ultimately reducing complexity and cost overhead.
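As a rough sketch of what that could look like, assuming a server built with the TypeScript MCP SDK and a hypothetical fetchTopComments helper:

// A sketch of a "summarize_comments" tool's core logic. `server` is assumed to be an
// MCP Server from the TypeScript SDK, and `fetchTopComments` is a hypothetical helper
// that returns the top comments for a story as strings.
async function summarizeComments(storyId: string): Promise<string> {
  const comments = await fetchTopComments(storyId);

  // The server never calls an LLM API itself; it asks the connected client to
  // produce the completion via a sampling request.
  const result = await server.createMessage({
    messages: [
      {
        role: "user",
        content: { type: "text", text: `Summarize these comments:\n\n${comments.join("\n\n")}` }
      }
    ],
    maxTokens: 300
  });

  return result.content.type === "text" ? result.content.text : "";
}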

Sampling Support

Support for sampling is still limited across clients, and this should be taken into account when implementing the feature in an MCP server.

Client Sampling Support
5ire
Apify MCP Tester
BeeAI Framework
Claude Code
Claude Desktop App
Cline
Continue
Copilot-MCP
Cursor
Daydreams Agents
Emacs Mcp
fast-agent
Genkit
GenAIScript
Goose
LibreChat
mcp-agent ⚠️
Microsoft Copilot Studio
OpenSumi
oterm
Roo Code
Sourcegraph Cody
SpinAI
Superinterface
TheiaAI/TheiaIDE
VS Code GitHub Copilot
Windsurf Editor
Witsy
Zed

Implementing Sampling in MCP

The sampling process follows a specific flow:

  1. Server initiates request: The server sends a sampling/createMessage request to the client
  2. Client processes request: The client forwards the request to the LLM with appropriate context
  3. LLM generates completion: The model creates a response based on the provided messages
  4. Client returns completion: The generated text is returned to the server
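In JSON-RPC terms, the exchange looks roughly like this (shown here as TypeScript object literals; the id, prompt text, and model name are made up for illustration):

// Server -> client request:
const samplingRequest = {
  jsonrpc: "2.0",
  id: 42,
  method: "sampling/createMessage",
  params: {
    messages: [
      { role: "user", content: { type: "text", text: "Summarize the discussion above." } }
    ],
    maxTokens: 200
  }
};

// Client -> server response, after the LLM generates a completion:
const samplingResponse = {
  jsonrpc: "2.0",
  id: 42,
  result: {
    model: "claude-3-5-sonnet-20241022",
    role: "assistant",
    content: { type: "text", text: "The discussion centers on..." },
    stopReason: "endTurn"
  }
};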

Request Format

{
  messages: [
    {
      role: "user" | "assistant",
      content: {
        type: "text",
        text?: string
      }
    }
  ],
  modelPreferences?: {
    hints?: [{
      name?: string
    }],
    costPriority?: number,
    speedPriority?: number,
    intelligencePriority?: number
  },
  systemPrompt?: string,
  includeContext?: "none" | "thisServer" | "allServers",
  temperature?: number,
  maxTokens: number,
  stopSequences?: string[],
  metadata?: Record<string, unknown>
}

Key Parameters

  • messages: The conversation history providing context for the completion
  • modelPreferences: Optional hints for model selection based on priorities
  • systemPrompt: Instructions that guide the LLM's behavior
  • includeContext: Controls whether to include context from servers
  • temperature: Controls randomness in the generated text (0-1)
  • maxTokens: Maximum length of the generated completion
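Putting these parameters together, a request that nudges the client toward a fast, inexpensive model might look like the following. The hint name, priorities, and prompt are illustrative; per the spec, hints are advisory and clients map them to whatever models they actually have available:

const params = {
  messages: [
    { role: "user", content: { type: "text", text: "Classify this ticket: 'App crashes on login.'" } }
  ],
  modelPreferences: {
    hints: [{ name: "claude-3-haiku" }], // treated as a substring hint, not a strict requirement
    costPriority: 0.8,         // favor cheaper models
    speedPriority: 0.7,        // favor lower latency
    intelligencePriority: 0.3  // raw capability matters less for this task
  },
  systemPrompt: "You are a support-ticket triage assistant. Reply with a single category.",
  includeContext: "none",
  temperature: 0.2,
  maxTokens: 50
};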

Response Format

{
  model: string,
  stopReason?: "endTurn" | "stopSequence" | "maxTokens",
  role: "user" | "assistant",
  content: {
    type: "text",
    text?: string
  }
}

Example Implementation

Here's a practical example of implementing sampling in a TypeScript server:

// Handle a request to analyze a file
server.setRequestHandler(AnalyzeFileRequestSchema, async (request) => {
  // Read the file content
  const fileContent = await readFile(request.params.filePath, "utf-8");

  // Ask the client to run the analysis via a sampling request
  // (createMessage sends sampling/createMessage in the TypeScript SDK)
  const analysisResult = await server.createMessage({
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text: `Please analyze this file content:\n\n${fileContent}`
        }
      }
    ],
    systemPrompt: "You are a helpful code analysis assistant.",
    includeContext: "thisServer",
    maxTokens: 500
  });

  // Return the analysis to the client
  return {
    analysis: analysisResult.content.text,
    filePath: request.params.filePath
  };
});

Human-in-the-Loop Controls

Sampling is designed with human oversight in mind:

For Prompts

  • Clients should show users the proposed prompt
  • Users should be able to modify or reject prompts
  • System prompts can be filtered or modified
  • Context inclusion is controlled by the client

For Completions

  • Clients should show users the completion
  • Users should be able to modify or reject completions
  • Clients can filter or modify completions
  • Users control which model is used
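On the client side, honoring these controls might look roughly like the following sketch. It assumes the TypeScript SDK's CreateMessageRequestSchema, a client that has declared the sampling capability, and two hypothetical helpers: askUserForApproval (your approval UI) and callLLM (your wrapper around whatever model API the client uses):

import { CreateMessageRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// `client` is an already-constructed SDK Client that declared the sampling capability.
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  // 1. Show the proposed prompt to the user and let them edit or reject it.
  const approved = await askUserForApproval(request.params.messages);
  if (!approved) {
    throw new Error("Sampling request rejected by user");
  }

  // 2. Forward the (possibly edited) prompt to whichever model the user selected.
  const completion = await callLLM(approved.messages, {
    systemPrompt: request.params.systemPrompt,
    maxTokens: request.params.maxTokens
  });

  // 3. Return the completion; a real client would also let the user review it first.
  return {
    model: completion.model,
    role: "assistant",
    content: { type: "text", text: completion.text },
    stopReason: "endTurn"
  };
});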

Best Practices

When implementing sampling in MCP:

Prompt Engineering

Clear, Well-Structured Prompts

Always provide clear, well-structured prompts with consistent formatting and explicit instructions at the beginning. Breaking complex tasks into smaller steps and providing examples for expected output formats significantly improves model performance. Using markdown or other formatting enhances readability and helps the model understand the structure of your request.

Effective System Prompts

Keep system prompts concise but comprehensive, clearly defining the model's role and constraints. Include necessary context while avoiding redundancy that could waste tokens. Testing different system prompts can help optimize performance for specific use cases. Consider versioning system prompts to maintain consistency across your application.

Content Management

Handle Content Types Appropriately

Validate text content for proper encoding and optimize image resolution for model capabilities before sending. Consider the content type limitations of different models and implement fallbacks for unsupported content types. Using appropriate MIME types for all content ensures proper handling throughout the pipeline.

Context Management

Request minimal necessary context to reduce token usage while ensuring the model has sufficient information. Structure context clearly with headers and sections, prioritizing recent and relevant information. Implementing context windowing for long conversations and considering summarization techniques helps maintain context within token limits.
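One simple form of context windowing is to keep only the most recent messages that fit within a token budget. The estimateTokens heuristic below is a rough character-count stand-in, not a real tokenizer:

type SamplingMessage = {
  role: "user" | "assistant";
  content: { type: "text"; text?: string };
};

// Very rough token estimate: ~4 characters per token for English text.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Keep the most recent messages whose combined size stays under the budget.
function windowMessages(messages: SamplingMessage[], tokenBudget: number): SamplingMessage[] {
  const kept: SamplingMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content.text ?? "");
    if (used + cost > tokenBudget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}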

Technical Implementation

Set Reasonable Token Limits

Balance between completeness and cost when setting token limits, adjusting based on expected response length. Consider model-specific token limitations and implement pagination for long responses when necessary. Monitoring token usage helps optimize costs over time as you learn your application's patterns.

Validate Responses

Check for expected output formats and implement schema validation for structured outputs. Develop mechanisms to detect and handle hallucinations or incorrect information, possibly implementing confidence scoring. Having fallback mechanisms for low-quality responses ensures your application remains robust even when sampling results are suboptimal.
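For structured outputs, one option is to parse the completion and validate it against a schema before trusting it. The sketch below uses zod; the TicketTriage shape is invented for illustration:

import { z } from "zod";

// Hypothetical expected shape for a structured triage response.
const TicketTriage = z.object({
  category: z.enum(["bug", "feature", "question"]),
  priority: z.number().int().min(1).max(5)
});

function parseTriage(completionText: string) {
  try {
    const parsed = TicketTriage.safeParse(JSON.parse(completionText));
    return parsed.success ? parsed.data : null; // null signals "fall back or retry"
  } catch {
    return null; // the model didn't return valid JSON at all
  }
}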

Error Handling

Implement comprehensive error catching with meaningful error messages to users. Handle timeouts gracefully and implement retry logic with exponential backoff for transient issues. Logging errors systematically facilitates debugging and continuous improvement of your sampling implementation.
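A retry wrapper with exponential backoff might look like this generic sketch; the attempt count and delays are placeholders you would tune for your client and models:

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 1s, 2s, 4s, ... between attempts, plus a little jitter.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: wrap the sampling call itself.
// const result = await withRetries(() => server.createMessage(params));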

Performance and Scalability

Rate Limiting

Implement client-side rate limiting using token bucket algorithms for smooth request distribution. Monitor usage patterns to adjust limits appropriately and implement queuing for high-demand periods. Consider establishing priority levels for different request types to ensure critical operations aren't blocked during peak usage.
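A minimal token bucket might look like the following; the capacity and refill rate are arbitrary, and a production version would queue requests rather than simply rejecting them:

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Returns true if the request may proceed, false if it should be queued or rejected.
  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// e.g. allow bursts of 10 sampling requests, refilling 2 per second.
const samplingLimiter = new TokenBucket(10, 2);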

Caching Strategies

Cache common responses to reduce API calls, implementing LRU (Least Recently Used) caching with appropriate TTL (Time To Live) settings. Consider semantic caching for similar queries to further reduce redundant model calls. Remember to invalidate cache when context changes significantly to maintain response accuracy.
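A small LRU cache with TTL can be built on a Map, which preserves insertion order. The sizes and TTL below are arbitrary, and a real implementation would likely key entries on a hash of the prompt plus relevant parameters:

type CacheEntry = { value: string; expiresAt: number };

class LruTtlCache {
  private entries = new Map<string, CacheEntry>();

  constructor(private maxSize = 100, private ttlMs = 5 * 60 * 1000) {}

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark this entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: string): void {
    if (this.entries.size >= this.maxSize) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}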

Comprehensive Testing

Test with various model parameters to understand performance characteristics across different settings. Create test suites for different use cases and implement integration tests with actual models. Testing error handling and edge cases thoroughly, along with performing load testing for production scenarios, ensures reliability at scale.

Cost and Resource Management

Monitor Sampling Costs

Implement usage tracking and reporting with alerts for unusual usage patterns. Create dashboards for cost visualization to identify optimization opportunities. Consider implementing user quotas to prevent unexpected cost spikes. Continuously optimize prompts to reduce token usage without sacrificing quality.

Resource Optimization

Balance between model capabilities and costs by selecting appropriate models for different tasks. Implement tiered access based on requirements and consider batching requests when appropriate. Use streaming for long-running completions to improve user experience, and implement graceful degradation during high load to maintain service availability.

By following these expanded best practices, developers can create more robust, efficient, and cost-effective implementations of MCP sampling in their applications.

Context Management

Best practices for context:

  • Request minimal necessary context
  • Structure context clearly
  • Handle context size limits
  • Update context as needed
  • Clean up stale context

Security Considerations

When implementing sampling:

  • Validate all message content
  • Sanitize sensitive information
  • Implement appropriate rate limits
  • Monitor sampling usage
  • Encrypt data in transit
  • Handle user data privacy
  • Audit sampling requests
  • Control cost exposure
  • Implement timeouts (see the sketch after this list)
  • Handle model errors gracefully
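For the timeout item in particular, the simplest approach is to race the sampling call against a timer; the 30-second value below is arbitrary:

// A sketch of a timeout guard around a sampling call; a production version would
// also cancel the underlying request and clear the timer.
function withTimeout<T>(promise: Promise<T>, ms = 30_000): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Sampling request timed out after ${ms}ms`)), ms)
  );
  // Whichever settles first wins.
  return Promise.race([promise, timeout]);
}

// Usage (server side):
// const result = await withTimeout(server.createMessage(params), 30_000);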

Limitations

Be aware of these limitations when implementing MCP sampling:

  • Client Capability Dependency: Sampling functionality depends entirely on the client's implementation. Different clients may support different features, model access, or sampling parameters.

  • User Control: End users ultimately control whether sampling requests are processed, which models are used, and how completions are handled. Servers cannot bypass user consent or preferences.

  • Context Size Constraints:
      ◦ Models have fixed context windows (typically 8K-128K tokens)
      ◦ Large contexts may slow down processing or increase costs
      ◦ Context management becomes critical for complex applications
      ◦ Some information may need to be summarized or excluded

  • Rate Limiting Considerations:
      ◦ Clients may impose rate limits on sampling requests
      ◦ Models themselves often have API-level rate limits
      ◦ High-frequency sampling may be throttled
      ◦ Implement backoff strategies for rate limit handling

  • Cost Implications:
      ◦ Sampling incurs token-based costs for model usage
      ◦ Larger contexts and completions increase costs
      ◦ Different models have different pricing structures
      ◦ Consider implementing budgeting mechanisms

  • Model Availability Variations:
      ◦ Not all models are available on all clients
      ◦ Model availability may change over time
      ◦ Newer models may not be immediately accessible
      ◦ Some regions may have restricted model access

  • Variable Response Times:
      ◦ Response times vary based on model size, load, and complexity
      ◦ Larger contexts typically result in slower responses
      ◦ First-token latency can be significant
      ◦ Implement appropriate timeouts and loading states

  • Content Type Limitations:
      ◦ Some models may not support all content types (images, audio, etc.)
      ◦ Multimodal capabilities vary across models
      ◦ Format conversions may be necessary
      ◦ Consider fallback strategies for unsupported content

  • Determinism Challenges:
      ◦ Even at temperature=0, responses may not be perfectly deterministic
      ◦ Results may vary between model versions or implementations
      ◦ Critical systems should not rely on exact reproducibility
      ◦ Implement validation for mission-critical applications

Conclusion

Hopefully this was helpful in better understanding how sampling works in MCP. As I mentioned in the introduction, it's a powerful but complex feature to take advantage of, and I'm hopeful we'll see greater adoption from clients in the near future.