MCP Sampling Explained: Best Practices and Examples

Matthew Lenhard

Matthew Lenhard is the creator of MCP Catie. Before Catie, he founded ContainIQ, an observability platform for Kubernetes, where he used eBPF to intercept network traffic.

Introduction: What is MCP Sampling?

MCP Sampling is undoubtedly one of the more complex topics in MCP. At its core, it allows MCP servers to request completions from the client. This enables you to offload LLM calls from the server to the client.

Say, for example, you have a HackerNews MCP Server and in one of your tool calls, you want to summarize the top comments. You could have the MCP client perform this operation, so that your server doesn't need to integrate with a third-party LLM, ultimately reducing complexity and cost overhead.
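As a rough sketch of what that could look like, assuming a server built with the TypeScript MCP SDK and a hypothetical fetchTopComments helper:

// A sketch of a "summarize_comments" tool's core logic. `server` is assumed to be an
// MCP Server from the TypeScript SDK, and `fetchTopComments` is a hypothetical helper
// that returns the top comments for a story as strings.
async function summarizeComments(storyId: string): Promise<string> {
  const comments = await fetchTopComments(storyId);

  // The server never calls an LLM API itself; it asks the connected client to
  // produce the completion via a sampling request.
  const result = await server.createMessage({
    messages: [
      {
        role: "user",
        content: { type: "text", text: `Summarize these comments:\n\n${comments.join("\n\n")}` }
      }
    ],
    maxTokens: 300
  });

  return result.content.type === "text" ? result.content.text : "";
}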

Sampling Support

Support for sampling is still limited across clients, and this should be taken into account when implementing the feature in an MCP server.

Client Sampling Support
5ire
Apify MCP Tester
BeeAI Framework
Claude Code
Claude Desktop App
Cline
Continue
Copilot-MCP
Cursor
Daydreams Agents
Emacs Mcp
fast-agent
Genkit
GenAIScript
Goose
LibreChat
mcp-agent ⚠️
Microsoft Copilot Studio
OpenSumi
oterm
Roo Code
Sourcegraph Cody
SpinAI
Superinterface
TheiaAI/TheiaIDE
VS Code GitHub Copilot
Windsurf Editor
Witsy
Zed

Implementing Sampling in MCP

The sampling process follows a specific flow:

  1. Server initiates request: The server sends a sampling/createMessage request to the client
  2. Client processes request: The client forwards the request to the LLM with appropriate context
  3. LLM generates completion: The model creates a response based on the provided messages
  4. Client returns completion: The generated text is returned to the server
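In JSON-RPC terms, the exchange looks roughly like this (shown here as TypeScript object literals; the id, prompt text, and model name are made up for illustration):

// Server -> client request:
const samplingRequest = {
  jsonrpc: "2.0",
  id: 42,
  method: "sampling/createMessage",
  params: {
    messages: [
      { role: "user", content: { type: "text", text: "Summarize the discussion above." } }
    ],
    maxTokens: 200
  }
};

// Client -> server response, after the LLM generates a completion:
const samplingResponse = {
  jsonrpc: "2.0",
  id: 42,
  result: {
    model: "claude-3-5-sonnet-20241022",
    role: "assistant",
    content: { type: "text", text: "The discussion centers on..." },
    stopReason: "endTurn"
  }
};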

Request Format

{
  messages: [
    {
      role: "user" | "assistant",
      content: {
        type: "text",
        text?: string
      }
    }
  ],
  modelPreferences?: {
    hints?: [{
      name?: string
    }],
    costPriority?: number,
    speedPriority?: number,
    intelligencePriority?: number
  },
  systemPrompt?: string,
  includeContext?: "none" | "thisServer" | "allServers",
  temperature?: number,
  maxTokens: number,
  stopSequences?: string[],
  metadata?: Record<string, unknown>
}

Key Parameters

  • messages: The conversation history providing context for the completion
  • modelPreferences: Optional hints for model selection based on priorities
  • systemPrompt: Instructions that guide the LLM's behavior
  • includeContext: Controls whether to include context from servers
  • temperature: Controls randomness in the generated text (0-1)
  • maxTokens: Maximum length of the generated completion
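Putting these parameters together, a request that nudges the client toward a fast, inexpensive model might look like the following. The hint name, priorities, and prompt are illustrative; per the spec, hints are advisory and clients map them to whatever models they actually have available:

const params = {
  messages: [
    { role: "user", content: { type: "text", text: "Classify this ticket: 'App crashes on login.'" } }
  ],
  modelPreferences: {
    hints: [{ name: "claude-3-haiku" }], // treated as a substring hint, not a strict requirement
    costPriority: 0.8,         // favor cheaper models
    speedPriority: 0.7,        // favor lower latency
    intelligencePriority: 0.3  // raw capability matters less for this task
  },
  systemPrompt: "You are a support-ticket triage assistant. Reply with a single category.",
  includeContext: "none",
  temperature: 0.2,
  maxTokens: 50
};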

Response Format

{
  model: string,
  stopReason?: "endTurn" | "stopSequence" | "maxTokens",
  role: "user" | "assistant",
  content: {
    type: "text",
    text?: string
  }
}

Example Implementation

Here's a practical example of implementing sampling in a TypeScript server:

// Handle a request to analyze a file
server.setRequestHandler(AnalyzeFileRequestSchema, async (request) => {
  // Read the file content
  const fileContent = await readFile(request.params.filePath, "utf-8");

  // Ask the client to run the analysis via a sampling request
  // (createMessage sends sampling/createMessage in the TypeScript SDK)
  const analysisResult = await server.createMessage({
    messages: [
      {
        role: "user",
        content: {
          type: "text",
          text: `Please analyze this file content:\n\n${fileContent}`
        }
      }
    ],
    systemPrompt: "You are a helpful code analysis assistant.",
    includeContext: "thisServer",
    maxTokens: 500
  });

  // Return the analysis to the client
  return {
    analysis: analysisResult.content.text,
    filePath: request.params.filePath
  };
});

Human-in-the-Loop Controls

Sampling is designed with human oversight in mind:

For Prompts

  • Clients should show users the proposed prompt
  • Users should be able to modify or reject prompts
  • System prompts can be filtered or modified
  • Context inclusion is controlled by the client

For Completions

  • Clients should show users the completion
  • Users should be able to modify or reject completions
  • Clients can filter or modify completions
  • Users control which model is used
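On the client side, honoring these controls might look roughly like the following sketch. It assumes the TypeScript SDK's CreateMessageRequestSchema, a client that has declared the sampling capability, and two hypothetical helpers: askUserForApproval (your approval UI) and callLLM (your wrapper around whatever model API the client uses):

import { CreateMessageRequestSchema } from "@modelcontextprotocol/sdk/types.js";

// `client` is an already-constructed SDK Client that declared the sampling capability.
client.setRequestHandler(CreateMessageRequestSchema, async (request) => {
  // 1. Show the proposed prompt to the user and let them edit or reject it.
  const approved = await askUserForApproval(request.params.messages);
  if (!approved) {
    throw new Error("Sampling request rejected by user");
  }

  // 2. Forward the (possibly edited) prompt to whichever model the user selected.
  const completion = await callLLM(approved.messages, {
    systemPrompt: request.params.systemPrompt,
    maxTokens: request.params.maxTokens
  });

  // 3. Return the completion; a real client would also let the user review it first.
  return {
    model: completion.model,
    role: "assistant",
    content: { type: "text", text: completion.text },
    stopReason: "endTurn"
  };
});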

Best Practices

When implementing sampling in MCP:

Prompt Engineering

Clear, Well-Structured Prompts

Always provide clear, well-structured prompts with consistent formatting and explicit instructions at the beginning. Breaking complex tasks into smaller steps and providing examples for expected output formats significantly improves model performance. Using markdown or other formatting enhances readability and helps the model understand the structure of your request.

Effective System Prompts

Keep system prompts concise but comprehensive, clearly defining the model's role and constraints. Include necessary context while avoiding redundancy that could waste tokens. Testing different system prompts can help optimize performance for specific use cases. Consider versioning system prompts to maintain consistency across your application.

Content Management

Handle Content Types Appropriately

Validate text content for proper encoding and optimize image resolution for model capabilities before sending. Consider the content type limitations of different models and implement fallbacks for unsupported content types. Using appropriate MIME types for all content ensures proper handling throughout the pipeline.

Context Management

Request minimal necessary context to reduce token usage while ensuring the model has sufficient information. Structure context clearly with headers and sections, prioritizing recent and relevant information. Implementing context windowing for long conversations and considering summarization techniques helps maintain context within token limits.
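One simple form of context windowing is to keep only the most recent messages that fit within a token budget. The estimateTokens heuristic below is a rough character-count stand-in, not a real tokenizer:

type SamplingMessage = {
  role: "user" | "assistant";
  content: { type: "text"; text?: string };
};

// Very rough token estimate: ~4 characters per token for English text.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

// Keep the most recent messages whose combined size stays under the budget.
function windowMessages(messages: SamplingMessage[], tokenBudget: number): SamplingMessage[] {
  const kept: SamplingMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content.text ?? "");
    if (used + cost > tokenBudget) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}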

Technical Implementation

Set Reasonable Token Limits

Balance between completeness and cost when setting token limits, adjusting based on expected response length. Consider model-specific token limitations and implement pagination for long responses when necessary. Monitoring token usage helps optimize costs over time as you learn your application's patterns.

Validate Responses

Check for expected output formats and implement schema validation for structured outputs. Develop mechanisms to detect and handle hallucinations or incorrect information, possibly implementing confidence scoring. Having fallback mechanisms for low-quality responses ensures your application remains robust even when sampling results are suboptimal.
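For structured outputs, one option is to parse the completion and validate it against a schema before trusting it. The sketch below uses zod; the TicketTriage shape is invented for illustration:

import { z } from "zod";

// Hypothetical expected shape for a structured triage response.
const TicketTriage = z.object({
  category: z.enum(["bug", "feature", "question"]),
  priority: z.number().int().min(1).max(5)
});

function parseTriage(completionText: string) {
  try {
    const parsed = TicketTriage.safeParse(JSON.parse(completionText));
    return parsed.success ? parsed.data : null; // null signals "fall back or retry"
  } catch {
    return null; // the model didn't return valid JSON at all
  }
}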

Error Handling

Implement comprehensive error catching with meaningful error messages to users. Handle timeouts gracefully and implement retry logic with exponential backoff for transient issues. Logging errors systematically facilitates debugging and continuous improvement of your sampling implementation.
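A retry wrapper with exponential backoff might look like this generic sketch; the attempt count and delays are placeholders you would tune for your client and models:

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 1s, 2s, 4s, ... between attempts, plus a little jitter.
      const delayMs = 1000 * 2 ** attempt + Math.random() * 250;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: wrap the sampling call itself.
// const result = await withRetries(() => server.createMessage(params));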

Performance and Scalability

Rate Limiting

Implement client-side rate limiting using token bucket algorithms for smooth request distribution. Monitor usage patterns to adjust limits appropriately and implement queuing for high-demand periods. Consider establishing priority levels for different request types to ensure critical operations aren't blocked during peak usage.
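A minimal token bucket might look like the following; the capacity and refill rate are arbitrary, and a production version would queue requests rather than simply rejecting them:

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSecond: number) {
    this.tokens = capacity;
  }

  // Returns true if the request may proceed, false if it should be queued or rejected.
  tryConsume(cost = 1): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// e.g. allow bursts of 10 sampling requests, refilling 2 per second.
const samplingLimiter = new TokenBucket(10, 2);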

Caching Strategies

Cache common responses to reduce API calls, implementing LRU (Least Recently Used) caching with appropriate TTL (Time To Live) settings. Consider semantic caching for similar queries to further reduce redundant model calls. Remember to invalidate cache when context changes significantly to maintain response accuracy.
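A small LRU cache with TTL can be built on a Map, which preserves insertion order. The sizes and TTL below are arbitrary, and a real implementation would likely key entries on a hash of the prompt plus relevant parameters:

type CacheEntry = { value: string; expiresAt: number };

class LruTtlCache {
  private entries = new Map<string, CacheEntry>();

  constructor(private maxSize = 100, private ttlMs = 5 * 60 * 1000) {}

  get(key: string): string | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark this entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: string): void {
    if (this.entries.size >= this.maxSize) {
      // Evict the least recently used entry (first key in insertion order).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}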

Comprehensive Testing

Test with various model parameters to understand performance characteristics across different settings. Create test suites for different use cases and implement integration tests with actual models. Testing error handling and edge cases thoroughly, along with performing load testing for production scenarios, ensures reliability at scale.

Cost and Resource Management

Monitor Sampling Costs

Implement usage tracking and reporting with alerts for unusual usage patterns. Create dashboards for cost visualization to identify optimization opportunities. Consider implementing user quotas to prevent unexpected cost spikes. Continuously optimize prompts to reduce token usage without sacrificing quality.

Resource Optimization

Balance between model capabilities and costs by selecting appropriate models for different tasks. Implement tiered access based on requirements and consider batching requests when appropriate. Use streaming for long-running completions to improve user experience, and implement graceful degradation during high load to maintain service availability.

By following these expanded best practices, developers can create more robust, efficient, and cost-effective implementations of MCP sampling in their applications.

Context Management

Best practices for context:

  • Request minimal necessary context
  • Structure context clearly
  • Handle context size limits
  • Update context as needed
  • Clean up stale context

Security Considerations

When implementing sampling:

  • Validate all message content
  • Sanitize sensitive information
  • Implement appropriate rate limits
  • Monitor sampling usage
  • Encrypt data in transit
  • Handle user data privacy
  • Audit sampling requests
  • Control cost exposure
  • Implement timeouts (see the sketch after this list)
  • Handle model errors gracefully
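For the timeout item in particular, the simplest approach is to race the sampling call against a timer; the 30-second value below is arbitrary:

// A sketch of a timeout guard around a sampling call; a production version would
// also cancel the underlying request and clear the timer.
function withTimeout<T>(promise: Promise<T>, ms = 30_000): Promise<T> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`Sampling request timed out after ${ms}ms`)), ms)
  );
  // Whichever settles first wins.
  return Promise.race([promise, timeout]);
}

// Usage (server side):
// const result = await withTimeout(server.createMessage(params), 30_000);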

Limitations

Be aware of these limitations when implementing MCP sampling:

  • Client Capability Dependency: Sampling functionality depends entirely on the client's implementation. Different clients may support different features, model access, or sampling parameters.

  • User Control: End users ultimately control whether sampling requests are processed, which models are used, and how completions are handled. Servers cannot bypass user consent or preferences.

  • Context Size Constraints:
      ◦ Models have fixed context windows (typically 8K-128K tokens)
      ◦ Large contexts may slow down processing or increase costs
      ◦ Context management becomes critical for complex applications
      ◦ Some information may need to be summarized or excluded

  • Rate Limiting Considerations:
      ◦ Clients may impose rate limits on sampling requests
      ◦ Models themselves often have API-level rate limits
      ◦ High-frequency sampling may be throttled
      ◦ Implement backoff strategies for rate limit handling

  • Cost Implications:
      ◦ Sampling incurs token-based costs for model usage
      ◦ Larger contexts and completions increase costs
      ◦ Different models have different pricing structures
      ◦ Consider implementing budgeting mechanisms

  • Model Availability Variations:
      ◦ Not all models are available on all clients
      ◦ Model availability may change over time
      ◦ Newer models may not be immediately accessible
      ◦ Some regions may have restricted model access

  • Variable Response Times:
      ◦ Response times vary based on model size, load, and complexity
      ◦ Larger contexts typically result in slower responses
      ◦ First-token latency can be significant
      ◦ Implement appropriate timeouts and loading states

  • Content Type Limitations:
      ◦ Some models may not support all content types (images, audio, etc.)
      ◦ Multimodal capabilities vary across models
      ◦ Format conversions may be necessary
      ◦ Consider fallback strategies for unsupported content

  • Determinism Challenges:
      ◦ Even at temperature=0, responses may not be perfectly deterministic
      ◦ Results may vary between model versions or implementations
      ◦ Critical systems should not rely on exact reproducibility
      ◦ Implement validation for mission-critical applications

Conclusion

Hopefully this was helpful in better understanding how sampling works in MCP. As I mentioned in the introduction, it's a powerful but complex feature to take advantage of, and I'm hopeful we'll see greater adoption from clients in the near future.