In the Weeds: Serverless AI Pipelines with Google Cloud Run MCP
A technical deep-dive into building serverless AI processing pipelines with Google Cloud Run MCP — from container deployment to auto-scaling inference endpoints.
Why Cloud Run for AI Workloads?
The serverless AI pitch is simple: deploy your model or pipeline, pay only when it runs, scale to zero when idle. In practice, most serverless platforms buckle under AI workloads — cold starts kill latency, memory limits constrain model size, and execution timeouts choke long inference tasks.
Google Cloud Run has systematically addressed each of these. And with Google Cloud Run MCP, AI models can deploy, manage, and interact with Cloud Run services directly through the Model Context Protocol. This turns infrastructure management into a conversation.
The Architecture
We're building a document processing pipeline that accepts PDFs, extracts text, runs AI analysis, and returns structured insights. The pipeline needs to handle burst traffic (Monday morning when everyone uploads weekly reports) and cost nothing on weekends when nobody's working.
```
User Upload → Cloud Storage → Pub/Sub → Cloud Run (Processing) → Firestore (Results)
                                                  ↓
                                       Cloud Run (AI Inference)
```
Two Cloud Run services: one for orchestration and preprocessing, one for AI inference. Separating them lets us scale each independently — the preprocessing service handles many requests quickly, while the inference service handles fewer requests with more resources.
The Processing Service
```python
# processor/main.py
from flask import Flask, request, jsonify
import google.cloud.storage as storage
import google.cloud.firestore as firestore
from pypdf import PdfReader
import requests
import os
import io

app = Flask(__name__)
gcs = storage.Client()
db = firestore.Client()


@app.route("/process", methods=["POST"])
def process_document():
    envelope = request.get_json()
    if not envelope:
        return "Bad Request", 400

    message = envelope.get("message", {})
    bucket_name = message.get("attributes", {}).get("bucketId")
    file_name = message.get("attributes", {}).get("objectId")
    if not bucket_name or not file_name:
        return "Bad Request: missing bucket/object attributes", 400

    # Download PDF from Cloud Storage
    bucket = gcs.bucket(bucket_name)
    blob = bucket.blob(file_name)
    pdf_bytes = blob.download_as_bytes()

    # Extract text
    reader = PdfReader(io.BytesIO(pdf_bytes))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Send to AI inference service
    inference_url = os.environ["INFERENCE_SERVICE_URL"]
    response = requests.post(
        f"{inference_url}/analyze",
        json={"text": text, "filename": file_name},
        timeout=300,
    )
    response.raise_for_status()
    result = response.json()

    # Store results in Firestore
    doc_ref = db.collection("analyses").document(file_name)
    doc_ref.set({
        "filename": file_name,
        "page_count": len(reader.pages),
        "analysis": result["analysis"],
        "key_findings": result["key_findings"],
        "processed_at": firestore.SERVER_TIMESTAMP,
    })
    return jsonify({"status": "processed", "document": file_name})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```
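For context, the handler expects a Pub/Sub push envelope. Here is a minimal sketch of what a Cloud Storage notification looks like when Pub/Sub pushes it to the service (the bucket and object names are made up; real envelopes also carry a base64-encoded `data` payload and a publish timestamp):

```python
import json

# Illustrative Pub/Sub push envelope for a Cloud Storage notification.
# Values are placeholders, but bucketId, objectId, and eventType are the
# standard attribute names GCS notifications set.
envelope = json.loads("""
{
  "message": {
    "attributes": {
      "bucketId": "weekly-reports",
      "objectId": "reports/2024-W12.pdf",
      "eventType": "OBJECT_FINALIZE"
    },
    "messageId": "1234567890"
  },
  "subscription": "projects/my-project/subscriptions/process-docs"
}
""")

# Same extraction logic the /process handler uses
message = envelope.get("message", {})
bucket_name = message.get("attributes", {}).get("bucketId")
file_name = message.get("attributes", {}).get("objectId")
print(bucket_name, file_name)  # weekly-reports reports/2024-W12.pdf
```

Returning a 2xx from the handler acknowledges the message; any error response causes Pub/Sub to redeliver it, which is what the retry section later relies on.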
The Inference Service
```python
# inference/main.py
from flask import Flask, request, jsonify
from anthropic import Anthropic
import json
import os

app = Flask(__name__)
client = Anthropic()


@app.route("/analyze", methods=["POST"])
def analyze():
    data = request.get_json()
    text = data["text"]
    filename = data["filename"]

    # Truncate if beyond context limit
    max_chars = 150000
    if len(text) > max_chars:
        text = text[:max_chars] + "\n[Document truncated]"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system="You are a document analyst. Extract key findings, summarize the content, and identify action items.",
        messages=[{
            "role": "user",
            "content": f"Analyze this document ({filename}):\n\n{text}"
        }]
    )
    analysis_text = response.content[0].text

    # Second pass for structured extraction
    structured = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"From this analysis, extract a JSON object with keys: summary (string), key_findings (array of strings), action_items (array of strings), sentiment (string). Analysis:\n\n{analysis_text}"
        }]
    )
    try:
        result = json.loads(structured.content[0].text)
    except json.JSONDecodeError:
        result = {"summary": analysis_text, "key_findings": [], "action_items": [], "sentiment": "neutral"}

    return jsonify({"analysis": analysis_text, "key_findings": result.get("key_findings", [])})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```
Dockerfiles
Both services need containers. The inference service needs more memory:
```dockerfile
# inference/Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--timeout", "300", "main:app"]
```
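The Dockerfile copies a `requirements.txt` that isn't shown. A minimal version for the inference service, inferred from its imports, might look like this (the processing service would additionally need `google-cloud-storage`, `google-cloud-firestore`, `pypdf`, and `requests`; pin versions in a real build):

```
# inference/requirements.txt (package list inferred from the imports above)
flask
gunicorn
anthropic
```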
Deployment with Cloud Run MCP
Here's where the MCP integration changes the workflow. Instead of writing deployment scripts, you can interact with Cloud Run through the AI model:
"Deploy the inference service with 2GB memory, 300-second timeout, max 10 instances, min 0 instances, and set the ANTHROPIC_API_KEY environment variable."
The MCP server translates this into the correct gcloud run deploy configuration, handles the container build, and manages the service account permissions. You're having a conversation about infrastructure instead of debugging YAML.
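Under the hood, a request like that maps to a single `gcloud run deploy` invocation. Roughly the following sketch, where the service name, source path, and region are assumptions (in production you would likely mount the key from Secret Manager via `--set-secrets` instead of a plain environment variable):

```shell
gcloud run deploy inference \
  --source ./inference \
  --region us-central1 \
  --memory 2Gi \
  --timeout 300 \
  --max-instances 10 \
  --min-instances 0 \
  --set-env-vars "ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}"
```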
Scaling Configuration
The critical Cloud Run settings for AI workloads:
Memory and CPU: AI inference is memory-hungry. Set the inference service to 2GB+ memory with 2 CPUs. Cloud Run now supports up to 32GB memory per instance, which is enough for smaller local models if you're using something like LocalAI.
Concurrency: Set concurrency to 1 for the inference service. AI inference requests are resource-intensive — you don't want multiple requests competing for the same instance's memory. The orchestration service can handle higher concurrency (50-80) since it's I/O-bound.
Min instances: Keep at 0 for cost efficiency. The cold-start penalty (3-10 seconds for Python containers) is acceptable for document processing since users aren't watching in real time.
Max instances: Cap this based on your API rate limits and budget. If your AI provider limits you to 50 concurrent requests, setting max instances to 50 prevents you from deploying more capacity than you can use.
Timeout: Set to 300 seconds for the inference service. Long documents take time to process, and the default 60-second timeout will kill your requests mid-inference.
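Taken together, these settings can also be expressed declaratively. Here's a sketch of a Cloud Run service manifest for the inference service, assuming a service name and placeholder image (the annotation and field names are standard Cloud Run YAML):

```yaml
# service.yaml (sketch; service name and image are placeholders)
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: inference
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "10"
    spec:
      containerConcurrency: 1   # one inference request per instance
      timeoutSeconds: 300       # long documents need the headroom
      containers:
        - image: gcr.io/my-project/inference
          resources:
            limits:
              memory: 2Gi
              cpu: "2"
```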
Cost Analysis
Here's the math that makes serverless compelling for AI workloads:
A Cloud Run instance with 2GB memory and 2 vCPUs costs approximately $0.00003 per second when active. A typical document analysis takes 30-60 seconds. That's $0.0009 to $0.0018 per document in compute costs, plus the AI API cost.
Compare this to a dedicated server running 24/7: even a modest VM costs $50-100/month whether you process 10 documents or 10,000. Cloud Run scales to zero — if you process nothing on the weekend, you pay nothing on the weekend.
For a team processing ~500 documents per month, Cloud Run costs roughly $1-2 in compute. The AI API calls will cost more (perhaps $5-15 depending on document length), but the infrastructure cost is negligible.
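The arithmetic above is easy to sanity-check. A minimal script using the article's stated rate (the rate itself is approximate and region-dependent):

```python
# Sanity-check the cost figures using the article's stated rate
# (~$0.00003 per active second for a 2GB / 2 vCPU instance).
RATE_PER_SECOND = 0.00003

# Per-document compute cost for a 30-60 second analysis
low = 30 * RATE_PER_SECOND    # ~$0.0009
high = 60 * RATE_PER_SECOND   # ~$0.0018

# Monthly compute for ~500 documents, assuming the worst case (60s each)
monthly = 500 * high          # ~$0.90, within the $1-2 ballpark

print(f"${low:.4f} to ${high:.4f} per document, ~${monthly:.2f}/month")
```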
Error Handling and Retries
AI inference fails. Models time out. API rate limits kick in. Your pipeline needs to handle all of this gracefully.
Cloud Run integrates with Pub/Sub's retry mechanism. If your processing function returns an error, Pub/Sub redelivers the message with exponential backoff. Configure a dead-letter topic for messages that fail repeatedly:
```yaml
# pubsub-subscription.yaml
deadLetterPolicy:
  deadLetterTopic: projects/my-project/topics/dead-letter
  maxDeliveryAttempts: 5
retryPolicy:
  minimumBackoff: 10s
  maximumBackoff: 600s
```
In the inference service, wrap the AI API call in retries with exponential backoff:
```python
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry transient API failures up to 3 times, backing off 4-60 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=4, max=60))
def call_ai(text, filename):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"Analyze: {text}"}]
    )
```
Monitoring with MCP
The Cloud Run MCP server provides monitoring capabilities that let you ask natural-language questions about your deployment:
- "What's the average latency for the inference service over the last hour?"
- "How many instances are currently running?"
- "Show me error rates for the processing service this week."
- "What was the peak concurrent request count yesterday?"
This is significantly faster than navigating the Cloud Console, especially when troubleshooting. Combined with logging tools and Neon MCP for database monitoring, you can observe your entire pipeline through AI-driven queries.
When to Use This Pattern
This architecture works best when:
- Workloads are bursty (not constant throughput)
- Cold starts are acceptable (not real-time user-facing)
- You want zero infrastructure management
- Cost efficiency matters more than maximum performance
- You need to scale from 0 to high volume quickly
For real-time user-facing AI (chatbots, interactive agents), consider keeping minimum instances at 1+ or using a dedicated platform. For batch processing, document analysis, and async AI workloads, Cloud Run hits the sweet spot of cost, simplicity, and scalability.
The Bottom Line
Serverless AI pipelines used to be a contradiction. The "serverless" constraints (memory limits, short timeouts, cold starts) fought against AI's needs (large models, long inference, warm caches). Cloud Run has closed most of those gaps, and the MCP integration makes the remaining complexity conversational rather than operational.
You deploy by describing what you want. You monitor by asking questions. You scale by doing nothing. That's the promise of serverless, finally delivered for AI workloads.