In the Weeds: Automating Content Pipelines with Apify + Neon
A technical guide to building automated content collection, processing, and enrichment pipelines using Apify for web scraping and Neon serverless Postgres for storage — the infrastructure behind a-gnt's catalog.
The Content Problem at Scale
When you run a catalog of 3,500+ AI tools, keeping information current is not a manual task. It is an infrastructure challenge. Tools update their features. New tools launch daily. Pricing changes. Integrations appear. Documentation evolves.
The answer is not hiring a team of researchers. The answer is building a pipeline that continuously collects, processes, enriches, and validates content — automatically, reliably, and cheaply.
Here is how we do it with Apify MCP for web collection and Neon MCP for serverless storage.
The Pipeline Architecture
Sources -> Apify Scraping -> Raw Storage (Neon) -> LLM Processing -> Enriched Storage -> Validation -> Publication
Each stage is independent. Each can fail without bringing down the pipeline. Each can be monitored separately. This modularity is critical at scale — when you are processing hundreds of sources daily, something will always be failing. The question is whether failure cascades or stays contained.
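In orchestrator terms, the containment boundary can be sketched as a driver loop that traps per-stage failures instead of letting them propagate. This is illustrative only; the stage names and failure-record shape are hypothetical:

```python
def run_pipeline(item, stages):
    """Run an item through named stages in order. A failing stage records
    the error and halts that item's progression; it never crashes the run,
    so other items and other stages keep flowing."""
    for name, stage in stages:
        try:
            item = stage(item)
        except Exception as exc:
            return {"failed_stage": name, "error": str(exc), "item": item}
    return {"failed_stage": None, "item": item}
```

A scheduler can then requeue or quarantine items by inspecting `failed_stage`, which is exactly the "failure stays contained" property described above.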
Stage 1: Source Discovery and Scheduling
First, we need to know what to scrape. Our source registry lives in Neon:
```sql
CREATE TABLE content_sources (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    url TEXT NOT NULL,
    source_type TEXT NOT NULL,
    scrape_frequency INTERVAL DEFAULT '24 hours',
    last_scraped TIMESTAMPTZ,
    apify_actor_id TEXT,
    config JSONB DEFAULT '{}',
    active BOOLEAN DEFAULT true,
    reliability_score FLOAT DEFAULT 1.0,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Scheduling query: what needs scraping now?
CREATE VIEW sources_due AS
SELECT * FROM content_sources
WHERE active = true
  AND (last_scraped IS NULL OR last_scraped + scrape_frequency < NOW())
ORDER BY last_scraped ASC NULLS FIRST;
```
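The same scheduling rule can be mirrored in application code, for example inside an orchestration step that decides which sources to enqueue. A sketch, with field names matching the table above (the function itself is ours, not part of the pipeline):

```python
from datetime import datetime, timezone

def due_sources(sources, now=None):
    """Mirror of the sources_due view: active sources that were never
    scraped, or whose last scrape is older than their frequency."""
    now = now or datetime.now(timezone.utc)
    due = [
        s for s in sources
        if s["active"] and (
            s["last_scraped"] is None
            or s["last_scraped"] + s["scrape_frequency"] < now
        )
    ]
    # ORDER BY last_scraped ASC NULLS FIRST
    return sorted(
        due,
        key=lambda s: (s["last_scraped"] is not None, s["last_scraped"] or now),
    )
```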
Stage 2: Apify Scraping
Apify MCP provides web scraping as a service — pre-built actors for common patterns plus custom actors for specific needs.
For product pages, we use a generic web scraper actor configured to extract structured content:
```javascript
// n8n Function node: build the Apify task configuration
const source = items[0].json;

// The page function executes inside the Apify actor, not in n8n. Since the
// configuration travels over HTTP as JSON, it must be serialized to a string.
// Note: a pageFunction that uses `page` matches apify/puppeteer-scraper
// (apify/web-scraper runs its pageFunction directly in the browser context).
const pageFunction = async function pageFunction(context) {
  const { page, request } = context;
  await page.waitForSelector('body', { timeout: 30000 });

  const title = await page.evaluate(() =>
    document.querySelector('h1')?.textContent || ''
  );
  const description = await page.evaluate(() =>
    document.querySelector('meta[name="description"]')?.content || ''
  );
  const bodyText = await page.evaluate(() =>
    (document.querySelector('main') || document.querySelector('article') || document.body).textContent
  );
  // Substring match on class names; an exact [class="price"] selector would
  // miss elements with multiple classes.
  const pricing = await page.evaluate(() =>
    Array.from(document.querySelectorAll('[class*="price"]')).map(el => el.textContent)
  );
  const features = await page.evaluate(() =>
    Array.from(document.querySelectorAll('li')).slice(0, 50).map(el => el.textContent.trim())
  );

  return {
    url: request.url,
    title,
    description,
    bodyText: bodyText.substring(0, 50000),
    pricing,
    features,
    scrapedAt: new Date().toISOString()
  };
};

const apifyConfig = {
  actorId: source.apify_actor_id || 'apify/puppeteer-scraper',
  input: {
    startUrls: [{ url: source.url }],
    pageFunction: pageFunction.toString(),
    maxRequestsPerCrawl: 10,
    proxyConfiguration: { useApifyProxy: true }
  },
  timeout: 120
};

return [{ json: { apifyConfig } }];
```
For GitHub repos, we use the GitHub API actor:
```javascript
{
  actorId: 'apify/github-scraper',
  input: {
    repoUrls: [source.url],
    includeReadme: true,
    includeReleases: true,
    includeLanguages: true,
    maxReleases: 5
  }
}
```
Stage 3: Raw Storage
Scraped data goes directly into a raw storage table — unprocessed, exactly as received:
```sql
CREATE TABLE raw_scrapes (
    id UUID DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES content_sources(id),
    raw_data JSONB NOT NULL,
    scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    processed BOOLEAN DEFAULT false,
    processing_error TEXT,
    -- The primary key must include the partition key on a partitioned table
    PRIMARY KEY (id, scraped_at)
) PARTITION BY RANGE (scraped_at);

-- Partition by month for performance
CREATE TABLE raw_scrapes_2026_04 PARTITION OF raw_scrapes
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```
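Monthly partitions have to exist before rows arrive, so in practice the DDL is generated on a schedule rather than written by hand. A small helper sketch (the function name is ours):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Generate the DDL for one monthly partition of raw_scrapes.

    Handles the December -> January rollover for the upper bound.
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"raw_scrapes_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF raw_scrapes\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

Run it from a scheduled job one month ahead, so inserts never hit a missing partition.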
Why store raw? Because processing logic changes. When you improve your LLM prompts or switch enrichment models, you can reprocess historical raw data without re-scraping. This saves API costs and prevents hitting rate limits.
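Reprocessing is then just a loop over stored rows with whatever the current enrichment logic is, and no network calls. A minimal sketch, where `enrich` stands in for the real processing function:

```python
def reprocess(raw_rows, enrich):
    """Re-run enrichment over stored raw scrapes without re-scraping.

    Per-row failures are collected rather than aborting the batch, so one
    bad record cannot stall a reprocessing run.
    """
    results, failures = [], []
    for row in raw_rows:
        try:
            results.append(enrich(row["raw_data"]))
        except Exception as exc:
            failures.append((row["id"], str(exc)))
    return results, failures
```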
Stage 4: LLM Processing and Enrichment
The enrichment stage transforms raw scraped content into structured, normalized data:
```python
import json

async def enrich_tool_data(raw_scrape: dict) -> dict:
    raw = raw_scrape['raw_data']
    prompt = f"""Analyze this scraped data about an AI tool and extract structured information.

RAW DATA:
Title: {raw.get('title', '')}
Description: {raw.get('description', '')}
Body (first 3000 chars): {raw.get('bodyText', '')[:3000]}
Pricing mentions: {raw.get('pricing', [])}
Features: {raw.get('features', [])[:20]}

Extract into JSON with fields: name, tagline, description (2-3 paragraphs),
category (automation/development/writing/design/data/other),
pricing_model (free/freemium/paid/enterprise), pricing_details,
key_features (list), integrations (list), use_cases (list),
technical_level (beginner/intermediate/advanced), last_updated.
If information is not available, use null. Do not guess."""

    # `llm` is whichever LLM client the pipeline is configured with
    response = await llm.generate(prompt)
    return json.loads(response)
```
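One practical wrinkle: calling `json.loads` on a raw model response fails whenever the model wraps its answer in a markdown code fence. A small tolerant parser helps; this is a sketch, not part of the pipeline above:

```python
import json
import re

def parse_llm_json(response: str) -> dict:
    """Extract the first JSON object from an LLM response, tolerating a
    ```json ... ``` fence that models often add despite instructions."""
    text = response.strip()
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```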
Stage 5: Validation and Quality Control
Before enriched data reaches production, it passes through validation:
```python
def validate_enriched_data(data: dict, previous_version: dict = None) -> tuple:
    errors = []

    # Required fields
    for field in ['name', 'tagline', 'description', 'category']:
        if not data.get(field):
            errors.append(f"Missing required field: {field}")

    # Sanity checks
    if data.get('description') and len(data['description']) < 50:
        errors.append("Description suspiciously short")
    if data.get('pricing_model') not in [None, 'free', 'freemium', 'paid', 'enterprise']:
        errors.append(f"Invalid pricing_model: {data['pricing_model']}")

    # Drift detection
    if previous_version:
        if data.get('name') != previous_version.get('name'):
            errors.append(f"Name changed: {previous_version['name']} to {data['name']}")
        if data.get('category') != previous_version.get('category'):
            errors.append("Category changed unexpectedly")

    return len(errors) == 0, errors
```
Failed validations go to a review queue rather than being silently dropped or silently published.
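A minimal sketch of that routing, with an in-memory queue standing in for the real review table (all names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """In-memory stand-in for a review_queue table in Neon."""
    items: list = field(default_factory=list)

    def add(self, record: dict, errors: list) -> None:
        self.items.append({"record": record, "errors": errors})

def route(record: dict, ok: bool, errors: list, published: list, queue: ReviewQueue) -> None:
    """Publish valid records; queue failures for human review so nothing
    is silently dropped or silently published."""
    if ok:
        published.append(record)
    else:
        queue.add(record, errors)
```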
Stage 6: The Neon Advantage
Why Neon MCP specifically for this pipeline?
Serverless scaling: The pipeline runs in bursts — heavy during scraping windows, idle between them. Neon scales to zero when idle, meaning we pay nothing during the 20+ hours per day when no processing happens.
Branching: Neon supports database branching. When we change enrichment logic, we branch the database, reprocess on the branch, validate results, and then merge. This gives us safe, reversible updates to 3,500+ records.
```bash
# Create a branch for reprocessing
neonctl branches create --name enrichment-v2 --parent main

# Reprocess on the branch
python reprocess.py --connection-string $BRANCH_URL

# Validate
python validate.py --connection-string $BRANCH_URL --report

# If good, merge
```
Point-in-time recovery: If a bad batch corrupts data, restore to any point in the last 7 days. This has saved us twice.
Monitoring and Observability
At scale, you need visibility:
```sql
-- Daily pipeline health check
SELECT
    date_trunc('day', scraped_at) AS day,
    COUNT(*) AS total_scrapes,
    COUNT(*) FILTER (WHERE processed) AS processed,
    COUNT(*) FILTER (WHERE processing_error IS NOT NULL) AS errors,
    ROUND(100.0 * COUNT(*) FILTER (WHERE processing_error IS NOT NULL) / COUNT(*), 2) AS error_rate
FROM raw_scrapes
WHERE scraped_at > NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1 DESC;
```
Alert when error rate exceeds 5%. Investigate when any source fails three times consecutively. Auto-deactivate sources that have not successfully scraped in 7 days.
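The alerting rules above are simple enough to express directly. A sketch, with thresholds matching the text (function names are ours):

```python
def should_alert(total: int, errors: int, threshold: float = 0.05) -> bool:
    """Alert when the error rate over a window exceeds the threshold (5% default)."""
    return total > 0 and errors / total > threshold

def consecutive_failures(statuses: list) -> int:
    """Count trailing consecutive failures for one source.

    `statuses` is oldest-first; True means a successful scrape. Three or
    more trailing failures triggers an investigation per the rule above.
    """
    count = 0
    for ok in reversed(statuses):
        if ok:
            break
        count += 1
    return count
```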
The Full Stack
The production pipeline uses:
- Apify MCP for distributed web scraping
- Neon MCP for serverless Postgres (raw storage, enriched storage, production catalog)
- n8n for orchestration and scheduling
- txtai for generating embeddings on enriched content (powering semantic search)
- Context7 for understanding documentation pages in context
Total infrastructure cost at 3,500+ tools with daily updates: under $50 per month. At enterprise scale, with dedicated Apify actors and higher scraping frequency, budget $200 to $500 per month.
The pipeline runs itself. Sources are discovered, scraped, processed, validated, and published without human intervention — unless validation fails, at which point a human reviews the flagged items.
This is how you keep a catalog fresh at scale. Not with researchers. With infrastructure.