In the Weeds: Automating Content Pipelines with Apify + Neon
A technical guide to building automated content collection, processing, and enrichment pipelines using Apify for web scraping and Neon serverless Postgres for storage — the infrastructure behind a-gnt's catalog.
The Content Problem at Scale
When you run a catalog of 3,500+ AI tools, keeping information current is not a manual task. It is an infrastructure challenge. Tools update their features. New tools launch daily. Pricing changes. Integrations appear. Documentation evolves.
The answer is not hiring a team of researchers. The answer is building a pipeline that continuously collects, processes, enriches, and validates content — automatically, reliably, and cheaply.
Here is how we do it with Apify MCP for web collection and Neon MCP for serverless storage.
The Pipeline Architecture
Sources -> Apify Scraping -> Raw Storage (Neon) -> LLM Processing -> Enriched Storage -> Validation -> Publication
Each stage is independent. Each can fail without bringing down the pipeline. Each can be monitored separately. This modularity is critical at scale — when you are processing hundreds of sources daily, something will always be failing. The question is whether failure cascades or stays contained.
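In orchestrator terms, the containment boundary can be sketched as a driver loop that traps per-stage failures instead of letting them propagate. This is illustrative only; the stage names and failure-record shape are hypothetical:

```python
def run_pipeline(item, stages):
    """Run an item through named stages in order. A failing stage records
    the error and halts that item's progression; it never crashes the run,
    so other items and other stages keep flowing."""
    for name, stage in stages:
        try:
            item = stage(item)
        except Exception as exc:
            return {"failed_stage": name, "error": str(exc), "item": item}
    return {"failed_stage": None, "item": item}
```

A scheduler can then requeue or quarantine items by inspecting `failed_stage`, which is exactly the "failure stays contained" property described above.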
Stage 1: Source Discovery and Scheduling
First, we need to know what to scrape. Our source registry lives in Neon:
```sql
CREATE TABLE content_sources (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    url TEXT NOT NULL,
    source_type TEXT NOT NULL,
    scrape_frequency INTERVAL DEFAULT '24 hours',
    last_scraped TIMESTAMPTZ,
    apify_actor_id TEXT,
    config JSONB DEFAULT '{}',
    active BOOLEAN DEFAULT true,
    reliability_score FLOAT DEFAULT 1.0,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Scheduling query: what needs scraping now?
CREATE VIEW sources_due AS
SELECT * FROM content_sources
WHERE active = true
  AND (last_scraped IS NULL OR last_scraped + scrape_frequency < NOW())
ORDER BY last_scraped ASC NULLS FIRST;
```
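The same scheduling rule can be mirrored in application code, for example inside an orchestration step that decides which sources to enqueue. A sketch, with field names matching the table above (the function itself is ours, not part of the pipeline):

```python
from datetime import datetime, timezone

def due_sources(sources, now=None):
    """Mirror of the sources_due view: active sources that were never
    scraped, or whose last scrape is older than their frequency."""
    now = now or datetime.now(timezone.utc)
    due = [
        s for s in sources
        if s["active"] and (
            s["last_scraped"] is None
            or s["last_scraped"] + s["scrape_frequency"] < now
        )
    ]
    # ORDER BY last_scraped ASC NULLS FIRST
    return sorted(
        due,
        key=lambda s: (s["last_scraped"] is not None, s["last_scraped"] or now),
    )
```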
Stage 2: Apify Scraping
Apify MCP provides web scraping as a service — pre-built actors for common patterns plus custom actors for specific needs.
For product pages, we use a generic web scraper actor configured to extract structured content:
```javascript
// n8n Function node: build the Apify task configuration
const source = items[0].json;

// The page function executes inside the Apify actor, not in n8n. Since the
// configuration travels over HTTP as JSON, it must be serialized to a string.
// Note: a pageFunction that uses `page` matches apify/puppeteer-scraper
// (apify/web-scraper runs its pageFunction directly in the browser context).
const pageFunction = async function pageFunction(context) {
  const { page, request } = context;
  await page.waitForSelector('body', { timeout: 30000 });

  const title = await page.evaluate(() =>
    document.querySelector('h1')?.textContent || ''
  );
  const description = await page.evaluate(() =>
    document.querySelector('meta[name="description"]')?.content || ''
  );
  const bodyText = await page.evaluate(() =>
    (document.querySelector('main') || document.querySelector('article') || document.body).textContent
  );
  // Substring match on class names; an exact [class="price"] selector would
  // miss elements with multiple classes.
  const pricing = await page.evaluate(() =>
    Array.from(document.querySelectorAll('[class*="price"]')).map(el => el.textContent)
  );
  const features = await page.evaluate(() =>
    Array.from(document.querySelectorAll('li')).slice(0, 50).map(el => el.textContent.trim())
  );

  return {
    url: request.url,
    title,
    description,
    bodyText: bodyText.substring(0, 50000),
    pricing,
    features,
    scrapedAt: new Date().toISOString()
  };
};

const apifyConfig = {
  actorId: source.apify_actor_id || 'apify/puppeteer-scraper',
  input: {
    startUrls: [{ url: source.url }],
    pageFunction: pageFunction.toString(),
    maxRequestsPerCrawl: 10,
    proxyConfiguration: { useApifyProxy: true }
  },
  timeout: 120
};

return [{ json: { apifyConfig } }];
```
For GitHub repos, we use the GitHub API actor:
```javascript
{
  actorId: 'apify/github-scraper',
  input: {
    repoUrls: [source.url],
    includeReadme: true,
    includeReleases: true,
    includeLanguages: true,
    maxReleases: 5
  }
}
```
Stage 3: Raw Storage
Scraped data goes directly into a raw storage table — unprocessed, exactly as received:
```sql
CREATE TABLE raw_scrapes (
    id UUID DEFAULT gen_random_uuid(),
    source_id UUID REFERENCES content_sources(id),
    raw_data JSONB NOT NULL,
    scraped_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    processed BOOLEAN DEFAULT false,
    processing_error TEXT,
    -- The primary key must include the partition key on a partitioned table
    PRIMARY KEY (id, scraped_at)
) PARTITION BY RANGE (scraped_at);

-- Partition by month for performance
CREATE TABLE raw_scrapes_2026_04 PARTITION OF raw_scrapes
    FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');
```
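Monthly partitions have to exist before rows arrive, so in practice the DDL is generated on a schedule rather than written by hand. A small helper sketch (the function name is ours):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Generate the DDL for one monthly partition of raw_scrapes.

    Handles the December -> January rollover for the upper bound.
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"raw_scrapes_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF raw_scrapes\n"
        f"    FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

Run it from a scheduled job one month ahead, so inserts never hit a missing partition.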
Why store raw? Because processing logic changes. When you improve your LLM prompts or switch enrichment models, you can reprocess historical raw data without re-scraping. This saves API costs and prevents hitting rate limits.
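Reprocessing is then just a loop over stored rows with whatever the current enrichment logic is, and no network calls. A minimal sketch, where `enrich` stands in for the real processing function:

```python
def reprocess(raw_rows, enrich):
    """Re-run enrichment over stored raw scrapes without re-scraping.

    Per-row failures are collected rather than aborting the batch, so one
    bad record cannot stall a reprocessing run.
    """
    results, failures = [], []
    for row in raw_rows:
        try:
            results.append(enrich(row["raw_data"]))
        except Exception as exc:
            failures.append((row["id"], str(exc)))
    return results, failures
```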
Stage 4: LLM Processing and Enrichment
The enrichment stage transforms raw scraped content into structured, normalized data:
```python
import json

async def enrich_tool_data(raw_scrape: dict) -> dict:
    raw = raw_scrape['raw_data']
    prompt = f"""Analyze this scraped data about an AI tool and extract structured information.

RAW DATA:
Title: {raw.get('title', '')}
Description: {raw.get('description', '')}
Body (first 3000 chars): {raw.get('bodyText', '')[:3000]}
Pricing mentions: {raw.get('pricing', [])}
Features: {raw.get('features', [])[:20]}

Extract into JSON with fields: name, tagline, description (2-3 paragraphs),
category (automation/development/writing/design/data/other),
pricing_model (free/freemium/paid/enterprise), pricing_details,
key_features (list), integrations (list), use_cases (list),
technical_level (beginner/intermediate/advanced), last_updated.
If information is not available, use null. Do not guess."""

    # `llm` is whichever LLM client the pipeline is configured with
    response = await llm.generate(prompt)
    return json.loads(response)
```
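One practical wrinkle: calling `json.loads` on a raw model response fails whenever the model wraps its answer in a markdown code fence. A small tolerant parser helps; this is a sketch, not part of the pipeline above:

```python
import json
import re

def parse_llm_json(response: str) -> dict:
    """Extract the first JSON object from an LLM response, tolerating a
    ```json ... ``` fence that models often add despite instructions."""
    text = response.strip()
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```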
Stage 5: Validation and Quality Control
Before enriched data reaches production, it passes through validation:
```python
def validate_enriched_data(data: dict, previous_version: dict = None) -> tuple:
    errors = []

    # Required fields
    for field in ['name', 'tagline', 'description', 'category']:
        if not data.get(field):
            errors.append(f"Missing required field: {field}")

    # Sanity checks
    if data.get('description') and len(data['description']) < 50:
        errors.append("Description suspiciously short")
    if data.get('pricing_model') not in [None, 'free', 'freemium', 'paid', 'enterprise']:
        errors.append(f"Invalid pricing_model: {data['pricing_model']}")

    # Drift detection
    if previous_version:
        if data.get('name') != previous_version.get('name'):
            errors.append(f"Name changed: {previous_version['name']} to {data['name']}")
        if data.get('category') != previous_version.get('category'):
            errors.append("Category changed unexpectedly")

    return len(errors) == 0, errors
```
Failed validations go to a review queue rather than being silently dropped or silently published.
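A minimal sketch of that routing, with an in-memory queue standing in for the real review table (all names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """In-memory stand-in for a review_queue table in Neon."""
    items: list = field(default_factory=list)

    def add(self, record: dict, errors: list) -> None:
        self.items.append({"record": record, "errors": errors})

def route(record: dict, ok: bool, errors: list, published: list, queue: ReviewQueue) -> None:
    """Publish valid records; queue failures for human review so nothing
    is silently dropped or silently published."""
    if ok:
        published.append(record)
    else:
        queue.add(record, errors)
```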
Stage 6: The Neon Advantage
Why Neon MCP specifically for this pipeline?
Serverless scaling: The pipeline runs in bursts — heavy during scraping windows, idle between them. Neon scales to zero when idle, meaning we pay nothing during the 20+ hours per day when no processing happens.
Branching: Neon supports database branching. When we change enrichment logic, we branch the database, reprocess on the branch, validate results, and then merge. This gives us safe, reversible updates to 3,500+ records.
```bash
# Create a branch for reprocessing
neonctl branches create --name enrichment-v2 --parent main

# Reprocess on the branch
python reprocess.py --connection-string $BRANCH_URL

# Validate
python validate.py --connection-string $BRANCH_URL --report

# If good, merge
```
Point-in-time recovery: If a bad batch corrupts data, restore to any point in the last 7 days. This has saved us twice.
Monitoring and Observability
At scale, you need visibility:
```sql
-- Daily pipeline health check
SELECT
    date_trunc('day', scraped_at) AS day,
    COUNT(*) AS total_scrapes,
    COUNT(*) FILTER (WHERE processed) AS processed,
    COUNT(*) FILTER (WHERE processing_error IS NOT NULL) AS errors,
    ROUND(100.0 * COUNT(*) FILTER (WHERE processing_error IS NOT NULL) / COUNT(*), 2) AS error_rate
FROM raw_scrapes
WHERE scraped_at > NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1 DESC;
```
Alert when error rate exceeds 5%. Investigate when any source fails three times consecutively. Auto-deactivate sources that have not successfully scraped in 7 days.
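The alerting rules above are simple enough to express directly. A sketch, with thresholds matching the text (function names are ours):

```python
def should_alert(total: int, errors: int, threshold: float = 0.05) -> bool:
    """Alert when the error rate over a window exceeds the threshold (5% default)."""
    return total > 0 and errors / total > threshold

def consecutive_failures(statuses: list) -> int:
    """Count trailing consecutive failures for one source.

    `statuses` is oldest-first; True means a successful scrape. Three or
    more trailing failures triggers an investigation per the rule above.
    """
    count = 0
    for ok in reversed(statuses):
        if ok:
            break
        count += 1
    return count
```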
The Full Stack
The production pipeline uses:
- Apify MCP for distributed web scraping
- Neon MCP for serverless Postgres (raw storage, enriched storage, production catalog)
- n8n for orchestration and scheduling
- txtai for generating embeddings on enriched content (powering semantic search)
- Context7 for understanding documentation pages in context
Total infrastructure cost at 3,500+ tools with daily updates: under $50 per month. At enterprise scale, with dedicated Apify actors and higher scraping frequency, budget $200 to $500 per month.
The pipeline runs itself. Sources are discovered, scraped, processed, validated, and published without human intervention — unless validation fails, at which point a human reviews the flagged items.
This is how you keep a catalog fresh at scale. Not with researchers. With infrastructure.