In the Weeds: Building a Personal Research Agent with Apify and Neon
A deep technical tutorial on combining Apify MCP for web scraping, Neon MCP for serverless Postgres, and AI to build a personal research pipeline. Code examples, architecture decisions, and real results.
What We're Building (and Why)
I'm a research hoarder. I read dozens of articles, papers, and blog posts per week. I bookmark things compulsively. And then, when I actually need to find something I read three weeks ago, I spend 20 minutes digging through browser bookmarks, note apps, and the vague recollection that "it had a blue header and the author's name started with M... or was it N?"
So I built a personal research agent. A system that automatically scrapes sources I care about, stores everything in a searchable database, and lets me query my collected research using natural language.
This is a technical tutorial. We're going deep. If you're here for the code and architecture, you're in the right place. If terms like "serverless Postgres" make your eyes glaze over, the Non-Technical Person's Guide might be a better starting point — no judgment.
Let's build.
The Architecture
The system has three main components:
[Web Sources] → [Apify MCP - Scraping] → [AI - Processing] → [Neon MCP - Storage] → [Query Interface]
Apify MCP handles the web scraping. It's an MCP (Model Context Protocol) server that gives AI access to Apify's web scraping infrastructure. Instead of writing and maintaining my own scrapers (a maintenance nightmare), I use Apify's pre-built actors for different source types and let the MCP layer handle the integration.
Neon MCP handles the storage. Neon is serverless Postgres — all the power of PostgreSQL without managing a database server. The MCP layer lets AI interact with the database directly: creating tables, inserting data, running queries. For a personal project, this is ideal — I don't want to manage infrastructure.
The AI layer sits in the middle, doing the cognitive work: extracting key information from scraped content, generating summaries and tags, identifying connections between articles, and handling natural language queries against the database.
Step 1: Setting Up the Scraping Pipeline
First, the scraping layer. Apify MCP provides access to Apify's actor ecosystem, which has pre-built scrapers for essentially every type of web content.
For my research agent, I scrape three types of sources:
RSS feeds from blogs and publications I follow. These are the most reliable — structured data, consistent formatting, regular updates.
Specific web pages that I manually add when I find something worth archiving. Think of this as a "save for later" that actually processes and indexes the content.
Search results for topics I'm tracking. Once a week, the system searches for new content on topics I've defined and adds relevant results to the pipeline.
The Apify MCP handles all of this through a unified interface. Here's the conceptual flow for the RSS feed scraper:
```javascript
// Pseudocode for the scraping workflow
const feedSources = [
  'https://example.com/ai-research/feed',
  'https://example.com/tech-blog/rss',
  // ... more feeds
];

// For each source, Apify scrapes the content
// The MCP layer handles authentication, rate limiting, and error handling
for (const source of feedSources) {
  const articles = await apifyMcp.scrape({
    source: source,
    type: 'rss',
    extractContent: true, // Get full article text, not just summaries
    since: lastScrapeDate // Only new articles
  });

  // Pass to AI processing layer
  for (const article of articles) {
    await processArticle(article);
  }
}
```
The key design decision here: scrape full content, not just metadata. Titles and summaries are useful for browsing, but when I'm doing serious research, I need to search the actual content. Apify extracts the full article text, cleaned of navigation, ads, and other clutter.
I run the scraper daily for RSS feeds and weekly for search-based discovery. The manual "save this article" scraping happens on demand.
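Since the daily RSS runs and the weekly discovery runs can return overlapping results, the pipeline needs to drop anything already stored before processing. A minimal sketch of that filter (a hypothetical helper, not part of the Apify API — in practice `seenUrls` would come from a `SELECT url FROM articles` query):

```javascript
// Hypothetical helper: given freshly scraped articles, keep only those
// published after the last scrape and not already stored in the database.
function filterNewArticles(articles, lastScrapeDate, seenUrls) {
  const seen = new Set(seenUrls);
  return articles.filter(a =>
    new Date(a.publishedAt) > new Date(lastScrapeDate) && !seen.has(a.url)
  );
}
```

Keying deduplication on the URL matches the `UNIQUE` constraint on `articles.url`, so even if this filter misses something, the database rejects the duplicate insert.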
Step 2: AI Processing
This is where raw scraped content becomes structured, searchable knowledge. For each article, the AI processing layer extracts:
Summary: A 2-3 sentence distillation of the article's main argument or findings. Not the article's own summary (which is often vague or clickbaity), but a genuine summary of the actual content.
Key claims: The specific assertions the article makes. "Remote workers report 15% higher productivity." "The study found no significant correlation between X and Y." These are extracted as discrete, quotable statements.
Tags: Thematic tags based on content analysis. Not the article's own tags (which are often SEO-optimized rather than descriptively useful), but tags that reflect what the article is actually about.
Connections: References to other articles in my database that cover related topics, present opposing views, or provide supporting evidence.
Quality assessment: A rough evaluation of the source's credibility, the strength of evidence presented, and any notable biases. This isn't perfect, but it's useful for quickly distinguishing between a peer-reviewed study and an opinion blog post.
Here's the conceptual processing function:
```javascript
async function processArticle(article) {
  const processed = await ai.analyze({
    content: article.fullText,
    tasks: [
      'Generate a 2-3 sentence summary of the main argument',
      'Extract key factual claims as discrete statements',
      'Assign 3-7 thematic tags',
      'Identify the type: research, opinion, tutorial, news, analysis',
      'Rate source credibility: high, medium, low, unknown',
      'Note any potential biases or limitations'
    ]
  });

  // Find connections to existing articles
  const connections = await findRelatedArticles(processed.tags, processed.summary);

  return {
    ...article,
    ...processed,
    connections: connections,
    processedAt: new Date()
  };
}
```
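The `findRelatedArticles` helper is left abstract above. One simple way to sketch it is Jaccard overlap on tag sets — this is an illustrative stand-in, not the system's actual implementation (which also compares summaries and queries the `article_tags` table rather than an in-memory list):

```javascript
// Hypothetical sketch of connection-finding via tag overlap.
// `existingArticles` is assumed to look like [{ id, tags: [...] }, ...].
function findRelatedByTags(newTags, existingArticles, threshold = 0.3) {
  const a = new Set(newTags);
  return existingArticles
    .map(article => {
      const b = new Set(article.tags);
      const shared = [...a].filter(t => b.has(t)).length;
      const union = new Set([...a, ...b]).size;
      return { id: article.id, score: union === 0 ? 0 : shared / union };
    })
    .filter(r => r.score >= threshold)
    .sort((x, y) => y.score - x.score);
}
```

Anything above the threshold becomes a candidate row for the `connections` table, with the connection type (supporting, opposing, extends) decided by the AI layer afterward.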
The quality assessment deserves special mention. I've tuned it to flag three things: (1) Is the source generally reputable? (2) Does the article cite its claims? (3) Is there language suggesting strong bias? This isn't a fact-checker — it's a first-pass filter that helps me quickly assess whether something is worth deep reading.
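Those three checks can be caricatured as a deterministic first pass. To be clear, the real assessment is done by the AI layer; the domain list and bias markers below are illustrative placeholders, not a curated dataset:

```javascript
// Rough first-pass credibility heuristic mirroring the three checks.
// REPUTABLE_DOMAINS and BIAS_MARKERS are toy examples for illustration only.
const REPUTABLE_DOMAINS = new Set(['nature.com', 'arxiv.org', 'acm.org']);
const BIAS_MARKERS = ['obviously', 'everyone knows', 'wake up', 'mainstream media'];

function firstPassCredibility(article) {
  const reputable = REPUTABLE_DOMAINS.has(article.sourceDomain);     // (1) reputable source?
  const citesClaims = /\bhttps?:\/\/|\[\d+\]|\bdoi\.org\b/i
    .test(article.fullText);                                          // (2) cites its claims?
  const biased = BIAS_MARKERS
    .some(m => article.fullText.toLowerCase().includes(m));           // (3) strongly biased language?

  if (reputable && citesClaims && !biased) return 'high';
  if (!reputable && !citesClaims) return 'low';
  return 'medium';
}
```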
Step 3: Database Design with Neon
Neon MCP lets me create and manage a serverless PostgreSQL database through the AI interface. Here's my schema:
```sql
-- Core articles table
CREATE TABLE articles (
  id SERIAL PRIMARY KEY,
  url TEXT UNIQUE NOT NULL,
  title TEXT NOT NULL,
  author TEXT,
  source_domain TEXT,
  published_at TIMESTAMPTZ,
  scraped_at TIMESTAMPTZ DEFAULT NOW(),
  full_text TEXT NOT NULL,
  summary TEXT,
  article_type TEXT, -- research, opinion, tutorial, news, analysis
  credibility TEXT,  -- high, medium, low, unknown
  bias_notes TEXT
);

-- Tags for flexible categorization
CREATE TABLE article_tags (
  id SERIAL PRIMARY KEY,
  article_id INTEGER REFERENCES articles(id),
  tag TEXT NOT NULL
);

-- Key claims extracted from articles
CREATE TABLE claims (
  id SERIAL PRIMARY KEY,
  article_id INTEGER REFERENCES articles(id),
  claim_text TEXT NOT NULL,
  claim_type TEXT -- factual, statistical, opinion, prediction
);

-- Connections between articles
CREATE TABLE connections (
  id SERIAL PRIMARY KEY,
  article_a_id INTEGER REFERENCES articles(id),
  article_b_id INTEGER REFERENCES articles(id),
  connection_type TEXT, -- related, supporting, opposing, extends
  description TEXT
);

-- Search indexes for natural language queries
CREATE INDEX idx_articles_fulltext ON articles USING gin(to_tsvector('english', full_text));
CREATE INDEX idx_articles_summary ON articles USING gin(to_tsvector('english', summary));
CREATE INDEX idx_claims_text ON claims USING gin(to_tsvector('english', claim_text));
CREATE INDEX idx_tags ON article_tags(tag);
```
A few design decisions worth explaining:
Why separate tables for tags and claims? Because they need to be independently queryable. I often want "all articles tagged with 'machine-learning' from the last month" or "all statistical claims about productivity." Normalized tables make these queries simple and fast.
Why full-text indexing? Because my primary query pattern is natural language search. PostgreSQL's built-in full-text search is surprisingly powerful for this use case. The gin indexes make these queries fast even as the database grows.
Why store full text? Storage is cheap. Context is expensive. When I'm doing research, I need to be able to read the full article within my system, not just see a summary and then go hunting for the original URL (which may have moved or gone behind a paywall).
Why serverless (Neon)? Because this is a personal project and I don't want to manage a database server. Neon scales to zero when I'm not using it and handles all the operational complexity. For a project like this, where usage is bursty (heavy during research periods, quiet otherwise), serverless is ideal.
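The two query patterns mentioned above — tag filtering and claim search — map directly onto parameterized SQL against this schema. Here's a sketch of how those queries might be built (the helper functions are hypothetical; values are passed as parameters rather than interpolated, to avoid SQL injection):

```javascript
// Hypothetical builders for the two common query patterns.
// Each returns { text, values } in the shape most Postgres clients accept.
function tagQuery(tag, sinceDays) {
  return {
    text: `SELECT a.id, a.title, a.url
           FROM articles a
           JOIN article_tags t ON t.article_id = a.id
           WHERE t.tag = $1
             AND a.published_at > NOW() - ($2 || ' days')::interval`,
    values: [tag, String(sinceDays)]
  };
}

function claimQuery(claimType, keyword) {
  return {
    text: `SELECT c.claim_text, a.title
           FROM claims c
           JOIN articles a ON a.id = c.article_id
           WHERE c.claim_type = $1
             AND to_tsvector('english', c.claim_text)
                 @@ plainto_tsquery('english', $2)`,
    values: [claimType, keyword]
  };
}
```

So "all articles tagged 'machine-learning' from the last month" is `tagQuery('machine-learning', 30)`, and "all statistical claims about productivity" is `claimQuery('statistical', 'productivity')` — each a single join, each covered by an index.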
Step 4: The Query Interface
This is where the system pays for itself. I can query my research database in natural language:
"What have I collected about AI in education this month?"
Returns articles, summaries, and key claims, sorted by relevance and recency.
"Find opposing viewpoints on remote work productivity."
Uses the connections table to identify articles with "opposing" relationships and the claims table to find contradictory statements.
"What sources have I collected that are rated high credibility on the topic of climate technology?"
Combines the credibility rating with tag-based filtering.
"Summarize everything I know about transformer architectures, citing sources."
Generates a synthesis across multiple articles, with citations back to the originals.
The natural language queries are translated to SQL by the AI layer, which understands my schema and can construct appropriate joins, filters, and aggregations. For complex queries, it often generates multi-step searches: first finding relevant articles, then extracting key claims, then synthesizing.
```javascript
// Conceptual query flow
async function query(naturalLanguageQuestion) {
  // AI translates to SQL (or multiple queries)
  const sqlQueries = await ai.translateToSQL({
    question: naturalLanguageQuestion,
    schema: DATABASE_SCHEMA,
    context: 'Personal research database with articles, tags, claims, and connections'
  });

  // Execute queries against Neon
  const results = await neonMcp.execute(sqlQueries);

  // AI synthesizes results into a human-readable answer
  const answer = await ai.synthesize({
    question: naturalLanguageQuestion,
    rawResults: results,
    format: 'detailed with citations'
  });

  return answer;
}
```
Real Results: Three Months In
After running this system for three months, here are the numbers:
- 2,847 articles scraped and processed
- 8,200+ tags assigned
- 4,100+ key claims extracted
- 1,900+ connections identified between articles
- Average query time: ~2 seconds for simple queries, ~8 seconds for synthesis queries
And here's what matters more than numbers: I actually use it. Every research project I've started in the last two months has begun with a query to my personal database, and every single time, I've found relevant material I'd already collected but would never have remembered or found through manual browsing.
Three real examples:
Example 1: I was writing about AI in healthcare. Queried my database and found 23 relevant articles I'd collected over the previous months, including a research paper I'd completely forgotten about that became the foundation of my piece. Time saved: probably 4-5 hours of searching and re-reading.
Example 2: A colleague asked me for references on the environmental impact of large language models. I queried my database, got 8 articles with credibility ratings and key claims, and sent them a synthesized summary within 10 minutes. Without the system, I would have spent an hour digging through bookmarks and probably given up.
Example 3: I noticed a pattern in my collected articles that I hadn't consciously identified: three separate sources, from different fields, were making similar arguments about attention fragmentation. The connections table surfaced this. I wrote an article connecting the three perspectives that I never would have conceived without the system's cross-referencing.
Maintenance and Costs
Time maintenance: About 30 minutes per week. Mostly reviewing the AI's processing for accuracy, adding manual "save this article" items, and occasionally adjusting scraping sources.
Neon costs: Minimal for a personal project. Serverless means I'm paying for actual compute and storage, not a running server. My monthly bill has been under $5.
Apify costs: Depends on scraping volume. My RSS feed scraping fits well within free tier limits. The search-based discovery uses more compute but stays under $10/month.
AI processing costs: This is the biggest variable. Processing 2,800+ articles through the AI layer for summarization, claim extraction, and connection mapping adds up. I've optimized by using smaller models for routine processing and only using larger models for synthesis queries. Total: roughly $15-20/month.
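The routing decision behind that optimization is simple enough to sketch. Model names and per-token prices below are placeholders, not real API identifiers or actual rates:

```javascript
// Hypothetical cost-routing sketch: routine per-article tasks go to a small
// model; cross-article synthesis goes to a large one. Names and prices are
// illustrative placeholders.
const ROUTINE_TASKS = new Set(['summarize', 'extract-claims', 'tag', 'classify']);

function pickModel(task) {
  return ROUTINE_TASKS.has(task)
    ? { model: 'small-model', approxCostPer1kTokens: 0.0005 }
    : { model: 'large-model', approxCostPer1kTokens: 0.01 };
}
```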
Total monthly cost: About $25-30 for a personal research infrastructure that would have been unimaginable five years ago.
Lessons for Builders
If you want to build something similar, here's what I'd do differently with the benefit of hindsight:
Start with fewer sources. I started with too many RSS feeds and was overwhelmed by volume. Begin with 5-10 sources you truly care about and expand gradually.
Invest time in the schema. My first schema was too simple (just articles and tags). Adding the claims and connections tables was a game-changer. Think about your actual query patterns before designing the database.
Don't over-process. My initial AI processing prompt was too ambitious — I wanted sentiment analysis, readability scores, and entity extraction in addition to the core features. Most of that went unused. Process what you'll actually query.
Build the query interface first. Start with the retrieval experience and work backward. Knowing how you want to search your data informs every upstream decision.
Accept imperfection. The AI's tag assignments are wrong maybe 10% of the time. Its claim extraction misses things. Its credibility ratings are rough. That's fine. The system doesn't need to be perfect — it needs to be better than the alternative, which is bookmarks and memory. That's a very low bar.
The Philosophy
I want to end with something less technical.
The reason I built this system — the real reason, underneath the tooling and the SQL and the MCP integrations — is that I believe in compounding knowledge. Every article I read should make me smarter not just today, but next month and next year. Every connection between ideas should be discoverable, not trapped in the unreliable storage of biological memory.
We're living through an information explosion. The amount of valuable knowledge being published daily is staggering, and it's growing. No human brain can keep up. But a human brain augmented with intelligent storage, processing, and retrieval? That brain can operate at a different level.
Apify MCP gives me the ability to gather information from across the web. Neon MCP gives me a place to store it that's powerful, flexible, and maintenance-free. The AI layer gives me the processing and retrieval capabilities that turn raw information into structured, searchable, connected knowledge.
Together, they form something that feels genuinely new: a personal research infrastructure that grows more valuable with every article it processes.
I started this project as a weekend experiment. Three months later, it's become indispensable. Not because the technology is impressive (though it is), but because it solved a real problem that I'd been failing to solve for years: the gap between what I read and what I can remember.
The weeds are deep, but the view from down here is pretty good.