In the Weeds: Building a Knowledge Graph with txtai

joey-io · 5 min read

A deep technical guide to building a semantic knowledge graph using txtai — from embedding your documents to traversing relationships that traditional search would never surface.

Why Knowledge Graphs Still Matter

Everyone is talking about RAG. Retrieval Augmented Generation. Stuff your documents into a vector database, do a similarity search, feed the results to an LLM. It works. But it has a ceiling.

The ceiling is this: vector similarity finds documents that are about similar things. It does not find documents that are connected in meaningful but semantically distant ways. Your quarterly financial report and your engineering sprint retrospective might be deeply connected — the budget cut caused the team reduction that caused the velocity drop — but their embeddings are in completely different neighborhoods.

Knowledge graphs solve this. They encode relationships explicitly. And txtai gives you a remarkably elegant way to build one that combines the best of both worlds: semantic understanding plus relational structure.

The Architecture

Here is what we are building:

Documents → Embeddings (txtai) → Entities (extracted) → Relationships (graph edges) → Queryable Graph

txtai is not just an embeddings library. It is a full semantic search and workflow engine that supports graphs natively. The key insight is using its pipeline system to chain extraction, embedding, and graph construction into a single flow.

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Entity, Labels
from txtai.graph import GraphFactory

# Initialize embeddings with graph support
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        "backend": "networkx",
        "batchsize": 256
    }
})

# Entity extraction pipeline
entity = Entity("dslim/bert-base-NER")
```

The graph configuration tells txtai to automatically build relationships between indexed documents based on shared entities, topics, and semantic proximity. This is the magic — you get a graph without manually specifying edges.
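To make the "shared entities" signal concrete, here is a library-free sketch of the idea: toy documents with pre-extracted entity sets, and an edge drawn between any pair that overlaps. The document ids and entities are invented for illustration; this is not txtai's internal algorithm, just the shape of it.

```python
from itertools import combinations

# Toy documents with entities already extracted (in the real pipeline,
# the Entity model produces these from raw text).
docs = {
    "rfc-042": {"entities": {"Postgres", "export API"}},
    "ticket-8834": {"entities": {"export API", "timeout"}},
    "roadmap-q2": {"entities": {"Postgres", "performance"}},
}

# Draw an undirected edge between documents sharing at least one entity,
# keeping the shared entities as edge metadata.
edges = {}
for a, b in combinations(docs, 2):
    shared = docs[a]["entities"] & docs[b]["entities"]
    if shared:
        edges[(a, b)] = sorted(shared)

print(edges)
# The RFC links to both the ticket and the roadmap item; the ticket and
# roadmap item share nothing directly, so no edge connects them.
```

Semantic proximity and topic overlap add further edges on top of this entity signal, which is why the resulting graph is denser than any one heuristic alone.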

Feeding the Graph

The power of this approach shows when you index heterogeneous content. I built a knowledge graph for a startup's internal documentation that included:

  • Engineering RFCs
  • Customer support tickets
  • Sales call transcripts
  • Product roadmap documents
  • Slack channel exports (using the Slack MCP to pull these programmatically)
```python
import json
import os
from pathlib import Path

# Collect documents from multiple sources
documents = []

# Load markdown files
for f in Path("./docs").rglob("*.md"):
    documents.append({
        "id": str(f),
        "text": f.read_text(),
        "source": "docs",
        "type": "engineering"
    })

# Load support tickets (exported as JSON)
tickets = json.loads(Path("./tickets.json").read_text())
for ticket in tickets:
    documents.append({
        "id": f"ticket-{ticket['id']}",
        "text": f"{ticket['subject']}\n{ticket['body']}",
        "source": "support",
        "type": "customer"
    })

# Index everything
embeddings.index([(doc["id"], doc, None) for doc in documents])
```

Once indexed, txtai's graph reveals connections that no keyword search or even vector search would surface. A customer complaint about slow exports is now linked to an engineering RFC about database query optimization, linked to a roadmap item about performance improvements, linked to a sales call where a prospect asked about export speeds.

Traversing Relationships

The graph query interface lets you walk these connections:

```python
# Find all documents related to a specific one, via graph edges
results = embeddings.graph.search(
    "SELECT id, text FROM txtai WHERE similar('database performance') AND graph_distance <= 2",
    limit=20
)

# Or traverse from a specific node
neighbors = embeddings.graph.neighbors("rfc-042-query-optimization")
for neighbor in neighbors:
    edge_data = embeddings.graph.edge(
        "rfc-042-query-optimization", neighbor
    )
    print(f"  → {neighbor} (relationship: {edge_data.get('type', 'semantic')})")
```

The graph_distance parameter is powerful. Distance 1 gives you directly connected documents. Distance 2 gives you documents connected through one intermediary. Distance 3 starts revealing surprising, serendipitous connections — the kind that make knowledge graphs worth the effort.
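What a distance cap means operationally can be sketched with a plain breadth-first search over a toy adjacency list (illustrative ids, not txtai's implementation):

```python
from collections import deque

# Tiny adjacency list standing in for the knowledge graph.
graph = {
    "ticket-8834": ["rfc-042"],
    "rfc-042": ["ticket-8834", "roadmap-q2"],
    "roadmap-q2": ["rfc-042", "sales-call-march-15"],
    "sales-call-march-15": ["roadmap-q2"],
}

def within_distance(graph, start, max_dist):
    """Breadth-first search: nodes reachable from start in <= max_dist hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_dist:
            continue  # do not expand past the distance cap
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    seen.pop(start)  # exclude the query node itself
    return seen

print(within_distance(graph, "ticket-8834", 2))
# → {'rfc-042': 1, 'roadmap-q2': 2}
```

At distance 2 the support ticket already reaches the roadmap item; raising the cap to 3 would pull in the sales call, which is exactly the "serendipitous connection" effect described above.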

Adding Custom Relationship Types

The automatic graph is useful, but custom relationships make it powerful. I add explicit edges for relationships the semantic model cannot detect:

```python
# Add explicit edges for causal relationships
embeddings.graph.addedge(
    "rfc-042-query-optimization",
    "ticket-8834",
    type="resolves",
    confidence=0.9
)

# Temporal relationships
embeddings.graph.addedge(
    "roadmap-q2-2026",
    "rfc-042-query-optimization",
    type="motivated",
    direction="forward"
)

# Contradiction detection
embeddings.graph.addedge(
    "sales-call-march-15",
    "roadmap-q2-2026",
    type="contradicts",
    note="Sales promised feature by April, roadmap says Q3"
)
```

That last one — contradiction detection — is where knowledge graphs earn their keep. Finding contradictions between what different parts of an organization believe is extraordinarily valuable. And with a graph, you can query for them:

```python
# Find all contradictions
contradictions = embeddings.graph.search(
    "SELECT source, target, note FROM edges WHERE type = 'contradicts'"
)
```

Visualization

A graph you cannot see is a graph you cannot use. I export to standard formats for visualization:

```python
import json
from pathlib import Path

import networkx as nx
from networkx.readwrite import json_graph

# Export the graph
G = embeddings.graph.backend
nx.write_gexf(G, "knowledge_graph.gexf")

# For web visualization, export to JSON
data = json_graph.node_link_data(G)
Path("graph.json").write_text(json.dumps(data))
```

I render the web visualization with D3.js force-directed layout. Color nodes by source type (engineering, support, sales, roadmap), size them by connection count, and make edges clickable to show relationship metadata. The result is a living map of organizational knowledge.
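The node and link attributes that drive such a layout (color by source type, size by connection count) can be precomputed before export. A library-free sketch with made-up document ids:

```python
# Illustrative edges and source types for a handful of documents.
edges = [("rfc-042", "ticket-8834"), ("rfc-042", "roadmap-q2")]
sources = {
    "rfc-042": "engineering",
    "ticket-8834": "support",
    "roadmap-q2": "roadmap",
}

# Size each node by its connection count (degree).
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# D3 force layouts consume node-link JSON of roughly this shape.
nodes = [
    {"id": n, "group": sources[n], "size": degree.get(n, 1)}
    for n in sources
]
links = [{"source": a, "target": b} for a, b in edges]

print(nodes[0])
# → {'id': 'rfc-042', 'group': 'engineering', 'size': 2}
```

The `group` field maps to a color scale on the D3 side and `size` to node radius; the field names here are conventions, not requirements.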

Integration with LLM Queries

The final piece is using the graph to enhance LLM responses. Instead of basic RAG — retrieve similar documents, stuff them in the context — I do graph-augmented retrieval:

```python
def graph_augmented_query(question, embeddings, llm):
    # Step 1: Find directly relevant documents
    direct = embeddings.search(question, limit=5)

    # Step 2: Expand via graph — get neighbors of top results
    expanded = set()
    for result in direct[:3]:
        neighbors = embeddings.graph.neighbors(result["id"])
        for n in neighbors[:5]:
            expanded.add(n)

    # Step 3: Fetch expanded context
    context_docs = direct + [
        embeddings.search(f"id:{doc_id}")[0]
        for doc_id in expanded
    ]

    # Step 4: Include relationship metadata
    relationships = []
    for doc in context_docs:
        edges = embeddings.graph.edges(doc["id"])
        for edge in edges:
            relationships.append(
                f"{edge['source']} --[{edge['type']}]--> {edge['target']}"
            )

    # Step 5: Prompt with graph context
    context = "\n".join([d["text"][:500] for d in context_docs])
    rels = "\n".join(relationships)

    prompt = f"""Based on the following documents and their relationships:

DOCUMENTS:
{context}

RELATIONSHIPS:
{rels}

QUESTION: {question}"""

    return llm(prompt)
```

This gives the LLM not just relevant content but the structure of how that content relates. The answers are dramatically better — they capture causality, contradiction, and temporal sequence in ways that flat RAG cannot.

Performance Considerations

For graphs under 100,000 nodes, NetworkX is fine. Beyond that, switch to a dedicated graph database:

```python
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        "backend": "rdbms",
        "url": "postgresql://user:pass@localhost/graphdb"
    }
})
```

For our use case at a-gnt, we use Neon MCP for serverless Postgres with pgvector. The embeddings live in Postgres, the graph edges live in Postgres, and txtai orchestrates both. Deployment is a single database connection string.

If you need real-time updates — new documents being indexed, new relationships being discovered — n8n webhooks can trigger re-indexing workflows whenever content changes. Push a document to your knowledge base, and within seconds the graph updates with new nodes and edges.
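The webhook-to-index handoff is a small amount of glue. A minimal sketch of the shape, with a hypothetical payload and a plain dict standing in for the txtai index (in production the body of `handle_webhook` would call the index's upsert):

```python
import json

# Hypothetical webhook payload announcing a new or changed document.
payload = json.loads("""{
    "id": "ticket-9001",
    "text": "Export still slow after the 2.3 release",
    "source": "support"
}""")

# Stand-in for the txtai index, keyed by document id.
index = {}

def handle_webhook(payload, index):
    """Upsert the changed document: replace if the id exists, insert if not."""
    index[payload["id"]] = {
        "text": payload["text"],
        "source": payload["source"],
    }
    return payload["id"]

doc_id = handle_webhook(payload, index)
print(doc_id, index[doc_id]["source"])
# → ticket-9001 support
```

Because the handler is keyed by id, replaying the same webhook is idempotent, which matters when automation platforms retry failed deliveries.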

When to Use This vs. Plain RAG

Plain RAG: You have a documentation site and want a chatbot that answers questions from it. The documents are self-contained. Relationships between them do not matter much.

Knowledge Graph: You have heterogeneous information across multiple systems. Understanding how pieces connect is as important as finding individual pieces. You need to detect contradictions, trace causality, or navigate organizational knowledge.

The Supabase MCP can handle the plain RAG case beautifully. But when your data has structure, when relationships are the insight — that is when txtai and its graph capabilities justify the added complexity.

Build the graph. Let the connections surprise you.
