In the Weeds: Building a Knowledge Graph with txtai

joey-io · 5 min read

A deep technical guide to building a semantic knowledge graph using txtai — from embedding your documents to traversing relationships that traditional search would never surface.

Why Knowledge Graphs Still Matter

Everyone is talking about RAG. Retrieval Augmented Generation. Stuff your documents into a vector database, do a similarity search, feed the results to an LLM. It works. But it has a ceiling.

The ceiling is this: vector similarity finds documents that are about similar things. It does not find documents that are connected in meaningful but semantically distant ways. Your quarterly financial report and your engineering sprint retrospective might be deeply connected — the budget cut caused the team reduction that caused the velocity drop — but their embeddings are in completely different neighborhoods.

Knowledge graphs solve this. They encode relationships explicitly. And txtai gives you a remarkably elegant way to build one that combines the best of both worlds: semantic understanding plus relational structure.

The Architecture

Here is what we are building:

Documents → Embeddings (txtai) → Entities (extracted) → Relationships (graph edges) → Queryable Graph

txtai is not just an embeddings library. It is a full semantic search and workflow engine that supports graphs natively. The key insight is using its pipeline system to chain extraction, embedding, and graph construction into a single flow.

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Entity, Labels
from txtai.graph import GraphFactory

# Initialize embeddings with graph support
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        "backend": "networkx",
        "batchsize": 256
    }
})

# Entity extraction pipeline
entity = Entity("dslim/bert-base-NER")
```

The graph configuration tells txtai to automatically build relationships between indexed documents based on shared entities, topics, and semantic proximity. This is the magic — you get a graph without manually specifying edges.
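To make the "shared entities" signal concrete, here is a library-free sketch of the idea: toy documents with pre-extracted entity sets, and an edge drawn between any pair that overlaps. The document ids and entities are invented for illustration; this is not txtai's internal algorithm, just the shape of it.

```python
from itertools import combinations

# Toy documents with entities already extracted (in the real pipeline,
# the Entity model produces these from raw text).
docs = {
    "rfc-042": {"entities": {"Postgres", "export API"}},
    "ticket-8834": {"entities": {"export API", "timeout"}},
    "roadmap-q2": {"entities": {"Postgres", "performance"}},
}

# Draw an undirected edge between documents sharing at least one entity,
# keeping the shared entities as edge metadata.
edges = {}
for a, b in combinations(docs, 2):
    shared = docs[a]["entities"] & docs[b]["entities"]
    if shared:
        edges[(a, b)] = sorted(shared)

print(edges)
# The RFC links to both the ticket and the roadmap item; the ticket and
# roadmap item share nothing directly, so no edge connects them.
```

Semantic proximity and topic overlap add further edges on top of this entity signal, which is why the resulting graph is denser than any one heuristic alone.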

Feeding the Graph

The power of this approach shows when you index heterogeneous content. I built a knowledge graph for a startup's internal documentation that included:

  • Engineering RFCs
  • Customer support tickets
  • Sales call transcripts
  • Product roadmap documents
  • Slack channel exports (using the Slack MCP to pull these programmatically)
```python
import json
import os
from pathlib import Path

# Collect documents from multiple sources
documents = []

# Load markdown files
for f in Path("./docs").rglob("*.md"):
    documents.append({
        "id": str(f),
        "text": f.read_text(),
        "source": "docs",
        "type": "engineering"
    })

# Load support tickets (exported as JSON)
tickets = json.loads(Path("./tickets.json").read_text())
for ticket in tickets:
    documents.append({
        "id": f"ticket-{ticket['id']}",
        "text": f"{ticket['subject']}\n{ticket['body']}",
        "source": "support",
        "type": "customer"
    })

# Index everything
embeddings.index([(doc["id"], doc, None) for doc in documents])
```

Once indexed, txtai's graph reveals connections that no keyword search or even vector search would surface. A customer complaint about slow exports is now linked to an engineering RFC about database query optimization, linked to a roadmap item about performance improvements, linked to a sales call where a prospect asked about export speeds.

Traversing Relationships

The graph query interface lets you walk these connections:

```python
# Find all documents related to a specific one, via graph edges
results = embeddings.graph.search(
    "SELECT id, text FROM txtai WHERE similar('database performance') AND graph_distance <= 2",
    limit=20
)

# Or traverse from a specific node
neighbors = embeddings.graph.neighbors("rfc-042-query-optimization")
for neighbor in neighbors:
    edge_data = embeddings.graph.edge(
        "rfc-042-query-optimization", neighbor
    )
    print(f"  → {neighbor} (relationship: {edge_data.get('type', 'semantic')})")
```

The graph_distance parameter is powerful. Distance 1 gives you directly connected documents. Distance 2 gives you documents connected through one intermediary. Distance 3 starts revealing surprising, serendipitous connections — the kind that make knowledge graphs worth the effort.
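What a distance cap means operationally can be sketched with a plain breadth-first search over a toy adjacency list (illustrative ids, not txtai's implementation):

```python
from collections import deque

# Tiny adjacency list standing in for the knowledge graph.
graph = {
    "ticket-8834": ["rfc-042"],
    "rfc-042": ["ticket-8834", "roadmap-q2"],
    "roadmap-q2": ["rfc-042", "sales-call-march-15"],
    "sales-call-march-15": ["roadmap-q2"],
}

def within_distance(graph, start, max_dist):
    """Breadth-first search: nodes reachable from start in <= max_dist hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_dist:
            continue  # do not expand past the distance cap
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    seen.pop(start)  # exclude the query node itself
    return seen

print(within_distance(graph, "ticket-8834", 2))
# → {'rfc-042': 1, 'roadmap-q2': 2}
```

At distance 2 the support ticket already reaches the roadmap item; raising the cap to 3 would pull in the sales call, which is exactly the "serendipitous connection" effect described above.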

Adding Custom Relationship Types

The automatic graph is useful, but custom relationships make it powerful. I add explicit edges for relationships the semantic model cannot detect:

```python
# Add explicit edges for causal relationships
embeddings.graph.addedge(
    "rfc-042-query-optimization",
    "ticket-8834",
    type="resolves",
    confidence=0.9
)

# Temporal relationships
embeddings.graph.addedge(
    "roadmap-q2-2026",
    "rfc-042-query-optimization",
    type="motivated",
    direction="forward"
)

# Contradiction detection
embeddings.graph.addedge(
    "sales-call-march-15",
    "roadmap-q2-2026",
    type="contradicts",
    note="Sales promised feature by April, roadmap says Q3"
)
```

That last one — contradiction detection — is where knowledge graphs earn their keep. Finding contradictions between what different parts of an organization believe is extraordinarily valuable. And with a graph, you can query for them:

```python
# Find all contradictions
contradictions = embeddings.graph.search(
    "SELECT source, target, note FROM edges WHERE type = 'contradicts'"
)
```

Visualization

A graph you cannot see is a graph you cannot use. I export to standard formats for visualization:

```python
import json
from pathlib import Path

import networkx as nx
from networkx.readwrite import json_graph

# Export the graph
G = embeddings.graph.backend
nx.write_gexf(G, "knowledge_graph.gexf")

# For web visualization, export to JSON
data = json_graph.node_link_data(G)
Path("graph.json").write_text(json.dumps(data))
```

I render the web visualization with D3.js force-directed layout. Color nodes by source type (engineering, support, sales, roadmap), size them by connection count, and make edges clickable to show relationship metadata. The result is a living map of organizational knowledge.
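The node and link attributes that drive such a layout (color by source type, size by connection count) can be precomputed before export. A library-free sketch with made-up document ids:

```python
# Illustrative edges and source types for a handful of documents.
edges = [("rfc-042", "ticket-8834"), ("rfc-042", "roadmap-q2")]
sources = {
    "rfc-042": "engineering",
    "ticket-8834": "support",
    "roadmap-q2": "roadmap",
}

# Size each node by its connection count (degree).
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# D3 force layouts consume node-link JSON of roughly this shape.
nodes = [
    {"id": n, "group": sources[n], "size": degree.get(n, 1)}
    for n in sources
]
links = [{"source": a, "target": b} for a, b in edges]

print(nodes[0])
# → {'id': 'rfc-042', 'group': 'engineering', 'size': 2}
```

The `group` field maps to a color scale on the D3 side and `size` to node radius; the field names here are conventions, not requirements.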

Integration with LLM Queries

The final piece is using the graph to enhance LLM responses. Instead of basic RAG — retrieve similar documents, stuff them in the context — I do graph-augmented retrieval:

```python
def graph_augmented_query(question, embeddings, llm):
    # Step 1: Find directly relevant documents
    direct = embeddings.search(question, limit=5)

    # Step 2: Expand via graph — get neighbors of top results
    expanded = set()
    for result in direct[:3]:
        neighbors = embeddings.graph.neighbors(result["id"])
        for n in neighbors[:5]:
            expanded.add(n)

    # Step 3: Fetch expanded context
    context_docs = direct + [
        embeddings.search(f"id:{doc_id}")[0]
        for doc_id in expanded
    ]

    # Step 4: Include relationship metadata
    relationships = []
    for doc in context_docs:
        edges = embeddings.graph.edges(doc["id"])
        for edge in edges:
            relationships.append(
                f"{edge['source']} --[{edge['type']}]--> {edge['target']}"
            )

    # Step 5: Prompt with graph context
    context = "\n".join([d["text"][:500] for d in context_docs])
    rels = "\n".join(relationships)

    prompt = f"""Based on the following documents and their relationships:

DOCUMENTS:
{context}

RELATIONSHIPS:
{rels}

QUESTION: {question}"""

    return llm(prompt)
```

This gives the LLM not just relevant content but the structure of how that content relates. The answers are dramatically better — they capture causality, contradiction, and temporal sequence in ways that flat RAG cannot.

Performance Considerations

For graphs under 100,000 nodes, NetworkX is fine. Beyond that, switch to a dedicated graph database:

```python
embeddings = Embeddings({
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "graph": {
        "backend": "rdbms",
        "url": "postgresql://user:pass@localhost/graphdb"
    }
})
```

For our use case at a-gnt, we use Neon MCP for serverless Postgres with pgvector. The embeddings live in Postgres, the graph edges live in Postgres, and txtai orchestrates both. Deployment is a single database connection string.

If you need real-time updates — new documents being indexed, new relationships being discovered — n8n webhooks can trigger re-indexing workflows whenever content changes. Push a document to your knowledge base, and within seconds the graph updates with new nodes and edges.
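The webhook-to-index handoff is a small amount of glue. A minimal sketch of the shape, with a hypothetical payload and a plain dict standing in for the txtai index (in production the body of `handle_webhook` would call the index's upsert):

```python
import json

# Hypothetical webhook payload announcing a new or changed document.
payload = json.loads("""{
    "id": "ticket-9001",
    "text": "Export still slow after the 2.3 release",
    "source": "support"
}""")

# Stand-in for the txtai index, keyed by document id.
index = {}

def handle_webhook(payload, index):
    """Upsert the changed document: replace if the id exists, insert if not."""
    index[payload["id"]] = {
        "text": payload["text"],
        "source": payload["source"],
    }
    return payload["id"]

doc_id = handle_webhook(payload, index)
print(doc_id, index[doc_id]["source"])
# → ticket-9001 support
```

Because the handler is keyed by id, replaying the same webhook is idempotent, which matters when automation platforms retry failed deliveries.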

When to Use This vs. Plain RAG

Plain RAG: You have a documentation site and want a chatbot that answers questions from it. The documents are self-contained. Relationships between them do not matter much.

Knowledge Graph: You have heterogeneous information across multiple systems. Understanding how pieces connect is as important as finding individual pieces. You need to detect contradictions, trace causality, or navigate organizational knowledge.

The Supabase MCP can handle the plain RAG case beautifully. But when your data has structure, when relationships are the insight — that is when txtai and its graph capabilities justify the added complexity.

Build the graph. Let the connections surprise you.
