SVQ.AI
This is my thesis project, an agentic Retrieval-Augmented Generation (RAG) system with five retrieval tools: vector search over raw chunks, vector search over entities, vector search over relations, hybrid graph search, and fuzzy literal search.
Datasource organization
Every corpus in SVQ is a datasource: a named, isolated collection of documents owned by one user. You create one, upload PDFs into it, and chat with it.
POST /api/data-sources
Authorization: Bearer <jwt>
{ "name": "my-legal-contracts" }
{ "msg": "Data source created successfully",
"dataSource": { "_id": "664f...", "name": "my-legal-contracts" } }
Creating a datasource provisions three MongoDB collections:
await Promise.all([
dbService.createCollection(getDb(), `chunks-${insertedId}`),
dbService.createCollection(getDb(), `entities-${insertedId}`),
dbService.createCollection(getDb(), `relations-${insertedId}`),
]);
Each one gets its own Atlas vector search index, with 3072-dimensional embeddings and cosine similarity:
const vectorIndexDefinition = {
name: `${collection_name}_vector_index`,
type: "vectorSearch",
definition: {
fields: [{ path: "embedding", numDimensions: 3072,
similarity: "cosine", type: "vector" }],
},
};
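Once an index exists, a collection can be queried with Atlas's $vectorSearch aggregation stage. The following is a minimal sketch of how such a pipeline could be assembled; buildVectorSearchPipeline, its option names, and the 10x default for numCandidates are illustrative assumptions, not code from the SVQ repository. Only the index naming scheme and the embedding path come from the definition above.

```typescript
// Sketch only: the function name, option names and the numCandidates default
// are assumptions made for illustration.
interface VectorSearchOptions {
  collectionName: string; // e.g. "chunks-<dataSourceId>"
  queryVector: number[];  // embedding of the user query (3072 numbers in SVQ)
  limit: number;          // how many documents to return
  numCandidates?: number; // nearest-neighbour candidates to scan; must be >= limit
}

function buildVectorSearchPipeline(opts: VectorSearchOptions): Record<string, unknown>[] {
  // Atlas caps numCandidates at 10000; 10x the limit is an arbitrary default.
  const numCandidates = Math.min(opts.numCandidates ?? opts.limit * 10, 10000);
  return [
    {
      $vectorSearch: {
        index: `${opts.collectionName}_vector_index`, // same naming as the index definition
        path: "embedding",
        queryVector: opts.queryVector,
        numCandidates,
        limit: opts.limit,
      },
    },
    // Attach the similarity score to each hit.
    { $set: { score: { $meta: "vectorSearchScore" } } },
    // Drop the large embedding array before returning results.
    { $project: { embedding: 0 } },
  ];
}
```

The returned array would be passed to the collection's aggregate() call. Keeping the pipeline builder pure makes it easy to reuse across the chunk, entity, and relation collections.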
Extraction
When a PDF is uploaded, we extract:
- Chunks: overlapping windows of raw text from the document.
- Entities: things mentioned in the text. People, organizations, places, concepts, dates. Each one with a description and a set of types.
- Relations: connections between those entities. “Person A owes Person B.” “Drug X interacts with Drug Y.” “Author Z cites Author W.”
Dense vector search over chunks is great at scoped queries: “find me a passage that’s semantically similar to this question.” However, when you want to understand relationships (“what are the obligations of A to B?”), you need entities and relations, because that information lives in the edges of the graph.
The graph extraction approach is inspired by LightRAG: instead of fine-tuning a model to extract entities, you ask GPT-4o-mini to do it per chunk, with a long structured prompt and examples. It’s slower but more flexible than a real NER pipeline.
Ingestion
Chunking strategy. The splitter is LangChain’s RecursiveCharacterTextSplitter, which prefers natural boundaries (paragraphs first, then lines, then words) over slicing at a fixed character count. Chunks are up to 1000 characters with a small overlap, and each one stores its start and end offsets in the original document. Those offsets help later, when we need to find which PDF page a chunk came from to show the user the original.
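To make the offset bookkeeping concrete, here is a simplified stand-in for the splitter. It is not LangChain's algorithm: it only backs off to the last paragraph break, newline, or space inside each window. The function name, the exclusive end offset, and the overlap of 50 characters are assumptions chosen for illustration.

```typescript
// Simplified sketch, NOT RecursiveCharacterTextSplitter itself.
// `end` is exclusive here, so text.slice(start, end) recovers the chunk.
interface Chunk {
  text: string;
  start: number;
  end: number;
}

function chunkWithOffsets(text: string, chunkSize = 1000, overlap = 50): Chunk[] {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      // Prefer to cut at a paragraph break, then a newline, then a space.
      const windowText = text.slice(start, end);
      for (const sep of ["\n\n", "\n", " "]) {
        const cut = windowText.lastIndexOf(sep);
        if (cut > overlap) {
          end = start + cut + sep.length;
          break;
        }
      }
    }
    chunks.push({ text: text.slice(start, end), start, end });
    if (end === text.length) break;
    start = end - overlap; // step back so consecutive chunks share some context
  }
  return chunks;
}
```

Because every chunk carries its own offsets, a later step can map it back to the original document without re-searching for the text.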
Entity extraction is parallelized but bounded. For every chunk, the system asks GPT-4o-mini to extract entities and relations using a Zod-typed schema:
export const entity_extraction_output_schema_zod = z.object({
entities: z.array(EntitySchema),
relationships: z.array(RelationshipSchema),
content_keywords: z.array(z.string()),
}).strict();
All chunks are queued at once, but p-limit(8) caps the work at eight in-flight requests to stay within OpenAI API rate limits.
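The bounding behaviour can be illustrated without any dependency. The sketch below hand-rolls a small limiter that plays the role of p-limit; createLimiter, extractFromChunk, and extractAll are hypothetical names, and the extraction call is a stub rather than a real LLM request.

```typescript
// Hand-rolled stand-in for p-limit so the sketch is dependency-free.
function createLimiter(maxConcurrent: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  // Called when a task settles: free the slot and start the next queued task.
  const release = () => {
    active--;
    const run = queue.shift();
    if (run) run();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        active++;
        task().then(
          (value) => { release(); resolve(value); },
          (error) => { release(); reject(error); },
        );
      };
      if (active < maxConcurrent) run();
      else queue.push(run);
    });
  };
}

// Hypothetical per-chunk extraction; a real one would call the LLM API.
async function extractFromChunk(chunk: string): Promise<number> {
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  return chunk.length;
}

async function extractAll(chunks: string[], maxConcurrent = 8): Promise<number[]> {
  const limit = createLimiter(maxConcurrent);
  // Every task is created up front; the limiter decides when each one starts.
  return Promise.all(chunks.map((chunk) => limit(() => extractFromChunk(chunk))));
}
```

Results come back in input order because Promise.all preserves the order of the promises it was given, regardless of which request finishes first.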
Deduplication of entities. When chunk 17 mentions “Acme Corp” and chunk 42 also mentions “Acme Corp,” you don’t want two separate entities but a single entity with both chunks listed as sources. The merging logic concatenates descriptions, unions the type sets, unions the source chunk IDs, and re-embeds the merged result so vector search reflects the combined context:
if (entityMap.has(entity.entity_name)) {
const existing = entityMap.get(entity.entity_name)!;
existing.entity_description = (existing.entity_description || '') + ' ' + entity.entity_description;
existing.entity_types = [...new Set([...existing.entity_types, ...entity.entity_types])];
existing.chunk_ids = [...new Set([...existing.chunk_ids, ...entity.chunk_ids])];
}
Similarly, relations are keyed on ${source_entity}==>${target_entity}. When the same relation shows up in multiple chunks, the relationship_strength field is updated, so that several independent mentions produce a stronger edge than a single one.
Local then global. Dedupe happens twice: first against everything extracted from the current file (to merge within the upload), then against everything already in the database (to merge with the existing graph). This allows incremental uploads: a new document doesn’t get its own disconnected graph, it just extends the existing one.
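A sketch of the relation merge might look like the following. The key format and the relationship_strength field come from the description above; the other field names, the mergeRelations helper, and in particular the choice to add strengths together are assumptions, since the exact update rule is not stated here.

```typescript
// Sketch under assumptions: field names other than relationship_strength,
// and the additive strength rule, are illustrative guesses.
interface Relation {
  source_entity: string;
  target_entity: string;
  relationship_description: string;
  relationship_strength: number;
  chunk_ids: string[];
}

const relationKey = (r: Relation): string => `${r.source_entity}==>${r.target_entity}`;

function mergeRelations(relations: Relation[]): Relation[] {
  const relationMap = new Map<string, Relation>();
  for (const rel of relations) {
    const key = relationKey(rel);
    const existing = relationMap.get(key);
    if (!existing) {
      // Copy so the caller's objects are not mutated by later merges.
      relationMap.set(key, { ...rel, chunk_ids: rel.chunk_ids.slice() });
      continue;
    }
    existing.relationship_description += " " + rel.relationship_description;
    // Assumed rule: each additional mention adds its strength to the edge.
    existing.relationship_strength += rel.relationship_strength;
    existing.chunk_ids = Array.from(new Set(existing.chunk_ids.concat(rel.chunk_ids)));
  }
  return Array.from(relationMap.values());
}
```

Note that the key is directional: A==>B and B==>A stay separate edges under this scheme.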
By the end of an ingestion, the three collections look something like this for a single datasource:
chunks-664f... → 500 chunks, 3072-dim embedding each
entities-664f... → 90 entities, deduped, embedded on "name - description"
relations-664f... → 140 relations, deduped, embedded on "source => target - description"
All three can be searched independently.
Agentic retrieval
SVQ has five retrieval tools and lets an LLM decide which ones to call.
From src/services/llm/queryTools.ts:
| Tool | What it does | When the agent picks it |
|---|---|---|
| queryChunksVectorDB | Dense vector search over raw text chunks | Open-ended, descriptive questions |
| queryEntitiesVectorDB | Vector search over the entity collection | “Who/what is X” entity-centric |
| queryRelationsVectorDB | Vector search over the relation collection | “How is X connected to Y” relationship-centric |
| queryKnowledgeGraph | Hybrid (entities + relations) | Complex multi-hop questions |
| queryRawDocuments | Fuzzy literal search (Ctrl+F with typo tolerance) | Exact codes, names, identifiers |
| answer | Terminate the loop and return a final response | When the agent has gathered enough |
The loop runs for at most five iterations:
for (let iteration = 0; iteration < MAX_ITERATIONS; iteration++) {
const agentResponse = await agent.getNextAction(prompt, [], isSystemPrompt);
const toolCalls = agentResponse?.toolCalls;
const toolExecutionResults = await processAgentToolCalls(toolCalls, dataSourceId);
await agent.resolveToolCalls(toolExecutionResults);
const answerToolResult = toolExecutionResults.find(r => r.tool_name === 'answer' && r.success);
if (answerToolResult) {
finalAnswer = answerToolResult.result;
break;
}
}
A few salient details:
Forced tool use. The OpenAI call is configured with tool_choice: "required", so the model can’t just respond with text. It has to call something: either a retrieval tool to gather more context, or the answer tool to terminate.
Last-iteration forced answer. If the agent reaches iteration 4 (the fifth and last one, since the counter starts at 0) without having called answer, the prompt gets nudged:
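For illustration, a request body with forced tool use could be assembled as below. The outer shape follows the OpenAI-compatible chat completions format (tools plus tool_choice), and the tool names come from the table above. The parameter schemas (query, topK, answer), the descriptions, and the helper names are hypothetical and may differ from what queryTools.ts declares.

```typescript
// Sketch only: parameter schemas and helper names are assumptions.
interface ToolDefinition {
  type: "function";
  function: { name: string; description: string; parameters: Record<string, unknown> };
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Builds a retrieval tool that takes a free-text query and an optional result count.
function searchTool(name: string, description: string): ToolDefinition {
  return {
    type: "function",
    function: {
      name,
      description,
      parameters: {
        type: "object",
        properties: { query: { type: "string" }, topK: { type: "integer" } },
        required: ["query"],
      },
    },
  };
}

const agentTools: ToolDefinition[] = [
  searchTool("queryChunksVectorDB", "Dense vector search over raw text chunks"),
  searchTool("queryEntitiesVectorDB", "Vector search over the entity collection"),
  searchTool("queryRelationsVectorDB", "Vector search over the relation collection"),
  searchTool("queryKnowledgeGraph", "Hybrid search over entities and relations"),
  searchTool("queryRawDocuments", "Fuzzy literal search over raw document text"),
  {
    type: "function",
    function: {
      name: "answer",
      description: "Terminate the loop and return the final response",
      parameters: {
        type: "object",
        properties: { answer: { type: "string" } },
        required: ["answer"],
      },
    },
  },
];

function buildAgentRequest(messages: ChatMessage[]) {
  return {
    model: "deepseek-chat",
    messages,
    tools: agentTools,
    // "required" means the model must emit at least one tool call per turn.
    tool_choice: "required" as const,
  };
}
```

Modelling answer as a tool is what makes the forced mode workable: the only way for the model to finish is an explicit, structured call the loop can detect.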
prompt = "Given the information you attained so far provide an answer now, "
+ "if you found nothing, return an answer explaining that you couldn't "
+ "find anything and your reasoning.";
isSystemPrompt = true;
In the worst case a confused agent might burn all five iterations and still not have a great answer, but it will return something coherent (including “I couldn’t find this in the dataset, here’s why”) rather than looping forever.
The agent runs on DeepSeek. Embeddings use OpenAI’s text-embedding-3-large and entity extraction uses the cheap GPT-4o-mini, but the agent itself runs on deepseek-chat via DeepSeek’s OpenAI-compatible endpoint, because it’s substantially cheaper than OpenAI for this kind of multi-turn, multi-tool reasoning.
Streaming with SSE
As the agent reasons, it streams progress to the client via Server-Sent Events, a one-way HTTP streaming protocol that’s lighter than WebSocket:
const sendEvent = (data, event = 'message') => {
res.write(`event: ${event}\n`);
res.write(`data: ${JSON.stringify(data)}\n\n`);
};
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
res.flushHeaders();
await runAgentLoop(query, dataSource._id, existingHistory, sendEvent);
The logCallback passed into the agent loop is sendEvent, so when the agent calls queryEntitiesVectorDB, a “Querying Entities in the knowledge graph” event hits the client immediately. When the agent finally calls answer, the final answer arrives as the last event.
The user sees the system thinking, which reduces perceived latency.
Why SSE and not WebSocket? Only the server-to-client direction needs streaming here, since the client just sends one query and waits. WebSocket would give us a bidirectional channel we don’t need, plus reconnection logic we would have to handle ourselves. SSE uses plain HTTP, auto-reconnects on most clients, and is simpler to operate.
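On the receiving side, a client that reads the stream manually (for example with fetch, which can send the Authorization header, unlike the browser's EventSource) has to split the byte stream back into frames. The sketch below mirrors the framing written by sendEvent; formatSseFrame, parseSseBuffer, the event names, and the payload fields are illustrative assumptions, and the parser handles only the event and data fields.

```typescript
interface SseEvent {
  event: string;
  data: unknown;
}

// Same framing as the server-side sendEvent: an event line, a data line, a blank line.
function formatSseFrame(data: unknown, event = "message"): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Parses every complete frame in the buffer and returns the unfinished tail,
// which the caller should prepend to the next network read.
function parseSseBuffer(buffer: string): { events: SseEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // last piece is incomplete (or empty)
  const events: SseEvent[] = [];
  for (const frame of frames) {
    let event = "message";
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join("\n")) });
    }
  }
  return { events, rest };
}
```

Returning the leftover text matters because network reads can end in the middle of a frame; parsing only complete frames avoids JSON errors on partial payloads.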
Citations
Every claim in an answer has to be cited. The prompt enforces this with a strict format:
/citation[chunk_id, citation_text]
chunk_id is the MongoDB ObjectId of a chunk the agent retrieved, and citation_text must be an exact substring of that chunk’s content. The prompt spells out the rules:
Every factual or descriptive claim must be followed by a citation. The citation must come immediately after the claim it supports. Never fabricate citations or reference unverified information. If a claim cannot be verified via a chunk, explicitly state that.
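Because the format is strict, citations can also be checked mechanically after generation. The following sketch extracts the tags and flags any whose text is not an exact substring of the referenced chunk. The helper names are hypothetical, and the regular expression assumes that a chunk_id contains no comma and that citation_text contains no closing bracket.

```typescript
interface Citation {
  chunkId: string;
  text: string;
}

// Assumes chunk_id has no comma and citation_text has no "]".
const CITATION_PATTERN = "\\/citation\\[\\s*([^,\\]]+?)\\s*,\\s*([^\\]]*?)\\s*\\]";

function extractCitations(answer: string): Citation[] {
  const re = new RegExp(CITATION_PATTERN, "g"); // fresh regex so lastIndex starts at 0
  const citations: Citation[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(answer)) !== null) {
    citations.push({ chunkId: match[1], text: match[2] });
  }
  return citations;
}

// Returns the citations that point at an unknown chunk or quote text
// that does not appear verbatim in that chunk.
function findInvalidCitations(
  citations: Citation[],
  chunkContentById: Map<string, string>,
): Citation[] {
  return citations.filter((c) => {
    const content = chunkContentById.get(c.chunkId);
    return content === undefined || c.text.length === 0 || content.indexOf(c.text) === -1;
  });
}
```

A check like this does not prove that the cited text supports the claim, only that the quote is real; that still catches fabricated quotes and wrong chunk ids.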
When the application renders an answer, those citation tags become hover links. Clicking one shows the PDF page (or pages) the chunk came from, found by checking which pages overlap the chunk’s [start, end] character range:
let cumulativeOffset = 0;
for (let pageIndex = 0; pageIndex < pdf.numPages; pageIndex++) {
const page = await pdf.getPage(pageIndex + 1);
const textContent = await page.getTextContent();
const pageText = textContent.items.map(i => i.str).join(' ');
const pageStart = cumulativeOffset;
const pageEnd = pageStart + pageText.length - 1;
if (end >= pageStart && start <= pageEnd) {
pagesToExtract.push(pageIndex);
}
cumulativeOffset += pageText.length;
}
const newPdf = await PDFDocument.create();
const copiedPages = await newPdf.copyPages(originalPdf, pagesToExtract);
copiedPages.forEach((page) => newPdf.addPage(page));
return Buffer.from(await newPdf.save());
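The overlap test at the heart of that loop can be isolated into a pure function, which makes the arithmetic easier to check. This is a sketch with a hypothetical name (pagesOverlapping) that takes page text lengths instead of a PDF, and it treats start and end as inclusive offsets, matching the comparison in the loop above.

```typescript
// Returns the zero-based indices of pages whose character range intersects
// the chunk's [start, end] range (both ends inclusive).
function pagesOverlapping(pageLengths: number[], start: number, end: number): number[] {
  const pages: number[] = [];
  let cumulativeOffset = 0;
  pageLengths.forEach((length, pageIndex) => {
    const pageStart = cumulativeOffset;
    const pageEnd = pageStart + length - 1;
    // Two closed intervals overlap when each one starts before the other ends.
    if (end >= pageStart && start <= pageEnd) {
      pages.push(pageIndex);
    }
    cumulativeOffset += length;
  });
  return pages;
}

// Pages of 100, 50 and 200 characters cover offsets 0-99, 100-149 and 150-349,
// so a chunk spanning 90-160 touches all three pages.
const examplePages = pagesOverlapping([100, 50, 200], 90, 160);
```

The mapping is only as accurate as the agreement between the text used at ingestion and the text rebuilt per page here; if the two extractions differ in whitespace, offsets can drift by a few characters near page boundaries.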
Evaluation
I built a multi-source eval over a generated question set, scoring four metrics across three question types:
- Context Relevance: of the chunks retrieved, how much was actually in the reference (gold) set?
- Context Utilization: of the chunks retrieved, how much did the answer actually use?
- Completeness: how complete is the answer relative to the ground truth?
- Adherence: does the answer stay faithful to the retrieved content, or does it hallucinate?
The three question types stress different things:
- Single-Source: the answer lives in one chunk of one document.
- Multi-Source: the answer requires synthesizing across chunks from multiple documents.
- No-Source: the answer isn’t in the dataset at all. The system should refuse instead of hallucinating.
Results
| Question type | Context Relevance | Context Utilization | Completeness | Adherence |
|---|---|---|---|---|
| Single-Source | 0.029 | 0.033 | 0.703 | 0.921 |
| Multi-Source | 0.059 | 0.040 | 0.347 | 0.900 |
| No-Source | 0.000 | 0.000 | 1.000 | 0.800 |
| Overall | 0.041 | 0.033 | 0.551 | 0.898 |
- Adherence is high across all question types (0.80 to 0.92, about 0.90 overall). The agent mostly stays faithful to what it retrieves and rarely makes things up.
- Context relevance and utilization are low (~3–6%). Retrieval is over-fetching: the agent pulls many chunks, but only a small slice of each is actually used. This could be tuned, probably via lower topK defaults or a re-ranking pass before stuffing the prompt.
- Single-source completeness (70%) is much better than multi-source (35%). Synthesizing across documents is the hard problem, and the numbers say it’s still hard.
- No-source completeness is 1.0: the agent correctly refuses every question whose answer isn’t in the dataset. That’s the forced-answer fallback doing its job: on the fifth and final iteration, it returns “I couldn’t find this, here’s why” rather than hallucinating.
Stack
| Layer | Choice |
|---|---|
| API | Express 5, TypeScript, Node |
| Storage | MongoDB 7 with Atlas Vector Search ($vectorSearch) |
| Blob storage | S3 / MinIO via swappable IBlobProvider interface |
| Auth | JWT (jsonwebtoken) with bcryptjs for password hashing |
| Embeddings | OpenAI text-embedding-3-large (3072-dim, cosine) |
| Entity extraction | OpenAI gpt-4o-mini with structured Zod schemas |
| Agent / answers | DeepSeek deepseek-chat via OpenAI-compatible endpoint |
| Chunking | LangChain RecursiveCharacterTextSplitter (1000 chars) |
| PDF | pdf-parse for ingestion, pdfjs-dist + pdf-lib for page extraction |
| Streaming | Server-Sent Events over plain HTTP |
| Concurrency | p-limit (8 in flight) + async-retry w/ exponential backoff |
| Schemas | Zod |
Frontend: Next.js 15, React 19, MobX Keystone, Tailwind, Radix UI / shadcn.
Source: svq-ai/svq.api (backend), svq-ai/svq.web (frontend).