EvalExplorer has two parts: a document processing pipeline that turns evaluation documents into structured, AI-ready data, and an analysis interface that queries and synthesises across the processed set. Here's what happens between a raw PDF and a sourced, portfolio-scale answer.
Documents flow through a series of steps. Each adds structure and classification, and separates signal from noise more clearly.
PDFs in → Docling OCR → headers normalised → theme · region · method → database + vector store
Evaluation documents are gathered from institutional repositories, publisher platforms and programme websites, then queued for processing.
Docling OCR, using IBM's Granite Docling vision-language model, “sees” document layouts — identifying tables, diagrams and text regions — giving substantially better table extraction than rule-based approaches.
Header patterns and AI classification identify sections and normalise them to one taxonomy, recognising that “6. Key Findings” and “Results and Discussion” both represent findings.
Content is tagged with evaluation-domain metadata — themes, geography, evaluation type, methodology codes — powering filtering and comparison across the corpus.
Structured outputs land in a database plus a vector store, ready for hybrid keyword and semantic retrieval at query time.
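The header normalisation step above can be illustrated with a minimal sketch. The patterns and the three canonical labels below are hypothetical, chosen only to reproduce the “6. Key Findings” example; the actual rules and taxonomy are not given here, and in the described pipeline AI classification would handle headings that no pattern matches.

```python
import re
from typing import Optional

# Hypothetical patterns and labels; the real taxonomy is not specified in the text.
SECTION_PATTERNS = {
    "findings": re.compile(r"\b(key\s+findings|findings|results)\b", re.IGNORECASE),
    "methodology": re.compile(r"\b(methodology|methods|approach)\b", re.IGNORECASE),
    "recommendations": re.compile(r"\brecommendations?\b", re.IGNORECASE),
}

# Strips leading numbering such as "6. " or "3.2 " before matching.
_NUMBERING = re.compile(r"^\s*\d+(\.\d+)*[.)]?\s*")


def normalise_header(header: str) -> Optional[str]:
    """Map a raw heading to a canonical section label, or None if no rule matches."""
    text = _NUMBERING.sub("", header)
    for label, pattern in SECTION_PATTERNS.items():
        if pattern.search(text):
            return label
    return None  # unmatched headings would fall through to a classifier
```

With these assumed rules, both `normalise_header("6. Key Findings")` and `normalise_header("Results and Discussion")` resolve to the same `"findings"` label.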
A document's Table of Contents reveals its semantic structure. The Methodology section contains methodology; the Findings section contains findings. By extracting along these natural boundaries rather than arbitrary page breaks or character counts, we preserve the semantic coherence that makes content useful.
When a user searches for methodology, they get complete methodology sections — not fragments that happen to mention “methods” but start mid-sentence and end mid-paragraph.
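As a rough illustration of extracting along section boundaries rather than character counts, the sketch below splits plain text wherever a numbered heading starts. The heading pattern is an assumption (headings such as "3. Methodology" or "4.1 Findings"); a real pipeline would more likely take boundaries from the parsed Table of Contents.

```python
import re
from dataclasses import dataclass, field

# Assumption: headings start with short numbering followed by a capitalised title.
HEADING = re.compile(r"^\d{1,2}(\.\d{1,2})*\.?\s+[A-Z]")
FRONT_MATTER = "(front matter)"


@dataclass
class SectionChunk:
    heading: str
    lines: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines).strip()


def split_by_sections(document: str) -> list:
    """Return one chunk per section, each holding the complete section body."""
    chunks = []
    current = SectionChunk(heading=FRONT_MATTER)
    for line in document.splitlines():
        if HEADING.match(line):
            # Close the previous section, skipping an empty front-matter placeholder.
            if current.lines or current.heading != FRONT_MATTER:
                chunks.append(current)
            current = SectionChunk(heading=line.strip())
        else:
            current.lines.append(line)
    if current.lines or current.heading != FRONT_MATTER:
        chunks.append(current)
    return chunks
```

Each chunk begins at a heading and ends where the next one starts, so a retrieved methodology chunk never starts mid-sentence.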

The same document is interrogated with different questions at each level of granularity — so a finding buried in an “Implementation” section or a recommendation inside a “Lessons Learned” narrative is still identified and tagged.
Each evaluation criterion is processed separately during extraction, so a sentence relevant to both “lessons” and “recommendations” receives both tags and appears for either query. Thematic concerns like Value for Money are captured through taxonomy tagging at the content level, regardless of whether the source report has a dedicated section for them.
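The effect of processing each criterion separately can be sketched as a set of independent checks whose tags are not mutually exclusive. The cue words below are hypothetical stand-ins for the per-criterion extraction calls, which are not described in detail here.

```python
# Hypothetical cue words standing in for per-criterion model calls.
CRITERION_CUES = {
    "lessons": ("lesson", "learned", "learnt"),
    "recommendations": ("recommend", "should"),
    "value_for_money": ("value for money", "cost-effective"),
}


def tag_sentence(sentence: str) -> set:
    """Run every criterion check independently and collect all tags that apply."""
    lowered = sentence.lower()
    tags = set()
    for criterion, cues in CRITERION_CUES.items():
        # A match on one criterion never blocks a match on another.
        if any(cue in lowered for cue in cues):
            tags.add(criterion)
    return tags
```

A sentence such as "A key lesson is that future programmes should budget for training." receives both the `lessons` and `recommendations` tags, so it appears for either query.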
The pipeline also produces three summary smart chunks per document — a document summary (~300 words), a findings summary and a methods summary. These compressed representations are the primary units for search and retrieval, and each smart chunk carries its themes, country and region, and evaluation methodology.
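One way to picture a smart chunk is as a small record that carries its summary text together with its metadata. The field names and the three kind labels below are assumptions for illustration, not the actual schema.

```python
from dataclasses import dataclass

# Assumed labels for the three summaries produced per document.
SUMMARY_KINDS = ("document", "findings", "methods")


@dataclass(frozen=True)
class SmartChunk:
    document_id: str
    kind: str            # one of SUMMARY_KINDS
    text: str
    themes: tuple
    country: str
    region: str
    methodology: str

    def __post_init__(self) -> None:
        if self.kind not in SUMMARY_KINDS:
            raise ValueError(f"unknown summary kind: {self.kind!r}")

    @property
    def word_count(self) -> int:
        """Whitespace-delimited word count, e.g. to check the ~300-word target."""
        return len(self.text.split())
```

Keeping the metadata on the chunk itself means a search hit can be filtered or compared without a second lookup against the parent document.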
Keyword matching alone missed related content — a search for “community health workers” wouldn't surface “village health volunteers.” So retrieval combines two layers.
PostgreSQL native full-text search over document titles, summaries and excerpt content — precise on exact terms.
Dense vector embeddings (Jina v5, 256 dimensions) over summary chunks, capturing meaning rather than exact words — strong on recall.
Reciprocal Rank Fusion interleaves both result sets by rank position, balancing keyword precision with semantic recall.
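Reciprocal Rank Fusion scores each item by summing 1 / (k + rank) over every result list it appears in. The sketch below assumes k = 60, the constant commonly used in the literature; the value used by the system is not stated, and the chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for one query, best match first.
keyword_hits = ["chw-report", "vhv-study", "clinic-audit"]
semantic_hits = ["vhv-study", "outreach-eval", "chw-report"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

Because only rank positions are used, the keyword and vector scores never need to be put on a common scale, and items found by both layers rise above items found by one.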
Every claim links to its source chunks → the original PDF pages.
Because content is already extracted and tagged, a question doesn't need hundreds of full documents. The system identifies the relevant taxonomy filters, retrieves focused context and the specific chunks that match, then synthesises an answer.
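The filter-then-retrieve flow described above might look roughly like the following. The dictionary keys, the word budget and the assumption that chunks arrive already ranked are all illustrative, not taken from the actual system.

```python
def build_context(chunks, filters, max_words=200):
    """Keep chunks matching every taxonomy filter, packed up to a word budget.

    `chunks` is assumed to be sorted by retrieval score, best first.
    """
    selected, used = [], 0
    for chunk in chunks:
        # Skip anything outside the requested taxonomy slice.
        if any(chunk.get(key) != value for key, value in filters.items()):
            continue
        words = len(chunk["text"].split())
        if used + words > max_words:
            break  # budget reached; stop adding context
        selected.append(chunk)
        used += words
    return selected
```

The synthesis model then sees only the selected chunks, which is why the input stays small even when the underlying corpus is large.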
This is what keeps portfolio-scale synthesis fast and affordable — smaller models can handle tightly scoped, pre-extracted inputs, while a more capable model is used for synthesis where faithfulness to the evidence matters most.
Traceability is the trust mechanism. A synthesised claim resolves, step by step, all the way back to a highlighted passage in the original PDF.
Findings → Sustainability · pp. 45–47 · lines 1203–1267
Open the original PDF with the passage highlighted in context.
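A source reference in the format shown above can be treated as structured data rather than a display string. The sketch below parses that label into a section path, page range and line range; the `SourceLocator` type and the parsing rules are assumptions based only on the one example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceLocator:
    section_path: tuple
    pages: tuple
    lines: tuple


# A numeric range written with an en dash or a hyphen, e.g. "45–47".
_RANGE = r"(\d+)\s*[–-]\s*(\d+)"
_LOCATOR = re.compile(
    rf"^(?P<path>.+?)\s*·\s*pp\.\s*{_RANGE}\s*·\s*lines\s*{_RANGE}$"
)


def parse_locator(label: str) -> SourceLocator:
    """Parse 'Section → Subsection · pp. A–B · lines C–D' into a SourceLocator."""
    match = _LOCATOR.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised locator: {label!r}")
    path = tuple(part.strip() for part in match.group("path").split("→"))
    pages = (int(match.group(2)), int(match.group(3)))
    lines = (int(match.group(4)), int(match.group(5)))
    return SourceLocator(path, pages, lines)


locator = parse_locator("Findings → Sustainability · pp. 45–47 · lines 1203–1267")
```

With the page and line ranges available as integers, a viewer can jump to the right pages and highlight the passage in context.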
User testing surfaced synthesis failures: cause-and-effect inversions, numerical misattributions, and over- or understated findings. A single instance erodes trust in every other output, so we addressed it directly.
Explore an initial set of 1,500 processed evaluations, run a synthesis, and click any claim through to its source.