Technical overview

How EvalExplorer works

EvalExplorer has two parts: a document processing pipeline that turns evaluation documents into structured, AI-ready data, and an analysis interface that queries and synthesises across the processed set. Here's what happens between a raw PDF and a sourced, portfolio-scale answer.

The processing pipeline

Documents flow through a series of steps. Each adds structure, classification, and a clearer separation of strong signal from weak noise.

1
Document ingestion

PDFs in

2
PDF processing

Docling OCR

3
Section extraction

headers normalised

4
Classification & tagging

theme · region · method

5
Storage

database + vector store

1

Document ingestion

Evaluation documents are gathered from institutional repositories, publisher platforms and programme websites, then queued for processing.

2

PDF processing

Docling OCR, using IBM's Granite Docling vision-language model, “sees” document layouts — identifying tables, diagrams and text regions — giving substantially better table extraction than rule-based approaches.

3

Section extraction

Header patterns and AI classification identify sections, recognising that “6. Key Findings” and “Results and Discussion” both represent findings, normalised to one taxonomy.

4

Classification & tagging

Content is tagged with evaluation-domain metadata — themes, geography, evaluation type, methodology codes — powering filtering and comparison across the corpus.

5

Storage

Structured outputs land in a database plus a vector store, ready for hybrid keyword and semantic retrieval at query time.

Extract along the document's own structure

A document's Table of Contents reveals its semantic structure. The Methodology section contains methodology; the Findings section contains findings. By extracting along these natural boundaries rather than arbitrary page breaks or character counts, we preserve the semantic coherence that makes content useful.

When a user searches for methodology, they get complete methodology sections — not fragments that happen to mention “methods” but start mid-sentence and end mid-paragraph.

A real evaluation report's table of contents mapped to sections, and sections mapped to individual findings and smart chunks

Classification at three levels

The same document is interrogated with different questions at each level of granularity — so a finding buried in an “Implementation” section or a recommendation inside a “Lessons Learned” narrative is still identified and tagged.

Document levelapplied once per document
What kind of evaluation is this, and what does it cover?
Impact evaluationEducationSub-Saharan Africa2023
Section levelapplied to each section
What type of content does this section hold?
MethodologyFindingsRecommendationsConclusions
Feature levelapplied to each extracted chunk
What specific insight does this chunk contain?
Finding · EffectivenessRecommendation · Scale-upEvidence gap · Sustainability

Each evaluation criterion is processed separately during extraction, so a sentence relevant to both “lessons” and “recommendations” receives both tags and appears for either query. Thematic concerns like Value for Money are captured through taxonomy tagging at the content level, regardless of whether the source report has a dedicated section for them.

The pipeline also produces three summary smart chunks per document — a document summary (~300 words), a findings summary and a methods summary. These compressed representations are the primary units for search and retrieval, and each smart chunk carries its themes, country and region, and evaluation methodology.

Search: hybrid retrieval

Keyword matching alone missed related content — a search for “community health workers” wouldn't surface “village health volunteers.” So retrieval combines two layers.

Keyword layer

PostgreSQL native full-text search over document titles, summaries and excerpt content — precise on exact terms.

Semantic layer

Dense vector embeddings (Jina v5, 256 dimensions) over summary chunks, capturing meaning rather than exact words — strong on recall.

Rank fusion

Reciprocal Rank Fusion interleaves both result sets by rank position, balancing keyword precision with semantic recall.

User query
“What are the main barriers to scaling education interventions?”
Step 1 · Identify taxonomy filters
Theme: EducationFindings · Constraints · RecommendationsGeography: all regions
Step 2 · Retrieve context
12 education evaluationsMethodology overviews (RCT, quasi-experimental)
Step 3 · Retrieve chunks
8 findings · effectiveness5 constraints · sustainability4 recommendations · scale-up
Synthesised answer
“Across 12 evaluations, scaling barriers cluster into three themes…”

Every claim links to its source chunks → the original PDF pages.

How a query runs

Because content is already extracted and tagged, a question doesn't need hundreds of full documents. The system identifies the relevant taxonomy filters, retrieves focused context and the specific chunks that match, then synthesises an answer.

This is what keeps portfolio-scale synthesis fast and affordable — smaller models can handle tightly scoped, pre-extracted inputs, while a more capable model is used for synthesis where faithfulness to the evidence matters most.

  • Focused inputs — findings and recommendations, not raw pages
  • Transparent tool calls — you see each search query and its parameters
  • Every claim linked to its source chunks and original PDF pages

The verification chain

Traceability is the trust mechanism. A synthesised claim resolves, step by step, all the way back to a highlighted passage in the original PDF.

Synthesised claim
“Community-based approaches showed stronger sustainability outcomes than top-down interventions.”
traces to
Supporting excerpt
“Findings from the community mobilisation component showed sustained behaviour change 18 months post-intervention…”
traces to
Source
UNICEF_Education_Kenya_2023.pdf

Findings → Sustainability · pp. 45–47 · lines 1203–1267

enables
Verification

Open the original PDF with the passage highlighted in context.

Data quality gates

  • A two-tier word-count check flags documents under 500 words (likely stubs) and under 100 words per page (likely presentation-style documents), excluding them from analysis by default.
  • Duplicate detection flags true duplicates and near-duplicates (editions, translations) for manual review rather than auto-deleting.
  • Managers can correct classification and taxonomy errors in-app, keeping the corpus accurate without reprocessing.

Evidence fidelity

User testing surfaced synthesis failures — cause-and-effect inversions, numerical misattributions, over- or understated findings. A single instance erodes trust in every other output, so we addressed it directly:

  • A more capable model for synthesis, at a lower generation temperature that favours faithfulness over creativity.
  • Prompts restructured to emphasise verbatim quotation where possible.
  • A publication-bias disclaimer on all analysis and chat outputs — with a systematic validation framework now measuring the remaining error rate.

See it on the FCDO evaluation corpus

Explore 1,500 initial processed evaluations, run a synthesis, and click any claim through to its source.