AI Transparency

Where artificial intelligence is used in EvalExplorer, and the safeguards around it. For the underlying mechanics, see How it works.

Our evaluation framework puts humans in the loop to score extraction and synthesis against a labelled gold set.

Overview

EvalExplorer uses AI in three places: to ingest and extract structure from evaluation documents, to analyse and synthesise findings across a collection, and to power a chat interface for conversational exploration. It is a targeted, retrieval-grounded system — not a general-purpose chatbot answering from a model's own memory.

Key principle: AI augments human judgement; it does not replace it. Every synthesised claim links back to the source excerpts and original PDF pages it came from, so outputs can be verified rather than trusted blindly.

Where AI is used

1. Ingestion & extraction

When an evaluation PDF enters the system, AI turns it into structured, AI-ready data. This happens once per document, upfront:

PDF processing — Docling OCR, using IBM's Granite Docling vision-language model, reads the page layout and recovers text, tables and structure.
Section & content classification — models identify and normalise sections (Methodology, Findings, Recommendations…) and tag content with themes, country and region, and evaluation methodology.
Smart-chunk & summary generation — individual findings, recommendations and evidence gaps are extracted, and each document gets a document, findings and methods summary.
Embeddings — summary chunks are embedded (Jina v5, 256 dimensions) to power semantic search.

Where a deterministic algorithm does the job as well as a model, we use it — extraction quality depends on the weakest link, so simpler and more predictable is better.

2. Analysis & synthesis

The structured analysis tools — summarisation, comparative, temporal and custom queries — retrieve the relevant pre-extracted excerpts and ask a language model to synthesise across them. Because the inputs are already focused and tagged, synthesis runs on the signal, not on hundreds of raw pages.

Every claim carries numbered references to the specific source excerpts behind it.
A more capable model is used for synthesis, at a low generation temperature that favours faithfulness over creativity, with prompts that emphasise verbatim quotation.
The report shows which excerpts were included, which were excluded, and how sources map to claims — and a publication-bias disclaimer is attached to all outputs.

3. Chat

The chat interface answers questions conversationally across the corpus. It retrieves relevant excerpts and summaries, synthesises a response, and provides inline citations in the same format as the structured analyses.

Each tool call is transparent: the chat shows the search queries it ran and the parameters it used, and when it runs several searches, each result set is labelled with its query — so you can see why results appear as they do.

Retrieval behind analysis and chat

Both analysis and chat sit on the same hybrid retrieval layer: PostgreSQL full-text search for keyword precision and dense vector embeddings (Jina v5) for semantic recall, combined with Reciprocal Rank Fusion. This is what lets the system pull the right 50 findings instead of loading 200 full documents. How it works describes the retrieval layer in more detail.
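As a rough illustration of the fusion step, the sketch below merges a keyword ranking and a semantic ranking with Reciprocal Rank Fusion. It is a minimal example under stated assumptions, not EvalExplorer's actual code: the function name, the chunk IDs and the constant k=60 (a commonly used default) are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    confirmed for EvalExplorer.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers.
keyword_hits = ["c7", "c2", "c9"]    # e.g. full-text search
semantic_hits = ["c2", "c5", "c7"]   # e.g. vector similarity search

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Chunks that rank well in both lists (here c2 and c7) rise above chunks found by only one retriever, which is the property that makes rank fusion useful when keyword and semantic scores are not directly comparable.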

Evidence fidelity & safeguards

Language-model synthesis can distort source evidence — inverting cause and effect, misattributing numbers, or over- and understating findings. Because a single such error erodes trust in every other output, we address it directly:

A more capable synthesis model at a lower, faithfulness-favouring temperature.
Prompts restructured to emphasise verbatim quotation where possible.
A publication-bias disclaimer on all analysis and chat outputs.
Full traceability from claim → chunk → source section → page and line references.
A systematic validation framework (LLM-as-judge, calibrated against human review) measuring extraction and synthesis error rates.
In-app correction so managers can fix classification and tagging errors as they surface.
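The traceability chain in the list above can be pictured as a small data model. The sketch below is illustrative only: the class names, field names and sample values are assumptions made for this example, not EvalExplorer's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRef:
    """Where a chunk came from in the original PDF (hypothetical fields)."""
    document: str
    section: str
    page: int
    line_start: int
    line_end: int


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    source: SourceRef


@dataclass(frozen=True)
class Claim:
    text: str
    chunk_ids: tuple  # IDs of the chunks cited as evidence


def trace(claim, chunks_by_id):
    """Resolve a claim to its source locations.

    Raises KeyError if the claim cites a chunk that does not exist,
    so a dangling citation fails loudly instead of passing silently.
    """
    missing = [cid for cid in claim.chunk_ids if cid not in chunks_by_id]
    if missing:
        raise KeyError(f"claim cites unknown chunks: {missing}")
    return [chunks_by_id[cid].source for cid in claim.chunk_ids]


# Made-up sample data for illustration.
ref = SourceRef("report.pdf", "Findings", page=12, line_start=4, line_end=9)
chunks = {"c1": Chunk("c1", "Attendance rose after the programme.", ref)}
claim = Claim("The programme improved attendance.", ("c1",))

for source in trace(claim, chunks):
    print(source.document, source.section, source.page)
```

Keeping the page and line range on the chunk, rather than on the claim, means every claim that cites the same chunk resolves to the same place in the source.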

Known failure modes

Two parts of the pipeline are where quality problems concentrate. We're candid that they aren't fully solved — and both are the focus of an active, human-in-the-loop evaluation framework in which evaluation experts verify the system's own work.

Imperfect extraction (mitigation in progress)

Smart chunks can be extracted or tagged poorly — a finding missed, a passage split at the wrong boundary, or a chunk filed under the wrong type. Everything downstream builds on this layer, so extraction errors propagate.

We run a human-verified evaluation framework: reviewers check extractions for faithfulness to the source, correct boundaries and the right chunk type, with disagreements resolved through adjudication. Those verified results feed our work to refine the extraction pipeline, so that whatever is extractable is properly extracted and tagged.
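One way to picture the review step is to compare two reviewers' labels, measure how far they agree beyond chance, and queue the disagreements for adjudication. The page does not say which agreement statistic is used; Cohen's kappa is swapped in here as an assumption, and the function names and labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items on which the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each reviewer's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


def needs_adjudication(labels_a, labels_b):
    """Indices of the items on which the two reviewers disagree."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Made-up chunk-type labels from two reviewers.
reviewer_a = ["finding", "finding", "recommendation", "gap"]
reviewer_b = ["finding", "recommendation", "recommendation", "gap"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
to_adjudicate = needs_adjudication(reviewer_a, reviewer_b)
print(round(kappa, 3), to_adjudicate)
```

Raw percentage agreement can look high simply because one chunk type dominates; a chance-corrected measure guards against that.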

Synthesis errors (evaluation underway)

When the system brings findings together across documents, it can misattribute a claim, over- or understate a pattern, or cite a source that doesn't quite support the point being made.

Human evaluation experts are validating the synthesis process — checking whether each claim is grounded in the evidence it cites, whether coverage across the corpus is proportional, and whether the system correctly declines to answer when the evidence isn't there.

Model providers & data flow

EvalExplorer uses several models, matching each to the task rather than defaulting to the largest model for everything. Providers include:

IBM Granite Docling — document OCR and layout understanding at ingestion.
Anthropic (Claude) — synthesis and analysis where faithfulness matters most.
Google (Gemini) — summary generation and classification during ingestion.
Open-weight models — tightly scoped extraction and classification tasks.
Jina AI — text embeddings for semantic search.

Sent to AI models

• Evaluation report text and excerpts (during ingestion and retrieval)
• Your search queries and analysis prompts
• Retrieved excerpts and their metadata (for synthesis)

Not sent to AI models

• Credentials or authentication tokens
• Personal account information
• Raw database contents beyond the relevant excerpts

No customer data is used to train third-party AI models; all provider agreements include “no training on our data” terms. See the Privacy Policy for details.

Limitations

Coverage is bounded

Answers reflect the evaluations currently in the corpus — not evidence that hasn't been ingested, and not the wider literature.

Language & recency

Models perform best in English, and their general knowledge stops at a training cutoff. EvalExplorer grounds answers in the corpus to reduce reliance on that general knowledge.

For more on how we measure and improve extraction and synthesis quality, see our evaluation framework.

Working well with AI outputs

Treat summaries and syntheses as a starting point, not a final answer.
Follow the citation links and read the source passage before relying on a claim.
Use per-analysis document selection and excerpt exclusion to keep inputs relevant.
Prefer specific questions — they retrieve more relevant evidence and synthesise more faithfully.

Want the technical detail?

How it works walks through the full pipeline, three-level extraction, hybrid retrieval and the verification chain.
