Where artificial intelligence is used in EvalExplorer, and the safeguards around it. For the underlying mechanics, see How it works.
Our evaluation framework puts humans in the loop to score extraction and synthesis against a labelled gold set.
EvalExplorer uses AI in three places: to ingest and extract structure from evaluation documents, to analyse and synthesise findings across a collection, and to power a chat interface for conversational exploration. It is a targeted, retrieval-grounded system, not a general-purpose chatbot answering from a model's own memory.
Key principle: AI augments human judgement; it does not replace it. Every synthesised claim links back to the source excerpts and original PDF pages it came from, so outputs can be verified rather than trusted blindly.
When an evaluation PDF enters the system, AI turns it into structured, AI-ready data. This happens once per document, upfront:
Where a deterministic algorithm does the job as well as a model, we use the algorithm. Extraction quality depends on the weakest link, so simpler and more predictable is better.
The structured analysis tools — summarisation, comparative, temporal and custom queries — retrieve the relevant pre-extracted excerpts and ask a language model to synthesise across them. Because the inputs are already focused and tagged, synthesis runs on the signal, not on hundreds of raw pages.
The chat interface answers questions conversationally across the corpus. It retrieves relevant excerpts and summaries, synthesises a response, and provides inline citations in the same format as the structured analyses.
Each tool call is transparent: the chat shows the search queries it ran and the parameters it used, and when it runs several searches, each result set is labelled with its query — so you can see why results appear as they do.
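As an illustration of what a labelled search trace could look like, here is a minimal sketch. The `SearchCall` class, its fields, the `limit` parameter and the excerpt IDs are all hypothetical names chosen for the example; they are not EvalExplorer's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class SearchCall:
    """One retrieval call made while answering a chat question (hypothetical shape)."""
    query: str
    params: dict
    excerpt_ids: list = field(default_factory=list)


def render_trace(calls):
    """Return a plain-text trace that labels each result set with its query."""
    lines = []
    for number, call in enumerate(calls, start=1):
        # Show the query and parameters first, then the results they produced.
        lines.append(f"Search {number}: {call.query!r} with {call.params}")
        for excerpt_id in call.excerpt_ids:
            lines.append(f"  - {excerpt_id}")
    return "\n".join(lines)


# Invented example data: two searches run for a single chat question.
calls = [
    SearchCall("cash transfers nutrition", {"limit": 2}, ["ex-12", "ex-40"]),
    SearchCall("school feeding attendance", {"limit": 1}, ["ex-07"]),
]
trace = render_trace(calls)
print(trace)
```

Keeping each result set attached to the query that produced it is what lets a reader trace an unexpected excerpt back to a specific search.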
Both analysis and chat sit on the same hybrid retrieval layer: PostgreSQL full-text search for keyword precision and dense vector embeddings (Jina v5) for semantic recall, combined with Reciprocal Rank Fusion. This is what lets the system pull the right 50 findings instead of loading 200 full documents. How it works describes the retrieval layer in more detail.
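For readers unfamiliar with Reciprocal Rank Fusion, the sketch below shows the general technique: each excerpt scores 1 / (k + rank) in every ranked list it appears in, and the scores are summed. The constant `k = 60` is the value commonly used in the literature and is an assumption here, as are the function name and the excerpt IDs; this is not taken from EvalExplorer's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first.

    `k` dampens the influence of top ranks; 60 is a conventional default
    and an assumption in this sketch.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, excerpt_id in enumerate(ranking, start=1):
            scores[excerpt_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda excerpt_id: (-scores[excerpt_id], excerpt_id))


# Invented example data: one list per retriever, best match first.
keyword_hits = ["c3", "c1", "c7"]   # e.g. from full-text search
semantic_hits = ["c1", "c9", "c3"]  # e.g. from vector similarity search
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because the fusion uses only ranks, the keyword and vector scores never need to be put on a common scale, and excerpts that both retrievers rank highly rise to the top.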
Language-model synthesis can distort source evidence — inverting cause and effect, misattributing numbers, or over- and understating findings. Because a single such error erodes trust in every other output, we address it directly:
Two parts of the pipeline are where quality problems concentrate. We're candid that they aren't fully solved — and both are the focus of an active, human-in-the-loop evaluation framework in which evaluation experts verify the system's own work.
Smart chunks can be extracted or tagged poorly — a finding missed, a passage split at the wrong boundary, or a chunk filed under the wrong type. Everything downstream builds on this layer, so extraction errors propagate.
We run a human-verified evaluation framework: reviewers check extractions for faithfulness to the source, correct boundaries and the right chunk type, with disagreements resolved through adjudication. Those verified results feed our work to refine the extraction pipeline, so that whatever is extractable is properly extracted and tagged.
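The adjudication step can be pictured with a small sketch. It assumes two reviewers each assign a chunk type to the same chunks, and that any chunk on which they differ goes to an adjudicator. The function name, the chunk IDs and the type labels are illustrative assumptions, not the framework's real schema.

```python
def split_by_agreement(reviewer_a, reviewer_b):
    """Separate chunks both reviewers agree on from those needing adjudication."""
    agreed, disputed = {}, []
    # Only chunks judged by both reviewers can be compared.
    for chunk_id in sorted(reviewer_a.keys() & reviewer_b.keys()):
        if reviewer_a[chunk_id] == reviewer_b[chunk_id]:
            agreed[chunk_id] = reviewer_a[chunk_id]
        else:
            disputed.append(chunk_id)
    return agreed, disputed


# Invented example labels: chunk ID -> chunk type chosen by each reviewer.
reviewer_a = {"c1": "finding", "c2": "recommendation", "c3": "finding"}
reviewer_b = {"c1": "finding", "c2": "lesson", "c3": "finding"}
agreed, disputed = split_by_agreement(reviewer_a, reviewer_b)
print(disputed)  # ['c2']
```

Only the agreed labels, plus whatever the adjudicator decides for the disputed chunks, would then count as verified results.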
When the system brings findings together across documents, it can misattribute a claim, over- or understate a pattern, or cite a source that doesn't quite support the point being made.
Human evaluation experts are validating the synthesis process — checking whether each claim is grounded in the evidence it cites, whether coverage across the corpus is proportional, and whether the system correctly declines to answer when the evidence isn't there.
EvalExplorer uses several models, matching each to the task rather than defaulting to the largest model for everything. Providers include:
No customer data is used to train third-party AI models; all provider agreements include “no training on our data” terms. See the Privacy Policy for details.
Coverage is bounded
Answers reflect the evaluations currently in the corpus — not evidence that hasn't been ingested, and not the wider literature.
Language & recency
Models perform best in English, and their general knowledge has a training cutoff; EvalExplorer grounds answers in the corpus to limit reliance on that general knowledge.
For more on how we measure and improve extraction and synthesis quality, see our evaluation framework.
How it works walks through the full pipeline, three-level extraction, hybrid retrieval and the verification chain.