Evaluation framework

How we measure whether EvalExplorer is trustworthy

A tool that answers questions across hundreds of evaluations is only as trustworthy as the two things it does under the hood: extraction (turning documents into clean, correctly typed evidence) and synthesis (bringing that evidence together into an answer). We evaluate both, with humans in the loop, so quality is measured rather than assumed.

In progress

Extraction evaluation

A human-validated annotation programme that checks whether findings, recommendations and methodology are extracted faithfully, bounded correctly and filed under the right type — building the gold reference everything else is scored against.

Planned

Synthesis evaluation

A five-dimension scoring rubric, retrieval scoring on the gold we already hold, and a controlled scoring pilot — producing a defensible first baseline for synthesis quality and a release gate to hold it to. Next up, and detailed below.

Phase 1 · In progress

Extraction evaluation

A working annotation tool turns evaluation reports into a human-validated set of labelled text spans for findings, recommendations and methodology. Those labelled spans are the gold the synthesiser retrieves from — which is why synthesis evaluation can build on this work rather than start over.

The annotation tool

Annotators mark findings, recommendations and methodology as labelled text spans, directly on the source PDF.

Suggestions & consensus

A strong model pre-seeds each span; a consensus mechanism reconciles multiple reviewers into a single agreed set, with adjudication for disputes.

Scored against the system

System spans are compared to the gold using intersection-over-union (IoU) span overlap, plus classification metrics for presence, themes and geography.

Because the synthesiser retrieves from these same labelled spans, synthesis evaluation starts from this gold, and retrieval can be scored against it without a second annotation round.

What reviewers verify

Extraction quality is not one judgement but several. Each smart chunk is checked along independent axes, so a faithful chunk filed under the wrong type — or a correct type with the wrong boundaries — is caught rather than folded into a single vague score.

Text faithfulness

Does the chunk match its source passage? Errors are severity-graded — critical, major, minor or inconsequential — so a fabrication is not weighed the same as a dropped comma.

Boundaries & span

Does the chunk begin and end in the right place on the page? Scored by intersection-over-union (IoU) against the gold span, with a 0.5 overlap threshold.
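
A minimal sketch of the overlap check described above, assuming spans are stored as half-open character offsets and that a system span counts as a match when its IoU with the gold span is at least 0.5. The names span_iou and matches_gold are illustrative, not part of the tool, and whether the threshold is inclusive is also an assumption.

```python
from typing import Tuple

# Assumed representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def span_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two half-open character spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def matches_gold(system: Span, gold: Span, threshold: float = 0.5) -> bool:
    """True when the system span overlaps the gold span enough to count."""
    return span_iou(system, gold) >= threshold
```

For example, a system span (0, 10) against a gold span (2, 10) shares 8 of 10 characters, so its IoU is 0.8 and it passes; against a gold span (5, 15) it shares 5 of 15 and fails.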

Chunk type

Is it labelled correctly — finding, recommendation, evidence gap or lesson?

Classification tags

Are the multi-label tags — themes, methodology, region and country — both correct and complete?
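
One way to separate "correct" from "complete" for a multi-label tag set is precision and recall against the gold tags. The sketch below assumes tags are plain strings compared as sets; the function name and the convention for empty sets are assumptions for illustration.

```python
from typing import Set, Tuple


def tag_precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision answers 'are the tags correct?'; recall answers 'are they complete?'."""
    true_positives = len(predicted & gold)
    # Assumed convention: an empty side is scored 1.0 (nothing wrong, nothing missing).
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall
```

A chunk tagged {health, kenya, wash} against gold {health, kenya, education, nutrition} has two of three tags correct (precision 2/3) but covers only two of four gold tags (recall 0.5).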

Page & line anchors

Does the chunk point back to the correct page and line references in the source PDF?

Chunker recall

Did the chunker miss passages it should have captured? Measured against passages the reviewer marks as missed.
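
As a worked example, recall here can be read as captured passages over captured plus reviewer-marked missed passages. This is a sketch under that assumption; the function name is illustrative.

```python
def chunker_recall(captured: int, missed: int) -> float:
    """Share of passages the chunker captured, out of all it should have captured."""
    if captured < 0 or missed < 0:
        raise ValueError("counts must be non-negative")
    total = captured + missed
    # Assumed convention: with nothing to capture, recall is reported as 1.0.
    return captured / total if total else 1.0
```

With 45 captured passages and 5 marked as missed, recall is 45 / 50 = 0.9.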

Atomicity & redundancy

Is each chunk a single coherent unit, without near-duplicate overlap with its neighbours?

Retrieval suitability

Is the chunk a usable standalone unit for retrieval, or too fragmentary to stand on its own?

Consensus, adjudication and reliability

No single annotator defines the gold. Each subject is reviewed by several people, and a consensus pipeline reconciles them before anything is promoted to ground truth.

Step 1

Similarity

Compare the regions proposed by different reviewers.

Step 2

Cluster

Group overlapping regions into candidate items.

Step 3

Reduce

Aggregate reviewer votes per cluster — exact match, span overlap, Jaccard.

Step 4

Decide

Threshold each cluster to accept, review, adjudicate or reject.
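
The four steps above can be sketched as follows. This is a simplified illustration, not the pipeline itself: it treats any character overlap as "similar", scores each cluster by the share of distinct reviewers who proposed it (rather than exact match or Jaccard), and uses hypothetical thresholds. The Proposal type and all function names are assumptions.

```python
from typing import List, NamedTuple


class Proposal(NamedTuple):
    reviewer: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def cluster_overlapping(proposals: List[Proposal]) -> List[List[Proposal]]:
    """Steps 1-2: group proposals whose spans overlap (single linkage)."""
    clusters: List[List[Proposal]] = []
    cluster_end = -1
    for p in sorted(proposals, key=lambda p: (p.start, p.end)):
        if clusters and p.start < cluster_end:
            clusters[-1].append(p)
            cluster_end = max(cluster_end, p.end)
        else:
            clusters.append([p])
            cluster_end = p.end
    return clusters


def support(cluster: List[Proposal], n_reviewers: int) -> float:
    """Step 3: share of reviewers who proposed something in this cluster."""
    return len({p.reviewer for p in cluster}) / n_reviewers


def decide(score: float, accept: float = 0.8, review: float = 0.6,
           adjudicate: float = 0.3) -> str:
    """Step 4: map a cluster score to an outcome (thresholds are hypothetical)."""
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    if score >= adjudicate:
        return "adjudicate"
    return "reject"
```

With three reviewers, a region all three marked is accepted, a region two marked goes to review, and a region only one marked is routed to adjudication under these example thresholds.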

Disagreement routes to adjudication

When agreement is low, the item goes to an adjudicator who sees every reviewer's proposal alongside the consensus candidate, then sets the final answer. Gold is then constructed by an explicit strategy — simple majority, a confidence threshold, or union-then-adjudicate for contested items.

Reliability is measured, not assumed

Inter-annotator agreement is reported with the statistic appropriate to the label type: Cohen's κ, Fleiss' κ, Krippendorff's α, or the intraclass correlation coefficient (ICC). Hidden calibration items with known answers test each reviewer's competence, producing a weight that feeds back into the consensus.
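
As an illustration of one of these statistics, the sketch below computes Cohen's κ for two reviewers assigning one categorical label per item. It is a plain-Python version for clarity; the function name is an assumption, and the other statistics named above are not shown.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each reviewer's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

If two reviewers agree on 3 of 4 items (observed 0.75) and chance agreement is 0.5, κ is (0.75 - 0.5) / (1 - 0.5) = 0.5.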

Phase 2 · Planned

Synthesis evaluation

The research surfaced a full evaluation programme. Phase 2 keeps the essentials and sequences the rest: a synthesis scoring rubric, retrieval scoring on the gold we already collected, one controlled, human-scored pilot, and a first baseline with a release-gate proposal.

The rubric — five dimensions

Five dimensions, scored the same way every run, so synthesis quality stays comparable while we tune retrieval and prompts. Adapted from the document-synthesis literature, narrowed to what matters for published evaluation work.

1

Groundedness / faithfulness

Is every claim traceable to a retrieved span?

Unit · Atomic claim · Critical

2

Coverage

Did it include the findings that matter for the prompt?

Unit · Content unit · Critical

3

Relevance

Does it actually answer the prompt that was asked?

Unit · Whole response · High

4

Coherence

Is it readable, ordered and well structured?

Unit · Whole response · Medium

5

Neutrality

Does it stay within the evidence, without editorialising?

Unit · Span + response · High

How each one is scored

Two kinds of label, decomposed once, scored by people first.

Label type A · Ratios

Claim & unit labels

Groundedness and coverage. Each claim is marked as supported or assigned one of two failure types, and the labels are then aggregated to a ratio.

  • Intrinsic: contradicts a retrieved span
  • Extrinsic: not found in any retrieved span

Label type B · Ordinal

Anchored 1-to-5 ratings

Relevance, coherence, neutrality. A rubric-anchored judgement on the whole response. We report the distribution, not just a mean — so a bimodal “great or terrible” result can't hide behind an average.
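
The two label types reduce to different summaries. A minimal sketch, assuming claim verdicts are stored as the strings "supported", "intrinsic" and "extrinsic" and ordinal ratings as integers 1 to 5; both function names are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence

CLAIM_LABELS = ("supported", "intrinsic", "extrinsic")


def claim_label_ratios(verdicts: Sequence[str]) -> Dict[str, float]:
    """Label type A: share of claims under each verdict."""
    if not verdicts:
        raise ValueError("no claims to aggregate")
    counts = Counter(verdicts)
    unknown = set(counts) - set(CLAIM_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return {label: counts[label] / len(verdicts) for label in CLAIM_LABELS}


def rating_distribution(ratings: Sequence[int]) -> Dict[int, int]:
    """Label type B: how many responses received each score from 1 to 5."""
    if any(r not in range(1, 6) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    counts = Counter(ratings)
    return {score: counts[score] for score in range(1, 6)}
```

The ratings [1, 1, 1, 5, 5, 5] and [3, 3, 3, 3, 3, 3] share a mean of 3, but their distributions differ completely, which is why the distribution is reported alongside the mean.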

The measures are drawn from the RAG and summarisation evaluation literature: citation precision and recall in the style of ALCE for groundedness, per-claim faithfulness in the style of RAGAS and FActScore, and nugget- or Pyramid-style recall for coverage. We adopt the measures rather than any single benchmark, and re-anchor them to evaluation-domain content.

Decompose once

Break the synthesis into claims a single time, then give every grader the same set — so we measure verdicts, not disagreement about how to split the text.

People first

Humans score the pilot. A model-judge is introduced only once it agrees with human labels on a held-out set, with agreement reported beside every score.

Groundedness gates

A fabricated claim fails the synthesis outright — it is not averaged away. The other four dimensions report as quality bands.
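
The difference between gating and averaging can be sketched as below. This assumes any claim not marked "supported" trips the gate, which is one reading of "fabricated"; the function name and result keys are hypothetical, and the other dimensions are passed through unchanged rather than mapped to bands.

```python
from typing import Dict, Sequence


def apply_groundedness_gate(claim_verdicts: Sequence[str],
                            other_scores: Dict[str, float]) -> Dict[str, object]:
    """Fail the synthesis if any claim is unsupported, however good the rest is."""
    if not claim_verdicts:
        raise ValueError("no claims to check")
    failing = sum(1 for v in claim_verdicts if v != "supported")
    return {
        "passed": failing == 0,
        "failing_claims": failing,
        # Reported for context only; it does not rescue a failed gate.
        "supported_ratio": (len(claim_verdicts) - failing) / len(claim_verdicts),
        "other_dimensions": dict(other_scores),
    }
```

A synthesis with nine supported claims and one unsupported claim has a supported ratio of 0.9, yet it still fails the gate.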

Retrieval scoring comes with the gold

No fresh annotation effort is needed to score retrieval. The labelled spans we already collected are the reference evidence set.

L1

Which reports were retrieved

Out of the corpus, did the system pull the evaluation reports relevant to the prompt? Scored with classification metrics.

L2

Which spans reached synthesis

Within those reports, which gold spans made it into the synthesis input? Ranking matters, so we use top-k metrics: Recall@k, MRR, NDCG.
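
A sketch of the three top-k metrics named above, assuming binary relevance (a span is either in the gold evidence set or not), unique span identifiers in the ranking, and hypothetical function names. MRR is the mean of the reciprocal rank over all prompts.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of gold spans that appear in the top k retrieved spans."""
    if not relevant:
        raise ValueError("need at least one relevant span")
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first gold span, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Discounted gain of the ranking, normalised by the best possible ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For a ranking [s3, s1, s9, s2] with gold spans {s1, s2}, Recall@2 is 0.5, Recall@4 is 1.0, and the reciprocal rank is 0.5 because the first gold span sits at position 2.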

Why this matters

The gold already carries theme and geography tags, so the evidence relevant to a prompt is derivable from work already done. Retrieval recall comes essentially for free, with no second annotation round.

One controlled scoring run

The smallest run that produces a defensible first number, designed so synthesis quality stays comparable as retrieval and prompts change.

Fixed gold prompts

A small set of synthesis prompts spanning the main themes and countries — frozen and versioned, so every run answers the same questions.
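
One possible way to make "frozen and versioned" concrete is an immutable prompt record plus a content hash over the whole set, so any edit to a prompt produces a new version string. The GoldPrompt fields and the prompt_set_version helper are assumptions for illustration, not the project's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class GoldPrompt:
    prompt_id: str
    text: str
    themes: Tuple[str, ...]
    countries: Tuple[str, ...]


def prompt_set_version(prompts: Iterable[GoldPrompt]) -> str:
    """Short content hash of the prompt set, independent of input order."""
    ordered = sorted(prompts, key=lambda p: p.prompt_id)
    payload = json.dumps([asdict(p) for p in ordered],
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

Storing the version string beside every run's scores makes it possible to confirm that two runs answered the same questions.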

Controlled task

Same prompt, output length and rubric across runs, so a change in score reflects the system, not the test conditions.

Two scorers, calibrated

A calibration subset is double-scored for agreement before any number is trusted, reusing the reliability gate from Phase 1.

Output

A first synthesis-quality baseline across all five dimensions, with inter-rater agreement reported beside every score.

In the works

Advanced evaluation

These methods measure things that depend on first having a reliable output score — how an answer was produced, the quality of the underlying evidence, robustness under stress. We plan to work through them once the rubric, retrieval scoring and first baseline are in place.

Process & trajectory evaluation

Scoring how an answer was produced — planning, query strategy, tool calls, error recovery — not only the final paragraph. Used when output scoring alone can't localise a failure.

Chatbot evaluation

A multi-turn conversational surface that asks, retrieves and answers across turns. Its evaluation schema — which dimensions, what reference standard, how to handle multi-turn context — is still to be designed.

Risk-of-bias & certainty (GRADE)

A GRADE-based appraisal of the quality of the underlying evidence, not the extraction. It needs a PICO / study-design layer we don't collect today, so it sits outside the first phase.

A validated model-judge at scale

Extending the human-calibrated judge across the whole corpus, once it has proven agreement with human labels on held-out sets.

The full dimension library

The complete multi-domain set of quality dimensions the research surfaced, beyond the five essential ones scored first.

Adversarial & temporal stress sets

Targeted test sets that probe robustness — adversarial prompts and time-shifted evidence — layered on once a stable baseline exists.

How this connects to the product

Evaluation is one half of our commitment to trustworthy outputs. The other is traceability — every claim linked back to its source. See where AI is used and the safeguards around it.