A tool that answers questions across hundreds of evaluations is only as trustworthy as the two things it does under the hood: extraction (turning documents into clean, correctly typed evidence) and synthesis (bringing that evidence together into an answer). We evaluate both, with humans in the loop, so quality is measured rather than assumed.
A human-validated annotation programme that checks whether findings, recommendations and methodology are extracted faithfully, bounded correctly and filed under the right type — building the gold reference everything else is scored against.
A five-dimension scoring rubric, retrieval scoring on the gold we already hold, and a controlled scoring pilot — producing a defensible first baseline for synthesis quality and a release gate to hold it to. Next up, and detailed below.
A working annotation tool turns evaluation reports into a human-validated set of labelled text spans for findings, recommendations and methodology. Those labelled spans are the gold the synthesiser retrieves from — which is why synthesis evaluation can build on this work rather than start over.
Annotators mark findings, recommendations and methodology as labelled text spans, directly on the source PDF.
A strong model pre-seeds each span; a consensus mechanism reconciles multiple reviewers into a single agreed set, with adjudication for disputes.
System spans are compared to the gold with intersection-over-union (IoU) span overlap, plus classification metrics for presence, themes and geography.
Because the synthesiser retrieves from these same labelled spans, synthesis evaluation begins right here, and retrieval can be scored with no second annotation round.
Extraction quality is not one judgement but several. Each smart chunk is checked along independent axes, so a faithful chunk filed under the wrong type — or a correct type with the wrong boundaries — is caught rather than folded into a single vague score.
Text faithfulness
Does the chunk match its source passage? Errors are severity-graded — critical, major, minor or inconsequential — so a fabrication is not weighed the same as a dropped comma.
Boundaries & span
Does the chunk begin and end in the right place on the page? Scored by intersection-over-union (IoU) against the gold span, with a 0.5 overlap threshold.
Chunk type
Is it labelled correctly — finding, recommendation, evidence gap or lesson?
Classification tags
Are the multi-label tags — themes, methodology, region and country — both correct and complete?
Page & line anchors
Does the chunk point back to the correct page and line references in the source PDF?
Chunker recall
Did the chunker miss passages it should have captured? Measured against passages the reviewer marks as missed.
Atomicity & redundancy
Is each chunk a single coherent unit, without near-duplicate overlap with its neighbours?
Retrieval suitability
Is the chunk a usable standalone unit for retrieval, or too fragmentary to stand on its own?
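The boundary check above can be sketched in a few lines. This is a minimal illustration, assuming spans are half-open character offsets `(start, end)` and that a chunk passes when its IoU with the gold span is at or above the 0.5 threshold; the names `span_iou` and `boundary_match` are hypothetical, and whether the real threshold is inclusive is an assumption.

```python
from typing import Tuple

# Assumption: a span is a half-open (start, end) pair of character offsets.
Span = Tuple[int, int]


def span_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two half-open character spans."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def boundary_match(system: Span, gold: Span, threshold: float = 0.5) -> bool:
    """True when the system chunk overlaps the gold span enough to count."""
    return span_iou(system, gold) >= threshold
```

For example, a chunk at `(2, 10)` against a gold span at `(0, 10)` overlaps on 8 of 10 characters and passes, while `(5, 15)` against `(0, 10)` shares only 5 of 15 and does not.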
No single annotator defines the gold. Each subject is reviewed by several people, and a consensus pipeline reconciles them before anything is promoted to ground truth.
Similarity
Compare the regions proposed by different reviewers.
Cluster
Group overlapping regions into candidate items.
Reduce
Aggregate reviewer votes per cluster, using exact match, span overlap or Jaccard similarity as the agreement measure.
Decide
Threshold each cluster to accept, review, adjudicate or reject.
When agreement is low, the item goes to an adjudicator who sees every reviewer's proposal alongside the consensus candidate, then sets the final answer. Gold is then constructed by an explicit strategy — simple majority, a confidence threshold, or union-then-adjudicate for contested items.
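The four consensus steps can be sketched as follows. This is an illustrative outline, not the pipeline itself: it assumes greedy clustering by span IoU, a reduce step that counts the share of distinct reviewers backing each cluster, and decision thresholds of 0.75, 0.5 and 0.25. All function names and threshold values here are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]            # half-open (start, end) character offsets
Proposal = Tuple[str, Span]       # (reviewer id, proposed span)


def iou(a: Span, b: Span) -> float:
    """Similarity step: overlap between two proposed regions."""
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def cluster_proposals(proposals: List[Proposal], min_iou: float = 0.5) -> List[List[Proposal]]:
    """Cluster step: greedily group proposals that overlap an existing member."""
    clusters: List[List[Proposal]] = []
    for proposal in sorted(proposals, key=lambda p: p[1]):
        for cluster in clusters:
            if any(iou(proposal[1], member[1]) >= min_iou for member in cluster):
                cluster.append(proposal)
                break
        else:
            clusters.append([proposal])
    return clusters


def decide(cluster: List[Proposal], n_reviewers: int,
           accept_at: float = 0.75, review_at: float = 0.5,
           adjudicate_at: float = 0.25) -> str:
    """Reduce and decide: share of distinct reviewers, then a threshold."""
    support = len({reviewer for reviewer, _ in cluster}) / n_reviewers
    if support >= accept_at:
        return "accept"
    if support >= review_at:
        return "review"
    if support >= adjudicate_at:
        return "adjudicate"
    return "reject"


proposals = [
    ("a", (0, 100)), ("b", (5, 100)), ("c", (0, 95)), ("d", (2, 98)),
    ("a", (200, 250)),
    ("b", (400, 420)), ("c", (405, 420)),
]
decisions = [decide(c, n_reviewers=4) for c in cluster_proposals(proposals)]
```

With four reviewers, the region all four marked is accepted, the region two of them marked goes to review, and the region only one marked is sent to adjudication.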
Inter-annotator agreement is reported with the statistic appropriate to the label type: Cohen's κ, Fleiss' κ, Krippendorff's α, or the intraclass correlation coefficient (ICC). Hidden calibration items with known answers test each reviewer's competence, producing a weight that feeds back into the consensus.
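As a concrete instance of one of these statistics, Cohen's κ for two reviewers labelling the same items is the observed agreement corrected for the agreement expected by chance. The sketch below is a plain-Python illustration; the function name `cohens_kappa` and the example labels are assumptions, not part of the annotation tool.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items with nominal labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where the two labels coincide.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


reviewer_1 = ["finding", "finding", "recommendation", "recommendation"]
reviewer_2 = ["finding", "finding", "recommendation", "finding"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # observed 0.75, chance 0.5
```

Here the reviewers agree on three of four items, but half of that agreement is expected by chance, so κ comes out at 0.5 rather than 0.75.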
The research surfaced a full evaluation programme. Phase 2 keeps the essentials and sequences the rest: a synthesis scoring rubric, retrieval scoring on the gold we already collected, one controlled, human-scored pilot, and a first baseline with a release-gate proposal.
Five dimensions, scored the same way every run, so synthesis quality stays comparable while we tune retrieval and prompts. Adapted from the document-synthesis literature, narrowed to what matters for published evaluation work.
Groundedness / faithfulness
Is every claim traceable to a retrieved span?
Coverage
Did it include the findings that matter for the prompt?
Relevance
Does it actually answer the prompt that was asked?
Coherence
Is it readable, ordered and well structured?
Neutrality
Does it stay within the evidence, without editorialising?
Two kinds of label, decomposed once, scored by people first.
Groundedness and coverage. Each claim is marked supported, or split into one of two failure types, then aggregated to a ratio.
Relevance, coherence, neutrality. A rubric-anchored judgement on the whole response. We report the distribution, not just a mean — so a bimodal “great or terrible” result can't hide behind an average.
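The two kinds of label aggregate differently, which a short sketch makes concrete. It assumes claim verdicts are strings where `"supported"` is the passing value, and that whole-response rubric scores sit on a 1 to 5 scale; the failure-type names below are placeholders, since the two failure types are not named here, and the scale is an assumption.

```python
from collections import Counter
from typing import Dict, Sequence


def supported_ratio(verdicts: Sequence[str]) -> float:
    """Claim-level aggregation: share of claims marked as supported."""
    if not verdicts:
        raise ValueError("a synthesis must contain at least one claim")
    return sum(v == "supported" for v in verdicts) / len(verdicts)


def score_distribution(scores: Sequence[int],
                       scale: Sequence[int] = (1, 2, 3, 4, 5)) -> Dict[int, int]:
    """Response-level aggregation: count of responses at each rubric level."""
    counts = Counter(scores)
    return {level: counts.get(level, 0) for level in scale}


# Placeholder failure labels; the real failure types are defined by the rubric.
claim_verdicts = ["supported", "supported", "failure_type_1", "supported"]
groundedness = supported_ratio(claim_verdicts)

coherence_scores = [1, 1, 5, 5, 5]
distribution = score_distribution(coherence_scores)
mean_score = sum(coherence_scores) / len(coherence_scores)
```

The coherence example has a mean of 3.4, which looks middling, while the distribution shows that no response actually scored between 2 and 4.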
The measures are drawn from the RAG and summarisation evaluation literature: citation precision and recall in the style of ALCE for groundedness, per-claim faithfulness in the style of RAGAS and FActScore, and nugget- or Pyramid-style recall for coverage. We adopt the measures rather than any single benchmark, and re-anchor them to evaluation-domain content.
Break the synthesis into claims a single time, then give every grader the same set — so we measure verdicts, not disagreement about how to split the text.
Humans score the pilot. A model-judge is introduced only once it agrees with human labels on a held-out set, with agreement reported beside every score.
A fabricated claim fails the synthesis outright — it is not averaged away. The other four dimensions report as quality bands.
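The hard-fail rule can be sketched as a gate that checks fabrication before anything else. This is only an illustration: the `SynthesisScore` structure, the 0 to 1 score range and the band cut-offs of 0.8 and 0.5 are all assumptions, and the sketch encodes the fabrication rule alone, leaving open which band levels would block a release.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SynthesisScore:
    fabricated_claims: int               # count of claims judged fabricated
    dimension_scores: Dict[str, float]   # assumed 0..1 scores for the other four dimensions


def band(score: float) -> str:
    """Map a score to a quality band (cut-offs are illustrative)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def release_gate(result: SynthesisScore) -> Dict[str, object]:
    """Fail outright on any fabrication; report bands for the other dimensions."""
    bands = {name: band(score) for name, score in result.dimension_scores.items()}
    # Fabrication is checked on its own so strong bands cannot offset it.
    return {"passed": result.fabricated_claims == 0, "bands": bands}


scores = {"coverage": 0.9, "relevance": 0.85, "coherence": 0.6, "neutrality": 0.4}
clean = release_gate(SynthesisScore(fabricated_claims=0, dimension_scores=scores))
tainted = release_gate(SynthesisScore(fabricated_claims=1, dimension_scores=scores))
```

Both results carry the same bands, but only the first passes: a single fabricated claim fails the synthesis regardless of how the other dimensions score.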
No fresh annotation effort is needed to score retrieval. The labelled spans we already collected are the reference evidence set.
Out of the corpus, did the system pull the evaluation reports relevant to the prompt? Scored with classification metrics.
Within those reports, which gold spans made it into the synthesis input? Ranking matters, so we use top-k metrics: Recall@k, MRR, NDCG.
The gold already carries theme and geography tags, so the evidence relevant to a prompt is derivable from work already done. Retrieval recall comes essentially for free, with no second annotation round.
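The top-k metrics named above can be sketched with binary relevance, where a retrieved span either is or is not in the gold set for the prompt. The span identifiers and function names are hypothetical; MRR is the mean of `reciprocal_rank` over all prompts in the set.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Share of gold spans that appear in the top k retrieved spans."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first gold span retrieved, or 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain relative to an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["s3", "s1", "s9", "s2"]   # system ranking for one prompt
gold = {"s1", "s2", "s7"}              # gold spans relevant to that prompt

recall_3 = recall_at_k(retrieved, gold, k=3)
rr = reciprocal_rank(retrieved, gold)
ndcg_4 = ndcg_at_k(retrieved, gold, k=4)
```

In this example one of three gold spans is in the top 3, the first hit is at rank 2, and NDCG at 4 is roughly 0.50 because both hits sit below where an ideal ranking would place them.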
The smallest run that produces a defensible first number, designed so synthesis quality stays comparable as retrieval and prompts change.
A small set of synthesis prompts spanning the main themes and countries — frozen and versioned, so every run answers the same questions.
Same prompt, output length and rubric across runs, so a change in score reflects the system, not the test conditions.
A calibration subset is double-scored for agreement before any number is trusted, reusing the reliability gate from Phase 1.
A first synthesis-quality baseline across all five dimensions, with inter-rater agreement reported beside every score.
These methods measure things that depend on first having a reliable output score — how an answer was produced, the quality of the underlying evidence, robustness under stress. We plan to work through them once the rubric, retrieval scoring and first baseline are in place.
Process & trajectory evaluation
Scoring how an answer was produced — planning, query strategy, tool calls, error recovery — not only the final paragraph. Used when output scoring alone can't localise a failure.
Chatbot evaluation
A multi-turn conversational surface that asks, retrieves and answers across turns. Its evaluation schema — which dimensions, what reference standard, how to handle multi-turn context — is still to be designed.
Risk-of-bias & certainty (GRADE)
A GRADE-based appraisal of the quality of the underlying evidence, not the extraction. It needs a PICO (population, intervention, comparison, outcome) and study-design layer we don't collect today, so it sits outside the first phase.
A validated model-judge at scale
Extending the human-calibrated judge across the whole corpus, once it has proven agreement with human labels on held-out sets.
The full dimension library
The complete multi-domain set of quality dimensions the research surfaced, beyond the five essential ones scored first.
Adversarial & temporal stress sets
Targeted test sets that probe robustness — adversarial prompts and time-shifted evidence — layered on once a stable baseline exists.
Evaluation is one half of our commitment to trustworthy outputs. The other is traceability — every claim linked back to its source. See where AI is used and the safeguards around it.