
Scalene 39: AbsenceBench / Jury Theorems / Microwaves

Humans | AI | Peer review. The triangle is changing.

Another week where I’m struggling to get everything in without overwhelming you. It feels like some of the fundamental problems surrounding AI and peer review are finally being acknowledged, and steps are being taken towards solving them, starting with the first story below. Looks like we’re back to weekly updates for the foreseeable future!

22nd June 2025

1//
AbsenceBench: Language Models Can’t Tell What’s Missing 

arXiv - 13 June 2025 - 46 min read

If you use LLMs regularly for manuscript evaluation, you will have noticed a distinct trend over time: the quality of evaluation of the presented work has increased enormously with each new model, but all LLMs still struggle to place the work in historical context and, importantly, fail to notice what isn’t presented. This is partly explained by this work, which shows there is much to do, and that we are starting from much further back than you might expect:

We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to.

As the authors observe in a Bluesky post:

If LLMs can’t tell what’s missing, this might pose challenges for using LLMs as judges, graders, or assistants in any domain where absence matters. We hope AbsenceBench can serve as a starting point for building more robust and trustworthy architectures that are absence-aware: it’s a necessary but far from sufficient bar for making sure LLMs can reason about absence.
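To make the setup concrete, here’s a rough sketch of how an omission-detection task like this might be scored. This is my own illustration in Python, not the authors’ code, and it assumes line-level omissions and a simple set-based F1:

```python
# A rough sketch of the AbsenceBench idea (my illustration, not the authors' code):
# remove some lines from a document, show the model both versions, then score
# the list of "missing" lines it returns with precision / recall / F1.

def true_omissions(original, edited):
    """Lines present in the original but absent from the edited copy."""
    return set(original) - set(edited)

def omission_f1(original, edited, predicted_missing):
    """F1 between the model's claimed omissions and the real ones."""
    gold = true_omissions(original, edited)
    pred = set(predicted_missing)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: "line 2" was removed, but the model also flags a line that is still there.
original = ["line 1", "line 2", "line 3", "line 4"]
edited = ["line 1", "line 3", "line 4"]
print(omission_f1(original, edited, ["line 2", "line 4"]))  # 0.666...
```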

2//
A tale of two studies

A confusing week for those of us who like to keep abreast of how AI is helping or hindering our search for absolute truths. First, I came across this working paper on EconStor entitled ‘Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science’ [pdf].

This study evaluates the effectiveness of varying levels of human and artificial intelligence (AI) integration in reproducibility assessments of quantitative social science research. We computationally reproduced quantitative results from published articles in the social sciences with 288 researchers, randomly assigned to 103 teams across three groups — human-only teams, AI-assisted teams and teams whose task was to minimally guide an AI to conduct reproducibility checks (the “AI-led” approach). Findings reveal that when working independently, human teams matched the reproducibility success rates of teams using AI assistance, while both groups substantially outperformed AI-led approaches (with human teams achieving 57 percentage points higher success rates than AI-led teams, p < 0.001).

Uh-huh, so humans and humans with AI assistance are better at these reproducibility tests than AI-led teams.

But then, literally a few hours later, I started to read this LinkedIn post from Arnaud Engelfriet which seems to state the opposite (a finding that appears to be common in radiology too).

🤖⚖️ “Our AI is 90% accurate!”
“Great, let’s add a human in the loop for the final 10%.”
"Perfect, now it’s 78%!"
…. beyond ~80% model accuracy, adding human-in-the-loop mechanisms actually degrades overall system performance. Not because people are malicious or lazy. We bring inconsistency, and often override good recommendations for the wrong reasons.

I’m more on board with the second viewpoint than the first, but conflicting views allow you to pick your favourite, I guess. I’m going to spend some more time this week reading the EconStor paper and testing my own assumptions here. I’d be intrigued to know your feelings on this.
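To see how that arithmetic can work, here’s a toy calculation, entirely my own and much cruder than either study. Assume the human overrides some fraction of the model’s calls, that the decision to override is independent of whether the model was actually right, and that the human has their own accuracy when overriding:

```python
# Toy model of a human-in-the-loop reviewer (my own arithmetic, not Engelfriet's):
# the human overrides a fraction of the model's calls, the decision to override is
# independent of whether the model was right, and the human has their own accuracy
# when they do override.

def hitl_accuracy(model_acc, override_rate, human_acc):
    """Accuracy of the combined system under the assumptions above."""
    return (1 - override_rate) * model_acc + override_rate * human_acc

print(hitl_accuracy(0.90, 0.0, 0.50))   # no overrides: 0.90
print(hitl_accuracy(0.90, 0.3, 0.50))   # override 30% of calls, right half the time: 0.78
print(hitl_accuracy(0.90, 0.3, 0.95))   # overrides only help if the human beats the model: 0.915
```

The middle line reproduces the 90% → 78% drop from the post; the last one shows the human only adds value if their override accuracy beats the model’s, which is a high bar at 90%.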

3//
Jury Theorems for Peer Review

BJPS - June 2025 - 23 min read

In this article we argue that crowd-sourced peer review is likely to do better than journal-solicited peer review at sorting articles by quality. Our argument rests on two key claims. First, crowd-sourced peer review will lead on average to more reviewers per article than journal-solicited peer review. Second, due to the wisdom of the crowds, more reviewers will tend to make better judgements than fewer reviewers will.
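The second claim is essentially the Condorcet jury theorem. A quick back-of-the-envelope check (my own, not the paper’s): if each reviewer independently judges quality correctly with probability p > 0.5, the chance that a simple majority gets it right climbs steadily with the number of reviewers.

```python
# Back-of-the-envelope check on "more reviewers make better judgements"
# (my own illustration; the paper's jury-theorem argument is more careful).
from math import comb

def majority_correct(n, p):
    """P(a simple majority of n independent reviewers, each correct with prob p, is right); n odd."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 11):
    print(n, round(majority_correct(n, 0.6), 3))
# 1 0.6, 3 0.648, 5 0.683, 11 0.753 -- and it keeps climbing towards 1 as n grows
```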

It seems obvious, and there are some examples of this working in parts of mathematics, but I see this as an example of a logical improvement to the scholarly comms industry that simply won’t work, for perverse reasons. Despite the wisdom of crowds, in many fields your input to the discussion is unrewarded work. It’s hard enough to get 2 peer reviewers to agree to review anything these days; getting 10 to comment on an article they need to read first would work for a small % of papers, but the vast majority would have none, I fear. I hope I’m wrong on this.

4//
Scientific Publishing: Enough is Enough

Astera Institute - 11 June 2025 - 20 min read

I admire the leadership of a few (well-funded) independent research institutes, such as Astera, who are using their privilege to eschew the world of journal publishing. Here Seemay Chou describes why journals are no longer fit for purpose.

Scientists should probably be putting out shorter narratives, datasets, code, and models at a faster rate, with more visibility into their thinking, mistakes, and methods. In this age of the internet, almost anything could technically be a “publishable unit.” It doesn’t even have to sound nice or match the human attention span anymore, given our increasing reliance on AI agents.

In more general terms, we need publishing to be a reliable record that approximates the true scientific process as closely as possible, both in content and in time. The way we publish now is so far from this goal that we’re even preventing our own ability to develop useful large language models (LLMs) that accelerate science. Automated AI agents can’t learn scientific reasoning based on a journal record that presents only the glossy parts of the intellectual process, and not actual human reasoning. We are shooting ourselves in the foot.

The comments on this post include quite a bit of pushback, but it’s refreshing to see this being discussed.

5//
YouTube recommendations x2

Two great things to watch this week.

1. A Conversation with Richard Sever, Assistant Director, Cold Spring Harbor Laboratory Press - thoughts on how preprints + review + curation should be making journal editors question their added value.

2. Panel Discussion: Using AI in Peer Review: Caution, Capability, and Community Input - great panel discussion from ASCE.

And finally…

One year ago: Scalene 6, 23 June 2024

Let’s chat
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days. However, we are now offering a secure hybrid human/AI service in just 5 days. If you want to chat about how to bring review times down with either a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.