
Scalene 42: Risks / Breaking Point / Lessons from Radiology

Humans | AI | Peer review. The triangle is changing.

Things are back to normal this week(!) as we look at the continuing strain on traditional peer review processes and some non-AI ways to fix it, lessons from radiology, and the continuing prompt injection story - which has an unexpected potential positive consequence.

3rd August 2025

1//
Evaluating the potential risks of employing large language models in peer review

Clin. & Trans. Discovery - 27 June 2025 - 14 min read

Claude 2.0 was used to generate peer review reports, rejection recommendations, citation requests and refutations for 20 original, unmodified cancer biology manuscripts obtained from eLife's new publishing model. Artificial intelligence (AI) detection tools (ZeroGPT and GPTZero) assessed whether the reviews were identifiable as LLM-generated. All LLM-generated outputs were evaluated for reasonableness by two experts on a five-point Likert scale.

Before I go into the most interesting parts of this paper, let’s acknowledge the sample size and the use of Claude 2 as notable limitations. There may not be hundreds of similar eLife papers available yet, but by the summer of 2025 Claude 4 is the well-established tool. That said, I’m going to highlight the main findings here, as they correlate well with personal experience:

  1. LLM-generated review comments cannot be identified by AI detectors

  2. LLMs can replace human reviewers in some cases

  3. LLMs can provide convincing rejection comments

  4. LLMs can generate seemingly reasonable citation requests for unrelated references

  5. LLMs can refute these unreasonable citation requests

2//
Have we already hit the peer review breaking point?

Substack - 24 July 2025 - 5 min read

Mark Hahnel opines on the current state of peer review and surmises that, thanks to a quadrupling in submissions over the last 25 years, we have reached - or are very close to - the point at which peer review no longer functions.

If publications grow 10x within 5 years of AI adoption, we'd need 200-300 million peer reviews annually. Even if every PhD-level researcher worldwide dedicated their entire career to reviewing, we couldn't keep up. As acceptance rates plummet (potentially from 20% to 5-10%), authors may submit each paper to more journals. This means the same research generates multiple submissions, artificially inflating the crisis.
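The arithmetic is easy to reproduce. Here is a minimal back-of-envelope sketch in Python - the current submission volume and reviewers-per-paper figures are my illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope sketch of the reviewing load Hahnel describes.
# The inputs are illustrative assumptions, not figures from the post.

current_submissions = 10_000_000   # assumed annual journal submissions today
growth_factor = 10                 # "10x within 5 years of AI adoption"
submissions = current_submissions * growth_factor

for reviewers_per_paper in (2, 3):
    reviews = submissions * reviewers_per_paper
    print(f"{reviewers_per_paper} reviewers/paper -> {reviews/1e6:.0f}M reviews/year")
# -> 200M and 300M reviews/year, the range quoted above

# Falling acceptance rates compound this: each rejection sends the same
# paper to another journal for a fresh round of review.
for acceptance_rate in (0.20, 0.05):
    rounds = 1 / acceptance_rate   # crude expected journal submissions per paper
    print(f"{acceptance_rate:.0%} acceptance -> ~{rounds:.0f} rounds of review per paper")
```

Whatever the exact inputs, the shape of the problem is the point: reviewer supply is roughly fixed while demand scales multiplicatively.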

So what’s the answer? Hahnel suggests the Publish-Then-Curate model. Certainly it would help, but do we not kick the can down the road and then have a Curation crisis? Who is reading these papers and recommending them - or warning us away from them? I feel this is a step in the right direction, but the journey is long.
You can also play a fun game where you can predict in which year the current peer review model will collapse :-)

3//
Making “Pay Peer Reviewers” More Than a Slogan

Origin Editorial - 22 July 2025 - 7 min read

A very even-handed review of the merits and otherwise of paying reviewers. I include it here because it also sets out some practical questions about how this could work, which proponents (of whom I am one) need to consider. Along with AI assistance, the professionalization of peer review is one thing I am convinced will become essential in the near future, so some of these points are valuable in that context too:

First, proponents of paid peer review should start developing suggestions in more detail, addressing the sorts of questions laid out here. Frequently, what seems to be an obvious solution to one person is not obvious to their colleagues, and it is likely that more consciousness raising among academics would be necessary for paid peer review to be more than a hypothetical or pilot study.

Second, many researchers may find data more convincing than arguments from first principles. More pilot studies from journals would be extremely welcome. These could help to force people to grapple with the details of implementation and determine what the cost-benefit ratio might be.

Third, organizations that specialize in research culture, metascience, and scholarly publishing are well-placed to examine the topic. Reports or white papers on the topic could help articulate the range of suggestions, and their potential pitfalls, from the perspective of all of the players.

Related is this LinkedIn blog post, which describes the ‘Feedback Loop of Decline’: https://www.linkedin.com/pulse/emerging-crisis-peer-review-ai-ethics-pressure-publish-fay-manning-kjcsf/ - it also has a killer closing paragraph.

4//
Comparing AI-generated and human peer reviews: A study on 11 articles

Hand Surg. & Rehab. - 19 July 2025 - 17 min read

Again, I’m going to lament the sample size here, but the results from this study are very interesting nonetheless. I was particularly impressed by the use of newer models (ChatGPT-4o and o1) and the ARCADIA tool to assess the quality of peer review reports in the biomedical domain.

LLMs, including ChatGPT, can enhance the peer review process for scientific articles by structuring and refining human reviewers’ peer reviews. Their effectiveness depends on precise instructions that minimize “hallucinations” while improving quality, explicitly prohibiting the generation of false information and requiring detailed, structured, high-quality peer reviews. Their ability to process large volumes of data far exceeds human capacity but requires strict oversight of their limitations. Although currently prohibited for privacy reasons, if used appropriately, LLMs could accelerate human reviewers' work and improve the quality of scientific publications. Our findings suggest that future work should explore whether combining LLM-generated reviews with human assessments could help improve the peer review process.
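The “precise instructions” point is concrete enough to sketch. Below is a minimal, hypothetical version of the kind of structured, fabrication-banning review prompt the authors describe - the prompt wording, model choice and use of the OpenAI Python client are my own illustration, not the study’s actual protocol:

```python
# A minimal sketch of a structured, hallucination-averse review prompt,
# in the spirit the authors describe; wording and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REVIEW_PROMPT = """You are assisting a human peer reviewer.
Rules:
- Do NOT invent facts, references, or results; if unsure, say so.
- Comment only on what appears in the manuscript text provided.
Produce a structured review with these sections:
1. Summary of the manuscript
2. Major concerns (methods, statistics, claims vs. evidence)
3. Minor concerns (clarity, figures, references)
4. Questions for the authors
"""

def draft_review(manuscript_text: str) -> str:
    # Returns a structured draft for a human reviewer to check and edit.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": manuscript_text},
        ],
    )
    return response.choices[0].message.content
```

The constraints matter more than the exact wording: structure plus an explicit prohibition on fabrication is what the study credits for reducing hallucinations.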

5//
Bottom-up Domain-specific Superintelligence: A Reliable Knowledge Graph is What We Need

One of the things holding back fuller AI assessment of academic manuscripts is that models are weak at placing the work in the historical context of ‘what has come before’ this paper. Weird, given that the references precisely describe the relevant nodes and edges on the graph of all published research. But this is important for detecting novelty - is this a new theory, or a new application, that we haven’t seen before? Knowledge graphs should be able to help enormously here (see the sketch after the excerpt below). The talk of ‘superintelligence’ does stick in my craw a little though:

Recent advances in language modeling [1–8] have made a significant stride towards a cognitive system [9, 10] capable of performing a wide spectrum of tasks with human-like proficiency [11–14]. Yet, human-level generality may only be a waypoint on the path to advanced intelligent systems that may exceed the cognitive performance of humans: Superintelligence [15, 16]. While achieving the breadth of human cognition is one goal of advanced artificial intelligence, superintelligence might be orthogonally characterized by depth, outperforming the best human experts in specialized domains [17–23], like proving unsolved conjectures in number theory, developing novel kinase inhibitors for rare cancer subtypes, or discovering new ferromagnetic semiconductors that operate at room temperature. Consequently, advancing towards superintelligence might require fine-tuning general cross-domain intelligence into specialized domain-specific expertise.
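To make the knowledge-graph point concrete, here is a minimal sketch using networkx - the citation data, the novelty heuristic and every identifier are hypothetical, intended only to show why reference graphs are a natural substrate for novelty detection:

```python
# A toy sketch: a manuscript's references define nodes and edges in the
# graph of published work, and graph structure can flag whether the
# combination of ideas looks novel. All data here is hypothetical.
import networkx as nx

# Hypothetical prior literature: an edge means "paper A cites paper B".
literature = nx.DiGraph()
literature.add_edges_from([
    ("P1", "P2"), ("P1", "P3"),   # established cluster on topic X
    ("P4", "P5"),                 # separate cluster on topic Y
])

def novelty_signal(manuscript_refs: list[str]) -> float:
    """Crude proxy: how disconnected are the cited papers from each other
    in the existing graph? Bridging otherwise unconnected clusters is one
    weak signal of a novel combination of ideas."""
    undirected = literature.to_undirected()
    cited = [r for r in manuscript_refs if r in undirected]
    if len(cited) < 2:
        return 1.0  # cites unknown/unlinked work: maximally "novel"
    components = {frozenset(nx.node_connected_component(undirected, r))
                  for r in cited}
    return (len(components) - 1) / (len(cited) - 1)

print(novelty_signal(["P2", "P5"]))  # 1.0: bridges two disjoint clusters
print(novelty_signal(["P2", "P3"]))  # 0.0: stays within one cluster
```

A real system would need hundreds of millions of nodes and richer edge types (methods, claims, datasets), but the principle stands: novelty shows up as unusual structure in the graph.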

And finally…

Beyond Assistance: The Case for Role Separation in AI-Human Radiology Workflows - I’m fascinated by the adoption of AI in radiology and the benefits and pitfalls of hybrid human-AI evaluations. There are lessons for peer review in here too, I feel.

Ruining Peer Review, but funnier - James Heathers has a go at prompt injection for peer review manipulation, with hilarious results.

Can Author Manipulation of AI Referees Be Welfare Improving? - like inflation, sometimes a little of a bad thing (prompt injections again) can serve a greater good. I’m going on the abstract only here, as I wasn’t able to buy the full text. Caveat lector.

AI will soon be able to audit all published research – what will that mean for public trust in science? - a logical endpoint to automated AI assessment is that we can go back and re-review all existing work. For the sake of the scientific endeavour, we should.

One year ago: Scalene 11, 04 Aug 2024

Let’s chat

I’ll be at the Peer Review Congress in Chicago in early September, and then ALPSP in Manchester shortly thereafter. Wanna meet?

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.