
Scalene 31: One step back? / ARTEMIS / Paint

Humans | AI | Peer review. The triangle is changing.

I’m not always great at predictions, but one of them is starting to come true. 2024 was the year of research integrity startups and services, and 2025 is already living up to my prediction of being the year of automated peer review solutions. As I sit here writing this, and undoubtedly at London Book Fair this week, new services are being launched and promoted as the future of peer review. Automation and human oversight are a difficult balance to strike, but one thing we can be sure of: human oversight is an absolute requirement now, and for the foreseeable future. I also highlight some dissenting voices, as they provide a reality check for the most ardent optimists, even if they aren’t always representative of the industry as a whole. Let’s go!

10th March 2025

// 1
Is Your Paper Being Reviewed by an LLM? A New Benchmark Dataset and Approach for Detecting AI Text in Peer Review
arXiv - 26 Feb 2025 - 25 min read

With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review.
To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written by humans and different state-of-the-art LLMs. Motivated by the shortcomings of existing methods, we propose a new detection approach which surpasses existing methods in the identification of AI written peer reviews. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI.

CL: It’s surprising to me that nearly 800,000 peer review reports can be identified as AI-generated. But given that they exist, a perfect use case is to compare them against reviews we can verify were written by humans. The bad news: the 18 existing approaches for determining whether a peer review report is AI-generated are ‘poorly suited’ to the task.
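
For a sense of what those 18 detectors are attempting, here’s a toy sketch of the task: treat detection as binary text classification and train on labelled reviews. This is not the paper’s proposed method, and the example reviews and labels are hypothetical stand-ins for the real benchmark data; it’s only meant to show the problem framing.

```python
# Toy AI-text detector for peer reviews: TF-IDF features + logistic
# regression. Purely illustrative; not the paper's method or data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: 1 = AI-generated, 0 = human-written.
reviews = [
    "The paper presents a comprehensive and well-structured analysis...",
    "Sec 3.2 contradicts Table 4 - did the authors rerun the ablation?",
    "Overall, the methodology is sound and the contributions are clear...",
    "Fig 2 axis labels are unreadable, and eq. (7) drops a term.",
]
labels = [1, 0, 1, 0]

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(reviews, labels)

# Probability that an unseen review is AI-generated (class 1).
print(detector.predict_proba(["The authors propose a novel framework..."])[0][1])
```

As the paper shows, simple surface-feature approaches like this struggle at the level of an individual review, which is exactly why new detection methods are needed.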

// 2
Automatically Evaluating the Paper Reviewing Capability of Large Language Models
arXiv - 24 Feb 2025 - 29 min read

Peer review is essential for scientific progress, but it faces challenges such as reviewer shortages and growing workloads. Although Large Language Models (LLMs) show potential for providing assistance, research has reported significant limitations in the reviews they generate. While the insights are valuable, conducting the analysis is challenging due to the considerable time and effort required, especially given the rapid pace of LLM developments. To address the challenge, we developed an automatic evaluation pipeline to assess the LLMs’ paper review capability by comparing them with expert-generated reviews. By constructing a dataset consisting of 676 OpenReview papers, we examined the agreement between LLMs and experts in their strength and weakness identifications. The results showed that LLMs lack balanced perspectives, significantly overlook novelty assessment when criticizing, and produce poor acceptance decisions. Our automated pipeline enables a scalable evaluation of LLMs’ paper review capability over time.

CL: More bad news? It certainly seems the automation of peer review has taken a step back, if this week’s arXiv releases are to be believed. Having used a variety of LLMs to test their strengths and weaknesses, I can anecdotally confirm that novelty assessment is still something they all struggle with, though some have specific strengths that can be useful in the right hands with the right prompting. A strong argument for a hybrid human-AI reviewing future.
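
The paper’s pipeline hinges on measuring agreement between the strengths and weaknesses an LLM identifies and those an expert identifies. As a minimal sketch of that idea (not the authors’ actual matching logic, which would need semantic matching rather than exact strings), here is plain set overlap over hypothetical review points:

```python
# Sketch of review-point agreement via Jaccard overlap. Real pipelines
# match points semantically; exact string matching is for illustration.
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical weakness lists for one paper.
expert_weaknesses = {"limited novelty", "weak baselines", "no error bars"}
llm_weaknesses = {"weak baselines", "unclear writing", "no error bars"}

print(f"weakness agreement: {jaccard(expert_weaknesses, llm_weaknesses):.2f}")
# -> 0.50: the LLM recovers two of the expert's three points but misses
# the novelty criticism, echoing the paper's headline finding.
```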

// 3
Crit me baby, one more time
Everything Hertz - 2 March 2025 - 54 min listen

It’s not often I recommend podcasts on here, mainly because I don’t have the time to listen to more than a handful a week, but this one is worth your time on a commute, long run, or dog walk - make time! James Heathers and Dan Quintana discuss the ERROR project (highlighted here a few weeks ago) and how citation counts or altmetric scores could trigger a post-publication review of high-visibility papers.

// 4
ARTEMIS - Automated Review and Trustworthy Evaluation for Manuscripts in Science
ResearchHub - 27 Feb 2025 - 10 min read

A web application developed by the ResearchHub Foundation using AI agents (an AI Editor and AI Peer Reviewers).

CL: Apparently inspired and accelerated by the Black Spatula Project, ResearchHub have pre-registered how they intend to develop a protocol for assessing reviewer performance, and to use agentic AI to tackle the peer review crisis of too many papers and not enough reviewers. The plan also includes incentivising reviewers with payment in ResearchCoin ($RSC), ResearchHub’s native crypto token. Admirable openness here, especially at the end of the post, where the authors engage directly with reader comments.

https://www.researchhub.com/post/3961/artemis-automated-review-and-trustworthy-evaluation-for-manuscripts-in-science

// 5
Is everyone huffing paint?
BlueSky - 08 March 2025 - 17 min read

A counterblast to any unchecked optimism around AI and peer review this week came from Carl Bergstrom on BlueSky. Reacting to the ResearchHub news above, he provoked a lively thread on the ethics of AI in peer review by asking “Is everyone huffing paint?”

Is everyone huffing paint? Crypto guy claims to have built an LLM-based tool to detect errors in research papers; funded using its own cryptocurrency; will let coin holders choose what papers to go after; it's unvetted and a total black box—and Nature reports it as if it's a new protein structure.

Carl T. Bergstrom (@carlbergstrom.com), 7 March 2025

CL: Carl has a point here, but a few counterpoints to his post: 1) the Nature editor questioned the charge that the article was uncritical of the approach, and 2) the ideal of human-only reviews that are high-quality and on time comes from a position of privilege. White men, experienced in their fields, working at Western institutions will have a very different take on this from the rest of the world. For many, AI is going to bring their experience closer to that ideal - as long as humans remain involved (for now).

And finally…

Loads of other things to clear out of my Scalene folder. Click away:

I’m at Olympia for the London Book Fair until Thursday afternoon this week. Reply to this email if you want to meet up at some point.

Let's do coffee!
- London Book Fair, London(!) - Mar 11-13
- ALPSP UP Redux, Oxford - April 3-4 [I’m giving the keynote speech on the 3rd]
Let me know if you’re at any of these and we can chat all things Scalene-related.

Free consultation calls
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this takes the form of 100% human peer review, delivered in 7 days. However, we are keen to experiment further with subtle AI assistance. If you want to chat about bringing review times down with a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.