
Scalene 23: Retraction / Strain / Agents

Humans | AI | Peer review. The triangle is changing.

The one in which I eat some humble pie, where others see AI accelerating the utility of AI peer review, and where ‘agents’ are coming to improve the evaluation work of, er, other agents. Plus the delightful UnderwheLLM. Let’s go…

01 December 2024

// 1
The Obsolescence of Traditional Peer Review: Why AI Should Replace Human Validation in Scientific Research [RECOMMENDATION RETRACTED]
Preprints.org - 05 Nov 2024 - 10 min read

Well, this is a little embarrassing, yet encouraging at the same time. Last week I highlighted this preprint and commended its readability while failing to notice some red flags. The author has a prodigious output of papers on a variety of subjects, and this preprint contains many made-up references to back up his point. In hindsight I should have spotted this, and furthermore the preprint seems likely to have been written entirely by some LLM or other.

But it’s encouraging in other ways. It shows the value of peer review (this hadn’t been reviewed), and it shows how well science can self-correct, given the number of messages I got by email and on Bluesky. I will happily fulfil my part of the deal by retracting the recommendation to read this, and the original post has been amended on the web version too.

Thanks to the many people who flagged this up, and particularly to Shahan Ali Memon for this investigative work. If only all potential retractions could be handled this easily.

// 2
AI to take the brunt of peer reviewing?
LinkedIn - 26 Nov 2024 - 9 min read

In light of the above (!) I’m going to leave the bigger statements to others this week, starting with a wonderful thread on LinkedIn provoked by Gunther Eysenbach’s assertion below:

Well, it turns out a lot of people were both for and against him on that. Read the full thread (linked below), as it is an enlightening snapshot of opinions on this topic. And readers of this newsletter will be particularly interested to know the following:

// 3
Standard Terminology for Peer Review: Where Next? 
EON - 15 Nov 2024 - 6 min read

Michael Willis describes The Standard Terminology for Peer Review and how it aims to clarify different peer review models and promote transparency for authors, reviewers, and readers. It encourages journals to clearly describe their peer review processes, helping to build trust in scholarly communication.

The rationale is sound and there are some great visual indicators of how it could look (kudos to IOP Publishing for having already implemented it), but a future revision may wish to include opportunities for editors to disclose AI use too.
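
To give a flavour of what ‘clearly describing’ a peer review process might look like in practice, here is a purely illustrative sketch of a journal declaring its model as structured metadata (Python, as good a notation as any for a sketch). The facet and term names are paraphrased from my reading of the standard rather than quoted from it, and the journal is invented, so treat this as a doodle rather than the spec:

# Illustrative only: a journal declaring its peer review model as structured
# metadata. Facet/term names are paraphrased from the Standard Terminology
# for Peer Review, not quoted from it; consult the published standard for
# the exact vocabulary.
peer_review_declaration = {
    "journal": "Example Journal of Things",        # invented journal
    "identity_transparency": "double anonymized",  # authors and reviewers hidden from each other
    "reviewer_interacts_with": ["editor"],         # no reviewer-to-reviewer discussion
    "review_information_published": "none",        # reports not published alongside the article
    "post_publication_commenting": "open",         # readers can comment after publication
    "ai_assistance_disclosed": False,              # not in the standard today; the sort of field a future revision could add
}

That last line is really the point of my quibble above: somewhere in a declaration like this there should eventually be room to say whether, and how, AI was used in the editorial process.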

// 4
The strain on scientific publishing
QSS - 08 Nov 2024 - 18 min read

The total number of articles indexed in Scopus and Web of Science has grown exponentially in recent years; in 2022 the article total was ∼47% higher than in 2016, which has outpaced the limited growth—if any—in the number of practicing scientists. Thus, publication workload per scientist has increased dramatically. We define this problem as “the strain on scientific publishing.” To analyze this strain, we present five data-driven metrics showing publisher growth, processing times, and citation behaviors. We draw these data from web scrapes, and from publishers through their websites or upon request. Specific groups have disproportionately grown in their articles published per year, contributing to this strain. Some publishers enabled this growth by hosting “special issues” with reduced turnaround times. Given pressures on researchers to “publish or perish” to compete for funding, this strain was likely amplified by these offers to publish more articles.

CL: This is a great paper with illuminating figures showing how the majority of growth in the output of papers in recent years has come through special issues, where the oversight of peer review processes is somewhat lower than for regular issues. Thus a commercial driver (accept more OA papers) has resulted in a polluted corpus of poorly- or non-reviewed manuscripts which we are going to have to retrospectively address when the will is there to do it.

// 5
Agent-as-a-Judge: Evaluate Agents with Agents
arXiv - 18 Oct 2024 - 23 min read

Now we’re comfortably heading into an area where my knowledge is somewhat limited right now, but since I see other people getting excited by it, I’m going to tentatively - given last week’s brouhaha - amplify those voices (caveat lector: this is a non-reviewed preprint).

I truly believe we will see the next revolution in peer review come from outside the industry, so I try to keep an eye on adjacent fields and what is happening there, and agentic AI is something that intrigues me deeply. Another LinkedIn post, this time by Nicholas Nouri, describes this paper’s premise well (bolding mine):

We're all excited about the potential of intelligent agents. But amidst all the buzz, a critical question often gets overlooked: How do we effectively evaluate these AI agents?

Traditional evaluation methods tend to focus on the final outcome of an agent's task. While this provides some insight, it often misses the nuances of how the agent arrived at that result - especially important in complex domains like code generation.

Meta's latest research introduces an "Agent-as-a-Judge" framework.

Instead of solely assessing the end product, this approach involves AI agents evaluating the performance of other AI agents throughout each step of their process. Think of it as AI systems peer-reviewing each other to provide detailed feedback, much like how colleagues might review each other's work in a collaborative environment.

Why is this approach much better?

- Step by Step Feedback: By evaluating each intermediate step, we gain a deeper understanding of the agent's decision-making process. This helps identify specific areas for improvement rather than just knowing whether the final outcome was successful.

- Introduction of the DevAI Dataset: To facilitate more rigorous testing, Meta introduced DevAI - a dataset comprising 55 realistic AI development tasks. This provides a more challenging benchmark to evaluate modern AI agents effectively.

In experiments using the DevAI benchmark, the Agent-as-a-Judge framework outperformed previous evaluation methods, especially in tasks requiring complex reasoning.

The study found that this method's evaluations are as reliable as human assessments, offering a viable alternative that saves both time and resources.

- Cost and Time Efficiency: Automating the evaluation process reduces the need for extensive manual reviews, accelerating the development cycle.

Could this peer-review approach be the key to unlocking more advanced and dependable AI applications?

I’ll leave it to you to draw the parallels to academic peer review. Nicholas’ post and the arXiv paper it describes are below. Even if you think this sounds far-fetched or too hard, just look at the figures for a high-level overview of what it is describing:
https://www.linkedin.com/posts/nicholasnouri_innovation-technology-future-activity-7264499042636701696-UedW/?utm_source=share&utm_medium=member_ios

https://arxiv.org/pdf/2410.10934
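
For the terminally curious, here is a very rough sketch of the pattern Nicholas describes: a ‘judge’ agent scoring another agent’s intermediate steps against the task requirements, rather than only its final output. This is not the paper’s implementation (that is what the arXiv link above is for); the Step class, the llm() stub and the PASS/FAIL convention are all my own illustrative inventions.

# Toy sketch of the agent-as-a-judge idea: a judge agent reviews each
# intermediate step of a worker agent's trajectory, not just the end result.
# Everything here (Step, llm, the PASS/FAIL convention) is illustrative and
# not taken from the paper or the DevAI benchmark.
from dataclasses import dataclass

@dataclass
class Step:
    description: str  # what the worker agent says it did at this step
    output: str       # the artefact it produced (plan, code, query, ...)

def llm(prompt: str) -> str:
    # Stand-in for a real model call; swap in whatever API you actually use.
    return "PASS - step appears consistent with the requirement (stub response)"

def judge_step(requirement: str, step: Step) -> dict:
    # Ask the judge agent to assess a single intermediate step.
    verdict = llm(
        f"Requirement: {requirement}\n"
        f"Agent step: {step.description}\n"
        f"Agent output: {step.output}\n"
        "Does this step move towards satisfying the requirement? "
        "Answer PASS or FAIL, then give one sentence of feedback."
    )
    return {
        "step": step.description,
        "passed": verdict.strip().upper().startswith("PASS"),
        "feedback": verdict,
    }

def judge_trajectory(requirement: str, steps: list[Step]) -> list[dict]:
    # Review the whole trajectory step by step, like a reviewer reading the
    # methods section rather than just the conclusions.
    return [judge_step(requirement, s) for s in steps]

if __name__ == "__main__":
    trajectory = [
        Step("Parsed the task requirements", "requirements.json"),
        Step("Wrote a training script", "train.py"),
    ]
    for result in judge_trajectory("Train a small image classifier", trajectory):
        print(result["passed"], "-", result["feedback"])

The parallel to peer review is the per-step feedback: the judge is not just stamping accept/reject on the finished artefact, it is commenting on the method.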

And finally…

It’s nice to see someone with a sense of humour in the peer review/AI space, because that’s what you need to develop UnderwheLLM - the ‘Reviewer 2’ AI that hates your work. I’ll let developer Lars explain more:

Links to stuff I didn’t have time for this week:

Let's do coffee!
I’m in London for the STM meeting on December 4th (± 1 day).

Free consultation calls
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days. However, we are keen to experiment further with subtle AI assistance. If you want to chat about how to bring review times down with either a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by Chris Leonard.
If you want to get in touch with me, please simply reply to this email.