- Scalene
- Posts
- Scalene 36: We are 1! / REMOR / SIFT
Scalene 36: We are 1! / REMOR / SIFT

Humans | AI | Peer review. The triangle is changing.
Happy Birthday to us, happy birthday to us, happy birthday dear Scalene, happy birthday to us. 12 months, 36 newsletters, and 450 subscribers later, we’re starting the second year of monitoring how AI is changing peer review. It means I can now have a ‘One year ago’ section at the bottom of the email to remind us how far we have come (or in some cases, haven’t) in our journey around the sun. But we’re all about looking forward, not back, so let’s see what’s been happening since issue 35…
23rd May 2025
1//
Integrating Artificial Intelligence into Scholarly Peer Review: A Framework for Enhancing Efficiency and Quality
OSF Preprints - 15 May 2025 - 16 min read
A great read to start us off from Richard Wynne and Vijaya B Kolachalama wherein they outline pragmatic and thoughtful ways AI can be introduced to peer review workflows. Definitely one to ponder over:
The integration of AI into scholarly peer review appears inevitable, but will publishers, journals, and editors proactively guide its implementation or be bystanders? Using pragmatic approaches, journals have an opportunity to embrace AI and ensure that it is adopted in an efficacious and transparent manner. Deploying specialized peer review “copilot” chatbots could shield reviewers from the complexities of multiple, iterative “thick” AI prompts and foster creative ideation. Appropriate user interface design will ensure that human reviewers and editors supervise and concur with every step, while automatically generating transparent audit logs. Taking this approach could help relieve chronic pain points in scholarly workflow such as saving time for experienced reviewers by allowing them to focus on conceptual analysis and opening the door to non-native English speakers.
2//
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
arXiv.org - 16 May 2025 - 37 min read
Daniel Acuna has been doing stellar work in this area for a long time and a new paper from him always worth reading. This time with Pawin Taechoyotin he has engineered the shallowness and overpraising suggestions of LLMs out of the review process through a process known as multi-objective reinforcement learning.
In this work, we demonstrated that explicit reasoning combined with multi-objective reinforcement learning significantly enhances the depth and quality of automated peer review generation. By incorporating detailed, aspect-based reward signals, REMOR produced feedback that achieves approximately double the Human-aligned Peer Review Reward (HPRR) of typical human reviews, approaching parity with the best-quality human-generated reviews. Our proposed human-aligned reward function can also serve as a self-assessment tool for reviewers, encouraging more detailed and relevant feedback

3//
Sensemaking with SIFT Toolbox
Checkplease - May 2025
Mike Caulfield has shared a prompt which helps improve the performance of language models by guiding them to provide better conclusions and consider different viewpoints. It is designed for student researchers to assist in their research while encouraging independent thinking - however it could be used in other ways too!
SIFT Toolbox is a lengthy instruction prompt that outperforms unmodified LLMs in multiple dimensions. You paste it in at the beginning of a chat session (or add it to project instructions) to make the LLM act differently. With the prompt in place, your LLM will come to better conclusions, hallucinate less, and source conflicting perspectives more systematically. It also models an approach that is less chatbot, and more research assistant in a way that is appropriate for student researchers, who can use it to aid research while coming to their own conclusions.
4//
In-Context Watermarks for Large Language Models
arxiv.org - 22 May 2025 - 29 min read
It took me two reads to ‘get’ this one, but it seems like preliminary work that could morph into something very interesting for editorial offices. Briefly, they use a watermarking process to identify which ‘dishonest (or lazy) reviewers’ have used LLMs for their reviews by embedding imperceptible signals into the manuscript through carefully crafted watermarking instructions, the LLM’s output can carry a hidden watermark that enables later detection and attribution.

5//
Are the confidence scores of reviewers consistent with the review content? Evidence from top conference proceedings in AI
arxiv.org - 21 May 2025 - 45 min read
This study examines the relationship between confidence scores of reviewers and the content of their review reports. It finds a strong consistency between the two, with higher confidence scores linked to more hedge sentences and a greater word count.
Our research findings indicate a high consistency between confidence scores and the expression of review texts at the word, sentence, and aspect levels. This suggests that the current confidence scores in review reports are reliable. Furthermore, our research results suggest a negative correlation between confidence scores and paper decisions, indicating that higher confidence scores tend to lean towards rejecting the paper.

And finally…
Tangential stories from around the web:
Re2: A Consistency-ensured Dataset for Full-stage Peer Review and Multi-turn Rebuttal Discussions
Cactus Communications and Silverchair collaborate to elevate research integrity and streamline editorial workflows [disclosure: I am a Cactus employee, but this is just big news so it’s going in!]
One year ago: Scalene 1, 22 May 2024
Length matters
Despite what others may have told you, the length of a peer review does matter and to satisfy the author they should be at least 947 words long. That's a startling finding from a paper by Abdelghani Motti & Luis Miotti, handily summarised by the author in a post on the LSE blog:
Our analyses revealed a statistically significant impact of reviewers’ report length on citations received, with reports surpassing approximately one and a half pages (947 words) marking a critical threshold. Notably, papers garnering the highest citation counts tended to be associated with longer reviewer reports, exceeding the average length. Beyond this threshold, citation counts exhibited an increasing trend with longer report lengths, corroborating the initial hypothesis positing the synonymous relationship between the length of referees’ reports and the extent of revisions solicited, thereby enhancing manuscript “quality”.
I think we all know short reviews can be pithy and superficial, but the opposite case never really struck me for some reason. As an author, and editor, you would want something substantial - particularly if you'd been waiting a long time for it.
Let’s chat
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days. However, we are now offering a secure hybrid human/AI service in just 5 days. If you want to chat about how to bring review times down with either a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus
Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.