Scalene 37: Unwritten codes / decentralization / equilibrium

Humans | AI | Peer review. The triangle is changing.
There are some great projects going on right now which ask us to reconsider manuscript evaluation as a continuum encompassing all aspects of research integrity and expert opinion. ‘Peer review’ itself may be a term we lose over time, as this quality-control step becomes so much more than it is now.
8th June 2025
1//
Language Models Surface the Unwritten Code of Science and Society
arXiv - 27 May 2025 - 19 min read
This paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how these biases in LLMs can be leveraged to make society’s “unwritten code” — such as implicit stereotypes and heuristics — visible and accessible for critique. We introduce a conceptual framework through a case study in science: uncovering hidden rules in peer review - the factors that reviewers care about but rarely state explicitly due to normative scientific expectations.

2//
Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
arXiv.org - 28 May 2025 - 16 min read
Recent advancements in large language models have sparked interest in utilizing them to assist the peer review process of scientific publication. Instead of having AI models generate reviews in the same way as human reviewers, we propose adopting them as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs from different providers and assessed their performance and API costs for identifying critical errors and unsoundness problems. The OpenAI o3 model performed the best, while o4-mini was the most cost-effective one in our evaluation.
CL: Inspired by the Black Spatula project, this work uses LLMs to check for errors and critical soundness problems that would invalidate a paper’s conclusions, and foresees AI assisting human review rather than replacing it.
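For a flavour of what a ‘manuscript quality checker’ might look like in practice, here is a minimal sketch using the OpenAI Python SDK. The prompt, model choice, and plain-text input are my own illustrative assumptions; the paper’s actual pipeline, including its LLM-as-judge evaluation framework, is considerably more involved.

```python
# Minimal sketch: asking a reasoning LLM to act as a manuscript quality checker
# rather than a full reviewer. Assumes the OpenAI Python SDK with an API key in
# the environment; the prompt and model choice are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHECKER_INSTRUCTIONS = (
    "You are a manuscript quality checker, not a reviewer. Do not judge "
    "novelty or significance. List any critical errors or soundness problems "
    "that would invalidate the paper's conclusions (flawed derivations, "
    "statistical mistakes, results that contradict the stated claims). "
    "For each problem, quote the relevant passage and explain the issue."
)

def check_manuscript(manuscript_text: str, model: str = "o3") -> str:
    """Return a plain-text list of potential critical problems in the manuscript."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": CHECKER_INSTRUCTIONS + "\n\n" + manuscript_text},
        ],
    )
    return response.choices[0].message.content

# Usage: print(check_manuscript(open("withdrawn_paper.txt").read()))
```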
3//
Decentralized knowledge assessment
The Innovation - 02 June 2025 - 21 min read
The peer-review process, which serves as the quality-control mechanism of scientific knowledge production, has been criticized for its bias, unreliability, and inefficiency. Academic conferences and journals typically rely on a centralized mechanism for reviewer assignment and paper assessment. We argue that this centralization is a major factor contributing to the unreliability of the review process, leading to deficiencies in the current knowledge-assessment systems. To address this, we propose a novel decentralized model that democratizes peer review by shifting decision-making rights from centralized authorities to all scholars participating in a scholarly community. Our model includes a dual-rewarding incentive mechanism that motivates scholars to actively participate in peer review by recognizing both their effort and scientific contributions. This model transforms peer review from passive judgment to active collaboration.

CL: Not strictly AI-related, but an innovation in peer review for computer science conferences nonetheless. It borrows from web3 concepts such as DAOs, and also touches on identity management (via zero-knowledge proofs), demand-based pricing of review services, and the general utility of blockchains in peer review. An eye-opening read if you’re unfamiliar with these things.
4//
Equilibrium effects of LLM reviewing
Bryan Wilder - 26 May 2025 - 7 min read
What could we expect if LLMs are adopted as a way to review academic manuscripts? In the longer term, bizarre author behaviour could (and almost certainly will) evolve to exploit them:
A very extreme form of equilibrium behavior is an explicitly adversarial attack on the LLM reviewer. The possibilities are near-endless. On the low-effort end of the spectrum, some authors will no doubt attempt to smuggle new instructions into hidden text in the paper, instructing the LLM to leave a more positive review. This may be explicitly forbidden by conferences, with automated checks implemented to detect such instructions. However, more sophisticated attacks abound. Authors could attempt steganographic attacks, using an LLM of their own to attempt millions of paraphrases of the original paper until discovering a version that is semantically similar but garners a much more positive review. Or, since sophisticated LLM reviewers will employ web search, authors could attempt to poison the search results, introducing papers into arxiv or less-selective venues which exist only to influence the LLM reviewer once retrieved into context.
The blog post itself is great, but also check out the comments on the Bluesky post.
https://bryanwilder.github.io/files/llmreviews.html
https://bsky.app/profile/brwilder.bsky.social/post/3lq3s2ooook2n
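The low-effort attack in that quote (smuggling hidden instructions into the paper) is also the easiest to screen for. Here is a minimal sketch of the kind of automated check a venue could run over the text extracted from a submission before handing it to an LLM reviewer; the phrase list is my own illustrative assumption and would only catch the laziest attempts.

```python
# Minimal sketch of an automated screen for hidden prompt-injection text in a
# submission, of the kind a venue might run before passing a paper to an LLM
# reviewer. White-on-white or zero-size text is invisible when rendered but
# still shows up in the extraction layer. The phrase list is illustrative,
# not exhaustive: paraphrased or steganographic attacks will slip straight past.
import re

INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"(leave|give|write) a (very )?positive review",
    r"recommend accept(ance)?",
    r"you are (the|an) (llm|ai) reviewer",
]

def flag_injection_attempts(extracted_text: str) -> list[str]:
    """Return any suspicious phrases found in the extracted text of a submission."""
    hits = []
    for pattern in INJECTION_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, extracted_text, flags=re.IGNORECASE))
    return hits

# Usage: flag_injection_attempts(text_extracted_from_pdf)
```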
5//
Introducing the Medical Evidence Project
James Heathers - 04 June 2025 - 8 min read
Everyone’s favourite ‘data thug’, James Heathers, looks set to disrupt the world of research integrity with a new initiative, The Medical Evidence Project. Read all about it on his own blog, where he outlines how things will work - and why current research integrity investigations are inadequate:
I have written before, as have many other people, about the sheer and unending silliness of dealing with regular research integrity processes. The people involved on ‘the other side’, wherever that may be, are slow. Or, slower than that. They lack accountability. They have guidelines for pursuing research integrity questions, like COPE, which are neither followed nor enforced. Complaints are often bounced between disinterested parties.
And if we find active threats to human health in this literature, I think it is immoral to get tied up in this silly process that you can’t trust or rely on.
Let's say we found an unambiguously contaminated treatment guideline. Should we enter into some nonsense exchange of emails that takes two years to issue an expression of concern while people are hurt that whole time?
No - we’re going straight to press.
No filter, no university intermediary, no silliness. It’ll look more like this:
1. Problem identified.
2. Report written about problem.
3. If serious enough, problem peer reviewed internally (AND I HAVE A LINE ITEM IN THE BUDGET TO PAY THEM)
4. Problem immediately reported in press.
And finally…
Tangential stories from around the web:
The Ethics of AI - a great new OA book from Rainer Mühlhoff and Bristol University Press
Recommendations for a Classification of AI Use in Academic Manuscript Preparation - The STM Association has created a classification system to help publishers establish guidelines for the use of AI in academic manuscript preparation.
Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery - AI-driven science raises ethical questions about creativity, oversight, and responsibility
One year ago: Scalene 4, 09 June 2024
The Other Reviewer: RoboReviewer
I came across a special issue of the Journal of the Association for Information Systems recently, dedicated to Generative AI and Knowledge Work. After an ‘interesting’ user journey to find the issue in question, I think I found the full article list as a PDF.
All of the articles will catch your eye in one way or another, but the one that stood out most for me was Ron Weber’s opinion piece introducing the concept of a RoboReviewer. It does a great job of looking at what peer review does and how we might start to use AI in the process, and it goes on to introduce the idea of a ‘RoboReviewer marketplace’.
However, it was Section 6 that shone for me: how will AI(-assisted) peer review change journal submission dynamics?
Presumably, prescreening activities at journals and conferences such as scans for plagiarism, doctored images, AI-generated content, and paper-mill output would also be rendered less effective (Hu, 2023; Tang, 2023). A RoboReviewer used by a researcher should have already detected these irregularities in their paper and possibly modified the paper to mask them.
CL - Should tools like ‘RoboReviewer’ be available to authors? My gut says yes, but my head is conflicted. We don’t make image manipulation or research integrity tools easily available to authors for good reason. But this might improve the overall quality of submissions to journals too. Food for thought.
https://aisel.aisnet.org/cgi/viewcontent.cgi?article=2171&context=jais
Let’s chat
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days, but we are now also offering a secure hybrid human/AI service in just 5 days. If you want to chat about bringing review times down with a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus
Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.