Scalene 47: arXiv papers reviewed #1

Humans | AI | Peer review. The triangle is changing.

As promised - although a little later than I intended - here is a special issue of Scalene devoted to arXiv preprints. I have noticed that some readers take a listing in the newsletter as a recommendation - and I don’t always agree with the articles I highlight (I highlight them because they are in ‘our’ space). But with great power comes great responsibility, so I am endeavouring to do the right thing and only write about preprints that I find interesting AND that are scientifically valid. The second part of this I am largely ascertaining with the help of LLMs and various prompts, which I share below. It’s a bit of effort, but hopefully it means I am bringing higher quality preprints to your attention. I am only doing this for preprints, not published articles or blog posts etc. Each review takes about 10 minutes.

20th October 2025

1//
Paper Copilot: Tracking the Evolution of Peer Review in AI Conferences

arXiv:2510.13201v1 - 15 Oct 2025

We present Paper Copilot, a system that creates durable digital archives of peer reviews across a wide range of computer-science venues, an open dataset that enables researchers to study peer review at scale, and a large-scale empirical analysis of ICLR reviews spanning multiple years. By releasing both the infrastructure and the dataset, Paper Copilot supports reproducible research on the evolution of peer review.

Evaluation of this preprint is below. As with other preprints in this issue, we are using Claude 4.5 Sonnet with Research and Extended Thinking activated.
The prompt is one I think I copied from the Black Spatula group - but it is tailored to CS conferences, so it felt appropriate. As with all LLM evaluations, it can be a bit nit-picky, but is at least thorough. Claude 4.5 is also good at suggesting missing references, with reasons, that are not hallucinated. A big step forward from just a few months ago. I have not further interrogated Claude on the response in the interests of time, but that would be best practice here.

[I’ve just noticed the links in my version are not in the published version linked above, making it hard to follow the section on Missing Critical Literature Citations. Here they are:

Interestingly the first two suggested links don’t go to the primary source (F1000 and arXiv, respectively) 🤔]

2//
ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

arXiv:2510.08867v1 - 09 Oct 2025

So, this one is coming through as ‘Major Revision’, but I like it so it’s going in. This particular Editor-in-Chief is human and has flaws and biases.

The paper proposes six guidelines for integrating AI into peer review workflows, positioning AI as a complement to human judgment rather than a replacement.

The paper makes three claimed contributions: (1) conceptualizing peer review as a socio-technical process requiring diverse perspectives, (2) presenting large-scale empirical analysis on the ICLR-2k dataset with persona-specific performance characterization, and (3) deriving practical guidelines for AI integration based on quantitative metrics and qualitative analysis.

[From the Claude review, not the preprint.]

The circular workflow of LLM-as-a-judge for LLM-generated reports needs benchmarking against human reviews, but otherwise I was very impressed with this paper. And I’m definitely going to use ‘peer review as a socio-technical process’ somewhere in the future!

3//
NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation

arXiv:2509.25179v2 - 01 Oct 2025

The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems.

[Abstract from preprint]

And if that sounds a bit much to process, we can simplify it:

what are 3 takeaways for a layman?

Here are 3 key takeaways from this research paper, explained in plain language:
1. The Challenge: Picking Good Research Papers is Complicated and Biased
Imagine trying to judge a cooking competition where chefs from different cuisines (Italian, Japanese, Mexican) are evaluated by different judges with different standards. A "7 out of 10" from one judge might mean something completely different than a "7" from another, and Italian dishes might consistently score differently than Japanese ones just because of judge preferences, not quality.
This is exactly what happens in scientific peer review. Papers from different research areas get scored inconsistently, and standards change over time. The researchers found that existing computer systems struggle with this—they're either too slow and expensive (like asking an AI to carefully read and review every paper) or too inaccurate (like just looking at numbers without understanding context).
2. The Solution: Train on Comparisons, Not Absolute Scores
Instead of teaching a computer "this paper deserves exactly a 6.5," the researchers trained their system by showing it pairs of papers and asking "which one is better?" — but crucially, only comparing papers from the same field and time period. This sidesteps the bias problem.
Think of it like this: Rather than asking "Is this pasta dish good?", you ask "Between these two Italian dishes submitted to the same competition, which is better?" This relative comparison is much more reliable. Then, once trained, the system can quickly assign scores to new papers without needing to do pairwise comparisons anymore—making it fast at deployment.
3. It Works Surprisingly Well, Even for Conferences It's Never Seen
The system achieved 78% accuracy at predicting whether papers would be accepted at major AI conferences. Even more impressive: when tested on papers from a completely different conference (NeurIPS) that it had never seen during training, it correctly predicted that rejected papers would have lower scores and "oral presentation" papers (the best ones) would have the highest scores.
Why this matters: This could help researchers get quick feedback on their work before submitting, help conference organizers identify promising papers more efficiently, and potentially make the peer review process more fair and consistent. However, the authors acknowledge this should assist humans, not replace expert judgment.
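
For the more technically minded, takeaway 2 can be sketched in a few lines of Python. This is only a minimal illustration of the general idea (compare papers only within the same field and year, then learn from "which is better?"), not the authors' code. The field names (`domain`, `year`, `rating`), the toy papers, and the Bradley-Terry style logistic loss are all my assumptions; the preprint's actual loss and its Review Tendency Signal are not reproduced here.

```python
import math
from collections import defaultdict
from itertools import combinations


def within_group_pairs(papers):
    """Return (better_id, worse_id) pairs, comparing only papers that
    share the same (domain, year) group. Field names are hypothetical."""
    groups = defaultdict(list)
    for paper in papers:
        groups[(paper["domain"], paper["year"])].append(paper)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            if a["rating"] == b["rating"]:
                continue  # a tie carries no preference signal
            better, worse = (a, b) if a["rating"] > b["rating"] else (b, a)
            pairs.append((better["id"], worse["id"]))
    return pairs


def pairwise_loss(score_better, score_worse):
    """Generic logistic (Bradley-Terry style) loss: small when the model
    scores the better paper higher. An assumed stand-in, not the paper's loss."""
    return math.log1p(math.exp(-(score_better - score_worse)))


# Toy data, invented purely for illustration.
papers = [
    {"id": "A", "domain": "vision", "year": 2023, "rating": 6.5},
    {"id": "B", "domain": "vision", "year": 2023, "rating": 5.0},
    {"id": "C", "domain": "nlp", "year": 2023, "rating": 7.0},
    {"id": "D", "domain": "nlp", "year": 2024, "rating": 4.0},
    {"id": "E", "domain": "nlp", "year": 2023, "rating": 7.5},
]

pairs = within_group_pairs(papers)
print(pairs)  # [('A', 'B'), ('E', 'C')]
```

Note that paper D never appears in a pair, because nothing else shares its field and year, and that A is never compared against C or E. Once a model has been trained on pairs like these, it scores each new paper on its own, which is why deployment stays fast.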

This one came without recommended citations, so I asked as a follow-up for missing cited works. I’m kind of glad, because this gives an example of how powerful Claude 4.5 is in recommending related papers. It also came as a chat, rather than an artifact, which is something that just seems to happen at random as far as I can make out.

4//
ReviewScore: Misinformed Peer Review Detection with Large Language Models

arXiv:2509.21679v1 - 25 Sep 2025

Peer review serves as a backbone of academic research, but in most AI conferences, the review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either “weaknesses” in a review that contain incorrect premises, or “questions” in a review that can be already answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore indicating if a review point is misinformed. To evaluate the factuality of each premise of weaknesses, we propose an automated engine that reconstructs every explicit and implicit premise from a weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreements on ReviewScore using eight current state-of-the-art LLMs and verify moderate agreements. We also prove that evaluating premise-level factuality shows significantly higher agreements than evaluating weakness-level factuality. A thorough disagreement analysis further supports a potential of fully automated ReviewScore evaluation.

[Abstract]

I was excited to share this one with you, under my new criteria of ‘interesting and correct’ - but alas there are too many flaws with this one to recommend it whole-heartedly. I am including it anyway as an example of something I would have shared blindly before, but will be more critical of in future.

There are a host of concerns here, but some are - in my eyes - either incorrect or invalid. The ones which seem to be most problematic are small sample size, inadequate inter-annotator agreement, circular reasoning (as in story 2) of LLM-as-judge for LLM-generated content, and overstated claims. But some criticisms, such as the claim that certain LLMs (GPT5, Claude-sonnet-4) were not available as of October 2025, are just plain wrong.

This is also a good example of an overly-long and over-picky (IMHO) review which is not dissimilar to some human reviews I have seen. When they decide they don’t like it, they really go for it (humans and machines alike).

And finally…

Stopping at 4 stories as my token limit has been reached for the next few hours. The next one will be a ‘normal’ newsletter and there is lots to cover in that too. I have just returned from the Frankfurt Book Fair and it was heartening to have conversations that are more accepting of some role for AI in the manuscript evaluation process. I think the next 6 months will involve some fairly bold moves from some fairly large players - given the questions of scale of output (see story 2 above), it’s going to be a necessary change.

What did you think?

Do you want to see more of these real-world review cases? Possibly with other LLMs and prompting strategies? Let me know. You can see there are some great positives and some fairly critical negatives in using LLMs for one-shot reviews like this. I’d love to do some more of these, possibly with Gemini next time, and maybe iterating and going into more depth on just 1 or 2?

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.