Scalene 28: Humans / GARs / LLM review

Humans | AI | Peer review. The triangle is changing.

Things are really ramping up in the LLM space. Just as we were getting to grips with the reasoning capabilities of DeepSeek's V3 and R1 models, another impressive model has been released by Ai2 - namely Tülu. I never really wanted Scalene to be a documenter of new AI tools, preferring to concentrate on how these tools affect the process of peer review (and ideally how they make life easier for authors), so I won't be going into any depth on the models themselves - only their applications. Thankfully, there are many people out there doing both very elegantly…

2 February 2025

// 1
Black Spatula (again)
Various/Discord - 27 Jan 2025 - 13 min read

The Black Spatula project continues to inspire, and to do a lot of the heavy lifting in applying the latest AI developments to spotting errors in academic manuscripts. A group of AI experts, researchers, sleuths, and reviewers has coalesced to see what insights can be gained when the latest models are applied to academic research manuscripts.

For now this is largely a retrospective exercise (analysing already-published papers), but the leap to using it for peer review is so small as to be negligible.

It’s hard to summarise a multitude of threads looking at various different problems in various different ways, but this single document (work in progress) gives you an idea of what’s possible now with DeepSeek and o1-preview.
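To give a flavour of how simple the core idea is, here is a minimal sketch of an error-spotting call - my own illustration, not the project's actual pipeline. The prompt wording, the model choice, and the use of the OpenAI Python SDK are all assumptions on my part:

```python
# Illustrative sketch only - not the Black Spatula project's actual pipeline.
# Assumes the OpenAI Python SDK (>=1.0) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

ERROR_SPOTTING_PROMPT = (
    "You are checking an academic manuscript for errors. "
    "List any numerical inconsistencies, unit mistakes, statistical errors, "
    "or claims unsupported by the data described. Quote the offending text "
    "and explain each suspected problem briefly."
)

def spot_errors(manuscript_text: str, model: str = "o1-preview") -> str:
    """Send a manuscript excerpt to a reasoning model and return its error report."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": ERROR_SPOTTING_PROMPT + "\n\n" + manuscript_text},
        ],
    )
    return response.choices[0].message.content

# Example usage:
# report = spot_errors(open("manuscript.txt").read())
# print(report)
```

The interesting work in the project is everything around a call like this: chunking long papers, comparing models, and deciding which flagged "errors" are real.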

If you like this newsletter, you really should join the party on Discord or WhatsApp via the project website:
https://the-black-spatula-project.github.io

// 2
Human in the loop
Substack - 31 Jan 2025 - 9 min read

Fascinating insight into how best to combine human and AI assessment (in this case for grading essays). Some of the problems they recognise around the need for human oversight are ones we have also encountered in peer review, but their way of mitigating them is novel indeed: audio notes.

What we’ve tried instead, which has been more successful, is to rework the process entirely.

- Teachers read and judge their students’ writing, as they would normally.

- They leave an audio comment on each piece of writing.

- The AI does two things: it transcribes the audio, and it combines together audio from different teachers into one final polished comment for the student, and a feedback report for the teacher.

So here we have a human in the loop, but in a slightly more sophisticated way that leads to better outcomes. The AI does a lot of the legwork, human time and effort are reduced, and the problem of hallucinations is mitigated.
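The article doesn't say which tools sit underneath this, but the shape of the pipeline is easy to picture: transcribe each teacher's audio note, then ask a model to merge the transcripts into one polished comment. A hypothetical sketch (the model names, prompt, and use of the OpenAI SDK are my assumptions, not theirs):

```python
# Hypothetical sketch of the audio-comment workflow described above;
# the article does not specify the tools used. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

def transcribe(audio_path: str) -> str:
    """Transcribe one teacher's audio comment."""
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def combine_feedback(transcripts: list[str]) -> str:
    """Merge several teachers' spoken comments into one polished written comment."""
    joined = "\n\n---\n\n".join(transcripts)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Combine these teachers' comments on a piece of student writing "
                "into one clear, encouraging comment. Do not add any judgement "
                "that is not present in the comments themselves.")},
            {"role": "user", "content": joined},
        ],
    )
    return response.choices[0].message.content

# Example usage:
# comments = [transcribe(p) for p in ["teacher_a.m4a", "teacher_b.m4a"]]
# print(combine_feedback(comments))
```

The key property is that all of the judgement comes from the human audio; the model only transcribes and reformats it, which is what keeps the hallucination risk low.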

// 3
Large language models for automated scholarly paper review: A survey
arXiv.org - 17 Jan 2025 - 25 min read

Peer review using AI isn’t really ‘peer’ review - so I’m grateful to these authors for introducing the term Automated Scholarly Paper Review (ASPR) as an alternative. It’s not often I recommend reading a whole paper, but this review is worth it if you have the time.

We begin with a survey to find out which LLMs are used to conduct ASPR. Then, we review what ASPR-related technological bottlenecks have been solved with the incorporation of LLM technology. After that, we move on to explore new methods, new datasets, new source code, and new online systems that come with LLMs for ASPR. Furthermore, we summarize the performance and issues of LLMs in ASPR, and investigate the attitudes and reactions of publishers and academia to ASPR. Lastly, we discuss the challenges associated with the development of LLMs for ASPR. We hope this survey can serve as an inspirational reference for the researchers and promote the progress of ASPR for its actual implementation.

// 4
The reviewer paradox: more publications, fewer peers?
Substack - 14 Jan 2025 - 8 min read

I recently read that “as the number of publications increases, the number and availability of peers decrease”. This didn’t seem intuitive to me. If each publication has multiple authors, the pool of potential reviewers should grow. However, if authors publish multiple papers in the same year, there could indeed be a deficit of peers. After all, we have all heard about the peer-reviewing crisis. So what is the mechanism behind this? And what fields of research are potentially the most affected?

CL: I’m increasingly convinced that reviewing will become more professionalized (maybe carried out by editorial board members, along with multiple AI tools), and this is one more data point as to why change is required in current peer review methods.
https://researchmusings.substack.com/p/the-reviewer-paradox-more-publications
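A back-of-envelope calculation (with numbers I've invented purely for illustration, not taken from the article) shows how a deficit can arise even while authorship grows:

```python
# Back-of-envelope illustration of the reviewer paradox; all numbers are
# invented for the example, not taken from the article.
papers = 1000                    # submissions in a field this year
authors_per_paper = 4
papers_per_active_author = 2.5   # prolific authors write several papers each
reviews_per_paper = 3

# Demand: how many reviews the field needs.
reviews_needed = papers * reviews_per_paper

# Supply: distinct people who could act as peers. Because the same authors
# appear on multiple papers, the pool of unique peers grows more slowly
# than the number of authorships does.
unique_authors = papers * authors_per_paper / papers_per_active_author

print(f"Reviews needed: {reviews_needed}")
print(f"Unique potential reviewers: {unique_authors:.0f}")
print(f"Reviews each peer must do: {reviews_needed / unique_authors:.1f}")
# With these numbers each peer would need to review ~1.9 papers; push
# papers_per_active_author higher and the load per peer climbs quickly.
```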

// 5
Generative Adversarial Reviews: When LLMs Become the Critic
arXiv.org - 09 Dec 2024 - 22 min read

The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet, the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Central to this approach is a graph-based representation of manuscripts, condensing content and logically organizing information — linking ideas with evidence and technical details. GAR’s review process leverages external knowledge to evaluate paper novelty, followed by detailed assessment using the graph representation and multi-round assessment. Finally, a meta-reviewer aggregates individual reviews to predict the acceptance decision. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond mere performance comparison, we conduct insightful experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.
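To make the architecture a little more concrete, here is a heavily stripped-down sketch of the persona-plus-meta-reviewer idea. The llm() helper, the personas, and the prompts are placeholders of mine, and the paper's graph-based manuscript representation and memory components are omitted entirely:

```python
# Stripped-down sketch of the GAR idea described in the abstract above.
# `llm` stands in for any chat-completion call; the personas, prompts, and
# aggregation step are illustrative, and the paper's graph representation
# and memory components are not modelled here.
from typing import Callable

def review_paper(
    manuscript: str,
    personas: list[str],
    llm: Callable[[str], str],
) -> dict:
    """Run several persona-conditioned reviewer agents, then a meta-reviewer."""
    reviews = []
    for persona in personas:
        prompt = (
            f"You are a peer reviewer: {persona}.\n"
            "Assess the novelty, soundness, and clarity of the manuscript below. "
            "Give detailed feedback and a recommendation (accept/revise/reject).\n\n"
            + manuscript
        )
        reviews.append(llm(prompt))

    meta_prompt = (
        "You are the meta-reviewer. Aggregate the following reviews into a "
        "single decision (accept or reject) with a short justification.\n\n"
        + "\n\n---\n\n".join(reviews)
    )
    decision = llm(meta_prompt)
    return {"reviews": reviews, "decision": decision}

# Example usage with any LLM client wrapped as `my_llm(prompt) -> str`:
# result = review_paper(open("paper.txt").read(),
#                       personas=["a methods-focused statistician",
#                                 "a domain expert in the paper's subfield"],
#                       llm=my_llm)
```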

And finally…

Back to a link list again this week. I'm restricting myself to five only, but - wow - there's a lot going on.

Let's do coffee!
- Oxford & Cambridge: w/c 17 Feb
- Researcher 2 Reader conference, London - Feb 25-26
- London Book Fair, London(!) - Mar 11-13
- ALPSP UP Redux, Oxford - April 3-4 [I’m giving the keynote speech on the 3rd]
Let me know if you’re at any of these and we can chat all things Scalene-related.

Free consultation calls
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days. However, we are keen to experiment further with subtle AI assistance. If you want to chat about bringing review times down - either with a 100% human service or by experimenting with how AI can assist - let's talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.