
Scalene 24: AnnotateGPT / Trust / Consensus

Humans | AI | Peer review. The triangle is changing.

If I’ve learnt anything in 25 years of product development in academic publishing (and it’s a lesson I need to re-learn every few years), it’s never to ignore the human element of any technology solution. AI can replace many elements of the manuscript evaluation process, but most people are just not ready for that yet - they don’t trust it. Trust needs to be earned, and it also depends on transparency - something most AI systems are not great at. So I’m happy to showcase an AI system which helps humans evaluate the best peer review response to a paper, plus a few sociological perspectives on how to increase AI adoption by building more trust. And more of the usual stuff, of course.

08 December 2024

// 1
Streamlining the review process: AI-generated annotations in research manuscripts
arXiv - 29 Nov 2024 - 21 min read

We have researched the use of AI as a supportive tool and propose a nuanced approach where AI aids reviewers in identifying important sections of the manuscript. To accomplish this, we recommend using "annotation" as a means to enhance collaboration between the AI agent and the reviewer. This vision is realized in AnnotateGPT, a web-based PDF visualizer that combines criterion-driven prompts with language models like GPT-4.

If you only read one thing this week, make it this. The description sounds odd until you see it in action in the paper’s figures (and in footnote 9, where the authors have gone all ‘Inception’ on us and evaluated this very paper with their own tool). AnnotateGPT evaluates the manuscript against multiple criteria: contribution, originality, relevance, rigour, solution description, and ‘state of the art’. For each facet of the paper there is a generalised evaluation, plus several alternatives the reviewer can switch to if they disagree with the generalised version. They can select from ‘positive’, ‘critical’, ‘constructive’, and ‘alternative’ viewpoints to amend the evaluation for that aspect of the report.

This is a wonderful way to introduce reviewers to AI, whilst keeping the reviewers as the ultimate arbiters of what is correct. The alternative viewpoints make the reviewer consider which appraisal is correct, rather than supplying them with one ‘truth’ to evaluate. There is also an excellent query option for the reviewers to ask specific questions. I’m incredibly excited to explore this more over the coming weeks (a rough sketch of the criterion-prompting idea follows the link below).

Figure: Clarification request through the prompt: ‘9 subjects enough for TAM evaluation?’

Do yourself a favour and click on this:
https://arxiv.org/abs/2412.00281
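For the technically curious, here is what criterion-driven prompting could look like in practice. This is my own rough sketch, not the authors’ code: the criteria and viewpoints come from the paper’s description, but the prompts, function names, and the use of the OpenAI client are all my assumptions.

```python
# Hypothetical sketch of criterion-driven review prompting, loosely inspired by
# the AnnotateGPT description above -- not the authors' implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = ["contribution", "originality", "relevance", "rigour",
            "solution description", "state of the art"]
VIEWPOINTS = ["positive", "critical", "constructive", "alternative"]


def _ask(prompt: str) -> str:
    """Send one prompt to the model and return the text of its reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def evaluate_criterion(manuscript_text: str, criterion: str) -> dict:
    """Return a general evaluation plus alternative viewpoints for one criterion."""
    results = {}
    # General, viewpoint-neutral evaluation of this criterion
    results["general"] = _ask(
        f"You are assisting a peer reviewer. Assess the manuscript below on the "
        f"criterion of '{criterion}', quoting the passages that support your "
        f"assessment.\n\n{manuscript_text}"
    )
    # Alternative framings the reviewer can pick from if they disagree
    for viewpoint in VIEWPOINTS:
        results[viewpoint] = _ask(
            f"Re-assess '{criterion}' from a {viewpoint} viewpoint for the same "
            f"manuscript:\n\n{manuscript_text}"
        )
    return results
```

The key design point, as I read the paper, is that the reviewer always sees several candidate appraisals per criterion rather than a single machine verdict.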

// 2
Bridging Tradition and Technology: Expert Insights on the Future of Innovation in Peer Review
CSE Science Editor - 06 Dec 2024 - 9 min read

The Asian Council of Science Editors (ACSE) hosted an exclusive interview series featuring industry experts who shared insights, ideas, and perspectives on the technology transforming the peer review process (Figure 1). The discussions highlighted critical areas, such as AI-driven automation and open peer review, along with the challenges and opportunities these innovations bring to academic publishing.

Figure 1. Expert perspectives on balancing innovation and integrity in peer review.

CL: It’s a good read, covering both the technology and human aspects of peer review, which need to align if we are to realise the benefits of both.
https://www.csescienceeditor.org/article/bridging-tradition-and-technology/

// 3
Use of Artificial Intelligence in Peer Review Among Top 100 Medical Journals
JAMA Network Open - 03 Dec 2024 - 5 min read

What I expected to be a fairly straightforward article looking at publisher policies towards AI in peer review (spoiler alert: largely prohibited or restricted) ended up delivering this very positive section in the Discussion:

Although AI is not expected to replace human peer review, its role is expected to grow as our familiarity with AI and its technical capabilities advances. Used safely and ethically, AI can increase productivity and innovation. Thus, continuous monitoring and regular assessment of AI’s impact are essential for updating guidance, thereby maintaining high-quality peer review.

CL: There are many other gems of information in this very short letter, especially figure 2. Go take a look.
https://doi.org/10.1001/jamanetworkopen.2024.48609

// 4
Scholars are Failing the GPT Review Process
HSNS - 01 Nov 2024 - 7 min read

In a proactive attempt to look more at the human elements of peer review, I’ve recently been broadening my reading and came across this great little article in a journal I’d never encountered before, Historical Studies in the Natural Sciences. In this piece, the author describes how AI is disrupting academia while scholars remain unwilling to learn how to use it to their advantage (or even to understand what it can and can’t do well).

Then this little snippet - following on from a discussion of AI-generated-text detectors - re-awakened some horror stories I’ve heard from African researchers about inherent biases in traditional peer review processes:

Research findings also show that those systems tend to generate far more false positives on non-native English speakers. From that point, begin to consider the differences in how expectations and perceptions of gender and race can affect people’s communication styles—and how these lived experiences are least likely to be properly accounted for in either the training data or the weighting architectures of GPTs or other “AI” tools. To highlight the importance of these intersections, consider an incident scholar Rua M. Williams discussed on social media. Williams recounted a tale in which the writing and scholarship quality of a non-native English-speaking student with whom they were working suddenly and catastrophically diminished. When Williams reached out to the student out of deep concern, it turned out the student had recently begun using ChatGPT to rewrite their work to “sound more white.”

This is a short but important read. Let’s not automate human review as it is; we need to do better.
https://doi.org/10.1525/hsns.2024.54.5.625

// 5
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability
arXiv - 10 Nov 2024 - 17 min read

Back to the nerdery now with this great paper showing that LLMs can increase their reliability, and reduce the need for human oversight, by using ensemble methods for content validation through model consensus. In other words: ask multiple LLMs the same question and pick the answer the majority of models agree on. It’s limited to multiple-choice questions for now, but the concept seems promising, if perhaps likely to ‘regress to the mean’ for novel academic research.

https://arxiv.org/abs/2411.06535
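To make the idea concrete, here is a minimal sketch of consensus-by-majority-vote across several models. This is not the authors’ framework: the model names and the ask_model() helper are placeholders you would wire up to your own providers.

```python
# Minimal sketch of ensemble consensus via majority vote on multiple-choice
# questions -- an illustration of the general idea, not the paper's code.
from collections import Counter

MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model identifiers


def ask_model(model: str, question: str, options: list[str]) -> str:
    """Placeholder: send a multiple-choice question to one model and return
    its chosen option label (e.g. 'A', 'B', 'C')."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


def consensus_answer(question: str, options: list[str]) -> tuple[str, float]:
    """Ask every model, then return the majority answer and its agreement rate."""
    votes = [ask_model(m, question, options) for m in MODELS]
    winner, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    # A low agreement score is a natural trigger for escalating to a human
    return winner, agreement
```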

And finally…

The section headings of this email are all set to ‘large’ but are coming through at random sizes on my preview. My life’s too short to sort that out so you’ll have to live with it, but there are lots of lovely links here that are worth a click if you’ve got nothing better to do:

Let's do coffee!
I’m at home for the rest of the year. If you’re in Devon at some point, I’d love to meet up (and ask why you’re here) - otherwise it looks like R2R in February is my next official outing.

Free consultation calls
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this takes the form of 100% human peer review, delivered in 7 days. However, we are keen to experiment further with subtle AI assistance. If you want to chat about bringing review times down with a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by Chris Leonard.
If you want to get in touch with me, please simply reply to this email.