Scalene 13: Stroke contest / PRW / LLMs and Peer Review
Humans | AI | Peer review. The triangle is changing.
After the excitement of last week, this has been a relatively quiet week for things I can share on here. Conversations with stealth-mode start-ups and commercial publishers must remain secret for now - and the discourse around peer review and AI has been somewhat dominated by Sakana’s AI scientist news and the ethics around peer reviewing your ‘own’ work. However, I’ve still been able to uncover these juicy nuggets for you…
8th September 2024
// 1
Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest
Stroke - 03 September 2024 - 28 min read
Another paper I willingly put my hand in my pocket to buy (for 24 hours at least) - honestly, this newsletter will bankrupt me. However, it was worth every penny as it examines one of the thornier problems we face right now: can humans determine what is written by humans and what is written by AI?
To explore the role of LLMs in scientific writing and the perceptions of manuscript reviewers, we launched a competitive essay contest for the journal Stroke. We sought to evaluate how editorial board members would score persuasive essays on 3 topics of controversy in the stroke field when reviewing essays blinded to author status (human versus 1 of 4 different LLMs), what factors would influence the assignment of human versus AI author, and whether or not the perceived author status would influence the reviewer’s assessment.
I liked this paper as it used more than one LLM (ChatGPT 3.5 and 4.0, Bard, and LLaMA-2). Its findings can be summarised as follows:
Reviewers had ‘great difficulty accurately assigning correct authorship to the essays’
Bard was responsible for the greatest number of ‘best in topic’ essays.
Reviewers had a clear bias against content they judged to be AI generated.
What does this all mean? If we can’t distinguish AI-generated content now, we are even less likely to be able to do so in the future. Or, in the study’s words: ‘our study suggests that journal reviewers may not currently possess the skills to accurately distinguish AI-generated content from human authorship.’
PS: A note to anyone from Stroke or AHA who sees this. The link to the supplemental file didn’t work until I bought access to the full text - surely some mistake?
// 2
Lex Fridman and Yann LeCun on peer review
YouTube - 23 Jan 2022 - 10 min watch
I can’t remember how I came across this video, but it was very useful for some research I’ve been doing recently on the purpose and function of peer review. Here, Yann LeCun reminds us that peer review can always find faults. Even great papers have faults. The question isn’t ‘is this paper faultless?’, but rather, ‘is it exciting and does it move the field forwards?’. The link below goes to the 7-minute point where the peer review chat starts.
https://youtu.be/8tB9qx_6duM?si=_E3uBdurzbwf_O92&t=435
// 3
Peer Review Week is here (soon)
23-27 Sept 2024 - 5 days
It’s Peer Review Week in a few weeks and I’m talking at each of the events below. Please sign up and ask me some easy questions:
24 Sept 12:30 GMT - Editage and EASE present: Envisioning a Hybrid Model of Peer Review: Integrating AI with reviewers, publishers, & authors - with Serge Horbach, Haseeb Irfanullah, and Marie McVeigh. https://www.editage.com/events/peer-review-week-2024?utm_source=linkedin&utm_medium=social&utm_campaign=prw2024
24 Sept 14:00 GMT - MDPI present: Peer Review Webinar: Roundtable Discussion on Innovation and Technology in Peer Review. https://sciforum.net/event/PeerReview2024?utm_source=mdpi_news&utm_medium=banner&utm_campaign=prw2024&utm_content=announcement?section=#promotional_video
25 Sept 14:00 GMT - ISMTE present Reviewer Burnout and AI - with Bahar Mehmani and Ashutosh Ghildiyal https://www.ismte.org/events/EventDetails.aspx?id=1889778&group=
// 4
Analysis of the ICML 2023 Ranking Data: Can Authors' Opinions of Their Own Papers Assist Peer Review in Machine Learning?
arXiv - 23 Aug 2024 - 37 min read
It’s generally not surprising that so much experimentation with peer review happens in the context of computer science and machine learning conferences. These events have many submissions, strict time limits for reviewing, and actors who are all well aware of the strengths and weaknesses of AI. Reviewing these submissions even results in calls for ‘emergency reviewers’.
This paper looks at the cases of authors who submitted more than one paper to a conference, asking them to rank their own submissions, and comparing their ranking to machine evaluation. The resulting Isotonic Mechanism adjusts review scores based on these rankings, leading to more accurate assessments of paper quality. A surprising refutation of Betteridge’s Law.
https://arxiv.org/abs/2408.13430
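For the curious, here is a rough idea of what ‘adjusting review scores based on rankings’ can look like in practice. This is a minimal sketch, not the paper’s code: it assumes the adjustment is a least-squares isotonic regression, i.e. find the scores closest to the raw review scores that still respect the author’s own ordering. The function name `isotonic_adjust` and the example numbers are made up for illustration.

```python
def isotonic_adjust(scores, ranking):
    """Adjust raw review scores so they respect an author's ranking.

    scores  -- raw review score for each paper, indexed by paper number
    ranking -- paper numbers ordered from the author's best to worst
    Returns the least-squares closest scores that are non-increasing
    along the ranking (pool-adjacent-violators algorithm).
    Hypothetical helper for illustration only.
    """
    # Line the raw scores up in the author's order, best paper first.
    ordered = [scores[paper] for paper in ranking]

    # Each block holds [sum of scores, number of papers] and is
    # represented by its mean. Blocks must have non-increasing means.
    blocks = []
    for value in ordered:
        blocks.append([value, 1])
        # If a lower-ranked block scores higher than the block before it,
        # the ranking is violated: pool the two blocks into one average.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Expand the blocks back into one fitted score per ranked position.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)

    # Map the fitted scores back to the original paper numbering.
    adjusted = [0.0] * len(scores)
    for position, paper in enumerate(ranking):
        adjusted[paper] = fitted[position]
    return adjusted


# Made-up example: reviewers scored three papers 7, 4 and 6, but the
# author ranks paper 0 best, then paper 1, then paper 2.
raw_scores = [7.0, 4.0, 6.0]
author_ranking = [0, 1, 2]
print(isotonic_adjust(raw_scores, author_ranking))
```

In this toy case papers 1 and 2 contradict the author’s ordering, so their scores are pooled to a shared average while paper 0 is left alone. When the review scores already agree with the ranking, nothing changes.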
// 5
Safekeep Science’s Future: Can LLMs Transform Peer Review?
Medium - 14 Aug 2024 - 9 min read
This great essay by Salvatore Raieli takes a very practical approach to constructing an LLM-assisted peer review workflow. How can we overcome the major hurdles of knowledge of previous literature and hallucinations? How can we interrogate the methods and results? How, in short, can we construct a meaningful peer review report? Read this to find out.
So far, therefore, we have a system that manages to be aware of the previous literature and minimizes hallucinations. The next part would be to have a system that was able to conduct a critical reading of a manuscript, propose corrections, potential experiments, and provide feedback. This last step could be done with specific questions to ask the model (‘Was the correct statistical analysis used in the article? Did the authors consider recent literature? Are there inconsistencies in the results? and so on’). However, this step requires additional reasoning skills and is currently beyond the capabilities of an LLM today. This means that still a complete peer review by one side of an LLM is not entirely feasible. Also, very few hallucinations are still not zero hallucinations. That doesn’t mean [it] can’t assist, though, by perhaps conducting a flag of potential errors or problems in an article. As it stands, the system could already identify inconsistencies with the literature, some potential errors, and provide some potential feedback. Conducting a peer review is a process that requires concentration and is labor intensive; conducting a quick first review with an LLM could be a reduction in the burden of work for researchers.
And finally…
Not going to lie, I’ve been on a Netflix binge recently, and the best things I’ve seen have been comedy, in particular Phil Wang’s “Wang in There, Baby!”. Also, I’ve spent nearly an hour working out how to link to that and it seems I can’t. Thanks Netflix.
Let's do coffee!
It’s conference season again. I’m at ALPSP and FBF over the coming weeks, so please get in touch (reply to this email) if you want to say hi at either of those events.
Curated by Chris Leonard.
If you want to get in touch with me, please simply reply to this email.