Scalene 29: Time / Economics / Hallucinations

Humans | AI | Peer review. The triangle is changing.

I’ve had some very fruitful and exciting conversations this week with publishers who are willing to experiment with AI-assisted manuscript review. This is notable: a few months ago the conversation was ‘I see it’s interesting, but we don’t want to change human-only peer review’, whereas now it is ‘I see it’s interesting and we need to change human-only peer review’. I think these early adopters are going to be pleasantly surprised by how subject matter experts and the right combination of prompts and tools can work optimally together. Anyway, on with the show!

16th February 2025

// 1
Exploring the Impact of Generative AI on Peer Review: Insights from Journal Reviewers
Journal of Academic Ethics - 11 Feb 2025 - 12 min read

This article explores the perspectives of 12 journal reviewers on LLMs in peer review. The small sample is further skewed by the fact that the reviewers are mainly humanities and social science researchers, with none from the physical sciences. However, the thematic grouping of concerns was interesting, even if the overall message was less so (LLMs can screen, but are not ready to review on their own, and need human oversight).
I think part of the problem here is that the questionnaire results came back between August 2023 and March 2024, and article preparation and review take so long that by the time the piece was published (earlier this week), it was already somewhat out of date. A month is a long time in this space, never mind almost a year. It’s good that we consider best practice around ethics and biases, but time is a frequently downplayed element of peer review:

// 2
Can AI Solve the Peer Review Crisis? A Large-Scale Experiment on LLM's Performance and Biases in Evaluating Economics Papers
IZA - 31 Jan 2025 - 34 min read

This paper came out on arXiv and via the Institute of Labor Economics at the same time and delivers some very valuable insights:

We systematically vary author characteristics (e.g., top male and female economists from RePEc’s top 10 list, bottom-ranked economists, and randomly generated names) and institutional affiliations across ranking tiers. Our base dataset includes 30 recently published papers: nine from the “top five” journals (Econometrica, Journal of Political Economy, Quarterly Journal of Economics), nine from mid-tier journals (European Economic Review, Economica, Oxford Bulletin of Economics and Statistics), nine from lower-ranked journals (Asian Economic and Financial Review, Journal of Applied Economics and Business, Business and Economics Journal), and three AI-generated papers designed to mimic the quality standards of “top five” submissions. Using GPT4o-mini, a leading LLM known for its cost-efficiency and broad applicability, we assess each variation along multiple dimensions: desk rejection and acceptance probability at “top five” journals, projected citation impact, likelihood of research grant success, tenure prospects, top conference acceptance, and potential Nobel Prize contributions.

The result? LLMs are good at distinguishing paper quality, but exhibit modest biases in favour of prominent institutions, male authors, and renowned economists. Additionally, LLMs find it hard to distinguish AI-generated papers from human-authored ones.

I appreciate the demonstration that LLMs can exhibit biases similar to those of human reviewers, but this is something for which we already have a solution: double-blind review, where author and affiliation data are simply not fed into the LLM. That is much easier than trying to reconfigure existing tools where, apparently, these biases are already baked in. The authors make several other suggestions and note that publicly available versions of articles on RePEc and other preprint servers may hinder simple double-blinding.
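For what it’s worth, that blinding step is trivial to put in front of any LLM reviewer. Here is a minimal sketch in plain Python; the manuscript record and field names are made up for illustration and are not any real submission-system schema:

```python
# Minimal sketch of double-blinding before LLM review: strip identifying
# metadata so the model only ever sees the science. Field names are
# illustrative, not a real submission-system schema.

IDENTIFYING_FIELDS = {"authors", "affiliations", "emails", "acknowledgements"}

def anonymise(manuscript: dict) -> dict:
    """Return a copy of the manuscript with identifying fields removed."""
    return {k: v for k, v in manuscript.items() if k not in IDENTIFYING_FIELDS}

def build_review_prompt(manuscript: dict) -> str:
    """Build a review prompt from the blinded record only."""
    blinded = anonymise(manuscript)
    return (
        "Review this manuscript on methodology, rigour and novelty alone.\n"
        f"Title: {blinded['title']}\n"
        f"Abstract: {blinded['abstract']}\n"
    )

if __name__ == "__main__":
    manuscript = {
        "title": "An Example Paper",
        "abstract": "We study an example question...",
        "authors": ["A. Economist"],
        "affiliations": ["Top-Ranked University"],
    }
    print(build_review_prompt(manuscript))  # contains no author or affiliation data
```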

// 3
Can AI write your PhD dissertation for you?
FOBH - 09 Feb 2025 - 5 min read

Following on from the previous point of LLMs not being able to distinguish AI-generated papers from human ones, this experiment was very timely. Andrew Maynard used OpenAI’s Deep Research tool to write an entire PhD thesis. And while it has its flaws, it’s also surprisingly good in many ways.

But remember that this is what an AI produced with the only input from me being the initial questions, some light prompts, and a bit of editing. It’s not hard to see how, with more human input, Deep Research and subsequent reasoning AI platforms could transform the process of doing a PhD.

So does this mean that the concept of a PhD dissertation as an original piece of uniquely human scholarship is on the way out?

It may be that the consensus is that pursuing a PhD is more about the journey than the new knowledge it produces — in which case not using AI would make sense, at least to a degree.

// 4
Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
arXiv - 07 Feb 2025 - 55 min read

A surprisingly broad and in-depth study of AI across the whole process of science, thankfully including peer review as part of that too. This is a mega-review with over 300 references and no one has time to read it all, so I’ll tell you that the peer review part starts on p32 of the PDF. The heavy referencing meant there were many opportunities to get distracted, which is exactly what I did.

// 5
Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks
arXiv - 27 Jan 2025 - 24 min read

Hallucinations remain a significant challenge in current Generative AI models, undermining trust in AI systems and their reliability. This study investigates how orchestrating multiple specialized Artificial Intelligent Agents can help mitigate such hallucinations, with a focus on systems leveraging Natural Language Processing (NLP) to facilitate seamless agent interactions. To achieve this, we design a pipeline that introduces over three hundred prompts, purposefully crafted to induce hallucinations, into a front-end agent. The outputs are then systematically reviewed and refined by second- and third-level agents, each employing distinct large language models and tailored strategies to detect unverified claims, incorporate explicit disclaimers, and clarify speculative content.

A great example of how multiple AI tools can compensate for the failings of a single tool. The example of the Library of Avencord on p13 shows this clearly, and this is likely to become best practice in the short term. No single tool is perfect, but combining several in a smart way is eye-opening.
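To make that concrete, here is a minimal sketch of the tiered set-up the paper describes: a front-end agent drafts an answer, a second-level agent flags unverified claims, and a third-level agent adds disclaimers to speculative content. The call_model() helper is a stand-in for whichever LLM backend each agent would use; it is an assumption for illustration, not the authors’ code.

```python
# Minimal sketch of a tiered agent pipeline in the spirit of the paper:
# draft -> review -> refine. call_model() is a placeholder for a real
# LLM client; swap in your provider of choice.

def call_model(system: str, prompt: str) -> str:
    """Placeholder LLM call: echoes the instruction so the sketch runs."""
    return f"[{system}] {prompt}"

def front_end_agent(question: str) -> str:
    return call_model("Answer the question directly.", question)

def second_level_agent(draft: str) -> str:
    return call_model("Mark any claim you cannot verify as [UNVERIFIED].", draft)

def third_level_agent(reviewed: str) -> str:
    return call_model("Add explicit disclaimers to speculative content.", reviewed)

def pipeline(question: str) -> str:
    draft = front_end_agent(question)
    reviewed = second_level_agent(draft)
    return third_level_agent(reviewed)

print(pipeline("Summarise the founding history of the Library of Avencord."))
```

Each level uses a different model and a narrower instruction than the last, which is the point: no single agent has to be right, it just has to catch what the previous one let through.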

And finally…

More links, less commentary:

[Sorry about the weird size fonts in 3 and 5 subheadings - I can’t get to the bottom of it in 5 minutes, and frankly, have other things in my life to spend time on]

 

Let's do coffee!
- Researcher 2 Reader conference, London - Feb 25-26
- London Book Fair, London(!) - Mar 11-13
- ALPSP UP Redux, Oxford - April 3-4 [I’m giving the keynote speech on the 3rd]
Let me know if you’re at any of these and we can chat all things Scalene-related.

Free consultation calls
Many of you may know I work for Cactus Communications in my day job, and one of my responsibilities there is to help publishers speed up their peer review processes. Usually this is in the form of 100% human peer review, delivered in 7 days. However, we are keen to experiment further with subtle AI assistance. If you want to chat about bringing review times down with a 100% human service, or you’re interested in experimenting with how AI can assist, let’s talk: https://calendly.com/chrisle1972/chris-leonard-cactus

Curated by me, Chris Leonard.
If you want to get in touch, please simply reply to this email.