ACL 2025

Generative Reviewer Agents

Scalable Simulacra of Peer Review

1Woven by Toyota

Abstract

The peer review process is fundamental to scientific progress, determining which papers meet the quality standards for publication. Yet, the rapid growth of scholarly production and increasing specialization in knowledge areas strain traditional scientific feedback mechanisms. In light of this, we introduce Generative Agent Reviewers (GAR), leveraging LLM-empowered agents to simulate faithful peer reviewers. To enable generative reviewers, we design an architecture that extends a large language model with memory capabilities and equips agents with reviewer personas derived from historical data. Our experiments demonstrate that GAR performs comparably to human reviewers in providing detailed feedback and predicting paper outcomes. Beyond mere performance comparison, we conduct insightful experiments, such as evaluating the impact of reviewer expertise and examining fairness in reviews. By offering early expert-level feedback, typically restricted to a limited group of researchers, GAR democratizes access to transparent and in-depth evaluation.

Key Results

0.684 GPT-4 Preference Score
(vs. 0.523 Human)
70% Paper Acceptance
Prediction Accuracy
3.99 Human-Likeness
(5-point scale)
0.66 F1 Score
(vs. 0.49 Human)

Overview

GAR employs a 4-phase pipeline to simulate the peer review process: (1) Graph Construction to structure the manuscript, (2) Reviewer Selection with persona initialization, (3) Multi-round Reviewer Evaluation, and (4) Meta-Review synthesis for final decisions.

GAR Framework Overview

Figure 1: The Generative Agent Reviewers (GAR) framework for simulating peer review, featuring graph-based paper representation, memory-augmented reviewers, and multi-round evaluation.

Reviewer Agent Architecture

GAR structures agents with four specialized modules tailored for review scenarios:

1

Profile Module

Stores eight core attributes: strictness, expertise level, focus areas, evidence focus, open-mindedness, ethic focus, tone, and attention to technical details. Characteristics are inferred via contrastive comparison across reviews.

2

Novelty Module

Draws upon external knowledge sources (Semantic Scholar) to gauge the originality of the manuscript in comparison to prior research, generating novelty scores with explanations.

3

Memory Module

Enables retrieval-augmented reviews through community-level retrieval (similar communities and reviews) and paper-level retrieval (similar papers based on graph overlap).

4

Review Module

Generates initial reviews and performs multi-round refinement, evaluating each community descriptor with retrieved exemplars from memory to identify strengths and weaknesses.

Graph-Paper Representation

To handle lengthy and complex manuscripts, GAR introduces a graph-based representation that organizes content into structured communities:

Experiments

LLM vs Human Review Preferences

We evaluated review quality using GPT-4 preferences in pairwise comparisons:

Rank Reviewer Bradley-Terry Score
1 GAR 0.684
2 Human 0.523
3 AI-Scientist 0.242
4 ReviewerGPT 0.000
5 AI-Review -0.365
6 OpenReviewer -0.632

Predicting Paper Acceptance

We compared GAR's decisions against ground truth from NeurIPS and ICLR submissions:

Method NeurIPS ICLR 22 ICLR 23
Bal. Acc F1 Bal. Acc F1 Bal. Acc F1
Human* 0.66 0.49 0.66 0.49 0.66 0.49
AI-Scientist 0.58 0.51 0.65 0.57 0.63 0.55
AI-Review 0.59 0.49 0.64 0.55 0.61 0.53
GAR 0.64 0.61 0.68 0.66 0.66 0.60
GAR> 0.68 0.62 0.71 0.67 0.70 0.69

*Human scores from NeurIPS 2021 consistency study. GAR> uses threshold-based acceptance.

Human-Likeness Evaluation

GPT-4o assessed whether reviews appeared AI-generated or human-like (5-point scale):

Method NeurIPS ICLR 22 ICLR 23
AI-Scientist 3.34 3.39 3.38
AI-Review 3.30 3.42 3.38
ReviewerGPT 3.26 3.25 3.29
GAR 3.89 4.02 3.99

Feedback Aspect Comparison

Comparison of Feedback Aspects by Reviewer Type

Figure 2: Comparison of feedback aspects across different reviewer types. GAR (purple) shows closer alignment to ground truth (0.0 baseline) on most dimensions including novelty assessment, theoretical soundness, and ethical aspects, while baseline methods tend to deviate more significantly.

Datasets

GAR is evaluated on peer review datasets from major ML conferences:

BibTeX

@inproceedings{bougie2025gar, title={Generative Reviewer Agents: Scalable Simulacra of Peer Review}, author={Bougie, Nicolas and Watanabe, Narimasa}, booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)}, year={2025} }