arXiv 2026

ContextSim

Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

Woven by Toyota

Abstract

Recommender systems are central to online services, yet evaluation remains challenging due to the disconnect between offline metrics and online performance. We introduce ContextSim, an LLM agent framework that simulates believable user proxies by anchoring interactions in daily life activities. Rather than modeling users in isolation, ContextSim incorporates a life simulation module that generates scenarios specifying when, where, and why users engage with recommendations, grounding agents in realistic temporal, spatial, situational, goal, and constraint contexts. To enforce consistency, agents maintain explicit internal thoughts and their behaviors are aligned at both the action and trajectory levels through item disentanglement and trajectory alignment tasks. Experiments demonstrate that ContextSim generates user interactions more closely resembling authentic human behavior compared to previous approaches. Importantly, recommender system parameters optimized using ContextSim show improved real-world engagement when validated through offline A/B testing correlation analysis.

Key Results

82.4% Preference Alignment (vs. 79.1% SimUSER)
0.45 Rating RMSE (vs. 0.50 SimUSER)
84.1% Thought Consistency (vs. 29.2% SimUSER)
4.60 Human-Likeness (5-point scale)

Overview

ContextSim addresses a key limitation of existing LLM-based user simulators: they model users in isolation, ignoring the contextual factors that shape real-world decisions. ContextSim grounds agent behavior in daily life activities, generating realistic scenarios that determine when, where, and why users interact with recommendations. The framework combines life simulation, thought synthesis, and context-aware policy to produce agents whose trajectories closely match real human behavior.


Figure 1: The ContextSim framework for evaluating recommender systems. A life simulation module generates daily schedules with temporal, spatial, and situational contexts. Agents interact with the recommender system while maintaining explicit internal thoughts aligned with their persona and context.

Framework

ContextSim operates in four phases to create context-aware user agents:

1. Persona Initialization

Infer self-consistent personas from historical data, including age, Big Five personality traits, occupation, habits, and preferences. Agents are equipped with episodic memory (interaction history) and emotional memory (fatigue, satisfaction).

2. Thought Synthesis

Train agents via two reasoning tasks: item disentanglement (why an action reflects user preferences) and trajectory alignment (why historical actions are preferred over alternatives). Joint SFT on both tasks enables explicit, consistent reasoning.

3. Life Simulation Module

Generate realistic daily schedules conditioned on persona, weather, and local events. Each interaction is grounded in five context dimensions: temporal (time of day), spatial (location), situational (activity, mood), goal (purpose), and constraint (budget, time).

4. Context-Aware Interaction

Agents sense the page, evaluate items, infer internal state (fatigue, curiosity), select actions with explicit thought, and self-reflect to update episodic memory. Context shapes every decision throughout the interaction session.
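The phases above can be pictured with a small data model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class names `Persona`, `Context`, and `Agent`, the item fields, and the fatigue and satisfaction increments are all hypothetical, and a simple rule stands in for the fine-tuned LLM policy.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    # Hypothetical fields mirroring the persona attributes listed above.
    age: int
    occupation: str
    big_five: dict[str, float]   # trait name -> score (scale is an assumption)
    preferences: list[str]       # e.g. liked categories


@dataclass
class Context:
    # The five context dimensions produced by the life simulation module.
    temporal: str                # time of day
    spatial: str                 # location
    situational: str             # activity, mood
    goal: str                    # purpose of the session
    constraint: dict[str, float] # e.g. {"budget": 15.0}


@dataclass
class Agent:
    persona: Persona
    episodic_memory: list[dict] = field(default_factory=list)  # interaction history
    fatigue: float = 0.0         # emotional memory
    satisfaction: float = 0.5    # emotional memory

    def step(self, item: dict, ctx: Context) -> dict:
        """One interaction: evaluate the item, act with an explicit thought, reflect."""
        matches = item["category"] in self.persona.preferences
        affordable = item["price"] <= ctx.constraint.get("budget", float("inf"))
        # Stand-in policy (assumption): a real agent would query an LLM here.
        if self.fatigue >= 1.0:
            action, thought = "exit", "I am tired of browsing."
        elif matches and affordable:
            action = "click"
            thought = f"A {item['category']} item within budget fits my goal: {ctx.goal}."
        else:
            action, thought = "skip", "This does not fit my taste or constraints."
        # Self-reflection: update emotional memory, then episodic memory.
        self.fatigue += 0.1
        if action == "click":
            self.satisfaction = min(1.0, self.satisfaction + 0.1)
        record = {"item": item["id"], "action": action, "thought": thought,
                  "when": ctx.temporal, "where": ctx.spatial}
        self.episodic_memory.append(record)
        return record


persona = Persona(34, "engineer", {"openness": 0.8}, ["ramen"])
ctx = Context("12:30", "office", "lunch break, hungry", "quick lunch", {"budget": 15.0})
agent = Agent(persona)
print(agent.step({"id": "r1", "category": "ramen", "price": 11.0}, ctx))
```

Keeping the thought next to the action in each memory record is what makes a later consistency check possible: a judge can compare the stated reason against the persona and the context of that step.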

Experiments

Preference Alignment

We evaluate whether agents can identify items aligned with their human counterparts' tastes. ContextSim outperforms all baselines across all datasets:

Method (1:1)   MovieLens         AmazonBook        Steam
               Acc     F1        Acc     F1        Acc     F1
RecAgent       0.581   0.621     0.604   0.659     0.627   0.650
Agent4Rec      0.691   0.698     0.719   0.700     0.689   0.679
SimUSER        0.791   0.777     0.822   0.790     0.791   0.794
ContextSim     0.824   0.819     0.847   0.831     0.818   0.839
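For readers who want to reproduce the two columns, the sketch below computes accuracy and F1 from binary like/dislike labels. It assumes the task is framed as binary classification, with 1 meaning the human liked the item; the function name and the toy labels are illustrative, not taken from the paper.

```python
def accuracy_and_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Accuracy and F1 for binary labels (1 = liked, 0 = not liked)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy example: three liked and three not-liked items (a 1:1 split).
human = [1, 1, 1, 0, 0, 0]
agent = [1, 1, 0, 0, 0, 1]
print(accuracy_and_f1(human, agent))
```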

Rating Prediction

ContextSim achieves lower rating prediction error than all baselines, with RMSE about 10% below SimUSER (the strongest baseline) on each of the three datasets, suggesting a better model of user preferences:

Method         MovieLens         AmazonBook        Steam
               RMSE    MAE       RMSE    MAE       RMSE    MAE
RecAgent       1.102   0.763     1.259   1.119     1.077   0.960
Agent4Rec      0.761   0.714     0.879   0.671     0.758   0.688
SimUSER        0.502   0.446     0.568   0.421     0.587   0.532
ContextSim     0.451   0.392     0.511   0.369     0.528   0.471
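The two error metrics in the table follow their standard definitions. A minimal sketch, with an illustrative function name and made-up ratings:

```python
import math


def rmse_and_mae(true_ratings: list[float], predicted: list[float]) -> tuple[float, float]:
    """Root mean squared error and mean absolute error between two rating lists."""
    if not true_ratings or len(true_ratings) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(true_ratings, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return rmse, mae


# Toy example: two of four predictions are off by one star.
print(rmse_and_mae([4, 3, 5, 2], [4, 4, 4, 2]))
```

RMSE penalizes large misses more heavily than MAE, so it is never smaller than MAE on the same predictions; every row of the table is consistent with that.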

Thought Consistency

We evaluate persona-action consistency using GPT-4o as the judge. ContextSim is rated coherent in 84.1% of cases, versus 29.2% for SimUSER, the strongest baseline:

Method         Coherent (%)   Partially (%)   Contradictory (%)
RecAgent       17.3           40.9            41.8
Agent4Rec      21.8           43.6            34.6
SimUSER        29.2           41.0            29.8
ContextSim     84.1           10.6            5.3
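Each row of the table is a distribution over three judge verdicts and sums to 100%. The sketch below shows how such a row could be derived from per-sample verdicts; the label strings and the function name are assumptions, and the call to the judge model itself is left out.

```python
from collections import Counter

# Hypothetical label names for the three verdict categories in the table.
LABELS = ("coherent", "partially", "contradictory")


def consistency_distribution(verdicts: list[str]) -> dict[str, float]:
    """Percentage of samples per verdict, in the column order of the table."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {name: 100.0 * counts[name] / len(verdicts) for name in LABELS}


# Toy example with four judged samples.
print(consistency_distribution(["coherent", "coherent", "coherent", "partially"]))
```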

Human-Likeness Evaluation

GPT-4o assessed whether agent interactions appear human or AI-generated using a 5-point Likert scale:

Method         MovieLens   AmazonBook   Steam   OPeRA
RecAgent       3.01        3.14         2.96    3.08
Agent4Rec      3.04        3.21         3.09    3.15
SimUSER        4.41        3.99         4.02    4.13
ContextSim     4.60        4.18         4.22    4.37

Optimizing RS Parameters

We validate ContextSim by optimizing recommender system parameters and measuring real-world engagement on a food recommendation platform:

Strategy               Viewing Ratio   Liked Items   Prop. Likes   Satisfaction
Baseline               0.521           3.14          0.398         3.82
Traditional (nDCG)     0.535           3.22          0.407         3.86
SimUSER-optimized      0.561           3.58          0.434         4.09
ContextSim-optimized   0.589           3.91          0.462         4.24
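To make the gaps in the table easier to compare, the sketch below converts each strategy's metrics into percent change over the baseline row. The values are copied from the table; the dictionary layout, key names, and function name are illustrative choices.

```python
# Values copied from the table above; key names are illustrative.
RESULTS = {
    "Baseline":             {"viewing_ratio": 0.521, "liked_items": 3.14, "prop_likes": 0.398, "satisfaction": 3.82},
    "Traditional (nDCG)":   {"viewing_ratio": 0.535, "liked_items": 3.22, "prop_likes": 0.407, "satisfaction": 3.86},
    "SimUSER-optimized":    {"viewing_ratio": 0.561, "liked_items": 3.58, "prop_likes": 0.434, "satisfaction": 4.09},
    "ContextSim-optimized": {"viewing_ratio": 0.589, "liked_items": 3.91, "prop_likes": 0.462, "satisfaction": 4.24},
}


def relative_lift(strategy: str, metric: str, reference: str = "Baseline") -> float:
    """Percent change of `metric` for `strategy` relative to `reference`."""
    base = RESULTS[reference][metric]
    return 100.0 * (RESULTS[strategy][metric] - base) / base


for name, metrics in RESULTS.items():
    if name != "Baseline":
        print(name, {m: round(relative_lift(name, m), 1) for m in metrics})
```

With these numbers, the ContextSim-optimized configuration raises the viewing ratio by about 13% over the baseline.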

Datasets

ContextSim is evaluated on four diverse recommendation datasets: MovieLens, AmazonBook, Steam, and OPeRA.

BibTeX

@article{bougie2026contextsim,
  title={Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation},
  author={Bougie, Nicolas and Marconi, Gian Maria and Ye, Xiaotong and Watanabe, Narimasa},
  journal={arXiv preprint arXiv:2604.09549},
  year={2026}
}