Can We Really Learn One Representation to Optimize All Rewards?

Princeton University
(*Equal contribution)
Teaser image

Overview

How can we leverage the right form of prior knowledge to pre-train large models for solving reinforcement learning (RL) problems? One promising direction is to learn a prior over the policies of yet-to-be-determined tasks: front-load as much computation as possible before a specific reward function is known.

Recent work on forward-backward (FB) representation learning pursues this direction, arguing that an unsupervised representation learning procedure can enable optimal control for arbitrary rewards without further fine-tuning. Our work demystifies FB by clarifying when such representations can exist, what the FB objective optimizes, and whether the algorithm converges in practice.
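For context, the FB framework from prior work factorizes the successor measure of a family of policies \( \pi_z \) and infers the task embedding from reward labels (\( \rho \) denotes the data distribution):

\[
M^{\pi_z}(s, a, \mathrm{d}s') \approx F(s, a, z)^\top B(s')\, \rho(\mathrm{d}s'),
\qquad
\pi_z(s) = \operatorname*{argmax}_a F(s, a, z)^\top z,
\qquad
z_r = \mathbb{E}_{s \sim \rho}\!\left[ r(s)\, B(s) \right].
\]

The zero-shot claim is that \( Q^{\pi_{z_r}}(s, a) \approx F(s, a, z_r)^\top z_r \) and that \( \pi_{z_r} \) is near-optimal for reward \( r \), with no further training. Our analysis examines when this claim can actually hold.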

Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement: one-step forward-backward representation learning (one-step FB).

Theoretical Findings

We aim to answer the following questions in our theoretical analysis of FB:

When do the ground-truth FB representations exist? In discrete controlled Markov processes (CMPs), we identify four necessary conditions for the existence of ground-truth FB representations.

What does the FB representation objective minimize? We interpret the FB representation objective as a temporal-difference (TD) variant of the least-squares importance fitting (LSIF) loss. This interpretation draws a connection with fitted Q-evaluation.
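To make this connection concrete, recall the LSIF loss for estimating a density ratio \( w \approx p / \rho \) from samples:

\[
\mathcal{L}_{\text{LSIF}}(w) = \tfrac{1}{2}\, \mathbb{E}_{x \sim \rho}\!\left[ w(x)^2 \right] - \mathbb{E}_{x \sim p}\!\left[ w(x) \right],
\]

whose minimizer over all functions is \( w^{\star} = p / \rho \). The FB objective applies this loss to \( w(s, a, s') = F(s, a, z)^\top B(s') \) as an estimate of the ratio \( M^{\pi_z} / \rho \), but replaces samples from the intractable measure \( M^{\pi_z} \) with a temporal-difference bootstrap, mirroring how fitted Q-evaluation bootstraps the Bellman backup.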

Does the practical FB algorithm converge to ground-truth representations? Whether the FB algorithm converges to any ground-truth representation remains an open theoretical problem. The key challenge is the circular dependency between the FB representations and the policies (see the figure above). Our didactic experiments demonstrate this convergence failure in practice.

Our Simplified Algorithm

Our understanding of FB suggests a simplified pre-training method for RL called one-step forward-backward (one-step FB) representation learning.

  • One-step FB breaks the circular dependency by learning representations for a fixed behavioral policy (see the figure above).
  • One-step FB performs policy adaptation via one step of policy improvement.
  • Starting from the existing FB algorithm, implementing our method requires two simple changes (sketched below).
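A minimal sketch of these two changes, in illustrative PyTorch (not the released implementation); the networks F and B, the behavioral actions, and the tensor shapes below are assumptions made for the sake of the example.

import torch

@torch.no_grad()
def one_step_fb_td_target(F_tgt, B_tgt, next_obs, next_action_beta, rand_obs, z, gamma=0.99):
    # Change 1: bootstrap with an action from the fixed behavioral policy
    # (e.g., the logged dataset action), not from the task policy pi_z.
    # The representations then estimate M^{pi_beta} / rho, which removes the
    # circular dependency between representations and policies.
    # target[i, j] = gamma * F(s'_i, a'_i, z)^T B(s''_j), with s'' sampled from rho.
    return gamma * F_tgt(next_obs, next_action_beta, z) @ B_tgt(rand_obs).T

def one_step_improved_action(F, obs, z_r, candidate_actions):
    # Change 2: zero-shot adaptation is a single step of policy improvement,
    # acting greedily with respect to Q^{pi_beta}(s, a) ~= F(s, a, z_r)^T z_r.
    q_values = torch.stack([F(obs, a, z_r) @ z_r for a in candidate_actions])
    return candidate_actions[int(q_values.argmax())]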

Didactic Experiments

3 State CMP
The three-state CMP. Agents start from state \(s_0\) and take action \(a_i\) (\(i = 0, 1, 2\)) to deterministically transition into state \(s_i\). States \(s_1\) and \(s_2\) are both absorbing.
3 State CMP Results
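This CMP is small enough to write out explicitly. The sketch below (plain NumPy, with hypothetical variable names) constructs its transition tensor and computes the ground-truth successor measure \( M^{\pi} \) of a tabular policy, which is what the prediction-error metrics below are measured against.

import numpy as np

n_states, n_actions = 3, 3
# P[s, a, s'] = probability of reaching s' after taking action a in state s.
P = np.zeros((n_states, n_actions, n_states))
for a in range(n_actions):
    P[0, a, a] = 1.0   # from s0, action a_i deterministically leads to s_i
for s in (1, 2):
    P[s, :, s] = 1.0   # s1 and s2 are absorbing under every action
assert np.allclose(P.sum(axis=-1), 1.0)

def successor_measure(P, pi, gamma=0.99):
    # M^pi[s, a, s'] = sum_t gamma^t * Pr(s_{t+1} = s' | s_0 = s, a_0 = a, pi),
    # computed in closed form; pi[s, a] is a stochastic policy.
    P_pi = np.einsum('sat,sa->st', P, pi)                    # state-to-state kernel under pi
    occ = np.linalg.inv(np.eye(P.shape[0]) - gamma * P_pi)   # sum_t gamma^t (P_pi)^t
    return np.einsum('sau,ut->sat', P, occ)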

We track several metrics to answer the following questions:

  1. Do the learned representations accurately reflect the successor measure ratio? Metrics: \( M^{\pi} / \rho \) and \( M^{\pi_{\beta}} / \rho \) prediction errors.
  2. Do the learned representations accurately reflect the ground-truth Q-values? Metrics: \( Q^{\star} \) and \( Q^{\pi_{\beta}} \) prediction errors.
  3. How similar are the learned policies to the ground-truth policies? Metrics: forward KL divergence to \( \pi^{\star} \) and to \( \operatorname{argmax}_a Q^{\pi_{\beta}} \).
  4. Do the predicted Q-values satisfy the equivariance property of universal value functions? Metric: \( \hat{Q} \) equivariance errors.

Results:

  • (Left) After training for \( 10^{5} \) gradient steps, FB fails to converge to ground-truth FB representations.
  • (Right) Given a fixed policy, one-step FB exactly fits the ground-truth one-step FB representations.

Experiments on Standard Benchmarks

Domains

walker
cheetah
quadruped
jaco
antmaze large
antmaze teleport
cube single
scene

Offline zero-shot RL

Offline zero-shot RL evaluations aggregated across domains (46 tasks in total)
  • One-step FB achieves the best or near-best performance on \( 6 \) out of \( 10 \) domains.
  • Compared with FB, one-step FB achieves a \(+1.4 \times\) improvement on average.
  • One-step FB outperforms prior methods by \(20\%\) when using RGB images directly as input.

Offline-to-online fine-tuning

Offline-to-online fine-tuning learning curves
  • After offline unsupervised pre-training, we fine-tune each method online with the same off-the-shelf RL algorithm (TD3).
  • One-step FB continues to provide higher sample efficiency (\(+40\%\) on average) than the original FB method during fine-tuning.
  • The fine-tuned policies reach the asymptotic performance of TD3 at the end of training.

The Key Components of One-Step FB

Orthonormalization regularization ablation
Using an appropriate value of the orthonormalization regularization strength \( \lambda_{\text{ortho}} \) is key to the performance of one-step FB.
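For intuition about what this regularizer does (a rough sketch; the exact form used by one-step FB may differ), one common choice is to penalize the deviation of the empirical covariance of the \( B \) embeddings from the identity:

import torch

def orthonormality_penalty(b_embeddings: torch.Tensor, lam_ortho: float = 1.0) -> torch.Tensor:
    # b_embeddings: [batch, d] outputs of B on states sampled from rho.
    n, d = b_embeddings.shape
    cov = b_embeddings.T @ b_embeddings / n   # empirical E_rho[B(s) B(s)^T]
    return lam_ortho * ((cov - torch.eye(d, device=b_embeddings.device)) ** 2).sum()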
Reward temperature ablation
During zero-shot adaptation, we reweight the reward function using a softmax weight with temperature \( \tau_{\text{reward}} \). One-step FB is less sensitive to the choice of \( \tau_{\text{reward}} \) across domains.
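One plausible instantiation of this reweighting, shown below as an illustrative sketch rather than the paper's exact procedure, replaces the uniform average in \( z_r = \mathbb{E}_{s \sim \rho}[\, r(s)\, B(s) \,] \) with a softmax-weighted average over the labeled states:

import torch

def infer_task_embedding(rewards: torch.Tensor, b_embeddings: torch.Tensor,
                         tau_reward: float = 1.0) -> torch.Tensor:
    # rewards: [n] reward labels; b_embeddings: [n, d] values of B(s) for the same states.
    weights = torch.softmax(rewards / tau_reward, dim=0)       # [n], sums to 1
    return (weights.unsqueeze(1) * b_embeddings).sum(dim=0)    # softmax-weighted average of B(s)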

BibTeX

@article{zheng2026can,
  title={Can We Really Learn One Representation to Optimize All Rewards?}, 
  author={Zheng, Chongyi and Jayanth, Royina Karegoudra and Eysenbach, Benjamin},
  journal={arXiv preprint arXiv:xx},
  year={2026},
}