Overview
How can we leverage the right form of prior knowledge to pre-train large models for solving reinforcement learning (RL) problems? One promising direction is to learn a prior over the policies of yet-to-be-determined tasks: prefetch as much computation as possible before a specific reward function is known.
Recent work on forward-backward (FB) representation learning has pursued this idea, arguing that an unsupervised representation learning procedure can enable optimal control for arbitrary rewards without further fine-tuning. Our work demystifies FB by clarifying when such representations can exist, what the FB objective optimizes, and whether it converges in practice.
Our analysis suggests a simplified unsupervised pre-training method for RL that, instead of enabling optimal control, performs one step of policy improvement: one-step forward-backward representation learning (one-step FB).
Theoretical Findings
We aim to answer the following questions in our theoretical analysis of FB:
When do the ground-truth FB representations exist? In discrete controlled Markov processes (CMPs), we identify four necessary conditions for the existence of ground-truth FB representations:
- representation dimension \( d \): \( d \geq \left| \mathcal{S} \times \mathcal{A} \right| \).
- rank of the ground-truth forward representation matrix \( F_{\mathcal{Z}}^{\star} \): \( \left| \mathcal{S} \times \mathcal{A} \right| \leq \text{rank}(F_{\mathcal{Z}}^{\star}) \leq d \).
- rank of the ground-truth backward representation matrix \( B^{\star} \): \( \text{rank}(B^{\star}) = d \).
- relationship between the ground-truth forward-backward representation matrices and successor measures \( M^{\pi}({\color{gray} a} \mid {\color{gray} s}, z) \): $$ B^{\star} = F^{\star +}_{z_1} M^{\pi({\color{gray} a} \mid {\color{gray} s}, z_1)} / \rho = \dots = F^{\star +}_{z_{|\mathcal{Z}|}} M^{\pi({\color{gray} a} \mid {\color{gray} s}, z_{|\mathcal{Z}|})} / \rho, $$ where \( {\color{gray} X}^{+} \) denotes the pseudoinverse of the matrix \({\color{gray} X} \).
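The last condition can be checked numerically in a small tabular CMP. Below is a minimal NumPy sketch, assuming access to the ground-truth forward matrices and successor measures; the function names and array layouts are illustrative, not taken from the paper's code.

```python
import numpy as np

def backward_from_forward(F_z, M_z, rho):
    """Recover the backward matrix implied by one latent z via B = F_z^+ (M_z / rho).

    F_z : (|S||A|, d)    ground-truth forward representations for latent z
    M_z : (|S||A|, |S|)  successor measure of the policy pi(. | ., z)
    rho : (|S|,)         marginal state distribution of the dataset
    Returns B of shape (d, |S|), so that F_z @ B ~= M_z / rho.
    """
    ratio = M_z / rho[None, :]            # successor-measure ratio M_z / rho
    return np.linalg.pinv(F_z) @ ratio    # pseudoinverse solve: B = F_z^+ (M_z / rho)

def condition_holds(F_list, M_list, rho, tol=1e-6):
    """Fourth condition: every latent z must imply the *same* backward matrix."""
    Bs = [backward_from_forward(F, M, rho) for F, M in zip(F_list, M_list)]
    return all(np.allclose(Bs[0], B, atol=tol) for B in Bs[1:])
```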
What does the FB representation objective minimize? We interpret the representation objective in FB as a temporal-difference (TD) variant of the least-squares importance fitting (LSIF) loss. This interpretation draws a connection with fitted Q-evaluation.
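For concreteness, here is a minimal PyTorch-style sketch of this TD-style LSIF objective for a single latent \( z \), written in the form commonly used in the FB literature; the network handles (`F_net`, `B_net`), target copies, and batch layout are assumptions for illustration.

```python
import torch

def fb_td_lsif_loss(F_net, B_net, F_tgt, B_tgt, batch, z, policy, gamma=0.99):
    """TD-style LSIF loss: fit F(s, a, z)^T B(.) to the ratio M^{pi_z} / rho.

    batch holds (s, a, s_next) transitions from the dataset together with
    s_rand, states drawn independently from the same dataset (i.e., from rho).
    """
    s, a, s_next, s_rand = batch
    # Quadratic term: squared TD error against the bootstrapped target.
    with torch.no_grad():
        a_next = policy(s_next, z)                                   # next action from pi_z
        target = gamma * (F_tgt(s_next, a_next, z) * B_tgt(s_rand)).sum(-1)
    pred = (F_net(s, a, z) * B_net(s_rand)).sum(-1)
    quadratic = (pred - target).pow(2).mean()
    # Linear term of the LSIF loss, evaluated at the observed next state.
    linear = (F_net(s, a, z) * B_net(s_next)).sum(-1).mean()
    return quadratic - 2.0 * linear
```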
Does the practical FB algorithm converge to ground-truth representations? Our analysis suggests that whether the FB algorithm converges to any ground-truth representations remains an open problem. The key challenge is the circular dependency between the FB representations and the policies (see the figure above). In practice, we use didactic experiments to demonstrate that FB fails to converge.
Our Simplified Algorithm
Our understanding of FB suggests a simplified pre-training method for RL called one-step forward-backward (one-step FB) representation learning.
- One-step FB breaks the circular dependency by learning representations for a fixed behavioral policy (see the figure above).
- One-step FB performs policy adaptation via one step of policy improvement.
- Starting from the existing FB algorithm, implementing our method requires making two simple changes.
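As a rough illustration (a sketch under assumed interfaces, not the exact implementation), the two changes are: (1) during pre-training, the TD target bootstraps with the behavioral action stored in the dataset rather than an action from \( \pi_z \), so the representations model the successor measure of the fixed behavioral policy \( \pi_{\beta} \); (2) at adaptation time, the agent infers the task latent from reward samples and acts greedily with respect to the implied \( Q^{\pi_{\beta}} \), i.e., one step of policy improvement.

```python
import torch

def one_step_fb_target(F_tgt, B_tgt, s_next, a_next_dataset, s_rand, z, gamma=0.99):
    """Change 1: bootstrap with the dataset (behavioral) action instead of pi_z's,
    which fixes the policy being modeled and breaks the F <-> pi_z circular dependency."""
    with torch.no_grad():
        return gamma * (F_tgt(s_next, a_next_dataset, z) * B_tgt(s_rand)).sum(-1)

def one_step_improved_action(F_net, B_net, s, candidate_actions, s_reward, r_reward):
    """Change 2: infer the task latent from reward-labelled states, then act greedily
    w.r.t. the implied Q^{pi_beta} -- a single step of policy improvement."""
    z_r = (B_net(s_reward) * r_reward[:, None]).mean(0)          # z_r ~= E_rho[r(s) B(s)]
    s_rep = s.unsqueeze(0).expand(candidate_actions.shape[0], -1)
    q = (F_net(s_rep, candidate_actions, z_r) * z_r).sum(-1)     # Q_hat(s, a) = F(s, a, z_r)^T z_r
    return candidate_actions[q.argmax()]
```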
Didactic Experiments
We track several metrics aimed at answering the following questions:
- Do the learned representations accurately reflect the successor measure ratio? \( M^{\pi} / \rho \) and \( M^{\pi_{\beta}} / \rho \) prediction errors.
- Do the learned representations accurately reflect the ground-truth Q values? \( Q^{\star} \) and \( Q^{\pi_{\beta}} \) prediction errors.
- How similar are the learned policies to the ground-truth policies? forward KL divergence (\( \pi^{\star} \)) and forward KL divergence (\( \text{argmax}_a Q^{\pi_{\beta}} \)).
- Do the predicted Q-values satisfy the equivariance property of universal value functions? \( \hat{Q} \) equivariance errors.
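In a tabular setting, the first two groups of metrics above can be computed directly from the learned matrices; the NumPy sketch below (with illustrative names, assuming the ground-truth successor measure and Q-values are available) shows the idea.

```python
import numpy as np

def prediction_errors(F_z, B, M_z, rho, Q_true, z):
    """Tabular prediction-error metrics for a single latent z.

    F_z    : (|S||A|, d)    learned forward representations for latent z
    B      : (d, |S|)       learned backward representations (one column per state)
    M_z    : (|S||A|, |S|)  ground-truth successor measure of the evaluated policy
    rho    : (|S|,)         dataset state distribution
    Q_true : (|S||A|,)      ground-truth Q-values of the task encoded by z
    """
    # Successor-measure ratio error: F(s, a, z)^T B(s') should match M / rho.
    ratio_err = np.abs(F_z @ B - M_z / rho[None, :]).mean()
    # Q-value error: Q_hat(s, a) = F(s, a, z)^T z.
    q_err = np.abs(F_z @ z - Q_true).mean()
    return ratio_err, q_err
```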
Results:
- (Left) After training for \( 10^{5} \) gradient steps, FB fails to converge to ground-truth FB representations.
- (Right) Given a fixed policy, one-step FB exactly fits the ground-truth one-step FB representations.
Experiments on Standard Benchmarks
Domains
Offline zero-shot RL
- One-step FB achieves the best or near-best performance on \( 6 \) out of \( 10 \) domains.
- Compared with FB, one-step FB achieves a \(+1.4\times\) improvement on average.
- One-step FB outperforms prior methods by \(20\%\) when using RGB image observations directly.
Offline-to-online fine-tuning
- After offline unsupervised pre-training, we fine-tune each method online using the same off-the-shelf RL algorithm (TD3).
- One-step FB continues to provide higher sample efficiency (\(+40\%\) on average) during fine-tuning, as compared with the original FB method.
- The fine-tuned policies reach the asymptotic performance of TD3 at the end of training.
The Key Components of One-Step FB
BibTeX
@article{zheng2026can,
title={Can We Really Learn One Representation to Optimize All Rewards?},
author={Zheng, Chongyi and Jayanth, Royina Karegoudra and Eysenbach, Benjamin},
journal={arXiv preprint arXiv:xx},
year={2026},
}