Beyond SFT-to-RL: Pre-alignment via Black-box On-policy Distillation for Multimodal RL

Sudong Wang1,*, Weiquan Huang1,*, Xiaomin Yu1, Zuhao Yang3, Hehai Lin1, Keming Wu2, Chaojun Xiao2, Chen Chen2, Wenxuan Wang4, Beier Zhu5, Yunjian Zhang6,†, Chengwei Qin1,†
1HKUST(GZ)   2Tsinghua University   3NTU   4RUC   5USTC   6UCAS
* Equal Contribution     † Corresponding Author
Email Contact: swang886@connect.hkust-gz.edu.cn, whuang491@connect.hkust-gz.edu.cn

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift: the fine-tuned policy neither fully preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL.

We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits.

Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, raising average accuracy by +4.4 and +6.0 points over the SFT→RLVR baseline at the 4B and 8B scales, respectively.

Contributions

(1) PRISM: A Three-Stage Post-Training Pipeline
We propose PRISM, the first framework to reposition on-policy distillation as a standalone intermediate alignment stage between SFT and RLVR. In the multimodal setting, PRISM introduces black-box adversarial alignment with an MoE discriminator, providing decoupled corrective signals for perception and reasoning drift.

(2) High-Quality Multimodal Reasoning Corpus
We curate a 113K multimodal reasoning corpus distilled from Gemini 3 Flash, targeting the hardest problems unsolved by current LMMs with dense visual grounding and step-by-step reasoning traces. Combined with 1.26M public demonstrations, this corpus serves as both the SFT foundation and the supervision reference for distribution alignment.

(3) Consistent Improvements Across RL Algorithms
Experiments on Qwen3-VL-4B/8B validate that PRISM consistently and substantially improves downstream RLVR, with PRISM+GRPO outperforming SFT→GRPO by +4.4 and +6.0 average points on the two scales, respectively, and similar gains observed across DAPO and GSPO.

Overview

[Figure: PRISM pipeline overview]

Overview of the PRISM pipeline. (a) SFT introduces distributional drift between the policy and the supervision distribution. (b) The alignment stage uses an MoE discriminator with dedicated perception and reasoning experts to repair this drift via adversarial on-policy distillation. (c) The resulting distribution-aligned policy provides a stronger initialization for downstream RLVR.
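
In pseudocode, the pipeline reduces to three stages. The sketch below is a minimal illustration of this flow; every function name (run_sft, alignment_batches, train_discriminator, update_policy, run_rlvr) is a hypothetical placeholder, not an API from a released codebase.

```python
# Minimal sketch of the three PRISM stages. All function and attribute
# names are illustrative placeholders, not a released implementation.
def prism(policy, sft_data, rl_data, discriminator, rl_algo="GRPO"):
    # Stage 1: SFT on curated demonstrations.
    policy = run_sft(policy, sft_data)

    # Stage 2: black-box on-policy distillation. The policy samples its
    # own responses; the MoE discriminator is trained to tell supervision
    # responses from policy outputs and feeds its score back as a
    # response-level reward. No teacher logits are required -- only the
    # supervision text itself.
    for batch in alignment_batches(sft_data):
        samples = policy.generate(batch.prompts)              # on-policy
        train_discriminator(discriminator, batch.responses, samples)
        update_policy(policy, samples, reward_fn=discriminator)

    # Stage 3: RLVR (GRPO / DAPO / GSPO) from the aligned checkpoint.
    return run_rlvr(policy, rl_data, algo=rl_algo)
```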

Distribution Alignment Architecture

[Figure: Distribution-alignment architecture]

Architecture of the distribution-alignment stage. An MoE discriminator with perception and reasoning experts is trained via Bradley-Terry loss to distinguish supervision from policy outputs; the policy is updated via policy gradient to maximize the combined MoE reward.
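
To make the training signal concrete, below is a minimal PyTorch sketch of the adversarial game. It assumes a frozen encoder that maps each (image, response) pair to a feature vector; the head shapes, gating scheme, and batch-mean baseline are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEDiscriminator(nn.Module):
    """Two-expert response-level discriminator: one head scores perception
    fidelity, the other reasoning quality; a gate mixes the two scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.perception_expert = nn.Linear(dim, 1)
        self.reasoning_expert = nn.Linear(dim, 1)
        self.gate = nn.Linear(dim, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, dim) features of full (image, response) pairs from a
        # frozen encoder -- black-box in that only sampled response text,
        # never teacher logits, enters the game.
        scores = torch.cat(
            [self.perception_expert(feats), self.reasoning_expert(feats)],
            dim=-1,
        )                                            # (B, 2) expert scores
        weights = self.gate(feats).softmax(dim=-1)   # (B, 2) mixing weights
        return (weights * scores).sum(dim=-1)        # (B,) combined reward

def bradley_terry_loss(s_sup: torch.Tensor, s_pol: torch.Tensor) -> torch.Tensor:
    """Discriminator objective: prefer supervision responses over
    on-policy samples, i.e. -log sigmoid(s_sup - s_pol)."""
    return -F.logsigmoid(s_sup - s_pol).mean()

def policy_gradient_loss(seq_log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """Policy objective: REINFORCE with a batch-mean baseline; the
    discriminator reward is treated as a constant (detached)."""
    advantage = (reward - reward.mean()).detach()
    return -(seq_log_probs * advantage).mean()
```

Because the discriminator only ever sees sampled responses, the supervision distribution enters solely through its response text, which is what makes the alignment stage black-box.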

Main Results

| Method | MathVista | MathVerse | MathVision | WeMath | MMMU | MMMU-Pro | HalluBench | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Qwen3-VL-4B* | | | | | | | | |
| Instruct | 74.9 | 59.0 | 36.5 | 70.7 | 63.6 | 45.1 | 68.2 | 59.7 |
| + SFT | 71.5 | 58.4 | 31.9 | 70.6 | 53.6 | 42.8 | 69.1 | 56.8 |
| + GRPO | 75.7 | 64.5 | 35.5 | 77.8 | 60.1 | 47.3 | 72.0 | 61.8 |
| + DAPO | 74.3 | 65.1 | 42.7 | 77.2 | 62.5 | 48.0 | 72.3 | 63.2 |
| + GSPO | 75.2 | 64.0 | 37.5 | 78.4 | 58.7 | 45.6 | 71.9 | 61.6 |
| PRISM | 71.0 | 59.5 | 30.6 | 67.5 | 56.3 | 42.8 | 72.6 | 57.2 |
| + GRPO | **77.9** | **68.6** | 45.4 | 82.9 | **64.1** | 49.7 | **74.8** | 66.2 |
| + DAPO | 77.8 | 68.2 | **46.7** | **83.9** | **64.1** | 50.4 | 72.9 | **66.3** |
| + GSPO | 77.5 | 66.6 | **46.7** | 82.3 | 63.2 | **51.1** | 72.9 | 65.8 |
| *Qwen3-VL-8B* | | | | | | | | |
| Instruct | 76.0 | 62.4 | 43.7 | 71.7 | 65.6 | 52.3 | 71.6 | 63.3 |
| + SFT | 70.2 | 60.4 | 32.6 | 73.4 | 56.3 | 42.9 | 71.2 | 58.1 |
| + GRPO | 75.9 | 66.9 | 37.1 | 79.7 | 62.6 | 48.8 | 71.9 | 63.3 |
| + DAPO | 77.0 | 69.8 | 41.5 | 84.3 | 63.0 | 49.0 | 71.5 | 65.2 |
| + GSPO | 75.9 | 65.5 | 41.1 | 80.8 | 58.2 | 47.8 | 73.6 | 63.3 |
| PRISM | 71.4 | 62.2 | 37.1 | 73.1 | 58.4 | 43.4 | 69.5 | 59.3 |
| + GRPO | **78.3** | 71.3 | **52.0** | **86.4** | **66.6** | **53.3** | **77.2** | **69.3** |
| + DAPO | 78.2 | 70.9 | **52.0** | 86.2 | 66.2 | 52.4 | 76.1 | 68.9 |
| + GSPO | 77.9 | **71.5** | 51.6 | 85.9 | 65.2 | 52.7 | 75.8 | 68.7 |

Main results on mathematical reasoning benchmarks (MathVista, MathVerse, MathVision, WeMath) and general multimodal benchmarks (MMMU, MMMU-Pro, HalluBench). We report accuracy (%) for all benchmarks; bold indicates the best result within each base model, and the PRISM rows and their RL variants are ours. PRISM consistently improves downstream RLVR performance across all RL algorithms and both model scales.

Ablation Study

| Setting | MathVista | MathVerse | MathVision | WeMath | MMMU | MMMU-Pro | HalluBench | Avg. |
|---|---|---|---|---|---|---|---|---|
| PRISM (full) | 77.9 | 68.6 | 45.4 | 82.9 | 64.1 | 49.7 | 74.8 | 66.2 |
| *Discriminator Design* | | | | | | | | |
| Dense 4B disc. | 74.6 | 63.7 | 41.8 | 76.9 | 61.3 | 47.1 | 74.0 | 62.8 |
| Text-only disc. | 74.0 | 59.5 | 42.8 | 76.8 | 62.7 | 48.5 | 71.6 | 62.3 |
| *Pipeline Stages* | | | | | | | | |
| w/o SFT | 62.4 | 47.6 | 25.9 | 55.7 | 51.4 | 36.5 | 66.1 | 49.4 |
| w/o Alignment | 75.7 | 64.5 | 35.5 | 77.8 | 60.1 | 47.3 | 72.0 | 61.8 |
| *SFT Data Scale* | | | | | | | | |
| SFT-107K | 72.3 | 67.0 | 43.1 | 76.9 | 60.6 | 49.0 | 68.3 | 62.5 |
| SFT-1.37M | 77.9 | 68.6 | 45.4 | 82.9 | 64.1 | 49.7 | 74.8 | 66.2 |

Ablation study results. The first row is the full PRISM pipeline for reference. The MoE discriminator, the three-stage pipeline, and sufficient SFT data are all critical for achieving the best performance.

Analysis

[Figure: Training dynamics (left) and distribution visualization (right)]

Left: Training dynamics — Reward gap (supervision − policy) for the perception expert and reasoning expert. The perception expert peaks early and converges quickly, whereas the reasoning expert rises more gradually and exhibits greater oscillation, reflecting the distinct nature of visual and reasoning alignment. Right: Structural proxies of distribution alignment — Reasoning steps and descriptive items per caption across PRISM stages. The alignment stage substantially reduces the mismatch between policy and supervision distributions, and this improvement persists through RLVR.
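
The reward-gap curves on the left can be reproduced, in spirit, by logging per-expert score differences at each alignment step. The sketch below is a hedged illustration that reuses the hypothetical MoEDiscriminator heads from the sketch above; it is not the paper's logging code.

```python
# Per-expert reward gap: mean expert score on supervision responses minus
# mean score on on-policy samples, logged at each alignment step.
# `experts` maps expert names to score heads (hypothetical accessors).
def reward_gaps(experts, sup_feats, pol_feats):
    return {
        name: (head(sup_feats).mean() - head(pol_feats).mean()).item()
        for name, head in experts.items()
    }

# e.g. reward_gaps({"perception": disc.perception_expert,
#                   "reasoning":  disc.reasoning_expert}, sup_f, pol_f)
# A gap near zero means the discriminator can no longer separate the two
# distributions, i.e. the policy has matched the supervision distribution.
```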

Token Efficiency

[Figure: Token efficiency comparison]

Token efficiency comparison on MathVision, MathVerse, and MMMU-Pro (Qwen3-VL-4B). PRISM+GRPO achieves higher accuracy with fewer tokens across all three benchmarks, suggesting that the alignment stage encourages more concise and effective reasoning rather than simply producing longer outputs.

Citation

@article{wang2026beyond,
  title={Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL},
  author={Wang, Sudong and Huang, Weiquan and Yu, Xiaomin and Yang, Zuhao and Lin, Hehai and Wu, Keming and Xiao, Chaojun and Chen, Chen and Wang, Wenxuan and Zhu, Beier and Zhang, Yunjian and Qin, Chengwei},
  journal={arXiv preprint arXiv:2604.28123},
  year={2026}
}