Beyond SFT-to-RL: Pre-alignment via Black-box On-policy Distillation for Multimodal RL

Sudong Wang1,*, Weiquan Huang1,*, Xiaomin Yu1, Zuhao Yang3, Hehai Lin1, Keming Wu2, Chaojun Xiao2, Chen Chen2, Wenxuan Wang4, Beier Zhu5, Yunjian Zhang6,†, Chengwei Qin1,†
1HKUST(GZ)   2Tsinghua University   3NTU   4RUC   5USTC   6UCAS
* Equal Contribution     † Corresponding Author
Email Contact: swang886@connect.hkust-gz.edu.cn, whuang491@connect.hkust-gz.edu.cn

Abstract

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations, followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift: the fine-tuned policy neither fully preserves the model's original capabilities nor faithfully matches the supervision distribution. This problem is further amplified in multimodal reasoning, where perception errors and reasoning failures follow distinct drift patterns that compound during subsequent RL.

We introduce PRISM, a three-stage pipeline that mitigates this drift by inserting an explicit distribution-alignment stage between SFT and RLVR. Building on the principle of on-policy distillation (OPD), PRISM casts alignment as a black-box, response-level adversarial game between the policy and a Mixture-of-Experts (MoE) discriminator with dedicated perception and reasoning experts, providing disentangled corrective signals that steer the policy toward the supervision distribution without requiring access to teacher logits.

Experiments on Qwen3-VL show that PRISM consistently improves downstream RLVR performance across multiple RL algorithms (GRPO, DAPO, GSPO) and diverse multimodal benchmarks, raising average accuracy by +4.4 and +6.0 points over the SFT→RLVR baseline at the 4B and 8B scales, respectively.

Contributions

(1) PRISM: A Three-Stage Post-Training Pipeline
We propose PRISM, the first framework to reposition on-policy distillation as a standalone intermediate alignment stage between SFT and RLVR. In the multimodal setting, PRISM introduces black-box adversarial alignment with an MoE discriminator, providing decoupled corrective signals for perception and reasoning drift.

(2) High-Quality Multimodal Reasoning Corpus
We curate a 113K multimodal reasoning corpus distilled from Gemini 3 Flash, targeting the hardest problems unsolved by current LMMs with dense visual grounding and step-by-step reasoning traces. Combined with 1.26M public demonstrations, this corpus serves as both the SFT foundation and the supervision reference for distribution alignment.

(3) Consistent Improvements Across RL Algorithms
Experiments on Qwen3-VL-4B/8B validate that PRISM consistently and substantially improves downstream RLVR, with PRISM+GRPO outperforming SFT→GRPO by +4.4 and +6.0 average points on the two scales, respectively, and similar gains observed across DAPO and GSPO.

Overview

[Figure: PRISM pipeline overview]

Overview of the PRISM pipeline. (a) SFT introduces distributional drift between the policy and the supervision distribution. (b) The alignment stage uses an MoE discriminator with dedicated perception and reasoning experts to repair this drift via adversarial on-policy distillation. (c) The resulting distribution-aligned policy provides a stronger initialization for downstream RLVR.
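
In pseudocode, the pipeline reduces to three stages. The sketch below is a minimal illustration of this flow; every function name (run_sft, alignment_batches, train_discriminator, update_policy, run_rlvr) is a hypothetical placeholder, not an API from a released codebase.

```python
# Minimal sketch of the three PRISM stages. All function and attribute
# names are illustrative placeholders, not a released implementation.
def prism(policy, sft_data, rl_data, discriminator, rl_algo="GRPO"):
    # Stage 1: SFT on curated demonstrations.
    policy = run_sft(policy, sft_data)

    # Stage 2: black-box on-policy distillation. The policy samples its
    # own responses; the MoE discriminator is trained to tell supervision
    # responses from policy outputs and feeds its score back as a
    # response-level reward. No teacher logits are required -- only the
    # supervision text itself.
    for batch in alignment_batches(sft_data):
        samples = policy.generate(batch.prompts)              # on-policy
        train_discriminator(discriminator, batch.responses, samples)
        update_policy(policy, samples, reward_fn=discriminator)

    # Stage 3: RLVR (GRPO / DAPO / GSPO) from the aligned checkpoint.
    return run_rlvr(policy, rl_data, algo=rl_algo)
```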

Distribution Alignment Architecture

[Figure: Distribution-alignment architecture]

Architecture of the distribution-alignment stage. An MoE discriminator with perception and reasoning experts is trained via Bradley-Terry loss to distinguish supervision from policy outputs; the policy is updated via policy gradient to maximize the combined MoE reward.
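
To make the training signal concrete, below is a minimal PyTorch sketch of the adversarial game. It assumes a frozen encoder that maps each (image, response) pair to a feature vector; the head shapes, gating scheme, and batch-mean baseline are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEDiscriminator(nn.Module):
    """Two-expert response-level discriminator: one head scores perception
    fidelity, the other reasoning quality; a gate mixes the two scores."""
    def __init__(self, dim: int):
        super().__init__()
        self.perception_expert = nn.Linear(dim, 1)
        self.reasoning_expert = nn.Linear(dim, 1)
        self.gate = nn.Linear(dim, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, dim) features of full (image, response) pairs from a
        # frozen encoder -- black-box in that only sampled response text,
        # never teacher logits, enters the game.
        scores = torch.cat(
            [self.perception_expert(feats), self.reasoning_expert(feats)],
            dim=-1,
        )                                            # (B, 2) expert scores
        weights = self.gate(feats).softmax(dim=-1)   # (B, 2) mixing weights
        return (weights * scores).sum(dim=-1)        # (B,) combined reward

def bradley_terry_loss(s_sup: torch.Tensor, s_pol: torch.Tensor) -> torch.Tensor:
    """Discriminator objective: prefer supervision responses over
    on-policy samples, i.e. -log sigmoid(s_sup - s_pol)."""
    return -F.logsigmoid(s_sup - s_pol).mean()

def policy_gradient_loss(seq_log_probs: torch.Tensor, reward: torch.Tensor) -> torch.Tensor:
    """Policy objective: REINFORCE with a batch-mean baseline; the
    discriminator reward is treated as a constant (detached)."""
    advantage = (reward - reward.mean()).detach()
    return -(seq_log_probs * advantage).mean()
```

Because the discriminator only ever sees sampled responses, the supervision distribution enters solely through its response text, which is what makes the alignment stage black-box.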

Main Results

| Method | MathVista | MathVerse | MathVision | WeMath | MMMU | MMMU-Pro | HalluBench | Avg. |
|---|---|---|---|---|---|---|---|---|
| *Qwen3-VL-4B* | | | | | | | | |
| Instruct | 74.9 | 59.0 | 36.5 | 70.7 | 63.6 | 45.1 | 68.2 | 59.7 |
| + SFT | 71.5 | 58.4 | 31.9 | 70.6 | 53.6 | 42.8 | 69.1 | 56.8 |
| + GRPO | 75.7 | 64.5 | 35.5 | 77.8 | 60.1 | 47.3 | 72.0 | 61.8 |
| + DAPO | 74.3 | 65.1 | 42.7 | 77.2 | 62.5 | 48.0 | 72.3 | 63.2 |
| + GSPO | 75.2 | 64.0 | 37.5 | 78.4 | 58.7 | 45.6 | 71.9 | 61.6 |
| PRISM | 71.0 | 59.5 | 30.6 | 67.5 | 56.3 | 42.8 | 72.6 | 57.2 |
| + GRPO | **77.9** | **68.6** | 45.4 | 82.9 | **64.1** | 49.7 | **74.8** | 66.2 |
| + DAPO | 77.8 | 68.2 | **46.7** | **83.9** | **64.1** | 50.4 | 72.9 | **66.3** |
| + GSPO | 77.5 | 66.6 | **46.7** | 82.3 | 63.2 | **51.1** | 72.9 | 65.8 |
| *Qwen3-VL-8B* | | | | | | | | |
| Instruct | 76.0 | 62.4 | 43.7 | 71.7 | 65.6 | 52.3 | 71.6 | 63.3 |
| + SFT | 70.2 | 60.4 | 32.6 | 73.4 | 56.3 | 42.9 | 71.2 | 58.1 |
| + GRPO | 75.9 | 66.9 | 37.1 | 79.7 | 62.6 | 48.8 | 71.9 | 63.3 |
| + DAPO | 77.0 | 69.8 | 41.5 | 84.3 | 63.0 | 49.0 | 71.5 | 65.2 |
| + GSPO | 75.9 | 65.5 | 41.1 | 80.8 | 58.2 | 47.8 | 73.6 | 63.3 |
| PRISM | 71.4 | 62.2 | 37.1 | 73.1 | 58.4 | 43.4 | 69.5 | 59.3 |
| + GRPO | **78.3** | 71.3 | **52.0** | **86.4** | **66.6** | **53.3** | **77.2** | **69.3** |
| + DAPO | 78.2 | 70.9 | **52.0** | 86.2 | 66.2 | 52.4 | 76.1 | 68.9 |
| + GSPO | 77.9 | **71.5** | 51.6 | 85.9 | 65.2 | 52.7 | 75.8 | 68.7 |

Main results on mathematical reasoning benchmarks (MathVista, MathVerse, MathVision, WeMath) and general multimodal benchmarks (MMMU, MMMU-Pro, HalluBench). We report accuracy (%) for all benchmarks; bold indicates the best result within each base model, and the PRISM rows and their RL variants are ours. PRISM consistently improves downstream RLVR performance across all RL algorithms and both model scales.

Ablation Study

| Setting | MathVista | MathVerse | MathVision | WeMath | MMMU | MMMU-Pro | HalluBench | Avg. |
|---|---|---|---|---|---|---|---|---|
| PRISM (full) | 77.9 | 68.6 | 45.4 | 82.9 | 64.1 | 49.7 | 74.8 | 66.2 |
| *Discriminator Design* | | | | | | | | |
| Dense 4B disc. | 74.6 | 63.7 | 41.8 | 76.9 | 61.3 | 47.1 | 74.0 | 62.8 |
| Text-only disc. | 74.0 | 59.5 | 42.8 | 76.8 | 62.7 | 48.5 | 71.6 | 62.3 |
| *Pipeline Stages* | | | | | | | | |
| w/o SFT | 62.4 | 47.6 | 25.9 | 55.7 | 51.4 | 36.5 | 66.1 | 49.4 |
| w/o Alignment | 75.7 | 64.5 | 35.5 | 77.8 | 60.1 | 47.3 | 72.0 | 61.8 |
| *SFT Data Scale* | | | | | | | | |
| SFT-107K | 72.3 | 67.0 | 43.1 | 76.9 | 60.6 | 49.0 | 68.3 | 62.5 |
| SFT-1.37M | 77.9 | 68.6 | 45.4 | 82.9 | 64.1 | 49.7 | 74.8 | 66.2 |

Ablation study results. The first row is the full PRISM pipeline for reference. The MoE discriminator, the three-stage pipeline, and sufficient SFT data are all critical for achieving the best performance.

Analysis

[Figure: Training dynamics (left) and distribution visualization (right)]

Left: Training dynamics — Reward gap (supervision − policy) for the perception expert and reasoning expert. The perception expert peaks early and converges quickly, whereas the reasoning expert rises more gradually and exhibits greater oscillation, reflecting the distinct nature of visual and reasoning alignment. Right: Structural proxies of distribution alignment — Reasoning steps and descriptive items per caption across PRISM stages. The alignment stage substantially reduces the mismatch between policy and supervision distributions, and this improvement persists through RLVR.
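
The reward-gap curves on the left can be reproduced, in spirit, by logging per-expert score differences at each alignment step. The sketch below is a hedged illustration that reuses the hypothetical MoEDiscriminator heads from the sketch above; it is not the paper's logging code.

```python
# Per-expert reward gap: mean expert score on supervision responses minus
# mean score on on-policy samples, logged at each alignment step.
# `experts` maps expert names to score heads (hypothetical accessors).
def reward_gaps(experts, sup_feats, pol_feats):
    return {
        name: (head(sup_feats).mean() - head(pol_feats).mean()).item()
        for name, head in experts.items()
    }

# e.g. reward_gaps({"perception": disc.perception_expert,
#                   "reasoning":  disc.reasoning_expert}, sup_f, pol_f)
# A gap near zero means the discriminator can no longer separate the two
# distributions, i.e. the policy has matched the supervision distribution.
```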

Token Efficiency

[Figure: Token efficiency comparison]

Token efficiency comparison on MathVision, MathVerse, and MMMU-Pro (Qwen3-VL-4B). PRISM+GRPO achieves higher accuracy with fewer tokens across all three benchmarks, suggesting that the alignment stage encourages more concise and effective reasoning rather than simply producing longer outputs.

Citation

@article{wang2026beyond,
  title={Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL},
  author={Wang, Sudong and Huang, Weiquan and Yu, Xiaomin and Yang, Zuhao and Lin, Hehai and Wu, Keming and Xiao, Chaojun and Chen, Chen and Wang, Wenxuan and Zhu, Beier and Zhang, Yunjian and Qin, Chengwei},
  journal={arXiv preprint arXiv:2604.28123},
  year={2026}
}