# Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export

- Source: MarkTechPost (RSS)
- Author: Sana Hassan
- Published: 2026-05-26 15:25
- AIHOT score: 62
- AIHOT link: https://aihot.virxact.com/items/cmpmbnw9w0naesl015vrzcpta
- Original link: https://www.marktechpost.com/2026/05/26/design-a-complete-multimodal-rlvr-pipeline-with-open-mm-rl-vision-language-prompting-reward-scoring-and-grpo-export

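## AI Summary

This tutorial uses the TuringEnterprises/Open-MM-RL dataset as a practical basis for building a multimodal reasoning pipeline with reinforcement learning with verifiable rewards (RLVR). It covers dataset loading, structural analysis (domains, formats, question lengths, answer types, and image distribution), and visualization of examples from each domain. It also implements a lightweight reward function that checks conditions such as exact match, and demonstrates how to export the data in a GRPO format.

## Article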

In this tutorial, we explore the TuringEnterprises/Open-MM-RL dataset as a practical foundation for multimodal reasoning and reinforcement learning with verifiable rewards. We load the dataset, inspect its schema, analyze domains, formats, question lengths, answer types, and image distributions, and visualize representative examples from each domain. We also build a lightweight reward function that checks exact, numeric, fractional, LaTeX, and symbolic answers, giving us a useful way to evaluate model outputs. Finally, we format prompts for vision-language models, optionally test SmolVLM on sample examples, and export the dataset into a GRPO-style structure for future multimodal RL training.

    import subprocess, sys
    subprocess.run([sys.executable, "-m", "pip", "-q", "install",
                    "datasets>=3.0", "huggingface_hub>=0.24", "transformers>=4.45",
                    "Pillow", "matplotlib", "pandas", "numpy", "sympy",
                    "accelerate", "tqdm"], check=True)

    import os, re, io, json, math, random, textwrap, hashlib, warnings
    from collections import Counter
    from pathlib import Path
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from PIL import Image
    import sympy as sp
    from datasets import load_dataset

    warnings.filterwarnings("ignore")
    random.seed(0); np.random.seed(0)
    pd.set_option("display.max_colwidth", 120)

    DS_ID = "TuringEnterprises/Open-MM-RL"
    ds = load_dataset(DS_ID, split="train")
    print(f"Loaded {DS_ID} — {len(ds)} rows")
    print("Features:", ds.features)
    print("Row 0 keys:", list(ds[0].keys()))

We install all required libraries and import the core tools needed for dataset loading, analysis, visualization, symbolic math, and file handling. We set random seeds for reproducibility and configure pandas so that longer text fields display clearly. We then load the TuringEnterprises/Open-MM-RL dataset from Hugging Face and inspect its size, features, and first-row structure.

    df = ds.remove_columns(["images"]).to_pandas()
    df["n_images"] = [len(ex["images"]) for ex in ds]
    df["q_len_chars"] = df["question"].str.len()
    df["a_len_chars"] = df["answer"].str.len()

    print("\n=== Domain ==="); print(df["domain"].value_counts())
    print("\n=== Format ==="); print(df["format"].value_counts())
    print("\n=== Sub-domain (top by domain) ===")
    print(df.groupby("domain")["subDomain"].value_counts().head(15))
    print(f"\nMean images/example: {df['n_images'].mean():.2f} max: {df['n_images'].max()}")
    print(f"Median Q length: {df['q_len_chars'].median():.0f} "
          f"Median A length: {df['a_len_chars'].median():.0f}")

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    df["domain"].value_counts().plot.bar(ax=axes[0], color="#4C72B0")
    axes[0].set_title("Examples per domain"); axes[0].set_ylabel("count")
    df["format"].value_counts().plot.bar(ax=axes[1], color="#55A868")
    axes[1].set_title("Image-format type"); axes[1].tick_params(axis='x', rotation=25)
    df["n_images"].plot.hist(ax=axes[2], bins=range(1, df["n_images"].max() + 2),
                             color="#C44E52", edgecolor="white")
    axes[2].set_title("Images per example"); axes[2].set_xlabel("n_images")
    plt.tight_layout(); plt.show()

    def img_stats(ex):
        sizes = [im.size for im in ex["images"]]
        modes = [im.mode for im in ex["images"]]
        return {
            "n_images": len(sizes),
            "min_w": min(w for w, h in sizes), "max_w": max(w for w, h in sizes),
            "min_h": min(h for w, h in sizes), "max_h": max(h for w, h in sizes),
            "modes": "|".join(sorted(set(modes))),
            "total_pixels": sum(w * h for w, h in sizes),
        }

    img_df = pd.DataFrame([img_stats(ex) for ex in ds])
    print("\n=== Image resolution stats ===")
    print(img_df[["min_w", "max_w", "min_h", "max_h", "total_pixels"]].describe().round(0))
    print("\nMode mix:", Counter("|".join(img_df["modes"]).split("|")))

We convert the dataset into a DataFrame after removing the image column, then calculate useful fields such as the number of images, question length, and answer length. We analyze domain counts, format distribution, sub-domain breakdowns, and basic text/image statistics. We also create charts to visualize the number of examples per domain, the image formats, and the distribution of images per example.
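
The same kind of summary can be tried without downloading the dataset. The sketch below uses only the standard library and a few hypothetical toy rows; the domain names, questions, and the `image_sizes` field are assumptions made for illustration, not values taken from Open-MM-RL.

```python
from collections import Counter
from statistics import median

# Hypothetical toy rows; field values are illustrative, not from Open-MM-RL.
# `image_sizes` stands in for the (width, height) of each PIL image.
rows = [
    {"domain": "Math", "question": "Find the area of the shaded region.",
     "image_sizes": [(640, 480)]},
    {"domain": "Physics", "question": "What is the net force?",
     "image_sizes": [(800, 600), (400, 300)]},
    {"domain": "Math", "question": "Solve for x.",
     "image_sizes": [(512, 512)]},
]

def summarize(rows):
    """Return per-domain counts plus basic text and image statistics."""
    return {
        "per_domain": Counter(r["domain"] for r in rows),
        "median_q_len": median(len(r["question"]) for r in rows),
        "mean_images": sum(len(r["image_sizes"]) for r in rows) / len(rows),
        "total_pixels": [sum(w * h for w, h in r["image_sizes"]) for r in rows],
    }

summary = summarize(rows)
print(summary["per_domain"])
print(summary["mean_images"], summary["total_pixels"])
```

Tracking total pixels per example is useful later, because vision-language models typically spend more tokens on larger or more numerous images.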

    def show_example(ex, max_chars=600):
        print("=" * 80)
        print(f"id={ex['conversation_id']} {ex['domain']} / {ex['subDomain']}")
        print(f"format={ex['format']} n_images={len(ex['images'])}")
        print("-" * 80)
        q = ex["question"][:max_chars] + ("..." if len(ex["question"]) > max_chars else "")
        print("Q:", textwrap.fill(q, 100))
        print("-" * 80)
        print("A (gold):", ex["answer"])
        n = len(ex["images"])
        fig, axes = plt.subplots(1, n, figsize=(5 * n, 5)) if n > 1 \
            else plt.subplots(1, 1, figsize=(6, 6))
        axes = np.atleast_1d(axes)
        for ax, im in zip(axes, ex["images"]):
            ax.imshow(im); ax.set_xticks([]); ax.set_yticks([])
            ax.set_title(f"{im.size[0]}×{im.size[1]} ({im.mode})")
        plt.tight_layout(); plt.show()

    for dom in df["domain"].unique():
        idx = int(df[df["domain"] == dom].index[0])
        show_example(ds[idx])

    LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")
    df["latex_blocks_q"] = df["question"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
    df["latex_blocks_a"] = df["answer"].apply(lambda s: len(LATEX_PAT.findall(s or "")))
    print("\n=== LaTeX blocks per field ===")
    print(df[["latex_blocks_q", "latex_blocks_a"]].describe().round(2))

    def classify_answer(a):
        s = (a or "").strip().strip("$ []").strip()
        s_no_dollar = s.replace("$", "")
        if re.fullmatch(r"-?\s*\d+(\.\d+)?\s*", s_no_dollar):
            return "integer/float"
        if any(t in s for t in ["\\sqrt", "\\frac", "\\pi", "^", "_", "\\kappa", "\\lceil"]):
            return "symbolic"
        if re.fullmatch(r"[-+0-9./()\s\\a-zA-Z{}]+", s) and any(c.isdigit() for c in s):
            return "numeric_expr"
        return "text"

    df["answer_type"] = df["answer"].apply(classify_answer)
    print("\n=== Answer-type breakdown ==="); print(df["answer_type"].value_counts())
    print("\n=== Answer-type × domain ===")
    print(pd.crosstab(df["domain"], df["answer_type"]))

We define a helper function to display one representative example from each domain, including its question, gold answer, and associated images. We use this visual inspection step to better understand how multimodal reasoning problems are structured across different domains. We then analyze LaTeX usage in questions and answers, classify answer types, and compare answer-type distributions across domains.
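
To get a feel for how the LaTeX pattern and the answer classifier behave, it can help to run them on a few strings in isolation. The sketch below copies `LATEX_PAT` and `classify_answer` from the tutorial code and wraps the pattern in a small helper, `count_latex_blocks`, which is a name introduced here for illustration. The sample strings are hypothetical, not dataset rows.

```python
import re

# Pattern from the tutorial: matches \[...\], \(...\), or $...$ spans.
LATEX_PAT = re.compile(r"\\\[[\s\S]+?\\\]|\\\([\s\S]+?\\\)|\$[^$]+\$")

def count_latex_blocks(text):
    """Count LaTeX spans in a string (helper name is illustrative)."""
    return len(LATEX_PAT.findall(text or ""))

def classify_answer(a):
    """Heuristic answer-type classifier, as in the tutorial code."""
    s = (a or "").strip().strip("$ []").strip()
    s_no_dollar = s.replace("$", "")
    if re.fullmatch(r"-?\s*\d+(\.\d+)?\s*", s_no_dollar):
        return "integer/float"
    if any(t in s for t in ["\\sqrt", "\\frac", "\\pi", "^", "_", "\\kappa", "\\lceil"]):
        return "symbolic"
    if re.fullmatch(r"[-+0-9./()\s\\a-zA-Z{}]+", s) and any(c.isdigit() for c in s):
        return "numeric_expr"
    return "text"

# Hypothetical sample strings.
print(count_latex_blocks(r"Compute $x^2$ and \(y\)"))
for sample in ["42", "$-3.5$", r"$\frac{3}{4}$", "3/4", "12 cm", "triangle"]:
    print(repr(sample), "->", classify_answer(sample))
```

Note that the third rule is permissive: a string such as `12 cm` lands in `numeric_expr` because it only contains allowed characters and at least one digit, so the category is best read as "contains a number" rather than "is a pure numeric expression".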

    EXTRACT_PATS = [
        r"\\boxed\{([^{}]+)\}",
        r"final\s+answer\s*[:=]\s*([^\n]+)",
        r"answer\s*[:=]\s*([^\n]+)",
    ]

    def extract_final(text):
        if not text:
            return ""
        for p in EXTRACT_PATS:
            m = re.search(p, text, flags=re.IGNORECASE)
            if m:
                return m.group(1).strip().strip(".,;")
        lines = [l.strip() for l in str(text).strip().splitlines() if l.strip()]
        return lines[-1] if lines else ""

    def latex_to_sympy(s):
        s = (s or "").strip().strip("$").strip()
        s = re.sub(r"^\\[\[\(]", "", s); s = re.sub(r"\\[\]\)]$", "", s)
        s = (s.replace("\\pi", "pi").replace("\\cdot", "*").replace("\\times", "*")
              .replace("\\,", "").replace("\\;", "").replace("\\!", ""))
        s = re.sub(r"\\frac\s*\{([^{}]+)\}\s*\{([^{}]+)\}", r"((\1)/(\2))", s)
        s = re.sub(r"\\sqrt\s*\{([^{}]+)\}", r"sqrt(\1)", s)
        s = s.replace("^", "**")
        s = re.sub(r"\\[a-zA-Z]+", "", s)
        s = s.replace("{", "(").replace("}", ")")
        return s

    def grade(pred, gold, tol=1e-4):
        """Verifiable reward in [0,1]: exact > numeric > sympy-symbolic > partial."""
        if pred is None or gold is None:
            return 0.0
        p = extract_final(str(pred)).strip()
        g = str(gold).strip()
        norm = lambda x: re.sub(r"\s+", "", x.lower()).strip("$.,;[]()")
        if norm(p) == norm(g):
            return 1.0
        def to_float(x):
            try:
                return float(latex_to_sympy(x))
            except Exception:
                try:
                    return float(sp.sympify(latex_to_sympy(x)).evalf())
                except Exception:
                    return None
        fp, fg = to_float(p), to_float(g)
        if fp is not None and fg is not None:
            if abs(fp - fg) / max(1.0, abs(fg)) r={grade(pred, gold)} (want {want})")

    SYSTEM = ("You are a STEM expert solving multimodal reasoning problems. "
              "You will see a question and one or more figures. "
              "Reason step by step, then end with exactly one line:\n"
              "Final answer: ")

    def build_prompt(ex):
        img_tags = "\n".join(f"[Image {i+1}]" for i in range(len(ex["images"])))
        return f"{SYSTEM}\n\n{img_tags}\n\nQuestion:\n{ex['question']}\n\nLet's think step by step."

    print("\n=== Example prompt (truncated) ===")
    print(build_prompt(ds[0])[:600], "...\n")

We build a verifiable reward function that extracts final answers and compares predictions against gold answers using exact, numeric, and symbolic matching. We also add a LaTeX-to-SymPy conversion helper, allowing mathematical expressions to be evaluated more reliably. We test the grader with sanity checks and then create a structured prompt format for vision-language model reasoning.
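
The numeric stage of such a grader can be illustrated on its own. The following is a minimal sketch, not the tutorial's `grade` function: it reuses `EXTRACT_PATS` and `extract_final` from the code above, while `to_number` and `numeric_match` are hypothetical helpers introduced here. It assumes the comparison is a relative error bounded by `tol`, with the denominator floored at 1.0, and it parses plain decimals and simple fractions with `fractions.Fraction` instead of SymPy.

```python
import re
from fractions import Fraction

EXTRACT_PATS = [
    r"\\boxed\{([^{}]+)\}",
    r"final\s+answer\s*[:=]\s*([^\n]+)",
    r"answer\s*[:=]\s*([^\n]+)",
]

def extract_final(text):
    """Pull the final answer from model output; fall back to the last line."""
    if not text:
        return ""
    for p in EXTRACT_PATS:
        m = re.search(p, text, flags=re.IGNORECASE)
        if m:
            return m.group(1).strip().strip(".,;")
    lines = [l.strip() for l in str(text).strip().splitlines() if l.strip()]
    return lines[-1] if lines else ""

def to_number(s):
    """Hypothetical helper: parse '0.75' or '3/4' into a float, else None."""
    s = s.strip().strip("$").strip()
    try:
        return float(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return None

def numeric_match(pred, gold, tol=1e-4):
    """Assumed rule: relative error <= tol, denominator floored at 1.0."""
    fp, fg = to_number(extract_final(pred)), to_number(gold)
    if fp is None or fg is None:
        return False
    return abs(fp - fg) / max(1.0, abs(fg)) <= tol

print(extract_final(r"So the result is \boxed{42}."))
print(numeric_match("Some reasoning...\nFinal answer: 0.75", "3/4"))
print(numeric_match("Final answer: 0.8", "3/4"))
```

Flooring the denominator at 1.0 makes the check behave like an absolute tolerance for small gold values and a relative tolerance for large ones, which avoids dividing by numbers close to zero.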

    import torch

    USE_VLM = torch.cuda.is_available()
    print(f"CUDA available: {USE_VLM}")

    if USE_VLM:
        try:
            from transformers import AutoProcessor, AutoModelForVision2Seq
            MODEL_ID = "HuggingFaceTB/SmolVLM-Instruct"
            print(f"Loading {MODEL_ID} (this takes ~1 min) ...")
            processor = AutoProcessor.from_pretrained(MODEL_ID)
            model = AutoModelForVision2Seq.from_pretrained(
                MODEL_ID, torch_dtype=torch.float16, device_map="auto"
            )

            def vlm_solve(ex, max_new_tokens=512):
                imgs = [im.convert("RGB") for im in ex["images"]]
                content = [{"type": "image"} for _ in imgs]
                content.append({"type": "text", "text": build_prompt(ex)})
                text = processor.apply_chat_template(
                    [{"role": "user", "content": content}], add_generation_prompt=True)
                inputs = processor(text=text, images=imgs, return_tensors="pt").to(model.device)
                with torch.no_grad():
                    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
                return processor.batch_decode(
                    out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]

            rows, sample_idx = [], random.sample(range(len(ds)), 6)
            for i in sample_idx:
                ex = ds[i]
                try:
                    pred = vlm_solve(ex)
                    r = grade(pred, ex["answer"])
                except Exception as e:
                    pred, r = f"", 0.0
                rows.append({"id": ex["conversation_id"], "domain": ex["domain"],
                             "reward": r, "pred_tail": pred[-200:]})
                print(f" id={ex['conversation_id']} {ex['domain']:9s} r={r:.2f}")
            res = pd.DataFrame(rows)
            print(f"\nMean reward over {len(res)} samples: {res['reward'].mean():.3f}")
            print(res.groupby("domain")["reward"].mean().rename("avg_reward"))
        except Exception as e:
            print(f"VLM run failed ({e}); reward & data pipeline remain usable.")
    else:
        print("No GPU detected — skipping live VLM inference (Runtime → Change runtime type → GPU).")

    out_dir = Path("/content/open_mm_rl_processed"); out_dir.mkdir(exist_ok=True, parents=True)
    img_dir = out_dir / "images"; img_dir.mkdir(exist_ok=True)
    records = []
    for ex in ds:
        paths = []
        for j, im in enumerate(ex["images"]):
            p = img_dir / f"{ex['conversation_id']}_{j}.png"
            im.convert("RGB").save(p)
            paths.append(str(p))
        records.append({
            "id": ex["conversation_id"], "domain": ex["domain"],
            "subDomain": ex["subDomain"], "format": ex["format"],
            "prompt": build_prompt(ex), "gold": ex["answer"],
            "image_paths": paths,
        })

    jsonl_path = out_dir / "data.jsonl"
    with open(jsonl_path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    print(f"\nWrote {len(records)} records → {jsonl_path}")
    print(f"Saved {sum(len(r['image_paths']) for r in records)} images under {img_dir}")

    def mock_policy_samples(gold, K=4):
        """Stand-in for K policy rollouts. Replace with model.generate(do_sample=True)."""
        return [gold, "Final answer: 0", f"Final answer: {gold} (≈)",
                "I think the answer is unclear."][:K]

    def grpo_advantages(rewards):
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + 1e-6)

    print("\n=== Mock GRPO rollouts for example 0 ===")
    gold0 = ds[0]["answer"]
    cands = mock_policy_samples(gold0, K=4)
    rewards = [grade(c, gold0) for c in cands]
    adv = grpo_advantages(rewards)
    for c, r, a in zip(cands, rewards, adv):
        print(f" r={r:.2f} adv={a:+.2f} cand={c!r}")

    print("\nDone. To turn this into real training:")
    print(" 1. Replace mock_policy_samples with vlm_solve(..., do_sample=True, num_return_sequences=K).")
    print(" 2. Feed (prompt, K rollouts, K rewards) into TRL's GRPOTrainer or verl.")
    print(" 3. Curriculum: start with examples where rewards have non-zero variance.")

We check whether CUDA is available and, optionally, run SmolVLM on a few examples to generate predictions, then score them using our reward function. We then export the dataset to a GRPO-style JSONL format, saving all images to disk for future multimodal RL experiments. Finally, we demonstrate mock GRPO rollouts, calculate group-relative advantages, and outline how this can be replaced with real model-generated samples.
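
The group-relative advantage used in the mock rollouts normalizes each reward against its own group: $A_i = (r_i - \mu) / (\sigma + \epsilon)$, where $\mu$ and $\sigma$ are the mean and population standard deviation of the K rewards for one prompt. The sketch below restates the tutorial's `grpo_advantages` with the standard library instead of NumPy; the reward values are made up for illustration.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, like NumPy's default (ddof=0)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rewards: two correct and two incorrect rollouts.
mixed = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Illustrative rewards: every rollout failed, so there is no variance.
flat = grpo_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(flat)
```

When every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal. That is the reason behind the suggestion to start with examples whose rewards have non-zero variance.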

In conclusion, we built a complete workflow for understanding, evaluating, and preparing the Open-MM-RL dataset for multimodal reasoning experiments. We moved from dataset loading and exploratory analysis to image inspection, LaTeX-aware answer classification, reward scoring, prompt construction, optional VLM inference, and GRPO-style rollout preparation. This workflow is a starting point for training and evaluating vision-language models with verifiable rewards, and it shows how a multimodal dataset can be turned into a practical reinforcement learning pipeline.

The post Design a Complete Multimodal RLVR Pipeline with Open-MM-RL, Vision-Language Prompting, Reward Scoring, and GRPO Export appeared first on MarkTechPost.