Three Years of Observations: How We Define Role-Play
Usage of the “Regenerate” button follows a long-tail pattern and concentrates on narrative pivot points. Whether the scene is a confession or another emotionally charged moment, users hit “Regenerate” to curate their own “perfect moment.” This signals that users do not judge role-play on a binary pass/fail basis; they are pursuing narrative precision. What matters most to them is the fidelity of these peak emotional experiences.
NPC popularity diverges from a typical power-law curve. Unlike broad content platforms, even niche characters maintain distinct, high-retention user groups. For these users, the character’s specific idiosyncrasies are the core value proposition. If our model regresses to satisfy the “average” experience, we destroy the very nuance that minority users value, leading to engagement loss in the long tail.
Conversation turn count correlates non-linearly with engagement. We observed a significant drop-off in conversations continuing past turn 20. This suggests that shallow role-play is driven by novelty, while long-term retention depends not on one-time thrills but on whether the NPC and user can build a stable emotional connection within a limited number of turns. Based on this, we decomposed engagement drivers into instant gratification and long-term connection: we continuously deepen emotional bonds while providing new stimuli through exploration.
How do we preserve the distinct “soul” of each world? (Worlds) User-generated contexts span a massive spectrum, from slice-of-life campus dramas to high-stakes fantasy epics and from intimate dyads to complex ensemble casts. If our model merely learns the “average,” characters will homogenize, and these diverse worlds will collapse into mediocrity. We need a model capable of representing the full distribution, preserving the fidelity of both mainstream hits and long-tail niches without regressing toward the mean.
How do we sustain narrative vitality over time? (Stories) As conversation length increases, the risk of coherence drift rises. Models naturally tend toward mechanical loops and repetitive phrasing, causing narrative tension to evaporate. A compelling story requires cadence: the judgment to know when to escalate conflict to drive the plot and when to slow down to allow for emotional processing.
How do we decode implicit user intent? (User Preferences) Users rarely explicitly state their pacing preferences. Some seek a “slow burn” emotional buildup, while others crave rapid plot progression. The model must learn to infer these unspoken desires from contextual cues, dynamically aligning its rhythm and tone with the user’s underlying psychological flow.
1 MiniMax-M2-her
High-Fidelity World Experience: MiniMax-M2-her does more than process text; it anchors itself within complex settings. Whether the context is a sprawling epic or an intimate drama, it maintains strict coherence, ensuring every interaction aligns with the established lore and the character’s soul.
Dynamic Story Progression: MiniMax-M2-her rejects mediocre repetition and rigid patterns. By utilizing richer, more vivid prose, it actively drives the plot forward, imbuing stories with the tension and breathing rhythm of life itself.
Intuitive Preference Alignment: MiniMax-M2-her is designed to read between the lines. It detects unspoken expectations and subtle context cues, adapting dynamically to the user’s unique style and long-term habits without needing explicit instruction.
2 Starting with Evaluation — Is A/B Testing A Good Evaluation?
Basics: We scan for mixed languages, excessive repetition, and formatting glitches.
Logic: We place special emphasis on Reference Confusion, a metric that reflects whether the model truly remembers the relationships among user-constructed characters.
Knowledge: We ensure the model adheres to the immutable physical and magical laws of the specific setting.
Diversity: We detect single-pattern phrasing, repetitive plot beats, stagnation, and low-information filler.
Content Logic: We measure narrative coherence and OOC (out-of-character) breaks.
AI Speaks for User: We check whether the model oversteps its role by speaking or acting on the user's behalf.
AI Ignores User: We check whether the model talks to itself instead of responding to the user's input.
AI Silence: We check whether the model provides “hooks” that invite a reply.
Interaction Boundary: We require the model to balance safety boundaries with emotional interaction.
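As a rough illustration of how a Diversity check for repetitive phrasing might be automated, the sketch below computes the share of word trigrams that recur across a model's replies. It is not the metric used for MiniMax-M2-her; the function name `repeated_ngram_ratio`, the whitespace tokenization, and the choice of n = 3 are all assumptions made for the example.

```python
from collections import Counter


def repeated_ngram_ratio(replies, n=3):
    """Fraction of word n-grams across `replies` that occur more than once.

    Hypothetical helper: whitespace tokenization is assumed, which suits
    English text but would need a real tokenizer for other languages.
    """
    ngrams = []
    for reply in replies:
        tokens = reply.lower().split()
        # Collect every contiguous window of n tokens within this reply.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


if __name__ == "__main__":
    npc_replies = ["she smiles softly at you", "she smiles softly again"]
    print(repeated_ngram_ratio(npc_replies))
```

A higher ratio hints at single-pattern phrasing; any threshold for flagging a conversation would have to be calibrated on real data.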
Long-range quality stability: Most models hit a “performance wall” after turn 20. MiniMax-M2-her avoids context bloat and compounding logic gaps.
Response length controllability: MiniMax-M2-her has been specifically optimized for brevity. Even in 100-turn conversations, it maintains response length within the optimal range.
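One simple way to monitor response length over a long conversation is sketched below. The bounds `low` and `high` stand in for the “optimal range,” which is not specified here, and both helper names are hypothetical.

```python
def length_violations(reply_lengths, low, high):
    """Return 1-based turn numbers whose reply length is outside [low, high].

    `low` and `high` are placeholder bounds chosen by the caller.
    """
    return [
        turn
        for turn, length in enumerate(reply_lengths, start=1)
        if not low <= length <= high
    ]


def late_to_early_ratio(reply_lengths, window=20):
    """Mean reply length over the last `window` turns divided by the first `window`.

    A value well above 1.0 suggests replies are growing as the conversation ages.
    """
    if not reply_lengths:
        raise ValueError("reply_lengths must not be empty")
    early = reply_lengths[:window]
    late = reply_lengths[-window:]
    early_mean = sum(early) / len(early)
    if early_mean == 0:
        raise ValueError("early replies have zero total length")
    return (sum(late) / len(late)) / early_mean


if __name__ == "__main__":
    lengths = [100] * 20 + [200] * 20  # synthetic example data
    print(length_violations(lengths, low=50, high=150))
    print(late_to_early_ratio(lengths))
```

The lengths here are synthetic; in practice they would be token or character counts taken from logged NPC replies.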
3 How We Built MiniMax-M2-her
Randomly sample from the NPC/User prompt library and instantiate expert models.
Expert models play the NPC and the User, with a Dynamic Chat Planning Module guiding direction and emotional tone.
Best-of-N (BoN) sampling filters out low-quality outputs.
An LLM-as-a-judge agent periodically reviews and rewrites segments to correct drift.
Rewritten segments become the initial state for the next synthesis round.
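The Best-of-N step in the pipeline above can be sketched generically as follows. This is a minimal illustration of the technique, not the production pipeline: `generate` and `score` are hypothetical callables standing in for an expert model and a quality scorer, and the `min_score` cutoff is an assumption added to show how low-quality outputs could be dropped entirely.

```python
from typing import Callable, Optional


def best_of_n(
    context: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
    min_score: Optional[float] = None,
) -> Optional[str]:
    """Draw n candidate replies and keep the highest-scoring one.

    Returns None when even the best candidate scores below `min_score`,
    so the caller can discard the turn instead of keeping a weak reply.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(context) for _ in range(n)]
    scored = [(score(context, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if min_score is not None and best_score < min_score:
        return None
    return best


if __name__ == "__main__":
    # Toy stand-ins: a fixed pool of replies and a "distinct words" scorer.
    pool = iter(["ok", "ok ok ok", "a vivid new reply"])
    reply = best_of_n(
        "scene context",
        generate=lambda ctx: next(pool),
        score=lambda ctx, text: len(set(text.split())),
        n=3,
    )
    print(reply)
```

In a real setup the scorer would more plausibly be a reward model or an LLM judge; the toy scorer only makes the selection logic visible.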
Scenario diversity: Dispersion sampling neutralizes style bias from overrepresented tropes.
Prompt diversity: Skeletal NPC prompts are enriched with worldview positioning and plot development.
Style diversity: A pool of expert models is fine-tuned on distinct stylistic corpora.
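The exact dispersion sampling procedure is not described here. One plausible reading is a greedy farthest-point (max-min) selection over scenario embeddings, which favors scenarios far from those already chosen and so underweights dense clusters of similar tropes. The sketch below assumes each scenario is already represented as a numeric vector; the function name and the toy vectors are hypothetical.

```python
import math


def farthest_point_sample(vectors, k, start=0):
    """Greedily pick k indices so each new pick is as far as possible
    from everything already chosen (max-min distance).

    `vectors` is a list of equal-length numeric tuples, for example
    scenario embeddings. This is an assumed stand-in for dispersion sampling.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and len(vectors)")
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    while len(chosen) < k:
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
    return chosen


if __name__ == "__main__":
    # Three near-duplicate scenarios clustered at the origin, two outliers.
    scenarios = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (10, 0)]
    print(farthest_point_sample(scenarios, k=3))
```

With the toy data, only one of the three near-duplicate scenarios is selected, which is the intended effect on overrepresented tropes.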