Model Considerations¶
Sonality is designed to be model-agnostic in architecture but not in requirements. Different roles in the pipeline have different demands. This guide covers what to look for when selecting models and documents known trade-offs.
Pipeline Roles¶
Sonality uses 2–3 LLM calls per interaction, each with different requirements:
| Role | Requirements | Frequency |
|---|---|---|
| Response generation | Strong reasoning, personality coherence, long-context handling, natural conversation | Every interaction |
| ESS classification | Structured output (tool_use), calibrated scoring, resistance to self-judge bias | Every interaction |
| Insight extraction | Concise summarization, identity-relevant pattern recognition | When classifier output is reliable |
| Reflection | Narrative synthesis, belief reconciliation, preservation of established traits | Every 20 interactions |
Response Generation¶
The main model handles conversation. Key requirements:
- Long-context support (100k+ tokens) — the system prompt includes personality state, traits, and retrieved episodes
- Instruction following — the core identity defines behavioral constraints (anti-sycophancy, intellectual honesty) that the model must respect
- Reasoning quality — the agent needs to construct genuine arguments, not just agree
- Personality coherence — the model must maintain the character described in the sponge snapshot across turns
Higher-capability models produce more distinctive personalities. Research finding: PersonaGym (EMNLP 2025) found only 2.97% difference between top-tier and mid-tier models on persona maintenance — suggesting mid-tier models are often sufficient.
ESS Classification¶
The ESS classifier evaluates user argument quality via structured tool output. Requirements:
- Structured output — must support
tool_useor equivalent function-calling API - Calibration — scores must be consistent across similar argument types (bare assertions always < 0.15, structured evidence always > 0.5)
- Resistance to framing — third-person evaluation prompt reduces but doesn't eliminate model-specific biases
Using a separate model for ESS reduces self-judge coupling. When the same model generates both the response and the ESS evaluation, it creates a feedback loop where the model rates its own reasoning favorably (up to 50pp bias documented by SYConBench, EMNLP 2025). Using a different model for ESS classification is the simplest mitigation.
Cost consideration: ESS classification uses structured output with a fixed schema — a smaller, faster model often suffices here.
Insight Extraction and Reflection¶
Both insight extraction and reflection require summarization and synthesis. Reflection in particular must:
- Preserve existing personality traits while integrating new insights
- Detect contradictions between beliefs
- Generate meta-patterns ("I notice I tend to value X")
Reflection quality directly determines personality stability. Park et al. (2023) ablation showed reflection is the most critical component for believable agents. Use a model with strong instruction-following for reflection.
Model Selection Criteria¶
Minimum Requirements¶
Any model used with Sonality must support:
- System prompts — the personality state is injected as a system message
- Structured output (for ESS) — tool_use, function calling, or equivalent
- 100k+ context window — for conversation history + personality state + episodes
- Strong instruction following — anti-sycophancy framing requires the model to resist agreeing by default
Evaluation Checklist¶
When evaluating a new model for Sonality:
| Test | What to Check | Healthy |
|---|---|---|
| ESS calibration | Run make test-ess with 10 benchmark messages |
Scores within ±0.15 of expected |
| Sycophancy resistance | Run the sycophancy battery (see Testing) | Disagreement rate > 20% |
| Personality coherence | 50 interactions, then check /snapshot distinctiveness |
Snapshot diverged from seed |
| Reflection quality | After reflection, /diff shows meaningful synthesis |
Not just parroting insights |
Cost-Performance Trade-offs¶
The pipeline makes 2–3 calls per interaction. For a 100-interaction session:
| Configuration | Calls | Trade-off |
|---|---|---|
| Same model for all roles | ~250 | Simplest setup; self-judge bias risk |
| Separate ESS model (smaller) | ~250 | Reduced self-judge bias; lower cost for ESS calls |
| Smaller model for insight extraction | ~250 | Lower cost; may reduce insight quality |
Many providers offer prompt caching (up to 90% discount on static prefixes). The system prompt (~1,400 tokens) is largely static, making Sonality cache-friendly.
Recommended Profiles by Use Case¶
These are practical model-role profiles for common deployment goals.
| Use Case | Response Generation | ESS Classification | Reflection | Why This Fit Works |
|---|---|---|---|---|
| Cost-sensitive support agent | Mid-tier reasoning model | Small structured-output model | Mid-tier model | Keeps quality acceptable while minimizing per-turn cost; ESS remains reliable if calibrated |
| Safety-first support agent | Top-tier reasoning model | Separate mid-tier model | Top-tier model | Maximizes robustness on difficult user pressure and reflection quality |
| High-volume operations | Mid-tier model with caching | Small model | Mid-tier model on reduced cadence | Best throughput/cost balance when handling many sessions |
| Deep coaching / long sessions | Top-tier long-context model | Separate mid-tier model | Top-tier model | Better narrative continuity and contradiction resolution over long horizons |
Selection rule:
- Start with a separate ESS model to reduce self-judge coupling.
- Keep reflection on the strongest available model before upgrading generation.
- Only increase generation model size if coherence or resistance metrics degrade.
Embedding Model¶
Sonality uses one embedding model through the unified provider abstraction for:
- derivative vector retrieval in PostgreSQL/pgvector,
- semantic feature embeddings in PostgreSQL.
The default is Qwen/Qwen3-Embedding-8B (configurable via SONALITY_EMBEDDING_MODEL).
Embedding quality directly affects retrieval relevance.
| Aspect | Consideration |
|---|---|
| Semantic quality | Better embedding models improve belief/topic recall and reranking quality |
| Dimensions | Must match SONALITY_EMBEDDING_DIMENSIONS and database vector schema |
| Throughput | Larger models improve recall but increase ingestion and query cost |
For production changes, re-embed stored derivatives/features when switching models.
Endpoint and Routing Notes¶
Sonality uses one OpenAI-compatible provider path for all model calls.
- Set
SONALITY_BASE_URLto your endpoint (https://api.openai.com/v1, Ollama, vLLM, etc.). - Set
SONALITY_API_KEYif required by your endpoint. - Keep
SONALITY_MODELfor response generation andSONALITY_ESS_MODELfor classifier tasks. - Use
SONALITY_EMBEDDING_MODELfor embeddings via the same provider abstraction.
If you move to a non OpenAI-compatible protocol, only sonality/provider.py needs adaptation.
Research on Model Selection for Personality Agents¶
Key findings relevant to model selection:
-
PERSIST (2025): Even 400B+ parameter models show σ > 0.3 measurement noise on personality assessments. Scaling alone does not solve personality instability — architectural scaffolding (external state, ESS gating, reflection) compensates.
-
BIG5-CHAT (ACL 2025): Higher conscientiousness and agreeableness in the base model improve reasoning quality. Models with these traits in their training character make better personality agents.
-
PISF (2024): Prompt Induction post Supervised Fine-Tuning — combining SFT with prompting achieves the highest efficacy and adversarial robustness. For API-only use, prompt engineering is the primary lever.
-
PersonaGym (EMNLP 2025): Persona maintenance is surprisingly similar across model tiers (< 3% difference between top and mid-tier). The bottleneck is architecture, not model capability.
-
Personality Illusion (NeurIPS 2025): RLHF-trained models show more stable personality expression than base models. Post-training matters more than scale for personality coherence.