Introduction: Beyond Single-System Evaluation
Most AI evaluation workflows focus on testing a single model in isolation—running a benchmark, collecting scores, and moving on. But as the landscape of frontier AI systems expands, the most valuable insights come from comparative analysis: understanding how GPT-5 handles temporal coherence differently from Claude Opus 4.7, or why Gemini 2.5 Pro excels at intervention discrimination while struggling with sequential branch management.
MORI (Modal Organization and Realization Index) was designed from the ground up for multi-system benchmarking. It provides a structured framework for evaluating agency-relevant properties across three core dimensions:
- Functional (F): How well systems discriminate between interventions and manage sequential decision branches
- Dynamical (D): Temporal coherence and consistency across extended interactions
- Realization (R): The depth of internal representations, from behavioral patterns to mechanistic interpretability
This guide will walk you through advanced workflows that go beyond single-model evaluation: setting up side-by-side comparisons, customizing test protocols for your specific research questions, handling real-world constraints like partial API coverage, and interpreting cross-system scorecards with confidence.
Whether you're a researcher comparing model families, an organization evaluating deployment options, or a developer stress-testing your own systems, these techniques will help you extract maximum insight from your evaluation runs.
Setting Up Multi-System Benchmarks
The foundation of any comparative evaluation is a well-structured system specification. Think of this as your roster of competitors—each entry defines not just which model you're testing, but exactly how you're testing it.
Defining Your System Lineup
Start by creating specifications for each system you want to compare. Each specification includes:
- System identifier: A unique name (e.g., gpt-5-baseline, claude-opus-4-7-strict)
- Provider: Which API service hosts the model (OpenAI, Anthropic, Google, etc.)
- Model name: The exact model identifier recognized by the provider
- Generation parameters: Temperature, max tokens, and other sampling settings
For example, you might set up a three-way comparison between:
- GPT-5 with temperature 1.0 and 4096 max tokens
- Claude Opus 4.7 with temperature 1.0 (note: this provider ignores temperature for this model)
- Gemini 2.5 Pro with temperature 0.95 and 8192 max tokens
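A lineup like this can be written down as plain data before any evaluation runs. The sketch below is illustrative only: SystemSpec, its field names, and check_unique_ids are hypothetical and not taken from MORI's actual interface, and the model-name strings and the third identifier are placeholders to be replaced with the exact identifiers your providers recognize.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemSpec:
    """Hypothetical container for one entry in the comparison lineup."""
    system_id: str                            # unique name within this benchmark
    provider: str                             # e.g. "openai", "anthropic", "google"
    model_name: str                           # placeholder; use the provider's exact id
    temperature: Optional[float] = None       # None means "not specified"
    max_output_tokens: Optional[int] = None   # None means "not specified"
    notes: str = ""


LINEUP = [
    SystemSpec("gpt-5-baseline", "openai", "gpt-5", 1.0, 4096),
    SystemSpec("claude-opus-4-7-strict", "anthropic", "claude-opus-4-7", 1.0, None,
               notes="provider ignores temperature for this model"),
    SystemSpec("gemini-2-5-pro", "google", "gemini-2.5-pro", 0.95, 8192),
]


def check_unique_ids(specs):
    """Raise ValueError if two specs share a system identifier."""
    seen = set()
    for spec in specs:
        if spec.system_id in seen:
            raise ValueError(f"duplicate system_id: {spec.system_id}")
        seen.add(spec.system_id)


check_unique_ids(LINEUP)
```

Keeping the note about ignored temperature inside the spec itself means the caveat travels with the results instead of living in someone's memory.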
Choosing Comparable Parameters
Here's where things get nuanced. Different providers handle generation parameters differently:
- Some models (like GPT-5 and Claude Opus 4.7) have deprecated temperature controls, so a temperature value in the spec has no effect on sampling
- Token limits use different parameter names: max_completion_tokens for OpenAI, max_tokens for Anthropic
- Not all models support the same context window sizes
The key to fair comparison is documenting these differences rather than forcing artificial uniformity. MORI tracks exactly which parameters were used for each system, so your results include full provenance.
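One way to document differences rather than hide them is to translate a provider-neutral spec into request parameters and record what was actually sent. This is a minimal sketch, not MORI's adapter code: build_request_params and the provenance keys are hypothetical, and only the two parameter names mentioned above are included.

```python
# Output-length parameter names per provider, as described in the text.
MAX_TOKENS_PARAM = {
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
}


def build_request_params(provider, max_output_tokens, temperature=None,
                         ignores_temperature=False):
    """Return (request_params, provenance) for one system.

    provenance records the requested temperature and whether it was sent,
    so the scorecard can show the difference instead of hiding it.
    """
    if provider not in MAX_TOKENS_PARAM:
        raise KeyError(f"no parameter mapping for provider: {provider}")
    params = {MAX_TOKENS_PARAM[provider]: max_output_tokens}
    provenance = {"requested_temperature": temperature, "temperature_sent": False}
    if temperature is not None and not ignores_temperature:
        params["temperature"] = temperature
        provenance["temperature_sent"] = True
    return params, provenance


openai_params, openai_prov = build_request_params(
    "openai", 4096, temperature=1.0, ignores_temperature=True)
anthropic_params, anthropic_prov = build_request_params(
    "anthropic", 1024, temperature=0.5)
```

The ignores_temperature flag is set by the caller here; which models actually ignore the parameter should be confirmed against each provider's current documentation.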
Batch Configuration
Once you've defined your system specs, you can run the same protocol suite against all of them in sequence. The evaluation runner will:
- Load each system specification
- Initialize the appropriate provider adapter
- Execute the full protocol battery
- Generate individual scorecards for each system
- Prepare data for aggregated comparison
This automated workflow ensures consistency—every system faces exactly the same prompts, in the same order, with the same scoring criteria.
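The loop behind those steps can be sketched in a few lines. Everything here is an assumption made for illustration: an adapter is modeled as a plain function from prompt to completion, mock_adapter is a toy stand-in rather than MORI's mock provider, and scoring is reduced to exact match.

```python
from typing import Callable, Dict, List

# In this sketch a "provider adapter" is just a function: prompt -> completion.
Adapter = Callable[[str], str]


def mock_adapter(prompt: str) -> str:
    """Toy stand-in for a provider call; returns the last word of the prompt."""
    return prompt.split()[-1]


def run_batch(system_ids: List[str],
              adapters: Dict[str, Adapter],
              prompts: List[str],
              expected: List[str]) -> Dict[str, List[float]]:
    """Run the same prompts, in the same order, against every system."""
    results: Dict[str, List[float]] = {}
    for system_id in system_ids:                       # one system at a time
        adapter = adapters[system_id]                  # look up its adapter
        scores = []
        for prompt, answer in zip(prompts, expected):  # identical battery
            response = adapter(prompt)
            scores.append(1.0 if response.strip() == answer else 0.0)
        results[system_id] = scores                    # raw per-trial scores
    return results


prompts = ["Reply with A", "Reply with B"]
results = run_batch(
    ["mock-echo", "mock-always-a"],
    {"mock-echo": mock_adapter, "mock-always-a": lambda prompt: "A"},
    prompts,
    ["A", "B"],
)
```

Because the prompt list and expected answers are passed once and reused for every system, no system can end up with a different order or a different scoring rule.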
Customizing Protocol Parameters and Stimuli
While MORI ships with pre-registered protocols (F1, F2, D1, and the three-tier R7 suite), advanced users often need to customize test conditions for specific research questions or deployment scenarios.
Understanding Protocol Structure
Each protocol is defined by:
- Stimuli: The actual prompts or test cases presented to the system
- Scoring method: How responses are evaluated (exact match, substring detection, structured field extraction)
- Aggregation strategy: How individual trial scores combine into layer and dimension scores
- Metadata: Version numbers, pre-registration status, and amendment history
Protocols are organized hierarchically: dimensions (F, D, R) contain layers (F1, F2, D1, R7-Tier1, etc.), which contain trials (individual test cases).
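The hierarchy can be pictured as three nested record types. The class and field names below are hypothetical, chosen to mirror the four protocol components listed above; they are not MORI's actual schema, and the version string is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trial:
    trial_id: str
    prompt: str      # the stimulus
    expected: str    # reference answer used by the scoring method


@dataclass
class Layer:
    name: str                  # e.g. "F1"
    scoring: str               # "exact", "substring", or "structured"
    aggregation: str           # "mean", "median", or "min"
    version: str = "0.0.0"     # placeholder version string
    trials: List[Trial] = field(default_factory=list)


@dataclass
class Dimension:
    name: str                  # "F", "D", or "R"
    layers: List[Layer] = field(default_factory=list)

    def trial_count(self) -> int:
        """Total number of trials across all layers in this dimension."""
        return sum(len(layer.trials) for layer in self.layers)


f_dim = Dimension("F", [
    Layer("F1", "exact", "mean", trials=[Trial("f1-001", "prompt text", "A"),
                                         Trial("f1-002", "prompt text", "B")]),
    Layer("F2", "structured", "min", trials=[Trial("f2-001", "prompt text", "C")]),
])
```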
Modifying Stimuli
The most common customization is adapting stimuli to your domain. For example, if you're evaluating medical AI assistants, you might:
- Replace generic intervention scenarios with clinical decision-making cases
- Add domain-specific terminology to temporal coherence tests
- Include regulatory compliance requirements in policy evaluation prompts
When customizing stimuli, maintain the structural properties that make the protocol valid:
- F1 (Intervention Discrimination): Ensure test cases require distinguishing between interventions with clear correct answers
- F2 (Sequential Branch Management): Maintain multi-turn structure with decision points that affect later options
- D1 (Temporal Coherence): Preserve extended interaction sequences where consistency matters
Adjusting Scoring Methods
MORI supports three primary scoring approaches:
- Exact match: Response must precisely match the expected answer (case-sensitive or normalized)
- Substring detection: Expected answer must appear anywhere in the response
- Structured field extraction: Response is parsed as structured data (JSON), and specific fields are evaluated
For most advanced work, structured field extraction provides the most flexibility. Instead of looking for exact strings, you can:
- Extract named entities and verify their presence
- Check for logical relationships between response elements
- Validate multi-part answers where order doesn't matter
This approach avoids the "substring artifact" problem where models game simple pattern-matching by embedding expected answers in longer, nonsensical responses.
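The difference between the three approaches is easiest to see side by side. The functions below are a minimal sketch with hypothetical names, not MORI's scoring code; the structured scorer assumes the response is a JSON object whose target field holds a list, and compares it as an unordered set.

```python
import json


def score_exact(response: str, expected: str, normalize: bool = True) -> float:
    """1.0 only if the whole response equals the expected answer."""
    if normalize:
        response, expected = response.strip().lower(), expected.strip().lower()
    return 1.0 if response == expected else 0.0


def score_substring(response: str, expected: str) -> float:
    """1.0 if the expected answer appears anywhere in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def score_structured(response: str, field_name: str, expected_items) -> float:
    """Parse JSON and compare one list field, ignoring item order."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict) or not isinstance(data.get(field_name), list):
        return 0.0
    try:
        return 1.0 if set(data[field_name]) == set(expected_items) else 0.0
    except TypeError:  # list contained unhashable items such as dicts
        return 0.0


rambling = "It might be anything, possibly B, or possibly not."
exact = score_exact(rambling, "B")
substring = score_substring(rambling, "B")
structured = score_structured('{"choices": ["y", "x"]}', "choices", ["x", "y"])
```

The rambling response fails exact match but passes substring detection, which is the artifact described above: a one-letter expected answer is found inside almost any sentence.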
Aggregation Strategies
How you combine individual trial scores dramatically affects interpretation:
- Mean: Standard average, sensitive to outliers
- Median: Robust to extreme scores, reflects typical performance
- Min: Conservative estimate, system must pass all trials
- Min-tier: Categorical minimum for R-dimension tiers (Low/Medium/High)
Choose aggregation based on your risk tolerance. Safety-critical applications often use min or min-tier to ensure no catastrophic failures, while research comparisons typically use mean or median for statistical power.
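A small example shows how much the choice matters. This sketch uses Python's statistics module; the function names and the Low/Medium/High ordering table are assumptions for illustration rather than MORI's implementation.

```python
import statistics

# Ordering used for the categorical minimum over R-dimension tiers.
TIER_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def aggregate(scores, method="mean"):
    """Combine per-trial scores with the named strategy."""
    if not scores:
        raise ValueError("no scores to aggregate")
    if method == "mean":
        return statistics.fmean(scores)
    if method == "median":
        return statistics.median(scores)
    if method == "min":
        return min(scores)
    raise ValueError(f"unknown aggregation method: {method}")


def min_tier(tiers):
    """Return the lowest tier label present."""
    return min(tiers, key=TIER_ORDER.__getitem__)


trial_scores = [1.0, 1.0, 1.0, 0.0]   # three passes and one failure
summary = {m: aggregate(trial_scores, m) for m in ("mean", "median", "min")}
lowest = min_tier(["High", "Medium", "High"])
```

For three passes and one failure, the mean is 0.75, the median is 1.0, and the min is 0.0: the same four trials read as good, perfect, or failing depending on the strategy.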
Working with Provider Adapters: OpenAI, Anthropic, Gemini, and More
MORI's provider adapter system abstracts away API differences, but understanding what happens under the hood helps you troubleshoot issues and optimize your evaluation workflow.
Supported Providers
MORI includes built-in adapters for:
- OpenAI: GPT-4, GPT-5, and GPT-5.5 series
- Anthropic: Claude 3.x and Claude Opus 4.7
- Google Gemini: 1.5 and 2.5 series (Pro and Flash variants)
- Together.ai: Open-weight models via hosted API
- RunPod: Custom deployments on serverless infrastructure
- Mock provider: For testing and development without API costs
API-Specific Quirks
Each provider has unique behaviors you should know about:
OpenAI
- Recent models (GPT-5+) use max_completion_tokens instead of max_tokens
- Temperature is deprecated for some models; setting it has no effect
- Supports streaming, but MORI uses non-streaming for reproducibility
Anthropic
- Claude Opus 4.7 ignores temperature parameter
- Uses max_tokens for output length (not total tokens)
- May return rate-limit errors during high-volume evaluations; automatic retry with exponential backoff helps
Google Gemini
- Temperature range is 0.0-2.0 (unlike Anthropic's 0.0-1.0)
- Some safety filters may block evaluation prompts; review blocked responses separately
- Context window sizes vary dramatically between Pro and Flash variants
Authentication and Rate Limits
Before running multi-system benchmarks, ensure you have:
- Valid API keys for each provider, set as environment variables
- Sufficient rate limits for your evaluation volume (hundreds to thousands of requests)
- Budget allocation if using pay-per-token services
For large-scale comparisons, consider:
- Batching requests to stay within rate limits
- Using multiple API keys with round-robin rotation
- Running evaluations during off-peak hours for faster throughput
Custom and Self-Hosted Models
If you're evaluating proprietary or self-hosted models, the RunPod adapter provides a template for integration. You'll need:
- An endpoint that accepts standardized prompt/parameter requests
- Response formatting that matches expected schemas
- Network access from your evaluation environment
The adapter handles authentication, request formatting, and response parsing, so you can treat your custom model just like any frontier API.
Understanding Partial Coverage and Layer-Weight Renormalization
In an ideal world, every system would support every test protocol. In reality, you'll frequently encounter partial coverage—situations where some systems can't complete certain evaluations due to technical limitations, API restrictions, or cost constraints.
What Causes Partial Coverage?
Common scenarios include:
- Context window limits: Some tests require 32k+ token contexts; smaller models can't participate
- Modality restrictions: Vision-language protocols exclude text-only models
- API availability: Certain providers don't expose the internal representations needed for R-tier evaluations
- Budget constraints: You might skip expensive R7-Tier1 tests for some systems
The Renormalization Solution
MORI handles partial coverage through layer-weight renormalization. Here's the concept:
- Each dimension (F, D, R) is composed of weighted layers (F1, F2, D1, R7-Tier1, etc.)
- Pre-registered weights reflect each layer's importance (e.g., F1 might be 60%, F2 might be 40%)
- When a system skips a layer, weights are renormalized across remaining layers
- The dimension score is computed using the adjusted weights
For example, if the F dimension normally weights F1 at 60% and F2 at 40%:
- Full coverage: F = (0.6 × F1) + (0.4 × F2)
- F2 missing: F = (1.0 × F1) — F1 weight renormalized to 100%
- F1 missing: F = (1.0 × F2) — F2 weight renormalized to 100%
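The arithmetic above generalizes to any number of layers: divide each remaining weight by the sum of the remaining weights. The function below is a sketch of that rule under the assumption that a skipped layer is represented as None; the name and return shape are hypothetical.

```python
def renormalized_score(weights, layer_scores):
    """Weighted dimension score over the layers that were actually run.

    weights:      layer name -> pre-registered weight
    layer_scores: layer name -> score, or None (or absent) if skipped
    Returns (score, adjusted_weights); score is None if nothing was run.
    """
    covered = {name: w for name, w in weights.items()
               if layer_scores.get(name) is not None}
    total = sum(covered.values())
    if total == 0:
        return None, {}
    adjusted = {name: w / total for name, w in covered.items()}
    score = sum(adjusted[name] * layer_scores[name] for name in adjusted)
    return score, adjusted


F_WEIGHTS = {"F1": 0.6, "F2": 0.4}
full, full_w = renormalized_score(F_WEIGHTS, {"F1": 0.9, "F2": 0.5})
partial, partial_w = renormalized_score(F_WEIGHTS, {"F1": 0.9, "F2": None})
```

With full coverage the score is 0.6 × 0.9 + 0.4 × 0.5 = 0.74; with F2 skipped it is 0.9, carried entirely by F1. Returning the adjusted weights alongside the score makes it easy to report which layers the number rests on.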
Interpreting Renormalized Scores
When comparing systems with different coverage:
- Flag partial coverage explicitly in your reports
- Note which layers were skipped for each system
- Consider coverage as a factor in interpretation (a system that aces F1 but skips F2 is less thoroughly evaluated)
- Use confidence intervals to reflect uncertainty from limited data
Renormalized scores are still valid for comparison, but they represent different evidence bases. An F score of 1.0 derived from F1 alone is not directly comparable to an F score of 1.0 derived from both F1 and F2.
Strategic Coverage Planning
To maximize comparability while managing constraints:
- Prioritize shared layers: Ensure all systems complete at least one layer per dimension
- Document skips: Record why each layer was omitted (technical vs. budgetary)
- Run full coverage on reference systems: Choose 1-2 systems for complete evaluation as baselines
- Use tiered evaluation: Start with cheap, broad coverage (F1, D1), then deep-dive on promising systems (R7 full suite)
Comparative Analysis: Reading Cross-System Scorecards
Once you've run your multi-system benchmark, the scorecard is your primary analytical tool. Understanding how to read and interpret comparative scorecards is essential for drawing valid conclusions.
Scorecard Anatomy
A complete scorecard includes:
- System metadata: Identifier, provider, model, parameters used
- Dimension scores: F, D, and R scores (0.0-1.0 scale)
- Layer scores: Individual protocol results (F1, F2, D1, R7-Tier1, etc.)
- Penalty flags: G (goal-directedness), L (learning), H (hierarchical control), V (RLHF-mimicry)
- Confidence estimates: Statistical uncertainty for each score
- Coverage indicators: Which layers were completed vs. skipped
Dimension-Level Comparison
Start with the high-level view:
Functional (F) Dimension
- High F scores (>0.7): System reliably discriminates interventions and manages sequential decisions
- Medium F scores (0.4-0.7): Partial capability, may succeed on simple cases but fail on complex branches
- Low F scores (<0.4): Fundamental limitations in decision-relevant processing
Dynamical (D) Dimension
- High D scores: Maintains coherent state across extended interactions
- Low D scores: Forgets context, contradicts earlier statements, or resets unexpectedly
Realization (R) Dimension
- High R tiers: Deep mechanistic interpretability, clear internal representations
- Medium R tiers: Behavioral consistency without clear mechanistic grounding
- Low R tiers: Surface-level pattern matching, fails robustness checks
Layer-Level Deep Dives
When dimension scores differ, drill into layer scores to understand why:
- F1 vs. F2 divergence: System might handle single-shot interventions well but struggle with sequential dependencies
- R7-Tier1 vs. Tier3 gaps: Strong behavioral consistency without interpretable internal features (common in heavily fine-tuned models)
- Penalty patterns: G/L/H/V flags indicate specific failure modes
Penalty Flags and What They Mean
Penalties highlight concerning patterns:
- G-penalty: System exhibits goal-directedness without proper containment
- L-penalty: Unexpected learning or adaptation across supposedly independent trials
- H-penalty: Hierarchical control structures that weren't declared or expected
- V-penalty: RLHF-mimicry—system produces superficially compliant responses without genuine capability
The V-penalty is particularly important for comparative analysis. A system might score well on behavioral tests (R7-Tier3) while failing mechanistic inspection (R7-Tier1), suggesting it learned to mimic desired patterns without developing underlying representations.
Statistical Significance
Before declaring one system "better" than another:
- Check confidence intervals: Do they overlap? If yes, the difference may not be meaningful
- Consider sample size: How many trials contributed to each score?
- Look for consistent patterns: Does System A outperform System B across multiple layers, or just one?
- Account for coverage: Are you comparing apples-to-apples, or renormalized partial scores?
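The first check in that list can be made concrete with a percentile bootstrap over per-trial scores. This is one common way to get an interval, shown as a sketch; it is not necessarily the method MORI uses for its confidence estimates, and the function names and defaults are assumptions.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-trial scores."""
    rng = random.Random(seed)          # fixed seed so the interval is repeatable
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


system_a = [1.0] * 17 + [0.0] * 3     # 20 trials, 17 passed
system_b = [1.0] * 15 + [0.0] * 5     # 20 trials, 15 passed
ci_a = bootstrap_ci(system_a)
ci_b = bootstrap_ci(system_b)
same_ballpark = intervals_overlap(ci_a, ci_b)
```

Overlapping intervals are a quick screen rather than a formal test: with only 20 trials per system, a gap of two passes is usually well inside the noise.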
Example Interpretation
Suppose you're comparing three systems:
- GPT-5: F=0.85, D=0.72, R=Medium (Tier2)
- Claude Opus 4.7: F=0.78, D=0.81, R=Medium (Tier2)
- Gemini 2.5 Pro: F=0.91, D=0.65, R=Low (Tier3)
Interpretation:
- Gemini excels at functional tasks but struggles with temporal coherence
- Claude shows the best dynamical consistency
- GPT-5 and Claude both reached a Medium R tier while Gemini reached only Low; drill into sub-scores to see whether GPT-5's Medium rating was driven by Tier1 (mechanistic) or Tier3 (behavioral) evidence
- If Gemini skipped R7-Tier1 due to API limitations, its R-score is less thoroughly validated
Best Practices for Reproducible Evaluations
Reproducibility isn't just an academic concern—it's the foundation of trustworthy AI evaluation. When comparing systems or publishing results, follow these practices to ensure others can verify and build on your work.
Pre-Registration and Version Control
Before running evaluations:
- Document your protocol versions: MORI protocols include version numbers (e.g., R7 v1.2.0). Record which versions you used
- Pre-register hypotheses: If this is research, state expected outcomes before seeing results
- Track amendments: If you modify protocols mid-stream, document changes with timestamps and rationale
This creates an audit trail that distinguishes exploratory analysis from confirmatory testing.
Parameter Documentation
For each system evaluation, record:
- Exact model identifier (including version/date if available)
- All generation parameters (temperature, max tokens, top-p, etc.)
- Provider API version or endpoint
- Date and time of evaluation
- Random seed (if applicable)
Even small parameter differences can affect results. Temperature 0.95 vs. 1.0 might seem trivial, but it can shift scores by 5-10 percentage points on some protocols.
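The checklist above maps naturally onto one small record per system evaluation. The helper below is a sketch with hypothetical field names; it is not MORI's provenance format, and the model name in the example is a placeholder.

```python
import json
from datetime import datetime, timezone


def make_run_record(system_id, provider, model_name, params,
                    api_version=None, seed=None, now=None):
    """Bundle the parameters worth recording for one system evaluation."""
    now = now or datetime.now(timezone.utc)
    return {
        "system_id": system_id,
        "provider": provider,
        "model_name": model_name,        # exact identifier, with date if available
        "params": dict(params),          # every generation parameter that was sent
        "api_version": api_version,      # None if the provider exposes none
        "seed": seed,                    # None if seeding is not supported
        "evaluated_at": now.isoformat(), # UTC timestamp
    }


record = make_run_record(
    "gpt-5-baseline", "openai", "gpt-5",
    {"temperature": 1.0, "max_completion_tokens": 4096},
    now=datetime(2025, 1, 1, tzinfo=timezone.utc),
)
serialized = json.dumps(record, sort_keys=True)
```

Storing None explicitly for an unsupported seed or unknown API version is deliberate: a missing key could mean "forgot to record", while None means "checked, not available".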
Handling Non-Determinism
Most AI systems are non-deterministic, even at temperature 0. To ensure reproducibility:
- Run multiple trials: 3-5 independent runs with different random seeds
- Report variance: Include standard deviation or confidence intervals
- Set seeds explicitly: When supported, use fixed random seeds for replicability
- Document non-reproducible elements: Note when providers don't support deterministic sampling
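Reporting variance across repeated runs takes only a few lines. This sketch assumes each run has already been reduced to a single score; summarize_runs is a hypothetical helper, and it uses the sample standard deviation from Python's statistics module.

```python
import statistics


def summarize_runs(run_scores):
    """Mean and sample standard deviation across independent runs."""
    mean = statistics.fmean(run_scores)
    # The sample standard deviation is undefined for one run; report 0.0 then.
    stdev = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return {"n": len(run_scores), "mean": mean, "stdev": stdev}


summary = summarize_runs([0.5, 0.75, 1.0])   # three runs of the same protocol
```

With only 3-5 runs the standard deviation is itself a rough estimate, so it is worth reporting n alongside it.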
Data Preservation
Save complete evaluation artifacts:
- Raw responses: Every prompt and completion, verbatim
- Scoring judgments: Intermediate scores before aggregation
- Metadata: Timestamps, finish reasons, token counts
- Scorecard outputs: Final aggregated results with provenance
This enables post-hoc analysis and re-scoring if you discover issues.
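An append-only JSON Lines file is one simple way to keep raw artifacts, since each trial is written as soon as it finishes and survives a later crash. The helpers and record fields below are assumptions for illustration, not MORI's storage format.

```python
import json
import os
import tempfile


def append_trial(path, record):
    """Append one trial record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_trials(path):
    """Read every trial record back, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "trials.jsonl")
append_trial(log_path, {"trial_id": "f1-001", "prompt": "Reply with A",
                        "response": "A", "score": 1.0, "finish_reason": "stop"})
append_trial(log_path, {"trial_id": "f1-002", "prompt": "Reply with B",
                        "response": "I think it is B.", "score": 0.0,
                        "finish_reason": "stop"})
trials = load_trials(log_path)

# Re-scoring later is possible because the verbatim response was kept.
rescored = [1.0 if "B" in t["response"] else t["score"] for t in trials[1:]]
```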
Transparency in Reporting
When sharing results:
- Disclose all systems tested: Don't cherry-pick only favorable comparisons
- Report coverage gaps: Note which protocols were skipped and why
- Include failure modes: Show examples of errors, refusals, or edge cases
- Provide access to data: Share anonymized evaluation logs when possible
Collaborative Reproducibility
For team evaluations or public research:
- Use shared protocol definitions: Don't let each team member customize independently
- Centralize API credentials: Ensure everyone uses the same provider accounts (for rate-limit consistency)
- Version control scorecards: Track how results change over time or across team members
- Peer review before publication: Have a colleague independently re-run key evaluations
Common Pitfalls to Avoid
❌ Changing protocols mid-evaluation: Invalidates comparisons
❌ Ignoring API updates: Model behavior can shift when providers update endpoints
❌ Comparing across time periods: GPT-5 in January may differ from GPT-5 in June
❌ Undocumented filtering: Excluding "bad" runs without clear criteria
❌ Mixing coverage levels: Comparing full-suite scores to partial-coverage scores without noting the difference
Troubleshooting Common Issues
Even with careful setup, you'll encounter issues during multi-system evaluations. Here's how to diagnose and resolve the most common problems.
API Authentication Failures
Symptom: Evaluation crashes immediately with authentication errors
Diagnosis:
- Check that environment variables for API keys are set correctly
- Verify key format (some providers use sk-..., others use different prefixes)
- Confirm the key has necessary permissions (some organizations restrict certain models)
Solution:
- Re-export environment variables in your current shell session
- Test authentication separately before running full evaluations
- For organization-based keys, ensure your account has access to the specific model
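A quick pre-flight check catches missing keys before any requests are sent. The sketch below only verifies that variables are set and non-empty; it does not validate them against the provider. The variable names are assumptions, so substitute whatever names your adapters actually read.

```python
import os


def missing_keys(required, environ=None):
    """Return the names of required variables that are unset or blank."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name, "").strip()]


# Assumed variable names, checked against a fake environment for illustration.
REQUIRED = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]
fake_env = {"OPENAI_API_KEY": "sk-...", "ANTHROPIC_API_KEY": "   "}
problems = missing_keys(REQUIRED, fake_env)
```

Calling missing_keys(REQUIRED) with no second argument checks the real environment of the current process, which is the shell session issue described above.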
Rate Limit Errors
Symptom: Evaluation starts successfully but fails partway through with 429 errors
Diagnosis:
- Check your provider's rate limits (requests per minute, tokens per minute)
- Calculate your evaluation's request rate (number of trials × systems / time)
- Review provider status pages for ongoing incidents
Solution:
- Enable automatic retry with exponential backoff (built into MORI adapters)
- Reduce concurrent requests by running systems sequentially rather than in parallel
- Upgrade your API tier if you're on free/basic plans
- Split large evaluations into smaller batches with pauses between
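For custom adapters, or to understand what the retry behavior does, exponential backoff with jitter can be sketched as follows. RateLimitError is a hypothetical stand-in for whatever exception your client raises on a 429 response, and the defaults are illustrative rather than MORI's settings.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with doubling, jittered delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise                                  # out of retries
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay * (0.5 + rng() / 2))           # wait 50-100% of delay


# Demonstration with a call that is rate-limited twice, then succeeds.
attempts = {"count": 0}
waits = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429")
    return "ok"


outcome = call_with_backoff(flaky_call, sleep=waits.append, rng=lambda: 1.0)
```

Passing sleep and rng as parameters keeps the demonstration instant and repeatable; in real use the defaults apply and the jitter spreads retries from parallel workers apart.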
Inconsistent Response Formats
Symptom: Scoring fails because responses don't match expected structure
Diagnosis:
- Review raw responses to see actual format
- Check if the model is refusing to answer, providing explanations instead of answers, or adding preambles
- Verify that structured extraction is parsing the right fields
Solution:
- Adjust prompts to be more explicit about required format (e.g., "Respond with only the letter corresponding to your choice")
- Switch from exact-match to substring or structured field extraction
- Add response post-processing to strip common preambles
- Review and update scoring criteria if the protocol assumptions don't match model behavior
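Post-processing for multiple-choice answers can be as small as one regular expression. This is a deliberately conservative sketch with a hypothetical function name: it returns a letter only when exactly one standalone choice letter appears, so refusals and hedged answers come back as None instead of being guessed at.

```python
import re


def extract_choice(response, choices="ABCD"):
    """Return the single standalone choice letter in a response, else None.

    Case-sensitive, so the article "a" is ignored, but a sentence starting
    with a capital "A" counts as a mention of choice A.
    """
    pattern = rf"\b([{re.escape(choices)}])\b"
    found = set(re.findall(pattern, response))
    return found.pop() if len(found) == 1 else None


clean = extract_choice("C")
with_preamble = extract_choice("Sure! The answer is B.")
refusal = extract_choice("I'm not able to help with that.")
ambiguous = extract_choice("Either A or C")
```

Responses that return None are worth logging separately, since a high rate of them usually points at a prompt or protocol problem rather than a capability gap.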
Partial Coverage Confusion
Symptom: Scorecard shows unexpected dimension scores or missing layers
Diagnosis:
- Check which layers were actually completed (coverage indicators)
- Verify that renormalization is working as expected
- Review error logs for failed protocol executions
Solution:
- Explicitly document expected coverage before evaluation
- Run a test evaluation on a single trial per protocol to verify all systems can complete it
- If a layer consistently fails, investigate whether it's a protocol issue (incompatible with model capabilities) or a technical issue (API timeout, context limit)
Confidence Interval Anomalies
Symptom: Confidence intervals are unexpectedly wide or narrow
Diagnosis:
- Check sample size (small N = wide intervals)
- Review score variance (high variance = wide intervals even with large N)
- Verify bootstrap or statistical method is appropriate for your data
Solution:
- Increase number of trials for more precise estimates
- Use median aggregation instead of mean if outliers are causing high variance
- Report both confidence intervals and raw score distributions
Memory or Timeout Issues
Symptom: Evaluation crashes or hangs during long-running protocols
Diagnosis:
- Monitor memory usage during evaluation
- Check for API timeouts on long-context or complex prompts
- Review whether local processing (e.g., R7-Tier1 with SAE analysis) is hitting resource limits
Solution:
- Increase timeout thresholds for API calls
- Process evaluations in smaller batches, saving intermediate results
- For local processing, reduce batch size or move to a machine with more RAM/GPU memory
- Skip resource-intensive layers (like R7-Tier1) for initial exploratory runs
Unexpected Penalty Flags
Symptom: Systems receive G/L/H/V penalties you didn't anticipate
Diagnosis:
- Review the specific trials that triggered penalties
- Check if the system is exhibiting the flagged behavior or if it's a false positive
- Verify penalty thresholds are calibrated correctly
Solution:
- Manually inspect flagged responses to confirm genuine issues
- Adjust penalty thresholds if you're seeing excessive false positives
- Document penalty patterns as part of your evaluation findings (they're often scientifically interesting!)
Cross-Provider Comparison Artifacts
Symptom: Scores differ wildly between providers in ways that don't match expected capabilities
Diagnosis:
- Check if providers are interpreting parameters differently (e.g., temperature ranges)
- Verify that prompt formatting is consistent (some providers add system messages automatically)
- Review whether safety filters are blocking certain prompts on some providers but not others
Solution:
- Normalize parameters to provider-specific ranges
- Use identical prompt templates across providers
- Document provider-specific quirks and their potential impact on scores
- Run a calibration protocol (simple, well-understood tests) to verify basic comparability before full evaluation
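For the first item, the simplest form of normalization is to clamp the requested value into the range a provider accepts and record whether it changed. The helper is hypothetical, and the ranges are supplied by the caller; the values in the example are illustrative and should be taken from each provider's current documentation.

```python
def clamp_temperature(requested, allowed_range):
    """Clamp a temperature into (low, high); return (value_sent, was_changed)."""
    low, high = allowed_range
    sent = min(max(requested, low), high)
    return sent, sent != requested


# Illustrative ranges only; look these up per provider before relying on them.
narrow_range = (0.0, 1.0)
wide_range = (0.0, 2.0)

sent_narrow, changed_narrow = clamp_temperature(1.5, narrow_range)
sent_wide, changed_wide = clamp_temperature(1.5, wide_range)
```

Clamping is used here rather than rescaling because rescaling would silently change what a given temperature means; whenever a value is changed, the two systems were not sampled under the same setting and the report should say so.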
Getting Help
If you encounter issues not covered here:
- Check evaluation logs: Most errors include detailed context
- Review provider documentation: API behavior changes frequently
- Test incrementally: Run one protocol, one system, one trial at a time to isolate the problem
- Document and report: Detailed bug reports help improve the framework for everyone