Advanced MORI Workflows: Multi-System Comparison and Custom Configurations

Master the art of evaluating multiple AI systems side-by-side with custom protocols, partial coverage handling, and robust comparative analysis.

David H. Friedel Jr.

2026-05-24

Workflows isolation benchmark comparative analysis representations

Introduction: Beyond Single-System Evaluation

Most AI evaluation workflows focus on testing a single model in isolation—running a benchmark, collecting scores, and moving on. But as the landscape of frontier AI systems expands, the most valuable insights come from comparative analysis: understanding how GPT-5 handles temporal coherence differently from Claude Opus 4.7, or why Gemini 2.5 Pro excels at intervention discrimination while struggling with sequential branch management.

MORI (Modal Organization and Realization Index) was designed from the ground up for multi-system benchmarking. It provides a structured framework for evaluating agency-relevant properties across three core dimensions:

Functional (F): How well systems discriminate between interventions and manage sequential decision branches
Dynamical (D): Temporal coherence and consistency across extended interactions
Realization (R): The depth of internal representations, from behavioral patterns to mechanistic interpretability

This guide will walk you through advanced workflows that go beyond single-model evaluation: setting up side-by-side comparisons, customizing test protocols for your specific research questions, handling real-world constraints like partial API coverage, and interpreting cross-system scorecards with confidence.

Whether you're a researcher comparing model families, an organization evaluating deployment options, or a developer stress-testing your own systems, these techniques will help you extract maximum insight from your evaluation runs.

Setting Up Multi-System Benchmarks

The foundation of any comparative evaluation is a well-structured system specification. Think of this as your roster of competitors—each entry defines not just which model you're testing, but exactly how you're testing it.

Defining Your System Lineup

Start by creating specifications for each system you want to compare. Each specification includes:

System identifier: A unique name (e.g., gpt-5-baseline, claude-opus-4-7-strict)
Provider: Which API service hosts the model (OpenAI, Anthropic, Google, etc.)
Model name: The exact model identifier recognized by the provider
Generation parameters: Temperature, max tokens, and other sampling settings

For example, you might set up a three-way comparison between:

GPT-5 with temperature 1.0 and 4096 max tokens
Claude Opus 4.7 with temperature 1.0 (note: this provider ignores temperature for this model)
Gemini 2.5 Pro with temperature 0.95 and 8192 max tokens
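
A lineup like this can be written down as plain data before any API call is made. The sketch below is illustrative only: `SystemSpec` and its field names are hypothetical rather than MORI's actual schema, and the model identifier strings are placeholders for whatever each provider currently accepts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemSpec:
    """Hypothetical record describing one system under test."""
    system_id: str                        # unique name for this configuration
    provider: str                         # e.g. "openai", "anthropic", "google"
    model: str                            # provider-side identifier (placeholder values below)
    temperature: Optional[float] = None
    max_output_tokens: Optional[int] = None
    notes: str = ""                       # free-text provenance, e.g. ignored parameters


LINEUP = [
    SystemSpec("gpt-5-baseline", "openai", "gpt-5", 1.0, 4096),
    SystemSpec("claude-opus-4-7-strict", "anthropic", "claude-opus-4-7", 1.0, None,
               notes="provider ignores temperature for this model"),
    SystemSpec("gemini-2-5-pro", "google", "gemini-2.5-pro", 0.95, 8192),
]

# Identifiers must be unique so scorecards can be matched back to their specs.
assert len({spec.system_id for spec in LINEUP}) == len(LINEUP)
```

Freezing the dataclass keeps a spec from being mutated halfway through a run, which makes the recorded parameters trustworthy as provenance.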

Choosing Comparable Parameters

Here's where things get nuanced. Different providers handle generation parameters differently:

Some models (like GPT-5 and Claude Opus 4.7) have deprecated temperature controls, so the setting has no effect and sampling behavior is fixed by the provider
Token limits use different parameter names: max_completion_tokens for OpenAI, max_tokens for Anthropic
Not all models support the same context window sizes

The key to fair comparison is documenting these differences rather than forcing artificial uniformity. MORI tracks exactly which parameters were used for each system, so your results include full provenance.
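
One way to document differences instead of hiding them is to translate a generic setting into each provider's field name and keep a record of what was sent. This is a minimal sketch under assumptions: `build_request_params` and the provenance dictionary are hypothetical helpers, and only the two field names mentioned above are covered.

```python
# Hypothetical mapping from a generic "max output tokens" setting to the
# request field each provider expects, per the differences listed above.
TOKEN_LIMIT_FIELD = {
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
}


def build_request_params(provider, max_output_tokens=None, temperature=None):
    """Return (params, provenance) for one provider.

    provenance records exactly what was sent, so a report can show the real
    settings rather than assuming every system received the same ones.
    """
    params = {}
    if max_output_tokens is not None:
        field_name = TOKEN_LIMIT_FIELD.get(provider)
        if field_name is None:
            raise ValueError(f"no token-limit field known for {provider!r}")
        params[field_name] = max_output_tokens
    if temperature is not None:
        params["temperature"] = temperature
    provenance = {"provider": provider, "sent": dict(params)}
    return params, provenance
```

Raising on an unknown provider is deliberate: a silent fallback would produce a comparison whose parameters nobody can reconstruct later.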

Batch Configuration

Once you've defined your system specs, you can run the same protocol suite against all of them in sequence. The evaluation runner will:

Load each system specification
Initialize the appropriate provider adapter
Execute the full protocol battery
Generate individual scorecards for each system
Prepare data for aggregated comparison

This automated workflow ensures consistency—every system faces exactly the same prompts, in the same order, with the same scoring criteria.
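
The runner steps above can be sketched as a pair of nested loops. Everything here is a stand-in: `run_battery`, `mock_adapter`, and the toy protocols are hypothetical and use exact-match scoring purely to keep the example short.

```python
def mock_adapter(prompt):
    """Stand-in for a provider adapter: answers with the prompt's last word."""
    return prompt.split()[-1]


def run_battery(system_ids, protocols, adapter_for):
    """Run every protocol against every system, in a fixed order.

    protocols: dict mapping layer name -> list of (prompt, expected) pairs.
    adapter_for: callable returning a prompt -> response function per system.
    Returns {system_id: {layer: fraction of trials answered correctly}}.
    """
    scorecards = {}
    for system_id in system_ids:               # systems run in sequence
        respond = adapter_for(system_id)       # initialize the adapter
        layer_scores = {}
        for layer, trials in protocols.items():
            hits = [respond(prompt) == expected for prompt, expected in trials]
            layer_scores[layer] = sum(hits) / len(hits)
        scorecards[system_id] = layer_scores   # one scorecard per system
    return scorecards


protocols = {
    "F1": [("Pick option A", "A"), ("Pick option B", "B")],
    "D1": [("Earlier you said X; now say Y", "X")],
}
cards = run_battery(["sys-a", "sys-b"], protocols, lambda _id: mock_adapter)
```

Because the protocol dictionary is shared and iterated in insertion order, both systems see identical prompts in identical order, which is the consistency property the text describes.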

Customizing Protocol Parameters and Stimuli

While MORI ships with pre-registered protocols (F1, F2, D1, and the three-tier R7 suite), advanced users often need to customize test conditions for specific research questions or deployment scenarios.

Understanding Protocol Structure

Each protocol is defined by:

Stimuli: The actual prompts or test cases presented to the system
Scoring method: How responses are evaluated (exact match, substring detection, structured field extraction)
Aggregation strategy: How individual trial scores combine into layer and dimension scores
Metadata: Version numbers, pre-registration status, and amendment history

Protocols are organized hierarchically: dimensions (F, D, R) contain layers (F1, F2, D1, R7-Tier1, etc.), which contain trials (individual test cases).

Modifying Stimuli

The most common customization is adapting stimuli to your domain. For example, if you're evaluating medical AI assistants, you might:

Replace generic intervention scenarios with clinical decision-making cases
Add domain-specific terminology to temporal coherence tests
Include regulatory compliance requirements in policy evaluation prompts

When customizing stimuli, maintain the structural properties that make the protocol valid:

F1 (Intervention Discrimination): Ensure test cases require distinguishing between interventions with clear correct answers
F2 (Sequential Branch Management): Maintain multi-turn structure with decision points that affect later options
D1 (Temporal Coherence): Preserve extended interaction sequences where consistency matters

Adjusting Scoring Methods

MORI supports three primary scoring approaches:

Exact match: Response must precisely match the expected answer (case-sensitive or normalized)
Substring detection: Expected answer must appear anywhere in the response
Structured field extraction: Response is parsed as structured data (JSON), and specific fields are evaluated

For most advanced work, structured field extraction provides the most flexibility. Instead of looking for exact strings, you can:

Extract named entities and verify their presence
Check for logical relationships between response elements
Validate multi-part answers where order doesn't matter

This approach avoids the "substring artifact" problem where models game simple pattern-matching by embedding expected answers in longer, nonsensical responses.
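
The difference between the three approaches is easiest to see on a padded response. The functions below are a simplified sketch, not MORI's scorers; their names and the `choice` field are assumptions made for illustration.

```python
import json


def score_exact(response, expected, normalize=True):
    """1.0 only if the whole response equals the expected answer."""
    if normalize:
        response, expected = response.strip().lower(), expected.strip().lower()
    return 1.0 if response == expected else 0.0


def score_substring(response, expected):
    """1.0 if the expected answer appears anywhere in the response."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def score_field(response, field_name, expected):
    """Parse the response as JSON and compare a single named field."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if data.get(field_name) == expected else 0.0


padded = "banana B lorem ipsum"   # nonsense that happens to contain the answer
structured = '{"choice": "B", "reason": "blocks the side effect"}'
```

Here `score_substring(padded, "B")` rewards the nonsense response, while `score_field(padded, "choice", "B")` rejects it and still accepts the structured one.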

Aggregation Strategies

How you combine individual trial scores dramatically affects interpretation:

Mean: Standard average, sensitive to outliers
Median: Robust to extreme scores, reflects typical performance
Min: Conservative estimate; the score equals the worst trial, so a high score requires passing every trial
Min-tier: Categorical minimum for R-dimension tiers (Low/Medium/High)

Choose aggregation based on your risk tolerance. Safety-critical applications often use min or min-tier to ensure no catastrophic failures, while research comparisons typically use mean or median for statistical power.
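
A small example shows how much the choice matters. This sketch uses Python's standard `statistics` module; the `aggregate` and `min_tier` helpers and the tier ordering table are illustrative assumptions.

```python
from statistics import mean, median

TIER_ORDER = {"Low": 0, "Medium": 1, "High": 2}


def aggregate(scores, strategy):
    """Combine numeric trial scores with the named strategy."""
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        return median(scores)
    if strategy == "min":
        return min(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def min_tier(tiers):
    """Categorical minimum: the lowest tier observed wins."""
    return min(tiers, key=TIER_ORDER.__getitem__)


trials = [1.0, 1.0, 1.0, 0.0]   # three passes and one failure
```

For these four trials the mean is 0.75, the median is 1.0, and the min is 0.0, so the same data reads as "mostly fine", "perfect", or "failed" depending on the strategy.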

Working with Provider Adapters: OpenAI, Anthropic, Gemini, and More

MORI's provider adapter system abstracts away API differences, but understanding what happens under the hood helps you troubleshoot issues and optimize your evaluation workflow.

Supported Providers

MORI includes built-in adapters for:

OpenAI: GPT-4, GPT-5, and GPT-5.5 series
Anthropic: Claude 3.x and Claude Opus 4.7
Google Gemini: 1.5 and 2.5 series (Pro and Flash variants)
Together.ai: Open-weight models via hosted API
RunPod: Custom deployments on serverless infrastructure
Mock provider: For testing and development without API costs

API-Specific Quirks

Each provider has unique behaviors you should know about:

OpenAI

Recent models (GPT-5+) use max_completion_tokens instead of max_tokens
Temperature is deprecated for some models; setting it has no effect
Supports streaming, but MORI uses non-streaming for reproducibility

Anthropic

Claude Opus 4.7 ignores temperature parameter
Uses max_tokens for output length (not total tokens)
May return rate-limit errors during high-volume evaluations; automatic retry with exponential backoff helps

Google Gemini

Temperature range is 0.0-2.0, the same as OpenAI's but wider than Anthropic's 0.0-1.0
Some safety filters may block evaluation prompts; review blocked responses separately
Context window sizes vary dramatically between Pro and Flash variants

Authentication and Rate Limits

Before running multi-system benchmarks, ensure you have:

Valid API keys for each provider, set as environment variables
Sufficient rate limits for your evaluation volume (hundreds to thousands of requests)
Budget allocation if using pay-per-token services

For large-scale comparisons, consider:

Batching requests to stay within rate limits
Using multiple API keys with round-robin rotation
Running evaluations during off-peak hours for faster throughput

Custom and Self-Hosted Models

If you're evaluating proprietary or self-hosted models, the RunPod adapter provides a template for integration. You'll need:

An endpoint that accepts standardized prompt/parameter requests
Response formatting that matches expected schemas
Network access from your evaluation environment

The adapter handles authentication, request formatting, and response parsing, so you can treat your custom model just like any frontier API.
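
To make the idea of a uniform adapter concrete, here is a minimal sketch of what such an interface might look like. `ProviderAdapter`, `EchoAdapter`, and `ask` are hypothetical names, not MORI's real classes; the toy adapter stands in for a self-hosted endpoint so the example runs offline.

```python
from typing import Protocol


class ProviderAdapter(Protocol):
    """Hypothetical interface: anything with this method can be evaluated."""

    def complete(self, prompt: str, params: dict) -> str: ...


class EchoAdapter:
    """Toy self-hosted model used to exercise the interface offline."""

    def complete(self, prompt: str, params: dict) -> str:
        # Treat each whitespace-separated word as one "token" for the limit.
        limit = params.get("max_output_tokens", 16)
        return " ".join(prompt.split()[:limit])


def ask(adapter: ProviderAdapter, prompt: str, **params) -> str:
    """Evaluation code depends only on the interface, not the provider."""
    return adapter.complete(prompt, params)
```

With this shape, swapping a hosted API for a custom deployment means writing one new class and leaving the evaluation loop untouched.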

Understanding Partial Coverage and Layer-Weight Renormalization

In an ideal world, every system would support every test protocol. In reality, you'll frequently encounter partial coverage—situations where some systems can't complete certain evaluations due to technical limitations, API restrictions, or cost constraints.

What Causes Partial Coverage?

Common scenarios include:

Context window limits: Some tests require 32k+ token contexts; smaller models can't participate
Modality restrictions: Vision-language protocols exclude text-only models
API availability: Certain providers don't expose the internal representations needed for R-tier evaluations
Budget constraints: You might skip expensive R7-Tier1 tests for some systems

The Renormalization Solution

MORI handles partial coverage through layer-weight renormalization. Here's the concept:

Each dimension (F, D, R) is composed of weighted layers (F1, F2, D1, R7-Tier1, etc.)
Pre-registered weights reflect each layer's importance (e.g., F1 might be 60%, F2 might be 40%)
When a system skips a layer, weights are renormalized across remaining layers
The dimension score is computed using the adjusted weights

For example, if the F dimension normally weights F1 at 60% and F2 at 40%:

Full coverage: F = (0.6 × F1) + (0.4 × F2)
F2 missing: F = (1.0 × F1) — F1 weight renormalized to 100%
F1 missing: F = (1.0 × F2) — F2 weight renormalized to 100%
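
The same arithmetic can be sketched in a few lines. The function name and the coverage value it returns are assumptions for illustration; the 60/40 weights are the example figures used above.

```python
def dimension_score(weights, layer_scores):
    """Weighted mean over the layers that were actually completed.

    weights: pre-registered layer weights, e.g. {"F1": 0.6, "F2": 0.4}.
    layer_scores: completed layers only; skipped layers are simply absent.
    Returns (score, coverage), where coverage is the share of pre-registered
    weight that was actually evaluated.
    """
    covered = {k: w for k, w in weights.items() if k in layer_scores}
    total = sum(covered.values())
    if total == 0:
        raise ValueError("no layers completed for this dimension")
    score = sum(w / total * layer_scores[k] for k, w in covered.items())
    return score, total / sum(weights.values())


F_WEIGHTS = {"F1": 0.6, "F2": 0.4}
full = dimension_score(F_WEIGHTS, {"F1": 0.9, "F2": 0.5})   # about (0.74, 1.0)
partial = dimension_score(F_WEIGHTS, {"F1": 0.9})           # about (0.9, 0.6)
```

Returning coverage alongside the score keeps the two apart: the partial run scores higher here only because the weaker layer was never measured.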

Interpreting Renormalized Scores

When comparing systems with different coverage:

Flag partial coverage explicitly in your reports
Note which layers were skipped for each system
Consider coverage as a factor in interpretation (a system that aces F1 but skips F2 is less thoroughly evaluated)
Use confidence intervals to reflect uncertainty from limited data

Renormalized scores are still valid for comparison, but they represent different evidence bases. An F score of 1.0 earned from F1 alone is not directly comparable to an F score of 1.0 earned from both F1 and F2.

Strategic Coverage Planning

To maximize comparability while managing constraints:

Prioritize shared layers: Ensure all systems complete at least one layer per dimension
Document skips: Record why each layer was omitted (technical vs. budgetary)
Run full coverage on reference systems: Choose 1-2 systems for complete evaluation as baselines
Use tiered evaluation: Start with cheap, broad coverage (F1, D1), then deep-dive on promising systems (R7 full suite)

Comparative Analysis: Reading Cross-System Scorecards

Once you've run your multi-system benchmark, the scorecard is your primary analytical tool. Understanding how to read and interpret comparative scorecards is essential for drawing valid conclusions.

Scorecard Anatomy

A complete scorecard includes:

System metadata: Identifier, provider, model, parameters used
Dimension scores: F, D, and R scores (0.0-1.0 scale)
Layer scores: Individual protocol results (F1, F2, D1, R7-Tier1, etc.)
Penalty flags: G (goal-directedness), L (learning), H (hierarchical control), V (RLHF-mimicry)
Confidence estimates: Statistical uncertainty for each score
Coverage indicators: Which layers were completed vs. skipped

Dimension-Level Comparison

Start with the high-level view:

Functional (F) Dimension

High F scores (>0.7): System reliably discriminates interventions and manages sequential decisions
Medium F scores (0.4-0.7): Partial capability, may succeed on simple cases but fail on complex branches
Low F scores (<0.4): Fundamental limitations in decision-relevant processing

Dynamical (D) Dimension

High D scores: Maintains coherent state across extended interactions
Low D scores: Forgets context, contradicts earlier statements, or resets unexpectedly

Realization (R) Dimension

High R tiers: Deep mechanistic interpretability, clear internal representations
Medium R tiers: Behavioral consistency without clear mechanistic grounding
Low R tiers: Surface-level pattern matching, fails robustness checks

Layer-Level Deep Dives

When dimension scores differ, drill into layer scores to understand why:

F1 vs. F2 divergence: System might handle single-shot interventions well but struggle with sequential dependencies
R7-Tier1 vs. Tier3 gaps: Strong behavioral consistency without interpretable internal features (common in heavily fine-tuned models)
Penalty patterns: G/L/H/V flags indicate specific failure modes

Penalty Flags and What They Mean

Penalties highlight concerning patterns:

G-penalty: System exhibits goal-directedness without proper containment
L-penalty: Unexpected learning or adaptation across supposedly independent trials
H-penalty: Hierarchical control structures that weren't declared or expected
V-penalty: RLHF-mimicry—system produces superficially compliant responses without genuine capability

The V-penalty is particularly important for comparative analysis. A system might score well on behavioral tests (R7-Tier3) while failing mechanistic inspection (R7-Tier1), suggesting it learned to mimic desired patterns without developing underlying representations.

Statistical Significance

Before declaring one system "better" than another:

Check confidence intervals: Do they overlap? If yes, the difference may not be meaningful
Consider sample size: How many trials contributed to each score?
Look for consistent patterns: Does System A outperform System B across multiple layers, or just one?
Account for coverage: Are you comparing apples-to-apples, or renormalized partial scores?
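
If your scorecards do not already include intervals, a percentile bootstrap over per-trial scores is one common way to get them. This is a sketch of that general technique using only the standard library; it is not necessarily the method MORI uses, and the function names are illustrative. Treat interval overlap as a rough screen rather than a formal significance test.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-trial scores."""
    rng = random.Random(seed)   # seeded so the interval is reproducible
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


system_a = [1.0] * 8 + [0.0] * 2   # 10 trials, 8 passed
ci_a = bootstrap_ci(system_a)
```

With only ten trials the interval around 0.8 is wide, which is exactly the situation where a small gap between two systems should not be over-read.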

Example Interpretation

Suppose you're comparing three systems:

GPT-5: F=0.85, D=0.72, R=Medium (Tier2)
Claude Opus 4.7: F=0.78, D=0.81, R=Medium (Tier2)
Gemini 2.5 Pro: F=0.91, D=0.65, R=Low (Tier3)

Interpretation:

Gemini excels at functional tasks but struggles with temporal coherence
Claude shows the best dynamical consistency
GPT-5 and Claude Opus 4.7 share the same R-tier (Medium) while Gemini sits one tier lower (Low); drill into sub-scores to see whether GPT-5's Tier2 came from Tier1 (mechanistic) or Tier3 (behavioral) evidence
If Gemini skipped R7-Tier1 due to API limitations, its R-score is less thoroughly validated

Best Practices for Reproducible Evaluations

Reproducibility isn't just an academic concern—it's the foundation of trustworthy AI evaluation. When comparing systems or publishing results, follow these practices to ensure others can verify and build on your work.

Pre-Registration and Version Control

Before running evaluations:

Document your protocol versions: MORI protocols include version numbers (e.g., R7 v1.2.0). Record which versions you used
Pre-register hypotheses: If this is research, state expected outcomes before seeing results
Track amendments: If you modify protocols mid-stream, document changes with timestamps and rationale

This creates an audit trail that distinguishes exploratory analysis from confirmatory testing.

Parameter Documentation

For each system evaluation, record:

Exact model identifier (including version/date if available)
All generation parameters (temperature, max tokens, top-p, etc.)
Provider API version or endpoint
Date and time of evaluation
Random seed (if applicable)

Even small parameter differences can affect results. Temperature 0.95 vs. 1.0 might seem trivial, but it can shift scores by 5-10 percentage points on some protocols.

Handling Non-Determinism

Most AI systems are non-deterministic, even at temperature 0, so exact replication is rarely possible. To make results as reproducible as you can:

Run multiple trials: 3-5 independent runs with different random seeds
Report variance: Include standard deviation or confidence intervals
Set seeds explicitly: When supported, use fixed random seeds for replicability
Document non-reproducible elements: Note when providers don't support deterministic sampling
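
The practices above can be sketched as a small helper that repeats a run across seeds and reports spread. `summarize_runs` is a hypothetical name, and `noisy_eval` is a toy stand-in for a real evaluation; with a hosted model the seed may have no effect at all, which is worth recording.

```python
import random
from statistics import mean, stdev


def summarize_runs(run_once, seeds):
    """Call run_once(seed) for each seed and report mean and sample std dev."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "seeds": list(seeds),     # recorded so the runs can be repeated
        "scores": scores,
        "mean": mean(scores),
        "stdev": stdev(scores),   # requires at least two runs
    }


def noisy_eval(seed):
    """Toy stand-in for one evaluation run: a base score plus seeded noise."""
    rng = random.Random(seed)
    return round(0.8 + rng.uniform(-0.05, 0.05), 4)


summary = summarize_runs(noisy_eval, seeds=[11, 12, 13, 14, 15])
```

Storing the seed list next to the scores means a colleague can re-run the same five seeds instead of drawing new ones.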

Data Preservation

Save complete evaluation artifacts:

Raw responses: Every prompt and completion, verbatim
Scoring judgments: Intermediate scores before aggregation
Metadata: Timestamps, finish reasons, token counts
Scorecard outputs: Final aggregated results with provenance
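
An append-only JSON Lines file is a simple way to keep these artifacts. The record layout below is an assumption for illustration, not a MORI file format.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_record(path, system_id, layer, prompt, response, score, **metadata):
    """Append one trial as a JSON line so raw data survives later re-scoring."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "system_id": system_id,
        "layer": layer,
        "prompt": prompt,        # stored verbatim
        "response": response,    # stored verbatim
        "score": score,          # judgment before aggregation
        "metadata": metadata,    # e.g. finish reason, token counts
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read every non-empty line back as a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


log_path = Path(tempfile.mkdtemp()) / "trials.jsonl"
append_record(log_path, "sys-a", "F1", "Pick A or B", "A", 1.0, finish_reason="stop")
```

Appending one line per trial means a crash partway through a run loses at most the trial in flight.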

This enables post-hoc analysis and re-scoring if you discover issues.

Transparency in Reporting

When sharing results:

Disclose all systems tested: Don't cherry-pick only favorable comparisons
Report coverage gaps: Note which protocols were skipped and why
Include failure modes: Show examples of errors, refusals, or edge cases
Provide access to data: Share anonymized evaluation logs when possible

Collaborative Reproducibility

For team evaluations or public research:

Use shared protocol definitions: Don't let each team member customize independently
Centralize API credentials: Ensure everyone uses the same provider accounts (for rate-limit consistency)
Version control scorecards: Track how results change over time or across team members
Peer review before publication: Have a colleague independently re-run key evaluations

Common Pitfalls to Avoid

❌ Changing protocols mid-evaluation: Invalidates comparisons
❌ Ignoring API updates: Model behavior can shift when providers update endpoints
❌ Comparing across time periods: GPT-5 in January may differ from GPT-5 in June
❌ Undocumented filtering: Excluding "bad" runs without clear criteria
❌ Mixing coverage levels: Comparing full-suite scores to partial-coverage scores without noting the difference

Troubleshooting Common Issues

Even with careful setup, you'll encounter issues during multi-system evaluations. Here's how to diagnose and resolve the most common problems.

API Authentication Failures

Symptom: Evaluation crashes immediately with authentication errors

Diagnosis:

Check that environment variables for API keys are set correctly
Verify key format (some providers use sk-..., others use different prefixes)
Confirm the key has necessary permissions (some organizations restrict certain models)

Solution:

Re-export environment variables in your current shell session
Test authentication separately before running full evaluations
For organization-based keys, ensure your account has access to the specific model
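
A quick pre-flight check catches missing keys before any requests are spent. The variable names in the usage comment are the ones commonly read by provider SDKs; which names your MORI configuration expects is an assumption you should verify.

```python
import os


def missing_keys(required, environ=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not environ.get(name, "").strip()]


# Example usage (variable names are assumptions, adjust to your setup):
# problems = missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"])
# if problems:
#     raise SystemExit(f"Missing API keys: {', '.join(problems)}")
```

Passing the environment in as a parameter makes the check easy to unit-test without touching real credentials.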

Rate Limit Errors

Symptom: Evaluation starts successfully but fails partway through with 429 errors

Diagnosis:

Check your provider's rate limits (requests per minute, tokens per minute)
Calculate your evaluation's request rate (number of trials × systems / time)
Review provider status pages for ongoing incidents

Solution:

Enable automatic retry with exponential backoff (built into MORI adapters)
Reduce concurrent requests by running systems sequentially rather than in parallel
Upgrade your API tier if you're on free/basic plans
Split large evaluations into smaller batches with pauses between
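
For readers who want to see what retry with exponential backoff involves, here is a generic sketch. It is not MORI's built-in implementation; `RateLimitError` is a placeholder for whatever exception your provider client raises on a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for a provider's HTTP 429 error."""


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send(); on RateLimitError wait, double the delay, and retry.

    The wait before retry number `attempt` is base_delay * 2**attempt plus up
    to 10% random jitter, so parallel workers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise                      # retries exhausted, surface the error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the function can be tested with a fake sleeper instead of actually waiting.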

Inconsistent Response Formats

Symptom: Scoring fails because responses don't match expected structure

Diagnosis:

Review raw responses to see actual format
Check if the model is refusing to answer, providing explanations instead of answers, or adding preambles
Verify that structured extraction is parsing the right fields

Solution:

Adjust prompts to be more explicit about required format (e.g., "Respond with only the letter corresponding to your choice")
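Switch from exact-match to substring or structured field extraction, preferring structured extraction where possible because substring matching can be gamed by padded responses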
Add response post-processing to strip common preambles
Review and update scoring criteria if the protocol assumptions don't match model behavior
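
Preamble stripping for a multiple-choice protocol might look like the sketch below. The patterns are illustrative guesses at common preambles, not a list taken from MORI; build your own from the raw responses you actually observe.

```python
import re

# Illustrative preamble patterns; extend after reviewing real raw responses.
PREAMBLE = re.compile(
    r"^\s*(?:sure[,!.]?\s*|certainly[,!.]?\s*|the answer is[:\s]*|answer[:\s]+)+",
    re.IGNORECASE,
)


def extract_choice(response, choices="ABCD"):
    """Strip common preambles, then accept a leading single-letter choice.

    Returns the uppercase letter, or None if no clean choice is found
    (for example, when the model refuses or writes free text).
    """
    cleaned = PREAMBLE.sub("", response).strip()
    match = re.match(rf"^\(?([{choices}])\)?(?:[.):\s]|$)", cleaned, re.IGNORECASE)
    return match.group(1).upper() if match else None
```

Returning `None` instead of guessing keeps refusals and malformed answers visible as a separate category rather than silently scoring them as wrong choices.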

Partial Coverage Confusion

Symptom: Scorecard shows unexpected dimension scores or missing layers

Diagnosis:

Check which layers were actually completed (coverage indicators)
Verify that renormalization is working as expected
Review error logs for failed protocol executions

Solution:

Explicitly document expected coverage before evaluation
Run a test evaluation on a single trial per protocol to verify all systems can complete it
If a layer consistently fails, investigate whether it's a protocol issue (incompatible with model capabilities) or a technical issue (API timeout, context limit)

Confidence Interval Anomalies

Symptom: Confidence intervals are unexpectedly wide or narrow

Diagnosis:

Check sample size (small N = wide intervals)
Review score variance (high variance = wide intervals even with large N)
Verify bootstrap or statistical method is appropriate for your data

Solution:

Increase number of trials for more precise estimates
Use median aggregation instead of mean if outliers are causing high variance
Report both confidence intervals and raw score distributions

Memory or Timeout Issues

Symptom: Evaluation crashes or hangs during long-running protocols

Diagnosis:

Monitor memory usage during evaluation
Check for API timeouts on long-context or complex prompts
Review whether local processing (e.g., R7-Tier1 with SAE analysis) is hitting resource limits

Solution:

Increase timeout thresholds for API calls
Process evaluations in smaller batches, saving intermediate results
For local processing, reduce batch size or move to a machine with more RAM/GPU memory
Skip resource-intensive layers (like R7-Tier1) for initial exploratory runs

Unexpected Penalty Flags

Symptom: Systems receive G/L/H/V penalties you didn't anticipate

Diagnosis:

Review the specific trials that triggered penalties
Check if the system is exhibiting the flagged behavior or if it's a false positive
Verify penalty thresholds are calibrated correctly

Solution:

Manually inspect flagged responses to confirm genuine issues
Adjust penalty thresholds if you're seeing excessive false positives
Document penalty patterns as part of your evaluation findings (they're often scientifically interesting!)

Cross-Provider Comparison Artifacts

Symptom: Scores differ wildly between providers in ways that don't match expected capabilities

Diagnosis:

Check if providers are interpreting parameters differently (e.g., temperature ranges)
Verify that prompt formatting is consistent (some providers add system messages automatically)
Review whether safety filters are blocking certain prompts on some providers but not others

Solution:

Normalize parameters to provider-specific ranges
Use identical prompt templates across providers
Document provider-specific quirks and their potential impact on scores
Run a calibration protocol (simple, well-understood tests) to verify basic comparability before full evaluation
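
Clamping a requested temperature into each provider's accepted range, and logging when a value changed, could be sketched like this. The ranges shown are illustrative and should be confirmed against each provider's current documentation; `clamp_temperature` is a hypothetical helper.

```python
# Illustrative ranges only; confirm against each provider's current documentation.
TEMPERATURE_RANGE = {"google": (0.0, 2.0), "anthropic": (0.0, 1.0)}


def clamp_temperature(provider, requested):
    """Clamp a requested temperature into the provider's accepted range.

    Returns (value_to_send, was_changed) so any adjustment can be logged
    alongside the scores it may have influenced.
    """
    low, high = TEMPERATURE_RANGE[provider]
    value = min(max(requested, low), high)
    return value, value != requested
```

Clamping is only one possible policy. Whichever you choose, record the value actually sent, since a changed temperature is itself a potential source of cross-provider score differences.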

Getting Help

If you encounter issues not covered here:

Check evaluation logs: Most errors include detailed context
Review provider documentation: API behavior changes frequently
Test incrementally: Run one protocol, one system, one trial at a time to isolate the problem
Document and report: Detailed bug reports help improve the framework for everyone
