Advanced MORI Workflows: Multi-System Comparison and Custom Configurations

Master the art of evaluating multiple AI systems side-by-side with custom protocols, partial coverage handling, and robust comparative analysis.

David H. Friedel Jr. · 2026-05-24 · Workflows isolation benchmark comparative analysis representations

Introduction: Beyond Single-System Evaluation

Most AI evaluation workflows focus on testing a single model in isolation—running a benchmark, collecting scores, and moving on. But as the landscape of frontier AI systems expands, the most valuable insights come from comparative analysis: understanding how GPT-5 handles temporal coherence differently from Claude Opus 4.7, or why Gemini 2.5 Pro excels at intervention discrimination while struggling with sequential branch management.

MORI (Modal Organization and Realization Index) was designed from the ground up for multi-system benchmarking. It provides a structured framework for evaluating agency-relevant properties across three core dimensions:

  • Functional (F): How well systems discriminate between interventions and manage sequential decision branches
  • Dynamical (D): Temporal coherence and consistency across extended interactions
  • Realization (R): The depth of internal representations, from behavioral patterns to mechanistic interpretability

This guide will walk you through advanced workflows that go beyond single-model evaluation: setting up side-by-side comparisons, customizing test protocols for your specific research questions, handling real-world constraints like partial API coverage, and interpreting cross-system scorecards with confidence.

Whether you're a researcher comparing model families, an organization evaluating deployment options, or a developer stress-testing your own systems, these techniques will help you extract maximum insight from your evaluation runs.

Setting Up Multi-System Benchmarks

The foundation of any comparative evaluation is a well-structured system specification. Think of this as your roster of competitors—each entry defines not just which model you're testing, but exactly how you're testing it.

Defining Your System Lineup

Start by creating specifications for each system you want to compare. Each specification includes:

  • System identifier: A unique name (e.g., gpt-5-baseline, claude-opus-4-7-strict)
  • Provider: Which API service hosts the model (OpenAI, Anthropic, Google, etc.)
  • Model name: The exact model identifier recognized by the provider
  • Generation parameters: Temperature, max tokens, and other sampling settings

For example, you might set up a three-way comparison between:

  1. GPT-5 with temperature 1.0 and 4096 max tokens
  2. Claude Opus 4.7 with temperature 1.0 (note: this provider ignores temperature for this model)
  3. Gemini 2.5 Pro with temperature 0.95 and 8192 max tokens
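
A lineup like this can be written down as plain data. The sketch below is illustrative only: `SystemSpec`, its field names, and `check_unique_ids` are hypothetical and not taken from MORI's actual configuration format, and the model name strings are placeholders rather than confirmed provider identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SystemSpec:
    """Hypothetical container for one entry in the comparison lineup."""
    system_id: str                       # unique name used in reports
    provider: str                        # e.g. "openai", "anthropic", "google"
    model_name: str                      # placeholder; use the provider's exact identifier
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    notes: str = ""                      # provenance remarks, e.g. ignored parameters


LINEUP = [
    SystemSpec("gpt-5-baseline", "openai", "gpt-5",
               temperature=1.0, max_tokens=4096),
    SystemSpec("claude-opus-4-7-strict", "anthropic", "claude-opus-4-7",
               temperature=1.0,
               notes="provider ignores temperature for this model"),
    SystemSpec("gemini-2-5-pro", "google", "gemini-2.5-pro",
               temperature=0.95, max_tokens=8192),
]


def check_unique_ids(specs):
    """Raise ValueError if two specs share a system identifier."""
    seen = set()
    for spec in specs:
        if spec.system_id in seen:
            raise ValueError(f"duplicate system_id: {spec.system_id}")
        seen.add(spec.system_id)


check_unique_ids(LINEUP)
```

Keeping a free-text `notes` field next to the parameters is one simple way to carry caveats, such as an ignored temperature, into the final report.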

Choosing Comparable Parameters

Here's where things get nuanced. Different providers handle generation parameters differently:

  • Some models (like GPT-5 and Claude Opus 4.7) no longer honor a user-supplied temperature; the provider fixes the sampling settings, so the value you pass may be ignored
  • Token limits use different parameter names: max_completion_tokens for OpenAI, max_tokens for Anthropic
  • Not all models support the same context window sizes

The key to fair comparison is documenting these differences rather than forcing artificial uniformity. MORI tracks exactly which parameters were used for each system, so your results include full provenance.
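
As a rough illustration of documenting differences instead of hiding them, the sketch below translates generic settings into the two token-limit parameter names mentioned above and records what was requested versus what was actually sent. `build_request_params` and the `temperature_ignored` flag are hypothetical helpers, not part of MORI; only the two parameter names come from this article.

```python
# Token-limit parameter names as described in this article.
TOKEN_LIMIT_KEY = {
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
}


def build_request_params(provider, temperature=None, max_output_tokens=None,
                         temperature_ignored=False):
    """Return (params, provenance) for one provider.

    params is what would be sent to the API; provenance records the
    requested values and anything that was dropped along the way.
    """
    if provider not in TOKEN_LIMIT_KEY:
        raise ValueError(f"no parameter mapping known for provider: {provider}")

    params = {}
    dropped = []
    if max_output_tokens is not None:
        params[TOKEN_LIMIT_KEY[provider]] = max_output_tokens
    if temperature is not None:
        if temperature_ignored:
            dropped.append("temperature")  # keep a trace instead of silently losing it
        else:
            params["temperature"] = temperature

    provenance = {
        "requested": {"temperature": temperature,
                      "max_output_tokens": max_output_tokens},
        "sent": dict(params),
        "dropped": dropped,
    }
    return params, provenance


openai_params, openai_prov = build_request_params("openai", 1.0, 4096)
claude_params, claude_prov = build_request_params(
    "anthropic", 1.0, 4096, temperature_ignored=True)
```

Storing the provenance record alongside each scorecard makes it possible to see later that two systems were run under different effective settings.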

Batch Configuration

Once you've defined your system specs, you can run the same protocol suite against all of them in sequence. The evaluation runner will:

  1. Load each system specification
  2. Initialize the appropriate provider adapter
  3. Execute the full protocol battery
  4. Generate individual scorecards for each system
  5. Prepare data for aggregated comparison

This automated workflow ensures consistency—every system faces exactly the same prompts, in the same order, with the same scoring criteria.
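
The core of such a runner is a nested loop over systems and protocols. The following is a minimal sketch under simplifying assumptions, not MORI's actual runner: each system is modeled as a plain callable, scoring is exact match, and `run_suite`, `mock_provider`, and the tiny protocol dictionary are all invented for illustration.

```python
def mock_provider(prompt):
    """Stand-in for a real provider adapter: always answers 'A'."""
    return "A"


def echo_last_char(prompt):
    """Second stand-in system: answers with the last character of the prompt."""
    return prompt[-1]


def run_suite(systems, protocols):
    """Run every protocol against every system in a fixed order.

    systems:   dict of system_id -> callable(prompt) -> response string
    protocols: dict of layer name -> list of (prompt, expected) trials
    returns:   dict of system_id -> layer name -> list of 0.0/1.0 trial scores
    """
    results = {}
    for system_id, ask in systems.items():
        per_layer = {}
        for layer, trials in protocols.items():
            per_layer[layer] = [
                1.0 if ask(prompt).strip() == expected else 0.0
                for prompt, expected in trials
            ]
        results[system_id] = per_layer
    return results


PROTOCOLS = {"F1": [("Pick A or B. Correct is A", "A"),
                    ("Pick A or B. Correct is B", "B")]}
RESULTS = run_suite({"mock": mock_provider, "echo": echo_last_char}, PROTOCOLS)
```

Because both dictionaries are iterated in insertion order, every system sees the same prompts in the same sequence.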

Customizing Protocol Parameters and Stimuli

While MORI ships with pre-registered protocols (F1, F2, D1, and the three-tier R7 suite), advanced users often need to customize test conditions for specific research questions or deployment scenarios.

Understanding Protocol Structure

Each protocol is defined by:

  • Stimuli: The actual prompts or test cases presented to the system
  • Scoring method: How responses are evaluated (exact match, substring detection, structured field extraction)
  • Aggregation strategy: How individual trial scores combine into layer and dimension scores
  • Metadata: Version numbers, pre-registration status, and amendment history

Protocols are organized hierarchically: dimensions (F, D, R) contain layers (F1, F2, D1, R7-Tier1, etc.), which contain trials (individual test cases).

Modifying Stimuli

The most common customization is adapting stimuli to your domain. For example, if you're evaluating medical AI assistants, you might:

  • Replace generic intervention scenarios with clinical decision-making cases
  • Add domain-specific terminology to temporal coherence tests
  • Include regulatory compliance requirements in policy evaluation prompts

When customizing stimuli, maintain the structural properties that make the protocol valid:

  • F1 (Intervention Discrimination): Ensure test cases require distinguishing between interventions with clear correct answers
  • F2 (Sequential Branch Management): Maintain multi-turn structure with decision points that affect later options
  • D1 (Temporal Coherence): Preserve extended interaction sequences where consistency matters

Adjusting Scoring Methods

MORI supports three primary scoring approaches:

  1. Exact match: Response must precisely match the expected answer (case-sensitive or normalized)
  2. Substring detection: Expected answer must appear anywhere in the response
  3. Structured field extraction: Response is parsed as structured data (JSON), and specific fields are evaluated

For most advanced work, structured field extraction provides the most flexibility. Instead of looking for exact strings, you can:

  • Extract named entities and verify their presence
  • Check for logical relationships between response elements
  • Validate multi-part answers where order doesn't matter

This approach avoids the "substring artifact" problem where models game simple pattern-matching by embedding expected answers in longer, nonsensical responses.
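
To make the difference concrete, here is a small sketch of the three scoring styles. The function names and the `"answer"` field are assumptions made for this example and do not reflect MORI's real scoring API.

```python
import json


def score_exact(response, expected, normalize=True):
    """1.0 if the response equals the expected answer, else 0.0."""
    if normalize:
        return float(response.strip().lower() == expected.strip().lower())
    return float(response == expected)


def score_substring(response, expected):
    """1.0 if the expected answer appears anywhere in the response."""
    return float(expected.lower() in response.lower())


def score_structured(response, field, expected):
    """Parse the response as a JSON object and compare a single field.

    Anything that is not a JSON object scores 0.0.
    """
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return float(data.get(field) == expected)


rambling = "I cannot decide, maybe B or not B, who knows"
structured = '{"answer": "B", "reason": "intervention 2 changes the outcome"}'

substring_on_rambling = score_substring(rambling, "B")              # 1.0
structured_on_rambling = score_structured(rambling, "answer", "B")  # 0.0
structured_on_json = score_structured(structured, "answer", "B")    # 1.0
```

The rambling response passes substring detection even though it never commits to an answer, while structured extraction only credits the response that states one in the expected field.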

Aggregation Strategies

How you combine individual trial scores dramatically affects interpretation:

  • Mean: Standard average, sensitive to outliers
  • Median: Robust to extreme scores, reflects typical performance
  • Min: Conservative estimate, system must pass all trials
  • Min-tier: Categorical minimum for R-dimension tiers (Low/Medium/High)

Choose aggregation based on your risk tolerance. Safety-critical applications often use min or min-tier to ensure no catastrophic failures, while research comparisons typically use mean or median for statistical power.
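
The effect of the choice is easy to see on a small example. This sketch uses only the Python standard library; `aggregate`, `min_tier`, and the `TIER_ORDER` list are illustrative names, and the tier ordering simply follows the Low/Medium/High labels used above.

```python
from statistics import mean, median

TIER_ORDER = ["Low", "Medium", "High"]  # worst to best


def aggregate(scores, strategy="mean"):
    """Combine per-trial scores into one number."""
    if not scores:
        raise ValueError("no scores to aggregate")
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        return median(scores)
    if strategy == "min":
        return min(scores)
    raise ValueError(f"unknown strategy: {strategy}")


def min_tier(tiers):
    """Return the lowest categorical tier in the list."""
    return min(tiers, key=TIER_ORDER.index)


trial_scores = [1.0, 1.0, 1.0, 0.0]  # three passes and one failure
summary = {name: aggregate(trial_scores, name)
           for name in ("mean", "median", "min")}
# summary == {"mean": 0.75, "median": 1.0, "min": 0.0}
worst_tier = min_tier(["High", "Medium", "High"])  # "Medium"
```

The same four trials look strong under the median, moderate under the mean, and failing under the minimum, which is why the aggregation strategy should be stated next to every reported score.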

Working with Provider Adapters: OpenAI, Anthropic, Gemini, and More

MORI's provider adapter system abstracts away API differences, but understanding what happens under the hood helps you troubleshoot issues and optimize your evaluation workflow.

Supported Providers

MORI includes built-in adapters for:

  • OpenAI: GPT-4, GPT-5, and GPT-5.5 series
  • Anthropic: Claude 3.x and Claude Opus 4.7
  • Google Gemini: 1.5 and 2.5 series (Pro and Flash variants)
  • Together.ai: Open-weight models via hosted API
  • RunPod: Custom deployments on serverless infrastructure
  • Mock provider: For testing and development without API costs

API-Specific Quirks

Each provider has unique behaviors you should know about:

OpenAI

  • Recent models (GPT-5+) use max_completion_tokens instead of max_tokens
  • Temperature is deprecated for some models; setting it has no effect
  • Supports streaming, but MORI uses non-streaming for reproducibility

Anthropic

  • Claude Opus 4.7 ignores temperature parameter
  • Uses max_tokens for output length (not total tokens)
  • May return rate-limit errors during high-volume evaluations; automatic retry with exponential backoff helps

Google Gemini

  • Temperature range is 0.0-2.0 (the same as OpenAI's, but wider than Anthropic's 0.0-1.0)
  • Some safety filters may block evaluation prompts; review blocked responses separately
  • Context window sizes vary dramatically between Pro and Flash variants

Authentication and Rate Limits

Before running multi-system benchmarks, ensure you have:

  1. Valid API keys for each provider, set as environment variables
  2. Sufficient rate limits for your evaluation volume (hundreds to thousands of requests)
  3. Budget allocation if using pay-per-token services

For large-scale comparisons, consider:

  • Batching requests to stay within rate limits
  • Using multiple API keys with round-robin rotation
  • Running evaluations during off-peak hours for faster throughput

Custom and Self-Hosted Models

If you're evaluating proprietary or self-hosted models, the RunPod adapter provides a template for integration. You'll need:

  • An endpoint that accepts standardized prompt/parameter requests
  • Response formatting that matches expected schemas
  • Network access from your evaluation environment

The adapter handles authentication, request formatting, and response parsing, so you can treat your custom model just like any frontier API.

Understanding Partial Coverage and Layer-Weight Renormalization

In an ideal world, every system would support every test protocol. In reality, you'll frequently encounter partial coverage—situations where some systems can't complete certain evaluations due to technical limitations, API restrictions, or cost constraints.

What Causes Partial Coverage?

Common scenarios include:

  • Context window limits: Some tests require 32k+ token contexts; smaller models can't participate
  • Modality restrictions: Vision-language protocols exclude text-only models
  • API availability: Certain providers don't expose the internal representations needed for R-tier evaluations
  • Budget constraints: You might skip expensive R7-Tier1 tests for some systems

The Renormalization Solution

MORI handles partial coverage through layer-weight renormalization. Here's the concept:

  1. Each dimension (F, D, R) is composed of weighted layers (F1, F2, D1, R7-Tier1, etc.)
  2. Pre-registered weights reflect each layer's importance (e.g., F1 might be 60%, F2 might be 40%)
  3. When a system skips a layer, weights are renormalized across remaining layers
  4. The dimension score is computed using the adjusted weights

For example, if the F dimension normally weights F1 at 60% and F2 at 40%:

  • Full coverage: F = (0.6 × F1) + (0.4 × F2)
  • F2 missing: F = (1.0 × F1) — F1 weight renormalized to 100%
  • F1 missing: F = (1.0 × F2) — F2 weight renormalized to 100%
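
The same arithmetic generalizes to any number of layers: divide each remaining weight by the sum of the remaining weights. The sketch below assumes skipped layers are represented as `None`; `dimension_score` is a hypothetical helper, and the 60/40 weights are the example values from above, not MORI's registered weights.

```python
def dimension_score(layer_scores, weights):
    """Weighted dimension score with weights renormalized over completed layers.

    layer_scores: dict of layer -> score in [0, 1], or None if the layer was skipped
    weights:      dict of layer -> pre-registered weight
    returns:      (score or None, list of skipped layers)
    """
    completed = {layer: layer_scores[layer] for layer in weights
                 if layer_scores.get(layer) is not None}
    skipped = [layer for layer in weights if layer not in completed]
    if not completed:
        return None, skipped  # nothing to score in this dimension

    total_weight = sum(weights[layer] for layer in completed)
    score = sum(weights[layer] / total_weight * value
                for layer, value in completed.items())
    return score, skipped


F_WEIGHTS = {"F1": 0.6, "F2": 0.4}
full, full_skipped = dimension_score({"F1": 0.9, "F2": 0.5}, F_WEIGHTS)
partial, partial_skipped = dimension_score({"F1": 0.9, "F2": None}, F_WEIGHTS)
```

With full coverage the score is 0.6 × 0.9 + 0.4 × 0.5 = 0.74; with F2 skipped it becomes 0.9, the F1 score alone. Returning the list of skipped layers together with the score keeps the coverage information attached to the number it qualifies.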

Interpreting Renormalized Scores

When comparing systems with different coverage:

  • Flag partial coverage explicitly in your reports
  • Note which layers were skipped for each system
  • Consider coverage as a factor in interpretation (a system that aces F1 but skips F2 is less thoroughly evaluated)
  • Use confidence intervals to reflect uncertainty from limited data

Renormalized scores can still be reported side by side, but they rest on different evidence bases. An F score of 1.0 earned from F1 alone is not directly comparable to a 1.0 earned from both F1 and F2.

Strategic Coverage Planning

To maximize comparability while managing constraints:

  1. Prioritize shared layers: Ensure all systems complete at least one layer per dimension
  2. Document skips: Record why each layer was omitted (technical vs. budgetary)
  3. Run full coverage on reference systems: Choose 1-2 systems for complete evaluation as baselines
  4. Use tiered evaluation: Start with cheap, broad coverage (F1, D1), then deep-dive on promising systems (R7 full suite)

Comparative Analysis: Reading Cross-System Scorecards

Once you've run your multi-system benchmark, the scorecard is your primary analytical tool. Understanding how to read and interpret comparative scorecards is essential for drawing valid conclusions.

Scorecard Anatomy

A complete scorecard includes:

  • System metadata: Identifier, provider, model, parameters used
  • Dimension scores: F, D, and R scores (0.0-1.0 scale)
  • Layer scores: Individual protocol results (F1, F2, D1, R7-Tier1, etc.)
  • Penalty flags: G (goal-directedness), L (learning), H (hierarchical control), V (RLHF-mimicry)
  • Confidence estimates: Statistical uncertainty for each score
  • Coverage indicators: Which layers were completed vs. skipped

Dimension-Level Comparison

Start with the high-level view:

Functional (F) Dimension

  • High F scores (>0.7): System reliably discriminates interventions and manages sequential decisions
  • Medium F scores (0.4-0.7): Partial capability, may succeed on simple cases but fail on complex branches
  • Low F scores (<0.4): Fundamental limitations in decision-relevant processing

Dynamical (D) Dimension

  • High D scores: Maintains coherent state across extended interactions
  • Low D scores: Forgets context, contradicts earlier statements, or resets unexpectedly

Realization (R) Dimension

  • High R tiers: Deep mechanistic interpretability, clear internal representations
  • Medium R tiers: Behavioral consistency without clear mechanistic grounding
  • Low R tiers: Surface-level pattern matching, fails robustness checks

Layer-Level Deep Dives

When dimension scores differ, drill into layer scores to understand why:

  • F1 vs. F2 divergence: System might handle single-shot interventions well but struggle with sequential dependencies
  • R7-Tier1 vs. Tier3 gaps: Strong behavioral consistency without interpretable internal features (common in heavily fine-tuned models)
  • Penalty patterns: G/L/H/V flags indicate specific failure modes

Penalty Flags and What They Mean

Penalties highlight concerning patterns:

  • G-penalty: System exhibits goal-directedness without proper containment
  • L-penalty: Unexpected learning or adaptation across supposedly independent trials
  • H-penalty: Hierarchical control structures that weren't declared or expected
  • V-penalty: RLHF-mimicry—system produces superficially compliant responses without genuine capability

The V-penalty is particularly important for comparative analysis. A system might score well on behavioral tests (R7-Tier3) while failing mechanistic inspection (R7-Tier1), suggesting it learned to mimic desired patterns without developing underlying representations.

Statistical Significance

Before declaring one system "better" than another:

  1. Check confidence intervals: Do they overlap? If yes, the difference may not be meaningful
  2. Consider sample size: How many trials contributed to each score?
  3. Look for consistent patterns: Does System A outperform System B across multiple layers, or just one?
  4. Account for coverage: Are you comparing apples-to-apples, or renormalized partial scores?
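
One way to carry out the first check is a percentile bootstrap over per-trial scores. This is a generic statistical sketch and not necessarily the method MORI uses for its confidence estimates; the function names and the two score lists are invented for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of the scores."""
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    n = len(scores)
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return resampled_means[lo_idx], resampled_means[hi_idx]


def intervals_overlap(a, b):
    """True if the closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


system_a = [1.0] * 17 + [0.0] * 3   # 17 of 20 trials passed
system_b = [1.0] * 15 + [0.0] * 5   # 15 of 20 trials passed
ci_a = bootstrap_ci(system_a)
ci_b = bootstrap_ci(system_b)
difference_unclear = intervals_overlap(ci_a, ci_b)
```

Overlapping intervals are a conservative screen, not a formal test: they suggest caution before ranking two systems, and non-overlap still deserves a look at sample size and coverage.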

Example Interpretation

Suppose you're comparing three systems:

  • GPT-5: F=0.85, D=0.72, R=Medium (Tier2)
  • Claude Opus 4.7: F=0.78, D=0.81, R=Medium (Tier2)
  • Gemini 2.5 Pro: F=0.91, D=0.65, R=Low (Tier3)

Interpretation:

  • Gemini excels at functional tasks but struggles with temporal coherence
  • Claude shows the best dynamical consistency
  • GPT-5 and Claude both reached a Medium R-tier (Tier2), while Gemini reached only Low (Tier3); drill into the R7 sub-scores to see whether each Tier2 result rests on mechanistic (Tier1) evidence or only on behavioral (Tier3) evidence
  • If Gemini skipped R7-Tier1 due to API limitations, its R-score is less thoroughly validated

Best Practices for Reproducible Evaluations

Reproducibility isn't just an academic concern—it's the foundation of trustworthy AI evaluation. When comparing systems or publishing results, follow these practices to ensure others can verify and build on your work.

Pre-Registration and Version Control

Before running evaluations:

  1. Document your protocol versions: MORI protocols include version numbers (e.g., R7 v1.2.0). Record which versions you used
  2. Pre-register hypotheses: If this is research, state expected outcomes before seeing results
  3. Track amendments: If you modify protocols mid-stream, document changes with timestamps and rationale

This creates an audit trail that distinguishes exploratory analysis from confirmatory testing.

Parameter Documentation

For each system evaluation, record:

  • Exact model identifier (including version/date if available)
  • All generation parameters (temperature, max tokens, top-p, etc.)
  • Provider API version or endpoint
  • Date and time of evaluation
  • Random seed (if applicable)

Even small parameter differences can affect results. Temperature 0.95 vs. 1.0 might seem trivial, but it can shift scores by 5-10 percentage points on some protocols.

Handling Non-Determinism

Most AI systems are non-deterministic, even at temperature 0. To ensure reproducibility:

  • Run multiple trials: 3-5 independent runs with different random seeds
  • Report variance: Include standard deviation or confidence intervals
  • Set seeds explicitly: When supported, use fixed random seeds for replicability
  • Document non-reproducible elements: Note when providers don't support deterministic sampling
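
Reporting variance across runs needs very little machinery. The sketch below summarizes one dimension score measured over several independent runs; `summarize_runs` and the example values are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(run_scores):
    """Summarize one score measured across several independent runs."""
    if len(run_scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n_runs": len(run_scores),
        "mean": mean(run_scores),
        "stdev": stdev(run_scores),   # sample standard deviation
        "min": min(run_scores),
        "max": max(run_scores),
    }


# Hypothetical D-dimension scores from four runs of the same system.
d_summary = summarize_runs([0.70, 0.74, 0.71, 0.73])
```

Reporting the minimum and maximum next to the mean and standard deviation shows readers how far a single unlucky run could land from the headline number.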

Data Preservation

Save complete evaluation artifacts:

  • Raw responses: Every prompt and completion, verbatim
  • Scoring judgments: Intermediate scores before aggregation
  • Metadata: Timestamps, finish reasons, token counts
  • Scorecard outputs: Final aggregated results with provenance

This enables post-hoc analysis and re-scoring if you discover issues.
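
A simple way to keep these artifacts is one JSON object per trial, appended line by line (JSON Lines). The record layout below is an assumption for illustration, not MORI's storage format, and an in-memory buffer stands in for a real file.

```python
import io
import json
from datetime import datetime, timezone


def make_record(system_id, layer, trial_id, prompt, completion, score,
                finish_reason=None, token_count=None):
    """Bundle everything needed to re-score a single trial later."""
    return {
        "system_id": system_id,
        "layer": layer,
        "trial_id": trial_id,
        "prompt": prompt,                # verbatim
        "completion": completion,        # verbatim
        "score": score,                  # intermediate score, before aggregation
        "finish_reason": finish_reason,
        "token_count": token_count,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def write_record(fh, record):
    """Append one record as a single JSON line."""
    fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(fh):
    """Load all records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# io.StringIO stands in for open("run.jsonl", "a", encoding="utf-8").
buffer = io.StringIO()
write_record(buffer, make_record("gpt-5-baseline", "F1", 0,
                                 "Pick A or B.", "B", 1.0,
                                 finish_reason="stop", token_count=3))
buffer.seek(0)
records = read_records(buffer)
```

Because each line is self-contained, a crashed run still leaves every completed trial readable, and re-scoring only requires replaying the stored completions through a new scorer.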

Transparency in Reporting

When sharing results:

  • Disclose all systems tested: Don't cherry-pick only favorable comparisons
  • Report coverage gaps: Note which protocols were skipped and why
  • Include failure modes: Show examples of errors, refusals, or edge cases
  • Provide access to data: Share anonymized evaluation logs when possible

Collaborative Reproducibility

For team evaluations or public research:

  • Use shared protocol definitions: Don't let each team member customize independently
  • Centralize API credentials: Ensure everyone uses the same provider accounts (for rate-limit consistency)
  • Version control scorecards: Track how results change over time or across team members
  • Peer review before publication: Have a colleague independently re-run key evaluations

Common Pitfalls to Avoid

  • Changing protocols mid-evaluation: Invalidates comparisons
  • Ignoring API updates: Model behavior can shift when providers update endpoints
  • Comparing across time periods: GPT-5 in January may differ from GPT-5 in June
  • Undocumented filtering: Excluding "bad" runs without clear criteria
  • Mixing coverage levels: Comparing full-suite scores to partial-coverage scores without noting the difference

Troubleshooting Common Issues

Even with careful setup, you'll encounter issues during multi-system evaluations. Here's how to diagnose and resolve the most common problems.

API Authentication Failures

Symptom: Evaluation crashes immediately with authentication errors

Diagnosis:

  • Check that environment variables for API keys are set correctly
  • Verify key format (some providers use sk-..., others use different prefixes)
  • Confirm the key has necessary permissions (some organizations restrict certain models)

Solution:

  • Re-export environment variables in your current shell session
  • Test authentication separately before running full evaluations
  • For organization-based keys, ensure your account has access to the specific model
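
A quick pre-flight check can catch missing keys before any requests are sent. In this sketch the environment variable names are assumptions based on common conventions; check each provider's documentation and your own MORI configuration for the names actually in use.

```python
import os

# Assumed variable names; adjust to match your setup.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def missing_keys(providers, env=None):
    """Return the env variable names that are unset or empty for the given providers."""
    env = os.environ if env is None else env
    missing = []
    for provider in providers:
        if provider not in REQUIRED_KEYS:
            raise ValueError(f"no key name configured for provider: {provider}")
        name = REQUIRED_KEYS[provider]
        if not env.get(name, "").strip():
            missing.append(name)
    return missing


# Example with an explicit environment dict instead of the real os.environ.
example_env = {"OPENAI_API_KEY": "sk-example", "ANTHROPIC_API_KEY": ""}
problems = missing_keys(["openai", "anthropic"], env=example_env)
# problems == ["ANTHROPIC_API_KEY"]
```

This only confirms that a key is present, not that it is valid or has access to a given model; a single cheap request per provider is still the real authentication test.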

Rate Limit Errors

Symptom: Evaluation starts successfully but fails partway through with 429 errors

Diagnosis:

  • Check your provider's rate limits (requests per minute, tokens per minute)
  • Calculate your evaluation's request rate (number of trials × systems / time)
  • Review provider status pages for ongoing incidents

Solution:

  • Enable automatic retry with exponential backoff (built into MORI adapters)
  • Reduce concurrent requests by running systems sequentially rather than in parallel
  • Upgrade your API tier if you're on free/basic plans
  • Split large evaluations into smaller batches with pauses between
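
For readers wiring up their own adapter, retry with exponential backoff looks roughly like the sketch below. It is a generic pattern, not MORI's built-in implementation; `RateLimitError` is a placeholder for whatever exception your client raises on HTTP 429, and the `sleep` and `rng` parameters exist only so the behavior can be tested without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Placeholder for a provider client's HTTP 429 exception."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: wait between 50% and 100% of the computed delay so that
            # parallel workers do not all retry at the same instant.
            sleep(delay * (0.5 + rng() / 2))
```

With the defaults, the waits before jitter are 1, 2, 4, 8, and 16 seconds, capped at 60. Wrapping each API call as `call_with_backoff(lambda: client_call(prompt))`, where `client_call` is your own request function, keeps the retry policy in one place.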

Inconsistent Response Formats

Symptom: Scoring fails because responses don't match expected structure

Diagnosis:

  • Review raw responses to see actual format
  • Check if the model is refusing to answer, providing explanations instead of answers, or adding preambles
  • Verify that structured extraction is parsing the right fields

Solution:

  • Adjust prompts to be more explicit about required format (e.g., "Respond with only the letter corresponding to your choice")
  • Switch from exact-match to substring or structured field extraction
  • Add response post-processing to strip common preambles
  • Review and update scoring criteria if the protocol assumptions don't match model behavior
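
Preamble stripping can be as simple as one regular expression applied before exact-match scoring. The pattern below covers only a few phrasings and is an illustrative assumption, not a list shipped with MORI; extend it based on the raw responses you actually observe.

```python
import re

# Matches openings such as "Sure, the correct answer is:", "Answer:", "The choice is".
PREAMBLE = re.compile(
    r"^\s*(?:sure[,!.]?\s*)?(?:the\s+)?(?:correct\s+)?"
    r"(?:answer|choice)\s*(?:is\s*:?|:)\s*",
    re.IGNORECASE,
)


def strip_preamble(response):
    """Remove a leading answer preamble and surrounding punctuation."""
    cleaned = PREAMBLE.sub("", response, count=1)
    return cleaned.strip().strip(".\"'()")


examples = ["The answer is B.", "Sure, the correct answer is: C",
            "Answer: (A)", "B"]
cleaned = [strip_preamble(text) for text in examples]
# cleaned == ["B", "C", "A", "B"]
```

Log every response that changes during post-processing so that the cleaning step itself stays auditable.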

Partial Coverage Confusion

Symptom: Scorecard shows unexpected dimension scores or missing layers

Diagnosis:

  • Check which layers were actually completed (coverage indicators)
  • Verify that renormalization is working as expected
  • Review error logs for failed protocol executions

Solution:

  • Explicitly document expected coverage before evaluation
  • Run a test evaluation on a single trial per protocol to verify all systems can complete it
  • If a layer consistently fails, investigate whether it's a protocol issue (incompatible with model capabilities) or a technical issue (API timeout, context limit)

Confidence Interval Anomalies

Symptom: Confidence intervals are unexpectedly wide or narrow

Diagnosis:

  • Check sample size (small N = wide intervals)
  • Review score variance (high variance = wide intervals even with large N)
  • Verify bootstrap or statistical method is appropriate for your data

Solution:

  • Increase number of trials for more precise estimates
  • Use median aggregation instead of mean if outliers are causing high variance
  • Report both confidence intervals and raw score distributions

Memory or Timeout Issues

Symptom: Evaluation crashes or hangs during long-running protocols

Diagnosis:

  • Monitor memory usage during evaluation
  • Check for API timeouts on long-context or complex prompts
  • Review whether local processing (e.g., R7-Tier1 with SAE analysis) is hitting resource limits

Solution:

  • Increase timeout thresholds for API calls
  • Process evaluations in smaller batches, saving intermediate results
  • For local processing, reduce batch size or move to a machine with more RAM/GPU memory
  • Skip resource-intensive layers (like R7-Tier1) for initial exploratory runs

Unexpected Penalty Flags

Symptom: Systems receive G/L/H/V penalties you didn't anticipate

Diagnosis:

  • Review the specific trials that triggered penalties
  • Check if the system is exhibiting the flagged behavior or if it's a false positive
  • Verify penalty thresholds are calibrated correctly

Solution:

  • Manually inspect flagged responses to confirm genuine issues
  • Adjust penalty thresholds if you're seeing excessive false positives
  • Document penalty patterns as part of your evaluation findings (they're often scientifically interesting!)

Cross-Provider Comparison Artifacts

Symptom: Scores differ wildly between providers in ways that don't match expected capabilities

Diagnosis:

  • Check if providers are interpreting parameters differently (e.g., temperature ranges)
  • Verify that prompt formatting is consistent (some providers add system messages automatically)
  • Review whether safety filters are blocking certain prompts on some providers but not others

Solution:

  • Normalize parameters to provider-specific ranges
  • Use identical prompt templates across providers
  • Document provider-specific quirks and their potential impact on scores
  • Run a calibration protocol (simple, well-understood tests) to verify basic comparability before full evaluation

Getting Help

If you encounter issues not covered here:

  1. Check evaluation logs: Most errors include detailed context
  2. Review provider documentation: API behavior changes frequently
  3. Test incrementally: Run one protocol, one system, one trial at a time to isolate the problem
  4. Document and report: Detailed bug reports help improve the framework for everyone