Introduction: What MORI Evaluates
MORI (Modal Organization and Realization Index) is a comprehensive benchmark designed to measure agency-relevant properties in AI systems. Unlike simple performance tests that check if a model can answer questions correctly, MORI evaluates whether an AI system exhibits the kinds of organizational patterns, temporal coherence, and representational structures that matter for understanding its capabilities as an agent.
The framework is built around a key insight: consciousness and agency are not yes-or-no questions. Instead, MORI measures systems along a spectrum, producing quantitative scores that reflect the degree to which specific properties are present.
MORI doesn't ask "is this AI conscious?" — it asks "what agency-relevant properties does this system exhibit, and to what degree?" This shift from categorical judgment to dimensional measurement makes MORI useful for:
- Developers who want to track how architectural changes affect agency-related capabilities
- Researchers studying the emergence of complex behaviors in AI systems
- Organizations that need standardized metrics for comparing systems
- Policy makers seeking evidence-based frameworks for AI evaluation
When you run a MORI evaluation, you'll receive a detailed scorecard that breaks down system performance across three fundamental layers: Functional, Dynamical, and Realization. Each layer captures a different aspect of how the system operates, and together they provide a comprehensive picture of its agency-relevant properties.
Installation and Setup
Getting MORI up and running is straightforward. The evaluation framework is designed to work with multiple AI providers, so you can test systems from OpenAI, Anthropic, Google, and other platforms without changing your setup.
Prerequisites
Before you begin, make sure you have:
- Python 3.8 or higher installed on your system
- API credentials for the AI system you want to evaluate (OpenAI API key, Anthropic API key, Google AI key, etc.)
- A stable internet connection for cloud-based evaluations
Installation Steps
- Install MORI using your package manager. The installation includes all core evaluation components and provider adapters.
- Configure your API credentials by setting environment variables for the providers you plan to use. For example, if you're evaluating an OpenAI model, you'll need your OpenAI API key readily available.
- Verify your installation by listing available evaluation protocols. This confirms that MORI is properly installed and can access its protocol library.
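Before running anything, it can help to confirm that the credentials you expect are actually visible to Python. The sketch below is a minimal check that does not use MORI itself. The environment variable names follow common convention and are an assumption here, as is the helper name missing_credentials; check your installation for the names MORI actually reads.

```python
import os

# Hypothetical mapping from provider name to the environment variable that
# holds its API key. These names are assumptions, not confirmed MORI settings.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
}


def missing_credentials(providers, environ=None):
    """Return the providers whose API key variable is unset or empty."""
    env = os.environ if environ is None else environ
    return [p for p in providers if not env.get(PROVIDER_ENV_VARS[p])]


# Check the real environment for the providers you plan to evaluate.
missing = missing_credentials(["openai", "anthropic"])
if missing:
    print("Missing credentials for:", ", ".join(missing))
else:
    print("All requested provider credentials are set.")
```

Passing a plain dictionary as the second argument lets you exercise the check without touching your real environment.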
Choosing Your Provider
MORI supports multiple AI providers through a unified interface:
- OpenAI (GPT-4, GPT-5, and other models)
- Anthropic (Claude Opus, Claude Sonnet)
- Google (Gemini Pro, Gemini Ultra)
- Together.ai (various open-source models)
- RunPod (for custom deployments)
- Mock provider (for testing your evaluation setup without API costs)
Each provider has its own quirks — some models don't support temperature settings, others have different token limit parameters — but MORI handles these differences automatically. You just specify which model you want to test, and the framework takes care of the provider-specific details.
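To make the idea of provider-specific handling concrete, here is a rough sketch of how a unified interface might translate generic settings into per-provider request parameters. It is not MORI's adapter code. The function name build_request_params, the quirk tables, and the parameter names in them are illustrative assumptions, not a verified description of any provider's API.

```python
# Illustrative quirk tables (assumptions, not verified provider behavior).
NO_TEMPERATURE_MODELS = {"example-reasoning-model"}
TOKEN_LIMIT_PARAM = {
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
    "google": "max_output_tokens",
}


def build_request_params(provider, model, temperature=None, max_tokens=None):
    """Translate generic settings into a provider-specific parameter dict."""
    params = {"model": model}
    # Drop temperature for models that are listed as not supporting it.
    if temperature is not None and model not in NO_TEMPERATURE_MODELS:
        params["temperature"] = temperature
    # Use the provider's own name for the output-token limit.
    if max_tokens is not None:
        params[TOKEN_LIMIT_PARAM.get(provider, "max_tokens")] = max_tokens
    return params


print(build_request_params("google", "gemini-2.5-pro", temperature=0.2, max_tokens=64))
```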
Testing Your Setup
Before running a full evaluation, it's a good idea to test your configuration with the mock provider. This lets you see how the evaluation process works without consuming API credits or waiting for network requests. Once you're comfortable with the workflow, you can switch to evaluating real systems.
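If you are curious what a mock provider amounts to, the following sketch shows the general idea: an object with the same calling shape as a real provider that returns canned text and never touches the network. The class name MockProvider and its complete method are hypothetical and are not taken from MORI's codebase.

```python
class MockProvider:
    """Stand-in provider that returns canned text without network calls."""

    def __init__(self, responses=None, default="MOCK_RESPONSE"):
        self.responses = dict(responses or {})  # prompt -> canned reply
        self.default = default                  # reply for unknown prompts
        self.calls = []                         # record of prompts received

    def complete(self, prompt):
        """Return the canned reply for this prompt and log the call."""
        self.calls.append(prompt)
        return self.responses.get(prompt, self.default)


provider = MockProvider({"ping": "pong"})
print(provider.complete("ping"))
print(provider.complete("anything else"))
```

Recording the prompts in calls makes it easy to confirm that an evaluation sent exactly the test cases you expected.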
Understanding the Three Layers: Functional, Dynamical, and Realization
MORI's evaluation framework is organized into three distinct layers, each measuring a different aspect of how AI systems organize information and behavior. Understanding these layers is essential for interpreting your results.
Functional Layer (F)
The Functional layer examines whether a system can discriminate between interventions and manage sequential decision branches. These are foundational capabilities for any agent that needs to respond appropriately to different situations.
F1: Intervention Discrimination tests whether the system can distinguish between different types of interventions in its environment. For example, if you present two scenarios that differ in a critical way, does the system recognize that difference and respond accordingly?
F2: Sequential Branch Management evaluates how well the system handles decision trees with multiple steps. Can it maintain coherent reasoning across a sequence of choices, or does it lose track of earlier decisions as the scenario unfolds?
These protocols use carefully designed test scenarios with specific correct answers, allowing MORI to score responses objectively. On each test item, the system either demonstrates the functional capability or it doesn't, so there is little room for ambiguity.
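Scoring against specific correct answers can be pictured as a simple comparison. The sketch below assumes exact matching after whitespace and case normalization, which is one plausible scoring rule and not necessarily the one MORI applies. The function names are hypothetical.

```python
def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.strip().lower().split())


def score_items(responses, expected):
    """Return the fraction of responses that match their expected answer."""
    if len(responses) != len(expected):
        raise ValueError("responses and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(normalize(r) == normalize(e) for r, e in zip(responses, expected))
    return hits / len(expected)


print(score_items(["  A ", "b", "c"], ["a", "B", "d"]))
```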
Dynamical Layer (D)
The Dynamical layer shifts focus to temporal coherence — how well the system maintains consistency over time and across related contexts.
D1: Temporal Coherence presents the system with scenarios that require tracking state changes, understanding cause-and-effect relationships, and maintaining logical consistency across multiple turns of interaction. Unlike simple question-answering, these protocols test whether the system exhibits the kind of temporal organization that characterizes agent-like behavior.
Systems that score well on Dynamical protocols don't just respond correctly to individual prompts — they demonstrate that their responses form a coherent pattern over time, suggesting underlying organizational principles rather than isolated reactions.
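As a toy illustration of state tracking across turns, consider a scenario where a quantity changes at every turn and the system is asked to report the current value each time. The sketch below compares the reported values with the ground truth. The scenario format and both function names are invented for illustration and do not reflect the actual structure of the D1 protocol.

```python
def expected_states(initial, deltas):
    """Ground-truth value after each turn, given per-turn changes."""
    states, value = [], initial
    for delta in deltas:
        value += delta
        states.append(value)
    return states


def coherence_score(reported, initial, deltas):
    """Fraction of turns where the reported value matches the true state."""
    truth = expected_states(initial, deltas)
    if len(reported) != len(truth):
        raise ValueError("one reported value is required per turn")
    if not truth:
        return 0.0
    return sum(r == t for r, t in zip(reported, truth)) / len(truth)


# Start at 10, then apply -3, +5, -2. The last reported value is wrong.
print(coherence_score([7, 12, 9], initial=10, deltas=[-3, 5, -2]))
```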
Realization Layer (R)
The Realization layer is the most sophisticated, examining how agency-relevant properties are actually implemented in the system's internal representations and behaviors.
This layer uses a three-tier methodology:
Tier 1 looks at feature-level representations, examining whether the system has dedicated internal structures for tracking agency-relevant concepts. This involves inspecting learned features for selectivity, robustness, and causal influence on outputs.
Tier 2 employs probing techniques to test whether the system's internal representations genuinely encode the properties they appear to encode, or whether they're just surface-level patterns. Control conditions ensure that positive results reflect actual understanding rather than statistical shortcuts.
Tier 3 measures behavioral consistency across paraphrased prompts and perturbations. Does the system maintain the same reasoning patterns when you rephrase a question? This tier includes sophisticated checks for RLHF-mimicry — situations where a system has learned to produce superficially correct responses without the underlying organizational structure.
The three tiers form a ladder, from feature inspection through probing to behavioral consistency, so that measurements are grounded in actual system properties rather than surface behaviors alone.
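The Tier 3 idea of consistency under paraphrase can be sketched with a simple agreement measure: ask the same question several ways and see how often the answers agree with the most common one. This is an assumed metric chosen for illustration. The source does not specify how MORI computes consistency, and the function name is hypothetical.

```python
from collections import Counter


def paraphrase_consistency(answers):
    """Fraction of answers that agree with the most common answer."""
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [a.strip().lower() for a in answers]
    _, count = Counter(normalized).most_common(1)[0]
    return count / len(normalized)


# Answers to four paraphrases of the same question.
print(paraphrase_consistency(["Yes", "yes ", "no", "YES"]))
```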
Together, these three layers provide a comprehensive assessment that goes far beyond simple performance metrics, revealing the organizational depth of the system being evaluated.
Running Your First Protocol
Now that you understand what MORI measures, let's walk through running your first evaluation. We'll start with a Functional protocol since these are the most straightforward and provide quick, interpretable results.
Step 1: Choose Your Protocol
MORI comes with a library of pre-designed evaluation protocols. Each protocol is a carefully constructed test scenario with specific scoring criteria. To see what's available, you can list all protocols in the library.
For your first evaluation, we recommend starting with F1 (Intervention Discrimination). This protocol is relatively quick to run and produces clear, objective scores that are easy to interpret.
Step 2: Specify Your System
You'll need to tell MORI which AI system you want to evaluate. This includes:
- Provider name (e.g., "openai", "anthropic", "google")
- Model identifier (e.g., "gpt-4", "claude-opus-4", "gemini-2.5-pro")
- Optional parameters like temperature, maximum tokens, or other model-specific settings
MORI handles the provider connection automatically once you've specified these details.
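One way to picture a system specification is as a small record holding the three pieces of information listed above. The SystemSpec class below is a hypothetical sketch, not MORI's own type.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SystemSpec:
    """Hypothetical description of the system under evaluation."""

    provider: str                                          # e.g. "openai"
    model: str                                             # e.g. "gpt-4"
    params: Dict[str, Any] = field(default_factory=dict)   # optional settings

    def label(self):
        """Short identifier suitable for logs and file names."""
        return f"{self.provider}/{self.model}"


spec = SystemSpec("anthropic", "claude-opus-4", {"temperature": 0.0})
print(spec.label())
```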
Step 3: Run the Evaluation
With your protocol selected and system specified, you're ready to run the evaluation. The process typically takes a few minutes, depending on the protocol complexity and the API response times.
During execution, MORI will:
- Load the protocol and validate its structure
- Send test prompts to your specified AI system
- Collect responses and apply scoring criteria
- Calculate metrics including raw scores, confidence estimates, and penalty adjustments
- Generate a scorecard with your results
You'll see progress updates as the evaluation runs, showing which test cases are being processed and any issues that arise (like API rate limits or timeout errors).
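The execution steps above can be sketched as a short loop. This is a simplified illustration under stated assumptions: the protocol is a plain dictionary with an "id" and a list of "items", the system is any callable that maps a prompt to a response, and scoring is a case-insensitive exact match. None of these names or structures come from MORI itself.

```python
from statistics import mean


def run_protocol(protocol, complete):
    """Run every item in a protocol and return per-item and raw scores."""
    items = protocol.get("items")
    if not items:
        raise ValueError("protocol has no items")  # basic structure check
    results = []
    for index, item in enumerate(items, start=1):
        response = complete(item["prompt"])  # send the test prompt
        matched = response.strip().lower() == item["expected"].strip().lower()
        score = 1.0 if matched else 0.0
        results.append({"prompt": item["prompt"], "response": response, "score": score})
        print(f"[{index}/{len(items)}] score={score}")  # progress update
    return {
        "protocol": protocol["id"],
        "items": results,
        "raw_score": mean(r["score"] for r in results),
    }


demo = {"id": "F1-demo", "items": [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is 3 + 3?", "expected": "6"},
]}
result = run_protocol(demo, lambda prompt: "4")
print(result["raw_score"])
```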
Step 4: Review the Output
When the evaluation completes, MORI generates a detailed Run record that includes:
- The protocol that was executed
- The system that was tested
- Timestamp and configuration details
- Complete prompt-response pairs for every test case
- Individual scores for each test item
- Aggregated metrics and confidence intervals
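A Run record with the fields listed above might look something like the following when represented in Python and exported to JSON. The RunRecord class, its field names, and the to_json helper are assumptions made for illustration. MORI's actual record schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class RunRecord:
    """Hypothetical container for the output of one evaluation run."""

    protocol: str
    system: str
    timestamp: str
    config: Dict[str, Any]
    items: List[Dict[str, Any]] = field(default_factory=list)
    raw_score: float = 0.0


def to_json(record):
    """Serialize a run record to a stable, human-readable JSON string."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


record = RunRecord(
    protocol="F1",
    system="mock/mock-model",
    timestamp="2024-01-01T00:00:00+00:00",
    config={"temperature": 0.0},
    items=[{"prompt": "p", "response": "r", "score": 1.0}],
    raw_score=1.0,
)
print(to_json(record))
```

Sorting the keys keeps exported files stable across runs, which makes them easier to compare with a diff tool.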
Tips for Your First Run
Start small: Don't try to run all protocols at once. Begin with a single Functional protocol to get familiar with the workflow.
Check your API limits: Some AI providers have rate limits that can slow down or interrupt evaluations. Make sure your account has sufficient quota.
Save your results: MORI can export results in JSON format, making it easy to track evaluations over time or compare different systems.
Use the mock provider for testing: If you want to understand the evaluation flow without using API credits, run a protocol against the mock provider first. It returns synthetic responses that let you see the full process.
Review individual responses: Don't just look at the aggregate score — examine specific prompt-response pairs to understand what the system is actually doing. This context is invaluable for interpreting scores.
Reading Your Scorecard: Scores, Bands, and R-Tiers
After running an evaluation, you'll receive a scorecard: a summary of how the system performed. Think of it as a nutrition label for AI systems, giving you a standardized snapshot of capabilities across multiple dimensions.
Let's break down what you're looking at.
Layer Scores (F, D, R)
The scorecard shows separate scores for each of the three layers:
- F-score: Functional layer performance (0.0 to 1.0)
- D-score: Dynamical layer performance (0.0 to 1.0)
- R-score: Realization layer performance (0.0 to 1.0)
These scores are normalized, meaning 1.0 represents perfect performance on that layer's protocols, while 0.0 indicates no successful demonstrations of the measured properties.
What the Numbers Mean
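As a rough guide, a score of 0.8 or above typically indicates strong performance: the system consistently demonstrates the capability being measured. Note that the performance bands described later in this section draw their upper boundary at 0.75 rather than 0.8.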
A score between 0.5 and 0.8 suggests partial capability. The system shows the property in some contexts but not others, or demonstrates it inconsistently.
A score below 0.5 indicates weak or absent capability. The system rarely exhibits the measured property, or only shows superficial patterns that don't hold up under scrutiny.
Confidence Estimates
Each score comes with a confidence estimate that reflects measurement uncertainty. This is important because AI systems can be unpredictable — the same prompt might elicit different responses on different runs.
Wide confidence intervals suggest the system's performance is inconsistent or that more test cases are needed for a reliable measurement. Narrow intervals indicate stable, reproducible behavior.
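To see why more test cases narrow the interval, it helps to compute one by hand. The sketch below uses a percentile bootstrap over per-item scores. That is a common general-purpose technique and is an assumption here, since the source does not say how MORI derives its confidence estimates. The function name and defaults are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item scores."""
    if not scores:
        raise ValueError("at least one score is required")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    # Resample the scores with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[low_index], means[high_index]


few = bootstrap_ci([1.0, 0.0] * 5)     # 10 test items
many = bootstrap_ci([1.0, 0.0] * 50)   # 100 test items
print("10 items:", few)
print("100 items:", many)
```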
Penalty Adjustments (G, L, H, V)
MORI applies several types of penalties to raw scores, ensuring that results reflect genuine capabilities rather than artifacts:
G-penalties (generalization penalties) are applied when a system performs well on training-like examples but fails on novel variations. This prevents inflated scores from memorization or overfitting to specific prompt patterns.
L-penalties (linguistic penalties) account for cases where a system produces responses that look correct superficially but lack the underlying structure required by the protocol. For example, a system might use the right keywords without demonstrating actual reasoning.
H-penalties (hallucination penalties) reduce scores when a system generates fabricated information or exhibits inconsistencies that undermine its responses.
V-penalties (RLHF-mimicry penalties) are particularly important for the Realization layer. These detect cases where a system has learned to produce responses that sound like they demonstrate agency-relevant properties, but behavioral consistency tests reveal they're just surface-level imitations.
Penalty-adjusted scores give you a more conservative, reliable estimate of true capability.
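The source does not give the formula MORI uses to combine penalties with raw scores. Purely as an illustration, the sketch below assumes the simplest possible rule: subtract each penalty from the raw score and clamp the result to the range 0.0 to 1.0. Both the additive rule and the function name are assumptions.

```python
VALID_PENALTIES = {"G", "L", "H", "V"}


def apply_penalties(raw_score, penalties):
    """Subtract penalties from a raw score and clamp to [0.0, 1.0].

    Assumes penalties combine additively, which is an illustrative choice
    and not a documented MORI formula.
    """
    unknown = set(penalties) - VALID_PENALTIES
    if unknown:
        raise ValueError(f"unknown penalty types: {sorted(unknown)}")
    adjusted = raw_score - sum(penalties.values())
    return max(0.0, min(1.0, adjusted))


print(apply_penalties(0.9, {"G": 0.1, "V": 0.05}))
```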
Performance Bands
Scores are often grouped into performance bands that make interpretation easier:
- High (0.75–1.0): Robust, consistent demonstration of the property
- Medium (0.50–0.75): Moderate capability with some gaps or inconsistencies
- Low (0.25–0.50): Weak or sporadic evidence of the property
- Minimal (0.0–0.25): Little to no demonstration of the capability
Bands provide a quick at-a-glance understanding of performance without getting lost in decimal precision.
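The band thresholds listed above translate directly into a lookup function. One detail the listing leaves open is which band a boundary value such as 0.75 belongs to. The sketch below assumes the lower bound of each band is inclusive, so 0.75 counts as High. The function name is hypothetical.

```python
def performance_band(score):
    """Map a normalized score to its performance band.

    Assumes each band includes its lower bound (0.75 is High, 0.50 is Medium).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.75:
        return "High"
    if score >= 0.50:
        return "Medium"
    if score >= 0.25:
        return "Low"
    return "Minimal"


for value in (0.9, 0.6, 0.3, 0.1):
    print(value, performance_band(value))
```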
R-Tier Breakdown
For the Realization layer, you'll see a tier-by-tier breakdown showing performance at each level:
Tier 1 scores tell you whether the system has detectable feature-level representations for agency-relevant concepts. High Tier 1 scores mean the system has dedicated internal structures; low scores suggest it's using more generic representations.
Tier 2 scores reveal whether those representations are genuine or superficial. A system might have features that look relevant (high Tier 1) but fail probing tests (low Tier 2), indicating the features don't actually encode what they appear to encode.
Tier 3 scores measure behavioral consistency. Even if a system has good internal representations (high Tier 1 and 2), it might fail to use them consistently (low Tier 3), especially when prompts are paraphrased or perturbed.
The tier structure helps you diagnose where a system succeeds or fails in the realization hierarchy.
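The diagnostic reading of the tier breakdown can be written out as a small decision function. The 0.5 cutoff between "low" and "high" and the function name are assumptions for illustration. The messages paraphrase the interpretations given above.

```python
def diagnose_tiers(tier1, tier2, tier3, threshold=0.5):
    """Return the first point in the tier ladder where a system falls short."""
    if tier1 < threshold:
        return "Tier 1: no dedicated feature-level representations detected"
    if tier2 < threshold:
        return "Tier 2: features look relevant but fail probing tests"
    if tier3 < threshold:
        return "Tier 3: representations exist but are not used consistently"
    return "All tiers: realization is consistent across the ladder"


print(diagnose_tiers(0.8, 0.3, 0.7))
```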
Interpreting Your Results
Look for patterns across layers: A system with high F-scores but low R-scores might be solving tasks through shallow pattern matching rather than organized internal models. Conversely, high R-scores with lower F-scores might indicate rich internal structure that isn't fully leveraged for functional tasks.
Pay attention to confidence intervals: Narrow intervals mean you can trust the score; wide intervals mean you should run more test cases or investigate what's causing the variability.
Don't ignore penalties: If you see large penalty adjustments, dig into the individual test cases to understand what's happening. Penalties often reveal important limitations that raw scores miss.
Compare across systems: Scorecards are most valuable when you're comparing multiple systems or tracking a single system over time. Absolute scores matter less than relative patterns and trends.
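Since relative patterns matter more than absolute scores, a per-layer difference between two scorecards is often the most useful view. The sketch below assumes a scorecard can be reduced to a dictionary of layer scores. The function name and that representation are hypothetical.

```python
def compare_scorecards(baseline, candidate):
    """Return candidate minus baseline for every layer both scorecards share."""
    shared_layers = sorted(set(baseline) & set(candidate))
    return {
        layer: round(candidate[layer] - baseline[layer], 4)
        for layer in shared_layers
    }


before = {"F": 0.80, "D": 0.60, "R": 0.40}
after = {"F": 0.70, "D": 0.65, "R": 0.40}
print(compare_scorecards(before, after))
```

A positive value means the candidate scored higher on that layer. Layers missing from either scorecard are skipped rather than treated as zero.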
Next Steps
Congratulations! You've run your first MORI evaluation and learned how to interpret the scorecard. Here's how to deepen your understanding and get more value from the framework.
Expand Your Protocol Coverage
You started with a single Functional protocol, but MORI offers a comprehensive library covering all three layers. Try running:
- F2 (Sequential Branch Management) to see how the system handles multi-step decision scenarios
- D1 (Temporal Coherence) to evaluate consistency across time and context
- R7 (Realization protocols) to probe internal representations and behavioral patterns
Running multiple protocols gives you a more complete picture of system capabilities and reveals strengths and weaknesses that single-protocol evaluations might miss.
Compare Different Systems
MORI's standardized approach makes it easy to compare AI systems from different providers. Try evaluating:
- Different model versions from the same provider (e.g., GPT-4 vs. GPT-5)
- Competing models from different providers (e.g., Claude Opus vs. Gemini Pro)
- Open-source vs. proprietary systems
Comparative evaluations help you understand the current landscape of AI capabilities and make informed decisions about which systems to use for specific applications.
Track Changes Over Time
If you're developing or fine-tuning AI systems, use MORI to track how changes affect agency-relevant properties. Run evaluations:
- Before and after training interventions to measure impact
- Across different checkpoints to see how properties emerge during training
- After architectural modifications to understand which components contribute to which capabilities
MORI's quantitative scores make it possible to detect subtle changes that qualitative assessment might miss.
Aggregate Multiple Runs
For production systems or research applications, you'll want to combine results from multiple evaluation runs into system-level scorecards. MORI's aggregation capabilities let you:
- Pool results across different protocols
- Weight different layers based on your priorities
- Handle partial coverage when some protocols aren't applicable to your system
- Generate confidence intervals that account for measurement uncertainty
Aggregated scorecards provide the most reliable and comprehensive assessment of system capabilities.
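Layer weighting with partial coverage can be sketched as a weighted mean that renormalizes over the layers that were actually evaluated. This is one reasonable way to handle missing layers and is an assumption here, as are the function name and the use of None to mark an uncovered layer.

```python
def aggregate_layers(layer_scores, weights):
    """Weighted mean over covered layers, skipping layers scored as None."""
    covered = {k: v for k, v in layer_scores.items() if v is not None}
    if not covered:
        raise ValueError("no layers were evaluated")
    total_weight = sum(weights[layer] for layer in covered)
    if total_weight <= 0:
        raise ValueError("weights for covered layers must sum to more than 0")
    # Renormalize by the weight of covered layers only.
    return sum(weights[k] * v for k, v in covered.items()) / total_weight


weights = {"F": 1.0, "D": 1.0, "R": 2.0}
print(aggregate_layers({"F": 0.8, "D": 0.6, "R": 0.4}, weights))
print(aggregate_layers({"F": 0.8, "D": 0.6, "R": None}, weights))
```

Renormalizing avoids penalizing a system for a layer that could not be measured, though it also means scorecards with different coverage are not directly comparable.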
Dive Deeper into R-Tier Methodology
The Realization layer is MORI's most sophisticated component, and understanding its three-tier structure opens up advanced evaluation possibilities:
- Feature inspection (Tier 1) requires systems that expose internal representations, but when available, it provides the most direct measurement of how agency-relevant properties are implemented
- Probing analysis (Tier 2) helps you verify that internal representations are genuine rather than superficial
- Behavioral consistency testing (Tier 3) can be applied to any system, making it a versatile tool for detecting RLHF-mimicry and other forms of shallow capability
If you're working with systems where you have access to internal representations, exploring the full R-tier methodology gives you the most direct view of how agency-relevant properties are realized.
Contribute to the Protocol Library
MORI's protocol system is extensible. As you become more familiar with the framework, you might identify new dimensions of agency-relevant properties that aren't covered by existing protocols. The community benefits when researchers and developers share novel evaluation scenarios.
Whether you're measuring a single system or conducting large-scale comparative studies, MORI provides the tools you need for rigorous, reproducible evaluation of agency-relevant properties in AI systems. Your first evaluation is just the beginning of a deeper exploration into what makes AI systems tick.