Evaluation
Install the CLI:
pip install rakam-systems-cli
The evaluation service must be running to use evaluation features. Contact us if you need help setting it up.
Configure access to the evaluation service in your .env:
# Evaluation service access
EVALFRAMEWORK_URL="http://eval-service-url.com" # URL of the evaluation service
EVALFRAMEWORK_API_KEY="your-api-token" # Generate from the /docs Swagger UI
Write an evaluation function
Create an eval/ directory in your project and add evaluation functions decorated with @eval_run. Each function returns an EvalConfig or SchemaEvalConfig.
Text evaluation
# eval/examples.py
from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    EvalConfig,
    TextInputItem,
    ClientSideMetricConfig,
    ToxicityConfig,
    CorrectnessConfig,
)

@eval_run
def test_simple_text_eval():
    """A simple text evaluation showcasing a basic client-side metric."""
    return EvalConfig(
        component="text_component_1",
        label="demo_simple_text",
        data=[
            TextInputItem(
                id="txt_001",
                input="Hello world",
                output="Hello world",
                expected_output="Hello world",
                metrics=[ClientSideMetricConfig(name="relevance", score=1)],
            )
        ],
        metrics=[ToxicityConfig(name="toxicity_demo", include_reason=False)],
    )
Available text metrics: CorrectnessConfig, AnswerRelevancyConfig, FaithfulnessConfig, ToxicityConfig.
Schema evaluation
from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    SchemaEvalConfig,
    SchemaInputItem,
    JsonCorrectnessConfig,
)

@eval_run
def test_json_output():
    """Validate JSON structure of model outputs."""
    return SchemaEvalConfig(
        component="json-generator",
        label="json_validation",
        data=[
            SchemaInputItem(
                input="Generate a JSON object with name and age.",
                output='{"name": "John", "age": 30}',
            )
        ],
        metrics=[
            JsonCorrectnessConfig(
                expected_schema={
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "number"},
                    },
                }
            )
        ],
    )
Available schema metrics: JsonCorrectnessConfig, FieldsPresenceConfig.
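FieldsPresenceConfig is not shown above; here is a minimal sketch, assuming it accepts the list of required field names (that parameter is an assumption, so check the schema module for the actual signature):

from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    SchemaEvalConfig,
    SchemaInputItem,
    FieldsPresenceConfig,
)

@eval_run
def test_required_fields():
    """Check that the required fields appear in the JSON output."""
    return SchemaEvalConfig(
        component="json-generator",
        label="fields_presence",
        data=[
            SchemaInputItem(
                input="Generate a JSON object with name and age.",
                output='{"name": "John", "age": 30}',
            )
        ],
        metrics=[
            # Assumed argument name; verify against the library before relying on it.
            FieldsPresenceConfig(fields=["name", "age"])
        ],
    )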
Client-side metrics
Log metrics calculated in your own code. These are sent alongside input data without server-side evaluation:
TextInputItem(
    input="User review",
    output="I am happy with this product.",
    metrics=[
        ClientSideMetricConfig(
            name="sentiment",
            score=1.0,
            reason="The user expressed a positive sentiment.",
        )
    ],
)
Pass an empty list to metrics in EvalConfig to skip server-side evaluation.
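For example, a minimal evaluation that only records a locally computed score could look like this (the component, label, and field values here are illustrative):

from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    EvalConfig,
    TextInputItem,
    ClientSideMetricConfig,
)

@eval_run
def test_client_side_only():
    """Log a client-side score without triggering server-side evaluation."""
    return EvalConfig(
        component="text_component_1",
        label="client_side_only",
        data=[
            TextInputItem(
                id="txt_002",
                input="User review",
                output="I am happy with this product.",
                metrics=[ClientSideMetricConfig(name="sentiment", score=1.0)],
            )
        ],
        metrics=[],  # empty list: no server-side metrics are run
    )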
Probabilistic evaluation
Use maybe_* methods to run evaluations on a sample of requests, reducing load on the evaluation service:
from rakam_systems_tools.evaluation import DeepEvalClient
client = DeepEvalClient()
# Runs approximately 10% of the time
client.maybe_text_eval(data=data, metrics=metrics, chance=0.1)
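For instance, you might call it from your application's request path so that only a fraction of live traffic is evaluated. The handler and generation function below are hypothetical stand-ins for your own code:

from rakam_systems_tools.evaluation import DeepEvalClient
from rakam_systems_tools.evaluation.schema import TextInputItem, ToxicityConfig

client = DeepEvalClient()

def generate_answer(question: str) -> str:
    # Placeholder for your own model or pipeline call.
    return "example answer"

def answer_and_maybe_eval(question: str) -> str:
    answer = generate_answer(question)
    # Evaluate roughly 5% of requests to keep load on the eval service low.
    client.maybe_text_eval(
        data=[TextInputItem(input=question, output=answer)],
        metrics=[ToxicityConfig(name="toxicity", include_reason=False)],
        chance=0.05,
    )
    return answer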
Error handling
By default, the evaluation client returns a dictionary with an "error" key on failure. Set raise_exception=True to raise instead:
import requests

from rakam_systems_tools.evaluation import DeepEvalClient

client = DeepEvalClient()

try:
    result = client.text_eval(data=data, metrics=metrics, raise_exception=True)
except requests.RequestException as e:
    print(f"An error occurred: {e}")
Run evaluations
Execute evaluations
The run command discovers and executes all @eval_run-decorated functions in the target directory:
# Run all evaluations in ./eval (default)
rakam eval run
# Run from a different directory
rakam eval run path/to/evals
# Search subdirectories recursively
rakam eval run --recursive
# Preview which functions would run without executing them
rakam eval run --dry-run
# Save each run result to a local JSON file
rakam eval run --save-runs --output-dir ./eval_runs
Example dry-run output:
📄 eval/quality.py
  ▶ test_answer_relevance
    🧪 Dry-run OK → text_eval
  ▶ test_json_output
    🧪 Dry-run OK → schema_eval
📄 eval/safety.py
  ▶ test_toxicity
    🧪 Dry-run OK → text_eval
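If you pass --save-runs, each run result is written as a JSON file to the --output-dir directory, which makes post-processing straightforward. A short sketch (the file naming and payload structure depend on your runs):

import json
from pathlib import Path

# Directory passed to --output-dir above.
for path in sorted(Path("./eval_runs").glob("*.json")):
    with path.open() as f:
        run = json.load(f)
    # Inspect whatever fields your run payloads contain.
    print(path.name, list(run.keys()) if isinstance(run, dict) else type(run))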
View results
Show the details of a specific run, or the most recent one by default:
# Show the most recent run
rakam eval show
# Show a specific run by ID
rakam eval show --id 42
# Show a run by tag
rakam eval show --tag baseline-v1
# Output raw JSON (useful for scripting)
rakam eval show --raw
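Because --raw prints plain JSON, the output is easy to consume from a script, for example:

import json
import subprocess

# Fetch the most recent run as raw JSON and parse it.
proc = subprocess.run(
    ["rakam", "eval", "show", "--raw"],
    capture_output=True,
    text=True,
    check=True,
)
run = json.loads(proc.stdout)
print(list(run.keys()) if isinstance(run, dict) else type(run))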
Compare runs
Compare two evaluation runs to track quality changes between iterations. Provide exactly two targets using --id or --tag:
# Compare two runs by ID
rakam eval compare --id 42 --id 45
# Compare a run by ID with a tagged run
rakam eval compare --id 42 --tag baseline-v1
# Show a summary diff only (reduced output)
rakam eval compare --id 42 --id 45 --summary
# Show a side-by-side diff
rakam eval compare --id 42 --id 45 --side-by-side
Example summary output:
Summary:
| Status      | # | Metrics                |
|-------------|---|------------------------|
| ↑ Improved  | 2 | relevance, correctness |
| ↓ Regressed | 1 | faithfulness           |
| ± Unchanged | 1 | toxicity               |
| + Added     | 0 | -                      |
| - Removed   | 0 | -                      |
The default compare mode produces a unified diff of the full run payloads. Use --summary for a quick overview of what improved or regressed.
Tag runs
Assign human-readable tags to runs for easier reference in show and compare:
# Assign a tag to a run
rakam eval tag --id 42 --tag baseline-v1
# Delete a tag
rakam eval tag --delete baseline-v1
Example output:
✅ Tag assigned successfully
Run ID: 42
Tag: baseline-v1
Tags let you compare named checkpoints (e.g., --tag baseline-v1 --tag after-prompt-update) instead of remembering numeric IDs.
List runs and evaluations
# List recent runs (newest first, default 20)
rakam eval list runs
# List more runs
rakam eval list runs --limit 50
# List all @eval_run functions discovered in ./eval
rakam eval list evals
# List all metric types used across evaluation functions
rakam eval metrics list
Example list runs output:
[id]  tag                  label             created_at
[45]  after-prompt-update  demo_simple_text  2025-01-15 14:32:10
[44]  -                    json_validation   2025-01-15 14:30:05
[42]  baseline-v1          demo_simple_text  2025-01-14 09:15:22
[41]  -                    toxicity_check    2025-01-14 09:12:00