Evaluation

Install the CLI:

pip install rakam-systems-cli

The evaluation service must be running to use evaluation features. Contact us if you need help setting it up.

Configure access to the evaluation service in your .env:

# Evaluation service access
EVALFRAMEWORK_URL="http://eval-service-url.com" # URL of the evaluation service
EVALFRAMEWORK_API_KEY="your-api-token" # Generate from the /docs Swagger UI

Write an evaluation function

Create an eval/ directory in your project and add evaluation functions decorated with @eval_run. Each function returns an EvalConfig or SchemaEvalConfig.

Text evaluation

# eval/examples.py
from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    EvalConfig,
    TextInputItem,
    ClientSideMetricConfig,
    ToxicityConfig,
)

@eval_run
def test_simple_text_eval():
    """A simple text evaluation showcasing a basic client-side metric."""
    return EvalConfig(
        component="text_component_1",
        label="demo_simple_text",
        data=[
            TextInputItem(
                id="txt_001",
                input="Hello world",
                output="Hello world",
                expected_output="Hello world",
                metrics=[ClientSideMetricConfig(name="relevance", score=1)],
            )
        ],
        metrics=[ToxicityConfig(name="toxicity_demo", include_reason=False)],
    )

Available text metrics: CorrectnessConfig, AnswerRelevancyConfig, FaithfulnessConfig, ToxicityConfig.
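Multiple text metrics can be attached to a single run. A sketch, assuming the other metric configs accept the same `name` keyword shown above for ToxicityConfig (check the schema module for each config's exact parameters):

```python
from rakam_systems_tools.evaluation.schema import (
    EvalConfig,
    AnswerRelevancyConfig,
    CorrectnessConfig,
    FaithfulnessConfig,
)

# Sketch only: parameter names beyond `name` are assumptions.
config = EvalConfig(
    component="text_component_1",
    label="multi_metric_demo",
    data=[...],  # TextInputItems as in the example above
    metrics=[
        CorrectnessConfig(name="correctness"),
        AnswerRelevancyConfig(name="relevance"),
        FaithfulnessConfig(name="faithfulness"),
    ],
)
```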

Schema evaluation

from rakam_systems_cli.decorators import eval_run
from rakam_systems_tools.evaluation.schema import (
    SchemaEvalConfig,
    SchemaInputItem,
    JsonCorrectnessConfig,
)

@eval_run
def test_json_output():
    """Validate the JSON structure of model outputs."""
    return SchemaEvalConfig(
        component="json-generator",
        label="json_validation",
        data=[
            SchemaInputItem(
                input="Generate a JSON object with name and age.",
                output='{"name": "John", "age": 30}',
            )
        ],
        metrics=[
            JsonCorrectnessConfig(
                expected_schema={
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "number"},
                    },
                }
            )
        ],
    )

Available schema metrics: JsonCorrectnessConfig, FieldsPresenceConfig.
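Conceptually, JsonCorrectnessConfig asks whether the model's output parses as JSON and conforms to the expected schema. A self-contained illustration of that idea (not the library's implementation):

```python
import json

def matches_schema(output: str, schema: dict) -> bool:
    """Minimal check in the spirit of JsonCorrectnessConfig: the output
    must parse as JSON, and each declared property that is present must
    have the declared primitive type."""
    type_map = {"string": str, "number": (int, float), "object": dict}
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    if schema.get("type") == "object" and not isinstance(obj, dict):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key in obj and not isinstance(obj[key], type_map[spec["type"]]):
            return False
    return True

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "number"}},
}
print(matches_schema('{"name": "John", "age": 30}', schema))  # True
print(matches_schema('{"name": 42}', schema))                 # False
```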

Client-side metrics

Log metrics calculated in your own code. These are sent alongside input data without server-side evaluation:

TextInputItem(
    input="User review",
    output="I am happy with this product.",
    metrics=[
        ClientSideMetricConfig(
            name="sentiment",
            score=1.0,
            reason="The user expressed a positive sentiment.",
        )
    ],
)

Pass an empty list to metrics in EvalConfig to skip server-side evaluation.
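Putting the two together, a client-side-only configuration might look like this (a sketch built from the pieces shown above):

```python
from rakam_systems_tools.evaluation.schema import (
    EvalConfig,
    TextInputItem,
    ClientSideMetricConfig,
)

# An empty top-level `metrics` list skips server-side evaluation;
# only the client-side metrics attached to each item are logged.
config = EvalConfig(
    component="text_component_1",
    label="client_side_only",
    data=[
        TextInputItem(
            input="User review",
            output="I am happy with this product.",
            metrics=[ClientSideMetricConfig(name="sentiment", score=1.0)],
        )
    ],
    metrics=[],  # no server-side evaluation
)
```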

Probabilistic evaluation

Use maybe_* methods to run evaluations on a sample of requests, reducing load on the evaluation service:

from rakam_systems_tools.evaluation import DeepEvalClient

client = DeepEvalClient()

# Runs approximately 10% of the time
client.maybe_text_eval(data=data, metrics=metrics, chance=0.1)
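The sampling behavior amounts to a weighted coin flip per call. A hypothetical sketch of the pattern (not the library's code):

```python
import random

def maybe_run(eval_fn, chance: float, *args, **kwargs):
    """Run eval_fn with probability `chance`, otherwise skip.
    Illustrative stand-in for the maybe_* pattern."""
    if random.random() < chance:
        return eval_fn(*args, **kwargs)
    return None  # skipped this time

random.seed(0)
sampled = sum(1 for _ in range(1000) if maybe_run(lambda: True, chance=0.1))
print(sampled)  # about 100 of 1000 calls actually run
```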

Error handling

By default, the evaluation client returns a dictionary with an "error" key on failure. Set raise_exception=True to raise instead:

import requests

from rakam_systems_tools.evaluation import DeepEvalClient

client = DeepEvalClient()

try:
    result = client.text_eval(data=data, metrics=metrics, raise_exception=True)
except requests.RequestException as e:
    print(f"An error occurred: {e}")
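The default dict-returning mode can be handled without a try/except. A self-contained stand-in illustrating the error contract described above (`call_eval` is a hypothetical function, not part of the library):

```python
def call_eval(fail: bool, raise_exception: bool = False) -> dict:
    """Stand-in for the client's error contract: return a dict with an
    "error" key by default, raise only when raise_exception=True."""
    if fail:
        if raise_exception:
            raise RuntimeError("evaluation service unreachable")
        return {"error": "evaluation service unreachable"}
    return {"status": "ok"}

result = call_eval(fail=True)
if "error" in result:
    print(f"Evaluation failed: {result['error']}")
```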

Run evaluations

Execute evaluations

The run command discovers and executes all @eval_run-decorated functions in the target directory:

# Run all evaluations in ./eval (default)
rakam eval run

# Run from a different directory
rakam eval run path/to/evals

# Search subdirectories recursively
rakam eval run --recursive

# Preview which functions would run without executing them
rakam eval run --dry-run

# Save each run result to a local JSON file
rakam eval run --save-runs --output-dir ./eval_runs
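With --save-runs, each run is written to the output directory. A sketch for loading the saved files back for offline analysis, assuming one JSON document per file (the exact payload shape is defined by the service):

```python
import json
from pathlib import Path

def load_runs(output_dir: str) -> list:
    """Load every run file saved by `rakam eval run --save-runs`
    from the given --output-dir directory, in filename order."""
    runs = []
    for path in sorted(Path(output_dir).glob("*.json")):
        runs.append(json.loads(path.read_text()))
    return runs
```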

Example dry-run output:

📄 eval/quality.py
▶ test_answer_relevance
🧪 Dry-run OK → text_eval
▶ test_json_output
🧪 Dry-run OK → schema_eval

📄 eval/safety.py
▶ test_toxicity
🧪 Dry-run OK → text_eval

View results

Show the details of a specific run, or the most recent one by default:

# Show the most recent run
rakam eval show

# Show a specific run by ID
rakam eval show --id 42

# Show a run by tag
rakam eval show --tag baseline-v1

# Output raw JSON (useful for scripting)
rakam eval show --raw

Compare runs

Compare two evaluation runs to track quality changes between iterations. Provide exactly two targets using --id or --tag:

# Compare two runs by ID
rakam eval compare --id 42 --id 45

# Compare a run by ID with a tagged run
rakam eval compare --id 42 --tag baseline-v1

# Show a summary diff only (reduced output)
rakam eval compare --id 42 --id 45 --summary

# Show a side-by-side diff
rakam eval compare --id 42 --id 45 --side-by-side

Example summary output:

Summary:
| Status      | # | Metrics                |
|-------------|---|------------------------|
| ↑ Improved  | 2 | relevance, correctness |
| ↓ Regressed | 1 | faithfulness           |
| ± Unchanged | 1 | toxicity               |
| + Added     | 0 | -                      |
| - Removed   | 0 | -                      |

The default compare mode produces a unified diff of the full run payloads. Use --summary for a quick overview of what improved or regressed.
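The summary categories amount to a per-metric diff of scores between the two runs. A simplified sketch, assuming each run exposes a metric-name to score mapping:

```python
def summarize(before: dict, after: dict) -> dict:
    """Bucket metrics into the categories shown in the summary table."""
    summary = {"improved": [], "regressed": [], "unchanged": [],
               "added": [], "removed": []}
    for name in sorted(set(before) | set(after)):
        if name not in before:
            summary["added"].append(name)
        elif name not in after:
            summary["removed"].append(name)
        elif after[name] > before[name]:
            summary["improved"].append(name)
        elif after[name] < before[name]:
            summary["regressed"].append(name)
        else:
            summary["unchanged"].append(name)
    return summary

print(summarize({"relevance": 0.7, "toxicity": 0.1},
                {"relevance": 0.9, "toxicity": 0.1}))
```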

Tag runs

Assign human-readable tags to runs for easier reference in show and compare:

# Assign a tag to a run
rakam eval tag --id 42 --tag baseline-v1
✅ Tag assigned successfully
Run ID: 42
Tag: baseline-v1

# Delete a tag
rakam eval tag --delete baseline-v1

Tags let you compare named checkpoints (e.g., --tag baseline-v1 --tag after-prompt-update) instead of remembering numeric IDs.

List runs and evaluations

# List recent runs (newest first, default 20)
rakam eval list runs

# List more runs
rakam eval list runs --limit 50

# List all @eval_run functions discovered in ./eval
rakam eval list evals

# List all metric types used across evaluation functions
rakam eval metrics list

Example list runs output:

[id] tag                 label            created_at
[45] after-prompt-update demo_simple_text 2025-01-15 14:32:10
[44] -                   json_validation  2025-01-15 14:30:05
[42] baseline-v1         demo_simple_text 2025-01-14 09:15:22
[41] -                   toxicity_check   2025-01-14 09:12:00