4. Understanding Results
LevelApp returns structured scores and justifications.
1. Key fields in the response
| Field | Meaning |
|---|---|
| `match_level` | 0–5 grade from the LLM evaluator (Poor → Excellent) |
| `justification` | The LLM's human-readable explanation of the score |
| `inputTokens` | Number of input tokens used during evaluation |
| `outputTokens` | Number of output tokens generated during evaluation |
| `total_cost` | Cost of the evaluation (specific to OpenAI) |
| `execution_time` | Time taken to complete the evaluation (in seconds) |
| `average_scores` | Aggregated scores across all scenarios and attempts |
2. Example snippet
```json
{
  "scenarios": [
    {
      "scenario_id": "1234-5678",
      "attempts": [
        {
          "attempt_id": "1",
          "conversation_id": "batch-1",
          "interactions": [
            {
              "user_message": "What is IONOS?",
              "agent_reply": "IONOS is a web hosting and cloud services company.",
              "reference_reply": "IONOS is a cloud provider in Europe based in Germany",
              "evaluation_results": {
                "openai": {
                  "match_level": "0",
                  "justification": "The agent's output provides extensive information but misses key points.",
                  "metadata": {
                    "inputTokens": "482",
                    "outputTokens": "71"
                  }
                },
                "ionos": {
                  "match_level": "1",
                  "justification": "The agent's output captures the main idea but lacks precision.",
                  "metadata": {
                    "inputTokens": "387",
                    "outputTokens": "59"
                  }
                }
              }
            }
          ],
          "execution_time": "11.01"
        }
      ]
    }
  ],
  "average_scores": {}
}
```
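If you want to post-process these results yourself, a minimal sketch like the one below can walk the nested structure and pull out the per-interaction scores and token counts. It assumes the response has already been parsed into a Python dictionary (for example with `json.load` from a saved file; the `batch_results.json` filename is just a placeholder), and note that `match_level` and the token counts arrive as strings in the snippet above, so they are cast to integers before use.

```python
import json

def summarize_results(response: dict) -> None:
    """Walk the scenarios -> attempts -> interactions tree and print scores."""
    total_in, total_out = 0, 0

    for scenario in response.get("scenarios", []):
        for attempt in scenario.get("attempts", []):
            for interaction in attempt.get("interactions", []):
                print(f"User: {interaction['user_message']}")
                for evaluator, result in interaction.get("evaluation_results", {}).items():
                    # match_level and token counts are strings in the example response
                    level = int(result["match_level"])
                    meta = result.get("metadata", {})
                    total_in += int(meta.get("inputTokens", 0))
                    total_out += int(meta.get("outputTokens", 0))
                    print(f"  [{evaluator}] match_level={level}: {result['justification']}")

    print(f"Tokens used: {total_in} input / {total_out} output")

# Example usage with a saved batch result (placeholder filename):
with open("batch_results.json") as f:
    summarize_results(json.load(f))
```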
Explanation of the Output

- Top-Level Fields:
  - `scenarios`: A list of the scenarios evaluated in the batch. Each scenario contains its own `scenario_id` and evaluation details.
  - `average_scores`: Aggregated scores across all scenarios (empty in this example).
- Scenario-Level Fields:
  - `scenario_id`: A unique identifier for the scenario being evaluated.
  - `attempts`: A list of attempts made for this scenario. Each attempt contains detailed evaluation results.
- Attempt-Level Fields:
  - `attempt_id`: A unique identifier for the attempt.
  - `conversation_id`: The ID of the conversation or batch being evaluated.
  - `execution_time`: The time taken to complete the evaluation (in seconds).
- Interaction-Level Fields:
  - `user_message`: The input message from the user.
  - `agent_reply`: The response generated by the agent.
  - `reference_reply`: The expected or ideal response used for comparison.
  - `interaction_type`: The type of interaction (omitted from the snippet above).
  - `reference_metadata`: Metadata about the reference reply, such as intent and sentiment (omitted from the snippet above).
  - `generated_metadata`: Metadata generated during the evaluation (omitted from the snippet above).
- Evaluation Results:
  - `openai` and `ionos`: Results from the different evaluator providers.
  - `match_level`: A score indicating how well the agent's reply matches the reference reply.
  - `justification`: A detailed explanation of the score.
  - `metadata`: Additional details about the evaluation, such as token usage and cost.
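LevelApp populates `average_scores` itself, but when it is empty (as in this example) you can derive a comparable rollup from the nested fields on your own. The sketch below is only an illustration under the assumption that a plain mean of the integer `match_level` values per evaluator is the rollup you want; it is not necessarily the exact aggregation LevelApp performs.

```python
from collections import defaultdict

def average_match_levels(response: dict) -> dict:
    """Average match_level per evaluator across all scenarios, attempts, and interactions."""
    levels_by_evaluator = defaultdict(list)

    for scenario in response.get("scenarios", []):
        for attempt in scenario.get("attempts", []):
            for interaction in attempt.get("interactions", []):
                for evaluator, result in interaction.get("evaluation_results", {}).items():
                    levels_by_evaluator[evaluator].append(int(result["match_level"]))

    return {
        evaluator: sum(levels) / len(levels)
        for evaluator, levels in levels_by_evaluator.items()
    }

# For the snippet above this yields {'openai': 0.0, 'ionos': 1.0}.
```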
3. Going further
- Missing fields? Ensure your `generated_metadata` is populated in the test batch.
- Empty results? Confirm your evaluator provider is configured (`openai` or `ionos`).
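Building on those checks, a small validation pass can surface such gaps quickly. The helper below is a hypothetical sketch rather than part of LevelApp: it flags interactions that have no evaluation results at all, or that are missing a given provider's entry.

```python
def find_gaps(response: dict, providers=("openai", "ionos")) -> list[str]:
    """Return human-readable warnings for interactions with missing evaluator output."""
    warnings = []
    for scenario in response.get("scenarios", []):
        for attempt in scenario.get("attempts", []):
            for interaction in attempt.get("interactions", []):
                results = interaction.get("evaluation_results") or {}
                if not results:
                    warnings.append(
                        f"Scenario {scenario['scenario_id']}, attempt {attempt['attempt_id']}: "
                        "no evaluation results"
                    )
                    continue
                for provider in providers:
                    if provider not in results:
                        warnings.append(
                            f"Scenario {scenario['scenario_id']}, attempt {attempt['attempt_id']}: "
                            f"missing '{provider}' results"
                        )
    return warnings
```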
When you’re comfortable here, dive into the Guides for advanced batch creation, custom metrics, and CI/CD.