
4. Understanding Results

LevelApp returns structured scores and justifications.

1. Key fields in the response

Field | Meaning
--- | ---
match_level | 0–5 grade from the LLM evaluator (Poor → Excellent)
justification | The LLM’s human-readable explanation of the score
inputTokens | Number of input tokens used during evaluation
outputTokens | Number of output tokens generated during evaluation
total_cost | Cost of the evaluation (specific to OpenAI)
execution_time | Time taken to complete the evaluation (in seconds)
average_scores | Aggregated scores across all scenarios and attempts
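Once the response is loaded as JSON, these fields can be read directly from each interaction’s evaluation_results. The following is a minimal sketch, assuming the response has been saved to a file named results.json (a hypothetical path):

import json

# Load a saved LevelApp evaluation response (hypothetical file name).
with open("results.json") as f:
    results = json.load(f)

# Pick the first interaction of the first attempt of the first scenario.
interaction = results["scenarios"][0]["attempts"][0]["interactions"][0]

for provider, result in interaction["evaluation_results"].items():
    meta = result.get("metadata", {})
    print(f"{provider}: match_level={result['match_level']}")
    print(f"  justification: {result['justification']}")
    print(f"  tokens: {meta.get('inputTokens')} in / {meta.get('outputTokens')} out")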

2. Example snippet

{
  "scenarios": [
    {
      "scenario_id": "1234-5678",
      "attempts": [
        {
          "attempt_id": "1",
          "conversation_id": "batch-1",
          "interactions": [
            {
              "user_message": "What is IONOS?",
              "agent_reply": "IONOS is a web hosting and cloud services company.",
              "reference_reply": "IONOS is a cloud provider in Europe based in Germany",
              "evaluation_results": {
                "openai": {
                  "match_level": "0",
                  "justification": "The agent's output provides extensive information but misses key points.",
                  "metadata": {
                    "inputTokens": "482",
                    "outputTokens": "71"
                  }
                },
                "ionos": {
                  "match_level": "1",
                  "justification": "The agent's output captures the main idea but lacks precision.",
                  "metadata": {
                    "inputTokens": "387",
                    "outputTokens": "59"
                  }
                }
              }
            }
          ],
          "execution_time": "11.01"
        }
      ]
    }
  ]
}
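Because results are nested as scenarios → attempts → interactions, a typical consumer loops over all three levels. A minimal sketch, assuming the response above has already been parsed into a dictionary called results:

# Walk the nested structure: scenarios -> attempts -> interactions.
for scenario in results.get("scenarios", []):
    print(f"Scenario {scenario['scenario_id']}")
    for attempt in scenario.get("attempts", []):
        print(f"  Attempt {attempt['attempt_id']} "
              f"(conversation {attempt['conversation_id']}, "
              f"{attempt['execution_time']}s)")
        for interaction in attempt.get("interactions", []):
            for provider, result in interaction["evaluation_results"].items():
                print(f"    [{provider}] match_level={result['match_level']}: "
                      f"{result['justification']}")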

Explanation of the Output

  1. Top-Level Fields:

    • scenarios: A list of scenarios evaluated in the batch. Each scenario contains its own scenario_id and evaluation details.
    • average_scores: Aggregated scores across all scenarios (omitted from the trimmed example above; a sketch for computing something similar follows this list).
  2. Scenario-Level Fields:

    • scenario_id: A unique identifier for the scenario being evaluated.
    • attempts: A list of attempts made for this scenario. Each attempt contains detailed evaluation results.
  3. Attempt-Level Fields:

    • attempt_id: A unique identifier for the attempt.
    • conversation_id: The ID of the conversation or batch being evaluated.
    • execution_time: The time taken to complete the evaluation (in seconds).
  4. Interaction-Level Fields:

    • user_message: The input message from the user.
    • agent_reply: The response generated by the agent.
    • reference_reply: The expected or ideal response for comparison.
    • interaction_type: The type of interaction (e.g., “None”).
    • reference_metadata: Metadata about the reference reply, such as intent and sentiment.
    • generated_metadata: Metadata generated during the evaluation. (These three fields are omitted from the trimmed example above.)
  5. Evaluation Results:

    • openai and ionos: Results from different evaluators.
      • match_level: A score indicating how well the agent’s reply matches the reference reply.
      • justification: A detailed explanation of the score.
      • metadata: Additional details about the evaluation, such as token usage and cost.
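To reproduce something like average_scores yourself, you can aggregate match_level per evaluator across all scenarios, attempts, and interactions. A sketch under the same assumptions as the snippets above; note that match_level arrives as a string, so it is converted before averaging:

from collections import defaultdict

# Collect match_level values per evaluator across the whole batch.
levels = defaultdict(list)
for scenario in results.get("scenarios", []):
    for attempt in scenario.get("attempts", []):
        for interaction in attempt.get("interactions", []):
            for provider, result in interaction["evaluation_results"].items():
                # match_level is returned as a string, e.g. "0" or "1".
                levels[provider].append(int(result["match_level"]))

for provider, scores in levels.items():
    print(f"{provider}: average match_level = {sum(scores) / len(scores):.2f}")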

3. Going further

  • Missing fields? Ensure your generated_metadata is populated in the test batch.
  • Empty results? Confirm your evaluator provider is configured (openai or ionos).

When you’re comfortable here, dive into the Guides for advanced batch creation, custom metrics, and CI/CD.