
4. Understanding Results

LevelApp returns structured scores and justifications.

1. Key fields in the response

Field | Meaning
--- | ---
match_level | 0–5 grade from the LLM evaluator (Poor → Excellent)
justification | The LLM’s human-readable explanation of the score
inputTokens | Number of input tokens used during evaluation
outputTokens | Number of output tokens generated during evaluation
total_cost | Cost of the evaluation (specific to OpenAI)
execution_time | Time taken to complete the evaluation (in seconds)
average_scores | Aggregated scores across all scenarios and attempts
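Once the response is loaded as JSON, these fields can be read directly from each interaction’s evaluation_results. The following is a minimal sketch, assuming the response has been saved to a file named results.json (a hypothetical path):

import json

# Load a saved LevelApp evaluation response (hypothetical file name).
with open("results.json") as f:
    results = json.load(f)

# Pick the first interaction of the first attempt of the first scenario.
interaction = results["scenarios"][0]["attempts"][0]["interactions"][0]

for provider, result in interaction["evaluation_results"].items():
    meta = result.get("metadata", {})
    print(f"{provider}: match_level={result['match_level']}")
    print(f"  justification: {result['justification']}")
    print(f"  tokens: {meta.get('inputTokens')} in / {meta.get('outputTokens')} out")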

2. Example snippet

{
  "scenarios": [
    {
      "scenario_id": "1234-5678",
      "attempts": [
        {
          "attempt_id": "1",
          "conversation_id": "batch-1",
          "interactions": [
            {
              "user_message": "What is IONOS?",
              "agent_reply": "IONOS is a web hosting and cloud services company.",
              "reference_reply": "IONOS is a cloud provider in Europe based in Germany",
              "evaluation_results": {
                "openai": {
                  "match_level": "0",
                  "justification": "The agent's output provides extensive information but misses key points.",
                  "metadata": {
                    "inputTokens": "482",
                    "outputTokens": "71"
                  }
                },
                "ionos": {
                  "match_level": "1",
                  "justification": "The agent's output captures the main idea but lacks precision.",
                  "metadata": {
                    "inputTokens": "387",
                    "outputTokens": "59"
                  }
                }
              }
            }
          ],
          "execution_time": "11.01"
        }
      ]
    }
  ]
}
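Because results are nested as scenarios → attempts → interactions, a typical consumer loops over all three levels. A minimal sketch, assuming the response above has already been parsed into a dictionary called results:

# Walk the nested structure: scenarios -> attempts -> interactions.
for scenario in results.get("scenarios", []):
    print(f"Scenario {scenario['scenario_id']}")
    for attempt in scenario.get("attempts", []):
        print(f"  Attempt {attempt['attempt_id']} "
              f"(conversation {attempt['conversation_id']}, "
              f"{attempt['execution_time']}s)")
        for interaction in attempt.get("interactions", []):
            for provider, result in interaction["evaluation_results"].items():
                print(f"    [{provider}] match_level={result['match_level']}: "
                      f"{result['justification']}")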

Explanation of the Output

  1. Top-Level Fields:

    • scenarios: A list of scenarios evaluated in the batch. Each scenario contains its own scenario_id and evaluation details.
    • average_scores: Aggregated scores across all scenarios (omitted from the trimmed example above; a sketch for computing something similar follows this list).
  2. Scenario-Level Fields:

    • scenario_id: A unique identifier for the scenario being evaluated.
    • attempts: A list of attempts made for this scenario. Each attempt contains detailed evaluation results.
  3. Attempt-Level Fields:

    • attempt_id: A unique identifier for the attempt.
    • conversation_id: The ID of the conversation or batch being evaluated.
    • execution_time: The time taken to complete the evaluation (in seconds).
  4. Interaction-Level Fields:

    • user_message: The input message from the user.
    • agent_reply: The response generated by the agent.
    • reference_reply: The expected or ideal response for comparison.
    • interaction_type: The type of interaction (e.g., “None”).
    • reference_metadata: Metadata about the reference reply, such as intent and sentiment.
    • generated_metadata: Metadata generated during the evaluation. (These three fields are omitted from the trimmed example above.)
  5. Evaluation Results:

    • openai and ionos: Results from different evaluators.
      • match_level: A score indicating how well the agent’s reply matches the reference reply.
      • justification: A detailed explanation of the score.
      • metadata: Additional details about the evaluation, such as token usage and cost.
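To reproduce something like average_scores yourself, you can aggregate match_level per evaluator across all scenarios, attempts, and interactions. A sketch under the same assumptions as the snippets above; note that match_level arrives as a string, so it is converted before averaging:

from collections import defaultdict

# Collect match_level values per evaluator across the whole batch.
levels = defaultdict(list)
for scenario in results.get("scenarios", []):
    for attempt in scenario.get("attempts", []):
        for interaction in attempt.get("interactions", []):
            for provider, result in interaction["evaluation_results"].items():
                # match_level is returned as a string, e.g. "0" or "1".
                levels[provider].append(int(result["match_level"]))

for provider, scores in levels.items():
    print(f"{provider}: average match_level = {sum(scores) / len(scores):.2f}")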

3. Going further

  • Missing fields? Ensure your generated_metadata is populated in the test batch.
  • Empty results? Confirm your evaluator provider is configured (openai or ionos).

When you’re comfortable here, dive into the Guides for advanced batch creation, custom metrics, and CI/CD.