Batch Tests
In LevelApp, a batch test is a structured way to evaluate how an AI model performs across multiple test cases. Each batch includes one or more test interactions, and LevelApp handles everything: sending prompts, collecting replies, scoring results, and returning detailed feedback.
What’s Inside a Batch Test?
A typical LevelApp batch test includes (modeled as code in the sketch after this list):

- A unique batch ID and optional metadata (name, description, version)
- A list of interactions, each with:
  - A user message (input to the AI)
  - A reference reply (what the model is expected to say)
  - An optional agent reply (if already generated)
  - Optional metadata like intent or sentiment
- The model ID and API endpoint being tested
- The number of attempts to run per interaction
- A test name to track results
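For readers who prefer types to prose, here is one way to model that structure in Python. These dataclasses are inferred from the field list and the JSON example below; they are illustrative, not LevelApp's actual classes.

```python
# Illustrative types inferred from the batch-test fields; LevelApp's real
# internal classes may differ.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Interaction:
    id: str
    user_message: str                          # input sent to the AI
    reference_reply: str                       # what the model should say
    agent_reply: Optional[str] = None          # pre-generated reply, if any
    interaction_type: Optional[str] = None     # e.g. "opening"
    reference_metadata: Optional[dict] = None  # e.g. intent, sentiment
    generated_metadata: Optional[dict] = None

@dataclass
class TestBatch:
    id: str                                    # unique batch ID
    interactions: list[Interaction]
    description: Optional[str] = None
    details: Optional[dict] = None             # name, version, etc.

@dataclass
class BatchTestRequest:
    test_batch: TestBatch
    endpoint: str                              # API endpoint under test
    model_id: str
    attempts: int = 1                          # runs per interaction
    test_name: str = ""                        # label for tracking results
```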
Example
Here’s an example of a LevelApp batch test:
{
  "test_batch": {
    "id": "12345678-1234-5678-1234-567812345678",
    "interactions": [
      {
        "id": "5b74c0b4-0c4a-4d1b-a6a0-bf31e0be2914",
        "user_message": "What is IONOS?",
        "agent_reply": "United States of America",
        "reference_reply": "IONOS is a cloud provider in Europe based in Germany",
        "interaction_type": "opening",
        "reference_metadata": {
          "intent": "greeting",
          "sentiment": "positive"
        },
        "generated_metadata": {
          "intent": "greeting",
          "sentiment": "positive"
        }
      }
    ],
    "description": "Test conversation for main evaluation",
    "details": {
      "name": "Main API Test",
      "version": "1.0"
    }
  },
  "endpoint": "http://localhost:8000",
  "model_id": "meta-llama/Llama-3.3-70B-Instruct",
  "attempts": 1,
  "test_name": "main_evaluate_test"
}
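To run the test, you submit this JSON to your LevelApp instance. Below is a minimal sketch using Python's requests library, assuming LevelApp is reachable locally; the URL and the /evaluate path are placeholders, so check your deployment for the actual route. Note that the endpoint field inside the payload points at the model under test, not at LevelApp itself.

```python
# Sketch of submitting the batch above. The LevelApp URL and route are
# assumptions; substitute the ones your deployment exposes.
import json
import requests

with open("batch_test.json") as f:      # the payload shown above
    payload = json.load(f)

resp = requests.post(
    "http://localhost:9000/evaluate",   # hypothetical LevelApp route
    json=payload,
    timeout=300,                        # batch runs can take a while
)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))  # structured results
```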
How LevelApp Uses It
When you submit a batch, LevelApp runs the following steps (a simplified sketch follows this list):
- Each user message is sent to the target model (via the provided endpoint).
- The model’s response is collected and compared with the reference reply.
- Evaluators assign scores and explanations.
- Metadata fields (if included) are also checked and scored.
- If multiple attempts are requested, results are averaged.
- Final results are returned in a structured, machine-readable format.
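In code, that loop looks roughly like the sketch below. This illustrates the steps above rather than LevelApp's internal implementation: call_model is a placeholder for the real model client, and the exact-match scorer stands in for LevelApp's evaluators, which grade more finely and also produce written explanations.

```python
# Simplified evaluation loop mirroring the steps above; not LevelApp's
# actual code.
from statistics import mean

def call_model(endpoint: str, model_id: str, message: str) -> str:
    """Placeholder: send one user message to the model under test."""
    raise NotImplementedError  # real code would call the target endpoint

def score_reply(generated: str, reference: str) -> float:
    # Stand-in scorer (exact match only); real evaluators return graded
    # scores plus explanations, and also check metadata fields.
    return 1.0 if generated.strip() == reference.strip() else 0.0

def run_batch(batch: dict, endpoint: str, model_id: str, attempts: int) -> list[dict]:
    results = []
    for interaction in batch["interactions"]:
        scores = []
        for _ in range(attempts):  # repeat each interaction `attempts` times
            reply = call_model(endpoint, model_id, interaction["user_message"])
            scores.append(score_reply(reply, interaction["reference_reply"]))
        results.append({
            "interaction_id": interaction["id"],
            "score": mean(scores),  # multiple attempts are averaged
        })
    return results
```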
Why Batch Testing Matters
Batch testing helps you:
- Evaluate model behavior across many examples
- Detect inconsistencies or failure patterns
- Measure both content quality and metadata accuracy
- Run repeatable tests for tracking model changes over time