
Configure Evaluators

In Evaluation Studio, evaluators are tools used to assess how well a model is performing based on specific tasks. They function like custom prompts or instructions designed to check certain aspects of a model’s output.

For example, an evaluator can be set up to assess content completeness. It takes inputs and outputs from a model and is programmed to compare the results against predefined criteria to check if the content is complete.

Types of Evaluators

In Evaluation Studio, there are two primary types of evaluators: AI evaluators and Human evaluators.

AI Evaluators

AI Evaluators are predefined sets of instructions provided to a large language model (LLM), whether open-source or commercial, to evaluate its outputs. These evaluators assess the model’s performance by comparing its outputs against predefined criteria or instructions, using a dataset of input and output data.

There are two types of AI evaluators:

  • System AI Evaluators: These are pre-built evaluators provided by the platform to assess common aspects of model performance, such as quality, correctness, and safety. These evaluators are ready-to-use and cannot be modified, providing a quick and efficient way for users to evaluate models.
  • Custom AI Evaluators: Users can create custom evaluators tailored to their specific needs. These evaluators allow users to define their own evaluation prompts and scoring mechanisms, offering more flexibility. Custom AI evaluators are particularly useful for evaluating unique datasets or specialized tasks, giving users full control over how models are assessed.

Note

Users can access all the available system evaluators through the global Evaluators page located at the project level. Simply click the Evaluators tab, located next to the Projects tab. This page provides an overview of available evaluators that can be applied to datasets for evaluation.

System Evaluators are grouped into three categories: Quality, Safety, and RAGAS evaluators.

Quality Metrics

Quality metrics assess the overall effectiveness and usefulness of the model's outputs. These metrics focus on whether the content generated by the model is clear, accurate, and complete.

Below are the key quality metrics and the components required in the dataset to use these evaluators:

| Metric | Description | Required Dataset Components |
| --- | --- | --- |
| Groundness | Evaluates whether the output accurately reflects the information provided in the input without introducing additional details from the model’s knowledge base. | Input, Output |
| Query Relevance | Assesses the relevance of the output to a user query and ensures the output is related to the given input. | Input, Output, User Query |
| Ground Truth Relevance | Compares the output to a provided ground truth to assess the relevance between input and output. | Input, Output, Ground Truth |
| Coherence | Evaluates how logically consistent and well-structured the generated output is, assessing its natural flow and readability. | Output |
| Fluency | Assesses the quality of individual sentences in the output, checking if they are grammatically correct and well-written. | Output |
| GPT Similarity Score | Compares the model’s response with a superior model’s response (e.g., GPT) for a given input. | Input, Your model’s response, Superior model’s response |
| Paraphrasing | Assesses whether the output conveys the same meaning as the input using different phrasing and sentence structures. | Input, Output |
| Completeness | Evaluates whether the output conveys the full context from the input, checking if any information was lost or omitted. | Input, Output |
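
The dataset itself is simply a table of columns. As a rough, hypothetical sketch (the column names below are examples, not required names), a row that supports most of the quality metrics above might look like this:

```python
# Hypothetical dataset rows; the column names are illustrative only. Map whatever
# headers your imported dataset uses to the evaluator variables.
rows = [
    {
        "input": "Summarize the refund policy section of the attached document.",
        "output": "Refunds are issued within 14 days of purchase for unused items.",
        "user_query": "What is the refund window?",            # needed by Query Relevance
        "ground_truth": "Refunds are available for 14 days.",  # needed by Ground Truth Relevance
    },
]

# Columns each evaluator reads, per the table above.
required_components = {
    "Groundness": ["input", "output"],
    "Query Relevance": ["input", "output", "user_query"],
    "Ground Truth Relevance": ["input", "output", "ground_truth"],
    "Coherence": ["output"],
    "Completeness": ["input", "output"],
}

# Quick check that a row has everything a given evaluator needs.
missing = [col for col in required_components["Query Relevance"] if col not in rows[0]]
print("Missing columns:", missing or "none")
```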

Safety Metrics

Safety metrics focus on evaluating whether the model’s outputs are free from harmful or unethical content. These metrics are crucial for ensuring that AI models do not produce dangerous or biased results.

Below are the key safety metrics and the components required in the dataset to use these evaluators:

| Metric | Description | Required Dataset Components |
| --- | --- | --- |
| Bias Detection | Analyzes the output for potential biases related to specified topics, ensuring that the model doesn’t exhibit unfair or discriminatory tendencies. | Output |
| Banned Topics | Scans the output for prohibited content related to specific topics, such as sensitive political issues or illegal activities. | Output |
| Toxicity | Screens the output for toxic content, such as violent, sexual, or otherwise inappropriate material. | Output |

RAGAS Evaluators

RAGAS evaluators serve as system evaluators within Evaluation Studio, playing a crucial role in assessing the performance of RAG (Retrieval-Augmented Generation) pipelines. These evaluators assess both the accuracy of the answer and the relevance of the contexts used.

For example, when a user query is processed, the pipeline returns an answer along with the contexts from which the answer was derived. RAGAS evaluators evaluate the quality of both the answer and the retrieved contexts, ensuring a thorough evaluation of the model’s performance.

Users can fine-tune the evaluation process by adjusting key parameters to meet specific needs. While the evaluation prompts themselves cannot be modified, as their results directly impact the final score calculation, users have the flexibility to adjust the following parameters:

  • Model: Users can choose which model to use for the evaluation.
  • Pass Threshold: Users can modify the threshold required for a pass based on the evaluation criteria.
  • Variables in the Prompt: Users can attach variables depending on the specific metric being used, such as ground_truth, retrieved_contexts, and user_input.

Below is the current list of RAGAS evaluators and the components required in the dataset to use them:

| Metric | Description | Required Dataset Components |
| --- | --- | --- |
| Context Precision | Measures the proportion of relevant chunks among the total number of chunks retrieved for the given user input. | Input, Response, Retrieved context |
| Context Recall | Given a reference answer, evaluates whether the retrieved context is sufficient to address the user input. Higher recall indicates that fewer significant chunks are omitted. | Input, Response, Retrieved context, Reference answer |
| Context Entity Recall | Evaluates the number of entities shared between the retrieved context and the reference, relative to the total number of entities in the reference. | Retrieved context, Reference answer |
| Noise Sensitivity | Measures the proportion of incorrect claims among the total number of claims in the response, given a reference. | Input, Response, Retrieved context, Reference answer |
| Faithfulness | Measures how factually consistent the response is with the retrieved context. | Input, Response, Retrieved context |
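
The exact RAGAS computations rely on LLM judgments at the chunk and claim level, but the ratios they report can be pictured with a simplified sketch. The relevance and correctness labels below stand in for judgments the evaluator model would normally make; this is not the RAGAS library itself.

```python
def context_precision(chunk_is_relevant: list[bool]) -> float:
    """Simplified view: share of retrieved chunks judged relevant to the user input."""
    return sum(chunk_is_relevant) / len(chunk_is_relevant) if chunk_is_relevant else 0.0


def noise_sensitivity(claim_is_incorrect: list[bool]) -> float:
    """Simplified view: share of claims in the response judged incorrect."""
    return sum(claim_is_incorrect) / len(claim_is_incorrect) if claim_is_incorrect else 0.0


# 3 of 4 retrieved chunks judged relevant; 1 of 5 response claims judged incorrect.
print(context_precision([True, True, False, True]))           # 0.75
print(noise_sensitivity([False, False, True, False, False]))  # 0.2
```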

Human Evaluators

Human evaluators play a crucial role in refining AI models by providing valuable insights into the quality of model outputs. They help assess and improve the accuracy, relevance, and overall quality of the outputs generated by AI systems.

The following are the three main types of human evaluators in Evaluation Studio:

  1. Thumbs Up/Down: Users react to the model's output with a thumbs up (positive) or a thumbs down (negative). These reactions are then converted into numerical values: 1 for thumbs up (good) and 0 for thumbs down (bad); a small aggregation sketch follows this list. This simple feedback mechanism makes it easy to gauge overall sentiment toward the output.

  2. Better Output: Users suggest an improved version of the model's output. They can add a column with their own revised response, showing what they consider a better answer. This feedback helps improve the model by offering concrete suggestions.

  3. Comments: Users can leave short comments about the output. These comments can be positive or negative and explain in more detail why an output is liked or disliked, highlighting specific areas for improvement.
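
Because thumbs reactions are stored as 1 (up) and 0 (down), they can be aggregated directly. A minimal sketch:

```python
# One entry per evaluated row: 1 = thumbs up, 0 = thumbs down.
thumbs = [1, 1, 0, 1, 0, 1]

approval_rate = sum(thumbs) / len(thumbs)
print(f"Approval rate: {approval_rate:.0%}")  # Approval rate: 67%
```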

To add a human evaluator, simply click Add human feedback on the Evaluations page and choose one of the three options. Like AI evaluators, human evaluators are added as separate columns in the dataset, allowing users to provide valuable feedback. This feedback helps create a more accurate picture of how the model is performing and offers actionable insights for further improvement.

Adding a System Evaluator

When adding a system evaluator, users map the evaluator's prompt variables to the corresponding column names in the dataset. The evaluator itself cannot be modified and must be used as-is.

Steps to add an evaluator:

  1. On the Evaluations page, click the + button, and select the Add evaluator option.


  2. From the list of Quality and Safety evaluators, select the desired evaluator.

  3. In the Evaluators dialog, fill in these details:

    1. Model: Choose the model you want to use as an evaluator. This model will assess the input and/or output and generate a score. Only the models deployed in Agent Platform appear in the search dropdown; both open-source and external models are supported.
    2. Model Configuration: Select the appropriate model hyperparameters, such as Temperature, Output token limit, and Top P.
    3. Prompt: Click to view the system prompt. The prompt associated with the system evaluator is view-only. While you can view the prompt, it cannot be edited.
    4. Map variables: Map the variables in the prompt to the corresponding columns in your imported dataset. This ensures the evaluator uses the right data for its analysis.
    5. Pass threshold: Set the minimum score required for an output to pass the evaluation. Choose either the ‘Greater than’ or ‘Less than’ option and then enter a threshold value (from 1 to 5).

      • For Positive Evaluators (evaluators where a higher score is better, such as Completeness), the output is considered "good" if the score exceeds the threshold. For example, if the Completeness evaluator returns a score greater than 2.5, the result is marked green, indicating that it meets the expected quality level.
      • For Negative Evaluators (such as Toxicity, where a lower score is better), a score above the threshold indicates a problem. For example, if the Toxicity evaluator returns a score greater than 2.5, it is marked red, signaling that the output contains undesirable levels of toxic content.

      The ‘Greater than’ and ‘Less than’ options distinguish between positive and negative evaluators, allowing you to adjust the evaluation based on the desired outcome. A simplified sketch of this pass/fail logic follows these steps.

  4. Click Save to save the evaluator configuration.

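Here is a simplified sketch of the pass/fail logic described under step 3. The 2.5 threshold and the direction strings mirror the examples above; they are illustrative, not a fixed configuration.

```python
def passes(score: float, threshold: float, direction: str) -> bool:
    """'Greater than' suits positive evaluators (higher is better, e.g. Completeness);
    'Less than' suits negative evaluators (lower is better, e.g. Toxicity)."""
    return score > threshold if direction == "Greater than" else score < threshold


print(passes(score=3.8, threshold=2.5, direction="Greater than"))  # True  -> shown green
print(passes(score=3.8, threshold=2.5, direction="Less than"))     # False -> shown red (e.g. high Toxicity)
```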

Adding a Custom Evaluator

Creating custom evaluators in Evaluation Studio allows users to tailor the evaluation process by designing AI evaluators suited to their specific needs. Users can choose from in-built templates to create custom evaluators for both Quality and Safety categories. Once created, evaluators can be saved as global evaluators, making them available for use across different projects. Custom evaluators can also be tested on datasets, and the results are provided with scores and detailed explanations. The flexibility to update and re-evaluate prompts ensures that the evaluation process is dynamic and adaptable.

Steps to add an evaluator:

  1. On the Evaluations page, click the + button, and select the Add evaluator option.


    The list of Quality and Safety evaluators is displayed.

  2. To add a custom evaluator, click Add evaluator.


  3. In the Custom evaluators dialog, fill in these details:

    1. Evaluator Name: Enter a name for the evaluator.
    2. Evaluator Type: Select the category for the evaluator: Quality or Safety.
    3. Description: Provide a brief description of the evaluator, explaining its purpose and function.
    4. Model: Choose the model you want to use for evaluation. This model will assess the input and/or output and generate a score. Only the models deployed in Agent Platform appear in the search dropdown; both open-source and external models are supported.
    5. Model Configuration: Select the appropriate model hyperparameters, such as Temperature, Output token limit, and Top P.
    6. Prompt: Enter the prompt that will guide the model in evaluating the input/output. You can also click ‘Template’ to use built-in evaluator templates, which you can then customize as needed. A hypothetical example prompt appears after these steps.

      Note

      Do not specify the format of the score in the prompt. The format is automatically determined by the selected output type (Score or Boolean). If there is a mismatch between the output type and the score format, an error may occur.

    7. Save as a Global Evaluator: Check this box to save the evaluator as a global evaluator. This will add the custom evaluator to the global evaluator page. The global evaluator will also appear on the project-level evaluator page for other users to access and use, without affecting the original evaluator.

    8. Map variables: Map the variables in the prompt to the corresponding columns in your imported dataset. This ensures the evaluator uses the right data for its analysis. Learn more in the Variable Mapping section below.
    9. Output Type: Select either Score or Boolean for the evaluator’s output.
    10. Maximum Score: If the output type is Score, specify the maximum score on the scale (For example, 1 to 10).
    11. Pass threshold: If the output type is Score, set the minimum score required for an output to pass the evaluation. Choose either the ‘Greater than’ or ‘Less than’ option and then enter a threshold value (For example, from 1 to 5). These options help distinguish between positive and negative evaluators, allowing you to adjust the evaluation based on the desired outcome.
      • For Positive Evaluators (or evaluators where a higher score is better, such as Completeness), the output is considered "good" if the score exceeds the threshold. For example, if the Completeness evaluator returns a score greater than 2.5, the result will be marked green, indicating that it meets the expected quality level.
      • For Negative Evaluators (such as Toxicity, where a lower score is better), a score above the threshold indicates a problem. For example, if the Toxicity evaluator returns a score greater than 2.5, it will be marked red, signaling that the output contains undesirable levels of toxic content.
  4. Click Save to save the evaluator configuration.

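As the note under the Prompt field points out, the prompt should describe the evaluation criteria but leave the score format to the Output Type setting. A purely hypothetical "Conciseness" evaluator prompt (not one of the built-in templates) might look like this:

```python
# Hypothetical custom evaluator prompt. The {{...}} variables are mapped to dataset
# columns in the Map variables step; no score-format instructions are included,
# because the selected Output Type (Score or Boolean) determines the format automatically.
conciseness_prompt = """
You are evaluating how concise a model response is.

Input given to the model:
{{input}}

Model response:
{{output}}

Judge whether the response conveys the necessary information from the input
without unnecessary repetition, filler, or digressions.
"""
```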

Variable Mapping

When setting up an AI evaluator, variable mapping is a crucial step. This is where the user connects the variables in the evaluator's prompt to the corresponding columns in the dataset.

  1. Variables in the Prompt: The evaluator’s prompt contains variables, indicated in double curly braces. For example, {{input}}, {{output}}, {{query}}. These variables are placeholders for your dataset columns and will appear on the left side of the Variable column. For example, in a Query Relevance evaluator, the prompt might include variables like {{query}} for the user query, {{input}} for the input text, and {{output}} for the model's response.
  2. Left Side - Prompt Variables: The left side of the mapping section shows the variables from the evaluator's prompt. This section is auto-populated by the system.
  3. Right Side - Dataset Columns: The right side displays the columns from your imported dataset. You must select the correct columns from the dataset to match each variable in the prompt. For example:
    • Map {{input}} to the corresponding input column in your dataset.
    • Map {{output}} to the output column.
  4. Safety Evaluators: For safety evaluators like Bias Detection or Toxicity, you may need to configure additional key-value pairs. These evaluators often provide binary (pass/fail) results, so map the relevant columns accordingly.

By correctly mapping the variables, you ensure the evaluator receives the right data and produces accurate results.
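
Conceptually, the mapping is a lookup from each prompt variable to a dataset column. A minimal sketch of how a mapped prompt could be rendered for one row follows; the column names and the template are hypothetical.

```python
# Left side: prompt variables. Right side: the dataset columns they are mapped to.
variable_map = {"query": "user_question", "input": "source_text", "output": "model_answer"}

row = {
    "user_question": "What is the refund window?",
    "source_text": "Refunds are issued within 14 days of purchase.",
    "model_answer": "You can get a refund within two weeks.",
}

prompt_template = (
    "Query: {{query}}\n"
    "Input: {{input}}\n"
    "Output: {{output}}\n"
    "Rate how relevant the output is to the query."
)

# Substitute each {{variable}} with the value from its mapped column.
rendered = prompt_template
for variable, column in variable_map.items():
    rendered = rendered.replace("{{" + variable + "}}", row[column])

print(rendered)
```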

Key Highlights

  • Evaluators are used to assess model performance by comparing its outputs against predefined criteria.

  • System evaluators are pre-built and cannot be modified, offering ready-to-use options for evaluating common aspects of model performance, such as quality and safety metrics. Custom evaluators offer flexibility for users to create custom evaluators tailored to their specific needs.

  • Variable mapping is crucial when adding evaluators, as users must link the variables in the evaluator's prompt to the appropriate dataset columns to ensure accurate evaluation results.