Import Datasets

Evaluation Studio provides a flexible approach to importing and managing datasets, making it easier to evaluate model outputs.

Here's how you can bring data into the platform:

  1. Import a dataset: Users can upload datasets in CSV format. These datasets may contain:

    • Input-output pairs (where each input has a corresponding output)
    • Input data only (where outputs need to be generated by a user-defined model)

    Evaluation Studio supports three scenarios for handling these datasets (a sample CSV illustrating Scenario 2 appears after this list):

    • Scenario 1: One Input, One Output: The simplest scenario where a dataset has one input column and one output column. Users map the input and output variables to run the evaluation, making it easy to evaluate model predictions.
    • Scenario 2: One Input, Multiple Outputs: In more complex scenarios, one input may produce multiple outputs through different models. The dataset will have one input column and multiple output columns. Users map the input to the corresponding model outputs for evaluation. Users can also upload ground-truth columns to compare responses against.
    • Scenario 3: Input Only: The dataset contains only input data, with no output columns. When users have input data but no corresponding outputs, they can generate outputs using a pre-trained model. The system automatically creates a new output column based on the input, enabling users to evaluate the pre-trained model's generations.
  2. Import production data: Users can import data from models running in production. Evaluation Studio supports filters such as the date range, the source where the model is deployed, and the columns to pull from the model traces. This helps you evaluate your deployed model’s generations. For example, users can identify outputs that took longer to generate or consumed more tokens, gaining insights to achieve an optimal balance between output quality and resource consumption, such as time and token usage.
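
For illustration, a Scenario 2 dataset (one input column, output columns from two different models, and an optional ground-truth column) might look like the following CSV; the column names here are hypothetical:

    conversation,summary_model_a,summary_model_b,ground_truth
    "Customer reports a late delivery; the agent offers a refund.","The customer complained about a late delivery and was offered a refund.","Late delivery complaint resolved with a refund.","Customer raised a late-delivery issue and accepted a refund."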

Adding Datasets to an Evaluation

Users can add datasets to evaluations within a project. Each dataset represents a collection of inputs and outputs for a specific use case.

Steps to import a dataset:

  1. Navigate to Evaluation Studio.
  2. Click the Projects tab, and choose the relevant project.
  3. Select the specific evaluation to which you want to add datasets.
  4. Choose one of the following methods to import the dataset for evaluation:

    1. Upload from device: Click the Upload file link and select your CSV file saved on your local machine.
    2. Import production data: Click Proceed and fill in the required fields in the Import production data dialog:
      1. Models: Choose the model deployed in production (open-source or commercial). You can select any model used in Agent Platform within Tools, Prompts, and endpoints. Only data related to the selected model will be retrieved from Model Traces.
      2. Source: Select the specific source where the model is deployed, such as Tools, Prompts, or endpoints. You can also select the ‘All’ option to import data from all available sources or specify individual sources like specific prompts or tools. For example, you can select a specific tool to see how the model is performing within that tool.
      3. Date: Set the desired date range for the data you want to import. By default, the last 30 days are selected.
      4. Columns: The system automatically fetches the input and output columns by default. If you need more detailed analysis, you can select additional columns such as request ID, input tokens, response time, and other relevant metrics. The selected columns will appear in the evaluation table.
  5. Check the preview of the dataset (first 10 rows). To confirm and finalize the import, click Proceed.

    The dataset is imported into Evaluation Studio and linked to the selected evaluation. You can then view your data in a tabular format in the evaluation table.

  6. Click the + button on the Evaluations page to access additional dataset actions:

    • Run a prompt: Run a prompt by selecting a model and its configuration settings.
    • Run an API: Run an API call using a specified endpoint and parameters to fetch content from external APIs or deployed tools.
    • Add an evaluator: Add a quality or safety evaluator to the dataset.
    • Add human feedback: Manually input feedback for model outputs.

    You can also filter the data (text, numeric, boolean), sort it, adjust row heights, and customize columns (hide/show). Filtering is limited to score columns to help with focused analysis. Applied filters and sorting states are clearly indicated and dynamically affect evaluation insights.

Running a Prompt

The Run a Prompt option enables users to generate customized data based on a specific model and prompt. This feature streamlines data creation and allows easy edits and adjustments for continuous improvement.

For instance, if you want to replace the manual effort of summarizing customer conversations with a fine-tuned model, you can use Evaluation Studio to evaluate its summaries. Start by bringing your conversations in as input and deploying the fine-tuned model in Agent Platform. Then, in Evaluation Studio, select 'Run a prompt' and choose your fine-tuned model. In the prompt, you can specify 'Summarize the {{input}}', where {{input}} references a dataset column as a variable. This variable will capture the conversations, and based on the additional prompt instructions, the model will generate the summary. Finally, you can assign the desired evaluators to evaluate the output produced by the fine-tuned model.
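
Conceptually, the {{input}} variable is filled in from the mapped column for each row before the model is called. The following is a minimal sketch of that per-row substitution, assuming a placeholder call_model function; it is not the platform's implementation:

    # Minimal sketch of per-row prompt variable substitution.
    # call_model stands in for the selected fine-tuned model (hypothetical).
    def call_model(prompt: str) -> str:
        return "(model-generated summary for: " + prompt[:40] + "...)"

    rows = [
        {"input": "Customer: my order arrived late. Agent: we will refund the shipping fee."},
        {"input": "Customer: I cannot log in. Agent: please reset your password via email."},
    ]
    prompt_template = "Summarize the {{input}}"

    for row in rows:
        # Replace the {{input}} placeholder with the row's mapped column value.
        prompt = prompt_template.replace("{{input}}", row["input"])
        row["summary"] = call_model(prompt)  # fills the new output column for this row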

Key Benefits

  • Efficiency: Generate content for multiple categories quickly and easily.
  • Customization: Edit prompts and regenerate content to match evolving needs.
  • Streamlined Workflow: Manage all content generation tasks in one central place.

Steps to run a prompt:

  1. On the Evaluations page, click the + button and select the Run a Prompt option.

  2. In the Run a Prompt dialog:

    1. Enter the Column Name for the output data.
    2. Choose the appropriate Model and Configuration settings.
    3. Type the prompt that describes the data you want to generate, making sure to include any mapped variables.
  3. Click Run to generate a new output column in your data table with the results.

After running the prompt, the following additional options are available:

  • To modify your prompt or configuration, click Properties.
  • To refresh the output based on a new prompt or updated data, click Regenerate.
  • To remove an output column, click Delete. Before deleting, ensure that no evaluators are dependent on this column to avoid any errors.

Running an API

Evaluation Studio offers the ability to run an API, enabling users to fetch content from external APIs or deployed tools directly into their evaluation process. This makes it possible to integrate live data or model outputs from deployed agents, adding flexibility to the evaluation process.

As a user, you can add a column in Evaluation Studio that triggers an API call to fetch content. This allows you to integrate external data, retrieve agent outputs, and incorporate them into your evaluation. Once the content is fetched, you can evaluate it using human or AI evaluators for in-depth analysis. Using the Run an API feature, you can also fetch outputs from models hosted outside of Agent Platform: the external model can use the input rows in Evaluation Studio as its input, process the data, and provide an output for each row.

This functionality enhances the evaluation process by providing greater flexibility, allowing users to use external data and models in Evaluation Studio.

Key Benefits

  • External Data Integration: Easily bring in data or model outputs from external sources or deployed tools into your evaluations.
  • Flexible Data Handling: Allows you to evaluate dynamic content generated by deployed tools in real-time, ensuring up-to-date assessments.
  • Seamless Evaluation: Attach evaluators to the API-generated outputs just like any other dataset column, ensuring consistency in the evaluation process.
  • Customization: Tailor the API call and parameters to suit specific needs, enabling flexible and targeted evaluations of external content.

Steps to run an API:

  1. On the Evaluations page, click the + button, and select the Run an API option.

  2. Configure the API call: In the Run an API dialog, specify the following (a rough standalone sketch of an equivalent call appears after these steps):

    • Column Name: Enter a name for the column where the API output will be displayed.
    • Method: Select the HTTP method (GET, POST, PUT, DELETE, or PATCH) based on the operation you want to perform with the API.
    • Request URL: Enter the URL of the API endpoint. If you have a cURL command for the API request, you can paste it here to test the API call.
    • Headers: Define any required headers for the API request. Specify the key/value pairs for the headers, such as authentication tokens or content-type specifications.
    • Body: If the request method requires a body (usually for POST or PUT requests), specify the data to be sent in the request. You can use one or more input columns as data in the body. If you want to replace certain parts of the body with dynamic values from input fields, you can use variables for the values. For example, "input":"{{column1}}".
    • Response: The response is automatically generated and displayed to show the result of the API call. When testing the API call, the system uses the input from the first row, makes the API request using the provided cURL, and displays the response.
    • JSON Output Path: Define the path to the specific data within the JSON response that you want to display. This is useful when the API returns complex JSON data, and you need to extract specific fields or values.

  3. Test the API call: Click Test to verify the API setup. The response from the test will be displayed in the Response tab of the properties panel. If the JSON Output Path is incorrect, an error message will appear.

  4. Fetch content from the API: After configuring the API, click Run to trigger the request. The system will send the API call to the deployed tool, retrieve the output, and automatically add the response as a new column in the evaluation dataset.
  5. Attach evaluators and evaluate the output: Once the content is added as a column, you can attach evaluators (e.g., Coherence, Toxicity, Bias Detection) to assess the output. Then run the evaluation, and the evaluators will analyze the API-generated data, providing insights into the quality and performance of the content.
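
For reference, the configuration in step 2 is roughly equivalent to the standalone call sketched below; the URL, header names, body keys, and row values are placeholders rather than the platform's internals:

    # Rough sketch of the configured API call for a single dataset row.
    # The request body mirrors a template such as "input": "{{column1}}".
    import requests

    row = {"column1": "Text from the first dataset row..."}

    url = "https://example.com/api/run-tool"          # Request URL (placeholder)
    headers = {
        "Content-Type": "application/json",
        "x-api-key": "YOUR_API_KEY",                  # authentication header (placeholder)
    }
    body = {"input": row["column1"]}                  # {{column1}} replaced by the row's value

    response = requests.post(url, headers=headers, json=body, timeout=30)
    print(response.json())  # inspect this JSON to choose the JSON Output Path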

Example Workflow for Running an API

Follow this example to set up and run an API call inside Evaluation Studio:

  1. Create and deploy a tool: Set up your tool and deploy it in Agent Platform.
  2. Copy the tool endpoint: From the Tool Endpoint tab, copy the deployed API's URL.

  3. Upload a dataset: In Evaluation Studio, upload a dataset containing only the input columns.

  4. Initiate ‘Run an API’: Click the + button, select Run an API, add a column name, and paste the copied endpoint URL in the Request URL field.

  5. Generate API key: Go back to the tool, navigate to the API Keys tab, create a new key, and copy it.

  6. Set the authorization header: In Evaluation Studio, on the Headers tab, paste the copied API key into the Value field for the x-api-key key.

  7. Configure the API body: Click the Body tab. Under the "input" key, replace {{example_text}} with your input column name, for example, {{Input}}.

  8. Test the API call and view the response: Click Test to trigger the API. The system uses the first row of your dataset to verify the API setup and displays the response in the Response tab. For example, if you are running a summarization tool, you should see the generated summary output based on the input text.

    The Test option enables you to preview the JSON response structure. After testing, carefully review the response to identify the correct output path, which you’ll need to specify as the JSON Output Path in the following step.

  9. Define JSON output path: Click the JSON output path tab, and specify the path to extract the required field from the API response.

    For example, if the API response structure is the following:

    "output": { "Summarization": "(generated output)" }

    you should enter output.Summarization as the JSON Output Path (a sketch of how such a dot-separated path resolves appears after this workflow).

  10. Run the API call: After successful testing, click Run to fetch outputs for all the dataset rows. A new column will be added, populated with the API responses.

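To make the JSON Output Path in step 9 concrete, the sketch below shows how a dot-separated path such as output.Summarization could resolve against the response shown above; it is illustrative only, not the platform's parser:

    # Illustrative dot-path resolution over the API response JSON.
    def resolve_output_path(response_json: dict, path: str):
        value = response_json
        for key in path.split("."):
            if not isinstance(value, dict) or key not in value:
                raise KeyError(f"Path segment '{key}' not found in response")
            value = value[key]
        return value

    response_json = {"output": {"Summarization": "(generated output)"}}
    print(resolve_output_path(response_json, "output.Summarization"))  # -> (generated output)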

Tip: Make sure the variable name in the request body exactly matches the input column name in your dataset so that each row's input is dynamically sent to the API.

Key Highlights

  • A project can contain multiple evaluations, and users can add a dataset to any evaluation.
  • Users can upload a dataset into Evaluation Studio and run evaluations to measure model performance.
  • If importing data from production, carefully select the model, source, and date range to ensure you're importing the relevant data.
  • Running a prompt enables flexible data generation, allowing users to create customized data based on specific instructions.
  • Running an API enables users to integrate live data and model outputs from external APIs or deployed tools, enhancing flexibility in the evaluation process.