
Audio To Text Node - Automate Transcriptions

The Audio to Text node, available under AI in the Tool Builder, converts spoken audio into written text using Automatic Speech Recognition (ASR). This multimodal node processes audio input and generates text output within a single workflow and supports the languages covered by the underlying model. It enables developers to build adaptable systems that efficiently handle and integrate audio and text data.

Selection of the Audio File

You can add audio input in one of the following ways:

  1. Manually select and upload an audio file in the allowed format.
  2. Configure an input variable by selecting Text for Type when adding input variables for the node. Learn more.

You must provide the audio file URL when running the flow, as mentioned here.


Note

Uploading audio files as input variables is not supported.

Supported Audio Formats

The following audio file formats are supported by the node:

  • m4a
  • mp3
  • webm
  • mp4
  • mpga
  • wav
  • mpeg

Note

Using other formats will result in a system error.

Audio File Size Limits

  • Maximum supported file size: 25 MB.
  • Split larger files into segments of 25 MB or less before uploading to avoid delays in input processing (transcription) and output generation (see the sketch after this list).
  • Maintain context and avoid mid-sentence breaks when splitting files.
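
If you need to split a long recording locally before uploading it, a rough pre-processing sketch is shown below. It is not part of the platform; it assumes the third-party pydub library (with ffmpeg installed), and the file names and helper name are illustrative only.

```python
# Local pre-processing sketch: split a long recording into pieces under the
# 25 MB limit before upload. Assumes pydub + ffmpeg; names are illustrative.
import math
import os

from pydub import AudioSegment

MAX_BYTES = 25 * 1024 * 1024  # the node's 25 MB input limit


def split_audio(path: str, out_dir: str = "chunks") -> list[str]:
    audio = AudioSegment.from_file(path)           # decode the source file
    n_parts = max(1, math.ceil(os.path.getsize(path) / MAX_BYTES))
    part_ms = math.ceil(len(audio) / n_parts)      # pydub lengths are in milliseconds
    os.makedirs(out_dir, exist_ok=True)

    out_paths = []
    for i, start in enumerate(range(0, len(audio), part_ms)):
        out_path = os.path.join(out_dir, f"part_{i:02d}.mp3")
        # Exporting re-encodes the audio; to avoid mid-sentence breaks, nudge the
        # cut points toward silences (for example with pydub.silence.detect_silence).
        audio[start:start + part_ms].export(out_path, format="mp3")
        out_paths.append(out_path)
    return out_paths
```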

Processing Model

The Agent Platform uses OpenAI Whisper-1 for transcription.
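
The node's internal request format is not documented here, but the underlying whisper-1 call can be illustrated with the OpenAI Python SDK. The file name, prompt text, and API key setup (OPENAI_API_KEY) in the sketch below are assumptions, not platform requirements.

```python
# Minimal sketch of a whisper-1 transcription request via the OpenAI Python SDK.
# This illustrates the model the node relies on; it is not the node's own API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("customer_call.mp3", "rb") as audio_file:  # illustrative file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # Optional prompt: style hints, proper nouns, or punctuation guidance.
        prompt="Transcript of a customer support call with agent Priya.",
    )

print(transcript.text)
```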

Use Cases

This node is commonly used for:

  • Transcribing meetings, interviews, or lectures.
  • Automating customer service chatbots.
  • Generating subtitles for videos.
  • Voice command processing for applications.
  • Audio translation.

Example: The Audio to Text node processes the uploaded audio file of a customer service call. The transcribed text file is generated as the output.

In customer service, the node transcribes calls, which helps analyze conversation quality, responses, and resolutions, and provides a record for future reference.

Translation

  • When enabled, transcribes and translates speech in non-English languages (see OpenAI Whisper-supported languages) into English (see the sketch after this list).
  • Inverse translation (English to other languages) is not currently supported.
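
For reference, the same behavior is exposed by Whisper's translations endpoint. The sketch below, with an illustrative file name, shows a non-English recording translated into English using the OpenAI Python SDK.

```python
# Sketch of Whisper's translate-to-English behavior, analogous to the node's
# Translation toggle. The input file name is illustrative.
from openai import OpenAI

client = OpenAI()

with open("spanish_call.m4a", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)  # English text, regardless of the spoken language
```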

Important Considerations

Note

Be mindful of the environment where you upload the files: host URLs that work in the Agent Platform may not work in GALE.

  • OpenAI Whisper automatically removes offensive and banned words during transcription.
  • Performance tracking is available under Settings > Model Analytics Dashboard > External Models tab. Learn more.

Metrics include:

  • Minutes transcribed/Minutes of audio (total audio processed by the node), since Whisper models are billed by the minutes of audio consumed (see the sketch after this list).
  • Input and output tokens, since Whisper models support only a limited number of prompt tokens, making it necessary to track the counts. Learn more.
  • Each model execution is logged on the Model Traces page, displaying summarized data for:
    • Input, Output, and Response Time
    • Translation and Timestamp. Learn more.
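
Because usage is metered by audio minutes, you may want a rough local estimate of billable minutes before sending a file. The sketch below assumes the third-party pydub library and an illustrative file name; it is not a platform feature.

```python
# Rough local estimate of billable audio minutes (assumes pydub and ffmpeg).
from pydub import AudioSegment

audio = AudioSegment.from_file("customer_call.mp3")  # illustrative file name
minutes = len(audio) / 60_000  # pydub reports duration in milliseconds
print(f"Approximate billable minutes: {minutes:.2f}")
```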

Steps to Add and Configure the Node

To add and configure the node, follow the steps below:

Note

Before proceeding, you must add an external LLM to your account using either Easy Integration or Custom API integration.

  1. On the Tools tab, click the name of the tool to which you want to add the node. The Tool Flow page is displayed.

  2. Click Go to flow to edit the in-development version of the flow.

  3. In the flow builder, click the + icon for Audio to Text under AI in the Assets panel. Alternatively, drag the node from the panel onto the canvas. You can also click AI in the pop-up menu and click Audio to text.

  4. Click the added node to open its properties dialog box. The General Settings for the node are displayed.

  5. Enter or select the following General Settings:

    • Node Name: Enter an appropriate name for the node. For example, “CustomerSupportConversation.”
    • Audio File: Provide the input variable that is configured for the node. Learn more.
    • Select a model from the list of configured models.
    • (Optional) Turn on the toggle for the following to enable the respective feature:
      • Translation: Translates speech in other languages supported by the model into English.
      • Timestamps: Adds the date and time at which each piece of dialog was spoken.
    • Prompt: Provide the instructions that you want the model to follow. User prompts define specific questions or requests for the model. Give clear instructions, using context variables for dynamic inputs in the recommended syntax: {{context.variable_name}}. For example, you can store the conversation transcript in a variable named “conversation” and pass it in the prompt using {{context.conversation}}. You can also include simple instructions about the style of the transcription, supply correct spellings of words or proper nouns the model may have misheard, fix punctuation, add context, and more.

    Note

    Whisper models process up to 224 tokens in the input prompt and ignore any input exceeding this limit.

    Standard Error

    If the model is not selected, the prompt is not provided, or both, the error message “Proper data needs to be provided in the LLM node” is displayed.

    • Response JSON schema: Define a JSON schema for structured responses. This step is optional and depends on the selected model.
      You can define a JSON schema to structure the model's response if the chosen model supports the response format. By default, if no schema is provided, the model will respond with plain text. Supported JSON schema types include: String, Boolean, Number, Integer, Object, Array, Enum, and anyOf. Ensure the schema follows the standard outlined here: Defining JSON schema. If the schema is invalid or mismatched, errors will be logged, and you must resolve them before proceeding.
      For more information about how the model parses the response and separates keys from the content body, see: Structured Response Parsing and Context Sharing in Workflows.
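
      For reference, a minimal response schema using the supported types might look like the sketch below, expressed here as a Python dictionary. The field names are illustrative assumptions, not a platform-defined contract.

```python
# Illustrative response JSON schema using the supported types
# (string, enum, boolean, number, array). Field names are assumptions.
response_schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "resolved": {"type": "boolean"},
        "duration_minutes": {"type": "number"},
        "action_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["summary", "sentiment"],
}
```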
  6. Click the Connections icon and select the Go to Node for success and failure conditions.

  • On Success > Go to Node: After the current node is successfully executed, go to a selected node in the flow to execute next, such as an AI node, Function node, Condition node, API node, or End node.
  • On Failure > Go to Node: If the execution of the current node fails, go to the End node to display any custom error message from the Audio to Text node.
  7. Finally, Test the flow and fix any issues found.

Configure and Test the Flow for the Node

Step 1: (Optional) Add Input Variable(s)

  1. Click the Input tab of the Start node, and click Add Input Variable to configure the input for the flow’s test run. Learn more.


  2. Select Text for the Type field in the Enter input variable window to define a text input variable.
  3. Click Save.

Add all the required input variables to run the flow in the Input section of the Start node.

Step 2: Add Output Variable(s)

  1. Click the Output tab for the Start node.
  2. Click Add Output Variable.


  3. Enter the value for Name (key) and select String for Type to generate the transcribed text output.
  4. Click Save. Learn more about accessing the node’s output.

Step 3: Run the Flow

To run and test the flow, follow the steps below:

  1. Click the Run Flow button at the top-right corner of the flow builder.

  2. (Optional) Add the value for Input Variable if you have configured it to test the flow. Otherwise, go directly to the next step.


  3. Click Generate Output.

The Debug window generates the flow log and results, as shown below. Learn more about running the tool flow.
