This blog post is co-authored by our great contributor Thomas Vitale.

OpenAI provides specialized models for speech-to-text and text-to-speech conversion, recognized for their performance and cost-efficiency. Spring AI integrates these capabilities through its Voice-to-Text and Text-to-Speech (TTS) APIs.
The new Audio Generation feature (`gpt-4o-audio-preview`) goes further, enabling mixed input and output modalities. Audio inputs can carry richer data than text alone: audio conveys nuanced information such as tone and inflection, and together with audio outputs it enables asynchronous speech-to-speech interactions.
Additionally, this new multimodality opens up possibilities for innovative applications, such as structured data extraction. Developers can extract structured information not just from simple text, but also from images and audio, building complex, structured objects seamlessly.
Spring AI Audio Integrations
The Spring AI Multimodality Message API simplifies the integration of multimodal capabilities with various AI models.
Now it fully supports OpenAI’s Audio Input and Audio Output modalities, thanks in large part to community member Thomas Vitale, who contributed to this feature’s development.
Setup
Follow the Spring AI OpenAI integration documentation to prepare your environment.

Audio Input
OpenAI’s User Message API accepts base64-encoded audio files within messages using the `Media` type. Supported formats include `audio/mp3` and `audio/wav`.
Example: Adding audio to an input prompt:
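A minimal sketch using the fluent `ChatClient` API. It assumes a `speech.mp3` file on the classpath and an autowired `ChatModel`; exact builder method names (such as `withModel`) may vary between Spring AI milestone releases.

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.core.io.ClassPathResource;
import org.springframework.util.MimeTypeUtils;

// Attach an MP3 recording to the user message alongside a text question.
var audioResource = new ClassPathResource("speech.mp3");

String response = ChatClient.create(chatModel).prompt()
        .user(u -> u.text("What is this recording about?")
                .media(MimeTypeUtils.parseMimeType("audio/mp3"), audioResource))
        .options(OpenAiChatOptions.builder()
                .withModel(OpenAiApi.ChatModel.GPT_4_O_AUDIO_PREVIEW)
                .build())
        .call()
        .content();
```

The `media(...)` call wraps the file in a `Media` object and base64-encodes it for the OpenAI API.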
Audio Output Generation
The OpenAI Assistant Message API can return base64-encoded audio files using the `Media` type.
Example: Generating audio output:
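A sketch requesting both a text and an audio rendition of the answer. The output-modality and `AudioParameters` builder methods shown here follow recent Spring AI milestones and may differ in name between versions.

```java
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.openai.OpenAiChatOptions;
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.ai.openai.api.OpenAiApi.ChatCompletionRequest.AudioParameters;

// Request both text and audio output modalities from the model.
var response = ChatClient.create(chatModel).prompt()
        .user("Tell me a joke about the Spring Framework")
        .options(OpenAiChatOptions.builder()
                .withModel(OpenAiApi.ChatModel.GPT_4_O_AUDIO_PREVIEW)
                .withOutputModalities(List.of("text", "audio"))
                .withOutputAudio(new AudioParameters(
                        AudioParameters.Voice.ALLOY,
                        AudioParameters.AudioResponseFormat.WAV))
                .build())
        .call()
        .chatResponse();

// The transcript arrives as text; the spoken answer arrives as WAV bytes.
String text = response.getResult().getOutput().getContent();
byte[] wavAudio = response.getResult().getOutput().getMedia().get(0)
        .getDataAsByteArray();
```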
You can customize audio generation via `OpenAiChatOptions`. Use the `AudioParameters` class to choose the voice and the audio format.
Voice ChatBot Demo
This example demonstrates building an interactive chatbot using Spring AI that supports audio input and output. It shows how AI can enhance user interaction with natural-sounding audio responses.

Setup
Add the Spring AI OpenAI starter to your build and configure your API key in application.properties.
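For example, with Maven (verify the artifact version against the current Spring AI release):

```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
</dependency>
```

And in application.properties (the model property name follows the Spring AI OpenAI starter conventions):

```properties
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.chat.options.model=gpt-4o-audio-preview
```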
Implementation
The Java implementation of the voice chatbot, detailed below, creates a conversational AI assistant using audio input and output. It leverages Spring AI’s integration with OpenAI models to enable seamless interactions with users.

- `VoiceAssistantApplication` serves as the main application.
- The `CommandLineRunner` bean initializes the chatbot:
  - The `ChatClient` is configured with a `systemPrompt` for contextual understanding and an in-memory chat memory for conversation history.
  - The `Audio` utility is used to record voice input from the user and play back audio responses generated by the AI.
- Chat loop: inside the loop:
  - Voice recording: the `audio.startRecording()` and `audio.stopRecording()` methods handle the recording process, pausing for user input.
  - Processing the AI response: the user message is sent to the AI model via `chatClient.prompt()`; the audio data is encapsulated in a `Media` object.
  - Response handling: the AI-generated response is retrieved as text and played back as audio using the `Audio.play()` method.
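The steps above can be sketched as a `CommandLineRunner` bean. This is hypothetical wiring, not the blog’s exact code: the `Audio` helper (with `startRecording`, `stopRecording`, and `play` methods) is illustrative, and the advisor and memory class names follow recent Spring AI milestones.

```java
import java.util.Scanner;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.MessageChatMemoryAdvisor;
import org.springframework.ai.chat.memory.InMemoryChatMemory;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.ByteArrayResource;
import org.springframework.util.MimeTypeUtils;

@Bean
CommandLineRunner voiceChat(ChatClient.Builder chatClientBuilder) {
    return args -> {
        // System prompt plus in-memory conversation history.
        var chatClient = chatClientBuilder
                .defaultSystem("You are a friendly voice assistant. Keep answers concise.")
                .defaultAdvisors(new MessageChatMemoryAdvisor(new InMemoryChatMemory()))
                .build();

        var audio = new Audio(); // illustrative Java Sound API helper
        var scanner = new Scanner(System.in);

        while (true) {
            System.out.println("Press <Enter> to start recording, <Enter> again to stop.");
            scanner.nextLine();
            audio.startRecording();
            scanner.nextLine();
            byte[] question = audio.stopRecording();

            // Send the recording as a Media attachment on the user message.
            var reply = chatClient.prompt()
                    .user(u -> u.text("Please answer the question in the recording.")
                            .media(MimeTypeUtils.parseMimeType("audio/wav"),
                                    new ByteArrayResource(question)))
                    .call()
                    .chatResponse();

            // Print the transcript and play the audio rendition of the answer.
            System.out.println("ASSISTANT: " + reply.getResult().getOutput().getContent());
            audio.play(reply.getResult().getOutput().getMedia().get(0).getDataAsByteArray());
        }
    };
}
```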
The `Audio` utility, used for capturing and playing back audio, is a single class leveraging the plain Java Sound API.
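One possible shape for such a utility, sketched with raw PCM capture and WAV playback. The buffer size and audio format are arbitrary choices for illustration, not the blog’s exact implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.TargetDataLine;

// Minimal single-class audio helper built on the Java Sound API.
public class Audio {

    // 44.1 kHz, 16-bit, mono, signed, little-endian PCM.
    private static final AudioFormat FORMAT = new AudioFormat(44100.0f, 16, 1, true, false);

    private TargetDataLine microphone;
    private Thread recordingThread;
    private final ByteArrayOutputStream recorded = new ByteArrayOutputStream();

    // Open the microphone and capture PCM bytes on a background thread.
    public void startRecording() throws LineUnavailableException {
        recorded.reset();
        microphone = AudioSystem.getTargetDataLine(FORMAT);
        microphone.open(FORMAT);
        microphone.start();
        recordingThread = new Thread(() -> {
            byte[] buffer = new byte[4096];
            while (microphone.isOpen()) {
                int read = microphone.read(buffer, 0, buffer.length);
                if (read > 0) {
                    recorded.write(buffer, 0, read);
                }
            }
        });
        recordingThread.start();
    }

    // Stop capturing and return the recorded bytes.
    public byte[] stopRecording() throws InterruptedException {
        microphone.stop();
        microphone.close();
        recordingThread.join();
        return recorded.toByteArray();
    }

    // Play back a WAV byte array (e.g. the model's audio response).
    public void play(byte[] wavBytes) throws Exception {
        try (var in = AudioSystem.getAudioInputStream(new ByteArrayInputStream(wavBytes));
             Clip clip = AudioSystem.getClip()) {
            clip.open(in);
            clip.start();
            clip.drain(); // block until playback finishes
        }
    }
}
```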
Important Considerations
- One hour of audio input is roughly equivalent to 128k tokens.
- The model currently supports `modalities = ["text", "audio"]`.
- Future updates may offer more flexible modality controls.
Conclusion
The `gpt-4o-audio-preview` model unlocks new possibilities for dynamic audio interactions, enabling developers to build rich, AI-powered audio applications.
Disclaimer: API capabilities and features may evolve. Refer to the latest OpenAI and Spring AI documentation for updates.