Text to Speech

Quickstart

DeepFellow provides the v1/audio/speech endpoint, which produces spoken audio in multiple languages.

The v1/audio/speech endpoint takes three key inputs:

the model you want to use
the text you want to turn into audio
the voice that will speak (the available voices depend on the model you use)

Example request:

curl -X POST \
 "https://deepfellow-server-host/v1/audio/speech" \
  -H "Authorization: Bearer DEEPFELLOW-PROJECT-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "speaches-ai/piper-en_US-ryan-high",
    "voice": "ryan",
    "input": "Today is a wonderful day to build something people love!",
    "instructions": "Speak in a cheerful and positive tone."
  }' \
  --output speech.mp3

from pathlib import Path
from openai import OpenAI

client = OpenAI(
    base_url="https://deepfellow-server-host/v1",
    api_key="DEEPFELLOW-PROJECT-API-KEY" 
)
speech_file_path = Path(__file__).parent / "speech.mp3"

with client.audio.speech.with_streaming_response.create(
    model="speaches-ai/piper-en_US-ryan-high",
    voice="ryan",
    input="Today is a wonderful day to build something people love!",
    instructions="Speak in a cheerful and positive tone.",
) as response:
    response.stream_to_file(speech_file_path)

import * as fs from 'fs';
import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'https://deepfellow-server-host/v1',
    apiKey: 'DEEPFELLOW-PROJECT-API-KEY' 
});

const response = await client.audio.speech.create({
    model: 'speaches-ai/piper-en_US-ryan-high',
    voice: 'ryan',
    input: 'Today is a wonderful day to build something people love!',
    instructions: 'Speak in a cheerful and positive tone.'
});

const buffer = Buffer.from(await response.arrayBuffer());
await fs.promises.writeFile('speech.mp3', buffer);

The output format is MP3 by default, but you can request any other supported format.
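As a rough illustration of requesting a non-default format, the sketch below builds the same POST request as the examples above using only the Python standard library. It assumes that DeepFellow follows the OpenAI convention of a `response_format` field with lowercase format names; this page does not confirm either detail. The helper names `build_speech_request` and `save_speech` are hypothetical.

```python
import json
import urllib.request

# Formats listed on this page; lowercase spelling is an assumption
# borrowed from the OpenAI API.
SUPPORTED_FORMATS = {"mp3", "opus", "aac", "flac", "wav", "pcm"}


def build_speech_request(base_url, api_key, model, voice, text, response_format="mp3"):
    """Build (but do not send) a POST request for the v1/audio/speech endpoint."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    payload = {
        "model": model,
        "voice": voice,
        "input": text,
        # Assumption: the field name follows the OpenAI API convention.
        "response_format": response_format,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path):
    """Send the request and write the returned audio bytes to path."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

A call might then look like `save_speech(build_speech_request("https://deepfellow-server-host/v1", "DEEPFELLOW-PROJECT-API-KEY", "speaches-ai/piper-en_US-ryan-high", "ryan", "Hello!", "wav"), "speech.wav")`. Keeping request construction separate from sending makes the payload easy to inspect before any network traffic happens.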

Supported Formats

MP3: The default response format for general use cases.
Opus: Low-latency format for streaming and communication.
AAC: Compressed digital audio format, preferred by popular platforms, e.g. YouTube.
FLAC: Lossless compression.
WAV: Uncompressed audio, suitable for low-latency applications because it avoids decoding overhead.
PCM: Similar to WAV, but contains only the raw samples at 24 kHz (16-bit signed, little-endian), without the header.
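Because PCM output has no header, most players cannot open it directly. The sketch below shows one way to wrap such raw samples in a WAV container with Python's standard `wave` module. It assumes the stream is mono, which this page does not state; the 24 kHz rate and 16-bit sample width come from the format description above. The function name `pcm_to_wav` is hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1):
    """Wrap raw 16-bit signed little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)     # assumption: mono output
        wav_file.setsampwidth(2)            # 16-bit samples = 2 bytes each
        wav_file.setframerate(sample_rate)  # 24 kHz per the PCM description
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

For example, after saving a PCM response to `speech.pcm`, `open("speech.wav", "wb").write(pcm_to_wav(open("speech.pcm", "rb").read()))` would produce a file that ordinary audio players can read, provided the mono assumption holds.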

To read more about text to speech, visit the OpenAI API documentation.
