MesmerTools
Text to Speech

Generate natural speech audio from any text. Pick high quality for a professional Azure OpenAI audiobook narrator, or low quality for a fast Kokoro-82M render. Cached at the edge. Rate limited to 20 generations per hour.

REST API

GET, POST /api/v1/tts
Pass text, optional voice, and optional quality as query params (GET) or a JSON body (POST). Returns a JSON response with the CDN URL of the generated MP3.

Parameters

text (required): Text to synthesize. Up to 2,000 characters per request.
quality: high (default) uses the Azure GPT-audio narrator; low uses Kokoro-82M via Replicate. High falls back to low on error or after a 30s timeout.
voice: For high: hq_female or hq_male. For low: any Kokoro voice id (e.g. af_bella, am_adam). Defaults to hq_female.

Example

curl "https://mesmer.tools/api/v1/tts?text=Hello+world&quality=high&voice=hq_female"
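
The same GET request can be assembled from a script. The sketch below only builds the URL and does not send it. The helper name build_tts_url and its client-side length check are illustrative, not part of the API; only the endpoint, the parameter names, and the 2,000-character limit come from the docs above.

```python
from urllib.parse import urlencode

API_URL = "https://mesmer.tools/api/v1/tts"
MAX_CHARS = 2000  # documented per-request limit


def build_tts_url(text, quality=None, voice=None):
    """Return a GET URL for the TTS endpoint (hypothetical helper).

    Optional parameters are left out when unset so the server
    applies its own defaults.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_CHARS:
        raise ValueError(f"text exceeds {MAX_CHARS} characters")
    params = {"text": text}
    if quality is not None:
        params["quality"] = quality
    if voice is not None:
        params["voice"] = voice
    # urlencode percent-encodes values and turns spaces into '+'
    return f"{API_URL}?{urlencode(params)}"


print(build_tts_url("Hello world", quality="high", voice="hq_female"))
```

Encoding through urlencode matters once the text contains characters such as & or ?, which would otherwise be read as part of the query string.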

POST body

curl -X POST "https://mesmer.tools/api/v1/tts" \
  -H "Content-Type: application/json" \
  -d '{"text":"Hello world","quality":"high","voice":"hq_female"}'
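
For callers who prefer a script to curl, here is a minimal Python sketch of the same POST using only the standard library. The function names and the 60-second client timeout are assumptions made for illustration; the endpoint, header, and body fields mirror the curl command above.

```python
import json
import urllib.request

API_URL = "https://mesmer.tools/api/v1/tts"


def build_tts_request(text, quality="high", voice="hq_female"):
    """Build (but do not send) a POST request carrying the JSON body."""
    body = json.dumps({"text": text, "quality": quality, "voice": voice})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, quality="high", voice="hq_female", timeout=60):
    """Send the request and return the decoded JSON response.

    The timeout value is an assumption, chosen to sit above the
    documented 30s high-quality fallback window.
    """
    req = build_tts_request(text, quality, voice)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Calling synthesize("Hello world") performs a real generation and counts against the hourly rate limit, so build_tts_request is kept separate for inspection and testing.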

Response

{
"url": "https://cdn.mesmer.tools/tts/hq/hq_female/a1b2c3…mp3",
"hash": "a1b2c3d4e5f6…",
"cached": false,
"quality": "high",
"voice": "hq_female",
"provider": "openai",
"fellBack": false
}

When fellBack: true, the response also includes fallbackReason with the error that triggered the fallback.
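
One way a client might read this response, including the fallback fields, is sketched below. The TtsResult class and parse_tts_response function are hypothetical names, and the sample payload uses invented placeholder values rather than real service output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class TtsResult:
    url: str
    hash: str
    cached: bool
    quality: str
    voice: str
    provider: str
    fell_back: bool
    fallback_reason: Optional[str] = None


def parse_tts_response(raw: str) -> TtsResult:
    """Map the documented JSON fields onto a typed result."""
    data = json.loads(raw)
    fell_back = bool(data.get("fellBack", False))
    return TtsResult(
        url=data["url"],
        hash=data["hash"],
        cached=data["cached"],
        quality=data["quality"],
        voice=data["voice"],
        provider=data["provider"],
        fell_back=fell_back,
        # fallbackReason is only documented as present when fellBack is true
        fallback_reason=data.get("fallbackReason") if fell_back else None,
    )


# Placeholder payload for illustration only; every value is made up.
sample = json.dumps({
    "url": "https://cdn.example/clip.mp3",
    "hash": "0000",
    "cached": False,
    "quality": "low",
    "voice": "af_bella",
    "provider": "example-provider",
    "fellBack": True,
    "fallbackReason": "example reason",
})
result = parse_tts_response(sample)
if result.fell_back:
    print("fell back:", result.fallback_reason)
```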

Batch endpoint

POST /api/v1/tts/batch (auth required)
Generate up to 200 clips in a single request, authenticated with a shared admin token. Body: { items: [{ text, voice }] }. Returns an array of results in the same order as the items. Contact the admin for access.
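
Catalogs larger than 200 clips have to be split across several requests. The sketch below shows one way to do that while preserving order. The helper name build_batch_bodies is hypothetical, and because the docs do not say how the admin token is transmitted, authentication is deliberately left out.

```python
import json

MAX_BATCH_ITEMS = 200  # documented per-request cap


def build_batch_bodies(items, max_items=MAX_BATCH_ITEMS):
    """Split (text, voice) pairs into JSON bodies of at most max_items each.

    Order is preserved, so results can be matched back to inputs by
    position within each batch.
    """
    bodies = []
    for start in range(0, len(items), max_items):
        chunk = items[start:start + max_items]
        payload = {"items": [{"text": t, "voice": v} for t, v in chunk]}
        bodies.append(json.dumps(payload))
    return bodies


lines = [(f"Line {i}", "af_bella") for i in range(450)]
bodies = build_batch_bodies(lines)
print([len(json.loads(b)["items"]) for b in bodies])
```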

Available voices

High quality (Azure GPT-audio): hq_female, hq_male
Low quality (Kokoro-82M): American English and British English voices

How it works

Three steps between your text and a ready-to-embed MP3.

Step 1
terminal
$ curl "https://mesmer.tools/api/v1/tts?text=Hello+world&voice=af_bella&quality=low"

Send text and voice

Make a GET or POST request with your text and a voice id. No authentication, no SDK, no API key.

Step 2
cache lookup: R2
kokoro-82m: rendering
wav → mp3: lamejs
r2 upload: stored

We render with Kokoro

Kokoro-82M synthesizes the audio on Replicate. We transcode the WAV to MP3 with lamejs and cache the result in R2 so repeat requests are instant.
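
The cache-then-render flow described here can be sketched in a few lines. This is an illustration of the pattern, not the service's code: a plain dict stands in for R2, and the render and transcode callables are stubs standing in for Kokoro-82M and lamejs.

```python
def get_or_render(cache, key, render, transcode):
    """Return (mp3_bytes, cached). Render and transcode only on a miss."""
    if key in cache:
        return cache[key], True
    wav = render()          # stand-in for the Kokoro-82M job
    mp3 = transcode(wav)    # stand-in for the WAV to MP3 step
    cache[key] = mp3        # stand-in for the R2 upload
    return mp3, False


calls = []


def fake_render():
    calls.append("render")
    return b"WAV"


def fake_transcode(wav):
    return b"MP3:" + wav


store = {}
first = get_or_render(store, "demo-key", fake_render, fake_transcode)
second = get_or_render(store, "demo-key", fake_render, fake_transcode)
print(first[1], second[1], len(calls))
```

The second call returns the stored bytes without invoking the renderer, which is the behavior the cached field in the response reports.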

Step 3
{
"url": "/tts/af_bella/…mp3",
"hash": "a1b2c3…",
"cached": false
}

Get a CDN URL

Receive a JSON response with a public MP3 URL you can embed in an audio tag, stream, or save. Same text and voice always return the same URL.
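
As a small illustration of the embedding step, the helper below turns a returned URL into an audio element. The function name audio_tag and the example URL are placeholders; the URL is escaped so it is safe to drop into an HTML attribute.

```python
from html import escape


def audio_tag(mp3_url):
    """Return an HTML <audio> element that plays the given MP3 URL."""
    return f'<audio controls src="{escape(mp3_url, quote=True)}"></audio>'


print(audio_tag("https://cdn.example/clip.mp3"))
```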

Why use this API

A no-frills text-to-speech endpoint for developers who want a real voice, a cached MP3, and no account creation step.

25+ voices, American & British English

Natural Kokoro voices

Kokoro-82M is an 82-million-parameter open model that rivals commercial TTS quality at a fraction of the cost. Female and male voices in American and British accents.

Cache hit: instant
R2-backed, content-addressed

Content-addressed caching

MP3s are keyed by voice and SHA-256 of the text. The same input always resolves to the same stable URL, so repeat calls skip the model entirely and your CDN can cache aggressively.
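
A content-addressed key of this kind might be derived as follows. The exact key layout the service uses is not documented here, so the voice/digest format, the UTF-8 encoding, and the absence of any text normalization are all assumptions made for illustration.

```python
import hashlib


def cache_key(voice, text):
    """Derive a deterministic key from the voice id and SHA-256 of the text.

    Hypothetical layout: "<voice>/<hex digest>". The real service may
    include other components or normalize the text first.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{voice}/{digest}"


a = cache_key("af_bella", "Hello world")
b = cache_key("af_bella", "Hello world")
c = cache_key("am_adam", "Hello world")
print(a == b, a == c)
```

Because the key depends only on the inputs, a client can compute it ahead of time and deduplicate requests before calling the API.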

GET /api/v1/tts?text=Hello+world&voice=af_bella

One endpoint, zero setup

Works with curl, fetch, Axios, any HTTP client. GET for quick testing, POST for long text. No SDKs to install, no API keys to manage, no signup required.

200 clips per request
5 concurrent Replicate jobs

Batch pipeline ready

A separate /batch endpoint handles up to 200 clips per request for pre-rendering catalogs, audiobook chapters, or podcast generation pipelines.

Common use cases

From reading apps to game NPCs, text-to-speech is the fastest way to give any product a voice.

Content narration

Add voice-over to articles, newsletters, or documentation. Pre-generate chapter narrations for audiobooks, podcasts, or reading apps.

Used by reading apps, learning platforms, and news publishers offering audio versions of written content.

Game & NPC dialog

Voice NPCs, tutorial prompts, or procedurally generated story lines. Cache lines by content hash so they never re-render on the hot path.

Used by indie game devs, interactive fiction creators, and AI companion apps where voice variety matters.

Accessibility

Add read-aloud buttons to your app for users with low vision or reading disabilities. Give your product a screen-reader-friendly audio layer without paying per-character TTS pricing.

Used by education platforms, SaaS products, and accessibility teams retrofitting existing UIs.