Text to Speech

Convert text to speech with natural voices. 50+ voices, 30+ languages. Download MP3. Free online tool.

✓ Free ✓ No sign-up ✓ Works in browser

Speed: 1x (0.5x to 2x)

Pitch: 1 (Low to High)

Volume: 100% (0% to 100%)

Uses your browser's built-in Text-to-Speech. Available voices depend on your OS and browser.

How to Use This Tool

Enter Your Text

Type or paste the text you want to convert to audio. Plain text works everywhere. SSML (Speech Synthesis Markup Language) can be entered for finer control, though how much of it is honored depends on your browser.

Choose Voice and Language

Select a voice and language from the list your browser and operating system provide; the number of voices varies by device. Then adjust the speaking speed and pitch.

Play or Download MP3

Click Play to preview the audio in your browser. Click Download to save as an MP3 file for use in videos, podcasts, or apps.

Frequently Asked Questions

What file format is the audio output?

The tool outputs MP3 audio, which is compatible with all devices, video editors, podcast platforms, and audio players.

What is SSML?

SSML (Speech Synthesis Markup Language) is an XML-based language that lets you control pronunciation, pauses, emphasis, and speaking style in text-to-speech output.

Can I use the audio in commercial projects?

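Yes, as far as this tool is concerned: it places no restrictions on using the generated audio in personal or commercial projects, including YouTube videos, podcasts, and apps. The voices themselves come from your operating system, so check the voice vendor's license terms before redistributing the audio.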

How long can the text be?

The free tool supports up to 3,000 characters per conversion. For longer texts, split into sections and combine the audio files.

About Text to Speech

A blind user is reading an article your site published and needs the browser to read it aloud because their preferred screen reader cannot handle the custom JavaScript-rendered markup. A creator is prototyping an explainer video and wants to hear a 300-word script spoken in several voices before investing in a professional voiceover session at 300 USD per hour.

This TTS tool uses the browser's Web Speech API (speechSynthesis), which ships in every modern browser (Chrome, Safari, Firefox, Edge), with no external service required. Voice selection depends on what the operating system provides: macOS ships with around 70 voices across languages, Windows 10 and later include Microsoft's neural voices, and Android and iOS provide Google and Apple voices respectively. Quality varies dramatically: macOS Siri and Apple's premium voices sound close to natural human speech, while some older Windows voices sound like 2005-era GPS units.

SSML support is patchy: Chrome ignores most tags, and Safari supports a subset. For production-grade voiceover (commercial videos, audiobooks), use a paid service such as ElevenLabs or Google Cloud Text-to-Speech, which offer consistent quality across devices.

How it works

1. Browser's Web Speech API handles synthesis
Input is passed to window.speechSynthesis.speak(), which hands the text and voice choice to the operating system. The OS renders audio using its installed TTS engine: macOS uses its native speech synthesizer, Windows uses Microsoft voices including newer neural options, Android uses Google's engine, and iOS uses Apple's. No text is sent to our servers. With locally installed voices (those reporting localService: true), synthesis happens entirely on-device; some browsers also list network voices, such as Chrome's Google voices, which send the text to the vendor for synthesis.
2. Voice list pulled from speechSynthesis.getVoices()
The voice dropdown is populated from the user's installed voices, which vary by OS and by which additional language packs the user has downloaded. Chrome on macOS typically shows 60 to 90 voices; Chrome on Windows shows whatever Microsoft TTS voices are installed plus some Chrome-bundled options; Android depends heavily on device and region. Some voices require a one-time download from the OS settings before they can be used.
3. Rate, pitch, and volume controllable via utterance properties
SpeechSynthesisUtterance exposes rate (0.1 to 10, with 1 as normal), pitch (0 to 2, with 1 as normal), and volume (0 to 1). These work consistently across browsers. SSML markup (<emphasis>, <break>) is supported inconsistently: Safari honors some tags, while Chrome mostly ignores them. For production pacing control, stick to plain text and adjust the rate parameter rather than relying on SSML.
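
As a minimal sketch of step 3, the helper below maps slider-style inputs onto the documented rate, pitch, and volume ranges. The names toUtteranceSettings and UtteranceSettings are hypothetical and not taken from this tool's source; the browser-only usage is left in comments so the logic can be run anywhere.

```typescript
// Ranges from the text above: rate 0.1 to 10, pitch 0 to 2, volume 0 to 1.
interface UtteranceSettings {
  rate: number;
  pitch: number;
  volume: number;
}

// Keep a value inside [min, max].
const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Hypothetical helper: converts UI values (speed multiplier, pitch,
// volume as a percentage) into SpeechSynthesisUtterance property values.
function toUtteranceSettings(
  speed: number,
  pitch: number,
  volumePercent: number
): UtteranceSettings {
  return {
    rate: clamp(speed, 0.1, 10),
    pitch: clamp(pitch, 0, 2),
    volume: clamp(volumePercent / 100, 0, 1),
  };
}

// Browser-only usage (assumes window.speechSynthesis is available):
//   const utterance = new SpeechSynthesisUtterance("Hello");
//   Object.assign(utterance, toUtteranceSettings(1.25, 1, 80));
//   window.speechSynthesis.speak(utterance);
```

Clamping before assignment keeps out-of-range slider values from reaching the utterance, where browsers may handle them differently.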

Pro tips

Voice quality varies wildly by device: test before relying on it

Premium macOS and iOS voices (Siri, Apple Neural) sound nearly indistinguishable from human. Modern Windows 10/11 Microsoft Neural voices are competitive. Older Windows 7/8 voices sound mechanical. Some Linux setups ship with espeak which sounds like 1990s assistive tech. If you are building a feature that relies on TTS, test on the actual target devices and browsers, and provide a fallback message ('Your device may not have high-quality voices available; try Chrome on macOS for best results') so users with older setups understand the limitation rather than blaming your product.

Chunking long text avoids browser hangs

Chrome, and reportedly some other browsers, has an undocumented limit of roughly 200 to 300 characters per utterance, beyond which playback can stop silently. For long-form content (over 500 words), split the text into sentences using a period-plus-space regex, create a separate SpeechSynthesisUtterance for each, and queue them. This avoids the mid-paragraph cutoff bug that plagues naive implementations, where a user hits play on a 2,000-word article and the audio stops after 30 seconds with no error. Our tool handles chunking automatically, but note this pattern if you are building your own TTS integration.
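
The chunking pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not this tool's actual implementation: chunkText and speakAll are hypothetical names, the 200-character default is taken from the lower end of the range quoted above, and existing line breaks are treated as sentence boundaries too.

```typescript
// Split text into sentence-sized chunks no longer than maxLen characters.
function chunkText(text: string, maxLen: number = 200): string[] {
  if (maxLen < 1) throw new RangeError("maxLen must be at least 1");

  // Mark sentence ends (., ! or ? followed by whitespace) with a newline,
  // then split on newlines and drop empty pieces.
  const sentences = text
    .replace(/([.!?])\s+/g, "$1\n")
    .split("\n")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  for (const sentence of sentences) {
    let rest = sentence;
    // A sentence longer than maxLen is cut at the last space that fits,
    // or hard-cut at maxLen when it contains no usable space.
    while (rest.length > maxLen) {
      const lastSpace = rest.lastIndexOf(" ", maxLen);
      const cutAt = lastSpace > 0 ? lastSpace : maxLen;
      chunks.push(rest.slice(0, cutAt).trim());
      rest = rest.slice(cutAt).trim();
    }
    if (rest.length > 0) chunks.push(rest);
  }
  return chunks;
}

// Passes each chunk to the supplied speak callback in order and returns
// the number of chunks. Injecting the callback keeps this testable.
function speakAll(
  text: string,
  speak: (chunk: string) => void,
  maxLen: number = 200
): number {
  const chunks = chunkText(text, maxLen);
  for (const chunk of chunks) {
    speak(chunk);
  }
  return chunks.length;
}

// Browser-only usage: speechSynthesis.speak() appends to the speech queue,
// so calling it once per chunk plays the chunks back to back.
//   speakAll(articleText, (chunk) =>
//     window.speechSynthesis.speak(new SpeechSynthesisUtterance(chunk)));
```

Splitting on punctuation followed by whitespace will also break after abbreviations such as "Dr. Smith"; for read-aloud purposes that only adds a short pause, so the sketch does not try to detect them.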

Language codes determine voice selection — get them right

Voice selection uses BCP 47 language tags such as en-US, en-GB, es-ES, es-MX, fr-FR, fr-CA, zh-CN, and zh-TW. Picking en-US for Spanish text will produce an American-accented Spanish reading that sounds off; picking es-ES for Mexican Spanish text sounds subtly wrong to native speakers. When multi-language content is involved, set the utterance.lang property correctly for each segment so the right voice is selected automatically. After voice selection itself, mismatched language tags are one of the most common causes of bad TTS quality.
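
One way to match a voice to a BCP 47 tag is sketched below. VoiceLike and pickVoice are hypothetical names; the interface only mirrors the name and lang fields that SpeechSynthesisVoice objects expose. Normalizing underscores (for example en_US) is a defensive assumption, since some platforms have been seen reporting tags that way.

```typescript
// Minimal shape shared with SpeechSynthesisVoice (name and lang only).
interface VoiceLike {
  name: string;
  lang: string;
}

// Normalize a tag for comparison: "en_US" and "EN-us" both become "en-us".
const normalizeTag = (tag: string): string =>
  tag.replace(/_/g, "-").toLowerCase();

// Prefer an exact region match (es-MX), then fall back to any voice with
// the same primary language (es-*). Returns undefined if neither exists.
function pickVoice<V extends VoiceLike>(
  voices: V[],
  lang: string
): V | undefined {
  const wanted = normalizeTag(lang);
  const primary = wanted.split("-")[0];
  const exact = voices.find((v) => normalizeTag(v.lang) === wanted);
  if (exact !== undefined) return exact;
  return voices.find((v) => normalizeTag(v.lang).split("-")[0] === primary);
}

// Browser-only usage (assumes the voice list has already loaded):
//   const utterance = new SpeechSynthesisUtterance("Hola, ¿qué tal?");
//   utterance.lang = "es-MX";
//   const voice = pickVoice(window.speechSynthesis.getVoices(), "es-MX");
//   if (voice) utterance.voice = voice;
//   window.speechSynthesis.speak(utterance);
```

The same-language fallback accepts the "subtly wrong" regional accent described above in preference to a voice for a different language entirely; stricter callers could skip the fallback and show a message instead.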

Honest limitations

· Voice quality depends entirely on the user's operating system and installed voices — you cannot guarantee consistent audio across devices.
· SSML support is inconsistent across browsers; avoid relying on advanced pacing or emphasis tags for cross-browser playback.
· For commercial voiceover (ads, audiobooks, videos), browser TTS is a prototype tool only; use paid services (ElevenLabs, Google Cloud TTS, Amazon Polly) for production-grade audio that sounds the same to every listener.

Frequently asked questions

Why do voices sound different on my phone versus my laptop?

Because voice selection and quality are provided by the operating system, not the browser or this tool. Apple's iOS and macOS ship high-quality neural voices that sound close to human; Android depends on Google's Play Services TTS and the specific Android version; Windows 10/11 includes modern Microsoft Neural voices but older Windows versions shipped lower-quality voices. The same text spoken by the 'default' voice on different devices will sound different because the default varies. If audio quality matters for your use case, specify a voice explicitly rather than relying on the default, and accept that some users on older hardware will hear lower-quality audio.

Can I download the generated audio as an MP3 or WAV file?

Not directly. The Web Speech API generates audio in real time through the OS audio pipeline and does not expose a 'save to file' method or an audio stream. To capture browser TTS as an audio file you would need to capture tab or system audio and record it with the MediaRecorder API (an approach with limitations and inconsistent cross-browser support), or use a paid TTS service with a REST API that returns audio bytes (ElevenLabs, Google Cloud Text-to-Speech, Amazon Polly). For occasional personal use, screen-recording software can capture the audio, but for systematic generation of audio files for podcasts or videos, paid TTS APIs are the right tool.

What languages does this support?

Support depends on which voices your operating system has installed. Most modern OS installations cover the major European languages (English, Spanish, French, German, Italian, Portuguese), CJK (Chinese, Japanese, Korean), and other widely spoken languages (Arabic, Hindi, Vietnamese). Less common languages (Swahili, Welsh, Icelandic) often require a one-time download of an additional language pack from the OS settings. Check your voice dropdown: if a language does not appear, visit your OS TTS settings to see whether voices can be installed. Browser vendors occasionally bundle extra voices, but OS-native voices generally sound better.

Why does the voice cut off partway through a long paragraph?

Chrome and some other browsers have an undocumented limit of roughly 200 to 300 characters per SpeechSynthesisUtterance that causes longer texts to stop unexpectedly. Our tool works around this by splitting text into sentence-sized chunks and queuing them sequentially, so long-form content plays end to end. If you are using speechSynthesis directly in your own code and seeing this issue, implement the same chunking: split on terminal punctuation, create one SpeechSynthesisUtterance per chunk, and call speechSynthesis.speak() on each chunk in order. Each call appends to the speech queue, so the chunks play back to back.

Is this suitable for creating commercial video voiceover or audiobooks?

Generally no. Browser TTS is a prototype and accessibility tool, not a production voiceover service. For commercial content, the audio quality varies too much across listener devices, the Web Speech API does not produce audio files, and the licensing of system voices for redistribution can be restrictive (check Apple's and Microsoft's terms before distributing audio generated with their voices commercially). For production voiceover, paid services such as ElevenLabs (consistent high-quality neural voices with clear commercial licensing), Google Cloud Text-to-Speech, Amazon Polly, or Azure Cognitive Services Speech provide consistent output you can use in ads, videos, and audiobooks without legal or quality concerns.

TTS fits naturally with writing and content tools. The word-counter estimates reading time, which roughly matches spoken duration at 150 wpm. The ai-writing-assistant drafts scripts that then feed into TTS for pacing review before recording. The paraphrasing-tool rewrites sentences that sound awkward when spoken aloud; often you catch phrasing issues by ear that were invisible on the page. For accessibility work, the caption-generator is adjacent: captions for video paired with TTS for blog content form a more complete accessibility toolkit.
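
The 150 wpm rule of thumb works out as in the rough sketch below. estimateSpokenSeconds is a hypothetical helper, and treating the rate setting as a simple linear multiplier on speaking speed is an assumption; real voices vary in pace.

```typescript
// Estimate spoken duration in seconds from a word count.
// Assumes 150 words per minute at rate 1 and that the rate setting
// scales speaking speed linearly (an approximation).
function estimateSpokenSeconds(
  text: string,
  wordsPerMinute: number = 150,
  rate: number = 1
): number {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0).length;
  return (words / (wordsPerMinute * rate)) * 60;
}
```

Under these assumptions, a 300-word script comes to about 120 seconds at normal rate and about 60 seconds at 2x.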