initial commit

commit 77af47274c
2026-01-25 18:58:40 +09:00
101 changed files with 16247 additions and 0 deletions

.gitattributes (new file)

@@ -0,0 +1,3 @@
assets/onnx/*.onnx filter=lfs diff=lfs merge=lfs -text
ios/** linguist-vendored
web/** linguist-vendored

.gitignore (new file)

@@ -0,0 +1,62 @@
assets/*
assets/.git
assets/.gitignore
assets/.gitattributes
*.onnx
onnx
# Output files
results
# Python
__pycache__
*.py[cod]
*$py.class
*.so
.Python
# Virtual environments
.venv
venv/
ENV/
env/
# Node.js
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*
package-lock.json
# Swift
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace
xcuserdata/
DerivedData/
# Distribution / packaging
build/
dist/
*.egg-info/
.eggs/
# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
assets

LICENSE (new file)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 Supertone Inc.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md (new file)

@@ -0,0 +1,452 @@
# Supertonic — Lightning Fast, On-Device TTS
[![v2 Demo](https://img.shields.io/badge/🤗%20v2-Demo-yellow)](https://huggingface.co/spaces/Supertone/supertonic-2)
[![v2 Models](https://img.shields.io/badge/🤗%20v2-Models-blue)](https://huggingface.co/Supertone/supertonic-2)
[![v1 Demo](https://img.shields.io/badge/🤗%20v1%20(old)-Demo-lightgrey)](https://huggingface.co/spaces/Supertone/supertonic#interactive-demo)
[![v1 Models](https://img.shields.io/badge/🤗%20v1%20(old)-Models-lightgrey)](https://huggingface.co/Supertone/supertonic)
<p align="center">
<img src="img/supertonic_preview_0.1.jpg" alt="Supertonic Banner">
</p>
**Supertonic** is a lightning-fast, on-device text-to-speech system designed for **extreme performance** with minimal computational overhead. Powered by ONNX Runtime, it runs entirely on your device—no cloud, no API calls, no privacy concerns.
### 📰 Update News
- **2026.01.22** - **[Voice Builder](https://supertonic.supertone.ai/voice_builder)** is now live! Turn your voice into a deployable, edge-native TTS with permanent ownership.
<p align="center">
<img src="img/voicebuilder_img.png" alt="Voice Builder" width="600">
</p>
- **2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
- **2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic`. For details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py)
- **2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
- **2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
- **2025.11.24** - Added Flutter SDK support with macOS compatibility
### Table of Contents
- [Demo](#demo)
- [Why Supertonic?](#why-supertonic)
- [Language Support](#language-support)
- [Getting Started](#getting-started)
- [Performance](#performance)
- [Built with Supertonic](#built-with-supertonic)
- [Citation](#citation)
- [License](#license)
## Demo
### Raspberry Pi
Watch Supertonic running on a **Raspberry Pi**, demonstrating on-device, real-time text-to-speech synthesis:
https://github.com/user-attachments/assets/ea66f6d6-7bc5-4308-8a88-1ce3e07400d2
### E-Reader
Experience Supertonic on an **Onyx Boox Go 6** e-reader in airplane mode, achieving an average RTF of 0.3 with zero network dependency:
https://github.com/user-attachments/assets/64980e58-ad91-423a-9623-78c2ffc13680
### Chrome Extension
The extension turns any webpage into audio in under one second, delivering lightning-fast, on-device text-to-speech with zero network dependency—free, private, and effortless:
https://github.com/user-attachments/assets/cc8a45fc-5c3e-4b2c-8439-a14c3d00d91c
---
> 🎧 **Try it now**: Experience Supertonic in your browser with our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic-2), or get started with pre-trained models from [**Hugging Face Hub**](https://huggingface.co/Supertone/supertonic-2)
## Why Supertonic?
- **⚡ Blazingly Fast**: Generates speech up to **167× faster than real-time** on consumer hardware (M4 Pro)—unmatched by any other TTS system
- **🪶 Ultra Lightweight**: Only **66M parameters**, optimized for efficient on-device performance with minimal footprint
- **📱 On-Device Capable**: **Complete privacy** and **zero latency**—all processing happens locally on your device
- **🎨 Natural Text Handling**: Seamlessly processes numbers, dates, currency, abbreviations, and complex expressions without pre-processing
- **⚙️ Highly Configurable**: Adjust inference steps, batch processing, and other parameters to match your specific needs
- **🧩 Flexible Deployment**: Deploy seamlessly across servers, browsers, and edge devices with multiple runtime backends
## Language Support
We provide ready-to-use TTS inference examples across multiple ecosystems:
| Language/Platform | Path | Description |
|-------------------|------|-------------|
| [**Python**](py/) | `py/` | ONNX Runtime inference |
| [**Node.js**](nodejs/) | `nodejs/` | Server-side JavaScript |
| [**Browser**](web/) | `web/` | WebGPU/WASM inference |
| [**Java**](java/) | `java/` | Cross-platform JVM |
| [**C++**](cpp/) | `cpp/` | High-performance C++ |
| [**C#**](csharp/) | `csharp/` | .NET ecosystem |
| [**Go**](go/) | `go/` | Go implementation |
| [**Swift**](swift/) | `swift/` | macOS applications |
| [**iOS**](ios/) | `ios/` | Native iOS apps |
| [**Rust**](rust/) | `rust/` | Memory-safe systems |
| [**Flutter**](flutter/) | `flutter/` | Cross-platform apps |
> For detailed usage instructions, please refer to the README.md in each language directory.
## Getting Started
First, clone the repository:
```bash
git clone https://github.com/supertone-inc/supertonic.git
cd supertonic
```
### Prerequisites
Before running the examples, download the ONNX models and preset voices, and place them in the `assets` directory:
> **Note:** The Hugging Face repository uses Git LFS. Please ensure Git LFS is installed and initialized before cloning or pulling large model files.
> - macOS: `brew install git-lfs && git lfs install`
> - Generic: see `https://git-lfs.com` for installers
```bash
git clone https://huggingface.co/Supertone/supertonic-2 assets
```
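If synthesis later fails to load the models, the `.onnx` files may still be un-fetched Git LFS pointer stubs rather than real binaries. A quick way to check is to look for the LFS pointer signature at the start of each file — the snippet below is a hypothetical helper for illustration, not part of this repo:

```python
# Hypothetical sanity check (not part of this repo): verify that the .onnx
# files under assets/ are real model binaries rather than un-fetched Git LFS
# pointer stubs, which are tiny text files starting with a fixed signature.
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path: Path) -> bool:
    """True if the file begins with the Git LFS pointer signature."""
    with path.open("rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX

def find_lfs_stubs(assets_dir: str = "assets") -> list[str]:
    """Return the paths of any .onnx files that are still LFS pointers."""
    return [str(p) for p in Path(assets_dir).rglob("*.onnx") if is_lfs_pointer(p)]
```

If `find_lfs_stubs()` returns anything, run `git lfs pull` inside the `assets` directory to fetch the real model files.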
### Quick Start
**Python Example** ([Details](py/))
```bash
cd py
uv sync
uv run example_onnx.py
```
**Node.js Example** ([Details](nodejs/))
```bash
cd nodejs
npm install
npm start
```
**Browser Example** ([Details](web/))
```bash
cd web
npm install
npm run dev
```
**Java Example** ([Details](java/))
```bash
cd java
mvn clean install
mvn exec:java
```
**C++ Example** ([Details](cpp/))
```bash
cd cpp
mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```
**C# Example** ([Details](csharp/))
```bash
cd csharp
dotnet restore
dotnet run
```
**Go Example** ([Details](go/))
```bash
cd go
go mod download
go run example_onnx.go helper.go
```
**Swift Example** ([Details](swift/))
```bash
cd swift
swift build -c release
.build/release/example_onnx
```
**Rust Example** ([Details](rust/))
```bash
cd rust
cargo build --release
./target/release/example_onnx
```
**iOS Example** ([Details](ios/))
```bash
cd ios/ExampleiOSApp
xcodegen generate
open ExampleiOSApp.xcodeproj
```
- In Xcode: Targets → ExampleiOSApp → Signing: select your Team
- Choose your iPhone as run destination → Build & Run
### Technical Details
- **Runtime**: ONNX Runtime for cross-platform inference (CPU-optimized; GPU mode is not tested)
- **Browser Support**: onnxruntime-web for client-side inference
- **Batch Processing**: Supports batch inference for improved throughput
- **Audio Output**: Outputs 16-bit WAV files
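As a rough illustration of the last bullet, here is a minimal sketch of turning float model output in `[-1, 1]` into a mono 16-bit PCM WAV file using only the Python standard library. This is not the repo's actual helper code, and the 44100 Hz default sample rate is an assumption for illustration — use whatever rate your model emits:

```python
# Minimal sketch (not the repo's helper code): write float audio samples in
# [-1, 1] as a mono 16-bit PCM WAV file using Python's standard library.
# The 44100 Hz default sample rate is an assumption for illustration only.
import struct
import wave

def write_wav_16bit(path: str, samples: list[float], sample_rate: int = 44100) -> None:
    """Clamp each sample to [-1, 1], scale to int16, and write mono PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 2 bytes per sample = 16-bit
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(ints)}h", *ints))
```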
## Performance
We evaluated Supertonic's performance (with 2 inference steps) using two key metrics across input texts of varying lengths: Short (59 chars), Mid (152 chars), and Long (266 chars).
**Metrics:**
- **Characters per Second**: Measures throughput by dividing the number of input characters by the time required to generate audio. Higher is better.
- **Real-time Factor (RTF)**: Measures the time taken to synthesize audio relative to its duration. Lower is better (e.g., RTF of 0.1 means it takes 0.1 seconds to generate one second of audio).
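Concretely, the two metrics reduce to two divisions over a single synthesis run:

```python
# The two metrics defined above, computed from one synthesis run.
def chars_per_second(n_chars: int, synthesis_seconds: float) -> float:
    """Throughput: input characters divided by wall-clock generation time."""
    return n_chars / synthesis_seconds

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF: generation time relative to the duration of the audio produced."""
    return synthesis_seconds / audio_seconds

# Example: 152 characters synthesized in 0.2 s, yielding 10 s of audio,
# gives roughly 760 chars/s and an RTF of about 0.02.
```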
### Characters per Second
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 912 | 1048 | 1263 |
| **Supertonic** (M4 Pro - WebGPU) | 996 | 1801 | 2509 |
| **Supertonic** (RTX4090) | 2615 | 6548 | 12164 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 144 | 209 | 287 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 37 | 55 | 82 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 12 | 18 | 24 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 38 | 64 | 92 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 104 | 107 | 117 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 37 | 42 | 47 |
> **Notes:**
> - `API` = Cloud-based API services (measured from Seoul)
> - `Open` = Open-source models
> - Supertonic (M4 Pro - CPU) and (M4 Pro - WebGPU): Tested with ONNX
> - Supertonic (RTX4090): Tested with PyTorch model
> - Kokoro: Tested on M4 Pro CPU with ONNX
> - NeuTTS Air: Tested on M4 Pro CPU with Q8-GGUF
### Real-time Factor
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.015 | 0.013 | 0.012 |
| **Supertonic** (M4 Pro - WebGPU) | 0.014 | 0.007 | 0.006 |
| **Supertonic** (RTX4090) | 0.005 | 0.002 | 0.001 |
| `API` [ElevenLabs Flash v2.5](https://elevenlabs.io/docs/api-reference/text-to-speech/convert) | 0.133 | 0.077 | 0.057 |
| `API` [OpenAI TTS-1](https://platform.openai.com/docs/guides/text-to-speech) | 0.471 | 0.302 | 0.201 |
| `API` [Gemini 2.5 Flash TTS](https://ai.google.dev/gemini-api/docs/speech-generation) | 1.060 | 0.673 | 0.541 |
| `API` [Supertone Sona speech 1](https://docs.supertoneapi.com/en/api-reference/endpoints/text-to-speech) | 0.372 | 0.206 | 0.163 |
| `Open` [Kokoro](https://github.com/hexgrad/kokoro/) | 0.144 | 0.124 | 0.126 |
| `Open` [NeuTTS Air](https://github.com/neuphonic/neutts-air) | 0.390 | 0.338 | 0.343 |
<details>
<summary><b>Additional Performance Data (5-step inference)</b></summary>
<br>
**Characters per Second (5-step)**
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 596 | 691 | 850 |
| **Supertonic** (M4 Pro - WebGPU) | 570 | 1118 | 1546 |
| **Supertonic** (RTX4090) | 1286 | 3757 | 6242 |
**Real-time Factor (5-step)**
| System | Short (59 chars) | Mid (152 chars) | Long (266 chars) |
|--------|-----------------|----------------|-----------------|
| **Supertonic** (M4 Pro - CPU) | 0.023 | 0.019 | 0.018 |
| **Supertonic** (M4 Pro - WebGPU) | 0.024 | 0.012 | 0.010 |
| **Supertonic** (RTX4090) | 0.011 | 0.004 | 0.002 |
</details>
### Natural Text Handling
Supertonic is designed to handle complex, real-world text inputs that contain numbers, currency symbols, abbreviations, dates, and proper nouns.
> 🎧 **Browse audio samples more easily**: Check out our [**Interactive Demo**](https://huggingface.co/spaces/Supertone/supertonic#text-handling) to listen to all audio examples in one place
**Overview of Test Cases:**
| Category | Key Challenges | Supertonic | ElevenLabs | OpenAI | Gemini | Microsoft |
|:--------:|:--------------:|:----------:|:----------:|:------:|:------:|:---------:|
| Financial Expression | Decimal currency, abbreviated magnitudes (M, K), currency symbols, currency codes | ✅ | ❌ | ❌ | ❌ | ❌ |
| Time and Date | Time notation, abbreviated weekdays/months, date formats | ✅ | ❌ | ❌ | ❌ | ❌ |
| Phone Number | Area codes, hyphens, extensions (ext.) | ✅ | ❌ | ❌ | ❌ | ❌ |
| Technical Unit | Decimal numbers with units, abbreviated technical notations | ✅ | ❌ | ❌ | ❌ | ❌ |
<details>
<summary><b>Example 1: Financial Expression</b></summary>
<br>
**Text:**
> "The startup secured **$5.2M** in venture capital, a huge leap from their initial **$450K** seed round."
**Challenges:**
- Decimal point in currency ($5.2M should be read as "five point two million")
- Abbreviated magnitude units (M for million, K for thousand)
- Currency symbol ($) that needs to be properly pronounced as "dollars"
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1eancUOhiSXCVoTu9ddh4S-OcVQaWrPV-/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1-r2scv7XQ1crIDu6QOh3eqVl445W6ap_/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1MFDXMjfmsAVOqwPx7iveS0KUJtZvcwxB/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1dEHpNzfMUucFTJPQK0k4RcFZvPwQTt09/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1b69XWBQnSZZ0WZeR3avv7E8mSdoN6p6P/view?usp=sharing) |
</details>
<details>
<summary><b>Example 2: Time and Date</b></summary>
<br>
**Text:**
> "The train delay was announced at **4:45 PM** on **Wed, Apr 3, 2024** due to track maintenance."
**Challenges:**
- Time expression with PM notation (4:45 PM)
- Abbreviated weekday (Wed)
- Abbreviated month (Apr)
- Full date format (Apr 3, 2024)
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1ehkZU8eiizBenG2DgR5tzBGQBvHS0Uaj/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ta3r6jFyebmA-sT44l8EaEQcMLVmuOEr/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1sskmem9AzHAQ3Hv8DRSZoqX_pye-CXuU/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1zx9X8oMsLMXW0Zx_SURoqjju-By2yh_n/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1ZpGEstZr4hA0EdAWBMCUFFWuAkIpYsVh/view?usp=sharing) |
</details>
<details>
<summary><b>Example 3: Phone Number</b></summary>
<br>
**Text:**
> "You can reach the hotel front desk at **(212) 555-0142 ext. 402** anytime."
**Challenges:**
- Area code in parentheses that should be read as separate digits
- Phone number with hyphen separator (555-0142)
- Abbreviated extension notation (ext.)
- Extension number (402)
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1z-e5iTsihryMR8ll1-N1YXkB2CIJYJ6F/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1HAzVXFTZfZm0VEK2laSpsMTxzufcuaxA/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/15tjfAmb3GbjP_kmvD7zSdIWkhtAaCPOg/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1BCL8n7yligUZyso970ud7Gf5NWb1OhKD/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1c0c0YM_Qm7XxSk2uSVYLbITgEDTqaVzL/view?usp=sharing) |
</details>
<details>
<summary><b>Example 4: Technical Unit</b></summary>
<br>
**Text:**
> "Our drone battery lasts **2.3h** when flying at **30kph** with full camera payload."
**Challenges:**
- Decimal time duration with abbreviation (2.3h = two point three hours)
- Speed unit with abbreviation (30kph = thirty kilometers per hour)
- Technical abbreviations (h for hours, kph for kilometers per hour)
- Technical/engineering context requiring proper pronunciation
**Audio Samples:**
| System | Result | Audio Sample |
|--------|--------|--------------|
| **Supertonic** | ✅ | [🎧 Play Audio](https://drive.google.com/file/d/1kvOBvswFkLfmr8hGplH0V2XiMxy1shYf/view?usp=sharing) |
| ElevenLabs Flash v2.5 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1_SzfjWJe5YEd0t3R7DztkYhHcI_av48p/view?usp=sharing) |
| OpenAI TTS-1 | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1P5BSilj5xFPTV2Xz6yW5jitKZohO9o-6/view?usp=sharing) |
| Gemini 2.5 Flash TTS | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1GU82SnWC50OvC8CZNjhxvNZFKQb7I9_Y/view?usp=sharing) |
| VibeVoice Realtime 0.5B | ❌ | [🎧 Play Audio](https://drive.google.com/file/d/1lUTrxrAQy_viEK2Hlu3KLLtTCe8jvbdV/view?usp=sharing) |
</details>
> **Note:** These samples demonstrate how each system handles text normalization and pronunciation of complex expressions **without requiring pre-processing or phonetic annotations**.
## Built with Supertonic
| Project | Description | Links |
|---------|-------------|-------|
| **TLDRL** | Free, on-device TTS extension for reading any webpage | [Chrome](https://chromewebstore.google.com/detail/tldrl-lightning-tts-power/mdbiaajonlkomihpcaffhkagodbcgbme) |
| **Read Aloud** | Open-source TTS browser extension | [Chrome](https://chromewebstore.google.com/detail/read-aloud-a-text-to-spee/hdhinadidafjejdhmfkjgnolgimiaplp) · [Edge](https://microsoftedge.microsoft.com/addons/detail/read-aloud-a-text-to-spe/pnfonnnmfjnpfgagnklfaccicnnjcdkm) · [GitHub](https://github.com/ken107/read-aloud) |
| **PageEcho** | E-Book reader app for iOS | [App Store](https://apps.apple.com/us/app/pageecho/id6755965837) |
| **VoiceChat** | On-device voice-to-voice LLM chatbot in the browser | [Demo](https://huggingface.co/spaces/RickRossTN/ai-voice-chat) · [GitHub](https://github.com/irelate-ai/voice-chat) |
| **OmniAvatar** | Talking avatar video generator from photo + speech | [Demo](https://huggingface.co/spaces/alexnasa/OmniAvatar) |
| **CopiloTTS** | Kotlin Multiplatform TTS SDK via ONNX Runtime | [GitHub](https://github.com/sigmadeltasoftware/CopiloTTS) |
| **Voice Mixer** | PyQt5 tool for mixing and modifying voice styles | [GitHub](https://github.com/Topping1/Supertonic-Voice-Mixer) |
| **Supertonic MNN** | Lightweight library based on MNN (fp32/fp16/int8) | [GitHub](https://github.com/vra/supertonic-mnn) · [PyPI](https://pypi.org/project/supertonic-mnn/) |
| **Transformers.js** | Hugging Face's JS library with Supertonic support | [GitHub PR](https://github.com/huggingface/transformers.js/pull/1459) · [Demo](https://huggingface.co/spaces/webml-community/Supertonic-TTS-WebGPU) |
| **Pinokio** | 1-click localhost cloud for Mac, Windows, and Linux | [Pinokio](https://pinokio.co/) · [GitHub](https://github.com/SUP3RMASS1VE/SuperTonic-TTS) |
## Citation
The following papers describe the core technologies used in Supertonic. If you use this system in your research or find these techniques useful, please consider citing the relevant papers:
### SupertonicTTS: Main Architecture
This paper introduces the overall architecture of SupertonicTTS, including the speech autoencoder, flow-matching based text-to-latent module, and efficient design choices.
```bibtex
@article{kim2025supertonic,
title={SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System},
author={Kim, Hyeongju and Yang, Jinhyeok and Yu, Yechan and Ji, Seunghun and Morton, Jacob and Bous, Frederik and Byun, Joon and Lee, Juheon},
journal={arXiv preprint arXiv:2503.23108},
year={2025},
url={https://arxiv.org/abs/2503.23108}
}
```
### Length-Aware RoPE: Text-Speech Alignment
This paper presents Length-Aware Rotary Position Embedding (LARoPE), which improves text-speech alignment in cross-attention mechanisms.
```bibtex
@article{kim2025larope,
title={Length-Aware Rotary Position Embedding for Text-Speech Alignment},
author={Kim, Hyeongju and Lee, Juheon and Yang, Jinhyeok and Morton, Jacob},
journal={arXiv preprint arXiv:2509.11084},
year={2025},
url={https://arxiv.org/abs/2509.11084}
}
```
### Self-Purifying Flow Matching: Training with Noisy Labels
This paper describes the self-purification technique for training flow matching models robustly with noisy or unreliable labels.
```bibtex
@article{kim2025spfm,
title={Training Flow Matching Models with Reliable Labels via Self-Purification},
author={Kim, Hyeongju and Yu, Yechan and Yi, June Young and Lee, Juheon},
journal={arXiv preprint arXiv:2509.19091},
year={2025},
url={https://arxiv.org/abs/2509.19091}
}
```
## License
This project's sample code is released under the MIT License; see the [LICENSE](https://github.com/supertone-inc/supertonic?tab=MIT-1-ov-file) for details.
The accompanying model is released under the OpenRAIL-M License; see the [LICENSE](https://huggingface.co/Supertone/supertonic-2/blob/main/LICENSE) file for details.
This model was trained using PyTorch, which is licensed under the BSD 3-Clause License but is not redistributed with this project; see the [LICENSE](https://docs.pytorch.org/FBGEMM/general/License.html) for details.
Copyright (c) 2026 Supertone Inc.

cpp/CMakeLists.txt (new file)

@@ -0,0 +1,122 @@
cmake_minimum_required(VERSION 3.15)
project(Supertonic_CPP)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# Enable aggressive optimization
if(NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif()
# Add optimization flags
set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} -O3 -DNDEBUG -ffast-math")
# Find required packages
find_package(PkgConfig REQUIRED)
find_package(OpenMP)
# ONNX Runtime - Try multiple methods
# Method 1: Try to find via CMake config
find_package(onnxruntime QUIET CONFIG)
if(NOT onnxruntime_FOUND)
# Method 2: Try pkg-config
pkg_check_modules(ONNXRUNTIME QUIET libonnxruntime)
if(ONNXRUNTIME_FOUND)
set(ONNXRUNTIME_INCLUDE_DIR ${ONNXRUNTIME_INCLUDE_DIRS})
set(ONNXRUNTIME_LIB ${ONNXRUNTIME_LIBRARIES})
else()
# Method 3: Manual search in common locations
find_path(ONNXRUNTIME_INCLUDE_DIR
NAMES onnxruntime_cxx_api.h
PATHS
/usr/local/include
/opt/homebrew/include
/usr/include
${CMAKE_PREFIX_PATH}/include
PATH_SUFFIXES onnxruntime
)
find_library(ONNXRUNTIME_LIB
NAMES onnxruntime libonnxruntime
PATHS
/usr/local/lib
/opt/homebrew/lib
/usr/lib
${CMAKE_PREFIX_PATH}/lib
)
endif()
if(NOT ONNXRUNTIME_INCLUDE_DIR OR NOT ONNXRUNTIME_LIB)
message(FATAL_ERROR "ONNX Runtime not found. Please install it:\n"
" macOS: brew install onnxruntime\n"
" Ubuntu: See README.md for installation instructions")
endif()
message(STATUS "Found ONNX Runtime:")
message(STATUS " Include: ${ONNXRUNTIME_INCLUDE_DIR}")
message(STATUS " Library: ${ONNXRUNTIME_LIB}")
endif()
# nlohmann/json
find_package(nlohmann_json REQUIRED)
# Include directories
if(NOT onnxruntime_FOUND)
include_directories(${ONNXRUNTIME_INCLUDE_DIR})
endif()
# Helper library
add_library(tts_helper STATIC
helper.cpp
helper.h
)
if(onnxruntime_FOUND)
target_link_libraries(tts_helper
onnxruntime::onnxruntime
nlohmann_json::nlohmann_json
)
else()
target_include_directories(tts_helper PUBLIC ${ONNXRUNTIME_INCLUDE_DIR})
target_link_libraries(tts_helper
${ONNXRUNTIME_LIB}
nlohmann_json::nlohmann_json
)
endif()
# Enable OpenMP if available
if(OpenMP_CXX_FOUND)
target_link_libraries(tts_helper OpenMP::OpenMP_CXX)
message(STATUS "OpenMP enabled for parallel processing")
else()
message(WARNING "OpenMP not found - parallel processing will be disabled")
endif()
# Example executable
add_executable(example_onnx
example_onnx.cpp
)
if(onnxruntime_FOUND)
target_link_libraries(example_onnx
tts_helper
onnxruntime::onnxruntime
nlohmann_json::nlohmann_json
)
else()
target_link_libraries(example_onnx
tts_helper
${ONNXRUNTIME_LIB}
nlohmann_json::nlohmann_json
)
endif()
# Installation
install(TARGETS example_onnx DESTINATION bin)
install(TARGETS tts_helper DESTINATION lib)
install(FILES helper.h DESTINATION include)

cpp/README.md (new file)

@@ -0,0 +1,139 @@
# Supertonic C++ Implementation
High-performance text-to-speech inference using ONNX Runtime.
## 📰 Update News
- **2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
- **2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
- **2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
- **2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality
- **2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5)
- **2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses
## Requirements
- C++17 compiler, CMake 3.15+
- Libraries: ONNX Runtime, nlohmann/json
## Installation
**Ubuntu/Debian:**
> ⚠️ **Note:** Installation instructions not yet verified.
```bash
sudo apt-get install -y cmake g++ nlohmann-json3-dev
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.3/onnxruntime-linux-x64-1.16.3.tgz
tar -xzf onnxruntime-linux-x64-1.16.3.tgz
sudo cp -r onnxruntime-linux-x64-1.16.3/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.16.3/lib/* /usr/local/lib/
sudo ldconfig
```
**macOS:**
```bash
brew install cmake nlohmann-json onnxruntime
```
**Windows (vcpkg):**
> ⚠️ **Note:** Installation instructions not yet verified.
```powershell
vcpkg install nlohmann-json:x64-windows onnxruntime:x64-windows
vcpkg integrate install
```
## Building
```bash
cd cpp && mkdir build && cd build
cmake .. && cmake --build . --config Release
./example_onnx
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
./example_onnx
```
This will use:
- Voice style: `../assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
./example_onnx \
--voice-style ../assets/voice_styles/M1.json,../assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
--lang en,ko \
--batch
```
This will:
- Use `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first English text
- Use female voice style (F1.json) for the second Korean text
- Process both samples in a single batch (automatic text chunking disabled)
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
./example_onnx \
--total-step 10 \
--voice-style ../assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
./example_onnx \
--voice-style ../assets/voice_styles/M1.json \
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```
This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file
**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
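The chunk-and-join behavior described above can be sketched in Python. This is an assumed approximation for illustration, not the actual `helper.cpp` logic — the real implementation's chunk boundaries may differ:

```python
# Sketch of the long-form pipeline described above (assumed behavior, not the
# actual helper.cpp logic): split text into chunks of at most `max_chars` at
# sentence boundaries, then join per-chunk audio with 0.3 s of silence.
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        candidate = f"{current} {s}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def join_with_silence(chunks_audio: list[list[float]], sample_rate: int = 44100,
                      pause_s: float = 0.3) -> list[float]:
    """Concatenate per-chunk audio, inserting pause_s of silence between chunks."""
    silence = [0.0] * int(sample_rate * pause_s)
    out: list[float] = []
    for i, audio in enumerate(chunks_audio):
        if i:
            out.extend(silence)
        out.extend(audio)
    return out
```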
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx-dir` | str | `../assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str | `../assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated for batch) |
| `--text` | str | (long default text) | Text(s) to synthesize (pipe-separated for batch) |
| `--lang` | str | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated for batch) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer

cpp/example_onnx.cpp (new file)

@@ -0,0 +1,121 @@
#include "helper.h"
#include <iostream>
#include <filesystem>
#include <algorithm>
#include <string>
#include <vector>
namespace fs = std::filesystem;
struct Args {
std::string onnx_dir = "../assets/onnx";
int total_step = 5;
float speed = 1.05f;
int n_test = 4;
std::vector<std::string> voice_style = {"../assets/voice_styles/M1.json"};
std::vector<std::string> text = {
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
};
std::vector<std::string> lang = {"en"};
std::string save_dir = "results";
bool batch = false;
};
auto splitString = [](const std::string& str, char delim) {
std::vector<std::string> result;
size_t start = 0, pos;
while ((pos = str.find(delim, start)) != std::string::npos) {
result.push_back(str.substr(start, pos - start));
start = pos + 1;
}
result.push_back(str.substr(start));
return result;
};
Args parseArgs(int argc, char* argv[]) {
Args args;
for (int i = 1; i < argc; i++) {
std::string arg = argv[i];
if (arg == "--onnx-dir" && i + 1 < argc) args.onnx_dir = argv[++i];
else if (arg == "--total-step" && i + 1 < argc) args.total_step = std::stoi(argv[++i]);
else if (arg == "--speed" && i + 1 < argc) args.speed = std::stof(argv[++i]);
else if (arg == "--n-test" && i + 1 < argc) args.n_test = std::stoi(argv[++i]);
else if (arg == "--voice-style" && i + 1 < argc) args.voice_style = splitString(argv[++i], ',');
else if (arg == "--text" && i + 1 < argc) args.text = splitString(argv[++i], '|');
else if (arg == "--lang" && i + 1 < argc) args.lang = splitString(argv[++i], ',');
else if (arg == "--save-dir" && i + 1 < argc) args.save_dir = argv[++i];
else if (arg == "--batch") args.batch = true;
}
return args;
}
int main(int argc, char* argv[]) {
std::cout << "=== TTS Inference with ONNX Runtime (C++) ===\n\n";
// --- 1. Parse arguments --- //
Args args = parseArgs(argc, argv);
int total_step = args.total_step;
float speed = args.speed;
int n_test = args.n_test;
std::string save_dir = args.save_dir;
std::vector<std::string> voice_style_paths = args.voice_style;
std::vector<std::string> text_list = args.text;
std::vector<std::string> lang_list = args.lang;
bool batch = args.batch;
if (voice_style_paths.size() != text_list.size()) {
std::cerr << "Error: Number of voice styles (" << voice_style_paths.size()
<< ") must match number of texts (" << text_list.size() << ")\n";
return 1;
}
int bsz = voice_style_paths.size();
// --- 2. Load Text to Speech --- //
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "TTS");
Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(
OrtAllocatorType::OrtArenaAllocator, OrtMemType::OrtMemTypeDefault
);
auto text_to_speech = loadTextToSpeech(env, args.onnx_dir, false);
std::cout << std::endl;
// --- 3. Load Voice Style --- //
auto style = loadVoiceStyle(voice_style_paths, true);
// --- 4. Synthesize speech --- //
fs::create_directories(save_dir);
for (int n = 0; n < n_test; n++) {
std::cout << "\n[" << (n + 1) << "/" << n_test << "] Starting synthesis...\n";
auto result = timer("Generating speech from text", [&]() {
if (batch) {
return text_to_speech->batch(memory_info, text_list, lang_list, style, total_step, speed);
} else {
return text_to_speech->call(memory_info, text_list[0], lang_list[0], style, total_step, speed);
}
});
int sample_rate = text_to_speech->getSampleRate();
int wav_shape_1 = result.wav.size() / bsz;
for (int b = 0; b < bsz; b++) {
std::string fname = sanitizeFilename(text_list[b], 20) + "_" + std::to_string(n + 1) + ".wav";
int wav_len = static_cast<int>(sample_rate * result.duration[b]);
std::vector<float> wav_out(
result.wav.begin() + b * wav_shape_1,
result.wav.begin() + b * wav_shape_1 + wav_len
);
std::string output_path = save_dir + "/" + fname;
writeWavFile(output_path, wav_out, sample_rate);
std::cout << "Saved: " << output_path << "\n";
}
clearTensorBuffers();
}
std::cout << "\n=== Synthesis completed successfully! ===\n";
return 0;
}

1186
cpp/helper.cpp Normal file

File diff suppressed because it is too large

229
cpp/helper.h Normal file

@@ -0,0 +1,229 @@
#pragma once
#include <string>
#include <vector>
#include <memory>
#include <iostream>
#include <iomanip>
#include <chrono>
#include <onnxruntime_cxx_api.h>
// Available languages for multilingual TTS
extern const std::vector<std::string> AVAILABLE_LANGS;
/**
* Configuration structure
*/
struct Config {
struct AEConfig {
int sample_rate;
int base_chunk_size;
} ae;
struct TTLConfig {
int chunk_compress_factor;
int latent_dim;
} ttl;
};
/**
* Unicode text processor
*/
class UnicodeProcessor {
public:
explicit UnicodeProcessor(const std::string& unicode_indexer_json_path);
// Process text list to text IDs and mask
void call(
const std::vector<std::string>& text_list,
const std::vector<std::string>& lang_list,
std::vector<std::vector<int64_t>>& text_ids,
std::vector<std::vector<std::vector<float>>>& text_mask
);
private:
std::vector<int64_t> indexer_;
std::string preprocessText(const std::string& text, const std::string& lang);
std::vector<uint16_t> textToUnicodeValues(const std::string& text);
std::vector<std::vector<std::vector<float>>> getTextMask(
const std::vector<int64_t>& text_ids_lengths
);
};
/**
* Style class
*/
class Style {
public:
Style(const std::vector<float>& ttl_data, const std::vector<int64_t>& ttl_shape,
const std::vector<float>& dp_data, const std::vector<int64_t>& dp_shape);
const std::vector<float>& getTtlData() const { return ttl_data_; }
const std::vector<float>& getDpData() const { return dp_data_; }
const std::vector<int64_t>& getTtlShape() const { return ttl_shape_; }
const std::vector<int64_t>& getDpShape() const { return dp_shape_; }
private:
std::vector<float> ttl_data_;
std::vector<float> dp_data_;
std::vector<int64_t> ttl_shape_;
std::vector<int64_t> dp_shape_;
};
/**
* TextToSpeech class
*/
class TextToSpeech {
public:
TextToSpeech(
const Config& cfgs,
UnicodeProcessor* text_processor,
Ort::Session* dp_ort,
Ort::Session* text_enc_ort,
Ort::Session* vector_est_ort,
Ort::Session* vocoder_ort
);
struct SynthesisResult {
std::vector<float> wav;
std::vector<float> duration;
};
SynthesisResult call(
Ort::MemoryInfo& memory_info,
const std::string& text,
const std::string& lang,
const Style& style,
int total_step,
float speed = 1.05f,
float silence_duration = 0.3f
);
SynthesisResult batch(
Ort::MemoryInfo& memory_info,
const std::vector<std::string>& text_list,
const std::vector<std::string>& lang_list,
const Style& style,
int total_step,
float speed = 1.05f
);
int getSampleRate() const { return sample_rate_; }
private:
SynthesisResult _infer(
Ort::MemoryInfo& memory_info,
const std::vector<std::string>& text_list,
const std::vector<std::string>& lang_list,
const Style& style,
int total_step,
float speed = 1.05f
);
Config cfgs_;
UnicodeProcessor* text_processor_;
Ort::Session* dp_ort_;
Ort::Session* text_enc_ort_;
Ort::Session* vector_est_ort_;
Ort::Session* vocoder_ort_;
int sample_rate_;
int base_chunk_size_;
int chunk_compress_factor_;
int ldim_;
void sampleNoisyLatent(
const std::vector<float>& duration,
std::vector<std::vector<std::vector<float>>>& noisy_latent,
std::vector<std::vector<std::vector<float>>>& latent_mask
);
};
// Utility functions
std::vector<std::vector<std::vector<float>>> lengthToMask(
const std::vector<int64_t>& lengths, int max_len = -1
);
std::vector<std::vector<std::vector<float>>> getLatentMask(
const std::vector<int64_t>& wav_lengths,
int base_chunk_size,
int chunk_compress_factor
);
// ONNX model loading
struct OnnxModels {
std::unique_ptr<Ort::Session> dp;
std::unique_ptr<Ort::Session> text_enc;
std::unique_ptr<Ort::Session> vector_est;
std::unique_ptr<Ort::Session> vocoder;
};
std::unique_ptr<Ort::Session> loadOnnx(
Ort::Env& env,
const std::string& onnx_path,
const Ort::SessionOptions& opts
);
OnnxModels loadOnnxAll(
Ort::Env& env,
const std::string& onnx_dir,
const Ort::SessionOptions& opts
);
// Configuration and processor loading
Config loadCfgs(const std::string& onnx_dir);
std::unique_ptr<UnicodeProcessor> loadTextProcessor(const std::string& onnx_dir);
// Voice style loading
Style loadVoiceStyle(const std::vector<std::string>& voice_style_paths, bool verbose = false);
// TextToSpeech loading
std::unique_ptr<TextToSpeech> loadTextToSpeech(
Ort::Env& env,
const std::string& onnx_dir,
bool use_gpu = false
);
// WAV file writing
void writeWavFile(
const std::string& filename,
const std::vector<float>& audio_data,
int sample_rate
);
// Tensor conversion utilities
void clearTensorBuffers();
Ort::Value arrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<std::vector<float>>>& array,
const std::vector<int64_t>& dims
);
Ort::Value intArrayToTensor(
Ort::MemoryInfo& memory_info,
const std::vector<std::vector<int64_t>>& array,
const std::vector<int64_t>& dims
);
// JSON loading helpers
std::vector<int64_t> loadJsonInt64(const std::string& file_path);
// Timer utility
template<typename Func>
auto timer(const std::string& name, Func&& func) -> decltype(func()) {
auto start = std::chrono::high_resolution_clock::now();
std::cout << name << "..." << std::endl;
auto result = func();
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << " -> " << name << " completed in "
<< std::fixed << std::setprecision(2) << elapsed.count() << " sec" << std::endl;
return result;
}
// Sanitize filename
std::string sanitizeFilename(const std::string& text, int max_len);
// Chunk text into manageable segments
std::vector<std::string> chunkText(const std::string& text, int max_len = 300);

41
csharp/.gitignore vendored Normal file

@@ -0,0 +1,41 @@
# Build results
bin/
obj/
[Dd]ebug/
[Rr]elease/
x64/
x86/
[Aa]rm/
[Aa]rm64/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/
# Visual Studio files
.vs/
*.suo
*.user
*.userosscache
*.sln.docstates
*.userprefs
# Rider
.idea/
*.sln.iml
# User-specific files
*.rsuser
# Output directory
results/*.wav
# OS files
.DS_Store
Thumbs.db

171
csharp/ExampleONNX.cs Normal file

@@ -0,0 +1,171 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Media;
namespace Supertonic
{
class Program
{
class Args
{
public bool UseGpu { get; set; } = false;
public string OnnxDir { get; set; } = "./assets/onnx";
public int TotalStep { get; set; } = 5;
public float Speed { get; set; } = 1.05f;
public int NTest { get; set; } = 4;
public List<string> VoiceStyle { get; set; } = new List<string> { "assets/voice_styles/F2.json" };
public List<string> Text { get; set; } = new List<string>
{
"동해물과 백두산이 마르고 닳도록 하느님이 보우하사. 우리 나라 만세~~"
};
public List<string> Lang { get; set; } = new List<string> { "ko" };
public string SaveDir { get; set; } = "results";
public bool Batch { get; set; } = false;
public int? Seed { get; set; } = null;
public float PreSilence { get; set; } = 0.2f;
}
static Args ParseArgs(string[] args)
{
var result = new Args();
for (int i = 0; i < args.Length; i++)
{
switch (args[i])
{
case "--use-gpu":
result.UseGpu = true;
break;
case "--batch":
result.Batch = true;
break;
case "--onnx-dir" when i + 1 < args.Length:
result.OnnxDir = args[++i];
break;
case "--total-step" when i + 1 < args.Length:
result.TotalStep = int.Parse(args[++i]);
break;
case "--speed" when i + 1 < args.Length:
result.Speed = float.Parse(args[++i]);
break;
case "--n-test" when i + 1 < args.Length:
result.NTest = int.Parse(args[++i]);
break;
case "--voice-style" when i + 1 < args.Length:
result.VoiceStyle = args[++i].Split(',').ToList();
break;
case "--text" when i + 1 < args.Length:
result.Text = args[++i].Split('|').ToList();
break;
case "--lang" when i + 1 < args.Length:
result.Lang = args[++i].Split(',').ToList();
break;
case "--save-dir" when i + 1 < args.Length:
result.SaveDir = args[++i];
break;
case "--seed" when i + 1 < args.Length:
result.Seed = int.Parse(args[++i]);
break;
case "--pre-silence" when i + 1 < args.Length:
result.PreSilence = float.Parse(args[++i]);
break;
}
}
return result;
}
static void Main(string[] args)
{
Console.WriteLine("=== TTS Inference with ONNX Runtime (C#) ===\n");
// --- 1. Parse arguments --- //
var parsedArgs = ParseArgs(args);
int totalStep = parsedArgs.TotalStep;
float speed = parsedArgs.Speed;
int nTest = parsedArgs.NTest;
string saveDir = parsedArgs.SaveDir;
var voiceStylePaths = parsedArgs.VoiceStyle;
var textList = parsedArgs.Text;
var langList = parsedArgs.Lang;
bool batch = parsedArgs.Batch;
if (voiceStylePaths.Count != textList.Count)
{
throw new ArgumentException(
$"Number of voice styles ({voiceStylePaths.Count}) must match number of texts ({textList.Count})");
}
int bsz = voiceStylePaths.Count;
// --- 2. Load Text to Speech --- //
var textToSpeech = Helper.LoadTextToSpeech(parsedArgs.OnnxDir, parsedArgs.UseGpu);
Console.WriteLine();
// --- 3. Load Voice Style --- //
var style = Helper.LoadVoiceStyle(voiceStylePaths, verbose: true);
// --- 4. Synthesize speech --- //
Random seedGenerator = new Random();
for (int n = 0; n < nTest; n++)
{
int currentSeed = parsedArgs.Seed ?? seedGenerator.Next();
Console.WriteLine($"\n[{n + 1}/{nTest}] Starting synthesis (Seed: {currentSeed})...");
var (wav, duration) = Helper.Timer("Generating speech from text", () =>
{
if (batch)
{
return textToSpeech.Batch(textList, langList, style, totalStep, speed, currentSeed);
}
else
{
return textToSpeech.Call(textList[0], langList[0], style, totalStep, speed, seed: currentSeed);
}
});
if (!Directory.Exists(saveDir))
{
Directory.CreateDirectory(saveDir);
}
for (int b = 0; b < bsz; b++)
{
string fname = $"{Helper.SanitizeFilename(textList[b], 20)}_{n + 1}_s{currentSeed}.wav";
int wavLen = (int)(textToSpeech.SampleRate * duration[b]);
// --- Add Pre-Silence (Delay) --- //
int silenceSamples = (int)(textToSpeech.SampleRate * parsedArgs.PreSilence);
var wavOut = new float[wavLen + silenceSamples];
// The array is initialized to 0 by default, so we just copy the audio after the silence
Array.Copy(wav, b * wav.Length / bsz, wavOut, silenceSamples, Math.Min(wavLen, wav.Length / bsz));
string outputPath = Path.Combine(saveDir, fname);
Helper.WriteWavFile(outputPath, wavOut, textToSpeech.SampleRate);
Console.WriteLine($"Saved: {outputPath}");
// --- Play the generated audio --- //
try
{
using (var player = new SoundPlayer(outputPath))
{
Console.WriteLine("Playing audio...");
player.PlaySync();
}
}
catch (Exception ex)
{
Console.WriteLine($"Warning: Could not play audio. {ex.Message}");
}
}
}
Console.WriteLine("\n=== Synthesis completed successfully! ===");
}
}
}

861
csharp/Helper.cs Normal file

@@ -0,0 +1,861 @@
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
namespace Supertonic
{
// Available languages for multilingual TTS
public static class Languages
{
public static readonly string[] Available = { "en", "ko", "es", "pt", "fr" };
}
// ============================================================================
// Configuration classes
// ============================================================================
public class Config
{
public AEConfig AE { get; set; } = null!;
public TTLConfig TTL { get; set; } = null!;
public class AEConfig
{
public int SampleRate { get; set; }
public int BaseChunkSize { get; set; }
}
public class TTLConfig
{
public int ChunkCompressFactor { get; set; }
public int LatentDim { get; set; }
}
}
// ============================================================================
// Style class
// ============================================================================
public class Style
{
public float[] Ttl { get; set; }
public long[] TtlShape { get; set; }
public float[] Dp { get; set; }
public long[] DpShape { get; set; }
public Style(float[] ttl, long[] ttlShape, float[] dp, long[] dpShape)
{
Ttl = ttl;
TtlShape = ttlShape;
Dp = dp;
DpShape = dpShape;
}
}
// ============================================================================
// Unicode text processor
// ============================================================================
public class UnicodeProcessor
{
private readonly Dictionary<int, long> _indexer;
public UnicodeProcessor(string unicodeIndexerPath)
{
var json = File.ReadAllText(unicodeIndexerPath);
var indexerArray = JsonSerializer.Deserialize<long[]>(json) ?? throw new Exception("Failed to load indexer");
_indexer = new Dictionary<int, long>();
for (int i = 0; i < indexerArray.Length; i++)
{
_indexer[i] = indexerArray[i];
}
}
private static string RemoveEmojis(string text)
{
var result = new StringBuilder();
for (int i = 0; i < text.Length; i++)
{
int codePoint;
if (char.IsHighSurrogate(text[i]) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
{
// Get the full code point from surrogate pair
codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
i++; // Skip the low surrogate
}
else
{
codePoint = text[i];
}
// Check if code point is in emoji ranges
bool isEmoji = (codePoint >= 0x1F600 && codePoint <= 0x1F64F) ||
(codePoint >= 0x1F300 && codePoint <= 0x1F5FF) ||
(codePoint >= 0x1F680 && codePoint <= 0x1F6FF) ||
(codePoint >= 0x1F700 && codePoint <= 0x1F77F) ||
(codePoint >= 0x1F780 && codePoint <= 0x1F7FF) ||
(codePoint >= 0x1F800 && codePoint <= 0x1F8FF) ||
(codePoint >= 0x1F900 && codePoint <= 0x1F9FF) ||
(codePoint >= 0x1FA00 && codePoint <= 0x1FA6F) ||
(codePoint >= 0x1FA70 && codePoint <= 0x1FAFF) ||
(codePoint >= 0x2600 && codePoint <= 0x26FF) ||
(codePoint >= 0x2700 && codePoint <= 0x27BF) ||
(codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF);
if (!isEmoji)
{
if (codePoint > 0xFFFF)
{
// Add back as surrogate pair
result.Append(char.ConvertFromUtf32(codePoint));
}
else
{
result.Append((char)codePoint);
}
}
}
return result.ToString();
}
private string PreprocessText(string text, string lang)
{
// TODO: Need advanced normalizer for better performance
text = text.Normalize(NormalizationForm.FormKD);
// Remove emojis (wide Unicode range)
// C# doesn't support \u{...} syntax in regex, so we use character filtering instead
text = RemoveEmojis(text);
// Replace various dashes and symbols
var replacements = new Dictionary<string, string>
{
{"–", "-"}, // en dash
{"‑", "-"}, // non-breaking hyphen
{"—", "-"}, // em dash
{"_", " "}, // underscore
{"\u201C", "\""}, // left double quote
{"\u201D", "\""}, // right double quote
{"\u2018", "'"}, // left single quote
{"\u2019", "'"}, // right single quote
{"´", "'"}, // acute accent
{"`", "'"}, // grave accent
{"[", " "}, // left bracket
{"]", " "}, // right bracket
{"|", " "}, // vertical bar
{"/", " "}, // slash
{"#", " "}, // hash
{"→", " "}, // right arrow
{"←", " "}, // left arrow
};
foreach (var kvp in replacements)
{
text = text.Replace(kvp.Key, kvp.Value);
}
// Remove special symbols
text = Regex.Replace(text, @"[♥☆♡©\\]", "");
// Replace known expressions
var exprReplacements = new Dictionary<string, string>
{
{"@", " at "},
{"e.g.,", "for example, "},
{"i.e.,", "that is, "},
};
foreach (var kvp in exprReplacements)
{
text = text.Replace(kvp.Key, kvp.Value);
}
// Fix spacing around punctuation
text = Regex.Replace(text, @" ,", ",");
text = Regex.Replace(text, @" \.", ".");
text = Regex.Replace(text, @" !", "!");
text = Regex.Replace(text, @" \?", "?");
text = Regex.Replace(text, @" ;", ";");
text = Regex.Replace(text, @" :", ":");
text = Regex.Replace(text, @" '", "'");
// Remove duplicate quotes
while (text.Contains("\"\""))
{
text = text.Replace("\"\"", "\"");
}
while (text.Contains("''"))
{
text = text.Replace("''", "'");
}
while (text.Contains("``"))
{
text = text.Replace("``", "`");
}
// Remove extra spaces
text = Regex.Replace(text, @"\s+", " ").Trim();
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if (!Regex.IsMatch(text, @"[.!?;:,'\u0022\u201C\u201D\u2018\u2019)\]}…。」』】〉》›»]$"))
{
text += ".";
}
// Validate language
if (!Languages.Available.Contains(lang))
{
throw new ArgumentException($"Invalid language: {lang}. Available: {string.Join(", ", Languages.Available)}");
}
// Wrap text with language tags
text = $"<{lang}>" + text + $"</{lang}>";
return text;
}
private int[] TextToUnicodeValues(string text)
{
return text.Select(c => (int)c).ToArray();
}
private float[][][] GetTextMask(long[] textIdsLengths)
{
return Helper.LengthToMask(textIdsLengths);
}
public (long[][] textIds, float[][][] textMask) Call(List<string> textList, List<string> langList)
{
var processedTexts = textList.Select((t, i) => PreprocessText(t, langList[i])).ToList();
var textIdsLengths = processedTexts.Select(t => (long)t.Length).ToArray();
long maxLen = textIdsLengths.Max();
var textIds = new long[textList.Count][];
for (int i = 0; i < processedTexts.Count; i++)
{
textIds[i] = new long[maxLen];
var unicodeVals = TextToUnicodeValues(processedTexts[i]);
for (int j = 0; j < unicodeVals.Length; j++)
{
if (_indexer.TryGetValue(unicodeVals[j], out long val))
{
textIds[i][j] = val;
}
}
}
var textMask = GetTextMask(textIdsLengths);
return (textIds, textMask);
}
}
// ============================================================================
// TextToSpeech class
// ============================================================================
public class TextToSpeech
{
private readonly Config _cfgs;
private readonly UnicodeProcessor _textProcessor;
private readonly InferenceSession _dpOrt;
private readonly InferenceSession _textEncOrt;
private readonly InferenceSession _vectorEstOrt;
private readonly InferenceSession _vocoderOrt;
public readonly int SampleRate;
private readonly int _baseChunkSize;
private readonly int _chunkCompressFactor;
private readonly int _ldim;
public TextToSpeech(
Config cfgs,
UnicodeProcessor textProcessor,
InferenceSession dpOrt,
InferenceSession textEncOrt,
InferenceSession vectorEstOrt,
InferenceSession vocoderOrt)
{
_cfgs = cfgs;
_textProcessor = textProcessor;
_dpOrt = dpOrt;
_textEncOrt = textEncOrt;
_vectorEstOrt = vectorEstOrt;
_vocoderOrt = vocoderOrt;
SampleRate = cfgs.AE.SampleRate;
_baseChunkSize = cfgs.AE.BaseChunkSize;
_chunkCompressFactor = cfgs.TTL.ChunkCompressFactor;
_ldim = cfgs.TTL.LatentDim;
}
private (float[][][] noisyLatent, float[][][] latentMask) SampleNoisyLatent(float[] duration, int seed)
{
int bsz = duration.Length;
float wavLenMax = duration.Max() * SampleRate;
var wavLengths = duration.Select(d => (long)(d * SampleRate)).ToArray();
int chunkSize = _baseChunkSize * _chunkCompressFactor;
int latentLen = (int)((wavLenMax + chunkSize - 1) / chunkSize);
int latentDim = _ldim * _chunkCompressFactor;
// Generate random noise with fixed seed
var random = new Random(seed);
var noisyLatent = new float[bsz][][];
for (int b = 0; b < bsz; b++)
{
noisyLatent[b] = new float[latentDim][];
for (int d = 0; d < latentDim; d++)
{
noisyLatent[b][d] = new float[latentLen];
for (int t = 0; t < latentLen; t++)
{
// Box-Muller transform for normal distribution
double u1 = 1.0 - random.NextDouble();
double u2 = 1.0 - random.NextDouble();
noisyLatent[b][d][t] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2));
}
}
}
var latentMask = Helper.GetLatentMask(wavLengths, _baseChunkSize, _chunkCompressFactor);
// Apply mask
for (int b = 0; b < bsz; b++)
{
for (int d = 0; d < latentDim; d++)
{
for (int t = 0; t < latentLen; t++)
{
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return (noisyLatent, latentMask);
}
private (float[] wav, float[] duration) _Infer(List<string> textList, List<string> langList, Style style, int totalStep, float speed = 1.05f, int seed = 42)
{
int bsz = textList.Count;
if (bsz != style.TtlShape[0])
{
throw new ArgumentException("Number of texts must match number of style vectors");
}
// Process text
var (textIds, textMask) = _textProcessor.Call(textList, langList);
var textIdsShape = new long[] { bsz, textIds[0].Length };
var textMaskShape = new long[] { bsz, 1, textMask[0][0].Length };
var textIdsTensor = Helper.IntArrayToTensor(textIds, textIdsShape);
var textMaskTensor = Helper.ArrayToTensor(textMask, textMaskShape);
var styleTtlTensor = new DenseTensor<float>(style.Ttl, style.TtlShape.Select(x => (int)x).ToArray());
var styleDpTensor = new DenseTensor<float>(style.Dp, style.DpShape.Select(x => (int)x).ToArray());
// Run duration predictor
var dpInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
NamedOnnxValue.CreateFromTensor("style_dp", styleDpTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
};
using var dpOutputs = _dpOrt.Run(dpInputs);
var durOnnx = dpOutputs.First(o => o.Name == "duration").AsTensor<float>().ToArray();
// Apply speed factor to duration
for (int i = 0; i < durOnnx.Length; i++)
{
durOnnx[i] /= speed;
}
// Run text encoder
var textEncInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("text_ids", textIdsTensor),
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor)
};
using var textEncOutputs = _textEncOrt.Run(textEncInputs);
var textEmbTensor = textEncOutputs.First(o => o.Name == "text_emb").AsTensor<float>();
// Sample noisy latent
var (xt, latentMask) = SampleNoisyLatent(durOnnx, seed);
var latentShape = new long[] { bsz, xt[0].Length, xt[0][0].Length };
var latentMaskShape = new long[] { bsz, 1, latentMask[0][0].Length };
var totalStepArray = Enumerable.Repeat((float)totalStep, bsz).ToArray();
// Iterative denoising
for (int step = 0; step < totalStep; step++)
{
var currentStepArray = Enumerable.Repeat((float)step, bsz).ToArray();
var vectorEstInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("noisy_latent", Helper.ArrayToTensor(xt, latentShape)),
NamedOnnxValue.CreateFromTensor("text_emb", textEmbTensor),
NamedOnnxValue.CreateFromTensor("style_ttl", styleTtlTensor),
NamedOnnxValue.CreateFromTensor("text_mask", textMaskTensor),
NamedOnnxValue.CreateFromTensor("latent_mask", Helper.ArrayToTensor(latentMask, latentMaskShape)),
NamedOnnxValue.CreateFromTensor("total_step", new DenseTensor<float>(totalStepArray, new int[] { bsz })),
NamedOnnxValue.CreateFromTensor("current_step", new DenseTensor<float>(currentStepArray, new int[] { bsz }))
};
using var vectorEstOutputs = _vectorEstOrt.Run(vectorEstInputs);
var denoisedLatent = vectorEstOutputs.First(o => o.Name == "denoised_latent").AsTensor<float>();
// Update xt
int idx = 0;
for (int b = 0; b < bsz; b++)
{
for (int d = 0; d < xt[b].Length; d++)
{
for (int t = 0; t < xt[b][d].Length; t++)
{
xt[b][d][t] = denoisedLatent.GetValue(idx++);
}
}
}
}
// Run vocoder
var vocoderInputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("latent", Helper.ArrayToTensor(xt, latentShape))
};
using var vocoderOutputs = _vocoderOrt.Run(vocoderInputs);
var wavTensor = vocoderOutputs.First(o => o.Name == "wav_tts").AsTensor<float>();
return (wavTensor.ToArray(), durOnnx);
}
public (float[] wav, float[] duration) Call(string text, string lang, Style style, int totalStep, float speed = 1.05f, float silenceDuration = 0.3f, int seed = 42)
{
if (style.TtlShape[0] != 1)
{
throw new ArgumentException("Single speaker text to speech only supports single style");
}
int maxLen = lang == "ko" ? 120 : 300;
var textList = Helper.ChunkText(text, maxLen);
var wavCat = new List<float>();
float durCat = 0.0f;
foreach (var chunk in textList)
{
var (wav, duration) = _Infer(new List<string> { chunk }, new List<string> { lang }, style, totalStep, speed, seed);
if (wavCat.Count == 0)
{
wavCat.AddRange(wav);
durCat = duration[0];
}
else
{
int silenceLen = (int)(silenceDuration * SampleRate);
var silence = new float[silenceLen];
wavCat.AddRange(silence);
wavCat.AddRange(wav);
durCat += duration[0] + silenceDuration;
}
}
return (wavCat.ToArray(), new float[] { durCat });
}
public (float[] wav, float[] duration) Batch(List<string> textList, List<string> langList, Style style, int totalStep, float speed = 1.05f, int seed = 42)
{
return _Infer(textList, langList, style, totalStep, speed, seed);
}
}
// ============================================================================
// Helper class with utility functions
// ============================================================================
public static class Helper
{
// ============================================================================
// Utility functions
// ============================================================================
public static float[][][] LengthToMask(long[] lengths, long maxLen = -1)
{
if (maxLen == -1)
{
maxLen = lengths.Max();
}
var mask = new float[lengths.Length][][];
for (int i = 0; i < lengths.Length; i++)
{
mask[i] = new float[1][];
mask[i][0] = new float[maxLen];
for (int j = 0; j < maxLen; j++)
{
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
public static float[][][] GetLatentMask(long[] wavLengths, int baseChunkSize, int chunkCompressFactor)
{
int latentSize = baseChunkSize * chunkCompressFactor;
var latentLengths = wavLengths.Select(len => (len + latentSize - 1) / latentSize).ToArray();
return LengthToMask(latentLengths);
}
// ============================================================================
// ONNX model loading
// ============================================================================
public static InferenceSession LoadOnnx(string onnxPath, SessionOptions opts)
{
return new InferenceSession(onnxPath, opts);
}
public static (InferenceSession dp, InferenceSession textEnc, InferenceSession vectorEst, InferenceSession vocoder)
LoadOnnxAll(string onnxDir, SessionOptions opts)
{
var dpPath = Path.Combine(onnxDir, "duration_predictor.onnx");
var textEncPath = Path.Combine(onnxDir, "text_encoder.onnx");
var vectorEstPath = Path.Combine(onnxDir, "vector_estimator.onnx");
var vocoderPath = Path.Combine(onnxDir, "vocoder.onnx");
return (
LoadOnnx(dpPath, opts),
LoadOnnx(textEncPath, opts),
LoadOnnx(vectorEstPath, opts),
LoadOnnx(vocoderPath, opts)
);
}
// ============================================================================
// Configuration loading
// ============================================================================
public static Config LoadCfgs(string onnxDir)
{
var cfgPath = Path.Combine(onnxDir, "tts.json");
var json = File.ReadAllText(cfgPath);
using var doc = JsonDocument.Parse(json);
var root = doc.RootElement;
return new Config
{
AE = new Config.AEConfig
{
SampleRate = root.GetProperty("ae").GetProperty("sample_rate").GetInt32(),
BaseChunkSize = root.GetProperty("ae").GetProperty("base_chunk_size").GetInt32()
},
TTL = new Config.TTLConfig
{
ChunkCompressFactor = root.GetProperty("ttl").GetProperty("chunk_compress_factor").GetInt32(),
LatentDim = root.GetProperty("ttl").GetProperty("latent_dim").GetInt32()
}
};
}
public static UnicodeProcessor LoadTextProcessor(string onnxDir)
{
var unicodeIndexerPath = Path.Combine(onnxDir, "unicode_indexer.json");
return new UnicodeProcessor(unicodeIndexerPath);
}
// ============================================================================
// Voice style loading
// ============================================================================
public static Style LoadVoiceStyle(List<string> voiceStylePaths, bool verbose = false)
{
int bsz = voiceStylePaths.Count;
// Read first file to get dimensions
var firstJson = File.ReadAllText(voiceStylePaths[0]);
using var firstDoc = JsonDocument.Parse(firstJson);
var firstRoot = firstDoc.RootElement;
var ttlDims = ParseInt64Array(firstRoot.GetProperty("style_ttl").GetProperty("dims"));
var dpDims = ParseInt64Array(firstRoot.GetProperty("style_dp").GetProperty("dims"));
long ttlDim1 = ttlDims[1];
long ttlDim2 = ttlDims[2];
long dpDim1 = dpDims[1];
long dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
int ttlSize = (int)(bsz * ttlDim1 * ttlDim2);
int dpSize = (int)(bsz * dpDim1 * dpDim2);
var ttlFlat = new float[ttlSize];
var dpFlat = new float[dpSize];
// Fill in the data
for (int i = 0; i < bsz; i++)
{
var json = File.ReadAllText(voiceStylePaths[i]);
using var doc = JsonDocument.Parse(json);
var root = doc.RootElement;
// Flatten data
var ttlData3D = ParseFloat3DArray(root.GetProperty("style_ttl").GetProperty("data"));
var ttlDataFlat = new List<float>();
foreach (var batch in ttlData3D)
{
foreach (var row in batch)
{
ttlDataFlat.AddRange(row);
}
}
var dpData3D = ParseFloat3DArray(root.GetProperty("style_dp").GetProperty("data"));
var dpDataFlat = new List<float>();
foreach (var batch in dpData3D)
{
foreach (var row in batch)
{
dpDataFlat.AddRange(row);
}
}
// Copy to pre-allocated array
int ttlOffset = (int)(i * ttlDim1 * ttlDim2);
ttlDataFlat.CopyTo(ttlFlat, ttlOffset);
int dpOffset = (int)(i * dpDim1 * dpDim2);
dpDataFlat.CopyTo(dpFlat, dpOffset);
}
var ttlShape = new long[] { bsz, ttlDim1, ttlDim2 };
var dpShape = new long[] { bsz, dpDim1, dpDim2 };
if (verbose)
{
Console.WriteLine($"Loaded {bsz} voice styles");
}
return new Style(ttlFlat, ttlShape, dpFlat, dpShape);
}
private static float[][][] ParseFloat3DArray(JsonElement element)
{
var result = new List<float[][]>();
foreach (var batch in element.EnumerateArray())
{
var batch2D = new List<float[]>();
foreach (var row in batch.EnumerateArray())
{
var rowData = new List<float>();
foreach (var val in row.EnumerateArray())
{
rowData.Add(val.GetSingle());
}
batch2D.Add(rowData.ToArray());
}
result.Add(batch2D.ToArray());
}
return result.ToArray();
}
private static long[] ParseInt64Array(JsonElement element)
{
var result = new List<long>();
foreach (var val in element.EnumerateArray())
{
result.Add(val.GetInt64());
}
return result.ToArray();
}
// ============================================================================
// TextToSpeech loading
// ============================================================================
public static TextToSpeech LoadTextToSpeech(string onnxDir, bool useGpu = false)
{
var opts = new SessionOptions();
if (useGpu)
{
throw new NotImplementedException("GPU mode is not supported yet");
}
else
{
Console.WriteLine("Using CPU for inference");
}
var cfgs = LoadCfgs(onnxDir);
var (dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) = LoadOnnxAll(onnxDir, opts);
var textProcessor = LoadTextProcessor(onnxDir);
return new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
}
// ============================================================================
// WAV file writing
// ============================================================================
public static void WriteWavFile(string filename, float[] audioData, int sampleRate)
{
using var writer = new BinaryWriter(File.Open(filename, FileMode.Create));
int numChannels = 1;
int bitsPerSample = 16;
int byteRate = sampleRate * numChannels * bitsPerSample / 8;
short blockAlign = (short)(numChannels * bitsPerSample / 8);
int dataSize = audioData.Length * bitsPerSample / 8;
// RIFF header
writer.Write(Encoding.ASCII.GetBytes("RIFF"));
writer.Write(36 + dataSize);
writer.Write(Encoding.ASCII.GetBytes("WAVE"));
// fmt chunk
writer.Write(Encoding.ASCII.GetBytes("fmt "));
writer.Write(16); // fmt chunk size
writer.Write((short)1); // audio format (PCM)
writer.Write((short)numChannels);
writer.Write(sampleRate);
writer.Write(byteRate);
writer.Write(blockAlign);
writer.Write((short)bitsPerSample);
// data chunk
writer.Write(Encoding.ASCII.GetBytes("data"));
writer.Write(dataSize);
// Write audio data
foreach (var sample in audioData)
{
float clamped = Math.Max(-1.0f, Math.Min(1.0f, sample));
short intSample = (short)(clamped * 32767);
writer.Write(intSample);
}
}
// ============================================================================
// Tensor conversion utilities
// ============================================================================
public static DenseTensor<float> ArrayToTensor(float[][][] array, long[] dims)
{
var flat = new List<float>();
foreach (var batch in array)
{
foreach (var row in batch)
{
flat.AddRange(row);
}
}
return new DenseTensor<float>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
}
public static DenseTensor<long> IntArrayToTensor(long[][] array, long[] dims)
{
var flat = new List<long>();
foreach (var row in array)
{
flat.AddRange(row);
}
return new DenseTensor<long>(flat.ToArray(), dims.Select(x => (int)x).ToArray());
}
// ============================================================================
// Timer utility
// ============================================================================
public static T Timer<T>(string name, Func<T> func)
{
var sw = System.Diagnostics.Stopwatch.StartNew();
Console.WriteLine($"{name}...");
var result = func();
Console.WriteLine($" -> {name} completed in {sw.Elapsed.TotalSeconds:F2} sec");
return result;
}
public static string SanitizeFilename(string text, int maxLen)
{
var result = new StringBuilder();
int count = 0;
foreach (char c in text)
{
if (count >= maxLen) break;
if (char.IsLetterOrDigit(c))
{
result.Append(c);
}
else
{
result.Append('_');
}
count++;
}
return result.ToString();
}
// ============================================================================
// Chunk text
// ============================================================================
public static List<string> ChunkText(string text, int maxLen = 300)
{
var chunks = new List<string>();
// Split by paragraph (two or more newlines)
var paragraphRegex = new Regex(@"\n\s*\n+");
var paragraphs = paragraphRegex.Split(text.Trim())
.Select(p => p.Trim())
.Where(p => !string.IsNullOrEmpty(p))
.ToList();
// Split by sentence boundaries, excluding abbreviations
var sentenceRegex = new Regex(@"(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+");
foreach (var paragraph in paragraphs)
{
var sentences = sentenceRegex.Split(paragraph);
string currentChunk = "";
foreach (var sentence in sentences)
{
if (string.IsNullOrEmpty(sentence)) continue;
if (currentChunk.Length + sentence.Length + 1 <= maxLen)
{
if (!string.IsNullOrEmpty(currentChunk))
{
currentChunk += " ";
}
currentChunk += sentence;
}
else
{
if (!string.IsNullOrEmpty(currentChunk))
{
chunks.Add(currentChunk.Trim());
}
currentChunk = sentence;
}
}
if (!string.IsNullOrEmpty(currentChunk))
{
chunks.Add(currentChunk.Trim());
}
}
// If no chunks were created, return the original text
if (chunks.Count == 0)
{
chunks.Add(text.Trim());
}
return chunks;
}
}
}


@@ -0,0 +1,8 @@
{
"profiles": {
"Supertonic": {
"commandName": "Project",
"commandLineArgs": "--seed 371279630"
}
}
}

137
csharp/README.md Normal file

@@ -0,0 +1,137 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `ExampleONNX.cs`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
### Prerequisites
- .NET 9.0 SDK or later
- [Download .NET SDK](https://dotnet.microsoft.com/download)
### Install dependencies
```bash
dotnet restore
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
dotnet run
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
dotnet run -- \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
--lang en,ko \
--batch
```
This will:
- Use `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first English text
- Use female voice style (F1.json) for the second Korean text
- Process both samples in a single batch (automatic text chunking disabled)
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
dotnet run -- \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
dotnet run -- \
--voice-style assets/voice_styles/M1.json \
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```
This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file
**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
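The chunking-plus-concatenation behavior described above can be sketched as follows. This is an illustrative Python sketch, not the shipped C# implementation; `chunk_text` and `join_with_silence` are hypothetical names, and the real code additionally guards sentence splitting against common abbreviations (Mr., Dr., e.g., etc.).

```python
import re

def chunk_text(text: str, max_len: int = 300) -> list[str]:
    """Pack sentences into chunks of at most max_len characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if len(current) + len(sentence) + 1 <= max_len:
            current = f"{current} {sentence}".strip()
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    # Fall back to the original text if no chunks were produced
    return chunks or [text.strip()]

def join_with_silence(wavs: list[list[float]], sample_rate: int,
                      silence_sec: float = 0.3) -> list[float]:
    """Concatenate per-chunk waveforms with a short silence between them."""
    gap = [0.0] * int(silence_sec * sample_rate)
    out: list[float] = []
    for i, wav in enumerate(wavs):
        if i > 0:
            out.extend(gap)
        out.extend(wav)
    return out
```

Each chunk is synthesized independently, so memory use stays bounded regardless of input length.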
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) (comma-separated) |
| `--text` | str+ | (long default text) | Text(s) to synthesize (pipe-separated: `\|`) |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` (comma-separated) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Building the Project
### Build for Release
```bash
dotnet build -c Release
```
### Run the compiled executable
```bash
./bin/Release/net9.0-windows/Supertonic.exe
```
## Project Structure
```
csharp/
├── ExampleONNX.cs # Main inference script
├── Helper.cs # Helper functions and classes
├── Supertonic.csproj # Project configuration
├── README.md # This file
└── results/ # Output directory (created automatically)
```

18
csharp/Supertonic.csproj Normal file

@@ -0,0 +1,18 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net9.0-windows</TargetFramework>
<UseWindowsForms>true</UseWindowsForms>
<LangVersion>13.0</LangVersion>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.1" />
<PackageReference Include="System.Text.Json" Version="9.0.1" />
</ItemGroup>
</Project>

24
csharp/csharp.sln Normal file

@@ -0,0 +1,24 @@
Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio Version 17
VisualStudioVersion = 17.5.2.0
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Supertonic", "Supertonic.csproj", "{869BE631-3CAF-8F33-CD9A-3A5788517967}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{869BE631-3CAF-8F33-CD9A-3A5788517967}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{869BE631-3CAF-8F33-CD9A-3A5788517967}.Debug|Any CPU.Build.0 = Debug|Any CPU
{869BE631-3CAF-8F33-CD9A-3A5788517967}.Release|Any CPU.ActiveCfg = Release|Any CPU
{869BE631-3CAF-8F33-CD9A-3A5788517967}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {2726B6AA-94CF-4D70-899D-0356CF025555}
EndGlobalSection
EndGlobal

45
flutter/.gitignore vendored Normal file

@@ -0,0 +1,45 @@
# Miscellaneous
*.class
*.log
*.pyc
*.swp
.DS_Store
.atom/
.build/
.buildlog/
.history
.svn/
.swiftpm/
migrate_working_dir/
# IntelliJ related
*.iml
*.ipr
*.iws
.idea/
# The .vscode folder contains launch configuration and tasks you configure in
# VS Code which you may wish to be included in version control, so this line
# is commented out by default.
#.vscode/
# Flutter/Dart/Pub related
**/doc/api/
**/ios/Flutter/.last_build_id
.dart_tool/
.flutter-plugins-dependencies
.pub-cache/
.pub/
/build/
/coverage/
# Symbolication related
app.*.symbols
# Obfuscation related
app.*.map.json
# Android Studio will place build artifacts here
/android/app/debug
/android/app/profile
/android/app/release

30
flutter/.metadata Normal file

@@ -0,0 +1,30 @@
# This file tracks properties of this Flutter project.
# Used by Flutter tool to assess capabilities and perform upgrades etc.
#
# This file should be version controlled and should not be manually edited.
version:
revision: "19074d12f7eaf6a8180cd4036a430c1d76de904e"
channel: "stable"
project_type: app
# Tracks metadata for the flutter migrate command
migration:
platforms:
- platform: root
create_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
base_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
- platform: macos
create_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
base_revision: 19074d12f7eaf6a8180cd4036a430c1d76de904e
# User provided section
# List of Local paths (relative to this file) that should be
# ignored by the migrate tool.
#
# Files that are not part of the templates will be ignored by default.
unmanaged_files:
- 'lib/main.dart'
- 'ios/Runner.xcodeproj/project.pbxproj'

38
flutter/README.md Normal file

@@ -0,0 +1,38 @@
# Supertonic Flutter Example
This example demonstrates how to use Supertonic 2 in a Flutter application using ONNX Runtime.
> **Note:** This project uses the `flutter_onnxruntime` package ([https://pub.dev/packages/flutter_onnxruntime](https://pub.dev/packages/flutter_onnxruntime)). At the moment, only the macOS platform has been tested. Although the flutter_onnxruntime package supports several other platforms, they have not been tested in this project yet and may require additional verification.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Added and tested macOS support.
## Multilingual Support
Supertonic 2 supports multiple languages. Select the appropriate language from the dropdown:
- **English (en)**: Default language
- **한국어 (ko)**: Korean
- **Español (es)**: Spanish
- **Português (pt)**: Portuguese
- **Français (fr)**: French
## Requirements
- Flutter SDK version ^3.5.0
## Running the Demo
```bash
flutter clean
flutter pub get
flutter run -d macos
```


@@ -0,0 +1,28 @@
# This file configures the analyzer, which statically analyzes Dart code to
# check for errors, warnings, and lints.
#
# The issues identified by the analyzer are surfaced in the UI of Dart-enabled
# IDEs (https://dart.dev/tools#ides-and-editors). The analyzer can also be
# invoked from the command line by running `flutter analyze`.
# The following line activates a set of recommended lints for Flutter apps,
# packages, and plugins designed to encourage good coding practices.
include: package:flutter_lints/flutter.yaml
linter:
# The lint rules applied to this project can be customized in the
# section below to disable rules from the `package:flutter_lints/flutter.yaml`
# included above or to enable additional rules. A list of all available lints
# and their documentation is published at https://dart.dev/lints.
#
# Instead of disabling a lint rule for the entire project in the
# section below, it can also be suppressed for a single line of code
# or a specific dart file by using the `// ignore: name_of_lint` and
# `// ignore_for_file: name_of_lint` syntax on the line or in the file
# producing the lint.
rules:
# avoid_print: false # Uncomment to disable the `avoid_print` rule
# prefer_single_quotes: true # Uncomment to enable the `prefer_single_quotes` rule
# Additional information about this file can be found at
# https://dart.dev/guides/language/analysis-options

695
flutter/lib/helper.dart Normal file

@@ -0,0 +1,695 @@
import 'dart:io';
import 'dart:convert';
import 'dart:math' as math;
import 'dart:typed_data';
import 'package:flutter/services.dart' show rootBundle;
import 'package:flutter_onnxruntime/flutter_onnxruntime.dart';
import 'package:logger/logger.dart';
import 'package:path_provider/path_provider.dart';
final logger = Logger(
printer: PrettyPrinter(methodCount: 0, errorMethodCount: 5, lineLength: 80),
);
// Available languages for multilingual TTS
const List<String> availableLangs = ['en', 'ko', 'es', 'pt', 'fr'];
bool isValidLang(String lang) => availableLangs.contains(lang);
// Hangul Jamo constants for NFKD decomposition
const int _hangulSyllableBase = 0xAC00;
const int _hangulSyllableEnd = 0xD7A3;
const int _leadingJamoBase = 0x1100;
const int _vowelJamoBase = 0x1161;
const int _trailingJamoBase = 0x11A7;
const int _vowelCount = 21;
const int _trailingCount = 28;
/// Decompose a Hangul syllable into Jamo (NFKD-like decomposition)
List<int> _decomposeHangulSyllable(int codePoint) {
if (codePoint < _hangulSyllableBase || codePoint > _hangulSyllableEnd) {
return [codePoint];
}
final syllableIndex = codePoint - _hangulSyllableBase;
final leadingIndex = syllableIndex ~/ (_vowelCount * _trailingCount);
final vowelIndex =
(syllableIndex % (_vowelCount * _trailingCount)) ~/ _trailingCount;
final trailingIndex = syllableIndex % _trailingCount;
final result = <int>[
_leadingJamoBase + leadingIndex,
_vowelJamoBase + vowelIndex,
];
if (trailingIndex > 0) {
result.add(_trailingJamoBase + trailingIndex);
}
return result;
}
/// Common Latin character decompositions (NFKD) for es, pt, fr
const Map<int, List<int>> _latinDecompositions = {
// Uppercase with acute accent
0x00C1: [0x0041, 0x0301], // Á → A + ́
0x00C9: [0x0045, 0x0301], // É → E + ́
0x00CD: [0x0049, 0x0301], // Í → I + ́
0x00D3: [0x004F, 0x0301], // Ó → O + ́
0x00DA: [0x0055, 0x0301], // Ú → U + ́
// Lowercase with acute accent
0x00E1: [0x0061, 0x0301], // á → a + ́
0x00E9: [0x0065, 0x0301], // é → e + ́
0x00ED: [0x0069, 0x0301], // í → i + ́
0x00F3: [0x006F, 0x0301], // ó → o + ́
0x00FA: [0x0075, 0x0301], // ú → u + ́
// Grave accent
0x00C0: [0x0041, 0x0300], // À → A + ̀
0x00C8: [0x0045, 0x0300], // È → E + ̀
0x00CC: [0x0049, 0x0300], // Ì → I + ̀
0x00D2: [0x004F, 0x0300], // Ò → O + ̀
0x00D9: [0x0055, 0x0300], // Ù → U + ̀
0x00E0: [0x0061, 0x0300], // à → a + ̀
0x00E8: [0x0065, 0x0300], // è → e + ̀
0x00EC: [0x0069, 0x0300], // ì → i + ̀
0x00F2: [0x006F, 0x0300], // ò → o + ̀
0x00F9: [0x0075, 0x0300], // ù → u + ̀
// Circumflex
0x00C2: [0x0041, 0x0302], // Â → A + ̂
0x00CA: [0x0045, 0x0302], // Ê → E + ̂
0x00CE: [0x0049, 0x0302], // Î → I + ̂
0x00D4: [0x004F, 0x0302], // Ô → O + ̂
0x00DB: [0x0055, 0x0302], // Û → U + ̂
0x00E2: [0x0061, 0x0302], // â → a + ̂
0x00EA: [0x0065, 0x0302], // ê → e + ̂
0x00EE: [0x0069, 0x0302], // î → i + ̂
0x00F4: [0x006F, 0x0302], // ô → o + ̂
0x00FB: [0x0075, 0x0302], // û → u + ̂
// Tilde
0x00C3: [0x0041, 0x0303], // Ã → A + ̃
0x00D1: [0x004E, 0x0303], // Ñ → N + ̃
0x00D5: [0x004F, 0x0303], // Õ → O + ̃
0x00E3: [0x0061, 0x0303], // ã → a + ̃
0x00F1: [0x006E, 0x0303], // ñ → n + ̃
0x00F5: [0x006F, 0x0303], // õ → o + ̃
// Diaeresis/Umlaut
0x00C4: [0x0041, 0x0308], // Ä → A + ̈
0x00CB: [0x0045, 0x0308], // Ë → E + ̈
0x00CF: [0x0049, 0x0308], // Ï → I + ̈
0x00D6: [0x004F, 0x0308], // Ö → O + ̈
0x00DC: [0x0055, 0x0308], // Ü → U + ̈
0x00E4: [0x0061, 0x0308], // ä → a + ̈
0x00EB: [0x0065, 0x0308], // ë → e + ̈
0x00EF: [0x0069, 0x0308], // ï → i + ̈
0x00F6: [0x006F, 0x0308], // ö → o + ̈
0x00FC: [0x0075, 0x0308], // ü → u + ̈
// Cedilla
0x00C7: [0x0043, 0x0327], // Ç → C + ̧
0x00E7: [0x0063, 0x0327], // ç → c + ̧
};
/// Apply NFKD-like decomposition (Hangul + Latin accented characters)
String _applyNfkdDecomposition(String text) {
final result = <int>[];
for (final codePoint in text.runes) {
// Check Hangul first
if (codePoint >= _hangulSyllableBase && codePoint <= _hangulSyllableEnd) {
result.addAll(_decomposeHangulSyllable(codePoint));
}
// Check Latin decomposition
else if (_latinDecompositions.containsKey(codePoint)) {
result.addAll(_latinDecompositions[codePoint]!);
}
// Keep as-is
else {
result.add(codePoint);
}
}
return String.fromCharCodes(result);
}
String preprocessText(String text, String lang) {
// Apply NFKD-like decomposition (especially for Hangul syllables → Jamo)
text = _applyNfkdDecomposition(text);
// Remove emojis
text = text.replaceAll(
RegExp(
r'[\u{1F600}-\u{1F64F}]|[\u{1F300}-\u{1F5FF}]|[\u{1F680}-\u{1F6FF}]|'
r'[\u{1F700}-\u{1F77F}]|[\u{1F780}-\u{1F7FF}]|[\u{1F800}-\u{1F8FF}]|'
r'[\u{1F900}-\u{1F9FF}]|[\u{1FA00}-\u{1FA6F}]|[\u{1FA70}-\u{1FAFF}]|'
r'[\u{2600}-\u{26FF}]|[\u{2700}-\u{27BF}]|[\u{1F1E6}-\u{1F1FF}]',
unicode: true,
),
'');
// Replace various dashes and symbols
const replacements = {
'\u2013': '-', // en dash
'\u2014': '-', // em dash
'\u2015': '-', // horizontal bar
'_': ' ',
'\u201C': '"',
'\u201D': '"',
'\u2018': "'",
'\u2019': "'",
'´': "'",
'`': "'",
'[': ' ',
']': ' ',
'|': ' ',
'/': ' ',
'#': ' ',
'\u200B': ' ', // zero-width space (invisible)
'\uFEFF': ' ', // byte-order mark (invisible)
};
for (final entry in replacements.entries) {
text = text.replaceAll(entry.key, entry.value);
}
// Remove special symbols
text = text.replaceAll(RegExp(r'[♥☆♡©\\]'), '');
// Replace known expressions
text = text.replaceAll('@', ' at ');
text = text.replaceAll('e.g.,', 'for example, ');
text = text.replaceAll('i.e.,', 'that is, ');
// Fix spacing around punctuation
text = text.replaceAll(' ,', ',');
text = text.replaceAll(' .', '.');
text = text.replaceAll(' !', '!');
text = text.replaceAll(' ?', '?');
text = text.replaceAll(' ;', ';');
text = text.replaceAll(' :', ':');
text = text.replaceAll(" '", "'");
// Remove duplicate quotes
while (text.contains('""')) text = text.replaceAll('""', '"');
while (text.contains("''")) text = text.replaceAll("''", "'");
while (text.contains('``')) text = text.replaceAll('``', '`');
// Remove extra spaces
text = text.replaceAll(RegExp(r'\s+'), ' ').trim();
// Add period if needed
if (text.isNotEmpty &&
!RegExp(r'[.!?;:,\x27\x22\u2018\u2019)\]}…。」』】〉》›»]$').hasMatch(text)) {
text += '.';
}
// Validate language
if (!isValidLang(lang)) {
throw ArgumentError(
'Invalid language: $lang. Available: ${availableLangs.join(", ")}');
}
// Wrap text with language tags
text = '<$lang>$text</$lang>';
return text;
}
class UnicodeProcessor {
final Map<int, int> indexer;
UnicodeProcessor._(this.indexer);
static Future<UnicodeProcessor> load(String path) async {
final json = jsonDecode(
path.startsWith('assets/')
? await rootBundle.loadString(path)
: File(path).readAsStringSync(),
);
final indexer = json is List
? {
for (var i = 0; i < json.length; i++)
if (json[i] is int && json[i] >= 0) i: json[i] as int
}
: (json as Map<String, dynamic>)
.map((k, v) => MapEntry(int.parse(k), v as int));
return UnicodeProcessor._(indexer);
}
Map<String, dynamic> call(List<String> textList, List<String> langList) {
// Preprocess texts with language tags
final processedTexts = <String>[];
for (var i = 0; i < textList.length; i++) {
processedTexts.add(preprocessText(textList[i], langList[i]));
}
final lengths = processedTexts.map((t) => t.runes.length).toList();
final maxLen = lengths.reduce(math.max);
final textIds = processedTexts.map((text) {
final row = List<int>.filled(maxLen, 0);
final runes = text.runes.toList();
for (var i = 0; i < runes.length; i++) {
row[i] = indexer[runes[i]] ?? 0;
}
return row;
}).toList();
return {'textIds': textIds, 'textMask': _lengthToMask(lengths)};
}
List<List<List<double>>> _lengthToMask(List<int> lengths, [int? maxLen]) {
maxLen ??= lengths.reduce(math.max);
return lengths
.map((len) => [List.generate(maxLen!, (i) => i < len ? 1.0 : 0.0)])
.toList();
}
}
class Style {
final OrtValue ttl, dp;
final List<int> ttlShape, dpShape;
Style(this.ttl, this.dp, this.ttlShape, this.dpShape);
}
class TextToSpeech {
final Map<String, dynamic> cfgs;
final UnicodeProcessor textProcessor;
final OrtSession dpOrt, textEncOrt, vectorEstOrt, vocoderOrt;
final int sampleRate, baseChunkSize, chunkCompressFactor, ldim;
TextToSpeech(this.cfgs, this.textProcessor, this.dpOrt, this.textEncOrt,
this.vectorEstOrt, this.vocoderOrt)
: sampleRate = cfgs['ae']['sample_rate'],
baseChunkSize = cfgs['ae']['base_chunk_size'],
chunkCompressFactor = cfgs['ttl']['chunk_compress_factor'],
ldim = cfgs['ttl']['latent_dim'];
Future<Map<String, dynamic>> call(
String text, String lang, Style style, int totalStep,
{double speed = 1.05, double silenceDuration = 0.3}) async {
final maxLen = lang == 'ko' ? 120 : 300;
final chunks = _chunkText(text, maxLen: maxLen);
final langList = List.filled(chunks.length, lang);
List<double>? wavCat;
double durCat = 0;
for (var i = 0; i < chunks.length; i++) {
final result = await _infer([chunks[i]], [langList[i]], style, totalStep,
speed: speed);
final wav = _safeCast<double>(result['wav']);
final duration = _safeCast<double>(result['duration']);
if (wavCat == null) {
wavCat = wav;
durCat = duration[0];
} else {
wavCat = [
...wavCat,
...List<double>.filled((silenceDuration * sampleRate).floor(), 0.0),
...wav
];
durCat += duration[0] + silenceDuration;
}
}
return {
'wav': wavCat,
'duration': [durCat]
};
}
Future<Map<String, dynamic>> _infer(
List<String> textList, List<String> langList, Style style, int totalStep,
{double speed = 1.05}) async {
final bsz = textList.length;
final result = textProcessor.call(textList, langList);
final textIdsRaw = result['textIds'];
final textIds = textIdsRaw is List<List<int>>
? textIdsRaw
: (textIdsRaw as List).map((row) => (row as List).cast<int>()).toList();
final textMaskRaw = result['textMask'];
final textMask = textMaskRaw is List<List<List<double>>>
? textMaskRaw
: (textMaskRaw as List)
.map((batch) => (batch as List)
.map((row) => (row as List).cast<double>())
.toList())
.toList();
final textIdsShape = [bsz, textIds[0].length];
final textMaskShape = [bsz, 1, textMask[0][0].length];
final textMaskTensor = await _toTensor(textMask, textMaskShape);
final dpResult = await dpOrt.run({
'text_ids': await _intToTensor(textIds, textIdsShape),
'style_dp': style.dp,
'text_mask': textMaskTensor,
});
final durOnnx = _safeCast<double>(await dpResult.values.first.asList());
final scaledDur = durOnnx.map((d) => d / speed).toList();
final textEncResult = await textEncOrt.run({
'text_ids': await _intToTensor(textIds, textIdsShape),
'style_ttl': style.ttl,
'text_mask': textMaskTensor,
});
final latentData = _sampleNoisyLatent(scaledDur);
final noisyLatentRaw = latentData['noisyLatent'];
var noisyLatent = noisyLatentRaw is List<List<List<double>>>
? noisyLatentRaw
: (noisyLatentRaw as List)
.map((batch) => (batch as List)
.map((row) => (row as List).cast<double>())
.toList())
.toList();
final latentMaskRaw = latentData['latentMask'];
final latentMask = latentMaskRaw is List<List<List<double>>>
? latentMaskRaw
: (latentMaskRaw as List)
.map((batch) => (batch as List)
.map((row) => (row as List).cast<double>())
.toList())
.toList();
final latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
final latentMaskTensor =
await _toTensor(latentMask, [bsz, 1, latentMask[0][0].length]);
final totalStepTensor =
await _scalarToTensor(List.filled(bsz, totalStep.toDouble()), [bsz]);
// Denoising loop
for (var step = 0; step < totalStep; step++) {
final result = await vectorEstOrt.run({
'noisy_latent': await _toTensor(noisyLatent, latentShape),
'text_emb': textEncResult.values.first,
'style_ttl': style.ttl,
'text_mask': textMaskTensor,
'latent_mask': latentMaskTensor,
'total_step': totalStepTensor,
'current_step':
await _scalarToTensor(List.filled(bsz, step.toDouble()), [bsz]),
});
final denoisedRaw = await result.values.first.asList();
final denoised = denoisedRaw is List<double>
? denoisedRaw
: _safeCast<double>(denoisedRaw);
var idx = 0;
for (var b = 0; b < noisyLatent.length; b++) {
for (var d = 0; d < noisyLatent[b].length; d++) {
for (var t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] = denoised[idx++];
}
}
}
}
final vocoderResult = await vocoderOrt
.run({'latent': await _toTensor(noisyLatent, latentShape)});
final wavRaw = await vocoderResult.values.first.asList();
final wav = wavRaw is List<double> ? wavRaw : _safeCast<double>(wavRaw);
return {'wav': wav, 'duration': scaledDur};
}
Map<String, dynamic> _sampleNoisyLatent(List<double> duration) {
final wavLenMax = duration.reduce(math.max) * sampleRate;
final wavLengths = duration.map((d) => (d * sampleRate).floor()).toList();
final chunkSize = baseChunkSize * chunkCompressFactor;
final latentLen = ((wavLenMax + chunkSize - 1) / chunkSize).floor();
final latentDim = ldim * chunkCompressFactor;
final random = math.Random();
final noisyLatent = List.generate(
duration.length,
(_) => List.generate(
latentDim,
(_) => List.generate(latentLen, (_) {
final u1 = math.max(1e-10, random.nextDouble());
final u2 = random.nextDouble();
return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2);
}),
),
);
final latentMask = _getLatentMask(wavLengths);
for (var b = 0; b < noisyLatent.length; b++) {
for (var d = 0; d < noisyLatent[b].length; d++) {
for (var t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return {'noisyLatent': noisyLatent, 'latentMask': latentMask};
}
List<List<List<double>>> _getLatentMask(List<int> wavLengths) {
final latentSize = baseChunkSize * chunkCompressFactor;
final latentLengths = wavLengths
.map((len) => ((len + latentSize - 1) / latentSize).floor())
.toList();
final maxLen = latentLengths.reduce(math.max);
return latentLengths
.map((len) => [List.generate(maxLen, (i) => i < len ? 1.0 : 0.0)])
.toList();
}
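_getLatentMask maps each waveform length to a latent length by ceiling division, then builds a `[batch, 1, maxLen]` binary mask that is 1.0 up to each sequence's length and 0.0 after. The same logic as a small Python sketch (names are illustrative):

```python
def get_latent_mask(wav_lengths, latent_size):
    """Return a [B, 1, T_max] mask: 1.0 within each sequence, 0.0 in padding."""
    # Ceiling division: number of latent frames needed to cover each waveform.
    latent_lengths = [(length + latent_size - 1) // latent_size
                      for length in wav_lengths]
    max_len = max(latent_lengths)
    return [[[1.0 if i < length else 0.0 for i in range(max_len)]]
            for length in latent_lengths]

# Waveforms of 5 and 12 samples with latent_size=4 -> latent lengths 2 and 3.
mask = get_latent_mask([5, 12], latent_size=4)
```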
List<String> _chunkText(String text, {int maxLen = 300}) {
final paragraphs = text
.trim()
.split(RegExp(r'\n\s*\n+'))
.where((p) => p.trim().isNotEmpty)
.toList();
final chunks = <String>[];
for (var paragraph in paragraphs) {
paragraph = paragraph.trim();
if (paragraph.isEmpty) continue;
final sentences = paragraph.split(RegExp(
r'(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+'));
var currentChunk = '';
for (final sentence in sentences) {
if (currentChunk.length + sentence.length + 1 <= maxLen) {
currentChunk += (currentChunk.isNotEmpty ? ' ' : '') + sentence;
} else {
if (currentChunk.isNotEmpty) chunks.add(currentChunk.trim());
currentChunk = sentence;
}
}
if (currentChunk.isNotEmpty) chunks.add(currentChunk.trim());
}
return chunks;
}
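_chunkText splits the input on blank lines into paragraphs, splits each paragraph into sentences at terminal punctuation (skipping common abbreviations like "Dr." and single-initial forms like "J."), and greedily packs sentences into chunks of at most `maxLen` characters. A Python sketch of the same strategy; since Python's `re` lacks variable-width lookbehind, the abbreviation alternation is rewritten as separate fixed-width lookbehinds:

```python
import re

_SENTENCE_SPLIT = re.compile(
    r"(?<!Mr\.)(?<!Mrs\.)(?<!Ms\.)(?<!Dr\.)(?<!Prof\.)"
    r"(?<!\b[A-Z]\.)(?<=[.!?])\s+")

def chunk_text(text: str, max_len: int = 300):
    """Paragraph-split, sentence-split, then greedily pack into <= max_len chunks."""
    paragraphs = [p for p in re.split(r"\n\s*\n+", text.strip()) if p.strip()]
    chunks = []
    for paragraph in paragraphs:
        current = ""
        for sentence in _SENTENCE_SPLIT.split(paragraph.strip()):
            if len(current) + len(sentence) + 1 <= max_len:
                current += (" " if current else "") + sentence
            else:
                if current:
                    chunks.append(current.strip())
                current = sentence
        if current:
            chunks.append(current.strip())
    return chunks
```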
List<T> _safeCast<T>(dynamic raw) {
if (raw is List<T>) return raw;
if (raw is List) {
if (raw.isNotEmpty && raw.first is List) {
return _flattenList<T>(raw);
}
if (T == double) {
return raw
.map((e) => e is num ? e.toDouble() : double.parse(e.toString()))
.toList() as List<T>;
}
return raw.cast<T>();
}
throw Exception('Cannot convert $raw to List<$T>');
}
List<T> _flattenList<T>(dynamic list) {
if (list is List) {
return list.expand((e) => _flattenList<T>(e)).toList();
}
if (T == double && list is num) {
return [list.toDouble()] as List<T>;
}
return [list as T];
}
Future<OrtValue> _toTensor(dynamic array, List<int> dims) async {
final flat = _flattenList<double>(array);
return await OrtValue.fromList(Float32List.fromList(flat), dims);
}
Future<OrtValue> _scalarToTensor(List<double> array, List<int> dims) async {
return await OrtValue.fromList(Float32List.fromList(array), dims);
}
Future<OrtValue> _intToTensor(List<List<int>> array, List<int> dims) async {
final flat = array.expand((row) => row).toList();
return await OrtValue.fromList(Int64List.fromList(flat), dims);
}
}
Future<TextToSpeech> loadTextToSpeech(String onnxDir,
{bool useGpu = false}) async {
if (useGpu) throw Exception('GPU mode not supported yet');
logger.i('Loading TTS models from $onnxDir');
final cfgs = await _loadCfgs(onnxDir);
final sessions = await _loadOnnxAll(onnxDir);
final textProcessor =
await UnicodeProcessor.load('$onnxDir/unicode_indexer.json');
logger.i('TTS models loaded successfully');
return TextToSpeech(
cfgs,
textProcessor,
sessions['dpOrt']!,
sessions['textEncOrt']!,
sessions['vectorEstOrt']!,
sessions['vocoderOrt']!,
);
}
Future<Style> loadVoiceStyle(List<String> paths) async {
final bsz = paths.length;
final firstJson = jsonDecode(
paths[0].startsWith('assets/')
? await rootBundle.loadString(paths[0])
: File(paths[0]).readAsStringSync(),
);
final ttlDims = List<int>.from(firstJson['style_ttl']['dims']);
final dpDims = List<int>.from(firstJson['style_dp']['dims']);
final ttlFlat = Float32List(bsz * ttlDims[1] * ttlDims[2]);
final dpFlat = Float32List(bsz * dpDims[1] * dpDims[2]);
for (var i = 0; i < bsz; i++) {
final json = jsonDecode(
paths[i].startsWith('assets/')
? await rootBundle.loadString(paths[i])
: File(paths[i]).readAsStringSync(),
);
final ttlData = _flattenToDouble(json['style_ttl']['data']);
final dpData = _flattenToDouble(json['style_dp']['data']);
ttlFlat.setRange(i * ttlDims[1] * ttlDims[2],
(i + 1) * ttlDims[1] * ttlDims[2], ttlData);
dpFlat.setRange(
i * dpDims[1] * dpDims[2], (i + 1) * dpDims[1] * dpDims[2], dpData);
}
final ttlShape = [bsz, ttlDims[1], ttlDims[2]];
final dpShape = [bsz, dpDims[1], dpDims[2]];
return Style(
await OrtValue.fromList(ttlFlat, ttlShape),
await OrtValue.fromList(dpFlat, dpShape),
ttlShape,
dpShape,
);
}
Future<Map<String, dynamic>> _loadCfgs(String onnxDir) async {
final path = '$onnxDir/tts.json';
final json = jsonDecode(await rootBundle.loadString(path));
return json as Map<String, dynamic>;
}
Future<String> copyModelToFile(String path) async {
final byteData = await rootBundle.load(path);
final tempDir = await getApplicationCacheDirectory();
final modelPath = '${tempDir.path}/${path.split("/").last}';
final file = File(modelPath);
await file.writeAsBytes(byteData.buffer.asUint8List());
return modelPath;
}
Future<Map<String, OrtSession>> _loadOnnxAll(String dir) async {
final ort = OnnxRuntime();
final models = [
'duration_predictor',
'text_encoder',
'vector_estimator',
'vocoder'
];
final sessions = await Future.wait(models.map((name) async {
final path = await copyModelToFile('$dir/$name.onnx');
logger.d('Loading $name.onnx');
return ort.createSessionFromAsset(path);
}));
return {
'dpOrt': sessions[0],
'textEncOrt': sessions[1],
'vectorEstOrt': sessions[2],
'vocoderOrt': sessions[3],
};
}
List<double> _flattenToDouble(dynamic list) {
if (list is List) return list.expand((e) => _flattenToDouble(e)).toList();
return [list is num ? list.toDouble() : double.parse(list.toString())];
}
void writeWavFile(String filename, List<double> audioData, int sampleRate) {
const numChannels = 1;
const bitsPerSample = 16;
final dataSize = audioData.length * 2;
final buffer = ByteData(44 + dataSize);
var offset = 0;
// RIFF header
for (var byte in [0x52, 0x49, 0x46, 0x46]) {
buffer.setUint8(offset++, byte);
}
buffer.setUint32(offset, 36 + dataSize, Endian.little);
offset += 4;
// WAVE
for (var byte in [0x57, 0x41, 0x56, 0x45]) {
buffer.setUint8(offset++, byte);
}
// fmt chunk
for (var byte in [0x66, 0x6D, 0x74, 0x20]) {
buffer.setUint8(offset++, byte);
}
buffer.setUint32(offset, 16, Endian.little);
offset += 4;
buffer.setUint16(offset, 1, Endian.little);
offset += 2;
buffer.setUint16(offset, numChannels, Endian.little);
offset += 2;
buffer.setUint32(offset, sampleRate, Endian.little);
offset += 4;
buffer.setUint32(offset, sampleRate * numChannels * 2, Endian.little);
offset += 4;
buffer.setUint16(offset, numChannels * 2, Endian.little);
offset += 2;
buffer.setUint16(offset, bitsPerSample, Endian.little);
offset += 2;
// data chunk
for (var byte in [0x64, 0x61, 0x74, 0x61]) {
buffer.setUint8(offset++, byte);
}
buffer.setUint32(offset, dataSize, Endian.little);
offset += 4;
// Write audio samples
for (var i = 0; i < audioData.length; i++) {
final sample = (audioData[i].clamp(-1.0, 1.0) * 32767).round();
buffer.setInt16(offset + i * 2, sample, Endian.little);
}
File(filename).writeAsBytesSync(buffer.buffer.asUint8List());
}
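writeWavFile emits the standard 44-byte RIFF/WAVE header for 16-bit PCM, followed by little-endian samples. The header layout it writes byte by byte can be sketched compactly in Python with `struct.pack` (an illustrative helper, not part of the app):

```python
import struct

def wav_header(num_samples: int, sample_rate: int,
               channels: int = 1, bits: int = 16) -> bytes:
    """Build the 44-byte RIFF/WAVE header for integer PCM audio."""
    data_size = num_samples * channels * bits // 8
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size, b"WAVE",   # RIFF chunk: size excludes first 8 bytes
        b"fmt ", 16, 1,                     # fmt chunk: 16 bytes, format 1 = PCM
        channels, sample_rate,
        byte_rate, block_align, bits,
        b"data", data_size)                 # data chunk header

# One second of mono 16-bit audio at 44.1 kHz -> 88200 data bytes.
header = wav_header(44100, 44100)
```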

391
flutter/lib/main.dart Normal file

@@ -0,0 +1,391 @@
import 'dart:io';
import 'package:flutter/material.dart';
import 'package:just_audio/just_audio.dart';
import 'package:path_provider/path_provider.dart';
import 'package:flutter_sdk/helper.dart';
void main() {
runApp(const SupertonicApp());
}
class SupertonicApp extends StatelessWidget {
const SupertonicApp({super.key});
@override
Widget build(BuildContext context) {
return MaterialApp(
title: 'Supertonic 2',
theme: ThemeData(
colorScheme: ColorScheme.fromSeed(seedColor: Colors.deepPurple),
useMaterial3: true,
),
home: const TTSPage(),
);
}
}
class TTSPage extends StatefulWidget {
const TTSPage({super.key});
@override
State<TTSPage> createState() => _TTSPageState();
}
class _TTSPageState extends State<TTSPage> {
final TextEditingController _textController = TextEditingController(
text: 'Hello, this is a text to speech example.',
);
final AudioPlayer _audioPlayer = AudioPlayer();
TextToSpeech? _textToSpeech;
Style? _style;
bool _isLoading = false;
bool _isGenerating = false;
String _status = 'Not initialized';
int _totalSteps = 5;
double _speed = 1.05;
String _selectedLang = 'en';
bool _isPlaying = false;
String? _lastGeneratedFilePath;
@override
void initState() {
super.initState();
_loadModels();
_setupAudioPlayerListeners();
}
void _setupAudioPlayerListeners() {
_audioPlayer.playerStateStream.listen((state) {
if (!mounted) return;
setState(() {
_isPlaying = state.playing;
if (state.processingState == ProcessingState.completed) {
_isPlaying = false;
_status = 'Ready';
} else if (state.processingState == ProcessingState.loading) {
_status = 'Loading audio...';
} else if (state.processingState == ProcessingState.buffering) {
_status = 'Buffering...';
}
});
});
}
Future<void> _loadModels() async {
setState(() {
_isLoading = true;
_status = 'Loading models...';
});
try {
_textToSpeech = await loadTextToSpeech('assets/onnx', useGpu: false);
_style = await loadVoiceStyle(['assets/voice_styles/M1.json']);
setState(() {
_isLoading = false;
_status = 'Ready';
});
} catch (e, stackTrace) {
logger.e('Error loading models', error: e, stackTrace: stackTrace);
setState(() {
_isLoading = false;
_status = 'Error: $e';
});
}
}
Future<void> _generateSpeech() async {
if (_textToSpeech == null || _style == null) {
setState(() => _status = 'Models not loaded yet');
return;
}
if (_textController.text.trim().isEmpty) {
setState(() => _status = 'Please enter some text');
return;
}
setState(() {
_isGenerating = true;
_status = 'Generating speech...';
});
List<double>? wav;
List<double>? duration;
// Step 1: Generate speech
try {
final result = await _textToSpeech!.call(
_textController.text,
_selectedLang,
_style!,
_totalSteps,
speed: _speed,
);
wav = result['wav'] is List<double>
? result['wav']
: (result['wav'] as List).cast<double>();
duration = result['duration'] is List<double>
? result['duration']
: (result['duration'] as List).cast<double>();
} catch (e) {
logger.e('Error generating speech', error: e);
setState(() {
_isGenerating = false;
_status = 'Error generating speech: $e';
});
return;
}
// Step 2: Save to file and play
try {
final tempDir = await getTemporaryDirectory();
final timestamp = DateTime.now().millisecondsSinceEpoch;
final outputPath = '${tempDir.path}/speech_$timestamp.wav';
writeWavFile(outputPath, wav!, _textToSpeech!.sampleRate);
final file = File(outputPath);
if (!file.existsSync()) {
throw Exception('Failed to create WAV file');
}
final absolutePath = file.absolute.path;
setState(() {
_isGenerating = false;
_status = 'Playing ${duration![0].toStringAsFixed(2)}s of audio...';
_lastGeneratedFilePath = absolutePath;
});
logger.i('Audio saved to $absolutePath');
final uri = Uri.file(absolutePath);
await _audioPlayer.setAudioSource(AudioSource.uri(uri));
await _audioPlayer.play();
} catch (e) {
logger.e('Error playing audio', error: e);
setState(() {
_isGenerating = false;
_status = 'Error playing audio: $e';
});
}
}
Future<void> _downloadFile() async {
if (_lastGeneratedFilePath == null) return;
try {
final sourceFile = File(_lastGeneratedFilePath!);
if (!sourceFile.existsSync()) {
setState(() => _status = 'Error: File no longer exists');
return;
}
final downloadsDir = await getDownloadsDirectory();
if (downloadsDir == null) {
setState(() => _status = 'Error: Could not access downloads folder');
return;
}
final timestamp = DateTime.now().millisecondsSinceEpoch;
final downloadPath = '${downloadsDir.path}/speech_$timestamp.wav';
await sourceFile.copy(downloadPath);
logger.i('File saved to $downloadPath');
setState(() => _status = 'File saved to: $downloadPath');
} catch (e) {
logger.e('Error downloading file', error: e);
setState(() => _status = 'Error downloading file: $e');
}
}
@override
Widget build(BuildContext context) {
return Scaffold(
appBar: AppBar(
backgroundColor: Theme.of(context).colorScheme.inversePrimary,
title: const Text('Supertonic 2'),
),
body: Padding(
padding: const EdgeInsets.all(16.0),
child: Column(
crossAxisAlignment: CrossAxisAlignment.stretch,
children: [
// Status indicator
Card(
color: _isLoading || _isGenerating
? Colors.orange.shade100
: _status.startsWith('Error')
? Colors.red.shade100
: Colors.green.shade100,
child: Padding(
padding: const EdgeInsets.all(16.0),
child: Row(
children: [
if (_isLoading || _isGenerating)
const SizedBox(
width: 20,
height: 20,
child: CircularProgressIndicator(strokeWidth: 2),
),
if (_isLoading || _isGenerating) const SizedBox(width: 12),
Expanded(
child:
Text(_status, style: const TextStyle(fontSize: 16)),
),
],
),
),
),
const SizedBox(height: 24),
// Text input
TextField(
controller: _textController,
maxLines: 5,
decoration: const InputDecoration(
labelText: 'Text to synthesize',
border: OutlineInputBorder(),
hintText: 'Enter the text you want to convert to speech...',
),
enabled: !_isLoading && !_isGenerating,
),
const SizedBox(height: 24),
// Parameters
Text('Parameters', style: Theme.of(context).textTheme.titleMedium),
const SizedBox(height: 12),
// Denoising steps slider
Row(
children: [
const Expanded(flex: 2, child: Text('Denoising Steps:')),
Expanded(
flex: 3,
child: Slider(
value: _totalSteps.toDouble(),
min: 1,
max: 20,
divisions: 19,
label: _totalSteps.toString(),
onChanged: _isLoading || _isGenerating
? null
: (value) =>
setState(() => _totalSteps = value.toInt()),
),
),
SizedBox(
width: 40,
child:
Text(_totalSteps.toString(), textAlign: TextAlign.right),
),
],
),
// Speed slider
Row(
children: [
const Expanded(flex: 2, child: Text('Speed:')),
Expanded(
flex: 3,
child: Slider(
value: _speed,
min: 0.5,
max: 2.0,
divisions: 30,
label: _speed.toStringAsFixed(2),
onChanged: _isLoading || _isGenerating
? null
: (value) => setState(() => _speed = value),
),
),
SizedBox(
width: 40,
child: Text(_speed.toStringAsFixed(2),
textAlign: TextAlign.right),
),
],
),
const SizedBox(height: 12),
// Language selector
Row(
children: [
const Expanded(flex: 2, child: Text('Language:')),
Expanded(
flex: 3,
child: DropdownButton<String>(
value: _selectedLang,
isExpanded: true,
items: const [
DropdownMenuItem(value: 'en', child: Text('English')),
DropdownMenuItem(value: 'ko', child: Text('한국어')),
DropdownMenuItem(value: 'es', child: Text('Español')),
DropdownMenuItem(value: 'pt', child: Text('Português')),
DropdownMenuItem(value: 'fr', child: Text('Français')),
],
onChanged: _isLoading || _isGenerating
? null
: (value) => setState(() => _selectedLang = value!),
),
),
],
),
const SizedBox(height: 24),
// Generate button
ElevatedButton.icon(
onPressed: _isLoading || _isGenerating
? null
: _isPlaying
? () async {
await _audioPlayer.stop();
setState(() => _status = 'Ready');
}
: _generateSpeech,
icon: Icon(_isPlaying ? Icons.stop : Icons.play_arrow),
label: Text(
_isGenerating
? 'Generating...'
: _isPlaying
? 'Stop Playback'
: 'Generate & Play Speech',
style: const TextStyle(fontSize: 16),
),
style: ElevatedButton.styleFrom(
padding: const EdgeInsets.symmetric(vertical: 16),
),
),
// Download button
if (_lastGeneratedFilePath != null) ...[
const SizedBox(height: 12),
OutlinedButton.icon(
onPressed: _isLoading || _isGenerating ? null : _downloadFile,
icon: const Icon(Icons.download),
label: const Text('Download WAV File',
style: TextStyle(fontSize: 16)),
style: OutlinedButton.styleFrom(
padding: const EdgeInsets.symmetric(vertical: 16),
),
),
],
],
),
),
);
}
@override
void dispose() {
_textController.dispose();
_audioPlayer.dispose();
super.dispose();
}
}

7
flutter/macos/.gitignore vendored Normal file

@@ -0,0 +1,7 @@
# Flutter-related
**/Flutter/ephemeral/
**/Pods/
# Xcode-related
**/dgph
**/xcuserdata/


@@ -0,0 +1,2 @@
#include? "Pods/Target Support Files/Pods-Runner/Pods-Runner.debug.xcconfig"
#include "ephemeral/Flutter-Generated.xcconfig"


@@ -0,0 +1,2 @@
#include? "Pods/Target Support Files/Pods-Runner/Pods-Runner.release.xcconfig"
#include "ephemeral/Flutter-Generated.xcconfig"


@@ -0,0 +1,16 @@
//
// Generated file. Do not edit.
//
import FlutterMacOS
import Foundation
import audio_session
import flutter_onnxruntime
import just_audio
func RegisterGeneratedPlugins(registry: FlutterPluginRegistry) {
AudioSessionPlugin.register(with: registry.registrar(forPlugin: "AudioSessionPlugin"))
FlutterOnnxruntimePlugin.register(with: registry.registrar(forPlugin: "FlutterOnnxruntimePlugin"))
JustAudioPlugin.register(with: registry.registrar(forPlugin: "JustAudioPlugin"))
}

45
flutter/macos/Podfile Normal file

@@ -0,0 +1,45 @@
platform :osx, '14.0'
# CocoaPods analytics sends network stats synchronously affecting flutter build latency.
ENV['COCOAPODS_DISABLE_STATS'] = 'true'
project 'Runner', {
'Debug' => :debug,
'Profile' => :release,
'Release' => :release,
}
def flutter_root
generated_xcode_build_settings_path = File.expand_path(File.join('..', 'Flutter', 'ephemeral', 'Flutter-Generated.xcconfig'), __FILE__)
unless File.exist?(generated_xcode_build_settings_path)
raise "#{generated_xcode_build_settings_path} must exist. If you're running pod install manually, make sure \"flutter pub get\" is executed first"
end
File.foreach(generated_xcode_build_settings_path) do |line|
matches = line.match(/FLUTTER_ROOT\=(.*)/)
return matches[1].strip if matches
end
raise "FLUTTER_ROOT not found in #{generated_xcode_build_settings_path}. Try deleting Flutter-Generated.xcconfig, then run \"flutter pub get\""
end
require File.expand_path(File.join('packages', 'flutter_tools', 'bin', 'podhelper'), flutter_root)
flutter_macos_podfile_setup
target 'Runner' do
use_frameworks! :linkage => :static
flutter_install_all_macos_pods File.dirname(File.realpath(__FILE__))
target 'RunnerTests' do
inherit! :search_paths
end
end
post_install do |installer|
installer.pods_project.targets.each do |target|
flutter_additional_macos_build_settings(target)
target.build_configurations.each do |config|
config.build_settings['MACOSX_DEPLOYMENT_TARGET'] = '14.0'
end
end
end


@@ -0,0 +1,54 @@
PODS:
- audio_session (0.0.1):
- FlutterMacOS
- flutter_onnxruntime (0.0.1):
- FlutterMacOS
- onnxruntime-objc (= 1.21.0)
- FlutterMacOS (1.0.0)
- just_audio (0.0.1):
- Flutter
- FlutterMacOS
- objective_c (0.0.1):
- FlutterMacOS
- onnxruntime-c (1.21.0)
- onnxruntime-objc (1.21.0):
- onnxruntime-objc/Core (= 1.21.0)
- onnxruntime-objc/Core (1.21.0):
- onnxruntime-c (= 1.21.0)
DEPENDENCIES:
- audio_session (from `Flutter/ephemeral/.symlinks/plugins/audio_session/macos`)
- flutter_onnxruntime (from `Flutter/ephemeral/.symlinks/plugins/flutter_onnxruntime/macos`)
- FlutterMacOS (from `Flutter/ephemeral`)
- just_audio (from `Flutter/ephemeral/.symlinks/plugins/just_audio/darwin`)
- objective_c (from `Flutter/ephemeral/.symlinks/plugins/objective_c/macos`)
SPEC REPOS:
trunk:
- onnxruntime-c
- onnxruntime-objc
EXTERNAL SOURCES:
audio_session:
:path: Flutter/ephemeral/.symlinks/plugins/audio_session/macos
flutter_onnxruntime:
:path: Flutter/ephemeral/.symlinks/plugins/flutter_onnxruntime/macos
FlutterMacOS:
:path: Flutter/ephemeral
just_audio:
:path: Flutter/ephemeral/.symlinks/plugins/just_audio/darwin
objective_c:
:path: Flutter/ephemeral/.symlinks/plugins/objective_c/macos
SPEC CHECKSUMS:
audio_session: 728ae3823d914f809c485d390274861a24b0904e
flutter_onnxruntime: e6887abc1032d3e5c92f84b912ad42c33e9ce1c9
FlutterMacOS: d0db08ddef1a9af05a5ec4b724367152bb0500b1
just_audio: a42c63806f16995daf5b219ae1d679deb76e6a79
objective_c: e5f8194456e8fc943e034d1af00510a1bc29c067
onnxruntime-c: ac65025f01072d25d7d394a2b43ac30d9397b260
onnxruntime-objc: 5fa03134356d47b642ec85b1023d9907a123d201
PODFILE CHECKSUM: 6b8e7008b8bf73cd361b3ffb8aa3768b71e74409
COCOAPODS: 1.16.2


@@ -0,0 +1,13 @@
import Cocoa
import FlutterMacOS
@main
class AppDelegate: FlutterAppDelegate {
override func applicationShouldTerminateAfterLastWindowClosed(_ sender: NSApplication) -> Bool {
return true
}
override func applicationSupportsSecureRestorableState(_ app: NSApplication) -> Bool {
return true
}
}


@@ -0,0 +1,68 @@
{
"images" : [
{
"size" : "16x16",
"idiom" : "mac",
"filename" : "app_icon_16.png",
"scale" : "1x"
},
{
"size" : "16x16",
"idiom" : "mac",
"filename" : "app_icon_32.png",
"scale" : "2x"
},
{
"size" : "32x32",
"idiom" : "mac",
"filename" : "app_icon_32.png",
"scale" : "1x"
},
{
"size" : "32x32",
"idiom" : "mac",
"filename" : "app_icon_64.png",
"scale" : "2x"
},
{
"size" : "128x128",
"idiom" : "mac",
"filename" : "app_icon_128.png",
"scale" : "1x"
},
{
"size" : "128x128",
"idiom" : "mac",
"filename" : "app_icon_256.png",
"scale" : "2x"
},
{
"size" : "256x256",
"idiom" : "mac",
"filename" : "app_icon_256.png",
"scale" : "1x"
},
{
"size" : "256x256",
"idiom" : "mac",
"filename" : "app_icon_512.png",
"scale" : "2x"
},
{
"size" : "512x512",
"idiom" : "mac",
"filename" : "app_icon_512.png",
"scale" : "1x"
},
{
"size" : "512x512",
"idiom" : "mac",
"filename" : "app_icon_1024.png",
"scale" : "2x"
}
],
"info" : {
"version" : 1,
"author" : "xcode"
}
}

(Seven binary image assets added, not shown; sizes range from 520 B to 101 KiB.)


@@ -0,0 +1,343 @@
<?xml version="1.0" encoding="UTF-8"?>
<document type="com.apple.InterfaceBuilder3.Cocoa.XIB" version="3.0" toolsVersion="14490.70" targetRuntime="MacOSX.Cocoa" propertyAccessControl="none" useAutolayout="YES" customObjectInstantitationMethod="direct">
<dependencies>
<deployment identifier="macosx"/>
<plugIn identifier="com.apple.InterfaceBuilder.CocoaPlugin" version="14490.70"/>
<capability name="documents saved in the Xcode 8 format" minToolsVersion="8.0"/>
</dependencies>
<objects>
<customObject id="-2" userLabel="File's Owner" customClass="NSApplication">
<connections>
<outlet property="delegate" destination="Voe-Tx-rLC" id="GzC-gU-4Uq"/>
</connections>
</customObject>
<customObject id="-1" userLabel="First Responder" customClass="FirstResponder"/>
<customObject id="-3" userLabel="Application" customClass="NSObject"/>
<customObject id="Voe-Tx-rLC" customClass="AppDelegate" customModule="Runner" customModuleProvider="target">
<connections>
<outlet property="applicationMenu" destination="uQy-DD-JDr" id="XBo-yE-nKs"/>
<outlet property="mainFlutterWindow" destination="QvC-M9-y7g" id="gIp-Ho-8D9"/>
</connections>
</customObject>
<customObject id="YLy-65-1bz" customClass="NSFontManager"/>
<menu title="Main Menu" systemMenu="main" id="AYu-sK-qS6">
<items>
<menuItem title="APP_NAME" id="1Xt-HY-uBw">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="APP_NAME" systemMenu="apple" id="uQy-DD-JDr">
<items>
<menuItem title="About APP_NAME" id="5kV-Vb-QxS">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="orderFrontStandardAboutPanel:" target="-1" id="Exp-CZ-Vem"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="VOq-y0-SEH"/>
<menuItem title="Preferences…" keyEquivalent="," id="BOF-NM-1cW"/>
<menuItem isSeparatorItem="YES" id="wFC-TO-SCJ"/>
<menuItem title="Services" id="NMo-om-nkz">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Services" systemMenu="services" id="hz9-B4-Xy5"/>
</menuItem>
<menuItem isSeparatorItem="YES" id="4je-JR-u6R"/>
<menuItem title="Hide APP_NAME" keyEquivalent="h" id="Olw-nP-bQN">
<connections>
<action selector="hide:" target="-1" id="PnN-Uc-m68"/>
</connections>
</menuItem>
<menuItem title="Hide Others" keyEquivalent="h" id="Vdr-fp-XzO">
<modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
<connections>
<action selector="hideOtherApplications:" target="-1" id="VT4-aY-XCT"/>
</connections>
</menuItem>
<menuItem title="Show All" id="Kd2-mp-pUS">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="unhideAllApplications:" target="-1" id="Dhg-Le-xox"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="kCx-OE-vgT"/>
<menuItem title="Quit APP_NAME" keyEquivalent="q" id="4sb-4s-VLi">
<connections>
<action selector="terminate:" target="-1" id="Te7-pn-YzF"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Edit" id="5QF-Oa-p0T">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Edit" id="W48-6f-4Dl">
<items>
<menuItem title="Undo" keyEquivalent="z" id="dRJ-4n-Yzg">
<connections>
<action selector="undo:" target="-1" id="M6e-cu-g7V"/>
</connections>
</menuItem>
<menuItem title="Redo" keyEquivalent="Z" id="6dh-zS-Vam">
<connections>
<action selector="redo:" target="-1" id="oIA-Rs-6OD"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="WRV-NI-Exz"/>
<menuItem title="Cut" keyEquivalent="x" id="uRl-iY-unG">
<connections>
<action selector="cut:" target="-1" id="YJe-68-I9s"/>
</connections>
</menuItem>
<menuItem title="Copy" keyEquivalent="c" id="x3v-GG-iWU">
<connections>
<action selector="copy:" target="-1" id="G1f-GL-Joy"/>
</connections>
</menuItem>
<menuItem title="Paste" keyEquivalent="v" id="gVA-U4-sdL">
<connections>
<action selector="paste:" target="-1" id="UvS-8e-Qdg"/>
</connections>
</menuItem>
<menuItem title="Paste and Match Style" keyEquivalent="V" id="WeT-3V-zwk">
<modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
<connections>
<action selector="pasteAsPlainText:" target="-1" id="cEh-KX-wJQ"/>
</connections>
</menuItem>
<menuItem title="Delete" id="pa3-QI-u2k">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="delete:" target="-1" id="0Mk-Ml-PaM"/>
</connections>
</menuItem>
<menuItem title="Select All" keyEquivalent="a" id="Ruw-6m-B2m">
<connections>
<action selector="selectAll:" target="-1" id="VNm-Mi-diN"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="uyl-h8-XO2"/>
<menuItem title="Find" id="4EN-yA-p0u">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Find" id="1b7-l0-nxx">
<items>
<menuItem title="Find…" tag="1" keyEquivalent="f" id="Xz5-n4-O0W">
<connections>
<action selector="performFindPanelAction:" target="-1" id="cD7-Qs-BN4"/>
</connections>
</menuItem>
<menuItem title="Find and Replace…" tag="12" keyEquivalent="f" id="YEy-JH-Tfz">
<modifierMask key="keyEquivalentModifierMask" option="YES" command="YES"/>
<connections>
<action selector="performFindPanelAction:" target="-1" id="WD3-Gg-5AJ"/>
</connections>
</menuItem>
<menuItem title="Find Next" tag="2" keyEquivalent="g" id="q09-fT-Sye">
<connections>
<action selector="performFindPanelAction:" target="-1" id="NDo-RZ-v9R"/>
</connections>
</menuItem>
<menuItem title="Find Previous" tag="3" keyEquivalent="G" id="OwM-mh-QMV">
<connections>
<action selector="performFindPanelAction:" target="-1" id="HOh-sY-3ay"/>
</connections>
</menuItem>
<menuItem title="Use Selection for Find" tag="7" keyEquivalent="e" id="buJ-ug-pKt">
<connections>
<action selector="performFindPanelAction:" target="-1" id="U76-nv-p5D"/>
</connections>
</menuItem>
<menuItem title="Jump to Selection" keyEquivalent="j" id="S0p-oC-mLd">
<connections>
<action selector="centerSelectionInVisibleArea:" target="-1" id="IOG-6D-g5B"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Spelling and Grammar" id="Dv1-io-Yv7">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Spelling" id="3IN-sU-3Bg">
<items>
<menuItem title="Show Spelling and Grammar" keyEquivalent=":" id="HFo-cy-zxI">
<connections>
<action selector="showGuessPanel:" target="-1" id="vFj-Ks-hy3"/>
</connections>
</menuItem>
<menuItem title="Check Document Now" keyEquivalent=";" id="hz2-CU-CR7">
<connections>
<action selector="checkSpelling:" target="-1" id="fz7-VC-reM"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="bNw-od-mp5"/>
<menuItem title="Check Spelling While Typing" id="rbD-Rh-wIN">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleContinuousSpellChecking:" target="-1" id="7w6-Qz-0kB"/>
</connections>
</menuItem>
<menuItem title="Check Grammar With Spelling" id="mK6-2p-4JG">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleGrammarChecking:" target="-1" id="muD-Qn-j4w"/>
</connections>
</menuItem>
<menuItem title="Correct Spelling Automatically" id="78Y-hA-62v">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticSpellingCorrection:" target="-1" id="2lM-Qi-WAP"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Substitutions" id="9ic-FL-obx">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Substitutions" id="FeM-D8-WVr">
<items>
<menuItem title="Show Substitutions" id="z6F-FW-3nz">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="orderFrontSubstitutionsPanel:" target="-1" id="oku-mr-iSq"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="gPx-C9-uUO"/>
<menuItem title="Smart Copy/Paste" id="9yt-4B-nSM">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleSmartInsertDelete:" target="-1" id="3IJ-Se-DZD"/>
</connections>
</menuItem>
<menuItem title="Smart Quotes" id="hQb-2v-fYv">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticQuoteSubstitution:" target="-1" id="ptq-xd-QOA"/>
</connections>
</menuItem>
<menuItem title="Smart Dashes" id="rgM-f4-ycn">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticDashSubstitution:" target="-1" id="oCt-pO-9gS"/>
</connections>
</menuItem>
<menuItem title="Smart Links" id="cwL-P1-jid">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticLinkDetection:" target="-1" id="Gip-E3-Fov"/>
</connections>
</menuItem>
<menuItem title="Data Detectors" id="tRr-pd-1PS">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticDataDetection:" target="-1" id="R1I-Nq-Kbl"/>
</connections>
</menuItem>
<menuItem title="Text Replacement" id="HFQ-gK-NFA">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="toggleAutomaticTextReplacement:" target="-1" id="DvP-Fe-Py6"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Transformations" id="2oI-Rn-ZJC">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Transformations" id="c8a-y6-VQd">
<items>
<menuItem title="Make Upper Case" id="vmV-6d-7jI">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="uppercaseWord:" target="-1" id="sPh-Tk-edu"/>
</connections>
</menuItem>
<menuItem title="Make Lower Case" id="d9M-CD-aMd">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="lowercaseWord:" target="-1" id="iUZ-b5-hil"/>
</connections>
</menuItem>
<menuItem title="Capitalize" id="UEZ-Bs-lqG">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="capitalizeWord:" target="-1" id="26H-TL-nsh"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Speech" id="xrE-MZ-jX0">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Speech" id="3rS-ZA-NoH">
<items>
<menuItem title="Start Speaking" id="Ynk-f8-cLZ">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="startSpeaking:" target="-1" id="654-Ng-kyl"/>
</connections>
</menuItem>
<menuItem title="Stop Speaking" id="Oyz-dy-DGm">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="stopSpeaking:" target="-1" id="dX8-6p-jy9"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="View" id="H8h-7b-M4v">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="View" id="HyV-fh-RgO">
<items>
<menuItem title="Enter Full Screen" keyEquivalent="f" id="4J7-dP-txa">
<modifierMask key="keyEquivalentModifierMask" control="YES" command="YES"/>
<connections>
<action selector="toggleFullScreen:" target="-1" id="dU3-MA-1Rq"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Window" id="aUF-d1-5bR">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Window" systemMenu="window" id="Td7-aD-5lo">
<items>
<menuItem title="Minimize" keyEquivalent="m" id="OY7-WF-poV">
<connections>
<action selector="performMiniaturize:" target="-1" id="VwT-WD-YPe"/>
</connections>
</menuItem>
<menuItem title="Zoom" id="R4o-n2-Eq4">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="performZoom:" target="-1" id="DIl-cC-cCs"/>
</connections>
</menuItem>
<menuItem isSeparatorItem="YES" id="eu3-7i-yIM"/>
<menuItem title="Bring All to Front" id="LE2-aR-0XJ">
<modifierMask key="keyEquivalentModifierMask"/>
<connections>
<action selector="arrangeInFront:" target="-1" id="DRN-fu-gQh"/>
</connections>
</menuItem>
</items>
</menu>
</menuItem>
<menuItem title="Help" id="EPT-qC-fAb">
<modifierMask key="keyEquivalentModifierMask"/>
<menu key="submenu" title="Help" systemMenu="help" id="rJ0-wn-3NY"/>
</menuItem>
</items>
<point key="canvasLocation" x="142" y="-258"/>
</menu>
<window title="APP_NAME" allowsToolTipsWhenApplicationIsInactive="NO" autorecalculatesKeyViewLoop="NO" releasedWhenClosed="NO" animationBehavior="default" id="QvC-M9-y7g" customClass="MainFlutterWindow" customModule="Runner" customModuleProvider="target">
<windowStyleMask key="styleMask" titled="YES" closable="YES" miniaturizable="YES" resizable="YES"/>
<rect key="contentRect" x="335" y="390" width="800" height="600"/>
<rect key="screenRect" x="0.0" y="0.0" width="2560" height="1577"/>
<view key="contentView" wantsLayer="YES" id="EiT-Mj-1SZ">
<rect key="frame" x="0.0" y="0.0" width="800" height="600"/>
<autoresizingMask key="autoresizingMask"/>
</view>
</window>
</objects>
</document>

View File

@@ -0,0 +1,14 @@
// Application-level settings for the Runner target.
//
// This may be replaced with something auto-generated from metadata (e.g., pubspec.yaml) in the
// future. If not, the values below would default to using the project name when this becomes a
// 'flutter create' template.
// The application's name. By default this is also the title of the Flutter window.
PRODUCT_NAME = flutter_sdk
// The application's bundle identifier
PRODUCT_BUNDLE_IDENTIFIER = com.example.flutterSdk
// The copyright displayed in application information
PRODUCT_COPYRIGHT = Copyright © 2025 com.example. All rights reserved.

View File

@@ -0,0 +1,2 @@
#include "../../Flutter/Flutter-Debug.xcconfig"
#include "Warnings.xcconfig"

View File

@@ -0,0 +1,2 @@
#include "../../Flutter/Flutter-Release.xcconfig"
#include "Warnings.xcconfig"

View File

@@ -0,0 +1,13 @@
WARNING_CFLAGS = -Wall -Wconditional-uninitialized -Wnullable-to-nonnull-conversion -Wmissing-method-return-type -Woverlength-strings
GCC_WARN_UNDECLARED_SELECTOR = YES
CLANG_UNDEFINED_BEHAVIOR_SANITIZER_NULLABILITY = YES
CLANG_WARN_UNGUARDED_AVAILABILITY = YES_AGGRESSIVE
CLANG_WARN__DUPLICATE_METHOD_MATCH = YES
CLANG_WARN_PRAGMA_PACK = YES
CLANG_WARN_STRICT_PROTOTYPES = YES
CLANG_WARN_COMMA = YES
GCC_WARN_STRICT_SELECTOR_MATCH = YES
CLANG_WARN_OBJC_REPEATED_USE_OF_WEAK = YES
CLANG_WARN_OBJC_IMPLICIT_RETAIN_SELF = YES
GCC_WARN_SHADOW = YES
CLANG_WARN_UNREACHABLE_CODE = YES

View File

@@ -0,0 +1,12 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>com.apple.security.app-sandbox</key>
<true/>
<key>com.apple.security.cs.allow-jit</key>
<true/>
<key>com.apple.security.network.server</key>
<true/>
</dict>
</plist>

View File

@@ -0,0 +1,32 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>CFBundleDevelopmentRegion</key>
<string>$(DEVELOPMENT_LANGUAGE)</string>
<key>CFBundleExecutable</key>
<string>$(EXECUTABLE_NAME)</string>
<key>CFBundleIconFile</key>
<string></string>
<key>CFBundleIdentifier</key>
<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
<key>CFBundleInfoDictionaryVersion</key>
<string>6.0</string>
<key>CFBundleName</key>
<string>$(PRODUCT_NAME)</string>
<key>CFBundlePackageType</key>
<string>APPL</string>
<key>CFBundleShortVersionString</key>
<string>$(FLUTTER_BUILD_NAME)</string>
<key>CFBundleVersion</key>
<string>$(FLUTTER_BUILD_NUMBER)</string>
<key>LSMinimumSystemVersion</key>
<string>$(MACOSX_DEPLOYMENT_TARGET)</string>
<key>NSHumanReadableCopyright</key>
<string>$(PRODUCT_COPYRIGHT)</string>
<key>NSMainNibFile</key>
<string>MainMenu</string>
<key>NSPrincipalClass</key>
<string>NSApplication</string>
</dict>
</plist>

View File

@@ -0,0 +1,15 @@
import Cocoa
import FlutterMacOS
class MainFlutterWindow: NSWindow {
override func awakeFromNib() {
let flutterViewController = FlutterViewController()
let windowFrame = self.frame
self.contentViewController = flutterViewController
self.setFrame(windowFrame, display: true)
RegisterGeneratedPlugins(registry: flutterViewController)
super.awakeFromNib()
}
}

View File

@@ -0,0 +1,8 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>com.apple.security.app-sandbox</key>
<true/>
</dict>
</plist>

View File

@@ -0,0 +1,12 @@
import Cocoa
import FlutterMacOS
import XCTest
class RunnerTests: XCTestCase {
func testExample() {
// If you add code to the Runner application, consider adding tests here.
// See https://developer.apple.com/documentation/xctest for more information about using XCTest.
}
}

418
flutter/pubspec.lock Normal file
View File

@@ -0,0 +1,418 @@
# Generated by pub
# See https://dart.dev/tools/pub/glossary#lockfile
packages:
args:
dependency: "direct main"
description:
name: args
sha256: d0481093c50b1da8910eb0bb301626d4d8eb7284aa739614d2b394ee09e3ea04
url: "https://pub.dev"
source: hosted
version: "2.7.0"
async:
dependency: transitive
description:
name: async
sha256: "758e6d74e971c3e5aceb4110bfd6698efc7f501675bcfe0c775459a8140750eb"
url: "https://pub.dev"
source: hosted
version: "2.13.0"
audio_session:
dependency: transitive
description:
name: audio_session
sha256: "8f96a7fecbb718cb093070f868b4cdcb8a9b1053dce342ff8ab2fde10eb9afb7"
url: "https://pub.dev"
source: hosted
version: "0.2.2"
boolean_selector:
dependency: transitive
description:
name: boolean_selector
sha256: "8aab1771e1243a5063b8b0ff68042d67334e3feab9e95b9490f9a6ebf73b42ea"
url: "https://pub.dev"
source: hosted
version: "2.1.2"
characters:
dependency: transitive
description:
name: characters
sha256: f71061c654a3380576a52b451dd5532377954cf9dbd272a78fc8479606670803
url: "https://pub.dev"
source: hosted
version: "1.4.0"
clock:
dependency: transitive
description:
name: clock
sha256: fddb70d9b5277016c77a80201021d40a2247104d9f4aa7bab7157b7e3f05b84b
url: "https://pub.dev"
source: hosted
version: "1.1.2"
collection:
dependency: transitive
description:
name: collection
sha256: "2f5709ae4d3d59dd8f7cd309b4e023046b57d8a6c82130785d2b0e5868084e76"
url: "https://pub.dev"
source: hosted
version: "1.19.1"
crypto:
dependency: transitive
description:
name: crypto
sha256: c8ea0233063ba03258fbcf2ca4d6dadfefe14f02fab57702265467a19f27fadf
url: "https://pub.dev"
source: hosted
version: "3.0.7"
fake_async:
dependency: transitive
description:
name: fake_async
sha256: "5368f224a74523e8d2e7399ea1638b37aecfca824a3cc4dfdf77bf1fa905ac44"
url: "https://pub.dev"
source: hosted
version: "1.3.3"
ffi:
dependency: transitive
description:
name: ffi
sha256: "289279317b4b16eb2bb7e271abccd4bf84ec9bdcbe999e278a94b804f5630418"
url: "https://pub.dev"
source: hosted
version: "2.1.4"
fixnum:
dependency: transitive
description:
name: fixnum
sha256: b6dc7065e46c974bc7c5f143080a6764ec7a4be6da1285ececdc37be96de53be
url: "https://pub.dev"
source: hosted
version: "1.1.1"
flutter:
dependency: "direct main"
description: flutter
source: sdk
version: "0.0.0"
flutter_lints:
dependency: "direct dev"
description:
name: flutter_lints
sha256: "5398f14efa795ffb7a33e9b6a08798b26a180edac4ad7db3f231e40f82ce11e1"
url: "https://pub.dev"
source: hosted
version: "5.0.0"
flutter_onnxruntime:
dependency: "direct main"
description:
name: flutter_onnxruntime
sha256: "55842e69293ec52c07f3065049ff7641e94e8a6cca3f659a913d5401a3994424"
url: "https://pub.dev"
source: hosted
version: "1.6.0"
flutter_test:
dependency: "direct dev"
description: flutter
source: sdk
version: "0.0.0"
flutter_web_plugins:
dependency: transitive
description: flutter
source: sdk
version: "0.0.0"
just_audio:
dependency: "direct main"
description:
name: just_audio
sha256: "9694e4734f515f2a052493d1d7e0d6de219ee0427c7c29492e246ff32a219908"
url: "https://pub.dev"
source: hosted
version: "0.10.5"
just_audio_platform_interface:
dependency: transitive
description:
name: just_audio_platform_interface
sha256: "2532c8d6702528824445921c5ff10548b518b13f808c2e34c2fd54793b999a6a"
url: "https://pub.dev"
source: hosted
version: "4.6.0"
just_audio_web:
dependency: transitive
description:
name: just_audio_web
sha256: "6ba8a2a7e87d57d32f0f7b42856ade3d6a9fbe0f1a11fabae0a4f00bb73f0663"
url: "https://pub.dev"
source: hosted
version: "0.4.16"
leak_tracker:
dependency: transitive
description:
name: leak_tracker
sha256: "33e2e26bdd85a0112ec15400c8cbffea70d0f9c3407491f672a2fad47915e2de"
url: "https://pub.dev"
source: hosted
version: "11.0.2"
leak_tracker_flutter_testing:
dependency: transitive
description:
name: leak_tracker_flutter_testing
sha256: "1dbc140bb5a23c75ea9c4811222756104fbcd1a27173f0c34ca01e16bea473c1"
url: "https://pub.dev"
source: hosted
version: "3.0.10"
leak_tracker_testing:
dependency: transitive
description:
name: leak_tracker_testing
sha256: "8d5a2d49f4a66b49744b23b018848400d23e54caf9463f4eb20df3eb8acb2eb1"
url: "https://pub.dev"
source: hosted
version: "3.0.2"
lints:
dependency: transitive
description:
name: lints
sha256: c35bb79562d980e9a453fc715854e1ed39e24e7d0297a880ef54e17f9874a9d7
url: "https://pub.dev"
source: hosted
version: "5.1.1"
logger:
dependency: "direct main"
description:
name: logger
sha256: a7967e31b703831a893bbc3c3dd11db08126fe5f369b5c648a36f821979f5be3
url: "https://pub.dev"
source: hosted
version: "2.6.2"
matcher:
dependency: transitive
description:
name: matcher
sha256: dc58c723c3c24bf8d3e2d3ad3f2f9d7bd9cf43ec6feaa64181775e60190153f2
url: "https://pub.dev"
source: hosted
version: "0.12.17"
material_color_utilities:
dependency: transitive
description:
name: material_color_utilities
sha256: f7142bb1154231d7ea5f96bc7bde4bda2a0945d2806bb11670e30b850d56bdec
url: "https://pub.dev"
source: hosted
version: "0.11.1"
meta:
dependency: transitive
description:
name: meta
sha256: "23f08335362185a5ea2ad3a4e597f1375e78bce8a040df5c600c8d3552ef2394"
url: "https://pub.dev"
source: hosted
version: "1.17.0"
objective_c:
dependency: transitive
description:
name: objective_c
sha256: "1f81ed9e41909d44162d7ec8663b2c647c202317cc0b56d3d56f6a13146a0b64"
url: "https://pub.dev"
source: hosted
version: "9.1.0"
path:
dependency: transitive
description:
name: path
sha256: "75cca69d1490965be98c73ceaea117e8a04dd21217b37b292c9ddbec0d955bc5"
url: "https://pub.dev"
source: hosted
version: "1.9.1"
path_provider:
dependency: "direct main"
description:
name: path_provider
sha256: "50c5dd5b6e1aaf6fb3a78b33f6aa3afca52bf903a8a5298f53101fdaee55bbcd"
url: "https://pub.dev"
source: hosted
version: "2.1.5"
path_provider_android:
dependency: transitive
description:
name: path_provider_android
sha256: f2c65e21139ce2c3dad46922be8272bb5963516045659e71bb16e151c93b580e
url: "https://pub.dev"
source: hosted
version: "2.2.22"
path_provider_foundation:
dependency: transitive
description:
name: path_provider_foundation
sha256: "6192e477f34018ef1ea790c56fffc7302e3bc3efede9e798b934c252c8c105ba"
url: "https://pub.dev"
source: hosted
version: "2.5.0"
path_provider_linux:
dependency: transitive
description:
name: path_provider_linux
sha256: f7a1fe3a634fe7734c8d3f2766ad746ae2a2884abe22e241a8b301bf5cac3279
url: "https://pub.dev"
source: hosted
version: "2.2.1"
path_provider_platform_interface:
dependency: transitive
description:
name: path_provider_platform_interface
sha256: "88f5779f72ba699763fa3a3b06aa4bf6de76c8e5de842cf6f29e2e06476c2334"
url: "https://pub.dev"
source: hosted
version: "2.1.2"
path_provider_windows:
dependency: transitive
description:
name: path_provider_windows
sha256: bd6f00dbd873bfb70d0761682da2b3a2c2fccc2b9e84c495821639601d81afe7
url: "https://pub.dev"
source: hosted
version: "2.3.0"
platform:
dependency: transitive
description:
name: platform
sha256: "5d6b1b0036a5f331ebc77c850ebc8506cbc1e9416c27e59b439f917a902a4984"
url: "https://pub.dev"
source: hosted
version: "3.1.6"
plugin_platform_interface:
dependency: transitive
description:
name: plugin_platform_interface
sha256: "4820fbfdb9478b1ebae27888254d445073732dae3d6ea81f0b7e06d5dedc3f02"
url: "https://pub.dev"
source: hosted
version: "2.1.8"
pub_semver:
dependency: transitive
description:
name: pub_semver
sha256: "5bfcf68ca79ef689f8990d1160781b4bad40a3bd5e5218ad4076ddb7f4081585"
url: "https://pub.dev"
source: hosted
version: "2.2.0"
rxdart:
dependency: transitive
description:
name: rxdart
sha256: "5c3004a4a8dbb94bd4bf5412a4def4acdaa12e12f269737a5751369e12d1a962"
url: "https://pub.dev"
source: hosted
version: "0.28.0"
sky_engine:
dependency: transitive
description: flutter
source: sdk
version: "0.0.0"
source_span:
dependency: transitive
description:
name: source_span
sha256: "254ee5351d6cb365c859e20ee823c3bb479bf4a293c22d17a9f1bf144ce86f7c"
url: "https://pub.dev"
source: hosted
version: "1.10.1"
stack_trace:
dependency: transitive
description:
name: stack_trace
sha256: "8b27215b45d22309b5cddda1aa2b19bdfec9df0e765f2de506401c071d38d1b1"
url: "https://pub.dev"
source: hosted
version: "1.12.1"
stream_channel:
dependency: transitive
description:
name: stream_channel
sha256: "969e04c80b8bcdf826f8f16579c7b14d780458bd97f56d107d3950fdbeef059d"
url: "https://pub.dev"
source: hosted
version: "2.1.4"
string_scanner:
dependency: transitive
description:
name: string_scanner
sha256: "921cd31725b72fe181906c6a94d987c78e3b98c2e205b397ea399d4054872b43"
url: "https://pub.dev"
source: hosted
version: "1.4.1"
synchronized:
dependency: transitive
description:
name: synchronized
sha256: c254ade258ec8282947a0acbbc90b9575b4f19673533ee46f2f6e9b3aeefd7c0
url: "https://pub.dev"
source: hosted
version: "3.4.0"
term_glyph:
dependency: transitive
description:
name: term_glyph
sha256: "7f554798625ea768a7518313e58f83891c7f5024f88e46e7182a4558850a4b8e"
url: "https://pub.dev"
source: hosted
version: "1.2.2"
test_api:
dependency: transitive
description:
name: test_api
sha256: ab2726c1a94d3176a45960b6234466ec367179b87dd74f1611adb1f3b5fb9d55
url: "https://pub.dev"
source: hosted
version: "0.7.7"
typed_data:
dependency: transitive
description:
name: typed_data
sha256: f9049c039ebfeb4cf7a7104a675823cd72dba8297f264b6637062516699fa006
url: "https://pub.dev"
source: hosted
version: "1.4.0"
uuid:
dependency: transitive
description:
name: uuid
sha256: a11b666489b1954e01d992f3d601b1804a33937b5a8fe677bd26b8a9f96f96e8
url: "https://pub.dev"
source: hosted
version: "4.5.2"
vector_math:
dependency: transitive
description:
name: vector_math
sha256: d530bd74fea330e6e364cda7a85019c434070188383e1cd8d9777ee586914c5b
url: "https://pub.dev"
source: hosted
version: "2.2.0"
vm_service:
dependency: transitive
description:
name: vm_service
sha256: "45caa6c5917fa127b5dbcfbd1fa60b14e583afdc08bfc96dda38886ca252eb60"
url: "https://pub.dev"
source: hosted
version: "15.0.2"
web:
dependency: transitive
description:
name: web
sha256: "868d88a33d8a87b18ffc05f9f030ba328ffefba92d6c127917a2ba740f9cfe4a"
url: "https://pub.dev"
source: hosted
version: "1.1.1"
xdg_directories:
dependency: transitive
description:
name: xdg_directories
sha256: "7a3f37b05d989967cdddcbb571f1ea834867ae2faa29725fd085180e0883aa15"
url: "https://pub.dev"
source: hosted
version: "1.1.0"
sdks:
dart: ">=3.9.0 <4.0.0"
flutter: ">=3.35.0"

26
flutter/pubspec.yaml Normal file
View File

@@ -0,0 +1,26 @@
name: flutter_sdk
description: Supertonic Flutter SDK TTS Example
version: 1.0.0
environment:
sdk: ^3.5.0
dependencies:
flutter:
sdk: flutter
flutter_onnxruntime: ^1.6.0
args: ^2.4.0
path_provider: ^2.1.1
just_audio: ^0.10.5
logger: ^2.0.2
dev_dependencies:
flutter_test:
sdk: flutter
flutter_lints: ^5.0.0
flutter:
assets:
- assets/onnx/
- assets/voice_styles/
uses-material-design: true

17
go/.gitignore vendored Normal file
View File

@@ -0,0 +1,17 @@
# Binaries
tts_example
example_onnx
*.exe
# Go build artifacts
*.o
*.a
*.so
# Results
results/
# Go workspace
go.work
go.work.sum

165
go/README.md Normal file
View File

@@ -0,0 +1,165 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx.go`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
This project uses Go modules for dependency management.
### Prerequisites
1. Install Go 1.21 or later from [https://golang.org/dl/](https://golang.org/dl/)
2. Install ONNX Runtime C library:
**macOS (via Homebrew):**
```bash
brew install onnxruntime
```
**Linux:**
```bash
# Download ONNX Runtime from GitHub releases
wget https://github.com/microsoft/onnxruntime/releases/download/v1.16.0/onnxruntime-linux-x64-1.16.0.tgz
tar -xzf onnxruntime-linux-x64-1.16.0.tgz
sudo cp onnxruntime-linux-x64-1.16.0/lib/* /usr/local/lib/
sudo cp -r onnxruntime-linux-x64-1.16.0/include/* /usr/local/include/
sudo ldconfig
```
### Install Go dependencies
```bash
go mod download
```
### Configure ONNX Runtime Library Path (Optional)
If the ONNX Runtime library is not in a standard location, set the environment variable:
**Automatic Detection (Recommended):**
```bash
# macOS
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
# Linux
export ONNXRUNTIME_LIB_PATH=$(find /usr/local/lib /usr/lib -name "libonnxruntime.so*" 2>/dev/null | head -n 1)
```
**Manual Configuration:**
```bash
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.so # Linux
# or
export ONNXRUNTIME_LIB_PATH=/path/to/libonnxruntime.dylib # macOS
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
go run example_onnx.go helper.go
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
go run example_onnx.go helper.go \
--batch \
-voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
-text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
-lang "en,ko"
```
This will:
- Generate speech for 2 different voice-text-language pairs
- Use male voice (M1.json) for the first text in English
- Use female voice (F1.json) for the second text in Korean
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
go run example_onnx.go helper.go \
-total-step 10 \
-voice-style "assets/voice_styles/M1.json" \
-text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
```bash
go run example_onnx.go helper.go \
-voice-style "assets/voice_styles/M1.json" \
-text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
```
This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file
**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
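The chunk-and-concatenate behavior described above can be sketched in a few lines of Go. The helper names below (`splitIntoChunks`, `concatWithSilence`) are illustrative, not the repo's actual implementation in `helper.go`: sentences are greedily packed into chunks under a character budget, and per-chunk audio is joined with a fixed-length block of silence.

```go
package main

import (
	"fmt"
	"strings"
)

// splitIntoChunks cuts text at sentence-final punctuation (., !, ?) and
// greedily packs whole sentences into chunks of at most maxChars characters.
func splitIntoChunks(text string, maxChars int) []string {
	runes := []rune(text)
	var sentences []string
	start := 0
	for i, r := range runes {
		if (r == '.' || r == '!' || r == '?') && (i+1 == len(runes) || runes[i+1] == ' ') {
			sentences = append(sentences, strings.TrimSpace(string(runes[start:i+1])))
			start = i + 1
		}
	}
	if s := strings.TrimSpace(string(runes[start:])); s != "" {
		sentences = append(sentences, s)
	}
	var chunks []string
	cur := ""
	for _, s := range sentences {
		switch {
		case cur == "":
			cur = s
		case len(cur)+1+len(s) <= maxChars:
			cur = cur + " " + s
		default:
			chunks = append(chunks, cur)
			cur = s
		}
	}
	if cur != "" {
		chunks = append(chunks, cur)
	}
	return chunks
}

// concatWithSilence joins per-chunk audio, inserting pauseSec seconds of
// silence (zero samples) between consecutive chunks.
func concatWithSilence(chunks [][]float32, sampleRate int, pauseSec float32) []float32 {
	silence := make([]float32, int(float32(sampleRate)*pauseSec))
	var out []float32
	for i, c := range chunks {
		if i > 0 {
			out = append(out, silence...)
		}
		out = append(out, c...)
	}
	return out
}

func main() {
	text := "This is a long text. It will be split into chunks. Each chunk is synthesized separately."
	for i, c := range splitIntoChunks(text, 60) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
	// Joining two 2-sample chunks at 10 Hz with a 0.5 s pause yields 2+5+2 samples.
	fmt.Println(len(concatWithSilence([][]float32{{1, 1}, {2, 2}}, 10, 0.5)))
}
```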
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `-use-gpu` | flag | false | Use GPU for inference (default: CPU) |
| `-onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `-total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `-n-test` | int | 4 | Number of times to generate each sample |
| `-voice-style` | str | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `-text` | str | (long default text) | Text(s) to synthesize, pipe-separated |
| `-lang` | str | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `-save-dir` | str | `results` | Output directory |
| `--batch` | flag | false | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
## Notes
- **Multilingual Support**: Use `-lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `-voice-style`, `-text`, and `-lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs Speed**: Higher `-total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Building a Binary
To build a standalone executable:
```bash
go build -o tts_example example_onnx.go helper.go
```
Then run it:
```bash
./tts_example -voice-style "../assets/voice_styles/M1.json" -text "Hello world"
```

193
go/example_onnx.go Normal file
View File

@@ -0,0 +1,193 @@
package main
import (
"flag"
"fmt"
"os"
"path/filepath"
"strings"
ort "github.com/yalue/onnxruntime_go"
)
// Args holds command line arguments
type Args struct {
useGPU bool
onnxDir string
totalStep int
speed float64
nTest int
voiceStyle []string
text []string
lang []string
saveDir string
batch bool
}
func parseArgs() *Args {
args := &Args{}
flag.BoolVar(&args.useGPU, "use-gpu", false, "Use GPU for inference (default: CPU)")
flag.StringVar(&args.onnxDir, "onnx-dir", "assets/onnx", "Path to ONNX model directory")
flag.IntVar(&args.totalStep, "total-step", 5, "Number of denoising steps")
flag.Float64Var(&args.speed, "speed", 1.05, "Speech speed factor (higher = faster)")
flag.IntVar(&args.nTest, "n-test", 4, "Number of times to generate")
flag.StringVar(&args.saveDir, "save-dir", "results", "Output directory")
flag.BoolVar(&args.batch, "batch", false, "Enable batch mode (multiple text-style pairs)")
var voiceStyleStr, textStr, langStr string
flag.StringVar(&voiceStyleStr, "voice-style", "assets/voice_styles/M1.json", "Voice style file path(s), comma-separated")
flag.StringVar(&textStr, "text", "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.", "Text(s) to synthesize, pipe-separated")
flag.StringVar(&langStr, "lang", "en", "Language(s) for synthesis, comma-separated (en, ko, es, pt, fr)")
flag.Parse()
// Parse comma-separated voice-style
if voiceStyleStr != "" {
args.voiceStyle = strings.Split(voiceStyleStr, ",")
for i := range args.voiceStyle {
args.voiceStyle[i] = strings.TrimSpace(args.voiceStyle[i])
}
}
// Parse pipe-separated text
if textStr != "" {
args.text = strings.Split(textStr, "|")
for i := range args.text {
args.text[i] = strings.TrimSpace(args.text[i])
}
}
// Parse comma-separated lang
if langStr != "" {
args.lang = strings.Split(langStr, ",")
for i := range args.lang {
args.lang[i] = strings.TrimSpace(args.lang[i])
}
}
return args
}
func main() {
fmt.Print("=== TTS Inference with ONNX Runtime (Go) ===\n\n")
// --- 1. Parse arguments --- //
args := parseArgs()
totalStep := args.totalStep
speed := float32(args.speed)
nTest := args.nTest
saveDir := args.saveDir
voiceStylePaths := args.voiceStyle
textList := args.text
langList := args.lang
batch := args.batch
if batch {
if len(voiceStylePaths) != len(textList) {
fmt.Printf("Error: Number of voice styles (%d) must match number of texts (%d)\n",
len(voiceStylePaths), len(textList))
os.Exit(1)
}
if len(langList) != len(textList) {
fmt.Printf("Error: Number of languages (%d) must match number of texts (%d)\n",
len(langList), len(textList))
os.Exit(1)
}
}
bsz := len(voiceStylePaths)
// Initialize ONNX Runtime
if err := InitializeONNXRuntime(); err != nil {
fmt.Printf("Error initializing ONNX Runtime: %v\n", err)
os.Exit(1)
}
defer ort.DestroyEnvironment()
// --- 2. Load config --- //
cfg, err := LoadCfgs(args.onnxDir)
if err != nil {
fmt.Printf("Error loading config: %v\n", err)
os.Exit(1)
}
// --- 3. Load TTS components --- //
textToSpeech, err := LoadTextToSpeech(args.onnxDir, args.useGPU, cfg)
if err != nil {
fmt.Printf("Error loading TTS components: %v\n", err)
os.Exit(1)
}
defer textToSpeech.Destroy()
// --- 4. Load voice styles --- //
style, err := LoadVoiceStyle(voiceStylePaths, true)
if err != nil {
fmt.Printf("Error loading voice styles: %v\n", err)
os.Exit(1)
}
defer style.Destroy()
// --- 5. Synthesize speech --- //
if err := os.MkdirAll(saveDir, 0755); err != nil {
fmt.Printf("Error creating save directory: %v\n", err)
os.Exit(1)
}
for n := 0; n < nTest; n++ {
fmt.Printf("\n[%d/%d] Starting synthesis...\n", n+1, nTest)
var wav []float32
var duration []float32
if batch {
Timer("Generating speech from text", func() interface{} {
w, d, err := textToSpeech.Batch(textList, langList, style, totalStep, speed)
if err != nil {
fmt.Printf("Error generating speech: %v\n", err)
os.Exit(1)
}
wav = w
duration = d
return nil
})
} else {
Timer("Generating speech from text", func() interface{} {
w, d, err := textToSpeech.Call(textList[0], langList[0], style, totalStep, speed, 0.3)
if err != nil {
fmt.Printf("Error generating speech: %v\n", err)
os.Exit(1)
}
wav = w
duration = []float32{d}
return nil
})
}
// Save outputs
for i := 0; i < bsz; i++ {
fname := fmt.Sprintf("%s_%d.wav", sanitizeFilename(textList[i], 20), n+1)
var wavOut []float64
if batch {
wavOut = extractWavSegment(wav, duration[i], textToSpeech.SampleRate, i, bsz)
} else {
// For non-batch mode, wav is a single concatenated audio
wavLen := int(float32(textToSpeech.SampleRate) * duration[0])
wavOut = make([]float64, wavLen)
for j := 0; j < wavLen && j < len(wav); j++ {
wavOut[j] = float64(wav[j])
}
}
outputPath := filepath.Join(saveDir, fname)
if err := writeWavFile(outputPath, wavOut, textToSpeech.SampleRate); err != nil {
fmt.Printf("Error writing wav file: %v\n", err)
continue
}
fmt.Printf("Saved: %s\n", outputPath)
}
}
fmt.Println("\n=== Synthesis completed successfully! ===")
}

13
go/go.mod Normal file
View File

@@ -0,0 +1,13 @@
module supertonic-tts
go 1.21
require (
github.com/go-audio/audio v1.0.0
github.com/go-audio/wav v1.1.0
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12
github.com/yalue/onnxruntime_go v1.11.0
golang.org/x/text v0.14.0
)
require github.com/go-audio/riff v1.0.0 // indirect

12
go/go.sum Normal file
View File

@@ -0,0 +1,12 @@
github.com/go-audio/audio v1.0.0 h1:zS9vebldgbQqktK4H0lUqWrG8P0NxCJVqcj7ZpNnwd4=
github.com/go-audio/audio v1.0.0/go.mod h1:6uAu0+H2lHkwdGsAY+j2wHPNPpPoeg5AaEFh9FlA+Zs=
github.com/go-audio/riff v1.0.0 h1:d8iCGbDvox9BfLagY94fBynxSPHO80LmZCaOsmKxokA=
github.com/go-audio/riff v1.0.0/go.mod h1:l3cQwc85y79NQFCRB7TiPoNiaijp6q8Z0Uv38rVG498=
github.com/go-audio/wav v1.1.0 h1:jQgLtbqBzY7G+BM8fXF7AHUk1uHUviWS4X39d5rsL2g=
github.com/go-audio/wav v1.1.0/go.mod h1:mpe9qfwbScEbkd8uybLuIpTgHyrISw/OTuvjUW2iGtE=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12 h1:dd7vnTDfjtwCETZDrRe+GPYNLA1jBtbZeyfyE8eZCyk=
github.com/mjibson/go-dsp v0.0.0-20180508042940-11479a337f12/go.mod h1:i/KKcxEWEO8Yyl11DYafRPKOPVYTrhxiTRigjtEEXZU=
github.com/yalue/onnxruntime_go v1.11.0 h1:aKH4yPIbqfcB3SfnQWq/WxzLelkyolntHnffL3eMBHY=
github.com/yalue/onnxruntime_go v1.11.0/go.mod h1:b4X26A8pekNb1ACJ58wAXgNKeUCGEAQ9dmACut9Sm/4=
golang.org/x/text v0.14.0 h1:ScX5w1eTa3QqT8oi6+ziP7dTV1S2+ALU0bI+0zXKWiQ=
golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=

1066
go/helper.go Normal file

File diff suppressed because it is too large

Binary file not shown.


BIN
img/voicebuilder_img.png Normal file

Binary file not shown.


View File

@@ -0,0 +1,10 @@
import SwiftUI
@main
struct ExampleiOSApp: App {
var body: some Scene {
WindowGroup {
ContentView()
}
}
}

View File

@@ -0,0 +1,30 @@
import Foundation
import AVFoundation
final class AudioPlayer: NSObject, AVAudioPlayerDelegate {
private var player: AVAudioPlayer?
private var onFinish: (() -> Void)?
func play(url: URL, onFinish: (() -> Void)? = nil) {
self.onFinish = onFinish
do {
let data = try Data(contentsOf: url)
let player = try AVAudioPlayer(data: data)
player.delegate = self
player.prepareToPlay()
player.play()
self.player = player
} catch {
print("Audio play error: \(error)")
}
}
func stop() {
player?.stop()
player = nil
}
func audioPlayerDidFinishPlaying(_ player: AVAudioPlayer, successfully flag: Bool) {
onFinish?()
}
}

View File

@@ -0,0 +1,99 @@
import SwiftUI
struct ContentView: View {
@StateObject private var vm = TTSViewModel()
var body: some View {
ZStack {
LinearGradient(gradient: Gradient(colors: [Color(.systemBackground), Color(.secondarySystemBackground)]), startPoint: .topLeading, endPoint: .bottomTrailing)
.ignoresSafeArea()
VStack(spacing: 20) {
Spacer()
VStack(spacing: 12) {
Text("Supertonic 2 iOS Demo")
.font(.title2.weight(.semibold))
.foregroundColor(.primary)
TextEditor(text: $vm.text)
.frame(minHeight: 120, maxHeight: 180)
.padding(8)
.background(Color(.secondarySystemBackground))
.cornerRadius(12)
.overlay(
RoundedRectangle(cornerRadius: 12)
.stroke(Color.secondary.opacity(0.3), lineWidth: 1)
)
.padding(.horizontal)
HStack(spacing: 12) {
Text("NFE")
.font(.subheadline)
.foregroundColor(.secondary)
Slider(value: $vm.nfe, in: 2...15, step: 1)
Text("\(Int(vm.nfe))")
.font(.subheadline.monospacedDigit())
.frame(width: 36)
}
.padding(.horizontal)
Picker("Voice", selection: $vm.voice) {
Text("M").tag(TTSService.Voice.male)
Text("F").tag(TTSService.Voice.female)
}
.pickerStyle(SegmentedPickerStyle())
.padding(.horizontal)
HStack(spacing: 12) {
Text("Language")
.font(.subheadline)
.foregroundColor(.secondary)
Picker("Language", selection: $vm.language) {
ForEach(TTSService.Language.allCases, id: \.self) { lang in
Text(lang.displayName).tag(lang)
}
}
.pickerStyle(MenuPickerStyle())
}
.padding(.horizontal)
}
HStack(spacing: 16) {
Button(action: { vm.generate() }) {
Label(vm.isGenerating ? "Generating..." : "Generate", systemImage: vm.isGenerating ? "hourglass" : "wand.and.stars")
.labelStyle(.titleAndIcon)
}
.buttonStyle(.borderedProminent)
.tint(.accentColor)
.disabled(vm.isGenerating)
Button(action: { vm.togglePlay() }) {
Label(vm.isPlaying ? "Stop" : "Play", systemImage: vm.isPlaying ? "stop.fill" : "play.fill")
}
.buttonStyle(.bordered)
.disabled(vm.audioURL == nil)
}
if let rtf = vm.rtfText {
Text(rtf)
.font(.footnote.monospacedDigit())
.foregroundColor(.secondary)
.padding(.top, 2)
}
if let error = vm.errorMessage {
Text(error)
.foregroundColor(.red)
.font(.footnote)
.multilineTextAlignment(.center)
.padding(.horizontal)
}
Spacer()
}
}
.onAppear { vm.startup() }
}
}


@@ -0,0 +1,29 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>CFBundleDevelopmentRegion</key>
<string>en</string>
<key>CFBundleExecutable</key>
<string>$(EXECUTABLE_NAME)</string>
<key>CFBundleIdentifier</key>
<string>$(PRODUCT_BUNDLE_IDENTIFIER)</string>
<key>CFBundleInfoDictionaryVersion</key>
<string>6.0</string>
<key>CFBundleName</key>
<string>ExampleiOSApp</string>
<key>CFBundlePackageType</key>
<string>APPL</string>
<key>CFBundleShortVersionString</key>
<string>1.0</string>
<key>CFBundleVersion</key>
<string>1</string>
<key>UILaunchScreen</key>
<dict/>
<key>UIApplicationSceneManifest</key>
<dict>
<key>UIApplicationSupportsMultipleScenes</key>
<false/>
</dict>
</dict>
</plist>


@@ -0,0 +1,114 @@
import Foundation
import OnnxRuntimeBindings
final class TTSService {
enum Voice { case male, female }
enum Language: String, CaseIterable {
case en = "en"
case ko = "ko"
case es = "es"
case pt = "pt"
case fr = "fr"
var displayName: String {
switch self {
case .en: return "English"
case .ko: return "한국어"
case .es: return "Español"
case .pt: return "Português"
case .fr: return "Français"
}
}
}
private let env: ORTEnv
private let textToSpeech: TextToSpeech
private let bundleOnnxDir: String
private let sampleRate: Int
init() throws {
bundleOnnxDir = try Self.locateOnnxDirInBundle()
env = try ORTEnv(loggingLevel: .warning)
textToSpeech = try loadTextToSpeech(bundleOnnxDir, false, env)
sampleRate = textToSpeech.sampleRate
}
func synthesize(text: String, nfe: Int, voice: Voice, language: Language) async throws -> URL {
// 1) Load style for the selected voice
let styleURL = try Self.locateVoiceStyleURL(voice: voice)
let style = try loadVoiceStyle([styleURL.path], verbose: false)
// 2) Synthesize via packed TextToSpeech component
let (wav, duration) = try textToSpeech.call(text, language.rawValue, style, nfe)
let audioSeconds = Double(duration)
let wavLenSample = min(Int(Double(sampleRate) * audioSeconds), wav.count)
let wavOut = Array(wav[0..<wavLenSample])
let tmpURL = FileManager.default.temporaryDirectory.appendingPathComponent("supertonic_tts_\(UUID().uuidString).wav")
try writeWavFile(tmpURL.path, wavOut, sampleRate)
return tmpURL
}
// MARK: - Resource location helpers
private static func locateOnnxDirInBundle() throws -> String {
let bundle = Bundle.main
let fm = FileManager.default
func dirHasRequiredFiles(_ dir: URL) -> Bool {
let required = [
"tts.json",
"duration_predictor.onnx",
"text_encoder.onnx",
"vector_estimator.onnx",
"vocoder.onnx"
]
return required.allSatisfy { fm.fileExists(atPath: dir.appendingPathComponent($0).path) }
}
var candidates: [URL] = []
if let dir = bundle.resourceURL?.appendingPathComponent("onnx", isDirectory: true) { candidates.append(dir) }
if let dir = bundle.resourceURL?.appendingPathComponent("assets/onnx", isDirectory: true) { candidates.append(dir) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "onnx") { candidates.append(url.deletingLastPathComponent()) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: "assets/onnx") { candidates.append(url.deletingLastPathComponent()) }
if let url = bundle.url(forResource: "tts", withExtension: "json", subdirectory: nil) { candidates.append(url.deletingLastPathComponent()) }
if let root = bundle.resourceURL { candidates.append(root) }
for dir in candidates {
if dirHasRequiredFiles(dir) { return dir.path }
}
throw NSError(
domain: "TTS",
code: -100,
userInfo: [NSLocalizedDescriptionKey: "Could not find the onnx directory in the bundle. Please make sure the onnx folder (as a folder reference) is included in Copy Bundle Resources in Xcode."]
)
}
private static func locateVoiceStyleURL(voice: Voice) throws -> URL {
// Prefer M1/F1 defaults; search common subdirectories
let fileName = (voice == .male) ? "M1" : "F1"
let bundle = Bundle.main
let candidates: [URL?] = [
bundle.url(forResource: fileName, withExtension: "json", subdirectory: "voice_styles"),
bundle.url(forResource: fileName, withExtension: "json", subdirectory: "assets/voice_styles"),
bundle.url(forResource: fileName, withExtension: "json", subdirectory: nil)
]
for url in candidates {
if let url = url { return url }
}
// Fallback: scan folders if needed
if let folder1 = bundle.resourceURL?.appendingPathComponent("voice_styles", isDirectory: true) {
let file = folder1.appendingPathComponent("\(fileName).json")
if FileManager.default.fileExists(atPath: file.path) { return file }
}
if let folder2 = bundle.resourceURL?.appendingPathComponent("assets/voice_styles", isDirectory: true) {
let file = folder2.appendingPathComponent("\(fileName).json")
if FileManager.default.fileExists(atPath: file.path) { return file }
}
throw NSError(
domain: "TTS",
code: -102,
userInfo: [NSLocalizedDescriptionKey: "Could not find the voice style JSON (\(fileName).json) in the bundle. Ensure voice_styles folder is included in Copy Bundle Resources."]
)
}
}


@@ -0,0 +1,82 @@
import Foundation
import AVFoundation
@MainActor
final class TTSViewModel: ObservableObject {
@Published var text: String = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
@Published var nfe: Double = 5
@Published var voice: TTSService.Voice = .male
@Published var language: TTSService.Language = .en
@Published var isGenerating: Bool = false
@Published var isPlaying: Bool = false
@Published var errorMessage: String?
@Published var audioURL: URL?
@Published var elapsedSeconds: Double?
@Published var audioSeconds: Double?
private var service: TTSService?
private var player = AudioPlayer()
var rtfText: String? {
guard let e = elapsedSeconds, let a = audioSeconds, a > 0 else { return nil }
return String(format: "RTF %.2fx · %.2fs / %.2fs", e / a, e, a)
}
func startup() {
do {
service = try TTSService()
} catch {
errorMessage = "Failed to init TTS: \(error.localizedDescription)"
}
}
func generate() {
guard let service = service else { return }
isGenerating = true
errorMessage = nil
audioURL = nil
elapsedSeconds = nil
audioSeconds = nil
Task {
let tic = Date()
do {
let url = try await service.synthesize(text: text, nfe: Int(nfe), voice: voice, language: language)
let elapsed = Date().timeIntervalSince(tic)
let audio = audioDuration(at: url)
await MainActor.run {
self.audioURL = url
self.elapsedSeconds = elapsed
self.audioSeconds = audio
self.isGenerating = false
self.play(url: url)
}
} catch {
await MainActor.run {
self.errorMessage = error.localizedDescription
self.isGenerating = false
}
}
}
}
func togglePlay() {
if isPlaying {
player.stop()
isPlaying = false
} else if let url = audioURL {
play(url: url)
}
}
private func play(url: URL) {
player.play(url: url) { [weak self] in
DispatchQueue.main.async { self?.isPlaying = false }
}
isPlaying = true
}
private func audioDuration(at url: URL) -> Double? {
guard let file = try? AVAudioFile(forReading: url) else { return nil }
return Double(file.length) / file.fileFormat.sampleRate
}
}


@@ -0,0 +1,29 @@
name: ExampleiOSApp
options:
minimumXcodeGenVersion: 2.37.0
packages:
onnxruntime:
url: https://github.com/microsoft/onnxruntime-swift-package-manager.git
from: 1.16.0
targets:
ExampleiOSApp:
type: application
platform: iOS
deploymentTarget: "15.0"
sources:
- path: .
- path: ../../swift/Sources/Helper.swift
type: file
resources:
- path: onnx
type: folder
- path: voice_styles
type: folder
settings:
base:
PRODUCT_BUNDLE_IDENTIFIER: com.supertonic.ExampleiOSApp
SWIFT_VERSION: 5.9
INFOPLIST_FILE: Info.plist
dependencies:
- package: onnxruntime
product: onnxruntime

ios/README.md Normal file

@@ -0,0 +1,78 @@
# Supertonic iOS Example App
A minimal iOS demo that runs Supertonic 2 (ONNX Runtime) on-device. The app provides:
- Multiline text input
- NFE (denoising steps) slider
- Voice toggle (M/F)
- Language selector (en, ko, es, pt, fr)
- Generate & Play buttons
- RTF display (Elapsed / Audio seconds)
All ONNX models/configs are reused from `Supertonic/assets/onnx`, and voice style JSON files from `Supertonic/assets/voice_styles`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
## Prerequisites
- macOS 13+, Xcode 15+
- Swift 5.9+
- iOS 15+ device (recommended)
- Homebrew, XcodeGen
Install tools (if needed):
```bash
brew install xcodegen
```
## Quick Start (zero-click in Xcode)
0) Prepare assets next to the iOS target (one-time)
```bash
cd ios/ExampleiOSApp
mkdir -p onnx voice_styles
rsync -a ../../assets/onnx/ onnx/
rsync -a ../../assets/voice_styles/ voice_styles/
```
1) Generate the Xcode project with XcodeGen
```bash
xcodegen generate
open ExampleiOSApp.xcodeproj
```
2) Open in Xcode and select your iPhone as the run destination
- Targets → ExampleiOSApp → Signing & Capabilities: Select your Team
- iOS Deployment Target: 15.0+
3) Build & Run on device
- Type text → adjust NFE/Voice → Tap Generate → Audio plays automatically
- An RTF line is displayed, e.g.: `RTF 0.30x · 3.04s / 10.11s`
## What's included (generated project)
- SwiftUI app files: `App.swift`, `ContentView.swift`, `TTSViewModel.swift`, `AudioPlayer.swift`
- Runtime wrapper: `TTSService.swift` (includes TTS inference logic)
- Resources (local, vendored in `ios/ExampleiOSApp/onnx` and `ios/ExampleiOSApp/voice_styles` after step 0)
These references are defined in `project.yml` and added to the app bundle by XcodeGen.
## App Controls
- **Text**: Multiline `TextEditor`
- **NFE**: Denoising steps (default 5)
- **Voice**: M/F voice style selector
- **Language**: Language selector (English, 한국어, Español, Português, Français)
- **Generate**: Runs end-to-end synthesis
- **Play/Stop**: Controls playback of the last output
- **RTF**: Shows Elapsed / Audio seconds for quick performance intuition
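The RTF shown in the app is simply elapsed wall-clock time divided by seconds of audio produced (see `rtfText` in `TTSViewModel.swift`). A minimal standalone sketch in Java, mirroring the same format string (class and method names here are illustrative):

```java
import java.util.Locale;

// RTF ("real-time factor") = elapsed synthesis time / generated audio duration.
// Values below 1.0x mean synthesis is faster than real time.
public class RtfDemo {
    static String rtfText(double elapsedSeconds, double audioSeconds) {
        if (audioSeconds <= 0) return null;
        // Locale.US ensures "." as the decimal separator regardless of system locale
        return String.format(Locale.US, "RTF %.2fx \u00B7 %.2fs / %.2fs",
                elapsedSeconds / audioSeconds, elapsedSeconds, audioSeconds);
    }

    public static void main(String[] args) {
        System.out.println(rtfText(3.04, 10.11)); // matches the example line above
    }
}
```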
## Multilingual Support
Supertonic 2 supports multiple languages. Select the appropriate language for your input text:
- **English (en)**: Default language
- **한국어 (ko)**: Korean
- **Español (es)**: Spanish
- **Português (pt)**: Portuguese
- **Français (fr)**: French
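Under the hood, the text preprocessor validates the language code and wraps the input text with language tags before inference (see `preprocessText` in `java/Helper.java`). A minimal sketch of that step, with illustrative names:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the language-tag wrapping applied before inference:
// "<lang>" + text + "</lang>", after validating the language code.
public class LangTagDemo {
    static final List<String> AVAILABLE = Arrays.asList("en", "ko", "es", "pt", "fr");

    static String wrap(String text, String lang) {
        if (!AVAILABLE.contains(lang)) {
            throw new IllegalArgumentException(
                "Invalid language: " + lang + ". Available: " + AVAILABLE);
        }
        return "<" + lang + ">" + text + "</" + lang + ">";
    }

    public static void main(String[] args) {
        System.out.println(wrap("Bonjour tout le monde.", "fr"));
    }
}
```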

java/.gitignore vendored Normal file

@@ -0,0 +1,35 @@
# Maven
target/
pom.xml.tag
pom.xml.releaseBackup
pom.xml.versionsBackup
pom.xml.next
release.properties
dependency-reduced-pom.xml
buildNumber.properties
.mvn/timing.properties
.mvn/wrapper/maven-wrapper.jar
# Compiled class files
*.class
# IntelliJ IDEA
.idea/
*.iml
*.iws
*.ipr
# Eclipse
.classpath
.project
.settings/
# VS Code
.vscode/
# Results
results/*.wav
# Mac
.DS_Store

java/ExampleONNX.java Normal file

@@ -0,0 +1,183 @@
import ai.onnxruntime.*;
import java.io.File;
import java.util.*;
/**
* TTS Inference Example with ONNX Runtime (Java)
*/
public class ExampleONNX {
/**
* Command line arguments
*/
static class Args {
boolean useGpu = false;
String onnxDir = "assets/onnx";
int totalStep = 5;
float speed = 1.05f;
int nTest = 4;
List<String> voiceStyle = Arrays.asList("assets/voice_styles/M1.json");
List<String> text = Arrays.asList(
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
);
List<String> lang = Arrays.asList("en");
String saveDir = "results";
boolean batch = false;
}
/**
* Parse command line arguments
*/
private static Args parseArgs(String[] args) {
Args result = new Args();
for (int i = 0; i < args.length; i++) {
switch (args[i]) {
case "--use-gpu":
result.useGpu = true;
break;
case "--onnx-dir":
if (i + 1 < args.length) result.onnxDir = args[++i];
break;
case "--total-step":
if (i + 1 < args.length) result.totalStep = Integer.parseInt(args[++i]);
break;
case "--speed":
if (i + 1 < args.length) result.speed = Float.parseFloat(args[++i]);
break;
case "--n-test":
if (i + 1 < args.length) result.nTest = Integer.parseInt(args[++i]);
break;
case "--voice-style":
if (i + 1 < args.length) {
result.voiceStyle = Arrays.asList(args[++i].split(","));
}
break;
case "--text":
if (i + 1 < args.length) {
result.text = Arrays.asList(args[++i].split("\\|"));
}
break;
case "--lang":
if (i + 1 < args.length) {
result.lang = Arrays.asList(args[++i].split(","));
}
break;
case "--save-dir":
if (i + 1 < args.length) result.saveDir = args[++i];
break;
case "--batch":
result.batch = true;
break;
}
}
return result;
}
/**
* Main inference function
*/
public static void main(String[] args) {
try {
System.out.println("=== TTS Inference with ONNX Runtime (Java) ===\n");
// --- 1. Parse arguments --- //
Args parsedArgs = parseArgs(args);
int totalStep = parsedArgs.totalStep;
float speed = parsedArgs.speed;
int nTest = parsedArgs.nTest;
String saveDir = parsedArgs.saveDir;
List<String> voiceStylePaths = parsedArgs.voiceStyle;
List<String> textList = parsedArgs.text;
List<String> langList = parsedArgs.lang;
boolean batch = parsedArgs.batch;
if (batch) {
if (voiceStylePaths.size() != textList.size()) {
throw new RuntimeException("Number of voice styles (" + voiceStylePaths.size() +
") must match number of texts (" + textList.size() + ")");
}
if (langList.size() != textList.size()) {
throw new RuntimeException("Number of languages (" + langList.size() +
") must match number of texts (" + textList.size() + ")");
}
}
int bsz = voiceStylePaths.size();
OrtEnvironment env = OrtEnvironment.getEnvironment();
// --- 2. Load TTS components --- //
TextToSpeech textToSpeech = Helper.loadTextToSpeech(parsedArgs.onnxDir, parsedArgs.useGpu, env);
// --- 3. Load voice styles --- //
Style style = Helper.loadVoiceStyle(voiceStylePaths, true, env);
// --- 4. Synthesize speech --- //
File saveDirFile = new File(saveDir);
if (!saveDirFile.exists()) {
saveDirFile.mkdirs();
}
for (int n = 0; n < nTest; n++) {
System.out.println("\n[" + (n + 1) + "/" + nTest + "] Starting synthesis...");
TTSResult ttsResult;
if (batch) {
ttsResult = Helper.timer("Generating speech from text", () -> {
try {
return textToSpeech.batch(textList, langList, style, totalStep, speed, env);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
} else {
ttsResult = Helper.timer("Generating speech from text", () -> {
try {
return textToSpeech.call(textList.get(0), langList.get(0), style, totalStep, speed, 0.3f, env);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
}
float[] wav = ttsResult.wav;
float[] duration = ttsResult.duration;
// Save outputs
for (int i = 0; i < bsz; i++) {
String fname = Helper.sanitizeFilename(textList.get(i), 20) + "_" + (n + 1) + ".wav";
float[] wavOut;
if (batch) {
int wavLen = wav.length / bsz;
int actualLen = (int) (textToSpeech.sampleRate * duration[i]);
wavOut = new float[actualLen];
System.arraycopy(wav, i * wavLen, wavOut, 0, Math.min(actualLen, wavLen));
} else {
// For non-batch mode, wav is a single concatenated audio
int actualLen = (int) (textToSpeech.sampleRate * duration[0]);
wavOut = new float[Math.min(actualLen, wav.length)];
System.arraycopy(wav, 0, wavOut, 0, wavOut.length);
}
String outputPath = saveDir + "/" + fname;
Helper.writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
System.out.println("Saved: " + outputPath);
}
}
// Clean up
style.close();
textToSpeech.close();
System.out.println("\n=== Synthesis completed successfully! ===");
} catch (Exception e) {
System.err.println("Error during inference: " + e.getMessage());
e.printStackTrace();
System.exit(1);
}
}
}

java/Helper.java Normal file

@@ -0,0 +1,955 @@
import ai.onnxruntime.*;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.LongBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
/**
* Available languages for multilingual TTS
*/
class Languages {
public static final List<String> AVAILABLE = Arrays.asList("en", "ko", "es", "pt", "fr");
public static boolean isValid(String lang) {
return AVAILABLE.contains(lang);
}
}
/**
* Configuration classes
*/
class Config {
static class AEConfig {
int sampleRate;
int baseChunkSize;
}
static class TTLConfig {
int chunkCompressFactor;
int latentDim;
}
AEConfig ae;
TTLConfig ttl;
}
/**
* Voice Style Data from JSON
*/
class VoiceStyleData {
static class StyleData {
float[][][] data;
long[] dims;
String type;
}
StyleData styleTtl;
StyleData styleDp;
}
/**
* Unicode text processor
*/
class UnicodeProcessor {
private long[] indexer;
public UnicodeProcessor(String unicodeIndexerJsonPath) throws IOException {
this.indexer = Helper.loadJsonLongArray(unicodeIndexerJsonPath);
}
private static String removeEmojis(String text) {
StringBuilder result = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
int codePoint;
if (Character.isHighSurrogate(text.charAt(i)) && i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
codePoint = Character.codePointAt(text, i);
i++; // Skip the low surrogate
} else {
codePoint = text.charAt(i);
}
// Check if code point is in emoji ranges
boolean isEmoji = (codePoint >= 0x1F600 && codePoint <= 0x1F64F) ||
(codePoint >= 0x1F300 && codePoint <= 0x1F5FF) ||
(codePoint >= 0x1F680 && codePoint <= 0x1F6FF) ||
(codePoint >= 0x1F700 && codePoint <= 0x1F77F) ||
(codePoint >= 0x1F780 && codePoint <= 0x1F7FF) ||
(codePoint >= 0x1F800 && codePoint <= 0x1F8FF) ||
(codePoint >= 0x1F900 && codePoint <= 0x1F9FF) ||
(codePoint >= 0x1FA00 && codePoint <= 0x1FA6F) ||
(codePoint >= 0x1FA70 && codePoint <= 0x1FAFF) ||
(codePoint >= 0x2600 && codePoint <= 0x26FF) ||
(codePoint >= 0x2700 && codePoint <= 0x27BF) ||
(codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF);
if (!isEmoji) {
if (codePoint > 0xFFFF) {
result.append(Character.toChars(codePoint));
} else {
result.append((char) codePoint);
}
}
}
return result.toString();
}
public TextProcessResult call(List<String> textList, List<String> langList) {
List<String> processedTexts = new ArrayList<>();
for (int i = 0; i < textList.size(); i++) {
processedTexts.add(preprocessText(textList.get(i), langList.get(i)));
}
// Convert texts to unicode values first to get correct character counts
List<int[]> allUnicodeVals = new ArrayList<>();
for (String text : processedTexts) {
allUnicodeVals.add(textToUnicodeValues(text));
}
int[] textIdsLengths = new int[processedTexts.size()];
int maxLen = 0;
for (int i = 0; i < allUnicodeVals.size(); i++) {
textIdsLengths[i] = allUnicodeVals.get(i).length; // Use code point count, not char count
maxLen = Math.max(maxLen, textIdsLengths[i]);
}
long[][] textIds = new long[processedTexts.size()][maxLen];
for (int i = 0; i < allUnicodeVals.size(); i++) {
int[] unicodeVals = allUnicodeVals.get(i);
for (int j = 0; j < unicodeVals.length; j++) {
textIds[i][j] = indexer[unicodeVals[j]];
}
}
float[][][] textMask = getTextMask(textIdsLengths);
return new TextProcessResult(textIds, textMask);
}
private String preprocessText(String text, String lang) {
// TODO: Need advanced normalizer for better performance
text = Normalizer.normalize(text, Normalizer.Form.NFKD);
// Remove emojis (wide Unicode range)
// Java Pattern doesn't support \x{...} syntax for Unicode above \uFFFF
// Use character filtering instead
text = removeEmojis(text);
// Replace various dashes and symbols
Map<String, String> replacements = new HashMap<>();
replacements.put("\u2013", "-"); // en dash
replacements.put("\u2011", "-"); // non-breaking hyphen
replacements.put("\u2014", "-"); // em dash
replacements.put("_", " "); // underscore
replacements.put("\u201C", "\""); // left double quote
replacements.put("\u201D", "\""); // right double quote
replacements.put("\u2018", "'"); // left single quote
replacements.put("\u2019", "'"); // right single quote
replacements.put("´", "'"); // acute accent
replacements.put("`", "'"); // grave accent
replacements.put("[", " "); // left bracket
replacements.put("]", " "); // right bracket
replacements.put("|", " "); // vertical bar
replacements.put("/", " "); // slash
replacements.put("#", " "); // hash
replacements.put("\u2192", " "); // right arrow
replacements.put("\u2190", " "); // left arrow
for (Map.Entry<String, String> entry : replacements.entrySet()) {
text = text.replace(entry.getKey(), entry.getValue());
}
// Remove special symbols
text = text.replaceAll("[♥☆♡©\\\\]", "");
// Replace known expressions
Map<String, String> exprReplacements = new HashMap<>();
exprReplacements.put("@", " at ");
exprReplacements.put("e.g.,", "for example, ");
exprReplacements.put("i.e.,", "that is, ");
for (Map.Entry<String, String> entry : exprReplacements.entrySet()) {
text = text.replace(entry.getKey(), entry.getValue());
}
// Fix spacing around punctuation
text = text.replaceAll(" ,", ",");
text = text.replaceAll(" \\.", ".");
text = text.replaceAll(" !", "!");
text = text.replaceAll(" \\?", "?");
text = text.replaceAll(" ;", ";");
text = text.replaceAll(" :", ":");
text = text.replaceAll(" '", "'");
// Remove duplicate quotes
while (text.contains("\"\"")) {
text = text.replace("\"\"", "\"");
}
while (text.contains("''")) {
text = text.replace("''", "'");
}
while (text.contains("``")) {
text = text.replace("``", "`");
}
// Remove extra spaces
text = text.replaceAll("\\s+", " ").trim();
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if (!text.matches(".*[.!?;:,'\"\\u201C\\u201D\\u2018\\u2019)\\]}…。」』】〉》›»]$")) {
text += ".";
}
// Validate language
if (!Languages.isValid(lang)) {
throw new IllegalArgumentException("Invalid language: " + lang + ". Available: " + Languages.AVAILABLE);
}
// Wrap text with language tags
text = "<" + lang + ">" + text + "</" + lang + ">";
return text;
}
private int[] textToUnicodeValues(String text) {
// Use codePoints() stream to correctly handle surrogate pairs
return text.codePoints().toArray();
}
private float[][][] getTextMask(int[] lengths) {
int bsz = lengths.length;
int maxLen = 0;
for (int len : lengths) {
maxLen = Math.max(maxLen, len);
}
float[][][] mask = new float[bsz][1][maxLen];
for (int i = 0; i < bsz; i++) {
for (int j = 0; j < maxLen; j++) {
mask[i][0][j] = j < lengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
static class TextProcessResult {
long[][] textIds;
float[][][] textMask;
TextProcessResult(long[][] textIds, float[][][] textMask) {
this.textIds = textIds;
this.textMask = textMask;
}
}
}
/**
* Text-to-Speech inference class
*/
class TextToSpeech {
private Config config;
private UnicodeProcessor textProcessor;
private OrtSession dpSession;
private OrtSession textEncSession;
private OrtSession vectorEstSession;
private OrtSession vocoderSession;
public int sampleRate;
private int baseChunkSize;
private int chunkCompress;
private int ldim;
public TextToSpeech(Config config, UnicodeProcessor textProcessor,
OrtSession dpSession, OrtSession textEncSession,
OrtSession vectorEstSession, OrtSession vocoderSession) {
this.config = config;
this.textProcessor = textProcessor;
this.dpSession = dpSession;
this.textEncSession = textEncSession;
this.vectorEstSession = vectorEstSession;
this.vocoderSession = vocoderSession;
this.sampleRate = config.ae.sampleRate;
this.baseChunkSize = config.ae.baseChunkSize;
this.chunkCompress = config.ttl.chunkCompressFactor;
this.ldim = config.ttl.latentDim;
}
private TTSResult _infer(List<String> textList, List<String> langList, Style style, int totalStep, float speed, OrtEnvironment env)
throws OrtException {
int bsz = textList.size();
// Process text
UnicodeProcessor.TextProcessResult textResult = textProcessor.call(textList, langList);
long[][] textIds = textResult.textIds;
float[][][] textMask = textResult.textMask;
// Create tensors
OnnxTensor textIdsTensor = Helper.createLongTensor(textIds, env);
OnnxTensor textMaskTensor = Helper.createFloatTensor(textMask, env);
// Predict duration
Map<String, OnnxTensor> dpInputs = new HashMap<>();
dpInputs.put("text_ids", textIdsTensor);
dpInputs.put("style_dp", style.dpTensor);
dpInputs.put("text_mask", textMaskTensor);
OrtSession.Result dpResult = dpSession.run(dpInputs);
Object dpValue = dpResult.get(0).getValue();
float[] duration;
if (dpValue instanceof float[][]) {
duration = ((float[][]) dpValue)[0];
} else {
duration = (float[]) dpValue;
}
// Apply speed factor to duration
for (int i = 0; i < duration.length; i++) {
duration[i] /= speed;
}
// Encode text
Map<String, OnnxTensor> textEncInputs = new HashMap<>();
textEncInputs.put("text_ids", textIdsTensor);
textEncInputs.put("style_ttl", style.ttlTensor);
textEncInputs.put("text_mask", textMaskTensor);
OrtSession.Result textEncResult = textEncSession.run(textEncInputs);
OnnxTensor textEmbTensor = (OnnxTensor) textEncResult.get(0);
// Sample noisy latent
NoisyLatentResult noisyLatentResult = sampleNoisyLatent(duration);
float[][][] xt = noisyLatentResult.noisyLatent;
float[][][] latentMask = noisyLatentResult.latentMask;
// Prepare constant tensors
float[] totalStepArray = new float[bsz];
Arrays.fill(totalStepArray, (float) totalStep);
OnnxTensor totalStepTensor = OnnxTensor.createTensor(env, totalStepArray);
// Denoising loop
for (int step = 0; step < totalStep; step++) {
float[] currentStepArray = new float[bsz];
Arrays.fill(currentStepArray, (float) step);
OnnxTensor currentStepTensor = OnnxTensor.createTensor(env, currentStepArray);
OnnxTensor noisyLatentTensor = Helper.createFloatTensor(xt, env);
OnnxTensor latentMaskTensor = Helper.createFloatTensor(latentMask, env);
OnnxTensor textMaskTensor2 = Helper.createFloatTensor(textMask, env);
Map<String, OnnxTensor> vectorEstInputs = new HashMap<>();
vectorEstInputs.put("noisy_latent", noisyLatentTensor);
vectorEstInputs.put("text_emb", textEmbTensor);
vectorEstInputs.put("style_ttl", style.ttlTensor);
vectorEstInputs.put("latent_mask", latentMaskTensor);
vectorEstInputs.put("text_mask", textMaskTensor2);
vectorEstInputs.put("current_step", currentStepTensor);
vectorEstInputs.put("total_step", totalStepTensor);
OrtSession.Result vectorEstResult = vectorEstSession.run(vectorEstInputs);
float[][][] denoised = (float[][][]) vectorEstResult.get(0).getValue();
// Update latent
xt = denoised;
// Clean up
currentStepTensor.close();
noisyLatentTensor.close();
latentMaskTensor.close();
textMaskTensor2.close();
vectorEstResult.close();
}
// Generate waveform
OnnxTensor finalLatentTensor = Helper.createFloatTensor(xt, env);
Map<String, OnnxTensor> vocoderInputs = new HashMap<>();
vocoderInputs.put("latent", finalLatentTensor);
OrtSession.Result vocoderResult = vocoderSession.run(vocoderInputs);
float[][] wavBatch = (float[][]) vocoderResult.get(0).getValue();
// Flatten all batch audio into a single array for batch processing
int totalSamples = 0;
for (float[] w : wavBatch) {
totalSamples += w.length;
}
float[] wav = new float[totalSamples];
int offset = 0;
for (float[] w : wavBatch) {
System.arraycopy(w, 0, wav, offset, w.length);
offset += w.length;
}
// Clean up
textIdsTensor.close();
textMaskTensor.close();
dpResult.close();
textEncResult.close();
totalStepTensor.close();
finalLatentTensor.close();
vocoderResult.close();
return new TTSResult(wav, duration);
}
private NoisyLatentResult sampleNoisyLatent(float[] duration) {
int bsz = duration.length;
float maxDur = 0;
for (float d : duration) {
maxDur = Math.max(maxDur, d);
}
long wavLenMax = (long) (maxDur * sampleRate);
long[] wavLengths = new long[bsz];
for (int i = 0; i < bsz; i++) {
wavLengths[i] = (long) (duration[i] * sampleRate);
}
int chunkSize = baseChunkSize * chunkCompress;
int latentLen = (int) ((wavLenMax + chunkSize - 1) / chunkSize);
int latentDim = ldim * chunkCompress;
Random rng = new Random();
float[][][] noisyLatent = new float[bsz][latentDim][latentLen];
for (int b = 0; b < bsz; b++) {
for (int d = 0; d < latentDim; d++) {
for (int t = 0; t < latentLen; t++) {
// Box-Muller transform
double u1 = Math.max(1e-10, rng.nextDouble());
double u2 = rng.nextDouble();
noisyLatent[b][d][t] = (float) (Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2));
}
}
}
float[][][] latentMask = Helper.getLatentMask(wavLengths, config);
// Apply mask
for (int b = 0; b < bsz; b++) {
for (int d = 0; d < latentDim; d++) {
for (int t = 0; t < latentLen; t++) {
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return new NoisyLatentResult(noisyLatent, latentMask);
}
/**
* Synthesize speech from a single text with automatic chunking
*/
public TTSResult call(String text, String lang, Style style, int totalStep, float speed, float silenceDuration, OrtEnvironment env)
throws OrtException {
int maxLen = lang.equals("ko") ? 120 : 300;
List<String> chunks = Helper.chunkText(text, maxLen);
List<Float> wavCat = new ArrayList<>();
float durCat = 0.0f;
for (int i = 0; i < chunks.size(); i++) {
TTSResult result = _infer(Arrays.asList(chunks.get(i)), Arrays.asList(lang), style, totalStep, speed, env);
float dur = result.duration[0];
int wavLen = (int) (sampleRate * dur);
float[] wavChunk = new float[wavLen];
System.arraycopy(result.wav, 0, wavChunk, 0, Math.min(wavLen, result.wav.length));
if (i == 0) {
for (float val : wavChunk) {
wavCat.add(val);
}
durCat = dur;
} else {
int silenceLen = (int) (silenceDuration * sampleRate);
for (int j = 0; j < silenceLen; j++) {
wavCat.add(0.0f);
}
for (float val : wavChunk) {
wavCat.add(val);
}
durCat += silenceDuration + dur;
}
}
float[] wavArray = new float[wavCat.size()];
for (int i = 0; i < wavCat.size(); i++) {
wavArray[i] = wavCat.get(i);
}
return new TTSResult(wavArray, new float[]{durCat});
}
/**
* Batch synthesize speech from multiple texts
*/
public TTSResult batch(List<String> textList, List<String> langList, Style style, int totalStep, float speed, OrtEnvironment env)
throws OrtException {
return _infer(textList, langList, style, totalStep, speed, env);
}
public void close() throws OrtException {
if (dpSession != null) dpSession.close();
if (textEncSession != null) textEncSession.close();
if (vectorEstSession != null) vectorEstSession.close();
if (vocoderSession != null) vocoderSession.close();
}
}
/**
* Style holder class
*/
class Style {
OnnxTensor ttlTensor;
OnnxTensor dpTensor;
Style(OnnxTensor ttlTensor, OnnxTensor dpTensor) {
this.ttlTensor = ttlTensor;
this.dpTensor = dpTensor;
}
public void close() throws OrtException {
if (ttlTensor != null) ttlTensor.close();
if (dpTensor != null) dpTensor.close();
}
}
/**
* TTS result holder
*/
class TTSResult {
float[] wav;
float[] duration;
TTSResult(float[] wav, float[] duration) {
this.wav = wav;
this.duration = duration;
}
}
/**
* Noisy latent result holder
*/
class NoisyLatentResult {
float[][][] noisyLatent;
float[][][] latentMask;
NoisyLatentResult(float[][][] noisyLatent, float[][][] latentMask) {
this.noisyLatent = noisyLatent;
this.latentMask = latentMask;
}
}
/**
* Helper utility class
*/
public class Helper {
private static final int MAX_CHUNK_LENGTH = 300;
private static final String[] ABBREVIATIONS = {
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
"St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
"Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D."
};
/**
* Chunk text into smaller segments based on paragraphs and sentences
*/
public static List<String> chunkText(String text, int maxLen) {
if (maxLen == 0) {
maxLen = MAX_CHUNK_LENGTH;
}
text = text.trim();
if (text.isEmpty()) {
return Arrays.asList("");
}
// Split by paragraphs
String[] paragraphs = text.split("\\n\\s*\\n");
List<String> chunks = new ArrayList<>();
for (String para : paragraphs) {
para = para.trim();
if (para.isEmpty()) {
continue;
}
if (para.length() <= maxLen) {
chunks.add(para);
continue;
}
// Split by sentences
List<String> sentences = splitSentences(para);
StringBuilder current = new StringBuilder();
int currentLen = 0;
for (String sentence : sentences) {
sentence = sentence.trim();
if (sentence.isEmpty()) {
continue;
}
int sentenceLen = sentence.length();
if (sentenceLen > maxLen) {
// If sentence is longer than maxLen, split by comma or space
if (current.length() > 0) {
chunks.add(current.toString().trim());
current.setLength(0);
currentLen = 0;
}
// Try splitting by comma
String[] parts = sentence.split(",");
for (String part : parts) {
part = part.trim();
if (part.isEmpty()) {
continue;
}
int partLen = part.length();
if (partLen > maxLen) {
// Split by space as last resort
String[] words = part.split("\\s+");
StringBuilder wordChunk = new StringBuilder();
int wordChunkLen = 0;
for (String word : words) {
int wordLen = word.length();
if (wordChunkLen + wordLen + 1 > maxLen && wordChunk.length() > 0) {
chunks.add(wordChunk.toString().trim());
wordChunk.setLength(0);
wordChunkLen = 0;
}
if (wordChunk.length() > 0) {
wordChunk.append(" ");
wordChunkLen++;
}
wordChunk.append(word);
wordChunkLen += wordLen;
}
if (wordChunk.length() > 0) {
chunks.add(wordChunk.toString().trim());
}
} else {
if (currentLen + partLen + 1 > maxLen && current.length() > 0) {
chunks.add(current.toString().trim());
current.setLength(0);
currentLen = 0;
}
if (current.length() > 0) {
current.append(", ");
currentLen += 2;
}
current.append(part);
currentLen += partLen;
}
}
continue;
}
if (currentLen + sentenceLen + 1 > maxLen && current.length() > 0) {
chunks.add(current.toString().trim());
current.setLength(0);
currentLen = 0;
}
if (current.length() > 0) {
current.append(" ");
currentLen++;
}
current.append(sentence);
currentLen += sentenceLen;
}
if (current.length() > 0) {
chunks.add(current.toString().trim());
}
}
if (chunks.isEmpty()) {
return Arrays.asList("");
}
return chunks;
}
/**
* Split text into sentences, avoiding common abbreviations
*/
private static List<String> splitSentences(String text) {
// Build pattern that avoids abbreviations
StringBuilder abbrevPattern = new StringBuilder();
for (int i = 0; i < ABBREVIATIONS.length; i++) {
if (i > 0) abbrevPattern.append("|");
abbrevPattern.append(Pattern.quote(ABBREVIATIONS[i]));
}
// Match sentence endings, but not abbreviations
String patternStr = "(?<!(?:" + abbrevPattern.toString() + "))(?<=[.!?])\\s+";
Pattern pattern = Pattern.compile(patternStr);
return Arrays.asList(pattern.split(text));
}
/**
* Load voice style from JSON files
*/
public static Style loadVoiceStyle(List<String> voiceStylePaths, boolean verbose, OrtEnvironment env)
throws IOException, OrtException {
int bsz = voiceStylePaths.size();
// Read first file to get dimensions
ObjectMapper mapper = new ObjectMapper();
JsonNode firstRoot = mapper.readTree(new File(voiceStylePaths.get(0)));
long[] ttlDims = new long[3];
for (int i = 0; i < 3; i++) {
ttlDims[i] = firstRoot.get("style_ttl").get("dims").get(i).asLong();
}
long[] dpDims = new long[3];
for (int i = 0; i < 3; i++) {
dpDims[i] = firstRoot.get("style_dp").get("dims").get(i).asLong();
}
long ttlDim1 = ttlDims[1];
long ttlDim2 = ttlDims[2];
long dpDim1 = dpDims[1];
long dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
int ttlSize = (int) (bsz * ttlDim1 * ttlDim2);
int dpSize = (int) (bsz * dpDim1 * dpDim2);
float[] ttlFlat = new float[ttlSize];
float[] dpFlat = new float[dpSize];
// Fill in the data
for (int i = 0; i < bsz; i++) {
JsonNode root = mapper.readTree(new File(voiceStylePaths.get(i)));
// Flatten TTL data
int ttlOffset = (int) (i * ttlDim1 * ttlDim2);
int idx = 0;
JsonNode ttlData = root.get("style_ttl").get("data");
for (JsonNode batch : ttlData) {
for (JsonNode row : batch) {
for (JsonNode val : row) {
ttlFlat[ttlOffset + idx++] = (float) val.asDouble();
}
}
}
// Flatten DP data
int dpOffset = (int) (i * dpDim1 * dpDim2);
idx = 0;
JsonNode dpData = root.get("style_dp").get("data");
for (JsonNode batch : dpData) {
for (JsonNode row : batch) {
for (JsonNode val : row) {
dpFlat[dpOffset + idx++] = (float) val.asDouble();
}
}
}
}
long[] ttlShape = {bsz, ttlDim1, ttlDim2};
long[] dpShape = {bsz, dpDim1, dpDim2};
OnnxTensor ttlTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(ttlFlat), ttlShape);
OnnxTensor dpTensor = OnnxTensor.createTensor(env, FloatBuffer.wrap(dpFlat), dpShape);
if (verbose) {
System.out.println("Loaded " + bsz + " voice styles\n");
}
return new Style(ttlTensor, dpTensor);
}
/**
* Load TTS components
*/
public static TextToSpeech loadTextToSpeech(String onnxDir, boolean useGpu, OrtEnvironment env)
throws IOException, OrtException {
if (useGpu) {
throw new RuntimeException("GPU mode is not supported yet");
}
System.out.println("Using CPU for inference\n");
// Load config
Config config = loadCfgs(onnxDir);
// Create session options
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
// Load models
OrtSession dpSession = env.createSession(onnxDir + "/duration_predictor.onnx", opts);
OrtSession textEncSession = env.createSession(onnxDir + "/text_encoder.onnx", opts);
OrtSession vectorEstSession = env.createSession(onnxDir + "/vector_estimator.onnx", opts);
OrtSession vocoderSession = env.createSession(onnxDir + "/vocoder.onnx", opts);
// Load text processor
UnicodeProcessor textProcessor = new UnicodeProcessor(onnxDir + "/unicode_indexer.json");
return new TextToSpeech(config, textProcessor, dpSession, textEncSession, vectorEstSession, vocoderSession);
}
/**
* Load configuration from JSON
*/
public static Config loadCfgs(String onnxDir) throws IOException {
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(new File(onnxDir + "/tts.json"));
Config config = new Config();
config.ae = new Config.AEConfig();
config.ae.sampleRate = root.get("ae").get("sample_rate").asInt();
config.ae.baseChunkSize = root.get("ae").get("base_chunk_size").asInt();
config.ttl = new Config.TTLConfig();
config.ttl.chunkCompressFactor = root.get("ttl").get("chunk_compress_factor").asInt();
config.ttl.latentDim = root.get("ttl").get("latent_dim").asInt();
return config;
}
/**
* Get latent mask from wav lengths
*/
public static float[][][] getLatentMask(long[] wavLengths, Config config) {
long baseChunkSize = config.ae.baseChunkSize;
long chunkCompressFactor = config.ttl.chunkCompressFactor;
long latentSize = baseChunkSize * chunkCompressFactor;
long[] latentLengths = new long[wavLengths.length];
long maxLen = 0;
for (int i = 0; i < wavLengths.length; i++) {
latentLengths[i] = (wavLengths[i] + latentSize - 1) / latentSize;
maxLen = Math.max(maxLen, latentLengths[i]);
}
float[][][] mask = new float[wavLengths.length][1][(int) maxLen];
for (int i = 0; i < wavLengths.length; i++) {
for (int j = 0; j < maxLen; j++) {
mask[i][0][j] = j < latentLengths[i] ? 1.0f : 0.0f;
}
}
return mask;
}
/**
* Write WAV file
*/
public static void writeWavFile(String filename, float[] audioData, int sampleRate) throws IOException {
// Convert float to byte array
byte[] bytes = new byte[audioData.length * 2];
ByteBuffer buffer = ByteBuffer.wrap(bytes);
buffer.order(ByteOrder.LITTLE_ENDIAN);
for (float sample : audioData) {
short val = (short) Math.max(-32768, Math.min(32767, sample * 32767));
buffer.putShort(val);
}
ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
AudioInputStream ais = new AudioInputStream(bais, format, audioData.length);
AudioSystem.write(ais, AudioFileFormat.Type.WAVE, new File(filename));
}
/**
* Sanitize filename (supports Unicode characters)
*/
public static String sanitizeFilename(String text, int maxLen) {
// Get first maxLen characters (code points, not chars for surrogate pairs)
int[] codePoints = text.codePoints().limit(maxLen).toArray();
StringBuilder result = new StringBuilder();
for (int codePoint : codePoints) {
if (Character.isLetterOrDigit(codePoint)) {
result.appendCodePoint(codePoint);
} else {
result.append('_');
}
}
return result.toString();
}
/**
* Timer utility
*/
public static <T> T timer(String name, java.util.function.Supplier<T> fn) {
long start = System.currentTimeMillis();
System.out.println(name + "...");
T result = fn.get();
long elapsed = System.currentTimeMillis() - start;
System.out.printf(" -> %s completed in %.2f sec\n", name, elapsed / 1000.0);
return result;
}
/**
* Create float tensor from 3D array
*/
public static OnnxTensor createFloatTensor(float[][][] array, OrtEnvironment env) throws OrtException {
int dim0 = array.length;
int dim1 = array[0].length;
int dim2 = array[0][0].length;
float[] flat = new float[dim0 * dim1 * dim2];
int idx = 0;
for (int i = 0; i < dim0; i++) {
for (int j = 0; j < dim1; j++) {
for (int k = 0; k < dim2; k++) {
flat[idx++] = array[i][j][k];
}
}
}
long[] shape = {dim0, dim1, dim2};
return OnnxTensor.createTensor(env, FloatBuffer.wrap(flat), shape);
}
/**
* Create long tensor from 2D array
*/
public static OnnxTensor createLongTensor(long[][] array, OrtEnvironment env) throws OrtException {
int dim0 = array.length;
int dim1 = array[0].length;
long[] flat = new long[dim0 * dim1];
int idx = 0;
for (int i = 0; i < dim0; i++) {
for (int j = 0; j < dim1; j++) {
flat[idx++] = array[i][j];
}
}
long[] shape = {dim0, dim1};
return OnnxTensor.createTensor(env, LongBuffer.wrap(flat), shape);
}
/**
* Load JSON long array
*/
public static long[] loadJsonLongArray(String filePath) throws IOException {
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(new File(filePath));
long[] result = new long[root.size()];
for (int i = 0; i < root.size(); i++) {
result[i] = root.get(i).asLong();
}
return result;
}
}

java/README.md
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `ExampleONNX.java`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
This project uses [Maven](https://maven.apache.org/) for dependency management.
### Prerequisites
- Java 11 or higher
- Maven 3.6 or higher
### Install dependencies
```bash
mvn clean install
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
mvn exec:java
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
mvn exec:java -Dexec.args="--batch --voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json --text 'The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요.' --lang en,ko"
```
This will:
- Generate speech for 2 different voice-text-language pairs
- Use male voice (M1.json) for the first text in English
- Use female voice (F1.json) for the second text in Korean
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
mvn exec:java -Dexec.args="--total-step 10 --voice-style assets/voice_styles/M1.json --text 'Increasing the number of denoising steps improves the output fidelity and overall quality.'"
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
```bash
mvn exec:java -Dexec.args="--voice-style assets/voice_styles/M1.json --text 'This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues.'"
```
This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file
**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
**Tip**: If your text contains apostrophes, use escaping or run the JAR directly:
```bash
java -jar target/tts-example.jar --total-step 10 --text "Text with apostrophe's here"
```
## Building a Fat JAR
To create a standalone JAR with all dependencies:
```bash
mvn clean package
```
Then run it directly:
```bash
java -jar target/tts-example.jar
```
Or with arguments:
```bash
java -jar target/tts-example.jar --total-step 10 --text "Your custom text here"
```
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (recommended range: 0.9-1.5) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
## Notes
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Voice Styles**: Uses pre-extracted voice style JSON files for fast inference

java/pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>ai.supertonic</groupId>
<artifactId>tts-onnx-java</artifactId>
<version>1.0.0</version>
<packaging>jar</packaging>
<name>TTS ONNX Java Example</name>
<description>Text-to-Speech inference using ONNX Runtime in Java</description>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
<onnxruntime.version>1.23.1</onnxruntime.version>
<jackson.version>2.15.2</jackson.version>
</properties>
<dependencies>
<!-- ONNX Runtime -->
<dependency>
<groupId>com.microsoft.onnxruntime</groupId>
<artifactId>onnxruntime</artifactId>
<version>${onnxruntime.version}</version>
</dependency>
<!-- Jackson for JSON parsing -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>${jackson.version}</version>
</dependency>
<!-- JTransforms for Fast FFT -->
<dependency>
<groupId>com.github.wendykierp</groupId>
<artifactId>JTransforms</artifactId>
<version>3.1</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>.</sourceDirectory>
<plugins>
<!-- Maven Compiler Plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.11.0</version>
<configuration>
<source>11</source>
<target>11</target>
</configuration>
</plugin>
<!-- Maven Exec Plugin for running the example -->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<mainClass>ExampleONNX</mainClass>
</configuration>
</plugin>
<!-- Maven Jar Plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<archive>
<manifest>
<mainClass>ExampleONNX</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<!-- Maven Shade Plugin for creating fat JAR -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.5.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>ExampleONNX</mainClass>
</transformer>
</transformers>
<finalName>tts-example</finalName>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>

nodejs/README.md
# TTS ONNX Node.js Implementation
Node.js implementation for TTS inference. Uses ONNX Runtime to generate speech from text.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Requirements
- Node.js v16 or higher
- npm or yarn
## Installation
```bash
cd nodejs
npm install
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
npm start
```
Or:
```bash
node example_onnx.js
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
node example_onnx.js \
--voice-style "assets/voice_styles/M1.json,assets/voice_styles/F1.json" \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
--lang "en,ko" \
--batch
```
This will:
- Use `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first English text
- Use female voice style (F1.json) for the second Korean text
- Process both samples in a single batch (automatic text chunking disabled)
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
node example_onnx.js \
--total-step 10 \
--voice-style "assets/voice_styles/M1.json" \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
node example_onnx.js \
--voice-style "assets/voice_styles/M1.json" \
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```
This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file
**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
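The concatenation step described above (chunks joined by fixed silent gaps) can be sketched as follows. This is a simplified illustration, not the actual `helper.js` API; `concatWithSilence` and its parameters are hypothetical names:

```javascript
// Sketch of long-form concatenation: each synthesized chunk is joined with a
// fixed block of silence. Float32Array is zero-filled, so the gaps stay silent.
function concatWithSilence(chunks, sampleRate, silenceDuration = 0.3) {
  const silenceLen = Math.floor(silenceDuration * sampleRate);
  const totalLen =
    chunks.reduce((sum, c) => sum + c.length, 0) +
    silenceLen * (chunks.length - 1);
  const out = new Float32Array(totalLen);
  let offset = 0;
  chunks.forEach((chunk, i) => {
    if (i > 0) offset += silenceLen; // silent gap before every later chunk
    out.set(chunk, offset);
    offset += chunk.length;
  });
  return out;
}

// Two 1-second chunks at 44.1 kHz -> 44100 + 13230 + 44100 = 101430 samples.
const joined = concatWithSilence(
  [new Float32Array(44100).fill(0.5), new Float32Array(44100).fill(0.5)],
  44100
);
console.log(joined.length); // 101430
```

With the default 0.3 s pause, total duration is simply the sum of chunk durations plus `0.3 * (numChunks - 1)` seconds.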
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s). Separate multiple files with commas |
| `--text` | str+ | (long default text) | Text(s) to synthesize. Separate multiple texts with pipes |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr`. Separate multiple with commas |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
## Notes
- **Batch Processing**: The number of voice style files must match the number of texts. Use commas to separate files and pipes to separate texts
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
## Architecture
- `helper.js`: Node.js port of Python's `helper.py`
- `Preprocessor`: Audio preprocessing (STFT, Mel Spectrogram)
- `UnicodeProcessor`: Text preprocessing
- Utility functions (mask generation, tensor conversion, etc.)
- `example_onnx.js`: Main inference script
- ONNX model loading
- TTS inference pipeline execution
- WAV file saving
- `package.json`: Node.js project configuration and dependencies
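The "mask generation" utility mentioned above turns per-sample lengths into a padded binary mask. A minimal sketch (the real `helper.js` implementation may differ in shape and dtype):

```javascript
// Turn per-sample lengths into a padded binary mask: 1 for valid positions,
// 0 for padding, with every row padded to the longest length in the batch.
function lengthToMask(lengths) {
  const maxLen = Math.max(...lengths);
  return lengths.map(len =>
    Array.from({ length: maxLen }, (_, j) => (j < len ? 1 : 0))
  );
}

console.log(lengthToMask([3, 1])); // [ [ 1, 1, 1 ], [ 1, 0, 0 ] ]
```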
## Implementation Notes
1. **Pure Node.js WAV Processing**: Writes WAV files without external native libraries. Outputs 16-bit PCM format.
2. **Memory Efficiency**: Note that Node.js may consume significant memory when processing large arrays.
3. **Performance**: The mel spectrogram extraction (Step 1-1) is currently slower than Python's Librosa, which uses highly optimized C extensions. This bottleneck could be further improved with additional optimizations such as WASM-based FFT libraries or native addons.
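The float-to-PCM step behind the "Pure Node.js WAV Processing" note can be sketched as below: clamp each float sample into the 16-bit range, scale by 32767, and write it little-endian. `floatTo16BitPCM` is an illustrative name, not the `helper.js` API, and the rounding detail may differ from the actual implementation:

```javascript
// Convert float samples in [-1, 1] to 16-bit little-endian PCM bytes,
// clamping out-of-range values to avoid integer wraparound.
function floatTo16BitPCM(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  samples.forEach((s, i) => {
    const v = Math.max(-32768, Math.min(32767, Math.round(s * 32767)));
    buf.writeInt16LE(v, i * 2);
  });
  return buf;
}

console.log(floatTo16BitPCM([0, 1, -1]));
```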

nodejs/example_onnx.js
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import { loadTextToSpeech, loadVoiceStyle, timer, writeWavFile, sanitizeFilename } from './helper.js';
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
/**
* Parse command line arguments
*/
function parseArgs() {
const args = {
useGpu: false,
onnxDir: 'assets/onnx',
totalStep: 5,
speed: 1.05,
nTest: 4,
voiceStyle: ['assets/voice_styles/M1.json'],
text: ['This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.'],
lang: ['en'],
saveDir: 'results',
batch: false
};
for (let i = 2; i < process.argv.length; i++) {
const arg = process.argv[i];
if (arg === '--use-gpu') {
args.useGpu = true;
} else if (arg === '--batch') {
args.batch = true;
} else if (arg === '--onnx-dir' && i + 1 < process.argv.length) {
args.onnxDir = process.argv[++i];
} else if (arg === '--total-step' && i + 1 < process.argv.length) {
args.totalStep = parseInt(process.argv[++i]);
} else if (arg === '--speed' && i + 1 < process.argv.length) {
args.speed = parseFloat(process.argv[++i]);
} else if (arg === '--n-test' && i + 1 < process.argv.length) {
args.nTest = parseInt(process.argv[++i]);
} else if (arg === '--voice-style' && i + 1 < process.argv.length) {
args.voiceStyle = process.argv[++i].split(',');
} else if (arg === '--text' && i + 1 < process.argv.length) {
args.text = process.argv[++i].split('|');
} else if (arg === '--lang' && i + 1 < process.argv.length) {
args.lang = process.argv[++i].split(',');
} else if (arg === '--save-dir' && i + 1 < process.argv.length) {
args.saveDir = process.argv[++i];
}
}
return args;
}
/**
* Main inference function
*/
async function main() {
console.log('=== TTS Inference with ONNX Runtime (Node.js) ===\n');
// --- 1. Parse arguments --- //
const args = parseArgs();
const totalStep = args.totalStep;
const speed = args.speed;
const nTest = args.nTest;
const saveDir = args.saveDir;
const voiceStylePaths = args.voiceStyle.map(p => path.resolve(__dirname, p));
const textList = args.text;
const langList = args.lang;
const batch = args.batch;
if (voiceStylePaths.length !== textList.length) {
throw new Error(`Number of voice styles (${voiceStylePaths.length}) must match number of texts (${textList.length})`);
}
if (langList.length !== textList.length) {
throw new Error(`Number of languages (${langList.length}) must match number of texts (${textList.length})`);
}
const bsz = voiceStylePaths.length;
// --- 2. Load Text to Speech --- //
const onnxDir = path.resolve(__dirname, args.onnxDir);
const textToSpeech = await loadTextToSpeech(onnxDir, args.useGpu);
// --- 3. Load Voice Style --- //
const style = loadVoiceStyle(voiceStylePaths, true);
// --- 4. Synthesize speech --- //
for (let n = 0; n < nTest; n++) {
console.log(`\n[${n + 1}/${nTest}] Starting synthesis...`);
const { wav, duration } = await timer('Generating speech from text', async () => {
if (batch) {
return await textToSpeech.batch(textList, langList, style, totalStep, speed);
} else {
return await textToSpeech.call(textList[0], langList[0], style, totalStep, speed);
}
});
if (!fs.existsSync(saveDir)) {
fs.mkdirSync(saveDir, { recursive: true });
}
const wavShape = [bsz, wav.length / bsz];
for (let b = 0; b < bsz; b++) {
const fname = `${sanitizeFilename(textList[b], 20)}_${n + 1}.wav`;
const wavLen = Math.floor(textToSpeech.sampleRate * duration[b]);
const wavOut = wav.slice(b * wavShape[1], b * wavShape[1] + wavLen);
const outputPath = path.join(saveDir, fname);
writeWavFile(outputPath, wavOut, textToSpeech.sampleRate);
console.log(`Saved: ${outputPath}`);
}
}
console.log('\n=== Synthesis completed successfully! ===');
}
// Run main function
main().catch(err => {
console.error('Error during inference:', err);
process.exit(1);
});

nodejs/helper.js
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
import * as ort from 'onnxruntime-node';
const __filename = fileURLToPath(import.meta.url);
const AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"];
/**
* Unicode text processor
*/
class UnicodeProcessor {
constructor(unicodeIndexerJsonPath) {
this.indexer = JSON.parse(fs.readFileSync(unicodeIndexerJsonPath, 'utf8'));
}
_preprocessText(text, lang) {
// TODO: Need advanced normalizer for better performance
text = text.normalize('NFKD');
// Remove emojis (wide Unicode range)
const emojiPattern = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
text = text.replace(emojiPattern, '');
// Replace various dashes and symbols
const replacements = {
'\u2013': '-', // en dash
'\u2212': '-', // minus sign
'—': '-',
'_': ' ',
'\u201C': '"', // left double quote "
'\u201D': '"', // right double quote "
'\u2018': "'", // left single quote '
'\u2019': "'", // right single quote '
'´': "'",
'`': "'",
'[': ' ',
']': ' ',
'|': ' ',
'/': ' ',
'#': ' ',
'→': ' ',
'←': ' ',
};
for (const [k, v] of Object.entries(replacements)) {
text = text.replaceAll(k, v);
}
// Remove special symbols
text = text.replace(/[♥☆♡©\\]/g, '');
// Replace known expressions
const exprReplacements = {
'@': ' at ',
'e.g.,': 'for example, ',
'i.e.,': 'that is, ',
};
for (const [k, v] of Object.entries(exprReplacements)) {
text = text.replaceAll(k, v);
}
// Fix spacing around punctuation
text = text.replace(/ ,/g, ',');
text = text.replace(/ \./g, '.');
text = text.replace(/ !/g, '!');
text = text.replace(/ \?/g, '?');
text = text.replace(/ ;/g, ';');
text = text.replace(/ :/g, ':');
text = text.replace(/ '/g, "'");
// Remove duplicate quotes
while (text.includes('""')) {
text = text.replace('""', '"');
}
while (text.includes("''")) {
text = text.replace("''", "'");
}
while (text.includes('``')) {
text = text.replace('``', '`');
}
// Remove extra spaces
text = text.replace(/\s+/g, ' ').trim();
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if (!/[.!?;:,'\"')\]}…。」』】〉》›»]$/.test(text)) {
text += '.';
}
// Validate language
if (!AVAILABLE_LANGS.includes(lang)) {
throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
}
// Wrap text with language tags
text = `<${lang}>` + text + `</${lang}>`;
return text;
}
_textToUnicodeValues(text) {
// codePointAt (not charCodeAt) so astral-plane characters map to full code points, matching Python's ord()
return Array.from(text).map(char => char.codePointAt(0));
}
_getTextMask(textIdsLengths) {
return lengthToMask(textIdsLengths);
}
call(textList, langList) {
const processedTexts = textList.map((t, i) => this._preprocessText(t, langList[i]));
const textIdsLengths = processedTexts.map(t => t.length);
const maxLen = Math.max(...textIdsLengths);
const textIds = [];
for (let i = 0; i < processedTexts.length; i++) {
const row = new Array(maxLen).fill(0);
const unicodeVals = this._textToUnicodeValues(processedTexts[i]);
for (let j = 0; j < unicodeVals.length; j++) {
row[j] = this.indexer[unicodeVals[j]];
}
textIds.push(row);
}
const textMask = this._getTextMask(textIdsLengths);
return { textIds, textMask };
}
}
/**
* Style class
*/
class Style {
constructor(styleTtlOnnx, styleDpOnnx) {
this.ttl = styleTtlOnnx;
this.dp = styleDpOnnx;
}
}
/**
* TextToSpeech class
*/
class TextToSpeech {
constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
this.cfgs = cfgs;
this.textProcessor = textProcessor;
this.dpOrt = dpOrt;
this.textEncOrt = textEncOrt;
this.vectorEstOrt = vectorEstOrt;
this.vocoderOrt = vocoderOrt;
this.sampleRate = cfgs.ae.sample_rate;
this.baseChunkSize = cfgs.ae.base_chunk_size;
this.chunkCompressFactor = cfgs.ttl.chunk_compress_factor;
this.ldim = cfgs.ttl.latent_dim;
}
sampleNoisyLatent(duration) {
const wavLenMax = Math.max(...duration) * this.sampleRate;
const wavLengths = duration.map(d => Math.floor(d * this.sampleRate));
const chunkSize = this.baseChunkSize * this.chunkCompressFactor;
const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
const latentDim = this.ldim * this.chunkCompressFactor;
// Generate random noise
const noisyLatent = [];
for (let b = 0; b < duration.length; b++) {
const batch = [];
for (let d = 0; d < latentDim; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
// Box-Muller transform for normal distribution
// Add epsilon to avoid log(0)
const eps = 1e-10;
const u1 = Math.max(eps, Math.random());
const u2 = Math.random();
const randNormal = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
row.push(randNormal);
}
batch.push(row);
}
noisyLatent.push(batch);
}
const latentMask = getLatentMask(wavLengths, this.baseChunkSize, this.chunkCompressFactor);
// Apply mask
for (let b = 0; b < noisyLatent.length; b++) {
for (let d = 0; d < noisyLatent[b].length; d++) {
for (let t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] *= latentMask[b][0][t];
}
}
}
return { noisyLatent, latentMask };
}
async _infer(textList, langList, style, totalStep, speed = 1.05) {
if (textList.length !== style.ttl.dims[0]) {
throw new Error('Number of texts must match number of style vectors');
}
const bsz = textList.length;
const { textIds, textMask } = this.textProcessor.call(textList, langList);
const textIdsShape = [bsz, textIds[0].length];
const textMaskShape = [bsz, 1, textMask[0][0].length];
const textMaskTensor = arrayToTensor(textMask, textMaskShape);
const dpResult = await this.dpOrt.run({
text_ids: intArrayToTensor(textIds, textIdsShape),
style_dp: style.dp,
text_mask: textMaskTensor
});
const durOnnx = Array.from(dpResult.duration.data);
// Apply speed factor to duration
for (let i = 0; i < durOnnx.length; i++) {
durOnnx[i] /= speed;
}
const textEncResult = await this.textEncOrt.run({
text_ids: intArrayToTensor(textIds, textIdsShape),
style_ttl: style.ttl,
text_mask: textMaskTensor
});
const textEmbTensor = textEncResult.text_emb;
let { noisyLatent, latentMask } = this.sampleNoisyLatent(durOnnx);
const latentShape = [bsz, noisyLatent[0].length, noisyLatent[0][0].length];
const latentMaskShape = [bsz, 1, latentMask[0][0].length];
const latentMaskTensor = arrayToTensor(latentMask, latentMaskShape);
const totalStepArray = new Array(bsz).fill(totalStep);
const scalarShape = [bsz];
const totalStepTensor = arrayToTensor(totalStepArray, scalarShape);
for (let step = 0; step < totalStep; step++) {
const currentStepArray = new Array(bsz).fill(step);
const vectorEstResult = await this.vectorEstOrt.run({
noisy_latent: arrayToTensor(noisyLatent, latentShape),
text_emb: textEmbTensor,
style_ttl: style.ttl,
text_mask: textMaskTensor,
latent_mask: latentMaskTensor,
total_step: totalStepTensor,
current_step: arrayToTensor(currentStepArray, scalarShape)
});
const denoisedLatent = Array.from(vectorEstResult.denoised_latent.data);
// Update latent with the denoised output
let idx = 0;
for (let b = 0; b < noisyLatent.length; b++) {
for (let d = 0; d < noisyLatent[b].length; d++) {
for (let t = 0; t < noisyLatent[b][d].length; t++) {
noisyLatent[b][d][t] = denoisedLatent[idx++];
}
}
}
}
const vocoderResult = await this.vocoderOrt.run({
latent: arrayToTensor(noisyLatent, latentShape)
});
const wav = Array.from(vocoderResult.wav_tts.data);
return { wav, duration: durOnnx };
}
async call(text, lang, style, totalStep, speed = 1.05, silenceDuration = 0.3) {
if (style.ttl.dims[0] !== 1) {
throw new Error('Single speaker text to speech only supports single style');
}
const maxLen = lang === 'ko' ? 120 : 300;
const textList = chunkText(text, maxLen);
let wavCat = null;
let durCat = 0;
for (const chunk of textList) {
const { wav, duration } = await this._infer([chunk], [lang], style, totalStep, speed);
if (wavCat === null) {
wavCat = wav;
durCat = duration[0];
} else {
const silenceLen = Math.floor(silenceDuration * this.sampleRate);
const silence = new Array(silenceLen).fill(0);
wavCat = [...wavCat, ...silence, ...wav];
durCat += duration[0] + silenceDuration;
}
}
return { wav: wavCat, duration: [durCat] };
}
async batch(textList, langList, style, totalStep, speed = 1.05) {
return await this._infer(textList, langList, style, totalStep, speed);
}
}
/**
* Convert lengths to binary mask
*/
function lengthToMask(lengths, maxLen = null) {
maxLen = maxLen || Math.max(...lengths);
const mask = [];
for (let i = 0; i < lengths.length; i++) {
const row = [];
for (let j = 0; j < maxLen; j++) {
row.push(j < lengths[i] ? 1.0 : 0.0);
}
mask.push([row]); // [B, 1, maxLen]
}
return mask;
}
/**
* Get latent mask from wav lengths
*/
function getLatentMask(wavLengths, baseChunkSize, chunkCompressFactor) {
const latentSize = baseChunkSize * chunkCompressFactor;
const latentLengths = wavLengths.map(len =>
Math.floor((len + latentSize - 1) / latentSize)
);
return lengthToMask(latentLengths);
}
/**
* Load ONNX model
*/
async function loadOnnx(onnxPath, opts) {
return await ort.InferenceSession.create(onnxPath, opts);
}
/**
* Load all ONNX models for TTS
*/
async function loadOnnxAll(onnxDir, opts) {
const dpPath = path.join(onnxDir, 'duration_predictor.onnx');
const textEncPath = path.join(onnxDir, 'text_encoder.onnx');
const vectorEstPath = path.join(onnxDir, 'vector_estimator.onnx');
const vocoderPath = path.join(onnxDir, 'vocoder.onnx');
const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = await Promise.all([
loadOnnx(dpPath, opts),
loadOnnx(textEncPath, opts),
loadOnnx(vectorEstPath, opts),
loadOnnx(vocoderPath, opts)
]);
return { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt };
}
/**
* Load configuration
*/
function loadCfgs(onnxDir) {
const cfgPath = path.join(onnxDir, 'tts.json');
const cfgs = JSON.parse(fs.readFileSync(cfgPath, 'utf8'));
return cfgs;
}
/**
* Load text processor
*/
function loadTextProcessor(onnxDir) {
const unicodeIndexerPath = path.join(onnxDir, 'unicode_indexer.json');
const textProcessor = new UnicodeProcessor(unicodeIndexerPath);
return textProcessor;
}
/**
* Load voice style from JSON file
*/
export function loadVoiceStyle(voiceStylePaths, verbose = false) {
const bsz = voiceStylePaths.length;
// Read first file to get dimensions
const firstStyle = JSON.parse(fs.readFileSync(voiceStylePaths[0], 'utf8'));
const ttlDims = firstStyle.style_ttl.dims;
const dpDims = firstStyle.style_dp.dims;
const ttlDim1 = ttlDims[1];
const ttlDim2 = ttlDims[2];
const dpDim1 = dpDims[1];
const dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
const ttlSize = bsz * ttlDim1 * ttlDim2;
const dpSize = bsz * dpDim1 * dpDim2;
const ttlFlat = new Float32Array(ttlSize);
const dpFlat = new Float32Array(dpSize);
// Fill in the data
for (let i = 0; i < bsz; i++) {
const voiceStyle = JSON.parse(fs.readFileSync(voiceStylePaths[i], 'utf8'));
const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
const ttlOffset = i * ttlDim1 * ttlDim2;
ttlFlat.set(ttlData, ttlOffset);
const dpData = voiceStyle.style_dp.data.flat(Infinity);
const dpOffset = i * dpDim1 * dpDim2;
dpFlat.set(dpData, dpOffset);
}
const ttlStyle = new ort.Tensor('float32', ttlFlat, [bsz, ttlDim1, ttlDim2]);
const dpStyle = new ort.Tensor('float32', dpFlat, [bsz, dpDim1, dpDim2]);
if (verbose) {
console.log(`Loaded ${bsz} voice styles`);
}
return new Style(ttlStyle, dpStyle);
}
/**
* Load text to speech components
*/
export async function loadTextToSpeech(onnxDir, useGpu = false) {
const opts = {};
if (useGpu) {
throw new Error('GPU mode is not supported yet');
} else {
console.log('Using CPU for inference');
}
const cfgs = loadCfgs(onnxDir);
const { dpOrt, textEncOrt, vectorEstOrt, vocoderOrt } = await loadOnnxAll(onnxDir, opts);
const textProcessor = loadTextProcessor(onnxDir);
const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
return textToSpeech;
}
/**
* Convert 3D array to ONNX tensor
*/
function arrayToTensor(array, dims) {
// Flatten the array
const flat = array.flat(Infinity);
return new ort.Tensor('float32', Float32Array.from(flat), dims);
}
/**
* Convert 2D int array to ONNX tensor
*/
function intArrayToTensor(array, dims) {
const flat = array.flat(Infinity);
return new ort.Tensor('int64', BigInt64Array.from(flat.map(x => BigInt(x))), dims);
}
/**
* Write WAV file
*/
export function writeWavFile(filename, audioData, sampleRate) {
const numChannels = 1;
const bitsPerSample = 16;
const byteRate = sampleRate * numChannels * bitsPerSample / 8;
const blockAlign = numChannels * bitsPerSample / 8;
const dataSize = audioData.length * bitsPerSample / 8;
const buffer = Buffer.alloc(44 + dataSize);
// RIFF header
buffer.write('RIFF', 0);
buffer.writeUInt32LE(36 + dataSize, 4);
buffer.write('WAVE', 8);
// fmt chunk
buffer.write('fmt ', 12);
buffer.writeUInt32LE(16, 16); // fmt chunk size
buffer.writeUInt16LE(1, 20); // audio format (PCM)
buffer.writeUInt16LE(numChannels, 22);
buffer.writeUInt32LE(sampleRate, 24);
buffer.writeUInt32LE(byteRate, 28);
buffer.writeUInt16LE(blockAlign, 32);
buffer.writeUInt16LE(bitsPerSample, 34);
// data chunk
buffer.write('data', 36);
buffer.writeUInt32LE(dataSize, 40);
// Write audio data
for (let i = 0; i < audioData.length; i++) {
const sample = Math.max(-1, Math.min(1, audioData[i]));
const intSample = Math.floor(sample * 32767);
buffer.writeInt16LE(intSample, 44 + i * 2);
}
fs.writeFileSync(filename, buffer);
}
/**
* Timer utility for measuring execution time
*/
export async function timer(name, fn) {
const start = Date.now();
console.log(`${name}...`);
const result = await fn();
const elapsed = ((Date.now() - start) / 1000).toFixed(2);
console.log(` -> ${name} completed in ${elapsed} sec`);
return result;
}
/**
* Sanitize filename by replacing non-alphanumeric characters with underscores (supports Unicode)
*/
export function sanitizeFilename(text, maxLen) {
const prefix = text.substring(0, maxLen);
// \p{L} matches any Unicode letter, \p{N} matches any Unicode number
return prefix.replace(/[^\p{L}\p{N}_]/gu, '_');
}
/**
* Chunk text into manageable segments
*/
function chunkText(text, maxLen = 300) {
if (typeof text !== 'string') {
throw new Error(`chunkText expects a string, got ${typeof text}`);
}
// Split by paragraph (two or more newlines)
const paragraphs = text.trim().split(/\n\s*\n+/).filter(p => p.trim());
const chunks = [];
for (let paragraph of paragraphs) {
paragraph = paragraph.trim();
if (!paragraph) continue;
// Split by sentence boundaries (period, question mark, exclamation mark followed by space)
// But exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
const sentences = paragraph.split(/(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+/);
let currentChunk = "";
for (let sentence of sentences) {
if (currentChunk.length + sentence.length + 1 <= maxLen) {
currentChunk += (currentChunk ? " " : "") + sentence;
} else {
if (currentChunk) {
chunks.push(currentChunk.trim());
}
currentChunk = sentence;
}
}
if (currentChunk) {
chunks.push(currentChunk.trim());
}
}
return chunks;
}

nodejs/package.json Normal file

@@ -0,0 +1,26 @@
{
"name": "tts-onnx-nodejs",
"version": "1.0.0",
"description": "TTS inference using ONNX Runtime for Node.js",
"main": "example_onnx.js",
"type": "module",
"scripts": {
"start": "node example_onnx.js"
},
"keywords": [
"tts",
"onnx",
"speech-synthesis",
"nodejs"
],
"author": "",
"license": "MIT",
"dependencies": {
"fft.js": "^4.0.3",
"js-yaml": "^4.1.0",
"onnxruntime-node": "^1.19.2"
},
"engines": {
"node": ">=16.0.0"
}
}

py/README.md Normal file

@@ -0,0 +1,145 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx.py`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added `supertonic` PyPI package! Install via `pip install supertonic` for a streamlined experience. This is a separate usage method from the ONNX examples in this directory. For more details, visit [supertonic-py documentation](https://supertone-inc.github.io/supertonic-py) and see `example_pypi.py` for usage.
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed. Adjust the speed factor to make speech faster or slower while maintaining natural quality.
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
This project uses [uv](https://docs.astral.sh/uv/) for fast package management.
### Install uv (if not already installed)
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### Install dependencies
```bash
uv sync
```
Or if you prefer using traditional pip with requirements.txt:
```bash
pip install -r requirements.txt
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
uv run example_onnx.py
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
uv run example_onnx.py \
--voice-style assets/voice_styles/M1.json assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange." "오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 좋아서 한참을 멈춰 서서 들었어요." \
--lang en ko \
--batch
```
This will:
- Use `--batch` flag to enable batch processing mode
- Generate speech for 2 different voice-text pairs
- Use male voice style (M1.json) for the first English text
- Use female voice style (F1.json) for the second Korean text
- Process both samples in a single batch (automatic text chunking disabled)
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
uv run example_onnx.py \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
For long texts, the system automatically chunks the text into manageable segments and generates a single audio file:
```bash
uv run example_onnx.py \
--voice-style assets/voice_styles/M1.json \
--text "Once upon a time, in a small village nestled between rolling hills, there lived a young artist named Clara. Every morning, she would wake up before dawn to capture the first light of day. The golden rays streaming through her window inspired countless paintings. Her work was known throughout the region for its vibrant colors and emotional depth. People from far and wide came to see her gallery, and many said her paintings could tell stories that words never could."
```
This will:
- Automatically split the long text into smaller chunks (max 300 characters by default)
- Process each chunk separately while maintaining natural speech flow
- Insert brief silences (0.3 seconds) between chunks for natural pacing
- Combine all chunks into a single output audio file
**Note**: When using batch mode (`--batch`), automatic text chunking is disabled. Use non-batch mode for long-form text synthesis.
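The chunk-and-join behavior described above can be sketched in a few lines. This is a minimal illustration, not the library's API: `join_chunks` is a hypothetical name, and `SAMPLE_RATE` is an assumed constant (the real rate comes from the model config in `tts.json`):

```python
import numpy as np

SAMPLE_RATE = 44100  # assumption for illustration; the real value comes from tts.json


def join_chunks(chunk_wavs, chunk_durs, silence_duration=0.3, sample_rate=SAMPLE_RATE):
    """Concatenate per-chunk audio, inserting a brief silence between chunks."""
    wav_cat, dur_cat = None, 0.0
    silence = np.zeros(int(silence_duration * sample_rate), dtype=np.float32)
    for wav, dur in zip(chunk_wavs, chunk_durs):
        if wav_cat is None:
            wav_cat, dur_cat = wav, dur
        else:
            wav_cat = np.concatenate([wav_cat, silence, wav])
            dur_cat += dur + silence_duration
    return wav_cat, dur_cat
```

Each chunk is synthesized independently, so memory use stays bounded regardless of input length.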
### Example 5: Adjusting Speech Speed
Control the speed of speech synthesis:
```bash
# Faster speech (speed > 1.0)
uv run example_onnx.py \
--voice-style assets/voice_styles/F2.json \
--text "This text will be synthesized at a faster pace." \
--speed 1.2
# Slower speech (speed < 1.0)
uv run example_onnx.py \
--voice-style assets/voice_styles/M2.json \
--text "This text will be synthesized at a slower, more deliberate pace." \
--speed 0.9
```
This will:
- Use `--speed 1.2` to generate faster speech
- Use `--speed 0.9` to generate slower speech
- Default speed is 1.05 if not specified
- Recommended speed range is between 0.9 and 1.5 for natural-sounding results
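Internally, the duration predictor's output is divided by the speed factor before the latent is sampled, so the effect on playback time is easy to reason about (illustrative arithmetic only; `effective_duration` is not part of the API):

```python
def effective_duration(predicted_sec: float, speed: float = 1.05) -> float:
    """The predicted duration is divided by the speed factor before synthesis."""
    return predicted_sec / speed
```

For example, a chunk predicted at 10 seconds plays in 8 seconds at `--speed 1.25`.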
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (not supported yet; raises an error) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--speed` | float | 1.05 | Speech speed factor (higher = faster, lower = slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--lang` | str+ | `en` | Language(s) for text(s): `en`, `ko`, `es`, `pt`, `fr` |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (disables automatic text chunking) |
## Notes
- **Batch Processing**: The number of `--voice-style` files must match the number of `--text` entries
- **Multilingual Support**: Use `--lang` to specify language(s). Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Long-Form Inference**: Without `--batch` flag, long texts are automatically chunked and combined into a single audio file with natural pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
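The quality-vs-speed trade-off comes from the iterative denoising loop: the vector estimator runs once per step, refining the latent each time, so inference cost grows linearly with `--total-step`. A minimal sketch of that loop, with a stand-in `estimator` callable in place of the real ONNX session:

```python
import numpy as np


def denoise(noisy_latent, estimator, total_step):
    """Call the estimator total_step times; each call returns a refined latent."""
    latent = noisy_latent
    for step in range(total_step):
        latent = estimator(latent, step, total_step)
    return latent
```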

py/example_onnx.py Normal file

@@ -0,0 +1,116 @@
import argparse
import os
import soundfile as sf
from helper import load_text_to_speech, timer, sanitize_filename, load_voice_style
def parse_args():
parser = argparse.ArgumentParser(description="TTS Inference with ONNX")
# Device settings
parser.add_argument(
"--use-gpu", action="store_true", help="Use GPU for inference (default: CPU)"
)
# Model settings
parser.add_argument(
"--onnx-dir",
type=str,
default="assets/onnx",
help="Path to ONNX model directory",
)
# Synthesis parameters
parser.add_argument(
"--total-step", type=int, default=5, help="Number of denoising steps"
)
parser.add_argument(
"--speed",
type=float,
default=1.05,
help="Speech speed (default: 1.05, higher = faster)",
)
parser.add_argument(
"--n-test", type=int, default=4, help="Number of times to generate"
)
# Batch processing
parser.add_argument("--batch", action="store_true", help="Batch processing")
# Input/Output
parser.add_argument(
"--voice-style",
type=str,
nargs="+",
default=["assets/voice_styles/M1.json"],
help="Voice style file path(s). Can specify multiple files for batch processing",
)
parser.add_argument(
"--text",
type=str,
nargs="+",
default=[
"This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
],
help="Text(s) to synthesize. Can specify multiple texts for batch processing",
)
parser.add_argument(
"--lang",
type=str,
nargs="+",
default=["en"],
help="Language(s) of the text(s). Can specify multiple languages for batch processing",
)
parser.add_argument(
"--save-dir", type=str, default="results", help="Output directory"
)
return parser.parse_args()
print("=== TTS Inference with ONNX Runtime (Python) ===\n")
# --- 1. Parse arguments --- #
args = parse_args()
total_step = args.total_step
speed = args.speed
n_test = args.n_test
save_dir = args.save_dir
voice_style_paths = args.voice_style
text_list = args.text
lang_list = args.lang
batch = args.batch
assert len(voice_style_paths) == len(
text_list
), f"Number of voice styles ({len(voice_style_paths)}) must match number of texts ({len(text_list)})"
bsz = len(voice_style_paths)
# --- 2. Load Text to Speech --- #
text_to_speech = load_text_to_speech(args.onnx_dir, args.use_gpu)
# --- 3. Load Voice Style --- #
style = load_voice_style(voice_style_paths, verbose=True)
# --- 4. Synthesize Speech --- #
for n in range(n_test):
print(f"\n[{n+1}/{n_test}] Starting synthesis...")
with timer("Generating speech from text"):
if batch:
wav, duration = text_to_speech.batch(
text_list, lang_list, style, total_step, speed
)
else:
wav, duration = text_to_speech(
text_list[0], lang_list[0], style, total_step, speed
)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
for b in range(bsz):
fname = f"{sanitize_filename(text_list[b], 20)}_{n+1}.wav"
w = wav[b, : int(text_to_speech.sample_rate * duration[b].item())] # [T_trim]
sf.write(os.path.join(save_dir, fname), w, text_to_speech.sample_rate)
print(f"Saved: {save_dir}/{fname}")
print("\n=== Synthesis completed successfully! ===")

py/example_pypi.py Normal file

@@ -0,0 +1,16 @@
from supertonic import TTS
# Note: First run downloads model automatically (~260MB)
tts = TTS(auto_download=True)
# Get a voice style
style = tts.get_voice_style(voice_name="M4")
# Generate speech
text = "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
wav, duration = tts.synthesize(text, voice_style=style)
# wav: np.ndarray, shape = (1, num_samples)
# duration: np.ndarray, shape = (1,)
# Save to file
tts.save_audio(wav, "results/example_pypi.wav")

py/helper.py Normal file

@@ -0,0 +1,429 @@
import json
import os
import time
from contextlib import contextmanager
from typing import Optional
from unicodedata import normalize
import numpy as np
import onnxruntime as ort
import re
AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]
class UnicodeProcessor:
def __init__(self, unicode_indexer_path: str):
with open(unicode_indexer_path, "r") as f:
self.indexer = json.load(f)
def _preprocess_text(self, text: str, lang: str) -> str:
# TODO: Need advanced normalizer for better performance
text = normalize("NFKD", text)
# Remove emojis (wide Unicode range)
emoji_pattern = re.compile(
"[\U0001f600-\U0001f64f" # emoticons
"\U0001f300-\U0001f5ff" # symbols & pictographs
"\U0001f680-\U0001f6ff" # transport & map symbols
"\U0001f700-\U0001f77f"
"\U0001f780-\U0001f7ff"
"\U0001f800-\U0001f8ff"
"\U0001f900-\U0001f9ff"
"\U0001fa00-\U0001fa6f"
"\U0001fa70-\U0001faff"
"\u2600-\u26ff"
"\u2700-\u27bf"
"\U0001f1e6-\U0001f1ff]+",
flags=re.UNICODE,
)
text = emoji_pattern.sub("", text)
# Replace various dashes and symbols
replacements = {
"": "-",
"": "-",
"": "-",
"_": " ",
"\u201c": '"', # left double quote "
"\u201d": '"', # right double quote "
"\u2018": "'", # left single quote '
"\u2019": "'", # right single quote '
"´": "'",
"`": "'",
"[": " ",
"]": " ",
"|": " ",
"/": " ",
"#": " ",
"": " ",
"": " ",
}
for k, v in replacements.items():
text = text.replace(k, v)
# Remove special symbols
text = re.sub(r"[♥☆♡©\\]", "", text)
# Replace known expressions
expr_replacements = {
"@": " at ",
"e.g.,": "for example, ",
"i.e.,": "that is, ",
}
for k, v in expr_replacements.items():
text = text.replace(k, v)
# Fix spacing around punctuation
text = re.sub(r" ,", ",", text)
text = re.sub(r" \.", ".", text)
text = re.sub(r" !", "!", text)
text = re.sub(r" \?", "?", text)
text = re.sub(r" ;", ";", text)
text = re.sub(r" :", ":", text)
text = re.sub(r" '", "'", text)
# Remove duplicate quotes
while '""' in text:
text = text.replace('""', '"')
while "''" in text:
text = text.replace("''", "'")
while "``" in text:
text = text.replace("``", "`")
# Remove extra spaces
text = re.sub(r"\s+", " ", text).strip()
# If text doesn't end with punctuation, quotes, or closing brackets, add a period
if not re.search(r"[.!?;:,'\"')\]}…。」』】〉》›»]$", text):
text += "."
if lang not in AVAILABLE_LANGS:
raise ValueError(f"Invalid language: {lang}")
text = f"<{lang}>" + text + f"</{lang}>"
return text
def _get_text_mask(self, text_ids_lengths: np.ndarray) -> np.ndarray:
text_mask = length_to_mask(text_ids_lengths)
return text_mask
def _text_to_unicode_values(self, text: str) -> np.ndarray:
unicode_values = np.array(
[ord(char) for char in text], dtype=np.uint16
) # 2 bytes
return unicode_values
def __call__(
self, text_list: list[str], lang_list: list[str]
) -> tuple[np.ndarray, np.ndarray]:
text_list = [
self._preprocess_text(t, lang) for t, lang in zip(text_list, lang_list)
]
text_ids_lengths = np.array([len(text) for text in text_list], dtype=np.int64)
text_ids = np.zeros((len(text_list), text_ids_lengths.max()), dtype=np.int64)
for i, text in enumerate(text_list):
unicode_vals = self._text_to_unicode_values(text)
text_ids[i, : len(unicode_vals)] = np.array(
[self.indexer[val] for val in unicode_vals], dtype=np.int64
)
text_mask = self._get_text_mask(text_ids_lengths)
return text_ids, text_mask
class Style:
def __init__(self, style_ttl_onnx: np.ndarray, style_dp_onnx: np.ndarray):
self.ttl = style_ttl_onnx
self.dp = style_dp_onnx
class TextToSpeech:
def __init__(
self,
cfgs: dict,
text_processor: UnicodeProcessor,
dp_ort: ort.InferenceSession,
text_enc_ort: ort.InferenceSession,
vector_est_ort: ort.InferenceSession,
vocoder_ort: ort.InferenceSession,
):
self.cfgs = cfgs
self.text_processor = text_processor
self.dp_ort = dp_ort
self.text_enc_ort = text_enc_ort
self.vector_est_ort = vector_est_ort
self.vocoder_ort = vocoder_ort
self.sample_rate = cfgs["ae"]["sample_rate"]
self.base_chunk_size = cfgs["ae"]["base_chunk_size"]
self.chunk_compress_factor = cfgs["ttl"]["chunk_compress_factor"]
self.ldim = cfgs["ttl"]["latent_dim"]
def sample_noisy_latent(
self, duration: np.ndarray
) -> tuple[np.ndarray, np.ndarray]:
bsz = len(duration)
wav_len_max = duration.max() * self.sample_rate
wav_lengths = (duration * self.sample_rate).astype(np.int64)
chunk_size = self.base_chunk_size * self.chunk_compress_factor
latent_len = ((wav_len_max + chunk_size - 1) / chunk_size).astype(np.int32)
latent_dim = self.ldim * self.chunk_compress_factor
noisy_latent = np.random.randn(bsz, latent_dim, latent_len).astype(np.float32)
latent_mask = get_latent_mask(
wav_lengths, self.base_chunk_size, self.chunk_compress_factor
)
noisy_latent = noisy_latent * latent_mask
return noisy_latent, latent_mask
def _infer(
self,
text_list: list[str],
lang_list: list[str],
style: Style,
total_step: int,
speed: float = 1.05,
) -> tuple[np.ndarray, np.ndarray]:
assert (
len(text_list) == style.ttl.shape[0]
), "Number of texts must match number of style vectors"
bsz = len(text_list)
text_ids, text_mask = self.text_processor(text_list, lang_list)
dur_onnx, *_ = self.dp_ort.run(
None, {"text_ids": text_ids, "style_dp": style.dp, "text_mask": text_mask}
)  # dur_onnx: [bsz]
dur_onnx = dur_onnx / speed
text_emb_onnx, *_ = self.text_enc_ort.run(
None,
{"text_ids": text_ids, "style_ttl": style.ttl, "text_mask": text_mask},
)
xt, latent_mask = self.sample_noisy_latent(dur_onnx)
total_step_np = np.array([total_step] * bsz, dtype=np.float32)
for step in range(total_step):
current_step = np.array([step] * bsz, dtype=np.float32)
xt, *_ = self.vector_est_ort.run(
None,
{
"noisy_latent": xt,
"text_emb": text_emb_onnx,
"style_ttl": style.ttl,
"text_mask": text_mask,
"latent_mask": latent_mask,
"current_step": current_step,
"total_step": total_step_np,
},
)
wav, *_ = self.vocoder_ort.run(None, {"latent": xt})
return wav, dur_onnx
def __call__(
self,
text: str,
lang: str,
style: Style,
total_step: int,
speed: float = 1.05,
silence_duration: float = 0.3,
) -> tuple[np.ndarray, np.ndarray]:
assert (
style.ttl.shape[0] == 1
), "Single speaker text to speech only supports single style"
max_len = 120 if lang == "ko" else 300
text_list = chunk_text(text, max_len=max_len)
wav_cat = None
dur_cat = None
for chunk in text_list:
wav, dur_onnx = self._infer([chunk], [lang], style, total_step, speed)
if wav_cat is None:
wav_cat = wav
dur_cat = dur_onnx
else:
silence = np.zeros(
(1, int(silence_duration * self.sample_rate)), dtype=np.float32
)
wav_cat = np.concatenate([wav_cat, silence, wav], axis=1)
dur_cat += dur_onnx + silence_duration
return wav_cat, dur_cat
def batch(
self,
text_list: list[str],
lang_list: list[str],
style: Style,
total_step: int,
speed: float = 1.05,
) -> tuple[np.ndarray, np.ndarray]:
return self._infer(text_list, lang_list, style, total_step, speed)
def length_to_mask(lengths: np.ndarray, max_len: Optional[int] = None) -> np.ndarray:
"""
Convert lengths to binary mask.
Args:
lengths: (B,)
max_len: int
Returns:
mask: (B, 1, max_len)
"""
max_len = max_len or lengths.max()
ids = np.arange(0, max_len)
mask = (ids < np.expand_dims(lengths, axis=1)).astype(np.float32)
return mask.reshape(-1, 1, max_len)
def get_latent_mask(
wav_lengths: np.ndarray, base_chunk_size: int, chunk_compress_factor: int
) -> np.ndarray:
latent_size = base_chunk_size * chunk_compress_factor
latent_lengths = (wav_lengths + latent_size - 1) // latent_size
latent_mask = length_to_mask(latent_lengths)
return latent_mask
def load_onnx(
onnx_path: str, opts: ort.SessionOptions, providers: list[str]
) -> ort.InferenceSession:
return ort.InferenceSession(onnx_path, sess_options=opts, providers=providers)
def load_onnx_all(
onnx_dir: str, opts: ort.SessionOptions, providers: list[str]
) -> tuple[
ort.InferenceSession,
ort.InferenceSession,
ort.InferenceSession,
ort.InferenceSession,
]:
dp_onnx_path = os.path.join(onnx_dir, "duration_predictor.onnx")
text_enc_onnx_path = os.path.join(onnx_dir, "text_encoder.onnx")
vector_est_onnx_path = os.path.join(onnx_dir, "vector_estimator.onnx")
vocoder_onnx_path = os.path.join(onnx_dir, "vocoder.onnx")
dp_ort = load_onnx(dp_onnx_path, opts, providers)
text_enc_ort = load_onnx(text_enc_onnx_path, opts, providers)
vector_est_ort = load_onnx(vector_est_onnx_path, opts, providers)
vocoder_ort = load_onnx(vocoder_onnx_path, opts, providers)
return dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
def load_cfgs(onnx_dir: str) -> dict:
cfg_path = os.path.join(onnx_dir, "tts.json")
with open(cfg_path, "r") as f:
cfgs = json.load(f)
return cfgs
def load_text_processor(onnx_dir: str) -> UnicodeProcessor:
unicode_indexer_path = os.path.join(onnx_dir, "unicode_indexer.json")
text_processor = UnicodeProcessor(unicode_indexer_path)
return text_processor
def load_text_to_speech(onnx_dir: str, use_gpu: bool = False) -> TextToSpeech:
opts = ort.SessionOptions()
if use_gpu:
raise NotImplementedError("GPU mode is not fully tested")
else:
providers = ["CPUExecutionProvider"]
print("Using CPU for inference")
cfgs = load_cfgs(onnx_dir)
dp_ort, text_enc_ort, vector_est_ort, vocoder_ort = load_onnx_all(
onnx_dir, opts, providers
)
text_processor = load_text_processor(onnx_dir)
return TextToSpeech(
cfgs, text_processor, dp_ort, text_enc_ort, vector_est_ort, vocoder_ort
)
def load_voice_style(voice_style_paths: list[str], verbose: bool = False) -> Style:
bsz = len(voice_style_paths)
# Read first file to get dimensions
with open(voice_style_paths[0], "r") as f:
first_style = json.load(f)
ttl_dims = first_style["style_ttl"]["dims"]
dp_dims = first_style["style_dp"]["dims"]
# Pre-allocate arrays with full batch size
ttl_style = np.zeros([bsz, ttl_dims[1], ttl_dims[2]], dtype=np.float32)
dp_style = np.zeros([bsz, dp_dims[1], dp_dims[2]], dtype=np.float32)
# Fill in the data
for i, voice_style_path in enumerate(voice_style_paths):
with open(voice_style_path, "r") as f:
voice_style = json.load(f)
ttl_data = np.array(
voice_style["style_ttl"]["data"], dtype=np.float32
).flatten()
ttl_style[i] = ttl_data.reshape(ttl_dims[1], ttl_dims[2])
dp_data = np.array(voice_style["style_dp"]["data"], dtype=np.float32).flatten()
dp_style[i] = dp_data.reshape(dp_dims[1], dp_dims[2])
if verbose:
print(f"Loaded {bsz} voice styles")
return Style(ttl_style, dp_style)
@contextmanager
def timer(name: str):
start = time.time()
print(f"{name}...")
yield
print(f" -> {name} completed in {time.time() - start:.2f} sec")
def sanitize_filename(text: str, max_len: int) -> str:
"""Sanitize filename by replacing non-alphanumeric characters with underscores (supports Unicode)"""
import re
prefix = text[:max_len]
# \w matches Unicode word characters (letters, digits, underscore) with re.UNICODE
# We replace non-word characters except keeping existing underscores
return re.sub(r"[^\w]", "_", prefix, flags=re.UNICODE)
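For example (the function is repeated so the snippet runs standalone):

```python
import re

def sanitize_filename(text: str, max_len: int) -> str:
    # Same logic as above: truncate, then replace non-word chars with "_".
    prefix = text[:max_len]
    return re.sub(r"[^\w]", "_", prefix, flags=re.UNICODE)

print(sanitize_filename("Hello, world! How are you?", 20))  # Hello__world__How_ar
print(sanitize_filename("오늘 아침에 공원을 산책했는데", 8))    # 오늘_아침에_공
```

Because `\w` matches Unicode letters and digits, Korean or accented text survives sanitization instead of collapsing to underscores.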
def chunk_text(text: str, max_len: int = 300) -> list[str]:
"""
Split text into chunks by paragraphs and sentences.
Args:
text: Input text to chunk
max_len: Maximum length of each chunk (default: 300)
Returns:
List of text chunks
"""
import re
# Split by paragraph (two or more newlines)
paragraphs = [p.strip() for p in re.split(r"\n\s*\n+", text.strip()) if p.strip()]
chunks = []
for paragraph in paragraphs:
paragraph = paragraph.strip()
if not paragraph:
continue
# Split by sentence boundaries (period, question mark, exclamation mark followed by space)
# But exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
pattern = r"(?<!Mr\.)(?<!Mrs\.)(?<!Ms\.)(?<!Dr\.)(?<!Prof\.)(?<!Sr\.)(?<!Jr\.)(?<!Ph\.D\.)(?<!etc\.)(?<!e\.g\.)(?<!i\.e\.)(?<!vs\.)(?<!Inc\.)(?<!Ltd\.)(?<!Co\.)(?<!Corp\.)(?<!St\.)(?<!Ave\.)(?<!Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+"
sentences = re.split(pattern, paragraph)
current_chunk = ""
for sentence in sentences:
if len(current_chunk) + len(sentence) + 1 <= max_len:
current_chunk += (" " if current_chunk else "") + sentence
else:
if current_chunk:
chunks.append(current_chunk.strip())
current_chunk = sentence
if current_chunk:
chunks.append(current_chunk.strip())
return chunks
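The greedy sentence-packing step at the core of `chunk_text` can be illustrated with a naive sentence splitter (the real function above additionally uses lookbehinds to skip abbreviations such as "Dr." and "e.g."):

```python
import re

def pack_sentences(paragraph: str, max_len: int) -> list[str]:
    # Naive boundary split; chunk_text guards abbreviations on top of this.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_len:
            current += (" " if current else "") + sentence
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

para = "One two three. Four five. Six seven eight nine."
print(pack_sentences(para, 25))
# ['One two three. Four five.', 'Six seven eight nine.']
```

Sentences are packed into a chunk until adding the next one would exceed `max_len`, at which point the chunk is flushed and a new one starts.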

20
py/pyproject.toml Normal file

@@ -0,0 +1,20 @@
[project]
name = "tts-onnx"
version = "1.0.0"
description = "TTS ONNX Inference"
requires-python = ">=3.10"
dependencies = [
"onnxruntime==1.23.1",
"numpy>=1.26.0",
"soundfile>=0.12.1",
"librosa>=0.10.0",
"PyYAML>=6.0",
]
[tool.setuptools]
py-modules = []
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

5
py/requirements.txt Normal file

@@ -0,0 +1,5 @@
onnxruntime==1.23.1
numpy>=1.26.0
soundfile>=0.12.1
librosa>=0.10.0
PyYAML>=6.0

1142
py/uv.lock generated Normal file

File diff suppressed because it is too large

21
rust/.gitignore vendored Normal file

@@ -0,0 +1,21 @@
# Rust build artifacts
/target/
Cargo.lock
# Output directory
/results/
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Debug
*.pdb

44
rust/Cargo.toml Normal file

@@ -0,0 +1,44 @@
[package]
name = "supertonic-tts"
version = "0.1.0"
edition = "2021"
[dependencies]
# ONNX Runtime
ort = "2.0.0-rc.7"
# Array processing (like NumPy)
ndarray = { version = "0.16", features = ["rayon"] }
rand = "0.8"
rand_distr = "0.4"
# Parallel processing
rayon = "1.10"
# Audio processing
hound = "3.5"
rustfft = "6.2"
# JSON serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
# CLI argument parsing
clap = { version = "4.5", features = ["derive"] }
# Error handling
anyhow = "1.0"
# Unicode normalization
unicode-normalization = "0.1"
# Regular expressions
regex = "1.10"
# System calls
libc = "0.2"
[[bin]]
name = "example_onnx"
path = "src/example_onnx.rs"

146
rust/README.md Normal file

@@ -0,0 +1,146 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using Rust.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
This project uses [Cargo](https://doc.rust-lang.org/cargo/) for package management.
### Install Rust (if not already installed)
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Build the project
```bash
cargo build --release
```
## Basic Usage
You can run the inference in two ways:
1. **Using cargo run** (builds if needed, then runs)
2. **Direct binary execution** (faster if already built)
### Example 1: Default Inference
Run inference with default settings:
```bash
# Using cargo run
cargo run --release --bin example_onnx
# Or directly execute the built binary (faster)
./target/release/example_onnx
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
--batch \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
--lang en,ko
# Or using the binary directly
./target/release/example_onnx \
--batch \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
--lang en,ko
```
This will:
- Generate speech for 2 different voice-text-language pairs
- Use male voice (M1.json) for the first text in English
- Use female voice (F1.json) for the second text in Korean
- Process both samples in a single batch
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
# Or using the binary directly
./target/release/example_onnx \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
```bash
# Using cargo run
cargo run --release --bin example_onnx -- \
--voice-style assets/voice_styles/M1.json \
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
# Or using the binary directly
./target/release/example_onnx \
--voice-style assets/voice_styles/M1.json \
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
```
This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file
**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s), comma-separated |
| `--text` | str+ | (long default text) | Text(s) to synthesize, pipe-separated |
| `--lang` | str+ | `en` | Language(s) for synthesis, comma-separated (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style pairs, disables automatic chunking) |
## Notes
- **Multilingual Support**: Use `--lang` to specify the language for each text. Available: `en` (English), `ko` (Korean), `es` (Spanish), `pt` (Portuguese), `fr` (French)
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet
- **Known Issues**: On some platforms (especially macOS), there might be a mutex cleanup warning during exit. This is a known ONNX Runtime issue and doesn't affect functionality. The implementation uses `libc::_exit()` and `mem::forget()` to bypass this issue.

144
rust/src/example_onnx.rs Normal file

@@ -0,0 +1,144 @@
use anyhow::Result;
use clap::Parser;
use std::path::PathBuf;
use std::fs;
use std::mem;
mod helper;
use helper::{
load_text_to_speech, load_voice_style, timer, write_wav_file, sanitize_filename,
};
#[derive(Parser, Debug)]
#[command(name = "TTS ONNX Inference")]
#[command(about = "TTS Inference with ONNX Runtime (Rust)", long_about = None)]
struct Args {
/// Use GPU for inference (default: CPU)
#[arg(long, default_value = "false")]
use_gpu: bool,
/// Path to ONNX model directory
#[arg(long, default_value = "assets/onnx")]
onnx_dir: String,
/// Number of denoising steps
#[arg(long, default_value = "5")]
total_step: usize,
/// Speech speed factor (higher = faster)
#[arg(long, default_value = "1.05")]
speed: f32,
/// Number of times to generate
#[arg(long, default_value = "4")]
n_test: usize,
/// Voice style file path(s)
#[arg(long, value_delimiter = ',', default_values_t = vec!["assets/voice_styles/M1.json".to_string()])]
voice_style: Vec<String>,
/// Text(s) to synthesize
#[arg(long, value_delimiter = '|', default_values_t = vec!["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.".to_string()])]
text: Vec<String>,
/// Language(s) for synthesis (en, ko, es, pt, fr)
#[arg(long, value_delimiter = ',', default_values_t = vec!["en".to_string()])]
lang: Vec<String>,
/// Output directory
#[arg(long, default_value = "results")]
save_dir: String,
/// Enable batch mode (multiple text-style pairs)
#[arg(long, default_value = "false")]
batch: bool,
}
fn main() -> Result<()> {
println!("=== TTS Inference with ONNX Runtime (Rust) ===\n");
// --- 1. Parse arguments --- //
let args = Args::parse();
let total_step = args.total_step;
let speed = args.speed;
let n_test = args.n_test;
let voice_style_paths = &args.voice_style;
let text_list = &args.text;
let lang_list = &args.lang;
let save_dir = &args.save_dir;
let batch = args.batch;
if batch {
if voice_style_paths.len() != text_list.len() {
anyhow::bail!(
"Number of voice styles ({}) must match number of texts ({})",
voice_style_paths.len(),
text_list.len()
);
}
if lang_list.len() != text_list.len() {
anyhow::bail!(
"Number of languages ({}) must match number of texts ({})",
lang_list.len(),
text_list.len()
);
}
}
let bsz = voice_style_paths.len();
// --- 2. Load TTS components --- //
let mut text_to_speech = load_text_to_speech(&args.onnx_dir, args.use_gpu)?;
// --- 3. Load voice styles --- //
let style = load_voice_style(voice_style_paths, true)?;
// --- 4. Synthesize speech --- //
fs::create_dir_all(save_dir)?;
for n in 0..n_test {
println!("\n[{}/{}] Starting synthesis...", n + 1, n_test);
let (wav, duration) = if batch {
timer("Generating speech from text", || {
text_to_speech.batch(text_list, lang_list, &style, total_step, speed)
})?
} else {
let (w, d) = timer("Generating speech from text", || {
text_to_speech.call(&text_list[0], &lang_list[0], &style, total_step, speed, 0.3)
})?;
(w, vec![d])
};
// Save outputs
for i in 0..bsz {
let fname = format!("{}_{}.wav", sanitize_filename(&text_list[i], 20), n + 1);
let wav_slice = if batch {
let wav_len = wav.len() / bsz;
let actual_len = (text_to_speech.sample_rate as f32 * duration[i]) as usize;
let wav_start = i * wav_len;
let wav_end = wav_start + actual_len.min(wav_len);
&wav[wav_start..wav_end]
} else {
// For non-batch mode, wav is a single concatenated audio
let actual_len = (text_to_speech.sample_rate as f32 * duration[0]) as usize;
&wav[..actual_len.min(wav.len())]
};
let output_path = PathBuf::from(save_dir).join(&fname);
write_wav_file(&output_path, wav_slice, text_to_speech.sample_rate)?;
println!("Saved: {}", output_path.display());
}
}
println!("\n=== Synthesis completed successfully! ===");
// Prevent ONNX Runtime sessions from being dropped, which causes mutex cleanup issues
mem::forget(text_to_speech);
// Use _exit to bypass all cleanup handlers and avoid ONNX Runtime mutex issues on macOS
unsafe {
libc::_exit(0);
}
}

838
rust/src/helper.rs Normal file

@@ -0,0 +1,838 @@
// ============================================================================
// TTS Helper Module - All utility functions and structures
// ============================================================================
use ndarray::{Array, Array3};
use serde::{Deserialize, Serialize};
use serde_json;
use std::fs::File;
use std::io::BufReader;
use std::path::Path;
use anyhow::{Result, Context, bail};
use unicode_normalization::UnicodeNormalization;
use hound::{WavWriter, WavSpec, SampleFormat};
use rand_distr::{Distribution, Normal};
use regex::Regex;
// Available languages for multilingual TTS
pub const AVAILABLE_LANGS: &[&str] = &["en", "ko", "es", "pt", "fr"];
pub fn is_valid_lang(lang: &str) -> bool {
AVAILABLE_LANGS.contains(&lang)
}
// ============================================================================
// Configuration Structures
// ============================================================================
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Config {
pub ae: AEConfig,
pub ttl: TTLConfig,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AEConfig {
pub sample_rate: i32,
pub base_chunk_size: i32,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TTLConfig {
pub chunk_compress_factor: i32,
pub latent_dim: i32,
}
/// Load configuration from JSON file
pub fn load_cfgs<P: AsRef<Path>>(onnx_dir: P) -> Result<Config> {
let cfg_path = onnx_dir.as_ref().join("tts.json");
let file = File::open(cfg_path)?;
let reader = BufReader::new(file);
let cfgs: Config = serde_json::from_reader(reader)?;
Ok(cfgs)
}
// ============================================================================
// Voice Style Data Structure
// ============================================================================
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VoiceStyleData {
pub style_ttl: StyleComponent,
pub style_dp: StyleComponent,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StyleComponent {
pub data: Vec<Vec<Vec<f32>>>,
pub dims: Vec<usize>,
#[serde(rename = "type")]
pub dtype: String,
}
// ============================================================================
// Unicode Text Processor
// ============================================================================
pub struct UnicodeProcessor {
indexer: Vec<i64>,
}
impl UnicodeProcessor {
pub fn new<P: AsRef<Path>>(unicode_indexer_json_path: P) -> Result<Self> {
let file = File::open(unicode_indexer_json_path)?;
let reader = BufReader::new(file);
let indexer: Vec<i64> = serde_json::from_reader(reader)?;
Ok(UnicodeProcessor { indexer })
}
pub fn call(&self, text_list: &[String], lang_list: &[String]) -> Result<(Vec<Vec<i64>>, Array3<f32>)> {
let mut processed_texts: Vec<String> = Vec::new();
for (text, lang) in text_list.iter().zip(lang_list.iter()) {
processed_texts.push(preprocess_text(text, lang)?);
}
let text_ids_lengths: Vec<usize> = processed_texts
.iter()
.map(|t| t.chars().count())
.collect();
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
let mut text_ids = Vec::new();
for text in &processed_texts {
let mut row = vec![0i64; max_len];
let unicode_vals = text_to_unicode_values(text);
for (j, &val) in unicode_vals.iter().enumerate() {
if val < self.indexer.len() {
row[j] = self.indexer[val];
} else {
row[j] = -1;
}
}
text_ids.push(row);
}
let text_mask = get_text_mask(&text_ids_lengths);
Ok((text_ids, text_mask))
}
}
pub fn preprocess_text(text: &str, lang: &str) -> Result<String> {
// TODO: Need advanced normalizer for better performance
let mut text: String = text.nfkd().collect();
// Remove emojis (wide Unicode range)
let emoji_pattern = Regex::new(r"[\x{1F600}-\x{1F64F}\x{1F300}-\x{1F5FF}\x{1F680}-\x{1F6FF}\x{1F700}-\x{1F77F}\x{1F780}-\x{1F7FF}\x{1F800}-\x{1F8FF}\x{1F900}-\x{1F9FF}\x{1FA00}-\x{1FA6F}\x{1FA70}-\x{1FAFF}\x{2600}-\x{26FF}\x{2700}-\x{27BF}\x{1F1E6}-\x{1F1FF}]+").unwrap();
text = emoji_pattern.replace_all(&text, "").to_string();
// Replace various dashes and symbols
let replacements = [
("", "-"), // en dash
("", "-"), // non-breaking hyphen
("", "-"), // em dash
("_", " "), // underscore
("\u{201C}", "\""), // left double quote
("\u{201D}", "\""), // right double quote
("\u{2018}", "'"), // left single quote
("\u{2019}", "'"), // right single quote
("´", "'"), // acute accent
("`", "'"), // grave accent
("[", " "), // left bracket
("]", " "), // right bracket
("|", " "), // vertical bar
("/", " "), // slash
("#", " "), // hash
("", " "), // right arrow
("", " "), // left arrow
];
for (from, to) in &replacements {
text = text.replace(from, to);
}
// Remove special symbols
let special_symbols = ["", "", "", "©", "\\"];
for symbol in &special_symbols {
text = text.replace(symbol, "");
}
// Replace known expressions
let expr_replacements = [
("@", " at "),
("e.g.,", "for example, "),
("i.e.,", "that is, "),
];
for (from, to) in &expr_replacements {
text = text.replace(from, to);
}
// Fix spacing around punctuation
text = Regex::new(r" ,").unwrap().replace_all(&text, ",").to_string();
text = Regex::new(r" \.").unwrap().replace_all(&text, ".").to_string();
text = Regex::new(r" !").unwrap().replace_all(&text, "!").to_string();
text = Regex::new(r" \?").unwrap().replace_all(&text, "?").to_string();
text = Regex::new(r" ;").unwrap().replace_all(&text, ";").to_string();
text = Regex::new(r" :").unwrap().replace_all(&text, ":").to_string();
text = Regex::new(r" '").unwrap().replace_all(&text, "'").to_string();
// Remove duplicate quotes
while text.contains("\"\"") {
text = text.replace("\"\"", "\"");
}
while text.contains("''") {
text = text.replace("''", "'");
}
while text.contains("``") {
text = text.replace("``", "`");
}
// Remove extra spaces
text = Regex::new(r"\s+").unwrap().replace_all(&text, " ").to_string();
text = text.trim().to_string();
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if !text.is_empty() {
let ends_with_punct = Regex::new(r#"[.!?;:,'"\u{201C}\u{201D}\u{2018}\u{2019})\]}…。」』】〉》›»]$"#).unwrap();
if !ends_with_punct.is_match(&text) {
text.push('.');
}
}
// Validate language
if !is_valid_lang(lang) {
bail!("Invalid language: {}. Available: {:?}", lang, AVAILABLE_LANGS);
}
// Wrap text with language tags
text = format!("<{}>{}</{}>", lang, text, lang);
Ok(text)
}
pub fn text_to_unicode_values(text: &str) -> Vec<usize> {
text.chars().map(|c| c as usize).collect()
}
pub fn length_to_mask(lengths: &[usize], max_len: Option<usize>) -> Array3<f32> {
let bsz = lengths.len();
let max_len = max_len.unwrap_or_else(|| *lengths.iter().max().unwrap_or(&0));
let mut mask = Array3::<f32>::zeros((bsz, 1, max_len));
for (i, &len) in lengths.iter().enumerate() {
for j in 0..len.min(max_len) {
mask[[i, 0, j]] = 1.0;
}
}
mask
}
pub fn get_text_mask(text_ids_lengths: &[usize]) -> Array3<f32> {
let max_len = *text_ids_lengths.iter().max().unwrap_or(&0);
length_to_mask(text_ids_lengths, Some(max_len))
}
/// Sample noisy latent from normal distribution and apply mask
pub fn sample_noisy_latent(
duration: &[f32],
sample_rate: i32,
base_chunk_size: i32,
chunk_compress: i32,
latent_dim: i32,
) -> (Array3<f32>, Array3<f32>) {
let bsz = duration.len();
let max_dur = duration.iter().fold(0.0f32, |a, &b| a.max(b));
let wav_len_max = (max_dur * sample_rate as f32) as usize;
let wav_lengths: Vec<usize> = duration
.iter()
.map(|&d| (d * sample_rate as f32) as usize)
.collect();
let chunk_size = (base_chunk_size * chunk_compress) as usize;
let latent_len = (wav_len_max + chunk_size - 1) / chunk_size;
let latent_dim_val = (latent_dim * chunk_compress) as usize;
let mut noisy_latent = Array3::<f32>::zeros((bsz, latent_dim_val, latent_len));
let normal = Normal::new(0.0, 1.0).unwrap();
let mut rng = rand::thread_rng();
for b in 0..bsz {
for d in 0..latent_dim_val {
for t in 0..latent_len {
noisy_latent[[b, d, t]] = normal.sample(&mut rng);
}
}
}
let latent_lengths: Vec<usize> = wav_lengths
.iter()
.map(|&len| (len + chunk_size - 1) / chunk_size)
.collect();
let latent_mask = length_to_mask(&latent_lengths, Some(latent_len));
// Apply mask
for b in 0..bsz {
for d in 0..latent_dim_val {
for t in 0..latent_len {
noisy_latent[[b, d, t]] *= latent_mask[[b, 0, t]];
}
}
}
(noisy_latent, latent_mask)
}
// ============================================================================
// WAV File I/O
// ============================================================================
pub fn write_wav_file<P: AsRef<Path>>(
filename: P,
audio_data: &[f32],
sample_rate: i32,
) -> Result<()> {
let spec = WavSpec {
channels: 1,
sample_rate: sample_rate as u32,
bits_per_sample: 16,
sample_format: SampleFormat::Int,
};
let mut writer = WavWriter::create(filename, spec)?;
for &sample in audio_data {
let clamped = sample.max(-1.0).min(1.0);
let val = (clamped * 32767.0) as i16;
writer.write_sample(val)?;
}
writer.finalize()?;
Ok(())
}
// ============================================================================
// Text Chunking
// ============================================================================
const MAX_CHUNK_LENGTH: usize = 300;
const ABBREVIATIONS: &[&str] = &[
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
"St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
"Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D.",
];
pub fn chunk_text(text: &str, max_len: Option<usize>) -> Vec<String> {
let max_len = max_len.unwrap_or(MAX_CHUNK_LENGTH);
let text = text.trim();
if text.is_empty() {
return vec![String::new()];
}
// Split by paragraphs
let para_re = Regex::new(r"\n\s*\n").unwrap();
let paragraphs: Vec<&str> = para_re.split(text).collect();
let mut chunks = Vec::new();
for para in paragraphs {
let para = para.trim();
if para.is_empty() {
continue;
}
if para.len() <= max_len {
chunks.push(para.to_string());
continue;
}
// Split by sentences
let sentences = split_sentences(para);
let mut current = String::new();
let mut current_len = 0;
for sentence in sentences {
let sentence = sentence.trim();
if sentence.is_empty() {
continue;
}
let sentence_len = sentence.len();
if sentence_len > max_len {
// If sentence is longer than max_len, split by comma or space
if !current.is_empty() {
chunks.push(current.trim().to_string());
current.clear();
current_len = 0;
}
// Try splitting by comma
let parts: Vec<&str> = sentence.split(',').collect();
for part in parts {
let part = part.trim();
if part.is_empty() {
continue;
}
let part_len = part.len();
if part_len > max_len {
// Split by space as last resort
let words: Vec<&str> = part.split_whitespace().collect();
let mut word_chunk = String::new();
let mut word_chunk_len = 0;
for word in words {
let word_len = word.len();
if word_chunk_len + word_len + 1 > max_len && !word_chunk.is_empty() {
chunks.push(word_chunk.trim().to_string());
word_chunk.clear();
word_chunk_len = 0;
}
if !word_chunk.is_empty() {
word_chunk.push(' ');
word_chunk_len += 1;
}
word_chunk.push_str(word);
word_chunk_len += word_len;
}
if !word_chunk.is_empty() {
chunks.push(word_chunk.trim().to_string());
}
} else {
if current_len + part_len + 1 > max_len && !current.is_empty() {
chunks.push(current.trim().to_string());
current.clear();
current_len = 0;
}
if !current.is_empty() {
current.push_str(", ");
current_len += 2;
}
current.push_str(part);
current_len += part_len;
}
}
continue;
}
if current_len + sentence_len + 1 > max_len && !current.is_empty() {
chunks.push(current.trim().to_string());
current.clear();
current_len = 0;
}
if !current.is_empty() {
current.push(' ');
current_len += 1;
}
current.push_str(sentence);
current_len += sentence_len;
}
if !current.is_empty() {
chunks.push(current.trim().to_string());
}
}
if chunks.is_empty() {
vec![String::new()]
} else {
chunks
}
}
fn split_sentences(text: &str) -> Vec<String> {
// Rust's regex doesn't support lookbehind, so we use a simpler approach
// Split on sentence boundaries and then check if they're abbreviations
let re = Regex::new(r"([.!?])\s+").unwrap();
// Find all matches
let matches: Vec<_> = re.find_iter(text).collect();
if matches.is_empty() {
return vec![text.to_string()];
}
let mut sentences = Vec::new();
let mut last_end = 0;
for m in matches {
// Get the text before the punctuation
let before_punc = &text[last_end..m.start()];
// Check if this ends with an abbreviation
let mut is_abbrev = false;
for abbrev in ABBREVIATIONS {
let combined = format!("{}{}", before_punc.trim(), &text[m.start()..m.start()+1]);
if combined.ends_with(abbrev) {
is_abbrev = true;
break;
}
}
if !is_abbrev {
// This is a real sentence boundary
sentences.push(text[last_end..m.end()].to_string());
last_end = m.end();
}
}
// Add the remaining text
if last_end < text.len() {
sentences.push(text[last_end..].to_string());
}
if sentences.is_empty() {
vec![text.to_string()]
} else {
sentences
}
}
// ============================================================================
// Utility Functions
// ============================================================================
pub fn timer<F, T>(name: &str, f: F) -> Result<T>
where
F: FnOnce() -> Result<T>,
{
let start = std::time::Instant::now();
println!("{}...", name);
let result = f()?;
let elapsed = start.elapsed().as_secs_f64();
println!(" -> {} completed in {:.2} sec", name, elapsed);
Ok(result)
}
pub fn sanitize_filename(text: &str, max_len: usize) -> String {
// Take first max_len characters (Unicode code points, not bytes)
text.chars()
.take(max_len)
.map(|c| {
// is_alphanumeric() works with all Unicode letters and digits
if c.is_alphanumeric() {
c
} else {
'_'
}
})
.collect()
}
// ============================================================================
// ONNX Runtime Integration
// ============================================================================
use ort::{
session::Session,
value::Value,
};
pub struct Style {
pub ttl: Array3<f32>,
pub dp: Array3<f32>,
}
pub struct TextToSpeech {
cfgs: Config,
text_processor: UnicodeProcessor,
dp_ort: Session,
text_enc_ort: Session,
vector_est_ort: Session,
vocoder_ort: Session,
pub sample_rate: i32,
}
impl TextToSpeech {
pub fn new(
cfgs: Config,
text_processor: UnicodeProcessor,
dp_ort: Session,
text_enc_ort: Session,
vector_est_ort: Session,
vocoder_ort: Session,
) -> Self {
let sample_rate = cfgs.ae.sample_rate;
TextToSpeech {
cfgs,
text_processor,
dp_ort,
text_enc_ort,
vector_est_ort,
vocoder_ort,
sample_rate,
}
}
fn _infer(
&mut self,
text_list: &[String],
lang_list: &[String],
style: &Style,
total_step: usize,
speed: f32,
) -> Result<(Vec<f32>, Vec<f32>)> {
let bsz = text_list.len();
// Process text
let (text_ids, text_mask) = self.text_processor.call(text_list, lang_list)?;
let text_ids_array = {
let text_ids_shape = (bsz, text_ids[0].len());
let mut flat = Vec::new();
for row in &text_ids {
flat.extend_from_slice(row);
}
Array::from_shape_vec(text_ids_shape, flat)?
};
let text_ids_value = Value::from_array(text_ids_array)?;
let text_mask_value = Value::from_array(text_mask.clone())?;
let style_dp_value = Value::from_array(style.dp.clone())?;
// Predict duration
let dp_outputs = self.dp_ort.run(ort::inputs!{
"text_ids" => &text_ids_value,
"style_dp" => &style_dp_value,
"text_mask" => &text_mask_value
})?;
let (_, duration_data) = dp_outputs["duration"].try_extract_tensor::<f32>()?;
let mut duration: Vec<f32> = duration_data.to_vec();
// Apply speed factor to duration
for dur in duration.iter_mut() {
*dur /= speed;
}
// Encode text
let style_ttl_value = Value::from_array(style.ttl.clone())?;
let text_enc_outputs = self.text_enc_ort.run(ort::inputs!{
"text_ids" => &text_ids_value,
"style_ttl" => &style_ttl_value,
"text_mask" => &text_mask_value
})?;
let (text_emb_shape, text_emb_data) = text_enc_outputs["text_emb"].try_extract_tensor::<f32>()?;
let text_emb = Array3::from_shape_vec(
(text_emb_shape[0] as usize, text_emb_shape[1] as usize, text_emb_shape[2] as usize),
text_emb_data.to_vec()
)?;
// Sample noisy latent
let (mut xt, latent_mask) = sample_noisy_latent(
&duration,
self.sample_rate,
self.cfgs.ae.base_chunk_size,
self.cfgs.ttl.chunk_compress_factor,
self.cfgs.ttl.latent_dim,
);
// Prepare constant arrays
let total_step_array = Array::from_elem(bsz, total_step as f32);
// Denoising loop
for step in 0..total_step {
let current_step_array = Array::from_elem(bsz, step as f32);
let xt_value = Value::from_array(xt.clone())?;
let text_emb_value = Value::from_array(text_emb.clone())?;
let latent_mask_value = Value::from_array(latent_mask.clone())?;
let text_mask_value2 = Value::from_array(text_mask.clone())?;
let current_step_value = Value::from_array(current_step_array)?;
let total_step_value = Value::from_array(total_step_array.clone())?;
let vector_est_outputs = self.vector_est_ort.run(ort::inputs!{
"noisy_latent" => &xt_value,
"text_emb" => &text_emb_value,
"style_ttl" => &style_ttl_value,
"latent_mask" => &latent_mask_value,
"text_mask" => &text_mask_value2,
"current_step" => &current_step_value,
"total_step" => &total_step_value
})?;
let (denoised_shape, denoised_data) = vector_est_outputs["denoised_latent"].try_extract_tensor::<f32>()?;
xt = Array3::from_shape_vec(
(denoised_shape[0] as usize, denoised_shape[1] as usize, denoised_shape[2] as usize),
denoised_data.to_vec()
)?;
}
// Generate waveform
let final_latent_value = Value::from_array(xt)?;
let vocoder_outputs = self.vocoder_ort.run(ort::inputs!{
"latent" => &final_latent_value
})?;
let (_, wav_data) = vocoder_outputs["wav_tts"].try_extract_tensor::<f32>()?;
let wav: Vec<f32> = wav_data.to_vec();
Ok((wav, duration))
}
pub fn call(
&mut self,
text: &str,
lang: &str,
style: &Style,
total_step: usize,
speed: f32,
silence_duration: f32,
) -> Result<(Vec<f32>, f32)> {
let max_len = if lang == "ko" { 120 } else { 300 };
let chunks = chunk_text(text, Some(max_len));
let mut wav_cat: Vec<f32> = Vec::new();
let mut dur_cat: f32 = 0.0;
for (i, chunk) in chunks.iter().enumerate() {
let (wav, duration) = self._infer(&[chunk.clone()], &[lang.to_string()], style, total_step, speed)?;
let dur = duration[0];
let wav_len = (self.sample_rate as f32 * dur) as usize;
let wav_chunk = &wav[..wav_len.min(wav.len())];
if i == 0 {
wav_cat.extend_from_slice(wav_chunk);
dur_cat = dur;
} else {
let silence_len = (silence_duration * self.sample_rate as f32) as usize;
let silence = vec![0.0f32; silence_len];
wav_cat.extend_from_slice(&silence);
wav_cat.extend_from_slice(wav_chunk);
dur_cat += silence_duration + dur;
}
}
Ok((wav_cat, dur_cat))
}
pub fn batch(
&mut self,
text_list: &[String],
lang_list: &[String],
style: &Style,
total_step: usize,
speed: f32,
) -> Result<(Vec<f32>, Vec<f32>)> {
self._infer(text_list, lang_list, style, total_step, speed)
}
}
// ============================================================================
// Component Loading Functions
// ============================================================================
/// Load voice style from JSON files
pub fn load_voice_style(voice_style_paths: &[String], verbose: bool) -> Result<Style> {
let bsz = voice_style_paths.len();
// Read first file to get dimensions
let first_file = File::open(&voice_style_paths[0])
.context("Failed to open voice style file")?;
let first_reader = BufReader::new(first_file);
let first_data: VoiceStyleData = serde_json::from_reader(first_reader)?;
let ttl_dims = &first_data.style_ttl.dims;
let dp_dims = &first_data.style_dp.dims;
let ttl_dim1 = ttl_dims[1];
let ttl_dim2 = ttl_dims[2];
let dp_dim1 = dp_dims[1];
let dp_dim2 = dp_dims[2];
// Pre-allocate arrays with full batch size
let ttl_size = bsz * ttl_dim1 * ttl_dim2;
let dp_size = bsz * dp_dim1 * dp_dim2;
let mut ttl_flat = vec![0.0f32; ttl_size];
let mut dp_flat = vec![0.0f32; dp_size];
// Fill in the data
for (i, path) in voice_style_paths.iter().enumerate() {
let file = File::open(path).context("Failed to open voice style file")?;
let reader = BufReader::new(file);
let data: VoiceStyleData = serde_json::from_reader(reader)?;
// Flatten TTL data
let ttl_offset = i * ttl_dim1 * ttl_dim2;
let mut idx = 0;
for batch in &data.style_ttl.data {
for row in batch {
for &val in row {
ttl_flat[ttl_offset + idx] = val;
idx += 1;
}
}
}
// Flatten DP data
let dp_offset = i * dp_dim1 * dp_dim2;
idx = 0;
for batch in &data.style_dp.data {
for row in batch {
for &val in row {
dp_flat[dp_offset + idx] = val;
idx += 1;
}
}
}
}
let ttl_style = Array3::from_shape_vec((bsz, ttl_dim1, ttl_dim2), ttl_flat)?;
let dp_style = Array3::from_shape_vec((bsz, dp_dim1, dp_dim2), dp_flat)?;
if verbose {
println!("Loaded {} voice styles\n", bsz);
}
Ok(Style {
ttl: ttl_style,
dp: dp_style,
})
}
/// Load TTS components
pub fn load_text_to_speech(onnx_dir: &str, use_gpu: bool) -> Result<TextToSpeech> {
if use_gpu {
anyhow::bail!("GPU mode is not supported yet");
}
println!("Using CPU for inference\n");
let cfgs = load_cfgs(onnx_dir)?;
let dp_path = format!("{}/duration_predictor.onnx", onnx_dir);
let text_enc_path = format!("{}/text_encoder.onnx", onnx_dir);
let vector_est_path = format!("{}/vector_estimator.onnx", onnx_dir);
let vocoder_path = format!("{}/vocoder.onnx", onnx_dir);
let dp_ort = Session::builder()?
.commit_from_file(&dp_path)?;
let text_enc_ort = Session::builder()?
.commit_from_file(&text_enc_path)?;
let vector_est_ort = Session::builder()?
.commit_from_file(&vector_est_path)?;
let vocoder_ort = Session::builder()?
.commit_from_file(&vocoder_path)?;
let unicode_indexer_path = format!("{}/unicode_indexer.json", onnx_dir);
let text_processor = UnicodeProcessor::new(&unicode_indexer_path)?;
Ok(TextToSpeech::new(
cfgs,
text_processor,
dp_ort,
text_enc_ort,
vector_est_ort,
vocoder_ort,
))
}

swift/.gitignore vendored Normal file

@@ -0,0 +1,15 @@
# Swift Package Manager
.build/
.swiftpm/
*.xcodeproj
*.xcworkspace
# Build artifacts
example_onnx
# Results
results/*.wav
# macOS
.DS_Store

swift/Package.resolved Normal file

@@ -0,0 +1,14 @@
{
"pins" : [
{
"identity" : "onnxruntime-swift-package-manager",
"kind" : "remoteSourceControl",
"location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git",
"state" : {
"revision" : "12ce7374c86944e1f68f3a866d10105d8357f074",
"version" : "1.20.0"
}
}
],
"version" : 2
}

swift/Package.swift Normal file

@@ -0,0 +1,22 @@
// swift-tools-version: 5.9
import PackageDescription
let package = Package(
name: "Supertonic",
platforms: [
.macOS(.v13)
],
dependencies: [
.package(url: "https://github.com/microsoft/onnxruntime-swift-package-manager.git", from: "1.16.0"),
],
targets: [
.executableTarget(
name: "example_onnx",
dependencies: [
.product(name: "onnxruntime", package: "onnxruntime-swift-package-manager")
],
path: "Sources"
)
]
)

swift/README.md Normal file

@@ -0,0 +1,122 @@
# TTS ONNX Inference Examples
This guide provides examples for running TTS inference using `example_onnx`.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details.
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added `--speed` parameter to control speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Installation
This project uses Swift Package Manager (SPM) for dependency management.
### Prerequisites
- Swift 5.9 or later
- macOS 13.0 or later
### Build the project
```bash
swift build -c release
```
## Basic Usage
### Example 1: Default Inference
Run inference with default settings:
```bash
.build/release/example_onnx
```
This will use:
- Voice style: `assets/voice_styles/M1.json`
- Text: "This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."
- Output directory: `results/`
- Total steps: 5
- Number of generations: 4
### Example 2: Batch Inference
Process multiple voice styles and texts at once:
```bash
.build/release/example_onnx \
--batch \
--voice-style assets/voice_styles/M1.json,assets/voice_styles/F1.json \
--text "The sun sets behind the mountains, painting the sky in shades of pink and orange.|오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요." \
--lang en,ko
```
This will:
- Generate speech for 2 different voice-text-language triplets
- Use male voice (M1.json) for the first English text
- Use female voice (F1.json) for the second Korean text
- Process both samples in a single batch
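In batch mode the model returns one padded buffer: every sample occupies an equal-length slot sized for the longest output, and only the prefix of length `duration[i] * sample_rate` is valid audio. A minimal Rust sketch of that slicing (function and parameter names here are illustrative, not part of the shipped API):

```rust
// Recover per-sample audio from a batched, padded output buffer.
// Each of the `bsz` samples owns an equal slot; the valid prefix of a
// slot is duration[i] seconds at the given sample rate.
fn slice_batch_output(wav: &[f32], bsz: usize, sample_rate: usize, durations: &[f32]) -> Vec<Vec<f32>> {
    let slot_len = wav.len() / bsz; // padded length shared by all samples
    durations
        .iter()
        .enumerate()
        .map(|(i, &dur)| {
            let start = i * slot_len;
            let valid = ((sample_rate as f32 * dur) as usize).min(slot_len);
            wav[start..start + valid].to_vec()
        })
        .collect()
}

fn main() {
    // Two 1-second slots at a toy 4 Hz rate; the second sample is 0.5 s long.
    let wav = vec![0.1f32; 8];
    let outs = slice_batch_output(&wav, 2, 4, &[1.0, 0.5]);
    assert_eq!(outs[0].len(), 4); // full slot is valid
    assert_eq!(outs[1].len(), 2); // only half the slot is real audio
}
```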
### Example 3: High Quality Inference
Increase denoising steps for better quality:
```bash
.build/release/example_onnx \
--total-step 10 \
--voice-style assets/voice_styles/M1.json \
--text "Increasing the number of denoising steps improves the output's fidelity and overall quality."
```
This will:
- Use 10 denoising steps instead of the default 5
- Produce higher quality output at the cost of slower inference
### Example 4: Long-Form Inference
The system automatically chunks long texts into manageable segments, synthesizes each segment separately, and concatenates them with natural pauses (0.3 seconds by default) into a single audio file. This happens by default when you don't use the `--batch` flag:
```bash
.build/release/example_onnx \
--voice-style assets/voice_styles/M1.json \
--text "This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues."
```
This will:
- Automatically split the text into chunks based on paragraph and sentence boundaries
- Synthesize each chunk separately
- Add 0.3 seconds of silence between chunks for natural pauses
- Concatenate all chunks into a single audio file
**Note**: Automatic text chunking is disabled when using `--batch` mode. In batch mode, each text is processed as-is without chunking.
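The concatenation step described above can be sketched as follows (a simplified Rust illustration of the join logic, not the shipped implementation; synthesis of each chunk is omitted):

```rust
// Join already-synthesized chunks into one waveform, inserting
// `silence_sec` seconds of zeros between consecutive chunks and
// accumulating the total duration (speech + silence).
fn concat_with_silence(chunks: &[Vec<f32>], sample_rate: usize, silence_sec: f32) -> (Vec<f32>, f32) {
    let silence_len = (silence_sec * sample_rate as f32) as usize;
    let mut wav = Vec::new();
    let mut total_sec = 0.0f32;
    for (i, chunk) in chunks.iter().enumerate() {
        if i > 0 {
            // Natural pause between chunks (0.3 s by default in this project).
            wav.extend(std::iter::repeat(0.0f32).take(silence_len));
            total_sec += silence_sec;
        }
        wav.extend_from_slice(chunk);
        total_sec += chunk.len() as f32 / sample_rate as f32;
    }
    (wav, total_sec)
}

fn main() {
    // Two 1-second chunks at a toy 10 Hz rate with a 0.3 s pause.
    let chunks = vec![vec![0.5f32; 10], vec![0.5f32; 10]];
    let (wav, dur) = concat_with_silence(&chunks, 10, 0.3);
    assert_eq!(wav.len(), 23); // 10 + 3 silence samples + 10
    assert!((dur - 2.3).abs() < 1e-6);
}
```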
## Available Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--use-gpu` | flag | False | Use GPU for inference (default: CPU) |
| `--onnx-dir` | str | `assets/onnx` | Path to ONNX model directory |
| `--total-step` | int | 5 | Number of denoising steps (higher = better quality, slower) |
| `--n-test` | int | 4 | Number of times to generate each sample |
| `--voice-style` | str+ | `assets/voice_styles/M1.json` | Voice style file path(s) |
| `--text` | str+ | (long default text) | Text(s) to synthesize |
| `--lang` | str+ | `en` | Language(s) for synthesis (en, ko, es, pt, fr) |
| `--save-dir` | str | `results` | Output directory |
| `--batch` | flag | False | Enable batch mode (multiple text-style-lang triplets, disables automatic chunking) |
## Multilingual Support
Supertonic 2 supports multiple languages. Use the `--lang` argument to specify the language:
- `en` - English (default)
- `ko` - Korean (한국어)
- `es` - Spanish (Español)
- `pt` - Portuguese (Português)
- `fr` - French (Français)
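Internally, the text processor validates the language code against this set and wraps the input in `<lang>...</lang>` tags before tokenization (as the helper sources in this repo show). A hedged Rust sketch of that step, with illustrative names:

```rust
// Languages currently accepted by the pipeline.
const AVAILABLE_LANGS: [&str; 5] = ["en", "ko", "es", "pt", "fr"];

// Validate the language code, then wrap the text in language tags,
// mirroring the preprocessing applied before tokenization.
fn wrap_with_lang_tags(text: &str, lang: &str) -> Result<String, String> {
    if !AVAILABLE_LANGS.contains(&lang) {
        return Err(format!(
            "Invalid language: {lang}. Available: {}",
            AVAILABLE_LANGS.join(", ")
        ));
    }
    Ok(format!("<{lang}>{text}</{lang}>"))
}

fn main() {
    assert_eq!(wrap_with_lang_tags("Hello.", "en").unwrap(), "<en>Hello.</en>");
    assert!(wrap_with_lang_tags("Hallo.", "de").is_err()); // unsupported language
}
```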
## Notes
- **Batch Processing**: When using `--batch`, the number of `--voice-style`, `--text`, and `--lang` entries must match
- **Automatic Chunking**: Without `--batch`, long texts are automatically split and concatenated with 0.3s pauses
- **Quality vs Speed**: Higher `--total-step` values produce better quality but take longer
- **GPU Support**: GPU mode is not supported yet


@@ -0,0 +1,163 @@
import Foundation
import OnnxRuntimeBindings
struct Args {
var useGpu: Bool = false
var onnxDir: String = "assets/onnx"
var totalStep: Int = 5
var speed: Float = 1.05
var nTest: Int = 4
var voiceStyle: [String] = ["assets/voice_styles/M1.json"]
var text: [String] = ["This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen."]
var lang: [String] = ["en"]
var saveDir: String = "results"
var batch: Bool = false
}
func parseArgs() -> Args {
var args = Args()
let arguments = CommandLine.arguments
var i = 1
while i < arguments.count {
let arg = arguments[i]
switch arg {
case "--use-gpu":
args.useGpu = true
case "--onnx-dir":
if i + 1 < arguments.count {
args.onnxDir = arguments[i + 1]
i += 1
}
case "--total-step":
if i + 1 < arguments.count {
args.totalStep = Int(arguments[i + 1]) ?? 5
i += 1
}
case "--speed":
if i + 1 < arguments.count {
args.speed = Float(arguments[i + 1]) ?? 1.05
i += 1
}
case "--n-test":
if i + 1 < arguments.count {
args.nTest = Int(arguments[i + 1]) ?? 4
i += 1
}
case "--voice-style":
if i + 1 < arguments.count {
args.voiceStyle = arguments[i + 1].components(separatedBy: ",")
i += 1
}
case "--text":
if i + 1 < arguments.count {
args.text = arguments[i + 1].components(separatedBy: "|")
i += 1
}
case "--lang":
if i + 1 < arguments.count {
args.lang = arguments[i + 1].components(separatedBy: ",")
i += 1
}
case "--save-dir":
if i + 1 < arguments.count {
args.saveDir = arguments[i + 1]
i += 1
}
case "--batch":
args.batch = true
default:
break
}
i += 1
}
return args
}
@main
struct ExampleONNX {
static func main() async {
print("=== TTS Inference with ONNX Runtime (Swift) ===\n")
// --- 1. Parse arguments --- //
let args = parseArgs()
if args.batch {
guard args.voiceStyle.count == args.text.count else {
print("Error: Number of voice styles (\(args.voiceStyle.count)) must match number of texts (\(args.text.count))")
return
}
guard args.lang.count == args.text.count else {
print("Error: Number of languages (\(args.lang.count)) must match number of texts (\(args.text.count))")
return
}
}
let bsz = args.voiceStyle.count
do {
let env = try ORTEnv(loggingLevel: .warning)
// --- 2. Load TTS components --- //
let textToSpeech = try loadTextToSpeech(args.onnxDir, args.useGpu, env)
// --- 3. Load voice styles --- //
let style = try loadVoiceStyle(args.voiceStyle, verbose: true)
// --- 4. Synthesize speech --- //
try? FileManager.default.createDirectory(atPath: args.saveDir, withIntermediateDirectories: true)
for n in 0..<args.nTest {
print("\n[\(n + 1)/\(args.nTest)] Starting synthesis...")
let wav: [Float]
let duration: [Float]
if args.batch {
let result = try timer("Generating speech from text") {
try textToSpeech.batch(args.text, args.lang, style, args.totalStep, speed: args.speed)
}
wav = result.wav
duration = result.duration
} else {
let result = try timer("Generating speech from text") {
try textToSpeech.call(args.text[0], args.lang[0], style, args.totalStep, speed: args.speed, silenceDuration: 0.3)
}
wav = result.wav
duration = [result.duration]
}
// Save outputs
for i in 0..<bsz {
let fname = "\(sanitizeFilename(args.text[i], maxLen: 20))_\(n + 1).wav"
let wavOut: [Float]
if args.batch {
let wavLen = wav.count / bsz
let actualLen = Int(Float(textToSpeech.sampleRate) * duration[i])
let wavStart = i * wavLen
let wavEnd = min(wavStart + actualLen, wavStart + wavLen)
wavOut = Array(wav[wavStart..<wavEnd])
} else {
// For non-batch mode, wav is a single concatenated audio
let actualLen = Int(Float(textToSpeech.sampleRate) * duration[0])
wavOut = Array(wav.prefix(actualLen))
}
let outputPath = "\(args.saveDir)/\(fname)"
try writeWavFile(outputPath, wavOut, textToSpeech.sampleRate)
print("Saved: \(outputPath)")
}
}
print("\n=== Synthesis completed successfully! ===")
} catch {
print("Error during inference: \(error)")
exit(1)
}
}
}

swift/Sources/Helper.swift Normal file

@@ -0,0 +1,835 @@
import Foundation
import Accelerate
import OnnxRuntimeBindings
// MARK: - Available Languages
let AVAILABLE_LANGS = ["en", "ko", "es", "pt", "fr"]
func isValidLang(_ lang: String) -> Bool {
return AVAILABLE_LANGS.contains(lang)
}
// MARK: - Configuration Structures
struct Config: Codable {
struct AEConfig: Codable {
let sample_rate: Int
let base_chunk_size: Int
}
struct TTLConfig: Codable {
let chunk_compress_factor: Int
let latent_dim: Int
}
let ae: AEConfig
let ttl: TTLConfig
}
// MARK: - Voice Style Data Structure
struct VoiceStyleData: Codable {
struct StyleComponent: Codable {
let data: [[[Float]]]
let dims: [Int]
let type: String
}
let style_ttl: StyleComponent
let style_dp: StyleComponent
}
// MARK: - Unicode Text Processor
class UnicodeProcessor {
let indexer: [Int64]
init(unicodeIndexerPath: String) throws {
let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath))
self.indexer = try JSONDecoder().decode([Int64].self, from: data)
}
func call(_ textList: [String], _ langList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) {
var processedTexts = [String]()
for (i, text) in textList.enumerated() {
processedTexts.append(preprocessText(text, lang: langList[i]))
}
// Use unicodeScalars.count for correct length after NFKD decomposition
var textIdsLengths = [Int]()
for text in processedTexts {
textIdsLengths.append(text.unicodeScalars.count)
}
let maxLen = textIdsLengths.max() ?? 0
var textIds = [[Int64]]()
for text in processedTexts {
var row = Array(repeating: Int64(0), count: maxLen)
let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) })
for (j, val) in unicodeValues.enumerated() {
if val < indexer.count {
row[j] = indexer[val]
} else {
row[j] = -1
}
}
textIds.append(row)
}
let textMask = getTextMask(textIdsLengths)
return (textIds, textMask)
}
}
func preprocessText(_ text: String, lang: String) -> String {
// Use NFKD (decomposed) for proper Hangul Jamo decomposition
var text = text.decomposedStringWithCompatibilityMapping
// Remove emojis (wide Unicode range)
// Swift NSRegularExpression doesn't support Unicode escapes above \uFFFF
// Use character filtering instead
text = text.unicodeScalars.filter { scalar in
let value = scalar.value
return !((value >= 0x1F600 && value <= 0x1F64F) ||
(value >= 0x1F300 && value <= 0x1F5FF) ||
(value >= 0x1F680 && value <= 0x1F6FF) ||
(value >= 0x1F700 && value <= 0x1F77F) ||
(value >= 0x1F780 && value <= 0x1F7FF) ||
(value >= 0x1F800 && value <= 0x1F8FF) ||
(value >= 0x1F900 && value <= 0x1F9FF) ||
(value >= 0x1FA00 && value <= 0x1FA6F) ||
(value >= 0x1FA70 && value <= 0x1FAFF) ||
(value >= 0x2600 && value <= 0x26FF) ||
(value >= 0x2700 && value <= 0x27BF) ||
(value >= 0x1F1E6 && value <= 0x1F1FF))
}.map { String($0) }.joined()
// Replace various dashes and symbols
let replacements: [String: String] = [
"–": "-", // en dash
"‑": "-", // non-breaking hyphen
"—": "-", // em dash
"_": " ", // underscore
"\u{201C}": "\"", // left double quote
"\u{201D}": "\"", // right double quote
"\u{2018}": "'", // left single quote
"\u{2019}": "'", // right single quote
"´": "'", // acute accent
"`": "'", // grave accent
"[": " ", // left bracket
"]": " ", // right bracket
"|": " ", // vertical bar
"/": " ", // slash
"#": " ", // hash
"→": " ", // right arrow
"←": " ", // left arrow
]
for (old, new) in replacements {
text = text.replacingOccurrences(of: old, with: new)
}
// Remove special symbols
let specialSymbols = ["", "", "", "©", "\\"]
for symbol in specialSymbols {
text = text.replacingOccurrences(of: symbol, with: "")
}
// Replace known expressions
let exprReplacements: [String: String] = [
"@": " at ",
"e.g.,": "for example, ",
"i.e.,": "that is, ",
]
for (old, new) in exprReplacements {
text = text.replacingOccurrences(of: old, with: new)
}
// Fix spacing around punctuation
text = text.replacingOccurrences(of: " ,", with: ",")
text = text.replacingOccurrences(of: " .", with: ".")
text = text.replacingOccurrences(of: " !", with: "!")
text = text.replacingOccurrences(of: " ?", with: "?")
text = text.replacingOccurrences(of: " ;", with: ";")
text = text.replacingOccurrences(of: " :", with: ":")
text = text.replacingOccurrences(of: " '", with: "'")
// Remove duplicate quotes
while text.contains("\"\"") {
text = text.replacingOccurrences(of: "\"\"", with: "\"")
}
while text.contains("''") {
text = text.replacingOccurrences(of: "''", with: "'")
}
while text.contains("``") {
text = text.replacingOccurrences(of: "``", with: "`")
}
// Remove extra spaces
let whitespacePattern = try! NSRegularExpression(pattern: "\\s+")
let whitespaceRange = NSRange(text.startIndex..., in: text)
text = whitespacePattern.stringByReplacingMatches(in: text, range: whitespaceRange, withTemplate: " ")
text = text.trimmingCharacters(in: .whitespacesAndNewlines)
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if !text.isEmpty {
let punctPattern = try! NSRegularExpression(pattern: "[.!?;:,'\"\\u201C\\u201D\\u2018\\u2019)\\]}…。」』】〉》›»]$")
let punctRange = NSRange(text.startIndex..., in: text)
if punctPattern.firstMatch(in: text, range: punctRange) == nil {
text += "."
}
}
// Validate language
guard isValidLang(lang) else {
fatalError("Invalid language: \(lang). Available: \(AVAILABLE_LANGS.joined(separator: ", "))")
}
// Wrap text with language tags
text = "<\(lang)>\(text)</\(lang)>"
return text
}
func lengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] {
let actualMaxLen = maxLen ?? (lengths.max() ?? 0)
var mask = [[[Float]]]()
for len in lengths {
var row = Array(repeating: Float(0.0), count: actualMaxLen)
for j in 0..<min(len, actualMaxLen) {
row[j] = 1.0
}
mask.append([row])
}
return mask
}
func getTextMask(_ textIdsLengths: [Int]) -> [[[Float]]] {
let maxLen = textIdsLengths.max() ?? 0
return lengthToMask(textIdsLengths, maxLen: maxLen)
}
func sampleNoisyLatent(duration: [Float], sampleRate: Int, baseChunkSize: Int, chunkCompress: Int, latentDim: Int) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) {
let bsz = duration.count
let maxDur = duration.max() ?? 0.0
let wavLenMax = Int(maxDur * Float(sampleRate))
var wavLengths = [Int]()
for d in duration {
wavLengths.append(Int(d * Float(sampleRate)))
}
let chunkSize = baseChunkSize * chunkCompress
let latentLen = (wavLenMax + chunkSize - 1) / chunkSize
let latentDimVal = latentDim * chunkCompress
var noisyLatent = [[[Float]]]()
for _ in 0..<bsz {
var batch = [[Float]]()
for _ in 0..<latentDimVal {
var row = [Float]()
for _ in 0..<latentLen {
// Box-Muller transform
let u1 = Float.random(in: 0.0001...1.0)
let u2 = Float.random(in: 0.0...1.0)
let val = sqrt(-2.0 * log(u1)) * cos(2.0 * Float.pi * u2)
row.append(val)
}
batch.append(row)
}
noisyLatent.append(batch)
}
var latentLengths = [Int]()
for len in wavLengths {
latentLengths.append((len + chunkSize - 1) / chunkSize)
}
let latentMask = lengthToMask(latentLengths, maxLen: latentLen)
// Apply mask
for b in 0..<bsz {
for d in 0..<latentDimVal {
for t in 0..<latentLen {
noisyLatent[b][d][t] *= latentMask[b][0][t]
}
}
}
return (noisyLatent, latentMask)
}
func getLatentMask(_ wavLengths: [Int64], _ cfgs: Config) -> [[[Float]]] {
let baseChunkSize = cfgs.ae.base_chunk_size
let chunkCompressFactor = cfgs.ttl.chunk_compress_factor
let latentSize = baseChunkSize * chunkCompressFactor
var latentLengths = [Int]()
for len in wavLengths {
latentLengths.append((Int(len) + latentSize - 1) / latentSize)
}
let maxLen = latentLengths.max() ?? 0
return lengthToMask(latentLengths, maxLen: maxLen)
}
// MARK: - WAV File I/O
func writeWavFile(_ filename: String, _ audioData: [Float], _ sampleRate: Int) throws {
let url = URL(fileURLWithPath: filename)
// Convert float to int16
let int16Data = audioData.map { sample -> Int16 in
let clamped = max(-1.0, min(1.0, sample))
return Int16(clamped * 32767.0)
}
// Create WAV header
let numChannels: UInt16 = 1
let bitsPerSample: UInt16 = 16
let byteRate = UInt32(sampleRate) * UInt32(numChannels) * UInt32(bitsPerSample) / 8
let blockAlign = numChannels * bitsPerSample / 8
let dataSize = UInt32(int16Data.count * 2)
var data = Data()
// RIFF chunk
data.append("RIFF".data(using: .ascii)!)
withUnsafeBytes(of: UInt32(36 + dataSize).littleEndian) { data.append(contentsOf: $0) }
data.append("WAVE".data(using: .ascii)!)
// fmt chunk
data.append("fmt ".data(using: .ascii)!)
withUnsafeBytes(of: UInt32(16).littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: UInt16(1).littleEndian) { data.append(contentsOf: $0) } // PCM
withUnsafeBytes(of: numChannels.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: UInt32(sampleRate).littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: byteRate.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: blockAlign.littleEndian) { data.append(contentsOf: $0) }
withUnsafeBytes(of: bitsPerSample.littleEndian) { data.append(contentsOf: $0) }
// data chunk
data.append("data".data(using: .ascii)!)
withUnsafeBytes(of: dataSize.littleEndian) { data.append(contentsOf: $0) }
// audio data
int16Data.withUnsafeBytes { data.append(contentsOf: $0) }
try data.write(to: url)
}
// MARK: - Text Chunking
let MAX_CHUNK_LENGTH = 300
let ABBREVIATIONS = [
"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Sr.", "Jr.",
"St.", "Ave.", "Rd.", "Blvd.", "Dept.", "Inc.", "Ltd.",
"Co.", "Corp.", "etc.", "vs.", "i.e.", "e.g.", "Ph.D."
]
func chunkText(_ text: String, maxLen: Int = 0) -> [String] {
let actualMaxLen = maxLen > 0 ? maxLen : MAX_CHUNK_LENGTH
let trimmedText = text.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
if trimmedText.isEmpty {
return [""]
}
// Split by paragraphs using regex
let paraPattern = try! NSRegularExpression(pattern: "\\n\\s*\\n")
let paraRange = NSRange(trimmedText.startIndex..., in: trimmedText)
var paragraphs = [String]()
var lastEnd = trimmedText.startIndex
paraPattern.enumerateMatches(in: trimmedText, range: paraRange) { match, _, _ in
if let match = match, let range = Range(match.range, in: trimmedText) {
paragraphs.append(String(trimmedText[lastEnd..<range.lowerBound]))
lastEnd = range.upperBound
}
}
if lastEnd < trimmedText.endIndex {
paragraphs.append(String(trimmedText[lastEnd...]))
}
if paragraphs.isEmpty {
paragraphs = [trimmedText]
}
var chunks = [String]()
for para in paragraphs {
let trimmedPara = para.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
if trimmedPara.isEmpty {
continue
}
if trimmedPara.count <= actualMaxLen {
chunks.append(trimmedPara)
continue
}
// Split by sentences
let sentences = splitSentences(trimmedPara)
var current = ""
var currentLen = 0
for sentence in sentences {
let trimmedSentence = sentence.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
if trimmedSentence.isEmpty {
continue
}
let sentenceLen = trimmedSentence.count
if sentenceLen > actualMaxLen {
// If sentence is longer than maxLen, split by comma or space
if !current.isEmpty {
chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
current = ""
currentLen = 0
}
// Try splitting by comma
let parts = trimmedSentence.components(separatedBy: ",")
for part in parts {
let trimmedPart = part.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines)
if trimmedPart.isEmpty {
continue
}
let partLen = trimmedPart.count
if partLen > actualMaxLen {
// Split by space as last resort
let words = trimmedPart.components(separatedBy: CharacterSet.whitespaces).filter { !$0.isEmpty }
var wordChunk = ""
var wordChunkLen = 0
for word in words {
let wordLen = word.count
if wordChunkLen + wordLen + 1 > actualMaxLen && !wordChunk.isEmpty {
chunks.append(wordChunk.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
wordChunk = ""
wordChunkLen = 0
}
if !wordChunk.isEmpty {
wordChunk += " "
wordChunkLen += 1
}
wordChunk += word
wordChunkLen += wordLen
}
if !wordChunk.isEmpty {
chunks.append(wordChunk.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
}
} else {
if currentLen + partLen + 1 > actualMaxLen && !current.isEmpty {
chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
current = ""
currentLen = 0
}
if !current.isEmpty {
current += ", "
currentLen += 2
}
current += trimmedPart
currentLen += partLen
}
}
continue
}
if currentLen + sentenceLen + 1 > actualMaxLen && !current.isEmpty {
chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
current = ""
currentLen = 0
}
if !current.isEmpty {
current += " "
currentLen += 1
}
current += trimmedSentence
currentLen += sentenceLen
}
if !current.isEmpty {
chunks.append(current.trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
}
}
return chunks.isEmpty ? [""] : chunks
}
func splitSentences(_ text: String) -> [String] {
// Swift's regex doesn't support lookbehind reliably, so we use a simpler approach
// Split on sentence boundaries and then check if they're abbreviations
let regex = try! NSRegularExpression(pattern: "([.!?])\\s+")
let range = NSRange(text.startIndex..., in: text)
// Find all matches
let matches = regex.matches(in: text, range: range)
if matches.isEmpty {
return [text]
}
var sentences = [String]()
var lastEnd = text.startIndex
for match in matches {
guard let matchRange = Range(match.range, in: text) else { continue }
// Get the text before the punctuation
let beforePunc = String(text[lastEnd..<matchRange.lowerBound])
// Get the punctuation character
let puncRange = Range(NSRange(location: match.range.location, length: 1), in: text)!
let punc = String(text[puncRange])
// Check if this ends with an abbreviation
var isAbbrev = false
let combined = beforePunc.trimmingCharacters(in: CharacterSet.whitespaces) + punc
for abbrev in ABBREVIATIONS {
if combined.hasSuffix(abbrev) {
isAbbrev = true
break
}
}
if !isAbbrev {
// This is a real sentence boundary
sentences.append(String(text[lastEnd..<matchRange.upperBound]))
lastEnd = matchRange.upperBound
}
}
// Add the remaining text
if lastEnd < text.endIndex {
sentences.append(String(text[lastEnd...]))
}
return sentences.isEmpty ? [text] : sentences
}
// MARK: - Utility Functions
func timer<T>(_ name: String, _ f: () throws -> T) rethrows -> T {
let start = Date()
print("\(name)...")
let result = try f()
let elapsed = Date().timeIntervalSince(start)
print(String(format: " -> %@ completed in %.2f sec", name, elapsed))
return result
}
func sanitizeFilename(_ text: String, maxLen: Int) -> String {
let truncated = text.count > maxLen ? String(text.prefix(maxLen)) : text
return truncated.map { char in
if char.isLetter || char.isNumber {
return char
} else {
return Character("_")
}
}.map(String.init).joined()
}
func loadCfgs(_ onnxDir: String) throws -> Config {
let cfgPath = "\(onnxDir)/tts.json"
let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
let config = try JSONDecoder().decode(Config.self, from: data)
return config
}
// MARK: - ONNX Runtime Integration
struct Style {
let ttl: ORTValue
let dp: ORTValue
}
class TextToSpeech {
let cfgs: Config
let textProcessor: UnicodeProcessor
let dpOrt: ORTSession
let textEncOrt: ORTSession
let vectorEstOrt: ORTSession
let vocoderOrt: ORTSession
let sampleRate: Int
init(cfgs: Config, textProcessor: UnicodeProcessor,
dpOrt: ORTSession, textEncOrt: ORTSession,
vectorEstOrt: ORTSession, vocoderOrt: ORTSession) {
self.cfgs = cfgs
self.textProcessor = textProcessor
self.dpOrt = dpOrt
self.textEncOrt = textEncOrt
self.vectorEstOrt = vectorEstOrt
self.vocoderOrt = vocoderOrt
self.sampleRate = cfgs.ae.sample_rate
}
private func _infer(_ textList: [String], _ langList: [String], _ style: Style, _ totalStep: Int, speed: Float = 1.05) throws -> (wav: [Float], duration: [Float]) {
let bsz = textList.count
// Process text
let (textIds, textMask) = textProcessor.call(textList, langList)
// Flatten text IDs
let textIdsFlat = textIds.flatMap { $0 }
let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)]
let textIdsValue = try ORTValue(tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size),
elementType: .int64,
shape: textIdsShape)
// Flatten text mask
let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } }
let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)]
let textMaskValue = try ORTValue(tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: textMaskShape)
// Predict duration
let dpOutputs = try dpOrt.run(withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue],
outputNames: ["duration"],
runOptions: nil)
let durationData = try dpOutputs["duration"]!.tensorData() as Data
var duration = durationData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
// Apply speed factor to duration
for i in 0..<duration.count {
duration[i] /= speed
}
// Encode text
let textEncOutputs = try textEncOrt.run(withInputs: ["text_ids": textIdsValue, "style_ttl": style.ttl, "text_mask": textMaskValue],
outputNames: ["text_emb"],
runOptions: nil)
let textEmbValue = textEncOutputs["text_emb"]!
// Sample noisy latent
var (xt, latentMask) = sampleNoisyLatent(duration: duration, sampleRate: sampleRate,
baseChunkSize: cfgs.ae.base_chunk_size,
chunkCompress: cfgs.ttl.chunk_compress_factor,
latentDim: cfgs.ttl.latent_dim)
// Prepare constant arrays
let totalStepArray = Array(repeating: Float(totalStep), count: bsz)
let totalStepValue = try ORTValue(tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size),
elementType: .float,
shape: [NSNumber(value: bsz)])
// Denoising loop
for step in 0..<totalStep {
let currentStepArray = Array(repeating: Float(step), count: bsz)
let currentStepValue = try ORTValue(tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size),
elementType: .float,
shape: [NSNumber(value: bsz)])
// Flatten xt
let xtFlat = xt.flatMap { $0.flatMap { $0 } }
let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
let xtValue = try ORTValue(tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: xtShape)
// Flatten latent mask
let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } }
let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)]
let latentMaskValue = try ORTValue(tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: latentMaskShape)
let vectorEstOutputs = try vectorEstOrt.run(withInputs: [
"noisy_latent": xtValue,
"text_emb": textEmbValue,
"style_ttl": style.ttl,
"latent_mask": latentMaskValue,
"text_mask": textMaskValue,
"current_step": currentStepValue,
"total_step": totalStepValue
], outputNames: ["denoised_latent"], runOptions: nil)
let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data
let denoisedFlat = denoisedData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
// Reshape to 3D
let latentDimVal = xt[0].count
let latentLen = xt[0][0].count
xt = []
var idx = 0
for _ in 0..<bsz {
var batch = [[Float]]()
for _ in 0..<latentDimVal {
var row = [Float]()
for _ in 0..<latentLen {
row.append(denoisedFlat[idx])
idx += 1
}
batch.append(row)
}
xt.append(batch)
}
}
// Generate waveform
let finalXtFlat = xt.flatMap { $0.flatMap { $0 } }
let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)]
let finalXtValue = try ORTValue(tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: finalXtShape)
let vocoderOutputs = try vocoderOrt.run(withInputs: ["latent": finalXtValue],
outputNames: ["wav_tts"],
runOptions: nil)
let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data
let wav = wavData.withUnsafeBytes { ptr in
Array(ptr.bindMemory(to: Float.self))
}
return (wav, duration)
}
func call(_ text: String, _ lang: String, _ style: Style, _ totalStep: Int, speed: Float = 1.05, silenceDuration: Float = 0.3) throws -> (wav: [Float], duration: Float) {
let maxLen = lang == "ko" ? 120 : 300
let chunks = chunkText(text, maxLen: maxLen)
let langList = Array(repeating: lang, count: chunks.count)
var wavCat = [Float]()
var durCat: Float = 0.0
for (i, chunk) in chunks.enumerated() {
let result = try _infer([chunk], [langList[i]], style, totalStep, speed: speed)
let dur = result.duration[0]
let wavLen = Int(Float(sampleRate) * dur)
let wavChunk = Array(result.wav.prefix(wavLen))
if i == 0 {
wavCat = wavChunk
durCat = dur
} else {
let silenceLen = Int(silenceDuration * Float(sampleRate))
let silence = [Float](repeating: 0.0, count: silenceLen)
wavCat.append(contentsOf: silence)
wavCat.append(contentsOf: wavChunk)
durCat += silenceDuration + dur
}
}
return (wavCat, durCat)
}
func batch(_ textList: [String], _ langList: [String], _ style: Style, _ totalStep: Int, speed: Float = 1.05) throws -> (wav: [Float], duration: [Float]) {
return try _infer(textList, langList, style, totalStep, speed: speed)
}
}
// MARK: - Component Loading Functions
func loadVoiceStyle(_ voiceStylePaths: [String], verbose: Bool) throws -> Style {
let bsz = voiceStylePaths.count
// Read first file to get dimensions
let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0]))
let firstStyle = try JSONDecoder().decode(VoiceStyleData.self, from: firstData)
let ttlDims = firstStyle.style_ttl.dims
let dpDims = firstStyle.style_dp.dims
let ttlDim1 = ttlDims[1]
let ttlDim2 = ttlDims[2]
let dpDim1 = dpDims[1]
let dpDim2 = dpDims[2]
// Pre-allocate arrays with full batch size
let ttlSize = bsz * ttlDim1 * ttlDim2
let dpSize = bsz * dpDim1 * dpDim2
var ttlFlat = [Float](repeating: 0.0, count: ttlSize)
var dpFlat = [Float](repeating: 0.0, count: dpSize)
// Fill in the data
for (i, path) in voiceStylePaths.enumerated() {
let data = try Data(contentsOf: URL(fileURLWithPath: path))
let voiceStyle = try JSONDecoder().decode(VoiceStyleData.self, from: data)
// Flatten TTL data
let ttlOffset = i * ttlDim1 * ttlDim2
var idx = 0
for batch in voiceStyle.style_ttl.data {
for row in batch {
for val in row {
ttlFlat[ttlOffset + idx] = val
idx += 1
}
}
}
// Flatten DP data
let dpOffset = i * dpDim1 * dpDim2
idx = 0
for batch in voiceStyle.style_dp.data {
for row in batch {
for val in row {
dpFlat[dpOffset + idx] = val
idx += 1
}
}
}
}
let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)]
let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)]
let ttlValue = try ORTValue(tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: ttlShape)
let dpValue = try ORTValue(tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size),
elementType: .float,
shape: dpShape)
if verbose {
print("Loaded \(bsz) voice styles\n")
}
return Style(ttl: ttlValue, dp: dpValue)
}
func loadTextToSpeech(_ onnxDir: String, _ useGpu: Bool, _ env: ORTEnv) throws -> TextToSpeech {
if useGpu {
throw NSError(domain: "TTS", code: 1, userInfo: [NSLocalizedDescriptionKey: "GPU mode is not supported yet"])
}
print("Using CPU for inference\n")
let cfgs = try loadCfgs(onnxDir)
let sessionOptions = try ORTSessionOptions()
let dpPath = "\(onnxDir)/duration_predictor.onnx"
let textEncPath = "\(onnxDir)/text_encoder.onnx"
let vectorEstPath = "\(onnxDir)/vector_estimator.onnx"
let vocoderPath = "\(onnxDir)/vocoder.onnx"
let dpOrt = try ORTSession(env: env, modelPath: dpPath, sessionOptions: sessionOptions)
let textEncOrt = try ORTSession(env: env, modelPath: textEncPath, sessionOptions: sessionOptions)
let vectorEstOrt = try ORTSession(env: env, modelPath: vectorEstPath, sessionOptions: sessionOptions)
let vocoderOrt = try ORTSession(env: env, modelPath: vocoderPath, sessionOptions: sessionOptions)
let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
let textProcessor = try UnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)
return TextToSpeech(cfgs: cfgs, textProcessor: textProcessor,
dpOrt: dpOrt, textEncOrt: textEncOrt,
vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
}
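// Example usage (sketch): how the pieces above fit together. The asset paths,
// voice-style file, and denoising step count below are illustrative, not part
// of the repository contract.
func exampleSynthesis() throws {
let env = try ORTEnv(loggingLevel: .warning)
let tts = try loadTextToSpeech("assets/onnx", false, env)
let style = try loadVoiceStyle(["assets/voice_styles/M1.json"], verbose: false)
let (wav, duration) = try timer("Synthesis") {
try tts.call("Hello from Supertonic!", "en", style, 5)
}
print("Generated \(wav.count) samples (\(duration) sec)")
}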

test_all.sh Normal file

@@ -0,0 +1,330 @@
#!/bin/bash
# Supertonic - Test All Language Implementations
# This script runs inference tests for all supported languages except web
set -eo pipefail  # Exit on error; propagate failures through pipelines (e.g. cmd | sed)
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$SCRIPT_DIR"
echo "=================================="
echo "Supertonic - Testing All Examples"
echo "=================================="
echo ""
# Ask user to select test mode
echo "Select test mode:"
echo " 1) Default inference only"
echo " 2) Batch inference only"
echo " 3) Long-form inference only"
echo " 4) All tests (default + batch + long-form)"
echo -e "Enter your choice (1/2/3/4) [default: 1]: \c"
read -r test_mode
test_mode=${test_mode:-1}
case $test_mode in
1)
TEST_DEFAULT=true
TEST_BATCH=false
TEST_LONGFORM=false
echo "Running default inference tests only"
;;
2)
TEST_DEFAULT=false
TEST_BATCH=true
TEST_LONGFORM=false
echo "Running batch inference tests only"
;;
3)
TEST_DEFAULT=false
TEST_BATCH=false
TEST_LONGFORM=true
echo "Running long-form inference tests only"
;;
4)
TEST_DEFAULT=true
TEST_BATCH=true
TEST_LONGFORM=true
echo "Running all tests (default + batch + long-form)"
;;
*)
echo "Invalid choice. Using default inference only."
TEST_DEFAULT=true
TEST_BATCH=false
TEST_LONGFORM=false
;;
esac
echo ""
# Batch inference test data - multilingual examples
BATCH_VOICE_STYLE_1="assets/voice_styles/M1.json"
BATCH_VOICE_STYLE_2="assets/voice_styles/F1.json"
BATCH_TEXT_1="The sun sets behind the mountains, painting the sky in shades of pink and orange."
BATCH_TEXT_2="오늘 아침에 공원을 산책했는데, 새소리와 바람 소리가 너무 기분 좋았어요."
BATCH_LANG_1="en"
BATCH_LANG_2="ko"
# Long-form inference test data
LONGFORM_VOICE_STYLE="assets/voice_styles/M1.json"
LONGFORM_TEXT="This is a very long text that will be automatically split into multiple chunks. The system will process each chunk separately and then concatenate them together with natural pauses between segments. This ensures that even very long texts can be processed efficiently while maintaining natural speech flow and avoiding memory issues. The text chunking algorithm intelligently splits on paragraph and sentence boundaries, preserving the natural flow of the content. When a sentence is too long, it further splits on commas and spaces as needed. This multi-level approach ensures optimal chunk sizes for inference while maintaining linguistic coherence."
# Ask if user wants to clean results folders
echo -e "Do you want to clean all results folders before running tests? (y/N): \c"
read -r response
if [[ "$response" =~ ^[Yy]$ ]]; then
echo ""
echo "Cleaning results folders..."
# List of result directories
declare -a RESULT_DIRS=(
"py/results"
"nodejs/results"
"go/results"
"rust/results"
"csharp/results"
"java/results"
"swift/results"
"cpp/build/results"
)
for dir in "${RESULT_DIRS[@]}"; do
if [ -d "$SCRIPT_DIR/$dir" ]; then
echo " - Cleaning $dir"
rm -rf "$SCRIPT_DIR/$dir"/*
fi
done
echo "Results folders cleaned!"
echo ""
fi
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Track results
declare -a PASSED=()
declare -a FAILED=()
# Helper function to show statistics
show_stats() {
local name=$1
local results_dir=$2
if [ -d "$results_dir" ]; then
# Count .wav files
local file_count=$(find "$results_dir" -name "*.wav" -type f 2>/dev/null | wc -l | tr -d ' ')
if [ "$file_count" -gt 0 ]; then
# Calculate total size
local total_size=0
while IFS= read -r file; do
if [ -f "$file" ]; then
local size=$(stat -f%z "$file" 2>/dev/null || stat -c%s "$file" 2>/dev/null)
total_size=$((total_size + size))
fi
done < <(find "$results_dir" -name "*.wav" -type f 2>/dev/null)
# Calculate statistics
local total_size_mb=$(echo "scale=2; $total_size / 1024 / 1024" | bc)
local avg_size_kb=$(echo "scale=2; $total_size / $file_count / 1024" | bc)
echo -e "${BLUE}[$name]${NC} 📊 Statistics:"
echo -e "${BLUE}[$name]${NC} - Files generated: $file_count"
echo -e "${BLUE}[$name]${NC} - Total size: ${total_size_mb} MB"
echo -e "${BLUE}[$name]${NC} - Average file size: ${avg_size_kb} KB"
fi
fi
}
# Helper function to run tests
run_test() {
local name=$1
local dir=$2
shift 2
local cmd="$@"
echo -e "${BLUE}[$name]${NC} Running inference..."
cd "$SCRIPT_DIR/$dir"
# Determine results directory based on the directory
local results_dir="$SCRIPT_DIR/$dir/results"
if [[ "$dir" == "cpp/build" ]]; then
results_dir="$SCRIPT_DIR/cpp/build/results"
fi
# Run command and prefix each output line with the language name
if eval "$cmd" 2>&1 | sed "s/^/[$name] /"; then
echo -e "${GREEN}[$name]${NC} ✓ Success"
# Show statistics
show_stats "$name" "$results_dir"
PASSED+=("$name")
else
echo -e "${RED}[$name]${NC} ✗ Failed"
FAILED+=("$name")
fi
echo ""
cd "$SCRIPT_DIR"
}
# ====================================
# Python
# ====================================
echo -e "${YELLOW}Testing Python...${NC}"
if [ "$TEST_DEFAULT" = true ]; then
run_test "Python (default)" "py" "uv run example_onnx.py"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Python (batch)" "py" "uv run example_onnx.py --batch --voice-style $BATCH_VOICE_STYLE_1 $BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1' '$BATCH_TEXT_2' --lang $BATCH_LANG_1 $BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "Python (long-form)" "py" "uv run example_onnx.py --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# JavaScript (Node.js)
# ====================================
echo -e "${YELLOW}Testing JavaScript (Node.js)...${NC}"
echo "Installing Node.js dependencies..."
cd nodejs && npm install --silent && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "JavaScript (default)" "nodejs" "node example_onnx.js"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "JavaScript (batch)" "nodejs" "node example_onnx.js --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "JavaScript (long-form)" "nodejs" "node example_onnx.js --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# Go
# ====================================
echo -e "${YELLOW}Testing Go...${NC}"
echo "Cleaning Go cache..."
cd go && go clean && cd ..
export ONNXRUNTIME_LIB_PATH=$(brew --prefix onnxruntime 2>/dev/null)/lib/libonnxruntime.dylib
if [ "$TEST_DEFAULT" = true ]; then
run_test "Go (default)" "go" "go run example_onnx.go helper.go"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Go (batch)" "go" "go run example_onnx.go helper.go --batch -voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 -text '$BATCH_TEXT_1|$BATCH_TEXT_2' -lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "Go (long-form)" "go" "go run example_onnx.go helper.go -voice-style $LONGFORM_VOICE_STYLE -text '$LONGFORM_TEXT'"
fi
# ====================================
# Rust
# ====================================
echo -e "${YELLOW}Testing Rust...${NC}"
echo "Building Rust project..."
cd rust && cargo clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Rust (default)" "rust" "cargo run --release"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Rust (batch)" "rust" "cargo run --release -- --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "Rust (long-form)" "rust" "cargo run --release -- --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# C#
# ====================================
echo -e "${YELLOW}Testing C#...${NC}"
echo "Building C# project..."
cd csharp && dotnet clean && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "C# (default)" "csharp" "dotnet run --configuration Release"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "C# (batch)" "csharp" "dotnet run --configuration Release -- --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "C# (long-form)" "csharp" "dotnet run --configuration Release -- --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# Java
# ====================================
echo -e "${YELLOW}Testing Java...${NC}"
echo "Building Java project..."
cd java && mvn clean install -q && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Java (default)" "java" "mvn exec:java -q"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Java (batch)" "java" "mvn exec:java -q -Dexec.args='--batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text \"$BATCH_TEXT_1|$BATCH_TEXT_2\" --lang $BATCH_LANG_1,$BATCH_LANG_2'"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "Java (long-form)" "java" "mvn exec:java -q -Dexec.args='--voice-style $LONGFORM_VOICE_STYLE --text \"$LONGFORM_TEXT\"'"
fi
# ====================================
# Swift
# ====================================
echo -e "${YELLOW}Testing Swift...${NC}"
echo "Building Swift project..."
cd swift && swift build -c release && cd ..
if [ "$TEST_DEFAULT" = true ]; then
run_test "Swift (default)" "swift" ".build/release/example_onnx"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "Swift (batch)" "swift" ".build/release/example_onnx --batch --voice-style $BATCH_VOICE_STYLE_1,$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "Swift (long-form)" "swift" ".build/release/example_onnx --voice-style $LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# C++
# ====================================
echo -e "${YELLOW}Testing C++...${NC}"
echo "Building C++ project..."
cd cpp && mkdir -p build && cd build && cmake .. && make && cd ../..
if [ "$TEST_DEFAULT" = true ]; then
run_test "C++ (default)" "cpp/build" "./example_onnx"
fi
if [ "$TEST_BATCH" = true ]; then
run_test "C++ (batch)" "cpp/build" "./example_onnx --batch --voice-style ../$BATCH_VOICE_STYLE_1,../$BATCH_VOICE_STYLE_2 --text '$BATCH_TEXT_1|$BATCH_TEXT_2' --lang $BATCH_LANG_1,$BATCH_LANG_2"
fi
if [ "$TEST_LONGFORM" = true ]; then
run_test "C++ (long-form)" "cpp/build" "./example_onnx --voice-style ../$LONGFORM_VOICE_STYLE --text '$LONGFORM_TEXT'"
fi
# ====================================
# Summary
# ====================================
echo "=================================="
echo "Test Summary"
echo "=================================="
echo ""
if [ ${#PASSED[@]} -gt 0 ]; then
echo -e "${GREEN}Passed (${#PASSED[@]}):${NC}"
for lang in "${PASSED[@]}"; do
echo -e "  ${GREEN}✓${NC} $lang"
done
echo ""
fi
if [ ${#FAILED[@]} -gt 0 ]; then
echo -e "${RED}Failed (${#FAILED[@]}):${NC}"
for lang in "${FAILED[@]}"; do
echo -e "  ${RED}✗${NC} $lang"
done
echo ""
exit 1
else
echo -e "${GREEN}All tests passed! 🎉${NC}"
exit 0
fi

web/.gitignore vendored Normal file

@@ -0,0 +1,4 @@
node_modules/
dist/
.DS_Store
*.log

web/README.md Normal file

@@ -0,0 +1,121 @@
# Supertonic Web Example
This example demonstrates how to use Supertonic in a web browser using ONNX Runtime Web.
## 📰 Update News
**2026.01.06** - 🎉 **Supertonic 2** released with multilingual support! Now supports English (`en`), Korean (`ko`), Spanish (`es`), Portuguese (`pt`), and French (`fr`). [Demo](https://huggingface.co/spaces/Supertone/supertonic-2) | [Models](https://huggingface.co/Supertone/supertonic-2)
**2025.12.10** - Added [6 new voice styles](https://huggingface.co/Supertone/supertonic/tree/b10dbaf18b316159be75b34d24f740008fddd381) (M3, M4, M5, F3, F4, F5). See [Voices](https://supertone-inc.github.io/supertonic-py/voices/) for details
**2025.12.08** - Optimized ONNX models via [OnnxSlim](https://github.com/inisis/OnnxSlim) now available on [Hugging Face Models](https://huggingface.co/Supertone/supertonic)
**2025.11.23** - Enhanced text preprocessing with comprehensive normalization, emoji removal, symbol replacement, and punctuation handling for improved synthesis quality.
**2025.11.19** - Added speed control slider to adjust speech synthesis speed (default: 1.05, recommended range: 0.9-1.5).
**2025.11.19** - Added automatic text chunking for long-form inference. Long texts are split into chunks and synthesized with natural pauses.
## Features
- 🌐 Runs entirely in the browser (no server required for inference)
- 🚀 WebGPU support with automatic fallback to WebAssembly
- 🌍 Multilingual support: English (en), Korean (ko), Spanish (es), Portuguese (pt), French (fr)
- ⚡ Pre-extracted voice styles for instant generation
- 🎨 Modern, responsive UI
- 🎭 Multiple voice style presets (5 Male, 5 Female)
- 💾 Download generated audio as WAV files
- 📊 Detailed generation statistics (audio length, generation time)
- ⏱️ Real-time progress tracking
## Requirements
- Node.js (for development server)
- Modern web browser (Chrome, Edge, Firefox, Safari)
## Installation
1. Install dependencies:
```bash
npm install
```
## Running the Demo
Start the development server:
```bash
npm run dev
```
This will start a local development server (usually at http://localhost:3000) and open the demo in your browser.
## Usage
1. **Wait for Models to Load**: The app will automatically load models and the default voice style (M1)
2. **Select Voice Style**: Choose from available voice presets
- **Male 1-5 (M1-M5)**: Male voice styles
- **Female 1-5 (F1-F5)**: Female voice styles
3. **Select Language**: Choose the language that matches your input text
- **English (en)**: Default language
- **한국어 (ko)**: Korean
- **Español (es)**: Spanish
- **Português (pt)**: Portuguese
- **Français (fr)**: French
4. **Enter Text**: Type or paste the text you want to convert to speech
5. **Adjust Settings** (optional):
- **Total Steps**: More steps = better quality but slower (default: 5)
6. **Generate Speech**: Click the "Generate Speech" button
7. **View Results**:
- See the full input text
- View audio length and generation time statistics
- Play the generated audio in the browser
- Download as WAV file
## Multilingual Support
Supertonic 2 supports multiple languages. Make sure to select the correct language for your input text to get the best results. The model will automatically handle text preprocessing and pronunciation for the selected language.
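The helper in `web/helper.js` enforces this: it validates the language code against the supported set and wraps the normalized text in language tags before tokenization. A minimal sketch of that step (the `tagText` name is illustrative):

```javascript
const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];

// Wrap text in language tags, as the UnicodeProcessor does before tokenization.
function tagText(text, lang) {
  if (!AVAILABLE_LANGS.includes(lang)) {
    throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
  }
  return `<${lang}>${text}</${lang}>`;
}

console.log(tagText('Bonjour tout le monde.', 'fr')); // <fr>Bonjour tout le monde.</fr>
```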
## Technical Details
### Browser Compatibility
This demo uses:
- **ONNX Runtime Web**: For running models in the browser
- **Web Audio API**: For playing generated audio
- **Vite**: For development and bundling
## Notes
- The ONNX models must be accessible at `assets/onnx/` relative to the web root
- Voice style JSON files must be accessible at `assets/voice_styles/` relative to the web root
- Pre-extracted voice styles enable instant generation without audio processing
- Ten voice style presets are provided (M1-M5, F1-F5)
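Both notes above amount to serving one `assets/` tree next to the page. The model and indexer filenames below are the ones the loaders in this repository request; the voice-style files are the ten presets listed earlier:

```
web/
└── assets/
    ├── onnx/
    │   ├── tts.json
    │   ├── unicode_indexer.json
    │   ├── duration_predictor.onnx
    │   ├── text_encoder.onnx
    │   ├── vector_estimator.onnx
    │   └── vocoder.onnx
    └── voice_styles/
        ├── M1.json … M5.json
        └── F1.json … F5.json
```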
## Troubleshooting
### Models not loading
- Check browser console for errors
- Ensure `assets/onnx/` path is correct and models are accessible
- Check CORS settings if serving from a different domain
### WebGPU not available
- WebGPU is only available in recent Chrome/Edge browsers (version 113+)
- The app will automatically fall back to WebAssembly if WebGPU is not available
- Check the backend badge to see which execution provider is being used
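A sketch of how such a fallback is typically wired with ONNX Runtime Web's `executionProviders` option (the `pickProviders` helper is illustrative, not part of this repo — ORT tries each listed provider in order and uses the first that initializes):

```javascript
// Build a preference-ordered provider list based on WebGPU availability.
function pickProviders(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

const providers = pickProviders(typeof navigator !== 'undefined' && 'gpu' in navigator);
// Then pass the list when creating each session, e.g.:
// const session = await ort.InferenceSession.create('assets/onnx/vocoder.onnx',
//                                                   { executionProviders: providers });
```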
### Out of memory errors
- Try shorter text inputs
- Reduce denoising steps
- Use a browser with more available memory
- Close other tabs to free up memory
### Audio quality issues
- Try different voice style presets
- Increase denoising steps for better quality
### Slow generation
- If using WebAssembly, try a browser that supports WebGPU
- Ensure no other heavy processes are running
- Consider using fewer denoising steps for faster (but lower quality) results

web/helper.js Normal file

@@ -0,0 +1,561 @@
import * as ort from 'onnxruntime-web';
// Available languages for multilingual TTS
export const AVAILABLE_LANGS = ['en', 'ko', 'es', 'pt', 'fr'];
export function isValidLang(lang) {
return AVAILABLE_LANGS.includes(lang);
}
/**
* Unicode Text Processor
*/
export class UnicodeProcessor {
constructor(indexer) {
this.indexer = indexer;
}
call(textList, langList) {
const processedTexts = textList.map((text, i) => this.preprocessText(text, langList[i]));
const textIdsLengths = processedTexts.map(text => text.length);
const maxLen = Math.max(...textIdsLengths);
const textIds = processedTexts.map(text => {
const row = new Array(maxLen).fill(0);
for (let j = 0; j < text.length; j++) {
const codePoint = text.codePointAt(j);
row[j] = (codePoint < this.indexer.length) ? this.indexer[codePoint] : -1;
}
return row;
});
const textMask = this.getTextMask(textIdsLengths);
return { textIds, textMask };
}
preprocessText(text, lang) {
// TODO: Need advanced normalizer for better performance
text = text.normalize('NFKD');
// Remove emojis (wide Unicode range)
const emojiPattern = /[\u{1F600}-\u{1F64F}\u{1F300}-\u{1F5FF}\u{1F680}-\u{1F6FF}\u{1F700}-\u{1F77F}\u{1F780}-\u{1F7FF}\u{1F800}-\u{1F8FF}\u{1F900}-\u{1F9FF}\u{1FA00}-\u{1FA6F}\u{1FA70}-\u{1FAFF}\u{2600}-\u{26FF}\u{2700}-\u{27BF}\u{1F1E6}-\u{1F1FF}]+/gu;
text = text.replace(emojiPattern, '');
// Replace various dashes and symbols
const replacements = {
'–': '-', // en dash
'−': '-', // minus sign
'—': '-', // em dash
'_': ' ',
'\u201C': '"', // left double quote "
'\u201D': '"', // right double quote "
'\u2018': "'", // left single quote '
'\u2019': "'", // right single quote '
'´': "'",
'`': "'",
'[': ' ',
']': ' ',
'|': ' ',
'/': ' ',
'#': ' ',
'→': ' ',
'←': ' ',
};
for (const [k, v] of Object.entries(replacements)) {
text = text.replaceAll(k, v);
}
// Remove special symbols
text = text.replace(/[♥☆♡©\\]/g, '');
// Replace known expressions
const exprReplacements = {
'@': ' at ',
'e.g.,': 'for example, ',
'i.e.,': 'that is, ',
};
for (const [k, v] of Object.entries(exprReplacements)) {
text = text.replaceAll(k, v);
}
// Fix spacing around punctuation
text = text.replace(/ ,/g, ',');
text = text.replace(/ \./g, '.');
text = text.replace(/ !/g, '!');
text = text.replace(/ \?/g, '?');
text = text.replace(/ ;/g, ';');
text = text.replace(/ :/g, ':');
text = text.replace(/ '/g, "'");
// Remove duplicate quotes
while (text.includes('""')) {
text = text.replace('""', '"');
}
while (text.includes("''")) {
text = text.replace("''", "'");
}
while (text.includes('``')) {
text = text.replace('``', '`');
}
// Remove extra spaces
text = text.replace(/\s+/g, ' ').trim();
// If text doesn't end with punctuation, quotes, or closing brackets, add a period
if (!/[.!?;:,'\"')\]}…。」』】〉》›»]$/.test(text)) {
text += '.';
}
// Validate language
if (!isValidLang(lang)) {
throw new Error(`Invalid language: ${lang}. Available: ${AVAILABLE_LANGS.join(', ')}`);
}
// Wrap text with language tags
text = `<${lang}>${text}</${lang}>`;
return text;
}
getTextMask(textIdsLengths) {
const maxLen = Math.max(...textIdsLengths);
return this.lengthToMask(textIdsLengths, maxLen);
}
lengthToMask(lengths, maxLen = null) {
const actualMaxLen = maxLen || Math.max(...lengths);
return lengths.map(len => {
const row = new Array(actualMaxLen).fill(0.0);
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
row[j] = 1.0;
}
return [row];
});
}
}
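// Example (sketch): lengthToMask pads every sequence to the batch maximum and
// wraps each mask row so the result has shape [batch, 1, maxLen]:
//
//   new UnicodeProcessor(indexer).lengthToMask([2, 3])
//   // → [ [[1, 1, 0]], [[1, 1, 1]] ]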
/**
* Style class to hold TTL and DP tensors
*/
export class Style {
constructor(ttlTensor, dpTensor) {
this.ttl = ttlTensor;
this.dp = dpTensor;
}
}
/**
* Text-to-Speech class
*/
export class TextToSpeech {
constructor(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt) {
this.cfgs = cfgs;
this.textProcessor = textProcessor;
this.dpOrt = dpOrt;
this.textEncOrt = textEncOrt;
this.vectorEstOrt = vectorEstOrt;
this.vocoderOrt = vocoderOrt;
this.sampleRate = cfgs.ae.sample_rate;
}
async _infer(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
const bsz = textList.length;
// Process text
const { textIds, textMask } = this.textProcessor.call(textList, langList);
const textIdsFlat = new BigInt64Array(textIds.flat().map(x => BigInt(x)));
const textIdsShape = [bsz, textIds[0].length];
const textIdsTensor = new ort.Tensor('int64', textIdsFlat, textIdsShape);
const textMaskFlat = new Float32Array(textMask.flat(2));
const textMaskShape = [bsz, 1, textMask[0][0].length];
const textMaskTensor = new ort.Tensor('float32', textMaskFlat, textMaskShape);
// Predict duration
const dpOutputs = await this.dpOrt.run({
text_ids: textIdsTensor,
style_dp: style.dp,
text_mask: textMaskTensor
});
const duration = Array.from(dpOutputs.duration.data);
// Apply speed factor to duration
for (let i = 0; i < duration.length; i++) {
duration[i] /= speed;
}
// Encode text
const textEncOutputs = await this.textEncOrt.run({
text_ids: textIdsTensor,
style_ttl: style.ttl,
text_mask: textMaskTensor
});
const textEmb = textEncOutputs.text_emb;
// Sample noisy latent
let { xt, latentMask } = this.sampleNoisyLatent(
duration,
this.sampleRate,
this.cfgs.ae.base_chunk_size,
this.cfgs.ttl.chunk_compress_factor,
this.cfgs.ttl.latent_dim
);
const latentMaskFlat = new Float32Array(latentMask.flat(2));
const latentMaskShape = [bsz, 1, latentMask[0][0].length];
const latentMaskTensor = new ort.Tensor('float32', latentMaskFlat, latentMaskShape);
// Prepare constant arrays
const totalStepArray = new Float32Array(bsz).fill(totalStep);
const totalStepTensor = new ort.Tensor('float32', totalStepArray, [bsz]);
// Denoising loop
for (let step = 0; step < totalStep; step++) {
if (progressCallback) {
progressCallback(step + 1, totalStep);
}
const currentStepArray = new Float32Array(bsz).fill(step);
const currentStepTensor = new ort.Tensor('float32', currentStepArray, [bsz]);
const xtFlat = new Float32Array(xt.flat(2));
const xtShape = [bsz, xt[0].length, xt[0][0].length];
const xtTensor = new ort.Tensor('float32', xtFlat, xtShape);
const vectorEstOutputs = await this.vectorEstOrt.run({
noisy_latent: xtTensor,
text_emb: textEmb,
style_ttl: style.ttl,
latent_mask: latentMaskTensor,
text_mask: textMaskTensor,
current_step: currentStepTensor,
total_step: totalStepTensor
});
const denoised = Array.from(vectorEstOutputs.denoised_latent.data);
// Reshape to 3D
const latentDim = xt[0].length;
const latentLen = xt[0][0].length;
xt = [];
let idx = 0;
for (let b = 0; b < bsz; b++) {
const batch = [];
for (let d = 0; d < latentDim; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
row.push(denoised[idx++]);
}
batch.push(row);
}
xt.push(batch);
}
}
// Generate waveform
const finalXtFlat = new Float32Array(xt.flat(2));
const finalXtShape = [bsz, xt[0].length, xt[0][0].length];
const finalXtTensor = new ort.Tensor('float32', finalXtFlat, finalXtShape);
const vocoderOutputs = await this.vocoderOrt.run({
latent: finalXtTensor
});
const wav = Array.from(vocoderOutputs.wav_tts.data);
return { wav, duration };
}
async call(text, lang, style, totalStep, speed = 1.05, silenceDuration = 0.3, progressCallback = null) {
if (style.ttl.dims[0] !== 1) {
throw new Error('Single-speaker text-to-speech only supports a single style');
}
const maxLen = lang === 'ko' ? 120 : 300;
const textList = chunkText(text, maxLen);
const langList = new Array(textList.length).fill(lang);
let wavCat = [];
let durCat = 0;
for (let i = 0; i < textList.length; i++) {
const { wav, duration } = await this._infer([textList[i]], [langList[i]], style, totalStep, speed, progressCallback);
if (wavCat.length === 0) {
wavCat = wav;
durCat = duration[0];
} else {
const silenceLen = Math.floor(silenceDuration * this.sampleRate);
const silence = new Array(silenceLen).fill(0);
wavCat = [...wavCat, ...silence, ...wav];
durCat += duration[0] + silenceDuration;
}
}
return { wav: wavCat, duration: [durCat] };
}
async batch(textList, langList, style, totalStep, speed = 1.05, progressCallback = null) {
return await this._infer(textList, langList, style, totalStep, speed, progressCallback);
}
sampleNoisyLatent(duration, sampleRate, baseChunkSize, chunkCompress, latentDim) {
const bsz = duration.length;
const maxDur = Math.max(...duration);
const wavLenMax = Math.floor(maxDur * sampleRate);
const wavLengths = duration.map(d => Math.floor(d * sampleRate));
const chunkSize = baseChunkSize * chunkCompress;
const latentLen = Math.floor((wavLenMax + chunkSize - 1) / chunkSize);
const latentDimVal = latentDim * chunkCompress;
const xt = [];
for (let b = 0; b < bsz; b++) {
const batch = [];
for (let d = 0; d < latentDimVal; d++) {
const row = [];
for (let t = 0; t < latentLen; t++) {
// Box-Muller transform
const u1 = Math.max(0.0001, Math.random());
const u2 = Math.random();
const val = Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
row.push(val);
}
batch.push(row);
}
xt.push(batch);
}
const latentLengths = wavLengths.map(len => Math.floor((len + chunkSize - 1) / chunkSize));
const latentMask = this.lengthToMask(latentLengths, latentLen);
// Apply mask
for (let b = 0; b < bsz; b++) {
for (let d = 0; d < latentDimVal; d++) {
for (let t = 0; t < latentLen; t++) {
xt[b][d][t] *= latentMask[b][0][t];
}
}
}
return { xt, latentMask };
}
lengthToMask(lengths, maxLen = null) {
const actualMaxLen = maxLen ?? Math.max(...lengths); // ?? (not ||) so an explicit maxLen of 0 is respected
return lengths.map(len => {
const row = new Array(actualMaxLen).fill(0.0);
for (let j = 0; j < Math.min(len, actualMaxLen); j++) {
row[j] = 1.0;
}
return [row];
});
}
}
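The denoising loop above starts from Gaussian noise that `sampleNoisyLatent` draws with the Box-Muller transform. A standalone sketch of that draw (same `log(0)` guard as the loop above; `randn` is an illustrative name, not part of the module), with a quick statistical sanity check:

```javascript
// Box-Muller transform: two uniform samples in (0, 1] map to one
// standard-normal sample; the max() guard avoids log(0), as in
// sampleNoisyLatent above.
function randn() {
    const u1 = Math.max(0.0001, Math.random());
    const u2 = Math.random();
    return Math.sqrt(-2.0 * Math.log(u1)) * Math.cos(2.0 * Math.PI * u2);
}

// Over many draws, the sample mean should be near 0 and the sample
// variance near 1.
const n = 100000;
let sum = 0, sumSq = 0;
for (let i = 0; i < n; i++) {
    const x = randn();
    sum += x;
    sumSq += x * x;
}
const mean = sum / n;
const variance = sumSq / n - mean * mean;
console.log(mean.toFixed(3), variance.toFixed(3)); // close to 0 and 1
```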
/**
* Load voice style from JSON files
*/
export async function loadVoiceStyle(voiceStylePaths, verbose = false) {
const bsz = voiceStylePaths.length;
// Read first file to get dimensions
const firstResponse = await fetch(voiceStylePaths[0]);
const firstStyle = await firstResponse.json();
const ttlDims = firstStyle.style_ttl.dims;
const dpDims = firstStyle.style_dp.dims;
const ttlDim1 = ttlDims[1];
const ttlDim2 = ttlDims[2];
const dpDim1 = dpDims[1];
const dpDim2 = dpDims[2];
// Pre-allocate arrays with full batch size
const ttlSize = bsz * ttlDim1 * ttlDim2;
const dpSize = bsz * dpDim1 * dpDim2;
const ttlFlat = new Float32Array(ttlSize);
const dpFlat = new Float32Array(dpSize);
// Fill in the data
for (let i = 0; i < bsz; i++) {
const response = await fetch(voiceStylePaths[i]);
const voiceStyle = await response.json();
// Flatten TTL data
const ttlData = voiceStyle.style_ttl.data.flat(Infinity);
const ttlOffset = i * ttlDim1 * ttlDim2;
ttlFlat.set(ttlData, ttlOffset);
// Flatten DP data
const dpData = voiceStyle.style_dp.data.flat(Infinity);
const dpOffset = i * dpDim1 * dpDim2;
dpFlat.set(dpData, dpOffset);
}
const ttlShape = [bsz, ttlDim1, ttlDim2];
const dpShape = [bsz, dpDim1, dpDim2];
const ttlTensor = new ort.Tensor('float32', ttlFlat, ttlShape);
const dpTensor = new ort.Tensor('float32', dpFlat, dpShape);
if (verbose) {
console.log(`Loaded ${bsz} voice styles`);
}
return new Style(ttlTensor, dpTensor);
}
/**
* Load configuration from JSON
*/
export async function loadCfgs(onnxDir) {
const response = await fetch(`${onnxDir}/tts.json`);
const cfgs = await response.json();
return cfgs;
}
/**
* Load text processor
*/
export async function loadTextProcessor(onnxDir) {
const response = await fetch(`${onnxDir}/unicode_indexer.json`);
const indexer = await response.json();
return new UnicodeProcessor(indexer);
}
/**
* Load ONNX model
*/
export async function loadOnnx(onnxPath, options) {
const session = await ort.InferenceSession.create(onnxPath, options);
return session;
}
/**
* Load all TTS components
*/
export async function loadTextToSpeech(onnxDir, sessionOptions = {}, progressCallback = null) {
console.log('Using WebAssembly/WebGPU for inference');
const cfgs = await loadCfgs(onnxDir);
const dpPath = `${onnxDir}/duration_predictor.onnx`;
const textEncPath = `${onnxDir}/text_encoder.onnx`;
const vectorEstPath = `${onnxDir}/vector_estimator.onnx`;
const vocoderPath = `${onnxDir}/vocoder.onnx`;
const modelPaths = [
{ name: 'Duration Predictor', path: dpPath },
{ name: 'Text Encoder', path: textEncPath },
{ name: 'Vector Estimator', path: vectorEstPath },
{ name: 'Vocoder', path: vocoderPath }
];
const sessions = [];
for (let i = 0; i < modelPaths.length; i++) {
if (progressCallback) {
progressCallback(modelPaths[i].name, i + 1, modelPaths.length);
}
const session = await loadOnnx(modelPaths[i].path, sessionOptions);
sessions.push(session);
}
const [dpOrt, textEncOrt, vectorEstOrt, vocoderOrt] = sessions;
const textProcessor = await loadTextProcessor(onnxDir);
const textToSpeech = new TextToSpeech(cfgs, textProcessor, dpOrt, textEncOrt, vectorEstOrt, vocoderOrt);
return { textToSpeech, cfgs };
}
/**
* Chunk text into manageable segments
*/
function chunkText(text, maxLen = 300) {
if (typeof text !== 'string') {
throw new Error(`chunkText expects a string, got ${typeof text}`);
}
// Split by paragraph (two or more newlines)
const paragraphs = text.trim().split(/\n\s*\n+/).filter(p => p.trim());
const chunks = [];
for (let paragraph of paragraphs) {
paragraph = paragraph.trim();
if (!paragraph) continue;
// Split by sentence boundaries (period, question mark, exclamation mark followed by space)
// But exclude common abbreviations like Mr., Mrs., Dr., etc. and single capital letters like F.
const sentences = paragraph.split(/(?<!Mr\.|Mrs\.|Ms\.|Dr\.|Prof\.|Sr\.|Jr\.|Ph\.D\.|etc\.|e\.g\.|i\.e\.|vs\.|Inc\.|Ltd\.|Co\.|Corp\.|St\.|Ave\.|Blvd\.)(?<!\b[A-Z]\.)(?<=[.!?])\s+/);
let currentChunk = "";
for (let sentence of sentences) {
if (currentChunk.length + sentence.length + 1 <= maxLen) {
currentChunk += (currentChunk ? " " : "") + sentence;
} else {
if (currentChunk) {
chunks.push(currentChunk.trim());
}
currentChunk = sentence;
}
}
if (currentChunk) {
chunks.push(currentChunk.trim());
}
}
return chunks;
}
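For illustration, the greedy packing rule in `chunkText` can be reduced to a minimal sketch (the abbreviation and single-initial guards in the real regex are omitted here; `chunkTextSketch` is a hypothetical name, not part of the module):

```javascript
// Simplified sketch of chunkText's packing: split on sentence-ending
// punctuation, then greedily pack sentences into chunks of <= maxLen chars.
function chunkTextSketch(text, maxLen = 300) {
    const sentences = text.trim().split(/(?<=[.!?])\s+/);
    const chunks = [];
    let current = "";
    for (const sentence of sentences) {
        if (current.length + sentence.length + 1 <= maxLen) {
            current += (current ? " " : "") + sentence;
        } else {
            if (current) chunks.push(current);
            current = sentence;
        }
    }
    if (current) chunks.push(current);
    return chunks;
}

console.log(chunkTextSketch("One. Two. Three.", 10)); // → ["One. Two.", "Three."]
```

Note that, as in the full implementation, a single sentence longer than `maxLen` still becomes its own oversized chunk.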
/**
* Write WAV file to ArrayBuffer
*/
export function writeWavFile(audioData, sampleRate) {
const numChannels = 1;
const bitsPerSample = 16;
const byteRate = sampleRate * numChannels * bitsPerSample / 8;
const blockAlign = numChannels * bitsPerSample / 8;
const dataSize = audioData.length * 2;
// Create ArrayBuffer
const buffer = new ArrayBuffer(44 + dataSize);
const view = new DataView(buffer);
// Write WAV header
const writeString = (offset, string) => {
for (let i = 0; i < string.length; i++) {
view.setUint8(offset + i, string.charCodeAt(i));
}
};
writeString(0, 'RIFF');
view.setUint32(4, 36 + dataSize, true);
writeString(8, 'WAVE');
writeString(12, 'fmt ');
view.setUint32(16, 16, true);
view.setUint16(20, 1, true); // PCM
view.setUint16(22, numChannels, true);
view.setUint32(24, sampleRate, true);
view.setUint32(28, byteRate, true);
view.setUint16(32, blockAlign, true);
view.setUint16(34, bitsPerSample, true);
writeString(36, 'data');
view.setUint32(40, dataSize, true);
// Write audio data
const int16Data = new Int16Array(audioData.length);
for (let i = 0; i < audioData.length; i++) {
const clamped = Math.max(-1.0, Math.min(1.0, audioData[i]));
int16Data[i] = Math.floor(clamped * 32767);
}
const dataView = new Uint8Array(buffer, 44);
dataView.set(new Uint8Array(int16Data.buffer));
return buffer;
}
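The float-to-PCM step at the end of `writeWavFile` is easy to verify in isolation. A sketch of just that conversion (`floatToInt16` is a hypothetical helper mirroring the loop above):

```javascript
// Clamp a float sample to [-1, 1], then scale to the 16-bit range,
// exactly as the conversion loop in writeWavFile does.
function floatToInt16(x) {
    const clamped = Math.max(-1.0, Math.min(1.0, x));
    return Math.floor(clamped * 32767);
}

console.log(floatToInt16(1.5));  // → 32767 (clipped)
console.log(floatToInt16(-2.0)); // → -32767 (clipped)
console.log(floatToInt16(0.5));  // → 16383
```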

95
web/index.html Normal file
View File

@@ -0,0 +1,95 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Supertonic - Web Demo</title>
<link rel="stylesheet" href="/style.css">
</head>
<body>
<div class="container">
<h1>🎤 Supertonic 2</h1>
<p class="subtitle">Multilingual Text-to-Speech with ONNX Runtime Web</p>
<div id="statusBox" class="status-box">
<div class="status-text-wrapper">
<div id="statusText"><strong>Loading models...</strong> Please wait...</div>
</div>
<div id="backendBadge" class="backend-badge">WebAssembly</div>
</div>
<div class="main-content">
<div class="left-panel">
<div class="section">
<div class="ref-audio-label">
<label for="voiceStyleSelect">Voice Style: </label>
<span id="voiceStyleInfo"
class="ref-audio-info">Loading...</span>
</div>
<select id="voiceStyleSelect">
<option value="assets/voice_styles/M1.json">Male 1 (M1)</option>
<option value="assets/voice_styles/M2.json">Male 2 (M2)</option>
<option value="assets/voice_styles/M3.json">Male 3 (M3)</option>
<option value="assets/voice_styles/M4.json">Male 4 (M4)</option>
<option value="assets/voice_styles/M5.json">Male 5 (M5)</option>
<option value="assets/voice_styles/F1.json">Female 1 (F1)</option>
<option value="assets/voice_styles/F2.json">Female 2 (F2)</option>
<option value="assets/voice_styles/F3.json">Female 3 (F3)</option>
<option value="assets/voice_styles/F4.json">Female 4 (F4)</option>
<option value="assets/voice_styles/F5.json">Female 5 (F5)</option>
</select>
</div>
<div class="section">
<label for="langSelect">Language:</label>
<select id="langSelect">
<option value="en" selected>English (en)</option>
<option value="ko">한국어 (ko)</option>
<option value="es">Español (es)</option>
<option value="pt">Português (pt)</option>
<option value="fr">Français (fr)</option>
</select>
</div>
<div class="section">
<label for="text">Text to Synthesize:</label>
<textarea id="text"
placeholder="Enter the text you want to convert to speech...">This morning, I took a walk in the park, and the sound of the birds and the breeze was so pleasant that I stopped for a long time just to listen.</textarea>
</div>
<div class="params-grid">
<div class="section">
<label for="totalStep">Total Steps (higher = better
quality):</label>
<input type="number" id="totalStep" value="5"
min="1" max="50">
</div>
<div class="section">
<label for="speed">Speed (0.9-1.5 recommended):</label>
<input type="number" id="speed" value="1.05"
min="0.5" max="2.0" step="0.05">
</div>
</div>
<button id="generateBtn">Generate Speech</button>
<div id="error" class="error"></div>
</div>
<div class="right-panel">
<div id="results" class="results">
<div class="results-placeholder">
<div class="results-placeholder-icon">🎤</div>
<p>Generated speech will appear here</p>
</div>
</div>
</div>
</div>
</div>
<script type="module" src="/main.js"></script>
</body>
</html>

291
web/main.js Normal file
View File

@@ -0,0 +1,291 @@
import {
loadTextToSpeech,
loadVoiceStyle,
writeWavFile
} from './helper.js';
// Configuration
const DEFAULT_VOICE_STYLE_PATH = 'assets/voice_styles/M1.json';
// Helper function to extract filename from path
function getFilenameFromPath(path) {
return path.split('/').pop();
}
// Global state
let textToSpeech = null;
let cfgs = null;
// Pre-computed style
let currentStyle = null;
let currentStylePath = DEFAULT_VOICE_STYLE_PATH;
// UI Elements
const textInput = document.getElementById('text');
const voiceStyleSelect = document.getElementById('voiceStyleSelect');
const voiceStyleInfo = document.getElementById('voiceStyleInfo');
const langSelect = document.getElementById('langSelect');
const totalStepInput = document.getElementById('totalStep');
const speedInput = document.getElementById('speed');
const generateBtn = document.getElementById('generateBtn');
const statusBox = document.getElementById('statusBox');
const statusText = document.getElementById('statusText');
const backendBadge = document.getElementById('backendBadge');
const resultsContainer = document.getElementById('results');
const errorBox = document.getElementById('error');
function showStatus(message, type = 'info') {
statusText.innerHTML = message;
statusBox.className = 'status-box';
if (type === 'success') {
statusBox.classList.add('success');
} else if (type === 'error') {
statusBox.classList.add('error');
}
}
function showError(message) {
errorBox.textContent = message;
errorBox.classList.add('active');
}
function hideError() {
errorBox.classList.remove('active');
}
function showBackendBadge() {
backendBadge.classList.add('visible');
}
// Load voice style from JSON
async function loadStyleFromJSON(stylePath) {
try {
const style = await loadVoiceStyle([stylePath], true);
return style;
} catch (error) {
console.error('Error loading voice style:', error);
throw error;
}
}
// Load models on page load
async function initializeModels() {
try {
showStatus(' <strong>Loading configuration...</strong>');
const basePath = 'assets/onnx';
// Try WebGPU first, fallback to WASM
let executionProvider = 'wasm';
try {
const result = await loadTextToSpeech(basePath, {
executionProviders: ['webgpu'],
graphOptimizationLevel: 'all'
}, (modelName, current, total) => {
showStatus(` <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
});
textToSpeech = result.textToSpeech;
cfgs = result.cfgs;
executionProvider = 'webgpu';
backendBadge.textContent = 'WebGPU';
backendBadge.style.background = '#4caf50';
} catch (webgpuError) {
console.warn('WebGPU not available, falling back to WebAssembly:', webgpuError);
const result = await loadTextToSpeech(basePath, {
executionProviders: ['wasm'],
graphOptimizationLevel: 'all'
}, (modelName, current, total) => {
showStatus(` <strong>Loading ONNX models (${current}/${total}):</strong> ${modelName}...`);
});
textToSpeech = result.textToSpeech;
cfgs = result.cfgs;
}
showStatus(' <strong>Loading default voice style...</strong>');
// Load default voice style
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
showStatus(`✅ <strong>Models loaded!</strong> Using ${executionProvider.toUpperCase()}. You can now generate speech.`, 'success');
showBackendBadge();
generateBtn.disabled = false;
} catch (error) {
console.error('Error loading models:', error);
showStatus(`❌ <strong>Error loading models:</strong> ${error.message}`, 'error');
}
}
// Handle voice style selection
voiceStyleSelect.addEventListener('change', async (e) => {
const selectedValue = e.target.value;
if (!selectedValue) return;
try {
generateBtn.disabled = true;
showStatus(` <strong>Loading voice style...</strong>`, 'info');
currentStylePath = selectedValue;
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = getFilenameFromPath(currentStylePath);
showStatus(`✅ <strong>Voice style loaded:</strong> ${getFilenameFromPath(currentStylePath)}`, 'success');
generateBtn.disabled = false;
} catch (error) {
showError(`Error loading voice style: ${error.message}`);
// Restore default style
currentStylePath = DEFAULT_VOICE_STYLE_PATH;
voiceStyleSelect.value = currentStylePath;
try {
currentStyle = await loadStyleFromJSON(currentStylePath);
voiceStyleInfo.textContent = `${getFilenameFromPath(currentStylePath)} (default)`;
} catch (styleError) {
console.error('Error restoring default style:', styleError);
}
generateBtn.disabled = false;
}
});
// Main synthesis function
async function generateSpeech() {
const text = textInput.value.trim();
if (!text) {
showError('Please enter some text to synthesize.');
return;
}
if (!textToSpeech || !cfgs) {
showError('Models are still loading. Please wait.');
return;
}
if (!currentStyle) {
showError('Voice style is not ready. Please wait.');
return;
}
const startTime = Date.now();
try {
generateBtn.disabled = true;
hideError();
// Clear results and show placeholder
resultsContainer.innerHTML = `
<div class="results-placeholder generating">
<div class="results-placeholder-icon">⏳</div>
<p>Generating speech...</p>
</div>
`;
const totalStep = parseInt(totalStepInput.value, 10);
const speed = parseFloat(speedInput.value);
const lang = langSelect.value;
showStatus(' <strong>Generating speech from text...</strong>');
const tic = Date.now();
const { wav, duration } = await textToSpeech.call(
text,
lang,
currentStyle,
totalStep,
speed,
0.3,
(step, total) => {
showStatus(` <strong>Denoising (${step}/${total})...</strong>`);
}
);
const toc = Date.now();
console.log(`Text-to-speech synthesis: ${((toc - tic) / 1000).toFixed(2)}s`);
showStatus(' <strong>Creating audio file...</strong>');
const wavLen = Math.floor(textToSpeech.sampleRate * duration[0]);
const wavOut = wav.slice(0, wavLen);
// Create WAV file
const wavBuffer = writeWavFile(wavOut, textToSpeech.sampleRate);
const blob = new Blob([wavBuffer], { type: 'audio/wav' });
const url = URL.createObjectURL(blob);
// Calculate total time and audio duration
const endTime = Date.now();
const totalTimeSec = ((endTime - startTime) / 1000).toFixed(2);
const audioDurationSec = duration[0].toFixed(2);
// Display result with full text
resultsContainer.innerHTML = `
<div class="result-item">
<div class="result-text-container">
<div class="result-text-label">Input Text</div>
<div class="result-text">${text.replace(/&/g, '&amp;').replace(/</g, '&lt;')}</div>
</div>
<div class="result-info">
<div class="info-item">
<span>📊 Audio Length</span>
<strong>${audioDurationSec}s</strong>
</div>
<div class="info-item">
<span>⏱️ Generation Time</span>
<strong>${totalTimeSec}s</strong>
</div>
</div>
<div class="result-player">
<audio controls>
<source src="${url}" type="audio/wav">
</audio>
</div>
<div class="result-actions">
<button onclick="downloadAudio('${url}', 'synthesized_speech.wav')">
<span>⬇️</span>
<span>Download WAV</span>
</button>
</div>
</div>
`;
showStatus('✅ <strong>Speech synthesis completed successfully!</strong>', 'success');
} catch (error) {
console.error('Error during synthesis:', error);
showStatus(`❌ <strong>Error during synthesis:</strong> ${error.message}`, 'error');
showError(`Error during synthesis: ${error.message}`);
// Restore placeholder
resultsContainer.innerHTML = `
<div class="results-placeholder">
<div class="results-placeholder-icon">🎤</div>
<p>Generated speech will appear here</p>
</div>
`;
} finally {
generateBtn.disabled = false;
}
}
// Download handler (make it global so it can be called from onclick)
window.downloadAudio = function(url, filename) {
const a = document.createElement('a');
a.href = url;
a.download = filename;
a.click();
};
// Attach generate function to button
generateBtn.addEventListener('click', generateSpeech);
// Initialize on load
window.addEventListener('load', async () => {
generateBtn.disabled = true;
await initializeModels();
});
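The try-WebGPU-then-WASM logic in `initializeModels` follows a generic fallback pattern. A simplified synchronous sketch (`loadWithFallback` is an illustrative name, not part of main.js):

```javascript
// Try each backend loader in order of preference; return the first
// that succeeds, or throw if all fail.
function loadWithFallback(loaders) {
    for (const [name, load] of loaders) {
        try {
            return { backend: name, result: load() };
        } catch (err) {
            console.warn(`${name} unavailable, trying next backend`);
        }
    }
    throw new Error('All backends failed');
}

const { backend } = loadWithFallback([
    ['webgpu', () => { throw new Error('no WebGPU adapter'); }],
    ['wasm', () => 'session'],
]);
console.log(backend); // → "wasm"
```

In main.js the two branches are async `loadTextToSpeech` calls; the same shape works with `async`/`await` on each loader.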

21
web/package.json Normal file
View File

@@ -0,0 +1,21 @@
{
"name": "tts-onnx-web",
"version": "1.0.0",
"description": "TTS inference in the web browser using ONNX Runtime Web",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"preview": "vite preview"
},
"keywords": ["tts", "onnx", "speech-synthesis", "web"],
"author": "",
"license": "MIT",
"dependencies": {
"fft.js": "^4.0.3",
"onnxruntime-web": "^1.17.0"
},
"devDependencies": {
"vite": "^5.0.0"
}
}

453
web/style.css Normal file
View File

@@ -0,0 +1,453 @@
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
display: flex;
justify-content: center;
align-items: center;
padding: 20px;
}
.container {
background: white;
border-radius: 20px;
padding: 40px;
max-width: 1400px;
width: 100%;
box-shadow: 0 20px 60px rgba(0, 0, 0, 0.3);
}
.main-content {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 40px;
margin-top: 30px;
align-items: start;
}
.left-panel {
display: flex;
flex-direction: column;
}
.right-panel {
display: flex;
flex-direction: column;
height: 100%;
}
@media (max-width: 1024px) {
.main-content {
grid-template-columns: 1fr;
}
}
h1 {
color: #333;
margin-bottom: 10px;
font-size: 2em;
}
.subtitle {
color: #666;
margin-bottom: 30px;
font-size: 1.1em;
}
.section {
margin-bottom: 25px;
}
label {
display: block;
font-weight: 600;
color: #333;
margin-bottom: 8px;
font-size: 0.95em;
}
input[type="file"],
textarea,
input[type="number"] {
width: 100%;
padding: 12px;
border: 2px solid #e0e0e0;
border-radius: 8px;
font-size: 1em;
transition: border-color 0.3s;
}
input[type="file"]:focus,
textarea:focus,
input[type="number"]:focus {
outline: none;
border-color: #667eea;
}
textarea {
resize: vertical;
min-height: 100px;
font-family: inherit;
}
.params-grid {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 15px;
}
button {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 15px 30px;
font-size: 1.1em;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
width: 100%;
transition: transform 0.2s, box-shadow 0.2s;
}
button:hover:not(:disabled) {
transform: translateY(-2px);
box-shadow: 0 5px 20px rgba(102, 126, 234, 0.4);
}
button:disabled {
opacity: 0.6;
cursor: not-allowed;
}
.status-box {
background: #e3f2fd;
border-left: 4px solid #2196f3;
padding: 15px;
margin-bottom: 10px;
border-radius: 4px;
font-size: 0.9em;
color: #1565c0;
transition: all 0.3s ease;
display: flex;
justify-content: space-between;
align-items: center;
flex-wrap: wrap;
gap: 15px;
min-height: 50px;
}
.status-box.success {
background: #e8f5e9;
border-left-color: #4caf50;
color: #2e7d32;
}
.status-box.error {
background: #ffebee;
border-left-color: #f44336;
color: #c62828;
}
.status-text-wrapper {
flex: 1;
min-width: 200px;
}
.backend-badge {
display: inline-block;
visibility: hidden;
padding: 6px 12px;
background: #ff9800;
color: white;
border-radius: 12px;
font-size: 0.85em;
font-weight: 600;
margin-left: 10px;
white-space: nowrap;
}
.backend-badge.visible {
visibility: visible;
}
.ref-audio-info {
color: #4caf50;
font-weight: 700;
font-size: 0.95em;
}
.ref-audio-label {
margin-bottom: 8px;
}
.ref-audio-label label {
display: inline;
margin-bottom: 0;
}
.results {
flex: 1;
display: flex;
flex-direction: column;
}
.result-item {
background: white;
border-radius: 16px;
box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
overflow: hidden;
transition: box-shadow 0.3s ease;
display: flex;
flex-direction: column;
flex: 1;
}
.result-item:hover {
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}
.result-item h3 {
color: #667eea;
margin-bottom: 15px;
font-size: 1.2em;
}
.result-text-container {
padding: 20px;
background: linear-gradient(135deg, #f8f9ff 0%, #ffffff 100%);
border-bottom: 1px solid #e8ecf5;
flex: 1;
display: flex;
flex-direction: column;
overflow: hidden;
}
.result-text-label {
font-size: 0.75em;
text-transform: uppercase;
letter-spacing: 0.5px;
color: #667eea;
font-weight: 600;
margin-bottom: 8px;
}
.result-text {
color: #333;
line-height: 1.7;
font-size: 0.95em;
word-wrap: break-word;
white-space: pre-wrap;
overflow-y: auto;
padding-right: 8px;
flex: 1;
}
.result-text::-webkit-scrollbar {
width: 6px;
}
.result-text::-webkit-scrollbar-track {
background: #f0f0f0;
border-radius: 3px;
}
.result-text::-webkit-scrollbar-thumb {
background: #c0c0c0;
border-radius: 3px;
}
.result-text::-webkit-scrollbar-thumb:hover {
background: #a0a0a0;
}
.result-info {
display: grid;
grid-template-columns: 1fr 1fr;
gap: 0;
background: #fafbff;
}
.info-item {
padding: 16px 20px;
display: flex;
align-items: center;
gap: 8px;
font-size: 0.9em;
color: #666;
border-bottom: 1px solid #e8ecf5;
}
.info-item:nth-child(1) {
border-right: 1px solid #e8ecf5;
}
.info-item strong {
color: #333;
font-size: 1.1em;
font-weight: 600;
margin-left: auto;
}
.result-player {
padding: 20px;
background: white;
}
.result-item audio {
width: 100%;
height: 48px;
outline: none;
}
.result-item audio:focus {
outline: 2px solid #667eea;
outline-offset: 2px;
border-radius: 4px;
}
.result-actions {
padding: 16px 20px 20px;
background: white;
}
.result-item button {
width: 100%;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
border: none;
padding: 12px 24px;
font-size: 0.95em;
font-weight: 600;
border-radius: 8px;
cursor: pointer;
transition: all 0.3s ease;
display: flex;
align-items: center;
justify-content: center;
gap: 8px;
}
.result-item button:hover {
transform: translateY(-2px);
box-shadow: 0 4px 16px rgba(102, 126, 234, 0.3);
}
.result-item button:active {
transform: translateY(0);
}
@media (max-width: 640px) {
.result-info {
grid-template-columns: 1fr;
}
.info-item:nth-child(1) {
border-right: none;
}
}
audio {
width: 100%;
margin-top: 10px;
}
.error {
background: #fee;
color: #c00;
padding: 15px;
border-radius: 8px;
margin-top: 20px;
display: none;
}
.error.active {
display: block;
}
.warning-box {
background: #fff3cd;
color: #856404;
padding: 12px 15px;
border-radius: 8px;
margin-top: 10px;
border-left: 4px solid #ffc107;
font-size: 0.9em;
display: none;
line-height: 1.5;
}
.warning-box.active {
display: block;
}
.warning-box::before {
content: "⚠️ ";
margin-right: 5px;
}
.results-placeholder {
background: white;
border-radius: 16px;
box-shadow: 0 2px 12px rgba(0, 0, 0, 0.08);
padding: 60px 40px;
text-align: center;
color: #999;
transition: all 0.3s ease;
display: flex;
flex-direction: column;
justify-content: center;
align-items: center;
flex: 1;
min-height: 400px;
}
.results-placeholder:hover {
box-shadow: 0 4px 20px rgba(0, 0, 0, 0.12);
}
.results-placeholder-icon {
font-size: 4em;
margin-bottom: 20px;
opacity: 0.6;
animation: float 3s ease-in-out infinite;
}
.results-placeholder.generating .results-placeholder-icon {
animation: spin 2s linear infinite;
}
@keyframes float {
0%, 100% {
transform: translateY(0px);
}
50% {
transform: translateY(-10px);
}
}
@keyframes spin {
0% {
transform: rotate(0deg);
}
100% {
transform: rotate(360deg);
}
}
.results-placeholder p {
font-size: 1.05em;
color: #888;
font-weight: 500;
margin: 0;
}
.hidden {
display: none;
}

Some files were not shown because too many files have changed in this diff